This series is published jointly by the Mathematical Optimization Society and the Society for
Industrial and Applied Mathematics. It includes research monographs, books on applications,
textbooks at all levels, and tutorials. Besides being of high scientific quality, books in the series
must advance the understanding and practice of optimization. They must also be written clearly
and at an appropriate level.
Editor-in-Chief
Thomas Liebling
École Polytechnique Fédérale de Lausanne
Editorial Board
William Cook, Georgia Tech
Gérard Cornuéjols, Carnegie Mellon University
Oktay Günlük, IBM T.J. Watson Research Center
Michael Jünger, Universität zu Köln
C.T. Kelley, North Carolina State University
Adrian S. Lewis, Cornell University
Pablo Parrilo, Massachusetts Institute of Technology
Daniel Ralph, University of Cambridge
Éva Tardos, Cornell University
Mike Todd, Cornell University
Laurence Wolsey, Université Catholique de Louvain
Series Volumes
Biegler, Lorenz T., Nonlinear Programming: Concepts, Algorithms, and Applications to
Chemical Processes
Shapiro, Alexander, Dentcheva, Darinka, and Ruszczyński, Andrzej, Lectures on Stochastic
Programming: Modeling and Theory
Conn, Andrew R., Scheinberg, Katya, and Vicente, Luis N., Introduction to Derivative-Free
Optimization
Ferris, Michael C., Mangasarian, Olvi L., and Wright, Stephen J., Linear Programming with MATLAB
Attouch, Hedy, Buttazzo, Giuseppe, and Michaille, Gérard, Variational Analysis in Sobolev
and BV Spaces: Applications to PDEs and Optimization
Wallace, Stein W. and Ziemba, William T., editors, Applications of Stochastic Programming
Grötschel, Martin, editor, The Sharpest Cut: The Impact of Manfred Padberg and His Work
Renegar, James, A Mathematical View of Interior-Point Methods in Convex Optimization
Ben-Tal, Aharon and Nemirovski, Arkadi, Lectures on Modern Convex Optimization: Analysis,
Algorithms, and Engineering Applications
Conn, Andrew R., Gould, Nicholas I. M., and Toint, Philippe L., Trust-Region Methods
Lorenz T. Biegler
Carnegie Mellon University
Pittsburgh, Pennsylvania
10 9 8 7 6 5 4 3 2 1
All rights reserved. Printed in the United States of America. No part of this book may be reproduced,
stored, or transmitted in any manner without the written permission of the publisher. For information,
write to the Society for Industrial and Applied Mathematics, 3600 Market Street, 6th Floor, Philadelphia,
PA 19104-2688.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These
names are used in an editorial context only; no infringement of trademark is intended.
Excel is a trademark of Microsoft Corporation in the United States and/or other countries.
MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please
contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA, 508-647-7000,
Fax: 508-647-7001 [email protected], www.mathworks.com.
Biegler, Lorenz T.
Nonlinear programming : concepts, algorithms, and applications to chemical processes / Lorenz T.
Biegler.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-898717-02-0
1. Chemical processes. 2. Nonlinear programming. I. Title.
TP155.75.B54 2010
519.7’6--dc22 2010013645
Contents
Preface xiii
Bibliography 363
Index 391
Preface
approach to inequality constraints and discusses algorithms and problem formulations that
are essential for developing large-scale optimization models. Chapter 7 then discusses steady
state process optimization and describes the application of NLP methods to modular and
equation-oriented simulation environments.
Chapter 8 introduces the emerging field of dynamic modeling and optimization in
process systems. A survey of optimization strategies is given and current applications are
summarized. The next two chapters deal with two strategies for dynamic optimization.
Chapter 9 develops optimization methods with embedded differential-algebraic equation
(DAE) solvers, while Chapter 10 describes methods that embed discretized DAE models
within the optimization formulation itself. Chapter 11 concludes this text by presenting
complementarity models that can describe a class of discrete decisions. Embedded within
nonlinear programs, these lead to mathematical programs with complementarity constraints
(MPCC) that apply to both steady state and dynamic process optimization models.
Finally, it is important to mention what this book does not cover. As seen in the table of
contents, the book is restricted to methods and applications centered around gradient-based
nonlinear programming. Within the broad area of optimization, the following topics
are not considered, although extensive citations are provided for additional reading:
• Optimization in function spaces. While the text provides a practical treatment of DAE
optimization, this is developed from the perspective of finite-dimensional optimiza-
tion problems. Similarly, PDE-constrained optimization problems [51, 50] are beyond
the scope of this text.
• Iterative linear solvers. Unlike methods for PDE-based formulations, indirect lin-
ear solvers are almost never used with chemical process models. Hence, the NLP
strategies in this book will rely on direct (and often sparse) linear solvers with little
coverage of iterative linear solvers.
• Optimization methods for nondifferentiable functions. These methods are not covered,
although some nondifferentiable features may be addressed through reformulation
of the nonlinear program. Likewise, derivative-free optimization methods are not
covered. A recent treatment of this area is given in [102].
• Optimization problems with stochastic elements. These problems are beyond the scope
of this text and are covered in a number of texts including [57, 216].
• Optimization methods that ensure global solutions. The NLP methods covered in the
text guarantee only local solutions unless the appropriate convexity conditions hold.
Optimization methods that ensure global solutions for nonconvex problems are not
covered here. Extensive treatment of these methods can be found in [144, 203, 379].
• Optimization methods for problems with integer variables. These methods are beyond
the scope of this book. Resources for these mixed integer problems can be found in
[53, 143, 295].
Nevertheless, the NLP concepts and algorithms developed in this text provide useful back-
ground and a set of tools to address many of these areas.
Acknowledgments
There are many people who deserve my thanks in the creation of this book. I have been privi-
leged to have been able to work with outstanding graduate students and research colleagues.
I am grateful to all of them for the inspiration that they brought to this book. In particu-
lar, many thanks to Nikhil Arora, Brian Baumrucker, Antonio Flores-Tlacuahuac, Shiva
Kameswaran, Carl Laird, Yi-dong Lang, Nuno Oliveira, Maame Yaa Poku, Arvind Raghu-
nathan, Lino Santos, Claudia Schmid, Andreas Wächter, Dominique Wolbert, and Victor
Zavala for their research contributions that were incorporated into the chapters. In addition,
special thanks go to Brian Baumrucker, Vijay Gupta, Shiva Kameswaran, Yi-dong Lang,
Rodrigo Lopez Negrete, Arvind Raghunathan, Andreas Wächter, Kexin Wang, and Victor
Zavala for their careful reading of the manuscript.
I am also very grateful to my research colleagues at Carnegie Mellon (Ignacio Gross-
mann, Nick Sahinidis, Art Westerberg, and Erik Ydstie) and at Wisconsin (Mike Ferris,
Christos Maravelias, Harmon Ray, Jim Rawlings, Ross Swaney, and Steve Wright) for their
advice and encouragement during the writing of this book. Also, I very much appreciate
the discussions and advice from Mihai Anitescu, John Betts, Georg Bock, Steve Campbell,
Andy Conn, Tim Kelley, Katya Kostina, Sven Leyffer, Pu Li, Wolfgang Marquardt, Hans
Mittelmann, Jorge Nocedal, Sachin Patwardhan, Danny Ralph, Zhijiang Shao, and Philippe
Toint. My thanks also go to Lisa Briggeman, Sara Murphy, and Linda Thiel at SIAM for
their advice and many suggestions.
Finally, it is a privilege to acknowledge the support of the Fulbright Foundation and the
Hougen Visiting Professorship for very fruitful research stays at the University of Heidelberg
and the University of Wisconsin, respectively. Without these opportunities this book could
not have been completed.
Lorenz T. Biegler
Chapter 1
Introduction to Process Optimization
Most things can be improved, so engineers and scientists optimize. While designing systems
and products requires a deep understanding of influences that achieve desirable performance,
the need for an efficient and systematic decision-making approach drives the need for opti-
mization strategies. This introductory chapter provides the motivation for this topic as well
as a description of applications in chemical engineering. Optimization applications can be
found in almost all areas of engineering. Typical problems in chemical engineering arise in
process design, process control, model development, process identification, and real-time
optimization. The chapter provides an overall description of optimization problem classes
with a focus on problems with continuous variables. It then describes where these problems
arise in chemical engineering, along with illustrative examples. This introduction sets the
stage for the development of optimization methods in the subsequent chapters.
Optimization is a fundamental and frequently applied task for most engineering ac-
tivities. However, in many cases, this task is done by trial and error (through case study). To
avoid such tedious activities, we take a systematic approach to this task, which is as efficient
as possible and also provides some guarantee that a better solution cannot be found.
The systematic determination of optimal solutions leads to a large family of methods
and algorithms. Moreover, the literature for optimization is dynamic, with hundreds of
papers published every month in dozens of journals. Further, research in optimization
can be observed at a number of different levels that necessarily overlap but are often
considered by separate communities:
• At the engineering level, optimization strategies are applied to challenging, and often
poorly defined, real-world problems. Knowledge of optimization at this level is
concerned with the efficiency and reliability of applicable methods, analysis of the
solution, and diagnosis and recovery from failure of the solution method.
From the above description of optimization research, it is clear that successful devel-
opment of an optimization strategy within a given level requires a working knowledge of
the preceding levels. For instance, while it is important at the mathematical programming
level to develop the “right” optimization algorithm, at the engineering level it is even more
important to solve the “right” optimization problem formulation. On the other hand, as en-
gineers need to consider optimization tasks on a regular basis, a systematic approach with a
fundamental knowledge of optimization formulations and algorithms is essential. It should
be noted that this requires not only knowledge of existing software, which may have limited
application to particularly difficult problems, but also knowledge of the underlying algo-
rithmic principles that allow challenging applications to be addressed. In the next section
we begin with a classification of mathematical programming problems. This is followed by
examples of optimization problems in chemical engineering that will be addressed in this
text. Finally, a simple example is presented to motivate the development of optimization
methods in subsequent chapters.
¹The term mathematical programming was coined in the 1940s and is somewhat unrelated to computer
programming; it originally referred to the more general concept of optimization in the sense of optimal
planning.
min_{x,y}  f(x, y)
s.t.  h(x, y) = 0,   (1.1)
      g(x, y) ≤ 0,
      x ∈ R^n,  y ∈ {0, 1}^t,
where f (x, y) is the objective function (e.g., cost, energy consumption, etc.), h(x, y) = 0
are the equations that describe the performance of the system (e.g., material balances, pro-
duction rates), and the inequality constraints g(x, y) ≤ 0 can define process specifications
or constraints for feasible plans and schedules. Note that the operator max f (x, y) is equiv-
alent to min −f (x, y). We define the real n-vector x to represent the continuous variables
while the t-vector y represents the discrete variables, which, without loss of generality, are
often restricted to take 0/1 values to define logical or discrete decisions, such as assignment
of equipment and sequencing of tasks. (These variables can also be formulated to take on
other integer values as well.) Problem (1.1) corresponds to an MINLP when any of the
functions involved are nonlinear. If the functions f (x, y), g(x, y), and h(x, y) are linear (or
vacuous), then (1.1) corresponds to a mixed integer linear program (MILP). Further, for
MILPs, an important case occurs when all the variables are integer; this gives rise to an
integer programming (IP) problem. IP problems can be further classified into a number of
specific problems (e.g., assignment, traveling salesman, etc.), not shown in Figure 1.1.
If there are no 0/1 variables, then problem (1.1) reduces to the nonlinear program
(1.2) given by
min_{x ∈ R^n}  f(x)
s.t.  h(x) = 0,   (1.2)
      g(x) ≤ 0.
This general problem can be further classified. First, an important distinction is whether the
problem is assumed to be differentiable or not. In this text, we will assume that the func-
tions f (x), h(x), and g(x) have continuous first and second derivatives. (In many cases,
nonsmooth problems can be reformulated into a smooth form of (1.2).)
Second, a key characteristic of (1.2) is whether it is convex or not, i.e., whether it
has a convex objective function, f (x), and a convex feasible region. This can be defined as
follows.
• A set S ⊆ R^n is convex if and only if all points on the straight line connecting any two
points in this set are also within this set. This can be stated as
  x(α) = αx¹ + (1 − α)x² ∈ S   (1.3)
for all α ∈ (0, 1) and all points x¹, x² ∈ S.
• A function φ(x) : R^n → R is convex if
  φ(αx¹ + (1 − α)x²) ≤ αφ(x¹) + (1 − α)φ(x²)   (1.4)
holds for all α ∈ (0, 1) and all points x¹, x² ∈ R^n. (Strict convexity requires that the
inequality (1.4) be strict.)
• Convex feasible regions require g(x) to be a convex function and h(x) to be linear.
• A function φ(x) is (strictly) concave if the function −φ(x) is (strictly) convex.
If (1.2) is a convex problem, then any local solution (for which a better, feasible solution
cannot be found in a neighborhood around this solution) is guaranteed to be a global solution
to (1.2); i.e., no better solution exists. On the other hand, nonconvex problems may have
multiple local solutions, i.e., feasible solutions that minimize the objective function only
within some neighborhood about the solution.
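As a quick numerical illustration of the chord inequality (1.4), the Python sketch below samples it on random segments; a single violation disproves convexity, while a clean pass only suggests it. The interval and the sample count are arbitrary choices made for this illustration.

```python
import random

def appears_convex(phi, lo, hi, trials=2000, tol=1e-9):
    """Sample phi(a*x1 + (1-a)*x2) <= a*phi(x1) + (1-a)*phi(x2) on random
    segments in [lo, hi].  One violation proves nonconvexity; no violation
    only suggests convexity."""
    random.seed(0)
    for _ in range(trials):
        x1 = random.uniform(lo, hi)
        x2 = random.uniform(lo, hi)
        a = random.uniform(0.0, 1.0)
        xm = a * x1 + (1 - a) * x2
        if phi(xm) > a * phi(x1) + (1 - a) * phi(x2) + tol:
            return False  # the chord lies below the function somewhere
    return True

# x^2 is convex on R; x^3 is not (it is concave for x < 0).
print(appears_convex(lambda x: x * x, -1.0, 1.0))   # -> True
print(appears_convex(lambda x: x ** 3, -1.0, 1.0))  # -> False
```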
Further specializations of problem (1.2) can be made if the constraint and objec-
tive functions satisfy certain properties, and specialized algorithms can be constructed for
these cases. In particular, if the objective and constraint functions in (1.2) are linear, then
the resulting linear program (LP) can be solved in a finite number of steps. Methods to solve
LPs are widespread and well implemented. Currently, state-of-the-art LP solvers can handle
millions of variables and constraints, and the application of further decomposition meth-
ods leads to the solution of problems that are two or three orders of magnitude larger than
this. Quadratic programs (QPs) represent a slight modification of LPs through the addition
of a quadratic term in the objective function. If the objective function is convex, then the
resulting convex QP can also be solved in a finite number of steps. While QP models are
generally not as large or widely applied as LP models, a number of solution strategies have
been created to solve large-scale QPs very efficiently. These problem classes are covered in
Chapter 4.
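To illustrate why a convex QP can be solved in a finite number of steps, the sketch below solves the equality-constrained case, min ½xᵀQx + cᵀx s.t. Ax = b, by assembling its KKT conditions into a single linear system; the particular Q, c, A, b are hypothetical.

```python
import numpy as np

def eq_qp(Q, c, A, b):
    """Solve min 0.5*x'Qx + c'x  s.t. Ax = b via the KKT system
    [Q  A'; A  0] [x; lam] = [-c; b].  For Q positive definite on the
    null space of A, one linear solve gives the exact minimizer."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-c, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]  # primal point and multipliers

# Hypothetical data: min x1^2 + x2^2  s.t.  x1 + x2 = 1.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, lam = eq_qp(Q, c, A, b)
print(x)  # optimal point is (0.5, 0.5)
```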
Finally, we mention that while the nonlinear programming (NLP) problem (1.2) is
given as a finite-dimensional representation, it may result from a possible large-scale dis-
cretization of differential equations and solution profiles that are distributed in time and
space.
Table 1.1 summarizes model types that have been formulated for process engineering
applications. Design is dominated by NLP and MINLP models due to the need for the explicit
handling of performance equations, although simpler targeting models for synthesis give
rise to LP and MILP problems. Operations problems, in contrast, tend to be dominated
by linear models, LPs and MILPs, for planning, scheduling, and supply chain problems.
Nonlinear Programming, however, plays a crucial role at the level of real-time optimization.
Finally, process control has traditionally relied on LP and NLP models, although MILPs
are being increasingly used for hybrid systems. It is also worth noting that the applications
listed in Table 1.1 have been facilitated not only by progress in optimization algorithms, but
also by modeling environments such as GAMS [71] and AMPL [148].
This book focuses on the nonlinear programming problem (1.2) and explores methods
that locate local solutions efficiently. While this approach might first appear as a restricted
form of optimization, NLPs have broad applications, particularly for large-scale engineering
models. Moreover, while the study of NLP algorithms is important on its own, these algo-
rithms also form important components of strategies for MINLP problems and for finding
the global optimum of nonconvex problems.
and momentum), physical and chemical equilibrium among species and phases, and addi-
tional constitutive equations that describe the rates of chemical transformation or transport
of mass and energy.
Chemical process models are often represented by a collection of individual unit
models (the so-called unit operations) that usually correspond to major pieces of process
equipment. Unit models are assembled within a process flowsheet that describes the inter-
action of equipment either for steady state or dynamic behavior. As a result, models can
be described by algebraic or differential equations. For example, steady state process flow-
sheets are usually described by lumped parameter models described by algebraic equations.
Similarly, dynamic process flowsheets are described by lumped parameter models described
by differential-algebraic equations (DAEs). Models that deal with spatially distributed mod-
els are frequently considered at the unit level, with partial differential equations (PDEs) that
model fluid flow, heat and mass transfer, and reactions. On the other hand, distributed mod-
els are usually considered too expensive to incorporate within an overall process model
(although recent research [421, 243] is starting to address these issues).
Process models may also contain stochastic elements with uncertain variables. While
these features are beyond the scope of this text, Chapter 6 considers a limited treatment
of uncertainty through the formulation and efficient solution of multiple scenario opti-
mization problems. These formulations link multiple instances of process models together
with common variables. Similarly, models can also be linked to include multiple processes
over multiple time scales. As a result, interactions between different operating levels (see
Table 1.1) or among spatially distributed processes can be exploited through an optimization
formulation.
To illustrate the formulation of NLPs from process models we introduce three exam-
ples from design, real-time optimization, and control. While solution strategies and results
are deferred to later chapters, some detail is provided here to demonstrate both the charac-
teristics of the resulting NLPs and some challenges in their solution.
• Each heat exchanger also has a capital cost that is based on its area A_i, i ∈ {c, h, m},
for heat exchange. Here we consider a simple countercurrent, shell and tube heat
exchanger with an overall heat transfer coefficient, U_i, i ∈ {c, h, m}. The resulting
area equations are given by
  Q_i = U_i A_i ΔT_lm^i,  i ∈ {c, h, m},   (1.8)
  ΔT_lm^i = (ΔT_a^i − ΔT_b^i) / ln(ΔT_a^i / ΔT_b^i),  i ∈ {c, h, m}.   (1.9)
Our objective is to minimize the total cost of the system, i.e., the energy cost as well as the
capital cost of the heat exchangers. This leads to the following NLP:
min  Σ_{i ∈ {c,h,m}} (ĉ_i Q_i + c̄_i A_i^β)   (1.10)
s.t.  (1.5)–(1.9),   (1.11)
      Q_i ≥ 0,  ΔT_a^i ≥ ε,  ΔT_b^i ≥ ε,  i ∈ {c, h, m},   (1.12)
where the cost coefficients ĉ_i and c̄_i reflect the energy and amortized capital prices, the
exponent β ∈ (0, 1] reflects the economy of scale of the equipment, and a small constant
ε > 0 is selected to prevent the log-mean temperature difference from becoming
undefined. This example has one degree of freedom. For instance, if the heat duty Qm is
specified, then the hot and cold stream temperatures and all of the remaining quantities can
be calculated.
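As a small numerical illustration of (1.8)–(1.9), the sketch below computes the log-mean temperature difference and the implied area for one exchanger; the duty, heat transfer coefficient, and end temperature differences are hypothetical values.

```python
import math

def lmtd(dta, dtb):
    """Log-mean temperature difference, eq. (1.9):
    (dTa - dTb) / ln(dTa / dTb), with the limiting case dTa -> dTb."""
    if abs(dta - dtb) < 1e-12:
        return dta
    return (dta - dtb) / math.log(dta / dtb)

def area(Q, U, dta, dtb):
    """Heat-exchange area from eq. (1.8): Q = U * A * dT_lm."""
    return Q / (U * lmtd(dta, dtb))

# Hypothetical exchanger: 100 kW duty, U = 0.5 kW/(m^2 K),
# end temperature differences of 40 K and 20 K.
A = area(100.0, 0.5, 40.0, 20.0)
print(round(A, 2))  # -> 6.93 (m^2)
```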
The column is specified to recover most of the n-butane (the light key) in the top product and
most of the isopentane (the heavy key) in the bottom product. We assume a total condenser
and partial reboiler, and that the liquid and vapor phases are in equilibrium. A tray-by-tray
B + V₀ − L₁ = 0,   (1.13)
L_i + V_i − L_{i+1} − V_{i−1} = 0,  i ∈ {1, …, N}, i ∉ S,   (1.14)
L_i + V_i − L_{i+1} − V_{i−1} − F = 0,  i ∈ S,   (1.15)
L_{N+1} + D − V_N = 0.   (1.16)
Enthalpy Balances
Σ_{j=1}^{m} y_{i,j} − Σ_{j=1}^{m} x_{i,j} = 0,  i = 0, …, N + 1,   (1.26)
The column optimization problem can then be stated as
min  Q_R   (1.30a)
s.t.  (1.13)–(1.27),   (1.30b)
      x_{bottom,lk} ≤ 0.01 x_{top,lk},   (1.30c)
      L_i, V_i, T_i ≥ 0,  i = 1, …, N + 1,   (1.30d)
      D, Q_R, Q_C ≥ 0,   (1.30e)
      y_{i,j}, x_{i,j} ∈ [0, 1],  j ∈ C,  i = 1, …, N + 1.   (1.30f)
and output variables. In its most common form, a linear model is used to represent the
dynamic process within separate time periods. A quadratic objective function is used to
drive the output (or controlled) variables back to their desired values (or setpoints) and also
to stabilize the control (or manipulated) variable profiles. As a result, a so-called quadratic
programming problem is solved online. As seen in Figure 1.4 the control variables are
allowed to vary in the first several time periods (Hu known as the input horizon) and the
process is simulated for several more time periods (Hp known as the output horizon). The
output horizon must be long enough so that the process will come back to steady state at
the end of the simulation. After a horizon subproblem is solved, the control variables in the
first time period are implemented on the process. At the next time period, the MPC model
is updated with new measurements from the process and the next horizon subproblem is set
up and solved again. The input horizon can be adjusted within the limits, 1 ≤ Hu ≤ Hp . As
long as the time sampling periods are short enough and the process does not stray too far
from the setpoint (about which the system can be linearized), then a linear model usually
leads to good controller performance [47]. On the other hand, for processes undergoing
large changes including periods of start-up, shut-down, and change of steady states, the
linear model for the process is insufficient, nonlinear models must be developed, and, as
described in Chapter 9, a more challenging nonlinear program needs to be solved.
For most MPC applications, a linear time invariant (LTI) model is used to describe
the process dynamics over a time interval Δt; this is of the following form:
x^{k+1} = A x^k + B u^k,   (1.31)
where x^k ∈ R^{n_x} is the vector of state variables at t = t₀ + kΔt, u^k ∈ R^{n_u} is the vector
of manipulated variables at t = t₀ + kΔt, A ∈ R^{n_x×n_x} is the state space model matrix,
B ∈ R^{n_x×n_u} is the manipulated variable model matrix, Δt ∈ R is the sampling time, and
t₀ ∈ R is the current time. A set of output (or controlled) variables is selected using the
following linear transformation:
y^k = C x^k + d^k,   (1.32)
where y^k ∈ R^{n_y} is the vector of output variables at t = t₀ + kΔt, d^k ∈ R^{n_y} is the vector
of known disturbances at t = t₀ + kΔt, and C ∈ R^{n_y×n_x} is the controlled variable mapping
matrix.
The control strategy for this process is to find values for the manipulated variables u^k,
k = 1, …, H_u, over an input horizon H_u such that (1.31)–(1.32) hold over the output horizon
H_p for k = 1, …, H_p and y^{H_p} ≈ y^{sp} (i.e., an endpoint constraint). To accomplish these goals,
the following MPC QP is set up and solved:
min_{u^k, x^k, y^k}  Σ_{k=1}^{H_p} (y^k − y^{sp})^T Q_y (y^k − y^{sp}) + Σ_{k=1}^{H_u} (u^k − u^{k−1})^T Q_u (u^k − u^{k−1})   (1.33a)
s.t.  x^k = A x^{k−1} + B u^{k−1}  for k = 1, …, H_p,   (1.33b)
      y^k = C x^k + d^k  for k = 1, …, H_p,   (1.33c)
      u^k = u^{H_u}  for k = H_u + 1, …, H_p,   (1.33d)
      −b̂ ≤ Â^T u^k ≤ b̂  for k = 1, …, H_p,   (1.33e)
      u^L ≤ u^k ≤ u^U  for k = 1, …, H_p,   (1.33f)
      −Δu^{max} ≤ u^k − u^{k−1} ≤ Δu^{max}  for k = 1, …, H_p,   (1.33g)
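The LTI model (1.31)–(1.32) and the tracking objective (1.33a) are straightforward to evaluate for a candidate input sequence. The sketch below does this for a hypothetical single-state plant; the matrices, horizons, and weights are made up, and a real MPC solver would of course optimize over the u^k rather than fix them.

```python
import numpy as np

def simulate(A, B, C, x0, u_seq, d=0.0):
    """Roll the LTI model forward: x[k] = A x[k-1] + B u[k-1],
    y[k] = C x[k] + d (cf. (1.31)-(1.32)), returning the outputs."""
    x, ys = x0, []
    for u in u_seq:
        x = A @ x + B @ u
        ys.append(C @ x + d)
    return ys

def mpc_cost(ys, u_seq, y_sp, Qy, Qu, u_prev):
    """Quadratic MPC objective (1.33a): setpoint tracking plus move suppression."""
    track = sum(float((y - y_sp) @ Qy @ (y - y_sp)) for y in ys)
    moves = [u_seq[0] - u_prev] + [u_seq[k] - u_seq[k - 1] for k in range(1, len(u_seq))]
    return track + sum(float(du @ Qu @ du) for du in moves)

# Hypothetical single-state plant and a constant candidate input.
A = np.array([[0.9]]); B = np.array([[0.1]]); C = np.array([[1.0]])
u_seq = [np.array([1.0])] * 5
ys = simulate(A, B, C, np.array([0.0]), u_seq)
J = mpc_cost(ys, u_seq, np.array([1.0]), np.eye(1), 0.1 * np.eye(1), np.array([0.0]))
```

With steady-state gain 0.1/(1 − 0.9) = 1, the outputs climb toward the setpoint of 1 and the cost penalizes the single input move at k = 1.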
Note that the cost of the tank is related to its surface area (i.e., the amount of material
required) and cs is the per unit area cost of the tank’s side, while ct is the per unit area cost
of the tank’s top and bottom.
This problem can be simplified by neglecting the bound constraints and eliminating
L from the problem, leading to the unconstrained problem
min_D  f(D) = 4 c_s V / D + (π/2) c_t D².   (1.35)
Differentiating this objective with respect to D and setting the derivative to zero gives
df/dD = −4 c_s V / D² + c_t π D = 0,   (1.36)
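Solving the stationarity condition (1.36) for D gives a closed form, D* = (4 c_s V / (π c_t))^{1/3}. The sketch below evaluates it for hypothetical cost data and checks stationarity; note that when c_s = c_t the optimal tank turns out as tall as it is wide (L = D).

```python
import math

def optimal_diameter(cs, ct, V):
    """Stationary point of (1.36): -4*cs*V/D**2 + ct*pi*D = 0
    =>  D* = (4*cs*V / (pi*ct)) ** (1/3)."""
    return (4.0 * cs * V / (math.pi * ct)) ** (1.0 / 3.0)

def dcost_dD(cs, ct, V, D):
    """Left-hand side of (1.36)."""
    return -4.0 * cs * V / D**2 + ct * math.pi * D

# Hypothetical cost data: cs = ct = 1, V = 10.
Dstar = optimal_diameter(1.0, 1.0, 10.0)
print(abs(dcost_dD(1.0, 1.0, 10.0, Dstar)) < 1e-9)  # -> True (stationarity holds)
```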
1.7 Exercises
1. Examine whether the following functions are convex or not:
• x² + ax + b, x ∈ R,
• x³, x ∈ R,
• x⁴, x ∈ R,
• log(x), x ∈ (0, 1].
min z
s.t. z ≥ x,
z ≥ y,
x + y = 1.
5. Modify the motivating example (1.34) and consider finding the minimum cost of a
parallelepiped with a fixed volume. Show that the optimal solution must be a cube.
6. Consider the distillation column optimization (1.30) and introduce binary decision
variables to formulate the optimal location of the feed tray as an MINLP of the form
(1.1).
Chapter 2
Concepts of Unconstrained Optimization
2.1 Introduction
We begin by considering a scalar, real objective function in n variables, i.e., f(x) : R^n → R.
The unconstrained optimization problem can be written as
min_{x ∈ R^n} f(x).
Example 2.1 Consider the reaction of chemical species A, B, and C in an isothermal batch
reactor according to the following first order reactions: A → B with rate constant k1, and
A → C with rate constant k2.
The kinetic parameters k1 and k2 are estimated by minimizing a least-squares objective f(x)
that sums the squared deviations of the kinetic model (2.7) from concentration measurements,
â, b̂, and ĉ, obtained at normalized times t_i. Figure 2.1 presents a contour plot in the space
of the kinetic parameters, with level sets for different values of f(x). The lowest value,
f(x) = 2.28 × 10⁻⁶, occurs at k1 = 0.50062 and k2 = 0.40059. This corresponds to
the best fit of the kinetic model (2.7) to the experimental data.
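The flavor of this estimation problem can be reproduced from the closed-form solution of the two first-order reactions: a(t) = e^{−(k1+k2)t}, with B and C formed in proportion to k1 and k2. The book's measurement data are not reproduced here, so the sketch below generates synthetic "measurements" at k1 = 0.5, k2 = 0.4 and evaluates the least-squares objective.

```python
import math

def concentrations(k1, k2, t):
    """Closed-form solution of the first-order network A -> B, A -> C
    with a(0) = 1, b(0) = c(0) = 0 (a standard result for this system)."""
    ktot = k1 + k2
    a = math.exp(-ktot * t)
    b = (k1 / ktot) * (1.0 - a)
    c = (k2 / ktot) * (1.0 - a)
    return a, b, c

def sse(k1, k2, data):
    """Least-squares objective: squared deviations between the model and
    measurements (t_i, a_hat, b_hat, c_hat)."""
    total = 0.0
    for t, ah, bh, ch in data:
        a, b, c = concentrations(k1, k2, t)
        total += (a - ah) ** 2 + (b - bh) ** 2 + (c - ch) ** 2
    return total

# Synthetic data generated at k1 = 0.5, k2 = 0.4, so the objective is
# exactly zero at those values and positive anywhere else.
data = [(t,) + concentrations(0.5, 0.4, t) for t in (0.5, 1.0, 2.0, 4.0)]
print(sse(0.5, 0.4, data))        # -> 0.0
print(sse(0.6, 0.4, data) > 0.0)  # -> True
```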
This example leads us to ask two questions.
• What criteria characterize the solution to the unconstrained optimization problem?
• How can one satisfy these criteria and therefore locate this solution quickly and
reliably?
The first question will be considered in this chapter while the second question will be
explored in the next. But before considering these questions, we first introduce some back-
ground material.
with elements {C}_{ij} = Σ_{k=1}^{m} {A}_{ik} {B}_{kj}. This operation is defined only if the number
of columns in A and rows of B is the same.
• The transpose of A ∈ R^{n×m} is A^T ∈ R^{m×n} with the rows and columns of A inter-
changed, i.e., {A^T}_{ij} = {A}_{ji}.
• A symmetric matrix A ∈ R^{n×n} satisfies A = A^T.
• A diagonal matrix A ∈ R^{n×n} has nonzero elements only on the diagonal, i.e.,
{A}_{ij} = 0 for i ≠ j.
• The identity matrix I ∈ R^{n×n} is defined as
  {I}_{ij} = 1 if i = j, and 0 otherwise.
Vectors and matrices can be characterized and described by a number of scalar mea-
sures. The following definitions introduce these.
represent (2.9) as
f(z) = c + ā^T z + (1/2) z^T Λ z
     = c + ā^T z + z*^T Λ z − (1/2) z*^T Λ z* + (1/2)(z − z*)^T Λ (z − z*)
     = c̄ + (ā + Λ z*)^T z + (1/2)(z − z*)^T Λ (z − z*)
     = c̄ + Σ_{j ∈ J0} ā_j z_j + (1/2) Σ_{j=1}^{n} λ_j (z_j − z*_j)²,   (2.10)
where c̄ = c − (1/2) z*^T Λ z*, J0 = {j : λ_j = 0}, and z*_j = −ā_j/λ_j for j ∉ J0.
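The separated form (2.10) can be checked numerically. The sketch below compares the original quadratic with its eigenvalue-separated form for made-up data that include one zero eigenvalue (so J0 is nonempty).

```python
def f_original(c, abar, lam, z):
    """f(z) = c + abar'z + 0.5 * z' Lam z with diagonal Lam = diag(lam)."""
    return c + sum(a * zj for a, zj in zip(abar, z)) \
             + 0.5 * sum(l * zj * zj for l, zj in zip(lam, z))

def f_separated(c, abar, lam, z):
    """Separated form (2.10): cbar + sum_{j in J0} abar_j z_j
    + 0.5 * sum_j lam_j (z_j - zstar_j)^2, zstar_j = -abar_j/lam_j."""
    zstar = [(-a / l if l != 0.0 else 0.0) for a, l in zip(abar, lam)]
    cbar = c - 0.5 * sum(l * zs * zs for l, zs in zip(lam, zstar))
    lin = sum(a * zj for a, l, zj in zip(abar, lam, z) if l == 0.0)  # j in J0
    quad = 0.5 * sum(l * (zj - zs) ** 2 for l, zs, zj in zip(lam, zstar, z))
    return cbar + lin + quad

# Hypothetical data with a zero eigenvalue in the second coordinate.
c, abar, lam = 1.0, [1.0, 2.0, 3.0], [2.0, 0.0, 5.0]
z = [0.3, -1.2, 0.7]
print(f_original(c, abar, lam, z), f_separated(c, abar, lam, z))  # both forms agree
```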
Figure 2.2. Contour plots for quadratic objective functions: B is positive definite
(top), B is singular (middle), B is indefinite (bottom).
and the eigenvectors v₁ and v₂ are shown. Note that the contours show ellipsoidal shapes
with condition number κ(B) = λ_max/λ_min = 3, and κ^{1/2} gives the ratio of major
and minor axes in these contours. Note that for large values of κ the contours become more
“needle like” and, as κ approaches infinity, the contours approach those of the middle plot
in Figure 2.2.
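A symmetric matrix with this conditioning is, for instance, B = [[2, 1], [1, 2]] (an illustrative choice, not necessarily the matrix behind the figure), whose eigenvalues are 1 and 3:

```python
import numpy as np

# Symmetric positive definite matrix chosen to match kappa = 3.
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(B)  # ascending eigenvalues for symmetric B
kappa = eigvals[-1] / eigvals[0]      # lam_max / lam_min
axis_ratio = np.sqrt(kappa)           # ratio of major to minor contour axes
print(eigvals, kappa)                 # eigenvalues ~ [1, 3], kappa ~ 3
```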
Definition 2.9 A function φ(x) : R^n → R is continuous in R^n if for every point x̄ and all
ε > 0 there is a value δ > 0 such that
  ‖x − x̄‖ < δ  implies  |φ(x) − φ(x̄)| < ε,
and therefore lim_{x→x̄} φ(x) = φ(x̄). Note that the bottom left contour plot in Figure 2.4
shows φ(x) to be discontinuous.
Definition 2.11 Let γ(ξ) be a scalar function of a scalar variable, ξ. Assuming that the limit
exists, the first derivative is defined by
  dγ/dξ = γ′(ξ) := lim_{ε→0} [γ(ξ + ε) − γ(ξ)]/ε   (2.13)
and the second derivative is defined by
  d²γ/dξ² = γ″(ξ) := lim_{ε→0} [γ′(ξ + ε) − γ′(ξ)]/ε.   (2.14)
Figure 2.4. Contour plots for different objective functions: Convex function (top
left), nonconvex function (top right), discontinuous function (bottom left), nondifferentiable
(but convex) function (bottom right).
Definition 2.12 Define the unit vector e_i = [0, 0, …, 1, …, 0]^T (i.e., a “1” in the ith element
of e_i; the other elements are zero). Assuming that the limits exist, the first partial derivative
of the multivariable function φ(x) is defined by
  ∂φ(x)/∂x_i := lim_{ε→0} [φ(x + ε e_i) − φ(x)]/ε.
We define φ(x) as (twice) differentiable if all of the (second) partial derivatives exist and
as (twice) continuously differentiable if these (second) partial derivatives are continuous.
Second derivatives that are continuous have the property ∂²φ/∂x_i∂x_j = ∂²φ/∂x_j∂x_i. Note that the bottom
right contour plot in Figure 2.4 shows φ(x) to be nondifferentiable.
Assembling these partial derivatives leads to the gradient vector
  ∇_x φ(x) = [∂φ(x)/∂x₁, …, ∂φ(x)/∂x_n]^T.   (2.17)
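Definition 2.12 and the gradient (2.17) translate directly into a finite-difference check. The sketch below approximates the gradient of a made-up function by central differences along each unit vector e_i and compares against the analytic result:

```python
def grad_fd(phi, x, eps=1e-6):
    """Approximate the gradient (2.17) by central differences along each
    unit vector e_i, following Definition 2.12."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((phi(xp) - phi(xm)) / (2.0 * eps))
    return g

# Example: phi(x) = x1^2 + 3*x1*x2, analytic gradient (2*x1 + 3*x2, 3*x1).
phi = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
g = grad_fd(phi, [1.0, 2.0])
print(g)  # approximately [8.0, 3.0]
```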
Also, when a function has only one argument, we suppress the subscript on “∇” and simply
write ∇φ(x) and ∇ 2 φ(x) for the gradient and Hessian, respectively.
Finally, we establish properties for convex functions. As already mentioned in Chap-
ter 1, a convex function is defined as follows.

Definition 2.13 A function φ(x) : R^n → R is convex if
  φ(αx^a + (1 − α)x^b) ≤ αφ(x^a) + (1 − α)φ(x^b)   (2.19)
holds for all α ∈ (0, 1) and all points x^a, x^b ∈ R^n. Strict convexity requires that inequal-
ity (2.19) be strict. For differentiable functions this can be extended to the following state-
ments:
• A continuously differentiable function φ(x) is convex if and only if
  φ(x + p) ≥ φ(x) + ∇φ(x)^T p   (2.20)
holds for all x, p ∈ R^n. Strict convexity requires that inequality (2.20) be strict.
• A twice continuously differentiable function φ(x) is convex if and only if ∇ 2 φ(x) is
positive semidefinite for all x ∈ Rn .
• If ∇ 2 φ(x) is positive definite for all x ∈ Rn , then the function φ(x) is defined as strongly
convex. A strongly convex function is always strictly convex, but the converse is not
true. For instance, the function φ(x) = x 4 is strictly convex but not strongly convex.
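These characterizations are easy to probe numerically. The sketch below checks the second-order condition for a strongly convex quadratic and shows why φ(x) = x⁴ is strictly but not strongly convex (the test functions are illustrative choices, not from the text):

```python
import numpy as np

# phi(x) = x1^2 + x2^2: the Hessian is 2I, positive definite everywhere,
# so phi is convex (indeed strongly convex) by the second-order test.
hessian = np.array([[2.0, 0.0], [0.0, 2.0]])
print((np.linalg.eigvalsh(hessian) > 0).all())  # True

# phi(x) = x^4 in one variable: phi''(x) = 12 x^2 >= 0, so phi is convex,
# but phi''(0) = 0, so no uniform positive curvature bound exists and
# phi is strictly convex without being strongly convex.
phi_pp = lambda x: 12.0 * x**2
print(phi_pp(0.0))  # 0.0
```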
Theorem 2.14 (Taylor's Theorem [294]). Suppose that f(x) is continuously differentiable; then we have for all x, p ∈ R^n,

    f(x + p) = f(x) + ∇f(x + tp)^T p for some t ∈ (0, 1).    (2.21)

If, in addition, f(x) is twice continuously differentiable, then

    f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x + tp) p for some t ∈ (0, 1)    (2.22)

and

    ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + tp) p dt.    (2.23)
As we will see, finding global minimizers is generally much harder than finding local
ones. However, if f (x) is a convex function, then we can invoke the following theorem.
Theorem 2.16 If f (x) is convex, then every local minimum is a global minimum. If f (x)
is strictly convex, then a local minimum is the unique global minimum. If f (x) is convex
and differentiable, then a stationary point is a global minimum.
Proof: For the first statement, we assume that there are two local minima, x^a and x^b, with
f(x^a) > f(x^b), and establish a contradiction. Note that x^a is then not a global minimum, and we
have f(x^a) ≤ f(x) for x ∈ N(x^a) and f(x^b) ≤ f(x) for x ∈ N(x^b). By convexity, for all α ∈ (0, 1),

    f(x^a + α(x^b − x^a)) = f((1 − α)x^a + αx^b) ≤ (1 − α)f(x^a) + αf(x^b) < f(x^a),

and choosing α small enough that x^a + α(x^b − x^a) ∈ N(x^a) contradicts the local minimality of x^a.
For the third statement, we note that the stationary point satisfies ∇f (x ∗ ) = 0, and
from (2.20), we have
f (x ∗ + p) ≥ f (x ∗ ) + ∇f (x ∗ )T p = f (x ∗ ) (2.28)
for all p ∈ R^n, so x^* is a global minimum. □
Convexity is sufficient to show that a local solution is a global solution. On the other
hand, in the absence of convexity, showing that a particular local solution is also a global
solution often requires application of rigorous search methods. A comprehensive treatment
of global optimization algorithms and their properties can be found in [144, 203, 379].
Instead, this text will focus on methods that guarantee only local solutions.
To identify locally optimal solutions, we consider the following properties.
Theorem 2.17 (Necessary Conditions for Local Optimality). If f (x) is twice continuously
differentiable and there exists a point x ∗ that is a local minimum, then ∇f (x ∗ ) = 0 and
∇ 2 f (x ∗ ) must be positive semidefinite.
Proof: To show ∇f (x ∗ ) = 0, we assume that it is not and apply Taylor’s theorem (2.21) to
get
f (x ∗ + tp) = f (x ∗ ) + t∇f (x ∗ + τp)T p for some τ ∈ (0, t) and for all t ∈ (0, 1). (2.29)
Choosing p = −∇f(x^*), by continuity of the first derivatives we can choose t sufficiently
small so that −t∇f(x^* + τp)^T ∇f(x^*) < 0 and x^* + tp ∈ N(x^*). Then
f(x^* + tp) = f(x^*) + t∇f(x^* + τp)^T p < f(x^*), which is a contradiction.
To show that ∇²f(x^*) must be positive semidefinite, we assume it is not and show a
contradiction. Applying Taylor's theorem (2.22) leads to

    f(x^* + tp) = f(x^*) + t∇f(x^*)^T p + (t²/2) p^T ∇²f(x^* + τp) p  (for some τ ∈ (0, t))
                = f(x^*) + (t²/2) p^T ∇²f(x^* + τp) p.

Assuming p^T ∇²f(x^*) p < 0 and choosing t sufficiently small that x^* + tp ∈ N(x^*) leads
to (t²/2) p^T ∇²f(x^* + τp) p < 0 by continuity of the second derivatives. This again leads
to the contradiction f(x^* + tp) < f(x^*).
Theorem 2.18 (Sufficient Conditions for Local Optimality). If f(x) is twice continuously
differentiable and there exists a point x^* where ∇f(x^*) = 0 and ∇²f(x^*) is positive definite,
then x^* is a strict, isolated local minimum.
Example 2.19 To illustrate these optimality conditions, we consider the following two-
variable unconstrained optimization problem [205]:
with a^T = [0.3, 0.6, 0.2], b^T = [5, 26, 3], and c^T = [40, 1, 10]. The solution to this problem is
given by x^* = [0.7395, 0.3144]^T with f(x^*) = −5.0893. At this solution, ∇f(x^*) = 0 and the
Hessian is given by

    ∇²f(x^*) = [  77.012  108.334 ]
               [ 108.334  392.767 ],

which has eigenvalues λ = 43.417 and 426.362. Therefore this solution is a strict local minimum. The contour plot for this
problem is given in the top of Figure 2.5. From this plot, the careful reader will note that
f (x) is a nonconvex function. Moreover, the bottom of Figure 2.5 shows regions of positive
and negative eigenvalues.
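The quoted eigenvalues can be checked directly from the Hessian entries; a quick numerical verification (using NumPy, which is not part of the text):

```python
import numpy as np

# Hessian of f at the solution x* = [0.7395, 0.3144] from Example 2.19.
H = np.array([[ 77.012, 108.334],
              [108.334, 392.767]])

lam = np.linalg.eigvalsh(H)  # eigenvalues in ascending order
print(lam)  # approximately [43.42, 426.36]: both positive, so the
            # Hessian is positive definite and x* is a strict local minimum
```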
2.4 Algorithms
From the conditions that identify locally optimal solutions, we now consider methods that
will locate them efficiently. In Section 2.2, we defined a number of function types. Our main
focus will be to develop fast methods for continuously differentiable functions. Before
developing these methods, we first provide a brief survey of methods that do not require
differentiability of the objective function.
Figure 2.5. Contours for nonlinear unconstrained example (top) and regions of
minimum eigenvalues (bottom). The clear region has nonnegative minimum eigenvalues;
successively darker regions correspond to minimum eigenvalues decremented by values of 50.
solution strategies. Many of these methods are derived from heuristics that naturally lead to
numerous variations, and a very broad literature describes these methods. Here we discuss
only a few of the many important trends in this area.
and minimizing an approximation based on quadratic forms. All of these methods require
only objective function values for unconstrained minimization. Associated with these meth-
ods are numerous studies on a wide range of process problems. Moreover, many of these
methods include heuristics that prevent premature termination (e.g., directional flexibility
in the complex search as well as random restarts and direction generation). To illustrate
these methods, Figure 2.6 shows the performance of a pattern search method as well as a
random search method on an unconstrained problem.
Simulated Annealing
This strategy is related to random search methods and is derived from a class of heuristics
with analogies to the motion of molecules in the cooling and solidification of metals [239].
Here a "temperature" parameter, θ, can be raised or lowered to influence the probability of
accepting points that do not improve the objective function. The method starts with a base
point, x, and objective value, f(x). The next point x′ is chosen at random from a distribution.
If f(x′) < f(x), the move is accepted with x′ as the new point. Otherwise, x′ is accepted
with probability p(θ, x′, x). Options include the Metropolis distribution,

    p(θ, x′, x) = exp(−(f(x′) − f(x))/θ).

The θ parameter is then reduced and the method continues until no further progress is made.
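The acceptance rule and cooling loop can be sketched directly. In the sketch below, the one-dimensional test function, uniform step distribution, cooling factor, and loop counts are all illustrative choices, not from the text:

```python
import math
import random

def simulated_annealing(f, x0, theta=10.0, cooling=0.95,
                        n_outer=100, n_inner=50, seed=0):
    """Minimize f by random perturbation with Metropolis acceptance:
    downhill moves are always taken; uphill moves are accepted with
    probability exp(-(f(x') - f(x))/theta); theta is reduced each sweep."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for _ in range(n_outer):
        for _ in range(n_inner):
            xp = x + rng.uniform(-1.0, 1.0)   # candidate from a simple distribution
            fxp = f(xp)
            if fxp < fx or rng.random() < math.exp(-(fxp - fx) / theta):
                x, fx = xp, fxp
            if fx < best_f:
                best_x, best_f = x, fx
        theta *= cooling                       # reduce the "temperature"
    return best_x, best_f

# One-dimensional test: the minimum of (x - 3)^2 is at x = 3.
x_best, f_best = simulated_annealing(lambda x: (x - 3.0)**2, x0=-5.0)
print(x_best, f_best)
```

Because acceptance is probabilistic, the returned point is only expected to be near the minimizer, not exactly at it.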
Genetic Algorithms
This approach, first proposed by Holland [200], is based on the analogy of improving a
population of solutions through modifying its gene pool. It also has performance charac-
teristics similar to random search methods and simulated annealing. Two forms of genetic
modification, crossover and mutation, are used, and the elements of the optimization vec-
tor, x, are represented as binary strings. Crossover deals with random swapping of vector
elements (among parents with highest objective function values or other rankings of the pop-
ulation) or any linear combinations of two parents. Mutation deals with the addition of a
random variable to elements of the vector. Genetic algorithms (GAs) have seen widespread
use in process engineering, and a number of codes are available. Edgar, Himmelblau, and
Lasdon [122] described a related GA that is available in Microsoft Excel™.
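The two genetic operators on binary strings are simple to illustrate; the cut-point rule, mutation rate, and bit-string length below are illustrative choices, not from the text:

```python
import random

def crossover(parent_a, parent_b, rng):
    """Swap the tails of two bit strings at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]

rng = random.Random(1)
a, b = [0] * 8, [1] * 8
child1, child2 = crossover(a, b, rng)
mutated = mutate(child1, rng)
print(child1, child2)  # complementary bit patterns sharing one cut point
```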
Newton-type methods because they are fast local solvers and because many of the concepts
for unconstrained optimization extend to fast constrained optimization algorithms as well. In
addition to these methods, it is also important to mention the matrix-free conjugate gradient
method. More detail on this method can be found in [294].
To derive Newton’s method for unconstrained optimization, we consider the smooth
function f (x) and apply Taylor’s theorem at a fixed value of x and a direction p:
    f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x) p + O(‖p‖³).    (2.37)

If we differentiate this function with respect to p, we obtain

    ∇f(x + p) = ∇f(x) + ∇²f(x) p + O(‖p‖²).    (2.38)
At the given point x we would like to determine the vector p that locates the stationary
point, i.e., ∇f (x + p) = 0. Taking only the first two terms on the right-hand side, we can
define the search direction by
∇f (x) + ∇ 2 f (x)p = 0 =⇒ p = −(∇ 2 f (x))−1 ∇f (x) (2.39)
if ∇²f(x) is nonsingular. In fact, for x in the neighborhood of a strict local optimum, with
‖p‖ small, we would expect the third order term in (2.37) to be negligible; f(x) would then
behave like a quadratic function. Moreover, ∇ 2 f (x) would need to be positive definite in
order to compute the Newton step in (2.39), and from Theorem 2.18 this would be consistent
with conditions for a local minimum.
This immediately leads to a number of observations:
• If f (x) is a quadratic function where ∇ 2 f (x) is positive definite, then the Newton step
p = −(∇ 2 f (x))−1 ∇f (x) applied at any point x ∈ Rn immediately finds the global
solution at x + p.
• Nonsingularity of ∇ 2 f (x) is required at any point x where the Newton step p is
computed from (2.39). This is especially important for ∇ 2 f (x ∗ ), as a strict local
solution is required for this method. On the other hand, if ∇ 2 f (x) is singular, some
corrections need to be made to obtain good search directions and develop a convergent
method. More detail on this issue will be given in Chapter 3.
• To promote convergence to a local solution, it is important that the Newton step be a
descent step, i.e., ∇f (x)T p < 0, so that f (x k ) can decrease from one iteration k to
the next. If x k ∈ N (x ∗ ) and x k+1 = x k + αp, where the step size α ∈ (0, 1], then from
(2.22) we require that
    0 > f(x^k + αp) − f(x^k) = α∇f(x^k)^T p + (α²/2) p^T ∇²f(x^k + tαp) p for some t ∈ (0, 1).    (2.40)
Algorithm 2.1.
Choose a starting point x 0 .
For k ≥ 0, while ‖p^k‖ > ε₁ and ‖∇f(x^k)‖ > ε₂:
1. At x k , evaluate ∇f (x k ) and ∇ 2 f (x k ). If ∇ 2 f (x k ) is singular, STOP.
2. Solve the linear system ∇ 2 f (x k )p k = −∇f (x k ).
3. Set x k+1 = x k + pk and k = k + 1.
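Algorithm 2.1 translates almost line for line into code. A minimal sketch, where the quadratic test function, its analytic derivatives, and the tolerances are illustrative choices:

```python
import numpy as np

def newton(grad, hess, x0, eps1=1e-8, eps2=1e-8, max_iter=50):
    """Basic Newton method (Algorithm 2.1): solve H(x^k) p^k = -g(x^k)
    and set x^{k+1} = x^k + p^k until the step or the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps2:
            break
        H = hess(x)
        if abs(np.linalg.det(H)) < 1e-14:   # singular Hessian: STOP
            break
        p = np.linalg.solve(H, -g)
        x = x + p
        if np.linalg.norm(p) <= eps1:
            break
    return x

# Quadratic test: f(x) = (x1 - 1)^2 + 2 (x2 + 0.5)^2, minimizer [1, -0.5].
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton(grad, hess, [10.0, 10.0]))  # one Newton step reaches [1.0, -0.5]
```

On a quadratic with a positive definite Hessian, the loop terminates after a single step, matching the first observation above.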
This basic algorithm has the desirable property of fast convergence, which can be
quantified by the following well-known property.
Theorem 2.20 Assume that f (x) is twice differentiable and ∇ 2 f (x) is Lipschitz continuous
in a neighborhood of the solution x ∗ , which satisfies the sufficient second order conditions.
Then, by applying Algorithm 2.1 and with x 0 sufficiently close to x ∗ , there exists a constant
L̂ > 0 such that
• ‖x^{k+1} − x^*‖ ≤ L̂ ‖x^k − x^*‖², i.e., the convergence rate for {x^k} is quadratic;
• the convergence rate for {∇f(x^k)} is also quadratic.
Proof: For the Newton step, p^k = −(∇²f(x^k))^{-1} ∇f(x^k), we note that, by continuity of the
second derivatives, ∇²f(x^k) is nonsingular and satisfies ‖(∇²f(x^k))^{-1}‖ ≤ C for some C > 0
if x^k is sufficiently close to x^*. Thus the Newton step is well defined, and we can then write

    x^k + p^k − x^* = x^k − x^* − (∇²f(x^k))^{-1} ∇f(x^k)
                    = (∇²f(x^k))^{-1} (∇²f(x^k)(x^k − x^*) − ∇f(x^k))
                    = (∇²f(x^k))^{-1} (∇²f(x^k)(x^k − x^*) − (∇f(x^k) − ∇f(x^*))),

and from (2.23),

    x^k + p^k − x^* = (∇²f(x^k))^{-1} ∫₀¹ (∇²f(x^k) − ∇²f(x^k + t(x^* − x^k)))(x^k − x^*) dt.    (2.41)
Table 2.2. Iteration sequence with basic Newton method with starting point close
to the solution.

    k    x₁^k     x₂^k     f(x^k)     ‖∇f(x^k)‖        ‖x^k − x^*‖
    0    0.8000   0.3000   −5.000     3.000            6.2175 × 10⁻²
    1    0.7423   0.3115   −5.0882    0.8163           3.9767 × 10⁻³
    2    0.7395   0.3143   −5.0892    6.8524 × 10⁻³    9.4099 × 10⁻⁵
    3    0.7395   0.3143   −5.0892    2.6847 × 10⁻⁶    2.6473 × 10⁻⁸
    4    0.7395   0.3143   −5.0892    1.1483 × 10⁻¹³   2.9894 × 10⁻¹⁶
Table 2.3. Iteration sequence with basic Newton method with starting point far
from the solution.
Example 2.21 Consider the problem described in Example 2.19. Applying the basic
Newton algorithm from a starting point close to the solution leads to the iteration sequence
given in Table 2.2. Here it is clear that the solution can be found very quickly and, based on
the errors ‖x^k − x^*‖ and ‖∇f(x^k)‖, we observe that both {x^k} and {∇f(x^k)} have quadratic
convergence rates, as predicted by Theorem 2.20. Moreover, from Figure 2.5 we note that
the convergence path remains in the region where the Hessian is positive definite.
On the other hand, if we choose a starting point farther away from the minimum, then
we obtain the iteration sequence in Table 2.3. Note that the starting point, which is not much
farther away, lies in a region where the Hessian is indefinite, and the method terminates at a
saddle point where the eigenvalues of the Hessian are 11.752 and −3.034. Moreover, other
starting points can also lead to iterates where the objective function is undefined and the
method fails.
Example 2.21 illustrates that the optimum solution can be found very accurately if a
good starting point is found. On the other hand, it also shows that more reliable algorithms
are needed to find the unconstrained optimum. While Newton’s method converges very
quickly in the neighborhood of the solution, it can fail if
In the next chapter we will develop unconstrained algorithms that overcome these difficulties
and still retain the fast convergence properties of Newton’s method.
2.6 Exercises
1. While searching for the minimum of f(x) = [x₁² + (x₂ + 1)²] [x₁² + (x₂ − 1)²], we
terminate at the following points:
(a) x = [0, 0]T ,
(b) x = [0, 1]T ,
(c) x = [0, −1]T ,
(d) x = [1, 1]T .
Classify each point.
Find the eigenvalues and eigenvectors and any stationary points. Are the stationary
points local optima? Are they global optima?
For M = 0 find all stationary points. Are they optimal? Find the path of optimal
solutions as M increases.
4. Apply Newton’s method to the optimization problem in Example 2.1, starting from
k1 = 0.5, k2 = 0.4 and also from k1 = 2.0, k2 = 2.0. Explain the performance of this
method from these points.
5. Download and apply the NEWUOA algorithm to Example 2.19. Use the two start-
ing points from Example 2.21. How does this method compare to the results in
Example 2.21?
Chapter 3
Newton-type methods are presented and analyzed for the solution of unconstrained optimiza-
tion problems. In addition to covering the basic derivation and local convergence properties,
both line search and trust region methods are described as globalization strategies, and key
convergence properties are presented. The chapter also describes quasi-Newton methods
and focuses on derivation of symmetric rank one (SR1) and Broyden–Fletcher–Goldfarb–
Shanno (BFGS) methods, using simple variational approaches. The chapter includes a small
example that illustrates the characteristics of these methods.
3.1 Introduction
Chapter 2 concluded with the derivation of Newton’s method for the unconstrained opti-
mization problem
    min_{x ∈ R^n} f(x).    (3.1)
For unconstrained optimization, Newton’s method forms the basis for the most efficient
algorithms. Derived from Taylor’s theorem, this method is distinguished by its fast perfor-
mance. As seen in Theorem 2.20, this method has a quadratic convergence rate that can
lead, in practice, to inexpensive solutions of optimization problems. Moreover, extensions
to constrained optimization rely heavily on this method; this is especially true in chemical
process engineering applications. As a result, concepts of Newton’s method form the core
of all of the algorithms discussed in this book.
On the other hand, given a solution to (3.1) that satisfies first and second order sufficient
conditions, the basic Newton method in Algorithm 2.1 may still have difficulties and may
be unsuccessful. Newton’s method can fail on problem (3.1) for the following reasons:
1. The objective function is not smooth. Here, first and second derivatives are needed
to evaluate the Newton step, and Lipschitz continuity of the second derivatives is
needed to keep them bounded.
2. The Newton step does not generate a descent direction. This is associated with
Hessian matrices that are not positive definite. A singular matrix produces New-
ton steps that are unbounded, while Newton steps with ascent directions lead to an
39
increase in the objective function. These arise from Hessian matrices with negative
curvature.
3. The starting point is not sufficiently close to the solution. For general unconstrained
problems, this property is the hardest to check. While estimates of regions of attraction
for Newton’s method have been developed in [113], they are not easy to apply when
the solution, and its relation to the initial point, is unknown.
These three challenges raise some important questions on how to develop reliable and
efficient optimization algorithms, based on Newton’s method. This chapter deals with these
questions in the following way:
1. In the application of Newton’s method throughout this book, we will focus only on
problems with smooth functions. Nevertheless, there is a rich literature on optimiza-
tion with nonsmooth functions. These include development of nonsmooth Newton
methods [97] and the growing field of nonsmooth optimization algorithms.
2. In Section 3.2, we describe a number of ways to modify the Hessian matrix to ensure
that the modified matrix at iteration k, B k , has a bounded condition number and
remains positive definite. This is followed by Section 3.3 that develops the concept of
quasi-Newton methods, which do not require the calculation of the Hessian matrix.
Instead, a symmetric, positive definite B k matrix is constructed from differences of
the objective function gradient at successive iterations.
3. To avoid the problem of finding a good starting point, globalization strategies are
required that ensure sufficient decrease of the objective function at each step and
lead the algorithm to converge to locally optimal solutions, even from distant starting
points. In Section 3.4, this global convergence property will be effected by line search
methods that are simple modifications of Algorithm 2.1 and require that B k be posi-
tive definite with bounded condition numbers. Moreover, these positive definiteness
assumptions can be relaxed if we apply trust region methods instead. These strategies
are developed and analyzed in Section 3.5.
Algorithm 3.1.
Choose a starting point x⁰ and tolerances ε₁, ε₂ > 0.
The modified Hessian, B^k, satisfies v^T (B^k)^{-1} v > ε‖v‖² for all vectors v ≠ 0 and for
some ε > 0. The step p^k determined from B^k then leads to the descent property

    ∇f(x^k)^T p^k = −∇f(x^k)^T (B^k)^{-1} ∇f(x^k) < 0 for ∇f(x^k) ≠ 0.

A straightforward modification shifts the eigenvalues of ∇²f(x^k) = V^k Λ^k V^{k,T} by writing

    B^k = ∇²f(x^k) + E^k = V^k Λ^k V^{k,T} + δI = V^k (Λ^k + δI) V^{k,T},

where δ ≥ 0 is chosen large enough that B^k is positive definite.
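This eigenvalue shift is straightforward to implement. A sketch, where the indefinite test matrix and the margin of 10⁻³ are illustrative choices:

```python
import numpy as np

def shift_hessian(H, margin=1e-3):
    """Return B = V (Lambda + delta*I) V^T with delta chosen so that the
    smallest eigenvalue of B is at least `margin`; B = H when H is
    already sufficiently positive definite."""
    lam, V = np.linalg.eigh(H)            # H = V diag(lam) V^T
    delta = max(0.0, margin - lam.min())  # shift only if needed
    return V @ np.diag(lam + delta) @ V.T

# Indefinite example: the eigenvalues of H are 3 and -1.
H = np.array([[1.0, 2.0], [2.0, 1.0]])
B = shift_hessian(H)
print(np.linalg.eigvalsh(B))  # smallest eigenvalue is now at the margin
```

Note that shifting all eigenvalues by the same δ preserves the eigenvectors of the Hessian while guaranteeing descent directions.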
Theorem 3.1 [294] Assume that f(x) is three times differentiable and that Algorithm 3.1
converges to a point that is a strict local minimum. Then x^k converges at a superlinear
rate, i.e.,

    lim_{k→∞} ‖x^k + p^k − x^*‖ / ‖x^k − x^*‖ = 0    (3.6)

if and only if

    lim_{k→∞} ‖(B^k − ∇²f(x^k)) p^k‖ / ‖p^k‖ = 0.    (3.7)
In Section 3.5, we will see that such a judicious modification of the Hessian can be
performed together with a globalization strategy. In particular, we will consider a systematic
strategy for the Levenberg–Marquardt step that is tied to the trust region method.
B k+1 s = y. (3.9)
If f (x) is a quadratic function, the secant relation is exactly satisfied when B k is the Hessian
matrix. Also, from Taylor’s theorem (2.22), we see that (3.9) can provide a reasonable
approximation to the curvature of f (x) along the direction s. Therefore we consider this
relation to motivate a formula to describe B k . Finally, because ∇ 2 f (x) is a symmetric
matrix, we also want B k to be symmetric as well.
The simplest way to develop an update formula for B^k is to postulate the rank-one
update B^{k+1} = B^k + ww^T. Applying (3.9) to this update (see Exercise 1) leads to

    B^{k+1} = B^k + (y − B^k s)(y − B^k s)^T / ((y − B^k s)^T s),    (3.10)

which is the symmetric rank one (SR1) update formula. The SR1 update asymptotically
converges to the (positive definite) Hessian of the objective function as long as the steps s are
linearly independent. On the other hand, the update for B^{k+1} can be adversely affected by
regions of negative or zero curvature and can become ill-conditioned, singular, or unbounded
in norm. In particular, care must be taken so that the denominator in (3.10) is bounded away
from zero, e.g., |(y − B^k s)^T s| ≥ C₁‖s‖² for some C₁ > 0. So, while this update can work
well, it is not guaranteed to be positive definite and may not lead to descent directions.
Instead, we also consider a rank-two quasi-Newton update formula that allows B^k to
remain symmetric and positive definite as well. To do this, we define the current Hessian
approximation as B^k = JJ^T, where J is a square, nonsingular matrix. Note that this defi-
nition implies that B^k is positive definite. To preserve symmetry, the update to B^k can be
given as B^{k+1} = J^+ (J^+)^T, where J^+ is also expected to remain square and nonsingular.
By working with the matrices J and J^+, it will also be easier to monitor the symmetry and
positive definiteness properties of B^k.
Using the matrix J^+, the secant relation (3.9) can be split into two parts. From

    B^{k+1} s = J^+ (J^+)^T s = y,    (3.11)

we introduce an intermediate vector v and write

    J^+ v = y and (J^+)^T s = v.    (3.12)
The derived update satisfies the secant relation and symmetry. In order to develop a unique
update formula, we also assume the update has the least change in some norm. Here we
obtain an update formula by invoking a least change strategy for J^+, leading to

    min  ‖J^+ − J‖_F    (3.13)
    s.t. J^+ v = y,    (3.14)

where ‖J‖_F is the Frobenius norm of matrix J. Solving (3.13)–(3.14) (see Exercise 8 in Chapter 4)
leads to the so-called Broyden update used to solve nonlinear equations:

    J^+ = J + (y − Jv) v^T / (v^T v).    (3.15)
Using (3.15) we can recover an update formula in terms of s, y, and B^k by using the
following identities for v. From (3.12), we have v^T v = (y^T (J^+)^{-T})(J^+)^T s = s^T y. Also,
postmultiplying (J^+)^T by s and using (3.15) leads to

    v = (J^+)^T s = J^T s + v (y − Jv)^T s / (v^T v)    (3.16)
      = J^T s + v − (v^T J^T s / (s^T y)) v,    (3.17)
and therefore v = (s^T y / (v^T J^T s)) J^T s. Premultiplying v by s^T J and simplifying the expression leads to

    v = (s^T y / (s^T B^k s))^{1/2} J^T s.
Finally, from the definition of v, B^k, and B^{k+1} as well as (3.15), we have

    B^{k+1} = ( J + (y − Jv)v^T/(v^T v) ) ( J + (y − Jv)v^T/(v^T v) )^T
            = JJ^T + (yy^T − Jv v^T J^T)/(v^T v)
            = B^k + yy^T/(s^T y) − Jv v^T J^T/(v^T v)
            = B^k + yy^T/(s^T y) − B^k s s^T B^k/(s^T B^k s).    (3.18)
From this derivation, we have assumed B k to be a symmetric matrix, and therefore B k+1
remains symmetric, as seen from (3.18). Moreover, it can be shown that if B k is positive
definite and s T y > 0, then the update, B k+1 , is also positive definite. In fact, the condition
that s T y be sufficiently positive at each iteration, i.e.,
    s^T y ≥ C₂‖s‖² for some C₂ > 0,    (3.19)
is important in order to maintain a bounded update. As a result, condition (3.19) is checked at
each iteration, and if it cannot be satisfied, the update (3.18) is skipped. Another alternative
to skipping is known as Powell damping [313]. As described in Exercise 2, this approach
maintains positive definiteness when (3.19) fails by redefining y := θy + (1 − θ )B k s for a
calculated θ ∈ [0, 1].
The update formula (3.18) is known as the Broyden–Fletcher–Goldfarb–Shanno
(BFGS) update, and the derivation above is due to Dennis and Schnabel [110]. As a re-
sult of this updating formula, we have a reasonable approximation to the Hessian matrix
that is also positive definite.
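The BFGS formula (3.18) together with the curvature safeguard (3.19) can be sketched as follows; the constant C₂ and the quadratic test problem are illustrative choices:

```python
import numpy as np

def bfgs_update(B, s, y, c2=1e-8):
    """Apply the BFGS update B+ = B + y y^T/(s^T y) - B s s^T B/(s^T B s).
    The update is skipped when s^T y is not sufficiently positive,
    mirroring the safeguard (3.19)."""
    sy = s @ y
    if sy < c2 * (s @ s):        # curvature condition fails: skip the update
        return B
    Bs = B @ s
    return B + np.outer(y, y) / sy - np.outer(Bs, Bs) / (s @ Bs)

# On a quadratic f with Hessian A we have y = A s, and the updated
# matrix must satisfy the secant relation B+ s = y exactly.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.eye(2)
s = np.array([1.0, 2.0])
y = A @ s
B_new = bfgs_update(B, s, y)
print(np.allclose(B_new @ s, y))  # True: secant relation holds
```

The secant check follows directly from (3.18): the yy^T term contributes y, and the B s s^T B term exactly cancels B s.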
Moreover, the BFGS update has a fast rate of convergence as summarized by the
following property.
Theorem 3.2 [294] If the BFGS algorithm converges to a strict local solution x^* with
Σ_{k=0}^{∞} ‖x^k − x^*‖ < ∞, and the Hessian ∇²f(x) is Lipschitz continuous at x^*, then (3.7)
holds and x^k converges at a superlinear rate.
Finally, while the BFGS update can be applied directly in step 2 of Algorithm 3.1,
calculation of the search direction p k can be made more efficient by implementing the
quasi-Newton update through the following options.
• Solution of the linear system can be performed with a Cholesky factorization of
B^k = L^k (L^k)^T. On the other hand, L^k can be updated directly by applying formula
(3.15) with J := L^k and v = (s^T y / (s^T L^k (L^k)^T s))^{1/2} (L^k)^T s, i.e.,

    J^+ = L^k + y s^T L^k / ((s^T y)^{1/2} (s^T L^k (L^k)^T s)^{1/2}) − L^k (L^k)^T s s^T L^k / (s^T L^k (L^k)^T s),    (3.20)
where

    u^k = B^k s^k / ((s^k)^T B^k s^k)^{1/2},    v^k = y^k / ((y^k)^T s^k)^{1/2}.    (3.23)

For large-scale problems (say, n > 1000), it is advantageous to store only the last m
updates and to develop the so-called limited memory update:

    B^{k+1} = B^0 + Σ_{i=max(0,k−m+1)}^{k} ( v^i (v^i)^T − u^i (u^i)^T ).    (3.24)
In this way, only the most recent updates are used for B k , and the older ones are
discarded. While the limited memory update has only a linear convergence rate, it
greatly reduces the linear algebra cost for large problems. Moreover, by storing only
the updates, one can work directly with matrix-vector products instead of B k , i.e.,
    B^{k+1} w = B^0 w + Σ_{i=max(0,k−m+1)}^{k} ( v^i ((v^i)^T w) − u^i ((u^i)^T w) ).    (3.25)
Similar updates have been developed for H k as well. Moreover, Byrd and Nocedal
[83] discovered a particularly efficient compact form of this update as follows:
    B^{k+1} = B^0 − [ B^0 S_k   Y_k ]  [ S_k^T B^0 S_k   L_k ]^{-1}  [ S_k^T B^0 ]
                                       [ L_k^T          −D_k ]       [ Y_k^T    ],    (3.26)

where D_k = diag[(s^{k−m+1})^T (y^{k−m+1}), …, (s^k)^T y^k], S_k = [s^{k−m+1}, …, s^k], Y_k =
[y^{k−m+1}, …, y^k], and

    (L_k)_{i,j} = (s^{k−m+i})^T (y^{k−m+j}) for i > j, and 0 otherwise.
The compact limited memory form (3.26) is more efficient to apply than the unrolled
form (3.25), particularly when m is large and B 0 is initialized as a diagonal matrix.
Similar compact representations have been developed for the inverse BFGS update
H k as well as the SR1 update.
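The matrix-free product (3.25) only needs the stored pairs (u^i, v^i). A sketch under the assumption that the pairs are kept in Python lists and B⁰ is diagonal (the random test data are illustrative):

```python
import numpy as np

def limited_memory_matvec(B0_diag, us, vs, w):
    """Compute B w = B0 w + sum_i ( v_i (v_i^T w) - u_i (u_i^T w) )
    without ever forming the n-by-n matrix B (unrolled form (3.25))."""
    out = B0_diag * w                    # B0 stored as a diagonal
    for u, v in zip(us, vs):
        out += v * (v @ w) - u * (u @ w)
    return out

# Consistency check against the dense form (3.24) on random data.
rng = np.random.default_rng(0)
n, m = 5, 3
B0_diag = np.ones(n)
us = [rng.standard_normal(n) for _ in range(m)]
vs = [rng.standard_normal(n) for _ in range(m)]
w = rng.standard_normal(n)
B_dense = np.diag(B0_diag) + sum(np.outer(v, v) - np.outer(u, u)
                                 for u, v in zip(us, vs))
print(np.allclose(limited_memory_matvec(B0_diag, us, vs, w), B_dense @ w))  # True
```

Each product costs O(mn) work and O(mn) storage, versus O(n²) for the dense matrix, which is the point of the limited memory form.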
Figure 3.1. Example that shows cycling of the basic Newton method.
Note that the Hessian is always positive definite (f (z) is a convex function) and a unique
optimum exists at z∗ = 0. However, as seen in Figure 3.1, applying Algorithm 3.1 with a
starting point z0 = 10 leads the algorithm to cycle indefinitely between 10 and −10.
To avoid cycling, convergence requires a sufficient decrease of the objective function.
If we choose the step p k = −(B k )−1 ∇f (x k ) with B k positive definite and bounded in
condition number, we can modify the selection of the next iterate using a positive step
length α with
x k+1 = x k + αpk . (3.27)
Using Taylor’s theorem, one can show for sufficiently small α that
On the other hand, restriction to only small α leads to inefficient algorithms. Instead, a
systematic line search method is needed that allows larger steps to be taken with a sufficient
decrease of f (x). The line search method therefore consists of three tasks:
1. At iteration k, start with a sufficiently large value for α. While a number of methods
can be applied to determine this initial step length [294, 134], for the purpose of
developing this method, we choose an initial α set to one.
Sufficient decrease of f (x) can be seen from the last term in (3.30). Clearly a value of
α can be chosen that reduces f (x). As the iterations proceed, one would expect decreases
in f (x) to taper off as ∇f (x) → 0. On the other hand, convergence would be impeded if
α k → 0 and the algorithm stalls.
An obvious line search option is to perform a single variable optimization in α, i.e.,
minα f (x k + αpk ), and to choose α k := α ∗ , as seen in Figure 3.2. However, this option is
expensive. And far away from the solution, it is not clear that this additional effort would
reduce the number of iterations to converge to x ∗ . Instead, we consider three popular criteria
to determine sufficient decrease during the line search. To illustrate the criteria for sufficient
decrease, we consider the plot of f (x k + αpk ) shown in Figure 3.2.
All of these criteria require the following decrease in the objective function:
f (x k + α k p k ) ≤ f (x k ) + ηα k ∇f (x k )T p k , (3.31)
where η ∈ (0, ½]. This is also known as the Armijo condition. As seen in Figure 3.2, α ∈ (0, α_a]
satisfies this condition. The following additional conditions are also required so that the
chosen value of α is not too short:
• The Wolfe conditions require that (3.31) be satisfied as well as
∇f (x k + α k p k )T p k ≥ ζ ∇f (x k )T p k (3.32)
for ζ ∈ (η, 1). From Figure 3.2, we see that α ∈ [αw , αa ] satisfies these conditions.
• The strong Wolfe conditions are more restrictive and require satisfaction of

    |∇f(x^k + α^k p^k)^T p^k| ≤ ζ |∇f(x^k)^T p^k|    (3.33)

for ζ ∈ (η, 1). From Figure 3.2, we see that α ∈ [α_w, α_sw] satisfies these conditions.
• The Goldstein or Goldstein–Armijo conditions require that (3.31) be satisfied as well as
f (x k + α k p k ) ≥ f (x k ) + (1 − η)α k ∇f (x k )T p k . (3.34)
From Figure 3.2, we see that α ∈ [αg , αa ] satisfies these conditions. (Note that the
relative locations of αg and αw change if (1 − η) > ζ .)
The Wolfe and strong Wolfe conditions lead to methods that have desirable conver-
gence properties that are analyzed in [294, 110]. The Goldstein conditions are similar but
do not require evaluation of the directional derivatives ∇f (x k + α k p k )T p k during the line
search. Moreover, in using a backtracking line search, where α is reduced if (3.31) fails, the
Goldstein condition (3.34) is easier to check.
We now consider a global convergence proof for the Goldstein conditions. Based on
the result by Zoutendijk [422], a corresponding proof is also given for the Wolfe conditions
in [294].
Theorem 3.3 (Global Convergence of Line Search Method). Consider an iteration: x k+1 =
x k + α k p k , where pk = −(B k )−1 ∇f (x k ) and α k satisfies the Goldstein–Armijo conditions
(3.31), (3.34). Suppose that f (x) is bounded below for x ∈ Rn , that f (x) is continuously
differentiable, and that ∇f (x) is Lipschitz continuous in an open set containing the level
set {x|f (x) ≤ f (x 0 )}. Then by defining the angle between p k and −∇f (x k ) as
    cos θ^k = −∇f(x^k)^T p^k / (‖∇f(x^k)‖ ‖p^k‖),    (3.35)
Since θ^k is the angle at x^k between the search direction p^k and the steepest descent
direction −∇f(x^k), this theorem leads to the result that either ‖∇f(x^k)‖ approaches zero
or ∇f(x^k) and p^k become orthogonal to each other. However, in the case where we
have a positive definite B^k with a bounded condition number κ(B^k), we have

    cos θ^k = |∇f(x^k)^T p^k| / (‖∇f(x^k)‖ ‖p^k‖)
            = ∇f(x^k)^T (B^k)^{-1} ∇f(x^k) / (‖∇f(x^k)‖ ‖(B^k)^{-1} ∇f(x^k)‖)
            ≥ 1/κ(B^k)
With this result, we now state the basic Newton-type algorithm for unconstrained
optimization with a backtracking line search.
Algorithm 3.2.
Choose a starting point x⁰ and tolerances ε₁, ε₂ > 0.
3. Set α = 1.
The value of ρ can be chosen in a number of ways. It can be a fixed fraction (e.g., ρ = ½),
or it can be determined by minimizing a quadratic (see Exercise 4) or cubic interpolant based
on previous line search information. In addition, if α < 1, then the Goldstein condition (3.34)
should be checked to ensure that α is not too short.
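The backtracking step of Algorithm 3.2 can be sketched as follows; the reduction factor ρ = ½, the Armijo constant η = 10⁻⁴, and the quadratic test function are illustrative choices:

```python
import numpy as np

def backtracking(f, grad_fx, x, p, eta=1e-4, rho=0.5, alpha0=1.0, max_back=50):
    """Reduce alpha by the factor rho until the Armijo condition (3.31),
    f(x + alpha p) <= f(x) + eta * alpha * grad_f(x)^T p, is satisfied."""
    fx = f(x)
    slope = grad_fx @ p            # must be negative for a descent direction
    alpha = alpha0
    for _ in range(max_back):
        if f(x + alpha * p) <= fx + eta * alpha * slope:
            return alpha
        alpha *= rho
    return alpha

# Steepest-descent step on f(x) = x1^2 + 10 x2^2 from x = [1, 1].
f = lambda x: x[0]**2 + 10.0 * x[1]**2
x = np.array([1.0, 1.0])
g = np.array([2.0, 20.0])          # gradient of f at x
alpha = backtracking(f, g, x, -g)
print(alpha, f(x + alpha * (-g)) < f(x))  # the accepted step decreases f
```

Because the full step α = 1 overshoots badly along the steep x₂ axis here, several halvings are needed before (3.31) holds.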
Finally, in addition to global convergence in Theorem 3.3 we would like Algorithm 3.2
to perform well, especially in the neighborhood of the optimum. The following theorem is
useful for this feature.
Theorem 3.4 Assume that f(x) is three times differentiable, that Algorithm 3.2 converges to a point x^∗ that is a strict local minimum, and that

lim_{k→∞} ‖(B^k − ∇^2 f(x^k)) p^k‖ / ‖p^k‖ = 0

is satisfied. Then there exists a finite k_0 such that α^k = 1 is admissible for all k > k_0, and x^k converges superlinearly to x^∗.
By Taylor's theorem, with some t ∈ (0, 1) and using B^k p^k = −∇f(x^k),

f(x^{k+1}) = f(x^k) + ∇f(x^k)^T p^k + (1/2) p^{k,T} ∇^2 f(x^k + t p^k) p^k
  = f(x^k) + ∇f(x^k)^T p^k + (1/2) p^{k,T} (∇^2 f(x^k + t p^k) − B^k) p^k + (1/2) p^{k,T} B^k p^k
  = f(x^k) + (1/2) ∇f(x^k)^T p^k + (1/2) p^{k,T} (∇^2 f(x^k + t p^k) − B^k) p^k
  = f(x^k) + η ∇f(x^k)^T p^k + (1/2 − η) ∇f(x^k)^T p^k + (1/2) p^{k,T} (∇^2 f(x^k + t p^k) − B^k) p^k
  ≤ f(x^k) + η ∇f(x^k)^T p^k − (1/2 − η) |∇f(x^k)^T p^k| + o(‖p^k‖^2).
From the proof of Theorem 3.3 we know that α^k is bounded away from zero, and because lim_{k→∞} ‖x^k − x^{k+1}‖ = 0, we have ‖p^k‖ → 0. Also, because x^∗ is a strict local optimum, we have from Taylor's theorem that |∇f(x^k)^T p^k| > ε ‖p^k‖^2 for some ε > 0 and k sufficiently large. Consequently, for η < 1/2 there exists a k_0 such that

|(1/2) p^{k,T} (∇^2 f(x^k + t p^k) − B^k) p^k| < (1/2 − η) |∇f(x^k)^T p^k|,

leading to

f(x^{k+1}) ≤ f(x^k) + η ∇f(x^k)^T p^k,

which satisfies the Armijo condition for α = 1. Superlinear convergence then follows from this result and Theorem 3.1.
and with a^T = [0.3, 0.6, 0.2], b^T = [5, 26, 3], and c^T = [40, 1, 10]. The minimizer occurs at x^∗ = [0.73950, 0.31436] with f(x^∗) = −5.08926.² As seen from Figure 2.5, this problem
has only a small region around the solution where the Hessian is positive definite. As a result,
we saw in Example 2.21 that Newton’s method has difficulty converging from a starting
point far away from the solution. To deal with this issue, let’s consider the line search
algorithm with BFGS updates, starting with an initial B 0 = I and from a starting point close
to the solution. Applying Algorithm 3.2 to this problem, with a termination tolerance of ‖∇f(x)‖ ≤ 10^{−6}, leads to the iteration sequence given in Table 3.1. Here it is clear that
the solution can be found very quickly, although it requires more iterations than Newton’s
method (see Table 2.3). Also, as predicted by Theorem 3.4, the algorithm chooses step sizes
with α^k = 1 as convergence proceeds in the neighborhood of the solution. Moreover, based on the error ‖∇f(x^k)‖ we observe superlinear convergence rates, as predicted by Theorem 3.1. Finally, from Figure 2.5 we again note that the convergence path remains in the region
where the Hessian is positive definite.
If we now choose a starting point farther away from the minimum, then applying Algorithm 3.2, with a termination tolerance of ‖∇f(x)‖ ≤ 10^{−6}, leads to the iteration
sequence given in Table 3.2. At this starting point, the Hessian is indefinite, and, as seen
in Example 2.21, Newton’s method was unable to converge to the minimum. On the other
hand, the line search method with BFGS updates converges relatively quickly to the optimal
solution. The first three iterations show that large search directions are generated, but the
² To prevent undefined objective and gradient functions, the square root terms are replaced by f(ξ) = (max(10^{−6}, ξ))^{1/2}. While this approximation violates the smoothness assumptions for these methods, it affects only large search steps, which are immediately reduced by the line search in the early stages of the algorithm. Hence first and second derivatives are never evaluated at these points.
Table 3.1. Iteration sequence with BFGS line search method with starting point
close to solution.
Iteration (k)  x1^k  x2^k  f(x^k)  ‖∇f(x^k)‖  α^k
0 0.8000 0.3000 −5.0000 3.0000 0.0131
1 0.7606 0.3000 −5.0629 3.5620 0.0043
2 0.7408 0.3159 −5.0884 0.8139 1.000
3 0.7391 0.3144 −5.0892 2.2624 × 10−2 1.0000
4 0.7394 0.3143 −5.0892 9.7404 × 10−4 1.0000
5 0.7395 0.3143 −5.0892 1.5950 × 10−5 1.0000
6 0.7395 0.3143 −5.0892 1.3592 × 10−7 —
Table 3.2. Iteration sequence with BFGS line search method with starting point far from solution.

Iteration (k)  x1^k  x2^k  f(x^k)  ‖∇f(x^k)‖  α^k
0 1.0000 0.5000 −1.1226 9.5731 0.0215
1 0.9637 0.2974 −3.7288 12.9460 0.0149∗
2 0.8101 0.2476 −4.4641 19.3569 0.0140
3 0.6587 0.3344 −4.9359 4.8700 1.0000
4 0.7398 0.3250 −5.0665 4.3311 1.0000
5 0.7425 0.3137 −5.0890 0.1779 1.0000
6 0.7393 0.3144 −5.0892 8.8269 × 10−3 1.0000
7 0.7395 0.3143 −5.0892 1.2805 × 10−4 1.0000
8 0.7395 0.3143 −5.0892 3.1141 × 10−6 1.0000
9 0.7395 0.3143 −5.0892 5.4122 × 10−12 —
*BFGS update was reinitialized to I .
line search leads to very small step sizes. In fact, the first BFGS update generates a poor
descent direction, and the matrix B k had to be reinitialized to I . Nevertheless, the algorithm
continues and takes full steps after the third iteration. Once this occurs, we see that the
method converges superlinearly toward the optimum.
The example demonstrates that the initial difficulties that occurred with Newton’s
method are overcome by line search methods as long as positive definite Hessian approx-
imations are applied. Also, the performance of the line search method on this example
confirms the convergence properties shown in this section. Note, however, that these prop-
erties apply only to convergence to stationary points and do not guarantee convergence to
points that also satisfy second order conditions.
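For reference, the BFGS update used in these runs can be sketched as below (a minimal version; the curvature safeguard here stands in for the reinitialization to I noted in Table 3.2, and the formula is the standard BFGS update rather than a transcription from this chapter):

```python
import numpy as np

def bfgs_update(B, s, y, tol=1e-8):
    """Standard BFGS update of the Hessian approximation B from the step
    s = x^{k+1} - x^k and gradient change y = grad f(x^{k+1}) - grad f(x^k).
    The update is skipped when the curvature s^T y is not safely positive,
    which keeps B positive definite."""
    sy = s @ y
    if sy <= tol * np.linalg.norm(s) * np.linalg.norm(y):
        return B                      # skip the update (or reinitialize to I)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy
```

Each accepted update adds curvature information along the most recent step while preserving positive definiteness, which is what allows the line search method to succeed where pure Newton steps failed.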
calculated search direction also changes as a function of the step length. This added flexibility
leads to methods that have convergence properties superior to line search methods. On the
other hand, the computational expense for each trust region iteration may be greater than
for line search iterations.
We begin the discussion of this method by defining a trust region for the optimization step, e.g., ‖p‖ ≤ Δ, and a model function m_k(p) that is expected to provide a good approximation to f(x^k + p) within a trust region of size Δ. Any norm can be used to characterize the trust region, although the Euclidean norm is often used for unconstrained optimization. Also, a quadratic model is often chosen for m_k(p), and the optimization step at iteration k is determined by the following optimization problem:

min  m_k(p) = ∇f(x^k)^T p + (1/2) p^T B^k p        (3.49)
s.t.  ‖p‖ ≤ Δ,
where B k = ∇ 2 f (x k ) or its quasi-Newton approximation. The basic trust region algorithm
can be given as follows.
Algorithm 3.3.
Given parameters Δ̄ > 0, Δ^0 ∈ (0, Δ̄], 0 < κ1 < κ2 < 1, γ ∈ (0, 1/4), and tolerances ε1, ε2 > 0.
Choose a starting point x^0.
Typical values of κ1 and κ2 are 1/4 and 3/4, respectively. Algorithm 3.3 lends itself to a
number of variations that will be outlined in the remainder of this section. In particular, if
second derivatives are available, then problem (3.49) is exact up to second order, although
one may need to deal with an indefinite Hessian and nonconvex model problem. On the
other hand, if second derivatives are not used, one may instead use an approximation for
B k such as the BFGS update. Here the model problem is convex, but without second order
information it is less accurate.
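The acceptance/adjustment logic at the heart of Algorithm 3.3 can be sketched as follows (a schematic single step; the shrink/expand factors are common illustrative choices, not the text's):

```python
import numpy as np

def trust_region_step(f, x, p, model_decrease, delta, delta_max,
                      gamma=0.1, kappa1=0.25, kappa2=0.75):
    """One acceptance/adjustment step of a basic trust region method.
    rho compares the actual reduction f(x) - f(x + p) against the reduction
    model_decrease = m(0) - m(p) > 0 predicted by the model problem (3.49)."""
    rho = (f(x) - f(x + p)) / model_decrease
    if rho < gamma:                       # poor agreement: reject and shrink
        return x, delta / 4.0
    if rho > kappa2 and abs(np.linalg.norm(p) - delta) < 1e-12:
        return x + p, min(2.0 * delta, delta_max)   # step hit the bound: expand
    if rho < kappa1:                      # accept, but shrink the region
        return x + p, delta / 4.0
    return x + p, delta                   # accept and keep the region
```

When the quadratic model predicts the actual decrease well (ρ near 1) and the step is restricted by the bound, the region grows so that larger, Newton-like steps become possible.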
Levenberg–Marquardt Steps
Levenberg–Marquardt steps were discussed in Section 3.2 as a way to correct Hessian
matrices that were not positive definite. For trust region methods, the application of these
steps is further motivated by the following property.
Theorem 3.6 Consider the model problem given by (3.49). The solution is given by p^k if and only if there exists a scalar λ ≥ 0 such that the following conditions are satisfied:

(B^k + λI) p^k = −∇f(x^k),        (3.50)
λ (Δ − ‖p^k‖) = 0,
(B^k + λI) is positive semidefinite.

Note that when λ = 0, we have the same Newton-type step as with a line search method, p^N = −(B^k)^{−1} ∇f(x^k).
On the other hand, as λ becomes large,

p(λ) = −(B^k + λI)^{−1} ∇f(x^k)

approaches a small step in the steepest descent direction, p^S ≈ −(1/λ) ∇f(x^k). As Δ_k is adjusted in Algorithm 3.3, and if Δ_k < ‖p^N‖, then one can find a suitable value of λ by solving the equation ‖p(λ)‖ − Δ_k = 0. With this equation, however, ‖p(λ)‖ depends nonlinearly on λ, thus leading to difficult convergence with an iterative method. The alternate form

1/‖p(λ)‖ − 1/Δ_k = 0,        (3.51)

suggested in [281, 294], is therefore preferred because it is nearly linear in λ. As a result, with an iterative solver, such as Newton's method, a suitable λ is often found for (3.51) in just 2–3 iterations. Figure 3.4 shows how the Levenberg–Marquardt step changes with increasing values of Δ. Note that for λ = 0 we have the Newton step p^N. Once λ increases, the step takes on a length given by the value of Δ from (3.51). The steps p(λ) then decrease in size with Δ (and increasing λ) and trace out an arc shown by the dotted lines in the figure. Finally, as Δ vanishes, p(λ) points to the steepest descent direction.
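A Newton iteration on (3.51) can be sketched as follows (assuming B + λI remains positive definite along the way; the Cholesky-based derivative formula is the standard one for this root-finding problem):

```python
import numpy as np

def lm_step(B, g, delta, lam=0.0, tol=1e-10, max_iter=30):
    """Find lam >= 0 with ||p(lam)|| = delta for p(lam) = -(B + lam I)^{-1} g,
    using Newton's method on phi(lam) = 1/||p(lam)|| - 1/delta, eq. (3.51)."""
    n = len(g)
    p = np.zeros(n)
    for _ in range(max_iter):
        R = np.linalg.cholesky(B + lam * np.eye(n))   # B + lam I = R R^T
        p = -np.linalg.solve(R.T, np.linalg.solve(R, g))
        pnorm = np.linalg.norm(p)
        phi = 1.0 / pnorm - 1.0 / delta
        if abs(phi) < tol:
            break
        q = np.linalg.solve(R, p)       # ||q||^2 = p^T (B + lam I)^{-1} p
        dphi = (q @ q) / pnorm**3       # phi'(lam); phi is nearly linear in lam
        lam = max(lam - phi / dphi, 0.0)
    return p, lam
```

For B = I, g = (2, 0), and Δ = 1, a single Newton correction already lands on λ = 1 with ‖p(λ)‖ = Δ, reflecting the near-linearity of (3.51).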
Figure 3.4. Levenberg–Marquardt (dotted lines) and Powell dogleg (dashed lines) steps for different trust regions.
In the first case, the Cauchy step can be derived by inserting a trial solution p = −τ ∇f(x^k) into the model problem (3.49) and solving for τ with a large value of Δ. Otherwise, if B^k is not positive definite, then the Cauchy step is taken to the trust region bound and has length Δ.
For the dogleg method, we assume that B^k is positive definite and we adopt the Cauchy step for the first case. We also have a well-defined Newton step given by p^N = −(B^k)^{−1} ∇f(x^k). As a result, the solution to (3.49) is given by the following cases:
• p^k = p^N if Δ ≥ ‖p^N‖;
• p^k = (Δ/‖p^C‖) p^C if Δ ≤ ‖p^C‖;
• otherwise, p^k = p^C + η (p^N − p^C), with η ∈ (0, 1) chosen so that ‖p^k‖ = Δ.
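These cases can be sketched as follows (a minimal dogleg step, assuming B positive definite as stated above; the quadratic for the boundary crossing is the standard one):

```python
import numpy as np

def dogleg_step(B, g, delta):
    """Dogleg approximation to the trust region problem (3.49) for positive
    definite B: the Newton step if it fits, a truncated Cauchy step if even
    that leaves the region, and otherwise the point where the Cauchy-to-Newton
    segment crosses the trust region boundary."""
    p_n = -np.linalg.solve(B, g)                   # Newton step
    if np.linalg.norm(p_n) <= delta:
        return p_n
    p_c = -(g @ g) / (g @ B @ g) * g               # unconstrained Cauchy step
    pc_norm = np.linalg.norm(p_c)
    if pc_norm >= delta:
        return (delta / pc_norm) * p_c             # truncated Cauchy step
    # find eta in (0, 1) with ||p_c + eta (p_n - p_c)|| = delta
    d = p_n - p_c
    a, b, c = d @ d, 2.0 * (p_c @ d), pc_norm**2 - delta**2
    eta = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p_c + eta * d
```

The returned step always lies within the trust region and never falls short of the Cauchy decrease, which is what the model reduction property (3.53) below requires.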
Both of these methods provide search directions that address the model problem (3.49).
It is also important to note that while the Levenberg–Marquardt steps provide an exact
solution to the model problem, the dogleg method solves (3.49) only approximately. In both
cases, the following property holds.
Theorem 3.7 [294] Assume that the model problem (3.49) is solved using Levenberg–Marquardt or dogleg steps with ‖p^k‖ ≤ Δ_k; then for some c_1 > 0,

m_k(0) − m_k(p) ≥ c_1 ‖∇f(x^k)‖ min( Δ_k , ‖∇f(x^k)‖ / ‖B^k‖ ).        (3.53)
The relation (3.53) can be seen as an analogue to the descent property in line search methods
as it relates improvement in the model function back to ∇f (x). With this condition, one
can show convergence properties similar to (and under weaker conditions than) line search
methods [294, 100]. These are summarized by the following theorems.
Theorem 3.8 [294] Let γ ∈ (0, 1/4), ‖B^k‖ ≤ β < ∞, and let f(x) be Lipschitz continuously differentiable and bounded below on a level set {x | f(x) ≤ f(x^0)}. Also, in solving (3.49) (approximately), assume that p^k satisfies (3.53) and that ‖p^k‖ ≤ c_2 Δ_k for some constant c_2 ≥ 1. Then the algorithm generates a sequence of points with lim_{k→∞} ‖∇f(x^k)‖ = 0.
The step acceptance condition f(x^k) − f(x^k + p^k) ≥ γ (m_k(0) − m_k(p^k)) > 0 with γ > 0, and the Lipschitz continuity of ∇f, can also be relaxed, but with a weakening of the above property.
Theorem 3.9 [294] Let γ = 0, ‖B^k‖ ≤ β < ∞, and let f(x) be continuously differentiable and bounded below on a level set {x | f(x) ≤ f(x^0)}. Let p^k satisfy (3.53) and ‖p^k‖ ≤ c_2 Δ_k for some constant c_2 ≥ 1. Then the algorithm generates a sequence of points with

lim inf_{k→∞} ‖∇f(x^k)‖ = 0.
This lim inf property states that without a strict step acceptance criterion (i.e., with γ = 0), ‖∇f(x^k)‖ has a limit point at zero but need not converge as a whole sequence; only a subsequence, indexed by k′, has ‖∇f(x^{k′})‖ converging to zero. A more detailed description of this property can be found in [294].
Finally, note that Theorems 3.8 and 3.9 deal with convergence to stationary points that may not be local optima if f(x) is nonconvex. With B^k forced to be positive definite, the dogleg approach may converge to a point where ∇^2 f(x^∗) is not positive definite. Similarly, the Levenberg–Marquardt method may also converge to such a point if λ remains positive. The stronger property of convergence to a local minimum requires consideration of second order conditions, as well as a more general approach with B^k = ∇^2 f(x^k) for the nonconvex model problem (3.49).
To adjust p(λ) to satisfy ‖p(λ)‖ = Δ_k, we see from (3.54) that we can make ‖p(λ)‖ small by increasing λ. Also, if ∇^2 f(x^k) is positive definite and λ = 0 satisfies the acceptance criterion, we can recover the Newton step, and fast convergence is assured.
In the case of negative or zero curvature, we have an eigenvalue λ_{i∗} ≤ 0 for a particular index i∗. As long as v_{i∗}^T ∇f(x^k) ≠ 0, we can still make ‖p(λ)‖ large by letting λ approach |λ_{i∗}|. Thus, p(λ) could be adjusted so that its length matches Δ_k. However, if v_{i∗}^T ∇f(x^k) = 0, then no positive value of λ can be found which increases the length ‖p(λ)‖ to Δ_k. This case is undesirable, as it precludes significant improvement of f(x) along the direction of negative curvature v_{i∗}, and it could lead to premature termination with large values of λ and very small values of Δ.
This phenomenon is called the hard case [282, 100, 294]. For this case, an additional
term is needed that includes a direction of negative curvature, z. Here the corresponding
eigenvector for a negative eigenvalue, vi∗ , is an ideal choice. Because it is orthogonal both
to the gradient vector and to all of the other eigenvectors, it can independently exploit the
negative curvature direction up to the trust region boundary with a step given by

p^k = −(∇^2 f(x^k) + λI)^{−1} ∇f(x^k) + τ z,        (3.55)

where τ is chosen so that ‖p^k‖ = Δ_k. Finding the appropriate eigenvector requires an
eigenvalue decomposition of ∇^2 f(x^k) and is only suitable for small problems. Nevertheless, the approach based on (3.55) can be applied in a systematic way to find the global minimum of the trust region problem (3.49). More details of this method can be found in [282, 100].
For large-scale problems, there is an inexact way to solve (3.49) based on the truncated Newton method. Here we attempt to solve the linear system B^k p^k = −∇f(x^k) with the method of conjugate gradients. However, if B^k is indefinite, this method “fails” by generating
a large step, which can be shown to be a direction of negative curvature and can be used
directly in (3.55). The truncated Newton algorithm for the inexact solution of (3.49) is given
as follows.
Algorithm 3.4.
Given parameters ε > 0, set p_0 = 0, r_0 = ∇f(x^k), and d_0 = −r_0. If ‖r_0‖ ≤ ε, stop with p = p_0.
For j ≥ 0:
• If d_j^T B^k d_j ≤ 0 (d_j is a direction of negative curvature), find τ so that p = p_j + τ d_j minimizes m(p) with ‖p‖ = Δ_k, and return with the solution p^k = p.
• Else, set α_j = r_j^T r_j / d_j^T B^k d_j and p_{j+1} = p_j + α_j d_j.
The conjugate gradient (CG) steps generated to solve (3.49) for a particular Δ can be seen in Figure 3.5. Note that the first step taken by the method is exactly the Cauchy step. The subsequently generated CG steps increase in size until they exceed the trust region or converge to the Newton step inside the trust region.
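Algorithm 3.4 can be sketched as follows (a minimal Steihaug-type version; the boundary-crossing test and residual update between the two bullets above are filled in from the standard truncated CG method):

```python
import numpy as np

def steihaug_cg(B, g, delta, tol=1e-10, max_iter=100):
    """Truncated (Steihaug) CG for min g^T p + 0.5 p^T B p s.t. ||p|| <= delta.
    On negative curvature, or when an iterate crosses the boundary, the step
    is extended along the current direction to the trust region boundary."""
    p = np.zeros_like(g)
    r, d = g.copy(), -g.copy()
    if np.linalg.norm(r) <= tol:
        return p
    for _ in range(max_iter):
        dBd = d @ B @ d
        if dBd <= 0.0:                       # negative curvature: go to boundary
            return p + _to_boundary(p, d, delta) * d
        alpha = (r @ r) / dBd
        p_next = p + alpha * d
        if np.linalg.norm(p_next) >= delta:  # leaving the region: stop at boundary
            return p + _to_boundary(p, d, delta) * d
        r_next = r + alpha * (B @ d)
        if np.linalg.norm(r_next) <= tol:
            return p_next
        beta = (r_next @ r_next) / (r @ r)
        p, r, d = p_next, r_next, -r_next + beta * d
    return p

def _to_boundary(p, d, delta):
    """Positive tau with ||p + tau d|| = delta."""
    a, b, c = d @ d, 2.0 * (p @ d), p @ p - delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
```

Because the first CG direction is the steepest descent direction, the returned step never does worse than the Cauchy step, as required for the convergence results above.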
Figure 3.5. Solution steps for model problem generated by the truncated Newton
method: truncated step (left), Newton step (right).
While treatment of negative curvature requires more expensive trust region algorithms, the resulting convergence properties are stronger than those obtained from solving convex model problems. In particular, these nonconvex methods can find limit points that are truly local minima, not just stationary points, as stated by the following property.
Theorem 3.10 [294] Let p^∗ be the exact solution of the model problem (3.49), γ ∈ (0, 1/4), and B^k = ∇^2 f(x^k). Also, let the approximate solution for Algorithm 3.3 satisfy ‖p^k‖ ≤ c_2 Δ_k and

lim_{k→∞} ‖∇f(x^k)‖ = 0.

Also, if the level set {x | f(x) ≤ f(x^0)} is closed and bounded, then either the algorithm terminates at a point that satisfies second order necessary conditions, or there is a limit point x^∗ that satisfies second order necessary conditions.
Finally, as with line search methods, global convergence alone does not guarantee efficient algorithms. To ensure fast convergence, we would like to take pure Newton steps, at least in the neighborhood of the optimum. For trust region methods, this requires that the trust region not shrink to zero upon convergence, i.e., lim_{k→∞} Δ_k ≥ δ > 0 for some δ. This property is stated by the following theorem.
Theorem 3.11 [294] Let f(x) be twice Lipschitz continuously differentiable, and suppose that the sequence {x^k} converges to a point x^∗ that satisfies second order sufficient conditions. Also, suppose that for sufficiently large k, problem (3.49) is solved asymptotically exactly, with B^k → ∇^2 f(x^∗) and with at least the same reduction as a Cauchy step. Then the trust region bound becomes inactive for k sufficiently large.
Example 3.12 To evaluate trust region methods with exact second derivatives, we again
consider the problem described in Example 3.5. This problem has only a small region
Table 3.3. Iteration sequence with trust region (TR) method and exact Hessian with starting point close to solution. NFE_k denotes the number of function evaluations needed to adjust the trust region in iteration k.

TR Iteration (k)  x1^k  x2^k  f(x^k)  ‖∇f(x^k)‖  NFE_k
0 0.8000 0.3000 −5.0000 3.0000 3
1 0.7423 0.3115 −5.0882 0.8163 1
2 0.7396 0.3143 −5.0892 6.8524 × 10−3 1
3 0.73950 0.31436 −5.08926 2.6845 × 10−6 —
Table 3.4. Iteration sequence with trust region (TR) method and exact Hessian with starting point far from solution. NFE_k denotes the number of function evaluations to adjust the trust region in iteration k.

TR Iteration (k)  x1^k  x2^k  f(x^k)  ‖∇f(x^k)‖  NFE_k
0 1.0000 0.5000 −1.1226 9.5731 11
1 0.9233 0.2634 −4.1492 11.1073 1
2 0.7621 0.3093 −5.0769 1.2263 1
3 0.7397 0.3140 −5.0892 0.1246 1
4 0.7395 0.3143 −5.0892 1.0470 × 10−4 1
5 0.73950 0.31436 −5.08926 5.8435 × 10−10 —
around the solution where the Hessian is positive definite, and a method that takes full
Newton steps has difficulty converging from starting points far away from the solution.
Here we apply the trust region algorithm of Gay [156]. This algorithm is based on the
exact trust region algorithm [282] described above. More details of this method can also be
found in [154, 110]. A termination tolerance of ‖∇f(x)‖ ≤ 10^{−6} is chosen, and the initial trust region size is determined from the initial Cauchy step ‖p^C‖. Choosing a starting point
close to the solution generates the iteration sequence given in Table 3.3. Here it is clear
that the solution can be found very quickly, with only three trust region iterations. The first
trust region step requires three function evaluations to determine a proper trust region size.
After this, pure Newton steps are taken, the trust region becomes inactive, as predicted by
Theorem 3.11, and we see quadratic convergence to the optimum solution.
If we choose a starting point farther away from the minimum, then the trust region algorithm, with a termination tolerance of ‖∇f(x)‖ ≤ 10^{−6}, generates the iteration sequence given in Table 3.4. Here only five trust region iterations are required, but the initial trust
region requires 11 function evaluations to determine a proper size. After this, pure Newton
steps are taken, the trust region becomes inactive, as predicted by Theorem 3.11, and we
can observe quadratic convergence to the optimum solution.
This example demonstrates that the initial difficulties that occur with Newton’s method
are overcome by trust region methods that still use exact Hessians, even if they are indefinite.
Also, the performance of the trust region method on this example confirms the trust region
convergence properties described in this section.
3.7 Exercises
1. Using an SR1 update, derive the following formula:

B^{k+1} = B^k + (y − B^k s)(y − B^k s)^T / ((y − B^k s)^T s).
max θ
s.t. θs T y + (1 − θ )s T B k s ≥ 0.2s T B k s,
0 ≤ θ ≤ 1.
min ‖W^+ − W‖_F
s.t. W^+ v = s, (W^+)^T y = v.        (3.56)

Using the inverse update formula (3.21), derive the DFP update for B^k.
7. Download and apply the L-BFGS method to Example 3.5. How does the method
perform as a function of the number of updates?
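For Exercise 7, the core of L-BFGS — the two-loop recursion that applies the limited-memory inverse Hessian to a gradient — can be sketched as follows (a minimal version; the length of the (s, y) history is the "number of updates" the exercise varies):

```python
import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    """L-BFGS two-loop recursion: apply the inverse Hessian approximation,
    built from the stored (s, y) pairs (oldest first), to the gradient g,
    and return the search direction p = -H g."""
    q = g.copy()
    alphas = []
    for s, y in reversed(list(zip(s_hist, y_hist))):   # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_hist:                                         # scaled initial H0
        s, y = s_hist[-1], y_hist[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return -q
```

With an empty history the recursion reduces to steepest descent, and each stored pair refines the curvature estimate, so performance as a function of the history length can be studied directly.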
Chapter 4
Concepts of Constrained
Optimization
This chapter develops the underlying properties for constrained optimization. It describes
concepts of feasible regions and reviews convexity conditions to allow visualization of
constrained solutions. Karush–Kuhn–Tucker (KKT) conditions are then presented from
two viewpoints. First, an intuitive, geometric interpretation is given to aid understanding of
the conditions. Once presented, the KKT conditions are analyzed more rigorously. Several
examples are provided to illustrate these conditions along with the role of multipliers and
constraint qualifications. Finally, two special cases of nonlinear programming, linear and
quadratic, are briefly described, and a case study on portfolio optimization is presented to
illustrate their characteristics.
4.1 Introduction
The previous two chapters dealt with unconstrained optimization, where the solution was not
restricted in Rn . This chapter now considers the influence of a constrained feasible region
in the characterization of optimization problems. For continuous variable optimization, we
consider the general NLP problem given by
for all α ∈ (0, 1). Strict convexity requires that inequality (4.2) be strict. Similarly, a region Y is convex if and only if α x^a + (1 − α) x^b ∈ Y for all points x^a, x^b ∈ Y and all α ∈ (0, 1).
Theorem 4.2 If g(x) is convex and h(x) is linear, then the region F = {x|g(x) ≤ 0, h(x) = 0}
is convex, i.e., αx a + (1 − α)x b ∈ F for all α ∈ (0, 1) and x a , x b ∈ F .
Proof: From the definition of convex regions, we consider two points x^a, x^b ∈ F and assume there exists a point x̄ = α x^a + (1 − α) x^b ∉ F for some α ∈ (0, 1). Since x̄ ∉ F, we must have g(x̄) > 0 or h(x̄) ≠ 0. In the former case, we have by feasibility of x^a and x^b and convexity of g(x),

0 ≥ α g(x^a) + (1 − α) g(x^b) ≥ g(α x^a + (1 − α) x^b) = g(x̄) > 0,
which is a contradiction. In the second case, we have from feasibility of x^a and x^b and linearity of h(x),

0 = α h(x^a) + (1 − α) h(x^b) = h(α x^a + (1 − α) x^b) = h(x̄) ≠ 0,

which again is a contradiction. Since neither case can hold, the statement of the theorem must hold.
Examples of convex optimization problems are pictured in Figures 4.1 and 4.2. As with unconstrained problems in Chapter 2, we now characterize solutions of convex constrained problems with the following theorem.
Theorem 4.3 If f (x) is convex and F is convex, then every local minimum in F is a
global minimum. If f (x) is strictly convex in F , then a local minimum is the unique global
minimum.
Proof: For the first statement, we assume that there are two local minima x a , x b ∈ F with
f (x a ) > f (x b ) and establish a contradiction. Here we have f (x a ) ≤ f (x), x ∈ N (x a ) ∩ F
and f (x b ) ≤ f (x), x ∈ N (x b ) ∩ F . By convexity, (1 − α)x a + αx b ∈ F and
can be observed at vertex points, on a constraint boundary, or in the interior of the feasible
region.
Finally, Figure 4.3 shows two contours for nonconvex problems where multiple local
solutions are observed. The left plot shows these local solutions as a result of a nonconvex
feasible region, even though the objective function is convex, while the right plot has a
convex feasible region but shows multiple solutions due to a nonconvex objective function.
In analogy to Chapter 2, we can characterize, calculate, and verify local solutions
through well-defined optimality conditions. Without convexity, however, there are generally
no well-defined optimality conditions that guarantee global solutions, and a much more
expensive enumeration of the search space is required. In this text, we will focus on finding
local solutions to (4.1). A comprehensive treatment of global optimization algorithms and
their properties can be found in [144, 203, 379].
To motivate the additional complexities of constrained optimization problems, we con-
sider a small geometric example, where the optimal solution is determined by the constraints.
Example 4.4 We consider three circles of different sizes, as shown in Figure 4.4, and seek
the smallest perimeter of a box that encloses these circles.
As decision variables we choose the dimensions of the box, A, B, and the coordinates
for the centers of the three circles, (x1 , y1 ), (x2 , y2 ), (x3 , y3 ). As specified parameters we have
the radii R1 , R2 , R3 . For this problem we minimize the perimeter 2(A + B) and include as
constraints the fact that the circles need to remain in the box and cannot overlap with each
other. As a result we formulate the following nonlinear program:
The conditions that characterize the stationary ball in Figure 4.7 can be stated more pre-
cisely by the following Karush–Kuhn–Tucker (KKT) necessary conditions for constrained
optimality. They comprise the following elements:
• Feasibility: Both inequality and equality constraints must be satisfied; i.e., the ball must lie on the rail and within the fences.

g(x^∗)^T u^∗ = 0,  u^∗ ≥ 0.        (4.11)
• Constraint Qualification: The stationarity conditions for a local optimum are based on gradient information from the constraint and objective functions. Because this linearized information is used to characterize the optimum of a nonlinear problem, an
additional regularity condition is required on the constraints to ensure that gradients
are sufficient for the local characterization of the feasible region at x ∗ . This concept
will be covered in more detail in the next section. A typical condition is that the active
constraint gradients at x ∗ be linearly independent, i.e., the matrix made up of columns
of ∇h(x ∗ ) and ∇gi (x ∗ ) with i ∈ {i|gi (x ∗ ) = 0} is full column rank.
• Second Order Conditions: In Figure 4.6 we find that the gradients (i.e., force balance)
determine how the ball rolls to x ∗ . On the other hand, we also want to consider
the curvature condition, especially along the active constraint. Thus, when the ball
is perturbed from x ∗ in a feasible direction, it needs to “roll back” to x ∗ . As with
unconstrained optimization, nonnegative (positive) curvature is necessary (sufficient)
for all of the feasible, i.e., constrained, nonzero directions, p. As will be discussed
in the next section, these directions determine the following necessary second order
conditions:
p^T ∇_{xx} L(x^∗, u^∗, v^∗) p ≥ 0        (4.12)
for all p ≠ 0 with ∇h(x^∗)^T p = 0,
∇g_i(x^∗)^T p = 0, i ∈ {i | g_i(x^∗) = 0, u_i^∗ > 0},
∇g_i(x^∗)^T p ≤ 0, i ∈ {i | g_i(x^∗) = 0, u_i^∗ = 0}.
The corresponding sufficient conditions require that the first inequality in (4.12) be
strict. Note that for Figure 4.5 the allowable directions p span the entire space. In
Figure 4.6 these directions are tangent to the inequality constraint at x ∗ . Finally, for
Figure 4.7, the constraints uniquely define x ∗ and there are no nonzero directions p
that satisfy (4.12).
To close this section we illustrate the KKT conditions with an example drawn from
Figures 4.5 and 4.7. An example drawn from Figure 4.6 will be considered in the next
section.
Example 4.5 Consider the following unconstrained NLP:

min  x1^2 − 4 x1 + (3/2) x2^2 − 7 x2 + x1 x2 + 9 − ln(x1) − ln(x2).        (4.13)

This function corresponds to the contour plot in Figure 4.5. The optimal solution can be found by solving for the first order conditions (4.8):

∇f(x^∗) = [ 2x1 − 4 + x2 − 1/x1 ;  3x2 − 7 + x1 − 1/x2 ] = 0,  leading to  x^∗ = [1.3475, 2.0470]^T.        (4.14)
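The reported stationary point can be checked numerically; the sketch below simply transcribes the gradient in (4.14):

```python
import numpy as np

def grad_f(x):
    # Gradient of the objective in (4.13)
    x1, x2 = x
    return np.array([2 * x1 - 4 + x2 - 1 / x1,
                     3 * x2 - 7 + x1 - 1 / x2])

x_star = np.array([1.3475, 2.0470])
residual = np.linalg.norm(grad_f(x_star))
assert residual < 1e-3   # stationary to the number of digits reported
```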
which corresponds to the plot in Figure 4.7. The optimal solution can be found by applying the first order KKT conditions (4.9)–(4.10):

∇f(x) + u ∇g(x) + v ∇h(x) = [ 2x1 − 4 + x2 − 1/x1 − x2 u + 2v ;  3x2 − 7 + x1 − 1/x2 − x1 u − v ] = 0,        (4.18)
g(x) = 4 − x1 x2 ≤ 0,  h(x) = 2x1 − x2 = 0,        (4.19)
u ≥ 0,  u (4 − x1 x2) = 0        (4.20)
⇓        (4.21)
x^∗ = [1.4142, 2.8284]^T,  u^∗ = 1.068,  v^∗ = 1.0355,        (4.22)
However, because this matrix is nonsingular, there are no nonzero vectors, p, that satisfy
the allowable directions. Hence, the sufficient second order conditions (4.12) are vacuously
satisfied for this problem.
A more precise definition of the active constraints can be made by considering the
multipliers in (4.9). The role of the multipliers can be understood by perturbing the right-
hand side of a constraint in (4.1). Here we consider a particular inequality g_î(x) + ε ≤ 0 with î ∈ A(x^∗) and consider the solution of the perturbed problem, x_ε. For small values of ε, one can show that this leads to

f(x_ε) − f(x^∗) ≈ ∇f(x^∗)^T (x_ε − x^∗)
  = − Σ_{i∈E} v_i^∗ ∇h_i(x^∗)^T (x_ε − x^∗) − Σ_{i∈I} u_i^∗ ∇g_i(x^∗)^T (x_ε − x^∗)
  ≈ − Σ_{i∈E} v_i^∗ (h_i(x_ε) − h_i(x^∗)) − Σ_{i∈I} u_i^∗ (g_i(x_ε) − g_i(x^∗))
  = ε u_î^∗.

Dividing the above expression by ε and taking limits as ε goes to zero leads to df(x_ε)/dε = u_î^∗.
Thus we see that the multipliers provide the sensitivity of the optimal objective function
value to perturbations in the constraints.
Note that we have not stated any particular properties of the multipliers. In the ab-
sence of additional conditions, the multipliers may be nonunique, or may not even exist
at an optimal solution. Nevertheless, with values of the multipliers, u∗ , v ∗ , that satisfy
(4.9) at the solution to (4.1), we can further refine the definition of the active set. We
define a strongly active set, As (x ∗ ) = {i|u∗i > 0, gi (x ∗ ) = 0}, and a weakly active set,
Aw (x ∗ ) = {i|u∗i = 0, gi (x ∗ ) = 0}. This also allows us to state the following definition.
Definition 4.6 Given a local solution of (4.1), x ∗ , along with multipliers u∗ , v ∗ that satisfy
(4.9) and (4.11), the strict complementarity condition is given by u∗i > 0 for all i ∈ A(x ∗ )∩I.
This condition holds if and only if Aw (x ∗ ) is empty.
min  x^2  s.t.  1/2 − x ≤ 0,  x − 1 ≤ 0.

The corresponding KKT conditions are

2x + u_U − u_L = 0,
u_U (x − 1) = 0,  u_L (1/2 − x) = 0,
1/2 ≤ x ≤ 1,  u_U, u_L ≥ 0.
These conditions are satisfied by x^∗ = 1/2, u_U^∗ = 0, and u_L^∗ = 1. Thus the lower bound on x is the only active constraint at the solution. This constraint is strongly active, and strict complementarity is satisfied. Moreover, if we increase the lower bound by ε, the optimal value of the objective function is f(x_ε) = 1/4 + ε + ε^2, and df(x_ε)/dε = 1 = u_L^∗ as ε → 0.
If we now modify the example to min x^2, 0 ≤ x ≤ 1, with corresponding KKT conditions given by

2x + u_U − u_L = 0,
u_U (1 − x) = 0,  u_L x = 0,
0 ≤ x ≤ 1,  u_U, u_L ≥ 0,

we have the solution x^∗ = 0, u_U^∗ = 0, and u_L^∗ = 0. Here the lower bound on x is only weakly active at the solution and strict complementarity does not hold. Here, df(x_ε)/dε = 0 = u_L^∗ and, as verified by the multipliers, we see that reducing the lower bound on x does not change f(x^∗), while increasing the lower bound on x by ε increases the optimal objective function value only to ε^2.
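The sensitivity interpretation of the multipliers in these two bound-constrained problems can be checked by finite differences (a small sketch; minimizing x^2 over an interval simply projects 0 onto it):

```python
import numpy as np

def f_min_on_interval(lo, hi):
    """Minimum of f(x) = x^2 over [lo, hi], attained at the projection of 0."""
    return float(np.clip(0.0, lo, hi) ** 2)

eps = 1e-6
# First problem: bounds [1/2, 1]; strongly active lower bound, u_L* = 1
sens1 = (f_min_on_interval(0.5 + eps, 1.0) - f_min_on_interval(0.5, 1.0)) / eps
# Second problem: bounds [0, 1]; weakly active lower bound, u_L* = 0
sens2 = (f_min_on_interval(0.0 + eps, 1.0) - f_min_on_interval(0.0, 1.0)) / eps
```

Here sens1 ≈ 1 and sens2 ≈ 0, matching the multipliers u_L^∗ = 1 and u_L^∗ = 0 from the two KKT systems.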
Limiting Directions
To deal with nonlinear objective and constraint functions we need to consider the concept
of feasible sequences and their limiting directions. For a given feasible point, x̄, we can
consider a sequence of points {x k }, with each point indexed by k, that satisfies
• x^k ≠ x̄,
• lim_{k→∞} x^k = x̄.
Associated with each feasible sequence is a limiting direction, d, of unit length defined by

lim_{k→∞} (x^k − x̄) / ‖x^k − x̄‖ = d.        (4.26)
In analogy with the kinematic interpretation in Section 4.2, one can interpret this sequence
to be points on the “path of a rolling ball” that terminates at or passes through x̄. The limiting
direction is tangent to this path in the opposite direction of the rolling ball.
We now consider all possible feasible sequences leading to a feasible point and the
limiting directions associated with them. In particular, if the objective function f (x k ) in-
creases monotonically for any feasible sequence to x̄, then x̄ cannot be optimal. Using the
concept of limiting directions, this property can be stated in terms of directional derivatives
as follows.
Theorem 4.8 [294] If x ∗ is a solution of (4.1), then all feasible sequences leading to x ∗
must satisfy
∇f (x ∗ )T d ≥ 0, (4.27)
where d is any limiting direction of a feasible sequence.
Proof: We will assume that there exists a feasible sequence {x k } that has a limiting direction d̂
with ∇f (x ∗ )T d̂ < 0 and establish a contradiction. From Taylor’s theorem, and the definition
of a limiting direction (4.26), we have for sufficiently large k,
    f(xk) = f(x∗) + ∇f(x∗)T (xk − x∗) + o(‖xk − x∗‖)   (4.28)
          = f(x∗) + ∇f(x∗)T d̂ ‖xk − x∗‖ + o(‖xk − x∗‖)   (4.29)
          ≤ f(x∗) + (1/2) ∇f(x∗)T d̂ ‖xk − x∗‖   (4.30)
          < f(x∗),   (4.31)

which contradicts the assumption that x∗ is a solution of (4.1), and the result follows.

We now consider the special case of (4.1) where the constraints are linear, i.e.,

    min f(x)   s.t.   Bx ≤ b,   Cx = c,   (4.32)
with vectors b, c and matrices B, C. For these constraints the limiting directions of all
feasible sequences to a point x̄ can be represented as a system of linear equations and
inequalities known as a cone [294]. This cone coincides exactly with all of the feasible
directions in the neighborhood of x̄ and these are derived from the constraints in (4.32).
Vectors in this cone will cover limiting directions from all possible feasible sequences
for k sufficiently large. As a result, for problem (4.32) we can change the statement of
Theorem 4.8 to the following.
Theorem 4.9 If x∗ is a solution of (4.32) with active constraints that correspond to appropriate rows of B and elements of b, i.e., Bi x∗ = bi, i ∈ A(x∗), then

    ∇f(x∗)T d ≥ 0   (4.33)

for all d in the cone defined by Bi d ≤ 0, i ∈ A(x∗), and Cd = 0.

The link between this theorem and the KKT conditions is provided by the following theorem of the alternative.

Lemma 4.10 (Motzkin's Theorem of the Alternative). Either the system

    aT z > 0,   Az ≥ 0,   Dz = 0   (4.34)

has a solution z, or the system

    a y1 + AT y2 + DT y3 = 0,   (4.35)
    y1 > 0,   y2 ≥ 0   (4.36)

has a solution (y1, y2, y3), but never both.
Theorem 4.11 Suppose that x∗ is a local solution of (4.32); then

• the KKT conditions (4.37) are satisfied at x∗;

• if f(x) is convex and the KKT conditions (4.37) are satisfied at a point x+, then x+
is the global solution to (4.32).
Proof: The linear constraints (4.37b) are satisfied as x ∗ is the solution of (4.32). The comple-
mentarity conditions (4.37c) are equivalent to the conditions u∗i ≥ 0, Bi x ∗ = bi , i ∈ A(x ∗ ),
and u∗i = 0 for i ∈ I \ A(x ∗ ). By Theorem 4.9, there is no solution d to the linear system
However, an additional condition is needed to ensure that limiting directions at the solution
can be represented by the following cone conditions derived from linearization of the active
constraints:
∇hi (x ∗ )T d = 0, ∇gi (x ∗ )T d ≤ 0, i ∈ A(x ∗ ). (4.43)
To motivate this point, consider the nonlinear program given by
min x1 (4.44)
s.t. x2 ≤ x13 ,
−x13 ≤ x2
and shown in Figure 4.8. The solution of (4.44) is x1∗ = x2∗ = 0; along any feasible sequence
to the origin the objective function decreases monotonically, and the conditions of Theorem 4.8 are satisfied.
On the other hand, it is easy to show that the KKT conditions are not satisfied at this point.
Moreover, linearization of the active constraints at x ∗ leads to only two directions, d1 =
[1, 0]T and d2 = [−1, 0]T , as seen in the figure. However, only d1 is a limiting direction of a
feasible sequence. Hence, the essential link between Theorem 4.8 and the KKT conditions
(4.42) is broken.
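The failure of the KKT conditions at the origin of (4.44) can be verified directly: both active constraint gradients vanish in their first component there, so stationarity ∇f(x∗) + u1 ∇g1(x∗) + u2 ∇g2(x∗) = 0 is unsolvable for any multipliers. A short numerical sketch (the array names are illustrative):

```python
import numpy as np

# At x* = (0, 0): g1(x) = x2 - x1**3 <= 0 and g2(x) = -x1**3 - x2 <= 0 are active.
x = np.zeros(2)
grad_f = np.array([1.0, 0.0])                 # f(x) = x1
grad_g1 = np.array([-3 * x[0] ** 2, 1.0])     # = [0, 1] at the origin
grad_g2 = np.array([-3 * x[0] ** 2, -1.0])    # = [0, -1] at the origin

# Best least-squares attempt at grad_f + u1*grad_g1 + u2*grad_g2 = 0:
A = np.column_stack([grad_g1, grad_g2])
u, *_ = np.linalg.lstsq(A, -grad_f, rcond=None)
residual = grad_f + A @ u
print(np.linalg.norm(residual))   # 1.0: the component 1 + 0*u1 + 0*u2 never vanishes
```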
To ensure this link, we need an additional constraint qualification that shows that
limiting directions of active nonlinear constraints can be represented by their linearizations
at the solutions. That is, unlike the constraints in Figure 4.8, the active constraints should
not be “too nonlinear.” A frequently used condition is the linear independence constraint
qualification (LICQ) given by the following definition.
Definition 4.12 (LICQ). Given a local solution of (4.1), x ∗ , and an active set A(x ∗ ), LICQ
is defined by linear independence of the constraint gradients
∇gi (x ∗ ), ∇hi (x ∗ ), i ∈ A(x ∗ ).
From Figure 4.8 it is easy to see that (4.44) does not satisfy the LICQ at the solution.
To make explicit the need for a constraint qualification, we use the following technical
lemma; the proof of this property is given in [294].
Lemma 4.13 [294, Lemma 12.2]. The set of limiting directions from all feasible sequences
is a subset of the cone (4.43). Moreover, if LICQ holds for (4.43), then,
• the cone (4.43) is equivalent to the set of limiting directions for all feasible sequences,
• for limiting directions d that satisfy (4.43) with ‖d‖ = 1, a feasible sequence {xk}
(with xk = x∗ + tk d + o(tk)) can always be constructed that satisfies

    hi(xk) = tk ∇hi(x∗)T d = 0,
    gi(xk) = tk ∇gi(x∗)T d ≤ 0,   i ∈ A(x∗),

for some small positive tk with limk→∞ tk = 0.
This lemma allows us to extend Theorem 4.11 to nonlinearly constrained problems.
Combining the above results, we now prove the main results of this section.
Theorem 4.14 Suppose that x ∗ is a local solution of (4.1) and the LICQ holds at this
solution; then
• the KKT conditions (4.42) are satisfied;
• if f (x) and g(x) are convex, h(x) is linear and the KKT conditions (4.42) are satisfied
at a point x + , then x + is the global solution to (4.1).
Proof: The proof follows along similar lines as in Theorem 4.11. The constraints (4.42b)
are satisfied as x ∗ is the solution of (4.1). The complementarity conditions (4.42c) are
equivalent to the conditions u∗i ≥ 0, gi (x ∗ ) = 0, i ∈ A(x ∗ ), and u∗i = 0 for i ∈ I \ A(x ∗ ). By
Theorem 4.8, the LICQ, and Lemma 4.13, there is no solution d to the linear system
∇f (x ∗ )T d < 0, ∇h(x ∗ )T d = 0, ∇gi (x ∗ )T d ≤ 0, i ∈ A(x ∗ ). (4.45)
Hence, from Lemma 4.10 and by setting a to −∇f (x ∗ ), D to ∇h(x ∗ )T , and A to the rows
of −∇gi (x ∗ ), i ∈ A(x ∗ ), we have (4.42a) and u∗ ≥ 0.
For the second part of the theorem, we know from Lemma 4.10 that (4.42a) implies
that there is no solution d to the linear system
∇f (x + )T d < 0, ∇h(x + )T d = 0, ∇gi (x + )T d ≤ 0, i ∈ A(x + ). (4.46)
Assume now that x + is not a solution to (4.1) and there is another feasible point x ∗ with
f (x ∗ ) < f (x + ). We define d = (x ∗ − x + ) and note that, by linearity of h(x) and convexity
of g(x), we have ∇h(x + )T d = 0 and
0 ≥ gi (x ∗ ) ≥ gi (x + ) + ∇gi (x + )T d = ∇gi (x + )T d, i ∈ A(x + ).
By convexity of f (x), ∇ 2 f (x) is positive semidefinite for all x, and by Taylor’s theorem,
    0 > f(x∗) − f(x+) = ∇f(x+)T d + (1/2) ∫_0^1 dT ∇²f(x+ + td) d dt   (4.47)
                      ≥ ∇f(x+)T d,   (4.48)
which contradicts the statement that there is no solution to (4.46). Therefore, x+ must be a local
solution to (4.1), and because the objective function and feasible region are both convex,
x+ is also a global solution by Theorem 4.3.
The assumption of the LICQ leads to an important additional property on the multi-
pliers, given by the following theorem.
Theorem 4.15 (LICQ and Multipliers). Given a point, x ∗ , that satisfies the KKT conditions,
along with an active set A(x ∗ ) with multipliers u∗ , v ∗ , if LICQ holds at x ∗ , then the
multipliers are unique.
Proof: We define the vector of multipliers λ∗ = [u∗T , v ∗T ]T and the matrix A(x ∗ ) made
up of the columns of active constraint gradients ∇gi (x ∗ ), ∇hi (x ∗ ), i ∈ A(x ∗ ), and write
(4.42a) as
∇f (x ∗ ) + A(x ∗ )λ∗ = 0. (4.49)
Now assume that there are two multiplier vectors, λa and λb , that satisfy (4.42) and (4.49).
Substituting both multiplier vectors into (4.49), subtracting the two equations from each
other, and premultiplying the difference by A(x∗)T leads to

    A(x∗)T A(x∗)(λa − λb) = 0.

Since the LICQ holds, A(x∗)T A(x∗) is nonsingular and we must have λa = λb, which proves
the result.
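Under LICQ this argument also gives a practical recipe: the multipliers are the unique solution of the normal equations A(x∗)T A(x∗) λ∗ = −A(x∗)T ∇f(x∗). The sketch below applies this to a hypothetical problem, min x1 + x2 subject to x1² + x2² − 2 = 0, at its local solution x∗ = (−1, −1):

```python
import numpy as np

# Multiplier recovery under LICQ via the normal equations of (4.49).
grad_f = np.array([1.0, 1.0])        # gradient of f(x) = x1 + x2
A = np.array([[-2.0], [-2.0]])       # single column: grad h(x*) at x* = (-1, -1)

lam = np.linalg.solve(A.T @ A, -A.T @ grad_f)
print(lam)                                 # [0.5], the unique multiplier
print(np.linalg.norm(grad_f + A @ lam))    # 0.0: stationarity (4.49) holds
```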
Finally, it should be noted that while the LICQ is commonly assumed, there are
several other constraint qualifications that link Theorem 4.8 to the KKT conditions (4.42)
(see [294, 273]). In particular, the weaker Mangasarian–Fromovitz constraint qualification
(MFCQ) has the following definition.
Definition 4.16 (MFCQ). Given a local solution of (4.1), x ∗ , and an active set A(x ∗ ),
the MFCQ is defined by linear independence of the equality constraint gradients and the
existence of a search direction d such that

    ∇gi(x∗)T d < 0, i ∈ A(x∗) ∩ I,   ∇h(x∗)T d = 0.

The MFCQ is always satisfied if the LICQ is satisfied. Also, satisfaction of the MFCQ leads
to bounded multipliers, u∗ , v ∗ , although they are not necessarily unique.
Theorem 4.17 (Second Order Necessary Conditions). Suppose that x ∗ is a local solution
of (4.1), LICQ holds at this solution, and u∗ , v ∗ are the multipliers that satisfy the KKT
conditions (4.42). Then

    dT ∇xx L(x∗, u∗, v∗) d ≥ 0   for all limiting directions d ∈ C1(x∗).   (4.50)

Proof: Suppose instead that there is a limiting direction d ∈ C1(x∗) with dT ∇xx L(x∗, u∗, v∗) d < 0. Since LICQ holds, from Lemma 4.13 we can construct a feasible sequence {xk} (with xk = x∗ + tk d + o(tk)) that satisfies

    hi(xk) = tk ∇hi(x∗)T d = 0,
    gi(xk) = tk ∇gi(x∗)T d ≤ 0,   i ∈ A(x∗),

for a sequence of scalars tk > 0 with limk→∞ tk = 0. For this feasible sequence we then have
the result
    f(xk) = L(xk, u∗, v∗)   (4.52)
          = L(x∗, u∗, v∗) + tk ∇L(x∗, u∗, v∗)T d   (4.53)
            + ((tk)²/2) dT ∇xx L(x∗, u∗, v∗) d + o((tk)²)   (4.54)
          = f(x∗) + ((tk)²/2) dT ∇xx L(x∗, u∗, v∗) d + o((tk)²)   (4.55)
          < f(x∗) + ((tk)²/4) dT ∇xx L(x∗, u∗, v∗) d < f(x∗)   (4.56)
for k sufficiently large. This contradicts the assumption that x ∗ is a solution and that there
is no increasing feasible sequence to the solution. Hence the result (4.50) must hold.
Theorem 4.18 (Second Order Sufficient Conditions). Suppose that x∗ and the multipliers
u∗, v∗ satisfy the KKT conditions (4.42) and

    dT ∇xx L(x∗, u∗, v∗) d > 0   for all nonzero d ∈ C2(x∗, u∗).   (4.57)

Then x∗ is a strict local solution of (4.1).

Proof: To show that x∗ is a strict local minimum, every feasible sequence must have
f (x k ) > f (x ∗ ) for all k sufficiently large. From Lemma 4.13, C1 (x ∗ ) contains the limiting
directions for all feasible sequences. We now consider any feasible sequence associated
with these limiting directions and consider the following two cases:
• d ∈ C2(x∗, u∗) ⊆ C1(x∗): For any feasible sequence with limiting direction d ∈
C2(x∗, u∗), ‖d‖ = 1, we have h(xk)T v∗ = 0, g(xk)T u∗ ≤ 0, and from (4.42) and (4.26),

    f(xk) ≥ f(xk) + h(xk)T v∗ + g(xk)T u∗
          = L(xk, u∗, v∗)
          = L(x∗, u∗, v∗) + (1/2)(xk − x∗)T ∇xx L(x∗, u∗, v∗)(xk − x∗) + o(‖xk − x∗‖²)
          = f(x∗) + (1/2) dT ∇xx L(x∗, u∗, v∗) d ‖xk − x∗‖² + o(‖xk − x∗‖²)
          ≥ f(x∗) + (1/4) dT ∇xx L(x∗, u∗, v∗) d ‖xk − x∗‖² > f(x∗),

where the higher order term can be absorbed by (1/4) dT ∇xx L(x∗, u∗, v∗) d ‖xk − x∗‖²
for k sufficiently large, and the result follows for the first case.
• d ∈ C1(x∗) \ C2(x∗, u∗): Unlike the first case, dT ∇xx L(x∗, u∗, v∗) d may not be positive
for d ∈ C1(x∗) \ C2(x∗, u∗), so a different approach is needed. Since d ∉ C2(x∗, u∗),
there is at least one constraint with ∇gī(x∗)T d < 0, ī ∈ As(x∗). As a result,

    L(xk, u∗, v∗) = f(x∗) + O(‖xk − x∗‖²).
The right-hand side can be absorbed into −(1/2) ∇gī(x∗)T d ‖xk − x∗‖ u∗ī for all k sufficiently large, leading to

    f(xk) − f(x∗) ≥ −(1/2) ∇gī(x∗)T d ‖xk − x∗‖ u∗ī > 0,
which gives the desired result for the second case.
– Necessary second order condition: (Q3N)T ∇xx L(x∗, u∗, v∗) Q3N is positive semidefinite.
– Sufficient second order condition: (Q3N)T ∇xx L(x∗, u∗, v∗) Q3N is positive definite.
To illustrate these properties, we revisit the remaining case related to Example 4.5.
Example 4.20 Consider the constrained NLP
    min  x1² − 4x1 + (3/2) x2² − 7x2 + x1 x2 + 9 − ln(x1) − ln(x2)   (4.59)
    s.t.  4 − x1 x2 ≤ 0,

which corresponds to the plot in Figure 4.6. The optimal solution can be found by applying
the first order KKT conditions (4.42) as follows:

    ∇L(x∗, u∗, v∗) = ∇f(x∗) + ∇g(x∗) u∗   (4.60)
                   = [ 2x1 − 4 + x2 − 1/x1 − x2 u ]  =  0,   (4.61)
                     [ 3x2 − 7 + x1 − 1/x2 − x1 u ]
    g(x) = 4 − x1 x2 ≤ 0,   u (4 − x1 x2) = 0,   u ≥ 0   (4.62)
                          ⇓   (4.63)
    x∗ = [1.7981, 2.2245]T,   u∗ = 0.5685.   (4.64)
This leads to (QN )T ∇xx L(x ∗ , u∗ , v ∗ )QN = 1.0331, and the sufficient second order condition
holds.
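The values in (4.64) can be reproduced by applying Newton's method directly to the stationarity and complementarity conditions above, under the assumption (confirmed by the example) that the inequality is active with u∗ > 0:

```python
import numpy as np

# Newton's method on the KKT system of Example 4.20, with 4 - x1*x2 = 0
# treated as an active equality.
def F(z):
    x1, x2, u = z
    return np.array([2*x1 - 4 + x2 - 1/x1 - u*x2,   # dL/dx1
                     3*x2 - 7 + x1 - 1/x2 - u*x1,   # dL/dx2
                     4 - x1*x2])                    # active constraint

def J(z):
    x1, x2, u = z
    return np.array([[2 + 1/x1**2, 1 - u, -x2],
                     [1 - u, 3 + 1/x2**2, -x1],
                     [-x2, -x1, 0.0]])

z = np.array([2.0, 2.0, 0.5])                       # starting guess
for _ in range(20):
    z = z - np.linalg.solve(J(z), F(z))
print(np.round(z, 4))    # ≈ [1.7981 2.2245 0.5685], matching (4.64)
```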
Linear programs can be formulated in a number of equivalent ways. In fact, by adding slack
variables to (4.68), we can convert the inequalities to equalities
and add bounds on the variables. Moreover, by adding or subtracting (possibly large) fixed
constants to the variables, one can instead impose simple nonnegativity on all the variables
and write the LP as
min a T x (4.69)
s.t. Cx = c, x ≥ 0,
The KKT conditions for (4.69), with multipliers v∗ for the equality constraints and u∗ for the nonnegativity bounds, are given by

    a + CT v∗ − u∗ = 0,   (4.70a)
Cx ∗ = c, x ∗ ≥ 0, (4.70b)
(x ∗ )T u∗ = 0, u∗ ≥ 0. (4.70c)
Since ∇xx L(x ∗ , u∗ , v ∗ ) is identically zero, it is easy to see that the necessary second order
conditions (4.50) always hold when (4.70) holds. Moreover, at a nondegenerate solution
there are no nonzero vectors that satisfy d ∈ C2 (x ∗ , u∗ ) and the sufficient second order
conditions (4.57) hold vacuously.
Problem (4.69) can be solved in a finite number of steps. The standard method used
to solve (4.69) is the simplex method, developed in the late 1940s [108] (although, starting
from Karmarkar’s discovery in 1984, interior point methods have become quite advanced
and competitive for highly constrained problems [411]). The simplex method proceeds by
moving successively from vertex to vertex with improved objective function values. At each
vertex point, we can repartition the variable vector into xT = [xNT, xBT] and C into submatrices
C = [CN | CB] with corresponding columns. Here xN is the subvector of n − m nonbasic
variables which are set to zero, and xB is the subvector of m basic variables, which are
determined by the square system CB xB = c. At this vertex, directions to adjacent vertices
are identified (with different basic and nonbasic sets) and directional derivatives of the
objective are calculated (the so-called pricing step). If all of these directional derivatives
are positive (nonnegative), then one can show that the KKT conditions (4.70) and the
sufficient (necessary) second order conditions are satisfied.
Otherwise, an adjacent vertex associated with a negative directional derivative is
selected where a nonbasic variable (the driving variable) is increased from zero and a basic
variable (the blocking variable) is driven to zero. This can be done with an efficient pivoting
operation, and a new vertex is obtained by updating the nonbasic and basic sets by swapping
the driving and blocking variables in these sets. The sequence of these simplex steps leads
to vertices with decreasing objective functions, and the algorithm stops when no adjacent
vertices can be found that improve the objective. More details of these simplex steps can be
found in [108, 195, 287, 294].
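Because each vertex corresponds to a basic set with CB xB = c and the remaining variables at zero, a tiny standard-form LP can even be solved by enumerating all bases, which makes the "optimal vertex" property concrete. The data below are hypothetical, and real simplex codes pivot between adjacent vertices rather than enumerating:

```python
import itertools
import numpy as np

# Brute-force vertex enumeration for min a^T x s.t. Cx = c, x >= 0 (form (4.69)).
a = np.array([-1.0, -2.0, 0.0, 0.0])      # objective; x3, x4 act as slacks
C = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 0.0, 1.0]])
c = np.array([4.0, 6.0])

best = None
for B in itertools.combinations(range(4), 2):   # candidate basic sets
    CB = C[:, B]
    if abs(np.linalg.det(CB)) < 1e-12:
        continue                                # columns do not form a basis
    x = np.zeros(4)
    x[list(B)] = np.linalg.solve(CB, c)         # basic variables; nonbasic stay 0
    if (x >= -1e-12).all():                     # feasible vertex
        if best is None or a @ x < a @ best:
            best = x
print(best[:2], a @ best)    # [3. 1.] -5.0: the optimum sits at a vertex
```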
Methods to solve (4.68) are well implemented and widely used, especially in planning
and logistical applications. They also form the basis for mixed integer linear programming
methods. Currently, state-of-the-art LP solvers can handle millions of variables and con-
straints and the application of specialized decomposition methods leads to the solution of
problems that are even two or three orders of magnitude larger than this.
For the quadratic program

    min  aT x + (1/2) xT Q x   s.t.   Cx = c,   x ≥ 0,   (4.72)

the corresponding KKT conditions are

    a + Qx∗ + CT v∗ − u∗ = 0,   (4.73a)
Cx ∗ = c, x ∗ ≥ 0, (4.73b)
(x ∗ )T u∗ = 0, u∗ ≥ 0. (4.73c)
If the matrix Q is positive semidefinite (positive definite) when projected into the null
space of the active constraints, then (4.72) is (strictly) convex and (4.73) provides a global
(and unique) minimum. Otherwise, multiple local solutions may exist for (4.72) and more
extensive global optimization methods are needed to obtain the global solution. Like LPs,
convex QPs can be solved in a finite number of steps. However, as seen in Figure 4.2, these
optimal solutions can lie on a vertex, on a constraint boundary, or in the interior.
Solution of QPs follows in a manner similar to LPs. For a fixed active set, i.e., nonbasic
variables identified and conditions (4.73c) suppressed, the remaining equations constitute
a linear system that is often solved directly. The remaining task is to choose among active
sets (i.e., nonbasic variables) that satisfy (4.73c), much like the simplex method. Conse-
quently, the QP solution can also be found in a finite number of steps. A number of active set
QP strategies have been created that provide efficient updates of active constraints. Popular
methods include null-space algorithms, range-space methods, and Schur complement meth-
ods. As with LPs, QP problems can also be solved with interior point methods. A thorough
discussion of QP algorithms is beyond the scope of this text. For more discussion, the inter-
ested reader is referred to [195, 287, 294, 134, 411], as well as the references to individual
studies cited therein.
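The active set discussion above can be made concrete: once the active set is fixed and (4.73c) is suppressed, only a symmetric linear system remains. The sketch below solves a hypothetical equality-constrained QP in this way, assuming no bounds are active at the solution:

```python
import numpy as np

# Fixed-active-set QP step: solve the stationarity/feasibility system directly.
# Hypothetical data: min a^T x + 1/2 x^T Q x  s.t.  Cx = c.
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
a = np.array([-2.0, -5.0])
C = np.array([[1.0, 1.0]])
c = np.array([3.0])

n, m = 2, 1
K = np.block([[Q, C.T], [C, np.zeros((m, m))]])   # symmetric KKT matrix
sol = np.linalg.solve(K, np.concatenate([-a, c]))
x, v = sol[:n], sol[n:]
print(x)                                     # [0.75 2.25]
print(np.linalg.norm(a + Q @ x + C.T @ v))   # 0.0: (4.73a) holds with u* = 0
```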
For a set of investments i ∈ N with observed rates of return ri(tj) over time periods tj, the average rate of return is

    ρi = (1/Nt) Σj=1..Nt ri(tj),   (4.74)
where Nt is the number of time periods. We now consider a choice of a portfolio that
maximizes the rate of return, and this can be posed as the following LP:
    max  Σi∈N ρi xi   (4.75)
    s.t.  Σi∈N xi = 1,
          0 ≤ xi ≤ xmax,
where xi is the fraction of the total portfolio invested in investment i and we can choose a
maximum amount xmax to allow for some diversity in the portfolio.
However, the LP formulation assumes that there is no risk to these investments,
and experience shows that high-yielding investments are likely to be riskier. To deal with
risk, one can calculate the variance of investment i as well as the covariance between two
investments i and i′, given by

    si,i = (1/(Nt − 1)) Σj=1..Nt (ri(tj) − ρi)²,   (4.76a)
    si,i′ = (1/(Nt − 1)) Σj=1..Nt (ri(tj) − ρi)(ri′(tj) − ρi′),   (4.76b)
respectively; these quantities form the elements of the matrix S. With this information
one can adopt the Markowitz mean/variance portfolio model [275], where the least risky
portfolio is determined that provides (at least) a desired rate of return. This can be written
as the following QP:
    min  obj ≡ Σi,i′∈N si,i′ xi xi′   (4.77)
    s.t.  Σi∈N xi = 1,
          Σi∈N ρi xi ≥ ρmin,
          0 ≤ xi ≤ xmax.
In addition to these basic formulations there are a number of related LP and QP formulations
that allow the incorporation of uncertainties, transaction costs, after-tax returns, as well as
differentiation between upside and downside risk. These models constitute important topics
in financial planning, and more information on these formulations can be found in [361]
and the many references cited therein. The following example illustrates the nature of the
solutions to these basic portfolio problems.
Example 4.21 Consider a set with four investments, N = {A, B, C, D}. The first three
represent stocks issued by large corporations, and the growth over 12 years in their stock
prices with dividends, i.e., (1 + ri (t)), is plotted in Figure 4.9. From these data we can
apply (4.74) and (4.76) to calculate the mean vector and covariance matrix, respectively,
for portfolio planning. Investment D is a risk-free asset which delivers a constant rate of
return and does not interact with the other investments. The data for these investments are
given in Table 4.1.
Now consider the results of the LP and QP planning strategies. Using the data in
Table 4.1, we can solve the LP (4.75) for values of xmax from 0.25 (the lowest value with
a feasible solution for (4.75)) to 1.0. This leads to the portfolio plotted in Figure 4.10 with
rates of return that vary from 0.1192 to 0.2108. From the figure we see that the optimum
portfolio follows a straightforward strategy: put as much as possible (xmax ) of the fund into
the highest yielding investment, then as much of the remainder into the next highest, and so
on, until all of the fund is invested. This is consistent with the nature of LPs, where vertices
are always optimal solutions. Moreover, as xmax is increased, we see that the fraction of
low-yielding investments is reduced and these eventually drop out, while the rate of return
on the portfolio increases steadily.
However, B is also the riskiest investment, so the LP solution may not be the best
investment portfolio. For the quadratic programming formulation (4.77), one can verify that
the S matrix from Table 4.1 is positive semidefinite (with eigenvalues 0, 0.0066, 0.0097, and
0.0753), the QP is convex, and hence its solutions will be global. Here we set xmax = 1
and use the data in Table 4.1 to solve this QP for different levels of ρmin . This leads to the
solutions shown in Figure 4.11. Unlike the LP solution, the results at intermediate interest
rates reflect a mixed or diversified portfolio with minimum risk. As seen in the figure, the
amount of risk, obj , increases with the desired rate of return, leading to a trade-off curve
between these two measures. One can observe that from ρmin = 0.03 to ρmin = 0.15, the
optimal portfolio always has the same ratios of stocks A, B, and C with the fraction invested
in D as the only change in the portfolio (see [361] for an interesting discussion on this point).
Note also that the risk-free investment disappears at ρmin = 0.15 while the lowest-yielding
stock A drops out at ρmin = 0.18, leaving only B and C in the portfolio. At these points, we
observe steep rises in the objective. Finally, at ρmin = 0.2108 only B, the highest-yielding
stock, remains in the portfolio.
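The structure of such solutions is easy to reproduce. The sketch below solves the Markowitz QP (4.77) with hypothetical data for three assets, under the assumption that the bounds are inactive and the return constraint holds with equality, so that only the KKT equality system remains:

```python
import numpy as np

# Markowitz QP (4.77) with hypothetical data; bounds assumed inactive and the
# target-return constraint assumed active, leaving a KKT equality system.
rho = np.array([0.10, 0.20, 0.15])        # mean returns as in (4.74)
S = np.array([[0.04, 0.01, 0.00],         # covariance matrix as in (4.76)
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.06]])
rho_min = 0.15

ones = np.ones(3)
# Stationarity 2*S*x - v1*1 - v2*rho = 0 plus the two equality constraints:
K = np.block([[2 * S, -ones[:, None], -rho[:, None]],
              [ones[None, :], np.zeros((1, 2))],
              [rho[None, :], np.zeros((1, 2))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(3), [1.0, rho_min]]))
x = sol[:3]
print(np.round(x, 4))       # ≈ [0.3226 0.3226 0.3548]: a diversified portfolio
print(round(x @ S @ x, 6))  # ≈ 0.027742, the minimum risk at this target return
```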
A central tool in the development of the KKT conditions is the theorem of the alternative (TOA). Moreover, there is a broad set of
constraint qualifications (CQ) that lead to weaker assumptions on the nonlinear program to
apply optimality conditions. Both TOAs and CQs are explored in depth in the classic text
by Mangasarian [273].
4.6 Exercises
1. Show that the “no overlap constraints” in Example 4.4 are not convex.
2. Derive the KKT conditions for Example 4.4.
3. Show that the nonlinear program (4.44) does not satisfy the KKT conditions.
4. Consider the convex problem
Show that this problem does not satisfy the LICQ and does not satisfy the KKT
conditions at the optimum.
5. Investigate Lemma 12.2 in [294] and explain why the LICQ is essential to the proof
of Lemma 4.13.
6. Apply the second order conditions to both parts of Example 4.7. Define the tangent
cones for this problem.
7. Convert (4.68) to (4.69) and compare KKT conditions of both problems.
8. In the derivation of the Broyden update, the following convex equality constrained
problem is solved:
    min  ‖J+ − J‖F²
    s.t.  J+ s = y.

Using the definition of the Frobenius norm from Section 2.2.1, apply the optimality
conditions to the elements of J+ and derive the Broyden update J+ = J + (y − Js)sT/(sT s).
9. Convert (4.1) to
and compare KKT conditions of both problems. If (4.1) is a convex problem, what
can be said about the global solution of (4.78)?
10. A widely used trick is to convert (4.1) into an equality constrained problem by adding
new variables zj to each inequality constraint to form gj (x) − (zj )2 = 0. Compare
the KKT conditions for the converted problem with (4.1). Discuss any differences
between these conditions as well as the implications of using the converted form
within an NLP solver.
Chapter 5
Newton Methods for Equality Constrained Optimization
This chapter extends the Newton-based algorithms in Chapter 3 to deal with equality
constrained optimization. It generalizes the concepts of Newton iterations and associ-
ated globalization strategies to the optimality conditions for this constrained optimiza-
tion problem. As with unconstrained optimization we focus on two important aspects:
solving for the Newton step and ensuring convergence from poor starting points. For
the first aspect, we focus on properties of the KKT system and introduce both full- and
reduced-space approaches to deal with equality constraint satisfaction and constrained
minimization of the objective function. For the second aspect, the important properties
of penalty-based merit functions and filter methods will be explored, both for line search
and trust region methods. Several examples are provided to illustrate the concepts devel-
oped in this chapter and to set the stage for the nonlinear programming codes discussed in
Chapter 6.
We consider the equality constrained NLP

    min f(x)   s.t.   h(x) = 0,   (5.1)

and we assume that the functions f(x): Rn → R and h(x): Rn → Rm have continuous
first and second derivatives. First, we note that if the constraints h(x) = 0 are linear, then
(5.1) is a convex problem if and only if f (x) is convex. On the other hand, as discussed in
Chapter 4, nonlinear equality constraints imply nonconvex problems even if f (x) is convex.
As a result, the presence of nonlinear equalities leads to violation of convexity properties
and will not guarantee that local solutions to (5.1) are global minima. Therefore, unless
additional information is known about (5.1), we will be content in this chapter to determine
only local minima.
Finding a local solution of (5.1) can therefore be realized by solving (5.2) and then
checking the second order conditions (5.3). In this chapter we develop Newton-based strate-
gies for this task. As with the Newton methods in Chapter 3, a number of important concepts
need to be developed.
Solution of (5.2) with Newton’s method relies on the generation of Newton steps from
the following linear system:
    [ Wk       Ak ] [ dx ]       [ ∇L(xk, vk) ]
    [ (Ak)T    0  ] [ dv ]  =  − [ h(xk)      ] ,   (5.4)

where Wk = ∇xx L(xk, vk) and Ak = ∇h(xk). Defining the new multiplier estimate as v̄ =
vk + dv and substituting into (5.4) leads to the equivalent system:

    [ Wk       Ak ] [ dx ]       [ ∇f(xk) ]
    [ (Ak)T    0  ] [ v̄  ]  =  − [ h(xk)  ] .   (5.5)
Note that (5.5) are also the first order KKT conditions of the following quadratic program-
ming problem:
    min(dx)  ∇f(xk)T dx + (1/2) dxT Wk dx   (5.6)
    s.t.  h(xk) + (Ak)T dx = 0.
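This equivalence can be exercised on a small problem. The following sketch iterates the Newton system (5.5) for the hypothetical problem min x1² + x2² s.t. x1 x2 − 1 = 0, whose local solution is x∗ = (1, 1) with v∗ = −2:

```python
import numpy as np

# Newton iteration (5.5) for min x1**2 + x2**2  s.t.  x1*x2 - 1 = 0.
def kkt_step(x, v):
    W = np.array([[2.0, v], [v, 2.0]])       # W = Hessian of the Lagrangian
    A = np.array([[x[1]], [x[0]]])           # A = grad h(x), a single column
    K = np.block([[W, A], [A.T, np.zeros((1, 1))]])
    rhs = -np.concatenate([2 * x, [x[0] * x[1] - 1.0]])   # -[grad f; h]
    return np.linalg.solve(K, rhs)           # returns (dx, vbar)

x, v = np.array([1.2, 0.8]), -1.0
for _ in range(15):
    step = kkt_step(x, v)
    x, v = x + step[:2], step[2]             # vbar replaces v, as in (5.5)
print(np.round(x, 6), round(v, 6))   # ≈ [1. 1.] and v* = -2.0
```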
Using either (5.4) or (5.5), the basic Newton method can then be stated as follows.
Algorithm 5.1.
Choose a starting point (x0, v0). For k ≥ 0, and while the steps dx and dv exceed a convergence tolerance:

1. Evaluate ∇f(xk), Ak = ∇h(xk), and Wk = ∇xx L(xk, vk).
2. Solve the linear system (5.4) (or, equivalently, (5.5)).
3. Set xk+1 = xk + dx and vk+1 = vk + dv (or vk+1 = v̄).
Theorem 5.1 Consider a solution x ∗ , v ∗ , which satisfies the sufficient second order con-
ditions and ∇h(x ∗ ) is full column rank (LICQ). Moreover, assume that f (x) and h(x) are
twice differentiable and ∇ 2 f (x) and ∇ 2 h(x) are Lipschitz continuous in a neighborhood of
this solution. Then, by applying Algorithm 5.1, and with x 0 and v 0 sufficiently close to x ∗
and v ∗ , there exists a constant Ĉ > 0 such that
    ‖ (xk+1 − x∗, vk+1 − v∗) ‖ ≤ Ĉ ‖ (xk − x∗, vk − v∗) ‖²,   (5.7)

i.e., the iterates (xk, vk) converge quadratically.
The proof of this theorem follows the proof of Theorem 2.20 as long as the
KKT matrix in (5.4) is nonsingular at the solution (see Exercise 1). In the remainder of
this section we consider properties of the KKT matrix and its application in solving (5.1).
or simply
    [ (Y∗)T W∗ Y∗    (Y∗)T W∗ Z∗    (Y∗)T A∗ ] [ pY ]       [ (Y∗)T ∇L(x∗, v∗) ]
    [ (Z∗)T W∗ Y∗    (Z∗)T W∗ Z∗    0        ] [ pZ ]  =  − [ (Z∗)T ∇L(x∗, v∗) ] .   (5.9)
    [ (A∗)T Y∗       0              0        ] [ dv ]       [ h(x∗)            ]
• Because the KKT conditions (5.2) are satisfied at x ∗ , v ∗ , the right-hand side of (5.9)
equals zero.
As a result, we can use the bottom row of (5.9) to solve uniquely for pY = 0. Then, from
the second row of (5.9), we can solve uniquely for pZ = 0. Finally, from the first row of
(5.9), we solve uniquely for dv = 0. This unique solution implies that the matrix in (5.9)
is nonsingular, and from Sylvester’s law of inertia we have that the KKT matrix in (5.4) is
nonsingular as well at x ∗ and v ∗ . Note that nonsingularity is a key property required for the
proof of Theorem 5.1.
The solution can be found by solving the first order KKT conditions:
x1 + v = 0,
x2 + v = 0,
x1 + x2 = 1
with the solution x1∗ = 1/2, x2∗ = 1/2, and v∗ = −1/2. We can define a null space basis as Z∗ =
[1 −1]T, so that (A∗)T Z∗ = 0. The reduced Hessian at the optimum is (Z∗)T ∇xx L(x∗, v∗) Z∗ =
2 > 0 and the sufficient second order conditions are satisfied. Moreover, the KKT matrix at
the solution is given by

          [ W∗       A∗ ]     [ 1  0  1 ]
    K  =  [ (A∗)T    0  ]  =  [ 0  1  1 ]
                              [ 1  1  0 ]

with eigenvalues −1, 1, and 2. The inertia of this system is therefore In(K) = (2, 1, 0) =
(n, m, 0) as required.
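The eigenvalues and the inertia of this small KKT matrix are easy to verify directly:

```python
import numpy as np

# Inertia check for the KKT matrix of the example above.
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
eig = np.linalg.eigvalsh(K)                    # ascending order
tol = 1e-10
inertia = (int((eig > tol).sum()), int((eig < -tol).sum()),
           int((np.abs(eig) <= tol).sum()))
print(np.round(eig, 6))   # [-1.  1.  2.]
print(inertia)            # (2, 1, 0) = (n, m, 0)
```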
A number of direct solvers are available for symmetric linear systems. In particular, for
positive definite symmetric matrices, efficient Cholesky factorizations can be applied. These
can be represented as LDLT , with L a lower triangular matrix and D diagonal. On the other
hand, because the KKT matrix in (5.4) is indefinite, possibly singular, and frequently sparse,
another symmetric factorization (such as the Bunch–Kaufman [74] factorization) needs to
be applied. Denoting the KKT matrix as K, and defining P as an orthonormal permutation
matrix, the indefinite factorization allows us to represent K as P T KP = LBLT , where the
block diagonal matrix B is determined to have 1 × 1 or 2 × 2 blocks. From Sylvester’s law of
inertia we can obtain the inertia of K cheaply by examining the blocks in B and evaluating
the signs of their eigenvalues.
To ensure that the iteration matrix is nonsingular, we first check whether the KKT
matrix at iteration k has the correct inertia, (n, m, 0), i.e., n positive, m negative, and no
zero eigenvalues. If the inertia of this matrix is incorrect, we can modify (5.4) to form the
following linear system:
    [ Wk + δW I    Ak     ] [ dx ]       [ ∇L(xk, vk) ]
    [ (Ak)T        −δA I  ] [ dv ]  =  − [ h(xk)      ] .   (5.12)
Here, different trial values can be selected for the scalars δW , δA ≥ 0 until the inertia is
correct. To see how this works, we first prove that such values of δW and δA can be found,
and then sketch a prototype algorithm.
Theorem 5.4 Assume that matrices W k and Ak are bounded in norm; then for any δA > 0
there exist suitable values of δW such that the matrix in (5.12) has the inertia (n, m, 0).
Proof: In (5.12) Ak has column rank r ≤ m. Because Ak may be rank deficient, we represent
it by the LU factorization Ak LT = [U T | 0], with upper triangular U ∈ Rr×n of full rank
and nonsingular lower triangular L ∈ Rm×m . This leads to the following factorization of the
KKT matrix:
    V̄T [ Wk + δW I    Ak    ] V̄  =  [ Wk + δW I    [UT | 0] ]
        [ (Ak)T       −δA I ]        [ [UT | 0]T   −δA LLT  ] ,   (5.13)

where

    V̄  =  [ I    0  ]
          [ 0    LT ] .
From Theorem 5.2, it is clear that the right-hand matrix has the same inertia as the KKT
matrix in (5.12). Moreover, defining the nonsingular matrix
    Ṽ  =  [ I    δA−1 [UT | 0] L−T L−1 ]
          [ 0    I                     ]
we obtain

    Ṽ [ Wk + δW I    [UT | 0] ] ṼT  =  [ Ŵ    0        ] ,   (5.14)
      [ [UT | 0]T    −δA LLT  ]        [ 0    −δA LLT ]

where

    Ŵ = Wk + δW I + δA−1 [UT | 0] L−T L−1 [UT | 0]T = Wk + δW I + δA−1 Ak (Ak)T.
We note that if δA > 0, the matrix −δA LLT has inertia (0, m, 0) and we see that the right-
hand matrix of (5.14) has an inertia given by I n(Ŵ ) + (0, m, 0). Again, we note that this
matrix has the same inertia as the KKT matrix in (5.12).
To obtain the desired inertia, we now examine Ŵ more closely. Similar to the decom-
position in (5.9), we define full rank range and null-space basis matrices, Y ∈ Rn×r and Z ∈
Rn×(n−r) , respectively, with Y T Z = 0. From Z T Ak = 0, it is clear that Z T [U T | 0]L−T = 0
and also that (Y )T Ak ∈ Rr×m has full column rank. For any vector p ∈ Rn with p =
YpY + ZpZ, we can choose positive constants ã and c̃ that satisfy

    c̃ ‖pY‖ ‖pZ‖ ≥ −pZT ZT (Wk + δW I) Y pY = −pZT ZT Wk Y pY,
    pYT YT Ak (Ak)T Y pY ≥ ã ‖pY‖².

For a given δA > 0, we can choose δW large enough so that, for all pZ ∈ Rn−r and pY ∈ Rr with

    pZT ZT (Wk + δW I) Z pZ ≥ m̃ ‖pZ‖²   for m̃ > 0,
    pYT YT (Wk + δW I) Y pY ≥ −w̃ ‖pY‖²   for w̃ ≥ 0,

we have ã δA−1 > w̃ and m̃ ≥ c̃²/(ã δA−1 − w̃). With these quantities, we obtain the following
relations:
    pT Ŵ p = [ pYT | pZT ] [ YT (Wk + δW I + δA−1 Ak (Ak)T) Y    YT Wk Z          ] [ pY ]
                           [ ZT Wk Y                            ZT (Wk + δW I) Z  ] [ pZ ]
           = pYT YT Ŵ Y pY + 2 pYT YT Wk Z pZ + pZT ZT (Wk + δW I) Z pZ
           ≥ (ã δA−1 − w̃) ‖pY‖² − 2 c̃ ‖pY‖ ‖pZ‖ + m̃ ‖pZ‖²
           = (ã δA−1 − w̃) ( ‖pY‖ − c̃ ‖pZ‖/(ã δA−1 − w̃) )² + ( m̃ − c̃²/(ã δA−1 − w̃) ) ‖pZ‖²  >  0
for all values of p. As a result, Ŵ is positive definite, the inertia of the KKT matrix in (5.12)
is (n, m, 0), and we have the desired result.
These observations motivate the following algorithm (simplified from [404]) for
choosing δA and δW for a particular Newton iteration k.
Algorithm 5.2.
Given constants 0 < δ̄_W^min < δ̄_W^0 < δ̄_W^max; δ̄_A > 0; and 0 < κ_l < 1 < κ_u. (From [404], recommended values of the scalar parameters are δ̄_W^min = 10^{−20}, δ̄_W^0 = 10^{−4}, δ̄_W^max = 10^{40}, δ̄_A = 10^{−8}, κ_u = 8, and κ_l = 1/3.) Also, for the first iteration, set δ_W^last := 0.
At each iteration k:
1. Attempt to factorize the matrix in (5.12) with δW = δA = 0. For instance, this can be
done with an LBLT factorization; the diagonal (1 × 1 and 2 × 2) blocks of B are then
used to determine the inertia. If the matrix has correct inertia, then use the resulting
search direction as the Newton step. Otherwise, continue with step 2.
2. If the inertia calculation reveals zero eigenvalues, set δA := δ̄A . Otherwise, set δA := 0.
3. If δ_W^last = 0, set δ_W := δ̄_W^0; else set δ_W := max{δ̄_W^min, κ_l δ_W^last}.
4. With updated values of δ_W and δ_A, attempt to factorize the matrix in (5.12). If the inertia is now correct, set δ_W^last := δ_W and use the resulting search direction in the line search. Otherwise, continue with step 5.
5. Continue to increase δW and set δW := κu δW .
6. If δ_W > δ̄_W^max, abort the search direction computation; the matrix is severely ill-conditioned. Otherwise, go back to step 4.
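As a sketch of this inertia-correction loop (under simplifying assumptions: dense matrices, and the inertia read from an eigenvalue count rather than from the LBL^T factorization a production code would use; all function and parameter names are illustrative):

```python
import numpy as np

def inertia(M, tol=1e-8):
    """Inertia (n_+, n_-, n_0) of a symmetric matrix via its eigenvalues.
    (A production code would instead read the inertia off an LBL^T factorization.)"""
    w = np.linalg.eigvalsh(M)
    return int((w > tol).sum()), int((w < -tol).sum()), int((abs(w) <= tol).sum())

def regularized_kkt_step(W, A, rhs, dW_last=0.0, dW_min=1e-20, dW_0=1e-4,
                         dW_max=1e40, dA_bar=1e-8, kappa_u=8.0, kappa_l=1.0/3.0):
    """Choose dW, dA as in Algorithm 5.2 so the regularized KKT matrix has
    inertia (n, m, 0); return the resulting step and the accepted dW."""
    n, m = A.shape
    dW = dA = 0.0
    while True:
        K = np.block([[W + dW * np.eye(n), A],
                      [A.T, -dA * np.eye(m)]])
        pos, neg, zero = inertia(K)
        if (pos, neg, zero) == (n, m, 0):
            return np.linalg.solve(K, rhs), dW
        if zero > 0:                 # zero eigenvalues detected: perturb the A-block
            dA = dA_bar
        if dW == 0.0:                # first correction of the Hessian block
            dW = dW_0 if dW_last == 0.0 else max(dW_min, kappa_l * dW_last)
        else:                        # keep increasing dW
            dW *= kappa_u
        if dW > dW_max:
            raise RuntimeError("severely ill-conditioned KKT matrix")
```

When W is already positive definite on the null space of A^T, the first factorization succeeds and no correction (δ_W = 0) is applied.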
Figure 5.1. Normal and tangential steps for the solution d_x of (5.4) or (5.6). Note also that if a coordinate basis (Y_c) is chosen, the normal and tangential steps may not be orthogonal and the steps Y_c p_{Y_c} are longer than Y p_Y.
In contrast to the full-space method, this decomposition requires that A^k have full column rank for all k. From the bottom row in (5.16) we obtain

h(x^k) + (A^k)^T Y^k p_Y = 0.   (5.17)
To determine the tangential component, pZ , we substitute (5.17) in the second row of (5.16).
The following linear system:
(Z k )T W k Z k pZ = −[(Z k )T ∇f (x k ) + (Z k )T W k Y k pY ] (5.19)
can then be solved if (Z k )T W k Z k is positive definite, and this property can be verified
through a successful Cholesky factorization. Otherwise, if (Z k )T W k Z k is not positive def-
inite, this matrix can be modified either by adding a sufficiently large diagonal term, say
δR I , or by using a modified Cholesky factorization, as discussed in Section 3.2.
Once p_Z is calculated from (5.19), we use (5.18) to obtain d_x. Finally, the top row in (5.16) can then be used to update the multipliers:

v̄ = −[(Y^k)^T A^k]^{−1} (Y^k)^T (∇f(x^k) + W^k d_x).   (5.20)

Because d_x^k → 0 as the algorithm converges, a first order multiplier calculation may be used instead, i.e.,

v̄ = −[(Y^k)^T A^k]^{−1} (Y^k)^T ∇f(x^k),   (5.21)

and we can avoid the calculation of (Y^k)^T W^k Y^k and (Y^k)^T W^k Z^k. Note that the multipliers from (5.21) are still asymptotically correct.
A dominant part of the Newton step is the computation of the null-space and range-
space basis matrices, Z and Y , respectively. There are many possible choices for these basis
matrices, including the following three options.
• By computing a QR factorization of A, we can define Z and Y so that they have orthonormal columns, i.e., Z^T Z = I_{n−m}, Y^T Y = I_m, and Z^T Y = 0. This gives a well-conditioned representation of the null space and range space of A. However, Y and Z are dense matrices, and this can lead to expensive computation when the number of variables (n) is large.
• A more economical alternative for Z and Y can be found through a simple elimination
of dependent variables [134, 162, 294]. Here we permute the components of x into m
dependent or basic variables (without loss of generality, we select the first m variables)
and n − m independent or superbasic variables. Similarly, the columns of (Ak )T are
permuted and partitioned accordingly to yield
(Ak )T = [AkB | AkS ]. (5.22)
We assume that the m×m basis matrix A_B^k is nonsingular and we define the coordinate bases

Z^k = [ −(A_B^k)^{−1} A_S^k ; I ]   and   Y^k = [ I ; 0 ].   (5.23)
Note that in practice Z k is not formed explicitly; instead we can compute and store
the sparse LU factors of AkB . Due to the choice (5.23) of Y k , the normal component
determined in (5.17) and multipliers determined in (5.21) simplify to
p_Y = −(A_B^k)^{−1} h(x^k),   (5.24)
v̄ = −(A_B^k)^{−T} (Y^k)^T ∇f(x^k).   (5.25)
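A minimal dense sketch of these coordinate-basis computations (a sparse implementation would store and reuse the LU factors of A_B^k instead of calling a dense solver; the function name is illustrative):

```python
import numpy as np

def coordinate_basis_steps(AT, grad_f, h, m):
    """Normal step (5.24), multipliers (5.25), and the action of Z^k from (5.23),
    given the partitioned Jacobian (A^k)^T = [A_B | A_S] with nonsingular A_B."""
    A_B, A_S = AT[:, :m], AT[:, m:]
    pY = -np.linalg.solve(A_B, h)            # (5.24)
    v = -np.linalg.solve(A_B.T, grad_f[:m])  # (5.25): (Y^k)^T grad picks the first m entries
    def Z_apply(pZ):
        # Z^k pZ = [-(A_B)^{-1} A_S pZ ; pZ], without forming Z^k explicitly
        return np.concatenate([-np.linalg.solve(A_B, A_S @ pZ), pZ])
    return pY, v, Z_apply
```

The returned `Z_apply` illustrates how tangential directions stay in the null space of (A^k)^T even though Z^k is never formed.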
• Numerical accuracy and algorithmic performance are often improved if the tangential
and normal directions, ZpZ and YpY , can be maintained orthogonal and the length
of the normal step is minimized. This can, of course, be obtained through a QR
factorization, but the directions can be obtained more economically by modifying the
coordinate basis decomposition and defining the orthogonal bases
Z^k = [ −(A_B^k)^{−1} A_S^k ; I ]   and   Y^k = [ I ; (A_S^k)^T (A_B^k)^{−T} ].   (5.26)
Note that (Z^k)^T Y^k = 0, and from the choice (5.26) of Y^k, the calculation in (5.17) can be written as

p_Y = −[I + (A_B^k)^{−1} A_S^k (A_S^k)^T (A_B^k)^{−T}]^{−1} (A_B^k)^{−1} h(x^k)
    = −{ I − (A_B^k)^{−1} A_S^k [I + (A_S^k)^T (A_B^k)^{−T} (A_B^k)^{−1} A_S^k]^{−1} (A_S^k)^T (A_B^k)^{−T} } (A_B^k)^{−1} h(x^k),   (5.27)
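The equivalence of the two forms in (5.27) can be checked numerically; in this sketch B stands for A_B^{−1}A_S, and only an (n − m) × (n − m) system is solved in the Sherman–Morrison–Woodbury form:

```python
import numpy as np

def normal_step_two_ways(A_B, A_S, h):
    """Compute pY directly from (5.27) and via the Sherman-Morrison-Woodbury form."""
    B = np.linalg.solve(A_B, A_S)     # A_B^{-1} A_S
    r = np.linalg.solve(A_B, h)       # A_B^{-1} h(x^k)
    m, s = B.shape
    direct = -np.linalg.solve(np.eye(m) + B @ B.T, r)
    # Woodbury form: only an s x s (= (n-m) x (n-m)) system is solved
    smw = -(r - B @ np.linalg.solve(np.eye(s) + B.T @ B, B.T @ r))
    return direct, smw
```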
where the second equation follows from the application of the Sherman–Morrison–Woodbury formula [294]. An analogous expression can be derived for v̄. As with coordinate bases, this calculation requires a factorization of A_B^k, but it also requires a factorization of the (n − m) × (n − m) matrix [I + (A_S^k)^T (A_B^k)^{−T} (A_B^k)^{−1} A_S^k] in (5.27).
• The normal and tangential search directions can also be determined implicitly through
an appropriate modification of the full-space KKT system (5.4). As shown in Exer-
cise 3, the tangential step dt = Z k pZ can be found from the following linear system:
[ W^k , A^k ; (A^k)^T , 0 ] [ d_t ; v ] = − [ ∇f(x^k) ; 0 ].   (5.28)
B^{k+1} s = y, but with

s = x^{k+1} − x^k   and   y = ∇_x L(x^{k+1}, v^{k+1}) − ∇_x L(x^k, v^{k+1}).

Note that because we approximate the Hessian with respect to x, both terms in the definition of y require the multiplier evaluated at v^{k+1}.
With these definitions of s and y, we can directly apply the BFGS update (3.18):

B^{k+1} = B^k + y y^T / (s^T y) − B^k s s^T B^k / (s^T B^k s),   (5.31)

or the SR1 update:

B^{k+1} = B^k + (y − B^k s)(y − B^k s)^T / ((y − B^k s)^T s).   (5.32)
However, unlike the Hessian matrix for unconstrained optimization, discussed in Chap-
ter 3, W (x, v) is not required to be positive definite at the optimum. Only its projection,
i.e., Z(x ∗ )T W (x ∗ , v ∗ )Z(x ∗ ), needs to be positive definite to satisfy sufficient second order
conditions. As a result, some care is needed in applying the update matrices using (5.31) or
(5.32).
As discussed in Chapter 3, these updates are well defined when s T y is sufficiently
positive for BFGS and (y − B k s)T s = 0 for SR1. Otherwise, the updates are skipped for a
given iteration, or Powell damping (see Section 3.3) can be applied for BFGS.
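A sketch of the BFGS update (5.31) with Powell damping (as in Section 3.3); the damping constant 0.2 is the customary choice, and the function name is illustrative:

```python
import numpy as np

def damped_bfgs_update(B, s, y, delta=0.2):
    """BFGS update (5.31) with Powell damping: y is replaced by a convex
    combination of y and B s so the curvature condition s^T y_bar > 0 holds
    even when the raw pair (s, y) has negative curvature."""
    Bs = B @ s
    sBs = s @ Bs
    sty = s @ y
    theta = 1.0 if sty >= delta * sBs else (1.0 - delta) * sBs / (sBs - sty)
    ybar = theta * y + (1.0 - theta) * Bs     # damped secant vector
    return B + np.outer(ybar, ybar) / (s @ ybar) - np.outer(Bs, Bs) / sBs
```

With θ = 1 the formula reduces to the undamped update; θ < 1 is only triggered when s^T y is too small, which preserves positive definiteness of B^{k+1}.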
However, because the updated quasi-Newton matrix B^k is dense, handling this matrix directly leads to factorizations of the Newton step that are O(n³). For large problems, this
calculation can be prohibitively expensive. Instead we would like to exploit the fact that
Ak is sparse and B k has the structure given by (5.31) or (5.32). We can therefore store the
updates y and s for successive iterations and incorporate these into the solution of (5.33).
For this update, we store only the last q updates and apply the compact limited memory
representation from [83] (see Section 3.3), along with a sparse factorization of the initial
quasi-Newton matrix, B 0 , in (5.33). Also, we assume that B 0 is itself sparse; often it is
chosen to be diagonal. For the BFGS update (5.31), this compact representation is given by
B^{k+1} = B^0 − [B^0 S_k | Y^k] [ S_k^T B^0 S_k , L_k ; L_k^T , −D_k ]^{−1} [ S_k^T B^0 ; (Y^k)^T ],   (5.34)

where S_k = [s^{k−q}, . . . , s^{k−1}] and Y^k = [y^{k−q}, . . . , y^{k−1}] collect the stored updates, D_k is the diagonal matrix with entries (D_k)_{ii} = (s^{k−q+i})^T y^{k−q+i}, and

(L_k)_{ij} = (s^{k−q+i})^T (y^{k−q+j}) for i > j, and 0 otherwise.
By writing (5.34) as

B^{k+1} = B^0 − [Y^k | B^0 S_k] [ −D_k^{1/2} , D_k^{−1/2} L_k^T ; 0 , J_k^T ]^{−1} [ D_k^{1/2} , 0 ; −L_k D_k^{−1/2} , J_k ]^{−1} [ (Y^k)^T ; S_k^T B^0 ],   (5.36)

where J_k is a lower triangular factor constructed from the Cholesky factorization that satisfies

J_k J_k^T = S_k^T B^0 S_k + L_k D_k^{−1} L_k^T,   (5.37)

we now define V_k = Y^k D_k^{−1/2}, U_k = (B^0 S_k + Y^k D_k^{−1} L_k^T) J_k^{−T}, Ũ^T = [U_k^T 0], and Ṽ^T = [V_k^T 0] and consider the matrices

K = [ B^k , A^k ; (A^k)^T , 0 ] = [ B^0 + V_k V_k^T − U_k U_k^T , A^k ; (A^k)^T , 0 ],

K_0 = [ B^0 , A^k ; (A^k)^T , 0 ].
Moreover, we assume that K_0 is sparse and can be factorized cheaply using a sparse or structured matrix decomposition. (Usually such a factorization requires O(n^β) operations, with the exponent β ∈ [1, 2].) Factorization of K can then be made by two applications of the Sherman–Morrison–Woodbury formula, yielding

K_1^{−1} = K_0^{−1} − K_0^{−1} Ṽ [I + Ṽ^T K_0^{−1} Ṽ]^{−1} Ṽ^T K_0^{−1},   (5.38)
K^{−1} = K_1^{−1} − K_1^{−1} Ũ [I + Ũ^T K_1^{−1} Ũ]^{−1} Ũ^T K_1^{−1}.   (5.39)
By carefully structuring these calculations so that K_0^{−1}, K_1^{−1}, and K^{−1} are factorized and incorporated into backsolves, these matrices are never explicitly created, and we can obtain the solution to (5.33), i.e.,
[ d_x ; d_v ] = −K^{−1} [ ∇L(x^k, v^k) ; h(x^k) ]   (5.40)
in only O(n^β + q³ + q²n) operations. Since only a few updates are stored (q is typically less
than 20), this limited memory approach leads to a much more efficient implementation of
quasi-Newton methods. An analogous approach can be applied to the SR1 method as well.
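The compact representation (5.34) can be checked against the recursive update (5.31); this dense sketch (illustrative names, B^0 = I in the test) also shows the small q-sized pieces S_k^T B^0 S_k, L_k, and D_k that a limited-memory code actually stores:

```python
import numpy as np

def bfgs_recursive(B0, S, Y):
    """Apply the BFGS update (5.31) once per stored pair (columns of S and Y)."""
    B = B0.copy()
    for s, y in zip(S.T, Y.T):
        Bs = B @ s
        B += np.outer(y, y) / (s @ y) - np.outer(Bs, Bs) / (s @ Bs)
    return B

def bfgs_compact(B0, S, Y):
    """Compact limited-memory form (5.34), with L_k and D_k built from S^T Y."""
    StY = S.T @ Y
    L = np.tril(StY, -1)               # (L_k)_{ij} = s_i^T y_j for i > j
    D = np.diag(np.diag(StY))
    M = np.block([[S.T @ B0 @ S, L], [L.T, -D]])
    Phi = np.hstack([B0 @ S, Y])
    return B0 - Phi @ np.linalg.solve(M, Phi.T)
```

Both routines produce the same matrix, but the compact form only ever solves a 2q × 2q system, which is what makes the O(n^β + q³ + q²n) operation count above possible.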
As seen in Chapter 3, quasi-Newton methods are slower to converge than Newton’s
method, and the best that can be expected is a superlinear convergence rate. For equality
constrained problems, the full-space quasi-Newton implementation has a local convergence
property that is similar to Theorem 3.1, but with additional restrictions.
Theorem 5.5 [63] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and Algorithm 5.1 modified by a quasi-Newton approxi-
mation converges to a KKT point that satisfies the LICQ and the sufficient second order
optimality conditions. Then x^k converges to x^* at a superlinear rate, i.e.,

lim_{k→∞} ‖x^k + d_x − x^*‖ / ‖x^k − x^*‖ = 0   (5.41)
if and only if

lim_{k→∞} ‖Z(x^k)^T (B^k − W(x^*, v^*)) d_x‖ / ‖d_x‖ = 0,   (5.42)
where Z(x) is a representation of the null space of A(x)T that is Lipschitz continuous in a
neighborhood of x ∗ .
Note that the theorem does not state whether a specific quasi-Newton update, e.g.,
BFGS or SR1, satisfies (5.42), as this depends on whether B k remains bounded and well-
conditioned, or whether updates need to be modified through damping or skipping. On the
other hand, under the more restrictive assumption where W (x ∗ , v ∗ ) is positive definite, the
following property can be shown.
Theorem 5.6 [63] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and Algorithm 5.1 modified by a BFGS update converges to
a KKT point that satisfies the LICQ and the sufficient second order optimality conditions.
If, in addition, W(x^*, v^*) and B^0 are positive definite, and ‖B^0 − W(x^*, v^*)‖ and ‖x^0 − x^*‖ are sufficiently small, then x^k converges to x^* at a superlinear rate, i.e.,

lim_{k→∞} ‖x^k + d_x − x^*‖ / ‖x^k − x^*‖ = 0.   (5.43)
This linear system is solved via two related subsystems of order m for p_Y and v̄, respectively, and a third subsystem of order n − m for p_Z. For this last subsystem, we can approximate the reduced Hessian, i.e., B̄^k ≈ (Z^k)^T W^k Z^k, by using the BFGS update (5.31).
By noting that

d_x = Y^k p_Y + Z^k p_Z

and assuming that a full Newton step is taken (i.e., x^{k+1} = x^k + d_x), we can apply the secant condition based on the following first order approximation to the reduced Hessian:
we can write the secant formula B̄^{k+1} s_k = y_k and use this definition for the BFGS update (5.31). However, note that for this update, some care is needed in the choice of Z(x^k) so that it remains continuous with respect to x. Discontinuous changes in Z(x) (due to different variable partitions or QR rotations) could lead to serious problems in the Hessian approximation [162].
In addition, we modify the Newton step from (5.44). Using the normal step p_Y calculated from (5.17), we can write the tangential step as follows:

B̄^k p_Z = −[(Z^k)^T ∇f(x^k) + (Z^k)^T W^k Y^k p_Y] = −[(Z^k)^T ∇f(x^k) + w^k],   (5.47)

where the second part follows from writing (or approximating) (Z^k)^T W^k Y^k p_Y as w^k. Note
that the BFGS update eliminates the need to evaluate W k and to form (Z k )T W k Z k explicitly.
In addition, to avoid the formation of (Z k )T W k Y k pY , it is appealing to approximate the
vectors w k in (5.49) and (5.47). This can be done in a number of ways.
• Many reduced Hessian methods [294, 100, 134] typically approximate (Z^k)^T W^k Y^k p_Y by w^k = 0. Neglecting this term often works well, especially when normal steps are much smaller than tangential steps. This approximation is particularly useful when the constraints are linear and p_Y remains zero once a feasible iterate is encountered. Moreover, when Z and Y have orthogonal representations, the length of the normal step Y^k p_Y is minimized. Consequently, neglecting this term in (5.49) and (5.47) still can lead to good performance. On the other hand, coordinate representations (5.23) of Y can lead to larger normal steps (see Figure 5.1), a nonnegligible value for (Z^k)^T W^k Y^k p_Y, and possibly erratic convergence behavior of the method.
• To ensure that good search directions are generated in (5.44) that are independent of the representations of Z and Y, we can compute a finite difference approximation of (Z^k)^T W^k Y^k along p_Y, for example,

w^k = (Z^k)^T [∇_x L(x^k + Y^k p_Y, v^k) − ∇_x L(x^k, v^k)].   (5.50)
Figure 5.2. Regions that determine criteria for choosing w^k, the approximation to (Z^k)^T W^k Y^k p_Y. A quasi-Newton approximation can be used in R_1 or R_3, while R_2 requires a finite difference approximation.
Here {γ_k} is a positive sequence with Σ_{k=1}^∞ γ_k < ∞, and σ_k = ‖(Z^k)^T ∇f(x^k)‖ + ‖h(x^k)‖, which is related directly to the distance to the solution, ‖x^k − x^*‖. In the space of ‖p_Y‖ and ‖p_Z‖, we now define three regions given by

R_1: 0 ≤ ‖p_Y‖ ≤ γ_k² ‖p_Z‖,
R_2: γ_k² ‖p_Z‖ < ‖p_Y‖ ≤ κ ‖p_Z‖ / σ_k^{1/2},
R_3: ‖p_Y‖ > κ ‖p_Z‖ / σ_k^{1/2}
and shown in Figure 5.2. In order to obtain a fast convergence rate, we only need
to resort to the finite difference approximation (5.50) when the calculated step lies
in region R2 . Otherwise, in regions R1 and R3 , the tangential and the normal steps,
respectively, are dominant and the less expensive quasi-Newton approximation can
be used for w k . In [54], these hybrid concepts were incorporated into an algorithm
with a line search along with safeguards to ensure that the updates S k and B̄ k remain
uniformly bounded. A detailed presentation of this hybrid algorithm along with anal-
ysis of global and local convergence properties and numerical performance is given
in [54].
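The region test of Figure 5.2 is straightforward to sketch (κ and the sequence {γ_k} are algorithm parameters; the function name is illustrative):

```python
import numpy as np

def choose_wk_region(pY, pZ, sigma_k, gamma_k, kappa=1.0):
    """Classify the step into R1/R2/R3; only R2 requires the finite
    difference approximation of w^k, since a quasi-Newton estimate is
    adequate whenever one step component dominates the other."""
    nY, nZ = np.linalg.norm(pY), np.linalg.norm(pZ)
    if nY <= gamma_k**2 * nZ:
        return "R1"                    # tangential step dominates
    if nY <= kappa * nZ / np.sqrt(sigma_k):
        return "R2"                    # finite difference approximation needed
    return "R3"                        # normal step dominates
```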
To summarize, the reduced-space quasi-Newton algorithm does not require the com-
putation of the Hessian of the Lagrangian W k and only makes use of first derivatives of f (x)
and h(x). The reduced Hessian matrix (Z k )T W k Z k is approximated by a positive definite
quasi-Newton matrix B̄^k, using the BFGS formula, and the (Z^k)^T W^k Y^k p_Y term is approximated by a vector w^k, which is computed either by means of a finite difference formula or
via a quasi-Newton approximation. The method is therefore well suited for large problems
with relatively few degrees of freedom.
The local convergence properties for this reduced-space method are analogous to The-
orems 5.5 and 5.6, but they are modified because B̄ k is now a quasi-Newton approximation
to the reduced Hessian and also because of the choice of approximation wk . First, we note
from (5.15), (5.47), and the definitions for B̄^k and w^k that condition (5.51) implies (5.42). Therefore, Theorem 5.5 holds and the algorithm converges superlinearly in x.
On the other hand, (5.51) does not hold for all choices of w k or quasi-Newton updates. For
instance, if wk is a poor approximation to (Z k )T W k Y k pY , the convergence rate is modified
as follows.
Theorem 5.7 [63] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and the reduced-space quasi-Newton algorithm converges to
a KKT point that satisfies the LICQ and the sufficient second order optimality conditions.
If Z(x) and v(x) (obtained from (5.21)) are Lipschitz continuous in a neighborhood of x ∗ ,
Y (x) and Z(x) are bounded, and [Y (x) | Z(x)] has a bounded inverse, then x k converges to
x^* at a 2-step superlinear rate, i.e.,

lim_{k→∞} ‖x^{k+2} − x^*‖ / ‖x^k − x^*‖ = 0   (5.52)

if and only if

lim_{k→∞} ‖(B̄^k − (Z^*)^T W(x^*, v^*) Z^*) p_Z‖ / ‖d_x‖ = 0.   (5.53)
Finally, we consider the hybrid approach that monitors and evaluates an accurate ap-
proximation of w k based on the regions in Figure 5.2. Moreover, since (Z ∗ )T W (x ∗ , v ∗ )Z ∗ is
positive definite, the BFGS update to the reduced Hessian provides a suitable approximation.
As a result, the following, stronger property can be proved.
Theorem 5.8 [54] Assume that f (x) and h(x) are twice differentiable, their second deriva-
tives are Lipschitz continuous, and we apply the hybrid reduced-space algorithm in [54]
with the BFGS update (5.31), (5.48). If
• the algorithm converges to a KKT point that satisfies the LICQ and the sufficient
second order optimality conditions,
• Y (x) and Z(x) are bounded and [Y (x) | Z(x)] has a bounded inverse in a neighborhood
of x ∗ ,
then x^k converges to x^* at a superlinear rate, i.e.,

lim_{k→∞} ‖x^k + d_x − x^*‖ / ‖x^k − x^*‖ = 0.   (5.54)
Because the augmented Lagrange function is exact for finite values of ρ and suitable values
of v, it forms the basis of a number of NLP methods (see [294, 100, 134]). On the other
hand, the suitability of LA as a merit function is complicated by the following:
• Finding a sufficiently large value of ρ is not straightforward; a suitable value needs to be determined iteratively.
• Suitable values for v need to be determined through iterative updates as the algorithm
proceeds, or by direct calculation using (5.58). Either approach can be expensive and
may lead to ill-conditioning, especially if ∇h(x) is rank deficient.
Nonsmooth Exact Penalty Functions
Because a smooth function of the form (5.56) cannot be exact, we finally turn to nonsmooth penalty functions. The most common choice for merit functions is the class of ℓ_p penalty functions, given by

φ_p(x; ρ) = f(x) + ρ‖h(x)‖_p.
• Let x̄ be a stationary point of φp (x; ρ) for all values of ρ above a positive threshold;
then if h(x̄) = 0, x̄ is a KKT point for (5.1) (see [294]).
• Let x̄ be a stationary point of φ_p(x; ρ) for all values of ρ above a positive threshold; then if h(x̄) ≠ 0, x̄ is an infeasible stationary point for the penalty function (see [294]).
Compared to (5.55) and (5.57) the nonsmooth exact penalty is easy to apply as a merit
function. Moreover, under the threshold and feasibility conditions, the above properties
show an equivalence between local minimizers of φp (x, ρ) and local solutions of (5.1). As
a result, further improvement of φp (x; ρ) cannot be made from a local optimum of (5.1).
Moreover, even if a feasible solution is not found, the minimizer of φp (x; ρ) leads to a
“graceful failure” at a local minimum of the infeasibility. This point may be useful to the
practitioner to flag incorrect constraint specifications in (5.1) or to reinitialize the algorithm.
As a result, φp (x; ρ), with p = 1 or 2, is widely used in globalization strategies discussed
in the remainder of the chapter and implemented in the algorithms in Chapter 6.
The main concern with this penalty function is to find a reasonable value for ρ. While
ρ has a well-defined threshold value, it is unknown a priori, and the resulting algorithm
can suffer from poor performance and ill-conditioning if ρ is set too high. Careful attention
is needed in updating ρ, which is generally chosen after a step is computed, in order to
promote acceptance. For instance, the value of ρ can be adjusted by monitoring the multiplier estimates v^k and ensuring that ρ > ‖v^k‖_q as the algorithm proceeds. Recently, more efficient
updates for ρ have been developed [78, 79, 294] that are based on allowing ρ to lie within a
bounded interval and using the smallest value of ρ in that interval to determine an acceptable
step. As described below, this approach provides considerable flexibility in choosing larger
steps. Concepts of exact penalty functions will be applied to both the line search and trust
region strategies developed in the next section.
A trial point x̂ (e.g., generated by a Newton-like step coupled to the line search or
trust region approach) is acceptable to the filter if it is not in the forbidden region, Fk ,
and a sufficient improvement is realized corresponding to a small fraction of the current
infeasibility. In other words, a trial point is accepted if

θ(x̂) ≤ (1 − γ_θ) θ(x^k)   or   f(x̂) ≤ f(x^k) − γ_f θ(x^k)   (5.60)

for some small positive values γ_f, γ_θ. These inequalities correspond to the trial point
(f (x̂), θ (x̂)) that lies below the dotted lines in Figure 5.3.
The filter strategy also requires two further enhancements in order to guarantee con-
vergence.
1. Relying solely on criterion (5.60) allows the acceptance of a sequence {x k } that only
provides sufficient reduction of the constraint violation, but not the objective function.
For instance, this can occur if a filter pair is placed at a feasible point with θ(x^l) = 0,
and acceptance through (5.60) may lead to convergence to a feasible, but nonoptimal,
point. In order to prevent this, we monitor a switching criterion to test whenever the
constraint infeasibility is too small to promote a sufficient decrease in the objective
function. If so, we require that the trial point satisfy a sufficient decrease of the
objective function alone.
2. Ill-conditioning at infeasible points can lead to new steps that may be too small to
satisfy the filter criteria; i.e., trial points could be “caught” between the dotted and
solid lines in Figure 5.3. This problem can be corrected with a feasibility restoration
phase which attempts to find a less infeasible point at which to restart the algorithm.
With these elements, the filter strategy, coupled either with a line search or trust region
approach, can be shown to converge to either a local solution of (5.1) or a stationary point
of the infeasibility, θ (x). To summarize, the filter strategy is outlined as follows.
• Generate a new trial point, x̂. Continue if the trial point is not within the forbidden
region, Fk .
• If a switching condition (see (5.85)) is triggered, then accept x̂ as x k+1 if a sufficient
reduction is found in the objective function.
• Else, accept x̂ as x^{k+1} if the trial point is acceptable to the filter with a margin given by θ(x^k). Update the filter.
• If the trial point is not acceptable, find another trial point that is closer to x^k.
• If there is insufficient progress in finding trial points, invoke the restoration phase to
find a new point, x R , with a smaller θ(x R ).
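These acceptance and augmentation rules can be sketched as follows; in this sketch each filter entry stores the corner ((1 − γ_θ)θ(x^k), f(x^k) − γ_f θ(x^k)) of its forbidden region, and the margin values are illustrative:

```python
def filter_acceptable(theta_hat, f_hat, filt, theta_k, f_k,
                      gamma_theta=1e-5, gamma_f=1e-5):
    """A trial point must avoid every forbidden corner and improve either
    feasibility or the objective by a fraction of the current infeasibility."""
    forbidden = any(theta_hat >= th and f_hat >= fv for th, fv in filt)
    improves = (theta_hat <= (1.0 - gamma_theta) * theta_k
                or f_hat <= f_k - gamma_f * theta_k)
    return (not forbidden) and improves

def augment_filter(filt, theta_k, f_k, gamma_theta=1e-5, gamma_f=1e-5):
    """Add the region dominated by the current iterate, with margins, as in (5.87)."""
    return filt + [((1.0 - gamma_theta) * theta_k, f_k - gamma_f * theta_k)]
```

Storing only corners keeps each filter test a linear scan over at most one pair per augmenting iteration.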
Figure 5.4. Illustration of merit function concepts with a flexible penalty parameter.
Curtis and Nocedal [105], and these can be interpreted and compared in the context of filter
methods.
Figure 5.4 shows how merit functions can mimic the performance of filter methods
in the space of f (x) and h(x) if we allow the penalty parameter to vary between upper
and lower bounds. To see this, consider the necessary condition for acceptance of x k+1 ,
f(x^{k+1}) + ρ‖h(x^{k+1})‖ < f(x^k) + ρ‖h(x^k)‖,   (5.61)

or, if ‖h(x^k)‖ ≠ ‖h(x^{k+1})‖,

ρ > ρ_v = (f(x^k) − f(x^{k+1})) / (‖h(x^{k+1})‖ − ‖h(x^k)‖).
Note that the bound on ρ defines the slopes of the lines in Figure 5.4. Clearly, R1 does not
satisfy (5.61), and x k+1 is not acceptable in this region. Instead, acceptable regions for x k+1
are R2 and R3 for ρ = ρu , while R3 and R4 are acceptable regions for ρ = ρl . If we now
allow ρ ∈ [ρl , ρu ], then all three regions R2 , R3 , and R4 are acceptable regions for x k+1 and
these regions behave like one element of the filter method. This is especially true if ρu can
be kept arbitrarily large and ρl can be kept close to zero. Of course, the limits on ρ cannot
be set arbitrarily; they are dictated by global convergence properties. As will be seen later,
it is necessary to update ρu based on a predicted descent property. This property ensures
decrease of the exact penalty function as the step size or trust region becomes small. On the
other hand, due to nonlinearity, smaller values of ρ may be allowed for satisfaction of (5.61)
for larger steps. This relation will guide the choice of ρl . More detail on this selection and
the resulting convergence properties is given in the next sections. We now consider how
both merit function and filter concepts are incorporated within globalization strategies.
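Since condition (5.61) is linear in ρ, acceptance for some ρ in [ρ_l, ρ_u] only needs to be checked at the interval endpoints; a sketch with illustrative names:

```python
def acceptable_flexible(f_k, h_k, f_new, h_new, rho_l, rho_u):
    """Accept the trial point if (5.61) holds for some rho in [rho_l, rho_u];
    linearity in rho means checking the two endpoints suffices."""
    return min(f_new - f_k + rho * (h_new - h_k) for rho in (rho_l, rho_u)) < 0.0
```

This mimics the filter-like behavior of Figure 5.4: a step that trades a worse objective for much better feasibility is accepted as long as ρ_u is large enough.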
also be addressed in this section in the context of constrained optimization, for both merit
function and filter approaches.
where x^{k+1} = x^k + α^k d_x and D_{d_x} φ_p(x^k; ρ) is the directional derivative of the merit function along d_x. To evaluate the directional derivative we first consider its definition:

D_{d_x} φ_p(x^k; ρ) = lim_{α→0} [φ_p(x^k + αd_x; ρ) − φ_p(x^k; ρ)] / α.   (5.63)
For α ≥ 0 we can apply Taylor's theorem and bound the difference in merit functions by

φ_p(x^k + αd_x; ρ) − φ_p(x^k; ρ) = (f(x^k + αd_x) − f(x^k)) + ρ(‖h(x^k + αd_x)‖_p − ‖h(x^k)‖_p)
≤ α∇f(x^k)^T d_x + ρ(‖h(x^k) + α∇h(x^k)^T d_x‖_p − ‖h(x^k)‖_p) + b₁α²‖d_x‖²   for some b₁ > 0.   (5.64)
Using the quadratic term from Taylor's theorem, one can also derive a corresponding lower bound, leading to (5.65) and (5.66). Dividing (5.65) and (5.66) by α and applying the definition (5.63) as α goes to zero leads to the following expression for the directional derivative along the Newton step (which satisfies ∇h(x^k)^T d_x = −h(x^k)):

D_{d_x} φ_p(x^k; ρ) = ∇f(x^k)^T d_x − ρ‖h(x^k)‖_p.   (5.67)
The next step is to tie the directional derivative to the line search strategy to enforce
the optimality conditions. In particular, the choices that determine dx and ρ play important
roles. In Sections 5.3 and 5.4 we considered full- and reduced-space representations as well
as the use of quasi-Newton approximations to generate dx . These features also extend to the
implementation of the line search strategy, as discussed next.
We start by noting that the Newton steps dx from (5.5) and (5.16) are equivalent.
Substituting (5.15) into the directional derivative (5.67) leads to
D_{d_x} φ_p(x^k; ρ) = ∇f(x^k)^T (Z^k p_Z + Y^k p_Y) − ρ‖h(x^k)‖_p,   (5.68)

and from (5.17) we have

D_{d_x} φ_p(x^k; ρ) = ∇f(x^k)^T Z^k p_Z − ∇f(x^k)^T Y^k [(A^k)^T Y^k]^{−1} h(x^k) − ρ‖h(x^k)‖_p.   (5.69)
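The value ∇f(x^k)^T d_x − ρ‖h(x^k)‖_p appearing in (5.68) can be checked against a one-sided finite difference of φ_p itself; this sketch uses a hypothetical two-variable problem with a single linear constraint (all functions illustrative) and a step satisfying ∇h(x)^T d_x = −h(x):

```python
import numpy as np

rho, p = 10.0, 1
f = lambda x: x[0]**2 + 2.0 * x[1]**2          # hypothetical objective
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
h = lambda x: np.array([x[0] + x[1] - 1.0])    # single linear constraint
phi = lambda x: f(x) + rho * np.linalg.norm(h(x), p)   # merit function phi_p

x = np.array([2.0, 0.0])
dx = np.array([-0.5, -0.5])                    # grad_h^T dx = -1 = -h(x)
D = grad_f(x) @ dx - rho * np.linalg.norm(h(x), p)     # directional derivative value
alpha = 1e-6
fd = (phi(x + alpha * dx) - phi(x)) / alpha    # one-sided difference estimate
```

Because the constraint is linear here, the linearization in (5.64) is exact and the finite difference matches the analytic value up to O(α).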
Substituting this expression into (5.75) and choosing the constants appropriately leads
directly to (5.72).
• Finally, for flexible choices of ρ ∈ [ρl , ρu ] as shown in Figure 5.4, one can choose ρu
to satisfy (5.72), based on (5.71), (5.73), or (5.74). This approach finds a step size α k
that satisfies the modified Armijo criterion:
φ_p(x^k + α^k d_x; ρ) ≤ φ_p(x^k; ρ) + η α^k D_{d_x} φ_p(x^k; ρ_m^k).   (5.77)
On the other hand, ρ_l^k is chosen as a slowly increasing lower bound given by (5.78).
Algorithm 5.3.
Choose constants η ∈ (0, 1/2); ε₁, ε₂ > 0; and τ, τ' with 0 < τ < τ' < 1. Set k := 0 and choose starting points x^0, v^0, and an initial value ρ^0 > 0 (or ρ_l^0, ρ_u^0 > 0) for the penalty parameter.
3. Update ρ k to satisfy (5.71), (5.73), or (5.74), or update ρlk and ρuk for the flexible
choice of penalty parameters.
4. Set α k = 1.
2. The matrix A(x) has full column rank for all x ∈ D, and there exist constants γ0 and
β0 such that
for all x ∈ D.
3. The Hessian (or approximation) W (x, v) is positive definite on the null space of the
Jacobian A(x)T and uniformly bounded.
Lemma 5.9 [54] If Assumptions I hold and if ρ^k = ρ is constant for all sufficiently large k, then there is a positive constant γ_ρ such that for all large k,

φ_p(x^k) − φ_p(x^{k+1}) ≥ γ_ρ [‖(Z^k)^T ∇f(x^k)‖² + ‖h(x^k)‖_p].   (5.81)
The proof of Lemma 5.9 is a modification from the proof of Lemma 3.2 in [82] and
Lemma 4.1 in [54]. Key points in the proof (see Exercise 4) include showing that α k is
always bounded away from zero, using (5.64). Applying (5.72) and the fact that h(x) has
an upper bound in D leads to the desired result.
Finally, Assumptions I and Lemma 5.9 now allow us to prove convergence to a KKT
point for a class of line search strategies that use nonsmooth exact penalty merit functions
applied to Newton-like steps.
Theorem 5.10 If Assumptions I hold, then the weights {ρ^k} are constant for all sufficiently large k, lim_{k→∞} (‖(Z^k)^T ∇f(x^k)‖ + ‖h(x^k)‖) = 0, and lim_{k→∞} x^k = x^*, a point that satisfies the KKT conditions (5.2).
Proof: By Assumptions I, we note that the quantities that define ρ in (5.71), (5.73), or (5.74) are bounded. Therefore, since the procedure increases ρ^k by at least a fixed positive amount whenever it changes the penalty parameter, it follows that there is an index k₀ and a value ρ such that for all k > k₀, ρ^k = ρ always satisfies (5.71), (5.73), or (5.74).
This argument can be extended to the flexible updates of ρ. Following the proof in
[78], we note that ρuk is increased in the same way ρ k is increased for the other methods,
so ρuk eventually approaches a constant value. For ρlk we note that this quantity is bounded
above by ρuk and through (5.78) it increases by a finite amount to this value at each iteration.
As a result, it also attains a constant value.
Now that ρ^k is constant for k > k₀, we have, by Lemma 5.9 and the fact that φ_p(x^k) decreases at each iterate, that

φ_p(x^{k₀}; ρ) − φ_p(x^{k+1}; ρ) = Σ_{j=k₀}^{k} (φ_p(x^j; ρ) − φ_p(x^{j+1}; ρ))
≥ γ_ρ Σ_{j=k₀}^{k} [‖(Z^j)^T ∇f(x^j)‖² + ‖h(x^j)‖_p].

By Assumptions I, φ_p(x, ρ) is bounded below for all x ∈ D, so the last sum is finite, and thus

lim_{k→∞} [‖(Z^k)^T ∇f(x^k)‖² + ‖h(x^k)‖_p] = 0.   (5.82)
with constants δ > 0, s_θ > 1, s_f ≥ 1, where m_k(α) := α∇f(x^k)^T d_x. If (5.85) holds, then step d_x is a descent direction for the objective function and we require that α^{k,l} satisfies the
Armijo-type condition

f(x^k + α^{k,l} d_x) ≤ f(x^k) + η_f m_k(α^{k,l}),   (5.86)

where η_f ∈ (0, 1/2) is a fixed constant. Note that several trial step sizes may be tried with (5.85) satisfied, but not (5.86). Moreover, for smaller step sizes the f-type switching condition (5.85) may no longer be valid and we revert to the acceptance criterion (5.84).
Note that the second part of (5.85) ensures that the progress for the objective function
enforced by the Armijo condition (5.86) is sufficiently large compared to the current con-
straint violation. Enforcing (5.85) is essential for points that are near the feasible region.
Also, the choices of sf and sθ allow some flexibility in performance of the algorithm. In
particular, if sf > 2sθ (see [403]), both (5.85) and (5.86) will hold for a full step, possibly
improved by a second order correction (see Section 5.8) [402], and rapid local convergence
is achieved.
During the optimization we make sure that the current iterate x k is always acceptable
to the current filter Fk . At the beginning of the optimization, the forbidden filter region is
normally initialized to F0 := {(θ , f ) ∈ R2 : θ ≥ θmax } for some upper bound on infeasibility,
θmax > θ(x 0 ). Throughout the optimization, the filter is then augmented in some iterations
after the new iterate x k+1 has been accepted. For this, the updating formula
" #
Fk+1 := Fk ∪ (θ , f ) ∈ R2 : θ ≥ (1 − γθ )θ (x k ) and f ≥ f (x k ) − γf θ (x k ) (5.87)
is used. On the other hand, the filter is augmented only if the current iteration is not an f -type
iteration, i.e., if for the accepted trial step size α k , the f -type switching condition (5.85)
does not hold. Otherwise, the filter region is not augmented, i.e., Fk+1 := Fk . Instead, the
Armijo condition (5.86) must be satisfied, and the value of the objective function is strictly
decreased. This is sufficient to prevent cycling.
Finally, finding a trial step size α k,l > 0 that provides sufficient reduction as defined by
criterion (5.84) is not guaranteed. In this situation, the filter method switches to a feasibility
restoration phase, whose purpose is to find a new iterate x k+1 that satisfies (5.84) and is also
acceptable to the current filter by trying to decrease the constraint violation. Any iterative
algorithm can be applied to find a less infeasible point, and different methods could even
be used at different stages of the optimization procedure. To detect the situation where no
admissible step size can be found, one can linearize (5.84) (or (5.85) in the case of an f -type
iteration) and estimate the smallest step size, α k,min , that allows an acceptable trial point.
The algorithm then switches to the feasibility restoration phase when α k,l becomes smaller
than α k,min .
If the feasibility restoration phase terminates successfully by delivering an admissible
point, the filter is augmented according to (5.87) to avoid cycling back to the problematic
point x k . On the other hand, if the restoration phase is unsuccessful, it should converge
to a stationary point of the infeasibility, θ (x). This “graceful exit” may be useful to the
practitioner to flag incorrect constraint specifications in (5.1) or to reinitialize the algorithm.
Combining these elements leads to the following filter line search algorithm.
Algorithm 5.4.
Given a starting point x 0 ; constants θmax ∈ (θ(x 0 ), ∞]; γθ , γf ∈ (0, 1); δ > 0; γα ∈ (0, 1];
sθ > 1; sf ≥ 1; ηf ∈ (0, 1/2); 0 < τ ≤ τ̄ < 1.
3. Compute search direction. Calculate the search step dx using either the full-space or
reduced-space linear systems, with exact or quasi-Newton Hessian information. If
this system is detected to be too ill-conditioned, go to feasibility restoration phase in
step 8.
6. Augment filter if necessary. If k is not an f -type iteration, augment the filter using
(5.87); otherwise leave the filter unchanged, i.e., set Fk+1 := Fk .
7. Continue with next iteration. Increase the iteration counter k ← k + 1 and go back to
step 2.
8. Feasibility restoration phase. Compute a new iterate x k+1 by decreasing the infeasi-
bility measure θ, so that x k+1 satisfies the sufficient decrease conditions (5.84) and is
acceptable to the filter, i.e., (θ(x^{k+1}), f(x^{k+1})) ∉ Fk. Augment the filter using (5.87)
(for x k ) and continue with the regular iteration in step 7.
The filter line search method is clearly more complicated than the merit function
approach and is also more difficult to analyze. Nevertheless, the key convergence result for
this method can be summarized as follows.
Theorem 5.11 [403] Let Assumption I(1) hold for all iterates and Assumptions I(1,2,3) hold
for those iterates that are within a neighborhood of the feasible region (||h(x^k)|| ≤ θinc for
some θinc > 0). In this region successful steps can be taken and the restoration phase need
not be invoked. Then Algorithm 5.4, with no unsuccessful terminations of the feasibility
restoration phase, has the following properties:
In other words, all limit points are feasible, and if {x k } is bounded, then there exists a limit
point x ∗ of {x k } which is a first order KKT point for the equality constrained NLP (5.1).
The filter line search method avoids the need to monitor and update penalty param-
eters over the course of the iterations. This also avoids the case where a poor update of
the penalty parameter may lead to small step sizes and poor performance of the algorithm.
On the other hand, the “lim inf ” result in Theorem 5.11 is not as strong as Theorem 5.7, as
only convergence of a subsequence can be shown. The reason for this weaker result is that,
even for a good trial point, there may be filter information from previous iterates that could
invoke the restoration phase. Nevertheless, one should be able to strengthen this property
to limk→∞ ||∇x L(x^k, v^k)|| = 0 through careful design of the restoration phase algorithm
(see [404]).
min_dx ∇f(x^k)^T dx + (1/2) dx^T W^k dx (5.89)
s.t. h(x^k) + (A^k)^T dx = 0,
||dx|| ≤ Δ.
Here the norm ||dx|| can be chosen as the max-norm, leading to simple bound constraints.
However, if the trust region is small, there may be no feasible solution for (5.89). As discussed
below, this problem can be overcome by using either a merit function or a filter approach.
With the merit function approach, the search direction dx is split into two components with
separate trust region problems for each, and a composite-step trust region method is devel-
oped. In contrast, the filter method applies its restoration phase if the trust region is too
small to yield a feasible solution to (5.89).
This approach focuses on improving each term separately; with p = 2, quadratic models
of each term can be constructed and solved as separate trust region problems. In particular,
a trust region problem that improves the quadratic model of the objective function can be
shown to give a tangential step, while the trust region problem that reduces a quadratic
model of the infeasibility leads to a normal step. Each trust region problem is then solved
using the algorithms developed in Chapter 3.
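As a minimal illustration of the normal subproblem, the sketch below computes a Cauchy (steepest descent) step for a single equality constraint with residual h and gradient a, clipped to the trust region; the text instead solves these subproblems with the full algorithms of Chapter 3, so this is only a sketch of the idea, with illustrative names.

```python
import math

def cauchy_normal_step(h, a, delta):
    """Cauchy step for the normal model q(d) = 0.5*(h + a.d)^2, ||d|| <= delta.

    h: scalar constraint residual, a: constraint gradient (list of floats).
    """
    g = [h * ai for ai in a]                     # gradient of q at d = 0
    gnorm = math.sqrt(sum(gi * gi for gi in g))
    if gnorm == 0.0:
        return [0.0] * len(a)
    a_norm2 = sum(ai * ai for ai in a)
    step = [-gi / a_norm2 for gi in g]           # unconstrained minimizer along -g
    snorm = math.sqrt(sum(si * si for si in step))
    if snorm > delta:                            # clip to the trust region radius
        step = [si * delta / snorm for si in step]
    return step
```

If the radius is large enough, this step removes the linearized infeasibility entirely; otherwise it reduces it as far as the trust region allows.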
Using the ℓ2 penalty function, we now postulate a composite-step trust region model
at iteration k given by
m_{ℓ2}(x; ρ) = ∇f(x^k)^T dx + (1/2) dx^T W^k dx + ρ ||h(x^k) + (A^k)^T dx||_2.
Decomposing dx = Y k pY + Z k pZ into normal and tangential components leads to the fol-
lowing subproblems.
s.t. ||Y^k pY||_2 ≤ ξN Δ^k. (5.90)
If acceptable steps have been found that provide sufficient reduction in the merit
function, we select our next iterate as
x k+1 = x k + Y k pY + Z k pZ . (5.92)
If not, we reject this step, reduce the size of the trust region, and repeat the calculation of
the normal and tangential steps. These steps are shown in Figure 5.5. In addition, we obtain
Lagrange multiplier estimates at iteration k using the first order approximation:
The criteria for accepting new points and adjusting the trust region are based on
predicted and actual reductions of the merit function. The predicted reduction in the quadratic
models contains the predicted reduction in the model for the tangential step (qk),
qk = −((Z^k)^T ∇f(x^k) + w^k)^T pZ − (1/2) pZ^T B̄^k pZ, (5.94)
and the predicted reduction in the model for the normal step (ϑk ),
The actual reduction is just the difference in the merit function at two successive
iterates:
ρ^k = max{ ρ^{k−1}, −(qk − ∇f(x^k)^T Y^k pY − (1/2) pY^T (Y^k)^T W^k Y^k pY) / ((1 − ζ)ϑk) }, (5.98)
where ζ ∈ (0, 1/2). Again, if we use quasi-Newton updates, we would drop the term
(1/2) pY^T (Y^k)^T W^k Y^k pY in (5.98). A more flexible alternate update for ρ^k is also discussed
in [294].
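The update (5.98) can be sketched directly; the argument names below are illustrative stand-ins for the quantities qk, ∇f(x^k)^T Y^k pY, pY^T (Y^k)^T W^k Y^k pY, and ϑk.

```python
def update_penalty(rho_prev, q_k, grad_f_dot_ypy, ypy_w_ypy, theta_pred, zeta=0.1):
    """Penalty update in the spirit of (5.98).

    Chooses rho large enough that the predicted reduction of the merit
    model stays positive; theta_pred is the predicted reduction of the
    infeasibility model (vartheta_k), assumed positive.
    """
    candidate = -(q_k - grad_f_dot_ypy - 0.5 * ypy_w_ypy) / ((1.0 - zeta) * theta_pred)
    return max(rho_prev, candidate)   # the penalty parameter never decreases
```

Note that the `max` makes the penalty parameter monotonically nondecreasing, which is what (5.98) requires; with quasi-Newton updates the `ypy_w_ypy` term would simply be passed as zero.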
With these elements, the composite-step trust region algorithm follows.
Algorithm 5.5.
Set all tolerances and constants; initialize B 0 if using quasi-Newton updates, as well as
x^0, v^0, ρ^0, and Δ^0.
Typical values for this algorithm are κ0 = 10^{−4}, κ1 = 0.25, κ2 = 0.75, γ1 = 0.5, and we
note that the trust region algorithm is similar to Algorithm 3.3 from Chapter 3.
The trust region algorithm is stated using reduced-space steps. Nevertheless, for large
sparse problems, the tangential and normal steps can be calculated more cheaply using only
full-space information based on the systems (5.29) and (5.28). Moreover, the tangential
problem allows either exact or quasi-Newton Hessian information. The following conver-
gence property holds for all of these cases.
1. The functions f : Rn → R and h : Rn → Rm are twice differentiable and their first and
second derivatives are Lipschitz continuous and uniformly bounded in norm over D.
2. The matrix A(x) has full column rank for all x ∈ D, and there exist constants γ0 and
β0 such that
||Y(x)[A(x)^T Y(x)]^{−1}|| ≤ γ0, ||Z(x)|| ≤ β0, (5.99)
for all x ∈ D.
Theorem 5.12 [100] Suppose Assumptions II hold; then Algorithm 5.5 has the following
properties:
In other words, all limit points are feasible and x ∗ is a first order stationary point for the
equality constrained NLP (5.1).
min_dx mf(dx) = ∇f(x^k)^T dx + (1/2) dx^T W^k dx (5.101)
s.t. h(x^k) + (A^k)^T dx = 0, ||dx|| ≤ Δ^k.
The filter trust region method uses similar concepts as in the line search case and can
be described by the following algorithm.
Algorithm 5.6.
Initialize all constants and choose a starting point (x^0, v^0) and initial trust region Δ^0.
Initialize the filter F0 := {(θ, f) ∈ R²: θ ≥ θmax} and the iteration counter k ← 0.
2. For the current trust region, check whether a solution exists to (5.101). If so, continue
to the next step. If not, add x k to the filter and invoke the feasibility restoration phase
min_dx mf(dx) = ∇f(x^r)^T dx + (1/2) dx^T W(x^r) dx (5.102)
s.t. h(x^r) + ∇h(x^r)^T dx = 0,
||dx|| ≤ Δ^k
θ (x k + dx ) ≤ (1 − γθ )θ (x k ) or f (x k + dx ) ≤ f (x k ) − γf θ(x k ),
then reject the trial step, set x^{k+1} = x^k and Δ^{k+1} ∈ [γ0 Δ^k, γ1 Δ^k], and go to
step 1.
• If
mf (x k ) − mf (x k + dx ) ≥ κθ θ(x k )2 (5.103)
and
πk = (f(x^k) − f(x^k + dx)) / (mf(x^k) − mf(x^k + dx)) < η1, (5.104)
then we have a rejected f-type step. Set x^{k+1} = x^k and Δ^{k+1} ∈ [γ0 Δ^k, γ1 Δ^k]
and go to step 1.
• If (5.103) fails, add x k to the filter. Otherwise, we have an f -type step.
Typical values for these constants are γ0 = 0.1, γ1 = 0.5, γ2 = 2, η1 = 0.01, η2 = 0.9,
κθ = 10^{−4}. Because the filter trust region method shares many of the concepts from the
corresponding line search method, it is not surprising that it shares the same convergence
property.
Example 5.14 (Slow Convergence Near Solution). Consider the following equality con-
strained problem:
min f(x) = τ(x1² + x2² − 1) − x1 s.t. h(x) = x1² + x2² − 1 = 0,
with the constant τ > 1. As shown in Figure 5.6, the solution to this problem is x* = [1, 0]^T,
v ∗ = 1/2 − τ and it can easily be seen that ∇xx L(x ∗ , v ∗ ) = I at the solution. To illustrate the
effect of slow convergence, we proceed from a feasible point x k = [cos θ, sin θ ]T and choose
θ small enough to start arbitrarily close to the solution. Using the Hessian information at
x ∗ and linearizing (5.5) at x k allows the search direction to be determined from the linear
system
[ 1       0       2x1^k ] [ d1 ]     [ 1 − 2τ x1^k           ]
[ 0       1       2x2^k ] [ d2 ]  =  [ −2τ x2^k              ]   (5.107)
[ 2x1^k   2x2^k   0     ] [ v̄  ]     [ 1 − (x1^k)² − (x2^k)² ]
yielding d1 = sin² θ, d2 = − sin θ cos θ and x^k + d = [cos θ + sin² θ, sin θ(1 − cos θ)]^T. As
seen in Figure 5.6, x k + d is much closer to the optimum solution, and one can show that
the ratio
||x^k + d − x*|| / ||x^k − x*||² = 1/2 (5.108)
is characteristic of quadratic convergence for all values of θ. However, with the ℓp merit
function, there is no positive value of the penalty parameter ρ that allows this point to be
accepted. Moreover, none of the globalization methods described in this chapter accept this
point either, because the infeasibility increases:
To avoid small steps that “creep toward the solution,” we develop a second order
correction at iteration k. The second order correction is an additional step taken in the range
space of the constraint gradients in order to reduce the infeasibility from x k + dx . Here we
define the correction by
d̄x = −Y^k [(A^k)^T Y^k]^{−1} h(x^k + dx). (5.110)
Example 5.15 (Maratos Example Revisited). To see the influence of the second order cor-
rection, we again refer to Figure 5.6. From (5.110) with Y k = A(x k ), we have
and we can see that x k + dx + d̄x is even closer to the optimum solution with
||x^k + dx − x*|| = 1 − cos θ ≥ (1 − cos θ)²/2 = ||x^k + dx + d̄x − x*||,
which is also characteristic of quadratic convergence from (5.108). Moreover, while the
infeasibility still increases from h(x^k) = 0 to h(x^k + dx + d̄x) = sin⁴θ/4, the infeasibility is
considerably less than at x k + dx . Moreover, the objective function becomes
For (1 − cos(θ)/2)/sin² θ > (τ + ρ)/4, the objective function clearly decreases from f(x^k) =
− cos θ. Moreover, this point also reduces the exact penalty and is acceptable to the filter.
Note that as θ → 0, this relation holds true for any bounded values of τ and ρ. As a
result, the second order correction allows full steps to be taken and fast convergence to be
obtained.
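The calculations of Examples 5.14 and 5.15 can be verified numerically. The sketch below starts from the feasible point x^k = [cos θ, sin θ]^T, applies the Newton step from (5.107) in its closed form, and then the second order correction with Y^k = A(x^k); the function name is illustrative, and only the constraint h(x) = x1² + x2² − 1 is needed (the objective does not enter these checks).

```python
import math

def maratos_steps(theta):
    """Return x^k + dx and x^k + dx + dbar_x for the Maratos example.

    Uses the closed-form Newton step d = (sin^2 t, -sin t cos t) derived
    in the text from (5.107), and the range-space correction (5.110).
    """
    c, s = math.cos(theta), math.sin(theta)
    xk = (c, s)                               # feasible starting point
    d = (s * s, -s * c)                       # Newton step from (5.107)
    xd = (xk[0] + d[0], xk[1] + d[1])
    h_xd = xd[0] ** 2 + xd[1] ** 2 - 1.0      # infeasibility after the step
    A = (2.0 * c, 2.0 * s)                    # constraint gradient at x^k
    t = h_xd / (A[0] ** 2 + A[1] ** 2)        # [(A^k)^T Y^k]^{-1} h with Y^k = A
    xc = (xd[0] - t * A[0], xd[1] - t * A[1]) # corrected point
    return xd, xc
```

For any θ the distance to x* = (1, 0) shrinks from 1 − cos θ to (1 − cos θ)²/2, the quadratic-rate ratio of (5.108), while the corrected infeasibility equals sin⁴θ/4 as claimed in Example 5.15.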
Implementation of the second order correction is a key part of all of the globalization
algorithms discussed in this chapter. These algorithms are easily modified by first detect-
ing whether the current iteration is within a neighborhood of the solution. This can be
done by checking whether the optimality conditions, say max(||h(x^k)||, ||(Z^k)^T ∇f(x^k)||) or
max(||h(x^k)||, ||∇L(x^k, v^k)||), are reasonably small. If x^k is within this neighborhood, then
d̄x is calculated and the trial point x k + dx + d̄x is examined for sufficient improvement.
If the trial point is accepted, we set x^{k+1} = x^k + dx + d̄x. Otherwise, we discard d̄x and
continue either by restricting the trust region or by backtracking in the line search.
5.11 Exercises
1. Prove Theorem 5.1 by modifying the proof of Theorem 2.20.
2. Consider the penalty function given by (5.56) with a finite value of ρ, and compare the
KKT conditions of (5.1) with a stationary point of (5.56). Argue why these conditions
cannot yield the same solution.
3. Show that the tangential step dt = Z k pZ and normal step dn = Y k pY can be found
from (5.28) and (5.29), respectively.
4. Prove Lemma 5.9 by modifying the proof of Lemma 3.2 in [82] and Lemma 4.1
in [54].
5. Fill in the steps used to obtain the results in Example 5.15. In particular, show that for
(1 − cos(θ)/2)/sin² θ > τ + ρ,
the second order correction allows a full step to be taken for either the ℓp penalty or
the filter strategy.
6. Consider the problem
min f(x) = (1/2)(x1² + x2²) (5.112)
s.t. h(x) = x1(x2 − 1) − θ x2 = 0, (5.113)
where θ is an adjustable parameter. The solution to this problem is x1* = x2* = v* = 0
and ∇xx L(x*, v*) = I. For θ = 1 and θ = 100, perform the following experiments
with xi^0 = 1/θ, i = 1, 2.
• Setting B k = I , solve this problem with full-space Newton steps and no glob-
alization.
• Setting B̄ k = (Z k )T Z k and w k = 0, solve the above problem with reduced-space
Newton steps and no globalization, using orthogonal bases (5.26).
• Setting B̄ k = (Z k )T Z k and w k = 0, solve the above problem with reduced-space
Newton steps and no globalization, using coordinate basis (5.23).
• Setting B̄ k = (Z k )T Z k and w k = (Z k )T Y k pY , solve the above problem with
reduced-space Newton steps and no globalization, using coordinate basis (5.23).
Chapter 6
This chapter deals with the solution of nonlinear programs with both equality and inequality
constraints. It follows from the previous chapter and builds directly on its concepts and
algorithms for equality constrained problems. The key extension in this chapter is the
treatment of inequality constraints. For this, we develop three approaches: sequential
quadratic programming (SQP)-type methods, interior point methods, and nested projection
methods. Both global and local convergence properties of these methods are presented and
discussed. With these fundamental concepts we then describe and classify many of the
nonlinear programming codes that are currently available. Features of these codes will be
discussed along with their performance characteristics. Several examples are provided and
a performance study is presented to illustrate the concepts developed in this chapter.
where we assume that the functions f (x), h(x), and g(x) have continuous first and sec-
ond derivatives. To derive algorithms for nonlinear programming and to exploit particular
problem structures, it is often convenient to consider related, but equivalent NLP formula-
tions. For instance, by introducing nonnegative slack variables s ≥ 0, we can consider the
equivalent bound constrained problem:
min f(x) s.t. h(x) = 0, g(x) + s = 0, s ≥ 0, (6.2)
where the nonlinearities are now shifted away from the inequalities.
More generally, we can write the NLP with double-sided bounds as
min f(x) s.t. c(x) = 0, xL ≤ x ≤ xU (6.3)
Through redefinition of the constraints and variables, it is easy to show (see Exercise 1)
that problems (6.1), (6.2), and (6.3) are equivalent formulations. On the other hand, it is not
true that optimization algorithms will perform identically on these equivalent formulations.
Consequently, one may prefer to choose an appropriate NLP formulation that is best suited
for the structure of a particular algorithm. For instance, bound constrained problems (6.2)
and (6.3) have the advantage that trial points generated from constraint linearizations will
never violate the bound constraints. Also, it is easier to identify active constraints through
active bounds on variables and to partition the variables accordingly. Finally, by using only
bound constraints, the structure of the nonlinearities remains constant over the optimization
and is not dependent on the choice of the active set. On the other hand, feasibility measures
for merit functions (see Chapter 5) are assessed more accurately in (6.1) than in the bound
constrained formulations. This can lead to larger steps and better performance in the line
search or trust region method.
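The slack reformulation from (6.1) to the bound constrained form is mechanical and easy to sketch; the inequality g below is an illustrative stand-in, not a problem from the text.

```python
# Sketch of moving an inequality g(x) <= 0 into the bound constrained
# form (6.2): g(x) + s = 0 with s >= 0, shifting the nonlinearity away
# from the inequality.

def g(x):
    return [x[0] ** 2 + x[1] - 1.0]        # example inequality g(x) <= 0

def slack_residual(x, s):
    """Equality residuals of the reformulated problem g(x) + s = 0."""
    return [gi + si for gi, si in zip(g(x), s)]

def feasible_slacks(x):
    """At a point with g(x) <= 0, s = -g(x) >= 0 makes the residual zero."""
    return [-gi for gi in g(x)]
```

Every point feasible for the original inequality yields a feasible (x, s) pair here, and the inequality itself has become a simple bound on s, which is the advantage cited above for trial points generated from constraint linearizations.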
For the above reasons and to ease development of the methods as a direct extension
of Chapter 5, this chapter derives and analyzes NLP algorithms using the framework of
problem (6.3). Nevertheless, the reader should feel comfortable in going back and forth
between these NLPs in order to take best advantage of a particular optimization application.
To motivate the development of the algorithms in this chapter, we consider the first
order KKT conditions for (6.3) written as
∇f(x) + ∇c(x)v − uL + uU = 0, (6.4a)
c(x) = 0, (6.4b)
0 ≤ uL ⊥ (x − xL) ≥ 0, (6.4c)
0 ≤ uU ⊥ (xU − x) ≥ 0. (6.4d)
Note that for the complementarity conditions, y ⊥ z denotes that y(i) = 0 (inclusive) or
z(i) = 0 for all elements i of these vectors. While it is tempting to solve (6.4) directly as a
set of nonlinear equations, conditions (6.4c)–(6.4d) can make the resulting KKT conditions
ill-conditioned. This issue will be explored in more detail in Chapter 11.
To deal with these complementarity conditions, two approaches are commonly applied
in the design of NLP algorithms. In the active set strategy, the algorithm sets either the
variable x(i) to its bound, or the corresponding bound multiplier uL,(i) or uU ,(i) to zero.
Once this assignment is determined, the remaining equations (6.4a)–(6.4b) can be solved
for the remaining variables.
On the other hand, for the interior point (or barrier) strategy, (6.4c)–(6.4d) are relaxed to
UL (x − xL) e = μ e, UU (xU − x) e = μ e,
where μ > 0, e^T = [1, 1, . . . , 1], UL = diag{uL}, and UU = diag{uU}. By taking care that the
variables x stay strictly within bounds and multipliers uL and uU remain strictly positive,
the relaxed KKT equations are then solved directly with µ fixed. Then, by solving a sequence
of these equations with µ → 0, we obtain the solution to the original nonlinear program.
With either approach, solving the NLP relies on the solution of nonlinear equations,
with globalization strategies to promote convergence from poor starting points.
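A tiny one-variable instance makes the μ → 0 mechanism concrete. For min (x − 2)² with the single bound x ≥ 0, stationarity gives 2(x − 2) − u = 0 and the relaxed complementarity gives u·x = μ, so the relaxed KKT point has a closed form for each μ. This is an illustrative example, not the method of the text, and the function name is made up.

```python
import math

def central_path_point(mu):
    """Relaxed KKT solution of min (x-2)^2 s.t. x >= 0 for barrier value mu.

    Substituting u = 2(x-2) into u*x = mu yields 2x^2 - 4x - mu = 0,
    whose positive root keeps x strictly feasible and u strictly positive.
    """
    x = 1.0 + math.sqrt(1.0 + mu / 2.0)
    u = 2.0 * (x - 2.0)
    return x, u
```

Driving μ through a decreasing sequence, the pair (x, u) traces the central path and converges to the true solution (x*, u*) = (2, 0), with x strictly inside the bound and u strictly positive at every μ > 0, exactly the safeguards described above.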
In this chapter, we develop three general NLP strategies derived from (6.4), using
either active set or interior point methods that extend Newton’s method to KKT systems.
The next section develops the SQP method as a direct active set extension of Chapter 5.
In addition to the derivation of the SQP step, line search and trust region strategies are
developed and convergence properties are discussed that build directly on the results and
algorithms in the previous chapter. Section 6.3 then considers the development of interior
point strategies with related Newton-based methods that handle inequalities through the use
of barrier functions.
In Section 6.4 we consider a nested, active set approach, where the variables are
first partitioned in order to deal with the complementarity conditions (6.4c)–(6.4d). The
remaining equations are then solved with (6.4b) nested within (6.4a). These methods apply
concepts of reduced gradients as well as gradient projection algorithms. Finally, Section 6.5
considers the implementation of these algorithms and classifies several popular NLP codes
using elements of these algorithms. A numerical performance study is presented in this
section to illustrate the characteristic features of these methods.
[ ∇xx L(x^k, u^k, v^k)   ∇c(x^k)   −EL   EU ] [ dx   ]      [ ∇x L(x^k, u^k, v^k) ]
[ ∇c(x^k)^T              0         0     0  ] [ dv   ]  = − [ c(x^k)              ]
[ −EL^T                  0         0     0  ] [ duAL ]      [ EL^T (xL − x^k)     ]
[ EU^T                   0         0     0  ] [ duAU ]      [ EU^T (x^k − xU)     ]   (6.8)
or, equivalently,
[ ∇xx L(x^k, u_A^k, v^k)   ∇c(x^k)   −EL   EU ] [ dx  ]      [ ∇f(x^k)         ]
[ ∇c(x^k)^T                0         0     0  ] [ v̄   ]  = − [ c(x^k)          ]
[ −EL^T                    0         0     0  ] [ ūAL ]      [ EL^T (xL − x^k) ]
[ EU^T                     0         0     0  ] [ ūAU ]      [ EU^T (x^k − xU) ]   (6.9)
with ūAL = ukAL + duAL , ūAU = ukAU + duAU , and v̄ = v k + dv . As shown in Chapter 5, these
systems correspond to the first order KKT conditions of the following quadratic program-
ming problem:
min_dx ∇f(x^k)^T dx + (1/2) dx^T W^k dx (6.10)
s.t. c(x^k) + ∇c(x^k)^T dx = 0,
EU^T (x^k + dx) = EU^T xU,
EL^T (x^k + dx) = EL^T xL,
where W^k = ∇xx L(x^k, u^k, v^k) = ∇xx f(x^k) + Σ_{j=1}^m ∇xx c(j)(x^k) v(j)^k. (Note that terms for the
linear bound constraints are absent.) Moreover, by relaxing this problem to allow selection
of active bounds (local to x k ), we can form the following quadratic program (QP) for each
iteration k
min_dx ∇f(x^k)^T dx + (1/2) dx^T W^k dx (6.11)
s.t. c(x^k) + ∇c(x^k)^T dx = 0,
xL ≤ x^k + dx ≤ xU.
In order to solve (6.3), solution of the QP leads to the choice of the likely active set with
quadratic programming multipliers ūL , ūU , and v̄; this choice is then updated at every
iteration. Moreover, if (6.11) has the solution dx = 0, one can state the following theorem.
Theorem 6.1 Assume that the QP (6.11) has a solution where the active constraint gradients
A(x k ) = [∇c(x k ) | EL | EU ] have linearly independent columns (LICQ is satisfied) and that
the projection of W k into the null space of A(x k )T is positive definite. Then (x k , v̄, ūL , ūU )
is a KKT point of (6.3) if and only if dx = 0.
Proof: The above assumptions imply that the solution of (6.11) is equivalent to (6.9), and
that the KKT matrix in (6.9) is nonsingular. If (x k , ū, v̄) is a KKT point that satisfies (6.7),
then the only solution of (6.9) is dx = 0, ūAL , ūAU , and v̄. For the converse, if dx = 0 is a
solution to (6.11) (and (6.9)), then by inspection of (6.9) we see that the vector (x k , ūL , ūU , v̄)
directly satisfies the KKT conditions (6.7).
Therefore, by substituting the QP (6.11) for the Newton step, we develop the SQP
method as an extension of Algorithm 5.2. Moreover, many of the properties developed
in Chapter 5 carry over directly to the treatment of inequality constraints, and analogous
algorithmic considerations apply directly to SQP. These issues are briefly considered below
and include
• One can invoke a restoration phase in order to find a point that is less infeasible and
for which (6.4) is likely to be solved. This approach was discussed in Section 5.5.2 in
the context of filter trust region methods and will also be considered later in Section
6.2.3.
• One can formulate a QP subproblem with an elastic mode where constraints are
relaxed, a feasible region is “created,” and a solution can be guaranteed. Below we
consider two versions of the elastic mode that are embedded in widely used SQP codes.
To motivate the first relaxation we consider the ℓ1 penalty function (5.59):
φ1(x; ρ) = f(x) + ρ ||c(x)||_1. (6.12)
With this function we can formulate the QP subproblem
min_dx ∇f(x^k)^T dx + (1/2) dx^T W^k dx + ρ Σ_j |c(j)(x^k) + ∇c(j)(x^k)^T dx|
s.t. xL ≤ x^k + dx ≤ xU.
Introducing additional nonnegative variables s and t allows us to reformulate this problem
as the so-called ℓ1 QP:
min_dx ∇f(x^k)^T dx + (1/2) dx^T W^k dx + ρ Σ_j (sj + tj) (6.13)
s.t. c(x^k) + ∇c(x^k)^T dx = s − t,
xL ≤ x^k + dx ≤ xU, s, t ≥ 0.
Because dx = 0, sj = max(0, c(j)(x^k)), tj = −min(0, c(j)(x^k)) is a feasible point for (6.13),
and if we choose W^k as a bounded, positive definite matrix, one can always generate a QP
solution. Moreover, as discussed in Exercise 2, if dx ≠ 0, this solution serves as a descent
direction for the ℓ1 merit function
φ1(x; ρ) = f(x) + ρ ||c(x)||_1
and allows the SQP algorithm to proceed with a reduction in φ1 .
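The elastic construction above is easy to sketch: split each linearized constraint residual into nonnegative parts s and t so that dx = 0 is always feasible. The helper names below are illustrative.

```python
# Sketch of the elastic-variable construction for the l1 QP (6.13).

def elastic_start(c_k):
    """Feasible starting slacks at dx = 0: s picks up positive residuals,
    t picks up negative ones, so that c(x^k) = s - t with s, t >= 0."""
    s = [max(0.0, cj) for cj in c_k]
    t = [-min(0.0, cj) for cj in c_k]
    return s, t

def residual(c_k, s, t):
    """Residual of c(x^k) + grad_c^T dx = s - t evaluated at dx = 0."""
    return [cj - (sj - tj) for cj, sj, tj in zip(c_k, s, t)]
```

At this starting point the penalty contribution ρ Σ(sj + tj) equals ρ ||c(x^k)||_1, which is exactly the constraint violation charged by the ℓ1 merit function.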
An alternative strategy that allows solution of a smaller QP subproblem is motivated
by extending (6.3) to the form
min f (x) + M̄ ξ̄ (6.14)
s.t. c(x)(1 − ξ̄ ) = 0,
xL ≤ x ≤ xU , ξ̄ ∈ [0, 1]
with M̄ as an arbitrarily large penalty weight on the scalar variable ξ̄ . Clearly, solving (6.14)
with ξ̄ = 0 is equivalent to solving (6.3). Moreover, a solution of (6.14) with ξ̄ < 1 also
solves (6.3). Finally, if there is no feasible solution for (6.3), one obtains a solution to (6.14)
with ξ̄ = 1. Choosing M̄ sufficiently large therefore helps to find solutions to (6.14) with
ξ̄ = 0 when feasible solutions exist for (6.3). For (6.14), we can write the corresponding QP
subproblem:
min_dx ∇f(x^k)^T dx + (1/2) dx^T W^k dx + M ξ (6.15)
s.t. c(x^k)(1 − ξ) + ∇c(x^k)^T dx = 0,
xL ≤ x^k + dx ≤ xU, ξ ≥ 0
with a slight redefinition of variables, ξ = (ξ̄ − ξ̄^k)/(1 − ξ̄^k) and M = (1 − ξ̄^k) M̄, assuming that ξ̄^k < 1.
Again, because dx = 0 and ξ = 1 is a feasible point for (6.15), a QP solution can be found.
To guarantee a unique solution (and nonsingular KKT matrix) when ξ > 0, QP (6.15) can
be modified to
min_dx ∇f(x^k)^T dx + (1/2) dx^T W^k dx + M(ξ + ξ²/2) (6.16)
s.t. c(x^k)(1 − ξ) + ∇c(x^k)^T dx = 0,
xL ≤ x^k + dx ≤ xU, ξ ≥ 0.
As the objective is bounded over the feasible region with a positive definite Hessian, the
QP (6.16) has a unique solution. Moreover, if a solution with dx = 0 can be found, the SQP
algorithm then proceeds with a reduction in φ1 (x, ρ), as shown in the following theorem.
Theorem 6.2 Consider the QP (6.16) with a positive definite W k (i.e., the Hessian of the
Lagrange function or its quasi-Newton approximation). If the QP has a solution with ξ < 1,
then dx is a descent direction for the ℓp merit function φp(x; ρ) = f(x) + ρ ||c(x)||_p (with
p ≥ 1) for all ρ > ρ̄, where ρ̄ is chosen sufficiently large.
Proof: We apply the definition of the directional derivative (5.63) to the p merit function.
From the solution of (6.16) we have
and
||c(x^k + α dx)||_p − ||c(x^k)||_p = (||c(x^k + α dx)||_p − ||c(x^k) + α ∇c(x^k)^T dx||_p)
  + (||c(x^k) + α ∇c(x^k)^T dx||_p − ||c(x^k)||_p)
= (||c(x^k + α dx)||_p − ||c(x^k) + α ∇c(x^k)^T dx||_p)
  − α(1 − ξ)||c(x^k)||_p.
From the derivation of the directional derivatives in Section 5.6.1, we have for some constant
γ > 0,
and applying the definition (5.63) as α goes to zero leads to the following expression for
the directional derivative:
Algorithm 6.1.
Choose constants η ∈ (0, 1/2), convergence tolerances ε1 > 0, ε2 > 0, and τ, τ̄ with
0 < τ ≤ τ̄ < 1. Set k := 0 and choose starting points x^0, v^0, uL^0, uU^0 and an initial value ρ^{−1} > 0
for the penalty parameter.
For k ≥ 0, while ||dx|| > ε1 and max(||∇L(x^k, uL^k, uU^k, v^k)||, ||c(x^k)||) > ε2:
1. Evaluate f (x k ), ∇f (x k ), c(x k ), ∇c(x k ), and the appropriate Hessian terms (or approx-
imations).
2. Solve the QP (6.16) to calculate the search step dx . If ξ = 1, stop. The constraints are
inconsistent and no further progress can be made.
3. Set α k = 1.
4. Update ρ^k = max{ρ^{k−1}, ||v̄||_q + ε}.
5. Test the line search condition
φp (x k + α k dx ; ρ k ) ≤ φp (x k ; ρ k ) + ηα k Ddx φp (x k ; ρ k ). (6.23)
With the relaxation of the QP subproblem, we can now invoke many of the properties
developed in Chapter 5 for equality constrained optimization. In particular, from Theo-
rem 6.2 the QP solution leads to a strong descent direction for φp (x, ρ). This allows us to
modify Theorem 5.10 to show the following “semiglobal” convergence property for SQP.
Theorem 6.4 Assume that the sequences {x k } and {x k + dx } generated by Algorithm 6.1
are contained in a closed, bounded, convex region with f (x) and c(x) having uniformly
continuous first and second derivatives. If W k is positive definite and uniformly bounded,
LICQ holds and all QPs are solvable with ξ < 1, and ρ ≥ ||v̄|| + ε for some ε > 0, then all
limit points of x k are KKT points of (6.3).
Moreover, the following property provides conditions under which active sets do not
change near the solution.
Theorem 6.5 [294, 336] Assume that x ∗ is a local solution of (6.3) which satisfies the LICQ,
sufficient second order conditions, and strict complementarity of the bound multipliers
(i.e., u*_{L,(i)} + (x*_{(i)} − x_{L,(i)}) > 0 and u*_{U,(i)} + (x_{U,(i)} − x*_{(i)}) > 0 for all vector elements i).
Also, assume that the iterates generated by Algorithm 6.1 converge to x ∗ . Under these
conditions, there exists some neighborhood for iterates x k ∈ N (x ∗ ) where the QP (6.16)
solution yields ξ = 0 and the active set from (6.16) is the same as that of (6.3).
Therefore, under the assumptions of Theorem 6.5, the active set remains unchanged
in a neighborhood around the solution and we can apply the same local convergence results
as in the equality constrained case in Chapter 5. Redefining v T := [uTL , uTU , v T ], these results
can be summarized as follows.
• If we set W k = ∇xx L(x k , v k ), then, by applying Algorithm 6.1 and with x k and v k
sufficiently close to x ∗ and v ∗ , there exists a constant Ĉ > 0 such that
|| (x^k + dx − x*, v^k + dv − v*) || ≤ Ĉ || (x^k − x*, v^k − v*) ||²; (6.24)

lim_{k→∞} ||x^k + dx − x*|| / ||x^k − x*|| = 0 (6.25)
if and only if
Finally, as noted in Chapter 5, stitching the global and local convergence properties
together follows from the need to take full steps in the neighborhood of the solution. Unfor-
tunately, the Maratos effect encountered in Chapter 5 occurs for the inequality constrained
problem as well, with poor, “creeping” performance near the solution due to small step sizes.
As noted before, second order corrections and related line searches overcome the Maratos
effect and lead to full steps near the solution.
For inequality constrained problems solved with SQP, the Watchdog strategy [93] is
a particularly popular line search strategy. This approach is based on ignoring the Armijo
descent condition (6.23) and taking a sequence of iW ≥ 1 steps with α = 1. If one of these
points leads to a sufficient improvement in the merit function, then the algorithm proceeds.
Otherwise, the algorithm restarts at the beginning of the sequence with the normal line
search procedure in Algorithm 6.1. A simplified version of the Watchdog strategy is given
below as a modification of Algorithm 6.1.
Algorithm 6.2.
Choose constants η ∈ (0, 1/2), convergence tolerances ε1 > 0, ε2 > 0, and τ, τ̄ with
0 < τ < τ̄ < 1. Set k := 0 and choose starting points x^0, v^0, and an initial value ρ^{−1} > 0 for
the penalty parameter.
For k ≥ 0, while ||dx|| > ε1 and max(||∇L(x^k, uL^k, uU^k, v^k)||, ||c(x^k)||) > ε2:
1. Evaluate f (x k ), ∇f (x k ), c(x k ), ∇c(x k ), and the appropriate Hessian terms (or approx-
imations).
2. Solve the QP (6.16) to calculate the search step dx . If ξ = 1, stop. The constraints are
inconsistent and no further progress can be made.
4. For i ∈ {1, . . . , iW}:
• Set x k+i = x k+i−1 + dxi , where dxi is obtained from the solution of QP (6.16) at
x k+i−1 .
• Evaluate f (x k+i ), ∇f (x k+i ), c(x k+i ), ∇c(x k+i ) and update the Hessian terms
(or approximations).
• If φp(x^{k+i}; ρ^k) ≤ φp(x^k; ρ^k) + η Ddx φp(x^k; ρ^k), then accept the step, set
k = k + i − 1, v^{k+1} = v̄, uL^{k+1} = ūL, and uU^{k+1} = ūU, and proceed to step 1.
φp (x k + α k dx ; ρ k ) ≤ φp (x k ; ρ k ) + ηα k Ddx φp (x k ; ρ k ). (6.28)
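The watchdog logic can be sketched compactly; `phi`, `step`, and the constants below are illustrative placeholders rather than the exact bookkeeping of Algorithm 6.2.

```python
# Sketch of the watchdog relaxation: take up to i_w free full steps and
# accept as soon as the merit function phi shows sufficient decrease;
# otherwise fall back to the point where the sequence started.

def watchdog(x0, step, phi, d_phi0, i_w=3, eta=1e-4):
    """x0: starting point; step: one full SQP step; phi: merit function;
    d_phi0: directional derivative of phi at x0 (negative for descent)."""
    x, base = x0, phi(x0)
    for _ in range(i_w):
        x = step(x)                          # full step, no line search test
        if phi(x) <= base + eta * d_phi0:
            return x, True                   # sufficient decrease: keep going
    return x0, False                         # restart with a normal line search
```

By deferring the Armijo test for a few iterations, full steps that temporarily increase the merit function (as in the Maratos effect) are given a chance to pay off before the safeguard is reinstated.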
where Z k is a full rank n × (n − m) matrix spanning the null space of (∇c(x k ))T and Y k is
any n × m matrix that allows [Y k | Z k ] to be nonsingular. A number of choices for Y k and
Z k are discussed in Section 5.3.2. These lead to the normal and tangential steps, Y k pY and
Z k pZ , respectively. Substituting (6.29) into the linearized equations c(x k ) + ∇c(x k )T dx = 0
directly leads to the normal step
Assuming that ∇c(x k ) has full column rank, we again write this step, pY , as
To determine the tangential component, pZ , we substitute (6.30) directly into (6.11) to obtain
the following QP in the reduced space:
    min_{p_Z}  ((Z^k)^T ∇f(x^k) + w^k)^T p_Z + (1/2) p_Z^T B̄^k p_Z        (6.32)
    s.t.  x_L ≤ x^k + Y^k p_Y + Z^k p_Z ≤ x_U,
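To make the decomposition concrete, here is a minimal sketch of the coordinate-basis choice from Section 5.3.2, assuming the first m columns of ∇c(x)^T form a nonsingular basis C (the function names and the dense linear algebra are illustrative):

```python
import numpy as np

def coordinate_bases(jac_cT):
    """Coordinate (partitioned) bases for the null-space decomposition.
    jac_cT is the m x n matrix grad c(x)^T = [C | N], with C (m x m)
    assumed nonsingular."""
    m, n = jac_cT.shape
    C, N = jac_cT[:, :m], jac_cT[:, m:]
    Z = np.vstack([-np.linalg.solve(C, N), np.eye(n - m)])  # jac_cT @ Z = 0
    Y = np.vstack([np.eye(m), np.zeros((n - m, m))])
    return Y, Z

def normal_step(jac_cT, c, Y):
    """p_Y from the linearized constraints c + jac_cT (Y p_Y) = 0."""
    return np.linalg.solve(jac_cT @ Y, -c)
```

The full step is then assembled as d_x = Y p_Y + Z p_Z once the tangential component p_Z is obtained from the reduced-space QP.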
    min_{p_Z}  ((Z^k)^T ∇f(x^k) + w^k)^T p_Z + (1/2) p_Z^T B̄^k p_Z + M(ξ + ξ^2/2)        (6.33)
    s.t.  x_L ≤ x^k + (1 − ξ)Y^k p_Y + Z^k p_Z ≤ x_U.
The QP (6.33) can then be solved if B̄ k is positive definite, and this property can be checked
through a Cholesky factorization. If B̄ k is not positive definite, the matrix can be modified ei-
ther by adding to it a sufficiently large diagonal term, say δR I , applying a modified Cholesky
factorization, or through the BFGS update (5.31) and (5.48).
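As a sketch, the Cholesky-based check with an increasing diagonal shift might look like this (the initial shift and growth factor are illustrative constants):

```python
import numpy as np

def make_positive_definite(B, delta0=1e-4, grow=10.0, max_tries=30):
    """Check positive definiteness of B via a Cholesky factorization;
    if the factorization fails, add an increasing diagonal shift
    delta_R * I until it succeeds (one of the modifications mentioned
    in the text)."""
    delta = 0.0
    for _ in range(max_tries):
        try:
            np.linalg.cholesky(B + delta * np.eye(B.shape[0]))
            return B + delta * np.eye(B.shape[0]), delta
        except np.linalg.LinAlgError:
            delta = delta0 if delta == 0.0 else delta * grow
    raise RuntimeError("could not make matrix positive definite")
```

A positive definite input is returned unchanged (with a zero shift), so the check costs only one Cholesky factorization in the common case.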
This reduced-space decomposition takes advantage of the sparsity and structure of
∇c(x). Moreover, through the formation of basis matrices Z k and Y k we can often take
advantage of a “natural” partitioning of dependent and decision variables, as will be seen
in Chapter 7. The tangential QP requires that the projected Hessian have positive curvature,
which is a less restrictive assumption than full-space QP. Moreover, if second derivatives
are not available, a BFGS update provides an inexpensive approximation, B̄ k , as long as
n − m is small. On the other hand, it should be noted that reduced-space matrices in (6.33)
are dense and expensive to construct and factorize when n − m becomes large.
Moreover, solution of (6.33) can be expensive if there are many bound constraints.
For the solution of (6.11), the QP solver usually performs a search in the space of the primal
variables. First a (primal) feasible point is found and the iterates of the solver reduce the
quadratic objective until the optimality conditions are satisfied (and dual feasible). On the
other hand, the feasible regions for (6.32) or (6.33), may be heavily constrained and difficult
to navigate if the problem is ill-conditioned. In this case, dual space QP solvers [26, 40]
perform more efficiently as they do not remain in the feasible region of (6.33). Instead, these
QP solvers find the unconstrained minimum of the quadratic objective function in (6.33)
and then update the solution by successively adding violated inequalities. In this way, the
QP solution remains dual feasible and the QP solver terminates when its solution is primal
feasible.
The reduced-space algorithm can be expressed as a modification of Algorithm 6.1 as
follows.
Algorithm 6.3.
Choose constants η ∈ (0, 1/2), convergence tolerances ε_1 > 0, ε_2 > 0, and τ, τ̄ with
0 < τ ≤ τ̄ < 1. Set k := 0 and choose starting points x^0, v^0, u_L^0, u_U^0, and an initial value
ρ_{−1} > 0 for the penalty parameter.
For k ≥ 0, while ‖d_x‖ > ε_1 and max(‖∇L(x^k, u_L^k, u_U^k, v^k)‖, ‖c(x^k)‖) > ε_2:
1. Evaluate f (x k ), ∇f (x k ), c(x k ), ∇c(x k ), and the appropriate Hessian terms (or ap-
proximations).
3. Calculate the tangential step pZ as well as the bound multipliers ūL , ūU from the QP
(6.33). If ξ = 1, stop. The constraints are inconsistent and no further progress can be
made.
4. Calculate the search step dx = Z k pZ +Y k pY along with the multipliers for the equality
constraints:
5. Set α k = 1.
φ_p(x^k + α^k d_x; ρ^k) ≤ φ_p(x^k; ρ^k) + η α^k D_{d_x} φ_p(x^k; ρ^k).        (6.35)
by e_i^T d_x = b_i with its corresponding scalar multiplier ū_i, then (6.9) is augmented as the
following linear system:

    [ K        ê_i ] [ z   ]   [ r   ]
    [ ê_i^T    0   ] [ ū_i ] = [ b_i ],        (6.36)
where ê_i^T = [e_i^T, 0, . . . , 0] has the same dimension as z. Using the Schur complement
method, we can apply a previous factorization of K and compute the solution of this
augmented system.
well. This approach also allows a number of sophisticated matrix updates to the Schur com-
plement as well as ways to treat degeneracy in the updated matrix. More details of these
large-scale QP solvers can be found in [26, 163].
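A sketch of the Schur complement solve for the augmented system (6.36), reusing an existing factorization of K, is given below. A dense SciPy LU factorization stands in for the sparse symmetric indefinite factorization used in practice, and the function name is illustrative:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def schur_solve(K_fac, e_hat, r, b):
    """Solve [[K, e_hat], [e_hat^T, 0]] [z; u_bar] = [r; b] by the
    Schur complement method, reusing the factorization K_fac of K."""
    y = lu_solve(K_fac, r)                   # y = K^{-1} r
    w = lu_solve(K_fac, e_hat)               # w = K^{-1} e_hat
    u_bar = (e_hat @ y - b) / (e_hat @ w)    # from e_hat^T z = b
    z = y - u_bar * w                        # z = K^{-1}(r - u_bar * e_hat)
    return z, u_bar
```

Each additional active bound appends another column ê and enlarges only the small, dense Schur complement, while the factorization of K is reused.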
Finally, to maintain full steps near the solution, and to avoid the Maratos effect, line
searches should be modified and built into these large-scale algorithms. Both second order
corrections and the Watchdog strategy are standard features in a number of large-scale SQP
codes [313, 404, 251].
• Once the restoration phase is completed in step 8, the new iterate xk+1 must also
satisfy the bounds in (6.3).
The modified algorithm enjoys the global convergence property of Theorem 5.11,
provided the following additional assumptions are satisfied for all successful iterates k:
• the smallest step from x^k to a feasible point of (6.13) is bounded above by the
infeasibility measure M_C ‖c(x^k)‖ for some M_C > 0.
These conditions are used to define a sufficiently consistent QP that does not require a
restoration step.
Finally, the local convergence rates are similar to those of Algorithms 6.1 and 6.3.
Again, a modified line search feature, such as the Watchdog strategy or a second order cor-
rection, is required to avoid the Maratos effect and take full steps near the solution. If full
steps are taken and the solution x ∗ , v ∗ , u∗L , u∗U satisfies sufficient second order conditions,
LICQ, and strict complementarity of the active constraints, then the SQP filter algorithm
with the above assumptions has the same convergence rates as in Algorithms 6.1 and 6.3.
Following the reduced-space decomposition, we can modify the normal and tangential sub-
problems, (6.30) and (6.32), respectively, and apply the trust region concepts in Chapter 5.
The normal trust region subproblem
    min_{p_Y}  ‖c(x^k) + ∇c(x^k)^T Y^k p_Y‖^2

is often a small NLP (in R^{n−m}) with a quadratic objective, a single quadratic constraint,
and bounds. For this problem we can apply a classical trust region algorithm extended to
deal with simple bounds. If (n − m) is small, this problem can be solved with an approach
that uses direct matrix factorizations [154]; else an inexact approach can be applied based on
truncated Newton methods (see, e.g., [100, 261]). A particular advantage of these approaches
is their ability to deal with directions of negative curvature to solve the subproblem within
the trust region. For a more comprehensive discussion of these methods, see [281, 154].
In addition, we can coordinate the trust regions, ξ_N Δ and Δ̃, in order to enforce
‖Y^k p_Y + Z^k p_Z‖ ≤ Δ, provided that we obtain sufficient reduction in the ℓ_2 merit
function.
Assumptions TR: The sequence {x k } generated by the trust region algorithm is contained
in a convex set D with the following properties:
1. The functions f : R^n → R and c : R^n → R^m and their first and second derivatives are
Lipschitz continuous and uniformly bounded in norm over D. Also, W^k is uniformly
bounded in norm over D.
2. The matrix ∇c(x) has full column rank for all x ∈ D, and there exist constants γ0 and
β0 such that
    ‖Y(x)[∇c(x)^T Y(x)]^{−1}‖ ≤ γ_0,    ‖Z(x)‖ ≤ β_0,        (6.46)
for all x ∈ D.
Theorem 6.6 [100] Suppose Assumptions TR hold and strict complementarity holds at the
limit points. Then Algorithm 5.5 with the modified tangential subproblem (6.44) has the
following properties:
In other words, all limit points x ∗ are feasible and are first order KKT points for the NLP
(6.39).
In addition, fast local convergence rates can be proved, depending on the choices for
the Hessian terms. These results are similar to those obtained with line search SQP methods
(see [100, 294]).
Note that a suitable reformulation of variables and equations (see Exercise 3) can be used
to represent (6.3) as (6.48). To deal with the nonnegative lower bound on x, we form a
log-barrier function and consider the solution of a sequence of equality constrained NLPs
of the form
    min_x  ϕ_{µ_l}(x) = f(x) − µ_l Σ_{i=1}^{n_x} ln(x_{(i)})        (6.49)
    s.t.   c(x) = 0,  x > 0,
where the integer l is the sequence counter and liml→∞ µl = 0. Note that the logarithmic
barrier term becomes unbounded at x = 0, and therefore the path generated by the algorithm
must lie in a region that consists of strictly positive points, x > 0, in order to determine the
solution vector of (6.49). As the barrier parameter decreases, the solutions x(µl ) approach
the solution of (6.48). This can be seen in Figure 6.1. Also, if c(x) = 0 satisfies the LICQ,
the solution of (6.49) with µ > 0 satisfies the following first order conditions:
where X = diag{x}, e = [1, 1, . . . , 1]T , and the solution vector x(µ) > 0, i.e., it lies strictly
in the interior. Equations (6.50) are known as the primal optimality conditions to denote the
absence of multipliers for the inequality constraints. A key advantage of the barrier approach
is that decisions about active sets are avoided and the methods developed in Chapter 5 can
now be applied. Moreover, the following theorem relates a sequence of solutions to (6.49)
to the solution of (6.48).
Figure 6.1. Illustration of barrier solutions for values of µ at 0.1, 0.05, 0.01, 0.001
for the problem min ϕµ (x) = x − µ ln(x). For this problem x(µ) = µ, and as µ → 0 we
approach the solution of the original problem min x s.t. x ≥ 0.
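The example in the caption is easy to verify numerically. The sketch below uses SciPy's bounded scalar minimizer to check that the minimizer of ϕ_µ(x) = x − µ ln(x) is x(µ) = µ for the values of µ shown in the figure (the bracketing interval and tolerance are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def barrier_minimizer(mu):
    """Numerically minimize phi_mu(x) = x - mu*ln(x) over x > 0; the
    stationarity condition 1 - mu/x = 0 gives x(mu) = mu analytically."""
    res = minimize_scalar(lambda x: x - mu * np.log(x),
                          bounds=(1e-12, 10.0), method="bounded",
                          options={"xatol": 1e-12})
    return res.x

# the barrier minimizers x(mu) = mu approach x* = 0, the solution of
# min x s.t. x >= 0, as mu -> 0, exactly as in Figure 6.1
for mu in [0.1, 0.05, 0.01, 0.001]:
    assert abs(barrier_minimizer(mu) - mu) < 1e-6
```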
Theorem 6.7 Consider problem (6.48), with f (x) and c(x) at least twice differentiable,
and let x ∗ be a local constrained minimizer of (6.48). Also, assume that the feasible re-
gion of (6.48) has a strict interior and that the following sufficient optimality conditions
hold at x ∗ :
1. x ∗ is a KKT point,
3. strict complementarity holds for x ∗ and for the bound multipliers u∗ satisfying the
KKT conditions,
4. for L(x, v) = f(x) + c(x)^T v, there exists ω > 0 such that q^T ∇_{xx} L(x^*, v^*) q ≥ ω‖q‖^2
for equality constraint multipliers v ∗ satisfying the KKT conditions and all nonzero
q ∈ Rn in the null space of the active constraints.
• lim_{µ→0^+} x(µ) = x^*;
• ‖x(µ_l) − x^*‖ = O(µ_l).
The proof of Theorem 6.7 is given through a detailed analysis in [147] (in particular, Theorem 3.12
and Lemma 3.13), and it indicates that nearby solutions of (6.49) provide useful
information for bounding solutions of (6.48) for small positive values of µ. On the other
hand, note that the gradients of the barrier function in Figure 6.1 are unbounded at the
constraint boundary, and this leads to very steep and ill-conditioned response surfaces for
ϕµ (x). These can also be observed from the Hessian of problem (6.49):
    W(x, v) = ∇^2 f(x) + Σ_{j=1}^{m} ∇^2 c_{(j)}(x) v_{(j)} + µX^{−2}.
Consequently, because of the extremely nonlinear behavior of the barrier function, direct
solution of barrier problems is often difficult. Moreover, for the subset of variables that are
zero at the solution of (6.48), the Hessian becomes unbounded as µ → 0. On the other hand,
if we consider the quantity q T W (x ∗ , v ∗ )q, with directions q that lie in the null space of the
Jacobian of the active constraints, then the unbounded terms are no longer present and the
conditioning of the corresponding reduced Hessian is not affected by the barrier terms.
This motivates an important modification to the primal optimality conditions (6.50) to
form the primal-dual system. Here we define new “dual” variables along with the equation
Xu = µe and we replace the barrier contribution to form
This substitution and linearization eases the nonlinearity of the barrier terms and now leads
to a straightforward relaxation of the KKT conditions for (6.48). Thus, we can view this
barrier modification as applying a homotopy method to the primal-dual equations with the
homotopy parameter µ, where the multipliers u ∈ Rn correspond to the KKT multipliers for
the bound constraints (6.48) as µ → 0. Note that (6.51) for µ = 0, together with “x, u ≥ 0,”
are the KKT conditions for the original problem (6.48). This defines the basic step for our
interior point method.
We now consider a Newton-based strategy to solve (6.48) via the primal-dual system
(6.51). First, using the elements of the primal-dual equations (6.51), we define an error for
the optimality conditions of the interior point problem as
    E_µ(x, v, u) := max { ‖∇f(x) + ∇c(x)v − u‖_∞, ‖c(x)‖_∞, ‖Xu − µe‖_∞ }.        (6.52)
Similarly, E_0(x, v, u) corresponds to (6.52) with µ = 0, and this measures the error in the
optimality conditions for (6.48). The overall barrier algorithm terminates if an approximate
solution is found that satisfies
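Definition (6.52) translates directly into code. In the sketch below the callables `grad_f`, `jac_c`, and `c` (returning ∇f(x), the n × m matrix ∇c(x), and c(x)) are illustrative assumptions about the interface:

```python
import numpy as np

def opt_error(x, v, u, mu, grad_f, jac_c, c):
    """Optimality error E_mu(x, v, u) of (6.52): the max of the
    infinity norms of the dual feasibility, primal feasibility, and
    (relaxed) complementarity residuals."""
    dual = grad_f(x) + jac_c(x) @ v - u     # gradient of the Lagrangian
    comp = x * u - mu                       # X u - mu e, componentwise
    return max(np.linalg.norm(dual, np.inf),
               np.linalg.norm(c(x), np.inf),
               np.linalg.norm(comp, np.inf))
```

Calling the same function with mu = 0 gives E_0, the termination measure for the original problem (6.48).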
In this description we consider a nested approach to solve (6.49), where the barrier
problem is solved (approximately) in an inner loop for a fixed value of µ. Using l as the
iteration counter in adjusting µ in the outer loop, we apply the following algorithm.
Algorithm 6.4.
Choose constants ε_tol > 0, κ_ε > 0, κ_µ ∈ (0, 1), and θ_µ ∈ (1, 2). Set l := 0 and choose starting
points x^0 > 0, v^0, u^0, and an initial value µ_0 > 0 for the barrier parameter.
with Σ^k := (X^k)^{−1}U^k. This is derived from (6.56) by eliminating the last block row. The
vector d_u^k is then obtained from
Note that (6.57) has the same structure as the KKT system (5.4) considered in Chapter 5.
Consequently, the concepts of (i) maintaining a nonsingular system, (ii) applying quasi-
Newton updates efficiently in the absence of second derivatives, and (iii) developing global
convergence strategies, carry over directly. In particular, to maintain a nonsingular system,
one can modify the matrix in (6.57) and consider the modified linear system
    [ W^k + Σ^k + δ_W I    ∇c(x^k) ] [ d_x^k ]       [ ∇ϕ_µ(x^k) + ∇c(x^k)v^k ]
    [ ∇c(x^k)^T            −δ_A I  ] [ d_v^k ]  = −  [ c(x^k)                 ].        (6.59)
Again, δW and δA can be updated using Algorithm 5.2 in order to ensure that (6.59) has the
correct inertia.
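A sketch of this inertia-correction loop for (6.59) is given below, with δ_A = 0 and an eigenvalue-based inertia test for clarity; production codes instead read the inertia directly from a symmetric indefinite factorization, and the shift constants here are illustrative:

```python
import numpy as np

def pd_step(W, Sigma, A, grad_phi, c, v, delta0=1e-4, grow=10.0):
    """Solve the modified primal-dual system (6.59) with delta_A = 0,
    increasing delta_W until the KKT matrix has the correct inertia
    (n positive and m negative eigenvalues).  A = grad c(x^k) is n x m
    and assumed to have full column rank."""
    n, m = A.shape
    delta_w = 0.0
    for _ in range(40):
        K = np.block([[W + Sigma + delta_w * np.eye(n), A],
                      [A.T, np.zeros((m, m))]])
        eigs = np.linalg.eigvalsh(K)
        if np.sum(eigs > 0) == n and np.sum(eigs < 0) == m:
            rhs = -np.concatenate([grad_phi + A @ v, c])
            sol = np.linalg.solve(K, rhs)
            return sol[:n], sol[n:], delta_w     # d_x^k, d_v^k, delta_W
        delta_w = delta0 if delta_w == 0.0 else delta_w * grow
    raise RuntimeError("inertia could not be corrected")
```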
From the search directions in (6.59) and (6.58), a line search can be applied with step
sizes α k , αuk ∈ (0, 1] determined to obtain the next iterate as
Since x and u are both positive at an optimal solution of the barrier problem (6.49), and
ϕµ (x) is only defined at positive values of x, this property must be maintained for all iterates.
For this, we apply the step-to-the-boundary rule
" #
k
αmax := max α ∈ (0, 1] : x k + αdxk ≥ (1 − τl )x k , (6.61a)
" #
αuk := max α ∈ (0, 1] : uk + αduk ≥ (1 − τl )uk (6.61b)
τl = max{τmin , 1 − µl }. (6.62)
Here, τ_min is typically set close to 1, say 0.99, and τ_l → 1 as µ_l → 0. Note that α_u^k is the
actual step size used in (6.60c), while the step size α^k ∈ (0, α_max^k] for x and v is determined
by a backtracking line search procedure that explores a decreasing sequence of trial step
sizes α^{k,j} = 2^{−j} α_max^k (with j = 0, 1, 2, . . .).
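Componentwise, rule (6.61) reduces to a ratio test over the components with a negative search direction; a minimal sketch:

```python
import numpy as np

def max_step(x, d, tau):
    """Fraction-to-the-boundary rule (6.61): the largest alpha in (0, 1]
    with x + alpha*d >= (1 - tau)*x.  Only components with d_i < 0 can
    cut the step, via the ratio test alpha <= -tau * x_i / d_i."""
    mask = d < 0
    if not np.any(mask):
        return 1.0
    return float(min(1.0, np.min(-tau * x[mask] / d[mask])))
```

The same function computes both α_max^k (applied to x and d_x^k) and α_u^k (applied to u and d_u^k).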
Figure 6.2. Failure of line search for Newton-based interior point method.
lead to a less infeasible point with improved convergence behavior. Alternately, invoking a
feasibility restoration step leads to a less infeasible point, from which convergence to the
solution is possible.
This example motivates the need for a more powerful global convergence approach. In
fact, from Chapter 5 the filter line search and trust region methods, as well as the composite-
step trust region strategy, overcome the poor behavior of this example.
The filter line search method from Chapter 5 can be applied with few modifications to
the solution of the inner problem (6.49) with µl fixed. The barrier function ϕµ (x) replaces
f (x) in Algorithm 5.4 and the line search proceeds by augmenting the filter with previous
iterates or by performing f -type iterates which lead to a sufficient reduction in the barrier
function. Moreover, if a suitably large step size cannot be found (α k,l < α k,min ), then a
feasibility restoration phase is invoked. (This occurs in Example 6.8.)
With the application of a reliable globalization strategy, the solution of the Newton
step (6.56) can be performed with all of the variations described in Chapter 5. In particular,
the symmetric primal-dual system (6.57) can be solved with the following options:
• The Hessian can be calculated directly, or a limited memory BFGS update can be
performed using the algorithm in Section 5.4.1 with B^0 := B^0 + Σ^k.
• If n − m is small, (6.57) can be decomposed into normal and tangential steps and solved
as in Section 5.3.2. Note that the relevant reduced Hessian quantities are modified to
Z^T(W + Σ)Z and Z^T(W + Σ)Y p_Y.
• For the decomposed system, quasi-Newton updates for the reduced Hessians can be
constructed as described in Section 5.4.2. However, additional terms for Z^T Σ Z and
Z^T Σ Y p_Y must also be evaluated and included in the Newton step calculation.
Convergence Properties
The convergence properties for the filter method follow closely from the results for the
equality constrained problem in Chapter 5. In particular, the global convergence property
from Theorem 5.11 follows directly as long as the iterates generated by the filter line search
algorithm for the primal-dual problem remain suitably in the interior of the region de-
fined by the bound constraints, x ≥ 0. To ensure this condition, we require the following
assumptions.
2. The matrix ∇c(x) has full column rank for all x ∈ D, and there exist constants γ0 and
β0 such that
for all x ∈ D.
3. The primal-dual barrier term Σ^k has a bounded deviation from the “primal Hessian”
µ_l(X^k)^{−2}. Suitable safeguards have been proposed in [206] to ensure this behavior.
4. The matrices W k + µ(X k )−2 are uniformly bounded and uniformly positive definite
on the null space of ∇c(x k )T .
5. The restoration phase is not invoked arbitrarily close to the feasible region. (In [403]
a weaker but more detailed assumption is applied.)
These assumptions lead to the following result.
Theorem 6.9 [403] Suppose Assumptions F hold. Then there exists a constant ε_x > 0, so that
x^k ≥ ε_x e for all k.
This theorem is proved in [403] and can be used directly to establish the following
global convergence result.
Theorem 6.10 [403] Suppose Assumptions F(1)–(2) hold for all x^k ∈ D, and the remaining
assumptions hold in the neighborhood of the feasible region ‖c(x^k)‖ ≤ θ_inc, for some θ_inc > 0
(it is assumed that successful steps can be taken here and the restoration phase need not be
invoked); then the filter line search algorithm for (6.49) has the following properties:
In other words, all limit points are feasible, and if {x k } is bounded, then there exists a limit
point x ∗ of {x k } which is a first order optimal point for the equality constrained NLP (6.49).
Finally, for the solution of (6.49) with µl fixed, one would expect to apply the local
properties from Chapter 5 as well. Moreover, close to the solution of (6.49) we expect
the line search to allow large steps, provided that suitable second order corrections are
applied. However, we see that the convergence rate is governed by the value of τl in (6.62)
and the update of µl . While there are a number of updating strategies for µl that lead to
fast convergence rates (see [294, 206, 293]), these need to be implemented carefully to
prevent ill-conditioning of the primal-dual system. Nevertheless, for µl updated by (6.55),
the convergence rate is at best superlinear for the overall algorithm.
directly to p̃N and p̃T . Using the derivation in Section 5.7.1, we formulate the trust region
subproblems as follows:
    min_{p_N}  ‖c(x^k) + ∇c(x^k)^T p_N‖_2^2
with conjugate gradient iterations is generally applied to solve (6.67) inexactly. This solution
is aided by the scaled barrier term X^k Σ^k X^k, an efficient preconditioner whose
eigenvalues cluster close to µ as µ → 0. As with the normal subproblem, if inequality
(6.67c) is violated by the approximate solution, a backtracking procedure can be applied to
recover dx .
Theorem 6.11 Suppose Assumptions TR hold and also assume that ϕµ (x) is bounded below
for all x ∈ D. Then the composite-step trust region algorithm with subproblems (6.66) and
(6.67) has the following properties:
In other words, the limit points x ∗ are feasible and first order KKT points for problem (6.49).
The above result was originally shown by Byrd, Gilbert, and Nocedal [80] for the
barrier trust region method applied to
but modified here with the boundedness assumption for ϕµ , so that it applies to prob-
lem (6.48). In their study, the resulting algorithm allows a reset of the slack variables to
s k+1 = −g(x k+1 ), and the assumption on ϕµ is not needed. Moreover, in the absence of an
assumption on linear independence of the constraint gradients, they show that one of three
situations occurs:
• the iterates x k approach an infeasible point that is stationary for a measure of infea-
sibility, and the penalty parameter ρ k tends to infinity,
• the iterates x k approach a feasible point but the active constraints are linearly depen-
dent, and the penalty parameter ρ k tends to infinity, or
• Theorem 6.11 holds, and the active constraints are linearly independent at x ∗ .
Finally, for the solution of (6.49) with µl fixed, one would expect to apply the local
convergence properties from Chapter 5 as well. Moreover, close to the solution of (6.49) we
expect the trust region to be inactive and allow large steps to be taken, provided that suitable
second order corrections are applied. However, the convergence rate is again governed by
the value of τl in (6.62) and the update of µl . While a number of fast updating strategies
are available for µl (see [294, 206, 293]), some care is needed to prevent ill-conditioning.
Again, as in the line search case, for µ updated by (6.55), the convergence rate is at best
superlinear for the overall algorithm.
poor and misleading information in determining search directions, active constraints, and
constraint multipliers. Nested approaches are designed to avoid these problems.
In this approach the variables are first partitioned in order to deal with the constrained
portions of (6.3):
min f (x)
s.t. c(x) = 0,
xL ≤ x ≤ xU .
We partition the variables into three categories: basic, nonbasic, and superbasic. As de-
scribed in Chapter 4, nonbasic variables are set to their bounds at the optimal solution,
while basic variables can be determined from c(x) = 0, once the remaining variables are
fixed. For linear programs (and NLPs with vertex solutions), a partition for basic and nonba-
sic variables is sufficient to develop a convergent algorithm. In fact, the LP simplex method
proceeds by setting nonbasic variables to their bounds, solving for the basic variables, and
then using pricing information (i.e., reduced gradients) to decide which nonbasic variable
to relax (the driving variable) and which basic variable to bound (the blocking variable).
On the other hand, if f (x) or c(x) is nonlinear, then superbasic variables are needed
as well. Like nonbasic variables, they “drive” the optimization algorithm, but their optimal
values are not at their bounds. To develop these methods, we again consider the first order
KKT conditions of (6.3):
    ∇_x L(x^*, u^*, v^*) = ∇f(x^*) + ∇c(x^*)v^* − u_L^* + u_U^* = 0,
    c(x^*) = 0,
    0 ≤ u_L^* ⊥ (x^* − x_L) ≥ 0,
    0 ≤ u_U^* ⊥ (x_U − x^*) ≥ 0.
We now partition and reorder the variables x as basic, superbasic, and nonbasic variables:
x = [x_B^T, x_S^T, x_N^T]^T with x_B ∈ R^{n_B}, x_S ∈ R^{n_S}, x_N ∈ R^{n_N}, and n_B + n_S + n_N = n.
This partition is derived from local information and may change over the course of the
optimization iterations. Similarly, we partition the constraint Jacobian as
∇c(x)T = [AB (x) | AS (x) | AN (x)]
and the gradient of the objective function as
∇f (x)T = [fB (x)T | fS (x)T | fN (x)T ].
The corresponding KKT conditions can be now written as
    f_S(x^*) + A_S(x^*)^T v^* = 0,                                  (6.69a)
    f_N(x^*) + A_N(x^*)^T v^* − u_L^* + u_U^* = 0,                  (6.69b)
    f_B(x^*) + A_B(x^*)^T v^* = 0,                                  (6.69c)
    c(x^*) = 0,                                                     (6.69d)
    x_{N,(i)}^* = x_{N,L,(i)}  or  x_{N,(i)}^* = x_{N,U,(i)}.       (6.69e)
Note that (6.69e) replaces the complementarity conditions (6.4c)–(6.4d) with the assumption
that u_L^*, u_U^* ≥ 0. We now consider a strategy of nesting equations (6.69c)–(6.69d)
within (6.69a)–(6.69b). At iteration k, for fixed values of x_N^k and x_S^k, we solve for x_B using
(6.69d). We can then solve for v^k = −(A_B^k)^{−T} f_B^k from (6.69c). With x_B^k determined in an
inner problem and x_N^k fixed, the superbasic variables can then be updated from (6.69a) and
Moreover, we define the matrix Z_P that lies in the null space of ∇c(x)^T as follows:

    Z_P = [ −A_B^{−1} A_S ]
          [       I       ]
          [       0       ]
and the reduced gradient is given by Z_P^T ∇f in a similar manner as in Sections 5.3.2 and
6.2.3. Following the same reasoning, we define the reduced Hessian as Z_P^T ∇_{xx} L(x, v) Z_P.
Often, the dimension of the reduced Hessian (n_S) is small and it may be more efficiently
approximated by a quasi-Newton update, B̄, as in Sections 5.3.2 and 6.2.3.
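As a small numerical check of these formulas, the multiplier estimate v = −A_B^{−T} f_B and the reduced gradient Z_P^T ∇f = f_S + A_S^T v can be computed as follows (a dense sketch with illustrative names):

```python
import numpy as np

def reduced_gradient(f_B, f_S, A_B, A_S):
    """Multiplier estimate and reduced gradient for the partition
    grad c(x)^T = [A_B | A_S | A_N]:
        v = -A_B^{-T} f_B   (from (6.69c)),
        Z_P^T grad f = f_S + A_S^T v   (residual of (6.69a))."""
    v = -np.linalg.solve(A_B.T, f_B)
    return v, f_S + A_S.T @ v
```

A zero reduced gradient (together with feasibility and nonnegative bound multipliers) recovers the partitioned KKT conditions (6.69a)–(6.69e).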
With these reduced-space concepts, we consider the nested strategy as the solution of
bound constrained subproblems in the space of the superbasic variables, i.e.,
A basic algorithm for this nested strategy, known as the generalized reduced gradient (GRG)
algorithm, is given as follows.
Algorithm 6.5.
Set l := 0, select algorithmic parameters including the convergence tolerance ε_tol and
minimum step size α_min. Choose a feasible starting point by first partitioning and reordering x
into index sets B^0, S^0, and N^0 for the basic, superbasic, and nonbasic variables, respectively.
Then fix x_S^0 and x_N^0 and solve (6.69d) to yield x_B^0. (This may require some trial
partitions to ensure that x_B remains within bounds.)
For l ≥ 0 and the partition sets B^l, S^l, and N^l, and while ‖∇_x L(x^l, v^l, u^l)‖ > ε_tol:
1. Determine x l+1 from the solution of problem (6.70).
For k ≥ 0, while ‖df(x_B(x_S^{k,l}), x_S^{k,l})/dx_S‖ ≥ ε_tol, at each trial step to solve (6.70),
i.e., x_S(α) = x_S^{k,l} + α d_S, perform the following steps:
2. At x^{l+1}, calculate the multiplier estimates v^{l+1} from (6.69c) and u_L^{l+1} and u_U^{l+1} from
(6.69b).
3. Update the index sets B l+1 , S l+1 , and N l+1 and repartition the variables for the
following cases:
• If x_B cannot be found due to convergence failure, repartition sets B and S.
• If x_B^{k,l} ∉ [x_{B,L}, x_{B,U}], repartition sets B and N.
• If u_L^{l+1} or u_U^{l+1} have negative elements, repartition S and N. For instance,
the nonbasic variable with the most negative multiplier can be added to the
superbasic partition.
• If xS has elements at their bounds, repartition S and N . For instance, superbasic
variables at upper (lower) bounds with negative (positive) reduced gradients
can be added to the nonbasic partition.
Once the algorithm finds the correct variable partition, the algorithm proceeds by up-
dating xS using reduced gradients in a Newton-type iteration with a line search. Convergence
of this step can be guaranteed by Theorem 3.3 presented in Chapter 3.
On the other hand, effective updating of the partition is essential to overcome potential
failure cases in step 3 and ensure reliable and efficient performance. The most serious
repartitioning case is failure to solve the constraint equations, as can be seen in the example in
Figure 6.3. Following Algorithm 6.5, and ignoring nonbasic variables, leads to the horizontal
steps taken for the superbasic variable. Each step requires solution to the constraint equation
(the dashed line), taken by vertical steps for basic variables. In the left graphic, various values
of the superbasic variable fail to solve the constraint and find the basic variable; the step
size is therefore reduced. Eventually, the algorithm terminates at a point where the basis
matrix AB is singular (the slope for the equality is infinity at this point). As seen in the right
graphic, repartitioning (rotating by 90◦ ) leads to a different basis matrix that is nonsingular,
and Algorithm 6.5 terminates successfully. In most implementations, repartitioning sets B
and S in step 3 is often governed by updating heuristics that can lead to efficient performance,
but do not have well-defined convergence properties.
On the other hand, repartitioning between sets S and N can usually be handled through
efficient solution of problem (6.70), and systematically freeing the bounded variables. Al-
ternately, a slightly larger bound constrained subproblem can be considered as well:
min f (xB (xS , xN ), xS , xN ) (6.71)
xS ,xN
s.t. xS,L ≤ xS ≤ xS,U ,
xN,L ≤ xN ≤ xN,U ,
where the nonbasic and basic variables are considered together in step 1 of Algorithm 6.5.
An effective solution strategy for either (6.70) or (6.71) is the gradient projection method,
described next.
To solve (6.72) we first introduce some new concepts. The inequality constraints need to be
handled through the concept of gradient projection. We define the projection operator for a
vector z as the point closest to z in the feasible region Ω, i.e.,
The projected gradient method comprises both projected gradient and projected Newton
steps. Using the developments in [39, 227, 294], we define the new iterates based on the
projected gradient as
Figure 6.4. Projected gradient steps and Cauchy points in bound constrained
problems. Note that z∗ = P (z∗ − ∇f (z∗ )).
where α ≥ 0 is the step-size parameter. We define the Cauchy point as the first local minimizer
of f (z(α)) with respect to α, i.e., zc = z(αc ) where αc = arg min f (z(α)), as shown in
Figure 6.4. Also, as illustrated in this figure, we note the following termination property.
Theorem 6.12 [227, Theorem 5.2.4] Let f(z) be continuously differentiable. A point
z^* ∈ Ω is stationary for problem (6.72) if and only if
Algorithm 6.6.
Choose a starting point z^0, convergence tolerance ε_tol, and step-size restriction parameter
β < 1.
For k ≥ 0 with iterate z^k and values f(z^k) and ∇f(z^k), and while ‖z^k − P(z^k −
α∇f(z^k))‖ > ε_tol:
1. Find the smallest integer j ≥ 0 for which α = (β)^j and z(α) = P(z^k − α∇f(z^k))
satisfies the Armijo inequality
This algorithm is globally convergent, but because it is based only on gradient in-
formation, it converges at a slow rate. To accelerate the convergence rate, it is tempting to
improve the algorithm with a Newton-type step pN = −(B̄ k )−1 ∇f (zk ), where B̄ k is a pos-
itive definite matrix that approximates ∇zz f (zk ), and then calculate z(α) = P (zk + αpN ).
However, once projected into a bound, such a step does not necessarily satisfy descent
conditions and might lead to a line search failure.
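A minimal sketch of the gradient projection method of Algorithm 6.6 for bound constraints is given below. The sufficient decrease test shown is one common form of the Armijo inequality for projected steps (the precise inequality used in the text is not reproduced here), and the constants are illustrative:

```python
import numpy as np

def projected_gradient(f, grad, z0, lo, hi, eta=1e-4, beta=0.5,
                       tol=1e-8, max_iter=500):
    """Gradient projection method for min f(z) s.t. lo <= z <= hi."""
    P = lambda z: np.clip(z, lo, hi)              # projection onto the box
    z = P(z0)
    for _ in range(max_iter):
        g = grad(z)
        if np.linalg.norm(z - P(z - g)) <= tol:   # stationary (Theorem 6.12)
            break
        alpha = 1.0
        for _ in range(60):                       # backtracking, alpha = beta^j
            z_new = P(z - alpha * g)
            if f(z_new) <= f(z) + eta * g @ (z_new - z):
                break
            alpha *= beta
        z = z_new
    return z
```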
Instead, we note that if the active set is known at the solution, then we can reorder
the variables and partition z^k = [(z_I^k)^T, (z_A^k)^T]^T, where z_{A,(i)}^k is at its bound, z_{I,(i)}^k ∈
(z_{I,L,(i)}, z_{I,U,(i)}), and z_I ∈ R^{n_I}. Defining a projection matrix Z_I(z^k) = [I_{n_I} | 0]^T along with
the modified Hessian

    B_R^k = [ Z_I^T B̄^k Z_I    0 ]
            [ 0                I ],        (6.78)
When we are close to z∗ and the correct active set is known (see Theorem 6.5), this leads
to projected gradient steps for zA while providing second order information for zI . On the
other hand, far from the solution the partitioning of z must be relaxed and the set of active
variables must be overestimated. This can be done by defining an ε-active set at iteration k,
where ε^k > 0 and z_A^k is given by the components that satisfy

    z_{U,A,(i)} − z_{A,(i)}^k ≤ ε^k   or   z_{A,(i)}^k − z_{L,A,(i)} ≤ ε^k,        (6.80)
and z_I includes the remaining n_I variables. The quantities B_R and z(α) are now defined
with respect to this new variable partition. Moreover, with the following choice of ε^k,
we have the remarkable property that the active set at the solution remains fixed for all
‖z^k − z^*‖ ≤ δ for some δ > 0, if strict complementarity holds at z^*. By defining
the variable partition remains unchanged in some neighborhood around the solution. As a
result, projected Newton and quasi-Newton methods can be developed that guarantee fast
local convergence.
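The ε-active set test (6.80) and the modified Hessian (6.78) can be sketched as follows. This is a hypothetical illustration: after identifying the active components, B_R keeps second-order information only in the inactive block and acts as the identity on the active one (equivalent to the reordered block form in (6.78) without the explicit reordering).

```python
import numpy as np

def eps_active_partition(z, lo, hi, eps):
    """Identify the eps-active set (cf. (6.80)): variables within eps of a
    bound are treated as active; the rest are inactive (free)."""
    active = (z - lo <= eps) | (hi - z <= eps)
    return ~active, active

def reduced_hessian_model(B, inactive):
    """Modified Hessian in the spirit of (6.78): keep B's block for the
    inactive variables, identity (projected gradient) for the active ones."""
    n = B.shape[0]
    BR = np.eye(n)
    idx = np.where(inactive)[0]
    BR[np.ix_(idx, idx)] = B[np.ix_(idx, idx)]
    return BR
```

A variable sitting within eps of its lower bound thus receives a pure projected-gradient step, while the free variables receive the Newton-type step.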
The resulting Newton-type bound constrained algorithm is given as follows.
Algorithm 6.7.
Choose a starting point z0, a convergence tolerance tol, and a step-size restriction parameter β < 1.
For k ≥ 0 with iterate zk and values f (zk), ∇f (zk), and B̄^k, and while ‖zk − P (zk − α∇f (zk))‖ > tol:
1. Evaluate εk from (6.81), determine the εk-active set, and reorder and partition the variables zk according to (6.80). Calculate B_R^k according to (6.78).
2. Find the smallest integer j ≥ 0 for which α = β^j and z(α), evaluated by (6.79), satisfies the Armijo inequality
Using the first two parts of Assumptions TR, convergence properties for Algorithm 6.7
can be summarized as follows [227, 39]:
• The algorithm is globally convergent; any limit point is a stationary point.
• If the stationary point z∗ satisfies strict complementarity, then the active set and the
associated matrix ZI remain constant for all k > k0, with k0 sufficiently large.
• If the exact Hessian is used and ZIT ∇zz f (z∗ )ZI is positive definite, then the local
convergence rate is quadratic.
• If the BFGS update described in [227] is used and ZIT ∇zz f (z∗ )ZI is positive definite,
then the local convergence rate is superlinear.
As noted in [227] some care is needed in constructing the BFGS update, because
the partition changes at each iteration and only information about the inactive variables is
updated. Nevertheless, Algorithm 6.7 can be applied to a variety of problem types, including
bound constrained QPs, and is therefore well suited for step 1 in Algorithm 6.5.
Finally, as detailed in [39, 227, 294], there are a number of related gradient projection
methods for bound constrained optimization. A popular alternative algorithm uses limited
memory BFGS updates and performs subspace minimizations [294]. Using breakpoints
for α defined by changes in the active set (see Figure 6.4), segments are defined in the
determination of the Cauchy point, and a quadratic model can be minimized within each
segment. This approach forms the basis of the L-BFGS-B code [420]. Trust region versions of gradient projection have also been developed; these include the TRON [261] and TOMS Algorithm 717 [73, 155] codes. As discussed and analyzed in [100], trust region
algorithms do not require positive definite Hessians and deal directly with negative curvature
in the objective function.
Figure 6.5. Progress of major iterations for the MINOS algorithm. At each major
iteration a linearly constrained subproblem is solved.
This subproblem can be solved by partitioning x into basic, superbasic, and nonbasic vari-
ables and applying Algorithm 6.5. The iterations of this algorithm are known as the minor
iterations. With linear equalities, the feasibility step is now greatly simplified, because the
∇c matrix does not change. As a result, much more efficient and reliable solutions can be
obtained for subproblem (6.84). The solution to (6.84) yields the primal and dual variables (x^{k+1}, v^{k+1}, u_L^{k+1}, u_U^{k+1}) for the next major iteration. This sequence of major iterations
is illustrated in Figure 6.5.
The MINOS solver has been implemented very efficiently to take advantage of linearity and sparsity in the Jacobian. In fact, if the problem is totally linear, MINOS reduces to the LP simplex method. For nonlinear problems, it uses a reduced-space method and applies quasi-Newton updates to solve (6.84). The sequence of major iterates x^k converges at a quadratic rate [162],
and because no feasibility step is required, MINOS tends to be more efficient than GRG,
especially if the equalities are “mostly linear.”
In solving (6.84), the penalty term in LA (x) controls the infeasibility of the major itera-
tions. Nevertheless, this approach has no other globalization strategies and is not guaranteed
to converge on problems with highly nonlinear constraints.
engineering optimization problems should therefore leverage their efforts with these well-
developed codes. Much more information on widely available codes can be found on the
NEOS server (www-neos.mcs.anl.gov) and in the NEOS Software Guide [1, 283].
We now consider a representative set of NLP codes that build on the concepts in
this chapter. This list is necessarily incomplete and represents only a sampling of what is
available. Moreover, with continuing development and improvement of many of these codes, the reader is urged to seek more information from the software developers, along with literature and Internet resources, regarding new developments, software availability, and updates to the codes.
• SNOPT [159]: Also developed by Gill, Murray, and Wright [160], SNOPT performs
a full-space, limited-memory BFGS update for the Hessian matrix. It then forms
and solves the QP problem using SQOPT, a reduced-Hessian active set method. It
performs a line search with an augmented Lagrangian function.
• SOCS [41]: Developed at Boeing by Betts and coworkers [40], this code incorporates
a primal sparse quadratic programming algorithm that fully exploits exact sparse
Jacobians and Hessians. It uses an augmented Lagrangian line search along with
specially developed sparse matrix linear algebra.
• IPOPT [206]: Using the algorithm described in [404], this object-oriented code is
part of the open-source COIN-OR project and solves the primal-dual equations and
the linear system (6.56) through a variety of options for sparse, symmetric, indefinite
matrix factorizations. It applies a filter line search and has options for handling exact
Hessian or limited-memory full-space BFGS updates. Also included are a number of
options for adjustment of the barrier parameter µ.
• KNITRO [79]: Using the algorithm described in [81], this code includes two proce-
dures for computing the steps for the primal-dual equations. The Interior/CG version
uses the Byrd–Omojokun algorithm and solves the tangential trust region subproblem via projected
conjugate gradient iterations. The Interior/Direct version solves the primal-dual KKT
matrix using direct linear algebra unless poor steps are detected; it then switches to
Interior/CG.
• LOQO [37]: As described in [392], LOQO is one of the earliest primal-dual NLP
codes and incorporates a line search–based Newton solver with a number of options
for promoting convergence on difficult problems. Also, through a regularization of
the lower right-hand corner of the KKT matrix in (6.56), LOQO formulates and solves
quasi-definite primal-dual linear systems through a specialized sparse algorithm.
• GRG2 [245]: Developed by Lasdon and coworkers, this code applies the generalized
reduced gradient concepts from Section 6.4 and has a long development history.
It includes an efficient quasi-Newton algorithm (BFGS in factored form) as well as
an optional conjugate gradient method for larger problems.
• MINOS [285]: As described in [162] and also in Section 6.4.2, MINOS solves a se-
quence of linearly constrained subproblems, as illustrated in Figure 6.5. For nonlinear
problems, it uses a reduced-space method and applies quasi-Newton updates to solve
(6.84).
• PENNON [231]: As described in [373], this code first converts the constraints by
using a transformed penalty/barrier function. It then forms a (generalized) augmented
Lagrange function. The NLP is solved through solution of a sequence of unconstrained
problems, with multipliers for the augmented Lagrange function updated in an outer
loop. The unconstrained minimization (inner loop) is performed by a Newton-type
solver with globalization using either line search or trust region options.
strongly dependent on the hardware environment and operating system, as well as the mod-
eling environment, which provides the problem data to the NLP algorithm. Nevertheless,
from the test problem results, we can observe the influence of the Newton step, globalization
features, and approximation of second derivatives.
Table 6.1. Results for scalable test problem (# iterations [CPU seconds]). Minor
iterations are also indicated for SNOPT, and CG iterations are indicated for KNITRO,
where applied. The numbers of variables and equations are n = 6N − 2 and m = 5N − 2, so that n − m = N.
N CONOPT IPOPT KNITRO SNOPT
5 15 [0.11] 9 [0.016] 16/0 [0.046] 13/33 [0.063]
10 15 [0.062] 9 [0.016] 19/0 [0.093] 15/55 [0.063]
50 28 [0.094] 9 [0.032] 20/0 [0.078] 15/357 [0.11]
100 30 [0.047] 9 [0.11] 18/0 [0.109] 13/603 [0.156]
500 32 [0.344] 9 [0.64] 96/337 [4.062] 23/5539 [4.828]
1000 29 [1.375] 9 [1.422] 116/771 [16.343] 31/10093 [13.734]
Since the direct version of KNITRO also solves (6.51) with a Newton method, its performance is similar to that of IPOPT for N up to 100, with the difference in iterations probably due to a different line search. For N = 500 and N = 1000, though, KNITRO automatically switches from the Interior/Direct to the Interior/CG version to provide more careful, but more expensive, trust region steps. This can be seen from the added CG iterations in the table.
Finally, SNOPT is the only solver considered that does not use exact second derivatives.
Table 6.1 lists the major (SQP) iterations and minor (QP) iterations. Note that the number
of SQP iterations increases only slightly with N . However, as a result of BFGS updates
and heavily constrained QPs, SNOPT requires a much larger increase in CPU time with
increasing N .
Figure 6.6. Dolan–Moré plot of the Mittelmann NLP benchmark from July 20, 2009. The graph indicates the fraction of test problems solved by an algorithm within a factor S of the minimum CPU time over all 5 algorithms; here S ranges from 1 to 1024. Comparisons are made on 48 large NLPs, ranging from 500 to 261,365 variables.
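A Dolan–Moré profile of this kind is straightforward to compute from a table of solver timings. A sketch with hypothetical data:

```python
import numpy as np

def performance_profile(times, S):
    """Dolan-More profile: times[i, s] is the CPU time of solver s on
    problem i (np.inf for failures). Returns, for each solver, the
    fraction of problems solved within a factor S of the best time."""
    times = np.asarray(times, float)
    best = times.min(axis=1, keepdims=True)   # fastest solver per problem
    ratios = times / best                     # performance ratios r_{i,s}
    return (ratios <= S).mean(axis=0)         # rho_s(S) for each solver

# Hypothetical timings: 3 problems x 2 solvers
t = [[1.0, 2.0],
     [4.0, 1.0],
     [3.0, np.inf]]    # solver 2 fails on problem 3
print(performance_profile(t, 2.0))
```

Plotting rho_s(S) against log2 S over a range of S values reproduces the shape of Figure 6.6: the value at S = 1 measures how often a solver is fastest, while the right-hand tail measures robustness.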
Large-scale SQP methods come in two flavors, full-space and reduced-space methods.
Reduced-space methods are best suited for problems with few decision variables, i.e., when n − m is small. These methods rely on decomposition of the QP solution into normal and tangential steps. For
this case, quasi-Newton updates of the reduced Hessian can also be applied efficiently. Global
convergence is promoted through a line search strategy, or, in some cases, a composite-
step trust region strategy. On the other hand, full-space methods are especially well suited
for problems where n − m is large. Key elements of these methods are the provision of second derivatives, the exploitation of sparsity in the QP solver, and the treatment of indefinite
Hessian matrices. Convergence properties for these methods are described in more detail in
[294, 100].
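The normal/tangential decomposition mentioned above can be sketched for a single equality-constrained QP step. This is a simplified dense illustration using a QR-based null-space basis; practical reduced-space codes exploit sparsity and quasi-Newton reduced Hessians instead of dense factorizations.

```python
import numpy as np

def composite_step(B, g, A, c):
    """One step for min g'p + 0.5 p'Bp  s.t.  c + A'p = 0, where A is the
    n x m matrix of constraint gradients, so n - m is the reduced dimension."""
    n, m = A.shape
    # Normal step: least-squares move toward the linearized constraints
    p_n = -A @ np.linalg.solve(A.T @ A, c)
    # Null-space basis Z from a full QR of A (columns satisfy A'Z = 0)
    Q, _ = np.linalg.qr(A, mode="complete")
    Z = Q[:, m:]
    # Tangential step: reduced Hessian system in the n - m free directions
    p_z = np.linalg.solve(Z.T @ B @ Z, -Z.T @ (g + B @ p_n))
    return p_n + Z @ p_z
```

By construction A'p = −c, so the linearized constraints are satisfied, while the tangential component minimizes the quadratic model within the null space of the constraint gradients.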
Barrier, or interior point, methods have much in common with SQP methods, as they
both depend on Newton-type solvers. On the other hand, rather than considering active set
strategies, these methods deal with relaxed KKT conditions, i.e., primal-dual equations.
Solving these equations leads to a sequence of relaxed problems solved with Newton-based
approaches. These methods provide an attractive alternative to active set strategies in han-
dling problems with large numbers of inequality constraints. However, careful implemen-
tation is needed in order to remain in the interior of the inequality constraints. This requires
modifications of line search and trust region strategies to prevent the method from “crashing
into bounds” and convergence failure. Otherwise, the Newton step can be decomposed or
regularized in the same manner as in Chapter 5, with the option of applying quasi-Newton
updates as well. Convergence properties for these methods are described in more detail in
[294, 100].
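The requirement to remain in the interior is commonly enforced with a fraction-to-boundary rule, which shortens the Newton step before it reaches a bound. A sketch follows; tau = 0.995 is a typical but assumed value.

```python
import numpy as np

def fraction_to_boundary(z, dz, tau=0.995):
    """Largest alpha in (0, 1] with z + alpha*dz >= (1 - tau) * z,
    keeping the iterate strictly interior (assumes z > 0)."""
    neg = dz < 0                      # only shrinking components can hit 0
    if not np.any(neg):
        return 1.0
    return min(1.0, float(np.min(-tau * z[neg] / dz[neg])))
```

The resulting alpha caps the line search or trust region step so the method cannot "crash into bounds," while still permitting full Newton steps near the solution as tau approaches 1.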
Nested NLP solvers based on reduced gradients tend to be applied on problems with
highly nonlinear constraints, where linearizations from infeasible points can lead to poor
Newton steps and multiplier estimates. These methods partition the NLP problem into basic,
superbasic, and nonbasic variables, nest the solution of the KKT conditions (6.69), and
update these variables in a sequential manner. There are a number of options and heuristics
dealing with repartitioning of these variables as well as relaxation of feasibility. Among
these, a particularly useful option is the solution of bound constrained subproblems to deal
with superbasic and nonbasic variables (see [227] for algorithmic development and analysis
of convergence properties). These methods generally require more function evaluations, but they may be more robust for highly nonlinear problems, as most of them follow feasible paths.
Finally, a set of 19 popular NLP codes is described that incorporate elements of these
Newton-type algorithms. In addition, with the help of two brief numerical studies, some
trends are observed on characteristics of these methods and their underlying algorithmic
components.
With the application of SQP on larger problems, both full-space and reduced-space
versions were developed, particularly with line search strategies. For applications with few
degrees of freedom, reduced-space algorithms were developed using concepts from Gill, Murray, and Wright [162, 159] and through composite-step methods analyzed by
Coleman and Conn [98] and by Nocedal and coworkers [292, 82, 54]. In particular, early
methods with reduced-space QP subproblems are described in [38, 262]. These were followed by line search [395, 54] and trust region [11, 380] adaptations, and by the development of the SNOPT code [159]. In addition, large-scale, full-space
SQP methods were developed by Betts and coworkers [40] and by Lucia and coworkers
[265, 266].
Interior point or barrier methods for large-scale NLP are grounded in the early work on barrier and other penalty functions by Fiacco and McCormick [131]. Moreover,
the development of interior point methods for linear programming [223, 411, 385], and the
relation of primal-dual equations to stationarity conditions of the barrier function problem,
spurred a lot of work in NLP, particularly for convex problems [289]. For general NLP, there
has also been a better understanding of the convergence properties of these methods [147],
and efficient algorithms have been developed with desirable global and local convergence
properties. To allow convergence from poor starting points, interior point methods in both
trust region and line search frameworks have been developed that use exact penalty merit
functions as well as filter methods [137, 403]. In addition, an interesting survey of filter
methods is given in [139].
Convergence properties for line-search-based interior point methods for NLP are
developed in [123, 413, 382]. Global and local convergence of an interior point algorithm
with a filter line search is analyzed in [403, 402], with less restrictive assumptions. In
addition, Benson, Shanno, and Vanderbei [36] discussed numerous options for line search
methods based on merit functions and filter methods.
Trust region interior point methods based on exact penalty functions were developed
by Byrd, Nocedal, and Waltz [79]. Since the late 1990s these “KNITRO-type” algorithms
have seen considerable refinement in the updating of the penalty parameters and solution of
the trust region subproblems. In addition, M. Ulbrich, S. Ulbrich, and Vicente [389] consid-
ered a trust region filter method that bases the acceptance of trial steps on the norm of the
optimality conditions. Also, Ulbrich [390] discussed a filter approach using the Lagrangian
function in a trust region setting, including both global and local convergence results. Fi-
nally, Dennis, Heinkenschloss, and Vicente [112] proposed a related trust region algorithm
for NLPs with bounds on control variables. They applied a reduced-space decomposition
and barrier functions for handling the bounds in the tangential step. In addition, they obtained
an approximate solution to the tangential problem by using truncated Newton methods.
Reduced gradient methods date back to the work by Rosen [339, 340] and Abadie
and Carpentier [5] in the 1960s. Because of their nested structure, these algorithms tend to
lead to “monolithic” codes with heuristic steps for repartitioning and decomposition. As a
result, they are not easily adapted to NLPs with specific structure. Reduced gradient methods
were developed in early work by Sargent [347] and Sargent and Murtagh [348]. Long-term
developments by Lasdon and coworkers [245] led to the popular GRG2 code. Similarly,
work by Saunders and Murtagh [285] led to the popular MINOS code. More recently, work
by Drud has led to the development of CONOPT [119]. Implemented within a number of
modeling environments, CONOPT is widely used due to robustness and efficiency with
highly nonlinear problems.
The NLP algorithms and associated solvers discussed in this chapter comprise only
a sampling of representative codes, based on Newton-type methods. A complete listing is
beyond the scope of this book and the reader is referred to the NEOS software guide [1] for
a more complete selection and description of NLP codes. Moreover, important issues such
as scaling and numerical implementations to improve precision have not been covered here.
Readers are referred to [162, 294, 100] for more information on these issues.
Finally, systematic numerical studies are important research elements for NLP algorithm development. A wealth of test problems abounds [64, 116, 152, 199], and numerical studies frequently deal with hundreds of test cases. While many benchmarks continue to
be available for NLP solvers, it is worth mentioning the frequently updated benchmarks
provided by Mittelmann [280]. These serve as extremely useful and impartial evaluations
of optimization solvers. Moreover, for numerical comparisons that are geared toward spe-
cialized test sets of interest to a particular user, modeling environments like GAMS [71]
provide facilities for conducting such solver benchmarks more systematically.
6.8 Exercises
1. Show, by adding and redefining variables and constraints, that problems (6.1), (6.2),
and (6.3) have the same optimal solutions.
2. Show that if dx ≠ 0, then the solution to (6.13) is a descent direction for the ℓ1 merit function φ1(x; ρ) = f (x) + ρ‖c(x)‖1.
3. Show, by adding and redefining variables and constraints, that problems (6.3) and
(6.48) have the same optimal solutions.
4. Rederive the interior point method in Section 6.3 using the double-bounded NLP
(6.3).
5. Select solvers from the SQP, interior point, and reduced gradient categories, and apply
these to Example 6.3. Use x 0 = [0, 0.1]T and x 0 = [0, 0]T as starting points.
6. Extend the derivation of the primal-dual equations (6.51) to the bound constrained
NLP (6.3). Derive the resulting Newton step for these equations, analogous to (6.57).
7. Select solvers from the SQP, interior point, and reduced gradient categories, and apply
these to Example 6.8. Use x 0 = [−2, 3, 1]T as the starting point.
8. Consider the placement of three circles of different diameter in a box enclosure of
minimum perimeter. As shown in Figure 6.7, this leads to the following nonconvex,
constrained NLP:
min (a + b)
s.t. xi , yi ≥ Ri , xi ≤ b − Ri , yi ≤ a − Ri , i = 1, . . . , 3,
     (xi − xi′)2 + (yi − yi′)2 ≥ (Ri + Ri′)2 ,  0 < i′ < i,  i = 1, . . . , 3,
     a, b ≥ 0,
where Ri is the radius of circle i, (xi , yi ) represents the center of circle i, and a and
b are the horizontal and vertical sides of the box, respectively.
• Let Ri = 1 + i/10. Apply an SQP solver to this problem with starting points xi0 = i, yi0 = i, a0 = 6, and b0 = 6.
• Repeat this study using an interior point NLP solver.
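One possible setup for this exercise uses SciPy's SLSQP, an SQP-type solver. This is a sketch: the variable ordering and solver choice are assumptions, and since the problem is nonconvex, only a local solution is found.

```python
import numpy as np
from scipy.optimize import minimize

R = np.array([1.1, 1.2, 1.3])          # R_i = 1 + i/10, i = 1..3
# Decision vector v = [x1, x2, x3, y1, y2, y3, a, b]

def objective(v):
    return v[6] + v[7]                  # minimize a + b

def constraints(v):
    x, y, a, b = v[0:3], v[3:6], v[6], v[7]
    cons = []
    cons += list(x - R)                 # x_i >= R_i
    cons += list(y - R)                 # y_i >= R_i
    cons += list(b - R - x)             # x_i <= b - R_i
    cons += list(a - R - y)             # y_i <= a - R_i
    for i in range(3):                  # non-overlap of circles i, i'
        for j in range(i):
            cons.append((x[i] - x[j])**2 + (y[i] - y[j])**2
                        - (R[i] + R[j])**2)
    return np.array(cons)

v0 = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 6.0, 6.0])   # exercise's start
res = minimize(objective, v0, method="SLSQP",
               constraints={"type": "ineq", "fun": constraints},
               bounds=[(None, None)] * 6 + [(0, None)] * 2)
print(res.fun)   # a + b at the local solution found
```

Note that a ≥ 2·max(Ri) = 2.6 and likewise for b, so any feasible point has a + b ≥ 5.2; the solver should improve substantially on the starting value of 12.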
9. Using a modeling environment such as GAMS, repeat the case study for problem
(6.85) using η = 5. Discuss the trends for the NLP algorithms as N is increased.
Chapter 7
Steady State Process Optimization
The chapter deals with the formulation and solution of chemical process optimization prob-
lems based on models that describe steady state behavior. Many of these models can be
described by algebraic equations, and here we consider models only in this class. In pro-
cess engineering such models are applied for the optimal design and analysis of chemical
processes, optimal process operation, and optimal planning of inventories and products.
Strategies for assembling and incorporating these models into an optimization problem
are reviewed using popular modular and equation-oriented solution strategies. Optimiza-
tion strategies are then described for these models and gradient-based NLP methods are
explored along with the calculation of accurate derivatives. Moreover, some guidelines
are presented for the formulation of equation-oriented optimization models. In addition,
four case studies are presented for optimization in chemical process design and in process
operations.
7.1 Introduction
The previous six chapters focused on properties of nonlinear programs and algorithms for
their efficient solution. Moreover, in the previous chapter, we described a number of NLP
solvers and evaluated their performance on a library of test problems. Knowledge of these
algorithmic properties and the characteristics of the methods is essential to the solution of
process optimization models.
As shown in Figure 7.1, chemical process models arise from quantitative knowledge
of process behavior based on conservation laws (for mass, energy, and momentum) and
constitutive equations that describe phase and chemical equilibrium, transport processes,
and reaction kinetics, i.e., the “state of nature.” These are coupled with restrictions based
on process and product specifications, as well as an objective that is often driven by an
economic criterion. Finally, a slate of available decisions, including equipment parameters
and operating conditions, needs to be selected. These items translate to a process model
with objective and constraint functions that make up the NLP. Care must be taken that the
resulting NLP problem represents the real-world problem accurately and also consists of
objective and constraint functions that are well defined and sufficiently smooth.
In this chapter we consider NLP formulations in the areas of design, operation, and
planning. Each area is described with optimization examples along with a case study that
illustrates the application and performance of appropriate NLP solvers.
Process optimization models arise frequently as off-line studies for design and anal-
ysis. These require accurate process simulation models to describe steady state behavior
of the chemical process. Because of the considerable detail required for these simulation
models (particularly in the description of thermodynamic equilibrium), the models need to
be structured with specialized solution procedures, along the lines of the individual process
units represented in the model. As shown in Section 7.2, this collection of unit models leads
to a process flowsheet simulation problem, either for design or analysis, executed in a mod-
ular simulation mode. Some attention to problem formulation and implementation needs to
be observed to adapt this calculation structure to efficient NLP solvers. Nevertheless, the
NLP problem still remains relatively small with only a few constraints and few degrees of
freedom.
Optimization problems that arise in process operations consist of a problem hierarchy
that deals with planning, scheduling, real-time optimization (RTO), and control. In partic-
ular, RTO is an online activity that achieves optimal operation in petroleum refineries and
chemical plants. Nonlinear programs for RTO are based on process models similar to those
used for design and analysis. On the other hand, because these problems need to be solved
at regular intervals (at least every few hours), detailed simulation models can be replaced
by correlations that are fitted and updated from process data. As shown in Section 7.4, these
time-critical nonlinear programs are formulated in the so-called equation-oriented mode.
As a result, the nonlinear program is generally large, with many equality constraints from
the process model, but relatively few degrees of freedom for optimization.
Finally, planning and scheduling problems play a central role in process operations.
Optimization models in this area are characterized by both discrete and continuous variables,
i.e., mixed integer programs, and by simpler nonlinear models. While these models still
consist of algebraic equations, they are often spatially distributed over temporal horizons
and describe material flows between networks and management of inventories. As seen in
Section 7.5 the resulting NLP problems are characterized by many equality constraints (with
fewer nonlinearities) and also many degrees of freedom. These applications are illustrated
later by gasoline blending models, along with a case study that demonstrates the performance
of NLP solvers.
The evolution of these problem formulations is closely tied to the choice of appropriate
optimization algorithms. In Chapters 5 and 6, efficient NLP algorithms were developed that
assume open, equation-oriented models are available with exact first and second derivatives
for all of the constraint and objective functions. These algorithms best apply to the examples
in Section 7.5. On the other hand, on problems where function evaluations are expensive,
and gradients and Hessians are difficult to obtain, it is not clear that large-scale NLP solvers
should be applied. Figure 7.2 suggests a hierarchy of models paired with suitable NLP
strategies. For instance, large equation-based models can be solved efficiently with struc-
tured barrier NLP solvers. On the other hand, black box optimization models with inexact
(or approximated) derivatives and few decision variables are poorly served by large-scale
NLP solvers, and derivative-free optimization algorithms should be considered instead. The
problem formulations in this chapter also include intermediate levels in Figure 7.2, where
SQP and reduced-space SQP methods are expected to perform well.
chemical components and bring the resulting mixture to the desired pressure and
temperature for reaction.
• Reaction section consists of reactor units that convert feed components to desired
products and by-products. These equipment units usually convert only a fraction of
the reactants, as dictated by kinetic and equilibrium limits.
• Recycle separation section consists of separation units to separate products from
reactant components and send their respective streams further downstream.
• Recycle processing section consists of pumps, compressors, and heat exchangers that
serve to combine the recycled reactants with the process feed.
• Product recovery consists of equipment units that provide further temperature, phase,
and pressure changes, as well as separation to obtain the product at desired conditions
and purity.
Computer models of these tasks are described by a series of unit modules; each software
module contains the specific unit model equations as well as specialized procedures for their
solution. For modular-based optimization models, we formulate the objective and constraint
functions in terms of unit and stream variables in the flowsheet and, through unit modules,
these are assumed to be implicit functions of the decision variables. Since we intend to use a
gradient-based algorithm, care must be taken so that the objective and constraint functions
are continuous and differentiable. Moreover, for the modular approach, derivatives for the
implicit module relationships are not directly available, and often not calculated. Instead,
they need to be obtained by finite differences (and additional flowsheet evaluations), or by
enhancing the unit model calculation to provide exact derivatives directly.
As a result of the above structure, these optimization problems deal with large, arbitrar-
ily complex models but relatively few degrees of freedom. While the number of flowsheet
variables could be many thousands, these are “hidden” within the simulation models. On the
other hand, even for large flowsheets, there are rarely more than 100 degrees of freedom.
The modular mode offers several advantages for flowsheet optimization. First, the flow-
sheeting problem is relatively easy to construct and to initialize, since numerical procedures
are applied that are tailored to each unit. Moreover, the flowsheeting model is relatively easy
to debug using process concepts intuitive to the process engineer. On the other hand, cal-
culations in the modular mode are procedural and directed by a rigidly defined information
sequence, dictated by the process simulation. Consequently, for optimization one requires
Figure 7.4. Evolving from black box (left) to modular formulations (right). Here,
tear streams are introduced to break all four recycle loops.
that unit models need to be solved repeatedly, and careful problem definition is required to
prevent intermediate failure of these process units.
Early attempts at applying gradient-based optimization strategies within the modular
mode were based on black box implementations, and these were discouraging. In this sim-
ple approach, an optimization algorithm was tied around the process simulator as shown in
Figure 7.4. In this “black box” mode, the entire flowsheet needs to be solved repeatedly and
failure in flowsheet convergence is detrimental to the optimization. Moreover, as gradients
are determined by finite difference, they are often corrupted by errors from loose conver-
gence tolerances in the flowsheet, and this has adverse effects on the optimization strategy.
Typically, a flowsheet optimization with 10 degrees of freedom requires the equivalent time
of several hundred simulations with the black box implementation.
Since the mid 1980s, however, flowsheet optimization for the modular mode has be-
come a widely used industrial tool. This has been made possible by the following advances
in implementation. First, intermediate calculation loops, involving flowsheet recycles and
implicit unit specifications, are usually solved in a nested manner with slow fixed-point
algorithms. These loops can now be incorporated as equality constraints in the optimization
problem, and these additional equality constraints can be handled efficiently with Newton-
based NLP solvers. For instance, SQP converges the equality and inequality constraints
simultaneously with the optimization problem. This strategy requires relatively few func-
tion evaluations and performs very efficiently for process optimization problems, usually
in less than the equivalent CPU time of 10 process simulations. Moreover, the NLP solver
can be implemented in a nonintrusive way, similar to recycle convergence modules that
are already in place. As a result, the structure of the simulation environment and the unit
operations blocks does not need to be modified in order to include a simultaneous op-
timization capability. As seen in Figure 7.4, this approach could be incorporated easily
within existing modular simulators and applied directly to flowsheets modeled within these
environments.
with decision variables z, objective function f (z), and constraint functions h(z), g(z), this
approach “breaks open” the calculation loops in the simulation problem by considering
the so-called tear variables y and tear equations y − w(z, y) = 0 (equations which serve
to break every calculation loop at least once) as part of the optimization problem. Adding
these variables and equations to (7.1) leads to

min f(z, y)
s.t. h(z, y) = 0,
     g(z, y) ≤ 0,
     y − w(z, y) = 0.   (7.2)
Because slow convergence loops and corresponding convergence errors are eliminated,
this strategy is often over an order of magnitude faster than the “black box” approach and
converges more reliably, due to more accurate gradients.
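The contrast between nested recycle convergence and Newton-based treatment of the tear equations can be sketched on a scalar toy problem. The recycle map w below is hypothetical and stands in for one pass through the flowsheet; the comparison of iteration counts is the point.

```python
# Toy tear equation y = w(y): successive substitution converges at the
# linear rate |w'(y*)| (about 0.84 here), while Newton's method applied
# to the tear residual F(y) = y - w(y) = 0 converges in a handful of steps.
import math

def w(y):
    # hypothetical recycle map (one "flowsheet pass")
    return 0.95 * y + math.exp(-y)

def successive_substitution(y0, tol=1e-10, max_it=1000):
    y, it = y0, 0
    while abs(w(y) - y) > tol and it < max_it:
        y, it = w(y), it + 1
    return y, it

def newton_tear(y0, tol=1e-10, max_it=100):
    y, it = y0, 0
    while abs(y - w(y)) > tol and it < max_it:
        F = y - w(y)
        dF = 1.0 - (0.95 - math.exp(-y))  # analytic derivative of F
        y, it = y - F / dF, it + 1
    return y, it

y_ss, n_ss = successive_substitution(1.0)
y_nt, n_nt = newton_tear(1.0)
print(n_ss, n_nt)  # Newton needs far fewer passes through w
```

In a simultaneous optimization the tear residual simply joins the constraint set of the NLP, so the SQP solver plays the role of the Newton iteration here.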
Example 7.1 (Formulation of Small Flowsheet Optimization Problem). Consider the sim-
ple process flowsheet shown in Figure 7.5. Here the feed stream vector S1 , specified in
Table 7.1, has elements j of chemical component flows:
as well as temperature and pressure. Stream S1 is mixed with the “guessed” tear stream
vector S6 to form S2 which is fed to an adiabatic flash tank, a phase separator operating at
vapor-liquid equilibrium with no additional heat input. Exiting from the flash tank is a vapor
product stream vector S3 and a liquid stream vector S4 , which is divided into the bottoms
product vector S7 and recycle vector S5 . The liquid stream S5 is pumped to 1.02 MPa to
form the “calculated” tear stream vector S̄6 .
The central unit in the flowsheet is the adiabatic flash tank, where high-boiling com-
ponents are concentrated in the liquid stream and low-boiling components exit in the vapor
stream. This separation module is modeled by the following equations:
where F , L, V are feed, liquid, and vapor flow rates, x, y, z are liquid, vapor, and feed
mole fractions, Tin , Pin are input temperature and pressure, Tf , Pf are flash temperature
and pressure, Kj is the equilibrium coefficient, and HF , HL , HV are feed, liquid, and vapor
enthalpies, respectively. In particular, Kj and HF , HL , HV are thermodynamic quantities
represented by complex nonlinear functions of their arguments, such as cubic equations of
state. These quantities are calculated by separate procedures and the model (7.3) is solved
with a specialized algorithm, so that vapor and liquid streams S3 and S4 are determined,
given Pf and feed stream S2 .
The process flowsheet has two decision variables, the flash pressure Pf , and η which
is the fraction of stream S4 which exits as S7 . The individual unit models in Figure 7.5 can
be described and linked as follows.
• Mixer: Adding stream S6 to the fixed feed stream S1 and applying a mass and energy
balance leads to the equations S2 = S2 (S6 ).
• Flash: By solving (7.3) we separate S2 adiabatically at a specified pressure, Pf , into
vapor and liquid streams leading to S3 = S3 (S2 , Pf ) and S4 = S4 (S2 , Pf ).
• Pump: Pumping the liquid stream S5 to 1.02 MPa leads to the equations S̄6 = S̄6 (S5 ).
The tear equations for S6 can then be formulated by nesting the dependencies in the flowsheet
streams:
S6 − S̄6 (S5 (S4 (S2 (S6 ), Pf ), η)) = S6 − S̄6 [S6 , Pf , η] = 0. (7.4)
The objective is to maximize a nonlinear function of the elements of the vapor product:
and the inequality constraints consist of simple bounds on the decision variables. Written
in the form of (7.2), we can pose the following nonlinear program:
The nonlinear program (7.5) can be solved with a gradient-based algorithm, but derivatives
that relate the module outputs to their inputs are usually not provided within simulation
codes. Instead, finite difference approximations are obtained through perturbations of the
variables y and z in (7.2). Because the accuracy of these derivatives plays a crucial role in
the optimization, we consider these more closely before solving this example.
As discussed in Chapter 6, one can partition the variables x = [x_B^T, x_N^T, x_S^T]^T into basic, nonbasic, and superbasic variables, respectively, with the first order KKT conditions written as (6.69):

c(x_B, x_N, x_S) = 0,   (7.7a)
∇f_B(x*) + A_B(x*)v* = 0,   (7.7b)
∇f_N(x*) + A_N(x*)v* − u_L* + u_U* = 0,   (7.7c)
∇f_S(x*) + A_S(x*)v* = 0,   (7.7d)
u_L*, u_U* ≥ 0,  x_L ≤ x* ≤ x_U,   (7.7e)
(u_L*)^T(x_N − x_N,L) = 0,  (u_U*)^T(x_N − x_N,U) = 0.   (7.7f)
The nonbasic variables are set to either lower or upper bounds while the basic variables
are determined from the equality constraints. For nonbasic variables, the gradients need
only be accurate enough so that the signs of u∗L and u∗U are correct. Moreover, once xS
and xN are fixed, xB is determined from (7.7a), which does not require accurate gradients;
multipliers are then determined from (7.7b). Thus, in the absence of superbasic variables
(i.e., for vertex optima), correct identification of KKT points may still be possible despite
some inaccuracy in the gradients. On the other hand, if superbasic variables are present,
accurate identification of x ∗ relies on accurate gradients for the solution of (7.7b), (7.7a),
and (7.7d).
with solution x* = 0 and x0 = β[1, 1]^T as the starting point for some β ≠ 0. Note that A is positive definite, but the condition number of A is 1/ϵ². The gradient at x0 is approximated by finite differences, where ϵ̄ is the relative error from the perturbation step. If we choose β = −1 and ϵ̄ = 8ϵ², we have the following Newton directions:
Note that the inexact gradient leads to no descent direction for the Newton step. Conse-
quently, the algorithm fails from a point far from the solution. It is interesting to note that
even if we used a “steepest descent direction,” dsd = −g(x0), similar behavior would be observed for this problem. Moreover, values of ϵ̄ that lead to failure can actually be quite small. For β < 0, we require ϵ̄ < −4βϵ²/(1 + ϵ²) in order to allow a descent direction for dsd. For example, with β = −1 and ϵ = 10^−3, the algorithm fails for ϵ̄ ≥ 4 × 10^−6.
This worst-case example shows that gradient error can greatly affect the performance
of any derivative-based optimization method. In practice, finite difference approximations
may still lead to successful “nearly optimal” NLP solutions, especially for performance or
economic optimization, where the optima are usually highly constrained. To minimize the
effects of gradient errors, the following guidelines are suggested (see also [162]).
• Modular calculation loops should be converged tightly, so that convergence noise leads to relative errors in module outputs and tear streams of at most δ, say δ ≤ 10^−8.
• Choose a perturbation size so that the relative effect on module outputs and tear streams is approximately δ^1/2.
• Finite-differenced gradients should be monitored to detect input/output discontinuities due to failed module convergence or conditional rules within a module. These are serious problems that must be avoided in the optimization.
• Choose a looser KKT tolerance to include the effect of convergence noise. Alternatively,
choose a tight tolerance and let the NLP solver terminate with “failure to find a
descent direction.” Monitoring the final iterates allows the user to assess the effect of
inaccurate derivatives on the KKT error, and also to decide whether the final iterate
is an acceptable solution.
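A worst-case error model makes the second guideline concrete: the truncation error of a forward difference grows like the perturbation size h, while noise of relative size δ contributes an error of order δ/h, so the total error is smallest near h ≈ δ^1/2. The sketch below uses exp as an illustrative stand-in for a module output.

```python
# Forward-difference error under worst-case relative noise delta in the
# function values: truncation O(h) plus noise O(delta/h).
import math

def fd_worst_case_error(f, df, x, h, delta):
    # worst case: noise of +delta*|f| at x+h and -delta*|f| at x
    noisy = (f(x + h) + delta * abs(f(x + h))
             - (f(x) - delta * abs(f(x)))) / h
    return abs(noisy - df(x))

f, df, x, delta = math.exp, math.exp, 1.0, 1e-8
steps = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
errs = {h: fd_worst_case_error(f, df, x, h, delta) for h in steps}
best = min(errs, key=errs.get)
print(best)  # 1e-4, i.e., delta**0.5
```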
Example 7.3 (Small Flowsheet Optimization Problem Revisited). The flowsheet given in
Figure 7.5 and represented by the optimization problem (7.5) was simulated using the
ProSIM process simulator [153] with Soave–Redlich–Kwong (cubic) equations of state
for the thermodynamic models. Within this simulation program, gradients for optimization
can be obtained either analytically (implemented through successive chain ruling) or by
finite difference. Three optimization algorithms discussed in Chapter 6, the reduced gradient
method [309], SQP using BFGS updates in the full space, and a reduced-space SQP method,
also with BFGS updates, were used to solve this problem with and without analytical
derivatives. Using the initial values given in Table 7.1, the optimal solution was obtained for
all algorithms considered. At the optimum, also shown in Table 7.1, the split fraction goes to
its upper bound while flash pressure is at an intermediate value. Consequently, there is only
one allowable direction at this solution (and one superbasic variable) and it is easily verified
that the reduced Hessian is positive definite, so that both necessary and sufficient KKT
conditions are satisfied at this solution. More detail on this solution can be found in [410].
Performance comparisons (number of flowsheet evaluations (NFE), number of gradi-
ent evaluations (NGE), and relative CPU time, normalized to the reduced gradient case with
finite difference perturbations) are given for each case in Table 7.2. The reduced gradient
case with finite differences corresponds to the equivalent CPU time of about 30 simula-
tions, and the best case finds the optimal solution in about 6 simulation time equivalents.
The comparison clearly shows the performance advantages of using exact derivatives. Here
the reduced gradient method appears to be the slowest algorithm, as it devotes considerable
effort to an exact line search algorithm. In contrast, the reduced SQP algorithm is slightly
faster than the full SQP algorithm when numerical perturbations are used, but with analyt-
ical derivatives their computation times are about the same. On the other hand, the use of
analytical derivatives leads to important time savings for all three optimization methods.
These savings are different for each method (48% for the reduced gradient method, 73% for
the full-space SQP, and 66% for reduced-space SQP), since they are realized only for the
fraction of time devoted to gradient evaluations.
the thermodynamic quantities. Three optimization algorithms have also been compared to
solve this problem, with and without exact derivatives. All of the algorithms converge to
the optimal solution shown in Table 7.3, which improves the profit function by over 20%.
Here the purge constraint is active and three decision variables are at their bounds. In
particular, the reactor conversion and temperature of stream S8 are at their upper bounds
while the temperature of stream S9 is at its lower bound. At this solution, the reduced
Hessian is a 4 × 4 matrix with eigenvalues 2.87 × 10^−2, 8.36 × 10^−10, 1.83 × 10^−4, and 7.72 × 10^−5. Consequently, this point satisfies second order necessary conditions, but the extremely low second eigenvalue indicates that the optimum solution is likely nonunique.
Table 7.4 presents a performance comparison among the three algorithms and shows
the NFE, the NGE, and the relative CPU time for each case (normalized to the reduced
gradient method with finite differenced gradients). The reduced gradient case with finite
differences corresponds to the equivalent CPU time of about 30 simulations, while the best
case finds the optimal solution in about 3 simulation time equivalents. The comparison
shows that the performance results are similar to those obtained for the adiabatic flash loop.
The reduced gradient method is slower. When numerical perturbations are used, the reduced-
space SQP is faster than full-space SQP, and about the same when analytical derivatives are
used. Moreover, significant time savings are observed due to analytical derivatives (42%
for reduced gradient, 75% for full-space SQP, and 75% for reduced-space SQP).
from additional variables and equations, particularly if they are linear. As a result, these
algorithms are generally more efficient and reliable in handling large, mostly linear prob-
lems than smaller, mostly nonlinear problems. Using this axiom as a guiding principle, we
further motivate the formulation of optimization models by recalling some assumptions
required for the convergence analyses in Chapters 5 and 6.
2. The objective and constraint functions and their first and second derivatives are Lip-
schitz continuous and uniformly bounded in norm over D.
3. The matrix of active constraint gradients has full column rank for all x ∈ D.
4. The Hessian matrices are uniformly positive definite and bounded on the null space
of the active constraint gradients.
The first two assumptions are relatively easy to enforce for a nonlinear program of the form

min f(x)  s.t.  c(x) = 0,  x_L ≤ x ≤ x_U.

Sensible bounds x_L, x_U can be chosen for many of the variables based on problem-specific
information in order to define the convex set D. This is especially important for variables
that appear nonlinearly in the objective and constraint functions, although care is needed
in bounding nonlinear terms to prevent inconsistent linearizations. In addition, f (x) and
c(x) must be bounded, well defined, and smooth for all x ∈ D. This can be addressed in the
following ways.
• A term such as ln(g(x)), which is undefined for g(x) ≤ 0, can be replaced by a new variable y along with the equation g(x) − exp(y) = 0, where g(x) is the scalar argument and variable y is substituted for ln(g(x)). Upper and
lower bounds can also be imposed on y based on limits of x and g(x). The resulting
reformulation satisfies the second assumption for any finite bounds on g(x).
• Nonsmooth terms and discrete switches from conditional logic must be avoided in
all nonlinear programs. As discussed in Chapter 1, discrete switches are beyond the
capabilities of NLP methods; they can be handled through the introduction of integer
variables and the formulation of mixed integer nonlinear programs (MINLPs). On
the other hand, nonsmooth terms (such as |x|) can be handled through smoothing
methods or the complementarity reformulations discussed in Chapter 11.
The third assumption requires particular care. Linearly dependent constraint sets are
hard to anticipate, as the active set is unknown at the optimum. This assumption can be
addressed by first considering the linear constraints and then extending the reformulation
to the nonlinear system as follows.
• It is far easier to examine feasible regions made up of simple bounds and linear
equalities. Linear independence of linear equalities is relatively easy to check, even for
large systems. With the addition of n − m bound constraints (i.e., nonbasic variables),
any linear dependence can be detected through elimination of the nonbasic variables
in the gradients of the equality constraints. A sufficient condition to satisfy the third
assumption is that all combinations of m basic variables lead to nonsingular basis
matrices, i.e., AB in (6.40). This test need only be performed once and is valid for
all x ∈ D. Even if the sufficient condition is not satisfied, one can at least identify
nonbasic variable sets (and variable bounds) that should be avoided.
• Further difficulties occur with nonlinear equations, especially when elements in the gradient vector vanish. For example, the gradient of the constraint x1² + x2² = 1 vanishes at x1 = x2 = 0, making the constraint linearly dependent there, and most NLP algorithms in Chapter 6 will fail from this starting point. Avoiding this singular point is only possible through the use of additional bounds derived from problem-specific information. A viable approach to handling linear dependencies from poor linearization of nonlinear terms is to reformulate troublesome nonlinear equations by adding new variables that help to isolate the nonlinear terms. These terms can then be bounded separately. For instance, if it is known that x1 ≥ 0.1, then the constraint x1² + x2² = 1 can be rewritten as

y1 + y2 = 1,
y1 − x1² = 0,
y2 − x2² = 0,
x1 ≥ 0.1,
and linear dependence from these nonlinear terms is avoided.
• An issue related to this assumption is that the objective and constraint functions should be well scaled. From the standpoint of Newton-based solvers, good scaling is required
for accurate solution of linear systems for determination of Newton steps. A widely
used rule of thumb is to scale the objective, constraint functions, and the variables so
that magnitudes of the gradient elements are “around 1” [162]. Moreover, many of
the NLP algorithms described in Chapter 6 have internal scaling algorithms, or issue
warnings if the problem scaling is poor. This usually provides a suitable aid to the
user to properly assess the problem scaling.
• Finally, note that linear dependence of constraint gradients is less of a concern in
the modular simulation mode because feasible solutions are supplied by internal
calculation loops, and because the basis matrix AB is usually nonsingular at feasible
points.
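The reformulation of the circle constraint can be checked numerically. In this sketch (function names are illustrative) the gradient of the original constraint vanishes at the origin, while the reformulated system keeps a nonvanishing pivot entry because of the bound x1 ≥ 0.1.

```python
# c(x) = x1^2 + x2^2 - 1 has gradient (2*x1, 2*x2), which vanishes at the
# origin; the linearization c + grad^T d = 0 then reads -1 = 0 and no
# Newton-type step exists.
def grad_c(x1, x2):
    return (2.0 * x1, 2.0 * x2)

newton_step_defined = grad_c(0.0, 0.0) != (0.0, 0.0)
print(newton_step_defined)  # False

# Reformulation: y1 + y2 = 1, y1 - x1^2 = 0, y2 - x2^2 = 0, x1 >= 0.1.
# Jacobian rows wrt (x1, x2, y1, y2); the bound keeps the -2*x1 entry,
# the only nonzero in its column, away from zero.
def jac_reformulated(x1, x2):
    return [
        [0.0,        0.0,       1.0, 1.0],  # y1 + y2 - 1
        [-2.0 * x1,  0.0,       1.0, 0.0],  # y1 - x1^2
        [0.0,       -2.0 * x2,  0.0, 1.0],  # y2 - x2^2
    ]

J = jac_reformulated(0.1, 0.5)
print(abs(J[1][0]))  # 0.2: the rows stay linearly independent
```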
The assumption for the reduced Hessian is the hardest to ensure through reformula-
tion. Positive definiteness is not required for trust region methods; otherwise, the Hessian
(or its projection) can be replaced by a quasi-Newton update. However, if the actual pro-
jected Hessian has large negative curvature, NLP algorithms can still perform poorly. As
a result, some attention needs to be paid to highly nonlinear terms in the objective and
constraint functions. For instance, the nonsmooth function max{0, g(x)} is often replaced by a smoothed reformulation (1/2)(g(x) + (g(x)² + ϵ)^1/2). For small ϵ > 0, this function provides a reasonable approximation of the max operator, but the higher derivatives become
unbounded as g(x), ϵ → 0; this can adversely affect the curvature of the Hessian. Consequently, a choice of ϵ is required that balances accuracy of the smoothing function with ill-conditioning.
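The balance can be quantified at the kink: for the smoothed function above, the approximation error at g = 0 equals ϵ^1/2/2 while the second derivative there equals 1/(2ϵ^1/2), so reducing the error tightens the curvature in the same proportion. A short sketch:

```python
# phi(g) = (g + sqrt(g^2 + eps))/2 approximates max(0, g).
#   phi(0)   = sqrt(eps)/2      -> approximation error at the kink
#   phi''(0) = 1/(2*sqrt(eps))  -> curvature blows up as eps -> 0
import math

def phi(g, eps):
    return 0.5 * (g + math.sqrt(g * g + eps))

def phi_second(g, eps):
    return 0.5 * eps / (g * g + eps) ** 1.5

for eps in (1e-2, 1e-4, 1e-6):
    print(eps, phi(0.0, eps), phi_second(0.0, eps))
```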
Because process models differ so widely, there are no clear-cut formulation rules
that apply to all problems. On the other hand, with the goal of formulating sparse, mostly
linear models in mind, along with the above assumptions on convergence properties, one
can develop successful optimization models that can be solved efficiently and reliably by
modern NLP solvers. To demonstrate the application of the above guidelines, we close this
section with a process case study.
0 = FA + FB − FG − FP − Fpurge,
0 = −k1 F_eff^A F_eff^B Vρ/(F_eff^sum)² − Fpurge F_eff^A/(F_eff^sum − FG − FP) + FA,
0 = (−k1 F_eff^A F_eff^B − k2 F_eff^B F_eff^C) Vρ/(F_eff^sum)² − Fpurge F_eff^B/(F_eff^sum − FG − FP) + FB,
0 = [(2k1 F_eff^A − 2k2 F_eff^C) F_eff^B − k3 F_eff^C F_eff^P] Vρ/(F_eff^sum)² − Fpurge F_eff^C/(F_eff^sum − FG − FP),
0 = 2k2 F_eff^B F_eff^C Vρ/(F_eff^sum)² − Fpurge F_eff^E/(F_eff^sum − FG − FP),
0 = (k2 F_eff^B − 0.5 k3 F_eff^P) F_eff^C Vρ/(F_eff^sum)² − Fpurge (F_eff^P − FP)/(F_eff^sum − FG − FP) − FP,
0 = 1.5 k3 F_eff^C F_eff^P Vρ/(F_eff^sum)² − FG,
0 = F_eff^A + F_eff^B + F_eff^C + F_eff^E + F_eff^P + FG − F_eff^sum.
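In the equation-oriented mode these balances become a residual vector handed directly to the NLP solver. The sketch below codes them as such; the argument grouping and the evaluation point (the initial values of Table 7.5) are illustrative, and k1, k2, k3 are treated as given data.

```python
# Residuals r(v) = 0 of the reactor mass balances above.
def wo_residuals(FA, FB, FG, FP, Fpurge, V, Feff, Feff_sum,
                 k1, k2, k3, rho=50.0):
    A, B, C, E, P = (Feff[j] for j in "ABCEP")
    s2 = Feff_sum ** 2
    denom = Feff_sum - FG - FP
    return [
        FA + FB - FG - FP - Fpurge,
        -k1 * A * B * V * rho / s2 - Fpurge * A / denom + FA,
        (-k1 * A * B - k2 * B * C) * V * rho / s2
            - Fpurge * B / denom + FB,
        ((2 * k1 * A - 2 * k2 * C) * B - k3 * C * P) * V * rho / s2
            - Fpurge * C / denom,
        2 * k2 * B * C * V * rho / s2 - Fpurge * E / denom,
        (k2 * B - 0.5 * k3 * P) * C * V * rho / s2
            - Fpurge * (P - FP) / denom - FP,
        1.5 * k3 * C * P * V * rho / s2 - FG,
        A + B + C + E + P + FG - Feff_sum,
    ]

# Evaluate at the initial point of Table 7.5; the residuals need not
# vanish there (only the flow-summation equation happens to close).
Feff0 = {"A": 10.0, "B": 30.0, "C": 3.0, "E": 3.0, "P": 5.0}
r = wo_residuals(FA=10.0, FB=20.0, FG=1.0, FP=0.5, Fpurge=0.0,
                 V=0.06, Feff=Feff0, Feff_sum=52.0,
                 k1=6.18, k2=15.2, k3=10.2)
print(len(r))  # 8 equations
```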
Bound Constraints
Table 7.5. Initial and optimal points for the Williams–Otto problem.
Variable Initial Point Optimal Solution
F_eff^sum    52      366.369
F_eff^A      10.     46.907
F_eff^B      30.     145.444
F_eff^C      3.      7.692
F_eff^E      3.      144.033
F_eff^P      5.      19.115
FP           0.5     4.712
FG           1.      3.178
Fpurge       0.      35.910
V            0.06    0.03
FA           10.     13.357
FB           20.     30.442
T            5.80    6.744
k1           6.18    111.7
k2           15.2    567.6
k3           10.2    1268.2
ROI          —       121.1
where ρ = 50, a1 = 5.9755 × 10^9, a2 = 2.5962 × 10^12, and a3 = 9.6283 × 10^15. Note that
the volume and mass flows are scaled by a factor of 1000, and temperature by a factor
of 100.
Table 7.5 shows the solution to this problem that satisfies sufficient second order KKT
conditions. Also listed is a suggested starting point. However, from this and other distant
starting points, the sparse, large-scale algorithms CONOPT, IPOPT, and KNITRO have
difficulty finding the optimal solution. In particular, the mass balances become linearly dependent when the flows F_eff^j tend to zero. Curiously, feasible points can be found for very small
values of this stream with essentially zero flow rates for all of the feed and product streams.
This leads to an attraction to an unbounded solution where no production occurs. Con-
sequently, the above formulation does not allow reliable solution in the equation-oriented
mode. To reformulate the model to allow convergence from distant starting points, we apply
the guidelines presented above and make the changes below.
• The overall mass balances are reformulated equivalently as mass balances around
the reactor, separator, and splitter. This larger model leads to greater sparsity and
additional linear equations.
• Additional variables are added to define linear mass balances in the reactor, and the
nonlinear reaction terms are defined through these additional variables and isolated
as additional equations.
• All nonlinear terms that are undefined at zero are replaced. In particular, the rate
constraints are reformulated by adding new variables that render the rate equations
linear.
• In addition to existing bound constraints, bounds are placed on all variables that occur
nonlinearly. In particular, small lower bounds are placed on FA and FB in order to
avoid the unbounded solution.
Rate Equations
FG = F_eff^G,
FP = F_eff^P − 0.1 F_eff^E,
Fpurge = η(F_eff^A + F_eff^B + F_eff^C + 1.1 F_eff^E).
Bound Constraints
The resulting reformulation increases the optimization model from 17 variables and
13 equations to 37 variables and 33 equations. With this reformulation, all three solvers
easily converge to the desired solution from the distant starting point given in Table 7.5.
CONOPT, IPOPT, and KNITRO require 72, 43, and 32 iterations, respectively, and less than 0.1 CPU seconds each (on a 2.4 GHz Intel Core2 Duo processor with 2 GB of RAM, running Windows XP).
assumption is that the steady state model is sufficient to describe the plant, and that fast
dynamics and disturbances can be handled by the process control system. Along with the
benefits of RTO, there are a number of challenges to RTO implementation [276, 146, 415],
including formulating models that are consistent with the plant, maintaining stability of
the RTO cycle, ensuring useful solutions in the face of disturbances, and solving the NLP
problems quickly and reliably.
• Performance of each heat exchanger is based on its available area A for heat
exchange along with an overall heat transfer coefficient, U . The resulting area equa-
tions are given by
Q = U A ΔT_lm,   (7.10)
ΔT_lm = (ΔT1 − ΔT2)/ln(ΔT1/ΔT2),
ΔT1 = T_a^in − T_b^out,
ΔT2 = T_a^out − T_b^in.
• Because the expression for the log-mean temperature difference, ΔT_lm, is not well defined when ΔT1 = ΔT2, it is often replaced by an approximation, e.g.,
(ΔT_lm)^1/3 = ((ΔT1)^1/3 + (ΔT2)^1/3)/2.
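The approximation matters numerically because the exact log-mean formula divides by ln(ΔT1/ΔT2), which vanishes when the two driving forces coincide. A short sketch (function names are illustrative):

```python
# Exact log-mean temperature difference vs. the 1/3-power approximation,
# which remains well defined as dT1 -> dT2.
import math

def lmtd_exact(dt1, dt2):
    return (dt1 - dt2) / math.log(dt1 / dt2)  # breaks down at dt1 == dt2

def lmtd_approx(dt1, dt2):
    return ((dt1 ** (1.0 / 3.0) + dt2 ** (1.0 / 3.0)) / 2.0) ** 3

print(lmtd_approx(10.0, 10.0))  # well defined, returns 10 (up to roundoff)
print(lmtd_exact(30.0, 10.0), lmtd_approx(30.0, 10.0))  # agree to ~0.03%
```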
Enthalpy Balances
Li HL,i (Ti , Pi , xi ) + Vi HV ,i (Ti , Pi , yi ) − Li+1 HL,i+1 (Ti+1 , Pi+1 , xi+1 )
−Vi−1 HV,i−1 (Ti−1 , Pi−1 , yi−1 ) + Qi,ext = 0, i ∈ {1, . . . , NT }, i ∉ S,   (7.15)
Li HL,i (Ti , Pi , xi ) + Vi HV ,i (Ti , Pi , yi ) − Li+1 HL,i+1 (Ti+1 , Pi+1 , xi+1 )
−Vi−1 HV ,i−1 (Ti−1 , Pi−1 , yi−1 ) − Fi HFi (TF ,i , PF ,i , zF ,i ) = 0, i ∈ S,
where HFi , HL,i , HV ,i are enthalpies of the feed, liquid, and vapor streams. Also, as shown in
Figure 7.9, Qi,ext is the heat added or removed from tray i through external heat exchange.
to two orders of magnitude larger, the case study is typical of many real-time optimization
problems.
The fractionation plant, shown in Figure 7.9, is used to separate the effluent stream
from a hydrocracking unit. The process deals with 17 chemical components,
C = {nitrogen, hydrogen sulfide, hydrogen, methane, ethane, propane, isobutane,
n-butane, isopentane, n-pentane, cyclopentane, C6 , C7 , C8 , C9 , C10 , C11 }.
The plant includes numerous heat exchangers, including six utility coolers, two interchangers
(with heat transfer coefficients UD and US ), and additional sources of heating and cooling.
It can be described by the following units, each consisting of collections of tray separator
and heat exchange models described in the previous section.
• Absorber/Stripper separates methane and ethane overhead with the remaining com-
ponents in the bottom stream. The combined unit is modeled as a single column with
30 trays and feeds on trays 1 and 14. Setpoints include mole fraction of propane in
the overhead product, feed cooler duty, and temperature in tray 7.
• Debutanizer separates the pentane and heavier components in the bottoms from the
butanes and lighter components. This distillation column has 40 trays and feed on
tray 20. External heat exchange on the bottom and top trays is provided through a
reboiler and condenser, respectively. These are modeled as additional heat exchangers.
Setpoints include feed preheater duty, reflux ratio, and mole fraction of butane in the
bottom product.
• C3/C4 splitter separates propane and lighter components from the butanes. This dis-
tillation column has 40 trays, feed on tray 24, and reboiler and condenser (modeled
as additional heat exchangers). Setpoints include the mole fractions of butane and
propane in the product streams.
• Deisobutanizer separates isobutane from butane and has 65 trays with feed on tray 38
and reboiler and condenser (modeled as additional heat exchangers). Setpoints include
the mole fractions of butane and isobutane in the product streams.
Note that these 10 unit setpoints represent the decision variables for the RTO. Additional
details on the individual units may be found in [20].
As seen in Figure 7.8, a two-step RTO procedure is considered. First, a single-
parameter case is solved as the DR-PE step, in order to fit the model to an operating point.
The optimization is then performed starting from this “good” starting point. Besides the
equality constraints used to represent the individual units, simple bounds are imposed to re-
spect actual physical limits of various variables, bound key variables to prevent the solution
from moving too far from the previous point, and fix respective variables (i.e., setpoints and
parameters) in the parameter and optimization cases. The objective function which drives
the operating conditions of the plant is a profit function consisting of the added value of raw
materials as well as energy costs:
f(x) = Σ_{k∈G} C_k^G S_k + Σ_{k∈E} C_k^E S_k + Σ_{k∈P} C_k^P S_k − E(x),   (7.16)

where C_k^G, C_k^E, C_k^P are prices on the feed and product streams (S_k) valued as gasoline, fuel, or pure components, respectively, and E(x) is the utility cost. The resulting RTO problem
has 2836 equality constraints and 10 independent variables.
Table 7.6. Performance results for hydrocracker fractionation problem (with CPU
times normalized to MINOS result).
Base Case 1 Case 2 Case 3 Case 4 Case 5
Case Base Opt. Fouling Fouling Market Market
UD (base normalized) 1.0 1.0 0.762 0.305 1.0 1.0
US (base normalized) 1.0 1.0 0.485 0.194 1.0 1.0
Propane Price ($/m3 ) 180 180 180 180 300 180
Gasoline Price ($/m3 ) 300 300 300 300 300 350
Octane Credit ($/(RON-m3 )) 2.5 2.5 2.5 2.5 2.5 10
Profit ($/day) 230969 239277 239268 236707 258913 370054
Change from Base — 3.6 % 3.6 % 2.5 % 12.1 % 60.2 %
Poor Initialization
MINOS Iters. (Major/minor) 5/275 9/788 — — — —
rSQP Iterations 5 20 12 24 17 12
MINOS CPU time (norm.) 0.394 12.48 — — — —
rSQP CPU time (norm.) 0.05 0.173 0.117 0.203 0.151 0.117
Good Initialization
MINOS Iters. (Major/minor) — 12/132 14/120 16/156 11/166 11/76
rSQP Iterations — 13 8 18 11 10
MINOS CPU time (norm.) — 1.0 0.883 2.212 1.983 0.669
rSQP CPU time (norm.) — 0.127 0.095 0.161 0.114 0.107
We consider the following RTO cases for this process. Numerical values of the corre-
sponding parameters are included in Table 7.6. In the base optimization case (Case 1), the
profit is improved by 3.6% over the base case. This level of performance is typical for RTO
implementations. In Cases 2 and 3, the effect of reduced process performance caused by fouling in heat exchangers is simulated by reducing the heat exchange coefficients for the debutanizer
and splitter feed/bottoms exchangers in order to see their effect on the optimal solution. For
Case 2 new setpoints are determined so that this process deterioration does not reduce the
profit from Case 1. For Case 3, further deterioration of heat transfer leads to lower profit from
Case 1, but this is still an improvement over the base case. The effect of changing market
prices is seen in Cases 4 and 5. Here, changing market prices are reflected by an increase
in the price for propane (Case 4) or an increase in the base price for gasoline, together with
an increase in the octane credit (Case 5). In both cases, significant increases in profit are
observed, as the RTO determines setpoints that maximize the affected product flows.
All cases were solved to a KKT tolerance of 10^−8. Results are reported in Table 7.6,
where “poor” initialization indicates initialization at the original initial point, while the
“good” initialization results were obtained using the solution to the parameter case as the
initial point. Table 7.6 compares RTO performance from two studies [20, 356] using MINOS
and reduced-space SQP (rSQP), both described in Chapter 6, as the NLP solvers. For all
cases considered, both algorithms terminate with the same optimal solution. From Table 7.6
it is apparent that rSQP is at least as robust and considerably more efficient than MINOS. In
particular, for the poor initialization, there is a difference of almost two orders of magnitude.
Also, for this problem, rSQP is much less sensitive to a poor initial point than MINOS.
Moreover, the results for the good initializations indicate an order of magnitude improvement
in CPU times when comparing rSQP to MINOS.
where the indexed variables st,lm represent stream flows between tank indices l and m, and
qt,l and vt,l are qualities (i.e., blend stream properties) and volumes for index l, respectively,
at time t. The classical blending problem mixes feeds directly into blends. The related
pooling problem also considers intermediate pools where the feeds are mixed prior to being
directed to the final blends. These pools consist of source pools with a single purchased
feed, intermediate pools with multiple inputs and outputs, and final pools with a single final
blend as output. Also, if intermediate pools have multiple outlets at the same time, then
additional equations are added to enforce the same tank qualities on the outlet.
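The bilinear terms that make these models nonconvex come from the quality balances: a pool's outlet quality, itself an unknown, multiplies the unknown outlet flows. A small sketch with hypothetical flows and a single quality:

```python
# Flow-weighted quality balances for a pool and a final blend.
def mix_quality(streams):
    # streams: iterable of (flow, quality) pairs
    total = sum(f for f, _ in streams)
    return sum(f * q for f, q in streams) / total

# Two feeds (quality 3.0 and 1.0) mixed in a pool, then blended with a
# direct feed of quality 2.0; the pool quality q_pool reappears as a
# factor on the pool's outlet flow -- the bilinear term q * s.
q_pool = mix_quality([(60.0, 3.0), (40.0, 1.0)])
q_blend = mix_quality([(50.0, q_pool), (50.0, 2.0)])
print(q_pool, q_blend)  # the pool quality propagates into the blend
```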
The objective function of the gasoline blending model minimizes cost or maximizes
profit of production of blends and remains linear, with the nonlinearities seen only in the
constraints. Here we consider three blending models. As shown in Figure 7.10, the first two
models were proposed by Haverly [190] (three sources, two products, and one intermediate)
and Audet, Hansen, and Brimberg [17] (three sources, three products, and two intermediates)
each with only a single quality. The third model applies the blending formulation to a real-
world industrial problem [310] (17 sources, 13 intermediate tanks, 7 product tanks, and
5 products) with 48 qualities including octane number, density, and Reid vapor pressure.
Among these models we consider 8 examples, Nt = 1 and Nt = 25 for each of the first
two models and Nt = 1, 5, 10, 15 for the industrial model. Because all of these problems are
nonconvex and may admit locally optimal solutions, we apply the following initialization
strategy in an attempt to find globally optimal solutions:
1. Fix the quality variables at initial estimates so that (7.17) becomes linear in the stream flows.
2. Solve this restricted problem as a linear program (LP) for the stream flows.
3. Using the LP solution, fix the streams and solve for the optimal qualities. This provides
an upper bound to the solution of (7.17).
4. Using the resulting flows and qualities as a feasible initial guess, solve (7.17) with an
NLP solver.
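The four steps above can be sketched on the classic single-quality Haverly-type pooling instance (one pool fed by two purchased feeds, a third feed blended directly into the products; the costs, sulfur qualities, and demands below are the values commonly quoted in the pooling literature and are used here only for illustration). The quality is fixed to obtain the restricted LP, a consistent quality is recovered from the LP flows, and the resulting feasible point initializes a local NLP solve (SLSQP stands in for the NLP solvers compared in the text):

```python
import numpy as np
from scipy.optimize import linprog, minimize

# Illustrative Haverly-type data: feeds A, B enter the pool; feed C blends directly.
# A: $6/bbl, 3% S; B: $16/bbl, 1% S; C: $10/bbl, 2% S
# product X: $9/bbl, <= 100 bbl, <= 2.5% S; product Y: $15/bbl, <= 200 bbl, <= 1.5% S
# flow variables: [fA, fB, px, py, cx, cy]  (pool->X, pool->Y, C->X, C->Y)

def lp_step(q):
    """Steps 1-2: fix the pool quality q and solve the restricted LP for the flows."""
    c = [6.0, 16.0, -9.0, -15.0, 1.0, -5.0]           # minimize -profit
    A_eq = [[1, 1, -1, -1, 0, 0],                     # pool mass balance
            [3 - q, 1 - q, 0, 0, 0, 0]]               # pool quality balance (q fixed)
    A_ub = [[0, 0, q - 2.5, 0, -0.5, 0],              # X sulfur specification
            [0, 0, 0, q - 1.5, 0, 0.5],               # Y sulfur specification
            [0, 0, 1, 0, 1, 0],                       # X demand
            [0, 0, 0, 1, 0, 1]]                       # Y demand
    return linprog(c, A_ub=A_ub, b_ub=[0, 0, 100, 200],
                   A_eq=A_eq, b_eq=[0, 0], bounds=[(0, None)] * 6)

def nlp_step(x0):
    """Step 4: solve the full bilinear NLP from the feasible initial guess."""
    def neg_profit(v):
        fA, fB, px, py, cx, cy, q = v
        return -(9*(px+cx) + 15*(py+cy) - 6*fA - 16*fB - 10*(cx+cy))
    cons = [
        {'type': 'eq',   'fun': lambda v: v[0] + v[1] - v[2] - v[3]},
        {'type': 'eq',   'fun': lambda v: v[6]*(v[0]+v[1]) - 3*v[0] - v[1]},
        {'type': 'ineq', 'fun': lambda v: 2.5*(v[2]+v[4]) - (v[6]*v[2] + 2*v[4])},
        {'type': 'ineq', 'fun': lambda v: 1.5*(v[3]+v[5]) - (v[6]*v[3] + 2*v[5])},
        {'type': 'ineq', 'fun': lambda v: 100 - (v[2]+v[4])},
        {'type': 'ineq', 'fun': lambda v: 200 - (v[3]+v[5])},
    ]
    bnds = [(0, None)]*6 + [(1.0, 3.0)]
    return minimize(neg_profit, x0, method='SLSQP', bounds=bnds, constraints=cons)

lp = lp_step(1.5)                                      # steps 1-2: restricted LP
flows = lp.x
q = (3*flows[0] + flows[1]) / max(flows[0] + flows[1], 1e-10)   # step 3
nlp = nlp_step(np.append(flows, q))                    # step 4: local NLP solve
print("LP profit :", -lp.fun)
print("NLP profit:", -nlp.fun)
```

Being a local method, SLSQP is only guaranteed to match or improve on the LP profit from this feasible start; finding the global solution is not assured, which is precisely the motivation for the initialization strategy.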
To compare solutions we considered the solvers SNOPT (version 5.3), LOQO (version
4.05), IPOPT (version 2.2.1 using MA27 as the sparse linear solver) with an exact Hessian,
and IPOPT with the limited memory BFGS update. As noted in Chapter 6, SNOPT consists
of an rSQP method that uses BFGS updates, LOQO uses a full-space barrier method with
exact Hessian information and a penalty function line search, and IPOPT, using a barrier
method with a filter line search, is applied in two forms, with exact Hessians and with
quasi-Newton updates. Default options were used for all of the solvers; more information
on the comparison can be found in [310]. Results for the 8 blending cases were obtained
with the NEOS server at Argonne National Laboratory (http://www-neos.mcs.anl.gov) and
are presented in Table 7.7. Here n represents the number of variables and nS is the number
of superbasic variables (degrees of freedom) at the solution. Note that for these cases and
the initialization above, we always found the same local solutions, although there is no
guarantee that these are global solutions.
Table 7.7 shows normalized CPU times as well as iteration counts, which represent the
number of linear KKT systems that were solved. First, we consider the results for the Haverly
(HM) and the Audet and Hansen (AHM) models. For Nt = 1, the objective function values
are 400 and 49.2, respectively. These problems have few superbasic variables, all CPU
times are small, and there is no significant difference in the solution times for these solvers.
Note, however, that solvers that use exact second derivatives (LOQO and IPOPT (exact))
generally require fewer iterations. As a result, this set of results serves as a consistency check
that shows the viability of all of the methods. For Nt = 25, the objective function values are
10000 and 1229.17, respectively. These models have hundreds of degrees of freedom and
the smallest iteration counts are required by both LOQO and IPOPT (exact). Here, methods
without exact second derivatives (SNOPT and IPOPT (BFGS)) require at least an order of
magnitude more iterations because nS is large.
7.7 Exercises
1. Consider the Williams–Otto optimization problem presented in Section 7.3.1. Refor-
mulate and solve this problem in the modular mode.
2. Consider the process flowsheet in Figure 7.11. The plug flow reactor is used to convert
component A to B in a reversible reaction according to the rate equations
(F + FR ) dCA /dV = −k1 CA + k2 CB ,
(F + FR ) dCB /dV = k1 CA − k2 CB ,
where the feed has F = 10 l/s as the volumetric flow rate with concentrations CA,f =
1 mol/l and CB,f = 0 mol/l. V is the reactor volume, k1 = 0.10/s and k2 = 0.05/s,
the molecular weights of A and B are both 100, and the liquid density is 0.8 g/l. The
flash separator operates at 2 atm and temperature T with vapor pressure equations
(in atm):
log10 Pvap,A = 4.665 − 1910/T ,
log10 Pvap,B = 4.421 − 1565/T .
We assume that the purge fraction is η ∈ [0.01, 0.99], T ∈ [380K, 430K], V ∈ [250 l,
1500 l] and the profit is given by
where FR is the recycle volumetric flow rate and Btop is the mass flow of component B
exiting as top product from the flash separator.
(a) Formulate the process optimization model by solving the differential equations
analytically and formulating the flash equations.
(b) Set up an equation-oriented optimization model in GAMS, AMPL, or AIMMS
and solve. What problems are likely to occur in the solution?
(c) Comment on satisfaction of the KKT conditions. Calculate the reduced Hessian
and comment on the second order conditions.
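As a numerical cross-check for part (a), the reactor equations are linear in (CA , CB ), so CA + CB is conserved along the reactor and the concentration profile has a closed form. The sketch below verifies this against direct integration; FR and V are decision variables of the exercise, so the values used for them here are placeholders only:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Data from the exercise; FR and V are placeholder values for this check only.
F, FR = 10.0, 10.0        # l/s (FR is a decision variable; value assumed here)
k1, k2 = 0.10, 0.05       # 1/s
CA0, CB0 = 1.0, 0.0       # mol/l at the reactor inlet (recycle mixing ignored)
V = 1000.0                # l (also a decision variable; value assumed)

# Since CA + CB is conserved, (F+FR) dCA/dV = -(k1+k2) CA + k2 (CA0+CB0),
# giving CA(V) = CA_eq + (CA0 - CA_eq) exp(-(k1+k2) V/(F+FR)).
Ctot = CA0 + CB0
CA_eq = k2 * Ctot / (k1 + k2)
CA_analytic = CA_eq + (CA0 - CA_eq) * np.exp(-(k1 + k2) * V / (F + FR))

def rhs(Vol, y):
    CA, CB = y
    return [(-k1*CA + k2*CB)/(F+FR), (k1*CA - k2*CB)/(F+FR)]

sol = solve_ivp(rhs, [0.0, V], [CA0, CB0], rtol=1e-10, atol=1e-12)
CA_numeric = sol.y[0, -1]
print(CA_analytic, CA_numeric)   # both approach CA_eq = k2/(k1+k2) = 1/3
```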
3. Consider an optimization model for the alkylation process discussed in Liebman et al.
[260] and shown in Figure 7.12. The alkylation model is derived from simple mass
balance relationships and regression equations determined from operating data. The
first four relationships represent characteristics of the alkylation reactor and are given
empirically. The alkylate field yield, X4 , is a function of both the olefin feed, X1 ,
and the external isobutane to olefin ratio, X8 . The following relation is developed
from a nonlinear regression for temperatures between 80°F and 90°F and acid strength
between 85 and 93 weight percent:
The motor octane number of the alkylate, X7 , is a function of X8 and the acid
strength, X6 . The nonlinear regression under the same conditions as for X4 yields
The acid dilution factor, X9 , can be expressed as a linear function of the F-4 perfor-
mance number, X10 and is given by
X9 = 35.82 − 0.222X10 .
The remaining three constraints represent exact definitions for the remaining variables.
The external isobutane to olefin ratio is given by
X8 X1 = X2 + X5 .
The isobutane feed, X5 , is determined by a volume balance on the system. Here olefins
are related to alkylated product and there is a constant 22% volume shrinkage, thus
giving
X5 = 1.22X4 − X1 .
Finally, the acid dilution strength (X6 ) is related to the acid addition rate (X3 ), the
acid dilution factor (X9 ), and the alkylate yield (X4 ) by the equation
X6 (X4 X9 + 1000X3 ) = 98000X3 .
The objective function to be maximized is the profit ($/day)
OBJ = 0.063X4 X7 − 5.04X1 − 0.035X2 − 10X3 − 3.36X5
based on the following prices:
• Alkylate product value = $0.063/octane-barrel
• Olefin feed cost = $5.04/barrel
• Isobutane feed cost = $3.36/barrel
• Isobutane recycle cost = $0.035/barrel
• Acid addition cost = $10.00/barrel.
Use the following variable bounds: X1 ∈ [0, 2000], X2 ∈ [0, 16000], X3 ∈ [0, 120],
X4 ∈ [0, 5000], X5 ∈ [0, 2000], X6 ∈ [85, 93], X7 ∈ [90, 95], X8 ∈ [3, 12], X9 ∈ [1.2, 4]
for the following exercises:
(a) Set up this NLP problem and solve.
(b) The above regression equations are based on operating data and are only approximations; it is assumed that equally accurate expressions actually lie in a band around these expressions. Therefore, in order to consider the effect of
this band, replace the variables X4 , X7 , X9 , and X10 with RX4 , RX7 , RX9 , and
RX10 in the regression equations (only) and impose the constraints
0.99X4 ≤ RX4 ≤ 1.01X4 ,
0.99X7 ≤ RX7 ≤ 1.01X7 ,
0.99X9 ≤ RX9 ≤ 1.01X9 ,
0.9X10 ≤ RX10 ≤ 1.11X10
to allow for this relaxation. Resolve with this formulation. How would you
interpret these results?
(c) Resolve problems 1 and 2 with the following prices:
• Alkylate product value = $0.06/octane-barrel
• Olefin feed cost = $5.00/barrel
• Isobutane feed cost = $3.50/barrel
• Isobutane recycle cost = $0.04/barrel
• Acid addition cost = $9.00/barrel.
(d) Calculate the reduced Hessian at optimum of the above three problems and
comment on second order conditions.
Chapter 8
Introduction to Dynamic
Process Optimization
This chapter provides the necessary background to develop and solve dynamic optimization
problems that arise in chemical processes. Such problems span a wide variety of areas. Off-line applications include batch process operation, transition optimization for different product grades, analysis of transients and upsets, and parameter estimation. Online
problems include formulations for model predictive control, online process identification,
and state estimation. The chapter describes a general multiperiod problem formulation that
applies to all of these applications. It also discusses a hierarchy of optimality conditions for
these formulations, along with variational strategies to solve them. Finally, the chapter in-
troduces dynamic optimization methods based on NLP which will be covered in subsequent
chapters.
8.1 Introduction
With growing application and acceptance of large-scale dynamic simulation in process
engineering, recent advances have also been made in the optimization of these dynamic
systems. Application domains for dynamic optimization cover a wide range of process
tasks and include
• off-line and online problems in process control, particularly for multivariable sys-
tems that are nonlinear and output constrained; these are particularly important for
nonlinear model predictive control and real-time optimization of dynamic systems;
• optimum batch process operating profiles, particularly for reactors and separators;
• parameter estimation and inverse problems that arise in state estimation for process
control as well as in model building applications.
Moreover, in other disciplines, including air traffic management [325] and aerospace appli-
cations [40], modern tools for dynamic optimization play an increasingly important role.
These applications demand solution strategies that are efficient, reliable, and flexible for
different problem formulations and structures.
This chapter introduces dynamic optimization problems related to chemical processes.
It provides a general problem statement, examples in process engineering, and optimality
conditions for a class of these problems. Since dynamic optimization strategies need to
solve (with some reasonable level of approximation) problems in infinite dimensions, they
need to determine solutions even for poorly conditioned or unstable systems. Moreover, for
online applications, computations are time-limited and efficient optimization formulations
and solvers are essential, particularly for large-scale systems. In the next section we describe
differential-algebraic models for process engineering and state a general multistage formu-
lation for dynamic optimization problems. Specific cases of this formulation are illustrated
with process examples. Section 8.3 then develops the optimality conditions for dynamic
optimization problems, based on variational principles. This leads to the examination of a
number of cases, which are illustrated with small examples. Two particularly difficult cases
merit separate sections. Section 8.4 deals with path constraints, while Section 8.5 deals with
singular control problems. Section 8.6 then outlines the need for numerical methods to solve
these optimization problems and sets the stage for further developments in Chapters 9 and 10.
Finally, it should be noted that the style of presentation differs somewhat from the
previous chapters. As dynamic optimization deals with infinite-dimensional problems, it re-
lies on principles of functional analysis, which are beyond the scope of this book. Although
external citations are provided to the relevant theory, the presentation style will rely on
informal derivations rather than detailed proofs in order to present the key concepts. More-
over, some notational changes are made that differ from the previous chapters, although the
notation shall be clear from the context of the presentation.
and we assume that y(t) can be solved uniquely from g(z(t), y(t), u(t), p, t) = 0, once z(t), u(t), and p are specified. Equivalently, ∂g/∂y is nonsingular for all values of z(t), y(t), u(t), and p. The invertibility of g(·, y(t), ·, ·) allows an implicit elimination of the algebraic variables y(t) = y[z(t), u(t), p], which lets us consider the DAE with the same solution as
the related ordinary differential equation (ODE):
dz/dt = f (z(t), y[z(t), u(t), p], u(t), p) = f¯(z(t), u(t), p),  z(0) = z0 .  (8.3)
In Section 8.4, we will see that this corresponds to the index-1 property of the DAE system
(8.2). With this property, we can then rely on an important result (the Picard–Lindelöf
theorem) regarding the solution of initial value ODE problems. This is paraphrased by the
following theorem.
Theorem 8.1 [16, 59] For u(t) and p specified, let f¯(z(t), u(t), p) be Lipschitz continuous
for all z(t) in a bounded region with t ∈ [0, tf ]. Then the solution z(t) of the initial value problem (8.3) exists and is unique for t ∈ [0, tf ].
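The Lipschitz assumption cannot be dropped. A standard counterexample (not from the text) is dz/dt = z1/2 with z(0) = 0, where the right-hand side is continuous but not Lipschitz at z = 0: both z(t) ≡ 0 and z(t) = t2/4 satisfy the ODE, so uniqueness fails. A quick residual check:

```python
import numpy as np

def f(z):
    # right-hand side of dz/dt = sqrt(z); continuous but not Lipschitz at z = 0
    return np.sqrt(z)

t = np.linspace(0.0, 2.0, 201)

# candidate 1: z(t) = 0, so dz/dt = 0
res1 = np.abs(0.0 - f(np.zeros_like(t)))
# candidate 2: z(t) = t^2/4, so dz/dt = t/2, and sqrt(t^2/4) = t/2 for t >= 0
res2 = np.abs(t/2.0 - f(t**2/4.0))

print(res1.max(), res2.max())   # both residuals vanish: two distinct solutions
```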
DAE models of the form (8.2) arise in many areas of process engineering. The differ-
ential equations usually arise from conservation laws such as mass, energy, and momentum
balances. The algebraic equations are typically derived from constitutive equations and
equilibrium conditions. They include equations for physical properties, hydraulics, and rate
laws. The decision variables or “degrees of freedom” in dynamic optimization problems
are the control variables u(t) and the time-independent variables p. The former correspond
to manipulated variables that determine operating policies over time, while the latter of-
ten correspond to equipment parameters, initial conditions, and other steady state decision
variables.
Related to the initial value problems (8.2) and (8.3) are boundary value problems
(BVPs), where the initial condition z(0) = z0 is replaced by boundary conditions. Much
less can be said about existence and uniqueness of solutions for nonlinear boundary value
problems of the form
dz/dt = f¯(z(t)),  h(z(0), z[tf ; z(0)]) = 0,  (8.4)
where we suppress the dependence on u(t) and p for the moment, and z[tf ; z(0)] is defined
by (8.3) for an unknown z(0) that satisfies the boundary conditions. Solutions to (8.4) may
be nonunique, or may not even exist, and only local properties can be considered for this
problem. For instance, a key property is that a known solution to (8.4) is isolated, i.e., locally
unique, as expressed by the following theorem.
Theorem 8.2 [16, pp. 164–165] Consider problem (8.4) with a solution ẑ(t). Also, let f¯(z(t)) be Lipschitz continuous for all z(t) with ‖z(t) − ẑ(t)‖ ≤ ε for some ε > 0 and t ∈ [0, tf ]. Then the solution ẑ(t) is locally unique if and only if the matrix
Q = ∂h(ẑ(0), ẑ(tf ))/∂z(0) + [∂h(ẑ(0), ẑ(tf ))/∂z(tf )] Z(tf )
is nonsingular, where the fundamental solution matrix Z(t) = dz(t)/dz(0) is evaluated at the solution ẑ(t).
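As a scalar illustration of this condition (an example invented here, not from the text), consider dz/dt = z with the boundary condition h = z(0) + z(tf ) − c = 0. The fundamental solution can be computed by integrating the variational equation dZ/dt = (∂ f¯/∂z)Z with Z(0) = 1, giving Q = 1 + Z(tf ) = 1 + e^tf > 0, so any solution is isolated:

```python
import numpy as np
from scipy.integrate import solve_ivp

tf = 1.0

# ODE dz/dt = z together with its variational equation dZ/dt = (df/dz) Z = Z
def rhs(t, y):
    z, Z = y
    return [z, Z]                      # df/dz = 1 for this example

sol = solve_ivp(rhs, [0.0, tf], [1.0, 1.0], rtol=1e-10, atol=1e-12)
Z_tf = sol.y[1, -1]                    # fundamental solution dz(tf)/dz(0)

# boundary condition h(z(0), z(tf)) = z(0) + z(tf) - c: dh/dz(0) = dh/dz(tf) = 1
Q = 1.0 + 1.0 * Z_tf
print(Q)                               # ~ 1 + e; nonsingular, so the solution is isolated
```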
We will apply both of these properties in the derivation of optimality conditions for
DAE constrained optimization problems.
For the general setting we consider the optimization of dynamic systems over a number
of time periods, l = 1, . . . , NT , possibly with different DAE models, states, and decisions in
each period, t ∈ (tl−1 , tl ]. We formulate this multiperiod dynamic problem in the following
form:
min Σ_{l=1}^{NT} Φl (zl (tl ), y l (tl ), ul (tl ), p l )  (8.5a)
s.t. dzl /dt = f l (zl (t), y l (t), ul (t), p l ),  zl (tl−1 ) = z0l ,  (8.5b)
g l (zl (t), y l (t), ul (t), p l ) = 0,  (8.5c)
ulL ≤ ul (t) ≤ ulU ,  (8.5d)
pLl ≤ p l ≤ pUl ,  (8.5e)
yLl ≤ y l (t) ≤ yUl ,  (8.5f)
zLl ≤ zl (t) ≤ zUl ,  t ∈ (tl−1 , tl ],  l = 1, . . . , NT ,  (8.5g)
h(p, z01 , z1 (t1 ), z02 , z2 (t2 ), . . . , z0NT , zNT (tNT )) = 0.  (8.5h)
The dynamic optimization problem (8.5) is defined by separate models in each period l, with
initial conditions, z0l and inequality constraints represented here as simple bounds (8.5g)
within each period. Note that the state variables are not assumed to be continuous across
periods. Instead, a general set of boundary conditions is represented by (8.5h) to link the
states of these periods together. The resulting multiperiod formulation (8.5) captures most
dynamic optimization problems of interest, including the problem classes in chemical en-
gineering considered below.
C2 H6 → 2 CH3 ·
CH3 · +C2 H6 → CH4 + C2 H5 ·
H · +C2 H6 → H2 + C2 H5 ·
C2 H5 · → C2 H4 + H ·
C2 H5 · +C2 H4 → C3 H6 + CH3 ·
2 C2 H5 · → C4 H10
H · +C2 H4 → C2 H5 ·
As presented in [88], the resulting DAE model contains mass and energy balances for the
species in the reactor, mass action reaction kinetics, and a momentum balance that provides
the pressure profile in the reactor. The reactor system has only one period (or zone) and
there are no time-independent decisions p. The goal of the optimization problem is to find
an optimal profile for the heat added along the reactor length that maximizes the production
of ethylene. Note that there are also several undesirable by-products that need to be sup-
pressed in the reactor, in order to promote the evolution of product ethylene. The product
distribution is therefore determined by the reactor temperature profile, influenced by the
heat flux distribution. More details on this dynamic optimization problem and its solution
can be found in [88].
Parameter Estimation
Parameter estimation problems arise frequently in the elucidation of kinetic mechanisms
in reaction engineering and in model construction for reactive and transport systems. This
optimization problem is a crucial step in a wide variety of modeling and analysis tasks.
It is used in applications ranging from discriminating among candidate mechanisms in
fundamental reactive systems to developing predictive models for optimization in chemical
plants. Moreover, the resulting NLP solution is subjected to an extensive sensitivity analysis
that leads to statistical analysis of the model and its estimated parameters.
For model development, DAEs for parameter estimation arise from a number of
dynamic process systems, especially batch reactors. These applications fit the form of prob-
lem (8.5), with all periods generally described by the same model equations and state
variables. The objective function is usually based on a statistical criterion, typically derived
from maximum likelihood assumptions. Depending on knowledge of the error distribution,
these assumptions often lead to weighted least squares functions, with a structure that can be
exploited by the NLP algorithm. The objective function includes experimental data collected
at time periods tl , l = 1, . . . , NT , which need to be matched with calculated values from the
Figure 8.2. Data and fitted concentration profiles (A, top, and Q, bottom) from
parameter estimation of rate constants.
DAE model. These calculated values can be represented in (8.5) as states at the end of each
period, z(tl ), y(tl ). Control profiles are rarely included in parameter estimation problems.
Instead, the degrees of freedom, represented by p in (8.5), are the model parameters that
provide the best fit to the experimental data. This problem is illustrated in Figure 8.2 on a
small batch kinetic system. Here three reaction rate parameters need to be estimated for the
kinetic model A → Q, Q → S, A → S (with rate constants p1 , p2 , p3 , respectively) to match the concentration data for A and Q. The
evolution of these reactions is modeled by two differential equations to calculate concen-
trations for A and Q. The rate parameters in the DAE model are adjusted by the NLP solver
to minimize the squared deviation between the data and calculated concentration values.
More details on this application can be found in [383].
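A minimal version of this estimation problem can be sketched with synthetic, noise-free data. For the first-order network, the concentrations have closed forms, A(t) = e^{−(p1+p3)t} and Q(t) = p1 (e^{−p2 t} − e^{−(p1+p3)t})/((p1 + p3) − p2), so a least-squares fit recovers the rate constants. The "true" values and time grid below are invented for the illustration and are not the data of [383]:

```python
import numpy as np
from scipy.optimize import least_squares

def model(p, t):
    # closed-form concentrations for A -> Q (p1), Q -> S (p2), A -> S (p3), A(0) = 1
    p1, p2, p3 = p
    k = p1 + p3
    A = np.exp(-k * t)
    Q = p1 * (np.exp(-p2 * t) - np.exp(-k * t)) / (k - p2)
    return A, Q

t = np.linspace(0.05, 3.0, 30)
p_true = np.array([2.0, 1.0, 0.5])         # invented "true" rate constants
A_data, Q_data = model(p_true, t)          # synthetic noise-free "measurements"

def residuals(p):
    A, Q = model(p, t)
    return np.concatenate([A - A_data, Q - Q_data])

fit = least_squares(residuals, x0=[1.0, 0.5, 1.0], bounds=(1e-6, 10.0))
print(fit.x)   # recovers p_true on noise-free data, up to solver tolerance
```

Note that the A data fix p1 + p3 and the Q data fix p1 and p2 separately, so all three parameters are identifiable here; with real (noisy) data a weighted objective and the sensitivity analysis mentioned above would follow.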
Figure 8.3. Batch process stages (I, II, III, IV) and corresponding process schedules
integrated with dynamic optimization.
The economic objective over the batch campaign is usually motivated by the net present
value function that includes product sales, raw material costs, and operating costs. In ad-
dition, the makespan (the time of the total campaign) is a major consideration in the opti-
mization problem. This dictates the time required for a fixed product slate, or the number
of batches that can be produced over a fixed time horizon. Finally, batch processes are of-
ten driven by strong interactions between the production schedule, dynamic operation, and
equipment design. While optimization models with these interactions are often difficult to
solve, they can lead to significant improvements in performance and profitability.
Figure 8.3 illustrates a four-stage batch process with units I (batch reactor), II (heat
exchanger), III (centrifugal separator), and IV (batch distillation). Units II and III have no
degrees of freedom and are run as “recipe” units, while the reactor can be optimized through
a temperature profile and the batch distillation unit can be optimized through the reflux pro-
file. Also shown in Figure 8.3 is a set of production schedules, determined in [46], that result
from various levels of integration between the dynamic operation and unit scheduling. As
discussed in [46, 142, 317], tighter levels of integration can lead to significant reductions in
idle times and makespans.
Problem (8.6) with the final time objective function is often called the Mayer problem.
Replacing this objective by an integral over time leads to the problem of Lagrange, and a
problem with both integral and final time terms is known as the Bolza problem. All of these
problems are equivalent and can be formulated interchangeably.
We now assume a local solution to (8.6), (z∗ (t), y ∗ (t), u∗ (t), p ∗ ), and derive relations
based on perturbations around this solution. Since all constraints are satisfied at the optimal
solution, we adjoin these constraints through the use of costate or adjoint variables as
follows:
where Φ̄(z∗ (t), u∗ (t), y ∗ (t), p ∗ ) = Φ(z∗ (tf )) due to satisfaction of the constraints at the
solution. Although we now deal with an infinite-dimensional problem, the development
of this adjoined system can be viewed as an extension of the Lagrange function developed
in Chapter 4. The adjoint variables λ, νE , νI are functions of time. Here λ(t) serves as a
multiplier on the differential equations, while νE (t) and νI (t) serve as multipliers for the
corresponding algebraic constraints. In addition, ηE and ηI serve as multipliers for the final conditions.
Applying integration by parts to ∫0^tf λ(t)T (dz/dt) dt leads to
∫0^tf λ(t)T (dz/dt) dt = z(tf )T λ(tf ) − z(0)T λ(0) − ∫0^tf z(t)T (dλ/dt) dt,
We now define perturbations δz(t) = z(t) − z∗ (t), δy(t) = y(t) − y ∗ (t), δu(t) = u(t) −
u∗ (t), dp = p − p ∗ . We also distinguish between the perturbation δz(t) (abbreviated as δz),
which applies at a fixed time t, and dp which is independent of time. Because (z∗ (t), y ∗ (t),
u∗ (t), p ∗ ) is a local optimum, we note that
dΦ̄∗ = Φ̄(z∗ (t) + δz, u∗ (t) + δu, y ∗ (t) + δy, p ∗ + dp) − Φ̄(z∗ (t), u∗ (t), y ∗ (t), p ∗ ) ≥ 0
for all allowable perturbations (i.e., feasible directions) in a neighborhood around the solution. Using infinitesimal perturbations (where we drop the (t) argument for convenience) allows us to rely on linearizations to assess the change in the objective function dΦ̄∗ .
• z(0) is not specified and all perturbations are allowed for δz(0). In this case
λ(0) = 0.
• The initial condition is specified by variable p in the optimization problem,
z(0) = z0 (p). For this case, which subsumes the first two cases, we define
λ(0)T δz(0) = λ(0)T (∂z0 /∂p) dp,
By eliminating the state perturbation terms and suitably defining the adjoint variables above,
(8.9) is now simplified, and it is clear that only perturbations in the decisions will continue
to influence dΦ̄∗ , as seen in (8.14),
0 ≤ dΦ̄∗ = ∫0^tf [(∂f/∂u) λ + (∂gE /∂u) νE + (∂gI /∂u) νI ]T δu dt
+ [(∂z0 /∂p) λ(0) + ∫0^tf ((∂f/∂p) λ + (∂gE /∂p) νE + (∂gI /∂p) νI ) dt]T dp.  (8.14)
For condition (8.14) we first derive optimality conditions where the inequality
constraints are absent. Following this, we modify these conditions to include inequality
constraints.
(∂f/∂u) λ + (∂gE /∂u) νE = 0,  (8.15a)
(∂z0 /∂p) λ(0) + ∫0^tf [(∂f/∂p) λ(t) + (∂gE /∂p) νE ] dt = 0.  (8.15b)
and this allows us to concisely represent the optimality conditions (8.12), (8.11), (8.13),
and (8.15) as
These are the Euler–Lagrange equations developed for problem (8.6) without inequalities.
Adding the state equations to these conditions leads to the following differential-algebraic
boundary value problem, with an integral constraint:
dz/dt = ∂H (t)/∂λ = f (z(t), y(t), u(t), p),  z(0) = z0 ,  (8.16a)
dλ/dt = −[(∂f/∂z) λ + (∂gE /∂z) νE ] = −∂H /∂z,  λf = ∂Φ(zf )/∂z + (∂hE (zf )/∂z) ηE ,  (8.16b)
hE (z(tf )) = 0,  (8.16c)
gE (z(t), y(t), u(t), p) = ∂H (t)/∂νE = 0,  (8.16d)
(∂f/∂y) λ + (∂gE /∂y) νE = ∂H (t)/∂y = 0,  (8.16e)
(∂f/∂u) λ + (∂gE /∂u) νE = ∂H (t)/∂u = 0,  (8.16f)
(∂z0 /∂p) λ(0) + ∫0^tf (∂H /∂p) dt = 0,  (8.16g)
where we define zf = z(tf ) and λf = λ(tf ) for convenience. To illustrate these conditions,
we consider a small batch reactor example.
Example 8.3 (Batch Reactor Example without Bounds). Consider the nonisothermal batch
reactor with first order series reactions A → B → C. For optimal reactor operation, we seek
a temperature profile that maximizes the final amount of product B. The optimal control
problem can be stated as
Note that the optimization model has neither algebraic equations nor time-independent variables
p. To simplify the solution, we redefine the control profile as u(t) = k10 exp(−E1 /RT ) and
rewrite the problem as
where k = k20 /k10 and β = E2 /E1 . To obtain the optimality conditions, we form the Hamil-
tonian
H (t) = (λ2 − λ1 )a(t)u(t) − λ2 ku(t)β b(t).
The adjoint equations are given by
dλ1 /dt = −(λ2 − λ1 ) u(t),  (8.19a)
dλ2 /dt = λ2 k u(t)β ,  (8.19b)
λ1 (tf ) = 0,  λ2 (tf ) = −1,  (8.19c)
and the stationarity condition for the Hamiltonian is given by
∂H /∂u = (λ2 − λ1 ) a(t) − βk λ2 u(t)β−1 b(t) = 0.  (8.20)
Equations (8.18)–(8.20) take the form of the optimality conditions (8.16), but without the
conditions for the algebraic equations and decisions p. Note that u(t) can be recovered in
terms of the state and adjoint variables only if β ≠ 1. Otherwise, the problem is singular and
the more complex analysis in Section 8.5 is needed. The temperature profile can be found
by solving a two-point BVP, which consists of the state equations with initial conditions
(8.18b)–(8.18d), adjoint equations with final conditions (8.19a)–(8.19c), and an algebraic
equation (8.20), using a DAE BVP solver. The optimal state and control profiles are given
in Figure 8.4 for values of k = 2, β = 2, tf = 1.
gE (z(t), y(t), u(t), p) = ∂H (t)/∂νE = 0,  (8.23f)
0 ≤ νI (t) ⊥ gI (z(t), y(t), u(t), p) ≤ 0,  (8.23g)
(∂f/∂y) λ + (∂gE /∂y) νE + (∂gI /∂y) νI = ∂H (t)/∂y = 0,  (8.23h)
(∂f/∂u) λ + (∂gE /∂u) νE + (∂gI /∂u) νI = ∂H (t)/∂u = 0,  (8.23i)
(∂z0 /∂p) λ(0) + ∫0^tf (∂H /∂p) dt = 0.  (8.23j)
Solving the above conditions and determining the active constraints is considerably more
difficult than solving the BVP (8.16), because the additional complementarity conditions
must be considered at each time point. Also, as noted in [72], the junction points, where an
inactive inequality constraint becomes active, or vice versa, give rise to “corner conditions”
which can lead to nonsmoothness and even discontinuities in u(t). Following the derivations
in [72, 248], we note that the optimality conditions (8.23) hold “almost everywhere,” with
discontinuities in the profiles excluded.
In addition to the optimality conditions (8.23), we consider the following properties:
• For a locally unique solution of (8.23) and the corresponding state and adjoint profiles,
we know from Theorem 8.2 that the “algebraic parts” need to be invertible for u(t),
y(t), νE (t), and νI (t), and that matrix Q(t) for the associated BVP must be nonsingular.
• The conditions (8.23) represent only first order necessary conditions for optimality.
In addition to these, second order conditions are also needed. These are analogous
to those developed in Chapter 4 for NLP. In the absence of active constraints, these
conditions are known as the Legendre–Clebsch conditions and are given as follows:
– Necessary condition: ∂²H∗/∂u² is positive semidefinite for 0 ≤ t ≤ tf .
– Sufficient condition: ∂²H∗/∂u² is positive definite for 0 ≤ t ≤ tf .
• For autonomous problems, the Hamiltonian H (t) is constant over time. This can be
seen from
dH /dt = (∂H /∂z)(dz/dt) + (∂H /∂λ)(dλ/dt) + (∂H /∂y)(dy/dt) + (∂H /∂νE )(dνE /dt) + (∂H /∂νI )(dνI /dt) + (∂H /∂u)(du/dt) + (∂H /∂p)(dp/dt).  (8.24)
From (8.23), ∂H /∂λ = f (z, y, u, p) and dλ/dt = −∂H /∂z, so the first two terms cancel. Also note that dp/dt = 0. Moreover, from (8.23), ∂H /∂u = 0, ∂H /∂y = 0, ∂H /∂νE = gE = 0, and either ∂H /∂νI = gI = 0 or dνI /dt = 0, because νI (t) = 0.
• If final time tf is not specified, then it can be replaced by a scalar decision variable pf .
In this case, time can be normalized as t = pf τ , τ ∈ [0, 1], and the DAE system can
be rewritten as
dz/dτ = pf f (z(τ ), y(τ ), u(τ ), p),  z(0) = z0 ,
g(z(τ ), y(τ ), u(τ ), p) = 0.
The optimality conditions (8.23) can still be applied in the same way.
• Finally, the formulation (8.6) can also accommodate integrals in the objective or
final time constraint functions. A particular integral term, ∫0^tf φ(z, y, u, p) dt, can be replaced by a new state variable ζ (tf ) and a new state equation
dζ /dt = φ(z, y, u, p),  ζ (0) = 0.
With this substitution, the optimality conditions (8.23) can still be applied as before.
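As a small numerical check of this substitution (with an example invented here): for dz/dt = −z, z(0) = 1, the term ∫0^tf z² dt equals (1 − e^{−2tf })/2, and the augmented state ζ reproduces it:

```python
import numpy as np
from scipy.integrate import solve_ivp

tf = 1.0

def rhs(t, y):
    z, zeta = y
    return [-z, z**2]        # original state plus dzeta/dt = phi(z) = z^2, zeta(0) = 0

sol = solve_ivp(rhs, [0.0, tf], [1.0, 0.0], rtol=1e-10, atol=1e-12)
zeta_tf = sol.y[1, -1]
exact = (1.0 - np.exp(-2.0*tf)) / 2.0    # analytic value of the integral
print(zeta_tf, exact)
```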
With these properties in hand, we now consider a batch reactor example with control
profile inequalities.
Example 8.4 (Batch Reactor Example with Control Bounds). Consider a nonisothermal
batch reactor with first order parallel reactions A → B, A → C, where the goal is again to
find a temperature profile that maximizes the final amount of product B. However, here the
temperature profile has an upper bound. Using the same transformation for temperature as
in the previous system, the optimal control problem can be stated as
min −b(tf )  (8.26a)
s.t. da/dt = −a(t)(u(t) + ku(t)β ),  (8.26b)
db/dt = a(t)u(t),  (8.26c)
a(0) = 1,  b(0) = 0,  u(t) ∈ [0, U ].  (8.26d)
We form the Hamiltonian
H = −λ1 (u(t) + ku(t)β )a(t) + λ2 u(t)a(t) − ν0 u(t) + νU (u(t) − U ),
and the adjoint equations are given by
dλ1 /dt = λ1 (u(t) + ku(t)β ) − λ2 u(t),  (8.27a)
dλ2 /dt = 0,  (8.27b)
λ1 (tf ) = 0,  λ2 (tf ) = −1.  (8.27c)
Also, the stationarity conditions for the Hamiltonian are given by
∂H /∂u = −λ1 (1 + kβu(t)β−1 )a(t) + λ2 a(t) − ν0 + νU = 0,
0 ≤ u(t) ⊥ ν0 (t) ≥ 0,  0 ≤ (U − u(t)) ⊥ νU (t) ≥ 0.  (8.28)
For ν0 (t) = νU (t) = 0, we see that u(t) can be recovered in terms of the state and adjoint variables only if β ≠ 1. Otherwise, the problem is singular and the more complex analysis in
Section 8.5 is needed. The optimality conditions now consist of the state equations with initial
conditions (8.26b)–(8.26d), adjoint equations with final conditions (8.27a)–(8.27c), and an
algebraic equation with complementarity conditions (8.28). The solution of this system
for k = 0.5, β = 2, tf = 1, and U = 5 is found by solving this dynamic complementarity
problem. The state and control profiles are given in Figure 8.5. It can be verified that they
satisfy conditions (8.23).
Finally, we consider a special class of optimal control problems where the state vari-
ables and the control variables appear only linearly in (8.6); i.e., we have linear autonomous
DAEs (assumed index 1) and a linear objective function. Without additional constraints, the
solution to the “unconstrained” problem is unbounded. On the other hand, if simple bounds
are added to the control profiles, then the optimal control profile is either at its upper bound
or its lower bound. This is known as a “bang-bang” solution profile [248, 72, 127]. To close
this section, we apply the optimality conditions (8.23) and determine an analytical solution
to a linear example.
Example 8.5 (Linear Optimal Control with Control Constraints: Car Problem). Consider
the operation of a car over a fixed distance starting at rest and ending at rest. Defining the
states as distance z1 (t) and velocity z2 (t) along with the control u(t), we have
min tf
s.t. dz1 /dt = z2 (t),
dz2 /dt = u(t),
z1 (0) = 0,  z1 (tf ) = L,  u(t) ∈ [uL , uU ],
z2 (0) = 0,  z2 (tf ) = 0.
This problem is the classic double integrator problem. As it is linear in the state and control
variables, a minimum exists only if constraints are specified on either the state or the control
variables. Moreover, when constraints are placed on the control variables only, the solution
will be on the boundary of the control region. For the case of controls with simple bounds,
this leads to a bang-bang control profile.
To make this problem autonomous, we define a third state z3 to represent time and
rewrite the problem as
min z3(tf)                                            (8.29a)
s.t.  dz1/dt = z2(t),                                 (8.29b)
      dz2/dt = u(t),                                  (8.29c)
      dz3/dt = 1,                                     (8.29d)
      z1(0) = 0, z1(tf) = L, u(t) ∈ [uL, uU],         (8.29e)
      z2(0) = 0, z2(tf) = 0, z3(0) = 0.               (8.29f)
The Hamiltonian can be written as
H = λ1 z2 (t) + λ2 u(t) + λ3 + νU (u(t) − uU ) + νL (uL − u(t)),
and the adjoint equations are given by
dλ1/dt = 0,                                           (8.30a)
dλ2/dt = −λ1,                                         (8.30b)
dλ3/dt = 0,  λ3 = 1.                                  (8.30c)
Note that because the states z1 (t) and z2 (t) have both initial and final conditions, no
conditions are specified for λ1 and λ2 . Also, the stationarity conditions for the Hamiltonian
are given by
∂H/∂u = λ2 − νL + νU = 0,                             (8.31a)
0 ≤ (u(t) − uL) ⊥ νL ≥ 0,                             (8.31b)
0 ≤ (uU − u(t)) ⊥ νU ≥ 0.                             (8.31c)
For this linear problem with νL (t) = νU (t) = 0, we see that u(t) cannot be recovered in
terms of the state and adjoint variables, and no solution can be obtained. Instead, we expect
the solution to lie on the bounds. Therefore, to obtain a solution, we need an approach that
depends on the sign of λ2 (t). Fortunately, this approach is aided by analytic solutions of the
state and adjoint equations.
From the adjoint equations (8.30a)–(8.30c), we know that λ3 = 1, λ1 = c1 , and λ2 =
c1 (tf − t) + c2 , where c1 and c2 are constants to be determined. We consider the following
cases:
• c1 = 0, c2 = 0. This leaves an indeterminate u(t). Moreover, repeated time differentiation of ∂H/∂u will not yield any additional information to determine u(t).
• c1 ≥ 0, c2 ≤ 0. This case leads to a linear profile for λ2(t) with a switching point, as λ2 goes from positive to negative over time. The control profile corresponds
to full braking at initial time and up to a switching point, and full acceleration from
the switching point to final time. Again, this profile does not allow satisfaction of the
state boundary conditions.
• c1 ≤ 0, c2 ≥ 0. This case leads to a linear profile for λ2 (t) with a switching point as
it goes from negative to positive as time evolves. The control profile corresponds to
full acceleration at initial time and up to a switching point, and full braking from the
switching point to final time. It is the only case that allows the boundary conditions
to be satisfied.
To find tf and the switching point ts, we solve the state equations with u(t) = uU for
t ∈ [0, ts] and u(t) = uL for t ∈ (ts, tf]. Substituting the final conditions leads to two
equations and two unknowns:

z1(tf) = (uU/2) tf² + ((uL − uU)/2)(tf − ts)² = L,
z2(tf) = uU ts + uL(tf − ts) = 0

with the solution ts = (2L/(uU − uU²/uL))^(1/2) and, from z2(tf) = 0, tf = (1 − uU/uL)ts.
The solution profiles for this system with uL = −2, uU = 1, and L = 300 are shown in Figure 8.6.
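These formulas are easy to check numerically. A minimal sketch in plain Python with the data quoted above; note that the expression for tf follows from requiring z2(tf) = 0:

```python
import math

# Data from Example 8.5 (uL = -2 is full braking, uU = 1 is full acceleration)
uL, uU, L = -2.0, 1.0, 300.0

# Switching time and final time from the bang-bang analysis
ts = math.sqrt(2*L / (uU - uU**2/uL))   # entry to the braking arc
tf = (1 - uU/uL) * ts                   # chosen so that z2(tf) = 0

# Verify the boundary conditions by evaluating the piecewise state solution
z2_tf = uU*ts + uL*(tf - ts)                                   # velocity at tf
z1_tf = 0.5*uU*ts**2 + uU*ts*(tf - ts) + 0.5*uL*(tf - ts)**2   # distance at tf

print(ts, tf, z1_tf, z2_tf)   # 20.0 30.0 300.0 0.0
```

Both boundary conditions z1(tf) = 300 and z2(tf) = 0 are reproduced exactly.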
Finally, there are two classes of optimal control problems where the assumptions
made in the derivation of the optimality conditions lead to difficult solution strategies.
In the case of state variable path constraints, the adjoint variables can no longer be defined
as in (8.14), by shifting the emphasis on the control variables. In the case of singular control
problems, the optimality condition (8.15a) is not explicit in the control variable u(t) and the
optimal solution may not occur at the control bounds. Both cases require reformulation of
the control problem, and considerably more complicated solution strategies. We will explore
these challenging cases in the next two sections.
Definition 8.6 (Index of Differential-Algebraic Equations [16]). Consider the DAE systems
of the form (8.1) or (8.2) with decisions u(t) and p fixed. The index is the integer s that
represents the minimum number of differentiations of the DAE system (with respect to time)
required to determine an ODE for the variables z(t) and y(t).
• Even with consistent initial conditions, the numerical solution of (8.37) can differ significantly from the solution of (8.2). This follows because the reformulated problem (8.37) also satisfies the modified algebraic constraint

  g(z(t), y(t), u(t), p) = Σ_{i=0}^{s−1} βi t^i

  with arbitrary constants βi for the polynomial on the right-hand side. Consequently, roundoff errors from the numerical solution of (8.37) can lead to drift from the algebraic equations in (8.2).
General-purpose solution strategies for high-index DAEs are based on the solution of the
derivative array equations (8.32) using specialized algorithms for overdetermined systems
[44, 238, 167]. However, another approach to index reduction can be found by creating
additional algebraic equations, rather than differential equations, and replacing existing
differential equations with these new equations. For instance, the index-2 example (8.36),
dz/dt = y(t),  z(t) − 5 = 0
  =⇒  0 = d(z(t) − 5)/dt = dz(t)/dt = y(t)
can be reformulated as the purely algebraic index-1 system:
z(t) − 5 = 0,  y(t) = 0.
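The differentiate-and-substitute step of this reformulation can be mechanized with a computer algebra system. A small sketch using sympy on the toy example above (an illustration of the step only, not the book's procedure):

```python
import sympy as sp

t = sp.symbols('t')
z, y = sp.Function('z')(t), sp.Function('y')(t)

g = z - 5                                   # algebraic equation g = 0 contains no y(t)
new_g = sp.diff(g, t)                       # differentiate once: d(z - 5)/dt = dz/dt
new_g = new_g.subs(sp.Derivative(z, t), y)  # substitute the ODE dz/dt = y(t)

print(new_g)   # y(t): together with z(t) - 5 = 0, this gives the index-1 system
```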
We now apply this algorithm to the index reduction of a dynamic phase separation.
Example 8.7 (Index Reduction for Dynamic Flash Optimization). Consider the optimal
operation of a flash tank with N C components (with index i), a fixed liquid feed F (t) with
mole fractions zi , and vapor and liquid outlet streams, as shown in Figure 8.7. The goal
of the optimization is to adjust the tank temperature T (t) to maximize the recovery of an
intermediate component subject to purity limits on the recovered product.
The dynamic model of the flash tank can be given by the following DAEs. The
differential equations relate to the mass balance:
dMi/dt = F(t)zi(t) − L(t)xi(t) − V(t)yi(t),   i = 1, . . . , NC,      (8.38a)
while the following algebraic equations describe the equilibrium and hydraulic conditions
of the flash:
M(t) = Σ_{i=1}^{NC} Mi(t),                                            (8.39a)
M = ML + MV,                                                          (8.39b)
Mi = ML xi(t) + MV yi(t),   i = 1, . . . , NC,                        (8.39c)
yi = Ki(T, P)xi,   i = 1, . . . , NC,                                 (8.39d)
CT = MV/ρV(T, P, y) + ML/ρL(T, P, x),                                 (8.39e)
L = ψL(P, M),                                                         (8.39f)
V = ψV(P − Pout),                                                     (8.39g)
0 = Σ_{i=1}^{NC} (yi − xi) = Σ_{i=1}^{NC} (Ki(T, P) − 1)xi.           (8.39h)
The differential variables are component holdups Mi (t), while algebraic variables are M,
the total holdup; MV , the vapor holdup; ML , the liquid holdup; xi , the mole fraction of
component i in the liquid phase; yi , the mole fraction of component i in the vapor phase; P ,
flash pressure; L, liquid flow rate; and V , vapor flow rate. In addition, Pout is the outlet
pressure, CT is the capacity (volume) of the flash tank, ρL and ρV are molar liquid and
vapor densities, Ki (T , P ) is the equilibrium expression, ψ V provides the valve equation,
and ψ L describes the liquid hydraulics of the tank. This model can be verified to be index 1,
but as pressure changes much faster than composition, it is also a stiff DAE system.
A simplified form of (8.38)–(8.39) arises by neglecting the vapor holdup MV (t) ≈ 0,
as this is 2 to 3 orders of magnitude smaller than the liquid holdup. This also allows the
pressure P to be specified directly and leads to a less stiff DAE, which can be written as
dM/dt = F(t) − L(t) − V(t),                                           (8.40a)
dxi/dt = [F(t)(zi − xi) − V(t)(yi − xi)]/M(t),   i = 1, . . . , NC,   (8.40b)
yi(t) = Ki(T, P)xi(t),   i = 1, . . . , NC,                           (8.40c)
L(t) = ψL(P, M),                                                      (8.40d)
0 = Σ_{i=1}^{NC} (yi(t) − xi(t)) = Σ_{i=1}^{NC} (Ki(T, P) − 1)xi(t).  (8.40e)
This model requires much less data than the previous one. With pressure specified, the
tank capacity and the vapor valve equations are no longer needed. However, this system
is now index 2, and two problems are apparent with (8.40). First, the algebraic variable
V(t) cannot be calculated from the algebraic equations; the last equation does not contain
the unassigned variable V(t). Second, the initial conditions for xi(0) cannot be specified
independently, since Σi xi(0) = 1.
Applying Algorithm 8.1 to (8.40) leads to the following steps:
• Step 1 requires us to consider the last algebraic equation only.
• Step 2 requires differentiating the last equation, leading to Σ_{i=1}^{NC} (Ki(T, P) − 1) dxi/dt = 0.
• Step 3 deals with substituting for the differential terms. This leads to the new algebraic equation,

  Σ_{i=1}^{NC} (Ki(T, P) − 1)[F(t)(zi(t) − xi(t)) − V(t)(yi(t) − xi(t))]/M(t) = 0,

  and one of the differential equations must be eliminated, say the last one.
This leads to the following index-1 DAE model:
dM/dt = F(t) − L(t) − V(t),                                               (8.41a)
dxi/dt = [F(t)(zi − xi) − V(t)(yi − xi)]/M(t),   i = 1, . . . , NC − 1,   (8.41b)
yi(t) = Ki(T, P)xi(t),   i = 1, . . . , NC,                               (8.41c)
L(t) = ψL(P, M),                                                          (8.41d)
0 = Σ_{i=1}^{NC} (yi(t) − xi(t)) = Σ_{i=1}^{NC} (Ki(T, P) − 1)xi(t),      (8.41e)
0 = Σ_{i=1}^{NC} (Ki(T, P) − 1)[F(t)(zi − xi) − V(t)(yi − xi)]/M(t).      (8.41f)
Note that xNC is now an algebraic variable and there are no restrictions on specifying initial
conditions for the differential variables.
Nj(z(t)) = [gI,j(z(t)), dgI,j(z(t))/dt, . . . , d^(q−1) gI,j(z(t))/dt^(q−1)]^T = 0,
d^q gI,j(z(t))/dt^q = 0,   t ∈ (tentry, texit),
which must be satisfied when the path constraint is active. Since the control variable appears
in the differentiated constraint, the Hamiltonian is now redefined as
H(t) = f(z, u, p)^T λ + gE(z, u, p)^T νE + Σ_j νI,j d^(qj) gI,j(z, u)/dt^(qj),
where we have distinguished the integer qj for each path constraint. The Euler–Lagrange
conditions can now be derived as in (8.23). In addition, state path constraints also require
the additional corner conditions:
λ(tentry⁺) = λ(tentry⁻) − (∂Nj/∂z)^T πj,          (8.42a)
λ(texit⁺) = λ(texit⁻)                             (8.42b)
or
λ(texit⁺) = λ(texit⁻) − (∂Nj/∂z)^T πj,            (8.43a)
λ(tentry⁺) = λ(tentry⁻).                          (8.43b)
Here πj is an additional multiplier on the corner conditions, and t⁻ and t⁺ represent points
just before and just after the change in the active set. Note that the choice of corner conditions
emphasizes the nonuniqueness of the multipliers with path inequalities [72, 307]. In fact,
there are a number of related conditions that can be applied to the treatment of path inequal-
ities [188] and this confirms the ill-posedness of the modified Euler–Lagrange equations.
Characteristics of path inequalities will be considered again in Chapter 10. Because of
the complexity in handling the above conditions along with finding entry and exit points, the
treatment of these constraints is difficult for all but small problems. The following example
illustrates the application of these conditions.
Example 8.8 (Car Problem Revisited). We now consider a slight extension to Example 8.5
by imposing a speed limit constraint, gI (z(t)) = z2 − V ≤ 0. As described above, the deriva-
tion of the Euler–Lagrange equations defines the adjoint system so that the objective and
constraint functions are influenced directly by control variables. To recapture this influence,
we differentiate gI (z) with respect to time in order to recover the control variable. For this
example, we therefore have
dgI/dt = (∂gI/∂z)^T f(z, u) = dz2/dt = u(t),
and we define a multiplier for the path constraint with the following Euler–Lagrange
equations:
H = λ^T f(z, u) + νI (dgI/dt) + νU (u(t) − uU) + νL (uL − u(t))
  = λ1 z2(t) + λ2 u(t) + νI u(t) + νU (u(t) − uU) + νL (uL − u(t)),   (8.44)
dλ1/dt = 0,                                                           (8.45)
dλ2/dt = −λ1,                                                         (8.46)
∂H/∂u = λ2 + νI − νL + νU = 0,                                        (8.47)
0 ≤ (u(t) − uL) ⊥ νL ≥ 0,                                             (8.48)
0 ≤ (uU − u(t)) ⊥ νU ≥ 0,                                             (8.49)
0 ≤ (V − z2(t)) ⊥ νI ≥ 0.                                             (8.50)
These conditions are more complex than in the previous examples. Moreover, because of
the influence of the path constraint, an additional corner condition is required:
λ2 (t1+ ) = λ2 (t1− ) − π ,
where π is a constant and t1 is the entry point of the path constraint. Solution of path-
constrained problems through the above Euler–Lagrange equations is generally difficult,
especially because the location of the active path constrained segments is not known a
priori. Nevertheless, the following intuitive solution:
u(t) = { uU,  λ2 < 0,    t ∈ [0, t1),
         0,   λ2 = −νI,  t ∈ (t1, t2),            (8.51)
         uL,  λ2 > 0,    t ∈ (t2, tf],
can be easily checked with these Euler–Lagrange equations (see Exercise 5). The solu-
tion profiles for this system with uL = −2, uU = 1, V = 15, and L = 300 are shown in
Figure 8.8.
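Assuming the bang–zero–bang structure in (8.51), the entry point t1, exit point t2, and final time tf can be computed in closed form. A sketch in plain Python with the data quoted above (uL = −2, uU = 1, V = 15, L = 300):

```python
# Data for the speed-limited car: braking bound uL, acceleration bound uU,
# speed limit V, and total distance L
uL, uU, V, L = -2.0, 1.0, 15.0, 300.0

t1 = V/uU                      # end of full acceleration: z2(t1) = V
d1 = V**2/(2*uU)               # distance covered while accelerating
d3 = V**2/(2*abs(uL))          # distance needed to brake from V to rest
t2 = t1 + (L - d1 - d3)/V      # exit of the constrained (u = 0) arc
tf = t2 + V/abs(uL)            # final time

print(t1, t2, tf)   # 15.0 23.75 31.25
```

The three arc lengths account for the full distance L, so the structure assumed in (8.51) is consistent with the boundary conditions.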
Figure 8.8. State and control profiles for car problem with speed limit.
and
∂H/∂u = Hu(t) = f2(z)^T λ − νL + νU = 0.
Repeated time derivatives of Hu (t) are required to obtain an explicit function in u(t). If such
a function exists, then the condition for the singular arc over t ∈ (tentry , texit ) is given by
d^q Hu(t)/dt^q = ϕ(z, u) = 0,                      (8.52)
where q is an even integer that represents the minimum number of times that Hu must be
differentiated. The order of the singular arc is given by q/2 and, for a scalar control, a second
order condition over the singular arc can be defined by the generalized Legendre–Clebsch
condition [72].
• Necessary condition:
  (−1)^(q/2) ∂/∂u [d^q Hu(t)/dt^q] = ∂ϕ(z, u)/∂u ≥ 0   for t ∈ (tentry, texit).
• Sufficient condition:
  (−1)^(q/2) ∂/∂u [d^q Hu(t)/dt^q] = ∂ϕ(z, u)/∂u > 0   for t ∈ (tentry, texit).
As with state path inequalities, entry and exit points must be found, and the stationarity conditions d^l Hu(t)/dt^l = 0, l = 0, . . . , q − 1, must also be satisfied for t ∈ [tentry, texit].
On the other hand, there are no corner conditions, and both the Hamiltonian and the adjoint
variables remain continuous over [0, tf ].
To illustrate the application of the singular arc conditions, we consider a classical
optimal control example in reaction engineering.
Example 8.9 (Singular Optimal Control: Catalyst Mixing Problem). Consider the catalyst
mixing problem analyzed by Jackson [207]. The reactions A ⇐⇒ B → C take place in
a tubular reactor at constant temperature. The first reaction is reversible and is catalyzed
by Catalyst I, while the second irreversible reaction is catalyzed by Catalyst II. The goal
of this problem is to determine the optimal mixture of catalysts along the length t of the
reactor in order to maximize the amount of product C. Using u(t) ∈ [0, 1] as the fraction of
Catalyst I, an intuitive solution would be to use Catalyst I at the beginning of the reactor
with Catalyst II toward the end of the reactor, leading to a “bang-bang” profile. As with
the solution in Example 8.5, the switching point would then be determined by the available
length of the reactor, tf . However, the bang-bang policy leads only to production of B in
the first portion, and for large tf this production is limited by equilibrium of the reversible
reaction. As a result, production of C will be limited as well. Instead, the production of C
can be enhanced through a mixture of catalysts over some internal location in the reactor,
where the reversible reaction is driven forward by consumption, as well as production, of B.
This motivates the singular arc solution.
The resulting optimal catalyst mixing problem can be stated as
max c(tf)
s.t.  da/dt = −u(k1 a(t) − k2 b(t)),
      db/dt = u(k1 a(t) − k2 b(t)) − (1 − u)k3 b(t),
      a0 = a(t) + b(t) + c(t),
      a(0) = a0,  b(0) = 0,  u(t) ∈ [0, 1].
By eliminating the algebraic state c(t) and its corresponding algebraic equation, an equivalent, but simpler, version of this problem minimizes a(tf) + b(tf) subject to the two differential equations and u(t) ∈ [0, 1]. Applying the optimality conditions (8.23) to this form gives the Hamiltonian
H(t) = (λ2 − λ1)(k1 a(t) − k2 b(t))u(t) − λ2 k3 b(t)(1 − u(t)) − ν0 u(t) + ν1 (u(t) − 1).   (8.54)
dλ1/dt = −(λ2 − λ1)k1 u(t),                        (8.55)
dλ2/dt = (λ2 − λ1)k2 u(t) + (1 − u(t))λ2 k3,       (8.56)
λ1(tf) = 1,  λ2(tf) = 1,                           (8.57)
∂H/∂u = J(t) − ν0 + ν1 = 0,                        (8.58)
0 ≤ ν0(t) ⊥ u(t) ≥ 0,                              (8.59)
0 ≤ ν1(t) ⊥ (1 − u(t)) ≥ 0,                        (8.60)
where J(t) = (λ2 − λ1)(k1 a(t) − k2 b(t)) + λ2 k3 b(t) denotes the coefficient of u(t) in the Hamiltonian.
• From the state equations we see that b(tf ) → 0 only if tf → ∞ and u(tf ) = 0.
Else, b(t) > 0 for t > 0. The final conditions for the adjoint variables, i.e., λ1 (tf ) =
λ2(tf) = 1, allow the Hamiltonian to be written as H(t) = J(t)u(t) − λ2(t)k3 b(t),
and since J(tf) = k3 b(tf) > 0, we have u(tf) = 0 and H(t) = H(tf) = −k3 b(tf) < 0.
• Assume at t = 0 that u(0) = 1. For u(t) = 1, we note that a0 = a(t) + b(t) and
  da/dt + db/dt = −k3 b(t)(1 − u(t)) = 0.
Since
H (t) = J (t)u(t) − λ2 k3 b(t) < 0,
we have at t = 0,
−k3 b(tf ) = H (0) = J (0) − λ2 k3 b(0) = J (0) < 0,
which is consistent with u(0) = 1. Alternately, if we had assumed that u(0) ∈ [0, 1),
we would have H (0) = 0, which contradicts H (0) = −k3 b(tf ) < 0.
• By continuity of the state and the adjoint variables, we know that J (t) must switch
sign at some point in t ∈ (0, tf ). Moreover, to indicate the presence of a singular arc,
we need to determine whether J (t) = 0 over a nonzero length.
Now if J = 0 over a nonzero length segment, we can apply the following time differentia-
tions:
J = (λ2 − λ1)(k1 a − k2 b) + λ2 k3 b = 0,                               (8.61a)
dJ/dt = (λ̇2 − λ̇1)(k1 a − k2 b) + (λ2 − λ1)(k1 ȧ − k2 ḃ) + λ̇2 k3 b + λ2 k3 ḃ
      = k3 (k1 a λ2 − k2 b λ1) = 0,                                     (8.61b)
d²J/dt² = λ2 k3 (k1 a − k2 b) + u [λ1 (k2 b(k2 − k3 − k1) − 2k1 k2 a)
      + λ2 (k1 a(k2 − k3 − k1) + 2k1 k2 b)] = 0,                        (8.61c)
where ξ̇ = dξ/dt. Note that two time differentiations, d²J/dt² = 0, are required to expose u(t) and
to yield an expression for u(t) over this time segment. This leads to a singular arc of order
q/2 = 1. Another approach used in [207] would be to reformulate the simpler expression
in (8.61b) by defining a new variable λ̄ as
λ̄ = λ1/λ2 = k1 a/(k2 b).                                               (8.62)
Substituting into (8.61a) leads to
J = λ2(t) k2 b(t)[k3/k2 − (1 − λ̄)²] = 0  =⇒  λ̄ = 1 ± (k3/k2)^(1/2),    (8.63)
so λ̄ is a constant for the singular arc. Over this time period we can write
λ̇1 = λ̄λ̇2 =⇒ (λ1 − λ2 )k1 u = λ̄[(λ2 − λ1 )k2 u + (1 − u)k3 λ2 ]. (8.64)
Dividing this equation by (λ2 k2) and using (8.62) and (8.63) leads to the singular arc value
of u(t):
−(1 − λ̄)u k1/k2 = λ̄[(1 − λ̄)u + (1 − u)(1 − λ̄)²]  =⇒  u = λ̄(λ̄ − 1)k2/(k1 + k2 λ̄²).   (8.65)
Since u(t) > 0 for the singular arc, we require λ̄ = 1 + (k3/k2)^(1/2), so that
u(t) = ((k3 k2)^(1/2) + k3)/(k1 + k2 + k3 + 2(k3 k2)^(1/2)).            (8.66)
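The algebra leading from (8.65) to (8.66) can be double-checked symbolically. A sketch with sympy; the comparison at the end is numerical, using sample positive rate constants:

```python
import sympy as sp

u, lb, k1, k2, k3 = sp.symbols('u lambda_bar k1 k2 k3', positive=True)

# Equation (8.65) before solving for u (lb stands for lambda_bar)
eq = sp.Eq(-(1 - lb)*u*k1/k2, lb*((1 - lb)*u + (1 - u)*(1 - lb)**2))
u_sing = sp.simplify(sp.solve(eq, u)[0])   # -> lb*(lb - 1)*k2/(k1 + k2*lb**2)

# Substitute lambda_bar = 1 + (k3/k2)**(1/2) and compare with (8.66) numerically
u_arc = u_sing.subs(lb, 1 + sp.sqrt(k3/k2))
u_866 = (sp.sqrt(k3*k2) + k3)/(k1 + k2 + k3 + 2*sp.sqrt(k3*k2))
vals = {k1: 1, k2: 10, k3: 1}
print(float(u_arc.subs(vals)), float(u_866.subs(vals)))   # both ≈ 0.2271
```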
Figure 8.9. State and control profiles for optimal catalyst mixing.
The switching points can now be determined by solving the state equations forward from
t = 0 and the adjoint equations backward from t = tf as follows:
• From t = 0 we have u(t) = 1 and λ̄ = k1 a(t)/(k2 b(t)). The state equations are then integrated forward until k1 a(t)/(k2 b(t)) = 1 + (k3/k2)^(1/2), the value of λ̄ in the singular arc. This defines the entry point, t1, for the singular arc.
• For t ∈ [t2, tf], we note that
  dλ̄/dt = (λ̇1 λ2 − λ̇2 λ1)/(λ2)² = −λ1 λ2 k3/(λ2)² = −k3 λ̄.           (8.68)
  Since λ1(tf) = λ2(tf) = λ̄(tf) = 1, equation (8.68) can be integrated backward until λ̄(t) = exp((tf − t)k3) = 1 + (k3/k2)^(1/2). This defines the exit point t2.
With this solution strategy, optimal control and state profiles are shown in Fig-
ure 8.9 for k1 = 1, k2 = 10, k3 = 1, and tf = 4. These reveal a “bang-singular-bang” control
policy.
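For this example the switching calculation can be carried out in closed form. A sketch in plain Python, assuming a(0) = a0 = 1 (the feed value a0 is not specified in this excerpt); the entry point uses the analytic solution of the state equations for u = 1, where a(t) relaxes exponentially toward k2 a0/(k1 + k2):

```python
import math

k1, k2, k3, tf = 1.0, 10.0, 1.0, 4.0
a0 = 1.0                                   # assumed initial composition (hypothetical)

lam_bar = 1 + math.sqrt(k3/k2)             # constant value of lambda_bar on the arc
u_sing = (math.sqrt(k3*k2) + k3)/(k1 + k2 + k3 + 2*math.sqrt(k3*k2))   # (8.66)

# Entry point t1: with u = 1 and a + b = a0, a(t) = a_inf + (a0 - a_inf)*exp(-(k1+k2)*t)
a_inf = k2*a0/(k1 + k2)
a_star = lam_bar*k2*a0/(k1 + lam_bar*k2)   # state where k1*a/(k2*b) = lambda_bar
t1 = -math.log((a_star - a_inf)/(a0 - a_inf))/(k1 + k2)

# Exit point t2: integrate (8.68) backward from lambda_bar(tf) = 1
t2 = tf - math.log(lam_bar)/k3

print(round(u_sing, 4), round(t1, 4), round(t2, 4))   # 0.2271 0.1363 3.7252
```

The computed values match the bang–singular–bang structure visible in Figure 8.9: a short initial u = 1 arc, a long singular arc at u ≈ 0.227, and a final u = 0 arc.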
problems, starting from “unconstrained” problems to mixed path constraints and conclud-
ing with state path constraints and singular control. For this progression, the optimality
conditions became increasingly harder to apply and the illustrative examples often rely on
analytic solutions of the state and adjoint equations to facilitate the treatment of inequality
constraints and singular arcs. Unfortunately, this is no longer possible for larger systems,
and efficient numerical methods are required instead.
A wide variety of approaches has been developed to address the solution of (8.6). These
strategies can be loosely classified as Optimize then Discretize and Discretize then Optimize.
In the first case, the state equations and the Euler–Lagrange equations are discretized in time
and solved numerically. In the second case, the state and control profiles are discretized and
substituted into the state equations. The resulting algebraic problem is then solved with an
NLP algorithm. There are a number of pros and cons to both approaches, and a roadmap of
optimization strategies is sketched in Figure 8.10.
machinery of large-scale NLP solvers. This approach has the advantage of directly find-
ing good approximate solutions that are feasible to the state equations. On the other hand,
the formulation requires an accurate level of discretization of the control (and possibly the
state) profiles. This may be a daunting task particularly for path constrained and singular
problems.
As seen in Figure 8.10, methods that apply NLP solvers can be separated into two
groups: the sequential and the simultaneous strategies. In the sequential methods, only
the control variables are discretized and the resulting NLP is solved with control vector
parameterization (CVP) methods. In this formulation the control variables are represented
as piecewise polynomials [398, 399, 28] and optimization is performed with respect to the
polynomial coefficients. Given initial conditions and a set of control parameters, the DAE
model is then solved in an inner loop, while the parameters representing the control variables
are updated on the outside using an NLP solver. Gradients of the objective function with
respect to the control coefficients and parameters are calculated either from direct sensitivity
equations of the DAE system or by integration of adjoint sensitivity equations.
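For a scalar toy model (our own construction, not from the text), the direct sensitivity equations can simply be appended to the state equations and integrated together. Here dz/dt = −pz with z(0) = 1, whose sensitivity s = ∂z/∂p satisfies ds/dt = −ps − z, s(0) = 0, with the analytic check s(t) = −t e^(−pt):

```python
import math
from scipy.integrate import solve_ivp

# Direct sensitivity sketch: integrate the state z and its sensitivity s = dz/dp
# as one augmented system, as a sequential code would do.
p, tf = 0.8, 2.0

def rhs(t, w):
    z, s = w
    return [-p*z, -p*s - z]   # state equation and its sensitivity equation

sol = solve_ivp(rhs, (0.0, tf), [1.0, 0.0], rtol=1e-10, atol=1e-12)
z_tf, s_tf = sol.y[:, -1]
print(round(s_tf, 6), round(-tf*math.exp(-p*tf), 6))   # both ≈ -0.403793
```

The integrated sensitivity agrees with the analytic derivative of z(tf) = e^(−p tf) with respect to p.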
Sequential strategies are relatively easy to construct, as they link reliable and efficient
codes for DAE and NLP solvers. On the other hand, they require repeated numerical integra-
tion of the DAE model and are not guaranteed to handle open loop unstable systems [16, 56].
Finally, path constraints can also be incorporated within this approach but are often handled
approximately, within the limits of the control parameterization. These approaches will be
developed in more detail in the next chapter.
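A minimal single-shooting sketch of this sequential loop, using scipy on a toy problem of our own construction (not from the text): the model is dz/dt = u(t) with u piecewise constant on four intervals, the inner loop integrates the model, and the outer SLSQP iteration updates the control parameters, with gradients left to the solver's finite differences rather than sensitivity equations:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

# Toy problem (illustration only): dz/dt = u(t), z(0) = 0, drive z(1) toward 1
# with a small control penalty; u is piecewise constant on N intervals.
N = 4
t_grid = np.linspace(0.0, 1.0, N + 1)

def final_state(p):
    z = 0.0
    for k in range(N):   # inner loop: integrate the model interval by interval
        sol = solve_ivp(lambda t, w, uk=p[k]: [uk],
                        (t_grid[k], t_grid[k + 1]), [z])
        z = sol.y[0, -1]
    return z

def phi(p):              # objective seen by the NLP solver
    return (final_state(p) - 1.0)**2 + 1e-3*np.dot(p, p)

res = minimize(phi, np.zeros(N), method='SLSQP')   # outer loop: NLP update of p
print(res.success, round(final_state(res.x), 2))   # expect True and ≈ 1.0
```

In a real sequential code the finite-difference gradients would be replaced by direct or adjoint sensitivities supplied through the solver's `jac` argument.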
Optimization with multiple shooting serves as a bridge between sequential and di-
rect transcription approaches and was developed to handle unstable DAE systems. In this
approach, the time domain is partitioned into smaller time elements and the DAE models
are integrated separately in each element [60, 62, 250]. Control variables are treated in
the same manner as in the sequential approach and sensitivities are obtained for both the
control variables, as well as for the initial conditions of the states in each element. In addi-
tion, equality constraints are added to the nonlinear program in order to link the elements
and ensure that the states remain continuous over time. Inequality constraints for states and
controls can be imposed directly at the grid points, although constraints on state profiles
may be violated between grid points. These methods will also be developed in the next
chapter.
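The element-linking idea can be sketched on a similar toy problem (again our own construction): on each element, dz/dt = −z + u_k has the exact solution z(h) = e^(−h) z(0) + (1 − e^(−h)) u_k, so the per-element "integration" is a closed-form map, node states become extra decision variables, and equality constraints enforce continuity:

```python
import math
import numpy as np
from scipy.optimize import minimize

# Toy multiple shooting sketch: N elements of length h, node states s_0..s_N,
# piecewise-constant controls u_0..u_{N-1}, continuity constraints between elements.
N, h, z0, target = 4, 0.25, 0.0, 0.5
alpha, beta = math.exp(-h), 1.0 - math.exp(-h)   # exact element map coefficients

def unpack(x):                       # x = [s_0..s_N, u_0..u_{N-1}]
    return x[:N + 1], x[N + 1:]

def objective(x):
    s, u = unpack(x)
    return (s[N] - target)**2 + 1e-3*np.dot(u, u)

def continuity(x):                   # s_0 = z0 and s_{k+1} = alpha*s_k + beta*u_k
    s, u = unpack(x)
    return np.concatenate(([s[0] - z0],
                           [s[k + 1] - (alpha*s[k] + beta*u[k]) for k in range(N)]))

res = minimize(objective, np.zeros(2*N + 1), method='SLSQP',
               constraints={'type': 'eq', 'fun': continuity})
s_opt, _ = unpack(res.x)
print(res.success, round(s_opt[N], 2))   # expect True and ≈ 0.5
```

Because each element is integrated from its own node state, an unstable model cannot blow up over the whole horizon, which is the motivation cited above.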
Direct transcription approaches deal with full discretization of the state and control profiles and the state equations. Typically the discretization is performed by using collocation on finite elements, an implicit Runge–Kutta method. The resulting set of equations and
constraints leads to a large nonlinear program that is addressed with large-scale NLP solvers.
The approach is fully simultaneous and requires no nested calculations with DAE solvers.
Moreover, both structure and sparsity of the KKT system can be exploited by modern NLP
codes, such as those developed in Chapter 6. On the other hand, adaptive discretization of
the state equations, found in many DAE solvers, is generally absent in direct transcription.
Instead, the number of finite elements is fixed during the solution of the NLP and the mesh is
either fixed or varies slightly during this solution. Nevertheless, an outer loop may be added
to this approach to improve accuracy of the solution profiles, by refining the finite element
mesh [40]. Finally, in some cases direct links can be made between the KKT conditions of
the NLP and the discretized Euler–Lagrange equations. This offers the exciting possibility
of relating D-O with O-D strategies. The direct transcription approach will be developed
further in Chapter 10.
On the other hand, larger problems require numerical solution strategies. As described in the
previous section, efficient optimization algorithms are not straightforward to develop from
the Euler–Lagrange and the state equations. Early algorithms based on this “indirect method”
are neither fast nor reliable. On the other hand, more recent methods based on modern BVP
solvers have led to promising performance results, even on large-scale problems [235].
Nevertheless, difficulties with indirect methods have motivated the development of direct
NLP methods. While direct methods represent a departure from the optimality conditions
in this chapter, the Euler–Lagrange equations still provide the concepts and insights for the
successful formulation and solution of NLP strategies, and the interpretation of accurate
solution profiles. These NLP strategies will be treated in the next two chapters.
8.9 Exercises
1. Consider the dynamic flash systems in Example 8.7.
2. Show that the solution profiles for Example 8.3 satisfy the optimality conditions
(8.16).
3. Show that the solution profiles for Example 8.4 satisfy the optimality conditions
(8.23).
4. Consider the two tanks in Figure 8.11 with levels h1 (t) and h2 (t) and cross-sectional
areas A1 and A2, respectively. The inlet flow rate to the first tank is F0 = g(t), and F2(t) is given by Cv h2^(1/2), where Cv is a valve constant.
(a) Write the DAE model so that F1 (t) is adjusted to keep h1 = 2h2 over time. What
is the index of the DAE system?
(b) Reformulate the system in part (a) so that consistent initial conditions can be
specified directly.
5. Verify that the solution profiles in Example 8.8 satisfy the Euler–Lagrange equations
with state path inequalities.
6. For the optimal control problem
   min ϕ(z(tf))  s.t.  dz(t)/dt = f1(z) + f2(z)u(t),  z(0) = z0,  u(t) ∈ [uL, uU],
   show that q ≥ 2 in (8.52).
7. Consider the mechanical system with time-varying forces, F and T . The system can
be written as
m d²x/dt² = −T(t)x(t),
m d²y/dt² = F(t) − T(t)y(t),
x(t) = f(t),  y(t) = g(t),
where m is the system mass, and x and y are the displacements in the horizontal and
vertical directions.
(a) Formulate this problem as a semiexplicit first order DAE system. What is the
index of this system?
(b) Reformulate this problem as an index-1 system using the index reduction ap-
proach presented in Section 8.4.1.
8. For the problem below, derive the two-point BVP and show the relationship of u(t)
to the state and adjoint variables:
9. Solve Example 8.9 for the case where the first reaction is irreversible (k2 = 0). Show
that this problem has a bang-bang solution.
Chapter 9
Dynamic Optimization Methods with Embedded DAE Solvers
This chapter develops dynamic optimization strategies that couple solvers for differential
algebraic equations (DAEs) with NLP algorithms developed in the previous chapters. First,
to provide a better understanding of DAE solvers, widely used Runge–Kutta and linear
multistep methods are described, and a summary of their properties is given. A framework
is then described for sequential dynamic optimization approaches that rely on DAE solvers
to provide function information to the NLP solver. In addition, gradient information is
provided by direct and adjoint sensitivity methods. Both methods are derived and described
along with examples. Unfortunately, the sequential approach can fail on unstable and ill-
conditioned systems. To deal with this case, we extend the sequential approach through the
use of multiple shooting and concepts of boundary value solvers. Finally, a process case
study is presented that illustrates both sequential and multiple shooting approaches.
9.1 Introduction
Chapter 8 provided an overview of the features of dynamic optimization problems and their
optimality conditions. From that chapter, it is clear that the extension to larger dynamic
optimization problems requires efficient and reliable numerical algorithms. This chapter
addresses this question by developing a popular approach for the optimization of dynamic
process systems. Powerful DAE solvers for large-scale initial value problems have led to
widely used simulation environments for dynamic nonlinear processes. The ability to de-
velop dynamic process simulation models naturally leads to their extension for optimization
studies. Moreover, with the development of reliable and efficient NLP solvers, discussed
in Chapter 6, an optimization capability can be implemented for dynamic systems along
the lines of modular optimization modes discussed in Chapter 7. With robust simulation
models, the implementation of NLP codes can be done in a reasonably straightforward way.
To develop sequential strategies that follow from the integration of DAE solvers and
NLP codes, we consider the following DAE optimization problem:
min ϕ(p) = Σ_{l=1}^{NT} Φ^l(z^l(tl), y^l(tl), p^l)                    (9.1a)
s.t.  dz^l/dt = f^l(z^l(t), y^l(t), p^l),   z^l(tl−1) = z0^l,         (9.1b)
      g^l(z^l(t), y^l(t), p^l) = 0,   t ∈ (tl−1, tl],  l = 1, . . . , NT,   (9.1c)
      pL^l ≤ p^l ≤ pU^l,                                              (9.1d)
      yL^l ≤ y^l(tl) ≤ yU^l,                                          (9.1e)
      zL^l ≤ z^l(tl) ≤ zU^l,                                          (9.1f)
      h(p^1, . . . , p^NT, z0^1, z^1(t1), z0^2, z^2(t2), . . . , z0^NT, z^NT(tNT)) = 0.   (9.1g)
This problem is related to problem (8.5) in Chapter 8. Again, we have differential vari-
ables z(t) and algebraic variables y(t) that appear in the DAE system (9.1b)–(9.1c) in
semiexplicit form, and we assume that the invertibility of g(·, y(t), ·) permits an implicit
elimination of the algebraic variables y(t) = y[z(t), p]. This allows us to consider the DAEs
as equivalent ODEs. Moreover, while (9.1) still has NT time periods, we no longer consider
time-dependent bounds, or other path constraints on the state variables. Also, control pro-
files u(t) are now represented as parameterized functions with coefficients that determine
the optimal profile. Consequently, the decisions in (9.1) appear only in the time-independent
vector pl . Finally, algebraic constraints and terms in the objective function are applied only
at the end of each period, tl .
Problem (9.1) can be represented as the following nonlinear program:
min ϕ(p) (9.2a)
s.t. cE (p) = 0, (9.2b)
cI (p) ≤ 0, (9.2c)
where p = [(p^1)^T, (p^2)^T, . . . , (p^NT)^T]^T ∈ R^np. The DAE system in (9.1b)–(9.1c) is
solved in an inner loop, and all the constraint and objective functions are now implicit
functions of p.
Figure 9.1 provides a sketch of the sequential dynamic optimization strategy for prob-
lem (9.1). At a given iteration of the optimization cycle, decision variables pl are specified
by the NLP solver. With these values of pl we treat the DAE system as an initial value
problem and integrate (9.1b)–(9.1c) forward in time for periods l = 1, . . . , NT . This inte-
gration provides the state profiles that determine the objective and constraint functions.
The next component evaluates the gradients of the objective and constraint functions with
respect to pl . These are usually provided through the solution of DAE sensitivity equations.
The function and gradient information is then passed to the NLP solver so that the deci-
sion variables can be updated. The cycle then continues with the NLP solver driving the
convergence of problem (9.2).
While any of the NLP solvers from Chapter 6 can be applied to this strategy, it is
important to note that problem (9.2) is relatively small and the constraint gradients are
dense. For this approach, there are few opportunities for the NLP solver to exploit sparsity
or structure of the dynamic model or the KKT system. Instead, the dominant calculation
cost lies in the solution of the DAE system and the sensitivity equations. Consequently, SQP
methods, with codes such as NPSOL, NLPQL, or fmincon, are generally well suited for this
task, as they require relatively few iterations to converge.
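The loop of Figure 9.1 can be sketched in a few lines. The following is a minimal single-shooting sketch under our own assumptions: a toy scalar ODE dz/dt = −p z stands in for the DAE model (9.1b)–(9.1c), SciPy's solve_ivp plays the role of the inner DAE solver, and BFGS with finite-difference gradients stands in for an SQP code such as NPSOL.

```python
# Minimal single-shooting sketch of Figure 9.1 (our own toy setup): the DAE
# model (9.1b)-(9.1c) is replaced by the scalar ODE dz/dt = -p*z, z(0) = 1,
# solve_ivp plays the role of the inner DAE solver, and BFGS with
# finite-difference gradients stands in for an SQP code such as NPSOL.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def simulate(p):
    # Inner loop: integrate the model forward in time for the given p.
    sol = solve_ivp(lambda t, z: [-p[0] * z[0]], (0.0, 1.0), [1.0],
                    rtol=1e-10, atol=1e-12)
    return sol.y[0, -1]

def phi(p):
    # Objective phi(p): an implicit function of p through the state profile.
    return (simulate(p) - 0.5) ** 2

# Outer loop: the NLP solver drives convergence of problem (9.2).
res = minimize(phi, x0=[1.0], method="BFGS")
p_star = res.x[0]        # analytic optimum: exp(-p) = 0.5, i.e. p = ln 2
```

Every objective evaluation triggers a full forward integration, which is why the dominant cost of the sequential approach lies in the DAE (and sensitivity) solves rather than in the NLP algebra.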
This chapter provides the necessary background concepts that describe sequential
optimization strategies. As discussed in Chapter 7, accurate function and gradient values
book_tem, 2010/7/27, page 253
are required in order for the NLP code to perform reliably. This is enabled by state-of-the-
art DAE solvers as well as the formulation and efficient solution of sensitivity equations.
A successful sequential algorithm requires careful attention to the problem formulation, as
failure of the DAE solver or the sensitivity component is fatal for the optimization loop. For
instance, the sequential approach may fail for unstable systems, as the forward integration
of the DAE may lead to unbounded state profiles. For these systems, failure can be avoided
through the use of boundary value methods.
The next section provides a brief survey of DAE solvers and shows how they have
been developed to solve challenging, large-scale (index-1) DAE systems. Section 9.3 de-
scribes the calculation of sensitivity information from the DAE system. Both direct and
adjoint sensitivity equations are derived and relative advantages of both are discussed. The
application of sensitivity calculations is also demonstrated on formulations of (9.1) that
approximate the multiperiod optimal control problem (8.5) from the previous chapter. Next,
Section 9.4 extends the sequential approach to multiple shooting formulations which deal
with unstable dynamic process systems. In this section we outline methods for BVPs that al-
low the solution of dynamic systems that cause initial value DAE solvers to fail. Section 9.5
then presents a detailed case study that illustrates both sequential and multiple shooting
optimization strategies.
functions of differential variables. We also note that semiexplicit high-index DAEs can be
reformulated to index 1 as demonstrated in Chapter 8.
ODE solvers for initial value problems proceed in increments of time ti+1 = ti + hi ,
hi > 0, i = 0, . . . , N , by approximating the ODE solution, z(ti ), by zi at each time point.
Discretization of the ODE model leads to difference equations of the form
with the approximated state zi+1 based on previous values, zi−j and fi−j = f (zi−j , p).
With j0 = 1, we have an explicit solution strategy, and zi+1 can be determined directly.
Setting j0 = 0 leads to an implicit strategy which requires an iterative solution for zi+1 .
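The two cases can be illustrated with the simplest members of each family, forward Euler (explicit) and backward Euler (implicit); the test problem dz/dt = −2z, z(0) = 1 and the fixed-point solve for the implicit step are our own illustrative choices.

```python
# Forward vs. backward Euler as the simplest explicit (j0 = 1) and implicit
# (j0 = 0) difference equations; the test problem dz/dt = -2z, z(0) = 1 and
# the fixed-point solve for the implicit step are illustrative choices.
import math

def forward_euler(f, z0, h, n):
    z = z0
    for _ in range(n):
        z = z + h * f(z)              # z_{i+1} determined directly
    return z

def backward_euler(f, z0, h, n, iters=50):
    z = z0
    for _ in range(n):
        znew = z                      # iterate z_{i+1} = z_i + h f(z_{i+1})
        for _ in range(iters):
            znew = z + h * f(znew)
        z = znew
    return z

f = lambda z: -2.0 * z
exact = math.exp(-2.0)                # z(1) for this problem
ze = forward_euler(f, 1.0, 0.001, 1000)
zi = backward_euler(f, 1.0, 0.001, 1000)
```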
Initial value solvers can be classified into two types. Single-step methods take the
form (9.4) with nq = 1; these are typically Runge–Kutta methods with a (possibly variable)
time step hi . Multistep methods require nq > 1 in (9.4) and therefore contain some history
of the state profile. Here, the difference equations are derived so that all of the time steps
hi in (9.4) are the same size for i = 1, . . . , nq . Both types of methods need to satisfy the
following properties:
• The difference equation must be consistent. Rewriting (9.4) as
(z_{i+1} − z_i)/h_i − φ̄(z_{i−j+1}, f_{i−j+1}, h_i) = 0   (9.5)
and substituting the true solution z(t) into (9.5) at the time points t_i leads to the local
truncation error
d_i(z) = (z(t_{i+1}) − z(t_i))/h_i − φ̄(z(t_{i−j+1}), f(z(t_{i−j+1}), p), h_i) = O(h_i^q).   (9.6)
• The difference equation must be zero-stable, i.e., ‖z_i − z̄_i‖ ≤ K‖z_0 − z̄_0‖
for some constant K > 0, for all h_i ∈ [0, h_0] and for two neighboring solutions z_i
and z̄_i. This property implies bounded invertibility of the difference formula (9.4) at
all t_i and time steps 0 ≤ h_i ≤ h_0.
• Consistency and zero-stability imply a convergent method of order q.
These conditions govern the choice of time steps hi that are needed to obtain accurate
and stable solutions to (9.3). Certainly hi must be chosen so that the local error is small
relative to the solution profile. In addition, the stability condition can also limit the choice
of time step, and the stability condition can be violated even when the local truncation error
is small.
If a small time step is dictated by the stability condition and not the accuracy criterion,
then we characterize the DAEs as a stiff system. For example, the scalar ODE
dz/dt = −10^6 (z − cos(t))
has a very fast transient, after which the solution essentially becomes z(t) = cos(t). However,
while a reasonably large time step ensures an accurate approximation to cos(t), all explicit
methods (and many implicit ones) are stable only if h_i remains small, i.e., h_i = O(10^{−6}).
Stiff systems usually arise in dynamic processes with multiple time scales, where the fastest
component dies out quickly but the step restriction still remains to prevent an unstable
integration.
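This behavior is easy to reproduce numerically. In the sketch below (step sizes are our own illustrative choices), explicit Euler with h·10^6 ≫ 1 diverges violently even though the solution it should track is just cos(t), while backward Euler, which has stiff decay, is accurate with a far larger step.

```python
# Numerical illustration of stiffness for dz/dt = -1e6 (z - cos t): an
# explicit Euler step with h*1e6 >> 1 is violently unstable even though the
# target solution is essentially cos(t), while backward Euler (stiff decay)
# remains accurate with a far larger step.  Step sizes are illustrative.
import math

lam = 1.0e6

def explicit_euler(z, h, n):
    t = 0.0
    for _ in range(n):
        z = z + h * (-lam * (z - math.cos(t)))
        t += h
    return z

def implicit_euler(z, h, n):
    t = 0.0
    for _ in range(n):
        t += h
        # Backward Euler step solved in closed form for this linear model.
        z = (z + h * lam * math.cos(t)) / (1.0 + h * lam)
    return z

z_unstable = explicit_euler(1.0, 1.0e-4, 100)   # h*lam = 100: blows up
z_stable = implicit_euler(1.0, 1.0e-2, 100)     # h*lam = 1e4: tracks cos(t)
```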
Because the zero-stability property does not directly determine appropriate time steps
for hi , more specific stability properties have been established with scalar homogeneous
and nonhomogeneous test ODEs
dζ/dt = λζ,   (9.7a)
dζ/dt = λ(ζ − γ(t)),   (9.7b)
respectively, where λ is a complex constant, and γ (t) is a forcing function. It can be argued
that these test equations provide a reasonable approximation to (9.3) after linearization
about a solution trajectory, and decomposition into orthogonal components. For these test
equations we define the following properties:
• For a specified value of hi = h, the difference formula (9.4) is absolutely stable if for
all i = 1, 2, . . . , it satisfies |ζi+1 | ≤ |ζi | for (9.7a).
• If the difference formula (9.4) is absolutely stable for all Re(hλ) ≤ 0, then it is
A-stable. This is an especially desirable property for stiff systems, as h is no longer
limited by a stability condition.
• The difference formula (9.4) has stiff decay if |ζi − γ (ti )| → 0 as Re(hλ) → −∞ for
the test equation (9.7b).
z_{i+1} = z_i + h_i Σ_{k=1}^{ns} b_k f(t_i + c_k h_i, ẑ_k),   (9.8a)
ẑ_k = z_i + h_i Σ_{j=1}^{nrk} a_{kj} f(t_i + c_j h_i, ẑ_j),   k = 1, . . . , ns,   (9.8b)
where ns is the number of stages, and intermediate stage variables are given by ẑk for the
profile zi . For the summation in (9.8b), explicit Runge–Kutta methods are characterized by
nrk < k, semi-implicit methods have nrk = k, and fully implicit methods have nrk = ns .
Note that all of the evaluations occur within the time step with ck ∈ [0, 1], and the method
is characterized by the matrix A and vectors b and c, commonly represented by the Butcher
block:

  c | A
    | b^T      (9.9)

The coefficients satisfy the relations Σ_{k=1}^{ns} b_k = 1 and Σ_{j=1}^{ns} a_{kj} = c_k, and they are derived
from Taylor series expansions of (9.8a)–(9.8b). These single-step methods have a number
of interesting features and advantages:
• Because Runge–Kutta methods are single step, adjustment of the time step h_i is easy
to implement and independent of previous profile information.
• The ability to control the time step directly allows accurate location of nonsmooth
events in the state profiles. Because of the single-step nature, accurate solutions require
that the state profile be smooth only within a step, and continuous across steps.
• Explicit methods have bounded, well-defined stability regions that increase in size
with order q. These are determined by applying (9.8) to the test equation (9.7a) and
determining where absolute stability holds for a grid of hλ values in the complex
plane.
• A-stable implicit Runge–Kutta methods can be found for any order q. Implicit
methods with A nonsingular, and bT equal to the last row of A, also satisfy the
stiff decay property. Consequently, these implicit methods are well suited for stiff
problems.
• Calculation of the local truncation error (9.6) is generally not straightforward for
Runge–Kutta methods. The local error is often estimated by embedding a high-order
Runge–Kutta method with a lower order method, using the same c and A values
for both methods, and different b values. This allows two methods to be executed
in parallel, and values of zi+1 to be compared, without additional calculation. The
popular RKF45 formulae (see [16]) are based on such an embedded method.
• For an explicit Runge–Kutta method, increasing the order q of the approximation
error tends to be expensive, as ns ≥ q.
• Most semi-implicit methods are of order ns + 1, and fully implicit methods have
orders up to 2ns . Of course, implicit methods tend to be more expensive to apply than
explicit ones.
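The embedded-pair idea behind RKF45 can be shown on a much simpler example: the Heun–Euler order 2(1) pair shares the stages (same c and A) and differs only in b, so the difference of the two answers estimates the local error at no extra cost. The step-size controller below is a crude illustration of our own, not the RKF45 scheme itself.

```python
# Embedded Runge-Kutta pair in the spirit of RKF45, using the simple
# Heun-Euler 2(1) pair: both methods share c and A (hence the stage
# evaluations) and differ only in b, so comparing them gives a free local
# error estimate.  The step-size controller is a crude illustration.
import math

def heun_euler_step(f, t, z, h):
    k1 = f(t, z)                              # shared stage evaluations
    k2 = f(t + h, z + h * k1)
    z_high = z + h * (0.5 * k1 + 0.5 * k2)    # order 2, b = [1/2, 1/2]
    z_low = z + h * k1                        # order 1, b = [1, 0]
    return z_high, abs(z_high - z_low)        # advance + error estimate

f = lambda t, z: -z                           # test problem dz/dt = -z
t, z, h, tol = 0.0, 1.0, 0.1, 1e-5
while t < 1.0 - 1e-12:
    h = min(h, 1.0 - t)
    z_new, err = heun_euler_step(f, t, z, h)
    if err <= tol:
        t, z = t + h, z_new                   # accept the step
    h = 0.9 * h * (tol / max(err, 1e-16)) ** 0.5   # adjust the step size
z_exact = math.exp(-1.0)
```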
values of βj . On the other hand, the Gear or backward difference formula (BDF) is implicit,
and has α0 = 0, βj = 0, j = 1, . . . , ns , with nonzero values for β0 and αj , j = 1, . . . , ns .
Derivation of these methods is based on approximating the profile history for both zi and
f (zi , p) with an interpolating polynomial, which is then used to carry the integration forward.
In particular, with zi−j and fi−j calculated from ns previous steps, the coefficients αj and
βj are derived by integrating, from ti to ti+1 , an interpolating polynomial for f (z(t), p) that
is represented by the Newton divided difference formula [16]:
f(z(t), p) − f_{ns}(t) = f[t_i, . . . , t_{i−ns}, t] Π_{j=0}^{ns} (t − t_{i−j}) = O(h^{ns+1}),   (9.13)
where the divided difference satisfies (ns + 1)! f[t_i, . . . , t_{i−ns}, t] ≈ d^{ns+1}f/dt^{ns+1}.
A key advantage of linear multistep methods is that higher order methods can be
applied simply by considering a longer profile history, and this requires no additional cal-
culation. On the other hand, multistep methods are not self-starting, and single-step meth-
ods must be used to build up the ns -step history for the polynomial representation (9.10).
Moreover, nonsmooth events due to discontinuous control profiles, or a change in period,
also require the profile history to be restarted. If these events occur frequently, repeated
construction of profile histories may lead to an inefficient low-order method. Finally, the
derivation is based on an interpolating polynomial with a constant time step h. Consequently,
adjusting h to satisfy an error tolerance affects the sequence of ns steps and requires a new
polynomial interpolation and reconstruction of the profile history.
The explicit and implicit Adams methods can be characterized by the following
properties:
• Derived from interpolation with the Newton divided difference formula, the explicit
Adams–Bashforth method
z_{i+1} = z_i + h Σ_{j=1}^{ns} β_j f_{i−j+1}   (9.14)
has order q = ns + 1. Moreover, the local truncation term can be estimated with the
aid of the residual of the polynomial interpolant (9.13).
• Combining both implicit and explicit formulae leads to a predictor-corrector method,
where the explicit formula provides a starting guess for zi+1 used in the implicit
formula. For h suitably small, successive substitution of zi+1 into f (z, p) and updating
zi+1 with (9.15) leads to a cheap fixed-point convergence strategy.
• Regions of absolute stability tend to be smaller than for single-step methods. More-
over, these regions shrink in size with increasing order. Hence, smaller steps are
required to satisfy stability requirements. This arises because the associated differ-
ence equation derived from (9.10) has multiple roots; all of these roots must satisfy
the absolute stability criterion.
• There are no explicit linear multistep methods that are A-stable. There are no implicit
linear multistep methods above order 2 that are A-stable.
• Because of the stability limitations on h (and also because predictor-corrector methods
converge only for small h), Adams methods should not be used for stiff problems.
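The predictor-corrector pattern can be sketched with the two-step Adams–Bashforth predictor and the trapezoidal Adams–Moulton corrector; the Euler starting step and the fixed number of corrector passes are our own simplifications.

```python
# Adams predictor-corrector sketch: the explicit two-step Adams-Bashforth
# formula predicts z_{i+1}, then successive substitution into the implicit
# trapezoidal (Adams-Moulton) formula corrects it.  The Euler starting step
# and the fixed corrector count are simplifications for illustration.
import math

def ab2_am2(f, z0, h, n):
    # Multistep methods are not self-starting: build the first step by Euler.
    z_prev, z = z0, z0 + h * f(z0)
    for _ in range(n - 1):
        f_prev, f_cur = f(z_prev), f(z)
        z_next = z + h * (1.5 * f_cur - 0.5 * f_prev)    # AB2 predictor
        for _ in range(3):                               # corrector passes
            z_next = z + 0.5 * h * (f_cur + f(z_next))   # AM (trapezoid)
        z_prev, z = z, z_next
    return z

f = lambda z: -z
z_num = ab2_am2(f, 1.0, 0.01, 100)     # integrate dz/dt = -z to t = 1
z_exact = math.exp(-1.0)
```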
Stiff problems are handled by BDF methods. While these methods are not A-stable
above order 2, they do have stiff decay up to order 6. Moreover, even for order 6, the
regions of absolute stability cover most of the left half plane in the phase space of hλ.
The only excluded regions are near the imaginary axis with |Im(hλ)| > 1 for Re(hλ) > −5.
These regions correspond to oscillatory solutions that normally require small steps to satisfy
accuracy requirements anyway. Consequently, large time steps h can then be selected once
fast transients have died out. The BDF method has the following features:
• The BDF is given by
φ(z_{i+1}) = z_{i+1} − Σ_{j=1}^{ns} α_j z_{i−j+1} − β_0 h f_{i+1} = 0   (9.16)
• Because large steps are taken, z_{i+1} is determined from (9.16) using the Newton
iteration
z_{i+1}^{m+1} = z_{i+1}^m − J(z_{i+1}^m)^{−1} φ(z_{i+1}^m),
where m is the iteration counter. For large systems, calculation of the Newton step
can be the most time-consuming part of the integration.
• The Jacobian is given by J(z_{i+1}) = I − hβ_0 ∂f/∂z, and ∂f/∂z is generally required from the
dynamic model. Note that J (z) is nonsingular for h sufficiently small. To save com-
putational expense, a factorized form of J (z) is used over multiple Newton iterations
and time steps, ti+j . It is refactorized only if the Newton iteration fails to converge
quickly.
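The frozen-Jacobian strategy can be sketched as follows; the refactorize-on-failure policy shown here and the scalar backward-Euler residual φ(z) = z − z_i + h z² (for dz/dt = −z²) are our own illustrative choices.

```python
# Frozen-Jacobian Newton iteration sketch: the factorization of J is computed
# once and reused across iterations, with a refactorization only if the
# frozen iteration fails to converge.  The scalar backward-Euler residual
# phi(z) = z - z_i + h*z**2 (for dz/dt = -z**2) is an illustrative test case.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def newton_frozen(phi, jac, z, tol=1e-10, max_iter=50):
    for attempt in range(2):                 # at most one refactorization
        lu_piv = lu_factor(jac(z))           # factorize at the current z
        for _ in range(max_iter):
            r = phi(z)
            if np.linalg.norm(r) < tol:
                return z
            z = z - lu_solve(lu_piv, r)      # reuse the old factorization
    raise RuntimeError("Newton iteration failed to converge")

h, zi = 0.1, 1.0
phi = lambda z: z - zi + h * z ** 2
jac = lambda z: np.array([[1.0 + 2.0 * h * z[0]]])
z_next = newton_frozen(phi, jac, np.array([1.0]))
```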
where ẑ_k and ŷ_k are intermediate stage variables and nrk ≥ k is defined from (9.8b). If
the DAE is index 1, the algebraic equations can be eliminated with the algebraic variables
determined implicitly. Consequently, the single-step formula generally has the same order
and stability properties as with ODEs. This is especially true for methods with stiff decay.
A more general discussion of order reduction and stability limitations of (9.17) for Runge–
Kutta methods and high-index DAEs can be found in [70].
Since predictor-corrector Adams methods require explicit formulae, we do not con-
sider them for DAEs and focus instead on the BDF method, which is the most popular
method for DAE systems. The extension of the BDF method to semiexplicit index-1 DAEs
is given by
ns
zi+1 = αj zi−j +1 + β0 hf (zi+1 , yi+1 ), (9.18a)
j =1
g(zi+1 , yi+1 ) = 0. (9.18b)
5 There seem to be no explicit DAE solvers, although half-explicit Runge–Kutta methods have been
proposed [16].
Again, as the last equation leads to implicit solution of the algebraic equations, the extended
BDF formula has the same order and stability properties as with ODEs. More discussion
of order reduction and stability limitations of (9.17) for linear multistep methods and high-
index DAEs can be found in [70].
The BDF equations (9.18) are also solved with Newton's method, using the Jacobian

J(z_{i+1}, y_{i+1}) = [ I − hβ_0 ∂f/∂z   −hβ_0 ∂f/∂y ]
                      [      ∂g/∂z            ∂g/∂y  ]   (9.19)
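A minimal sketch of one step of (9.18) with ns = 1 (implicit Euler), applying Newton's method with the block Jacobian structure of (9.19); the contrived index-1 DAE dz/dt = −z + y, 0 = y − 0.5 z² is our own test case, not from the text.

```python
# One BDF step (9.18) for a semiexplicit index-1 DAE with ns = 1 (implicit
# Euler), using Newton's method with the block Jacobian structure of (9.19).
# The contrived DAE dz/dt = -z + y, 0 = y - 0.5*z**2 is an assumed test case.
import numpy as np

def bdf1_dae_step(z_i, y_i, h, tol=1e-12):
    z, y = z_i, y_i                       # initial guess: previous values
    for _ in range(50):
        r1 = z - z_i - h * (-z + y)       # differential residual (9.18a)
        r2 = y - 0.5 * z ** 2             # algebraic residual (9.18b)
        if max(abs(r1), abs(r2)) < tol:
            break
        # Block Jacobian in the shape of (9.19):
        # [[1 - h*df/dz, -h*df/dy], [dg/dz, dg/dy]]
        J = np.array([[1.0 + h, -h],
                      [-z, 1.0]])
        dz, dy = np.linalg.solve(J, [r1, r2])
        z, y = z - dz, y - dy
    return z, y

# One step of size h = 0.01 from the consistent point z = 1, y = 0.5*z**2.
z1, y1 = bdf1_dae_step(1.0, 0.5, 0.01)
```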
Since z(t) is sufficiently smooth in t and p, we can exchange the order of differentiation,
define the sensitivity matrices S(t) = dz/dp and R(t) = dy/dp, and obtain the following
sensitivity equations:
dS/dt = ∂f/∂z S(t) + ∂f/∂y R(t) + ∂f/∂p,   S(0) = ∂z_0/∂p,   (9.22a)
0 = ∂g/∂z S(t) + ∂g/∂y R(t) + ∂g/∂p.   (9.22b)
Note that for nz + ny states and np decision variables, we have (nz + ny) × np sensitivity
equations. These equations are linear time-variant (LTV), index-1 DAEs, and they require the
state profile solution in their right-hand sides. Decision variables that appear in the
initial conditions also influence the initial conditions of the sensitivity equations. Once
the sensitivities are determined, we calculate the gradients for the objective and constraint
functions in (9.2) by using (9.20) and the chain rule to yield
∇_p ψ = S(t_f)^T ∂ψ/∂z + ∂ψ/∂p.
The formulation of the sensitivity equations is illustrated with the following example.
With nz = 2, ny = 1, and np = 3, the original DAE system has two differential equations
and one algebraic equation. The np (nz + ny ) sensitivity equations consist of six differential
equations, with initial conditions, and three algebraic equations.
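The direct sensitivity calculation can be sketched on a scalar example of our own: for dz/dt = −p z, z(0) = 1, the sensitivity S = dz/dp obeys dS/dt = −p S − z with S(0) = 0, and it is integrated together with the state; a finite-difference perturbation provides an independent check.

```python
# Direct (forward) sensitivity sketch for the assumed toy model
# dz/dt = -p*z, z(0) = 1: the sensitivity S = dz/dp obeys dS/dt = -p*S - z,
# S(0) = 0, and is integrated simultaneously with the state; a
# finite-difference perturbation gives an independent check.
import numpy as np
from scipy.integrate import solve_ivp

def state_and_sensitivity(p, tf=1.0):
    def rhs(t, x):
        z, S = x
        return [-p * z, -p * S - z]          # state and sensitivity equations
    sol = solve_ivp(rhs, (0.0, tf), [1.0, 0.0], rtol=1e-10, atol=1e-12)
    return sol.y[0, -1], sol.y[1, -1]        # z(tf) and S(tf) = dz(tf)/dp

p, eps = 0.7, 1e-6
z_tf, S_tf = state_and_sensitivity(p)
z_pert, _ = state_and_sensitivity(p + eps)
S_fd = (z_pert - z_tf) / eps                 # finite-difference comparison
```

For this model z(1) = exp(−p), so S(1) = −exp(−p), which the integrated sensitivity reproduces far more accurately than the perturbation quotient.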
To avoid the need to store the state profiles, sensitivity equations are usually solved
simultaneously with the state equations. Because of their desirable stability properties, BDF
methods are often adapted to the solution of the combined state-sensitivity system. This
implicit method applies Newton’s method to determine the states at the next time step. With
the addition of the sensitivity equations, the same Jacobian

J(z, y) = [ I − hβ_0 ∂f/∂z   −hβ_0 ∂f/∂y ]
          [      ∂g/∂z            ∂g/∂y  ]   (9.23)
can be used to solve them as well, due to the structure of (9.22). This leads to a number
of ways to automate the sensitivity calculation and make it more efficient. Moreover, a
number of large-scale DAE solvers take advantage of sparsity and special structures (e.g.,
bandedness) of the Jacobian.
Sensitivity methods can be classified into the staggered direct [85], simultaneous
corrector [272], and staggered corrector options [128]. At each time step, the simultane-
ous corrector method solves the entire dynamic system of state (9.1b)–(9.1c) and sensi-
tivity equations (9.22). Instead, the staggered methods first solve the DAEs for the state
variables at each time step. Here the accuracy of the state and the sensitivity equations
must be checked separately, and a shorter step may be required if the sensitivity error is
not acceptable. After the Newton method has converged for the state variables, sensitiv-
ity equations are solved at the current step. In solving the sensitivity equations, the stag-
gered direct method updates the factorization of the Jacobian (9.23) at every step, while
the staggered corrector refactorizes this Jacobian matrix only when necessary. Since this
is often the most expensive step in the sensitivity algorithm, considerable savings can be
made. More details on these three options can be found in [258]. The partial derivatives
in the sensitivity equations can be determined either by a finite difference approxima-
tion or through automatic differentiation. Finally, several automatic differentiation tools
[58, 306, 173] are available to provide exact partial derivatives. When implemented with
BDF solvers, they lead to faster and more accurate sensitivities and fewer Newton solver
failures.
Note that the differential and algebraic adjoints, λ(t) and ν(t), respectively, serve as multipliers
to influence ψ(z(t_f), p). Applying integration by parts to ∫ λ(t)^T (dz/dt) dt and substituting
into (9.24) leads to the equivalent expression
As in Chapter 8, we apply perturbations δz(t), δy(t), dp about the current point. Applying
these perturbations to (9.25) leads to
dψ = (∂ψ/∂z − λ(t_f))^T δz(t_f) + λ(0)^T δz(0) + (∂ψ/∂p)^T dp
   + ∫_0^{tf} [ ((∂f/∂z)^T λ + dλ/dt + (∂g/∂z)^T ν)^T δz(t) + ((∂f/∂y)^T λ + (∂g/∂y)^T ν)^T δy(t)
   + ((∂f/∂p)^T λ + (∂g/∂p)^T ν)^T dp ] dt.   (9.26)
We now define the adjoint variables so that only the perturbations dp have a direct influence
on dψ:
1. From perturbation of the final state, δz(t_f), we have
λ(t_f) = ∂ψ/∂z.   (9.27)
This leads to a boundary or transversality condition for λ(t).
2. From perturbation of the differential state, δz(t), we have
dλ/dt = −(∂f/∂z)^T λ − (∂g/∂z)^T ν,   (9.28)
which yields a differential equation for λ(t).
3. From perturbation of the algebraic state, δy(t), we have
(∂f/∂y)^T λ + (∂g/∂y)^T ν = 0,   (9.29)
which leads to an algebraic equation with the algebraic variable ν(t).
4. We assume that the perturbation of the initial state, δz(0), is governed by the variables
p in the optimization problem, z(0) = z_0(p), and we define
λ(0)^T δz(0) = λ(0)^T (∂z_0/∂p) dp
and group this term with dp.
By eliminating the state perturbation terms and suitably defining the adjoint variables
above, (9.26) is now simplified as follows:
dψ = [ ∂ψ/∂p + (∂z_0/∂p)^T λ(0) + ∫_0^{tf} ((∂f/∂p)^T λ + (∂g/∂p)^T ν) dt ]^T dp.   (9.30)
Calculation of adjoint sensitivities requires the solution of a DAE system with final
values specified for the differential variables. Note that if the state equations are semiexplicit,
index 1, then the adjoint system is also semiexplicit and index 1. Once the state and adjoint
variables are obtained, the integrals allow the direct calculation of the gradients ∇p .
We now revisit Example 9.1 and contrast the adjoint approach with direct sensitivity.
Example 9.2 For the DAEs
dz_1/dt = (z_1)^2 + (z_2)^2 − 3y,   z_1(0) = 5,
dz_2/dt = z_1 z_2 + z_1 (y + p_2),   z_2(0) = p_1,
0 = z_1 y + p_3 z_2,
we write
λ(t)^T f(z, y, p) + ν(t)^T g(z, y, p) = λ_1 ((z_1)^2 + (z_2)^2 − 3y(t))
   + λ_2 (z_1 z_2 + z_1 (y(t) + p_2)) + ν (z_1 y(t) + p_3 z_2).
Using the above adjoint equations, we have from (9.28) and (9.27),
dλ_1/dt = −(2 z_1 λ_1 + (z_2 + y + p_2) λ_2 + y ν),   λ_1(t_f) = ∂ψ(z(t_f))/∂z_1,   (9.31)
dλ_2/dt = −(2 z_2 λ_1 + z_1 λ_2 + p_3 ν),   λ_2(t_f) = ∂ψ(z(t_f))/∂z_2.   (9.32)
From (9.29) we have the algebraic equation
0 = −3 λ_1 + z_1 λ_2 + z_1 ν.   (9.33)
Solving this system of DAEs leads to the adjoint profiles λ(t) and ν(t). The gradients are
then obtained from (9.30) as follows:
∇_{p1} ψ = ∂ψ/∂p_1 + λ_2(0),
∇_{p2} ψ = ∂ψ/∂p_2 + ∫_0^{tf} z_1 λ_2 dt,
∇_{p3} ψ = ∂ψ/∂p_3 + ∫_0^{tf} z_2 ν dt.
The original DAE system has two differential equations and one algebraic equation with
nz = 2, ny = 1, and np = 3. In addition, we have n objective and constraint functions at
final time. Therefore, the adjoint sensitivities require the solution of n(nz + ny) DAEs and
n·np integrals (which are generally less expensive to solve). From this example, we see
that if np > n, then the adjoint approach is more efficient than solving the direct sensitivity
equations.
On the other hand, the adjoint sensitivity approach is more difficult to implement than
direct sensitivity; hence, there are fewer adjoint sensitivity codes [259, 197] than direct
sensitivity codes and the adjoint approach is less widely applied, except for systems with
many decision variables. Solution of the adjoint equations requires storage of state pro-
files, and retrieval is needed for the backward integration. To avoid the storage requirement,
especially for large systems, an adjoint approach is usually implemented with a checkpoint-
ing scheme. Here the state variables are stored at only a few checkpoints in time. Starting
from the checkpoint closest to tf , the state is reconstructed by integrating forward from
this checkpoint, while the adjoint variable is calculated by integrating backward up to this
checkpoint. Once the adjoint is calculated at this point, we back up to an earlier checkpoint
and the state and adjoint calculation repeats until t = 0. The checkpointing scheme offers a
trade-off between repeated adjoint calculation and state variable storage. Moreover, strate-
gies have been developed for the optimal distribution of checkpoints [172] that lead to very
efficient adjoint sensitivity strategies.
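The checkpointing pattern can be sketched with a discrete forward-Euler model and its discrete adjoint; the model dz/dt = −z, the objective ψ = z(tf), and the use of four checkpoints over 100 steps are all illustrative assumptions, not the book's setup.

```python
# Checkpointing sketch: store the state only at a few checkpoints during the
# forward sweep; then, working backward from tf, recompute the state on each
# segment and carry the discrete adjoint through it.  Forward Euler on
# dz/dt = -z with psi = z(tf) and four checkpoints are assumed choices.
N, h, stride = 100, 0.01, 25
f = lambda z: -z
dfdz = lambda z: -1.0          # df/dz; would depend on z for a nonlinear f

# Forward sweep: store the state only at the checkpoints.
checkpoints, z = {}, 1.0
for i in range(N):
    if i % stride == 0:
        checkpoints[i] = z
    z = z + h * f(z)           # z_{i+1} = z_i + h f(z_i)

# Backward sweep: start from lam_N = dpsi/dz_N = 1 and work back one
# checkpoint segment at a time, recomputing the states of each segment.
lam = 1.0
for start in sorted(checkpoints, reverse=True):
    seg = [checkpoints[start]]
    for i in range(start, min(start + stride, N) - 1):
        seg.append(seg[-1] + h * f(seg[-1]))
    # Discrete adjoint of the Euler step: lam_i = (1 + h*df/dz(z_i)) lam_{i+1}.
    for z_i in reversed(seg):
        lam = lam * (1.0 + h * dfdz(z_i))
dpsi_dz0 = lam                 # gradient of z(tf) with respect to z(0)
```

Only the checkpointed states are ever stored; the trade-off is the repeated forward recomputation of each segment, exactly as described above.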
the solution, and can be neglected. This simplification leads to the Hessian approximation
d^2 ϕ(p)/dp^2 ≈ Σ_{l=1}^{ND} (dx(t_l)/dp)^T W (dx(t_l)/dp) = Σ_{l=1}^{ND} [S(t_l)^T  R(t_l)^T] W [S(t_l); R(t_l)],   (9.35)
where S(t_l) = dz(t_l)/dp and R(t_l) = dy(t_l)/dp.
With the assumption of small residuals, this simplification leads to a quadratically con-
vergent method while calculating only first derivatives. Should this assumption be violated,
then the Gauss–Newton method has only a linear rate of convergence.
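A Gauss–Newton iteration with sensitivity-based Jacobians can be sketched as follows; the one-parameter model z(t; p) = exp(−pt) and the noise-free synthetic data (so that the residuals vanish at the solution and convergence is quadratic) are our own assumptions.

```python
# Gauss-Newton sketch for a least-squares objective with the approximate
# Hessian J^T W J built from sensitivities (cf. (9.35), W = I).  The model
# z(t; p) = exp(-p*t) and noise-free data generated at p = 2 are assumed,
# so residuals vanish at the solution and convergence is quadratic.
import numpy as np

t_data = np.array([0.25, 0.5, 0.75, 1.0])
p_true = 2.0
z_data = np.exp(-p_true * t_data)            # synthetic measurements

def residuals_and_jac(p):
    z = np.exp(-p * t_data)
    r = z - z_data                           # residual vector
    J = (-t_data * z)[:, None]               # sensitivity dz/dp as a column
    return r, J

p = 0.5                                      # starting guess
for _ in range(20):
    r, J = residuals_and_jac(p)
    dp = np.linalg.solve(J.T @ J, -J.T @ r)  # Gauss-Newton step
    p = p + dp[0]
```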
To illustrate, we consider the batch reactor data in Figure 9.3 for the first order re-
versible reactions:
A ⇐⇒ B ⇐⇒ C
in liquid phase. The DAE model for the reactor system is given by
dz_A/dt = −p_1 z_A + p_2 z_B,   z_A(0) = 1,   (9.36a)
dz_B/dt = p_1 z_A − (p_2 + p_3) z_B + p_4 y_C,   z_B(0) = 0,   (9.36b)
z_A + z_B + y_C = 1.   (9.36c)
The 12 sensitivity equations (9.22) are written as
dS_{A1}/dt = −p_1 S_{A1} − z_A + p_2 S_{B1},   S_{A1}(0) = 0,   (9.37a)
dS_{B1}/dt = p_1 S_{A1} + z_A − (p_2 + p_3) S_{B1} + p_4 R_{C1},   S_{B1}(0) = 0,   (9.37b)
S_{A1} + S_{B1} + R_{C1} = 0,   (9.37c)
dS_{A2}/dt = −p_1 S_{A2} + p_2 S_{B2} + z_B,   S_{A2}(0) = 0,   (9.37d)
dS_{B2}/dt = p_1 S_{A2} − (p_2 + p_3) S_{B2} − z_B + p_4 R_{C2},   S_{B2}(0) = 0,   (9.37e)
S_{A2} + S_{B2} + R_{C2} = 0,   (9.37f)
dS_{A3}/dt = −p_1 S_{A3} + p_2 S_{B3},   S_{A3}(0) = 0,   (9.37g)
dS_{B3}/dt = p_1 S_{A3} − (p_2 + p_3) S_{B3} − z_B + p_4 R_{C3},   S_{B3}(0) = 0,   (9.37h)
S_{A3} + S_{B3} + R_{C3} = 0,   (9.37i)
dS_{A4}/dt = −p_1 S_{A4} + p_2 S_{B4},   S_{A4}(0) = 0,   (9.37j)
dS_{B4}/dt = p_1 S_{A4} − (p_2 + p_3) S_{B4} + p_4 R_{C4} + y_C,   S_{B4}(0) = 0,   (9.37k)
S_{A4} + S_{B4} + R_{C4} = 0,   (9.37l)
and the weighting matrix in the objective function is set to W = I .
The parameter estimation problem is solved using the scheme in Figure 9.1. A trust
region SQP method is applied, similar to the algorithm described in Section 6.2.3, with the QP
subproblem given by (6.38) and the Hessian given by (9.35). This approach is implemented
in the GREG parameter estimation code [372, 384]. Starting from p 0 = [10, 10, 30, 30]T the
algorithm converges after 8 iterations and 20 DAE evaluations to the optimal parameters
given by
(p∗ )T = [3.997, 1.998, 40.538, 20.264]
with ϕ ∗ = 4.125 × 10−5 . At the solution the reduced Hessian has eigenvalues that range
from 10−2 to 103 ; such ill-conditioning is frequently encountered in parameter estimation
problems.
To extend the sequential approach to optimal control problems, we rely on the
multiperiod problem (9.1) and represent the control profile as piecewise polynomials in
each period. In addition, the length of each period may be variable. The decisions are still
represented by p l and the state variables remain continuous across the periods.
For the gradient calculations, both the direct and adjoint sensitivity equations are
easily modified to reflect parameters that are active only in a given period.
• For the direct sensitivity approach, we define the sensitivity matrices S^l(t) = dz(t)/dp^l
and R^l(t) = dy(t)/dp^l and modify the sensitivity equations (9.22) as follows. For t ∈
[t_{l−1}, t_l],
dS^l/dt = ∂f/∂z S^l(t) + ∂f/∂y R^l(t) + ∂f/∂p^l,   S^l(0) = 0,   (9.38)
0 = ∂g/∂z S^l(t) + ∂g/∂y R^l(t) + ∂g/∂p^l,   t ∈ [t_{l−1}, t_l];   (9.39)
and for t ∉ [t_{l−1}, t_l],
dS^l/dt = ∂f/∂z S^l(t) + ∂f/∂y R^l(t),   S^l(0) = 0,   t ∉ [t_{l−1}, t_l],   (9.40)
0 = ∂g/∂z S^l(t) + ∂g/∂y R^l(t),   t ∉ [t_{l−1}, t_l].   (9.41)
• For the adjoint sensitivity approach, the adjoint equations (9.28), (9.27), and (9.29)
remain unchanged. The only change occurs in (9.30), which is now rewritten as
dψ/dp^l = ∂ψ/∂p^l + ∫_{t_{l−1}}^{t_l} ((∂f/∂p^l)^T λ + (∂g/∂p^l)^T ν) dt.   (9.42)
Finally, path inequality constraints are usually approximated by enforcing them only
at tl . Because the periods can be made as short as needed, this approximation is often a
practical approach to maintain near feasibility of these constraints.
To demonstrate the (approximate) solution of optimal control problems with the se-
quential approach, we revisit Example 8.4 from Chapter 8.
Example 9.4 For the nonisothermal batch reactor with first order parallel reactions A → B,
A → C, we maximize the final amount of product B subject to an upper temperature bound.
The optimal control problem over t ∈ [0, 1] can be stated as
min −b(1)
s.t. da/dt = −a(t)(u(t) + k u(t)^β),
     db/dt = a(t) u(t),
     a(0) = 1,   b(0) = 0,   u(t) ∈ [0, U].
Here we represent the controls as piecewise constants over the time periods l = 1, . . . , NT ,
with each period having a variable duration. We therefore define pl = [ul , hl ] and redefine
time with the mapping t = tl−1 + hl (τ − (l − 1)), so that tl is replaced by l and t ∈ [0, 1]
is replaced by τ ∈ [0, NT ]. This allows us to rewrite the optimal control problem as a
multiperiod problem of the form (9.1):
and a(τ ) and b(τ ) are continuous across period boundaries. The direct sensitivity equations
for (9.43) are given for τ ∈ (l − 1, l]:
dS_{a,hl}/dτ = −a(τ)(u_l + k u_l^β) − h_l S_{a,hl}(τ)(u_l + k u_l^β),   (9.44a)
dS_{a,ul}/dτ = −h_l a(τ)(1 + βk u_l^{β−1}) − h_l S_{a,ul}(τ)(u_l + k u_l^β),   (9.44b)
dS_{b,hl}/dτ = a(τ) u_l + h_l S_{a,hl}(τ) u_l,   (9.44c)
dS_{b,ul}/dτ = h_l a(τ) + h_l S_{a,ul}(τ) u_l,   (9.44d)
and for τ ∈ (l′ − 1, l′), l′ ≠ l:
dS_{a,hl}/dτ = −h_{l′} S_{a,hl}(τ)(u_{l′} + k u_{l′}^β),   (9.45a)
dS_{a,ul}/dτ = −h_{l′} S_{a,ul}(τ)(u_{l′} + k u_{l′}^β),   (9.45b)
dS_{b,hl}/dτ = h_{l′} S_{a,hl}(τ) u_{l′},   (9.45c)
dS_{b,ul}/dτ = h_{l′} S_{a,ul}(τ) u_{l′}.   (9.45d)
Also, for τ = 0 the sensitivities have zero initial conditions, and S(τ) is continuous
across period boundaries. As a result, we now have 4NT sensitivity equations.
We also derive the related adjoint sensitivity system. Starting from
λ(t)^T f(z, p) = −λ_a h_l a(τ)(u_l + k u_l^β) + λ_b h_l a(τ) u_l,
Note that only two adjoint equations and 2NT integrals need to be evaluated.
Figure 9.4. Optimal temperature profiles for Example 9.4 with increasing numbers
of elements NT .
Both the direct and adjoint sensitivity equations can be applied using the approach of
Figure 9.1. On the other hand, because this example is linear in the states, the state equations
can be solved analytically once p l are fixed. This allows us to solve this problem under ideal
conditions, without any errors from the DAE or the sensitivity equation solvers.
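Under these ideal conditions each period can be integrated in closed form: with c_l = u_l + k u_l^β, a is scaled by exp(−h_l c_l) over period l and b gains u_l a_{l−1} (1 − exp(−h_l c_l))/c_l. The sketch below exploits this, substituting SciPy's SLSQP for SNOPT; the equality constraint Σ h_l = 1 and all variable names are our own formulation choices.

```python
# Exact per-period integration of the linear-in-the-states model with
# piecewise constant controls: with c_l = u_l + k*u_l**beta, a is scaled by
# exp(-h_l*c_l) over period l and b gains u_l*a_{l-1}*(1 - exp(-h_l*c_l))/c_l.
# SciPy's SLSQP is substituted for SNOPT; names and the constraint
# sum(h_l) = 1 are our own formulation choices.
import numpy as np
from scipy.optimize import minimize

k, beta, U, NT = 0.5, 2.0, 5.0, 5

def neg_b_final(x):
    u, hl = x[:NT], x[NT:]
    a, b = 1.0, 0.0
    for l in range(NT):
        c = u[l] + k * u[l] ** beta
        decay = np.exp(-hl[l] * c)
        if c > 0.0:
            b += u[l] * a * (1.0 - decay) / c
        a *= decay
    return -b

cons = {"type": "eq", "fun": lambda x: np.sum(x[NT:]) - 1.0}
bounds = [(0.0, U)] * NT + [(0.0, 2.0 / NT)] * NT
x0 = np.concatenate([np.ones(NT), np.full(NT, 1.0 / NT)])
res = minimize(neg_b_final, x0, method="SLSQP", bounds=bounds,
               constraints=[cons])
```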
For this example we set U = 5, k = 0.5, and β = 2 and add the bounds, 0 ≤ hl ≤ 2/NT ,
to the NLP (9.43). With starting values of h0l = 1/NT and u0l = 1, we apply the SNOPT NLP
solver to (9.43) and apply the approach in Figure 9.1. The results are shown in Table 9.1
with control profiles in Figure 9.4 for increasing values of NT . Comparing these profiles to
the solution in Example 8.4, we see that even for the coarse profile with NT = 5, the same
trend can be observed as in Figure 8.5. The temperature starts at a low temperature and
gradually increases, with a final acceleration to the upper bound. As NT is increased, the
control profile is refined with smaller steps that maintain this shape. Finally, with NT = 100,
the temperature profile (not shown) cannot be distinguished from the solution in Figure 8.5.
On the other hand, more accurate solutions with larger values of NT come at an additional
cost. Each of the function evaluations in Table 9.1 requires the solution of the DAE system,
and each SQP iteration requires the solution of the sensitivity equations as well. From this
table, increasing NT leads to an increase in the number of SQP iterations and function
values. In particular, without exact Hessians, the quasi-Newton update may require a large
number of iterations, particularly on problems with many decision variables. Note also that
the work per iteration increases with NT as well. This is especially true if direct sensitivity
calculations are implemented.
dz^l/dt = f^l(z^l(t), y^l(t), p^l),   z^l(t_{l−1}) = z_0^l,
g^l(z^l(t), y^l(t), p^l) = 0,   t ∈ (t_{l−1}, t_l],   l = 1, . . . , NT,
embedded in an inner loop for each element l. We note that sensitivities are required for z0l
as well as pl . On the other hand, direct and adjoint sensitivity equations need only be solved
over the elements where their variables are present.
At first glance, multiple shooting seems to offer few advantages over the sequential
approach. A larger NLP is formed, more sensitivity information is required, and the number
of SQP iterations may not be reduced. However, for ill-conditioned and unstable systems,
the advantages become clear, as the sequential approach may be incapable of considering
such systems. Unstable systems arise frequently in reactive systems and control problems.
In fact, even though stable solutions may exist at the optimum, the initial value solver may
fail to provide bounded state profiles over the sequence of NLP iterates. This is shown in
the next example.
Example 9.5 Consider the simple dynamic optimization problem discussed in [61, 89] and
given by
Motivated by Example 9.5 we now investigate why the multiple shooting approach is
successful on unstable problems. Unstable state profiles typically encountered with multiple
shooting are sketched in Figure 9.6. Here we see that unbounded solutions can be prevented
in the multiple shooting formulation (9.48), because it introduces two important features
not present in the sequential approach.
First, by choosing the length of the period or element to be sufficiently small, the
multiple shooting approach limits the escape associated with an unstable dynamic mode.
Instead, the imposition of bounds on the state at the end of the elements forces the state
to remain in a bounded region. In contrast, with the sequential approach, an unstable state
profile can become unbounded before tf is reached.
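The escape argument above can be made concrete with a small numerical sketch (the scalar mode and step sizes here are hypothetical; τ = 50 is borrowed from the τ used with Example 9.5). The quantity of interest is the growth factor e^{τh} accumulated over a single integration interval of length h:

```python
import numpy as np

# Minimal sketch of the escape argument (hypothetical scalar unstable mode).
# The unstable mode grows like e^{tau*h} over an interval of length h.
tau, tf = 50.0, 1.0

# Sequential (single-shooting) approach: one IVP over the full horizon [0, tf].
z_single = 1e-3 * np.exp(tau * tf)      # even a 1e-3 perturbation explodes

# Multiple shooting with NT elements: each IVP covers only h = tf/NT, so the
# per-element growth factor stays moderate, and bounds on the element initial
# values z0^l keep the NLP iterates in a bounded region.
NT = 50
h = tf / NT
growth_per_element = np.exp(tau * h)

print(f"single-shooting growth of a 1e-3 perturbation: {z_single:.3e}")
print(f"per-element growth factor (NT = {NT}): {growth_per_element:.3f}")
```

With NT = 50 the per-element growth factor is only e¹ ≈ 2.72, while the single IVP amplifies a 10⁻³ perturbation by e⁵⁰.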
Second, a subtle, but more important, feature is the prevention of ill-conditioning in
the NLP problem. Even when the state profile remains bounded, unstable dynamic modes
can lead to serious ill-conditioning in the Jacobian matrices in (9.48) or the Hessian matrix
Figure 9.5. Comparison of multiple shooting solution (left) with initial value profiles (right) for Example 9.5 with τ = 50. The solid line shows z1(t) while the dotted line shows z2(t).
$$\Longrightarrow\ c = Q^{-1}\Big[b - B_1 Z(1)\int_0^1 Z^{-1}(s)q(s)\,ds\Big], \tag{9.56}$$
where Q = B0 + B1 Z(1). From Theorem 8.1 we know that the IVP has a unique solution,
and from Theorem 8.2 a unique solution exists for the BVP if and only if Q is nonsingular.
To assess the well-posedness of (9.53), we need to ensure that the solution remains
bounded under perturbations of the problem data (i.e., b and q(t)). For this we define
$\Phi(t) = Z(t)Q^{-1}$ as the fundamental solution to
$$\frac{d\Phi}{dt} = A(t)\Phi, \qquad B_0\Phi(0) + B_1\Phi(1) = I, \qquad t \in [0, 1], \tag{9.57}$$
and rewrite (9.51) as
$$\begin{aligned}
z(t) &= Z(t)\Big[c + \int_0^t Z^{-1}(s)q(s)\,ds\Big] \\
&= Z(t)Q^{-1}\Big[b - B_1 Z(1)\int_0^1 Z^{-1}(s)q(s)\,ds\Big] + Z(t)\int_0^t Z^{-1}(s)q(s)\,ds \\
&= \Phi(t)b - \Phi(t)B_1\Phi(1)\int_0^1 \Phi^{-1}(s)q(s)\,ds + \Phi(t)\int_0^t \Phi^{-1}(s)q(s)\,ds \\
&= \Phi(t)b + \Phi(t)\Big[(I - B_1\Phi(1))\int_0^t \Phi^{-1}(s)q(s)\,ds - B_1\Phi(1)\int_t^1 \Phi^{-1}(s)q(s)\,ds\Big] \\
&= \Phi(t)b + \Phi(t)\Big[B_0\Phi(0)\int_0^t \Phi^{-1}(s)q(s)\,ds - B_1\Phi(1)\int_t^1 \Phi^{-1}(s)q(s)\,ds\Big],
\end{aligned}$$
where we use $Z(1)Z(s)^{-1} = \Phi(1)\Phi(s)^{-1}$ and $B_0\Phi(0) + B_1\Phi(1) = I$. This allows us to write
the solution to (9.53) as
$$z(t) = \Phi(t)b + \int_0^1 G(t, s)q(s)\,ds, \tag{9.58}$$
and the matrices $P = B_0\Phi(0)$ and $I - P = B_1\Phi(1)$ correspond to the initial and final
conditions that are “pinned down.” The dichotomy property of pinning down the unstable
modes is analogous to the zero-stability property presented in Section 9.2. In this case, we
invert (9.53) and obtain (9.58) and the Green’s function. Moreover, we define the BVP as
stable and well-conditioned if there exists a constant κ of moderate size (relative to the
problem data A(t), q(t), and b) such that the following inequalities hold:
$$\|z(t)\| \le \kappa\Big(\|b\| + \int_0^1 \|q(s)\|\,ds\Big),$$
$$\|\Phi(t)\| \le \kappa, \qquad t \ge 0,$$
$$\|G_0(t, s)\| \le \kappa, \qquad s \le t,$$
$$\|G_1(t, s)\| \le \kappa, \qquad s > t. \tag{9.60}$$
From the dynamic models and the boundary conditions, we see that the unstable dynamic
mode z1 (t) is pinned down by its boundary condition. The corresponding Green’s functions
are given by
$$G_0(t, s) = \begin{pmatrix} 0 & 0 \\ 0 & e^{\tau(s-t)} \end{pmatrix} \ \text{for all } t \ge s; \qquad
G_1(t, s) = -\begin{pmatrix} e^{\tau(t-s)} & 0 \\ 0 & 0 \end{pmatrix} \ \text{for all } t < s.$$
If the boundary conditions are reversed, then the unstable mode is no longer pinned down and the Green's functions become
$$G_0(t, s) = \begin{pmatrix} e^{\tau(t-s)} & 0 \\ 0 & 0 \end{pmatrix} \ \text{for all } t \ge s; \qquad
G_1(t, s) = -\begin{pmatrix} 0 & 0 \\ 0 & e^{\tau(s-t)} \end{pmatrix} \ \text{for all } t < s.$$
From (9.60) we require $\kappa \ge e^{\tau}$. Since this constant may be much larger than τ, the BVP is
considered unstable.
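A small numerical sketch (with a modest, hypothetical τ = 5 so the numbers stay printable) illustrates how the dichotomous boundary conditions keep the Green's function entries bounded, while the wrong choice forces κ ≥ e^τ:

```python
import numpy as np

# Sketch of the conditioning argument for the decoupled two-mode example:
#   dz1/dt = tau*z1 (unstable),  dz2/dt = -tau*z2 (stable).
# The nonzero Green's function entries are e^{tau(s-t)} and e^{tau(t-s)}.
tau = 5.0                          # hypothetical, modest tau
t_grid = np.linspace(0.0, 1.0, 101)

# Dichotomous BCs (unstable mode pinned at final time): entries stay <= 1.
G0_pinned = max(np.exp(tau * (s - t)) for t in t_grid for s in t_grid if s <= t)
G1_pinned = max(np.exp(tau * (t - s)) for t in t_grid for s in t_grid if s > t)

# Wrong (non-dichotomous) BCs: the e^{tau(t-s)} entry now appears for s <= t
# and forces the conditioning constant kappa >= e^tau.
G0_bad = max(np.exp(tau * (t - s)) for t in t_grid for s in t_grid if s <= t)

print(f"pinned:   max|G0| = {G0_pinned:.3f}, max|G1| = {G1_pinned:.3f}")
print(f"unpinned: max|G0| = {G0_bad:.1f}  (e^tau = {np.exp(tau):.1f})")
```

With the unstable mode pinned, both Green's function entries are bounded by 1; without pinning, the bound grows like e^τ, which is the source of the ill-conditioning discussed next.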
The dichotomy property applies not only to the BVP but also to the discretization used
for its numerical solution. As shown in [16], the inverse of the Jacobian matrix associated
with the matching constraints (9.48b) in the multiple shooting formulation consists of the
quantities $G(t_{l-1}, t_l)$ and $\Phi(t_l)$. Therefore, the boundedness of these quantities has a direct
influence on the discretized BVP that arises in the multiple shooting formulation. Moreover,
de Hoog and Mattheij [109] investigated the relationship between the discretized BVP and
the fundamental BVP solution and showed that the conditioning constants for the Jacobian
are closely related to κ in (9.60). Therefore to promote a stable and well-conditioned solution,
the appropriate dichotomous boundary condition is required in (9.48). For this NLP, the
required boundary condition can be added explicitly through a final time constraint that
could include additional decision variables, or through a sufficiently tight bound on the
dynamic modes at final time. In either case, a well-conditioned discretized system results.
This property is illustrated by the case study in the next section.
A + C + D → G,
A + C + E → H,
A + E → F,
3D → 2F.
These reactions lead to two liquid products, G and H , and an unwanted liquid by-product, F .
The reactor model is given by the following ODEs:
$$\frac{dN_{A,r}}{dt} = y_{A,in}F_{in} - y_{A,out}F_{out} - R_1 - R_2 - R_3, \tag{9.61a}$$
$$\frac{dN_{B,r}}{dt} = y_{B,in}F_{in} - y_{B,out}F_{out}, \tag{9.61b}$$
$$\frac{dN_{C,r}}{dt} = y_{C,in}F_{in} - y_{C,out}F_{out} - R_1 - R_2, \tag{9.61c}$$
$$\frac{dN_{D,r}}{dt} = y_{D,in}F_{in} - y_{D,out}F_{out} - R_1 - \tfrac{3}{2}R_4, \tag{9.61d}$$
$$\frac{dN_{E,r}}{dt} = y_{E,in}F_{in} - y_{E,out}F_{out} - R_2 - R_3, \tag{9.61e}$$
$$\frac{dN_{F,r}}{dt} = y_{F,in}F_{in} - y_{F,out}F_{out} + R_3 + R_4, \tag{9.61f}$$
$$\frac{dN_{G,r}}{dt} = y_{G,in}F_{in} - y_{G,out}F_{out} + R_1, \tag{9.61g}$$
$$\frac{dN_{H,r}}{dt} = y_{H,in}F_{in} - y_{H,out}F_{out} + R_2, \tag{9.61h}$$
$$\sum_{i=A}^{H} N_{i,r}C_{p,i}\,\frac{dT_r}{dt} = \sum_{i=A}^{H} y_{i,in}C_{p,vap,i}F_{in}(T_{in} - T_r)
- Q_{CW}(T_r, T_{CW}, F_{CW}) - \sum_j \Delta H_{R,j}(T_r)R_j, \tag{9.61i}$$
$$(\rho_{CW}V_{CW}C_{p,CW})\frac{dT_{CW}}{dt} = F_{CW}C_{p,CW}(T_{CW,in} - T_{CW})
+ Q_{CW}(T_r, T_{CW}, F_{CW}). \tag{9.61j}$$
The differential equations represent component mass balances in the reactor as well as energy
balances for the reactor vessel and heat exchanger. The additional variables are defined by
the following equations:
$$x_{i,r} = N_{i,r}\Big/\sum_{i'=D}^{H} N_{i',r}, \qquad i = D, E, \dots, H,$$
$$y_{i,out} = P_{i,r}/P_r, \qquad i = A, B, \dots, H,$$
$$P_r = \sum_{i=A}^{H} P_{i,r},$$
$$P_{i,r} = N_{i,r}RT_r/V_{Vr}, \qquad i = A, B, C,$$
$$V_{Lr} = \sum_{i=D}^{H} N_{i,r}/\rho_i,$$
$$V_{Vr} = V_r - V_{Lr},$$
$$P_{i,r} = \gamma_{i,r}x_{i,r}P_i^{sat}, \qquad i = D, E, F, G, H,$$
$$R_1 = \alpha_1 V_{Vr}\exp[\mu_1 - \nu_1/T_r]\,P_{A,r}^{1.15}P_{C,r}^{0.370}P_{D,r},$$
$$R_2 = \alpha_2 V_{Vr}\exp[\mu_2 - \nu_2/T_r]\,P_{A,r}^{1.15}P_{C,r}^{0.370}P_{E,r},$$
$$R_3 = \alpha_3 V_{Vr}\exp[\mu_3 - \nu_3/T_r]\,P_{A,r}P_{E,r},$$
$$R_4 = \alpha_4 V_{Vr}\exp[\mu_4 - \nu_4/T_r]\,P_{A,r}P_{D,r},$$
$$F_{out} = \beta\sqrt{P_r - P_s}.$$
Here Fin and Fout are the inlet and outlet flow rates. In addition, Pr and Ps are the reactor
and system pressures, and Ni , yi , xi , Pisat , Pir , ρi are reactor holdups, vapor and liquid mole
fractions, vapor pressures, partial pressures, and liquid densities of component i. Rj are
the reaction rates, VLr , VV r , VCW are the liquid and vapor reactor volumes and the heat
exchanger volume, respectively. In addition, γ_i, α_j, μ_j, and ν_j are constants; ΔH_{R,j} and C_{p,i} are
heats of reaction and heat capacities; β is the valve constant; and Q_CW is a function that
models the heat removed by the heat exchanger.
This system has three output variables y(t) (reactor pressure Pr , level VL , and tem-
perature Tr ), and the control (or manipulated) variables ul are the reactor feed rate Fin , the
agitator speed β, and the cooling water flow rate FCW . For the purpose of this study, we
assume that complete and accurate state information is available over all time. More details
on the reactor model can be found in [118, 345, 211].
The NMPC controller needs to maintain conditions of reactor operation that corre-
spond to equal production of products G and H . However, at this operating point the reactor
is open-loop unstable, and without closed-loop stabilizing control, the state profiles become
unbounded.
The dynamic model (9.61) can be represented as the following ODE system along
with output equations:
dz(t)
= f (z(t), u), y(t) = g(z(t), u).
dt
The control problem is described over a prediction horizon with NT time periods of 180 sec-
onds each. The objective function is to keep the reactor operating at the steady state point
without violating the operating limits and constraints on the process variables. This problem
takes the form of (9.2) and is written over the predictive horizon shown in Figure 9.8 as
$$\min\ \varphi(p) = \sum_{l=1}^{NT}\Big[(y(t_l) - y_{sp})^T Q_y (y(t_l) - y_{sp})
+ (u_l - u_{l-1})^T Q_u (u_l - u_{l-1})\Big] \tag{9.62a}$$
$$\text{s.t.}\quad \frac{dz(t)}{dt} = f(z(t), u_l), \qquad z(0) = z_k, \tag{9.62b}$$
$$y(t) = g(z(t), u_l), \tag{9.62c}$$
$$u_{L,l} \le u_l \le u_{U,l}, \qquad l = 1, \dots, NT, \tag{9.62d}$$
where weighting matrices for the controller are Qy = I and Qu = 10−4 × I . By adjusting
the bounds uL,l and uU,l , the control variables can be varied over the input horizon with NU
periods, while the process is simulated for several more time periods (the output horizon
with NT periods) with constant controls. The output horizon must be long enough so that
the process will arrive at the steady state at the end of the simulation. After the NMPC
problem (9.62) is solved at time k with initial state zk , the control variables in the first time
period u1 are injected into the process. At time k + 1, the dynamic model is updated with
new measurements from the process, and problem (9.62) is set up and solved again with
zk+1 as the initial condition.
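The receding-horizon logic just described can be sketched as follows; `solve_nmpc` and `plant_step` are placeholders (not the NEWCON implementation referenced later in the text), shown only to make the information flow of the NMPC loop explicit:

```python
import numpy as np

NT = 10          # hypothetical prediction horizon (periods of 180 s each)

def solve_nmpc(zk, u_prev):
    """Placeholder for the SQP solve of (9.62) with initial state zk."""
    return [u_prev] * NT            # would return the optimal u_1, ..., u_NT

def plant_step(zk, u1):
    """Placeholder for one 180 s period of the real (or simulated) plant."""
    return zk                       # stand-in: state unchanged

z = np.zeros(3)                     # measured state z_k
u = np.zeros(3)                     # previously injected controls
for k in range(5):                  # five sampling periods
    u_seq = solve_nmpc(z, u)        # solve (9.62) at time k
    u = u_seq[0]                    # inject only the first period's controls
    z = plant_step(z, u)            # plant evolves; measurement gives z_{k+1}
```

Only the first control period is ever applied; the remaining NT − 1 periods serve to anticipate constraints and steady-state behavior before the problem is re-solved.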
Problem (9.62) was solved with both sequential and multiple shooting approaches.
Moreover, with multiple shooting we consider the option of adding constraints at final
time to force the model to steady state values. This constraint guarantees the dichotomy
property for the dynamic system. The NMPC problem was solved with the NEWCON
package, which applied DASSL and DDASAC to solve the DAEs and the direct sensitivity
equations, respectively. The SQP solver incorporates the QPSOL package to solve the QP
subproblem, and the Gauss–Newton approximation (9.35) was used for the Hessian.
For this problem, open-loop instability of the reactor has a strong influence on the
convergence properties of these algorithms. In particular, for predictive horizons of length
NT = 10, and NU = 5, the condition number of the Hessian is on the order of 108 . With an
output horizon at NT = 20, the condition number grows to approximately 1012 .
Figure 9.9 shows the open-loop and closed-loop responses of the normalized output
variables. The G/H mass ratio is shown in Figure 9.10. These graphs describe the
formulation was twice as fast as the sequential approach and required an average of only
6 SQP iterations to solve problem (9.62). For the larger output horizon with NT = 20, the
disturbance case (curve 4) was also obtained successfully with and without endpoint con-
straints. However, without endpoint constraints, ill-conditioning often led to error messages
from the QP solver. Finally, for the case of no disturbances (curve 5), only the endpoint
formulation was able to solve this problem. Without endpoint constraints, the QP solver
fails at time t = 7 h and the controller terminates. This failure is due to ill-conditioning of
the constraint gradients when endpoint constraints are absent. This difficulty also increases
with longer predictive horizons. On the other hand, by using multiple shooting with terminal
output constraints for NT = 30 and NU = 5, the flat curves 4 and 5 can still be obtained for
simulation times of 20 hours (i.e., a sequence of 400 horizon problems (9.62)).
• Both sequential and multiple shooting approaches require the repeated solution of
DAE systems. The DAE solution and sensitivity calculations represent the dominant
costs of the optimization. In particular, the direct sensitivity approach can become an
overwhelming component of the computation if the problem has many variables in
the nonlinear program.
• Both approaches are heavily dependent on the reliability of the DAE solver. Failure in
the DAE solver or the corresponding sensitivity calculation will cause the optimization
strategy to fail. This issue was explored for unstable DAEs, but failure may also be due
to nonsmoothness, loose tolerances, and other features of DAE systems and solvers.
• Control profiles and path constraints must be approximated in the sequential and mul-
tiple shooting approaches. While this is often adequate for many practical problems,
a close approximation requires many more decision variables and more expensive
solutions.
• By embedding the DAE solver and sensitivity into the optimization strategy, the
constraint gradients in problem (9.1) are no longer sparse, but contain dense blocks.
The linear algebra associated with these blocks may be expensive for problems with
many decision or state variables, especially in the multiple shooting formulation.
Overcoming these issues will be addressed in the next chapter, where large-scale, sparse
NLP formulations are developed that directly incorporate the discretized DAE system.
Schlegel et al. [355]; Prata et al. [317]; Kadam and Marquardt [214, 215]; Santos et al.
[346, 345]; Romanenko and Santos [337]; and Oliveira and Biegler [299].
Similarly, the multiple shooting method has seen significant recent development
through the work of Leineweber et al. [250, 251]; Bock [61]; and Diehl, Bock, and Schlöder
[115]. In particular, the MUSCOD code implements large-scale SQP methods with ad-
vanced DAE solvers. Building on the concepts in Section 9.4, sophisticated advances have
been implemented in MUSCOD that accelerate the direct sensitivity step and promote reli-
able solutions, even for unstable and chaotic systems. More information on the MUSCOD
algorithm can be found in [251].
Finally, as shown in Chapters 5 and 6, the performance of the NLP solver in Figure 9.1
can be greatly enhanced if second derivatives can be made available from the DAE solver.
Recent work (see [186, 301]) has led to efficient strategies to calculate Hessian vector
products from the sensitivity equations. Here second order adjoint equations are applied to
the direct sensitivity equations (or vice versa) for search directions supplied by the NLP
solver. This approach leads to an accurate Hessian vector product at the cost of an additional
adjoint sensitivity step.
9.8 Exercises
1. Consider the reactor optimization problem given by
$$\min\ L - 500\int_0^L \big(T(t) - T_S\big)\,dt$$
$$\text{s.t.}\quad \frac{dq}{dt} = 0.3(1 - q(t))\exp\big(20(1 - 1/T(t))\big), \qquad q(0) = 0,$$
$$\frac{dT}{dt} = -1.5(T(t) - T_S) + \frac{2}{3}\frac{dq}{dt}, \qquad T(0) = 1,$$
where q(t) and T(t) are the normalized reactor conversion and temperature, respectively, and the decision variables are T_S ∈ [0.5, 1] and L ∈ [0.5, 1.25].
(a) Derive the direct sensitivity equations for the DAEs in this problem.
(b) Using MATLAB or a similar package, apply the sequential approach to find the
optimum values for the decision variables.
(c) How would you reformulate the problem so that the path constraint T (t) ≤ 1.45
can be enforced?
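One possible way to carry out part (b) in Python with SciPy (a "similar package" in the exercise's sense): the sketch below uses gradient-free Nelder–Mead in place of an SQP solver and evaluates the integral term by quadrature on the dense ODE output. All function and variable names here are illustrative, not from the text.

```python
import numpy as np
from scipy.integrate import solve_ivp, trapezoid
from scipy.optimize import minimize

def objective(p):
    """Sequential approach: simulate the DAE/ODE, then evaluate the objective."""
    Ts, L = p
    def rhs(t, x):
        q, T = x
        dq = 0.3 * (1.0 - q) * np.exp(20.0 * (1.0 - 1.0 / T))
        dT = -1.5 * (T - Ts) + (2.0 / 3.0) * dq
        return [dq, dT]
    sol = solve_ivp(rhs, (0.0, L), [0.0, 1.0], dense_output=True,
                    rtol=1e-8, atol=1e-10)
    t = np.linspace(0.0, L, 400)
    integral = trapezoid(sol.sol(t)[1] - Ts, t)   # int_0^L (T(t) - Ts) dt
    return L - 500.0 * integral

res = minimize(objective, x0=[0.75, 1.0], method="Nelder-Mead",
               bounds=[(0.5, 1.0), (0.5, 1.25)])
print(res.x, res.fun)
```

A derivative-based SQP code with the sensitivity equations from part (a) would be closer to the text's sequential strategy; this sketch only demonstrates the simulate-then-optimize structure.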
(a) For the sequential optimization formulation, derive the analytical solution for
substitution in (9.2) and solve the problem in GAMS or AMPL for various values
of NT .
(b) For the multiple shooting optimization formulation, derive the analytical solu-
tion for substitution in (9.48) and solve the problem in GAMS or AMPL for
various values of NT .
(a) Derive the direct sensitivity and adjoint sensitivity equations required for the
sequential formulation (9.2).
(b) Derive the direct sensitivity and adjoint sensitivity equations required for the
multiple shooting formulation (9.48).
(c) For the sequential optimization formulation, derive the analytical solution for
substitution in (9.2) and solve the problem in GAMS or AMPL for various values
of NT .
(d) For the multiple shooting optimization formulation, derive the analytical solu-
tion for substitution in (9.48) and solve the problem in GAMS or AMPL for
various values of NT .
$$\frac{dz_1}{dt} = z_2,$$
$$\frac{dz_2}{dt} = 1600z_1 - (\pi^2 + 1600)\sin(\pi t).$$
dt
(a) Show that the analytic solutions of these differential equations are the same for
the initial conditions z1(0) = 0, z2(0) = π and for the boundary conditions z1(0) =
z1(1) = 0.
(b) Find the analytic solution for the initial and boundary value problems. Comment
on the dichotomy of each system.
(c) Consider the optimal control problem:
$$\max\ c_2(1.0)$$
$$\text{s.t.}\quad \frac{dc_1}{dt} = -k_1(T)c_1^2, \qquad c_1(0) = 1,$$
$$\frac{dc_2}{dt} = k_1(T)c_1^2 - k_2(T)c_2, \qquad c_2(0) = 0,$$
$$\max\ z_3(1.0)$$
$$\text{s.t.}\quad \frac{dz_1}{dt} = z_2, \qquad z_1(0) = 0,$$
$$\frac{dz_2}{dt} = -z_2 + u(t), \qquad z_2(0) = -1,$$
$$\frac{dz_3}{dt} = z_1^2 + z_2^2 + 0.005\,u(t)^2, \qquad z_3(0) = 0.$$
Discretize the control profile as piecewise constants over NT periods and perform the
following:
(a) Derive the adjoint sensitivity equations for this problem.
(b) Cast this example in the form of (9.2) and solve using the sequential strategy
with MATLAB or a similar package.
(c) Cast this example in the form of (9.48) and solve using the multiple shooting
approach with MATLAB or a similar package.
Chapter 10
Following on embedded methods for dynamic optimization, this chapter considers “all-
at-once” or direct transcription methods that allow a simultaneous approach for this opti-
mization problem. In particular, we consider formulations based on orthogonal collocation
methods. These methods can also be represented as a special class of implicit Runge–Kutta
(IRK) methods, and concepts and properties of IRK methods apply directly. Large-scale
optimization formulations are then presented with the aim of maintaining accurate state and
control profiles and locating potential break points. This approach is applied to a number
of difficult problem classes, including unstable systems, path constraints, and singular
controls. Moreover, a number of real-world examples are featured that demonstrate
the characteristics and advantages of this approach. These include batch crystallization pro-
cesses, grade transitions in polymer processes, and large-scale parameter estimation for
complex industrial reactors.
10.1 Introduction
This chapter evolves from sequential and multiple shooting approaches for dynamic op-
timization by considering a large nonlinear programming (NLP) formulation without an
embedded DAE solver. Instead, we consider the multiperiod dynamic optimization prob-
lem (8.5) where the periods themselves are represented by finite elements in time, with
piecewise polynomial representations of the state and controls in each element. This ap-
proach leads to a discretization that is equivalent to the Runge–Kutta methods described
in the previous chapter. Such an approach leads to a fully open formulation, represented in
Figure 7.1, and has a number of pros and cons. The large-scale NLP formulation allows a
great deal of sparsity and structure, along with flexible decomposition strategies to solve this
problem efficiently. Moreover, convergence difficulties in the embedded DAE solver are
avoided, and sensitivity calculations from the solver are replaced by direct gradient and
Hessian evaluations within the NLP formulation. On the other hand, efficient large-scale
NLP solvers are required, and accurate state and control profiles require careful formulation
of the nonlinear program.
To address these problems, we rely on efficient large-scale NLP algorithms devel-
oped in Chapter 6. In particular, methods that accept exact second derivatives have fast
convergence properties. Moreover, exploitation of the structure of the KKT matrix leads
to efficient large-scale algorithms. Nevertheless, additional concerns include choosing an
accurate and stable discretization, selecting the number and the length of finite elements,
and dealing with unstable dynamic modes. Finally, the fundamental relation of the NLP
formulation and the original dynamic optimization problem needs to be analyzed. All of
these issues will be explored in this chapter, and properties of the resulting simultaneous
collocation formulation will be demonstrated on a number of real-world process examples.
The next section describes the collocation approach and motivates its properties,
particularly high-order approximations and the relation to IRK methods. Section 10.3 in-
corporates the discretized DAEs into an NLP formulation and discusses the addition of finite
elements and the incorporation of discontinuous control profiles. In addition, the calcula-
tion of error bounds is discussed and the extension to variable finite elements is developed.
Moreover, we also analyze the treatment of unstable modes. Section 10.4 then explores
the relation of the NLP formulation with the dynamic optimization problem. A key issue
to establishing this relationship is regularity of the optimal control problem that translates
into nonsingularity of the KKT matrix. Subsections 10.4.4 and 10.4.5 deal with open ques-
tions where these regularity conditions are violated. This section also deals with problems
with inequality constraints on state profiles. As seen in Chapter 8, these problems are dif-
ficult to handle, but with the simultaneous collocation method, they may be treated in a
more straightforward way. We also deal with singular control problems, discuss related
convergence difficulties, both with indirect and direct approaches, and present heuristic
regularization approaches to overcome these difficulties.
Figure 10.1. Polynomial approximation for state profile across a finite element.
$$\text{where } \ell_j(\tau) = \prod_{k=0,\,k\neq j}^{K}\frac{(\tau - \tau_k)}{(\tau_j - \tau_k)},$$
where $z_{i-1}$ is a coefficient that represents the differential state at the beginning of element
i, $\dot z_{ij}$ represents the time derivative $\frac{dz^K(t_{ij})}{dt}$, and $\Omega_j(\tau)$ is a polynomial of order K satisfying
$$\Omega_j(\tau) = \int_0^{\tau}\ell_j(\tau')\,d\tau', \qquad t \in [t_{i-1}, t_i],\ \tau \in [0, 1].$$
To determine the polynomial coefficients that approximate the solution of the DAE, we
substitute the polynomial into (10.2) and enforce the resulting algebraic equations at the
interpolation points τ_k. This leads to the following collocation equations:
$$\frac{dz^K}{dt}(t_{ik}) = f(z^K(t_{ik}), t_{ik}), \qquad k = 1, \dots, K, \tag{10.6}$$
with $z^K(t_{i-1})$ determined separately. For the polynomial representations (10.4) and (10.5),
it is convenient to normalize time over the element, write the state profile as a function of τ,
and apply $\frac{dz^K}{d\tau} = h_i\frac{dz^K}{dt}$. For the Lagrange polynomial (10.4), the collocation equations
become
$$\sum_{j=0}^{K} z_{ij}\frac{d\ell_j(\tau_k)}{d\tau} = h_i f(z_{ik}, t_{ik}), \qquad k = 1, \dots, K, \tag{10.7}$$
while the collocation equations for the Runge–Kutta basis are given by
$$\dot z_{ik} = f(z_{ik}, t_{ik}), \tag{10.8a}$$
$$z_{ik} = z_{i-1} + h_i\sum_{j=1}^{K}\Omega_j(\tau_k)\,\dot z_{ij}, \qquad k = 1, \dots, K, \tag{10.8b}$$
with $z_{i-1}$ determined from the previous element i − 1 or from the initial condition on the
ODE.
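To make the Runge–Kutta form (10.8) concrete, the sketch below solves the linear test problem dz/dt = λz on a single element with K = 3 shifted Gauss–Legendre points; for linear f the collocation equations reduce to a K × K linear system. The test problem and all names are illustrative, not from the text:

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial.legendre import leggauss

# One-element collocation in Runge-Kutta form (10.8) for dz/dt = lam*z, z(0)=1.
lam, h, K = -2.0, 1.0, 3
tau = (leggauss(K)[0] + 1.0) / 2.0        # roots shifted from [-1,1] to [0,1]

A = np.zeros((K, K))                      # a_kj = Omega_j(tau_k)
b = np.zeros(K)                           # b_j  = Omega_j(1)
for j in range(K):
    c = np.array([1.0])                   # build ell_j(s) by factor products
    for m in range(K):
        if m != j:
            c = P.polymul(c, np.array([-tau[m], 1.0]) / (tau[j] - tau[m]))
    Omega = P.polyint(c)                  # antiderivative, Omega_j(0) = 0
    A[:, j] = P.polyval(tau, Omega)
    b[j] = P.polyval(1.0, Omega)

# zdot_k = lam*z_k with z_k = z0 + h*sum_j A[k,j]*zdot_j  ->  linear system
z0 = 1.0
zdot = np.linalg.solve(np.eye(K) - h * lam * A, lam * z0 * np.ones(K))
z_end = z0 + h * b @ zdot                 # state at the element boundary

print(f"collocation: {z_end:.10f}   exact: {np.exp(lam * h):.10f}")
```

For K Gauss–Legendre points the one-element result matches e^{λh} to high accuracy, consistent with the O(h^{2K}) order discussed below.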
The optimal choice of interpolation points τj and quadrature weights ωj leads to 2K degrees
of freedom with the result that (10.10) will provide the exact solution to (10.9) as long as
f (z(t), t) is a polynomial in t of order 2K (degree ≤ 2K − 1). The optimal choice of
interpolation points is given by the following theorem.
Theorem 10.1 (Accuracy of Gaussian Quadrature). The quadrature formula (10.10) provides the exact solution to the integral (10.9) if f(z(t), t) is a polynomial in t of order 2K
and the τ_j are the roots of a Kth degree polynomial P_K(τ) with the property
$$\int_0^1 P_j(\tau)P_{j'}(\tau)\,d\tau = 0, \qquad j = 0, \dots, K-1,\ \ j' = 1, \dots, K, \ \text{ for indices } j \neq j'. \tag{10.11}$$
Proof: Without loss of generality, we consider only scalar profiles in (10.9) and (10.10)
and define z and f directly as functions of τ with the domain of integration τ ∈ [0, 1]. We
expand the integrand as a polynomial and write
$$f(\tau) = \sum_{j=1}^{K}\ell_j(\tau)f(\tau_j) + \frac{d^K f(\bar\tau)}{d\tau^K}\prod_{j=1}^{K}(\tau - \tau_j)\big/K!
= \sum_{j=1}^{K}\ell_j(\tau)f(\tau_j) + q_{K-1}(\tau)\prod_{j=1}^{K}(\tau - \tau_j) + Q(\tau),$$
where $\bar\tau \in [0, 1]$, $\prod_{j=1}^{K}(\tau - \tau_j)$ is of degree K, Q(τ) is the residual polynomial of order
2K + 1, and $q_{K-1}(\tau)$ is a polynomial of degree ≤ K − 1, which can be represented as
$q_{K-1}(\tau) = \sum_{j=0}^{K-1}\alpha_j P_j(\tau)$, with coefficients α_j.
Now if f(τ) is a polynomial of order 2K, we note that Q(τ) = 0. Choosing τ_j so that
$\prod_{j=1}^{K}(\tau - \tau_j) = \kappa P_K(\tau)$ for some κ > 0 then leads to
$$\int_0^1 f(\tau)\,d\tau = \int_0^1\Big[\sum_{j=1}^{K}\ell_j(\tau)f(\tau_j) + q_{K-1}(\tau)\prod_{j=1}^{K}(\tau - \tau_j)\Big]d\tau
= \int_0^1\sum_{j=1}^{K}\ell_j(\tau)f(\tau_j)\,d\tau,$$
where the last integral follows from (10.11) with j′ = K and j = 0, …, K − 1. The desired
result then follows by setting $\omega_j = \int_0^1\ell_j(\tau)\,d\tau$ and noting that $t_{ij} = t_{i-1} + h_i\tau_j$.
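The exactness claimed in Theorem 10.1 can be checked numerically: K-point Gauss–Legendre quadrature shifted to [0, 1] integrates a random polynomial of degree 2K − 1 to machine precision. This sketch uses NumPy's `leggauss`, which works on [−1, 1] (hence the shift mentioned in the footnote below):

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# Check of Theorem 10.1: K-point Gauss-Legendre quadrature on [0,1] integrates
# polynomials of order 2K (degree <= 2K-1) exactly.
K = 4
x, w = leggauss(K)                  # nodes and weights on [-1, 1]
tau = (x + 1.0) / 2.0               # shift nodes to [0, 1]
omega = w / 2.0                     # rescale weights for the unit interval

rng = np.random.default_rng(0)
coeffs = rng.standard_normal(2 * K)                     # degree 2K-1 polynomial
exact = sum(c / (k + 1) for k, c in enumerate(coeffs))  # integral of sum c_k t^k
vals = np.array([sum(c * t**k for k, c in enumerate(coeffs)) for t in tau])
quad = omega @ vals

print(f"quadrature error: {abs(quad - exact):.2e}")
```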
This result justifies the choice of collocation points τ_j as the roots of P_K(τ), the shifted
Gauss–Legendre polynomial^6 with the orthogonality property (10.11). This polynomial
belongs to the more general class of Gauss–Jacobi polynomials that satisfy
$$\int_0^1 (1 - \tau)^{\alpha}\tau^{\beta}P_j(\tau)P_{j'}(\tau)\,d\tau = 0, \qquad j \neq j'. \tag{10.12}$$
Using these polynomials, Theorem 10.1 can be suitably modified to allow an exact quadrature for f(τ) with degree ≤ 2K − 1 − α − β. Gauss–Jacobi polynomials of degree K can
be written as
$$P_K^{(\alpha,\beta)}(\tau) = \sum_{j=0}^{K}(-1)^{K-j}\gamma_j\,\tau^j \tag{10.13}$$
^6 Gauss–Legendre polynomials are normally defined with τ ∈ [−1, 1] as the domain of integration in
(10.11). In this chapter, we will only consider τ ∈ [0, 1].
with γ_0 = 1 and
$$\gamma_j = \gamma_{j-1}\,\frac{(K - j + 1)(K + j + \alpha + \beta)}{j(j + \beta)}, \qquad j = 1, \dots, K.$$
With this result, we note from (10.8) that
$$z^K(t_i) = z^K(t_{i-1}) + h_i\sum_{j=1}^{K}\Omega_j(1)f(z_{ij}),$$
K is large), then $z^K(t_0) = z_0$. For multiple elements, with N > 1, we enforce continuity of
the state profiles across element boundaries. With Lagrange interpolation profiles, this is
written as
$$z_{i+1,0} = \sum_{j=0}^{K}\ell_j(1)z_{ij}, \qquad i = 1, \dots, N-1, \tag{10.14a}$$
$$z_f = \sum_{j=0}^{K}\ell_j(1)z_{Nj}, \qquad z_{1,0} = z_0, \tag{10.14b}$$
while the Runge–Kutta basis gives
$$z_f = z_{N-1} + h_N\sum_{j=1}^{K}\Omega_j(1)\dot z_{Nj}. \tag{10.15b}$$
Finally, we note that by using the Runge–Kutta basis, equations (10.8) and (10.15) show
that the collocation approach is an IRK method in the form of equations (9.8). This can
be seen directly by noting the following equivalences: $n_s = K$, $c_k = \tau_k$, $a_{kj} = \Omega_j(\tau_k)$, and
$b_k = \Omega_k(1)$. Because collocation methods are IRK methods, they
enjoy the following properties described in Section 9.2.
• Collocation methods are A-stable, and both Gauss–Legendre and Radau collocation
are AN -stable, or equivalently algebraically stable. As a result, there is no stability
limitation on hi for stiff problems.
• Radau collocation has stiff decay. Consequently, large time steps hi are allowed for
stiff systems that capture steady state components and slow time scales.
• Both Gauss–Legendre and Radau collocations are among the highest order methods.
The truncation error (9.6) is O(h2K ) for Gauss–Legendre and O(h2K−1 ) for Radau.
This high-order error applies to zi , but not to the intermediate points, zij .
To illustrate the use of collocation methods, we consider the following small IVP.
and
$$z_{i+1,0} = \sum_{j=0}^{3}\ell_j(1)z_{ij}, \qquad i = 1, \dots, N-1,$$
$$z_f = \sum_{j=0}^{K}\ell_j(1)z_{Nj}, \qquad z_{1,0} = -3.$$
$$u(t) = \sum_{j=1}^{K}\bar\ell_j(\tau)u_{ij}, \qquad y(t) = \sum_{j=1}^{K}\bar\ell_j(\tau)y_{ij},$$
$$\text{where } \bar\ell_j(\tau) = \prod_{k=1,\,k\neq j}^{K}\frac{(\tau - \tau_k)}{(\tau_j - \tau_k)}, \tag{10.17}$$
$$\sum_{j=0}^{K}\dot\ell_j(\tau_k)z_{ij} - h_i f(z_{ik}, y_{ik}, u_{ik}, p) = 0, \qquad i \in \{1, \dots, N\},\ k \in \{1, \dots, K\}, \tag{10.18a}$$
$$g(z_{ik}, y_{ik}, u_{ik}, p) = 0, \qquad i \in \{1, \dots, N\},\ k \in \{1, \dots, K\}, \tag{10.18b}$$
$$z_f = \sum_{j=0}^{K}\ell_j(1)z_{Nj}, \qquad z_{1,0} = z(t_0), \tag{10.19g}$$
$$h_E(z_f) = 0. \tag{10.19h}$$
Problem (10.19) allows for a number of formulation options, especially when extended
to multiperiod problems. The simplest case, where the dynamic system is described by
a single finite element (N = 1), leads to the class of pseudospectral methods [327, 342].
Such methods can be very accurate for dynamic optimization problems that have smooth
profile solutions. On the other hand, if the solution profiles have changes in active sets or the
control profiles are discontinuous over time, then these solutions are not sufficiently accurate
to capture the solution of the state equations, and multiple elements need to be introduced.
For the multielement formulation, it is important to note from (10.17) that the alge-
braic and control profiles are not specified at τ = 0. Moreover, continuity is not enforced at
the element boundary for these profiles. For algebraic profiles defined by index-1 constraints
(10.18b), continuity of these profiles is obtained automatically, as algebraic variables are
implicit functions of continuous differential variables. On the other hand, control profiles are
allowed to be discontinuous at element boundaries and this allows us to capture interesting
solutions of optimal control problems accurately, including those observed in Examples 8.5,
8.8, and 8.9. Finally, for process control applications, the control profile is often represented
as a piecewise constant profile, where the breakpoints are defined by finite element bound-
aries. This profile description is straightforward to incorporate within (10.19). Moreover, as
seen next, accurate control profiles can be determined with the aid of variable finite elements.
where $C_1$ is a computable constant and $T_i(t)$ can be computed from the polynomial
solution. Choices for $T_i(t)$ are reviewed in Russell and Christensen [341]. In particular,
the error estimate given by
$$T_i(t) = \frac{d^{K+1}z(t)}{dt^{K+1}}\,h_i^{K+1}$$
is widely used. This quantity can be estimated from discretization of $z^K(t)$ over
elements that are neighbors of element i.
• Alternately, one can obtain an error estimate from
$$T_i(t) = \frac{dz^K(t)}{d\tau} - h_i f(z^K(t), y(z^K(t)), u_i(t), p).$$
This residual-based estimate can be calculated directly from the discretized DAEs.
Here, $T_i(t_{ik}) = 0$ at collocation points, but choosing a noncollocation point $t_{nc} =
t_{i-1} + h_i\tau_{nc}$, $\tau_{nc} \in [0, 1]$, leads to $\|e_i(t)\| \le \bar C\,\|T_i(t_{nc})\|$ with the constant $\bar C$ given by
$$\bar C = \frac{1}{A}\int_0^{\tau_{nc}}\prod_{j=1}^{K}(s - \tau_j)\,ds, \qquad A = \prod_{j=1}^{K}(\tau_{nc} - \tau_j).$$
With this estimate, N sufficiently large, and a user-specified error tolerance ε, appropriate values of $h_i$ can be determined by adding the constraints
$$\sum_{i=1}^{N} h_i = t_f, \qquad h_i \ge 0, \tag{10.21a}$$
$$\bar C\,\|T_i(t_{nc})\| \le \varepsilon \tag{10.21b}$$
to (10.19). This extended formulation has been developed in [396] and demonstrated
on a number of challenging reactor optimization problems. In particular, this approach
allows variable finite elements to track and adapt to steep profiles encountered over
the course of an optimization problem.
• On the other hand, the variable finite element formulation with (10.21) does lead to
more constrained and difficult NLPs, with careful initializations needed to obtain good
solutions. A simpler approach is to choose a sufficiently large number of elements
and a nominal set of time steps h̄i by trial and error, and to first solve (10.19) with
these fixed values. Using this solution to initialize the variable element problem, we
then relax $h_i$ and add the following constraints to (10.19):
$$\sum_{i=1}^{N} h_i = t_f, \qquad h_i \in [(1-\gamma)\bar h_i,\ (1+\gamma)\bar h_i], \tag{10.22}$$
with a constant γ ∈ [0, 1/2), and solve the resulting NLP again. This approach allows
the time steps to remain suitably small and provides sufficient freedom to locate
breakpoints in the control profiles.
• One drawback to this NLP formulation is that addition of hi as variables may lead to
additional zero curvature in the reduced Hessian for time steps that have little effect
on the solution profiles. This issue will be addressed in Section 10.4. Finally, optimal
control solutions can be further improved by monitoring the solution and adding
additional time steps if several hi have optimal values at their upper bounds [48, 378].
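The constant C̄ in the residual-based error estimate above can be evaluated directly for a given set of collocation points. The sketch below uses K = 3 shifted Gauss–Legendre points and a hypothetical noncollocation point τ_nc = 0.4 (any point in [0, 1] off the collocation grid would do):

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial.legendre import leggauss

K, tau_nc = 3, 0.4                          # tau_nc off the collocation grid
tau = (leggauss(K)[0] + 1.0) / 2.0          # shifted Gauss-Legendre points

poly = np.array([1.0])                      # build prod_j (s - tau_j)
for t in tau:
    poly = P.polymul(poly, np.array([-t, 1.0]))

A = np.prod(tau_nc - tau)                   # A = prod_j (tau_nc - tau_j)
Cbar = P.polyval(tau_nc, P.polyint(poly)) / A
print(f"A = {A:.6f},  C-bar = {Cbar:.6f}")
```

Multiplying C̄ by the residual evaluated at t_nc then gives the per-element error bound used in (10.21b).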
To illustrate the problem formulation in (10.19) along with variable elements, we
consider the following crystallization example.
Example 10.3 (Optimal Cooling Profile for Crystallization). In this example, we consider
the dynamic optimization of a crystallizer described with a simple population balance model.
Figure 10.3. Crystallizer growth problem with optimal cooling temperature and
crystal size profiles.
Here we seek to find the optimum temperature profile to cool the crystallizer and grow the
largest crystals. Conventional crystallization kinetics are characterized in terms of two dom-
inant phenomena: nucleation (the creation of new particles) and crystal growth of existing
particles. Both competing phenomena consume desired solute material during the crystal-
lization process. To obtain larger (and fewer) crystals, nucleation needs to be avoided, and
the goal of the optimization is to find operating strategies that promote growth of existing
crystals.
The dynamic optimization problem for the crystallizer consists of a DAE model, a
lower bound on the jacket temperature as a function of the solute concentration, and an
objective to maximize the crystal length. This objective also corresponds to minimizing the
crystal surface area in order to obtain higher purity of the crystals. The crystallizer operates
over 25 hours and the control variable is the cooling profile in the crystallizer jacket. The
dynamic optimization problem can be stated as
where L_s is the mean crystal size, N_c is the number of nuclei per liter (l) of solvent, L is the total length of the crystals per liter of solvent, A_c is the total surface area of the crystals per liter of solvent, V_c is the total volume of the crystals per liter of solvent, C_c is the solute concentration, M_c is the total mass of the crystals, and T_c is the crystallizer temperature. Additional constants include V_s = 300 l, the volume of the solvent, W = 2025 kg, the total mass in the crystallizer, ΔT = max(0, T_equ − T_c), the degree of supercooling, T_equ = Σ_{i=1}^{4} a_i C̄^{i−1}, the equilibrium temperature, C̄ = 100C_c/(1.35 + C_c), the weight fraction, L_s0 = 5 × 10^{−4} m, the initial crystal size, L_0 = 5 × 10^{−5} m, the nucleate crystal size, W_s0 = 2 kg, the weight of seed crystals, ρ = 1.58 kg/l, the specific gravity of the crystals, and α = 0.2, β = 1.2, the shape factors for area and volume of the crystals, respectively. The control variable is the jacket temperature, T_j, which has a lower bound, T_a = Σ_{i=1}^{4} b_i C̄^{i−1}. The remaining parameter values are K_g = 0.00418, B_n = 385, C_p = 0.4, K_c = 35, K_e = 377, η_1 = 1.1, and η_2 = 5.72. Finally, polynomial coefficients a_i and b_i define T_equ and T_a.
Applying the formulations (10.19) and (10.22) to this model, with a piecewise constant
control profile, three-point Radau collocation, and 50 variable finite elements, leads to an
NLP with 1900 variables and 1750 equality constraints. The optimal solution was obtained in 12.5 CPU seconds (1.6 GHz Pentium 4 PC running Windows XP) using 105 iterations of a reduced-space interior point solver (IPOPT v2.4). The optimal profiles of the mean crystal size and the jacket temperature are given in Figure 10.3. Over the 25-hour operation, the mean crystal size increases by over eight times, from 0.5 mm to 4.4 mm. Also, note that in order to maximize the crystal size, the jacket cooling temperature must first increase to reduce the number of nucleating particles. Additional information on this optimization study can be found in [242].
To develop the second concept for the NLP formulation (10.19), we note that the
dichotomy property can be enforced by maintaining a stable pivot sequence for the associated
KKT matrix. To see this, consider the Jacobian of the collocation equations given by
        ⎡ I                                 ⎤
        ⎢ T^1  C^1                  U^1     ⎥
        ⎢ D̄^1  D^1  −I                      ⎥
A^T  =  ⎢      T^2  C^2             U^2     ⎥  = [ A_z | A_u ],        (10.24)
        ⎢      D̄^2  D^2  −I                 ⎥
        ⎢           T^3  C^3        U^3     ⎥
        ⎣                 ⋱    ⋱     ⋮      ⎦
where T^i is the Jacobian with respect to z_{i,0}, C^i is the Jacobian with respect to the state variables z_{ik}, y_{ik}, and U^i is the Jacobian with respect to the control variables u_{ik} for (10.19b)–(10.19c) in element i. Similarly, D̄^i and D^i are the Jacobians with respect to z_{i,0} and z_{ik} in (10.19f). The partition of A^T into A_z and A_u can be interpreted as the selection of dependent (basic) and decision (superbasic) variables in the reduced-space optimization strategy.
With initial conditions specified in the first row of AT , uik fixed, and Az square and
nonsingular, solving for the state variables leads to an ill-conditioned system in the presence
of increasing (i.e., unstable) dynamic modes. This essentially mimics the solution strategy of
the sequential optimization approach in Figure 9.1. On the other hand, if the columns of U i
span the range of the rows of −I that correspond to unstable modes, then by repartitioning
Az and Au , the control variable could be shifted from Au to Az , and the corresponding
columns of −I (in an element j > i) could be shifted from Az to Au . Moving these unstable
states (columns of −I ) to Au has the effect of providing boundary conditions that pin down
the unstable modes. In addition, moving control variables into Az leads to corresponding
free states starting from element i + 1. Moreover, this repartitioning strategy can be left
to the KKT matrix solver alone, as long as a reliable pivoting strategy is applied. Because
of the relation between dichotomous BVPs and well-conditioned discretized systems [109],
the pivoting strategy itself should then lead to the identification of a stable repartition for Az .
To demonstrate this approach we consider a challenging dynamic optimization prob-
lem that deals with an unstable polymerization reactor.
Example 10.4 (Grade Transition for High-Impact Polystyrene). We consider the dynamic
optimization of a polymerization reactor that operates at unstable conditions. Polymeriza-
tion reactors typically manufacture a variety of products or grades, and an effective grade
transition policy must feature minimum transition time in order to minimize waste prod-
uct and utility consumption. The minimum transition policy can be determined from the
discretized dynamic optimization problem given by (10.19) with the objective given by
min ∫_0^θ ( ‖z(t) − ẑ‖^2 + ‖u(t) − û‖^2 ) dt,        (10.25)
where ẑ and û are the states and inputs for the desired operating point of the new product
grade, and θ is the transition horizon length. We consider the grade transition in [141] that
deals with free-radical bulk polymerization of styrene/polybutadiene, using a monofunc-
tional initiator (I ) to form high-impact polystyrene (HIPS). The polymerization occurs
Table 10.2. Reaction mechanism for HIPS polymerization.

Propagation reactions:
  R_S^j + M_S →(k_p) R_S^{j+1}              BR_S^j + M_S →(k_p) BR_S^{j+1}
Termination reactions:
  Homopolymer:    R_S^j + R_S^m →(k_t) P^{j+m}
  Grafting:       R_S^j + BR →(k_t) BP^j        R_S^j + BR_S^m →(k_t) BP^{j+m}
  Crosslinking:   BR + BR →(k_t) B_EB           BR_S^j + BR →(k_t) BPB^j
                  BR_S^j + BR_S^m →(k_t) BPB^{j+m}
Transfer reactions:
  Monomer:        R_S^j + M_S →(k_fs) P^j + R_S^1       BR_S^j + M_S →(k_fs) BP^j + R_S^1
  Grafting sites: R_S^j + B_0 →(k_fb) P^j + BR          BR_S^j + B_0 →(k_fb) BP^j + BR
in a nonisothermal stirred tank reactor assuming perfect mixing, constant reactor vol-
ume and physical properties, and quasi steady state and the long chain assumptions for
the polymerization reactions. The reaction mechanism involves the initiation, propagation,
transfer, and termination reactions shown in Table 10.2. Polybutadiene is also added in order
to guarantee desired mechanical properties by promoting grafting reactions. In Table 10.2,
the superscript on each species refers to the length of the polymer chain, B0 is the polybu-
tadiene unit, BEB is the cross-linked polybutadiene, BP is the grafted polymer, BP B is the
cross-linked polymer, BR is an activated polybutadiene unit, BRS is grafted radical with a
styrene end group, I is the initiator, MS is styrene monomer, P is the homopolymer, and R is
the free radical. The dynamic model can be found in [393, 141] along with rate constants and reactor data. This model includes mass balances for the initiator, monomer, butadiene, and radical species. Also included are zeroth-moment models for the polymer products and an energy balance over the stirred tank reactor. Finally, the manipulated variables are the cooling water flow rate F_j and the initiator volumetric flow rate F_i.
For the grade transition problem, Figure 10.4 displays the multiple steady states for
this reactor in the space of one of the input parameters: cooling water flow rate. Under
nominal operating conditions (Fj = 1 l/s and Fi = 0.0015 l/s), the reactor exhibits three
steady state values (denoted by N1, N2, and N3). The lower and upper steady state branches
are stable but represent undesired operating conditions. On the intermediate unstable branch,
the monomer conversion rises to around 30%, which is given by the nominal, but unstable,
operating point N2. Note that a sequential approach cannot be applied to transitions on
the intermediate branches, as numerical errors in the integration may lead to drift of the
transition to the stable outer branches.
In [141], 14 grade transitions were considered from N2 to either stable or unstable
steady states, using either cooling water and/or initiator flow as control variables. NLP
[Plot: reactor temperature (K) versus cooling water flow rate (l/s) for initiator flows Q_i = 0.0015, 0.0025, and 0.0040 l/s, with labeled operating points N1–N3 and A1–A5.]
Figure 10.4. Multiple steady states for HIPS reactor. The dashed lines indicate
unstable branches for operation.
formulations of the form (10.19) and (10.25) were solved with the polymerization model
using the IPOPT code and Radau collocation with K = 3 and finite elements up to N = 40. To
demonstrate this approach, we consider the transition from N2 to A2 in Figure 10.4. For this
transition, cooling water flow rate was the manipulated variable, and the resulting nonlinear
program has 496 variables, 316 equations, and 20 finite elements. The optimization problem
required 223 iterations and 5.01 CPU seconds (1.6 GHz Pentium 4 PC running Linux) to solve.
At the desired state A2, the reactor operates close to a turning point. To achieve this
state, the optimal cooling water flow rate suddenly drops from 1 l/s to its lower bound and
remains there for most of the transition time. It then rises to its final desired value at the end.
From Figure 10.5 we see that the states reach their desired end values in less than 1.3 h.
From the expected behavior of nonisothermal reactors with multiple steady states, we note
that at the unstable point N2, the loss of cooling leads to a decrease in temperature. With
this decrease in reactor temperature, the propagation rate step is reduced, less initiator is
consumed, and monomer and initiator concentration increase. When the flow rate increases
at the end, the monomer concentration decreases as well, and conversion increases to a value
higher than that at the initial point.
Figure 10.5. Optimal state and control profiles for HIPS grade transition: N2 → A2.
Newton solver at each time step, and often with a sparse matrix routine embedded within the
Newton solver. Sparse factorization of the Newton step occurs at a cost that scales slightly
more than linearly with problem size (i.e., with exponent β ∈ [1, 2]). For the simultaneous
approach, this step is replaced by the optimization step (v). In Step (ii) both multiple shooting
and sequential approaches obtain reduced gradients through direct sensitivity calculations
of the DAE system. While this calculation is often implemented efficiently, the cost scales
linearly with the number of inputs times the size of the DAE system, since previous fac-
torizations can be reused. With the sequential approach, the number of inputs is nu N ; with
multiple shooting, sensitivity is calculated separately in each time step and the number of
inputs is nw + nu . For the simultaneous approach the gradient calculation (through auto-
matic differentiation) usually scales linearly with the problem size. Step (iii) deals with the
calculation of second derivatives, which is rarely used for the embedded optimization ap-
proaches. For both multiple shooting and sequential approaches the cost of reduced Hessians
scales roughly with the number of inputs times the sensitivity cost. Instead, quasi-Newton
methods are employed, but often at the cost of additional iterations of the NLP solver, as
seen in Example 9.4. On the other hand, the calculation cost of the sparse Hessian for the
simultaneous collocation approach usually scales linearly with the problem size.
In addition, multiple shooting executes a decomposition (step (iv)) which requires
projection of the Hessian to dimension nu N , through the factorization of dense matrices.
With the collocation approach, the Hessian remains sparse and its calculation (through
automatic differentiation) also scales with the problem size. Step (v) deals with the opti-
mization step determination; sequential and multiple shooting methods require the solution
of a quadratic program (QP) with nu N variables, and dense constraint and reduced Hessian
matrices. These require dense factorizations and matrix updates that scale with the expo-
nent α ∈ [2, 3]. The QP also chooses an active constraint set, a combinatorial step which
is often accelerated by a warm start. On the other hand, with a barrier approach applied to
simultaneous collocation, the active set is determined from the solution of nonlinear (KKT)
equations through a Newton method. The corresponding Newton step is obtained through
sparse factorization and a backsolve of the KKT system in step (v).
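The scalings in this comparison can be made concrete with a toy cost model; only the exponents α and β come from the discussion above, while the functions, constants, and sizes below are invented for illustration:

```python
# Illustrative cost model: only the exponents come from the text;
# the constants, sizes, and function names are invented for this sketch.
def dense_qp_cost(n_u, N, alpha=3.0):
    """Dense factorization work for a QP in n_u*N inputs, alpha in [2, 3]."""
    return float((n_u * N) ** alpha)

def sparse_kkt_cost(size, beta=1.2):
    """Sparse KKT factorization work, beta in [1, 2]."""
    return float(size ** beta)

# Doubling the control grid multiplies the dense QP cost by 2**alpha = 8,
# but a sparse KKT factorization only by 2**beta (about 2.3 for beta = 1.2).
assert dense_qp_cost(2, 80) / dense_qp_cost(2, 40) == 8.0
ratio = sparse_kkt_cost(2000) / sparse_kkt_cost(1000)
assert 2.2 < ratio < 2.4
```

The gap between the two growth rates is what favors the simultaneous collocation approach as n_u N grows.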
Table 10.3 shows that as the number of inputs nu N increases, one sees a particular
advantage to the simultaneous collocation approach incorporated with a barrier NLP solver.
Moreover, the specific structure of the KKT matrix can be exploited with the simultaneous
collocation approach as demonstrated in the following case study.
not exceed 70–80 mm. A schematic representation of a typical tubular reactor is presented
in Figure 10.6.
DAE models for LDPE tubular reactors comprise detailed polymerization kinetic
mechanisms and rigorous methods for the prediction of the reacting mixture as well as
thermodynamic and transport properties at extreme conditions. For this case study, a first-
principles model is applied that describes the gas-phase free-radical homopolymerization
of ethylene in the presence of several different initiators and chain-transfer agents at super-
critical conditions. Details of this model are presented in [418], and the reaction mecha-
nism for the polymerization is presented in Table 10.4. Here, Ii , i = 1, . . . , NI , R·, M, and
Si , i = 1, . . . , NS , denote the initiators, radicals, monomer, and chain-transfer agent (CTA)
molecules, respectively. The symbol ηi represents the efficiency of initiator i, Pr repre-
sents “live” polymer chains, and Mr are “dead” polymer chains with r monomer units.
The corresponding reaction rates for the monomers, initiators, CTAs, and “live” and “dead”
polymer chains can be obtained by combining the reaction rates of the elementary reactions
describing their production and consumption.
To simplify the description of the polymer molecular weight distribution and related
properties, we apply the method of moments. This method represents average polymer
molecular weights in terms of the leading moments of the polymer chain-length distributions.
Additional simplifying assumptions are described in [418, 419] to develop steady state
Figure 10.7. Temperature profile and model fit for LDPE reactor.
differential molar and energy balances that describe the evolution of the reacting mixture
along each reactor zone. The detailed design equations are reported in [418].
For high-pressure LDPE reactors, the determination of the kinetic rate constants in
Table 10.4 remains a key challenge. These rate constants are both pressure and temperature
dependent, using a general Arrhenius form with parameters determined from reactor data.
In addition to estimating kinetic parameters, adjustable parameters must also be determined
that account for uncertainty in reactor conditions. These are mainly due to fouling of the inner
reactor wall from continuous polymer build-up and the decomposition of reaction initiators
in each reaction zone. Because both phenomena are difficult to predict by a mechanistic
model, heat-transfer coefficients (HTCs) that account for fouling are estimated to match the
plant reactor temperature profile in each zone. Similarly, efficiency factors ηi are associated
with the decomposition of each initiator i. Because initiator efficiencies vary widely along
the reactor, they are estimated for each reaction zone in order to match the plant reactor
temperature profile. This approach was demonstrated in [418] and a typical model fit to the
data is presented in Figure 10.7.
The combined reactor model with multiple zones can be formulated as the following
multiperiod index-1 DAE system:
F_{s,l}( dz_{s,l}(t)/dt, z_{s,l}(t), y_{s,l}(t), p_{s,l}, θ ) = 0,
G_{s,l}( z_{s,l}(t), y_{s,l}(t), p_{s,l}, θ ) = 0,
z_{s,l}(0) = φ( z_{s,l−1}(t_{L_{s,l−1}}), u_{s,l} ),
s = 1, . . . , NS,  l = 1, . . . , NZ_s.        (10.26)
The subscript l refers to the reactor zone defined for an operating scenario s for which a data
set is provided. Also, N Zs is the total number of zones in scenario s; this allows parameter
estimation with data from different reactor configurations. In addition, note that the DAE
models are coupled across zones through material and energy balances φ(·). This coupling
expression contains input variables us,l that include flow rates and temperatures for the
monomer, initiator, and cooling water sidestreams along the reactor. Also, tLs,l denotes the
total length of zone l in scenario s. The decision variables p_{s,l} represent local parameters corresponding to the HTCs and initiator efficiencies for each zone l and scenario s, and the variables θ correspond to the kinetic rate constants. As developed in [418], the reactor
model contains around 130 ODEs and 500 algebraic equations for each scenario s, and the
total number of DAEs in (10.26) increases linearly with the number of scenarios. In addition
to the large number of equations, the reactor model is highly nonlinear and stiff.
The objective of this dynamic optimization problem is to estimate the (global) kinetic parameters, θ, as well as the (local) HTCs and initiator efficiencies, p_{s,l}, that match the plant reactor operating conditions and polymer properties. For this we consider a multiscenario errors-in-variables-measured (EVM) parameter estimation problem of the form
min_{θ, p_{s,l}, u_{s,l}}  Σ_{s=1}^{NS} Σ_{l=1}^{NZ_s} Σ_{i=1}^{NM_{s,l}} ( y_{s,l}(t_i) − y^M_{s,l,i} )^T V_y^{−1} ( y_{s,l}(t_i) − y^M_{s,l,i} )        (10.27)
Figure 10.8. Total CPU time and iterations for the solution of multiscenario non-
linear programs with serial and parallel implementations.
where v_s are the primal and dual variables in each scenario s, r_s corresponds to the corresponding KKT conditions in each scenario, and r_θ corresponds to the KKT condition with respect to θ. The system (10.28) contains diagonal blocks K_s that represent the KKT matrix for each scenario, and the matrices A_s represent the equations that link the global variables to each scenario. Moreover, by replacing θ by local variables in each scenario and introducing additional linear equations that link these variables to θ, the matrices A_s have a simple, sparse structure that can be automated in a general-purpose way [419]. With the
structure in (10.28), one can apply a Schur complement decomposition strategy that avoids
the full factorization of the KKT matrix on a single processor. Instead, when distributed
over multiple processors, one avoids memory bottlenecks and can handle a large number
of scenarios (i.e., data sets) in the parameter estimation problem. Moreover, to obtain exact
first and second derivative information, each scenario was implemented as a separate AMPL
model that indicates internally the set of variables corresponding to the global parameters θ. This allows the construction of a linking variable vector that is passed to the IPOPT solver.
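A minimal dense sketch of this Schur complement step is below; random symmetric blocks stand in for the scenario KKT matrices K_s and linking matrices A_s of (10.28), and all sizes and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
NS, n, m = 3, 6, 2            # scenarios, per-scenario KKT size, global parameters

# Hypothetical random symmetric nonsingular blocks standing in for the
# scenario KKT matrices K_s and linking matrices A_s; data are invented.
def sym(M):
    return M + M.T

Ks = [sym(rng.standard_normal((n, n))) + 2 * n * np.eye(n) for _ in range(NS)]
As = [rng.standard_normal((n, m)) for _ in range(NS)]
K0 = float(n) * np.eye(m)     # block for the global parameters
rs = [rng.standard_normal(n) for _ in range(NS)]
r0 = rng.standard_normal(m)

# Schur complement in the global block: each K_s is factorized independently.
S = K0 - sum(A.T @ np.linalg.solve(K, A) for K, A in zip(Ks, As))
rhs = r0 - sum(A.T @ np.linalg.solve(K, r) for K, A, r in zip(Ks, As, rs))
d_glob = np.linalg.solve(S, rhs)                       # global step
d_vs = [np.linalg.solve(K, r - A @ d_glob) for K, A, r in zip(Ks, As, rs)]

# Compare against one factorization of the full arrowhead system.
full = np.zeros((NS * n + m, NS * n + m))
for s in range(NS):
    full[s*n:(s+1)*n, s*n:(s+1)*n] = Ks[s]
    full[s*n:(s+1)*n, NS*n:] = As[s]
    full[NS*n:, s*n:(s+1)*n] = As[s].T
full[NS*n:, NS*n:] = K0
ref = np.linalg.solve(full, np.concatenate(rs + [r0]))
assert np.allclose(np.concatenate(d_vs + [d_glob]), ref)
```

In the parallel implementation each K_s solve runs on its own processor; only the small m × m Schur system couples the scenarios, which is what avoids the single-processor memory bottleneck.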
Figure 10.8 presents computational results for the solution of multiscenario nonlinear
programs with up to 32 data sets. The results were obtained in a Beowulf-type cluster using
standard Intel Pentium IV Xeon 2.4 GHz, 2 GB RAM processors running Linux. These
results are also compared with the serial solution of the multiscenario problems on a single
processor. As seen in the figure, the serial solution of the multiscenario nonlinear programs
has memory available for only 9 data sets, while the parallel implementation overcomes this
Figure 10.9. CPU time per iteration and per factorization of the KKT matrix during
the solution of multiscenario nonlinear programs with serial and parallel implementations.
memory bottleneck and solves problems with over 32 data sets. In all of the analyzed cases,
sufficient second order KKT conditions are satisfied and, consequently, the full parameter
set can be estimated uniquely. The largest problem solved contains around 4,100 differential
and 16,000 algebraic equations, over 400,000 variables, and 2,100 degrees of freedom. The
solution time in the serial implementation increases significantly with the addition of more
data sets; the 9 data set problem requires over 30 wall clock CPU minutes. In contrast,
the parallel solution takes consistently less than 10 wall clock minutes regardless of the
number of data sets. On the other hand, it is important to emphasize that this behavior is
problem (and data) dependent. In fact, the solution of the 32 data set problem requires fewer
iterations than with 20 data sets. This behavior is mainly attributed to the nonlinearity of
the constraints and the influence of the initialization with different NS .
To provide more insight into the performance of the algorithm, Figure 10.9 presents
computational results for the time required per iteration and per factorization of the KKT
matrix. This leads to a more consistent measure of the scalability of the proposed strategy.
For the parallel approach, a near-perfect scaling is reflected in the time per factorization.
Also, the time per iteration can be consistently kept below 5 wall clock CPU seconds, while
the factorization in the serial approach can take as much as 35 wall clock seconds before
running out of memory. Additional information on this case study can be found in [419].
these conditions, indirect methods were proposed that led to an “optimize then discretize”
approach, where the DAEs representing the Euler–Lagrange equations were discretized to
finite dimension and solved to a desired level of accuracy.
In contrast, direct approaches based on a “discretize then optimize” strategy were
developed in Chapter 9 and also in this chapter. For the direct approaches in Chapter 9,
we have focused on the solution of nonlinear programs with a finite number of decision
variables, either through the variables p or through discretization of the control profiles.
On the other hand, simultaneous collocation methods have the ability to discretize both the
control and state profiles at the same level of accuracy. Consequently, it is important to
consider the characteristics of the simultaneous approach and the conditions under which it
can converge to the solution of (8.6). In this section, we analyze these properties for Gauss–
Legendre and Radau collocation with simultaneous NLP formulations. We then consider
some open questions when the nonlinear program is no longer well-posed, and we offer
some guidelines and directions for future work.
For this analysis we consider a simpler optimal control problem of the form
Note that in (10.29) we have removed the algebraic variables y(t) as these are implicit
functions of z(t) and u(t) from the (index-1) algebraic equations. To focus on convergence
of the control profiles, we also removed decision variables p of finite dimension. Finally,
we have removed the inequality constraints, as they present additional difficulties for the
optimality conditions in Chapter 8. These will be discussed later in this section.
As derived in Chapter 8, the Euler–Lagrange equations are given by
dz/dt = f(z, u),   z(t_0) = z_0,        (10.30a)
h_E(z(t_f)) = 0,        (10.30b)
dλ/dt = − (∂f(z, u)/∂z) λ(t),        (10.30c)
λ(t_f) = ∂Φ(z(t_f))/∂z + (∂h_E(z(t_f))/∂z) η_E,        (10.30d)
(∂f(z, u)/∂u) λ = 0.        (10.30e)
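A compact way to read (10.30c) and (10.30e) is through the Hamiltonian of this problem; the lines below are a standard restatement (with transposes written explicitly as a notational choice), showing that the adjoint and stationarity conditions are ∂H/∂z and ∂H/∂u conditions:

```latex
H(z, u, \lambda) = \lambda^{T} f(z, u), \qquad
\frac{d\lambda}{dt} = -\frac{\partial H}{\partial z}
                    = -\left(\frac{\partial f}{\partial z}\right)^{\!T}\lambda, \qquad
0 = \frac{\partial H}{\partial u} = \left(\frac{\partial f}{\partial u}\right)^{\!T}\lambda .
```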
These conditions form a BVP, and to obtain a well-posed solution we require an assumption on the problem (Assumption I), which will be examined later in this section.
One can either solve the BVP (10.30a)–(10.30e) with some discretization method or
instead discretize (10.29) over time and solve the related nonlinear program, as is done
with the direct transcription approach discussed in this chapter. This section shows the
relationship between both methods and also addresses related issues such as convergence
rates (as a function of time step h) and estimation of adjoint profiles from KKT multipliers.
The collocation discretization of problem (10.29) is given by
z_f = Σ_{j=0}^{K} ℓ_j(1) z_{Nj},   z_{1,0} = z(t_0),        (10.31d)

h_E(z_f) = 0.        (10.31e)
Assumption II: For all suitable small values of hi , the nonlinear program (10.31) has a
solution where the sufficient second order KKT conditions and LICQ are satisfied. This
guarantees a unique solution for the primal and dual variables.
We redefine the NLP multipliers λ̄_{ij} for (10.31b) through λ̄_{ij} = ω_j λ_{ij}, where ω_j > 0 is the quadrature weight. With this transformation we can write the KKT conditions as
The KKT conditions are similar to the discretization applied to (10.30a)–(10.30e) but with
some important differences. In particular, (10.32c) is a straightforward collocation dis-
cretization of (10.30e). On the other hand, multipliers, ν̄i , are added that correspond to
continuity conditions on the state variables, and these also appear in (10.32a) and (10.32b).
To reconcile this difference with (10.30d) and (10.30c), we first consider collocation at
Gauss–Legendre roots and then take advantage of properties of Legendre polynomials.
These will redefine the corresponding terms in (10.32b) and (10.32d).
Σ_{k=1}^{K} ω_k λ_{ik} ℓ̇_j(τ_k),   j = 0, . . . , K,  τ_k ∈ (0, 1).        (10.33)
From Theorem 10.1, this quadrature formula is exact for polynomial integrands up to degree 2K − 1. Since ℓ̇_j(τ) is a polynomial of degree K − 1, we may represent λ_{ij} as coefficients of an interpolating polynomial evaluated at Gauss–Legendre roots. Because we can choose a polynomial λ_i(τ) up to degree K (order K + 1), we are free to choose an additional interpolation point for this polynomial.
Taking advantage of the order property for Gauss quadrature, we integrate (10.33) by
parts to obtain the following relations:
Σ_{k=1}^{K} ω_k λ_{ik} ℓ̇_j(τ_k) = ∫_0^1 λ_i(τ) ℓ̇_j(τ) dτ
   = λ_i(1) ℓ_j(1) − λ_i(0) ℓ_j(0) − ∫_0^1 λ̇_i(τ) ℓ_j(τ) dτ
   = λ_i(1) ℓ_j(1) − λ_i(0) ℓ_j(0) − Σ_{k=1}^{K} ω_k λ̇_i(τ_k) ℓ_j(τ_k).        (10.34)

Σ_{k=1}^{K} ω_k λ_{ik} ℓ̇_j(τ_k) = λ_i(1) ℓ_j(1) − ω_j λ̇_i(τ_j),   j = 1, . . . , K,        (10.35)

Σ_{k=1}^{K} ω_k λ_{ik} ℓ̇_0(τ_k) = λ_i(1) ℓ_0(1) − λ_i(0).        (10.36)

∇_{z_{ij}} L = ω_j [ h_i ∇_z f(z_{ij}, u_{ij}) λ_{ij} + λ̇_i(τ_j) ] − (ν̄_i + λ_i(1)) ℓ_j(1) = 0,        (10.37)
∇_{z_{i,0}} L = λ_i(0) + ν̄_{i−1} − (ν̄_i + λ_i(1)) ℓ_0(1) = 0.        (10.38)
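The by-parts identity (10.34) can be checked numerically. The sketch below (a self-contained check, not code from the text) uses K = 3 Gauss–Legendre points on (0, 1), a Lagrange basis over τ_0 = 0 plus the collocation points, and a random polynomial λ_i of degree K, so that both quadratures in (10.34) are exact:

```python
import numpy as np
from numpy.polynomial import Polynomial

K = 3
x, w = np.polynomial.legendre.leggauss(K)      # nodes/weights on [-1, 1]
tau = np.concatenate(([0.0], (x + 1) / 2))     # tau_0 = 0 plus Gauss points in (0, 1)
omega = w / 2                                  # quadrature weights on [0, 1]

def lagrange(j):
    """Lagrange basis polynomial ell_j over the K+1 points tau_0..tau_K."""
    p = Polynomial([1.0])
    for m, tm in enumerate(tau):
        if m != j:
            p = p * Polynomial([-tm, 1.0]) / (tau[j] - tm)
    return p

rng = np.random.default_rng(0)
lam = Polynomial(rng.standard_normal(K + 1))   # arbitrary lambda_i(tau) of degree K
dlam = lam.deriv()

# Both integrands have degree 2K - 1, so Gauss quadrature is exact and the
# identity (10.34) holds to rounding error for every basis function.
for j in range(K + 1):
    lj, dlj = lagrange(j), lagrange(j).deriv()
    lhs = sum(omega[k] * lam(tau[k + 1]) * dlj(tau[k + 1]) for k in range(K))
    rhs = (lam(1.0) * lj(1.0) - lam(0.0) * lj(0.0)
           - sum(omega[k] * dlam(tau[k + 1]) * lj(tau[k + 1]) for k in range(K)))
    assert abs(lhs - rhs) < 1e-10
```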
Note that the NLP multipliers provide information only at the collocation points for λij and
ν̄i . Moreover, from Assumption II, the primal and dual variables are unique. Finally, because
we are free to choose an additional coefficient for the (K + 1)th order polynomial λi (τ ),
we may impose (ν̄i + λi (1)) = 0. From (10.38), this leads to continuity of the λi profiles,
i.e., λi (0) = −ν̄i−1 = λi−1 (1). With these simplifications, the KKT conditions (10.32) now
become
Note that these transformed KKT conditions are equivalent to discrete approximations
of the Euler–Lagrange equations (10.30c)–(10.30e). Therefore, (10.31b)–(10.31e) and
(10.39a)–(10.39c) can be solved directly as a discretized BVP, and convergence rates for
(10.30a)–(10.30e) are directly applicable to the simultaneous NLP formulation (without in-
equality constraints). In particular, Reddien [331] shows that these calculated state, adjoint, and control profiles converge at the rate of O(h^{K+1}) with superconvergence (i.e., O(h^{2K})) at the element endpoints.
Examining these equations shows that the residual error at any collocation point is O(h^K) for Radau collocation and O(h^{K+1}) for Gauss–Legendre collocation.
• Writing (10.19b)–(10.19h), (10.32a)–(10.32d) in vector form as φ(·) = 0 leads to
where w is a vector that represents all the primal and dual variables in φ(·) = 0 and
w ∗ is a vector that represents the solution of the optimal control problem discretized
Δw = w^* − w = J^{−1} r.        (10.43)

The mean value theorem allows a symmetric J matrix to be constructed which has the same symbolic form and structure as the KKT matrix of (10.19a)–(10.19c), (10.19f), (10.19g), (10.19h),

J = ⎡ H     A ⎤
    ⎣ A^T   0 ⎦ ,        (10.44)

where H is the Hessian of the Lagrange function and A^T is the Jacobian of the equality constraints of the NLP.
• For the analysis we assume equally spaced time steps (h_1 = · · · = h_N = h) and that ‖w − w^*‖_∞ remains bounded for sufficiently small h. (This assumption is based on the fact that the KKT matrices at w^* and w also satisfy these two assumptions with w^* and w “close” to each other [217].) From Theorem 5.4, J is guaranteed to be invertible if the constraint gradients A are linearly independent and the reduced Hessian H_r (i.e., H projected into the null space of A^T) is positive definite.
• The similarity of J to the KKT matrix allows us to perform a reduced-space decom-
position similar to the one developed in Section 5.3.2. By partitioning the constraint
Jacobian between state and control variables, A^T = (A_B | A_S), we define

Y = ⎡ I_1 ⎤ ,    Z = ⎡ −A_B^{−1} A_S ⎤ ,        (10.45)
    ⎣ 0   ⎦         ⎣ I_2           ⎦

where I_1 and I_2 are identity matrices of appropriate dimensions. These matrices satisfy

A^T Y = A_B,   A^T Z = 0,   (Y | Z) is nonsingular,        (10.46)

and we partition Δw into subvectors for the state, control, and adjoint variables with

[ Δw_z^T  Δw_u^T ]^T = (Y | Z) ⎡ d_Y ⎤ .        (10.47)
                               ⎣ d_Z ⎦
Also, by partitioning the residual vector into r^T = ( r_z^T, 0^T, r_λ^T ), the modified linear system then becomes

⎡ Y^T H Y   Y^T H Z   A_B^T ⎤ ⎡ d_Y  ⎤     ⎡ r_z                 ⎤
⎢ Z^T H Y   Z^T H Z   0     ⎥ ⎢ d_Z  ⎥  =  ⎢ −A_S^T A_B^{−T} r_z ⎥ .        (10.48)
⎣ A_B       0         0     ⎦ ⎣ Δw_λ ⎦     ⎣ r_λ                 ⎦
• The orders of the vectors and matrices in (10.48) increase with 1/h, and this must be considered when analyzing the normed quantities with respect to the time step, h. Also, the analysis in [217] follows an assumption in Hager [178] that

‖H_r^{−1}‖_∞ = ‖(Z^T H Z)^{−1}‖_∞ = O(1/h).        (10.49)

With this property, (10.48) can be solved to provide order results for Δw. For Radau collocation, ‖d_Y‖_∞ = O(h^{K+1}), ‖d_Z‖_∞ = O(h^K), and ‖Δw_λ‖_∞ = O(h^K) for sufficiently small h.
• From this analysis, the states, the multipliers, and the controls with Radau collocation converge to the true solution at the rate of O(h^K), and the Lagrange multipliers scaled by ω_j provide estimates of the true adjoints. Following the same analysis, convergence to the true solution with Gauss–Legendre collocation occurs at the rate of O(h^{K+1}).
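The basis construction in (10.45) and the properties (10.46) are easy to verify directly; the snippet below does so for a random partition with arbitrary illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n_z, n_u = 5, 2                         # arbitrary illustrative sizes

A_B = rng.standard_normal((n_z, n_z)) + n_z * np.eye(n_z)  # nonsingular basis block
A_S = rng.standard_normal((n_z, n_u))
A_T = np.hstack([A_B, A_S])             # A^T = (A_B | A_S)

Y = np.vstack([np.eye(n_z), np.zeros((n_u, n_z))])
Z = np.vstack([-np.linalg.solve(A_B, A_S), np.eye(n_u)])

assert np.allclose(A_T @ Y, A_B)        # A^T Y = A_B
assert np.allclose(A_T @ Z, 0.0)        # A^T Z = 0
assert np.linalg.matrix_rank(np.hstack([Y, Z])) == n_z + n_u  # (Y | Z) nonsingular
```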
The above results were originally developed for unconstrained optimal control prob-
lems and were extended in [219] to account for the final-time equality constraints (10.29c).
This extension requires that the dynamic system be controllable. These results provide the
framework for obtaining results for the optimal control problem (10.29) with the direct
transcription approach, even for large systems.
On the other hand, there are two classes of optimal control problems where Assump-
tions I and II are not satisfied. First, singular control problems do not satisfy Assumption I
and, in particular, the assumption that ‖H_r^{−1}‖_∞ = O(1/h). In the next subsection, we will
examine the behavior of direct NLP methods for this problem class and consider some open
questions for their solution. Second, state inequality constrained problems do not satisfy
Assumption II as hi becomes small, because constraint gradients do not satisfy the LICQ
as the time step hi → 0. To conclude this section, we will examine the behavior of direct
NLP methods and consider open questions for this class as well.
Figure 10.10. Control profile obtained for N = 100, K = 2 for problem (10.50).
For singular control problems, the analysis above breaks down because the reduced Hessian becomes very ill-conditioned as the time step goes to zero, h → 0. Tests on a number of singular examples (see [222]) reveal that ‖H_r^{-1}‖_∞ = O(h^{-j}), where j is greater than 2 and has been observed as high as 5 (with the value of j less for Radau collocation than for Gauss collocation). To illustrate this behavior, we consider the following
test example [8].
Example 10.5 (Singular Control with Direct Transcription). Consider the singular optimal
control problem given by
$$\begin{aligned}
\min_u \;\; & z_3\!\left(\tfrac{\pi}{2}\right) && (10.50a)\\
\text{s.t.}\;\; & \frac{dz_1}{dt} = z_2, \quad z_1(0) = 0, && (10.50b)\\
& \frac{dz_2}{dt} = u, \quad z_2(0) = 1, && (10.50c)\\
& \frac{dz_3}{dt} = \tfrac{1}{2} z_2^2 - \tfrac{1}{2} z_1^2, \quad z_3(0) = 0, && (10.50d)\\
& -1 \le u \le 1. && (10.50e)
\end{aligned}$$
The analytic solution of this problem shows a control profile that is singular over the entire
time domain, t ∈ [0, π/2].
Figure 10.10 presents the results of solving problem (10.50) using the simultane-
ous approach with K = 2 and N = 100. We see that for large N , the control profile is
highly oscillatory with spikes at initial and final time. Moreover, note that the spectrum of
eigenvalues for the reduced Hessian Hr quickly drops to zero, and for N = 100 its condi-
tion number is 1.5 × 109 . Similar behaviors have been observed with other discretizations
as well.
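A practical diagnostic for this behavior is to form the reduced Hessian for a given discretization and inspect its spectrum and condition number. A numpy sketch with random stand-in matrices (not the actual collocation matrices of Example 10.5):

```python
import numpy as np

rng = np.random.default_rng(1)

def reduced_hessian_spectrum(H, Z):
    """Eigenvalues (sorted descending) and condition number of Z^T H Z."""
    Hr = Z.T @ H @ Z
    eig = np.sort(np.linalg.eigvalsh(Hr))[::-1]
    return eig, eig[0] / eig[-1]

n, nz = 8, 3
M = rng.normal(size=(n, n))
H = M @ M.T + 1e-6 * np.eye(n)                  # symmetric positive definite stand-in
Z = np.linalg.qr(rng.normal(size=(n, nz)))[0]   # orthonormal null-space basis stand-in

eig, cond = reduced_hessian_spectrum(H, Z)
assert np.all(eig > 0) and cond >= 1.0
```

For a singular problem, repeating this computation as N grows would show the smallest eigenvalues collapsing toward zero and the condition number blowing up, which is the behavior reported above.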
To address singular problems, Jacobson, Gershwin, and Lele [208] proposed to solve the singular control problem by adding the regularization ε ∫_0^{t_f} u(t)^T S u(t) dt to the objective function (with S positive definite and ε > 0). The resulting problem is nonsingular and is solved
for a monotonically decreasing sequence {ε_k}. Similar ideas are given in [414, 357]; the latter source also proves consistency of this approach. However, as ε is decreased, the problems tend to become increasingly ill-conditioned, and the quality of the optimal control deteriorates. In
addition, regularizations have been performed through coarse discretizations of the control
profile and application of an NLP-based strategy [355]. Finally, several methods are based on
analytical calculation of the singular optimal controls (see [368] for a summary). However,
many of these methods rely on tools from Lie algebra and involve cumbersome calculations,
especially for large problems.
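The ε-continuation idea can be sketched on a toy problem with flat (singular) directions. Here a rank-deficient least-squares stand-in plays the role of the singular objective, and scipy.optimize is used with warm starts over a decreasing ε sequence (all data and tolerances are illustrative, not the formulation of [208]):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 10
G = rng.normal(size=(3, n))          # rank-deficient map: flat directions in u
b = rng.normal(size=3)
S = np.eye(n)                        # S positive definite

def solve_regularized(eps, u0):
    """Minimize the singular objective plus the regularization eps * u^T S u."""
    obj = lambda u: np.sum((G @ u - b) ** 2) + eps * (u @ S @ u)
    return minimize(obj, u0, method="BFGS").x

u = np.zeros(n)
for eps in [1e-1, 1e-2, 1e-3, 1e-4]:     # monotonically decreasing sequence
    u = solve_regularized(eps, u)         # warm start from previous solution

assert np.linalg.norm(G @ u - b) < 1e-2   # residual nearly minimized
```

The warm start is the practical point: each regularized problem is well conditioned and its solution initializes the next, slightly more ill-conditioned one.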
All of these approaches aim at a regularization not only to improve the spectrum
of eigenvalues for the reduced Hessian and to guarantee a unique solution, but also to
recover the true solution of the singular problem as well. A recent approach developed in
Kameswaran [222] relies on properties of the Hamiltonian ((8.25) and (8.52)) and applies an
indirect approach that explicitly recovers the states and adjoint profiles. With these quantities,
an accurate approximation of the Hamiltonian and ∂H /∂u, and their stationarity with respect
to time, can be enforced within an extended NLP formulation. While this approach is more
complex than a nonlinear program derived for direct transcription, it has been very effective
in solving challenging singular control problems of orders one and two.
Consequently, accurate determination of control profiles for these singular problems
still remains an open question and must be handled carefully within the context of direct
NLP-based methods. For the simultaneous collocation approach, a simple heuristic strategy
is to apply a control discretization that is coarser than the state profile but still can be
refined by decreasing the time step. This has the effect of retarding the rapid decay in
the spectrum of eigenvalues in the reduced Hessian so that reasonable solutions can still
be obtained. For instance, problem (10.19) can be modified so that u(t) is represented as piecewise constant in each element, i.e., u_i = u_{ij}, j = 1, . . . , K. Applying this modification to
Example 10.5 leads to the control profiles shown in Figure 10.11. From the shapes of the
profile and the value of the objective function as N is increased, an accurate approximation
of the solution profiles may be obtained before the onset of ill-conditioning.
Finally, the heuristic approach with coarse control profiles in (10.19) can be extended
to variable finite elements as well. By using relatively few elements and a high-order collo-
cation formula, singular arcs can be located, and breakpoints can be determined for control
profiles that have bang-bang and singular characteristics. For instance, the catalyst mixing
profiles in Figure 8.9 match the analytical solution in Example 8.9 and were determined
using this approach with up to 100 elements and 3 collocation points.
Figure 10.11. Singular control profiles obtained for piecewise constant controls
in (10.50).
collocation points. For time steps that are sufficiently small and profiles that are sufficiently
smooth within the element, this approach can ensure the feasibility of path constraints in
practice. Nevertheless, open questions remain on the influence of these high-index inequal-
ities on the solution of the NLP problem (10.19).
While much work still needs to be done to analyze the behavior of path constraints
within collocation formulations, we observe that these constraints lead to violations of
constraint qualifications in the nonlinear program along with undesirable behavior of the
KKT multipliers. To demonstrate this observation, we analyze the problem presented in the
following example from an NLP perspective.
with q fixed at 10^{-3}. This problem represents a heat conduction problem where the boundary
controls are used to minimize the field temperature subject to a minimum temperature profile
constraint.
Problem (10.51) is discretized spatially using a central difference scheme and equi-
distant spacing with n + 1 points. A trapezoidal scheme is used to discretize the double
integral in the objective function, and this converts it into a single integral. The spatial mesh
size is then defined as δ = π/n. With suitable time scaling, we obtain the following optimal
control problem:
$$\begin{aligned}
\min \;\; & \frac{1}{2}\sum_{k=1}^{n-1} 2\delta^3 \int_0^{5\delta^{-2}} z_k^2(\tau)\, d\tau
  + \frac{\delta^3 + 2q\delta^2}{2}\int_0^{5\delta^{-2}} \left(u_0^2(\tau) + u_\pi^2(\tau)\right) d\tau && (10.52a)\\
\text{s.t.}\;\; & \frac{dz_1}{d\tau} = z_2 - 2z_1 + u_0; \quad z_1(0) = 0; && (10.52b)\\
& \frac{dz_2}{d\tau} = z_3 - 2z_2 + z_1; \quad z_2(0) = 0; && (10.52c)\\
& \quad\vdots\\
& \frac{dz_{n-1}}{d\tau} = u_\pi - 2z_{n-1} + z_{n-2}; \quad z_{n-1}(0) = 0; && (10.52d)\\
& u_0(\tau) \ge \sin(0)\,\sin\!\left(\frac{\pi\delta^2\tau}{5}\right) - 0.7 = -0.7; && (10.52e)\\
& z_k(\tau) \ge \sin(k\delta)\,\sin\!\left(\frac{\pi\delta^2\tau}{5}\right) - 0.7; \quad k = 1, \ldots, n-1; && (10.52f)\\
& u_\pi(\tau) \ge \sin(\pi)\,\sin\!\left(\frac{\pi\delta^2\tau}{5}\right) - 0.7 = -0.7. && (10.52g)
\end{aligned}$$
Note that the control profiles u0 and uπ are applied only at the boundaries and directly
influence z1 and zn−1 . Moreover, computational experience [45, 220] confirms that only
the inequality constraints in the spatial center (at n/2) become active at some point in time.
Using this observation, and the symmetry of the resulting DAE optimization problem, the
active path constraint requires n/2 time differentiations to expose the boundary control.
With finer spatial discretization, the index of this constraint can be made arbitrarily high.
Here, cases are considered with n = 10 and n = 20.
Nevertheless, the simultaneous approach can provide accurate results irrespective of
the temporal discretization scheme. For any temporal discretization (including explicit Euler
with N equally spaced steps), problem (10.52) is transformed into the following quadratic
program:
$$\begin{aligned}
\min \;\; & \frac{1}{2} u^T H u + \frac{1}{2} z^T Q z && (10.53)\\
\text{s.t.}\;\; & Az + Bu = 0,\\
& s = Cz + d,\\
& s \ge 0,
\end{aligned}$$
where z, u, s are now written as vectors of discretized values of z(τ), the boundary controls, and the slacks of the inequality constraints, respectively. The matrices H and Q are positive definite for all
Figure 10.12. Singular values for Jacobian of active constraints in the QP (10.54)
with n = 20. Note that the fraction of constraints with singular values below a given tolerance
increases with N .
temporal and spatial mesh sizes, A is invertible, and C is full row rank [217]. With these
assumptions the above QP can be reduced to the following smaller QP:
$$\begin{aligned}
\min_u \;\; & \frac{1}{2} u^T \left( H + B^T A^{-T} Q A^{-1} B \right) u && (10.54)\\
\text{s.t.}\;\; & C A^{-1} B u - d \le 0.
\end{aligned}$$
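The reduction from (10.53) to (10.54) is a few lines of linear algebra; a numpy sketch with random stand-in matrices (dimensions and data are illustrative) that also prints the singular values of C A^{-1} B, the quantity examined in Figure 10.12:

```python
import numpy as np

rng = np.random.default_rng(3)
nz, nu, ns = 6, 2, 4                     # state, control, slack dimensions
A = rng.normal(size=(nz, nz)) + 5 * np.eye(nz)   # invertible stand-in
B = rng.normal(size=(nz, nu))
C = rng.normal(size=(ns, nz))
Q = np.eye(nz)
H = np.eye(nu)

Ainv_B = np.linalg.solve(A, B)           # A^{-1} B, via a solve (never invert A)
H_red = H + Ainv_B.T @ Q @ Ainv_B        # reduced Hessian of (10.54)
J = C @ Ainv_B                           # constraint Jacobian C A^{-1} B

assert np.all(np.linalg.eigvalsh(H_red) > 0)   # strict convexity of the QP
print(np.linalg.svd(J, compute_uv=False))      # small values flag near-dependence
```

Here J has more rows than columns, so its rows are necessarily linearly dependent; this is exactly the LICQ failure mode discussed below, made visible through the singular values.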
As the above QP is strictly convex, it has a unique optimal solution u∗ . On the other
hand, with increasing temporal discretization, the fraction of dependent rows (for a given
tolerance level) in matrix CA−1 B also increases, as seen from the singular values plotted
in Figure 10.12. Hence, the LICQ fails to hold for a given tolerance level and calculation
of the QP multipliers is ill-conditioned. In fact, the profile of the path constraint multipliers
exhibits inconsistent behavior and appears to go unbounded as N is increased. Nevertheless,
as observed from Figure 10.13, the optimal control profile is well defined and appears to
converge for increasing values of N .
From Figures 10.12 and 10.13 it is clear that the failure of LICQ triggers the ill-
conditioned behavior of the multipliers. This has a strong impact on the estimation of the
adjoint profiles even though the control profile is well defined. The reason for this behavior
can be deduced from Theorem 4.9. For the linearly constrained problem (10.54) neither
unique multipliers nor constraint qualifications are necessary to solve the QP, as the optimum
is defined by the gradient of the objective function and the cone of the active constraints.
Figure 10.13. Active constraint multipliers and control variable profiles for the
QP (10.54) with n = 20. Note inconsistency of the multipliers with increasing N while the
control profile remains essentially unchanged.
On the other hand, the indirect approach (using a multipoint BVP with the path con-
straints reformulated to index 1) is more difficult to apply for this problem. This method is
applied in [220, 45] but fails to converge. Moreover, it is noted in [220] that the constrained
arc is unstable. This leads to an inability to obtain the states, the adjoints, and the controls
with the indirect approach.
From this illustrative example, it is also clear that the simultaneous approach can
produce meaningful solutions for inequality path constrained optimal control problems. On
the other hand, open questions still remain including (a) convergence and convergence rates
of the state and control profiles as h → 0, (b) location of entrance and exit points for the
active constraints (e.g., through variable finite elements), and (c) the consequence of failure
of LICQ on nonlinearly constrained problems. Additional discussion on this problem can
be found in [40].
with path constraints. Moreover, there is a growing literature for this dynamic optimization
approach including two excellent monographs [40, 318] that describe related direct tran-
scription approaches with Runge–Kutta discretizations. In addition, related quasi-sequential
approaches have been developed that solve the collocation equations in an inner loop and
therefore apply a nested optimization strategy [201, 257].
The development of the simultaneous collocation approach is taken from a number of
papers [106, 396, 264, 89, 377, 401, 378, 217]. More detail can also be found in the following
Ph.D. theses [394, 92, 222]. In addition, the properties of the collocation method and related
boundary value approaches can be found in [341, 14, 13, 16]. Convergence results for
simultaneous methods have been developed for a class of discretization schemes: Gauss
collocation [331, 106], explicit Runge–Kutta methods satisfying a symmetry condition
[178], Radau collocation [217], and general IRK methods [179, 65].
Moreover, the simultaneous approach has been applied widely in aeronautical and as-
tronautical applications. A cursory literature search reveals several hundred publications that
apply simultaneous approaches in this area. Specific applications include collision avoid-
ance for multiple aircraft [42, 325] and underwater vehicles [365], trajectories for satellites
and earth orbiters [351, 103], and the design of multiple paths and orbits for multibody dy-
namics [68], including interplanetary travel [43]. An overview of these applications is given
in [40]. Moreover, the SOCS (sparse optimal control software) package [40], a commercial
software package developed and marketed by Boeing Corporation, has been widely used
for these and other engineering applications. Also, GESOP, a graphical environment that
incorporates SOCS as well as multiple shooting methods, can be found in [158].
In process engineering, applications of the simultaneous approach include the de-
sign and optimal operation of batch processes. These include optimization of operating
policies for fermentors [107] and bioreactors [334], flux balance models for metabolic sys-
tems [232, 271, 321], batch distillation columns [264, 298], membrane separators [125],
polymerization reactors [141, 209], crystallization [242], freeze-drying processes [67], and
integrated multiunit batch processes [46]. Other off-line applications include parameter es-
timation of reactive systems [384, 126], design of periodic separation processes including
pressure swing adsorption [291] and simulated moving beds [230, 225], optimal grade tran-
sitions in polymer processes [91, 141], reactor network synthesis [21, 302], and economic
performance analysis of batch systems [263].
Online applications include dynamic data reconciliation algorithms for batch pro-
cesses [7, 260], state estimation and process identification [375], optimal startup policies
for distillation columns [320], optimal feed policies for direct methanol fuel cells [412], and a
number of algorithms and case studies for nonlinear model predictive control (NMPC) [364,
211]. Moreover, commercial applications of NMPC include several applications at Exxon-
Mobil and Chevron Phillips, which use the NLC package and NOVA solver both from
Honeywell, Inc. [333, 417]. Other software implementations of the simultaneous approach
include DynoPC [240], a Windows-based platform; the MATLAB-based OptimalControl-
Centre [211] and dynopt [95] packages, and the pseudospectral packages PROPT [342] and
GPOPS [327].
10.7 Exercises
1. Apply two-point Gauss–Legendre collocation to Example 10.2 and find the solution
for up to N = 10 finite elements. Calculate the global error and determine the order
of the method.
2. Using the information reported in [242], resolve the crystallizer optimization problem
described in Example 10.3 by applying two-point Radau collocation to the state and
control profiles. How does the solution compare with Example 10.3?
3. Consider the unstable dynamic optimization problem given in Example 9.5. Formulate
this problem using two-point collocation and compare the solutions with Radau and
Gauss–Legendre collocation. How many finite elements are needed in each case?
4. Consider the parallel batch reactor system in Example 9.4. Formulate this problem
using two-point collocation and variable finite elements. Compare the solutions with
Radau and Gauss–Legendre collocation. How many finite elements are needed in
each case?
5. Consider the batch reactor system in Example 8.3. Formulate this problem using three
point Radau collocation and variable finite elements.
(a) Solve this problem without the control bound and compare with the solution in
Example 8.3.
(b) Solve the problem with u(t) ≤ 2. Compare the state profiles with a profile de-
termined by a DAE solver. How many finite elements are needed to achieve a
solution with less than 10−4 error in the states?
6. Write the KKT conditions for (10.19). Using Gauss–Legendre collocation, apply the
quadrature formulation and extend the derivation in Section 10.4 to deal with algebraic
variables and decision variables p.
(a) Without considering inequality constraints, compare the transformed KKT sys-
tem to the optimality conditions developed in Chapter 8.
(b) Using the KKT multipliers, write a discrete version of the Hamiltonian func-
tion. If the finite elements are variables in (10.19) and the constraints (10.22)
are added, what additional conditions arise on the Hamiltonian from the KKT
system?
(c) Explain why changes in active constraints and discontinuities in control profiles
must be confined to the boundaries of the finite elements. Discuss how this can
be enforced in the KKT conditions.
7. Consider the singular catalyst mixing problem in Example 8.9.
(a) Apply three-point Gauss collocation and solve with piecewise constant controls
for increasing values of N .
(b) Apply three-point Radau collocation and solve with piecewise constant controls
for increasing values of N . Compare this solution with Gauss collocation.
(c) Apply three-point Radau collocation and solve with control coefficients at each
collocation point for increasing values of N . Compare this solution to those with
piecewise controls.
Chapter 11
Steady state and dynamic process models frequently deal with switches and other nonsmooth
decisions that can be represented through complementarity constraints. If these formulations
can be applied at the NLP level, they provide an “all-at-once” strategy that can be addressed
with state-of-the-art NLP solvers. On the other hand, these mathematical programs with
complementarity constraints (MPCCs) have not been widely studied in process engineering
and must be handled with care. These formulations are nonconvex and violate constraint
qualifications, including LICQ and MFCQ, which are required for well-posed, nonlinear
programs. This chapter deals with properties of MPCCs including concepts of stationarity
and linear independence that are essential for well-defined NLP formulations. NLP-based
solution strategies for MPCCs are then reviewed along with examples of complementarity
drawn from steady state chemical engineering applications. In addition, we extend these
MPCC formulations for the optimization of a class of hybrid dynamic models, where the
differential states remain continuous over time. These involve differential inclusions of the
Filippov type, and a formulation is developed that preserves the piecewise smoothness prop-
erties of the dynamic system. Results on several process examples drawn from distillation
optimization, process control, and hybrid systems are used to illustrate MPCC formulations
and demonstrate the proposed optimization approach.
11.1 Introduction
Optimization problems were introduced in Chapter 1 with discrete and continuous decision
variables. This mixed integer nonlinear programming formulation (1.1) leads to the most
general description of process optimization problems. To solve these problems, NLP formu-
lations are usually solved at a lower level for fixed values of the discrete variables. Based
on information from the NLP solution, a search of the discrete variables is then conducted
at a higher level.
While the scope of this book is restricted to nonlinear programs with smooth objective
and constraint functions, the ability to deal with (some) discrete decisions within an “all-at-
once” NLP formulation can have an advantage over nested MINLP strategies. This motivates
the modeling of these decisions with complementarity constraints. Complementarity is a
relationship between variables where either one (or both) must be at its bound. In principle,
these relationships can be embedded within an NLP formulation, but these complements
introduce an inherent nonconvexity as well as linear dependence of constraints, which make
the nonlinear program harder to solve.
Complementarity problems arise in a large number of applications in engineering and
economic systems. They include contact and friction problems in computational mechanics,
equilibrium relations in economics, and a wide variety of discrete events in process sys-
tems. Many of these models embed themselves within optimization problems [130, 267].
However, these mathematical programming formulations and related solution strategies are
not yet fully developed in process engineering.
In process optimization, complementarity models allow a number of options when
dealing with phase changes, flow reversal, safety valve operation, and other discrete events.
These events are often embedded within modular process simulators (as described in Chap-
ter 7) that model flowsheets through conditional (IF-THEN) statements embedded in the
software. In fact, depending on the value of computed variables, different sets of equations
may even be used for the simulation. However, with embedded conditional statements,
the modular approach has significant limitations in dealing with process optimization. In-
stead, equation-based flowsheet models are much better suited to an optimization framework
and have the flexibility to exploit complex specifications and recycle streams. Moreover,
equation-based process optimization, also described in Chapter 7, must deal with conditional
statements in a different way. In this context, complementarity models offer an efficient al-
ternative.
Complementarities also arise naturally in the solution of multilevel optimization prob-
lems. Such problems are members of a more general problem class called mathematical
programs with equilibrium constraints (MPECs) [267]. For bilevel optimization problems
of the form
min f (x, y) (11.1a)
x
s.t. (x, y) ∈ Z, (11.1b)
y = arg min{θ(x, ŷ) : ŷ ∈ C(x)}, (11.1c)
ŷ
we define Z and C(x) as feasible regions for the upper and lower level problems, respectively,
and f (x, y) and θ (x, y) as objective functions for the upper and lower level problems,
respectively.
Bilevel optimization problems can be reformulated as mathematical programs with
complementarity constraints (MPCCs) by writing the optimality conditions of the inner
optimization problem as constraints on the outer problem. An MPCC takes the following
general form:
min f (x, y, z) (11.2a)
s.t. h(x, y, z) = 0, (11.2b)
g(x, y, z) ≤ 0, (11.2c)
0 ≤ x ⊥ y ≥ 0, (11.2d)
where ⊥ is the complementarity operator enforcing at least one of the bounds to be active.
(Note that we have classified the complementing variables as x and y, both in R^{n_c}, with
the remaining variables, z ∈ R^{n−2n_c}.) The complementarity constraint (11.2d) implies the
following:
x(i) = 0 OR y(i) = 0, i = 1, . . . , nc ,
x ≥ 0, y ≥ 0,
for each vector element i. Here the OR operator is inclusive, as both variables may be zero.
Alternately, the complementarity constraint may be written in several equivalent ways:
x^T y = 0, x ≥ 0, y ≥ 0, (11.3)
x_(i) y_(i) = 0, i = 1, . . . , n_c , x ≥ 0, y ≥ 0, (11.4)
x_(i) y_(i) ≤ 0, i = 1, . . . , n_c , x ≥ 0, y ≥ 0. (11.5)
These alternate forms are particularly useful when applying existing NLP solution strategies to solve MPCCs. In addition, every MPCC problem can be rewritten as an equivalent MPEC problem.
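The equivalence of the forms (11.3)–(11.5) for nonnegative x, y can be checked componentwise. A small sketch (the min(x, y) residual included here is an additional standard complementarity measure, not one of the three forms above):

```python
import numpy as np

def comp_residuals(x, y):
    """Complementarity measures for x, y >= 0: the aggregate x^T y,
    the componentwise products, and the componentwise min(x, y)."""
    assert np.all(x >= 0) and np.all(y >= 0)
    return float(x @ y), x * y, np.minimum(x, y)

x = np.array([0.0, 2.0, 0.0])
y = np.array([3.0, 0.0, 0.0])   # complementary pair (both may be zero)
agg, comp, mn = comp_residuals(x, y)
assert agg == 0.0 and np.all(comp == 0) and np.all(mn == 0)

# a violating pair: second components both strictly positive
agg2, _, _ = comp_residuals(np.array([1.0, 1.0]), np.array([0.0, 0.5]))
assert agg2 > 0
```

Under x, y ≥ 0 every product x_(i) y_(i) is nonnegative, so the aggregate, componentwise-equality, and componentwise-inequality forms all vanish on exactly the same set of points.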
Section 11.2 summarizes MPCC properties and develops related NLP formulations
that lead to standard NLP problems that satisfy constraint qualifications. Also included
is a numerical case study comparison of these formulations with standard NLP solvers.
Section 11.3 then considers the modeling of discrete decisions with MPCC formulations and
describes several important chemical engineering applications of MPCCs. To demonstrate
both the models and solution strategies, we consider a distillation optimization case study in
Section 11.4. Section 11.5 extends complementarity concepts to the optimization of a class of
hybrid dynamic systems, with continuous states and discontinuous right-hand sides. Known
as Filippov systems [132], these are considered with the complementarity class of hybrid
systems described and analyzed in [193], where the differential states remain continuous
over all time. For this class, we adopt the simultaneous collocation approach from Chapter 10
along with variable finite elements to track the location of these nonsmooth features. Two
case studies illustrate this approach in Section 11.6 and demonstrate the advantages of the
MPCC formulation.
From Theorem 4.14 we know that if constraint qualifications (CQs) hold, the KKT
conditions are necessary conditions for optimality. However, these CQs are violated
for MPCCs and it is easy to formulate MPCC examples where the multipliers α, β,
and δ are unbounded or do not exist.
• At feasible points that satisfy
h(x, y, z) = 0, g(x, y, z) ≤ 0, 0 ≤ x ⊥ y ≥ 0,
we have for all x_(i) = 0 that the constraint x_(i) y_(i) = 0 holds automatically; a similar condition holds for y_(i) = 0. As a result, the gradients of these active constraints are linearly dependent, the LICQ is violated, and failure of the LICQ implies that the multipliers, if they exist, are nonunique.
• The weaker MFCQ condition from Definition 4.16 is also violated. MFCQ requires
linearly independent gradients for the equality constraints and a feasible direction into
the interior of the cone of inequality constraint gradients. This constraint qualification
is a necessary and sufficient condition for boundedness of the multipliers. On the
other hand, 0 ≤ x ⊥ y ≥ 0 can be written equivalently as x ≥ 0, y ≥ 0, and x T y ≤ 0.
Therefore, at a feasible point (x̄,ȳ,z̄), there is no feasible search direction that satisfies
x̄(y) + ȳ(x) < 0. Consequently, MFCQ cannot hold either, and multipliers of the
MPCC (11.2) will be nonunique and unbounded.
Because these constraint qualifications do not hold, it is not surprising that an MPCC
may be difficult to solve. In order to classify an MPCC solution, we introduce the concept of
B-stationarity.8 A point w^* = [x^{*T}, y^{*T}, z^{*T}]^T is a B-stationary point if it is feasible for the MPCC, and d = 0 is a solution to the following linear program with equilibrium constraints
(LPEC) [267, 326, 350]:
$$\begin{aligned}
\min_d \;\; & \nabla f(w^*)^T d && (11.7a)\\
\text{s.t.}\;\; & g(w^*) + \nabla g(w^*)^T d \le 0, && (11.7b)\\
& h(w^*) + \nabla h(w^*)^T d = 0, && (11.7c)\\
& 0 \le x^* + d_x \perp y^* + d_y \ge 0. && (11.7d)
\end{aligned}$$
The LPEC therefore verifies that locally there is no feasible direction that improves the
objective function. On the other hand, verification of this condition may require the solution
8 There are weaker stationarity conditions that identify weak, A-, C-, and M-stationary points. However,
these conditions are not sufficient to identify local optima, as they allow negative multipliers and have feasible
descent directions [256].
Strong stationarity is a more useful, but less general, definition of stationarity for the MPCC
problem. A point is strongly stationary if it is feasible for the MPCC and d = 0 solves the
following relaxed problem:
$$\begin{aligned}
\min_d \;\; & \nabla f(w^*)^T d && (11.8a)\\
\text{s.t.}\;\; & g(w^*) + \nabla g(w^*)^T d \le 0, && (11.8b)\\
& h(w^*) + \nabla h(w^*)^T d = 0, && (11.8c)\\
& d_{x(i)} = 0, \quad i \in I_X \setminus I_Y, && (11.8d)\\
& d_{y(i)} = 0, \quad i \in I_Y \setminus I_X, && (11.8e)\\
& d_{x(i)} \ge 0, \quad i \in I_X \cap I_Y, && (11.8f)\\
& d_{y(i)} \ge 0, \quad i \in I_X \cap I_Y. && (11.8g)
\end{aligned}$$
Strong stationarity can also be related to stationarity of an NLP relaxation of (11.2), abbreviated RNLP. With the active sets I_X = {i : x^*_(i) = 0} and I_Y = {i : y^*_(i) = 0}, the RNLP is defined as
$$\begin{aligned}
\min_{x,y,z} \;\; & f(x, y, z) && (11.9)\\
\text{s.t.}\;\; & h(x, y, z) = 0, \quad g(x, y, z) \le 0,\\
& x_{(i)} = 0, \; i \in I_X \setminus I_Y, \qquad y_{(i)} = 0, \; i \in I_Y \setminus I_X,\\
& x_{(i)} \ge 0, \quad y_{(i)} \ge 0, \quad i \in I_X \cap I_Y.
\end{aligned}$$
This property implies that there is no feasible descent direction at the solution of either
(11.8) or (11.9). This condition is equivalent to B-stationarity if the biactive set IX ∩ IY is
empty, or if the MPEC-LICQ property holds [10]. MPEC-LICQ requires that the following
set of vectors be linearly independent:
$$\{\nabla g_i(w^*) \mid i \in I_g\} \cup \{\nabla h(w^*)\} \cup \{\nabla x_{(i)} \mid i \in I_X\} \cup \{\nabla y_{(i)} \mid i \in I_Y\}, \qquad (11.10)$$
where I_g = {i : g_i(w^*) = 0}.
MPEC-LICQ is equivalent to LICQ for (11.9) and implies that the multipliers of either
(11.8) or (11.9) are bounded and unique. Satisfaction of MPEC-LICQ leads to the following
result.
Theorem 11.1 [10, 255, 350] If w ∗ is a solution to the MPCC (11.2) and MPEC-LICQ
holds at w∗ , then w ∗ is strongly stationary.
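Since MPEC-LICQ is a rank condition on a stacked matrix of gradients, it can be tested numerically at a candidate point. A sketch with hypothetical gradient data in R^3 (columns are gradient vectors; the data is illustrative only):

```python
import numpy as np

def mpec_licq_holds(grad_g_active, grad_h, e_X, e_Y):
    """Check linear independence of the stacked gradient set (11.10).
    Inputs are matrices whose columns are the gradient vectors."""
    M = np.hstack([grad_g_active, grad_h, e_X, e_Y])
    return np.linalg.matrix_rank(M) == M.shape[1]

# toy point in R^3: one active inequality, one equality,
# one active x-bound, and (in the second check) one active y-bound
gg = np.array([[1.0], [0.0], [0.0]])
gh = np.array([[0.0], [1.0], [0.0]])
ex = np.array([[0.0], [0.0], [1.0]])
ey = np.array([[1.0], [1.0], [0.0]])

assert mpec_licq_holds(gg, gh, ex, np.zeros((3, 0)))   # 3 independent vectors
assert not mpec_licq_holds(gg, gh, ex, ey)             # 4 vectors in R^3 fail
```

The second check illustrates a generic failure mode: once the number of active gradients exceeds the space dimension, linear independence is impossible.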
Strong stationarity is the key assumption that allows the solution of MPCCs through
NLP formulations. With this property, a well-posed set of multipliers for (11.9) verifies op-
timality of (11.2), and a suitable reformulation of MPCC can lead to a number of equivalent,
well-posed nonlinear programs. As a result, MPCCs can be addressed directly through these
reformulations.
Finally, a related constraint qualification is MPEC-MFCQ, i.e., MFCQ applied to
(11.9). As noted above, this property requires that all the equality constraints in (11.9) have
linearly independent gradients and there exists a nonzero vector d ∈ R^n such that
$$\begin{aligned}
& d_{x(i)} = 0, \quad i \in I_X \setminus I_Y, && (11.11a)\\
& d_{y(i)} = 0, \quad i \in I_Y \setminus I_X, && (11.11b)\\
& \nabla h(w^*)^T d = 0, && (11.11c)\\
& \nabla g_i(w^*)^T d < 0, \quad i \in I_g, && (11.11d)\\
& d_{x(i)} > 0, \; d_{y(i)} > 0, \quad i \in I_X \cap I_Y. && (11.11e)
\end{aligned}$$
MPEC-MFCQ also implies that the multipliers of (11.9) will be bounded.
Similarly, second order conditions can be defined for MPCCs that extend analogous
conditions developed for constrained nonlinear programs in Chapter 4. In Theorem 4.18,
sufficient second order conditions were developed with the use of constrained (or allowable)
search directions. For MPCCs, the allowable search directions d are defined in [326, 350]
for the following cones:
• d ∈ S̄, where d is tangential to equality constraints and inequality constraints with
positive multipliers, and it forms an acute angle to active constraints with zero mul-
tipliers.
• d ∈ S ∗ , where d ∈ S̄ and also tangential to at least one of the branches for i ∈ IX ∪ IY .
• d ∈ T̄ , where d is tangential to equality constraints and inequality constraints with
nonzero multipliers.
• d ∈ T ∗ , where d ∈ T̄ and also tangential to at least one of the branches for i ∈ IX ∪ IY .
We define the MPCC Lagrange function given by
$$\mathcal{L}^C = f(w) + g(w)^T u + h(w)^T v - \alpha^T x - \beta^T y$$
and take w^* as a strongly stationary point with multipliers u^*, v^*, α^*, β^* that satisfy the KKT conditions for (11.9). The following second order sufficient conditions (SOSC) and strong second order sufficient conditions (SSOSC), defined in [326], require
$$d^T \nabla_{ww} \mathcal{L}^C(w^*, u^*, v^*, \alpha^*, \beta^*)\, d \ge \sigma \|d\|^2$$
for some σ > 0 and all allowable directions in the following cones:
• RNLP-SOSC for d ∈ S̄,
• MPEC-SOSC for d ∈ S ∗ ,
• RNLP-SSOSC for d ∈ T̄ ,
• MPEC-SSOSC for d ∈ T ∗ .
For the first three, regularized formulations, the complementarity conditions are relaxed and the MPCC is reformulated as Reg(ε), RegComp(ε), or RegEq(ε) with a positive relaxation parameter ε. The solution of the MPCC, w^*, can be obtained by solving a series of relaxed solutions, w(ε), as ε approaches zero.
The convergence properties of these NLP formulations can be summarized by the
following theorems, developed in [326].
Theorem 11.2 Suppose that w^* is a strongly stationary solution to (11.2) at which MPEC-MFCQ and MPEC-SOSC are satisfied. Then there exist r_0 > 0 and ε̄ > 0 so that for all ε ∈ (0, ε̄], w(ε), the global solution of Reg(ε) with constraint ‖w(ε) − w^*‖ ≤ r_0 that lies closest to w^*, satisfies ‖w(ε) − w^*‖ = O(ε^{1/2}). If the stronger conditions MPEC-LICQ and RNLP-SOSC hold, then under similar conditions we have ‖w(ε) − w^*‖ = O(ε).
Uniqueness properties have also been shown for Reg(ε). Also, a property similar to Theorem 11.2 holds for the RegComp(ε) formulation, where the individual complementarity constraints are replaced by the single constraint x^T y ≤ ε. However, local uniqueness of the solutions to RegComp(ε) cannot be guaranteed.
For the RegEq(ε) formulation the following convergence property holds.
Theorem 11.3 Suppose that w^* is a strongly stationary solution to (11.2) at which MPEC-LICQ and MPEC-SOSC are satisfied. Then there exist r_0 > 0 and ε̄ > 0 so that for all ε ∈ (0, ε̄], w(ε), the global solution of RegEq(ε) with the constraint ‖w(ε) − w^*‖ ≤ r_0 that lies closest to w^*, satisfies ‖w(ε) − w^*‖ = O(ε^{1/4}).
Note that even with slightly stronger assumptions than in Theorem 11.2, the RegEq(ε)
formulation exhibits slower convergence. Despite this property, the RegEq formulation
has proved to be popular because simpler equations replace the complementarity conditions.
In particular, the related nonlinear complementarity problem (NCP) functions and smoothing
functions have been widely used to solve MPCCs [94, 167, 241, 369]. A popular NCP
function is the Fischer–Burmeister function, written here in smoothed form:

φ(x, y) = x + y − √(x² + y² + ε) = 0 (11.16)

for some small ε > 0. The solution to the original problem is then recovered by solving
a sequence of problems as ε approaches zero. An equivalence can be made between this
method and RegEq(ε). Accordingly, the problem will converge at the same slow convergence rate.
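The link to RegEq(ε) can be seen directly: since (x + y)² = x² + y² + 2xy, any root of the smoothed function with x + y ≥ 0 satisfies x·y = ε/2 exactly. A minimal numerical check (the values are illustrative):

```python
import math

def fb(x, y, eps=0.0):
    """(Smoothed) Fischer-Burmeister function."""
    return x + y - math.sqrt(x * x + y * y + eps)

# On the curve x*y = eps/2 the smoothed residual vanishes ...
eps = 1e-4
x = 0.3
y = (eps / 2.0) / x
residual = fb(x, y, eps)          # essentially zero

# ... and with eps = 0 the roots are exactly the complementary pairs.
exact = fb(0.7, 0.0)              # x >= 0, y = 0: also (numerically) zero
```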
In contrast to the regularized formulations, we also consider the exact ℓ1 penalization
shown in PF(ρ). Here the complementarity can be moved from the constraints to the objec-
tive function, and the resulting problem is solved for a particular value of ρ. If ρ ≥ ρc , where
ρc is the critical value of the penalty parameter, then the complementarity constraints will
be satisfied at the solution. Similarly, Anitescu, Tseng, and Wright [10] considered a related
“elastic mode” formulation, where artificial variables are introduced to relax the constraints
in PF(ρ) and an additional ℓ∞ constraint penalty term is added. In both cases the resulting
NLP formulation has the following properties.
Theorem 11.4 [326] If w∗ is a strongly stationary point for the MPCC (11.2), then for all ρ
sufficiently large, w∗ is a stationary point for PF(ρ). Also, if MPEC-LICQ, MPEC-MFCQ,
or MPEC-SOSC hold for (11.2), then the corresponding LICQ, MFCQ, or SOSC properties
hold for PF(ρ).
Theorem 11.5 [10] If w∗ is a solution to PF(ρ) and w ∗ is feasible for the MPCC (11.2),
then w ∗ is a strongly stationary solution to (11.2). Moreover, if LICQ or SOSC hold for
PF(ρ), then the corresponding MPEC-LICQ or MPEC-SOSC properties hold for (11.2).
Theorem 11.4 indicates that the solution strategy is attracted to strongly stationary
points for sufficiently large values of ρ. However, it is not known beforehand how large
ρ must be. If the initial value of ρ is too small, a series of problems with increasing ρ
values may need to be solved. Also, Theorem 11.5 indicates that a local solution of PF(ρ)
is a solution to the MPCC if it is also feasible to the MPCC. This condition is essential as
there is no guarantee that PF(ρ) will not get stuck at a local solution that does not satisfy
the complementarity, as observed by [204]. Nevertheless, with a finite value of ρ > ρc this
ℓ1 penalization ensures that strongly stationary points of the MPCC are local minimizers
to PF(ρ).
The PF(ρ) formulation has a number of advantages. Provided that the penalty parame-
ter is large enough, the MPCC may then be solved as a single problem, instead of a sequence
of problems. Moreover, PF(ρ) allows any NLP solver to be used to solve a complementarity
problem, without modification of the algorithm. Moreover, with the PF(ρ) formulation the
penalty parameter can also be updated during the course of the optimization, for
both active set and interior point optimization algorithms. This modification was introduced
and demonstrated in [10, 255].
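A minimal sketch of PF(ρ) on the same kind of toy MPCC (our own example, not one from the text): the complementarity term is penalized in the objective and a single bound-constrained NLP is solved with an off-the-shelf solver. For this particular toy problem, any ρ above the critical value ρc = 2 recovers a complementary solution.

```python
# Toy MPCC:  min (x-1)^2 + (y-1)^2  s.t.  0 <= x  _|_  y >= 0,
# as PF(rho):  min f(w) + rho * x * y  s.t.  x, y >= 0   (one solve).
from scipy.optimize import minimize

rho = 10.0
obj = lambda w: (w[0] - 1.0) ** 2 + (w[1] - 1.0) ** 2 + rho * w[0] * w[1]
res = minimize(obj, [0.8, 0.3], method="L-BFGS-B",
               bounds=[(0.0, None), (0.0, None)])
x, y = res.x
# one variable is driven to its bound, so complementarity holds at the solution
```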
Finally, there is a strong interaction between the MPCC reformulation and the ap-
plied NLP solver. In particular, as noted in [255], if a barrier NLP method (like LOQO,
KNITRO, or IPOPT) is applied to PF(ρ), then there is a correspondence between
RegComp(ε) and intermediate PF(ρ) solutions with the barrier parameter µl → 0. This
relation can be seen by comparing the first order KKT conditions of RegComp(ε) and inter-
mediate PF(ρ) subproblems, and identifying a corresponding set of parameters ρ, µ, and ε.
For this comparison we consider the MPECLib collection of MPCC test problems
maintained by Dirkse [284]. This test set consists of 92 problems including small-scale
models from the literature and several large industrial models. The performance of the NLP
reformulations PF(ρ) with ρ = 10, Reg(ε) with ε = 10−8 , and the NCP formulation x T y = 0,
x, y ≥ 0 were compared, with CONOPT (version 3.14) and IPOPT (version 3.2.3) used to
solve the resulting NLP problems. Also included is the IPOPT-C solver, which applies the
Reg(ε) formulation within IPOPT [324] and coordinates the adjustment of the relaxation
parameter ε with the barrier parameter µ.
The results of this comparison are presented as Dolan–Moré plots in Figures 11.1
and 11.2. All results were obtained on an Intel Pentium 4, 1.8 GHz CPU with 992 MB
of RAM. The plots portray both robustness and relative performance of a set of solvers
on a given problem set. Each problem is assigned its minimum solution time among the
algorithms compared. The figure then plots the fraction of test problems solved by a particular
algorithm within Time Factor of this minimum CPU time.
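The computation behind such a plot is compact; the sketch below (solver names and timings are made up) computes, for each solver, the fraction of problems solved within a factor τ of the best recorded time:

```python
import numpy as np

def perf_profile(T, tau):
    """T[p, s]: CPU time of solver s on problem p (np.inf marks a failure).
    Returns, per solver, the fraction of problems solved within a factor
    tau of the best time recorded for that problem."""
    best = T.min(axis=1, keepdims=True)     # fastest time per problem
    return (T / best <= tau).mean(axis=0)

T = np.array([[1.0, 2.0],
              [4.0, 2.0],
              [np.inf, 5.0]])               # solver 0 fails on problem 2
frac_at_1 = perf_profile(T, 1.0)            # fraction of wins per solver
frac_at_4 = perf_profile(T, 4.0)            # robustness within factor 4
```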
Figure 11.1 shows the performance plot of different reformulations and solvers tested
on all 92 problems of the test set. Since the problems are inherently nonconvex, not all of
the different reformulation and solver combinations converged to the same local solutions.
While this makes a direct comparison difficult, this figure is useful to demonstrate the
robustness of the methods (as Time Factor becomes large). With one exception, all of the
reformulation and solver combinations are able to solve at least 84% of the test problems.
The most reliable solvers were IPOPT-Reg(ε) (99%), CONOPT-Penalty (98%), and IPOPT-
Penalty (94%). In contrast, IPOPT-Mult (IPOPT with the complementarity written as (11.3)
and no further reformulation) was able to solve only 57% of all of the problems. This is not
unexpected, as IPOPT is an interior point algorithm and the original MPCC problem has no
interior at the solution.
Figure 11.2 shows the same reformulations and solvers for only 22 of the 92 test prob-
lems. For these 22 problems, all reformulation-solver combinations gave the same solutions,
when they were successful. This allows for a more accurate comparison of the solvers’ per-
formance. Table 11.1 displays these 22 test problems and their respective objective function
values. From Figure 11.2 we note that CONOPT-Reg(ε) performed the fastest on 73% of
the problems, but it turns out to solve only 76% of them. In contrast, CONOPT-Penalty and
IPOPT-Penalty (PF(ρ)) are the next best in performance and they also prove to be the most
robust, solving over 90% of the problems. These turn out to be the best all-around methods.
Because the test problems are not large, CONOPT provides the best performance on
this test set. Also, the CONOPT formulations appear to take advantage of the CONOPT ac-
tive set strategy in the detection and removal of dependent constraints. As a result, CONOPT-
Mult (with no reformulation) performs well. On the other hand, the IPOPT formulations
(IPOPT-Penalty, IPOPT-Reg, IPOPT-C) follow similar trends within Figure 11.2, with the
penalty formulation as the most robust. This can be explained by the similarities among
these methods, as analyzed in [255]. Finally, IPOPT-Mult (with no reformulation) is the
worst performer as it cannot remove dependent constraints, and therefore suffers most from
the inherent degeneracies in MPCCs.
From this brief set of results, we see the advantages of the PF(ρ) strategy, particularly
since it is easier to address with general NLP solvers such as CONOPT [119] and IPOPT
[206]. These solvers have competing advantages; CONOPT quickly detects active sets and
handles dependent constraints efficiently, while IPOPT has low computational complexity
in handling large-scale problems with many inequality constraints.
Table 11.1. Objective function values of the 22 MPCCLib test problems that con-
verged to the same solutions.
Problem Name Objective Function Value Constraints Variables Complementarities
bard2 −6600 10 13 8
bard3 −12.67872 6 7 4
bartruss3_0 3.54545 × 10−7 29 36 26
bartruss3_1 3.54545 × 10−7 29 36 11
bartruss3_2 10166.57 29 36 6
bartruss3_3 3 × 10−7 27 34 26
bartruss3_4 3 × 10−7 27 34 11
bartruss3_5 10166.57 27 34 6
desilva −1 5 7 4
ex9_1_4m −61 5 6 4
findb10s 2.02139 × 10−7 203 198 176
fjq1 3.207699 7 8 6
gauvin 20 3 4 2
kehoe1 3.6345595 11 11 5
outrata31 2.882722 5 6 4
outrata33 2.888119 5 6 4
outrata34 5.7892185 5 6 4
qvi 7.67061 × 10−19 3 5 3
three 2.58284 × 10−20 4 3 1
tinque_dhs2 N/A 4834 4805 3552
tinque_sws3 12467.56 5699 5671 4480
tollmpec −20.82589 2377 2380 2376
For example, the feasible region described by

0 ≤ x ⊥ (1 − x) ≥ 0 (11.18)

consists of only two isolated points, and the disjoint feasible region associated with this
example often leads to convergence failures, unless the model is initialized close to a solution
with either x = 0 or x = 1. Moreover, without careful consideration, it is not difficult to
create MPCCs with similar difficulties.
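A quick numerical scan illustrates the disjoint feasible set of (11.18): on a grid of candidate points, only x = 0 and x = 1 satisfy all three conditions.

```python
# Scan [0, 1] and keep the points feasible for 0 <= x _|_ (1 - x) >= 0.
feasible = []
for k in range(101):
    x = k / 100.0
    if x >= 0.0 and (1.0 - x) >= 0.0 and x * (1.0 - x) == 0.0:
        feasible.append(x)
# feasible is exactly [0.0, 1.0]: two isolated points, with no path between them
```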
Because the MPCC can be derived from an associated MPEC, we consider the formu-
lation in (11.1). To avoid disjoint regions, we require that both θ(x, y) and C(x) be convex
in y for all (x, y) ∈ Z. This leads to a well-defined solution for the inner problem. We use
this observation to develop the following guidelines for the examples in this study.
• Represent the discrete decision through an inner minimization problem of the form

y = arg min_ŷ { ϕ(x)ŷ : ya ≤ ŷ ≤ yb }, (11.19)

where ϕ(x) is a switching function for the discrete decision. Often a linear program
in y is a good choice.
• When possible, formulate (11.19) so that the upper level constraints in (11.1), i.e.,
(x, y) ∈ Z, do not interfere with the selection of any value of y ∈ C(x).
• The resulting MPEC is then converted to an MPCC. For instance, applying the KKT
conditions to (11.19), we obtain the complementarities
ϕ(x) − sa + sb = 0, (11.20a)
0 ≤ (y − ya ) ⊥ sa ≥ 0, (11.20b)
0 ≤ (yb − y) ⊥ sb ≥ 0. (11.20c)
• Simplify the relations in (11.20) through variable elimination and application of the
complementarity conditions.
• Finally, incorporate the resulting complementarity conditions within the NLP refor-
mulations described in Section 11.2.
Note that in the absence of upper level constraints in y, the solution of (11.20) allows
y to take any value in [ya , yb ] when ϕ(x) = 0. Thus, it is necessary to avoid disjoint feasible
regions for the resulting MPCC, and this is frequently a problem-specific modeling task. For
instance, complementarity formulations should not be used to model logical disjunctions
such as exclusive or (EXOR) operators as in (11.18) because they lead to disjoint regions
for y. On the other hand, logical disjunctions such as an inclusive or operator can be modeled
successfully with MPCCs.
In the remainder of this section, we apply these guidelines to develop complementarity
models that arise in process applications.
These relations can be simplified by using (11.27b), (11.27c), (11.27d) and substituting
into (11.27a) to eliminate y, leading to
z = f (x) + sb , (11.28a)
f (x) − za = sa − sb , (11.28b)
0 ≤ sb ⊥ sa ≥ 0. (11.28c)
• The min operator z = min (f (x), zb ) can be treated in a similar way by defining the
problem
z = f (x) + (zb − f (x))y, (11.29)
y = arg min_ŷ { (zb − f (x))ŷ : 0 ≤ ŷ ≤ 1 }. (11.30)
Applying the optimality conditions and simplifying leads to the following comple-
mentarity system:
z = f (x) − sa , (11.31a)
zb − f (x) = sb − sa , (11.31b)
0 ≤ sb ⊥ sa ≥ 0. (11.31c)
• The sign of a variable x (used, for instance, in modeling the absolute value operator)
can be recovered from the inner problem

min −y · x (11.32a)
s.t. − 1 ≤ y ≤ 1, (11.32b)

whose optimality conditions yield the complementarity system
x = sb − sa , (11.33a)
0 ≤ sa ⊥ (y + 1) ≥ 0, (11.33b)
0 ≤ sb ⊥ (1 − y) ≥ 0. (11.33c)
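The reformulations above can be verified in closed form; the helper functions below are our own illustration, not code from the text. For (11.28) the complementary slacks are sa = max(f − za , 0) and sb = max(za − f , 0); for (11.33) the slacks sa = max(−x, 0), sb = max(x, 0) force y = sign(x) whenever x ≠ 0.

```python
def max_via_slacks(f_val, z_a):
    """z = f + s_b with f - z_a = s_a - s_b and s_a _|_ s_b, as in (11.28)."""
    s_a = max(f_val - z_a, 0.0)
    s_b = max(z_a - f_val, 0.0)
    assert s_a * s_b == 0.0            # complementarity holds by construction
    return f_val + s_b                 # equals max(f_val, z_a)

def sign_via_slacks(x):
    """x = s_b - s_a with 0 <= s_a _|_ (y+1) and 0 <= s_b _|_ (1-y), (11.33)."""
    s_a, s_b = max(-x, 0.0), max(x, 0.0)
    if s_b > 0.0:
        return 1.0                     # (1 - y) must vanish
    if s_a > 0.0:
        return -1.0                    # (y + 1) must vanish
    return 0.0                         # x = 0: y may take any value in [-1, 1]
```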
Flow Reversal
Flow reversal occurs in fuel headers and pipeline distribution networks. This is problematic
in simulation models as physical properties and other equations implicitly assume the flow
rate to be positive. These situations may be modeled with the absolute value operator, using
complementarities as in (11.24). The magnitude of the flow rate can then be used in the sign
sensitive equations.
• Check valves prevent fluid flows in reverse directions that may arise from changes
in differential pressure. Assuming that the directional flow is a monotonic function
f (Δp) of the differential pressure, we model the flow F through the check valve as

F = max{0, f (Δp)}

and rewrite the max operator as the complementarity system given in (11.28).
• Relief valves allow flow only when the pressure, p, or pressure difference, Δp, is
larger than a predetermined value. Once open, the flow F is some function of the
pressure, f (p). This can be modeled as F = f (p)y with

y = arg min_ŷ { (pmax − p)ŷ : 0 ≤ ŷ ≤ 1 }.

The inner minimization problem sets y = 1 if p > pmax , y = 0 if p < pmax , and
y ∈ [0, 1] if p = pmax . This behavior determines whether the flow rate is zero, when
the valve is closed, or given by the expression f (p), when the valve is open.
The related complementarity conditions are
F = f (p)y,
(pmax − p) = s0 − s1 ,
0 ≤ y ⊥ s0 ≥ 0,
0 ≤ (1 − y) ⊥ s1 ≥ 0.
Piecewise Functions
Piecewise smooth functions are often encountered in physical property models, tiered pric-
ing, and table lookups. This composite function can be represented by a scalar ξ along with
the following inner minimization problem and associated equation [323]:
min_y Σ_{i=1}^{N} (ξ − ai )(ξ − ai−1 ) y(i) (11.34a)

s.t. Σ_{i=1}^{N} y(i) = 1, y(i) ≥ 0, (11.34b)

z = Σ_{i=1}^{N} fi (ξ ) y(i) , (11.35)
where N is the number of piecewise segments, fi (ξ ) is the function over the interval ξ ∈
[ai−1 , ai ], and z represents the value of the piecewise function. This LP sets y(i) = 1 and
y(j ) = 0 for j ≠ i when ξ ∈ (ai−1 , ai ). The associated equation will then set z = fi (ξ ), which is
the function value on the interval. The NLP (11.34) can be rewritten as the following
complementarity system:
Σ_{i=1}^{N} y(i) = 1, (11.36a)
(ξ − ai )(ξ − ai−1 ) − γ − si = 0, i = 1, . . . , N , (11.36b)
0 ≤ y(i) ⊥ si ≥ 0. (11.36c)
If the function z(ξ ) is piecewise smooth but not continuous, then y(i) can “cheat” at ξ = ai
or ξ = ai−1 by taking fractional values. For instance, cost per unit may increase stepwise
over different ranges, and jumps in costs that occur at ai may be replaced by arbitrary inter-
mediate values. A way around this problem is to define z(ξ ) as a continuous, but nonsmooth,
function (e.g., unit cost times quantity), so that fractional values of y(i) will still represent
z(ai ) accurately.
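A sketch of this construction (the breakpoints and segment functions below are made up): the LP (11.34) is solved with scipy.optimize.linprog, and the resulting y(i) evaluates (11.35).

```python
from scipy.optimize import linprog

a = [0.0, 1.0, 2.0, 3.0]                            # breakpoints a_0..a_N
f = [lambda s: s, lambda s: s**2, lambda s: s**3]   # f_i on [a_{i-1}, a_i]

def piecewise(xi):
    # the cost (xi - a_i)(xi - a_{i-1}) is negative only on the segment
    # containing xi, so the LP sets y_(i) = 1 there and 0 elsewhere
    c = [(xi - a[i]) * (xi - a[i - 1]) for i in range(1, len(a))]
    res = linprog(c, A_eq=[[1.0] * len(c)], b_eq=[1.0],
                  bounds=[(0.0, None)] * len(c))
    return sum(fi(xi) * yi for fi, yi in zip(f, res.x))   # (11.35)

z = piecewise(1.5)     # xi lies in segment 2, so z = f_2(1.5) = 1.5**2
```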
PI Controller Saturation
PI (proportional plus integral) controller saturation has been studied using complementarity
formulations [416]. The PI control law takes the following form:
u(t) = Kc [ e(t) + (1/τI ) ∫_0^t e(t′ ) dt′ ], (11.37)
where u(t) is the control law output, e(t) is the error in the measured variable, Kc is the
controller gain, and τI is the integral time constant. The controller output v(t) is typically
subject to upper and lower bounds, vup and vlo , i.e., v(t) = max(vlo , min(vup , u(t))). The
following inner minimization relaxes the controller output to take into account the saturation
effects:
min_{yup ,ylo} (vup − u)yup + (u − vlo )ylo (11.38)
s.t. 0 ≤ ylo ≤ 1, 0 ≤ yup ≤ 1
with v(t) = u(t) + (vup − u(t))yup + (vlo − u(t))ylo . Suppressing the dependence on t and
applying the KKT conditions to (11.38) leads to
v = u + (vup − u)yup + (vlo − u)ylo , (11.39)
(vup − u) − s0,up + s1,up = 0,
(u − vlo ) − s0,lo + s1,lo = 0,
0 ≤ s0,up ⊥ yup ≥ 0, 0 ≤ s1,up ⊥ (1 − yup ) ≥ 0,
0 ≤ s0,lo ⊥ ylo ≥ 0, 0 ≤ s1,lo ⊥ (1 − ylo ) ≥ 0.
Eliminating the variables yup and ylo and applying the complementarity conditions leads to
the following simplification:
v = u + s1,lo − s1,up , (11.40)
(vup − u) − s0,up + s1,up = 0,
(u − vlo ) − s0,lo + s1,lo = 0,
0 ≤ s1,up ⊥ s0,up ≥ 0, 0 ≤ s0,lo ⊥ s1,lo ≥ 0.
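The simplified system (11.40) has a closed-form solution that can be checked directly (the helper names are ours): s1,up = max(u − vup , 0) and s1,lo = max(vlo − u, 0), so that v reproduces the clipped controller output.

```python
def saturate(u, v_lo, v_up):
    """Solve the saturation system (11.40) in closed form and return v."""
    s1_up = max(u - v_up, 0.0)
    s1_lo = max(v_lo - u, 0.0)
    s0_up = (v_up - u) + s1_up        # from (v_up - u) - s0_up + s1_up = 0
    s0_lo = (u - v_lo) + s1_lo        # from (u - v_lo) - s0_lo + s1_lo = 0
    assert s1_up * s0_up == 0.0 and s1_lo * s0_lo == 0.0   # complementarity
    return u + s1_lo - s1_up

# the slack solution reproduces the clip operator on all three branches
for u in (-2.0, 0.3, 5.0):
    assert saturate(u, 0.0, 1.0) == min(max(u, 0.0), 1.0)
```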
Phase Changes
As described in Example 7.1 in Chapter 7, flash separators operate at vapor-liquid equilib-
rium by concentrating high-boiling components in the liquid stream and low-boiling com-
ponents in the vapor stream. The process model (7.3) that describes this system is derived
from minimization of the Gibbs free energy of the system [344]. Flash separations operate at
vapor-liquid equilibrium (i.e., boiling mixtures) between the so-called dew and bubble point
conditions. Outside of these ranges, one of the phases disappears and equilibrium no longer
holds between the phases. To model these phase changes, Gibbs minimization at a fixed
temperature T and pressure P is written as a constrained optimization problem of the form
min_{li ,vi} G(T , P , li , vi ) = Σ_{i=1}^{NC} li Ḡ_i^L + Σ_{i=1}^{NC} vi Ḡ_i^V

s.t. Σ_{i=1}^{NC} li ≥ 0, Σ_{i=1}^{NC} vi ≥ 0,
li + vi = mT i > 0, i = 1, . . . , NC,
where NC refers to the number of chemical components with index i, G(T , P , li , vi ) is the
total Gibbs free energy,
Ḡ_i^L = Ḡ_i^ig (T , P ) + RT ln(f_i^L ),
Ḡ_i^V = Ḡ_i^ig (T , P ) + RT ln(f_i^V ),
Ḡ_i^ig is the ideal gas free energy per mole for component i, f_i^L and f_i^V are the mixture liquid
and vapor fugacities for component i, li and vi are the moles of component i in the liquid
and vapor phase, R is the gas constant, and mTi are the total moles of component i. The first
order KKT conditions for this problem are given by
Ḡ_i^ig (T , P ) + RT ln(f_i^L ) + Σ_{j=1}^{NC} [ lj ∂Ḡ_j^L /∂li + vj ∂Ḡ_j^V /∂li ] − αL − γi = 0, (11.41a)

Ḡ_i^ig (T , P ) + RT ln(f_i^V ) + Σ_{j=1}^{NC} [ lj ∂Ḡ_j^L /∂vi + vj ∂Ḡ_j^V /∂vi ] − αV − γi = 0, (11.41b)

0 ≤ αL ⊥ Σ_{i=1}^{NC} li ≥ 0, (11.41c)

0 ≤ αV ⊥ Σ_{i=1}^{NC} vi ≥ 0, (11.41d)

li + vi = mT i , i = 1, . . . , NC. (11.41e)
The bracketed terms in (11.41a) and (11.41b) are equal to zero from the Gibbs–Duhem
equation [344], and subtracting (11.41a) from (11.41b) leads to
RT ln(fiV /fiL ) − αV + αL = 0.
Moreover, defining f_i^V = φ_i^V (T , P , yi )yi , f_i^L = φ_i^L (T , P , xi )xi , and Ki = φ_i^L /φ_i^V , where
φ_i^L and φ_i^V are fugacity coefficients, leads to

yi = exp((αV − αL )/(RT )) Ki xi .

From (11.41c) and (11.41d) we can deduce 0 ≤ αL ⊥ αV ≥ 0, and by defining β = exp((αV − αL )/(RT )),
we have
yj = βKj xj , (11.42a)
β = 1 − sL + sV , (11.42b)
0 ≤ L ⊥ sL ≥ 0, (11.42c)
0 ≤ V ⊥ sV ≥ 0. (11.42d)
In this manner, phase existence can be determined within the context of an MPCC. If a
slack variable (sL or sV ) is positive, the corresponding liquid or vapor phase is absent
and β ≠ 1 relaxes the phase equilibrium condition, as required in (11.41). As seen in the
next section, (11.42) can also be applied to the optimization of distillation columns, and, in
Exercise 5, these conditions can also be extended as necessary conditions for multiphase
equilibrium.
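The phase-existence logic behind (11.42) can be sketched for a mixture with constant K-values (the numbers below are made up, and only the single-phase candidate tests are shown): a subcooled feed gives β > 1 with V = 0, while a superheated feed gives β < 1 with L = 0, in each case consistent with β = 1 − sL + sV.

```python
def phase_state(K, z_feed):
    """Return (beta, s_L, s_V) from the single-phase candidate tests."""
    sum_Kz = sum(Ki * zi for Ki, zi in zip(K, z_feed))
    sum_zK = sum(zi / Ki for Ki, zi in zip(K, z_feed))
    if sum_Kz <= 1.0:
        # subcooled liquid: x = z and sum(y) = beta * sum(K x) = 1
        beta = 1.0 / sum_Kz
        return beta, 0.0, beta - 1.0        # V = 0 complements s_V >= 0
    if sum_zK <= 1.0:
        # superheated vapor: y = z and sum(x) = sum(y / (beta K)) = 1
        beta = sum_zK
        return beta, 1.0 - beta, 0.0        # L = 0 complements s_L >= 0
    return 1.0, 0.0, 0.0                    # two phases: beta = 1, slacks zero

beta, sL, sV = phase_state([0.5, 0.3], [0.5, 0.5])   # an all-liquid feed
```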
L1 + V1 − L2 = 0,
Li + Vi − Li+1 − Vi−1 = 0, i = 2, . . . , N + 1, i ∉ S,
Li + Vi − Li+1 − Vi−1 − Fi = 0, i ∈ S,
R + D − VN+1 = 0.
Energy Balance
L1 HL,1 + V1 HV ,1 − L2 HL,2 − QR = 0,
Li HL,i + Vi HV ,i − Li+1 HL,i+1 − Vi−1 HV ,i−1 = 0, i = 2, . . . , N + 1, i ∉ S,
Li HL,i + Vi HV ,i − Li+1 HL,i+1 − Vi−1 HV ,i−1 − Fi HF = 0, i ∈ S,
VN+1 HV ,N+1 − (R + D)HL,D − QC = 0.
Complementarities
0 ≤ Li ⊥ sL,i ≥ 0, i = 1, . . . , N + 1,
0 ≤ Vi ⊥ sV ,i ≥ 0, i = 1, . . . , N + 1,
βi − 1 + sL,i − sV ,i = 0, i = 1, . . . , N + 1.
Bounds
D, R, QR , QC ≥ 0,
1 ≥ yi,j , xi,j ≥ 0, j ∈ N C, i = 1, . . . , N + 1,
where i indexes trays numbered from the reboiler (i = 1), j indexes components in the feed, P is
the column pressure, S is the set of feed tray locations, D is the distillate flow rate, R is the
reflux flow rate, N is the number of trays in the column, Fi is the feed flow rate, Li /Vi is the flow
rate of liquid/vapor leaving tray i, Ti is temperature of tray i, HF is feed enthalpy, HL/V ,i
is the enthalpy of liquid/vapor leaving tray i, xF , xD are the feed and distillate compositions, xi,j /yi,j
is the mole fraction of j in the liquid/vapor leaving tray i, βi is the relaxation parameter, sL/V ,i are
slack variables, Aj , Bj , Cj are Antoine coefficients, and QR/C is
the heat load on the reboiler/condenser.
The feed is a saturated liquid of mole fractions [0.05, 0.15, 0.25, 0.2, 0.35] with the
components in the order given above. The column has N = 20 trays, and feed enters on
tray 12. The column is operated at a pressure of 725 kPa and we neglect pressure drop
across the column. This problem has 2 degrees of freedom. The objective is to minimize
the reboiler heat duty, which accounts for most of the energy costs. We are interested in
observing whether the proposed algorithm can identify (possibly) dry and vaporless trays
at the optimal solution. For this purpose we study three cases that differ in the recovery of
key components:
(1) xbottom,lk ≤ 0.01xtop,lk and xtop,hk ≤ 0.01xbottom,hk , where lk = C3 H8 is the light key
and hk = iC5 H12 is the heavy key,
(2) xtop,hk ≤ 0.01xbottom,hk , which specifies high recovery of the heavy key only,
(3) xbottom,hk ≥ 0.35. This loose specification corresponds to no need for separation.
Since reboiler duty increases if we achieve more than the specified recovery, we expect
these constraints to be active at the optimal solution. Figure 11.4 shows the liquid and vapor
flow rates from all of the trays for these three cases. In case (1), we require high recovery in
the top as well as the bottom, and the column operates with a high-reboiler load with liquid
and vapor present on all trays. In case (2), we require high recovery of the heavy key only.
Since we have no specification for the top products, there is vapor on all trays, but the trays
above the feed tray run dry to attain minimum reboiler heat duty. Case (3) requires no more
separation than is already present in the feed. The column is rewarded for not operating and
has no vapor on its trays. Instead, the feed runs down the column as a liquid and leaves
without change in composition.
All three cases were solved with IPOPT-C, a modification of IPOPT based on Reg(ε)
with adaptive adjustment of ε (see [319]). The size of the problem in the three cases and
the performance results are provided in Table 11.2. The number of constraints includes the
complementarity constraints as well. All problems were solved to a tolerance
of less than 10−6 in the KKT error and required less than 2 CPU seconds on a Pentium III,
667 MHz CPU running Linux.
Figure 11.4. Minimum energy distillation with liquid (solid lines) and vapor
(dashed lines) molar flow rates for (1) top and bottom specifications, (2) bottom speci-
fication, (3) no specifications.
Figure 11.5. Diagram of distillation column showing feed and reflux flows dis-
tributed according to the DDF. The gray section of the column is above the primary reflux
location and has negligible liquid flows.
where σf , σt = 0.5 are parameters in the distribution. Finally, we modify the overall mass,
energy, and component balances in Section 11.4.1 by allowing feed and reflux flow rates on
all trays i ∈ S.
This modification enables the placement of feeds, sidestreams, and number of trays
in the column to be continuous variables in the DDF. On the other hand, this approach can leave
the upper trays with no liquid flows, and this requires the complementarity constraints
(11.42). The resulting model is used to determine the optimal number of trays, reflux ratio,
and feed tray location for a benzene/toluene separation. The distillation model uses ideal
thermodynamics as in Section 11.4.1. Three MPCC formulations, modeled in GAMS and
solved with CONOPT, were considered for the following cases:
• Penalty formulation, PF(ρ) with ρ = 1000.
• Relaxed formulation, Reg(ε). This strategy was solved in two NLP stages with ε =
10−6 followed by ε = 10−12 .
• NCP formulation using the Fischer–Burmeister function (11.17). This approach is
equivalent to RegEq(ε) and was solved in three NLP stages with ε = 10−4 followed
by ε = 10−8 and ε = 10−12 .
The binary column has a maximum of 25 trays, its feed is 100 mol/s of a 70%/30% mixture
of benzene/toluene, and distillate flow is specified to be 50% of the feed. The reflux ratio is
allowed to vary between 1 and 20, the feed tray location varies between 2 and 20, and the
total tray number varies between 3 and 25. The objective function for the benzene–toluene
separation minimizes a weighted sum of the toluene impurity in the distillate, the reflux
ratio, and the number of trays,
where Nt is the number of trays, r = R/D is the reflux ratio, D is the distillate flow, xD,T oluene
is the toluene mole fraction, and wt, wr, and wn are the weighting parameters for each term;
these weights allow the optimization to trade off product purity, energy cost, and capital cost.
The column optimizations were initialized with 21 trays, a feed tray location at the seventh
tray, and a reflux ratio at 2.2. Temperature and mole fraction profiles were initialized with
linear interpolations based on the top and bottom product properties. The resulting GAMS
models consist of 353 equations and 359 variables for the Reg() and NCP formulations,
and 305 equations and 361 variables for the PF(ρ) formulation. The following cases were
considered:
• Case 1 (wt = 1, wr = 1, wn = 1): This represents the base case with equal weights for
toluene in distillate, reflux ratio, and tray count. As seen in the results in Table 11.3, the
optimal solution has an objective function value of 9.4723 with intermediate values of
r and Nt . This solution is found quickly by the PF(ρ) formulation. On the other hand,
the Reg(ε) formulation terminates close to this solution, while the NCP formulation
terminates early with poor progress.
• Case 2 (wt = 1, wr = 0.1, wn = 1): In this case, less emphasis is given to energy cost.
As seen in the results in Table 11.3, the optimal solution now has a lower objective
function value of 6.8103 along with a higher value of r and lower value of Nt . This
is found quickly by both PF(ρ) and Reg(ε), although only the former satisfies the
convergence tolerance. On the other hand, the NCP formulation again terminates
early with poor progress.
• Case 3 (wt = 1, wr = 1, wn = 0.1): In contrast to Case 2, less emphasis is now given
to capital cost. As seen in the results in Table 11.3, the optimal solution now has an
objective function value of 2.9048 with lower values of r and higher values of Nt .
This is found quickly by the Reg(ε) formulation. Although Reg(ε) does not satisfy
the convergence tolerance, the optimum could also be verified by PF(ρ). On the other
hand, PF(ρ) quickly converges to a slightly different solution, which it identifies as a
local optimum, while the NCP formulation requires more time to terminate with poor
progress.
Along with the numerical study in Section 11.2.2, these three cases demonstrate that
optimization of detailed distillation column models can be performed efficiently with MPCC
formulations. In particular, it can be seen that the penalty formulation (PF(ρ)) represents
a significant improvement over the NCP formulation both in terms of iterations and CPU
seconds; PF(ρ) offers advantages over the Reg(ε) formulation as well.
In this representation, t, z(t), u(t), ν(t), and σ (z(t)) are time, differential state variables, con-
trol variables, switching profiles, and guard (or switching) function, respectively. The scalar
switching function, σ (z(t)), determines transitions to different state models, represented by a
scalar switching profile ν(t) set to zero or one. At the transition point, where σ (t) = 0, a con-
vex combination of the two models is allowed with ν(t) ∈ [0, 1]. Note that if σ (z(t)) = 0 over
a nonzero period of time, ν(t) can be determined from smoothness properties of σ (z(t)); i.e.,
dσ/dt = ∇z σ (z(t))T [ν(t)f− (z(t), u(t)) + (1 − ν(t))f+ (z(t), u(t))] = 0. (11.47)
Also, for this problem class, we assume that the differential states z(t) remain continuous
over time. Existence and uniqueness properties of (11.46) have been analyzed in [132, 249].
We can express (11.46) equivalently through the addition of slack variable profiles and
complementarity constraints as
σ (z(t)) = s p (t) − s n (t), (11.48a)
dz/dt = ν(t)f− (z, u) + (1 − ν(t))f+ (z, u), ν(t) ∈ [0, 1], (11.48b)
0 ≤ s p (t) ⊥ ν(t) ≥ 0, (11.48c)
0 ≤ s n (t) ⊥ (1 − ν(t)) ≥ 0. (11.48d)
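As a scalar sketch of this formulation (our own example, not one from the text), take σ(z) = z with f−(z) = +1 (active for z < 0) and f+(z) = −1 (active for z > 0), i.e., dz/dt = −sign(z), which reaches z = 0 in finite time and then slides with ν = 1/2. One implicit Euler step then solves the step complementarity in closed form:

```python
def step(z_prev, h):
    """One implicit Euler step of z' = nu*(+1) + (1 - nu)*(-1) = 2*nu - 1,
    with the complementarity of (11.48) resolved in closed form."""
    if z_prev - h > 0.0:               # stays positive: s_p > 0 forces nu = 0
        return z_prev - h, 0.0
    if z_prev + h < 0.0:               # stays negative: s_n > 0 forces nu = 1
        return z_prev + h, 1.0
    nu = 0.5 * (1.0 - z_prev / h)      # sliding: z = 0 with nu in [0, 1]
    return 0.0, nu

z, h = 0.35, 0.1
for _ in range(10):
    z, nu = step(z, h)
# the trajectory hits the surface z = 0 and then stays there with nu = 1/2
```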
Using the simultaneous collocation approach from Chapter 10, the DAE is converted
into an NLP by approximating state and control profiles by a family of polynomials on finite
elements, defined by t0 < t1 < · · · < tN . In addition, the differential equation is discretized
using K-point Radau collocation on finite elements. As in Chapter 10, one can apply either
Lagrange interpolation polynomials (10.4) or the Runge–Kutta representation (10.5) for the
differential state profiles, with continuity enforced across element boundaries. The switching
ν(t) and control profiles u(t) are approximated using Lagrange interpolation polynomials
(10.17).
A straightforward substitution of the polynomial representations allows us to write
the collocation equations (10.6) for (11.48) as
dzK/dt (tik ) = νik f− (zik , uik ) + (1 − νik )f+ (zik , uik ), νik ∈ [0, 1],
σ (zik ) = s^p_ik − s^n_ik ,
0 ≤ s^p_ik ⊥ νik ≥ 0,
0 ≤ s^n_ik ⊥ (1 − νik ) ≥ 0,
i = 1, . . . , N , k = 1, . . . , K.
However, this formulation is not sufficient to enforce smoothness within an element for
z(t), ν(t), and u(t). For this, we allow a variable length hi ∈ [hL , hU ] for each finite element
(determined by the NLP) and allow sign changes in σ (z(t)) only at ti , the boundary of the
finite element. Also, a positive lower bound on hi is required to ensure that it does not go to
zero, and an upper bound on hi is chosen to limit the approximation error associated with
the finite element. Permitting sign changes in σ (z(t)) only at the boundary of a finite element
is enforced by choosing ν(t) to complement the L1 norm ∫_{ti−1}^{ti} |s p (t)| dt, and 1 − ν(t) to
complement ∫_{ti−1}^{ti} |s n (t)| dt. With this modification, the discretized formulation becomes
dzK/dt (tik ) = νik f− (zik , uik ) + (1 − νik )f+ (zik , uik ), νik ∈ [0, 1], (11.49a)

σ (zik ) = s^p_ik − s^n_ik , 0 ≤ Σ_{k′=0}^{K} s^p_ik′ ⊥ νik ≥ 0, (11.49b)

0 ≤ Σ_{k′=0}^{K} s^n_ik′ ⊥ (1 − νik ) ≥ 0, i = 1, . . . , N , k = 1, . . . , K, (11.49c)
where we define s^p_i0 = s^p_{i−1,K} and s^n_i0 = s^n_{i−1,K} for i = 2, . . . , N . Note that the comple-
mentarities are now formulated so that only one branch of the complement is allowed over the
element i, i.e., either σ (z(t)) ≥ 0 over the element with νik = 0, or σ (z(t)) ≤ 0 with νik = 1,
or σ (z(t)) = 0 over the entire element.
Moreover, for the last condition, we have an index-2 path constraint σ (z(t)) = 0 over the
finite element. In our direct transcription approach, the Radau collocation scheme allows us
to handle the high-index constraint directly as
σ (zik ) = 0, i = 1, . . . , N , k = 1, . . . , K,
and to obtain ν(t) implicitly through the solution of (11.49). As noted in Chapter 10 and
in [16], Radau collocation is stable and accurate for index-2 systems; the error in the dif-
ferential variables is O(h^{2K−1} ) and the error in the algebraic variables is reduced only
to O(h^K ).
We now generalize this formulation to multiple guard functions and switches that
define NT periods of positive length, indexed by l = 1, . . . , NT , along with hybrid index-1
DAE models given by
F (dz/dt , z(t), y(t), u(t), ν(t), p) = 0,
g(z(t), y(t), u(t), ν(t), p) ≤ 0,  t ∈ [tl−1 , tl ], l = 1, . . . , NT ,

σm (z(t)) < 0 =⇒ νm (t) = 1,
σm (z(t)) > 0 =⇒ νm (t) = 0,  t ∈ (tl−1 , tl ], m ∈ M, (11.50a)
σm (z(t)) = 0 =⇒ νm (t) ∈ [0, 1],

z(t0 ) = z0 , z(tNT ) = zf , (11.50b)
z(tl− ) = z(tl+ ), l = 1, . . . , NT − 1. (11.50c)
and apply the complementarity conditions within each finite element. This leads to the
following formulation:
F (dzK/dt (tik ), zik , yik , uik , νik , p) = 0, (11.51a)
g(zik , yik , uik , νik , p) ≤ 0, (11.51b)
z(ti− ) = z(ti+ ), (11.51c)
σm (zik ) = s^p_m,ik − s^n_m,ik , (11.51d)

0 ≤ Σ_{k′=0}^{K} s^p_m,ik′ ⊥ νm,ik ≥ 0, (11.51e)

0 ≤ Σ_{k′=0}^{K} s^n_m,ik′ ⊥ (1 − νm,ik ) ≥ 0, (11.51f )

i = 1, . . . , N , k = 1, . . . , K, m ∈ M, (11.51g)
z(t0 ) = z0 , z(tN ) = zf . (11.51h)
Equations (11.51) constitute the constraints for the discretized hybrid dynamic optimization
problem represented by the MPCC (11.2). Moreover, because a higher order IRK discretiza-
tion is used within smooth finite elements, we are able to enforce accurate solutions to the
hybrid dynamic system, with a relatively small number of finite elements.
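The slack variables in (11.51d)–(11.51f) simply split the guard value into its positive and negative parts. A minimal illustrative sketch of this split (not the NLP formulation itself), checking the complementarity s^p · s^n = 0 pointwise:

```python
def split_guard(sigma):
    """Split sigma into nonnegative slacks with sigma = s_p - s_n and s_p*s_n = 0."""
    s_p = max(sigma, 0.0)    # active when the guard value is positive
    s_n = max(-sigma, 0.0)   # active when the guard value is negative
    return s_p, s_n

for sigma in (-1.5, 0.0, 2.0):
    s_p, s_n = split_guard(sigma)
    assert abs((s_p - s_n) - sigma) < 1e-15   # recovers sigma
    assert s_p * s_n == 0.0                   # complementarity holds
```

At a solution of (11.51), ν_{m,ik} then selects which slack, and hence which branch of the dynamics, is active over element i.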
We present this case study in two parts. First, we demonstrate the NLP reformulation
on an example with increasingly many complementarity conditions. In the second part, we
apply the dynamic reformulation (11.51) to determine accurate solutions for this hybrid
system. A related discussion of this problem can also be found in [40].
min φ = (z_end − 5/3)^2 + h Σ_{i=1}^{N} (z_i)^2   (11.53a)
s.t.  ż_i = u_i + 2,   (11.53b)
      z_i = z_{i−1} + h ż_i,   (11.53c)
      z_i = s_i^+ − s_i^−,   (11.53d)
      0 ≤ 1 − u_i ⊥ s_i^+ ≥ 0,   (11.53e)
      0 ≤ u_i + 1 ⊥ s_i^− ≥ 0,  i = 1, . . . , N.   (11.53f)
Table 11.4. Solution times (Pentium 4, 1.8 GHz, 992 MB RAM) for different solution
strategies.
Objective CONOPT/PF IPOPT-C
N Function CPU s. Iterations CPU s. Iterations
10 1.4738 0.047 15 0.110 17
100 1.7536 0.234 78 1.250 41
1000 1.7864 9.453 680 28.406 78
2000 1.7888 35.359 1340 14.062 25
3000 1.7894 112.094 2020 106.188 84
4000 1.7892 211.969 2679 84.875 56
5000 1.7895 340.922 3342 199.391 87
6000 1.7898 468.891 3998 320.140 115
7000 1.7896 646.953 4655 457.984 141
8000 1.7898 836.891 5310 364.937 98
For this formulation, choosing K = 1 leads to an implicit Euler method with first order
accuracy. Since the differential equation is piecewise constant, implicit Euler integrates the
Table 11.5. Solution times (Pentium 4, 1.8 GHz, 992 MB RAM) for the MPCC formulation
with variable finite elements.
MPCC Formulation
N Objective Iters. CPU s.
10 1.5364 25 0.063
100 1.7766 97 0.766
1000 1.7889 698 23.266
2000 1.7895 1345 77.188
3000 1.7897 2009 166.781
4000 1.7898 2705 343.016
differential inclusion exactly and leads to exact identification of the switching locations. On
the other hand, the integrand in the objective function is not integrated exactly.
The discretized MPCC was solved with CONOPT using the NLP penalty reformulation
with ρ = 1000. The results are shown in Table 11.5. As in the case with fixed
elements, the total computational time for CONOPT grows approximately quadratically
with problem size.
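The penalty reformulation used here moves each complementarity into the objective: 0 ≤ a ⊥ b ≥ 0 becomes a, b ≥ 0 together with a penalized product ρ·a·b. A minimal sketch on a two-variable toy MPCC (an illustrative example, not the problem above), assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

rho = 1000.0  # penalty parameter, matching the value used in the text

def pf_objective(v):
    # min (x-1)^2 + (y-1)^2  s.t.  0 <= x  _|_  y >= 0,
    # with the complementarity replaced by the penalty term rho*x*y.
    x, y = v
    return (x - 1.0) ** 2 + (y - 1.0) ** 2 + rho * x * y

res = minimize(pf_objective, x0=[0.8, 0.1], method="L-BFGS-B",
               bounds=[(0.0, None), (0.0, None)])
x, y = res.x
print(x, y)  # one complementary branch is active: one variable ~1, the other ~0
```

Because the penalty vanishes on either branch (x = 0 or y = 0), a sufficiently large ρ leaves the branch solutions unperturbed, which is why a single NLP solve suffices.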
The analytic solution for the hybrid dynamic system is plotted in Figure 11.6. Starting
from z(0) = −2, z(t) is piecewise linear, and the influence of sgn(z) can be seen clearly from
the plot of dz/dt. Moreover, from the analytic solution, it can be shown that z(t) and the
objective function, φ, are both differentiable in z(0). On the other hand, a discretized problem
with hi fixed does not locate the transition point accurately and this leads to inaccurate
profiles for z(t) and ν(t). As discussed in [371], the application of fixed finite elements also
leads to a nonsmooth dependence of the solution on z(0). In Figure 11.7 the plot for N = 100
fixed elements shows a sawtooth behavior of φ versus z(0). In contrast, with variable finite elements, the objective function varies smoothly with z(0). This occurs because the NLP
solver can now locate the switching points accurately, and the complementarity formulation
requires the differential state z(t) to remain smooth within an element.
Moreover, the Euler discretization captures the piecewise linear z(t) profile exactly
and varies smoothly with z(0). On the other hand, because φ still has local errors of O(h_i^2),
Figure 11.7 shows that the plot of the analytically determined objective function still differs
from the Euler discretization, despite the accurate determination of z(t). When the problem
is solved again with K = 3, Radau collocation has fifth-order accuracy and the numerical solution
matches the analytical solution within machine precision. Nevertheless, for both K = 1 and
K = 3 smoothness is still maintained in Figure 11.7 because the discontinuity is located
exactly, and the integral is evaluated over the piecewise smooth portions, not across the
discontinuity.
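The analytic objective value can be evaluated in closed form under an assumed reconstruction of the dynamics (the defining differential inclusion appears earlier in the section, not in this excerpt): ż = 2 − sgn(z) on [0, 2] with z(0) = −2, which yields φ = 145/81 ≈ 1.7901, consistent with the limiting objective values in Tables 11.4 and 11.5:

```python
# Piecewise-exact evaluation of phi for the assumed hybrid system
# dz/dt = 2 - sgn(z), z(0) = -2, on [0, T] with T = 2 (hypothetical values,
# chosen for consistency with the objectives reported in Tables 11.4-11.5).
z0, T = -2.0, 2.0

t_sw = -z0 / 3.0          # z < 0: dz/dt = 3, so z reaches zero at t = 2/3
z_end = T - t_sw          # z > 0: dz/dt = 1, so z(T) = T - t_sw = 4/3

# phi = (z_end - 5/3)^2 + integral_0^T z(t)^2 dt, integrated piecewise:
int_neg = -(z0 ** 3) / 9.0        # integral over [0, t_sw] of (z0 + 3t)^2 = 8/9
int_pos = (T - t_sw) ** 3 / 3.0   # integral over [t_sw, T] of (t - t_sw)^2 = 64/81
phi = (z_end - 5.0 / 3.0) ** 2 + int_neg + int_pos
print(phi)   # 145/81 ~ 1.7901
```

The discretized objectives in the tables (≈1.7898 for large N) approach this value at the expected first-order rate for the K = 1 scheme.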
[Figure (tank cascade schematic): feed valve V0, Tanks 1–3 with levels L1, L2, L3, connecting valves V1, V2, and heights H2, H3.]
Implicit Euler (K = 1) was used to discretize the differential equations and IPOPT was
used to solve this set of problems using the PF(ρ) formulation. The model was initially
solved with 10 finite elements per time interval, 10 time intervals, and 3 tanks, as the base
case scenario. The state trajectories for the solution of the base case problem are plotted in
Figure 11.9. Plot of state trajectories of tank levels Li for base case.
[Figure 11.10: control profiles w_0(t), w_1(t), w_2(t), w_3(t) for the base case, plotted against time step.]
Figure 11.9, while the corresponding control profiles are given in Figure 11.10. The tank
levels reach the target during the first two time intervals and remain at the target setpoint
for all subsequent intervals. Note that the valve openings remain at intermediate values for
these target levels.
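The qualitative settling behavior in Figures 11.9–11.10 can be mimicked with a simple time-stepping simulation. The model below is a hypothetical stand-in (constant feed, gravity-driven outflow w·c·√L, with max(L, 0) under the square root playing the role of the check valve), not the formulation actually optimized in the text:

```python
import math

# Hypothetical three-tank cascade: each tank drains into the next through a
# valve at fixed opening w; max(L, 0) under the sqrt prevents backflow once
# a tank is empty (the check-valve effect).
feed, w, c, dt = 0.2, 0.5, 1.0, 0.01
levels = [0.0, 0.0, 0.0]

for _ in range(10000):                        # integrate to t = 100 s
    outflows = [w * c * math.sqrt(max(L, 0.0)) for L in levels]
    inflows = [feed] + outflows[:-1]          # tank i is fed by tank i-1
    levels = [L + dt * (fin - fout)
              for L, fin, fout in zip(levels, inflows, outflows)]

print(levels)  # all levels settle near the steady state (feed/(w*c))**2 = 0.16
```

At steady state every flow equals the feed, so each level approaches (feed/(w·c))², mirroring the intermediate valve openings and constant target levels described above.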
Table 11.6. Scaling of solution time with the number of finite elements (N) with
3 tanks and 10 time intervals.
N Time (s) Iterations
10 1.703 28
20 7.297 42
30 12.125 41
40 23.641 48
50 28.078 43
60 40.219 47
70 46.063 43
80 63.094 45
90 80.734 48
100 90.234 44
Table 11.7. Scaling of solution time with the number of tanks with 10 time intervals
and 10 elements per interval.
Tanks Time (s) Iterations
3 1.703 28
4 4.157 45
5 9.984 69
6 7.484 40
7 14.968 61
8 11.156 40
9 13.906 41
10 78.703 122
11 20.266 43
12 28.203 52
From this base case the number of finite elements per time interval and the number of
tanks are varied independently, with the solution times (on a Pentium 4 CPU, 1.8 GHz,
992 MB RAM) and iteration counts given in Tables 11.6 and 11.7 for complementarities
representing up to 3000 discrete switches. The solution time grows linearly with increasing
numbers of tanks, and grows between linearly and quadratically with the number of finite
elements. The increase in solution times is largely the result of the increased cost of the
sparse linear factorization of the KKT matrix, which increases linearly with the size of
the problem. As both the number of finite elements and the number of tanks are increased,
the solution times never increase more than quadratically.
not satisfy constraint qualifications and have nonunique and unbounded multipliers. Conse-
quently, the MPCC must be reformulated to a well-posed nonlinear program. This chapter
describes a number of regularization and penalty formulations for MPCCs. In particular, the
penalty formulation is advantageous, as it can be solved with standard NLP algorithms and
does not require a sequence of nonlinear programs to be solved. In addition to summarizing
convergence properties of these formulations, a small numerical comparison is presented
on a collection of MPCC test problems.
The modeling of complementarity constraints has a strong influence on the success-
ful solution of the MPCC. Here we motivate the development of complementarity models
through formulation of bilevel optimization problems, where the lower level problem is con-
vex, and present guidelines for the formulation of complementarity constraints for applica-
tions that include flow reversals, check valves, controller saturation, and phase equilibrium.
These are demonstrated on the optimization of two distillation case studies.
The MPCC-based formulation is also extended to a class of hybrid dynamic opti-
mization problems, where the differential states are continuous over time. These include
differential inclusions of the Filippov type. Using direct transcription formulations, the lo-
cation of switching points is modeled by allowing variable finite elements and by enforcing
only one branch of the complementarity condition across the finite element. Any nonsmooth
behavior from a change in these conditions is thus forced to occur only at the finite ele-
ment boundary. As in Chapter 10, high-order IRK discretization is used to ensure accurate
solutions in each finite element.
Two case studies illustrate this approach. The first demonstrates the MPCC formulation
for hybrid systems, as well as how the cost of the MPCC solution strategy grows
with problem size. In particular, with fixed finite elements the problem can
exhibit nonsmoothness in the parametric sensitivities, while moving finite elements lead to
smooth solutions. This study also demonstrates that solution times grow polynomially with
the number of complementarity constraints and problem size. In the second case study an
optimal control profile was determined for a set of cascading tanks with check valves. Here
the solution time grows no worse than quadratically with respect to the number of tanks
and the number of finite elements per time interval.
[130, 267]. In a parallel development, researchers have also investigated the applicability of
general NLP algorithms for the solution of MPECs. Anitescu [9] analyzed convergence be-
havior of a certain class of NLP algorithms when applied to MPECs, and Ralph and Wright
[326] examined convergence properties for a wide range of NLP formulations for MPCCs.
Moreover, the modeling of complementarities and MPCCs has become a very useful feature
in optimization modeling environments. Ferris et al. [129] surveyed software advances for
MPCCs and discussed extended mathematical programming frameworks for this class of
problems.
MPCC formulations that arise in process engineering are described in Baumrucker,
Renfro, and Biegler [33]. Applications of MPCCs in process engineering frequently stem
from multilevel optimization problems. Clark and Westerberg [96] were among the first to
reformulate such problems as an MPCC by replacing the inner minimization problem with
its stationary conditions. Moreover, flexibility and resilience problems involved in design
and operation under uncertainty have been formulated and solved as bilevel and trilevel
optimization problems [376, 176]. Complementarity formulations of vapor-liquid phase
equilibrium have been derived from the minimization of Gibbs free energy [167], and these
complementarity conditions have been extended to the optimization of trayed distillation
columns [320]. Complementarity models have also been applied to parameter estimation of
core flood and reservoir models [221]. Other MPCC applications include optimization of
metabolic flux networks [323], real-time optimization (RTO) of chemical processes [363],
hybrid dynamic systems [191], dynamic startup of fuel cell systems [218], and the design of
robust control systems [416]. In particular, this approach extends to optimization problems
with DAE and PDE models that include equilibrium phase transitions [320, 221]. Moreover,
Raghunathan et al. [322, 321] and Hjersted and Henson [198] considered optimization
problems for metabolic network flows in cellular systems (modeled as linear programs in
the inner problem).
MPCC formulations that arise in hybrid dynamic systems are described in [32, 320,
40]. For the application of complementarity constraints to hybrid dynamic systems, a survey
of differential inclusions (DIs) along with an analysis of several classes of DIs is given by
Dontchev and Lempio [117]. More recently, Pang and Stewart [303] introduced and ana-
lyzed differential variational inequalities (DVIs). Their study includes development of time-
stepping methods along with conditions for existence of solutions, convergence of particular
methods, and rates of convergence with respect to time steps. Consistency of time-stepping
methods for DVIs and complementarity systems was also investigated in [191, 192, 193].
DVIs form a broad problem class that unifies several other fields, including ODEs with smooth
or discontinuous right-hand sides, DAEs, and dynamic complementarity systems. Leine
and Nijmeijer [249] considered dynamics and bifurcation of nonsmooth Filippov systems
with applications in mechanics. Control and optimization of hybrid systems have been ac-
tive application areas for DVIs, and smoothing methods, nonsmooth solution techniques,
and mixed integer approaches have been considered in a number of studies including
[371, 29, 30, 374, 18, 297]. In particular, the last two approaches deal with optimization
formulations with discrete decisions.
11.9 Exercises
1. Show the equivalence between the Fischer–Burmeister function and the RegEq() for-
mulation. Additional smoothing functions can be derived by (i) noting the equivalence
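As a numerical aside on this exercise (a check, not the requested proof): the Fischer–Burmeister function φ_FB(a, b) = a + b − √(a² + b²) vanishes exactly on the complementarity set {a ≥ 0, b ≥ 0, ab = 0}:

```python
import math

def fischer_burmeister(a, b):
    """phi(a, b) = a + b - sqrt(a^2 + b^2); zero iff 0 <= a _|_ b >= 0."""
    return a + b - math.sqrt(a * a + b * b)

# Complementary pairs give exactly zero...
assert fischer_burmeister(0.0, 5.0) == 0.0
assert fischer_burmeister(3.0, 0.0) == 0.0
# ...while violations do not: positive when both arguments are positive,
# negative when either argument is negative.
assert fischer_burmeister(2.0, 2.0) > 0.0
assert fischer_burmeister(-1.0, 0.0) < 0.0
```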
Bibliography
[1] NEOS Wiki. Mathematics and Computer Science Division, Argonne National Laboratory, http://wiki.mcs.anl.gov/NEOS/index.php/NEOS_wiki, 1996.
[2] Aspen Custom Modeler User’s Guide. Technical report, Aspen Technology,
http://www.aspentech.com, 2002.
[3] gPROMS User’s Guide. Technical report, Process Systems Enterprises Limited,
http://www.psenterprise.com, 2002.
[4] JACOBIAN dynamic modeling and optimization software. Technical report, Numer-
ica Technology LLC, http://www.numericatech.com, 2005.
[5] J. Abadie and J. Carpentier. Generalization of the Wolfe reduced gradient method to
the case of nonlinear constraints. In R. Fletcher, editor, Optimization, pages 37–49.
Academic Press, New York, 1969.
[6] N. Adhya and N. Sahinidis. A Lagrangian Approach to the Pooling Problem. Ind.
Eng. Chem. Res., 38:1956–1972, 1999.
[7] J. Albuquerque and L.T. Biegler. Decomposition Algorithms for On-Line Estimation
with Nonlinear DAE Models. Comput. Chem. Eng., 21:283–294, 1997.
[8] G. M. Aly and W. C. Chan. Application of a Modified Quasilinearization Technique
to Totally Singular Optimal Problems. Int. J. Control, 17:809–815, 1973.
[9] M. Anitescu. On Using the Elastic Mode in Nonlinear Programming Approaches
to Mathematical Programs with Complementarity Constraints. SIAM J. Optim.,
15:1203–1236, 2005.
[10] M. Anitescu, P. Tseng, and S.J. Wright. Elastic-Mode Algorithms for Mathemati-
cal Programs with Equilibrium Constraints: Global Convergence and Stationarity
Properties. Math. Program., 110:337–371, 2007.
[11] N. Arora and L. T. Biegler. A Trust Region SQP Algorithm for Equality Constrained
Parameter Estimation with Simple Parametric Bounds. Comput. Optim. Appl.,
28:51–86, 2004.
[12] J. J. Arrieta-Camacho and L. T. Biegler. Real Time Optimal Guidance of Low-Thrust
Spacecraft: An Application of Nonlinear Model Predictive Control. Ann. N.Y. Acad.
Sci., 1065:174, 2006.
[13] U. Ascher, J. Christiansen, and R. Russell. Collocation Software for Boundary Value
ODEs. ACM Trans. Math. Software, 7:209–222, 1981.
[14] U. M. Ascher, R. M. M. Mattheij, and R. D. Russell. Numerical Solution of Boundary
Value Problems for Ordinary Differential Equations. Classics in Appl. Math. 13. SIAM, Philadelphia, 1995.
[15] U. M. Ascher and R. J. Spiteri. Collocation Software for Boundary Value Differential-
Algebraic Equations. SIAM J. Sci. Comput., 15:938–952, 1994.
[16] U. M. Ascher and L. R. Petzold. Computer Methods for Ordinary Differential Equa-
tions and Differential-Algebraic Equations. SIAM, Philadelphia, 1998.
[17] C. Audet, P. Hansen, and J. Brimberg. Pooling Problem: Alternate Formulations and
Solution Methods. Les Cahiers du GERAD G, 23–31, 2000.
[18] M. Avraam, N. Shah, and C. C. Pantelides. Modeling and Optimization of General
Hybrid Systems in Continuous Time Domain. Comput. Chem. Eng., 22:S221–S224, 1998.
[19] R. Bachmann, L. Bruell, T. Mrziglod, and U. Pallaske. On Methods of Reducing
the Index of Differential-Algebraic Equations. Comput. Chem. Eng., 14:1271–1276,
1990.
[20] J. K. Bailey, A. N. Hrymak, S. S. Treiber, and R. B. Hawkins. Nonlinear Optimization
of a Hydrocracker Fractionation Plant. Comput. Chem. Eng., 17:123–130, 1993.
[21] S. Balakrishna and L.T. Biegler. A Unified Approach for the Simultaneous Synthesis
of Reaction, Energy, and Separation Systems. Ind. Eng. Chem. Res., 32:1372–1382,
1993.
[22] J. R. Banga and W. D. Seider. Global optimization of chemical processes using
stochastic algorithms. In C. Floudas and P. Pardalos, editors, State of the Art in
Global Optimization, Kluwer, Dordrecht, page 563, 1996.
[23] Y. Bard. Nonlinear Parameter Estimation. Academic Press, New York, 1974.
[24] R. A. Bartlett. MOOCHO: Multi-functional Object-Oriented arCHitecture for
Optimization. http://trilinos.sandia.gov/packages/moocho/, 2005.
[25] R. A. Bartlett and L. T. Biegler. rSQP++ : An Object-Oriented Framework for
Successive Quadratic Programming. In O. Ghattas, M. Heinkenschloss, D. Keyes,
L. Biegler, and B. van Bloemen Waanders, editors, Large-Scale PDE-Constrained
Optimization, page 316. Lecture Notes in Computational Science and Engineering,
Springer, Berlin, 2003.
[26] R. A. Bartlett and L. T. Biegler. QPSchur: A Dual, Active Set, Schur Complement
Method for Large-Scale and Structured Convex Quadratic Programming Algorithm.
Optim. Eng., 7:5–32, 2006.
[27] P. I. Barton and C. C. Pantelides. The Modeling of Combined Discrete/Continuous
Processes. AIChE J., 40:966–979, 1994.
[60] H. G. Bock. Recent advances in parameter identification techniques for ODE. In Numerical Treatment of Inverse Problems in Differential and Integral Equations, 1983.
[62] H. G. Bock and K. J. Plitt. A Multiple Shooting Algorithm for Direct Solution of
Optimal Control Problems. Ninth IFAC World Congress, Budapest, 1984.
[64] I. Bongartz, A. R. Conn, N. I. M. Gould, and Ph. L. Toint. CUTE: Constrained and
Unconstrained Testing Environment. ACM Trans. Math. Software, 21:123–142, 1995.
[67] E. A. Boss, R. Maciel Filho, V. De Toledo, and E. Coselli. Freeze Drying Process:
Real Time Model and Optimization. Chem. Eng. Process., 12:1475–1485, 2004.
[68] C. L. Bottasso and A. Croce. Optimal Control of Multibody Systems Using an Energy
Preserving Direct Transcription Method. Multibody Sys. Dyn., 12/1:17–45, 2004.
[72] A. E. Bryson and Y. C. Ho. Applied Optimal Control. Hemisphere, Washington, DC,
1975.
[73] D. S. Bunch, D. M. Gay, and R. E. Welsch. Algorithm 717: Subroutines for Maximum
Likelihood and Quasi-Likelihood Estimation of Parameters in Nonlinear Regression
Models. ACM Trans. Math. Software, 19:109–120, 1993.
[74] J. R. Bunch and L. Kaufman. Some Stable Methods for Calculating Inertia and Solving
Symmetric Linear Systems. Math. Comput., 31:163–179, 1977.
[126] W. R. Esposito and C. A. Floudas. Global Optimization for the Parameter Estimation
of Differential-Algebraic Systems. Ind. Eng. Chem. Res., 5:1291–1310, 2000.
[130] M. C. Ferris and J. S. Pang (eds.). Complementarity and Variational Problems: State of the Art. Proceedings of the International Conference on Complementarity Problems (Baltimore, MD, 1995). SIAM, Philadelphia, 1997.
[133] W. Fleming and R. Rishel. Deterministic and Stochastic Optimal Control. Springer,
Berlin, 1975.
[135] R. Fletcher, N. I. M. Gould, S. Leyffer, Ph. L. Toint, and A. Wächter. Global Conver-
gence of a Trust-Region SQP-Filter Algorithm for General Nonlinear Programming.
SIAM J. Optim., 13:635–659, 2002.
[136] R. Fletcher and S. Leyffer. User Manual for FilterSQP. Technical Report, Numerical
Analysis Report, NA/181, University of Dundee, 1999.
[137] R. Fletcher and S. Leyffer. Nonlinear Programming without a Penalty Function. Math.
Programm., 91:239–269, 2002.
[139] R. Fletcher, S. Leyffer, and Ph. L. Toint. A Brief History of Filter Methods.
SIAG/Optimization Views-and-News, 18:2–12, 2007.
[140] R. Fletcher and W. Morton. Initialising Distillation Column Models. Comput. Chem.
Eng., 23:1811–1824, 2000.
[141] A. Flores-Tlacuahuac, L. T. Biegler, and E. Saldivar-Guerra. Dynamic Optimiza-
tion of HIPS Open-Loop Unstable Polymerization Reactors. Ind. Eng. Chem. Res.,
8:2659–2674, 2005.
[142] A. Flores-Tlacuahuac and I. E. Grossmann. Simultaneous Cyclic Scheduling and
Control of a Multiproduct CSTR. Ind. Eng. Chem. Res., 20:6698–6712, 2006.
[143] C. A. Floudas. Nonlinear and Mixed Integer Optimization: Fundamentals and Ap-
plications. Oxford University Press, New York, 1995.
[144] C. A. Floudas. Deterministic Global Optimization: Theory, Algorithms and Applica-
tions. Kluwer Academic Publishers, Norwell, MA, 2000.
[145] fmincon: An SQP Solver. MATLAB User’s Guide. The MathWorks, 2009.
[146] J. F. Forbes and T. E. Marlin. Design Cost: A Systematic Approach to Technology
Selection for Model-Based Real-Time Optimization Systems. Comput. Chem. Eng.,
20:717–734, 1996.
[147] A. Forsgren, P. E. Gill, and M. H. Wright. Interior Methods for Nonlinear Optimiza-
tion. SIAM Review, 44:525–597, 2002.
[148] R. Fourer, D. M. Gay, and B. W. Kernighan. AMPL: A Modeling Language for Math-
ematical Programming. Duxbury Press/Brooks/Cole Publishing Company, 2002.
[149] R. Franke and J. Doppelhamer. Real-time implementation of nonlinear model predic-
tive control of batch processes in an industrial framework. In Assessment and Future
Directions of NMPC, pp. 465–472. Springer, Berlin, 2007.
[150] R. Franke and J. Engell. Integration of advanced model based control with industrial
IT. In R. Findeisen, F. Allgöwer, and L.T. Biegler, editors, Assessment and Future
Directions of Nonlinear Model Predictive Control, pages 399–406. Springer, 2007.
[151] S. E. Gallun, R. H. Luecke, D. E. Scott, and A. M. Morshedi. Use Open Equations
for Better Models. Hydrocarbon Processing, pages 78–90, 1992.
[152] GAMS. Model Library Index. GAMS Development Corporation (2010).
[153] M. A. Gaubert and X. Joulia. Tools for Computer Aided Analysis and Interpretation
of Process Simulation Results. Comput. Chem. Eng., 21:S205–S210, 1997.
[154] D. M. Gay. Computing Optimal Locally Constrained Steps. SIAM J. Sci. Statist.
Comput., 2:186–197, 1981.
[155] D. M. Gay. A Trust-Region Approach to Linearly Constrained Optimization. Numer-
ical Analysis Proceedings (Dundee, 1983), D. F. Griffiths, editor, Springer, 1983.
[156] D. M. Gay. Algorithm 611: Subroutine for Unconstrained Minimization Using a
Model/Trust-Region Approach. ACM Trans. Math. Software, 9:503–524, 1983.
[173] A. Griewank, D. Juedes, and J. Utke. ADOL-C: A Package for the Automatic Differ-
entiation of Algorithms written in C/C++. ACM Trans. Math. Software, 22:131–167,
1996.
[174] L. Grippo, F. Lampariello, and S. Lucidi. A Nonmonotone Line Search Technique
for Newton’s Method. SIAM J. Numer. Anal, 23:707–716, 1986.
[175] I. E. Grossmann and L. T. Biegler. Part II: Future Perspective on Optimization. Com-
put. Chem. Eng., 8:1193–1218, 2004.
[176] I. E. Grossmann and C. A. Floudas. Active Constraint Strategy for Flexibility Analysis
in Chemical Process. Comput. Chem. Eng., 11:675–693, 1987.
[177] M. Grötschel, S. Krumke, and J. Rambau (eds.). Online Optimization of Large Sys-
tems. Springer, Berlin, 2001.
[178] W. W. Hager. Rates of Convergence for Discrete Approximations to Unconstrained
Control Problems. SIAM J. Numer. Anal., 13:449–472, 1976.
[179] W. W. Hager. Runge-Kutta Methods in Optimal Control and the Transformed Adjoint
System. Numer. Math., 87:247–282, 2000.
[180] W. W. Hager. Minimizing a Quadratic over a Sphere. SIAM J. Optim., 12:188–208,
2001.
[181] W. W. Hager and S. Park. Global Convergence of SSM for Minimizing a Quadratic
over a Sphere. Math Comp., 74:1413–1423, 2005.
[182] E. Hairer, S. P. Norsett, and G. Wanner. Solving Ordinary Differential Equations I:
Nonstiff Problems. Springer Series in Computational Mathematics, Springer, Berlin,
2008.
[183] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II: Stiff and
Differential-Algebraic Problems. Springer Series in Computational Mathematics,
Springer, Berlin, 2002.
[184] S. P. Han. Superlinearly Convergent Variable Metric Algorithms for General Nonlin-
ear Programming Problems. Math. Programm., 11:263–282, 1976.
[185] S. P. Han and O. L. Mangasarian. Exact Penalty Functions in Nonlinear Program-
ming. Math. Programm., 17:251–269, 1979.
[186] R. Hannemann and W. Marquardt. Continuous and Discrete Composite Adjoints for
the Hessian of the Lagrangian in Shooting Algorithms for Dynamic Optimization.
SIAM J. Sci. Comput., 31:4675–4695, 2010.
[187] P. T. Harker and J. S. Pang. Finite-Dimensional Variational Inequalities and Com-
plementarity Problems: A Survey of Theory, Algorithms and Applications. Math.
Programm., 60:161–220, 1990.
[188] R. F. Hartl, S. P. Sethi, and R. G. Vickson. A Survey of the Maximum Principles for
Optimal Control Problems with State Constraints. SIAM Review, 37:181–218, 1995.
[189] L. Hasdorff. Gradient Optimization and Nonlinear Control. Wiley, New York, 1976.
[190] C.A. Haverly. Studies of the Behavior of Recursion for the Pooling Problem. SIGMAP
Bull., page 25, 1978.
[194] G. A. Hicks and W. H. Ray. Approximation Methods for Optimal Control Synthesis.
Can. J. Chem. Eng., 40:522–529, 1971.
[197] A. C. Hindmarsh and R. Serban. User Documentation for CVODES, An ODE Solver
with Sensitivity Analysis Capabilities. LLNL Report UCRL-MA-148813, 2002.
[199] W. Hock and K. Schittkowski. Test Examples for Nonlinear Programming Codes.
J. Optim. Theory Appl., 30:127–129, 1980. http://www.math.uni-bayreuth.de/~kschittkowski/tp_coll1.htm.
[202] R. Hooke and T. A. Jeeves. Direct Search Solution of Numerical and Statistical
Problems. J. ACM, 8:212–220, 1961.
[207] R. Jackson. Optimal Use of Mixed Catalysts for Two Successive Chemical Reactions.
J. Optim. Theory Appl., 2/1:27–39, 1968.
[209] Shi-Shang Jang and Pin-Ho Lin. Discontinuous Minimum End-Time Tempera-
ture/Initiator Policies for Batch Emulsion Polymerization of Vinyl Acetate. Chem.
Eng. Sci., 46:12–19, 1991.
[210] L. Jiang, L. T. Biegler, and V. G. Fox. Simulation and Optimization of Pressure Swing
Adsorption Systems for Air Separation. AIChE J., 49, 5:1140, 2003.
[213] B. S. Jung, W. Mirosh, and W. H. Ray. Large Scale Process Optimization Techniques Applied to Chemical and Petroleum Processes. Can. J. Chem. Eng., 49:844–851, 1971.
[215] J. Kadam and W. Marquardt. Integration of economical optimization and control for
intentionally transient process operation. In R. Findeisen, F. Allgöwer, and L. Biegler,
editors, Assessment and future directions of nonlinear model predictive control, pages
419–434. Springer, Berlin, 2007.
[216] P. Kall and S. W. Wallace. Stochastic Programming. John Wiley and Sons, Chichester,
1994.
[219] S. Kameswaran and L.T. Biegler. Convergence Rates for Direct Transcription of
Optimal Control Problems with Final-Time Equality Constraints Using Collocation
at Radau Points. In Proc. 2006 American Control Conference, pages 165–171, 2006.
[252] J. Leis and M. Kramer. The Simultaneous Solution and Sensitivity Analysis of Sys-
tems Described by Ordinary Differential Equations. ACM Trans. Math. Software,
14:45–60, 1988.
[253] M. Lentini and V. Pereyra. PASVA4: An Ordinary Boundary Solver for Problems with
Discontinuous Interfaces and Algebraic Parameters. Mat. Apl. Comput., 2:103–118,
1983.
[255] S. Leyffer, G. López-Calva, and J. Nocedal. Interior Methods for Mathematical Pro-
grams with Complementarity Constraints. SIAM J. Optim., 17:52–77, 2006.
[256] S. Leyffer and T. S. Munson. A globally convergent filter method for MPECs. Techni-
cal Report ANL/MCS-P1457-0907, Argonne National Laboratory, Mathematics and
Computer Science Division, 2007.
[258] S. Li and L. Petzold. Design of new DASPK for sensitivity analysis. Technical Report,
Dept. of Computer Science, UCSB, 1999.
[260] J. Liebman, L. Lasdon, L. Schrage, and A. Waren. Modeling and Optimization with
GINO. The Scientific Press, Palo Alto, CA, 1986.
[261] C. J. Lin and J. J. Moré. Newton’s Method for Large Bound-Constrained Optimization
problems. SIAM J. Optim., 9:1100–1127, 1999.
[265] A. Lucia and J. Xu. Methods of Successive Quadratic Programming. Comput. Chem.
Eng., 18:S211–S215, 1994.
[266] A. Lucia, J. Xu, and M. Layn. Nonconvex Process Optimization. Comput. Chem.
Eng., 20:1375–1398, 1996.
[267] Z.-Q. Luo, J.-S. Pang, and D. Ralph. Mathematical Programs with Equilibrium Con-
straints. Cambridge University Press, Cambridge, 1996.
[268] R. Luus and T. H. I. Jaakola. Direct Search for Complex Systems. AIChE J. 19:645–
646, 1973.
[269] J. Macki and A. Strauss. Introduction to Optimal Control Theory. Springer, Berlin,
1982.
[270] L. Magni and R. Scattolini. Robustness and robust design of MPC for nonlinear
systems. In Assessment and Future Directions of NMPC, pp. 239–254. Springer,
Berlin, 2007.
[271] R. Mahadevan, J. S. Edwards, and F. J. Doyle III. Dynamic Flux Balance Analysis of Diauxic Growth in Escherichia coli. Biophys. J., 83:1331–1340, 2002.
[272] T. Maly and L. R. Petzold. Numerical Methods and Software for Sensitivity Analysis
of Differential-Algebraic Systems. Appl. Numer. Math., to appear.
[279] A. Miele. Gradient algorithms for the optimization of dynamic systems. In Leondes
C.T., editor, Control and Dynamic Systems: Advances in Theory and Applications,
volume 16, pages 1–52. Academic Press, New York, 1980.
[297] J. Oldenburg and W. Marquardt. Disjunctive Modeling for Optimal Control of Hy-
brid Systems. Comput. Chem. Eng., 32:2346–2364, 2008.
[301] D. B. Özyurt and P. I. Barton. Cheap Second Order Directional Derivatives of Stiff
ODE Embedded Functionals. SIAM J. Sci. Comput., 26:1725–1743, 2005.
[302] B. Pahor and Z. Kravanja. Simultaneous Solution and MINLP Synthesis of DAE Pro-
cess Problems: PFR Networks in Overall Processes. Comput. Chem. Eng., 19:S181–
S188, 1995.
[307] H. J. Pesch. A practical guide to the solution of real-life optimal control problems.
Control Cybernetics, 23:7–60, 1994.
[317] A. Prata, J. Oldenburg, A. Kroll, and W. Marquardt. Integrated Scheduling and Dy-
namic Optimization of Grade Transitions for a Continuous Polymerization Reactor.
Comput. Chem. Eng., 32:463–476, 2008.
[318] R. Pytlak. Numerical Methods for Optimal Control Problems with State Constraints.
Lecture Notes in Mathematics, Vol. 1707, Springer, Berlin, 1999.
[343] H. Sagan. Introduction to the Calculus of Variations. Dover Publications, New York,
1992.
[345] L. O. Santos, N. de Oliveira, and L. T. Biegler. Reliable and efficient optimization strategies for nonlinear model predictive control. In J. B. Rawlings, W. Marquardt, D. Bonvin, and S. Skogestad, editors, Proc. Fourth IFAC Symposium on Dynamics and Control of Chemical Reactors, Distillation Columns and Batch Processes (DYCORD '95), page 33. Pergamon, 1995.
[347] R. W. H. Sargent. Reduced gradient and projection methods for nonlinear program-
ming. In P. E. Gill and W. Murray, editors, Numerical Methods in Constrained Opti-
mization, pages 140–174. Academic Press, New York, 1974.
[352] A. Schiela and M. Weiser. Superlinear Convergence of the Control Reduced Interior
Point Method for PDE Constrained Optimization. Comput. Optim. Appl., 39:369–
393, 2008.
[356] C. Schmid and L. T. Biegler. Quadratic Programming Methods for Tailored Reduced
Hessian SQP. Comput. Chem. Eng., 18:817, 1994.
[374] O. Stursberg, S. Panek, J. Till, and S. Engell. Generation of optimal control policies for
systems with switched hybrid dynamics. Modelling, Analysis, and Design of Hybrid
Systems, 279/LNCIS:337–352, 2002.
[377] P. Tanartkit and L. T. Biegler. A Nested, Simultaneous Approach for Dynamic Optimization Problems. Comput. Chem. Eng., 20:735, 1996.
[378] P. Tanartkit and L. T. Biegler. A Nested Simultaneous Approach for Dynamic Opti-
mization Problems II: The Outer Problem. Comput. Chem. Eng., 21:1365, 1997.
[380] D. Ternet and L. T. Biegler. Recent Improvements to a Multiplier Free Reduced Hes-
sian Successive Quadratic Programming Algorithm. Comput. Chem. Eng., 22:963–
978, 1998.
[381] J. Till, S. Engell, S. Panek, and O. Stursberg. Applied Hybrid System Optimization:
An Empirical Investigation of Complexity. Control Eng. Prac., 12:1291–1303, 2004.
[383] I.-B. Tjoa. Simultaneous Solution and Optimization Strategies for Data Analysis. Ph.D. Thesis, Carnegie Mellon University, 1991.
[384] I.-B. Tjoa and L. T. Biegler. Simultaneous Solution and Optimization Strategies for
Parameter Estimation of Differential-Algebraic Equation Systems. Ind. Eng. Chem.
Res., 30:376–385, 1991.
[385] M. J. Todd and Y. Ye. A Centered Projective Algorithm for Linear Programming.
Math. Oper. Res., 15:508–529, 1990.
[386] J. Tolsma and P. Barton. DAEPACK: An Open Modeling Environment for Legacy
Models. Ind. Eng. Chem. Res., 39:1826–1839, 2000.
[387] A. Toumi, M. Diehl, S. Engell, H. G. Bock, and J. P. Schlöder. Finite Horizon Opti-
mizing Control of Advanced SMB Chromatographic Processes. In 16th IFAC World
Congress, Prague, 2005.
[391] R. J. Vanderbei and D. F. Shanno. An interior point algorithm for nonconvex nonlinear
programming. Technical Report SOR-97-21, CEOR, Princeton University, Princeton,
NJ, 1997.
[400] O. von Stryk and R. Bulirsch. Direct and Indirect Methods for Trajectory Optimiza-
tion. Ann. Oper. Res., 37:357–373, 1992.
[402] A. Wächter and L. T. Biegler. Line Search Filter Methods for Nonlinear Programming:
Local Convergence. SIAM J. Optim., 16:32–48, 2005.
[403] A. Wächter and L. T. Biegler. Line Search Filter Methods for Nonlinear Programming:
Motivation and Global Convergence. SIAM J. Optim., 16:1–31, 2005.
[404] A. Wächter and L. T. Biegler. On the Implementation of a Primal-Dual Interior Point
Filter Line Search Algorithm for Large-Scale Nonlinear Programming. Math. Pro-
gramm., 106:25–57, 2006.
[405] A. Wächter, C. Visweswariah, and A. R. Conn. Large-Scale Nonlinear Optimization
in Circuit Tuning. Future Generat. Comput. Syst., 21:1251–1262, 2005.
[406] M. Weiser. Function Space Complementarity Methods for Optimal Control Problems.
Ph.D. Thesis, Freie Universität Berlin, 2001.
[407] M. Weiser and P. Deuflhard. Inexact Central Path Following Algorithms for Optimal
Control Problems. SIAM J. Control Optim., 46:792–815, 2007.
[408] T. Williams and R. Otto. A Generalized Chemical Processing Model for the Investigation of Computer Control. AIEE Trans., 79:458, 1960.
[409] R. B. Wilson. A Simplicial Algorithm for Concave Programming. Ph.D. Thesis, Grad-
uate School of Business Administration, Harvard University, 1963.
[410] D. Wolbert, X. Joulia, B. Koehret, and L. T. Biegler. Flowsheet Optimization and Op-
timal Sensitivity Analysis Using Exact Derivatives. Comput. Chem. Eng., 18:1083–
1095, 1994.
[411] S. J. Wright. Primal-Dual Interior-Point Methods. SIAM, Philadelphia, 1997.
[412] C. Xu, P. M. Follmann, L. T. Biegler, and M. S. Jhon. Numerical Simulation and Optimization of a Direct Methanol Fuel Cell. Comput. Chem. Eng., 29(8):1849–1860, 2005.
[413] H. Yamashita. A Globally Convergent Primal-Dual Interior-Point Method for Con-
strained Optimization. Optim. Methods Softw., 10:443–469, 1998.
[414] B. P. Yeo. A Modified Quasilinearization Algorithm for the Computation of Optimal
Singular Control. Int. J. Control, 32:723–730, 1980.
[415] W. S. Yip and T. Marlin. The Effect of Model Fidelity on Real-Time Optimization
Performance. Comput. Chem. Eng., 28:267, 2004.
[416] J. C. C. Young, R. Baker, and C. L. E. Swartz. Input Saturation Effects in Optimizing
Control: Inclusion within a Simultaneous Optimization Framework. Comput. Chem.
Eng., 28:1347–1360, 2004.
[417] R. E. Young, R. D. Bartusiak, and R. W. Fontaine. Evolution of an industrial nonlinear model predictive controller. In J. W. Eaton, J. B. Rawlings, and B. A. Ogunnaike, editors, Chemical Process Control VI: Sixth International Conference on Chemical Process Control, pages 342–351. AIChE Symposium Series, Volume 98, Number 326, 2001.
[418] V. M. Zavala and L. T. Biegler. Large-Scale Parameter Estimation in Low-Density
Polyethylene Tubular Reactors. Ind. Eng. Chem. Res., 45:7867–7881, 2006.
Index
PDE constrained optimization, 318
PDEs (partial differential equations), 7
penalty function, 109, 114
PENNON, 171, 173
perturbation, 260
PF(ρ), 331, 334, 347, 348, 353, 357
phase changes, 342
PI (proportional plus integral), 341
PI controller saturation, 341
Picard–Lindelöf theorem, 215
piecewise functions, 340
planning, 182
polyhedron, 84
polymerization, 305
polymerization reactors, 300
polynomial approximation, 289
portfolio model, 86
positive curvature, 23
positive definite, 21
positive semidefinite, 21
positive step length, 46
Powell damping, 44
Powell dogleg steps, 54
Powell's dogleg method, 123
pressure swing adsorption, 323
primal degenerate, 84
primal-dual equations, 153, 154
Pro/II, 15
problem of Lagrange, 221
process control, 213
process flowsheet, 7, 186
process model, 7
    steady state, 181
process optimization, 326
process simulation, 184
    modular mode, 184
process simulation models, 182
process synthesis, 5
profit function, 193
projected gradient, 164
projected gradient steps, 166
projected Newton methods, 166
projection operator, 164
PROPT, 323
ProSIM, 190
pseudospectral methods, 296
qth order inequality path constraint, 237
QP (quadratic program), 4, 12, 85, 95, 136
QP portfolio planning, 89
QPSOL, 279
QR factorization, 45, 101
quadratic forms, 22
quadratic penalty functions, 109
quadratic program (QP), 4, 12, 85, 95, 136
quasi-Newton methods, 42
quasi-Newton update matrix, 102
Radau collocation, 293, 310, 313, 315, 350, 354, 356
Radau collocation on finite elements, 351
range-space matrices, 101
reactor, 196
reactor control, 239
real-time optimization (RTO), 5, 182, 200, 336
reboiler, 9, 204
reduced gradient method, 190
reduced gradients, 162
reduced Hessians, 82, 130, 162
    properties, 82
reduced-space decomposition, 96
reduced-space Newton steps, 99
reduced-space SQP (rSQP), 205
Reg(), 331, 347, 348
RegComp(), 331
RegEq(), 331
RNLP-SOSC, 330
RNLP-SSOSC, 330
ROMeo, 15
roundoff errors, 260
rSQP (reduced-space SQP), 207
RTO, 201, 204
Runge–Kutta, 251, 293
    basis, 289
    Butcher block, 256
    discretizations, 323
    explicit, 255
    fully implicit methods, 255, 256
    methods, 254–256, 259, 296
    polynomial representation, 350, 354
    representation, 350
    semi-implicit methods, 255, 256
trust region
ared, 53, 125
pred, 53, 125
trust region globalization, 122
trust region methods, 52, 158
convergence rate, 160
local convergence properties, 160
trust region with merit functions, 149
tubular reactors, 304
twice differentiable, 26
zero curvature, 23
zero-stability property, 254, 255