These are notes for a one-semester graduate course on numerical optimisation given by Prof.
Miguel Á. Carreira-Perpiñán at the University of California, Merced. The notes are largely based
on the book “Numerical Optimization” by Jorge Nocedal and Stephen J. Wright (Springer, 2nd ed.,
2006), with some additions.
These notes may be used for educational, non-commercial purposes.
© 2005–2020 Miguel Á. Carreira-Perpiñán
1 Introduction
• Goal: describe the basic concepts & main state-of-the-art algorithms for continuous optimiza-
tion.
Ex. (transportation problem): cij = shipping cost; xij = amount of product shipped from factory i to shop j.
• Ex.: LSQ problem: fit a parametric model (e.g. line, polynomial, neural net...) to a data set (Ex. 2.1).
• Optimization algorithms are iterative: build sequence of points that converges to the solution.
Needs good initial point (often by prior knowledge).
• Desirable properties of an algorithm:
– Robustness: perform well on a wide variety of problems in their class, for any starting point;
– Efficiency: little computer time or storage;
– Accuracy: identify the solution precisely (within the limits of floating-point arithmetic).
• General comment about optimization (Fletcher): “fascinating blend of theory and computation,
heuristics and rigour”.
– No universal algorithm: a given algorithm works well for a given class of problems.
– Necessary to adapt a method to the problem at hand (by experimenting).
– Not choosing an appropriate algorithm → solution found very slowly or not at all.
– Discrete optimization (integer programming): the variables are discrete. Ex.: integer
transportation problem, traveling salesman problem.
∗ Harder to solve than continuous opt (in the latter we can predict the objective function
value at nearby points).
∗ Too many solutions to count them.
∗ Rounding typically gives very bad solutions.
∗ Highly specialized techniques for each problem type.
Ref: Papadimitriou & Steiglitz 1982.
– Network opt: shortest paths, max flow, min cost flow, assignments & matchings, MST,
dynamic programming, graph partitioning. . .
Ref: Ahuja, Magnanti & Orlin 1993.
– Stochastic opt: the model is specified with uncertainty, e.g. x ≤ b where b could be given
by a probability density function.
– Global opt: find the global minimum, not just a local one. Very difficult.
Some heuristics: simulated annealing, genetic algorithms, evolutionary computation.
– Multiobjective opt: one approach is to transform it to a single objective = linear combi-
nations of objectives.
– EM algorithm (Expectation-Maximization): specialized technique for maximum likelihood
estimation of probabilistic models.
Ref: McLachlan & Krishnan 2008; many books on statistics or machine learning.
– Modeling: the setup of the opt problem, i.e., the process of identifying objective, variables
and constraints for a given problem. Very important but application-dependent.
Ref: Dantzig 1963; Ahuja, Magnanti & Orlin 1993.
2 Fundamentals of unconstrained optimization
Problem: min f (x), x ∈ Rn .
• Strict (or strong) local minimizer: f(x*) < f(x) ∀x ∈ N \{x*}. (Ex. f(x) = 3 vs f(x) = (x − 2)⁴ at x* = 2.)
• Isolated local minimizer: ∃N of x* such that x* is the only local min. in N. (Ex. f(x) = x⁴cos(1/x) + 2x⁴ with f(0) = 0 has a strict global minimizer at x* = 0 but non-isolated.) All isolated local min. are strict.
• First-order necessary conditions (Th. 2.2): x* local min, f cont. diff. in an open neighborhood of x* ⇒ ∇f(x*) = 0. (Not a sufficient condition, ex: f(x) = x³.)
(Pf.: by contradiction: if ∇f(x*) ≠ 0 then f decreases along the negative gradient direction.)
• Second-order necessary conditions (Th. 2.3): x* local min, f twice cont. diff. in an open neighborhood of x* ⇒ ∇f(x*) = 0 and ∇²f(x*) is psd. (Not a sufficient condition, ex: f(x) = x³.)
(Pf.: by contradiction: if ∇²f(x*) is not psd then f decreases along any direction p with pᵀ∇²f(x*)p < 0.)
The key for the conditions is that ∇, ∇2 exist and are continuous. The smoothness of f allows us to
predict approximately the landscape around a point x.
Convex optimization
• S ⊂ Rn is a convex set if x, y ∈ S ⇒ αx + (1 − α)y ∈ S, ∀α ∈ [0, 1].
• Convex optimization problem: the objective function and the feasible set are both convex
(⇐ the equality constraints are linear and the inequality constraints ci (x) ≥ 0 are concave.)
Ex.: linear programming (LP).
• Th. 2.5: when f is convex, any local minimizer is a global minimizer; if in addition f is differentiable, any stationary point is a global minimizer.
Algorithm overview
• Algorithms look for a stationary point starting from a point x0 (arbitrary or user-supplied) ⇒ sequence of iterates {xk}_{k=0}^∞ that terminates when no more progress can be made, or it seems that a solution has been approximated with sufficient accuracy.
• Stopping criterion: can't use ‖xk − x*‖ or |f(xk) − f(x*)| (x* unknown). Instead, in practice (given a small ǫ > 0): e.g. ‖∇f(xk)‖ ≤ ǫ.
• We choose xk+1 given information about f at xk (and possibly earlier iterates) so that f (xk+1 ) <
f (xk ) (descent).
• Move xk → xk+1 : two fundamental strategies, line search and trust region.
• In both strategies, the subproblem (step 2) is easier to solve than the real problem. Why not solve the subproblem exactly?
• Both strategies differ in the order in which they choose the direction and the distance of the
move:
Scaling (“units” of the variables)
• A problem is poorly scaled if changes to x in a certain direction produce much larger variations
in the value of f than do changes to x in another direction. Some algorithms (e.g. steepest
descent) are sensitive to poor scaling while others (e.g. Newton’s method) are not. Generally,
scale-invariant algorithms are more robust to poor problem formulations.
Ex. f(x) = 10⁹x₁² + x₂² (fig. 2.7).
3 Line search methods
Iteration: xk+1 = xk + αk pk , where αk is the step length (how far to move along pk ), αk > 0; pk is
the search direction.
[Figure: a search direction pk at the iterate xk.]
Descent direction at xk: pkᵀ∇fk = ‖pk‖‖∇fk‖ cos θk < 0 (angle θk < π/2 with −∇fk). Guarantees that f can be reduced along pk (for a sufficiently small step):
• The steepest descent direction, i.e., the direction along which f decreases most rapidly, is pk = −∇fk (figs. 2.5, 2.6). Pf.: for any p, α: f(xk + αp) = f(xk) + αpᵀ∇fk + O(α²), so the rate of change in f along p at xk is pᵀ∇fk (the directional derivative) = ‖p‖‖∇fk‖ cos θ. Then min_p pᵀ∇fk s.t. ‖p‖ = 1 is achieved when cos θ = −1, i.e., p = −∇fk/‖∇fk‖.
This direction is ⊥ to the contours of f. Pf.: take x + p on the same contour line as x. Then, by Taylor's th.:
f(x + p) = f(x) + pᵀ∇f(x) + ½pᵀ∇²f(x + ǫp)p, ǫ ∈ (0, 1) ⇒ cos ∠(p, ∇f(x)) = −(pᵀ∇²f(x + ǫp)p)/(2‖p‖‖∇f(x)‖) → 0 as ‖p‖ → 0,
but ‖p‖ → 0 along the contour line means p/‖p‖ is parallel to its tangent at x.
• The Newton direction is pk = −(∇²fk)⁻¹∇fk. This corresponds to assuming f is locally quadratic and jumping directly to its minimum. Pf.: by Taylor's th.:
f(xk + p) ≈ fk + pᵀ∇fk + ½pᵀ∇²fk p = mk(p)
which is minimized (take derivatives wrt p) by the Newton direction if ∇²fk is pd. (✐ what happens if assuming f is locally linear (order 1)?)
In a line search the Newton direction has a natural step length of 1.
These directions have the general form pk = −Bk⁻¹∇fk for a symmetric, nonsingular Bk:
– steepest descent: Bk = I
– Newton's method: Bk = ∇²f(xk)
– Quasi-Newton method: Bk ≈ ∇²f(xk)
Here, we deal with how to choose the step length given the search direction pk . Desirable properties:
guaranteed global convergence and rapid rate of convergence.
Step length
Time/accuracy trade-off: want to choose αk to give a substantial reduction in f but not to spend
much time on it.
• Exact line search (global or local min): αk: min_{α>0} φ(α) = f(xk + αpk). Too expensive: many evaluations of f, ∇f to find αk even with moderate precision. (✐ Angle ∠(pk, ∇fk+1) = ?)
• Inexact line search: a typical l.s. algorithm will try a sequence of α values and stop when certain
conditions hold.
We want easily verifiable theoretical conditions on the step length that allow to prove convergence
of an optimization algorithm.
• Reduction in f : f (xk + αk pk ) < f (xk ) → not enough, can converge before reaching the mini-
mizer.
Wolfe conditions
➀ Sufficient decrease (Armijo condition): f(xk + αk pk) ≤ f(xk) + c1αk∇fkᵀpk for some c1 ∈ (0, 1); equivalently φ(0) − φ(αk) ≥ αk(−c1φ′(0)) (fig. 3.3).
Rejects too-small decreases. The reduction is proportional both to the step length αk and to the directional derivative ∇fkᵀpk. In practice, c1 is very small, e.g. c1 = 10⁻⁴.
It is satisfied for any sufficiently small α ⇒ not enough, need to rule out unacceptably small steps.
➁ Curvature condition: ∇f(xk + αk pk)ᵀpk ≥ c2∇fkᵀpk for some c2 ∈ (c1, 1) (e.g. c2 = 0.9). Rules out unacceptably short steps by requiring the slope of φ at αk to be less negative than c2φ′(0).
We will concentrate on the Wolfe conditions in general, and assume they always hold when the l.s.
is used as part of an optimization algorithm (allows convergence proofs).
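As an illustration, here is a minimal backtracking line search enforcing the sufficient decrease (Armijo) condition; it is a sketch (function names, the quadratic test function and the constants are illustrative, not from the notes):

```python
# A minimal backtracking line search enforcing the Armijo condition.
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha until f(x + alpha*p) <= f(x) + c1*alpha*grad_f(x)^T p."""
    alpha = alpha0
    fx = f(x)
    slope = grad_f(x) @ p            # directional derivative; must be < 0 (descent)
    assert slope < 0, "p must be a descent direction"
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho                 # step rejected: too long, shrink it
    return alpha

# Usage: steepest descent on the toy function f(x) = x1^2 + 10 x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
g = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
for _ in range(50):
    p = -g(x)
    x = x + backtracking_line_search(f, g, x, p) * p
```

Note backtracking enforces sufficient decrease but not the curvature condition; as said above, it often (not always) yields points satisfying the Wolfe conditions.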
Lemma 3.1: there always exist step lengths that satisfy the Wolfe (also the strong Wolfe) conditions
if f is smooth and bounded below. (Pf.: mean value th.; see figure below.)
The Goldstein conditions are an alternative (➊ sufficient decrease, ➋ lower bound ruling out too-small steps):
f(xk) + (1 − c)αk∇fkᵀpk ≤ f(xk + αk pk) ≤ f(xk) + cαk∇fkᵀpk, with 0 < c < ½.
Step-length selection algorithms
• They take a starting value of α and generate a sequence {αi } that satisfies the Wolfe cond.
Usually they use interpolation, e.g. approximate φ(α) as a cubic polynomial.
• There are also derivative-free methods (e.g. the golden section search) but they are less efficient
and can’t benefit from the Wolfe cond. (to prove global convergence).
• Important theorem, e.g. shows that the steepest descent method is globally convergent; for
other algorithms, it describes how far pk can deviate from the steepest descent direction and
still give rise to a globally convergent iteration.
• Th. 3.2 (Zoutendijk). Consider an iterative method xk+1 = xk + αk pk with starting point
x0 where pk is a descent direction and αk satisfies the Wolfe conditions. Suppose f is bounded
below in Rⁿ and cont. diff. in an open set N containing the level set L = {x: f(x) ≤ f(x0)}, and that ∇f is Lipschitz continuous on N (⇔ ∃L > 0: ‖∇f(x) − ∇f(x̃)‖ ≤ L‖x − x̃‖ ∀x, x̃ ∈ N; weaker than bounded Hessian). Then
∑_{k≥0} cos²θk ‖∇fk‖² < ∞ (Zoutendijk's condition). (Proof: ➁ + Lipschitz cond. lower-bound αk; ➀ upper-bounds f; telescope.)
Zoutendijk's condition implies cos²θk ‖∇fk‖² → 0. Thus if cos θk ≥ δ > 0 ∀k for fixed δ, then ‖∇fk‖ → 0 (global convergence).
• Examples:
– Steepest descent method: pk = −∇fk ⇒ cos θk = 1 ⇒ global convergence (fig. 3.7). Intuitive method, but very slow in difficult problems.
– Newton-like method: pk = −Bk⁻¹∇fk with Bk symmetric, pd and with bounded condition number: ‖Bk‖‖Bk⁻¹‖ ≤ M ∀k (✐ ill-cond. ⇒ ∇f ⊥ Newton dir.). Then cos θk ≥ 1/M (Pf.: exe. 3.5) ⇒ global convergence.
In other words, if Bk are pd (which is required for descent directions), have bounded c.n.
and the step lengths satisfy the Wolfe conditions ⇒ global convergence. This includes
steepest descent, some Newton and quasi-Newton methods.
• For some methods (e.g. conjugate gradients) we may have directions that are almost ⊥ ∇fk when the Hessian is ill-conditioned. It is still possible to show global convergence by assuming that we take a steepest descent step from time to time. “Turning” the directions toward −∇fk whenever cos θk < δ for some preselected δ > 0 is generally a bad idea: it slows down the method (difficult to choose a good δ) and also destroys the invariance properties of quasi-Newton methods.
• Fast convergence can sometimes conflict with global convergence, e.g. steepest descent is glob-
ally convergent but quite slow; Newton’s method converges very fast when near a solution but
away from the solution its steps may not even be descent (indeed, it may be looking for a
maximizer!). The challenge is to design algorithms with both fast and global convergence.
Rate of convergence
• Steepest descent: pk = −∇fk .
Th. 3.4: assume f is twice cont. diff. and that the iterates generated by the steepest descent method with exact line searches converge to a point x* where the Hessian ∇²f(x*) is pd. Then
f(xk+1) − f(x*) ≤ r²(f(xk) − f(x*)), where r = (λn − λ1)/(λn + λ1) = (κ − 1)/(κ + 1),
0 < λ1 ≤ ··· ≤ λn are the eigenvalues of ∇²f(x*) and κ = λn/λ1 its condition number.
(Pf.: near the min., f is approx. quadratic.) (For quadratic functions (with matrix Q): ‖xk+1 − x*‖_Q ≤ r‖xk − x*‖_Q, where ‖x‖²_Q = xᵀQx and ½‖x − x*‖²_Q = f(x) − f(x*).)
– Very well conditioned Hessian: λ1 ≈ λn ; very fast, since the steepest descent direction
approximately points to the minimizer.
– Ill-conditioned Hessian: λ1 ≪ λn ; very slow, zigzagging behaviour. This is the typical
situation in practice.
• Newton's method: pk = −(∇²fk)⁻¹∇fk. Near the solution, where the Hessian is pd, the convergence rate is quadratic if we always take αk = 1 (no line search at all). The theorem does not apply away from the solution, where the Hessian may not be pd (so the direction may not be descent) and the unit step size may not satisfy the Wolfe cond. or may even increase f; practical Newton methods avoid this.
Newton’s method with Hessian modification
Newton step: solution of the n × n linear system ∇²f(xk)p^N_k = −∇f(xk).
• Near a minimizer the Hessian is pd ⇒ quadratic convergence (with unit steps αk = 1).
• Away from a minimizer, modify the Hessian: Bk = ∇²f(xk) + Ek s.t. Bk is sufficiently pd. Quadratic convergence near the minimizer, where the Hessian is pd, if Ek = 0 there.
Diagonal Hessian modification: Ek = λI with λ ≥ 0 large enough that Bk is sufficiently pd: λ = max(0, δ − λmin(∇²f(xk))) for some δ > 0.
• We want λ as small as possible to preserve Hessian information along the positive curvature
directions; but if λ is too small, Bk is nearly singular and the step too long.
• The method behaves like pure Newton for pd Hessian and λ = 0, like steepest descent for
λ → ∞, and finds some descent direction for intermediate λ.
Other types of Hessian modification exist, but there is no consensus about which one is best:
• Direct modification of the eigenvalues (needs the spectral decomposition of the Hessian).
If A = QΛQᵀ (spectral th.) then ∆A = Q diag(max(0, δ − λi)) Qᵀ is the correction with minimum Frobenius norm s.t. λmin(A + ∆A) ≥ δ.
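A minimal sketch of the diagonal modification and the resulting modified Newton step (assuming the Hessian is available as a dense symmetric matrix; names are illustrative):

```python
# Diagonal Hessian modification Bk = Hess + lam*I with
# lam = max(0, delta - lambda_min(Hess)), then the modified Newton step.
import numpy as np

def modified_newton_step(hess, grad, delta=1e-8):
    lam_min = np.linalg.eigvalsh(hess)[0]    # smallest eigenvalue (symmetric matrix)
    lam = max(0.0, delta - lam_min)          # lam = 0 if Hessian already suff. pd
    B = hess + lam * np.eye(hess.shape[0])
    return np.linalg.solve(B, -grad)         # p = -B^{-1} grad (a descent direction)
```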
Review: line search (l.s.) methods
• Iteration xk+1 = xk + αk pk: the search direction pk is given by the optimization method; the l.s. determines the step length αk.
• We want:
– Descent direction: pkᵀ∇fk = ‖pk‖‖∇fk‖ cos θk < 0. Examples:
steepest descent dir.: −∇fk
Newton dir.: −(∇²fk)⁻¹∇fk
quasi-Newton dir.: −Bk⁻¹∇fk (Bk pd ⇒ descent dir.)
– Inexact l.s.: approx. solution of min_{α>0} f(xk + αpk) (faster convergence of the overall algorithm).
Even if the l.s. is inexact, if αk satisfies certain conditions at each k then the overall algorithm has global
convergence.
Ex.: the Wolfe conditions (others exist), most crucially the sufficient decrease in f . A simple l.s. algorithm
that often (not always) satisfies the Wolfe cond. is backtracking (better ones exist).
• Global convergence (to a stationary point): ‖∇fk‖ → 0.
Zoutendijk's th.: descent dir. + Wolfe + mild cond. on f ⇒ ∑_{k≥0} cos²θk ‖∇fk‖² < ∞.
Corollary: cos θk ≥ δ > 0 ∀k ⇒ global convergence. But we often want cos θk ≈ 0!
Ex.: steepest descent and some Newton-like methods have global convergence.
• Convergence rate:
– Steepest descent: linear, rate r² with r = (λn − λ1)/(λn + λ1); slow for ill-conditioned problems.
– Quasi-Newton: superlinear under certain conditions.
– Newton: quadratic near the solution.
• Modified Hessian Newton’s method: Bk = ∇2 f (xk ) + λI (diagonal modif., others exist) s.t. Bk suffic. pd.
λ = 0: pure Newton step; λ → ∞: steepest descent direction (with step → 0).
Descent direction with moderate length away from minimizer, pure Newton’s method near minimizer.
Global, quadratic convergence if κ(Bk ) ≤ M and l.s. with Wolfe cond.
4 Trust region methods
Iteration: xk+1 = xk + pk, where pk is the approximate minimizer of the model mk(p) in a region around xk (the trust region); if pk does not produce a sufficient decrease in f, we shrink the region and try again.
• Each time we decrease ∆ after failure of a candidate iterate, the step from xk is shorter and
usually points in a different direction.
• Trade-off in ∆: too small ⇒ good model, but we can only take a small step, so slow convergence; too large ⇒ bad model, we may have to reduce ∆ and repeat.
In practice, we increase ∆ if previous steps showed the model reliable. fig. 4.1
• Linear model: mk(p) = fk + ∇fkᵀp s.t. ‖p‖ ≤ ∆k ⇒ pk = −∆k∇fk/‖∇fk‖, i.e., steepest descent with step length αk given by ∆k (no news). (✐ what is the approx. error for the linear model?)
• Quadratic model: mk(p) = fk + ∇fkᵀp + ½pᵀBk p, where Bk is symmetric but need not be psd.
The approximation error is O(‖p‖³) if Bk = ∇²fk (trust-region Newton method), O(‖p‖²) otherwise.
In both cases, the model is accurate for small kpk, which guarantees we can always find a good
step for sufficiently small ∆. (✐ what happens if ρk < 0 but kpk k < ∆k with an arbitrary mk ?)
• ρk = (f(xk) − f(xk + pk)) / (mk(0) − mk(pk)) (actual reduction over predicted reduction). If the actual reduction is < 0 the new objective value is larger, so reject the step.
– ρk ≈ 1: good agreement between f and the model mk, so expand ∆k+1 > ∆k if ‖pk‖ = ∆k (otherwise, don't interfere);
– ρk > 0 but not close to 1: keep ∆k+1 = ∆k;
– ρk close to 0 or negative: shrink ∆k+1 < ∆k.
Algorithm 4.1
• If Bk is pd and ‖Bk⁻¹∇fk‖ ≤ ∆k, the solution is the unconstrained minimizer pk = −Bk⁻¹∇fk (the full step).
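A schematic trust-region loop in the spirit of algorithm 4.1 (a sketch: `solve_subproblem` and `model_decrease` are assumed black boxes returning the approximate subproblem minimizer and mk(0) − mk(p); the constants follow common choices):

```python
# Trust-region iteration: accept/reject the step and update the radius
# from the agreement ratio rho = actual / predicted reduction.
import numpy as np

def trust_region(f, model_decrease, solve_subproblem, x, delta=1.0,
                 delta_max=100.0, eta=0.125, iters=100):
    for _ in range(iters):
        p = solve_subproblem(x, delta)              # approx. min of m_k s.t. ||p|| <= delta
        rho = (f(x) - f(x + p)) / model_decrease(x, p)
        if rho < 0.25:
            delta *= 0.25                           # poor model: shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2 * delta, delta_max)       # good model, step at boundary: expand
        if rho > eta:                               # sufficient decrease: accept the step
            x = x + p
    return x
```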
Characterization of the exact solution of the optimization subproblem
Th. 4.1. p* is a global solution of the trust-region problem min_{‖p‖≤∆} m(p) = f + gᵀp + ½pᵀBp iff
p∗ is feasible and ∃λ ≥ 0 such that:
1. (B + λI)p∗ = −g
2. λ(∆ − ‖p*‖) = 0 (i.e., λ = 0 or ‖p*‖ = ∆)
3. B + λI is psd.
Note:
• Conditions 1 and 2 follow from the KKT conditions (th. 12.1) for a local solution, where λ is
the Lagrange multiplier. Condition 3 holds for a global solution.
• Using B + λI instead of B in the model transforms the problem into min_p m(p) + (λ/2)‖p‖², and so for large λ > 0 the minimizer is strictly inside the region. As we decrease λ the minimizer moves to the region boundary, and the theorem holds for that λ.
• If λ > 0 then the direction is antiparallel to the model gradient and so the region is tangent to
the model contour at the solution: λp∗ = −g − Bp∗ = −∇m(p∗ ).
• This is useful for Newton’s method and is the basis of the Levenberg-Marquardt algorithm for
nonlinear least-squares problems.
• Cauchy point: the minimizer of mk along the steepest descent direction subject to ‖p‖ ≤ ∆k, i.e., steepest descent with a certain step size. It gives a baseline solution; several methods improve over the Cauchy point.
• Iterative solution of the subproblem (based on the characterization):
1. Try λ = 0, solve Bp* = −g and see if ‖p*‖ ≤ ∆ (full step).
2. If ‖p*‖ > ∆, define p(λ) = −(B + λI)⁻¹g for λ sufficiently large that B + λI is pd, and seek a value λ > 0 such that ‖p(λ)‖ = ∆ (1D root-finding for λ; iterative solution factorizing the matrix B + λI) (fig. 4.5).
B = QΛQᵀ (spectral th. with λ1 ≤ ··· ≤ λn) ⇒ p(λ) = −(B + λI)⁻¹g = −∑_{j=1}^n (qjᵀg)/(λj + λ) qj ⇒ ‖p(λ)‖² = ∑_{j=1}^n (qjᵀg)²/(λj + λ)². If q1ᵀg ≠ 0, find λ* > −λ1 using Newton's method for root finding on r(λ) = 1/∆ − 1/‖p(λ)‖ (since 1/‖p(λ)‖ ≈ (λ + λ1)/constant is nearly linear there). One can show this is equivalent to Alg. 4.3, which uses Cholesky factorizations (limit to ∼3 steps).
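A sketch of this subproblem solution based on the spectral characterization above (for simplicity it brackets and bisects on ‖p(λ)‖ = ∆ instead of the safeguarded Newton/Cholesky iteration of Alg. 4.3, and it ignores the hard case q1ᵀg = 0):

```python
# Solve min m(p) = f + g^T p + 0.5 p^T B p s.t. ||p|| <= Delta
# via p(lam) = -(B + lam*I)^{-1} g computed from the eigendecomposition.
import numpy as np

def tr_subproblem(B, g, Delta):
    lam_all, Q = np.linalg.eigh(B)                       # ascending eigenvalues
    p = lambda lam: -Q @ ((Q.T @ g) / (lam_all + lam))   # -(B + lam I)^{-1} g
    if lam_all[0] > 0 and np.linalg.norm(p(0.0)) <= Delta:
        return p(0.0)                                    # full (unconstrained) step
    lo = max(0.0, -lam_all[0]) + 1e-12                   # ||p(lo)|| > Delta
    hi = max(1.0, -lam_all[0]) + 1.0
    while np.linalg.norm(p(hi)) > Delta:                 # grow hi: large lam shortens p
        hi *= 2
    for _ in range(100):                                 # bisect on ||p(lam)|| = Delta
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(p(mid)) > Delta else (lo, mid)
    return p(hi)
```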
• Two-dimensional subspace minimization: minimize mk over span{∇fk, Bk⁻¹∇fk} (the span of the dogleg path), eq. 4.17. The minimizer results from a 4th-degree polynomial.
• If using Bk = ∇2 fk and if the region becomes eventually inactive and we always take the full
step, the convergence is quadratic (the method becomes Newton’s method).
Review: trust-region methods
• Iteration xk+1 = xk + pk
5 Conjugate gradient methods
• Linear conjugate gradient method: solves a large linear system Ax = b with A symmetric pd, equivalently minimizes the quadratic φ(x) = ½xᵀAx − bᵀx (residual rk = Axk − b = ∇φ(xk)).
• Th. 5.1: we can minimize φ in n steps at most by successively minimizing φ along the n vectors
in a conjugate set.
Conjugate direction method: given a starting point x0 ∈ Rⁿ and a set of n conjugate directions {p0, ..., pn−1}, generate the sequence {xk} with xk+1 = xk + αk pk, where αk = −(rkᵀpk)/(pkᵀApk) (✐ denominator ≠ 0?) (exact line search). (Proof: x* = x0 + ∑_{i=0}^{n−1} αi pi.)
• Intuitive idea:
– A diagonal: the quadratic function φ can be minimized along the coordinate directions e1, ..., en in n iterations (fig. 5.1).
– A not diagonal: the coordinate directions don't minimize φ in n iterations; but the variable change x̂ = S⁻¹x with S = (p0 p1 ... pn−1) diagonalizes A (fig. 5.2): φ̂(x̂) = φ(Sx̂) = ½x̂ᵀ(SᵀAS)x̂ − (Sᵀb)ᵀx̂. (✐ why is S invertible? why is SᵀAS diagonal?)
coordinate search in x̂ ⇔ conjugate direction search in x.
• Th. 5.2 (expanding subspace minimization): for the conjugate direction method, rkᵀpi = 0 for i = 0, ..., k − 1, and xk is the minimizer of φ over the set x0 + span{p0, ..., pk−1}.
• How to obtain conjugate directions? Many ways, e.g. using the eigenvectors of A or transform-
ing a set of l.i. vectors into conjugate directions with a procedure similar to Gram-Schmidt.
But these are computationally expensive!
• The conjugate gradient method generates the conjugate direction pk using only the previous one, pk−1:
– pk is a l.c. of −∇φ(xk) and pk−1 s.t. being conjugate to pk−1 ⇒ pk = −rk + βk pk−1 with βk = (rkᵀApk−1)/(pk−1ᵀApk−1).
To prove the algorithm works, we need to prove it builds a conjugate direction set.
• Th. 5.3: suppose that the kth iterate of the CG method is not the solution x*. Then:
– rkᵀri = 0 for i = 0, ..., k − 1 (the gradients at all iterates are ⊥ to each other);
– span{r0, ..., rk} = span{p0, ..., pk} = span{r0, Ar0, ..., Aᵏr0} = Krylov subspace of degree k for r0. (So {rk} is an orthogonal basis, {pk} a basis.)
(Intuitive explanation: compute rk, pk for k = 1, 2 using rk+1 = rk + αk Apk, pk+1 = −rk+1 + βk+1 pk.)
⇒ αk ← (rkᵀrk)/(pkᵀApk) = ‖rk‖²/‖pk‖²_A, rk+1 ← rk + αk Apk, βk+1 ← (rk+1ᵀrk+1)/(rkᵀrk) = ‖rk+1‖²/‖rk‖² in algorithm 5.2.
• Advantages: no matrix storage; does not alter A; does not introduce fill (for a sparse matrix
A); fast convergence.
• Disadvantages: sensitive to roundoff errors.
It is recommended for large systems.
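A minimal implementation of the linear CG iteration above (algorithm 5.2), using only matrix-vector products with A; names and the stopping tolerance are illustrative:

```python
# Linear conjugate gradient for Ax = b with A symmetric pd.
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    x = x0.copy()
    r = A @ x - b                      # residual r_k = grad phi(x_k)
    p = -r
    while np.linalg.norm(r) > tol:
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)     # exact minimizer of phi along p_k
        x = x + alpha * p
        r_new = r + alpha * Ap         # r_{k+1} = r_k + alpha_k A p_k
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p          # next conjugate direction
        r = r_new
    return x
```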
Rate of convergence
• Here we don't mean the asymptotic rate (k → ∞), because CG converges in at most n steps for a quadratic function. But CG can get very close to the solution in far fewer than n steps,
depending on the eigenvalue structure of A:
– Th. 5.4: if A has only r distinct eigenvalues, CG converges in at most r steps.
– If the eigenvalues of A occur in r distinct clusters, CG will approximately solve the problem
in r steps. fig. 5.4
• Two bounds (using ‖x‖²_A = xᵀAx), useful to estimate the convergence rate in advance if we know something about the eigenvalues of A:
– Th. 5.5: ‖xk+1 − x*‖²_A ≤ ((λn−k − λ1)/(λn−k + λ1))² ‖x0 − x*‖²_A if A has eigenvalues λ1 ≤ ··· ≤ λn.
– ‖xk − x*‖_A ≤ 2((√κ − 1)/(√κ + 1))ᵏ ‖x0 − x*‖_A if κ = λn/λ1 is the c.n. (this bound is very coarse).
Recall for steepest descent we had a similar expression but with (κ − 1)/(κ + 1) instead of (√κ − 1)/(√κ + 1).
• Preconditioning: change of variables x̂ = Cx so that the new matrix  = C−T AC−1 has a
clustered spectrum or a small condition number (thus faster convergence). Besides being effec-
tive in this sense, a good preconditioner C should take little storage and allow an inexpensive
solution of Cx = x̂. Finding good preconditioners C depends on the problem (the structure of
A), e.g. good ones exist when A results from a discretized PDE.
The preconditioner can be integrated in a convenient way in the CG algorithm. alg. 5.3
Nonlinear CG: Fletcher-Reeves (FR) extends linear CG to a nonlinear f by using ∇f in place of the residual, with β^FR_{k+1} = (∇fk+1ᵀ∇fk+1)/(∇fkᵀ∇fk).
• Line search for αk: we need each direction pk+1 = −∇fk+1 + β^FR_{k+1} pk to be a descent direction, i.e., ∇fk+1ᵀpk+1 = −‖∇fk+1‖² + β^FR_{k+1} ∇fk+1ᵀpk < 0.
– Exact l.s.: αk is a local minimizer along pk ⇒ ∇fk+1ᵀpk = 0 ⇒ pk+1 is descent.
– Inexact l.s.: pk+1 is descent if αk satisfies the strong Wolfe conditions with 0 < c2 < ½ (lemma 5.6).
• We can define βk+1 in other ways that also generalize the quadratic case: for quadratic functions with a pd Hessian and exact l.s. we have β^FR_k = β^PR_k = β^HS_k = the βk of linear CG (since the successive gradients are mutually ⊥).
– For nonlinear functions in general, with inexact l.s., PR (Polak-Ribière) is empirically more robust and efficient than FR.
– Yet, for PR the strong Wolfe conditions don't guarantee that pk is a descent direction.
– PR needs a good l.s. to do well.
Restarts Restarting the iteration every n steps (by setting βk = 0, i.e., taking a steepest descent step) periodically refreshes the algorithm and works well in practice. It leads to n-step quadratic convergence: ‖xk+n − x*‖/‖xk − x*‖² ≤ M; intuitively because near the minimum, f is approx. quadratic and so after a restart we will have (approximately) the linear CG method (which requires p0 = steepest descent).
For large n (when CG is most useful) restarts may never occur, since an approximate solution
may be found in less than n steps.
Global convergence
• With restarts and the strong Wolfe conditions, the algorithms (FR, PR) have global convergence
since they include as a subsequence the steepest descent method (which is globally convergent
with the Wolfe conditions).
• Without restarts: FR with a strong-Wolfe l.s. is still globally convergent, but PR in general is not (it can cycle without approaching a stationary point).
• In general, the theory on the rate of convergence of CG is complex and assumes exact l.s.
Review: conjugate gradient methods
• Linear CG: A (n × n) sym. pd: solves Ax = b ⇔ min φ(x) = ½xᵀAx − bᵀx.
– At each step, xk is the minimizer over the set x0 + span{p0, ..., pk−1}; rk+1 = rk + αk Apk; and rkᵀpi = rkᵀri = 0 ∀i < k.
– Conjugate direction pk is obtained from the previous one and the current gradient: pk = −rk + βk pk−1 with βk+1 = (rk+1ᵀrk+1)/(rkᵀrk).
6 Quasi-Newton methods
• Like Newton’s method but using a certain Bk instead of the Hessian.
Like steepest descent and conjugate gradients, they require only the gradient.
By measuring the changes in gradients over iterations, they construct an approximation Bk to
the Hessian whose accuracy improves gradually and results in superlinear convergence.
• Quadratic model of the objective function (correct to first order):
mk(p) = fk + ∇fkᵀp + ½pᵀBk p,
where Bk is symmetric pd and is updated at every iteration.
• Search direction given by the minimizer of mk: pk = −Bk⁻¹∇fk (✐ which is descent, why?).
• Line search xk+1 = xk + αk pk with step length chosen to satisfy the Wolfe conditions.
• Secant equation: Bk+1sk = yk, where sk = xk+1 − xk and yk = ∇fk+1 − ∇fk. Obtained by requiring the gradients of the quadratic model mk+1 to agree with those of f at xk+1 and xk:
– ∇mk+1 = ∇f at xk+1: by construction;
– ∇mk+1 = ∇f at xk: ∇mk+1(xk − xk+1) = ∇fk+1 + Bk+1(xk − xk+1) = ∇fk ⇔ Bk+1sk = yk.
• It implicitly requires that skᵀyk = skᵀBk+1sk > 0 (a curvature condition). If f is strongly convex, this is guaranteed (because skᵀyk > 0 for any two points xk and xk+1; proof: exe. 6.1). Otherwise, it is guaranteed if the line search verifies the 2nd Wolfe condition ∇f(xk + αk pk)ᵀpk ≥ c2∇fkᵀpk, 0 < c2 < 1 (proof: 2nd Wolfe ⇔ ∇fk+1ᵀsk ≥ c2∇fkᵀsk ⇔ ykᵀsk ≥ (c2 − 1)αk∇fkᵀpk > 0). Or, in particular, if the l.s. is exact.
The secant equation provides only n constraints for the n² dof in Bk+1, so it has many solutions (even with the constraint that Bk+1 be pd). We choose the solution closest to the current matrix Bk: min_B ‖B − Bk‖ s.t. B symmetric pd, Bsk = yk. Different choices of norm are possible; one that allows an easy solution and gives rise to scale invariance is the weighted Frobenius norm ‖A‖_W = ‖W^{1/2}AW^{1/2}‖_F (where ‖A‖²_F = ∑_{ij} a²_{ij}). W is any matrix satisfying Wyk = sk (thus the norm is adimensional, i.e., the solution doesn't depend on the units of the problem). This yields the DFP update.
Define Hk = Bk⁻¹. Using Bk directly requires solving the linear system Bk pk = −∇fk, which is O(n³), while using Hk gives us pk = −Hk∇fk, which is O(n²).
The BFGS method (Broyden-Fletcher-Goldfarb-Shanno)
We apply the conditions to Hk+1 = Bk+1⁻¹ rather than Bk+1 (secant eq. Hk+1yk = sk; minimize ‖H − Hk‖ with the same norm as before, where now Wsk = yk). For W = the average Hessian we obtain:
BFGS: Hk+1 = (I − ρk sk ykᵀ)Hk(I − ρk yk skᵀ) + ρk sk skᵀ with ρk = 1/(ykᵀsk)
      Bk+1 = Bk − (Bk sk skᵀBk)/(skᵀBk sk) + (yk ykᵀ)/(ykᵀsk)   (Pf.: SMW formula)
We have: Hk pd ⇒ Hk+1 pd (proof: zᵀHk+1z > 0 if z ≠ 0). We take the initial matrix as H0 = I for lack of better knowledge. This means BFGS may make slow progress for the first iterations, while information is being built into Hk.
• For quadratic f and if an exact line search is performed, then DFP, BFGS, SR1 converge to the exact minimizer in n steps and Hn = (∇²f)⁻¹. (✐ Why do many methods work well with quad. f?)
• BFGS is the best quasi-Newton method. With an adequate line search (e.g. Wolfe conditions),
BFGS has effective self-correcting properties (and DFP is not so effective): a poor approxima-
tion to the Hessian will be improved in a few steps, thus being stable wrt roundoff error.
H0 ← I, k ← 0
while ‖∇fk‖ > ǫ
    pk ← −Hk∇fk                                              (search direction)
    xk+1 ← xk + αk pk                                        (line search with Wolfe cond.)
    sk ← xk+1 − xk, yk ← ∇fk+1 − ∇fk, Hk+1 ← BFGS update     (update inverse Hessian)
    k ← k + 1
end
• Always try αk = 1 first in the line search (this step will always be accepted eventually). Empirically, good values for c1, c2 in the Wolfe conditions are c1 = 10⁻⁴, c2 = 0.9.
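A sketch of the BFGS loop with the inverse-Hessian update above (`line_search_wolfe` is an assumed routine returning a Wolfe step length, e.g. built on the backtracking sketch of ch. 3 plus a curvature check):

```python
# BFGS with the inverse-Hessian update H_{k+1} = V H_k V^T + rho s s^T.
import numpy as np

def bfgs(f, grad_f, x0, line_search_wolfe, tol=1e-6):
    n = x0.size
    H, x, g = np.eye(n), x0.copy(), grad_f(x0)
    while np.linalg.norm(g) > tol:
        p = -H @ g                                   # quasi-Newton direction
        alpha = line_search_wolfe(f, grad_f, x, p)   # must satisfy Wolfe conditions
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        rho = 1.0 / (y @ s)                          # y^T s > 0 under the Wolfe cond.
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)       # BFGS inverse-Hessian update
        x, g = x_new, g_new
    return x
```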
– Space: O(n2 ) matrix storage. For large problems, techniques exist to modify the method
to take less space, though converging more slowly (see ch. 7).
– Time: O(n2 ) matrix × vector, outer products.
• Global convergence if Bk have a bounded condition number + Wolfe conditions (see ch. 3);
but in practice this assumption may not hold. There aren’t truly global convergence results,
though the methods are very robust in practice.
• Local convergence: if BFGS converges, then its order is superlinear under mild conditions.
                       Newton                   Quasi-Newton
Convergence rate       quadratic                superlinear
Cost per iteration     O(n³) (linear system)    O(n²) (matrix × vector)
∇²f required           yes                      no
The SR1 method (symmetric rank-1)
By requiring Bk+1 = Bk + σvvᵀ, where σ = ±1 and v is a nonzero vector, and substituting in the secant eq., we obtain:
SR1: Bk+1 = Bk + ((yk − Bk sk)(yk − Bk sk)ᵀ)/((yk − Bk sk)ᵀsk)
     Hk+1 = Hk + ((sk − Hk yk)(sk − Hk yk)ᵀ)/((sk − Hk yk)ᵀyk)
Generates very good Hessian approximations, often better than BFGS's (indeed BFGS only produces pd Bk), but:
• Bk+1 is symmetric but not necessarily pd ⇒ best used with a trust region rather than a line search.
• The denominator (yk − Bk sk)ᵀsk can vanish or be tiny ⇒ skip the update in that case for stability.
Review: quasi-Newton methods
• Newton's method with an approximate Hessian Bk (or its inverse Hk = Bk⁻¹), built from gradient differences through the secant equation Bk+1sk = yk; e.g. the SR1 update:
Hk+1 = Hk + ((sk − Hk yk)(sk − Hk yk)ᵀ)/((sk − Hk yk)ᵀyk).
• Global convergence: no general results, though the methods are robust in practice.
• Convergence rate: superlinear.
                            Newton      Quasi-Newton
Convergence rate            quadratic   superlinear
Cost per iteration (time)   O(n³)       O(n²)
Cost per iteration (space)  O(n²)       O(n²)
Hessian required            yes         no
7 Large-scale unconstrained optimization
• Large problems (today): 10³–10⁶ variables.
• In large problems, the following can have a prohibitive cost: factorizing the Hessian (solving
for the Newton step), or even computing the Hessian or multiplying times it or storing it
(note quasi-Newton algorithms generate dense approximate Hessians even if the true Hessian
is sparse).
• Large-scale quasi-Newton methods keep simple, compact approximations of the Hessian based on a few n-vectors (rather than an n × n matrix).
• Focus on L-BFGS, which uses curvature information from only the most recent iterations.
L-BFGS: store modified version of Hk implicitly, by storing m ≪ n of the vector pairs {si , yi }.
• Product Hk ∇fk = sum of inner products involving ∇fk and the pairs {si , yi }.
• After the new iterate is computed, replace oldest pair with newest one.
• Update, in detail:
– Use l.s. with (strong) Wolfe conditions to make BFGS stable.
– The first m − 1 iterates are as in BFGS.
Choose m > 0; k ← 0
repeat
    Choose H⁰k                                    (e.g. eq. (7.20))
    pk ← −Hk∇fk                                   (two-loop recursion, algorithm 7.4)
    xk+1 ← xk + αk pk                             (l.s. with Wolfe conditions)
    Discard {sk−m, yk−m} if k > m
    Store pair sk ← xk+1 − xk, yk ← ∇fk+1 − ∇fk
    k ← k + 1
until convergence
Relationship with CG: the Hestenes-Stiefel direction can be written as
pk+1 = −∇fk+1 + ((∇fk+1ᵀyk)/(ykᵀpk)) pk = −Ĥk+1∇fk+1 with Ĥk+1 = I − (sk ykᵀ)/(ykᵀsk),
which resembles quasi-Newton iterates, but Ĥk+1 is neither symmetric nor pd.
• The following memoryless BFGS is symmetric, pd and satisfies the secant eq. Hk+1yk = sk:
Hk+1 = (I − (sk ykᵀ)/(ykᵀsk))(I − (yk skᵀ)/(ykᵀsk)) + (sk skᵀ)/(ykᵀsk) ≡ BFGS update with Hk = I ≡ L-BFGS with m = 1 and H⁰k = I.
And, with exact l.s. (∇fk+1ᵀpk = 0 ∀k): pk+1 = −Hk+1∇fk+1 ≡ CG–HS ≡ CG–PR.
Compact (outer-product) representation of limited-memory BFGS: Bk = (1/γk)I + (n × 2m)(2m × 2m)(2m × n) matrix product built from the stored pairs.
This could be used in a trust-region method or in a constrained optimization method. Efficient since
updating Bk costs O(mn + m3 ) and matrix-vector products Bk v cost O(mn + m2 ).
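A sketch of the L-BFGS two-loop recursion (algorithm 7.4) that computes Hk∇fk from the m stored pairs in O(mn), assuming `s_list`/`y_list` hold the pairs oldest-first:

```python
# L-BFGS two-loop recursion: returns p = -Hk * grad without forming any matrix.
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a))
    if s_list:                                             # H0 = gamma*I scaling, eq. (7.20)
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), (rho, a) in zip(zip(s_list, y_list), reversed(alphas)):  # oldest to newest
        b = rho * (y @ q)
        q += (a - b) * s
    return -q                                              # search direction p = -Hk*grad
```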
Inexact Newton methods
Newton step: solution of the n × n linear system ∇2 f (xk ) pk = −∇f (xk ).
• Expensive: computing the Hessian is a major task, O(n2 ), and solving the system is O(n3 ).
• Not robust: far from a minimizer, need to ensure pk is descent.
Newton-CG method: solve the system approximately with the linear CG method (efficient), termi-
nating if negative curvature is encountered (robustness); can be implemented as line search or trust
region.
Inexact Newton steps Terminate the iterative solver (e.g. CG) when the residual rk = ∇²f(xk)pk + ∇f(xk) (where pk is the inexact Newton step) is small wrt the gradient (to achieve invariance wrt scalings of f): ‖rk‖ ≤ ηk‖∇f(xk)‖, where (ηk) is the forcing sequence. Under mild conditions, if the initial x0 is sufficiently near a minimizer x* and eventually we always try the full step αk = 1, we have:
• Th. 7.1: if 0 < ηk ≤ η < 1 ∀k then xk → x∗ .
• Th. 7.2: rate of convergence:
– ηk → 0: superlinear, e.g. ηk = min(0.5, √‖∇f(xk)‖);
– ηk = O(‖∇f(xk)‖): quadratic, e.g. ηk = min(0.5, ‖∇f(xk)‖).
But the smaller ηk , the more iterations of CG we need.
(✐ How many Newton-CG iterations and how many CG steps in total does this require if f is quadratic with pd Hessian?)
How do we get sufficiently near a minimizer?
Newton-CG method We solve the system with the CG method to an accuracy determined by
the forcing sequence, but terminating if negative curvature is encountered (pTk Apk ≤ 0). If the very
first direction is of negative curvature, use the steepest descent direction instead. Then, we use the
resulting direction:
• For a line search (inexact, with appropriate conditions: Wolfe, Goldstein, or backtracking; Alg. 7.1). The method behaves like:
– pd Hessian: pure (inexact) Newton direction;
– nd Hessian: steepest descent;
– indefinite Hessian: finds some descent direction.
Problem (as in the modified Hessian Newton’s method): if the Hessian is nearly singular, the
Newton-CG direction can be very long.
Hessian-free Newton methods: the Hessian-vector product ∇2 fk v can be obtained (exactly or approximately) without computing
the Hessian, see ch. 8.
Sparse quasi-Newton updates: require Bk+1 to have the same (known) sparsity pattern as the true Hessian and to satisfy the secant equation. The solution Bk+1 is given by solving an n × n linsys with the same sparsity pattern, but it is not necessarily pd.
• Then, use Bk+1 within trust-region method.
• Partially separable function = sum of element functions, each dependent on a few variables
(e.g. f (x) = f1 (x1 , x2 ) + f2 (x2 , x3 ) + f3 (x3 , x4 )) ⇒ sparse gradient, Hessian for each function; it is efficient
to maintain quasi-Newton approximations to each element function. Essentially, we work on a
lower-dimensional space for each function.
f(x) = ∑ᵢ φᵢ(Uᵢx) ⇒ ∇f(x) = ∑ᵢ Uᵢᵀ∇φᵢ(Uᵢx), ∇²f(x) = ∑ᵢ Uᵢᵀ∇²φᵢ(Uᵢx)Uᵢ
⇒ ∇²f(x) ≈ B = ∑ᵢ Uᵢᵀ B[i] Uᵢ
8 Calculating derivatives
Approximate or automatic techniques to compute the gradient, Hessian or Jacobian if difficult by
hand.
Needs a careful choice of ǫ: as small as possible but not too close to the machine precision (to avoid roundoff errors). As a rule of thumb, ǫ ∼ u^{1/2} with error ∼ u^{1/2} for forward diff. and ǫ ∼ u^{1/3} with error ∼ u^{2/3} for central diff., where u (≈ 10⁻¹⁶ in double precision) is the unit roundoff.
Pf.: assume |f| ≤ L0 and |∂²f/∂xᵢ²| ≤ L2 in the region of interest. Then ∂f/∂xᵢ(x) = (f(x + ǫeᵢ) − f(x))/ǫ + δǫ with |δǫ| ≤ (L2/2)ǫ. But the machine representation of f at any x has relative error u, so |comp(f(x)) − f(x)| ≤ uL0. Thus the absolute error E = |∂f/∂xᵢ(x) − (f(x + ǫeᵢ) − f(x))/ǫ| is bounded by (L2/2)ǫ + 2uL0/ǫ, which is minimal for ǫ² = 4L0u/L2 and E = L2ǫ, or ǫ ∼ √u and E ∼ √u. A similar derivation for the central diff. (with |∂³f/∂xᵢ³| ≤ L3) gives ǫ³ = 3L0u/L3 and E = (L3/2)ǫ², or ǫ ∼ u^{1/3} and E ∼ u^{2/3}.
• If the Jacobian or Hessian is sparse, it is possible to reduce the number of function evaluations
by cleverly choosing the perturbation vector p (graph-coloring problem).
• Computationally, the finite-difference approximation of ∇f , etc. can cost more than computing
them from their analytical expression (ex.: quadratic f ), though this depends on f .
• Numerical gradients are also useful to check whether the expression for a gradient calculated
by hand is correct.
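A sketch of such a gradient check with central differences, using the rule-of-thumb step ǫ ∼ u^{1/3} from above (names are illustrative):

```python
# Compare a hand-coded gradient against central differences.
import numpy as np

def check_gradient(f, grad_f, x, eps=None):
    eps = eps or np.finfo(float).eps ** (1 / 3)        # ~6e-6 in double precision
    g_num = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g_num[i] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference
    return np.max(np.abs(g_num - grad_f(x)))           # should be ~ u^(2/3) * scale of f
```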
Exact derivatives by automatic differentiation
• Build a computational graph of f using intermediate variables.
• Apply the chain rule: ∇ₓh(y(x)) = ∑ᵢ (∂h/∂yᵢ) ∇yᵢ(x), where Rⁿ →(y) Rᵐ →(h) R.
9 Derivative-free optimization
Evaluating ∇f in practice is sometimes impossible, e.g.:
• f (x) can be the result of an experimental measurement or a simulation (so analytic form of f
unknown).
Approaches:
• Approximate gradient and possibly Hessian using finite differences (ch. 8), then apply derivative-
based method (previous chapters). But:
• Don’t approximate the gradient, instead use function values at a set of sample points and
determine a new iterate by a different means (this chapter). But: less developed and less
efficient than derivative-based methods; effective only for small problems; difficult to use with
general constraints.
If possible, try methods in this order: derivative-based > finite-difference-based > derivative-free.
Model-based methods
• Build a model mk as a quadratic function that interpolates f at an appropriate set of samples.
Compute a step with a trust-region strategy (since mk is usually nonconvex).
• Model: samples Y = {y1, ..., yq} ⊂ Rⁿ, with the current iterate xk ∈ Y having the lowest function value in Y. Construct mk(xk + p) = c + gᵀp + ½pᵀGp (we can't use g = ∇f(xk), G = ∇²f(xk)) by imposing the interpolation conditions mk(yl) = f(yl), l = 1, ..., q (linear system). Need q = ½(n + 1)(n + 2) points (exe. 9.2), chosen so the linsys is nonsingular.
Coordinate-descent algorithm (alternating optimization): pk cycles through the n coordinate directions e1, ..., en in turn (fig. 9.1).
• May not converge, iterating indefinitely without approaching a stationary point, if the gradient becomes more and more ⊥ to the coordinate directions: then cos θk approaches 0 sufficiently rapidly that the Zoutendijk condition is satisfied even when ∇fk ↛ 0.
• If it does converge, its rate of convergence is often much slower than that of steepest descent,
and this gets worse as n increases.
• Advantages: very simple, does not require calculation of derivatives, convergence rate ok if the
variables are loosely coupled.
• Variants:
– back-and-forth approach repeats e1 e2 . . . en−1 en en−1 . . . e2 e1 e2 . . .
– Hooke-Jeeves: after sequence of coordinate descent steps, search along first and last point
in the cycle.
• Very useful in special cases (a sketch follows this list):
– When alternating over groups of variables so that the optimization over each group is easy. Ex.: f(X, A) = ∑_{j=1}^m ‖yj − Axj‖².
– When the cost of cycling through the n variables is comparable to the cost of computing the gradient. Ex.: f(w) = ½∑_{n=1}^N (yn − wᵀxn)² + λ‖w‖₁ (Lasso).
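A sketch of cyclic coordinate descent on a strictly convex quadratic f(x) = ½xᵀAx − bᵀx (chosen because each 1D minimization is closed-form; A pd is assumed):

```python
# Cyclic coordinate descent with exact 1D minimization on a quadratic.
import numpy as np

def coordinate_descent(A, b, x0, sweeps=100):
    x = x0.copy()
    for _ in range(sweeps):
        for i in range(x.size):     # cycle through e_1, ..., e_n
            # minimize over x_i holding the rest fixed:
            # A_ii x_i = b_i - sum_{j != i} A_ij x_j
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```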
Pattern search Generalizes coordinate search to a richer set of directions at each iteration. At
each iterate xk :
• Choose a certain set of search directions Dk = {p1, ...} and define a frame centered at xk by the points at a given step length γk > 0 from xk along each direction: {xk + γk p1, ...}.
• Evaluate f at each frame point:
– If significantly lower f value found, adopt as new iterate and shift frame center to it.
Possibly, increase γk (expand the frame).
– Otherwise, stay at xk and reduce γk (shrink the frame).
• Possibly, change the directions.
Ex.: algorithm 9.2, which eventually shrinks the frame around a stationary point. Global convergence
under certain conditions on the choice of directions:
(1) At least one direction in Dk should be descent (unless ∇f(xk) = 0); specifically, min_{v∈Rⁿ} max_{p∈Dk} cos ∠(v, p) ≥ δ for a constant δ > 0.
(2) All directions have roughly similar length (so we can use a single step length): ∀p ∈ Dk: βmin ≤ ‖p‖ ≤ βmax, for some positive βmin, βmax and all k.
Examples of such Dk : fig. 9.2
Nelder-Mead method (downhill simplex)
• At each iteration we keep n + 1 points whose convex hull forms a simplex. Proc. 9.5
fig. 9.4
A simplex with vertices z1 , . . . , zn+1 is nondegenerate or nonsingular if the edge matrix V = (z2 − z1 , . . . , zn+1 − z1 ) is nonsingular.
At each iteration we replace the worst vertex (in f-value) with a better point obtained by reflecting, expanding or contracting the simplex along the line joining the worst vertex with the simplex center of mass. If we can't find a better point this way, we keep only the best vertex and shrink the simplex towards it.
A conjugate-direction method
• Idea: algorithm that builds a set of conjugate directions using only function values (thus
minimizes a strictly convex quadratic function); then extend to nonlinear function.
• Parallel subspace property: let x1 ≠ x2 ∈ Rⁿ and {p1, ..., pl} ⊂ Rⁿ l.i. Define the two parallel linear varieties Sj = {xj + ∑_{i=1}^l αᵢpᵢ, α1, ..., αl ∈ R}, j = 1, 2; let x1* and x2* be the minimizers of f(x) = ½xᵀAx − bᵀx on S1 and S2, resp. ⇒ x2* − x1* is conjugate to {p1, ..., pl} (fig. 9.3).
Ex. 2D: given x0 and (say) e1, e2: (1) minimize from x0 along e2 to obtain x1, (2) minimize from x1 along e1 then e2 to obtain z ⇒ z − x1 is conjugate to e2.
• Algorithm: starting with n l.i. directions, perform n consecutive exact minimizations each along
a current direction, then generate a new conjugate direction which replaces the oldest direction;
repeat.
• For quadratic f , terminates in n steps with total O(n2 ) function evaluations (each one O(n2 ))
⇒ O(n4 ). For non-quadratic f the l.s. is inexact (using interpolation) and needs some care.
Problem: the directions {pi } tend to become l.d. Heuristics exist to correct this.
Finite differences and noise (illustrative analysis)
Noise in evaluating f can arise because:
• differential-equation solver (or some other complex numerical procedure): small but nonzero
tolerance in calculations.
Write f(x) = h(x) + noise, where h is the smooth underlying function and η(x; ǫ) bounds the noise level near x. The error of the finite-difference gradient ∇ǫf splits into two terms:
‖∇ǫf(x) − ∇h(x)‖∞ ≤ Lhǫ² (finite-diff. approximation error) + η(x; ǫ)/ǫ (noise error).
“If the noise dominates ǫ, no accuracy in ∇ǫ f and so little hope that −∇ǫ f will be descent.” So,
instead of using close samples, it may be better to use samples more widely separated.
Implicit filtering
• Essentially, steepest descent at an accuracy level ǫ (the finite-difference parameter) that is
decreased over iterations.
• Useful when we can control the accuracy in computing f and ∇ǫ f (e.g. if we can control the
tolerance of a differential-equation solver, or the number of trials of a stochastic simulation).
A more accurate (less noisy) value costs more computation.
• The algorithm decreases ǫ systematically (but hopefully not as quickly as the decay in error)
so as to maintain a reasonable accuracy in ∇ǫ f (x). Each iteration is a steepest descent step
at accuracy level ǫk (i.e., along −∇ǫk f (x)) with a backtracking l.s. that is limited to a fixed
number of steps. We decrease ǫk when:
– or we reach the fixed number of backtracking steps (i.e., ∇ǫ f (x) is a poor approximation of ∇f (x)).
• Converges if ǫk is decreased such that ǫk² + η(xk; ǫk)/ǫk → 0, i.e., the noise level decreases sufficiently
fast as the iterates approach a solution.
Review: derivative-free optimization
Methods that use only function values to minimize f (but not ∇f or ∇2 f ). Less efficient than derivative-based
methods, but possibly acceptable for small n or in special cases.
• Model-based methods: build a linear or quadratic model of f by interpolating f at a set of samples and use it
with trust-region. Slow convergence rate and very costly steps.
• Coordinate descent (alternating minimization): minimize successively along each variable. If it does con-
verge, its rate of convergence is often much slower than that of steepest descent. Very simple and convenient
sometimes.
• Pattern search: the iterate xk carries a set of directions that is possibly updated based on the values of f
along them. Generalizes coordinate descent to a richer direction set.
• Nelder-Mead method (downhill simplex): the iterate xk carries a simplex that evolves based on the values of
f , falling down and eventually shrinking around a minimizer (if it does converge).
• Conjugate directions built using the parallel subspace property: computing the new conjugate direction requires
n line searches (CG requires only one).
• Finite-difference approximations to the gradient degrade significantly with noise in f .
Implicit filtering: steepest descent at an accuracy level ǫ (the finite-difference parameter) that is decreased
over iterations. Useful when we can control the accuracy in computing f and ∇ǫ f .
10 Nonlinear least-squares problems
• Least-squares (LSQ) problem: f(x) = ½∑_{j=1}^m rj(x)², where the residuals rj: Rⁿ → R, j = 1, ..., m, are smooth and m ≥ n.
• Arise very often in practice when fitting a parametric model to observed data; rj (x) is the error
for datum j with model parameters x; “min f ” means finding the parameter values that best
match the model to the data.
• Ex.: regression (curve fitting): rj = yj − φ(x; tj); f(x) = ½∑_{j=1}^m (yj − φ(x; tj))² is the LSQ error of fitting the curve φ: t ↦ y (with parameters x) to the observed data points {(tj, yj)}_{j=1}^m.
If using other norms, e.g. |rj| or |rj|³, it won't be a LSQ problem.
• The special form of f simplifies the minimization problem. Write f(x) = ½‖r(x)‖₂² in terms of the residual vector r: Rⁿ → Rᵐ, r(x) = (r1(x), ..., rm(x))ᵀ, with Jacobian J(x) = (∂rj/∂xᵢ)_{j=1,...,m; i=1,...,n} = (∇r1ᵀ; ...; ∇rmᵀ) (m × n matrix of first partial derivatives).
• Linear LSQ problem: rj(x) is linear ∀j ⇒ J(x) = J constant. Calling r = r(0), we have f(x) = ½‖Jx + r‖₂² (✐ convex?), ∇f(x) = Jᵀ(Jx + r), ∇²f(x) = JᵀJ constant.
(✐ is fitting a polynomial to data a linear LSQ problem?)
Minimizer: ∇f(x*) = 0 ⇒ JᵀJx* = −Jᵀr, the normal equations: n × n linear system with pd or psd matrix which can be solved with numerical analysis techniques.
Cholesky factorization of JᵀJ, or QR or SVD factorization of J are best depending on the problem; one could also use the linear conjugate gradient method for large n.
If m is very large, do not build J explicitly but accumulate JᵀJ = ∑_j ∇rj(x)∇rj(x)ᵀ and Jᵀr = ∑_j rj(x)∇rj(x).
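A sketch comparing the two ways of solving the linear LSQ problem (min ½‖Jx + r‖² ⇔ Jx ≈ −r; the toy data are illustrative):

```python
# Linear LSQ: normal equations vs. an SVD-based solver (lstsq),
# the latter being the more numerically stable choice.
import numpy as np

J = np.random.randn(100, 3)                   # toy data: m = 100, n = 3
r = np.random.randn(100)

x_ne = np.linalg.solve(J.T @ J, -J.T @ r)     # normal equations: J^T J x = -J^T r
x_ls = np.linalg.lstsq(J, -r, rcond=None)[0]  # SVD-based solve of J x = -r
assert np.allclose(x_ne, x_ls)
```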
• For nonlinear LSQ problems f isn’t necessarily convex. We see 2 methods (Gauss-Newton,
Levenberg-Marquardt) which take advantage of the particular form of LSQ problems; but any
of the methods we have seen in earlier chapters are applicable too (e.g. Newton's method, if we
compute ∇2 rj ).
Gauss-Newton method
• Line search with Wolfe conditions and a modification of Newton’s method: instead of generating
the search direction pk by solving the Newton eq. ∇²f(xk)p = −∇f(xk), ignore the second-order term in ∇²f (i.e., approximate ∇²fk ≈ JkᵀJk) and solve JkᵀJk p^GN_k = −Jkᵀrk.
• Equivalent to approximating r(x) by a linear model r(x+p) ≈ r(x)+J(x)p (linearization) and
so f (x) by a quadratic model with Hessian J(x)T J(x), then solving the linear LSQ problem
minp 21 kJk p + rk k22 .
• If Jk has full rank and ∇fk = Jkᵀrk ≠ 0 then p^GN_k is a descent direction. (Pf.: evaluate (p^GN_k)ᵀ∇fk < 0.)
The theorem doesn't hold if J(xk) is rank-deficient for some k. This occurs when the normal equations are underdetermined (infinitely many solutions for p^GN_k).
• Rate of convergence depends on how much the term JᵀJ dominates the second-order term in the Hessian at the solution x*; it is linear but rapid in the small-residual case. Eq. (10.30):
‖xk + p^GN_k − x*‖ ≲ ‖(Jᵀ(x*)J(x*))⁻¹∇²f(x*) − I‖ ‖xk − x*‖ + O(‖xk − x*‖²).
• Inexact Gauss-Newton method : solve the linsys approximately, e.g. with CG.
Levenberg-Marquardt method
• Same modification of Newton’s method as in the Gauss-Newton method but with a trust region
instead of a line search. Essentially, modify JTk Jk → JTk Jk + λI with λ ≥ 0 to make it pd.
• Spherical trust region with radius ∆k, quadratic model for f with Hessian JkᵀJk:
mk(p) = ½‖rk‖² + pᵀJkᵀrk + ½pᵀJkᵀJkp = ½‖Jkp + rk‖₂²
⇒ min_p ½‖Jkp + rk‖₂² s.t. ‖p‖ ≤ ∆k.
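A sketch of one Levenberg-Marquardt-style iteration: a damped Gauss-Newton step with a simple accept/reject rule on f = ½‖r‖² (this adjusts the damping λ directly rather than the radius ∆k; factors of 3 are a common heuristic, not from the notes):

```python
# One damped Gauss-Newton (LM-style) step on f(x) = 0.5 ||r(x)||^2.
import numpy as np

def lm_step(residual, jacobian, x, lam):
    r, J = residual(x), jacobian(x)
    p = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ r)
    if 0.5 * np.sum(residual(x + p) ** 2) < 0.5 * np.sum(r ** 2):
        return x + p, lam / 3      # success: accept step, reduce damping
    return x, lam * 3              # failure: reject step, increase damping
```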
Large-residual problems
If the residuals rj (x∗ ) near the solution x∗ are large, both Gauss-Newton and Levenberg-Marquardt
converge slowly, since JT J is a bad model of the Hessian. Options:
• Use a hybrid method, e.g. start with GN/LM then switch to (quasi-)Newton; or apply a quasi-Newton approximation Sk to the second-order part of the Hessian ∑_j rj(x)∇²rj(x) and combine with GN: Bk = JkᵀJk + Sk.
Note that, in model fitting, large residuals mean the model is a poor fit to the data, so we may want
to use a better model.
Review: nonlinear least-squares problems
• Linear LSQ: rj linear, J constant; the minimizer x* satisfies (calling r = r(0)) the normal eqs. JᵀJx* = −Jᵀr (JᵀJ pd or psd).
• Nonlinear LSQ: GN, LM methods.
• Gauss-Newton method :
– Approximate Hessian ∇²fk ≈ JkᵀJk, solve for the search direction JkᵀJk p^GN_k = −Jkᵀrk, inexact line search with Wolfe conditions.
– Equivalent to linearizing r(x + p) ≈ r(x) + J(x)p.
– Problems if Jk is rank-defective.
• Levenberg-Marquardt method :
– Like GN but with trust region instead of line search: min_{‖p‖≤∆k} ½‖Jkp + rk‖₂².
– No problem if Jk is rank-deficient.
– One way to solve the trust-region subproblem approximately: try some λ ≥ 0, solve (JkᵀJk + λI)p^LM_k = −Jkᵀrk, accept p^LM_k if sufficient decrease in f, otherwise increase λ and try again.
• Global convergence under certain assumptions.
• Rate of convergence: linear but fast if JT (x∗ )J(x∗ ) ≈ ∇2 f (x∗ ), which occurs with small residuals (rj (x) ≈ 0)
or quasilinear residuals (∇2 rj (x) ≈ 0). Otherwise GN/LM are slow; try other methods instead (quasi-Newton,
Newton, etc.) or hybrids that combine the advantages of GN/LM and (quasi-)Newton.
Method of scoring (≈ Gauss-Newton for maximum likelihood)
Maximum likelihood estimation of parameters x given observations {tᵢ}: maxₓ (1/N)∑_{i=1}^N log p(tᵢ; x), where p(t; x) is a pmf or pdf in t. Call L(t; x) = log p(t; x). Then (all derivatives are wrt x in this section, and assume we can interchange ∇ and ∫):
Gradient: ∇L = (1/p)∇p
Hessian: ∇²L = −(1/p²)∇p∇pᵀ + (1/p)∇²p = −∇log p ∇log pᵀ + (1/p)∇²p.
Taking expectations wrt the model p(t; x) we have:
E{−∇²L} = E{∇log p ∇log pᵀ} − E{(1/p)∇²p} = cov{∇log p}
since E{∇log p} = ∫ p(1/p)∇p = ∇∫p = 0 and E{(1/p)∇²p} = ∫ p(1/p)∇²p = ∫∇²p = ∇²∫p = 0.
In statistical parlance:
• Observed information: −∇2 L.
• Expected information: E{−∇²L} = E{∇log p ∇log pᵀ} (Fisher information matrix)
• Score: ∇ log p = ∇L
Two ways of approximating the log-likelihood Hessian (1/N)∑_{i=1}^N ∇²log p(tᵢ; x) using only the first-order term ∇log p:
• Gauss-Newton: sample observed information J(x) = (1/N)∑_{i=1}^N ∇log p(tᵢ; x)∇log p(tᵢ; x)ᵀ.
• Method of scoring: expected information J(x) = E{∇log p ∇log pᵀ}. This requires computing an integral, but its form is often much simpler than that of the observed information (e.g. for the exponential family).
Advantages:
• Good approximation to Hessian (the second-order term is small on average if the model fits
well the data).
Ex. (exponential-family model, of the form p(t; x) = exp(Φ(x)ᵀh(t))/g(x), with g the normalization factor):
∇log p = −(1/g)∇g + ∇Φ(x)h(t) = ∇Φ(x)(h(t) − E{h(t)}) ⇒
log-likelihood gradient: (1/N)∑_{i=1}^N ∇log p(tᵢ; x) = ∇Φ(x)(E_data{h} − E_model{h}).
Missing-data problem Consider t observed and z missing, so p(t; x) = ∫ p(t|z; x)p(z; x) dz (e.g. z = label of mixture component). We have:
∇log p(t; x) = (1/p(t; x)) ∫ (∇p(t|z; x) p(z; x) + p(t|z; x)∇p(z; x)) dz
= (1/p(t; x)) ∫ p(z; x)p(t|z; x)(∇log p(t|z; x) + ∇log p(z; x)) dz = E_{z|t}{∇log p(t, z; x)}
= posterior expectation of the complete-data log-likelihood gradient.
∇²log p(t; x) = ∇ ∫ p(z|t; x)∇log p(t, z; x) dz
= E_{z|t}{∇²log p(t, z; x)} + ∫ ∇p(z|t; x)∇log p(t, z; x)ᵀ dz.
Noting that
∇p(z|t; x) = ∇(p(t, z; x)/p(t; x)) = (1/p(t; x))(∇p(t, z; x) − p(z|t; x)∇p(t; x))
= p(z|t; x)(∇log p(t, z; x) − ∇log p(t; x)),
Again, ignoring the second-order term we obtain a cheap, pd approximation to the Hessian (Gauss-Newton method), but for minimizing the likelihood, not for maximizing it!
We can still use the first-order, negative-definite approximation from before. EM algorithm:
• E step: compute p(z|t; x_old) and Q(x; x_old) = E_{z|t; x_old}{log p(t, z; x)}.
• M step: x_new = arg maxₓ Q(x; x_old).
11 Nonlinear equations
• Problem: find roots of the n equations r(x) = 0 in n unknowns x, where r(x): Rⁿ → Rⁿ (ex. in p. 271). There may be 0, 1, finitely many, or infinitely many roots.
• Many similarities with optimization: Newton’s method, line search, trust region. . .
• Differences:
– In optimization, the local optima can be ranked by objective value.
In root finding, all roots are equally good.
– For quadratic convergence, we need derivatives of order 2 in optimization, but only of order 1 in root-finding.
– Quasi-Newton methods are less useful in root-finding.
• Assume the n × n Jacobian J(x) = (∂rᵢ/∂xⱼ)ᵢⱼ exists and is continuous in the region of interest.
Newton’s method
• Taylor’s th.: linearize r(x + p) = r(x) + J(x)p + O(kpk2 ) and use as model; find its root.
for k = 0, 1, 2 . . .
solve Jk pk = −rk
xk+1 ← xk + pk
end
• Newton’s method for optimizing an objective function f is the same as applying this algorithm
to r(x) = ∇f (x).
• Convergence rate for nondegenerate roots (th. 11.2): superlinear if the Jacobian is continuous; quadratic if the Jacobian is Lipschitz continuous.
• Problems:
– Degenerate roots, e.g. r(x) = x² produces xk = 2⁻ᵏx0, which converges linearly.
– Not globally convergent: away from a root the algorithm can diverge or cycle; it is not
even defined if J(xk ) is singular.
– Expensive to compute J and solve the system exactly for large n.
Broyden’s method (secant or quasi-Newton method)
• Constructs an approximation to the Jacobian over iterations.
• We require the updated Jacobian approximation Bk+1 to satisfy the secant equation yk = Bk+1sk and to minimize ‖B − Bk‖, i.e., the smallest possible update that satisfies the secant eq.: Bk+1 = Bk + ((yk − Bksk)skᵀ)/(skᵀsk) (from lemma 11.4).
• Convergence rate: superlinear if the initial point x0 and Jacobian B0 are close to the root x∗
and its Jacobian J(x∗ ), resp.; the latter condition can be crucial, but is difficult to guarantee
in practice.
In 1D (n = 1):
Newton's method: xk+1 = xk − r(xk)/r′(xk).
Secant method: xk+1 = xk − r(xk)/Bk with Bk = (r(xk) − r(xk−1))/(xk − xk−1), independent of Bk−1.
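A sketch of Newton's method for a nonlinear system, with the Broyden rank-1 update shown as the quasi-Newton alternative (names are illustrative; no globalization safeguards):

```python
# Newton's method for r(x) = 0 and the Broyden Jacobian update.
import numpy as np

def newton_roots(r, J, x, iters=50, tol=1e-12):
    for _ in range(iters):
        if np.linalg.norm(r(x)) < tol:
            break
        x = x + np.linalg.solve(J(x), -r(x))   # solve J_k p_k = -r_k, take full step
    return x

def broyden_update(B, s, y):
    """Smallest update (lemma 11.4) satisfying the secant eq. B_new s = y."""
    return B + np.outer(y - B @ s, s) / (s @ s)
```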
Practical methods
• Line search and trust region techniques ensure convergence away from a root by minimizing the merit function f(x) = ½‖r(x)‖₂².
• Problem: each root (r(x) = 0) is a local minimizer of f but not vice versa, so local minima that are not roots can attract the algorithm.
If a local minimizer x̄ is not a root then J(x̄) is singular. (Pf.: ∇f(x̄) = J(x̄)ᵀr(x̄) = 0 with r(x̄) ≠ 0 ⇒ J(x̄) singular.)
– We want descent directions for f (pkᵀ∇f(xk) < 0); step length chosen as in ch. 3.
– Zoutendijk's th.: descent directions + Wolfe conditions + Lipschitz continuous J ⇒ ∑_{k≥0} cos²θk ‖Jkᵀrk‖² < ∞.
– So if cos θk ≥ δ for constant δ ∈ (0, 1) and all k sufficiently large ⇒ ∇fk = JTk rk → 0; and
if kJ(x)−1 k is bounded ⇒ rk → 0.
– If well defined, the Newton step pk = −Jk⁻¹rk is a descent direction for f when rk ≠ 0 (Pf.: pkᵀ∇fk = −rkᵀJk⁻ᵀJkᵀrk = −‖rk‖² < 0). But
cos θk = −pkᵀ∇fk/(‖pk‖‖∇fk‖) = ‖rk‖²/(‖Jk⁻¹rk‖‖Jkᵀrk‖) ≥ 1/(‖Jk‖‖Jk⁻¹‖) = 1/κ(Jk),
so a large condition number causes poor performance (search direction almost ⊥ ∇fk).
– One modification of Newton's direction is (JkᵀJk + τkI)pk = −Jkᵀrk; a large enough τk ensures cos θk is bounded away from 0, because τk → ∞ ⇒ pk ∝ −Jkᵀrk = −∇fk.
– Inexact Newton steps do not compromise global convergence: if at each step ‖rk + Jkpk‖ ≤ ηk‖rk‖ for ηk ∈ [0, η] and η ∈ [0, 1), then cos θk ≥ (1 − η)/(2κ(Jk)).
– Algorithm 4.1 from ch. 4 applied to f(x) = ½‖r(x)‖₂² using Bk = JkᵀJk as approximate Hessian in the model mk, i.e., linearize r(p) ≈ rk + Jkp.
– The exact solution has the form pk = −(JkᵀJk + λkI)⁻¹Jkᵀrk for some λk ≥ 0, with λk = 0 if the unconstrained solution is in the trust region. The Levenberg-Marquardt algorithm searches for such a λk.
– Global convergence (to non-degenerate roots) under mild conditions.
– Quadratic convergence rate if the trust region subproblem is solved exactly for all k suffi-
ciently large.
Continuation/homotopy/path-following methods
• Problem of Newton-based methods: unless J is nonsingular in the region of interest, they may
converge to a local minimizer of the merit function rather than a root.
• Continuation methods: instead of dealing with the original problem r(x) = 0 directly, establish
a continuous sequence of root-finding problems that converges to the original problem but starts
from an easy problem; then solve each problem in the sequence, tracking the root as we move
from the easy to the original problem.
• Homotopy map: H(x, λ) = λr(x) + (1 − λ)(x − a), where a ∈ Rⁿ is fixed and λ ∈ R. At λ = 1 we recover the original problem r(x) = 0; at λ = 0 we have an easy problem with solution x = a.
If (∂/∂x)H(x, λ) is nonsingular then H(x, λ) = 0 defines a continuous curve x(λ), the zero path.
(By the implicit function theorem (th. A.2); ex.: x² − y + 1 = 0, Ax + By + c = 0 ⇒ y = g(x) or x = g(y) locally.)
We want to follow the path numerically. x = a plays the role of the initial iterate.
• Naive approach: start from λ = 0, x = a; gradually increase λ from 0 to 1 and solve H(x, λ)
= 0 using as initial x the one from the previous λ value; stop after solving for λ = 1.
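A sketch of this naive approach, with Newton's method as the inner solver (the scalar test problem r(x) = x² − 4 with a = 1 is an assumed example):

import numpy as np

def continuation(r, J, a, n_steps=20, newton_iters=20, tol=1e-10):
    x, n = a.copy(), len(a)             # at lam = 0 the solution is x = a
    for lam in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        for _ in range(newton_iters):   # Newton on H(., lam), warm-started
            H = lam * r(x) + (1 - lam) * (x - a)
            if np.linalg.norm(H) < tol:
                break
            JH = lam * J(x) + (1 - lam) * np.eye(n)   # dH/dx
            x = x + np.linalg.solve(JH, -H)
    return x

r = lambda x: np.array([x[0]**2 - 4.0])     # roots at x = +-2
J = lambda x: np.array([[2 * x[0]]])
print(continuation(r, J, np.array([1.0])))  # follows the zero path to x = 2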
• Arc-length parametrization of the zero path: (x(s), λ(s)) where s = arc length measured from
(a, 0) at s = 0. Since H(x(s), λ(s)) = 0 ∀s ≥ 0, its total derivative wrt s is also 0:
dH/ds = ∂H/∂x (x, λ) ẋ + ∂H/∂λ (x, λ) λ̇ = 0 (n equations),
where (ẋ, λ̇) = (dx/ds, dλ/ds) is the tangent vector to the zero path.
• Continuation methods can fail in practice with even simple problems and they require consid-
erable computation; but they are generally more reliable than merit-function methods.
• Related algorithms:
– For constrained optimization: quadratic-penalty, log-barrier, interior-point.
– For (heuristic) global optimization: deterministic annealing. Ex.:
h(x, λ) = λ f(x) + (1 − λ) g(x), where λ = 1 gives the original objective f and λ = 0 an easy problem g, e.g. quadratic.
Review: nonlinear equations
Problem: find roots of n equations r(x) = 0 in n unknowns.
• Degenerate roots (singular Jacobian) cause troubles (e.g. slower convergence).
• Similar to optimization but harder: most methods only converge if starting sufficiently near a root.
– Newton’s method : linsys given by Jacobian of r (first deriv. only), quadratic convergence. Inexact steps
may be used.
– Broyden’s method : quasi-Newton method, approximates Jacobian through differences in r and x, super-
linear convergence. Limited-memory versions exist.
• Merit function f(x) = ½‖r(x)‖²₂ turns root-finding into minimization (so we are guided towards minima), but contains minima that are not roots (∇f(x) = J(x)ᵀr(x) = 0 but r(x) ≠ 0, i.e., singular Jacobian).
– Line search and trust region strategies may be used. Results obtained in earlier chapters (e.g. for
convergence) carry over appropriately.
• Root finding and minimization: related but not equivalent:
– From root finding r(x) = 0 to minimization: the merit function ‖r(x)‖² contains all the roots of r(x) = 0 as global minimizers, but also contains local minimizers that are not roots.
– From minimization min f (x) to root finding: the nonlinear equations ∇f (x) = 0 contain all the min-
imizers of f (x) as roots, but also contain other roots that are not minimizers (maximizers and saddle
points).
• Continuation/homotopy methods construct a family of problems parameterized over λ ∈ [0, 1] so that λ = 0 is
easy to solve (e.g. x − a = 0) and λ = 1 is the original problem (r(x) = 0).
– This implicitly defines a path in (x, λ) that most times is continuous, bounded and goes from (a, 0) to
(x∗ , 1) where x∗ is a root of r.
– We follow the path numerically (solving an ODE, or solving a sequence of root-finding problems).
– More robust than other methods, but higher computational cost.
Summary of methods
Assumptions:
• Typical behavior.
• Evaluation costs:
f’s type      f(x)     ∇f(x)    ∇²f(x)
quadratic     O(n²)    O(n²)    O(1)
other         O(n)     O(n)     O(n²)
• Appropriate conditions for: the line search (e.g. Wolfe) or trust region strategy; the functions, etc. (e.g. Lipschitz continuity,
solution with pd Hessian).
Method                                     glob. conv.  conv. rate              memory   time/iter.    ∇ order  cost (quadratic f)
Newton               Pure                  N            quadratic               O(n²)    O(n³)         2        O(n³)
                     Modified-Hessian      Y            quadratic               O(n²)    O(n³)         2        O(n³)
Conjugate-gradient   Fletcher-Reeves       Y            linear                  O(n)     O(n)          1        O(n³)
                     Polak-Ribière         N            linear                  O(n)     O(n)          1        O(n³)
Quasi-Newton         DFP, BFGS, SR1        N            superlinear             O(n²)    O(n²)         1        O(n³)
Large-scale          Newton-CG             Y            linear to quadratic     O(n²)    O(n²)–O(n³)   2        at least O(n³), but finite
                                                        (dep. on forcing seq.)
                     L-BFGS                N            linear                  O(nm)    O(nm)         1        ∞
Derivative-free      Model-based           N            ≤ linear                O(n³)    O(n⁴)         0        O(n⁶)
                     Coordinate descent    N            ≤ linear                O(1)     O(n)          0        ∞
                     Nelder-Mead           N            ≤ linear                O(n²)    O(n)          0        ∞
                     Conjugate directions  N            ≤ linear                O(n²)    O(n²)         0        O(n⁴)
Least-squares        Gauss-Newton          N            linear                  O(n²)    O(n³)         1        O(n³)
                     Levenberg-Marquardt   Y            linear                  O(n²)    O(n³)         1        O(n³)
12 Theory of constrained optimization
Problem: min_{x∈Rn} f(x) s.t. ci(x) = 0 (i ∈ E), ci(x) ≥ 0 (i ∈ I); feasible set Ω = {x ∈ Rn: ci(x) = 0, i ∈ E; ci(x) ≥ 0, i ∈ I}.
• x∗ is a local solution iff x∗ ∈ Ω and ∃ neigh. N of x∗: f(x) ≥ f(x∗) ∀x ∈ N ∩ Ω.
• x∗ is a strict local solution iff x∗ ∈ Ω and ∃ neigh. N of x∗: f(x) > f(x∗) ∀x ∈ N ∩ Ω, x ≠ x∗.
• x∗ is an isolated local solution iff x∗ ∈ Ω and ∃ neigh. N of x∗: x∗ is the only local minimizer in N ∩ Ω.
• At a feasible point x, the inequality constraint ci (i ∈ I) is:
– active iff ci (x) = 0 (x is on the boundary for that constraint)
– inactive iff ci (x) > 0 (x is interior point for that constraint).
✐ What happens if we define the feasible set as ci (x) > 0 rather than ci (x) ≥ 0?
• For inequality constraints, the constraint normal ∇ci (x) points towards the feasible region and
is ⊥ to the contour ci (x) = 0. For equality constraints, ∇ci (x) is ⊥ to the contour ci (x) = 0.
• Mathematical characterization of solutions for unconstrained optimization (reminder):
– Necessary conditions: x∗ local minimizer of f ⇒ ∇f (x∗ ) = 0, ∇2 f (x∗ ) psd.
– Sufficient conditions: ∇f (x∗ ) = 0, ∇2 f (x∗ ) pd ⇒ x∗ is a strong local minimizer of f .
Here we derive similar conditions for constrained optimization problems. Let’s see some exam-
ples.
• Local and global solutions: constraining can decrease or increase the number of optimizers:
min_{x∈Rn} ‖x‖²₂: unconstrained: single solution (x = 0); constrained s.t. ‖x‖²₂ ≥ 1: infinitely many solutions (‖x‖₂ = 1); constrained s.t. c1(x) = 0: several solutions.
min_{x∈R} sin x: unconstrained: infinitely many solutions (x = −π/2 + 2kπ, k ∈ Z); constrained s.t. 0 ≤ x ≤ 2π: single solution (x = 3π/2).
• Smoothness of both f and the constraints is important since then we can predict what happens
near a point.
Some apparently nonsmooth problems (involving k·k1 , k·k∞ ) can be reformulated as smooth:
1. A nonsmooth function by adding variables: ‖x‖₁ = Σ_{i=1..n} |xi| = Σ_{i=1..n} (xi⁺ + xi⁻) by splitting xi = xi⁺ − xi⁻ into nonnegative and nonpositive parts, where xi⁺ = max(xi, 0) ≥ 0 and xi⁻ = max(−xi, 0) ≥ 0. (✐ Do we need to force (x⁺)ᵀx⁻ = 0?)
2. A nonsmooth constraint as several smooth constraints.
Case 1: a single equality constraint c1(x) = 0. At a solution x∗: ∇f(x∗) = λ1∇c1(x∗) for some λ1 ∈ R. [Figure: contours of f, minimum x∗, constraint normal ∇c1.]
Pf.: consider a feasible point x, i.e., c1(x) = 0. An infinitesimal move to x + d:
• retains feasibility if ∇c1(x)ᵀd = 0 (by Taylor’s th.: 0 = c1(x + d) ≈ c1(x) + ∇c1(x)ᵀd)
• decreases f if ∇f (x)T d < 0 (by Taylor’s th.: 0 > f (x + d) − f (x) ≈ ∇f (x)T d) (descent direction).
Thus if no improvement is possible then there cannot be a direction d such that ∇c1 (x)T d = 0 and
∇f (x)T d < 0 ⇒ ∇f (x) = λ1 ∇c1 (x) for some λ1 ∈ R.
Equivalent formulation in terms of the Lagrangian function L(x, λ1 ) = f (x) − λ1 c1 (x): at a solu-
|{z}
tion x∗ , ∃λ∗1 ∈ R: ∇x L(x∗ , λ∗1 ) = 0 (and also c1 (x∗ ) = 0). Lagrange
multiplier
Idea: to optimize equality-constrained problems, search for stationary points of the Lagrangian.
48
Case 2: a single inequality constraint c1 (x) ≥ 0. The solution is the same, but now the sign
of λ∗1 matters: at x∗ , ∇f (x∗ ) = λ∗1 ∇c1 (x∗ ) for λ∗1 ≥ 0.
Pf.: consider a feasible point at x i.e., c1 (x) ≥ 0. An infinitesimal move to x + d:
• retains feasibility if c1 (x) + ∇c1 (x)T d ≥ 0 (by Taylor’s th.: 0 ≤ c1 (x + d) ≈ c1 (x) + ∇c1 (x)T d)
If no improvement is possible:
1. Interior point c1(x) > 0: any small enough d satisfies feasibility ⇒ ∇f(x) = 0 (this is the unconstrained case).
2. Boundary point c1(x) = 0: there is no d with ∇c1(x)ᵀd ≥ 0 and ∇f(x)ᵀd < 0 ⇒ ∇f(x) = λ1∇c1(x) for some λ1 ≥ 0 (∇f and ∇c1 point in the same direction).
Another example:
At x: no d satisfies ∇c1(x)ᵀd ≥ 0, ∇c2(x)ᵀd ≥ 0, ∇f(x)ᵀd < 0.
At z: dz satisfies ∇c1(z)ᵀd ≥ 0, ∇c2(z)ᵀd ≥ 0, ∇f(z)ᵀd < 0.
At y: dy satisfies c1(y) + ∇c1(y)ᵀd ≥ 0 (c1 not active), ∇c2(y)ᵀd ≥ 0, ∇f(y)ᵀd < 0.
At w: dw satisfies c1(w) + ∇c1(w)ᵀd ≥ 0, c2(w) + ∇c2(w)ᵀd ≥ 0 (c1, c2 not active), ∇f(w)ᵀd < 0.
Equivalent formulation in general: L(x, λ) = f(x) − Σi λi ci(x). At a solution x∗, ∃λ∗ ≥ 0 (≡ λ∗i ≥ 0 ∀i): ∇x L(x∗, λ∗) = 0 and λ∗i ci(x∗) = 0 ∀i (complementarity condition), and also ci(x∗) ≥ 0 ∀i.
First-order necessary (Karush-Kuhn-Tucker) conditions for optimality
They relate the gradient of f and of the constraints at a solution.
Consider the constrained optimization problem min_{x∈Rn} f(x) s.t. ci(x) = 0 (i ∈ E), ci(x) ≥ 0 (i ∈ I).
• |E ∪ I| = m constraints.
• Lagrangian L(x, λ) = f(x) − Σ_{i∈E∪I} λi ci(x).
• Active set at a feasible point x: A(x) = E ∪ {i ∈ I: ci (x) = 0}.
Matrix of active constraint gradients at x: A(x) = [∇ci(x)ᵀ]_{i∈A(x)}, of size |A(x)| × n.
• Degenerate constraint behavior: e.g. c21 (x) is equivalent to c1 (x) as an equality constraint, but
∇(c21 ) = 2c1 ∇c1 = 0 at any feasible point, which disables the condition ∇f = λ1 ∇c1 . We can
avoid degenerate behavior by requiring the following constraint qualification:
– Def. 12.4: given x∗ , A(x∗ ), the linear independence constraint qualification (LICQ) holds iff
the set of active constraint gradients {∇ci(x∗), i ∈ A(x∗)} is l.i. (which implies ∇ci(x∗) ≠ 0). Equivalently, A(x∗) has full row rank.
Other constraint qualif. possible in th. 12.1, in partic. “all active constraints are linear”.
Th. 12.1 (KKT conditions): x∗ local solution of the optimization problem, LICQ holds at x∗ ⇒
∃!λ∗ ∈ Rm (Lagrange multipliers) such that: ex. 12.6
a) ∇x L(x∗ , λ∗ ) = 0 n eqs.
b) ci (x∗ ) = 0 ∀i ∈ E
c) ci (x∗ ) ≥ 0 ∀i ∈ I
d) λ∗i ≥ 0 ∀i ∈ I
e) λ∗i ci(x∗) = 0 ∀i ∈ E ∪ I (m complementarity eqs.)
Notes:
• I = ∅: KKT ⇔ ∇L(x∗, λ∗) = 0 (∇ wrt x, λ). In principle solvable by writing x = (xa, φ(xa))ᵀ (implicitly eliminating the constraints) and solving the unconstrained problem min_{xa} f(xa, φ(xa)).
• Given a solution x∗ , its associated Lagrange multipliers λ∗ are 0 for the inactive constraints
and (A(x∗ )A(x∗ )T )−1 A(x∗ )∇f (x∗ ) for the active ones. Pf.: solve for λ∗ in KKT a).
• f (x∗ ) = L(x∗ , λ∗ ) (from the complementarity condition).
• Strict complementarity: exactly one of λ∗i and ci (x∗ ) is zero ∀i ∈ I. Easier for some algorithms.
An active constraint ci at x∗ is strongly active if λ∗i > 0 and weakly active if λ∗i = 0.
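A tiny numerical check of this multiplier formula on an assumed example, min x1 + x2 s.t. x1² + x2² = 2, whose solution is x∗ = (−1, −1):

import numpy as np

x_star = np.array([-1.0, -1.0])
grad_f = np.array([1.0, 1.0])                   # f(x) = x1 + x2
A = np.array([[2 * x_star[0], 2 * x_star[1]]])  # gradient row of c(x) = x1^2 + x2^2 - 2
lam = np.linalg.solve(A @ A.T, A @ grad_f)      # lambda* = (A A^T)^{-1} A grad f
print(lam)                                      # [-0.5]
print(grad_f - A.T @ lam)                       # grad_x L(x*, lambda*) = 0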
Second order conditions
• Set of linearized feasible directions F (x) at a feasible point x, which
is a cone (def. 12.3):
F(x) = {w ∈ Rn: wᵀ∇ci(x) = 0 ∀i ∈ E; wᵀ∇ci(x) ≥ 0 ∀i ∈ I ∩ A(x)}.
These are the directions along which we can move from x, infinitesimally, and remain feasible.
If LICQ holds, F (x) is the tangent cone to the feasible set at x.
Other examples of tangent cones F (x) in 2D:
• If the first-order conditions hold at x∗, then an infinitesimal move along any vector w ∈ F(x∗) remains feasible and, to first order, either increases f, if wᵀ∇f(x∗) > 0 (“decided”), or keeps it constant, if wᵀ∇f(x∗) = 0 (“undecided”). (✐ Why can it not decrease f?)
• Second-order necessary conditions (Th. 12.5): x∗ local solution, LICQ holds, KKT conditions hold with Lagrange multiplier vector λ∗ ⇒ wᵀ∇²xx L(x∗, λ∗)w ≥ 0 ∀w ∈ C(x∗, λ∗) (the critical cone; see the review below).
• Second-order sufficient conditions (Th. 12.6): x∗ ∈ Rn feasible point, KKT conditions hold with Lagrange multiplier λ∗, wᵀ∇²xx L(x∗, λ∗)w > 0 ∀w ∈ C(x∗, λ∗), w ≠ 0 ⇒ x∗ is a strict local solution. (ex. 12.8, ex. 12.9)
Duality
• Dual problem: constructed from the primal problem (objective and constraints) and related to
it in certain ways (possibly easier to solve computationally, gives lower bound on the optimal
primal objective). Applies to convex problems.
• Consider only inequalities with f and −ci all convex (so convex problem):
Primal problem: minx∈Rn f (x) s.t. c(x) ≥ 0 with c(x) = (c1 (x), . . . , cm (x))T , Lagrangian
L(x, λ) = f (x) − λT c(x). Note L(·, λ) is convex for any λ ≥ 0.
• Dual problem: maxλ∈Rm q(λ) s.t. λ ≥ 0 with dual objective function q: Rm → R defined as
q(λ) = inf x L(x, λ) with domain D = {λ: q(λ) > −∞}. ex. 12.10
• Th. 12.10: q is concave and its domain D is convex. So the dual problem is convex. Proofs
Th. 12.11 (weak duality): x̄ feasible for the primal, λ̄ feasible for the dual (i.e., c(x̄), λ̄ ≥ 0) ⇒
q(λ̄) ≤ f (x̄).
Th. 12.12: x̄ is a solution of the primal; f, −c1 , . . . , −cm are convex in Rn and diff. at x̄ ⇒ any
λ̄ for which (x̄, λ̄) satisfies the primal KKT conditions is a solution of the dual.
Th. 12.13: x̄ is a solution of the primal at which LICQ holds; f, −c1 , . . . , −cm are convex
and cont. diff. in Rn ; suppose that λ̂ is a solution of the dual, that x̂ = arg inf x L(x, λ̂) and
that L(·, λ̂) is strictly convex ⇒ x̄ = x̂ (i.e., x̂ is the unique solution of the primal) and
f (x̂) = L(x̂, λ̂) = q(λ̂).
• Th. 12.14: (x̄, λ̄) is a solution pair of the primal at which LICQ holds; f, −c1 , . . . , −cm are
convex and cont. diff. in Rn ⇒ (x̄, λ̄) is a solution of the Wolfe dual.
• Examples:
Review: theory of constrained optimization
Constrained optimization problem: min_{x∈Rn} f(x) s.t. ci(x) = 0 (i ∈ E), ci(x) ≥ 0 (i ∈ I), with m constraints.
Assuming the derivatives ∇f (x∗ ), ∇2 f (x∗ ) exist and are continuous in a neighborhood of x∗ :
First-order necessary (KKT) conditions:
• Lagrangian L(x, λ) = f(x) − Σ_{i∈E∪I} λi ci(x).
• LICQ: active constraint gradients are l.i.
• Unconstrained opt: ∇f(x∗) = 0 (n eqs., n unknowns).
• Constrained opt (m + n eqs., m + n unknowns, with additional constraints):
  ∇x L(x∗, λ∗) = 0
  ci(x∗) = 0, i ∈ E
  λ∗i ci(x∗) = 0, i ∈ E ∪ I
  plus ci(x∗) ≥ 0 (i ∈ I) and λ∗i ≥ 0 (i ∈ I).
Second-order conditions:
• Critical cone containing the undecided directions:
C(x∗, λ∗) = {w ∈ Rn: wᵀ∇ci(x∗) = 0 ∀i ∈ E; wᵀ∇ci(x∗) = 0 ∀i ∈ I ∩ A(x∗) with λ∗i > 0; wᵀ∇ci(x∗) ≥ 0 ∀i ∈ I ∩ A(x∗) with λ∗i = 0}.
• Unconstrained opt:
– Necessary: x∗ is a local minimizer ⇒ ∇2 f (x∗ ) psd.
– Sufficient: ∇f (x∗ ) = 0, ∇2 f (x∗ ) pd ⇒ x∗ is a strict local minimizer.
• Constrained opt:
– Necessary: (x∗ , λ∗ ) local solution + LICQ + KKT ⇒ wT ∇2xx L(x∗ , λ∗ )w ≥ 0 ∀w ∈ C(x∗ , λ∗ ).
– Sufficient: (x∗ , λ∗ ) KKT point, wT ∇2xx L(x∗ , λ∗ )w > 0 ∀w ∈ C(x∗ , λ∗ ) \ {0} ⇒ x∗ strict local sol.
⇒ Seek solutions of KKT system, then check whether they are really minimizers (second-order conditions).
Duality:
• Primal problem: minx∈Rn f (x) s.t. c(x) ≥ 0
with Lagrangian L(x, λ) = f (x) − λT c(x).
• Dual problem: maxλ∈Rm q(λ) s.t. λ ≥ 0
with q(λ) = inf x L(x, λ) with domain D = {λ: q(λ) > −∞}.
• Wolfe dual: maxx,λ cT x − λT (Ax − b) s.t. AT λ = c, λ ≥ 0.
Loosely speaking, the primal objective is lower bounded by the dual objective and they touch at the (pri-
mal,dual) solution, so that the dual variables give the Lagrange multipliers of the primal. Sometimes, solving
the dual is easier. Particularly useful with LP, convex QP and other convex problems.
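A quick numerical check of weak/strong duality (a sketch via scipy.optimize.linprog; the data A = (1 ½ 2), b = 2, c = (1, 1, 0)ᵀ reuses the small example of the interior-point chapter below):

import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 1.0, 0.0]); A = np.array([[1.0, 0.5, 2.0]]); b = np.array([2.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)   # min c^T x s.t. Ax = b, x >= 0
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)])   # max b^T lam s.t. A^T lam <= c
print(primal.x, primal.fun)   # x* = (0, 0, 1), c^T x* = 0
print(dual.x, -dual.fun)      # lam* = 0,       b^T lam* = 0   (zero duality gap)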
13 Linear programming: the simplex method
Linear program (LP)
• Linear objective function, linear constraints (equal-
ity + inequality); feasible set: polytope (= convex;
connected set with flat faces); contours of objective
function: hyperplanes; solution: either none (feasi-
ble set is empty or problem is unbounded), one (a
vertex) or an infinite number (edge, face, etc.).
(✐ What happens if there are no inequalities?)
• LP is a very special case of constrained optimization, but popular because of its simplicity and the availability of software.
• Standard form: min cᵀx s.t. Ax = b, x ≥ 0 (A of m × n, m ≤ n). Commercial software accepts LP in non-standard form (inequalities, free variables) and converts it internally, e.g. by adding slack variables or splitting variables.
Optimality conditions
LP is a convex optimization problem ⇒ any minimizer is a global minimizer; KKT conditions are
necessary and also sufficient; LICQ isn’t necessary? . (✐ What happens with the second-order conditions?)
KKT conditions: L(x, λ, s) = cᵀx − λᵀ(Ax − b) − sᵀx (λ, s: Lagrange multipliers). If x is a solution ⇒ ∃!λ ∈ Rm, s ∈ Rn:
a) Aᵀλ + s = c
b) Ax = b
c) x ≥ 0
d) s ≥ 0
e) xi si = 0, i = 1, …, n ⇔ xᵀs = 0.
From a), b), e): cᵀx = (Aᵀλ + s)ᵀx = (Ax)ᵀλ = bᵀλ.
The KKT conditions are also sufficient. Pf.: let x̄ be another feasible point ⇔ Ax̄ = b, x̄ ≥ 0. Then cᵀx̄ = (Aᵀλ + s)ᵀx̄ = bᵀλ + x̄ᵀs ≥ bᵀλ = cᵀx (using x̄, s ≥ 0). And x̄ optimal ⇔ x̄ᵀs = 0.
The dual problem
• Primal problem: min cT x s.t. Ax = b, x ≥ 0.
Dual problem: max bT λ s.t. AT λ ≤ c, or min −bT λ s.t. c − AT λ ≥ 0 in the form of ch. 12.
x: primal variables (n), λ: dual variables (m).
• KKT conditions for the dual: L̄(λ, x) = −bᵀλ − xᵀ(c − Aᵀλ). If λ is a solution ⇒ ∃!x: Ax = b, c − Aᵀλ ≥ 0, x ≥ 0, xi(c − Aᵀλ)i = 0 ∀i, i.e., the primal KKT conditions with s = c − Aᵀλ.
• Dual of the dual = primal. Pf.: restate the dual in LP standard form by introducing slack variables s ≥ 0 (so that Aᵀλ + s = c) and splitting the unbounded variables λ into λ = λ⁺ − λ⁻ with λ⁺, λ⁻ ≥ 0. Then we can write the dual as
min (−b; b; 0)ᵀ(λ⁺; λ⁻; s) s.t. (Aᵀ −Aᵀ I)(λ⁺; λ⁻; s) = c, (λ⁺; λ⁻; s) ≥ 0,
whose dual is
max cᵀz s.t. (A; −A; I) z ≤ (−b; b; 0) ⇔ max cᵀz s.t. Az = −b, z ≤ 0,
which, after the change of variable x = −z, is the primal min cᵀx s.t. Ax = b, x ≥ 0.
• Duality gap: given a feasible vector x for the primal (⇔ Ax = b, x ≥ 0) and a feasible vector
(λ, s) for the dual (⇔ AT λ + s = c, s ≥ 0) we have:
0 ≤ xᵀs = xᵀ(c − Aᵀλ) = cᵀx − bᵀλ (the duality gap) ⇔ cᵀx ≥ bᵀλ.
Thus, the dual objective function bᵀλ is a lower bound on the primal objective function cᵀx. Furthermore:
1. If either problem (primal or dual) has a (finite) solution, then so does the other, and the
objective values are equal.
2. If either problem (primal or dual) is unbounded, then the other problem is infeasible.
• Duality is important in the theory of LP (and convex opt. in general) and in primal-dual
algorithms; also, the dual may be easier to solve than the primal.
• Sensitivity analysis: how sensitive the global objective value is to perturbations in the con-
straints ⇔ find the Lagrange multipliers λ, s.
Geometry of the feasible set
x ≥ 0 is the n-dim positive quadrant and we consider its intersection with the m-dim (m < n) linear
subspace Ax = b. The intersection happens at points x having at most m nonzeros, which are the
vertices of the feasible polytope. If the objective is bounded then at least one of these vectors is a
minimizer. Examples:
• x is a basic feasible point (BFP) if x is a feasible point with at most m nonzero components and we can identify a subset B(x) of the index set {1, …, n} such that:
– B(x) contains exactly m indices;
– i ∉ B(x) ⇒ xi = 0;
– the m × m matrix B = [Ai]_{i∈B(x)} (the columns of A indexed by B(x)) is nonsingular.
• The simplex method generates a sequence of iterates xk that are BFPs and converges (in a
finite number of steps) to a solution, if the LP has BFPs and at least one of them is a basic
optimal point (= a BFP which is a minimizer).
• All BFPs for the standard LP are vertices of the feasible polytope {x: Ax = b, x ≥ 0} and
vice versa (th. 13.3). (A vertex is a point that does not lie on a straight line between two other points in the polytope). Proof
The simplex method (Not to be confused with Nelder & Mead’s downhill simplex method of derivative-free opt.)
There are at most (n choose m) different sets of basic indices B (e.g. (20 choose 10) ∼ 10⁵), so a brute-force way to find a solution would be to try them all and check the KKT conditions. The simplex algorithm does better than this: it
guarantees a sequence of iterates all of which are BFPs (thus vertices of the polytope). Each step
moves from one vertex to an adjacent vertex for which the set of basic indices B(x) differs in exactly
one component and either decreases the objective or keeps it unchanged.
The move: we need to decide which index to change in the basic set B (by taking it out and
replacing it with one index from outside B, i.e., from N = {1, . . . , n} \ B). Write the KKT conditions
in terms of B and N (partitioned matrices and vectors):
Formally, call x⁺ the new iterate and x the current one: we want Ax⁺ = b = Ax:
Ax⁺ = (B N)(x⁺B; x⁺N) ⋄= B x⁺B + Aq x⁺q = B xB = Ax ⇒ x⁺B = xB − B⁻¹Aq x⁺q
(⋄: x⁺i = 0 for i ∈ N \ {q}; increase x⁺q until some component of x⁺B becomes 0). This operation decreases cᵀx (pf.: p. 369).
LP nondegenerate & bounded ⇒ simplex method terminates at a basic optimal point (th. 13.4).
• The practical implementation of the simplex method needs to take care of some details:
– Selection of the entering index from among the several negative components of s.
• Presolving (preprocessing): reduces the size of the user-given problem by applying several
techniques to eliminate variables, constraints and bounds; may also detect infeasibility. Ex:
look for rows or columns in A that are singletons or all-zeros, or for redundant constraints.
• With inequalities, as indicated by the KKT conditions, an algorithm must determine (implicitly
or explicitly) which of them are active at a solution. Active-set methods, of which the simplex
method is an example:
– maintain explicitly a set of constraints that estimates the active set at the solution (the
complement of the basis B in the simplex method), and
– make small changes to it at each step (a single index in the simplex method).
Active-set methods apply also to QP and bound-constrained optimization, but are less conve-
nient for nonlinear programming.
• The simplex method is very efficient in practice (it typically requires 2m to 3m iterations) but
it does have a worst-case complexity that is exponential in n. This can be demonstrated with a
pathological n-dim problem where the feasible polytope has 2n vertices, all of which are visited
by the simplex method before reaching the optimal point.
Interior-point methods for LP (next chapter) have a polynomial worst-case complexity.
14 Linear programming: interior-point methods
Interior-point methods:
– All iterates satisfy the inequality constraints strictly, so in the limit they approach the solution from the inside (in some methods from the outside) but never lie on the boundary of the feasible set.
– Each iteration is expensive to compute but can make significant progress toward the solution.
– Average-case complexity = worst-case complexity = polynomial.
Simplex method:
– Moves along the boundary of the feasible polytope, testing a finite sequence of vertices until it finds the optimal one.
– Usually requires a larger number of inexpensive iterations.
– Average-case complexity: 2m–3m iterations (m = number of constraints); worst-case complexity: exponential.
Primal-dual methods
• Standard-form primal LP: min cT x s.t. Ax = b, x ≥ 0, c, x ∈ Rn , b ∈ Rm , Am×n
Dual LP: max bT λ s.t. AT λ + s = c, s ≥ 0, λ ∈ Rm , s ∈ Rn .
• KKT conditions: Aᵀλ + s = c, Ax = b, xi si = 0 (i = 1, …, n), x, s ≥ 0: a system of 2n + m equations for the 2n + m unknowns (x, λ, s), mildly nonlinear because of xi si:
F(x, λ, s) = (Aᵀλ + s − c; Ax − b; XSe) = 0, with X = diag(xi), S = diag(si), e = (1, …, 1)ᵀ.
• Idea: find solutions (x∗ , λ∗ , s∗ ) of this system with a Newton-like method, but modifying the
search directions and step sizes to satisfy x, s > 0 (strict inequality). The sequence of iterates
traces a path in the space (x, λ, s), thus the name primal-dual. Solving the system is relatively
easy (little nonlinearity) but the nonnegativity condition complicates things. Spurious solutions
(F(x, λ, s) = 0 but (x, s) ≱ 0) abound and do not provide useful information about feasible solutions, so we must make sure to exclude them.
All the vertices of the x–polytope are associated with a root of F, but most violate x, s ≥ 0.
• Newton’s method to solve nonlinear equations r(x) = 0 from current estimate xk (ch. 11):
xk+1 = xk + ∆x where J(xk ) ∆x = −r(xk ) and J(x) is the Jacobian of r.
(Recall that if we apply it to solve ∇f (x) = 0 we obtain ∇2 f (x)p = −∇f (x), Newton’s method for optimization.)
• In our case the Jacobian J(x, λ, s) takes a simple form (✐ is J nonsingular?). Assuming x, s > 0 and calling rc = Aᵀλ + s − c, rb = Ax − b the residuals for the linear equations, the Newton step is:
J(x, λ, s) = ( 0 Aᵀ I ; A 0 0 ; S 0 X ),  J (∆x; ∆λ; ∆s) = (−rc; −rb; −XSe).
This Newton direction is also called the affine-scaling direction. Since a full step would likely violate x, s > 0, we perform a line search so that the new iterate is (x; λ; s) + α(∆x; ∆λ; ∆s) for α ∈ (0, 1], with α < min( min_{∆xi<0} (−xi/∆xi), min_{∆si<0} (−si/∆si) ).
Still, often α ≪ 1. Primal-dual methods modify the basic Newton procedure by:
1. Biasing the search direction towards the interior of the nonnegative orthant x, s ≥ 0 (so
more room to move within it). We take a less aggressive Newton direction that aims at a
solution with xi si = σµ > 0 (perturbed KKT conditions) instead of all the way to 0 (this
usually allows a longer step α), with:
– Duality measure µ = xᵀs/n = average of the pairwise products xi si. It measures closeness to the boundary, and the algorithms drive µ to zero.
– Centering parameter σ ∈ [0, 1]: amount of reduction in µ we want to achieve:
  σ = 0: pure Newton step towards (x0, λ0, s0), the point with xi si = 0 (affine-scaling dir.); aims at reducing µ.
  σ = 1: Newton step towards (xµ, λµ, sµ) ∈ C (centering direction); aims at centrality.
Primal-dual methods trade off both aims.
2. Controlling the step α to keep xi , si from moving too close to the boundary of the non-
negative orthant.
for k = 0, 1, 2, …
  Solve ( 0 Aᵀ I ; A 0 0 ; Sᵏ 0 Xᵏ ) (∆xk; ∆λk; ∆sk) = (−rck; −rbk; −XᵏSᵏe + σk µk e), where σk ∈ [0, 1], µk = (xk)ᵀsk/n
  (xk+1, λk+1, sk+1) ← (xk, λk, sk) + αk(∆xk, ∆λk, ∆sk), choosing αk such that xk+1, sk+1 > 0
end
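A compact numerical sketch of this framework (simplifying assumptions: dense linear algebra, fixed centering parameter σ, damped fraction-to-the-boundary step, no corrector):

import numpy as np

def primal_dual_lp(A, b, c, sigma=0.1, tol=1e-8, max_iter=200):
    """Path-following sketch for min c^T x s.t. Ax = b, x >= 0."""
    m, n = A.shape
    x, lam, s = np.ones(n), np.zeros(m), np.ones(n)     # x, s strictly positive
    def step_len(v, dv):
        # largest damped alpha in (0, 1] keeping v + alpha*dv > 0
        mask = dv < 0
        return 1.0 if not mask.any() else min(1.0, 0.99 * np.min(-v[mask] / dv[mask]))
    for _ in range(max_iter):
        rc, rb, mu = A.T @ lam + s - c, A @ x - b, x @ s / n
        if max(np.linalg.norm(rc), np.linalg.norm(rb), mu) < tol:
            break
        J = np.block([[np.zeros((n, n)), A.T,              np.eye(n)],
                      [A,                np.zeros((m, m)), np.zeros((m, n))],
                      [np.diag(s),       np.zeros((n, m)), np.diag(x)]])
        rhs = np.concatenate([-rc, -rb, -x * s + sigma * mu * np.ones(n)])
        d = np.linalg.solve(J, rhs)
        dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]
        alpha = min(step_len(x, dx), step_len(s, ds))
        x, lam, s = x + alpha * dx, lam + alpha * dlam, s + alpha * ds
    return x, lam, s

# On the example below (A = (1 1/2 2), b = 2, c = (1,1,0)^T) this approaches x = (0, 0, 1).
A = np.array([[1.0, 0.5, 2.0]]); b = np.array([2.0]); c = np.array([1.0, 1.0, 0.0])
print(primal_dual_lp(A, b, c)[0])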
Examples
• This shows every vertex of the polytope in x (i.e., Ax = b, x ≥ 0) produces one root of F(x, λ, s). Take min cᵀx s.t. Ax = b, x ≥ 0, for c = (1, 1, 0)ᵀ, A = (1 ½ 2), b = 2. The KKT conditions F(x, λ, s) = 0:
λ + s1 = 1, ½λ + s2 = 1, 2λ + s3 = 0 (Aᵀλ + s = c); x1 + ½x2 + 2x3 = 2 (Ax = b); s1x1 = s2x2 = s3x3 = 0 (sᵀx = 0); x, s ≥ 0,
with solutions:
(1) λ = 1, x = (2, 0, 0)ᵀ, s = (0, ½, −2)ᵀ: infeasible.
(2) λ = 2, x = (0, 4, 0)ᵀ, s = (−1, 0, −4)ᵀ: infeasible.
(3) λ = 0, x = (0, 0, 1)ᵀ, s = (1, 1, 0)ᵀ: feasible.
• Another example: A = (1 −2 2), same b, c. The solutions of F(x, λ, s) = 0 are:
(1) λ = 1, x = (2, 0, 0)ᵀ, s = (0, 3, −2)ᵀ: infeasible.
(2) λ = −½, x = (0, −1, 0)ᵀ, s = (3/2, 0, 1)ᵀ: infeasible.
(3) λ = 0, x = (0, 0, 1)ᵀ, s = (1, 1, 0)ᵀ: feasible.
Thus the need to steer away from the boundary till we approach the solution.
The central path
Define the primal-dual feasible set F = {(x, λ, s): Ax = b, Aᵀλ + s = c, x, s ≥ 0} and the primal-dual strictly feasible set F⁰ = {(x, λ, s): Ax = b, Aᵀλ + s = c, x, s > 0}.
Before, we justified taking a step towards τ = σµ > 0 instead of directly towards 0 in that it keeps
us away from the feasible set boundaries and allows longer steps. We can also see this as following
a central path through the nonnegative orthant. Parameterize the KKT system in terms of a scalar
parameter τ > 0 (perturbed KKT conditions):
AT λ + s = c
Ax = b The solution F(xτ , λτ , sτ ) = 0 gives a curve C = {(xτ , λτ , sτ ): τ > 0} whose
xi si = τ points are strictly feasible, and that converges to a solution for τ → 0. This curve
i = 1, . . . , n
is the central path C. The central path is defined uniquely ∀τ > 0 ⇔ F 0 6= ∅.
x, s > 0
The central path guides us to a solution along a route that steers clear of spurious solutions by
keeping all x and s components strictly positive and decreasing the pairwise products xi si to 0 at
the same rate. A Newton step towards points in C is biased toward the interior of the nonnegative
orthant x, s ≥ 0 and so it can usually be longer than the pure Newton step for F.
Path-following methods
• They explicitly restrict iterates to a neighborhood of the central
path C, thus following C more or less strictly. That is, we choose
αk ∈ [0, 1] as large as possible but so that (xk+1 , λk+1 , sk+1) lies in
the neighborhood.
• Examples:
  N−∞(γ) = {(x, λ, s) ∈ F⁰: xi si ≥ γµ, i = 1, …, n} for γ ∈ (0, 1] (typ. γ = 10⁻³); N−∞(0) = F, N−∞(1) = C.
  N2(θ) = {(x, λ, s) ∈ F⁰: ‖XSe − µe‖₂ ≤ θµ} for θ ∈ [0, 1) (typ. θ = 0.5); N2(1) ≠ F, N2(0) = C.
• Homotopy methods for nonlinear eqs. follow tightly a tubular neighborhood of the path.
Interior-point path-following methods follow a horn-shaped neighborhood, initially wide.
• Most computational effort is spent solving the linear system for the direction:
– This is often a sparse system because A is often sparse.
– If not, can reformulate (by eliminating variables) as a smaller system: eq. (14.42) or
(14.44).
A practical algorithm: Mehrotra’s predictor-corrector algorithm
At each iteration:
1. Predictor step (x′, λ′, s′) = (x, λ, s) + α(∆x_aff, ∆λ_aff, ∆s_aff): affine-scaling direction (i.e., σ = 0) and largest step size α ∈ [0, 1] that satisfies x′, s′ ≥ 0.
2. Adapt σ: compute the effectiveness µ_aff = (x′)ᵀs′/n of this step and set σ = (µ_aff/µ)³. Thus, if the predictor step is effective, µ_aff is small and σ is close to 0; otherwise σ is close to 1.
3. Corrector step: re-solve the Newton system with the same coefficient matrix and a right-hand side modified to include the centering term σµe and a correction −∆X_aff∆S_aff e.
• Two linear systems must be solved (predictor and corrector steps) but with the same coefficient
matrix (so use a matrix factorization, e.g. J = LU).
• If LP is infeasible or unbounded, the algorithm typically diverges (rkc , rkb and/or µk → ∞).
• No convergence theory available for this algorithm (which can occasionally diverge); but it has
good practical performance.
Review: linear programming: interior-point methods
• Standard-form primal LP: min cT x s.t. Ax = b, x ≥ 0, c, x ∈ Rn , b ∈ Rm , Am×n .
• KKT conditions with Lagrange multipliers λ, s: F(x, λ, s) = (Aᵀλ + s − c; Ax − b; XSe) = 0, with X = diag(xi), S = diag(si), e = (1, …, 1)ᵀ.
• Apply Newton’s method to solve for (x, λ, s) (primal-dual space) but modify the complementarity conditions
as xi si = τ = σµ > 0 to force iterates to be strictly feasible, i.e., interior (x, s > 0), and drive τ → 0. This
affords longer steps α:
– Pure Newton step: J (∆x; ∆λ; ∆s) = (−rc; −rb; −XSe + σµe) with J(x, λ, s) = ( 0 Aᵀ I ; A 0 0 ; S 0 X ).
– New iterate: (x; λ; s) + α(∆x; ∆λ; ∆s) for α ∈ (0, 1] that ensures the iterate is sufficiently interior.
– Duality measure µ = xᵀs/n (measures progress towards a solution).
– Centering parameter σ between 0 (affine-scaling direction) and 1 (central path).
• The set of solutions as a function of τ > 0 is called the central path C. It serves as guide to a solution from
the interior that avoids non-KKT points. Path-following algorithms follow C more or less closely.
– Global convergence: µk+1 ≤ cµk for constant c ∈ (0, 1) if σk is bounded away from 0 and 1.
– Convergence rate: achieving µk < ε requires O(n log(1/ε)) iterations ⇒ polynomial complexity.
• Each step of the interior-point method requires solving a linear system (for the Newton step) of 2n + m eqs.
which is sparse if A is sparse.
• Fewer, more costly iterations than the simplex method. In practice, preferable in large problems.
15 Fundamentals of algorithms for nonlinear constrained
optimization
• General constrained optimization problem:
min_{x∈Rn} f(x) s.t. ci(x) = 0 (i ∈ E), ci(x) ≥ 0 (i ∈ I), with f, {ci} smooth.
– Linear programming (LP): f , all ci linear; solved by simplex & interior-point methods.
– Quadratic programming (QP): f quadratic, all ci linear.
– Linearly constrained optimization: all ci linear.
– Bound-constrained optimization: constraints are only of the form xi ≥ li or xi ≤ ui .
– Convex programming: f convex, equality ci linear, inequality ci concave. (✐ Is QP convex progr.?)
• Brute-force approach: guess which inequality constraints are active (λ∗i ≠ 0), try to solve the nonlinear equations given by the KKT conditions directly, and then check whether the resulting solutions are feasible. If there are m inequality constraints and k are active, we have (m choose k) combinations, and so altogether (m choose 0) + (m choose 1) + ··· + (m choose m) = 2ᵐ combinations, which is wasteful unless we can really guess which constraints are active. Solving a nonlinear system of equations is still hard because the root-finding algorithms are not guaranteed to find a solution from arbitrary starting points.
• Iterative algorithms: sequence of xk (and possibly of Lagrange multipliers associated with the
constraints) that converges to a solution. The move to a new iterate is based on information
about the objective and constraints, and their derivatives, at the current iterate, possibly
combined with information gathered in previous iterates. Termination occurs when a solution
is identified accurately enough, or when further progress can’t be made.
Goal: to find a local minimizer (global optimization is too hard).
• Initial study of the problem: try to show whether the problem is infeasible or unbounded; try
to simplify the problem.
• Hard constraints: they cannot be violated during the algorithm’s run, e.g. non-negativity of x if √x appears in the objective function. Need feasible algorithms, which are slower than
algorithms that allow the iterates to be infeasible, since they can’t allow shortcuts across
infeasible territory; but the objective is a merit function, which spares us the need to introduce
a more complex merit function that accounts for constraint violations.
Soft constraints: they may be modeled as objective function f + penalty, where the penalty
includes the constraints. However, this can introduce ill-conditioning.
• Slack variables are commonly used to simplify an inequality into a bound, at the cost of having
an extra equality and slack variable:
ci (x) ≥ 0 ⇒ ci (x) − si = 0, si ≥ 0 ∀i ∈ I.
Categorization of algorithms
• Ch. 16: quadratic programming: it’s an important problem by itself and as part of other
algorithms; the algorithms can be tailored to specific types of QP.
• Ch. 17: penalty and augmented Lagrangian methods:
– Penalty methods: combine objective and constraints into a penalty function φ(x; µ) via a
penalty parameter µ > 0; e.g. if only equality constraints exist:
∗ φ(x; µ) = f(x) + (µ/2) Σ_{i∈E} ci(x)² ⇒ unconstrained minimization of φ wrt x for a series of increasing µ values.
∗ φ(x; µ) = f(x) + µ Σ_{i∈E} |ci(x)| (exact penalty function) ⇒ single unconstrained minimization for large enough µ.
– Augmented Lagrangian methods: define a function that combines the Lagrangian and a
quadratic penalty; e.g. if only equality constraints exist:
∗ LA(x, λ; µ) = f(x) − Σ_{i∈E} λi ci(x) + (µ/2) Σ_{i∈E} ci²(x)
⇒ unconstrained min. of LA wrt x for fixed λ, µ; update λ, increase µ; repeat.
– Sequential linearly constrained methods: minimize at every iteration a certain Lagrangian
function subject to linearization of the constraints; useful for large problems.
• Ch. 18: sequential quadratic programming: model the problem as a QP subproblem; solve it
by ensuring a certain merit function decreases; repeat. Effective in practice. Although the QP
subproblem is relatively complicated, they typically require fewer function evaluations than
some of the other methods.
• Ch. 19: interior-point methods for nonlinear programming: extension of the primal-dual interior-
point methods for LP. Effective in practice. They can also be viewed as:
– Barrier methods: add terms to the objective (via a barrier parameter µ > 0) that are
insignificant when x is safely inside the feasible set but become large as x approaches the
boundary; e.g. if only inequality constraints exist:
∗ P(x; µ) = f(x) − µ Σ_{i∈I} log ci(x) (logarithmic barrier function)
⇒ unconstrained minimization of P wrt x for a series of decreasing µ values.
Elimination of variables
Goal: eliminate some of the constraints and so simplify the problem. This must be done with care
because the problem may be altered, or ill-conditioning may appear.
• Example 15.2: elimination alters the problem: min_{x,y} x² + y² s.t. (x − 1)³ = y² has the solution (x, y) = (1, 0). Eliminating y² = (x − 1)³ yields min x² + (x − 1)³, which is unbounded (x → −∞); the mistake is that this elimination ignores the implicit constraint x ≥ 1 (since y² ≥ 0), which is active at the solution.
In general, nonlinear elimination is tricky. Instead, many algorithms linearize the constraints, then
apply linear elimination.
Linear elimination. Consider min f(x) s.t. Ax = b where A is m × n, m ≤ n, and A has full rank (otherwise, remove redundant constraints or determine whether the problem is infeasible). Say we eliminate xB = (x1, …, xm)ᵀ (otherwise permute x, A and b); writing A = (B N) with B m×m nonsingular, N m×(n−m) and x = (xB; xN), we have xB = B⁻¹b − B⁻¹N xN (remember how to find a basic feasible point in the simplex method), and we can solve the unconstrained problem min_{xN∈R^{n−m}} f(xB(xN), xN). Ideally we’d like to select B to be easily factorizable (easier linear system). (ex. 15.3)
We can also write x = Yb + ZxN with Y = (B⁻¹; 0), Z = (−B⁻¹N; I). We have: the columns of Y and the columns of Z are l.i. (pf.: (Y Z)λ = 0 ⇒ λ = 0); AZ = 0, so range(Z) = null(A); and null(A) ⊕ range(Aᵀ) = Rⁿ.
Thus the elimination technique expresses feasible points as the sum of a particular solution of Ax = b plus a displacement along the null space of A.
But linear elimination can give rise to numerical instability, e.g. for n = 2. This can be improved by choosing as the particular solution the one having minimum norm: min ‖x‖₂ s.t. Ax = b, which is xp = Aᵀ(AAᵀ)⁻¹b (pf.: apply KKT to min ½xᵀx s.t. Ax = b). Both this xp and Z can be computed in a numerically stable way using the QR decomposition of A, though the latter is costly if A is large (even if sparse).
If inequality constraints exist, eliminating equality constraints is worthwhile if the inequality
constraints don’t get more complicated.
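A sketch of the stable computation via the QR factorization of Aᵀ (the 1 × 3 example matrix is an assumed illustration):

import numpy as np

def elimination_basis(A, b):
    m, n = A.shape
    Q, R = np.linalg.qr(A.T, mode='complete')   # A^T = Q R, Q is n x n orthogonal
    Z = Q[:, m:]                                # last n-m columns span null(A)
    # minimum-norm particular solution x_p = A^T (A A^T)^{-1} b via the QR factors
    xp = Q[:, :m] @ np.linalg.solve(R[:m, :].T, b)
    return xp, Z

A = np.array([[1.0, 1.0, 2.0]]); b = np.array([2.0])
xp, Z = elimination_basis(A, b)
print(A @ xp - b)   # ~0: x_p is feasible
print(A @ Z)        # ~0: columns of Z are feasible directions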
Measuring progress: merit functions φ(x; µ)
• A merit function measures a combination of the objective f and of feasibility via a penalty
parameter µ > 0 which controls the tradeoff; several definitions exist. They help to control
the optimization algorithm: a step is accepted if it leads to a sufficient reduction in the merit
function.
• A merit function φ(x; µ) is exact if ∃µ∗ > 0: µ > µ∗ ⇒ any local solution x of the optimization
problem is a local minimizer of φ.
– ℓ1 exact function:
φ1(x; µ) = f(x) + µ Σ_{i∈E} |ci(x)| + µ Σ_{i∈I} [ci(x)]⁻, where [y]⁻ = max(0, −y).
Review: fundamentals of algorithms for nonlinear constrained optimiza-
tion
• Brute-force approach: try all 2m combinations of active constraints, solving a different system of nonlinear
equations for each (from the KKT conditions): computationally intractable. Instead, iterative algorithms
construct a sequence of xk and possibly λk , using information about the objective and constraints and their
derivatives, that converges to one local solution.
• Some algorithms are for specific problems (LP, QP) while others apply more generally (penalty and augmented
Lagrangian, sequential QP, interior-point and barrier methods).
• Elimination of variables: useful with linear equality constraints (if done carefully to prevent introducing ill-
conditioning); tricky with nonlinear ones.
• Merit function φ(x; µ): measures a combination of the objective f and of feasibility via a penalty parameter
µ > 0 which controls the tradeoff. Ex: φ = f for unconstrained problems or feasible algorithms; quadratic-
penalty; augmented Lagrangian, Fletcher’s augmented Lagrangian, ℓ1 exact function.
Exact merit function: any local solution is a minimizer of φ for sufficiently large µ.
16 Quadratic programming
Quadratic program (QP): quadratic objective function, linear constraints.
min_{x∈Rn} q(x) = ½xᵀGx + cᵀx s.t. aiᵀx = bi (i ∈ E), aiᵀx ≥ bi (i ∈ I), with G n×n symmetric.
Can always be solved in a finite number of iterations (exactly how many depends on G and on the
number of inequality constraints).
• Convex QP ⇔ G psd. Local minimizer(s) also global; not much harder than LP.
• Non-convex QP ⇔ G not psd. Possibly several solutions.
Equality-constrained QP
• m equality constraints, no inequality constraints:
min_x q(x) = ½xᵀGx + cᵀx s.t. Ax = b, A full rank. (Example: min x1² + x2² s.t. x1 + x2 = 1.)
• KKT conditions: a solution x∗ verifies (where λ∗ is the Lagrange multiplier vector):
( G −Aᵀ ; A 0 ) (x∗; λ∗) = (−c; b), or equivalently, with p = x∗ − x, g = c + Gx, h = Ax − b:
( G Aᵀ ; A 0 ) (−p; λ∗) = (g; h).
The coefficient matrix is the KKT matrix K. (✐ What happens if G = 0 (LP)?)
Let Z of n×(n−m) = (z1, …, zn−m) be a basis of null(A) ⇔ AZ = 0, rank(Z) = n − m. Call ZᵀGZ the reduced Hessian (= how the quadratic form looks in the subspace Ax = b). Then, if A has full row rank (= m) and ZᵀGZ is pd:
– Lemma 16.1: K is nonsingular (⇒ unique (x∗, λ∗)). Proof
• Classification of the solutions (assuming the KKT system has solutions (x∗, λ∗)): Ex. 16.2
• The KKT system can be solved with various linear algebra techniques (note that linear conju-
gate gradients are not applicable? ).
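For the small example above (min x1² + x2² s.t. x1 + x2 = 1), the KKT system can be solved directly; a sketch with dense linear algebra:

import numpy as np

G = 2 * np.eye(2); c = np.zeros(2)           # q(x) = x1^2 + x2^2
A = np.array([[1.0, 1.0]]); b = np.array([1.0])

K = np.block([[G, -A.T], [A, np.zeros((1, 1))]])      # KKT matrix
sol = np.linalg.solve(K, np.concatenate([-c, b]))
x_star, lam_star = sol[:2], sol[2:]
print(x_star, lam_star)   # [0.5 0.5], [1.]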
Inequality-constrained QP
• Optimality conditions: Lagrangian function L(x, λ) = ½xᵀGx + cᵀx − Σ_{i∈E∪I} λi(aiᵀx − bi).
Active set at an optimal point x∗: A(x∗) = {i ∈ E ∪ I: aiᵀx∗ = bi}.
– Second-order conditions:
1. G psd (convex QP) ⇒ x∗ is a global minimizer (th. 16.4); and unique if G pd. Proof
2. Strict, unique local minimizer at x∗ ⇔ ZT GZ pd, where Z is a nullspace basis for the
active constraint Jacobian matrix (aTi )i∈A(x∗ ) .
3. If G is not psd, there may be more than one strict local minimizer at which the 2nd -
order conditions hold (non-convex, or indefinite QP); harder to solve. Determining
whether a feasible point is a global minimizer is NP-hard (fig. 16.1).
Ex.: max xᵀx s.t. x ∈ [−1, 1]ⁿ: 2ⁿ local (and global) optima.
• Degeneracy is one of the following situations, which can cause problems for the algorithms: ex. p. 466
– Active constraint gradients are l.d. at the solution, e.g. (but not necessarily) if more than
n constraints are active at the solution ⇒ numerically difficult to compute Z.
– Strict complementary condition fails: λ∗i = 0 for some active index i ∈ A(x∗ ) (the con-
straint is weakly active) ⇒ numerically difficult to determine whether a weakly active
constraint is active.
Active-set methods for convex QP
• Convex QP: any local solution is also global.
• They are the most effective methods for small- to medium-scale problems; efficient detection
of unboundedness and infeasibility; accurate estimate (typically) of the optimal active set.
• Remember the brute-force approach to solving the KKT systems for all combinations of active
constraints: if we knew the optimal active set A(x∗ ) (≡ the active set at the optimal point x∗ ),
we could find the solution of the equality-constrained QP problem minx q(x) s.t. aTi x = bi , i ∈
A(x∗ ). Goal: to determine this set.
• Active-set method: start from a guess of the optimal active set; if not optimal, drop one index
from A(x) and add a new index (using gradient and Lag. mult. information); repeat.
– The simplex method for LP is an active-set method.
– QP active-set methods may have iterates that aren’t vertices of the feasible polytope.
Three types of active-set methods: primal, dual, and primal-dual. We focus on primal methods,
which generate iterates that remain feasible wrt the primal problem while steadily decreasing
the objective function q.
• Iterating this process (where we keep adding blocking constraints and moving xk ) we must reach
a point x̂ that minimizes q over its current working set Ŵ, or equivalently p = 0 occurs. Now,
is this also a minimizer of the QP problem, i.e., does it satisfy the KKT conditions? Only if
the Lagrange multipliers for the inequality Pconstraints in the working set are nonnegative. The
?
Lagrange multipliers are the solution of i∈Ŵ ai λ̂i = Gx̂ + c. So if λ̂j < 0 for some j ∈ Ŵ ∩ I
then we drop constant j from the working set (since by making this constraint inactive we can
decrease q while remaining feasible; th. 16.5) and go back to iterate.
If there are several λ̂j < 0 one typically chooses the most negative one since the rate of decrease
of q is proportional to |λ̂j | if we remove constraint j (other heuristics possible).
• If αk > 0 at each step, this algorithm converges in a finite number of iterations since there
is a finite number of working sets. In rare situations the algorithm can cycle: a sequence of
consecutive iterations results in no movement of xk while the working set undergoes deletions
and additions of indices and eventually repeats itself. Although this can be dealt with, most
QP implementations simply ignore it.
• The linear systems can be solved efficiently by updating factorizations (in the KKT matrix, G
is constant and A changes by one row at most at each step).
for k = 0, 1, 2, …
  Solve the equality-constrained subproblem over the working set Wk for pk
  if pk = 0 (xk minimizes q over Wk)
    Compute the Lagrange multipliers λ̂ (from Σ_{i∈Wk} ai λ̂i = Gxk + c)
    if λ̂i ≥ 0 ∀i ∈ Wk ∩ I
      stop with solution x∗ = xk                              (all KKT conditions hold)
    else
      j ← arg min_{j∈Wk∩I} λ̂j, Wk+1 ← Wk \ {j}                (remove the ineq. constraint with the most negative Lag. mult.)
      xk+1 ← xk
    end
  else (pk ≠ 0: we can move xk and decrease q)
    compute αk = min(1, min …) from (16.41)                   (longest step in [0, 1])
    xk+1 ← xk + αk pk
    if αk < 1 (there are blocking constraints)
      Wk+1 ← Wk ∪ {one blocking constraint}                   (add one of them to the working set)
    else
      Wk+1 ← Wk
    end
  end
end
The gradient projection method
• Active-set method: the working set changes by only one index at each iteration, so many
iterations are needed for large scale problems.
• Gradient-projection method: large changes to the active set (from those constraints that are
active at the current point, to those that are active at the Cauchy point).
• Most efficient on bound-constrained QP, on which we focus:
min_x q(x) = ½xᵀGx + cᵀx s.t. l ≤ x ≤ u,
with x, l, u ∈ Rⁿ, G symmetric (not necessarily pd); not all components need to be bounded.
• The feasible set is a box.
• Idea: steepest descent but bending along the box faces.
• Needs a feasible starting point x◦ (trivial to obtain); all iterates remain feasible.
• Each iteration consists of two stages; assume current point is x (which is feasible):
1. Find the Cauchy point xc: this is the first minimizer along the steepest descent direction −∇q = −(Gx + c), piecewise-bent to satisfy the constraints. To find it, search along −∇q; if we hit a bound (a box face), bend the direction (by projecting it onto the face) and keep searching along it, and so on, resulting in a piecewise linear path P(x − t∇q; [l, u]), t ≥ 0, where P(x; [l, u])i = median(xi, li, ui) = li if xi < li, xi if xi ∈ [li, ui], ui if xi > ui (exact formulas: pp. 486ff).
xc is somewhere along this path (depending on the quadratic form q(x)); note there can be several minimizers along the path.
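A sketch of the projection operator P(x; [l, u]) and of evaluating points along the bent path (the quadratic data G, c and the box [0, 1]² are assumed examples):

import numpy as np

def project(x, l, u):
    """Componentwise median(l_i, x_i, u_i): projection onto the box [l, u]."""
    return np.minimum(np.maximum(x, l), u)

G = np.array([[2.0, 0.0], [0.0, 1.0]]); c = np.array([-4.0, -1.0])
l, u = np.zeros(2), np.ones(2)
x = np.array([1.0, 0.5])                 # feasible starting point
g = G @ x + c                            # grad q(x)
for t in [0.0, 0.5, 1.0, 2.0]:
    print(t, project(x - t * g, l, u))   # points on the projected (bent) path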
Interior-point methods
• Appropriate for large problems.
• A simple extension of the primal-dual interior-point approach of LP works for convex QP. The
algorithms are easy to implement and efficient for some problems.
• Consider for simplicity only inequality constraints (exe. 16.21 considers also equality ones):
min_x q(x) = ½xᵀGx + xᵀc s.t. Ax ≥ b, with G symmetric pd, A of m × n.
Write KKT conditions, then introduce surplus vector y = Ax − b ≥ 0 (saves m Lag. mult.)? .
Since the problem is convex, the KKT conditions are not only necessary but also sufficient. We
find minimizers of the QP by finding roots of the KKT system:
Gx − Aᵀλ = −c (the only addition wrt LP), Ax − y = b, yi λi = 0 (i = 1, …, m), y, λ ≥ 0:
a system of n + 2m equations for the n + 2m unknowns (x, y, λ), mildly nonlinear because of yi λi:
F(x, y, λ) = (Gx − Aᵀλ + c; Ax − y − b; YΛe) = 0, with Y = diag(yi), Λ = diag(λi), e = (1, …, 1)ᵀ.
• Central path C = {(xτ, yτ, λτ): F(xτ, yτ, λτ) = (0; 0; τe), τ > 0} ⇔ solve the perturbed KKT system with yi λi = τ. Given a current iterate (x, y, λ) with y, λ > 0, define the duality measure µ = yᵀλ/m (closeness to the boundary) and the centering parameter σ ∈ [0, 1].
• Newton-like step toward the point (xσµ, yσµ, λσµ) on the central path:
( G 0 −Aᵀ ; A −I 0 ; 0 Λ Y ) (∆x; ∆y; ∆λ) = (−rc; −rb; −ΛYe + σµe), with rc = Gx − Aᵀλ + c, rb = Ax − y − b
(the Jacobian of F times the step equals −F(x, y, λ) plus the centering term).
(xk+1, yk+1, λk+1) ← (xk, yk, λk) + αk(∆xk, ∆yk, ∆λk), choosing αk ∈ [0, 1] such that yk+1, λk+1 > 0.
• Likewise, we can extend the path-following methods (by defining a neighborhood N−∞ (γ)) and
Mehrotra’s predictor-corrector algorithm.
• Major computation: solving the linear system, more costly than for LP because of G.
• As in the comparison between the simplex and interior-point methods for LP, for QP: active-set methods are preferred for small- to medium-scale problems, interior-point methods for large ones.
Review: quadratic programming
• Quadratic program:
17 Penalty and augmented Lagrangian methods
The quadratic penalty method
min_x f(x) s.t. ci(x) = 0 (i ∈ E), ci(x) ≥ 0 (i ∈ I).
• Define the following quadratic-penalty function with penalty parameter µ > 0:
Q(x; µ) = f(x) + (µ/2) Σ_{i∈E} ci(x)² + (µ/2) Σ_{i∈I} ([ci(x)]⁻)², where [y]⁻ = max(−y, 0):
the objective function plus one term per constraint, which is positive when x violates ci and 0 otherwise.
Given tolerance τ0 > 0, starting penalty parameter µ0 > 0, starting point x0ˢ
for k = 0, 1, 2, …
  Find an approximate minimizer xk of Q(x; µk), starting at xkˢ and terminating when ‖∇x Q(x; µk)‖ ≤ τk
  if final convergence test satisfied ⇒ stop with approximate solution xk
  Choose new penalty parameter µk+1 > µk, new starting point xk+1ˢ, new tolerance τk+1 ∈ (0, τk)
end
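A sketch of this framework for equality constraints (using scipy.optimize.minimize as the inner unconstrained solver; the µ and τ update factors are arbitrary illustrative choices):

import numpy as np
from scipy.optimize import minimize

def quadratic_penalty(f, cons, x0, mu0=1.0, tol0=1.0, n_outer=10):
    """min f(x) s.t. cons(x) = 0 (vector-valued), driving mu upward."""
    x, mu, tol = np.asarray(x0, float), mu0, tol0
    for _ in range(n_outer):
        Q = lambda x: f(x) + 0.5 * mu * np.sum(cons(x) ** 2)
        x = minimize(Q, x, options={'gtol': tol}).x   # warm start at previous xk
        mu *= 10.0                                    # increase penalty
        tol *= 0.5                                    # tighten inner tolerance
    return x

# Example: min x1 + x2 s.t. x1^2 + x2^2 - 2 = 0 (solution (-1, -1)).
f = lambda x: x[0] + x[1]
cons = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0])
print(quadratic_penalty(f, cons, np.array([0.0, 0.0])))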
– Using the tangent to the path {x(µ), µ > 0}: xkˢ = xk−1 + (µk − µk−1)ẋ. The path tangent ẋ = dx(µ)/dµ can be obtained by total differentiation of ∇x Q(x; µ) = 0 wrt µ:
0 = d∇x Q(x(µ); µ)/dµ = ∇²xx Q(x; µ) ẋ + (1/µ)(∇x Q(x; µ) − ∇f(x)),
a linear system for the vector ẋ. This can be seen as a predictor-corrector method: linear prediction xkˢ, nonlinear correction by minimizing Q(x; µk).
• Choice of {µk}: adaptive, e.g. increase µ only modestly if minimizing Q(x; µk) was expensive, more aggressively if it was cheap.
• Choice of {τk}: τk → 0 as k → ∞ (the minimization is carried out progressively more accurately).
– Th. 17.1: xk global min. of Q(x; µk ) ⇒ xk → global solution of the constr. problem.
Impractical: requires global minimization (or a convex problem).
Practical problems
• Q may be unbounded below for some values of µ (ex. in eq. 17.5) ⇒ safeguard.
• The penalty function doesn’t look quadratic around its minimizer except very close to it (see
contours in fig. 17.2).
• Even if ∇²f(x∗) is well-conditioned, the Hessian ∇²xx Q(x; µk) becomes arbitrarily ill-conditioned as µk → ∞. Consider equality constraints only and define A(x)ᵀ = (∇ci(x))_{i∈E} (matrix of constraint gradients; usually rank(A) < n):
∇²xx Q(x; µk) = ∇²f(x) + Σ_{i∈E} µk ci(x)∇²ci(x) + µk A(x)ᵀA(x) ≈ ∇²xx L(x, λ∗) + µk A(x)ᵀA(x) near a minimizer (where µk ci(x) ≈ −λ∗i),
where L(x, λ) is the Lagrangian function (and usually |E| < n). Unconstrained optimization methods have problems with ill-conditioning. For Newton’s method we can apply a reformulation that avoids the ill-conditioning (p solves both systems?): the resulting system has dimension n + |E| rather than n, and is a regularized version of the SQP system (18.6) (the “−(1/µk)I” term makes the matrix nonsingular even if A(x) is rank-deficient).
The augmented Lagrangian method is more effective, as it delays the onset of ill-conditioning.
Exact penalty functions
• Exact penalty function φ(x; µ): ∃µ∗ > 0: ∀µ > µ∗ , any local solution x of the constrained
problem is a local minimizer of φ. So we need a single unconstrained minimization of φ(x; µ)
for such a µ > µ∗ .
• The quadratic-penalty and log-barrier functions are not exact, so they need µ → ∞.
• The ℓ1 exact penalty function (ex. 17.2, ex. 17.3)
φ1(x; µ) = f(x) + µ Σ_{i∈E} |ci(x)| + µ Σ_{i∈I} [ci(x)]⁻
is exact for µ∗ = largest Lagrange multiplier (in absolute value) associated with an optimal solution (th. 17.3). Algorithms based on minimizing φ1 need:
– Rules for adjusting µ to ensure µ > µ∗ (note minimizing φ for large µ is difficult).
– Special techniques to deal with the fact that φ1 is not differentiable at any x for which
ci (x) = 0 for some i ∈ E ∪ I (and such x must be encountered).
• Any exact penalty function of the type φ(x; µ) = f(x) + µ h(c1(x)) (where h(0) = 0 and h(y) ≥ 0 ∀y ∈ R) must be nonsmooth (proof: p. 513).
The augmented Lagrangian method (equality constraints)
• Define the augmented Lagrangian LA(x, λ; µ) = f(x) − Σ_{i∈E} λi ci(x) + (µ/2) Σ_{i∈E} ci²(x), combining the Lagrangian and the quadratic penalty. At an approximate minimizer xk of LA(·, λᵏ; µk) we have 0 ≈ ∇x LA = ∇f(xk) − Σi (λᵏi − µk ci(xk))∇ci(xk), so λ∗i ≈ λᵏi − µk ci(xk), i.e., c(xk) ≈ −(1/µk)(λ∗ − λᵏ).
So if λᵏ is close to the optimal multiplier vector λ∗ then ‖c(xk)‖ will be much smaller than 1/µk rather than just proportional to 1/µk.
• Now we need an update equation for λk+1 so that it approximates λ∗ more and more accurately; the relation c(xk) ≈ −(1/µk)(λ∗ − λk) suggests λk+1 ← λk − µk c(xk). (ex. 17.4)
Note that −µk ci(xk) → λ∗i in the quadratic-penalty method but → 0 in the augmented Lagrangian method.
• Algorithmic framework 17.3 (augmented Lagrangian method—equality constraints): as for the
quadratic-penalty method but using LA (x, λ; µ) and updating λk+1 ← λk − µk c(xk ) where xk
is the (approximate) minimizer of LA (x, λk ; µ) and with given starting point λ0 .
• Choice of starting point xsk for the minimization of LA (x, λk ; µk ) less critical now (less ill-
conditioning), so we can simply take xsk+1 ← xk .
• Convergence:
– Th. 17.5: (x∗ , λ∗ ) = (local solution, Lagrange multiplier) at which KKT + LICQ +
2nd -order sufficient conditions hold (≡ well-behaved solution) ⇒ x∗ is a stationary point
of LA (x, λ∗ ; µ) for any µ ≥ 0, and ∃µ̄ > 0: ∀µ ≥ µ̄, x∗ is a strict local minimizer of
LA (x, λ∗ ; µ).
Pf.: KKT + 2nd-order cond. for constrained problem ⇒ KKT + 2nd-order cond. for unconstrained problem minx LA .
Thus LA is an exact penalty function for the optimal Lagrange multiplier λ = λ∗ and, if we
knew the latter, we would not need to take µ → ∞. In practice, we need to estimate λ∗ over
iterates and drive µ sufficiently large; if λk is close to λ∗ or if µk is large, then xk will be close
to x∗ (the quadratic-penalty method gives only one option: increase µk ).
(✐ Given λ∗ , how do we determine x∗ from the KKT conditions?)
Note that LA (x, 0; µ) = Q(x; µ).
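A sketch of framework 17.3 (inner solver scipy.optimize.minimize; the example problem and its multiplier λ∗ = −1/2 are assumed for illustration):

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, cons, x0, mu=10.0, n_outer=10):
    x = np.asarray(x0, float)
    lam = np.zeros(len(cons(x)))         # lambda_0 = 0
    for _ in range(n_outer):
        LA = lambda x: f(x) - lam @ cons(x) + 0.5 * mu * np.sum(cons(x) ** 2)
        x = minimize(LA, x).x            # approximate minimizer of LA(., lam; mu)
        lam = lam - mu * cons(x)         # multiplier update
        mu *= 2.0                        # a moderate increase suffices here
    return x, lam

# Same example as before: min x1 + x2 s.t. x1^2 + x2^2 = 2.
f = lambda x: x[0] + x[1]
cons = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0])
print(augmented_lagrangian(f, cons, np.array([0.0, 0.0])))  # x ~ (-1,-1), lam ~ -0.5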
• Special case (useful for distributed optimization): alternating direction method of multipliers
(ADMM): for a convex problem with (block-)separable objective and constraints:
min_{x,z} f(x) + g(z) s.t. Ax + Bz = c
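For reference, the standard ADMM iteration for this problem (stated here as a sketch, under the sign convention λ ← λ − µ c(x) used above) alternates block minimizations of the augmented Lagrangian LA(x, z, λ; µ) = f(x) + g(z) − λᵀ(Ax + Bz − c) + (µ/2)‖Ax + Bz − c‖²₂ with a multiplier update:
x ← arg min_x LA(x, z, λ; µ),
z ← arg min_z LA(x, z, λ; µ),
λ ← λ − µ (Ax + Bz − c);
each block is minimized separately (often in closed form), which is what makes the method attractive for distributed optimization.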
1. Bound-constrained Lagrangian formulation: minimize LA(x, λ; µ) subject to the bounds l ≤ x ≤ u (as in the LANCELOT package).
2. Linearly-constrained formulation: in the bound-constrained problem, solve the subproblem of minimizing the (augmented) Lagrangian subject to linearization of the constraints:
min_x Fk(x) s.t. ci(xk) + ∇ci(xk)ᵀ(x − xk) = 0 (i = 1, …, m), l ≤ x ≤ u,
where:
• Augmented Lagrangian Fk(x) = f(x) − Σ_{i=1..m} λᵏi cᵏi(x) + (µ/2) Σ_{i=1..m} (cᵏi(x))² (Lagrangian term + quadratic penalty).
• Current Lagrange multiplier estimate λᵏ = Lagrange multipliers for the linearized constraints at k − 1.
• cᵏi(x) = ci(x) − (ci(xk) + ∇ci(xk)ᵀ(x − xk)) = true − linearized = “second-order ci remainder”.
Similar to SQP but with a nonlinear objective (hard subproblem); particularly effective when most of the constraints are linear.
Implemented in the MINOS package.
3. Unconstrained formulation: consider again the general constrained problem (without equality constraints for simplicity) and introduce slack variables: ci(x) ≥ 0 ⇒ ci(x) − si = 0, si ≥ 0 ∀i ∈ I. Consider the bound-constrained augmented Lagrangian:
min_{x,s} LA(x, s, λ; µ) = f(x) − Σ_{i∈I} λi(ci(x) − si) + (µ/2) Σ_{i∈I} (ci(x) − si)² s.t. si ≥ 0 ∀i ∈ I.
• Update µk , etc.
Review: penalty and augmented Lagrangian methods
• Quadratic-penalty method: sequence of unconstrained minimization subproblems where we drive µ → ∞:
min_x Q(x; µ) = f(x) + (µ/2) Σ_{i∈E} ci(x)² + (µ/2) Σ_{i∈I} ([ci(x)]⁻)²,
thus forcing the minimizer of Q closer to the feasible region of the constrained problem.
– Assuming equality constraints, if this converges to a point x∗ , then either it is infeasible and a stationary
point of kc(x)k2 , or it is feasible; in the latter case, if A(x∗ ) (the matrix of active constraint gradients)
has full rank then −µc(x) → λ∗ and (x∗ , λ∗ ) satisfy the KKT cond.
– Problem: typically, ∇2xx Q(x; µ) becomes progressively more ill-conditioned as µ → ∞.
• Exact penalty function φ(x; µ): ∃µ∗ > 0: ∀µ > µ∗ , any local solution x of the constrained problem is a local
minimizer of φ. So we need a single unconstrained minimization of φ(x; µ) for such a µ > µ∗ .
– Ex.: ℓ1 exact penalty function (it is exact for µ∗ = largest Lagrange multiplier):
φ1(x; µ) = f(x) + µ Σ_{i∈E} |ci(x)| + µ Σ_{i∈I} [ci(x)]⁻.
Exact penalty functions of the form φ(x; µ) = f(x) + µ h(c1(x)) are nonsmooth.
• Augmented Lagrangian method (method of multipliers): sequence of unconstrained minimization subproblems where we increase µ (for equality constraints):
min_x LA(x, λ; µ) = f(x) − Σ_{i∈E} λi ci(x) + (µ/2) Σ_{i∈E} ci²(x),  λ ← λ − µ c(x).
– Modifies the quadratic-penalty function by introducing explicit estimates of the Lagrange multipliers,
and this delays the onset of ill-conditioning.
– Convergence: a well-behaved solution (x∗ , λ∗ ) is a stationary point of LA (x, λ∗ ; µ) ∀µ ≥ 0 and a strict
local minimizer of LA (x, λ∗ ; µ) ∀µ ≥ µ̄ (for some µ̄ > 0). So LA is an exact penalty function for λ = λ∗
(but we don’t know λ∗ ).
– Inequality constraints: several formulations (bound-constrained Lagrangian, linearly-constrained formu-
lation, unconstrained formulation).
18 Sequential quadratic programming (SQP)
One of the most effective approaches for nonlinearly constrained optimization, large or small.
Given initial x0 , λ0
for k = 0, 1, 2, . . .
Evaluate fk , ∇fk , ci (xk ), ∇ci (xk ), ∇2xx L(xk , λk )
(pk , λk+1 ) ← (solution, Lagrange multiplier) of QP subproblem
xk+1 ← xk + pk
if convergence test satisfied ⇒ stop with approximate solution (xk+1 , λk+1 )
end
• Intuitive idea: the QP subproblem is Newton’s method applied to the optimality conditions of
the problem. Consider only equality constraints for simplicity and write minx f (x) s.t. c(x) = 0
with c(x)T = (c1 (x), . . . , cm (x)) and A(x)T = (∇c1 (x), . . . , ∇cm (x)):
(i) The solution (pk, lk) of the QP subproblem satisfies
∇²xx Lk pk + ∇fk − Aᵀk lk = 0 and Ak pk + ck = 0 ⇔ ( ∇²xx Lk −Aᵀk ; Ak 0 ) (pk; lk) = (−∇fk; −ck).
(ii) KKT system for the problem: F(x, λ) = (∇x L(x, λ); c(x)) = 0 (where ∇x L(x, λ) = ∇f(x) − A(x)ᵀλ), for which Newton’s method (for root finding) results in the step
(xk+1; λk+1) = (xk; λk) + (pk; pλ), where ( ∇²xx Lk −Aᵀk ; Ak 0 ) (pk; pλ) = (−∇fk + Aᵀk λk; −ck)
(the Jacobian of F at (xk, λk) times the step equals −F(xk, λk)).
(i) ≡ (ii), since the two linear systems have the same solution (define lk = pλ + λk).
• Assumptions (recall lemma 16.1 in ch. 16 about equality-constrained QP):
– The constraint Jacobian Ak has full row rank (LICQ)
– ∇2xx Lk is pd on the tangent space of the constraints (dT ∇2xx Lk d > 0 ∀d 6= 0, Ak d = 0)
⇒ the KKT matrix is nonsingular and the linear system has a unique solution.
Do these assumptions hold? They do locally (near the solution) if the problem solution satisfies
the 2nd -order sufficient conditions. Then Newton’s method converges quadratically.
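A sketch of one local (equality-constrained) SQP / Newton-KKT step, with the problem derivatives supplied as callables (the example problem and starting multiplier are assumed for illustration):

import numpy as np

def sqp_step(x, lam, grad_f, c, A, hess_L):
    n, m = len(x), len(lam)
    H, Ak = hess_L(x, lam), A(x)
    K = np.block([[H, -Ak.T], [Ak, np.zeros((m, m))]])      # KKT matrix
    rhs = np.concatenate([-grad_f(x) + Ak.T @ lam, -c(x)])  # -F(x, lam)
    d = np.linalg.solve(K, rhs)
    return x + d[:n], lam + d[n:]                           # (x_{k+1}, lambda_{k+1})

# Example: min x1 + x2 s.t. x1^2 + x2^2 - 2 = 0, iterated from near the solution:
grad_f = lambda x: np.array([1.0, 1.0])
c = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0])
A = lambda x: np.array([[2 * x[0], 2 * x[1]]])
hess_L = lambda x, lam: -2 * lam[0] * np.eye(2)   # hess f = 0, hess c = 2I
x, lam = np.array([-2.0, -1.0]), np.array([-0.4])
for _ in range(8):
    x, lam = sqp_step(x, lam, grad_f, c, A, hess_L)
print(x, lam)   # approaches x = (-1, -1), lam = -0.5 (quadratically)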
Considering now equality and inequality constraints:
– Th. 18.1: (x∗, λ∗) local solution at which KKT + LICQ + 2nd-order + strict complementarity
hold ⇒ if (xk, λk) is sufficiently close to (x∗, λ∗), there is a local solution of the QP
subproblem whose active set Ak is A(x∗).
Thus, near the solution, the QP subproblem correctly identifies the active set and SQP
behaves like Newton steps for an equality-constrained problem.
• To ensure global convergence (≡ from remote starting points), Newton's method needs to be
modified (just as in the unconstrained optimization case). This includes defining a merit function
(which evaluates the goodness of an iterate by trading off reduction of the objective function
against violation of the constraints) and applying the strategies of:
– Line search: modify the Hessian of the quadratic model to make it pd, so that pk is a
descent direction for the merit function.
– Trust region: limit the step size to a region so that the step produces sufficient decrease
of the merit function (the Hessian need not be pd).
Additional issues need to be accounted for, e.g. the linearization of inequality constraints may
produce an infeasible subproblem.
Ex.: linearizing x ≤ 1, x² ≥ 0 at xk = 3 results in 3 + p ≤ 1, 9 + 6p ≥ 0, i.e., p ≤ −2 and p ≥ −3/2, which is inconsistent.
Review: sequential quadratic programming (SQP)
• At each iterate, solve the QP subproblem

    min_p  fk + ∇fkᵀp + ½ pᵀ∇²xxL(xk, λk)p   s.t.  ∇ci(xk)ᵀp + ci(xk) = 0, i ∈ E;  ∇ci(xk)ᵀp + ci(xk) ≥ 0, i ∈ I

where ∇²xxL(xk, λk) is the Hessian of the Lagrangian L(x, λ) = f(x) − λᵀc(x), for a given λ estimate.
• For equality constraints, this is equivalent to applying Newton's method to the KKT conditions.
• Local convergence: near a solution (x∗, λ∗), the QP subproblem correctly identifies the active set at the solution
and SQP behaves like Newton steps for an equality-constrained problem, converging quadratically.
• Global convergence: Newton's method needs to be modified, by defining a merit function and applying the
strategies of line search or trust region (as in unconstrained optimization).
19 Interior-point methods for nonlinear programming
• Considered the most powerful algorithms (together with SQP) for large-scale nonlinear programming.
• Extension of the interior-point methods for LP and QP.
• The terms “interior-point methods” and “barrier methods” are used interchangeably (though they have
different historical origins): the interior-point method can be seen as minimizing the log-barrier function
P(x; µ) (subject to the equalities) and taking µ → 0.
The primal log-barrier method
• Consider the inequality-constrained problem min_x f(x) s.t. ci(x) ≥ 0, i ∈ I.
Strictly feasible region F⁰ = {x ∈ Rⁿ: ci(x) > 0 ∀i ∈ I}, assumed nonempty.
Define the log-barrier function (through a barrier parameter µ > 0): (ex. 19.1)

    P(x; µ) = f(x) − µ Σ_{i∈I} log ci(x)

where f is the objective function and the log-barrier term −µ Σ_{i∈I} log ci(x) is infinite everywhere
except in F⁰, smooth inside F⁰, and approaches ∞ as x approaches the boundary of F⁰.
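A minimal sketch of the resulting barrier method for one inequality in one variable; the safeguarded finite-difference Newton inner solver, the schedule µ ← µ/5, and the example are all illustrative assumptions:

def log_barrier(df, c, dc, x, mu=1.0, rate=0.2, outer=12):
    # min f(x) s.t. c(x) >= 0, scalar x: minimize P(x; mu) for decreasing mu.
    # Inner solve: Newton on P'(x; mu) = f'(x) - mu*c'(x)/c(x) = 0, with a
    # step-halving safeguard to stay strictly feasible.
    for _ in range(outer):
        g = lambda y: df(y) - mu * dc(y) / c(y)             # gradient of P(.; mu)
        for _ in range(50):
            step = -g(x) / ((g(x + 1e-6) - g(x)) / 1e-6)    # FD Newton step
            while c(x + step) <= 0:                         # back off into F0
                step *= 0.5
            x += step
            if abs(g(x)) < 1e-10:
                break
        lam = mu / c(x)         # multiplier estimate lambda(mu) = mu / c(x(mu))
        mu *= rate              # follow the central path as mu -> 0
    return x, lam

# Hypothetical example: min (x+1)^2 s.t. x >= 0, where the central path is known
# analytically: x(mu) = (sqrt(1+2mu)-1)/2 -> x* = 0 and lam(mu) -> lam* = 2.
x, lam = log_barrier(df=lambda x: 2*(x + 1), c=lambda x: x, dc=lambda x: 1.0, x=1.0)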
Convergence
• For convex programs: global convergence.
Th.: f, {−ci, i ∈ I} convex functions, F⁰ ≠ ∅ ⇒
1. For any µ > 0, P(x; µ) is convex in F⁰ and attains a minimizer x(µ) (not necessarily
unique) on F⁰; any local minimizer x(µ) is also global.
2. If the set of solutions of the constrained optimization problem is nonempty and bounded
and if (µk ) is a decreasing sequence with µk → 0 ⇒ (x(µk )) converges to a solution x∗
and f (x(µk )) → f ∗ , P (x(µk ); µk ) → f ∗ .
If there are no solutions or the solution set is unbounded, the theorem may not apply.
Relation between the minimizers of P(x; µ) and a solution (x∗, λ∗): at a minimizer x(µ),

    0 = ∇xP(x(µ); µ) = ∇f(x(µ)) − Σ_{i∈I} (µ/ci(x(µ))) ∇ci(x(µ)) = ∇f(x(µ)) − Σ_{i∈I} λi(µ) ∇ci(x(µ)),

defining λi(µ) = µ/ci(x(µ)). This is KKT condition a) for the constrained problem (∇xL(x, λ) = 0). As for the other KKT
conditions at (x(µ), λ(µ)): b) (ci(x) ≥ 0, i ∈ I) and c) (λi ≥ 0, i ∈ I) also hold since ci(x(µ)) > 0;
only the complementarity condition d) fails: λi ci (x) = µ > 0; but it holds as µ → 0. The path
Cp = {x(µ): µ > 0} is called primal central path, and is the projection on the primal variables of the
primal-dual central path from the interior-point version.
Practical problems: as with the quadratic-penalty method, the barrier function looks quadratic
only very near its minimizer, and the Hessian ∇²xxP(x; µk) becomes ill-conditioned as µk → 0:
    ∇xP(x; µ)   = ∇f(x) − Σ_{i∈I} (µ/ci(x)) ∇ci(x)
    ∇²xxP(x; µ) = ∇²f(x) − Σ_{i∈I} (µ/ci(x)) ∇²ci(x) + Σ_{i∈I} (µ/ci(x)²) ∇ci(x)∇ci(x)ᵀ.
Near a minimizer x(µ) with µ small, from the earlier theorem the optimal Lagrange
multipliers can be estimated as λi∗ ≈ µ/ci(x), so

    ∇²xxP(x; µ) ≈ ∇²xxL(x, λ∗) + (1/µ) Σ_{i∈I} (λi∗)² ∇ci(x)∇ci(x)ᵀ

where the first term is independent of µ, while the second becomes very large as µ → 0 because of
the active constraints (λi∗ ≠ 0), and has rank < n.
The Newton step can be reformulated as in the quadratic-penalty method to avoid this ill-conditioning,
and it should be implemented with a line-search or trust-region strategy that keeps the iterates
safely inside the strictly feasible region.
Equality constraints  min_x f(x) s.t. ci(x) = 0, i ∈ E; ci(x) ≥ 0, i ∈ I.
Splitting an equality constraint ci(x) = 0 as two inequalities ci(x) ≥ 0, −ci(x) ≥ 0 doesn't work
(no point can satisfy both strictly, so F⁰ = ∅), but we can combine the quadratic penalty and the log-barrier:

    B(x; µ) = f(x) − µ Σ_{i∈I} log ci(x) + (1/2µ) Σ_{i∈E} ci(x)².
This behaves like the quadratic-penalty and barrier methods: the algorithm alternates successive
reduction of µ with approximate minimization of B wrt x; ∇²xxB is ill-conditioned when µ is
small; etc.
To find an initial point which is strictly feasible wrt the inequality constraints, introduce slack
variables si , i ∈ I:
    min_{x,s} f(x)  s.t.  ci(x) = 0, i ∈ E;  ci(x) − si = 0, i ∈ I;  si ≥ 0, i ∈ I  ⇒

    B(x, s; µ) = f(x) − µ Σ_{i∈I} log si + (1/2µ) Σ_{i∈E} ci(x)² + (1/2µ) Σ_{i∈I} (ci(x) − si)².
Review: interior-point methods for nonlinear programming
• Considered the most powerful algorithms (together with SQP) for large-scale nonlinear programming.
• Interior-point methods as homotopy methods: perturbed KKT conditions (introducing slacks s)

    ∇f(x) − AEᵀ(x)y − AIᵀ(x)z = 0
    Sz − µe = 0
    cE(x) = 0
    cI(x) − s = 0
    s, z ≥ 0

and Newton step

    ( ∇²xxL    0   −AEᵀ(x)  −AIᵀ(x) ) ( px )      ( ∇f(x) − AEᵀ(x)y − AIᵀ(x)z )
    (   0      Z      0        S    ) ( ps )      ( Sz − µe                    )
    (  AE(x)   0      0        0    ) ( py )  = − ( cE(x)                      )
    (  AI(x)  −I      0        0    ) ( pz )      ( cI(x) − s                  )

This follows the primal-dual central path (x(µ), s(µ), y(µ), z(µ)) as µ → 0⁺ while preserving s, z > 0, avoiding
spurious solutions.
– No ill-conditioning arises for a well-behaved solution.
– Practical versions of interior-point methods follow line-search or trust-region implementations.
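A minimal numpy sketch that assembles and solves this primal-dual Newton system for user-supplied callables (all names are illustrative; a practical solver would add a fraction-to-the-boundary rule to keep s, z > 0, symmetrization, and sparse factorizations):

import numpy as np

def primal_dual_step(x, s, y, z, mu, grad_f, hess_L, c_E, A_E, c_I, A_I):
    # One Newton step on the perturbed KKT conditions; unknowns (px, ps, py, pz)
    n, mI, mE = len(x), len(s), len(c_E(x))
    AE, AI = A_E(x), A_I(x)
    S, Z = np.diag(s), np.diag(z)
    K = np.block([
        [hess_L(x, y, z),   np.zeros((n, mI)),  -AE.T,               -AI.T              ],
        [np.zeros((mI, n)), Z,                  np.zeros((mI, mE)),  S                  ],
        [AE,                np.zeros((mE, mI)), np.zeros((mE, mE)),  np.zeros((mE, mI))],
        [AI,                -np.eye(mI),        np.zeros((mI, mE)),  np.zeros((mI, mI))],
    ])
    F = np.concatenate([grad_f(x) - AE.T @ y - AI.T @ z,
                        S @ z - mu * np.ones(mI),
                        c_E(x),
                        c_I(x) - s])
    p = np.linalg.solve(K, -F)
    px, ps, py, pz = np.split(p, [n, n + mI, n + mI + mE])
    return px, ps, py, pz   # caller should shorten the step to keep s+ps, z+pz > 0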
• Primal log-barrier method: sequence of unconstrained minimization subproblems where we drive µ → 0⁺

    min_x P(x; µ) = f(x) − µ Σ_{i∈I} log ci(x)    (if equality constraints, add them as a quadratic penalty to P)

thus allowing the minimizer of P to approach the boundary of the feasible set from inside it.
– Convergence: global for convex problems; otherwise local for a well-behaved solution (x∗, λ∗):
(x(µ), λ(µ)) → (x∗, λ∗) as µ → 0, where λi(µ) = µ/ci(x(µ)), i ∈ I.
– Cp = {x(µ): µ > 0} is the primal central path, and is the projection on the primal variables of the
primal-dual central path from the interior-point version.
– Problem: typically, ∇²xxP(x; µ) becomes progressively more ill-conditioned as µ → 0.
• Interior-point methods can be seen as barrier methods: eliminating s and z from the perturbed KKT conditions
yields the gradient of the log-barrier function.
Final comments
Fundamental ideas underlying most methods:
• Sequence of subproblems that converges to our problem, where each subproblem is easy:
– line search
– trust region
– homotopy or path-following, interior-point
– quadratic penalty, augmented Lagrangian, log-barrier
– sequential quadratic programming
– etc.
• A simpler function valid near the current iterate (e.g. a linear or quadratic model from Taylor's th.):
allows us to predict the objective function locally.
– inexact steps: approximate rather than exact solution of the subproblem, to minimize
overall computation
– warm starts: initialize subproblem from previous iteration’s result
– caching factorizations: for linear algebra subproblems, e.g. linear system with constant
coefficient matrix but variable RHS
• Heuristics are useful to invent algorithms, but they must be backed by theory guaranteeing
good performance (e.g. line-search heuristics are ok as long as the Wolfe conditions hold).
Given your particular optimization problem:
• No best method in general; use your understanding of basic methods and fundamental ideas to
choose an appropriate method, or to design your own.
• Recognize the type of problem and its structure: Differentiable? Smooth? (Partially) sepa-
rable? LP? QP? Convex? And the dual? How many constraints? Many optima? Feasible?
Bounded? Sparse Hessian? Etc.
• Simplify the problem if possible: improve variable scaling, introduce slacks, eliminate variables
or redundant constraints, introduce new variables/constraints to decouple terms. . .
• Try to come up with subproblems that make good progress towards the solution but are easy
to solve.
• Try to guess good initial iterates: from domain knowledge, or from solving a version of the
problem that is simpler (e.g. convex or less nonlinear) or smaller (using fewer variables).
• Determine your stopping criterion. Do you need a highly accurate or just an approximate
minimizer? Do you need to identify the active constraints at the solution accurately?
• Close the loop between the definition of the optimization problem (motivated by a practical
application) and the computational approach to solve it, in order to find a good compromise:
a problem that is practically meaningful (its solution is useful and accurate or approximate
enough) and convenient to solve (efficiently, in a scalable way, using existing algorithms, etc.).
A Math review
Error analysis and floating-point arithmetic
• Floating-point representation of x ∈ R: fl(x) = 2ᵉ Σ_{i=1}^{t} di 2⁻ⁱ (t bits for the fractional part,
d1 = 1; the remaining bits for exponent and sign).
• Unit roundoff u = 2⁻ᵗ⁻¹ (≈ 1.1 × 10⁻¹⁶ for 64-bit IEEE double precision). Matlab: eps = 2u.
• Any x with |x| ∈ [2ᴸ, 2ᵁ] (where e ∈ {L+1, . . . , U}) can be approximated with relative accuracy u:
|fl(x) − x|/|x| ≤ u ⇔ fl(x) = x(1 + ε) with roundoff error |ε| ≤ u (so x and fl(x) agree to at least
15 decimal digits).
• Roundoff errors accumulate during floating-point operations. An algorithm is stable if errors do
not grow unboundedly. A particularly nasty case is cancellation: the relative error in computing
x − y when x and y are very close is ≲ 2u|x|/|x − y| ⇒ precision loss; or, if x and y are accurate
to k digits and they agree in the first k̄, their difference contains only about k − k̄ significant
digits. So, avoid taking the difference of similar floating-point numbers.
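A small Python illustration of cancellation: computing 1 − cos x for small x directly loses all significant digits, while the algebraically equivalent 2 sin²(x/2) does not:

import math

x = 1e-8
naive = 1.0 - math.cos(x)          # catastrophic cancellation: prints 0.0
stable = 2.0 * math.sin(x / 2)**2  # algebraically equal, no cancellation
print(naive, stable)               # 0.0 vs ~5e-17 (the true value is x^2/2)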
• Cone: set F verifying: x ∈ F ⇒ αx ∈ F ∀α > 0. Ex.: {(x1, x2)ᵀ: x1 > 0, x2 ≥ 0}.
Cone generated by {x1, . . . , xm} ⊂ Rⁿ: {x ∈ Rⁿ: x = Σ_{i=1}^{m} αi xi, αi ≥ 0 ∀i = 1, . . . , m}.
Convex hull of {x1, . . . , xm} ⊂ Rⁿ: {x ∈ Rⁿ: x = Σ_{i=1}^{m} αi xi, αi ≥ 0 ∀i = 1, . . . , m, Σ_{i=1}^{m} αi = 1}.
Matrices
• Positive definite matrix B ⇔ pᵀBp > 0 ∀p ≠ 0 (pd). Positive semidefinite if ≥ 0 (psd).
• Matrix norm induced by a vector norm: ‖A‖ = sup_{x≠0} ‖Ax‖/‖x‖.
Ex.: ‖A‖₂ (spectral norm) = largest s.v. σmax(A) = √(largest eigenvalue of AᵀA).
If A is symmetric, then its s.v.'s are the absolute values of its eigenvalues.
• Condition number of a square nonsingular matrix A: κ(A) = ‖A‖‖A⁻¹‖ ≥ 1, where ‖·‖ is any
matrix norm (ex. in p. 616). Ex. κ2(A) = σmax/σmin. For a square linsys Ax = b perturbed to Ãx̃ = b̃:

    ‖x − x̃‖/‖x‖ ≈ κ(A) (‖A − Ã‖/‖A‖ + ‖b − b̃‖/‖b‖),

so ill-conditioned problem ⇔ large κ(A).
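A quick numpy illustration (the matrix is an arbitrary nearly-singular example): a tiny relative perturbation of b is amplified in x by a factor of order κ(A):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])          # nearly singular => large kappa
b = np.array([2.0, 2.0001])
x = np.linalg.solve(A, b)              # exact solution (1, 1)
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-6]))
rel_in = 1e-6 / np.linalg.norm(b)      # relative input perturbation
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(np.linalg.cond(A), rel_out / rel_in)   # both ~4e4: error amplified by ~kappa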
• Eigenvalues and eigenvectors of a real matrix: Au = λu, with λ ∈ C an eigenvalue and u ∈ Cⁿ an eigenvector.
– symmetric: all λ ∈ R, u ∈ Rⁿ; eigenvectors of different eigenvalues are ⊥
– nonsingular: all λ ≠ 0
– pd: all λ > 0 (nd: all λ < 0); psd: all λ ≥ 0 (nsd: all λ ≤ 0); not definite: mixed-sign λ
– Unit-norm best approximation (min_x ‖Ax‖₂ s.t. ‖x‖₂ = 1): x = minor eigenvector of AᵀA.
– min_x ‖Ax‖₂ s.t. ‖Bx‖₂ = 1: x = minor generalized eigenvector of (AᵀA, BᵀB).
• m > n: overconstrained, no solution in general; instead, define the LSQ solution as min_x ‖Ax − b‖₂
⇒ x = A⁺b = (AᵀA)⁻¹Aᵀb.
AA⁺ = A(AᵀA)⁻¹Aᵀ is the orthogonal projection on range(A) (Pf.: write y ∈ Rᵐ as y = Ax + u where
x ∈ Rⁿ and u ⊥ Ax ∀x ∈ Rⁿ (i.e., u ∈ null(Aᵀ)); then A(AᵀA)⁻¹Aᵀy = Ax).
Likewise (in the underconstrained case m < n, with A⁺ = Aᵀ(AAᵀ)⁻¹), A⁺A = Aᵀ(AAᵀ)⁻¹A is the
orthogonal projection on range(Aᵀ).
• rank(A) ≥ p ⇒ UpSpVpᵀ (the truncated SVD) is the best rank-p approximation to A in the sense of the
Frobenius norm (‖A‖F² = tr(AAᵀ) = Σ_{i,j} aij²) and the 2-norm (‖A‖₂ = largest s.v.).
Other matrix decompositions (besides spectral, SVD)
• Cholesky decomposition: A symmetric pd ⇒ A = LLᵀ with L lower triangular.
Useful to solve a sym pd linsys efficiently: Ax = b ⇔ LLᵀx = b ⇒ solve two triangular linsys.
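For instance, with scipy (minimal sketch, arbitrary example data):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
b = np.array([1.0, 2.0])
c, low = cho_factor(A)                    # one factorization A = L L^T
x = cho_solve((c, low), b)                # two triangular solves
print(np.allclose(A @ x, b))              # True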
Matrix identities
Ranks:
    Am×n, Bn×p: rank(A) + rank(B) − n ≤ rank(AB) ≤ min(rank(A), rank(B)).
    Am×n, Bm×n: rank(A + B) ≤ rank(A) + rank(B).
Derivatives:

    d(aᵀx)/dx = ∇x(aᵀx) = a   (a, x ∈ Rⁿ, a independent of x)

    d(xᵀAx)/dx = ∇x(xᵀAx) = (A + Aᵀ)x = 2Ax if A symmetric   (An×n independent of x)

    xm×1, yn×1: dyᵀ/dx = m × n matrix with entries (dyᵀ/dx)ij = ∂yj/∂xi, the transpose of the
    Jacobian matrix J(x) = (∂yi/∂xj)ij

    f1×1, xn×1: d²f/(dx dxᵀ) = n × n Hessian matrix ∇²f(x) = (∂²f/∂xi∂xj)ij

    d(xᵀC)/dx = Cn×m,  d(Bx)/dxᵀ = Bm×n   (xn×1; B, C independent of x)

    Product rule: d(uᵀv)/dx = ∇x(uᵀv) = (duᵀ/dx) v + (dvᵀ/dx) u   (x, u, v ∈ Rⁿ)

    Chain rule: dy(x(t))/dt = (dy/dxᵀ)(dx/dt) = Σ_{i=1}^{n} (∂y/∂xi)(dxi/dt)   (x ∈ Rⁿ, y ∈ Rᵐ, t ∈ R;
    dy/dxᵀ = (dyᵀ/dx)ᵀ is m × n and dx/dt is n × 1)
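A quick finite-difference sanity check of the identity ∇x(xᵀAx) = (A + Aᵀ)x above (arbitrary random data, central differences):

import numpy as np

rng = np.random.default_rng(0)
A, x = rng.standard_normal((3, 3)), rng.standard_normal(3)
f = lambda x: x @ A @ x                      # the quadratic form x^T A x
h = 1e-6
fd_grad = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(3)])
print(np.allclose(fd_grad, (A + A.T) @ x, atol=1e-6))   # True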
Quadratic forms
f(x) = ½xᵀAx + bᵀx + c, A ∈ Rⁿˣⁿ, b, x ∈ Rⁿ, c ∈ R. Center and diagonalize it:
1. Ensure A is symmetric: xᵀAx = xᵀ((A + Aᵀ)/2)x.
2. Center (A nonsingular): with x∗ = −A⁻¹b, f(x) = ½(x − x∗)ᵀA(x − x∗) + f(x∗).
3. Diagonalize: with the spectral decomposition A = UΛUᵀ and y = Uᵀ(x − x∗), f = ½ Σ_i λi yi² + f(x∗).
[Figure: contour plots of quadratic forms over (x1, x2) ∈ [−2, 2]² for different cases of A.]
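A numeric check of the centering/diagonalization steps above, on an arbitrary small example:

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric pd
b, c = np.array([1.0, -1.0]), 0.5
f = lambda x: 0.5 * x @ A @ x + b @ x + c
x_star = -np.linalg.solve(A, b)            # center: stationary point -A^{-1} b
lam, U = np.linalg.eigh(A)                 # diagonalize: A = U diag(lam) U^T
x = np.array([1.0, 1.0])                   # arbitrary test point
y = U.T @ (x - x_star)                     # coordinates in the eigenbasis
print(np.isclose(f(x), 0.5 * lam @ y**2 + f(x_star)))   # True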
Order notation
Consider f (n), g(n) ≥ 0 for n = 1, 2, 3 . . .
• Asymptotic upper bound O(·): f is O(g) iff f(n) ≤ c g(n) for some c > 0 and all n > n0.
f is of order g at most. Ex.: 3n + 5 is O(n) and O(n²) but not O(log n) or O(√n).
• Asymptotic strict upper bound o(·): f is o(g) iff lim_{n→∞} f(n)/g(n) = 0.
f becomes insignificant relative to g as n grows. Ex.: 3n + 5 is o(n²) and o(n^1.3) but not o(n).
• Asymptotic tight bound Ω(·): f is Ω(g) iff c0 g(n) ≤ f(n) ≤ c1 g(n) for some c1 ≥ c0 > 0 and all n > n0.
Equivalently, f is O(g) and g is O(f). Ex.: 3n + 5 is Ω(n) but not Ω(n²), Ω(log n), or Ω(√n).
Rates of convergence
Infimum of a set S ⊂ R, inf S (greatest lower bound): the largest v ∈ R s.t. v ≤ s ∀s ∈ S. If inf S ∈ S
then we also denote it as min S (minimum, smallest element of S). Likewise for maximum/supremum.
Ex.: for S = {1/n, n ∈ N} we have sup S = max S = 1 and inf S = 0, but S has no minimum.
Ex.: does “min_x f(x) s.t. x > 0” make sense?
Let {xk}, k = 0, 1, 2, . . . ⊂ Rⁿ be a sequence that converges to x∗.
• Linear convergence: ‖xk+1 − x∗‖/‖xk − x∗‖ ≤ r for all k sufficiently large, with constant 0 < r < 1.
The distance to the solution decreases at each iteration by at least a constant factor. Ex.: xk = 2⁻ᵏ;
steepest descent (and r ≈ 1 for ill-conditioned problems).
• Sublinear convergence: lim_{k→∞} ‖xk+1 − x∗‖/‖xk − x∗‖ = 1. Ex.: xk = 1/k.
• Superlinear convergence: lim_{k→∞} ‖xk+1 − x∗‖/‖xk − x∗‖ = 0. Ex.: xk = k⁻ᵏ; quasi-Newton methods.
• Quadratic convergence (order 2): ‖xk+1 − x∗‖/‖xk − x∗‖² ≤ M for all k sufficiently large, with constant
M > 0 (not necessarily < 1). We double the number of correct digits at each iteration. Quadratic is faster
than superlinear, which is faster than linear. Ex.: xk = 2^(−2^k); Newton's method.
• Order p: ‖xk+1 − x∗‖/‖xk − x∗‖ᵖ ≤ M (rare for p > 2).
In the long run, the speed of an algorithm depends mainly on the order p and on r (for p = 1), and
more weakly on M. The values of r, M depend on the algorithm and the particular problem.
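A small Python illustration of these rates, printing the successive error ratios ‖xk+1 − x∗‖/‖xk − x∗‖ for the example sequences above (x∗ = 0):

import numpy as np

k = np.arange(1, 8, dtype=float)
for name, xk in [("sublinear 1/k", 1/k),
                 ("linear 2^-k", 2.0**-k),
                 ("superlinear k^-k", k**-k),
                 ("quadratic 2^-2^k", 2.0**-(2.0**k))]:
    ratios = xk[1:] / xk[:-1]          # successive error ratios
    print(name, np.round(ratios, 4))   # -> 1, constant r = 1/2, -> 0, -> 0 very fast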