
Lecture Notes on Numerical Optimization

Miguel Á. Carreira-Perpiñán


EECS, University of California, Merced
December 30, 2020

These are notes for a one-semester graduate course on numerical optimization given by Prof.
Miguel Á. Carreira-Perpiñán at the University of California, Merced. The notes are largely based
on the book “Numerical Optimization” by Jorge Nocedal and Stephen J. Wright (Springer, 2nd ed.,
2006), with some additions.
These notes may be used for educational, non-commercial purposes.
© 2005–2020 Miguel Á. Carreira-Perpiñán
1 Introduction
• Goal: describe the basic concepts & main state-of-the-art algorithms for continuous optimiza-
tion.

• The optimization problem:

      min_{x ∈ ℝⁿ} f(x)   s.t.   c_i(x) = 0, i ∈ E   (equality constraints, scalar)
                                 c_i(x) ≥ 0, i ∈ I   (inequality constraints, scalar)

  x: variables (vector); f(x): objective function (scalar).


Feasible region: set of points satisfying all constraints.
max f ≡ − min −f .
• Ex. (fig. 1.1): min_{x₁,x₂} (x₁ − 2)² + (x₂ − 1)²   s.t.   x₁² − x₂ ≤ 0,   x₁ + x₂ ≤ 2.
• Ex.: transportation problem (LP):

      min_{x_ij} Σ_{i,j} c_ij x_ij   s.t.   Σ_j x_ij ≤ a_i ∀i   (capacity of factory i)
                                            Σ_i x_ij ≥ b_j ∀j   (demand of shop j)
                                            x_ij ≥ 0 ∀i,j       (nonnegative production)

  c_ij: shipping cost; x_ij: amount of product shipped from factory i to shop j.

• Ex.: LSQ problem: fit a parametric model (e.g. line, polynomial, neural net. . . ) to a data set. Ex. 2.1

• Optimization algorithms are iterative: build sequence of points that converges to the solution.
Needs good initial point (often by prior knowledge).

• Focus on many-variable problems (but will illustrate in 2D).

• Desiderata for algorithms:

– Robustness: perform well on wide variety of problems in their class, for any starting point;
– Efficiency: little computer time or storage;
– Accuracy: identify the solution precisely (within the limits of floating-point arithmetic).

They conflict with each other.

• General comment about optimization (Fletcher): “fascinating blend of theory and computation,
heuristics and rigour”.

– No universal algorithm: a given algorithm works well with a given class of problems.
– Necessary to adapt a method to the problem at hand (by experimenting).
– Not choosing an appropriate algorithm → solution found very slowly or not at all.

• Not covered in the Nocedal & Wright book, or in this course:

– Discrete optimization (integer programming): the variables are discrete. Ex.: integer
transportation problem, traveling salesman problem.
∗ Harder to solve than continuous opt (in the latter we can predict the objective function
value at nearby points).

∗ Too many solutions to count them.
∗ Rounding typically gives very bad solutions.
∗ Highly specialized techniques for each problem type.
Ref: Papadimitriou & Steiglitz 1982.

– Network opt: shortest paths, max flow, min cost flow, assignments & matchings, MST,
dynamic programming, graph partitioning. . .
Ref: Ahuja, Magnanti & Orlin 1993.

– Non-smooth opt: discontinuous derivatives, e.g. the L1-norm.


Ref: Fletcher 1987.

– Stochastic opt: the model is specified with uncertainty, e.g. x ≤ b where b could be given
by a probability density function.
– Global opt: find the global minimum, not just a local one. Very difficult.
Some heuristics: simulated annealing, genetic algorithms, evolutionary computation.
– Multiobjective opt: one approach is to transform it to a single objective = linear combi-
nations of objectives.
– EM algorithm (Expectation-Maximization): specialized technique for maximum likelihood
estimation of probabilistic models.
Ref: McLachlan & Krishnan 2008; many books on statistics or machine learning.

– Calculus of variations: stationary point of a functional (= function of functions).


– Convex optimization: we’ll see some of this.
Ref: Boyd & Vandenberghe 2003.

– Modeling: the setup of the opt problem, i.e., the process of identifying objective, variables
and constraints for a given problem. Very important but application-dependent.
Ref: Dantzig 1963; Ahuja, Magnanti & Orlin 1993.

• Course contents: derivative-based methods for continuous optimization (see syllabus).

2 Fundamentals of unconstrained optimization
Problem: min f (x), x ∈ Rn .

Conditions for a local minimum x∗ (cf. case n = 1)

• Global minimizer : f (x∗ ) ≤ f (x) ∀x ∈ Rn .

• Local minimizer : ∃ neighborhood N of x∗ : f (x∗ ) ≤ f (x) ∀x ∈ N .

• Strict (or strong) local minimizer: f(x∗) < f(x) ∀x ∈ N \ {x∗}. (Ex. f(x) = 3 vs f(x) = (x − 2)⁴ at x∗ = 2.)

• Isolated local minimizer: ∃N of x∗ such that x∗ is the only local min. in N. (Ex. f(x) = x⁴cos(1/x) + 2x⁴ with f(0) = 0 has a strict global minimizer at x∗ = 0 but non-isolated.) All isolated local min. are strict.

• First-order necessary conditions (Th. 2.2): x∗ local min, f cont. diff. in an open neighborhood
of x∗ ⇒ ∇f(x∗) = 0. (Not a sufficient condition, ex: f(x) = x³.)
(Pf.: by contradiction: if ∇f(x∗) ≠ 0 then f decreases along the negative gradient direction.)

• Stationary point: ∇f (x∗ ) = 0.

• Second-order necessary conditions (Th. 2.3): x∗ is local min, f twice cont. diff. in an open
neighborhood of x∗ ⇒ ∇f(x∗) = 0 and ∇²f(x∗) is psd. (Not a sufficient condition, ex: f(x) = x³.)
(Pf.: by contradiction: if ∇²f(x∗) is not psd then f decreases along a direction of negative curvature.)

• Second-order sufficient conditions (Th. 2.4): ∇2 f cont. in an open neighborhood of x∗ , ∇f (x∗ ) =


0, ∇2 f (x∗ ) pd ⇒ x∗ is a strict local minimizer of f . (Not necessary condition, ex.: f (x) = x4 at x∗ = 0.)
(Pf.: Taylor-expand f around x∗ .)

The key for the conditions is that ∇, ∇2 exist and are continuous. The smoothness of f allows us to
predict approximately the landscape around a point x.

Convex optimization
• S ⊂ Rn is a convex set if x, y ∈ S ⇒ αx + (1 − α)y ∈ S, ∀α ∈ [0, 1].

• f : S ⊂ Rn → R is a convex function if its domain S is convex and f (αx + (1 − α)y) ≤


αf (x) + (1 − α)f (y), ∀α ∈ (0, 1), ∀x, y ∈ S.
Strictly convex: “<” instead of “≤”. f is (strictly) concave if −f is (strictly) convex.

• Convex optimization problem: the objective function and the feasible set are both convex
(⇐ the equality constraints are linear and the inequality constraints ci (x) ≥ 0 are concave.)
Ex.: linear programming (LP).

• Easier to solve because every local min is a global min.

• Th. 2.5:

– f convex ⇒ any local min is also global.


– f convex and differentiable ⇒ any stationary point is a global min.
(Pf.: by contradiction, assume z with f (z) < f (x∗ ), study the segment x∗ z.)

Algorithm overview
• Algorithms look for a stationary point starting from a point x₀ (arbitrary or user-supplied) ⇒ sequence of iterates {x_k}_{k=0}^∞ that terminates when no more progress can be made, or when a solution has been approximated with sufficient accuracy.

• Stopping criterion: can't use ‖x_k − x∗‖ or |f(x_k) − f(x∗)|. Instead, in practice (given a small ε > 0):

  – ‖∇f(x_k)‖ < ε
  – ‖x_k − x_{k−1}‖ < ε or ‖x_k − x_{k−1}‖ < ε‖x_{k−1}‖
  – |f(x_k) − f(x_{k−1})| < ε or |f(x_k) − f(x_{k−1})| < ε|f(x_{k−1})|

  and also set a limit on k.


With convex problems, it is often possible to bound the error |f (xk ) − f (x∗ )| as a function of k.

• We choose xk+1 given information about f at xk (and possibly earlier iterates) so that f (xk+1 ) <
f (xk ) (descent).

• Move xk → xk+1 : two fundamental strategies, line search and trust region.

– Line search strategy


1. Choose a direction pk
2. Search along pk from xk for xk+1 with f (xk+1) < f (xk ), i.e., approximately solve the
1D minimization problem: minα>0 f (xk + αpk ) where α = step length.
– Trust region strategy
1. Construct a model function mk (typ. quadratic) that is similar to (but simpler than)
f around xk .
2. Search for x_{k+1} with m_k(x_{k+1}) < m_k(x_k) inside a small trust region (typ. a ball)
around xk , i.e., approximately solve the n-D minimization problem: minp mk (xk + p)
s.t. xk + p ∈ trust region.
3. If xk+1 does not produce enough decay in f , shrink the region.

• In both strategies, the subproblem (step 2) is easier to solve than the real problem. Why not
solve the subproblem exactly?

– Good: derives maximum benefit from pk or mk ; but


– Bad: expensive (many iterations over α) and unnecessary towards the real problem (min
f over all Rn ).

• Both strategies differ in the order in which they choose the direction and the distance of the
move:

– Line search: fix direction, choose distance.


– Trust region: fix maximum distance, choose direction and actual distance.

Scaling (“units” of the variables)
• A problem is poorly scaled if changes to x in a certain direction produce much larger variations
in the value of f than do changes to x in another direction. Some algorithms (e.g. steepest
descent) are sensitive to poor scaling while others (e.g. Newton’s method) are not. Generally,
scale-invariant algorithms are more robust to poor problem formulations.
Ex. f(x) = 10⁹x₁² + x₂² (fig. 2.7).

• Related, but not the same as ill-conditioning.


 
Ex. f(x) = ½ xᵀAx − bᵀx with A = (1  1+ε; 1+ε  1) and b = (2+ε, 2)ᵀ.

Review: fundamentals of unconstrained optimization


Assuming the derivatives ∇f (x∗ ), ∇2 f (x∗ ) exist and are continuous in a neighborhood of x∗ :

• x∗ is a local minimizer ⇒ ∇f(x∗) = 0 (first order) and ∇²f(x∗) psd (second order) (necessary conditions).
• ∇f(x∗) = 0 and ∇²f(x∗) pd ⇒ x∗ is a strict local minimizer (sufficient condition).
• Stationary point: ∇f(x∗) = 0. At a stationary point, if ∇²f(x∗) is:
  – pd: strict local minimizer
  – nd: strict local maximizer
  – not definite (pos & neg eigenvalues): saddle point
  – psd: may be a non-strict local minimizer
  – nsd: may be a non-strict local maximizer.

3 Line search methods
Iteration: xk+1 = xk + αk pk , where αk is the step length (how far to move along pk ), αk > 0; pk is
the search direction.

[Figure: a search direction p_k at a point x_k.]

Descent direction at x_k: p_kᵀ∇f_k = ‖p_k‖‖∇f_k‖ cos θ_k < 0 (angle θ_k < π/2 with −∇f_k). Guarantees that f can be reduced along p_k (for a sufficiently small step):

    f(x_k + αp_k) = f(x_k) + α p_kᵀ∇f_k + O(α²)   (Taylor's th.)
                  < f(x_k)   for all sufficiently small α > 0

• The steepest descent direction, i.e., the direction along which f decreases most rapidly, is p_k = −∇f_k (figs. 2.5–2.6). Pf.: for any p, α: f(x_k + αp) = f(x_k) + α pᵀ∇f_k + O(α²), so the rate of change in f along p at x_k is pᵀ∇f_k (the directional derivative) = ‖p‖‖∇f_k‖ cos θ. Then min_p pᵀ∇f_k s.t. ‖p‖ = 1 is achieved when cos θ = −1, i.e., p = −∇f_k/‖∇f_k‖.
  This direction is ⊥ to the contours of f. Pf.: take x + p on the same contour line as x. Then, by Taylor's th.:

      f(x + p) = f(x) + pᵀ∇f(x) + ½ pᵀ∇²f(x + εp)p,  ε ∈ (0, 1)
      ⇒ cos ∠(p, ∇f(x)) = −(½ pᵀ∇²f(x + εp)p)/(‖p‖‖∇f(x)‖) → 0 as ‖p‖ → 0,

  but ‖p‖ → 0 along the contour line means p/‖p‖ is parallel to its tangent at x.

• The Newton direction is p_k = −(∇²f_k)⁻¹∇f_k. This corresponds to assuming f is locally quadratic and jumping directly to the minimum of that quadratic. Pf.: by Taylor's th.:

      f(x_k + p) ≈ f_k + pᵀ∇f_k + ½ pᵀ∇²f_k p = m_k(p)

  which is minimized (take derivatives wrt p) by the Newton direction if ∇²f_k is pd. (✐ what happens if assuming f is locally linear (order 1)?)
  In a line search the Newton direction has a natural step length of 1.

• For most algorithms, p_k = −B_k⁻¹∇f_k where B_k is symmetric and nonsingular:
  – steepest descent: B_k = I
  – Newton's method: B_k = ∇²f(x_k)
  – quasi-Newton method: B_k ≈ ∇²f(x_k)
  If B_k is pd then p_k is a descent direction: p_kᵀ∇f_k = −∇f_kᵀB_k⁻¹∇f_k < 0.

Here, we deal with how to choose the step length given the search direction pk . Desirable properties:
guaranteed global convergence and rapid rate of convergence.

Step length
Time/accuracy trade-off: want to choose αk to give a substantial reduction in f but not to spend
much time on it.

• Exact line search (global or local min): α_k = arg min_{α>0} φ(α) = f(x_k + αp_k). Too expensive: many evaluations of f, ∇f to find α_k even with moderate precision. (✐ Angle ∠(p_k, ∇f_{k+1}) = ?)
∇fk+1 ) = ?)

• Inexact line search: a typical l.s. algorithm will try a sequence of α values and stop when certain
conditions hold.

We want easily verifiable theoretical conditions on the step length that allow to prove convergence
of an optimization algorithm.

• Simple reduction in f, f(x_k + α_k p_k) < f(x_k): not enough, the iterates can converge without reaching a minimizer.

Wolfe conditions

➀ f(x_k + α_k p_k) ≤ f(x_k) + c₁α_k∇f_kᵀp_k ("sufficient decrease in the objective function");
➁ ∇f(x_k + α_k p_k)ᵀp_k ≥ c₂∇f_kᵀp_k ("more positive gradient"),
where 0 < c₁ < c₂ < 1. Call φ(α) = f(x_k + αp_k); then φ′(α) = ∇f(x_k + αp_k)ᵀp_k.

➀ Sufficient decrease (Armijo condition) is equivalent to φ(0) − φ(α_k) ≥ α_k(−c₁φ′(0)) (fig. 3.3).
  Rejects too-small decreases. The reduction is proportional both to the step length α_k and to the directional derivative ∇f_kᵀp_k. In practice, c₁ is very small, e.g. c₁ = 10⁻⁴.
  It is satisfied for any sufficiently small α ⇒ not enough by itself; we need to rule out unacceptably small steps.

➁ Curvature condition is equivalent to φ′(α_k) ≥ c₂φ′(0) (figs. 3.4–3.5).
  Rejects too-negative slopes. Reason: if the slope at α, φ′(α), is strongly negative, it's likely we can reduce f significantly by moving further.
  In practice, c₂ = 0.9 if p_k is chosen by a Newton or quasi-Newton method, and c₂ = 0.1 if p_k is chosen by a nonlinear conjugate gradient method.

We will concentrate on the Wolfe conditions in general, and assume they always hold when the l.s.
is used as part of an optimization algorithm (allows convergence proofs).
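As a concrete illustration, here is a minimal sketch in Python/NumPy of a Wolfe-conditions test. The function name is ours (not from the book), and f and grad_f are assumed to be callables returning f(x) and ∇f(x):

import numpy as np

def wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step alpha along a descent direction p."""
    phi0 = f(x)
    dphi0 = grad_f(x) @ p                 # phi'(0) = grad f(x)^T p, negative for descent p
    phi_a = f(x + alpha * p)
    dphi_a = grad_f(x + alpha * p) @ p    # phi'(alpha)
    cond1 = phi_a <= phi0 + c1 * alpha * dphi0   # (1) sufficient decrease
    cond2 = dphi_a >= c2 * dphi0                 # (2) curvature
    return cond1 and cond2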

Lemma 3.1: there always exist step lengths that satisfy the Wolfe (also the strong Wolfe) conditions
if f is smooth and bounded below. (Pf.: mean value th.; see figure below.)

[Figure: φ(α) = f(x_k + αp_k) with the sufficient-decrease line l(α) = f_k − (−c₁∇f_kᵀp_k)α = f_k + c₁φ′(0)α.]

Other useful conditions

• Strong Wolfe conditions: ➀ + ➁′ |∇f(x_k + α_k p_k)ᵀp_k| ≤ c₂|∇f_kᵀp_k| ("flatter gradient"). We don't allow φ′(α_k) to be too positive, so we exclude points that are far from stationary points of φ.
• Goldstein conditions: ensure the step length achieves sufficient decrease but is not too short (fig. 3.6):

      f(x_k) + (1 − c)α_k∇f_kᵀp_k ≤ f(x_k + α_k p_k) ≤ f(x_k) + cα_k∇f_kᵀp_k,   0 < c < ½

  The first inequality (➋) controls the step from below; the second (➊) is the first Wolfe condition.
  Disadvantage: may exclude all minimizers of φ. (✐ do the Wolfe conditions exclude minimizers?)

Sufficient decrease and backtracking l.s. (Armijo l.s.)


Start with largish step size and decrease it (times ρ < 1) until it meets the sufficient decrease condition
➀ (Algorithm 3.1).
• It’s a heuristic approach to avoid a more careful l.s. that satisfies the Wolfe cond.
• It always terminates because ➀ is satisfied by sufficiently small α.
• Works well in practice because the accepted αk is near (times ρ) the previous α, which was
rejected for being too long.
• The initial step length α is 1 for Newton and quasi-Newton methods.
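A minimal sketch of Algorithm 3.1 in Python/NumPy, under the same assumptions as the checker above; the values of ρ and c₁ are typical choices, not prescribed by the notes:

import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Backtracking (Armijo) l.s.: shrink alpha until sufficient decrease (1) holds."""
    alpha = alpha0
    fx = f(x)
    dphi0 = grad_f(x) @ p             # directional derivative, < 0 for a descent direction
    while f(x + alpha * p) > fx + c1 * alpha * dphi0:
        alpha *= rho                  # step too long: shrink it
    return alpha

It always terminates (for a descent direction p) because condition ➀ holds for all sufficiently small α.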

Step-length selection algorithms
• They take a starting value of α and generate a sequence {αi } that satisfies the Wolfe cond.
Usually they use interpolation, e.g. approximate φ(α) as a cubic polynomial.

• There are also derivative-free methods (e.g. the golden section search) but they are less efficient
and can’t benefit from the Wolfe cond. (to prove global convergence).

• We’ll just use backtracking for simplicity.

Convergence of line search methods


• Global convergence: ‖∇f_k‖ → 0 as k → ∞, i.e., convergence to a stationary point for any starting point x₀. To converge to a minimizer we need more information, e.g. the Hessian or convexity.
• We give a theorem on the search direction p_k needed to obtain global convergence, focusing on the angle θ_k between p_k and the steepest descent direction −∇f_k: cos θ_k = −∇f_kᵀp_k / (‖∇f_k‖‖p_k‖), and assuming the Wolfe cond. Similar theorems exist for the strong Wolfe and Goldstein cond.

• Important theorem, e.g. shows that the steepest descent method is globally convergent; for
other algorithms, it describes how far pk can deviate from the steepest descent direction and
still give rise to a globally convergent iteration.
• Th. 3.2 (Zoutendijk). Consider an iterative method x_{k+1} = x_k + α_k p_k with starting point x₀, where p_k is a descent direction and α_k satisfies the Wolfe conditions. Suppose f is bounded below in ℝⁿ and cont. diff. in an open set N containing the level set L = {x: f(x) ≤ f(x₀)}, and that ∇f is Lipschitz continuous on N (⇔ ∃L > 0: ‖∇f(x) − ∇f(x̃)‖ ≤ L‖x − x̃‖ ∀x, x̃ ∈ N; weaker than a bounded Hessian). Then

      Σ_{k≥0} cos²θ_k ‖∇f_k‖² < ∞   (Zoutendijk's condition).

  (Proof: ➁ + Lipschitz continuity give a lower bound on α_k; ➀ gives an upper bound on f_{k+1}; telescope.)

  Zoutendijk's condition implies cos²θ_k ‖∇f_k‖² → 0. Thus if cos θ_k ≥ δ > 0 ∀k for fixed δ, then ‖∇f_k‖ → 0 (global convergence).

• Examples:

– Steepest descent method: p_k = −∇f_k ⇒ cos θ_k = 1 ⇒ global convergence (fig. 3.7). Intuitive method, but very slow in difficult problems.
– Newton-like method: p_k = −B_k⁻¹∇f_k with B_k symmetric, pd and with bounded condition number: ‖B_k‖‖B_k⁻¹‖ ≤ M ∀k (✐ ill-cond. ⇒ ∇f ⊥ Newton dir.). Then cos θ_k ≥ 1/M (Pf.: exe. 3.5) ⇒ global convergence.
  In other words, if the B_k are pd (which is required for descent directions), have bounded c.n., and the step lengths satisfy the Wolfe conditions ⇒ global convergence. This includes steepest descent and some Newton and quasi-Newton methods.
• For some methods (e.g. conjugate gradients) we may have directions that are almost ⊥∇fk when
the Hessian is ill-conditioned. It is still possible to show global convergence by assuming that we
take a steepest descent step from time to time. "Turning" the directions toward −∇f_k so that cos θ_k ≥ δ for some preselected δ > 0 is generally a bad idea: it slows down the method (it is difficult to choose a good δ) and also destroys the invariance properties of quasi-Newton methods.

• Fast convergence can sometimes conflict with global convergence, e.g. steepest descent is globally convergent but quite slow; Newton's method converges very fast near a solution, but away from the solution its steps may not even be descent directions (indeed, it may be looking for a maximizer!). The challenge is to design algorithms with both fast and global convergence.

Rate of convergence
• Steepest descent: p_k = −∇f_k.
  Th. 3.4: assume f is twice cont. diff. and that the iterates generated by the steepest descent method with exact line searches converge to a point x∗ where the Hessian ∇²f(x∗) is pd. Then f(x_{k+1}) − f(x∗) ≤ r²(f(x_k) − f(x∗)), where r = (λ_n − λ₁)/(λ_n + λ₁) = (κ − 1)/(κ + 1), 0 < λ₁ ≤ ⋯ ≤ λ_n are the eigenvalues of ∇²f(x∗), and κ = λ_n/λ₁ is its condition number.
  (Pf.: near the min., f is approx. quadratic.) (For quadratic functions (with matrix Q): ‖x_{k+1} − x∗‖_Q ≤ r‖x_k − x∗‖_Q, where ‖x‖²_Q = xᵀQx and ½‖x − x∗‖²_Q = f(x) − f(x∗).)

Thus the convergence rate is linear, with two extremes:

– Very well conditioned Hessian: λ1 ≈ λn ; very fast, since the steepest descent direction
approximately points to the minimizer.
– Ill-conditioned Hessian: λ1 ≪ λn ; very slow, zigzagging behaviour. This is the typical
situation in practice.

• Newton's method: p_k = −(∇²f_k)⁻¹∇f_k.
  Th. 3.5: assume f twice diff. and ∇²f Lipschitz cont. in a neighborhood of a solution x∗ at which the second-order sufficient cond. hold. Consider the iterates x_{k+1} = x_k − (∇²f_k)⁻¹∇f_k. Then, if x₀ is sufficiently close to x∗: (x_k) → x∗ and (‖∇f_k‖) → 0, both quadratically.
  (Pf.: upper bound ‖x_k + p_k − x∗‖ by a constant × ‖x_k − x∗‖² using Taylor + Lipschitz.)
  That is, near the solution, where the Hessian is pd, the convergence rate is quadratic if we always take α_k = 1 (no line search at all). The theorem does not apply away from the solution, where the Hessian may not be pd (so the direction may not be descent) or the unit step size may not satisfy the Wolfe cond. or may even increase f; practical Newton methods avoid this.

• Quasi-Newton methods: p_k = −B_k⁻¹∇f_k with B_k symmetric pd.
  Th. 3.7: assume f is twice cont. diff. and that the iterates x_{k+1} = x_k − B_k⁻¹∇f_k converge to a point x∗ at which the second-order sufficient cond. hold. Then (x_k) → x∗ superlinearly iff lim_{k→∞} ‖(B_k − ∇²f(x∗))p_k‖ / ‖p_k‖ = 0.
  Thus, the convergence rate is superlinear if the matrices B_k become increasingly accurate approximations of the Hessian along the search directions p_k, and near the solution the step length α_k is always 1, i.e., x_{k+1} = x_k + p_k. In practice, we must always try α = 1 in the line search and accept it if it satisfies the Wolfe cond.

Newton’s method with Hessian modification
Newton step: solution of the n × n linear system ∇²f(x_k) p_k^N = −∇f(x_k).

• Near a minimizer the Hessian is pd ⇒ quadratic convergence (with unit steps αk = 1).

• Away from a minimizer the Hessian may not be pd or may be close to singular ⇒ p_k^N may be an ascent direction, or too long. A too-long direction may not be good even if it is a descent direction, because it violates the spirit of Newton's method (which relies on a quadratic approximation valid near the current iterate); thus it may require many iterations in the line search.
Modified Newton method : modify the Hessian to make it sufficiently pd.
• We solve (by factorizing Bk , e.g. Cholesky) the system Bk pk = −∇f (xk ) where Bk = ∇2 f (xk )+
Ek (modified Hessian) such that Bk is sufficiently pd. Then we use pk (which is a descent
direction because Bk is pd) in a line search with Wolfe conditions.

• Global convergence (‖∇f_k‖ → 0) by Zoutendijk's condition if κ(B_k) = ‖B_k‖‖B_k⁻¹‖ ≤ M (which can be proven for several types of modifications).

• Quadratic convergence near the minimizer, where the Hessian is pd, if Ek = 0 there.
Diagonal Hessian modification: E_k = λI with λ ≥ 0 large enough that B_k is sufficiently pd: λ = max(0, δ − λ_min(∇²f(x_k))) for some δ > 0.

• We want λ as small as possible to preserve Hessian information along the positive curvature
directions; but if λ is too small, Bk is nearly singular and the step too long.

• The method behaves like pure Newton for pd Hessian and λ = 0, like steepest descent for
λ → ∞, and finds some descent direction for intermediate λ.
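A sketch of this modification in Python/NumPy; it computes λ_min by an eigendecomposition for clarity, whereas a practical implementation would instead detect indefiniteness during the Cholesky factorization:

import numpy as np

def modified_newton_direction(hess, grad, delta=1e-8):
    """Solve (H + lam*I) p = -grad with lam = max(0, delta - lambda_min(H))."""
    lam_min = np.linalg.eigvalsh(hess).min()    # smallest eigenvalue of the Hessian
    lam = max(0.0, delta - lam_min)             # just enough to make B sufficiently pd
    B = hess + lam * np.eye(hess.shape[0])
    return np.linalg.solve(B, -grad)            # descent direction, since B is pd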

Other types of Hessian modification exist, but there is no consensus about which one is best:
• Direct modification of the eigenvalues (needs the spectral decomposition of the Hessian).
  If A = QΛQᵀ (spectral th.), then ΔA = Q diag(max(0, δ − λ_i)) Qᵀ is the correction with minimum Frobenius norm s.t. λ_min(A + ΔA) ≥ δ.

• Modified Cholesky factorization (add terms to diagonal on the fly).

Review: line search (l.s.) methods

• Iteration x_{k+1} = x_k + α_k p_k: the search direction p_k is given by the optimization method; the l.s. determines the step length α_k.

• We want:
  – A descent direction: p_kᵀ∇f_k = ‖p_k‖‖∇f_k‖ cos θ_k < 0. Examples: steepest descent dir. −∇f_k; Newton dir. −(∇²f_k)⁻¹∇f_k; quasi-Newton dir. −B_k⁻¹∇f_k (B_k pd ⇒ descent dir.).
  – An inexact l.s.: approx. solution of min_{α>0} f(x_k + αp_k) (faster convergence of the overall algorithm). Even if the l.s. is inexact, if α_k satisfies certain conditions at each k then the overall algorithm has global convergence.
    Ex.: the Wolfe conditions (others exist), most crucially the sufficient decrease in f. A simple l.s. algorithm that often (not always) satisfies the Wolfe cond. is backtracking (better ones exist).
• Global convergence (to a stationary point): ‖∇f_k‖ → 0.
  Zoutendijk's th.: descent dir. + Wolfe + mild cond. on f ⇒ Σ_{k≥0} cos²θ_k ‖∇f_k‖² < ∞.
  Corollary: cos θ_k ≥ δ > 0 ∀k ⇒ global convergence. But we often want cos θ_k ≈ 0!
  Ex.: steepest descent and some Newton-like methods have global convergence.
• Convergence rate:
  – Steepest descent: linear, with rate r² where r = (λ_n − λ₁)/(λ_n + λ₁); slow for ill-conditioned problems.
  – Quasi-Newton: superlinear under certain conditions.
  – Newton: quadratic near the solution.
• Modified Hessian Newton’s method: Bk = ∇2 f (xk ) + λI (diagonal modif., others exist) s.t. Bk suffic. pd.
λ = 0: pure Newton step; λ → ∞: steepest descent direction (with step → 0).
Descent direction with moderate length away from minimizer, pure Newton’s method near minimizer.
Global, quadratic convergence if κ(Bk ) ≤ M and l.s. with Wolfe cond.

4 Trust region methods
Iteration: x_{k+1} = x_k + p_k, where p_k is the approximate minimizer of the model m_k(p) in a region around x_k (the trust region); if p_k does not produce a sufficient decrease in f, we shrink the region and try again.

• The trust region is typically a ball B(xk ; ∆).


Other region shapes may also be used: elliptical (to adapt to ill-conditioned Hessians) and box-shaped (with linear constraints.)

• Each time we decrease ∆ after failure of a candidate iterate, the step from x_k is shorter and usually points in a different direction.
• Trade-off in ∆: too small ⇒ good model, but we can only take a small step, so slow convergence; too large ⇒ bad model, we may have to reduce ∆ and repeat.
  In practice, we increase ∆ if previous steps showed the model reliable (fig. 4.1).

• Linear model: m_k(p) = f_k + ∇f_kᵀp s.t. ‖p‖ ≤ ∆_k ⇒ p_k = −∆_k ∇f_k/‖∇f_k‖, i.e., steepest descent with step length α_k given by ∆_k (no news). (✐ what is the approx. error for the linear model?)

• Quadratic model: m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p, where B_k is symmetric but need not be psd.
  The approximation error is O(‖p‖³) if B_k = ∇²f_k (trust-region Newton method), O(‖p‖²) otherwise.
  In both cases, the model is accurate for small ‖p‖, which guarantees we can always find a good step for sufficiently small ∆. (✐ what happens if ρ_k < 0 but ‖p_k‖ < ∆_k with an arbitrary m_k?)

Two issues remain: how to choose ∆k ? How to find pk ?

Choice of the trust-region radius ∆_k. Define the ratio ρ_k = (f_k − f(x_k + p_k)) / (m_k(0) − m_k(p_k)) = actual reduction / predicted reduction.

• Predicted reduction ≥ 0 always (since p = 0 is in the region).
• If the actual reduction < 0, the new objective value is larger, so reject the step.
• ρ_k:
  – ≈ 1: good agreement between f and the model m_k, so expand ∆_{k+1} > ∆_k if ‖p_k‖ = ∆_k (otherwise, don't interfere);
  – > 0 but not close to 1: keep ∆_{k+1} = ∆_k;
  – close to 0 or negative: shrink ∆_{k+1} < ∆_k.
(Algorithm 4.1.)
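A sketch of the resulting loop in Python/NumPy. The thresholds 1/4 and 3/4, the shrink/expand factors, and the acceptance parameter η are typical values; solve_subproblem stands for any of the approximate subproblem solvers discussed below (Cauchy point, dogleg, ...) and is passed in as a callable:

import numpy as np

def trust_region(f, grad_f, hess_f, x, solve_subproblem,
                 delta=1.0, delta_max=100.0, eta=0.15, tol=1e-6):
    """Basic trust-region loop: accept/reject steps and adapt the radius from rho_k."""
    while np.linalg.norm(grad_f(x)) > tol:
        g, B = grad_f(x), hess_f(x)
        p = solve_subproblem(g, B, delta)       # approx. minimizer of m_k over ||p|| <= delta
        pred = -(g @ p + 0.5 * p @ B @ p)       # predicted reduction m_k(0) - m_k(p) >= 0
        rho = (f(x) - f(x + p)) / pred          # actual / predicted reduction
        if rho < 0.25:
            delta *= 0.25                       # poor agreement: shrink the region
        elif rho > 0.75 and np.linalg.norm(p) >= 0.99 * delta:
            delta = min(2.0 * delta, delta_max) # good model, step at the boundary: expand
        if rho > eta:                           # sufficient decrease: accept the step
            x = x + p
    return x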

The optimization subproblem. A quadratically-constrained quadratic program:

    min_{p ∈ ℝⁿ} m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p   s.t.   ‖p‖ ≤ ∆_k

• If B_k is pd and ‖B_k⁻¹∇f_k‖ ≤ ∆_k, the solution is the unconstrained minimizer p_k = −B_k⁻¹∇f_k (the full step).

• Otherwise, we compute an approximate solution (finding an exact one is too costly).

Characterization of the exact solution of the optimization subproblem
Th. 4.1. p∗ is a global solution of the trust-region problem min_{‖p‖≤∆} m(p) = f + gᵀp + ½ pᵀBp iff p∗ is feasible and ∃λ ≥ 0 such that:
1. (B + λI)p∗ = −g
2. λ(∆ − ‖p∗‖) = 0 (i.e., λ = 0 or ‖p∗‖ = ∆)
3. B + λI is psd.
Note:
• Conditions 1 and 2 follow from the KKT conditions (th. 12.1) for a local solution, where λ is
the Lagrange multiplier. Condition 3 holds for a global solution.
• Using B + λI instead of B in the model transforms the problem into min_p m(p) + (λ/2)‖p‖², and so for large λ > 0 the minimizer is strictly inside the region. As we decrease λ, the minimizer moves to the region boundary, and the theorem holds for that λ.
• If λ > 0 then the direction is antiparallel to the model gradient and so the region is tangent to
the model contour at the solution: λp∗ = −g − Bp∗ = −∇m(p∗ ).
• This is useful for Newton’s method and is the basis of the Levenberg-Marquardt algorithm for
nonlinear least-squares problems.

Approximate solution of the optimization subproblem. Several approaches:

• The Cauchy point p_k^C is the minimizer of m_k along p_k = −∇m_k(0) = −g_k (eq. (4.12), fig. 4.3), which lies either at −(∆_k/‖g_k‖)g_k (boundary) or at −τ_k(∆_k/‖g_k‖)g_k with 0 < τ_k < 1 (interior). So it is steepest descent with a certain step size. It gives a baseline solution; several methods improve over the Cauchy point.
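A sketch of the Cauchy-point computation (eq. (4.12)) in Python/NumPy; it can be passed as solve_subproblem in the trust-region sketch above:

import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy point: minimizer of the quadratic model along -g within the region."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0:
        tau = 1.0                                   # nonpositive curvature: go to the boundary
    else:
        tau = min(gnorm ** 3 / (delta * gBg), 1.0)  # 1D minimizer along -g, clipped at 1
    return -tau * (delta / gnorm) * g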
• Iterative solution of the subproblem (based on the characterization):
  1. Try λ = 0, solve Bp∗ = −g and see if ‖p∗‖ ≤ ∆ (full step).
  2. If ‖p∗‖ > ∆, define p(λ) = −(B + λI)⁻¹g for λ sufficiently large that B + λI is pd, and seek a smaller value λ > 0 such that ‖p(λ)‖ = ∆ (1D root-finding for λ; iterative solution factorizing the matrix B + λI) (fig. 4.5).
     B = QΛQᵀ (spectral th. with λ_n ≥ ⋯ ≥ λ₁) ⇒ p(λ) = −(B + λI)⁻¹g = −Σ_{j=1}^n (q_jᵀg)/(λ_j + λ) q_j ⇒ ‖p(λ)‖² = Σ_{j=1}^n (q_jᵀg)²/(λ_j + λ)². If q₁ᵀg ≠ 0, find λ∗ > −λ₁ using Newton's method for root finding on r(λ) = 1/∆ − 1/‖p(λ)‖ (since 1/‖p(λ)‖ ≈ (λ + λ₁)/constant). One can show this is equivalent to Alg. 4.3, which uses Cholesky factorizations (limit to ∼ 3 steps).

• Dogleg method (for pd B): find the minimizer along the two-leg path 0 → p_k^C → p_k^B within the trust region (fig. 4.4), where p_k^B is the full step.
  m(p) decreases along the dogleg path (lemma 4.2). The minimizer results from a 2nd-degree polynomial.
• Two-dimensional subspace minimization: minimize over the span of the dogleg path (eq. 4.17).
  The minimizer results from a 4th-degree polynomial.
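A sketch of the dogleg step in Python/NumPy (pd B assumed); the boundary intersection is the positive root of the 2nd-degree polynomial mentioned above. Like cauchy_point, it matches the solve_subproblem signature of the earlier trust-region sketch:

import numpy as np

def dogleg(g, B, delta):
    """Dogleg step for pd B: from 0 along -g to p^U, then toward the full step p^B."""
    pB = np.linalg.solve(B, -g)                    # full step
    if np.linalg.norm(pB) <= delta:
        return pB                                  # full step lies inside the region
    pU = -((g @ g) / (g @ B @ g)) * g              # unconstrained minimizer along -g
    if np.linalg.norm(pU) >= delta:
        return -(delta / np.linalg.norm(g)) * g    # first leg already reaches the boundary
    d = pB - pU                                    # second leg: p(t) = pU + t*d, t in [0, 1]
    a, b, c = d @ d, 2.0 * (pU @ d), pU @ pU - delta ** 2
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)  # positive root of ||pU + t*d|| = delta
    return pU + t * d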

Global and local convergence


• Under certain assumptions, these approximate algorithms have global convergence to a stationary point (Th. 4.6).
  Essentially, they must ensure a sufficient decrease m_k(p_k) ≤ m_k(0) (by a fraction of the decrease achieved by the Cauchy point) at each step; the Cauchy point itself already achieves such a decrease.

• If using Bk = ∇2 fk and if the region becomes eventually inactive and we always take the full
step, the convergence is quadratic (the method becomes Newton’s method).

Review: trust-region methods
• Iteration xk+1 = xk + pk

– p_k = approx. minimizer of the model m_k of f in the trust region: min_{‖p‖≤∆_k} m_k(p).


– p_k does not produce sufficient decrease ⇒ region too big, shrink it and try again.
  Insufficient decrease ⇔ ρ_k = (f_k − f(x_k + p_k)) / (m_k(0) − m_k(p_k)) ≲ 0.

• Quadratic model: m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p. B_k need not be psd.


Cauchy point: minimizer of m_k within the trust region along the steepest descent direction.
The exact, global minimizer of m_k within the trust region ‖p‖ ≤ ∆_k satisfies certain conditions (th. 4.1) that can be used to find an approximate solution. Simpler methods exist (dogleg, 2D subspace min.).
• Global convergence under mild conditions if sufficient decrease; quadratic rate if Bk = ∇2 fk and if the region
becomes eventually inactive and we always take the full step.
• Mainly used for Newton and Levenberg-Marquardt methods.

5 Conjugate gradient methods
• Linear conjugate gradient method: solves a large linear system of equations.

• Nonlinear conjugate gradient method: adaptation of linear CG for nonlinear optimization.

Key features: requires no matrix storage, faster than steepest descent.


Assume in all this chapter that A is an n × n symmetric pd matrix, φ(x) = ½ xᵀAx − bᵀx and ∇φ(x) = Ax − b = r(x) (the residual).
Idea (number of iterations each method needs to minimize the quadratic φ):
• steepest descent: 1 or ∞ iterations
• coordinate descent: n or ∞ iterations
• Newton's method: 1 iteration
• CG: n iterations.

The linear conjugate gradient method


Iterative method for solving the two equivalent problems (i.e., both have the same, unique solution x∗):

    Linear system Ax = b   ⇐⇒   Optimization problem min φ(x) = ½ xᵀAx − bᵀx.
• A set of nonzero vectors {p₀, p₁, ..., p_l} is conjugate wrt A iff p_iᵀAp_j = 0 ∀i ≠ j.
  Conjugacy ⇒ linear independence (Pf.: left-multiply Σ σ_i p_i by p_jᵀA).
  Note {A^{1/2}p₀, A^{1/2}p₁, ..., A^{1/2}p_l} are orthogonal.

• Th. 5.1: we can minimize φ in n steps at most by successively minimizing φ along the n vectors
in a conjugate set.
Conjugate direction method: given a starting point x₀ ∈ ℝⁿ and a set of n conjugate directions {p₀, ..., p_{n−1}}, generate the sequence {x_k} with x_{k+1} = x_k + α_k p_k, where α_k = −r_kᵀp_k / (p_kᵀAp_k) (exact line search) (✐ denominator ≠ 0?). (Proof: x∗ = x₀ + Σ_{i=0}^{n−1} α_i p_i.)

• Intuitive idea:
  – A diagonal: the quadratic function φ can be minimized along the coordinate directions e₁, ..., e_n in n iterations (fig. 5.1).
  – A not diagonal: the coordinate directions don't minimize φ in n iterations (fig. 5.2); but the variable change x̂ = S⁻¹x with S = (p₀ p₁ ... p_{n−1}) diagonalizes A: φ̂(x̂) = φ(Sx̂) = ½ x̂ᵀ(SᵀAS)x̂ − (Sᵀb)ᵀx̂. (✐ why is S invertible? why is SᵀAS diagonal?)
    Coordinate search in x̂ ⇔ conjugate direction search in x.

• Th. 5.2 (expanding subspace minimization): for the conjugate direction method:
  – r_kᵀp_i = 0 for i = 0, ..., k−1 (the current residual is ⊥ to all previous search directions).
    (Intuition: if r_k = ∇φ(x_k) had a nonzero projection along p_i, it would not be a minimum.)
  – x_k is the minimizer of φ over the set x₀ + span{p₀, ..., p_{k−1}}.
  That is, the method minimizes φ piecewise, one direction at a time.
  (Proof: induction plus the fact r_{k+1} = r_k + α_k Ap_k (implied by r_k = Ax_k − b and x_{k+1} = x_k + α_k p_k).)

• How to obtain conjugate directions? Many ways, e.g. using the eigenvectors of A or transform-
ing a set of l.i. vectors into conjugate directions with a procedure similar to Gram-Schmidt.
But these are computationally expensive!
• The conjugate gradient method generates the conjugate direction p_k by using only the previous one, p_{k−1}:
  – p_k is a l.c. of −∇φ(x_k) and p_{k−1} s.t. being conjugate to p_{k−1} ⇒ p_k = −r_k + β_k p_{k−1} with β_k = r_kᵀAp_{k−1} / (p_{k−1}ᵀAp_{k−1}).
  – We start with the steepest descent direction: p₀ = −∇φ(x₀) = −r₀.

Algorithm 5.1 (CG; preliminary version): given x₀:

  r₀ ← ∇φ(x₀) = Ax₀ − b, p₀ ← −r₀, k ← 0        start with steepest descent dir. from x₀
  while r_k ≠ 0                                  r_k = 0 means we are done, which may happen before n steps
    α_k ← −r_kᵀp_k / (p_kᵀAp_k), x_{k+1} ← x_k + α_k p_k        exact line search
    r_{k+1} ← Ax_{k+1} − b                       new residual
    β_{k+1} ← r_{k+1}ᵀAp_k / (p_kᵀAp_k), p_{k+1} ← −r_{k+1} + β_{k+1} p_k   new direction, conjugate to p_k, ..., p₀
    k ← k + 1
  end

To prove the algorithm works, we need to prove it builds a conjugate direction set.
• Th. 5.3: suppose that the kth iterate of the CG method is not the solution x∗. Then:
  – r_kᵀr_i = 0 for i = 0, ..., k−1 (the gradients at all iterates are ⊥ to each other).
  – span{r₀, ..., r_k} = span{p₀, ..., p_k} = span{r₀, Ar₀, ..., A^k r₀} = Krylov subspace of degree k for r₀. (So {r_k} is an orthogonal basis, {p_k} a basis.)
    (Intuitive explanation: compute r_k, p_k for k = 1, 2 using r_{k+1} = r_k + α_k Ap_k, p_{k+1} = −r_{k+1} + β_{k+1} p_k.)
  – p_kᵀAp_i = 0 for i = 0, ..., k−1 (conjugate wrt A).
  Thus the sequence {x_k} converges to x∗ in at most n steps.
  Important: the theorem needs the first direction to be the steepest descent direction.
  (Proof: r_kᵀr_i = p_kᵀAp_i = 0 follow for i = k−1 by construction and for i < k−1 by induction.)

• We can simplify the algorithm a bit using the following results:
  – p_{k+1} = −r_{k+1} + β_{k+1} p_k (construction of the kth direction)
  – r_{k+1} = r_k + α_k Ap_k (definition of r_k and x_{k+1})
  – r_kᵀp_i = r_kᵀr_i = 0 for i < k (th. 5.2 & 5.3)

  ⇒ α_k ← r_kᵀr_k / (p_kᵀAp_k) = ‖r_k‖²/‖p_k‖²_A,  r_{k+1} ← r_k + α_k Ap_k,  β_{k+1} ← r_{k+1}ᵀr_{k+1} / (r_kᵀr_k) = ‖r_{k+1}‖²/‖r_k‖²  in algorithm 5.2.
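Algorithm 5.2 then takes the following form; a sketch in Python/NumPy:

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Linear CG (algorithm 5.2) for symmetric pd A: solves Ax = b."""
    x = x0.astype(float).copy()
    r = A @ x - b                     # residual r = grad phi(x)
    p = -r                            # first direction: steepest descent
    rr = r @ r
    while np.sqrt(rr) > tol:
        Ap = A @ p                    # the only matrix-vector product per iteration
        alpha = rr / (p @ Ap)         # exact line search along p
        x = x + alpha * p
        r = r + alpha * Ap            # r_{k+1} = r_k + alpha_k A p_k
        rr_new = r @ r
        p = -r + (rr_new / rr) * p    # beta_{k+1} = ||r_{k+1}||^2 / ||r_k||^2
        rr = rr_new
    return x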

• Space complexity: O(n), since it computes x, r, p at step k+1 given only their values at step k: no matrix storage.
• Time complexity: the bottleneck is the matrix-vector product Ap_k, which is O(n²) (maybe less if A has structure) ⇒ in n steps, O(n³), similar to other methods for solving linear systems (e.g. Gaussian elimination).

• Advantages: no matrix storage; does not alter A; does not introduce fill (for a sparse matrix
A); fast convergence.
• Disadvantages: sensitive to roundoff errors.
It is recommended for large systems.

Rate of convergence
• Here we don't mean the asymptotic rate (k → ∞), because CG converges in at most n steps for a quadratic function. But CG can get very close to the solution in far fewer than n steps, depending on the eigenvalue structure of A:
  – Th. 5.4: if A has only r distinct eigenvalues, CG converges in at most r steps.
  – If the eigenvalues of A occur in r distinct clusters, CG will approximately solve the problem in r steps (fig. 5.4).

• Two bounds (using ‖x‖²_A = xᵀAx), useful to estimate the convergence rate in advance if we know something about the eigenvalues of A:
  – Th. 5.5: ‖x_{k+1} − x∗‖²_A ≤ ((λ_{n−k} − λ₁)/(λ_{n−k} + λ₁))² ‖x₀ − x∗‖²_A if A has eigenvalues λ₁ ≤ ⋯ ≤ λ_n.
  – ‖x_k − x∗‖_A ≤ 2((√κ − 1)/(√κ + 1))^k ‖x₀ − x∗‖_A if κ = λ_n/λ₁ is the c.n. (this bound is very coarse).

  Recall that for steepest descent we had a similar expression but with (κ − 1)/(κ + 1) instead of (√κ − 1)/(√κ + 1).

• Preconditioning: change of variables x̂ = Cx so that the new matrix  = C−T AC−1 has a
clustered spectrum or a small condition number (thus faster convergence). Besides being effec-
tive in this sense, a good preconditioner C should take little storage and allow an inexpensive
solution of Cx = x̂. Finding good preconditioners C depends on the problem (the structure of
A), e.g. good ones exist when A results from a discretized PDE.
The preconditioner can be integrated in a convenient way in the CG algorithm. alg. 5.3

Nonlinear conjugate gradient methods


We adapt the linear CG (which minimizes a quadratic function φ) to a nonlinear function f: α_k is determined by an inexact line search, and r_k = ∇f(x_k).

The Fletcher-Reeves method
Algorithm 5.4: given x₀:

  Evaluate f₀ ← f(x₀), ∇f₀ ← ∇f(x₀)
  p₀ ← −∇f₀, k ← 0
  while ∇f_k ≠ 0
    x_{k+1} ← x_k + α_k p_k                      with inexact l.s. for α_k
    Evaluate ∇f_{k+1}
    β_{k+1}^FR ← ∇f_{k+1}ᵀ∇f_{k+1} / (∇f_kᵀ∇f_k),  p_{k+1} ← −∇f_{k+1} + β_{k+1}^FR p_k
    k ← k + 1
  end
• Uses no matrix operations, requires only f and ∇f .
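A sketch of nonlinear CG (with the PR variant below as an option, and restarts every n steps) in Python/NumPy; line_search is an assumed helper implementing an inexact l.s. that satisfies the strong Wolfe conditions (e.g. scipy.optimize.line_search could be wrapped for this purpose):

import numpy as np

def nonlinear_cg(f, grad_f, x0, line_search, variant="PR", tol=1e-6, max_iter=10000):
    """Nonlinear CG (FR or PR) with restarts every n steps."""
    x = x0.astype(float).copy()
    g = grad_f(x)
    p = -g                                    # initial direction: steepest descent
    n = x.size
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = line_search(f, grad_f, x, p)  # assumed helper: strong Wolfe l.s.
        x = x + alpha * p
        g_new = grad_f(x)
        if variant == "FR":
            beta = (g_new @ g_new) / (g @ g)
        else:                                 # Polak-Ribiere
            beta = (g_new @ (g_new - g)) / (g @ g)
        if (k + 1) % n == 0:
            beta = 0.0                        # restart: take a steepest descent step
        p = -g_new + beta * p
        g = g_new
    return x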

• Line search for α_k: we need each direction p_{k+1} = −∇f_{k+1} + β_{k+1}^FR p_k to be a descent direction, i.e., ∇f_{k+1}ᵀp_{k+1} = −‖∇f_{k+1}‖² + β_{k+1}^FR ∇f_{k+1}ᵀp_k < 0.
  – Exact l.s.: α_k is a local minimizer along p_k ⇒ ∇f_{k+1}ᵀp_k = 0 ⇒ p_{k+1} is descent.
  – Inexact l.s.: p_{k+1} is descent if α_k satisfies the strong Wolfe conditions (lemma 5.6):

        f(x_k + α_k p_k) ≤ f(x_k) + c₁α_k∇f_kᵀp_k
        |∇f(x_k + α_k p_k)ᵀp_k| ≤ c₂|∇f_kᵀp_k|

    where 0 < c₁ < c₂ < ½ (note we required a looser 0 < c₁ < c₂ < 1 in ch. 3).

The Polak-Ribière method and other variants


    β_{k+1}^FR (Fletcher-Reeves)    = ∇f_{k+1}ᵀ∇f_{k+1} / (∇f_kᵀ∇f_k)
    β_{k+1}^PR (Polak-Ribière)      = ∇f_{k+1}ᵀ(∇f_{k+1} − ∇f_k) / ‖∇f_k‖²
    β_{k+1}^HS (Hestenes-Stiefel)   = ∇f_{k+1}ᵀ(∇f_{k+1} − ∇f_k) / ((∇f_{k+1} − ∇f_k)ᵀp_k)
    β_{k+1}^other (other variant)   = ‖∇f_{k+1}‖² / ((∇f_{k+1} − ∇f_k)ᵀp_k)

• We can define β_{k+1} in other ways that also generalize the quadratic case: for quadratic functions with a pd Hessian and exact l.s., we have β_k^FR = β_k^PR = β_k^HS = β_k^other = the linear-CG value (since the successive gradients are mutually ⊥).

• PR is the variant of choice in practice:

– For nonlinear functions in general, with inexact l.s., PR is empirically more robust and
efficient than FR.
– Yet, the strong Wolfe conditions don’t guarantee that pk is a descent direction.
– PR needs a good l.s. to do well.

Restarts. Restarting the iteration every n steps (by setting β_k = 0, i.e., taking a steepest descent step) periodically refreshes the algorithm and works well in practice. It leads to n-step quadratic convergence: ‖x_{k+n} − x∗‖ ≤ M‖x_k − x∗‖²; intuitively because near the minimum f is approx. quadratic, so after a restart we will have (approximately) the linear CG method (which requires p₀ = steepest descent direction).
For large n (when CG is most useful), restarts may never occur, since an approximate solution may be found in fewer than n steps.

Global convergence

• With restarts and the strong Wolfe conditions, the algorithms (FR, PR) have global convergence
since they include as a subsequence the steepest descent method (which is globally convergent
with the Wolfe conditions).

• Without restarts:

– FR has global convergence with the strong Wolfe conditions above.


– PR does not have global convergence, even though in practice it is better.

• In general, the theory on the rate of convergence of CG is complex and assumes exact l.s.

Review: conjugate gradient methods
• Linear CG: A (n × n) sym. pd; solves Ax = b ⇔ min φ(x) = ½ xᵀAx − bᵀx.
  – {p₀, ..., p_{n−1}} conjugate wrt A ⇔ p_iᵀAp_j = 0 ∀i ≠ j, p_i ≠ 0 ∀i.
  – Finds the solution in at most n steps, each an exact line search along a conjugate direction: x_{k+1} = x_k + α_k p_k, α_k = r_kᵀr_k / (p_kᵀAp_k), r_k = ∇φ(x_k) = Ax_k − b.
  – At each step, x_k is the minimizer over the set x₀ + span{p₀, ..., p_{k−1}}; r_{k+1} = r_k + α_k Ap_k; and r_kᵀp_i = r_kᵀr_i = 0 ∀i < k.
  – Conjugate direction p_k is obtained from the previous one and the current gradient: p_k = −r_k + β_k p_{k−1} with β_k = r_kᵀr_k / (r_{k−1}ᵀr_{k−1}).
  – The initial direction is the steepest descent direction p₀ = −∇φ(x₀).


• Space complexity O(n), time complexity O(n³).
  But often (e.g. when the eigenvalues of A are clustered, or A has a low c.n.) it gets very close to the solution in r ≪ n steps, so O(rn²).
• Nonlinear CG: solves min f(x) where f is nonlinear in general.
  – Fletcher-Reeves: r_k = ∇f(x_k), α_k is determined by an inexact l.s. satisfying the strong Wolfe conditions, β_{k+1}^FR = ∇f_{k+1}ᵀ∇f_{k+1} / (∇f_kᵀ∇f_k).
    Has global convergence, but to work well in practice it needs restarts (i.e., set β_k = 0 every n steps).
  – Polak-Ribière: like FR but β_{k+1}^PR = ∇f_{k+1}ᵀ(∇f_{k+1} − ∇f_k) / (∇f_kᵀ∇f_k).
    Works better than FR in practice, even though it has no global convergence guarantee.
• In summary: better method than steepest descent, very useful for large n (little storage).

6 Quasi-Newton methods
• Like Newton’s method but using a certain Bk instead of the Hessian.
Like steepest descent and conjugate gradients, they require only the gradient.
By measuring the changes in gradients over iterations, they construct an approximation Bk to
the Hessian whose accuracy improves gradually and results in superlinear convergence.
• Quadratic model of the objective function (correct to first order):

      m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p

  where B_k is symmetric pd and is updated at every iteration.
• Search direction given by the minimizer of m_k: p_k = −B_k⁻¹∇f_k (a descent direction, since B_k is pd).

• Line search x_{k+1} = x_k + α_k p_k with step length chosen to satisfy the Wolfe conditions.

The secant equation


Secant equation: B_{k+1}s_k = y_k, where s_k = x_{k+1} − x_k and y_k = ∇f_{k+1} − ∇f_k.
• It can be derived:
– From Taylor’s th.: ∇2 fk+1 (xk+1 − xk ) ≈ ∇fk+1 − ∇fk (exact if f is quadratic).

– Or by requiring the gradients of the quadratic model mk+1 to agree with those of f at
xk+1 and xk :
∇mk+1 = ∇f at xk+1 : by construction
∇mk+1 = ∇f at xk : ∇mk+1 (xk − xk+1 ) = ∇fk+1 + Bk+1 (xk − xk+1 ) = ∇fk .

• It implicitly requires that s_kᵀy_k = s_kᵀB_{k+1}s_k > 0 (a curvature condition). If f is strongly convex, this is guaranteed (because s_kᵀy_k > 0 for any two points x_k and x_{k+1}; proof: exe. 6.1). Otherwise, it is guaranteed if the line search verifies the 2nd Wolfe condition ∇f(x_k + α_k p_k)ᵀp_k ≥ c₂∇f_kᵀp_k, 0 < c₂ < 1 (proof: 2nd Wolfe ⇔ ∇f_{k+1}ᵀs_k ≥ c₂∇f_kᵀs_k ⇔ y_kᵀs_k ≥ (c₂ − 1)α_k∇f_kᵀp_k > 0). Or, in particular, if the l.s. is exact.

The secant equation provides only n constraints for the n² dof in the new matrix, so it has many solutions (even with the constraint that it be symmetric pd). We choose the solution closest to the current matrix B_k: min_B ‖B − B_k‖ s.t. B symmetric pd, Bs_k = y_k. Different choices of norm are possible; one that allows an easy solution and gives rise to scale invariance is the weighted Frobenius norm ‖A‖_W = ‖W^{1/2}AW^{1/2}‖_F (where ‖A‖²_F = Σ_{ij} a²_ij). W is any matrix satisfying Wy_k = s_k (thus the norm is adimensional, i.e., the solution doesn't depend on the units of the problem).

The DFP method (Davidon-Fletcher-Powell)


If we choose W⁻¹ = ∫₀¹ ∇²f(x_k + τα_k p_k) dτ (the average Hessian) then the minimizer is unique and is the following rank-2 update:

DFP:   B_{k+1} = (I − γ_k y_k s_kᵀ) B_k (I − γ_k s_k y_kᵀ) + γ_k y_k y_kᵀ   with γ_k = 1/(y_kᵀs_k)   (y_kᵀs_k > 0 from the curvature cond.)
       H_{k+1} = H_k − (H_k y_k y_kᵀ H_k)/(y_kᵀH_k y_k) + (s_k s_kᵀ)/(y_kᵀs_k)   (Pf.: SMW formula)

where H_k = B_k⁻¹. Using B_k directly requires solving p_k = −B_k⁻¹∇f_k, which is O(n³), while using H_k gives us p_k = −H_k∇f_k, which is O(n²).

The BFGS method (Broyden-Fletcher-Goldfarb-Shanno)
We apply the conditions to H_{k+1} = B_{k+1}⁻¹ rather than to B_{k+1}:

    H_{k+1} = arg min_H ‖H − H_k‖   s.t.   H symmetric pd, Hy_k = s_k

with the same norm as before, where now Ws_k = y_k. For W = the average Hessian we obtain:

BFGS:  H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ   with ρ_k = 1/(y_kᵀs_k)
       B_{k+1} = B_k − (B_k s_k s_kᵀ B_k)/(s_kᵀB_k s_k) + (y_k y_kᵀ)/(y_kᵀs_k)   (Pf.: SMW formula)

We have: H_k pd ⇒ H_{k+1} pd (proof: zᵀH_{k+1}z > 0 if z ≠ 0). We take the initial matrix as H₀ = I for lack of better knowledge. This means BFGS may make slow progress in the first iterations, while information is being built into H_k.

• For quadratic f and if an exact line search is performed, then DFP, BFGS, SR1 converge to the exact minimizer in n steps and H_n = (∇²f)⁻¹. (✐ Why do many methods work well with quad. f?)

• BFGS is the best quasi-Newton method. With an adequate line search (e.g. Wolfe conditions),
BFGS has effective self-correcting properties (and DFP is not so effective): a poor approxima-
tion to the Hessian will be improved in a few steps, thus being stable wrt roundoff error.

Algorithm 6.1 (BFGS): given starting point x₀, convergence tolerance ε > 0:

  H₀ ← I, k ← 0
  while ‖∇f_k‖ > ε
    p_k ← −H_k∇f_k                                         search direction
    x_{k+1} ← x_k + α_k p_k                                line search with Wolfe cond.
    s_k ← x_{k+1} − x_k, y_k ← ∇f_{k+1} − ∇f_k, H_{k+1} ← BFGS update   update inverse Hessian
    k ← k + 1
  end
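A sketch of Algorithm 6.1 in Python/NumPy; line_search is again an assumed helper implementing an inexact Wolfe l.s. that tries α = 1 first:

import numpy as np

def bfgs(f, grad_f, x0, line_search, tol=1e-6):
    """BFGS (algorithm 6.1): maintains the inverse Hessian approximation H."""
    n = x0.size
    x, H = x0.astype(float).copy(), np.eye(n)
    g = grad_f(x)
    while np.linalg.norm(g) > tol:
        p = -H @ g                               # search direction, O(n^2)
        alpha = line_search(f, grad_f, x, p)     # assumed helper: Wolfe l.s., try alpha = 1 first
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        rho = 1.0 / (y @ s)                      # y^T s > 0 under the Wolfe conditions
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS update of the inverse Hessian
        x, g = x_new, g_new
    return x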

• Always try α_k = 1 first in the line search (this step will always be accepted eventually).
  Empirically, good values for c₁, c₂ in the Wolfe conditions are c₁ = 10⁻⁴, c₂ = 0.9.

• Cost per iteration:

– Space: O(n2 ) matrix storage. For large problems, techniques exist to modify the method
to take less space, though converging more slowly (see ch. 7).
– Time: O(n2 ) matrix × vector, outer products.

• Global convergence if Bk have a bounded condition number + Wolfe conditions (see ch. 3);
but in practice this assumption may not hold. There aren’t truly global convergence results,
though the methods are very robust in practice.

• Local convergence: if BFGS converges, then its order is superlinear under mild conditions.

                              Newton                Quasi-Newton
  Convergence rate            quadratic             superlinear
  Cost per iteration (time)   O(n³) linear system   O(n²) matrix × vector
  ∇²f required                yes                   no

The SR1 method (symmetric rank-1)
By requiring B_{k+1} = B_k + σvvᵀ, where σ = ±1 and v is a nonzero vector, and substituting in the secant eq., we obtain:

SR1:   B_{k+1} = B_k + ((y_k − B_k s_k)(y_k − B_k s_k)ᵀ) / ((y_k − B_k s_k)ᵀs_k)
       H_{k+1} = H_k + ((s_k − H_k y_k)(s_k − H_k y_k)ᵀ) / ((s_k − H_k y_k)ᵀy_k)

Generates very good Hessian approximations, often better than BFGS's (indeed BFGS only produces pd B_k), but:

• Does not necessarily preserve pd ⇒ use in a trust-region (not l.s.) framework.
• May not satisfy the secant equation y_k = B_{k+1}s_k, s_k = H_{k+1}y_k if (y_k − B_k s_k)ᵀs_k = 0 ⇒ skipping the update if the denominator is small works well in practice:
  if |s_kᵀ(y_k − B_k s_k)| < r‖s_k‖‖y_k − B_k s_k‖ then B_{k+1} = B_k, else B_{k+1} = SR1 update [use r ∼ 10⁻⁸].
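A sketch of the SR1 update, applied to the inverse Hessian H_k with the analogous skipping rule for that form, in Python/NumPy:

import numpy as np

def sr1_update(H, s, y, r=1e-8):
    """SR1 update of the inverse Hessian approximation, with the skipping rule."""
    v = s - H @ y                                  # s_k - H_k y_k
    denom = v @ y                                  # (s_k - H_k y_k)^T y_k
    if abs(denom) < r * np.linalg.norm(y) * np.linalg.norm(v):
        return H                                   # denominator tiny: skip the update
    return H + np.outer(v, v) / denom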

The Broyden class

    B_{k+1} = (1 − φ_k) B_{k+1}^BFGS + φ_k B_{k+1}^DFP,   φ_k ∈ ℝ.

• generalizes BFGS, DFP and SR1

• symmetric

• preserves pd for φk ∈ [0, 1]

• satisfies the secant equation.

Review: quasi-Newton methods
• Newton’s method with an approximate Hessian:

– Quadratic model of objective function f: m_k(p) = f_k + ∇f_kᵀp + ½ pᵀB_k p.
– Search direction is the minimizer of m_k: p_k = −B_k⁻¹∇f_k.
– Inexact line search with Wolfe conditions; always try αk = 1 first.
– Bk is symmetric pd (for DFP/BFGS) and is updated at each k given the current and previous gradients,
so that it approximates ∇2 fk .
• Idea: ∇²f_{k+1}(x_{k+1} − x_k) ≃ ∇f_{k+1} − ∇f_k (by Taylor's th.), i.e., ∇²f_{k+1}s_k ≃ y_k.
  Secant equation: B_{k+1}s_k = y_k; implies ∇m_{k+1} = ∇f at x_k and x_{k+1}.
• DFP method (Davidon-Fletcher-Powell): B_{k+1} satisfies the secant equation and is closest to B_k (in a precise sense) ⇒ rank-2 update

      H_{k+1} (= B_{k+1}⁻¹) = H_k − (H_k y_k y_kᵀ H_k)/(y_kᵀH_k y_k) + (s_k s_kᵀ)/(y_kᵀs_k)   ⇒   p_k = −H_k∇f_k in O(n²).

• BFGS method (Broyden-Fletcher-Goldfarb-Shanno): H_{k+1} (= B_{k+1}⁻¹) satisfies the secant eq. and is closest to H_k (in a precise sense) ⇒ rank-2 update

      H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ   with ρ_k = 1/(y_kᵀs_k).

  Use H₀ = I. Best quasi-Newton method, self-correcting properties.


• SR1 method (symmetric rank-1): rank-1 update, H_{k+1} not necessarily pd, so use with trust region:

      H_{k+1} = H_k + ((s_k − H_k y_k)(s_k − H_k y_k)ᵀ) / ((s_k − H_k y_k)ᵀy_k).

• Global convergence: no general results, though the methods are robust in practice.
• Convergence rate: superlinear.

                               Newton      Quasi-Newton
  Convergence rate             quadratic   superlinear
  Cost per iteration (time)    O(n³)       O(n²)
  Cost per iteration (space)   O(n²)       O(n²)
  Hessian required             yes         no

7 Large-scale unconstrained optimization
• Large problems (today): 10³–10⁶ variables.

• In large problems, the following can have a prohibitive cost: factorizing the Hessian (to solve for the Newton step), or even computing the Hessian, multiplying by it, or storing it (note quasi-Newton algorithms generate dense approximate Hessians even if the true Hessian is sparse).

• In these cases we can use the following:

– Nonlinear CG is applicable without modification, though not very fast.


– Sparse Hessian: efficient algorithms (in time and memory) exist to factorize it.
– This chapter: limited-memory quasi-Newton methods (L-BFGS), inexact Newton methods
(Newton-CG).

Limited-memory quasi-Newton methods


• Useful for large problems with costly or nonsparse Hessian.

• They keep simple, compact approximations of the Hessian based on a few n–vectors (rather
than an n × n matrix).

• Linear convergence but fast rate.

• Focus on L-BFGS, which uses curvature information from only the most recent iterations.

Limited-memory BFGS (L-BFGS). BFGS review:

• Step: x_{k+1} = x_k − α_k H_k∇f_k.
• Update: H_{k+1} = V_kᵀH_k V_k + ρ_k s_k s_kᵀ, with V_k = I − ρ_k y_k s_kᵀ, ρ_k = 1/(y_kᵀs_k), s_k = x_{k+1} − x_k, y_k = ∇f_{k+1} − ∇f_k. The updated H_{k+1} is dense, so for large n we can't use or store it.

L-BFGS: store modified version of Hk implicitly, by storing m ≪ n of the vector pairs {si , yi }.
• Product Hk ∇fk = sum of inner products involving ∇fk and the pairs {si , yi }.

• After the new iterate is computed, replace oldest pair with newest one.

• Modest values of m (∼ 3–20) work ok in practice; best m is problem-dependent.

• Slow convergence in ill-conditioned problems.

• Update, in detail:

– Iteration k: xk , {si , yi } for i = k − m, . . . , k − 1.


– H_k = eq. (7.19), obtained by recursively expanding the update and assuming an initial Hessian approximation H_k⁰, from which the product H_k∇f_k can be computed (algorithm 7.4) in O(m) vector-vector products, plus the matrix-vector product by H_k⁰ ⇒ choose it (say) diagonal; in particular, H_k⁰ = γ_k I with γ_k = eq. (7.20) helps the direction p_k to be well scaled, so that the step length α_k = 1 is mostly accepted.

– Use l.s. with (strong) Wolfe conditions to make BFGS stable.
– The first m − 1 iterates are as in BFGS.

Algorithm 7.5 (L-BFGS): given x₀:

  Choose m > 0; k ← 0
  repeat
    Choose H_k⁰                                  e.g. eq. (7.20)
    p_k ← −H_k∇f_k                               algorithm 7.4 (two-loop recursion)
    x_{k+1} ← x_k + α_k p_k                      l.s. with Wolfe conditions
    Discard {s_{k−m}, y_{k−m}} if k > m
    Store pair s_k ← x_{k+1} − x_k, y_k ← ∇f_{k+1} − ∇f_k
    k ← k + 1
  until convergence
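A sketch of the two-loop recursion (algorithm 7.4) in Python/NumPy, with H_k⁰ = γ_k I chosen as in eq. (7.20), i.e. γ_k = s_{k−1}ᵀy_{k−1}/(y_{k−1}ᵀy_{k−1}):

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: computes H_k @ grad from the m stored pairs {s_i, y_i}."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    if s_list:
        s, y = s_list[-1], y_list[-1]
        gamma = (s @ y) / (y @ y)          # H_k^0 = gamma_k I, eq. (7.20)
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):   # oldest pair first
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r                               # the search direction is p_k = -r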

Relationship with CG methods

• Limited-memory methods historically evolved as improvements of CG methods.


• CG–Hestenes-Stiefel: we have β_{k+1}^HS = ∇f_{k+1}ᵀ(∇f_{k+1} − ∇f_k) / ((∇f_{k+1} − ∇f_k)ᵀp_k), so (from s_k = α_k p_k and p_{k+1} = −∇f_{k+1} + β_{k+1}^HS p_k):

      p_{k+1} = −∇f_{k+1} + ((∇f_{k+1}ᵀy_k)/(y_kᵀp_k)) p_k = −Ĥ_{k+1}∇f_{k+1}   with   Ĥ_{k+1} = I − (s_k y_kᵀ)/(y_kᵀs_k)

  which resembles quasi-Newton iterates, but Ĥ_{k+1} is neither symmetric nor pd.

• The following memoryless BFGS is symmetric, pd and satisfies the secant eq. H_{k+1}y_k = s_k:

      H_{k+1} = (I − (s_k y_kᵀ)/(y_kᵀs_k)) (I − (y_k s_kᵀ)/(y_kᵀs_k)) + (s_k s_kᵀ)/(y_kᵀs_k)   ≡   BFGS update with H_k = I   ≡   L-BFGS with m = 1 and H_k⁰ = I.

  And, with exact l.s. (∇f_{k+1}ᵀp_k = 0 ∀k): p_{k+1} = −H_{k+1}∇f_{k+1} ≡ CG–HS ≡ CG–PR.

General limited-memory updating. In general, we can represent the quasi-Newton approximation to the Hessian B_k and inverse Hessian H_k (BFGS, SR1) in an outer-product form:

    B_k = (1/γ_k) I + (n × 2m matrix)(2m × 2m matrix)(2m × n matrix).

This could be used in a trust-region method or in a constrained optimization method. It is efficient, since updating B_k costs O(mn + m³) and matrix-vector products B_k v cost O(mn + m²).

Inexact Newton methods
Newton step: solution of the n × n linear system ∇2 f (xk ) pk = −∇f (xk ).
• Expensive: computing the Hessian is a major task, O(n2 ), and solving the system is O(n3 ).
• Not robust: far from a minimizer, need to ensure pk is descent.
Newton-CG method: solve the system approximately with the linear CG method (efficient), termi-
nating if negative curvature is encountered (robustness); can be implemented as line search or trust
region.

Inexact Newton steps. Terminate the iterative solver (e.g. CG) when the residual r_k = ∇²f(x_k)p_k + ∇f(x_k) (where p_k is the inexact Newton step) is small wrt the gradient (to achieve invariance wrt scalings of f): ‖r_k‖ ≤ η_k‖∇f(x_k)‖, where (η_k) is the forcing sequence. Under mild conditions, if the initial x₀ is sufficiently near a minimizer x∗ and eventually we always try the full step α_k = 1, we have:
• Th. 7.1: if 0 < η_k ≤ η < 1 ∀k, then x_k → x∗.
• Th. 7.2: rate of convergence: η_k → 0 gives superlinear convergence, e.g. η_k = min(0.5, √‖∇f(x_k)‖); η_k = O(‖∇f(x_k)‖) gives quadratic convergence, e.g. η_k = min(0.5, ‖∇f(x_k)‖).
  But the smaller η_k, the more iterations of CG we need.
  (✐ How many Newton-CG iterations and how many CG steps in total does this require if f is quadratic with pd Hessian?)
How do we get sufficiently near a minimizer?

Newton-CG method. We solve the system with the CG method, to an accuracy determined by the forcing sequence, but terminating if negative curvature is encountered (a CG direction p with pᵀ∇²f(x_k)p ≤ 0). If the very first direction is of negative curvature, use the steepest descent direction instead. Then, we use the resulting direction:

• In a line search (inexact, with appropriate conditions: Wolfe, Goldstein, or backtracking) (Alg. 7.1).
  The method behaves like: pure (inexact) Newton for a pd Hessian; steepest descent for a nd Hessian; and it finds some descent direction for a non-definite Hessian.
  Problem (as in the modified Hessian Newton's method): if the Hessian is nearly singular, the Newton-CG direction can be very long.
  Hessian-free Newton methods: the Hessian-vector product ∇²f_k v can be obtained (exactly or approximately) without computing the Hessian; see ch. 8.
• In a trust-region way: limit the line search to a region of size ∆_k.
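A sketch of the inner CG loop of line-search Newton-CG in Python/NumPy; hess_vec is an assumed callable returning ∇²f(x_k)·v (e.g. the finite-difference approximation of ch. 8), and the forcing sequence is the superlinear choice from Th. 7.2:

import numpy as np

def newton_cg_direction(g, hess_vec, max_cg_iter=250):
    """Approximately solve hess * p = -g by CG; stop on negative curvature."""
    gnorm = np.linalg.norm(g)
    eps = min(0.5, np.sqrt(gnorm)) * gnorm   # eta_k * ||grad f||, superlinear forcing
    z = np.zeros_like(g)                     # current inexact Newton step (z_0 = 0)
    r = g.copy()                             # residual of hess*z + g
    d = -r
    for _ in range(max_cg_iter):
        Hd = hess_vec(d)
        curv = d @ Hd
        if curv <= 0:                        # negative curvature encountered
            return -g if not z.any() else z  # steepest descent if on the first iteration
        alpha = (r @ r) / curv
        z = z + alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) <= eps:     # ||r_k|| <= eta_k ||grad f||: accurate enough
            return z
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return z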

Sparse quasi-Newton updates


• Assume we know the sparsity pattern Ω of the true Hessian and demand that the quasi-Newton approximation B_k to it have that same pattern:

      min_B ‖B − B_k‖²_F = Σ_{(i,j)∈Ω} (B_ij − (B_k)_ij)²   s.t.   Bs_k = y_k, B = Bᵀ, B_ij = 0 for (i,j) ∉ Ω.

  The solution B_{k+1} is given by solving an n × n linear system with the same sparsity pattern, but it is not necessarily pd.

• Then, use Bk+1 within trust-region method.

• Smaller storage. And, perhaps more accurate approximation? Unfortunately:

– update not scale invariant under linear transformations


– disappointing practical performance; seemingly, bad model for Bk and so poor approxi-
mation.

Partially separable functions


• Separable function (e.g. f (x) = f1 (x1 , x3 ) + f2 (x2 , x4 , x6 ) + f3 (x5 )) ⇒ independent optimizations.

• Partially separable function = sum of element functions, each dependent on a few variables
(e.g. f (x) = f1 (x1 , x2 ) + f2 (x2 , x3 ) + f3 (x3 , x4 )) ⇒ sparse gradient, Hessian for each function; it is efficient
to maintain quasi-Newton approximations to each element function. Essentially, we work on a
lower-dimensional space for each function.
    f(x) = Σ_i φ_i(U_i x)   ⇒   ∇f(x) = Σ_i U_iᵀ∇φ_i(U_i x),   ∇²f(x) = Σ_i U_iᵀ∇²φ_i(U_i x)U_i
                            ⇒   ∇²f(x) ≈ B = Σ_i U_iᵀ B_[i] U_i

where U_i = compactifying matrix (sparse) of m_i × n and B_[i] = m_i × m_i quasi-Newton approx. of ∇²φ_i. Then, use within a trust-region method: B_k p_k = −∇f_k, which can be solved by linear CG with low-dim operations using U_i and B_[i] (avoiding explicitly constructing B_k). See the example on page 187.

Review: large-scale unconstrained optimization


• Avoid factorizing or even computing the (approximate) Hessian in Bk pk = −∇fk .
• Nonlinear CG (ch. 5), linear convergence, not very fast.
• If sparse Hessian or partially separable function, take advantage of it (e.g. separable low-dim quasi-Newton
approximations for each element function).
• Inexact Newton methods:
– Newton-CG solves the Newton direction system approximately with linear CG, terminating when either
a sufficiently accurate direction is found (using a forcing sequence) or when negative curvature is found
(ensures descent direction).
– Convergence rate: linear, superlinear or quadratic depending on the forcing sequence.
– Hessian-free if computing ∇2 f v products without computing ∇2 f (ch. 8).
• Limited-memory quasi-Newton methods:
– L-BFGS implicitly constructs approximate Hessians based on m ≪ n outer products constructed using
the last m pairs {si , yi }.
– Obtained by unfolding the BFGS formula for Hk+1 over the previous m iterations.
– Linear convergence but generally faster than nonlinear CG.
– The memoryless L-BFGS (m = 1) is similar to nonlinear CG methods.

8 Calculating derivatives
Approximate or automatic techniques to compute the gradient, Hessian or Jacobian if difficult by
hand.

Finite-difference derivative approximations


Gradient of f: Rⁿ → R, from Taylor's th.:

∂f/∂xi = [f(x + ǫei) − f(x)]/ǫ + O(ǫ)    ← forward difference, n + 1 function evaluations
∂f/∂xi = [f(x + ǫei) − f(x − ǫei)]/(2ǫ) + O(ǫ²)    ← central difference, 2n function evaluations.

We approximate the derivative with the difference quotient; the O(·) term is the truncation error.

Needs careful choice of ǫ: as small as possible but not too close to the machine precision (to avoid roundoff errors). As a rule of thumb, ǫ ∼ u^{1/2} with error ∼ u^{1/2} for forward diff. and ǫ ∼ u^{1/3} with error ∼ u^{2/3} for central diff., where u (≈ 10⁻¹⁶ in double precision) is the unit roundoff.
Pf.: assume |f| ≤ L0 and |∂²f/∂xi²| ≤ L2 in the region of interest. Then ∂f/∂xi(x) = [f(x + ǫei) − f(x)]/ǫ + δǫ with |δǫ| ≤ (L2/2)ǫ. But the machine representation of f at any x has a relative error |comp(f(x)) − f(x)| ≤ uL0. Thus the absolute error E = |∂f/∂xi(x) − [f(x + ǫei) − f(x)]/ǫ| is bounded by (L2/2)ǫ + 2uL0/ǫ, which is minimal for ǫ² = 4L0u/L2 and E = L2ǫ, i.e., ǫ ∼ √u and E ∼ √u. A similar derivation for the central diff. (with |∂³f/∂xi³| ≤ L3) gives ǫ³ = 3L0u/(2L3) and E = L3ǫ², i.e., ǫ ∼ u^{1/3} and E ∼ u^{2/3}.

Hessian of f: O(n²) function evaluations:

∂²f/∂xi∂xj (x) = [f(x + ǫei + ǫej) − f(x + ǫei) − f(x + ǫej) + f(x)]/ǫ² + O(ǫ).

The roundoff error accumulates badly in this formula.

Jacobian-vector product J(x)p, where r: Rⁿ → Rᵐ and J (m × n) is the Jacobian of r:

J(x) p = [r(x + ǫp) − r(x)]/ǫ + O(ǫ)    (by Taylor's th.)
∇²f(x) p = [∇f(x + ǫp) − ∇f(x)]/ǫ + O(ǫ)    (in particular for r = ∇f)
∇²f(x) ei = [∇f(x + ǫei) − ∇f(x)]/ǫ + O(ǫ)    (column i of the Hessian).
Using the expression for ∇2 f (x) p in Newton-CG gives rise to a Hessian-free Newton method (ch. 7).

• If the Jacobian or Hessian is sparse, it is possible to reduce the number of function evaluations
by cleverly choosing the perturbation vector p (graph-coloring problem).

• Computationally, the finite-difference approximation of ∇f , etc. can cost more than computing
them from their analytical expression (ex.: quadratic f ), though this depends on f .

• Numerical gradients are also useful to check whether the expression for a gradient calculated
by hand is correct.
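Such a check takes a few lines; a minimal sketch (central differences with the u^{1/3} rule of thumb above, on a made-up test function):

    import numpy as np

    def fd_gradient(f, x, eps=None):
        u = np.finfo(float).eps                 # unit roundoff, ~1e-16
        eps = eps or u ** (1 / 3)
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x); e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)   # 2n evaluations in total
        return g

    # check a hand-coded gradient of f(x) = 1/2 ||x||^2 + sin(x_0)
    f = lambda x: 0.5 * x @ x + np.sin(x[0])
    grad = lambda x: x + np.eye(x.size)[0] * np.cos(x[0])
    x = np.random.randn(5)
    print(np.max(np.abs(fd_gradient(f, x) - grad(x))))   # ~1e-10 or smaller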

Exact derivatives by automatic differentiation
• Build a computational graph of f using intermediate variables.
• Apply the chain rule: ∇x h(y(x)) = Σi (∂h/∂yi) ∇yi(x), where Rⁿ →(y) Rᵐ →(h) R.

• Computes the exact numerical value of f , ∇f or ∇2 f p at a point x recursively.


Cost (time and space) depends on the structure of f .
Example: call Dp(r(x)) = ∂/∂ǫ r(x + ǫp)|ǫ=0 = limǫ→0 [r(x + ǫp) − r(x)]/ǫ. Then Dp(∇f(x)) = ∇²f(x) p. Autodiff can compute this exactly with about the same cost required to compute ∇f(x).

• Done automatically by a software tool.

• Disadvantage: it may miss simplifications of expressions and reuse of common operations that a hand-coded derivative would exploit.


 
Ex: differentiate tan x − x, ln((x − 1)/(x + 1)), 1/(1 + e^{−ax}).
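A toy forward-mode implementation makes the idea concrete: propagate (value, derivative) pairs through the computational graph via the chain rule (a minimal sketch of ours, not a real autodiff tool):

    import math

    class Dual:
        # carries (value v, derivative d) through each operation
        def __init__(self, v, d=0.0):
            self.v, self.d = v, d
        def __add__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.v + o.v, self.d + o.d)
        __radd__ = __add__
        def __mul__(self, o):   # product rule
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.v * o.v, self.d * o.v + self.v * o.d)
        __rmul__ = __mul__

    def exp(x):                 # chain rule for e^x
        return Dual(math.exp(x.v), math.exp(x.v) * x.d)

    # d/dx (x e^{-x}) at x = 1: an exact numerical value, no algebraic expression
    x = Dual(1.0, 1.0)          # seed dx/dx = 1
    y = x * exp(-1.0 * x)
    print(y.v, y.d)             # 0.36787..., 0.0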

Exact derivatives by symbolic differentiation


• Produce an algebraic expression for the gradient, etc., possibly simplified.
Ex: d/dx (x e^{−x}) = e^{−x} − x e^{−x} = e^{−x}(1 − x).

• Packages: Mathematica, Maple, Matlab Symbolic Toolbox. . .


Autodiff does not produce an algebraic expression, but an exact numerical value for the desired input x. Ex: d/dx (x e^{−x})|x=1 = 0.

Review: Calculating derivatives


• Finite-difference approximation with perturbation ǫ for the gradient, Hessian or Jacobian:
– Specific finite difference scheme obtained from Taylor’s th.
– ǫ: large truncation error if too large, large roundoff error if too small.
– ∇f , ∇2 f cost O(n) and O(n2 ) different evaluations of f , respectively; less if sparsity.
• Exact numerical derivatives by automatic differentiation: a software tool applies the chain rule recursively at
a given numerical input x. The cost depends on the structure of f .
• Exact symbolic derivatives by symbolic differentiation with Mathematica, Maple, etc.

9 Derivative-free optimization
Evaluating ∇f in practice is sometimes impossible, e.g.:

• f (x) can be the result of an experimental measurement or a simulation (so analytic form of f
unknown).

• Even if known, coding ∇f may be time-consuming or impractical.

Approaches:

• Approximate gradient and possibly Hessian using finite differences (ch. 8), then apply derivative-
based method (previous chapters). But:

– Number of function evaluations may be excessive.


– Unreliable with noise (= inaccuracy in function evaluation).

• Don’t approximate the gradient, instead use function values at a set of sample points and
determine a new iterate by a different means (this chapter). But: less developed and less
efficient than derivative-based methods; effective only for small problems; difficult to use with
general constraints.

If possible, try methods in this order: derivative-based > finite-difference-based > derivative-free.

Model-based methods
• Build a model mk as a quadratic function that interpolates f at an appropriate set of samples.
Compute a step with a trust-region strategy (since mk is usually nonconvex).

• Model: samples Y = {y1, . . . , yq} ⊂ Rⁿ with current iterate xk ∈ Y and having lowest function value in Y. Construct mk(xk + p) = c + gᵀp + ½pᵀGp (we can't use g = ∇f(xk), G = ∇²f(xk)) by imposing interpolation conditions mk(yl) = f(yl), l = 1, . . . , q (linear system). Need q = ½(n + 1)(n + 2) (exe. 9.2) and choose points so the linsys is nonsingular.

• Step: min_{‖p‖₂≤∆} mk(xk + p), etc. as in trust-region.

• If sufficient reduction: the latest iterate replaces one sample in Y.


Else: if Y is adequate (= low condition number of linsys) then reduce ∆ else improve Y.

• Good initial Y: vertices and edge midpoints of simplex in Rn (exe. 9.2).

• Naive implementation costs O(n6 ).


Acceleration: update mk at each iteration rather than recomputing it from scratch.
Even with this it still costs O(n4 ), very expensive.

• Linear model (G = 0): q = n + 1 parameters, O(n³) step (more slowly convergent).
Hybrid: start with linear steps, switch to quadratic when q = ½(n + 1)(n + 2) function values become available.

Coordinate- and pattern-search methods


Search along certain specified directions from the current iterate for a point of lower f value.

Coordinate-descent algorithm (alternating optimization) pk cycles through the n coordinate directions e1, . . . , en in turn (fig. 9.1).
• May not converge, iterating indefinitely without approaching a stationary point, if the gradient becomes more and more ⊥ to the coordinate directions. Then cos θk approaches 0 sufficiently rapidly that the Zoutendijk condition is satisfied even when ∇fk ↛ 0.
• If it does converge, its rate of convergence is often much slower than that of steepest descent,
and this gets worse as n increases.
• Advantages: very simple, does not require calculation of derivatives, convergence rate ok if the
variables are loosely coupled.
• Variants:
– back-and-forth approach repeats e1 e2 . . . en−1 en en−1 . . . e2 e1 e2 . . .
– Hooke-Jeeves: after sequence of coordinate descent steps, search along first and last point
in the cycle.
• Very useful in special cases:
– When alternating over groups of variables so that the optimization over each group is easy. Ex.: f(X, A) = Σ_{j=1}^m ‖yj − Axj‖².
– When the cost of cycling through the n variables is comparable to the cost of computing the gradient. Ex.: f(w) = ½ Σ_{n=1}^N (yn − wᵀxn)² + λ‖w‖₁ (Lasso). (A small numerical sketch follows.)
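A minimal sketch on a strictly convex quadratic, where each coordinate step can be solved exactly (this is Gauss-Seidel on the linear system; the data are made up):

    import numpy as np

    def coordinate_descent(A, b, x, iters=100):
        # minimize f(x) = 1/2 x^T A x - b^T x (A pd), exact step per coordinate:
        # df/dx_i = A_i . x - b_i = 0  =>  x_i = (b_i - sum_{j!=i} A_ij x_j) / A_ii
        for _ in range(iters):
            for i in range(x.size):
                x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
        return x

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5)); A = M @ M.T + 5 * np.eye(5)  # pd, loosely coupled
    b = rng.standard_normal(5)
    x = coordinate_descent(A, b, np.zeros(5))
    print(np.linalg.norm(x - np.linalg.solve(A, b)))   # ~0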

Pattern search Generalizes coordinate search to a richer set of directions at each iteration. At
each iterate xk :
• Choose a certain set of search directions Dk = {p1, . . .} and define a frame centered at xk by points at a given step length γk > 0 along each direction: {xk + γk p1, . . .}.
• Evaluate f at each frame point:
– If significantly lower f value found, adopt as new iterate and shift frame center to it.
Possibly, increase γk (expand the frame).
– Otherwise, stay at xk and reduce γk (shrink the frame).
• Possibly, change the directions.
Ex.: algorithm 9.2, which eventually shrinks the frame around a stationary point. Global convergence
under certain conditions on the choice of directions:
(1) At least one direction in Dk should be descent (unless ∇f (xk ) = 0), specifically:
min_{v∈Rⁿ} max_{p∈Dk} cos∠(v, p) ≥ δ for a constant δ > 0.
(2) All directions have roughly similar length (so we can use a single step length): ∀p ∈ Dk :
βmin ≤ kpk ≤ βmax , for some positive βmin , βmax and all k.
Examples of such Dk : fig. 9.2

• Coordinate dirs ±e1 , . . . , ±en .


• Simplex set: pi = (1/2n) e − ei, i = 1, . . . , n, and pn+1 = (1/2n) e (where e = (1, . . . , 1)ᵀ).
The coordinate-descent method uses Dk = {ei , −ei } for some i at each k, which violates condition
(1) (exe. 9.9).

Nelder-Mead method (downhill simplex)
• At each iteration we keep n + 1 points whose convex hull forms a simplex (Proc. 9.5, fig. 9.4).
A simplex with vertices z1, . . . , zn+1 is nondegenerate or nonsingular if the edge matrix V = (z2 − z1, . . . , zn+1 − z1) is nonsingular.
At each iteration we replace the worst vertex (in f-value) with a better point obtained by reflecting, expanding or contracting the simplex along the line joining the worst vertex with the simplex center of mass. If we can't find a better point this way, we keep only the best vertex and shrink the simplex towards it.

• Reasonable practical performance but sometimes doesn't converge.
The average function value (1/(n+1)) Σ_{i=1}^{n+1} f(xi) decreases after each step except perhaps after a shrinkage step (exe. 9.11).
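In practice one rarely codes the simplex updates by hand; e.g. SciPy exposes the method (a usage sketch on the Rosenbrock function, which needs no gradient here):

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: 100 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2
    res = minimize(f, x0=np.array([-1.2, 1.0]), method="Nelder-Mead")
    print(res.x, res.nfev)   # near (1, 1); note the large number of f-evaluations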

A conjugate-direction method
• Idea: algorithm that builds a set of conjugate directions using only function values (thus
minimizes a strictly convex quadratic function); then extend to nonlinear function.

• Parallel subspace property: let x1 ≠ x2 ∈ Rⁿ and {p1, . . . , pl} ⊂ Rⁿ l.i. Define the two parallel linear varieties Sj = {xj + Σ_{i=1}^l αi pi: α1, . . . , αl ∈ R}, j = 1, 2; let x1* and x2* be the minimizers of f(x) = ½xᵀAx − bᵀx on S1 and S2, resp. ⟹ x2* − x1* is conjugate to {p1, . . . , pl} (fig. 9.3).
Ex. 2D: given x0 and (say) e1 , e2 : (1) minimize from x0 along e2 to obtain x1 , (2) minimize
from x1 along e1 then e2 to obtain z ⇒ z − x1 is conjugate to e2 .

• Algorithm: starting with n l.i. directions, perform n consecutive exact minimizations each along
a current direction, then generate a new conjugate direction which replaces the oldest direction;
repeat.

Algorithm 9.3: given x0 :

Init: pi = ei for i = 1, . . . , n; x1 ← arg min f from x0 along pn ; k ← 1


repeat
min sequentially from xk along p1 , . . . , pn to obtain z
New dirs: p2 , . . . , pn , z − xk
New iterate: xk+1 ← arg min f from z along z − xk
k ←k+1
until convergence

• For quadratic f , terminates in n steps with total O(n2 ) function evaluations (each one O(n2 ))
⇒ O(n4 ). For non-quadratic f the l.s. is inexact (using interpolation) and needs some care.
Problem: the directions {pi } tend to become l.d. Heuristics exist to correct this.

• Useful for small problems.

Finite differences and noise (illustrative analysis)
Noise in evaluating f can arise because:

• stochastic simulation: random error because finite number of trials

• differential-equation solver (or some other complex numerical procedure): small but nonzero
tolerance in calculations.

Write f(x) = h(x) + φ(x) with smooth h and noise φ (which need not be a function of x); consider the centered finite-difference approx. ∇ǫf(x) = ([f(x + ǫei) − f(x − ǫei)]/2ǫ)_{i=1,...,n} and let η(x; ǫ) = sup_{‖z−x‖∞≤ǫ} |φ(z)| (noise level at x). We have:
Lemma 9.1: if ∇²h is Lipschitz continuous in a neighborhood of the box {z: ‖z − x‖∞ ≤ ǫ} with Lipschitz constant Lh, then:

‖∇ǫf(x) − ∇h(x)‖∞ ≤ Lh ǫ² + η(x; ǫ)/ǫ

where the first term is the finite-difference approximation error and the second one the noise error.
(Pf.: as for the unit-roundoff argument in finite differences.)

“If the noise dominates ǫ, no accuracy in ∇ǫ f and so little hope that −∇ǫ f will be descent.” So,
instead of using close samples, it may be better to use samples more widely separated.

Implicit filtering
• Essentially, steepest descent at an accuracy level ǫ (the finite-difference parameter) that is
decreased over iterations.

• Useful when we can control the accuracy in computing f and ∇ǫ f (e.g. if we can control the
tolerance of a differential-equation solver, or the number of trials of a stochastic simulation).
A more accurate (less noisy) value costs more computation.

• The algorithm decreases ǫ systematically (but hopefully not as quickly as the decay in error)
so as to maintain a reasonable accuracy in ∇ǫ f (x). Each iteration is a steepest descent step
at accuracy level ǫk (i.e., along −∇ǫk f (x)) with a backtracking l.s. that is limited to a fixed
number of steps. We decrease ǫk when:

– k∇ǫk f (x)k ≤ ǫk (i.e., minimum found with accuracy level ǫk )

– or we reach the fixed number of backtracking steps (i.e., ∇ǫ f (x) is a poor approximation of ∇f (x)).

• Converges if ǫk is decreased such that ǫk² + η(xk; ǫk)/ǫk → 0, i.e., the noise level decreases sufficiently fast as the iterates approach a solution.

Review: derivative-free optimization
Methods that use only function values to minimize f (but not ∇f or ∇2 f ). Less efficient than derivative-based
methods, but possibly acceptable for small n or in special cases.
• Model-based methods: build a linear or quadratic model of f by interpolating f at a set of samples and use it
with trust-region. Slow convergence rate and very costly steps.
• Coordinate descent (alternating minimization): minimize successively along each variable. If it does con-
verge, its rate of convergence is often much slower than that of steepest descent. Very simple and convenient
sometimes.
• Pattern search: the iterate xk carries a set of directions that is possibly updated based on the values of f
along them. Generalizes coordinate descent to a richer direction set.
• Nelder-Mead method (downhill simplex): the iterate xk carries a simplex that evolves based on the values of
f , falling down and eventually shrinking around a minimizer (if it does converge).
• Conjugate directions built using the parallel subspace property: computing the new conjugate direction requires
n line searches (CG requires only one).
• Finite-difference approximations to the gradient degrade significantly with noise in f .
Implicit filtering: steepest descent at an accuracy level ǫ (the finite-difference parameter) that is decreased
over iterations. Useful when we can control the accuracy in computing f and ∇ǫ f .

10 Nonlinear least-squares problems
• Least-squares (LSQ) problem: f(x) = ½ Σ_{j=1}^m rj(x)², where the residuals rj: Rⁿ → R, j = 1, 2, . . . , m, are smooth and m ≥ n.

• Arise very often in practice when fitting a parametric model to observed data; rj (x) is the error
for datum j with model parameters x; “min f ” means finding the parameter values that best
match the model to the data.

• Ex.: regression (curve fitting): rj = yj − φ(x; tj); f(x) = ½ Σ_{j=1}^m (yj − φ(x; tj))² is the LSQ error of fitting the curve φ: t → y (with parameters x) to the observed data points {(tj, yj)}_{j=1}^m. If using other norms, e.g. |rj| or |rj|³, it won't be a LSQ problem.

• The special form of f simplifies the minimization problem. Write f(x) = ½‖r(x)‖₂² in terms of the residual vector r: Rⁿ → Rᵐ, r(x) = (r1(x), . . . , rm(x))ᵀ, with Jacobian J(x) = (∂rj/∂xi), the m × n matrix of first partial derivatives, whose jth row is ∇rj(x)ᵀ. Usually it's easy to compute J explicitly. Then

∇f(x) = Σj rj(x) ∇rj(x) = J(x)ᵀ r(x)
∇²f(x) = Σj ∇rj(x) ∇rj(x)ᵀ + Σj rj(x) ∇²rj(x) = J(x)ᵀJ(x) + Σj rj(x) ∇²rj(x),

where (∗) denotes the first term J(x)ᵀJ(x).

Often (∗) is the leading term, e.g.

– if rj are small around the solution (“small residual case”);


– if rj are approximately linear around the solution.

So we get a pretty good approximation of the Hessian for free.

• Linear LSQ problem: rj(x) is linear ∀j ⇒ J(x) = J constant. Calling r = r(0), we have f(x) = ½‖Jx + r‖₂² (convex), ∇f(x) = Jᵀ(Jx + r), ∇²f(x) = JᵀJ constant.
(✐ is fitting a polynomial to data a linear LSQ problem?)
Minimizer: ∇f(x*) = 0 ⇒ JᵀJ x* = −Jᵀr, the normal equations: an n × n linear system with pd or psd matrix which can be solved with numerical analysis techniques.
Cholesky factorization of JᵀJ, or QR or SVD factorization of J, are best depending on the problem; one could also use the linear conjugate gradient method for large n.
If m is very large, do not build J explicitly but accumulate JᵀJ = Σj ∇rj(x)∇rj(x)ᵀ and Jᵀr = Σj rj(x)∇rj(x). (A small numerical sketch follows.)
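A small sketch with made-up data: fitting a cubic polynomial is a linear LSQ problem (f is linear in the coefficients), solvable via the normal equations or, more stably, via SVD:

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(-1, 1, 50)
    y = 2 * t ** 3 - t + 0.05 * rng.standard_normal(t.size)   # noisy cubic
    J = np.vander(t, 4)                 # constant Jacobian: columns t^3, t^2, t, 1

    x_ne = np.linalg.solve(J.T @ J, J.T @ y)        # normal equations (J^T J pd here)
    x_svd = np.linalg.lstsq(J, y, rcond=None)[0]    # SVD-based, safer if ill conditioned
    print(x_ne, x_svd)                  # both near the true coefficients (2, 0, -1, 0)

(Here the residuals are r(x) = Jx − y, so r(0) = −y and the normal equations read JᵀJ x* = Jᵀy.)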

• For nonlinear LSQ problems f isn’t necessarily convex. We see 2 methods (Gauss-Newton,
Levenberg-Marquardt) which take advantage of the particular form of LSQ problems; but any
of the methods we have seen in earlier chapter are applicable too (e.g. Newton’s method, if we
compute ∇2 rj ).

Gauss-Newton method
• Line search with Wolfe conditions and a modification of Newton's method: instead of generating the search direction pk by solving the Newton eq. ∇²f(xk)p = −∇f(xk), ignore the second-order term in ∇²f (i.e., approximate ∇²fk ≈ JkᵀJk) and solve JkᵀJk pkGN = −Jkᵀrk.
• Equivalent to approximating r(x) by a linear model r(x + p) ≈ r(x) + J(x)p (linearization) and so f(x) by a quadratic model with Hessian J(x)ᵀJ(x), then solving the linear LSQ problem minp ½‖Jk p + rk‖₂².
• If Jk has full rank and ∇fk = Jkᵀrk ≠ 0 then pkGN is a descent direction. (Pf.: evaluate (pkGN)ᵀ∇fk < 0.)

• Saves us the trouble of computing the individual Hessians ∇2 rj .


• Global convergence if Wolfe conditions + kJ(x)zk2 ≥ γkzk2 in the region of interest (for constant
γ > 0) + technical condition (th. 10.1)
Pf.: cos θk is bounded away from 0 + Zoutendijk’s th.
kJ(x)zk2 ≥ γkzk2 ⇔ singular values of J are bounded away from 0 ⇔ JT J is well conditioned (see ch. 3).

The theorem doesn’t hold if J(xk ) is rank-deficient for some k. This occurs when the normal
equations are underdetermined (infinite number of solutions for pGN
k ).

• Rate of convergence depends on how much the term JᵀJ dominates the second-order term in the Hessian at the solution x*; it is linear but rapid in the small-residual case. Eq. (10.30):

‖xk + pkGN − x*‖ ≲ ‖(Jᵀ(x*)J(x*))⁻¹ ∇²f(x*) − I‖ ‖xk − x*‖ + O(‖xk − x*‖²).
• Inexact Gauss-Newton method : solve the linsys approximately, e.g. with CG.

Levenberg-Marquardt method
• Same modification of Newton’s method as in the Gauss-Newton method but with a trust region
instead of a line search. Essentially, modify JTk Jk → JTk Jk + λI with λ ≥ 0 to make it pd.
• Spherical trust region with radius ∆k, quadratic model for f with Hessian JkᵀJk:

mk(p) = ½‖rk‖² + pᵀJkᵀrk + ½pᵀJkᵀJk p = ½‖Jk p + rk‖₂²  ⇒  minp ½‖Jk p + rk‖₂² s.t. ‖p‖ ≤ ∆k.

• A rank-deficient Jacobian is no problem because the step length is bounded by ∆k .


• Characterization of the solution of the trust-region subproblem (lemma 10.2, direct consequence of th. 4.1 in ch. 4): pLM is a solution of the trust-region problem min_{‖p‖≤∆} ‖Jp + r‖₂² iff pLM is feasible and ∃λ ≥ 0 such that: (a) (JᵀJ + λI)pLM = −Jᵀr and (b) λ(∆ − ‖pLM‖) = 0.
Search for λ: start with large λ, reduce it till the corresponding pLM from (a) produces a
sufficient decrease (defined in some way) in f .
• Global convergence under certain assumptions.
• Rate of convergence: like Gauss-Newton, since near a solution the trust region eventually
becomes inactive and the algorithm takes Gauss-Newton steps.
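A minimal LM sketch with a made-up curve-fitting example (the λ update below is a crude sufficient-decrease heuristic, not the full trust-region machinery):

    import numpy as np

    def levenberg_marquardt(r, J, x, lam=1e-2, iters=50):
        f = lambda x: 0.5 * np.sum(r(x) ** 2)
        for _ in range(iters):
            rk, Jk = r(x), J(x)
            p = np.linalg.solve(Jk.T @ Jk + lam * np.eye(x.size), -Jk.T @ rk)
            if f(x + p) < f(x):
                x, lam = x + p, lam * 0.5   # good step: accept, trust the model more
            else:
                lam *= 10.0                 # bad step: stay, restrict the step more
        return x

    # fit y = exp(a t) + b: residuals r_j(x) = exp(a t_j) + b - y_j, x = (a, b)
    t = np.linspace(0, 1, 30)
    y = np.exp(1.5 * t) + 2.0
    r = lambda x: np.exp(x[0] * t) + x[1] - y
    J = lambda x: np.column_stack((t * np.exp(x[0] * t), np.ones_like(t)))
    print(levenberg_marquardt(r, J, np.array([0.0, 0.0])))   # ~ (1.5, 2.0)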

Large-residual problems
If the residuals rj (x∗ ) near the solution x∗ are large, both Gauss-Newton and Levenberg-Marquardt
converge slowly, since JT J is a bad model of the Hessian. Options:

• Use a Newton or quasi-Newton method.

• Use a hybrid method, e.g. start with GN/LM then switch to (quasi-)Newton, or apply a quasi-Newton approximation Sk to the second-order part of the Hessian Σj rj(x)∇²rj(x) and combine with GN: Bk = JkᵀJk + Sk.

Note that, in model fitting, large residuals mean the model is a poor fit to the data, so we may want
to use a better model.

Review: nonlinear least-squares problems


f(x) = ½ Σ_{j=1}^m rj(x)², m ≥ n; residual rj: Rⁿ → R is the error at datum j for a model with parameters x.
• Simplified form for gradient and Hessian:
– f(x) = ½‖r(x)‖₂² with r(x) = (rj(x))j
– ∇f(x) = J(x)ᵀr(x) with Jacobian J(x) = (∂rj/∂xi)ji
– ∇²f(x) = J(x)ᵀJ(x) + Σj rj(x)∇²rj(x); use the first term as approximate Hessian.
• Linear LSQ: rj linear, J constant; the minimizer x* satisfies (calling r = r(0)) the normal eqs. JᵀJ x* = −Jᵀr, with JᵀJ pd or psd.
• Nonlinear LSQ: GN, LM methods.

• Gauss-Newton method:
– Approximate Hessian ∇²fk ≈ JkᵀJk, solve for the search direction JkᵀJk pkGN = −Jkᵀrk, inexact line search with Wolfe conditions.
– Equivalent to linearizing r(x + p) ≈ r(x) + J(x)p.
– Problems if Jk is rank-deficient.
• Levenberg-Marquardt method:
– Like GN but with trust region instead of line search: min_{‖p‖≤∆k} ‖Jk p + rk‖₂².
– No problem if Jk is rank-deficient.
– One way to solve the trust-region subproblem approximately: try a large λ ≥ 0, solve (JkᵀJk + λI)pkLM = −Jkᵀrk, accept pkLM if sufficient decrease in f, otherwise try a smaller λ.
• Global convergence under certain assumptions.
• Rate of convergence: linear but fast if JT (x∗ )J(x∗ ) ≈ ∇2 f (x∗ ), which occurs with small residuals (rj (x) ≈ 0)
or quasilinear residuals (∇2 rj (x) ≈ 0). Otherwise GN/LM are slow; try other methods instead (quasi-Newton,
Newton, etc.) or hybrids that combine the advantages of GN/LM and (quasi-)Newton.

Method of scoring (≈ Gauss-Newton for maximum likelihood)
Maximum likelihood estimation of parameters x given observations {ti}: maxx (1/N) Σ_{i=1}^N log p(ti; x), where p(t; x) is a pmf or pdf in t. Call L(t; x) = log p(t; x). Then (all derivatives are wrt x in this section, and assume we can interchange ∇ and ∫):

Gradient: ∇L = (1/p)∇p
Hessian: ∇²L = −(1/p²)∇p ∇pᵀ + (1/p)∇²p = −∇log p ∇log pᵀ + (1/p)∇²p.
Taking expectations wrt the model p(t; x) we have:

E{−∇²L} = E{∇log p ∇log pᵀ} − E{(1/p)∇²p} = cov{∇log p}

since E{∇log p} = ∫ p (1/p)∇p = ∇∫ p = 0 and E{(1/p)∇²p} = ∫ p (1/p)∇²p = ∇²∫ p = 0.
In statistical parlance:
• Observed information: −∇2 L.

• Expected information: E{−∇²L} = E{∇log p ∇log pᵀ} (Fisher information matrix)

• Score: ∇ log p = ∇L
Two ways of approximating the log-likelihood Hessian (1/N) Σ_{i=1}^N ∇²log p(ti; x) using only the first-order term on ∇log p:

• Gauss-Newton: sample observed information J(x) = (1/N) Σ_{i=1}^N ∇log p(ti; x) ∇log p(ti; x)ᵀ.

• Method of scoring: expected information J(x) = E{∇log p ∇log pᵀ}. This requires computing
an integral, but its form is often much simpler than that of the observed information (e.g. for
the exponential family).
Advantages:
• Good approximation to Hessian (the second-order term is small on average if the model fits
well the data).

• Cheaper to compute (only requires first derivatives)

• Positive definite (covariance matrix), so descent directions.


• Particularly simple expressions for the exponential family: p(t; x) = (f(t)/g(x)) e^{h(t)ᵀΦ(x)}, with sufficient statistic h(t) and partition function g(x) = ∫ f(t) e^{h(t)ᵀΦ(x)} dt.

∇log p = −(1/g)∇g + ∇Φ(x) h(t) = ∇Φ(x)(h(t) − E{h(t)}) ⇒
log-likelihood gradient: (1/N) Σ_{i=1}^N ∇log p(ti; x) = ∇Φ(x)(E_data{h} − E_model{h}).

Typically Φ(x) = x so ∇Φ(x) = I, in which case E{−∇²L} = −∇²L.
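A concrete sketch (an assumed example, not from the book): logistic regression with the canonical link, where observed and expected information coincide, so scoring = Newton (IRLS). Here w plays the role of the parameters x and the rows of T of the observations:

    import numpy as np

    def fisher_scoring_logistic(T, y, iters=15):
        # maximize the log-likelihood of p(y=1|t) = 1/(1+exp(-w^T t))
        w = np.zeros(T.shape[1])
        for _ in range(iters):
            mu = 1.0 / (1.0 + np.exp(-T @ w))            # model probabilities
            grad = T.T @ (y - mu)                        # score
            info = T.T @ (T * (mu * (1 - mu))[:, None])  # Fisher information matrix
            w = w + np.linalg.solve(info, grad)          # ascent step (pd system)
        return w

    rng = np.random.default_rng(0)
    T = np.column_stack((np.ones(200), rng.standard_normal(200)))
    y = (rng.random(200) < 1 / (1 + np.exp(-T @ np.array([-0.5, 2.0])))).astype(float)
    print(fisher_scoring_logistic(T, y))   # near (-0.5, 2.0); may fail if data separable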

Missing-data problem Consider that t is observed and z is missing, so p(t; x) = ∫ p(t|z; x) p(z; x) dz (e.g. z = label of mixture component). We have:

∇log p(t; x) = (1/p(t; x)) ∫ (∇p(t|z; x) p(z; x) + p(t|z; x) ∇p(z; x)) dz
             = (1/p(t; x)) ∫ p(z; x) p(t|z; x) (∇log p(t|z; x) + ∇log p(z; x)) dz = E_{z|t}{∇log p(t, z; x)}
             = posterior expectation of the complete-data log-likelihood gradient.

∇²log p(t; x) = ∇ ∫ p(z|t; x) ∇log p(t, z; x) dz = E_{z|t}{∇²log p(t, z; x)} + ∫ ∇p(z|t; x) ∇log p(t, z; x)ᵀ dz.

Noting that

∇p(z|t; x) = ∇(p(t, z; x)/p(t; x)) = (1/p(t; x))(∇p(t, z; x) − p(z|t; x)∇p(t; x)) = p(z|t; x)(∇log p(t, z; x) − ∇log p(t; x)),

the second term is:

E_{z|t}{(∇log p(t, z; x) − ∇log p(t; x)) ∇log p(t, z; x)ᵀ} = cov_{z|t}{∇log p(t, z; x)}

since ∇log p(t; x) = E_{z|t}{∇log p(t, z; x)}.


Finally:

Gradient: ∇log p(t; x) = E_{z|t}{∇log p(t, z; x)}
Hessian: ∇²log p(t; x) = E_{z|t}{∇²log p(t, z; x)} + cov_{z|t}{∇log p(t, z; x)}
       = (posterior expectation of the complete-data log-likelihood Hessian) + (posterior covariance of the complete-data log-likelihood gradient).

Again, ignoring the second-order term we obtain a cheap, pd approximation to the Hessian (Gauss-Newton method), but one suited for minimizing the likelihood, not for maximizing it! We can still use the first-order, negative-definite approximation from before:

∇²log p(t; x) ≃ −∇log p(t; x) ∇log p(t; x)ᵀ.

Relation with the EM (expectation-maximization) algorithm

• E step: compute p(z|t; xold) and Q(x; xold ) = Ez|t;xold {log p(t, z; x)}.

• M step: maxx Q(x; xold ).


We have ∇log p(t; x) = ∂Q(x′; x)/∂x′ |_{x′=x} = E_{z|t}{∇log p(t, z; x)}.

11 Nonlinear equations
• Problem: find roots of the n equations r(x) = 0 in n unknowns x, where r: Rⁿ → Rⁿ (ex. in p. 271). There may be 0, 1, finitely many, or infinitely many roots.
• Many similarities with optimization: Newton’s method, line search, trust region. . .
• Differences:
– In optimization, the local optima can be ranked by objective value.
In root finding, all roots are equally good. 
– For quadratic convergence, we need derivatives of order 2 in optimization but only of order 1 in root-finding.
– Quasi-Newton methods are less useful in root-finding.
• Assume the n × n Jacobian J(x) = (∂ri/∂xj)ij exists and is continuous in the region of interest.

• Degenerate root: x∗ with r(x∗ ) = 0 and J(x∗ ) singular.

Newton’s method
• Taylor’s th.: linearize r(x + p) = r(x) + J(x)p + O(kpk2 ) and use as model; find its root.

Algorithm 11.1: given x0 :

for k = 0, 1, 2 . . .
solve Jk pk = −rk
xk+1 ← xk + pk
end
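A direct transcription of Algorithm 11.1 on a made-up 2 × 2 system (pure Newton, no globalization, so it needs a good x0):

    import numpy as np

    def newton_system(r, J, x, tol=1e-12, iters=50):
        for _ in range(iters):
            rx = r(x)
            if np.linalg.norm(rx) <= tol:
                break
            x = x + np.linalg.solve(J(x), -rx)   # solve J_k p_k = -r_k; x_{k+1} = x_k + p_k
        return x

    # intersect the circle x1^2 + x2^2 = 4 with the curve x2 = e^{x1}
    r = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4, np.exp(x[0]) - x[1]])
    J = lambda x: np.array([[2 * x[0], 2 * x[1]], [np.exp(x[0]), -1.0]])
    print(newton_system(r, J, np.array([1.0, 1.0])))   # quadratic convergence near the root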

• Newton’s method for optimizing an objective function f is the same as applying this algorithm
to r(x) = ∇f (x). 
• Convergence rate for nondegenerate roots (th. 11.2): superlinear if the Jacobian is continuous; quadratic if it is Lipschitz continuous.
• Problems:
– Degenerate roots, e.g. r(x) = x² produces xk = 2⁻ᵏ x0, which converges linearly.
– Not globally convergent: away from a root the algorithm can diverge or cycle; it is not
even defined if J(xk ) is singular.
– Expensive to compute J and solve the system exactly for large n.

Inexact Newton method


• Exit the linear solver of Jk pk = −rk when krk + Jk pk k ≤ ηk krk k for ηk ∈ [0, η] with constant
η ∈ (0, 1). (ηk ) = forcing sequence (as for Newton’s method in optimization).
We can’t use linear conjugate gradients here because Jk is not always pd; but there are other
linear solvers (also based on Krylov subspaces, i.e., iterative multiplication by Jk ), such as
GMRES.

• Convergence rate to a non-degenerate root (th. 11.3): linear if η is sufficiently small; superlinear if ηk → 0; quadratic if ηk = O(‖rk‖).

Broyden’s method (secant or quasi-Newton method)
• Constructs an approximation to the Jacobian over iterations.

• Write sk = xk+1 − xk , yk = rk+1 − rk : yk = Jk sk + O(ksk k2 ) (Taylor’s th.).

• We require the updated Jacobian approximation Bk+1 to satisfy the secant equation yk = Bk+1 sk and to minimize ‖B − Bk‖₂, i.e., the smallest possible update that satisfies the secant eq.: Bk+1 = Bk + [(yk − Bk sk) skᵀ]/(skᵀ sk) (from lemma 11.4).

• Convergence rate: superlinear if the initial point x0 and Jacobian B0 are close to the root x∗
and its Jacobian J(x∗ ), resp.; the latter condition can be crucial, but is difficult to guarantee
in practice.

• Limited-memory versions exist for large n.

• For a scalar equation (n = 1):
Newton's method: xk+1 = xk − r(xk)/r′(xk). Secant method: xk+1 = xk − r(xk)/Bk with Bk = [r(xk) − r(xk−1)]/(xk − xk−1), independent of Bk−1.
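A minimal sketch of Broyden's update on the same 2 × 2 system as above (B0 = I for simplicity; a finite-difference Jacobian is a better start when affordable, and without a line search the pure method may fail far from a root):

    import numpy as np

    def broyden(r, x, tol=1e-10, iters=100):
        B = np.eye(x.size)                # crude initial Jacobian approximation
        rx = r(x)
        for _ in range(iters):
            if np.linalg.norm(rx) <= tol:
                break
            s = np.linalg.solve(B, -rx)   # quasi-Newton step
            x = x + s
            y = r(x) - rx
            B = B + np.outer(y - B @ s, s) / (s @ s)   # secant update
            rx = r(x)
        return x

    r = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4, np.exp(x[0]) - x[1]])
    print(broyden(r, np.array([1.0, 2.0])))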

Practical methods
• Line search and trust region techniques to ensure convergence away from a root.

• Merit function: a function f : Rn → R that indicates how close x is to a root, so that by


decreasing f (x) we approach a root. In optimization, f is the objective function. In root-
finding, there is no unique (and fully satisfactory) way to define a merit function. The most
widely used is f (x) = 12 kr(x)k22 .

• Problem: each root (r(x) = 0) is a local minimizer of f but not vice versa, so local minima
that are not roots can attract the algorithm.
If a local minimizer x is not a root then J(x) is singular (Pf.: ∇f(x) = J(x)ᵀr(x) = 0 with r(x) ≠ 0 ⟹ J(x) singular).

• Line search methods: xk+1 = xk + αk pk with step length αk along direction pk .

– We want descent directions for f (pTk ∇f (xk ) < 0); step length chosen as in ch. 3.
– Zoutendijk’s th.: descent directions + Wolfe conditions + Lipschitz continuous J ⇒
Σ_{k≥0} cos²θk ‖Jkᵀrk‖² < ∞.
– So if cos θk ≥ δ for constant δ ∈ (0, 1) and all k sufficiently large ⇒ ∇fk = JTk rk → 0; and
if kJ(x)−1 k is bounded ⇒ rk → 0.

– If well defined, the Newton step is a descent direction for f for rk ≠ 0 (Pf.: pkᵀ∇fk = (Jk pk)ᵀrk = −‖rk‖² < 0). But cos θk = −pkᵀ∇fk/(‖pk‖‖∇fk‖) = ‖rk‖²/(‖Jk⁻¹rk‖‖Jkᵀrk‖) ≥ 1/(‖Jkᵀ‖‖Jk⁻¹‖) = 1/κ(Jk), so a large condition number causes poor performance (search direction almost ⊥ ∇fk).
– One modification of Newton's direction is (JkᵀJk + τk I)pk = −Jkᵀrk; a large enough τk ensures cos θk is bounded away from 0 because τk → ∞ ⇒ pk ∝ −Jkᵀrk.
– Inexact Newton steps do not compromise global convergence: if at each step ‖rk + Jk pk‖ ≤ ηk‖rk‖ for ηk ∈ [0, η] and η ∈ [0, 1), then cos θk ≥ (1 − η)/(2κ(Jk)).

• Trust region methods: min mk(p) = fk + ∇fkᵀp + ½pᵀBk p s.t. ‖p‖ ≤ ∆k.
– Algorithm 4.1 from ch. 4 applied to f(x) = ½‖r(x)‖₂² using Bk = JkᵀJk as approximate Hessian in the model mk, i.e., linearize r(p) ≈ rk + Jk p.
– The exact solution has the form pk = −(JkᵀJk + λk I)⁻¹ Jkᵀrk for some λk ≥ 0, with λk = 0 if the unconstrained solution is in the trust region. The Levenberg-Marquardt algorithm searches for such λk.
– Global convergence (to non-degenerate roots) under mild conditions.
– Quadratic convergence rate if the trust region subproblem is solved exactly for all k suffi-
ciently large.

Continuation/homotopy/path-following methods
• Problem of Newton-based methods: unless J is nonsingular in the region of interest, they may
converge to a local minimizer of the merit function rather than a root.
• Continuation methods: instead of dealing with the original problem r(x) = 0 directly, establish
a continuous sequence of root-finding problems that converges to the original problem but starts
from an easy problem; then solve each problem in the sequence, tracking the root as we move
from the easy to the original problem.
• Homotopy map: H(x, λ) = λ r(x) + (1 − λ)(x − a), where a ∈ Rⁿ is fixed and λ ∈ R; λ = 1 gives the original problem, λ = 0 an easy problem with solution x = a.

If ∂H/∂x (x, λ) is nonsingular then H(x, λ) = 0 defines a continuous curve x(λ), the zero path.
By the implicit function theorem (th. A.2); ex.: x² − y + 1 = 0, Ax + By + c = 0 ⇒ y = g(x) or x = g(y) locally.

We want to follow the path numerically. x = a plays the role of initial iterate.
• Naive approach: start from λ = 0, x = a; gradually increase λ from 0 to 1 and solve H(x, λ)
= 0 using as initial x the one from the previous λ value; stop after solving for λ = 1.

But at a turning point λT, if we increase λ we lose track of the root x. To follow the zero path smoothly we need to allow λ to decrease and even to roam outside [0, 1]. So we follow the path along the arc length s (instead of λ).

• Arc-length parametrization of the zero path: (x(s), λ(s)), where s = arc length measured from (a, 0) at s = 0. Since H(x(s), λ(s)) = 0 ∀s ≥ 0, its total derivative wrt s is also 0:

dH/ds = (∂H/∂x)(x, λ) ẋ + (∂H/∂λ)(x, λ) λ̇ = 0    (n equations),

where (ẋ, λ̇) = (dx/ds, dλ/ds) is the tangent vector to the zero path.

• To calculate the tangent vector at a point (x, λ) notice that:
1. (ẋ, λ̇) lies in the null space of the n × (n + 1) matrix (∂H/∂x, ∂H/∂λ); the null space is 1D if this matrix has full rank. It can be obtained from the QR factorization of (∂H/∂x, ∂H/∂λ).
2. Its length is 1 = ‖ẋ(s)‖² + |λ̇(s)|² ∀s ≥ 0 (s is arc length, so unit speed).
3. We need to choose the correct sign to ensure we move forward along the path; a heuristic that works well is to choose the sign so that the tangent vector makes an angle of less than π/2 with the previous tangent.
Procedure 11.7.
• Following the path can now be done by different methods:
– By solving an initial-value first-order ODE: dH/ds = 0 for s ≥ 0 with (x(0), λ(0)) = (a, 0), terminating at an s for which λ(s) = 1.
– By a predictor-corrector method (fig. 11.5): at each iteration k:
1. Predictor step of length ǫ along the tangent vector: (xP, λP) = (xk, λk) + ǫ(ẋk, λ̇k). This doesn't lie on the zero path, but is close to it.
2. Corrector step: bring (xP, λP) back to the zero path with a few Newton iterations (a sketch follows below).
(✐ What happens if running Newton’s method at λ = 1 from some initial x?)
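A minimal predictor-corrector sketch (our own simplifications: the tangent is taken as the SVD null vector, and the corrector runs Newton on x with λ frozen rather than the usual orthogonal correction):

    import numpy as np

    def homotopy_solve(r, J, a, eps=0.05):
        # follow H(x,lam) = lam*r(x) + (1-lam)*(x-a) = 0 from (a,0) towards lam = 1
        n = a.size
        x, lam = a.copy(), 0.0
        t_prev = np.zeros(n + 1); t_prev[-1] = 1.0     # initially head towards lam = 1
        while lam < 1.0:                               # stops once lam reaches 1
            Hx = lam * J(x) + (1 - lam) * np.eye(n)    # dH/dx
            Hlam = (r(x) - (x - a)).reshape(-1, 1)     # dH/dlam
            t = np.linalg.svd(np.hstack((Hx, Hlam)))[2][-1]   # unit null vector
            if t @ t_prev < 0:                         # keep moving forward on the path
                t = -t
            x, lam, t_prev = x + eps * t[:n], lam + eps * t[-1], t   # predictor
            for _ in range(3):                         # corrector: Newton in x, lam fixed
                Hx = lam * J(x) + (1 - lam) * np.eye(n)
                x = x + np.linalg.solve(Hx, -(lam * r(x) + (1 - lam) * (x - a)))
        return x

    r = lambda x: np.array([x[0] ** 2 - 2.0])          # root sqrt(2)
    J = lambda x: np.array([[2 * x[0]]])
    print(homotopy_solve(r, J, np.array([1.0])))       # ~1.4142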

• The tangent vector is well defined if the matrix (∂H/∂x, ∂H/∂λ) has full rank. This is guaranteed under certain assumptions:
Th. 11.9: r twice cont. differentiable ⇒ for almost all a ∈ Rⁿ there is a zero path from (a, 0) along which the matrix (∂H/∂x, ∂H/∂λ) has full rank. If this path is bounded for λ ∈ [0, 1) then it has an accumulation point (x, 1) with r(x) = 0. If J(x) is non-singular, the zero path between (a, 0) and (x, 1) has finite length.
Thus, unless we are unfortunate in the choice of a, the continuation algorithm will find a path
that either diverges or leads to a root x if J(x) is non-singular. However, divergence can occur
in practice. ex. 11.3

• Continuation methods can fail in practice with even simple problems and they require consid-
erable computation; but they are generally more reliable than merit-function methods.
• Related algorithms:
– For constrained optimization: quadratic-penalty, log-barrier, interior-point.
– For (heuristic) global optimization: deterministic annealing. Ex.: h(x, λ) = λ f(x) + (1 − λ) g(x), where λ = 1 gives the original objective and λ = 0 an easy problem, e.g. quadratic.

Review: nonlinear equations
Problem: find roots of n equations r(x) = 0 in n unknowns.
• Degenerate roots (singular Jacobian) cause troubles (e.g. slower convergence).
• Similar to optimization but harder: most methods only converge if starting sufficiently near a root.
– Newton’s method : linsys given by Jacobian of r (first deriv. only), quadratic convergence. Inexact steps
may be used.
– Broyden’s method : quasi-Newton method, approximates Jacobian through differences in r and x, super-
linear convergence. Limited-memory versions exist.
• Merit function f(x) = ½‖r(x)‖₂² turns root-finding into minimization (so we are guided towards minima), but contains minima that are not roots (∇f(x) = J(x)ᵀr(x) = 0 but r(x) ≠ 0, i.e., singular Jacobian).

– Line search and trust region strategies may be used. Results obtained in earlier chapters (e.g. for
convergence) carry over appropriately.
• Root finding and minimization: related but not equivalent:
– From root finding r(x) = 0 to minimization: the merit function ‖r(x)‖² contains all the roots of r(x) = 0
as global minimizers, but also contains local minimizers that are not roots.
– From minimization min f (x) to root finding: the nonlinear equations ∇f (x) = 0 contain all the min-
imizers of f (x) as roots, but also contain other roots that are not minimizers (maximizers and saddle
points).
• Continuation/homotopy methods construct a family of problems parameterized over λ ∈ [0, 1] so that λ = 0 is
easy to solve (e.g. x − a = 0) and λ = 1 is the original problem (r(x) = 0).
– This implicitly defines a path in (x, λ) that most times is continuous, bounded and goes from (a, 0) to
(x∗ , 1) where x∗ is a root of r.
– We follow the path numerically (solving an ODE, or solving a sequence of root-finding problems).
– More robust than other methods, but higher computational cost.

Summary of methods
Assumptions:
• Typical behavior.

• Evaluation costs:
f's type     f(x)    ∇f(x)   ∇²f(x)
quadratic    O(n²)   O(n²)   O(1)
other        O(n)    O(n)    O(n²)

• Appropriate conditions for: the line search (e.g. Wolfe) or trust region strategy; the functions, etc. (e.g. Lipschitz continuity,
solution with pd Hessian).

(Space/time = cost of one iteration.)

Method                                  Global conv.  Convergence rate                            Space    Time          Deriv. order  Quadratic f (time)
Steepest descent                        Y             linear                                      O(n)     O(n)          1             ∞
Newton, pure                            N             quadratic                                   O(n²)    O(n³)         2             O(n³)
Newton, modified-Hessian                Y             quadratic                                   O(n²)    O(n³)         2             O(n³)
Conjugate-gradient, Fletcher-Reeves     Y             linear                                      O(n)     O(n)          1             O(n³)
Conjugate-gradient, Polak-Ribière       N             linear                                      O(n)     O(n)          1             O(n³)
Quasi-Newton (DFP, BFGS, SR1)           N             superlinear                                 O(n²)    O(n²)         1             O(n³)
Large-scale, Newton-CG                  Y             linear to quadratic (dep. on forcing seq.)  O(n²)    O(n²)–O(n³)   2             at least O(n³), but finite
Large-scale, L-BFGS                     N             linear                                      O(nm)    O(nm)         1             ∞
Derivative-free, model-based            N             ≤ linear                                    O(n³)    O(n⁴)         0             O(n⁶)
Derivative-free, coordinate descent     N             ≤ linear                                    O(1)     O(n)          0             ∞
Derivative-free, Nelder-Mead            N             ≤ linear                                    O(n²)    O(n)          0             ∞
Derivative-free, conjugate directions   N             ≤ linear                                    O(n²)    O(n²)         0             O(n⁴)
Least-squares, Gauss-Newton             N             linear                                      O(n²)    O(n³)         1             O(n³)
Least-squares, Levenberg-Marquardt      Y             linear                                      O(n²)    O(n³)         1             O(n³)

In terms of convergence rate alone, we can rank the methods as:


derivative-free < steepest descent < conjugate-gradient < L-BFGS < Least-squares < quasi-Newton < Newton.
12 Theory of constrained optimization
Optimization problem:

min_{x∈Rⁿ} f(x)   s.t.   ci(x) = 0, i ∈ E (equality constraints);   ci(x) ≥ 0, i ∈ I (inequality constraints)

with objective function f and finite sets of indices E, I; or equivalently min_{x∈Ω} f(x), where Ω = {x ∈ Rⁿ: ci(x) = 0, i ∈ E; ci(x) ≥ 0, i ∈ I}.
• Ω is the feasible set.
• The objective function f and the constraints ci are all smooth.
• x* is a local solution iff x* ∈ Ω and ∃ neighborhood N of x*: f(x*) ≤ f(x) for x ∈ N ∩ Ω (note both x* ∈ Ω and x ∈ N ∩ Ω).
• x* is a strict local solution iff x* ∈ Ω and ∃ neigh. N of x*: f(x*) < f(x) for x ∈ N ∩ Ω, x ≠ x*.
• x* is an isolated local solution iff x* ∈ Ω and ∃ neigh. N of x*: x* is the only local minimizer in N ∩ Ω.
• At a feasible point x, the inequality constraint ci (i ∈ I) is:
– active iff ci (x) = 0 (x is on the boundary for that constraint)
– inactive iff ci (x) > 0 (x is interior point for that constraint).
✐ What happens if we define the feasible set as ci (x) > 0 rather than ci (x) ≥ 0?

• For inequality constraints, the constraint normal ∇ci (x) points towards the feasible region and
is ⊥ to the contour ci (x) = 0. For equality constraints, ∇ci (x) is ⊥ to the contour ci (x) = 0.
• Mathematical characterization of solutions for unconstrained optimization (reminder):
– Necessary conditions: x∗ local minimizer of f ⇒ ∇f (x∗ ) = 0, ∇2 f (x∗ ) psd.
– Sufficient conditions: ∇f (x∗ ) = 0, ∇2 f (x∗ ) pd ⇒ x∗ is a strong local minimizer of f .
Here we derive similar conditions for constrained optimization problems. Let’s see some exam-
ples.
• Local and global solutions: constraining can decrease or increase the number of optimizers:
– min_{x∈Rⁿ} ‖x‖₂²: unconstrained: single solution (x = 0); constrained s.t. ‖x‖₂² ≥ 1: infinitely many solutions (‖x‖₂ = 1); constrained s.t. c1(x) = 0: several solutions.
– min_{x∈R} sin x: unconstrained: infinitely many solutions (x = −π/2 + 2kπ, k ∈ Z); constrained s.t. 0 ≤ x ≤ 2π: single solution (x = 3π/2).
• Smoothness of both f and the constraints is important since then we can predict what happens
near a point.
Some apparently nonsmooth problems (involving ‖·‖₁, ‖·‖∞) can be reformulated as smooth:
1. A nonsmooth function by adding variables: ‖x‖₁ = Σ_{i=1}^n |xi| = Σ_{i=1}^n (xi⁺ + xi⁻) by splitting xi = xi⁺ − xi⁻ into nonnegative and nonpositive parts, where xi⁺ = max(xi, 0) ≥ 0 and xi⁻ = max(−xi, 0) ≥ 0. (✐ Do we need to force (x⁺)ᵀx⁻ = 0?)

2. A nonsmooth constraint as several smooth constraints: min_{x1∈R} |x1| ⇔ min_{x∈R²} x2 s.t. x2 = |x1| ⇔ min_{x∈R²} x2 s.t. x2 ≥ |x1| ⇔ min_{x∈R²} x2 s.t. −x2 ≤ x1 ≤ x2.
3. A nonsmooth objective by adding variables: for f(x) = max(x², x): min_{x∈R} f(x) ⇔ min_{t,x} t s.t. t = f(x) ⇔ min_{t,x} t s.t. t = max(x², x) ⇔ min_{t,x} t s.t. t ≥ x², t ≥ x.
Intuitive derivation of the KKT conditions
Idea: if f and {ci } are differentiable, then they are all approximately linear sufficiently near any
(feasible) point x, so they can be characterized by their gradient vectors ∇f (x) and {∇ci (x)}.

Case 1: a single equality constraint c1(x) = 0 (figure: contours of f, with ∇f and ∇c1 drawn along the contour c1(x) = 0).
• At x*, ∇c1(x*) ∥ ∇f(x*), i.e., ∇f(x*) = λ1* ∇c1(x*) for some λ1* ∈ R.
• This is necessary but not sufficient, since it also holds at the maximum.
• The sign of λ1* can be + or − (e.g. use −c1 instead of c1).
• Compare with exact line search.
Pf.: consider a feasible point at x, i.e., c1 (x) = 0. An infinitesimal move to x + d:


• retains feasibility if ∇c1 (x)T d = 0 (Taylor’s th.: 0 = c1 (x + d) ≈ c1 (x) + ∇c1 (x)T d) (contour line)

• decreases f if ∇f (x)T d < 0 (by Taylor’s th.: 0 > f (x + d) − f (x) ≈ ∇f (x)T d) (descent direction).

Thus if no improvement is possible then there cannot be a direction d such that ∇c1 (x)T d = 0 and
∇f (x)T d < 0 ⇒ ∇f (x) = λ1 ∇c1 (x) for some λ1 ∈ R.

Equivalent formulation in terms of the Lagrangian function L(x, λ1) = f(x) − λ1 c1(x) (λ1: Lagrange multiplier): at a solution x*, ∃λ1* ∈ R: ∇x L(x*, λ1*) = 0 (and also c1(x*) = 0).
Idea: to optimize equality-constrained problems, search for stationary points of the Lagrangian.
Case 2: a single inequality constraint c1 (x) ≥ 0. The solution is the same, but now the sign
of λ∗1 matters: at x∗ , ∇f (x∗ ) = λ∗1 ∇c1 (x∗ ) for λ∗1 ≥ 0.
Pf.: consider a feasible point at x i.e., c1 (x) ≥ 0. An infinitesimal move to x + d:

• retains feasibility if c1 (x) + ∇c1 (x)T d ≥ 0 (by Taylor’s th.: 0 ≤ c1 (x + d) ≈ c1 (x) + ∇c1 (x)T d)

• decreases f if ∇f (x)T d < 0 as before.

If no improvement is possible:
1. Interior point c1 (x) > 0: any small enough d satisfies feasibility ⇒ ∇f (x) =
0 (this is the unconstrained case).

2. Boundary point c1(x) = 0: there cannot be a direction such that ∇f(x)ᵀd < 0 and ∇c1(x)ᵀd ≥ 0 ⇒ ∇f(x) and ∇c1(x) must point in the same direction ⇒ ∇f(x) = λ1 ∇c1(x) for some λ1 ≥ 0.
Equivalent formulation: at a solution x*, ∃λ1* ≥ 0: ∇x L(x*, λ1*) = 0 and λ1* c1(x*) = 0 (and also c1(x*) ≥ 0). The latter is a complementarity condition: either λ1* > 0 and c1 is active, or λ1* = 0 and c1 is inactive.

Case 3: two inequality constraints c1(x) ≥ 0, c2(x) ≥ 0. Consider the case of a point x for which both constraints are active, i.e., c1(x) = c2(x) = 0. A direction d is a feasible descent direction to first order if ∇f(x)ᵀd < 0 and ∇ci(x)ᵀd ≥ 0, i ∈ I (the latter being the intersection of the half-spaces defined by the ∇ci(x)).
At a solution (and ignoring the case where the ∇ci are parallel), we cannot have any feasible descent direction d ⇒ ∇f = λ1∇c1 + λ2∇c2 with λ1, λ2 ≥ 0, or equivalently ∇f is in the positive quadrant of (∇c1, ∇c2).
Another example (figure with points x, z, y, w):
– At x: no d satisfies ∇c1(x)ᵀd ≥ 0, ∇c2(x)ᵀd ≥ 0, ∇f(x)ᵀd < 0.
– At z: dz satisfies ∇c1(z)ᵀd ≥ 0, ∇c2(z)ᵀd ≥ 0, ∇f(z)ᵀd < 0.
– At y: dy satisfies c1(y) + ∇c1(y)ᵀd ≥ 0 (c1 not active), ∇c2(y)ᵀd ≥ 0, ∇f(y)ᵀd < 0.
– At w: dw satisfies c1(w) + ∇c1(w)ᵀd ≥ 0, c2(w) + ∇c2(w)ᵀd ≥ 0 (c1, c2 not active), ∇f(w)ᵀd < 0.

Equivalent formulation in general: L(x, λ) = f(x) − Σi λi ci(x). At a solution x*, ∃λ* ≥ 0 (i.e., λi* ≥ 0 ∀i): ∇x L(x*, λ*) = 0 and λi* ci(x*) = 0 ∀i (complementarity condition) (and also ci(x*) ≥ 0 ∀i).

First-order necessary (Karush-Kuhn-Tucker) conditions for optimality
They relate the gradient of f and of the constraints at a solution. Consider the constrained optimization problem min_{x∈Rⁿ} f(x) s.t. ci(x) = 0, i ∈ E; ci(x) ≥ 0, i ∈ I.
• |E ∪ I| = m constraints.
• Lagrangian L(x, λ) = f(x) − Σ_{i∈E∪I} λi ci(x).
• Active set at a feasible point x: A(x) = E ∪ {i ∈ I: ci(x) = 0}.
Matrix of active constraint gradients at x: A(x) = [∇ci(x)ᵀ]_{i∈A(x)} of |A| × n.
• Degenerate constraint behavior: e.g. c1²(x) is equivalent to c1(x) as an equality constraint, but ∇(c1²) = 2c1∇c1 = 0 at any feasible point, which disables the condition ∇f = λ1∇c1. We can avoid degenerate behavior by requiring the following constraint qualification:
– Def. 12.4: given x*, A(x*), the linear independence constraint qualification (LICQ) holds iff the set of active constraint gradients {∇ci(x*), i ∈ A(x*)} is l.i. (which implies ∇ci(x*) ≠ 0). Equivalently, A(x*) has full row rank.
Other constraint qualifications are possible in th. 12.1, in partic. "all active constraints are linear".
Th. 12.1 (KKT conditions): x* local solution of the optimization problem, LICQ holds at x* ⇒ ∃!λ* ∈ Rᵐ (Lagrange multipliers) such that (ex. 12.6):
a) ∇x L(x*, λ*) = 0    (n eqs.)
b) ci(x*) = 0 ∀i ∈ E
c) ci(x*) ≥ 0 ∀i ∈ I
d) λi* ≥ 0 ∀i ∈ I
e) λi* ci(x*) = 0 ∀i ∈ E ∪ I    (m complementarity eqs.)
Notes:
• I = ∅: KKT ⇔ ∇L(x∗ , λ∗ ) = 0 (∇ wrt x, λ).
In principle solvable by writing x = (xa , φ(xa ))T and solving the unconstrained problem min f (xa ).

• Given a solution x∗ , its associated Lagrange multipliers λ∗ are 0 for the inactive constraints
and (A(x∗ )A(x∗ )T )−1 A(x∗ )∇f (x∗ ) for the active ones. Pf.: solve for λ∗ in KKT a).
• f (x∗ ) = L(x∗ , λ∗ ) (from the complementarity condition).

• Strict complementarity: exactly one of λ∗i and ci (x∗ ) is zero ∀i ∈ I. Easier for some algorithms.

Sensitivity of f wrt constraints: λi* indicates how hard f is pushing or pulling against ci: if λi* = 0, f does not change when perturbing ci; if λi* ≠ 0, f changes proportionately to λi*.
Intuitive proof: infinitesimal perturbation of ci (inequality constraint): ci(x) ≥ ǫ. Suppose ǫ is sufficiently small that the perturbed solution x*(ǫ) still has the same active constraints and that the Lagrange multipliers are not affected much. Then (⋄: Taylor's th. to first order of f, {cj} at x*(ǫ)):

ǫ = ci(x*(ǫ)) − ci(x*) ≈⋄ (x*(ǫ) − x*)ᵀ∇ci(x*)
0 = cj(x*(ǫ)) − cj(x*) ≈⋄ (x*(ǫ) − x*)ᵀ∇cj(x*)   ∀j ≠ i, j ∈ A(x*)

⇒ f(x*(ǫ)) − f(x*) ≈⋄ (x*(ǫ) − x*)ᵀ∇f(x*) =KKT Σ_{j∈A(x*)} λj* (x*(ǫ) − x*)ᵀ∇cj(x*) ≈ λi* ǫ ⟹ df(x*(ǫ))/dǫ = λi* (as ǫ → 0).

An active constraint ci at x∗ is strongly active if λ∗i > 0 and weakly active if λ∗i = 0.
Second order conditions
• Set of linearized feasible directions F(x) at a feasible point x, which is a cone (def. 12.3):

F(x) = {w ∈ Rⁿ: wᵀ∇ci(x) = 0 ∀i ∈ E;  wᵀ∇ci(x) ≥ 0 ∀i ∈ I ∩ A(x)}.

These are the directions along which we can move from x, infinitesimally, and remain feasible.
If LICQ holds, F (x) is the tangent cone to the feasible set at x.
Other examples of tangent cones F (x) in 2D:

∗ ∗
• If the first-order conditions hold at x
( , then an infinitesimal move along any vector w ∈ F (x )
either increases f , if wT ∇f (x∗ ) > 0 (“decided”)
remains feasible and, to first order, .
(✐ Why can it not decrease f ?) or keeps it constant, if wT ∇f (x∗ ) = 0 (“undecided”).

• The second-order conditions give information in the undecided directions wT ∇f (x∗ ) = 0, by


examining the curvature along them (✐ what are the undecided directions in the unconstrained case?).
• Given F(x*) and some Lagrange multiplier vector λ* satisfying the KKT conditions, define a subset of it, the critical cone C(x*, λ*):

C(x*, λ*) = {w ∈ F(x*): ∇ci(x*)ᵀw = 0 ∀i ∈ A(x*) with λi* > 0} ⊂ F(x*)

or equivalently

C(x*, λ*) = {w ∈ Rⁿ: wᵀ∇ci(x*) = 0 ∀i ∈ E;  wᵀ∇ci(x*) = 0 ∀i ∈ I ∩ A(x*) with λi* > 0;  wᵀ∇ci(x*) ≥ 0 ∀i ∈ I ∩ A(x*) with λi* = 0}.

C(x*, λ*) contains the undecided directions for F(x):

w ∈ C(x*, λ*) ⟹ wᵀ∇f(x*) =KKT Σ_{i∈E∪I} λi* wᵀ∇ci(x*) = 0 (since either λi* = 0 or wᵀ∇ci(x*) = 0).

• Second-order necessary conditions (th. 12.5): x* local solution, LICQ holds, KKT conditions hold with Lagrange multiplier vector λ* ⇒ wᵀ∇²xx L(x*, λ*) w ≥ 0 ∀w ∈ C(x*, λ*).
• Second-order sufficient conditions (th. 12.6): x* ∈ Rⁿ feasible point, KKT conditions hold with Lagrange multiplier λ*, wᵀ∇²xx L(x*, λ*) w > 0 ∀w ∈ C(x*, λ*), w ≠ 0 ⇒ x* is a strict local solution (ex. 12.8, 12.9).


(✐ What happens with F (x∗ ), C(x∗ ) and ∇2xx L(x∗ , λ∗ ) at an interior KKT point x∗ ?)
• Weaker but useful statement of the second-order conditions: assume LICQ and strict com-
plementarity hold, then C(x∗ , λ∗ ) is the null space of A(x∗ ). If Z contains a basis of it,
then the necessary and the sufficient conditions reduce to the projected (or reduced) Hessian
ZT ∇2xx L(x∗ , λ∗ ) Z being psd or pd, resp.
(✐ When are the first-order conditions sufficient? Consider the unconstrained vs constrained case.)

Duality
• Dual problem: constructed from the primal problem (objective and constraints) and related to
it in certain ways (possibly easier to solve computationally, gives lower bound on the optimal
primal objective). Applies to convex problems.

• Consider only inequalities with f and −ci all convex (so convex problem):
Primal problem: minx∈Rn f (x) s.t. c(x) ≥ 0 with c(x) = (c1 (x), . . . , cm (x))T , Lagrangian
L(x, λ) = f (x) − λT c(x). Note L(·, λ) is convex for any λ ≥ 0.

• Dual problem: maxλ∈Rm q(λ) s.t. λ ≥ 0 with dual objective function q: Rm → R defined as
q(λ) = inf x L(x, λ) with domain D = {λ: q(λ) > −∞}. ex. 12.10

• Th. 12.10: q is concave and its domain D is convex. So the dual problem is convex. Proofs
Th. 12.11 (weak duality): x̄ feasible for the primal, λ̄ feasible for the dual (i.e., c(x̄), λ̄ ≥ 0) ⇒
q(λ̄) ≤ f (x̄).
Th. 12.12: x̄ is a solution of the primal; f, −c1 , . . . , −cm are convex in Rn and diff. at x̄ ⇒ any
λ̄ for which (x̄, λ̄) satisfies the primal KKT conditions is a solution of the dual.
Th. 12.13: x̄ is a solution of the primal at which LICQ holds; f, −c1 , . . . , −cm are convex
and cont. diff. in Rn ; suppose that λ̂ is a solution of the dual, that x̂ = arg inf x L(x, λ̂) and
that L(·, λ̂) is strictly convex ⇒ x̄ = x̂ (i.e., x̂ is the unique solution of the primal) and
f (x̂) = L(x̂, λ̂) = q(λ̂).

• Wolfe dual: maxx,λ L(x, λ) s.t. ∇x L(x, λ) = 0, λ ≥ 0


A slightly different form of duality that is convenient for computations.

• Th. 12.14: (x̄, λ̄) is a solution pair of the primal at which LICQ holds; f, −c1 , . . . , −cm are
convex and cont. diff. in Rn ⇒ (x̄, λ̄) is a solution of the Wolfe dual.

• Examples:

– LP: minx cT x s.t. Ax − b ≥ 0.


Dual: maxλ bT λ s.t. AT λ = c, λ ≥ 0.
Wolfe dual: maxx,λ cT x − λT (Ax − b) s.t. AT λ = c, λ ≥ 0 (equivalent to the dual).
– QP with symmetric pd G: minx 12 xT Gx + cT x s.t. Ax − b ≥ 0.
Dual: maxλ − 12 (AT λ − c)T G−1 (AT λ − c) + bT λ s.t. λ ≥ 0.
Wolfe dual: maxx,λ 21 xT Gx + cT x − λT (Ax − b) s.t. Gx + c − AT λ = 0, λ ≥ 0 (equiva-
lent to the dual if G is pd, by eliminating x; but requires G to be psd only).

Review: theory of constrained optimization

Constrained optimization problem: min_{x∈Rⁿ} f(x) s.t. ci(x) = 0, i ∈ E; ci(x) ≥ 0, i ∈ I (m constraints).
Assuming the derivatives ∇f (x∗ ), ∇2 f (x∗ ) exist and are continuous in a neighborhood of x∗ :
First-order necessary (KKT) conditions:
• Lagrangian L(x, λ) = f(x) − Σ_{i∈E∪I} λi ci(x).
• LICQ: active constraint gradients are l.i.
• Unconstrained opt: ∇f(x*) = 0 (n eqs., n unknowns).
• Constrained opt: ∇x L(x*, λ*) = 0; ci(x*) = 0, i ∈ E; λi* ci(x*) = 0, i ∈ E ∪ I (m + n eqs., m + n unknowns), with the additional constraints ci(x*) ≥ 0, i ∈ I, and λi* ≥ 0, i ∈ I.

Second-order conditions:
• Critical cone containing the undecided directions:

C(x*, λ*) = {w ∈ Rⁿ: wᵀ∇ci(x*) = 0 ∀i ∈ E;  wᵀ∇ci(x*) = 0 ∀i ∈ I ∩ A(x*) with λi* > 0;  wᵀ∇ci(x*) ≥ 0 ∀i ∈ I ∩ A(x*) with λi* = 0}.

• Unconstrained opt:
– Necessary: x∗ is a local minimizer ⇒ ∇2 f (x∗ ) psd.
– Sufficient: ∇f (x∗ ) = 0, ∇2 f (x∗ ) pd ⇒ x∗ is a strict local minimizer.
• Constrained opt:
– Necessary: (x∗ , λ∗ ) local solution + LICQ + KKT ⇒ wT ∇2xx L(x∗ , λ∗ )w ≥ 0 ∀w ∈ C(x∗ , λ∗ ).
– Sufficient: (x∗ , λ∗ ) KKT point, wT ∇2xx L(x∗ , λ∗ )w > 0 ∀w ∈ C(x∗ , λ∗ ) \ {0} ⇒ x∗ strict local sol.
⇒ Seek solutions of KKT system, then check whether they are really minimizers (second-order conditions).
Duality:
• Primal problem: minx∈Rn f (x) s.t. c(x) ≥ 0
with Lagrangian L(x, λ) = f (x) − λT c(x).
• Dual problem: maxλ∈Rm q(λ) s.t. λ ≥ 0
with q(λ) = inf x L(x, λ) with domain D = {λ: q(λ) > −∞}.
• Wolfe dual: max_{x,λ} L(x, λ) s.t. ∇x L(x, λ) = 0, λ ≥ 0 (for LP: max_{x,λ} cᵀx − λᵀ(Ax − b) s.t. Aᵀλ = c, λ ≥ 0).
Loosely speaking, the primal objective is lower bounded by the dual objective and they touch at the (pri-
mal,dual) solution, so that the dual variables give the Lagrange multipliers of the primal. Sometimes, solving
the dual is easier. Particularly useful with LP, convex QP and other convex problems.

13 Linear programming: the simplex method
Linear program (LP)
• Linear objective function, linear constraints (equality + inequality); feasible set: polytope (= convex, connected set with flat faces); contours of objective function: hyperplanes; solution: either none (feasible set is empty or problem is unbounded), one (a vertex) or an infinite number (edge, face, etc.). (✐ What happens if there are no inequalities?)
• Standard form LP: minx cᵀx s.t. Ax = b, x ≥ 0, where c, x ∈ Rⁿ, b ∈ Rᵐ, A of m × n. Assume m < n and A has full row rank (otherwise Ax = b contains redundant rows, is infeasible, or defines a unique point).
• Techniques for transformation to standard form (generally applicable beyond LP):
– max cᵀx ⇔ −min(−c)ᵀx.
– Unbounded variable x: split x into nonnegative and nonpositive parts: x = x⁺ − x⁻, where x⁺ = max(x, 0) ≥ 0 and x⁻ = max(−x, 0) ≥ 0. (✐ Do we need to force (x⁺)ᵀx⁻ = 0?)
– Ax ≤ b: add slack variables ⇔ Ax + y = b, y ≥ 0. Ax ≥ b: add surplus variables ⇔ Ax − y = b, y ≥ 0.
Example: min cᵀx s.t. Ax ≥ b ⇔ min (c, −c, 0)ᵀ(x⁺, x⁻, z) s.t. (A −A −I)(x⁺; x⁻; z) = b, (x⁺, x⁻, z) ≥ 0.

• LP is a very special case of constrained optimization, but popular because of its simplicity and
the availability of software.
• Commercial software accepts LP in non-standard form.
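For instance, SciPy's linprog takes min cᵀx s.t. A_ub x ≤ b_ub, A_eq x = b_eq plus bounds, and converts internally (a usage sketch on a made-up 2-variable instance; the default bounds are x ≥ 0):

    import numpy as np
    from scipy.optimize import linprog

    c = np.array([-1.0, -2.0])                    # max x1 + 2 x2 as min of the negative
    A_ub = np.array([[1.0, 1.0], [1.0, -1.0]])    # x1 + x2 <= 4,  x1 - x2 <= 2
    b_ub = np.array([4.0, 2.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub)
    print(res.x, res.fun)                         # vertex (0, 4), objective -8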

Optimality conditions
LP is a convex optimization problem ⇒ any minimizer is a global minimizer; the KKT conditions are necessary and also sufficient; LICQ isn't necessary. (✐ What happens with the second-order conditions?)
KKT conditions: L(x, λ, s) = cᵀx − λᵀ(Ax − b) − sᵀx (λ, s: Lagrange multipliers). If x is a solution ⇒ ∃! λ ∈ Rᵐ, s ∈ Rⁿ:
a) Aᵀλ + s = c
b) Ax = b
c) x ≥ 0
d) s ≥ 0
e) xi si = 0, i = 1, . . . , n ⇔ xᵀs = 0.
From a), b) and e): cᵀx = (Aᵀλ + s)ᵀx = (Ax)ᵀλ = bᵀλ.
The KKT conditions are also sufficient. Pf.: let x̄ be another feasible point ⇔ Ax̄ = b, x̄ ≥ 0. Then cᵀx̄ = (Aᵀλ + s)ᵀx̄ = bᵀλ + x̄ᵀs ≥ bᵀλ = cᵀx (using x̄, s ≥ 0). And x̄ optimal ⇔ x̄ᵀs = 0.
The dual problem
• Primal problem: min cᵀx s.t. Ax = b, x ≥ 0.
Dual problem: max bᵀλ s.t. Aᵀλ ≤ c, or min −bᵀλ s.t. c − Aᵀλ ≥ 0 in the form of ch. 12.
x: primal variables (n), λ: dual variables (m).
• KKT conditions for the dual: L(λ, x) = −bᵀλ − xᵀ(c − Aᵀλ). If λ is a solution ⇒ ∃!x: Ax = b, Aᵀλ ≤ c, x ≥ 0, xi(c − Aᵀλ)i = 0 for i = 1, . . . , n; which are identical to the primal problem's KKT conditions if we define s = c − Aᵀλ. That is, λ are the optimal Lagrange multipliers of the primal and the optimal variables of the dual, while x are the optimal variables of the primal and the optimal Lagrange multipliers of the dual.

• Dual of the dual = primal. Pf.: restate dual in LP standard form by introducing slack variables
s ≥ 0 (so that AT λ + s = c) and splitting the unbounded variables λ into λ = λ+ − λ− with
λ+, λ− ≥ 0. Then we can write the dual as:

min (−b; b; 0)T (λ+; λ−; s) s.t. (AT −AT I) (λ+; λ−; s) = c, (λ+; λ−; s) ≥ 0

whose dual is

max cT z s.t. (A; −A; I) z ≤ (−b; b; 0) ⇔ min −cT z s.t. Az = −b, z ≤ 0

i.e., the primal with z ≡ −x.

• Duality gap: given a feasible vector x for the primal (⇔ Ax = b, x ≥ 0) and a feasible vector
(λ, s) for the dual (⇔ AT λ + s = c, s ≥ 0) we have:

0 ≤ xT s = xT (c − AT λ) = cT x − bT λ (the duality gap) ⇔ cT x ≥ bT λ.

Thus, the dual objective function bT λ is a lower bound on the primal objective function cT x (weak duality); at a solution the gap is 0.
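As a quick numerical check of weak/strong duality (a sketch in Python with scipy.optimize.linprog; the problem data below are made up for the example):

    # Solve the primal min c'x s.t. Ax = b, x >= 0 and the dual
    # max b'lambda s.t. A'lambda <= c, and compare objective values.
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 2.0, 1.0, 0.0],
                  [2.0, 1.0, 0.0, 1.0]])          # assumed full row rank
    b = np.array([4.0, 3.0])
    c = np.array([-1.0, -2.0, 0.0, 0.0])

    primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 4)
    # Dual: max b'lambda <=> min -b'lambda; lambda is free for an equality primal.
    dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)
    print(primal.fun, -dual.fun)                  # equal at the solution (gap 0)

For this data both values are −4, so the duality gap at the solution is 0.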

• Strong duality (th. 13.1):

1. If either problem (primal or dual) has a (finite) solution, then so does the other, and the
objective values are equal.
2. If either problem (primal or dual) is unbounded, then the other problem is infeasible.

• Duality is important in the theory of LP (and convex opt. in general) and in primal-dual
algorithms; also, the dual may be easier to solve than the primal.

• Sensitivity analysis: how sensitive the global objective value is to perturbations in the con-
straints ⇔ find the Lagrange multipliers λ, s.

Geometry of the feasible set
x ≥ 0 is the n-dim positive quadrant and we consider its intersection with the m-dim (m < n) linear
subspace Ax = b. The intersection happens at points x having at most m nonzeros, which are the
vertices of the feasible polytope. If the objective is bounded then at least one of these vectors is a
minimizer. (Example figures omitted.)

• x is a basic feasible point (BFP) if x is a feasible point with at most m nonzero components
and we can identify a subset B(x) of the index set {1, . . . , n} such that:

– B(x) contains exactly m indices

– i ∉ B(x) ⇒ xi = 0

– Bm×m = [Ai ]i∈B(x) is nonsingular.

• The simplex method generates a sequence of iterates xk that are BFPs and converges (in a
finite number of steps) to a solution, if the LP has BFPs and at least one of them is a basic
optimal point (= a BFP which is a minimizer).

• Fundamental th. of LP (th. 13.2): for the standard LP problem:

– ∃ feasible point ⇒ ∃ a BFP


– LP has solutions ⇒ at least one such solution is a basic optimal point
– LP is feasible and bounded ⇒ it has an optimal solution. Proof

• All BFPs for the standard LP are vertices of the feasible polytope {x: Ax = b, x ≥ 0} and
vice versa (th. 13.3). (A vertex is a point that does not lie on a straight line between two other points in the polytope). Proof

• An LP is degenerate if there exists at least one BFP with fewer than m nonzero components.

The simplex method (Not to be confused with Nelder & Mead’s downhill simplex method of derivative-free opt.)

There are at most (n choose m) different sets of basic indices B (e.g. (20 choose 10) ∼ 10^5), so a brute-force way to find a solution would be to try them all and check the KKT conditions. The simplex algorithm does better than this: it
guarantees a sequence of iterates all of which are BFPs (thus vertices of the polytope). Each step
moves from one vertex to an adjacent vertex for which the set of basic indices B(x) differs in exactly
one component and either decreases the objective or keeps it unchanged.

The move: we need to decide which index to change in the basic set B (by taking it out and
replacing it with one index from outside B, i.e., from N = {1, . . . , n} \ B). Write the KKT conditions
in terms of B and N (partitioned matrices and vectors):

B = [Ai]i∈B, N = [Ai]i∈N ⇒ A = (B N)
xB = [xi]i∈B, xN = [xi]i∈N ⇒ x = (xB; xN); also s = (sB; sN), c = (cB; cN).

• Since x is a BFP we have: B nonsingular, xB ≥ 0, xN = 0, so KKT c) holds.

• KKT b): b = Ax = BxB + NxN ⇒ xB = B−1 b.

• KKT e): xT s = xBT sB = 0 ⇒ sB = 0.

• KKT a): s + AT λ = c ⇒ (sB; sN) + (BT λ; NT λ) = (cB; cN) ⇒ λ = (B−1)T cB and sN = cN − (B−1N)T cB.
• KKT d): s ≥ 0: while sB satisfies this, sN = cN − (B−1 N)T cB may not (if it does, i.e., sN ≥ 0,
we have found an optimal (x, λ, s) and we have finished). Thus we take out one of the indices
q ∈ N for which sq < 0 (there are usually several) and:

– allow xq to increase from 0


– fix all other components of xN to 0
– figure out the effect of increasing xq on the current BFP xB , given that we want it to stay
feasible wrt Ax = b
– keep increasing xq until one of components of xB (say, that of xp ) is driven to 0
– p leaves B to N , q enters B from N .

Formally, call x+ the new iterate and x the current one: we want Ax+ = b = Ax:

Ax+ = (B N)(xB+; xN+) ⋄= BxB+ + Aq xq+ = BxB = Ax ⇒ xB+ = xB − B−1 Aq xq+

(increase xq+ till some component of xB+ becomes 0; ⋄: xi+ = 0 for i ∈ N \ {q}). This operation decreases cT x (pf.: p. 369).

LP nondegenerate & bounded ⇒ simplex method terminates at a basic optimal point (th. 13.4).
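A NumPy sketch of one simplex iteration implementing the formulas above (dense linear algebra; Bland's rule, degeneracy handling and factorization updates are omitted; the function name and tolerances are illustrative):

    # One simplex step: price the nonbasic columns, pick an entering index q
    # with s_q < 0, do the ratio test, and pivot.
    import numpy as np

    def simplex_step(A, b, c, Bset):
        m, n = A.shape
        Nset = [j for j in range(n) if j not in Bset]
        B = A[:, Bset]
        xB = np.linalg.solve(B, b)                   # current BFP: x_B = B^{-1} b
        lam = np.linalg.solve(B.T, c[Bset])          # lambda = B^{-T} c_B
        sN = c[Nset] - A[:, Nset].T @ lam            # pricing: s_N = c_N - N^T lambda
        if np.all(sN >= -1e-12):
            return Bset, True                        # KKT holds: current BFP is optimal
        q = Nset[int(np.argmin(sN))]                 # entering index: most negative s_q
        d = np.linalg.solve(B, A[:, q])              # x_B^+ = x_B - d * x_q^+
        if np.all(d <= 1e-12):
            raise ValueError("LP is unbounded")
        ratios = np.full(m, np.inf)                  # ratio test: first component to hit 0
        ratios[d > 1e-12] = xB[d > 1e-12] / d[d > 1e-12]
        p = int(np.argmin(ratios))
        Bset = list(Bset); Bset[p] = q               # pivot: q enters, old index leaves
        return Bset, False

Starting from a BFP (e.g. the all-slack basis of a problem in standard form) and calling simplex_step until the flag is True traces the sequence of vertices described above.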

• The practical implementation of the simplex method needs to take care of some details:

– Degenerate & unbounded cases.


– Efficient solution of the m × m linear system. B changes by a single column between iterations.

– Selection of the entering index from among the several negative components of s.

• Presolving (preprocessing): reduces the size of the user-given problem by applying several
techniques to eliminate variables, constraints and bounds; may also detect infeasibility. Ex:
look for rows or columns in A that are singletons or all-zeros, or for redundant constraints.

• With inequalities, as indicated by the KKT conditions, an algorithm must determine (implicitly
or explicitly) which of them are active at a solution. Active-set methods, of which the simplex
method is an example:

– maintain explicitly a set of constraints that estimates the active set at the solution (the
complement of the basis B in the simplex method), and
– make small changes to it at each step (a single index in the simplex method).

Active-set methods apply also to QP and bound-constrained optimization, but are less conve-
nient for nonlinear programming.

• The simplex method is very efficient in practice (it typically requires 2m to 3m iterations) but it does have a worst-case complexity that is exponential in n. This can be demonstrated with a pathological n-dim problem where the feasible polytope has 2^n vertices, all of which are visited by the simplex method before reaching the optimal point.
Interior-point methods for LP (next chapter) have a polynomial worst-case complexity.

Review: linear programming: the simplex method


• Linear program:
– linear objective and constraints
– convex problem
– feasible set: polytope
– number of solutions: 0 (infeasible or unbounded), 1 (a vertex) or ∞ (a face)
– KKT conditions are necessary and sufficient; LICQ not needed.
• Primal problem in LP std form (use slacks, etc. if needed): minx cT x s.t. Ax = b; x ≥ 0 with Am×n .
• Dual problem: maxλ bT λ s.t. AT λ ≤ c
– optimal variables x and Lagrange multipliers λ switch roles
– dual(dual(primal)) = primal
– weak duality: dual objective ≤ primal objective for any feasible point; they coincide at the solution
– strong duality: the primal and the dual, either both have a finite solution, or one is unbounded and the
other infeasible.
• Sensitivity analysis: values of the Lagrange multipliers.
• The simplex method :
– tests a finite sequence of polytope vertices with nonincreasing objective values, final one = solution
– each step requires solving a linear system of m × m
– needs to deal with degenerate and unbounded cases
– takes 2m–3m steps in practice
– is an active-set method.

14 Linear programming: interior-point methods
Interior-point methods:
– All iterates satisfy the inequality constraints strictly, so in the limit they approach the solution from the inside (in some methods from the outside) but never lie on the boundary of the feasible set.
– Each iteration is expensive to compute but can make significant progress toward the solution.
– Average-case complexity = worst-case complexity = polynomial.

Simplex method:
– Moves along the boundary of the feasible polytope, testing a finite sequence of vertices until it finds the optimal one.
– Usually requires a larger number of inexpensive iterations.
– Average-case complexity: 2m–3m iterations (m = number of constraints); worst-case complexity: exponential.

Primal-dual methods
• Standard-form primal LP: min cT x s.t. Ax = b, x ≥ 0, c, x ∈ Rn , b ∈ Rm , Am×n
Dual LP: max bT λ s.t. AT λ + s = c, s ≥ 0, λ ∈ Rm , s ∈ Rn .
• KKT conditions:

AT λ + s = c
Ax = b
xi si = 0, i = 1, . . . , n
x, s ≥ 0

a system of 2n + m equations for the 2n + m unknowns x, λ, s (mildly nonlinear because of the products xi si), i.e.,

F(x, λ, s) = (AT λ + s − c; Ax − b; XSe) = 0, x, s ≥ 0, where X = diag(xi), S = diag(si), e = (1, . . . , 1)T.

• Idea: find solutions (x∗ , λ∗ , s∗ ) of this system with a Newton-like method, but modifying the
search directions and step sizes to satisfy x, s > 0 (strict inequality). The sequence of iterates
traces a path in the space (x, λ, s), thus the name primal-dual. Solving the system is relatively
easy (little nonlinearity) but the nonnegativity condition complicates things. Spurious solutions
(F(x, λ, s) = 0 but not x, s ≥ 0) abound and do not provide useful information about feasible
solutions, so we must ensure to exclude them.
All the vertices of the x–polytope are associated with a root of F, but most violate x, s ≥ 0.

• Newton’s method to solve nonlinear equations r(x) = 0 from current estimate xk (ch. 11):
xk+1 = xk + ∆x where J(xk ) ∆x = −r(xk ) and J(x) is the Jacobian of r.
(Recall that if we apply it to solve ∇f (x) = 0 we obtain ∇2 f (x)p = −∇f (x), Newton’s method for optimization.)

• In our case the Jacobian J(x, λ, s) takes a simple form (✐ is J nonsingular?). Assuming x, s > 0 and calling rc = AT λ + s − c, rb = Ax − b the residuals for the linear equations, the Newton step is:

J(x, λ, s) = [0 AT I; A 0 0; S 0 X] ⇒ J (∆x; ∆λ; ∆s) = (−rc; −rb; −XSe).

This Newton direction is also called the affine-scaling direction. Since a full step would likely violate x, s > 0, we perform a line search so that the new iterate is (x; λ; s) + α (∆x; ∆λ; ∆s) for α ∈ (0, 1], with α < min( min∆xi<0 (−xi/∆xi), min∆si<0 (−si/∆si) ).

Still, often α ≪ 1. Primal-dual methods modify the basic Newton procedure by:

1. Biasing the search direction towards the interior of the nonnegative orthant x, s ≥ 0 (so
more room to move within it). We take a less aggressive Newton direction that aims at a
solution with xi si = σµ > 0 (perturbed KKT conditions) instead of all the way to 0 (this
usually allows a longer step α), with:
– Duality measure µ = xT s / n = average of the pairwise products xi si. It measures closeness to the boundary, and the algorithms drive µ to zero.
– Centering parameter σ ∈ [0, 1]: amount of reduction in µ we want to achieve:
σ = 0: pure Newton step towards points with xi si = 0 (affine-scaling dir.); aims at reducing µ.
σ = 1: Newton step towards (xµ, λµ, sµ) ∈ C (centering direction); aims at centrality.
Primal-dual methods trade off both aims.
2. Controlling the step α to keep xi , si from moving too close to the boundary of the non-
negative orthant.

Framework 14.1 (Primal-dual path-following): given (x0 , λ0 , s0 ) with x0 , s0 > 0

for k = 0, 1, 2, . . .   k  
0 AT I ∆x −rc
(xk )T sk
Solve  A 0 0  ∆λk  =  −rb  where σk ∈ [0, 1], µk =
n
k k k k k
S 0 X ∆s −X S e + σk µk e
(xk+1 , λk+1 , sk+1) ← (xk , λk , sk ) + αk (∆xk , ∆λk , ∆sk ) choosing αk such that xk+1 , sk+1 > 0
end
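A minimal NumPy transcription of Framework 14.1 (a sketch: the fixed σ and the 0.995 step-to-boundary factor are illustrative choices, and the Newton system is solved densely; practical codes exploit sparsity and adapt σk):

    # Primal-dual path-following for min c'x s.t. Ax = b, x >= 0.
    import numpy as np

    def pdip_lp(A, b, c, x, lam, s, sigma=0.2, tol=1e-8, maxit=100):
        # (x, lam, s): starting point with x, s > 0 (not necessarily feasible).
        m, n = A.shape
        for _ in range(maxit):
            rc, rb = A.T @ lam + s - c, A @ x - b      # dual and primal residuals
            mu = x @ s / n                             # duality measure
            if max(mu, np.linalg.norm(rb), np.linalg.norm(rc)) < tol:
                break
            J = np.block([[np.zeros((n, n)), A.T, np.eye(n)],
                          [A, np.zeros((m, m)), np.zeros((m, n))],
                          [np.diag(s), np.zeros((n, m)), np.diag(x)]])
            rhs = np.concatenate([-rc, -rb, -x * s + sigma * mu])
            d = np.linalg.solve(J, rhs)
            dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]
            alpha = 1.0                                # keep x, s strictly positive
            for v, dv in ((x, dx), (s, ds)):
                if (dv < 0).any():
                    alpha = min(alpha, 0.995 * np.min(-v[dv < 0] / dv[dv < 0]))
            x, lam, s = x + alpha * dx, lam + alpha * dlam, s + alpha * ds
        return x, lam, s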

• The strategies for choosing or adapting σk , αk depend on the particular algorithm.


• If a full step (αk = 1) is taken at any iteration, the residuals rc, rb become 0 and all the subsequent iterates remain strictly feasible (strict feasibility is preserved: (xk, λk, sk) ∈ F0 ⇒ (xk+1, λk+1, sk+1) ∈ F0).

Examples
• This shows every vertex of the polytope in x (i.e., Ax = b, x ≥ 0) produces one root of F(x, λ, s):
min cT x s.t. Ax = b, x ≥ 0, for c = (1, 1, 0)T, A = (1 1/2 2), b = 2. The KKT conditions F(x, λ, s) = 0, x, s ≥ 0 are:

λ + s1 = 1, λ/2 + s2 = 1, 2λ + s3 = 0     (AT λ + s = c)
x1 + x2/2 + 2x3 = 2                       (Ax = b)
s1 x1 = s2 x2 = s3 x3 = 0                 (sT x = 0)

with solutions
(1) λ = 1, x = (2, 0, 0)T, s = (0, 1/2, −2)T: infeasible
(2) λ = 2, x = (0, 4, 0)T, s = (−1, 0, −4)T: infeasible
(3) λ = 0, x = (0, 0, 1)T, s = (1, 1, 0)T: feasible.
• Another example: A = (1 −2 2) above. The solutions of F(x, λ, s) = 0 are:
(1) λ = 1, x = (2, 0, 0)T, s = (0, 3, −2)T: infeasible
(2) λ = −1/2, x = (0, −1, 0)T, s = (3/2, 0, 1)T: infeasible
(3) λ = 0, x = (0, 0, 1)T, s = (1, 1, 0)T: feasible.
Thus the need to steer away from the boundary till we approach the solution.

The central path

Define the primal-dual feasible set F = {(x, λ, s): Ax = b, AT λ + s = c, x, s ≥ 0} and the
primal-dual strictly feasible set F0 = {(x, λ, s): Ax = b, AT λ + s = c, x, s > 0}.
Before, we justified taking a step towards τ = σµ > 0 instead of directly towards 0 in that it keeps
us away from the feasible set boundaries and allows longer steps. We can also see this as following
a central path through the nonnegative orthant. Parameterize the KKT system in terms of a scalar
parameter τ > 0 (perturbed KKT conditions):

AT λ + s = c
Ax = b
xi si = τ, i = 1, . . . , n
x, s > 0.

Solving this system for each τ gives a curve C = {(xτ, λτ, sτ): τ > 0} whose points are strictly feasible and which converges to a solution for τ → 0: the central path C. The central path is defined uniquely ∀τ > 0 ⇔ F0 ≠ ∅.
The central path guides us to a solution along a route that steers clear of spurious solutions by
keeping all x and s components strictly positive and decreasing the pairwise products xi si to 0 at
the same rate. A Newton step towards points in C is biased toward the interior of the nonnegative
orthant x, s ≥ 0 and so it can usually be longer than the pure Newton step for F.

Example of central path: min cT x s.t. Ax = b, x ≥ 0 for x ∈ R; A = b = c = 1. The perturbed KKT equations for the central path C (τ > 0) are

AT λ + s = c, Ax = b, xT s = τ ⇒ x = 1, s = τ, λ = 1 − τ.

Path-following methods
• They explicitly restrict iterates to a neighborhood of the central
path C, thus following C more or less strictly. That is, we choose
αk ∈ [0, 1] as large as possible but so that (xk+1 , λk+1 , sk+1) lies in
the neighborhood.
• Examples:
N−∞(γ) = {(x, λ, s) ∈ F0: xi si ≥ γµ, i = 1, . . . , n} for γ ∈ (0, 1] (typ. γ = 10^−3); note N−∞(0) = F, N−∞(1) = C.
N2(θ) = {(x, λ, s) ∈ F0: ‖XSe − µe‖2 ≤ θµ} for θ ∈ [0, 1) (typ. θ = 0.5); note N2(1) ≠ F, N2(0) = C.

• Global convergence (th. 14.3):
σk ∈ [σmin, σmax] with 0 < σmin < σmax < 1 ⇒ µk+1 ≤ (1 − δ/n) µk for some constant δ ∈ (0, 1).
• Convergence rate (th. 14.4): (worst-case bound; in practice, almost independent of n)
given ǫ ∈ (0, 1), O(n log(1/ǫ)) iterations suffice to find a point for which µk < ǫµ0. Proof

• Homotopy methods for nonlinear eqs. follow tightly a tubular neighborhood of the path.
Interior-point path-following methods follow a horn-shaped neighborhood, initially wide.
• Most computational effort is spent solving the linear system for the direction:
– This is often a sparse system because A is often sparse.
– If not, can reformulate (by eliminating variables) as a smaller system: eq. (14.42) or
(14.44).
A practical algorithm: Mehrotra’s predictor-corrector algorithm
At each iteration:

1. Predictor step (x′ , λ′ , s′ ) = (x, λ, s) + α(∆xaff , ∆λaff , ∆saff ): affine-scaling direction (i.e., σ = 0)
and largest step size α ∈ [0, 1] that satisfies x′ , s′ ≥ 0.
2. Adapt σ: compute the effectiveness µaff = (x′)T s′ / n of this step and set σ = (µaff/µ)³. Thus, if the predictor step is effective, µaff is small and σ is close to 0; otherwise σ is close to 1.

3. Corrector step: compute the step direction using this σ:

J(x, λ, s) (∆x; ∆λ; ∆s) = (−rc; −rb; −XSe − ∆Xaff ∆Saff e + σµe).

This is an approximation to keeping to the central path:

(xi + ∆xi)(si + ∆si) = σµ ⇒ xi ∆si + si ∆xi = σµ − ∆xi ∆si − xi si,

where the (unknown) product ∆xi ∆si is approximated with ∆xiaff ∆siaff.
Notes:

• Two linear systems must be solved (predictor and corrector steps) but with the same coefficient
matrix (so use a matrix factorization, e.g. J = LU).

• Heuristics exist to find a good initial point.

• If LP is infeasible or unbounded, the algorithm typically diverges (rkc , rkb and/or µk → ∞).

• No convergence theory available for this algorithm (which can occasionally diverge); but it has
good practical performance.

Review: linear programming: interior-point methods
• Standard-form primal LP: min cT x s.t. Ax = b, x ≥ 0, c, x ∈ Rn, b ∈ Rm, Am×n.
• KKT conditions with Lagrange multipliers λ, s: F(x, λ, s) = (AT λ + s − c; Ax − b; XSe) = 0, where X = diag(xi), S = diag(si), e = (1, . . . , 1)T.
• Apply Newton’s method to solve for (x, λ, s) (primal-dual space) but modify the complementarity conditions
as xi si = τ = σµ > 0 to force iterates to be strictly feasible, i.e., interior (x, s > 0), and drive τ → 0. This
affords longer steps α:
– Pure Newton step: J (∆x; ∆λ; ∆s) = (−rc; −rb; −XSe + σµe) with J(x, λ, s) = [0 AT I; A 0 0; S 0 X].
– New iterate: (x; λ; s) + α (∆x; ∆λ; ∆s) for α ∈ (0, 1] that ensures the iterate is sufficiently interior.
– Duality measure µ = xT s / n (measures progress towards a solution).
– Centering parameter σ between 0 (affine-scaling direction) and 1 (central path).
• The set of solutions as a function of τ > 0 is called the central path C. It serves as guide to a solution from
the interior that avoids non-KKT points. Path-following algorithms follow C more or less closely.
– Global convergence: µk+1 ≤ cµk for constant c ∈ (0, 1) if σk is bounded away from 0 and 1.
– Convergence rate: achieving µk < ǫ requires O(n log(1/ǫ)) iterations ⇒ polynomial complexity.
• Each step of the interior-point method requires solving a linear system (for the Newton step) of 2n + m eqs.
which is sparse if A is sparse.
• Fewer, more costly iterations than the simplex method. In practice, preferable in large problems.

15 Fundamentals of algorithms for nonlinear constrained
optimization
• General constrained optimization problem:

minx∈Rn f(x) s.t. ci(x) = 0, i ∈ E; ci(x) ≥ 0, i ∈ I; with f, {ci} smooth.

Special cases (for which specialized algorithms exist):

– Linear programming (LP): f , all ci linear; solved by simplex & interior-point methods.
– Quadratic programming (QP): f quadratic, all ci linear.
– Linearly constrained optimization: all ci linear.
– Bound-constrained optimization: constraints are only of the form xi ≥ li or xi ≤ ui .
– Convex programming: f convex, equality ci linear, inequality ci concave. (✐ Is QP convex progr.?)

• Brute-force approach: guess which inequality constraints are active (λ∗i ≠ 0), try to solve the nonlinear equations given by the KKT conditions directly and then check whether the resulting solutions are feasible. If there are m inequality constraints and k are active, we have (m choose k) combinations and so altogether (m choose 0) + (m choose 1) + · · · + (m choose m) = 2^m combinations, which is wasteful
unless we can really guess which constraints are active. Solving a nonlinear system of equations
is still hard because the root-finding algorithms are not guaranteed to find a solution from
arbitrary starting points.

• Iterative algorithms: sequence of xk (and possibly of Lagrange multipliers associated with the
constraints) that converges to a solution. The move to a new iterate is based on information
about the objective and constraints, and their derivatives, at the current iterate, possibly
combined with information gathered in previous iterates. Termination occurs when a solution
is identified accurately enough, or when further progress can’t be made.
Goal: to find a local minimizer (global optimization is too hard).

• Initial study of the problem: try to show whether the problem is infeasible or unbounded; try
to simplify the problem.

• Hard constraints: they cannot be violated during the algorithm's run, e.g. non-negativity of x if √x appears in the objective function. Need feasible algorithms, which are slower than
algorithms that allow the iterates to be infeasible, since they can’t allow shortcuts across
infeasible territory; but the objective is a merit function, which spares us the need to introduce
a more complex merit function that accounts for constraint violations.
Soft constraints: they may be modeled as objective function f + penalty, where the penalty
includes the constraints. However, this can introduce ill-conditioning.

• Slack variables are commonly used to simplify an inequality into a bound, at the cost of having
an extra equality and slack variable:

ci (x) ≥ 0 ⇒ ci (x) − si = 0, si ≥ 0 ∀i ∈ I.

Categorization of algorithms
• Ch. 16: quadratic programming: it’s an important problem by itself and as part of other
algorithms; the algorithms can be tailored to specific types of QP.

• Ch. 17: penalty and augmented Lagrangian methods.

– Penalty methods: combine objective and constraints into a penalty function φ(x; µ) via a
penalty parameter µ > 0; e.g. if only equality constraints exist:
∗ φ(x; µ) = f(x) + (µ/2) Σi∈E ci(x)²
⇒ unconstrained minimization of φ wrt x for a series of increasing µ values.
∗ φ(x; µ) = f(x) + µ Σi∈E |ci(x)| (exact penalty function)
⇒ single unconstrained minimization for large enough µ.
– Augmented Lagrangian methods: define a function that combines the Lagrangian and a
quadratic penalty; e.g. if only equality constraints exist:
∗ LA(x, λ; µ) = f(x) − Σi∈E λi ci(x) + (µ/2) Σi∈E ci(x)²
⇒ unconstrained min. of LA wrt x for fixed λ, µ; update λ, increase µ; repeat.
– Sequential linearly constrained methods: minimize at every iteration a certain Lagrangian
function subject to linearization of the constraints; useful for large problems.

• Ch. 18: sequential quadratic programming: model the problem as a QP subproblem; solve it
by ensuring a certain merit function decreases; repeat. Effective in practice. Although the QP
subproblem is relatively complicated, they typically require fewer function evaluations than
some of the other methods.

• Ch. 19: interior-point methods for nonlinear programming: extension of the primal-dual interior-
point methods for LP. Effective in practice. They can also be viewed as:

– Barrier methods: add terms to the objective (via a barrier parameter µ > 0) that are
insignificant when x is safely inside the feasible set but become large as x approaches the
boundary; e.g. if only inequality constraints exist:
∗ P(x; µ) = f(x) − µ Σi∈I log ci(x) (logarithmic barrier function)
⇒ unconstrained minimization of P wrt x for a series of decreasing µ values.

Class of methods that. . . replace the original constrained problem by:

– Quadratic penalty methods, log-barrier methods, augmented Lagrangian methods → a sequence of unconstrained problems
– Nonsmooth exact penalty function methods → a single unconstrained problem
– Sequential linearly constrained methods → a sequence of linearly constrained problems
– Sequential quadratic programming → a sequence of QP problems
– Interior-point methods → a sequence of nonlinear systems of equations

Elimination of variables
Goal: eliminate some of the constraints and so simplify the problem. This must be done with care
because the problem may be altered, or ill-conditioning may appear.

• Example 15.2’: safe elimination.

• Example 15.2: elimination alters the problem: minx,y x² + y² s.t. (x − 1)³ = y² has the solution (x, y) = (1, 0). Eliminating y² = (x − 1)³ yields min x² + (x − 1)³, which is unbounded (x → −∞); the mistake is that this elimination ignores the implicit constraint x ≥ 1 (since y² ≥ 0), which is active at the solution.

In general, nonlinear elimination is tricky. Instead, many algorithms linearize the constraints, then
apply linear elimination.

Linear elimination: Consider min f(x) s.t. Ax = b where Am×n, m ≤ n, and A has full rank (otherwise, remove redundant constraints or determine whether the problem is infeasible). Say we eliminate xB = (x1, . . . , xm)T (otherwise permute x, A and b); writing A = (B N) with Bm×m nonsingular, Nm×(n−m) and x = (xB; xN), we have xB = B−1b − B−1NxN (remember how to find a basic feasible point in the simplex method), and we can solve the unconstrained problem minxN∈Rn−m f(xB(xN), xN) (ex. 15.3). Ideally we'd like to select B to be easily factorizable (easier linear system).
We can also write x = Yb + ZxN with Y = (B−1; 0), Z = (−B−1N; I). We have:

• Z has n − m l.i. columns (due to I being the lower block) and AZ = 0


⇒ Z is a basis of the null space of A.

• The columns of Y and the columns of Z are l.i. (pf.: (Y Z)λ = 0 ⇒ λ = 0); and null (A) ⊕ range AT = Rn


⇒ Y is a basis of the range space of AT and Yb is a particular solution of Ax = b.

Thus the elimination technique expresses feasible points as the sum of a particular solution of Ax = b
plus a displacement along the null space of A:

x = (particular solution of Ax = b) + (general solution of Ax = 0).

But linear elimination can give rise to numerical instability, e.g. for n = 2 (figure omitted).
This can be improved by choosing as the particular solution the one having minimum norm: min ‖x‖2 s.t. Ax = b, which is xp = AT(AAT)−1b (pf.: apply KKT to min ½ xT x s.t. Ax = b). Both this xp and Z
can be computed in a numerically stable way using the QR decomposition of A, though the latter is
costly if A is large (even if sparse).
If inequality constraints exist, eliminating equality constraints is worthwhile if the inequality
constraints don’t get more complicated.

Measuring progress: merit functions φ(x; µ)
• A merit function measures a combination of the objective f and of feasibility via a penalty
parameter µ > 0 which controls the tradeoff; several definitions exist. They help to control
the optimization algorithm: a step is accepted if it leads to a sufficient reduction in the merit
function.

• Unconstrained optimization: objective function = merit function.


Also in feasible methods (which force all iterates to be feasible).

• A merit function φ(x; µ) is exact if ∃µ∗ > 0: µ > µ∗ ⇒ any local solution x of the optimization
problem is a local minimizer of φ.

• Useful merit functions:

– ℓ1 exact function:

φ1(x; µ) = f(x) + µ Σi∈E |ci(x)| + µ Σi∈I [ci(x)]−, where [y]− = max(0, −y).

It is not differentiable. It is exact for µ∗ = largest Lagrange multiplier (in absolute


value) associated with an optimal solution. Many algorithms using this function adjust µ
heuristically to ensure µ > µ∗ (but not too large). It is inexpensive to evaluate but it may
reject steps that make good progress toward the solution (Maratos effect).
– Fletcher’s augmented Lagrangian: when only equality constraints c(x) = 0 exist:
φF(x; µ) = f(x) − λ(x)T c(x) + (µ/2) ‖c(x)‖²
where A(x) is the Jacobian of c(x) and λ(x) = (A(x)A(x)T )−1 A(x) ∇f (x) are the least-
squares multipliers’ estimates. It is differentiable and exact and does not suffer from the
Maratos effect, but since it requires the solution of a linear system to obtain λ(x), it is
expensive to evaluate; and may be ill-conditioned or not defined.
– Augmented Lagrangian in x and λ: when only equality constraints exist:
LA(x, λ; µ) = f(x) − λT c(x) + (µ/2) ‖c(x)‖².
Here the iterates are (xk , λk ), i.e., a step both in the primal and dual variables. A solution
of the optimization problem is a stationary point of LA but in general not a minimizer.

Review: fundamentals of algorithms for nonlinear constrained optimiza-
tion
• Brute-force approach: try all 2m combinations of active constraints, solving a different system of nonlinear
equations for each (from the KKT conditions): computationally intractable. Instead, iterative algorithms
construct a sequence of xk and possibly λk , using information about the objective and constraints and their
derivatives, that converges to one local solution.
• Some algorithms are for specific problems (LP, QP) while others apply more generally (penalty and augmented
Lagrangian, sequential QP, interior-point and barrier methods).
• Elimination of variables: useful with linear equality constraints (if done carefully to prevent introducing ill-
conditioning); tricky with nonlinear ones.
• Merit function φ(x; µ): measures a combination of the objective f and of feasibility via a penalty parameter
µ > 0 which controls the tradeoff. Ex: φ = f for unconstrained problems or feasible algorithms; quadratic-
penalty; augmented Lagrangian, Fletcher’s augmented Lagrangian, ℓ1 exact function.
Exact merit function: any local solution is a minimizer of φ for sufficiently large µ.

16 Quadratic programming
Quadratic program (QP): quadratic objective function, linear constraints.

minx∈Rn q(x) = ½ xT Gx + cT x s.t. aiT x = bi, i ∈ E; aiT x ≥ bi, i ∈ I; with Gn×n symmetric.
Can always be solved in a finite number of iterations (exactly how many depends on G and on the
number of inequality constraints).
• Convex QP ⇔ G psd. Local minimizer(s) also global; not much harder than LP.
• Non-convex QP ⇔ G not psd. Possibly several solutions.

Example: portfolio optimization


• n possible investments (bonds, stocks, etc.) with returns ri , i = 1, 2, . . . , n.
• r = (r1 , . . . , rn )T is a random variable with mean µ and covariance G:
– µi = E {ri }
– gij = E {(ri − µi )(rj − µj )} = tendency of the returns of investments i, j to move together.
Usually high µi means high gii .
In practice, µ and G are guesstimated based on historical data and “intuition”.
• An investor constructs a portfolio by putting a fraction xi ∈ [0, 1] of the available funds into investment i, with Σi=1..n xi = 1 and x ≥ 0. Return of the portfolio: R = xT r, with:
– mean E {R} = xT µ (expected return)
– variance E {(R − E {R})2 } = xT Gx.
• Wanted portfolio: large expected return, small variance:

maxx µT x − κ xT Gx s.t. Σi=1..n xi = 1, x ≥ 0 (✐ Is this convex QP?)

where κ ≥ 0 is set by the investor (conservative investor: large κ; aggressive investor: small κ).

Equality-constrained QP
• m equality constraints, no inequality constraints:

minx q(x) = ½ xT Gx + cT x s.t. Ax = b, A full rank.

(Example: min x1² + x2² s.t. x1 + x2 = 1.)
• KKT conditions: a solution x∗ verifies (where λ∗ is the Lagrange multiplier vector):

[G −AT; A 0] (x∗; λ∗) = (−c; b) ⇐⇒ [G AT; A 0] (−p; λ∗) = (g; h), with h = Ax − b, g = c + Gx, p = x∗ − x;

the matrix on the left is the KKT matrix K. (✐ What happens if G = 0 (LP)?)

Let Zn×(n−m) = (z1 , . . . , zn−m ) be a basis of null (A) ⇔ AZ = 0, rank (Z) = n − m. Call
ZT GZ the reduced Hessian (= what the quadratic form looks like in the subspace Ax = b).
Then, if A has full row rank (= m) and ZT GZ is pd:


– Lemma 16.1: K is nonsingular (⇒ unique (x∗; λ∗)). Proof

– Th. 16.3: inertia(K) = (n, m, 0) = number of (+, −, 0) eigenvalues (⇒ K always indefinite).


In general, inertia(K) = inertia(ZT GZ) + (m, m, 0).


• Classification of the solutions (assuming the KKT system has solutions (x∗; λ∗)): Ex. 16.2

1. Strong local minimizer at x∗ ⇔ ZT GZ pd. Proof:


– Either using the 2nd -order suffic. cond. (note Z is a basis of C(x∗ , λ∗ ) = null (A)).
– Or direct proof that x∗ is the unique, global solution (th. 16.2). Proof
2. Infinite solutions if ZT GZ is psd and singular.
3. Unbounded if ZT GZ indefinite.

• The KKT system can be solved with various linear algebra techniques (note that linear conju-
gate gradients are not applicable? ).
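A direct dense solve of the KKT system (a sketch; in practice one would use a symmetric indefinite factorization or a null-space method):

    # Equality-constrained QP: solve the KKT system above for (x*, lambda*).
    import numpy as np

    def eqp_solve(G, c, A, b):
        n, m = G.shape[0], A.shape[0]
        K = np.block([[G, -A.T], [A, np.zeros((m, m))]])   # KKT matrix (indefinite)
        sol = np.linalg.solve(K, np.concatenate([-c, b]))
        return sol[:n], sol[n:]                            # x*, lambda*

    # Example above: min x1^2 + x2^2 s.t. x1 + x2 = 1 -> x* = (1/2, 1/2), lambda* = 1.
    x, lam = eqp_solve(2 * np.eye(2), np.zeros(2), np.array([[1.0, 1.0]]), np.array([1.0]))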

Inequality-constrained QP
• Optimality conditions: Lagrangian function L(x, λ) = ½ xT Gx + cT x − Σi∈E∪I λi (aiT x − bi).
Active set at an optimal point x∗ : A(x∗ ) = {i ∈ E ∪ I: aTi x∗ = bi }.

– KKT conditions (LICQ not required? ):


Gx∗ + c − Σi∈A(x∗) λ∗i ai = 0
aTi x∗ = bi ∀i ∈ A(x∗ )
aTi x∗ ≥ bi ∀i ∈ I \ A(x∗ )
λ∗i ≥ 0 ∀i ∈ I ∩ A(x∗ ).

– Second-order conditions:
1. G psd (convex QP) ⇒ x∗ is a global minimizer (th. 16.4); and unique if G pd. Proof

2. Strict, unique local minimizer at x∗ ⇔ ZT GZ pd, where Z is a nullspace basis for the
active constraint Jacobian matrix (aTi )i∈A(x∗ ) .
3. If G is not psd, there may be more than one strict local minimizer at which the 2nd -
order conditions hold (non-convex, or indefinite QP); harder to solve. Determining
whether a feasible point is a global minimizer is NP-hard. fig. 16.1
Ex.: max xT x s.t. x ∈ [−1, 1]^n: 2^n local (and global) optima.

• Degeneracy is one of the following situations, which can cause problems for the algorithms: ex. p. 466

– Active constraint gradients are l.d. at the solution, e.g. (but not necessarily) if more than
n constraints are active at the solution ⇒ numerically difficult to compute Z.
– Strict complementary condition fails: λ∗i = 0 for some active index i ∈ A(x∗ ) (the con-
straint is weakly active) ⇒ numerically difficult to determine whether a weakly active
constraint is active.

• Three types of methods: active-set, gradient-projection, interior-point.

Active-set methods for convex QP
• Convex QP: any local solution is also global.
• They are the most effective methods for small- to medium-scale problems; efficient detection
of unboundedness and infeasibility; accurate estimate (typically) of the optimal active set.
• Remember the brute-force approach to solving the KKT systems for all combinations of active
constraints: if we knew the optimal active set A(x∗ ) (≡ the active set at the optimal point x∗ ),
we could find the solution of the equality-constrained QP problem minx q(x) s.t. aTi x = bi , i ∈
A(x∗ ). Goal: to determine this set.
• Active-set method: start from a guess of the optimal active set; if not optimal, drop one index
from A(x) and add a new index (using gradient and Lag. mult. information); repeat.
– The simplex method for LP is an active-set method.
– QP active-set methods may have iterates that aren’t vertices of the feasible polytope.
Three types of active-set methods: primal, dual, and primal-dual. We focus on primal methods,
which generate iterates that remain feasible wrt the primal problem while steadily decreasing
the objective function q.

Primal active-set method


• Compute a feasible initial iterate x0 (some techniques available; pp. 473–474); subsequent
iterates will remain feasible.
• Move to next iterate xk+1 = xk + αk pk : obtained by solving an equality-constrained quadratic
subproblem. The constraint set, called working set Wk , consists of all the equality constraints
and some of the inequality constraints taken as equality constraints (i.e., assuming they are
active); the gradients ai of the constraints in Wk must be l.i. The quadratic subproblem
(solvable as in sec. 16.1):

minp q(xk + p) s.t. Wk ⇐⇒ minp ½ pT Gp + (Gxk + c)T p s.t. aiT p = 0 ∀i ∈ Wk.

Call pk the solution of the subproblem. Then:


– Use αk = 1 if possible: if xk + pk satisfies all constraints (not just those in Wk ) and is
thus feasible, set xk+1 = xk + pk .
– Otherwise, choose αk as the largest value in [0, 1) for which all constraints are satisfied.
Note that xk + αk pk satisfies? aTi xk+1 = bi ∀αk ∈ R ∀i ∈ Wk , so we need only worry about
the constraints not in Wk ; the result is trivial to compute (similar to the choice of αk in
interior-point methods for LP):

αk = min( 1, min{ (bi − aiT xk)/(aiT pk): i ∉ Wk, aiT pk < 0 } ).   (16.41)
The constraints (typically just one) for which the minimum above is achieved are called
blocking constraints. The new working set: if αk < 1 (⇔ the step along pk was blocked by
some constraint not in Wk ) then add one of the blocking constraints to Wk ; otherwise keep
Wk . Note that it is possible that αk = 0; this happens when aTi pk < 0 for a constraint i
that is active at xk but not a member of Wk .

• Iterating this process (where we keep adding blocking constraints and moving xk ) we must reach
a point x̂ that minimizes q over its current working set Ŵ, or equivalently p = 0 occurs. Now,
is this also a minimizer of the QP problem, i.e., does it satisfy the KKT conditions? Only if
the Lagrange multipliers for the inequality Pconstraints in the working set are nonnegative. The
?
Lagrange multipliers are the solution of i∈Ŵ ai λ̂i = Gx̂ + c. So if λ̂j < 0 for some j ∈ Ŵ ∩ I
then we drop constant j from the working set (since by making this constraint inactive we can
decrease q while remaining feasible; th. 16.5) and go back to iterate.
If there are several λ̂j < 0 one typically chooses the most negative one since the rate of decrease
of q is proportional to |λ̂j | if we remove constraint j (other heuristics possible).

• If αk > 0 at each step, this algorithm converges in a finite number of iterations since there
is a finite number of working sets. In rare situations the algorithm can cycle: a sequence of
consecutive iterations results in no movement of xk while the working set undergoes deletions
and additions of indices and eventually repeats itself. Although this can be dealt with, most
QP implementations simply ignore it.

• The linear systems can be solved efficiently by updating factorizations (in the KKT matrix, G
is constant and A changes by one row at most at each step).

• Extension to nonconvex QP possible but complicated.

Algorithm 16.3 (Active-set method for convex QP) Ex. 16.4

Compute a feasible starting point x0


W0 ← subset of the active constraints at x0
for k = 0, 1, 2 . . .
pk = argminp ½ pT Gp + (Gxk + c)T p s.t. aiT p = 0 ∀i ∈ Wk    Equality-constr. quadratic subproblem
if pk = 0
Solve for λ̂i: Σi∈Wk ai λ̂i = G(xk + pk) + c    Compute Lagrange mult. for subproblem

if λ̂i ≥ 0 ∀i ∈ Wk ∩ I
stop with solution x∗ = xk All KKT conditions hold
else
Remove from the working set that ineq.
j ← arg minj∈Wk ∩I λ̂j , Wk+1 ← Wk \ {j} constr. having the most negative Lag. mult.
xk+1 ← xk
end
else pk 6= 0: we can move xk and decrease q
compute αk = min (1, min . . . ) from (16.41) Longest step in [0, 1]
xk+1 ← xk + αk pk
if αk < 1 There are blocking constraints
Wk+1 ← Wk ∪ {one blocking constraint} Add one of them to the working set
else
Wk+1 ← Wk
end
end
end

(✐ Does this algorithm become the simplex method for G = 0 (LP)?)
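A compact Python sketch of Algorithm 16.3 for the inequality-only case (G assumed pd, dense solves, no factorization updates or anti-cycling; names and tolerances are illustrative):

    # Primal active-set method for convex QP: min 1/2 x'Gx + c'x s.t. Ax >= b.
    import numpy as np

    def qp_activeset(G, c, A, b, x, W, maxit=100):
        # x: feasible start; W: initial working set (indices with a_i'x = b_i).
        n, W = len(x), list(W)
        for _ in range(maxit):
            g = G @ x + c
            if W:   # subproblem: min_p 1/2 p'Gp + g'p s.t. a_i'p = 0, i in W
                k = len(W)
                K = np.block([[G, -A[W].T], [A[W], np.zeros((k, k))]])
                sol = np.linalg.solve(K, np.concatenate([-g, np.zeros(k)]))
                p, lam = sol[:n], sol[n:]
            else:
                p, lam = np.linalg.solve(G, -g), np.empty(0)
            if np.allclose(p, 0):
                if lam.size == 0 or lam.min() >= -1e-10:
                    return x, W                     # all KKT conditions hold
                W.pop(int(np.argmin(lam)))          # drop most negative multiplier
            else:
                alpha, blocking = 1.0, None         # ratio test (16.41)
                for i in range(len(b)):
                    aip = A[i] @ p
                    if i not in W and aip < -1e-12:
                        t = (b[i] - A[i] @ x) / aip
                        if t < alpha:
                            alpha, blocking = t, i
                x = x + alpha * p
                if blocking is not None:
                    W.append(blocking)              # add one blocking constraint
        return x, W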

The gradient projection method
• Active-set method: the working set changes by only one index at each iteration, so many
iterations are needed for large scale problems.
• Gradient-projection method: large changes to the active set (from those constraints that are
active at the current point, to those that are active at the Cauchy point).
• Most efficient on bound-constrained QP, on which we focus:
minx q(x) = ½ xT Gx + cT x s.t. l ≤ x ≤ u
x, l, u ∈ Rn , G symmetric (not necessarily pd); not all components need to be bounded.
• The feasible set is a box.
• Idea: steepest descent but bending along the box faces.
• Needs a feasible starting point x◦ (trivial to obtain); all iterates remain feasible.
• Each iteration consists of two stages; assume current point is x (which is feasible):
1. Find the Cauchy point xc : this is the first minimizer along the steepest descent direction
−∇q = −(Gx + c) piecewise-bent to satisfy the constraints. To find it, search along −∇q;
if we hit a bound (a box face), bend the direction (by projecting it on the face) and keep
searching along it; and so on, resulting in a piecewise linear path P (x − t∇q; [l, u]), t ≥ 0,
where P(x; [l, u])i = median(xi, li, ui) = li if xi < li, xi if xi ∈ [li, ui], ui if xi > ui (exact formulas: pp. 486ff.). xc is somewhere in this path (depending on the quadratic form q(x)); note there can be several minimizers along the path.

(✐ Does this method converge in a finite number of iterations?)

2. (Optionally) Approximate solution of QP subproblem where the active constraints are


taken as equality constraints, i.e., the components of xc that hit the bounds are kept
fixed:

minx q(x) s.t. xi = xci for i ∈ A(xc); li ≤ xi ≤ ui for i ∉ A(xc).
This is almost as hard as the original QP problem; but all we need for global convergence
is a point x+ with q(x+ ) ≤ q(xc ) and is feasible wrt the subproblem constraints. xc itself
works. Something even better can be obtained by applying linear conjugate gradient to
min q(x) s.t. xi = xci , i ∈ A(xc ) and stopping when either negative curvature appears, or
a bound li ≤ xi ≤ ui, i ∉ A(xc) is violated.
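A small sketch of the box projection and the bent path (sampling t on a grid instead of computing the exact breakpoints of pp. 486ff.):

    # Projection onto the box [l, u] and the bent steepest-descent path
    # P(x - t*grad, [l, u]) whose first local minimizer is the Cauchy point.
    import numpy as np

    def project_box(x, l, u):
        return np.minimum(np.maximum(x, l), u)    # componentwise median(x_i, l_i, u_i)

    def bent_path(x, grad, l, u, ts):
        # Points along the piecewise-linear projected path for a grid of t >= 0.
        return [project_box(x - t * grad, l, u) for t in ts]

Evaluating q along bent_path(x, G @ x + c, l, u, ts) and taking the first local minimizer approximates xc; the exact method walks the breakpoints instead of a grid.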
• The gradient-projection method can be applied to general linear constraints (not just bounds),
but finding the piecewise path costs more computation, which makes it less efficient; and, in
general, to problems beyond QP (nonlinear objective and constraints).

Interior-point methods
• Appropriate for large problems.

• A simple extension of the primal-dual interior-point approach of LP works for convex QP. The
algorithms are easy to implement and efficient for some problems.

• Consider for simplicity only inequality constraints (exe. 16.21 considers also equality ones):
minx q(x) = ½ xT Gx + cT x s.t. Ax ≥ b, with G symmetric pd, Am×n.
Write KKT conditions, then introduce surplus vector y = Ax − b ≥ 0 (saves m Lag. mult.)? .
Since the problem is convex, the KKT conditions are not only necessary but also sufficient. We
find minimizers of the QP by finding roots of the KKT system:
Gx − AT λ = −c (the only addition wrt LP)
Ax − y = b
yi λi = 0, i = 1, . . . , m
y, λ ≥ 0

a system of n + 2m equations for the n + 2m unknowns x, y, λ (mildly nonlinear because of the products yi λi), i.e.,

F(x, y, λ) = (Gx − AT λ + c; Ax − y − b; YΛe) = 0, y, λ ≥ 0, where Y = diag(yi), Λ = diag(λi), e = (1, . . . , 1)T.

 
• Central path C = {(xτ, yτ, λτ): F(xτ, yτ, λτ) = (0; 0; τe), τ > 0} ⇔ solve the perturbed KKT system with yi λi = τ. Given a current iterate (x, y, λ) with y, λ > 0, define the duality measure µ = yT λ / m (closeness to the boundary) and the centering parameter σ ∈ [0, 1].

• Newton-like step toward the point (xσµ, yσµ, λσµ) on the central path:

[G 0 −AT; A −I 0; 0 Λ Y] (∆x; ∆y; ∆λ) = (−rc; −rb; −ΛYe + σµe), with rc = Gx − AT λ + c, rb = Ax − y − b

(Jacobian of F times step = −F(x, y, λ)); then

(xk+1; yk+1; λk+1) ← (xk; yk; λk) + αk (∆xk; ∆yk; ∆λk), choosing αk ∈ [0, 1] such that yk+1, λk+1 > 0.

• Likewise, we can extend the path-following methods (by defining a neighborhood N−∞ (γ)) and
Mehrotra’s predictor-corrector algorithm.

• Major computation: solving the linear system, more costly than for LP because of G.

• As in the comparison between the simplex and interior-point methods for LP, for QP:

– Active-set methods: large number of inexpensive steps; more complicated to implement;


preferable if an estimate of the solution is available (“warm start”).
– Interior-point methods: small number of expensive steps; simpler to implement; the spar-
sity pattern in the linear system is constant.

Review: quadratic programming
• Quadratic program:

– quadratic objective, linear constraints


– convex problem iff convex objective
– feasible set: polytope
– can always be solved in a finite number of iterations
– LICQ not needed in KKT conditions.
• Equality-constrained QP :
– KKT conditions result in a linear system (nonsingular if full-rank constraint matrix A and pd reduced
Hessian ZT GZ); can be solved with various linear algebra techniques.
– 2nd-order conditions as in the unconstrained case but using the reduced Hessian.
• Inequality-constrained QP : several methods, including:
– Active-set methods:
∗ Appropriate for small- to medium-scale (for large problems: many iterations); efficient detection of
unboundedness and infeasibility; accurate estimate of the optimal active set; particularly simple for
convex QP.
∗ Identify the optimal active set given an initial guess for it, by repeatedly adding or subtracting one
constraint at a time (like the simplex method for LP, but iterates need not be vertices of the feasible
polytope).
– Gradient projection: steepest descent but “bending” along the constraints; particularly simple with
bound constraints, where the projection operator is trivial.
– Interior-point methods: appropriate for large problems; very similar to LP interior-point methods.

17 Penalty and augmented Lagrangian methods
The quadratic penalty method

minx f(x) s.t. ci(x) = 0, i ∈ E; ci(x) ≥ 0, i ∈ I.
• Define the following quadratic-penalty function with penalty parameter µ > 0:
Q(x; µ) = f(x) + (µ/2) Σi∈E ci(x)² + (µ/2) Σi∈I ([ci(x)]−)², where [y]− = max(−y, 0)

(objective function + one term per constraint, which is positive when x violates ci and 0 otherwise).

• Define a sequence of unconstrained minimization subproblems minx Q(x; µk) given a sequence µk → ∞ as k → ∞. By driving µ to ∞ we penalize the constraint violations with increasing severity, forcing the minimizer of Q closer to the feasible region of the constrained problem. (ex. 17.1)
Algorithmic framework 17.1 (Quadratic-penalty method) eq. 17.5

Given tolerance τ0 > 0, starting penalty parameter µ0 > 0, starting point xs0
for k = 0, 1, 2, . . .
    Find an approximate minimizer xk of Q(x; µk), starting at xsk and terminating when ‖∇x Q(x; µk)‖ ≤ τk
    if final convergence test satisfied ⇒ stop with approximate solution xk
    Choose new penalty parameter µk+1 > µk, new starting point xsk+1 and new tolerance τk+1 ∈ (0, τk)
end
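A Python sketch of this framework for equality constraints, using scipy.optimize.minimize as the unconstrained inner solver (the µ and τ update rules below are illustrative choices, not from the notes):

    # Quadratic-penalty method (Framework 17.1), equality constraints only.
    import numpy as np
    from scipy.optimize import minimize

    def quad_penalty(f, c, x0, mu=1.0, tau=1.0, outer=15):
        # f: objective; c: function returning the vector of equality constraints.
        x = np.asarray(x0, dtype=float)
        for _ in range(outer):
            Q = lambda z, mu=mu: f(z) + 0.5 * mu * np.sum(c(z) ** 2)
            x = minimize(Q, x, tol=tau).x            # approx. minimizer of Q(.; mu_k)
            lam = -mu * c(x)                         # multiplier estimates (th. 17.2)
            mu, tau = 10 * mu, max(0.1 * tau, 1e-9)  # increase penalty, tighten tol.
        return x, lam

E.g. with f = lambda x: x[0] + x[1] and c = lambda x: np.array([x[0]**2 + x[1]**2 - 2]) (cf. ex. 17.1), the iterates should approach the solution (−1, −1) with λ ≈ −1/2.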

• Smoothness of the penalty terms:

– Equality constraints: ci² has at least? as many derivatives as ci


⇒ use derivative-based techniques for unconstrained optimization.
– Inequality constraints: ([ci ]− )2 can be less smooth than ci .
Ex.: for x1 ≥ 0, ([x1 ]− )2 = min (0, x1 )2 has a discontinuous second derivative.

• Choice of starting point xsk for the min. of Q(x; µk ): extrapolate xk :


– From previous iterates xk−1 , xk−2, xk−3 , . . . (in particular, xsk = xk−1 ;
“warm start”).

– Using the tangent to the path {x(µ), µ > 0}: xsk = xk−1 + (µk − µk−1)ẋ.
The path tangent ẋ = dx(µ)/dµ can be obtained by total differentiation of ∇x Q(x; µ) = 0 wrt µ:

0 = d∇x Q(x(µ); µ)/dµ = ∇2xx Q(x; µ) ẋ + (1/µ)(∇x Q(x; µ) − ∇f(x)),

a linear system for the vector ẋ.
This can be seen as a predictor-corrector method: linear prediction xsk ,
nonlinear correction by minimizing Q(x; µ).

• Choice of {µk }: adaptive, e.g. if minimizing Q(x; µk ) was:

– expensive: modest increase, e.g. µk+1 = 1.5µk


– cheap: larger increase, e.g. µk+1 = 10µk .

• Choice of {τk}: τk → 0 as k → ∞ (the minimization is carried out progressively more accurately).

• Convergence: assume µk → ∞ and consider equality constraints only.

– Th. 17.1: xk global min. of Q(x; µk ) ⇒ xk → global solution of the constr. problem.
Impractical: requires global minimization (or a convex problem).

– Th. 17.2: if τk → 0, xk → x∗ ⇒ x∗ is a stationary point of ‖c(x)‖². Proof


Besides, if ∇ci (x∗ ) l.i. ⇒ −µk ci (xk ) → λ∗i ∀i ∈ E and (x∗ , λ∗ ) satisfy the KKT cond.
Practical: only needs tolerances τk → 0; gives Lag. mult. estimates; but can converge to points that are infeasible or to
KKT points that are not minimizers, or not converge at all.

Practical problems

• Q may be unbounded below for some values of µ (ex. in eq. 17.5) ⇒ safeguard.

• The penalty function doesn’t look quadratic around its minimizer except very close to it (see
contours in fig. 17.2).

• Even if ∇2 f (x∗ ) is well-conditioned, the Hessian ∇2xx Q(x; µk ) becomes arbitrarily ill-conditioned
as µk → ∞. Consider equality constraints only and define A(x)T = (∇ci (x))i∈E (matrix of
constraint gradients; usually rank(A) < n):
∇2xx Q(x; µk) = ∇2f(x) + µk Σi∈E ci(x) ∇2ci(x) + µk A(x)T A(x).

Near a minimizer, from th. 17.2 we have µk ci(x) ≈ −λ∗i and so

∇2xx Q(x; µk) ≈ ∇2xx L(x, λ∗) + µk A(x)T A(x)

(the first term independent of µk; the second of rank |E| with nonzero eigenvalues of O(µk)),

where L(x, λ) is the Lagrangian function (and usually |E| < n). Unconstrained optimization
methods have problems with ill-conditioning. For Newton’s method we can apply the following
reformulation that avoids the ill-conditioning (p solves both systems? ):

Newton step: ∇2xx Q(x; µk) p = −∇x Q(x; µ) (ill-conditioned ⇒ large error in p). Introducing the dummy vector ζ = µk A(x) p, this is equivalent to

[∇2f(x) + µk Σi∈E ci(x)∇2ci(x)   A(x)T;   A(x)   −(1/µk)I] (p; ζ) = (−∇x Q(x; µ); 0),

which is well-conditioned as µk → ∞? (cf. lemma 16.1).

This system has dimension n + |E| rather than n, and is a regularized version of the SQP system (18.6) (the −(1/µk)I term makes the matrix nonsingular even if A(x) is rank-deficient).
The augmented Lagrangian method is more effective, as it delays the onset of ill-conditioning.

Exact penalty functions
• Exact penalty function φ(x; µ): ∃µ∗ > 0: ∀µ > µ∗ , any local solution x of the constrained
problem is a local minimizer of φ. So we need a single unconstrained minimization of φ(x; µ)
for such a µ > µ∗ .
• The quadratic-penalty and log-barrier functions are not exact, so they need µ → ∞.
• The ℓ1 exact penalty function (ex. 17.2, ex. 17.3)

φ1(x; µ) = f(x) + µ Σi∈E |ci(x)| + µ Σi∈I [ci(x)]−

is exact for µ∗ = largest Lagrange multiplier (in absolute value) associated with an optimal
solution (th. 17.3). Algorithms based on minimizing φ1 need:
– Rules for adjusting µ to ensure µ > µ∗ (note minimizing φ for large µ is difficult).
– Special techniques to deal with the fact that φ1 is not differentiable at any x for which
ci (x) = 0 for some i ∈ E ∪ I (and such x must be encountered).
• Any exact penalty function of the type φ(x; µ) = f(x) + µ h(c1(x)) (where h(0) = 0 and h(y) ≥ 0 ∀y ∈ R) must be nonsmooth (proof: p. 513).

Augmented Lagrangian method (method of multipliers)


• Modification of the quadratic-penalty method to reduce the possibility of ill-conditioning by
introducing explicit estimates of the Lagrange multipliers at each iterate.
• Tends to yield less ill-conditioned subproblems than the log-barrier method and doesn’t need
strictly feasible iterates for the inequality constraints.

Equality constraints only


• In the quadratic-penalty method, the minimizer xk of Q(x; µk) satisfies ci(xk) ≈ −λ∗i/µk ∀i ∈ E (from th. 17.2), and so ci(xk) → 0 as µk → ∞. Idea: redefine Q so that its minimizer xk satisfies
ci (xk ) ≈ 0 (i.e., the subproblem solution better satisfies the equality constraint); this way we
will get better iterates when µk is not so small and delay the appearance of ill-conditioning.
• Define the augmented Lagrangian function by adding a quadratic penalty to the Lagrangian
and considering the argument λ as estimates for the Lagrange multipliers at a solution (or, as a quadratic-penalty function with objective function f(x) − Σi∈E λi ci(x)):

LA(x, λ; µ) = f(x) − Σi∈E λi ci(x) + (µ/2) Σi∈E ci(x)² = f(x) + (µ/2) Σi∈E (ci(x) − λi/µ)² + const.

Near x∗, the minimizer xk of LA for λ = λk satisfies ci(xk) ≈ −(1/µk)(λ∗i − λki) ∀i ∈ E.
Pf.: 0 = ∇x LA(xk, λk; µk) = ∇f(xk) − Σi∈E (λki − µk ci(xk)) ∇ci(xk) ⇒ λ∗i ≈ λki − µk ci(xk) (KKT cond.).
So if λk is close to the optimal multiplier vector λ∗ then ‖c(xk)‖ will be much smaller than 1/µk rather than just proportional to 1/µk.
• Now we need an update equation for λk+1 so that it approximates λ∗ more and more accurately;
the relation c(xk) ≈ −(1/µk)(λ∗ − λk) suggests λk+1 ← λk − µk c(xk). (ex. 17.4)
Note that −µk ci(xk) → λ∗i in the quadratic-penalty method, but → 0 in the augmented Lagrangian method.

• Algorithmic framework 17.3 (augmented Lagrangian method—equality constraints): as for the
quadratic-penalty method but using LA (x, λ; µ) and updating λk+1 ← λk − µk c(xk ) where xk
is the (approximate) minimizer of LA (x, λk ; µ) and with given starting point λ0 .
• Choice of starting point xsk for the minimization of LA (x, λk ; µk ) less critical now (less ill-
conditioning), so we can simply take xsk+1 ← xk .
• Convergence:
– Th. 17.5: (x∗ , λ∗ ) = (local solution, Lagrange multiplier) at which KKT + LICQ +
2nd -order sufficient conditions hold (≡ well-behaved solution) ⇒ x∗ is a stationary point
of LA (x, λ∗ ; µ) for any µ ≥ 0, and ∃µ̄ > 0: ∀µ ≥ µ̄, x∗ is a strict local minimizer of
LA (x, λ∗ ; µ).
Pf.: KKT + 2nd-order cond. for constrained problem ⇒ KKT + 2nd-order cond. for unconstrained problem minx LA .

Thus LA is an exact penalty function for the optimal Lagrange multiplier λ = λ∗ and, if we
knew the latter, we would not need to take µ → ∞. In practice, we need to estimate λ∗ over
iterates and drive µ sufficiently large; if λk is close to λ∗ or if µk is large, then xk will be close
to x∗ (the quadratic-penalty method gives only one option: increase µk ).
(✐ Given λ∗ , how do we determine x∗ from the KKT conditions?)
Note that LA (x, 0; µ) = Q(x; µ).
• Special case (useful for distributed optimization): alternating direction method of multipliers
(ADMM): for a convex problem with (block-)separable objective and constraints:
min f (x) + g(z) s.t. Ax + Bz = c
x,z

the augmented Lagrangian is


LA(x, z, λ; µ) = f(x) + g(z) − λT(Ax + Bz − c) + (µ/2) ‖Ax + Bz − c‖²
which lends itself to alternating optimization:
– xk+1 ← arg minx LA(x, zk, λk; µ) (= f(x) − λT Ax + (µ/2) ‖Ax + u‖² + constant, with u = Bzk − c)
– zk+1 ← arg minz LA(xk+1, z, λk; µ) (= g(z) − λT Bz + (µ/2) ‖Bz + v‖² + constant, with v = Axk+1 − c)
– λk+1 ← λk − µ(Axk+1 + Bzk+1 − c).
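A minimal ADMM sketch for the special case A = I, B = −I, c = 0 (consensus), on a toy quadratic instance where both subproblems have closed form (all data and names are illustrative):

    # ADMM for min f(x) + g(z) s.t. x - z = 0, with f = 1/2||x-a||^2 and
    # g = 1/2||z-b||^2; the consensus solution is x = z = (a+b)/2.
    import numpy as np

    a, b, mu = np.array([1.0, 2.0]), np.array([3.0, 0.0]), 1.0
    x, z, lam = np.zeros(2), np.zeros(2), np.zeros(2)
    for _ in range(100):
        x = (a + lam + mu * z) / (1 + mu)      # argmin_x L_A(x, z, lam; mu)
        z = (b - lam + mu * x) / (1 + mu)      # argmin_z L_A(x, z, lam; mu)
        lam = lam - mu * (x - z)               # multiplier update
    print(x, z, lam)                           # -> (2, 1), (2, 1), (1, -1)

The closed-form updates come from setting the gradient of LA wrt x (resp. z) to zero; for general f, g each update is itself a small optimization problem.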

Extension to inequality constraints Three useful formulations:


1. Bound-constrained formulation: use slack variables s to turn inequalities into equalities and
bounds: ci (x) ≥ 0 ⇒ ci (x) − si = 0, si ≥ 0 ∀i ∈ I. Consider then the bound-constrained
problem (x absorbs the slacks, and li or ui can be −∞ or +∞):
minx∈Rn f(x) s.t. ci(x) = 0, i = 1, . . . , m (equalities); l ≤ x ≤ u (bounds).
Bound-constrained Lagrangian: augmented Lagrangian using equality constraints only but sub-
ject to bounds:
minx LA(x, λ; µ) = f(x) − Σi=1..m λi ci(x) + (µ/2) Σi=1..m ci(x)² s.t. l ≤ x ≤ u.
Solve this subproblem approximately, update λ and µ, repeat.
The subproblem may be solved with the (nonlinear) gradient-projection method.
Implemented in the LANCELOT package.

2. Linearly-constrained formulation: in the bound-constrained problem, solve the subproblem of
minimizing the (augmented) Lagrangian subject to linearization of the constraints:
minx Fk(x) s.t. ci(xk) + ∇ci(xk)T (x − xk) = 0, i = 1, . . . , m; l ≤ x ≤ u
where

• Augmented Lagrangian Fk(x) = f(x) − Σi=1..m λki cki(x) + (µ/2) Σi=1..m (cki(x))² (Lagrangian + quadratic penalty).
• Current Lag. mult. estimate λk = Lag. mult. for the linearized constraints at k − 1.
• cki (x) = ci (x) − (ci (xk ) + ∇ci (xk )T (x − xk )) = true − linearized = “second-order ci remainder”.

Similar to SQP but with a nonlinear objective (hard subproblem); particularly effective when
most of the constraints are linear.
Implemented in the MINOS package.

3. Unconstrained formulation: consider again the general constrained problem (without equality
constraints for simplicity) and introduce slack variables: ci (x) ≥ 0 ⇒ ci (x) −si = 0, si ≥ 0 ∀i ∈
I. Consider the bound-constrained augmented Lagrangian
minx,s LA(x, s, λ; µ) = f(x) − Σi∈I λi (ci(x) − si) + (µ/2) Σi∈I (ci(x) − si)² s.t. si ≥ 0 ∀i ∈ I.

Apply “coordinate descent” first in s, then in x:


• In s: mins LA s.t. si ≥ 0 ∀i ∈ I ⇒ solution: si = max(0, ci(x) − λki/µk) ∀i ∈ I, because LA is a convex, separable quadratic form on s.
• In x: substitute the solution si in LA to obtain:

LA(x, λk; µk) = f(x) + Σi∈I ψ(ci(x), λki; µk) (all arguments of ψ scalar), where ψ(t, σ; µ) = −σt + (µ/2)t² if t ≤ σ/µ, and −σ²/(2µ) otherwise.

Now, iterate the following:

• Solve approximately the unconstrained problem minx LA (x, λk ; µk ).


Note that ψ doesn’t have a second derivative? when µt = σ ⇔ µci (x) = λi for some i ∈ I. Fortunately, this rarely happens,
since from the strict complementarity KKT condition (if it holds) exactly one of λ∗i and ci (x∗ ) is 0, so the iterates should
stay away from points at which µk ci (xk ) = λki . Thus it is safe to use Newton’s method. For a weakly active constraint
µci (x∗ ) = λ∗i = 0 does hold.

• Update the Lagrange multipliers as λk+1i ← max(λki − µk ci(xk), 0) ∀i ∈ I (since λi ≥ 0 for the KKT conditions to hold at the solution).

• Update µk , etc.
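A direct transcription of ψ and the multiplier update (a sketch with scalar arguments, as above):

    # psi(t, sigma; mu) from the unconstrained formulation, and the update.
    def psi(t, sigma, mu):
        if t <= sigma / mu:
            return -sigma * t + 0.5 * mu * t ** 2
        return -sigma ** 2 / (2 * mu)             # constant beyond t = sigma/mu

    def update_lambda(lam_i, mu, c_i):
        return max(lam_i - mu * c_i, 0.0)         # keep lambda_i >= 0 (KKT)

Note that psi is continuous and once differentiable at t = sigma/mu, matching the smoothness discussion above.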

Review: penalty and augmented Lagrangian methods
• Quadratic-penalty method : sequence of unconstrained minimization subproblems where we drive µ → ∞
minx Q(x; µ) = f(x) + (µ/2) Σi∈E ci(x)² + (µ/2) Σi∈I ([ci(x)]−)²

thus forcing the minimizer of Q closer to the feasible region of the constrained problem.
– Assuming equality constraints, if this converges to a point x∗ , then either it is infeasible and a stationary
point of kc(x)k2 , or it is feasible; in the latter case, if A(x∗ ) (the matrix of active constraint gradients)
has full rank then −µc(x) → λ∗ and (x∗ , λ∗ ) satisfy the KKT cond.
– Problem: typically, ∇2xx Q(x; µ) becomes progressively more ill-conditioned as µ → ∞.
• Exact penalty function φ(x; µ): ∃µ∗ > 0: ∀µ > µ∗ , any local solution x of the constrained problem is a local
minimizer of φ. So we need a single unconstrained minimization of φ(x; µ) for such a µ > µ∗ .
– Ex.: ℓ1 exact penalty function (it is exact for µ∗ = largest Lagrange multiplier):
φ1(x; µ) = f(x) + µ Σi∈E |ci(x)| + µ Σi∈I [ci(x)]−.

Exact penalty functions of the form φ(x; µ) = f (x) + µh(c1 (x)) are nonsmooth.
• Augmented Lagrangian method (method of multipliers): sequence of unconstrained minimization subproblems
where we drive µ → ∞ (for equality constraints)

     min_x L_A(x, λ; µ) = f(x) − Σ_{i∈E} λ_i c_i(x) + (µ/2) Σ_{i∈E} c_i²(x),      λ ← λ − µ c(x).

– Modifies the quadratic-penalty function by introducing explicit estimates of the Lagrange multipliers,
and this delays the onset of ill-conditioning.
– Convergence: a well-behaved solution (x∗ , λ∗ ) is a stationary point of LA (x, λ∗ ; µ) ∀µ ≥ 0 and a strict
local minimizer of LA (x, λ∗ ; µ) ∀µ ≥ µ̄ (for some µ̄ > 0). So LA is an exact penalty function for λ = λ∗
(but we don’t know λ∗ ).
– Inequality constraints: several formulations (bound-constrained Lagrangian, linearly-constrained formu-
lation, unconstrained formulation).
18 Sequential quadratic programming (SQP)
One of the most effective approaches for nonlinearly constrained optimization, large or small.

Local SQP method

General nonlinear programming problem:  min_x f(x)  s.t.  c_i(x) = 0, i ∈ E;  c_i(x) ≥ 0, i ∈ I.

Approximate the objective quadratically and linearize the constraints to obtain a QP subproblem:

     min_p ½ p^T ∇²_xx L(x_k, λ_k) p + ∇f(x_k)^T p + f(x_k)   s.t.  ∇c_i(x_k)^T p + c_i(x_k) = 0, i ∈ E
                                                                    ∇c_i(x_k)^T p + c_i(x_k) ≥ 0, i ∈ I

where ∇²_xx L(x_k, λ_k) is the Hessian of the Lagrangian L(x, λ) = f(x) − λ^T c(x), for a given λ estimate.
(✐ Why not use ∇²f instead of ∇²_xx L? See later.)
(✐ Using ∇_x L instead of ∇f_k changes nothing for equality constraints. Why?)

Algorithm 18.1 (local SQP algorithm):

Given initial x0 , λ0
for k = 0, 1, 2, . . .
Evaluate fk , ∇fk , ci (xk ), ∇ci (xk ), ∇2xx L(xk , λk )
(pk , λk+1 ) ← (solution, Lagrange multiplier) of QP subproblem
xk+1 ← xk + pk
if convergence test satisfied ⇒ stop with approximate solution (xk+1 , λk+1 )
end

• Intuitive idea: the QP subproblem is Newton's method applied to the optimality conditions of
  the problem. Consider only equality constraints for simplicity and write min_x f(x) s.t. c(x) = 0
  with c(x)^T = (c_1(x), . . . , c_m(x)) and A(x)^T = (∇c_1(x), . . . , ∇c_m(x)):

  (i) The solution (p_k, l_k) of the QP subproblem satisfies

        ∇²_xx L_k p_k + ∇f_k − A_k^T l_k = 0          [ ∇²_xx L_k   −A_k^T ] [ p_k ]   [ −∇f_k ]
                                               ⇔      [ A_k            0   ] [ l_k ] = [ −c_k  ].
        A_k p_k + c_k = 0

  (ii) KKT system for the problem: F(x, λ) = (∇_x L(x, λ); c(x)) = 0 (where ∇_x L(x, λ) =
       ∇f(x) − A(x)^T λ), for which Newton's method (for root finding) results in the step

        [ x_{k+1} ]   [ x_k ]   [ p_k ]          [ ∇²_xx L_k   −A_k^T ] [ p_k ]   [ −∇f_k + A_k^T λ_k ]
        [ λ_{k+1} ] = [ λ_k ] + [ p_λ ]   where  [ A_k            0   ] [ p_λ ] = [ −c_k              ]

       (the coefficient matrix is the Jacobian of F at (x_k, λ_k) and the RHS is −F(x_k, λ_k)).

  (i) ≡ (ii), since the two linear systems have the same solution (define l_k = p_λ + λ_k).
• Assumptions (recall lemma 16.1 in ch. 16 about equality-constrained QP):
  – The constraint Jacobian A_k has full row rank (LICQ).
  – ∇²_xx L_k is pd on the tangent space of the constraints (d^T ∇²_xx L_k d > 0 ∀d ≠ 0 with A_k d = 0)
  ⇒ the KKT matrix is nonsingular and the linear system has a unique solution.
  Do these assumptions hold? They do locally (near the solution) if the problem solution satisfies the
  2nd-order sufficient conditions. Then Newton's method converges quadratically.
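A minimal sketch of this local SQP iteration for one equality constraint, solving the KKT system above
at each step (the toy problem min x_1 + x_2 s.t. x_1² + x_2² = 2 and the starting point are illustrative
assumptions; a practical implementation adds the globalization safeguards discussed below):

# Local SQP = Newton's method on the KKT conditions of min f(x) s.t. c(x) = 0.
import numpy as np

grad_f = lambda x: np.array([1.0, 1.0])            # f(x) = x1 + x2
c      = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0])
A      = lambda x: np.array([[2*x[0], 2*x[1]]])    # constraint Jacobian (1 x 2)
hess_L = lambda lam: -2.0*lam[0]*np.eye(2)         # Hessian of L = f - lam*c

x, lam = np.array([-1.2, -0.9]), np.array([-0.4])
for k in range(10):
    Ak = A(x)
    K = np.block([[hess_L(lam), -Ak.T],            # KKT matrix [[H, -A^T], [A, 0]]
                  [Ak, np.zeros((1, 1))]])
    rhs = np.concatenate([-grad_f(x), -c(x)])
    sol = np.linalg.solve(K, rhs)
    x, lam = x + sol[:2], sol[2:]                  # step p_k and new multiplier l_k
print(x, lam)   # converges quadratically to (-1,-1) with lambda* = -1/2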
Considering now equality and inequality constraints:

– Th. 18.1: (x∗ , λ∗ ) local solution at which KKT + LICQ + 2nd-order + strict complemen-
tarity hold ⇒ if (xk , λk ) is sufficiently close to (x∗ , λ∗ ), there is a local solution of the QP
subproblem whose active set Ak is A(x∗ ).

Once (x_k, λ_k) is close enough, the QP subproblem correctly identifies the active set at the solution
and SQP behaves like Newton steps for an equality-constrained problem.

• To ensure global convergence (≡ from remote starting points), Newton’s method needs to be
modified (just as in the unconstrained optimization case). This includes defining a merit func-
tion (which evaluates the goodness of an iterate, trading off reducing the objective function
but improving feasibility) and applying the strategies of:

– Line search: modify the Hessian of the quadratic model to make it pd, so that pk is a
descent direction for the merit function.
– Trust region: limit the step size to a region so that the step produces sufficient decrease
of the merit function (the Hessian need not be pd).

Additional issues need to be accounted for, e.g. the linearization of inequality constraints may
produce an infeasible subproblem.
Ex.: linearizing x ≤ 1, x2 ≥ 0 at xk = 3 results in 3 + p ≤ 1, 9 + 6p ≥ 0 which is inconsistent.

Review: sequential quadratic programming (SQP)


• Very effective for all problem sizes.
• Approximate the objective quadratically and linearize the constraints to obtain a QP subproblem:

     min_p ½ p^T ∇²_xx L(x_k, λ_k) p + ∇f(x_k)^T p + f(x_k)   s.t.  ∇c_i(x_k)^T p + c_i(x_k) = 0, i ∈ E
                                                                    ∇c_i(x_k)^T p + c_i(x_k) ≥ 0, i ∈ I

where ∇2xx L(xk , λk ) is the Hessian of the Lagrangian L(x, λ) = f (x) − λT c(x), for a given λ estimate.
• For equality constraints, this is equivalent to applying Newton’s method to the KKT conditions.
• Local convergence: near a solution (x∗ , λ∗ ), the QP subproblem correctly identifies the active set at the solution
and SQP behaves like Newton steps for an equality-constrained problem, converging quadratically.
• Global convergence: Newton’s method needs to be modified, by defining a merit function and applying the
strategies of line search or trust region (as in unconstrained optimization).
19 Interior-point methods for nonlinear programming
• Considered the most powerful algorithms (together with SQP) for large-scale nonlinear pro-
gramming.
• Extension of the interior-point methods for LP and QP.
• Terms “interior-point methods” and “barrier methods” used interchangeably (but different
historical origin).

Interior-point methods as homotopy methods

General nonlinear programming problem:  min_x f(x)  s.t.  c_i(x) = 0, i ∈ E;  c_i(x) ≥ 0, i ∈ I.

Deriving the KKT conditions (Lagrangian L(x, y, z) = f(x) − y^T c_E(x) − z^T c_I(x)) and introducing
in them slacks (c_i(x) ≥ 0 ⇒ c_i(x) − s_i = 0, s_i ≥ 0 ∀i ∈ I) and a parameter µ > 0, we obtain the
perturbed KKT conditions:

     ∇f(x) − A_E^T(x) y − A_I^T(x) z = 0            c_E(x), c_I(x): constraint vectors
     Sz − µe = 0                                    A_E, A_I: constraint Jacobians
     c_E(x) = 0                                     y, z: Lagrange multipliers
     c_I(x) − s = 0                                 S = diag(s_i), e = (1, . . . , 1)^T
     s, z ≥ 0
We solve (approximately) this nonlinear system of equations for a sequence µ_k → 0 (as k → ∞) while
preserving s, z > 0, and require the iterates to decrease a merit function (to help converge not just
to a KKT point but to a minimizer). This follows a primal-dual central path (x(µ), s(µ), y(µ), z(µ))
that steers through the interior of the primal-dual feasible set, avoiding solutions that satisfy the
nonlinear system of equations but not s, z > 0, and converges to a solution (x*, s*, y*, z*) as µ → 0⁺
under some conditions. The Newton step is

     [ ∇²_xx L    0    −A_E^T(x)   −A_I^T(x) ] [ p_x ]       [ ∇f(x) − A_E^T(x) y − A_I^T(x) z ]
     [    0       Z        0           S     ] [ p_s ]  = −  [ Sz − µe                         ]
     [ A_E(x)     0        0           0     ] [ p_y ]       [ c_E(x)                          ]
     [ A_I(x)    −I        0           0     ] [ p_z ]       [ c_I(x) − s                      ]
Note that no ill-conditioning arises if the 2nd-order and strict complementarity conditions hold at x*,
because then, for each i, at least one of s_i and z_i is nonzero, so the second block row of the Jacobian
has full row rank. The system can be rewritten in a symmetric form (eq. 19.12), but that form introduces
ill-conditioning.
Practical versions of interior-point methods follow line-search or trust-region implementations.
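A minimal sketch that assembles and solves this Newton system for a problem with one inequality and
no equality constraints, so the A_E rows and columns drop out (the toy problem, the µ schedule and the
fraction-to-the-boundary damping factor 0.995 are illustrative assumptions):

# Primal-dual interior-point iteration loop for min f(x) s.t. c(x) >= 0.
import numpy as np

grad_f = lambda x: 2*x                              # f(x) = x1^2 + x2^2
hess_L = lambda x, z: 2*np.eye(2)                   # c linear => hess L = hess f
c      = lambda x: np.array([x[0] + x[1] - 1.0])    # c(x) >= 0
A      = lambda x: np.array([[1.0, 1.0]])

x, s, z, mu = np.array([2.0, 0.0]), np.array([1.0]), np.array([1.0]), 1.0
for k in range(25):
    S, Z = np.diag(s), np.diag(z)
    K = np.block([[hess_L(x, z),     np.zeros((2, 1)), -A(x).T          ],
                  [np.zeros((1, 2)), Z,                S                ],
                  [A(x),             -np.eye(1),       np.zeros((1, 1)) ]])
    r = np.concatenate([grad_f(x) - A(x).T @ z, s*z - mu, c(x) - s])
    p = np.linalg.solve(K, -r)
    px, ps, pz = p[:2], p[2:3], p[3:]
    # damped step preserving s, z > 0 (fraction-to-the-boundary rule)
    alpha = min(1.0, *(0.995*v/-pv for v, pv in zip(s, ps) if pv < 0),
                     *(0.995*v/-pv for v, pv in zip(z, pz) if pv < 0))
    x, s, z = x + alpha*px, s + alpha*ps, z + alpha*pz
    mu *= 0.5                                       # drive mu -> 0
print(x, s, z)   # -> x* = (0.5, 0.5) with the constraint active and z* = 1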

Interior-point methods as barrier methods

In the perturbed KKT conditions, eliminate s and z: for i ∈ I, s_i z_i − µ = 0 and c_i(x) − s_i = 0
⇒ z_i = µ/c_i(x); substituting in ∇_x L(x, y, z):

     ∇f(x) − A_E^T(x) y − Σ_{i∈I} (µ/c_i(x)) ∇c_i(x) = 0   ⇔   min_x P(x; µ) = f(x) − µ Σ_{i∈I} log c_i(x)  s.t.  c_E(x) = 0.

That is, the interior-point method can be seen as minimizing the log-barrier function P(x; µ) (subject
to the equalities) and taking µ → 0.
The primal log-barrier method
• Consider the inequality-constrained problem min_x f(x) s.t. c_i(x) ≥ 0, i ∈ I.
  Strictly feasible region F⁰ = {x ∈ Rⁿ: c_i(x) > 0 ∀i ∈ I}, assumed nonempty.
  Define the log-barrier function (through a barrier parameter µ > 0):   (ex. 19.1)

     P(x; µ) = f(x) − µ Σ_{i∈I} log c_i(x)
               (objective)   (log-barrier term)

  The log-barrier term is infinite everywhere except in F⁰, smooth inside F⁰, and approaches ∞
  as x approaches the boundary of F⁰.

• The minimizer x(µ) of P(x; µ) approaches a solution of the constrained problem as µ → 0⁺.

• Algorithmic framework 19.5 (primal log-barrier method): as for the quadratic penalty but
  choosing a new barrier parameter µ_{k+1} ∈ (0, µ_k) instead of a new penalty parameter. Similar
  choices of tolerances, adaptive decrease of µ_k, starting point x_k^s for the minimization of
  P(x; µ), etc. (✐ Is x_k^s feasible if extrapolating along the path tangent?)
• The log-barrier function is smooth (if f, c_i are), so, if x(µ) ∈ F⁰, no constraints are active and
  we can use derivative-based techniques for unconstrained optimization.

• The point x(∞) = arg max_x Σ_{i∈I} log c_i(x) is called the analytic center of the inequalities {c_i}_{i∈I}.
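A minimal sketch of the resulting outer loop (the toy problem and the µ schedule are illustrative
assumptions; the derivative-free inner solver just keeps the sketch short, since in practice one uses
a derivative-based method with a line search that stays strictly feasible):

# Primal log-barrier method for min f(x) s.t. c(x) > 0.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2)**2 + (x[1] - 1)**2
c = lambda x: 1.0 - x[0]**2 - x[1]**2

def P(x, mu):                       # log-barrier function; +inf outside F0
    cx = c(x)
    return f(x) - mu*np.log(cx) if cx > 0 else np.inf

x, mu = np.array([0.0, 0.0]), 1.0   # strictly feasible starting point
for k in range(12):
    x = minimize(P, x, args=(mu,), method='Nelder-Mead').x  # inner minimization
    lam = mu / c(x)                 # multiplier estimate lambda(mu) = mu/c(x(mu))
    mu *= 0.2                       # decrease the barrier parameter
print(x, lam)   # x(mu) -> boundary point (2,1)/sqrt(5), lam -> sqrt(5)-1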

Convergence
• For convex programs: global convergence.
  Th.: f, {−c_i, i ∈ I} convex functions, F⁰ ≠ ∅ ⇒
  1. For any µ > 0, P(x; µ) is convex in F⁰ and attains a minimizer x(µ) (not necessarily
     unique) on F⁰; any local minimizer x(µ) is also global.
  2. If the set of solutions of the constrained optimization problem is nonempty and bounded,
     and if (µ_k) is a decreasing sequence with µ_k → 0 ⇒ (x(µ_k)) converges to a solution x*,
     with f(x(µ_k)) → f* and P(x(µ_k); µ_k) → f*.
  If there are no solutions or the solution set is unbounded, the theorem may not apply.

• For general inequality-constrained problems: local convergence.

  Th.: F⁰ ≠ ∅, (x*, λ*) = (local solution, Lagrange multiplier) at which KKT + LICQ +
  2nd-order sufficient conditions hold (≡ well-behaved solution) ⇒
  1. For all sufficiently small µ, ∃! continuously differentiable function x(µ): x(µ) is a local
     minimizer of P(x; µ) and ∇²_xx P(x; µ) is pd.
  2. (x(µ), λ(µ)) → (x*, λ*) as µ → 0, where λ_i(µ) = µ/c_i(x(µ)), i ∈ I.

  This means there may be sequences of minimizers of P(x; µ) that don't converge to a solution as µ → 0.

Relation between the minimizers of P(x; µ) and a solution (x*, λ*): at a minimizer x(µ):

     0 = ∇_x P(x(µ); µ) = ∇f(x(µ)) − Σ_{i∈I} (µ/c_i(x(µ))) ∇c_i(x(µ)) = ∇f(x(µ)) − Σ_{i∈I} λ_i(µ) ∇c_i(x(µ)),

defining λ_i(µ) = µ/c_i(x(µ)). This is KKT condition a) for the constrained problem (∇_x L(x, λ) = 0).
As for the other KKT conditions at (x(µ), λ(µ)): b) (c_i(x) ≥ 0, i ∈ I) and c) (λ_i ≥ 0, i ∈ I) also hold
since c_i(x(µ)) > 0; only the complementarity condition d) fails: λ_i c_i(x) = µ > 0; but it holds in the
limit µ → 0. The path C_p = {x(µ): µ > 0} is called the primal central path, and is the projection on the
primal variables of the primal-dual central path from the interior-point version.
Practical problems   As with the quadratic-penalty method, the barrier function looks quadratic
only very near its minimizer, and the Hessian ∇²_xx P(x; µ_k) becomes ill-conditioned as µ_k → 0:

     ∇_x P(x; µ) = ∇f(x) − Σ_{i∈I} (µ/c_i(x)) ∇c_i(x)
     ∇²_xx P(x; µ) = ∇²f(x) − Σ_{i∈I} (µ/c_i(x)) ∇²c_i(x) + Σ_{i∈I} (µ/c_i²(x)) ∇c_i(x) ∇c_i(x)^T.

Near a minimizer x(µ) with µ small, from the earlier theorem the optimal Lagrange multipliers can be
estimated as λ_i^* ≈ µ/c_i(x), so

     ∇²_xx P(x; µ) ≈ ∇²_xx L(x; λ*) + (1/µ) Σ_{i∈I} (λ_i^*)² ∇c_i(x) ∇c_i(x)^T,

where the first term is independent of µ and the second becomes very large as µ → 0 for the active
constraints (λ_i^* ≠ 0), and has rank < n.

The Newton step can be reformulated as in the quadratic-penalty method to avoid the ill-conditioning,
and it should be implemented with a line-search or trust-region strategy to remain (well) strictly
feasible.

Equality constraints   min_x f(x)  s.t.  c_i(x) = 0, i ∈ E;  c_i(x) ≥ 0, i ∈ I.
Splitting an equality constraint c_i(x) = 0 as two inequalities c_i(x) ≥ 0 and −c_i(x) ≥ 0 doesn't work
(there are then no strictly feasible points, so the log barrier is undefined), but we can combine the
quadratic penalty and the log-barrier:

     B(x; µ) = f(x) − µ Σ_{i∈I} log c_i(x) + (1/(2µ)) Σ_{i∈E} c_i²(x).

This has similar aspects to the quadratic-penalty and barrier methods: algorithm = successive reduction
of µ alternated with approximate minimization of B wrt x; ill-conditioned ∇²_xx B when µ is small; etc.
To find an initial point that is strictly feasible wrt the inequality constraints, introduce slack
variables s_i, i ∈ I:

     min_{x,s} f(x)  s.t.  c_i(x) = 0, i ∈ E;   c_i(x) − s_i = 0, i ∈ I;   s_i ≥ 0, i ∈ I   ⇒

     B(x, s; µ) = f(x) − µ Σ_{i∈I} log s_i + (1/(2µ)) Σ_{i∈E} c_i²(x) + (1/(2µ)) Σ_{i∈I} (c_i(x) − s_i)².

Now, any point (x, s) with s > 0 lies in the domain of B.
Review: interior-point methods for nonlinear programming
• Considered the most powerful algorithms (together with SQP) for large-scale nonlinear programming.
• Interior-point methods as homotopy methods: perturbed KKT conditions (introducing slacks s ≥ 0 for
  the inequalities):

     ∇f(x) − A_E^T(x) y − A_I^T(x) z = 0,   Sz − µe = 0,   c_E(x) = 0,   c_I(x) − s = 0,   s, z ≥ 0,

  with Newton step

     [ ∇²_xx L    0    −A_E^T(x)   −A_I^T(x) ] [ p_x ]       [ ∇f(x) − A_E^T(x) y − A_I^T(x) z ]
     [    0       Z        0           S     ] [ p_s ]  = −  [ Sz − µe                         ]
     [ A_E(x)     0        0           0     ] [ p_y ]       [ c_E(x)                          ]
     [ A_I(x)    −I        0           0     ] [ p_z ]       [ c_I(x) − s                      ]

  This follows the primal-dual central path (x(µ), s(µ), y(µ), z(µ)) as µ → 0⁺ while preserving s, z > 0,
  avoiding spurious solutions.
– No ill-conditioning arises for a well-behaved solution.
– Practical versions of interior-point methods follow line-search or trust-region implementations.
• Primal log-barrier method: sequence of unconstrained minimization subproblems where we drive µ → 0⁺:

     min_x P(x; µ) = f(x) − µ Σ_{i∈I} log c_i(x)     (if there are equality constraints, add them to P as a quadratic penalty)

  thus allowing the minimizer of P to approach the boundary of the feasible set from inside it.
  – Convergence: global for convex problems; otherwise local for a well-behaved solution (x*, λ*):
    (x(µ), λ(µ)) → (x*, λ*) as µ → 0, where λ_i(µ) = µ/c_i(x(µ)), i ∈ I.
  – C_p = {x(µ): µ > 0} is the primal central path, the projection on the primal variables of the
    primal-dual central path from the interior-point version.
  – Problem: typically, ∇²_xx P(x; µ) becomes progressively more ill-conditioned as µ → 0.
• Interior-point methods can be seen as barrier methods by eliminating s and z in the perturbed KKT
  conditions, which then become the gradient of the log-barrier function.
Final comments
Fundamental ideas underlying most methods:

• Optimality conditions (KKT, 2nd-order):

– check whether a point is a solution


– suggest algorithms
– convergence proofs

• Sequence of subproblems that converges to our problem, where each subproblem is easy:

– line search
– trust region
– homotopy or path-following, interior-point
– quadratic penalty, augmented Lagrangian, log-barrier
– sequential quadratic programming
– etc.

• Transformations in order to simplify the problem, introduce decoupling, reduce ill-conditioning,


etc. In particular, by introducing new variables: slack variables, homotopy parameter, penalty
parameter (quadratic-penalty, augmented Lagrangian, log-barrier), duality measure (interior-
point), primal-dual methods (interior-point, augmented Lagrangian), consensus transforma-
tions (ADMM), auxiliary coordinates (MAC). . .

• Simpler function valid near the current iterate (e.g. linear or quadratic, via Taylor's th.): lets us
  predict the function locally.

• Derivative ≈ finite difference (e.g. secant equation).

• Common techniques to speed up computation:

– inexact steps: approximate rather than exact solution of the subproblem, to minimize
overall computation
– warm starts: initialize subproblem from previous iteration’s result
– caching factorizations: for linear algebra subproblems, e.g. linear system with constant
coefficient matrix but variable RHS

• Heuristics are useful for inventing algorithms, but they must be backed by theory guaranteeing
  good performance (e.g. line-search heuristics are fine as long as the Wolfe conditions hold).

• Problem-dependent heuristics, e.g.:

– restarts in nonlinear conjugate gradients


– how accurately to solve the subproblem (forcing sequences, tolerances)
– how to choose the centering parameter σ in interior-point methods
– how to increase the penalty parameter µ in quadratic penalty, augmented Lagrangian
Given your particular optimization problem:

• No best method in general; use your understanding of basic methods and fundamental ideas to
choose an appropriate method, or to design your own.

• Recognize the type of problem and its structure: Differentiable? Smooth? (Partially) sepa-
rable? LP? QP? Convex? And the dual? How many constraints? Many optima? Feasible?
Bounded? Sparse Hessian? Etc.

• Simplify the problem if possible: improve variable scaling, introduce slacks, eliminate variables
or redundant constraints, introduce new variables/constraints to decouple terms. . .

• Try to come up with subproblems that make good progress towards the solution but are easy
to solve.

• Try to guess good initial iterates: from domain knowledge, or from solving a version of the
problem that is simpler (e.g. convex or less nonlinear) or smaller (using fewer variables).

• Determine your stopping criterion. Do you need a highly accurate or just an approximate
minimizer? Do you need to identify the active constraints at the solution accurately?

• Evaluate costs (time & memory):

– computing the objective function, gradient, Hessian


– solving linear systems or factorizing matrices (sparse?)
– solving subproblems

• Close the loop between the definition of the optimization problem (motivated by a practical
application) and the computational approach to solve it, in order to find a good compromise:
a problem that is practically meaningful (its solution is useful and accurate or approximate
enough) and convenient to solve (efficiently, in a scalable way, using existing algorithms, etc.).

• In practical problems in science, engineering, economics, etc., optimization (together with
  linear algebra, Fourier and other transforms, etc.) is a mathematical tool for formulating problems
  that we know how to solve effectively.
Ref: Gill, Murray & Wright 1981.
A Math review
Error analysis and floating-point arithmetic

• Floating-point representation of x ∈ R: fl(x) = 2^e Σ_{i=1}^t d_i 2^{−i} (t bits for the fractional
  part, d₁ = 1, remaining bits for exponent and sign).
• Unit roundoff u = 2^{−t−1} (≈ 1.1 × 10⁻¹⁶ for 64-bit IEEE double precision). Matlab: eps = 2u.
• Any x with |x| ∈ [2^L, 2^U] (where e ∈ {L+1, . . . , U}) can be approximated with relative accuracy u:
  |fl(x) − x| / |x| ≤ u ⇔ fl(x) = x(1 + ε) with roundoff error |ε| ≤ u (so x and fl(x) agree to at least
  15 decimal digits).
• Roundoff errors accumulate during floating-point operations. An algorithm is stable if errors do not
  grow unboundedly. A particularly nasty case is cancellation: the relative error in computing x − y
  when x and y are very close is ≲ 2u|x|/|x − y| ⇒ precision loss; or, if x and y are accurate to k
  digits and they agree in the first k̄, their difference contains only about k − k̄ significant
  digits. So, avoid taking the difference of similar floating-point numbers.
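A small demonstration of cancellation (the particular expressions are an illustrative choice):

# sqrt(x+1) - sqrt(x) for large x subtracts two nearly equal numbers and loses
# most significant digits; the algebraically equivalent form below is stable.
import math
x = 1e12
naive  = math.sqrt(x + 1) - math.sqrt(x)
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))
print(naive, stable)   # naive keeps only ~4-5 correct digits; stable keeps ~16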

Functions, derivatives, sets

• f: Rⁿ → R.
  Gradient of f at x (n × 1 vector): ∇f(x) = (∂f/∂x₁, . . . , ∂f/∂x_n)^T.
  Hessian of f at x (n × n symmetric matrix): ∇²f(x) = (∂²f/∂x_i∂x_j)_{ij}.

• Directional derivative along direction p ∈ Rⁿ:

     lim_{ε→0} (f(x + εp) − f(x))/ε = ∇f(x)^T p    [Pf.: Taylor's th.]
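A quick numerical check of this identity (the example function is an arbitrary choice):

# (f(x+eps*p) - f(x))/eps -> grad_f(x)^T p as eps -> 0, with error O(eps).
import numpy as np
f      = lambda x: x[0]**2 + np.sin(x[1])
grad_f = lambda x: np.array([2*x[0], np.cos(x[1])])
x, p = np.array([1.0, 0.5]), np.array([0.3, -0.7])
for eps in (1e-2, 1e-4, 1e-6):
    print(eps, (f(x + eps*p) - f(x))/eps, grad_f(x) @ p)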
• Mean value th.:
– φ: R → R continuously differentiable, α1 > α0 :
φ(α1 ) − φ(α0 ) = φ′ (ξ)(α1 − α0 ) for some ξ ∈ (α0 , α1 ).
– f : Rn → R cont. diff., for any p ∈ Rn :
f (x + p) − f (x) = ∇f (x + αp)T p for some α ∈ (0, 1).
• Taylor’s th.: f : Rn → R, p ∈ Rn :
– If f is cont. diff. then:
f (x + p) = f (x) + ∇f (x + tp)T p for some t ∈ (0, 1). [Mean value th.]
– If f is twice cont. diff. then:
       f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x + tp) p for some t ∈ (0, 1).
       ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + tp) p dt.
• r: Rⁿ → R^m; Jacobian matrix of r at x (m × n matrix): J(x) = (∂r_i/∂x_j)_{ij}.
  Taylor's th.: r(x + p) = r(x) + ∫₀¹ J(x + tp) p dt (if J is continuous in the domain of interest).
• r is Lipschitz continuous in N ⊂ Rⁿ if ∃L > 0: ‖r(x) − r(y)‖ ≤ L‖x − y‖ ∀x, y ∈ N.
  The sum and (if they are bounded) the product of Lipschitz continuous functions are Lipschitz
  continuous. If the Jacobian of r exists and is bounded on N, then r is Lipschitz continuous.
• Cone: a set F verifying x ∈ F ⇒ αx ∈ F ∀α > 0. Ex.: {(x₁, x₂): x₁ > 0, x₂ ≥ 0}.
  Cone generated by {x₁, . . . , x_m} ⊂ Rⁿ: {x ∈ Rⁿ: x = Σ_{i=1}^m α_i x_i, α_i ≥ 0 ∀i = 1, . . . , m}.
  Convex hull of {x₁, . . . , x_m} ⊂ Rⁿ: {x ∈ Rⁿ: x = Σ_{i=1}^m α_i x_i, α_i ≥ 0 ∀i = 1, . . . , m,
  Σ_{i=1}^m α_i = 1}.

Matrices
• Positive definite (pd) matrix B ⇔ p^T B p > 0 ∀p ≠ 0. Positive semidefinite (psd) if ≥ 0.
• Matrix norm induced by a vector norm: ‖A‖ = sup_{x≠0} ‖Ax‖/‖x‖.
  Ex.: ‖A‖₂ (spectral norm) = largest s.v. σ_max(A) = sqrt of the largest eigenvalue of A^T A.
  If A is symmetric, then its s.v.'s are the absolute values of its eigenvalues.
• Condition number of a square nonsingular matrix A: κ(A) = ‖A‖ ‖A⁻¹‖ ≥ 1, where ‖·‖ is any matrix
  norm. Ex.: κ₂(A) = σ_max/σ_min. For a square linsys Ax = b perturbed to Ãx̃ = b̃:
  ‖x − x̃‖/‖x‖ ≲ κ(A) (‖A − Ã‖/‖A‖ + ‖b − b̃‖/‖b‖), so ill-conditioned problem ⇔ large κ(A). (Ex. in p. 616.)
• Eigenvalues and eigenvectors of a real matrix A: Au = λu, with eigenvalue λ ∈ C and eigenvector
  u ∈ Cⁿ.
  – A symmetric: all λ ∈ R, u ∈ Rⁿ; eigenvectors of different eigenvalues are ⊥.
  – A nonsingular: all λ ≠ 0.
  – A pd: all λ > 0 (nd: all λ < 0); psd: all λ ≥ 0 (nsd: all λ ≤ 0); not definite: mixed-sign λ.

• Spectral theorem: A symmetric, real, with normalized eigenvectors u₁, . . . , u_n ∈ Rⁿ associated
  with eigenvalues λ₁, . . . , λ_n ∈ R ⇒ A = UΛU^T = Σ_{i=1}^n λ_i u_i u_i^T, where U = (u₁ · · · u_n) is
  orthogonal and Λ = diag(λ₁, . . . , λ_n). In other words, a symmetric real matrix can be diagonalized
  in terms of its eigenvalues and eigenvectors.
  Spectrum of A = eigenvalues of A.

Linear independence, subspaces


• v₁, . . . , v_k ∈ Rⁿ are l.i. if ∀λ₁, . . . , λ_k ∈ R: λ₁v₁ + · · · + λ_k v_k = 0 ⇒ λ₁ = · · · = λ_k = 0.
• Vectors u, v ≠ 0 are orthogonal (u ⊥ v) iff u^T v = 0 (recall u^T v = ‖u‖‖v‖ cos ∠(u, v)).
  Orthogonal matrix: U⁻¹ = U^T. For the Euclidean norm: ‖Ux‖ = ‖x‖.
• span{v₁, . . . , v_k} = {x: x = λ₁v₁ + · · · + λ_k v_k for λ₁, . . . , λ_k ∈ R} is the set of all vectors
  that are linear combinations of v₁, . . . , v_k, i.e., the linear subspace spanned by v₁, . . . , v_k.
• If v₁, . . . , v_k are l.i. then they are a basis of span{v₁, . . . , v_k}, which has dimension k.

Subspaces of a real matrix A (m × n) ⇔ linear mapping Rⁿ → R^m

• Null space: null(A) = {x ∈ Rⁿ: Ax = 0}, i.e., the subspace associated with eigenvalue 0.
• Range space: range(A) = {y ∈ R^m: y = Ax for some x ∈ Rⁿ}.
• Fundamental theorem of linear algebra: null(A) ⊕ range(A^T) = Rⁿ (direct sum: for each x ∈ Rⁿ
  ∃! u ∈ null(A) and v ∈ range(A^T) with x = u + v). Also: null(A) ∩ range(A^T) = {0},
  null(A) ⊥ range(A^T), and dim(null(A)) + dim(range(A^T)) = n, where dim(range(A^T)) =
  dim(range(A)) = rank(A).
Ex. (for the 2 × 3 matrix A = (1 −1 0; 0 0 2), recoverable from the bases below):

     Subspace       Basis                         Dimension
     null(A)        (1, 1, 0)^T                   1
     range(A)       (1, 0)^T, (0, 1)^T            2
     range(A^T)     (1, −1, 0)^T, (0, 0, 2)^T     2

Least squares, pseudoinverse and singular value decomposition

Linear system Ax = b with m equations and n unknowns; assume A is full-rank:

• m = n: unique solution x = A⁻¹b.
• m < n: underconstrained, infinitely many solutions x = x₀ + u, where x₀ = a particular solution
  and u ∈ null(A), with dim(null(A)) = n − m.
  Minimum-norm solution (min_x ‖x‖² s.t. Ax = b): x = A⁺b = A^T(AA^T)⁻¹b.
  Homogeneous system Ax = 0:
  – Unit-norm best approximation (min_x ‖Ax‖² s.t. ‖x‖² = 1): x = minor eigenvector of A^T A.
  – min_x ‖Ax‖² s.t. ‖Bx‖² = 1: x = minor generalized eigenvector of A^T A, B^T B.
• m > n: overconstrained, no solution in general; instead, define the LSQ solution as min_x ‖Ax − b‖²
  ⇒ x = A⁺b = (A^T A)⁻¹A^T b.
  AA⁺ = A(A^T A)⁻¹A^T is the orthogonal projection on range(A). (Pf.: write y ∈ R^m as y = Ax + u with
  x ∈ Rⁿ and u ⊥ Ax ∀x ∈ Rⁿ, i.e., u ∈ null(A^T); then A(A^T A)⁻¹A^T y = Ax.)
  Likewise, A⁺A = A^T(AA^T)⁻¹A is the orthogonal projection on range(A^T).

Pseudoinverse of A (m × n): the matrix A⁺ (n × m) satisfying the Moore-Penrose conditions (which
define it uniquely): A⁺AA⁺ = A⁺; AA⁺A = A; AA⁺ and A⁺A symmetric.
Given the SVD of A as USV^T: A⁺ = VS⁺U^T with s_i⁺ = 1/s_i if s_i > 0 and s_i⁺ = 0 if s_i = 0.
Particular cases (assume A full-rank): m > n ⇒ A⁺ = (A^T A)⁻¹A^T; m < n ⇒ A⁺ = A^T(AA^T)⁻¹;
m = n ⇒ A⁺ = A⁻¹.
Singular value decomposition (SVD) of A (m × n, m ≥ n): A = USV^T = Σ_{i=1}^n s_i u_i v_i^T with
U (m × n) = (u₁ · · · u_n) and V (n × n) = (v₁ · · · v_n) orthogonal, S = diag(s₁, . . . , s_n), and singular
values s₁ ≥ · · · ≥ s_n ≥ 0. Unique up to permutations/multiplicity of the s.v.

• rank(A) = p ≤ n ⇔ s_{p+1} = · · · = s_n = 0: then A = U_p S_p V_p^T, where U_p = (u₁ · · · u_p) and
  V_{n−p} = (v_{p+1} · · · v_n) are orthonormal bases of range(A) and null(A), respectively.
• rank(A) ≥ p ⇒ U_p S_p V_p^T is the best rank-p approximation to A in the sense of the Frobenius
  norm (‖A‖_F² = tr(AA^T) = Σ_{i,j} a_{ij}²) and the 2-norm (‖A‖₂ = largest s.v.).
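A quick check that the SVD-based pseudoinverse, the normal equations and a library LSQ solver agree
(random data):

# Overconstrained LSQ: x = A+ b computed from the SVD, vs. (A^T A)^{-1} A^T b.
import numpy as np
rng = np.random.default_rng(1)
A, b = rng.standard_normal((6, 3)), rng.standard_normal(6)   # m > n, full rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x1 = Vt.T @ np.diag(1/s) @ U.T @ b                # A+ = V S+ U^T (all s_i > 0)
x2 = np.linalg.solve(A.T @ A, A.T @ b)            # normal equations
x3 = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x1, x2), np.allclose(x1, x3))   # True True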
Other matrix decompositions (besides spectral, SVD)
• Cholesky decomposition: A symmetric pd ⇒ A = LL^T with L lower triangular.
  Useful to solve a symmetric pd linsys efficiently: Ax = b ⇔ LL^T x = b ⇒ solve two triangular linsys.
• LU decomposition: A square ⇒ A = LU with L lower triangular and U upper triangular.
  Useful to solve a linsys efficiently: Ax = b ⇔ LUx = b ⇒ solve two triangular linsys.
• QR decomposition: A (m × n) ⇒ A = QR with Q (m × m) orthogonal and R (m × n) upper triangular.
  Useful to find an orthogonal basis of the columns of A, solve a linsys, or find rank(A).
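A sketch of the Cholesky-based solve (random pd matrix; uses scipy.linalg):

# Solve A x = b for symmetric pd A via A = L L^T and two triangular solves.
import numpy as np
from scipy.linalg import cholesky, solve_triangular
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4*np.eye(4)                   # symmetric pd by construction
b = rng.standard_normal(4)
L = cholesky(A, lower=True)                 # A = L L^T
y = solve_triangular(L, b, lower=True)      # forward solve  L y = b
x = solve_triangular(L.T, y, lower=False)   # backward solve L^T x = y
print(np.allclose(A @ x, b))                # True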

Matrix identities

Ranks:  A (m × n), B (n × p): rank(A) + rank(B) − n ≤ rank(AB) ≤ min(rank(A), rank(B));
        A (m × n), B (m × n): rank(A + B) ≤ rank(A) + rank(B).

Inverse of a sum of matrices (Sherman-Morrison-Woodbury formula): given A (p × p), B (p × q),
C (q × q), D (q × p), with A, C invertible:

     (A + BCD)⁻¹ = A⁻¹ − A⁻¹B(C⁻¹ + DA⁻¹B)⁻¹DA⁻¹
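A quick numerical check of the formula (random well-conditioned blocks; p = 5, q = 2 are arbitrary):

import numpy as np
rng = np.random.default_rng(3)
p, q = 5, 2
A = rng.standard_normal((p, p)) + 5*np.eye(p)    # keep A, C far from singular
C = rng.standard_normal((q, q)) + 5*np.eye(q)
B, D = rng.standard_normal((p, q)), rng.standard_normal((q, p))
Ai, Ci = np.linalg.inv(A), np.linalg.inv(C)
lhs = np.linalg.inv(A + B @ C @ D)
rhs = Ai - Ai @ B @ np.linalg.inv(Ci + D @ Ai @ B) @ D @ Ai
print(np.allclose(lhs, rhs))                     # True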

Inverse of a matrix by blocks: A = [A_11, A_12; A_21, A_22], with A_11, A_22 invertible:

     A⁻¹ = [A^11, A^12; A^21, A^22]  with  A^11 = (A_11 − A_12 A_22⁻¹ A_21)⁻¹,  A^12 = −A^11 A_12 A_22⁻¹ = −A_11⁻¹ A_12 A^22,
                                           A^22 = (A_22 − A_21 A_11⁻¹ A_12)⁻¹,  A^21 = −A^22 A_21 A_11⁻¹ = −A_22⁻¹ A_21 A^11.

Derivatives:

     d(a^T x)/dx = ∇_x(a^T x) = a                                        (a, x ∈ Rⁿ, a independent of x)
     d(x^T A x)/dx = ∇_x(x^T A x) = (A + A^T)x, = 2Ax if A symmetric     (A (n × n) independent of x)
     x (m × 1), y (n × 1): dy^T/dx = m × n matrix with (i, j) entry ∂y_j/∂x_i
     f scalar, x (n × 1): d²f/(dx dx^T) = n × n Hessian matrix ∇²f(x) = (∂²f/∂x_i∂x_j)_{ij}
     d(x^T C)/dx = C (n × m),  d(Bx)/dx^T = B (m × n)                    (B, C independent of x ∈ Rⁿ)
     Product rule: d(u^T v)/dx = ∇_x(u^T v) = (du^T/dx) v + (dv^T/dx) u  (x, u, v ∈ Rⁿ; u, v functions of x)
     Chain rule: dy(x(t))/dt = Σ_{i=1}^n (∂y/∂x_i)(dx_i/dt), i.e., the m × n Jacobian of y wrt x
                 times dx/dt                                             (x ∈ Rⁿ, y ∈ R^m, t ∈ R)
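A finite-difference check of the gradient of the quadratic form (random data):

# d(x^T A x)/dx = (A + A^T) x, verified entrywise by central differences.
import numpy as np
rng = np.random.default_rng(4)
n = 3
A, x = rng.standard_normal((n, n)), rng.standard_normal(n)   # A not symmetric
g = (A + A.T) @ x
eps, g_fd = 1e-6, np.zeros(n)
for i in range(n):
    e = np.zeros(n); e[i] = eps
    g_fd[i] = ((x+e) @ A @ (x+e) - (x-e) @ A @ (x-e)) / (2*eps)
print(np.allclose(g, g_fd, atol=1e-6))                       # True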
Quadratic forms
f(x) = ½ x^T A x + b^T x + c, with A ∈ R^{n×n}, b, x ∈ Rⁿ, c ∈ R. Center and diagonalize it:

1. Ensure A is symmetric: x^T A x = x^T ((A + A^T)/2) x.
2. Translate the stationary point to the origin: ∇f = Ax + b = 0; change variables y = x + A⁺b ⇒
   f(y) = ½ y^T A y + b'^T y + c' with b' = (I − AA⁺)b and c' = −½ b^T A⁺ b.
   Linear terms may remain if A is singular.
3. Rotate to axis-aligned: A = UΛU^T (spectral th.); change variables z = U^T y ⇒
   f(z) = ½ z^T Λ z + b''^T z + c' with b'' = U^T (I − AA⁺)b.

Considering only the quadratic part ½ z^T Λ z = ½ Σ_{i=1}^n λ_i z_i², we have: A pd: single minimizer;
psd: ∞ minimizers; not definite: saddle point(s); nsd: ∞ maximizers; nd: single maximizer (see the
contour plots below).

[Contour plots over (x₁, x₂) ∈ [−2, 2]² illustrating each case: f(x) = x₁² + ½x₂² (pd: single minimizer),
f(x) = x₁² (psd: ∞ minimizers), f(x) = x₁² − ½x₂² (not definite: saddle point), f(x) = −x₁² − ½x₂²
(nd: single maximizer), f(x) = −x₁² (nsd: ∞ maximizers), f(x) = x₁² + x₂ (singular A with remaining
linear term: no stationary point).]
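A sketch of the three steps on a small pd example (taking c = 0, so the constant after centering is
c' = −½ b^T A⁺ b; the particular A, b are arbitrary):

# Center and diagonalize f(x) = 1/2 x^T A x + b^T x.
import numpy as np
A = np.array([[3.0, 1.0], [1.0, 2.0]])            # symmetric pd
b = np.array([1.0, -2.0])
Ap = np.linalg.pinv(A)                            # A+ (= A^{-1} here)
cprime = -0.5 * b @ Ap @ b                        # constant term after centering
lam, U = np.linalg.eigh(A)                        # A = U diag(lam) U^T
f  = lambda x: 0.5 * x @ A @ x + b @ x            # original form
fz = lambda z: 0.5 * (lam * z**2).sum() + cprime  # centered, axis-aligned form
x = np.array([0.7, -1.1])                         # arbitrary test point
z = U.T @ (x + Ap @ b)                            # y = x + A+ b, then z = U^T y
print(np.isclose(f(x), fz(z)))                    # True: same value in z coords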
Order notation
Consider f(n), g(n) ≥ 0 for n = 1, 2, 3, . . .

• Asymptotic upper bound O(·): f is O(g) iff f(n) ≤ c g(n) for some c > 0 and all n > n₀.
  f is of order g at most.
  Ex.: 3n + 5 is O(n) and O(n²) but not O(log n) or O(√n).
• Asymptotic upper bound o(·): f is o(g) iff lim_{n→∞} f(n)/g(n) = 0.
  f becomes insignificant relative to g as n grows.
  Ex.: 3n + 5 is o(n²) and o(n^1.3) but not o(n).
• Asymptotic tight bound Ω(·): f is Ω(g) iff c₀ g(n) ≤ f(n) ≤ c₁ g(n) for c₁ ≥ c₀ > 0 and all n > n₀.
  Equivalently, f is O(g) and g is O(f).
  Ex.: 3n + 5 is Ω(n) but not Ω(n²), Ω(log n), Ω(√n).

Cost of operations: assume n × 1 vectors and n × n matrices.

• Space: vectors are O(n), matrices are O(n²). Less if sparse or structured, e.g. a diagonal matrix
  or a circulant matrix can be stored as O(n).
• Time: we count scalar multiplications only.
  – Vector × vector: O(n).
  – Matrix × vector: O(n²).
  – Matrix × matrix: O(n³). Mind the parenthesization with rectangular matrices: A(BC) vs. (AB)C.
  – Eigenvalues and eigenvectors: O(n³).
  – Inversion and linear system solution: O(n³).
  – SVD, Cholesky, QR, LU, spectral decomposition: O(n³).
  Less if sparse or structured, e.g. matrix × vector is O(n) for a diagonal matrix and O(n log n)
  for a circulant matrix; linsys solution is O(n²) for a triangular matrix; etc.
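A small timing demonstration of the parenthesization remark (n and wall-clock timing are arbitrary
choices):

# (A B) x costs O(n^3); A (B x) costs O(n^2) and gives the same vector.
import time
import numpy as np
rng = np.random.default_rng(5)
n = 1500
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
x = rng.standard_normal(n)
t0 = time.perf_counter(); y1 = (A @ B) @ x; t1 = time.perf_counter()
y2 = A @ (B @ x);                           t2 = time.perf_counter()
print(t1 - t0, t2 - t1, np.allclose(y1, y2))   # first time >> second, same result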

Rates of convergence
Infimum of a set S ⊂ R, inf S (greatest lower bound): the largest v ∈ R s.t. v ≤ s ∀s ∈ S. If inf S ∈ S
then we also denote it min S (the minimum, i.e., smallest element of S). Likewise for supremum/maximum.
Ex.: for S = {1/n, n ∈ N} we have sup S = max S = 1 and inf S = 0, but S has no minimum.
Ex.: does "min_x f(x) s.t. x > 0" make sense?

Let {x_k}_{k=0}^∞ ⊂ Rⁿ be a sequence that converges to x*.

• Linear convergence: ‖x_{k+1} − x*‖ / ‖x_k − x*‖ ≤ r for all k sufficiently large, with constant
  0 < r < 1. The distance to the solution decreases at each iteration by at least a constant factor.
  Ex.: x_k = 2⁻ᵏ; steepest descent (with r ≈ 1 for ill-conditioned problems).
• Sublinear convergence: lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = 1. Ex.: x_k = 1/k.
• Superlinear convergence: lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = 0. Ex.: x_k = k⁻ᵏ; quasi-Newton
  methods.
• Quadratic convergence (order 2): ‖x_{k+1} − x*‖ / ‖x_k − x*‖² ≤ M for all k sufficiently large, with
  constant M > 0 (not necessarily < 1). We double the number of correct digits at each iteration.
  Quadratic is faster than superlinear, which is faster than linear. Ex.: x_k = 2^(−2ᵏ); Newton's method.
• Order p: ‖x_{k+1} − x*‖ / ‖x_k − x*‖^p ≤ M (rare for p > 2).

In the long run, the speed of an algorithm depends mainly on the order p and on r (for p = 1), and more
weakly on M. The values of r and M depend on the algorithm and the particular problem.
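A small demonstration of these definitions on the example sequences above:

# Error ratios for linearly, superlinearly and quadratically convergent sequences.
import numpy as np
ks = np.arange(1, 8)
linear    = 2.0**(-ks)                 # e_{k+1}/e_k -> 1/2
superlin  = ks.astype(float)**(-ks)    # e_{k+1}/e_k -> 0
quadratic = 2.0**(-(2**ks))            # e_{k+1}/e_k^2 = const (digits double)
for e in (linear, superlin, quadratic):
    print(e[1:] / e[:-1])              # successive error ratios
print(quadratic[1:] / quadratic[:-1]**2)   # constant (= 1): order-2 convergence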