CHAPTER 3
UNCONSTRAINED OPTIMIZATION
1. Preliminaries
1.1. Introduction
In this chapter we will examine some theory for the optimization of unconstrained functions.
We will assume all functions are continuous and differentiable. Although most engineering
problems are constrained, much of constrained optimization theory is built upon the concepts
and theory presented in this chapter.
1.2. Notation
We will use lower case italics, e.g., x, to represent a scalar quantity. Vectors will be
represented by lower case bold, e.g., x, and matrices by upper case bold, e.g., H.
The set of n design variables will be represented by the n-dimensional vector x. For example,
previously we considered the design variables for the Two-bar truss to be represented by
scalars such as diameter, d, thickness, t, height, h; now we consider diameter to be the first
element, x1 , of the vector x, thickness to be the second element, x2 , and so forth. Thus for
any problem the set of design variables is given by x.
Elements of a vector are denoted by subscripts. Values of a vector at specific points are denoted by superscripts. Typically $\mathbf{x}^0$ will be the starting vector of values for the design variables. We will then move to $\mathbf{x}^1$, $\mathbf{x}^2$, and so on, until we reach the optimum, which will be $\mathbf{x}^*$. A summary of notation used in this chapter is given in Table 1.
Table 1 Notation

  A                        Matrix A
  I                        Identity matrix
  a                        Column vector
  a_i, i = 1, 2, ...       Columns of A
  e_i, i = 1, 2, ...       Coordinate vectors (columns of I)
  A^T, a^T                 Transpose of A, a
  |A|                      Determinant of A
  x, x^k, x*               Vector of design variables, vector at iteration k, vector at the optimum
  x_1, x_2, ..., x_n       Elements of vector x
  s, s^k                   Search direction, search direction at iteration k
  α, α^k, α*               Step length, step length at iteration k, step length at minimum along search direction
  f(x), f(x^k), f^k        Objective function, objective evaluated at x^k
  ∇f(x), ∇f(x^k), ∇f^k     Gradient of f(x), gradient evaluated at x^k
  ∇²f(x^k), ∇²f^k,         Hessian matrix at x^k
  H(x^k), H^k
Find $\mathbf{x}$, $\mathbf{x} \in \mathbb{R}^n$
to minimize $f(\mathbf{x})$
1.4.1. Definition
The gradient of f (x) is denoted f (x) . The gradient is defined as a column vector of the
first partial derivatives of f(x):
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$
For example, for the function $f(\mathbf{x}) = 6 + 2x_1 - x_2 + 2x_1^2 + 3x_1x_2 + x_2^2$,

$$\nabla f(\mathbf{x}) = \begin{bmatrix} 2 + 4x_1 + 3x_2 \\ -1 + 3x_1 + 2x_2 \end{bmatrix}$$

If evaluated at $\mathbf{x}^0 = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$, then $\nabla f^0 = \begin{bmatrix} 4 \\ 1 \end{bmatrix}$.
A very important property of the gradient vector is that it is orthogonal to the function
contours and points in the direction of greatest increase of a function. The negative gradient
points in the direction of greatest decrease. Any vector $\mathbf{v}$ which is orthogonal to $\nabla f(\mathbf{x})$ will satisfy $\mathbf{v}^T\nabla f(\mathbf{x}) = 0$.
[Figure: sketch of a valley showing that the gradient ∇f is orthogonal to the tangent line of a contour and points uphill, while the negative gradient points downhill.]
As long as $\mathbf{s}^T\nabla f > 0$, then s points, at least for some small distance, in a direction that increases the function (points uphill). In like manner, if $\mathbf{s}^T\nabla f < 0$, then s points downhill.
As an example, suppose at the current point in space the gradient vector is $\nabla f(\mathbf{x}^k)^T = [6,\ 1,\ -2]$. We propose to move from this point in a search direction $\mathbf{s}^T = [-1,\ -1,\ 0]$. Does this direction go downhill? We evaluate

$$\mathbf{s}^T\nabla f = [-1,\ -1,\ 0]\begin{bmatrix} 6 \\ 1 \\ -2 \end{bmatrix} = -7$$

Since $\mathbf{s}^T\nabla f < 0$, this direction goes downhill.
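As a quick numeric sketch of this test (illustrative code, not part of the chapter's examples, using the numbers above):

```python
import numpy as np

# Gradient at the current point and a proposed search direction (example values).
grad = np.array([6.0, 1.0, -2.0])
s = np.array([-1.0, -1.0, 0.0])

slope = s @ grad            # s^T * grad
print("s^T grad =", slope)  # -7.0
print("downhill" if slope < 0 else "uphill")
```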
1.6.1. Definition
The Hessian matrix, $\mathbf{H}(\mathbf{x})$ or $\nabla^2 f(\mathbf{x})$, is defined to be the square matrix of second partial derivatives:

$$\mathbf{H}(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \tag{3.2}$$
We can also obtain the Hessian by applying the gradient operator on the gradient transpose,

$$\mathbf{H}(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \nabla\left(\nabla f(\mathbf{x})^T\right) = \begin{bmatrix} \dfrac{\partial}{\partial x_1} \\ \dfrac{\partial}{\partial x_2} \\ \vdots \\ \dfrac{\partial}{\partial x_n} \end{bmatrix}\left[\dfrac{\partial f}{\partial x_1},\ \dfrac{\partial f}{\partial x_2},\ \ldots,\ \dfrac{\partial f}{\partial x_n}\right] = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \tag{3.3}$$
The Hessian is a symmetric matrix. The Hessian matrix gives us information about the
curvature of a function, and tells us how the gradient is changing.
For the example function above,

$$\nabla f = \begin{bmatrix} 2 + 4x_1 + 3x_2 \\ -1 + 3x_1 + 2x_2 \end{bmatrix}, \qquad \frac{\partial^2 f}{\partial x_1^2} = 4, \quad \frac{\partial^2 f}{\partial x_1 \partial x_2} = \frac{\partial^2 f}{\partial x_2 \partial x_1} = 3, \quad \frac{\partial^2 f}{\partial x_2^2} = 2$$

and the Hessian is:

$$\mathbf{H}(\mathbf{x}) = \begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix}$$
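If analytical second derivatives are not available, the Hessian can be estimated numerically. The sketch below (illustrative only; it assumes the example function above) approximates the Hessian by central finite differences:

```python
import numpy as np

def f(x):
    # Example quadratic used in this section.
    return 6 + 2*x[0] - x[1] + 2*x[0]**2 + 3*x[0]*x[1] + x[1]**2

def hessian_fd(f, x, h=1e-4):
    """Approximate the Hessian by central finite differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4*h*h)
    return H

print(hessian_fd(f, np.array([2.0, -2.0])))   # ~ [[4, 3], [3, 2]]
```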
1.7.1. Definitions
1. A symmetric matrix B is positive definite if, for every nonzero vector x,
$$\mathbf{x}^T\mathbf{B}\mathbf{x} > 0$$
2. A symmetric matrix is positive definite if and only if the determinant of each of its
principal minor matrices is positive.
What does it mean for the Hessian to be positive or negative definite? If positive definite, it
means curvature of the function is everywhere positive. This will be an important condition
for checking if we have a minimum. If negative definite, curvature is everywhere negative.
This will be a condition for verifying we have a maximum.
For example, consider the matrix

$$\mathbf{B} = \begin{bmatrix} 2 & 3 & 2 \\ 3 & 5 & 1 \\ 2 & 1 & 5 \end{bmatrix}$$

The determinants of the principal minor matrices are

$$|2| = 2 > 0, \qquad \begin{vmatrix} 2 & 3 \\ 3 & 5 \end{vmatrix} = 1 > 0, \qquad \begin{vmatrix} 2 & 3 & 2 \\ 3 & 5 & 1 \\ 2 & 1 & 5 \end{vmatrix} = -5 < 0$$
The determinants of the first two principal minors are positive. However, because the
determinant of the matrix as a whole is negative, this matrix is not positive definite.
We also note that the eigenvalues are -0.15, 4.06, 8.09. That these are not all positive also
indicates the matrix is not positive definite.
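Both tests are easy to automate. The following sketch (illustrative only) checks the matrix above with numpy, using eigenvalues and leading principal minor determinants:

```python
import numpy as np

B = np.array([[2., 3., 2.],
              [3., 5., 1.],
              [2., 1., 5.]])

# Test 1: are all eigenvalues positive?
eigvals = np.linalg.eigvalsh(B)
print("eigenvalues:", np.round(eigvals, 2))     # ~ [-0.15, 4.06, 8.09]

# Test 2: are the determinants of the leading principal minors all positive?
minors = [np.linalg.det(B[:k, :k]) for k in range(1, B.shape[0] + 1)]
print("principal minor determinants:", np.round(minors, 2))  # [2, 1, -5]

print("positive definite:", bool(np.all(eigvals > 0)))
```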
1. A symmetric matrix B is negative definite if, for every nonzero vector x, $\mathbf{x}^T\mathbf{B}\mathbf{x} < 0$.
2. A symmetric matrix is negative definite if we reverse the sign of each element and the resulting matrix is positive definite.
Note: A symmetric matrix is not negative definite if the determinant of each of its principal minor matrices is negative. Rather, in the negative definite case, the signs of the determinants alternate minus and plus, so the easiest way to check for negative definiteness using principal minor matrices is to reverse all signs and see if the resulting matrix is positive definite.
If a matrix is neither positive definite nor negative definite (nor semi-definite) then it is
indefinite. If using principal minor matrices, note that we need to check both cases before we
reach a conclusion that a matrix is indefinite.
For the same example matrix, we reverse the signs and check the first principal minor of

$$-\mathbf{B} = \begin{bmatrix} -2 & -3 & -2 \\ -3 & -5 & -1 \\ -2 & -1 & -5 \end{bmatrix}, \qquad |-2| = -2 < 0$$

Because the first determinant is negative there is no reason to go further. We also note that the eigenvalues of the "reversed sign" matrix are not all positive.
1.8.1. Definition
The Taylor expansion is an approximation to a function at a point $\mathbf{x}^k$ and can be written in vector notation as:

$$f(\mathbf{x}) \approx f(\mathbf{x}^k) + \nabla f(\mathbf{x}^k)^T\left(\mathbf{x} - \mathbf{x}^k\right) + \frac{1}{2}\left(\mathbf{x} - \mathbf{x}^k\right)^T\nabla^2 f(\mathbf{x}^k)\left(\mathbf{x} - \mathbf{x}^k\right) \tag{3.4}$$

If we note that $\mathbf{x} - \mathbf{x}^k$ can be written as $\Delta\mathbf{x}^k$, and using the notation $f(\mathbf{x}^k) = f^k$, we can write (3.4) more compactly as,
$$f^{k+1} \approx f^k + \nabla f^{k\,T}\Delta\mathbf{x}^k + \frac{1}{2}\Delta\mathbf{x}^{k\,T}\nabla^2 f^k\,\Delta\mathbf{x}^k \tag{3.5}$$
As an example, we will approximate the function $f(\mathbf{x}) = 2x_1^{1/2} + 3\ln x_2$ at the point $\mathbf{x}^k = [5,\ 4]^T$. The gradient is

$$\nabla f^T = \left[x_1^{-1/2},\ \frac{3}{x_2}\right], \qquad \nabla f^{k\,T} = [0.447,\ 0.750]$$

The second partial derivatives are

$$\frac{\partial^2 f}{\partial x_1^2} = -\frac{1}{2}x_1^{-3/2}, \qquad \frac{\partial^2 f}{\partial x_1 \partial x_2} = \frac{\partial^2 f}{\partial x_2 \partial x_1} = 0, \qquad \frac{\partial^2 f}{\partial x_2^2} = -\frac{3}{x_2^2}$$

so the Hessian, evaluated at $\mathbf{x}^k = [5,\ 4]^T$, is

$$\mathbf{H}(\mathbf{x}^k) = \begin{bmatrix} -0.045 & 0.0 \\ 0.0 & -0.188 \end{bmatrix}$$

The Taylor expansion about this point is therefore

$$f(\mathbf{x}) \approx 8.631 + [0.447,\ 0.750]\begin{bmatrix} x_1 - 5 \\ x_2 - 4 \end{bmatrix} + \frac{1}{2}[x_1 - 5,\ x_2 - 4]\begin{bmatrix} -0.045 & 0.0 \\ 0.0 & -0.188 \end{bmatrix}\begin{bmatrix} x_1 - 5 \\ x_2 - 4 \end{bmatrix}$$
If we wish, we can stop here with the equation in vector form. To see the equation in scalar
form we can carry out the vector multiplications and combine similar terms:
Comparing the quadratic approximation with the actual function at several points:

  x^T       Quadratic   Actual   Error
  [5, 4]      8.63       8.63     0.00
  [5, 5]      9.28       9.30     0.02
  [6, 4]      9.05       9.06     0.01
  [7, 6]     10.55      10.67     0.12
  [2, 1]      3.98       2.83    -1.15
  [9, 2]      8.19       8.08    -0.11
We notice that the further the point gets from the expansion point, the greater the error that is
introduced. We also see that at the point of expansion the approximation is exact.
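The table can be reproduced with a short script. The sketch below (illustrative, not from the text) builds the quadratic approximation of $f = 2\sqrt{x_1} + 3\ln x_2$ about $[5, 4]^T$ using the gradient and Hessian derived above and compares it with the actual function:

```python
import numpy as np

def f(x):
    return 2*np.sqrt(x[0]) + 3*np.log(x[1])

xk = np.array([5.0, 4.0])
grad_k = np.array([xk[0]**-0.5, 3/xk[1]])            # [0.447, 0.750]
H_k = np.array([[-0.5*xk[0]**-1.5, 0.0],
                [0.0, -3/xk[1]**2]])                 # ~ [[-0.045, 0], [0, -0.188]]

def quad_approx(x):
    d = x - xk
    return f(xk) + grad_k @ d + 0.5 * d @ H_k @ d

for pt in [[5, 4], [5, 5], [6, 4], [7, 6], [2, 1], [9, 2]]:
    x = np.array(pt, dtype=float)
    q, a = quad_approx(x), f(x)
    print(pt, round(q, 2), round(a, 2), round(a - q, 2))
```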
2.1. Representation
We can represent a quadratic function three ways—as a scalar equation, a general vector
equation, and as a Taylor expansion. Although these representations look different, they give
exactly the same results. For example, consider the scalar equation,

$$f(\mathbf{x}) = 6 + 2x_1 - x_2 + 2x_1^2 + 3x_1x_2 + x_2^2 \tag{3.6}$$

The general vector equation for a quadratic is,

$$f(\mathbf{x}) = a + \mathbf{b}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{C}\mathbf{x} \tag{3.7}$$

which for this example is,

$$f(\mathbf{x}) = 6 + [2,\ -1]\,\mathbf{x} + \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix}\mathbf{x} \tag{3.8}$$

where,

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
The same function can also be written as a Taylor expansion about a point $\mathbf{x}^k$:

$$f(\mathbf{x}) = f^k + \nabla f^{k\,T}\Delta\mathbf{x}^k + \frac{1}{2}\Delta\mathbf{x}^{k\,T}\mathbf{H}\,\Delta\mathbf{x}^k \tag{3.9}$$

We note for (3.6),

$$\nabla f = \begin{bmatrix} 2 + 4x_1 + 3x_2 \\ -1 + 3x_1 + 2x_2 \end{bmatrix} \quad \text{and} \quad \mathbf{H} = \begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix}$$

We will assume a point of expansion, $\mathbf{x}^k = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$, for which $\nabla f^k = \begin{bmatrix} 4 \\ 1 \end{bmatrix}$ and $f^k = 12$. (It may not be apparent, but if we are approximating a quadratic, it doesn't matter what point of expansion we assume. The Taylor expansion will be exact.) Then,

$$f(\mathbf{x}) = 12 + [4,\ 1]\,\Delta\mathbf{x}^k + \frac{1}{2}\Delta\mathbf{x}^{k\,T}\begin{bmatrix} 4 & 3 \\ 3 & 2 \end{bmatrix}\Delta\mathbf{x}^k \tag{3.10}$$

where,

$$\Delta\mathbf{x}^k = \begin{bmatrix} x_1 - 2 \\ x_2 + 2 \end{bmatrix}$$
These three representations are equivalent. If we pick the point $\mathbf{x}^T = [1.0,\ 2.0]$, all three representations give $f = 18$ at this point, as you can verify by substitution.
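A quick numerical check of this equivalence (an illustrative sketch, using the expansion point and matrices above) is:

```python
import numpy as np

x = np.array([1.0, 2.0])
C = np.array([[4., 3.], [3., 2.]])
b = np.array([2., -1.])

# Scalar form (3.6)
f_scalar = 6 + 2*x[0] - x[1] + 2*x[0]**2 + 3*x[0]*x[1] + x[1]**2

# Vector form (3.8)
f_vector = 6 + b @ x + 0.5 * x @ C @ x

# Taylor form (3.10), expanded about xk = [2, -2]
xk = np.array([2., -2.])
grad_k = np.array([2 + 4*xk[0] + 3*xk[1], -1 + 3*xk[0] + 2*xk[1]])  # [4, 1]
d = x - xk
f_taylor = 12 + grad_k @ d + 0.5 * d @ C @ d

print(f_scalar, f_vector, f_taylor)   # all 18.0
```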
The equations for the gradient vector of a quadratic function are linear. This makes it
easy to solve for where the gradient is equal to zero.
The Hessian for a quadratic function is a matrix of constants (so we will write it as $\mathbf{H}$ or $\nabla^2 f$ instead of $\mathbf{H}(\mathbf{x})$ or $\nabla^2 f(\mathbf{x})$). Thus the curvature of a quadratic is everywhere the same.
Excluding the cases where we have a semi-definite Hessian, quadratic functions have
only one stationary point, i.e. only one point where the gradient is zero.
Given the gradient and Hessian at some point x k , the gradient at some other point,
x k 1 , is given by,
$$\nabla f^{k+1} = \nabla f^k + \mathbf{H}\left(\mathbf{x}^{k+1} - \mathbf{x}^k\right) \tag{3.11}$$
Given the gradient at some point $\mathbf{x}^k$, the Hessian, H, and a search direction, s, the optimal step length, $\alpha^*$, in the direction s is given by,

$$\alpha^* = \frac{-\nabla f^{k\,T}\mathbf{s}}{\mathbf{s}^T\mathbf{H}\mathbf{s}} \tag{3.12}$$
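As a sketch of how (3.11) and (3.12) are used (illustrative code, with the quadratic (3.6) as the example and a steepest descent direction chosen for s):

```python
import numpy as np

H = np.array([[4., 3.], [3., 2.]])
b = np.array([2., -1.])

def grad(x):
    return b + H @ x                  # gradient of the quadratic (3.6)

xk  = np.array([2., -2.])
xk1 = np.array([1., 2.])

# (3.11): gradient at a new point from the gradient at the old point
print(grad(xk) + H @ (xk1 - xk))      # equals grad(xk1)
print(grad(xk1))

# (3.12): optimal step length along the (normalized) steepest descent direction
s = -grad(xk) / np.linalg.norm(grad(xk))
alpha_star = -(grad(xk) @ s) / (s @ H @ s)
x_new = xk + alpha_star * s
print("alpha* =", round(alpha_star, 3), " new point:", np.round(x_new, 3))
```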
2.3. Examples
We start with an example whose contour plots are given in Fig. 3.2 and Fig. 3.3. Since this is a quadratic, we know it only has one stationary point. We note that the Hessian,

$$\mathbf{H} = \begin{bmatrix} 2 & 4 \\ 4 & 2 \end{bmatrix}$$

is indefinite (eigenvalues are -2.0 and 6.0). This means we should have a saddle point. The contour plots in Fig. 3.2 and Fig. 3.3 confirm this.
For a second example,

$$\nabla f = \begin{bmatrix} 1 + 8x_1 + x_2 \\ 2 + x_1 + 4x_2 \end{bmatrix} \quad \text{and} \quad \mathbf{H} = \begin{bmatrix} 8 & 1 \\ 1 & 4 \end{bmatrix}$$

By inspection, we see that the determinants of the principal minor matrices are all positive. Thus this function should have a min and look like a bowl. The contour plots follow.
3.1. Definitions
A first-order necessary condition for $\mathbf{x}^*$ to be a local optimum is that the gradient vanish there,

$$\nabla f(\mathbf{x}^*) = \mathbf{0}$$

These conditions are necessary but not sufficient, inasmuch as $\nabla f(\mathbf{x}) = \mathbf{0}$ can apply at a max, min or a saddle point. However, if at a point $\nabla f(\mathbf{x}) \neq \mathbf{0}$, then that point cannot be an optimum.
As an example, we will find the point that satisfies the necessary conditions for the function,

$$f(\mathbf{x}) = x_1^2 + 2x_1x_2 + 4x_2^2$$

Since this is a quadratic function, the partial derivatives will be linear equations. We can solve these equations directly for a point that satisfies the necessary conditions. The gradient vector is,

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[4pt] \dfrac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 + 2x_2 \\ 2x_1 + 8x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

When we solve these two equations, we have the solution $x_1 = 0$, $x_2 = 0$; this is a point where the gradient is equal to zero. It represents a minimum, a maximum, or a saddle point. At this point, the Hessian is,

$$\mathbf{H} = \begin{bmatrix} 2 & 2 \\ 2 & 8 \end{bmatrix}$$
Since this Hessian is positive definite (eigenvalues are 1.4, 8.6), this must be a minimum.
As a second example, apply the necessary and sufficient conditions to find the optimum for
the quadratic function,
$$f(\mathbf{x}) = 4x_1 + 2x_2 + x_1^2 - 4x_1x_2 + x_2^2$$

As in example 1, we will solve the gradient equations directly for a point that satisfies the necessary conditions. The gradient vector is,

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[4pt] \dfrac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 4 + 2x_1 - 4x_2 \\ 2 - 4x_1 + 2x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

When we solve these two equations, we have the solution $x_1 = 1.333$, $x_2 = 1.667$. The Hessian is,
$$\mathbf{H} = \begin{bmatrix} 2 & -4 \\ -4 & 2 \end{bmatrix}$$
The eigenvalues are -2, 6. The Hessian is indefinite. This means this is neither a max nor a
min—it is a saddle point.
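Because the gradient of a quadratic is linear, the necessary conditions can be solved with a linear solve and the stationary point classified from the Hessian eigenvalues. A sketch (illustrative only) for the two examples above:

```python
import numpy as np

def classify(H):
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > 0):  return "minimum"
    if np.all(eigs < 0):  return "maximum"
    return "saddle point"

# Example 1: f = x1^2 + 2 x1 x2 + 4 x2^2  ->  grad = H x + c = 0
H1, c1 = np.array([[2., 2.], [2., 8.]]), np.array([0., 0.])
print(np.linalg.solve(H1, -c1), classify(H1))          # [0, 0], minimum

# Example 2: f = 4 x1 + 2 x2 + x1^2 - 4 x1 x2 + x2^2
H2, c2 = np.array([[2., -4.], [-4., 2.]]), np.array([4., 2.])
print(np.round(np.linalg.solve(H2, -c2), 3), classify(H2))   # [1.333, 1.667], saddle
```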
Comments: As mentioned, the equations for the gradient for a quadratic function are linear,
so they are easy to solve. Obviously we don’t usually have a quadratic objective, so the
equations are usually not linear. Often we will use the necessary conditions to check a point
to see if we are at an optimum. Some algorithms, however, solve for an optimum by solving
directly where the gradient is equal to zero. Sequential Quadratic Programming (SQP) is this
type of algorithm.
Other algorithms search for the optimum by taking downhill steps and continuing until they
can go no further. The GRG (Generalized Reduced Gradient) algorithm is an example of this
type of algorithm. In the next section we will study one of the simplest unconstrained
algorithms that steps downhill: steepest descent.
4.1. Description
One of the simplest unconstrained optimization methods is steepest descent. Given an initial
starting point, the algorithm moves downhill until it can go no further.
The search can be broken down into stages. For any algorithm, at each stage (or iteration) we must determine two things:
1. What should the search direction be?
2. How far should we step in that direction?
Answer to question 1: For the method of steepest descent, the search direction is $-\nabla f(\mathbf{x})$.
Answer to question 2: A line search is performed. "Line" in this case means we search along a
direction vector. The line search strategy presented here, bracketing the function with quadratic
fit, is one of many that have been proposed, and is one of the most common.
A step in the line search is given by,

$$\mathbf{x}^{k+1} = \mathbf{x}^k + \alpha\,\mathbf{s} \tag{3.18}$$

where s is the search direction vector, usually normalized, and $\alpha$ is the step length, a scalar.
We will step in direction s with increasing values of until the function starts to get worse. Then
we will curve fit the data with a parabola, and step to the minimum of the parabola.
We will find $\alpha^*$, which is the optimal step length, by trial and error. For this example we will minimize $f(\mathbf{x}) = x_1^2 + 2x_1x_2 + 4x_2^2$ starting from $\mathbf{x}^0 = [3.0,\ 1.0]^T$, where $\nabla f^0 = [8,\ 14]^T$ and the normalized steepest descent direction is $\mathbf{s}^0 = [-0.50,\ -0.86]^T$.

Guess $\alpha^* = 0.4$ for step number 1:
  Line Search Step   α     x¹ = x⁰ + αs⁰                                           f(x)
  1                  0.4   [3.0, 1.0]ᵀ + 0.4·[-0.50, -0.86]ᵀ = [2.80, 0.66]ᵀ        13.3
We see that the function has decreased; we decide to double the step length and continue
doubling until the function begins to increase:
  Line Search Step   α     x¹ = x⁰ + αs⁰                                           f(x)
  2                  0.8   [3.0, 1.0]ᵀ + 0.8·[-0.50, -0.86]ᵀ = [2.60, 0.31]ᵀ         8.75
  3                  1.6   [3.0, 1.0]ᵀ + 1.6·[-0.50, -0.86]ᵀ = [2.20, -0.38]ᵀ        3.74
  4                  3.2   [3.0, 1.0]ᵀ + 3.2·[-0.50, -0.86]ᵀ = [1.40, -1.75]ᵀ        9.31
The objective function has started to increase; therefore we have gone too far.
Fig. 3.6 Progress of the line search for f(x) = x₁² + 2x₁x₂ + 4x₂², shown on a contour plot. The fifth point, obtained by stepping back after the overshoot, is at (1.8, -1.06).
If we plot the objective value as a function of step length as shown in Fig 3.7:
Fig. 3.7 The objective value vs. step length for the line search. A parabola fit through the data gives y = 19.0794 - 16.1386α + 4.0897α² (R² = 1.00), with its minimum near α* ≈ 1.97.
We see that the data plot as a parabola. We would like to estimate the minimum of this curve. We will curve fit points 2, 3, and 5. These points are equally spaced and bracket the minimum.
For three equally spaced points $(\alpha_1, f_1)$, $(\alpha_2, f_2)$, $(\alpha_3, f_3)$ with spacing $\Delta\alpha$, the minimum of the fitted parabola is at

$$\alpha^* = \alpha_2 + \frac{\Delta\alpha\,(f_1 - f_3)}{2\left(f_1 - 2f_2 + f_3\right)}$$

Substituting,

$$\alpha^* = 1.60 + \frac{0.8\,(8.75 - 3.91)}{2\left(8.75 - 2(3.74) + 3.91\right)} = 1.97$$
When we step back, after the function has become worse, we have four points to choose from
(points 2, 3, 5, 4). How do we know which three to pick to make sure we don’t lose the bracket
on the minimum? The rule is this: take the point with the lowest function value (point 3) and the
two points to either side (points 2 and 5).
In summary, the line search consists of stepping along the search direction until the minimum of the function in this direction is bracketed, fitting three points which bracket the minimum with a parabola, and calculating the minimum of the parabola. If necessary the parabolic fit can be carried out several times until the change in the minimum is very small (although the points are then no longer equally spaced, so the general three-point formula must be used):

$$\alpha^* = \frac{1}{2}\,\frac{\left(\alpha_2^2 - \alpha_3^2\right)f_1 + \left(\alpha_3^2 - \alpha_1^2\right)f_2 + \left(\alpha_1^2 - \alpha_2^2\right)f_3}{\left(\alpha_2 - \alpha_3\right)f_1 + \left(\alpha_3 - \alpha_1\right)f_2 + \left(\alpha_1 - \alpha_2\right)f_3}$$
As shown in Fig. 3.7, at $\alpha^*$, $\dfrac{df}{d\alpha} = 0$. The process of determining $\alpha^*$ will be referred to as taking a minimizing step, or, executing an exact line search.
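The whole line search can be collected into a short routine. The sketch below (illustrative only; it assumes the bracketing produces an equally spaced triple around the best point, as in the example above) doubles the step until the function worsens, steps back halfway, and then applies the parabolic fit:

```python
import numpy as np

def f(x):
    return x[0]**2 + 2*x[0]*x[1] + 4*x[1]**2

def line_search(f, x0, s, alpha0=0.4):
    """Bracket the minimum by doubling the step, then fit a parabola
    through three equally spaced points and return its minimum."""
    alphas, fs = [0.0], [f(x0)]
    a = alpha0
    while True:
        alphas.append(a)
        fs.append(f(x0 + a*s))
        if len(fs) > 2 and fs[-1] > fs[-2]:   # function got worse: bracketed
            break
        a *= 2.0
    # Step back halfway to get three equally spaced points around the best one.
    a_mid = 0.5*(alphas[-1] + alphas[-2])
    alphas.insert(-1, a_mid)
    fs.insert(-1, f(x0 + a_mid*s))
    i = int(np.argmin(fs))
    a1, a2, a3 = alphas[i-1], alphas[i], alphas[i+1]
    f1, f2, f3 = fs[i-1], fs[i], fs[i+1]
    da = a2 - a1
    return a2 + da*(f1 - f3) / (2*(f1 - 2*f2 + f3))

x0 = np.array([3.0, 1.0])
s0 = np.array([-0.50, -0.86])          # normalized steepest descent direction
print("alpha* ~", round(line_search(f, x0, s0), 2))   # ~1.97
```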
To characterize the convergence rate of steepest descent on a quadratic function, define the error in the objective at a point x as,

$$E(\mathbf{x}) = \frac{1}{2}\left(\mathbf{x} - \mathbf{x}^*\right)^T\mathbf{H}\left(\mathbf{x} - \mathbf{x}^*\right) \tag{3.21}$$

Then, for steepest descent with exact line searches,

$$E\left(\mathbf{x}^{k+1}\right) \le \left(\frac{A - a}{A + a}\right)^2 E\left(\mathbf{x}^k\right) \tag{3.22}$$

where
A = Largest eigenvalue of H
a = Smallest eigenvalue of H
Thus if A = 50 and a = 1, we have that the error at the k+1 step is only guaranteed to be less than the error at the k step by,

$$E^{k+1} \le \left(\frac{49}{51}\right)^2 E^k$$
"Roughly speaking, the above theorem says that the convergence rate of steepest descent is slowed as the contours of f become more eccentric. If a = A, corresponding to circular contours, convergence occurs in a single step. Note, however, that even if n − 1 of the n eigenvalues are equal and the remaining one is a great distance from these, convergence will be slow, and hence a single abnormal eigenvalue can destroy the effectiveness of steepest descent."¹
The above theorem is based on a quadratic function. If we have a quadratic, and we do rotation
and translation of the axes, we can eliminate all of the linear and cross product terms. We then
have only the pure second order terms left. The eigenvalues of the resulting Hessian are equal to
twice the coefficients of the pure second order terms. Thus the function,
$$f = x_1^2 + x_2^2$$
would have equal eigenvalues of (2, 2) and would represent the circular contours as mentioned
above, shown in Fig. 3.8. Steepest descent would converge in one step. Conversely the function,
1. Luenberger and Ye, Linear and Nonlinear Programming, Third Edition, 2008.
$$f = 50x_1^2 + x_2^2$$
has eigenvalues of (100, 2). The contours would be highly eccentric and convergence of steepest
descent would be very slow. A contour plot of this function is given in Fig 3.9,
Fig. 3.9. Contours of the function $f = 50x_1^2 + x_2^2$. Notice how the contours have been "stretched" out.
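The effect of eccentricity is easy to demonstrate numerically. The following sketch (illustrative only) runs exact-line-search steepest descent on quadratics of the form $f = \tfrac{1}{2}\mathbf{x}^T\mathbf{H}\mathbf{x}$, once with circular contours and once with stretched contours:

```python
import numpy as np

def steepest_descent(H, x0, tol=1e-6, max_iter=1000):
    """Exact-line-search steepest descent on f = 0.5 * x^T H x."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = H @ x                        # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return x, k                  # converged in k steps
        alpha = (g @ g) / (g @ H @ g)    # minimizing step along -g
        x = x - alpha * g
    return x, max_iter

x0 = np.array([1.0, 1.0])
print(steepest_descent(np.diag([2.0, 2.0]), x0))     # circular contours: 1 step
print(steepest_descent(np.diag([100.0, 2.0]), x0))   # eccentric: hundreds of steps
```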
5. The Directional Derivative

The directional derivative of f along the search direction s follows from the chain rule,

$$\frac{df}{d\alpha} = \sum_i \frac{\partial f}{\partial x_i}\frac{dx_i}{d\alpha}$$

Noting that $\mathbf{x}^{k+1} = \mathbf{x}^k + \alpha\mathbf{s}$, or, for a single element of vector x, $x_i^{k+1} = x_i^k + \alpha s_i^k$, we have $\dfrac{dx_i}{d\alpha} = s_i$, so

$$\frac{df}{d\alpha} = \sum_i \frac{\partial f}{\partial x_i}\frac{dx_i}{d\alpha} = \sum_i \frac{\partial f}{\partial x_i}s_i = \nabla f^T\mathbf{s} \tag{3.23}$$
As an example, we will find the directional derivative, $\dfrac{df}{d\alpha}$, for the problem given in Section 4.2 above, at $\alpha = 0$. From (3.23):

$$\frac{df}{d\alpha} = \nabla f^T\mathbf{s} = [8,\ 14]\begin{bmatrix} -0.50 \\ -0.86 \end{bmatrix} = -16.04$$

This gives us the change in the function for a small step in the search direction, i.e.,

$$\Delta f \approx \frac{df}{d\alpha}\,\Delta\alpha \tag{3.24}$$

If $\Delta\alpha = 0.01$, the predicted change is $-0.1604$. The actual change in the function is $-0.1599$.
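A small sketch of this check (illustrative, using the example's numbers):

```python
import numpy as np

def f(x):
    return x[0]**2 + 2*x[0]*x[1] + 4*x[1]**2

x0 = np.array([3.0, 1.0])
grad0 = np.array([8.0, 14.0])            # gradient at x0
s = np.array([-0.50, -0.86])             # normalized search direction

slope = grad0 @ s                        # df/d(alpha) from (3.23)
print("df/dalpha =", slope)              # ~ -16.04

d_alpha = 0.01
predicted = slope * d_alpha              # (3.24)
actual = f(x0 + d_alpha*s) - f(x0)
print("predicted:", round(predicted, 4), "actual:", round(actual, 4))
```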
Equation (3.23) is the same equation for checking if a direction goes downhill, given in Section 1.4. Before we just looked at the sign; if negative we knew we were going downhill. Now we see that the value has meaning as well: it represents the expected change in the function for a small step. If, for example, the value of $\left.\dfrac{df}{d\alpha}\right|_{\alpha=0}$ is less than some epsilon, we could terminate the line search, because the predicted change in the objective function is below a minimum threshold.
Another important value of $\dfrac{df}{d\alpha}$ occurs at $\alpha^*$. If we locate the minimum exactly, then

$$\left.\frac{df}{d\alpha}\right|_{\alpha^*} = \nabla f^{k+1\,T}\mathbf{s}^k = 0 \tag{3.25}$$
As we have seen in examples, when we take a minimizing step we stop where the search
direction is tangent to the contours of the function. Thus the gradient at this new point is
orthogonal to the previous search direction.
6. Newton’s Method
6.1. Derivation
Another classical method we will briefly study is called Newton's method. It simply makes a
quadratic approximation to a function at the current point and solves for where the necessary
conditions (to the approximation) are satisfied. Starting with a Taylor series:
$$f^{k+1} \approx f^k + \nabla f^{k\,T}\Delta\mathbf{x}^k + \frac{1}{2}\Delta\mathbf{x}^{k\,T}\mathbf{H}^k\Delta\mathbf{x}^k \tag{3.26}$$

Since the gradient and Hessian are evaluated at k, they are just a vector and matrix of constants. Taking the gradient (Section 9.1) and setting it to zero,

$$\nabla f^{k+1} = \nabla f^k + \mathbf{H}^k\Delta\mathbf{x}^k = \mathbf{0}$$
$$\mathbf{H}^k\Delta\mathbf{x}^k = -\nabla f^k$$

Solving for $\Delta\mathbf{x}$:

$$\Delta\mathbf{x}^k = -\left(\mathbf{H}^k\right)^{-1}\nabla f^k \tag{3.27}$$
Note that we have solved for a vector, i.e. x , which has both a step length and direction.
As an example, for $f(\mathbf{x}) = x_1^2 + 2x_1x_2 + 4x_2^2$ starting at $\mathbf{x}^0 = [3,\ 1]^T$, where $\nabla f^0 = [8,\ 14]^T$ and $\mathbf{H} = \begin{bmatrix} 2 & 2 \\ 2 & 8 \end{bmatrix}$,

$$\Delta\mathbf{x} = -\mathbf{H}^{-1}\nabla f^0 = \begin{bmatrix} -3 \\ -1 \end{bmatrix}$$

So, $\mathbf{x}^1 = \mathbf{x}^0 + \Delta\mathbf{x} = \begin{bmatrix} 3 \\ 1 \end{bmatrix} + \begin{bmatrix} -3 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
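In code, a Newton step is a linear solve rather than an explicit inverse. A sketch for the example above (illustrative only):

```python
import numpy as np

H = np.array([[2., 2.], [2., 8.]])      # Hessian of f = x1^2 + 2 x1 x2 + 4 x2^2
def grad(x):
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 8*x[1]])

x0 = np.array([3.0, 1.0])
dx = -np.linalg.solve(H, grad(x0))      # Newton step (3.27)
x1 = x0 + dx
print("step:", dx, " new point:", x1)   # step = [-3, -1], new point = [0, 0]
```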
[Contour plot of f(x) = x₁² + 2x₁x₂ + 4x₂² showing the single Newton step from x⁰ = [3, 1]ᵀ to the minimum at the origin.]
Fig. 3.10. The operation of Newton’s method.
However we should note some drawbacks. First, it requires second derivatives. Normally we
compute derivatives numerically, and computing second derivatives is computationally
expensive, on the order of n² function evaluations, where n is the number of design variables.
The derivation of Newton’s method solved for where the gradient is equal to zero. The
gradient is equal to zero at a min, a max or a saddle, and nothing in the method differentiates
between these. Thus Newton’s method can diverge, or fail to go downhill (indeed, not only
not go downhill, but go to a maximum!). This is obviously a serious drawback.
7. Quasi-Newton Methods
7.1. Introduction
Let's summarize the pros and cons of Newton's method and Steepest Descent:

                      Pros                                        Cons
  Steepest Descent    Always goes downhill                        Slow on eccentric functions
                      Always converges
                      Simple to implement
  Newton's Method     Solves a quadratic in one step. Very fast   Requires second derivatives,
                      when close to the optimum on a              can diverge
                      non-quadratic.
We want to develop a method that starts out like steepest descent and gradually becomes
Newton's method, doesn't need second derivatives, doesn't have trouble with eccentric
functions and doesn't diverge! Fortunately such methods exist. They combine the good
aspects of steepest descent and Newton's method without the drawbacks. These methods are
called quasi-Newton methods or sometimes variable metric methods.
In quasi-Newton methods the search direction is given by,

$$\mathbf{s}^k = -\mathbf{N}^k\nabla f^k \tag{3.28}$$

where $\mathbf{N}^k$ is the "direction matrix," an approximation to the inverse of the Hessian that is built up as the algorithm proceeds (a common starting choice is $\mathbf{N}^0 = \mathbf{I}$, which makes the first step a steepest descent step).

If N is always positive definite, then s always points downhill. To show this, our criterion for moving downhill is:

$$\mathbf{s}^T\nabla f < 0$$

Or,

$$\nabla f^T\mathbf{s} < 0 \tag{3.29}$$

Substituting (3.28),

$$-\left(\nabla f^T\mathbf{N}\nabla f\right) < 0 \tag{3.30}$$

Since N is positive definite, we know that any nonzero vector which pre-multiplies N and post-multiplies N will result in a positive scalar. Thus the quantity within the parentheses is always positive; with the negative sign it becomes always negative, and therefore s always goes downhill.
7.2.1. Development
In this section we will develop one of the simplest updates, called a “rank one” update
because the correction to the direction matrix, N, is a rank one matrix (i.e., it only has one
independent row or column). We first start with some preliminaries.
For a quadratic function with Hessian H, the Taylor expansion of the gradient is exact,

$$\nabla f^{k+1} = \nabla f^k + \mathbf{H}\Delta\mathbf{x}^k \tag{3.32}$$

where $\Delta\mathbf{x}^k = \mathbf{x}^{k+1} - \mathbf{x}^k$. Defining:

$$\boldsymbol{\gamma}^k = \nabla f^{k+1} - \nabla f^k \tag{3.33}$$

we have,

$$\boldsymbol{\gamma}^k = \mathbf{H}\Delta\mathbf{x}^k \quad \text{or} \quad \mathbf{H}^{-1}\boldsymbol{\gamma}^k = \Delta\mathbf{x}^k \tag{3.34}$$
Equation (3.34) is very important: it shows that for a quadratic function, the inverse of the
Hessian matrix ( H 1 ) maps differences in the gradients to differences in x. The relationship
expressed by (3.34) is called the Newton condition.
We will make the direction matrix satisfy this relationship. However, since we can only calculate $\boldsymbol{\gamma}^k$ and $\Delta\mathbf{x}^k$ after the line search, we will make the updated matrix satisfy

$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^k = \Delta\mathbf{x}^k \tag{3.35}$$

$$\mathbf{N}^{k+1} = \mathbf{N}^k + a\mathbf{u}\mathbf{u}^T \tag{3.36}$$

where we will "update" the direction matrix with a correction of the form $a\mathbf{u}\mathbf{u}^T$, which is a rank one symmetric matrix. Substituting (3.36) into (3.35),

$$\mathbf{N}^k\boldsymbol{\gamma}^k + a\mathbf{u}\mathbf{u}^T\boldsymbol{\gamma}^k = \Delta\mathbf{x}^k \tag{3.37}$$

or

$$a\mathbf{u}\underbrace{\left(\mathbf{u}^T\boldsymbol{\gamma}^k\right)}_{\text{scalar}} = \Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k \tag{3.38}$$

Since $\mathbf{u}^T\boldsymbol{\gamma}^k$ is a scalar, u must be proportional to $\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k$. Because any length of u can be absorbed into a, we choose

$$\mathbf{u} = \Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k \tag{3.39}$$
Substituting (3.39) into (3.38):

$$a\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)\underbrace{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k}_{\text{scalar}} = \Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k \tag{3.40}$$

For this to be true,

$$a\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k = 1$$

so

$$a = \frac{1}{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k} \tag{3.41}$$

Substituting (3.41) and (3.39) into (3.36) gives the expression we need:

$$\mathbf{N}^{k+1} = \mathbf{N}^k + \frac{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T}{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k} \tag{3.42}$$
Equation (3.42) allows us to get a new direction matrix in terms of the previous matrix and
the difference in x and the gradient. We then use this to get a new search direction according
to (3.28).
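A direct translation of (3.42) into code might look like the following sketch (illustrative; the numbers reproduce the first update of the example that follows):

```python
import numpy as np

def rank_one_update(N, dx, gamma):
    """Symmetric rank one update (3.42) of the direction matrix N."""
    u = dx - N @ gamma
    return N + np.outer(u, u) / (u @ gamma)

N0 = np.eye(2)
dx0 = np.array([2.030 - 3.0, -0.698 - 1.0])          # x1 - x0
gamma0 = np.array([2.664 - 8.0, -1.522 - 14.0])      # grad f1 - grad f0
N1 = rank_one_update(N0, dx0, gamma0)
print(np.round(N1, 3))     # ~ [[0.920, -0.254], [-0.254, 0.197]]
```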
As an example, we will apply the rank one update to the problem of Section 4.2, $f(\mathbf{x}) = x_1^2 + 2x_1x_2 + 4x_2^2$, starting from $\mathbf{x}^0 = [3,\ 1]^T$ with $\nabla f^0 = [8,\ 14]^T$ and $\mathbf{N}^0 = \mathbf{I}$. The first step is a steepest descent step with an exact line search, which gives,

$$\mathbf{x}^1 = \begin{bmatrix} 2.030 \\ -0.698 \end{bmatrix}, \qquad \nabla f^1 = \begin{bmatrix} 2.664 \\ -1.522 \end{bmatrix}$$
The update terms are,

$$\Delta\mathbf{x}^0 = \mathbf{x}^1 - \mathbf{x}^0 = \begin{bmatrix} -0.970 \\ -1.698 \end{bmatrix}, \qquad \boldsymbol{\gamma}^0 = \nabla f^1 - \nabla f^0 = \begin{bmatrix} -5.336 \\ -15.522 \end{bmatrix}, \qquad \Delta\mathbf{x}^0 - \mathbf{N}^0\boldsymbol{\gamma}^0 = \begin{bmatrix} 4.366 \\ 13.824 \end{bmatrix}$$

$$a\mathbf{u}\mathbf{u}^T = \frac{\left(\Delta\mathbf{x}^0 - \mathbf{N}^0\boldsymbol{\gamma}^0\right)\left(\Delta\mathbf{x}^0 - \mathbf{N}^0\boldsymbol{\gamma}^0\right)^T}{\left(\Delta\mathbf{x}^0 - \mathbf{N}^0\boldsymbol{\gamma}^0\right)^T\boldsymbol{\gamma}^0} = \frac{\begin{bmatrix} 19.062 & 60.364 \\ 60.364 & 191.158 \end{bmatrix}}{-237.932} = \begin{bmatrix} -0.080 & -0.254 \\ -0.254 & -0.803 \end{bmatrix}$$

$$\mathbf{N}^1 = \mathbf{N}^0 + a\mathbf{u}\mathbf{u}^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} -0.080 & -0.254 \\ -0.254 & -0.803 \end{bmatrix} = \begin{bmatrix} 0.920 & -0.254 \\ -0.254 & 0.197 \end{bmatrix}$$

The new search direction is,

$$\mathbf{s}^1 = -\mathbf{N}^1\nabla f^1 = \begin{bmatrix} -2.837 \\ 0.975 \end{bmatrix}$$
When we step in this direction, using again a line search, we arrive at the optimum
$$\mathbf{x}^2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \nabla f^2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
At this point we are done. However, if we update the direction matrix one more time, we find
it has become the inverse Hessian.
$$\Delta\mathbf{x}^1 = \mathbf{x}^2 - \mathbf{x}^1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} - \begin{bmatrix} 2.030 \\ -0.698 \end{bmatrix} = \begin{bmatrix} -2.030 \\ 0.698 \end{bmatrix}, \qquad \boldsymbol{\gamma}^1 = \nabla f^2 - \nabla f^1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} - \begin{bmatrix} 2.664 \\ -1.524 \end{bmatrix} = \begin{bmatrix} -2.664 \\ 1.524 \end{bmatrix}$$

$$\Delta\mathbf{x}^1 - \mathbf{N}^1\boldsymbol{\gamma}^1 = \begin{bmatrix} 0.808 \\ -0.279 \end{bmatrix}$$

$$a\mathbf{u}\mathbf{u}^T = \frac{\left(\Delta\mathbf{x}^1 - \mathbf{N}^1\boldsymbol{\gamma}^1\right)\left(\Delta\mathbf{x}^1 - \mathbf{N}^1\boldsymbol{\gamma}^1\right)^T}{\left(\Delta\mathbf{x}^1 - \mathbf{N}^1\boldsymbol{\gamma}^1\right)^T\boldsymbol{\gamma}^1} = \begin{bmatrix} -0.253 & 0.088 \\ 0.088 & -0.030 \end{bmatrix}$$

$$\mathbf{N}^2 = \mathbf{N}^1 + a\mathbf{u}\mathbf{u}^T = \begin{bmatrix} 0.667 & -0.166 \\ -0.166 & 0.167 \end{bmatrix}$$

which, to within rounding, is the inverse of the Hessian, $\mathbf{H}^{-1} = \begin{bmatrix} 0.667 & -0.167 \\ -0.167 & 0.167 \end{bmatrix}$.
The rank one update possesses the "hereditary property":

$$\begin{aligned} \mathbf{N}^{k+1}\boldsymbol{\gamma}^k &= \Delta\mathbf{x}^k \\ \mathbf{N}^{k+1}\boldsymbol{\gamma}^{k-1} &= \Delta\mathbf{x}^{k-1} \\ &\ \vdots \\ \mathbf{N}^{k+1}\boldsymbol{\gamma}^{k-n+1} &= \Delta\mathbf{x}^{k-n+1} \end{aligned} \tag{3.43}$$
where n is the number of variables. That is, (3.35) is not only satisfied for the current step, but for
the last n-1 steps. Why is this significant? Let's write this relationship of (3.43) as follows:
Let the matrix whose columns are the $\boldsymbol{\gamma}$ vectors in (3.43) be denoted by G, and the matrix whose columns are the corresponding $\Delta\mathbf{x}$ vectors be denoted by X. Then,
$$\mathbf{N}^{k+1}\mathbf{G} = \mathbf{X}$$

If $\boldsymbol{\gamma}^k, \ldots, \boldsymbol{\gamma}^{k-n+1}$ are independent, and if we have n vectors, i.e. G is a square matrix, then the inverse for G exists and

$$\mathbf{N}^{k+1} = \mathbf{X}\mathbf{G}^{-1} \tag{3.44}$$

is uniquely defined.
Since the Hessian inverse satisfies (3.44) for a quadratic function, then we have the important
result that, after n updates the direction matrix becomes the Hessian inverse for a quadratic
function. This implies the quasi-Newton method will solve a quadratic in no more than n+1
steps. The proof that our rank one update has the hereditary property is given in the next
section.
7.2.4. Proof of the Hereditary Property for the Rank One Update
THEOREM. Let H be a constant symmetric matrix and suppose that $\Delta\mathbf{x}^0, \Delta\mathbf{x}^1, \ldots, \Delta\mathbf{x}^k$ and $\boldsymbol{\gamma}^0, \boldsymbol{\gamma}^1, \ldots, \boldsymbol{\gamma}^k$ are given vectors, where $\boldsymbol{\gamma}^i = \mathbf{H}\Delta\mathbf{x}^i$, $i = 0, 1, 2, \ldots, k$, where $k < n$. Starting with any initial symmetric matrix $\mathbf{N}^0$, let
$$\mathbf{N}^{k+1} = \mathbf{N}^k + \frac{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T}{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k} \tag{3.45}$$

then

$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^i = \Delta\mathbf{x}^i \quad \text{for } i \le k \tag{3.46}$$
PROOF. The proof is by induction. We will show that if (3.46) holds for the previous direction matrix, it holds for the current direction matrix. We know that at the current point, k, the following is true,

$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^k = \Delta\mathbf{x}^k \tag{3.47}$$

because we enforced this condition when we developed the update. Now, suppose it is true that,

$$\mathbf{N}^k\boldsymbol{\gamma}^i = \Delta\mathbf{x}^i \quad \text{for } i \le k-1 \tag{3.48}$$

i.e. that the hereditary property holds for the previous direction matrix. We can post-multiply (3.45) by $\boldsymbol{\gamma}^i$, giving,
$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^i = \mathbf{N}^k\boldsymbol{\gamma}^i + \frac{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^i}{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k} \tag{3.49}$$
where we define $\mathbf{y}^k = \dfrac{\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k}{\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k}$, so that (3.49) can be written,

$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^i = \mathbf{N}^k\boldsymbol{\gamma}^i + \mathbf{y}^k\left(\Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^i \tag{3.50}$$

We can distribute the transpose on the last term, and distribute the post-multiplication by $\boldsymbol{\gamma}^i$, to give (Note: recall that when you take the transpose inside a product, the order of the product is reversed; also, because N is symmetric, $\mathbf{N}^T = \mathbf{N}$, thus $\left(\mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^i = \boldsymbol{\gamma}^{k\,T}\mathbf{N}^k\boldsymbol{\gamma}^i$),

$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^i = \mathbf{N}^k\boldsymbol{\gamma}^i + \mathbf{y}^k\left(\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^i - \boldsymbol{\gamma}^{k\,T}\mathbf{N}^k\boldsymbol{\gamma}^i\right) \tag{3.51}$$

By the induction hypothesis (3.48), $\mathbf{N}^k\boldsymbol{\gamma}^i = \Delta\mathbf{x}^i$, so

$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^i = \Delta\mathbf{x}^i + \mathbf{y}^k\left(\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^i - \boldsymbol{\gamma}^{k\,T}\Delta\mathbf{x}^i\right) \tag{3.52}$$

The term in parentheses vanishes because

$$\boldsymbol{\gamma}^{k\,T}\Delta\mathbf{x}^i = \left(\mathbf{H}\Delta\mathbf{x}^k\right)^T\Delta\mathbf{x}^i = \Delta\mathbf{x}^{k\,T}\mathbf{H}\Delta\mathbf{x}^i = \Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^i \tag{3.53}$$

so that $\mathbf{N}^{k+1}\boldsymbol{\gamma}^i = \Delta\mathbf{x}^i$ for $i \le k-1$; combined with (3.47), this gives (3.46).
Thus, if the hereditary property holds for the previous direction matrix, it holds for the
current direction matrix. When k 0 , condition (3.47) is all that is needed to have the
hereditary property for the first update, N1 . The second update, N 2 , will then have the
hereditary property since N1 does, and so on.
7.3. Conjugacy
7.3.1. Definition
Quasi-Newton methods are also methods of conjugate directions. A set of search directions,
$\mathbf{s}^0, \mathbf{s}^1, \ldots, \mathbf{s}^k$ are said to be conjugate with respect to a square, symmetric matrix, H, if,

$$\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^i = 0 \quad \text{for all } i \ne k \tag{3.55}$$
PROPOSITION. If H is positive definite and the set of non-zero vectors s0 , s1 ,..., s n1 are
conjugate to H, then these vectors are linearly independent.
PROOF. Suppose there exist coefficients $\alpha_i$ such that

$$\sum_{i=0}^{n-1}\alpha_i\mathbf{s}^i = \mathbf{0} \tag{3.57}$$

Pre-multiplying (3.57) by $\mathbf{s}^{k\,T}\mathbf{H}$, from conjugacy, all of the terms except $\alpha_k\,\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^k$ are zero. Since H is positive definite, the only way for this remaining term to be zero is for $\alpha_k$ to be zero. Repeating this for each k, we can show that for (3.57) to be satisfied all the coefficients must be zero. This is the definition of linear independence.
For a quadratic function, the minimizing step along $\mathbf{s}^k$ is,

$$\alpha^k = \frac{-\nabla f^{k\,T}\mathbf{s}^k}{\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^k} \tag{3.60}$$

and the gradient satisfies,

$$\nabla f^{k+1} = \nabla f^k + \mathbf{H}\left(\mathbf{x}^{k+1} - \mathbf{x}^k\right)$$

A method of conjugate directions that takes minimizing steps of length (3.60) reaches the minimum of an n-variable quadratic in at most n steps. To see this, write the sequence of points as,
2. Himmelblau, Applied Nonlinear Programming, p. 112.
$$\mathbf{x}^1 = \mathbf{x}^0 + \alpha^0\mathbf{s}^0$$

Likewise for $\mathbf{x}^2$:

$$\mathbf{x}^2 = \mathbf{x}^1 + \alpha^1\mathbf{s}^1 = \mathbf{x}^0 + \alpha^0\mathbf{s}^0 + \alpha^1\mathbf{s}^1$$

Or, in general,

$$\mathbf{x}^k = \mathbf{x}^0 + \alpha^0\mathbf{s}^0 + \alpha^1\mathbf{s}^1 + \ldots + \alpha^{k-1}\mathbf{s}^{k-1} \tag{3.61}$$

After n steps, we can write the optimum (assuming the directions are independent, which we just showed) as,

$$\mathbf{x}^* = \mathbf{x}^0 + \alpha^0\mathbf{s}^0 + \alpha^1\mathbf{s}^1 + \ldots + \alpha^k\mathbf{s}^k + \ldots + \alpha^{n-1}\mathbf{s}^{n-1} \tag{3.62}$$

Pre-multiplying $\left(\mathbf{x}^* - \mathbf{x}^0\right)$ by $\mathbf{s}^{k\,T}\mathbf{H}$,

$$\mathbf{s}^{k\,T}\mathbf{H}\left(\mathbf{x}^* - \mathbf{x}^0\right) = \alpha^0\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^0 + \alpha^1\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^1 + \ldots + \alpha^k\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^k + \ldots + \alpha^{n-1}\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^{n-1}$$

Solving for $\alpha^k$ (by conjugacy all terms except the kth vanish):

$$\alpha^k = \frac{\mathbf{s}^{k\,T}\mathbf{H}\left(\mathbf{x}^* - \mathbf{x}^0\right)}{\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^k} \tag{3.63}$$
Similarly, using (3.61),

$$\mathbf{s}^{k\,T}\mathbf{H}\left(\mathbf{x}^k - \mathbf{x}^0\right) = \alpha^0\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^0 + \alpha^1\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^1 + \ldots + \alpha^{k-1}\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^{k-1} = 0 + 0 + \ldots + 0 \tag{3.64}$$

which gives,

$$\mathbf{s}^{k\,T}\mathbf{H}\left(\mathbf{x}^k - \mathbf{x}^0\right) = 0 \tag{3.65}$$
Using (3.65), the numerator of (3.63) can be replaced by $\mathbf{s}^{k\,T}\mathbf{H}\left(\mathbf{x}^* - \mathbf{x}^k\right)$, so

$$\alpha^k = \frac{\mathbf{s}^{k\,T}\mathbf{H}\left(\mathbf{x}^* - \mathbf{x}^k\right)}{\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^k} \tag{3.66}$$

Noting that $\mathbf{H}\left(\mathbf{x}^* - \mathbf{x}^k\right) = -\nabla f^k$, since $\mathbf{x}^*$ is the solution to (3.58), we can solve for $\alpha^k$ as,

$$\alpha^k = \frac{-\nabla f^{k\,T}\mathbf{s}^k}{\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^k}$$
We notice that (3.60) is the same as the minimizing step we derived in Section 9.2. Thus the
conjugate direction theorem relies on taking minimizing steps.
7.3.3. Examples
We stated earlier that quasi-Newton methods are also methods of conjugate directions. Thus
for the example given in Section 7.3, we should have,
$$\mathbf{s}^{0\,T}\mathbf{H}\mathbf{s}^1 = [-0.496,\ -0.868]\begin{bmatrix} 2 & 2 \\ 2 & 8 \end{bmatrix}\begin{bmatrix} -2.837 \\ 0.975 \end{bmatrix} = 0.0017 \approx 0$$
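The same check in code (illustrative sketch):

```python
import numpy as np

H = np.array([[2., 2.], [2., 8.]])   # Hessian of the quadratic from Section 4.2
s0 = np.array([-0.496, -0.868])      # first (steepest descent) direction
s1 = np.array([-2.837, 0.975])       # second (quasi-Newton) direction

print("s0^T H s1 =", round(s0 @ H @ s1, 4))   # ~ 0, so the directions are conjugate
```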
In the previous problem we only had two search directions. Let's look at a problem where we have three search directions so we have more conjugate relationships to examine. We will consider the problem,

$$\text{Min } f = -2x_1 + x_1^2 - 4x_2 + 4x_2^2 - 8x_3 + 2x_3^2$$

starting from

$$\mathbf{x}^0 = \begin{bmatrix} -2 \\ -3 \\ -3 \end{bmatrix}, \qquad \nabla f^0 = \begin{bmatrix} -6 \\ -28 \\ -20 \end{bmatrix}, \qquad \mathbf{s}^0 = \begin{bmatrix} 0.172 \\ 0.802 \\ 0.573 \end{bmatrix}$$

We execute a line search in the direction of steepest descent (normalized as s⁰ above), stop at α*, and determine the new point and gradient. We calculate the new search direction using our rank 1 update, take a minimizing step along it, and repeat. After three such steps we arrive at the optimum,

$$\mathbf{x}^3 = \begin{bmatrix} 1.000 \\ 0.500 \\ 2.000 \end{bmatrix}, \qquad \nabla f^3 = \begin{bmatrix} 0.000 \\ 0.000 \\ 0.000 \end{bmatrix}$$
Another insight from conjugacy comes from considering the error in the objective as a function of the vector of step lengths $\boldsymbol{\alpha} = [\alpha^0, \alpha^1, \ldots, \alpha^{n-1}]^T$ taken along the search directions,

$$E(\boldsymbol{\alpha}) = \frac{1}{2}\left(\boldsymbol{\alpha} - \boldsymbol{\alpha}^*\right)^T\mathbf{S}^T\mathbf{H}\mathbf{S}\left(\boldsymbol{\alpha} - \boldsymbol{\alpha}^*\right) \tag{3.67}$$

where S is a matrix with columns $\mathbf{s}^0, \mathbf{s}^1, \ldots, \mathbf{s}^{n-1}$. If the s vectors are conjugate then (3.67) reduces to,
$$E(\boldsymbol{\alpha}) = \frac{1}{2}\sum_{i=0}^{n-1}\left(\alpha^i - \alpha^{i*}\right)^2 d_i$$

where $d_i = \mathbf{s}^{i\,T}\mathbf{H}\mathbf{s}^i$.
As we show in Section 9.3, another result of conjugacy is that at the k+1 step,

$$\nabla f^{k+1\,T}\mathbf{s}^i = 0 \quad \text{for all } i \le k \tag{3.68}$$

Equation (3.68) indicates 1) that the current gradient is orthogonal to all the past search directions, and 2) at the current point we have zero slope with respect to all past search directions, i.e.,

$$\frac{\partial f}{\partial \alpha^i} = 0 \quad \text{for all } i \le k$$

meaning we have minimized the function in the "subspace" of the previous directions. As an example, for the three variable function above, $\nabla f^2$ should be orthogonal to $\mathbf{s}^0$ and $\mathbf{s}^1$:
$$\nabla f^{2\,T}\mathbf{s}^0 = [-2.038,\ -0.218,\ 0.917]\begin{bmatrix} 0.172 \\ 0.802 \\ 0.573 \end{bmatrix} = 0.0007 \approx 0$$

$$\nabla f^{2\,T}\mathbf{s}^1 = [-2.038,\ -0.218,\ 0.917]\begin{bmatrix} 4.251 \\ -3.320 \\ 8.657 \end{bmatrix} = 0.0013 \approx 0$$
3. R. Fletcher, Practical Methods of Optimization, Second Edition, 1987, p. 26.
The rank one update has a drawback, however: there is no guarantee that the direction matrix N remains positive definite, and thus no guarantee that the search direction will always go downhill. It has been shown that (3.42) is the only rank one update which satisfies the quasi-Newton condition. For more flexibility, rank 2 updates have been proposed. These are of the form,

$$\mathbf{N}^{k+1} = \mathbf{N}^k + a\mathbf{u}\mathbf{u}^T + b\mathbf{v}\mathbf{v}^T \tag{3.69}$$

Requiring the update to satisfy the quasi-Newton condition,

$$\mathbf{N}^{k+1}\boldsymbol{\gamma}^k = \Delta\mathbf{x}^k \tag{3.70}$$

we have,

$$\mathbf{N}^k\boldsymbol{\gamma}^k + a\mathbf{u}\mathbf{u}^T\boldsymbol{\gamma}^k + b\mathbf{v}\mathbf{v}^T\boldsymbol{\gamma}^k = \Delta\mathbf{x}^k \tag{3.71}$$
There are a number of possible choices for u and v. One choice is to try,

$$\mathbf{u} = \Delta\mathbf{x}^k, \qquad \mathbf{v} = \mathbf{N}^k\boldsymbol{\gamma}^k \tag{3.72}$$

Substituting into (3.71),

$$\mathbf{N}^k\boldsymbol{\gamma}^k + a\Delta\mathbf{x}^k\underbrace{\left(\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k\right)}_{\text{scalar}} + b\mathbf{N}^k\boldsymbol{\gamma}^k\underbrace{\left(\mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k}_{\text{scalar}} = \Delta\mathbf{x}^k \tag{3.73}$$

In (3.73) we note that the dot products result in scalars. If we choose a and b such that,

$$a\left(\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k\right) = 1 \qquad \text{and} \qquad b\left(\mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k = -1 \tag{3.74}$$

then (3.73) becomes,

$$\mathbf{N}^k\boldsymbol{\gamma}^k + \Delta\mathbf{x}^k - \mathbf{N}^k\boldsymbol{\gamma}^k = \Delta\mathbf{x}^k \tag{3.75}$$

and the quasi-Newton condition (3.70) is satisfied. Substituting a, b, u, and v back into (3.69), the update is,

$$\mathbf{N}^{k+1} = \mathbf{N}^k + \frac{\Delta\mathbf{x}^k\Delta\mathbf{x}^{k\,T}}{\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k} - \frac{\mathbf{N}^k\boldsymbol{\gamma}^k\left(\mathbf{N}^k\boldsymbol{\gamma}^k\right)^T}{\left(\mathbf{N}^k\boldsymbol{\gamma}^k\right)^T\boldsymbol{\gamma}^k} \tag{3.76}$$

or, equivalently (since N is symmetric),

$$\mathbf{N}^{k+1} = \mathbf{N}^k + \frac{\Delta\mathbf{x}^k\Delta\mathbf{x}^{k\,T}}{\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k} - \frac{\mathbf{N}^k\boldsymbol{\gamma}^k\boldsymbol{\gamma}^{k\,T}\mathbf{N}^k}{\boldsymbol{\gamma}^{k\,T}\mathbf{N}^k\boldsymbol{\gamma}^k} \tag{3.77}$$
Davidon⁴ was the first one to propose this update. Fletcher and Powell further developed his method;⁵ thus this method came to be known as the Davidon-Fletcher-Powell (DFP) update.
This update has the following properties,
THEOREM. If $\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k > 0$ for all steps of the algorithm, and if we start with any symmetric, positive definite matrix, $\mathbf{N}^0$, then the DFP update preserves the positive definiteness of $\mathbf{N}^k$ for all k.
PROOF. The proof is inductive. We will show that if $\mathbf{N}^k$ is positive definite, $\mathbf{N}^{k+1}$ is also, i.e., from the definition of positive definiteness, $\mathbf{z}^T\mathbf{N}^{k+1}\mathbf{z} > 0$ for any nonzero z. For simplicity we will drop the superscript k on the update terms. From (3.77),
$$\mathbf{z}^T\mathbf{N}^{k+1}\mathbf{z} = \underbrace{\mathbf{z}^T\mathbf{N}\mathbf{z}}_{\text{term 1}} + \underbrace{\frac{\mathbf{z}^T\Delta\mathbf{x}\,\Delta\mathbf{x}^T\mathbf{z}}{\Delta\mathbf{x}^T\boldsymbol{\gamma}}}_{\text{term 2}} - \underbrace{\frac{\mathbf{z}^T\mathbf{N}\boldsymbol{\gamma}\boldsymbol{\gamma}^T\mathbf{N}\mathbf{z}}{\boldsymbol{\gamma}^T\mathbf{N}\boldsymbol{\gamma}}}_{\text{term 3}} \tag{3.78}$$
We need to show that all the terms on the right hand side are positive. We will focus for a
moment on the first and third terms on the right hand side. Noting that N can be written as $\mathbf{N} = \mathbf{L}\mathbf{L}^T$ via Cholesky decomposition, and substituting $\mathbf{a} = \mathbf{L}^T\mathbf{z}$, $\mathbf{a}^T = \mathbf{z}^T\mathbf{L}$, $\mathbf{b} = \mathbf{L}^T\boldsymbol{\gamma}$, $\mathbf{b}^T = \boldsymbol{\gamma}^T\mathbf{L}$, the first and third terms are,
$$\mathbf{z}^T\mathbf{N}\mathbf{z} - \frac{\mathbf{z}^T\mathbf{N}\boldsymbol{\gamma}\boldsymbol{\gamma}^T\mathbf{N}\mathbf{z}}{\boldsymbol{\gamma}^T\mathbf{N}\boldsymbol{\gamma}} = \mathbf{a}^T\mathbf{a} - \frac{\left(\mathbf{a}^T\mathbf{b}\right)^2}{\mathbf{b}^T\mathbf{b}} \tag{3.79}$$
The Cauchy-Schwarz inequality states that for any two vectors, x and y,
4. W. C. Davidon, USAEC Doc. ANL-5990 (rev.), Nov. 1959.
5. R. Fletcher and M. J. D. Powell, Computer J. 6: 163, 1963.
$$\left(\mathbf{x}^T\mathbf{y}\right)^2 \le \left(\mathbf{x}^T\mathbf{x}\right)\left(\mathbf{y}^T\mathbf{y}\right)$$

thus

$$\mathbf{a}^T\mathbf{a} - \frac{\left(\mathbf{a}^T\mathbf{b}\right)^2}{\mathbf{b}^T\mathbf{b}} \ge 0 \tag{3.80}$$
So the first and third terms of (3.78) are positive. Now we need to show this for the second
term,
$$\frac{\mathbf{z}^T\Delta\mathbf{x}\,\Delta\mathbf{x}^T\mathbf{z}}{\Delta\mathbf{x}^T\boldsymbol{\gamma}} = \frac{\left(\mathbf{z}^T\Delta\mathbf{x}\right)^2}{\Delta\mathbf{x}^T\boldsymbol{\gamma}} \tag{3.81}$$
The numerator of the right-most expression is obviously positive. The denominator can be written,

$$\Delta\mathbf{x}^T\boldsymbol{\gamma} = \Delta\mathbf{x}^{k\,T}\left(\nabla f^{k+1} - \nabla f^k\right) = \underbrace{\alpha\,\mathbf{s}^{k\,T}\nabla f^{k+1}}_{\text{term 1}} - \underbrace{\alpha\,\mathbf{s}^{k\,T}\nabla f^k}_{\text{term 2}} \tag{3.82}$$

If we take a minimizing step, term 1 is zero from (3.25); term 2 is negative because $\mathbf{s}^k$ is a descent direction, so with the minus sign the denominator is positive.
We have now shown that all three terms of (3.78) are positive if we take a minimizing step.
Thus, if N k is positive definite, N k 1 is positive definite, etc.
A related rank two update of the Hessian inverse is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update,

$$\mathbf{N}^{k+1} = \mathbf{N}^k + \left(1 + \frac{\boldsymbol{\gamma}^{k\,T}\mathbf{N}^k\boldsymbol{\gamma}^k}{\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k}\right)\frac{\Delta\mathbf{x}^k\Delta\mathbf{x}^{k\,T}}{\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k} - \frac{\Delta\mathbf{x}^k\boldsymbol{\gamma}^{k\,T}\mathbf{N}^k + \mathbf{N}^k\boldsymbol{\gamma}^k\Delta\mathbf{x}^{k\,T}}{\Delta\mathbf{x}^{k\,T}\boldsymbol{\gamma}^k} \tag{3.83}$$
This update is currently considered to be the best update for use in optimization. It is the
update inside OptdesX, Excel and many other optimization packages.
Note that these methods use information the previous methods “threw away.” Quasi-Newton
methods use differences in gradients and differences in x to estimate second derivatives
according to (3.34). This allows information from previous steps to correct (or update) the
current step.
The quasi-Newton condition can also be written in terms of an approximation to the Hessian itself,

$$\boldsymbol{\gamma}^k = \mathbf{H}^k\Delta\mathbf{x}^k \tag{3.84}$$

The corresponding update of the Hessian (rather than its inverse) is,

$$\mathbf{H}^{k+1} = \mathbf{H}^k + \frac{\boldsymbol{\gamma}^k\boldsymbol{\gamma}^{k\,T}}{\boldsymbol{\gamma}^{k\,T}\Delta\mathbf{x}^k} - \frac{\mathbf{H}^k\Delta\mathbf{x}^k\Delta\mathbf{x}^{k\,T}\mathbf{H}^k}{\Delta\mathbf{x}^{k\,T}\mathbf{H}^k\Delta\mathbf{x}^k} \tag{3.85}$$
You will note that this looks a lot like the DFP Hessian inverse update but with H
interchanged with N and interchanged with x. In fact these two formulas are said to be
complementary to each other.
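The pieces above can be assembled into a complete quasi-Newton iteration. The sketch below (illustrative only; it uses the BFGS inverse update (3.83) and, because the test function is a quadratic with known H, an exact line step) solves the example problem of Section 4.2:

```python
import numpy as np

def bfgs_inverse_update(N, dx, gamma):
    """BFGS update (3.83) of the inverse Hessian approximation N."""
    dg = dx @ gamma
    Ng = N @ gamma
    return (N
            + (1.0 + (gamma @ Ng) / dg) * np.outer(dx, dx) / dg
            - (np.outer(dx, Ng) + np.outer(Ng, dx)) / dg)

def quasi_newton(grad, x0, H_true, n_iter=5):
    """Quasi-Newton iteration; H_true is used only for the exact step length."""
    x, N = x0.astype(float), np.eye(len(x0))
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < 1e-10:
            break
        s = -N @ g                              # search direction (3.28)
        alpha = -(g @ s) / (s @ H_true @ s)     # exact line search (3.12)
        x_new = x + alpha * s
        N = bfgs_inverse_update(N, x_new - x, grad(x_new) - g)
        x = x_new
    return x

H = np.array([[2., 2.], [2., 8.]])
grad = lambda x: H @ x                          # f = x1^2 + 2 x1 x2 + 4 x2^2
print(np.round(quasi_newton(grad, np.array([3.0, 1.0]), H), 6))   # ~ [0, 0]
```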
8.1. Definition
There is one more method we will learn, called the conjugate gradient method. We will
present the results for this method primarily because it is an algorithm used in Microsoft
Excel.
The conjugate gradient method is built upon steepest descent, except a correction factor is
added to the search direction. The correction makes this method a method of conjugate
directions. For the conjugate direction method, the search direction is given by,
$$\mathbf{s}^{k+1} = -\nabla f^{k+1} + \beta^k\mathbf{s}^k \tag{3.86}$$

where

$$\beta^k = \frac{\nabla f^{k+1\,T}\nabla f^{k+1}}{\nabla f^{k\,T}\nabla f^k} \tag{3.87}$$
As an example, we will solve the problem given in Section 4.2, $f(\mathbf{x}) = x_1^2 + 2x_1x_2 + 4x_2^2$, starting from

$$\mathbf{x}^0 = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \qquad \nabla f^0 = \begin{bmatrix} 8 \\ 14 \end{bmatrix}$$

The first step is a steepest descent step ($\mathbf{s}^0 = -\nabla f^0$) with an exact line search, which gives,

$$\mathbf{x}^1 = \begin{bmatrix} 2.03 \\ -0.7 \end{bmatrix}, \qquad \nabla f^1 = \begin{bmatrix} 2.664 \\ -1.522 \end{bmatrix}$$
Now we calculate $\beta^0$ as

$$\beta^0 = \frac{\nabla f^{1\,T}\nabla f^1}{\nabla f^{0\,T}\nabla f^0} = \frac{[2.664,\ -1.522]\begin{bmatrix} 2.664 \\ -1.522 \end{bmatrix}}{[8,\ 14]\begin{bmatrix} 8 \\ 14 \end{bmatrix}} = \frac{9.413}{260} = 0.0362$$
The new search direction is,

$$\mathbf{s}^1 = -\nabla f^1 + \beta^0\mathbf{s}^0 = \begin{bmatrix} -2.664 \\ 1.522 \end{bmatrix} + 0.0362\begin{bmatrix} -8 \\ -14 \end{bmatrix} = \begin{bmatrix} -2.954 \\ 1.015 \end{bmatrix}$$

When we step in this direction, using an exact line search, we arrive at the optimum,

$$\mathbf{x}^2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \nabla f^2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
The main advantage of the conjugate gradient method, as compared to quasi-Newton
methods, is computation and storage. The conjugate gradient method only requires that we
store the last search direction and last gradient, instead of a full matrix. Thus this method is a
good one to use for large problems (say with 500 variables).
Although both conjugate gradient and quasi-Newton methods will optimize quadratic
functions in n steps, on real problems quasi-Newton methods are better. Further, small errors
can build up in the conjugate gradient method so some researchers recommend restarting the
algorithm periodically (such as every n steps) to be steepest descent.
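A sketch of the conjugate gradient iteration (illustrative; exact line steps are used since the test function is a quadratic with known H):

```python
import numpy as np

def conjugate_gradient(grad, x0, H, n_iter=10, tol=1e-10):
    """Fletcher-Reeves conjugate gradient with exact line steps on a quadratic."""
    x = x0.astype(float)
    g = grad(x)
    s = -g
    for _ in range(n_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = -(g @ s) / (s @ H @ s)      # exact line search (3.12)
        x = x + alpha * s
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)    # (3.87)
        s = -g_new + beta * s               # (3.86)
        g = g_new
    return x

H = np.array([[2., 2.], [2., 8.]])
grad = lambda x: H @ x                      # f = x1^2 + 2 x1 x2 + 4 x2^2
print(np.round(conjugate_gradient(grad, np.array([3.0, 1.0]), H), 6))   # ~ [0, 0]
```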
9. Appendix
9.1. Gradients of Linear and Quadratic Functions in Vector Form

The coordinate vector $\mathbf{e}_i$ is a column of zeros with a single 1 in the ith position,

$$\mathbf{e}_i = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \leftarrow i\text{th position} \tag{3.88}$$
Consider the linear function,

$$f(\mathbf{x}) = a + \mathbf{b}^T\mathbf{x}$$

then

$$\nabla f(\mathbf{x}) = \underbrace{\nabla a}_{\text{term 1}} + \underbrace{\nabla\left(\mathbf{b}^T\mathbf{x}\right)}_{\text{term 2}}$$

For the first term, since a is a constant, $\nabla a = \mathbf{0}$. Looking at the second term, from the rule for differentiation of a product,

$$\nabla\left(\mathbf{b}^T\mathbf{x}\right) = \left(\nabla\mathbf{b}^T\right)\mathbf{x} + \left(\nabla\mathbf{x}^T\right)\mathbf{b}$$
but $\nabla\mathbf{b}^T = \mathbf{0}$ and $\nabla\mathbf{x}^T = \mathbf{I}$. Thus

$$\nabla f(\mathbf{x}) = \nabla a + \nabla\left(\mathbf{b}^T\mathbf{x}\right) = \mathbf{0} + \left(\nabla\mathbf{b}^T\right)\mathbf{x} + \left(\nabla\mathbf{x}^T\right)\mathbf{b} = \mathbf{0} + \mathbf{0} + \mathbf{I}\mathbf{b} = \mathbf{b} \tag{3.90}$$
For the quadratic function,

$$q(\mathbf{x}) = a + \mathbf{b}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{H}\mathbf{x} \tag{3.91}$$

we wish to evaluate the gradient in vector form. We will do this term by term,

$$\nabla q(\mathbf{x}) = \nabla a + \nabla\left(\mathbf{b}^T\mathbf{x}\right) + \nabla\left(\frac{1}{2}\mathbf{x}^T\mathbf{H}\mathbf{x}\right) = \mathbf{0} + \mathbf{b} + \frac{1}{2}\nabla\left(\mathbf{x}^T\mathbf{H}\mathbf{x}\right)$$
So we only need to evaluate the term $\frac{1}{2}\nabla\left(\mathbf{x}^T\mathbf{H}\mathbf{x}\right)$. If we split this into two vectors, i.e. $\mathbf{u} = \mathbf{x}$, $\mathbf{v} = \mathbf{H}\mathbf{x}$, then

$$\nabla\left(\mathbf{x}^T\mathbf{H}\mathbf{x}\right) = \nabla\left(\mathbf{u}^T\mathbf{v}\right) = \left(\nabla\mathbf{x}^T\right)\mathbf{H}\mathbf{x} + \left(\nabla\left(\mathbf{H}\mathbf{x}\right)^T\right)\mathbf{x}$$

We know $\left(\nabla\mathbf{x}^T\right)\mathbf{H}\mathbf{x} = \mathbf{I}\mathbf{H}\mathbf{x} = \mathbf{H}\mathbf{x}$, so we must only evaluate $\left(\nabla\left(\mathbf{H}\mathbf{x}\right)^T\right)\mathbf{x}$. We can write,

$$\left(\mathbf{H}\mathbf{x}\right)^T = \left[\mathbf{h}_{r1}^T\mathbf{x},\ \mathbf{h}_{r2}^T\mathbf{x},\ \ldots,\ \mathbf{h}_{rn}^T\mathbf{x}\right]$$

where $\mathbf{h}_{r1}^T$ represents the first row of H, $\mathbf{h}_{r2}^T$ represents the second row, and so forth. Applying the gradient operator, from the previous result for $\nabla\left(\mathbf{b}^T\mathbf{x}\right)$, we know that $\nabla\left(\mathbf{h}_{ri}^T\mathbf{x}\right) = \mathbf{h}_{ri}$ since $\mathbf{h}_{ri}$ is a vector constant. Therefore,
$$\nabla\left(\mathbf{H}\mathbf{x}\right)^T = \left[\mathbf{h}_{r1},\ \mathbf{h}_{r2},\ \ldots,\ \mathbf{h}_{rn}\right] = \mathbf{H}^T$$

Returning now to the gradient of the expression $q(\mathbf{x}) = a + \mathbf{b}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{H}\mathbf{x}$,

$$\nabla q(\mathbf{x}) = \nabla a + \nabla\left(\mathbf{b}^T\mathbf{x}\right) + \frac{1}{2}\nabla\left(\mathbf{x}^T\mathbf{H}\mathbf{x}\right) = \mathbf{0} + \mathbf{b} + \frac{1}{2}\left(\mathbf{H}\mathbf{x} + \mathbf{H}^T\mathbf{x}\right) = \mathbf{b} + \mathbf{H}\mathbf{x} \tag{3.92}$$

since H is symmetric ($\mathbf{H}^T = \mathbf{H}$).
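A quick numerical confirmation of (3.92) (illustrative sketch with a randomly generated symmetric H):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
H = A + A.T                          # a random symmetric H
b = rng.normal(size=n)
a = 1.7

q = lambda x: a + b @ x + 0.5 * x @ H @ x

x = rng.normal(size=n)
h = 1e-6
grad_fd = np.array([(q(x + h*e) - q(x - h*e)) / (2*h) for e in np.eye(n)])
print(np.allclose(grad_fd, b + H @ x, atol=1e-4))   # True: gradient is b + Hx
```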
Applying this result to the Taylor expansion of a quadratic,

$$f^{k+1} = f^k + \nabla f^{k\,T}\Delta\mathbf{x}^k + \frac{1}{2}\Delta\mathbf{x}^{k\,T}\mathbf{H}^k\Delta\mathbf{x}^k$$

and taking its gradient with respect to $\Delta\mathbf{x}^k$ gives,

$$\nabla f^{k+1} = \nabla f^k + \mathbf{H}^k\Delta\mathbf{x}^k \tag{3.93}$$
9.2. The Minimizing Step Length for a Quadratic Function

Starting again with the Taylor expansion,

$$f^{k+1} = f^k + \nabla f^{k\,T}\Delta\mathbf{x}^k + \frac{1}{2}\Delta\mathbf{x}^{k\,T}\mathbf{H}\Delta\mathbf{x}^k \tag{3.94}$$

and substituting

$$\Delta\mathbf{x}^k = \alpha\mathbf{s} \tag{3.95}$$

we have,

$$f^{k+1} = f^k + \alpha\,\nabla f^{k\,T}\mathbf{s} + \frac{1}{2}\alpha^2\,\mathbf{s}^T\mathbf{H}\mathbf{s}$$

Taking the derivative with respect to $\alpha$,

$$\frac{df^{k+1}}{d\alpha} = \nabla f^{k\,T}\mathbf{s} + \alpha\,\mathbf{s}^T\mathbf{H}\mathbf{s} \tag{3.96}$$
Setting (3.96) to zero and solving for the minimizing step,

$$\alpha^* = \frac{-\nabla f^{k\,T}\mathbf{s}}{\mathbf{s}^T\mathbf{H}\mathbf{s}} \tag{3.97}$$
9.3. Proof of the Orthogonality of the Gradient to Previous Search Directions

THEOREM. For a method of conjugate directions with exact line searches on a quadratic function,

$$\nabla f^{k+1\,T}\mathbf{s}^i = 0 \quad \text{for all } i \le k \tag{3.98}$$

which indicates that we have zero slope along any search direction in the subspace generated by $\mathbf{x}^0$ and the search directions $\mathbf{s}^0, \mathbf{s}^1, \ldots, \mathbf{s}^k$, i.e.,

$$\frac{\partial f}{\partial \alpha^i} = 0 \quad \text{for all } i \le k$$
PROOF. The proof is by induction. Given the usual expression for the gradient of a Taylor expansion,

$$\nabla f^{k+1} = \nabla f^k + \mathbf{H}\Delta\mathbf{x}^k$$

and substituting $\Delta\mathbf{x}^k = \alpha^k\mathbf{s}^k$,

$$\nabla f^{k+1} = \nabla f^k + \alpha^k\mathbf{H}\mathbf{s}^k \tag{3.99}$$

Pre-multiplying (3.99) by $\mathbf{s}^{k\,T}$, for $i = k$,

$$\mathbf{s}^{k\,T}\nabla f^{k+1} = \mathbf{s}^{k\,T}\nabla f^k + \alpha^k\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^k = 0$$

which is zero from the definition of $\alpha^k$ in (3.97). For $i < k$, pre-multiplying (3.99) by $\mathbf{s}^{i\,T}$,

$$\mathbf{s}^{i\,T}\nabla f^{k+1} = \underbrace{\mathbf{s}^{i\,T}\nabla f^k}_{\text{term 1}} + \underbrace{\alpha^k\mathbf{s}^{i\,T}\mathbf{H}\mathbf{s}^k}_{\text{term 2}}$$

Term 1 vanishes by the induction hypothesis, while term 2 vanishes from the definition of conjugate directions.
9.4. Proof that an Update with the Hereditary Property is Also a Method of
Conjugate Directions
THEOREM. An update with the hereditary property and exact line searches is a method of conjugate directions and therefore terminates after at most $m \le n$ iterations on a quadratic function.
We assume that the search directions generated through step k are conjugate, i.e.,

$$\mathbf{s}^{k\,T}\mathbf{H}\mathbf{s}^i = 0 \quad \text{for all } i \le k-1 \tag{3.101}$$

The proof is by induction. We will show that if $\mathbf{s}^k$ is conjugate then $\mathbf{s}^{k+1}$ is as well, i.e.,

$$\mathbf{s}^{k+1\,T}\mathbf{H}\mathbf{s}^i = 0 \quad \text{for all } i \le k \tag{3.102}$$
We note that

$$\mathbf{s}^{k+1} = -\mathbf{N}^{k+1}\nabla f^{k+1} \tag{3.103}$$

or, taking the transpose (N is symmetric),

$$\mathbf{s}^{k+1\,T} = -\nabla f^{k+1\,T}\mathbf{N}^{k+1} \tag{3.104}$$

so that

$$\mathbf{s}^{k+1\,T}\mathbf{H}\mathbf{s}^i = -\nabla f^{k+1\,T}\mathbf{N}^{k+1}\mathbf{H}\mathbf{s}^i \quad \text{for all } i \le k \tag{3.105}$$
Also, since $\mathbf{H}\Delta\mathbf{x}^i = \boldsymbol{\gamma}^i$ and $\Delta\mathbf{x}^i = \alpha^i\mathbf{s}^i$,

$$\mathbf{H}\mathbf{s}^i = \frac{\boldsymbol{\gamma}^i}{\alpha^i}$$

so (3.105) becomes,

$$\mathbf{s}^{k+1\,T}\mathbf{H}\mathbf{s}^i = \frac{-\nabla f^{k+1\,T}\mathbf{N}^{k+1}\boldsymbol{\gamma}^i}{\alpha^i} \quad \text{for all } i \le k \tag{3.106}$$
From the hereditary property we have $\mathbf{N}^{k+1}\boldsymbol{\gamma}^i = \Delta\mathbf{x}^i$ for $i \le k$, so (3.106) can be written,
$$\mathbf{s}^{k+1\,T}\mathbf{H}\mathbf{s}^i = \frac{-\nabla f^{k+1\,T}\Delta\mathbf{x}^i}{\alpha^i} = -\nabla f^{k+1\,T}\mathbf{s}^i = 0 \quad \text{for all } i \le k$$
The term $\nabla f^{k+1\,T}\mathbf{s}^i$ is zero for all values of $i = 1, 2, \ldots, k-1$ from the assumption that the previous search directions were conjugate, which implies (3.98). It is zero for $i = k$ from the definition of $\alpha^*$. Thus if we have conjugate directions at k, and the hereditary property holds, we have conjugate directions at k+1.
10. References
For more information on unconstrained optimization and in particular Hessian updates, see:
D. Luenberger, and Y. Ye, Linear and Nonlinear Programming, Third Edition, 2008.