
1. Conjugate Direction Methods


1.1. General Discussion. In this section we are again concerned with the problem of
unconstrained optimization:
P: minimize f (x)
subject to x ∈ Rn
where f : Rn → R is twice continuously differentiable. The underlying assumption of this
section is that the dimension n is very large, indeed, so large that the matrix ∇2 f (x) cannot
be held in storage. However, it will be assumed that the matrix vector product ∇2 f (x)v can
be computed for any vector v.
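To illustrate this working assumption, here is a minimal sketch (an illustration, not part of the notes) of a matrix-free Hessian-vector product: the product ∇2 f (x)v is approximated using only gradient evaluations, so ∇2 f (x) is never formed or stored. The function name hessian_vector_product and the step size eps are illustrative choices.

```python
import numpy as np

def hessian_vector_product(grad, x, v, eps=1e-6):
    """Approximate (Hessian of f at x) @ v by a forward difference of
    gradients, so the Hessian itself is never formed or stored."""
    return (grad(x + eps * v) - grad(x)) / eps

# Check on a quadratic f(x) = 0.5 x^T Q x - b^T x, for which grad f(x) = Qx - b
rng = np.random.default_rng(0)
n = 5
Q = np.diag(rng.uniform(1.0, 10.0, size=n))   # stand-in for a large Hessian
b = rng.standard_normal(n)
grad = lambda x: Q @ x - b
x, v = rng.standard_normal(n), rng.standard_normal(n)
print(np.linalg.norm(hessian_vector_product(grad, x, v) - Q @ v))  # ~ 0
```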
To better understand the numerical approach to be developed, first consider a constrained
version of P:
P(x0 ,S) :  minimize f (x)
            subject to x ∈ x0 + S,
where x0 ∈ Rn and S ⊂ Rn is a subspace with
x0 + S = {x0 + s |s ∈ S } .
The problem P(x0 ,S) is our first instance of a constrained problem. However, it is a constrained
problem that is equivalent to an unconstrained problem. To see this let v0 , v1 , . . . , vk−1 be a
basis for S. Then
S = Span(v0 , . . . , vk−1 ) = Ran(V ), where V = [v0 , v1 , . . . , vk−1 ] ∈ Rn×k .
Hence

x0 + S = x0 + Ran(V ) = { x0 + V z | z ∈ Rk } .


Therefore, we can rewrite the problem P(x0 ,S) as

P0(x0 ,V ) :  minimize f (x0 + V z)
              subject to z ∈ Rk .

If z̄ is a local solution to P0(x0 ,V ), then x̄ = x0 + V z̄ is a local solution to P(x0 ,S) , and conversely,
if x̄ is a local solution to P(x0 ,S) , then any z̄ ∈ Rk for which x̄ = x0 + V z̄ is a local solution
to P0(x0 ,V ).
Due to this equivalence, it is possible to derive first-order necessary conditions for optimality
in P(x0 ,S) from those for the unconstrained problem P0(x0 ,V ).

Theorem 1.1. (Subspace Optimality Theorem)


Let f : Rn → R be continuously differentiable, x0 ∈ Rn , and S a subspace of Rn . If x̄ is a
local solution to P(x0 ,S) , then ∇f (x̄) ⊥ S. If it is further assumed that f is convex, then the
condition ∇f (x̄) ⊥ S is both necessary and sufficient for x̄ to be a global solution to P(x0 ,S) .
Proof. As has already been observed, if x̄ is a local solution to P(x0 ,S) , then, since x̄ ∈ x0 + S,
there must exist z̄ ∈ Rk such that x̄ = x0 + V z̄, and any such z̄ is a local solution to P0(x0 ,V ).

Therefore, if h(z) = f (x0 + V z), then


0 = ∇h(z̄) = V T ∇f (x0 + V z̄) = V T ∇f (x̄) = ( v1T ∇f (x̄), v2T ∇f (x̄), . . . , vkT ∇f (x̄) )T ,
or equivalently, ∇f (x̄)T vi = 0, i = 1, 2, . . . , k, which is in turn equivalent to ∇f (x̄) ⊥ S.
If f is assumed to be convex, then the final statement of the theorem follows from the
convexity of the function h. 
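To make the reduction of P(x0 ,S) to the unconstrained problem P0(x0 ,V ) concrete, here is a minimal NumPy sketch (an illustration, not part of the notes). It uses a quadratic objective for concreteness (quadratics are the focus of the next subsection), solves the reduced problem in z, and checks the conclusion of the theorem, V T ∇f (x̄) = 0.

```python
import numpy as np

# Minimize f(x) = 0.5 x^T Q x - b^T x over the affine set x0 + Ran(V)
# by solving the reduced (unconstrained) problem in z, h(z) = f(x0 + V z).
rng = np.random.default_rng(0)
n, k = 8, 3
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)              # symmetric positive definite
b = rng.standard_normal(n)
x0 = rng.standard_normal(n)
V = rng.standard_normal((n, k))          # columns form a basis of S

grad = lambda x: Q @ x - b               # gradient of the quadratic model

# grad h(z) = V^T grad f(x0 + V z) = 0  <=>  (V^T Q V) z = V^T (b - Q x0)
z_bar = np.linalg.solve(V.T @ Q @ V, V.T @ (b - Q @ x0))
x_bar = x0 + V @ z_bar

# Subspace Optimality Theorem: grad f(x_bar) is orthogonal to S
print(np.linalg.norm(V.T @ grad(x_bar)))  # ~ 0
```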

1.2. Conjugate Direction Methods. In this section we focus on the problem P when f
has the form
(1.1) f (x) := (1/2) xT Qx − bT x,
where Q is a symmetric positive definite matrix. Our development in this section revolves
around the notion of Q-conjugacy.
Definition 1.1 (Conjugacy). Let Q ∈ Rn×n be symmetric and positive definite. We say
that the vectors x, y ∈ Rn \{0} are Q-conjugate (or Q-orthogonal) if xT Qy = 0.
Proposition 1.1.1 (Conjugacy implies Linear Independence). If Q ∈ Rn×n is pos-
itive definite and the set of nonzero vectors d0 , d1 , . . . , dk are (pairwise) Q-conjugate, then
these vectors are linearly independent.
Proof. If 0 = α0 d0 + α1 d1 + · · · + αk dk , then for any i0 ∈ {0, 1, . . . , k}
0 = dTi0 Q[α0 d0 + · · · + αk dk ] = αi0 dTi0 Qdi0 ,
since dTi0 Qdi = 0 for i ≠ i0 . Because Q is positive definite and di0 ≠ 0, we have dTi0 Qdi0 > 0,
and so αi0 = 0. Since i0 was arbitrary, αi = 0 for each i = 0, . . . , k.
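The proposition also suggests a practical way to manufacture Q-conjugate directions: apply Gram-Schmidt in the inner product ⟨x, y⟩Q = xT Qy. The sketch below (illustrative, not part of the notes) builds such a set from a random basis and verifies pairwise conjugacy and linear independence.

```python
import numpy as np

# Build Q-conjugate directions by Gram-Schmidt in the Q-inner product
# <x, y>_Q = x^T Q y, then verify pairwise conjugacy and independence.
rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)               # symmetric positive definite

ds = []
for v in rng.standard_normal((n, n)):     # rows give a random basis
    d = v.copy()
    for dj in ds:
        d -= (dj @ Q @ v) / (dj @ Q @ dj) * dj   # strip off Q-components
    ds.append(d)

D = np.column_stack(ds)
G = D.T @ Q @ D                           # Gram matrix in the Q-inner product
print(np.max(np.abs(G - np.diag(np.diag(G)))))   # ~ 0: pairwise Q-conjugate
print(np.linalg.matrix_rank(D))                  # n: linearly independent
```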


Let x0 ∈ Rn and suppose that the vectors d0 , d1 , . . . , dk−1 ∈ Rn are Q-conjugate. Set
S = Span(d0 , d1 , . . . , dk−1 ). Since Q is positive definite, f is both coercive and strictly convex.
Therefore, a solution x∗ to P(x0 ,S) exists, is unique, and satisfies 0 = V T ∇f (x∗ ) = V T (Qx∗ − b),
where V = [d0 , d1 , . . . , dk−1 ], by the Subspace Optimality Theorem. Since x∗ ∈ x0 + S, there
are scalars µ0 , . . . , µk−1 such
that
(1.2) x∗ = x0 + µ0 d0 + . . . + µk−1 dk−1 .
Since 0 = V T ∇f (x∗ ) = V T (Qx∗ − b), for each j = 0, 1, . . . , k − 1 we have
0 = dTj (Qx∗ − b)
= dTj (Q(x0 + µ0 d0 + . . . + µk−1 dk−1 ) − b)
= dTj (Qx0 − b) + µ0 dTj Qd0 + . . . + µk−1 dTj Qdk−1
= dTj ∇f (x0 ) + µj dTj Qdj ,

so that
(1.3) µj = −dTj ∇f (x0 ) / (dTj Qdj ) ,   j = 0, 1, . . . , k − 1 .
This observation motivates the following theorem.
Theorem 1.2. [Expanding Subspace Theorem]
Let d0 , d1 , . . . , dn−1 be a sequence of nonzero Q-conjugate vectors in Rn . Then for any x0 ∈ Rn the
sequence {xk } generated according to
xk+1 = xk + αk dk
with
αk := arg min{f (xk + αdk ) : α ∈ R}
has the property that f (x) = (1/2) xT Qx − bT x attains its minimum value on the affine set
x0 + Span {d0 , . . . , dk−1 } at the point xk .
Proof. Let us first compute the value of the αk ’s. Set
ϕk (α) = f (xk + αdk ) = (α2 /2) dTk Qdk + α gkT dk + f (xk ),
where gk = ∇f (xk ) = Qxk − b. Then ϕ′k (α) = α dTk Qdk + gkT dk . Since f is strictly convex, so
is ϕk , and so αk is the unique solution to ϕ′k (α) = 0, which is given by
αk = − gkT dk / (dTk Qdk ) .
Therefore,
xk+1 = x0 + α0 d0 + α1 d1 + · · · + αk dk
with
αj = − gjT dj / (dTj Qdj ) ,   j = 0, 1, . . . , k.
Preceding the theorem it was shown that if x∗ is the solution to the problem
min {f (x) | x ∈ x0 + Span(d0 , d1 , . . . , dk )} ,
then x∗ is given by (1.2) and (1.3) (with k − 1 replaced by k). Therefore, if we can now show
that µj = αj , j = 0, 1, . . . , k, then x∗ = xk+1 , which proves the result. For each j ∈ {0, 1, . . . , k}
we have
∇f (xj )T dj = (Qxj − b)T dj
= (Q(x0 + α0 d0 + α1 d1 + · · · + αj−1 dj−1 ) − b)T dj
= (Qx0 − b)T dj + α0 dT0 Qdj + α1 dT1 Qdj + · · · + αj−1 dTj−1 Qdj
= (Qx0 − b)T dj
= ∇f (x0 )T dj .
Therefore, for each j ∈ {0, 1, . . . , k},
αj = −∇f (xj )T dj / (dTj Qdj ) = −∇f (x0 )T dj / (dTj Qdj ) = µj ,

which proves the result. 


As an immediate consequence of this theorem we obtain the following result.
Theorem 1.3 (Conjugate Direction Algorithm). Let d0 , d1 , . . . , dn−1 be a set of nonzero
Q-conjugate vectors. For any x0 ∈ Rn the sequence {xk } generated according to
xk+1 := xk + αk dk , k≥0
with
αk := arg min{f (xk + αdk ) : α ∈ R}
converges to the unique solution x∗ of P, with f given by (1.1), after n steps; that is, xn = x∗ .
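As a numerical sanity check on Theorems 1.2 and 1.3 (again an illustration, not part of the notes), the sketch below runs the conjugate direction iteration with directions produced by Gram-Schmidt in the Q-inner product, verifies that gk ⊥ Span{d0 , . . . , dk−1 } at every step, and confirms that xn solves Qx = b.

```python
import numpy as np

# The Conjugate Direction Algorithm on f(x) = 0.5 x^T Q x - b^T x using
# Q-conjugate directions built by Gram-Schmidt in the Q-inner product.
# Checks the Expanding Subspace Theorem (g_k orthogonal to d_0,...,d_{k-1})
# and the n-step convergence x_n = x* of Theorem 1.3.
rng = np.random.default_rng(2)
n = 10
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)

ds = []
for v in rng.standard_normal((n, n)):
    d = v - sum((dj @ Q @ v) / (dj @ Q @ dj) * dj for dj in ds)
    ds.append(d)

x = rng.standard_normal(n)                    # x_0
for k, d in enumerate(ds):
    g = Q @ x - b                             # g_k = grad f(x_k)
    assert all(abs(g @ dj) < 1e-6 for dj in ds[:k])   # g_k orthogonal to d_j, j < k
    alpha = -(g @ d) / (d @ Q @ d)            # exact line search along d_k
    x = x + alpha * d

print(np.linalg.norm(x - np.linalg.solve(Q, b)))      # ~ 0: x_n = x*
```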
1.3. The Conjugate Gradient Algorithm. The major drawback of the Conjugate Direc-
tion Algorithm of the previous section is that it seems to require that a set of Q-conjugate
directions must be obtained before the algorithm can be implemented. This is in opposition
to our working assumption that Q is so large that it cannot be kept in storage since any set
of Q-conjugate directions requires the same amount of storage as Q. However, it is possible
to generate the directions dj one at a time and then discard them after each iteration of the
algorithm. One example of such an algorithm is the Conjugate Gradient Algorithm.
The C-G Algorithm:
Initialization: x0 ∈ Rn , d0 = −g0 = −∇f (x0 ) = b − Qx0 .
For k = 0, 1, 2, . . .
αk := −gkT dk /dTk Qdk
xk+1 := xk + αk dk
gk+1 := Qxk+1 − b (STOP if gk+1 = 0)
βk := gTk+1 Qdk /dTk Qdk
dk+1 := −gk+1 + βk dk
k := k + 1.
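A direct transcription of the C-G Algorithm into NumPy might look as follows. This is a sketch for illustration only: the exact test gk+1 = 0 is replaced by a small tolerance on the norm of gk+1 , and Q is used only through matrix-vector products, in keeping with the working assumption of Section 1.1.

```python
import numpy as np

def conjugate_gradient(Q, b, x0, tol=1e-10):
    """C-G Algorithm for f(x) = 0.5 x^T Q x - b^T x with Q symmetric
    positive definite; Q is used only through matrix-vector products."""
    x = x0.copy()
    g = Q @ x - b                          # g_0 = grad f(x_0)
    d = -g                                 # d_0 = -g_0
    for _ in range(len(b)):
        if np.linalg.norm(g) <= tol:       # STOP if g_k = 0 (numerically)
            break
        Qd = Q @ d                         # the only use of Q
        alpha = -(g @ d) / (d @ Qd)        # exact minimizer along d_k
        x = x + alpha * d
        g_new = Q @ x - b                  # g_{k+1} = Q x_{k+1} - b
        beta = (g_new @ Qd) / (d @ Qd)     # beta_k = g_{k+1}^T Q d_k / d_k^T Q d_k
        d = -g_new + beta * d              # d_{k+1}
        g = g_new
    return x

rng = np.random.default_rng(3)
n = 200
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
x = conjugate_gradient(Q, b, rng.standard_normal(n))
print(np.linalg.norm(Q @ x - b))           # ~ 0 after at most n steps
```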
Theorem 1.4. [Conjugate Gradient Theorem]
The C-G algorithm is a conjugate direction method. If it does not terminate at xk (i.e.
gk ≠ 0), then
(1) Span [g0 , g1 , . . . , gk ] = Span [g0 , Qg0 , . . . , Qk g0 ]
(2) Span [d0 , d1 , . . . , dk ] = Span [g0 , Qg0 , . . . , Qk g0 ]
(3) dTk Qdi = 0 for i ≤ k − 1
(4) αk = gkT gk /dTk Qdk
(5) βk = gTk+1 gk+1 /gkT gk .
Proof. We first prove (1)-(3) by induction. The results are clearly true for k = 0. Now
suppose they are true for k; we show they are true for k + 1. First observe that
gk+1 = gk + αk Qdk ,
so that gk+1 ∈ Span[g0 , . . . , Qk+1 g0 ] by the induction hypothesis on (1) and (2). Also
gk+1 ∉ Span [d0 , . . . , dk ], since otherwise gk+1 = 0: by the induction hypothesis the method
is a conjugate direction method up to step k, so gk+1 ⊥ Span [d0 , . . . , dk ] by the Subspace
Optimality Theorem. Hence

gk+1 ∉ Span [g0 , . . . , Qk g0 ], and so Span [g0 , g1 , . . . , gk+1 ] = Span [g0 , . . . , Qk+1 g0 ], which
proves (1).
To prove (2) write
dk+1 = −gk+1 + βk dk
so that (2) follows from (1) and the induction hypothesis on (2).
To see (3) observe that
dTk+1 Qdi = −gTk+1 Qdi + βk dTk Qdi .
For i = k the right hand side is zero by the definition of βk . For i < k both terms vanish:
the term gTk+1 Qdi = 0 by Theorem 1.2, since Qdi ∈ Span[d0 , . . . , dk ] by (1) and (2); the
term dTk Qdi vanishes by the induction hypothesis on (3).
To prove (4) write
−gkT dk = gkT gk − βk−1 gkT dk−1
where gkT dk−1 = 0 by Theorem 1.2.
To prove (5) note that gTk+1 gk = 0 by Theorem 1.2 because gk ∈ Span[d0 , . . . , dk ]. Hence
gTk+1 Qdk = (1/αk ) gTk+1 [gk+1 − gk ] = (1/αk ) gTk+1 gk+1 .
Therefore,
βk = (1/αk ) gTk+1 gk+1 / (dTk Qdk ) = gTk+1 gk+1 / (gkT gk ) .


Remarks:
(1) The C–G method described above is a descent method since the values
f (x0 ), f (x1 ), . . . , f (xn )
form a decreasing sequence. Moreover, note that
∇f (xk )T dk = −gkT gk and αk > 0 .
Thus, the C–G method behaves very much like the descent methods discussed previously.
(2) It should be observed that due to the occurrence of round-off error the C-G algorithm
is best implemented as an iterative method. That is, at the end of n steps, f may
not attain its global minimum at xn and the intervening directions dk may not be
Q-conjugate. Consequently, at the end of the nth step one should check the value
k∇f (xn )k. If it is sufficiently small, then accept xn as the point at which f attains
its global minimum value; otherwise, reset x0 := xn and run the algorithm again.
Due to the observations in the remark above, this approach is guaranteed to continue to
reduce the function value if possible since the overall method is a descent method.
In this sense the C–G algorithm is self correcting.

1.4. Extensions to Non-Quadratic Problems. If f : Rn → R is not quadratic, then


the Hessian matrix ∇2 f (xk ) changes with k. Hence the C-G method needs modification in
this case. An obvious approach is to replace Q by ∇2 f (xk ) everywhere it occurs in the C-G
algorithm. However, this approach is fundamentally flawed in its explicit use of ∇2 f . By
using parts (4) and (5) of the conjugate gradient Theorem 1.4 and by trying to mimic the
descent features of the C–G method, one can obtain a workable approximation of the C–G
algorithm in the non–quadratic case.
The Non-Quadratic C-G Algorithm
Initialization: x0 ∈ Rn , g0 = ∇f (x0 ), d0 = −g0 , 0 < c < β < 1.
Having xk , obtain xk+1 as follows:
Check restart criteria. If a restart condition is satisfied, then reset x0 = xn , g0 = ∇f (x0 ),
d0 = −g0 ; otherwise, set
αk ∈ { λ | λ > 0, ∇f (xk + λdk )T dk ≥ β ∇f (xk )T dk , and f (xk + λdk ) − f (xk ) ≤ cλ ∇f (xk )T dk }
xk+1 := xk + αk dk
gk+1 := ∇f (xk+1 )
βk := gTk+1 gk+1 / gkT gk   (Fletcher-Reeves), or
βk := max{ 0, gTk+1 (gk+1 − gk ) / gkT gk }   (Polak-Ribiere)
dk+1 := −gk+1 + βk dk
k := k + 1.
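A minimal sketch of this scheme (illustrative, not the notes' implementation) is given below. It uses the Polak-Ribiere update, a simple bisection search for a step satisfying the two conditions defining αk above, a restart with d = −g every n iterations (the k = n criterion discussed next), and a crude descent safeguard in place of the full Powell tests; the test function at the end is an arbitrary strongly convex, non-quadratic example chosen for illustration.

```python
import numpy as np

def weak_wolfe(f, grad, x, d, c=1e-4, beta=0.9):
    """Bisection search for a step satisfying the two conditions that
    define alpha_k above: sufficient decrease and the curvature condition."""
    lo, hi, alpha = 0.0, np.inf, 1.0
    f0, slope0 = f(x), grad(x) @ d
    for _ in range(60):
        if f(x + alpha * d) > f0 + c * alpha * slope0:
            hi = alpha                                    # decrease condition fails
        elif grad(x + alpha * d) @ d < beta * slope0:
            lo = alpha                                    # curvature condition fails
        else:
            return alpha
        alpha = (lo + hi) / 2.0 if hi < np.inf else 2.0 * lo
    return alpha

def nonlinear_cg(f, grad, x0, tol=1e-6, max_iter=500):
    """Non-quadratic C-G with the Polak-Ribiere update; restarts with the
    steepest descent direction every n steps and whenever the computed
    direction fails to be a descent direction (a crude stand-in for the
    Powell restart tests)."""
    x, n = x0.astype(float).copy(), x0.size
    g = grad(x)
    d = -g
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = weak_wolfe(f, grad, x, d)
        x = x + alpha * d
        g_new = grad(x)
        if (k + 1) % n == 0:
            beta_k = 0.0                                  # restart: k = n
        else:
            beta_k = max(0.0, g_new @ (g_new - g) / (g @ g))   # Polak-Ribiere
        d_new = -g_new + beta_k * d
        if g_new @ d_new >= 0.0:                          # safeguard: keep a descent direction
            d_new = -g_new
        d, g = d_new, g_new
    return x

# Example: a strongly convex, non-quadratic test function
rng = np.random.default_rng(4)
n = 50
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ Q @ x + np.sum(np.log(np.cosh(x))) - b @ x
grad = lambda x: Q @ x + np.tanh(x) - b
x = nonlinear_cg(f, grad, np.zeros(n))
print(np.linalg.norm(grad(x)))            # small: approximate stationary point
```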
Remarks
(1) The Polak-Ribiere update for βk has demonstrated experimental superiority. One
way to see why this might be true is to observe that
gTk+1 (gk+1 − gk ) ≈ αk gTk+1 ∇2 f (xk ) dk ,
thereby yielding a better second-order approximation. Indeed, in the quadratic case
the formula for βk is precisely
αk gTk+1 ∇2 f (xk ) dk / (gkT gk ) ,
which reduces to gTk+1 Qdk / (dTk Qdk ) = βk upon substituting ∇2 f (xk ) = Q and using
part (4) of Theorem 1.4.
(2) Observe that the Hessian is never explicitly referred to in the above algorithm.
(3) At any given iteration the procedure requires the storage of only 2 vectors if Fletcher-
Reeves is used and 3 vectors if Polak-Ribiere is used. This is of great significance if
n is very large, say n = 50, 000. Thus we see that one of the advantages of the C-G
method is that it can be practically applied to very large scale problems.
(4) Aside from the cost of gradient and function evaluations the greatest cost lies in the
line search employed for the computation of αk .
We now consider appropriate restart criteria. Clearly, we should restart when k = n
since this is what we do in the quadratic case. But there are other issues to take into
consideration. First, since ∇2 f (xk ) changes with each iteration, there is no reason to think
that we are preserving any sort of conjugacy relation from one iteration to the next. In order

to get some kind of control on this behavior, we define a measure of conjugacy and if this
measure is violated, then we restart. Second, we need to make sure that the search directions
dk are descent directions. Moreover, (a) the angle between these directions and the negative
gradient should be bounded away from zero in order to force the gradient to zero, and (b)
the directions should have a magnitude that is comparable to that of the gradient in order
to prevent ill–conditioning. The precise restart conditions are given below.
Restart Conditions
(1) k = n
(2) |gTk+1 gk | ≥ 0.2 gkT gk
(3) the condition −2 gkT gk ≤ gkT dk ≤ −0.2 gkT gk is violated

Conditions (2) and (3) above are known as the Powell restart conditions.
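A small helper of the following form (an illustrative sketch, not part of the notes) could be used inside the non-quadratic C-G iteration to test these conditions; the constants 0.2 and 2 are those listed above, and condition (3) is treated as triggering a restart when the descent/conditioning window fails to hold.

```python
import numpy as np

def powell_restart(g_new: np.ndarray, g: np.ndarray, d: np.ndarray,
                   k: int, n: int) -> bool:
    """Return True when any of the restart conditions listed above holds:
    (1) k = n, (2) loss of near-orthogonality of successive gradients,
    (3) the descent/conditioning window on g_k^T d_k is violated."""
    gg = g @ g
    return (k == n
            or abs(g_new @ g) >= 0.2 * gg                 # condition (2)
            or not (-2.0 * gg <= g @ d <= -0.2 * gg))     # condition (3)
```

When the helper returns True, the iteration would be reinitialized with d0 = −g0 at the current point, as in the restart step of the algorithm above.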
