
Lecture 5

Simple methods for extremely large-scale problems

5.1 Motivation
The polynomial time Interior Point methods, same as all other polynomial time methods for
Convex Programming known so far, share a not-so-pleasant feature: the arithmetic
cost C of an iteration in such a method grows nonlinearly with the design dimension n of the
problem, unless the latter possesses a very favourable structure. E.g., in IP methods, an iteration
requires solving a system of linear equations with (at least) n unknowns. Solving this auxiliary
problem costs at least O(n²) operations (with traditional Linear Algebra, even O(n³)
operations), except for the cases when the matrix of the system is very sparse and, moreover,
possesses a well-structured sparsity pattern. The latter indeed is the case when solving most
LPs of real-world origin, but nearly never is the case for, e.g., SDPs. For other known
polynomial time methods the situation is similar: the arithmetic cost of an iteration, even in
the case of extremely simple objectives and feasible sets, is at least O(n²). With n of the order of
tens and hundreds of thousands, a computational effort of O(n²), to say nothing of O(n³),
operations per iteration becomes prohibitively large – basically, you will never finish the very
first iteration of your method... On the other hand, design dimensions of tens and hundreds
of thousands are exactly what one meets in many applications, like SDP relaxations of combinatorial
problems involving large graphs, or Structural Design (especially for 3D structures). As another
important application of this type, consider the 3D Medical Imaging problem arising in Positron
Emission Tomography.

Positron Emission Tomography (PET) is a powerful, non-invasive, medical diagnostic imaging
technique for measuring the metabolic activity of cells in the human body. It has been in clinical use
since the early 1990s. PET imaging is unique in that it shows the chemical functioning of organs and
tissues, while other imaging techniques – such as X-ray, computerized tomography (CT) and magnetic
resonance imaging (MRI) – show anatomic structures.
A PET scan involves the use of a radioactive tracer – a fluid with a small amount of a radioactive
material which has the property of emitting positrons. When the tracer is administered to a patient,
either by injection or inhalation of gas, it distributes within the body. For a properly chosen tracer, this
distribution “concentrates” in desired locations, e.g., in the areas of high metabolic activity where cancer
tumors can be expected.
The radioactive component of the tracer disintegrates, emitting positrons. Such a positron nearly
immediately annihilates with a nearby electron, giving rise to two photons flying at the speed of light off
the point of annihilation in nearly opposite directions along a line with a completely random orientation
(i.e., uniformly distributed in space). They penetrate the surrounding tissue and are registered outside
the patient by a PET scanner consisting of circular arrays (rings) of gamma radiation detectors. Since
the two gamma rays are emitted simultaneously and travel in almost exactly opposite directions, we can
say a lot about the location of their source: when a pair of opposing detectors register high-energy photons
within a short (∼ 10−8 sec) timing window (“a coincidence event”), we know that the photons came
from a disintegration act, and that the act took place on the line (“line of response” (LOR)) linking the
detectors. The measured data set is the collection of numbers of coincidences counted by different pairs of
detectors (“bins”), and the problem is to recover from these measurements the 3D density of the tracer.
The mathematical model of the process, after appropriate discretization, is

y = P λ + ξ,

where
• λ ≥ 0 is the vector representing the (discretized) density of the tracer; the entries of λ are indexed
by voxels – small cubes into which we partition the field of view, and λj is the mean density of
the tracer in voxel j. Typically, the number n of voxels is in the range from 3×10^5 to 3×10^6,
depending on the resolution of the discretization grid;
• y are the measurements; the entries in y are indexed by bins – pairs of detectors, and yi is the
number of coincidences counted by the i-th pair of detectors. Typically, the dimension m of y – the
total number of bins – is in the millions (at least 3×10^6);
• P is the projection matrix; its entries pij are the probabilities for a LOR originating in voxel j to
be registered by bin i. These probabilities are readily given by the geometry of the scanner;
• ξ is the measurement noise, coming mainly from the fact that all physical processes underlying PET
are random. The standard statistical model for PET is that y_i, i = 1, ..., m, are independent
Poisson random variables with expectations (Pλ)_i.
The problem we are interested in is to recover the tracer's density λ given the measurements y. As far as the
quality of the result is concerned, the most attractive reconstruction scheme is given by the Maximum
Likelihood estimation standard in Statistics: denoting by p(·|λ) the density, w.r.t. a certain dominating
distribution, of the probability distribution of the measurements coming from λ, the estimate of the unknown
true value λ_* of λ is

λ̂ = argmax_{λ≥0} p(y|λ),

where y is the vector of measurements.


For the aforementioned Poisson model of PET, building the Maximum Likelihood estimate is equiv-
alent to solving the optimization problem

min_{λ} { Σ_{j=1}^n λ_j p_j − Σ_{i=1}^m y_i ln( Σ_{j=1}^n λ_j p_ij ) : λ ≥ 0 },  p_j = Σ_i p_ij.    (PET)

This is a nicely structured convex program (by the way, polynomially reducible to CQP and even LP).
The only difficulty – and a severe one – is the huge size of the problem: as was already explained, the
number n of decision variables is at least 300,000, while the number m of log-terms in the objective is in
the range from 3×10^6 to 25×10^6.
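To make the objective of (PET) concrete, here is a small Python sketch of the negative Poisson log-likelihood (constants dropped) on synthetic data; the sizes, the matrix P and the measurements y are hypothetical stand-ins for a real scanner geometry.

```python
import numpy as np

# Toy instance: m bins, n voxels. P, y and the sizes are synthetic
# stand-ins, not data from a real scanner.
rng = np.random.default_rng(0)
m, n = 50, 20
P = rng.random((m, n))            # p_ij >= 0: detection probabilities
lam_true = rng.random(n)          # "true" discretized tracer density
y = rng.poisson(P @ lam_true)     # Poisson counts with means (P lam)_i

def neg_log_likelihood(lam):
    """Objective of (PET): sum_j p_j lam_j - sum_i y_i ln((P lam)_i),
    with p_j = sum_i p_ij (additive constants dropped)."""
    Plam = P @ lam
    return P.sum(axis=0) @ lam - y @ np.log(Plam)

# The function is convex on {lam >= 0}; evaluating it at any positive
# point is an O(mn) operation.
lam0 = rng.random(n) + 0.5
print(neg_log_likelihood(lam0))
```

The instance here is purely illustrative; in a real scanner P is huge and sparse, and the evaluation exploits that sparsity.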

At the present level of our knowledge, a design dimension n of the order of tens and hundreds
of thousands rules out the possibility of solving a nonlinear convex program, even a well-structured
one, by polynomial time methods, because the arithmetic cost of an iteration blows up at least
quadratically in n. When n is really large, all we can use are simple methods with a cost of an
iteration linear in n. As a byproduct of this restriction, we can no longer utilize our knowledge
of the analytic structure of the problem, since all currently known ways of doing so are
too expensive when n is large. As a result, we are forced to restrict ourselves to
black-box-oriented optimization techniques – those which use solely the possibility to compute
the values and the (sub)gradients of the objective and the constraints at a point. In Convex
Optimization, two types of “cheap” black-box-oriented optimization techniques are known:

• techniques for unconstrained minimization of smooth convex functions (Gradient Descent,
Conjugate Gradients, quasi-Newton methods with restricted memory, etc.);

• subgradient-type techniques for nonsmooth convex programs, including constrained ones.

Since the majority of applications are constrained, we restrict our exposition to the techniques
of the second type. We start by investigating what, in principle, can be expected of black-
box-oriented optimization techniques.

5.2 Information-based complexity of Convex Programming


Black-box-oriented methods and Information-based complexity. Consider a Convex
Programming program in the form

min_x { f(x) : x ∈ X },    (CP)

where X is a convex compact set in R^n and the objective f is a continuous convex function
on R^n. Let us fix a family P(X) of convex programs (CP) with X common for all programs
from the family, so that such a program can be identified with the corresponding objective, and
the family itself is nothing but a certain family of convex functions on R^n. We intend to explain
what the Information-based complexity of P(X) is – informally, the complexity of the family w.r.t.
“black-box-oriented” methods. We start by defining such a method as a routine B as follows:

1. When starting to solve (CP), B is given an accuracy ε > 0 to which the problem should
be solved and knows that the problem belongs to a given family P(X). However, B does
not know which particular problem it deals with.

2. In the course of solving the problem, B has access to a First Order oracle for f. This
oracle is capable, given on input a point x ∈ R^n, of reporting on output the value
f(x) and a subgradient f′(x) of f at x.
B generates somehow a sequence of search points x_1, x_2, ... and calls the First Order oracle
to get the values and the subgradients of f at these points. The rules for building x_t
can be arbitrary, except for the fact that they should be causal: x_t can depend only on
the information f(x_1), f′(x_1), ..., f(x_{t−1}), f′(x_{t−1}) on f accumulated by B at the first t − 1
steps.

3. After a certain number T = T_B(f, ε) of calls to the oracle, B terminates and outputs the
result z_B(f, ε). This result again should depend solely on the information on f accumulated
by B at the T search steps, and must be an ε-solution to (CP), i.e.,

z_B(f, ε) ∈ X  and  f(z_B(f, ε)) − min_X f ≤ ε.

We measure the complexity of P(X) w.r.t. a solution method B by the function

Compl_B(ε) = max_{f∈P(X)} T_B(f, ε)

– the minimal number of steps in which B is capable of solving to within accuracy ε every instance
of P(X). Finally, the Information-based complexity of the family P(X) of problems is defined
as

Compl(ε) = min_B Compl_B(ε),

the minimum being taken over all solution methods. Thus, the relation Compl(ε) = N means,
first, that there exists a solution method B capable of solving to within accuracy ε every instance of
P(X) in no more than N calls to the First Order oracle, and, second, that for every solution
method B there exists an instance of P(X) such that B needs at least N steps to solve the instance
to within accuracy ε.
Note that as far as black-box-oriented optimization methods are concerned, the information-
based complexity Compl(ε) of a family P(X) is a lower bound on the “actual” computational effort,
whatever that means, sufficient to find an ε-solution to every instance of the family.

Main results on Information-based complexity of Convex Programming can be sum-
marized as follows. Let X be a solid in R^n (a convex compact set with a nonempty interior),
and let P(X) be the family of all convex functions on R^n normalized by the condition

max_X f − min_X f ≤ 1.    (5.2.1)

For this family,

I. Complexity of finding high-accuracy solutions in fixed dimension is independent of the
geometry of X. Specifically,

∀(ε ≤ ε(X)):  O(1) n ln(2 + 1/ε) ≤ Compl(ε);
∀(ε > 0):    Compl(ε) ≤ O(1) n ln(2 + 1/ε),    (5.2.2)

where

• O(1) are appropriately chosen positive absolute constants,

• ε(X) depends on the geometry of X, but is never less than 1/n², where n is the dimen-
sion of X.

II. Complexity of finding solutions of fixed accuracy in high dimensions does depend on the
geometry of X. Here are 3 typical results:

(a) Let X be an n-dimensional box: X = {x ∈ R^n : ‖x‖_∞ ≤ 1}. Then

ε ≤ 1/2 ⇒ O(1) n ln(1/ε) ≤ Compl(ε) ≤ O(1) n ln(1/ε).    (5.2.3)

(b) Let X be an n-dimensional ball: X = {x ∈ R^n : ‖x‖_2 ≤ 1}. Then

n ≥ 1/ε² ⇒ O(1)/ε² ≤ Compl(ε) ≤ O(1)/ε².    (5.2.4)

(c) Let X be an n-dimensional hyperoctahedron: X = {x ∈ R^n : ‖x‖_1 ≤ 1}. Then

n ≥ 1/ε² ⇒ O(1)/ε² ≤ Compl(ε) ≤ O(ln n)/ε²    (5.2.5)

(in fact, O(1) in the lower bound can be replaced with O(ln n), provided that n ≫ 1/ε²).
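The dimension-independent upper bound in (5.2.4) is attained by the classical projected subgradient method with steps of order 1/√t. Below is a minimal Python sketch on a hypothetical test function f(x) = ‖x − a‖_2 (Lipschitz constant 1 w.r.t. ‖·‖_2); the instance and the step-size rule are illustrative choices, and the point is that the achieved accuracy after N oracle calls behaves like O(1/√N) no matter how large n is.

```python
import numpy as np

# Projected subgradient method on the unit Euclidean ball, illustrating
# the dimension-independent O(1/eps^2) bound (5.2.4). The test function
# f(x) = ||x - a||_2 is a hypothetical example with L = 1 and min f = 0.
rng = np.random.default_rng(1)
n = 10_000                        # a genuinely large design dimension
a = rng.standard_normal(n)
a /= 2 * np.linalg.norm(a)        # ||a||_2 = 1/2: minimizer inside the ball

def f(x):
    return np.linalg.norm(x - a)

def subgrad(x):
    g = x - a
    nrm = np.linalg.norm(g)
    return g / nrm if nrm > 0 else np.zeros_like(g)

def project_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def subgradient_method(N):
    """N steps with step-size ~ 1/sqrt(t); returns the best value found."""
    x = np.zeros(n)
    best = f(x)
    for t in range(1, N + 1):
        x = project_ball(x - subgrad(x) / np.sqrt(t))
        best = min(best, f(x))
    return best

# Each iteration costs O(n); the number of iterations needed for a given
# accuracy does not depend on n at all.
```

Running `subgradient_method(400)` returns a value of order 1/√400 = 0.05, and doubling n leaves this behaviour unchanged.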

Since we are interested in extremely large-scale problems, the moral we can extract from
the outlined results is as follows:
• I is discouraging: it says that we have no hope to guarantee high accuracy, like ε = 10^{-6},
when solving large-scale problems with black-box-oriented methods; indeed, with O(n) steps per
accuracy digit and at least O(n) operations per step (this many operations are required already
to input a search point to the oracle), the arithmetic cost per accuracy digit is at least O(n²),
which is prohibitively large for really large n.
• II is partly discouraging, partly encouraging. The bad news reported by II is that when X is
a box, which is the most typical situation in applications, we have no hope to solve extremely
large-scale problems, in reasonable time, to guaranteed, even low, accuracy, since the required
number of steps should be at least of the order of n. The good news reported by II is that there
exist situations where the complexity of minimizing a convex function to a fixed accuracy is
independent, or nearly independent, of the design dimension. Of course, the dependence of the
complexity bounds in (5.2.4) and (5.2.5) on ε is very bad and has nothing in common with being
polynomial in ln(1/ε); however, this drawback is tolerable when we do not intend to get high
accuracy. Another drawback is that there are not that many applications where the feasible set
is a ball or a hyperoctahedron. Note, however, that in fact we can preserve the upper complexity
bounds in (5.2.4) and (5.2.5) – the ones most important to us – when requiring X to be a subset of
a ball, respectively of a hyperoctahedron, rather than the entire ball/hyperoctahedron.
This extension is not costless: we should simultaneously strengthen the normalization condition
(5.2.1). Specifically, we shall see that

B. The upper complexity bound in (5.2.4) remains valid when X ⊂ {x : ‖x‖_2 ≤ 1} and

P(X) = {f : f is convex and |f(x) − f(y)| ≤ ‖x − y‖_2 ∀x, y ∈ X};

S. The upper complexity bound in (5.2.5) remains valid when X ⊂ {x : ‖x‖_1 ≤ 1} and

P(X) = {f : f is convex and |f(x) − f(y)| ≤ ‖x − y‖_1 ∀x, y ∈ X}.

Note that the “ball-like” case mentioned in B seems rather artificial: the Euclidean norm
associated with this case is a very natural mathematical entity, but this is all we can say in its
favour. For example, the normalization of the objective in B is that the Lipschitz constant of f
w.r.t. ‖·‖_2 is ≤ 1, or, which is the same, that the vector of the first order partial derivatives of f
should, at every point, be of ‖·‖_2-norm not exceeding 1. In other words, the “typical” magnitudes
of the partial derivatives of f should become smaller and smaller as the number of variables grows;
what could be the reason for such strange behaviour? In contrast to this, the normalization
condition imposed on f in S is that the Lipschitz constant of f w.r.t. ‖·‖_1 is ≤ 1, or, which is
the same, that the ‖·‖_∞-norm of the vector of partial derivatives of f is ≤ 1. In other words,
the normalization is that the magnitudes of the first order partial derivatives of f should be ≤ 1,
and this normalization is “dimension-independent”. Of course, in B we deal with minimization
over subsets of the unit ball, while in S we deal with minimization over subsets of the unit
hyperoctahedron, which is much smaller than the unit ball. However, there do exist problems
in reality where we should minimize over the standard simplex

Δ_n = {x ∈ R^n : x ≥ 0, Σ_i x_i = 1},

which indeed is a subset of the unit hyperoctahedron. For example, it turns out that the PET
Image Reconstruction problem (PET) is in fact the problem of minimization over the standard
simplex. Indeed, the optimality condition for (PET) reads

λ_j ( p_j − Σ_i y_i p_ij / (Pλ)_i ) = 0,  j = 1, ..., n;

summing up these equalities, we get

Σ_j p_j λ_j = B ≡ Σ_i y_i.

It follows that the optimal solution to (PET) remains unchanged when we add to the non-
negativity constraints λ_j ≥ 0 also the constraint Σ_j p_j λ_j = B. Passing to the new variables
x_j = B^{-1} p_j λ_j, we further convert (PET) to the equivalent form

min_x { f(x) ≡ − Σ_i y_i ln( Σ_j q_ij x_j ) : x ∈ Δ_n },  q_ij = B p_ij / p_j,    (PET′)

which is a problem of minimizing a convex function over the standard simplex.
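The change of variables behind (PET′) is easy to check numerically. In the sketch below P, y and λ are small synthetic stand-ins; we verify that x lands in the standard simplex and that the log-terms of (PET) and (PET′) coincide.

```python
import numpy as np

# Numerical check of the substitution x_j = B^{-1} p_j lam_j with
# q_ij = B p_ij / p_j: then sum_j q_ij x_j = sum_j p_ij lam_j, so the
# log-terms of (PET) and (PET') agree. All data here are synthetic.
rng = np.random.default_rng(2)
m, n = 30, 8
P = rng.random((m, n))
y = rng.poisson(5.0, size=m).astype(float)
lam = rng.random(n)

p = P.sum(axis=0)                 # p_j = sum_i p_ij
B = y.sum()
lam = lam * B / (p @ lam)         # rescale so that sum_j p_j lam_j = B
x = p * lam / B                   # new variables
Q = B * P / p                     # q_ij = B p_ij / p_j

assert np.isclose(x.sum(), 1.0) and (x >= 0).all()   # x is in Delta_n
assert np.allclose(Q @ x, P @ lam)                   # identical log-terms
```

On the slice Σ_j p_j λ_j = B the linear term of (PET) is the constant B, which is why it can be dropped in (PET′).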

Intermediate conclusion. The discussion above suggests that it is a good idea to
look for simple convex minimization techniques which, as applied to convex programs (CP)
with feasible sets of appropriate geometry, exhibit dimension-independent (or nearly dimension-
independent) and nearly optimal information-based complexity. We are about to present a
family of techniques of this type.

5.3 The Bundle-Mirror scheme


The setup. Consider problem (CP) and assume that

(A.1): The (convex) objective f is Lipschitz continuous on X.

To quantify this assumption, we fix once and for all a norm ‖·‖ on R^n and associate
with f the Lipschitz constant of f|_X w.r.t. the norm ‖·‖:

L_{‖·‖}(f) = min { L : |f(x) − f(y)| ≤ L‖x − y‖ ∀x, y ∈ X }.

Note that from Convex Analysis it follows that f admits, at every point x ∈ X, a
subgradient f′(x) such that

‖f′(x)‖_* ≤ L_{‖·‖}(f),

where ‖·‖_* is the norm conjugate to ‖·‖:

‖ξ‖_* = max { ξ^T x : ‖x‖ ≤ 1 }.

For example, the norm conjugate to ‖·‖_p is ‖·‖_q, where q = p/(p − 1).

We assume that this “small norm” subgradient f′(x) is exactly the one reported by
the First Order oracle when called with input x ∈ X; this is not a severe restriction,
since at least in the interior of X all subgradients of f are “small” in the outlined
sense.
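The conjugate-norm formula is easy to illustrate numerically. The sketch below checks the two pairings used throughout this lecture – ‖·‖_1 versus ‖·‖_∞, and the self-conjugate ‖·‖_2 – on a random vector; the example is generic, not tied to any particular problem.

```python
import numpy as np

# ||xi||_* = max{ xi^T x : ||x|| <= 1 }. For ||.||_1 the maximizer puts
# all its mass on the coordinate with the largest |xi_j|, so the
# conjugate norm is ||.||_inf; ||.||_2 is conjugate to itself.
rng = np.random.default_rng(3)
xi = rng.standard_normal(6)

# Conjugate of ||.||_1:
j = np.argmax(np.abs(xi))
x_star = np.zeros_like(xi)
x_star[j] = np.sign(xi[j])                    # ||x_star||_1 = 1
assert np.isclose(xi @ x_star, np.linalg.norm(xi, np.inf))

# Conjugate of ||.||_2:
x_star = xi / np.linalg.norm(xi)              # ||x_star||_2 = 1
assert np.isclose(xi @ x_star, np.linalg.norm(xi))
```

The same maximizing arguments show why a Lipschitz constant w.r.t. ‖·‖_1 translates into an ‖·‖_∞-bound on the gradient, as used in S above.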

The setup for the generic method BM we are about to present is given by X, ‖·‖ and a
continuously differentiable function ω(x) : X → R which we assume to be strongly convex, with
parameter κ > 0, w.r.t. the norm ‖·‖; the latter means that

ω(y) ≥ ω(x) + (y − x)^T ∇ω(x) + (κ/2) ‖y − x‖²  ∀x, y ∈ X.    (5.3.1)
The standard setups we will be especially interested in are:

1. “Ball setup”: ω(x) = (1/2) x^T x, ‖·‖ = ‖·‖_2, and X is a convex compact set simple enough
to allow for rapid solving of the optimization problems of the form

x[p] = argmin_{x∈X} [ ω(x) + p^T x ];    (5.3.2)

2. “Simplex setup”: ω(x) is the “regularized entropy”

ω(x) = Σ_{i=1}^n (x_i + δn^{-1}) ln(x_i + δn^{-1}) : Δ_n^+ → R,    (5.3.3)

where δ ∈ (0, 1) is a once and for all fixed “regularization parameter”; X is a convex compact
subset of the standard “full-dimensional” simplex

Δ_n^+ = {x ∈ R^n : x ≥ 0, Σ_i x_i ≤ 1}

simple enough to allow for solving the optimization problems of the form (5.3.2); ‖·‖ = ‖·‖_1;

3. “Spectahedron setup”: this setup deals with the special case when our “universe” is S^n
rather than R^n; as usual, S^n is equipped with the Frobenius inner product. The specta-
hedron is the set in S^n defined as

Σ_n^+ = {x ∈ S^n : x ⪰ 0, Tr(x) ≤ 1}

(we are using lowercase notation for the elements of S^n in order to be consistent with the
rest of the text). The function ω(x) is the “regularized matrix entropy”

ω(x) = Tr( (x + δn^{-1} I_n) ln(x + δn^{-1} I_n) ) : Σ_n^+ → R,    (5.3.4)

where δ ∈ (0, 1) is a once and for all fixed regularization parameter. X is a convex compact
subset of the spectahedron Σ_n^+ simple enough to allow for solving the optimization problems
of the form (5.3.2), and ‖·‖ is the norm

|x|_1 ≡ ‖λ(x)‖_1

on S^n, λ(x) being the vector of eigenvalues of x.
Note that in fact the simplex setup is a particular case of the spectahedron one, corre-
sponding to the case when X is a set of diagonal positive semidefinite matrices.

One can verify that for these setups, ω(·) indeed is continuously differentiable on X and satisfies
(5.3.1) with κ = O(1) (in fact, κ = 1 for the ball setup, κ = (1 + δ)^{-1} for the simplex setup, and
κ = 0.5(1 + δ)^{-1} for the spectahedron setup; see Appendix to Lecture 5).
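For the simplex setup, the cheapness of the prox-step (5.3.2) is what makes an iteration of BM affordable. As an illustrative sketch, the code below uses the plain (δ → 0) entropy in place of the regularized entropy (5.3.3); for it, the minimizer over the standard simplex Δ_n has a closed "softmax" form, and the regularization would only perturb this slightly.

```python
import numpy as np

# Prox-step argmin_{x in Delta_n} [ sum_i x_i ln x_i + p^T x ].
# Writing the optimality conditions with a multiplier for sum_i x_i = 1
# gives ln x_i + 1 + p_i + mu = 0, i.e. x_i proportional to exp(-p_i).
# This is a stand-in for the regularized entropy of the lecture.
def prox_entropy(p):
    z = np.exp(-(p - p.min()))     # shift by min(p) for numerical stability
    return z / z.sum()

p = np.array([0.3, -1.0, 2.0])
x = prox_entropy(p)                # one O(n) "cheap iteration" step
assert np.isclose(x.sum(), 1.0) and (x > 0).all()
```

The step costs O(n) operations, in contrast with the O(n²) (or worse) cost per iteration of the polynomial time methods discussed in Section 5.1.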

The generic algorithm BM we are about to describe works as follows.

A. The algorithm generates a sequence of search points, all belonging to X, where the
First Order oracle is called, and at every step builds the following entities:

1. the best value of f found so far, along with the best search point found so far; the latter is
treated as the current approximate solution built by the method;

2. a (valid) lower bound on the optimal value of the problem.

B. The execution is split into subsequent phases. Phase s, s = 1, 2, ..., is associated with a
prox-center c_s ∈ X and a level ℓ_s ∈ R such that

• when starting the phase, we already know f(c_s) and f′(c_s);

• ℓ_s = f_s + λ(f^s − f_s), where

– f^s is the best value of the objective known at the time when the phase starts;
– f_s is the lower bound on f_* we have at our disposal when the phase starts;
– λ ∈ (0, 1) is a parameter of the method.

The prox-center c_1 corresponding to the very first phase can be chosen in X in an arbitrary
fashion. We start the entire process by computing f, f′ at this prox-center, which results in

f^1 = f(c_1),

and set

f_1 = min_{x∈X} [ f(c_1) + (x − c_1)^T f′(c_1) ],

thus getting the initial lower bound on f_*.



C. Now let us describe a particular phase s. Let

ω_s(x) = ω(x) − (x − c_s)^T ∇ω(c_s);

note that (5.3.1) implies that

ω_s(y) ≥ ω_s(x) + (y − x)^T ∇ω_s(x) + (κ/2) ‖y − x‖²  ∀x, y ∈ X,    (5.3.5)

and that c_s is the minimizer of ω_s(·) on X.

The search points x_t = x_{t,s} of the phase s, t = 1, 2, ..., are generated according to the following
rules:

1. When generating x_t, we already have at our disposal x_{t−1} and a convex compact set
X_{t−1} ⊂ X such that

(a_{t−1})  x ∈ X∖X_{t−1} ⇒ f(x) > ℓ_s;
(b_{t−1})  x_{t−1} ∈ argmin_{X_{t−1}} ω_s.    (5.3.6)

Here x_0 = c_s and, say, X_0 = X, which ensures (5.3.6.a_0–b_0).

2. To update (x_{t−1}, X_{t−1}) into (x_t, X_t), we solve the auxiliary problem

min_x { ω_s(x) : x ∈ X_{t−1}, g_{t−1}(x) ≡ f(x_{t−1}) + (x − x_{t−1})^T f′(x_{t−1}) ≤ ℓ_s }.    (P_{t−1})

Our subsequent actions depend on the results of this optimization:

(a) When (P_{t−1}) is infeasible, we terminate the phase and update the lower bound on f_*
as

f_s → f_{s+1} = ℓ_s.

Note that in the case in question ℓ_s is indeed a valid lower bound on f_*: in X∖X_{t−1}
we have f(x) > ℓ_s by (5.3.6.a_{t−1}), while the infeasibility of (P_{t−1}) means that

x ∈ X_{t−1} ⇒ f(x) ≥ g_{t−1}(x) > ℓ_s,

where the inequality f(x) ≥ g_{t−1}(x) is given by the convexity of f.


The prox-center c_{s+1} for the new phase can be chosen in X in an arbitrary fashion.
(b) When (P_{t−1}) is feasible, we get the optimal solution x_t of this problem and compute
f(x_t), f′(x_t). It is possible that

f(x_t) − ℓ_s ≤ θ(f^s − ℓ_s),    (5.3.7)

where θ ∈ (0, 1) is a parameter of the method. In this case, we again terminate the
phase and set

f^{s+1} = f(x_t),  f_{s+1} = f_s.

The prox-center c_{s+1} for the new phase, same as above, can be chosen in X in an
arbitrary fashion.

(c) When (P_{t−1}) is feasible and (5.3.7) does not take place, we continue the phase s,
choosing as X_t an arbitrary convex compact set such that

X̲_t ≡ {x ∈ X_{t−1} : g_{t−1}(x) ≤ ℓ_s} ⊂ X_t ⊂ X̄_t ≡ {x ∈ X : (x − x_t)^T ∇ω_s(x_t) ≥ 0}.    (5.3.8)

Note that we are in the case when (P_{t−1}) is feasible and x_t is the optimal solution to
the problem; it follows that

∅ ≠ X̲_t ⊂ X̄_t,

so that (5.3.8) indeed allows one to choose X_t. Besides this, every choice of X_t compatible
with (5.3.8) ensures (5.3.6.a_t) and (5.3.6.b_t): the first relation is clearly ensured by the
left inclusion in (5.3.8) combined with (5.3.6.a_{t−1}) and the fact that f(x) ≥ g_{t−1}(x),
while the second relation (5.3.6.b_t) follows from the right inclusion in (5.3.8) due to
the convexity of ω_s(·).

Convergence Analysis. Let us define the s-th gap as the quantity

ε_s = f^s − f_s.

By its origin, the gap is nonnegative and nonincreasing in s; besides this, it clearly is an upper
bound on the inaccuracy, in terms of the objective, of the approximate solution z^s we have at
our disposal at the beginning of phase s.
The convergence and the complexity properties of the BM algorithm are given by the fol-
lowing statement.
Theorem 5.3.1 (i) The number N_s of oracle calls at a phase s can be bounded from above as
follows:

N_s ≤ 4 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²),    (5.3.9)

where

Ω = max_{x,y∈X} [ ω(y) − ω(x) − (y − x)^T ∇ω(x) ].

(ii) Consequently, for every ε > 0, the total number of oracle calls before the phase s with
ε_s ≤ ε is started (i.e., before an ε-solution to the problem is built) does not exceed

N(ε) = c(θ, λ) Ω L²_{‖·‖}(f) / (κ ε²),    (5.3.10)

with an appropriate c(θ, λ) depending solely and continuously on θ, λ ∈ (0, 1).
Proof. (i): Assume that phase s did not terminate in the course of N steps. Observe that then

‖x_t − x_{t−1}‖ ≥ θ(1 − λ)ε_s / L_{‖·‖}(f),  1 ≤ t ≤ N.    (5.3.11)

Indeed, we have g_{t−1}(x_t) ≤ ℓ_s by construction of x_t and g_{t−1}(x_{t−1}) = f(x_{t−1}) > ℓ_s + θ(f^s − ℓ_s), since
otherwise the phase would be terminated at the step t − 1. It follows that g_{t−1}(x_{t−1}) − g_{t−1}(x_t) >
θ(f^s − ℓ_s) = θ(1 − λ)ε_s. Taking into account that g_{t−1}(·), due to its origin, is Lipschitz continuous w.r.t.
‖·‖ with constant L_{‖·‖}(f), (5.3.11) follows.
Now observe that x_{t−1} is the minimizer of ω_s on X_{t−1} by (5.3.6.b_{t−1}), and the latter set, by con-
struction, contains x_t, whence (x_t − x_{t−1})^T ∇ω_s(x_{t−1}) ≥ 0. Applying (5.3.5), we get

ω_s(x_t) ≥ ω_s(x_{t−1}) + (κ/2) ( θ(1 − λ)ε_s / L_{‖·‖}(f) )²,  t = 1, ..., N,

whence

ω_s(x_N) − ω_s(x_0) ≥ (Nκ/2) ( θ(1 − λ)ε_s / L_{‖·‖}(f) )².

The latter relation, due to the evident inequality max_X ω_s − min_X ω_s ≤ Ω (readily given by the definition
of Ω and ω_s), implies that

N ≤ 2 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²).

Recalling the origin of N, we conclude that

N_s ≤ 2 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²) + 1.

All we need in order to derive from this inequality the required relation (5.3.9) is to demonstrate that

2 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²) ≥ 1,    (5.3.12)

which is immediate: let R = max_{x∈X} ‖x − c_1‖ = ‖x̄ − c_1‖, where x̄ ∈ X. We have ε_s ≤ ε_1 = f(c_1) − min_{x∈X} [f(c_1) +
(x − c_1)^T f′(c_1)] ≤ R L_{‖·‖}(f), where the concluding inequality is given by the fact that ‖f′(c_1)‖_* ≤ L_{‖·‖}(f).
On the other hand, by the definition of Ω we have

Ω ≥ ω(x̄) − [ω(c_1) + (x̄ − c_1)^T ∇ω(c_1)] ≥ (κ/2) ‖x̄ − c_1‖² = κR²/2

(we have used (5.3.1)). Thus, Ω ≥ κR²/2 and ε_s ≤ R L_{‖·‖}(f), and (5.3.12) follows. (i) is proved.
(ii): Assume that ε_s > ε at the phases s = 1, 2, ..., S, and let us bound from above the total number
of oracle calls at these S phases. Observe, first, that two subsequent gaps ε_s, ε_{s+1} are linked by the
relation

ε_{s+1} ≤ γ(θ, λ) ε_s,  γ = max[1 − λ, 1 − (1 − θ)(1 − λ)] < 1.    (5.3.13)

Indeed, it is possible that the phase s was terminated according to the rule 2(a); in this case

ε_{s+1} = f^{s+1} − ℓ_s ≤ f^s − ℓ_s = (1 − λ)ε_s,

as required in (5.3.13). The only other possibility is that the phase s was terminated when the relation
(5.3.7) took place. In this case,

ε_{s+1} = f^{s+1} − f_{s+1} ≤ ℓ_s + θ(f^s − ℓ_s) − f_s = λε_s + θ(1 − λ)ε_s = (1 − (1 − θ)(1 − λ))ε_s,

and we again arrive at (5.3.13).

From (5.3.13) it follows that ε_s ≥ ε γ^{s−S}, s = 1, ..., S, since ε_S > ε by the origin of S. We now have

Σ_{s=1}^S N_s ≤ Σ_{s=1}^S 4 Ω L²_{‖·‖}(f) / (θ²(1−λ)²κε_s²)
            ≤ Σ_{s=1}^S 4 Ω L²_{‖·‖}(f) γ^{2(S−s)} / (θ²(1−λ)²κε²)
            ≤ (4 Ω L²_{‖·‖}(f) / (θ²(1−λ)²κε²)) Σ_{t=0}^∞ γ^{2t}
            = c(θ, λ) Ω L²_{‖·‖}(f) / (κε²),  c(θ, λ) ≡ 4 / (θ²(1 − λ)²(1 − γ²)),

and (5.3.10) follows.
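For concreteness, the constant produced by the proof, c(θ, λ) = 4/(θ²(1 − λ)²(1 − γ²)) with γ = max[1 − λ, 1 − (1 − θ)(1 − λ)], can be evaluated numerically; the sample parameter values below are illustrative.

```python
# Evaluating the constant from the proof of Theorem 5.3.1:
# c(theta, lam) = 4 / (theta^2 (1-lam)^2 (1-gamma^2)),
# gamma = max(1-lam, 1-(1-theta)(1-lam)).
def bm_constant(theta, lam):
    gamma = max(1 - lam, 1 - (1 - theta) * (1 - lam))
    return 4 / (theta**2 * (1 - lam) ** 2 * (1 - gamma**2))

# The "balanced" choice theta = lam = 1/2 gives gamma = 3/4 and
# c = 4 / ((1/4)(1/4)(7/16)) = 1024/7.
print(bm_constant(0.5, 0.5))   # -> 146.2857...
```

The constant blows up as θ or λ approach 0 or 1, which is why moderate values of both parameters are the natural choice.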


Now let us look at what our complexity analysis says in the case of the standard setups.

Ball setup and optimization over the ball. As we remember, in the case of the ball setup
one has κ = 1. Besides this, it is immediately seen that in this case

Ω = (1/2) D²_{‖·‖_2}(X),

where D_{‖·‖_2}(X) = max_{x,y∈X} ‖x − y‖_2 is the ‖·‖_2-diameter of X. Thus, (5.3.10) becomes

N(ε) ≤ c(θ, λ) D²_{‖·‖_2}(X) L²_{‖·‖_2}(f) / (2ε²).    (5.3.14)
On the other hand, let L > 0, and let P_{‖·‖_2,L}(X) be the family of all convex problems (CP) with
convex objectives which are Lipschitz continuous, with constant L, w.r.t. ‖·‖_2. It is known that if X is an
n-dimensional Euclidean ball and n ≥ D²_{‖·‖_2}(X) L² / ε², then the information-based complexity of the
family P_{‖·‖_2,L}(X) is at least O(1) D²_{‖·‖_2}(X) L² / ε² (cf. (5.2.4)). Comparing this result with (5.3.14),
we conclude that

If X is an n-dimensional Euclidean ball, then the complexity of the family P_{‖·‖_2,L}(X)
w.r.t. the BM algorithm with the ball setup in the “large-scale case” n ≥ D²_{‖·‖_2}(X) L² / ε²
coincides (within a factor depending solely on θ, λ) with the information-based com-
plexity of the family.

Simplex setup and minimization over the simplex. In the case of the simplex setup one
has κ = (1 + δ)^{-1}, where δ ∈ (0, 1) is the regularization parameter for the entropy. Besides this,
one has

Ω ≤ (1 + δ) ( 1 + ln( n(1 + δ)/δ ) )    (5.3.15)

(see the computation in Appendix to Lecture 5).
We see that for the simplex setup, Ω is of the order of ln n, provided that δ is not astronomically
small. E.g., when δ = 1.e-16 is the “machine zero” (so that for all computational purposes, our
regularized entropy is, essentially, the same as the usual entropy), we have Ω ≤ 37 + ln n, whence
Ω ≤ 6 ln n provided that n ≥ 1000.
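The bound (5.3.15) can be spot-checked numerically: for random pairs x, y in the simplex, the quantity ω(y) − ω(x) − (y − x)ᵀ∇ω(x) entering the definition of Ω should never exceed the right-hand side. The sample sizes and the choice of random points below are illustrative.

```python
import numpy as np

# Spot-check of (5.3.15) for the regularized entropy
# omega(x) = sum_i (x_i + delta/n) ln(x_i + delta/n).
rng = np.random.default_rng(4)
n, delta = 100, 1e-6

def omega(x):
    z = x + delta / n
    return np.sum(z * np.log(z))

def grad_omega(x):
    return np.log(x + delta / n) + 1.0

bound = (1 + delta) * (1 + np.log(n * (1 + delta) / delta))
worst = 0.0
for _ in range(500):
    # spiky Dirichlet samples stress the bound near the simplex vertices
    x = rng.dirichlet(np.full(n, 0.1))
    y = rng.dirichlet(np.full(n, 0.1))
    gap = omega(y) - omega(x) - (y - x) @ grad_omega(x)
    worst = max(worst, gap)
assert worst <= bound
```

With δ = 10⁻⁶ and n = 100 the bound is about 1 + ln 10⁸ ≈ 19.4, and the extreme pairs (near opposite vertices) come close to it, which matches the claim that the estimate cannot be improved by more than a constant factor.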
With our bounds for κ and Ω, the complexity bound (5.3.10) becomes

N(ε) ≤ c(θ, λ) L²_{‖·‖_1}(f) ln n / ε²    (5.3.16)

(provided that δ ≥ 1.e-16). On the other hand, let L > 0, and let P_{‖·‖_1,L}(X) be the family of all
convex problems (CP) with convex objectives which are Lipschitz continuous, with constant L, w.r.t. ‖·‖_1.
It is known that if X is the n-dimensional simplex Δ_n (or the full-dimensional simplex Δ_n^+) and
n ≥ L²/ε², then the information-based complexity of the family P_{‖·‖_1,L}(X) is at least O(1) L²/ε² (cf.
(5.2.5)). Comparing this result with (5.3.16), we conclude that

If X is the n-dimensional simplex Δ_n (or the full-dimensional simplex Δ_n^+), then the
complexity of the family P_{‖·‖_1,L}(X) w.r.t. the BM algorithm with the simplex setup
in the “large-scale case” n ≥ L²/ε² coincides, within a factor of the order of ln n, with the
information-based complexity of the family.

Spectahedron setup and large-scale semidefinite optimization. All the conclusions we
have made when speaking about the case of the simplex setup and X = Δ_n (or X = Δ_n^+)
remain valid in the case of the spectahedron setup and X defined as the set of all block-diagonal
matrices of a given block-diagonal structure contained in Σ_n = {x ∈ S^n : x ⪰ 0, Tr(x) = 1} (or
contained in Σ_n^+).
We see that with every one of our standard setups, the BM algorithm under appropriate con-
ditions possesses a dimension-independent (or nearly dimension-independent) complexity bound
and, moreover, is nearly optimal in the sense of Information-based complexity theory, provided
that the dimension is large.

Why the standard setups? “The contribution” of ω(·) to the performance estimate (5.3.10)
is the factor Θ = Ω/κ; the smaller it is, the better. In principle, given X and ‖·‖, we could play with
ω(·) to minimize Θ. The standard setups are given by a kind of such optimization for the cases
when X is the ball and ‖·‖ = ‖·‖_2 (“the ball case”), when X is the simplex and ‖·‖ = ‖·‖_1
(“the simplex case”), and when X is the spectahedron and ‖·‖ = |·|_1 (“the spectahedron
case”), respectively. We did not try to solve the arising variational problems exactly; however,
it can be proved that in all three cases the value of Θ we have reached (i.e., O(1) in the ball
case and O(ln n) in the simplex and the spectahedron cases) cannot be reduced by more than
an absolute constant factor. Note that in the simplex case the (regularized) entropy is not the
only reasonable choice; similar complexity results can be obtained for, say, ω(x) = Σ_i x_i^{p(n)} or
ω(x) = ‖x‖²_{p(n)} with p(n) = 1 + O(1/ln n).

5.4 Implementation issues


Solving the auxiliary problems (P_t). As far as implementation of the BM algorithm is con-
cerned, the major issue is how to solve the auxiliary problem (P_t) efficiently. Formally, this
problem is of the same design dimension as the problem of interest; what do we gain when
reducing the solution of a single large-scale problem (CP) to a long series of auxiliary problems
of the same dimension? To answer this crucial question, observe first that we have control over
the complexity of the domain X_t which, up to a single linear constraint, is the feasible domain
of (P_t). Indeed, assume that X_{t−1} is a part of X given by a finite list of linear inequalities.
Then both sets X̲_t and X̄_t in (5.3.8) are also cut off X by finitely many linear inequalities, so
that we may enforce X_t to be cut off X by finitely many linear inequalities as well. Moreover,
we have full control of the number of inequalities in the list. For example,

A. Setting all the time Xt = X̄t, we ensure that Xt is cut off X by a single linear inequality;

B. Setting all the time Xt = X̲t, we ensure that Xt is cut off X by t linear inequalities (so that the larger t is, the “more complicated” the description of Xt is);

C. We can choose something in-between these extremes. For example, assume that we have chosen a certain m and are ready to work with Xt's cut off X by at most m linear inequalities. In this case, we could use policy B at the initial steps of a phase, until the number of linear inequalities in the description of Xt−1 reaches the maximum allowed value m. At the step t, we are supposed to choose Xt in-between the two sets

    X̲t = {x ∈ X : h1(x) ≤ 0, ..., hm(x) ≤ 0, hm+1(x) ≤ 0},   X̄t = {x ∈ X : hm+2(x) ≤ 0},


200 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS

where

– the linear inequalities h1(x) ≤ 0, ..., hm(x) ≤ 0 cut Xt−1 off X;

– the linear inequality hm+1(x) ≤ 0 is, in our old notation, the inequality gt−1(x) ≤ ℓs;

– the linear inequality hm+2(x) ≤ 0 is, in our old notation, the inequality (x − xt)ᵀ∇ωs(xt) ≥ 0.

Now we can form a list of m linear inequalities as follows:


• we build m − 1 linear inequalities ei (x) ≤ 0 by aggregating the m + 1 inequalities
hj (x) ≤ 0, j = 1, ..., m + 1, so that every one of the e-inequalities is a convex combination
of the h-ones (the coefficients of these convex combinations can be whatever we want);
• we set Xt = {x ∈ X : ei (x) ≤ 0, i = 1, ..., m − 1, hm+2 (x) ≤ 0}.
It is immediately seen that with this approach we ensure (5.3.8), on the one hand, and that Xt is cut off X by at most m inequalities, on the other. And of course we can proceed in the same fashion at the subsequent steps.
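For illustration, here is a minimal sketch of the aggregation step in policy C, with cuts stored as rows of a matrix; the particular convex weights used (averaging the three oldest cuts, keeping the rest untouched) are my own arbitrary choice, not one prescribed by the text:

```python
import numpy as np

def compress_cuts(A, b):
    """Given m+1 linear cuts h_j(x) = A[j] @ x + b[j] <= 0, return m-1 cuts,
    each a convex combination of the originals: here the three oldest cuts
    are averaged into one, and the remaining m-2 are kept as they are.
    Any point satisfying all original cuts satisfies the compressed ones."""
    merged_A = A[:3].mean(axis=0)          # uniform convex combination
    merged_b = b[:3].mean()
    return (np.vstack([merged_A[None, :], A[3:]]),
            np.concatenate([[merged_b], b[3:]]))
```

With m = 4, five cuts go in and three come out; validity is preserved simply because a convex combination of valid linear inequalities is again a valid linear inequality.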

The bottom line is: we always can ensure that Xt−1 is cut off X by at most m linear inequalities hj(x) ≤ 0, j = 1, ..., m, where m ≥ 1 is (any) desired bound. Consequently, we may assume that the feasible set of (Pt−1) is cut off X by m + 1 linear inequalities hj(x) ≤ 0, j = 1, ..., m + 1. The crucial point is that with this approach, we can reduce (Pt−1) to a convex program with at most m + 1 decision variables. Indeed, assuming (Pt−1) strictly feasible:

    ∃ x̄ ∈ rint X : hj(x̄) < 0, j = 1, ..., m + 1,
and applying the standard Lagrange Duality, we see that the optimal value in (Pt−1) is equal to the one in the Lagrange dual problem

    max_{λ≥0} L(λ),   L(λ) = min_{x∈X} [ωs(x) + Σ_{j=1}^{m+1} λj hj(x)].   (Dt−1)

Now note that although (Dt−1) possesses absolutely no structure, its objective is concave (as a minimum of linear functions of λ) and “computable” at every given λ. Indeed, to compute the value L(λ) and a supergradient L′(λ) of L at a given point is the same as to find the optimal solution xλ of the optimization program

    min_{x∈X} [ωs(x) + Σ_{j=1}^{m+1} λj hj(x)];   (D[λ])

after xλ is found, we can set

    L(λ) = ωs(xλ) + Σ_{j=1}^{m+1} λj hj(xλ),   L′(λ) = (h1(xλ), ..., hm+1(xλ))ᵀ.
It remains to note that to solve (D[λ]) means to minimize over X a sum of ω(·) and a linear
function, and we have assumed that X is simple enough for problems of this type to be rapidly
solved.
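To make the duality recipe concrete, here is a sketch of the First Order oracle for (Dt−1) in the simplest situation: the ball setup with ωs(x) = ½‖x − c‖₂², X the unit Euclidean ball, and linear cuts hj(x) = aⱼᵀx + bⱼ. The names c, A, b are illustrative, not from the text; one oracle call costs a single closed-form minimization over X:

```python
import numpy as np

def x_lambda(lam, c, A, b):
    """argmin over the unit ball of 0.5*||x - c||^2 + sum_j lam_j*(a_j^T x + b_j).
    The unconstrained minimizer is c - A^T lam; minimizing over the ball
    amounts to radially projecting that point onto the ball."""
    y = c - A.T @ lam
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

def dual_oracle(lam, c, A, b):
    """Return L(lam) and the supergradient L'(lam) = (h_1(x_lam), ..., h_{m+1}(x_lam))."""
    x = x_lambda(lam, c, A, b)
    h = A @ x + b
    return 0.5 * np.dot(x - c, x - c) + lam @ h, h
```

(Dt−1) can then be maximized over λ ≥ 0 by any method for problems with m + 1 variables (cf. Theorem 4.1.2), each oracle call being cheap.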
The summary of our observations is that (Dt−1) is a convex optimization program with m + 1 decision variables, and we have at our disposal a First Order oracle for this problem, so that we
can solve it efficiently, provided that m is not too big (cf. Theorem 4.1.2). And we indeed can enforce the latter requirement – m is under our full control!
After (Dt−1) is solved to high accuracy and we have at our disposal the corresponding maximizer λ∗, we can choose, as xt, the point xλ∗: the Lagrange Duality theorem says that the optimal solution xt of (Pt−1) is among the optimal solutions to (D[λ∗]), and the set of optimal solutions to the latter problem is a singleton (since ωs is strongly convex).
It should be mentioned that the outlined construction works when (Pt−1) is strictly feasible; what to do if this is not the case? To overcome the difficulty in our context, it suffices to solve first the auxiliary problem

    min_x { gt−1(x) − ℓs : x ∈ Xt−1 ≡ {x ∈ X : hj(x) ≤ 0, j = 1, ..., m} }.   (P′t−1)

If the optimal value in this problem is negative, then (Pt−1) is strictly feasible, and we know how to proceed. If the optimal value is nonnegative, then, recalling the origin of Xt−1, we can conclude that ℓs is a valid lower bound on f∗, and there is no necessity to solve (Pt−1) at all – we should terminate the phase and update the lower bound on f∗ according to fs → fs+1 = ℓs. It remains to note that the auxiliary problem can be solved by the same Lagrange Duality trick as (Pt−1), and that it definitely is strictly feasible (induction in t!), so that here the duality does work1).

Remark 5.4.1 Note that the optimal value ht−1 of the above auxiliary problem can be used to update the current lower bound on f∗: from (5.3.6.at−1) combined with the fact that gt−1(x) ≤ f(x) it follows that f∗ ≥ min[ℓs, ℓs + ht−1].

When are the standard setups implementable? As we have seen, the possibility to implement the BM algorithm requires an ability to solve rapidly optimization problems of the form (5.3.2). Let us look at several important cases when this indeed is possible.

Ball setup. Here problem (5.3.2) becomes

    min_{x∈X} [ ½ xᵀx − pᵀx ],

or, which is the same,

    min_{x∈X} ½ ‖x − p‖₂².

We see that to solve (5.3.2) is the same as to project on X – to find the point in X which is as close as possible, in the usual ‖·‖₂-norm, to a given point p. This problem is easy to solve for several simple solids X, e.g.,

• a ball {x : ‖x − a‖₂ ≤ r},

• a box {x : a ≤ x ≤ b},
1) For a careful reader it should be mentioned that as far as duality is concerned, the situation with the auxiliary problem is not completely similar to the one with (Pt−1): the objective of the former problem is not strongly convex, so that there might be a difficulty with passing from the optimal solution of its dual to the optimal solution of the auxiliary problem itself. Note, however, that we do not need this optimal solution at all – we need only the optimal value...
• the simplex ∆n = {x : x ≥ 0, Σ_i xi = 1}.
In the first two cases, it takes O(n) operations to compute the solution – it is given by evident explicit formulas. In the third case, to project is a bit more involved: one can easily demonstrate that the projection is given by the relations xi = xi(λ∗), where xi(λ) = max[0, pi − λ] and λ∗ is the unique root of the equation

    Σ_i xi(λ) = 1.

The left hand side of this equation is nonincreasing and continuous in λ and, as is immediately seen, its value varies from something ≥ 1 when λ = max_i pi − 1 to 0 when λ = max_i pi. It follows that one can easily approximate λ∗ by Bisection, and that it takes a moderate absolute constant of bisection steps to compute λ∗ (and thus the projection) within machine precision. The arithmetic cost of a bisection step clearly is O(n), and the overall arithmetic complexity of finding the projection becomes O(n).
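In code, the three projections might look as follows (a sketch; function names are mine, and the simplex case uses exactly the bisection just described):

```python
import numpy as np

def project_ball(p, a, r):
    """Projection of p onto {x : ||x - a||_2 <= r}."""
    d = p - a
    nd = np.linalg.norm(d)
    return p if nd <= r else a + (r / nd) * d

def project_box(p, a, b):
    """Projection of p onto {x : a <= x <= b} (componentwise clipping)."""
    return np.clip(p, a, b)

def project_simplex(p, tol=1e-12):
    """Projection of p onto the standard simplex: x_i = max(0, p_i - lam*),
    with lam* the root of sum_i max(0, p_i - lam) = 1, found by bisection
    on [max_i p_i - 1, max_i p_i]."""
    lo, hi = p.max() - 1.0, p.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.maximum(p - mid, 0.0).sum() >= 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(p - 0.5 * (lo + hi), 0.0)
```

Each bisection step costs O(n) and halves the bracketing interval, so a fixed number of steps reaches machine precision – in accordance with the O(n) total cost stated above.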

Simplex setup. Let us restrict ourselves to the two simplest cases:
S.A: X is the standard simplex ∆n;
S.B: X is the standard full-dimensional simplex ∆+n.
Case S.A. When X = ∆n, problem (5.3.2) becomes

    min { Σ_i (xi + σ) ln(xi + σ) − pᵀx : x ≥ 0, Σ_i xi = 1 }   [σ = δ/n]   (5.4.1)
The solution of this optimization problem, as is immediately seen, is given by xi = xi(λ∗), where

    xi(λ) = max[exp{p̄i − λ} − σ, 0]   [p̄i = pi − max_j pj]   (5.4.2)

and λ∗ is the solution of the equation

    Σ_i xi(λ) = 1.
Here again the left hand side of the equation is nonincreasing and continuous in λ and, as is immediately seen, its value varies from something ≥ 1 when λ = −σ to something < 1 when λ = ln n, so that we again can compute λ∗ (and thus x(λ∗)) within machine precision, in a realistic range of values of n, in a moderate absolute constant of bisection steps. As a result, the arithmetic cost of solving (5.4.1) is again O(n).
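A sketch of this bisection, with σ = δ/n and the bracketing interval [−σ, ln n] as in the text:

```python
import numpy as np

def entropy_prox_simplex(p, delta=1e-16, tol=1e-12):
    """Solve (5.4.1) by bisection: minimize
    sum_i (x_i + sigma) ln(x_i + sigma) - p^T x over the standard simplex,
    sigma = delta/n.  By (5.4.2), x_i(lam) = max(exp(pbar_i - lam) - sigma, 0)
    with pbar_i = p_i - max_j p_j, and lam* is the root of
    sum_i x_i(lam) = 1 on [-sigma, ln n]."""
    n = p.size
    sigma = delta / n
    pbar = p - p.max()
    x_of = lambda lam: np.maximum(np.exp(pbar - lam) - sigma, 0.0)
    lo, hi = -sigma, np.log(n)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if x_of(mid).sum() >= 1.0:
            lo = mid
        else:
            hi = mid
    return x_of(0.5 * (lo + hi))
```

The per-step cost is the O(n) evaluation of the sum, so the total cost is O(n) up to the fixed number of bisection steps.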
Note that “numerically speaking”, we should not bother about Bisection at all. Indeed, let us set δ to something really small, say, δ = 1.e-16. Then σ = δ/n ≪ 1.e-16, while (at least some of) the xi(λ∗) should be of order of 1/n (since their sum should be 1). It follows that in actual (i.e., finite precision) computations, the quantity σ in (5.4.2) is negligible. Omitting σ in (5.4.1) (i.e., replacing in (5.3.2) the regularized entropy by the usual one), we can explicitly write down the solution x∗ of (5.4.1):

    xi = exp{p̄i} / Σ_j exp{p̄j},   i = 1, ..., n.
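In this σ = 0 form the solution is just a max-shifted softmax: for the linear term −pᵀx of (5.4.1), shifting p by max_j pj does not change the minimizer over ∆n but keeps the exponentials bounded in finite-precision arithmetic, which is exactly the role of p̄ in (5.4.2):

```python
import numpy as np

def entropy_prox_softmax(p):
    """Minimizer of sum_i x_i ln x_i - p^T x over the standard simplex:
    x_i = exp(pbar_i) / sum_j exp(pbar_j), pbar_i = p_i - max_j p_j."""
    z = np.exp(p - p.max())
    return z / z.sum()
```

A quick sanity check on the optimality condition: at the minimizer, ln xi + 1 − pi must be the same for all i (it equals the multiplier of the constraint Σ_i xi = 1).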
Case S.B. The case of X = ∆+n is very close to the one of X = ∆n. The only difference is that now we first should check whether

    Σ_i max[exp{pi − 1} − δ/n, 0] ≤ 1;

if it is the case, then the optimal solution of (5.3.2) is given by

    xi = max[exp{pi − 1} − δ/n, 0],   i = 1, ..., n;

otherwise the optimal solution of (5.3.2) is exactly the optimal solution of (5.4.1).
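A sketch of this check-then-fall-back recipe for minimizing Σ_i xi ln xi − pᵀx over {x ≥ 0, Σ_i xi ≤ 1} (again with σ dropped; the preliminary test on max_i pi is my own overflow guard, not from the text):

```python
import numpy as np

def entropy_prox_full_simplex(p):
    """Minimizer of sum_i x_i ln x_i - p^T x over {x >= 0, sum_i x_i <= 1}
    (the sigma = 0 simplification).  The minimizer over x >= 0 alone is
    x_i = exp(p_i - 1); it is the answer iff its entries sum to <= 1,
    otherwise the constraint sum_i x_i = 1 is active and the solution is
    that of case S.A (the softmax of p)."""
    if p.max() <= 1.0:                 # else exp(p_i - 1) > 1 for some i
        y = np.exp(p - 1.0)
        if y.sum() <= 1.0:
            return y
    z = np.exp(p - p.max())            # fall back to the Delta_n case
    return z / z.sum()
```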

Spectahedron setup. Consider the case of the spectahedron setup, and assume that either
Sp.A: X is comprised of all block-diagonal matrices, of a given block-diagonal structure, belonging to Σn,
or
Sp.B: X is comprised of all block-diagonal matrices, of a given block-diagonal structure, belonging to Σ+n.
Case Sp.A. Here problem (5.3.2) becomes

    min_{x∈X} { Tr((x + σIn) ln(x + σIn)) + Tr(px) }   [σ = δ/n]

We lose nothing by assuming that p is a symmetric block-diagonal matrix of the same block-diagonal structure as that of the matrices from X. Let p = UπUᵀ be the eigenvalue decomposition of p, with orthogonal U and diagonal π of the same block-diagonal structure as that of p. Passing from x to the new matrix variable ξ according to x = UξUᵀ, we convert our problem to the problem

    min_{ξ∈X} { Tr((ξ + σIn) ln(ξ + σIn)) + Tr(πξ) }   (5.4.3)

We claim that the unique (due to the strong convexity of ω) optimal solution ξ∗ of the latter problem is a diagonal matrix. Indeed, for every diagonal matrix D with diagonal entries ±1 and every feasible solution ξ of our problem, the matrix DξD clearly is again a feasible solution with the same value of the objective (recall that π is diagonal). It follows that the optimal set {ξ∗} of our problem should be invariant w.r.t. the transformations ξ ↦ DξD, which is possible if and only if ξ∗ is a diagonal matrix. Thus, when solving (5.4.3), we may from the very beginning restrict ourselves to diagonal ξ, and with this restriction the problem becomes

    min_{ξ∈Rⁿ} { Σ_i (ξi + σ) ln(ξi + σ) + πᵀξ : ξ ≥ 0, Σ_i ξi = 1 },   (5.4.4)

which is exactly the problem we have considered in the case of the simplex setup and X = ∆n (up to the substitution p = −π in the notation of (5.4.1)). We see that the only elaboration in the case of the spectahedron setup as compared to the simplex one is the necessity to find the eigenvalue decomposition of p. The latter task is easy, provided that the diagonal blocks of the matrices in question are of small sizes. Note that this favourable situation does occur in several important applications, e.g., in Structural Design.
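The reduction just described – diagonalize, solve the diagonal simplex-type problem, rotate back – can be sketched as follows for the σ = 0 simplification of case Sp.A, with numpy's eigh playing the role of the decomposition p = UπUᵀ:

```python
import numpy as np

def matrix_entropy_prox(p):
    """Minimizer of Tr(x ln x) + Tr(p x) over {x symmetric psd, Tr x = 1}
    (sigma = 0), for symmetric p.  Diagonalize p = U diag(pi) U^T; the
    diagonal problem min { sum_i xi_i ln xi_i + pi^T xi } over the simplex
    has the softmax of -pi as its solution; rotate it back by U."""
    pi, U = np.linalg.eigh(p)
    z = np.exp(-(pi - pi.min()))       # numerically stable softmax of -pi
    xi = z / z.sum()
    return (U * xi) @ U.T              # U diag(xi) U^T
```

The cost is dominated by the eigenvalue decomposition, which, as noted above, is cheap when the diagonal blocks are small (each block can be diagonalized separately).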
Case Sp.B. This case is completely similar to the previous one; the only difference is that the role of (5.4.3) is now played by the problem

    min_{ξ∈Rⁿ} { Σ_i (ξi + σ) ln(ξi + σ) + πᵀξ : ξ ≥ 0, Σ_i ξi ≤ 1 },

which we have already considered when speaking about the simplex setup.

Updating prox-centers. The complexity results stated in Theorem 5.3.1 are absolutely independent of how we update the prox-centers, so that in this respect we are, in principle, completely free. Common sense, however, says that the most natural policy here is to use as the prox-center, at every stage, the best (i.e., the one with the smallest value of f) solution among those we have at our disposal at the beginning of the stage.

Accumulating information. The set Xt summarizes, in a sense, all the information on f we have accumulated so far and intend to use in the sequel. Relation (5.3.8) allows for a tradeoff between the quality (and the volume) of this information and the computational effort required to solve the auxiliary problems (Pt−1). With no restrictions on this effort, the most promising policy for updating the Xt's would be to set Xt = X̲t, i.e., to keep all the cuts accumulated so far (“collecting information with no compression of it”). With this policy the BM algorithm with the ball setup is basically identical to the Prox-Level algorithm [9] from the famous family of bundle methods for nonsmooth convex optimization. Aside from theoretical complexity bounds similar to (5.3.14), most bundle methods (in particular, the Prox-Level one) share the following experimentally observed property: the practical performance of the algorithm is in full accordance with the complexity bound (5.2.2) – every n steps reduce the inaccuracy by at least an absolute constant factor (something like 3). This property is very attractive in moderate dimensions, where we indeed are capable of carrying out several times the dimension number of steps.

5.5 Illustration: PET Image Reconstruction problem


To get an impression of the practical performance of the BM method, let us look at preliminary
numerical results related to the 2D version of the PET Image Reconstruction problem.

The model. We process simulated measurements as if they were registered by a ring of 360 detectors, the inner radius of the ring being 1 (Fig. 5.1). The field of view is a concentric circle of radius 0.9, and it is covered by a 129×129 rectangular grid. The grid partitions the field of view into 10,471 pixels, and we act as if the tracer's density were constant in every pixel. Thus, the design dimension of the problem (PET) we are interested in solving is “just” n = 10471.
The number of bins (i.e., the number m of log-terms in the objective of (PET)) is 39784, while the number of nonzeros among the qij is 3,746,832.
The true image is “black and white” (the density in every pixel is either 1 or 0). The
measurement time (which is responsible for the level of noise in the measurements) is mimicked
as follows: we model the measurements according to the Poisson model as if during the period
of measurements the expected number of positrons emitted by a single pixel with unit density
was a given number M .
Figure 5.1. Ring with 360 detectors, field of view and a line of response

The algorithm we use to solve (PET) is the plain BM method with the simplex setup and with the sets Xt cut off X = ∆n by just one linear inequality:

    Xt = {x ∈ ∆n : (x − xt)ᵀ∇ωs(xt) ≥ 0}.

The parameters λ, θ of the algorithm were chosen as λ = 0.95, θ = 0.5.

The approximate solution reported by the algorithm at a step is the best search point found so far (the one with the best value of the objective seen up to the moment).

The results of two sample runs we are about to present are not that bad.

Experiment 1: Noiseless measurements. The evolution of the best, in terms of the objective, solutions xt found in the course of the first t calls to the oracle is displayed at Fig. 5.2 (in the pictures, brighter areas correspond to higher density of the tracer). The numbers are as follows. With noiseless measurements, we know in advance the optimal value in (PET) – it is easily seen that without noise, the true image (which in our simulated experiment we do know) is an optimal solution. In our problem, this optimal value equals 2.8167; the best value of the objective found in 111 oracle calls is 2.8171 (optimality gap 4.e-4). The progress in accuracy is plotted on Fig. 5.3. We built a total of 111 search points, and the entire computation took 18′51″ on a 350 MHz Pentium II laptop with 96 MB RAM.

Experiment 2: Noisy measurements (40 LORs per pixel with unit density, 63,092 LORs registered in total). The pictures are presented at Fig. 5.4. Here are the numbers. With noisy measurements, we have no a priori knowledge of the true optimal value in (PET); in simulated experiments, a kind of orientation is given by the value of the objective at the true image (which is hopefully an upper bound on f∗ close to f∗). In our experiment, this bound equals -0.8827. The best value of the objective found in 115 oracle calls is -0.8976 (which is less than the objective at the true image; in fact, the algorithm went below the value of f at the true image already after 35 oracle calls). The upper bound on the optimality gap at termination is 9.7e-4. The progress in accuracy is plotted on Fig. 5.5. We built a total of 115 search points; the entire computation took 20′41″.
Figure 5.2. Reconstruction from noiseless measurements. Panels (left to right, top to bottom): true image, 10 “hot spots” (f = 2.817); x1 = n−1(1, ..., 1)ᵀ (f = 3.247); x2, some traces of 8 spots (f = 3.185); x3, traces of 8 spots (f = 3.126); x5, some trace of the 9-th spot (f = 3.016); x8, the 10-th spot still missing (f = 2.869); x24, a trace of the 10-th spot (f = 2.828); x27, all 10 spots in place (f = 2.823); x31, that is it (f = 2.818).
Figure 5.3. Progress in accuracy, noiseless measurements.
solid line: relative gap Gap(t)/Gap(1) vs. step number t; Gap(t) is the difference between the best found so far value f(xt) of f and the current lower bound on f∗. In 111 steps, the gap was reduced by a factor > 1600.
dashed line: progress in accuracy (f(xt) − f∗)/(f(x1) − f∗) vs. step number t. In 111 steps, the accuracy was improved by a factor > 1080.

3D PET. The BM algorithm is pretty young (April 2002), and when writing these Lecture Notes, I have no possibility to test it on actual clinical 3D PET data (since such a test would require a lot of dedicated data-handling routines I have no access to). However, a couple of years ago we participated in a EU project on 3D PET Image Reconstruction; in this project, among other things, we developed a special algorithm – the ‖·‖₁-Mirror Descent method MD1 – for minimizing a convex function over the standard simplex, and used this algorithm to solve the 3D PET problem; the related theory and numerical results can be found in [2]. The Mirror Descent algorithm has much in common with BM and enjoys the same theoretical performance bounds. My feeling is that in practice the BM algorithm should be somewhat better than Mirror Descent, but right now I do not have enough experimental data to support this guess. However, I believe it makes sense to present here some experimental data on the practical performance of MD1, in order to demonstrate that simple optimization techniques indeed have chances when applied to really huge convex programs. I restrict myself to a single numerical example – a real clinical brain study carried out on a very powerful PET scanner. In the corresponding problem (PET), there are n = 2,763,635 design variables (this is the number of voxels in the field of view) and about 25,000,000 log-terms in the objective. The reconstruction was carried out on an INTEL Marlinspike Windows NT Workstation (500 MHz 1Mb Cache INTEL Pentium III Xeon processor, 2GB RAM). A single call to the First Order oracle (a single computation of the value and a subgradient of f) took about 90 min. Pictures of clinically acceptable quality were obtained after just four calls to the oracle (as was the case with other sets of PET data); for research purposes, we carried out 6 additional steps of the algorithm.
The pictures presented on Fig. 5.6 are slices – 2D cross-sections – of the reconstructed 3D image. The two series of pictures shown on Fig. 5.6 correspond to two different versions, MD and OSMD, of MD1 (for details, see [2]). The relevant numbers are presented in Table 5.1. The best
Figure 5.4. Reconstruction from noisy measurements. Panels (left to right, top to bottom): true image, 10 “hot spots” (f = −0.883); x1 = n−1(1, ..., 1)ᵀ (f = −0.452); x2, light traces of 5 spots (f = −0.520); x3, traces of 8 spots (f = −0.585); x5, 8 spots in place (f = −0.707); x8, the 10th spot still missing (f = −0.865); x12, all 10 spots in place (f = −0.872); x35, all 10 spots in place (f = −0.886); x43 (f = −0.896).
Figure 5.5. Progress in accuracy, noisy measurements.
solid line: relative gap Gap(t)/Gap(1) vs. step number t. In 115 steps, the gap was reduced by a factor of 1580.
dashed line: progress in accuracy (f(xt) − f̲)/(f(x1) − f̲) vs. step number t, where f̲ is the last lower bound on f∗ built in the run. In 115 steps, the accuracy was improved by a factor > 460.

step #     MD       OSMD        step #     MD       OSMD
   1     -1.463    -1.463          6     -1.987    -2.015
   2     -1.725    -1.848          7     -1.997    -2.016
   3     -1.867    -2.001          8     -2.008    -2.016
   4     -1.951    -2.015          9     -2.008    -2.016
   5     -1.987    -2.015         10     -2.009    -2.016

Table 5.1. Performance of MD and OSMD in the Brain study

known lower bound on the optimal value in the problem is -2.050; in 10 oracle calls, MD and OSMD decrease the objective from its initial value -1.436 to -2.009 and -2.016, respectively (optimality gaps 4.e-2 and 3.5e-2), reducing the initial inaccuracy in terms of the objective by factors of 15.3 (MD) and 17.5 (OSMD).
Figure 5.6. Brain, near-mid slice of the reconstructions.
[the top-left missing part is the area affected by Alzheimer's disease]
