
Lecture 5

Simple methods for extremely large-scale problems

5.1 Motivation
The polynomial time Interior Point methods, same as all other polynomial time methods for
Convex Programming known so far, share a not-so-pleasant feature: the arithmetic
cost C of an iteration in such a method grows nonlinearly with the design dimension n of the
problem, unless the latter possesses a very favourable structure. E.g., in IP methods, an iteration
requires solving a system of linear equations with (at least) n unknowns. Solving this auxiliary
problem costs at least O(n²) operations (with traditional Linear Algebra, even O(n³)
operations), except for the cases when the matrix of the system is very sparse and, moreover,
possesses a well-structured sparsity pattern. The latter indeed is the case when solving most
LPs of real-world origin, but nearly never is the case for, e.g., SDPs. For other known
polynomial time methods the situation is similar: the arithmetic cost of an iteration, even in
the case of extremely simple objectives and feasible sets, is at least O(n²). With n of the order of
tens and hundreds of thousands, a computational effort of O(n²), to say nothing of O(n³),
operations per iteration becomes prohibitively large – basically, you will never finish the very
first iteration of your method... On the other hand, design dimensions of tens and hundreds
of thousands are exactly what one meets in many applications, like SDP relaxations of combinatorial
problems involving large graphs, or Structural Design (especially for 3D structures). As another
important application of this type, consider the 3D Medical Imaging problem arising in Positron
Emission Tomography.

Positron Emission Tomography (PET) is a powerful, non-invasive, medical diagnostic imaging
technique for measuring the metabolic activity of cells in the human body. It has been in clinical use
since the early 1990s. PET imaging is unique in that it shows the chemical functioning of organs and
tissues, while other imaging techniques – such as X-ray, computerized tomography (CT) and magnetic
resonance imaging (MRI) – show anatomic structures.
A PET scan involves the use of a radioactive tracer – a fluid with a small amount of a radioactive
material which has the property of emitting positrons. When the tracer is administered to a patient,
either by injection or inhalation of gas, it distributes within the body. For a properly chosen tracer, this
distribution “concentrates” in desired locations, e.g., in the areas of high metabolic activity where cancer
tumors can be expected.
The radioactive component of the tracer disintegrates, emitting positrons. Such a positron nearly
immediately annihilates with a nearby electron, giving rise to two photons flying at the speed of light off
the point of annihilation in nearly opposite directions along a line with a completely random orientation
(i.e., uniformly distributed in space). They penetrate the surrounding tissue and are registered outside
the patient by a PET scanner consisting of circular arrays (rings) of gamma radiation detectors. Since
the two gamma rays are emitted simultaneously and travel in almost exactly opposite directions, we can
say a lot about the location of their source: when a pair of opposing detectors register high-energy photons
within a short (∼ 10−8 sec) timing window (“a coincidence event”), we know that the photons came
from a disintegration act, and that the act took place on the line (“line of response” (LOR)) linking the
detectors. The measured data set is the collection of numbers of coincidences counted by different pairs of
detectors (“bins”), and the problem is to recover from these measurements the 3D density of the tracer.
The mathematical model of the process, after appropriate discretization, is

y = P λ + ξ,

where
• λ ≥ 0 is the vector representing the (discretized) density of the tracer; the entries of λ are indexed
by voxels – small cubes into which we partition the field of view, and λj is the mean density of
the tracer in voxel j. Typically, the number n of voxels is in the range from 3×10^5 to 3×10^6,
depending on the resolution of the discretization grid;
• y are the measurements; the entries in y are indexed by bins – pairs of detectors, and yi is the
number of coincidences counted by the i-th pair of detectors. Typically, the dimension m of y – the
total number of bins – is in the millions (at least 3×10^6);
• P is the projection matrix; its entries pij are the probabilities for a LOR originating in voxel j to
be registered by bin i. These probabilities are readily given by the geometry of the scanner;
• ξ is the measurement noise, coming mainly from the fact that all physical processes underlying PET
are random. The standard statistical model for PET is that y_i, i = 1, ..., m, are independent
Poisson random variables with expectations (Pλ)_i.
The problem we are interested in is to recover the tracer's density λ given the measurements y. As far as the
quality of the result is concerned, the most attractive reconstruction scheme is given by the Maximum
Likelihood estimation standard in Statistics: denoting by p(·|λ) the density, w.r.t. a certain dominating
distribution, of the probability distribution of the measurements coming from λ, the estimate of the unknown
true value λ_* of λ is

λ̂ = argmax_{λ≥0} p(y|λ),

where y is the vector of measurements.


For the aforementioned Poisson model of PET, building the Maximum Likelihood estimate is equiv-
alent to solving the optimization problem

min_{λ} { Σ_{j=1}^n λ_j p_j − Σ_{i=1}^m y_i ln( Σ_{j=1}^n λ_j p_ij ) : λ ≥ 0 },  p_j = Σ_i p_ij.    (PET)

This is a nicely structured convex program (by the way, polynomially reducible to CQP and even LP).
The only difficulty – and a severe one – is the huge size of the problem: as was already explained, the
number n of decision variables is at least 300,000, while the number m of log-terms in the objective is in
the range from 3×10^6 to 25×10^6.
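To make the objective of (PET) concrete, here is a small Python sketch of the negative Poisson log-likelihood (constants dropped) on synthetic data; the sizes, the matrix P and the measurements y are hypothetical stand-ins for a real scanner geometry.

```python
import numpy as np

# Toy instance: m bins, n voxels. P, y and the sizes are synthetic
# stand-ins, not data from a real scanner.
rng = np.random.default_rng(0)
m, n = 50, 20
P = rng.random((m, n))            # p_ij >= 0: detection probabilities
lam_true = rng.random(n)          # "true" discretized tracer density
y = rng.poisson(P @ lam_true)     # Poisson counts with means (P lam)_i

def neg_log_likelihood(lam):
    """Objective of (PET): sum_j p_j lam_j - sum_i y_i ln((P lam)_i),
    with p_j = sum_i p_ij (additive constants dropped)."""
    Plam = P @ lam
    return P.sum(axis=0) @ lam - y @ np.log(Plam)

# The function is convex on {lam >= 0}; evaluating it at any positive
# point is an O(mn) operation.
lam0 = rng.random(n) + 0.5
print(neg_log_likelihood(lam0))
```

The instance here is purely illustrative; in a real scanner P is huge and sparse, and the evaluation exploits that sparsity.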

At the present level of our knowledge, a design dimension n of the order of tens and hundreds
of thousands rules out the possibility of solving a nonlinear convex program, even a well-structured
one, by polynomial time methods, because the arithmetic cost of an iteration blows up at least
quadratically in n. When n is really large, all we can use are simple methods with a cost of an
iteration linear in n. As a byproduct of this restriction, we can no longer utilize our knowledge
of the analytic structure of the problem, since all currently known ways of doing so are
too expensive when n is large. As a result, we are forced to restrict ourselves to
black-box-oriented optimization techniques – those which use solely the possibility to compute
the values and the (sub)gradients of the objective and the constraints at a point. In Convex
Optimization, two types of “cheap” black-box-oriented optimization techniques are known:

• techniques for unconstrained minimization of smooth convex functions (Gradient Descent,
Conjugate Gradients, quasi-Newton methods with restricted memory, etc.);

• subgradient-type techniques for nonsmooth convex programs, including constrained ones.

Since the majority of applications are constrained, we restrict our exposition to the techniques
of the second type. We start by investigating what, in principle, can be expected of black-
box-oriented optimization techniques.

5.2 Information-based complexity of Convex Programming


Black-box-oriented methods and Information-based complexity. Consider a Convex
Programming program in the form

min_x { f(x) : x ∈ X },    (CP)

where X is a convex compact set in R^n and the objective f is a continuous convex function
on R^n. Let us fix a family P(X) of convex programs (CP) with X common for all programs
from the family, so that such a program can be identified with the corresponding objective, and
the family itself is nothing but a certain family of convex functions on R^n. We intend to explain
what the Information-based complexity of P(X) is – informally, the complexity of the family w.r.t.
“black-box-oriented” methods. We start by defining such a method as a routine B as follows:

1. When starting to solve (CP), B is given an accuracy ε > 0 to which the problem should
be solved and knows that the problem belongs to a given family P(X). However, B does
not know which particular problem it deals with.

2. In the course of solving the problem, B has access to a First Order oracle for f. This
oracle is capable, given on input a point x ∈ R^n, of reporting on output the value
f(x) and a subgradient f′(x) of f at x.
B generates somehow a sequence of search points x_1, x_2, ... and calls the First Order oracle
to get the values and the subgradients of f at these points. The rules for building x_t
can be arbitrary, except for the fact that they should be causal: x_t can depend only on
the information f(x_1), f′(x_1), ..., f(x_{t−1}), f′(x_{t−1}) on f accumulated by B at the first t − 1
steps.

3. After a certain number T = T_B(f, ε) of calls to the oracle, B terminates and outputs the
result z_B(f, ε). This result again should depend solely on the information on f accumulated
by B at the T search steps, and must be an ε-solution to (CP), i.e.,

z_B(f, ε) ∈ X  and  f(z_B(f, ε)) − min_X f ≤ ε.

We measure the complexity of P(X) w.r.t. a solution method B by the function

Compl_B(ε) = max_{f∈P(X)} T_B(f, ε)

– the minimal number of steps in which B is capable of solving to within accuracy ε every instance
of P(X). Finally, the Information-based complexity of the family P(X) of problems is defined
as

Compl(ε) = min_B Compl_B(ε),

the minimum being taken over all solution methods. Thus, the relation Compl(ε) = N means,
first, that there exists a solution method B capable of solving to within accuracy ε every instance of
P(X) in no more than N calls to the First Order oracle, and, second, that for every solution
method B there exists an instance of P(X) such that B needs at least N steps to solve the instance
to within accuracy ε.
Note that as far as black-box-oriented optimization methods are concerned, the information-
based complexity Compl(ε) of a family P(X) is a lower bound on the “actual” computational effort,
whatever that means, sufficient to find an ε-solution to every instance of the family.

Main results on Information-based complexity of Convex Programming can be sum-
marized as follows. Let X be a solid in R^n (a convex compact set with a nonempty interior),
and let P(X) be the family of all convex functions on R^n normalized by the condition

max_X f − min_X f ≤ 1.    (5.2.1)

For this family,

I. Complexity of finding high-accuracy solutions in fixed dimension is independent of the
geometry of X. Specifically,

∀(ε ≤ ε(X)):  O(1) n ln(2 + 1/ε) ≤ Compl(ε);
∀(ε > 0):    Compl(ε) ≤ O(1) n ln(2 + 1/ε),    (5.2.2)

where

• O(1) are appropriately chosen positive absolute constants,

• ε(X) depends on the geometry of X, but is never less than 1/n², where n is the dimen-
sion of X.

II. Complexity of finding solutions of fixed accuracy in high dimensions does depend on the
geometry of X. Here are 3 typical results:

(a) Let X be an n-dimensional box: X = {x ∈ R^n : ‖x‖_∞ ≤ 1}. Then

ε ≤ 1/2 ⇒ O(1) n ln(1/ε) ≤ Compl(ε) ≤ O(1) n ln(1/ε).    (5.2.3)

(b) Let X be an n-dimensional ball: X = {x ∈ R^n : ‖x‖_2 ≤ 1}. Then

n ≥ 1/ε² ⇒ O(1)/ε² ≤ Compl(ε) ≤ O(1)/ε².    (5.2.4)

(c) Let X be an n-dimensional hyperoctahedron: X = {x ∈ R^n : ‖x‖_1 ≤ 1}. Then

n ≥ 1/ε² ⇒ O(1)/ε² ≤ Compl(ε) ≤ O(ln n)/ε²    (5.2.5)

(in fact, O(1) in the lower bound can be replaced with O(ln n), provided that n ≫ 1/ε²).
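The dimension-independent upper bound in (5.2.4) is attained by the classical projected subgradient method with steps of order 1/√t. Below is a minimal Python sketch on a hypothetical test function f(x) = ‖x − a‖_2 (Lipschitz constant 1 w.r.t. ‖·‖_2); the instance and the step-size rule are illustrative choices, and the point is that the achieved accuracy after N oracle calls behaves like O(1/√N) no matter how large n is.

```python
import numpy as np

# Projected subgradient method on the unit Euclidean ball, illustrating
# the dimension-independent O(1/eps^2) bound (5.2.4). The test function
# f(x) = ||x - a||_2 is a hypothetical example with L = 1 and min f = 0.
rng = np.random.default_rng(1)
n = 10_000                        # a genuinely large design dimension
a = rng.standard_normal(n)
a /= 2 * np.linalg.norm(a)        # ||a||_2 = 1/2: minimizer inside the ball

def f(x):
    return np.linalg.norm(x - a)

def subgrad(x):
    g = x - a
    nrm = np.linalg.norm(g)
    return g / nrm if nrm > 0 else np.zeros_like(g)

def project_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def subgradient_method(N):
    """N steps with step-size ~ 1/sqrt(t); returns the best value found."""
    x = np.zeros(n)
    best = f(x)
    for t in range(1, N + 1):
        x = project_ball(x - subgrad(x) / np.sqrt(t))
        best = min(best, f(x))
    return best

# Each iteration costs O(n); the number of iterations needed for a given
# accuracy does not depend on n at all.
```

Running `subgradient_method(400)` returns a value of order 1/√400 = 0.05, and doubling n leaves this behaviour unchanged.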

Since we are interested in extremely large-scale problems, the moral we can extract from
the outlined results is as follows:
• I is discouraging: it says that we have no hope to guarantee high accuracy, like ε = 10^{-6},
when solving large-scale problems with black-box-oriented methods; indeed, with O(n) steps per
accuracy digit and at least O(n) operations per step (this many operations are required already
to input a search point to the oracle), the arithmetic cost per accuracy digit is at least O(n²),
which is prohibitively large for really large n.
• II is partly discouraging, partly encouraging. The bad news reported by II is that when X is
a box, which is the most typical situation in applications, we have no hope to solve extremely
large-scale problems, in reasonable time, to guaranteed, even low, accuracy, since the required
number of steps should be at least of the order of n. The good news reported by II is that there
exist situations where the complexity of minimizing a convex function to a fixed accuracy is
independent, or nearly independent, of the design dimension. Of course, the dependence of the
complexity bounds in (5.2.4) and (5.2.5) on ε is very bad and has nothing in common with being
polynomial in ln(1/ε); however, this drawback is tolerable when we do not intend to get high
accuracy. Another drawback is that there are not that many applications where the feasible set
is a ball or a hyperoctahedron. Note, however, that in fact we can preserve the upper complexity
bounds in (5.2.4) and (5.2.5) – the ones most important to us – when requiring X to be a subset of
a ball, respectively of a hyperoctahedron, rather than the entire ball/hyperoctahedron.
This extension is not costless: we should simultaneously strengthen the normalization condition
(5.2.1). Specifically, we shall see that

B. The upper complexity bound in (5.2.4) remains valid when X ⊂ {x : ‖x‖_2 ≤ 1} and

P(X) = {f : f is convex and |f(x) − f(y)| ≤ ‖x − y‖_2 ∀x, y ∈ X};

S. The upper complexity bound in (5.2.5) remains valid when X ⊂ {x : ‖x‖_1 ≤ 1} and

P(X) = {f : f is convex and |f(x) − f(y)| ≤ ‖x − y‖_1 ∀x, y ∈ X}.

Note that the “ball-like” case mentioned in B seems rather artificial: the Euclidean norm
associated with this case is a very natural mathematical entity, but this is all we can say in its
favour. For example, the normalization of the objective in B is that the Lipschitz constant of f
w.r.t. ‖·‖_2 is ≤ 1, or, which is the same, that the vector of the first order partial derivatives of f
should, at every point, be of ‖·‖_2-norm not exceeding 1. In other words, the “typical” magnitudes
of the partial derivatives of f should become smaller and smaller as the number of variables grows;
what could be the reason for such strange behaviour? In contrast to this, the normalization
condition imposed on f in S is that the Lipschitz constant of f w.r.t. ‖·‖_1 is ≤ 1, or, which is
the same, that the ‖·‖_∞-norm of the vector of partial derivatives of f is ≤ 1. In other words,
the normalization is that the magnitudes of the first order partial derivatives of f should be ≤ 1,
and this normalization is “dimension-independent”. Of course, in B we deal with minimization
over subsets of the unit ball, while in S we deal with minimization over subsets of the unit
hyperoctahedron, which is much smaller than the unit ball. However, there do exist problems
in reality where we should minimize over the standard simplex

Δ_n = {x ∈ R^n : x ≥ 0, Σ_i x_i = 1},

which indeed is a subset of the unit hyperoctahedron. For example, it turns out that the PET
Image Reconstruction problem (PET) is in fact the problem of minimization over the standard
simplex. Indeed, the optimality condition for (PET) reads

λ_j ( p_j − Σ_i y_i p_ij / (Pλ)_i ) = 0,  j = 1, ..., n;

summing up these equalities, we get

Σ_j p_j λ_j = B ≡ Σ_i y_i.

It follows that the optimal solution to (PET) remains unchanged when we add to the non-
negativity constraints λ_j ≥ 0 also the constraint Σ_j p_j λ_j = B. Passing to the new variables
x_j = B^{-1} p_j λ_j, we further convert (PET) to the equivalent form

min_x { f(x) ≡ − Σ_i y_i ln( Σ_j q_ij x_j ) : x ∈ Δ_n },  q_ij = B p_ij / p_j,    (PET′)

which is a problem of minimizing a convex function over the standard simplex.
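The change of variables behind (PET′) is easy to check numerically. In the sketch below P, y and λ are small synthetic stand-ins; we verify that x lands in the standard simplex and that the log-terms of (PET) and (PET′) coincide.

```python
import numpy as np

# Numerical check of the substitution x_j = B^{-1} p_j lam_j with
# q_ij = B p_ij / p_j: then sum_j q_ij x_j = sum_j p_ij lam_j, so the
# log-terms of (PET) and (PET') agree. All data here are synthetic.
rng = np.random.default_rng(2)
m, n = 30, 8
P = rng.random((m, n))
y = rng.poisson(5.0, size=m).astype(float)
lam = rng.random(n)

p = P.sum(axis=0)                 # p_j = sum_i p_ij
B = y.sum()
lam = lam * B / (p @ lam)         # rescale so that sum_j p_j lam_j = B
x = p * lam / B                   # new variables
Q = B * P / p                     # q_ij = B p_ij / p_j

assert np.isclose(x.sum(), 1.0) and (x >= 0).all()   # x is in Delta_n
assert np.allclose(Q @ x, P @ lam)                   # identical log-terms
```

On the slice Σ_j p_j λ_j = B the linear term of (PET) is the constant B, which is why it can be dropped in (PET′).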

Intermediate conclusion. The discussion above suggests that it is a good idea to
look for simple convex minimization techniques which, as applied to convex programs (CP)
with feasible sets of appropriate geometry, exhibit dimension-independent (or nearly dimension-
independent) and nearly optimal information-based complexity. We are about to present a
family of techniques of this type.

5.3 The Bundle-Mirror scheme


The setup. Consider problem (CP) and assume that

(A.1): The (convex) objective f is Lipschitz continuous on X.

To quantify this assumption, we fix once and for all a norm ‖·‖ on R^n and associate
with f the Lipschitz constant of f|_X w.r.t. the norm ‖·‖:

L_{‖·‖}(f) = min { L : |f(x) − f(y)| ≤ L‖x − y‖ ∀x, y ∈ X }.

Note that from Convex Analysis it follows that f admits, at every point x ∈ X, a
subgradient f′(x) such that

‖f′(x)‖_* ≤ L_{‖·‖}(f),

where ‖·‖_* is the norm conjugate to ‖·‖:

‖ξ‖_* = max { ξ^T x : ‖x‖ ≤ 1 }.

For example, the norm conjugate to ‖·‖_p is ‖·‖_q, where q = p/(p − 1).

We assume that this “small norm” subgradient f′(x) is exactly the one reported by
the First Order oracle when called with input x ∈ X; this is not a severe restriction,
since at least in the interior of X all subgradients of f are “small” in the outlined
sense.
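The conjugate-norm formula is easy to illustrate numerically. The sketch below checks the two pairings used throughout this lecture – ‖·‖_1 versus ‖·‖_∞, and the self-conjugate ‖·‖_2 – on a random vector; the example is generic, not tied to any particular problem.

```python
import numpy as np

# ||xi||_* = max{ xi^T x : ||x|| <= 1 }. For ||.||_1 the maximizer puts
# all its mass on the coordinate with the largest |xi_j|, so the
# conjugate norm is ||.||_inf; ||.||_2 is conjugate to itself.
rng = np.random.default_rng(3)
xi = rng.standard_normal(6)

# Conjugate of ||.||_1:
j = np.argmax(np.abs(xi))
x_star = np.zeros_like(xi)
x_star[j] = np.sign(xi[j])                    # ||x_star||_1 = 1
assert np.isclose(xi @ x_star, np.linalg.norm(xi, np.inf))

# Conjugate of ||.||_2:
x_star = xi / np.linalg.norm(xi)              # ||x_star||_2 = 1
assert np.isclose(xi @ x_star, np.linalg.norm(xi))
```

The same maximizing arguments show why a Lipschitz constant w.r.t. ‖·‖_1 translates into an ‖·‖_∞-bound on the gradient, as used in S above.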

The setup for the generic method BM we are about to present is given by X, ‖·‖ and a
continuously differentiable function ω(x) : X → R which we assume to be strongly convex, with
parameter κ > 0, w.r.t. the norm ‖·‖; the latter means that

ω(y) ≥ ω(x) + (y − x)^T ∇ω(x) + (κ/2) ‖y − x‖²  ∀x, y ∈ X.    (5.3.1)
The standard setups we will be especially interested in are:

1. “Ball setup”: ω(x) = (1/2) x^T x, ‖·‖ = ‖·‖_2, and X is a convex compact set simple enough
to allow for rapid solving of the optimization problems of the form

x[p] = argmin_{x∈X} [ ω(x) + p^T x ];    (5.3.2)

2. “Simplex setup”: ω(x) is the “regularized entropy”

ω(x) = Σ_{i=1}^n (x_i + δn^{-1}) ln(x_i + δn^{-1}) : Δ_n^+ → R,    (5.3.3)

where δ ∈ (0, 1) is a once and for all fixed “regularization parameter”; X is a convex compact
subset of the standard “full-dimensional” simplex

Δ_n^+ = {x ∈ R^n : x ≥ 0, Σ_i x_i ≤ 1}

simple enough to allow for solving the optimization problems of the form (5.3.2); ‖·‖ = ‖·‖_1;

3. “Spectahedron setup”: this setup deals with the special case when our “universe” is S^n
rather than R^n; as usual, S^n is equipped with the Frobenius inner product. The specta-
hedron is the set in S^n defined as

Σ_n^+ = {x ∈ S^n : x ⪰ 0, Tr(x) ≤ 1}

(we are using lowercase notation for the elements of S^n in order to be consistent with the
rest of the text). The function ω(x) is the “regularized matrix entropy”

ω(x) = Tr( (x + δn^{-1} I_n) ln(x + δn^{-1} I_n) ) : Σ_n^+ → R,    (5.3.4)

where δ ∈ (0, 1) is a once and for all fixed regularization parameter. X is a convex compact
subset of the spectahedron Σ_n^+ simple enough to allow for solving the optimization problems
of the form (5.3.2), and ‖·‖ is the norm

|x|_1 ≡ ‖λ(x)‖_1

on S^n, λ(x) being the vector of eigenvalues of x.
Note that in fact the simplex setup is a particular case of the spectahedron one, corre-
sponding to the case when X is a set of diagonal positive semidefinite matrices.

One can verify that for these setups, ω(·) indeed is continuously differentiable on X and satisfies
(5.3.1) with κ = O(1) (in fact, κ = 1 for the ball setup, κ = (1 + δ)^{-1} for the simplex setup, and
κ = 0.5(1 + δ)^{-1} for the spectahedron setup; see Appendix to Lecture 5).
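For the simplex setup, the cheapness of the prox-step (5.3.2) is what makes an iteration of BM affordable. As an illustrative sketch, the code below uses the plain (δ → 0) entropy in place of the regularized entropy (5.3.3); for it, the minimizer over the standard simplex Δ_n has a closed "softmax" form, and the regularization would only perturb this slightly.

```python
import numpy as np

# Prox-step argmin_{x in Delta_n} [ sum_i x_i ln x_i + p^T x ].
# Writing the optimality conditions with a multiplier for sum_i x_i = 1
# gives ln x_i + 1 + p_i + mu = 0, i.e. x_i proportional to exp(-p_i).
# This is a stand-in for the regularized entropy of the lecture.
def prox_entropy(p):
    z = np.exp(-(p - p.min()))     # shift by min(p) for numerical stability
    return z / z.sum()

p = np.array([0.3, -1.0, 2.0])
x = prox_entropy(p)                # one O(n) "cheap iteration" step
assert np.isclose(x.sum(), 1.0) and (x > 0).all()
```

The step costs O(n) operations, in contrast with the O(n²) (or worse) cost per iteration of the polynomial time methods discussed in Section 5.1.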

The generic algorithm BM we are about to describe works as follows.

A. The algorithm generates a sequence of search points, all belonging to X, where the
First Order oracle is called, and at every step builds the following entities:

1. the best value of f found so far, along with the best search point found so far; the latter is
treated as the current approximate solution built by the method;

2. a (valid) lower bound on the optimal value of the problem.

B. The execution is split into subsequent phases. Phase s, s = 1, 2, ..., is associated with a
prox-center c_s ∈ X and a level ℓ_s ∈ R such that

• when starting the phase, we already know f(c_s) and f′(c_s);

• ℓ_s = f_s + λ(f^s − f_s), where

– f^s is the best value of the objective known at the time when the phase starts;
– f_s is the lower bound on f_* we have at our disposal when the phase starts;
– λ ∈ (0, 1) is a parameter of the method.

The prox-center c_1 corresponding to the very first phase can be chosen in X in an arbitrary
fashion. We start the entire process by computing f, f′ at this prox-center, which results in

f^1 = f(c_1),

and set

f_1 = min_{x∈X} [ f(c_1) + (x − c_1)^T f′(c_1) ],

thus getting the initial lower bound on f_*.



C. Now let us describe a particular phase s. Let

ω_s(x) = ω(x) − (x − c_s)^T ∇ω(c_s);

note that (5.3.1) implies that

ω_s(y) ≥ ω_s(x) + (y − x)^T ∇ω_s(x) + (κ/2) ‖y − x‖²  ∀x, y ∈ X,    (5.3.5)

and that c_s is the minimizer of ω_s(·) on X.

The search points x_t = x_{t,s} of the phase s, t = 1, 2, ..., are generated according to the following
rules:

1. When generating x_t, we already have at our disposal x_{t−1} and a convex compact set
X_{t−1} ⊂ X such that

(a_{t−1})  x ∈ X∖X_{t−1} ⇒ f(x) > ℓ_s;
(b_{t−1})  x_{t−1} ∈ argmin_{X_{t−1}} ω_s.    (5.3.6)

Here x_0 = c_s and, say, X_0 = X, which ensures (5.3.6.a_0–b_0).

2. To update (x_{t−1}, X_{t−1}) into (x_t, X_t), we solve the auxiliary problem

min_x { ω_s(x) : x ∈ X_{t−1}, g_{t−1}(x) ≡ f(x_{t−1}) + (x − x_{t−1})^T f′(x_{t−1}) ≤ ℓ_s }.    (P_{t−1})

Our subsequent actions depend on the results of this optimization:

(a) When (P_{t−1}) is infeasible, we terminate the phase and update the lower bound on f_*
as

f_s → f_{s+1} = ℓ_s.

Note that in the case in question ℓ_s is indeed a valid lower bound on f_*: in X∖X_{t−1}
we have f(x) > ℓ_s by (5.3.6.a_{t−1}), while the infeasibility of (P_{t−1}) means that

x ∈ X_{t−1} ⇒ f(x) ≥ g_{t−1}(x) > ℓ_s,

where the inequality f(x) ≥ g_{t−1}(x) is given by the convexity of f.


The prox-center c_{s+1} for the new phase can be chosen in X in an arbitrary fashion.
(b) When (P_{t−1}) is feasible, we get the optimal solution x_t of this problem and compute
f(x_t), f′(x_t). It is possible that

f(x_t) − ℓ_s ≤ θ(f^s − ℓ_s),    (5.3.7)

where θ ∈ (0, 1) is a parameter of the method. In this case, we again terminate the
phase and set

f^{s+1} = f(x_t),  f_{s+1} = f_s.

The prox-center c_{s+1} for the new phase, same as above, can be chosen in X in an
arbitrary fashion.

(c) When (P_{t−1}) is feasible and (5.3.7) does not take place, we continue the phase s,
choosing as X_t an arbitrary convex compact set such that

X̲_t ≡ {x ∈ X_{t−1} : g_{t−1}(x) ≤ ℓ_s} ⊂ X_t ⊂ X̄_t ≡ {x ∈ X : (x − x_t)^T ∇ω_s(x_t) ≥ 0}.    (5.3.8)

Note that we are in the case when (P_{t−1}) is feasible and x_t is the optimal solution to
the problem; it follows that

∅ ≠ X̲_t ⊂ X̄_t,

so that (5.3.8) indeed allows one to choose X_t. Besides this, every choice of X_t compatible
with (5.3.8) ensures (5.3.6.a_t) and (5.3.6.b_t): the first relation is clearly ensured by the
left inclusion in (5.3.8) combined with (5.3.6.a_{t−1}) and the fact that f(x) ≥ g_{t−1}(x),
while the second relation (5.3.6.b_t) follows from the right inclusion in (5.3.8) due to
the convexity of ω_s(·).

Convergence Analysis. Let us define the s-th gap as the quantity

ε_s = f^s − f_s.

By its origin, the gap is nonnegative and nonincreasing in s; besides this, it clearly is an upper
bound on the inaccuracy, in terms of the objective, of the approximate solution z^s we have at
our disposal at the beginning of phase s.
The convergence and the complexity properties of the BM algorithm are given by the fol-
lowing statement.
Theorem 5.3.1 (i) The number N_s of oracle calls at a phase s can be bounded from above as
follows:

N_s ≤ 4 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²),    (5.3.9)

where

Ω = max_{x,y∈X} [ ω(y) − ω(x) − (y − x)^T ∇ω(x) ].

(ii) Consequently, for every ε > 0, the total number of oracle calls before the phase s with
ε_s ≤ ε is started (i.e., before an ε-solution to the problem is built) does not exceed

N(ε) = c(θ, λ) Ω L²_{‖·‖}(f) / (κ ε²),    (5.3.10)

with an appropriate c(θ, λ) depending solely and continuously on θ, λ ∈ (0, 1).
Proof. (i): Assume that phase s did not terminate in the course of N steps. Observe that then

‖x_t − x_{t−1}‖ ≥ θ(1 − λ)ε_s / L_{‖·‖}(f),  1 ≤ t ≤ N.    (5.3.11)

Indeed, we have g_{t−1}(x_t) ≤ ℓ_s by construction of x_t and g_{t−1}(x_{t−1}) = f(x_{t−1}) > ℓ_s + θ(f^s − ℓ_s), since
otherwise the phase would be terminated at the step t − 1. It follows that g_{t−1}(x_{t−1}) − g_{t−1}(x_t) >
θ(f^s − ℓ_s) = θ(1 − λ)ε_s. Taking into account that g_{t−1}(·), due to its origin, is Lipschitz continuous w.r.t.
‖·‖ with constant L_{‖·‖}(f), (5.3.11) follows.
Now observe that x_{t−1} is the minimizer of ω_s on X_{t−1} by (5.3.6.b_{t−1}), and the latter set, by con-
struction, contains x_t, whence (x_t − x_{t−1})^T ∇ω_s(x_{t−1}) ≥ 0. Applying (5.3.5), we get

ω_s(x_t) ≥ ω_s(x_{t−1}) + (κ/2) ( θ(1 − λ)ε_s / L_{‖·‖}(f) )²,  t = 1, ..., N,

whence

ω_s(x_N) − ω_s(x_0) ≥ (Nκ/2) ( θ(1 − λ)ε_s / L_{‖·‖}(f) )².

The latter relation, due to the evident inequality max_X ω_s − min_X ω_s ≤ Ω (readily given by the definition
of Ω and ω_s), implies that

N ≤ 2 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²).

Recalling the origin of N, we conclude that

N_s ≤ 2 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²) + 1.

All we need in order to derive from this inequality the required relation (5.3.9) is to demonstrate that

2 Ω L²_{‖·‖}(f) / (θ² (1 − λ)² κ ε_s²) ≥ 1,    (5.3.12)

which is immediate: let R = max_{x∈X} ‖x − c_1‖ = ‖x̄ − c_1‖, where x̄ ∈ X. We have ε_s ≤ ε_1 = f(c_1) − min_{x∈X} [f(c_1) +
(x − c_1)^T f′(c_1)] ≤ R L_{‖·‖}(f), where the concluding inequality is given by the fact that ‖f′(c_1)‖_* ≤ L_{‖·‖}(f).
On the other hand, by the definition of Ω we have

Ω ≥ ω(x̄) − [ω(c_1) + (x̄ − c_1)^T ∇ω(c_1)] ≥ (κ/2) ‖x̄ − c_1‖² = κR²/2

(we have used (5.3.1)). Thus, Ω ≥ κR²/2 and ε_s ≤ R L_{‖·‖}(f), and (5.3.12) follows. (i) is proved.
(ii): Assume that ε_s > ε at the phases s = 1, 2, ..., S, and let us bound from above the total number
of oracle calls at these S phases. Observe, first, that two subsequent gaps ε_s, ε_{s+1} are linked by the
relation

ε_{s+1} ≤ γ(θ, λ) ε_s,  γ = max[1 − λ, 1 − (1 − θ)(1 − λ)] < 1.    (5.3.13)

Indeed, it is possible that the phase s was terminated according to the rule 2(a); in this case

ε_{s+1} = f^{s+1} − ℓ_s ≤ f^s − ℓ_s = (1 − λ)ε_s,

as required in (5.3.13). The only other possibility is that the phase s was terminated when the relation
(5.3.7) took place. In this case,

ε_{s+1} = f^{s+1} − f_{s+1} ≤ ℓ_s + θ(f^s − ℓ_s) − f_s = λε_s + θ(1 − λ)ε_s = (1 − (1 − θ)(1 − λ))ε_s,

and we again arrive at (5.3.13).

From (5.3.13) it follows that ε_s ≥ ε γ^{s−S}, s = 1, ..., S, since ε_S > ε by the origin of S. We now have

Σ_{s=1}^S N_s ≤ Σ_{s=1}^S 4 Ω L²_{‖·‖}(f) / (θ²(1−λ)²κε_s²)
            ≤ Σ_{s=1}^S 4 Ω L²_{‖·‖}(f) γ^{2(S−s)} / (θ²(1−λ)²κε²)
            ≤ (4 Ω L²_{‖·‖}(f) / (θ²(1−λ)²κε²)) Σ_{t=0}^∞ γ^{2t}
            = c(θ, λ) Ω L²_{‖·‖}(f) / (κε²),  c(θ, λ) ≡ 4 / (θ²(1 − λ)²(1 − γ²)),

and (5.3.10) follows.
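For concreteness, the constant produced by the proof, c(θ, λ) = 4/(θ²(1 − λ)²(1 − γ²)) with γ = max[1 − λ, 1 − (1 − θ)(1 − λ)], can be evaluated numerically; the sample parameter values below are illustrative.

```python
# Evaluating the constant from the proof of Theorem 5.3.1:
# c(theta, lam) = 4 / (theta^2 (1-lam)^2 (1-gamma^2)),
# gamma = max(1-lam, 1-(1-theta)(1-lam)).
def bm_constant(theta, lam):
    gamma = max(1 - lam, 1 - (1 - theta) * (1 - lam))
    return 4 / (theta**2 * (1 - lam) ** 2 * (1 - gamma**2))

# The "balanced" choice theta = lam = 1/2 gives gamma = 3/4 and
# c = 4 / ((1/4)(1/4)(7/16)) = 1024/7.
print(bm_constant(0.5, 0.5))   # -> 146.2857...
```

The constant blows up as θ or λ approach 0 or 1, which is why moderate values of both parameters are the natural choice.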


Now let us look at what our complexity analysis says in the case of the standard setups.

Ball setup and optimization over the ball. As we remember, in the case of the ball setup
one has κ = 1. Besides this, it is immediately seen that in this case

Ω = (1/2) D²_{‖·‖_2}(X),

where D_{‖·‖_2}(X) = max_{x,y∈X} ‖x − y‖_2 is the ‖·‖_2-diameter of X. Thus, (5.3.10) becomes

N(ε) ≤ c(θ, λ) D²_{‖·‖_2}(X) L²_{‖·‖_2}(f) / (2ε²).    (5.3.14)
On the other hand, let L > 0, and let P_{‖·‖_2,L}(X) be the family of all convex problems (CP) with
convex objectives which are Lipschitz continuous, with constant L, w.r.t. ‖·‖_2. It is known that if X is an
n-dimensional Euclidean ball and n ≥ D²_{‖·‖_2}(X) L² / ε², then the information-based complexity of the
family P_{‖·‖_2,L}(X) is at least O(1) D²_{‖·‖_2}(X) L² / ε² (cf. (5.2.4)). Comparing this result with (5.3.14),
we conclude that

If X is an n-dimensional Euclidean ball, then the complexity of the family P_{‖·‖_2,L}(X)
w.r.t. the BM algorithm with the ball setup in the “large-scale case” n ≥ D²_{‖·‖_2}(X) L² / ε²
coincides (within a factor depending solely on θ, λ) with the information-based com-
plexity of the family.

Simplex setup and minimization over the simplex. In the case of the simplex setup one
has κ = (1 + δ)^{-1}, where δ ∈ (0, 1) is the regularization parameter for the entropy. Besides this,
one has

Ω ≤ (1 + δ) ( 1 + ln( n(1 + δ)/δ ) )    (5.3.15)

(see the computation in Appendix to Lecture 5).
We see that for the simplex setup, Ω is of the order of ln n, provided that δ is not astronomically
small. E.g., when δ = 1.e-16 is the “machine zero” (so that for all computational purposes, our
regularized entropy is, essentially, the same as the usual entropy), we have Ω ≤ 37 + ln n, whence
Ω ≤ 6 ln n provided that n ≥ 1000.
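The bound (5.3.15) can be spot-checked numerically: for random pairs x, y in the simplex, the quantity ω(y) − ω(x) − (y − x)ᵀ∇ω(x) entering the definition of Ω should never exceed the right-hand side. The sample sizes and the choice of random points below are illustrative.

```python
import numpy as np

# Spot-check of (5.3.15) for the regularized entropy
# omega(x) = sum_i (x_i + delta/n) ln(x_i + delta/n).
rng = np.random.default_rng(4)
n, delta = 100, 1e-6

def omega(x):
    z = x + delta / n
    return np.sum(z * np.log(z))

def grad_omega(x):
    return np.log(x + delta / n) + 1.0

bound = (1 + delta) * (1 + np.log(n * (1 + delta) / delta))
worst = 0.0
for _ in range(500):
    # spiky Dirichlet samples stress the bound near the simplex vertices
    x = rng.dirichlet(np.full(n, 0.1))
    y = rng.dirichlet(np.full(n, 0.1))
    gap = omega(y) - omega(x) - (y - x) @ grad_omega(x)
    worst = max(worst, gap)
assert worst <= bound
```

With δ = 10⁻⁶ and n = 100 the bound is about 1 + ln 10⁸ ≈ 19.4, and the extreme pairs (near opposite vertices) come close to it, which matches the claim that the estimate cannot be improved by more than a constant factor.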
With our bounds for κ and Ω, the complexity bound (5.3.10) becomes

N(ε) ≤ c(θ, λ) L²_{‖·‖_1}(f) ln n / ε²    (5.3.16)

(provided that δ ≥ 1.e-16). On the other hand, let L > 0, and let P_{‖·‖_1,L}(X) be the family of all
convex problems (CP) with convex objectives which are Lipschitz continuous, with constant L, w.r.t. ‖·‖_1.
It is known that if X is the n-dimensional simplex Δ_n (or the full-dimensional simplex Δ_n^+) and
n ≥ L²/ε², then the information-based complexity of the family P_{‖·‖_1,L}(X) is at least O(1) L²/ε² (cf.
(5.2.5)). Comparing this result with (5.3.16), we conclude that

If X is the n-dimensional simplex Δ_n (or the full-dimensional simplex Δ_n^+), then the
complexity of the family P_{‖·‖_1,L}(X) w.r.t. the BM algorithm with the simplex setup
in the “large-scale case” n ≥ L²/ε² coincides, within a factor of the order of ln n, with the
information-based complexity of the family.

Spectahedron setup and large-scale semidefinite optimization. All the conclusions we
have made when speaking about the case of the simplex setup and X = Δ_n (or X = Δ_n^+)
remain valid in the case of the spectahedron setup and X defined as the set of all block-diagonal
matrices of a given block-diagonal structure contained in Σ_n = {x ∈ S^n : x ⪰ 0, Tr(x) = 1} (or
contained in Σ_n^+).
We see that with every one of our standard setups, the BM algorithm under appropriate con-
ditions possesses a dimension-independent (or nearly dimension-independent) complexity bound
and, moreover, is nearly optimal in the sense of Information-based complexity theory, provided
that the dimension is large.

Why the standard setups? “The contribution” of ω(·) to the performance estimate (5.3.10)
is the factor Θ = Ω/κ; the smaller it is, the better. In principle, given X and ‖·‖, we could play with
ω(·) to minimize Θ. The standard setups are given by a kind of such optimization for the cases
when X is the ball and ‖·‖ = ‖·‖_2 (“the ball case”), when X is the simplex and ‖·‖ = ‖·‖_1
(“the simplex case”), and when X is the spectahedron and ‖·‖ = |·|_1 (“the spectahedron
case”), respectively. We did not try to solve the arising variational problems exactly; however,
it can be proved that in all three cases the value of Θ we have reached (i.e., O(1) in the ball
case and O(ln n) in the simplex and the spectahedron cases) cannot be reduced by more than
an absolute constant factor. Note that in the simplex case the (regularized) entropy is not the
only reasonable choice; similar complexity results can be obtained for, say, ω(x) = Σ_i x_i^{p(n)} or
ω(x) = ‖x‖²_{p(n)} with p(n) = 1 + O(1/ln n).

5.4 Implementation issues


Solving the auxiliary problems (P_t). As far as implementation of the BM algorithm is con-
cerned, the major issue is how to solve the auxiliary problem (P_t) efficiently. Formally, this
problem is of the same design dimension as the problem of interest; what do we gain when
reducing the solution of a single large-scale problem (CP) to a long series of auxiliary problems
of the same dimension? To answer this crucial question, observe first that we have control over
the complexity of the domain X_t which, up to a single linear constraint, is the feasible domain
of (P_t). Indeed, assume that X_{t−1} is a part of X given by a finite list of linear inequalities.
Then both sets X̲_t and X̄_t in (5.3.8) are also cut off X by finitely many linear inequalities, so
that we may enforce X_t to be cut off X by finitely many linear inequalities as well. Moreover,
we have full control of the number of inequalities in the list. For example,

A. Setting all the time Xt = X̄t, we ensure that Xt is cut off X by a single linear inequality;

B. Setting all the time Xt = X̲t, we ensure that Xt is cut off X by t linear inequalities (so that the larger t is, the “more complicated” the description of Xt is);

C. We can choose something in-between these extremes. For example, assume that we have chosen a certain m and are ready to work with Xt's cut off X by at most m linear inequalities. In this case, we could use policy B at the initial steps of a phase, until the number of linear inequalities in the description of Xt−1 reaches the maximum allowed value m. At the step t, we are supposed to choose Xt in-between the two sets

    X̲t = {x ∈ X : h1(x) ≤ 0, ..., hm(x) ≤ 0, hm+1(x) ≤ 0},   X̄t = {x ∈ X : hm+2(x) ≤ 0},


200 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS

where

– the linear inequalities h1(x) ≤ 0, ..., hm(x) ≤ 0 cut Xt−1 off X;

– the linear inequality hm+1(x) ≤ 0 is, in our old notation, the inequality gt−1(x) ≤ ℓs;

– the linear inequality hm+2(x) ≤ 0 is, in our old notation, the inequality (x − xt)ᵀ∇ωs(xt) ≥ 0.

Now we can form a list of m linear inequalities as follows:


• we build m − 1 linear inequalities ei (x) ≤ 0 by aggregating the m + 1 inequalities
hj (x) ≤ 0, j = 1, ..., m + 1, so that every one of the e-inequalities is a convex combination
of the h-ones (the coefficients of these convex combinations can be whatever we want);
• we set Xt = {x ∈ X : ei (x) ≤ 0, i = 1, ..., m − 1, hm+2 (x) ≤ 0}.
It is immediately seen that with this approach we ensure (5.3.8), on the one hand, and that Xt is cut off X by at most m inequalities, on the other. And of course we can proceed in the same fashion at the subsequent steps.
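For illustration, here is a minimal sketch of the aggregation step in policy C, with cuts stored as rows of a matrix; the particular convex weights used (averaging the three oldest cuts, keeping the rest untouched) are my own arbitrary choice, not one prescribed by the text:

```python
import numpy as np

def compress_cuts(A, b):
    """Given m+1 linear cuts h_j(x) = A[j] @ x + b[j] <= 0, return m-1 cuts,
    each a convex combination of the originals: here the three oldest cuts
    are averaged into one, and the remaining m-2 are kept as they are.
    Any point satisfying all original cuts satisfies the compressed ones."""
    merged_A = A[:3].mean(axis=0)          # uniform convex combination
    merged_b = b[:3].mean()
    return (np.vstack([merged_A[None, :], A[3:]]),
            np.concatenate([[merged_b], b[3:]]))
```

With m = 4, five cuts go in and three come out; validity is preserved simply because a convex combination of valid linear inequalities is again a valid linear inequality.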

The bottom line is: we always can ensure that Xt−1 is cut off X by at most m linear inequalities hj(x) ≤ 0, j = 1, ..., m, where m ≥ 1 is (any) desired bound. Consequently, we may assume that the feasible set of (Pt−1) is cut off X by m + 1 linear inequalities hj(x) ≤ 0, j = 1, ..., m + 1. The crucial point is that with this approach, we can reduce (Pt−1) to a convex program with at most m + 1 decision variables. Indeed, assuming (Pt−1) strictly feasible:

    ∃ x̄ ∈ rint X : hj(x̄) < 0, j = 1, ..., m + 1,
and applying the standard Lagrange Duality, we see that the optimal value in (Pt−1) is equal to the one in the Lagrange dual problem

    max_{λ≥0} L(λ),   L(λ) = min_{x∈X} [ωs(x) + Σ_{j=1}^{m+1} λj hj(x)].   (Dt−1)

Now note that although (Dt−1) possesses absolutely no structure, its objective is concave (as a minimum of linear functions of λ) and “computable” at every given λ. Indeed, to compute the value L(λ) and a supergradient L′(λ) of L at a given point is the same as to find the optimal solution xλ of the optimization program

    min_{x∈X} [ωs(x) + Σ_{j=1}^{m+1} λj hj(x)];   (D[λ])

after xλ is found, we can set

    L(λ) = ωs(xλ) + Σ_{j=1}^{m+1} λj hj(xλ),   L′(λ) = (h1(xλ), ..., hm+1(xλ))ᵀ.
It remains to note that to solve (D[λ]) means to minimize over X a sum of ω(·) and a linear
function, and we have assumed that X is simple enough for problems of this type to be rapidly
solved.
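To make the duality recipe concrete, here is a sketch of the First Order oracle for (Dt−1) in the simplest situation: the ball setup with ωs(x) = ½‖x − c‖₂², X the unit Euclidean ball, and linear cuts hj(x) = aⱼᵀx + bⱼ. The names c, A, b are illustrative, not from the text; one oracle call costs a single closed-form minimization over X:

```python
import numpy as np

def x_lambda(lam, c, A, b):
    """argmin over the unit ball of 0.5*||x - c||^2 + sum_j lam_j*(a_j^T x + b_j).
    The unconstrained minimizer is c - A^T lam; minimizing over the ball
    amounts to radially projecting that point onto the ball."""
    y = c - A.T @ lam
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

def dual_oracle(lam, c, A, b):
    """Return L(lam) and the supergradient L'(lam) = (h_1(x_lam), ..., h_{m+1}(x_lam))."""
    x = x_lambda(lam, c, A, b)
    h = A @ x + b
    return 0.5 * np.dot(x - c, x - c) + lam @ h, h
```

(Dt−1) can then be maximized over λ ≥ 0 by any method for problems with m + 1 variables (cf. Theorem 4.1.2), each oracle call being cheap.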
The summary of our observations is that (Dt−1) is a convex optimization program with m + 1 decision variables, and we have at our disposal a First Order oracle for this problem, so that we
can solve it efficiently, provided that m is not too big (cf. Theorem 4.1.2). And we indeed can enforce the latter requirement – m is under our full control!
After (Dt−1) is solved to high accuracy and we have at our disposal the corresponding maximizer λ∗, we can choose, as xt, the point xλ∗: the Lagrange Duality theorem says that the optimal solution xt of (Pt−1) is among the optimal solutions to (D[λ∗]), and the set of optimal solutions to the latter problem is a singleton (since ωs is strongly convex).
It should be mentioned that the outlined construction works when (Pt−1) is strictly feasible; what to do if this is not the case? To overcome the difficulty in our context, it suffices to solve first the auxiliary problem

    min_x { gt−1(x) − ℓs : x ∈ Xt−1 ≡ {x ∈ X : hj(x) ≤ 0, j = 1, ..., m} }.   (P′t−1)

If the optimal value in this problem is negative, then (Pt−1) is strictly feasible, and we know how to proceed. If the optimal value is nonnegative, then, recalling the origin of Xt−1, we can conclude that ℓs is a valid lower bound on f∗, and there is no necessity to solve (Pt−1) at all – we should terminate the phase and update the lower bound on f∗ according to fs → fs+1 = ℓs. It remains to note that the auxiliary problem can be solved by the same Lagrange Duality trick as (Pt−1), and that it definitely is strictly feasible (induction in t!), so that here the duality does work1).

Remark 5.4.1 Note that the optimal value ht−1 of the above auxiliary problem can be used to update the current lower bound on f∗: from (5.3.6.at−1) combined with the fact that gt−1(x) ≤ f(x) it follows that f∗ ≥ min[ℓs, ℓs + ht−1].

When are the standard setups implementable? As we have seen, the possibility to implement the BM algorithm requires an ability to solve rapidly optimization problems of the form (5.3.2). Let us look at several important cases when this indeed is possible.

Ball setup. Here problem (5.3.2) becomes

    min_{x∈X} [ ½ xᵀx − pᵀx ],

or, which is the same,

    min_{x∈X} ½ ‖x − p‖₂².

We see that to solve (5.3.2) is the same as to project on X – to find the point in X which is as close as possible, in the usual ‖·‖₂-norm, to a given point p. This problem is easy to solve for several simple solids X, e.g.,

• a ball {x : ‖x − a‖₂ ≤ r},

• a box {x : a ≤ x ≤ b},
1) For a careful reader it should be mentioned that as far as duality is concerned, the situation with the auxiliary problem is not completely similar to the one with (Pt−1): the objective of the former problem is not strongly convex, so that there might be a difficulty with passing from the optimal solution of its dual to the optimal solution of the auxiliary problem itself. Note, however, that we do not need this optimal solution at all – we need only the optimal value...
• the simplex ∆n = {x : x ≥ 0, Σ_i xi = 1}.
In the first two cases, it takes O(n) operations to compute the solution – it is given by evident explicit formulas. In the third case, to project is a bit more involved: one can easily demonstrate that the projection is given by the relations xi = xi(λ∗), where xi(λ) = max[0, pi − λ] and λ∗ is the unique root of the equation

    Σ_i xi(λ) = 1.

The left hand side of this equation is nonincreasing and continuous in λ and, as is immediately seen, its value varies from something ≥ 1 when λ = max_i pi − 1 to 0 when λ = max_i pi. It follows that one can easily approximate λ∗ by Bisection, and that it takes a moderate absolute constant of bisection steps to compute λ∗ (and thus the projection) within machine precision. The arithmetic cost of a bisection step clearly is O(n), and the overall arithmetic complexity of finding the projection becomes O(n).
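In code, the three projections might look as follows (a sketch; function names are mine, and the simplex case uses exactly the bisection just described):

```python
import numpy as np

def project_ball(p, a, r):
    """Projection of p onto {x : ||x - a||_2 <= r}."""
    d = p - a
    nd = np.linalg.norm(d)
    return p if nd <= r else a + (r / nd) * d

def project_box(p, a, b):
    """Projection of p onto {x : a <= x <= b} (componentwise clipping)."""
    return np.clip(p, a, b)

def project_simplex(p, tol=1e-12):
    """Projection of p onto the standard simplex: x_i = max(0, p_i - lam*),
    with lam* the root of sum_i max(0, p_i - lam) = 1, found by bisection
    on [max_i p_i - 1, max_i p_i]."""
    lo, hi = p.max() - 1.0, p.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.maximum(p - mid, 0.0).sum() >= 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(p - 0.5 * (lo + hi), 0.0)
```

Each bisection step costs O(n) and halves the bracketing interval, so a fixed number of steps reaches machine precision – in accordance with the O(n) total cost stated above.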

Simplex setup. Let us restrict ourselves to the two simplest cases:
S.A: X is the standard simplex ∆n;
S.B: X is the standard full-dimensional simplex ∆+n.
Case S.A. When X = ∆n, problem (5.3.2) becomes

    min { Σ_i (xi + σ) ln(xi + σ) − pᵀx : x ≥ 0, Σ_i xi = 1 }   [σ = δ/n]   (5.4.1)
The solution of this optimization problem, as is immediately seen, is given by xi = xi(λ∗), where

    xi(λ) = max[exp{p̄i − λ} − σ, 0]   [p̄i = pi − max_j pj]   (5.4.2)

and λ∗ is the solution of the equation

    Σ_i xi(λ) = 1.
Here again the left hand side of the equation is nonincreasing and continuous in λ and, as is immediately seen, its value varies from something ≥ 1 when λ = −σ to something < 1 when λ = ln n, so that we again can compute λ∗ (and thus x(λ∗)) within machine precision, in a realistic range of values of n, in a moderate absolute constant of bisection steps. As a result, the arithmetic cost of solving (5.4.1) is again O(n).
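A sketch of this bisection, with σ = δ/n and the bracketing interval [−σ, ln n] as in the text:

```python
import numpy as np

def entropy_prox_simplex(p, delta=1e-16, tol=1e-12):
    """Solve (5.4.1) by bisection: minimize
    sum_i (x_i + sigma) ln(x_i + sigma) - p^T x over the standard simplex,
    sigma = delta/n.  By (5.4.2), x_i(lam) = max(exp(pbar_i - lam) - sigma, 0)
    with pbar_i = p_i - max_j p_j, and lam* is the root of
    sum_i x_i(lam) = 1 on [-sigma, ln n]."""
    n = p.size
    sigma = delta / n
    pbar = p - p.max()
    x_of = lambda lam: np.maximum(np.exp(pbar - lam) - sigma, 0.0)
    lo, hi = -sigma, np.log(n)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if x_of(mid).sum() >= 1.0:
            lo = mid
        else:
            hi = mid
    return x_of(0.5 * (lo + hi))
```

The per-step cost is the O(n) evaluation of the sum, so the total cost is O(n) up to the fixed number of bisection steps.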
Note that “numerically speaking”, we should not bother about Bisection at all. Indeed, let us set δ to something really small, say, δ = 1.e-16. Then σ = δ/n ≪ 1.e-16, while (at least some of) the xi(λ∗) should be of order of 1/n (since their sum should be 1). It follows that in actual (i.e., finite precision) computations, the quantity σ in (5.4.2) is negligible. Omitting σ in (5.4.1) (i.e., replacing in (5.3.2) the regularized entropy by the usual one), we can explicitly write down the solution x∗ of (5.4.1):

    xi = exp{p̄i} / Σ_j exp{p̄j},   i = 1, ..., n.
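In this σ = 0 form the solution is just a max-shifted softmax: for the linear term −pᵀx of (5.4.1), shifting p by max_j pj does not change the minimizer over ∆n but keeps the exponentials bounded in finite-precision arithmetic, which is exactly the role of p̄ in (5.4.2):

```python
import numpy as np

def entropy_prox_softmax(p):
    """Minimizer of sum_i x_i ln x_i - p^T x over the standard simplex:
    x_i = exp(pbar_i) / sum_j exp(pbar_j), pbar_i = p_i - max_j p_j."""
    z = np.exp(p - p.max())
    return z / z.sum()
```

A quick sanity check on the optimality condition: at the minimizer, ln xi + 1 − pi must be the same for all i (it equals the multiplier of the constraint Σ_i xi = 1).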
Case S.B. The case of X = ∆+n is very close to the one of X = ∆n. The only difference is that now we first should check whether

    Σ_i max[exp{pi − 1} − δ/n, 0] ≤ 1;

if it is the case, then the optimal solution of (5.3.2) is given by

    xi = max[exp{pi − 1} − δ/n, 0],   i = 1, ..., n;

otherwise the optimal solution of (5.3.2) is exactly the optimal solution of (5.4.1).
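A sketch of this check-then-fall-back recipe for minimizing Σ_i xi ln xi − pᵀx over {x ≥ 0, Σ_i xi ≤ 1} (again with σ dropped; the preliminary test on max_i pi is my own overflow guard, not from the text):

```python
import numpy as np

def entropy_prox_full_simplex(p):
    """Minimizer of sum_i x_i ln x_i - p^T x over {x >= 0, sum_i x_i <= 1}
    (the sigma = 0 simplification).  The minimizer over x >= 0 alone is
    x_i = exp(p_i - 1); it is the answer iff its entries sum to <= 1,
    otherwise the constraint sum_i x_i = 1 is active and the solution is
    that of case S.A (the softmax of p)."""
    if p.max() <= 1.0:                 # else exp(p_i - 1) > 1 for some i
        y = np.exp(p - 1.0)
        if y.sum() <= 1.0:
            return y
    z = np.exp(p - p.max())            # fall back to the Delta_n case
    return z / z.sum()
```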

Spectahedron setup. Consider the case of the spectahedron setup, and assume that either
Sp.A: X is comprised of all block-diagonal matrices, of a given block-diagonal structure, belonging to Σn,
or
Sp.B: X is comprised of all block-diagonal matrices, of a given block-diagonal structure, belonging to Σ+n.
Case Sp.A. Here problem (5.3.2) becomes

    min_{x∈X} { Tr((x + σIn) ln(x + σIn)) + Tr(px) }   [σ = δ/n]

We lose nothing by assuming that p is a symmetric block-diagonal matrix of the same block-diagonal structure as that of the matrices from X. Let p = UπUᵀ be the eigenvalue decomposition of p, with orthogonal U and diagonal π of the same block-diagonal structure as that of p. Passing from x to the new matrix variable ξ according to x = UξUᵀ, we convert our problem to the problem

    min_{ξ∈X} { Tr((ξ + σIn) ln(ξ + σIn)) + Tr(πξ) }   (5.4.3)

We claim that the unique (due to the strong convexity of ω) optimal solution ξ∗ of the latter problem is a diagonal matrix. Indeed, for every diagonal matrix D with diagonal entries ±1 and every feasible solution ξ of our problem, the matrix DξD clearly is again a feasible solution with the same value of the objective (recall that π is diagonal). It follows that the optimal set {ξ∗} of our problem should be invariant w.r.t. the transformations ξ ↦ DξD, which is possible if and only if ξ∗ is a diagonal matrix. Thus, when solving (5.4.3), we may from the very beginning restrict ourselves to diagonal ξ, and with this restriction the problem becomes

    min_{ξ∈Rⁿ} { Σ_i (ξi + σ) ln(ξi + σ) + πᵀξ : ξ ≥ 0, Σ_i ξi = 1 },   (5.4.4)

which is exactly the problem we have considered in the case of the simplex setup and X = ∆n (up to the substitution p = −π in the notation of (5.4.1)). We see that the only elaboration in the case of the spectahedron setup as compared to the simplex one is the necessity to find the eigenvalue decomposition of p. The latter task is easy, provided that the diagonal blocks of the matrices in question are of small sizes. Note that this favourable situation does occur in several important applications, e.g., in Structural Design.
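The reduction just described – diagonalize, solve the diagonal simplex-type problem, rotate back – can be sketched as follows for the σ = 0 simplification of case Sp.A, with numpy's eigh playing the role of the decomposition p = UπUᵀ:

```python
import numpy as np

def matrix_entropy_prox(p):
    """Minimizer of Tr(x ln x) + Tr(p x) over {x symmetric psd, Tr x = 1}
    (sigma = 0), for symmetric p.  Diagonalize p = U diag(pi) U^T; the
    diagonal problem min { sum_i xi_i ln xi_i + pi^T xi } over the simplex
    has the softmax of -pi as its solution; rotate it back by U."""
    pi, U = np.linalg.eigh(p)
    z = np.exp(-(pi - pi.min()))       # numerically stable softmax of -pi
    xi = z / z.sum()
    return (U * xi) @ U.T              # U diag(xi) U^T
```

The cost is dominated by the eigenvalue decomposition, which, as noted above, is cheap when the diagonal blocks are small (each block can be diagonalized separately).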
Case Sp.B. This case is completely similar to the previous one; the only difference is that the role of (5.4.3) is now played by the problem

    min_{ξ∈Rⁿ} { Σ_i (ξi + σ) ln(ξi + σ) + πᵀξ : ξ ≥ 0, Σ_i ξi ≤ 1 },

which we have already considered when speaking about the simplex setup.

Updating prox-centers. The complexity results stated in Theorem 5.3.1 are absolutely independent of how we update the prox-centers, so that in this respect we are, in principle, completely free. Common sense, however, says that the most natural policy here is to use as the prox-center, at every stage, the best (i.e., the one with the smallest value of f) solution among those we have at our disposal at the beginning of the stage.

Accumulating information. The set Xt summarizes, in a sense, all the information on f we have accumulated so far and intend to use in the sequel. Relation (5.3.8) allows for a tradeoff between the quality (and the volume) of this information and the computational effort required to solve the auxiliary problems (Pt−1). With no restrictions on this effort, the most promising policy for updating the Xt's would be to set Xt = X̲t, i.e., to keep all the cuts accumulated so far (“collecting information with no compression of it”). With this policy the BM algorithm with the ball setup is basically identical to the Prox-Level algorithm [9] from the famous family of bundle methods for nonsmooth convex optimization. Aside from theoretical complexity bounds similar to (5.3.14), most bundle methods (in particular, the Prox-Level one) share the following experimentally observed property: the practical performance of the algorithm is in full accordance with the complexity bound (5.2.2) – every n steps reduce the inaccuracy by at least an absolute constant factor (something like 3). This property is very attractive in moderate dimensions, where we indeed are capable of carrying out several times the dimension number of steps.

5.5 Illustration: PET Image Reconstruction problem


To get an impression of the practical performance of the BM method, let us look at preliminary
numerical results related to the 2D version of the PET Image Reconstruction problem.

The model. We process simulated measurements as if they were registered by a ring of 360 detectors, the inner radius of the ring being 1 (Fig. 5.1). The field of view is a concentric circle of radius 0.9, and it is covered by a 129×129 rectangular grid. The grid partitions the field of view into 10,471 pixels, and we act as if the tracer's density were constant in every pixel. Thus, the design dimension of the problem (PET) we are interested in solving is “just” n = 10471.
The number of bins (i.e., the number m of log-terms in the objective of (PET)) is 39784, while the number of nonzeros among the qij is 3,746,832.
The true image is “black and white” (the density in every pixel is either 1 or 0). The
measurement time (which is responsible for the level of noise in the measurements) is mimicked
as follows: we model the measurements according to the Poisson model as if during the period
of measurements the expected number of positrons emitted by a single pixel with unit density
was a given number M .
Figure 5.1. Ring with 360 detectors, field of view and a line of response

The algorithm we use to solve (PET) is the plain BM method with the simplex setup and with the sets Xt cut off X = ∆n by just one linear inequality:

    Xt = {x ∈ ∆n : (x − xt)ᵀ∇ωs(xt) ≥ 0}.

The parameters λ, θ of the algorithm were chosen as λ = 0.95, θ = 0.5.

The approximate solution reported by the algorithm at a step is the best search point found so far (the one with the best value of the objective seen up to the moment).

The results of two sample runs we are about to present are not that bad.

Experiment 1: Noiseless measurements. The evolution of the best, in terms of the objective, solutions xt found in the course of the first t calls to the oracle is displayed at Fig. 5.2 (in the pictures, brighter areas correspond to higher density of the tracer). The numbers are as follows. With noiseless measurements, we know in advance the optimal value in (PET) – it is easily seen that without noise, the true image (which in our simulated experiment we do know) is an optimal solution. In our problem, this optimal value equals 2.8167; the best value of the objective found in 111 oracle calls is 2.8171 (optimality gap 4.e-4). The progress in accuracy is plotted on Fig. 5.3. We built a total of 111 search points, and the entire computation took 18′51″ on a 350 MHz Pentium II laptop with 96 MB RAM.

Experiment 2: Noisy measurements (40 LORs per pixel with unit density, 63,092 LORs registered in total). The pictures are presented at Fig. 5.4. Here are the numbers. With noisy measurements, we have no a priori knowledge of the true optimal value in (PET); in simulated experiments, a kind of orientation is given by the value of the objective at the true image (which is hopefully an upper bound on f∗ close to f∗). In our experiment, this bound equals -0.8827. The best value of the objective found in 115 oracle calls is -0.8976 (which is less than the objective at the true image; in fact, the algorithm went below the value of f at the true image already after 35 oracle calls). The upper bound on the optimality gap at termination is 9.7e-4. The progress in accuracy is plotted on Fig. 5.5. We built a total of 115 search points; the entire computation took 20′41″.
Figure 5.2. Reconstruction from noiseless measurements. Panels (left to right, top to bottom): true image, 10 “hot spots” (f = 2.817); x1 = n−1(1, ..., 1)ᵀ (f = 3.247); x2, some traces of 8 spots (f = 3.185); x3, traces of 8 spots (f = 3.126); x5, some trace of the 9-th spot (f = 3.016); x8, the 10-th spot still missing (f = 2.869); x24, a trace of the 10-th spot (f = 2.828); x27, all 10 spots in place (f = 2.823); x31, that is it (f = 2.818).
Figure 5.3. Progress in accuracy, noiseless measurements.
solid line: relative gap Gap(t)/Gap(1) vs. step number t; Gap(t) is the difference between the best found so far value f(xt) of f and the current lower bound on f∗. In 111 steps, the gap was reduced by a factor > 1600.
dashed line: progress in accuracy (f(xt) − f∗)/(f(x1) − f∗) vs. step number t. In 111 steps, the accuracy was improved by a factor > 1080.

3D PET. The BM algorithm is pretty young (April 2002), and when writing these Lecture Notes, I have no possibility to test it on actual clinical 3D PET data (since such a test would require a lot of dedicated data-handling routines I have no access to). However, a couple of years ago we participated in a EU project on 3D PET Image Reconstruction; in this project, among other things, we developed a special algorithm – the ‖·‖₁-Mirror Descent method MD1 – for minimizing a convex function over the standard simplex, and used this algorithm to solve the 3D PET problem; the related theory and numerical results can be found in [2]. The Mirror Descent algorithm has much in common with BM and enjoys the same theoretical performance bounds. My feeling is that in practice the BM algorithm should be somewhat better than Mirror Descent, but right now I do not have enough experimental data to support this guess. However, I believe it makes sense to present here some experimental data on the practical performance of MD1, in order to demonstrate that simple optimization techniques indeed have chances when applied to really huge convex programs. I restrict myself to a single numerical example – a real clinical brain study carried out on a very powerful PET scanner. In the corresponding problem (PET), there are n = 2,763,635 design variables (this is the number of voxels in the field of view) and about 25,000,000 log-terms in the objective. The reconstruction was carried out on an INTEL Marlinspike Windows NT Workstation (500 MHz 1Mb Cache INTEL Pentium III Xeon processor, 2GB RAM). A single call to the First Order oracle (a single computation of the value and a subgradient of f) took about 90 min. Pictures of clinically acceptable quality were obtained after just four calls to the oracle (as was the case with other sets of PET data); for research purposes, we carried out 6 additional steps of the algorithm.
The pictures presented on Fig. 5.6 are slices – 2D cross-sections – of the reconstructed 3D image. The two series of pictures shown on Fig. 5.6 correspond to two different versions, MD and OSMD, of MD1 (for details, see [2]). The relevant numbers are presented in Table 5.1. The best
Figure 5.4. Reconstruction from noisy measurements. Panels (left to right, top to bottom): true image, 10 “hot spots” (f = −0.883); x1 = n−1(1, ..., 1)ᵀ (f = −0.452); x2, light traces of 5 spots (f = −0.520); x3, traces of 8 spots (f = −0.585); x5, 8 spots in place (f = −0.707); x8, the 10th spot still missing (f = −0.865); x12, all 10 spots in place (f = −0.872); x35, all 10 spots in place (f = −0.886); x43 (f = −0.896).
Figure 5.5. Progress in accuracy, noisy measurements.
solid line: relative gap Gap(t)/Gap(1) vs. step number t. In 115 steps, the gap was reduced by a factor of 1580.
dashed line: progress in accuracy (f(xt) − f̲)/(f(x1) − f̲) vs. step number t, where f̲ is the last lower bound on f∗ built in the run. In 115 steps, the accuracy was improved by a factor > 460.

step #     MD       OSMD        step #     MD       OSMD
   1     -1.463    -1.463          6     -1.987    -2.015
   2     -1.725    -1.848          7     -1.997    -2.016
   3     -1.867    -2.001          8     -2.008    -2.016
   4     -1.951    -2.015          9     -2.008    -2.016
   5     -1.987    -2.015         10     -2.009    -2.016

Table 5.1. Performance of MD and OSMD in the Brain study

known lower bound on the optimal value in the problem is -2.050; in 10 oracle calls, MD and OSMD decrease the objective from its initial value -1.436 to -2.009 and -2.016, respectively (optimality gaps 4.e-2 and 3.5e-2), reducing the initial inaccuracy in terms of the objective by factors of 15.3 (MD) and 17.5 (OSMD).
Figure 5.6. Brain, near-mid slice of the reconstructions.
[the top-left missing part is the area affected by Alzheimer's disease]
