Introduction to Online Convex Optimization
Author: Elad Hazan
ISBN: 9781680831702, 1680831704
Published in: Foundations and Trends in Optimization, Volume 2, Issue 3-4, 2015
File details: PDF, 4.22 MB
Language: English
Introduction to Online Convex Optimization
Elad Hazan
Princeton University
[email protected]
Boston — Delft
Foundations and Trends® in Optimization
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers.

Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The 'services' for users can be found on the internet at: www.copyright.com

For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1 781 871 0245; www.nowpublishers.com; [email protected]

now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: [email protected]
Foundations and Trends® in Optimization
Volume 2, Issue 3-4, 2015
Editorial Board
Editors-in-Chief
Editors
Topics
– Machine learning
– Statistics
– Data analysis
– Signal and image processing
– Computational economics and finance
– Engineering design
– Scheduling and resource allocation
– and other areas
Elad Hazan
Princeton University
[email protected]
to Dana
Preface
Book’s structure
Contents

Preface
1 Introduction
1.1 The online convex optimization model
1.2 Examples of problems that can be modeled via OCO
1.3 A gentle start: learning from expert advice
1.4 Exercises
1.5 Bibliographic remarks
3.5 Exercises
3.6 Bibliographic remarks
5 Regularization
5.1 Regularization functions
5.2 The RFTL algorithm and its analysis
5.3 Online Mirrored Descent
5.4 Application and special cases
5.5 Randomized regularization
5.6 * Optimal regularization
5.7 Exercises
5.8 Bibliographic Remarks
Acknowledgements
References
Abstract
1 Introduction

1.2 Examples of problems that can be modeled via OCO
Perhaps the main reason that OCO has become a leading online learn-
ing framework in recent years is its powerful modeling capability: prob-
lems from diverse domains such as online routing, ad selection for search
engines and spam filtering can all be modeled as special cases. In this
section, we briefly survey a few special cases and how they fit into the
OCO framework.
Perhaps the most well known problem in prediction theory is the so-
called “experts problem”. The decision maker has to choose among the
advice of n given experts. After making her choice, a loss between
zero and one is incurred. This scenario is repeated iteratively, and at
each iteration the costs of the various experts are arbitrary (possibly
even adversarial, trying to mislead the decision maker). The goal of the
decision maker is to do as well as the best expert in hindsight.
The online convex optimization problem captures this problem as
a special case: the set of decisions is the set of all distributions over
n elements (experts), i.e., the n-dimensional simplex $K = \Delta_n = \{\mathbf{x} \in \mathbb{R}^n : \sum_i x_i = 1,\ x_i \ge 0\}$. Let the cost of the $i$'th expert at iteration $t$ be $g_t(i)$, and let $\mathbf{g}_t$ be the cost vector of all $n$ experts. Then the cost function is the expected cost of choosing an expert according to distribution $\mathbf{x}$, and is given by the linear function $f_t(\mathbf{x}) = \mathbf{g}_t^\top \mathbf{x}$.
Thus, prediction from expert advice is a special case of OCO in
which the decision set is the simplex and the cost functions are linear
and bounded, in the $\ell_\infty$ norm, to be at most one. The bound on the cost functions is derived from the bound on the elements of the cost vector $\mathbf{g}_t$.
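As a small illustration of this formulation (not from the text; the loss vectors below are randomly generated), the following Python sketch evaluates the linear costs $f_t(\mathbf{x}) = \mathbf{g}_t^\top \mathbf{x}$ of a fixed distribution over experts and compares its cumulative cost to that of the best expert in hindsight.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 100, 5                        # rounds and number of experts (arbitrary toy sizes)

x = np.ones(n) / n                   # a point of the simplex: the uniform distribution over experts
G = rng.uniform(0.0, 1.0, (T, n))    # g_t(i) in [0, 1]: cost of expert i at iteration t

cost_of_x = (G @ x).sum()            # sum_t f_t(x) with f_t(x) = g_t^T x
best_expert_cost = G.sum(axis=0).min()   # cost of the best single expert in hindsight

print(cost_of_x, best_expert_cost)
```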
The fundamental importance of the experts problem in machine
learning warrants special attention, and we shall return to it and ana-
lyze it in detail at the end of this chapter.
$$\sum_{e=(u,w),\,w\in V} x_e \;=\; 1 \;=\; \sum_{e=(w,v),\,w\in V} x_e \qquad \text{flow value is one}$$
$$\forall w \in V \setminus \{u,v\}: \quad \sum_{e=(w,x)\in E} x_e \;=\; \sum_{e=(x,w)\in E} x_e \qquad \text{flow conservation}$$
$$\forall e \in E: \quad 0 \le x_e \le 1 \qquad \text{capacity constraints}$$
Figure 1.1: Linear equalities and inequalities that define the flow polytope, which is the convex hull of all $u$-$v$ paths.
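To see how these constraints could be assembled in practice, here is a minimal sketch with a made-up four-node graph (the edge list and node names are assumptions, not from the text). It builds the equality constraints in matrix form $A\mathbf{x} = \mathbf{b}$, with the capacity constraints kept as box bounds $0 \le x_e \le 1$, and checks that a single $u$-$v$ path is a feasible 0/1 point.

```python
import numpy as np

# A small directed graph with source u and sink v (a made-up example).
edges = [("u", "a"), ("u", "b"), ("a", "b"), ("a", "v"), ("b", "v")]
nodes = ["u", "a", "b", "v"]
u, v = "u", "v"
m = len(edges)

rows, rhs = [], []

# Flow value is one: the flow out of u and the flow into v both equal 1.
rows.append(np.array([1.0 if e[0] == u else 0.0 for e in edges]))
rhs.append(1.0)
rows.append(np.array([1.0 if e[1] == v else 0.0 for e in edges]))
rhs.append(1.0)

# Flow conservation at every internal node w: flow into w equals flow out of w.
for w in nodes:
    if w in (u, v):
        continue
    row = [(1.0 if e[1] == w else 0.0) - (1.0 if e[0] == w else 0.0) for e in edges]
    rows.append(np.array(row))
    rhs.append(0.0)

A, b = np.vstack(rows), np.array(rhs)

# The flow polytope is {x in [0, 1]^m : A x = b}; every u-v path is a 0/1 vertex of it.
path = np.array([1.0, 0.0, 0.0, 1.0, 0.0])   # the path u -> a -> v
assert path.shape == (m,) and np.allclose(A @ path, b)
print(A, b, sep="\n")
```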
Portfolio selection
In the online setting, for each iteration the decision maker outputs a preference matrix $X_t \in K$, where $K \subseteq \{0,1\}^{n \times m}$ is a subset of all possible zero/one matrices. An adversary then chooses a user/song pair $(i_t, j_t)$ along with a "real" preference for this pair $y_t \in \{0,1\}$. Thus, the loss experienced by the decision maker can be described by the convex loss function
$$f_t(X) = (X_{i_t, j_t} - y_t)^2.$$
The natural comparator in this scenario is a low-rank matrix, which
corresponds to the intuitive assumption that preference is determined
by few unknown factors. Regret with respect to this comparator means
making, on average, as few preference-prediction errors as the
best low-rank matrix.
We return to this problem and explore efficient algorithms for it in
Chapter 7.
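To make the loss concrete, here is a minimal sketch (illustrative only; the matrix sizes and the observed entry are invented, and a fractional matrix is used purely to evaluate the loss, whereas the text's decision set consists of zero/one matrices). It computes $f_t(X)$ and its gradient, which is supported on the single observed entry.

```python
import numpy as np

n, m = 4, 6                          # users x songs (toy sizes)
X = np.full((n, m), 0.5)             # fractional matrix, used only to illustrate the loss

def loss_and_grad(X, i_t, j_t, y_t):
    """f_t(X) = (X[i_t, j_t] - y_t)^2 and its gradient, supported on the observed entry."""
    diff = X[i_t, j_t] - y_t
    grad = np.zeros_like(X)
    grad[i_t, j_t] = 2.0 * diff
    return diff ** 2, grad

f, g = loss_and_grad(X, i_t=2, j_t=3, y_t=1)
print(f, g[2, 3])                    # 0.25 and -1.0
```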
Proof. Assume that there are only two experts and one always chooses
option A while the other always chooses option B. Consider the setting
in which an adversary always chooses the opposite of our prediction (she
can do so, since our algorithm is deterministic). Then, the total number
of mistakes the algorithm makes is T . However, the best expert makes
no more than $T/2$ mistakes (at every iteration exactly one of the two
experts is mistaken). Therefore, there is no algorithm that can always
guarantee less than 2L mistakes.
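This adversarial argument is easy to simulate. The sketch below (illustrative code, not from the text) runs a deterministic follow-the-leader learner against the flipping adversary and reports the mistake counts, which come out to $T$ for the learner and $T/2$ for the best expert.

```python
import numpy as np

T = 1000
expert_mistakes = np.zeros(2, dtype=int)   # expert 0 always says A (=0), expert 1 always says B (=1)
alg_mistakes = 0

for t in range(T):
    # Any deterministic learner works for the argument; here: follow the expert
    # with the fewest mistakes so far (ties broken toward expert 0).
    prediction = int(np.argmin(expert_mistakes))
    truth = 1 - prediction                 # the adversary chooses the opposite of our prediction
    alg_mistakes += int(prediction != truth)
    expert_mistakes += np.array([truth != 0, truth != 1], dtype=int)

print(alg_mistakes)                        # = T: the deterministic learner errs every round
print(expert_mistakes.min())               # <= T/2: one of the two experts is correct each round
```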
Theorem 1.2. Let $\varepsilon \in (0, \tfrac{1}{2})$. Suppose the best expert makes $L$ mistakes. Then:
Using this optimal value of $\varepsilon$, we get that for the best expert $i^\star$
$$M_T \le 2 M_T(i^\star) + O\!\left(\sqrt{M_T(i^\star)\log N}\right).$$
Since the value of $W_T(i)$ is always less than the sum of all weights $\Phi_T$, we conclude that
$$(1-\varepsilon)^{M_T(i)} = W_T(i) \le \Phi_T \le N\left(1 - \frac{\varepsilon}{2}\right)^{M_T}.$$
Taking the logarithm of both sides we get
$$M_T(i)\log(1-\varepsilon) \le \log N + M_T \log\left(1 - \frac{\varepsilon}{2}\right).$$
Next, we use the approximations
$$-x - x^2 \le \log(1-x) \le -x, \qquad 0 < x < \tfrac{1}{2},$$
which follow from the Taylor series of the logarithm function, to obtain that
$$-M_T(i)(\varepsilon + \varepsilon^2) \le \log N - M_T\,\frac{\varepsilon}{2},$$
and the lemma follows.
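The pseudocode of the deterministic weighted majority algorithm falls outside this excerpt, but a minimal sketch consistent with the analysis above (weights multiplied by $(1-\varepsilon)$ on each mistake; binary advice is assumed and the data below is randomly generated purely for illustration) looks as follows.

```python
import numpy as np

def weighted_majority(advice, truth, eps=0.1):
    """Deterministic weighted majority: predict by a weighted vote, then
    multiply the weight of every mistaken expert by (1 - eps), so that
    W_T(i) = (1 - eps)^{M_T(i)} as in the analysis above.

    advice: (T, N) array of 0/1 expert predictions; truth: length-T 0/1 outcomes."""
    T, N = advice.shape
    W = np.ones(N)
    mistakes = 0
    for t in range(T):
        vote_for_one = W[advice[t] == 1].sum()
        prediction = int(vote_for_one >= W.sum() / 2)
        mistakes += int(prediction != truth[t])
        W *= np.where(advice[t] != truth[t], 1.0 - eps, 1.0)
    return mistakes, W

# Toy run on random data (made up purely for illustration).
rng = np.random.default_rng(1)
advice = rng.integers(0, 2, size=(500, 10))
truth = rng.integers(0, 2, size=500)
M_T, W = weighted_majority(advice, truth)
print(M_T, W.round(3))
```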
$$\mathbf{E}[M_T] \le (1+\varepsilon)\, M_T(i) + \frac{\log N}{\varepsilon}.$$
The proof of this lemma is very similar to the previous one, where the factor of two is saved by the use of randomness:

Proof. As before, let $\Phi_t = \sum_{i=1}^N W_t(i)$ for all $t \in [T]$, and note that $\Phi_1 = N$. Inspecting the sum of weights,
$$\Phi_{t+1} = \sum_i W_t(i)\,(1 - \varepsilon\, m_t(i)) = \Phi_t \Big(1 - \varepsilon \sum_i p_t(i)\, m_t(i)\Big) \qquad \Big(p_t(i) = \tfrac{W_t(i)}{\sum_j W_t(j)}\Big)$$
$$= \Phi_t\,\big(1 - \varepsilon\, \mathbf{E}[\tilde m_t]\big) \;\le\; \Phi_t\, e^{-\varepsilon\, \mathbf{E}[\tilde m_t]}. \qquad \big(1 + x \le e^x\big)$$
Since the value of $W_T(i)$ is always less than the sum of all weights $\Phi_T$, we conclude that
1.3.3 Hedge
The RWM algorithm is in fact more general: instead of considering
a discrete number of mistakes, we can consider measuring the perfor-
mance of an expert by a non-negative real number $\ell_t(i)$, which we refer
to as the loss of the expert i at iteration t. The randomized weighted
majority algorithm guarantees that a decision maker following its ad-
vice will incur an average expected loss approaching that of the best
expert in hindsight.
Algorithm 1 Hedge
1: Initialize: $\forall i \in [N]$, $W_1(i) = 1$
2: for $t = 1$ to $T$ do
3:   Pick $i_t \sim_R W_t$, i.e., $i_t = i$ with probability $x_t(i) = W_t(i) / \sum_j W_t(j)$
4:   Incur loss $\ell_t(i_t)$
5:   Update weights $W_{t+1}(i) = W_t(i)\, e^{-\varepsilon \ell_t(i)}$
6: end for
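A short runnable sketch of the Hedge update follows (illustrative code, not the book's; the loss sequence is randomly generated, and the tuning $\varepsilon = \sqrt{\log N / T}$ is a standard choice assumed here rather than prescribed by this chapter).

```python
import numpy as np

def hedge(losses, eps):
    """Run Hedge on a (T, N) array of losses; return the expected loss x_t^T l_t per round."""
    T, N = losses.shape
    W = np.ones(N)
    expected = np.empty(T)
    for t in range(T):
        x = W / W.sum()                     # x_t(i) = W_t(i) / sum_j W_t(j)
        expected[t] = x @ losses[t]
        W *= np.exp(-eps * losses[t])       # W_{t+1}(i) = W_t(i) * exp(-eps * l_t(i))
    return expected

rng = np.random.default_rng(2)
T, N = 2000, 8
losses = rng.uniform(0.0, 1.0, size=(T, N))
eps = np.sqrt(np.log(N) / T)                # one reasonable tuning of eps for losses in [0, 1]
regret = hedge(losses, eps).sum() - losses.sum(axis=0).min()
print(regret)                               # O(sqrt(T log N)) for bounded losses
```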
Theorem 1.5. Let $\ell_t^2$ denote the $N$-dimensional vector of square losses, i.e., $\ell_t^2(i) = \ell_t(i)^2$, let $\varepsilon > 0$, and assume all losses to be non-negative. The Hedge algorithm satisfies for any expert $i^\star \in [N]$:
$$\sum_{t=1}^{T} \mathbf{x}_t^\top \ell_t \;\le\; \sum_{t=1}^{T} \ell_t(i^\star) \;+\; \varepsilon \sum_{t=1}^{T} \mathbf{x}_t^\top \ell_t^2 \;+\; \frac{\log N}{\varepsilon}$$
Proof. As before, let $\Phi_t = \sum_{i=1}^N W_t(i)$ for all $t \in [T]$, and note that $\Phi_1 = N$.
Inspecting the sum of weights:
$$\Phi_{t+1} = \sum_i W_t(i)\, e^{-\varepsilon \ell_t(i)} = \Phi_t \sum_i x_t(i)\, e^{-\varepsilon \ell_t(i)} \qquad \Big(x_t(i) = \tfrac{W_t(i)}{\sum_j W_t(j)}\Big)$$
Since the value of $W_T(i^\star)$ is always less than the sum of all weights $\Phi_T$, we conclude that
$$W_T(i^\star) \le \Phi_T \le N\, e^{-\varepsilon \sum_t \mathbf{x}_t^\top \ell_t + \varepsilon^2 \sum_t \mathbf{x}_t^\top \ell_t^2}.$$
1.4 Exercises
2. (a) Consider the experts problem in which the payoffs are between zero and a positive real number $G > 0$. Give an algorithm that attains expected payoff lower bounded by:
$$\sum_{t=1}^{T} \mathbf{E}[\ell_t(i_t)] \;\ge\; \max_{i^\star \in [N]} \sum_{t=1}^{T} \ell_t(i^\star) \;-\; c\sqrt{T \log N}$$
The OCO model was first defined by Zinkevich (110), has since become widely influential in the learning community, and has been significantly extended (see thesis and surveys (52; 53; 97)).
The problem of prediction from expert advice and the Weighted
Majority algorithm were devised in (71; 73). This seminal work was
one of the first uses of the multiplicative updates method—a ubiquitous
meta-algorithm in computation and learning, see the survey (11) for
more details. The Hedge algorithm was introduced in (44).
The Universal Portfolios model was put forth in (32), and is one
of the first examples of a worst-case online learning model. Cover gave
an optimal-regret algorithm for universal portfolio selection that runs
in exponential time. A polynomial time algorithm was given in (62),
which was further sped up in (7; 54). Numerous extensions to the model
also appeared in the literature, including addition of transaction costs
(20) and relation to the Geometric Brownian Motion model for stock
prices (56).
In their influential paper, Awerbuch and Kleinberg (14) put forth
the application of OCO to online routing. A great deal of work has been
devoted since then to improve the initial bounds, and generalize it into
a complete framework for decision making with limited feedback. This
framework is an extension of OCO, called Bandit Convex Optimization
(BCO). We defer further bibliographic remarks to Chapter 6, which is devoted to the BCO framework.
2 Basic concepts in convex optimization
We denote by $D$ an upper bound on the diameter of $K$:
$$\forall x, y \in K, \quad \|x - y\| \le D.$$
A set $K$ is convex if for any $x, y \in K$, all the points on the line segment connecting $x$ and $y$ also belong to $K$, i.e.,
$$\forall \alpha \in [0, 1], \quad \alpha x + (1-\alpha) y \in K.$$
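As a small numeric illustration (not from the text; the dimension and sample sizes are arbitrary), the simplex $\Delta_n$ from Chapter 1 satisfies both definitions: convex combinations of its points remain in it, and its Euclidean diameter is bounded by $D = \sqrt{2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5                                     # dimension of the simplex (arbitrary)

def in_simplex(x, tol=1e-9):
    """Membership test for the probability simplex Delta_n."""
    return bool(np.all(x >= -tol) and abs(x.sum() - 1.0) < tol)

# Sample points of Delta_n, then check closure under convex combinations
# and the diameter bound ||x - y|| <= sqrt(2).
pts = rng.dirichlet(np.ones(n), size=200)
for _ in range(1000):
    x, y = pts[rng.integers(200)], pts[rng.integers(200)]
    a = rng.uniform()
    assert in_simplex(a * x + (1.0 - a) * y)            # convexity
    assert np.linalg.norm(x - y) <= np.sqrt(2) + 1e-9   # D = sqrt(2) for Delta_n
print("convexity and diameter checks passed")
```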