The Mathematics of Data

IAS/Park City Mathematics Series
Volume 25

Michael W. Mahoney
John C. Duchi
Anna C. Gilbert
Editors
IAS/Park City Mathematics Institute runs mathematics education programs that bring
together high school mathematics teachers, researchers in mathematics and mathematics
education, undergraduate mathematics faculty, graduate students, and undergraduates to
participate in distinct but overlapping programs of research and education. This volume
contains the lecture notes from the Graduate Summer School program held in July 2016.
2010 Mathematics Subject Classification. Primary 15-02, 52-02, 60-02, 62-02, 65-02,
68-02, 90-02.
Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting
for them, are permitted to make fair use of the material, such as to copy select pages for use
in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for permission
to reuse portions of AMS publication content are handled by the Copyright Clearance Center. For
more information, please visit www.ams.org/publications/pubpermissions.
Send requests for translation rights and licensed reprints to [email protected].
© 2018 by the American Mathematical Society. All rights reserved.
The American Mathematical Society retains all rights
except those granted to the United States Government.
Printed in the United States of America.
∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at https://www.ams.org/
Contents
Preface
Introduction
IAS/Park City Mathematics Series
Volume 25, Pages vii–viii
https://doi.org/10.1090/pcms/025/00827
Preface
The IAS/Park City Mathematics Institute (PCMI) was founded in 1991 as part
of the Regional Geometry Institute initiative of the National Science Foundation.
In mid-1993 the program found an institutional home at the Institute for Ad-
vanced Study (IAS) in Princeton, New Jersey.
The IAS/Park City Mathematics Institute encourages both research and educa-
tion in mathematics and fosters interaction between the two. The three-week sum-
mer institute offers programs for researchers and postdoctoral scholars, graduate
students, undergraduate students, high school students, undergraduate faculty,
K-12 teachers, and international teachers and education researchers. The Teacher
Leadership Program also includes weekend workshops and other activities dur-
ing the academic year.
One of PCMI’s main goals is to make all of the participants aware of the full
range of activities that occur in research, mathematics training and mathematics
education: the intention is to involve professional mathematicians in education
and to bring current concepts in mathematics to the attention of educators. To
that end, late afternoons during the summer institute are devoted to seminars and
discussions of common interest to all participants, meant to encourage interaction
among the various groups. Many deal with current issues in education; others
treat mathematical topics at a level which encourages broad participation.
Each year the Research Program and Graduate Summer School focus on a
different mathematical area, chosen to represent some major thread of current
mathematical interest. Activities in the Undergraduate Summer School and Un-
dergraduate Faculty Program are also linked to this topic, the better to encourage
interaction between participants at all levels. Lecture notes from the Graduate
Summer School are published each year in this series. The prior volumes are:
• Volume 1: Geometry and Quantum Field Theory (1991)
• Volume 2: Nonlinear Partial Differential Equations in Differential Geometry
(1992)
• Volume 3: Complex Algebraic Geometry (1993)
• Volume 4: Gauge Theory and the Topology of Four-Manifolds (1994)
• Volume 5: Hyperbolic Equations and Frequency Interactions (1995)
• Volume 6: Probability Theory and Applications (1996)
• Volume 7: Symplectic Geometry and Topology (1997)
• Volume 8: Representation Theory of Lie Groups (1998)
• Volume 9: Arithmetic Algebraic Geometry (1999)
Introduction
“The Mathematics of Data” was the topic for the 26th annual Park City Mathe-
matics Institute (PCMI) summer session, held in July 2016. To those more familiar
with very abstract areas of mathematics or more applied areas of data—the latter
going these days by names such as “big data” or “data science”—it may come as
a surprise that such an area even exists. A moment’s thought, however, should
dispel such a misconception. After all, data must be modeled, e.g., by a matrix or
a graph or a flat table, and if one performs similar operations on very different
types of data, then there is an expectation that there must be some sort of com-
mon mathematical structure, e.g., from linear algebra or graph theory or logic.
So too, ignorance or errors or noise in the data can be modeled, and it should be
plausible that how well operations perform on data depends not just on how well
data are modeled but also on how well ignorance or noise or errors are modeled.
So too, the operations themselves can be modeled, e.g., to make statements such
as whether the operations answer a precise question, exactly or approximately, or
whether they will return a solution in a reasonable amount of time.
As such, “The Mathematics of Data” fits squarely in applied mathematics—
when that term is broadly, not narrowly, defined. Technically, it represents some
combination of what is traditionally the domain of linear algebra and probability
and optimization and other related areas. Moreover, while some of the work
in this area takes place in mathematics departments, much of the work in the
area takes place in computer science, statistics, and other related departments.
This was the challenge and opportunity we faced, both in designing the graduate
summer school portion of the PCMI summer session, as well as in designing this
volume. With respect to the latter, while the area is not sufficiently mature to say
the final word, we have tried to capture the major trends in the mathematics of
data sufficiently broadly and at a sufficiently introductory level that this volume
could be used as a teaching resource for students with backgrounds in any of the
wide range of areas related to the mathematics of data.
The first chapter, “Lectures on Randomized Numerical Linear Algebra,” pro-
vides an overview of linear algebra, probability, and ways in which they interact
fruitfully in many large-scale data applications. Matrices are a common way to
model data, e.g., an m × n matrix provides a natural way to describe m objects,
each of which is described by n features, and thus linear algebra, as well as more
sophisticated variants such as functional analysis and linear operator theory, are
central to the mathematics of data. An interesting twist is that, while work in nu-
merical linear algebra and scientific computing typically focuses on deterministic
algorithms that return answers to machine precision, randomness can be used
in novel algorithmic and statistical ways in matrix algorithms for data. While
randomness is often assumed to be a property of the data (e.g., think of noise
being modeled by random variables drawn from a Gaussian distribution), it can
also be a powerful algorithmic resource to speed up algorithms (e.g., think of
Monte Carlo and Markov Chain Monte Carlo methods), and many of the most
interesting and exciting developments in the mathematics of data explore this
algorithmic-statistical interface. This chapter, in particular, describes the use of
these methods for the development of improved algorithms for fundamental and
ubiquitous matrix problems such as matrix multiplication, least-squares approxi-
mation, and low-rank matrix approximation.
The second chapter, “Optimization Algorithms for Data Analysis,” goes one
step beyond basic linear algebra problems, which themselves are special cases
of optimization problems, to consider more general optimization problems. Op-
timization problems are ubiquitous throughout data science, and a wide class
of problems can be formulated as optimizing smooth functions, possibly with
simple constraints or structured nonsmooth regularizers. This chapter describes
some canonical problems in data analysis and their formulation as optimization
problems. It also describes iterative algorithms (i.e., those that generate a se-
quence of points) that, for convex objective functions, converge to the set of solu-
tions of such problems. Algorithms covered include first-order methods that de-
pend on gradients, so-called accelerated gradient methods, and Newton’s second-
order method that can guarantee convergence to points that approximately satisfy
second-order conditions for a local minimizer of a smooth nonconvex function.
The third chapter, “Introductory Lectures on Stochastic Optimization,” cov-
ers the basic analytical tools and algorithms necessary for stochastic optimiza-
tion. Stochastic optimization problems are problems whose definition involves
randomness, e.g., minimizing the expectation of some function; and stochastic
optimization algorithms are algorithms that generate and use random variables
to find the solution of a (perhaps deterministic) problem. As with the use of ran-
domness in Randomized Numerical Linear Algebra, there is an interesting syn-
ergy between the two ways in which stochasticity appears. This chapter builds
the necessary convex analytic and other background, and it describes gradient
and subgradient first-order methods for the solution of these types of problems.
These methods tend to be simple methods that are slower to converge than more
advanced methods—such as Newton’s or other second-order methods—for deter-
ministic problems, but they have the advantage that they can be robust to noise
in the optimization problem itself. Also covered are mirror descent and adap-
tive methods, as well as methods for proving upper and lower bounds on such
stochastic algorithms.
from a class of data structures. An important point is that linear algebra can
be enriched to cover not merely linear transformations—the 99.9% use case—but
also sequences of linear transformations that form complexes, thus opening the
possibility of further mathematical developments.
Overall, the 2016 PCMI summer program included minicourses by Petros
Drineas, John Duchi, Cynthia Dwork and Kunal Talwar, Robert Ghrist, Piotr
Indyk, Mauro Maggioni, Gunnar Martinsson, Roman Vershynin, and Stephen
Wright. This volume consists of contributions, summarized above, by Petros
Drineas (with Michael Mahoney), Stephen Wright, John Duchi, Gunnar Martins-
son, Roman Vershynin, and Robert Ghrist. Each chapter in this volume was
written by a different author, and so each chapter has its own style, in-
cluding notational differences, but we have taken some effort to ensure that they
can fruitfully be read together.
Putting together such an effort—both the entire summer session as well as this
volume—is not a minor undertaking, but for us it was not difficult, due to the
large amount of support we received. We would first like to thank Richard Hain,
the former PCMI Program Director, who first invited us to organize the summer
school, as well as Rafe Mazzeo, the current PCMI Program Director, who pro-
vided seamless guidance throughout the entire process. In terms of running the
summer session, a special thank you goes out to the entire PCMI staff, and in par-
ticular to Beth Brainard and Dena Vigil as well as Bryna Kra and Michelle Wachs.
We received a lot of feedback from participants who enjoyed the event, and Beth
and Dena deserve much of the credit for making it run smoothly; and Bryna and
Michelle’s role with the graduate steering committee helped us throughout the
entire process. In terms of this volume, in addition to thanking the authors for
their efforts and (usually) getting back to us in a timely manner, we would like to
thank Ian Morrison, who is the PCMI Publisher. Putting together a volume such
as this can be a tedious task, but for us it was not, and this is in large part due to
Ian’s help and guidance.
IAS/Park City Mathematics Series
Volume 25, Pages 1–48
https://doi.org/10.1090/pcms/025/00829
Contents
1 Introduction
2 Linear Algebra
2.1 Basics
2.2 Norms
2.3 Vector norms
2.4 Induced matrix norms
2.5 The Frobenius norm
2.6 The Singular Value Decomposition
2.7 SVD and Fundamental Matrix Spaces
2.8 Matrix Schatten norms
2.9 The Moore-Penrose pseudoinverse
2.10 References
3 Discrete Probability
3.1 Random experiments: basics
3.2 Properties of events
3.3 The union bound
3.4 Disjoint events and independent events
3.5 Conditional probability
3.6 Random variables
3.7 Probability mass function and cumulative distribution function
3.8 Independent random variables
3.9 Expectation of a random variable
3.10 Variance of a random variable
3.11 Markov's inequality
3.12 The Coupon Collector Problem
3.13 References
4 Randomized Matrix Multiplication
4.1 Analysis of the RandMatrixMultiply algorithm
4.2 Analysis of the algorithm for nearly optimal probabilities
1. Introduction
Matrices are ubiquitous in computer science, statistics, and applied mathemat-
ics. An m × n matrix can encode information about m objects (each described
by n features), or the behavior of a discretized differential operator on a finite
element mesh; an n × n positive-definite matrix can encode the correlations be-
tween all pairs of n objects, or the edge-connectivity between all pairs of n nodes
in a social network; and so on. Motivated largely by technological developments
that generate extremely large scientific and Internet data sets, recent years have
witnessed exciting developments in the theory and practice of matrix algorithms.
Particularly remarkable is the use of randomization—typically assumed to be a
property of the input data due to, e.g., noise in the data generation mechanisms—
as an algorithmic or computational resource for the development of improved
algorithms for fundamental matrix problems such as matrix multiplication, least-
squares (LS) approximation, low-rank matrix approximation, etc.
Randomized Numerical Linear Algebra (RandNLA) is an interdisciplinary re-
search area that exploits randomization as a computational resource to develop
improved algorithms for large-scale linear algebra problems. From a founda-
tional perspective, RandNLA has its roots in theoretical computer science (TCS),
with deep connections to mathematics (convex analysis, probability theory, met-
ric embedding theory) and applied mathematics (scientific computing, signal pro-
cessing, numerical linear algebra). From an applied perspective, RandNLA is a
vital new tool for machine learning, statistics, and data analysis. Well-engineered
implementations have already outperformed highly-optimized software libraries
2. Linear Algebra
In this section, we present a brief overview of basic linear algebraic facts and
notation that will be useful in this chapter. We assume basic familiarity with
linear algebra (e.g., inner/outer products of vectors, basic matrix operations such
as addition, scalar multiplication, transposition, upper/lower triangular matrices,
matrix-vector products, matrix multiplication, matrix trace, etc.).
2.1. Basics. We will entirely focus on matrices and vectors over the reals. We
will use the notation x ∈ Rn to denote an n-dimensional vector: notice the use
of bold Latin lowercase letters for vectors. Vectors will always be assumed to be
column vectors, unless explicitly noted otherwise. The vector of all zeros will be
denoted as 0, while the vector of all ones will be denoted as 1; dimensions will
be implied from context or explicitly included as a subscript.
We will use bold Latin uppercase letters for matrices, e.g., A ∈ Rm×n denotes
an m × n matrix A. We will use the notation Ai∗ to denote the i-th row of A as
a row vector and A∗i to denote the i-th column of A as a column vector. The
(square) identity matrix will be denoted as In where n denotes the number of
rows and columns. Finally, we use ei to denote the i-th column of In , i.e., the
i-th canonical vector.
This family of norms is named "induced" because they are realized by a non-zero vector x that varies depending on A and p. Thus, there exists a unit norm vector (unit norm in the p-norm) x such that $\|A\|_p = \|Ax\|_p$. The induced matrix p-norms follow the submultiplicativity laws:
$$\|Ax\|_p \le \|A\|_p\|x\|_p \quad\text{and}\quad \|AB\|_p \le \|A\|_p\|B\|_p.$$
Furthermore, matrix p-norms are invariant to permutations: $\|PAQ\|_p = \|A\|_p$, where P and Q are permutation matrices of appropriate dimensions. Also, if we consider the matrix with permuted rows and columns
$$PAQ = \begin{pmatrix} B & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$
then the norm of the submatrix is related to the norm of the full unpermuted matrix as follows: $\|B\|_p \le \|A\|_p$. The following relationships between matrix p-norms are relatively easy to prove. Given a matrix $A \in \mathbb{R}^{m\times n}$,
$$\frac{1}{\sqrt{n}}\,\|A\|_\infty \le \|A\|_2 \le \sqrt{m}\,\|A\|_\infty,$$
$$\frac{1}{\sqrt{m}}\,\|A\|_1 \le \|A\|_2 \le \sqrt{n}\,\|A\|_1.$$
It is also the case that $\|A^T\|_1 = \|A\|_\infty$ and $\|A^T\|_\infty = \|A\|_1$. While transposition affects the infinity and one norm of a matrix, it does not affect the two norm, i.e., $\|A^T\|_2 = \|A\|_2$. Also, the matrix two-norm is not affected by pre- (or post-) multiplication with matrices whose columns (or rows) are orthonormal vectors: $\|UAV^T\|_2 = \|A\|_2$, where U and V are orthonormal matrices ($U^TU = I$ and $V^TV = I$) of appropriate dimensions.
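To make these norm relationships concrete, the following small NumPy check (ours, not part of the original lecture notes; the matrix dimensions are arbitrary illustrative choices) verifies them on a random matrix:

```python
import numpy as np

# Numerically check the relationships between the induced matrix p-norms
# stated above, on a random 7 x 5 matrix.
rng = np.random.default_rng(0)
m, n = 7, 5
A = rng.standard_normal((m, n))

one_norm = np.linalg.norm(A, 1)       # max absolute column sum
inf_norm = np.linalg.norm(A, np.inf)  # max absolute row sum
two_norm = np.linalg.norm(A, 2)       # largest singular value

# (1/sqrt(n)) ||A||_inf <= ||A||_2 <= sqrt(m) ||A||_inf
assert inf_norm / np.sqrt(n) <= two_norm <= np.sqrt(m) * inf_norm
# (1/sqrt(m)) ||A||_1 <= ||A||_2 <= sqrt(n) ||A||_1
assert one_norm / np.sqrt(m) <= two_norm <= np.sqrt(n) * one_norm
# Transposition swaps the one and infinity norms and preserves the two norm.
assert np.isclose(np.linalg.norm(A.T, 1), inf_norm)
assert np.isclose(np.linalg.norm(A.T, np.inf), one_norm)
assert np.isclose(np.linalg.norm(A.T, 2), two_norm)
```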
2.5. The Frobenius norm. The Frobenius norm is not an induced norm, as it
belongs to the family of Schatten norms (to be discussed in Section 2.8).
Definition 2.5.1. Given a matrix $A \in \mathbb{R}^{m\times n}$, we define the Frobenius norm as:
$$\|A\|_F = \sqrt{\sum_{j=1}^{n}\sum_{i=1}^{m}A_{ij}^2} = \sqrt{\mathrm{Tr}\left(A^TA\right)},$$
where $\mathrm{Tr}(\cdot)$ denotes the matrix trace (where, recall, the trace of a square matrix is defined to be the sum of the elements on the main diagonal).
Informally, the Frobenius norm measures the variance or variability (which can be given an interpretation of size or mass) of a matrix. Given a vector $x \in \mathbb{R}^n$, its Frobenius norm is equal to its Euclidean norm, i.e., $\|x\|_F = \|x\|_2$. Transposition of a matrix $A \in \mathbb{R}^{m\times n}$ does not affect its Frobenius norm, i.e., $\|A\|_F = \|A^T\|_F$.
Similar to the two norm, the Frobenius norm does not change under permutations
Definition 2.6.1. Given a matrix $A \in \mathbb{R}^{m\times n}$, we define its full SVD as:
$$A = U\Sigma V^T = \sum_{i=1}^{\min\{m,n\}}\sigma_i u_i v_i^T,$$
where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices that contain the left and right singular vectors of A, respectively, and $\Sigma \in \mathbb{R}^{m\times n}$ is a diagonal matrix with the singular values $\sigma_i$, $i = 1,\ldots,\min\{m,n\}$, of A in decreasing order on the diagonal.
Definition 2.6.4. Given a matrix $A \in \mathbb{R}^{m\times n}$ of rank $\rho \le \min\{m,n\}$, we define its thin SVD as:
$$A = \underbrace{U}_{m\times\rho}\,\underbrace{\Sigma}_{\rho\times\rho}\,\underbrace{V^T}_{\rho\times n} = \sum_{i=1}^{\rho}\sigma_i u_i v_i^T,$$
Theorem 2.6.5. Let $A = U\Sigma V^T \in \mathbb{R}^{m\times n}$ be the thin SVD of A; let k be an integer less than $\rho = \mathrm{rank}(A)$; and let $A_k = \sum_{i=1}^{k}\sigma_i u_i v_i^T = U_k\Sigma_k V_k^T$. Then,
$$\sigma_{k+1} = \min_{B\in\mathbb{R}^{m\times n},\ \mathrm{rank}(B)=k}\|A - B\|_2 = \|A - A_k\|_2$$
and
$$\sum_{j=k+1}^{\rho}\sigma_j^2 = \min_{B\in\mathbb{R}^{m\times n},\ \mathrm{rank}(B)=k}\|A - B\|_F^2 = \|A - A_k\|_F^2.$$
Let $U_{k,\perp}$ (respectively, $V_{k,\perp}$) denote the matrix of the bottom $\rho - k$ nonzero left (respectively, right) singular vectors of A; and let $\Sigma_{k,\perp} \in \mathbb{R}^{(\rho-k)\times(\rho-k)}$ denote the diagonal matrix containing the bottom $\rho - k$ singular values of A. Then,
$$(2.6.6)\quad A_k = U_k\Sigma_k V_k^T \quad\text{and}\quad A_{k,\perp} = A - A_k = U_{k,\perp}\Sigma_{k,\perp}V_{k,\perp}^T.$$
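As a quick numerical illustration of Theorem 2.6.5 (a check we add here, not part of the original text), one can form $A_k$ from the thin SVD and verify that its Frobenius-norm error equals the square root of the sum of the trailing squared singular values:

```python
import numpy as np

# Verify the Frobenius-norm optimality statement of Theorem 2.6.5.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD of A
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation

# ||A - A_k||_F equals the square root of the trailing squared singular values.
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```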
2.7. SVD and Fundamental Matrix Spaces. Any matrix A ∈ Rm×n defines four
fundamental spaces:
The Column Space of A: This space is spanned by the columns of A:
range(A) = {b : Ax = b, x ∈ Rn } ⊂ Rm .
The Null Space of A: This space is spanned by all vectors x ∈ Rn such that
Ax = 0:
null(A) = {x : Ax = 0} ⊂ Rn .
The Row Space of A: This space is spanned by the rows of A:
$$\mathrm{range}(A^T) = \{d : A^Ty = d,\ y \in \mathbb{R}^m\} \subset \mathbb{R}^n.$$
The Left Null Space of A: This space is spanned by all vectors $y \in \mathbb{R}^m$ such that $A^Ty = 0$:
$$\mathrm{null}(A^T) = \{y : A^Ty = 0\} \subset \mathbb{R}^m.$$
The SVD reveals orthogonal bases for all these spaces. Given a matrix $A \in \mathbb{R}^{m\times n}$, with $\mathrm{rank}(A) = \rho$, its SVD can be written as:
$$A = \begin{pmatrix} U_\rho & U_{\rho,\perp} \end{pmatrix}\begin{pmatrix} \Sigma_\rho & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_\rho^T \\ V_{\rho,\perp}^T \end{pmatrix}.$$
Then,
$$\mathrm{range}(A) = \mathrm{range}(U_\rho),\quad \mathrm{null}(A) = \mathrm{range}(V_{\rho,\perp}),\quad \mathrm{range}(A^T) = \mathrm{range}(V_\rho),\quad \mathrm{null}(A^T) = \mathrm{range}(U_{\rho,\perp}).$$
2.8. Matrix Schatten norms. The matrix Schatten norms are a special family of
norms that are defined on the vector containing the singular values of a matrix.
Given a matrix $A \in \mathbb{R}^{m\times n}$ with singular values $\sigma_1 \ge \cdots \ge \sigma_\rho > 0$, we define the Schatten p-norm as:
$$\|A\|_p = \left(\sum_{i=1}^{\rho}\sigma_i^p\right)^{1/p}.$$
Schatten one-norm: The nuclear norm, i.e., the sum of the singular values.
Schatten two-norm: The Frobenius norm, i.e., the square root of the sum of
the squares of the singular values.
Schatten infinity-norm: The spectral norm, defined as the limit as p → ∞
of the Schatten p-norm, i.e., the largest singular value.
Schatten norms are orthogonally invariant and submultiplicative, and they satisfy
Hölder’s inequality.
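The special cases above are easy to check numerically; the short snippet below (our addition) computes the Schatten 1-, 2-, and infinity-norms from the singular values and compares them with the nuclear, Frobenius, and spectral norms:

```python
import numpy as np

# Schatten p-norms computed directly from the singular values.
A = np.random.default_rng(2).standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)

nuclear = np.sum(s)                  # Schatten one-norm
frobenius = np.sqrt(np.sum(s ** 2))  # Schatten two-norm
spectral = np.max(s)                 # Schatten infinity-norm

assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(spectral, np.linalg.norm(A, 2))
```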
2.9. The Moore-Penrose pseudoinverse. A generalization of the well-known no-
tion of matrix inverse is the Moore-Penrose pseudoinverse. Formally, given a
matrix A ∈ Rm×n, a matrix A† is the Moore-Penrose pseudoinverse of A if it
satisfies the following properties:
(1) AA† A = A.
(2) A† AA† = A† .
(3) $(AA^\dagger)^T = AA^\dagger$.
(4) $(A^\dagger A)^T = A^\dagger A$.
Given a matrix $A \in \mathbb{R}^{m\times n}$ of rank ρ and its thin SVD
$$A = \sum_{i=1}^{\rho}\sigma_i u_i v_i^T,$$
its pseudoinverse is $A^\dagger = \sum_{i=1}^{\rho}\sigma_i^{-1}v_i u_i^T$. If $Y_1 \in \mathbb{R}^{m\times p}$ has full column rank and $Y_2 \in \mathbb{R}^{p\times n}$ has full row rank, then
$$(2.9.1)\quad (Y_1Y_2)^\dagger = Y_2^\dagger Y_1^\dagger.$$
(We emphasize that the condition on the ranks is crucial: while the inverse of
the product of two matrices always equals the product of the inverses of those
matrices, the analogous statement is not true in full generality for the Moore-
Penrose pseudoinverse [5].)
The fundamental spaces of the Moore-Penrose pseudoinverse are connected
with those of the actual matrix. Given a matrix A and its Moore-Penrose pseu-
doinverse A† , the column space of A† can be defined as:
$$\mathrm{range}(A^\dagger) = \mathrm{range}(A^TA) = \mathrm{range}(A^T),$$
and it is orthogonal to the null space of A. The null space of A† can be defined as:
$$\mathrm{null}(A^\dagger) = \mathrm{null}(AA^T) = \mathrm{null}(A^T),$$
and it is orthogonal to the column space of A.
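The four defining properties above are straightforward to verify numerically; the following check (ours, for illustration) uses a deliberately rank-deficient matrix:

```python
import numpy as np

# Check the four defining properties of the Moore-Penrose pseudoinverse.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))  # rank <= 3
A_pinv = np.linalg.pinv(A)

assert np.allclose(A @ A_pinv @ A, A)            # (1) A A+ A = A
assert np.allclose(A_pinv @ A @ A_pinv, A_pinv)  # (2) A+ A A+ = A+
assert np.allclose((A @ A_pinv).T, A @ A_pinv)   # (3) (A A+)^T = A A+
assert np.allclose((A_pinv @ A).T, A_pinv @ A)   # (4) (A+ A)^T = A+ A
```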
2.10. References. We refer the interested reader to [5, 13, 26, 27] for additional
background on linear algebra and matrix computations, as well as to [4, 25] for
additional background on matrix perturbation theory.
3. Discrete Probability
In this section, we present a brief overview of discrete probability. More
advanced results (in particular, Bernstein-type inequalities for real-valued and
matrix-valued random variables) will be introduced in the appropriate context
later in the chapter. It is worth noting that most of RandNLA builds upon simple,
fundamental principles of discrete (instead of continuous) probability.
3.1. Random experiments: basics. A random experiment is any procedure that
can be infinitely repeated and has a well-defined set of possible outcomes. Typi-
cal examples are the roll of a die or the toss of a coin. The sample space Ω of a
random experiment is the set of all possible outcomes of the random experiment.
If the random experiment only has two possible outcomes (e.g., success and fail-
ure) then it is often called a Bernoulli trial. In discrete probability, the sample
space Ω is finite. (We will not cover countably or uncountably infinite sample
spaces in this chapter.)
An event is any subset of the sample space Ω. Clearly, the set of all possible
events is the powerset (the set of all possible subsets) of Ω, often denoted as
2Ω . As an example, consider the following random experiment: toss a coin three
times. Then, the sample space Ω is
Ω = {HHH, HHT , HT H, HT T , T HH, T HT , T T H, T T T }
and an event E could be described in words as “the output of the random exper-
iment was either all heads or all tails”. Then, E = {HHH, T T T }. The probability
measure or probability function maps the (finite) sample space Ω to the interval [0, 1]. Formally, let Pr[ω] for all ω ∈ Ω be a function whose domain is Ω and whose range is the interval [0, 1]. This function has the so-called normalization property, namely
$$\sum_{\omega\in\Omega}\Pr[\omega] = 1.$$
If E is an event, then
$$(3.1.1)\quad \Pr[E] = \sum_{\omega\in E}\Pr[\omega],$$
namely the probability of an event is the sum of the probabilities of its elements.
It follows that the probability of the empty event (the event E that corresponds
to the empty set) is equal to zero, whereas the probability of the event Ω (clearly
Ω itself is an event) is equal to one. Finally, the uniform probability function is
defined as Pr [ω] = 1/ |Ω|, for all ω ∈ Ω.
3.2. Properties of events. Recall that events are sets and thus set operations
(union, intersection, complementation) are applicable. Assuming finite sample
spaces and using Eqn. (3.1.1), it is easy to prove the following property for the
union of two events E1 and E2 :
Pr [E1 ∪ E2 ] = Pr [E1 ] + Pr [E2 ] − Pr [E1 ∩ E2 ] .
This property follows from the well-known inclusion-exclusion principle for set
union and can be generalized to more than two sets and thus to more than two
events. Similarly, one can prove that Pr Ē = 1 − Pr [E] . In the above, Ē denotes
the complement of the event E. Finally, it is trivial to see that if E1 is a subset of
E2 then Pr [E1 ] Pr [E2 ] .
3.3. The union bound. The union bound is a fundamental result in discrete
probability and can be used to bound the probability of a union of events without
any special assumptions on the relationships between the events. Indeed, let Ei
for all $i = 1,\ldots,n$ be events defined over a finite sample space Ω. Then, the union bound states that
$$\Pr\left[\bigcup_{i=1}^{n}E_i\right] \le \sum_{i=1}^{n}\Pr[E_i].$$
The proof of the union bound is quite simple and can be done by induction, using
the inclusion-exclusion principle for two sets that was discussed in the previous
section.
3.4. Disjoint events and independent events. Two events E1 and E2 are called
disjoint or mutually exclusive if their intersection is the empty set, i.e., if
E1 ∩ E2 = ∅.
This can be generalized to any number of events by necessitating that the events
are all pairwise disjoint. Two events E1 and E2 are called independent if the oc-
currence of one does not affect the probability of the other. Formally, they must
satisfy
Pr [E1 ∩ E2 ] = Pr [E1 ] · Pr [E2 ] .
Again, this can be generalized to more than two events by necessitating that the
events are all pairwise independent.
3.5. Conditional probability. For any two events E1 and E2 , the conditional
probability Pr [E1 |E2 ] is the probability that E1 occurs given that E2 occurs. For-
mally,
$$\Pr[E_1\mid E_2] = \frac{\Pr[E_1\cap E_2]}{\Pr[E_2]}.$$
In the above, X(Ω) is the image of the random variable X over the sample space
Ω; recall that X is a function. That is, the sum is over the range of the random
variable X. Alternatively, E [X] can be expressed in terms of a sum over the domain
of X, i.e., over Ω. For finite sample spaces Ω, such as those that arise in discrete
probability, we get
E [X] = X(ω)Pr [ω] .
ω∈Ω
We now discuss fundamental properties of the expectation. The most important
property is linearity of expectation: for any random variables X and Y and real
number λ,
E [f(X)] = 1 · Pr [X t] + 0 · Pr [X < t] = Pr [X t] .
Clearly, from the function definition, f(X) X
t . Taking expectation on both sides:
X E [X]
E [f(X)] E = .
t t
Thus,
$$\Pr[X \ge t] \le \frac{E[X]}{t}.$$
Hence, we conclude the proof of Markov’s inequality.
3.12. The Coupon Collector Problem. Suppose there are m types of coupons
and we seek to collect them in independent trials, where in each trial the proba-
bility of obtaining any one coupon is 1/m (uniform). Let X denote the number of
trials that we need in order to collect at least one coupon of each type. Then, one
can prove that [20, Section 3.6]
$$E[X] = m\sum_{i=1}^{m}\frac{1}{i} = m\ln m + \Theta(m).$$
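A short simulation (our addition, with illustrative parameter choices) makes this bound tangible by comparing the empirical mean of X with $m\sum_{i=1}^{m}1/i$:

```python
import numpy as np

# Simulate the coupon collector problem and compare the empirical mean
# of X with m * H_m, where H_m is the m-th harmonic number.
rng = np.random.default_rng(4)
m, trials = 50, 2000

def collect_all(m, rng):
    seen, draws = set(), 0
    while len(seen) < m:            # draw until every coupon type is seen
        seen.add(int(rng.integers(m)))
        draws += 1
    return draws

empirical = np.mean([collect_all(m, rng) for _ in range(trials)])
harmonic = np.sum(1.0 / np.arange(1, m + 1))
print(empirical, m * harmonic)      # the two values should be close
```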
4. Randomized Matrix Multiplication

Given matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$, their product can be written as a sum of n rank-one outer products:
$$(4.0.1)\quad AB = \sum_{k=1}^{n}A_{*k}B_{k*},$$
where each of the summands is the outer product of a column of A and the corre-
sponding row of B. Recall that the standard definition of matrix multiplication
states that the (i, j)-th entry of the matrix product AB is equal to the inner product
of the i-th row of A and the j-th column of B, namely
$$(AB)_{ij} = A_{i*}B_{*j} \in \mathbb{R}.$$
It is easy to see that the two definitions are equivalent. However, when matrix
multiplication is formulated as in Eqn. (4.0.1), a simple randomized algorithm
(1) For t = 1 to c,
• Pick $i_t \in \{1,\ldots,n\}$ with $\Pr[i_t = k] = p_k$, independently and with replacement.
• Set $C_{*t} = \frac{1}{\sqrt{c\,p_{i_t}}}A_{*i_t}$ and $R_{t*} = \frac{1}{\sqrt{c\,p_{i_t}}}B_{i_t*}$.
(2) Return $CR = \sum_{t=1}^{c}\frac{1}{c\,p_{i_t}}A_{*i_t}B_{i_t*}$.
Remark 4.0.4. The choice of the sampling probabilities {pk }nk=1 in the RandMa-
trixMultiply algorithm is very important. As we will prove in Lemma 4.1.1,
the estimator returned by the RandMatrixMultiply algorithm is (in an element-
wise sense) unbiased, regardless of our choice of the sampling probabilities. How-
ever, a natural notion of the variance of our estimator (see Theorem 4.1.2 for a
precise definition) is minimized when the sampling probabilities are set to
$$p_k = \frac{\left\|A_{*k}B_{k*}\right\|_F}{\sum_{k'=1}^{n}\left\|A_{*k'}B_{k'*}\right\|_F} = \frac{\left\|A_{*k}\right\|_2\left\|B_{k*}\right\|_2}{\sum_{k'=1}^{n}\left\|A_{*k'}\right\|_2\left\|B_{k'*}\right\|_2}.$$
In words, the best choice when sampling rank-one matrices from the summation
of Eqn. (4.0.1) is to select rank-one matrices that have larger Frobenius norms
with higher probabilities. This is equivalent to selecting column-row pairs that
have larger (products of) Euclidean norms with higher probability.
Remark 4.0.5. This approach for approximating matrix multiplication has several
advantages. First, it is conceptually simple. Second, since the heart of the algo-
rithm involves matrix multiplication of smaller matrices, it can use any algorithms
that exist in the literature for performing the desired matrix multiplication. Third,
this approach does not tamper with the sparsity of the input matrices. Finally, the
algorithm can be easily implemented in one pass over the input matrices A and
B, given the sampling probabilities {pk }nk=1 . See [9, Section 4.2] for a detailed dis-
cussion regarding the implementation of the RandMatrixMultiply algorithm in
the pass-efficient and streaming models of computation.
$$\mathrm{Var}\left[(CR)_{ij}\right] = \sum_{t=1}^{c}\mathrm{Var}\left[X_t\right] \le c\sum_{k=1}^{n}\frac{A_{ik}^2B_{kj}^2}{c^2p_k} = \frac{1}{c}\sum_{k=1}^{n}\frac{A_{ik}^2B_{kj}^2}{p_k},$$
which concludes the proof of the lemma.
Our next result bounds the expectation of the Frobenius norm of the error
matrix AB − CR. Notice that this error metric depends on our choice of the
sampling probabilities $\{p_k\}_{k=1}^{n}$.
$$= \frac{1}{c}\sum_{k=1}^{n}\frac{1}{p_k}\left\|A_{*k}\right\|_2^2\left\|B_{k*}\right\|_2^2.$$
Let $p_k$ be as in Eqn. (4.1.4); then
$$E\left[\|AB - CR\|_F^2\right] \le \frac{1}{c}\left(\sum_{k=1}^{n}\left\|A_{*k}\right\|_2\left\|B_{k*}\right\|_2\right)^2.$$
Finally, to prove that the aforementioned choice for the $\{p_k\}_{k=1}^{n}$ minimizes the quantity $E\left[\|AB - CR\|_F^2\right]$, define the function
$$f(p_1,\ldots,p_n) = \sum_{k=1}^{n}\frac{1}{p_k}\left\|A_{*k}\right\|_2^2\left\|B_{k*}\right\|_2^2,$$
which characterizes the dependence of $E\left[\|AB - CR\|_F^2\right]$ on the $p_k$'s. In order to minimize f subject to $\sum_{k=1}^{n}p_k = 1$, we can introduce the Lagrange multiplier λ and define the function
$$g(p_1,\ldots,p_n) = f(p_1,\ldots,p_n) + \lambda\left(\sum_{k=1}^{n}p_k - 1\right).$$
We then have the minimum at
$$0 = \frac{\partial g}{\partial p_k} = -\frac{1}{p_k^2}\left\|A_{*k}\right\|_2^2\left\|B_{k*}\right\|_2^2 + \lambda.$$
Thus,
$$p_k = \frac{\left\|A_{*k}\right\|_2\left\|B_{k*}\right\|_2}{\sqrt{\lambda}} = \frac{\left\|A_{*k}\right\|_2\left\|B_{k*}\right\|_2}{\sum_{k'=1}^{n}\left\|A_{*k'}\right\|_2\left\|B_{k'*}\right\|_2},$$
where the second equality comes from solving for $\sqrt{\lambda}$ using $\sum_{k=1}^{n}p_k = 1$. These probabilities are minimizers of f because $\frac{\partial^2 g}{\partial p_k^2} > 0$ for all k.
We conclude this section by pointing out that we can apply Markov’s inequal-
ity on the expectation bound of Theorem 4.1.2 in order to get bounds for the
Frobenius norm of the error matrix AB − CR that hold with constant probabil-
ity. We refer the reader to [9, Section 4.4] for a tighter analysis, arguing for a
better (in the sense of better dependence on the failure probability than provided
by Markov’s inequality) concentration of the Frobenius norm of the error matrix
around its mean using a martingale argument.
4.2. Analysis of the algorithm for nearly optimal probabilities. We now dis-
cuss three different choices for the sampling probabilities that are easy to analyze
and will be useful in this chapter. We summarize these results in the following
list; all three bounds can be easily proven following the proof of Theorem 4.1.2.
Nearly optimal probabilities, depending on both A and B: If the $\{p_k\}_{k=1}^{n}$ satisfy
$$(4.2.1)\quad \sum_{k=1}^{n}p_k = 1 \quad\text{and}\quad p_k \ge \frac{\beta\left\|A_{*k}\right\|_2\left\|B_{k*}\right\|_2}{\sum_{k'=1}^{n}\left\|A_{*k'}\right\|_2\left\|B_{k'*}\right\|_2},$$
(The quantities $\left\|U_{k*}\right\|_2^2$ are known as leverage scores [17]; and the probabilities given by Eqn. (4.3.2) are nearly-optimal, in the sense of Eqn. (4.2.1), i.e., in the sense that they approximate the optimal probabilities for approximating the matrix product shown in Eqn. (4.3.1), up to a β factor.) Applying Markov's inequality to the bound of Eqn. (4.3.1) and setting
$$(4.3.3)\quad c = \frac{10d^2}{\beta\epsilon^2},$$
we get that, with probability at least 9/10,
$$(4.3.4)\quad \left\|U^TU - R^TR\right\|_F = \left\|I_d - R^TR\right\|_F \le \epsilon.$$
Clearly, the above equation also implies a two-norm bound. Indeed, with probability at least 9/10,
$$\left\|U^TU - R^TR\right\|_2 = \left\|I_d - R^TR\right\|_2 \le \epsilon$$
by setting c to the value of Eqn. (4.3.3).
In the remainder of this section, we will state and prove a theorem that also
guarantees $\left\|U^TU - R^TR\right\|_2 \le \epsilon$, while setting c to a value that is smaller than
the one in Eqn. (4.3.3). For related concentration techniques, see the chapter by
Vershynin in this volume [28].
Combine Eqns. (4.3.9) and (4.3.2) to get $M_2 \le \|U\|_F^2/\beta = d/\beta$. Recall that $\epsilon < 1$ to conclude that it suffices to choose a value of c such that
$$\frac{c}{\ln\left(2c/\sqrt{\delta}\right)} \ge \frac{48d}{\beta\epsilon^2},$$
or, equivalently,
$$\frac{2c/\sqrt{\delta}}{\ln\left(2c/\sqrt{\delta}\right)} \ge \frac{96d}{\beta\epsilon^2\sqrt{\delta}}.$$
Remark 4.3.11. Let δ = 1/10 and let ε and β be constants. Then, we can compare
the bound of Eqn. (4.3.3) with the bound of Eqn. (4.3.6) of Theorem 4.3.5: both
values of c guarantee the same accuracy and the same success probability (say
9/10). However, asymptotically, the bound of Theorem 4.3.5 holds by setting
c = O(d ln d), while the bound of Eqn. (4.3.3) holds by setting c = O(d2 ). Thus,
the bound of Theorem 4.3.5 is much better. By the Coupon Collector Problem (see
Section 3.12), sampling-based approaches necessitate at least Ω(d ln d) samples,
thus making our algorithm asymptotically optimal. We should note, however,
that deterministic methods exist (see, for example, [24]) that achieve the same
bound with $c = O(d/\epsilon^2)$ samples.
Remark 4.3.12. We made no effort to optimize the constants in the expression for
c in Eqn. (4.3.6). Better constants are known, by using tighter matrix-Bernstein
inequalities. For a state-of-the-art bound see, for example, [16, Theorem 5.1].
4.4. References. Our presentation in this chapter follows closely the derivations
in [9]; see [9] for a detailed discussion of prior work on this topic. We also
refer the interested reader to [16] and references therein for more recent work on
randomized matrix multiplication.
Computing $x_{opt}$ in this way also takes $O(nd^2)$ time, again assuming $n \gg d$. In this section, we will describe a randomized algorithm that will provide accurate relative-error approximations to the minimal $\ell_2$-norm solution vector $x_{opt}$
of Eqn. (5.0.2) faster than these “exact” algorithms for a large class of over-
constrained least-squares problems.
5.1. The Randomized Hadamard Transform. The Randomized Hadamard Trans-
form was introduced in [1] as one step in the development of a fast version of the
Johnson-Lindenstrauss lemma. Recall that the n × n Hadamard matrix (assuming n is a power of two) $\tilde{H}_n$ may be defined recursively as follows:
$$\tilde{H}_n = \begin{pmatrix}\tilde{H}_{n/2} & \tilde{H}_{n/2}\\ \tilde{H}_{n/2} & -\tilde{H}_{n/2}\end{pmatrix}, \quad\text{with}\quad \tilde{H}_2 = \begin{pmatrix}+1 & +1\\ +1 & -1\end{pmatrix}.$$
We can now define the normalized Hadamard transform $H_n$ as $(1/\sqrt{n})\tilde{H}_n$; it is easy to see that $H_nH_n^T = H_n^TH_n = I_n$. Now consider a diagonal matrix
D ∈ Rn×n such that Dii is set to +1 with probability 1/2 and to −1 with
probability 1/2. The product HD is the Randomized Hadamard Transform and has
three useful properties. First, when applied to a vector, it “spreads out” the
mass/energy of that vector, in the sense of providing a bound for the largest ele-
ment, or infinity norm, of the transformed vector. Second, computing the product
HDx for any vector x ∈ Rn takes O(n log2 n) time. Even better, if we only need
to access, say, r elements in the transformed vector, then those r elements can
be computed in O(n log2 r) time. We will expand on the latter observation in
Section 5.5, where we will discuss the running time of the proposed algorithm.
Third, the Randomized Hadamard Transform is an orthogonal transformation, since $HDD^TH^T = H^TD^TDH = I_n$.
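For concreteness, here is a minimal implementation of the transform (ours, not from the text; the function name is an illustrative choice). The iterative butterfly below matches the recursion of the Hadamard matrix and runs in $O(n\log_2 n)$ operations:

```python
import numpy as np

def fwht(x):
    """Fast normalized Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.array(x, dtype=float)
    n = x.shape[0]
    h = 1
    while h < n:
        # One butterfly stage: combine blocks as H_n x = [H(x1+x2); H(x1-x2)].
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)   # normalization makes the transform orthogonal

rng = np.random.default_rng(6)
n = 1024
x = rng.standard_normal(n)
signs = rng.choice([-1.0, 1.0], size=n)   # the diagonal of D
y = fwht(signs * x)                        # y = H D x
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))  # HD is orthogonal
```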
5.2. The main algorithm and main theorem. We are now ready to provide an
overview of the RandLeastSquares algorithm (Algorithm 5.2.1). Let the matrix
product HD denote the n × n Randomized Hadamard Transform discussed in the
previous section. (For simplicity, we restrict our discussion to the case that n is
a power of two, although this restriction can easily be removed by using variants
of the Randomized Hadamard Transform [17].) Our algorithm is a preconditioned
random sampling algorithm: after premultiplying A and b by HD, our algorithm
samples uniformly at random r constraints from the preprocessed problem. (See
Eqn. (5.2.3), as well as the remarks after Theorem 5.2.2 for the precise value of
r.) Then, this algorithm solves the least squares problem on just those sampled
constraints to obtain a vector x̃opt ∈ Rd such that Theorem 5.2.2 is satisfied.
It is worth noting that the claims of Theorem 5.2.2 can be made to hold with
probability 1 − δ, for any δ > 0, by repeating the algorithm ln(1/δ)/ ln(5) times.
Also, we note that if n is not a power of two we can pad A and b with all-zero
rows in order to satisfy the assumption; this process at most doubles the size of
the input matrix.
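To convey the overall pipeline, here is a compact demonstration (our sketch, with illustrative sizes; it forms the Hadamard matrix explicitly via scipy.linalg.hadamard rather than using the fast transform, and it samples rows without replacement, a simplification of the algorithm's sampling matrix S):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(7)
n, d = 1024, 10                         # n must be a power of two here
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

H = hadamard(n) / np.sqrt(n)            # normalized Hadamard transform
D = rng.choice([-1.0, 1.0], size=n)     # random sign flips
HDA = H @ (D[:, None] * A)              # precondition A and b with HD
HDb = H @ (D * b)

r = 200                                 # number of sampled constraints
rows = rng.choice(n, size=r, replace=False)
x_tilde, *_ = np.linalg.lstsq(HDA[rows], HDb[rows], rcond=None)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
# Ratio of residual norms; close to one when the sketch is accurate.
print(np.linalg.norm(b - A @ x_tilde) / np.linalg.norm(b - A @ x_opt))
```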
Remark 5.2.5. The matrix ST HD can be viewed in one of two equivalent ways:
as a random preprocessing or random preconditioning, which “uniformizes” the
leverage scores of the input matrix A (see Lemma 5.4.1 for a precise statement),
followed by a uniform sampling operation; or as a Johnson-Lindenstrauss style
random projection, which preserves the geometry of the entire span of A, rather
than just a discrete set of points (see Lemma 5.4.5 for a precise statement).
Eqn. (5.3.8) follows since b = Axopt + b⊥ and Eqn. (5.3.9) follows since the
columns of the matrix A span the same subspace as the columns of UA . Now, let
$z_{opt} \in \mathbb{R}^d$ be such that $U_Az_{opt} = A(\tilde{x}_{opt} - x_{opt})$. Using this value for $z_{opt}$, we will prove that $z_{opt}$ is the minimizer of the above optimization problem, as follows:
$$\left\|XU_Az_{opt} - Xb^\perp\right\|_2^2 = \left\|XA(\tilde{x}_{opt} - x_{opt}) - Xb^\perp\right\|_2^2 = \left\|XA\tilde{x}_{opt} - XAx_{opt} - Xb^\perp\right\|_2^2$$
$$(5.3.10)\qquad = \left\|XA\tilde{x}_{opt} - Xb\right\|_2^2 = \min_{x\in\mathbb{R}^d}\left\|XAx - Xb\right\|_2^2.$$
Eqn. (5.3.10) follows since b = Axopt + b⊥ and the last equality follows from
Eqn. (5.3.9). Thus, by the normal equations (5.0.3), we have that
$$\left(XU_A\right)^TXU_Az_{opt} = \left(XU_A\right)^TXb^\perp.$$
Taking the norm of both sides and observing that under Condition (5.3.3) we have $\sigma_i\left((XU_A)^TXU_A\right) = \sigma_i^2(XU_A) \ge 1/\sqrt{2}$, for all i, it follows that
$$(5.3.11)\quad \left\|z_{opt}\right\|_2^2/2 \le \left\|(XU_A)^TXU_Az_{opt}\right\|_2^2 = \left\|(XU_A)^TXb^\perp\right\|_2^2.$$
Using Condition (5.3.4) we observe that
$$(5.3.12)\quad \left\|z_{opt}\right\|_2^2 \le \epsilon Z^2.$$
To establish the first claim of the lemma, let us rewrite the norm of the residual
vector as
$$\left\|b - A\tilde{x}_{opt}\right\|_2^2 = \left\|b - Ax_{opt} + Ax_{opt} - A\tilde{x}_{opt}\right\|_2^2$$
$$(5.3.13)\qquad = \left\|b - Ax_{opt}\right\|_2^2 + \left\|Ax_{opt} - A\tilde{x}_{opt}\right\|_2^2$$
$$(5.3.14)\qquad = Z^2 + \left\|{-U_Az_{opt}}\right\|_2^2$$
$$(5.3.15)\qquad \le Z^2 + \epsilon Z^2,$$
Lemma 5.3.18. Using the notation of Lemma 5.3.5, and additionally assuming that $\left\|U_AU_A^Tb\right\|_2 \ge \gamma\|b\|_2$, for some fixed $\gamma \in (0,1]$, it follows that
$$(5.3.19)\quad \left\|x_{opt} - \tilde{x}_{opt}\right\|_2 \le \sqrt{\epsilon}\,\kappa(A)\sqrt{\gamma^{-2} - 1}\,\left\|x_{opt}\right\|_2.$$
Lemma 5.4.1. Let U be an n × d orthogonal matrix and let the product HD be the
n × n Randomized Hadamard Transform of Section 5.1. Then, with probability
at least .95,
$$(5.4.2)\quad \left\|(HDU)_{i*}\right\|_2^2 \le \frac{2d\ln(40nd)}{n}, \quad\text{for all } i = 1,\ldots,n.$$
The following well-known inequality [15, Theorem 2] will be useful in the
proof. (See also the chapter by Vershynin in this volume [28] for related results.)
Let $X_\ell = D_{\ell\ell}H_{i\ell}U_{\ell j}$, for $\ell = 1,\ldots,n$, be our set of n (independent) random variables. By the construction of D and H, it is easy to see that $E[X_\ell] = 0$; also,
$$|X_\ell| = \left|D_{\ell\ell}H_{i\ell}U_{\ell j}\right| \le \frac{1}{\sqrt{n}}\left|U_{\ell j}\right|.$$
Applying Lemma 5.4.3, we get
$$\Pr\left[\left|(HDU)_{ij}\right| \ge t\right] \le 2\exp\left(-\frac{2nt^2}{4\sum_{\ell=1}^{n}U_{\ell j}^2}\right) = 2\exp\left(-nt^2/2\right).$$
In the last equality we used the fact that $\sum_{\ell=1}^{n}U_{\ell j}^2 = 1$, i.e., that the columns of U are unit-length. Let the right-hand side of the above inequality be equal to δ and solve for t to get
$$\Pr\left[\left|(HDU)_{ij}\right| \ge \sqrt{\frac{2\ln(2/\delta)}{n}}\right] \le \delta.$$
Let $\delta = 1/(20nd)$ and apply the union bound over all nd possible index pairs (i, j) to get that, with probability at least $1 - 1/20 = 0.95$, for all i, j,
$$\left|(HDU)_{ij}\right| \le \sqrt{\frac{2\ln(40nd)}{n}}.$$
Thus,
$$(5.4.4)\quad \left\|(HDU)_{i*}\right\|_2^2 = \sum_{j=1}^{d}(HDU)_{ij}^2 \le \frac{2d\ln(40nd)}{n}$$
the input matrix is always satisfied. Combining the above with inequality (5.4.8)
concludes the proof of the lemma.
Satisfying Condition (5.3.4). We next prove the following lemma, which states
that Condition (5.3.4) is satisfied by the RandLeastSquares algorithm. The proof
of this Lemma 5.4.9 again essentially follows from our bounds for the RandMa-
trixMultiply algorithm from Section 4 (except here it is used for approximating
the product of a matrix and a vector).
Lemma 5.4.9. Assume that Eqn. (5.4.2) holds. If $r \ge 40d\ln(40nd)/\epsilon$, then, with probability at least .9,
$$\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2 \le \epsilon Z^2/2.$$
T
Proof. (of Lemma 5.4.9) Recall that $b^\perp = U_A^\perp U_A^{\perp T}b$ and that $Z = \|b^\perp\|_2$. We start by noting that since $\left\|U_A^TDH^THDb^\perp\right\|_2^2 = \left\|U_A^Tb^\perp\right\|_2^2 = 0$ it follows that
$$\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2 = \left\|U_A^TDH^TSS^THDb^\perp - U_A^TDH^THDb^\perp\right\|_2^2.$$
Thus, $\left(S^THDU_A\right)^TS^THDb^\perp$ can be viewed as approximating the product of the two matrices, $(HDU_A)^T$ and $HDb^\perp$, by randomly sampling columns from $(HDU_A)^T$ and rows (elements) from $HDb^\perp$. Note that the sampling probabilities are uniform and do not depend on the norms of the columns of $(HDU_A)^T$ or the rows of $HDb^\perp$. We will apply the bounds of Eqn. (4.2.4), after arguing
that the assumptions of Eqn. (4.2.3) are satisfied. Indeed, since we condition on
Eqn. (5.4.2) holding, the rows of HDUA (which of course correspond to columns
of (HDUA )T ) satisfy
$$(5.4.10)\quad \frac{1}{n} \ge \beta\,\frac{\left\|(HDU_A)_{i*}\right\|_2^2}{\left\|HDU_A\right\|_F^2}, \quad\text{for all } i = 1,\ldots,n,$$
for $\beta = (2\ln(40nd))^{-1}$. Thus, Eqn. (4.2.4) implies
$$E\left[\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2\right] \le \frac{1}{\beta r}\left\|HDU_A\right\|_F^2\left\|HDb^\perp\right\|_2^2 = \frac{dZ^2}{\beta r}.$$
In the above we used $\left\|HDU_A\right\|_F^2 = d$. Markov's inequality now implies that with probability at least .9,
$$\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2 \le \frac{10dZ^2}{\beta r}.$$
Setting $r \ge 20d/(\beta\epsilon)$ and using the value of β specified above concludes the proof of the lemma.
Completing the proof of Theorem 5.2.2. The theorem follows since Lemma 5.4.5
and Lemma 5.4.9 establish that the sufficient conditions of Lemma 5.3.5 hold. In
more detail, we now complete the proof of Theorem 5.2.2. First, let E(5.4.2) de-
note the event that Eqn. (5.4.2) holds; clearly, $\Pr\left[\mathcal{E}_{(5.4.2)}\right] \ge .95$. Second, let
E5.4.5,5.4.9|(5.4.2) denote the event that both Lemmas 5.4.5 and 5.4.9 hold conditioned
on E(5.4.2) holding. Then,
$$\Pr\left[\mathcal{E}_{5.4.5,5.4.9|(5.4.2)}\right] = 1 - \Pr\left[\overline{\mathcal{E}_{5.4.5,5.4.9|(5.4.2)}}\right]$$
$$= 1 - \Pr\left[\left(\text{Lemma 5.4.5 does not hold}\mid\mathcal{E}_{(5.4.2)}\right)\ \text{or}\ \left(\text{Lemma 5.4.9 does not hold}\mid\mathcal{E}_{(5.4.2)}\right)\right]$$
$$\ge 1 - \Pr\left[\text{Lemma 5.4.5 does not hold}\mid\mathcal{E}_{(5.4.2)}\right] - \Pr\left[\text{Lemma 5.4.9 does not hold}\mid\mathcal{E}_{(5.4.2)}\right]$$
$$\ge 1 - .05 - .1 = .85.$$
In the above, E denotes the complement of event E. In the first inequality we used
the union bound and in the second inequality we leveraged the bounds for the
failure probabilities of Lemmas 5.4.5 and 5.4.9, given that Eqn. (5.4.2) holds. We
now let E denote the event that both Lemmas 5.4.5 and 5.4.9 hold, without any a
priori conditioning on event E(5.4.2) ; we will bound Pr [E] as follows:
$$\Pr[\mathcal{E}] = \Pr\left[\mathcal{E}\mid\mathcal{E}_{(5.4.2)}\right]\cdot\Pr\left[\mathcal{E}_{(5.4.2)}\right] + \Pr\left[\mathcal{E}\mid\overline{\mathcal{E}_{(5.4.2)}}\right]\cdot\Pr\left[\overline{\mathcal{E}_{(5.4.2)}}\right]$$
$$\ge \Pr\left[\mathcal{E}\mid\mathcal{E}_{(5.4.2)}\right]\cdot\Pr\left[\mathcal{E}_{(5.4.2)}\right] = \Pr\left[\mathcal{E}_{5.4.5,5.4.9|(5.4.2)}\mid\mathcal{E}_{(5.4.2)}\right]\cdot\Pr\left[\mathcal{E}_{(5.4.2)}\right] \ge .85\cdot.95 \ge .8.$$
In the first inequality we used the fact that all probabilities are positive. The
above derivation immediately bounds the success probability of Theorem 5.2.2.
Combining Lemmas 5.4.5 and 5.4.9 with the structural results of Lemma 5.3.5
and setting r as in Eqn. (5.2.3) concludes the proof of the accuracy guarantees of
Theorem 5.2.2.
5.5. The running time of the RandLeastSquares algorithm. We now discuss the
running time of the RandLeastSquares algorithm. First of all, by the construc-
tion of S, the number of non-zero entries in S is r. In Step 6 we need to compute
the products ST HDA and ST HDb. Recall that A has d columns and thus the
running time of computing both products is equal to the time needed to apply
ST HD on (d + 1) vectors. In order to apply D on (d + 1) vectors in Rn , n(d + 1)
operations suffice. In order to estimate how many operations are needed to ap-
ply ST H on (d + 1) vectors, we use the following analysis that was first proposed
in [2, Section 7].
Let x be any vector in Rn ; multiplying H by x can be done as follows:
$$\begin{pmatrix}H_{n/2} & H_{n/2}\\ H_{n/2} & -H_{n/2}\end{pmatrix}\begin{pmatrix}x_1\\x_2\end{pmatrix} = \begin{pmatrix}H_{n/2}(x_1 + x_2)\\ H_{n/2}(x_1 - x_2)\end{pmatrix}.$$
Let T (n) be the number of operations required to perform this operation for n-
dimensional vectors. Then,
T (n) = 2T (n/2) + n,
and thus T (n) = O(n log n). We can now include the sub-sampling matrix S to
get
$$\begin{pmatrix}S_1 & S_2\end{pmatrix}\begin{pmatrix}H_{n/2} & H_{n/2}\\ H_{n/2} & -H_{n/2}\end{pmatrix}\begin{pmatrix}x_1\\x_2\end{pmatrix} = S_1H_{n/2}(x_1 + x_2) + S_2H_{n/2}(x_1 - x_2).$$
Let nnz(·) denote the number of non-zero entries of its argument. Then,
T (n, nnz(S)) = T (n/2, nnz(S1 )) + T (n/2, nnz(S2 )) + n.
From standard methods in the analysis of recursive algorithms, we can now use
the fact that r = nnz(S) = nnz(S1 ) + nnz(S2 ) to prove that
$$T(n, r) \le 2n\log_2(r + 1).$$
Towards that end, let r1 = nnz(S1 ) and let r2 = nnz(S2 ). Then,
(see, for example, [23] for traditional numerical methods based on subspace it-
eration and Krylov subspaces to compute such approximations) as well as more
recently in machine learning and data analysis. RandNLA has pioneered an
alternative approach, by applying random sampling and random projection algo-
rithms to construct such low-rank approximations with provable accuracy guar-
antees; see [7] for early work on the topic and [14, 17, 18, 30] for overviews of
more recent approaches. In this section, we will present and analyze a simple
algorithm to approximate the top k left singular vectors of a matrix A ∈ Rm×n .
Many RandNLA methods for low-rank approximation boil down to variants of
this basic technique; see, e.g., the chapter by Martinsson in this volume [19]. Un-
like the previous section on RandNLA algorithms for regression problems, no
particular assumptions will be imposed on m and n; indeed, A could be a square
matrix.
6.1. The main algorithm and main theorem. Our main algorithm is quite sim-
ple and again leverages the Randomized Hadamard Tranform of Section 5.1. In-
deed, let the matrix product HD denote the n × n Randomized Hadamard Trans-
form. First, we postmultiply the input matrix A ∈ Rm×n by (HD)T , thus form-
ing a new matrix $ADH \in \mathbb{R}^{m\times n}$. Then, we sample (uniformly at random) c
columns from the matrix ADH, thus forming a smaller matrix C ∈ Rm×c . Finally,
we use a Ritz-Rayleigh type procedure to construct approximations Ũk ∈ Rm×k
to the top k left singular vectors of A from C; these approximations lie within the
column space of C.
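The following NumPy/SciPy sketch (ours, with illustrative sizes; it forms the Hadamard matrix explicitly rather than using the fast transform, and uses a QR factorization for the orthonormal basis) traces these steps:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(8)
m, n, k, c = 200, 512, 5, 120           # n a power of two for this demo

# A noisy low-rank matrix.
A = (rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
     + 0.01 * rng.standard_normal((m, n)))

D = rng.choice([-1.0, 1.0], size=n)
H = hadamard(n) / np.sqrt(n)
ADH = (A * D) @ H                        # postmultiply A by D, then by H

cols = rng.choice(n, size=c, replace=False)
C = ADH[:, cols]                         # sampled sketch, m x c

U_C, _ = np.linalg.qr(C)                 # orthonormal basis for range(C)
W = U_C.T @ A                            # project A onto that basis
U_W, _, _ = np.linalg.svd(W, full_matrices=False)
U_k = U_C @ U_W[:, :k]                   # approximate top-k left singular vectors

err = np.linalg.norm(A - U_k @ (U_k.T @ A), 'fro')
U, s, Vt = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A - U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :], 'fro')
print(err / best)   # close to one when the sketch captures the top subspace
```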
See the RandLowRank algorithm (Algorithm 6.1.4) for a detailed description
of this procedure, using a sampling-and-rescaling matrix S ∈ Rn×c to form
the matrix C. Theorem 6.1.1 is our main quality-of-approximation result for
the RandLowRank algorithm.
Theorem 6.1.1. Let $A \in \mathbb{R}^{m\times n}$, let k be a rank parameter, and let $\epsilon \in (0, 1/2]$. If we set
$$(6.1.2)\quad c \ge c_0\,\frac{k\ln n}{\epsilon^2}\left(\ln\frac{k\ln n}{\epsilon^2} + \ln\ln n\right)$$
(for a fixed constant $c_0$), then, with probability at least .85, the RandLowRank algorithm returns a matrix $\tilde{U}_k \in \mathbb{R}^{m\times k}$ such that
$$(6.1.3)\quad \left\|A - \tilde{U}_k\tilde{U}_k^TA\right\|_F \le (1+\epsilon)\left\|A - U_kU_k^TA\right\|_F = (1+\epsilon)\left\|A - A_k\right\|_F.$$
(Here, Uk ∈ Rm×k contains the top k left singular vectors of A). The running
time of the RandLowRank algorithm is O(mnc).
terms like ε and δ) and that the rank of C (denoted by $\rho_C$) is at least k, i.e., $\rho_C \ge k$.
The matrix UC has dimensions m × ρC and the matrix W has dimensions ρC × n.
Finally, the matrix UW,k has dimensions ρC × k (by our assumption on the rank
of W).
Recall that the best rank-k approximation to A is equal to Ak = Uk UTk A. In
words, Theorem 6.1.1 argues that the RandLowRank algorithm returns a set of
k orthonormal vectors that are excellent approximations to the top k left singular
vectors of A, in the sense that projecting A on the subspace spanned by Ũk
returns a matrix that has residual error that is close to that of Ak .
Remark 6.1.5. We stress that the O(mnc) running time of the RandLowRank
algorithm is due to the Ritz-Rayleigh type procedure in steps (7)-(9). These
steps guarantee that the proposed algorithm returns a matrix Ũk with exactly
k columns that approximates the top k left singular vectors of A. The results
of [19] focus (in our parlance) on the matrix C, which can be constructed much
faster (see Section 6.5), in O(mn log2 c) time, but has more than k columns. By
bounding the error term $\left\|A - CC^\dagger A\right\|_F = \left\|A - U_CU_C^TA\right\|_F$, one can prove that
the column span of C contains good approximations to the top k left singular
vectors of A.
Lemma 6.2.1. Let UC be a basis for the column span of C and let Ũk be the
output of the RandLowRank algorithm. Then
$$(6.2.2)\quad A - \tilde{U}_k\tilde{U}_k^TA = A - U_C\left(U_C^TA\right)_k.$$
In addition, $U_C\left(U_C^TA\right)_k$ is the best rank-k approximation to A, with respect to the Frobenius norm, that lies within the column span of the matrix C, namely
$$(6.2.3)\quad \left\|A - U_C\left(U_C^TA\right)_k\right\|_F^2 = \min_{\mathrm{rank}(Y)\le k}\left\|A - U_CY\right\|_F^2.$$
Proof. Recall that Ũk = UC UW,k , where UW,k is the matrix of the top k left
singular vectors of W = UTC A. Thus, UW,k spans the same range as W k , the best
rank-k approximation to W, i.e., $U_{W,k}U_{W,k}^T = W_kW_k^\dagger$. Therefore
$$A - \tilde{U}_k\tilde{U}_k^TA = A - U_CU_{W,k}U_{W,k}^TU_C^TA = A - U_CW_kW_k^\dagger W = A - U_CW_k.$$
The last equality follows from W k W †k being the orthogonal projector onto the
range of W k . In order to prove the optimality property of the lemma, we simply
observe that
The second to last equality follows from Matrix Pythagoras (Lemma 2.5.2) and
the last equality follows from the orthonormality of the columns of UC . The
second statement of the lemma is now immediate since (UTC A)k is the best rank-
k approximation to UTC A and thus any other matrix Y of rank at most k would
result in a larger Frobenius norm error.
Lemma 6.2.1 shows that Eqn. (6.1.3) in Theorem 6.1.1 can be proven by bounding $\left\|A - U_C\left(U_C^TA\right)_k\right\|_F$. Next, we transition from the best rank-k approximation
of the projected matrix (UTC A)k to the best rank-k approximation Ak of the orig-
inal matrix. First (recall the notation introduced in Section 2.6), we split
Lemma 6.2.5. Let UC be an orthonormal basis for the column span of the matrix
C and let Ũk be the output of the RandLowRank algorithm. Then,
$$\left\|A - \tilde{U}_k\tilde{U}_k^TA\right\|_F^2 \le \left\|A_k - U_CU_C^TA_k\right\|_F^2 + \left\|A_{k,\perp}\right\|_F^2.$$
Proof. The optimality property in Eqn. (6.2.3) in Lemma 6.2.1 and the fact that
UTC Ak has rank at most k imply
$$\left\|A - \tilde{U}_k\tilde{U}_k^TA\right\|_F^2 = \left\|A - U_C\left(U_C^TA\right)_k\right\|_F^2 \le \left\|A - U_CU_C^TA_k\right\|_F^2 = \left\|A_k - U_CU_C^TA_k\right\|_F^2 + \left\|A_{k,\perp}\right\|_F^2.$$
The last equality follows from Lemma 2.5.2.
6.3. A structural inequality. We now state and prove a structural inequality
that will help us bound $\left\|A_k - U_CU_C^TA_k\right\|_F^2$ (the first term in the error bound
of Lemma 6.2.5) and that, with minor variants, underlies nearly all RandNLA
algorithms for low-rank matrix approximation [18]. Recall that, given a matrix
A ∈ Rm×n , many RandNLA algorithms seek to construct a “sketch” of A by post-
multiplying A by some “sketching” matrix Z ∈ Rn×c , where c is much smaller
than n. (In particular, this is precisely what the RandLowRank algorithm does.)
Thus, the resulting matrix AZ ∈ Rm×c is much smaller than the original matrix
A, and the interesting question is the approximation guarantees that it offers.
A common approach is to explore how well AZ spans the principal subspace
of A, and one metric of accuracy is a suitably chosen norm of the error matrix
Ak − (AZ)(AZ)† Ak , where (AZ)(AZ)† Ak is the projection of Ak onto the sub-
space spanned by the columns of AZ. (See Section 2.9 for the definition of the
Moore-Penrose pseudoinverse of a matrix.) The following structural result offers
a means to bound the Frobenius norm of the error matrix Ak − (AZ)(AZ)† Ak .
Lemma 6.3.1. Given $A \in \mathbb{R}^{m\times n}$, let $Z \in \mathbb{R}^{n\times c}$ ($c \ge k$) be any matrix such that $V_k^TZ \in \mathbb{R}^{k\times c}$ has rank k. Then,
$$(6.3.2)\quad \left\|A_k - (AZ)(AZ)^\dagger A_k\right\|_F^2 \le \left\|(A - A_k)Z\left(V_k^TZ\right)^\dagger\right\|_F^2.$$
Remark 6.3.3. Lemma 6.3.1 holds for any matrix Z, regardless of whether Z is
constructed deterministically or randomly. In the context of RandNLA, typical
constructions of Z would represent a random sampling or random projection
operation, like the matrix DHS used in the RandLowRank algorithm.
Remark 6.3.4. The lemma actually holds for any unitarily invariant norm, includ-
ing the two and the nuclear norm of a matrix [18].
Remark 6.3.5. See [18] for a detailed discussion of such structural inequalities and
their history. Lemma 6.3.1 immediately suggests a proof strategy for bounding
the error of RandNLA algorithms for low-rank matrix approximation: identify a
sketching matrix Z such that $V_k^TZ$ has full rank; and, at the same time, bound the relevant norms of $\left(V_k^TZ\right)^\dagger$ and $(A - A_k)Z$.
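Following that strategy, one can also sanity-check the inequality itself numerically; the snippet below (our addition) uses a Gaussian sketching matrix Z, for which $V_k^TZ$ has rank k almost surely:

```python
import numpy as np

# Numerical check of the structural inequality (6.3.2).
rng = np.random.default_rng(9)
m, n, k, c = 30, 40, 3, 10
A = rng.standard_normal((m, n))
Z = rng.standard_normal((n, c))          # Gaussian sketching matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
Vk = Vt[:k, :].T                         # top-k right singular vectors

AZ = A @ Z
lhs = np.linalg.norm(A_k - AZ @ np.linalg.pinv(AZ) @ A_k, 'fro')
rhs = np.linalg.norm((A - A_k) @ Z @ np.linalg.pinv(Vk.T @ Z), 'fro')
assert lhs <= rhs + 1e-8
```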
6.4. Completing the proof of Theorem 6.1.1. In order to complete the proof
of the relative error guarantee of Theorem 6.1.1, we will complete the strategy
outlined at the end of Section 6.1. First, recall that from Lemma 6.2.5 it suffices to
bound
$$(6.4.1)\quad \left\|A - \tilde{U}_k\tilde{U}_k^TA\right\|_F^2 \le \left\|A_k - U_CU_C^TA_k\right\|_F^2 + \left\|A - A_k\right\|_F^2.$$
Then, to bound the first term in the right-hand side of the above inequality, we
will apply the structural result of Lemma 6.3.1 on the matrix
Φ = ADH,
with Z = S, where the matrices D, H, and S are constructed as described in
the RandLowRank algorithm. If V TΦ,k S has rank k, then Lemma 6.3.1 gives the
estimate that
$$(6.4.2)\quad \left\|\Phi_k - (\Phi S)(\Phi S)^\dagger\Phi_k\right\|_F^2 \le \left\|(\Phi - \Phi_k)S\left(V_{\Phi,k}^TS\right)^\dagger\right\|_F^2.$$
Here, we used V Φ,k ∈ Rn×k to denote the matrix of the top k right singular
vectors of Φ.
Recall from Section 5.1 that DH is an orthogonal matrix and thus the left
singular vectors and the singular values of the matrices A and Φ = ADH are
identical. The right singular vectors of the matrix Φ are simply the right singular
vectors of A, rotated by DH, namely
$$V_\Phi^T = V^TDH,$$
where V (respectively, V Φ ) denotes the matrix of the right singular vectors of A
(respectively, Φ). Thus, we have Φk = Ak DH, Φ − Φk = (A − Ak )DH, and
V Φ,k = V k DH. Using all the above, we can rewrite Eqn. (6.4.2) as follows:
$$(6.4.3)\quad \left\|A_k - (ADHS)(ADHS)^\dagger A_k\right\|_F^2 \le \left\|(A - A_k)DHS\left(V_k^TDHS\right)^\dagger\right\|_F^2.$$
In the above derivation, we used unitary invariance to drop a DH term from
the Frobenius norm. Recall that Ak,⊥ = A − Ak ; we now proceed to manipulate
the right-hand side of the above inequality as follows:
$$(6.4.7)\quad \left\|(HDV_k)_{i*}\right\|_2^2 \le \frac{2k\ln(40nk)}{n}, \quad\text{for all } i = 1,\ldots,n.$$
The proof of the above lemma is identical to the proof of Lemma 5.4.1, with
V k instead of U and k instead of d.
6.4.1. Bounding Expression (6.4.4). To bound the term in Expression (6.4.4), we
first use the strong submultiplicativity of the Frobenius norm (see Section 2.5)
to get
Remark 6.4.10. The above lemma holds even if the sampling of the canonical
vectors ei to be included in S is not done uniformly at random, but with respect
to any set of probabilities {p1 , . . . , pn } summing up to one, as long as the selected
canonical vector at the t-th trial (say the $i_t$-th canonical vector $e_{i_t}$) is rescaled by $1/\sqrt{c\,p_{i_t}}$. Thus, even for nonuniform sampling, $\|XS\|_F^2$ is an unbiased estimator of $\|X\|_F^2$.
Proof. (of Lemma 6.4.9) We compute the expectation of $\|XS\|_F^2$ from first principles as follows:
$$E\left[\|XS\|_F^2\right] = \sum_{t=1}^{c}\sum_{j=1}^{n}\frac{1}{n}\cdot\frac{n}{c}\left\|X_{*j}\right\|_2^2 = \frac{1}{c}\sum_{t=1}^{c}\sum_{j=1}^{n}\left\|X_{*j}\right\|_2^2 = \|X\|_F^2.$$
We can now prove the following lemma, assuming that Eqn. (6.4.7) holds.
Proof. Let σi denote the i-th singular value of the matrix V Tk DHS. Conditioned
on Eqn. (6.4.7) holding, we can replicate the proof of Lemma 5.4.5 to argue that if
c satisfies Eqn. (6.4.12), then, with probability at least .95,
$$(6.4.13)\quad \left|1 - \sigma_i^2\right| \le \epsilon$$
holds for all i. (Indeed, we can replicate the proof of Lemma 5.4.5 using V k
instead of UA and k instead of d; we also evaluate the bound for arbitrary
instead of fixing it.) We now observe that the matrices
$$V_k^TDHS \quad\text{and}\quad \left(\left(V_k^TDHS\right)^\dagger\right)^T$$
have the same left and right singular vectors⁵. Recall that in this lemma we used $\sigma_i$ to denote the singular values of the matrix $V_k^TDHS$. Then, the singular values of the matrix $\left(V_k^TDHS\right)^T$ are equal to the $\sigma_i$'s, while the singular values of the matrix $\left(V_k^TDHS\right)^\dagger$ are equal to the $\sigma_i^{-1}$'s. Thus,
$$\left\|\left(V_k^TDHS\right)^\dagger - \left(V_k^TDHS\right)^T\right\|_2^2 = \max_i\left(\sigma_i^{-1} - \sigma_i\right)^2 = \max_i\left(1 - \sigma_i^2\right)^2\sigma_i^{-2}.$$
Combining with Eqn. (6.4.13) and using the fact that $\epsilon \le 1/2$,
$$\left\|\left(V_k^TDHS\right)^\dagger - \left(V_k^TDHS\right)^T\right\|_2^2 = \max_i\left(1 - \sigma_i^2\right)^2\sigma_i^{-2} \le \frac{\epsilon^2}{1-\epsilon} \le 2\epsilon^2.$$
Lemma 6.4.14. Assume that Eqn. (6.4.7) holds. If $c$ satisfies Eqn. (6.4.12), then, with probability at least .9,
$$2\,\big\|A_{k,\perp} DHS\big((V_k^T DHS)^\dagger - (V_k^T DHS)^T\big)\big\|_F^2 \;\leq\; 80\epsilon^2\,\|A_{k,\perp}\|_F^2.$$
Proof. We combine Eqn. (6.4.8) with Lemma 6.4.9 (applied to $X = A_{k,\perp}$) and Lemma 6.4.11 to get that, conditioned on Eqn. (6.4.7) holding, with probability at least $1 - 0.05 - 0.05 = 0.9$,
$$\big\|A_{k,\perp} DHS\big((V_k^T DHS)^\dagger - (V_k^T DHS)^T\big)\big\|_F^2 \;\leq\; 40\epsilon^2\,\|A_{k,\perp}\|_F^2.$$
The aforementioned failure probability follows from a simple union bound on the failure probabilities of Lemmas 6.4.9 and 6.4.11.
Bounding Expression (6.4.5). Our bound for Expression (6.4.5) will be conditioned on Eqn. (6.4.7) holding; then, we will use our matrix multiplication results

⁵ Given any matrix $X$ with thin SVD $X = U_X \Sigma_X V_X^T$, its transpose is $X^T = V_X \Sigma_X U_X^T$ and its pseudoinverse is $X^\dagger = V_X \Sigma_X^{-1} U_X^T$.
The last inequality follows by using $\epsilon \leq 1/2$. Taking square roots of both sides and using $\sqrt{1 + 41\epsilon} \leq 1 + 21\epsilon$, we get
$$\|A - \tilde{U}_k \tilde{U}_k^T A\|_F \;\leq\; (1 + 21\epsilon)\,\|A - A_k\|_F.$$
Observe that $c$ has to be set to the maximum of the values used in Lemmas 6.4.14 and 6.4.15, which is the value in Eqn. (6.4.12). Adjusting $\epsilon$ to $\epsilon/21$ and appropriately adjusting the constants in the expression for $c$ gives the lemma. (We made no special effort to compute or optimize the constant $c_0$ in the expression for $c$.) The failure probability follows by a union bound on the failure probabilities of Lemmas 6.4.14 and 6.4.15, conditioned on Eqn. (6.4.7).
To conclude the proof of Theorem 6.1.1, we simply need to remove the conditioning from Lemma 6.4.17. Towards that end, we follow the same strategy as in Section 5.4 to conclude that the success probability of the overall approach is at least $0.85 \cdot 0.95 \geq 0.8$.
6.5. Running time. The RandLowRank Algorithm 6.1.4 computes the product $C = ADHS$ using the ideas of Section 5.5, thus taking $2n(m+1)\log_2(c+1)$ time. Step 7 takes $O(mc^2)$ time; step 8 takes $O(mnc + nc^2)$ time; step 9 takes $O(mck)$ time. Overall, the running time is, asymptotically, dominated by the $O(mnc)$ term in step 8, with $c$ as in Eqn. (6.4.18).
6.6. References. Our presentation in this chapter follows the derivations in [8].
We also refer the interested reader to [21, 29] for related work.
Acknowledgements. The authors would like to thank Ilse Ipsen for allowing
them to use her slides for the introductory linear algebra lecture delivered at the
PCMI Summer School on which the first section of this chapter is heavily based.
The authors would also like to thank Aritra Bose, Eugenia-Maria Kontopoulou,
and Fred Roosta for their help in proofreading early drafts of this manuscript.
References
[1] N. Ailon and B. Chazelle, The fast Johnson-Lindenstrauss transform and approximate nearest neighbors, SIAM J. Comput. 39 (2009), no. 1, 302–322, DOI 10.1137/060673096. MR2506527
[2] N. Ailon and E. Liberty, Fast dimension reduction using Rademacher series on dual BCH codes, Discrete Comput. Geom. 42 (2009), no. 4, 615–630, DOI 10.1007/s00454-008-9110-x. MR2556458
[3] H. Avron, P. Maymounkov, and S. Toledo, Blendenpik: supercharging Lapack's least-squares solver, SIAM J. Sci. Comput. 32 (2010), no. 3, 1217–1236, DOI 10.1137/090767911. MR2639236
[4] Rajendra Bhatia, Matrix analysis, Graduate Texts in Mathematics, vol. 169, Springer-Verlag, New York, 1997. MR1477662
[5] Å. Björck, Numerical Methods in Matrix Computations, Springer, Heidelberg, 2015.
[6] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, Near-optimal column-based matrix reconstruction, SIAM J. Comput. 43 (2014), no. 2, 687–717, DOI 10.1137/12086755X. MR3504679
[7] Petros Drineas, Ravi Kannan, and Michael W. Mahoney, Fast Monte Carlo algorithms for matrices. II. Computing a low-rank approximation to a matrix, SIAM J. Comput. 36 (2006), no. 1, 158–183, DOI 10.1137/S0097539704442696. MR2231644
[8] Petros Drineas, Ilse C. F. Ipsen, Eugenia-Maria Kontopoulou, and Malik Magdon-Ismail, Structural convergence results for approximation of dominant subspaces from block Krylov spaces, SIAM J. Matrix Anal. Appl. 39 (2018), no. 2, 567–586, DOI 10.1137/16M1091745. MR3782400
[9] Petros Drineas, Ravi Kannan, and Michael W. Mahoney, Fast Monte Carlo algorithms for matrices. I. Approximating matrix multiplication, SIAM J. Comput. 36 (2006), no. 1, 132–157, DOI 10.1137/S0097539704442684. MR2231643
[10] P. Drineas and M. W. Mahoney, RandNLA: Randomized Numerical Linear Algebra, Communications of the ACM 59 (2016), no. 6, 80–90.
[11] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós, Faster least squares approximation, Numer. Math. 117 (2011), no. 2, 219–249, DOI 10.1007/s00211-010-0331-6. MR2754850
[12] John C. Duchi, Introductory lectures on stochastic optimization, The Mathematics of Data, IAS/Park City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018.
[13] G. H. Golub and C. F. Van Loan, Matrix computations, 3rd ed., Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, 1996. MR1417720
[14] N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. 53 (2011), no. 2, 217–288, DOI 10.1137/090771806. MR2806637
[15] Wassily Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. 58 (1963), 13–30. MR0144363
[16] John T. Holodnak and Ilse C. F. Ipsen, Randomized approximation of the Gram matrix: exact computation and probabilistic bounds, SIAM J. Matrix Anal. Appl. 36 (2015), no. 1, 110–137, DOI 10.1137/130940116. MR3306014
[17] M. W. Mahoney, Randomized algorithms for matrices and data, Foundations and Trends in Machine Learning, NOW Publishers, Boston, 2011.
[18] Michael W. Mahoney and Petros Drineas, Structural properties underlying high-quality randomized numerical linear algebra algorithms, Handbook of big data, Chapman & Hall/CRC Handb. Mod. Stat. Methods, CRC Press, Boca Raton, FL, 2016, pp. 137–154. MR3674816
[19] Per-Gunnar Martinsson, Randomized methods for matrix computations, The Mathematics of Data, IAS/Park City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018.
[20] Rajeev Motwani and Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, Cambridge, 1995. MR1344451
[21] C. Musco and C. Musco, Stronger and faster approximate singular value decomposition via the block Lanczos method, Neural Information Processing Systems (NIPS), 2015, available at arXiv:1504.05477.
[22] Roberto Imbuzeiro Oliveira, Sums of random Hermitian matrices and an inequality by Rudelson, Electron. Commun. Probab. 15 (2010), 203–212, DOI 10.1214/ECP.v15-1544. MR2653725
[23] Yousef Saad, Numerical methods for large eigenvalue problems, Classics in Applied Mathematics, vol. 66, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2011. Revised edition of the 1992 original. MR3396212
[24] Nikhil Srivastava, Spectral sparsification and restricted invertibility, ProQuest LLC, Ann Arbor, MI, 2010. Thesis (Ph.D.)–Yale University. MR2941475
[25] G. W. Stewart and J. G. Sun, Matrix Perturbation Theory, Academic Press, New York, 1990.
[26] Gilbert Strang, Linear algebra and its applications, 2nd ed., Academic Press [Harcourt Brace Jovanovich, Publishers], New York-London, 1980. MR575349
[27] L. N. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM, Philadelphia, 1997.
[28] Roman Vershynin, Four lectures on probabilistic methods for data science, The Mathematics of Data, IAS/Park City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018.
[29] S. Wang, Z. Zhang, and T. Zhang, Improved analyses of the randomized power method and block Lanczos method (2015), available at arXiv:1508.06429.
[30] David P. Woodruff, Sketching as a tool for numerical linear algebra, Found. Trends Theor. Comput. Sci. 10 (2014), no. 1-2, iv+157. MR3285427
Purdue University, Computer Science Department, 305 N University Street, West Lafayette, IN
47906.
Email address: [email protected]
University of California at Berkeley, ICSI and Department of Statistics, 367 Evans Hall, Berkeley,
CA 94720.
Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 49–97
https://doi.org/10.1090/pcms/025/00830
Optimization Algorithms for Data Analysis
Stephen J. Wright
Contents
1 Introduction
1.1 Omissions
1.2 Notation
2 Optimization Formulations of Data Analysis Problems
2.1 Setup
2.2 Least Squares
2.3 Matrix Completion
2.4 Nonnegative Matrix Factorization
2.5 Sparse Inverse Covariance Estimation
2.6 Sparse Principal Components
2.7 Sparse Plus Low-Rank Matrix Decomposition
2.8 Subspace Identification
2.9 Support Vector Machines
2.10 Logistic Regression
2.11 Deep Learning
3 Preliminaries
3.1 Solutions
3.2 Convexity and Subgradients
3.3 Taylor's Theorem
3.4 Optimality Conditions for Smooth Functions
3.5 Proximal Operators and the Moreau Envelope
3.6 Convergence Rates
4 Gradient Methods
4.1 Steepest Descent
4.2 General Case
4.3 Convex Case
4.4 Strongly Convex Case
4.5 General Case: Line-Search Methods
4.6 Conditional Gradient Method
5 Prox-Gradient Methods
6 Accelerating Gradient Methods
6.1 Heavy-Ball Method
6.2 Conjugate Gradient
6.3 Nesterov's Accelerated Gradient: Weakly Convex Case
6.4 Nesterov's Accelerated Gradient: Strongly Convex Case
6.5 Lower Bounds on Rates
7 Newton Methods
7.1 Basic Newton's Method
7.2 Newton's Method for Convex Functions
7.3 Newton Methods for Nonconvex Functions
7.4 A Cubic Regularization Approach
8 Conclusions
2010 Mathematics Subject Classification. Primary 14Dxx; Secondary 14Dxx.
Key words and phrases. Park City Mathematics Institute.
1. Introduction
In this article, we consider algorithms for solving smooth optimization prob-
lems, possibly with simple constraints or structured nonsmooth regularizers. One
such canonical formulation is
(1.0.1) min f(x),
x∈Rn
where $f : \mathbb{R}^n \to \mathbb{R}$ has at least Lipschitz continuous gradients. Additional assumptions about $f$, such as convexity and Lipschitz continuity of the Hessian, are introduced as needed. Another formulation we consider is
(1.0.2) min f(x) + λψ(x),
x∈Rn
where $f$ is as in (1.0.1), $\psi : \mathbb{R}^n \to \mathbb{R}$ is a function that is usually convex and usually nonsmooth, and $\lambda \geq 0$ is a regularization parameter.¹ We refer to (1.0.2) as a regularized minimization problem because the presence of the term involving $\psi$ induces certain structural properties on the solution that make it more desirable or plausible in the context of the application. We describe iterative algorithms
that generate a sequence {xk }k=0,1,2,... of points that, in the case of convex objective
functions, converges to the set of solutions. (Some algorithms also generate other
“auxiliary” sequences of iterates.)
We are motivated to study problems of the forms (1.0.1) and (1.0.2) by their
ubiquity in data analysis applications. Accordingly, Section 2 describes some
canonical problems in data analysis and their formulation as optimization prob-
lems. After some preliminaries in Section 3, we describe in Section 4 algorithms
that take steps based on the gradients $\nabla f(x^k)$. Extensions of these methods to
¹ A set $S$ is said to be convex if for any pair of points $z', z'' \in S$, we have $\alpha z' + (1-\alpha)z'' \in S$ for all $\alpha \in [0,1]$. A function $\phi : \mathbb{R}^n \to \mathbb{R}$ is convex if $\phi(\alpha z' + (1-\alpha)z'') \leq \alpha\phi(z') + (1-\alpha)\phi(z'')$ for all $z', z''$ in the (convex) domain of $\phi$ and all $\alpha \in [0,1]$.
the case (1.0.2) of regularized objectives are described in Section 5. Section 6 de-
scribes accelerated gradient methods, which achieve better worst-case complexity
than basic gradient methods, while still only using first-derivative information.
We discuss Newton’s method in Section 7, outlining variants that can guarantee
convergence to points that approximately satisfy second-order conditions for a
local minimizer of a smooth nonconvex function.
1.1. Omissions Our approach throughout is to give a concise description of
some of the most important algorithmic tools for smooth nonlinear optimization
and regularized optimization, along with the basic convergence theory for each.
(In any given context, we mean by “smooth” that the function is differentiable as
many times as is necessary for the discussion to make sense.) In most cases, the
theory is elementary enough to include here in its entirety. In the few remaining
cases, we provide citations to works in which complete proofs can be found.
Although we allow nonsmoothness in the regularization term in (1.0.2), we do
not cover subgradient methods or mirror descent explicitly in this chapter. We
also do not discuss stochastic gradient methods, a class of methods that is central
to modern machine learning. All these topics are discussed in the contribution of
John Duchi to the current volume [22]. Other omissions include the following.
• Coordinate descent methods; see [47] for a recent review.
• Augmented Lagrangian methods, including alternating direction meth-
ods of multipliers (ADMM) [23]. The review [5] remains a good reference
for the latter topic, especially as it applies to problems from data analysis.
• Semidefinite programming (see [43, 45]) and conic optimization (see [6]).
• Methods tailored specifically to linear or quadratic programming, such as
the simplex method or interior-point methods (see [46] for a discussion of
the latter).
• Quasi-Newton methods, which modify Newton’s method by approximat-
ing the Hessian or its inverse, thus attaining attractive theoretical and
practical performance without using any second-derivative information.
For a discussion of these methods, see [36, Chapter 6]. One important
method of this class, which is useful in data analysis and many other
large-scale problems, is the limited-memory method L-BFGS [30]; see also
[36, Section 7.2].
1.2. Notation Our notational conventions in this chapter are as follows. We
use upper-case Roman characters (A, L, R, and so on) for matrices and lower-
case Roman (x, v, u, and so on) for vectors. (Vectors are assumed to be column
vectors.) Transposes are indicated by a superscript “T .” Elements of matrices and
vectors are indicated by subscripts, for example, Aij and xj . Iteration numbers are
indicated by superscripts, for example, xk . We denote the set of real numbers by
R, so that Rn denotes the Euclidean space of dimension n. The set of symmetric
real $n \times n$ matrices is denoted by $S\mathbb{R}^{n\times n}$. Real scalars are usually denoted by lower-case Greek characters ($\alpha$, $\gamma$, $\lambda$, and so on).
where the $j$th term $\ell(a_j, y_j; x)$ is a measure of the mismatch between $\phi(a_j)$ and $y_j$, and $x$ is the vector of parameters that determines $\phi$.
One use of φ is to make predictions about future data items. Given another
previously unseen item of data â of the same type as aj , j = 1, 2, . . . , m, we
predict that the label ŷ associated with â would be φ(â). The mapping may also
expose other structure and properties in the data set. For example, it may reveal
that only a small fraction of the features in aj are needed to reliably predict the
label yj . (This is known as feature selection.) The function φ or its parameter x
may also reveal important structure in the data. For example, X could reveal a
low-dimensional subspace that contains most of the aj , or X could reveal a matrix
with particular structure (low-rank, sparse) such that observations of X prompted
by the feature vectors aj yield results close to yj .
Examples of labels yj include the following.
• A real number, leading to a regression problem.
• A label, say yj ∈ {1, 2, . . . , M} indicating that aj belongs to one of M
classes. This is a classification problem. We have M = 2 for binary classifi-
cation and M > 2 for multiclass classification.
• Null. Some problems only have feature vectors aj and no labels. In this
case, the data analysis task may consist of grouping the aj into clusters
(where the vectors within each cluster are deemed to be functionally sim-
ilar), or identifying a low-dimensional subspace (or a collection of low-
dimensional subspaces) that approximately contains the aj . Such prob-
lems require the labels yj to be learned, alongside the function φ. For
example, in a clustering problem, yj could represent the cluster to which
aj is assigned.
Even after cleaning and preparation, the setup above may contain many com-
plications that need to be dealt with in formulating the problem in rigorous math-
ematical terms. The quantities (aj , yj ) may contain noise, or may be otherwise
corrupted. We would like the mapping φ to be robust to such errors. There may
be missing data: parts of the vectors aj may be missing, or we may not know all
the labels yj . The data may be arriving in streaming fashion rather than being
available all at once. In this case, we would learn φ in an online fashion.
One particular consideration is that we wish to avoid overfitting the model to
the data set D in (2.1.1). The particular data set D available to us can often be
thought of as a finite sample drawn from some underlying larger (often infinite)
collection of data, and we wish the function φ to perform well on the unobserved
data points as well as the observed subset D. In other words, we want φ to
be not too sensitive to the particular sample D that is used to define empirical
objective functions such as (2.1.2). The optimization formulation can be modified
in various ways to achieve this goal, by the inclusion of constraints or penalty
terms that limit some measure of “complexity” of the function (such techniques
are called generalization or regularization). Another approach is to terminate the
optimization algorithm early, the rationale being that overfitting occurs mainly in
the later stages of the optimization process.
2.2. Least Squares Probably the oldest and best-known data analysis problem is linear least squares. Here, the data points $(a_j, y_j)$ lie in $\mathbb{R}^n \times \mathbb{R}$, and we solve
$$(2.2.1)\qquad \min_x\; \frac{1}{2m}\sum_{j=1}^{m}\big(a_j^T x - y_j\big)^2 \;=\; \frac{1}{2m}\|Ax - y\|_2^2,$$
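As a concrete illustration (a small sketch with synthetic data, not drawn from the text), the minimizer of (2.2.1) can be computed directly with a standard least-squares solver:

```python
# Solving the least-squares problem (2.2.1) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
A = rng.standard_normal((m, n))
y = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)  # noisy labels

x, *_ = np.linalg.lstsq(A, y, rcond=None)    # minimizes ||Ax - y||_2^2
objective = np.linalg.norm(A @ x - y) ** 2 / (2 * m)
print(objective)                             # small residual objective value
```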
where $\langle A, B\rangle := \mathrm{trace}(A^T B)$. Here we can think of the $A_j$ as "probing" the unknown matrix $X$. Commonly considered types of observations are random linear
combinations (where the elements of Aj are selected i.i.d. from some distribution)
or single-element observations (in which each Aj has 1 in a single location and
zeros elsewhere). A regularized version of (2.3.1), leading to solutions X that are
low-rank, is
$$(2.3.2)\qquad \min_X\; \frac{1}{2m}\sum_{j=1}^{m}\big(\langle A_j, X\rangle - y_j\big)^2 + \lambda\|X\|_*,$$
where $\|X\|_*$ is the nuclear norm, which is the sum of singular values of $X$ [39]. The nuclear norm plays a role analogous to the $\ell_1$ norm in (2.2.2). Although the
nuclear norm is a somewhat complex nonsmooth function, it is at least convex, so
that the formulation (2.3.2) is also convex. This formulation can be shown to yield
a statistically valid solution when the true X is low-rank and the observation ma-
trices Aj satisfy a “restricted isometry” property, commonly satisfied by random
matrices, but not by matrices with just one nonzero element. The formulation is
also valid in a different context, in which the true X is incoherent (roughly speak-
ing, it does not have a few elements that are much larger than the others), and
the observations Aj are of single elements [10].
In another form of regularization, the matrix $X$ is represented explicitly as a product of two "thin" matrices $L$ and $R$, where $L \in \mathbb{R}^{n\times r}$ and $R \in \mathbb{R}^{p\times r}$, with $r \ll \min(n, p)$. We set $X = LR^T$ in (2.3.1) and solve
$$(2.3.3)\qquad \min_{L,R}\; \frac{1}{2m}\sum_{j=1}^{m}\big(\langle A_j, LR^T\rangle - y_j\big)^2.$$
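A gradient method on the factorized formulation (2.3.3) is straightforward to sketch for single-element observations. The following is illustrative only (synthetic data, a hand-tuned constant step); it is not an algorithm analyzed in this chapter.

```python
# Gradient descent on the factorized matrix-completion objective (2.3.3),
# with single-element observations A_j (a sketch on synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, p, r, m = 50, 40, 3, 1500
Ltrue = rng.standard_normal((n, r))
Rtrue = rng.standard_normal((p, r))
rows, cols = rng.integers(0, n, m), rng.integers(0, p, m)
y = (Ltrue @ Rtrue.T)[rows, cols]              # observed entries of the true X

L = 0.1 * rng.standard_normal((n, r))
R = 0.1 * rng.standard_normal((p, r))
step = 0.2                                     # fixed step, chosen by hand
for it in range(2000):
    resid = np.sum(L[rows] * R[cols], axis=1) - y   # <A_j, L R^T> - y_j
    gL = np.zeros_like(L)
    gR = np.zeros_like(R)
    np.add.at(gL, rows, (resid[:, None] * R[cols]) / m)
    np.add.at(gR, cols, (resid[:, None] * L[rows]) / m)
    L -= step * gL
    R -= step * gR
print(0.5 * np.mean((np.sum(L[rows] * R[cols], axis=1) - y) ** 2))
```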
2.5. Sparse Inverse Covariance Estimation In this problem, the labels yj are
null, and the vectors aj ∈ Rn are viewed as independent observations of a ran-
dom vector A ∈ Rn , which has zero mean. The sample covariance matrix con-
structed from these observations is
$$S = \frac{1}{m-1}\sum_{j=1}^{m} a_j a_j^T.$$
The element $S_{il}$ is an estimate of the covariance between the $i$th and $l$th elements of the random vector $A$. Our interest is in calculating an estimate $X$ of
the inverse covariance matrix that is sparse. The structure of X yields important
information about A. In particular, if Xil = 0, we can conclude that the i and
l components of A are conditionally independent. (That is, they are independent
given knowledge of the values of the other n − 2 components of A.) Stated an-
other way, the nonzero locations in X indicate the arcs in the dependency graph
whose nodes correspond to the n components of A.
One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix $X$ is the following:
$$(2.5.1)\qquad \min_{X \in S\mathbb{R}^{n\times n},\; X \succ 0}\; \langle S, X\rangle - \log\det(X) + \lambda\|X\|_1,$$
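The objective in (2.5.1) is easy to evaluate for a candidate $X$, which is a useful sanity check before handing the problem to a solver. A small sketch (synthetic data; the minimization itself is not attempted here):

```python
# Evaluating the sparse inverse covariance objective (2.5.1) at a candidate X.
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 8
Aobs = rng.standard_normal((m, n))           # rows are observations a_j^T
S = Aobs.T @ Aobs / (m - 1)                  # sample covariance matrix

def objective(X, lam):
    sign, logdet = np.linalg.slogdet(X)      # stable log det for X > 0
    assert sign > 0, "X must be positive definite"
    return np.trace(S @ X) - logdet + lam * np.abs(X).sum()

print(objective(np.eye(n), lam=0.1))         # value at the identity matrix
```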
2.6. Sparse Principal Components The setup for this problem is similar to the
previous section, in that we have a sample covariance matrix S that is estimated
from a number of observations of some underlying random vector. The princi-
pal components of this matrix are the eigenvectors corresponding to the largest
eigenvalues. It is often of interest to find sparse principal components, approxi-
mations to the leading eigenvectors that also contain few nonzeros. An explicit
optimization formulation of this problem is
$$(2.6.1)\qquad \max_{v\in\mathbb{R}^n}\; v^T S v \quad \text{s.t.}\quad \|v\|_2 = 1,\; \|v\|_0 \leq k,$$
where $\|\cdot\|_0$ indicates the cardinality of $v$ (that is, the number of nonzeros in $v$) and $k$ is a user-defined parameter indicating a bound on the cardinality of $v$. The
problem (2.6.1) is NP-hard, so exact formulations (for example, as a quadratic
program with binary variables) are intractable. We consider instead a relaxation,
due to [18], which replaces $vv^T$ by a positive semidefinite proxy $M \in S\mathbb{R}^{n\times n}$:
$$(2.6.2)\qquad \max_{M\in S\mathbb{R}^{n\times n}}\; \langle S, M\rangle \quad \text{s.t.}\quad M \succeq 0,\; \langle I, M\rangle = 1,\; \|M\|_1 \leq \rho,$$
for some parameter ρ > 0 that can be adjusted to attain the desired sparsity. This
formulation is a convex optimization problem, in fact, a semidefinite program-
ming problem.
This formulation can be generalized to find the leading r > 1 sparse principal
components. Ideally, we would obtain these from a matrix V ∈ Rn×r whose
columns are mutually orthogonal and have at most k nonzeros each. We can
write a convex relaxation of this problem, once again a semidefinite program, as
$$(2.6.3)\qquad \max_{M\in S\mathbb{R}^{n\times n}}\; \langle S, M\rangle \quad \text{s.t.}\quad 0 \preceq M \preceq I,\; \langle I, M\rangle = r,\; \|M\|_1 \leq \bar\rho.$$
A more compact (but nonconvex) formulation is
$$\max_{F\in\mathbb{R}^{n\times r}}\; \langle S, FF^T\rangle \quad \text{s.t.}\quad \|F\|_2 \leq 1,\; \|F\|_{2,1} \leq \bar{R},$$
where $\|F\|_{2,1} := \sum_{i=1}^{n}\|F_{i\cdot}\|_2$ [15]. The latter regularization term is often called a "group-sparse" or "group-LASSO" regularizer. (An early use of this type of regularizer was described in [44].)
2.7. Sparse Plus Low-Rank Matrix Decomposition Another useful paradigm
is to decompose a partly or fully observed n × p matrix Y into the sum of a
sparse matrix and a low-rank matrix. A convex formulation of the fully-observed
problem is
$$\min_{M,S}\; \|M\|_* + \lambda\|S\|_1 \quad \text{s.t.}\quad Y = M + S,$$
where $\|S\|_1 := \sum_{i=1}^{n}\sum_{j=1}^{p}|S_{ij}|$ [11, 14]. Compact, nonconvex formulations that allow noise in the observations include the following:
$$\min_{L,R,S}\; \frac{1}{2}\|LR^T + S - Y\|_F^2 \quad \text{(fully observed)},$$
$$\min_{L,R,S}\; \frac{1}{2}\|P_\Phi(LR^T + S - Y)\|_F^2 \quad \text{(partially observed)},$$
where Φ represents the locations of the observed entries of Y and PΦ is projection
onto this set [15, 48].
One application of these formulations is to robust PCA, where the low-rank
part represents principal components and the sparse part represents “outlier”
observations. Another application is to foreground-background separation in
video processing. Here, each column of Y represents the pixels in one frame of
video, whereas each row of Y shows the evolution of one pixel over time.
2.8. Subspace Identification In this application, the aj ∈ Rn , j = 1, 2, . . . , m are
vectors that lie (approximately) in a low-dimensional subspace. The aim is to
identify this subspace, expressed as the column subspace of a matrix X ∈ Rn×r .
If the aj are fully observed, an obvious way to solve this problem is to perform
a singular value decomposition of the $n \times m$ matrix $A = [a_j]_{j=1}^{m}$, and take $X$ to be the leading $r$ left singular vectors. In interesting variants of this problem,
however, the vectors aj may be arriving in streaming fashion and may be only
partly observed, for example in indices Φj ⊂ {1, 2, . . . , n}. We would thus need to
identify a matrix X and vectors sj ∈ Rr such that
PΦj (aj − Xsj ) ≈ 0, j = 1, 2, . . . , m.
The algorithm for identifying X, described in [1], is a manifold-projection scheme
that takes steps in incremental fashion for each aj in turn. Its validity relies on
incoherence of the matrix X with respect to the principal axes, that is, the matrix
X should not have a few elements that are much larger than the others. A local
convergence analysis of this method is given in [2].
2.9. Support Vector Machines Classification via support vector machines (SVM)
is a classical paradigm in machine learning. This problem takes as input data
(aj , yj ) with aj ∈ Rn and yj ∈ {−1, 1}, and seeks a vector x ∈ Rn and a scalar
β ∈ R such that
Note that the jth term in this summation is zero if the conditions (2.9.1) are
satisfied, and positive otherwise. Even if no pair (x, β) exists with H(x, β) = 0,
the pair (x, β) that minimizes (2.1.2) will be the one that comes as close as possible
to satisfying (2.9.1), in a suitable sense. A term $(\lambda/2)\|x\|_2^2$, where $\lambda$ is a small positive parameter, is often added to (2.9.2), yielding the following regularized version:
$$(2.9.3)\qquad H(x,\beta) = \frac{1}{m}\sum_{j=1}^{m}\max\big(1 - y_j(a_j^T x - \beta),\, 0\big) + \frac{1}{2}\lambda\|x\|_2^2.$$
By introducing variables $s_j$, $j = 1, 2, \ldots, m$, to represent the hinge terms, this problem can be written as the convex quadratic program
$$(2.9.5a)\qquad \min_{x,\beta,s}\; \frac{1}{m}\mathbf{1}^T s + \frac{1}{2}\lambda\|x\|_2^2,$$
$$(2.9.5b)\qquad \text{subject to}\quad s_j \geq 1 - y_j(a_j^T x - \beta),\; s_j \geq 0,\; j = 1, 2, \ldots, m,$$
where $\mathbf{1} = (1, 1, \ldots, 1)^T \in \mathbb{R}^m$.
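The objective (2.9.3) is cheap to evaluate directly, as the following sketch illustrates on synthetic labeled data (the data and constants here are illustrative, not from the text):

```python
# Evaluating the regularized SVM objective (2.9.3) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
yv = np.sign(A @ rng.standard_normal(n) + 0.1)    # labels in {-1, +1}

def svm_objective(x, beta, lam):
    margins = 1.0 - yv * (A @ x - beta)           # hinge arguments
    return np.maximum(margins, 0.0).mean() + 0.5 * lam * (x @ x)

print(svm_objective(np.zeros(n), 0.0, lam=1e-2))  # equals 1 at the zero classifier
```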
Often it is not possible to find a hyperplane that separates the positive and
negative cases well enough to be useful as a classifier. One solution is to trans-
form all of the raw data vectors aj by a mapping ζ into a higher-dimensional
Euclidean space, then perform the support-vector-machine classification on the
vectors ζ(aj ), j = 1, 2, . . . , m.
The conditions (2.9.1) would thus be replaced by
Interestingly, problem (2.9.8) can be formulated and solved without any explicit
knowledge or definition of the mapping ζ. We need only a technique to define the
elements of Q. This can be done with the use of a kernel function K : Rn × Rn → R,
where K(ak , al ) replaces ζ(ak )T ζ(al ) [4, 16]. This is the so-called “kernel trick.”
(The kernel function K can also be used to construct a classification function
φ from the solution of (2.9.8).) A particularly popular choice of kernel is the
Gaussian kernel:
$$K(a_k, a_l) := \exp\big(-\|a_k - a_l\|^2/(2\sigma)\big),$$
where σ is a positive parameter.
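In matrix form, the kernel can be evaluated at all pairs of data vectors at once. A small vectorized sketch, with $\sigma$ a user-chosen parameter:

```python
# Forming the Gaussian kernel matrix K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2*sigma)).
import numpy as np

def gaussian_kernel_matrix(A, sigma):
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] - 2.0 * (A @ A.T) + sq[None, :]   # pairwise ||a_k - a_l||^2
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma))

A = np.random.default_rng(0).standard_normal((6, 3))   # six data vectors in R^3
K = gaussian_kernel_matrix(A, sigma=1.0)
print(K.shape, np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
```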
2.10. Logistic Regression Logistic regression can be viewed as a variant of bi-
nary support-vector machine classification, in which rather than the classification
function φ giving an unqualified prediction of the class in which a new data vector
a lies, it returns an estimate of the odds of a belonging to one class or the other.
We seek an "odds function" $p$ parametrized by a vector $x \in \mathbb{R}^n$ as follows:
$$(2.10.1)\qquad p(a; x) := \big(1 + \exp(a^T x)\big)^{-1},$$
and aim to choose the parameter x so that
where $\lambda > 0$ is a regularization parameter. (Note that we subtract rather than add the regularization term $\lambda\|x\|_1$ to the objective, because this problem is formulated as a maximization rather than a minimization.) As we see later, this term has
the effect of producing a solution in which few components of x are nonzero,
making it possible to evaluate p(a; x) by knowing only those components of a
that correspond to the nonzeros in x.
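To make the formulation concrete, the following sketch evaluates the $\ell_1$-regularized log-likelihood for the odds model (2.10.1). The $\{0, 1\}$ label convention is an assumption made here for illustration, since the intervening definitions fall outside this excerpt.

```python
# The l1-regularized logistic objective for p(a; x) = (1 + exp(a^T x))^(-1),
# to be maximized over x (a sketch; labels y_j in {0, 1} are assumed).
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
A = rng.standard_normal((m, n))
y = (A @ rng.standard_normal(n) < 0).astype(float)

def regularized_loglik(x, lam):
    p = 1.0 / (1.0 + np.exp(A @ x))           # modeled odds for each a_j
    eps = 1e-12                               # guard against log(0)
    ll = np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ll - lam * np.abs(x).sum()         # subtract the regularizer

print(regularized_loglik(np.zeros(n), lam=0.01))   # log(1/2) at x = 0
```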
An important extension of this technique is to multiclass (or multinomial) lo-
gistic regression, in which the data vectors aj belong to more than two classes.
Such applications are common in modern data analysis. For example, in a speech
recognition system, the M classes could each represent a phoneme of speech, one
of the potentially thousands of distinct elementary sounds that can be uttered by
[Figure 2.11.1: a deep neural network, with input nodes, hidden layers, and output nodes.]
Note that this is exactly the function (2.10.8) applied to the output of the top hidden layer $a_j^D(w)$. We write $a_j^D(w)$ to make explicit the dependence of $a_j^D$ on the parameters $w$ of (2.11.2), as well as on the input vector $a_j$. (We can view multiclass logistic regression (2.10.8) as a special case of deep learning in which there are no hidden layers, so that $D = 0$, $w$ is null, and $a_j^D = a_j$, $j = 1, 2, \ldots, m$.)
Neural networks in use for particular applications (in image recognition and
speech recognition, for example, where they have been very successful) include
many variants on the basic design above. These include restricted connectivity
between layers (that is, enforcing structure on the matrices W l , l = 1, 2, . . . , D),
layer arrangements that are more complex than the linear layout illustrated in
Figure 2.11.1, with outputs coming from different levels, connections across non-
adjacent layers, different componentwise transformations σ at different layers,
and so on. Deep neural networks for practical applications are highly engineered
objects.
The loss function (2.11.3) shares with many other applications the “summation”
form (2.1.2), but it has several features that set it apart from the other applications
discussed above. First, and possibly most important, it is nonconvex in the param-
eters w. There is reason to believe that the “landscape” of L is complex, with the
global minimizer being exceedingly difficult to find. Second, the total number
of parameters in (w, X) is usually very large. The most popular algorithms for
minimizing (2.11.3) are of stochastic gradient type, which like most optimization
methods come with no guarantee for finding the minimizer of a nonconvex func-
tion. Effective training of deep learning classifiers typically requires a great deal
of data and computation power. Huge clusters of powerful computers, often us-
ing multicore processors, GPUs, and even specially architected processing units,
are devoted to this task. Efficiency also requires many heuristics in the formula-
tion and the algorithm (for example, in the choice of regularization functions and
in the steplengths for stochastic gradient).
3. Preliminaries
We discuss here some foundations for the analysis of subsequent sections.
These include useful facts about smooth and nonsmooth convex functions, Tay-
lor’s theorem and some of its consequences, optimality conditions, and proximal
operators.
In the discussion of this section, our basic assumption is that f is a mapping
from Rn to R ∪ {+∞}, continuous on its effective domain D := {x | f(x) < ∞}.
Further assumptions on $f$ are introduced as needed.
3.1. Solutions Consider the problem of minimizing $f$ (1.0.1). We have the following terminology:
• $x^*$ is a local minimizer of $f$ if there is a neighborhood $N$ of $x^*$ such that $f(x) \geq f(x^*)$ for all $x \in N$.
• $x^*$ is a global minimizer of $f$ if $f(x) \geq f(x^*)$ for all $x \in \mathbb{R}^n$.
• $x^*$ is a strict local minimizer if it is a local minimizer on some neighborhood $N$ and in addition $f(x) > f(x^*)$ for all $x \in N$ with $x \neq x^*$.
• $x^*$ is an isolated local minimizer if there is a neighborhood $N$ of $x^*$ such that $f(x) \geq f(x^*)$ for all $x \in N$ and in addition, $N$ contains no local minimizers other than $x^*$.
3.2. Convexity and Subgradients A convex set Ω ⊂ Rn has the property that
(3.2.1) x, y ∈ Ω ⇒ (1 − α)x + αy ∈ Ω for all α ∈ [0, 1].
We usually deal with closed convex sets in this article. For a convex set Ω ⊂ Rn
we define the indicator function IΩ (x) as follows:
$$I_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega \\ +\infty & \text{otherwise.} \end{cases}$$
Indicator functions are useful devices for deriving optimality conditions for con-
strained problems, and even for developing algorithms. The constrained opti-
mization problem
(3.2.2) min f(x)
x∈Ω
can be restated equivalently as follows:
$$(3.2.3)\qquad \min_x\; f(x) + I_\Omega(x).$$
We noted already that a convex function φ : Rn → R ∪ {+∞} has the following
defining property:
$$(3.2.4)\qquad \phi\big((1-\alpha)x + \alpha y\big) \leq (1-\alpha)\phi(x) + \alpha\phi(y), \quad \text{for all } x, y \in \mathbb{R}^n \text{ and all } \alpha \in [0,1].$$
The concepts of “minimizer” are simpler in the case of convex objective func-
tions than in the general case. In particular, the distinction between “local” and
“global” minimizers disappears. For f convex in (1.0.1), we have the following.
(a) Any local minimizer of (1.0.1) is also a global minimizer.
(b) The set of global minimizers of (1.0.1) is a convex set.
If there exists a value $\gamma > 0$ such that
$$(3.2.5)\qquad \phi\big((1-\alpha)x + \alpha y\big) \leq (1-\alpha)\phi(x) + \alpha\phi(y) - \frac{1}{2}\gamma\alpha(1-\alpha)\|x - y\|_2^2$$
for all $x$ and $y$ in the domain of $\phi$ and $\alpha \in [0,1]$, we say that $\phi$ is strongly convex with modulus of convexity $\gamma$.
We summarize some definitions and results about subgradients of convex func-
tions here. For a more extensive discussion, see [22].
Proof. From the convexity of f and the definitions of a and b, we deduce that
f(y) f(x) + aT (y − x) and f(x) f(y) + bT (x − y). The result follows by adding
these two inequalities.
We can easily characterize a minimum in terms of the subdifferential.
Theorem 3.2.8. The point x∗ is the minimizer of a convex function f if and only if
0 ∈ ∂f(x∗ ).
Lemma 3.3.10. Given convex $f$ satisfying (3.2.5), with $\nabla f$ uniformly Lipschitz continuous with constant $L$, we have for any $x$, $y$ that
$$(3.3.11)\qquad \frac{\gamma}{2}\|y - x\|^2 \;\leq\; f(y) - f(x) - \nabla f(x)^T(y - x) \;\leq\; \frac{L}{2}\|y - x\|^2.$$
For later convenience, we define a condition number $\kappa$ as follows:
$$(3.3.12)\qquad \kappa := \frac{L}{\gamma}.$$
When $f$ is twice continuously differentiable, we can characterize the constants $\gamma$ and $L$ in terms of the eigenvalues of the Hessian $\nabla^2 f(x)$. Specifically, we can show that (3.3.11) is equivalent to
$$(3.3.13)\qquad \gamma I \preceq \nabla^2 f(x) \preceq LI, \quad \text{for all } x.$$
When f is strictly convex and quadratic, κ defined in (3.3.12) is the condition
number of the (constant) Hessian, in the usual sense of linear algebra.
Strongly convex functions have unique minimizers, as we now show.
Theorem 3.3.14. Let f be differentiable and strongly convex with modulus γ > 0. Then
the minimizer x∗ of f exists and is unique.
Proof. We show first that for any point $x^0$, the level set $\{x \mid f(x) \leq f(x^0)\}$ is closed and bounded, and hence compact. Suppose for contradiction that there is a sequence $\{x^\ell\}$ such that $\|x^\ell\| \to \infty$ and
$$(3.3.15)\qquad f(x^\ell) \leq f(x^0).$$
By strong convexity of $f$, we have for some $\gamma > 0$ that
$$f(x^\ell) \geq f(x^0) + \nabla f(x^0)^T(x^\ell - x^0) + \frac{\gamma}{2}\|x^\ell - x^0\|^2.$$
By rearranging slightly, and using (3.3.15), we obtain
$$\frac{\gamma}{2}\|x^\ell - x^0\|^2 \;\leq\; -\nabla f(x^0)^T(x^\ell - x^0) \;\leq\; \|\nabla f(x^0)\|\,\|x^\ell - x^0\|.$$
By dividing both sides by $(\gamma/2)\|x^\ell - x^0\|$, we obtain $\|x^\ell - x^0\| \leq (2/\gamma)\|\nabla f(x^0)\|$ for all $\ell$, which contradicts unboundedness of $\{x^\ell\}$. Thus, the level set is bounded. Since it is also closed (by continuity of $f$), it is compact.
Since $f$ is continuous, it attains its minimum on the compact level set, which is also the solution of $\min_x f(x)$, and we denote it by $x^*$. Suppose for contradiction that the minimizer is not unique, so that we have two points $x_1^*$ and $x_2^*$ that minimize $f$. Obviously, these points must attain equal objective values, so that $f(x_1^*) = f(x_2^*) = f^*$ for some $f^*$. By taking (3.2.5) and setting $\phi = f$, $x = x_1^*$, $y = x_2^*$, and $\alpha = 1/2$, we obtain
$$f\big((x_1^* + x_2^*)/2\big) \;\leq\; \frac{1}{2}\big(f(x_1^*) + f(x_2^*)\big) - \frac{1}{8}\gamma\|x_1^* - x_2^*\|^2 \;<\; f^*,$$
so the point $(x_1^* + x_2^*)/2$ has a smaller function value than both $x_1^*$ and $x_2^*$, contradicting our assumption that $x_1^*$ and $x_2^*$ are both minimizers. Hence, the minimizer $x^*$ is unique.
3.4. Optimality Conditions for Smooth Functions We consider the case of a
smooth (twice continuously differentiable) function f that is not necessarily con-
vex. Before designing algorithms to find a minimizer of f, we need to identify
properties of f and its derivatives at a point x̄ that tell us whether or not x̄ is a
minimizer, of one of the types described in Subsection 3.1. We call such properties
optimality conditions.
A first-order necessary condition for optimality is that $\nabla f(\bar{x}) = 0$. More precisely, if $\bar{x}$ is a local minimizer, then $\nabla f(\bar{x}) = 0$. We can prove this by using Taylor's theorem. Supposing for contradiction that $\nabla f(\bar{x}) \neq 0$, we can show by setting $x = \bar{x}$ and $p = -\alpha\nabla f(\bar{x})$ for $\alpha > 0$ in (3.3.3) that $f(\bar{x} - \alpha\nabla f(\bar{x})) < f(\bar{x})$ for all $\alpha > 0$ sufficiently small. Thus any neighborhood of $\bar{x}$ will contain points $x$ with $f(x) < f(\bar{x})$, so $\bar{x}$ cannot be a local minimizer.
If f is convex, as well as smooth, the condition ∇f(x̄) = 0 is sufficient for x̄ to be
a global solution. This claim follows immediately from Theorems 3.2.8 and 3.2.9.
A second-order necessary condition for x̄ to be a local solution is that ∇f(x̄) = 0
and ∇2 f(x̄) is positive semidefinite. The proof is by an argument similar to that
of the first-order necessary condition, but using the second-order Taylor series ex-
pansion (3.3.5) instead of (3.3.3). A second-order sufficient condition is that ∇f(x̄) = 0
and ∇2 f(x̄) is positive definite. This condition guarantees that x̄ is a strict local
minimizer, that is, there is a neighborhood of x̄ such that x̄ has a strictly smaller
function value than all other points in this neighborhood. Again, the proof makes
use of (3.3.5).
We call x̄ a stationary point for smooth f if it satisfies the first-order necessary
condition ∇f(x̄) = 0. Stationary points are not necessarily local minimizers. In
fact, local maximizers satisfy the same condition. More interestingly, stationary
points can be saddle points. These are points for which there exist directions u
and v such that f(x̄ + αu) < f(x̄) and f(x̄ + αv) > f(x̄) for all positive α suffi-
ciently small. When the Hessian ∇2 f(x̄) has both strictly positive and strictly
negative eigenvalues, it follows from (3.3.5) that x̄ is a saddle point. When ∇2 f(x̄)
is positive semidefinite or negative semidefinite, second derivatives alone are in-
sufficient to classify x̄; higher-order derivative information is needed.
3.5. Proximal Operators and the Moreau Envelope Here we present some anal-
ysis for analyzing the convergence of algorithms for the regularized problem
(1.0.2), where the objective is the sum of a smooth function and a convex (usually
nonsmooth) function.
We start with a formal definition.
Definition 3.5.1. For a closed proper convex function $h$ and a positive scalar $\lambda$, the Moreau envelope is
$$(3.5.2)\qquad M_{\lambda,h}(x) := \inf_u \left\{ h(u) + \frac{1}{2\lambda}\|u - x\|^2 \right\} = \frac{1}{\lambda}\inf_u \left\{ \lambda h(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$
The proximal operator of the function $\lambda h$ is the value of $u$ that achieves the infimum in (3.5.2), that is,
$$(3.5.3)\qquad \mathrm{prox}_{\lambda h}(x) := \arg\min_u \left\{ \lambda h(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$
From optimality properties for (3.5.3) (see Theorem 3.2.8), we have
(3.5.4) 0 ∈ λ∂h(proxλh (x)) + (proxλh (x) − x).
The Moreau envelope can be viewed as a kind of smoothing or regularization
of the function h. It has a finite value for all x, even when h takes on infinite
values for some x ∈ Rn . In fact, it is differentiable everywhere, with gradient
$$\nabla M_{\lambda,h}(x) = \frac{1}{\lambda}\big(x - \mathrm{prox}_{\lambda h}(x)\big).$$
Moreover, x∗ is a minimizer of h if and only if it is a minimizer of Mλ,h .
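For $h = \|\cdot\|_1$, the proximal operator (3.5.3) has the familiar closed form of componentwise soft-thresholding, and the gradient formula above can be checked numerically. A small sketch (valid away from the kinks of the envelope's second derivative):

```python
# Soft-thresholding as the prox of the l1 norm, plus a finite-difference
# check of the Moreau-envelope gradient formula (an illustrative sketch).
import numpy as np

def prox_l1(x, t):
    # argmin_u { t*||u||_1 + 0.5*||u - x||^2 }, componentwise
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def moreau_l1(x, lam):
    u = prox_l1(x, lam)
    return np.abs(u).sum() + np.linalg.norm(u - x) ** 2 / (2.0 * lam)

lam = 0.5
x = np.array([1.3, -0.2, 0.0, 2.0])
g = (x - prox_l1(x, lam)) / lam                       # claimed gradient of M
h = 1e-6
num = np.array([(moreau_l1(x + h * e, lam) - moreau_l1(x - h * e, lam)) / (2 * h)
                for e in np.eye(x.size)])
print(np.allclose(g, num, atol=1e-4))                 # True
```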
The proximal operator satisfies a nonexpansiveness property. From the optimality conditions (3.5.4) at two points $x$ and $y$, we have
$$x - \mathrm{prox}_{\lambda h}(x) \in \lambda\partial h\big(\mathrm{prox}_{\lambda h}(x)\big), \qquad y - \mathrm{prox}_{\lambda h}(y) \in \lambda\partial h\big(\mathrm{prox}_{\lambda h}(y)\big).$$
By applying monotonicity (Lemma 3.2.7), we have
$$\frac{1}{\lambda}\Big[\big(x - \mathrm{prox}_{\lambda h}(x)\big) - \big(y - \mathrm{prox}_{\lambda h}(y)\big)\Big]^T\Big(\mathrm{prox}_{\lambda h}(x) - \mathrm{prox}_{\lambda h}(y)\Big) \geq 0,$$
zero of {dist(0, ∂f(xk ))} (the sequence of distances from 0 to the subdifferential
∂f(xk )). Other error measures for which we may be able to prove convergence
rates include xk − x∗ (where x∗ is a solution) and f(xk ) − f∗ (where f∗ is the
optimal value of the objective function f). For generality, we denote by {φk } the
sequence of nonnegative scalars whose rate of convergence to 0 we wish to find.
We say that linear convergence holds if there is some σ ∈ (0, 1) such that
$$(3.6.1)\qquad \phi_{k+1}/\phi_k \leq 1 - \sigma, \quad \text{for all } k \text{ sufficiently large.}$$
(This property is sometimes also called geometric or exponential convergence, but
the term linear is standard in the optimization literature, so we use it here.) It
follows from (3.6.1) that there is some positive constant C such that
$$(3.6.2)\qquad \phi_k \leq C(1 - \sigma)^k, \quad k = 1, 2, \ldots.$$
While (3.6.1) implies (3.6.2), the converse does not hold. The sequence
$$\phi_k = \begin{cases} 2^{-k} & k \text{ even} \\ 0 & k \text{ odd,} \end{cases}$$
satisfies (3.6.2) with $C = 1$ and $\sigma = .5$, but does not satisfy (3.6.1). To distinguish between these two slightly different definitions, (3.6.1) is sometimes called Q-linear while (3.6.2) is called R-linear.
Sublinear convergence is, as its name suggests, slower than linear. Several
varieties of sublinear convergence are encountered in optimization algorithms
for data analysis, including the following:
$$(3.6.3a)\qquad \phi_k \leq C/\sqrt{k}, \quad k = 1, 2, \ldots,$$
$$(3.6.3b)\qquad \phi_k \leq C/k, \quad k = 1, 2, \ldots,$$
$$(3.6.3c)\qquad \phi_k \leq C/k^2, \quad k = 1, 2, \ldots,$$
where in each case, $C$ is some positive constant.
Superlinear convergence occurs when the constant σ ∈ (0, 1) in (3.6.1) can be
chosen arbitrarily close to 1. Specifically, we say that the sequence {φk } converges
Q-superlinearly to 0 if
$$(3.6.4)\qquad \lim_{k\to\infty} \phi_{k+1}/\phi_k = 0.$$
Q-quadratic convergence occurs when
$$(3.6.5)\qquad \phi_{k+1}/\phi_k^2 \leq C, \quad k = 1, 2, \ldots,$$
for some sufficiently large $C$. We say that the convergence is R-superlinear if there is a Q-superlinearly convergent sequence $\{\nu_k\}$ that dominates $\{\phi_k\}$ (that is, $0 \leq \phi_k \leq \nu_k$ for all $k$). R-quadratic convergence is defined similarly. Quadratic
and superlinear rates are associated with higher-order methods, such as Newton
and quasi-Newton methods.
When a convergence rate applies globally, from any reasonable starting point,
it can be used to derive a complexity bound for the algorithm, which takes the
4. Gradient Methods
We consider here iterative methods for solving the unconstrained smooth prob-
lem (1.0.1) that make use of the gradient $\nabla f$ (see also [22], which describes subgradient methods for nonsmooth convex functions). We consider mostly methods
that generate an iteration sequence {xk } via the formula
(4.0.1) xk+1 = xk + αk dk ,
where dk is the search direction and αk is a steplength.
We consider the steepest descent method, which searches along the negative
gradient direction dk = −∇f(xk ), proving convergence results for nonconvex
functions, convex functions, and strongly convex functions. In Subsection 4.5, we
consider methods that use more general descent directions dk , proving conver-
gence of methods that make careful choices of the line search parameter αk at
each iteration. In Subsection 4.6, we consider the conditional gradient method for
minimization of a smooth function f over a compact set.
4.1. Steepest Descent The simplest stepsize protocol is the short-step variant
of steepest descent. We assume here that f is differentiable, with gradient ∇f
satisfying the Lipschitz continuity condition (3.3.6) with constant L. We choose
the search direction dk = −∇f(xk ) in (4.0.1), and set the steplength αk to be the
constant 1/L, to obtain the iteration
the constant $1/L$, to obtain the iteration
$$(4.1.1)\qquad x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k), \quad k = 0, 1, 2, \ldots.$$
To estimate the amount of decrease in f obtained at each iterate of this method,
we use Taylor’s theorem. From (3.3.7), we have
$$(4.1.2)\qquad f(x + \alpha d) \leq f(x) + \alpha\nabla f(x)^T d + \frac{L}{2}\alpha^2\|d\|^2,$$
For x = xk and d = −∇f(xk ), the value of α that minimizes the expression on the
right-hand side is α = 1/L. By substituting these values, we obtain
$$(4.1.3)\qquad f(x^{k+1}) = f\big(x^k - (1/L)\nabla f(x^k)\big) \leq f(x^k) - \frac{1}{2L}\|\nabla f(x^k)\|^2.$$
This expression is one of the foundational inequalities in the analysis of optimiza-
tion methods. Depending on the assumptions about f, we can derive a variety of
different convergence rates from this basic inequality.
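The decrease guarantee (4.1.3) is easy to observe numerically. Here is a minimal sketch on a two-dimensional convex quadratic (an illustrative instance; $L$ is simply the largest Hessian eigenvalue):

```python
# Short-step steepest descent (4.1.1), checking the decrease bound (4.1.3).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                   # Hessian of f(x) = 0.5 x^T A x
L = np.linalg.eigvalsh(A).max()              # Lipschitz constant of the gradient
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, -2.0])
for k in range(20):
    g = grad(x)
    x_new = x - g / L                        # constant steplength 1/L
    # inequality (4.1.3): f(x^{k+1}) <= f(x^k) - ||grad f(x^k)||^2 / (2L)
    assert f(x_new) <= f(x) - (g @ g) / (2 * L) + 1e-12
    x = x_new
print(f(x))                                  # decreases toward the minimum 0
```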
4.2. General Case We consider first a function f that is Lipschitz continuously
differentiable and bounded below, but that need not necessarily be convex. Using
(4.1.3) alone, we can prove a sublinear convergence result for the steepest descent
method.
Theorem 4.3.1. Suppose that $f$ is convex and Lipschitz continuously differentiable, satisfying (3.3.6), and that (1.0.1) has a solution $x^*$. Then the steepest descent method with stepsize $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^{\infty}$ that satisfies
$$(4.3.2)\qquad f(x^T) - f^* \leq \frac{L}{2T}\|x^0 - x^*\|^2.$$
$$\sum_{k=0}^{T-1}\big(f(x^{k+1}) - f^*\big) \;\leq\; \frac{L}{2}\sum_{k=0}^{T-1}\big(\|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2\big) \;=\; \frac{L}{2}\big(\|x^0 - x^*\|^2 - \|x^T - x^*\|^2\big) \;\leq\; \frac{L}{2}\|x^0 - x^*\|^2.$$
Since $\{f(x^k)\}$ is a nonincreasing sequence, we have, as required,
$$f(x^T) - f(x^*) \;\leq\; \frac{1}{T}\sum_{k=0}^{T-1}\big(f(x^{k+1}) - f^*\big) \;\leq\; \frac{L}{2T}\|x^0 - x^*\|^2.$$
4.4. Strongly Convex Case Recall that the definition (3.3.9) of strong convexity shows that $f$ can be bounded below by a quadratic with Hessian $\gamma I$. A strongly convex $f$ with $L$-Lipschitz gradients is also bounded above by a similar quadratic (see (3.3.7)), differing only in the quadratic term, which becomes $LI$. From this "sandwich" effect, we derive a linear convergence rate for the gradient method, stated formally in the following theorem.
Theorem 4.4.1. Suppose that $f$ is Lipschitz continuously differentiable, satisfying (3.3.6), and strongly convex, satisfying (3.2.5) with modulus of convexity $\gamma$. Then $f$ has a unique minimizer $x^*$, and the steepest descent method with stepsize $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^{\infty}$ that satisfies
$$f(x^{k+1}) - f(x^*) \;\leq\; \Big(1 - \frac{\gamma}{L}\Big)\big(f(x^k) - f(x^*)\big), \quad k = 0, 1, 2, \ldots.$$
Proof. Existence of the unique minimizer $x^*$ follows from Theorem 3.3.14. Minimizing both sides of the inequality (3.3.9) with respect to $y$, we find that the minimizer on the left side is attained at $y = x^*$, while on the right side it is attained at $x - \nabla f(x)/\gamma$. Plugging these optimal values into (3.3.9), we obtain
$$\min_y f(y) \;\geq\; \min_y \Big\{ f(x) + \nabla f(x)^T(y - x) + \frac{\gamma}{2}\|y - x\|^2 \Big\}$$
$$\Rightarrow\quad f(x^*) \;\geq\; f(x) - \frac{1}{\gamma}\nabla f(x)^T\nabla f(x) + \frac{\gamma}{2}\Big\|\frac{1}{\gamma}\nabla f(x)\Big\|^2$$
$$\Rightarrow\quad f(x^*) \;\geq\; f(x) - \frac{1}{2\gamma}\|\nabla f(x)\|^2.$$
By rearrangement, we obtain
$$(4.4.2)\qquad \|\nabla f(x)\|^2 \;\geq\; 2\gamma\big[f(x) - f(x^*)\big].$$
By substituting (4.4.2) into our basic inequality (4.1.3), we obtain
$$f(x^{k+1}) = f\Big(x^k - \frac{1}{L}\nabla f(x^k)\Big) \;\leq\; f(x^k) - \frac{1}{2L}\|\nabla f(x^k)\|^2 \;\leq\; f(x^k) - \frac{\gamma}{L}\big(f(x^k) - f^*\big).$$
Subtracting $f^*$ from both sides of this inequality yields the result.
After $T$ steps, we have
$$(4.4.3)\qquad f(x^T) - f^* \;\leq\; \Big(1 - \frac{\gamma}{L}\Big)^T\big(f(x^0) - f^*\big),$$
which is convergence of type (3.6.2) with constant $\sigma = \gamma/L$.
4.5. General Case: Line-Search Methods Returning to the case in which f has
Lipschitz continuous gradients but is possibly nonconvex, we consider algorithms
that take steps of the form (4.0.1), where dk is a descent direction, that is, it
makes a positive inner product with the negative gradient −∇f(xk ), so that
∇f(xk )T dk < 0. This condition ensures that f(xk + αdk ) < f(xk ) for sufficiently
small positive values of step length α — we obtain improvement in f by taking
small steps along dk . (This claim follows from (3.3.3).) Line-search methods are
built around this fundamental observation. By introducing additional conditions
on dk and αk , that can be verified in practice with reasonable effort, we can estab-
lish a bound on decrease similar to (4.1.3) on each iteration, and thus a conclusion
similar to that of Theorem 4.2.1.
We assume that $d^k$ satisfies the following for some $\eta > 0$:
$$(4.5.1)\qquad \nabla f(x^k)^T d^k \;\leq\; -\eta\,\|\nabla f(x^k)\|\,\|d^k\|.$$
For the steplength αk , we assume the following weak Wolfe conditions hold, for
some constants c1 and c2 with 0 < c1 < c2 < 1:
and therefore all accumulation points x̄ of the sequence {xk } generated by the
algorithm (4.0.1) have ∇f(x̄) = 0. In the case of f convex, this condition guarantees
that x̄ is a solution of (1.0.1). When f is nonconvex, x̄ may be a local minimum,
but it may also be a saddle point or a local maximum.
The paper [29] uses the stable manifold theorem to show that line-search gra-
dient methods are highly unlikely to converge to stationary points x̄ at which
some eigenvalues of the Hessian ∇2 f(x̄) are negative. Although it is easy to con-
struct examples for which such bad behavior occurs, it requires special choices of
starting point $x^0$. Possibly the most obvious example is where $f(x_1, x_2) = x_1^2 - x_2^2$, starting from $x^0 = (1, 0)^T$, where $d^k = -\nabla f(x^k)$ at each $k$. For this example, all iterates have $x_2^k = 0$ and, under appropriate conditions, converge to the saddle point $\bar{x} = 0$. Any starting point with $x_2^0 \neq 0$ cannot converge to $0$; in fact, it is easy to see that $x_2^k$ diverges away from $0$.
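This behavior is easy to reproduce numerically. In the following sketch, gradient descent from $(1, 0)^T$ slides into the saddle point at the origin, while any perturbation of the second coordinate is amplified at every step (the steplength and iteration count here are illustrative choices):

```python
# Gradient descent on f(x1, x2) = x1^2 - x2^2, illustrating the saddle point.
import numpy as np

def run(x0, alpha=0.1, iters=100):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = np.array([2.0 * x[0], -2.0 * x[1]])   # gradient of f
        x = x - alpha * g                         # x1 shrinks, x2 grows
    return x

print(run([1.0, 0.0]))     # converges to the saddle point (0, 0)
print(run([1.0, 1e-8]))    # the tiny second coordinate diverges away from 0
```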
$$f(x^{k+1}) - f(x^*) \;\leq\; \Big(1 - \frac{2}{k+2}\Big)\big[f(x^k) - f(x^*)\big] + \frac{1}{2}\,\frac{4}{(k+2)^2}\,LD^2 \qquad\text{from (4.6.6), (4.6.2b)}$$
$$\leq\; \frac{2k}{(k+2)^2}\,LD^2 + \frac{2}{(k+2)^2}\,LD^2 \qquad\text{from (4.6.4)}$$
$$=\; 2LD^2\,\frac{k+1}{(k+2)^2} \;=\; 2LD^2\,\frac{k+1}{k+2}\,\frac{1}{k+2} \;\leq\; 2LD^2\,\frac{k+2}{k+3}\,\frac{1}{k+2} \;=\; \frac{2LD^2}{k+3},$$
as required.
5. Prox-Gradient Methods
We now describe an elementary but powerful approach for solving the regularized optimization problem
$$(5.0.1)\qquad \min_{x\in\mathbb{R}^n}\; \phi(x) := f(x) + \lambda\psi(x),$$
where $f$ is a smooth convex function, $\psi$ is a convex regularization function (known simply as the "regularizer"), and $\lambda \geq 0$ is a regularization parameter. The technique we describe here is a natural extension of the steepest-descent approach, in that it reduces to the steepest-descent method analyzed in Theorems 4.3.1 and 4.4.1 applied to $f$ when the regularization term is not present ($\lambda = 0$). It is useful when the regularizer $\psi$ has a simple structure that is easy to account for explicitly, as is true for many regularizers that arise in data analysis, such as the $\ell_1$ norm ($\psi(x) = \|x\|_1$) or the indicator function for a simple set $\Omega$ ($\psi(x) = I_\Omega(x)$), such as a box $\Omega = [l_1, u_1] \otimes [l_2, u_2] \otimes \ldots \otimes [l_n, u_n]$. For such regularizers, the proximal operators can be computed explicitly and efficiently.²
Each step of the algorithm is defined as follows:
$$(5.0.2)\qquad x^{k+1} := \mathrm{prox}_{\alpha_k\lambda\psi}\big(x^k - \alpha_k\nabla f(x^k)\big),$$
for some steplength $\alpha_k > 0$, and the prox operator defined in (3.5.3). By substituting into this definition, we can verify that $x^{k+1}$ is the solution of an approximation to the objective $\phi$ of (5.0.1), namely:
$$(5.0.3)\qquad x^{k+1} := \arg\min_z\; \nabla f(x^k)^T(z - x^k) + \frac{1}{2\alpha_k}\|z - x^k\|^2 + \lambda\psi(z).$$
² For the analysis of this section I am indebted to class notes of L. Vandenberghe, from 2013–14.
One way to verify this equivalence is to note that the objective in (5.0.3) can be written as
$$\frac{1}{\alpha_k}\left\{ \frac{1}{2}\big\|z - \big(x^k - \alpha_k\nabla f(x^k)\big)\big\|^2 + \alpha_k\lambda\psi(z) \right\},$$
(modulo a term $\frac{\alpha_k}{2}\|\nabla f(x^k)\|^2$ that does not involve $z$). The subproblem objective in (5.0.3) consists of a linear term $\nabla f(x^k)^T(z - x^k)$ (the first-order term in a Taylor-series expansion), a proximality term $\frac{1}{2\alpha_k}\|z - x^k\|^2$ that becomes more strict as $\alpha_k \downarrow 0$, and the regularization term $\lambda\psi(z)$ in unaltered form. When $\lambda = 0$, we have $x^{k+1} = x^k - \alpha_k\nabla f(x^k)$, so the iteration (5.0.2) (or (5.0.3)) reduces to the usual steepest-descent approach discussed in Section 4 in this case. It is useful to continue thinking of $\alpha_k$ as playing the role of a line-search parameter, though here the line search is expressed implicitly through a proximal term.
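For $\psi = \|\cdot\|_1$, the iteration (5.0.2) is the well-known iterative soft-thresholding algorithm (ISTA). The following sketch applies it to a synthetic sparse least-squares instance with $\alpha_k \equiv 1/L$ (the data, $\lambda$, and iteration count are illustrative choices):

```python
# Prox-gradient iteration (5.0.2) with the l1 regularizer (ISTA), as a sketch.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 50
A = rng.standard_normal((m, n))
xtrue = np.zeros(n)
xtrue[:5] = 1.0                               # sparse ground truth
y = A @ xtrue + 0.01 * rng.standard_normal(m)

grad_f = lambda x: A.T @ (A @ x - y)          # f(x) = 0.5 ||Ax - y||^2
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad f
lam, alpha = 0.5, 1.0 / L

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
for k in range(500):
    x = prox_l1(x - alpha * grad_f(x), alpha * lam)   # step (5.0.2)
print(np.nonzero(x)[0])                       # support concentrates on indices 0..4
```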
We will demonstrate convergence of the method (5.0.2) at a sublinear rate, for
functions f whose gradients satisfy a Lipschitz continuity property with Lipschitz
constant L (see (3.3.6)), and for the constant steplength choice αk = 1/L. The proof
makes use of a "gradient map" defined by
$$(5.0.4)\qquad G_\alpha(x) := \frac{1}{\alpha}\Big(x - \mathrm{prox}_{\alpha\lambda\psi}\big(x - \alpha\nabla f(x)\big)\Big).$$
By comparing with (5.0.2), we see that this map defines the step taken at iteration $k$ as:
$$(5.0.5)\qquad x^{k+1} = x^k - \alpha_k G_{\alpha_k}(x^k) \quad\Leftrightarrow\quad G_{\alpha_k}(x^k) = \frac{1}{\alpha_k}\big(x^k - x^{k+1}\big).$$
The following technical lemma reveals some useful properties of $G_\alpha(x)$.
The following technical lemma reveals some useful properties of Gα (x).
Lemma 5.0.6. Suppose that in problem (5.0.1), ψ is a closed convex function and that f
is convex with Lipschitz continuous gradient on Rn , with Lipschitz constant L. Then for
the definition (5.0.4) with α > 0, the following claims are true.
(a) $G_\alpha(x) \in \nabla f(x) + \lambda\partial\psi\big(x - \alpha G_\alpha(x)\big)$.
(b) For any $z$, and any $\alpha \in (0, 1/L]$, we have that
$$\phi\big(x - \alpha G_\alpha(x)\big) \;\leq\; \phi(z) + G_\alpha(x)^T(x - z) - \frac{\alpha}{2}\|G_\alpha(x)\|^2.$$
Proof. For part (a), we use the optimality property (3.5.4) of the prox operator,
and make the following substitutions: x − α∇f(x) for “x”, αλ for “λ”, and ψ for
“h” to obtain
$$0 \in \alpha\lambda\,\partial\psi\Big(\mathrm{prox}_{\alpha\lambda\psi}\big(x - \alpha\nabla f(x)\big)\Big) + \Big(\mathrm{prox}_{\alpha\lambda\psi}\big(x - \alpha\nabla f(x)\big) - \big(x - \alpha\nabla f(x)\big)\Big).$$
We make the substitution proxαλψ (x − α∇f(x)) = x − αGα (x), using definition
(5.0.4), to obtain
0 ∈ αλ∂ψ(x − αGα (x)) − α(Gα (x) − ∇f(x)),
and the result follows when we divide by α.
For (b), we start with the following consequence of Lipschitz continuity of ∇f,
from Lemma 3.3.10:
$$f(y) \;\leq\; f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|y - x\|^2.$$
where αk and βk are positive scalars. That is, a momentum term βk (xk − xk−1 )
is added to the usual steepest descent update. Although this method can be ap-
plied to any smooth convex f (and even to nonconvex functions), the convergence
analysis is most straightforward for the special case of strongly convex quadratic
functions (see [38]). (This analysis also suggests appropriate values for the step
lengths $\alpha_k$ and $\beta_k$.) Consider the function
$$(6.1.2)\qquad \min_{x\in\mathbb{R}^n}\; f(x) := \frac{1}{2}x^T Ax - b^T x,$$
where the (constant) Hessian $A$ has eigenvalues in the range $[\gamma, L]$, with $0 < \gamma \leq L$.
For the following constant choices of steplength parameters:
$$\alpha_k = \alpha := \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta := \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$
it can be shown that $\|x^k - x^*\| \leq C\beta^k$, for some (possibly large) constant $C$. We can use (3.3.7) to translate this into a bound on the function error, as follows:
$$f(x^k) - f(x^*) \;\leq\; \frac{L}{2}\|x^k - x^*\|^2 \;\leq\; \frac{LC^2}{2}\beta^{2k},$$
allowing a direct comparison with the rate (4.4.3) for the steepest descent method.
If we suppose that $L \gg \gamma$, we have
$$\beta \approx 1 - 2\sqrt{\frac{\gamma}{L}},$$
so that we achieve approximate convergence $f(x^k) - f(x^*) \leq \epsilon$ (for small positive $\epsilon$) in $O(\sqrt{L/\gamma}\,\log(1/\epsilon))$ iterations, compared with $O((L/\gamma)\log(1/\epsilon))$ for steepest descent — a significant improvement.
The heavy-ball method is fundamental, but several points should be noted.
First, the analysis for convex quadratic f is based on linear algebra arguments,
and does not generalize to general strongly convex nonlinear functions. Second,
the method requires knowledge of γ and L, for the purposes of defining parame-
ters α and β. Third, it is not a descent method; we usually have f(xk+1 ) > f(xk )
for many k. These properties are not specific to the heavy-ball method — some
of them are shared by other methods that use momentum.
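A sketch of the heavy-ball iteration on a random instance of (6.1.2), using the constant parameters above (the test matrix and iteration count are illustrative choices, and the comparison with steepest descent is only qualitative):

```python
# Heavy-ball (6.1.1) on the strongly convex quadratic (6.1.2).
import numpy as np

rng = np.random.default_rng(0)
n, gamma, Lc = 50, 1.0, 100.0
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(gamma, Lc, n)) @ Q.T   # eigenvalues in [gamma, L]
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)

alpha = 4.0 / (np.sqrt(Lc) + np.sqrt(gamma)) ** 2
beta = (np.sqrt(Lc) - np.sqrt(gamma)) / (np.sqrt(Lc) + np.sqrt(gamma))

x = x_prev = np.zeros(n)
for k in range(300):
    g = A @ x - b                                  # gradient of (6.1.2)
    x, x_prev = x - alpha * g + beta * (x - x_prev), x
print(np.linalg.norm(x - xstar))                   # rapid linear convergence
```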
6.2. Conjugate Gradient The conjugate gradient method for solving linear sys-
tems Ax = b (or, equivalently, minimizing the convex quadratic (6.1.2)) where A
is symmetric positive definite, is one of the most important algorithms in compu-
tational science. Though invented earlier than the other algorithms discussed in
this section (see [27]) and motivated in a different way, conjugate gradient clearly
makes use of momentum. Its steps have the form
(6.2.1) xk+1 = xk + αk pk , where pk = −∇f(xk ) + ξk pk−1 ,
for some choices of αk and ξk , which is identical to (6.1.1) when we define βk ap-
propriately. For strongly convex quadratic problems (6.1.2), conjugate gradient has excellent properties. It does not require prior knowledge of the range $[\gamma, L]$ of
minimizer $x^*$. Then the method (6.3.2), (6.4.1) with starting point $x^0 = y^0$ satisfies
$$f(x^T) - f(x^*) \;\leq\; \frac{L+\gamma}{2}\,\|x^0 - x^*\|^2\,\Big(1 - \frac{1}{\sqrt{\kappa}}\Big)^T, \quad T = 1, 2, \ldots.$$
Proof. The proof makes use of a family of strongly convex functions $\Phi_k(z)$ defined inductively as follows:
$$(6.4.3a)\qquad \Phi_0(z) = f(y^0) + \frac{\gamma}{2}\|z - y^0\|^2,$$
$$(6.4.3b)\qquad \Phi_{k+1}(z) = (1 - 1/\sqrt{\kappa})\Phi_k(z) + \frac{1}{\sqrt{\kappa}}\Big( f(y^k) + \nabla f(y^k)^T(z - y^k) + \frac{\gamma}{2}\|z - y^k\|^2 \Big).$$
Each $\Phi_k(\cdot)$ is a quadratic, and an inductive argument shows that $\nabla^2\Phi_k(z) = \gamma I$ for all $k$ and all $z$. Thus, each $\Phi_k$ has the form
$$(6.4.4)\qquad \Phi_k(z) = \Phi_k^* + \frac{\gamma}{2}\|z - v^k\|^2, \quad k = 0, 1, 2, \ldots,$$
where $v^k$ is the minimizer of $\Phi_k(\cdot)$ and $\Phi_k^*$ is its optimal value. (From (6.4.3a), we have $v^0 = y^0$.) We note too that $\Phi_k$ becomes a tighter overapproximation to $f$ as $k \to \infty$. To show this, we use (3.3.9) to replace the final term in parentheses in (6.4.3b) by $f(z)$, then subtract $f(z)$ from both sides of (6.4.3b) to obtain
$$(6.4.5)\qquad \Phi_{k+1}(z) - f(z) \;\leq\; (1 - 1/\sqrt{\kappa})\big(\Phi_k(z) - f(z)\big).$$
In the remainder of the proof, we show that the following bound holds:
$$(6.4.6)\qquad f(x^k) \;\leq\; \min_z \Phi_k(z) = \Phi_k^*, \quad k = 0, 1, 2, \ldots.$$
The upper bound in Lemma 3.3.10 for $x = x^*$ gives $f(z) - f(x^*) \leq (L/2)\|z - x^*\|^2$. By combining this bound with (6.4.5) and (6.4.6), we have
$$f(x^k) - f(x^*) \;\leq\; \Phi_k^* - f(x^*) \;\leq\; \Phi_k(x^*) - f(x^*)$$
$$(6.4.7)\qquad \leq\; (1 - 1/\sqrt{\kappa})^k\big(\Phi_0(x^*) - f(x^*)\big)$$
$$\leq\; (1 - 1/\sqrt{\kappa})^k\big[\big(\Phi_0(x^*) - f(x^0)\big) + \big(f(x^0) - f(x^*)\big)\big]$$
$$\leq\; (1 - 1/\sqrt{\kappa})^k\,\frac{\gamma + L}{2}\,\|x^0 - x^*\|^2.$$
The proof is completed by establishing (6.4.6), by induction on $k$. Since $x^0 = y^0$, it holds by definition at $k = 0$. By using the step formula (6.3.2a), the convexity property (3.3.8) (with $x = y^k$), and the inductive hypothesis, we have
$$(6.4.8)\qquad f(x^{k+1}) \;\leq\; f(y^k) - \frac{1}{2L}\|\nabla f(y^k)\|^2$$
$$=\; (1 - 1/\sqrt{\kappa})f(x^k) + (1 - 1/\sqrt{\kappa})\big(f(y^k) - f(x^k)\big) + f(y^k)/\sqrt{\kappa} - \frac{1}{2L}\|\nabla f(y^k)\|^2$$
$$\leq\; (1 - 1/\sqrt{\kappa})\Phi_k^* + (1 - 1/\sqrt{\kappa})\nabla f(y^k)^T(y^k - x^k) + f(y^k)/\sqrt{\kappa} - \frac{1}{2L}\|\nabla f(y^k)\|^2.$$
Thus the claim is established (and the theorem is proved) if we can show that the
right-hand side in (6.4.8) is bounded above by Φ∗k+1 .
Recalling the observation (6.4.4), we have by taking derivatives of both sides of
(6.4.3b) with respect to z that
(6.4.9)    ∇Φ_{k+1}(z) = γ(1 − 1/√κ)(z − v^k) + ∇f(y^k)/√κ + γ(z − y^k)/√κ.
Since vk+1 is the minimizer of Φk+1 we can set ∇Φk+1 (vk+1 ) = 0 in (6.4.9) to
obtain
(6.4.10)    v^{k+1} = (1 − 1/√κ)v^k + y^k/√κ − ∇f(y^k)/(γ√κ).
By subtracting y^k from both sides of this expression, and taking ‖·‖² of both
sides, we obtain
(6.4.11)
    ‖v^{k+1} − y^k‖² = (1 − 1/√κ)²‖y^k − v^k‖² + ‖∇f(y^k)‖²/(γ²κ)
                       − 2((1 − 1/√κ)/(γ√κ)) ∇f(y^k)^T(v^k − y^k).
By evaluating Φk+1 at z = yk , using both (6.4.4) and (6.4.3b), we obtain
    Φ^∗_{k+1} + (γ/2)‖y^k − v^{k+1}‖²
(6.4.12)    = (1 − 1/√κ)Φ_k(y^k) + f(y^k)/√κ
            = (1 − 1/√κ)Φ^∗_k + (γ/2)(1 − 1/√κ)‖y^k − v^k‖² + f(y^k)/√κ.
By substituting (6.4.11) into (6.4.12), we obtain
    Φ^∗_{k+1} = (1 − 1/√κ)Φ^∗_k + f(y^k)/√κ + (γ(1 − 1/√κ)/(2√κ))‖y^k − v^k‖²
                − (1/(2L))‖∇f(y^k)‖² + ((1 − 1/√κ)/√κ) ∇f(y^k)^T(v^k − y^k)
(6.4.13)      ≥ (1 − 1/√κ)Φ^∗_k + f(y^k)/√κ
                − (1/(2L))‖∇f(y^k)‖² + ((1 − 1/√κ)/√κ) ∇f(y^k)^T(v^k − y^k),
where we simply dropped a nonnegative term from the right-hand side to obtain
the inequality. The final step is to show that
(6.4.14)    v^k − y^k = √κ (y^k − x^k),
which we do by induction. Note that v0 = x0 = y0 , so the claim holds for k = 0.
We have
    v^{k+1} − y^{k+1} = (1 − 1/√κ)v^k + y^k/√κ − ∇f(y^k)/(γ√κ) − y^{k+1}
                      = √κ y^k − (√κ − 1)x^k − √κ ∇f(y^k)/L − y^{k+1}
(6.4.15)              = √κ x^{k+1} − (√κ − 1)x^k − y^{k+1}
                      = √κ (y^{k+1} − x^{k+1}),
where the first equality is from (6.4.10), the second equality is from the inductive
hypothesis, the third equality is from the iteration formula (6.3.2a), and the final
equality is from the iteration formula (6.3.2b) with the definition of βk+1 from
(6.4.1). We have thus proved (6.4.14), and by substituting this equality into (6.4.13),
we obtain that Φ∗k+1 is an upper bound on the right-hand side of (6.4.8). This
establishes (6.4.6) and thus completes the proof of the theorem.
6.5. Lower Bounds on Rates The term “optimal” in Nesterov’s optimal method
is used because the convergence rate achieved by the method is the best possible
(possibly up to a constant), among algorithms that make use of gradient informa-
tion at the iterates xk . This claim can be proved by means of a carefully designed
function, for which no method that makes use of all gradients observed up to and
including iteration k (namely, ∇f(xi ), i = 0, 1, 2, . . . , k) can produce a sequence
{xk } that achieves a rate better than that of Theorem 6.3.5. The function proposed
in [32] is a convex quadratic f(x) = (1/2)x^T Ax − e_1^T x, where A is the n × n
tridiagonal matrix with 2s on the main diagonal and −1s on the first super- and
subdiagonals, and e_1 = (1, 0, 0, . . . , 0)^T is the first coordinate vector.
The solution x∗ satisfies Ax∗ = e1 ; its components are x∗i = 1 − i/(n + 1), for
i = 1, 2, . . . , n. If we use x0 = 0 as the starting point, and construct the iterate
xk+1 as
    x^{k+1} = x^k + Σ_{j=0}^{k} ξ_j ∇f(x^j),
for some coefficients ξ_j, then x^{k+1} can have nonzero entries only in its first
k + 1 components, so that ‖x^{k+1} − x^∗‖² ≥ Σ_{i=k+2}^{n} (x^∗_i)²; this observation
yields the claimed lower bound on the convergence rate.
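A short numerical check of this construction (an illustrative aid, not part of
the text) builds the tridiagonal A and confirms the stated form of x^∗:

    import numpy as np

    n = 10
    # Tridiagonal A: 2 on the diagonal, -1 on the first off-diagonals, as above.
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    e1 = np.zeros(n); e1[0] = 1.0
    x_star = np.linalg.solve(A, e1)
    i = np.arange(1, n + 1)
    print(np.allclose(x_star, 1 - i / (n + 1)))  # True: x*_i = 1 - i/(n+1)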
7. Newton Methods
So far, we have dealt with methods that use first-order (gradient or subgra-
dient) information about the objective function. We have shown that such algo-
rithms can yield sequences of iterates that converge at linear or sublinear rates.
We turn our attention in this chapter to methods that exploit second-derivative
(Hessian) information. The canonical method here is Newton’s method, named
after Isaac Newton, who proposed a version of the method for polynomial equa-
tions in around 1670.
For many functions, including many that arise in data analysis, second-order
information is not difficult to compute, in the sense that the functions that we
deal with are simple (usually compositions of elementary functions). In compar-
ing with first-order methods, there is a tradeoff. Second-order methods typically
have local superlinear or quadratic convergence rates: Once the iterates reach a
neighborhood of a solution at which second-order sufficient conditions are sat-
isfied, convergence is rapid. Moreover, their global convergence properties are
attractive. With appropriate enhancements, they can provably avoid convergence
to saddle points. But the costs of calculating and handling the second-order infor-
mation and of computing the step are higher. Whether this tradeoff makes them
appealing depends on the specifics of the application and on whether the second-
derivative computations are able to take advantage of structure in the objective
function.
We start by sketching the basic Newton’s method for the unconstrained smooth
optimization problem min f(x), and prove local convergence to a minimizer x∗
that satisfies second-order sufficient conditions. Subsection 7.2 discusses perfor-
mance of Newton’s method on convex functions, where the use of Newton search
directions in the line search framework (4.0.1) can yield global convergence. Mod-
ifications of Newton’s method for nonconvex functions are discussed in Subsec-
tion 7.3. Subsection 7.4 discusses algorithms for smooth nonconvex functions
that use gradient and Hessian information but guarantee convergence to points
that approximately satisfy second-order necessary conditions. Some variants of
these methods are related closely to the trust-region methods discussed in Sub-
section 7.3, but the motivation and mechanics are somewhat different.
7.1. Basic Newton’s Method Consider the problem
(7.1.1) min f(x),
where f : Rn → R is a Lipschitz twice continuously differentiable function, where
the Hessian has Lipschitz constant M, that is,
(7.1.2)    ‖∇²f(x′) − ∇²f(x″)‖ ≤ M‖x′ − x″‖,
where ‖·‖ denotes the Euclidean vector norm and its induced matrix norm. New-
ton's method generates a sequence of iterates {x^k}_{k=0,1,2,...} by stepping along
the Newton direction
(7.1.4)    p^k = −∇²f(x^k)^{−1} ∇f(x^k),
so that
(7.1.5)    x^{k+1} = x^k + p^k,    k = 0, 1, 2, . . . .
Theorem 7.1.6. Consider the problem (7.1.1) with f twice Lipschitz continuously differ-
entiable with Lipschitz constant M defined in (7.1.2). Suppose that the second-order suf-
ficient conditions are satisfied for the problem (7.1.1) at the point x^∗, that is, ∇f(x^∗) = 0
and ∇²f(x^∗) ⪰ γI for some γ > 0. Then if ‖x^0 − x^∗‖ ≤ γ/(2M), the sequence defined by
(7.1.5) converges to x^∗ at a quadratic rate, with
(7.1.7)    ‖x^{k+1} − x^∗‖ ≤ (M/γ)‖x^k − x^∗‖²,    k = 0, 1, 2, . . . .
Proof. From (7.1.4) and (7.1.5), and using ∇f(x^∗) = 0, we have
    x^{k+1} − x^∗ = x^k − x^∗ − ∇²f(x^k)^{−1}∇f(x^k)
                  = ∇²f(x^k)^{−1} [∇²f(x^k)(x^k − x^∗) − (∇f(x^k) − ∇f(x^∗))],
so that
(7.1.8)    ‖x^{k+1} − x^∗‖ ≤ ‖∇²f(x^k)^{−1}‖ ‖∇²f(x^k)(x^k − x^∗) − (∇f(x^k) − ∇f(x^∗))‖.
By Taylor's theorem (Theorem 3.3.1) and the Lipschitz condition (7.1.2),
(7.1.9)    ‖∇²f(x^k)(x^k − x^∗) − (∇f(x^k) − ∇f(x^∗))‖
           = ‖∫₀¹ [∇²f(x^k) − ∇²f(x^∗ + t(x^k − x^∗))](x^k − x^∗) dt‖
           ≤ (M/2)‖x^k − x^∗‖².
Moreover, (7.1.2) together with a standard eigenvalue perturbation bound [28] gives
    λ_min(∇²f(x^k)) ≥ λ_min(∇²f(x^∗)) − M‖x^k − x^∗‖,
where λmin (·) denotes the smallest eigenvalue of a symmetric matrix. Thus for
(7.1.10)    ‖x^k − x^∗‖ ≤ γ/(2M),
we have
    λ_min(∇²f(x^k)) ≥ λ_min(∇²f(x^∗)) − M‖x^k − x^∗‖ ≥ γ − M (γ/(2M)) = γ/2,
so that ‖∇²f(x^k)^{−1}‖ ≤ 2/γ. By substituting this result together with (7.1.9) into
(7.1.8), we obtain
    ‖x^{k+1} − x^∗‖ ≤ (2/γ)(M/2)‖x^k − x^∗‖² = (M/γ)‖x^k − x^∗‖²,
verifying the local quadratic convergence rate. By applying (7.1.10) again, we
have
    ‖x^{k+1} − x^∗‖ ≤ (M/γ)‖x^k − x^∗‖ ‖x^k − x^∗‖ ≤ (1/2)‖x^k − x^∗‖,
so, by arguing inductively, we see that the sequence converges to x∗ provided
that x0 satisfies (7.1.10), as claimed.
Of course, we do not need to explicitly identify a starting point x0 in the stated
region of convergence. Any sequence that approaches x^∗ will eventually enter
this region, and thereafter the quadratic convergence guarantees apply.
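A minimal sketch of the pure Newton iteration (7.1.5) follows; the separable test
function is an illustrative assumption used only to exhibit the rapid local
convergence.

    import numpy as np

    def newton(grad, hess, x0, iters=20, tol=1e-12):
        # x^{k+1} = x^k - (grad^2 f(x^k))^{-1} grad f(x^k), as in (7.1.4)-(7.1.5)
        x = x0.copy()
        for _ in range(iters):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            x = x - np.linalg.solve(hess(x), g)
        return x

    # Illustrative strongly convex f(x) = sum_i (exp(x_i) + exp(-x_i)); x* = 0.
    grad = lambda x: np.exp(x) - np.exp(-x)
    hess = lambda x: np.diag(np.exp(x) + np.exp(-x))
    print(newton(grad, hess, np.full(3, 1.0)))  # approximately zero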
We have established that Newton’s method converges rapidly once the iterates
enter the neighborhood of a point x∗ satisfying second-order sufficient optimality
conditions. But what happens when we start far from such a point?
7.2. Newton’s Method for Convex Functions When the function f is convex as
well as smooth, we can devise variants of Newton’s method for which global
convergence and complexity results (in particular, results based on those of Sec-
tion 4.5) can be proved in addition to local quadratic convergence.
When f is strongly convex with modulus γ and satisfies Lipschitz continuity
of the gradient (3.3.6), the Hessian ∇2 f(xk ) is positive definite for all k, with
all eigenvalues in the interval [γ, L]. Thus, the Newton direction (7.1.4) is well
defined at all iterates xk , and is a descent direction satisfying the condition (4.5.1)
with η = γ/L. To verify this claim, note first that
    ‖p^k‖ ≤ ‖∇²f(x^k)^{−1}‖ ‖∇f(x^k)‖ ≤ (1/γ)‖∇f(x^k)‖.
Then
    (p^k)^T ∇f(x^k) = −∇f(x^k)^T ∇²f(x^k)^{−1} ∇f(x^k)
                    ≤ −(1/L)‖∇f(x^k)‖²
                    ≤ −(γ/L)‖∇f(x^k)‖ ‖p^k‖.
We can use the Newton direction in the line-search framework of Subsection 4.5
to obtain a method for which xk → x∗ , where x∗ is the (unique) global minimizer
of f. (This claim follows from the property (4.5.6) together with the fact that x∗ is
the only point for which ∇f(x^∗) = 0.) We can even obtain a complexity result —
an O(1/√T) bound on min_{0≤k≤T−1} ‖∇f(x^k)‖ — from Theorem 4.5.3.
These global convergence properties are enhanced by the local quadratic con-
vergence property of Theorem 7.1.6 if we modify the line-search framework by
accepting the step length αk = 1 in (4.0.1) whenever it satisfies the weak Wolfe
conditions (4.5.2). (It can be shown, by again using arguments based on Taylor’s
theorem (Theorem 3.3.1), that these conditions will be satisfied by αk = 1 for all
xk sufficiently close to the minimizer x∗ .)
Consider now the case in which f is convex and satisfies condition (3.3.6) but
is not strongly convex. Here, the Hessian ∇2 f(xk ) may be singular for some k, so
the direction (7.1.4) may not be well defined. However, by adding any positive
number λk > 0 to the diagonal, we can ensure that the modified Newton direction
defined by
(7.2.1) pk = −[∇2 f(xk ) + λk I]−1 ∇f(xk ),
is well defined and is a descent direction for f. For any η ∈ (0, 1) in (4.5.1),
we have, by choosing λ_k large enough that λ_k/(L + λ_k) ≥ η, that the condition
(4.5.1) is satisfied too, so we can use the resulting direction p^k in the line-search
framework of Subsection 4.5, to obtain a method that converges to a solution
x∗ of (1.0.1), when one exists.
If, in addition, the minimizer x∗ is unique and satisfies a second-order suffi-
cient condition (so that ∇2 f(x∗ ) is positive definite), then ∇2 f(xk ) will be positive
definite too for k sufficiently large. Thus, provided that η is sufficiently small,
the unmodified Newton direction (with λk = 0 in (7.2.1)) will satisfy the condi-
tion (4.5.1). If we use (7.2.1) in the line-search framework of Section 4.5, but set
λk = 0 where possible, and accept αk = 1 as the step length whenever it satisfies
(4.5.2), we can obtain local quadratic convergence to x∗ , in addition to the global
convergence and complexity promised by Theorem 4.5.3.
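The following sketch assembles these elements: the damped direction (7.2.1) with
λ_k = 0 when the Hessian is safely positive definite, and a backtracking line
search that tries α_k = 1 first. It is an illustrative arrangement in the spirit
of the discussion, with a simple sufficient-decrease test standing in for the
weak Wolfe conditions (4.5.2).

    import numpy as np

    def damped_newton(f, grad, hess, x0, lam=1.0, c=1e-4, iters=100):
        x = x0.copy()
        for _ in range(iters):
            g = grad(x)
            if np.linalg.norm(g) < 1e-10:
                break
            H = hess(x)
            # use lam_k = 0 when H is safely positive definite, as in the text
            lam_k = 0.0 if np.min(np.linalg.eigvalsh(H)) > 1e-8 else lam
            p = -np.linalg.solve(H + lam_k * np.eye(len(x)), g)  # direction (7.2.1)
            alpha = 1.0   # unit step first, enabling local quadratic convergence
            while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
                alpha *= 0.5          # backtrack until sufficient decrease holds
            x = x + alpha * p
        return x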
7.3. Newton Methods for Nonconvex Functions For smooth nonconvex f, the
Hessian ∇2 f(xk ) may be indefinite for some k. The Newton direction (7.1.4)
may not exist (when ∇2 f(xk ) is singular) or it may not be a descent direction
(when ∇2 f(xk ) has negative eigenvalues). However, we can still define a modified
Newton direction as in (7.2.1), which will be a descent direction for λk sufficiently
large, and thus can be used in the line-search framework of Section 4.5. For a
given η in (4.5.1), a sufficient condition for pk from (7.2.1) to satisfy (4.5.1) is that
    (λ_k + λ_min(∇²f(x^k))) / (λ_k + L) ≥ η,
where λmin (∇2 f(xk )) is the minimum eigenvalue of the Hessian, which may be
negative. The line-search framework of Section 4.5 can then be applied to ensure
that ∇f(xk ) → 0.
Once again, if the iterates {xk } enter the neighborhood of a local solution x∗
for which ∇2 f(x∗ ) is positive definite, some enhancements of the strategy for
choosing λk and the step length αk can recover the local quadratic convergence
of Theorem 7.1.6.
Formula (7.2.1) is not the only way to modify the Newton direction to ensure
descent in a line-search framework. Other approaches are outlined in [36, Chap-
ter 3]. One such technique is to modify the Cholesky factorization of ∇²f(x^k) by
adding positive elements to the diagonal only as needed to allow the factoriza-
tion to proceed (that is, to avoid taking the square root of a negative number),
then using the modified factorization in place of ∇2 f(xk ) in the calculation of the
Newton step pk . Another technique is to compute an eigenvalue decomposition
∇2 f(xk ) = Qk Λk QTk (where Qk is orthogonal and Λk is the diagonal matrix con-
taining the eigenvalues), then define Λ̃k to be a modified version of Λk in which
all the diagonals are positive. Then, following (7.1.4), pk can be defined as
    p^k := −Q_k Λ̃_k^{−1} Q_k^T ∇f(x^k).
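In code, this eigenvalue-modification strategy might look as follows; flooring
the eigenvalues at a small δ > 0 is one common choice of modification (an
assumption here — the text requires only that the modified diagonals be positive).

    import numpy as np

    def eig_modified_newton_direction(H, g, delta=1e-6):
        # H = Q diag(lam) Q^T; replace lam by max(lam, delta), then form
        # p = -Q lam_mod^{-1} Q^T g, following (7.1.4).
        lam, Q = np.linalg.eigh(H)
        lam_mod = np.maximum(lam, delta)
        return -Q @ ((Q.T @ g) / lam_mod)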
A third approach obtains the step d^k by solving a trust-region subproblem
(7.3.1)    min_d ∇f(x^k)^T d + (1/2) d^T ∇²f(x^k) d    subject to ‖d‖ ≤ Δ_k,
for some radius Δ_k > 0. A solution d of (7.3.1) satisfies a linear system of the form
(7.3.2)    (∇²f(x^k) + λI) d = −∇f(x^k),
for some λ ≥ 0 such that ∇²f(x^k) + λI is positive semidefinite [31]. Finding the
appropriate λ requires the solution of a one-dimensional equation in the
value of the scalar λ_k, for which specialized methods have been devised.
For large-scale problems, it may be too expensive to solve (7.3.1) near-exactly,
since the process may require several factorizations of an n × n matrix (namely,
the coefficient matrix in (7.3.2), for different values of λ). A popular approach
for finding approximate solutions of (7.3.1), which can be used when ∇2 f(xk )
is positive definite, is the dogleg method. In this method the curved path traced
out by solutions of (7.3.2) for values of λ in the interval [0, ∞) is approximated
by a simpler path consisting of two line segments. The first segment joins 0 to
the point d^k_C that minimizes the objective in (7.3.1) along the direction −∇f(x^k),
while the second segment joins d^k_C to the pure Newton step defined in (7.1.4). The
approximate solution is taken to be the point at which this "dogleg" path crosses
the boundary of the trust region ‖d‖ ≤ Δ_k. If the dogleg path lies entirely inside
the trust region, we take d^k to be the pure Newton step. See [36, Section 4.1].
Having discussed the trust-region subproblem (7.3.1), let us outline how it can
be used as the basis for a complete algorithm. A crucial role is played by the ratio
between the actual decrease in f, namely f(x^k) − f(x^k + d^k), and the decrease
predicted by the quadratic objective in (7.3.1). Ideally, this ratio would be close
to 1. If it exceeds a small tolerance (say, 10⁻⁴) we accept the step
and proceed to the next iteration. Otherwise, we conclude that the trust-region
radius Δk is too large, so we do not take the step, shrink the trust region, and
re-solve (7.3.1) to obtain a new step. Additionally, when the actual-to-predicted
ratio is close to 1, we conclude that a larger trust region may hasten progress, so
we increase Δ for the next iteration, provided that the bound ‖d^k‖ ≤ Δ_k really is
active at the solution of (7.3.1).
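A compact sketch of a trust-region Newton iteration with dogleg steps follows;
the acceptance threshold 10⁻⁴ comes from the text, while the expansion test
(ratio above 0.75) and the update factors 2 and 1/4 are illustrative choices.
As in the text, the dogleg step assumes ∇²f(x^k) is positive definite.

    import numpy as np

    def dogleg_step(g, H, delta):
        # Dogleg approximation to the trust-region subproblem (7.3.1).
        p_newton = -np.linalg.solve(H, g)        # pure Newton step (7.1.4)
        if np.linalg.norm(p_newton) <= delta:
            return p_newton
        p_cauchy = -(g @ g) / (g @ H @ g) * g    # minimizer along -grad f
        if np.linalg.norm(p_cauchy) >= delta:
            return delta * p_cauchy / np.linalg.norm(p_cauchy)
        # walk from p_cauchy toward p_newton until hitting the boundary
        d = p_newton - p_cauchy
        a, b, c = d @ d, 2 * (p_cauchy @ d), p_cauchy @ p_cauchy - delta**2
        t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
        return p_cauchy + t * d

    def trust_region_newton(f, grad, hess, x0, delta=1.0, iters=100):
        x = x0.copy()
        for _ in range(iters):
            g, H = grad(x), hess(x)
            if np.linalg.norm(g) < 1e-10:
                break
            d = dogleg_step(g, H, delta)
            predicted = -(g @ d + 0.5 * d @ H @ d)      # model decrease
            rho = (f(x) - f(x + d)) / predicted         # actual-to-predicted ratio
            if rho > 1e-4:
                x = x + d                               # accept the step
                if rho > 0.75 and np.linalg.norm(d) >= 0.99 * delta:
                    delta *= 2.0    # good model, bound active: expand the region
            else:
                delta /= 4.0        # poor agreement: shrink and re-solve
        return x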
Unlike a basic line-search method, the trust-region Newton method can “es-
cape” from a saddle point. Suppose we have ∇f(xk ) = 0 and ∇2 f(xk ) indefinite
with some strictly negative eigenvalues. Then, the solution dk to (7.3.1) will be
nonzero, and the algorithm will step away from the saddle point, in the direc-
tion of most negative curvature for ∇2 f(xk ). Another appealing feature of the
trust-region Newton approach is that when the sequence {xk } approaches a point
x∗ satisfying second-order sufficient conditions, the trust region bound becomes
inactive, and the method takes pure Newton steps (7.1.4) for all sufficiently large
k, thereby attaining the local quadratic convergence that characterizes Newton's
method.
The basic difference between line-search and trust-region methods can be sum-
marized as follows. Line-search methods first choose a direction pk , then decide
how far to move along that direction. Trust-region methods do the opposite: They
choose the distance Δk first, then find the direction that makes the best progress
for this step length.
7.4. A Cubic Regularization Approach Trust-region Newton methods have the
significant advantage of guaranteeing that any accumulation points will satisfy
second-order necessary conditions. A related approach based on cubic regulariza-
tion has similar properties, plus some additional complexity guarantees. Cubic
regularization requires the Hessian to be Lipschitz continuous, as in (7.1.2). It
follows that the following cubic function yields a global upper bound for f:
(7.4.1)    T_M(z; x) := f(x) + ∇f(x)^T(z − x) + (1/2)(z − x)^T ∇²f(x)(z − x)
                        + (M/6)‖z − x‖³.
Using the lower bound f̄ on the objective f, we see that the number of iterations
K required must satisfy the condition
    K min{ ε_g²/(2L), 2ε_H³/(3M²) } ≤ f(x^0) − f̄,
from which we conclude that
    K ≤ max{ 2L ε_g^{−2}, (3/2) M² ε_H^{−3} } (f(x^0) − f̄).
We also observe that the maximum number of iterates required to identify a
point at which only the approximate stationarity condition ‖∇f(x^k)‖ ≤ ε_g holds
is 2L ε_g^{−2} (f(x^0) − f̄). (We can just omit the second-order part of the algorithm.)
Note too that it is easy to devise approximate versions of this algorithm with simi-
lar complexity. For example, the negative curvature direction pk in step (ii) above
can be replaced by an approximation to the direction of most negative curvature,
obtained by the Lanczos iteration with random initialization.
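For concreteness, here is a minimal sketch of a two-phase scheme with this
complexity profile, under the assumptions of this subsection: take a steepest-
descent step with stepsize 1/L while ‖∇f(x^k)‖ > ε_g (each such step decreases f
by at least ε_g²/(2L)), and otherwise step along a direction of curvature below
−ε_H (each such step decreases f by at least 2ε_H³/(3M²)). This is an illustration
consistent with the bounds above, not a verbatim statement of the algorithm
referenced in the text.

    import numpy as np

    def two_phase(grad, hess, x0, L, M, eps_g=1e-4, eps_H=1e-2, iters=10**4):
        x = x0.copy()
        for _ in range(iters):
            g = grad(x)
            if np.linalg.norm(g) > eps_g:
                x = x - g / L                 # decrease >= eps_g^2 / (2L)
                continue
            lam, Q = np.linalg.eigh(hess(x))
            if lam[0] >= -eps_H:
                return x                      # approximate second-order point
            p = Q[:, 0]                       # direction of most negative curvature
            if g @ p > 0:
                p = -p                        # orient p as a descent direction
            x = x + (2 * abs(lam[0]) / M) * p  # decrease >= 2|lam|^3 / (3 M^2)
        return x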
In algorithms that make more complete use of the cubic model (7.4.1), the term
ε_g^{−2} in the complexity expression becomes ε_g^{−3/2}, and the constants are
different.
The subproblems (7.4.1) are more complicated to solve than those in the simple
scheme above. Active research is going on into other algorithms that achieve
complexities similar to those of the cubic regularization approach. A variety of
methods that make use of Newton-type steps, approximate negative curvature di-
rections, accelerated gradient methods, random perturbations, randomized Lanc-
zos and conjugate gradient methods, and other algorithmic elements have been
proposed.
8. Conclusions
We have outlined various algorithmic tools from optimization that are useful
for solving problems in data analysis and machine learning, and presented their
basic theoretical properties. The intersection of optimization and machine learn-
ing is a fruitful and very popular area of current research. All the major machine
learning conferences have a large contingent of optimization papers, and there is
a great deal of interest in developing algorithmic tools to meet new challenges
and in understanding their properties. The edited volume [41] contains a snap-
shot of the state of the art circa 2010, but this is a fast-moving field and there have
been many developments since then.
Acknowledgments
I thank Ching-pei Lee for a close reading and many helpful suggestions, and
David Hong and an anonymous referee for detailed, excellent comments.
References
[1] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from highly incom-
plete information, 48th Annual Allerton Conference on Communication, Control, and Computing,
2010, pp. 704–711. ←57
[2] L. Balzano and S. J. Wright, Local convergence of an algorithm for subspace identification from par-
tial data, Found. Comput. Math. 15 (2015), no. 5, 1279–1314, DOI 10.1007/s10208-014-9227-7.
MR3394711 ←58
[3] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems,
SIAM J. Imaging Sci. 2 (2009), no. 1, 183–202, DOI 10.1137/080716542. MR2486527 ←80, 83
[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers,
Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–
152. ←60
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learn-
ing via the alternating direction methods of multipliers, Foundations and Trends in Machine Learning
3 (2011), no. 1, 1–122. ←51
[6] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, Cambridge, 2004.
MR2061575 ←51
[7] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine
Learning 8 (2015), no. 3–4, 231–357. ←83, 84
[8] S. Bubeck, Y. T. Lee, and M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent,
Technical Report arXiv:1506.08187, Microsoft Research, 2015. ←83
[9] S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs
via low-rank factorization, Math. Program. 95 (2003), no. 2, Ser. B, 329–357, DOI 10.1007/s10107-
002-0352-8. Computational semidefinite and second order cone programming: the state of the art.
MR1976484 ←55
[10] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computa-
tional Mathematics 9 (2009), 717–772. ←55
[11] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, J. ACM 58 (2011),
no. 3, Art. 11, 37, DOI 10.1145/1970392.1970395. MR2811000 ←57
[12] C. Cartis, N. I. M. Gould, and P. L. Toint, Adaptive cubic regularisation methods for unconstrained
optimization. Part I: motivation, convergence and numerical results, Math. Program. 127 (2011), no. 2,
Ser. A, 245–295, DOI 10.1007/s10107-009-0286-5. MR2776701 ←94
[13] C. Cartis, N. I. M. Gould, and P. L. Toint, Adaptive cubic regularisation methods for unconstrained
optimization. Part II: worst-case function- and derivative-evaluation complexity, Math. Program. 130
(2011), no. 2, Ser. A, 295–319, DOI 10.1007/s10107-009-0337-y. MR2855872 ←94
[14] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, Rank-sparsity incoherence for
matrix decomposition, SIAM J. Optim. 21 (2011), no. 2, 572–596, DOI 10.1137/090761793. MR2817479
←57
[15] Y. Chen and M. J. Wainwright, Fast low-rank estimation by projected gradient descent: General statistical
and algorithmic guarantees, UC-Berkeley, 2015, arXiv:1509.03025. ←57
[16] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning 20 (1995), 273–297. ←60
[17] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection,
SIAM J. Matrix Anal. Appl. 30 (2008), no. 1, 56–66, DOI 10.1137/060670985. MR2399568 ←56
[18] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet, A direct formulation for sparse
PCA using semidefinite programming, SIAM Rev. 49 (2007), no. 3, 434–448, DOI 10.1137/050645506.
MR2353806 ←56
[19] T. Dasu and T. Johnson, Exploratory data mining and data cleaning, Wiley Series in Probability and
Statistics, Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2003. MR1979601 ←52
[20] P. Drineas and M. W. Mahoney, Lectures on randomized numerical linear algebra, The Mathematics
of Data, IAS/Park City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018. ←54
[21] D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first order method based on optimal quadratic
averaging, SIAM J. Optim. 28 (2018), no. 1, 251–271, DOI 10.1137/16M1072528. MR3757113 ←83
[22] J. C. Duchi, Introductory lectures on stochastic optimization, The Mathematics of Data, IAS/Park
City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018. ←51, 64, 71
[23] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point
algorithm for maximal monotone operators, Math. Programming 55 (1992), no. 3, Ser. A, 293–318,
DOI 10.1007/BF01581204. MR1168183 ←51
References 97
[24] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Res. Logist. Quart. 3 (1956),
95–110, DOI 10.1002/nav.3800030109. MR0089102 ←76
[25] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso,
Biostatistics 9 (2008), no. 3, 432–441. ←56
[26] A. Griewank, The modification of Newton’s method for unconstrained optimization by bounding cubic
terms, Technical Report NA/12, DAMTP, Cambridge University, 1981. ←92, 94
[27] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Research
Nat. Bur. Standards 49 (1952), 409–436 (1953). MR0060307 ←81
[28] A. J. Hoffman and H. W. Wielandt, The variation of the spectrum of a normal matrix, Duke Math. J.
20 (1953), 37–39. MR0052379 ←89
[29] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, Gradient descent only converges to minimizers,
Conference on Learning Theory, 2016, pp. 1246–1257. ←75
[30] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math.
Programming 45 (1989), no. 3, (Ser. B), 503–528, DOI 10.1007/BF01589116. MR1038245 ←51, 82
[31] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM J. Sci. Statist. Comput. 4 (1983),
no. 3, 553–572, DOI 10.1137/0904038. MR723110 ←92
[32] A. S. Nemirovsky and D. B. Yudin, Problem complexity and method efficiency in optimization, John
Wiley & Sons, Inc., New York, 1983. Translated from the Russian and with a preface by E. R.
Dawson; Wiley-Interscience Series in Discrete Mathematics. MR702836 ←87
[33] Yu. E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k2 )
(Russian), Dokl. Akad. Nauk SSSR 269 (1983), no. 3, 543–547. MR701288 ←80
[34] Y. Nesterov, Introductory lectures on convex optimization, Applied Optimization, vol. 87, Kluwer
Academic Publishers, Boston, MA, 2004. A basic course. MR2142598 ←80, 84
[35] Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Math.
Program. 108 (2006), no. 1, Ser. A, 177–205, DOI 10.1007/s10107-006-0706-8. MR2229459 ←92, 94
[36] J. Nocedal and S. J. Wright, Numerical optimization, second edition, Springer, New York, 2006. ←51, 74,
82, 92, 93
[37] B. T. Poljak, Some methods of speeding up the convergence of iterative methods (Russian), Ž. Vyčisl. Mat.
i Mat. Fiz. 4 (1964), 791–803. MR0169403 ←80
[38] B. T. Polyak, Introduction to optimization, Translations Series in Mathematics and Engineering,
Optimization Software, Inc., Publications Division, New York, 1987. Translated from the Russian;
With a foreword by Dimitri P. Bertsekas. MR1099605 ←80, 81
[39] B. Recht, M. Fazel, and P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equa-
tions via nuclear norm minimization, SIAM Rev. 52 (2010), no. 3, 471–501, DOI 10.1137/070697835.
MR2680543 ←55
[40] R. T. Rockafellar, Convex analysis, Princeton Mathematical Series, No. 28, Princeton University
Press, Princeton, N.J., 1970. MR0274683 ←65
[41] S. Sra, S. Nowozin, and S. J. Wright (eds.), Optimization for machine learning, NIPS Workshop
Series, MIT Press, 2011. ←95
[42] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B 58 (1996),
no. 1, 267–288. MR1379242 ←54
[43] M. J. Todd, Semidefinite optimization, Acta Numer. 10 (2001), 515–560, DOI
10.1017/S0962492901000071. MR2009698 ←51
[44] B. A. Turlach, W. N. Venables, and S. J. Wright, Simultaneous variable selection, Technometrics 47
(2005), no. 3, 349–363, DOI 10.1198/004017005000000139. MR2164706 ←57
[45] L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Rev. 38 (1996), no. 1, 49–95, DOI
10.1137/1038003. MR1379041 ←51
[46] S. J. Wright, Primal-dual interior-point methods, Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA, 1997. MR1422257 ←51
[47] S. J. Wright, Coordinate descent algorithms, Math. Program. 151 (2015), no. 1, Ser. B, 3–34, DOI
10.1007/s10107-015-0892-3. MR3347548 ←51
[48] X. Yi, D. Park, Y. Chen, and C. Caramanis, Fast algorithms for robust PCA via gradient descent,
Advances in Neural Information Processing Systems 29, 2016, pp. 4152–4160. ←57
John C. Duchi
Contents
1 Introduction 100
1.1 Scope, limitations, and other references 101
1.2 Notation 102
2 Basic Convex Analysis 103
2.1 Introduction and Definitions 103
2.2 Properties of Convex Sets 105
2.3 Continuity and Local Differentiability of Convex Functions 112
2.4 Subgradients and Optimality Conditions 114
2.5 Calculus rules with subgradients 119
3 Subgradient Methods 122
3.1 Introduction 122
3.2 The gradient and subgradient methods 123
3.3 Projected subgradient methods 129
3.4 Stochastic subgradient methods 132
4 The Choice of Metric in Subgradient Methods 140
4.1 Introduction 141
4.2 Mirror Descent Methods 141
4.3 Adaptive stepsizes and metrics 151
5 Optimality Guarantees 157
5.1 Introduction 157
5.2 Le Cam’s Method 162
5.3 Multiple dimensions and Assouad’s Method 167
A Technical Appendices 172
A.1 Continuity of Convex Functions 172
A.2 Probability background 173
A.3 Auxiliary results on divergences 175
B Questions and Exercises 176
2010 Mathematics Subject Classification. Primary 65Kxx; Secondary 90C15, 62C20.
Key words and phrases. Convexity, stochastic optimization, subgradients, mirror descent, minimax op-
timal.
1. Introduction
In this set of four lectures, we study the basic analytical tools and algorithms
necessary for the solution of stochastic convex optimization problems, as well as
for providing optimality guarantees associated with the methods. As we proceed
through the lectures, we will be more exact about the precise problem formula-
tions, providing a number of examples, but roughly, by a stochastic optimization
problem we mean a numerical optimization problem that arises from observing
data from some (random) data-generating process. We focus almost exclusively
on first-order methods for the solution of these types of problems, as they have
proven quite successful in the large scale problems that have driven many ad-
vances throughout the 2000s.
Our main goal in these lectures, as in the lectures by S. Wright [61] in this vol-
ume, is to develop methods for the solution of optimization problems arising in
large-scale data analysis. Our route will be somewhat circuitous, as we will first
build the necessary convex analytic and other background (see Lecture 2), but
broadly, the problems we wish to solve are the problems arising in stochastic con-
vex optimization. In these problems, we have samples S coming from a sample
space S, drawn from a distribution P, and we have some decision vector x ∈ Rn
that we wish to choose to minimize the expected loss
(1.0.1)    f(x) := E_P[F(x; S)] = ∫_S F(x; s) dP(s),
where F is convex in its first argument.
The methods we consider for minimizing problem (1.0.1) are typically sim-
ple methods that are slower to converge than more advanced methods—such as
Newton or other second-order methods—for deterministic problems, but have the
advantage that they are robust to noise in the optimization problem itself. Con-
sequently, it is often relatively straightforward to derive generalization bounds
for these procedures: if they produce an estimate x̂ exhibiting good performance
on some sample S₁, . . . , S_m drawn from P, then they are likely to exhibit good
performance (on average) for future data, that is, to have small objective f(x̂);
see Lecture 3, and especially Theorem 3.4.13. It is of course often advantageous
to take advantage of problem structure and geometric aspects of the problem,
broadly defined, which is the goal of mirror descent and related methods, which
we discuss in Lecture 4.
The last part of our lectures is perhaps the most unusual for material on opti-
mization, which is to investigate optimality guarantees for stochastic optimization
problems. In Lecture 5, we study the sample complexity of solving problems of
the form (1.0.1). More precisely, we measure the performance of an optimization
procedure given samples S₁, . . . , S_m drawn independently from the population
distribution P, denoted by x̂ = x̂(S_{1:m}), in a uniform sense: for a class of objec-
tive functions F, a procedure’s performance is its expected error—or risk—for the
worst member of the class F. We provide lower bounds on this maximum risk,
showing that the first-order procedures we have developed satisfy certain notions
of optimality.
We briefly outline the coming lectures. The first lecture provides definitions
and the convex analytic tools necessary for the development of our algorithms
and other ideas, developing separation properties of convex sets as well as other
properties of convex functions from basic principles. The second two lectures
investigate subgradient methods and their application to certain stochastic opti-
mization problems, demonstrating a number of convergence results. The second
lecture focuses on standard subgradient-type methods, while the third investi-
gates more advanced material on mirror descent and adaptive methods, which
require more care but can yield substantial practical performance benefits. The
final lecture investigates optimality guarantees for the various methods we study,
demonstrating two standard techniques for proving lower bounds on the ability
of any algorithm to solve stochastic optimization problems.
1.1. Scope, limitations, and other references The lectures assume some limited
familiarity with convex functions and convex optimization problems and their
formulation, which will help appreciation of the techniques herein. All that is
truly essential is a level of mathematical maturity that includes some real analysis,
linear algebra, and introductory probability. In terms of real analysis, a typical
undergraduate course, such as one based on Marsden and Hoffman's Elementary
Real Analysis [37] or Rudin's Principles of Mathematical Analysis [50], is sufficient.
Readers should not consider these lectures in any way a comprehensive view of
convex analysis or stochastic optimization. These subjects are well-established,
and there are numerous references.
Our lectures begin with convex analysis, whose study Rockafellar, influenced
by Fenchel, launched in his 1970 book Convex Analysis [49]. We develop the basic
ideas necessary for our treatment of first-order (gradient-based) methods for op-
timization, which includes separating and supporting hyperplane theorems, but
we provide essentially no treatment of the important concepts of Lagrangian and
Fenchel duality, support functions, or saddle point theory more broadly. For these
and other important ideas, I have found the books of Rockafellar [49], Hiriart-
Urruty and Lemaréchal [27, 28], Bertsekas [8], and Boyd and Vandenberghe [12]
illuminating.
Convex optimization itself is a huge topic, with thousands of papers and nu-
merous books on the subject. Because of our focus on solution methods for large-
scale problems arising out of data collection, we are somewhat constrained in
our views. Boyd and Vandenberghe [12] provide an excellent treatment of the
possibilities of modeling engineering and scientific problems as convex optimiza-
tion problems, as well as some important numerical methods. Polyak [47] pro-
vides a treatment of stochastic and non-stochastic methods for optimization from
which ours borrows substantially. Nocedal and Wright [46] and Bertsekas [9]
also describe more advanced methods for the solution of optimization problems,
focusing on non-stochastic optimization problems for which there are many so-
phisticated methods.
Because of our goal to solve problems of the form (1.0.1), we develop first-order
methods that are robust to many types of noise from sampling. There are other
approaches to dealing with data uncertainty, notably robust optimization [6],
where researchers study and develop tractable (polynomial-time-solvable) formu-
lations for a variety of data-based problems in engineering and the sciences. The
book of Shapiro et al. [54] provides a more comprehensive picture of stochastic
modeling problems and optimization algorithms than we have been able to in our
lectures, as stochastic optimization is by itself a major field. Several recent sur-
veys on online learning and online convex optimization provide complementary
treatments to ours [26, 52].
The last lecture traces its roots to seminal work in information-based complex-
ity by Nemirovski and Yudin in the early 1980s [41], who investigate the limits of
“optimal” algorithms, where optimality is defined in a worst-case sense accord-
ing to an oracle model of algorithms given access to function, gradient, or other
types of local information about the problem at hand. Issues of optimal estima-
tion in statistics are as old as the field itself, and the minimax formulation we use
is originally due to Wald in the late 1930s [59, 60]. We prove our results using
information theoretic tools, which have broader applications across statistics, and
that have been developed by many authors [31, 33, 62, 63].
1.2. Notation We use mostly standard notation throughout these notes, but for
completeness, we collect it here. We let R denote the typical field of real numbers,
with Rn having its usual meaning as n-dimensional Euclidean space. Given
vectors x and y, we let ⟨x, y⟩ denote the inner product between x and y. Given a
norm ‖·‖, its dual norm ‖·‖_∗ is defined as
    ‖z‖_∗ := sup {⟨z, x⟩ : ‖x‖ ≤ 1}.
Hölder's inequality (see Exercise B.2.4) shows that the ℓ_p and ℓ_q norms, defined
by
    ‖x‖_p = ( Σ_{j=1}^n |x_j|^p )^{1/p}
(and as the limit ‖x‖_∞ = max_j |x_j|) are dual to one another, where 1/p + 1/q = 1
and p, q ∈ [1, ∞]. Throughout, we will assume that ‖x‖₂ = √⟨x, x⟩ is the norm
defined by the inner product ⟨·, ·⟩.
We also require notation related to sets. For a sequence of vectors v1 , v2 , v3 , . . .,
we let (v_n) denote the entire sequence. Given sets A and B, we let A ⊂ B denote
that A is a subset of (possibly equal to) B, and A ⊊ B that A is a strict subset of
B. The notation cl A denotes the closure of A, while int A denotes the interior of
the set A. For a function f, the set dom f is its domain. If f : Rn → R ∪ {+∞} is
convex, we let dom f := {x ∈ Rn | f(x) < +∞}.
2. Basic Convex Analysis
2.1. Introduction and Definitions This set of lecture notes considers convex op-
timization problems, numerical optimization problems of the form
minimize f(x)
(2.1.1)
subject to x ∈ C,
where f is a convex function and C is a convex set. While we will consider
tools to solve these types of optimization problems presently, this first lecture is
concerned most with the analytic tools and background that underlies solution
methods for these problems.
The starting point for any study of convex functions is the definition and study
of convex sets, which are intimately related to convex functions. To that end, we
recall that a set C ⊂ Rn is convex if for all x, y ∈ C,
λx + (1 − λ)y ∈ C for λ ∈ [0, 1].
See Figure 2.1.2.
Figure 2.1.2. (a) A convex set (b) A non-convex set.
Figure 2.1.3. (a) The convex function f(x) = max{x², −2x − .2}
and (b) its epigraph, which is a convex set.
One may ask why, precisely, we focus on convex functions. In short, as Rock-
afellar [49] notes, convex optimization problems are the clearest dividing line
between numerical problems that are efficiently solvable, often by iterative meth-
ods, and numerical problems for which we have no hope. We give one simple
result in this direction first:
Observation: if f is convex and C is a convex set, then any local minimum of f
over C is also a global minimum.

Figure 2.1.4. The slopes (f(x + t) − f(x))/t increase, with t₁ < t₂ < t₃.
To see this, note that if x is a local minimum then for any y ∈ C, we have for
small enough t > 0 that
    f(x) ≤ f(x + t(y − x)),  or equivalently  0 ≤ (f(x + t(y − x)) − f(x))/t.
We now use the criterion of increasing slopes, that is, for any convex function f the
function
(2.1.5)    t ↦ (f(x + tu) − f(x))/t
is increasing in t > 0. (See Fig. 2.1.4.) Indeed, let 0 ≤ t₁ ≤ t₂. Then
epigraphs and gradients, results that in turn find many applications in the design
of optimization algorithms as well as optimality certificates.
A few basic properties We list a few simple properties that convex sets have,
which are evident from their definitions. First, if Cα are convex sets for each
α ∈ A, where A is an arbitrary index set, then the intersection
    C = ∩_{α∈A} C_α
is also convex. Additionally, convex sets are closed under scalar multiplication: if
α ∈ R and C is convex, then
αC := {αx : x ∈ C}
is evidently convex. The Minkowski sum of two convex sets is defined by
C1 + C2 := {x1 + x2 : x1 ∈ C1 , x2 ∈ C2 },
and is also convex. To see this, note that if xi , yi ∈ Ci , then
λ(x1 + x2 ) + (1 − λ)(y1 + y2 ) = λx1 + (1 − λ)y1 + λx2 + (1 − λ)y2 ∈ C1 + C2 .
∈C1 ∈C2
In particular, convex sets are closed under arbitrary linear combinations: if
α ∈ R^m, then C = Σ_{i=1}^m α_i C_i is also convex.
We also define the convex hull of a set of points x1 , . . . , xm ∈ Rn by
    Conv{x₁, . . . , x_m} = { Σ_{i=1}^m λ_i x_i : λ_i ≥ 0, Σ_{i=1}^m λ_i = 1 }.
This set is clearly a convex set.
Projections We now turn to a discussion of orthogonal projection onto a con-
vex set, which will allow us to develop a number of separation properties and
alternate characterizations of convex sets. See Figure 2.2.1 for a geometric view
of projection.
Figure 2.2.1. Projection of the point x onto the set C (with pro-
jection π_C(x)), exhibiting ⟨x − π_C(x), y − π_C(x)⟩ ≤ 0.
We begin by stating a classical result about the projection of zero onto a convex
set.
Theorem 2.2.2 (Projection of zero). Let C be a closed convex set not containing the
origin 0. Then there is a unique point x_C ∈ C such that ‖x_C‖₂ = inf_{x∈C} ‖x‖₂.
Moreover, ‖x_C‖₂ = inf_{x∈C} ‖x‖₂ if and only if
(2.2.3)    ⟨x_C, y − x_C⟩ ≥ 0
for all y ∈ C.
Proof. The key to the proof is the following parallelogram identity, which holds
in any inner product space: for any x, y,
(2.2.4)    (1/2)‖x − y‖₂² + (1/2)‖x + y‖₂² = ‖x‖₂² + ‖y‖₂².
Define M := inf_{x∈C} ‖x‖₂. Now, let (x_n) ⊂ C be a sequence of points in C such that
‖x_n‖₂ → M as n → ∞. By the parallelogram identity (2.2.4), for any n, m ∈ N,
we have
    (1/2)‖x_n − x_m‖₂² = ‖x_n‖₂² + ‖x_m‖₂² − (1/2)‖x_n + x_m‖₂².
Fix ε > 0, and choose N ∈ N such that n ≥ N implies that ‖x_n‖₂² ≤ M² + ε. Then
for any m, n ≥ N, we have
(2.2.5)    (1/2)‖x_n − x_m‖₂² ≤ 2M² + 2ε − (1/2)‖x_n + x_m‖₂².
Now we use the convexity of the set C. We have (1/2)x_n + (1/2)x_m ∈ C for any
n, m, which implies
    (1/2)‖x_n + x_m‖₂² = 2 ‖(1/2)x_n + (1/2)x_m‖₂² ≥ 2M²
by definition of M. Using the above inequality in the bound (2.2.5), we see that
    (1/2)‖x_n − x_m‖₂² ≤ 2M² + 2ε − 2M² = 2ε.
In particular, ‖x_n − x_m‖₂ ≤ 2√ε; since ε was arbitrary, (x_n) forms a Cauchy
sequence and so must converge to a point x_C. The continuity of the norm ‖·‖₂
implies that ‖x_C‖₂ = inf_{x∈C} ‖x‖₂, and the fact that C is closed implies that the
point x_C lies in C.
Now we show the inequality (2.2.3) holds if and only if x_C is the projection of
the origin 0 onto C. Suppose that inequality (2.2.3) holds. Then
    ‖x_C‖₂² = ⟨x_C, x_C⟩ ≤ ⟨x_C, y⟩ ≤ ‖x_C‖₂ ‖y‖₂,
the last inequality following from the Cauchy–Schwarz inequality. Dividing each
side by ‖x_C‖₂ implies that ‖x_C‖₂ ≤ ‖y‖₂ for all y ∈ C. For the converse, let x_C
minimize ‖x‖₂ over C. Then for any t ∈ [0, 1] and any y ∈ C, we have
    ‖x_C‖₂² ≤ ‖(1 − t)x_C + ty‖₂² = ‖x_C + t(y − x_C)‖₂²
            = ‖x_C‖₂² + 2t ⟨x_C, y − x_C⟩ + t² ‖y − x_C‖₂².
Subtracting ‖x_C‖₂² and t²‖y − x_C‖₂² from both sides of the above inequality, we
have
    −t² ‖y − x_C‖₂² ≤ 2t ⟨x_C, y − x_C⟩.
Dividing both sides of the above inequality by 2t, we have
    −(t/2) ‖y − x_C‖₂² ≤ ⟨x_C, y − x_C⟩
for all t ∈ (0, 1]. Letting t ↓ 0 gives the desired inequality.
With this theorem in place, a simple shift gives a characterization of more
general projections onto convex sets.
Corollary 2.2.6 (Projection onto convex sets). Let C be a closed convex set and let
x ∈ Rn . Then there is a unique point πC (x), called the projection of x onto C, such
that ‖x − π_C(x)‖₂ = inf_{y∈C} ‖x − y‖₂, that is, π_C(x) = argmin_{y∈C} ‖y − x‖₂². The
projection is characterized by the inequality
(2.2.7)    ⟨π_C(x) − x, y − π_C(x)⟩ ≥ 0
for all y ∈ C.
Proof. When x ∈ C, the statement is clear. For x ∉ C, the corollary simply fol-
lows by considering the recentered set C − x and applying Theorem 2.2.2 to it.
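As a concrete illustration of the characterization (2.2.7), the following sketch
projects onto a Euclidean ball, where π_C has the closed form below, and checks
the inequality numerically at random points of C; the setup is an illustrative
assumption, not part of the text.

    import numpy as np

    def project_ball(x, radius=1.0):
        # Euclidean projection onto the ball C = {y : ||y||_2 <= radius}.
        nrm = np.linalg.norm(x)
        return x if nrm <= radius else (radius / nrm) * x

    rng = np.random.default_rng(1)
    x = 3.0 * rng.standard_normal(5)
    pi_x = project_ball(x)
    # verify (2.2.7): <pi(x) - x, y - pi(x)> >= 0 for points y in C
    for _ in range(1000):
        y = project_ball(rng.standard_normal(5))
        assert (pi_x - x) @ (y - pi_x) >= -1e-12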
Corollary 2.2.8 (Non-expansive projections). Projections onto convex sets are non-
expansive, in particular,
    ‖π_C(x) − y‖₂ ≤ ‖x − y‖₂
for any x ∈ Rn and y ∈ C.
Proof. When x ∈ C, the inequality is clear, so assume that x ∉ C. Now use
inequality (2.2.7) from the previous corollary. By adding and subtracting y in the
inner product, we have
    0 ≤ ⟨π_C(x) − x, y − π_C(x)⟩ = ⟨π_C(x) − y, y − π_C(x)⟩ + ⟨y − x, y − π_C(x)⟩
      = −‖y − π_C(x)‖₂² + ⟨y − x, y − π_C(x)⟩,
so that ‖y − π_C(x)‖₂² ≤ ⟨y − x, y − π_C(x)⟩ ≤ ‖y − x‖₂ ‖y − π_C(x)‖₂ by the
Cauchy–Schwarz inequality; dividing both sides by ‖y − π_C(x)‖₂ gives the result.
Proposition 2.2.10 (Strict separation of points). Let C be a closed convex set. Given
any point x ∉ C, there is a vector v such that
(2.2.11)    ⟨v, x⟩ > sup_{y∈C} ⟨v, y⟩.
Moreover, we can take the vector v = x − π_C(x), and ⟨v, x⟩ ≥ sup_{y∈C} ⟨v, y⟩ + ‖v‖₂².
See Figure 2.2.9.
We can also investigate the existence of hyperplanes that support the convex
set C, meaning that they touch only its boundary and never enter its interior.
Such hyperplanes—and the halfspaces associated with them—provide alterna-
tive descriptions of convex sets and functions. See Figure 2.2.13.
Figure 2.2.18. The function f (solid blue line) and affine under-
estimators (dotted lines).
Corollary 2.2.19. Let f be a closed convex function that is not identically −∞. Then
    f(x) = sup_{v∈Rⁿ, b∈R} {⟨v, x⟩ + b : f(y) ≥ b + ⟨v, y⟩ for all y ∈ Rⁿ}.
Proof. First, we note that epi f is closed by definition. Moreover, we know that we
can write
    epi f = ∩ {H : H ⊃ epi f},
where H denotes a halfspace. More specifically, we may index each halfspace by
(v, a, c) ∈ Rⁿ × R × R, and we have H_{v,a,c} = {(x, t) ∈ Rⁿ × R : ⟨v, x⟩ + at ≤ c}.
Now, because H ⊃ epi f, we must be able to take t → ∞, so that a ≤ 0. If a < 0,
we may divide by |a| and assume without loss of generality that a = −1, while
otherwise a = 0. So if we let
    H₁ := {(v, c) : H_{v,−1,c} ⊃ epi f}  and  H₀ := {(v, c) : H_{v,0,c} ⊃ epi f},
then
    epi f = ( ∩_{(v,c)∈H₁} H_{v,−1,c} ) ∩ ( ∩_{(v,c)∈H₀} H_{v,0,c} ).
We would like to show that epi f = ∩_{(v,c)∈H₁} H_{v,−1,c}, as the set H_{v,0,c} is a
vertical hyperplane separating the domain of f, dom f, from the rest of the space.
To that end, we show that if (v₁, c₁) ∈ H₁ and (v₀, c₀) ∈ H₀, then
    H := ∩_{λ≥0} H_{v₁+λv₀, −1, c₁+λc₀} = H_{v₁,−1,c₁} ∩ H_{v₀,0,c₀}.
Lemma 2.3.1. Let f be convex and defined on B₁ = {x ∈ Rⁿ : ‖x‖₁ ≤ 1}, the ℓ₁ ball in
n dimensions. Then there exist −∞ < m ≤ M < ∞ such that m ≤ f(x) ≤ M for all
x ∈ B₁.
If x ∈ int dom f, there exists a constant L < ∞ such that |f′(x; u)| ≤ L‖u‖ for any
u ∈ Rⁿ. If f is Lipschitz continuous with respect to the norm ‖·‖, we can take L to be
the Lipschitz constant of f.
Lastly, we state a well-known condition that is equivalent to convexity. This is
intuitive: if a function is bowl-shaped, it should have positive second derivatives.
Theorem 2.3.5. Let f : Rn → R be twice continuously differentiable. Then f is convex
if and only if ∇²f(x) ⪰ 0 for all x, that is, ∇²f(x) is positive semidefinite.
Recall that a vector g ∈ Rⁿ is a subgradient of f at the point x if
f(y) ≥ f(x) + ⟨g, y − x⟩ for all y ∈ Rⁿ; the subdifferential ∂f(x) is the set of all
subgradients of f at x. That
is, f should always have global linear underestimators of itself. When a function f
is convex, the subgradient generalizes the derivative of f (which is a global linear
underestimator of f when f is differentiable), and is also intimately related to
optimality conditions for convex minimization.
Figure: a convex function f with affine underestimators f(x₁) + ⟨g₁, x − x₁⟩,
f(x₂) + ⟨g₂, x − x₂⟩, and f(x₂) + ⟨g₃, x − x₂⟩, illustrating that several distinct
subgradients (here g₂ and g₃) may exist at a point of nondifferentiability.
for all y, by dividing both sides by −b. In particular, −v/b is a subgradient. Thus,
suppose for the sake of contradiction that b = 0. Then we have ⟨v, x − y⟩ ≥ 0 for
all y ∈ dom f, but we assumed that x ∈ int dom f, so for small enough ε > 0, we
can set y = x + εv. This would imply that 0 ≤ ⟨v, x − y⟩ = −ε⟨v, v⟩, i.e. v = 0,
contradicting the fact that at least one of v and b must be non-zero.
For the compactness of ∂f(x), we use Lemma 2.3.1, which implies that f is
bounded in an ℓ₁-ball around x. As x ∈ int dom f by assumption, there is
some ε > 0 such that x + εB ⊂ int dom f for the ℓ₁-ball B = {v : ‖v‖₁ ≤ 1}.
Lemma 2.3.1 implies that sup_{v∈B} f(x + εv) = M < ∞ for some M, so we have
M ≥ f(x + εv) ≥ f(x) + ε⟨g, v⟩ for all v ∈ B and g ∈ ∂f(x), or ‖g‖_∞ ≤ (M − f(x))/ε.
Thus ∂f(x) is closed and bounded, hence compact.
The next two results require a few auxiliary results related to the directional
derivative of a convex function. The reason for this is that both require connect-
ing the local properties of the convex function f with the sub-differential ∂f(x),
which is difficult in general since ∂f(x) can consist of multiple vectors. However,
by looking at directional derivatives, we can accomplish what we desire. The
connection between a directional derivative and the subdifferential is contained
in the next two lemmas.
Proof. Denote by S = {g : ⟨g, u⟩ ≤ f′(x; u) for all u} the set on the right hand side
of the equality (2.4.5), and let g ∈ S. By the increasing slopes condition, we have
    ⟨g, u⟩ ≤ f′(x; u) ≤ (f(x + αu) − f(x))/α
for all u and α > 0; in particular, by taking α = 1 and u = y − x, we have the
standard subgradient inequality that f(x) + ⟨g, y − x⟩ ≤ f(y). So if g ∈ S, then
g ∈ ∂f(x). Conversely, for any g ∈ ∂f(x), the definition of a subgradient implies
that
    f(x + αu) ≥ f(x) + ⟨g, x + αu − x⟩ = f(x) + α⟨g, u⟩.
Subtracting f(x) from both sides and dividing by α gives that
    (1/α)[f(x + αu) − f(x)] ≥ sup_{g∈∂f(x)} ⟨g, u⟩;
taking α ↓ 0 then gives f′(x; u) ≥ sup_{g∈∂f(x)} ⟨g, u⟩, so that ∂f(x) ⊂ S.
Proof. Certainly, Lemma 2.4.4 shows that f′(x; u) ≥ sup_{g∈∂f(x)} ⟨g, u⟩. We must
show the other direction. To that end, note that viewed as a function of u, f′(x; u)
is convex and positively homogeneous, meaning that f′(x; tu) = tf′(x; u) for t ≥ 0.
Thus, we can always write (by Corollary 2.2.19)
    f′(x; u) = sup {⟨v, u⟩ + b : f′(x; w) ≥ b + ⟨v, w⟩ for all w ∈ Rⁿ}.
Using the positive homogeneity, we have f′(x; 0) = 0 and thus we must have
b = 0, so that u ↦ f′(x; u) is characterized as the supremum of linear functions:
    f′(x; u) = sup {⟨v, u⟩ : f′(x; w) ≥ ⟨v, w⟩ for all w ∈ Rⁿ}.
But the set {v : ⟨v, w⟩ ≤ f′(x; w) for all w} is simply ∂f(x) by Lemma 2.4.4.
A relatively straightforward calculation using Lemma 2.4.4, which we give
in the next proposition, shows that the subgradient is simply the gradient of
differentiable convex functions. Note that as a consequence of this, we have
the first-order inequality that f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for any differentiable
convex function.
Proposition 2.4.8. Let f be convex and differentiable at a point x. Then ∂f(x) = {∇f(x)}.
Proposition 2.4.9. Suppose that f is L-Lipschitz with respect to the norm ‖·‖ over a set
C, where C ⊂ int dom f. Then
    sup{‖g‖_∗ : g ∈ ∂f(x), x ∈ C} ≤ L.
Moreover, we have that ‖x‖ = sup_{y:‖y‖_∗≤1} ⟨y, x⟩. Fixing x ∈ Rⁿ, we thus see that
if ‖g‖_∗ ≤ 1 and ⟨g, x⟩ = ‖x‖, then
    ‖x‖ + ⟨g, y − x⟩ = ‖x‖ − ‖x‖ + ⟨g, y⟩ ≤ sup_{v:‖v‖_∗≤1} ⟨v, y⟩ = ‖y‖.
Figure 2.4.10. Optimality in convex optimization: for some g ∈ ∂f(x), the
vector −g supports the feasible set C at the minimizing point x.
The next theorem, containing necessary and sufficient conditions for a point x
to minimize a convex function f, generalizes the standard first-order optimality
conditions for differentiable f (e.g., Section 4.2.3 in [12]). The intuition for Theo-
rem 2.4.11 is that there is a vector g in the subgradient set ∂f(x) such that −g is
a supporting hyperplane to the feasible set C at the point x. That is, the direc-
tions of decrease of the function f lie outside the optimization set C. Figure 2.4.10
shows this behavior.
Theorem 2.4.11. Let f be convex. The point x ∈ int dom f minimizes f over a convex
set C if and only if there exists a subgradient g ∈ ∂f(x) such that simultaneously for all
y ∈ C,
(2.4.12)    ⟨g, y − x⟩ ≥ 0.
Proof. One direction of the theorem is easy. Indeed, pick y ∈ C. Then certainly
there exists g ∈ ∂f(x) for which ⟨g, y − x⟩ ≥ 0. Then by definition,
    f(y) ≥ f(x) + ⟨g, y − x⟩ ≥ f(x).
This holds for any y ∈ C, so x is clearly optimal.
For the converse, suppose that x minimizes f over C. Then for any y ∈ C and
any t ≥ 0 such that x + t(y − x) ∈ C, we have
    f(x + t(y − x)) ≥ f(x),  or  0 ≤ (f(x + t(y − x)) − f(x))/t.
Taking the limit as t ↓ 0, we have f′(x; y − x) ≥ 0 for all y ∈ C. Now, let
us suppose for the sake of contradiction that there exists a y such that for all
g ∈ ∂f(x), we have ⟨g, y − x⟩ < 0. Because
    ∂f(x) = {g : ⟨g, u⟩ ≤ f′(x; u) for all u ∈ Rⁿ}
by Lemma 2.4.6, and ∂f(x) is compact, we have that sup_{g∈∂f(x)} ⟨g, y − x⟩ is at-
tained, which would imply
    f′(x; y − x) < 0.
This is a contradiction.
2.5. Calculus rules with subgradients We present a number of calculus rules
that show how subgradients are, essentially, similar to derivatives, with a few
exceptions (see also Ch. VII of [27]). When we develop methods for optimization
problems based on subgradients, these basic calculus rules will prove useful.
Scaling. If we let h(x) = αf(x) for some α ≥ 0, then ∂h(x) = α∂f(x).
Finite sums. Suppose that f₁, . . . , f_m are convex functions and let f = Σ_{i=1}^m f_i.
Then
    ∂f(x) = Σ_{i=1}^m ∂f_i(x),
where the addition is Minkowski addition. To see that Σ_{i=1}^m ∂f_i(x) ⊂ ∂f(x), let
g_i ∈ ∂f_i(x) for each i. Clearly, f(y) = Σ_{i=1}^m f_i(y) ≥ Σ_{i=1}^m [f_i(x) + ⟨g_i, y − x⟩], so
that Σ_{i=1}^m g_i ∈ ∂f(x). The converse is somewhat more technical and is a special
case of the results to come.
Integrals. More generally, we can extend this summation result to integrals, as-
suming the integrals exist. These calculations are essential for our development
of stochastic optimization schemes based on stochastic (sub)gradient information
in the coming lectures. Indeed, for each s ∈ S, where S is some set, let fs be
convex. Let μ be a positive measure on the set S, and define the convex function
f(x) = ∫ f_s(x) dμ(s). In the notation of the introduction (Eq. (1.0.1)) and the prob-
lems coming in Section 3.4, we take μ to be a probability distribution on a set S,
and if F(·; s) is convex in its first argument for all s ∈ S, then we may take
f(x) = E[F(x; S)]
and satisfy the conditions above. We shall see many such examples in the sequel.
Then if we let g_s(x) ∈ ∂f_s(x) for each s ∈ S, we have (assuming the integral
exists and that the selections g_s(x) are appropriately measurable)
(2.5.1)    ∫ g_s(x) dμ(s) ∈ ∂f(x).
To see the inclusion, note that for any y we have
    ⟨∫ g_s(x) dμ(s), y − x⟩ = ∫ ⟨g_s(x), y − x⟩ dμ(s)
                            ≤ ∫ (f_s(y) − f_s(x)) dμ(s) = f(y) − f(x),
so the inclusion (2.5.1) holds. Eliding a few technical details, one generally obtains
the equality
    ∂f(x) = { ∫ g_s(x) dμ(s) : g_s(x) ∈ ∂f_s(x) for each s ∈ S }.
Returning to our running example of stochastic optimization, if we have a
collection of functions F : Rn × S → R, where for each s ∈ S the function F(·; s)
is convex, then f(x) = E[F(x; S)] is convex when we take expectations over S, and
taking
g(x; s) ∈ ∂F(x; s)
gives a stochastic gradient with the property that E[g(x; S)] ∈ ∂f(x). For more on
these calculations and conditions, see the classic paper of Bertsekas [7], which
addresses the measurability issues.
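As an illustrative instance of this construction (the hinge loss here is an assumed
example of a convex integrand F, not one taken from the text): for
F(x; (a, b)) = max{0, 1 − b⟨a, x⟩}, the vector computed below is a subgradient in x
for each sample, so its expectation lies in ∂f(x) and can drive a stochastic
subgradient iteration.

    import numpy as np

    def hinge_subgrad(x, a, b):
        # A subgradient in x of F(x; (a, b)) = max(0, 1 - b * <a, x>).
        return -b * a if b * (a @ x) < 1 else np.zeros_like(x)

    # E[hinge_subgrad(x, A, B)] is an element of the subdifferential of
    # f(x) = E[F(x; (A, B))], so it can drive a stochastic subgradient method:
    rng = np.random.default_rng(2)
    x = np.zeros(10)
    for k in range(1, 1001):
        a = rng.standard_normal(10)
        b = np.sign(a[0] + 0.1 * rng.standard_normal())   # noisy label
        x = x - (1.0 / np.sqrt(k)) * hinge_subgrad(x, a, b)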
Affine transformations. Let f : Rm → R be convex and A ∈ Rm×n and
b ∈ Rm . Then h : Rn → R defined by h(x) = f(Ax + b) is convex and has
subdifferential
∂h(x) = AT ∂f(Ax + b).
Indeed, let g ∈ ∂f(Ax + b), so that
    h(y) = f(Ay + b) ≥ f(Ax + b) + ⟨g, (Ay + b) − (Ax + b)⟩ = h(x) + ⟨A^T g, y − x⟩,
giving the result.
Finite maxima. Let f₁, . . . , f_m be convex and define f(x) := max_{i≤m} f_i(x). Then
epi f = ∩_{i=1}^m epi f_i,
which is convex, and f is convex. Now, let i be any index such that f_i(x) = f(x),
and let gi ∈ ∂fi (x). Then we have for any y ∈ Rn that
    f(y) ≥ f_i(y) ≥ f_i(x) + ⟨g_i, y − x⟩ = f(x) + ⟨g_i, y − x⟩.
So gi ∈ ∂f(x). More generally, we have the result that
(2.5.2) ∂f(x) = Conv{∂fi (x) : fi (x) = f(x)},
that is, the subgradient set of f is the convex hull of the subgradients of active
functions at x, that is, those attaining the maximum. If there is only a single
unique active function fi , then ∂f(x) = ∂fi (x). See Figure 2.5.3 for a graphical
representation.
Figure 2.5.3. The function f = max{f₁, f₂} and its epigraph; at points where
both functions are active, both ∂f₁ and ∂f₂ (and their convex combinations)
contribute to ∂f.
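A small sketch of the rule (2.5.2) (the particular functions below are illustrative
assumptions): returning the gradient of any active f_i yields an element of ∂f(x).

    import numpy as np

    def max_subgrad(fs, grads, x):
        # A subgradient of f = max_i f_i at x: the gradient of any active f_i,
        # which lies in Conv{grad f_i(x) : f_i(x) = f(x)} as in (2.5.2).
        vals = np.array([fi(x) for fi in fs])
        i = int(np.argmax(vals))        # an active index
        return grads[i](x)

    # Example: f(x) = max{x^2, -2x - 0.2}, as in Figure 2.1.3.
    fs = [lambda x: x**2, lambda x: -2 * x - 0.2]
    grads = [lambda x: 2 * x, lambda x: -2.0]
    print(max_subgrad(fs, grads, 1.0))  # 2.0: the quadratic piece is active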
Given the limited focus of these lecture notes, we have only barely touched on many
topics in convex analysis, developing only those we need. Two omissions are
perhaps the most glaring: except tangentially, we have provided no discussion of
conjugate functions and conjugacy, and we have not discussed Lagrangian dual-
ity, both of which are central to any study of convex analysis and optimization.
A number of books provide coverage of convex analysis in finite and infinite di-
mensional spaces and make excellent further reading. For broad coverage of con-
vex optimization problems, theory, and algorithms, Boyd and Vandenberghe [12]
is an excellent reference, also providing coverage of basic convex duality theory
and conjugate functions. For deeper forays into convex analysis, personal fa-
vorites of mine include the books of Hiriart-Urruty and Lemaréchal [27, 28], as
well as the shorter volume [29], and Bertsekas [8] also provides an elegant geo-
metric picture of convex analysis and optimization. Our approach here follows
Hiriart-Urruty and Lemaréchal’s most closely. For a treatment of the issues of
separation, convexity, duality, and optimization in infinite dimensional spaces,
an excellent reference is the classic book by Luenberger [36].
3. Subgradient Methods
Lecture Summary: In this lecture, we discuss first order methods for the min-
imization of convex functions. We focus almost exclusively on subgradient-
based methods, which are essentially universally applicable for convex opti-
mization problems, because they rely very little on the structure of the prob-
lem being solved. This leads to effective but slow algorithms in classical
optimization problems. In large scale problems arising out of machine learn-
ing and statistical tasks, however, subgradient methods enjoy a number of
(theoretical) optimality properties and have excellent practical performance.
any problem for which we can compute subgradients efficiently can be solved
to accuracy ε in time polynomial in the dimension n of the problem and log(1/ε)
by the ellipsoid method (cf. [41, 45]). Moreover, for somewhat better structured
(but still quite general) convex problems, interior point and second order meth-
ods [12, 45] are practically and theoretically quite efficient, sometimes requiring
only O(log log 1 ) iterations to achieve optimization error . (See the lectures by S.
Wright in this volume.) These methods use the Newton method as a basic solver,
along with specialized representations of the constraint set C, and can be quite
powerful.
However, for large scale problems, the time complexity of standard interior
point and Newton methods can be prohibitive. Indeed, for problems in n-dimen-
sions—that is, when x ∈ Rⁿ—interior point methods scale at best as O(n³), and
can be much worse. When n is large (where today, large may mean n ≈ 10⁹),
this becomes highly non-trivial. In such large scale problems and problems aris-
ing from any type of data-collection process, it is reasonable to expect that our
representation of problem data is inexact at best. In statistical machine learning
problems, for example, this is often the case; generally, many applications do not
require accuracy higher than, say, ε = 10⁻² or 10⁻³, in which case faster but less
exact methods become attractive.
It is with this motivation that we approach the problem (3.1.1) in this lec-
ture, showing classical subgradient algorithms. These algorithms have the ad-
vantage that their per-iteration costs are low—O(n) or smaller for n-dimensional
problems—but they achieve low accuracy solutions to (3.1.1) very quickly. More-
over, depending on problem structure, they can sometimes achieve convergence
rates that are independent of problem dimension. More precisely, and as we will
see later, the methods we study will guarantee convergence to an ε-optimal so-
lution to problem (3.1.1) in O(1/ε²) iterations, while methods that achieve better
dependence on ε require at least n log(1/ε) iterations.
3.2. The gradient and subgradient methods We begin by focusing on the un-
constrained case, that is, when the set C in problem (3.1.1) is C = Rn . That is, we
wish to solve
minimize_{x∈Rⁿ} f(x).
We first review the gradient descent method, using it as motivation for what
follows. In the gradient descent method, we minimize the objective (3.1.1) by
iteratively updating
(3.2.1) xk+1 = xk − αk ∇f(xk ),
where α_k > 0 is a positive sequence of stepsizes. The original motivations for
this choice of update come from the fact that x⋆ minimizes a convex f if and only
if 0 = ∇f(x⋆); we believe a more compelling justification comes from the idea
of modeling the convex function being minimized. Indeed, the update (3.2.1) is
equivalent to
(3.2.2)  x_{k+1} = argmin_x { f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/(2α_k))‖x − x_k‖₂² }.
The interpretation is that the function x ↦ f(x_k) + ⟨∇f(x_k), x − x_k⟩ is
the best linear approximation to the function f at the point x_k, and we would like
to make progress minimizing f. So we minimize this linear approximation, but to
make sure that it has fidelity to the function f, we add a quadratic penalty ‖x − x_k‖₂² to pe-
nalize moving too far from x_k, which would invalidate the linear approximation.
See Figure 3.2.3.
(Figure 3.2.3: the function f(x), its linear approximation f(x_k) + ⟨∇f(x_k), x − x_k⟩, and the regularized model f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/2)‖x − x_k‖₂².)
The negative of the minimum-norm element of ∂f(x) is a descent direction, but we do not prove this here. Indeed, finding such a
descent direction would require explicitly calculating the entire subgradient set
∂f(x), which for a number of functions is non-trivial and breaks the simplicity of
the subgradient method (3.2.4), which works with any subgradient.
It is the case, however, that so long as the point x does not minimize f(x),
subgradients do descend on a related quantity: the distance of x to any optimal point.
Indeed, let g ∈ ∂f(x), and let x⋆ ∈ argminₓ f(x) (we assume such a point exists),
which need not be unique. Then we have for any α that
(1/2)‖x − αg − x⋆‖₂² = (1/2)‖x − x⋆‖₂² − α⟨g, x − x⋆⟩ + (α²/2)‖g‖₂².
The key is that for small enough α > 0, the quantity on the right is strictly
smaller than (1/2)‖x − x⋆‖₂², as we now show. We use the defining inequality of the
subgradient, that is, that f(y) ≥ f(x) + ⟨g, y − x⟩ for all y, including y = x⋆. This gives
−⟨g, x − x⋆⟩ = ⟨g, x⋆ − x⟩ ≤ f(x⋆) − f(x), and thus
(3.2.5)  (1/2)‖x − αg − x⋆‖₂² ≤ (1/2)‖x − x⋆‖₂² − α(f(x) − f(x⋆)) + (α²/2)‖g‖₂².
From inequality (3.2.5), we see immediately that, no matter our choice g ∈ ∂f(x),
we have
0 < α < 2(f(x) − f(x⋆))/‖g‖₂²  implies  ‖x − αg − x⋆‖₂² < ‖x − x⋆‖₂².
Summarizing, by noting that f(x) − f(x⋆) > 0, we have
Observation 3.2.6. If 0 ∉ ∂f(x), then for any x⋆ ∈ argminₓ f(x) and any g ∈ ∂f(x),
there is a stepsize α > 0 such that ‖x − αg − x⋆‖₂² < ‖x − x⋆‖₂².
α_k[f(x_k) − f(x⋆)] ≤ (1/2)‖x_k − x⋆‖₂² − (1/2)‖x_{k+1} − x⋆‖₂² + (α_k²/2)‖g_k‖₂²
≤ (1/2)‖x_k − x⋆‖₂² − (1/2)‖x_{k+1} − x⋆‖₂² + (α_k²/2)M².
By summing the preceding expression from k = 1 to k = K and canceling the
alternating ±‖x_k − x⋆‖₂² terms, we obtain the theorem.
Theorem 3.2.7 is the starting point from which we may derive a number of
useful consequences. First, we use convexity to obtain the following immediate
corollary (we assume that αk > 0 in the corollary).
Corollary 3.2.8. Let A_K = Σ_{i=1}^K α_i and define x̄_K = A_K⁻¹ Σ_{k=1}^K α_k x_k. Then
f(x̄_K) − f(x⋆) ≤ ( ‖x₁ − x⋆‖₂² + Σ_{k=1}^K α_k² M² ) / ( 2 Σ_{k=1}^K α_k ).
Proof. Noting that A_K⁻¹ Σ_{k=1}^K α_k = 1, we see by convexity that
f(x̄_K) − f(x⋆) ≤ A_K⁻¹ Σ_{k=1}^K α_k f(x_k) − f(x⋆) = A_K⁻¹ Σ_{k=1}^K α_k (f(x_k) − f(x⋆)).
Applying Theorem 3.2.7 gives the result.
We give the results in Figures 3.2.11 and 3.2.12, which exhibit much of the
typical behavior of subgradient methods. From the plots, we see roughly a few
phases of behavior: the method with stepsize α = 1 makes progress very quickly
initially, but then enters its “jamming” phase, where it essentially makes no more
progress. (The largest stepsize, α = 10, simply jams immediately.) The accuracy
of the methods with different stepsizes varies greatly, as well—the smaller the
stepsize, the better the (final) performance of the iterates xk , but initial progress
is much slower.
3.3. Projected subgradient methods We often wish to solve problems not over
Rⁿ but over some constrained set, for example, in the Lasso [57] and in com-
pressed sensing applications [20] one minimizes an objective such as ‖Ax − b‖₂²
subject to ‖x‖₁ ≤ R for some constant R < ∞. Recalling the problem (3.1.1), we
more generally wish to solve the problem
minimize f(x) subject to x ∈ C ⊂ Rn ,
where C is a closed convex set, not necessarily Rn . The projected subgradient
method is close to the subgradient method, except that we replace the iteration
with
(3.3.1) xk+1 = πC (xk − αk gk )
where π_C(x) = argmin_{y∈C} ‖x − y‖₂ denotes the Euclidean projection of x onto C.
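In code, the projected iteration (3.3.1) changes the unconstrained loop by a single projection step; a minimal sketch, where the project argument may be any of the projections discussed in the example that follows:

    import numpy as np

    def projected_subgradient_method(subgrad, project, x0, stepsize, K):
        """Iterate x_{k+1} = pi_C(x_k - alpha_k * g_k), as in (3.3.1)."""
        x = np.asarray(x0, dtype=float)
        for k in range(1, K + 1):
            x = project(x - stepsize(k) * subgrad(x))
        return x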
Example 3.3.5 (Some norm balls): Consider updates when C = {x : ‖x‖_p ≤ 1} for
p ∈ {1, 2, ∞}, each reasonably simple, though the projections are no longer affine.
First, for p = ∞, we consider each coordinate j = 1, 2, . . . , n in turn, giving
[π_C(x)]_j = min{1, max{x_j, −1}},
that is, we truncate the coordinates of x to be in the range [−1, 1]. For p = 2, we
have a similarly simple to describe update:
π_C(x) = x if ‖x‖₂ ≤ 1, and π_C(x) = x/‖x‖₂ otherwise.
When p = 1, that is, C = {x : ‖x‖₁ ≤ 1}, the update is somewhat more complex. If
‖x‖₁ ≤ 1, then π_C(x) = x. Otherwise, we find the (unique) t ≥ 0 such that
Σ_{j=1}^n [|x_j| − t]₊ = 1,
and then set the coordinates j via [π_C(x)]_j = sign(x_j)[|x_j| − t]₊. There are nu-
merous efficient algorithms for finding this t (e.g. [14, 23]). ♦
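The three projections of Example 3.3.5 are each a few lines of code; the ℓ₁-ball routine below is a simple O(n log n) sorting scheme, a sketch of one standard approach (the references [14, 23] give faster algorithms):

    import numpy as np

    def project_linf_ball(x):
        # Truncate each coordinate to the interval [-1, 1].
        return np.clip(x, -1.0, 1.0)

    def project_l2_ball(x):
        norm = np.linalg.norm(x)
        return x if norm <= 1.0 else x / norm

    def project_l1_ball(x):
        if np.abs(x).sum() <= 1.0:
            return x                  # already feasible
        # Find t >= 0 with sum_j [|x_j| - t]_+ = 1 via a sort.
        u = np.sort(np.abs(x))[::-1]
        cssv = np.cumsum(u) - 1.0
        rho = np.nonzero(u * np.arange(1, x.size + 1) > cssv)[0][-1]
        t = cssv[rho] / (rho + 1.0)
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)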
Proof. The starting point of the proof is the same basic inequality as we have been
using, that is, the distance ‖x_{k+1} − x⋆‖₂². In this case, we note that projections can
never increase distances to points x⋆ ∈ C, so that
‖x_{k+1} − x⋆‖₂² = ‖π_C(x_k − α_k g_k) − x⋆‖₂² ≤ ‖x_k − α_k g_k − x⋆‖₂².
Now, as in our earlier derivation, we apply inequality (3.2.5) to obtain
(1/2)‖x_{k+1} − x⋆‖₂² ≤ (1/2)‖x_k − x⋆‖₂² − α_k[f(x_k) − f(x⋆)] + (α_k²/2)‖g_k‖₂².
Rearranging this slightly by dividing by α_k, we find that
f(x_k) − f(x⋆) ≤ (1/(2α_k))[ ‖x_k − x⋆‖₂² − ‖x_{k+1} − x⋆‖₂² ] + (α_k/2)‖g_k‖₂².
Now, using a variant of the telescoping sum in the proof of Theorem 3.2.7 we
have
(3.3.7)  Σ_{k=1}^K [f(x_k) − f(x⋆)] ≤ Σ_{k=1}^K (1/(2α_k))[ ‖x_k − x⋆‖₂² − ‖x_{k+1} − x⋆‖₂² ] + Σ_{k=1}^K (α_k/2)‖g_k‖₂².
We rearrange the middle sum in expression (3.3.7), obtaining
Σ_{k=1}^K (1/(2α_k))[ ‖x_k − x⋆‖₂² − ‖x_{k+1} − x⋆‖₂² ]
= Σ_{k=2}^K [ 1/(2α_k) − 1/(2α_{k−1}) ]‖x_k − x⋆‖₂² + (1/(2α₁))‖x₁ − x⋆‖₂² − (1/(2α_K))‖x_{K+1} − x⋆‖₂²
≤ Σ_{k=2}^K [ 1/(2α_k) − 1/(2α_{k−1}) ]R² + (1/(2α₁))R²
because α_k ≤ α_{k−1}. Noting that this last sum telescopes and that ‖g_k‖₂² ≤ M² in
inequality (3.3.7) gives the result.
One application of this result is to use a decreasing stepsize of α_k = α/√k.
This allows nearly as strong a convergence rate as in the fixed stepsize case
when the number of iterations K is known, but the algorithm provides a guarantee
for all iterations k. Here, we have that
Σ_{k=1}^K 1/√k ≤ ∫₀^K t^{−1/2} dt = 2√K,
and so by taking x̄_K = (1/K) Σ_{k=1}^K x_k we obtain the following corollary.
Corollary 3.3.8. In addition to the conditions of the preceding paragraph, let the condi-
tions of Theorem 3.3.6 hold. Then
f(x̄_K) − f(x⋆) ≤ R²/(2α√K) + M²α/√K.
So we see that convergence is guaranteed, at the “best” rate 1/√K, for all iter-
ations. Here, we say “best” because this rate is unimprovable—there are worst
case functions for which no method can achieve a rate of convergence faster than
RM/√K—but in practice, one would hope to attain better behavior by leveraging
problem structure.
3.4. Stochastic subgradient methods The real power of subgradient methods,
which has become evident in the last ten or fifteen years, is in their applicability to
large scale optimization problems. Indeed, while subgradient methods guarantee
only slow convergence—requiring 1/ε² iterations to achieve ε-accuracy—their
simplicity ensures that they are robust to a number of errors. In fact, subgradient
methods achieve unimprovable rates of convergence for a number of optimization
problems with noise, and they often do so very computationally efficiently.
Stochastic optimization problems The basic building block for stochastic (sub)-
gradient methods is the stochastic (sub)gradient, often called the stochastic (sub)-
gradient oracle. Let f : Rn → R ∪ {∞} be a convex function, and fix x ∈ dom f.
(We will typically omit the sub- qualifier in what follows.) Then a random vector
g is a stochastic gradient for f at the point x if E[g] ∈ ∂f(x), or
f(y) ≥ f(x) + ⟨E[g], y − x⟩ for all y.
Said somewhat more formally, we make the following definition.
Definition 3.4.1. A stochastic gradient oracle for the function f is a triple (g, S, P),
where S is a sample space, P is a probability distribution, and g : Rⁿ × S → Rⁿ
is a mapping that for each x ∈ dom f satisfies
E_P[g(x, S)] = ∫ g(x, s) dP(s) ∈ ∂f(x),
where S ∈ S is a sample drawn from P.
Often, with some abuse of notation, we will use g or g(x) for shorthand of the
random vector g(x, S) when this does not cause confusion.
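In code, a stochastic gradient oracle is simply a function returning an unbiased subgradient estimate. A sketch for the finite-sum objective f(x) = (1/N) Σ_i |⟨a_i, x⟩ − b_i| (the robust-regression loss used in our examples; the uniform sampling scheme is an assumption of this sketch):

    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_subgradient(x, A, b):
        """Return g with E[g] in the subdifferential of (1/N) sum_i |<a_i, x> - b_i|.

        Sampling the index i uniformly makes the returned vector an
        unbiased estimate of a subgradient of the average."""
        i = rng.integers(b.size)
        return A[i] * np.sign(A[i] @ x - b[i])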
A standard example for these types of problems is stochastic programming,
where we wish to solve the convex optimization problem
minimize_{x∈C} f(x) := E_P[F(x; S)].
Figure 3.4.5. Performance of the stochastic subgradient method
and of the non-stochastic subgradient method on problem (3.4.4).
(Horizontal axis: iteration k.)
We plot the results of running the stochastic gradient iteration versus stan-
dard projected subgradient descent in Figure 3.4.5; both methods run with the
fixed stepsize α = R/(M√K) for M² = (1/m)‖A‖²_Fr, which optimizes the convergence
guarantees for the methods. We see in the figure the typical performance of a sto-
chastic gradient method: the initial progress in improving the objective is quite
fast, but the method eventually stops making progress once it achieves some low
accuracy (in this case, 10⁻¹). In this figure we should make clear, however, that
each iteration of the stochastic gradient method requires time O(n), while each
iteration of the (non-noisy) projected gradient method requires time O(n · m),
approximately a factor of 100 slower. ♦
Example 3.4.6 (Multiclass support vector machine): Our second example is some-
what more complex. We are given a collection of 16 × 16 grayscale images of
handwritten digits {0, 1, . . . , 9}, and wish to classify images, represented as vec-
tors a ∈ R256 , as one of the 10 digits. In a general k-class classification problem,
we represent the multiclass classifier using the matrix
X = [x1 x2 · · · xk ] ∈ Rn×k ,
where k = 10 for the digit classification problem. Given a data vector a ∈ Rⁿ, the
“score” associated with class l is then ⟨x_l, a⟩, and the goal (given image data) is
to find a matrix X assigning high scores to the correct image labels. (In machine
learning, the typical notation is to use weight vectors w1 , . . . , wk ∈ Rn instead of
x1 , . . . , xk , but we use X to remain consistent with our optimization focus.) The
predicted class for a data vector a ∈ Rn is then
argmax_{l∈[k]} ⟨a, x_l⟩ = argmax_{l∈[k]} [Xᵀa]_l.
The loss we use is the multiclass hinge loss F(X; (a, b)) = Σ_{l≠b} [1 + ⟨a, x_l − x_b⟩]₊,
where [t]₊ = max{t, 0} denotes the positive part. Then F is convex in X, and for
a pair (a, b) we have F(X; (a, b)) = 0 if and only if the classifier represented by X
has a large margin, meaning that
⟨a, x_b⟩ ≥ ⟨a, x_l⟩ + 1 for all l ≠ b.
In this example, we have a sample of N = 7291 digits (ai , bi ) ∈ Rn × {1, . . . , k},
and we compare the performance of stochastic subgradient descent to standard
subgradient descent for solving the problem
(3.4.7)  minimize f(X) = (1/N) Σ_{i=1}^N F(X; (a_i, b_i)) subject to ‖X‖_Fr ≤ R,
where R = 40. We perform the stochastic gradient descent using the stepsizes
α_k = α₁/√k, where α₁ = R/M and M² = (1/N) Σ_{i=1}^N ‖a_i‖₂² (this is an approxi-
mation to the Lipschitz constant of f). For our stochastic gradient oracle, we
select an index i ∈ {1, . . . , N} uniformly at random, then take g ∈ ∂_X F(X; (a_i, b_i)).
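A subgradient of the multiclass hinge loss is computable in O(nk) time per sample; the sketch below is one natural rendering of the calculation that Exercise B.3.1 develops (the code itself is our illustration, not taken from the exercise):

    import numpy as np

    def multiclass_hinge_subgradient(X, a, b):
        """One element of the subdifferential of
        F(X; (a, b)) = sum_{l != b} [1 + <a, x_l - x_b>]_+,
        where X has shape (n, k), a has shape (n,), and b is in {0, ..., k-1}."""
        scores = X.T @ a                  # <a, x_l> for each class l
        margins = 1.0 + scores - scores[b]
        margins[b] = 0.0
        active = margins > 0              # classes violating the margin
        G = np.zeros_like(X, dtype=float)
        G[:, active] = a[:, None]         # gradient +a for each active class
        G[:, b] = -active.sum() * a       # and -a per active class for x_b
        return G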
For the analysis, we assume that ‖x⋆ − x‖₂ ≤ R for all x ∈ C, that projections π_C are efficiently computable, and
that for all x ∈ C we have the bound E[‖g(x, S)‖₂²] ≤ M² for our stochastic oracle
g. (The oracle’s noise S may depend on the previous iterates, but we always have
the unbiased condition E[g(x, S)] ∈ ∂f(x).)
Theorem 3.4.9. Let the conditions of the preceding paragraph hold and let α_k > 0 be a
non-increasing sequence of stepsizes. Let x̄_K = (1/K) Σ_{k=1}^K x_k. Then
E[f(x̄_K) − f(x⋆)] ≤ R²/(2Kα_K) + (1/(2K)) Σ_{k=1}^K α_k M².
Proof. The analysis is quite similar to our previous analyses, in that we simply
expand the error ‖x_{k+1} − x⋆‖₂². Let us define f′(x) := E[g(x, S)] ∈ ∂f(x) to be
the expected subgradient returned by the stochastic gradient oracle, and then let
ξ_k = g_k − f′(x_k) be the error in the kth subgradient. Then
(1/2)‖x_{k+1} − x⋆‖₂² = (1/2)‖π_C(x_k − α_k g_k) − x⋆‖₂²
≤ (1/2)‖x_k − α_k g_k − x⋆‖₂²
= (1/2)‖x_k − x⋆‖₂² − α_k⟨g_k, x_k − x⋆⟩ + (α_k²/2)‖g_k‖₂²,
as in the proof of Theorems 3.2.7 and 3.3.6. Now, we can add and subtract a term
α_k⟨f′(x_k), x_k − x⋆⟩, which gives
(1/2)‖x_{k+1} − x⋆‖₂²
≤ (1/2)‖x_k − x⋆‖₂² − α_k⟨f′(x_k), x_k − x⋆⟩ + (α_k²/2)‖g_k‖₂² − α_k⟨ξ_k, x_k − x⋆⟩
≤ (1/2)‖x_k − x⋆‖₂² − α_k[f(x_k) − f(x⋆)] + (α_k²/2)‖g_k‖₂² − α_k⟨ξ_k, x_k − x⋆⟩,
where we have used the standard first-order convexity inequality.
Except for the error term ⟨ξ_k, x_k − x⋆⟩, the proof is completely identical to that
of Theorem 3.3.6. Indeed, dividing each side of the preceding display by α_k and
rearranging, we have
f(x_k) − f(x⋆) ≤ (1/(2α_k))[ ‖x_k − x⋆‖₂² − ‖x_{k+1} − x⋆‖₂² ] + (α_k/2)‖g_k‖₂² − ⟨ξ_k, x_k − x⋆⟩.
Summing this inequality, as is done after inequality (3.3.7), yields
(3.4.10)  Σ_{k=1}^K [f(x_k) − f(x⋆)] ≤ R²/(2α_K) + (1/2) Σ_{k=1}^K α_k‖g_k‖₂² − Σ_{k=1}^K ⟨ξ_k, x_k − x⋆⟩.
All our subsequent convergence guarantees follow from this basic inequality.
For this theorem, we need only take expectations, realizing that
E[⟨ξ_k, x_k − x⋆⟩] = E[ E[⟨g(x_k) − f′(x_k), x_k − x⋆⟩ | x_k] ]
= E[ ⟨E[g(x_k) | x_k] − f′(x_k), x_k − x⋆⟩ ] = 0,
because E[g(x_k) | x_k] = f′(x_k). Thus we obtain
E[ Σ_{k=1}^K (f(x_k) − f(x⋆)) ] ≤ R²/(2α_K) + (1/2) Σ_{k=1}^K α_k M²
once we realize that E[‖g_k‖₂²] ≤ M², which gives the desired result.
Theorem 3.4.9 makes it clear that, in expectation, we can achieve the same con-
vergence guarantees as in the non-noisy case. This does not mean that stochastic
subgradient methods are always as good as non-stochastic methods, but it does
show the robustness of the subgradient method even to substantial noise. So
while the subgradient method is very slow, its slowness comes with the benefit
that it can handle large amounts of noise.
We now provide a few corollaries on the convergence of stochastic gradient de-
scent. For background on probabilistic modes of convergence, see Appendix A.2.
Corollary 3.4.11. Let the conditions of Theorem 3.4.9 hold, and let α_k = R/(M√k) for
each k. Then
E[f(x̄_K)] − f(x⋆) ≤ 3RM/(2√K)
for all K ∈ N.
The proof of the corollary is identical to that of Corollary 3.3.8 for the projected
gradient method, once we substitute α = R/M in the bound. We can also obtain
convergence in probability of the iterates more generally.
Corollary 3.4.12. If α_k is non-summable but convergent to zero (i.e. Σ_{k=1}^∞ α_k = ∞
and α_k → 0), then f(x̄_K) − f(x⋆) → 0 in probability as K → ∞. That is, for all ε > 0 we have
lim_{K→∞} P( f(x̄_K) ≥ f(x⋆) + ε ) = 0.
Theorem 3.4.13. In addition to the conditions of Theorem 3.4.9, assume that ‖g‖₂ ≤ M
for all stochastic subgradients g. Then for any ε > 0,
f(x̄_K) − f(x⋆) ≤ R²/(2Kα_K) + (M²/(2K)) Σ_{k=1}^K α_k + εRM/√K
with probability at least 1 − e^{−ε²/2}.
Written differently, by taking α_k = R/(M√k) and setting δ = e^{−ε²/2}, we have
f(x̄_K) − f(x⋆) ≤ 3MR/√K + MR√(2 log(1/δ))/√K
with probability at least 1 − δ. That is, we have convergence of O(MR/√K) with
high probability.
Before providing the proof proper, we discuss two examples in which the
boundedness condition holds. Recall from Lecture 2 that a convex function f
is M-Lipschitz if and only if ‖g‖₂ ≤ M for all g ∈ ∂f(x) and x ∈ Rⁿ, so Theo-
rem 3.4.13 requires that the random functions F(·; S) are Lipschitz over the domain
C. Our robust regression and multiclass support vector machine examples both
satisfy the conditions of the theorem so long as the data is bounded. More pre-
cisely, for the robust regression problem (3.2.10) with loss F(x; (a, b)) = |⟨a, x⟩ − b|,
we have ∂F(x; (a, b)) = a sign(⟨a, x⟩ − b), so that the condition ‖g‖₂ ≤ M holds
if and only if ‖a‖₂ ≤ M. For the multiclass hinge loss problem (3.4.7), with
F(X; (a, b)) = Σ_{l≠b} [1 + ⟨a, x_l − x_b⟩]₊, Exercise B.3.1 develops the subgradient
calculations, but again, we have the boundedness of ∂_X F(X; (a, b)) if and only if
a ∈ Rⁿ is bounded.
Proof. We begin with the basic inequality of Theorem 3.4.9, inequality (3.4.10). We
see that we would like to bound the probability that
Σ_{k=1}^K ⟨ξ_k, x⋆ − x_k⟩
is large. First, we note that the iterate x_k is a function of ξ₁, . . . , ξ_{k−1}, and we
have the conditional expectation
E[ξ_k | ξ₁, . . . , ξ_{k−1}] = E[ξ_k | x_k] = 0.
Moreover, using the boundedness assumption that ‖g‖₂ ≤ M, we first obtain
‖ξ_k‖₂ = ‖g_k − f′(x_k)‖₂ ≤ 2M and then
|⟨ξ_k, x_k − x⋆⟩| ≤ ‖ξ_k‖₂ ‖x_k − x⋆‖₂ ≤ 2MR.
Thus, the sequence Σ_{k=1}^K ⟨ξ_k, x_k − x⋆⟩ is a bounded difference martingale se-
quence, and we may apply Azuma’s inequality (Theorem A.2.5), which guarantees
P( Σ_{k=1}^K ⟨ξ_k, x⋆ − x_k⟩ ≥ t ) ≤ exp( −t²/(2KM²R²) )
for all t ≥ 0. Substituting t = εMR√K, we obtain, as desired, that
P( (1/K) Σ_{k=1}^K ⟨ξ_k, x⋆ − x_k⟩ ≥ εMR/√K ) ≤ exp(−ε²/2).
First, roughly 1/ε² iterations suffice to achieve optimization
error at most f(x̄_K) − f(x⋆) = O(ε). Second, this convergence is (at least to the
order in ε) the same as in the non-noisy case; that is, stochastic gradient meth-
ods are robust enough to noise that their convergence is hardly affected by it. In
addition to this, they are often applicable in situations in which we cannot even
evaluate the objective f, whether for computational reasons or because we do not
have access to it, as in statistical problems. This robustness to noise and good
performance has led to wide adoption of subgradient-like methods as the de facto
choice for many large-scale data-based optimization problems. In the coming sec-
tions, we give further discussion of the optimality of stochastic gradient methods,
showing that—roughly—when we have access only to noisy data, it is impossi-
ble to solve (certain) problems to accuracy better than ε given 1/ε² data points;
thus, using more expensive but accurate optimization methods may have limited
benefit (though there may still be some benefit practically!).
Notes and further reading Our treatment in this chapter borrows from a num-
ber of resources. The two heaviest are the lecture notes for Stephen Boyd’s Stan-
ford EE364b course [10, 11] and Polyak’s Introduction to Optimization [47]. Our
guarantees of high probability convergence are similar to those originally de-
veloped by Cesa-Bianchi et al. [16] in the context of online learning, which Ne-
mirovski et al. [40] more fully develop. More references on subgradient methods
include the lecture notes of Nemirovski [43] and Nesterov [44].
A number of extensions of (stochastic) subgradient methods are possible, in-
cluding to online scenarios in which we observe streaming sequences of func-
tions [25, 64]; our analysis in this section follows closely that of Zinkevich [64].
The classic paper of Polyak and Juditsky [48] shows that stochastic gradient de-
scent methods, coupled with averaging, can achieve asymptotically optimal rates
of convergence even to constant factors. Recent work in machine learning by a
number of authors [18, 32, 53] has shown how to leverage the structure of opti-
mization problems based on finite sums, that is, when f(x) = (1/N) Σ_{i=1}^N f_i(x), to
develop methods that achieve convergence rates similar to those of interior point
methods but with iteration complexity close to stochastic gradient methods.
(Figure: the function h and the Bregman divergence D_h(x, y) it induces.)
Throughout, we assume the distance-generating function h is strongly convex over the
set C in the mirror descent method (4.2.3).⁴ Note that strong convexity of h is
equivalent to
D_h(x, y) ≥ (1/2)‖x − y‖² for all x, y ∈ C.
Examples of mirror descent Before analyzing the method (4.2.3), we present a
few examples, showing the updates that are possible as well as verifying that the
associated divergence is appropriately strongly convex. One of the nice conse-
quences of allowing different divergence measures Dh , as opposed to only the
Euclidean divergence, is that they often yield cleaner or simpler updates.
Example 4.2.4 (Gradient descent is mirror descent): Let h(x) = (1/2)‖x‖₂². Then
∇h(y) = y, and
D_h(x, y) = (1/2)‖x‖₂² − (1/2)‖y‖₂² − ⟨y, x − y⟩ = (1/2)‖x‖₂² + (1/2)‖y‖₂² − ⟨x, y⟩ = (1/2)‖x − y‖₂².
Thus, substituting into the update (4.2.3), we see the choice h(x) = (1/2)‖x‖₂² recovers
the standard (stochastic sub)gradient method
x_{k+1} = argmin_{x∈C} { ⟨g_k, x⟩ + (1/(2α_k))‖x − x_k‖₂² }.
It is evident that h is strongly convex with respect to the ℓ₂-norm for any con-
straint set C. ♦
For the negative entropy h(x) = Σ_{j=1}^n x_j log x_j, we have ∇h(y) = (log y_j + 1)_{j=1}^n, so
D_h(x, y) = Σ_{j=1}^n [ x_j log x_j − y_j log y_j − (log y_j + 1)(x_j − y_j) ]
= Σ_{j=1}^n x_j log(x_j/y_j) + ⟨1, y − x⟩ = D_kl(x‖y),
⁴This is not strictly a requirement, and sometimes it is analytically convenient to avoid this, but our
analysis is simpler when h is strongly convex.
We assume that the y_j > 0, though this is not strictly necessary. Though we
have not discussed this, we write the Lagrangian for this problem by introducing
Lagrange multipliers τ ∈ R for the equality constraint ⟨1, x⟩ = 1 and λ ∈ Rⁿ₊ for
the inequality x ⪰ 0. Then we obtain the Lagrangian
L(x, τ, λ) = ⟨g, x⟩ + Σ_{j=1}^n [ x_j log(x_j/y_j) + τx_j − λ_j x_j ] − τ.
Minimizing out x to find the appropriate form for the solution, we take deriva-
tives with respect to x and set them to zero to find
0 = (∂/∂x_j) L(x, τ, λ) = g_j + log x_j + 1 − log y_j + τ − λ_j,
or
x_j(τ, λ) = y_j exp(−g_j − 1 − τ + λ_j).
We may take λ_j = 0, as the latter expression yields all positive x_j, and to satisfy
the constraint that Σ_j x_j = 1, we set τ = log( Σ_j y_j e^{−g_j} ) − 1. Thus we have the
update
x_i = y_i exp(−g_i) / Σ_{j=1}^n y_j exp(−g_j).
Rewriting this in terms of the precise update at time k for the mirror descent
method, we have for each coordinate i of iterate k + 1 of the method that
(4.2.6)  x_{k+1,i} = x_{k,i} exp(−α_k g_{k,i}) / Σ_{j=1}^n x_{k,j} exp(−α_k g_{k,j}).
This is the so-called exponentiated gradient update, also known as entropic mirror
descent.
Later, after stating and proving our main convergence theorems, we will show
that the negative entropy is strongly convex with respect to the 1 -norm, meaning
that our coming convergence guarantees apply. ♦
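In code the update (4.2.6) is a multiplicative reweighting followed by a normalization; a minimal sketch (the log-domain variant guards against underflow and assumes x has strictly positive coordinates):

    import numpy as np

    def exponentiated_gradient_step(x, g, alpha):
        """Entropic mirror descent update (4.2.6) on the simplex."""
        w = x * np.exp(-alpha * g)
        return w / w.sum()

    def exponentiated_gradient_step_stable(x, g, alpha):
        # Work in the log domain and subtract the max exponent before
        # exponentiating; the normalized result is unchanged.
        z = np.log(x) - alpha * g
        z -= z.max()
        w = np.exp(z)
        return w / w.sum()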
for all K ∈ N,
Σ_{k=1}^K [f(x_k) − f(x⋆)] ≤ R²/α_K + Σ_{k=1}^K (α_k/2)‖g_k‖²_∗.
If α_k ≡ α is constant, then for all K ∈ N,
Σ_{k=1}^K [f(x_k) − f(x⋆)] ≤ (1/α) D_h(x⋆, x₁) + (α/2) Σ_{k=1}^K ‖g_k‖²_∗.
As an immediate consequence of this theorem, we see that if x̄_K = (1/K) Σ_{k=1}^K x_k or
x̄_K = argmin_{x_k} f(x_k), and we have the gradient bound ‖g‖_∗ ≤ M for all g ∈ ∂f(x)
and x ∈ C, then (say, in the second case) convexity implies
(4.2.10)  f(x̄_K) − f(x⋆) ≤ (1/(Kα)) D_h(x⋆, x₁) + (α/2)M².
By comparing with the bound (3.2.9), we see that the mirror descent (non-Euclid-
ean gradient descent) method gives roughly the same type of convergence guar-
antees as standard subgradient descent. Roughly we expect the following type of
behavior with a fixed stepsize: a rate of convergence of roughly 1/αK until we are
within a radius α of the optimum, after which mirror descent and subgradient
descent essentially jam—they just jump back and forth near the optimum.
Proof. We begin by considering the progress made in a single update of x_k, but
whereas our previous proofs all began with a Lyapunov function for the distance
‖x_k − x⋆‖₂, we use function value gaps instead of the distance to optimality. Us-
ing the first order convexity inequality—i.e. the definition of a subgradient—we
have
f(x_k) − f(x⋆) ≤ ⟨g_k, x_k − x⋆⟩.
The idea is to show that replacing x_k with x_{k+1} makes the term ⟨g_k, x_k − x⋆⟩
small because of the definition of x_{k+1}, but x_k and x_{k+1} are close together so that
this is not much of a difference.
First, we add and subtract ⟨g_k, x_{k+1}⟩ to obtain
(4.2.11)  f(x_k) − f(x⋆) ≤ ⟨g_k, x_{k+1} − x⋆⟩ + ⟨g_k, x_k − x_{k+1}⟩.
Now, we use the first-order necessary and sufficient conditions for optimality
of convex optimization problems given by Theorem 2.4.11. Because x_{k+1} solves
problem (4.2.3), we have
⟨g_k + α_k⁻¹(∇h(x_{k+1}) − ∇h(x_k)), x − x_{k+1}⟩ ≥ 0 for all x ∈ C.
In particular, this holds for x = x⋆, and substituting into (4.2.11) yields
(4.2.12)  f(x_k) − f(x⋆) ≤ (1/α_k)⟨∇h(x_{k+1}) − ∇h(x_k), x⋆ − x_{k+1}⟩ + ⟨g_k, x_k − x_{k+1}⟩.
We now use two tricks: an algebraic identity involving D_h and the Fenchel-Young
inequality. By algebraic manipulations, we have that
⟨∇h(x_{k+1}) − ∇h(x_k), x⋆ − x_{k+1}⟩ = D_h(x⋆, x_k) − D_h(x⋆, x_{k+1}) − D_h(x_{k+1}, x_k),
while the Fenchel-Young inequality gives
⟨g_k, x_k − x_{k+1}⟩ ≤ (α_k/2)‖g_k‖²_∗ + (1/(2α_k))‖x_k − x_{k+1}‖².
The strong convexity assumption on h guarantees D_h(x_{k+1}, x_k) ≥ (1/2)‖x_k − x_{k+1}‖²,
or that
−(1/α_k) D_h(x_{k+1}, x_k) + ⟨g_k, x_k − x_{k+1}⟩ ≤ (α_k/2)‖g_k‖²_∗.
Substituting this into inequality (4.2.12), we have
(4.2.13)  f(x_k) − f(x⋆) ≤ (1/α_k)[ D_h(x⋆, x_k) − D_h(x⋆, x_{k+1}) ] + (α_k/2)‖g_k‖²_∗.
This inequality should look similar to inequality (3.3.7) in the proof of Theo-
rem 3.3.6 on the projected subgradient method in Lecture 3. Indeed, using that
D_h(x⋆, x_k) ≤ R² by assumption, an identical derivation to that in Theorem 3.3.6
gives the first result of this theorem. For the second when the stepsize is fixed,
note that
Σ_{k=1}^K [f(x_k) − f(x⋆)] ≤ (1/α) Σ_{k=1}^K [ D_h(x⋆, x_k) − D_h(x⋆, x_{k+1}) ] + (α/2) Σ_{k=1}^K ‖g_k‖²_∗
= (1/α)[ D_h(x⋆, x₁) − D_h(x⋆, x_{K+1}) ] + (α/2) Σ_{k=1}^K ‖g_k‖²_∗,
which is the second result.
We briefly provide a few remarks before moving on. As a first remark, all
of the preceding analysis carries through in an almost completely identical fash-
ion in the stochastic case. We state the most basic result, as the extension from
Section 3.4 is essentially straightforward.
Corollary 4.2.14. Let the conditions of Theorem 4.2.9 hold, except that instead of re-
ceiving a vector g_k ∈ ∂f(x_k) at iteration k, the vector g_k is a stochastic subgradient
satisfying E[g_k | x_k] ∈ ∂f(x_k). Then for any non-increasing stepsize sequence α_k,
the bounds of Theorem 4.2.9 hold with each f(x_k) − f(x⋆) replaced by its expectation
E[f(x_k) − f(x⋆)] and each ‖g_k‖²_∗ replaced by E[‖g_k‖²_∗].
Proof. We sketch the proof, which is identical to that of Theorem 4.2.9, except that
we replace g_k with the vector f′(x_k) satisfying E[g_k | x_k] = f′(x_k) ∈ ∂f(x_k). Then
f(x_k) − f(x⋆) ≤ ⟨f′(x_k), x_k − x⋆⟩ = ⟨g_k, x_k − x⋆⟩ + ⟨f′(x_k) − g_k, x_k − x⋆⟩,
and an identical derivation yields the following analogue of inequality (4.2.13):
f(x_k) − f(x⋆)
≤ (1/α_k)[ D_h(x⋆, x_k) − D_h(x⋆, x_{k+1}) ] + (α_k/2)‖g_k‖²_∗ + ⟨f′(x_k) − g_k, x_k − x⋆⟩.
This inequality holds regardless of how we choose α_k. Moreover, by iterating
expectations, we have
E[⟨f′(x_k) − g_k, x_k − x⋆⟩] = E[⟨f′(x_k) − E[g_k | x_k], x_k − x⋆⟩] = 0,
which gives the corollary by a derivation identical to Theorem 4.2.9.
Thus, if we have the bound E[‖g‖²_∗] ≤ M² for all stochastic subgradients, then
taking x̄_K = (1/K) Σ_{k=1}^K x_k and α_k = R/(M√k) gives E[f(x̄_K) − f(x⋆)] ≤ 3RM/(2√K),
just as in Corollary 3.4.11.
For the entropic updates (4.2.6) over the simplex C = {x ∈ Rⁿ₊ : ⟨1, x⟩ = 1}, with
h(x) = Σ_j x_j log x_j, x₁ = (1/n)1, and a fixed stepsize α > 0, we have the following corollary.
Corollary 4.2.16. Let the conditions above hold. Then for all K ∈ N,
f(x̄_K) − f(x⋆) ≤ log n/(Kα) + (α/(2K)) Σ_{k=1}^K ‖g_k‖²_∞.
Proof. To apply Theorem 4.2.9, we must show that the negative entropy h is
strongly convex with respect to the ℓ₁-norm, whose dual norm is the ℓ∞-norm.
By a Taylor expansion, we know that for any x, y ∈ C, we have
h(x) = h(y) + ⟨∇h(y), x − y⟩ + (1/2)(x − y)ᵀ∇²h(x̃)(x − y)
for some x̃ between x and y, that is, x̃ = tx + (1 − t)y for some t ∈ [0, 1]. Calculat-
ing these quantities, this is equivalent to
D_kl(x‖y) = D_h(x, y) = (1/2)(x − y)ᵀ diag(1/x̃₁, . . . , 1/x̃ₙ)(x − y) = (1/2) Σ_{j=1}^n (x_j − y_j)²/x̃_j.
An application of the Cauchy-Schwarz inequality then shows that D_h(x, y) ≥ (1/2)‖x − y‖₁², the claimed strong convexity.
If x₁ = (1/n)1, then D_kl(x‖x₁) = h(x) + log n ≤ log n, as h(x) ≤ 0 for x ∈ C. Thus,
dividing by K and using that f(x̄_K) ≤ (1/K) Σ_{k=1}^K f(x_k) gives the corollary.
Comparing the guarantee provided by Corollary 4.2.16 to that given by the
standard (non-stochastic) projected subgradient method (i.e. using h(x) = (1/2)‖x‖₂²
as in Theorem 3.3.6) is instructive. In the case of projected subgradient descent,
we have D_h(x⋆, x) = (1/2)‖x⋆ − x‖₂² ≤ 1 for all x, x⋆ ∈ C = {x ∈ Rⁿ₊ : ⟨1, x⟩ = 1} (and
this distance is achieved). But the dual norm to the ℓ₂-norm is the ℓ₂-norm, so we measure
the size of the gradient terms g_k in ℓ₂-norm. As ‖g_k‖_∞ ≤ ‖g_k‖₂ ≤ √n ‖g_k‖_∞,
supposing that ‖g_k‖_∞ ≤ 1 for all k, the convergence guarantee O(1)√(log n/K)
may be up to √(n/log n)-times better than that guaranteed by the standard (Eu-
clidean) projected gradient method.
Lastly, we provide a final convergence guarantee for the mirror descent method
using ℓ_p-norms, where p ∈ (1, 2]. Using such norms has the benefit that D_h
is bounded whenever the set C is compact—distinct from the relative entropy
D_h(x, y) = Σ_j x_j log(x_j/y_j)—thus providing a nicer guarantee of convergence.
Indeed, for h(x) = (1/2)‖x‖²_p we always have that
D_h(x, y) = (1/2)‖x‖²_p − (1/2)‖y‖²_p − ‖y‖^{2−p}_p Σ_{j=1}^n sign(y_j)|y_j|^{p−1}(x_j − y_j)
(4.2.17)  = (1/2)‖x‖²_p + (1/2)‖y‖²_p − ‖y‖^{2−p}_p Σ_{j=1}^n |y_j|^{p−1} sign(y_j) x_j ≤ ‖x‖²_p + ‖y‖²_p,
the final inequality because Hölder’s inequality (with 1/p + 1/q = 1) gives
‖y‖^{2−p}_p Σ_{j=1}^n |y_j|^{p−1}|x_j| ≤ ‖y‖^{2−p}_p ( Σ_{j=1}^n |y_j|^{q(p−1)} )^{1/q} ( Σ_{j=1}^n |x_j|^p )^{1/p}
= ‖y‖^{2−p}_p ‖y‖^{p−1}_p ‖x‖_p = ‖y‖_p ‖x‖_p ≤ (1/2)‖x‖²_p + (1/2)‖y‖²_p.
With the choice h(x) = (1/(2(p−1)))‖x‖²_p for p = 1 + 1/log(2n), if C is contained in the
ℓ₁-ball of radius R₁, then for any non-increasing stepsizes α_k and all K ∈ N,
Σ_{k=1}^K [f(x_k) − f(x⋆)] ≤ 2R₁² log(2n)/α_K + (e²/2) Σ_{k=1}^K α_k ‖g_k‖²_∞.
In particular, taking α_k = R₁√(log(2n)/k)/e and x̄_K = (1/K) Σ_{k=1}^K x_k gives
f(x̄_K) − f(x⋆) ≤ 3e R₁ √(log(2n))/√K.
Proof. First, we note that h(x) = (1/(2(p−1)))‖x‖²_p is strongly convex with respect to
the ℓ_p-norm, where 1 < p ≤ 2. (Recall Example 4.2.7 and see Exercise B.4.2.)
Moreover, we know that the dual to the ℓ_p-norm is the conjugate ℓ_q-norm with
1/p + 1/q = 1, and thus Theorem 4.2.9 implies that
Σ_{k=1}^K [f(x_k) − f(x⋆)] ≤ (1/α_K) sup_{x∈C} D_h(x, x⋆) + Σ_{k=1}^K (α_k/2)‖g_k‖²_q.
Now, we use that if C is contained in the ℓ₁-ball of radius R₁, then
(p − 1) D_h(x, y) ≤ ‖x‖²_p + ‖y‖²_p ≤ ‖x‖²₁ + ‖y‖²₁ ≤ 2R₁².
Moreover, because p = 1 + 1/log(2n), we have q = 1 + log(2n), and
‖v‖_q ≤ n^{1/q}‖v‖_∞ = n^{1/(1+log(2n))}‖v‖_∞ ≤ e‖v‖_∞.
Substituting this into the previous display and noting that 1/(p − 1) = log(2n)
gives the first result. The second follows by first integrating, Σ_{k=1}^K k^{−1/2} ≤ 2√K, and then
using convexity.
So we see that, in more general cases than the simple simplex constraint af-
forded by the entropic mirror descent (exponentiated gradient) updates, we have
convergence guarantees of order √(log n/K), which may be substantially faster
than that guaranteed by the standard projected gradient methods.
A simulated mirror-descent example With our convergence theorems given, we
provide a (simulation-based) example of the convergence behavior for an opti-
mization problem for which it is natural to use non-Euclidean norms. We con-
sider a robust regression problem of the following form: we let A ∈ R^{m×n} have
entries drawn i.i.d. N(0, 1) with rows a₁, . . . , a_m. We let b_i = (1/2)(a_{i,1} + a_{i,2}) + ε_i,
where ε_i ∼ N(0, 10⁻²) i.i.d., and m = 20 and the dimension n = 3000. Then we define
f(x) := ‖Ax − b‖₁ = Σ_{i=1}^m |⟨a_i, x⟩ − b_i|,
which has subgradients Aᵀ sign(Ax − b). We then minimize f over the simplex
C = {x ∈ Rⁿ₊ : ⟨1, x⟩ = 1}; this is the same robust regression problem (3.2.10),
except with a particular choice of C.
We compare the subgradient method to exponentiated gradient descent for
this problem, noting that the Euclidean projection of a vector v ∈ Rⁿ to the set C
has coordinates x_j = [v_j − t]₊, where t ∈ R is chosen so that
Σ_{j=1}^n x_j = Σ_{j=1}^n [v_j − t]₊ = 1.
(See the papers [14, 23] for a full derivation of this expression.) We use stepsizes
α_k = α₀/√k, where the initial stepsize α₀ is chosen to optimize the convergence
guarantee for each of the methods (see the coming section). In Figure 4.2.19, we
plot the results of performing the projected gradient method versus the expo-
nentiated gradient (entropic mirror descent) method and a method using distance-
generating functions h(x) = (1/2)‖x‖²_p for p = 1 + 1/log(2n), which can also be
shown to be optimal, showing the optimality gap versus iteration count. While
all three methods are sensitive to initial stepsize, the mirror descent method (4.2.6)
enjoys faster convergence than the standard gradient-based method.
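For concreteness, a compact sketch reproducing the experiment's setup with the stated problem sizes (the initial stepsizes here are assumptions standing in for the tuned values, and the ℓ_p-norm method is omitted for brevity):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 20, 3000
    A = rng.standard_normal((m, n))
    b = 0.5 * (A[:, 0] + A[:, 1]) + 0.1 * rng.standard_normal(m)
    subgrad = lambda x: A.T @ np.sign(A @ x - b)

    def project_simplex(v):
        # Coordinates [v_j - t]_+ with t chosen so they sum to one.
        u = np.sort(v)[::-1]
        cssv = np.cumsum(u) - 1.0
        rho = np.nonzero(u * np.arange(1, n + 1) > cssv)[0][-1]
        return np.maximum(v - cssv[rho] / (rho + 1.0), 0.0)

    x_pg = np.full(n, 1.0 / n)      # projected subgradient iterate
    x_eg = np.full(n, 1.0 / n)      # exponentiated gradient iterate
    alpha_pg, alpha_eg = 0.1, 1.0   # assumed initial stepsizes
    for k in range(1, 2001):
        x_pg = project_simplex(x_pg - alpha_pg / np.sqrt(k) * subgrad(x_pg))
        w = x_eg * np.exp(-alpha_eg / np.sqrt(k) * subgrad(x_eg))
        x_eg = w / w.sum()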
4.3. Adaptive stepsizes and metrics In our discussion of mirror descent meth-
ods, we assumed we knew enough about the geometry of the problem at hand—
or at least the constraint set—to choose an appropriate metric and associated
distance-generating function h. In other situations, however, it may be advanta-
geous to adapt the metric being used, or at least the stepsizes, to achieve faster
convergence guarantees. We begin by describing a simple scheme for choosing
stepsizes to optimize bounds on convergence, which means one does not need
to know the Lipschitz constants of gradients ahead of time, and then move on to
somewhat more involved schemes that use a distance-generating function of the
type h(x) = 12 x Ax for some matrix A, which may change depending on infor-
mation observed during solution of the problem. We leave proofs of the major
results in these sections to exercises at the end of the lectures.
Adaptive stepsizes Let us begin by recalling the convergence guarantees for
mirror descent in the stochastic case, given by Corollary 4.2.14, which assumes
the stepsize α_k used to calculate x_{k+1} is chosen based on the observed gradients
g₁, . . . , g_k (it may be specified ahead of time). In this case, taking x̄_K = (1/K) Σ_{k=1}^K x_k,
we have by Corollary 4.2.14 that as long as D_h(x, x⋆) ≤ R² for all x ∈ C, then
(4.3.1)  E[f(x̄_K) − f(x⋆)] ≤ E[ R²/(Kα_K) + (1/K) Σ_{k=1}^K (α_k/2)‖g_k‖²_∗ ].
Now, if we were to use a fixed stepsize α_k = α for all k, we see that the choice of
stepsize minimizing
R²/(Kα) + (α/(2K)) Σ_{k=1}^K ‖g_k‖²_∗
is
α = √2 R ( Σ_{k=1}^K ‖g_k‖²_∗ )^{−1/2},
which, when substituted into the bound (4.3.1), yields
(4.3.2)  E[f(x̄_K) − f(x⋆)] ≤ (√2 R/K) E[ ( Σ_{k=1}^K ‖g_k‖²_∗ )^{1/2} ].
While the stepsize choice α and the resulting bound are not strictly possible,
as we do not know the magnitudes of the gradients gk ∗ before the procedure
executes, in Exercise B.4.1, we prove the following corollary, which uses the “up
to now” optimal choice of stepsize αk .
Corollary 4.3.3. Let the conditions of Corollary 4.2.14 hold. Let α_k = R/( Σ_{i=1}^k ‖g_i‖²_∗ )^{1/2}.
Then
E[f(x̄_K) − f(x⋆)] ≤ (3R/K) E[ ( Σ_{k=1}^K ‖g_k‖²_∗ )^{1/2} ],
where x̄_K = (1/K) Σ_{k=1}^K x_k.
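The stepsize of Corollary 4.3.3 is easy to maintain online, since it depends only on the running sum of squared gradient norms; a Euclidean-case sketch (where ‖·‖_∗ = ‖·‖₂, and R is the assumed bound on the divergence to the optimum):

    import numpy as np

    def adaptive_projected_subgradient(subgrad, project, x0, R, K):
        """Projected subgradient steps with alpha_k = R / sqrt(sum_{i<=k} ||g_i||^2)."""
        x = np.asarray(x0, dtype=float)
        sumsq = 0.0
        for _ in range(K):
            g = subgrad(x)
            sumsq += float(g @ g)
            alpha = R / np.sqrt(sumsq) if sumsq > 0 else R
            x = project(x - alpha * g)
        return x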
Variable metric methods and the adaptive gradient method In variable metric
methods, the idea is to adjust the metric with which one constructs updates to
better reflect local (or non-local) problem structure. The basic framework is very
similar to the standard subgradient method (or the mirror descent method), and
proceeds as follows.
(i) Receive subgradient gk ∈ ∂f(xk ) (or stochastic subgradient gk satisfying
E[gk | xk ] ∈ ∂f(xk )).
(ii) Update positive semidefinite matrix Hk ∈ Rn×n .
(iii) Compute update
(4.3.4)  x_{k+1} = argmin_{x∈C} { ⟨g_k, x⟩ + (1/2)⟨x − x_k, H_k(x − x_k)⟩ }.
The method (4.3.4) subsumes a number of standard and less standard optimiza-
tion methods. If H_k = (1/α_k) I_{n×n}, a scaled identity matrix, we recover the (sto-
chastic) subgradient method (3.2.4) when C = Rⁿ (or (3.3.2) generally). If f is
twice differentiable and C = Rn , then taking Hk = ∇2 f(xk ) to be the Hessian
of f at xk gives the (undamped) Newton method, and using Hk = ∇2 f(xk ) even
when C = Rn gives a constrained Newton method. More general choices of
Hk can even give the ellipsoid method and other classical convex optimization
methods [56].
In our case, we specialize the iterations above to focus on diagonal matrices Hk ,
and we do not assume the function f is smooth (not even differentiable). This, of
course, renders unusable standard methods using second order information in
the matrix Hk (as it does not exist), but we may still develop useful algorithms.
It is possible to consider more general matrices [22], but their additional com-
putational cost generally renders them impractical in large scale and stochastic
settings. With that in mind, let us develop a general framework for algorithms
and provide their analysis.
We begin with a general convergence guarantee.
Theorem 4.3.5. Let Hk be a sequence of positive definite matrices, where Hk is a func-
tion of g1 , . . . , gk (and potentially some additional randomness). Let gk be (stochastic)
subgradients with E[gk | xk ] ∈ ∂f(xk ). Then
K
E
(f(xk ) − f(x ))
k=1
K
1
2 2 2
E xk − x Hk − xk − x Hk−1 + x1 − x H1
2
k=2
K
1
+ E gk 2H−1 .
2 k
k=1
Proof. In contrast to mirror descent methods, in this proof we return to our classic
Lyapunov-based style of proof for standard subgradient methods, looking at the
distance ‖x_k − x⋆‖. Let ‖x‖²_A = ⟨x, Ax⟩ for any positive semidefinite matrix A. We
claim that
(4.3.6)  ‖x_{k+1} − x⋆‖²_{H_k} ≤ ‖x_k − H_k⁻¹g_k − x⋆‖²_{H_k},
the analogue of the fact that projections are non-expansive. This is an immediate
consequence of the update (4.3.4): we have that
x_{k+1} = argmin_{x∈C} ‖x − (x_k − H_k⁻¹g_k)‖²_{H_k},
which is a Euclidean projection of x_k − H_k⁻¹g_k onto C (in the norm ‖·‖_{H_k}). Then
the standard result that projections are non-expansive (Corollary 2.2.8) gives in-
equality (4.3.6).
Inequality (4.3.6) is the key to our analysis, as previously. Expanding the
square on the right side of the inequality, we obtain
(1/2)‖x_{k+1} − x⋆‖²_{H_k} ≤ (1/2)‖x_k − H_k⁻¹g_k − x⋆‖²_{H_k}
= (1/2)‖x_k − x⋆‖²_{H_k} − ⟨g_k, x_k − x⋆⟩ + (1/2)‖g_k‖²_{H_k⁻¹},
and taking expectations we have E[⟨g_k, x_k − x⋆⟩ | x_k] ≥ f(x_k) − f(x⋆) by convexity
and that E[g_k | x_k] ∈ ∂f(x_k). Thus
E[(1/2)‖x_{k+1} − x⋆‖²_{H_k}] ≤ E[(1/2)‖x_k − x⋆‖²_{H_k}] − E[f(x_k) − f(x⋆)] + (1/2)E[‖g_k‖²_{H_k⁻¹}].
Rearranging, we have
E[f(x_k) − f(x⋆)] ≤ E[ (1/2)‖x_k − x⋆‖²_{H_k} − (1/2)‖x_{k+1} − x⋆‖²_{H_k} + (1/2)‖g_k‖²_{H_k⁻¹} ].
Summing this inequality from k = 1 to K gives the theorem.
We may specialize the theorem in a number of ways to develop particular algo-
rithms. One specialization, which is convenient because the computational over-
head is fairly small, is to use diagonal matrices H_k. In particular, the AdaGrad
method sets
(4.3.7)  H_k = (1/α) diag( Σ_{i=1}^k g_i g_iᵀ )^{1/2},
where α > 0 is a pre-specified constant (stepsize). In this case, the following corol-
lary to Theorem 4.3.5 follows. Exercise B.4.3 sketches the proof of the corollary,
which is similar to that of Corollary 4.3.3. In the corollary, recall that the trace of
a matrix is tr(A) = Σ_{j=1}^n A_jj.
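A diagonal implementation of (4.3.7) requires only O(n) extra memory for the running sum of squared gradient coordinates; the sketch below includes the box constraint C = {x : ‖x‖∞ ≤ 1} considered next, for which the ‖·‖_{H_k}-projection is coordinate-wise clipping because H_k is diagonal:

    import numpy as np

    def adagrad(subgrad, x0, alpha, K):
        """Diagonal AdaGrad: H_k = (1/alpha) diag(sum_i g_i g_i^T)^{1/2}."""
        x = np.asarray(x0, dtype=float)
        sumsq = np.zeros_like(x)    # running sum of g_{i,j}^2 per coordinate
        for _ in range(K):
            g = subgrad(x)
            sumsq += g * g
            h = np.sqrt(sumsq)
            h[h == 0] = 1.0         # guard coordinates with no observed gradient
            x = np.clip(x - alpha * g / h, -1.0, 1.0)
        return x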
In addition to proving the bound (4.3.9), Exercise B.4.3 also shows that when we
take C = {x ∈ Rⁿ : ‖x‖_∞ ≤ 1}, the bound (4.3.9) is always better than the bounds
(e.g. Corollary 3.4.11) guaranteed by standard stochastic gradient methods. In
addition, the bound (4.3.9) is unimprovable—there are stochastic optimization
problems for which no algorithm can achieve a faster convergence rate. These
types of problems generally involve data in which the gradients g have highly
varying components (or components that are often zero, i.e. the gradients g are
sparse), as for such problems geometric aspects are quite important.
Notes and further reading The mirror descent method was originally devel-
oped by Nemirovski and Yudin [41] in order to more carefully control the norms
of gradients, and associated dual spaces, in first-order optimization methods.
Since their original development, a number of researchers have explored variants
and extensions of their methods. Beck and Teboulle [5] give an analysis of mirror
descent as a non-Euclidean gradient method, which is the approach we take in
this lecture. Nemirovski et al. [40] study mirror descent methods in stochastic
settings, giving high-probability convergence guarantees similar to those we gave
in the previous lecture. Bubeck and Cesa-Bianchi [15] explore the use of mirror
descent methods in the context of bandit optimization problems, where instead of
observing stochastic gradients one observes only random function values f(x) + ε,
where ε is a mean-zero noise perturbation.
Variable metric methods have a similarly long history. Our simple results with
stepsize selection follow the more advanced techniques of Auer et al. [3] (see es-
pecially their Lemma 3.5), and the AdaGrad method (and our development) is
due to Duchi, Hazan, and Singer [22] and McMahan and Streeter [38]. More gen-
eral metric methods include Shor’s space dilation methods (of which the ellipsoid
method is a celebrated special case), which develop matrices Hk that make new
directions of descent somewhat less correlated with previous directions, allowing
faster convergence in directions toward x ; see the books of Shor [55, 56] as well
as the thesis of Nedić [39]. Newton methods, which we do not discuss, use scaled
multiples of ∇2 f(xk ) for Hk , while Quasi-Newton methods approximate ∇2 f(xk )
with Hk while using only gradient-based information; for more on these and
other more advanced methods for smooth optimization problems, see the books
of Nocedal and Wright [46] and Boyd and Vandenberghe [12].
5. Optimality Guarantees
Lecture Summary: In this lecture, we provide a framework for demonstrat-
ing the optimality of a number of algorithms for solving stochastic optimiza-
tion problems. In particular, we introduce minimax lower bounds, showing
how techniques for reducing estimation problems to statistical testing prob-
lems allow us to prove lower bounds on optimization.
5.1. Introduction The procedures and algorithms we have presented thus far en-
joy good performance on a number of statistical, machine learning, and stochastic
optimization tasks, and we have provided theoretical guarantees on their perfor-
mance. It is interesting to ask whether it is possible to improve the algorithms, or
in what ways it may be possible to improve them. With that in mind, in this lec-
ture we develop a number of tools for showing optimality—according to certain
metrics—of optimization methods for stochastic problems.
than the stochastic gradients). A similar variant with a natural stochastic gradient
oracle is to set g(x, s, F) ∈ ∂F(x; s) instead of providing the sample S = s.
We focus in this note on the case when the optimization procedure may view
only the sequence of subgradients g1 , g2 , . . . at the points it queries. We note in
passing, however, that for many problems we can reconstruct S from a gradient
g ∈ ∂F(x; S). As an example, consider a logistic regression problem with data
s = (a, b) ∈ {0, 1}ⁿ × {−1, 1}, a typical data case. Then
F(x; s) = log(1 + e^{−b⟨a,x⟩}), and ∇_x F(x; s) = −(1/(1 + e^{b⟨a,x⟩})) b a,
so that (a, b) is identifiable from any g ∈ ∂F(x; s). More generally, classical linear
models in statistics have gradients that are scaled multiples of the data, so that
the sample s is typically identifiable from g ∈ ∂F(x; s).
Now, given function f and stochastic gradient oracle g, an optimization pro-
cedure chooses query points x₁, x₂, . . . , x_K and observes stochastic subgradients
g_k with E[g_k] ∈ ∂f(x_k). Based on these stochastic gradients, the optimization
procedure outputs x̂_K, and we assess the quality of the procedure in terms of the
excess loss
E[ f(x̂_K(g₁, . . . , g_K)) − inf_{x⋆∈C} f(x⋆) ],
where the expectation is taken over the subgradients g(xi , Si , f) returned by the
stochastic oracle and any randomness in the chosen iterates, or query points,
x1 , . . . , xK of the optimization method. Of course, if we only consider this ex-
cess objective value for a fixed function f, then a trivial optimization procedure
achieves excess risk 0: simply return some x⋆ ∈ argmin_{x∈C} f(x). It is thus impor-
tant to ask for a more uniform notion of risk: we would like the procedure to have
good performance uniformly across all functions f ∈ F, leading us to measure the
performance of a procedure by its worst-case risk
sup_{f∈F} E[ f(x̂(g₁, . . . , g_K)) − inf_{x∈C} f(x) ],
where the supremum is taken over functions f ∈ F (the subgradient oracle g then
implicitly depends on f). An optimal estimator for this metric then gives the
minimax risk for optimizing the family of stochastic optimization problems {f}f∈F
over x ∈ C ⊂ Rn , which is
(5.1.2)  M_K(C, F) := inf_{x̂_K} sup_{f∈F} E[ f(x̂_K(g₁, . . . , g_K)) − inf_{x⋆∈C} f(x⋆) ].
The basic approach There are a variety of techniques giving lower bounds on
the minimax risk (5.1.2). Each of them transforms the maximum risk by lower
bounding it via a Bayesian problem (e.g. [31, 33, 34]), then proving a lower bound
on the performance of all possible estimators for the Bayesian problem. In partic-
ular, let {fv } ⊂ F be a collection of functions in F indexed by some (finite or count-
able) set V and π be any probability mass function over V. Let f⋆_v = inf_{x∈C} f_v(x).
Then for any procedure x̂, the maximum risk has lower bound
sup_{f∈F} E[f(x̂) − f⋆] ≥ Σ_v π(v) E[f_v(x̂) − f⋆_v].
While trivial, this lower bound serves as the departure point for each of the
subsequent techniques for lower bounding the minimax risk. The lower bound
also allows us to assume that the procedure x̂ is deterministic. Indeed, assume that
x̂ is non-deterministic, which we can represent generally as depending on some
auxiliary random variable U independent of the observed subgradients. Then we
certainly have
E[ Σ_v π(v) E[f_v(x̂) − f⋆_v | U] ] ≥ inf_u Σ_v π(v) E[f_v(x̂) − f⋆_v | U = u],
that is, there is some realization of the auxiliary randomness that is at least as
good as the average realization. We can simply incorporate this into our minimax
optimal procedures x̂, and thus we assume from this point onward that all our
optimization procedures are deterministic when proving our lower bounds.
(Figure: the functions f₀ and f₁, their separation δ = d_opt(f₀, f₁), and the set {x : f₁(x) ≤ f₁⋆ + δ}.)
Proposition 5.1.6. Let V be drawn uniformly from V, where |V| < ∞, and assume the
collection {f_v}_{v∈V} is δ-separated. Then for any optimization procedure x̂ based on the
observed subgradients,
(1/|V|) Σ_{v∈V} E[f_v(x̂) − f⋆_v] ≥ δ · inf_{v̂} P(v̂ ≠ V),
where the distribution P is the joint distribution over the random index V and the ob-
served gradients g₁, . . . , g_K and the infimum is taken over all testing procedures v̂ based
on the observed data.
Definition 5.2.1. Let P and Q be distributions on a space S, and assume that they
are both absolutely continuous with respect to a measure μ on S. The variation
distance between P and Q is
‖P − Q‖_TV := sup_{A⊂S} |P(A) − Q(A)| = (1/2) ∫_S |p(s) − q(s)| dμ(s).
The Kullback-Leibler divergence between P and Q is
D_kl(P‖Q) := ∫_S p(s) log(p(s)/q(s)) dμ(s).
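For discrete distributions both quantities are a few lines of code; the sketch below also checks the inequality ‖P − Q‖_TV ≤ √(D_kl(P‖Q)/2) (Pinsker's inequality), which underlies Theorem 5.2.4 below, on the Bernoulli pair arising later from the oracle (5.2.6):

    import numpy as np

    def tv_distance(p, q):
        return 0.5 * np.abs(p - q).sum()

    def kl_divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    delta = 0.2
    p = np.array([(1 + delta) / 2, (1 - delta) / 2])
    q = np.array([(1 - delta) / 2, (1 + delta) / 2])
    assert tv_distance(p, q) <= np.sqrt(kl_divergence(p, q) / 2)  # Pinsker
    assert kl_divergence(p, q) <= 3 * delta**2                    # cf. Lemma 5.2.9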
We can connect the variation distance to binary hypothesis tests via the following
lemma, due to Le Cam. The lemma states that testing between two distributions
is hard precisely when they are close in variation distance.
Lemma 5.2.2 (Le Cam). For any distributions P₁ and P_{−1} and any test v̂ taking values in {−1, 1},
inf_{v̂} { P₁(v̂ ≠ 1) + P_{−1}(v̂ ≠ −1) } = 1 − ‖P₁ − P_{−1}‖_TV.
Proof. A test v̂ corresponds to an acceptance region A = {v̂ = 1}, so that
inf_{v̂} { P₁(v̂ ≠ 1) + P_{−1}(v̂ ≠ −1) } = inf_A { 1 − P₁(A) + P_{−1}(A) }
= 1 − sup_A { P₁(A) − P_{−1}(A) } = 1 − ‖P₁ − P_{−1}‖_TV,
as desired.
As an immediate consequence of Lemma 5.2.2, we obtain the standard mini-
max lower bound based on binary hypothesis testing. In particular, let f₁ and f_{−1}
be δ-separated and belong to F, and assume that the method x̂ receives data (in
this case, the data is the K subgradients) from P_v^K when f_v is the true function.
Then we immediately have
(5.2.3)  M_K(C, F) ≥ inf_{x̂_K} max_{v∈{−1,1}} { E_{P_v}[f_v(x̂_K) − f⋆_v] } ≥ (δ/2)( 1 − ‖P₁^K − P_{−1}^K‖_TV ).
Theorem 5.2.4. Let F be a collection of convex functions, and let f₁, f_{−1} ∈ F. Assume
that when function f_v is to be optimized, we observe K subgradients according to P_v^K.
Then
M_K(C, P) ≥ ( d_opt(f_{−1}, f₁; C)/2 )( 1 − √( D_kl(P₁^K‖P_{−1}^K)/2 ) ).
What remains to give a concrete lower bound, then, is (1) to construct a family
of well-separated functions f1 , f−1 , and (2) to construct a stochastic gradient or-
acle for which we give a small upper bound on the KL-divergence between the
distributions P1 and P−1 associated with the functions, which means that testing
between P1 and P−1 is hard.
Constructing well-separated functions Our first goal is to construct a family
of well-separated functions and an associated first-order subgradient oracle that
makes the functions hard to distinguish. We parameterize our functions—of
which we construct only 2—by a parameter δ > 0 governing their separation.
Our construction applies in dimension n = 1: let us assume that C contains the
interval [−R, R] (this is no loss of generality, as we may simply shift the interval).
Then define the M-Lipschitz continuous functions
(5.2.5) f1 (x) = Mδ|x − R| and f−1 (x) = Mδ|x + R|.
See Figure 5.2.7 for an example of these functions, which makes clear that their
separation (5.1.4) is
dopt (f1 , f−1 ) = δMR.
We also consider the stochastic oracle for this problem, recalling that we must
construct subgradients satisfying E[‖g‖₂²] ≤ M². We will do slightly more: we
will guarantee that |g| ≤ M always. With this in mind, we assume that δ ≤ 1,
and define the stochastic gradient oracle for the distribution P_v, v ∈ {−1, 1}, at the
point x to be
(5.2.6)  g_v(x) = M sign(x − vR) with probability (1 + δ)/2,
         g_v(x) = −M sign(x − vR) with probability (1 − δ)/2.
At x = vR the oracle simply returns a random sign. Then by inspection, we see
that
E[g_v(x)] = M sign(x − vR)·(1 + δ)/2 − M sign(x − vR)·(1 − δ)/2 = Mδ sign(x − vR) ∈ ∂f_v(x)
for v = −1, 1. Thus, the combination of the functions (5.2.5) and the stochastic
gradient (5.2.6) give us a valid subgradient and well-separated pair of functions.
(Figure 5.2.7: the functions f_{−1} and f₁ on [−R, R], with separation d_opt(f_{−1}, f₁) = MRδ and the set {x : f₁(x) ≤ f₁⋆ + MRδ}.)
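The hard instance is straightforward to simulate; a sketch of the oracle (5.2.6), with function and variable names of our choosing:

    import numpy as np

    rng = np.random.default_rng(0)

    def hard_oracle(x, v, M, R, delta):
        """Stochastic subgradient (5.2.6) for f_v(x) = M * delta * |x - vR|.

        The informative sign M sign(x - vR) appears with probability
        (1 + delta)/2, so E[g] = M delta sign(x - vR), a subgradient of f_v."""
        s = np.sign(x - v * R)
        if s == 0.0:               # at the kink, return a random sign
            s = rng.choice([-1.0, 1.0])
        keep = rng.random() < (1 + delta) / 2
        return M * s if keep else -M * s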
Bounding the distance between distributions The second step in proving our
minimax lower bound is to upper bound the distance between the distributions
that generate the subgradients our methods observe. This means that testing
which of the functions we are optimizing is challenging, giving us a strong lower
bound. At a high level, building off of Theorem 5.2.4, we hope to show an upper
bound of the form
D_kl(P₁^K‖P_{−1}^K) ≤ κδ²
for some κ. This is a local condition, allowing us to scale our problems with δ
to achieve minimax bounds. If we have such a quadratic bound, we may simply choose
δ² = 1/(2κ), giving the constant probability of error
1 − ‖P₁^K − P_{−1}^K‖_TV ≥ 1 − √( D_kl(P₁^K‖P_{−1}^K)/2 ) ≥ 1 − √(κδ²/2) = 1/2.
To this end, we begin with a standard lemma (the chain rule for KL diver-
gence), which applies when we have K potentially dependent observations from
a distribution. The result is an immediate consequence of Bayes’ rule.
Lemma 5.2.8. Let P_v(· | g₁, . . . , g_{k−1}) denote the conditional distribution of g_k given
g₁, . . . , g_{k−1}. For each K ∈ N let P₁^K and P_{−1}^K be the distributions of the K subgradients
g₁, . . . , g_K. Then
D_kl(P₁^K‖P_{−1}^K) = Σ_{k=1}^K E_{P₁^{k−1}}[ D_kl( P₁(· | g₁, . . . , g_{k−1}) ‖ P_{−1}(· | g₁, . . . , g_{k−1}) ) ].
Using Lemma 5.2.8, we have the following upper bound on the KL-divergence
between P₁^K and P_{−1}^K for the stochastic gradient (5.2.6).
Lemma 5.2.9. Let the K observations under distribution P_v come from the stochastic
gradient oracle (5.2.6). Then for δ ≤ 4/5,
D_kl(P₁^K‖P_{−1}^K) ≤ 3Kδ².
Proof. We use the chain-rule for KL-divergence, whence we must only provide
an upper bound on the individual terms. We first note that x_k is a function
of g₁, . . . , g_{k−1} (because we may assume w.l.o.g. that x_k is deterministic) so that
P_v(· | g₁, . . . , g_{k−1}) is the distribution of a Bernoulli random variable with distri-
bution (5.2.6), i.e. with probabilities (1 ± δ)/2. Thus we have
D_kl( P₁(· | g₁, . . . , g_{k−1}) ‖ P_{−1}(· | g₁, . . . , g_{k−1}) ) ≤ D_kl( (1+δ)/2 ‖ (1−δ)/2 )
= ((1+δ)/2) log((1+δ)/(1−δ)) + ((1−δ)/2) log((1−δ)/(1+δ))
= δ log((1+δ)/(1−δ)).
By a Taylor expansion, we have that
δ log((1+δ)/(1−δ)) = δ( δ − δ²/2 + O(δ³) ) − δ( −δ − δ²/2 + O(δ³) ) = 2δ² + O(δ⁴) ≤ 3δ²
for δ ≤ 4/5, or
D_kl( P₁(· | g₁, . . . , g_{k−1}) ‖ P_{−1}(· | g₁, . . . , g_{k−1}) ) ≤ 3δ²
for δ ≤ 4/5. Summing over k completes the proof.
Putting it all together: a minimax lower bound With Lemma 5.2.9 in place
along with our construction (5.2.5) of well-separated functions, we can now give
the best possible convergence guarantees for a broad family of problems.
Theorem 5.2.10. Let C ⊂ Rⁿ be a convex set containing an ℓ₂ ball of radius R, and let P
denote the collection of distributions generating stochastic subgradients with ‖g‖₂ ≤ M
with probability 1. Then, for all K ∈ N,
M_K(C, P) ≥ RM/(4√6 √K).
Proof. We combine Le Cam’s method, Lemma 5.2.2 (and the subsequent Theo-
rem 5.2.4) with our construction (5.2.5) and their stochastic subgradients (5.2.6).
Certainly, the class of n-dimensional optimization problems is at least as challeng-
ing as a 1-dimensional problem (we may always restrict our functions to depend
only on a single coordinate), so that for any δ ≥ 0 we have
M_K(C, F) ≥ (δMR/2)( 1 − √( D_kl(P₁^K‖P_{−1}^K)/2 ) ).
Next, we use Lemma 5.2.9, which guarantees the further lower bound
M_K(C, F) ≥ (δMR/2)( 1 − √(3Kδ²/2) ),
and choosing δ = 1/√(6K), which satisfies δ ≤ 4/5, gives M_K(C, F) ≥ MR/(4√(6K)), the claimed bound.
We say that a collection {f_v}_{v∈V} indexed by V = {−1, 1}ⁿ is δ-separated in the Hamming
metric, for a vector δ ∈ Rⁿ₊, if for all x ∈ C ⊂ Rⁿ we have
(5.3.1)  f_v(x) − f⋆_v ≥ Σ_{j=1}^n δ_j 1{sign(x_j) ≠ v_j},
where the subscript j denotes the jth coordinate. For example, if we define the
function f_v(x) = δ‖x − v‖₁ for each v ∈ V, then certainly {f_v} is δ1-separated
in the Hamming metric; more generally, f_v(x) = Σ_{j=1}^n δ_j|x_j − v_j| is δ-separated.
With this definition, we have the following lemma, providing a lower bound for
functions f : Rⁿ → R; in it, P_{±j} := (2/|V|) Σ_{v:v_j=±1} P_v denotes the mixture of the
distributions P_v over v with jth coordinate fixed to ±1.
Then
(1/2ⁿ) Σ_{v∈{−1,1}ⁿ} E[f_v(x̂) − f⋆_v] ≥ (1/2) Σ_{j=1}^n δ_j ( 1 − ‖P_{+j} − P_{−j}‖_TV ).
Proof. By the separation condition (5.3.1),
(1/|V|) Σ_{v∈V} E[f_v(x̂) − f⋆_v] ≥ (1/|V|) Σ_{v∈V} Σ_{j=1}^n δ_j P_v(sign(x̂_j) ≠ v_j)
= Σ_{j=1}^n δ_j (1/|V|) [ Σ_{v:v_j=1} P_v(sign(x̂_j) ≠ 1) + Σ_{v:v_j=−1} P_v(sign(x̂_j) ≠ −1) ]
= Σ_{j=1}^n (δ_j/2) [ P_{+j}(sign(x̂_j) ≠ 1) + P_{−j}(sign(x̂_j) ≠ −1) ].
Now we use Le Cam’s lemma (Lemma 5.2.2) on optimal binary hypothesis tests
to see that
P_{+j}(sign(x̂_j) ≠ 1) + P_{−j}(sign(x̂_j) ≠ −1) ≥ 1 − ‖P_{+j} − P_{−j}‖_TV,
which gives the desired result.
With this proposition, we can give a number of minimax lower bounds. We
focus on two concrete cases, which show that the stochastic gradient procedures
we have developed are optimal for a variety of problems. We give one result,
deferring others to the exercises associated with the lecture notes. For our main
result using Assouad’s method, we consider optimization problems for which the
set C ⊂ Rⁿ contains an ℓ∞ ball of radius R. We also assume that the stochastic
gradient oracle satisfies the ℓ₁-bound condition
E[‖g(x, S, f)‖₁²] ≤ M².
This means that all the functions f ∈ F are M-Lipschitz continuous with respect
to the ℓ∞-norm, that is, |f(x) − f(y)| ≤ M‖x − y‖_∞.
Theorem 5.3.4. Let F and the stochastic gradient oracle be as above, and assume that
C ⊃ [−R, R]ⁿ. Then
M_K(C, F) ≥ RM min{ 1/5, √n/(√96 √K) }.
Proof. Our proof is similar to our construction of our earlier lower bounds, except
that now we must construct functions defined on Rⁿ so that our minimax lower
bound on convergence rate grows with the dimension. Let δ > 0 be fixed for now.
For each v ∈ V = {−1, 1}ⁿ, define the function
f_v(x) := (Mδ/n)‖x − Rv‖₁.
Then by inspection, the collection {f_v} is (MRδ/n)-separated in Hamming metric, as
f_v(x) = (Mδ/n) Σ_{j=1}^n |x_j − Rv_j| ≥ (MRδ/n) Σ_{j=1}^n 1{sign(x_j) ≠ v_j}.
It remains to upper bound the KL-divergence terms. Let P_v^K denote the distribu-
tion of the K subgradients the method observes for the function f_v, and let v(±j)
denote the vector v except that its jth entry is forced to be ±1. Then, we may use
the convexity of the KL-divergence to obtain that
D_kl(P_{+j}‖P_{−j}) ≤ (1/2ⁿ) Σ_{v∈V} D_kl( P^K_{v(+j)} ‖ P^K_{v(−j)} ).
Let us thus bound D_kl(P_v^K‖P_{v′}^K) when v and v′ differ in only a single coordinate
(we let it be the first coordinate with no loss of generality). Let us assume for
notational simplicity M = 1 for the next calculation, as this only changes the
support of the subgradient distribution (5.3.5) but not any divergences. Applying
the chain rule (Lemma 5.2.8), we have
D_kl(P_v^K‖P_{v′}^K) = Σ_{k=1}^K E_{P_v}[ D_kl( P_v(· | g_{1:k−1}) ‖ P_{v′}(· | g_{1:k−1}) ) ].
We consider one of the terms, noting that the kth query x_k is a function of
g₁, . . . , g_{k−1}; a calculation with the oracle (5.3.5), paralleling the proof of Lemma 5.2.9,
shows each term is bounded by 3δ²/n for δ ≤ 4/5, which gives
D_kl(P_v^K‖P_{v′}^K) ≤ 3Kδ²/n  if δ ≤ 4/5.
Substituting the preceding calculation into the lower bound (5.3.6), we obtain
M_K(C, F) ≥ (MRδ/2) ( 1 − √( (1/(2n)) Σ_{j=1}^n 3Kδ²/n ) ) = (MRδ/2) ( 1 − √( 3Kδ²/(2n) ) ).
Choosing δ² = min{16/25, n/(6K)} gives the result of the theorem.
A few remarks are in order. First, the theorem recovers the 1-dimensional re-
sult of Theorem 5.2.10, by simply taking n = 1 in its statement. Second, we see
that if we wish to optimize over a set larger than the ℓ2-ball, then there must neces-
sarily be some dimension-dependent penalty, at least in the worst case. Lastly, the
result again is sharp. By using Theorem 3.4.9, we obtain the following corollary.
A. Technical Appendices
A.1. Continuity of Convex Functions In this appendix, we provide proofs of
the basic continuity results for convex functions. Our arguments are based on
those of Hiriart-Urruty and Lemaréchal [27].
Proof of Lemma 2.3.1 We can write x ∈ B_1 as x = Σ_{i=1}^n x_i e_i, where the e_i are the standard basis vectors and Σ_{i=1}^n |x_i| ≤ 1. Thus, we have
f(x) = f( Σ_{i=1}^n e_i x_i ) = f( Σ_{i=1}^n |x_i| sign(x_i) e_i + (1 − ‖x‖_1) · 0 )
  ≤ Σ_{i=1}^n |x_i| f( sign(x_i) e_i ) + (1 − ‖x‖_1) f(0)
  ≤ max{ f(e_1), f(−e_1), f(e_2), f(−e_2), . . . , f(e_n), f(−e_n), f(0) }.
The first inequality uses the fact that the |x_i| and (1 − ‖x‖_1) form a convex combination, since x ∈ B_1, as does the second.
For the lower bound, note by the fact that x ∈ int B_1 satisfies x ∈ int dom f, we have ∂f(x) ≠ ∅ by Theorem 2.4.3. In particular, there is a vector g such that f(y) ≥ f(x) + ⟨g, y − x⟩ for all y, and even more,
f(y) ≥ f(x) + inf_{y∈B_1} ⟨g, y − x⟩ ≥ f(x) − 2 ‖g‖_∞
for all y ∈ B_1.
Proof of Theorem 2.3.2 First, let us suppose that for each point x0 ∈ C, there
exists an open ball B ⊂ int dom f such that
(A.1.1) |f(x) − f(x′)| ≤ L ‖x − x′‖_2 for all x, x′ ∈ B.
The collection of such balls B covers C, and as C is compact, there exists a
finite subcover B1 , . . . , Bk with associated Lipschitz constants L1 , . . . , Lk . Take
L = maxi Li to obtain the result.
It thus remains to show that we can construct balls satisfying the Lipschitz
condition (A.1.1) at each point x0 ∈ C.
With that in mind, we use Lemma 2.3.1, which shows that for each point x0 ,
there is some ε > 0 and −∞ < m ≤ M < ∞ such that
−∞ < m ≤ inf_{v : ‖v‖_2 ≤ 2ε} f(x_0 + v) ≤ sup_{v : ‖v‖_2 ≤ 2ε} f(x_0 + v) ≤ M < ∞.
We make the following claim, from which the condition (A.1.1) evidently follows
based on the preceding display.
Lemma A.1.2. Let ε > 0, let f be convex, and let B = {v : ‖v‖_2 ≤ 1}. Suppose that f(x) ∈ [m, M] for all x ∈ x_0 + 2εB. Then
|f(x) − f(x′)| ≤ ((M − m)/ε) ‖x − x′‖_2 for all x, x′ ∈ x_0 + εB.
Proof. We prove the upper tail, as the lower tail is similar. The proof is a nearly
immediate consequence of Hoeffding’s lemma (Lemma A.2.3) and the Chernoff
bound technique. Indeed, we have
P( Σ_{i=1}^n X_i ≥ t ) ≤ E[ exp( λ Σ_{i=1}^n X_i ) ] exp(−λt)
5 We may assume there is a dominating base measure μ with respect to which P has a density p.
for all λ ≥ 0. Now, letting Z_i be the sequence to which the X_i are adapted, we iterate conditional expectations. We have
E[ exp( λ Σ_{i=1}^n X_i ) ] = E[ E[ exp( λ Σ_{i=1}^{n−1} X_i ) exp(λX_n) | Z_{n−1} ] ]
  = E[ exp( λ Σ_{i=1}^{n−1} X_i ) E[ exp(λX_n) | Z_{n−1} ] ]
  ≤ E[ exp( λ Σ_{i=1}^{n−1} X_i ) ] e^{λ²B²/8}
because X_1, . . . , X_{n−1} are functions of Z_{n−1}. By iteratively applying this calculation, we arrive at
(A.2.6) E[ exp( λ Σ_{i=1}^n X_i ) ] ≤ exp( λ²nB²/8 ).
Now we optimize by choosing λ ≥ 0 to minimize the upper bound that inequality (A.2.6) provides, namely
P( Σ_{i=1}^n X_i ≥ t ) ≤ inf_{λ≥0} exp( λ²nB²/8 − λt ) = exp( −2t²/(nB²) ),
by taking λ = 4t/(B²n).
A.3. Auxiliary results on divergences We present a few standard results on
divergences without proof, referring to standard references (e.g. the book of Cover
and Thomas [17] or the extensive paper on divergence measures by Liese and
Vajda [35]). Nonetheless, we state and prove a few results. The first is known as
the data processing inequality, and it says that processing a random variable (even
adding noise to it) can only make distributions closer together. See Cover and
Thomas [17] or Theorem 14 of Liese and Vajda [35] for a proof.
Proof. First, we note that if we show the result assuming that the sample space S on which P and Q are defined is finite, we have the general result. Indeed, suppose that A ⊂ S achieves the supremum
‖P − Q‖_TV = sup_{A⊂S} |P(A) − Q(A)|.
(We may assume without loss of generality that such a set exists.) If we define P̂ and Q̂ to be the binary distributions with P̂(0) = P(A) and P̂(1) = 1 − P(A), and similarly for Q̂, we have ‖P̂ − Q̂‖_TV = ‖P − Q‖_TV, and Proposition A.3.1 immediately guarantees that
D_kl( P̂ ‖ Q̂ ) ≤ D_kl( P ‖ Q ).
(c) Using the result of part (b), show that ⟨x, y⟩ ≤ ‖x‖_p ‖y‖_q.
(d) Show that ‖·‖_p and ‖·‖_q are dual norms.
where [t]+ = max{t, 0} denotes the positive part. We will use stochastic gradient
descent to attempt to minimize
f(X) := EP [F(X; (A, B))] = F(X; (a, b))dP(a, b),
where the expectation is taken over pairs (A, B).
(a) Show that F is convex.
(b) Show that F(X; (a, b)) = 0 if and only if the classifier represented by X has a large margin, meaning that
⟨a, x_b⟩ ≥ ⟨a, x_l⟩ + 1 for all l ≠ b.
(c) For a pair (a, b), give a way to calculate a vector G ∈ ∂F(X; (a, b)) (note
that G ∈ Rd×k ).
Question B.3.2: In this problem, you will perform experiments to explore the
performance of stochastic subgradient methods for classification problems, specif-
ically, a handwritten digit recognition problem using zip code data from the
United States Postal Service (this data is taken from the book [24], originally
due to Yann LeCun). The data—training data zip.train, test data zip.test,
and information file zip.inf—are available for download from the zipped tar
file http://web.stanford.edu/~jduchi/PCMIConvex/ZIPCodes.tgz. Starter code is
available for Julia and Matlab at the following URLs.
i. For Julia: http://web.stanford.edu/~jduchi/PCMIConvex/sgd.jl
ii. For Matlab: http://web.stanford.edu/~jduchi/PCMIConvex/matlab.tgz
There are two methods left un-implemented in the starter code: the sgd method
and the MulticlassSVMSubgradient method. Implement these methods (you may
find the code for unit-testing the multiclass SVM subgradient useful to double
check your implementation). For the SGD method, your stepsizes should be
proportional to α_i ∝ 1/√i, and you should project X to the Frobenius norm ball
B_r := { X ∈ R^{d×k} : ‖X‖_Fr ≤ r }, where ‖X‖²_Fr = Σ_{i,j} X²_{ij}.
We have implemented a pre-processing step that also kernelizes the data representation. Let the function K(a, a′) = exp( −(1/(2τ)) ‖a − a′‖²_2 ); the kernelized representation of an example is then built from evaluations of K against the training data.
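To fix ideas, here is a minimal Matlab sketch of the two routines the starter code leaves unimplemented. The names sgd and MulticlassSVMSubgradient come from the starter code, but the bodies are our own illustration, written under the assumption that F(X; (a, b)) is the sum-of-hinges multiclass loss consistent with the margin condition in part (b) above; they are not the reference implementation.

function G = MulticlassSVMSubgradient(X, a, b)
% Subgradient of F(X; (a,b)) = sum over l ~= b of [1 + <a, x_l> - <a, x_b>]_+
k = size(X, 2);
G = zeros(size(X));
for l = 1:k
  if l ~= b && 1 + a' * X(:, l) - a' * X(:, b) > 0
    G(:, l) = G(:, l) + a;   % active hinge: +a on column l
    G(:, b) = G(:, b) - a;   % and -a on the label column b
  end
end
end

function X = sgd(A, labels, r, K)
% Projected stochastic subgradient method: steps proportional to 1/sqrt(i),
% projecting onto the Frobenius-norm ball of radius r after each step.
[n, d] = size(A);
k = max(labels);
X = zeros(d, k);
for i = 1:K
  s = randi(n);                                         % sample a pair (a, b)
  G = MulticlassSVMSubgradient(X, A(s, :)', labels(s));
  X = X - G / sqrt(i);                                  % subgradient step
  nrm = norm(X, 'fro');
  if nrm > r, X = X * (r / nrm); end                    % project onto B_r
end
end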
Question B.3.3: In this problem, we give a simple bound on the rate of conver-
gence for stochastic optimization for minimization of strongly convex functions.
Let C denote a compact convex set and f denote a λ-strongly convex function
with respect to the ℓ2-norm on C, meaning that
f(y) ≥ f(x) + ⟨g, y − x⟩ + (λ/2) ‖x − y‖²_2 for all g ∈ ∂f(x), x, y ∈ C.
Consider the following stochastic gradient method: at iteration k, we
i. receive a noisy subgradient gk with E[gk | xk ] ∈ ∂f(xk );
ii. perform the projected subgradient step
xk+1 = πC (xk − αk gk ).
Show that if E[ ‖g_k‖²_2 ] ≤ M² for all k, then with the stepsize choice α_k = 1/(λk), we have the convergence guarantee
E[ Σ_{k=1}^K ( f(x_k) − f(x*) ) ] ≤ (M²/(2λ)) (log K + 1).
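In Matlab-style pseudocode, the method in question is just the following loop; project_C and noisy_subgradient are hypothetical stand-ins for the projection π_C and the oracle, not part of any provided code:

x = x0;
for k = 1:K
  g = noisy_subgradient(x);            % oracle: E[g | x] lies in the subdifferential of f at x
  x = project_C(x - g / (lambda*k));   % stepsize alpha_k = 1/(lambda*k)
end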
Question B.4.2 (Strong convexity of ℓp-norms): Prove the claim of Example 4.2.7. That is, for some fixed p ∈ (1, 2], if h(x) = (1/(2(p−1))) ‖x‖²_p, show that h is strongly convex with respect to the ℓp-norm.
Hint: Let Ψ(t) = (1/(2(p−1))) t^{2/p} and φ(t) = |t|^p, noting that h(x) = Ψ( Σ_{j=1}^n φ(x_j) ). Then by a Taylor expansion, this question is equivalent to showing that for any w, x ∈ R^n, we have
x⊤ ∇²h(w) x ≥ ‖x‖²_p,
where, defining the shorthand vector ∇φ(w) = [φ′(w_1) · · · φ′(w_n)]⊤, we have
∇²h(w) = Ψ′′( Σ_{j=1}^n φ(w_j) ) ∇φ(w) ∇φ(w)⊤ + Ψ′( Σ_{j=1}^n φ(w_j) ) diag( φ′′(w_1), . . . , φ′′(w_n) ).
Now apply an argument similar to that used in Example 4.2.5 to show the strong convexity of h(x) = Σ_j x_j log x_j, but applying Hölder’s inequality instead of Cauchy-Schwarz.
Question B.4.3 (Variable metric methods and AdaGrad): Consider the following
variable-metric method for minimizing a convex function f on a convex subset
C ⊂ Rn :
x_{k+1} = argmin_{x∈C} { ⟨g_k, x⟩ + (1/2) (x − x_k)⊤ H_k (x − x_k) },
where E[g_k] ∈ ∂f(x_k). In the lecture, we showed that
E[ Σ_{k=1}^K ( f(x_k) − f(x*) ) ]
  ≤ (1/2) Σ_{k=2}^K E[ ‖x_k − x*‖²_{H_k} − ‖x_k − x*‖²_{H_{k−1}} ] + (1/2) ‖x_1 − x*‖²_{H_1} + (1/2) Σ_{k=1}^K E[ ‖g_k‖²_{H_k^{−1}} ].
(a) Let
H_k = diag( Σ_{i=1}^k g_i g_i⊤ )^{1/2}
be the diagonal matrix whose entries are the square roots of the sums of the squares of the gradient coordinates. (This is the AdaGrad method.) Show that
‖x_k − x*‖²_{H_k} − ‖x_k − x*‖²_{H_{k−1}} ≤ ‖x_k − x*‖²_∞ tr(H_k − H_{k−1}),
where tr(A) = Σ_{i=1}^n A_{ii} is the trace of the matrix A.
(b) Assume that R_∞ = sup_{x∈C} ‖x − x*‖_∞ is finite. Show that with any choice of diagonal matrix H_k, we obtain
E[ Σ_{k=1}^K ( f(x_k) − f(x*) ) ] ≤ (1/2) R²_∞ E[ tr(H_K) ] + (1/2) Σ_{k=1}^K E[ ‖g_k‖²_{H_k^{−1}} ].
(c) Let g_{k,j} denote the jth coordinate of the kth subgradient. Let H_k be chosen as above. Show that
E[ Σ_{k=1}^K ( f(x_k) − f(x*) ) ] ≤ (3/2) R_∞ Σ_{j=1}^n E[ ( Σ_{k=1}^K g²_{k,j} )^{1/2} ].
(d) Suppose that the domain C = {x : ‖x‖_∞ ≤ 1}. What is the expected regret of AdaGrad? Show that (to a numerical constant factor we ignore) this expected regret is always smaller than the expected regret bound for standard projected gradient descent, which is
E[ Σ_{k=1}^K ( f(x_k) − f(x*) ) ] ≤ O(1) sup_{x∈C} ‖x − x*‖_2 E[ ( Σ_{k=1}^K ‖g_k‖²_2 )^{1/2} ].
Hint: Use Cauchy-Schwarz.
(e) As in the previous sub-question, assume that C = {x : ‖x‖_∞ ≤ 1}. Suppose that the subgradients are such that g_k ∈ {−1, 0, 1}^n for all k, and that for each coordinate j we have P(g_{k,j} ≠ 0) = p_j. Show that AdaGrad has convergence guarantee
E[ Σ_{k=1}^K ( f(x_k) − f(x*) ) ] ≤ (3√K/2) Σ_{j=1}^n √p_j.
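To make the iteration concrete, here is a minimal Matlab sketch of diagonal AdaGrad on C = {x : ‖x‖∞ ≤ 1}; subgrad_oracle is a hypothetical stochastic subgradient routine, and since H_k is diagonal, the projection in the H_k metric onto this box reduces to coordinate-wise clipping:

x = zeros(n, 1);
s = zeros(n, 1);              % running sums of squared gradient coordinates
for k = 1:K
  g = subgrad_oracle(x);
  s = s + g.^2;
  h = sqrt(s);
  h(h == 0) = 1;              % guard zero entries on the diagonal of H_k
  x = x - g ./ h;             % variable-metric step
  x = min(max(x, -1), 1);     % project onto {x : ||x||_inf <= 1}
end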
d_opt(f_{−1}, f_1; C) := sup{ δ ≥ 0 : for any x ∈ C, f_1(x) ≤ f*_1 + δ implies f_{−1}(x) ≥ f*_{−1} + δ, and f_{−1}(x) ≤ f*_{−1} + δ implies f_1(x) ≥ f*_1 + δ }.
When C = R (or, more generally, as long as C ⊃ [−δ, δ]), show that
d_opt(f_{−1}, f_1; C) ≥ λδ²/2.
(b) Show that the Kullback-Leibler divergence between two normal distributions P_1 = N(μ_1, σ²) and P_2 = N(μ_2, σ²) is
D_kl( P_1 ‖ P_2 ) = (μ_1 − μ_2)² / (2σ²).
(c) Use Le Cam’s method to show the following lower bound for stochastic optimization: for any optimization procedure x̂_K using K noisy gradient evaluations,
max_{v∈{−1,1}} E_{P_v}[ f_v(x̂_K) − f*_v ] ≥ σ²/(32λK).
Compare the result with the regret upper bound in problem B.3.3. Hint: If P_v^K denotes the distribution of the K noisy gradients for function f_v, show that
D_kl( P_1^K ‖ P_{−1}^K ) ≤ 2Kλ²δ²/σ².
Question B.5.2: Let C = {x ∈ R^n : ‖x‖_∞ ≤ 1}, and consider the collection of functions F where the stochastic gradient oracle g : R^n × S × F → {−1, 0, 1}^n satisfies
P( g_j(x, S, f) ≠ 0 ) ≤ p_j
for each coordinate j = 1, 2, . . . , n. Show that, for large enough K ∈ N, a minimax lower bound for this class of functions and the given stochastic oracle is
M_K(C, F) ≥ c (1/√K) Σ_{j=1}^n √p_j,
where c > 0 is a numerical constant. How does this compare to the convergence guarantee that AdaGrad gives?
References
[1] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, Information-theoretic lower bounds
on the oracle complexity of stochastic convex optimization, IEEE Trans. Inform. Theory 58 (2012), no. 5,
3235–3249, DOI 10.1109/TIT.2011.2182178. MR2952543 ←171
[2] P. Assouad, Deux remarques sur l’estimation (French, with English summary), C. R. Acad. Sci. Paris
Sér. I Math. 296 (1983), no. 23, 1021–1024. MR777600 ←167
[3] P. Auer, N. Cesa-Bianchi, and C. Gentile, Adaptive and self-confident on-line learning algorithms, J.
Comput. System Sci. 64 (2002), no. 1, 48–75, DOI 10.1006/jcss.2001.1795. Special issue on COLT
2000 (Palo Alto, CA). MR1896142 ←157
[4] K. Azuma, Weighted sums of certain dependent random variables, Tôhoku Math. J. (2) 19 (1967), 357–
367, DOI 10.2748/tmj/1178243286. MR0221571 ←174
[5] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex opti-
mization, Oper. Res. Lett. 31 (2003), no. 3, 167–175, DOI 10.1016/S0167-6377(02)00231-6. MR1967286
←157
[6] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust optimization, Princeton Series in Applied
Mathematics, Princeton University Press, Princeton, NJ, 2009. MR2546839 ←102
[7] D. P. Bertsekas, Stochastic optimization problems with nondifferentiable cost functionals, J. Optimiza-
tion Theory Appl. 12 (1973), 218–231, DOI 10.1007/BF00934819. MR0329725 ←120
[8] D. P. Bertsekas, Convex optimization theory, Athena Scientific, Nashua, NH, 2009. MR2830150 ←101,
122
[9] D. P. Bertsekas, Nonlinear programming, 2nd ed., Athena Scientific Optimization and Computation
Series, Athena Scientific, Belmont, MA, 1999. MR3444832 ←101
[10] S. Boyd, J. Duchi, and L. Vandenberghe, Subgradients, 2015. Course notes for Stanford Course
EE364b. ←140
[11] S. Boyd and A. Mutapcic, Stochastic subgradient methods, 2007. Course notes for EE364b at Stanford,
available at http://www.stanford.edu/class/ee364b/notes/stoch_subgrad_notes.pdf. ←140
[12] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, Cambridge, 2004.
MR2061575 ←101, 119, 122, 123, 157
[13] G. Braun, C. Guzmán, and S. Pokutta, Lower bounds in the oracle complexity of nonsmooth convex
optimization via information theory, IEEE Trans. Inform. Theory 63 (2017), no. 7, 4709–4724, DOI
10.1109/TIT.2017.2701343. MR3666985 ←171
[14] P. Brucker, An O(n) algorithm for quadratic knapsack problems, Oper. Res. Lett. 3 (1984), no. 3,
163–166, DOI 10.1016/0167-6377(84)90010-5. MR761510 ←130, 143, 151
[15] S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit
problems, Foundations and Trends in Machine Learning 5 (2012), no. 1, 1–122. ←157
[16] N. Cesa-Bianchi, A. Conconi, and C. Gentile, On the generalization ability of on-line learning al-
gorithms, IEEE Trans. Inform. Theory 50 (2004), no. 9, 2050–2057, DOI 10.1109/TIT.2004.833339.
MR2097190 ←140
[17] T. M. Cover and J. A. Thomas, Elements of information theory, 2nd ed., Wiley-Interscience [John
Wiley & Sons], Hoboken, NJ, 2006. MR2239987 ←171, 175
[18] A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support
for non-strongly convex composite objectives, Advances in neural information processing systems 27,
2014. ←140
[19] D. L. Donoho, R. C. Liu, and B. MacGibbon, Minimax risk over hyperrectangles, and implications,
Ann. Statist. 18 (1990), no. 3, 1416–1437, DOI 10.1214/aos/1176347758. MR1062717 ←167
[20] D. L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52 (2006), no. 4, 1289–1306, DOI
10.1109/TIT.2006.871582. MR2241189 ←129
[21] J. C. Duchi, Stats311/EE377: Information theory and statistics, 2015. ←171
[22] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic
optimization, J. Mach. Learn. Res. 12 (2011), 2121–2159. MR2825422 ←153, 157
[23] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, Efficient projections onto the 1 -ball for
learning in high dimensions, Proceedings of the 25th international conference on machine learning,
2008. ←130, 143, 151
[24] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, 2nd ed., Springer Series
in Statistics, Springer, New York, 2009. Data mining, inference, and prediction. MR2722294 ←178
[25] E. Hazan, The convex optimization approach to regret minimization, Optimization for machine learn-
ing, 2012. ←140
[26] E. Hazan, Introduction to online convex optimization, Foundations and Trends in Optimization 2
(2016), no. 3–4, 157–325. ←102
[27] J. Hiriart-Urruty and C. Lemaréchal, Convex analysis and minimization algorithms I, Springer, New
York, 1993. ←101, 119, 122, 172
[28] J. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms II, Springer,
New York, 1993. ←101, 122
[29] J.-B. Hiriart-Urruty and C. Lemaréchal, Fundamentals of convex analysis, Springer, 2001. ←122
[30] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc.
58 (1963), 13–30. MR0144363 ←173
[31] I. A. Ibragimov and R. Z. Hasminskiı̆, Statistical estimation, Applications of Mathematics, vol. 16,
Springer-Verlag, New York-Berlin, 1981. Asymptotic theory; Translated from the Russian by
Samuel Kotz. MR620321 ←102, 160, 171
[32] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction,
Advances in neural information processing systems 26, 2013. ←140
[33] L. Le Cam, Asymptotic methods in statistical decision theory, Springer Series in Statistics, Springer-
Verlag, New York, 1986. MR856411 ←102, 160, 162
[34] E. L. Lehmann and G. Casella, Theory of point estimation, 2nd ed., Springer Texts in Statistics,
Springer-Verlag, New York, 1998. MR1639875 ←160
[35] F. Liese and I. Vajda, On divergences and informations in statistics and information theory, IEEE Trans.
Inform. Theory 52 (2006), no. 10, 4394–4412, DOI 10.1109/TIT.2006.881731. MR2300826 ←175
[36] D. G. Luenberger, Optimization by vector space methods, John Wiley & Sons, Inc., New York-London-
Sydney, 1969. MR0238472 ←122
[37] J. E. Marsden, Elementary classical analysis, W. H. Freeman and Co., San Francisco, 1974. With the
assistance of Michael Buchner, Amy Erickson, Adam Hausknecht, Dennis Heifetz, Janet Macrae
and William Wilson, and with contributions by Paul Chernoff, István Fáry and Robert Gulliver.
MR0357693 ←101
[38] B. McMahan and M. Streeter, Adaptive bound optimization for online convex optimization, Proceed-
ings of the twenty third annual conference on computational learning theory, 2010. ←157
[39] A. Nedić, Subgradient methods for convex minimization, Ph.D. Thesis, 2002. ←157
[40] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach
to stochastic programming, SIAM J. Optim. 19 (2008), no. 4, 1574–1609, DOI 10.1137/070704277.
MR2486041 ←140, 157
[41] A. S. Nemirovsky and D. B. Yudin, Problem complexity and method efficiency in optimization, A Wiley-
Interscience Publication, John Wiley & Sons, Inc., New York, 1983. Translated from the Russian
and with a preface by E. R. Dawson; Wiley-Interscience Series in Discrete Mathematics. MR702836
←102, 123, 157, 171
[42] A. Nemirovski, Efficient methods in convex programming, 1994. Technion: The Israel Institute of
Technology. ←171
[43] A. Nemirovski, Lectures on modern convex optimization, 2005. Georgia Institute of Technology.
←140
[44] Y. Nesterov, Introductory lectures on convex optimization, Applied Optimization, vol. 87, Kluwer
Academic Publishers, Boston, MA, 2004. A basic course. MR2142598 ←124, 140, 171
[45] Y. Nesterov and A. Nemirovskii, Interior-point polynomial algorithms in convex programming, SIAM
Studies in Applied Mathematics, vol. 13, Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 1994. MR1258086 ←123
[46] J. Nocedal and S. J. Wright, Numerical optimization, 2nd ed., Springer Series in Operations Re-
search and Financial Engineering, Springer, New York, 2006. MR2244940 ←101, 157
[47] B. T. Polyak, Introduction to optimization, Translations Series in Mathematics and Engineering,
Optimization Software, Inc., Publications Division, New York, 1987. Translated from the Russian;
With a foreword by Dimitri P. Bertsekas. MR1099605 ←101, 140
[48] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control
Optim. 30 (1992), no. 4, 838–855, DOI 10.1137/0330046. MR1167814 ←140
[49] R. T. Rockafellar, Convex analysis, Princeton University Press, 1970. ←101, 104, 121
[50] W. Rudin, Principles of mathematical analysis, 3rd ed., McGraw-Hill Book Co., New York-Auckland-
Düsseldorf, 1976. International Series in Pure and Applied Mathematics. MR0385023 ←101
[51] S. Shalev-Shwartz, Online learning: Theory, algorithms, and applications, Ph.D. Thesis, 2007. ←144
[52] O. Shamir and S. Shalev-Shwartz, Matrix completion with the trace norm: learning, bounding, and
transducing, J. Mach. Learn. Res. 15 (2014), 3401–3423. MR3277164 ←102
[53] S. Shalev-Shwartz and T. Zhang, Stochastic dual coordinate ascent methods for regularized loss mini-
mization, J. Mach. Learn. Res. 14 (2013), 567–599. MR3033340 ←140
[54] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on stochastic programming, MPS/SIAM
Series on Optimization, vol. 9, Society for Industrial and Applied Mathematics (SIAM), Philadel-
phia, PA; Mathematical Programming Society (MPS), Philadelphia, PA, 2009. Modeling and the-
ory. MR2562798 ←102
[55] N. Z. Shor, Minimization methods for nondifferentiable functions, Springer Series in Computational
Mathematics, vol. 3, Springer-Verlag, Berlin, 1985. Translated from the Russian by K. C. Kiwiel
and A. Ruszczyński. MR775136 ←157
[56] N. Z. Shor, Nondifferentiable optimization and polynomial problems, Nonconvex Optimization and its
Applications, vol. 24, Kluwer Academic Publishers, Dordrecht, 1998. MR1620179 ←153, 157
[57] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B 58 (1996),
no. 1, 267–288. MR1379242 ←129
[58] A. B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009. ←161, 171
[59] A. Wald, Contributions to the theory of statistical estimation and testing hypotheses, Ann. Math. Statis-
tics 10 (1939), 299–326. MR0000932 ←102, 171
[60] A. Wald, Statistical decision functions which minimize the maximum risk, Ann. of Math. (2) 46 (1945),
265–280, DOI 10.2307/1969022. MR0012402 ←102
[61] S. Wright, Optimization Algorithms for Data Analysis, 2018. ←100
[62] Y. Yang and A. Barron, Information-theoretic determination of minimax rates of convergence, Ann.
Statist. 27 (1999), no. 5, 1564–1599, DOI 10.1214/aos/1017939142. MR1742500 ←102, 161
[63] B. Yu, Assouad, Fano, and Le Cam, Festschrift for Lucien Le Cam, Springer, New York, 1997,
pp. 423–435. MR1462963 ←102, 161, 171
[64] M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, Proceedings
of the twentieth international conference on machine learning, 2003. ←140
Randomized Methods for Matrix Computations
Per-Gunnar Martinsson
Contents
1 Introduction 188
1.1 Scope and objectives 188
1.2 The key ideas of randomized low-rank approximation 189
1.3 Advantages of randomized methods 190
1.4 Relation to other chapters and the broader literature 190
2 Notation 191
2.1 Notation 191
2.2 The singular value decomposition (SVD) 191
2.3 Orthonormalization 192
2.4 The Moore-Penrose pseudoinverse 192
3 A two-stage approach 193
4 A randomized algorithm for “Stage A” — the range finding problem 194
5 Single pass algorithms 195
5.1 Hermitian matrices 196
5.2 General matrices 198
6 A method with complexity O(mn log k) for general dense matrices 199
7 Theoretical performance bounds 200
7.1 Bounds on the expectation of the error 201
7.2 Bounds on the likelihood of large deviations 202
8 An accuracy enhanced randomized scheme 202
8.1 The key idea — power iteration 202
8.2 Theoretical results 204
8.3 Extended sampling matrix 205
9 The Nyström method for positive symmetric definite matrices 205
10 Randomized algorithms for computing Interpolatory Decompositions 206
10.1 Structure preserving factorizations 206
10.2 Three flavors of ID: row, column, and double-sided ID 207
10.3 Deterministic techniques for computing the ID 208
10.4 Randomized techniques for computing the ID 210
11 Randomized algorithms for computing the CUR decomposition 212
11.1 The CUR decomposition 212
1. Introduction
1.1. Scope and objectives The objective of this chapter is to describe a set of
randomized methods for efficiently computing a low rank approximation to a
given matrix. In other words, given an m × n matrix A, we seek to compute
factors E and F such that
(1.1.1) A ≈ EF, where E is of size m × k and F of size k × n,
where the rank k of the approximation is a number we assume to be much smaller
than either m or n. In some situations, the rank k is given to us in advance, while
in others, it is part of the problem to determine a rank such that the approxima-
tion satisfies a bound of the type
‖A − EF‖ ≤ ε
where ε is a given tolerance, and · is some specified matrix norm (in this
chapter, we will discuss only the spectral and the Frobenius norms).
An approximation of the form (1.1.1) is useful for storing the matrix A more fru-
gally (we can store E and F using k(m + n) numbers, as opposed to mn numbers
for storing A), for efficiently computing a matrix vector product z = Ax (via y = Fx
and z = Ey), for data interpretation, and much more. Low-rank approximation
problems of this type form a cornerstone of data analysis and scientific comput-
ing, and arise in a broad range of applications, including principal component
analysis (PCA) in computational statistics, spectral methods for clustering high-
dimensional data and finding structure in graphs, image and video compression,
model reduction in physical modeling, and many more.
For the special case where A is Hermitian of size n × n, the natural target is an approximate rank-k eigenvalue decomposition (EVD)
(1.1.2) A ≈ U D U∗, with U of size n × k and D of size k × k,
where the columns of U form an orthonormal set, and where D is diagonal. For
a general m × n matrix A, we would typically be interested in an approximate
rank-k singular value decomposition (SVD), which takes the form
(1.1.3) A ≈ U D V∗, with U of size m × k, D of size k × k, and V of size n × k,
where U and V have orthonormal columns, and D is diagonal. In this chapter, we
will discuss both the EVD and the SVD in depth. We will also describe factoriza-
tions such as the interpolative decomposition (ID) and the CUR decomposition which
are highly useful for data interpretation, and for certain applications in scientific
computing. In these, we seek to determine a subset of the columns (rows) of A
itself that form a good approximate basis for the column (row) space.
While most of the chapter is aimed at computing low rank factorizations where
the target rank k is much smaller than the dimensions of the matrix m and n, we
will in the last couple of sections of the chapter also discuss how randomization
can be used to speed up factorization of full matrices, such as a full column
pivoted QR factorization, or various relaxations of the SVD that are useful for
solving least-squares problems, etc.
1.2. The key ideas of randomized low-rank approximation To quickly intro-
duce the central ideas of the current chapter, let us describe a simple prototypical
randomized algorithm: Let A be a matrix of size m × n that is approximately of
low rank. In other words, we assume that for some integer k < min(m, n), there
exists an approximate low rank factorization of the form (1.1.1). Then a natu-
ral question is how do you in a computationally efficient manner construct the
factors E and F? In [39], it was observed that random matrix theory provides a
simple solution: Draw a Gaussian random matrix G of size n × k, form the sampling
matrix
E = AG,
and then compute the factor F via
F = E† A,
where E† is the Moore-Penrose pseudoinverse of E, as in Subsection 2.4. (Then EF = EE†A, where EE† is the orthogonal projection onto the linear space spanned
by the columns of E.) One can demonstrate that the resulting approximation
(1.2.1) A ≈ E (E†A), with E of size m × k and E†A of size k × n,
is close to optimal. With this observation as a starting point, we will construct
highly efficient algorithms for computing approximate spectral decompositions
of A, for solving certain least-squares problems, for doing principal component
analysis of large data sets, etc.
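For concreteness, the scheme just described is only a few lines of Matlab; the sketch below (with variable names of our own choosing) computes the factors for a given matrix A and target rank k:

G = randn(size(A, 2), k);   % Gaussian random matrix of size n x k
E = A * G;                  % sampling matrix E = AG
F = pinv(E) * A;            % F = (pseudoinverse of E) times A
% A is now approximated by the rank-k product E*F.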
1.3. Advantages of randomized methods The algorithms that result from using
randomized sampling techniques are computationally efficient, and are simple to
implement as they rely on standard building blocks (matrix-matrix multiplication,
unpivoted QR factorization, etc.) that are readily available for most computing
environments (multicore CPU, GPU, distributed memory machines, etc). As an il-
lustration, we invite the reader to peek ahead at Algorithm 4.0.1, which provides a
complete Matlab code for a randomized algorithm that computes an approximate
singular value decomposition of a matrix. Examples of improvements enabled by
these randomized algorithms include:
• Given an m × n matrix A, the cost of computing a rank-k approximant
by classical methods is O(mnk). Randomized algorithms can attain com-
plexity O(mn log k + k2 (m + n)), cf. [26, Sec. 6.1], and Section 6.
• Algorithms for performing principal component analysis (PCA) of large
data sets have been greatly accelerated, in particular when the data is
stored out-of-core, cf. [25].
• Randomized methods tend to require less communication than traditional
methods, and can be efficiently implemented on severely communication
constrained environments such as GPUs [38] and distributed computing
platforms, cf. [24, Ch. 4] and [15, 18].
• Randomized algorithms have enabled the development of single-pass ma-
trix factorization algorithms in which the matrix is “streamed” and never
stored, cf. [26, Sec. 6.3] and Section 5.
1.4. Relation to other chapters and the broader literature Our focus in this
chapter is to describe randomized methods that attain high practical computa-
tional efficiency. In particular, we use randomization mostly as a tool for min-
imizing communication, rather than minimizing the flop count (although we do
sometimes improve asymptotic flop counts as well). The methods described were
first published in [39] (which was inspired by [17], and later led to [31, 40]; see
also [45]). Our presentation largely follows that in the 2011 survey [26], but with
a focus more on practical usage, rather than theoretical analysis. We have also
included material from more recent work, including [49] on factorizations that al-
low for better data interpretation, [38] on blocking and adaptive error estimation,
and [35, 36] on full factorizations.
2. Notation
respectively. We use the notation of Golub and Van Loan [20] to specify submatrices. In other words, if B is an m × n matrix with entries B(i, j), and if I = [i_1, i_2, . . . , i_k] and J = [j_1, j_2, . . . , j_ℓ] are two index vectors, then B(I, J) denotes the k × ℓ matrix whose (p, q) entry is B(i_p, j_q), that is,
B(I, J) = [ B(i_1, j_1)  B(i_1, j_2)  · · ·  B(i_1, j_ℓ)
            B(i_2, j_1)  B(i_2, j_2)  · · ·  B(i_2, j_ℓ)
            . . .
            B(i_k, j_1)  B(i_k, j_2)  · · ·  B(i_k, j_ℓ) ].
We let B(I, :) denote the matrix B(I, [1, 2, . . . , n]), and define B(:, J) analogously.
The transpose of B is denoted B∗ , and we say that a matrix U is orthonormal
(ON) if its columns form an orthonormal set, so that U∗ U = I.
2.2. The singular value decomposition (SVD) The SVD was introduced briefly
in the introduction. Here we define it again, with some more detail added. Let A
denote an m × n matrix, and set r = min(m, n). Then A admits a factorization
(2.2.1) A = U D V∗, with U of size m × r, D of size r × r, and V of size n × r,
where the matrices U and V are orthonormal, and D is diagonal. We let {ui }ri=1
and {vi }ri=1 denote the columns of U and V, respectively. These vectors are the
left and right singular vectors of A. The diagonal elements {σj }rj=1 of D are the
singular values of A. We order these so that σ_1 ≥ σ_2 ≥ · · · ≥ σ_r ≥ 0.
Letting A_k = Σ_{j=1}^k σ_j u_j v_j∗ denote the rank-k truncation of the SVD, we have ‖A − A_k‖ = σ_{k+1} and ‖A − A_k‖_Fro = ( Σ_{j=k+1}^r σ_j² )^{1/2}, where ‖A‖ denotes the operator norm of A and ‖A‖_Fro denotes the Frobenius norm of A. Moreover, the Eckart-Young theorem [14] states that these errors are the smallest possible errors that can be incurred when approximating A by a matrix of rank k.
2.3. Orthonormalization Given an m × ℓ matrix X, with m ≥ ℓ, we introduce
the function
Q = orth(X)
to denote orthonormalization of the columns of X. In other words, the matrix Q
will be an m × ℓ orthonormal matrix whose columns form a basis for the column
space of X.
In practice, this step is typically achieved most efficiently by a call to a pack-
aged QR factorization; e.g., in Matlab, we would write [Q, ∼] = qr(X, 0). However,
all calls to orth in this manuscript can be implemented without pivoting, which
makes efficient implementation much easier.
2.4. The Moore-Penrose pseudoinverse The Moore-Penrose pseudoinverse is a
generalization of the concept of an inverse for a non-singular square matrix. To
define it, let A be a given m × n matrix. Let k denote its actual rank, so that its
singular value decomposition (SVD) takes the form
A = Σ_{j=1}^k σ_j u_j v_j∗ = U_k D_k V_k∗,
where U_k and V_k hold the first k left and right singular vectors of A, and D_k = diag(σ_1, . . . , σ_k). The Moore-Penrose pseudoinverse of A is then the n × m matrix
A† = V_k D_k^{−1} U_k∗.
3. A two-stage approach
The problem of computing an approximate low-rank factorization to a given
matrix can conveniently be split into two distinct “stages.” For concreteness, we
describe the split for the specific task of computing an approximate singular value
decomposition. To be precise, given an m × n matrix A and a target rank k, we
seek to compute factors U, D, and V such that
A ≈ U D V∗, with U of size m × k, D of size k × k, and V of size n × k.
The factors U and V should be orthonormal, and D should be diagonal. (For
now, we assume that the rank k is known in advance, techniques for relaxing this
assumption are described in Section 12.) Following [26], we split this task into
two computational stages:
Stage A — find an approximate range: Construct an m × k matrix Q with or-
thonormal columns such that A ≈ QQ∗ A. (In other words, the columns of
Q form an approximate basis for the column space of A.) This step will
be executed via a randomized process described in Section 4.
Stage B — form a specific factorization: Given the matrix Q computed in Stage
A, form the factors U, D, and V using classical deterministic techniques.
For instance, this stage can be executed via the following steps:
(1) Form the k × n matrix B = Q∗ A.
(2) Compute the SVD of the (small) matrix B so that B = ÛDV∗ .
(3) Form U = QÛ.
The point here is that in a situation where k min(m, n), the difficult part
of the computation is all in Stage A. Once that is finished, the post-processing in
Stage B is easy, as all matrices involved have at most k rows or columns.
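In Matlab, the two stages amount to a handful of lines. The sketch below mirrors, but is not identical to, Algorithm 4.0.1; here p is an over-sampling parameter of the kind discussed in Remark 4.0.2:

G = randn(size(A, 2), k + p);  % Stage A: Gaussian test matrix
[Q, ~] = qr(A * G, 0);         % orthonormal basis, so that A ~ Q*Q'*A
B = Q' * A;                    % Stage B: small (k+p) x n matrix
[Uhat, D, V] = svd(B, 'econ'); % deterministic post-processing
U = Q * Uhat;                  % A ~ U*D*V'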
Remark 3.0.1. Stage B is exact up to floating point arithmetic so all errors in the
factorization process are incurred at Stage A. To be precise, we have
Q Q∗A = Q B = Q (Û D V∗) = U D V∗.
In other words, if the factor Q satisfies ‖A − QQ∗A‖ ≤ ε, then automatically
(3.0.2) ‖A − UDV∗‖ = ‖A − QQ∗A‖ ≤ ε
unless ε is close to the machine precision.
Remark 3.0.3. A bound of the form (3.0.2) implies that the diagonal elements {D(i, i)}_{i=1}^k of D are accurate approximations to the singular values of A in the sense that |σ_i − D(i, i)| ≤ ε for i = 1, 2, . . . , k. However, a bound like (3.0.2) does
not provide assurances on the relative errors in the singular values; nor does it, in
the general case, provide strong assurances that the columns of U and V are good
approximations to the singular vectors of A.
Remark 4.0.2 (How many basis vectors?). The reader may have observed that
while our stated goal was to find a matrix Q that holds k orthonormal columns,
the randomized process discussed in this section and summarized in Algorithm
4.0.1 results in a matrix with k + p columns instead. The p extra vectors are
needed to ensure that the basis produced in “Stage A” accurately captures the k
dominant left singular vectors of A. In a situation where an approximate SVD
with precisely k modes is sought, one can drop the last p components when exe-
cuting Stage B. Using Matlab notation, we would after Step (5) run the commands
Uhat = Uhat(:,1:k); D = D(1:k,1:k); V = V(:,1:k);.
From a practical point of view, the cost of carrying around a few extra samples
in the intermediate steps is often entirely negligible.
The method described in Section 4 accesses the matrix A twice: first in “Stage A,” where we build an orthonormal basis for its column space, and then in “Stage B,” where we project A on to the space spanned by the computed basis vectors. It turns out to be possible to modify the algorithm in such a way that each entry of A is accessed only once. This is important because it allows us to compute the factorization of a matrix that is too large to be stored.
For Hermitian matrices, the modification to Algorithm 4.0.1 is very minor and
we describe it in Subsection 5.1. Subsection 5.2 then handles the case of a general
matrix.
5.1. Hermitian matrices Suppose that A = A∗ , and that our objective is to com-
pute an approximate eigenvalue decomposition
(5.1.1) A ≈ U D U∗, where U is of size n × k and D of size k × k,
with U an orthonormal matrix and D diagonal. (Note that for a Hermitian matrix,
the EVD and the SVD are essentially equivalent, and that the EVD is the more
natural factorization.) Then execute Stage A with an over-sampling parameter p
to compute an orthonormal matrix Q whose columns form an approximate basis
for the column space of A:
(1) Draw a Gaussian random matrix G of size n × (k + p).
(2) Form the sampling matrix Y = AG.
(3) Orthonormalize the columns of Y to form Q, in other words Q = orth(Y).
Then
(5.1.2) A ≈ QQ∗ A.
Since A is Hermitian, its row and column spaces are identical, so we also have
(5.1.3) A ≈ AQQ∗ .
Inserting (5.1.2) into (5.1.3), we (informally!) find that
(5.1.4) A ≈ QQ∗ AQQ∗ .
We define
(5.1.5) C = Q∗ AQ.
Then A ≈ Q C Q∗, and the matrix C can be determined from quantities we have already computed: combining C = Q∗AQ with Y = AG and the approximation A ≈ AQQ∗ shows that Q∗Y = Q∗AG ≈ (Q∗AQ)(Q∗G), so that C must satisfy
(5.1.8) C (Q∗G) = Q∗Y,
where all three matrices involved are of size ℓ × ℓ with ℓ = k + p. At first, it may appear that (5.1.8) is perfectly balanced in that there are ℓ² equations for ℓ² unknowns. However, we need to enforce that C is Hermitian, so the system is actually over-determined by roughly a factor of two. Putting everything together, we obtain the method summarized in Algorithm 5.1.9.
The procedure described in this section is less accurate than the procedure
described in Algorithm 4.0.1 for two reasons: (1) The approximation error in
formula (5.1.4) tends to be larger than the error in (5.1.2). (2) While the matrix
Q∗ G is invertible, it tends to be very ill-conditioned.
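A minimal Matlab sketch of the Hermitian single-pass idea follows; the plain solve and explicit symmetrization below are one simple way of treating the system (5.1.8), not necessarily the choice made in Algorithm 5.1.9:

G = randn(n, k + p);
Y = A * G;                 % the only access to A
[Q, ~] = qr(Y, 0);
C = (Q' * Y) / (Q' * G);   % solve C*(Q'*G) = Q'*Y
C = (C + C') / 2;          % enforce the Hermitian constraint
[Uhat, D] = eig(C);
U = Q * Uhat;              % A ~ U*D*U'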
For a general matrix, the analogous derivation leads to an equation of the form
(5.1.11) C (Q∗G) = Q∗Y,
where now C is of size k × k and the matrices Q∗G and Q∗Y are of size k × ℓ. Since (5.1.11) is over-determined, we solve it using a least-squares technique. Observe that we are now looking for less information (a k × k matrix rather than an ℓ × ℓ matrix), and have more information in order to determine it.
(6.0.1) Ω = D F S, with Ω of size n × ℓ, D and F of size n × n, and S of size n × ℓ,
where D is a diagonal matrix whose diagonal entries are complex numbers of
modulus one drawn from a uniform distribution on the unit circle in the complex
plane, where F is the discrete Fourier transform,
F(p, q) = n−1/2 e−2πi(p−1)(q−1)/n , p, q ∈ {1, 2, 3, . . . , n},
can be found in, e.g., [22, 51], while a detailed discussion of a related method for
low-rank approximation can be found in [11].
7.1. Bounds on the expectation of the error A basic result on the typical error
observed is Theorem 10.6 of [26], which states:
Theorem 7.1.1. Let A be an m × n matrix with singular values {σ_j}_{j=1}^{min(m,n)}. Let k be a target rank, and let p be an over-sampling parameter such that p ≥ 2 and such that k + p ≤ min(m, n). Let G be a Gaussian random matrix of size n × (k + p) and set Q = orth(AG). Then the average error, as measured in the Frobenius norm, satisfies
(7.1.2) E ‖A − QQ∗A‖_Fro ≤ ( 1 + k/(p − 1) )^{1/2} ( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2},
where E refers to expectation with respect to the draw of G. The corresponding result for the spectral norm reads
(7.1.3) E ‖A − QQ∗A‖ ≤ ( 1 + √( k/(p − 1) ) ) σ_{k+1} + ( e √(k + p) / p ) ( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2}.
When errors are measured in the Frobenius norm, Theorem 7.1.1 is very gratifying. For our standard recommendation of p = 10, we are basically within a factor of (1 + k/9)^{1/2} of the theoretically minimal error. (Recall that the Eckart-Young theorem states that ( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2} is a lower bound on the residual for any rank-k approximant.) If you over-sample more aggressively and set p = k + 1, then we are within a distance of √2 of the theoretically minimal error.
When errors are measured in the spectral norm, the situation is much less rosy.
The first term in the bound in (7.1.3) is perfectly acceptable, but the second term
is unfortunate in that it involves the minimal error in the Frobenius norm, which
can be much larger, especially when m or n are large. The theorem is quite sharp,
as it turns out, so the sub-optimality expressed in (7.1.3) reflects a true limitation
on the accuracy to be expected from the basic randomized scheme.
The extent to which the error in (7.1.3) is problematic depends on how rapidly
min(m,n)
the “tail” singular values {σj }j=1 decay. If they decay fast, then the spectral
norm error and the Frobenius norm error are similar, and the RSVD works well.
If they decay slowly, then the RSVD performs fine when errors are measured
in the Frobenius norm, but not very well when the spectral norm is the one of
interest. To illustrate the difference, let us consider two situations:
Case 1 — fast decay: Suppose that the tail singular values decay exponentially fast, so that for some β ∈ (0, 1), we have σ_j ≈ σ_{k+1} β^{j−k−1} for j > k. Then we get an estimate for the tail singular values of
( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2} ≈ σ_{k+1} ( Σ_{j=k+1}^{min(m,n)} β^{2(j−k−1)} )^{1/2} ≤ σ_{k+1} (1 − β²)^{−1/2}.
As long as β is not very close to 1, we see that the contribution from the tail
singular values is modest in this case.
Case 2 — no decay: Suppose that the tail singular values exhibit no decay, so that σ_j = σ_{k+1} for j > k. Now
( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2} = σ_{k+1} √( min(m, n) − k ).
This represents the worst case scenario and, since we want to allow for n and
m to be very large, represents devastating suboptimality.
Fortunately, it is possible to modify the RSVD in such a way that the errors
produced are close to optimal in both the spectral and the Frobenius norms. The
price to pay is a modest increase in the computational cost. See Section 8 and
[26, Sec. 4.5].
7.2. Bounds on the likelihood of large deviations One can prove that (perhaps
surprisingly) the likelihood of a large deviation from the mean depends only on
the over-sampling parameter p, and decays extraordinarily fast. For instance, one
can prove that if p ≥ 4, then
(7.2.1) ‖A − QQ∗A‖ ≤ ( 1 + 17 √(1 + k/p) ) σ_{k+1} + ( 8 √(k + p) / (p + 1) ) ( Σ_{j>k} σ_j² )^{1/2},
with failure probability at most 3e^{−p}, see [26, Cor. 10.9].
8. An accuracy enhanced randomized scheme
8.1. The key idea — power iteration The error bounds in Section 7 involve the factor ( Σ_{j>k} σ_j² )^{1/2}. When the singular values decay slowly, this quantity can be much larger than the theoretically minimal approximation error (which is σ_{k+1}).
Recall that the objective of the randomized sampling is to construct a set of
orthonormal vectors {q_j}_{j=1}^ℓ that capture to high accuracy the space spanned by the k dominant left singular vectors {u_j}_{j=1}^k of A. The idea is now to sample not A, but the matrix A^(q) defined by
A^(q) = (A A∗)^q A,
where q is a small positive integer (say, q = 1 or q = 2). A simple calculation shows that if A has the SVD A = UDV∗, then the SVD of A^(q) is
A^(q) = U D^{2q+1} V∗.
In other words, A^(q) has the same left singular vectors as A, while its singular values are {σ_j^{2q+1}}_j. Even when the singular values of A decay slowly, the singular values of A^(q) tend to decay fast enough for our purposes.
The accuracy enhanced scheme now consists of drawing a Gaussian matrix G and then forming a sample matrix
Y = (A A∗)^q A G.
Then orthonormalize the columns of Y to obtain Q = orth(Y), and proceed as
before. The resulting scheme is shown in Algorithm 8.1.1.
Remark 8.1.2. The scheme described in Algorithm 8.1.1 can lose accuracy due to round-off errors. The problem is that as q increases, all columns in the sample matrix Y = (AA∗)^q AG tend to align closer and closer to the dominant left singular vector. This means that essentially all information about the singular values and singular vectors associated with smaller singular values gets lost to round-off errors. Roughly speaking, if
σ_j/σ_1 < ε_mach^{1/(2q+1)},
where ε_mach is machine precision, then all information associated with the jth singular value and beyond is lost (see Section 3.2 of [34]). This problem can be fixed by orthonormalizing the columns between each iteration, as shown in
Algorithm 8.1.3. The modified scheme is more costly due to the extra calls to
orth. (However, note that orth can be executed using unpivoted Gram-Schmidt,
which is quite fast.)
Algorithm 8.1.3. This algorithm takes the same inputs and out-
puts as the method in Algorithm 8.1.1. The only difference is that
orthonormalization is carried out between each step of the power
iteration, to avoid loss of accuracy due to rounding errors.
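In Matlab, the reorthonormalized power scheme can be sketched as follows (our rendering of the idea, with orth as in Subsection 2.3):

G = randn(size(A, 2), k + p);
Q = orth(A * G);
for i = 1:q
  W = orth(A' * Q);   % half step with A', reorthonormalizing
  Q = orth(A * W);    % half step with A, reorthonormalizing
end
B = Q' * A;
[Uhat, D, V] = svd(B, 'econ');
U = Q * Uhat;         % A ~ U*D*V', built from the enhanced basis Q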
The bound in (8.2.2) is slightly opaque. To simplify it, let us consider a worst
case scenario in which there is no decay in the singular values beyond the trunca-
tion point, so that we have σk+1 = σk+2 = · · · = σmin(m,n) . Then (8.2.2) simplifies
to
E ‖A − QQ∗A‖ ≤ ( 1 + √( k/(p − 1) ) + ( e √(k + p) / p ) · √( min{m, n} − k ) )^{1/(2q+1)} σ_{k+1}.
In other words, as we increase the exponent q, the power scheme drives the factor
that multiplies σk+1 to one exponentially fast. This factor represents the degree
of “sub-optimality” you can expect to see.
8.3. Extended sampling matrix The scheme given in Subsection 8.1 is slightly
wasteful in that it does not directly use all the sampling vectors computed. To
further improve accuracy, one could, for a symmetric matrix A and a small posi-
tive integer q, form an “extended” sampling matrix
Y = [A G, A² G, . . . , A^q G].
Observe that this new sampling matrix Y has qℓ columns. Then proceed as before:
(8.3.1) Q = qr(Y), B = Q∗A, [Û, D, V] = svd(B, ’econ’), U = QÛ.
The computations in (8.3.1) can be quite expensive since the “tall thin” matrices being operated on now have qℓ columns, rather than the ℓ columns of the tall thin matrices in, e.g., Algorithm 8.1.1. This results in an increase in cost for all operations (QR factorization, matrix-matrix multiply, SVD) by a factor of O(q²).
Consequently, the scheme described here is primarily useful in situations in
which the computational cost is dominated by applications of A and A∗ , and we
want to maximally leverage all interactions with A. An early discussion of this
idea can be found in [44, Sec. 4.4], with a more detailed discussion in [42].
9. The Nyström method for positive symmetric definite matrices
For a psd matrix A, the basic scheme of Section 4 yields the approximation
(9.0.1) A ≈ Q (Q∗AQ) Q∗.
In contrast, the so called “Nyström scheme” relies on the rank-k approximation
(9.0.2) A ≈ (AQ) (Q∗AQ)^{−1} (AQ)∗.
For both stability and computational efficiency, we typically rewrite (9.0.2) as
A ≈ F F∗,
where F is an approximate Cholesky factor of A of size n × k, defined by
F = (AQ) (Q∗AQ)^{−1/2}.
To compute the factor F numerically, we may first form the matrices B_1 = AQ and B_2 = Q∗B_1. Observe that B_2 is necessarily psd, so that we can compute its Cholesky factorization B_2 = C∗C. Finally compute the factor F = B_1 C^{−1} by performing a triangular solve.
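A Matlab sketch of this post-processing step reads as follows; the explicit symmetrization is our own guard against round-off, and chol returns an upper triangular C with B2 = C'*C:

B1 = A * Q;
B2 = Q' * B1;
B2 = (B2 + B2') / 2;   % symmetrize against round-off
C = chol(B2);          % Cholesky factorization B2 = C'*C
F = B1 / C;            % triangular solve yields F = B1*inv(C)
% A ~ F*F', and an EVD of A can be read off from an SVD of F.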
10.2. Three flavors of ID: row, column, and double-sided ID Subsection 10.1
describes a factorization where we use a subset of the columns of A to span its
column space. Naturally, this factorization has a sibling which uses the rows of A
to span its row space. In other words A also admits the factorization
(10.2.1) A = X R, with X of size m × k and R of size k × n,
where R is a matrix consisting of k rows of A, and where X is a matrix that
contains the k × k identity matrix. We let Is denote the index vector of length k
that marks the “skeleton” rows so that R = A(Is , :).
Finally, there exists a so called double-sided ID which takes the form
(10.2.2) A = X A_s Z, with X of size m × k, A_s of size k × k, and Z of size k × n,
where X and Z are the same matrices as those that appear in (10.1.1) and (10.2.1),
and where As is the k × k submatrix of A given by
As = A(Is , Js ).
10.3. Deterministic techniques for computing the ID In this section we demon-
strate that there is a close connection between the column ID and the classical
column pivoted QR factorization (CPQR). The end result is that standard soft-
ware used to compute the CPQR can with some light post-processing be used to
compute the column ID.
As a starting point, recall that for a given m × n matrix A, with m n, the QR
factorization can be written as
(10.3.1) A P = Q S, with P of size n × n, Q of size m × n, and S of size n × n,
where P is a permutation matrix, where Q has orthonormal columns and where S is upper triangular. Since our objective here is to construct a rank-k approxi-
mation to A, we split off the leading k columns from Q and S to obtain partitions
(10.3.2) Q = [Q_1  Q_2]  and  S = [ S_11  S_12 ; 0  S_22 ],
where Q_1 holds the first k columns of Q, and S_11 is the leading k × k block of S.
Remark 10.3.8 (Conditioning). Equation (10.3.5) involves the quantity S_11^{−1} which
prompts the question of whether S11 is necessarily invertible, and what its condi-
tion number might be. It is easy to show that whenever the rank of A is at least
k, the CPQR algorithm is guaranteed to result in a matrix S11 that is non-singular.
(If the rank of A is j, where j < k, then the QR factorization process can detect
this and halt the factorization after j steps.)
Unfortunately, S11 is typically quite ill-conditioned. The saving grace is that
even though one should expect S11 to be poorly conditioned, it is often the case
that the linear system
(10.3.9) S11 T = S12
still has a solution T whose entries are of moderate size. Informally, one could say
that the directions where S11 and S12 are small “line up.” For standard column
pivoted QR, the system (10.3.9) will in practice be observed to almost always have
a solution T of small size [6], but counter-examples can be constructed. More
sophisticated pivot selection procedures have been proposed that are guaranteed
to result in matrices S11 and S12 such that (10.3.9) has a good solution; but these
are harder to code and take longer to execute [23].
Of course, the row ID can be computed via an entirely analogous process that
starts with a CPQR of the transpose of A. In other words, we execute a pivoted
Gram-Schmidt orthonormalization process on the rows of A.
Finally, to obtain the double-sided ID, we start with using the CPQR-based
process to build the column ID (10.1.1). Then compute the row ID by performing
Gram-Schmidt on the rows of the tall thin matrix C.
The three deterministic algorithms described for computing the three flavors
of ID are summarized in Algorithm 10.3.11.
would stop once the residual error ‖A − Q(:, 1 : k) S(1 : k, :)‖ = ‖S_22‖ ≤ ε. When
the QR factorization is interrupted after k steps, the output would still be a fac-
torization of the form (10.3.1), but in this case, S22 would not be upper triangular.
This is immaterial since S22 is never used. To further accelerate the computation,
one can advantageously use a randomized CPQR algorithm, cf. Sections 14 and 15
or [36, 38].
(10.4.1) A = Y F, with Y of size m × k and F of size k × n.
Once the factorization (10.4.1) is available, let us use the algorithm ID_row de-
scribed in Algorithm 10.3.11 to compute a row ID [Is , X] = ID_row(Y, k) of Y so
that
(10.4.2) Y = X Y(I_s, :), with X of size m × k and Y(I_s, :) of size k × k.
It then turns out that {I_s, X} is automatically (!) a row ID of A as well. To see this, simply note that
X A(I_s, :) = X Y(I_s, :) F   {use (10.4.1) restricted to the rows in I_s}
           = Y F              {use (10.4.2)}
           = A                {use (10.4.1)}.
The key insight here is very simple, but powerful, so let us spell it out explicitly: a row ID of a matrix Y whose columns span the column space of A is automatically a row ID of A itself.
As we have seen, the task of finding a matrix Y whose columns form a good
basis for the column space of a matrix is ideally suited to randomized sampling.
To be precise, we showed in Section 4 that given a matrix A, we can find a matrix
Y whose columns approximately span the column space of A via the formula
Y = AG, where G is a tall thin Gaussian random matrix. The scheme that results
from combining these two insights is summarized in Algorithm 10.4.3.
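A compact Matlab sketch of the resulting randomized row ID follows, for a given target rank k and over-sampling p (our rendering of Algorithm 10.4.3; the three-output economy qr returns the pivots as a permutation vector):

G = randn(size(A, 2), k + p);
Y = A * G;                           % tall thin sample matrix
[~, S, J] = qr(Y', 0);               % CPQR of Y', so that Y'(:, J) = Qy*S
Is = J(1:k);                         % skeleton row indices
T = S(1:k, 1:k) \ S(1:k, k+1:end);   % interpolation coefficients
X = zeros(size(A, 1), k);
X(J, :) = [eye(k); T'];              % now A ~ X * A(Is, :)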
The randomized algorithm for computing a row ID shown in Algorithm 10.4.3
has complexity O(mnk). We can reduce this complexity to O(mn log k) by using
a structured random matrix instead of a Gaussian, cf. Section 6. The result is
summarized in Algorithm 10.4.4.
(11.1.1) A ≈ C U R, with C of size m × k, U of size k × k, and R of size k × n,
where C contains a subset of the columns of A and R contains a subset of the rows
of A. Like the ID, the CUR decomposition offers the ability to preserve properties
like sparsity or non-negativity in the factors of the decomposition, the prospect
to reduce memory requirements, and excellent tools for data interpretation.
The CUR decomposition is often obtained in three steps [10, 41]: (1) Some
scheme is used to assign a weight or the so called leverage score (of importance)
[27] to each column and row in the matrix. This is typically done either using
the ℓ2 norms of the columns and rows or by using the leading singular vectors of
A [12, 50]. (2) The matrices C and R are constructed via a randomized sampling
Remark 11.1.3 (Conditioning of CUR). For matrices whose singular values ex-
perience substantial decay, the accuracy of the CUR factorization can deteriorate
due to effects of ill-conditioning. To simplify slightly, one would normally expect
the leading k singular values of C and R to be rough approximations to the lead-
ing k singular values of A, so that the condition numbers of C and R would be
roughly σ1 (A)/σk (A). Since low-rank factorizations are most useful when applied
to matrices whose singular values decay reasonably rapidly, we would typically
expect the ratio σ1 (A)/σk (A) to be large, which is to say that C and R would be
ill-conditioned. Hence, in the typical case, evaluation of the formula (11.1.2) can
be expected to result in substantial loss of accuracy due to accumulation of round-
off errors. Observe that the ID does not suffer from this problem; in (10.2.2), the
matrix A_s tends to be ill-conditioned, but it does not need to be inverted. (The
matrices X and Z are well-conditioned.)
The techniques described in this section are designed for dense matrices stored
in RAM. They directly update the matrix, and come with a firm guarantee that
the computed low rank approximation is within distance ε of the original matrix.
There are many situations where direct updating is not feasible and we can in
practice only interact with the matrix via the matrix-vector multiplication (e.g.,
very large matrices stored out-of-core, sparse matrices, matrices that are defined
implicitly). Section 13 describes algorithms designed for this environment that
use randomized sampling techniques to estimate the approximation error.
Recall that for the case where a computational tolerance is given (rather than a rank), the optimal solution is given by the SVD. Specifically, let {σ_j}_{j=1}^{min(m,n)} be the singular values of A, and let ε be a given tolerance. Then the minimal rank k for which there exists a matrix B of rank k that is within distance ε of A is the smallest integer k such that σ_{k+1} ≤ ε. The algorithms described here will determine a k that is not necessarily optimal, but is typically fairly close.
12.2. A greedy updating algorithm Let us start by describing a general algorith-
mic template for how to compute an approximate rank-k approximate factoriza-
tion of a matrix. To be precise, suppose that we are given an m × n matrix A,
and a computational tolerance ε. Our objective is then to determine an integer
k ∈ {1, 2, . . . , min(m, n)}, an m × k orthonormal matrix Qk , and a k × n matrix Bk
such that A − Qk Bk ε. Algorithm 12.2.1 outlines how one might in a greedy
fashion build the matrices Qk and Bk , adding one column to Qk and one row to
Bk at each step.
(1) Q_0 = [ ]; B_0 = [ ]; A_0 = A; k = 0;
(2) while ‖A_k‖ > ε
(3)   k = k + 1
(4)   Pick a vector y ∈ Ran(A_{k−1})
(5)   q = y/‖y‖;
(6)   b = q∗ A_{k−1};
(7)   Q_k = [Q_{k−1} q];
(8)   B_k = [B_{k−1}; b];
(9)   A_k = A_{k−1} − qb;
(10) end while
(1) Q = [ ]; B = [ ];
(2) while ‖A‖ > ε
(3)   Draw an n × b Gaussian random matrix G.
(4)   Compute the m × b matrix Q_new = qr(AG, 0).
(5)   B_new = Q∗_new A
(6)   Q = [Q Q_new]
(7)   B = [B; B_new]
(8)   A = A − Q_new B_new
(9) end while
12.4. Evaluating the norm of the residual The algorithms described in this sec-
tion contain one step that could be computationally expensive unless some care
is exercised. The potential problem concerns the evaluation of the norm of the re-
mainder matrix Ak (cf. Line (2) in Algorithm 12.2.1) at each step of the iteration.
When the Frobenius norm is used, this evaluation can be done very efficiently, as
follows: When the computation starts, evaluate the norm of the input matrix
a = ‖A‖_Fro.
Then observe that after step j completes, we have
A = Qj Bj + Aj .
Since the columns in the first term all lie in Col(Q_j), and the columns of the second term all lie in Col(Q_j)⊥, we now find that
‖A‖²_Fro = ‖Q_j B_j‖²_Fro + ‖A_j‖²_Fro = ‖B_j‖²_Fro + ‖A_j‖²_Fro,
where in the last step we used that ‖Q_j B_j‖_Fro = ‖B_j‖_Fro since Q_j is orthonormal. In other words, we can easily compute ‖A_j‖_Fro via the identity
‖A_j‖_Fro = ( a² − ‖B_j‖²_Fro )^{1/2}.
The idea here is related to “down-dating” schemes for computing column norms
when executing a column pivoted QR factorization, as described in, e.g., [48,
Chapter 5, Section 2.1].
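In code, the down-dating amounts to one line per iteration. A sketch: with a2 = norm(A, 'fro')^2 computed once before the loop, the residual norm after each step is available from the current factor B alone (the max guards against tiny negative values caused by round-off):

a2 = norm(A, 'fro')^2;                        % computed once, up front
% ... inside the loop, after B has been updated:
errF = sqrt(max(a2 - norm(B, 'fro')^2, 0));   % equals ||A - Q*B||_Fro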
When the spectral norm is used, one could use a power iteration to compute
an estimate of the norm of the matrix. Alternatively, one can use the randomized
procedure described in Section 13 which is faster, but less accurate.
(1) Q_0 = [ ]; B_0 = [ ];
(2) for j = 1, 2, 3, . . .
(3)   Draw a Gaussian random vector g_j ∈ R^n and set y_j = A g_j
(4)   Set z_j = y_j − Q_{j−1} Q∗_{j−1} y_j and then q_j = z_j/‖z_j‖.
(5)   Q_j = [Q_{j−1} q_j]
(6)   B_j = [B_{j−1}; q∗_j A]
(7) end for
Lemma 13.0.3. Let T be a real m × n matrix. Fix a positive integer r and a real number
α ∈ (0, 1). Draw an independent family {gi : i = 1, 2, . . . , r} of standard Gaussian
vectors. Then
‖T‖ ≤ (1/α) √(2/π) max_{i=1,...,r} ‖T gi‖
with probability at least 1 − α^r.
In applying this result, we set α = 1/10, whence it follows that if ‖zj‖ is smaller
than the resulting threshold for r vectors in a row, then ‖A − Qj Bj‖ ≤ ε with
probability at least 1 − 10^−r. The result is shown in Algorithm 13.0.4. Observe
that choosing α = 1/10 will work well only if the singular values decay reasonably
fast.
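As an illustration of how Lemma 13.0.3 drives a stopping rule, the following
sketch grows the basis one vector at a time and halts once r consecutive samples
fall below the threshold ε α √(π/2); the structure follows Algorithm 13.0.4, but
the function name and the choice to skip (rather than append) small sample
vectors are our own:

import numpy as np

def adaptive_rangefinder(A, eps, r=10, alpha=0.1, seed=None):
    """Grow Q, B one vector at a time; stop once r consecutive samples
    certify ||A - Q @ B|| <= eps via Lemma 13.0.3."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Q = np.empty((m, 0))
    B = np.empty((0, n))
    # If r consecutive samples satisfy ||z|| <= eps * alpha * sqrt(pi/2),
    # then ||A - QB|| <= eps with probability at least 1 - alpha^r.
    thresh = eps * alpha * np.sqrt(np.pi / 2.0)
    small_in_a_row = 0
    while small_in_a_row < r and Q.shape[1] < min(m, n):
        g = rng.standard_normal(n)
        y = A @ g
        z = y - Q @ (Q.T @ y)            # z = (A - QB) g
        if np.linalg.norm(z) < thresh:
            small_in_a_row += 1          # small sample: count it, do not grow
            continue
        small_in_a_row = 0
        q = z / np.linalg.norm(z)
        Q = np.hstack([Q, q[:, None]])
        B = np.vstack([B, (q @ A)[None, :]])
    return Q, B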
Remark 13.0.5. While the proof of Lemma 13.0.3 is outside the scope of these
lectures, it is perhaps instructive to prove a much simpler related result that says
that if T is any m × n matrix, and g ∈ Rn is a standard Gaussian vector, then
(13.0.6) E[‖Tg‖²] = ‖T‖²Fro.
To prove (13.0.6), let T have the singular value decomposition T = UDV∗, and set
g̃ = V∗g. Then
‖Tg‖² = ‖UDV∗g‖² = ‖UDg̃‖² = {U is unitary} = ‖Dg̃‖² = ∑_{j=1}^n σj² g̃j².
Then observe that since the distribution of Gaussian vectors is rotationally invariant,
the vector g̃ = V∗g is also a standardized Gaussian vector, and so E[g̃j²] = 1.
Since the variables {g̃j}_{j=1}^n are independent, it follows that
E[‖Tg‖²] = E[ ∑_{j=1}^n σj² g̃j² ] = ∑_{j=1}^n σj² E[g̃j²] = ∑_{j=1}^n σj² = ‖T‖²Fro.
2 The matrix Qi is in fact symmetric, so Qi = Q∗i, but we keep the transpose symbol in the formula
for consistency with the remainder of the section.
Once the process has completed, the matrices Q and P in (14.1.1) are given by
Q = Qn−1 Qn−2 · · · Q1 , and P = Pn−1 Pn−2 · · · P1 .
For details, see, e.g., [20, Sec. 5.2].
The Householder QR factorization process is a celebrated algorithm that is
exceptionally stable and accurate. However, it has a serious weakness in that
it executes rather slowly on modern hardware, in particular on systems involv-
ing many cores, when the matrix is stored in distributed memory or on a hard
drive, etc. The problem is that it inherently consists of a sequence of n − 1 rank-1
updates (so-called BLAS2 operations), which makes the process very communica-
tion intensive. In principle, the resolution to this problem is to block the process,
as shown in Figure 14.1.3.
Let b denote a block size; then in a blocked Householder QR algorithm, we
would find groups of b pivot vectors that are moved into the active b slots at
once, then b Householder reflectors would be determined by processing the b
pivot columns, and then the remainder of the matrix would be updated jointly.
Such a blocked algorithm would expend most of its flops on matrix-matrix mul-
tiplications (so-called BLAS3 operations), which execute very rapidly on a broad
range of computing hardware. Many techniques for blocking Householder QR
have been proposed over the years, including, e.g., [3, 4].
It was recently observed [36] that randomized sampling is ideally suited for
resolving the long-standing problem of how to find groups of pivot vectors. The
key observation is that a measure of quality for a group of b pivot vectors is its
spanning volume in Rm. This turns out to be closely related to how good a basis
these vectors form for the column space of the matrix [21, 23, 37]. As we saw in
Section 10, this task is particularly well suited to randomized sampling. Precisely,
consider the task of identifying a group of b good pivot vectors in the first step of
the blocked QR process shown in Figure 14.1.3. Using the procedures described
in Subsection 10.4, we proceed as follows: Fix an over-sampling parameter p, say
p = 10. Then draw a Gaussian random matrix G of size (b + p) × m, and form
the sampling matrix Y = GA. Then simply perform column pivoted QR on the
columns of Y. To summarize, we determine P1 as follows:
G = randn(b + p, m),
Y = GA,
[∼, ∼, P1 ] = qr(Y, 0).
Observe that the QR factorization of Y is affordable since Y is small, and fits in
fast memory close to the processor. For the remaining steps, we simply apply
the same idea to find the best spanning columns for the lower right block in Ai
that has not yet been driven to upper triangular form. The resulting algorithm is
called Householder QR with Randomization for Pivoting (HQRRP); it is described in
detail in [36], and is available at https://github.com/flame/hqrrp/. (The method
described in this section was first published in [33], but is closely related to the
independently discovered results in [13].)
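For concreteness, the pivot-group selection just described can be sketched in a
few lines of Python (this is an illustration under the sampling scheme above, not
the HQRRP library code; SciPy's column-pivoted QR stands in for the MATLAB
call qr(Y, 0), and the function name is our own):

import numpy as np
from scipy.linalg import qr

def pick_pivot_block(A, b, p=10, seed=None):
    """Return the indices of b pivot columns chosen from a small sketch."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((b + p, A.shape[0]))
    Y = G @ A                       # (b+p) x n sketch; cheap to pivot
    _, _, piv = qr(Y, mode='economic', pivoting=True)
    return piv[:b]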
To maximize performance, it turns out to be possible to “downdate” the sam-
pling matrix from one step of the factorization to the next, in a manner simi-
lar to how downdating of the pivot weights is done in classical Householder
QR [48, Ch. 5, Sec. 2.1]. This obviates the need to draw a new random matrix at
each step [13, 36], and reduces the leading term in the asymptotic flop count of
HQRRP to 2mn² − (4/3)n³, which is identical to classical Householder QR.
15.1. The UTV decomposition Given an m × n matrix A, a UTV decomposition
takes the form
(15.1.1) A = U T V∗,
where U is an m × m unitary matrix, V is an n × n unitary matrix, and T is an
m × n triangular matrix (either lower or upper triangular). The UTV decomposition
can be viewed as a generalization of
other standard factorizations such as, e.g., the Singular Value Decomposition (SVD)
or the Column Pivoted QR decomposition (CPQR). (To be precise, the SVD is the
special case where T is diagonal, and the CPQR is the special case where V is
a permutation matrix.) The additional flexibility inherent in the UTV decompo-
sition enables the design of efficient updating procedures, see [48, Ch. 5, Sec. 4]
and [16].
15.2. An overview of randUTV The algorithm randUTV follows the same general
pattern as HQRRP, as illustrated in Figure 15.2.1. Like HQRRP, it drives a given matrix
A = A0 to upper triangular form via a sequence of steps
Ai = U∗i Ai−1 Vi ,
where each Ui and Vi is a unitary matrix. As in Section 14, we let b denote a block
size, and let p = n/b denote the number of steps taken. The key difference from
HQRRP is in how the transformation matrices are chosen. In the first step, randomized
sampling is used to build unitary matrices Ũ and Ṽ such that the transformed
matrix Ã = Ũ∗ A Ṽ has a particular block structure: the top left b × b block Ã11 is
upper triangular, and the bottom left block is zero. One can also show that all
entries of Ã12 are typically small in magnitude.
Next, we compute a full SVD of the block Ã11 = Û D11 V̂∗. This step is affordable
since Ã11 is of size b × b, where b is small. Then we form the transformation
matrices
U1 = Ũ [Û 0; 0 Im−b], and V1 = Ṽ [V̂ 0; 0 In−b],
and set
A1 = U∗1 A V1.
One can demonstrate that the diagonal entries of D11 typically form accurate
approximations to the first b singular values of A, and that
‖A1,22‖ ≈ inf{‖A − B‖ : B has rank b}.
Once the first b columns and rows of A have been processed as described in
this section, randUTV then applies the same procedure to the remaining block
A1,22 , of size (m − b) × (n − b), and then continues in the same fashion to pro-
cess all remaining blocks, as outlined in Subsection 15.2. The full algorithm is
summarized in Algorithm 15.3.1.
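The following is a simplified single-step sketch of the transformation described
above (a single sampling pass, no power iteration, and the helper name
randutv_step is our own; the full algorithm with all refinements is in [35]). It
assumes b < min(m, n):

import numpy as np
from scipy.linalg import block_diag

def randutv_step(A, b, seed=None):
    """One block step: unitary U1, V1 such that U1.T @ A @ V1 has an
    approximately diagonal b x b top-left block and zeros below it."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    G = rng.standard_normal((m, b))
    Y = A.T @ G                                 # sketch of the row space of A
    Vt, _ = np.linalg.qr(Y, mode='complete')    # n x n; leading b columns span Y
    Ut, _ = np.linalg.qr(A @ Vt[:, :b], mode='complete')  # m x m
    T = Ut.T @ A @ Vt                           # top-left b x b block triangular
    Uh, D, Vh = np.linalg.svd(T[:b, :b])        # polish the corner, as in the text
    U1 = Ut @ block_diag(Uh, np.eye(m - b))
    V1 = Vt @ block_diag(Vh.T, np.eye(n - b))
    return U1, V1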
For more information about the UTV factorization, including careful numerical
experiments that illustrate how it compares in terms of speed and accuracy to
competitors such as column pivoted QR and the traditional SVD, see [35]. Codes
are available for download from https://github.com/flame/randutv.
References
[1] N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform,
Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, 2006, pp. 557–
563. ←200
[2] N. Ailon and E. Liberty, An almost optimal unrestricted fast Johnson-Lindenstrauss transform, ACM
Transactions on Algorithms (TALG) 9 (2013), no. 3, 21. ←200
[3] C. H. Bischof and G. Quintana-Ortí, Algorithm 782: Codes for rank-revealing QR factorizations of
dense matrices, ACM Transactions on Mathematical Software (TOMS) 24 (1998), no. 2, 254–257.
←222
[4] C. H. Bischof and G. Quintana-Ortí, Computing rank-revealing QR factorizations of dense matrices,
ACM Transactions on Mathematical Software (TOMS) 24 (1998), no. 2, 226–253. ←222
[5] C. Boutsidis, M. W. Mahoney, and P. Drineas, An improved approximation algorithm for the column
subset selection problem, Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete
Algorithms, 2009, pp. 968–977. ←207
[6] H. Cheng, Z. Gimbutas, P.-G. Martinsson, and V. Rokhlin, On the compression of low rank matrices,
SIAM Journal of Scientific Computing 26 (2005), no. 4, 1389–1404. ←207, 209
[7] P. Drineas, R. Kannan, and M. W. Mahoney, Fast Monte Carlo algorithms for matrices. II. Comput-
ing a low-rank approximation to a matrix, SIAM J. Comput. 36 (2006), no. 1, 158–183 (electronic).
MR2231644 (2008a:68243) ←191
[8] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff, Fast approximation of matrix
coherence and statistical leverage, Journal of Machine Learning Research 13 (2012), no. Dec, 3475–
3506. ←207
[9] P. Drineas and M. W. Mahoney, On the Nyström method for approximating a Gram matrix for improved
kernel-based learning, J. Mach. Learn. Res. 6 (2005), 2153–2175. ←205
[10] P. Drineas and M. W. Mahoney, CUR matrix decompositions for improved data analysis, Proceedings
of the National Academy of Sciences 106 (2009), no. 3, 697–702. ←207, 212
[11] P. Drineas and M. W. Mahoney, Lectures on randomized numerical linear algebra, The mathematics
of data, 2018. ←191, 201
[12] P. Drineas, M. W. Mahoney, and S. Muthukrishnan, Relative-error CUR matrix decompositions,
SIAM J. Matrix Anal. Appl. 30 (2008), no. 2, 844–881. MR2443975 (2009k:68269) ←212
[13] J. Duersch and M. Gu, True BLAS-3 performance QRCP using random sampling, 2015. ←223
[14] C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika 1
(1936), no. 3, 211–218. ←192
[15] Facebook, Inc., Fast randomized SVD, 2016. ←190
[16] R. D. Fierro, P. C. Hansen, and P. S. K. Hansen, UTV Tools: Matlab templates for rank-revealing UTV
decompositions, Numerical Algorithms 20 (1999), no. 2-3, 165–194. ←224
[17] A. Frieze, R. Kannan, and S. Vempala, Fast Monte Carlo algorithms for finding low-rank approxima-
tions, J. ACM 51 (2004), no. 6, 1025–1041. (electronic). MR2145262 (2005m:65006) ←190, 191
[18] A. Gittens, A. Devarakonda, E. Racah, M. Ringenburg, L. Gerhardt, J. Kottalam, J. Liu, K.
Maschhoff, S. Canon, J. Chhugani, P. Sharma, J. Yang, J. Demmel, J. Harrell, V. Krishnamurthy,
M. W. Mahoney, and Prabhat, Matrix factorizations at scale: A comparison of scientific data analytics
in spark and C+MPI using three case studies, 2016 IEEE International Conference on Big Data, 2016,
pp. 204–213. ←190
[19] A. Gittens and M. W. Mahoney, Revisiting the Nyström method for improved large-scale machine learn-
ing, J. Mach. Learn. Res. 17 (January 2016), no. 1, 3977–4041. ←206
[20] G. H. Golub and C. F. Van Loan, Matrix computations, Third, Johns Hopkins Studies in the Math-
ematical Sciences, Johns Hopkins University Press, Baltimore, MD, 1996. ←191, 222
[21] S. A. Goreinov, N. L. Zamarashkin, and E. E. Tyrtyshnikov, Pseudo-skeleton approximations by
matrices of maximal volume, Mathematical Notes 62 (1997). ←222
[22] M. Gu, Subspace iteration randomization and singular value problems, SIAM Journal on Scientific
Computing 37 (2015), no. 3, A1139–A1173. ←201
[23] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization,
SIAM J. Sci. Comput. 17 (1996), no. 4, 848–869. MR97h:65053 ←207, 209, 222
[24] N. Halko, Randomized methods for computing low-rank approximations of matrices, Ph.D. Thesis, 2012.
←190
[25] N. Halko, P.-G. Martinsson, Y. Shkolnisky, and M. Tygert, An algorithm for the principal component
analysis of large data sets, SIAM Journal on Scientific Computing 33 (2011), no. 5, 2580–2594. ←190
[26] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algo-
rithms for constructing approximate matrix decompositions, SIAM Review 53 (2011), no. 2, 217–288.
←190, 193, 200, 201, 202, 204, 207, 219, 220
[27] D. C. Hoaglin and R. E. Welsch, The Hat matrix in regression and ANOVA, The American Statistician
32 (1978), no. 1, 17–22. ←212
[28] W. Kahan, Numerical linear algebra, Canadian Math. Bull 9 (1966), no. 6, 757–801. ←216
[29] D. M. Kane and J. Nelson, Sparser Johnson-Lindenstrauss transforms, J. ACM 61 (January 2014), no. 1,
4:1–4:23. ←200
[30] E. Liberty, Accelerated dense random projections, Ph.D. Thesis, 2009. ←200
[31] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert, Randomized algorithms for the
low-rank approximation of matrices, Proc. Natl. Acad. Sci. USA 104 (2007), no. 51, 20167–20172.
←190
[32] M. W. Mahoney, Randomized algorithms for matrices and data, Foundations and Trends® in Machine
Learning 3 (2011), no. 2, 123–224. ←191
[33] P.-G. Martinsson, Blocked rank-revealing QR factorizations: how randomized sampling can be used to
avoid single-vector pivoting, 2015. ←221, 223
[34] P.-G. Martinsson, Randomized methods for matrix computations and analysis of high dimensional data
(2016), available at arXiv:1607.01649. ←203, 207
[35] P.-G. Martinsson, G. Quintana Orti, and N. Heavner, randUTV: A blocked randomized algorithm for
computing a rank-revealing UTV factorization (2017), available at arXiv:1703.00998. ←190, 227
[36] P.-G. Martinsson, G. Quintana-Ortí, N. Heavner, and R. van de Geijn, Householder QR factorization
with randomization for column pivoting (HQRRP), SIAM Journal on Scientific Computing 39 (2017),
no. 2, C96–C115. ←190, 210, 221, 222, 223
[37] P.-G. Martinsson, V. Rokhlin, and M. Tygert, On interpolation and integration in finite-dimensional
spaces of bounded functions, Comm. Appl. Math. Comput. Sci (2006), 133–142. ←222
[38] P.-G. Martinsson and S. Voronin, A randomized blocked algorithm for efficiently computing rank-
revealing factorizations of matrices, 2015. To appear in SIAM Journal on Scientific Computation,
arXiv:1503.07157. ←190, 210
[39] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the approximation of matri-
ces, Technical Report Yale CS research report YALEU/DCS/RR-1361, Yale University, Computer
Science Department, 2006. ←189, 190
[40] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the decomposition of matrices,
Appl. Comput. Harmon. Anal. 30 (2011), no. 1, 47–68. MR2737933 (2011i:65066) ←190
[41] N. Mitrovic, M. T. Asif, U. Rasheed, J. Dauwels, and P. Jaillet, CUR decomposition for compression
and compressed sensing of large-scale traffic data, 16th International IEEE Conference on Intelligent
Transportation Systems (ITSC), 2013, pp. 1475–1480. ←212
[42] C. Musco and C. Musco, Randomized block Krylov methods for stronger and faster approximate singu-
lar value decomposition, Proceedings of the 28th International Conference on Neural Information
Processing Systems, 2015, pp. 1396–1404. ←205
[43] F. Pourkamali-Anaraki and S. Becker, Randomized clustered Nyström for large-scale kernel machines,
2016. arXiv:1612.06470. ←206
[44] V. Rokhlin, A. Szlam, and M. Tygert, A randomized algorithm for principal component analysis, SIAM
Journal on Matrix Analysis and Applications 31 (2009), no. 3, 1100–1124. ←205
[45] T. Sarlos, Improved approximation algorithms for large matrices via random projections, 2006 47th An-
nual IEEE Symposium on Foundations of Computer Science (FOCS'06), 2006, pp. 143–152. ←190, 191,
200
[46] D. C. Sorensen and M. Embree, A DEIM induced CUR factorization, SIAM Journal on Scientific Com-
puting 38 (2016), no. 3, A1454–A1482. ←213
[47] G. W. Stewart, On the early history of the singular value decomposition, SIAM Rev. 35 (1993), no. 4,
551–566. MR1247916 (94f:15001) ←194
[48] G. W. Stewart, Matrix algorithms volume 1: Basic decompositions, SIAM, 1998. ←218, 223, 224
[49] S. Voronin and P.-G. Martinsson, A CUR factorization algorithm based on the interpolative decomposi-
tion (2014), available at arXiv:1412.8447. ←190, 207, 213
[50] S. Wang and Z. Zhang, Improving CUR matrix decomposition and the Nyström approximation via
adaptive sampling, J. Mach. Learn. Res. 14 (2013), 2729–2769. MR3121656 ←207, 212
[51] R. Witten and E. Candes, Randomized algorithms for low-rank matrix factorizations: sharp performance
bounds, Algorithmica 72 (2015), no. 1, 264–281. ←201
[52] D. P. Woodruff, Sketching as a tool for numerical linear algebra, Foundations and Trends in Theoreti-
cal Computer Science 10 (2014), no. 1–2, 1–157. ←191
[53] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, A fast randomized algorithm for the approximation
of matrices, Applied and Computational Harmonic Analysis 25 (2008), no. 3, 335–366. ←200, 219
Mathematical Institute, Andrew Wiles Building, University of Oxford, Oxford, OX2 6GG, United
Kingdom
Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 231–271
https://doi.org/10.1090/pcms/025/00833
Four Lectures on Probabilistic Methods for Data Science
Roman Vershynin
Abstract. Methods of high-dimensional probability play a central role in ap-
plications to statistics, signal processing, theoretical computer science and re-
lated fields. These lectures present a sample of particularly useful tools of high-
dimensional probability, focusing on the classical and matrix Bernstein’s inequal-
ity and the uniform matrix deviation inequality. We illustrate these tools with
applications to dimension reduction, network analysis, covariance estimation,
matrix completion and sparse signal recovery. The lectures are geared towards
beginning graduate students who have taken a rigorous course in probability but
may not have any experience in data science applications.
Contents
1 Lecture 1: Concentration of sums of independent random variables 232
1.1 Sub-gaussian distributions 233
1.2 Hoeffding’s inequality 234
1.3 Sub-exponential distributions 234
1.4 Bernstein’s inequality 235
1.5 Sub-gaussian random vectors 236
1.6 Johnson-Lindenstrauss Lemma 237
1.7 Notes 239
2 Lecture 2: Concentration of sums of independent random matrices 239
2.1 Matrix calculus 239
2.2 Matrix Bernstein’s inequality 241
2.3 Community recovery in networks 244
2.4 Notes 248
3 Lecture 3: Covariance estimation and matrix completion 249
3.1 Covariance estimation 249
3.2 Norms of random matrices 252
3.3 Matrix completion 255
3.4 Notes 258
4 Lecture 4: Matrix deviation inequality 259
4.1 Gaussian width 260
4.2 Matrix deviation inequality 261
4.3 Deriving Johnson-Lindenstrauss Lemma 262
Partially supported by NSF Grant DMS 1265782 and U.S. Air Force Grant FA9550-14-1-0009.
1.2. Hoeffding’s inequality You may remember from a basic course in probabil-
ity that the normal distribution N(μ, σ) has a remarkable property: the sum of
independent normal random variables is also normal. Here is a version of this
property for sub-gaussian distributions.
Proposition 1.2.1 (Sums of sub-gaussians). Let X1, . . . , XN be independent, mean
zero, sub-gaussian random variables. Then ∑_{i=1}^N Xi is sub-gaussian, and
‖ ∑_{i=1}^N Xi ‖²_{ψ2} ≤ C ∑_{i=1}^N ‖Xi‖²_{ψ2},
where C is an absolute constant.
Proof. To bound the moment generating function of the sum, we use independence:
E exp( λ ∑_{i=1}^N Xi ) = ∏_{i=1}^N E exp(λXi)   (using independence)
≤ ∏_{i=1}^N exp( Cλ² ‖Xi‖²_{ψ2} )   (by the last property in Proposition 1.1.1)
= exp(λ²K²),   where K² := C ∑_{i=1}^N ‖Xi‖²_{ψ2}.
Using again the last property in Proposition 1.1.1, we conclude that the sum
S = ∑_{i=1}^N Xi is sub-gaussian, and ‖S‖_{ψ2} ≤ C1 K where C1 is an absolute constant.
The proof is complete.
Let us rewrite Proposition 1.2.1 in a form that is often more useful in appli-
cations, namely as a concentration inequality. To do this, we simply use the first
property in Proposition 1.1.1 for the sum ∑_{i=1}^N Xi. We immediately get the fol-
lowing.
Theorem 1.2.2 (General Hoeffding’s inequality). Let X1, . . . , XN be independent,
mean zero, sub-gaussian random variables. Then, for every t ≥ 0 we have
P( | ∑_{i=1}^N Xi | ≥ t ) ≤ 2 exp( − ct² / ∑_{i=1}^N ‖Xi‖²_{ψ2} ).
Hoeffding’s inequality controls how far and with what probability a sum of
independent random variables can deviate from its mean, which is zero.
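As a quick numerical illustration (ours, not from the text), one can compare the
empirical tail of a normalized sum of N independent ±1 signs with the classical
Hoeffding bound 2 exp(−t²/2), which is the form the inequality takes for this
example; all parameter values below are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
N, t, trials = 1000, 3.0, 200000
# A sum of N independent +/-1 signs, generated via a binomial count,
# normalized by sqrt(N) so the bound reads 2 exp(-t^2 / 2).
S = (2.0 * rng.binomial(N, 0.5, size=trials) - N) / np.sqrt(N)
print("empirical tail: ", np.mean(np.abs(S) >= t))
print("Hoeffding bound:", 2 * np.exp(-t ** 2 / 2))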
1.3. Sub-exponential distributions Sub-gaussian distributions constitute a suf-
ficiently wide class of distributions. Many results in probability and data science
are proved nowadays for sub-gaussian random variables. Still, as we noted, there
are some natural random variables that are not sub-gaussian. For example, the
3 In the future, we will always denote positive absolute constants by C, c, C1, etc. These numbers do
not depend on anything. In most cases, one can get good bounds on these constants from the proof,
but the optimal constants for each result are rarely known.
Just like we did for sub-gaussian distributions, we call the best K3 the sub-
exponential norm of X and denote it by ‖X‖_{ψ1}, that is
‖X‖_{ψ1} := inf {t > 0 : E exp(|X|/t) ≤ 2}.
All sub-exponential random variables are squares of sub-gaussian random vari-
ables. Indeed, inspecting the definitions you will quickly see that
(1.3.2) ‖X²‖_{ψ1} = ‖X‖²_{ψ2}.
(Check!)
1.4. Bernstein’s inequality A version of Hoeffding’s inequality known as Bern-
stein’s inequality holds for sub-exponential random variables. You may naturally
expect to see a sub-exponential tail bound in this result. So it may come as a sur-
prise that Bernstein’s inequality actually has a mixture of two tails – sub-gaussian
and sub-exponential. Let us state and prove the inequality first, and then we will
comment on the mixture of the two tails.
Proof. For simplicity, we will assume that K = 1 and only prove the one-sided
bound (without absolute value); the general case is not much harder. Our ap-
proach will be based on bounding the moment generating function of the sum
S := ∑_{i=1}^N Xi. To see how the MGF can be helpful here, choose λ ≥ 0 and use
Markov’s inequality to get
(1.4.2) P( S ≥ t ) = P( exp(λS) ≥ exp(λt) ) ≤ e^{−λt} E exp(λS).
Recall that S = ∑_{i=1}^N Xi and use independence to express the right side of (1.4.2)
as
e^{−λt} ∏_{i=1}^N E exp(λXi).
(Check!) It remains to bound the MGF of each term Xi, and this is a much simpler
task. If we choose λ small enough so that
(1.4.3) 0 < λ ≤ c / max_i ‖Xi‖_{ψ1},
then we can use the last property in Proposition 1.3.1 to get
does. So we can choose λ that minimizes the right side subject to the constraint
(1.4.3). When this is done carefully, we obtain the tail bound stated in Bernstein’s
inequality. (Do this!)
Now, why does Bernstein’s inequality have a mixture of two tails? The sub-
exponential tail should of course be there. Indeed, even if the entire sum consisted
of a single term Xi, the best bound we could hope for would be of the form
exp(−ct/‖Xi‖_{ψ1}). The sub-gaussian term could be explained by the central limit
theorem, which states that the sum should become approximately normal as the
number of terms N increases to infinity.
Remark 1.4.4 (Bernstein’s inequality for bounded random variables). Suppose the
random variables Xi are uniformly bounded, which is a stronger assumption than
being sub-gaussian. Then there is a useful version of Bernstein’s inequality, which
unlike Theorem 1.4.1 is sensitive to the variances of the Xi’s. It states that if K > 0 is
such that |Xi| ≤ K almost surely for all i, then, for every t ≥ 0, we have
(1.4.5) P( | ∑_{i=1}^N Xi | ≥ t ) ≤ 2 exp( − (t²/2) / (σ² + CKt) ).
Here σ² = ∑_{i=1}^N E Xi² is the variance of the sum. This version of Bernstein’s
inequality can be proved in essentially the same way as Theorem 1.4.1. We will
not do it here, but a stronger Theorem 2.2.1, which is valid for matrix-valued
random variables Xi, will be proved in Lecture 2.
To compare this with Theorem 1.4.1, note that σ² + CKt ≤ 2 max(σ², CKt). So
we can state the probability bound (1.4.5) as
2 exp( − c min( t²/σ², t/K ) ).
Just as before, here we also have a mixture of two tails, sub-gaussian and sub-
exponential. The sub-gaussian tail is a bit sharper than in Theorem 1.4.1, since
it depends on the variances rather than sub-gaussian norms of Xi . The sub-
exponential tail, on the other hand, is weaker, since it depends on the sup-norms
rather than the sub-exponential norms of Xi .
random projection”5
P := (1/√m) A.
Assume that
m ≥ Cε⁻² log N,
where C is an appropriately large constant that depends only on the sub-gaussian norms of
the vectors Xi. Then, with high probability (say, 0.99), the map P preserves the distances
between all points in X with error ε, that is
(1.6.2) (1 − ε)‖x − y‖₂ ≤ ‖Px − Py‖₂ ≤ (1 + ε)‖x − y‖₂ for all x, y ∈ X.
Proof. Let us take a closer look at the desired conclusion (1.6.2). By linearity,
Px − Py = P(x − y). So, dividing the inequality by ‖x − y‖₂, we can rewrite (1.6.2)
in the following way:
(1.6.4) 1 − ε ≤ ‖Pz‖₂ ≤ 1 + ε for all z ∈ T,
where
T := { (x − y)/‖x − y‖₂ : x, y ∈ X distinct points }.
It will be convenient to square the inequality (1.6.4). Using that 1 + ε ≤ (1 + ε)²
and 1 − ε ≥ (1 − ε)², we see that it is enough to show that
(1.6.5) 1 − ε ≤ ‖Pz‖₂² ≤ 1 + ε for all z ∈ T.
By construction, the coordinates of the vector Pz = (1/√m) Az are (1/√m) ⟨Xi, z⟩. Thus
we can restate (1.6.5) as
(1.6.6) | (1/m) ∑_{i=1}^m ⟨Xi, z⟩² − 1 | ≤ ε for all z ∈ T.
Results like (1.6.6) are often proved by combining concentration and a union
bound. In order to use concentration, we first fix z ∈ T. By assumption, the
random variables ⟨Xi, z⟩² − 1 are independent; they have zero mean (use isotropy
5 Strictly speaking, this P is not a projection since it maps Rn to a different space Rm.
to check this!), and they are sub-exponential (use (1.3.2) to check this). Then
Bernstein’s inequality (Theorem 1.4.1) gives
P( | (1/m) ∑_{i=1}^m ⟨Xi, z⟩² − 1 | > ε ) ≤ 2 exp(−cε²m).
(Check!)
Finally, we can unfix z by taking a union bound over all possible z ∈ T:
P( max_{z∈T} | (1/m) ∑_{i=1}^m ⟨Xi, z⟩² − 1 | > ε ) ≤ ∑_{z∈T} P( | (1/m) ∑_{i=1}^m ⟨Xi, z⟩² − 1 | > ε )
(1.6.7) ≤ |T| · 2 exp(−cε²m).
By definition of T, we have |T| ≤ N². So, if we choose m ≥ Cε⁻² log N with an
appropriately large constant C, we can make (1.6.7) bounded by 0.01. The proof
is complete.
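A short simulation makes the statement of the theorem tangible (this is our
illustration; the parameters n, N, ε and the constant C = 4 are arbitrary choices):
with m ∼ ε⁻² log N, all pairwise distance ratios stay within [1 − ε, 1 + ε].

import numpy as np

rng = np.random.default_rng(1)
n, N, eps = 1000, 50, 0.2
m = int(4 * eps ** -2 * np.log(N))        # m ~ C eps^-2 log N, with C = 4
X = rng.standard_normal((N, n))           # N points in R^n
P = rng.standard_normal((m, n)) / np.sqrt(m)
Y = X @ P.T                               # projected points in R^m
i, j = np.triu_indices(N, k=1)            # all distinct pairs
ratios = (np.linalg.norm(Y[i] - Y[j], axis=1)
          / np.linalg.norm(X[i] - X[j], axis=1))
print(ratios.min(), ratios.max())         # should lie in [1 - eps, 1 + eps]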
1.7. Notes The material presented in Sections 1.1–1.5 is basic and can be found
e.g. in [58] and [60] with all the proofs. Bernstein’s and Hoeffding’s inequalities
that we covered here are two basic examples of concentration inequalities. There
are many other useful concentration inequalities for sums of independent random
variables (e.g. Chernoff’s and Bennett’s) and for more general objects. The text-
book [60] is an elementary introduction into concentration; the books [10, 38, 39]
offer more comprehensive and more advanced accounts of this area.
The original version of Johnson-Lindenstrauss Lemma was proved in [31]. The
version we gave here, Theorem 1.6.1, was stated with probability of success 0.99,
but an inspection of the proof gives probability 1 − 2 exp(−cε2 m) which is much
better for large m. A great variety of ramifications and applications of Johnson-
Lindenstrauss lemma are known, see e.g. [2, 4, 7, 10, 34, 42].
Then one can check that the following matrix series converges8 and defines f(X):
f(X) = ∑_{k=1}^∞ ak (X − X0)^k.
(Check!)
2.2. Matrix Bernstein’s inequality We are now ready to state and prove a re-
markable generalization of Bernstein’s inequality for random matrices.
for any n × n symmetric matrices X and Y. Another result, which we will actually
use in the proof of matrix Bernstein’s inequality, is Lieb’s inequality.
Note that in the scalar case, where n = 1, the function f in Lieb’s inequality is
linear and the result is trivial.
To use Lieb’s inequality in a probabilistic context, we will combine it with
the classical Jensen’s inequality. It states that for any concave function f and a
random matrix X, one has10
(2.2.5) E f(X) ≤ f(E X).
Using this for the function f in Lieb’s inequality, we get
E tr exp(H + log X) ≤ tr exp(H + log E X).
And changing variables to X = e^Z, we get the following:
Lemma 2.2.6 (Lieb’s inequality for random matrices). Let H be a fixed n × n sym-
metric matrix and Z be an n × n symmetric random matrix. Then
E tr exp(H + Z) ≤ tr exp(H + log E e^Z).
Lieb’s inequality is a perfect tool for bounding the MGF of a sum of indepen-
dent random variables S = ∑_{i=1}^N Xi. To do this, let us condition on the random
variables X1, . . . , XN−1. Apply Lemma 2.2.6 for the fixed matrix H := ∑_{i=1}^{N−1} λXi
and the random matrix Z := λXN, and afterwards take the expectation with re-
spect to X1, . . . , XN−1. By the law of total expectation, we get
E tr exp(λS) ≤ E tr exp( ∑_{i=1}^{N−1} λXi + log E e^{λXN} ).
Next, apply Lemma 2.2.6 in a similar manner for H := ∑_{i=1}^{N−2} λXi + log E e^{λXN}
and Z := λXN−1, and so on. After N times, we obtain:
For a convex function g, it reads g(E X) ≤ E g(X). From this, inequality (2.2.5) for concave functions and random matrices
easily follows (Check!).
Think of this inequality as a matrix version of the scalar identity (2.2.2). The
main difference is that it bounds the trace of the MGF11 rather than the MGF itself.
You may recall from a course in probability theory that the quantity log E eλXi
that appears in this bound is called the cumulant generating function of Xi .
Lemma 2.2.7 reduces the complexity of our task significantly, for it is much
easier to bound the cumulant generating function of each single random variable
Xi than to say something about their sum. Here is a simple bound.
where
Z := ∑_{i=1}^N E Xi².
It remains to optimize this bound in λ. The minimum value is attained for
λ = t/(σ² + Kt/3). (Check!) With this value of λ, we conclude
P( λmax(S) ≥ t ) ≤ n · exp( − (t²/2) / (σ² + Kt/3) ).
This completes the proof of Theorem 2.2.1.
Bernstein’s inequality gives a powerful tail bound for ∑_{i=1}^N Xi. This easily
implies a useful bound on the expectation:
Proof. The link from tail bounds to expectation is provided by the basic identity
(2.2.10) E Z = ∫₀^∞ P( Z > t ) dt,
which is valid for any non-negative random variable Z. (Check it!) Integrating
the tail bound given by matrix Bernstein’s inequality, you will arrive at the expec-
tation bound we claimed. (Check!)
Notice that the bound in this corollary has mild, logarithmic, dependence on
the ambient dimension n. As we will see shortly, this can be an important feature
in some applications.
2.3. Community recovery in networks Matrix Bernstein’s inequality has many
applications. The one we are going to discuss first is for the analysis of networks.
A network can be mathematically represented by a graph, a set of n vertices
with edges connecting some of them. For simplicity, we will consider undirected
graphs where the edges do not have arrows. Real world networks often tend to
have clusters, or communities – subsets of vertices that are connected by unusually
many edges. (Think, for example, about a friendship network where communities
form around some common interests.) An important problem in data science is
to recover communities from a given network.
We are going to explain one of the simplest methods for community recovery,
which is called spectral clustering. But before we introduce it, we will first of all
place a probabilistic model on the networks we consider. In other words, it will be
convenient for us to view networks as random graphs whose edges are formed at
random. Although not all real-world networks are truly random, this simplistic
Roman Vershynin 245
model can motivate us to develop algorithms that may empirically succeed also
for real-world networks.
The basic probabilistic model of random graphs is the Erdös-Rényi model.
Definition 2.3.1 (Erdös-Rényi model). Consider a set of n vertices and connect every
pair of vertices independently and with fixed probability p. The resulting random graph
is said to follow the Erdös-Rényi model G(n, p).
The Erdös-Rényi random model is very simple. But it is not a good choice if
we want to model a network with communities, for every pair of vertices has the
same chance to be connected. So let us introduce a natural generalization of the
Erdös-Rényi random model that does allow for community structure:
Definition 2.3.2 (Stochastic block model). Partition a set of n vertices into two subsets
(“communities”) with n/2 vertices each, and connect every pair of vertices independently
with probability p if they belong to the same community and q < p if not. The resulting
random graph is said to follow the stochastic block model G(n, p, q).
Figure 2.3.3 illustrates a simulation of a stochastic block model.
Let A be the adjacency matrix of the random graph: Aij = 1 if vertices i and j
are connected by an edge and Aij = 0 otherwise. If two vertices
belong to the same community then E Aij = p, and otherwise E Aij = q. Thus E A
has block structure: for example, if n = 4 then E A looks like this:
E A = [ p p q q ;
        p p q q ;
        q q p p ;
        q q p p ]
(For illustration purposes, we grouped the vertices from each community to-
gether.)
You will easily check that E A has rank 2, and the non-zero eigenvalues and the
corresponding eigenvectors are
λ1(E A) = ((p + q)/2) n, v1(E A) = (1, 1, 1, 1)ᵀ;  λ2(E A) = ((p − q)/2) n, v2(E A) = (1, 1, −1, −1)ᵀ.
(Check!)
The eigenvalues and eigenvectors of E A tell us a lot about the community
structure of the underlying graph. Indeed, the first (larger) eigenvalue,
d := ((p + q)/2) n,
is the expected degree of any vertex of the graph.13 The second eigenvalue tells
us whether there is any community structure at all (which happens when p ≠ q
and thus λ2(E A) ≠ 0). The first eigenvector v1 is not informative of the structure
of the network at all. It is the second eigenvector v2 that tells us exactly how to
separate the vertices into the two communities: the signs of the coefficients of v2
can be used for this purpose.
Thus if we know E A, we can recover the community structure of the network
from the signs of the second eigenvector. The problem is that we do not know
E A. Instead, we know the adjacency matrix A. If, by some chance, A is not far
from E A, we may hope to use A to approximately recover the community
structure. So is it true that A ≈ E A? The answer is yes, and we can prove it
using matrix Bernstein’s inequality.
Theorem 2.3.4 (Concentration of the stochastic block model). Let A be the adjacency
matrix of a G(n, p, q) random graph. Then
E ‖A − E A‖ ≲ √(d log n) + log n.
Here d = (p + q)n/2 is the expected degree.
13 The degree of a vertex is the number of edges connected to it.
Proof. Let us sketch the argument. To use matrix Bernstein’s inequality, let us
break A into a sum of independent random matrices
A = ∑_{i,j: i≤j} Xij,
where each matrix Xij contains a pair of symmetric entries of A, or one diagonal
entry.14 Matrix Bernstein’s inequality obviously applies for the sum
A − E A = ∑_{i≤j} (Xij − E Xij).
14 Precisely, if i < j, then Xij has all zero entries except the (i, j) and (j, i) entries, which can potentially
equal 1. If i = j, the only non-zero entry of Xij is the (i, i) entry.
15 We will liberally use the notation ≲ to hide constant factors appearing in the inequalities; thus
a ≲ b means that a ≤ Cb for some absolute constant C.
Then we should expect that most of the coefficients of v2 (A) are positive on one
community and negative on the other. So we can use v2 (A) to approximately
recover the communities. This method is called spectral clustering:
Spectral Clustering Algorithm. Compute v2 (A), the eigenvector corresponding to
the second largest eigenvalue of the adjacency matrix A of the network. Use the signs of
the coefficients of v2 (A) to predict the community membership of the vertices.
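Here is a small simulation of spectral clustering on G(n, p, q) (an illustration we
add for these notes; the parameters n = 600, p = 0.1, q = 0.02 are arbitrary
choices satisfying (2.3.7)):

import numpy as np

rng = np.random.default_rng(2)
n, p, q = 600, 0.10, 0.02
labels = np.repeat([1, -1], n // 2)              # planted communities
same = np.equal.outer(labels, labels)            # same-community indicator
probs = np.where(same, p, q)
U = rng.random((n, n))
A = (np.triu(U, 1) < np.triu(probs, 1)).astype(float)
A = A + A.T                                      # symmetric adjacency matrix
vals, vecs = np.linalg.eigh(A)
v2 = vecs[:, -2]                  # eigenvector of the second largest eigenvalue
pred = np.where(v2 >= 0, 1, -1)
acc = max(np.mean(pred == labels), np.mean(pred == -labels))
print("fraction correctly classified:", acc)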
We saw that spectral clustering should perform well for the stochastic block
model G(n, p, q) if it is not too sparse, namely if the expected degrees satisfy
d = (p + q)n/2 ≳ log n.
A more careful analysis along these lines, which you should be able to do
yourself with some work, leads to the following more rigorous result.
Theorem 2.3.6 (Guarantees of spectral clustering). Consider a random graph gener-
ated according to the stochastic block model G(n, p, q) with p > q, and set a = pn,
b = qn. Suppose that
(2.3.7) (a − b)² ≳ log(n)(a + b).
Then, with high probability, the spectral clustering algorithm recovers the communities
up to o(n) misclassified vertices.
Note that condition (2.3.7) implies that the expected degrees are not too small,
namely d = (a + b)/2 ≳ log(n) (check!). It also ensures that a and b are suffi-
ciently different: recall that if a = b the network is an Erdös-Rényi graph without
any community structure.
2.4. Notes The idea to extend concentration inequalities like Bernstein’s to ma-
trices goes back to R. Ahlswede and A. Winter [3]. They used Golden-Thompson
inequality (2.2.3) and proved a slightly weaker form of matrix Bernstein’s inequal-
ity than we gave in Section 2.2. R. Oliveira [48, 49] found a way to improve
this argument and gave a result similar to Theorem 2.2.1. The version of matrix
Bernstein’s inequality we gave here (Theorem 2.2.1) and a proof based on Lieb’s
inequality is due to J. Tropp [52].
The survey [53] contains a comprehensive introduction of matrix calculus, a
proof of Lieb’s inequality (Theorem 2.2.4), a detailed proof of matrix Bernstein’s
inequality (Theorem 2.2.1) and a variety of applications. A proof of Golden-
Thompson inequality (2.2.3) can be found in [8, Theorem 9.3.7].
In Section 2.3 we scratched the surface of an interdisciplinary area of net-
work analysis. For a systematic introduction into networks, refer to the book
[47]. Stochastic block models (Definition 2.3.2) were introduced in [33]. The
community recovery problem in stochastic block models, sometimes also called
community detection problem, has been in the spotlight in the last few years.
A vast and still growing body of literature exists on algorithms and theoretical
results for community recovery, see the book [47], the survey [22], papers such as
[9, 29, 30, 32, 37, 46, 61] and the references therein.
A concentration result similar to Theorem 2.3.4 can be found in [48]; the ar-
gument there is also based on matrix concentration. This theorem is not quite
optimal. For dense networks, where the expected degree d satisfies d ≳ log n, the
concentration inequality in Theorem 2.3.4 can be improved to
(2.4.1) E ‖A − E A‖ ≲ √d.
This improved bound goes back to the original paper [21] which studies the sim-
pler Erdös-Rényi model but the results extend to stochastic block models [17]; it
can also be deduced from [6, 32, 37].
If the network is relatively dense, i.e. d ≳ log n, one can improve the guarantee
(2.3.7) of spectral clustering in Theorem 2.3.6 to
(a − b)² ≳ (a + b).
All one has to do is use the improved concentration inequality (2.4.1) instead of
Theorem 2.3.4. Furthermore, in this case there exist algorithms that can recover
the communities exactly, i.e. without any misclassified vertices, and with high
probability, see e.g. [1, 17, 32, 43].
For sparser networks, where d ≲ log n and possibly even d = O(1), relatively
few algorithms were known until recently, but now there exist many approaches
that provably recover communities in sparse stochastic block models, see, for
example, [9, 17, 29, 30, 37, 46, 61].
In other words,
(1/N) ∑_{i=1}^N Xi → E X as N → ∞.
The next most basic parameter of the distribution is the covariance matrix
Σ := E(X − E X)(X − E X)T .
This is a higher-dimensional version of the usual notion of variance of a random
variable Z, which is
Var(Z) = E(Z − E Z)2 .
The eigenvectors of the covariance matrix Σ are called the principal components.
Principal components that correspond to large eigenvalues of Σ are the directions
in which the distribution of X is most extended, see Figure 3.1.1. These are often
the most interesting directions in the data. Practitioners often visualize the high-
dimensional data by projecting it onto the span of a few (maybe two or three) of
such principal components; the projection may reveal some hidden structure of
the data. This method is called Principal Component Analysis (PCA).
One can estimate the covariance matrix Σ from the sample by computing the
sample covariance
ΣN := (1/N) ∑_{i=1}^N (Xi − E Xi)(Xi − E Xi)ᵀ.
Again, the law of large numbers guarantees that the estimate becomes tight as
the sample size N grows to infinity, i.e.
ΣN → Σ as N → ∞.
But how large should the sample size N be for covariance estimation? Gener-
ally, one can not have N < n for dimension reasons. (Why?) We are going to
show that
N ∼ n log n
is enough. In other words, covariance estimation is possible with just logarithmic
oversampling.
For simplicity, we shall state the covariance estimation bound for mean zero
distributions. (If the mean is not zero, we can estimate it from the sample and
Roman Vershynin 251
subtract. Check that the mean can be accurately estimated from a sample of size
N = O(n).)
Theorem 3.1.2 (Covariance estimation). Let X be a random vector in Rⁿ with covari-
ance matrix Σ. Suppose that
(3.1.3) ‖X‖₂² ≲ E ‖X‖₂² = tr Σ almost surely.
Then, for every N ≥ 1, we have
E ‖ΣN − Σ‖ ≲ ( √(n log n / N) + n log n / N ) ‖Σ‖.
Before we pass to the proof, let us note that Theorem 3.1.2 yields the covariance
estimation result we promised. Let ε ∈ (0, 1). If we take a sample of size
N ∼ ε⁻² n log n,
then we are guaranteed covariance estimation with a good relative error:
E ‖ΣN − Σ‖ ≤ ε ‖Σ‖.
Proof. Apply matrix Bernstein’s inequality (Corollary 2.2.9) for the sum of inde-
pendent random matrices Xi Xiᵀ − Σ to bound
E ‖ΣN − Σ‖ = (1/N) E ‖ ∑_{i=1}^N (Xi Xiᵀ − Σ) ‖.
Complete the proof by using tr Σ ≤ n‖Σ‖ (check this!) to simplify the bound.
Remark 3.1.5 (Low-dimensional distributions). Far fewer samples are needed for
covariance estimation for low-dimensional, or approximately low-dimensional,
distributions. To measure approximate low-dimensionality we can use the notion
of the stable rank of Σ^{1/2}. The stable rank of a matrix A is defined as the square of
the ratio of the Frobenius to operator norms:16
r(A) := ‖A‖²_F / ‖A‖².
The stable rank is always bounded by the usual, linear algebraic rank,
r(A) ≤ rank(A),
and it can be much smaller. (Check both claims.)
Our proof of Theorem 3.1.2 actually gives
E ‖ΣN − Σ‖ ≲ ( √(r log n / N) + r log n / N ) ‖Σ‖,
where
r = r(Σ^{1/2}) = tr Σ / ‖Σ‖.
(Check this!) Therefore, covariance estimation is possible with
N ∼ r log n
samples.
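A small experiment (our illustration, with arbitrary parameters) shows the
sample-size heuristic at work. We use Gaussian data, which does not satisfy the
boundedness condition (3.1.3) but is covered by the sub-gaussian refinement
discussed in Section 4.4:

import numpy as np

rng = np.random.default_rng(3)
n, eps = 100, 0.5
N = int(eps ** -2 * n * np.log(n))          # N ~ eps^-2 n log n samples
L = rng.standard_normal((n, n)) / np.sqrt(n)
Sigma = L @ L.T                             # a generic covariance matrix
X = rng.standard_normal((N, n)) @ L.T       # mean-zero samples with cov Sigma
SigmaN = X.T @ X / N                        # sample covariance
rel = np.linalg.norm(SigmaN - Sigma, 2) / np.linalg.norm(Sigma, 2)
print("relative operator-norm error:", rel)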
3.2. Norms of random matrices We have worked a lot with the operator norm of
matrices, denoted ‖A‖. One may ask whether there exists a formula that expresses ‖A‖
in terms of the entries Aij. Unfortunately, there is no such formula. The operator
norm is a more difficult quantity in this respect than the Frobenius norm, which as
we know can be easily expressed in terms of entries: ‖A‖_F = ( ∑_{i,j} Aij² )^{1/2}.
If we cannot express ‖A‖ in terms of the entries, can we at least get a good es-
timate? Let us consider n × n symmetric matrices for simplicity. In one direction,
16 The Frobenius norm of an n × m matrix, sometimes also called the Hilbert-Schmidt norm, is de-
fined as ‖A‖_F = ( ∑_{i=1}^n ∑_{j=1}^m Aij² )^{1/2}. Equivalently, for an n × n symmetric matrix, we have
‖A‖_F = ( ∑_{i=1}^n λi(A)² )^{1/2}, where λi(A) are the eigenvalues of A. Thus the stable rank of A
can be expressed as r(A) = ∑_{i=1}^n λi(A)² / max_i λi(A)².
‖A‖ is always bounded below by the largest Euclidean norm of the rows Ai:
(3.2.1) ‖A‖ ≥ max_i ‖Ai‖₂ = max_i ( ∑_j Aij² )^{1/2}.
(Check!) Unfortunately, this bound is sometimes very loose, and the best possible
upper bound is
(3.2.2) ‖A‖ ≤ √n · max_i ‖Ai‖₂.
(Show this bound, and give an example where it is sharp.)
Fortunately, for random matrices with independent entries the bound (3.2.2)
can be improved to the point where the upper and lower bounds almost match.
Theorem 3.2.3 (Norms of random matrices without boundedness assumptions).
Let A be an n × n symmetric random matrix whose entries on and above the diagonal are
independent, mean zero random variables. Then
E max_i ‖Ai‖₂ ≤ E ‖A‖ ≤ C log n · E max_i ‖Ai‖₂,
where Ai denote the rows of A.
In words, the operator norm of a random matrix is almost determined by the
norm of the rows.
Our proof of this result will be based on matrix Bernstein’s inequality – more
precisely, Corollary 2.2.9. There is one surprising point. How can we use matrix
Bernstein’s inequality, which applies only for bounded distributions, to prove
a result like Theorem 3.2.3 that does not have any boundedness assumptions?
We will do this using a trick based on conditioning and symmetrization. Let us
introduce this technique first.
Lemma 3.2.4 (Symmetrization). Let X1, . . . , XN be independent, mean zero random
vectors in a normed space and ε1, . . . , εN be independent Rademacher random variables.17
Then
(1/2) E ‖ ∑_{i=1}^N εi Xi ‖ ≤ E ‖ ∑_{i=1}^N Xi ‖ ≤ 2 E ‖ ∑_{i=1}^N εi Xi ‖.
17 This means that the random variables εi take values ±1 with probability 1/2 each. We require that all
of the variables εi and Xi be jointly independent.
Apply the symmetrization inequality (Lemma 3.2.4) for the random matrices Zij
and get
(3.2.5) E ‖A‖ = E ‖ ∑_{i≤j} Zij ‖ ≤ 2 E ‖ ∑_{i≤j} Xij ‖,
where we set
Xij := εij Zij
and εij are independent Rademacher random variables.
Now we condition on A. The random variables Zij become fixed values and all
randomness remains in the Rademacher random variables εij. Note that Xij are
(conditionally) bounded almost surely, and this is exactly what we have lacked to
apply matrix Bernstein’s inequality. Now we can do it. Corollary 2.2.9 gives18
(3.2.6) Eε ‖ ∑_{i≤j} Xij ‖ ≲ σ √(log n) + K log n,
where σ² = ‖ ∑_{i≤j} Eε Xij² ‖ and K = max_{ij} ‖Xij‖.
18 We stick a subscript ε to the expected value to remember that this is a conditional expectation, i.e.
the expectation with respect to the random signs εij only, with A held fixed.
where Ai and Aj denote the rows and columns of A. To see this, apply Theo-
rem 3.2.3 to the (m + n) × (m + n) symmetric random matrix
[ 0 A ; Aᵀ 0 ].
(Do this!)
3.3. Matrix completion Consider a fixed, unknown n × n matrix X. Suppose we
are shown m randomly chosen entries of X. Can we guess all the missing entries?
This important problem is called matrix completion. We will analyze it using the
bounds on the norms on random matrices we just obtained.
Obviously, there is no way to guess the missing entries unless we know some-
thing extra about the matrix X. So let us assume that X has low rank:
rank(X) =: r n.
The number of degrees of freedom of an n × n matrix with rank r is O(rn).
(Why?) So we may hope that
(3.3.1) m ∼ rn
observed entries of X will be enough to determine X completely. But how?
Here we will analyze what is probably the simplest method for matrix com-
pletion. Take the matrix Y that consists of the observed entries of X while all
unobserved entries are set to zero. Unlike X, the matrix Y may not have small
rank. Compute the best rank r approximation19 of Y. The result, as we will show,
will be a good approximation to X.
But before we show this, let us define sampling of entries more rigorously.
Assume each entry of X is shown or hidden independently of others with fixed
19 The best rank r approximation of an n × n matrix A is a matrix B of rank r that minimizes the
operator norm ‖A − B‖ or, alternatively, the Frobenius norm ‖A − B‖_F (the minimizer turns out to
be the same). One can compute B by truncating the singular value decomposition A = ∑_{i=1}^n si ui viᵀ
of A as follows: B = ∑_{i=1}^r si ui viᵀ, where we assume that the singular values si are arranged in
non-increasing order.
Before we prove this result, let us understand what this bound says about the
quality of matrix completion. The recovery error is measured in the Frobenius
norm, and the left side of (3.3.3) is
(1/n) ‖X̂ − X‖_F = ( (1/n²) ∑_{i,j=1}^n |X̂ij − Xij|² )^{1/2}.
Thus Theorem 3.3.2 controls the average error per entry in the mean-squared sense.
To make the error small, let us assume that we have a sample of size
m ≳ rn log² n,
which is slightly larger than the ideal size we discussed in (3.3.1). This makes
C √(rn log(n)/m) = o(1) and forces the recovery error to be bounded by o(1)·‖X‖∞.
Summarizing, Theorem 3.3.2 says that the expected average error per entry is much
smaller than the maximal magnitude of the entries of X. This is true for a sample of
almost optimal size m. The smaller the rank r of the matrix X, the fewer entries
of X we need to see in order to do matrix completion.
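The completion scheme is simple enough to try in a few lines (this is our
illustration; the sizes n = 500, r = 5 and sampling rate p = 0.2 are arbitrary
choices):

import numpy as np

rng = np.random.default_rng(4)
n, r, p = 500, 5, 0.2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n)) / np.sqrt(r)
mask = rng.random((n, n)) < p               # each entry shown with prob. p
Y = np.where(mask, X, 0.0)                  # observed entries, zeros elsewhere
U, s, Vt = np.linalg.svd(Y / p)
Xhat = (U[:, :r] * s[:r]) @ Vt[:r]          # best rank-r approximation of Y/p
print("avg per-entry error:", np.linalg.norm(Xhat - X, 'fro') / n)
print("max entry magnitude:", np.abs(X).max())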
Proof of Theorem 3.3.2. Step 1: The error in the operator norm. Let us first bound
the recovery error in the operator norm. Decompose the error into two parts using
the triangle inequality:
‖X̂ − X‖ ≤ ‖X̂ − p⁻¹Y‖ + ‖p⁻¹Y − X‖.
Recall that X̂ is a best approximation to p⁻¹Y. Then the first part of the error is
smaller than the second part, i.e. ‖X̂ − p⁻¹Y‖ ≤ ‖p⁻¹Y − X‖, and we have
(3.3.4) ‖X̂ − X‖ ≤ 2‖p⁻¹Y − X‖ = (2/p) ‖Y − pX‖.
All that remains is to bound the norms of the rows and columns of Y − pX.
This is not difficult if we note that they can be expressed as sums of independent
random variables:
‖(Y − pX)i‖₂² = ∑_{j=1}^n (δij − p)² Xij² ≤ ∑_{j=1}^n (δij − p)² · ‖X‖∞²,
and Bernstein’s inequality controls the probability that ∑_{j=1}^n (δij − p)² exceeds tpn.
(Check!) This probability can be further bounded by n^{−ct} using the assumption
that m = pn² ≳ n log n. A union bound over n rows leads to
P( max_{i∈[n]} ∑_{j=1}^n (δij − p)² > tpn ) ≤ n · n^{−ct} for t ≥ 3.
present in Theorem 3.3.2. This assumption basically excludes matrices that are
simultaneously sparse and low rank (such as a matrix all of whose entries but one
are zero – it would be extremely hard to complete, since sampling will likely
miss the non-zero entry). Many further results on exact matrix completion are
known, e.g. [15, 18, 28, 56].
Theorem 3.3.2 with a simple proof is borrowed from [50]; see also the tutorial
[59]. This result only guarantees approximate matrix completion, but it does not
have any incoherence assumptions on the matrix.
E ‖Ax‖₂² = E ∑_{j=1}^m ⟨Aj, x⟩²   (where Ajᵀ denote the rows of A)
= ∑_{j=1}^m E ⟨Aj, x⟩²   (by linearity)
This reasoning is valid except where we assumed that g is a unit vector. In-
stead, for g ∼ N(0, In) we have E ‖g‖₂² = n and
‖g‖₂ ≈ √n with high probability.
(Check both these claims using Bernstein’s inequality.) Thus, we need to scale
by the factor √n. Ultimately, the geometric interpretation of the Gaussian width
becomes the following: w(T) is approximately √n/2 larger than the usual, geometric
width of T averaged over all directions.
21 The set T − T is defined as {x − y : x, y ∈ T}. More generally, given two sets A and B in the
same vector space, the Minkowski sum of A and B is defined as A + B = {a + b : a ∈ A, b ∈ B}.
A good exercise is to compute the Gaussian width and complexity for some
simple sets, such as the unit balls of the ℓp norms in Rⁿ, which we denote by
Bpⁿ = {x ∈ Rⁿ : ‖x‖p ≤ 1}. In particular, we have
(4.1.4) γ(B₂ⁿ) ∼ √n,  γ(B₁ⁿ) ∼ √(log n).
For any finite set T ⊂ B₂ⁿ, we have
(4.1.5) γ(T) ≲ √(log |T|).
The same holds for the Gaussian width w(T). (Check these facts!)
A look at these examples reveals that the Gaussian width captures some non-
obvious geometric qualities of sets. Of course, the fact that the Gaussian width of
the unit Euclidean ball B₂ⁿ is of order √n is not surprising: the usual, geometric
width in all directions is 2 and the Gaussian width is about √n times that. But
it may be surprising that the Gaussian width of the ℓ₁ ball B₁ⁿ is much smaller,
and so is the width of any finite set T (unless the set has exponentially large
cardinality). As we will see later, Gaussian width nicely captures the geometric
size of “the bulk” of a set.
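Both asymptotics in (4.1.4) are easy to check by Monte Carlo (our illustration,
with arbitrary parameters): for B₂ⁿ the width is E‖g‖₂ ≈ √n, while for B₁ⁿ it is
E max_i |g_i| ≈ √(2 log n).

import numpy as np

rng = np.random.default_rng(5)
n, trials = 1000, 2000
G = rng.standard_normal((trials, n))
# w(B2^n) = E sup_{||x||_2 <= 1} <g, x> = E ||g||_2
print(np.linalg.norm(G, axis=1).mean(), np.sqrt(n))
# w(B1^n) = E sup_{||x||_1 <= 1} <g, x> = E max_i |g_i|
print(np.abs(G).max(axis=1).mean(), np.sqrt(2 * np.log(n)))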
4.2. Matrix deviation inequality Now we are ready to answer the question we
asked in the beginning of this lecture: what is the magnitude of the uniform de-
viation (4.0.2)? The answer is surprisingly simple: it is bounded by the Gaussian
complexity of T . The proof is not too simple however, and we will skip it (see the
notes after this lecture for references).
where K = max_i ‖Ai‖_{ψ2} is the maximal sub-gaussian norm22 of the rows of A.
Remark 4.2.2 (Tail bound). It is often useful to have results that hold with high
probability rather than in expectation. There exists a high-probability version of
the matrix deviation inequality, and it states the following. Let u ≥ 0. Then the
event
(4.2.3) sup_{x∈T} | ‖Ax‖₂ − √m ‖x‖₂ | ≤ CK² [γ(T) + u · rad(T)]
22 A definition of the sub-gaussian norm of a random vector was given in Section 1.5. For example, if
A is a Gaussian random matrix with independent N(0, 1) entries, then K is an absolute constant.
The argument based on matrix deviation inequality, which we just gave, can
be easily extended for infinite sets. It allows one to state a version of Johnson-
Lindenstrauss lemma for general, possibly infinite, sets, which depends on the
Gaussian complexity of T rather than cardinality. (Try to do this!)
4.4. Covariance estimation In Section 3.1, we introduced the problem of covari-
ance estimation, and we showed that
N ∼ n log n
samples are enough to estimate the covariance matrix of a general distribution in
Rⁿ. We will now show how to do better if the distribution is sub-gaussian (see
Section 1.5 for the definition of sub-gaussian random vectors): in that case we can
get rid of the logarithmic oversampling and the boundedness condition (3.1.3).
ΣN − Σ = Σ^{1/2} RN Σ^{1/2}, where RN := (1/N) ∑_{i=1}^N Zi Ziᵀ − In.
Figure 4.5.1. The set on the left (whose boundary is shown) is star-
shaped, the set on the right is not.
Now we propose the following way to solve the recovery problem: solve the
optimization program
(4.5.2) min ‖x′‖_K subject to y = Ax′.
Note that this is a very natural program: it looks at all solutions to the equation
y = Ax′ and tries to “shrink” the solution x′ toward K. (This is what minimiza-
tion of the Minkowski functional is about.)
Also note that if K is convex, this is a convex optimization program, and thus
can be solved effectively by one of the many available numeric algorithms.
The main question we should now be asking is – would the solution to this
program approximate the original vector x? The following result bounds the
approximation error for a probabilistic model of linear equations. Assume that A
is a random matrix as in Theorem 4.2.1, i.e. A is an m × n matrix whose rows Ai
are independent, isotropic and sub-gaussian random vectors in Rn .
Proof. Both the original vector x and the solution x̂ are feasible vectors for the
optimization program (4.5.2). Then
(Do this!) Keep in mind that neither ‖x‖₀ nor ‖x‖p for 0 < p < 1 are actually
norms on Rⁿ, since they fail the triangle inequality. (Give an example.)
Let us go back to the sparse recovery problem. Our first attempt to recover x
is to try the following optimization problem:
(4.6.2) min ‖x′‖₀ subject to y = Ax′.
This is sensible because this program selects the sparsest feasible solution. But
there is an implementation caveat: the function f(x) = ‖x‖₀ is highly non-convex
and even discontinuous. There is simply no known algorithm to solve the opti-
mization problem (4.6.2) efficiently.
To overcome this difficulty, let us turn to the relation (4.6.1) for inspiration.
What if we replace ‖x‖₀ in the optimization problem (4.6.2) by ‖x‖p with p > 0?
The smallest p for which f(x) = ‖x‖p is a genuine norm (and thus a convex
function on Rⁿ) is p = 1. So let us try
(4.6.3) min ‖x′‖₁ subject to y = Ax′.
This is a convexification of the non-convex program (4.6.2), and a variety of nu-
meric convex optimization methods are available to solve it efficiently.
We will now show that ℓ₁ minimization works nicely for sparse recovery. As
before, we assume that A is a random matrix as in Theorem 4.2.1.
Theorem 4.6.4 (Sparse recovery by optimization). If an unknown vector x ∈ Rⁿ has
at most s non-zero coordinates, i.e. ‖x‖₀ ≤ s, then the solution x̂ of the optimization
program (4.6.3) satisfies
E ‖x̂ − x‖₂ ≲ √(s log n / m) ‖x‖₂.
Proof. Since ‖x‖₀ ≤ s, the Cauchy-Schwarz inequality shows that
(4.6.5) ‖x‖₁ ≤ √s ‖x‖₂.
(Check!) Denote the unit ball of the ℓ₁ norm in Rⁿ by B₁ⁿ, i.e. B₁ⁿ := {x ∈ Rⁿ :
‖x‖₁ ≤ 1}. By (4.6.5), the vector x lies in the set K := √s ‖x‖₂ · B₁ⁿ.
Apply Theorem 4.5.3 for this set K. We noted the Gaussian width of B₁ⁿ in (4.1.4),
so
w(K) = √s ‖x‖₂ · w(B₁ⁿ) ≤ √s ‖x‖₂ · γ(B₁ⁿ) ≲ √s ‖x‖₂ · √(log n).
Substitute this in Theorem 4.5.3 and complete the proof.
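To see the theorem in action, ℓ₁ minimization (4.6.3) can be posed as a linear
program by splitting x into positive and negative parts. The following demo is
our own illustration (the measurement count m ≈ 4 s log n is an arbitrary choice
consistent with the theorem); it typically recovers x exactly:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
n, s = 200, 5
m = int(4 * s * np.log(n))                  # m ~ s log n measurements
x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n))
y = A @ x
# min 1^T (u + v)  s.t.  A(u - v) = y, u >= 0, v >= 0;  then xhat = u - v.
res = linprog(np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=(0, None))
xhat = res.x[:n] - res.x[n:]
print("relative recovery error:", np.linalg.norm(xhat - x) / np.linalg.norm(x))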
Acknowledgement
I am grateful to the referees who made a number of useful suggestions, which
led to better presentation of the material in this chapter.
References
[1] E. Abbe, A. S. Bandeira, G. Hall, Exact recovery in the stochastic block model, IEEE Transactions on
Information Theory 62 (2016), 471–487. MR3447993 249
[2] D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal
of Computer and System Sciences, 66 (2003), 671–687. MR2005771 239
[3] R. Ahlswede, A. Winter, Strong converse for identification via quantum channels, IEEE Trans. Inf.
Theory 48 (2002), 569–579. MR1889969 248
[4] N. Ailon, B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform,
Proceedings of the 38th Annual ACM Symposium on Theory of Computing. New York: ACM
Press, 2006. pp. 557–563. MR2277181 239
[5] D. Amelunxen, M. Lotz, M. McCoy, J. Tropp, Living on the edge: phase transitions in convex programs
with random data, Inf. Inference 3 (2014), 224–294. MR3311453 268
[6] A. Bandeira, R. van Handel, Sharp nonasymptotic bounds on the norm of random matrices with inde-
pendent entries, Ann. Probab. 44 (2016), 2479–2506. MR3531673 249, 258
[7] R. Baraniuk, M. Davenport, R. DeVore, M. Wakin, A simple proof of the restricted isometry property
for random matrices, Constructive Approximation, 28 (2008), 253–263. MR2453366 239
[8] R. Bhatia, Matrix Analysis. Graduate Texts in Mathematics, vol. 169. Springer, Berlin, 1997.
MR1477662 248
[9] C. Bordenave, M. Lelarge, L. Massoulie, Non-backtracking spectrum of random graphs: community
detection and non-regular Ramanujan graphs, Annals of Probability, to appear. MR3758726 248, 249
[10] S. Boucheron, G. Lugosi, P. Massart, Concentration inequalities. A nonasymptotic theory of indepen-
dence. With a foreword by Michel Ledoux. Oxford University Press, Oxford, 2013. MR3185193
239
[11] O. Bousquet1, S. Boucheron, G. Lugosi, Introduction to statistical learning theory, in: Advanced
Lectures on Machine Learning, Lecture Notes in Computer Science 3176, pp.169–207, Springer
Verlag 2004.
[12] T. Cai, R. Zhao, H. Zhou, Estimating structured high-dimensional covariance and precision matrices:
optimal rates and adaptive estimation, Electron. J. Stat. 10 (2016), 1–59. MR3466172 258
[13] E. Candes, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction from highly
incomplete frequency information, IEEE Trans. Inform. Theory 52 (2006), 489–509. MR2236170 268
[14] E. Candes, B. Recht, Exact Matrix Completion via Convex Optimization, Foundations of Computa-
tional Mathematics 9 (2009), 717–772. MR2565240 258
[15] E. Candes, T. Tao, The power of convex relaxation: near-optimal matrix completion, IEEE Trans. Inform.
Theory 56 (2010), 2053–2080. MR2723472 259
[16] R. Chen, A. Gittens, J. Tropp, The masked sample covariance estimator: an analysis using matrix con-
centration inequalities, Inf. Inference 1 (2012), 2–20. MR3311439 258
[17] P. Chin, A. Rao, and V. Vu, Stochastic block model and community detection in the sparse graphs: A
spectral algorithm with optimal rate of recovery, preprint, 2015. 249
[18] M. Davenport, Y. Plan, E. van den Berg, M. Wootters, 1-bit matrix completion, Inf. Inference 3
(2014), 189–223. MR3311452 259
[19] M. Davenport, M. Duarte, Yonina C. Eldar, Gitta Kutyniok, Introduction to compressed sensing, in:
Compressed sensing. Edited by Yonina C. Eldar and Gitta Kutyniok. Cambridge University Press,
Cambridge, 2012. MR2963166 268
[20] S. Dirksen, Tail bounds via generic chaining, Electron. J. Probab. 20 (2015), 1–29. MR3354613 268
[21] U. Feige, E. Ofek, Spectral techniques applied to sparse random graphs, Random Structures Algo-
rithms 27 (2005), 251–275. MR2155709 249
[22] S. Fortunato, Santo; D. Hric, Community detection in networks: A user guide. Phys. Rep. 659 (2016),
1–44. MR3566093 248
[23] S. Foucart, H. Rauhut, A mathematical introduction to compressive sensing. Applied and Numerical
Harmonic Analysis. Birkhäuser/Springer, New York, 2013. MR3100033 268
[24] Y. Gordon, Some inequalities for Gaussian processes and applications, Israel J. Math. 50 (1985), 265–289.
MR800188 268
[25] Y. Gordon, Elliptically contoured distributions, Prob. Th. Rel. Fields 76 (1987), 429–438. MR917672
268
[26] Y. Gordon, On Milman’s inequality and random subspaces which escape through a mesh in Rn , Geo-
metric aspects of functional analysis (1986/87), Lecture Notes in Math., vol. 1317, pp. 84–106.
MR950977 268
[27] Y. Gordon, Majorization of Gaussian processes and geometric applications, Prob. Th. Rel. Fields 91
(1992), 251–267. MR1147616 268
270 References
[28] D. Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE Trans. Inform. Theory
57 (2011), 1548–1566. MR2815834 259
[29] O. Guedon, R. Vershynin, Community detection in sparse networks via Grothendieck’s inequality, Prob-
ability Theory and Related Fields 165 (2016), 1025–1049. MR3520025 248, 249
[30] A. Javanmard, A. Montanari, F. Ricci-Tersenghi, Phase transitions in semidefinite relaxations, PNAS,
April 19, 2016, vol. 113, no.16, E2218–E2223. MR3494080 248, 249
[31] W. B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space. In Beals,
Richard; Beck, Anatole; Bellow, Alexandra; et al. Conference in modern analysis and probability
(New Haven, Conn., 1982). Contemporary Mathematics. 26. Providence, RI: American Mathemat-
ical Society, 1984. pp. 189–206. MR737400 239
[32] B. Hajek, Y. Wu, J. Xu, Achieving exact cluster recovery threshold via semidefinite programming, IEEE
Transactions on Information Theory 62 (2016), 2788–2797. MR3493879 248, 249
[33] P. W. Holland, K. B. Laskey, S. Leinhardt, Stochastic blockmodels: first steps, Social Networks 5
(1983), 109–137. MR718088 248
[34] D. Kane, J. Nelson, Sparser Johnson-Lindenstrauss Transforms, Journal of the ACM 61 (2014): 1.
MR3167920 239
[35] B. Klartag, S. Mendelson, Empirical processes and random projections, J. Funct. Anal. 225 (2005),
229–245. MR2149924 268
[36] V. Koltchinskii, K. Lounici, Concentration inequalities and moment bounds for sample covariance oper-
ators, Bernoulli 23 (2017), 110–133. MR3556768 258, 268
[37] C. Le, E. Levina, R. Vershynin, Concentration and regularization of random graphs, Random Struc-
tures and Algorithms, to appear. MR3689343 248, 249
[38] M. Ledoux, The concentration of measure phenomenon. American Mathematical Society, Providence,
RI, 2001. MR1849347 239
[39] M. Ledoux, M. Talagrand, Probability in Banach spaces. Isoperimetry and processes. Springer-Verlag,
Berlin, 1991. MR1102015 239
[40] E. Levina, R. Vershynin, Partial estimation of covariance matrices, Probability Theory and Related
Fields 153 (2012), 405–419. MR2948681 258
[41] C. Liaw, A. Mehrabian, Y. Plan, R. Vershynin, A simple tool for bounding the deviation of random ma-
trices on geometric sets, Geometric Aspects of Functional Analysis, Lecture Notes in Mathematics,
Springer, Berlin, to appear. MR3645128 268
[42] J. Matouĺek, Lectures on discrete geometry. Graduate Texts in Mathematics, 212. Springer-Verlag,
New York, 2002. MR1899299 239
[43] F. McSherry, Spectral partitioning of random graphs, Proc. 42nd FOCS (2001), 529–537. MR1948742
249
[44] S. Mendelson, S. Mendelson, A few notes on statistical learning theory, in: Advanced Lectures on
Machine Learning, S. Mendelson, A. J. Smola (Eds.) LNAI 2600, pp. 1–40, 2003. 268
[45] S. Mendelson, A. Pajor, N. Tomczak-Jaegermann, Reconstruction and subgaussian operators in as-
ymptotic geometric analysis. Geom. Funct. Anal. 17 (2007), 1248–1282. MR2373017 268
[46] E. Mossel, J. Neeman, A. Sly, Belief propagation, robust reconstruction and optimal recovery of block
models. Ann. Appl. Probab. 26 (2016), 2211–2256. MR3543895 248, 249
[47] M. E. Newman, Networks. An introduction. Oxford University Press, Oxford, 2010. MR2676073 248
[48] R. I. Oliveira, Concentration of the adjacency matrix and of the Laplacian in random graphs with inde-
pendent edges, unpublished (2010), arXiv:0911.0600. 248, 249
[49] R. I. Oliveira, Sums of random Hermitian matrices and an inequality by Rudelson, Electron. Commun.
Probab. 15 (2010), 203–212. MR2653725 248
[50] Y. Plan, R. Vershynin, E. Yudovina, High-dimensional estimation with geometric constraints, Informa-
tion and Inference 0 (2016), 1–40. MR3636866 259
[51] S. Riemer, C. Schütt, On the expectation of the norm of random matrices with non-identically distributed
entries, Electron. J. Probab. 18 (2013), no. 29, 13 pp. MR3035757 258
[52] J. Tropp, User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 (2012),
389–434. MR2946459 248
[53] J. Tropp, An introduction to matrix concentration inequalities. Found. Trends Mach. Learning 8 (2015),
10-230. 248
[54] R. van Handel, Structured random matrices. in: IMA Volume “Discrete Structures: Analysis and
Applications”, Springer, to appear.
References 271
[55] R. van Handel, On the spectral norm of Gaussian random matrices, Trans. Amer. Math. Soc., to appear.
MR3695857 258
[56] B. Recht, A simpler approach to matrix completion, J. Mach. Learn. Res. 12 (2011), 3413–3430.
MR2877360 259
[57] G. Schechtman, Two observations regarding embedding subsets of Euclidean spaces in normed spaces,
Adv. Math. 200 (2006), 125–135. MR2199631 268
[58] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices. Compressed sensing,
210–268, Cambridge University Press, Cambridge, 2012. MR2963170 232, 239, 258, 268
[59] R. Vershynin, Estimation in high dimensions: a geometric perspective. Sampling Theory, a Renaissance,
3–66, Birkhauser Basel, 2015. MR3467418 232, 259, 268
[60] R. Vershynin, High-Dimensional Probability. An Introduction with Applications in Data Science. Cam-
bridge University Press, to appear. 232, 239, 258, 268
[61] H. Zhou, A. Zhang, Minimax Rates of Community Detection in Stochastic Block Models, Annals of
Statistics, to appear. MR3546450 248, 249
Department of Mathematics, University of Michigan, 530 Church Street, Ann Arbor, MI 48109,
U.S.A.
Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 273–325
https://doi.org/10.1090/pcms/025/00834
Homological Algebra and Data
Robert Ghrist
Contents
Introduction and Motivation 273
What is Homology? 274
When is Homology Useful? 274
Scheme 275
Lecture 1: Complexes and Homology 275
Spaces 275
Spaces and Equivalence 279
Application: Neuroscience 284
Lecture 2: Persistence 286
Towards Functoriality 286
Sequences 288
Stability 292
Application: TDA 294
Lecture 3: Compression and Computation 297
Sequential Manipulation 297
Homology Theories 301
Application: Algorithms 306
Lecture 4: Higher Order 308
Cohomology and Duality 308
Cellular Sheaves 312
Cellular Sheaf Cohomology 313
Application: Sensing and Evasion 318
Conclusion: Beyond Linear Algebra 320
are mathematical in nature, this article will treat the formalities with a light touch
and heavy references, in order to make the subject more accessible to practitioners.
See the concluding section for a roadmap for finding more details. The material
is written for beginning graduate students in any of the applied mathematical
sciences (though some mathematical maturity is helpful).
What is Homology?
Homology is an algebraic compression scheme that excises all but the essen-
tial topological features from a particular class of data structures arising naturally
from topological spaces. Homology therefore pairs with topology. Topology is
the mathematics of abstract spaces and the transformations between them. The notion
of a space, X, requires only a set together with a notion of nearness, expressed as
a system of subsets comprising the “open” neighborhoods satisfying certain con-
sistency conditions. Metrics are permissible but not required. So many familiar
notions in applied mathematics – networks, graphs, data sets, signals, imagery,
and more – are interpretable as topological spaces, often with useful auxiliary
structures. Furthermore, manipulations of such objects, whether as comparison,
inference, or metadata, are expressible in the language of mappings, or contin-
uous relationships between spaces. Topology concerns the fundamental notions
of equivalence up to the loose nearness of what makes a space. Thus, connectiv-
ity and holes are significant; bends and corners less so. Topological invariants
of spaces and mappings between them record the essential qualitative features,
insensitive to coordinate changes and deformations.
Homology is the simplest, general, computable invariant of topological data.
In its most primal manifestation, the homology of a space X returns a sequence
of vector spaces H• (X), the dimensions of which count various types of linearly
independent holes in X. Homology is inherently linear-algebraic, but transcends
linear algebra, serving as the inspiration for homological algebra. It is this algebraic
engine that powers the subject.
Scheme
There is far too much material in the subject of algebraic topology to be sur-
veyed here. Existing applications alone span an enormous range of principles and
techniques, and the subject of applications of homology and homological algebra
is in its infancy still. As such, these notes are selective to a degree that suggests
caprice. For deeper coverage of the areas touched on here, complete with illustra-
tions, see [51]. For alternate ranges and perspectives, there are now a number of
excellent sources, including [40, 62, 76]. These notes will deemphasize formalities
and ultimate formulations, focusing instead on principles, with examples and ex-
ercises. The reader should not infer that the theorems or theoretic minutiae are
anything less than critical in practice.
These notes err on the side of simplicity. The many included exercises are
not of the typical lemma-lemma-theorem form appropriate for a mathematics
course; rather, they are meant to ground the student in examples. There is an
additional layer of unstated problems for the interested reader: these notes are
devoid of figures. The student apt with a pen should endeavor to create cartoons
to accompany the various definitions and examples presented here, with the aim
of minimality and clarity of encapsulation. The author’s attempt at such can be
found in [51].
Lecture 1: Complexes and Homology
Spaces
A space is a set X together with a compendium of all subsets in X deemed
“open,” which subcollection must of necessity satisfy a list of intuitively obvious
properties. The interested reader should consult any point-set topology book
(such as [70]) briefly. All the familiar spaces of elementary calculus – surfaces,
level sets of functions, Euclidean spaces – are indeed topological spaces and just
the beginning of the interesting spaces studied in manifold theory, algebraic ge-
ometry, differential geometry, and more. These tend to be frustratingly indiscrete.
Applications involving computation prompt an emphasis on those spaces that are
easily digitized. Such are usually called complexes, often with an adjectival prefix.
Several are outlined below.
Simplicial Complexes Consider a set X of discrete objects. A k-simplex in X is
an unordered collection σ of k + 1 distinct elements of X. Though the definition is
combinatorial, for X a set of points in a Euclidean space [viz. point-cloud data set]
one visualizes a simplex as the geometric convex hull of the k + 1 points, a “filled-
in” clique: thus, 0-simplices are points, 1-simplices are edges, 2-simplices are
filled-in triangles, etc. A complex is a collection of multiple simplices.1 In particular,
a simplicial complex on X is a collection of simplices in X that is downward closed,
in the sense that every subset of every simplex is also a simplex in the complex.
One says that X contains all its faces. Greek letters (especially σ and τ) will be used
to denote simplices in what follows.
Exercise 1.2. Not all interesting simplicial complexes are simple to visualize. Con-
sider a finite-dimensional real vector space V and consider V to be the vertex set
of a simplicial complex defined as follows: a k-simplex consists of k + 1 linearly
independent members of V. Is the resulting independence complex finite? Finite-
dimensional? What does the dimension of this complex tell you?
where ∼ is the equivalence relation that identifies faces of Δk with the correspond-
ing combinatorial faces of σ in X(j) for j < k. Thus, for example, X(0) is a discrete
1 The etymology of both words is salient.
collection of points (the vertices) and X(1) is the abstract space obtained by gluing
edges to the vertices using information stored in the 1-simplices.
Exercise 1.5. How many total k-simplices are there in the closed n-simplex for
k < n?
Vietoris-Rips Complexes A data set in the form of a finite metric space (X, d)
gives rise to a family of simplicial complexes in the following manner. The
Vietoris-Rips complex (or VR-complex) of (X, d) at scale ε > 0 is the simplicial
complex VRε(X) whose simplices are precisely those collections of points with
pairwise distance ≤ ε. Otherwise said, one connects points that are sufficiently
close, filling in sufficiently small holes, with sufficiency specified by ε.
These VR complexes have been used as a way of associating a simplicial com-
plex to point cloud data sets. One obvious difficulty, however, lies in the choice
of ε: too small, and nothing is connected; too large, and everything is connected.
The question of which ε to use has no easy answer. However, the perspectives of
algebraic topology offer a modified question. How to integrate structures across all
ε values? This will be considered in Lecture 2 of this series.
Flag/clique complexes The VR complex is a particular instance of the following
construct. Given a graph (network) X, the flag complex or clique complex of X is
the maximal simplicial complex X that has the graph as its 1-skeleton: X(1) = X.
What this means in practice is that whenever you “see” the skeletal frame of
a simplex in X, you fill it and all its faces in with simplices. Flag complexes are
advantageous as data structures for spaces, in that you do not need to input/store
all of the simplices in a simplicial complex: the 1-skeleton consisting of vertices
and edges suffices to define the rest of the complex.
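To make this concrete, here is a minimal sketch in Python (illustrative code, not
from these notes; the function name and distance-matrix input are assumptions):
it builds VRε(X) from a distance matrix as the clique complex of the ε-neighborhood
graph, returning the simplices graded by dimension.

import itertools
import numpy as np

def vietoris_rips(dist, eps, max_dim=2):
    """Simplices of VR_eps, graded by dimension, as sorted vertex tuples."""
    n = dist.shape[0]
    # 1-skeleton: connect points at pairwise distance <= eps
    edges = {(i, j) for i, j in itertools.combinations(range(n), 2)
             if dist[i, j] <= eps}
    simplices = [{(i,) for i in range(n)}, edges]
    # clique filling: every (k+1)-clique of the 1-skeleton is a k-simplex
    for k in range(2, max_dim + 1):
        simplices.append({
            sigma for sigma in itertools.combinations(range(n), k + 1)
            if all(e in edges for e in itertools.combinations(sigma, 2))
        })
    return simplices

# four points at the corners of a unit square
pts = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print([len(s) for s in vietoris_rips(dist, 1.1)])   # [4, 4, 0]: a hollow cycle
print([len(s) for s in vietoris_rips(dist, 1.5)])   # [4, 6, 4]: diagonals fill in

Storing only the 1-skeleton and filling cliques on demand is exactly the storage
advantage described above.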
Exercise 1.7. Compute all possible nerves of four bounded convex subsets in
the Euclidean plane. What is and is not possible? Now, repeat, but with two
nonconvex subsets of Euclidean R3 .
Exercise 1.8. Compute the Dowker complex and the dual Dowker complex of the
following relation R:

(1.9) R =
⎡ 1 0 0 0 1 1 0 0 ⎤
⎢ 0 1 1 0 0 0 1 0 ⎥
⎢ 0 1 0 0 1 1 0 1 ⎥
⎢ 1 0 1 0 1 0 0 1 ⎥
⎣ 1 0 1 0 0 1 1 0 ⎦ .
Dowker complexes have been used in a variety of social science contexts (where
X and Y represent agents and attributes respectively) [7]. More recent applications
of these complexes have arisen in settings ranging from social networks [89] to
sensor networks [54]. The various flavors of witness complexes in the literature on
topological data analysis [37, 57] are special cases of Dowker complexes.
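As an illustration, the following sketch (hypothetical code, using the standard
definition: a set of rows of R spans a simplex iff some column witnesses all of
them) computes the two complexes of Exercise 1.8.

import itertools
import numpy as np

def dowker(R, max_dim=3):
    """Simplices (graded by dimension) of the Dowker complex of a 0/1 relation."""
    n_rows = R.shape[0]
    simplices = []
    for size in range(1, max_dim + 2):
        simplices.append({
            sigma for sigma in itertools.combinations(range(n_rows), size)
            # sigma is a simplex iff some column witnesses every row in sigma
            if np.any(R[list(sigma), :].all(axis=0))
        })
    return simplices

R = np.array([[1, 0, 0, 0, 1, 1, 0, 0],
              [0, 1, 1, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 1, 0, 1],
              [1, 0, 1, 0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0, 1, 1, 0]])
print([len(s) for s in dowker(R)])     # the Dowker complex of R
print([len(s) for s in dowker(R.T)])   # the dual: rows and columns swapped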
Cell Complexes There are other ways to build spaces out of simple pieces.
These, too, are called complexes, though not simplicial, as they are not necessarily
built from simplices. They are best described as cell complexes, being built from
cells of various dimensions sporting a variety of possible auxiliary structures.
A cubical complex is a cell complex built from cubes of various dimensions,
the formal definition mimicking Equation (1.4): see [51, 62]. These often arise as
the natural model for pixel or voxel data in imagery and time series. Cubical
complexes have found other uses in modelling spaces of phylogenetic trees [17,
77] and robot configuration spaces [1, 53, 55].
There are much more general cellular complexes built from simple pieces with
far less rigidity in the attachments between pieces. Perhaps the most general
useful model of a cell complex is the CW complex used frequently in algebraic
topology. The idea of a CW complex is this: one begins with a disjoint union
of points X(0) as the 0-skeleton. One then inductively defines the n-skeleton of
permissible in CW complexes.
Many of the core results in topology are stated in the language of homotopy
(and are not true when homotopy is replaced with the more restrictive homeomor-
phism). For example:
Theorem 1.11. If U is a finite collection of open contractible subsets of X with all non-
empty intersections of subcollections of U contractible, then N(U) is homotopic to the
union ∪α Uα .
Theorem 1.12. Given any binary relation R ⊂ X × Y, the Dowker and dual Dowker
complexes are homotopic.
Homotopy invariants — functions that assign equivalent values to homotopic
inputs — are central both to topology and its applications to data (as noise per-
turbs spaces in an often non-homeomorphic but homotopic manner). Invariants
of finite simplicial and cell complexes invite a computational perspective, since
one has the hope of finite inputs and felicitous data structures.
Euler Characteristic The simplest nontrivial topological invariant of finite cell
complexes dates back to Euler. It is elementary, combinatorial, and sublime. The
Euler characteristic of a finite cell complex X is:
(1.13) χ(X) = Σ_σ (−1)^{dim σ} ,
where the sum is over all cells σ of X.
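In code the invariant is a one-liner; a sketch reusing the graded simplex sets of
the earlier Vietoris-Rips sketch:

def euler_characteristic(simplices):
    """chi per Equation (1.13): alternating sum of cell counts by dimension."""
    return sum((-1) ** k * len(cells) for k, cells in enumerate(simplices))

# for the unit-square example: chi = 4 - 4 + 0 = 0 at eps = 1.1 (a circle),
# and chi = 4 - 6 + 4 = 2 at eps = 1.5 (with max_dim = 2 the clique filling
# stops at triangles, leaving the boundary of a tetrahedron: a 2-sphere)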
Exercise 1.14. Compute explicitly the Euler characteristics of the following cell
complexes: (1) the decompositions of the 2-sphere, S2 , defined by the boundaries
of the five regular Platonic solids; (2) the CW complex having one 0-cell and
one 2-cell disc whose boundary is attached directly to the 0-cell (“collapse the
boundary circle to a point”); and (3) the thickened 2-sphere S2 × [0, 1]. How did
you put a cell structure on this last 3-dimensional space?
Completion of this exercise suggests the following result:
Theorem 1.15. Euler characteristic is a homotopy invariant among finite cell complexes.
That this is so would seem to require a great deal of combinatorics to prove.
The modern proof transcends combinatorics, making the problem hopelessly un-
computable before pulling back to the finite world, as will be seen in Lecture 3.
Exercise 1.16. Prove that the Euler characteristic distinguishes [connected] trees
from [connected] graphs with cycles. What happens if the connectivity require-
ment is dropped?
Euler characteristic is a wonderfully useful invariant, with modern applica-
tions ranging from robotics [42, 47] and AI [68] to sensor networks [9–11] to
Gaussian random fields [4,6]. In the end, however, it is a numerical invariant, and
has a limited resolution. The path to improving the resolution of this invariant is
to enrich the underlying algebra that the Eulerian ±1 obscures in Equation (1.13).
Lifting to Linear Algebra One of the core themes of this lecture series is the lift-
ing of cell complexes to algebraic complexes on which the tools of homological
algebra can be brought to bear. This is not a novel idea: most applied mathemati-
cians learn, e.g., to use the adjacency matrix of a graph as a means of harnessing
linear-algebraic ideas to understand networks. What is novel is the use of higher-
dimensional structure and the richer algebra this entails.
Homological algebra is often done with modules over a commutative ring. For
clarity of exposition, let us restrict to the nearly trivial setting of finite-dimensional
vector spaces over a field F, typically either R or, when orientations are bother-
some, F2 , the binary field.
Given a cell complex, one lifts the topological cells to algebraic objects by using
them as bases for vector spaces. One remembers the dimensions of the cells by
using a sequence of vector spaces, with dimension as a grading that indexes the
vector spaces. Consider the following sequence C = (Ck ) of vector spaces, where
the grading k is in N.
(1.17) ··· Ck Ck−1 ··· C1 C0 .
In algebraic topology, one often uses a “star” or “dot” to denote a grading: this
chapter will use a dot, as in C = (C• ). For a finite (and thus finite-dimensional)
cell complex, the sequence becomes all zeros eventually. Such a sequence does
not obviously offer an algebraic advantage over the original space; indeed, much
of the information on how cells are glued together has been lost. However, it
is easy to “lift” the Euler characteristic to this class of algebraic objects. For C a
sequence of finite-dimensional vector spaces with finitely many nonzero terms,
define:
(1.18) χ(C) = Σ_k (−1)^k dim Ck .
Chain Complexes Recall that basic linear algebra does not focus overmuch on
vector spaces and bases; it is in linear transformations that power resides. Aug-
menting a sequence of vector spaces with a matching sequence of linear transfor-
mations adds in the assembly instructions and permits a fuller algebraic repre-
sentation of a topological complex. Given a simplicial3 complex X, fix a field F
and let C = (Ck , ∂k ) denote the following sequence of F-vector spaces and linear
transformations.
(1.19) · · · −→ Ck −∂k→ Ck−1 −→ · · · −→ C1 −∂1→ C0 −∂0→ 0 .
in which case orientations can be ignored; otherwise, one must affix an orienta-
tion to each simplex and proceed accordingly: see [51, 58] for details on how this
is performed.
The chain complex is the primal algebraic object in homological algebra. It
is rightly seen as the higher-dimensional analogue of a graph together with its
adjacency matrix. The chain complex will often be written as C = (C• , ∂), with
the grading implied in the subscript dot. The boundary operator, ∂ = ∂• , can be
thought of either as a sequence of linear transformations, or as a single operator
acting on the direct sum of the Ck .
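A sketch of this lift in code, over F = F2 so that orientations can be ignored
(illustrative, using the graded simplex sets of the earlier sketches): the matrix
of ∂k has a 1 in position (τ, σ) exactly when the (k − 1)-simplex τ is a face of
the k-simplex σ.

import numpy as np

def boundary_matrix(simplices, k):
    """Matrix of the boundary map d_k : C_k -> C_{k-1} over F2."""
    rows = sorted(simplices[k - 1])       # basis of C_{k-1}
    cols = sorted(simplices[k])           # basis of C_k
    index = {tau: i for i, tau in enumerate(rows)}
    D = np.zeros((len(rows), len(cols)), dtype=np.uint8)
    for j, sigma in enumerate(cols):
        for v in sigma:                   # each face map drops one vertex
            face = tuple(u for u in sigma if u != v)
            D[index[face], j] = 1
    return D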
Homology Homological algebra begins with the following suspiciously simple
statement about simplicial complexes.
Lemma 1.20. The boundary of a boundary is null:
(1.21) ∂2 = ∂k−1 ◦ ∂k = 0,
for all k.
Proof. For simplicity, consider the case of an abstract simplicial complex on a
vertex set V = {vi} with chain complex having F2 coefficients. The face map Di
acts on a simplex by removing the ith vertex vi from the simplex's list, if present;
else, it returns zero. The graded boundary operator ∂ : C• → C• is thus a formal
sum of face maps, ∂ = Σ_i Di. It suffices to show that ∂2 = 0 on each basis simplex σ.
Computing the composition in terms of face maps, one obtains:
(1.22) ∂2 σ = Σ_{i≠j} Dj Di σ .
Since removing vertex vi and then vertex vj yields the same face as removing
them in the opposite order, each term in this sum appears exactly twice; over F2
the pairs cancel, and ∂2 σ = 0.
(1.24) Hk(C) = Zk/Bk = ker ∂k / im ∂k+1 = cycles/boundaries.
Homology inherits the grading of the complex C and has trivial (zero) linear
transformations connecting the individual vector spaces. Elements of H• (C) are
homology classes and are denoted [α] ∈ Hk , where α ∈ Zk is a k-cycle and [·]
denotes the equivalence class modulo elements of Bk .
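Concretely, over F2 the definition reduces Betti numbers to two ranks: dim Hk =
(dim Ck − rank ∂k) − rank ∂k+1. The following sketch (illustrative code, reusing
boundary_matrix from the previous sketch) computes this by Gaussian elimination
mod 2.

import numpy as np

def rank_F2(D):
    """Rank of a 0/1 matrix over the binary field, by elimination mod 2."""
    A = D.copy()
    r = 0
    for c in range(A.shape[1]):
        pivots = np.nonzero(A[r:, c])[0]
        if pivots.size == 0:
            continue
        A[[r, r + pivots[0]]] = A[[r + pivots[0], r]]   # move a pivot up
        for i in np.nonzero(A[:, c])[0]:
            if i != r:
                A[i] ^= A[r]                            # clear column c mod 2
        r += 1
        if r == A.shape[0]:
            break
    return r

def betti(simplices, k):
    """dim H_k = dim ker d_k - dim im d_{k+1}, all over F2."""
    rk_k = rank_F2(boundary_matrix(simplices, k)) if k > 0 else 0
    next_nonempty = k + 1 < len(simplices) and simplices[k + 1]
    rk_k1 = rank_F2(boundary_matrix(simplices, k + 1)) if next_nonempty else 0
    return (len(simplices[k]) - rk_k) - rk_k1

# the hollow square at eps = 1.1: betti(..., 0) == 1 and betti(..., 1) == 1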
Exercise 1.25. If C has boundary maps that are all zero, what can you say about
H• (C)? What if all the boundary maps (except at the ends) are isomorphisms
(injective and surjective)?
As stated, the above theorem would seem to apply only to cell complexes.
However, as we will detail in Lecture 3, we can define homology for any topologi-
cal space independent of cell structure; to this, as well, the above theorem applies.
Thus, we can talk of the homology of a space independent of any cell structure
or concrete representation: homotopy type is all that matters. It therefore makes
sense to explore some basic examples. The following are the homologies of the
Application: Neuroscience
Each of these lectures ends with a sketch of some application(s): this first
sketch will focus on the use of Betti numbers. Perhaps the best-to-date exam-
ple of the use of homology in data analysis is the following recent work of
Giusti, Pastalkova, Curto, & Itskov [56] on network inference in neuroscience
using parametrized Betti numbers as a statistic.
Consider the challenge of inferring how a collection of neurons is wired to-
gether. Because of the structure of a neuron (in particular the length of the
axon), mere physical proximity does not characterize the wiring structure: neu-
rons which are far apart may in fact be “wired” together. Experimentalists can
measure the responses of individual neurons and their firing sequences as a re-
sponse to stimuli. By comparing time-series data from neural probes, the corre-
lations of neuron activity can be estimated, resulting in a correlation matrix with
entries, say, between zero and one, referencing the estimated correlation between
neurons, with the diagonal, of course, consisting of ones. By thresholding the
correlation matrix at some value, one can estimate the “wiring network” of how
neurons are connected.
Unfortunately, things are more complicated than this simple scenario suggests.
First, again, the problem of which threshold to choose is present. Worse, the cor-
relation matrix is not the truth, but an experimentally measured estimation that
relies on how the experiment was performed (Where were the probes inserted?
How was the spike train data handled?). Repeating an experiment may lead to a
very different correlation matrix – a difference not accountable by a linear trans-
formation. This means, in particular, that methods based on spectral properties
such as PCA are misleading [56].
What content does the experimentally-measured correlation matrix hold? The
entries satisfy an order principle: if neurons A and B seem more correlated than
C and D, then, in truth, they are. In other words, repeated experiments lead
to a nonlinear, but order-preserving, homeomorphism of the correlation axis. It
is precisely this nonlinear coordinate-free nature of the problem that prompts a
topological approach.
The approach is this. Given a correlation matrix R, let 1 ≥ ε ≥ 0 be a decreas-
ing threshold parameter, and, for each ε, let Rε be the binary matrix generated
from R with ones wherever the correlation exceeds ε. Let Xε be the Dowker com-
plex of Rε (or dual; the same, by symmetry). Then consider the kth Betti number
distribution βk : [1, 0] → N. These distributions are unique under change of cor-
relation axis coordinates up to order-preserving homeomorphisms of the domain.
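A sketch of this pipeline, chaining the illustrative dowker and betti functions
from Lecture 1 (the function name and the grid of thresholds below are assump-
tions for illustration):

import numpy as np

def betti_distribution(corr, eps_grid, k=1):
    """beta_k of the Dowker complex of the thresholded correlation matrix,
    recorded as the threshold eps decreases."""
    values = []
    for eps in eps_grid:
        R_eps = (corr > eps).astype(np.uint8)  # ones where correlation exceeds eps
        X_eps = dowker(R_eps, max_dim=k + 1)   # dimensions up to k+1 suffice
        values.append(betti(X_eps, k))
    return values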
What do these distributions look like? For ε → 1, the complex is an isolated
set of points, and for ε → 0 it is one large connected simplex: all the interesting
homology lies in the middle. It is known that when this sequence of simplicial
complexes is obtained by sampling points from a probability distribution, the
Betti distributions βk for k > 0 are unimodal. Furthermore, it is known that
homological peaks are ordered by dimension [5]: the peak value for β1 precedes
that of β2 , etc. Thus, what is readily available as a signature for the network is
the ordering of the heights of the peaks of the βk distributions. The surprise is
that this peak height data gives information about the distribution from which
the points were sampled.
In particular, one can distinguish between networks that are wired randomly
versus those that are wired geometrically. This is motivated by the neuroscience
applications. It has been known since the Nobel prize-winning work of O'Keefe
et al. that certain neurons in the hippocampus of rats act as place cells, encoding the
geometry of a learned domain (e.g., a maze) by how the neurons are wired [74], in a
manner not unlike that of a nerve complex [35]. Other neural networks are known
to be wired together randomly, such as the olfactory system of a fly [29]. Giusti et
al., relying on theorems about Betti number distributions for random geometric
complexes by Kahle [63], show that one can differentiate between geometrically-
wired and randomly wired networks by looking at the peak signatures of β1 ,
β2 , and β3 and whether the peaks increase [random] or decrease [geometric].
Follow-on work gives novel signature types [87]. The use of these methods is
Lecture 2: Persistence
We have covered the basic definitions of simplicial and algebraic complexes
and their homological invariants. Our goal is to pass from the mechanics of
invariants to the principles that animate the subject, culminating in a deeper un-
derstanding of how data can be qualitatively compressed and analyzed. In this
lecture, we will begin that process, using the following principles as a guide:
(1) A simplicial [or cell] complex is the right type of discrete data structure
for capturing the significant features of a space.
(2) A chain complex is a linear-algebraic representation of this data structure
– an algebraic set of assembly instructions.
(3) To prove theorems about how cell complexes behave under deformation,
study instead deformations of chain complexes.
(4) Homology is the optimal compression of a chain complex down to its
qualitative features.
Towards Functoriality
Our chief end is this: homology is functorial. This means that one can talk
not only about homology of a complex, but also of the homology of a map be-
tween complexes. To study continuous maps between spaces algebraically, one
translates the concept to chain complexes. Assume that X and Y are simplicial
complexes and f : X → Y is a simplicial map – a continuous map taking simplices
to simplices.4 This does not imply that the simplices map homeomorphically to
simplices of the same dimension.
In the same way that X and Y lift to algebraic chain complexes C• (X) and C• (Y),
the map f lifts to a graded sequence of linear transformations f• : C• (X) → C• (Y),
generated by basis n-simplices of X being sent to basis n-simplices of Y, where,
if an n-simplex of X is sent by f to a simplex of dimension less than n, then the
algebraic effect is to send the basis chain in Cn (X) to 0 ∈ Cn (Y). The continuity
of the map f induces a chain map f• that fits together with the boundary maps
of C•(X) and C•(Y) to form the following commutative diagram of vector spaces
and linear transformations:

(2.1)
· · · −→ Cn+1(X) −∂→ Cn(X) −∂→ Cn−1(X) −→ · · ·
            ↓ f•           ↓ f•          ↓ f•
· · · −→ Cn+1(Y) −∂→ Cn(Y) −∂→ Cn−1(Y) −→ · · ·

in which each square commutes: ∂ ◦ f• = f• ◦ ∂.
Exercise 2.2. Show that any chain map f• : C → C′ takes cycles to cycles and
boundaries to boundaries.
Exercise 2.4. Consider the disc in R2 of radius π punctured at the integer points
along the x and y axes. Although this space is not a cell complex, let us assume
that its homology is well-defined and is “the obvious thing” for H1 , defined by
the number of punctures. What are the induced homomorphisms on H1 of the
continuous maps given by (1) rotation by π/2 counterclockwise; (2) the folding
map x → |x|; (3) flipping along the y axis?
Exercise 2.5. Show using functoriality that homeomorphisms between spaces in-
duce isomorphisms on homologies.
Exercise 2.6. Can you find explicit counterexamples to the following statements
about maps f between simplicial complexes and their induced homomorphisms
H(f) (on some grading for homology)?
(1) If f is surjective then H(f) is surjective.
(2) If f is injective then H(f) is injective.
(3) If f is not surjective then H(f) is not surjective.
(4) If f is not injective then H(f) is not injective.
(5) If f is not bijective then H(f) is not bijective.
Exercise 2.8. In the above scenario, what can you conclude about H• (X) if H(f) is
an isomorphism? If it is merely injective? Surjective?
Sequences
Induced homomorphisms in homology are key, as central to homology as linear
transformations are to linear algebra. In these lectures, we have seen how, though
linear transformations between vector spaces are important, there is great advantage
in chaining linear transformations into sequences and complexes. The advent of
induced homomorphisms should prompt the same desire: to chain into sequences,
analyze, classify, and infer. This is the plan for
What one observes from this example is the evolution of homological features
over a sequence: homology classes are born, can merge, split, die, or persist.
This evolutionary process as written in the language of sequences is the algebraic
means of encoding notions of geometry, significance, and noise.
Persistence Let us formalize some of what we have observed. Consider a se-
quence of spaces (Xi ) and continuous transformations fi : Xi → Xi+1 , without
requiring subcomplexes and inclusions. We again have a sequence of homologies
with induced homomorphisms. A homology class in H• (Xi ) is said to persist if its
image in H• (Xi+1 ) is also nonzero; otherwise it is said to die. A homology class
in H• (Xj ) is said to be born when it is not in the image of H• (Xj−1 ).
One may proceed with this line of argument, at the expense of some sloppiness
of language. Does every homology class have an unambiguous birth and death?
Can we describe cycles this way, or do we need to work with classes of cycles
modulo boundaries? For the sake of precision and clarity, it is best to follow the
pattern of these lectures and pass to the context of linear algebra and sequences.
Consider a sequence V• of finite-dimensional vector spaces, graded over the
integers Z, and stitched together with linear transformations like so:
(2.13) V• = · · · −→ Vi−1 −→ Vi −→ Vi+1 −→ · · · .
These sequences are more general than algebraic complexes, which must satisfy the
restriction of composing two incident linear transformations yielding zero. Two
such sequences V• and V′• are said to be isomorphic if there are isomorphisms
Vk ≅ V′k which commute with the linear transformations in V• and V′• as in
Equation (2.1). The simplest such sequence is an interval indecomposable of the
form
(2.14) I• = · · · −→ 0 −→ 0 −→ F −Id→ F −Id→ · · · −Id→ F −→ 0 −→ 0 −→ · · · ,
where the length of the interval equals the number of Id maps, so that an interval
of length zero consists of 0 → F → 0 alone. Infinite or bi-infinite intervals are
also included as indecomposables.
Representation Theory A very slight amount of representation theory is all that
is required to convert a sequence of homologies into a useful data structure for
measuring persistence. Consider the following operation: sequences can be for-
mally added by taking the direct sum, ⊕, term-by-term and map-by-map. The
interval indecomposables are precisely indecomposable with respect to ⊕ and can-
not be expressed as a sum of simpler sequences, even up to isomorphism. The
following theorem, though simple, is suitable for our needs.
Theorem 2.15 (Structure Theorem for Sequences). Any sequence of finite dimen-
sional vector spaces and linear transformations decomposes as a direct sum of interval
indecomposables, unique up to reordering.
What does this mean? It’s best to begin with the basics of linear algebra, and
then see how that extends to homology.
Exercise 2.16. Any linear transformation A : Rn → Rm extends to a biinfinite se-
quence with all but two terms zero. How many different isomorphism classes of
decompositions into interval indecomposables are there? What types of intervals
are present? Can you interpret the numbers of the various types of intervals?
What well-known theorem from elementary linear algebra have you recovered?
persist, then die at particular (if perhaps infinite) parameter values. This decom-
position also impacts how we illustrate evolving homology classes. By drawing
pictures of the interval indecomposables over the [discretized] parameter line as
horizontal bars, we obtain a pictograph that is called a homology barcode.
Exercise 2.17. Consider a simple sequence of four vector spaces, each of dimen-
sion three. Describe and/or draw pictures of all possible barcodes arising from
such a sequence. Up to isomorphism, how many such barcodes are there?
consists of homology classes that persist: dim Hk (P[i, j]) equals the number of
intervals in the barcode of Hk (P) containing the parameter interval [i, j].
Exercise 2.19. If, in the indexing for a persistence complex, you have i < j < k < ℓ,
what is the relationship between the various subintervals of [i, ℓ] using {i, j, k, ℓ}
as endpoints? Draw the lattice of such intervals under inclusion. What is the
relationship between the persistent homologies on these subintervals?
Persistence Diagrams. Barcodes are not the only possible graphical presenta-
tion for persistent homology. Since there is a decomposition into homology
classes with well-defined initial and terminal parameter values, one can plot each
homology class as a point in the plane with axes the parameter line. To each
interval indecomposable (homology class) one assigns a single point with coordi-
nates (birth, death). This scatter plot is called the persistence diagram and is more
practical to plot and interpret than a barcode for very large numbers of homology
classes.
Exercise 2.20. In the case of a homology barcode coming from a data set, the
“noisy” homology classes are those with the smallest length, with the largest bars
holding claim as the “significant” topological features in a data set. What do these
noisy and significant bars translate to in the context of a persistence diagram? For
a specific example, return to the “clockface” data set of Exercise 2.12, but now
consider the set of all even points: 2, 4, 6, 8, 10, and 12. Show that the persistent
H2 contains a “short” bar. Are you surprised at this artificial bubble in the VR
complex? Does a similar bubble form in the homology when all 12 points are
used? In which dimension?
One aspect worth calling out is the notion of persistent homology as a ho-
mological data structure over the parameter space, in that one associates to each
interval [i, j] its persistent homology. This perspective is echoed in the early lit-
erature on the subject [32, 36, 41, 91], in which a continuous parameter space was
used, with a continuous family of (excursion sets) of spaces Xt , t ∈ R: in this
setting, persistent homology is assigned to an interval [s, t]. The discretized pa-
rameter interval offers little in the way of restrictions (unless you are working
with fractal-like or otherwise degenerate objects) and opens up the simple setting
of the Structure Theorem on Sequences as used in this lecture.
Stability
The idea behind the use of barcodes and persistence diagrams in data is
grounded in the intuition that essential topological features of a domain are ro-
bust to noise, whether arising from sensing, sampling, or approximation. In a
barcode, noisy features appear as short bars; in a persistence diagram, as near-
diagonal points. To solidify this intuition of robustness, one wants a more specific
statement on the stability of persistent homology. Can a small change in the input
[Interleaving diagram, garbled in this copy: maps f and g between the two
sequences, each shifting the index by T, whose composites agree with the internal
maps Wn → Wn+T → Wn+2T .]
Exercise 2.22. Verify that the interleaving distance of two sequences is zero if and
only if the two sequences are isomorphic.
Application: TDA
Topological Data Analysis, or TDA, is the currently popular nomenclature for
the set of techniques surrounding persistence, persistent homology, and the ex-
traction of significant topological features from data. The typical input to such a
problem is a point cloud Q in a Euclidean space, though any finite metric space
will work the same. Given such a data set, assumed to be a noisy sampling of
some domain of interest, one wants to characterize that domain. Such questions
are not new: linear regression assumes an affine space and returns a best fit; a
variety of locally-linear or nonlinear methods look for nonlinear embeddings of
Euclidean spaces.
Topological data analysis looks for global structure — homology classes — in
a manner that is to some degree decoupled from rigid geometric considerations.
This, too, is not entirely novel. Witness clustering algorithms take a point cloud
and return a partition that is meant to approximate connected components. Of
course, this reminds one of H0 , and the use of a Vietoris-Rips complex makes this
precise: single linkage clustering is precisely the computation of H0(VRε(Q)) for a
choice of ε > 0. Which choice of ε is best? The lesson of persistence is to take all ε
and build the homology barcode. Notice, however, that the barcode returns only
the dimension of H0 — the number of clusters — and to more carefully specify
the clusters, one needs an appropriate basis. There are many other clustering
schemes with interesting functorial interpretations [26, 27].
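To make the dictionary explicit, the following sketch (a standard Kruskal-style
union-find, illustrative rather than code from these notes) computes the H0 bar-
code of the Vietoris-Rips filtration directly from a distance matrix: every point is
born at ε = 0, and a component's class dies at the edge length that merges it into
another.

import itertools

def h0_barcode(dist):
    """(birth, death) intervals for H0 of the Vietoris-Rips filtration."""
    n = dist.shape[0]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x
    bars = []
    for eps, i, j in sorted((dist[i, j], i, j)
                            for i, j in itertools.combinations(range(n), 2)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            bars.append((0.0, eps))          # one component dies at eps
    bars.append((0.0, float("inf")))         # the last component never dies
    return bars

Cutting the bars at height ε recovers the single linkage clusters at that scale.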
The ubiquity and utility of clustering is clear. What is less clear is the preva-
lence and practicality of higher-dimensional persistent homology classes in “or-
ganic” data sets. Using again a Vietoris-Rips filtration of simplicial complexes
on a point cloud Q allows the computation of homology barcodes in gradings
larger than zero. To what extent are they prevalent? The grading of homology
is reminiscent of the grading of polynomials in Taylor expansions. Though Tay-
lor expansions are undoubtedly useful, it is acknowledged that the lowest-order
terms (zeroth and first especially) are most easily seen and used. Something like
this holds in TDA, where one most readily sees clusters (H0 ) and simple loops
(H1 ) in data. The following is a brief list of applications known to the author.
The literature on TDA has blown-up of late to a degree that makes it impossi-
ble to give an exhaustive account of applications. The following are chosen as
illustrative of the basic principles of persistent homology.
Medical imaging data: Some of the earliest and most natural applications of TDA
were to image analysis [2, 13, 25, 79]. One recent study by Bendich et al. looks at
the structure of arteries in human brains [14]. These are highly convoluted path-
ways, with lots of branching and features at multiple scales, but which vary in
dramatic and unpredictable ways from patient to patient. The topology of arterial
structures is globally trivial — sampling the arterial structure through standard
imaging techniques yields a family of trees (acyclic graphs). Nevertheless, since
the geometry is measurable, one can filter these trees by sweeping a plane across
in solutions/solvents with the protein. Hollow cavities within the protein’s geo-
metric conformation are contributing factors, both size and number. Gameiro et
al. propose a topological compressibility that is argued to measure the relative con-
tributions of these features, but with minimal experimental measurement, using
nothing more as input than the standard molecular datasets that record atom lo-
cations as a point cloud, together with a van der Waals radius about each atom.
What is interesting in this case is that one does not have the standard Vietoris-
Rips filtered complex, but rather a filtered complex obtained by starting with the
van der Waals radii (which vary from atom to atom) and then adding to these
radii the filtration parameter ε > 0. The proposed topological compressibility is
a ratio of the number of persistent H2 intervals divided by the number of persis-
tent H1 intervals (where the intervals are restricted to certain parameter ranges).
This ratio is meant to serve as a proxy for the experimental measurement of cavi-
ties and tunnels in the protein’s structure. Comparisons with experimental data
suggest, with some exceptions, a tight linear correlation between the expensive
experimentally-measured compressibility and the (relatively inexpensive) topo-
logical compressibility.
These varied examples are merely summaries: see the cited references for more
details. The applications of persistent homology to data are still quite recent, and
by the time of publication of these notes, there will have been a string of novel
applications, ranging from materials science to social networks and more.
Lecture 3: Compression and Computation
Sequential Manipulation
Not surprisingly, these lectures take the perspective that the desiderata for ho-
mological data analysis include a calculus for complexes and sequences of com-
plexes. We have seen hints of this in Lecture 2; now, we proceed to introduce a
bit more of the rich structure that characterizes (the near-trivial linear-algebraic
version of) homological algebra. Instead of focusing on spaces or simplicial com-
plexes per se, we focus on algebraic complexes; this motivates our examination
of certain types of algebraic sequences.
Exact Complexes Our first new tool is inspired by the question: among all com-
plexes, which are simplest? Simplicial complexes might suggest that the simplest
sort of complex is that of a single simplex, which has homology vanishing in all
gradings except zero. However, there are simpler complexes still. We say that
an algebraic complex C = (C• , ∂) is exact if its homology completely vanishes,
H• (C) = 0. This is often written termwise as:
(3.1) ker ∂k = im ∂k+1 ∀ k.
Exercise 3.2. What can you say about the barcode of an exact complex? (This
means the barcode of the complex, not its [null] homology).
The following simple examples of exact complexes help build intuition:
• Two vector spaces are isomorphic, V ≅ W, iff there is an exact complex of
the form:
0 → V → W → 0 .
The linear transformations between the homologies are where all the interesting
details lie. These consist of: (1) H(ψ), which adds the homology classes from A
and B to give a homology class in X; (2) H(φ), which reinterprets a homology
class in A ∩ B to be a homology class of A and an (orientation reversed!) homology
Exercise 3.5. Assume the following: for any k ≥ 0, (1) Hn(Dk) = 0 for all n > 0;
(2) the 1-sphere S1 has H1(S1) ≅ F and Hn(S1) = 0 for all n > 1. The computation
of H•(Sk) can be carried out via Mayer-Vietoris as follows. Let A and B be upper
and lower hemispheres of Sk, each homeomorphic to Dk and intersecting at
an equatorial Sk−1. Write out the Mayer-Vietoris complex in this case: what
can you observe? As Hn(Dk) ≅ 0 for n > 0, one obtains by exactness that
δ : Hn(Sk) ≅ Hn−1(Sk−1) for all n and all k. Thus, starting from a knowledge
of H•(S1), show that, for k > 0, Hn(Sk) ≅ 0 for all n > 0 except n = k, where it
has dimension equal to one.
(3.10) 0 → A• −i→ B• −j→ C• → 0
The induced connecting homomorphism δ : Hn (C) → Hn−1 (A) comes from the
boundary map in C as follows:
(1) Fix [γ] ∈ Hn (C); thus, γ ∈ Cn .
(2) By exactness, γ = j(β) for some β ∈ Bn .
(3) By commutativity, j(∂β) = ∂(jβ) = ∂γ = 0.
(4) By exactness, ∂β = iα for some α ∈ An−1 .
(5) Set δ[γ] = [α] ∈ Hn−1 (A).
Exercise 3.12. This is a tedious but necessary exercise for anyone interested in
homological algebra: (1) show that δ[γ] is well-defined and independent of all
choices; (2) show that the resulting long complex is exact. Work ad tedium: for
help, see any textbook on algebraic topology, [58] recommended.
Euler Characteristic, Redux Complexes solve the mystery of the topological
invariance of the Euler characteristic. Recall that we can define the Euler char-
acteristic of a (finite, finite-dimensional) complex C as in Equation (1.13). The
alternating sum interacts with exactness in a basic way: a short exact complex of
vector spaces 0 → A → B → C → 0 has χ = 0, since C ≅ B/A. By applying this to individ-
ual rows of a short exact complex of (finite, finite-dimensional) chain complexes,
we can lift once again to talk about the Euler characteristic of a (finite-enough)
complex of complexes:
(3.13) 0 → A• → B• → C• → 0 .
One sees that χ of this complex also vanishes: χ(A• ) − χ(B• ) + χ(C• ) = 0.
The following lemma is the homological version of the Rank-Nullity Theorem
from linear algebra:
Lemma 3.14. The Euler characteristic of a chain complex C• and its homology H• are
identical, when both are defined.
Proof. From the definitions of homology and chain complexes, one has two short
exact complexes of chain complexes:
(3.15)
0 → B• → Z• → H• → 0
0 → Z• → C• → B•−1 → 0
Here, B•−1 is the shifted boundary complex whose kth term is Bk−1 . By exact-
ness, the Euler characteristic of each of these two complexes is zero; thus, so is
the Euler characteristic of their concatenation.
0 → B• → Z• → H• → 0 → Z• → C• → B•−1 → 0 .
Count the +/− signs: the Z terms cancel and, since χ(B•−1 ) = −χ(B• ), the B
terms cancel. This leaves two terms and we conclude that χ(H• ) − χ(C• ) = 0.
Euler characteristic thus inherits its topological invariance from that of ho-
mology. Where does the invariance of homology come from? Something more
complicated still?
Homology Theories
Invariance of homology is best discerned from a singularly uncomputable vari-
ant that requires a quick deep dive into the plethora of homologies available. We
begin with a reminder: homology is an algebraic compression scheme — a way
of collapsing a complex to the simplest form that respects its global features. The
notion of homology makes sense for any chain complex. Thus far, our only means
of generating a complex from a space X has been via some finite auxiliary struc-
ture on X, such as a simplicial, cubical, or cellular decomposition. There are other
types of structures a space may carry, and, with them, other complexes. In the
same way that the homology of a simplicial complex is independent of the simpli-
cial decomposition, the various homologies associated to a space under different
auspices tend to be isomorphic.
Reduced homology Our first alternate theory is not really a different type of
homology at all; merely a slight change in the chain complex meant to make
contractible (or rather acyclic) spaces fit more exactly into homology theory. Re-
call that a contractible cell complex — such as a single simplex — has homology
Hk = 0 for all k > 0, with H0 being one-dimensional, recording the fact that
the cell complex is connected. For certain results in algebraic topology, it would
be convenient to have the homology of a contractible space vanish completely.
This can be engineered by an augmented complex in a manner that is applicable to
any N-graded complex. Assume for simplicity that C = (Ck , ∂) is a N-graded
complex of vector spaces over a field F. The reduced complex is the following
augmentation:
(3.16) · · · −∂→ C3 −∂→ C2 −∂→ C1 −∂→ C0 −ε→ F → 0 ,
Exercise 3.17. Show that the reduced complex is in fact a complex: i.e., that
ε ◦ ∂ = 0. How does the reduced homology of a complex differ from the “ordinary”
homology? What is the dependence on the choice of augmentation map ε? Show
that the augmented complex of a contractible simplicial complex is exact.
There is little hope in computing the resulting singular homology, save for the
fact that this homology is, blessedly, an efficient compression.
Theorem 3.19. For a cell complex, singular and cellular homology are isomorphic.
When combined with Theorem 3.19, we obtain a truly useful, computable re-
sult. The proof of Theorem 3.20 does not focus on spaces at all, but rather, in
the spirit of these lectures, pulls back the notion of homotopy to complexes. Re-
call that f, g : X → Y are homotopic if there is a map F : X × [0, 1] → Y which
restricts to f on X × {0} and to g on X × {1}. A chain homotopy between chain maps
ϕ•, ψ• : C → C′ is a graded linear transformation F : C → C′ sending n-chains to
(n + 1)-chains so that ∂F − F∂ = ϕ• − ψ• :
(3.21)
· · · −→ Cn+1 −∂→ Cn −∂→ Cn−1 −∂→ · · ·
  [vertical maps ϕ• , ψ• : Cn → C′n ; diagonal maps F : Cn → C′n+1 ]
· · · −→ C′n+1 −∂→ C′n −∂→ C′n−1 −∂→ · · ·
One calls F a map of degree +1, indicating the upshift in the grading.5 Note
the morphological resemblance to homotopy of maps: a chain homotopy maps
each n-chain to an (n + 1)-chain, the algebraic analogue of a 1-parameter family.
The difference between the ends of the homotopy, ∂F − F∂, gives the difference
between the chain maps.
Exercise 3.22. Show that two chain homotopic maps induce the same homomor-
phisms on homology. Start by considering [α] ∈ H• (C), assuming ϕ• and ψ• are
chain homotopic maps from C to C′ .
The proof of Theorem 3.20 follows from constructing an explicit chain homo-
topy [58].
Morse Homology All the homology theories we have looked at so far have
used simplices or cells as basis elements of chains and dimension as the grading.
There is a wonderful homology theory that breaks this pattern in a creative and,
eventually, useful manner. Let M be a smooth, finite-dimensional Riemannian
5 The overuse of the term degree in graphs, maps of spheres, and chain complexes is unfortunate.
Discrete Morse theory shows that the classical constraints – manifolds, smooth
dynamics, nondegenerate critical points – are not necessary. This point is worthy
of emphasis: the classical notion of a critical point (maximum, minimum, saddle)
is distilled away from its analytic and dynamical origins until only the algebraic
spirit remains.
Applications of discrete Morse theory are numerous and expansive, including
to combinatorics [65], mesh simplification [67], image processing [79], configu-
ration spaces of graphs [43, 44], and, most strikingly, efficient computation of
homology of cell complexes [69]. This will be our focus for applications to com-
putation at the end of this lecture.
Application: Algorithms
Advances in applications of homological invariants have been and will remain
inextricably linked to advances in computational methods for such invariants. Re-
cent history has shown that potential applications are impotent when divorced
from computational advances, and computation of unmotivated quantities is fu-
tile. The reader who is interested in applying these methods to data is no doubt
interested in knowing the best and easiest available software. Though this is not
the right venue for a discussion of cutting-edge software, there are a number
of existing software libraries/packages for computing homology and persistent
homology of simplicial or cubical complexes, some of which are exciting and
deep. As of the time of this writing, the most extensive and current benchmark-
ing comparing available software packages can be found in the preprint of Otter
et al. [75]. We remark on and summarize a few of the issues involved with com-
puting [persistent] homology, in order to segue into how the theoretical content
of this lecture impacts how software can be written.
Time complexity: The computation of homology is output-sensitive, meaning that the complexity of computing homology is a function of how large the homology is, as opposed to how large the complex is. What this means in practice is that the homology of a simple complex is simple to compute. The time-complexity of computing homology is, by output-sensitivity, difficult to specify tightly. The
standard algorithm to compute H• (X) for X a simplicial complex is to compute the Smith normal form of the graded boundary map ∂ : C• → C•, where we concatenate the various gradings into one large vector space. This graded boundary map is, by definition, nilpotent: it has a block structure with zero blocks on the block-diagonal (since ∂_k : C_k → C_{k−1}) and is nonzero on the superdiagonal blocks. The algorithm for computing Smith normal form is really a slight variant of the ubiquitous Gaussian elimination, with reduction to the normal form via elementary row and column operations. For field coefficients in F2 this reduction is easily seen to be of time-complexity O(n³) in the size of the matrix, with an expected run time of O(n²). This is not encouraging, given the typical sizes seen
in applications. Fortunately, compression preprocessing methods exist, as we will
detail.
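To make the counting concrete, here is a minimal sketch in Python (illustrative only, not production software): it assumes the boundary maps ∂_k are handed to us as dense 0-1 matrices over F2, computes ranks by naive Gaussian elimination (the O(n³) reduction discussed above), and reads off Betti numbers via β_k = dim C_k − rank ∂_k − rank ∂_{k+1}. The function names and the hollow-triangle example are our own.

import numpy as np

def rank_f2(M):
    """Rank of a 0-1 matrix over the field F2, via Gaussian elimination."""
    M = (np.array(M, dtype=np.uint8) % 2).copy()
    rank, rows, cols = 0, M.shape[0], M.shape[1]
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]      # move pivot row up
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]                  # clear column c (mod 2)
        rank += 1
    return rank

def betti_numbers(boundaries, dims):
    """betti[k] = dim C_k - rank d_k - rank d_{k+1}; boundaries[k] is d_k."""
    ranks = {k: rank_f2(d) for k, d in boundaries.items()}
    return {k: dims[k] - ranks.get(k, 0) - ranks.get(k + 1, 0) for k in dims}

# Hollow triangle: three vertices, three edges, no 2-cell.
d1 = [[1, 0, 1],
      [1, 1, 0],
      [0, 1, 1]]          # rows = vertices, columns = edges
print(betti_numbers({1: d1}, {0: 3, 1: 3}))  # {0: 1, 1: 1}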
Memory and inputs: Time-complexity is not the only obstruction; holding a
complex in memory is nontrivial, as is the problem of inputting a complex. A
typical simplicial complex is specified by fixing the simplices as basis and then
specifying the boundary matrices. For very large complexes, this is prohibitive
and unnecessary, as the boundary matrices are typically sparse. There are a
number of ways to reduce the input cost, including inputting (1) a distance matrix
spelling out an explicit metric between points in a point cloud, using a persistent
Dowker complex (see Exercise 2.10) to build the filtered complex; (2) using voxels
in a lattice as means of coordinatizing top-dimensional cubes in a cubical complex,
and specifying the complex as a list of voxels; and (3) using the Vietoris-Rips
complex of a network of nodes and edges, the specification of which requires
only a quadratic number of bits of data as a function of nodes.
Exercise 3.28. To get an idea of how the size of a complex leads to an inefficient complexity bound, consider a single simplex, Δ^n, and cube, I^n, each of dimension n. How many total simplices/cubes are in each? Include all faces of all dimensions. Computing the [nearly trivial] homology of such a simple object requires, in principle, computing the Smith normal form of a graded boundary matrix of what net size?
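(For calibration: Δ^n has 2^{n+1} − 1 faces in all dimensions, one for each nonempty subset of its n + 1 vertices, while I^n has 3^n cells, since each coordinate independently contributes a left endpoint, a right endpoint, or an interior interval. The graded boundary matrix is thus exponentially large in n, and a cubic-time reduction on it is exponential in n, despite the homology being that of a point.)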
Morse theory & Compression: One fruitful approach for addressing the computation of homology is to consider alternate intermediate compression schemes. If, instead of applying Smith normal form directly to a graded boundary operator, one modifies the complex first to obtain a smaller chain-homotopic complex, then the resulting complexity bounds may collapse with a dramatic decrease in size of the input. There have been many proposals for reduction and coreduction of chain complexes that preserve homology: see [62] for examples. One clear and successful compression scheme comes from discrete Morse theory. If one puts a discrete gradient field on a cell complex, then the resulting Morse complex is smaller, potentially much smaller, being generated only by critical cells. The process of defining and constructing an associated discrete Morse complex is roughly linear in the size of the cell complex [69] and thus gives an efficient approach to homology computation. This has been implemented in the popular software package Perseus (see [75]).
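To give the flavor of such a reduction, here is a toy sketch in Python of a greedy collapse scheme in the spirit of discrete Morse theory (a simplified variant for illustration, not the algorithm of [69]): repeatedly cancel a free face against its unique coface, and, when stuck, delete a maximal simplex and record it as critical. Only the critical cells would generate the Morse complex. All names here are our own.

from itertools import combinations

def faces(s):
    """Codimension-1 faces of a simplex, given as a sorted tuple of vertices."""
    return [s[:i] + s[i+1:] for i in range(len(s))]

def greedy_discrete_morse(simplices):
    """Pair a free face with its unique coface when possible; otherwise
    delete a maximal simplex and record it as a critical cell."""
    complex_ = set(simplices)
    critical = []
    while complex_:
        free = None
        for s in complex_:
            cofaces = [t for t in complex_ if len(t) == len(s) + 1 and s in faces(t)]
            if len(cofaces) == 1:
                free = (s, cofaces[0])
                break
        if free:
            complex_.discard(free[0]); complex_.discard(free[1])  # collapse the pair
        else:
            top = max(complex_, key=len)                          # a maximal simplex
            complex_.discard(top); critical.append(top)
    return critical

# Hollow triangle: one critical vertex and one critical edge survive,
# matching the Betti numbers (1, 1) of the circle.
K = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(greedy_discrete_morse(K))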
Modernizing Morse Theory: Morse homology, especially the discrete version, has
not yet been fully unfolded. There are several revolutionary approaches to Morse
theory that incorporate tools outside the bounds of these lectures. Nevertheless,
it is the opinion of this author that we are just realizing the full picture of the cen-
trality of Morse theory in Mathematics and in homological algebra in particular.
Two recent developments are worth pointing out as breakthroughs in conceptual
frameworks with potentially large impact. The first, in the papers by Nanda et
al. [71, 72], gives a categorical reworking of discrete Morse theory that relaxes the notion
of a discrete vector field to allow for any acyclic pairing of cells and faces without
restriction on the dimension of the face. It furthermore shows how to reconstruct
the topology of the original complex (up to homotopy type, not homology type)
using only data about critical cells and the critical discrete flowlines. Though the
tools used are formidable (2-categories and localization), the results are equally
strong.
Matroid Morse Theory: The second contribution on the cusp of impact comes in the thesis of Henselman [59], which proposes matroid theory as the missing link between Morse theory and homological algebra. Matroids are classical structures in the intersection of combinatorial topology and linear algebra and have no end of interesting applications in optimization theory. Henselman recasts discrete Morse theory and persistent homology both in terms of matroids, then exploits matroid-theoretic principles (rank, modularity, minimal bases) in order to generate efficient algorithms for computing persistent homology and barcodes. This work has already led to an initial software package Eirene that, as of the time of this writing, has computed persistent H_k for 0 ≤ k ≤ 7 of a filtered 8-dimensional simplicial complex obtained as the Vietoris-Rips complex of a random sampling of 50 points in dimension 20, with a total of 3.3 × 10^11 simplices, on a PC laptop with an i7 quadcore and 16GB of RAM, in 11.1 seconds, with peak memory use of 1.8GB. This computation compares very favorably with the fastest-available software, Ripser, on a cluster of machines, as recorded in the survey of Otter et al. [75]: Ripser computes the persistent homology of this complex in 349 seconds with a peak memory load of 24.7GB. This portends much more to come, both at the level of conceptual understanding and computational capability.
This prompts the theme of our next lecture, that in order to prepare for in-
creased applicability, one must ascend and enfold tools and perspectives of in-
creasing generality and power.
Exercise 4.2. Fix a triangulated disc D^2 and consider cochains using F2 coefficients. What do 1-cocycles look like? Show that any such 1-cocycle is the coboundary of a 0-cochain which labels vertices with 0 and 1 on the left and on the right of the 1-cocycle, so to speak: this is what a trivial class in H^1(D^2) looks like. Now fix a circle S^1 discretized as a finite graph and construct examples of 1-cocycles that are (1) coboundaries; and (2) nonvanishing in H^1. What is the difference between the trivial and nontrivial cocycles on a circle?
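(A quick check for the circle: discretize S^1 as a triangle, with three vertices, three edges, and no 2-cells, so that every 1-cochain is automatically a cocycle. For a 0-cochain s, the coboundary assigns to an edge the sum of the values of s at its two endpoints; summing a coboundary over all three edges counts each vertex twice, giving 0 in F2. The cochain assigning 1 to a single edge has total sum 1 and therefore cannot be a coboundary: it represents the nonzero class in H^1(S^1).)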
The previous exercise foreshadows the initially depressing truth: nothing new is gained by computing cohomology, in the sense that H_n(X) and H^n(X) have the same dimension for each n. Recall, however, that there is more to co/homology than just the Betti numbers. Functoriality is key, and there is a fundamental difference in how homology and cohomology transform.
Alexander Duality There are numerous means by which duality expresses itself in the form of cohomology. One of the most useful and ubiquitous of these is known as Alexander duality, which relates the homology and cohomology of a subset of a sphere S^n (or, with a puncture, R^n) and its complement. The following is a particularly simple form of that duality theorem: for a compact, locally contractible, nonempty, proper subset X ⊂ S^n, there are isomorphisms H̃_i(S^n − X) ≅ H̃^{n−i−1}(X) for all i. Note that the reduced theory is used for both homology and cohomology.
Cohomology and Calculus Most students initially view cohomology as more obtuse than homology; however, there are certain instances in which cohomology is the most natural operation. Perhaps the most familiar such setting comes from calculus. As seen in Exercise 3.3 from Lecture 2, the familiar constructs of vector calculus on R^3 fit into an exact complex. This exactness reflects the fact that R^3 is topologically trivial [contractible]. Later, in Exercise 4.2, you looked at simplicial 1-cocycles and hopefully noticed that whether or not they are null in H^1 depends on whether or not these cochains are simplicial gradients of 0-cochains on the vertex set. These exercises together hint at the strong relationship between cohomology and calculus.
The use of gradient, curl, and divergence for vector calculus is, however, an unfortunate vestige of the philosophy of calculus-for-physics as opposed to a more modern calculus-for-data sensibility. A slight modern update sets the stage better for cohomology. For U ⊂ R^n an open set, let Ω^k(U) denote the differentiable k-form fields on U (a smooth choice of multilinear antisymmetric functionals on ordered k-tuples of tangent vectors at each point). For example, Ω^0 consists of smooth functions, Ω^1 consists of 1-form fields, viewable (in a Euclidean setting) as duals to vector fields, Ω^n consists of signed densities on U times the volume form, and Ω^{k>n}(U) = 0. There is a natural extension of differentiation (familiar from implicit differentiation in calculus class) that gives a coboundary map d : Ω^k → Ω^{k+1}, yielding the deRham complex,

(4.6)   0 → Ω^0(U) --d--> Ω^1(U) --d--> Ω^2(U) --d--> ··· --d--> Ω^n(U) → 0.

As one would hope, d² = 0, in this case due to the fact that mixed partial derivatives commute: you worked this out explicitly in Exercise 3.3. The resulting cohomology of this complex, the deRham cohomology H^•(U), is isomorphic to the singular cohomology of U using R coefficients.
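Spelled out in the lowest degree (a quick check rather than a proof in general): for f ∈ Ω^0(U),

    df = Σ_i (∂f/∂x_i) dx_i,    d(df) = Σ_{i<j} ( ∂²f/∂x_i∂x_j − ∂²f/∂x_j∂x_i ) dx_i ∧ dx_j = 0,

the antisymmetry dx_j ∧ dx_i = −dx_i ∧ dx_j collecting the mixed partials into cancelling pairs.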
This overlap between calculus and cohomology is neither coincidental nor con-
cluded with this brief example. A slightly deeper foray leads to an examination
of the Laplacian operator (on a manifold with some geometric structure). The
well-known Hodge decomposition theorem then gives, among other things, an iso-
morphism between the cohomology of the manifold and the harmonic differential
forms (those in the kernel of the Laplacian). For more information on these con-
nections, see [19].
What is especially satisfying is that the calculus approach to cohomology and
the deRham theory feeds back to the simplicial: one can export the Laplacian and
the Hodge decomposition theorem to the cellular world (see [51, Ch. 6]). This,
then, impacts data-centric problems of ranking and more over networks.
Cohomology and Ranking Cohomology arises in a surprising number of different contexts. One natural example that follows easily from the calculus-based perspective on cohomology lives in certain Escherian optical illusions, such as impossible tribars, eternally cyclic waterfalls, or neverending stairs. When one looks at an Escher staircase, the drawn perspective is locally realizable (one can construct a local perspective function), but a global extension cannot be defined. Thus, an Escher-like loop is really a nonzero class in H^1 (as first pointed out by Penrose [78]).
This is not disconnected from issues of data. Consider the problem of ranking.
One simple example that evokes nontrivial 1-cocycles is the popular game of Rock,
Paper, Scissors, for which there are local but not global ranking functions. A local
gradient of rock-beats-scissors does not extend to a global gradient. Perhaps this
is why customers are asked to conduct rankings (e.g., Netflix movie rankings or
Amazon book rankings) as a 0-cochain (“how many stars?”), and not as a 1-cochain
(“which-of-these-two-is-better?”): nontrivial H^1 is, in this setting, undesirable. The Condorcet paradox – that locally consistent comparative rankings can lead to global inconsistencies – is an appearance of H^1 in ranking theory.
There are less frivolous examples of precisely this type of application, leverag-
ing the language of gradients and curls to realize cocycle obstructions to perfect
rankings in systems. The paper of Jiang et al. [61] interprets the simplicial cochain
complex of the clique/flag complex of a network in terms of rankings. For exam-
ple, the (R-valued) 0-cochains are interpreted as numerical score functions on the
nodes of the network; the 1-cochains (supported on edges) are interpreted as pair-
wise preference rankings (with oriented edges and positive/negative values deter-
mining which is preferred over the other); and the higher-dimensional cochains
represent more sophisticated local orderings of nodes in a clique [simplex]. They
then resort to the calculus-based language of grad, curl, and div to build up the
cochain complex and infer from its cohomology information about existence and
nonexistence of compatible ranking schemes over the network. Their use of the
Laplacian and the Hodge decomposition theorem permits projection of noisy or
inconsistent ranking schemes onto the nearest consistent ranking.
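This projection fits in a few lines of Python. The sketch below (illustrative, not the pipeline of [61]) takes the Rock, Paper, Scissors triangle, encodes the pairwise preferences as a 1-cochain y on oriented edges, and least-squares projects y onto the gradients of 0-cochains: the projection is the consistent score function, and the residual is the cyclic component obstructing a global ranking.

import numpy as np

# Oriented edges of a 3-node network (Rock=0, Paper=1, Scissors=2).
edges = [(0, 1), (1, 2), (2, 0)]
# Pairwise preference 1-cochain: y[e] > 0 means head beats tail.
y = np.array([1.0, 1.0, 1.0])          # P>R, S>P, R>S : a pure cycle

# grad: 0-cochains (scores) -> 1-cochains, (grad s)[(u,v)] = s[v] - s[u].
grad = np.zeros((len(edges), 3))
for k, (u, v) in enumerate(edges):
    grad[k, u], grad[k, v] = -1.0, 1.0

# Least squares: project y onto gradients of score functions.
s, *_ = np.linalg.lstsq(grad, y, rcond=None)
residual = y - grad @ s
print(s - s.mean())    # scores all equal: no meaningful global ranking
print(residual)        # the cyclic component survives: [1, 1, 1]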
There are more sophisticated variants of these ideas, with applications pass-
ing beyond finding consistent rankings or orderings. Recent work of Gao et
al. [49] gives a cohomological and Hodge-theoretic approach to synchronization
problems over networks based on pairwise nodal data in the presence of noise.
Singer and collaborators [85, 86] have published several works on synchronization problems arising in cryo-electron microscopy.
Cellular Sheaves
One of the most natural uses for cohomology comes in the form of a yet-more-
abstract theory that is the stated end of these lectures: sheaf cohomology. Our
perspective is that a sheaf is an algebraic data structure tethered to a space (gen-
erally) or simplicial complex (in particular). In keeping with the computational
and linear-algebraic focus of this series, we will couch everything in the language
of linear algebra. The more general theory of sheaves [20, 64, 83] extends far beyond this setting.
Fix X a simplicial (or regular cell) complex, with ≤ denoting the face relation: σ ≤ τ if and only if σ ⊂ τ. A cellular sheaf over X, F, is generated by (1) an assignment to each simplex σ of X of a stalk, a vector space F(σ); and (2) to each face pair σ ≤ τ of a restriction map, a linear transformation F(σ ≤ τ) : F(σ) → F(τ). This data must respect the manner in which the simplicial complex is assembled, meaning that faces of faces satisfy the composition rule:

(4.7)   ρ ≤ σ ≤ τ  ⇒  F(ρ ≤ τ) = F(σ ≤ τ) ◦ F(ρ ≤ σ).

The trivial face τ ≤ τ by default has the identity isomorphism F(τ ≤ τ) = Id as its restriction map. Again, if one thinks of the stalks as the data over the individual simplices, then, in the same manner that the simplicial complex is glued up by face maps, the sheaf is assembled by the system of linear transformations.
One simple example of a sheaf on a cell complex X is that of the constant sheaf, F_X, taking values in vector spaces over a field F. This sheaf assigns F to every cell and the identity map Id : F → F to every face relation σ ≤ τ. In contrast, the skyscraper sheaf over a single cell σ of X is the sheaf F_σ that assigns F to σ and 0 to all other cells and face maps.
Exercise 4.8. Consider the following version of a random rank-1 sheaf over a
simplicial complex X. Assign the field F to every simplex. To each face relation σ ≤ τ assign either Id or 0 according to some (your favorite) random process. Does this
always give you a sheaf? How does this depend on X? What is the minimal set
of assumptions you would need to make on either X or the random assignment
in order to guarantee that what you get is in fact a sheaf?
One thinks of the values of the sheaf over cells as being data and the restric-
tion maps as something like local constraints or relationships between data. It’s
very worthwhile to think of a sheaf as programmable – one has a great deal of
freedom in encoding local relationships. For example, consider the simple linear recurrence u_{n+1} = A_n u_n, where u_n ∈ R^k is a vector of states and A_n is a k-by-k real matrix. Such a discrete-time dynamical system can be represented as a sheaf F of states over the time-line R, with the cell structure on R having Z as vertices, where F has constant stalks R^k. One programs the dynamics of the recurrence relation as follows: F({n} ≤ (n, n + 1)) is the map u ↦ A_n u and F({n + 1} ≤ (n, n + 1)) is the identity. Compatibility of local solutions over the sheaf is, precisely, the condition for being a global solution to the dynamics.
Local and Global Sections One says that the sheaf is generated by its values on individual simplices of X: the stalk F(τ) over a cell τ is also called the local sections of F on τ, and one writes s_τ ∈ F(τ) for a local section over τ. Though the sheaf is generated by local sections, there is more to a sheaf than its generating data, just as there is more to a vector space than its basis. The restriction maps of a sheaf encode how local sections can be continued into larger sections. One glues together local sections by means of the restriction maps. The value of the sheaf F on all of X is defined to be those collections of local sections that continue according to the restriction maps on faces. The global sections of F on X are defined as:

(4.9)   F(X) = { (s_τ)_{τ∈X} : s_σ = F(ρ ≤ σ)(s_ρ) ∀ ρ ≤ σ } ⊂ ∏_τ F(τ).
Exercise 4.10. Show that in the example of a sheaf for the recurrence relation u_{n+1} = A_n u_n, the global solutions to this dynamical system are classified by the global sections of the sheaf.
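As a sanity check on the exercise, one can compute the space of global sections directly over a finite window of the timeline. The sketch below (our own illustration: a finite truncation, with hypothetical function names) stacks the compatibility constraints u_{n+1} − A_n u_n = 0 into one matrix and reads off the dimension of the space of global sections as the dimension of its kernel; solutions are parametrized by the initial condition u_0, so the answer is k.

import numpy as np

def global_sections_dim(As, k, tol=1e-10):
    """Dimension of the global sections of the recurrence sheaf over a
    finite window: unknowns u_0, ..., u_N stacked, one block row of
    constraints u_{n+1} - A_n u_n = 0 per edge."""
    N = len(As)
    M = np.zeros((N * k, (N + 1) * k))
    for n, A in enumerate(As):
        M[n*k:(n+1)*k, n*k:(n+1)*k] = -A                 # -A_n u_n
        M[n*k:(n+1)*k, (n+1)*k:(n+2)*k] = np.eye(k)      # +u_{n+1}
    sv = np.linalg.svd(M, compute_uv=False)
    rank = int((sv > tol).sum())
    return (N + 1) * k - rank                            # dim of the kernel

# Three time steps of a 2-dimensional rotation dynamic: solutions are
# determined by u_0, so the space of global sections is 2-dimensional.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
print(global_sections_dim([R, R, R], k=2))   # 2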
The observed fact that the value of the sheaf over all of X retains the same sort of structure as the type of data over the vertices — say, a vector space over a field F — is a hint that this space of global solutions is really a type of homological data. In fact, it is cohomological in nature, and, like zero-dimensional cohomology, it is a measure of connected components of the sheaf.
The cochain complex of a cellular sheaf F on X has as n-cochains C^n(X; F) = ⊕_τ F(τ), the sum over cells τ of dimension n, with coboundary d given by summing restriction maps weighted by the incidence numbers [σ : τ] = ±1 (simplest using binary coefficients so that −1 = 1). Note that d : C^n(X; F) → C^{n+1}(X; F), since [σ : τ] = 0 unless σ is a codimension-1 face of τ. This gives a cochain complex: in the computation of d², the incidence numbers factor from the restriction maps, and the computation from cellular co/homology suffices to yield 0. The resulting cellular sheaf cohomology is denoted H^•(X; F).
This idea of global compatibility of sets of local data in a sheaf yields, through the language of cohomology, global qualitative features of the data structure. We have seen several examples of the utility of classifying various types of holes or large-scale qualitative features of a space or complex. Imagine what one can do with a measure of topological features of a data structure over a space.
Exercise 4.13. The cohomology of the constant sheaf F_X on a compact cell complex X is, clearly, H^•_cell(X; F), the usual cellular cohomology of X with coefficients in F. Why the need for compactness? Consider the following cell complex: X = R, decomposed into two vertices and three edges. What happens when you follow all the above steps for the cochain complex of F_X? Show that this problem is solved if you include in the cochain complex only contributions from compact cells.
Exercise 4.14. For a closed subcomplex A ⊂ X, define the constant sheaf over A, F_A, as, roughly speaking, the constant sheaf on A (as its own complex) with all other cells and face maps in X having data zero. Argue that H^•(X; F_A) ≅ H^•(A; F). Conclude that it is possible to have a contractible base space X with nontrivial sheaf cohomology.
The elements of linear algebra recur throughout topology, including sheaf cohomology. Consider the following sheaf F over the closed interval with two vertices, a and b, and one edge e. The stalks are given as F(a) = R^m, F(b) = 0, and F(e) = R^n. The restriction maps are F(b ≤ e) = 0 and F(a ≤ e) = A, where A is a linear transformation. Then, by definition, the sheaf cohomology is H^0 ≅ ker A and H^1 ≅ coker A.
Cellular sheaf cohomology taking values in vector spaces is really a characterization of solutions to complex networks of linear equations. If one modifies F(b) = R^p with F(b ≤ e) = B another linear transformation, then the cochain complex takes the form

(4.15)   0 → R^m × R^p --[A | −B]--> R^n → 0 → ··· ,

so that H^0 ≅ ker [A | −B] = {(u, v) : Au = Bv} and H^1 ≅ coker [A | −B].
Cosheaves Sheaves are meant for cohomology: the direction of the restriction maps ensures this. Is there a way to talk about sheaf homology? If one works in the cellular case, this is a simple process. As we have seen, the only real difference between the cohomology of a cochain complex and the homology of a chain complex is whether the grading ascends or descends, so a simple matter of arrow reversal on a sheaf should take care of things. It does. A cosheaf F̂ of vector spaces on a simplicial complex assigns (1) to each simplex σ a costalk, a vector space F̂(σ); and (2) to each face σ ≤ τ of τ a corestriction map, a linear transformation F̂(σ ≤ τ) : F̂(τ) → F̂(σ), reversing the direction of the sheaf maps. Of course, the cosheaf must respect the composition rule:

(4.17)   ρ ≤ σ ≤ τ  ⇒  F̂(ρ ≤ τ) = F̂(ρ ≤ σ) ◦ F̂(σ ≤ τ),

and the identity rule that F̂(τ ≤ τ) = Id.
In the cellular context, there are very few differences between sheaves and
cosheaves — the use of one over another is a matter of convenience, in terms
of which direction makes the most sense. This is by no means true in the more
subtle setting of sheaves and cosheaves over open sets in a continuous domain.
Splines and Béziers. Cosheaves and sheaves alike arise in the study of splines,
Bézier curves, and other piecewise-assembled structures. For example, a single
segment of a planar Bézier curve is specified by the locations of two endpoints,
along with additional control points, each of which may be interpreted as a handle
specifying tangency data of the resulting curve at each endpoint. The reader who
has used any modern drawing software will understand the control that these
handles give over the resulting smooth curve. Most programs use a cubic Bézier
curve in the plane – the image of the unit closed interval by a cubic polynomial.
In these programs, the specification of the endpoints and the endpoint handles
(tangent vectors) completely determines the interior curve segment uniquely.
This can be viewed from the perspective of a cosheaf F̂ over the closed interval I = [0, 1]. The costalk over the interior (0, 1) is the space of all cubic polynomials from [0, 1] → R^2, which is isomorphic to R^4 ⊕ R^4 (one cubic polynomial for each of the x and y coordinates). If one sets the costalks at the endpoints of [0, 1] to be R^2, the physical locations of the endpoints, then the obvious corestriction maps to the endpoint costalks are nothing more than evaluation at 0 and 1 respectively. The corresponding cosheaf chain complex is:

(4.18)   ··· → 0 → R^4 ⊕ R^4 --∂--> R^2 ⊕ R^2 → 0.

Here, the boundary operator ∂ computes how far the cubic polynomial (edge costalk) ‘misses’ the specified endpoints (vertex costalks).
(planar) tangent vector. Repeat this exercise for a 2-segment cubic planar Bézier curve. How many control points are needed, and with what degrees of freedom, in order to match the H_1 of the cosheaf?
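For the single-segment complex (4.18), the ranks can be checked numerically. In the monomial basis (c_0, c_1, c_2, c_3) per coordinate, evaluation at the endpoints is the boundary map; the sketch below (our own illustration) confirms dim H_1 = 4, i.e., the interior cubic retains exactly the four degrees of freedom of the two control handles, and dim H_0 = 0, since any specified endpoints can be matched.

import numpy as np

# Edge costalk: cubics (x(t), y(t)), coefficients (c0, c1, c2, c3) per coordinate.
# Boundary: evaluation at the two endpoint costalks R^2 and R^2.
ev0 = np.array([1.0, 0.0, 0.0, 0.0])      # p(0) = c0
ev1 = np.array([1.0, 1.0, 1.0, 1.0])      # p(1) = c0 + c1 + c2 + c3
E = np.vstack([ev0, ev1])                  # R^4 -> R^2, one coordinate

# Two coordinates: block-diagonal boundary R^4 (+) R^4 -> R^2 (+) R^2.
boundary = np.block([[E, np.zeros((2, 4))],
                     [np.zeros((2, 4)), E]])

rank = np.linalg.matrix_rank(boundary)
print("dim H_1 =", 8 - rank)   # 4: two control handles, each a planar vector
print("dim H_0 =", 4 - rank)   # 0: endpoint data can always be matched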
Note the interesting duality: the global solutions with boundary condition
are characterized by the top-dimensional homology of the cosheaf, instead of the
zero-dimensional cohomology of a sheaf. This simple example extends greatly, as
shown originally by Billera (using cosheaves, without that terminology [16]) and
Yuzvinsky (using sheaves [90]). By Billera’s work, the (vector) space of splines
over a triangulated Euclidean domain is isomorphic to the top-dimensional ho-
mology of a particular cosheaf over the domain. This matches what you see in
the simpler example of a Bézier curve over a line segment.
Splines and Béziers are a nice set of examples of cosheaves that have natu-
ral higher-dimensional generalizations — Bézier surfaces and surface splines are
used in design and modelling of surfaces ranging from architectural structures to
vehicle surfaces, ship hulls, and the like. Other examples of sheaves over higher-
dimensional spaces arise in the broad generalization of the Euler characteristic to
the Euler calculus, a topological integral calculus of recent interest in topological
signal processing applications [9, 10, 80–82].
Towards Generalizing Barcodes One of the benefits of a more general, sophis-
ticated language is the ability to reinterpret previous results in a new light with
new avenues for exploration appearing naturally. Let’s wrap up our brief survey
of sheaves and cosheaves by revisiting the basics of persistent homology, follow-
ing the thesis of Curry [33]. Recall the presentation of persistent homology and
barcodes in Lecture 2 that relied crucially on the Structure Theorem for linear
sequences of finite-dimensional vector spaces (Theorem 2.15).
There are a few ways one might want to expand this story. We have hinted on a few occasions at the desirability of a continuous line as a parameter: our story of sequences of vector spaces and linear transformations is bound to the discrete setting. Intuitively, one could take a limit of finer discretizations and hope to obtain a convergence with the appropriate assumptions on variability. Questions of stability and interleaving (recall Exercise 2.23) then arise: see [18, 21, 66].
Another natural question is: what about non-linear sequences? What if in-
stead of a single parameter, there are two or more parameters that one wants to
vary? Is it possible to classify higher-dimensional sequences and derive barcodes
here? Unfortunately, the situation is much more complex than in the simple, lin-
ear setting. There are fundamental algebraic reasons for why such a classification
is not directly possible. These obstructions originate from representation theory
and quivers: see [24, 28, 76]. The good news is that quiver theory implies the exis-
tence of barcodes for linear sequences of vector spaces where the directions of the
maps do not have to be uniform, as per the zigzag persistence of Carlsson and de
Silva [24]. The bad news is that quiver theory implies that a well-defined barcode cannot exist as presently conceived for any sequence whose underlying quiver is not a Dynkin diagram (meaning, in particular, that higher-dimensional persistence has no simple classification).
Nevertheless, one intuits that sheaves and cosheaves should have some bearing on persistence and barcodes. Consider the classical scenario, in which one has a sequence of finite-dimensional vector spaces V_i and linear transformations ϕ_i : V_i → V_{i+1}. Consider the following sheaf F over a Z-discretized R. To each vertex {i} is assigned V_i. To each edge (i, i + 1) is assigned V_{i+1}, with the identity isomorphism from the right vertex stalk to the edge and ϕ_i as the left-vertex restriction map to the edge data. Note the similarity of this sheaf to that of the recurrence relation earlier in this lecture. As in the case of the recurrence relation, H^0 detects global solutions: something similar happens for intervals in the barcode.
There are limits to what basic sheaves and cosheaves can do, as cohomology
does not come with the descriptiveness plus uniqueness that the representation
theoretic approach gives. Nevertheless, there are certain settings in which bar-
codes for persistent homology are completely captured by sheaves and cosheaves
(see the thesis of Curry for the case of level-set persistence [33]), with more char-
acterizations to come [34].
Homological Data, Redux We summarize by updating and expanding the prin-
ciples that we outlined earlier in the lecture series into a more refined language:
(1) Algebraic co/chain complexes are a good model for converting a space
built from local pieces into a linear-algebraic structure.
(2) Co/homology is an optimal compression scheme to collapse inessential
structure and retain qualitative features.
(3) The classification of linear algebraic sequences yields barcodes as a de-
composition of sequences of co/homologies, capturing the evolution of
qualitative features.
(4) Exact sequences permit inference from partial sequential co/homological
data to more global characterization.
Work of Adams and Carlsson [3] gives a complete solution to the existence of
an evasion path in the case of a planar (n = 2) system with additional genericity
conditions and some geometric assumptions. Recently, a complete solution in all
dimensions was given [52] using sheaves and sheaf cohomology. One begins with
a closed coverage region C ⊂ Rn × R whose complement is uniformly bounded
over R. For the sake of exposition, assume that the time axis is given a discretiza-
tion into (ordered) vertices vi and edges ei = (vi , vi+1 ) such that the coverage
domains Ct are topologically equivalent over each edge (this is not strictly neces-
sary). There are a few simple sheaves over the discretized timeline relevant to the
evasion problem.
First, consider for each time t ∈ R the coverage domain C_t ⊂ R^n. How many different ways are there for an evader to hide from C_t? This is regulated by the number of connected components of the complement, classified by H_0(R^n − C_t). Since we do not have access to C_t directly (remember – its embedding in R^n is unknown to the pursuer), we must try to compute this H_0 based only on the topology of C_t. That this can be done is an obvious but wonderful corollary of Alexander duality, which relates the homology and cohomology of complementary subsets of R^n. Here, Alexander duality implies that H_0(R^n − C_t) ≅ H^{n−1}(C_t): this, then, is something we can measure, and motivates using the Leray cellular sheaf H of (n − 1)-dimensional cohomology of the coverage regions over the time axis. Specifically, for each edge e_i define H(e_i) = H^{n−1}(C_{(v_i, v_{i+1})}) to be the cohomology of the region over the open edge. For the vertices, use the star: H(v_i) = H^{n−1}(C_{(v_{i−1}, v_{i+1})}).
Exercise 4.21. Why are the stalks over the vertices defined in this way? Show
that this gives a well-defined cellular sheaf using as restriction maps the induced
homomorphisms on cohomology. Hint: which way do the induced maps in coho-
mology go?
The intuition is that global sections of this sheaf over the time axis, H^0(R; H), would classify the different complementary “tunnels” through the coverage set that an evader could use to escape detection. Unfortunately, this is incorrect, for reasons pointed out by Adams and Carlsson [3] (using the language of zigzags). The culprit is the commutative nature of homology and cohomology — one cannot discern tunnels which illegally twirl backwards in time. To solve this problem, one could try to keep track of some sort of directedness or orientation. Thanks to the assumption that C_t is connected for all time, there is a global orientation class on R^n that can be used to assign a ±1 to basis elements of H^{n−1}(C_t) based on whether the complementary tunnel is participating in a time-orientation-preserving evasion path on the time interval (−∞, t).
However, to incorporate this orientation data into the sheaf requires breaking
the bounds of working with vector spaces. As detailed in [52], one may use
sheaves that take values in semigroups. In this particular case, the semigroups
are positive cones within vector spaces, where a cone K ⊂ V in a vector space V
is a subset closed under vector addition and closed under multiplication by R+ ,
the [strictly] positive reals. A cone is positive if K ∩ −K = ∅. With work, one
can formulate co/homology theories and sheaves to take values in cones: for
details, see [52]. The story proceeds: within the sheaf H of n − 1 dimensional
cohomology of C on R, there is (via abuse of terminology) a “subsheaf” +H of
positive cones, meaning that the stalks of +H are positive cones within the stalks
of H, encoding all the positive cohomology classes that can participate in a legal
(time-orientation-respecting) evasion path. It is this sheaf of positive cones that
classifies evasion paths.
Theorem 4.22. For n > 1 and C = {C_t} ⊂ R^n × R closed and with bounded complement consisting of connected fibers C_t for all t, there is an evasion path over R if and only if H^0(R; +H) is nonempty.

Note that the theorem statement says nonempty instead of nonzero, since the sheaf takes values in positive cones, which are R_+-cones and thus do not contain zero.
The most interesting part of the story is how one can compute this H^0. This is where the technicalities begin to weigh heavily, as one cannot use the classical definition of H^0 in terms of kernels and images. The congenial commutative world of vector spaces requires significant care when passing to the nonabelian setting. One defines H^0 for sheaves of cones using constructs from category theory (lim-
its, specifically). Computation of such objects requires a great deal more thought
and care than the simpler linear-algebraic notions of these lectures. That is by
no means a defect; indeed, it is a harbinger. Increase in resolution requires an
increase in algebra.
The book by Edelsbrunner and Harer [40] is a gentle introduction, with emphases on both theory and algorithms. The book on computational homology by Kaczynski, Mischaikow, and Mrozek [62] has even more algorithmic material, mostly in the setting of cubical complexes. Both books suffer from the short shelf life of algorithms as compared to theories. Newer titles on the theory of persistent homology are in the process of appearing: that of Oudot [76] is one of perhaps several soon to come. For introductions to topology specifically geared towards the data sciences, there does not seem to be an ideal book; rather, a selection of survey articles such as that of Carlsson [23] is appropriate.
The open directions for inquiry are perhaps too many to identify properly –
the subject of topological data analysis is in its infancy. One can say with cer-
tainty that there is a slow spread of these topological methods and perspectives
to new application domains. This will continue, and it is not yet clear to this author whether neuroscience, genetics, signal processing, materials science, or some other domain will be the locus of inquiry that benefits most from homological methods, so rapid have been the advances in all these areas. Dual to this spread in applications is the antipodal expansion into Mathematics, as ideas from homological data engage with and impact contemporary Mathematics. The unique demands
of data have already prompted explorations into mathematical structures (e.g.,
interleaving distance in sheaf theory and persistence in matroid theory) which
otherwise would seem unmotivated and be unexplored. It is to be hoped that the
simple applications of representation theory to homological data analysis will
inspire deeper explorations into the mathematical tools.
There is at the moment a frenzy of activity surrounding various notions of
stability associated to persistent homology, sheaves, and related structures con-
cerning representation theory. It is likely to take some time to sort things out into
their clearest form. In general, one expects that applications of deeper ideas from
algebraic topology exist and will percolate through applied mathematics into ap-
plication domains. Perhaps the point of most optimism and uncertainty lies in
the intersection with probabilistic and stochastic methods. The rest of this volume
makes ample use of such tools; their absence in these notes is noticeable. Topol-
ogy and probability are neither antithetical nor natural partners; expectation of
progress is warranted from some excellent results on the topology of Gaussian
random fields [6] and recent work on the homology of random complexes [63].
Much of the material from these lectures appears ripe for a merger with modern
probabilistic methods. Courage and optimism — two of the cardinal mathemati-
cal virtues — are needed for this.
References
[1] A. Abrams and R. Ghrist. State complexes for metamorphic systems. Intl. J. Robotics Research,
23(7,8):809–824, 2004. ←278
[2] H. Adams and G. Carlsson. On the nonlinear statistics of range image patches. SIAM J. Imaging
Sci., 2(1):110–117, 2009. MR2486524 ←294
[3] H. Adams and G. Carlsson. Evasion paths in mobile sensor networks. International Journal of
Robotics Research, 34:90–104, 2014. ←318, 319
[4] R. J. Adler. The Geometry of Random Fields. Society for Industrial and Applied Mathematics, 1981.
MR3396215 ←280
[5] R. J. Adler, O. Bobrowski, M. S. Borman, E. Subag, and S. Weinberger. Persistent homology for
random fields and complexes. In Borrowing Strength: Theory Powering Applications, pages 124–143.
IMS Collections, 2010. MR2798515 ←285
[6] R. J. Adler and J. E. Taylor. Random Fields and Geometry. Springer Monographs in Mathematics.
Springer, New York, 2007. MR2319516 ←280, 321
[7] R. H. Atkin. Combinatorial Connectivities in Social Systems. Springer Basel AG, 1977. ←278
[8] A. Banyaga and D. Hurtubise. Morse Homology. Springer, 2004. ←304
[9] Y. Baryshnikov and R. Ghrist. Target enumeration via Euler characteristic integrals. SIAM J. Appl.
Math., 70(3):825–844, 2009. MR2538627 ←280, 316
[10] Y. Baryshnikov and R. Ghrist. Euler integration over definable functions. Proc. Natl. Acad. Sci.
USA, 107(21):9525–9530, 2010. MR2653583 ←280, 316
[11] Y. Baryshnikov, R. Ghrist, and D. Lipsky. Inversion of Euler integral transforms with applications
to sensor data. Inverse Problems, 27(12), 2011. MR2854317 ←280
[12] U. Bauer and M. Lesnick. Induced matchings and the algebraic stability of persistence barcodes.
Discrete Comput. Geom., 6(2):162–191, 2015. MR3333456 ←293
[13] P. Bendich, H. Edelsbrunner, and M. Kerber. Computing robustness and persistence for images.
IEEE Trans. Visual and Comput. Graphics, pages 1251–1260, 2010. ←294
[14] P. Bendich, J. Marron, E. Miller, A. Pieloch, and S. Skwerer. Persistent homology analysis of brain
artery trees. Ann. Appl. Stat., 10(1):198–218, 2016. MR3480493 ←294
[15] S. Bhattacharya, R. Ghrist, and V. Kumar. Persistent homology for path planning in uncertain
environments. IEEE Trans. on Robotics, 31(3):578–590, 2015. ←295
[16] L. J. Billera. Homology of smooth splines: generic triangulations and a conjecture of Strang. Trans.
Amer. Math. Soc., 310(1):325–340, 1988. MR965757 ←316
[17] L. J. Billera, S. P. Holmes, and K. Vogtmann. Geometry of the space of phylogenetic trees. Adv. in
Appl. Math., 27(4):733–767, 2001. MR1867931 ←278
[18] M. Botnan and M. Lesnick. Algebraic stability of zigzag persistence modules. arXiv:1604.00655v2.
←293, 316
[19] R. Bott and L. Tu. Differential Forms in Algebraic Topology. Springer, 1982. MR658304 ←311
[20] G. Bredon. Sheaf Theory. Springer, 1997. MR1481706 ←312
[21] P. Bubenik, V. de Silva, and J. Scott. Metrics for generalized persistence modules. Found. Comput.
Math., 15(6):1501–1531, 2015. MR3413628 ←293, 316
[22] P. Bubenik and J. A. Scott. Categorification of persistent homology. Discrete Comput. Geom.,
51(3):600–627, 2014. MR3201246 ←293
[23] G. Carlsson. The shape of data. In Foundations of computational mathematics, Budapest 2011, volume
403 of London Math. Soc. Lecture Note Ser., pages 16–44. Cambridge Univ. Press, Cambridge, 2013.
MR3137632 ←321
[24] G. Carlsson and V. de Silva. Zigzag persistence. Found. Comput. Math., 10(4):367–405, 2010.
MR2657946 ←316
[25] G. Carlsson, T. Ishkhanov, V. de Silva, and A. Zomorodian. On the local behavior of spaces of
natural images. Intl. J. Computer Vision, 76(1):1–12, Jan. 2008. MR3715451 ←294
[26] G. Carlsson and F. Mémoli. Characterization, stability and convergence of hierarchical clustering
methods. J. Mach. Learn. Res., 11:1425–1470, Aug. 2010. MR2645457 ←294
[27] G. Carlsson and F. Mémoli. Classifying clustering schemes. Found. Comput. Math., 13(2):221–252,
2013. MR3032681 ←294
[28] G. Carlsson and A. Zomorodian. The theory of multidimensional persistence. Discrete Comput.
Geom., 42(1):71–93, 2009. MR2506738 ←316
[29] S. Caron, V. Ruta, L. Abbott, and R. Axel. Random convergence of olfactory inputs in the Drosophila mushroom body. Nature, 497(7447):113–117, 2013. ←285
[30] F. Chazal, V. de Silva, M. Glisse, and S. Oudot. The Structure and Stability of Persistence Modules.
Springer Briefs in Mathematics, 2016. MR3524869 ←293
[31] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete Com-
put. Geom., 37(1):103–120, 2007. MR2279866 ←293
[32] A. Collins, A. Zomorodian, G. Carlsson, and L. Guibas. A barcode shape descriptor for curve
point cloud data. In M. Alexa and S. Rusinkiewicz, editors, Eurographics Symposium on Point-
Based Graphics, ETH, Zürich, Switzerland, 2004. ←292
[33] J. Curry. Sheaves, Cosheaves and Applications. PhD thesis, University of Pennsylvania, 2014.
MR3259939 ←316, 317
[34] J. Curry and A. Patel. Classification of constructible cosheaves. arXiv:1603.01587. ←317
[35] C. Curto and V. Itskov. Cell groups reveal structure of stimulus space. PLoS Comput. Biol.,
4(10):e1000205, 13, 2008. MR2457124 ←285
[36] M. d’Amico, P. Frosini, and C. Landi. Optimal matching between reduced size functions. Techni-
cal Report 35, DISMI, Univ. degli Studi di Modena e Reggio Emilia, Italy, 2003. ←292
[37] V. de Silva and G. Carlsson. Topological estimation using witness complexes. In M. Alexa and
S. Rusinkiewicz, editors, Eurographics Symposium on Point-based Graphics, 2004. ←278
[38] J. Derenick, A. Speranzon, and R. Ghrist. Homological sensing for mobile robot localization. In
Proc. Intl. Conf. Robotics & Aut., 2012. ←296
[39] C. Dowker. Homology groups of relations. Annals of Mathematics, pages 84–95, 1952. MR0048030
←278
[40] H. Edelsbrunner and J. Harer. Computational Topology: an Introduction. American Mathematical
Society, Providence, RI, 2010. MR2572029 ←275, 321
[41] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification.
Discrete Comput. Geom., 28:511–533, 2002. MR1949898 ←292
[42] M. Farber. Invitation to Topological Robotics. Zurich Lectures in Advanced Mathematics. European
Mathematical Society (EMS), Zürich, 2008. MR2455573 ←280
[43] D. Farley and L. Sabalka. On the cohomology rings of tree braid groups. J. Pure Appl. Algebra,
212(1):53–71, 2008. MR2355034 ←305
[44] D. Farley and L. Sabalka. Presentations of graph braid groups. Forum Math., 24(4):827–859, 2012.
MR2949126 ←305
[45] R. Forman. Morse theory for cell complexes. Adv. Math., 134(1):90–145, 1998. MR1612391 ←304,
305
[46] R. Forman. A user’s guide to discrete Morse theory. Sém. Lothar. Combin., 48, 2002. MR1939695 ←
304
[47] Ś. R. Gal. Euler characteristic of the configuration space of a complex. Colloq. Math., 89(1):61–67,
2001. MR1853415 ←280
[48] M. Gameiro, Y. Hiraoka, S. Izumi, M. Kramar, K. Mischaikow, and V. Nanda. Topological mea-
surement of protein compressibility via persistence diagrams. Japan J. Industrial & Applied Mathe-
matics, 32(1):1–17, Oct 2014. MR3318898 ←296
[49] T. Gao, J. Brodzki, and S. Mukherjee. The geometry of synchronization problems and learning
group actions. arXiv:1610.09051. ←311
[50] S. I. Gelfand and Y. I. Manin. Methods of Homological Algebra. Springer Monographs in Mathemat-
ics. Springer-Verlag, Berlin, second edition, 2003. MR1950475 ←299, 320
[51] R. Ghrist. Elementary Applied Topology. Createspace, 1.0 edition, 2014. ← 275, 278, 282, 303, 311,
320
[52] R. Ghrist and S. Krishnan. Positive Alexander duality for pursuit and evasion. To appear, SIAM
J. Appl. Alg. Geom. MR3763757 ←318, 319, 320
[53] R. Ghrist and S. M. Lavalle. Nonpositive curvature and Pareto optimal coordination of robots.
SIAM J. Control Optim., 45(5):1697–1713, 2006. MR2272162 ←278
[54] R. Ghrist, D. Lipsky, J. Derenick, and A. Speranzon. Topological landmark-based navigation and mapping. Preprint, 2012. ←278, 296
[55] R. Ghrist and V. Peterson. The geometry and topology of reconfiguration. Adv. in Appl. Math.,
38(3):302–323, 2007. MR2301699 ←278
[56] C. Giusti, E. Pastalkova, C. Curto, and V. Itskov. Clique topology reveals intrinsic structure in
neural connections. Proc. Natl. Acad. Sci. USA, 112(44):13455–13460, 2015. MR3429279 ←284, 285
[57] L. J. Guibas and S. Y. Oudot. Reconstruction using witness complexes. In Proc. 18th ACM-SIAM
Sympos. on Discrete Algorithms, pages 1076–1085, 2007. MR2485259 ←278
[58] A. Hatcher. Algebraic Topology. Cambridge University Press, 2002. MR1867354 ←281, 282, 299, 300,
303
[59] G. Henselman and R. Ghrist. Matroid filtrations and computational persistent homology.
arXiv:1606.00199. ←307
[60] J. Hocking and G. Young. Topology. Dover Press, 1988. MR1016814 ←302
[61] X. Jiang, L.-H. Lim, Y. Yao, and Y. Ye. Statistical ranking and combinatorial Hodge theory. Math.
Program., 127(1, Ser. B):203–244, 2011. MR2776715 ←311
[62] T. Kaczynski, K. Mischaikow, and M. Mrozek. Computational Homology, volume 157 of Applied
Mathematical Sciences. Springer-Verlag, New York, 2004. MR2028588 ←275, 278, 307, 321
[63] M. Kahle. Topology of random clique complexes. Discrete Comput. Geom., 45(3):553–573, 2011.
MR2770552 ←285, 321
[64] M. Kashiwara and P. Schapira. Categories and Sheaves, volume 332 of Grundlehren der Mathematis-
chen Wissenschaften. Springer-Verlag, 2006. MR2182076 ←312
[65] D. Kozlov. Combinatorial Algebraic Topology, volume 21 of Algorithms and Computation in Mathemat-
ics. Springer, 2008. MR2361455 ←304, 305
[66] M. Lesnick. The theory of the interleaving distance on multidimensional persistence modules. J.
Found. Comp. Math., 15(3):613–650, 2015. MR3348168 ←316
[67] T. Lewiner, H. Lopes, and G. Tavares. Applications of Forman’s discrete Morse theory to topology
visualization and mesh compression. IEEE Trans. Visualization & Comput. Graphics, 10(5):499–508,
2004. ←305
[68] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1987.
←280
[69] K. Mischaikow and V. Nanda. Morse theory for filtrations and efficient computation of persistent
homology. Discrete Comput. Geom., 50(2):330–353, 2013. MR3090522 ←306, 307
[70] J. Munkres. Topology. Prentice Hall, 2000. MR3728284 ←275
[71] V. Nanda. Discrete Morse theory and localization. arXiv:1510.01907. ←307
[72] V. Nanda, D. Tamaki, and K. Tanaka. Discrete Morse theory and classifying spaces.
arXiv:1612.08429v1. ←307
[73] M. Nicolau, A. J. Levine, and G. Carlsson. Topology based data analysis identifies a subgroup of
breast cancers with a unique mutational profile and excellent survival. Proc. Natl. Acad. Sci. USA,
108(17):7265–7270, 2011. ←295
[74] J. O’Keefe and J. Dostrovsky. The hippocampus as a spatial map. Brain Research, 34(1):171–175,
1971. ←285
[75] N. Otter, M. Porter, U. Tillmann, P. Grindrod, and H. Harrington. A roadmap for the computation
of persistent homology. EPJ Data Science 6(17), 2017. MR3670641 ←306, 307, 308
[76] S. Oudot. Persistence Theory: From Quiver Representations to Data Analysis. American Mathematical
Society, 2015. MR3408277 ←275, 316, 321
[77] L. Pachter and B. Sturmfels. The mathematics of phylogenomics. SIAM Rev., 49(1):3–31, 2007.
MR2302545 ←278
[78] R. Penrose. La cohomologie des figures impossibles. Structural Topology, 17:11–16, 1991.
MR1140400 ←311
[79] V. Robins, P. Wood, and A. Sheppard. Theory and algorithms for constructing discrete Morse
complexes from grayscale digital images. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 33(8):1646–1658, 2011. ←294, 305
[80] M. Robinson. Topological Signal Processing. Springer, Heidelberg, 2014. MR3157249 ←316
[81] P. Schapira. Operations on constructible functions. J. Pure Appl. Algebra, 72(1):83–93, 1991.
MR1115569 ←316
[82] P. Schapira. Tomography of constructible functions. In Applied Algebra, Algebraic Algorithms and
Error-Correcting Codes, pages 427–435. Springer, 1995. MR1448182 ←316
[83] J. Schürmann. Topology of Singular Spaces and Constructible Sheaves, volume 63 of Mathematics
Institute of the Polish Academy of Sciences. Mathematical Monographs (New Series). Birkhäuser Verlag,
Basel, 2003. MR2031639 ←312
[84] M. Schwarz. Morse Homology, volume 111 of Progress in Mathematics. Birkhäuser Verlag, Basel,
1993. MR1239174 ←304
[85] Y. Shkolnisky and A. Singer. Viewing direction estimation in cryo-EM using synchronization.
SIAM J. Imaging Sci., 5(3):1088–1110, 2012. MR3022188 ←311
[86] A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Appl. Com-
put. Harmonic Anal., 30(1):20–36, 2011. MR2737931 ←311
[87] A. Sizemore, C. Giusti, and D. Bassett. Classification of weighted networks through mesoscale
homological features. J. Complex Networks, 2016. MR3801686 ←285
[88] B. Torres, J. Oliviera, A. Tate, P. Rath, K. Cumnock, and D. Schneider. Tracking resilience to
infections by mapping disease space. PLOS Biology, 2016. ←295
[89] A. Wilkerson, H. Chintakunta, H. Krim, T. Moore, and A. Swami. A distributed collapse of a
network’s dimensionality. In Proceedings of Global Conference on Signal and Information Processing
(GlobalSIP). IEEE, 2013. ←278
[90] S. Yuzvinsky. Modules of splines on polyhedral complexes. Math. Z., 210(2):245–254, 1992.
MR1166523 ←316
[91] A. Zomorodian. Topology for Computing. Cambridge Univ Press, 2005. MR2111929 ←292