
IAS/PARK CITY
MATHEMATICS SERIES
Volume 25

The Mathematics
of Data

Michael W. Mahoney
John C. Duchi
Anna C. Gilbert
Editors

American Mathematical Society
Institute for Advanced Study
Society for Industrial and Applied Mathematics
Rafe Mazzeo, Series Editor
Michael W. Mahoney, John C. Duchi, and Anna C. Gilbert, Volume Editors.

IAS/Park City Mathematics Institute runs mathematics education programs that bring
together high school mathematics teachers, researchers in mathematics and mathematics
education, undergraduate mathematics faculty, graduate students, and undergraduates to
participate in distinct but overlapping programs of research and education. This volume
contains the lecture notes from the Graduate Summer School program on the
Mathematics of Data, based on a series of lectures held in July 2016 at the Park
City Mathematics Institute.

2010 Mathematics Subject Classification. Primary 15-02, 52-02, 60-02, 62-02, 65-02,
68-02, 90-02.

Library of Congress Cataloging-in-Publication Data


Names: Mahoney, Michael W., editor. | Duchi, John, editor. | Gilbert, Anna C. (Anna Catherine),
1972– editor. | Institute for Advanced Study (Princeton, N.J.) | Society for Industrial and Applied
Mathematics. | Park City Mathematics Institute.
Title: The mathematics of data / Michael W. Mahoney, John C. Duchi, Anna C. Gilbert, editors.
Description: Providence : American Mathematical Society, 2018. | Series: IAS/Park City mathe-
matics series ; Volume 25 | “Institute for Advanced Study.” | “Society for Industrial and Applied
Mathematics.” | Based on a series of lectures held July 2016, at the Park City Mathematics
Institute. | Includes bibliographical references.
Identifiers: LCCN 2018024239 | ISBN 9781470435752 (alk. paper)
Subjects: LCSH: Mathematics teachers–Training of–Congresses. | Mathematics–Study and
teaching–Congresses | Big data–Congresses. | AMS: Linear and multilinear algebra; matrix the-
ory – Research exposition (monographs, survey articles). msc | Convex and discrete geometry –
Research exposition (monographs, survey articles). msc | Probability theory and stochastic pro-
cesses – Research exposition (monographs, survey articles). msc | Statistics – Research exposition
(monographs, survey articles). msc | Numerical analysis – Research exposition (monographs, sur-
vey articles). msc | Computer science – Research exposition (monographs, survey articles). msc
| Operations research, mathematical programming – Research exposition (monographs, survey
articles). msc
Classification: LCC QA11.A1 M345 2018 | DDC 510–dc23
LC record available at https://lccn.loc.gov/2018024239

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting
for them, are permitted to make fair use of the material, such as to copy select pages for use
in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for permission
to reuse portions of AMS publication content are handled by the Copyright Clearance Center. For
more information, please visit www.ams.org/publications/pubpermissions.
Send requests for translation rights and licensed reprints to [email protected].

© 2018 by the American Mathematical Society. All rights reserved.
The American Mathematical Society retains all rights
except those granted to the United States Government.
Printed in the United States of America.

∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at https://www.ams.org/
10 9 8 7 6 5 4 3 2 1 23 22 21 20 19 18
Contents

Preface vii
Introduction ix

Lectures on Randomized Numerical Linear Algebra
Petros Drineas and Michael W. Mahoney 1

Optimization Algorithms for Data Analysis
Stephen J. Wright 49

Introductory Lectures on Stochastic Optimization
John C. Duchi 99

Randomized Methods for Matrix Computations
Per-Gunnar Martinsson 187

Four Lectures on Probabilistic Methods for Data Science
Roman Vershynin 231

Homological Algebra and Data
Robert Ghrist 273

IAS/Park City Mathematics Series
Volume 25, Pages vii–viii
https://doi.org/10.1090/pcms/025/00827

Preface

The IAS/Park City Mathematics Institute (PCMI) was founded in 1991 as part
of the Regional Geometry Institute initiative of the National Science Foundation.
In mid-1993 the program found an institutional home at the Institute for Ad-
vanced Study (IAS) in Princeton, New Jersey.
The IAS/Park City Mathematics Institute encourages both research and educa-
tion in mathematics and fosters interaction between the two. The three-week sum-
mer institute offers programs for researchers and postdoctoral scholars, graduate
students, undergraduate students, high school students, undergraduate faculty,
K-12 teachers, and international teachers and education researchers. The Teacher
Leadership Program also includes weekend workshops and other activities dur-
ing the academic year.
One of PCMI’s main goals is to make all of the participants aware of the full
range of activities that occur in research, mathematics training and mathematics
education: the intention is to involve professional mathematicians in education
and to bring current concepts in mathematics to the attention of educators. To
that end, late afternoons during the summer institute are devoted to seminars and
discussions of common interest to all participants, meant to encourage interaction
among the various groups. Many deal with current issues in education; others
treat mathematical topics at a level which encourages broad participation.
Each year the Research Program and Graduate Summer School focuses on a
different mathematical area, chosen to represent some major thread of current
mathematical interest. Activities in the Undergraduate Summer School and Un-
dergraduate Faculty Program are also linked to this topic, the better to encourage
interaction between participants at all levels. Lecture notes from the Graduate
Summer School are published each year in this series. The prior volumes are:
• Volume 1: Geometry and Quantum Field Theory (1991)
• Volume 2: Nonlinear Partial Differential Equations in Differential Geometry
(1992)
• Volume 3: Complex Algebraic Geometry (1993)
• Volume 4: Gauge Theory and the Topology of Four-Manifolds (1994)
• Volume 5: Hyperbolic Equations and Frequency Interactions (1995)
• Volume 6: Probability Theory and Applications (1996)
• Volume 7: Symplectic Geometry and Topology (1997)
• Volume 8: Representation Theory of Lie Groups (1998)
• Volume 9: Arithmetic Algebraic Geometry (1999)
• Volume 10: Computational Complexity Theory (2000)
• Volume 11: Quantum Field Theory, Supersymmetry, and Enumerative Geome-
try (2001)
• Volume 12: Automorphic Forms and their Applications (2002)
• Volume 13: Geometric Combinatorics (2004)
• Volume 14: Mathematical Biology (2005)
• Volume 15: Low Dimensional Topology (2006)
• Volume 16: Statistical Mechanics (2007)
• Volume 17: Analytical and Algebraic Geometry (2008)
• Volume 18: Arithmetic of L-functions (2009)
• Volume 19: Mathematics in Image Processing (2010)
• Volume 20: Moduli Spaces of Riemann Surfaces (2011)
• Volume 21: Geometric Group Theory (2012)
• Volume 22: Geometric Analysis (2013)
• Volume 23: Mathematics and Materials (2014)
• Volume 24: Geometry of Moduli Spaces and Representation Theory (2015)
The American Mathematical Society publishes material from the Undergradu-
ate Summer School in their Student Mathematical Library and from the Teacher
Leadership Program in the series IAS/PCMI—The Teacher Program.
After more than 25 years, PCMI retains its intellectual vitality and continues
to draw a remarkable group of participants each year from across the entire spec-
trum of mathematics, from Fields Medalists to elementary school teachers.
Rafe Mazzeo
PCMI Director
March 2017
IAS/Park City Mathematics Series
Volume 25, Pages ix–xii
https://doi.org/10.1090/pcms/025/00828

Introduction

Michael W. Mahoney, John C. Duchi, and Anna C. Gilbert

“The Mathematics of Data” was the topic for the 26th annual Park City Mathe-
matics Institute (PCMI) summer session, held in July 2016. To those more familiar
with very abstract areas of mathematics or more applied areas of data—the latter
going these days by names such as “big data” or “data science”—it may come as
a surprise that such an area even exists. A moment’s thought, however, should
dispel such a misconception. After all, data must be modeled, e.g., by a matrix or
a graph or a flat table, and if one performs similar operations on very different
types of data, then there is an expectation that there must be some sort of com-
mon mathematical structure, e.g., from linear algebra or graph theory or logic.
So too, ignorance or errors or noise in the data can be modeled, and it should be
plausible that how well operations perform on data depends not just on how well
data are modeled but also on how well ignorance or noise or errors are modeled.
So too, the operations themselves can be modeled, e.g., to make statements such
as whether the operations answer a precise question, exactly or approximately, or
whether they will return a solution in a reasonable amount of time.
As such, “The Mathematics of Data” fits squarely in applied mathematics—
when that term is broadly, not narrowly, defined. Technically, it represents some
combination of what is traditionally the domain of linear algebra and probability
and optimization and other related areas. Moreover, while some of the work
in this area takes place in mathematics departments, much of the work in the
area takes place in computer science, statistics, and other related departments.
This was the challenge and opportunity we faced, both in designing the graduate
summer school portion of the PCMI summer session, as well as in designing this
volume. With respect to the latter, while the area is not sufficiently mature to say
the final word, we have tried to capture the major trends in the mathematics of
data sufficiently broadly and at a sufficiently introductory level that this volume
could be used as a teaching resource for students with backgrounds in any of the
wide range of areas related to the mathematics of data.
The first chapter, “Lectures on Randomized Numerical Linear Algebra,” pro-
vides an overview of linear algebra, probability, and ways in which they interact
fruitfully in many large-scale data applications. Matrices are a common way to
model data, e.g., an m × n matrix provides a natural way to describe m objects,
each of which is described by n features, and thus linear algebra, as well as more
sophisticated variants such as functional analysis and linear operator theory, is
central to the mathematics of data. An interesting twist is that, while work in
numerical linear algebra and scientific computing typically focuses on deterministic
algorithms that return answers to machine precision, randomness can be used
in novel algorithmic and statistical ways in matrix algorithms for data. While
randomness is often assumed to be a property of the data (e.g., think of noise
being modeled by random variables drawn from a Gaussian distribution), it can
also be a powerful algorithmic resource to speed up algorithms (e.g., think of
Monte Carlo and Markov Chain Monte Carlo methods), and many of the most
interesting and exciting developments in the mathematics of data explore this
algorithmic-statistical interface. This chapter, in particular, describes the use of
these methods for the development of improved algorithms for fundamental and
ubiquitous matrix problems such as matrix multiplication, least-squares approxi-
mation, and low-rank matrix approximation.
The second chapter, “Optimization Algorithms for Data Analysis,” goes one
step beyond basic linear algebra problems, which themselves are special cases
of optimization problems, to consider more general optimization problems. Op-
timization problems are ubiquitous throughout data science, and a wide class
of problems can be formulated as optimizing smooth functions, possibly with
simple constraints or structured nonsmooth regularizers. This chapter describes
some canonical problems in data analysis and their formulation as optimization
problems. It also describes iterative algorithms (i.e., those that generate a se-
quence of points) that, for convex objective functions, converge to the set of solu-
tions of such problems. Algorithms covered include first-order methods that de-
pend on gradients, so-called accelerated gradient methods, and Newton’s second-
order method that can guarantee convergence to points that approximately satisfy
second-order conditions for a local minimizer of a smooth nonconvex function.
The third chapter, “Introductory Lectures on Stochastic Optimization,” cov-
ers the basic analytical tools and algorithms necessary for stochastic optimiza-
tion. Stochastic optimization problems are problems whose definition involves
randomness, e.g., minimizing the expectation of some function; and stochastic
optimization algorithms are algorithms that generate and use random variables
to find the solution of a (perhaps deterministic) problem. As with the use of ran-
domness in Randomized Numerical Linear Algebra, there is an interesting syn-
ergy between the two ways in which stochasticity appears. This chapter builds
the necessary convex analytic and other background, and it describes gradient
and subgradient first-order methods for the solution of these types of problems.
These methods tend to be simple methods that are slower to converge than more
advanced methods—such as Newton’s or other second-order methods—for deter-
ministic problems, but they have the advantage that they can be robust to noise
in the optimization problem itself. Also covered are mirror descent and adap-
tive methods, as well as methods for proving upper and lower bounds on such
stochastic algorithms.

The fourth chapter, “Randomized Methods for Matrix Computations,” goes
into more detail on randomized methods for efficiently computing a low-rank
approximation to a given matrix. One often wants to decompose a large m × n
matrix A, where m and n are both large, into two lower-rank more-rectangular
matrices E and F such that A ≈ EF. Examples include low-rank approximations
to the eigenvalue decomposition or the singular value decomposition. While
low-rank approximation problems of this type form a cornerstone of traditional
applied mathematics and scientific computing, they also arise in a broad range
of data science applications. Importantly, though, the questions one asks of these
matrix decompositions (e.g., whether one is interested in numerical precision or
statistical inference objectives) and even how one accesses these matrices (e.g.,
within the RAM model idealization or in a single-pass streaming setting where
the data can’t even be stored) are very different. Randomness can be useful
in many ways here. This chapter describes randomized algorithms that obtain
better worst-case running time, both in the RAM model and a streaming model,
how randomness can be used to obtain improved communication properties for
algorithms, and also several data-driven decompositions such as the Nyström
method, the Interpolative Decomposition, and the CUR decomposition.
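The range-finder idea underlying decompositions of the form A ≈ EF can be sketched in a few lines. The following is an illustrative sketch, not the chapter's algorithm; the function name and oversampling parameter are ours: sample the range of A with a Gaussian test matrix, orthonormalize the result, and project.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_low_rank(A, k, p=5):
    """Return E (m x (k+p)) and F ((k+p) x n) with A approximately E @ F."""
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + p))   # random Gaussian test matrix
    Y = A @ Omega                             # sample the range of A
    E, _ = np.linalg.qr(Y)                    # orthonormal basis for the sample
    F = E.T @ A                               # project A onto that subspace
    return E, F

# A 100 x 50 matrix of exact rank 3: the sketch should capture it essentially exactly.
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 50))
E, F = randomized_low_rank(A, k=3)
err = np.linalg.norm(A - E @ F) / np.linalg.norm(A)
assert err < 1e-8
```

For a matrix whose rank exceeds k, the same sketch returns a near-optimal rank-(k+p) approximation with high probability; the oversampling parameter p trades a little extra work for much better reliability.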
The fifth chapter, “Four Lectures on Probabilistic Methods for Data Science,”
describes modern methods of high dimensional probability and illustrates how
these methods can be used in data science. Methods of high-dimensional proba-
bility play a central role in applications for statistics, signal processing, theoretical
computer science, and related fields. For example, they can be used within a ran-
domized algorithm to obtain improved running time properties, and/or they can
be used as random models for data, in which case they are needed to obtain in-
ferential guarantees. Indeed, they are used (explicitly or implicitly) in all of the
previous chapters. This chapter presents a sample of particularly useful tools of
high-dimensional probability, focusing on the classical and matrix Bernstein’s in-
equality and the uniform matrix deviation inequality, and it illustrates these tools
with applications for dimension reduction, network analysis, covariance estima-
tion, matrix completion, and sparse signal recovery.
The sixth and final chapter, “Homological Algebra and Data,” provides an ex-
ample of how methods from more pure mathematics, in this case topology, might
be used fruitfully in data science and the mathematics of data, as outlined in the
previous chapters. Topology is—informally—the study of shape, and topological
data analysis provides a framework to analyze data in a manner that should be
insensitive to the particular metric chosen, e.g., to measure the similarity between
data points. It involves replacing a set of data points with a family of simplicial
complexes, and then using ideas from persistent homology to try to determine
the large scale structure of the set. This chapter approaches topological data
analysis from the perspective of homological algebra, where homology is an al-
gebraic compression scheme that excises all but the essential topological features
from a class of data structures. An important point is that linear algebra can
be enriched to cover not merely linear transformations—the 99.9% use case—but
also sequences of linear transformations that form complexes, thus opening the
possibility of further mathematical developments.
Overall, the 2016 PCMI summer program included minicourses by Petros
Drineas, John Duchi, Cynthia Dwork and Kunal Talwar, Robert Ghrist, Piotr
Indyk, Mauro Maggioni, Gunnar Martinsson, Roman Vershynin, and Stephen
Wright. This volume consists of contributions, summarized above, by Petros
Drineas (with Michael Mahoney), Stephen Wright, John Duchi, Gunnar Martins-
son, Roman Vershynin, and Robert Ghrist. Each chapter in this volume was
written by a different author, and so each chapter has its own unique style,
including notational differences, but we have taken some effort to ensure that they
can fruitfully be read together.
Putting together such an effort—both the entire summer session as well as this
volume—is not a minor undertaking, but for us it was not difficult, due to the
large amount of support we received. We would first like to thank Richard Hain,
the former PCMI Program Director, who first invited us to organize the summer
school, as well as Rafe Mazzeo, the current PCMI Program Director, who pro-
vided seamless guidance throughout the entire process. In terms of running the
summer session, a special thank you goes out to the entire PCMI staff, and in par-
ticular to Beth Brainard and Dena Vigil as well as Bryna Kra and Michelle Wachs.
We received a lot of feedback from participants who enjoyed the event, and Beth
and Dena deserve much of the credit for making it run smoothly; and Bryna and
Michelle’s role with the graduate steering committee helped us throughout the
entire process. In terms of this volume, in addition to thanking the authors for
their efforts and (usually) getting back to us in a timely manner, we would like to
thank Ian Morrison, who is the PCMI Publisher. Putting together a volume such
as this can be a tedious task, but for us it was not, and this is in large part due to
Ian’s help and guidance.
IAS/Park City Mathematics Series
Volume 25, Pages 1–48
https://doi.org/10.1090/pcms/025/00829

Lectures on Randomized Numerical Linear Algebra

Petros Drineas and Michael W. Mahoney

Contents
1 Introduction 2
2 Linear Algebra 3
2.1 Basics. 3
2.2 Norms. 4
2.3 Vector norms. 4
2.4 Induced matrix norms. 5
2.5 The Frobenius norm. 6
2.6 The Singular Value Decomposition. 7
2.7 SVD and Fundamental Matrix Spaces. 9
2.8 Matrix Schatten norms. 9
2.9 The Moore-Penrose pseudoinverse. 10
2.10 References. 11
3 Discrete Probability 11
3.1 Random experiments: basics. 11
3.2 Properties of events. 12
3.3 The union bound. 12
3.4 Disjoint events and independent events. 12
3.5 Conditional probability. 12
3.6 Random variables. 13
3.7 Probability mass function and cumulative distribution function. 13
3.8 Independent random variables. 14
3.9 Expectation of a random variable. 14
3.10 Variance of a random variable. 14
3.11 Markov’s inequality. 15
3.12 The Coupon Collector Problem. 16
3.13 References. 16
4 Randomized Matrix Multiplication 16
4.1 Analysis of the RANDMATRIXMULTIPLY algorithm. 18
4.2 Analysis of the algorithm for nearly optimal probabilities. 21

2010 Mathematics Subject Classification. Primary 68W20; Secondary 65Fxx, 62Jxx.


Key words and phrases. Random sampling, random projection, matrix multiplication, least-squares
approximation, low-rank matrix approximation, Park City Mathematics Institute.

©2018 Petros Drineas and Michael W. Mahoney


4.3 Bounding the two norm. 21
4.4 References. 24
5 RandNLA Approaches for Regression Problems 24
5.1 The Randomized Hadamard Transform. 25
5.2 The main algorithm and main theorem. 26
5.3 RandNLA algorithms as preconditioners. 28
5.4 The proof of Theorem 5.2.2. 31
5.5 The running time of the RANDLEASTSQUARES algorithm. 35
5.6 References. 36
6 A RandNLA Algorithm for Low-rank Matrix Approximation 36
6.1 The main algorithm and main theorem. 37
6.2 An alternative expression for the error. 40
6.3 A structural inequality. 41
6.4 Completing the proof of Theorem 6.1.1. 42
6.5 Running time. 47
6.6 References. 47

1. Introduction
Matrices are ubiquitous in computer science, statistics, and applied mathemat-
ics. An m × n matrix can encode information about m objects (each described
by n features), or the behavior of a discretized differential operator on a finite
element mesh; an n × n positive-definite matrix can encode the correlations be-
tween all pairs of n objects, or the edge-connectivity between all pairs of n nodes
in a social network; and so on. Motivated largely by technological developments
that generate extremely large scientific and Internet data sets, recent years have
witnessed exciting developments in the theory and practice of matrix algorithms.
Particularly remarkable is the use of randomization—typically assumed to be a
property of the input data due to, e.g., noise in the data generation mechanisms—
as an algorithmic or computational resource for the development of improved
algorithms for fundamental matrix problems such as matrix multiplication, least-
squares (LS) approximation, low-rank matrix approximation, etc.
Randomized Numerical Linear Algebra (RandNLA) is an interdisciplinary re-
search area that exploits randomization as a computational resource to develop
improved algorithms for large-scale linear algebra problems. From a founda-
tional perspective, RandNLA has its roots in theoretical computer science (TCS),
with deep connections to mathematics (convex analysis, probability theory, met-
ric embedding theory) and applied mathematics (scientific computing, signal pro-
cessing, numerical linear algebra). From an applied perspective, RandNLA is a
vital new tool for machine learning, statistics, and data analysis. Well-engineered
implementations have already outperformed highly-optimized software libraries
for ubiquitous problems such as least-squares regression, with good scalability
in parallel and distributed environments. Moreover, RandNLA promises a sound
algorithmic and statistical foundation for modern large-scale data analysis.
This chapter serves as a self-contained, gentle introduction to three fundamen-
tal RandNLA algorithms: randomized matrix multiplication, randomized least-
squares solvers, and a randomized algorithm to compute a low-rank approxima-
tion to a matrix. As such, this chapter has strong connections with many areas
of applied mathematics, and in particular it has strong connections with several
other chapters in this volume. Most notably, this includes that of G. Martinsson,
who uses these methods to develop improved low-rank matrix approximation
solvers [19]; R. Vershynin, who develops probabilistic tools that are used in the
analysis of RandNLA algorithms [28]; and J. Duchi, who uses stochastic and ran-
domized methods in a complementary manner for more general optimization
problems [12].
We start this chapter with a review of basic linear algebraic facts in Section 2;
we review basic facts from discrete probability in Section 3; we present a random-
ized algorithm for matrix multiplication in Section 4; we present a randomized al-
gorithm for least-squares regression problems in Section 5; and finally we present
a randomized algorithm for low-rank approximation in Section 6. We conclude
this introduction by noting that [10, 17] might also be of interest to a reader who
wants to go through other introductory texts on RandNLA.

2. Linear Algebra
In this section, we present a brief overview of basic linear algebraic facts and
notation that will be useful in this chapter. We assume basic familiarity with
linear algebra (e.g., inner/outer products of vectors, basic matrix operations such
as addition, scalar multiplication, transposition, upper/lower triangular matrices,
matrix-vector products, matrix multiplication, matrix trace, etc.).
2.1. Basics. We will entirely focus on matrices and vectors over the reals. We
will use the notation x ∈ Rn to denote an n-dimensional vector: notice the use
of bold latin lowercase letters for vectors. Vectors will always be assumed to be
column vectors, unless explicitly noted otherwise. The vector of all zeros will be
denoted as 0, while the vector of all ones will be denoted as 1; dimensions will
be implied from context or explicitly included as a subscript.
We will use bold latin uppercase letters for matrices, e.g., A ∈ Rm×n denotes
an m × n matrix A. We will use the notation Ai∗ to denote the i-th row of A as
a row vector and A∗i to denote the i-th column of A as a column vector. The
(square) identity matrix will be denoted as In where n denotes the number of
rows and columns. Finally, we use ei to denote the i-th column of In , i.e., the
i-th canonical vector.

Matrix Inverse. A matrix $A \in \mathbb{R}^{n \times n}$ is nonsingular or invertible if there exists a
matrix $A^{-1} \in \mathbb{R}^{n \times n}$ such that
$$A A^{-1} = I_n = A^{-1} A.$$
The inverse exists when all the columns (or all the rows) of $A$ are linearly independent.
In other words, there does not exist a non-zero vector $x \in \mathbb{R}^n$ such that
$Ax = 0$. Standard properties of the inverse include $(A^{-1})^\top = (A^\top)^{-1} = A^{-\top}$
and $(AB)^{-1} = B^{-1} A^{-1}$.
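These properties are easy to spot-check numerically; the following sketch (ours, not the text's) verifies them on random, well-conditioned matrices:

```python
import numpy as np

# Check (A^{-1})^T = (A^T)^{-1} and (AB)^{-1} = B^{-1} A^{-1} numerically.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # shifted to be well-conditioned
B = rng.standard_normal((4, 4)) + 4 * np.eye(4)

Ainv = np.linalg.inv(A)
# A A^{-1} = I = A^{-1} A
assert np.allclose(A @ Ainv, np.eye(4)) and np.allclose(Ainv @ A, np.eye(4))
# transpose of the inverse equals the inverse of the transpose
assert np.allclose(Ainv.T, np.linalg.inv(A.T))
# the inverse of a product reverses the order of the factors
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ Ainv)
```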
Orthogonal matrix. A matrix $A \in \mathbb{R}^{n \times n}$ is orthogonal if $A^\top = A^{-1}$. Equivalently,
for all $i$ and $j$ between one and $n$,
$$A_{*i}^\top A_{*j} = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i \neq j. \end{cases}$$
The same property holds for the rows of $A$. In words, the columns (rows) of $A$
are pairwise orthogonal and normal vectors.
QR Decomposition. Any matrix $A \in \mathbb{R}^{n \times n}$ can be decomposed into the product
of an orthogonal matrix and an upper triangular matrix as:
$$A = QR,$$
where $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $R \in \mathbb{R}^{n \times n}$ is an upper triangular
matrix. The QR decomposition is useful in solving systems of linear equations,
has computational complexity $O(n^3)$, and is numerically stable. To solve the
linear system $Ax = b$ using the QR decomposition, we first premultiply both
sides by $Q^\top$, thus getting $Q^\top Q R x = Rx = Q^\top b$. Then, we solve $Rx = Q^\top b$
using backward substitution [13].
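The QR-based solve just described can be sketched as follows. This is a minimal NumPy illustration; `np.linalg.solve` stands in here for a dedicated triangular back-substitution routine:

```python
import numpy as np

# Solve Ax = b via QR: premultiply by Q^T, then solve the triangular system.
rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible test matrix
b = rng.standard_normal(n)

Q, R = np.linalg.qr(A)        # A = QR with Q orthogonal, R upper triangular
y = Q.T @ b                   # Q^T Q R x = R x = Q^T b
x = np.linalg.solve(R, y)     # back-substitution step: R x = Q^T b

assert np.allclose(A @ x, b)  # x solves the original system
```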
2.2. Norms. Norms are used to measure the size or mass of a matrix or, relatedly,
the length of a vector. They are functions that map an object from $\mathbb{R}^{m \times n}$ (or $\mathbb{R}^n$)
to $\mathbb{R}$. Formally:

Definition 2.2.1. Any function $\| \cdot \| : \mathbb{R}^{m \times n} \to \mathbb{R}$ that satisfies the following
properties is called a norm:
(1) Non-negativity: $\|A\| \geq 0$; $\|A\| = 0$ if and only if $A = 0$.
(2) Triangle inequality: $\|A + B\| \leq \|A\| + \|B\|$.
(3) Scalar multiplication: $\|\alpha A\| = |\alpha| \, \|A\|$, for all $\alpha \in \mathbb{R}$.

The following properties are easy to prove for any norm: $\|{-A}\| = \|A\|$ and
$\big| \|A\| - \|B\| \big| \leq \|A - B\|$.
The latter property is known as the reverse triangle inequality.
2.3. Vector norms. Given $x \in \mathbb{R}^n$ and an integer $p \geq 1$, we define the vector
p-norm as:
$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

The most common vector p-norms are:

• One norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$.
• Euclidean (two) norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2} = \sqrt{x^\top x}$.
• Infinity (max) norm: $\|x\|_\infty = \max_{1 \leq i \leq n} |x_i|$.

Given $x, y \in \mathbb{R}^n$ we can bound the inner product $x^\top y = \sum_{i=1}^{n} x_i y_i$ using p-norms.
The Cauchy-Schwarz inequality states that:
$$|x^\top y| \leq \|x\|_2 \|y\|_2.$$
In words, it gives an upper bound for the inner product of two vectors in terms
of the Euclidean norms of the two vectors. Hölder's inequality states that
$$|x^\top y| \leq \|x\|_1 \|y\|_\infty \quad \text{and} \quad |x^\top y| \leq \|x\|_\infty \|y\|_1.$$
The following inequalities between common vector p-norms are easy to prove:
$$\|x\|_\infty \leq \|x\|_1 \leq n \, \|x\|_\infty,$$
$$\|x\|_2 \leq \|x\|_1 \leq \sqrt{n} \, \|x\|_2,$$
$$\|x\|_\infty \leq \|x\|_2 \leq \sqrt{n} \, \|x\|_\infty.$$
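The inequalities above are easy to spot-check numerically; a small illustrative sketch (ours, not the text's):

```python
import numpy as np

# Spot-check Cauchy-Schwarz, Hölder, and the p-norm comparison inequalities
# on a random vector pair.
rng = np.random.default_rng(3)
n = 10
x = rng.standard_normal(n)
y = rng.standard_normal(n)

one = np.linalg.norm(x, 1)
two = np.linalg.norm(x, 2)
inf = np.linalg.norm(x, np.inf)

assert abs(x @ y) <= two * np.linalg.norm(y, 2)        # Cauchy-Schwarz
assert abs(x @ y) <= one * np.linalg.norm(y, np.inf)   # Hölder
assert inf <= one <= n * inf
assert two <= one <= np.sqrt(n) * two
assert inf <= two <= np.sqrt(n) * inf
```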
Also, $\|x\|_2^2 = x^\top x$. We can now define the notion of orthogonality for a pair of
vectors and state the Pythagorean theorem.

Theorem 2.3.1. Two vectors $x, y \in \mathbb{R}^n$ are orthogonal, i.e., $x^\top y = 0$, if and only if
$$\|x \pm y\|_2^2 = \|x\|_2^2 + \|y\|_2^2.$$

Theorem 2.3.1 is also known as the Pythagorean Theorem. Another interesting
property of the Euclidean norm is that it does not change after pre(post)-
multiplication by a matrix with orthonormal columns (rows).

Theorem 2.3.2. Given a vector $x \in \mathbb{R}^n$ and a matrix $V \in \mathbb{R}^{m \times n}$ with $m \geq n$ and
$V^\top V = I_n$:
$$\|Vx\|_2 = \|x\|_2 \quad \text{and} \quad \|x^\top V^\top\|_2 = \|x\|_2.$$

2.4. Induced matrix norms. Given a matrix $A \in \mathbb{R}^{m \times n}$ and an integer $p \geq 1$ we
define the matrix p-norm as:
$$\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p} = \max_{\|x\|_p = 1} \|Ax\|_p.$$

The most frequent matrix p-norms are:

• One norm: the maximum absolute column sum,
$$\|A\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^{m} |A_{ij}| = \max_{1 \leq j \leq n} \|A e_j\|_1.$$
• Infinity norm: the maximum absolute row sum,
$$\|A\|_\infty = \max_{1 \leq i \leq m} \sum_{j=1}^{n} |A_{ij}| = \max_{1 \leq i \leq m} \|A^\top e_i\|_1.$$
6 Lectures on Randomized Numerical Linear Algebra

• Two (or spectral) norm :



A2 = max Ax2 = max x A Ax.
x2 =1 x2 =1

This family of norms is named “induced” because they are realized by a non-zero
vector $x$ that varies depending on $A$ and $p$. Thus, there exists a unit norm
vector (unit norm in the $p$-norm) $x$ such that $\|A\|_p = \|Ax\|_p$. The induced matrix
$p$-norms follow the submultiplicativity laws:
$$\|Ax\|_p \leq \|A\|_p \|x\|_p \quad \text{and} \quad \|AB\|_p \leq \|A\|_p \|B\|_p.$$
Furthermore, matrix $p$-norms are invariant to permutations: $\|PAQ\|_p = \|A\|_p$,
where $P$ and $Q$ are permutation matrices of appropriate dimensions. Also, if we
consider the matrix with permuted rows and columns
$$PAQ = \begin{pmatrix} B & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$
then the norm of the submatrix is related to the norm of the full unpermuted
matrix as follows: $\|B\|_p \leq \|A\|_p$. The following relationships between matrix
$p$-norms are relatively easy to prove. Given a matrix $A \in \mathbb{R}^{m \times n}$,
$$\frac{1}{\sqrt{n}} \|A\|_\infty \leq \|A\|_2 \leq \sqrt{m}\,\|A\|_\infty,$$
$$\frac{1}{\sqrt{m}} \|A\|_1 \leq \|A\|_2 \leq \sqrt{n}\,\|A\|_1.$$
It is also the case that $\|A^T\|_1 = \|A\|_\infty$ and $\|A^T\|_\infty = \|A\|_1$. While transposition
affects the infinity and one norm of a matrix, it does not affect the two norm,
i.e., $\|A^T\|_2 = \|A\|_2$. Also, the matrix two-norm is not affected by pre- (or post-)
multiplication with matrices whose columns (or rows) are orthonormal vectors:
$\|UAV^T\|_2 = \|A\|_2$, where $U$ and $V$ are orthonormal matrices ($U^T U = I$ and
$V^T V = I$) of appropriate dimensions.
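The induced norms above are all available through `numpy.linalg.norm`; the following sketch (sizes and library choice are ours, not the text's) computes them directly from their definitions and checks a few of the stated relationships.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))

# Induced norms from their definitions:
one_norm = np.abs(A).sum(axis=0).max()     # max absolute column sum = ||A||_1
inf_norm = np.abs(A).sum(axis=1).max()     # max absolute row sum    = ||A||_inf
two_norm = np.linalg.norm(A, 2)            # largest singular value  = ||A||_2

# Agreement with the library's implementations:
assert np.isclose(one_norm, np.linalg.norm(A, 1))
assert np.isclose(inf_norm, np.linalg.norm(A, np.inf))

# Relationships between the induced norms (small slack for roundoff):
ok1 = (inf_norm / np.sqrt(n) <= two_norm + 1e-12) and (two_norm <= np.sqrt(m) * inf_norm + 1e-12)
ok2 = (one_norm / np.sqrt(m) <= two_norm + 1e-12) and (two_norm <= np.sqrt(n) * one_norm + 1e-12)
# Transposition swaps the one and infinity norms:
ok3 = np.isclose(np.linalg.norm(A.T, 1), inf_norm)
```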
2.5. The Frobenius norm. The Frobenius norm is not an induced norm, as it
belongs to the family of Schatten norms (to be discussed in Section 2.8).

Definition 2.5.1. Given a matrix $A \in \mathbb{R}^{m \times n}$, we define the Frobenius norm as:
$$\|A\|_F = \sqrt{\sum_{j=1}^{n} \sum_{i=1}^{m} A_{ij}^2} = \sqrt{\mathrm{Tr}(A^T A)},$$
where $\mathrm{Tr}(\cdot)$ denotes the matrix trace (where, recall, the trace of a square matrix
is defined to be the sum of the elements on the main diagonal).

Informally, the Frobenius norm measures the variance or variability (which can
be given an interpretation of size or mass) of a matrix. Given a vector $x \in \mathbb{R}^n$, its
Frobenius norm is equal to its Euclidean norm, i.e., $\|x\|_F = \|x\|_2$. Transposition
of a matrix $A \in \mathbb{R}^{m \times n}$ does not affect its Frobenius norm, i.e., $\|A\|_F = \|A^T\|_F$.
Similar to the two norm, the Frobenius norm does not change under permutations
Petros Drineas and Michael W. Mahoney 7

or under pre- (or post-) multiplication with any matrix with orthonormal columns
(rows):
$$\|UAV^T\|_F = \|A\|_F,$$
where $U$ and $V$ are orthonormal matrices ($U^T U = I$ and $V^T V = I$) of appropriate
dimensions. The two and the Frobenius norm can be related by:
$$\|A\|_2 \leq \|A\|_F \leq \sqrt{\mathrm{rank}(A)}\,\|A\|_2 \leq \sqrt{\min\{m, n\}}\,\|A\|_2.$$
The Frobenius norm satisfies the so-called strong submultiplicativity property,
namely:
$$\|AB\|_F \leq \|A\|_2 \|B\|_F \quad \text{and} \quad \|AB\|_F \leq \|A\|_F \|B\|_2.$$
For any vectors $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, the Frobenius norm of their outer product
is equal to the product of the Euclidean norms of the two vectors forming the
outer product:
$$\|x y^T\|_F = \|x\|_2 \|y\|_2.$$
Finally, we state a matrix version of the Pythagorean theorem.

Lemma 2.5.2 (Matrix Pythagoras). Let $A, B \in \mathbb{R}^{m \times n}$. If $A^T B = 0$ then
$$\|A + B\|_F^2 = \|A\|_F^2 + \|B\|_F^2.$$
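A small numerical check of Definition 2.5.1 and Lemma 2.5.2; the construction of a matrix $B$ with $A^T B = 0$ via projection is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

# ||A||_F two ways: entrywise and via the trace of A^T A.
fro = np.sqrt((A ** 2).sum())
fro_trace = np.sqrt(np.trace(A.T @ A))
ok_def = np.isclose(fro, fro_trace) and np.isclose(fro, np.linalg.norm(A, 'fro'))

# ||A||_2 <= ||A||_F <= sqrt(rank(A)) ||A||_2
two = np.linalg.norm(A, 2)
r = np.linalg.matrix_rank(A)
ok_rel = (two <= fro + 1e-12) and (fro <= np.sqrt(r) * two + 1e-12)

# Matrix Pythagoras: A^T B = 0  =>  ||A + B||_F^2 = ||A||_F^2 + ||B||_F^2.
Q, _ = np.linalg.qr(A)                 # orthonormal basis of range(A)
B = rng.standard_normal((5, 3))
B = B - Q @ (Q.T @ B)                  # project out range(A), so A^T B = 0
ok_pyth = np.isclose(np.linalg.norm(A + B, 'fro') ** 2,
                     np.linalg.norm(A, 'fro') ** 2 + np.linalg.norm(B, 'fro') ** 2)
```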

2.6. The Singular Value Decomposition. The Singular Value Decomposition
(SVD) is the most important matrix decomposition and exists for every matrix.

Definition 2.6.1. Given a matrix $A \in \mathbb{R}^{m \times n}$, we define its full SVD as:
$$A = U \Sigma V^T = \sum_{i=1}^{\min\{m,n\}} \sigma_i u_i v_i^T,$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices that contain the left
and right singular vectors of $A$, respectively, and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix,
with the singular values of $A$ in decreasing order on the diagonal.

We will often use $u_i$ (respectively, $v_j$), $i = 1, \ldots, m$ (respectively, $j = 1, \ldots, n$)
to denote the columns of the matrix $U$ (respectively, $V$). Similarly, we will use $\sigma_i$,
$i = 1, \ldots, \min\{m, n\}$ to denote the singular values:
$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min\{m,n\}} \geq 0.$$
The singular values of $A$ are non-negative and their number is equal to $\min\{m, n\}$.
The number of non-zero singular values of $A$ is equal to the rank of $A$. Due to
orthonormal invariance, we get:
$$\Sigma_{PAQ^T} = \Sigma_A,$$
where $P$ and $Q$ are orthonormal matrices ($P^T P = I$ and $Q^T Q = I$) of appro-
priate dimensions. In words, the singular values of $PAQ^T$ are the same as the
singular values of $A$. The following inequalities involving the singular values of
the matrices $A$ and $B$ are important. First, if both $A$ and $B$ are in $\mathbb{R}^{m \times n}$, for all
$i = 1, \ldots, \min\{m, n\}$,

(2.6.2) $\quad |\sigma_i(A) - \sigma_i(B)| \leq \|A - B\|_2.$

Second, if $A \in \mathbb{R}^{p \times m}$ and $B \in \mathbb{R}^{m \times n}$, for all $i = 1, \ldots, \min\{m, n\}$,

(2.6.3) $\quad \sigma_i(AB) \leq \sigma_1(A)\,\sigma_i(B),$

where, recall, $\sigma_1(A) = \|A\|_2$. We are often interested in keeping only the non-
zero singular values and the corresponding left and right singular vectors of a
matrix $A$. Given a matrix $A \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A) = \rho$, its thin SVD can be
defined as follows.

Definition 2.6.4. Given a matrix $A \in \mathbb{R}^{m \times n}$ of rank $\rho \leq \min\{m, n\}$, we define its
thin SVD as:
$$A = \underbrace{U}_{m \times \rho}\, \underbrace{\Sigma}_{\rho \times \rho}\, \underbrace{V^T}_{\rho \times n} = \sum_{i=1}^{\rho} \sigma_i u_i v_i^T,$$
where $U \in \mathbb{R}^{m \times \rho}$ and $V \in \mathbb{R}^{n \times \rho}$ are matrices with pairwise orthonormal
columns (i.e., $U^T U = I$ and $V^T V = I$) that contain the left and right singular
vectors of $A$ corresponding to the non-zero singular values; $\Sigma \in \mathbb{R}^{\rho \times \rho}$ is a di-
agonal matrix with the non-zero singular values of $A$ in decreasing order on the
diagonal.

If $A$ is a nonsingular matrix, we can compute its inverse using the SVD:
$$A^{-1} = (U \Sigma V^T)^{-1} = V \Sigma^{-1} U^T.$$
(If $A$ is nonsingular, then it is square and full rank, in which case the thin SVD is
the same as the full SVD.) The SVD is important because, as is well-known, the
best rank-$k$ approximation to any matrix can be computed via the SVD.

Theorem 2.6.5. Let $A = U \Sigma V^T \in \mathbb{R}^{m \times n}$ be the thin SVD of $A$; let $k$ be an integer
less than $\rho = \mathrm{rank}(A)$; and let $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T = U_k \Sigma_k V_k^T$. Then,
$$\sigma_{k+1} = \min_{B \in \mathbb{R}^{m \times n},\ \mathrm{rank}(B) = k} \|A - B\|_2 = \|A - A_k\|_2$$
and
$$\sum_{j=k+1}^{\rho} \sigma_j^2 = \min_{B \in \mathbb{R}^{m \times n},\ \mathrm{rank}(B) = k} \|A - B\|_F^2 = \|A - A_k\|_F^2.$$

In words, the above theorem states that if we seek a rank-$k$ approximation to a
matrix $A$ that minimizes the two or the Frobenius norm of the “error” matrix, i.e.,
of the difference between $A$ and its approximation, then it suffices to keep the top
$k$ singular values of $A$ and the corresponding left and right singular vectors.
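Theorem 2.6.5 is easy to exercise numerically: truncating the SVD after $k$ terms should produce errors exactly $\sigma_{k+1}$ and the root of the tail sum of squares. A sketch (NumPy, with sizes of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 8, 6, 2
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD, s sorted decreasingly
Ak = U[:, :k] * s[:k] @ Vt[:k, :]                 # best rank-k approximation

# Theorem 2.6.5: the errors are sigma_{k+1} and sqrt of the tail sum of squares.
err2 = np.linalg.norm(A - Ak, 2)
errF = np.linalg.norm(A - Ak, 'fro')
ok_two = np.isclose(err2, s[k])                   # s[k] is sigma_{k+1} (0-indexed)
ok_fro = np.isclose(errF, np.sqrt((s[k:] ** 2).sum()))
```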
We will often use the following notation: let $U_k \in \mathbb{R}^{m \times k}$ and $V_k \in \mathbb{R}^{n \times k}$
denote the matrices of the top $k$ left and top $k$ right singular vectors of $A$; and
let $\Sigma_k \in \mathbb{R}^{k \times k}$ denote the diagonal matrix containing the top $k$ singular values
of $A$. Similarly, let $U_{k,\perp} \in \mathbb{R}^{m \times (\rho - k)}$ (respectively, $V_{k,\perp} \in \mathbb{R}^{n \times (\rho - k)}$) denote
the matrix of the bottom $\rho - k$ nonzero left (respectively, right) singular vectors
of $A$; and let $\Sigma_{k,\perp} \in \mathbb{R}^{(\rho - k) \times (\rho - k)}$ denote the diagonal matrix containing the
bottom $\rho - k$ singular values of $A$. Then,

(2.6.6) $\quad A_k = U_k \Sigma_k V_k^T \quad \text{and} \quad A_{k,\perp} = A - A_k = U_{k,\perp} \Sigma_{k,\perp} V_{k,\perp}^T.$
2.7. SVD and Fundamental Matrix Spaces. Any matrix $A \in \mathbb{R}^{m \times n}$ defines four
fundamental spaces:

The Column Space of $A$: This space is spanned by the columns of $A$:
$$\mathrm{range}(A) = \{b : Ax = b,\ x \in \mathbb{R}^n\} \subset \mathbb{R}^m.$$
The Null Space of $A$: This space is spanned by all vectors $x \in \mathbb{R}^n$ such that
$Ax = 0$:
$$\mathrm{null}(A) = \{x : Ax = 0\} \subset \mathbb{R}^n.$$
The Row Space of $A$: This space is spanned by the rows of $A$:
$$\mathrm{range}(A^T) = \{d : A^T y = d,\ y \in \mathbb{R}^m\} \subset \mathbb{R}^n.$$
The Left Null Space of $A$: This space is spanned by all vectors $y \in \mathbb{R}^m$
such that $A^T y = 0$:
$$\mathrm{null}(A^T) = \{y : A^T y = 0\} \subset \mathbb{R}^m.$$

The SVD reveals orthogonal bases for all these spaces. Given a matrix $A \in \mathbb{R}^{m \times n}$,
with $\mathrm{rank}(A) = \rho$, its SVD can be written as:
$$A = \begin{pmatrix} U_\rho & U_{\rho,\perp} \end{pmatrix} \begin{pmatrix} \Sigma_\rho & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_\rho^T \\ V_{\rho,\perp}^T \end{pmatrix}.$$
It is easy to prove that:
$$\mathrm{range}(A) = \mathrm{range}(U_\rho), \qquad \mathrm{null}(A) = \mathrm{range}(V_{\rho,\perp}),$$
$$\mathrm{range}(A^T) = \mathrm{range}(V_\rho), \qquad \mathrm{null}(A^T) = \mathrm{range}(U_{\rho,\perp}).$$

Theorem 2.7.1 (Basic Theorem of Linear Algebra). The column space of $A$ is
orthogonal to the null space of $A^T$ and their union is $\mathbb{R}^m$. The column space of
$A^T$ is orthogonal to the null space of $A$ and their union is $\mathbb{R}^n$.

2.8. Matrix Schatten norms. The matrix Schatten norms are a special family of
norms that are defined on the vector containing the singular values of a matrix.
Given a matrix $A \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \geq \cdots \geq \sigma_\rho > 0$, we define the
Schatten $p$-norm as:
$$\|A\|_p = \left( \sum_{i=1}^{\rho} \sigma_i^p \right)^{1/p}.$$
Common Schatten norms of a matrix $A \in \mathbb{R}^{m \times n}$ are:

Schatten one-norm: The nuclear norm, i.e., the sum of the singular values.
Schatten two-norm: The Frobenius norm, i.e., the square root of the sum of
the squares of the singular values.
Schatten infinity-norm: The spectral norm, defined as the limit as $p \to \infty$
of the Schatten $p$-norm, i.e., the largest singular value.

Schatten norms are orthogonally invariant and submultiplicative, and they satisfy
Hölder’s inequality.
2.9. The Moore-Penrose pseudoinverse. A generalization of the well-known no-
tion of matrix inverse is the Moore-Penrose pseudoinverse. Formally, given a
matrix $A \in \mathbb{R}^{m \times n}$, a matrix $A^\dagger$ is the Moore-Penrose pseudoinverse of $A$ if it
satisfies the following properties:
(1) $A A^\dagger A = A$.
(2) $A^\dagger A A^\dagger = A^\dagger$.
(3) $(A A^\dagger)^T = A A^\dagger$.
(4) $(A^\dagger A)^T = A^\dagger A$.
Given a matrix $A \in \mathbb{R}^{m \times n}$ of rank $\rho$ and its thin SVD
$$A = \sum_{i=1}^{\rho} \sigma_i u_i v_i^T,$$
its Moore-Penrose pseudoinverse $A^\dagger$ is
$$A^\dagger = \sum_{i=1}^{\rho} \frac{1}{\sigma_i} v_i u_i^T.$$

If a matrix $A \in \mathbb{R}^{n \times n}$ has full rank, then $A^\dagger = A^{-1}$. If a matrix $A \in \mathbb{R}^{m \times n}$
has full column rank, then $A^\dagger A = I_n$, and $A A^\dagger$ is a projection matrix onto the
column span of $A$; while if it has full row rank, then $A A^\dagger = I_m$, and $A^\dagger A$ is a
projection matrix onto the row span of $A$.
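A sketch checking the four Moore-Penrose properties and the projection facts above with `numpy.linalg.pinv` (the matrix sizes are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((7, 3))        # full column rank with probability 1

Ap = np.linalg.pinv(A)                 # Moore-Penrose pseudoinverse

# The four defining properties:
ok1 = np.allclose(A @ Ap @ A, A)
ok2 = np.allclose(Ap @ A @ Ap, Ap)
ok3 = np.allclose((A @ Ap).T, A @ Ap)
ok4 = np.allclose((Ap @ A).T, Ap @ A)

# Full column rank: A^+ A = I_3, and A A^+ projects onto range(A).
ok5 = np.allclose(Ap @ A, np.eye(3))
P = A @ Ap
ok6 = np.allclose(P @ P, P)            # idempotent, hence a projection
```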
A particularly important property regarding the pseudoinverse of the product
of two matrices is the following: for matrices $Y_1 \in \mathbb{R}^{m \times p}$ and $Y_2 \in \mathbb{R}^{p \times n}$
satisfying $\mathrm{rank}(Y_1) = \mathrm{rank}(Y_2)$, [5, Theorem 2.2.3] states that

(2.9.1) $\quad (Y_1 Y_2)^\dagger = Y_2^\dagger Y_1^\dagger.$

(We emphasize that the condition on the ranks is crucial: while the inverse of
the product of two invertible matrices always equals the product of the inverses of those
matrices in reverse order, the analogous statement is not true in full generality for the
Moore-Penrose pseudoinverse [5].)
The fundamental spaces of the Moore-Penrose pseudoinverse are connected
with those of the actual matrix. Given a matrix $A$ and its Moore-Penrose pseu-
doinverse $A^\dagger$, the column space of $A^\dagger$ can be defined as:
$$\mathrm{range}(A^\dagger) = \mathrm{range}(A^T A) = \mathrm{range}(A^T),$$
and it is orthogonal to the null space of $A$. The null space of $A^\dagger$ can be defined as:
$$\mathrm{null}(A^\dagger) = \mathrm{null}(A A^T) = \mathrm{null}(A^T),$$
and it is orthogonal to the column space of $A$.
2.10. References. We refer the interested reader to [5, 13, 26, 27] for additional
background on linear algebra and matrix computations, as well as to [4, 25] for
additional background on matrix perturbation theory.

3. Discrete Probability
In this section, we present a brief overview of discrete probability. More
advanced results (in particular, Bernstein-type inequalities for real-valued and
matrix-valued random variables) will be introduced in the appropriate context
later in the chapter. It is worth noting that most of RandNLA builds upon simple,
fundamental principles of discrete (instead of continuous) probability.
3.1. Random experiments: basics. A random experiment is any procedure that
can be infinitely repeated and has a well-defined set of possible outcomes. Typi-
cal examples are the roll of a die or the toss of a coin. The sample space Ω of a
random experiment is the set of all possible outcomes of the random experiment.
If the random experiment only has two possible outcomes (e.g., success and fail-
ure) then it is often called a Bernoulli trial. In discrete probability, the sample
space Ω is finite. (We will not cover countably or uncountably infinite sample
spaces in this chapter.)
An event is any subset of the sample space $\Omega$. Clearly, the set of all possible
events is the powerset (the set of all possible subsets) of $\Omega$, often denoted as
$2^\Omega$. As an example, consider the following random experiment: toss a coin three
times. Then, the sample space $\Omega$ is
$$\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$$
and an event $E$ could be described in words as “the output of the random exper-
iment was either all heads or all tails”. Then, $E = \{HHH, TTT\}$. The probability
measure or probability function maps the (finite) sample space $\Omega$ to the interval
$[0, 1]$. Formally, $\Pr[\omega]$ is a function whose domain is $\Omega$ and whose range is the
interval $[0, 1]$. This function has the so-called normalization property, namely
$$\sum_{\omega \in \Omega} \Pr[\omega] = 1.$$
If $E$ is an event, then

(3.1.1) $\quad \Pr[E] = \sum_{\omega \in E} \Pr[\omega],$

namely the probability of an event is the sum of the probabilities of its elements.
It follows that the probability of the empty event (the event E that corresponds
to the empty set) is equal to zero, whereas the probability of the event Ω (clearly
Ω itself is an event) is equal to one. Finally, the uniform probability function is
defined as Pr [ω] = 1/ |Ω|, for all ω ∈ Ω.
3.2. Properties of events. Recall that events are sets and thus set operations
(union, intersection, complementation) are applicable. Assuming finite sample
spaces and using Eqn. (3.1.1), it is easy to prove the following property for the
union of two events $E_1$ and $E_2$:
$$\Pr[E_1 \cup E_2] = \Pr[E_1] + \Pr[E_2] - \Pr[E_1 \cap E_2].$$
This property follows from the well-known inclusion-exclusion principle for set
union and can be generalized to more than two sets and thus to more than two
events. Similarly, one can prove that $\Pr[\bar{E}] = 1 - \Pr[E]$, where $\bar{E}$ denotes
the complement of the event $E$. Finally, it is easy to see that if $E_1$ is a subset of
$E_2$ then $\Pr[E_1] \leq \Pr[E_2]$.
3.3. The union bound. The union bound is a fundamental result in discrete
probability and can be used to bound the probability of a union of events without
any special assumptions on the relationships between the events. Indeed, let $E_i$
for all $i = 1, \ldots, n$ be events defined over a finite sample space $\Omega$. Then, the union
bound states that
$$\Pr\left[\bigcup_{i=1}^{n} E_i\right] \leq \sum_{i=1}^{n} \Pr[E_i].$$
The proof of the union bound is quite simple and can be done by induction, using
the inclusion-exclusion principle for two sets that was discussed in the previous
section.
3.4. Disjoint events and independent events. Two events $E_1$ and $E_2$ are called
disjoint or mutually exclusive if their intersection is the empty set, i.e., if
$$E_1 \cap E_2 = \emptyset.$$
This can be generalized to any number of events by requiring that the events
are all pairwise disjoint. Two events $E_1$ and $E_2$ are called independent if the oc-
currence of one does not affect the probability of the other. Formally, they must
satisfy
$$\Pr[E_1 \cap E_2] = \Pr[E_1] \cdot \Pr[E_2].$$
Again, this can be generalized to more than two events by requiring that the
events are all pairwise independent.
3.5. Conditional probability. For any two events $E_1$ and $E_2$, the conditional
probability $\Pr[E_1 | E_2]$ is the probability that $E_1$ occurs given that $E_2$ occurs. For-
mally,
$$\Pr[E_1 | E_2] = \frac{\Pr[E_1 \cap E_2]}{\Pr[E_2]}.$$
Obviously, the probability of $E_2$ in the denominator must be non-zero for this to
be well-defined. The well-known Bayes rule states that for any two events $E_1$ and
$E_2$ such that $\Pr[E_1] > 0$ and $\Pr[E_2] > 0$,
$$\Pr[E_2 | E_1] = \frac{\Pr[E_1 | E_2] \Pr[E_2]}{\Pr[E_1]}.$$
Using the Bayes rule and the fact that the sample space $\Omega$ can be partitioned as
$\Omega = E_2 \cup \bar{E}_2$, it follows that
$$\Pr[E_1] = \Pr[E_1 | E_2] \Pr[E_2] + \Pr[E_1 | \bar{E}_2] \Pr[\bar{E}_2].$$
We note that the probabilities of both events $E_1$ and $E_2$ must be in the open
interval $(0, 1)$. We can now revisit the notion of independent events. Indeed, for
any two events $E_1$ and $E_2$ such that $\Pr[E_1] > 0$ and $\Pr[E_2] > 0$ the following
statements are equivalent:
(1) $\Pr[E_1 | E_2] = \Pr[E_1]$,
(2) $\Pr[E_2 | E_1] = \Pr[E_2]$, and
(3) $\Pr[E_1 \cap E_2] = \Pr[E_1] \Pr[E_2]$.
Recall that the last statement was the definition of independence in the previous
section.
3.6. Random variables. Random variables are functions mapping the sample
space $\Omega$ to the real numbers $\mathbb{R}$. Note that even though they are called variables,
in reality they are functions. Let $\Omega$ be the sample space of a random experiment.
A formal definition for the random variable $X$ would be as follows: let $\alpha \in \mathbb{R}$ be
a real number (not necessarily positive) and note that
$$X^{-1}(\alpha) = \{\omega \in \Omega : X(\omega) = \alpha\}$$
is a subset of $\Omega$ and thus is an event. Therefore, the set $X^{-1}(\alpha)$ has a
probability. We will abuse notation and write:
$$\Pr[X = \alpha]$$
instead of the more proper notation $\Pr[X^{-1}(\alpha)]$. This function of $\alpha$ is of great
interest and it is easy to generalize as follows:
$$\Pr[X \leq \alpha] = \Pr\left[X^{-1}((-\infty, \alpha])\right] = \Pr[\omega \in \Omega : X(\omega) \leq \alpha].$$
3.7. Probability mass function and cumulative distribution function. Two com-
mon functions associated with random variables are the probability mass func-
tion (PMF) and the cumulative distribution function (CDF). The first measures
the probability that a random variable takes a particular value α ∈ R, and the
second measures the probability that a random variable takes any value below
α ∈ R.

Definition 3.7.1 (Probability Mass Function (PMF)). For a random variable $X$
and a real number $\alpha$, the function $f(\alpha) = \Pr[X = \alpha]$ is called the probability mass
function (PMF).

Definition 3.7.2 (Cumulative Distribution Function (CDF)). For a random vari-
able $X$ and a real number $\alpha$, the function $F(\alpha) = \Pr[X \leq \alpha]$ is called the
cumulative distribution function (CDF).

It is obvious from the above definitions that $F(\alpha) = \sum_{x \leq \alpha} f(x)$.
3.8. Independent random variables. Following the notion of independence for
events, we can now define the notion of independence for random variables. In-
deed, two random variables X and Y are independent if for all reals a and b,
Pr [X = a and Y = b] = Pr [X = a] · Pr [Y = b] .
3.9. Expectation of a random variable. Given a random variable $X$, its expecta-
tion $\mathrm{E}[X]$ is defined as
$$\mathrm{E}[X] = \sum_{x \in X(\Omega)} x \cdot \Pr[X = x].$$
In the above, $X(\Omega)$ is the image of the random variable $X$ over the sample space
$\Omega$; recall that $X$ is a function. That is, the sum is over the range of the random
variable $X$. Alternatively, $\mathrm{E}[X]$ can be expressed in terms of a sum over the domain
of $X$, i.e., over $\Omega$. For finite sample spaces $\Omega$, such as those that arise in discrete
probability, we get
$$\mathrm{E}[X] = \sum_{\omega \in \Omega} X(\omega) \Pr[\omega].$$
We now discuss fundamental properties of the expectation. The most important
property is linearity of expectation: for any random variables $X$ and $Y$ and real
number $\lambda$,
$$\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y], \quad \text{and} \quad \mathrm{E}[\lambda X] = \lambda\, \mathrm{E}[X].$$
The first property generalizes to any finite sum of random variables and does
not need any assumptions on the random variables involved in the summation.
If two random variables $X$ and $Y$ are independent then we can manipulate the
expectation of their product as follows:
$$\mathrm{E}[XY] = \mathrm{E}[X] \cdot \mathrm{E}[Y].$$
3.10. Variance of a random variable. Given a random variable $X$, its variance
$\mathrm{Var}[X]$ is defined as
$$\mathrm{Var}[X] = \mathrm{E}\left[(X - \mathrm{E}[X])^2\right].$$
In words, the variance measures the average of the square of the difference
$X - \mathrm{E}[X]$. The standard deviation is the square root of the variance and is often
denoted by $\sigma$. It is easy to prove that
$$\mathrm{Var}[X] = \mathrm{E}[X^2] - (\mathrm{E}[X])^2.$$
This obviously implies
$$\mathrm{Var}[X] \leq \mathrm{E}[X^2],$$
which is often all we need in order to get an upper bound for the variance. Un-
like the expectation, the variance does not have a linearity property, unless the
random variables involved are independent. Indeed, if the random variables $X$
and $Y$ are independent, then
$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y].$$
The above property generalizes to sums of more than two random variables, as-
suming that all involved random variables are pairwise independent. Also, for
any real $\lambda$,
$$\mathrm{Var}[\lambda X] = \lambda^2\, \mathrm{Var}[X].$$


3.11. Markov’s inequality. Let $X$ be a non-negative random variable; for any
$\alpha > 0$,
$$\Pr[X \geq \alpha] \leq \frac{\mathrm{E}[X]}{\alpha}.$$
This is a very simple inequality to apply and only needs an upper bound for
the expectation of $X$. An equivalent formulation is the following: let $X$ be a non-
negative random variable; for any $k > 1$,
$$\Pr[X \geq k \cdot \mathrm{E}[X]] \leq \frac{1}{k},$$
or, equivalently,
$$\Pr[X \leq k \cdot \mathrm{E}[X]] \geq 1 - \frac{1}{k}.$$
In words, the probability that a random variable exceeds $k$ times its expectation
is at most $1/k$. (The two formulations are related by setting $k = \alpha / \mathrm{E}[X]$.) In order
to prove Markov’s inequality, we will show that, for any $t > 0$,
$$\Pr[X \geq t] \leq \frac{\mathrm{E}[X]}{t}.$$
To prove the above inequality, we define the indicator function
$$f(X) = \begin{cases} 1, & \text{if } X \geq t \\ 0, & \text{otherwise} \end{cases}$$
with expectation:
$$\mathrm{E}[f(X)] = 1 \cdot \Pr[X \geq t] + 0 \cdot \Pr[X < t] = \Pr[X \geq t].$$
Since $X$ is non-negative, the definition of $f$ implies $f(X) \leq \frac{X}{t}$. Taking expectations
on both sides:
$$\mathrm{E}[f(X)] \leq \mathrm{E}\left[\frac{X}{t}\right] = \frac{\mathrm{E}[X]}{t}.$$
Thus,
$$\Pr[X \geq t] \leq \frac{\mathrm{E}[X]}{t}.$$
Hence, we conclude the proof of Markov’s inequality.
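A small simulation illustrates how loose Markov's inequality can be in practice. The Bernoulli-sum example below is our own illustration, not from the text.

```python
import random

random.seed(0)
# X = number of heads in 10 fair coin flips; E[X] = 5, and X is non-negative.
trials = 20000
samples = [sum(random.random() < 0.5 for _ in range(10)) for _ in range(trials)]

mean = sum(samples) / trials            # should be close to 5
k = 1.6                                 # threshold k * E[X] = 8 heads
frac = sum(x >= k * 5 for x in samples) / trials

# Markov: Pr[X >= k E[X]] <= 1/k = 0.625.  The true probability of seeing
# 8 or more heads is about 0.055, far below the bound -- Markov only uses
# non-negativity and the mean, so it is often very loose.
markov_ok = frac <= 1 / k
```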
3.12. The Coupon Collector Problem. Suppose there are $m$ types of coupons
and we seek to collect them in independent trials, where in each trial the proba-
bility of obtaining any one coupon is $1/m$ (uniform). Let $X$ denote the number of
trials that we need in order to collect at least one coupon of each type. Then, one
can prove that [20, Section 3.6]:
$$\mathrm{E}[X] = m \ln m + \Theta(m), \quad \text{and} \quad \mathrm{Var}[X] = \frac{\pi^2}{6} m^2 + \Theta(m \ln m).$$
The occurrence of the additional $\ln m$ factor in the expectation is common in
sampling-based approaches that attempt to recover $m$ different types of objects
using sampling in independent trials. Such factors will appear in many RandNLA
sampling-based algorithms.
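For the uniform case the expectation can also be written exactly as $m H_m$ (the harmonic-number form of $m \ln m + \Theta(m)$); the simulation below, with parameters of our own choosing, matches it closely.

```python
import math
import random

random.seed(1)

def coupon_trials(m):
    """Number of i.i.d. uniform draws needed to see all m coupon types."""
    seen, draws = set(), 0
    while len(seen) < m:
        seen.add(random.randrange(m))
        draws += 1
    return draws

m = 50
runs = 2000
avg = sum(coupon_trials(m) for _ in range(runs)) / runs

# E[X] = m * H_m exactly for uniform coupons; H_m ~ ln m + 0.577...
exact = m * sum(1 / i for i in range(1, m + 1))
close = abs(avg - exact) / exact < 0.05
```

Here `exact` is about 225 for m = 50, noticeably larger than m = 50 itself: that is the extra ln m factor the text refers to.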
3.13. References. There are numerous texts covering discrete probability; most
of the material in this chapter was adapted from [20].

4. Randomized Matrix Multiplication


Our first randomized algorithm for a numerical linear algebra problem is a
simple, sampling-based approach to approximate the product of two matrices
A ∈ Rm×n and B ∈ Rn×p . This randomized matrix multiplication algorithm is
at the heart of all of the RandNLA algorithms that we will discuss in this chapter,
and indeed all of RandNLA more generally. It is of interest both pedagogically
and in and of itself, and it is also used in an essential way in the analysis of the
least squares approximation and low-rank approximation algorithms discussed
below.
We start by noting that the product $AB$ may be written as the sum of $n$ rank-
one matrices:

(4.0.1) $\quad AB = \sum_{t=1}^{n} \underbrace{A_{*t} B_{t*}}_{\in\, \mathbb{R}^{m \times p}},$

where each of the summands is the outer product of a column of $A$ and the corre-
sponding row of $B$. Recall that the standard definition of matrix multiplication
states that the $(i, j)$-th entry of the matrix product $AB$ is equal to the inner product
of the $i$-th row of $A$ and the $j$-th column of $B$, namely
$$(AB)_{ij} = A_{i*} B_{*j} \in \mathbb{R}.$$
It is easy to see that the two definitions are equivalent. However, when matrix
multiplication is formulated as in Eqn. (4.0.1), a simple randomized algorithm
to approximate the product $AB$ suggests itself: in independent identically dis-
tributed (i.i.d.) trials, randomly sample (and appropriately rescale) a few rank-
one matrices from the $n$ terms in the summation of Eqn. (4.0.1); and then output
the sum of the (rescaled) terms as an estimator for $AB$.

Input: $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, integer $c$ ($1 \leq c \leq n$), and $\{p_k\}_{k=1}^{n}$ s.t.
$p_k \geq 0$ and $\sum_{k=1}^{n} p_k = 1$.
Output: $C \in \mathbb{R}^{m \times c}$ and $R \in \mathbb{R}^{c \times p}$.

(1) For $t = 1$ to $c$,
    • Pick $i_t \in \{1, \ldots, n\}$ with $\Pr[i_t = k] = p_k$, independently and
      with replacement.
    • Set $C_{*t} = \frac{1}{\sqrt{c\, p_{i_t}}} A_{*i_t}$ and $R_{t*} = \frac{1}{\sqrt{c\, p_{i_t}}} B_{i_t *}$.
(2) Return $CR = \sum_{t=1}^{c} \frac{1}{c\, p_{i_t}} A_{*i_t} B_{i_t *}$.

Algorithm 4.0.2: The RandMatrixMultiply algorithm

Consider the RandMatrixMultiply algorithm (Algorithm 4.0.2), which makes
this simple idea precise. When this algorithm is given as input two matrices $A$
and $B$, a probability distribution $\{p_k\}_{k=1}^{n}$, and a number $c$ of column-row pairs to
choose, it returns as output an estimator for the product $AB$ of the form
$$\sum_{t=1}^{c} \frac{1}{c\, p_{i_t}} A_{*i_t} B_{i_t *}.$$
Equivalently, the above estimator can be thought of as the product of the two ma-
trices $C$ and $R$ formed by the RandMatrixMultiply algorithm, where $C$ consists
of $c$ (rescaled) columns of $A$ and $R$ consists of the corresponding (rescaled) rows
of $B$. Observe that
$$CR = \sum_{t=1}^{c} C_{*t} R_{t*} = \sum_{t=1}^{c} \left( \frac{1}{\sqrt{c\, p_{i_t}}} A_{*i_t} \right) \left( \frac{1}{\sqrt{c\, p_{i_t}}} B_{i_t *} \right) = \frac{1}{c} \sum_{t=1}^{c} \frac{1}{p_{i_t}} A_{*i_t} B_{i_t *}.$$
Therefore, the procedure used for sampling and scaling column-row pairs in the
RandMatrixMultiply algorithm corresponds to sampling and rescaling terms
in Eqn. (4.0.1).
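The algorithm is short enough to implement directly. The following NumPy sketch (function and variable names are ours) follows Algorithm 4.0.2 and empirically confirms that averaging many independent estimates $CR$ recovers $AB$, using the norm-based probabilities of Remark 4.0.4.

```python
import numpy as np

def rand_matrix_multiply(A, B, c, p, rng):
    """Algorithm 4.0.2 (RandMatrixMultiply): sample and rescale c column-row pairs."""
    n = A.shape[1]
    idx = rng.choice(n, size=c, replace=True, p=p)   # i_1, ..., i_c
    scale = 1.0 / np.sqrt(c * p[idx])
    C = A[:, idx] * scale                            # c rescaled columns of A
    R = B[idx, :] * scale[:, None]                   # c rescaled rows of B
    return C, R

rng = np.random.default_rng(5)
m, n, q = 10, 50, 8
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, q))

# Nearly optimal probabilities: proportional to ||A_{*k}||_2 ||B_{k*}||_2.
w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p = w / w.sum()

# CR is an unbiased estimator of AB, so the average of many independent
# estimates should converge to the true product.
T = 400
est = np.zeros((m, q))
for _ in range(T):
    C, R = rand_matrix_multiply(A, B, c=30, p=p, rng=rng)
    est += C @ R
est /= T
rel_err = np.linalg.norm(A @ B - est, 'fro') / np.linalg.norm(A @ B, 'fro')
```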

Remark 4.0.3. The analysis of RandNLA algorithms has benefited enormously
from formulating algorithms using the so-called sampling-and-rescaling matrix for-
malism. Let’s define the sampling-and-rescaling matrix $S \in \mathbb{R}^{n \times c}$ to be a matrix
with $S_{i_t t} = 1/\sqrt{c\, p_{i_t}}$ if the $i_t$-th column of $A$ is chosen in the $t$-th trial (all other
entries of $S$ are set to zero). Then
$$C = AS \quad \text{and} \quad R = S^T B,$$
so that $CR = A S S^T B \approx AB$. Obviously, the matrix $S$ is very sparse, having a
single non-zero entry per column, for a total of $c$ non-zero entries, and so it is not
explicitly constructed and stored by the algorithm.

Remark 4.0.4. The choice of the sampling probabilities $\{p_k\}_{k=1}^{n}$ in the RandMa-
trixMultiply algorithm is very important. As we will prove in Lemma 4.1.1,
the estimator returned by the RandMatrixMultiply algorithm is (in an element-
wise sense) unbiased, regardless of our choice of the sampling probabilities. How-
ever, a natural notion of the variance of our estimator (see Theorem 4.1.2 for a
precise definition) is minimized when the sampling probabilities are set to
$$p_k = \frac{\|A_{*k} B_{k*}\|_F}{\sum_{k'=1}^{n} \|A_{*k'} B_{k'*}\|_F} = \frac{\|A_{*k}\|_2 \|B_{k*}\|_2}{\sum_{k'=1}^{n} \|A_{*k'}\|_2 \|B_{k'*}\|_2}.$$
In words, the best choice when sampling rank-one matrices from the summation
of Eqn. (4.0.1) is to select rank-one matrices that have larger Frobenius norms
with higher probabilities. This is equivalent to selecting column-row pairs that
have larger (products of) Euclidean norms with higher probability.

Remark 4.0.5. This approach for approximating matrix multiplication has several
advantages. First, it is conceptually simple. Second, since the heart of the algo-
rithm involves matrix multiplication of smaller matrices, it can use any algorithms
that exist in the literature for performing the desired matrix multiplication. Third,
this approach does not tamper with the sparsity of the input matrices. Finally, the
algorithm can be easily implemented in one pass over the input matrices A and
$B$, given the sampling probabilities $\{p_k\}_{k=1}^{n}$. See [9, Section 4.2] for a detailed
discussion regarding the implementation of the RandMatrixMultiply algorithm
in the pass-efficient and streaming models of computation.

4.1. Analysis of the RandMatrixMultiply algorithm. This section provides up-
per bounds for the error $\|AB - CR\|_F^2$, where $C$ and $R$ are the outputs of
the RandMatrixMultiply algorithm.
Our first lemma proves that the expectation of the $(i, j)$-th element of the es-
timator $CR$ is equal to the $(i, j)$-th element of the exact product $AB$, regardless
of the choice of the sampling probabilities. It further bounds the variance of the
$(i, j)$-th element of the estimator, although this bound does depend on our choice
of the sampling probabilities.

Lemma 4.1.1. Let $C$ and $R$ be constructed as described in the RandMatrixMul-
tiply algorithm. Then,
$$\mathrm{E}\left[(CR)_{ij}\right] = (AB)_{ij}$$
and
$$\mathrm{Var}\left[(CR)_{ij}\right] \leq \frac{1}{c} \sum_{k=1}^{n} \frac{A_{ik}^2 B_{kj}^2}{p_k}.$$
Proof. Fix some pair $i, j$. For $t = 1, \ldots, c$, define
$$X_t = \left( \frac{A_{*i_t} B_{i_t *}}{c\, p_{i_t}} \right)_{ij} = \frac{A_{i i_t} B_{i_t j}}{c\, p_{i_t}}.$$
Thus, for any $t$,
$$\mathrm{E}[X_t] = \sum_{k=1}^{n} p_k \frac{A_{ik} B_{kj}}{c\, p_k} = \frac{1}{c} \sum_{k=1}^{n} A_{ik} B_{kj} = \frac{1}{c} (AB)_{ij}.$$
Since we have $(CR)_{ij} = \sum_{t=1}^{c} X_t$, it follows that
$$\mathrm{E}\left[(CR)_{ij}\right] = \mathrm{E}\left[\sum_{t=1}^{c} X_t\right] = \sum_{t=1}^{c} \mathrm{E}[X_t] = (AB)_{ij}.$$
Hence, $CR$ is an unbiased estimator of $AB$, regardless of the choice of the sam-
pling probabilities. Using the fact that $(CR)_{ij}$ is the sum of $c$ independent random
variables, we get
$$\mathrm{Var}\left[(CR)_{ij}\right] = \mathrm{Var}\left[\sum_{t=1}^{c} X_t\right] = \sum_{t=1}^{c} \mathrm{Var}[X_t].$$
Using $\mathrm{Var}[X_t] \leq \mathrm{E}\left[X_t^2\right] = \sum_{k=1}^{n} \frac{A_{ik}^2 B_{kj}^2}{c^2 p_k}$, we get
$$\mathrm{Var}\left[(CR)_{ij}\right] = \sum_{t=1}^{c} \mathrm{Var}[X_t] \leq c \sum_{k=1}^{n} \frac{A_{ik}^2 B_{kj}^2}{c^2 p_k} = \frac{1}{c} \sum_{k=1}^{n} \frac{A_{ik}^2 B_{kj}^2}{p_k},$$
which concludes the proof of the lemma. $\square$
Our next result bounds the expectation of the Frobenius norm of the error
matrix $AB - CR$. Notice that this error metric depends on our choice of the
sampling probabilities $\{p_k\}_{k=1}^{n}$.

Theorem 4.1.2. Construct $C$ and $R$ using the RandMatrixMultiply algorithm
and let $CR$ be an approximation to $AB$. Then,

(4.1.3) $\quad \mathrm{E}\left[\|AB - CR\|_F^2\right] \leq \sum_{k=1}^{n} \frac{\|A_{*k}\|_2^2 \|B_{k*}\|_2^2}{c\, p_k}.$

Furthermore, if

(4.1.4) $\quad p_k = \frac{\|A_{*k}\|_2 \|B_{k*}\|_2}{\sum_{k'=1}^{n} \|A_{*k'}\|_2 \|B_{k'*}\|_2},$

for all $k = 1, \ldots, n$, then

(4.1.5) $\quad \mathrm{E}\left[\|AB - CR\|_F^2\right] \leq \frac{1}{c} \left( \sum_{k=1}^{n} \|A_{*k}\|_2 \|B_{k*}\|_2 \right)^2.$

This choice for $\{p_k\}_{k=1}^{n}$ minimizes $\mathrm{E}\left[\|AB - CR\|_F^2\right]$, among possible choices for
the sampling probabilities.


Proof. First of all, since $CR$ is an unbiased estimator of $AB$, $\mathrm{E}\left[(AB - CR)_{ij}\right] = 0$.
Thus,
$$\mathrm{E}\left[\|AB - CR\|_F^2\right] = \sum_{i=1}^{m} \sum_{j=1}^{p} \mathrm{E}\left[(AB - CR)_{ij}^2\right] = \sum_{i=1}^{m} \sum_{j=1}^{p} \mathrm{Var}\left[(CR)_{ij}\right].$$
Using Lemma 4.1.1, we get
$$\mathrm{E}\left[\|AB - CR\|_F^2\right] \leq \frac{1}{c} \sum_{k=1}^{n} \frac{1}{p_k} \left( \sum_i A_{ik}^2 \right) \left( \sum_j B_{kj}^2 \right) = \frac{1}{c} \sum_{k=1}^{n} \frac{1}{p_k} \|A_{*k}\|_2^2 \|B_{k*}\|_2^2.$$
Let $p_k$ be as in Eqn. (4.1.4); then
$$\mathrm{E}\left[\|AB - CR\|_F^2\right] \leq \frac{1}{c} \left( \sum_{k=1}^{n} \|A_{*k}\|_2 \|B_{k*}\|_2 \right)^2.$$
Finally, to prove that the aforementioned choice for the $\{p_k\}_{k=1}^{n}$ minimizes the
quantity $\mathrm{E}\left[\|AB - CR\|_F^2\right]$, define the function
$$f(p_1, \ldots, p_n) = \sum_{k=1}^{n} \frac{1}{p_k} \|A_{*k}\|_2^2 \|B_{k*}\|_2^2,$$
which characterizes the dependence of $\mathrm{E}\left[\|AB - CR\|_F^2\right]$ on the $p_k$’s. In order to
minimize $f$ subject to $\sum_{k=1}^{n} p_k = 1$, we can introduce the Lagrange multiplier $\lambda$
and define the function
$$g(p_1, \ldots, p_n) = f(p_1, \ldots, p_n) + \lambda \left( \sum_{k=1}^{n} p_k - 1 \right).$$
We then have the minimum at
$$0 = \frac{\partial g}{\partial p_k} = \frac{-1}{p_k^2} \|A_{*k}\|_2^2 \|B_{k*}\|_2^2 + \lambda.$$
Thus,
$$p_k = \frac{\|A_{*k}\|_2 \|B_{k*}\|_2}{\sqrt{\lambda}} = \frac{\|A_{*k}\|_2 \|B_{k*}\|_2}{\sum_{k'=1}^{n} \|A_{*k'}\|_2 \|B_{k'*}\|_2},$$
where the second equality comes from solving for $\sqrt{\lambda}$ in $\sum_{k=1}^{n} p_k = 1$. These
probabilities are minimizers of $f$ because $\frac{\partial^2 g}{\partial p_k^2} > 0$ for all $k$. $\square$
We conclude this section by pointing out that we can apply Markov’s inequal-
ity on the expectation bound of Theorem 4.1.2 in order to get bounds for the
Frobenius norm of the error matrix AB − CR that hold with constant probabil-
ity. We refer the reader to [9, Section 4.4] for a tighter analysis, arguing for a
better (in the sense of better dependence on the failure probability than provided
by Markov’s inequality) concentration of the Frobenius norm of the error matrix
around its mean using a martingale argument.
4.2. Analysis of the algorithm for nearly optimal probabilities. We now dis-
cuss three different choices for the sampling probabilities that are easy to analyze
and will be useful in this chapter. We summarize these results in the following
list; all three bounds can be easily proven following the proof of Theorem 4.1.2.

Nearly optimal probabilities, depending on both $A$ and $B$: If the $\{p_k\}_{k=1}^{n}$
satisfy

(4.2.1) $\quad \sum_{k=1}^{n} p_k = 1 \quad \text{and} \quad p_k \geq \frac{\beta \|A_{*k}\|_2 \|B_{k*}\|_2}{\sum_{k'=1}^{n} \|A_{*k'}\|_2 \|B_{k'*}\|_2},$

for some positive constant $\beta \leq 1$, then,

(4.2.2) $\quad \mathrm{E}\left[\|AB - CR\|_F^2\right] \leq \frac{1}{\beta c} \left( \sum_{k=1}^{n} \|A_{*k}\|_2 \|B_{k*}\|_2 \right)^2.$

Nearly optimal probabilities, depending only on $A$: If the $\{p_k\}_{k=1}^{n}$ satisfy

(4.2.3) $\quad \sum_{k=1}^{n} p_k = 1 \quad \text{and} \quad p_k \geq \frac{\beta \|A_{*k}\|_2^2}{\|A\|_F^2},$

for some positive constant $\beta \leq 1$, then,

(4.2.4) $\quad \mathrm{E}\left[\|AB - CR\|_F^2\right] \leq \frac{1}{\beta c} \|A\|_F^2 \|B\|_F^2.$

Nearly optimal probabilities, depending only on $B$: If the $\{p_k\}_{k=1}^{n}$ satisfy

(4.2.5) $\quad \sum_{k=1}^{n} p_k = 1 \quad \text{and} \quad p_k \geq \frac{\beta \|B_{k*}\|_2^2}{\|B\|_F^2},$

for some positive constant $\beta \leq 1$, then,

(4.2.6) $\quad \mathrm{E}\left[\|AB - CR\|_F^2\right] \leq \frac{1}{\beta c} \|A\|_F^2 \|B\|_F^2.$

We note that, from the Cauchy-Schwarz inequality,
$$\left( \sum_{k=1}^{n} \|A_{*k}\|_2 \|B_{k*}\|_2 \right)^2 \leq \|A\|_F^2 \|B\|_F^2,$$
and thus the bound of Eqn. (4.2.2) is generally better than those of Eqns. (4.2.4)
and (4.2.6). See [9, Section 4.3, Table 1] for other sampling probabilities and
respective error bounds that might be of interest.
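The practical difference between these probability choices shows up when the column-row norms are far from uniform. The sketch below (parameters and setup are our own illustration) estimates the expected squared Frobenius error empirically for the nearly optimal probabilities with $\beta = 1$ versus plain uniform sampling.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, q = 12, 60, 9
A = rng.standard_normal((m, n))
A[:, :6] *= 10.0                       # a few heavy columns: norms far from uniform
B = rng.standard_normal((n, q))

def avg_sq_error(p, c=25, reps=300):
    """Empirical E||AB - CR||_F^2 under sampling probabilities p."""
    total = 0.0
    for _ in range(reps):
        idx = rng.choice(n, size=c, replace=True, p=p)
        scale = 1.0 / np.sqrt(c * p[idx])
        C = A[:, idx] * scale
        R = B[idx, :] * scale[:, None]
        total += np.linalg.norm(A @ B - C @ R, 'fro') ** 2
    return total / reps

w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p_opt = w / w.sum()                    # nearly optimal probabilities (beta = 1)
p_unif = np.full(n, 1.0 / n)           # oblivious uniform sampling

better = avg_sq_error(p_opt) < avg_sq_error(p_unif)
```

With the heavy columns above, norm-based sampling typically reduces the average squared error by a substantial constant factor, as the comparison of Eqn. (4.1.5) with the uniform-sampling bound suggests.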
4.3. Bounding the two norm. In both applications of the RandMatrixMultiply
algorithm that we will discuss in this chapter (see least-squares approximation
and low-rank matrix approximation in Sections 5 and 6, respectively), we will be
particularly interested in approximating the product $U^T U$, where $U$ is a tall-and-
thin matrix, by sampling (and rescaling) a few rows of $U$. (The matrix $U$ will be
a matrix spanning the column space or the “important” part of the column space
of some other matrix of interest.) It turns out that, without loss of generality,
we can focus on the special case where $U \in \mathbb{R}^{n \times d}$ ($n \gg d$) is a matrix with
orthonormal columns (i.e., $U^T U = I_d$). Then, if we let $R \in \mathbb{R}^{c \times d}$ be a sample
of c (rescaled) rows of U constructed using the RandMatrixMultiply algorithm,


and note that the corresponding c (rescaled) columns of UT form the matrix RT ,
then Theorem 4.1.2 implies that
    d2
(4.3.1) E UT U − RT R2F = E Id − RT R2F  .
βc
In the above, we used the fact that U2F = d. For the above bound to hold, it
suffices to use sampling probabilities pk (k = 1, . . . , n) that satisfy

n
βUk∗ 22
(4.3.2) pk = 1 and pk  .
d
k=1

(The quantities Uk∗ 22are known as leverage scores [17]; and the probabilities
given by Eqn. (4.3.2) are nearly-optimal, in the sense of Eqn. (4.2.1), i.e., in the
sense that they approximate the optimal probabilities for approximating the ma-
trix product shown in Eqn (4.3.1), up to a β factor.) Applying Markov’s inequality
to the bound of Eqn. (4.3.1) and setting
10d2
(4.3.3) c= ,
β 2
we get that, with probability at least 9/10,
(4.3.4) UT U − RT RF = Id − RT RF  .
Clearly, the above equation also implies a two-norm bound. Indeed, with proba-
bility at least 9/10,
UT U − RT R2 = Id − RT R2 
by setting c to the value of Eqn. (4.3.3).
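To make this concrete, here is a minimal NumPy sketch (our illustration, not code from the text) of the row-sampling scheme: sample rows of an orthonormal U with leverage-score probabilities, rescale them as in the RandMatrixMultiply algorithm, and check how well $R^TR$ approximates $U^TU = I_d$. All sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tall-and-thin matrix U with orthonormal columns (U.T @ U = I_d).
n, d = 4096, 10
U, _ = np.linalg.qr(rng.standard_normal((n, d)))

# Leverage scores and the induced sampling probabilities (beta = 1, the optimal case).
lev = np.sum(U**2, axis=1)      # ||U_{k*}||_2^2 for each row k; these sum to d
p = lev / d                     # sampling probabilities; they sum to 1

# Sample c rows i.i.d. with replacement and rescale each by 1/sqrt(c * p_k).
c = 1000
idx = rng.choice(n, size=c, p=p)
R = U[idx] / np.sqrt(c * p[idx])[:, None]

# Spectral-norm error of the approximation R.T @ R to U.T @ U = I_d.
err = np.linalg.norm(np.eye(d) - R.T @ R, 2)
```

With $d^2/c = 0.1$ here, the bound of Eqn. (4.3.1) suggests the error should be small but not tiny; rerunning with larger c shrinks it at the expected $1/\sqrt{c}$ rate.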
In the remainder of this section, we will state and prove a theorem that also guarantees $\|U^TU - R^TR\|_2 \leq \varepsilon$, while setting c to a value smaller than the one in Eqn. (4.3.3). For related concentration techniques, see the chapter by Vershynin in this volume [28].

Theorem 4.3.5. Let $U \in \mathbb{R}^{n\times d}$ ($n \gg d$) satisfy $U^TU = I_d$. Construct R using the RandMatrixMultiply algorithm and let the sampling probabilities $\{p_k\}_{k=1}^n$ satisfy the conditions of Eqn. (4.3.2), for all $k = 1,\ldots,n$ and for some constant $\beta \in (0, 1]$. Let $\varepsilon \in (0, 1)$ be an accuracy parameter and let

(4.3.6) $$c \geq \frac{96d}{\beta\varepsilon^2}\ln\left(\frac{96d}{\beta\varepsilon^2\sqrt{\delta}}\right).$$

Then, with probability at least $1 - \delta$,
$$\|U^TU - R^TR\|_2 = \|I_d - R^TR\|_2 \leq \varepsilon.$$

Prior to proving the above theorem, we state a matrix-Bernstein inequality that is due to Oliveira [22, Lemma 1].
Petros Drineas and Michael W. Mahoney 23
Lemma 4.3.7. Let $x^1, x^2, \ldots, x^c$ be independent identically distributed copies of a d-dimensional random vector x with
$$\|x\|_2 \leq M \quad\text{and}\quad \left\|\mathbf{E}\left[xx^T\right]\right\|_2 \leq 1.$$
Then, for any $\alpha > 0$,
$$\left\|\frac{1}{c}\sum_{i=1}^{c} x^i\left(x^i\right)^T - \mathbf{E}\left[xx^T\right]\right\|_2 \leq \alpha$$
holds with probability at least
$$1 - (2c)^2\exp\left(-\frac{c\alpha^2}{16M^2 + 8M^2\alpha}\right).$$

This inequality essentially bounds the probability that the matrix $\frac{1}{c}\sum_{i=1}^{c} x^i(x^i)^T$ deviates significantly from its expectation, where the deviation is measured with respect to the two norm (namely, the largest singular value) of the error matrix.
Proof. (of Theorem 4.3.5) Define the random row vector $y \in \mathbb{R}^d$ as
$$\Pr\left[y = \frac{1}{\sqrt{p_k}}U_{k*}\right] = p_k \geq \frac{\beta\|U_{k*}\|_2^2}{d},$$
for $k = 1,\ldots,n$. In words, y is set to be the (rescaled) k-th row of U with probability $p_k$. Thus, the matrix R has rows $\frac{1}{\sqrt{c}}y^1, \frac{1}{\sqrt{c}}y^2, \ldots, \frac{1}{\sqrt{c}}y^c$, where $y^1, y^2, \ldots, y^c$ are c independent copies of y. Using this notation, it follows that

(4.3.8) $$\mathbf{E}\left[y^Ty\right] = \sum_{k=1}^{n} p_k\left(\frac{1}{\sqrt{p_k}}U_{k*}^T\right)\left(\frac{1}{\sqrt{p_k}}U_{k*}\right) = U^TU = I_d.$$

Also,
$$R^TR = \frac{1}{c}\sum_{t=1}^{c}\left(y^t\right)^T y^t \in \mathbb{R}^{d\times d}.$$
For this vector y, let M be such that

(4.3.9) $$M \geq \|y\|_2 = \frac{1}{\sqrt{p_k}}\|U_{k*}\|_2.$$

Notice that from Eqn. (4.3.8) we immediately get $\left\|\mathbf{E}\left[y^Ty\right]\right\|_2 = \|I_d\|_2 = 1$. Applying Lemma 4.3.7 (with $x = y^T$ and $\alpha = \varepsilon$), we get

(4.3.10) $$\|R^TR - U^TU\|_2 < \varepsilon,$$

with probability at least $1 - (2c)^2\exp\left(-\frac{c\varepsilon^2}{16M^2 + 8M^2\varepsilon}\right)$. Let $\delta$ be the failure probability of Theorem 4.3.5; we seek an appropriate value of c in order to guarantee $(2c)^2\exp\left(-\frac{c\varepsilon^2}{16M^2 + 8M^2\varepsilon}\right) \leq \delta$. Equivalently, we need to satisfy
$$\frac{c}{\ln\left(2c/\sqrt{\delta}\right)} \geq \frac{2}{\varepsilon^2}\left(16M^2 + 8M^2\varepsilon\right).$$
Combine Eqns. (4.3.9) and (4.3.2) to get $M^2 \leq \|U\|_F^2/\beta = d/\beta$. Recall that $\varepsilon < 1$ to conclude that it suffices to choose a value of c such that
$$\frac{c}{\ln\left(2c/\sqrt{\delta}\right)} \geq \frac{48d}{\beta\varepsilon^2},$$
or, equivalently,
$$\frac{2c/\sqrt{\delta}}{\ln\left(2c/\sqrt{\delta}\right)} \geq \frac{96d}{\beta\varepsilon^2\sqrt{\delta}}.$$
We now use the fact that, for any $\eta \geq 4$, if $x \geq 2\eta\ln\eta$, then $x/\ln x \geq \eta$. Let $x = 2c/\sqrt{\delta}$, let $\eta = 96d/(\beta\varepsilon^2\sqrt{\delta})$, and note that $\eta \geq 4$ since $d \geq 1$ and $\beta$, $\varepsilon$, and $\delta$ are at most one. Thus, it suffices to set
$$\frac{2c}{\sqrt{\delta}} \geq 2\cdot\frac{96d}{\beta\varepsilon^2\sqrt{\delta}}\ln\left(\frac{96d}{\beta\varepsilon^2\sqrt{\delta}}\right),$$
which concludes the proof of the theorem. □
Remark 4.3.11. Let $\delta = 1/10$ and let $\varepsilon$ and $\beta$ be constants. Then, we can compare the bound of Eqn. (4.3.3) with the bound of Eqn. (4.3.6) of Theorem 4.3.5: both values of c guarantee the same accuracy $\varepsilon$ and the same success probability (say 9/10). However, asymptotically, the bound of Theorem 4.3.5 holds by setting $c = O(d\ln d)$, while the bound of Eqn. (4.3.3) holds by setting $c = O(d^2)$. Thus, the bound of Theorem 4.3.5 is much better. By the Coupon Collector Problem (see Section 3.12), sampling-based approaches necessitate at least $\Omega(d\ln d)$ samples, thus making our algorithm asymptotically optimal. We should note, however, that deterministic methods exist (see, for example, [24]) that achieve the same bound with $c = O(d/\varepsilon^2)$ samples.
Remark 4.3.12. We made no effort to optimize the constants in the expression for c in Eqn. (4.3.6). Better constants are known, obtained by using tighter matrix-Bernstein inequalities. For a state-of-the-art bound see, for example, [16, Theorem 5.1].
4.4. References. Our presentation in this chapter follows closely the derivations
in [9]; see [9] for a detailed discussion of prior work on this topic. We also
refer the interested reader to [16] and references therein for more recent work on
randomized matrix multiplication.
5. RandNLA Approaches for Regression Problems

In this section, we will present a simple randomized algorithm for least-squares regression. In many applications in mathematics and statistical data analysis, it is of interest to find an approximate solution to a system of linear equations that has no exact solution. For example, let a matrix $A \in \mathbb{R}^{n\times d}$ and a vector $b \in \mathbb{R}^n$ be given. If $n \gg d$, there will not in general exist a vector $x \in \mathbb{R}^d$ such that $Ax = b$, and yet it is often of interest to find a vector x such that $Ax \approx b$ in some precise sense. The method of least squares, whose original formulation is often credited to Gauss and Legendre, accomplishes this by minimizing the sum of squares of the elements of the residual vector, i.e., by solving the optimization problem

(5.0.1) $$\mathcal{Z} = \min_{x\in\mathbb{R}^d} \|Ax - b\|_2.$$

The minimum $\ell_2$-norm vector among those satisfying Eqn. (5.0.1) is

(5.0.2) $$x_{opt} = A^\dagger b,$$

where $A^\dagger$ denotes the Moore-Penrose generalized inverse of the matrix A. This solution vector has a very natural statistical interpretation as providing an optimal estimator among all linear unbiased estimators, and it has a very natural geometric interpretation as providing an orthogonal projection of the vector b onto the span of the columns of the matrix A.
Recall that to minimize the quantity in Eqn. (5.0.1), we can set the derivative of $\|Ax - b\|_2^2 = (Ax-b)^T(Ax-b)$ with respect to x equal to zero, from which it follows that the minimizing vector $x_{opt}$ is a solution of the so-called normal equations

(5.0.3) $$A^TAx_{opt} = A^Tb.$$

Computing $A^TA$, and thus computing $x_{opt}$ in this way, takes $O(nd^2)$ time, assuming $n \geq d$. Geometrically, Eqn. (5.0.3) requires the residual vector $b^\perp = b - Ax_{opt}$ to be orthogonal to the column space of A, i.e., $(b^\perp)^TA = 0$. While solving the normal equations squares the condition number of the input matrix (and thus is typically not recommended in practice), direct methods (such as the QR decomposition; see Section 2.1) also solve the problem of Eqn. (5.0.1) in $O(nd^2)$ time, assuming that $n \geq d$. Finally, an alternative expression for the vector $x_{opt}$ of Eqn. (5.0.2) emerges by leveraging the SVD of A. If $A = U_A\Sigma_A V_A^T$ denotes the SVD of A, then
$$x_{opt} = V_A\Sigma_A^{-1}U_A^Tb = A^\dagger b.$$
Computing $x_{opt}$ in this way also takes $O(nd^2)$ time, again assuming $n \geq d$. In this section, we will describe a randomized algorithm that will provide accurate relative-error approximations to the minimal $\ell_2$-norm solution vector $x_{opt}$ of Eqn. (5.0.2) faster than these "exact" algorithms for a large class of overconstrained least-squares problems.
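As a quick illustration (ours, with arbitrary problem sizes), the three routes to $x_{opt}$ just described (normal equations, a QR-based solver, and the SVD/pseudoinverse) can be checked against each other in NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 8
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# (i) Normal equations A^T A x = A^T b (this squares the condition number of A;
#     shown only for illustration, not recommended in practice).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# (ii) A direct least-squares solver (np.linalg.lstsq).
x_qr, *_ = np.linalg.lstsq(A, b, rcond=None)

# (iii) The SVD route: x_opt = V_A Sigma_A^{-1} U_A^T b = A^+ b.
x_svd = np.linalg.pinv(A) @ b
```

For this well-conditioned random A all three agree to machine precision; for ill-conditioned A, route (i) loses accuracy first.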
5.1. The Randomized Hadamard Transform. The Randomized Hadamard Transform was introduced in [1] as one step in the development of a fast version of the Johnson-Lindenstrauss lemma. Recall that the $n \times n$ Hadamard matrix $\tilde{H}_n$ (assuming n is a power of two) may be defined recursively as follows:
$$\tilde{H}_n = \begin{pmatrix} \tilde{H}_{n/2} & \tilde{H}_{n/2} \\ \tilde{H}_{n/2} & -\tilde{H}_{n/2} \end{pmatrix}, \quad\text{with}\quad \tilde{H}_2 = \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}.$$
We can now define the normalized Hadamard transform $H_n$ as $(1/\sqrt{n})\tilde{H}_n$; it is easy to see that $H_nH_n^T = H_n^TH_n = I_n$. Now consider a diagonal matrix $D \in \mathbb{R}^{n\times n}$ such that $D_{ii}$ is set to +1 with probability 1/2 and to −1 with probability 1/2. The product HD is the Randomized Hadamard Transform and has three useful properties. First, when applied to a vector, it "spreads out" the mass/energy of that vector, in the sense of providing a bound for the largest element, or infinity norm, of the transformed vector. Second, computing the product HDx for any vector $x \in \mathbb{R}^n$ takes $O(n\log_2 n)$ time. Even better, if we only need to access, say, r elements of the transformed vector, then those r elements can be computed in $O(n\log_2 r)$ time. We will expand on the latter observation in Section 5.5, where we will discuss the running time of the proposed algorithm. Third, the Randomized Hadamard Transform is an orthogonal transformation, since $HDD^TH^T = H^TD^TDH = I_n$.
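The first two properties are easy to see in code. The sketch below (ours, not from the text) implements the transform via the standard butterfly recursion in $O(n\log_2 n)$ time; applied to the "spikiest" possible unit vector $e_1$, the normalized transform $H_nD$ spreads the energy perfectly, so every entry of the output has magnitude exactly $1/\sqrt{n}$.

```python
import numpy as np

def hadamard_transform(x):
    """Apply the (unnormalized) Hadamard transform to x in O(n log2 n) time
    via the butterfly recursion; len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # top half:    x1 + x2
            x[i + h:i + 2 * h] = a - b  # bottom half: x1 - x2
        h *= 2
    return x

rng = np.random.default_rng(2)
n = 1024
signs = rng.choice([-1.0, 1.0], size=n)            # the diagonal of D

e1 = np.zeros(n)
e1[0] = 1.0                                        # the "spikiest" unit vector
y = hadamard_transform(signs * e1) / np.sqrt(n)    # y = H_n D e1
```

The output y has unit Euclidean norm (the transform is orthogonal) and infinity norm $1/\sqrt{n}$, the best possible spreading.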
5.2. The main algorithm and main theorem. We are now ready to provide an
overview of the RandLeastSquares algorithm (Algorithm 5.2.1). Let the matrix
product HD denote the n × n Randomized Hadamard Transform discussed in the
previous section. (For simplicity, we restrict our discussion to the case that n is
a power of two, although this restriction can easily be removed by using variants
of the Randomized Hadamard Transform [17].) Our algorithm is a preconditioned
random sampling algorithm: after premultiplying A and b by HD, our algorithm
samples uniformly at random r constraints from the preprocessed problem. (See
Eqn. (5.2.3), as well as the remarks after Theorem 5.2.2 for the precise value of
r.) Then, this algorithm solves the least squares problem on just those sampled
constraints to obtain a vector x̃opt ∈ Rd such that Theorem 5.2.2 is satisfied.
Input: $A \in \mathbb{R}^{n\times d}$, $b \in \mathbb{R}^n$, and an error parameter $\varepsilon \in (0, 1)$.
Output: $\tilde{x}_{opt} \in \mathbb{R}^d$.
(1) Let r assume the value of Eqn. (5.2.3).
(2) Let S be an empty matrix.
(3) For $t = 1, \ldots, r$ (i.i.d. trials with replacement) select uniformly at random an integer i from $\{1, 2, \ldots, n\}$.
    • If i is selected, then append the column vector $\sqrt{n/r}\,e_i$ to S, where $e_i \in \mathbb{R}^n$ is the i-th canonical vector.
(4) Let $H \in \mathbb{R}^{n\times n}$ be the normalized Hadamard transform matrix.
(5) Let $D \in \mathbb{R}^{n\times n}$ be a diagonal matrix with
$$D_{ii} = \begin{cases} +1, & \text{with probability } 1/2 \\ -1, & \text{with probability } 1/2 \end{cases}$$
(6) Compute and return $\tilde{x}_{opt} = \left(S^THDA\right)^\dagger S^THDb$.

Algorithm 5.2.1: The RandLeastSquares algorithm
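A direct, dense NumPy rendering of these six steps might look as follows (our illustration; the sizes, seed, and noise level are arbitrary, and a practical implementation would apply H via the fast $O(n\log_2 n)$ transform instead of forming it explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 1024, 6, 200

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)  # noisy right-hand side

# Steps 4-5: normalized Hadamard matrix (built densely via the Sylvester
# recursion, for clarity only) and the random-sign diagonal D.
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
H /= np.sqrt(n)
signs = rng.choice([-1.0, 1.0], size=n)

# Steps 2-3: uniform sampling with replacement, with rescaling sqrt(n/r).
idx = rng.choice(n, size=r)
scale = np.sqrt(n / r)

# Step 6: solve the small sampled problem.
PA = scale * (H @ (signs[:, None] * A))[idx]   # S^T H D A
Pb = scale * (H @ (signs * b))[idx]            # S^T H D b
x_tilde, *_ = np.linalg.lstsq(PA, Pb, rcond=None)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
res_opt = np.linalg.norm(A @ x_opt - b)
res_tilde = np.linalg.norm(A @ x_tilde - b)
```

On typical runs the sampled solution's residual is within a few percent of the optimal residual, in line with the $(1+\varepsilon)$ guarantee of Theorem 5.2.2.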
Formally, we will let $S \in \mathbb{R}^{n\times r}$ denote a sampling-and-rescaling matrix specifying which of the n (preprocessed) constraints are to be sampled and how they are to be rescaled. This matrix is initially empty and is constructed as described in the RandLeastSquares algorithm. (We are describing this algorithm in terms of the matrix S, but, as with the RandMatrixMultiply algorithm, we do not need to construct S explicitly in an actual implementation [3].) Then, we can consider the problem
$$\tilde{\mathcal{Z}} = \min_{x\in\mathbb{R}^d} \|S^THDAx - S^THDb\|_2,$$
which is a least-squares approximation problem involving only the r constraints, where the r constraints are uniformly sampled from the matrix A after the preprocessing with the Randomized Hadamard Transform. The minimum $\ell_2$-norm vector $\tilde{x}_{opt} \in \mathbb{R}^d$ among those that achieve the minimum value $\tilde{\mathcal{Z}}$ in this problem is
$$\tilde{x}_{opt} = \left(S^THDA\right)^\dagger S^THDb,$$
which is the output of the RandLeastSquares algorithm. One can prove (and the proof is provided below) the following theorem about this algorithm.
Theorem 5.2.2. Suppose $A \in \mathbb{R}^{n\times d}$ is a matrix of rank d, with n being a power of two. Let $b \in \mathbb{R}^n$ and let $\varepsilon \in (0, 1)$. Run the RandLeastSquares algorithm with

(5.2.3) $$r = \max\left\{48^2 d\ln(40nd)\ln\left(100^2 d\ln(40nd)\right),\ 40d\ln(40nd)/\varepsilon\right\}$$

and return $\tilde{x}_{opt}$. Then, with probability at least .8, the following two claims hold: first, $\tilde{x}_{opt}$ satisfies
$$\|A\tilde{x}_{opt} - b\|_2 \leq (1+\varepsilon)\mathcal{Z},$$
where, recall, $\mathcal{Z}$ is given in Eqn. (5.0.1); and, second, if we assume that $\|U_AU_A^Tb\|_2 \geq \gamma\|b\|_2$ for some $\gamma \in (0, 1]$, then $\tilde{x}_{opt}$ satisfies
$$\|x_{opt} - \tilde{x}_{opt}\|_2 \leq \sqrt{\varepsilon}\,\kappa(A)\sqrt{\gamma^{-2}-1}\,\|x_{opt}\|_2.$$
Finally,
$$n(d+1) + 2n(d+1)\log_2(r+1) + O(rd^2)$$
time suffices to compute the solution $\tilde{x}_{opt}$.

It is worth noting that the claims of Theorem 5.2.2 can be made to hold with probability $1-\delta$, for any $\delta > 0$, by repeating the algorithm $\ln(1/\delta)/\ln(5)$ times. Also, we note that if n is not a power of two we can pad A and b with all-zero rows in order to satisfy the assumption; this process at most doubles the size of the input matrix.
Remark 5.2.4. Assuming that $d \leq n \leq e^d$, and using $\max\{a_1, a_2\} \leq a_1 + a_2$, we get that
$$r = O\left(d(\ln d)(\ln n) + \frac{d\ln n}{\varepsilon}\right).$$
Thus, the running time of the RandLeastSquares algorithm becomes
$$O\left(nd\ln\frac{d}{\varepsilon} + d^3(\ln d)(\ln n) + \frac{d^3\ln n}{\varepsilon}\right).$$
Assuming that $n/\ln n = \Omega(d^2)$, the above running time reduces to
$$O\left(nd\ln\frac{d}{\varepsilon} + \frac{nd\ln d}{\varepsilon}\right).$$
For fixed $\varepsilon$, these improve upon the standard $O(nd^2)$ running time of traditional deterministic algorithms. It is worth noting that improvements over the standard $O(nd^2)$ time could be derived with weaker assumptions on n and d. However, for the sake of clarity of presentation, we only focus on the above setting.
Remark 5.2.5. The matrix $S^THD$ can be viewed in one of two equivalent ways: as a random preprocessing or random preconditioning, which "uniformizes" the leverage scores of the input matrix A (see Lemma 5.4.1 for a precise statement), followed by a uniform sampling operation; or as a Johnson-Lindenstrauss style random projection, which preserves the geometry of the entire span of A, rather than just a discrete set of points (see Lemma 5.4.5 for a precise statement).
5.3. RandNLA algorithms as preconditioners. Stepping back, recall that the RandLeastSquares algorithm may be viewed as preconditioning the input matrix A and the target vector b with a carefully-constructed, data-independent random matrix X. (Since the analysis of the RandLowRank algorithm, our main algorithm for low-rank matrix approximation in Section 6 below, boils down to very similar ideas as the analysis of the RandLeastSquares algorithm, the ideas underlying the following discussion also apply to the RandLowRank algorithm.) For our random sampling algorithm, we let $X = S^THD$, where S is a matrix that represents the sampling operation and HD is the Randomized Hadamard Transform. Thus, we replace the least-squares approximation problem of Eqn. (5.0.1) with the least-squares approximation problem

(5.3.1) $$\tilde{\mathcal{Z}} = \min_{x\in\mathbb{R}^d} \|X(Ax - b)\|_2.$$

We explicitly compute the solution to the above problem using a traditional deterministic algorithm, e.g., by computing the vector

(5.3.2) $$\tilde{x}_{opt} = (XA)^\dagger Xb.$$

Alternatively, one could use standard iterative methods such as the Conjugate Gradient Normal Residual method, which can produce an $\varepsilon$-approximation to the optimal solution of Eqn. (5.3.1) in $O(\kappa(XA)\,rd\ln(1/\varepsilon))$ time, where $\kappa(XA)$ is the condition number of XA and r is the number of rows of XA. This was indeed the strategy implemented in the popular Blendenpik/LSRN approach [3].
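The preconditioning viewpoint is easy to verify numerically. In the sketch below (ours; a Gaussian sketch stands in for $S^THD$ purely for brevity), we QR-factorize the sketched matrix XA and observe that $AR^{-1}$ is nearly perfectly conditioned even when A itself is very ill-conditioned; this is exactly what makes CG/LSQR-type iterations on the preconditioned problem converge in a few steps.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, r = 2000, 20, 200

# A deliberately ill-conditioned tall matrix: kappa(A) = 1e6.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = U @ np.diag(np.logspace(0, 6, d)) @ V.T

# Sketch XA (a Gaussian X here, standing in for S^T H D), then QR the sketch.
X = rng.standard_normal((r, n)) / np.sqrt(r)
_, R = np.linalg.qr(X @ A)

kappa_A = np.linalg.cond(A)
kappa_pre = np.linalg.cond(A @ np.linalg.inv(R))  # conditioning after preconditioning
```

Because the sketch is a subspace embedding for the column space of A, the singular values of $AR^{-1}$ all lie in a narrow band around 1, so its condition number is a small constant independent of $\kappa(A)$.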
We now state and prove a lemma that establishes sufficient conditions on any matrix X such that the solution vector $\tilde{x}_{opt}$ to the least-squares problem of Eqn. (5.3.1) will satisfy the relative-error bounds of Theorem 5.2.2. Recall that the SVD of A is $A = U_A\Sigma_AV_A^T$. In addition, for notational simplicity, we let $b^\perp = U_A^\perp\left(U_A^\perp\right)^Tb$ denote the part of the right-hand side vector b lying outside of the column space of A.

The two conditions that we will require of the matrix X are:

(5.3.3) $$\sigma^2_{\min}(XU_A) \geq 1/\sqrt{2}; \quad\text{and}$$
(5.3.4) $$\|U_A^TX^TXb^\perp\|_2^2 \leq \varepsilon\mathcal{Z}^2/2,$$

for some $\varepsilon \in (0, 1)$. Several things should be noted about these conditions.
• First, although Condition (5.3.3) only states that $\sigma_i^2(XU_A) \geq 1/\sqrt{2}$ for all $i = 1,\ldots,d$, our randomized algorithm satisfies the two-sided inequality $\left|1 - \sigma_i^2(XU_A)\right| \leq 1 - 1/\sqrt{2}$ for all $i = 1,\ldots,d$. This is equivalent to
$$\|I - U_A^TX^TXU_A\|_2 \leq 1 - 1/\sqrt{2}.$$
Thus, one should think of $XU_A$ as an approximate isometry.
• Second, the lemma is a deterministic statement, since it makes no explicit reference to a particular randomized algorithm and since X is not assumed to be constructed from a randomized process. Failure probabilities will enter later, when we show that our randomized algorithm constructs an X that satisfies Conditions (5.3.3) and (5.3.4) with some probability.
• Third, Conditions (5.3.3) and (5.3.4) define what has come to be known as a subspace embedding, since it is an embedding that preserves the geometry of the entire subspace of the matrix A. Such a subspace embedding can be oblivious (meaning that it is constructed without knowledge of the input matrix, as with random projection algorithms) or non-oblivious (meaning that it is constructed from information in the input matrix, as with data-dependent nonuniform sampling algorithms). This style of analysis represented a major advance in RandNLA algorithms, since it permitted much stronger bounds to be obtained than had been possible with previous methods. See [11] for the journal version (which was a combination and extension of two previous conference papers) of the first paper to use this style of analysis.
• Fourth, Condition (5.3.4) simply states that $Xb^\perp = XU_A^\perp\left(U_A^\perp\right)^Tb$ remains approximately orthogonal to $XU_A$. Clearly, before applying X, it holds that $U_A^Tb^\perp = 0$.
• Fifth, although Condition (5.3.4) depends on the right-hand side vector b, the RandLeastSquares algorithm will satisfy it without using any information from b. (See Lemma 5.4.9 below.)
Given Conditions (5.3.3) and (5.3.4), we can establish the following lemma.
Lemma 5.3.5. Consider the overconstrained least-squares approximation problem of Eqn. (5.0.1) and let the matrix $U_A \in \mathbb{R}^{n\times d}$ contain the top d left singular vectors of A. Assume that the matrix X satisfies Conditions (5.3.3) and (5.3.4) above, for some $\varepsilon \in (0, 1)$. Then, the solution vector $\tilde{x}_{opt}$ to the least-squares approximation problem (5.3.1) satisfies:

(5.3.6) $$\|A\tilde{x}_{opt} - b\|_2 \leq (1+\varepsilon)\mathcal{Z}, \quad\text{and}$$
(5.3.7) $$\|x_{opt} - \tilde{x}_{opt}\|_2 \leq \frac{1}{\sigma_{\min}(A)}\sqrt{\varepsilon}\,\mathcal{Z}.$$
Proof. Let us first rewrite the down-scaled regression problem induced by X as
$$\min_{x\in\mathbb{R}^d}\|Xb - XAx\|_2^2 = \min_{x\in\mathbb{R}^d}\|XAx - Xb\|_2^2$$
(5.3.8) $$= \min_{y\in\mathbb{R}^d}\|XA(x_{opt}+y) - X(Ax_{opt}+b^\perp)\|_2^2$$
$$= \min_{y\in\mathbb{R}^d}\|XAy - Xb^\perp\|_2^2$$
(5.3.9) $$= \min_{z\in\mathbb{R}^d}\|XU_Az - Xb^\perp\|_2^2.$$

Eqn. (5.3.8) follows since $b = Ax_{opt} + b^\perp$, and Eqn. (5.3.9) follows since the columns of the matrix A span the same subspace as the columns of $U_A$. Now, let $z_{opt} \in \mathbb{R}^d$ be such that $U_Az_{opt} = A(\tilde{x}_{opt} - x_{opt})$. Using this value for $z_{opt}$, we will prove that $z_{opt}$ is the minimizer of the above optimization problem, as follows:
$$\|XU_Az_{opt} - Xb^\perp\|_2^2 = \|XA(\tilde{x}_{opt} - x_{opt}) - Xb^\perp\|_2^2$$
$$= \|XA\tilde{x}_{opt} - XAx_{opt} - Xb^\perp\|_2^2$$
(5.3.10) $$= \|XA\tilde{x}_{opt} - Xb\|_2^2$$
$$= \min_{x\in\mathbb{R}^d}\|XAx - Xb\|_2^2$$
$$= \min_{z\in\mathbb{R}^d}\|XU_Az - Xb^\perp\|_2^2.$$
Eqn. (5.3.10) follows since $b = Ax_{opt} + b^\perp$, and the last equality follows from Eqn. (5.3.9). Thus, by the normal equations (5.0.3), we have that
$$(XU_A)^TXU_Az_{opt} = (XU_A)^TXb^\perp.$$
Taking the norm of both sides and observing that under Condition (5.3.3) we have $\sigma_i\left((XU_A)^TXU_A\right) = \sigma_i^2(XU_A) \geq 1/\sqrt{2}$ for all i, it follows that

(5.3.11) $$\|z_{opt}\|_2^2/2 \leq \|(XU_A)^TXU_Az_{opt}\|_2^2 = \|(XU_A)^TXb^\perp\|_2^2.$$

Using Condition (5.3.4) we observe that

(5.3.12) $$\|z_{opt}\|_2^2 \leq \varepsilon\mathcal{Z}^2.$$

To establish the first claim of the lemma, let us rewrite the norm of the residual vector as
$$\|b - A\tilde{x}_{opt}\|_2^2 = \|b - Ax_{opt} + Ax_{opt} - A\tilde{x}_{opt}\|_2^2$$
(5.3.13) $$= \|b - Ax_{opt}\|_2^2 + \|Ax_{opt} - A\tilde{x}_{opt}\|_2^2$$
(5.3.14) $$= \mathcal{Z}^2 + \|{-U_Az_{opt}}\|_2^2$$
(5.3.15) $$\leq \mathcal{Z}^2 + \varepsilon\mathcal{Z}^2,$$
where Eqn. (5.3.13) follows by the Pythagorean theorem, since $b - Ax_{opt} = b^\perp$ is orthogonal to the column space of A and consequently to $A(x_{opt} - \tilde{x}_{opt})$; Eqn. (5.3.14) follows by the definition of $z_{opt}$ and $\mathcal{Z}$; and Eqn. (5.3.15) follows by (5.3.12) and the fact that $U_A$ has orthonormal columns. The first claim of the lemma follows since $\sqrt{1+\varepsilon} \leq 1+\varepsilon$.

To establish the second claim of the lemma, recall that $A(x_{opt} - \tilde{x}_{opt}) = -U_Az_{opt}$. If we take the norm of both sides of this expression, we have that
(5.3.16) $$\|x_{opt} - \tilde{x}_{opt}\|_2^2 \leq \frac{\|U_Az_{opt}\|_2^2}{\sigma_{\min}^2(A)}$$
(5.3.17) $$\leq \frac{\varepsilon\mathcal{Z}^2}{\sigma_{\min}^2(A)},$$
where Eqn. (5.3.16) follows since $\sigma_{\min}(A)$ is the smallest singular value of A and since the rank of A is d; and Eqn. (5.3.17) follows by Eqn. (5.3.12) and the orthonormality of the columns of $U_A$. Taking the square root, the second claim of the lemma follows. □
If we make no assumption on b, then Eqn. (5.3.7) from Lemma 5.3.5 may provide a weak bound in terms of $\|x_{opt}\|_2$. If, on the other hand, we make the additional assumption that a constant fraction of the norm of b lies in the subspace spanned by the columns of A, then Eqn. (5.3.7) can be strengthened. Such an assumption is reasonable, since most least-squares problems are practically interesting only if at least some part of b lies in the subspace spanned by the columns of A.

Lemma 5.3.18. Using the notation of Lemma 5.3.5, and additionally assuming that $\|U_AU_A^Tb\|_2 \geq \gamma\|b\|_2$ for some fixed $\gamma \in (0, 1]$, it follows that

(5.3.19) $$\|x_{opt} - \tilde{x}_{opt}\|_2 \leq \sqrt{\varepsilon}\,\kappa(A)\sqrt{\gamma^{-2}-1}\,\|x_{opt}\|_2.$$

Proof. Since $\|U_AU_A^Tb\|_2 \geq \gamma\|b\|_2$, it follows that
$$\mathcal{Z}^2 = \|b\|_2^2 - \|U_AU_A^Tb\|_2^2 \leq (\gamma^{-2}-1)\|U_AU_A^Tb\|_2^2 \leq \sigma_{\max}^2(A)(\gamma^{-2}-1)\|x_{opt}\|_2^2.$$
The last inequality follows from $U_AU_A^Tb = Ax_{opt}$, which implies
$$\|U_AU_A^Tb\|_2 = \|Ax_{opt}\|_2 \leq \|A\|_2\|x_{opt}\|_2 = \sigma_{\max}(A)\|x_{opt}\|_2.$$
By combining this with Eqn. (5.3.7) of Lemma 5.3.5, the lemma follows. □
5.4. The proof of Theorem 5.2.2. To prove Theorem 5.2.2, we adopt the following approach: we first show that the Randomized Hadamard Transform has the effect of preprocessing (or preconditioning) the input matrix to make the leverage scores approximately uniform, and we then show that Conditions (5.3.3) and (5.3.4) can be satisfied by sampling uniformly on the preconditioned input. The theorem will then follow from Lemma 5.3.5.
The effect of the Randomized Hadamard Transform. We start by stating a lemma that quantifies the manner in which HD approximately "uniformizes" information in the left singular subspace of the matrix A; this will allow us to sample uniformly and apply our randomized matrix multiplication results from Section 4 in order to analyze the proposed algorithm. We state the lemma for a general $n \times d$ orthogonal matrix U such that $U^TU = I_d$.

Lemma 5.4.1. Let U be an $n \times d$ orthogonal matrix and let the product HD be the $n \times n$ Randomized Hadamard Transform of Section 5.1. Then, with probability at least .95,

(5.4.2) $$\|(HDU)_{i*}\|_2^2 \leq \frac{2d\ln(40nd)}{n}, \quad\text{for all } i = 1,\ldots,n.$$

The following well-known inequality [15, Theorem 2] will be useful in the proof. (See also the chapter by Vershynin in this volume [28] for related results.)
Lemma 5.4.3. Let $X_i$, $i = 1,\ldots,n$, be independent random variables with finite first and second moments such that, for all i, $a_i \leq X_i \leq b_i$. Then, for any $t > 0$,
$$\Pr\left[\left|\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\mathbf{E}\left[X_i\right]\right| \geq nt\right] \leq 2\exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n}(a_i - b_i)^2}\right).$$

Given this lemma, we now provide the proof of Lemma 5.4.1.

Proof. (of Lemma 5.4.1) Consider $(HDU)_{ij}$ for some i, j (recalling that $i = 1,\ldots,n$ and $j = 1,\ldots,d$). Recall that D is a diagonal matrix; then,
$$(HDU)_{ij} = \sum_{\ell=1}^{n} H_{i\ell}D_{\ell\ell}U_{\ell j} = \sum_{\ell=1}^{n} D_{\ell\ell}H_{i\ell}U_{\ell j} = \sum_{\ell=1}^{n} X_\ell.$$
Let $X_\ell = D_{\ell\ell}H_{i\ell}U_{\ell j}$ be our set of n (independent) random variables. By the construction of D and H, it is easy to see that $\mathbf{E}\left[X_\ell\right] = 0$; also,
$$|X_\ell| = \left|D_{\ell\ell}H_{i\ell}U_{\ell j}\right| \leq \frac{1}{\sqrt{n}}\left|U_{\ell j}\right|.$$
Applying Lemma 5.4.3, we get
$$\Pr\left[\left|(HDU)_{ij}\right| \geq nt\right] \leq 2\exp\left(-\frac{2n^3t^2}{4\sum_{\ell=1}^{n}U_{\ell j}^2}\right) = 2\exp\left(-n^3t^2/2\right).$$
In the last equality we used the fact that $\sum_{\ell=1}^{n}U_{\ell j}^2 = 1$, i.e., that the columns of U are unit-length. Let the right-hand side of the above inequality be equal to $\delta$ and solve for t to get
$$\Pr\left[\left|(HDU)_{ij}\right| \geq \sqrt{\frac{2\ln(2/\delta)}{n}}\right] \leq \delta.$$
Let $\delta = 1/(20nd)$ and apply the union bound over all nd possible index pairs (i, j) to get that, with probability at least $1 - 1/20 = .95$, for all i, j,
$$\left|(HDU)_{ij}\right| \leq \sqrt{\frac{2\ln(40nd)}{n}}.$$
Thus,

(5.4.4) $$\|(HDU)_{i*}\|_2^2 = \sum_{j=1}^{d}(HDU)_{ij}^2 \leq \frac{2d\ln(40nd)}{n}$$

for all $i = 1,\ldots,n$, which concludes the proof of the lemma. □
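The lemma is easy to check numerically. In the toy example below (ours), U is a worst case for uniform sampling: its leverage scores are 1 on the first d rows and 0 elsewhere. Because the columns of the Hadamard matrix have constant-magnitude entries, one Randomized Hadamard Transform makes every row norm of HDU exactly $d/n$, comfortably below the bound of Eqn. (5.4.2).

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 1024, 4

# Worst case for uniform sampling: leverage scores are 1 on the first d rows.
U = np.zeros((n, d))
U[:d, :d] = np.eye(d)

# Dense normalized Hadamard matrix (Sylvester recursion) -- fine at this size.
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
H /= np.sqrt(n)

signs = rng.choice([-1.0, 1.0], size=n)      # the diagonal of D
HDU = H @ (signs[:, None] * U)

lev_before = np.sum(U**2, axis=1)            # leverage scores of U
lev_after = np.sum(HDU**2, axis=1)           # leverage scores of HDU
bound = 2 * d * np.log(40 * n * d) / n       # the bound of Eqn. (5.4.2)
```

For a general orthogonal U the spreading is not exact, but with probability .95 every row of HDU stays below the same logarithmic bound.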
Satisfying Condition (5.3.3). We next prove the following lemma, which states that all the singular values of $S^THDU_A$ are close to one, and in particular that Condition (5.3.3) is satisfied by the RandLeastSquares algorithm. The proof of Lemma 5.4.5 essentially follows from our results in Theorem 4.3.5 for the RandMatrixMultiply algorithm (for approximating the product of a matrix and its transpose).

Lemma 5.4.5. Assume that Eqn. (5.4.2) holds. If

(5.4.6) $$r \geq 48^2 d\ln(40nd)\ln\left(100^2 d\ln(40nd)\right),$$

then, with probability at least .95,
$$\left|1 - \sigma_i^2\left(S^THDU_A\right)\right| \leq 1 - \frac{1}{\sqrt{2}}$$
holds for all $i = 1,\ldots,d$.

Proof. (of Lemma 5.4.5) Note that for all $i = 1,\ldots,d$,
$$\left|1 - \sigma_i^2\left(S^THDU_A\right)\right| = \left|\sigma_i\left(U_A^TDH^THDU_A\right) - \sigma_i\left(U_A^TDH^TSS^THDU_A\right)\right|$$
(5.4.7) $$\leq \|U_A^TDH^THDU_A - U_A^TDH^TSS^THDU_A\|_2.$$

In the above, we used the fact that $U_A^TDH^THDU_A = I_d$ and inequality (2.6.2), which was discussed in our linear algebra review in Section 2.6. We now view $U_A^TDH^TSS^THDU_A$ as an approximation to the product of two matrices, namely $U_A^TDH^T = (HDU_A)^T$ and $HDU_A$, constructed by randomly sampling and rescaling columns of $(HDU_A)^T$. Thus, we can leverage Theorem 4.3.5.

More specifically, consider the matrix $(HDU_A)^T$. Obviously, since H, D, and $U_A$ are orthogonal matrices, $\|HDU_A\|_2 = 1$ and $\|HDU_A\|_F = \|U_A\|_F = \sqrt{d}$. Let $\beta = (2\ln(40nd))^{-1}$; since we assumed that Eqn. (5.4.2) holds, we note that the columns of $(HDU_A)^T$, which correspond to the rows of $HDU_A$, satisfy

(5.4.8) $$\frac{1}{n} \geq \beta\frac{\|(HDU_A)_{i*}\|_2^2}{\|HDU_A\|_F^2}, \quad\text{for all } i = 1,\ldots,n.$$

Thus, applying Theorem 4.3.5 with $\beta = (2\ln(40nd))^{-1}$, $\varepsilon = 1 - 1/\sqrt{2}$, and $\delta = 1/20$ implies that
$$\|U_A^TDH^THDU_A - U_A^TDH^TSS^THDU_A\|_2 \leq 1 - \frac{1}{\sqrt{2}}$$
holds with probability at least $1 - 1/20 = .95$. For the above bound to hold, we need r to assume the value of Eqn. (5.4.6). Finally, we note that since $\|HDU_A\|_F^2 = d \geq 1$, the assumption of Theorem 4.3.5 on the Frobenius norm of the input matrix is always satisfied. Combining the above with inequality (5.4.8) concludes the proof of the lemma. □
Satisfying Condition (5.3.4). We next prove the following lemma, which states that Condition (5.3.4) is satisfied by the RandLeastSquares algorithm. The proof of Lemma 5.4.9 again essentially follows from our bounds for the RandMatrixMultiply algorithm from Section 4 (except that here it is used for approximating the product of a matrix and a vector).

Lemma 5.4.9. Assume that Eqn. (5.4.2) holds. If $r \geq 40d\ln(40nd)/\varepsilon$, then, with probability at least .9,
$$\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2 \leq \varepsilon\mathcal{Z}^2/2.$$

Proof. (of Lemma 5.4.9) Recall that $b^\perp = U_A^\perp\left(U_A^\perp\right)^Tb$ and that $\mathcal{Z} = \|b^\perp\|_2$. We start by noting that since $\|U_A^TDH^THDb^\perp\|_2^2 = \|U_A^Tb^\perp\|_2^2 = 0$, it follows that
$$\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2 = \|U_A^TDH^TSS^THDb^\perp - U_A^TDH^THDb^\perp\|_2^2.$$
Thus, $\left(S^THDU_A\right)^TS^THDb^\perp$ can be viewed as approximating the product of the two matrices $(HDU_A)^T$ and $HDb^\perp$ by randomly sampling columns from $(HDU_A)^T$ and rows (elements) from $HDb^\perp$. Note that the sampling probabilities are uniform and do not depend on the norms of the columns of $(HDU_A)^T$ or the rows of $HDb^\perp$. We will apply the bounds of Eqn. (4.2.4), after arguing that the assumptions of Eqn. (4.2.3) are satisfied. Indeed, since we condition on Eqn. (5.4.2) holding, the rows of $HDU_A$ (which of course correspond to columns of $(HDU_A)^T$) satisfy

(5.4.10) $$\frac{1}{n} \geq \beta\frac{\|(HDU_A)_{i*}\|_2^2}{\|HDU_A\|_F^2}, \quad\text{for all } i = 1,\ldots,n,$$

for $\beta = (2\ln(40nd))^{-1}$. Thus, Eqn. (4.2.4) implies
$$\mathbf{E}\left[\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2\right] \leq \frac{1}{\beta r}\|HDU_A\|_F^2\|HDb^\perp\|_2^2 = \frac{d\mathcal{Z}^2}{\beta r}.$$
In the above we used $\|HDU_A\|_F^2 = d$. Markov's inequality now implies that, with probability at least .9,
$$\left\|\left(S^THDU_A\right)^TS^THDb^\perp\right\|_2^2 \leq \frac{10d\mathcal{Z}^2}{\beta r}.$$
Setting $r \geq 20d/(\beta\varepsilon)$ and using the value of $\beta$ specified above concludes the proof of the lemma. □
Completing the proof of Theorem 5.2.2. The theorem follows since Lemmas 5.4.5 and 5.4.9 establish that the sufficient conditions of Lemma 5.3.5 hold. In more detail, we now complete the proof of Theorem 5.2.2. First, let $E_{(5.4.2)}$ denote the event that Eqn. (5.4.2) holds; clearly, $\Pr\left[E_{(5.4.2)}\right] \geq .95$. Second, let $E_{5.4.5,5.4.9|(5.4.2)}$ denote the event that both Lemmas 5.4.5 and 5.4.9 hold, conditioned on $E_{(5.4.2)}$ holding. Then,
$$\begin{aligned}
\Pr\left[E_{5.4.5,5.4.9|(5.4.2)}\right] &= 1 - \Pr\left[\overline{E_{5.4.5,5.4.9|(5.4.2)}}\right]\\
&= 1 - \Pr\left[\text{Lemma 5.4.5 or Lemma 5.4.9 does not hold}\,\middle|\,E_{(5.4.2)}\right]\\
&\geq 1 - \Pr\left[\text{Lemma 5.4.5 does not hold}\,\middle|\,E_{(5.4.2)}\right] - \Pr\left[\text{Lemma 5.4.9 does not hold}\,\middle|\,E_{(5.4.2)}\right]\\
&\geq 1 - .05 - .1 = .85.
\end{aligned}$$
In the above, $\overline{E}$ denotes the complement of the event E. In the first inequality we used the union bound, and in the second inequality we leveraged the bounds on the failure probabilities of Lemmas 5.4.5 and 5.4.9, given that Eqn. (5.4.2) holds. We now let E denote the event that both Lemmas 5.4.5 and 5.4.9 hold, without any a priori conditioning on the event $E_{(5.4.2)}$; we bound $\Pr[E]$ as follows:
$$\begin{aligned}
\Pr[E] &= \Pr\left[E\,\middle|\,E_{(5.4.2)}\right]\cdot\Pr\left[E_{(5.4.2)}\right] + \Pr\left[E\,\middle|\,\overline{E_{(5.4.2)}}\right]\cdot\Pr\left[\overline{E_{(5.4.2)}}\right]\\
&\geq \Pr\left[E\,\middle|\,E_{(5.4.2)}\right]\cdot\Pr\left[E_{(5.4.2)}\right]\\
&= \Pr\left[E_{5.4.5,5.4.9|(5.4.2)}\right]\cdot\Pr\left[E_{(5.4.2)}\right]\\
&\geq .85 \cdot .95 \geq .8.
\end{aligned}$$
In the first inequality we used the fact that all probabilities are nonnegative. The above derivation immediately bounds the success probability of Theorem 5.2.2. Combining Lemmas 5.4.5 and 5.4.9 with the structural results of Lemma 5.3.5 and setting r as in Eqn. (5.2.3) concludes the proof of the accuracy guarantees of Theorem 5.2.2.
5.5. The running time of the RandLeastSquares algorithm. We now discuss the running time of the RandLeastSquares algorithm. First of all, by the construction of S, the number of non-zero entries in S is r. In Step 6 we need to compute the products $S^THDA$ and $S^THDb$. Recall that A has d columns, and thus the running time of computing both products is equal to the time needed to apply $S^THD$ to (d+1) vectors. In order to apply D to (d+1) vectors in $\mathbb{R}^n$, $n(d+1)$ operations suffice. In order to estimate how many operations are needed to apply $S^TH$ to (d+1) vectors, we use the following analysis, which was first proposed in [2, Section 7].

Let x be any vector in $\mathbb{R}^n$; multiplying H by x can be done as follows:
$$\begin{pmatrix} H_{n/2} & H_{n/2} \\ H_{n/2} & -H_{n/2} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} H_{n/2}(x_1+x_2) \\ H_{n/2}(x_1-x_2) \end{pmatrix}.$$
Let T(n) be the number of operations required to perform this operation for n-dimensional vectors. Then
$$T(n) = 2T(n/2) + n,$$
and thus $T(n) = O(n\log n)$. We can now include the sub-sampling matrix S. Writing $S^T = \begin{pmatrix} S_1 & S_2 \end{pmatrix}$, where $S_1$ and $S_2$ consist of the first and last n/2 columns of $S^T$, respectively, we get
$$\begin{pmatrix} S_1 & S_2 \end{pmatrix}\begin{pmatrix} H_{n/2} & H_{n/2} \\ H_{n/2} & -H_{n/2} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = S_1H_{n/2}(x_1+x_2) + S_2H_{n/2}(x_1-x_2).$$
Let $\mathrm{nnz}(\cdot)$ denote the number of non-zero entries of its argument. Then,
$$T(n, \mathrm{nnz}(S)) = T(n/2, \mathrm{nnz}(S_1)) + T(n/2, \mathrm{nnz}(S_2)) + n.$$
From standard methods in the analysis of recursive algorithms, we can now use the fact that $r = \mathrm{nnz}(S) = \mathrm{nnz}(S_1) + \mathrm{nnz}(S_2)$ to prove that
$$T(n, r) \leq 2n\log_2(r+1).$$
Towards that end, let $r_1 = \mathrm{nnz}(S_1)$ and let $r_2 = \mathrm{nnz}(S_2)$. Then,
$$T(n, r) = T(n/2, r_1) + T(n/2, r_2) + n \leq 2\,\frac{n}{2}\log_2(r_1+1) + 2\,\frac{n}{2}\log_2(r_2+1) + n\log_2 2 = n\log_2\left(2(r_1+1)(r_2+1)\right) \leq n\log_2(r+1)^2 = 2n\log_2(r+1).$$
The last inequality follows from simple algebra using $r = r_1 + r_2$. Thus, at most $2n(d+1)\log_2(r+1)$ operations are needed to apply $S^THD$ to d+1 vectors. After this preprocessing, the RandLeastSquares algorithm must compute the pseudoinverse of an $r \times d$ matrix or, equivalently, solve a least-squares problem on r constraints and d variables. This operation can be performed in $O(rd^2)$ time since $r \geq d$. Thus, the entire algorithm runs in time
$$n(d+1) + 2n(d+1)\log_2(r+1) + O(rd^2).$$
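The $O(n\log_2 r)$ claim corresponds to pruning this butterfly recursion: at each level we descend only into halves that contain a requested index. The small recursive implementation below (ours, unoptimized) illustrates the idea and is checked against a dense Hadamard multiply.

```python
import numpy as np

def subsampled_hadamard(x, indices):
    """Entries of the (unnormalized) Hadamard transform of x at the given
    0-based indices, visiting only the branches of the recursion that are needed."""
    n = len(x)
    if not indices:
        return {}
    if n == 1:
        return {0: x[0]}
    half = n // 2
    top = [i for i in indices if i < half]
    bot = [i - half for i in indices if i >= half]
    out = {}
    out.update(subsampled_hadamard(x[:half] + x[half:], top))            # H_{n/2}(x1 + x2)
    out.update({i + half: v for i, v in
                subsampled_hadamard(x[:half] - x[half:], bot).items()})  # H_{n/2}(x1 - x2)
    return out

rng = np.random.default_rng(6)
n = 256
x = rng.standard_normal(n)
idx = [3, 17, 200]
vals = subsampled_hadamard(x, idx)

# Dense reference for the check.
Ht = np.array([[1.0]])
while Ht.shape[0] < n:
    Ht = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), Ht)
full = Ht @ x
```

Each level of the recursion still costs n additions for the halves it visits, but empty branches are skipped entirely, which yields the $2n\log_2(r+1)$ bound derived above.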
5.6. References. Our presentation in this chapter follows closely the derivations
in [11]; see [11] for a detailed discussion of prior work on this topic. We also
refer the interested reader to [3, 30] for followup work on randomized solvers for
least-squares problems.
6. A RandNLA Algorithm for Low-rank Matrix Approximation

In this section, we will present a simple randomized matrix algorithm for low-rank matrix approximation. Algorithms to compute low-rank approximations to matrices have been of paramount importance historically in scientific computing (see, for example, [23] for traditional numerical methods based on subspace iteration and Krylov subspaces to compute such approximations), as well as more recently in machine learning and data analysis. RandNLA has pioneered an alternative approach, by applying random sampling and random projection algorithms to construct such low-rank approximations with provable accuracy guarantees; see [7] for early work on the topic and [14, 17, 18, 30] for overviews of more recent approaches. In this section, we will present and analyze a simple algorithm to approximate the top k left singular vectors of a matrix $A \in \mathbb{R}^{m\times n}$. Many RandNLA methods for low-rank approximation boil down to variants of this basic technique; see, e.g., the chapter by Martinsson in this volume [19]. Unlike the previous section on RandNLA algorithms for regression problems, no particular assumptions will be imposed on m and n; indeed, A could be a square matrix.
6.1. The main algorithm and main theorem. Our main algorithm is quite simple and again leverages the Randomized Hadamard Transform of Section 5.1. Indeed, let the matrix product HD denote the n × n Randomized Hadamard Transform. First, we postmultiply the input matrix A ∈ Rm×n by (HD)^T, thus forming a new matrix ADH ∈ Rm×n.1 Then, we sample (uniformly at random) c
columns from the matrix ADH, thus forming a smaller matrix C ∈ Rm×c . Finally,
we use a Ritz-Rayleigh type procedure to construct approximations Ũk ∈ Rm×k
to the top k left singular vectors of A from C; these approximations lie within the
column space of C.
See the RandLowRank algorithm (Algorithm 6.1.4) for a detailed description
of this procedure, using a sampling-and-rescaling matrix S ∈ Rn×c to form
the matrix C. Theorem 6.1.1 is our main quality-of-approximation result for
the RandLowRank algorithm.

Theorem 6.1.1. Let A ∈ Rm×n, let k be a rank parameter, and let ε ∈ (0, 1/2]. If we set

(6.1.2)    c ≥ c0 (k ln n / ε^2) (ln(k/ε^2) + ln ln n),

(for a fixed constant c0) then, with probability at least .85, the RandLowRank algorithm returns a matrix Ũk ∈ Rm×k such that

(6.1.3)    ‖A − Ũk Ũk^T A‖F ≤ (1 + ε)‖A − Uk Uk^T A‖F = (1 + ε)‖A − Ak‖F.

(Here, Uk ∈ Rm×k contains the top k left singular vectors of A). The running time of the RandLowRank algorithm is O(mnc).

We discuss the dimensions of the matrices in steps 6-9 of the RandLowRank algorithm. One can think of the matrix C ∈ Rm×c as a “sketch” of the input matrix A. Notice that c is O(k ln k) (up to ln ln factors and ignoring constant
1 Alternatively, we could premultiply A^T by HD. The reader should become comfortable going back and forth with such manipulations.
38 Lectures on Randomized Numerical Linear Algebra

terms like ε and δ) and that the rank of C (denoted by ρC) is at least k, i.e., ρC ≥ k. The matrix UC has dimensions m × ρC and the matrix W has dimensions ρC × n. Finally, the matrix UW,k has dimensions ρC × k (by our assumption on the rank of W).
Recall that the best rank-k approximation to A is equal to Ak = Uk Uk^T A. In words, Theorem 6.1.1 argues that the RandLowRank algorithm returns a set of k orthonormal vectors that are excellent approximations to the top k left singular vectors of A, in the sense that projecting A onto the subspace spanned by Ũk returns a matrix whose residual error is close to that of Ak.

Input: A ∈ Rm×n, a rank parameter k ≤ min{m, n}, and an error parameter ε ∈ (0, 1/2].
Output: Ũk ∈ Rm×k.
(1) Let c assume the value of Eqn. (6.1.2).
(2) Let S be an empty matrix.
(3) For t = 1, . . . , c (i.i.d. trials with replacement) select uniformly at random an integer from {1, 2, . . . , n}. If i is selected, then append the column vector √(n/c) ei to S, where ei ∈ Rn is the i-th canonical vector.
(4) Let H ∈ Rn×n be the normalized Hadamard transform matrix.
(5) Let D ∈ Rn×n be a diagonal matrix with
        Dii = +1, with probability 1/2
              −1, with probability 1/2
(6) Compute C = ADHS ∈ Rm×c.
(7) Compute UC, a basis for the column space of C.
(8) Compute W = UC^T A and (assuming that its rank is at least k), compute its top k left singular vectors UW,k.
(9) Return Ũk = UC UW,k ∈ Rm×k.

Algorithm 6.1.4: The RandLowRank algorithm
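The nine steps above admit a direct transcription into NumPy. The sketch below is our own illustration, not an optimized implementation: it forms the Hadamard matrix densely (so it does not attain the running time of Section 6.5) and assumes n is a power of 2:

```python
import numpy as np

def rand_low_rank(A, k, c, rng=None):
    """A direct transcription of Algorithm 6.1.4 (RandLowRank).
    Dense O(n^2) Hadamard multiply, for illustration only."""
    if rng is None:
        rng = np.random.default_rng(0)
    m, n = A.shape
    H = np.array([[1.0]])                      # normalized Hadamard matrix
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
    D = np.diag(rng.choice([-1.0, 1.0], size=n))   # step 5
    cols = rng.integers(0, n, size=c)              # step 3: i.i.d. uniform trials
    S = np.zeros((n, c))
    S[cols, np.arange(c)] = np.sqrt(n / c)         # rescaled canonical vectors
    C = A @ D @ H @ S                              # step 6: the sketch
    U_C, _ = np.linalg.qr(C)                       # step 7: basis for range(C)
    W = U_C.T @ A                                  # step 8
    U_W, _, _ = np.linalg.svd(W, full_matrices=False)
    return U_C @ U_W[:, :k]                        # step 9
```

The returned Ũk always has exactly k orthonormal columns, since both U_C and U_W have orthonormal columns.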

Remark 6.1.5. We stress that the O(mnc) running time of the RandLowRank algorithm is due to the Ritz-Rayleigh type procedure in steps (7)-(9). These steps guarantee that the proposed algorithm returns a matrix Ũk with exactly k columns that approximates the top k left singular vectors of A. The results of [19] focus (in our parlance) on the matrix C, which can be constructed much faster (see Section 6.5), in O(mn log2 c) time, but has more than k columns. By bounding the error term ‖A − CC†A‖F = ‖A − UC UC^T A‖F, one can prove that the column span of C contains good approximations to the top k left singular vectors of A.

Remark 6.1.6. Repeating the RandLowRank algorithm ln(1/δ)/ln 5 times and keeping the matrix Ũk that minimizes the error ‖A − Ũk Ũk^T A‖F reduces the failure probability of the algorithm to at most δ, for any δ ∈ (0, 1).
Remark 6.1.7. As with the sampling process in the RandLeastSquares algo-
rithm, the operation represented by DHS in the RandLowRank algorithm can
be viewed in one of two equivalent ways: either as a random preconditioning fol-
lowed by a uniform sampling operation; or as a Johnson-Lindenstrauss style ran-
dom projection. (In particular, informally, the RandLowRank algorithm “works”
for the following reason. If a matrix is well-approximated by a low-rank matrix,
then there is redundancy in the columns (and/or rows), and thus random sam-
pling “should” be successful at selecting a good set of columns. That said, just
as with the RandLeastSquares algorithm, there may be some columns that are
more important to select, e.g., that have high leverage. Thus, using a random
projection, which transforms the input to a new basis where the leverage scores
of different columns are uniformized, amounts to preconditioning the input such
that uniform sampling is appropriate.)
Remark 6.1.8. The value c is essentially2 equal to O((k/ε^2) ln(k/ε) ln n). For constant ε, this grows as a function of k ln k and ln n.
Remark 6.1.9. Similar bounds can be proven for many other random projection
algorithms (using different values for c) and not just the Randomized Hadamard
Transform. Well-known alternatives include random Gaussian matrices, the Ran-
domized Discrete Cosine Transform, sparsity-preserving random projections, etc.
Which variant is most appropriate in a given situation depends on the sparsity
structure of the matrix, the noise properties of the data, the model of data access,
etc. See [17, 30] for an overview of similar results.
Remark 6.1.10. One can generalize the RandLowRank algorithm to work with the matrix (AA^T)^t ADHS for integer t ≥ 0. This would result in subspace iteration. If all intermediate iterates (for t = 0, 1, . . .) are kept, the Krylov subspace would be formed. See [8, 21] and references therein for a detailed treatment and analysis of such methods. (See also the chapter by Martinsson in this volume [19] for related results.)
The remainder of this section will focus on the proof of Theorem 6.1.1. Our proof strategy will consist of three steps. First (Section 6.2), we will prove that:

‖A − Ũk Ũk^T A‖F^2 ≤ ‖Ak − UC UC^T Ak‖F^2 + ‖Ak,⊥‖F^2.

This inequality allows us to focus on the easier-to-bound term ‖Ak − UC UC^T Ak‖F^2 instead of the term ‖A − Ũk Ũk^T A‖F^2. Second (Section 6.3), to bound this term, we will use a structural inequality that is central (in this form or mild variations) in
2 We omit the ln ln n term from this qualitative remark. Recall that ln ln n goes to infinity with
dignity and therefore, quoting Stan Eisenstat, ln ln n is for all practical purposes essentially a constant;
see https://rjlipton.wordpress.com/2011/01/19/we-believe-a-lot-but-can-prove-little/.

many RandNLA low-rank approximation algorithms and their analyses. Indeed, we will argue that

‖Ak − UC UC^T Ak‖F^2 ≤ 2‖(A − Ak)DHS((Vk^T DHS)† − (Vk^T DHS)^T)‖F^2
                      + 2‖(A − Ak)DHS(Vk^T DHS)^T‖F^2.

Third (Section 6.4), we will use results from Section 4 to bound the two terms on the right-hand side of the above inequality.
6.2. An alternative expression for the error. The RandLowRank algorithm approximates the top k left singular vectors of A, i.e., the matrix Uk ∈ Rm×k, by the orthonormal matrix Ũk ∈ Rm×k. Bounding ‖A − Ũk Ũk^T A‖F directly seems hard, so we present an alternative expression that is easier to analyze and that also reveals an interesting insight for Ũk. We will prove that the matrix Ũk Ũk^T A is the best rank-k approximation to A (with respect to the Frobenius norm3) that lies within the column space of the matrix C. This optimality property is guaranteed by the Ritz-Rayleigh type procedure implemented in Steps 7-9 of the RandLowRank algorithm.

Lemma 6.2.1. Let UC be a basis for the column span of C and let Ũk be the output of the RandLowRank algorithm. Then

(6.2.2)    A − Ũk Ũk^T A = A − UC (UC^T A)k.

In addition, UC (UC^T A)k is the best rank-k approximation to A, with respect to the Frobenius norm, that lies within the column span of the matrix C, namely

(6.2.3)    ‖A − UC (UC^T A)k‖F^2 = min_{rank(Y)≤k} ‖A − UC Y‖F^2.
Proof. Recall that Ũk = UC UW,k, where UW,k is the matrix of the top k left singular vectors of W = UC^T A. Thus, UW,k spans the same range as Wk, the best rank-k approximation to W, i.e., UW,k UW,k^T = Wk Wk†. Therefore

A − Ũk Ũk^T A = A − UC UW,k UW,k^T UC^T A = A − UC Wk Wk† W = A − UC Wk.

The last equality follows from Wk Wk† being the orthogonal projector onto the range of Wk. In order to prove the optimality property of the lemma, we simply observe that

‖A − UC (UC^T A)k‖F^2 = ‖A − UC UC^T A + UC UC^T A − UC (UC^T A)k‖F^2
                      = ‖(I − UC UC^T)A + UC (UC^T A − (UC^T A)k)‖F^2
                      = ‖(I − UC UC^T)A‖F^2 + ‖UC (UC^T A − (UC^T A)k)‖F^2
                      = ‖(I − UC UC^T)A‖F^2 + ‖UC^T A − (UC^T A)k‖F^2.
3 This is not true for other unitarily invariant norms, e.g., the two-norm; see [6] for a detailed discussion.

The second to last equality follows from Matrix Pythagoras (Lemma 2.5.2) and the last equality follows from the orthonormality of the columns of UC. The second statement of the lemma is now immediate since (UC^T A)k is the best rank-k approximation to UC^T A and thus any other matrix Y of rank at most k would result in a larger Frobenius norm error. □
Lemma 6.2.1 shows that Eqn. (6.1.3) in Theorem 6.1.1 can be proven by bounding ‖A − UC (UC^T A)k‖F. Next, we transition from the best rank-k approximation (UC^T A)k of the projected matrix to the best rank-k approximation Ak of the original matrix. First (recall the notation introduced in Section 2.6), we split

(6.2.4)    A = Ak + Ak,⊥, where Ak = Uk Σk Vk^T and Ak,⊥ = Uk,⊥ Σk,⊥ Vk,⊥^T.

Lemma 6.2.5. Let UC be an orthonormal basis for the column span of the matrix C and let Ũk be the output of the RandLowRank algorithm. Then,

‖A − Ũk Ũk^T A‖F^2 ≤ ‖Ak − UC UC^T Ak‖F^2 + ‖Ak,⊥‖F^2.

Proof. The optimality property in Eqn. (6.2.3) in Lemma 6.2.1 and the fact that UC^T Ak has rank at most k imply

‖A − Ũk Ũk^T A‖F^2 = ‖A − UC (UC^T A)k‖F^2
                   ≤ ‖A − UC UC^T Ak‖F^2
                   = ‖Ak − UC UC^T Ak‖F^2 + ‖Ak,⊥‖F^2.

The last equality follows from Lemma 2.5.2. □
6.3. A structural inequality. We now state and prove a structural inequality that will help us bound ‖Ak − UC UC^T Ak‖F^2 (the first term in the error bound of Lemma 6.2.5) and that, with minor variants, underlies nearly all RandNLA algorithms for low-rank matrix approximation [18]. Recall that, given a matrix A ∈ Rm×n, many RandNLA algorithms seek to construct a “sketch” of A by postmultiplying A by some “sketching” matrix Z ∈ Rn×c, where c is much smaller than n. (In particular, this is precisely what the RandLowRank algorithm does.) Thus, the resulting matrix AZ ∈ Rm×c is much smaller than the original matrix A, and the interesting question is the approximation guarantees that it offers. A common approach is to explore how well AZ spans the principal subspace of A, and one metric of accuracy is a suitably chosen norm of the error matrix Ak − (AZ)(AZ)† Ak, where (AZ)(AZ)† Ak is the projection of Ak onto the subspace spanned by the columns of AZ. (See Section 2.9 for the definition of the Moore-Penrose pseudoinverse of a matrix.) The following structural result offers a means to bound the Frobenius norm of the error matrix Ak − (AZ)(AZ)† Ak.

Lemma 6.3.1. Given A ∈ Rm×n, let Z ∈ Rn×c (c ≥ k) be any matrix such that Vk^T Z ∈ Rk×c has rank k. Then,

(6.3.2)    ‖Ak − (AZ)(AZ)† Ak‖F^2 ≤ ‖(A − Ak) Z (Vk^T Z)†‖F^2.

Remark 6.3.3. Lemma 6.3.1 holds for any matrix Z, regardless of whether Z is constructed deterministically or randomly. In the context of RandNLA, typical constructions of Z would represent a random sampling or random projection operation, like the matrix DHS used in the RandLowRank algorithm.

Remark 6.3.4. The lemma actually holds for any unitarily invariant norm, including the two-norm and the nuclear norm of a matrix [18].

Remark 6.3.5. See [18] for a detailed discussion of such structural inequalities and their history. Lemma 6.3.1 immediately suggests a proof strategy for bounding the error of RandNLA algorithms for low-rank matrix approximation: identify a sketching matrix Z such that Vk^T Z has full rank; and, at the same time, bound the relevant norms of (Vk^T Z)† and (A − Ak)Z.

Proof. (of Lemma 6.3.1) First, note that

(AZ)† Ak = argmin_{X∈Rc×n} ‖Ak − (AZ)X‖F^2.

The above equation follows by viewing the above optimization problem as least-squares regression with multiple right-hand sides. Interestingly, this property holds for any unitarily invariant norm, but the proof is involved; see Lemma 4.2 of [8] for a detailed discussion. The upshot is that, instead of having to bound ‖Ak − (AZ)(AZ)† Ak‖F^2, we can replace (AZ)† Ak with any other c × n matrix and the equality with an inequality. In particular, we replace (AZ)† Ak with (Ak Z)† Ak:

‖Ak − (AZ)(AZ)† Ak‖F^2 ≤ ‖Ak − AZ (Ak Z)† Ak‖F^2.

This suboptimal choice for X is essentially the “heart” of our proof: it allows us to manipulate and further decompose the error term, thus making the remainder of the analysis feasible. Use A = A − Ak + Ak to get

‖Ak − (AZ)(AZ)† Ak‖F^2 ≤ ‖Ak − (A − Ak + Ak)Z (Ak Z)† Ak‖F^2
                       = ‖Ak − Ak Z(Ak Z)† Ak − (A − Ak)Z(Ak Z)† Ak‖F^2
                       = ‖(A − Ak)Z(Ak Z)† Ak‖F^2.

To derive the last equality, we used

Ak − Ak Z(Ak Z)† Ak = Ak − Uk Σk Vk^T Z (Uk Σk Vk^T Z)† Ak
(6.3.6)             = Ak − Uk Σk (Vk^T Z)(Vk^T Z)† Σk^{-1} Uk^T Ak
(6.3.7)             = Ak − Uk Uk^T Ak = 0.

In Eqn. (6.3.6), we used Eqn. (2.9.1) and the fact that both matrices Vk^T Z and Uk Σk have rank k. The former fact also implies that (Vk^T Z)(Vk^T Z)† = Ik, which derives Eqn. (6.3.7); the fact that Uk Uk^T Ak = Ak concludes the derivation. Finally, since (Ak Z)† Ak = (Vk^T Z)† Σk^{-1} Uk^T Ak = (Vk^T Z)† Vk^T and Vk^T has orthonormal rows, ‖(A − Ak)Z(Ak Z)† Ak‖F = ‖(A − Ak)Z(Vk^T Z)†‖F, which concludes the proof of the lemma. □
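Lemma 6.3.1 is straightforward to sanity-check numerically. In the snippet below (our own; a generic Gaussian Z stands in for any sketching matrix satisfying the rank assumption), both sides of Eqn. (6.3.2) are evaluated on a random low-rank-plus-noise matrix:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n, k, c = 30, 20, 3, 8
# a rank-k matrix plus small noise
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.1 * rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]        # best rank-k approximation
Vk = Vt[:k, :].T                           # top k right singular vectors

Z = rng.standard_normal((n, c))            # any Z with rank(Vk^T Z) = k
assert np.linalg.matrix_rank(Vk.T @ Z) == k

lhs = np.linalg.norm(Ak - (A @ Z) @ np.linalg.pinv(A @ Z) @ Ak, "fro") ** 2
rhs = np.linalg.norm((A - Ak) @ Z @ np.linalg.pinv(Vk.T @ Z), "fro") ** 2
assert lhs <= rhs + 1e-8                   # Eqn. (6.3.2)
```

Since the lemma is deterministic, the final assertion holds for every Z with the required rank, not just for Gaussian draws.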

6.4. Completing the proof of Theorem 6.1.1. In order to complete the proof of the relative error guarantee of Theorem 6.1.1, we will complete the strategy outlined at the end of Section 6.1. First, recall that from Lemma 6.2.5 it suffices to bound

(6.4.1)    ‖A − Ũk Ũk^T A‖F^2 ≤ ‖Ak − UC UC^T Ak‖F^2 + ‖A − Ak‖F^2.

Then, to bound the first term on the right-hand side of the above inequality, we will apply the structural result of Lemma 6.3.1 on the matrix

Φ = ADH,

with Z = S, where the matrices D, H, and S are constructed as described in the RandLowRank algorithm. If VΦ,k^T S has rank k, then Lemma 6.3.1 gives the estimate

(6.4.2)    ‖Φk − (ΦS)(ΦS)† Φk‖F^2 ≤ ‖(Φ − Φk)S(VΦ,k^T S)†‖F^2.
Here, we used VΦ,k ∈ Rn×k to denote the matrix of the top k right singular vectors of Φ.

Recall from Section 5.1 that DH is an orthogonal matrix and thus the left singular vectors and the singular values of the matrices A and Φ = ADH are identical. The right singular vectors of the matrix Φ are simply the right singular vectors of A, rotated by DH, namely

VΦ^T = V^T DH,

where V (respectively, VΦ) denotes the matrix of the right singular vectors of A (respectively, Φ). Thus, we have Φk = Ak DH, Φ − Φk = (A − Ak)DH, and VΦ,k = (DH)^T Vk = HDVk. Using all the above, we can rewrite Eqn. (6.4.2) as follows:

(6.4.3)    ‖Ak − (ADHS)(ADHS)† Ak‖F^2 ≤ ‖(A − Ak)DHS(Vk^T DHS)†‖F^2.

In the above derivation, we used unitary invariance to drop a DH term from the Frobenius norm. Recall that Ak,⊥ = A − Ak; we now proceed to manipulate the right-hand side of the above inequality as follows4:

‖Ak,⊥ DHS(Vk^T DHS)†‖F^2
    = ‖Ak,⊥ DHS((Vk^T DHS)† − (Vk^T DHS)^T + (Vk^T DHS)^T)‖F^2
(6.4.4)    ≤ 2‖Ak,⊥ DHS((Vk^T DHS)† − (Vk^T DHS)^T)‖F^2
(6.4.5)       + 2‖Ak,⊥ DHS(Vk^T DHS)^T‖F^2.

We now proceed to bound the terms in (6.4.4) and (6.4.5) separately. Our first order of business, however, will be to quantify the manner in which the Randomized Hadamard Transform approximately uniformizes information in the top k right singular vectors of A.
4 We use the following easy-to-prove version of the triangle inequality for the Frobenius norm: for any two matrices X and Y that have the same dimensions, ‖X + Y‖F^2 ≤ 2‖X‖F^2 + 2‖Y‖F^2.

The effect of the Randomized Hadamard Transform. Here, we state a lemma that quantifies the manner in which HD (premultiplying Vk, or DH postmultiplying Vk^T) approximately “uniformizes” information in the right singular subspace of the matrix A, thus allowing us to apply our matrix multiplication results from Section 4 in order to bound (6.4.4) and (6.4.5). This is completely analogous to our discussion in Section 5.4 regarding the RandLeastSquares algorithm.

Lemma 6.4.6. Let Vk be an n × k matrix with orthonormal columns and let the product HD be the n × n Randomized Hadamard Transform of Section 5.1. Then, with probability at least .95,

(6.4.7)    ‖(HDVk)i∗‖2^2 ≤ 2k ln(40nk)/n,  for all i = 1, . . . , n.

The proof of the above lemma is identical to the proof of Lemma 5.4.1, with Vk instead of U and k instead of d.
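The flattening effect of Lemma 6.4.6 can be observed directly. The following sketch (ours; it uses a dense Hadamard matrix and a Vk whose leverage is concentrated on its first k rows) checks that the squared row norms of HDVk fall below the bound of Eqn. (6.4.7) — an event that, per the lemma, occurs with probability at least .95 over the choice of D:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 256, 4
# Vk with leverage concentrated on its first k rows
Vk, _ = np.linalg.qr(np.vstack([np.eye(k),
                                1e-3 * rng.standard_normal((n - k, k))]))
H = np.array([[1.0]])
while H.shape[0] < n:                      # normalized Hadamard matrix
    H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
D = np.diag(rng.choice([-1.0, 1.0], size=n))

lev_before = (Vk ** 2).sum(axis=1)         # squared row norms of Vk
lev_after = ((H @ D @ Vk) ** 2).sum(axis=1)

assert lev_before.max() > 0.9              # one row carries almost unit mass
assert lev_after.max() <= 2 * k * np.log(40 * n * k) / n   # Eqn. (6.4.7)
```

Before the transform, nearly all of the “mass” of the right singular subspace sits on k rows; after, it is spread over all n rows, which is exactly why uniform sampling of columns of ADH succeeds.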
6.4.1. Bounding Expression (6.4.4). To bound the term in Expression (6.4.4), we first use the strong submultiplicativity of the Frobenius norm (see Section 2.5) to get

(6.4.8)    ‖Ak,⊥ DHS((Vk^T DHS)† − (Vk^T DHS)^T)‖F^2 ≤ ‖Ak,⊥ DHS‖F^2 ‖(Vk^T DHS)† − (Vk^T DHS)^T‖2^2.

Our first lemma bounds the term ‖(A − Ak)DHS‖F^2 = ‖Ak,⊥ DHS‖F^2. We actually prove the result for any matrix X and for our choice of the matrix S in the RandLowRank algorithm.
Lemma 6.4.9. Let the sampling matrix S ∈ Rn×c be constructed as in the RandLowRank algorithm. Then, for any matrix X ∈ Rm×n,

E[‖XS‖F^2] = ‖X‖F^2,

and, from Markov’s inequality (see Section 3.11), with probability at least 0.95,

‖XS‖F^2 ≤ 20‖X‖F^2.

Remark 6.4.10. The above lemma holds even if the sampling of the canonical vectors ei to be included in S is not done uniformly at random, but with respect to any set of probabilities {p1, . . . , pn} summing up to one, as long as the selected canonical vector at the t-th trial (say the it-th canonical vector eit) is rescaled by √(1/(c pit)). Thus, even for nonuniform sampling, ‖XS‖F^2 is an unbiased estimator of ‖X‖F^2.

Proof. (of Lemma 6.4.9) We compute the expectation of ‖XS‖F^2 from first principles as follows:

E[‖XS‖F^2] = Σ_{t=1}^{c} Σ_{j=1}^{n} (1/n) · (n/c) ‖X∗j‖2^2 = (1/c) Σ_{t=1}^{c} Σ_{j=1}^{n} ‖X∗j‖2^2 = ‖X‖F^2.

The lemma now follows by applying Markov’s inequality. □
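The unbiasedness claim of Lemma 6.4.9 is easy to confirm by simulation; the following snippet (our own) averages ‖XS‖F^2 over many independently drawn sampling matrices S, and the 5% tolerance is loose relative to the sampling variance at these sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 5, 64, 8
X = rng.standard_normal((m, n))
true_sq = np.linalg.norm(X, "fro") ** 2

trials = 20000
acc = 0.0
for _ in range(trials):
    cols = rng.integers(0, n, size=c)      # uniform sampling with replacement
    XS = X[:, cols] * np.sqrt(n / c)       # the rescaled sketch XS
    acc += np.linalg.norm(XS, "fro") ** 2
est = acc / trials

assert abs(est - true_sq) < 0.05 * true_sq  # E ||XS||_F^2 = ||X||_F^2
```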



We can now prove the following lemma, assuming that Eqn. (6.4.7) holds.

Lemma 6.4.11. Assume that Eqn. (6.4.7) holds. If c satisfies

(6.4.12)    c ≥ (192 k ln(40nk)/ε^2) ln(192 √20 k ln(40nk)/ε^2),

then with probability at least .95,

‖(Vk^T DHS)† − (Vk^T DHS)^T‖2^2 ≤ 2ε^2.

Proof. Let σi denote the i-th singular value of the matrix Vk^T DHS. Conditioned on Eqn. (6.4.7) holding, we can replicate the proof of Lemma 5.4.5 to argue that if c satisfies Eqn. (6.4.12), then, with probability at least .95,

(6.4.13)    |1 − σi^2| ≤ ε

holds for all i. (Indeed, we can replicate the proof of Lemma 5.4.5 using Vk instead of UA and k instead of d; we also evaluate the bound for arbitrary ε instead of fixing it.) We now observe that the matrices

(Vk^T DHS)† and (Vk^T DHS)^T

have the same left and right singular vectors5. Recall that in this lemma we used σi to denote the singular values of the matrix Vk^T DHS. Then, the singular values of the matrix (Vk^T DHS)^T are equal to the σi’s, while the singular values of the matrix (Vk^T DHS)† are equal to the σi^{-1}’s. Thus,

‖(Vk^T DHS)† − (Vk^T DHS)^T‖2^2 = max_i |σi^{-1} − σi|^2 = max_i (1 − σi^2)^2 σi^{-2}.

Combining with Eqn. (6.4.13) and using the fact that ε ≤ 1/2,

‖(Vk^T DHS)† − (Vk^T DHS)^T‖2^2 = max_i (1 − σi^2)^2 σi^{-2} ≤ ε^2 (1 − ε)^{-1} ≤ 2ε^2. □
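The observation at the heart of this proof — that the pseudoinverse and the transpose of a full-row-rank matrix share singular vectors, with singular values σi^{-1} and σi respectively — can be verified numerically for a generic matrix (our own snippet):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((4, 9))            # full row rank, like Vk^T DHS
s = np.linalg.svd(X, compute_uv=False)     # singular values sigma_i

# pinv(X) = V diag(1/sigma) U^T and X^T = V diag(sigma) U^T share
# singular vectors, so the spectral norm of their difference is
# exactly max_i |1/sigma_i - sigma_i|.
lhs = np.linalg.norm(np.linalg.pinv(X) - X.T, 2)
rhs = np.abs(1.0 / s - s).max()
assert abs(lhs - rhs) < 1e-10
```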
Lemma 6.4.14. Assume that Eqn. (6.4.7) holds. If c satisfies Eqn. (6.4.12), then, with probability at least .9,

2‖Ak,⊥ DHS((Vk^T DHS)† − (Vk^T DHS)^T)‖F^2 ≤ 80 ε^2 ‖Ak,⊥‖F^2.

Proof. We combine Eqn. (6.4.8) with Lemma 6.4.9 (applied to X = Ak,⊥) and Lemma 6.4.11 to get that, conditioned on Eqn. (6.4.7) holding, with probability at least 1 − 0.05 − 0.05 = 0.9,

‖Ak,⊥ DHS((Vk^T DHS)† − (Vk^T DHS)^T)‖F^2 ≤ 40 ε^2 ‖Ak,⊥‖F^2.

The aforementioned failure probability follows from a simple union bound on the failure probabilities of Lemmas 6.4.9 and 6.4.11. □
Bounding Expression (6.4.5). Our bound for Expression (6.4.5) will be conditioned on Eqn. (6.4.7) holding; then, we will use our matrix multiplication results

5 Given any matrix X with thin SVD X = UX ΣX VX^T, its transpose is X^T = VX ΣX UX^T and its pseudoinverse is X† = VX ΣX^{-1} UX^T.

from Section 4 to derive our bounds. Our discussion is completely analogous to the proof of Lemma 5.4.9. We will prove the following lemma.

Lemma 6.4.15. Assume that Eqn. (6.4.7) holds. If c ≥ 40k ln(40nk)/ε, then, with probability at least .95,

‖Ak,⊥ DHS(Vk^T DHS)^T‖F^2 ≤ ε ‖Ak,⊥‖F^2.

Proof. To prove the lemma, we first observe that

‖Ak,⊥ DHS(Vk^T DHS)^T‖F^2 = ‖Ak,⊥ DHS S^T H^T D Vk − Ak,⊥ DH H^T D Vk‖F^2,

since DHH^T D = In and Ak,⊥ Vk = 0. Thus, we can view Ak,⊥ DHS S^T H^T D Vk as approximating the product of two matrices, Ak,⊥ DH and H^T D Vk, by randomly sampling columns from the first matrix and the corresponding rows from the second matrix, with uniform sampling probabilities that do not depend on the two matrices involved in the product. We will apply the bounds of Eqn. (4.2.6), after arguing that the assumptions of Eqn. (4.2.5) are satisfied. Indeed, since we condition on Eqn. (6.4.7) holding, the rows of H^T D Vk = HDVk satisfy

(6.4.16)    1/n ≥ β ‖(HDVk)i∗‖2^2 / k,  for all i = 1, . . . , n,

for β = (2 ln(40nk))^{-1}. Thus, Eqn. (4.2.6) implies

E[‖Ak,⊥ DHS S^T H^T D Vk − Ak,⊥ DH H^T D Vk‖F^2] ≤ (1/(βc)) ‖Ak,⊥ DH‖F^2 ‖HDVk‖F^2 = (k/(βc)) ‖Ak,⊥‖F^2.

In the above we used ‖HDVk‖F^2 = k and, by unitary invariance, ‖Ak,⊥ DH‖F^2 = ‖Ak,⊥‖F^2. Markov’s inequality now implies that with probability at least .95,

‖Ak,⊥ DHS S^T H^T D Vk − Ak,⊥ DH H^T D Vk‖F^2 ≤ (20k/(βc)) ‖Ak,⊥‖F^2.

Setting c ≥ 20k/(βε) and using the value of β specified above concludes the proof of the lemma. □
Concluding the proof of Theorem 6.1.1. We are now ready to conclude the proof, and therefore we revert back to using Ak,⊥ = A − Ak. We first state the following lemma.

Lemma 6.4.17. Assume that Eqn. (6.4.7) holds. There exists a constant c0 such that, if

(6.4.18)    c ≥ c0 (k ln n / ε^2) ln(k ln n / ε^2),

then with probability at least .85,

‖A − Ũk Ũk^T A‖F ≤ (1 + ε)‖A − Ak‖F.

Proof. Combining Lemma 6.2.5 with Expressions (6.4.4) and (6.4.5), and Lemmas 6.4.14 and 6.4.15, we get

‖A − Ũk Ũk^T A‖F^2 ≤ (1 + ε + 80 ε^2)‖A − Ak‖F^2 ≤ (1 + 41ε)‖A − Ak‖F^2.

The last inequality follows by using ε ≤ 1/2. Taking square roots of both sides and using √(1 + 41ε) ≤ 1 + 21ε, we get

‖A − Ũk Ũk^T A‖F ≤ (1 + 21ε)‖A − Ak‖F.

Observe that c has to be set to the maximum of the values used in Lemmas 6.4.14 and 6.4.15, which is the value of Eqn. (6.4.12). Adjusting ε to ε/21 and appropriately adjusting the constants in the expression for c gives the lemma. (We made no special effort to compute or optimize the constant c0 in the expression for c.) The failure probability follows by a union bound on the failure probabilities of Lemmas 6.4.14 and 6.4.15 conditioned on Eqn. (6.4.7). □
To conclude the proof of Theorem 6.1.1, we simply need to remove the conditional probability from Lemma 6.4.17. Towards that end, we follow the same strategy as in Section 5.4, to conclude that the success probability of the overall approach is at least 0.85 · 0.95 ≥ 0.8.
6.5. Running time. The RandLowRank Algorithm 6.1.4 computes the product C = ADHS using the ideas of Section 5.5, thus taking 2n(m + 1) log2 (c + 1) time. Step 7 takes O(mc^2) time; step 8 takes O(mnc + nc^2) time; step 9 takes O(mck) time. Overall, the running time is, asymptotically, dominated by the O(mnc) term in step 8, with c as in Eqn. (6.4.18).
6.6. References. Our presentation in this chapter follows the derivations in [8].
We also refer the interested reader to [21, 29] for related work.
Acknowledgements. The authors would like to thank Ilse Ipsen for allowing
them to use her slides for the introductory linear algebra lecture delivered at the
PCMI Summer School on which the first section of this chapter is heavily based.
The authors would also like to thank Aritra Bose, Eugenia-Maria Kontopoulou,
and Fred Roosta for their help in proofreading early drafts of this manuscript.

References
[1] N. Ailon and B. Chazelle, The fast Johnson-Lindenstrauss transform and approximate nearest neighbors,
SIAM J. Comput. 39 (2009), no. 1, 302–322, DOI 10.1137/060673096. MR2506527 ←25
[2] N. Ailon and E. Liberty, Fast dimension reduction using Rademacher series on dual BCH codes, Discrete
Comput. Geom. 42 (2009), no. 4, 615–630, DOI 10.1007/s00454-008-9110-x. MR2556458 ←35
[3] H. Avron, P. Maymounkov, and S. Toledo, Blendenpik: supercharging Lapack’s least-squares solver,
SIAM J. Sci. Comput. 32 (2010), no. 3, 1217–1236, DOI 10.1137/090767911. MR2639236 ←27, 28, 36
[4] Rajendra Bhatia, Matrix analysis, Graduate Texts in Mathematics, vol. 169, Springer-Verlag, New
York, 1997. MR1477662 ←11
[5] Å. Björck, Numerical Methods in Matrix Computations, Springer, Heidelberg, 2015. ←10, 11
[6] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, Near-optimal column-based matrix reconstruction,
SIAM J. Comput. 43 (2014), no. 2, 687–717, DOI 10.1137/12086755X. MR3504679 ←40
[7] Petros Drineas, Ravi Kannan, and Michael W. Mahoney, Fast Monte Carlo algorithms for matrices.
II. Computing a low-rank approximation to a matrix, SIAM J. Comput. 36 (2006), no. 1, 158–183, DOI
10.1137/S0097539704442696. MR2231644 ←37
[8] Petros Drineas, Ilse C. F. Ipsen, Eugenia-Maria Kontopoulou, and Malik Magdon-Ismail, Struc-
tural Convergence Results for Approximation of Dominant Subspaces from Block Krylov Spaces, SIAM J.
Matrix Anal. Appl. 39 (2018), no. 2, 567–586, DOI 10.1137/16M1091745. MR3782400 ←39, 42, 47

[9] Petros Drineas, Ravi Kannan, and Michael W. Mahoney, Fast Monte Carlo algorithms for ma-
trices. I. Approximating matrix multiplication, SIAM J. Comput. 36 (2006), no. 1, 132–157, DOI
10.1137/S0097539704442684. MR2231643 ←18, 20, 21, 24
[10] P. Drineas and M. W. Mahoney, RandNLA: Randomized Numerical Linear Algebra, Communications
of the ACM 59 (2016), no. 6, 80–90. ←3
[11] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós, Faster least squares approximation,
Numer. Math. 117 (2011), no. 2, 219–249, DOI 10.1007/s00211-010-0331-6. MR2754850 ←29, 36
[12] John C. Duchi, Introductory lectures on stochastic optimization, The Mathematics of Data, IAS/Park
City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018. ←3
[13] G. H. Golub and C. F. Van Loan, Matrix computations, 3rd ed., Johns Hopkins Studies in the
Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, 1996. MR1417720 ←4, 11
[14] N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: probabilistic algo-
rithms for constructing approximate matrix decompositions, SIAM Rev. 53 (2011), no. 2, 217–288, DOI
10.1137/090771806. MR2806637 ←37
[15] Wassily Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist.
Assoc. 58 (1963), 13–30. MR0144363 ←32
[16] John T. Holodnak and Ilse C. F. Ipsen, Randomized approximation of the Gram matrix: exact com-
putation and probabilistic bounds, SIAM J. Matrix Anal. Appl. 36 (2015), no. 1, 110–137, DOI
10.1137/130940116. MR3306014 ←24
[17] M. W. Mahoney, Randomized algorithms for matrices and data, Foundations and Trends in Machine
Learning, NOW Publishers, Boston, 2011. ←3, 22, 26, 37, 39
[18] Michael W. Mahoney and Petros Drineas, Structural properties underlying high-quality randomized
numerical linear algebra algorithms, Handbook of big data, Chapman & Hall/CRC Handb. Mod.
Stat. Methods, CRC Press, Boca Raton, FL, 2016, pp. 137–154. MR3674816 ←37, 41, 42
[19] Per-Gunnar Martinsson, Randomized methods for matrix computations, The Mathematics of Data,
IAS/Park City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018. ←3, 37, 38, 39
[20] Rajeev Motwani and Prabhakar Raghavan, Randomized algorithms, Cambridge University Press,
Cambridge, 1995. MR1344451 ←16
[21] C. Musco and C. Musco, Stronger and Faster Approximate Singular Value Decomposition via
the Block Lanczos Method, Neural Information Processing Systems (NIPS), 2015, available at
arXiv:1504.05477. ←39, 47
[22] Roberto Imbuzeiro Oliveira, Sums of random Hermitian matrices and an inequality by Rudelson, Elec-
tron. Commun. Probab. 15 (2010), 203–212, DOI 10.1214/ECP.v15-1544. MR2653725 ←22
[23] Yousef Saad, Numerical methods for large eigenvalue problems, Classics in Applied Mathematics, vol. 66, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2011. Revised edition of the 1992 original. MR3396212 ←37
[24] Nikhil Srivastava, Spectral sparsification and restricted invertibility, ProQuest LLC, Ann Arbor, MI,
2010. Thesis (Ph.D.)–Yale University. MR2941475 ←24
[25] G. W. Stewart and J. G. Sun, Matrix Perturbation Theory, Academic Press, New York, 1990. ←11
[26] Gilbert Strang, Linear algebra and its applications, 2nd ed., Academic Press [Harcourt Brace Jo-
vanovich, Publishers], New York-London, 1980. MR575349 ←11
[27] L.N. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM, Philadelphia, 1997. ←11
[28] Roman Vershynin, Four lectures on probabilistic methods for data science, The Mathematics of Data,
IAS/Park City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018. ←3, 22, 32
[29] S. Wang, Z. Zhang, and T. Zhang, Improved Analyses of the Randomized Power Method and Block Lanczos Method (2015), 1–22 pp., available at arXiv:1508.06429. ←47
[30] David P. Woodruff, Sketching as a tool for numerical linear algebra, Found. Trends Theor. Comput.
Sci. 10 (2014), no. 1-2, iv+157. MR3285427 ←36, 37, 39

Purdue University, Computer Science Department, 305 N University Street, West Lafayette, IN
47906.
Email address: [email protected]

University of California at Berkeley, ICSI and Department of Statistics, 367 Evans Hall, Berkeley,
CA 94720.
Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 49–97
https://doi.org/10.1090/pcms/025/00830

Optimization Algorithms for Data Analysis

Stephen J. Wright

Contents
1 Introduction 50
1.1 Omissions 51
1.2 Notation 51
2 Optimization Formulations of Data Analysis Problems 52
2.1 Setup 52
2.2 Least Squares 54
2.3 Matrix Completion 54
2.4 Nonnegative Matrix Factorization 55
2.5 Sparse Inverse Covariance Estimation 56
2.6 Sparse Principal Components 56
2.7 Sparse Plus Low-Rank Matrix Decomposition 57
2.8 Subspace Identification 57
2.9 Support Vector Machines 58
2.10 Logistic Regression 60
2.11 Deep Learning 61
3 Preliminaries 63
3.1 Solutions 64
3.2 Convexity and Subgradients 64
3.3 Taylor’s Theorem 65
3.4 Optimality Conditions for Smooth Functions 67
3.5 Proximal Operators and the Moreau Envelope 68
3.6 Convergence Rates 69
4 Gradient Methods 71
4.1 Steepest Descent 71
4.2 General Case 72
4.3 Convex Case 72
4.4 Strongly Convex Case 73
4.5 General Case: Line-Search Methods 74
4.6 Conditional Gradient Method 75
2010 Mathematics Subject Classification. Primary 14Dxx; Secondary 14Dxx.
Key words and phrases. Park City Mathematics Institute.

©2018 Stephen J. Wright


5 Prox-Gradient Methods 77
6 Accelerating Gradient Methods 80
6.1 Heavy-Ball Method 80
6.2 Conjugate Gradient 81
6.3 Nesterov’s Accelerated Gradient: Weakly Convex Case 82
6.4 Nesterov’s Accelerated Gradient: Strongly Convex Case 84
6.5 Lower Bounds on Rates 87
7 Newton Methods 88
7.1 Basic Newton’s Method 88
7.2 Newton’s Method for Convex Functions 90
7.3 Newton Methods for Nonconvex Functions 91
7.4 A Cubic Regularization Approach 93
8 Conclusions 95
1. Introduction
In this article, we consider algorithms for solving smooth optimization prob-
lems, possibly with simple constraints or structured nonsmooth regularizers. One
such canonical formulation is
(1.0.1)    min_{x∈Rn} f(x),

where f : Rn → R has at least Lipschitz continuous gradients. Additional assumptions about f, such as convexity and Lipschitz continuity of the Hessian, are
introduced as needed. Another formulation we consider is
(1.0.2)    min_{x∈Rn} f(x) + λψ(x),

where f is as in (1.0.1), ψ : Rn → R is a function that is usually convex and usually
nonsmooth, and λ ⩾ 0 is a regularization parameter.¹ We refer to (1.0.2) as a
regularized minimization problem because the presence of the term involving ψ
induces certain structural properties on the solution, that make it more desirable
or plausible in the context of the application. We describe iterative algorithms
that generate a sequence {xk }k=0,1,2,... of points that, in the case of convex objective
functions, converges to the set of solutions. (Some algorithms also generate other
“auxiliary” sequences of iterates.)
We are motivated to study problems of the forms (1.0.1) and (1.0.2) by their
ubiquity in data analysis applications. Accordingly, Section 2 describes some
canonical problems in data analysis and their formulation as optimization prob-
lems. After some preliminaries in Section 3, we describe in Section 4 algorithms
that take steps based on the gradients ∇f(xk ). Extensions of these methods to
¹A set S is said to be convex if for any pair of points z, z′ ∈ S, we have that αz + (1 − α)z′ ∈ S for
all α ∈ [0, 1]. A function φ : Rn → R is convex if φ(αz + (1 − α)z′) ⩽ αφ(z) + (1 − α)φ(z′)
for all z, z′ in the (convex) domain of φ and all α ∈ [0, 1].
the case (1.0.2) of regularized objectives are described in Section 5. Section 6 de-
scribes accelerated gradient methods, which achieve better worst-case complexity
than basic gradient methods, while still only using first-derivative information.
We discuss Newton’s method in Section 7, outlining variants that can guarantee
convergence to points that approximately satisfy second-order conditions for a
local minimizer of a smooth nonconvex function.
1.1. Omissions Our approach throughout is to give a concise description of
some of the most important algorithmic tools for smooth nonlinear optimization
and regularized optimization, along with the basic convergence theory for each.
(In any given context, we mean by “smooth” that the function is differentiable as
many times as is necessary for the discussion to make sense.) In most cases, the
theory is elementary enough to include here in its entirety. In the few remaining
cases, we provide citations to works in which complete proofs can be found.
Although we allow nonsmoothness in the regularization term in (1.0.2), we do
not cover subgradient methods or mirror descent explicitly in this chapter. We
also do not discuss stochastic gradient methods, a class of methods that is central
to modern machine learning. All these topics are discussed in the contribution of
John Duchi to the current volume [22]. Other omissions include the following.
• Coordinate descent methods; see [47] for a recent review.
• Augmented Lagrangian methods, including alternating direction meth-
ods of multipliers (ADMM) [23]. The review [5] remains a good reference
for the latter topic, especially as it applies to problems from data analysis.
• Semidefinite programming (see [43, 45]) and conic optimization (see [6]).
• Methods tailored specifically to linear or quadratic programming, such as
the simplex method or interior-point methods (see [46] for a discussion of
the latter).
• Quasi-Newton methods, which modify Newton’s method by approximat-
ing the Hessian or its inverse, thus attaining attractive theoretical and
practical performance without using any second-derivative information.
For a discussion of these methods, see [36, Chapter 6]. One important
method of this class, which is useful in data analysis and many other
large-scale problems, is the limited-memory method L-BFGS [30]; see also
[36, Section 7.2].
1.2. Notation Our notational conventions in this chapter are as follows. We
use upper-case Roman characters (A, L, R, and so on) for matrices and lower-
case Roman (x, v, u, and so on) for vectors. (Vectors are assumed to be column
vectors.) Transposes are indicated by a superscript “T .” Elements of matrices and
vectors are indicated by subscripts, for example, Aij and xj . Iteration numbers are
indicated by superscripts, for example, xk . We denote the set of real numbers by
R, so that Rn denotes the Euclidean space of dimension n. The set of symmetric
real n × n matrices is denoted by SRn×n . Real scalars are usually denoted by
Greek characters, for example, α, β, and so on, though in deference to convention,
we sometimes use Roman capitals (for example, L for the Lipschitz constant of
a gradient). Where vector norms appear, the type of norm in use is indicated
by a subscript (for example x1 ), except that when no subscript appears, the
Euclidean norm  · 2 is assumed. Matrix norms are defined where first used.
2. Optimization Formulations of Data Analysis Problems
In this section, we describe briefly some representative problems in data anal-
ysis and machine learning, emphasizing their formulation as optimization prob-
lems. Our list is by no means exhaustive. In many cases, there are a number of
different ways to formulate a given application as an optimization problem. We
do not try to describe all of them. But our list here gives a flavor of the interface
between data analysis and optimization.
2.1. Setup Practical data sets are often extremely messy. Data may be misla-
beled, noisy, incomplete, or otherwise corrupted. Much of the hard work in data
analysis is done by professionals, familiar with the underlying applications, who
“clean” the data and prepare it for analysis, while being careful not to change the
essential properties that they wish to discern from the analysis. Dasu and Johnson [19] claim that “80% of data analysis is spent on the process of cleaning
and preparing the data.” We do not discuss this aspect of the process, focusing in-
stead on the part of the data analysis pipeline in which the problem is formulated
and solved.
The data set in a typical analysis problem consists of m objects:
(2.1.1) D := {(aj , yj ), j = 1, 2, . . . , m},
where aj is a vector (or matrix) of features and yj is an observation or label. (Each
pair (aj , yj ) has the same size and shape for all j = 1, 2, . . . , m.) The analysis
task then consists of discovering a function φ such that φ(aj ) ≈ yj holds for
most j = 1, 2, . . . , m. The process of discovering the mapping φ is often called
“learning” or “training.”
The function φ is often defined in terms of a vector or matrix of parameters,
which we denote by x or X. (Other notation also appears below.) With these
parametrizations, the problem of identifying φ becomes a data-fitting problem:
“Find the parameters x defining φ such that φ(aj ) ≈ yj , j = 1, 2, . . . , m in some
optimal sense.” Once we come up with a definition of the term “optimal,” we
have an optimization problem. Many such optimization formulations have objec-
tive functions of the “summation” type

(2.1.2)    LD(x) := ∑_{j=1}^m ℓ(aj, yj; x),

where the jth term ℓ(aj, yj; x) is a measure of the mismatch between φ(aj) and
yj, and x is the vector of parameters that determines φ.
One use of φ is to make predictions about future data items. Given another
previously unseen item of data â of the same type as aj , j = 1, 2, . . . , m, we
predict that the label ŷ associated with â would be φ(â). The mapping may also
expose other structure and properties in the data set. For example, it may reveal
that only a small fraction of the features in aj are needed to reliably predict the
label yj . (This is known as feature selection.) The function φ or its parameter x
may also reveal important structure in the data. For example, X could reveal a
low-dimensional subspace that contains most of the aj , or X could reveal a matrix
with particular structure (low-rank, sparse) such that observations of X prompted
by the feature vectors aj yield results close to yj .
Examples of labels yj include the following.
• A real number, leading to a regression problem.
• A label, say yj ∈ {1, 2, . . . , M} indicating that aj belongs to one of M
classes. This is a classification problem. We have M = 2 for binary classifi-
cation and M > 2 for multiclass classification.
• Null. Some problems only have feature vectors aj and no labels. In this
case, the data analysis task may consist of grouping the aj into clusters
(where the vectors within each cluster are deemed to be functionally sim-
ilar), or identifying a low-dimensional subspace (or a collection of low-
dimensional subspaces) that approximately contains the aj . Such prob-
lems require the labels yj to be learned, alongside the function φ. For
example, in a clustering problem, yj could represent the cluster to which
aj is assigned.
Even after cleaning and preparation, the setup above may contain many com-
plications that need to be dealt with in formulating the problem in rigorous math-
ematical terms. The quantities (aj , yj ) may contain noise, or may be otherwise
corrupted. We would like the mapping φ to be robust to such errors. There may
be missing data: parts of the vectors aj may be missing, or we may not know all
the labels yj . The data may be arriving in streaming fashion rather than being
available all at once. In this case, we would learn φ in an online fashion.
One particular consideration is that we wish to avoid overfitting the model to
the data set D in (2.1.1). The particular data set D available to us can often be
thought of as a finite sample drawn from some underlying larger (often infinite)
collection of data, and we wish the function φ to perform well on the unobserved
data points as well as the observed subset D. In other words, we want φ to
be not too sensitive to the particular sample D that is used to define empirical
objective functions such as (2.1.2). The optimization formulation can be modified
in various ways to achieve this goal, by the inclusion of constraints or penalty
terms that limit some measure of “complexity” of the function (such techniques
are called generalization or regularization). Another approach is to terminate the
optimization algorithm early, the rationale being that overfitting occurs mainly in
the later stages of the optimization process.
2.2. Least Squares Probably the oldest and best-known data analysis problem is
linear least squares. Here, the data points (aj , yj ) lie in Rn × R, and we solve
(2.2.1)    min_x (1/2m) ∑_{j=1}^m (aTj x − yj)² = (1/2m) ‖Ax − y‖₂²,
where A is the matrix whose rows are aTj , j = 1, 2, . . . , m and y = (y1 , y2 , . . . , ym )T .
In the terminology above, the function φ is defined by φ(a) := aT x. (We could
also introduce a nonzero intercept by adding an extra parameter β ∈ R and
defining φ(a) := aT x + β.) This formulation can be motivated statistically, as a
maximum-likelihood estimate of x when the observations yj are exact but for
i.i.d. Gaussian noise. Randomized linear algebra methods for large-scale in-
stances of this problem are discussed in Section 5 of the lectures of Drineas and
Mahoney [20] in this volume.
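As a concrete illustration (the data and code here are our own sketch, not from the text), the unregularized problem (2.2.1) can be solved with a single dense linear-algebra call:

```python
import numpy as np

# Minimal sketch (synthetic data): solving min_x (1/2m)||Ax - y||_2^2.
rng = np.random.default_rng(0)
m, n = 50, 5
A = rng.standard_normal((m, n))
y = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

# lstsq minimizes ||Ax - y||_2, which has the same minimizer as (2.2.1),
# since the constant factor 1/2m does not change the argmin.
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)

# At the minimizer, the gradient (1/m) A^T (Ax - y) vanishes.
grad = A.T @ (A @ x_star - y) / m
print(np.linalg.norm(grad))
```

The vanishing gradient is exactly the optimality condition discussed in Section 3.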
Various modifications of (2.2.1) impose desirable structure on x and hence on
φ. For example, Tikhonov regularization with a squared ℓ₂-norm, which is

    min_x (1/2m) ‖Ax − y‖₂² + λ‖x‖₂²,  for some parameter λ > 0,
yields a solution x with less sensitivity to perturbations in the data (aj , yj ). The
LASSO formulation
(2.2.2)    min_x (1/2m) ‖Ax − y‖₂² + λ‖x‖₁
tends to yield solutions x that are sparse, that is, containing relatively few nonzero
components [42]. This formulation performs feature selection: The locations of
the nonzero components in x reveal those components of aj that are instrumental
in determining the observation yj . Besides its statistical appeal — predictors that
depend on few features are potentially simpler and more comprehensible than
those depending on many features — feature selection has practical appeal in
making predictions about future data. Rather than gathering all components of a
new data vector â, we need to find only the “selected” features, since only these
are needed to make a prediction. The LASSO formulation (2.2.2) is an important
prototype for many problems in data analysis, in that it involves a regularization
term λ‖x‖₁ that is nonsmooth and convex, but with relatively simple structure
that can potentially be exploited by algorithms.
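To make (2.2.2) concrete, here is an illustrative sketch (our own construction, with a hypothetical sparse ground truth) of solving the LASSO by iterative soft-thresholding, a proximal-gradient method of the kind discussed in Section 5:

```python
import numpy as np

def soft_threshold(z, tau):
    # Componentwise prox of tau * ||.||_1: shrink each entry toward zero by tau.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(A, y, lam, iters=500):
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2 / m          # Lipschitz constant of the gradient
    x = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (A @ x - y) / m           # gradient of (1/2m)||Ax - y||_2^2
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.0, -2.0]                   # sparse ground truth (hypothetical)
y = A @ x_true
x_hat = lasso_ista(A, y, lam=0.1)
print(np.nonzero(np.abs(x_hat) > 1e-3)[0])     # support concentrates on {2, 7}
```

The soft-thresholding step is what produces exact zeros in x, i.e. the feature selection behavior described above.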
2.3. Matrix Completion Matrix completion is in one sense a natural extension
of least-squares to problems in which the data aj are naturally represented as
matrices rather than vectors. Changing notation slightly, we suppose that each
Aj is an n × p matrix, and we seek another n × p matrix X that solves
(2.3.1)    min_X (1/2m) ∑_{j=1}^m (⟨Aj, X⟩ − yj)²,

where ⟨A, B⟩ := trace(ATB). Here we can think of the Aj as “probing” the unknown matrix X. Commonly considered types of observations are random linear
combinations (where the elements of Aj are selected i.i.d. from some distribution)
or single-element observations (in which each Aj has 1 in a single location and
zeros elsewhere). A regularized version of (2.3.1), leading to solutions X that are
low-rank, is
(2.3.2)    min_X (1/2m) ∑_{j=1}^m (⟨Aj, X⟩ − yj)² + λ‖X‖∗,

where ‖X‖∗ is the nuclear norm, which is the sum of singular values of X [39].
The nuclear norm plays a role analogous to the ℓ₁ norm in (2.2.2). Although the
nuclear norm is a somewhat complex nonsmooth function, it is at least convex, so
that the formulation (2.3.2) is also convex. This formulation can be shown to yield
a statistically valid solution when the true X is low-rank and the observation ma-
trices Aj satisfy a “restricted isometry” property, commonly satisfied by random
matrices, but not by matrices with just one nonzero element. The formulation is
also valid in a different context, in which the true X is incoherent (roughly speak-
ing, it does not have a few elements that are much larger than the others), and
the observations Aj are of single elements [10].
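As a quick aside (an illustration of the definition, not part of the text), the nuclear norm can be computed directly from the singular values; numpy also exposes it as a matrix norm:

```python
import numpy as np

# The nuclear norm ||X||_* is the sum of the singular values of X.
rng = np.random.default_rng(6)
X = rng.standard_normal((5, 3))

s = np.linalg.svd(X, compute_uv=False)                     # singular values
print(np.isclose(s.sum(), np.linalg.norm(X, ord='nuc')))   # True
```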
In another form of regularization, the matrix X is represented explicitly as a
product of two “thin” matrices L and R, where L ∈ Rn×r and R ∈ Rp×r, with
r ≪ min(n, p). We set X = LRT in (2.3.1) and solve
(2.3.3)    min_{L,R} (1/2m) ∑_{j=1}^m (⟨Aj, LRT⟩ − yj)².
In this formulation, the rank r is “hard-wired” into the definition of X, so there is
no need to include a regularizing term. This formulation is also typically much
more compact than (2.3.2); the total number of elements in (L, R) is (n + p)r,
which is much less than np. A disadvantage is that it is nonconvex. An active
line of current research, pioneered in [9] and also drawing on statistical sources,
shows that the nonconvexity is benign in many situations, and that under certain
assumptions on the data (Aj , yj ), j = 1, 2, . . . , m and careful choice of algorithmic
strategy, good solutions can be obtained from the formulation (2.3.3). A clue to
this good behavior is that although this formulation is nonconvex, it is in some
sense an approximation to a tractable problem: If we have a complete observation
of X, then a rank-r approximation can be found by performing a singular value
decomposition of X, and defining L and R in terms of the r leading left and right
singular vectors.
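The closing remark can be checked numerically; a small sketch (synthetic data of our own) of recovering L and R from the truncated SVD of a fully observed low-rank X:

```python
import numpy as np

# If X is exactly rank r, the r leading singular triples recover it:
# L holds the scaled left singular vectors, R the right singular vectors.
rng = np.random.default_rng(2)
n, p, r = 8, 6, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))  # exactly rank 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
L = U[:, :r] * s[:r]                 # n x r factor (scaled left singular vectors)
R = Vt[:r].T                         # p x r factor (leading right singular vectors)
print(np.linalg.norm(L @ R.T - X))   # essentially zero
```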
2.4. Nonnegative Matrix Factorization Some applications in computer vision,
chemometrics, and document clustering require us to find factors L and R like
those in (2.3.3) in which all elements are nonnegative. If the full matrix Y ∈ Rn×p
is observed, this problem has the form
    min_{L,R} ‖LRT − Y‖²F,  subject to L ⩾ 0, R ⩾ 0.
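One classical heuristic for this problem (not covered in the text; the multiplicative updates of Lee and Seung) alternates rescalings that automatically preserve nonnegativity; an illustrative sketch on synthetic data:

```python
import numpy as np

# Lee-Seung multiplicative updates for min ||L R^T - Y||_F^2, L >= 0, R >= 0.
# Each update multiplies by a nonnegative ratio, so L and R stay nonnegative.
rng = np.random.default_rng(3)
n, p, r = 12, 9, 3
Y = rng.random((n, r)) @ rng.random((r, p))   # nonnegative matrix of rank 3

L = rng.random((n, r))
R = rng.random((p, r))
eps = 1e-12                                   # guard against division by zero
for _ in range(1000):
    L *= (Y @ R) / (L @ (R.T @ R) + eps)
    R *= (Y.T @ L) / (R @ (L.T @ L) + eps)

rel_err = np.linalg.norm(L @ R.T - Y) / np.linalg.norm(Y)
print(rel_err)                                # small relative residual
```

Because the objective is nonconvex, these updates find only a local solution; their appeal is simplicity and built-in feasibility.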
2.5. Sparse Inverse Covariance Estimation In this problem, the labels yj are
null, and the vectors aj ∈ Rn are viewed as independent observations of a ran-
dom vector A ∈ Rn , which has zero mean. The sample covariance matrix con-
structed from these observations is
    S = (1/(m − 1)) ∑_{j=1}^m aj aTj.

The element Sil is an estimate of the covariance between the ith and lth elements
of the random variable vector A. Our interest is in calculating an estimate X of
the inverse covariance matrix that is sparse. The structure of X yields important
information about A. In particular, if Xil = 0, we can conclude that the i and
l components of A are conditionally independent. (That is, they are independent
given knowledge of the values of the other n − 2 components of A.) Stated an-
other way, the nonzero locations in X indicate the arcs in the dependency graph
whose nodes correspond to the n components of A.
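As a sanity check (synthetic data, our own), the sample covariance formula above matches numpy's built-in estimator when the observations are centered:

```python
import numpy as np

# S = 1/(m-1) * sum_j a_j a_j^T for zero-mean observations a_j (rows of A_obs).
rng = np.random.default_rng(4)
m, n = 200, 4
A_obs = rng.standard_normal((m, n))
A_obs -= A_obs.mean(axis=0)                   # enforce the zero-mean assumption

S = sum(np.outer(a, a) for a in A_obs) / (m - 1)
print(np.allclose(S, np.cov(A_obs, rowvar=False)))   # True
```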
One optimization formulation that has been proposed for estimating the sparse
inverse covariance matrix X is the following:

(2.5.1)    min_{X ∈ SRn×n, X ≻ 0}  ⟨S, X⟩ − log det(X) + λ‖X‖₁,

where SRn×n is the set of n × n symmetric matrices, X ≻ 0 indicates that X is
positive definite, and ‖X‖₁ := ∑_{i,l=1}^n |Xil| (see [17, 25]).
2.6. Sparse Principal Components The setup for this problem is similar to the
previous section, in that we have a sample covariance matrix S that is estimated
from a number of observations of some underlying random vector. The princi-
pal components of this matrix are the eigenvectors corresponding to the largest
eigenvalues. It is often of interest to find sparse principal components, approxi-
mations to the leading eigenvectors that also contain few nonzeros. An explicit
optimization formulation of this problem is
(2.6.1)    max_{v∈Rn} vT Sv  s.t. ‖v‖₂ = 1, ‖v‖₀ ⩽ k,

where ‖·‖₀ indicates the cardinality of v (that is, the number of nonzeros in v)
and k is a user-defined parameter indicating a bound on the cardinality of v. The
problem (2.6.1) is NP-hard, so exact formulations (for example, as a quadratic
program with binary variables) are intractable. We consider instead a relaxation,
due to [18], which replaces vvT by a positive semidefinite proxy M ∈ SRn×n :
(2.6.2)    max_{M∈SRn×n} ⟨S, M⟩  s.t. M ⪰ 0, ⟨I, M⟩ = 1, ‖M‖₁ ⩽ ρ,
for some parameter ρ > 0 that can be adjusted to attain the desired sparsity. This
formulation is a convex optimization problem, in fact, a semidefinite program-
ming problem.
This formulation can be generalized to find the leading r > 1 sparse principal
components. Ideally, we would obtain these from a matrix V ∈ Rn×r whose
columns are mutually orthogonal and have at most k nonzeros each. We can
write a convex relaxation of this problem, once again a semidefinite program, as
(2.6.3)    max_{M∈SRn×n} ⟨S, M⟩  s.t. 0 ⪯ M ⪯ I, ⟨I, M⟩ = 1, ‖M‖₁ ⩽ ρ̄.
A more compact (but nonconvex) formulation is
    max_{F∈Rn×r} ⟨S, FFT⟩  s.t. ‖F‖₂ ⩽ 1, ‖F‖₂,₁ ⩽ R̄,

where ‖F‖₂,₁ := ∑_{i=1}^n ‖Fi·‖₂ [15]. The latter regularization term is often called
a “group-sparse” or “group-LASSO” regularizer. (An early use of this type of
regularizer was described in [44].)
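A much simpler baseline than these relaxations (our own illustration, not from the text) is to compute the leading eigenvector by power iteration and then keep only its k largest components; on a covariance with a planted sparse principal component this recovers the support:

```python
import numpy as np

# Planted model: the leading eigenvector u is supported on the first k coordinates.
rng = np.random.default_rng(5)
n, k = 20, 3
u = np.zeros(n)
u[:k] = 1.0 / np.sqrt(k)
S = 5.0 * np.outer(u, u) + np.eye(n)          # eigenvalues: 6 (for u), 1 (rest)

v = rng.standard_normal(n)
for _ in range(200):                          # power iteration for the top PC
    v = S @ v
    v /= np.linalg.norm(v)

idx = np.argsort(-np.abs(v))[:k]              # keep the k largest components
v_sparse = np.zeros(n)
v_sparse[idx] = v[idx]
v_sparse /= np.linalg.norm(v_sparse)          # feasible point for (2.6.1)
print(sorted(idx.tolist()))                   # [0, 1, 2]
```

Unlike (2.6.2), this heuristic carries no guarantee in general; it simply shows what a feasible point of (2.6.1) looks like.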
2.7. Sparse Plus Low-Rank Matrix Decomposition Another useful paradigm
is to decompose a partly or fully observed n × p matrix Y into the sum of a
sparse matrix and a low-rank matrix. A convex formulation of the fully-observed
problem is
    min_{M,S} ‖M‖∗ + λ‖S‖₁  s.t. Y = M + S,

where ‖S‖₁ := ∑_{i=1}^n ∑_{j=1}^p |Sij| [11, 14]. Compact, nonconvex formulations that
allow noise in the observations include the following:
    min_{L,R,S} (1/2) ‖LRT + S − Y‖²F    (fully observed)
    min_{L,R,S} (1/2) ‖PΦ(LRT + S − Y)‖²F    (partially observed),
where Φ represents the locations of the observed entries of Y and PΦ is projection
onto this set [15, 48].
One application of these formulations is to robust PCA, where the low-rank
part represents principal components and the sparse part represents “outlier”
observations. Another application is to foreground-background separation in
video processing. Here, each column of Y represents the pixels in one frame of
video, whereas each row of Y shows the evolution of one pixel over time.
2.8. Subspace Identification In this application, the aj ∈ Rn , j = 1, 2, . . . , m are
vectors that lie (approximately) in a low-dimensional subspace. The aim is to
identify this subspace, expressed as the column subspace of a matrix X ∈ Rn×r .
If the aj are fully observed, an obvious way to solve this problem is to perform
a singular value decomposition of the n × m matrix A = [aj]_{j=1}^m, and take X to
be the leading r left singular vectors. In interesting variants of this problem,
however, the vectors aj may be arriving in streaming fashion and may be only
partly observed, for example in indices Φj ⊂ {1, 2, . . . , n}. We would thus need to
identify a matrix X and vectors sj ∈ Rr such that
PΦj (aj − Xsj ) ≈ 0, j = 1, 2, . . . , m.
The algorithm for identifying X, described in [1], is a manifold-projection scheme
that takes steps in incremental fashion for each aj in turn. Its validity relies on
incoherence of the matrix X with respect to the principal axes, that is, the matrix
X should not have a few elements that are much larger than the others. A local
convergence analysis of this method is given in [2].
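For the fully observed case, the SVD computation is a one-liner; a sketch with synthetic data whose columns lie exactly in a two-dimensional subspace:

```python
import numpy as np

# Columns of A lie in a 2-dimensional subspace; the leading left singular
# vectors of A give an orthonormal basis X for it.
rng = np.random.default_rng(7)
n, m, r = 10, 30, 2
basis = rng.standard_normal((n, r))
A = basis @ rng.standard_normal((r, m))        # each column a_j = basis @ s_j

U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = U[:, :r]                                   # candidate subspace basis
resid = A - X @ (X.T @ A)                      # residual after projecting columns
print(np.linalg.norm(resid))                   # essentially zero
```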
2.9. Support Vector Machines Classification via support vector machines (SVM)
is a classical paradigm in machine learning. This problem takes as input data
(aj , yj ) with aj ∈ Rn and yj ∈ {−1, 1}, and seeks a vector x ∈ Rn and a scalar
β ∈ R such that
(2.9.1a)    aTj x − β ⩾ 1  when yj = +1;
(2.9.1b)    aTj x − β ⩽ −1  when yj = −1.
Any pair (x, β) that satisfies these conditions defines a separating hyperplane in
Rn , that separates the “positive” cases {aj | yj = +1} from the “negative” cases
{aj | yj = −1}. (In the language of Section 2.1, we could define the function
φ as φ(aj ) = sign(aTj x − β).) Among all separating hyperplanes, the one that
minimizes x2 is the one that maximizes the margin between the two classes,
that is, the hyperplane whose distance to the nearest point aj of either class is
greatest.
We can formulate the problem of finding a separating hyperplane as an opti-
mization problem by defining an objective with the summation form (2.1.2):
(2.9.2)    H(x, β) = (1/m) ∑_{j=1}^m max(1 − yj(aTj x − β), 0).
Note that the jth term in this summation is zero if the conditions (2.9.1) are
satisfied, and positive otherwise. Even if no pair (x, β) exists with H(x, β) = 0,
the pair (x, β) that minimizes (2.1.2) will be the one that comes as close as possible
to satisfying (2.9.1), in a suitable sense. A term λ‖x‖₂², where λ is a small positive
parameter, is often added to (2.9.2), yielding the following regularized version:

(2.9.3)    H(x, β) = (1/m) ∑_{j=1}^m max(1 − yj(aTj x − β), 0) + (λ/2)‖x‖₂².
If λ is sufficiently small (but positive), and if separating hyperplanes exist, the
pair (x, β) that minimizes (2.9.3) is the maximum-margin separating hyperplane.
The maximum-margin property is consistent with the goals of generalizability
and robustness. For example, if the observed data (aj , yj ) is drawn from an
underlying “cloud” of positive and negative cases, the maximum-margin solution
usually does a reasonable job of separating other empirical data samples drawn
from the same clouds, whereas a hyperplane that passes close by several of the
observed data points may not do as well (see Figure 2.9.4).
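A minimal sketch (toy data of our own) of evaluating the regularized objective (2.9.3); the hinge term vanishes exactly when the margin conditions (2.9.1) hold:

```python
import numpy as np

def svm_objective(x, beta, A, y, lam):
    margins = y * (A @ x - beta)               # y_j (a_j^T x - beta)
    hinge = np.maximum(1.0 - margins, 0.0)     # zero when (2.9.1) is satisfied
    return hinge.mean() + 0.5 * lam * (x @ x)

A = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
x = np.array([1.0, 0.0])                       # a separating direction
beta = 0.0
print(svm_objective(x, beta, A, y, lam=0.0))   # 0.0: every margin is >= 1
```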
The problem of minimizing (2.9.3) can be written as a convex quadratic pro-
gram — having a convex quadratic objective and linear constraints — by intro-
ducing variables sj , j = 1, 2, . . . , m to represent the residual terms. Then,
Figure 2.9.4. Linear support vector machine classification, with
one class represented by circles and the other by squares. One
possible choice of separating hyperplane is shown at left. If the
observed data is an empirical sample drawn from a cloud of un-
derlying data points, this plane does not do well in separating
the two clouds (middle). The maximum-margin separating hy-
perplane does better (right).
(2.9.5a)    min_{x,β,s} (1/m) 1T s + (λ/2)‖x‖₂²,
(2.9.5b)    subject to sj ⩾ 1 − yj(aTj x − β), sj ⩾ 0, j = 1, 2, . . . , m,

where 1 = (1, 1, . . . , 1)T ∈ Rm.
Often it is not possible to find a hyperplane that separates the positive and
negative cases well enough to be useful as a classifier. One solution is to trans-
form all of the raw data vectors aj by a mapping ζ into a higher-dimensional
Euclidean space, then perform the support-vector-machine classification on the
vectors ζ(aj ), j = 1, 2, . . . , m.
The conditions (2.9.1) would thus be replaced by
(2.9.6a)    ζ(aj)T x − β ⩾ 1  when yj = +1;
(2.9.6b)    ζ(aj)T x − β ⩽ −1  when yj = −1,
leading to the following analog of (2.9.3):
(2.9.7)    H(x, β) = (1/m) ∑_{j=1}^m max(1 − yj(ζ(aj)T x − β), 0) + (λ/2)‖x‖₂².
When transformed back to Rn, the surface {a | ζ(a)T x − β = 0} is nonlinear and
possibly disconnected, and is often a much more powerful classifier than the
hyperplanes resulting from (2.9.3).
We can formulate (2.9.7) as a convex quadratic program in exactly the same
manner as we derived (2.9.5) from (2.9.3). By taking the dual of this quadratic
program, we obtain another convex quadratic program, in m variables:
(2.9.8)    min_{α∈Rm} (1/2) αT Qα − 1T α  subject to 0 ⩽ α ⩽ (1/λ)1, yT α = 0,

where
    Qkl = yk yl ζ(ak)T ζ(al),  y = (y1, y2, . . . , ym)T,  1 = (1, 1, . . . , 1)T ∈ Rm.
Interestingly, problem (2.9.8) can be formulated and solved without any explicit
knowledge or definition of the mapping ζ. We need only a technique to define the
elements of Q. This can be done with the use of a kernel function K : Rn × Rn → R,
where K(ak , al ) replaces ζ(ak )T ζ(al ) [4, 16]. This is the so-called “kernel trick.”
(The kernel function K can also be used to construct a classification function
φ from the solution of (2.9.8).) A particularly popular choice of kernel is the
Gaussian kernel:
    K(ak, al) := exp(−‖ak − al‖²/(2σ)),
where σ is a positive parameter.
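To make the kernel trick concrete, here is a sketch (synthetic data of our own) of forming the matrix Q of (2.9.8) from the Gaussian kernel alone, with no explicit feature map ζ:

```python
import numpy as np

def gaussian_kernel_Q(A, y, sigma):
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a^T b.
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    K = np.exp(-d2 / (2.0 * sigma))            # K(a_k, a_l) as defined above
    return (y[:, None] * y[None, :]) * K       # Q_kl = y_k y_l K(a_k, a_l)

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 3))
y = rng.choice([-1.0, 1.0], size=6)
Q = gaussian_kernel_Q(A, y, sigma=1.0)
print(np.allclose(np.diag(Q), 1.0))            # K(a, a) = 1 and y_k^2 = 1
```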
2.10. Logistic Regression Logistic regression can be viewed as a variant of bi-
nary support-vector machine classification, in which rather than the classification
function φ giving an unqualified prediction of the class in which a new data vector
a lies, it returns an estimate of the odds of a belonging to one class or the other.
We seek an “odds function” p parametrized by a vector x ∈ Rn as follows:
(2.10.1)    p(a; x) := (1 + exp(aT x))⁻¹,
and aim to choose the parameter x so that
(2.10.2a)    p(aj; x) ≈ 1  when yj = +1;
(2.10.2b)    p(aj; x) ≈ 0  when yj = −1.
(Note the similarity to (2.9.1).) The optimal value of x can be found by maximizing
a log-likelihood function:
(2.10.3)    L(x) := (1/m) [ ∑_{j:yj=−1} log(1 − p(aj; x)) + ∑_{j:yj=1} log p(aj; x) ].
We can perform feature selection using this model by introducing a regularizer
λ‖x‖₁, as follows:

(2.10.4)    max_x (1/m) [ ∑_{j:yj=−1} log(1 − p(aj; x)) + ∑_{j:yj=1} log p(aj; x) ] − λ‖x‖₁,
where λ > 0 is a regularization parameter. (Note that we subtract rather than add
the regularization term λ‖x‖₁ to the objective, because this problem is formulated
as a maximization rather than a minimization.) As we see later, this term has
the effect of producing a solution in which few components of x are nonzero,
making it possible to evaluate p(a; x) by knowing only those components of a
that correspond to the nonzeros in x.
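A small sketch (one-dimensional toy data of our own) evaluating (2.10.3) with p as defined in (2.10.1); a well-chosen x drives the average log-likelihood toward its maximum value of zero:

```python
import numpy as np

def p(A, x):
    # Odds function (2.10.1), applied rowwise: p(a; x) = 1 / (1 + exp(a^T x)).
    return 1.0 / (1.0 + np.exp(A @ x))

def log_likelihood(A, y, x):
    probs = p(A, x)
    pos = np.log(probs[y == 1])                # terms with y_j = +1
    neg = np.log(1.0 - probs[y == -1])         # terms with y_j = -1
    return (pos.sum() + neg.sum()) / len(y)

A = np.array([[-4.0], [-5.0], [4.0], [5.0]])   # 1-d features
y = np.array([1, 1, -1, -1])
x = np.array([2.0])                            # a^T x << 0 on the +1 class
print(log_likelihood(A, y, x))                 # close to zero (from below)
```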
An important extension of this technique is to multiclass (or multinomial) lo-
gistic regression, in which the data vectors aj belong to more than two classes.
Such applications are common in modern data analysis. For example, in a speech
recognition system, the M classes could each represent a phoneme of speech, one
of the potentially thousands of distinct elementary sounds that can be uttered by
humans in a few tens of milliseconds. A multinomial logistic regression problem
requires a distinct odds function pk for each class k ∈ {1, 2, . . . , M}. These func-
tions are parametrized by vectors x[k] ∈ Rn , k = 1, 2, . . . , M, defined as follows:
(2.10.5)    pk(a; X) := exp(aT x[k]) / ∑_{l=1}^M exp(aT x[l]),    k = 1, 2, . . . , M,
where we define X := {x[k] | k = 1, 2, . . . , M}. Note that for all a and for all
k = 1, 2, . . . , M, we have pk(a) ∈ (0, 1) and also ∑_{k=1}^M pk(a) = 1. The operation
in (2.10.5) is referred to as a “softmax” on the quantities {aT x[l] | l = 1, 2, . . . , M}.
If one of these inner products dominates the others, that is, aT x[k] ≫ aT x[l] for
all l ≠ k, the formula (2.10.5) will yield pk(a; X) ≈ 1 and pl(a; X) ≈ 0 for all l ≠ k.
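In floating point, (2.10.5) is usually evaluated after subtracting the largest inner product, which cancels in the ratio; a sketch:

```python
import numpy as np

def softmax(scores):
    # scores[k] plays the role of a^T x_[k]; shifting by max(scores) leaves
    # the ratio in (2.10.5) unchanged but avoids overflow in exp.
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([5.0, 1.0, -2.0])
pk = softmax(scores)
print(np.isclose(pk.sum(), 1.0), int(pk.argmax()))   # True 0
```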
In the setting of multiclass logistic regression, the labels yj are vectors in RM ,
whose elements are defined as follows:
(2.10.6)    yjk = 1 when aj belongs to class k, and yjk = 0 otherwise.
Similarly to (2.10.2), we seek to define the vectors x[k] so that
(2.10.7a)    pk(aj; X) ≈ 1  when yjk = 1;
(2.10.7b)    pk(aj; X) ≈ 0  when yjk = 0.
The problem of finding values of x[k] that satisfy these conditions can again be
formulated as one of maximizing a log-likelihood:
(2.10.8)    L(X) := (1/m) ∑_{j=1}^m [ ∑_{ℓ=1}^M yjℓ (x[ℓ]T aj) − log ∑_{ℓ=1}^M exp(x[ℓ]T aj) ].
“Group-sparse” regularization terms can be included in this formulation to select
a set of features in the vectors aj, common to each class, that distinguish
effectively between the classes.
2.11. Deep Learning Deep neural networks are often designed to perform the
same function as multiclass logistic regression, that is, to classify a data vector a
into one of M possible classes, where M ⩾ 2 is large in some key applications.
The difference is that the data vector a undergoes a series of structured transfor-
mations before being passed through a multiclass logistic regression classifier of
the type described in the previous subsection.
The simple neural network shown in Figure 2.11.1 illustrates the basic ideas.
In this figure, the data vector aj enters at the bottom of the network, each node in
the bottom layer corresponding to one component of aj . The vector then moves
upward through the network, undergoing a structured nonlinear transformation
as it moves from one layer to the next. A typical form of this transformation,
which converts the vector a_j^{l−1} at layer l − 1 to input vector a_j^l at layer l, is

    a_j^l = σ(W^l a_j^{l−1} + g^l),    l = 1, 2, . . . , D,
Figure 2.11.1. A deep neural network, showing connections between adjacent
layers. (Layers from bottom to top: input nodes, hidden layers, output nodes.)
where W^l is a matrix of dimension |a_j^l| × |a_j^{l−1}|, g^l is a vector of length |a_j^l|, σ
is a componentwise nonlinear transformation, and D is the number of hidden layers,
defined as the layers situated strictly between the bottom and top layers. Each
arc in Figure 2.11.1 represents one of the elements of a transformation matrix W l .
We define a_j^0 to be the “raw” input vector aj, and let a_j^D be the vector formed
by the nodes at the topmost hidden layer in Figure 2.11.1. Typical forms of the
function σ include the following, acting identically on each component t ∈ R of
its input vector:
• Logistic function: t → 1/(1 + e−t );
• Hinge loss: t → max(t, 0);
• Bernoulli: a random function that outputs 1 with probability 1/(1 + e−t )
and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of
each node corresponds to the odds of the input vector belonging to each class. As
mentioned, the “softmax” operator is typically used to convert the transformed
input vector in the second-top layer (layer D) to a set of odds at the top layer. As-
sociated with each input vector aj are labels yjk , defined as in (2.10.6) to indicate
which of the M classes that aj belongs to.
The parameters in this neural network are the matrix-vector pairs (W^l, g^l),
l = 1, 2, . . . , D that transform the input vector aj into its form a_j^D at the topmost
hidden layer, together with the parameters X of the multiclass logistic regression
hidden layer, together with the parameters X of the multiclass logistic regression
operation that takes place at the very top stage, where X is defined exactly as
in the discussion of Section 2.10. We aim to choose all these parameters so that
the network does a good job on classifying the training data correctly. Using the
notation w for the hidden layer transformations, that is,
(2.11.2) w := (W 1 , g1 , W 2 , g2 , . . . , W D , gD ),
Stephen J. Wright 63
and defining X := {x_{[k]} | k = 1, 2, . . . , M} as in Section 2.10, we can write the loss function for deep learning as follows:

(2.11.3)   L(w, X) := −(1/m) Σ_{j=1}^m [ Σ_{ℓ=1}^M y_{jℓ} x_{[ℓ]}^T a^D_j(w) − log ( Σ_{ℓ=1}^M exp(x_{[ℓ]}^T a^D_j(w)) ) ].
Note that this is exactly the function (2.10.8) applied to the output of the top hidden layer a^D_j(w). We write a^D_j(w) to make explicit the dependence of a^D_j on the parameters w of (2.11.2), as well as on the input vector a_j. (We can view multiclass logistic regression (2.10.8) as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a^D_j = a_j, j = 1, 2, . . . , m.)
Neural networks in use for particular applications (in image recognition and
speech recognition, for example, where they have been very successful) include
many variants on the basic design above. These include restricted connectivity between layers (that is, enforcing structure on the matrices W^l, l = 1, 2, . . . , D), layer arrangements more complex than the linear layout illustrated in Figure 2.11.1 (with outputs coming from different levels), connections across nonadjacent layers, different componentwise transformations σ at different layers, and so on. Deep neural networks for practical applications are highly engineered
objects.
The loss function (2.11.3) shares with many other applications the “summation”
form (2.1.2), but it has several features that set it apart from the other applications
discussed above. First, and possibly most important, it is nonconvex in the param-
eters w. There is reason to believe that the “landscape” of L is complex, with the
global minimizer being exceedingly difficult to find. Second, the total number
of parameters in (w, X) is usually very large. The most popular algorithms for
minimizing (2.11.3) are of stochastic gradient type, which, like most optimization methods, come with no guarantee of finding the minimizer of a nonconvex function. Effective training of deep learning classifiers typically requires a great deal
of data and computation power. Huge clusters of powerful computers, often us-
ing multicore processors, GPUs, and even specially architected processing units,
are devoted to this task. Efficiency also requires many heuristics in the formula-
tion and the algorithm (for example, in the choice of regularization functions and
in the steplengths for stochastic gradient).

3. Preliminaries
We discuss here some foundations for the analysis of subsequent sections.
These include useful facts about smooth and nonsmooth convex functions, Tay-
lor’s theorem and some of its consequences, optimality conditions, and proximal
operators.
In the discussion of this section, our basic assumption is that f is a mapping
from Rn to R ∪ {+∞}, continuous on its effective domain D := {x | f(x) < ∞}.
Further assumptions on f are introduced as needed.
3.1. Solutions Consider the problem of minimizing f (1.0.1). We have the following terminology:
• x∗ is a local minimizer of f if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N.
• x∗ is a global minimizer of f if f(x) ≥ f(x∗) for all x ∈ R^n.
• x∗ is a strict local minimizer if it is a local minimizer on some neighborhood N and in addition f(x) > f(x∗) for all x ∈ N with x ≠ x∗.
• x∗ is an isolated local minimizer if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N and, in addition, N contains no local minimizers other than x∗.
3.2. Convexity and Subgradients A convex set Ω ⊂ Rn has the property that
(3.2.1) x, y ∈ Ω ⇒ (1 − α)x + αy ∈ Ω for all α ∈ [0, 1].
We usually deal with closed convex sets in this article. For a convex set Ω ⊂ Rn
we define the indicator function IΩ(x) as follows:

    IΩ(x) = 0 if x ∈ Ω,   IΩ(x) = +∞ otherwise.
Indicator functions are useful devices for deriving optimality conditions for constrained problems, and even for developing algorithms. The constrained optimization problem

(3.2.2)   min_{x∈Ω} f(x)

can be restated equivalently as follows:

(3.2.3)   min_x f(x) + IΩ(x).
We noted already that a convex function φ : Rn → R ∪ {+∞} has the following
defining property:
(3.2.4)   φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y),  for all x, y ∈ R^n and all α ∈ [0, 1].
The concepts of “minimizer” are simpler in the case of convex objective func-
tions than in the general case. In particular, the distinction between “local” and
“global” minimizers disappears. For f convex in (1.0.1), we have the following.
(a) Any local minimizer of (1.0.1) is also a global minimizer.
(b) The set of global minimizers of (1.0.1) is a convex set.
If there exists a value γ > 0 such that

(3.2.5)   φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y) − (1/2) γ α(1 − α) ‖x − y‖₂²

for all x and y in the domain of φ and α ∈ [0, 1], we say that φ is strongly convex with modulus of convexity γ.
We summarize some definitions and results about subgradients of convex func-
tions here. For a more extensive discussion, see [22].
Definition 3.2.6. A vector v ∈ R^n is a subgradient of f at a point x if

    f(x + d) ≥ f(x) + v^T d,  for all d ∈ R^n.

The subdifferential, denoted ∂f(x), is the set of all subgradients of f at x.

Subdifferentials satisfy a monotonicity property, as we show now.

Lemma 3.2.7. If a ∈ ∂f(x) and b ∈ ∂f(y), we have (a − b)^T (x − y) ≥ 0.

Proof. From the convexity of f and the definitions of a and b, we deduce that f(y) ≥ f(x) + a^T (y − x) and f(x) ≥ f(y) + b^T (x − y). The result follows by adding these two inequalities. □
We can easily characterize a minimum in terms of the subdifferential.

Theorem 3.2.8. The point x∗ is the minimizer of a convex function f if and only if
0 ∈ ∂f(x∗ ).

Proof. Suppose that 0 ∈ ∂f(x∗). By substituting x = x∗ and v = 0 into Definition 3.2.6, we have f(x∗ + d) ≥ f(x∗) for all d ∈ R^n, which implies that x∗ is a minimizer of f.
The converse follows trivially by showing that v = 0 satisfies Definition 3.2.6 when x∗ is a minimizer. □
The subdifferential is the generalization to nonsmooth convex functions of the
concept of derivative of a smooth function.

Theorem 3.2.9. If f is convex and differentiable at x, then ∂f(x) = {∇f(x)}.

A converse of this result is also true. Specifically, if the subdifferential of a convex function f at x contains a single subgradient, then f is differentiable with gradient equal to this subgradient (see [40, Theorem 25.1]).
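As a simple example, for f(x) = |x| on R the subdifferential at 0 is ∂f(0) = [−1, 1]. The following Python sketch (the helper name and grid are our own choices) checks Definition 3.2.6 on a finite grid of displacements:

```python
def is_subgradient(f, x, v, displacements):
    # Check the subgradient inequality f(x + d) >= f(x) + v*d on a
    # finite sample of displacements d (a necessary check only).
    return all(f(x + d) >= f(x) + v * d for d in displacements)

f = abs
ds = [k / 10.0 for k in range(-50, 51)]
# Every v in [-1, 1] passes the inequality at x = 0 ...
inside_ok = all(is_subgradient(f, 0.0, v, ds) for v in (-1.0, -0.5, 0.0, 0.5, 1.0))
# ... while v = 1.5 lies outside the subdifferential and fails.
outside_fails = not is_subgradient(f, 0.0, 1.5, ds)
```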
3.3. Taylor’s Theorem Taylor’s theorem is a foundational result for optimization
of smooth nonlinear functions. It shows how smooth functions can be approxi-
mated locally by low-order (linear or quadratic) functions.

Theorem 3.3.1. Given a continuously differentiable function f : R^n → R, and given x, p ∈ R^n, we have that

(3.3.2)   f(x + p) = f(x) + ∫₀¹ ∇f(x + ξp)^T p dξ,
(3.3.3)   f(x + p) = f(x) + ∇f(x + ξp)^T p,  for some ξ ∈ (0, 1).

If f is twice continuously differentiable, we have

(3.3.4)   ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + ξp) p dξ,
(3.3.5)   f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇²f(x + ξp) p,  for some ξ ∈ (0, 1).
We can derive an important consequence of this theorem when f is Lipschitz continuously differentiable with constant L, that is,

(3.3.6)   ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,  for all x, y ∈ R^n.

By setting y = x + p in (3.3.2) and subtracting the term ∇f(x)^T (y − x) from both sides, we have

    f(y) − f(x) − ∇f(x)^T (y − x) = ∫₀¹ [∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) dξ.

By using (3.3.6), we have

    [∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) ≤ ‖∇f(x + ξ(y − x)) − ∇f(x)‖ ‖y − x‖ ≤ Lξ ‖y − x‖².

By substituting this bound into the previous integral, we obtain

(3.3.7)   f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ‖y − x‖².
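The bound (3.3.7) is easy to check numerically. For the quadratic f(x) = (1/2) x^T A x with A symmetric positive definite (our own test problem), ∇f(x) = Ax is Lipschitz with L = λ_max(A), and the following sketch verifies the inequality at random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)          # symmetric positive definite
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of grad f

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

violations = 0
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # Check f(y) - f(x) - grad(x)^T (y - x) <= (L/2) ||y - x||^2.
    lhs = f(y) - f(x) - grad(x) @ (y - x)
    rhs = 0.5 * L * np.dot(y - x, y - x)
    if lhs > rhs + 1e-9:
        violations += 1
```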
For the remainder of Section 3.3, we assume that f is continuously differentiable and also convex. The definition of convexity (3.2.4) and the fact that ∂f(x) = {∇f(x)} imply that

(3.3.8)   f(y) ≥ f(x) + ∇f(x)^T (y − x),  for all x, y ∈ R^n.

We defined "strong convexity with modulus γ" in (3.2.5). When f is differentiable, we have the following equivalent definition, obtained by rearranging (3.2.5) and letting α ↓ 0:

(3.3.9)   f(y) ≥ f(x) + ∇f(x)^T (y − x) + (γ/2) ‖y − x‖².
By combining this expression with (3.3.7), we have the following result.

Lemma 3.3.10. Given convex f satisfying (3.2.5), with ∇f uniformly Lipschitz continuous with constant L, we have for any x, y that

(3.3.11)   (γ/2) ‖y − x‖² ≤ f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ‖y − x‖².

For later convenience, we define a condition number κ as follows:

(3.3.12)   κ := L/γ.

When f is twice continuously differentiable, we can characterize the constants γ and L in terms of the eigenvalues of the Hessian ∇²f(x). Specifically, we can show that (3.3.11) is equivalent to

(3.3.13)   γI ⪯ ∇²f(x) ⪯ LI,  for all x.

When f is strictly convex and quadratic, κ defined in (3.3.12) is the condition number of the (constant) Hessian, in the usual sense of linear algebra.
Strongly convex functions have unique minimizers, as we now show.

Theorem 3.3.14. Let f be differentiable and strongly convex with modulus γ > 0. Then
the minimizer x∗ of f exists and is unique.
Proof. We show first that for any point x^0, the level set {x | f(x) ≤ f(x^0)} is closed and bounded, and hence compact. Suppose for contradiction that there is a sequence {x^ℓ} such that ‖x^ℓ‖ → ∞ and

(3.3.15)   f(x^ℓ) ≤ f(x^0).

By strong convexity of f, we have for some γ > 0 that

    f(x^ℓ) ≥ f(x^0) + ∇f(x^0)^T (x^ℓ − x^0) + (γ/2) ‖x^ℓ − x^0‖².

By rearranging slightly, and using (3.3.15), we obtain

    (γ/2) ‖x^ℓ − x^0‖² ≤ −∇f(x^0)^T (x^ℓ − x^0) ≤ ‖∇f(x^0)‖ ‖x^ℓ − x^0‖.

By dividing both sides by (γ/2) ‖x^ℓ − x^0‖, we obtain ‖x^ℓ − x^0‖ ≤ (2/γ) ‖∇f(x^0)‖ for all ℓ, which contradicts unboundedness of {x^ℓ}. Thus, the level set is bounded. Since it is also closed (by continuity of f), it is compact.

Since f is continuous, it attains its minimum on the compact level set, which is also the solution of min_x f(x), and we denote it by x∗. Suppose for contradiction that the minimizer is not unique, so that we have two points x∗_1 and x∗_2 that minimize f. Obviously, these points must attain equal objective values, so that f(x∗_1) = f(x∗_2) = f∗ for some f∗. By taking (3.2.5) and setting φ = f, x = x∗_1, y = x∗_2, and α = 1/2, we obtain

    f((x∗_1 + x∗_2)/2) ≤ (1/2) (f(x∗_1) + f(x∗_2)) − (γ/8) ‖x∗_1 − x∗_2‖² < f∗,

so the point (x∗_1 + x∗_2)/2 has a smaller function value than both x∗_1 and x∗_2, contradicting our assumption that x∗_1 and x∗_2 are both minimizers. Hence, the minimizer x∗ is unique. □
3.4. Optimality Conditions for Smooth Functions We consider the case of a
smooth (twice continuously differentiable) function f that is not necessarily con-
vex. Before designing algorithms to find a minimizer of f, we need to identify
properties of f and its derivatives at a point x̄ that tell us whether or not x̄ is a
minimizer, of one of the types described in Subsection 3.1. We call such properties
optimality conditions.
A first-order necessary condition for optimality is that ∇f(x̄) = 0. More precisely, if x̄ is a local minimizer, then ∇f(x̄) = 0. We can prove this by using Taylor's theorem. Supposing for contradiction that ∇f(x̄) ≠ 0, we can show by setting x = x̄ and p = −α∇f(x̄) for α > 0 in (3.3.3) that f(x̄ − α∇f(x̄)) < f(x̄) for all α > 0 sufficiently small. Thus any neighborhood of x̄ contains points x with f(x) < f(x̄), so x̄ cannot be a local minimizer.
If f is convex, as well as smooth, the condition ∇f(x̄) = 0 is sufficient for x̄ to be
a global solution. This claim follows immediately from Theorems 3.2.8 and 3.2.9.
A second-order necessary condition for x̄ to be a local solution is that ∇f(x̄) = 0
and ∇2 f(x̄) is positive semidefinite. The proof is by an argument similar to that
of the first-order necessary condition, but using the second-order Taylor series ex-
pansion (3.3.5) instead of (3.3.3). A second-order sufficient condition is that ∇f(x̄) = 0
and ∇2 f(x̄) is positive definite. This condition guarantees that x̄ is a strict local
minimizer, that is, there is a neighborhood of x̄ such that x̄ has a strictly smaller
function value than all other points in this neighborhood. Again, the proof makes
use of (3.3.5).
We call x̄ a stationary point for smooth f if it satisfies the first-order necessary
condition ∇f(x̄) = 0. Stationary points are not necessarily local minimizers. In
fact, local maximizers satisfy the same condition. More interestingly, stationary
points can be saddle points. These are points for which there exist directions u
and v such that f(x̄ + αu) < f(x̄) and f(x̄ + αv) > f(x̄) for all positive α suffi-
ciently small. When the Hessian ∇2 f(x̄) has both strictly positive and strictly
negative eigenvalues, it follows from (3.3.5) that x̄ is a saddle point. When ∇2 f(x̄)
is positive semidefinite or negative semidefinite, second derivatives alone are in-
sufficient to classify x̄; higher-order derivative information is needed.
3.5. Proximal Operators and the Moreau Envelope Here we present some tools for analyzing the convergence of algorithms for the regularized problem (1.0.2), where the objective is the sum of a smooth function and a convex (usually nonsmooth) function.
We start with a formal definition.

Definition 3.5.1. For a closed proper convex function h and a positive scalar λ, the Moreau envelope is

(3.5.2)   M_{λ,h}(x) := inf_u { h(u) + (1/(2λ)) ‖u − x‖² } = (1/λ) inf_u { λh(u) + (1/2) ‖u − x‖² }.

The proximal operator of the function λh is the value of u that achieves the infimum in (3.5.2), that is,

(3.5.3)   prox_{λh}(x) := arg min_u { λh(u) + (1/2) ‖u − x‖² }.

From optimality properties for (3.5.3) (see Theorem 3.2.8), we have

(3.5.4)   0 ∈ λ ∂h(prox_{λh}(x)) + (prox_{λh}(x) − x).
The Moreau envelope can be viewed as a kind of smoothing or regularization
of the function h. It has a finite value for all x, even when h takes on infinite
values for some x ∈ Rn . In fact, it is differentiable everywhere, with gradient
    ∇M_{λ,h}(x) = (1/λ) (x − prox_{λh}(x)).
Moreover, x∗ is a minimizer of h if and only if it is a minimizer of Mλ,h .
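These properties can be checked numerically. For h = ‖·‖₁ the proximal operator is the soft-thresholding map (3.5.5), derived later in this section, and the gradient formula above can be compared against central finite differences on the envelope (the parameter values here are our own illustrative choices):

```python
import numpy as np

def prox_l1(x, lam):
    # Soft-thresholding (3.5.5): the prox operator of lam * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_l1(x, lam):
    # M_{lam,h}(x) = h(u) + ||u - x||^2 / (2 lam) at u = prox_{lam h}(x).
    u = prox_l1(x, lam)
    return np.sum(np.abs(u)) + np.dot(u - x, u - x) / (2.0 * lam)

lam = 0.3
x = np.array([1.0, -0.2, 0.05, -2.5])
analytic = (x - prox_l1(x, lam)) / lam   # (1/lam)(x - prox(x))

# Central finite differences on the (smooth) envelope.
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    e = np.zeros_like(x)
    e[i] = eps
    numeric[i] = (moreau_l1(x + e, lam) - moreau_l1(x - e, lam)) / (2.0 * eps)
```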
The proximal operator satisfies a nonexpansiveness property. From the optimality conditions (3.5.4) at two points x and y, we have

    x − prox_{λh}(x) ∈ λ ∂h(prox_{λh}(x)),   y − prox_{λh}(y) ∈ λ ∂h(prox_{λh}(y)).

By applying monotonicity (Lemma 3.2.7), we have

    (1/λ) [(x − prox_{λh}(x)) − (y − prox_{λh}(y))]^T (prox_{λh}(x) − prox_{λh}(y)) ≥ 0.
Rearranging this and applying the Cauchy-Schwarz inequality yields

    ‖prox_{λh}(x) − prox_{λh}(y)‖² ≤ (x − y)^T (prox_{λh}(x) − prox_{λh}(y)) ≤ ‖x − y‖ ‖prox_{λh}(x) − prox_{λh}(y)‖,

from which we obtain ‖prox_{λh}(x) − prox_{λh}(y)‖ ≤ ‖x − y‖, as claimed.
We list the prox operator for several instances of h that are common in data
analysis applications. These definitions are useful in implementing the prox-
gradient algorithms of Section 5.
• h(x) = 0 for all x, for which we have prox_{λh}(x) = x. (This observation is useful in proving that the prox-gradient method reduces to the familiar steepest descent method when the objective contains no regularization term.)
• h(x) = IΩ(x), the indicator function for a closed convex set Ω. In this case, we have for any λ > 0 that

    prox_{λIΩ}(x) = arg min_u { λIΩ(u) + (1/2) ‖u − x‖² } = arg min_{u∈Ω} (1/2) ‖u − x‖²,

which is simply the projection of x onto the set Ω.
• h(x) = ‖x‖₁. By substituting into definition (3.5.3), we see that the minimization separates into its n components, and that the ith component of prox_{λ‖·‖₁}(x) is

    [prox_{λ‖·‖₁}(x)]_i = arg min_{u_i} { λ|u_i| + (1/2)(u_i − x_i)² }.

We can thus verify that

(3.5.5)   [prox_{λ‖·‖₁}(x)]_i = x_i − λ if x_i > λ;  0 if x_i ∈ [−λ, λ];  x_i + λ if x_i < −λ,

an operation that is known as soft-thresholding.
• h(x) = ‖x‖₀, where ‖x‖₀ denotes the cardinality of the vector x, its number of nonzero components. Although this h is not a convex function (as we can see by considering convex combinations of the vectors (0, 1)^T and (1, 0)^T in R²), its proximal operator is well defined, and is known as hard thresholding:

    [prox_{λ‖·‖₀}(x)]_i = x_i if |x_i| ≥ √(2λ);  0 if |x_i| < √(2λ).

As in (3.5.5), the definition (3.5.3) separates into n individual components.
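These prox operators are one-liners in practice. The following Python sketch implements soft-thresholding (3.5.5), hard thresholding, and, for the indicator case, projection onto a box (the box and the test vector are our own examples):

```python
import numpy as np

def prox_l1(x, lam):
    # (3.5.5): componentwise soft-thresholding for h = ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    # Hard thresholding for h = ||.||_0: keep components with
    # |x_i| >= sqrt(2 lam) and zero out the rest.
    return np.where(np.abs(x) >= np.sqrt(2.0 * lam), x, 0.0)

def prox_box(x, lo, hi):
    # For h = indicator of the box [lo, hi]^n, the prox is projection.
    return np.clip(x, lo, hi)

x = np.array([3.0, -0.5, 1.2, -2.0])
```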
3.6. Convergence Rates An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error. For smooth f, we may be interested in how rapidly the sequence of gradient norms {‖∇f(x^k)‖} converges to zero. For nonsmooth convex f, a measure of interest may be convergence to
zero of {dist(0, ∂f(x^k))} (the sequence of distances from 0 to the subdifferential ∂f(x^k)). Other error measures for which we may be able to prove convergence rates include ‖x^k − x∗‖ (where x∗ is a solution) and f(x^k) − f∗ (where f∗ is the optimal value of the objective function f). For generality, we denote by {φ_k} the sequence of nonnegative scalars whose rate of convergence to 0 we wish to find.
We say that linear convergence holds if there is some σ ∈ (0, 1) such that

(3.6.1)   φ_{k+1}/φ_k ≤ 1 − σ,  for all k sufficiently large.

(This property is sometimes also called geometric or exponential convergence, but the term linear is standard in the optimization literature, so we use it here.) It follows from (3.6.1) that there is some positive constant C such that

(3.6.2)   φ_k ≤ C(1 − σ)^k,  k = 1, 2, . . . .

While (3.6.1) implies (3.6.2), the converse does not hold. The sequence

    φ_k = 2^{−k} (k even),   φ_k = 0 (k odd)

satisfies (3.6.2) with C = 1 and σ = 0.5, but does not satisfy (3.6.1). To distinguish between these two slightly different definitions, (3.6.1) is sometimes called Q-linear while (3.6.2) is called R-linear.
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence are encountered in optimization algorithms for data analysis, including the following:

(3.6.3a)   φ_k ≤ C/√k,  k = 1, 2, . . . ,
(3.6.3b)   φ_k ≤ C/k,  k = 1, 2, . . . ,
(3.6.3c)   φ_k ≤ C/k²,  k = 1, 2, . . . ,

where in each case, C is some positive constant.
Superlinear convergence occurs when the constant σ ∈ (0, 1) in (3.6.1) can be chosen arbitrarily close to 1. Specifically, we say that the sequence {φ_k} converges Q-superlinearly to 0 if

(3.6.4)   lim_{k→∞} φ_{k+1}/φ_k = 0.

Q-quadratic convergence occurs when

(3.6.5)   φ_{k+1}/φ_k² ≤ C,  k = 1, 2, . . . ,

for some sufficiently large C. We say that the convergence is R-superlinear if there is a Q-superlinearly convergent sequence {ν_k} that dominates {φ_k} (that is, 0 ≤ φ_k ≤ ν_k for all k). R-quadratic convergence is defined similarly. Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.
When a convergence rate applies globally, from any reasonable starting point,
it can be used to derive a complexity bound for the algorithm, which takes the
form of a bound on the number of iterations K required to reduce φ_k below some specified tolerance ε. For a sequence satisfying the R-linear convergence condition (3.6.2), a sufficient condition for φ_K ≤ ε is C(1 − σ)^K ≤ ε. By using the estimate log(1 − σ) ≤ −σ for all σ ∈ (0, 1), we have that

    C(1 − σ)^K ≤ ε  ⇔  K log(1 − σ) ≤ log(ε/C)  ⇐  K ≥ log(C/ε)/σ.

It follows that for linearly convergent algorithms, the number of iterations required to converge to a tolerance ε depends logarithmically on 1/ε and inversely on the rate constant σ. For an algorithm that satisfies the sublinear rate (3.6.3a), a sufficient condition for φ_K ≤ ε is C/√K ≤ ε, which is equivalent to K ≥ (C/ε)², so the complexity is O(1/ε²). Similar analyses for (3.6.3b) reveal complexity of O(1/ε), while for (3.6.3c), we have complexity O(1/√ε).

For quadratically convergent methods, the complexity is doubly logarithmic in ε (that is, O(log log(1/ε))). Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.
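These complexity bounds translate directly into iteration counts. A small sketch (our own arithmetic, with C = 1 and σ = 0.1 as illustrative values) compares the linear and sublinear regimes:

```python
import math

def iters_linear(C, sigma, eps):
    # K >= log(C/eps)/sigma suffices for C (1 - sigma)^K <= eps.
    return math.ceil(math.log(C / eps) / sigma)

def iters_sqrt(C, eps):
    # K >= (C/eps)^2 suffices for C / sqrt(K) <= eps.
    return math.ceil((C / eps) ** 2)

C, eps = 1.0, 1e-6
K_lin = iters_linear(C, 0.1, eps)
K_sub = iters_sqrt(C, eps)

# Verify that these iteration counts do drive the error below eps.
lin_ok = C * (1.0 - 0.1) ** K_lin <= eps
sub_ok = C / math.sqrt(K_sub) <= eps
```

With these numbers, K_lin is a few hundred while K_sub is on the order of 10^12, illustrating the gap between log(1/ε) and 1/ε² dependence.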

4. Gradient Methods
We consider here iterative methods for solving the unconstrained smooth problem (1.0.1) that make use of the gradient ∇f (see also [22], which describes subgradient methods for nonsmooth convex functions). We consider mostly methods that generate an iteration sequence {x^k} via the formula

(4.0.1)   x^{k+1} = x^k + α_k d^k,

where d^k is the search direction and α_k is a steplength.
We consider the steepest descent method, which searches along the negative
gradient direction dk = −∇f(xk ), proving convergence results for nonconvex
functions, convex functions, and strongly convex functions. In Subsection 4.5, we
consider methods that use more general descent directions dk , proving conver-
gence of methods that make careful choices of the line search parameter αk at
each iteration. In Subsection 4.6, we consider the conditional gradient method for
minimization of a smooth function f over a compact set.
4.1. Steepest Descent The simplest stepsize protocol is the short-step variant
of steepest descent. We assume here that f is differentiable, with gradient ∇f
satisfying the Lipschitz continuity condition (3.3.6) with constant L. We choose
the search direction dk = −∇f(xk ) in (4.0.1), and set the steplength αk to be the
constant 1/L, to obtain the iteration
(4.1.1)   x^{k+1} = x^k − (1/L) ∇f(x^k),  k = 0, 1, 2, . . . .
To estimate the amount of decrease in f obtained at each iterate of this method,
we use Taylor’s theorem. From (3.3.7), we have
(4.1.2)   f(x + αd) ≤ f(x) + α ∇f(x)^T d + (L/2) α² ‖d‖².
For x = xk and d = −∇f(xk ), the value of α that minimizes the expression on the
right-hand side is α = 1/L. By substituting these values, we obtain
(4.1.3)   f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖².
This expression is one of the foundational inequalities in the analysis of optimiza-
tion methods. Depending on the assumptions about f, we can derive a variety of
different convergence rates from this basic inequality.
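For a strongly convex quadratic f(x) = (1/2) x^T A x − b^T x (our own test problem), where ∇f(x) = Ax − b and L = λ_max(A), the per-iteration guarantee (4.1.3) can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = M.T @ M + np.eye(6)          # symmetric positive definite
b = rng.standard_normal(6)
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of grad f

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x0 = rng.standard_normal(6)
x = x0.copy()
guarantee_holds = True
for _ in range(200):
    g = grad(x)
    x_new = x - g / L            # short-step iteration (4.1.1)
    # Check (4.1.3): f(x_new) <= f(x) - ||g||^2 / (2L).
    if f(x_new) > f(x) - np.dot(g, g) / (2.0 * L) + 1e-9:
        guarantee_holds = False
    x = x_new
```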
4.2. General Case We consider first a function f that is Lipschitz continuously
differentiable and bounded below, but that need not necessarily be convex. Using
(4.1.3) alone, we can prove a sublinear convergence result for the steepest descent
method.
Theorem 4.2.1. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6), and that f is bounded below by a constant f̄. Then for the steepest descent method with constant steplength α_k ≡ 1/L, applied from a starting point x^0, we have for any integer T ≥ 1 that

    min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √( 2L[f(x^0) − f(x^T)] / T ) ≤ √( 2L[f(x^0) − f̄] / T ).
Proof. Rearranging (4.1.3) and summing for k = 0, 1, . . . , T − 1, we have

(4.2.2)   Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ≤ 2L Σ_{k=0}^{T−1} [f(x^k) − f(x^{k+1})] = 2L[f(x^0) − f(x^T)].

(Note the telescoping sum.) Since f is bounded below by f̄, the right-hand side is bounded above by the constant 2L[f(x^0) − f̄]. We also have that

    min_{0≤k≤T−1} ‖∇f(x^k)‖ = √( min_{0≤k≤T−1} ‖∇f(x^k)‖² ) ≤ √( (1/T) Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ).

The result is obtained by combining this bound with (4.2.2). □
This result shows that within the first T − 1 steps of steepest descent, at least one of the iterates has gradient norm less than √(2L[f(x^0) − f̄]/T), which represents sublinear convergence of type (3.6.3a). It follows too from (4.2.2) that, for f bounded below, any accumulation point of the sequence {x^k} is stationary.
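The guarantee of Theorem 4.2.1 can be observed on a one-dimensional nonconvex example (our own choice): f(x) = cos x is bounded below by f̄ = −1 and has f′ Lipschitz with constant L = 1.

```python
import math

f = math.cos
fprime = lambda x: -math.sin(x)
L, fbar = 1.0, -1.0

x0, T = 1.0, 100
x, grad_norms = x0, []
for _ in range(T):
    grad_norms.append(abs(fprime(x)))
    x = x - fprime(x) / L        # steepest descent, steplength 1/L

# Bound from Theorem 4.2.1 on the smallest gradient norm seen.
bound = math.sqrt(2.0 * L * (f(x0) - fbar) / T)
```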
4.3. Convex Case When f is also convex, we have the following stronger result
for the steepest descent method.

Theorem 4.3.1. Suppose that f is convex and Lipschitz continuously differentiable, satisfying (3.3.6), and that (1.0.1) has a solution x∗. Then the steepest descent method with stepsize α_k ≡ 1/L generates a sequence {x^k}_{k=0}^∞ that satisfies

(4.3.2)   f(x^T) − f∗ ≤ (L/(2T)) ‖x^0 − x∗‖².
Proof. By convexity of f, we have f(x∗) ≥ f(x^k) + ∇f(x^k)^T (x∗ − x^k), so by substituting into (4.1.3), we obtain for k = 0, 1, 2, . . . that

    f(x^{k+1}) ≤ f(x∗) + ∇f(x^k)^T (x^k − x∗) − (1/(2L)) ‖∇f(x^k)‖²
              = f(x∗) + (L/2) ( ‖x^k − x∗‖² − ‖x^k − x∗ − (1/L)∇f(x^k)‖² )
              = f(x∗) + (L/2) ( ‖x^k − x∗‖² − ‖x^{k+1} − x∗‖² ).

By summing over k = 0, 1, 2, . . . , T − 1, and noting the telescoping sum, we have

    Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/2) Σ_{k=0}^{T−1} ( ‖x^k − x∗‖² − ‖x^{k+1} − x∗‖² )
                                   = (L/2) ( ‖x^0 − x∗‖² − ‖x^T − x∗‖² )
                                   ≤ (L/2) ‖x^0 − x∗‖².

Since {f(x^k)} is a nonincreasing sequence, we have, as required,

    f(x^T) − f(x∗) ≤ (1/T) Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/(2T)) ‖x^0 − x∗‖². □
4.4. Strongly Convex Case Recall that the definition (3.3.9) of strong convexity
shows that f can be bounded below by a quadratic with Hessian γI. A strongly
convex f with L-Lipschitz gradients is also bounded above by a similar quadratic
(see (3.3.7)) differing only in the quadratic term, which becomes LI. From this
“sandwich” effect, we derive a linear convergence rate for the gradient method,
stated formally in the following theorem.
Theorem 4.4.1. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6), and strongly convex, satisfying (3.2.5) with modulus of convexity γ. Then f has a unique minimizer x∗, and the steepest descent method with stepsize α_k ≡ 1/L generates a sequence {x^k}_{k=0}^∞ that satisfies

    f(x^{k+1}) − f(x∗) ≤ (1 − γ/L) (f(x^k) − f(x∗)),  k = 0, 1, 2, . . . .
Proof. Existence of the unique minimizer x∗ follows from Theorem 3.3.14. Minimizing both sides of the inequality (3.3.9) with respect to y, we find that the minimizer on the left side is attained at y = x∗, while on the right side it is attained at x − ∇f(x)/γ. Plugging these optimal values into (3.3.9), we obtain

    min_y f(y) ≥ min_y { f(x) + ∇f(x)^T (y − x) + (γ/2) ‖y − x‖² }
    ⇒ f(x∗) ≥ f(x) − (1/γ) ∇f(x)^T ∇f(x) + (γ/2) ‖(1/γ)∇f(x)‖²
    ⇒ f(x∗) ≥ f(x) − (1/(2γ)) ‖∇f(x)‖².

By rearrangement, we obtain

(4.4.2)   ‖∇f(x)‖² ≥ 2γ [f(x) − f(x∗)].

By substituting (4.4.2) into our basic inequality (4.1.3), we obtain

    f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖² ≤ f(x^k) − (γ/L) (f(x^k) − f∗).

Subtracting f∗ from both sides of this inequality yields the result. □
After T steps, we have

(4.4.3)   f(x^T) − f∗ ≤ (1 − γ/L)^T (f(x^0) − f∗),

which is convergence of type (3.6.2) with constant σ = γ/L.
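The rate (4.4.3) can be checked on a diagonal quadratic (our own example), where γ and L are the extreme eigenvalues of the Hessian:

```python
import numpy as np

A = np.diag([1.0, 4.0, 10.0])    # Hessian; gamma = 1, L = 10
gamma, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x    # minimizer x* = 0 with f* = 0

x = np.array([1.0, 1.0, 1.0])
f0 = f(x)
T = 40
for _ in range(T):
    x = x - (A @ x) / L          # steepest descent with steplength 1/L

# Check f(x^T) - f* <= (1 - gamma/L)^T (f(x^0) - f*).
rate_ok = f(x) <= (1.0 - gamma / L) ** T * f0
```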
4.5. General Case: Line-Search Methods Returning to the case in which f has
Lipschitz continuous gradients but is possibly nonconvex, we consider algorithms
that take steps of the form (4.0.1), where dk is a descent direction, that is, it
makes a positive inner product with the negative gradient −∇f(xk ), so that
∇f(xk )T dk < 0. This condition ensures that f(xk + αdk ) < f(xk ) for sufficiently
small positive values of step length α — we obtain improvement in f by taking
small steps along dk . (This claim follows from (3.3.3).) Line-search methods are
built around this fundamental observation. By introducing additional conditions
on dk and αk , that can be verified in practice with reasonable effort, we can estab-
lish a bound on decrease similar to (4.1.3) on each iteration, and thus a conclusion
similar to that of Theorem 4.2.1.
We assume that d^k satisfies the following for some η > 0:

(4.5.1)   ∇f(x^k)^T d^k ≤ −η ‖∇f(x^k)‖ ‖d^k‖.

For the steplength α_k, we assume the following weak Wolfe conditions hold, for some constants c_1 and c_2 with 0 < c_1 < c_2 < 1:

(4.5.2a)   f(x^k + α_k d^k) ≤ f(x^k) + c_1 α_k ∇f(x^k)^T d^k,
(4.5.2b)   ∇f(x^k + α_k d^k)^T d^k ≥ c_2 ∇f(x^k)^T d^k.
Condition (4.5.2a) is called “sufficient decrease;” it ensures descent at each step of
at least a small fraction c1 of the amount promised by the first-order Taylor-series
expansion (3.3.3). Condition (4.5.2b) ensures that the directional derivative of f
along the search direction dk is significantly less negative at the chosen steplength
αk than at α = 0. This condition ensures that the step is “not too short.” It can
be shown that it is always possible to find αk that satisfies both conditions (4.5.2)
simultaneously.
Line-search procedures, which are specialized optimization procedures for
minimizing functions of one variable, have been devised to find such values effi-
ciently; see [36, Chapter 3] for details.
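One standard construction (a bisection scheme, not spelled out in the text) finds a step satisfying the weak Wolfe conditions (4.5.2): shrink the step when sufficient decrease (4.5.2a) fails, and grow it when the curvature condition (4.5.2b) fails. A minimal sketch, with our own default constants:

```python
import numpy as np

def weak_wolfe(f, grad, x, d, c1=1e-4, c2=0.9, max_iter=100):
    # Bisection search for alpha satisfying (4.5.2a)-(4.5.2b);
    # requires d to be a descent direction: grad(x)^T d < 0.
    lo, hi, alpha = 0.0, np.inf, 1.0
    f0, g0d = f(x), grad(x) @ d
    for _ in range(max_iter):
        if f(x + alpha * d) > f0 + c1 * alpha * g0d:
            hi = alpha                   # (4.5.2a) fails: step too long
        elif grad(x + alpha * d) @ d < c2 * g0d:
            lo = alpha                   # (4.5.2b) fails: step too short
        else:
            return alpha
        alpha = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * lo
    return alpha

# Try it on a simple convex quadratic with the steepest-descent direction.
A = np.diag([1.0, 20.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x = np.array([1.0, 1.0])
d = -grad(x)
alpha = weak_wolfe(f, grad, x, d)
```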
For line-search methods of this type, we have the following generalization of
Theorem 4.2.1.
Theorem 4.5.3. Suppose that f is Lipschitz continuously differentiable, satisfying (3.3.6), and that f is bounded below by a constant f̄. Consider the method that takes steps of the form (4.0.1), where d^k satisfies (4.5.1) for some η > 0 and the conditions (4.5.2) hold at all k, for some constants c_1 and c_2 with 0 < c_1 < c_2 < 1. Then for any integer T ≥ 1, we have

    min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √( (L / (η² c_1 (1 − c_2))) · (f(x^0) − f̄) / T ).
Proof. By combining the Lipschitz property (3.3.6) with (4.5.2b), we have

    −(1 − c_2) ∇f(x^k)^T d^k ≤ [∇f(x^k + α_k d^k) − ∇f(x^k)]^T d^k ≤ L α_k ‖d^k‖².

By comparing the first and last terms in these inequalities, we obtain the following lower bound on α_k:

    α_k ≥ −((1 − c_2)/L) · ∇f(x^k)^T d^k / ‖d^k‖².

By substituting this bound into (4.5.2a), and using (4.5.1) and the step definition (4.0.1), we obtain

(4.5.4)   f(x^{k+1}) = f(x^k + α_k d^k) ≤ f(x^k) + c_1 α_k ∇f(x^k)^T d^k
                     ≤ f(x^k) − (c_1 (1 − c_2)/L) · (∇f(x^k)^T d^k)² / ‖d^k‖²
                     ≤ f(x^k) − (c_1 (1 − c_2)/L) η² ‖∇f(x^k)‖²,

which by rearrangement yields

(4.5.5)   ‖∇f(x^k)‖² ≤ (L / (c_1 (1 − c_2) η²)) [f(x^k) − f(x^{k+1})].

The result now follows as in the proof of Theorem 4.2.1. □
It follows by taking limits on both sides of (4.5.5) (using the fact that {f(x^k)} is decreasing and bounded below, so that f(x^k) − f(x^{k+1}) → 0) that

(4.5.6)   lim_{k→∞} ‖∇f(x^k)‖ = 0,

and therefore all accumulation points x̄ of the sequence {x^k} generated by the algorithm (4.0.1) have ∇f(x̄) = 0. In the case of f convex, this condition guarantees that x̄ is a solution of (1.0.1). When f is nonconvex, x̄ may be a local minimum, but it may also be a saddle point or a local maximum.
The paper [29] uses the stable manifold theorem to show that line-search gradient methods are highly unlikely to converge to stationary points x̄ at which some eigenvalues of the Hessian ∇²f(x̄) are negative. Although it is easy to construct examples for which such bad behavior occurs, it requires special choices of starting point x^0. Possibly the most obvious example is f(x_1, x_2) = x_1² − x_2² starting from x^0 = (1, 0)^T, where d^k = −∇f(x^k) at each k. For this example, all iterates have x^k_2 = 0 and, under appropriate conditions, converge to the saddle point x̄ = 0. Any starting point with x^0_2 ≠ 0 cannot converge to 0; in fact, it is easy to see that x^k_2 diverges away from 0.
4.6. Conditional Gradient Method The conditional gradient approach, often known as "Frank-Wolfe" after the authors who devised it [24], is a method for convex nonlinear optimization over compact convex sets. This is the problem

(4.6.1)   min_{x∈Ω} f(x)

(see earlier discussion around (3.2.2)), where Ω is a compact convex set and f is a convex function whose gradient ∇f is Lipschitz continuous in a neighborhood of Ω, with Lipschitz constant L. We assume that Ω has diameter D, that is, ‖x − y‖ ≤ D for all x, y ∈ Ω.
The conditional gradient method replaces the objective in (4.6.1) at each iter-
ation by a linear Taylor-series approximation around the current iterate xk , and
minimizes this linear objective over the original constraint set Ω. It then takes
a step from xk towards the minimizer of this linearized subproblem. The full
method is as follows:

(4.6.2a)   v^k := arg min_{v∈Ω} v^T ∇f(x^k);
(4.6.2b)   x^{k+1} := x^k + α_k (v^k − x^k),   α_k := 2/(k + 2).
The method has a sublinear convergence rate, as we show below, and indeed
requires many iterations in practice to obtain an accurate solution. Despite this
feature, it makes sense in many interesting applications, because the subproblems
(4.6.2a) can be solved very cheaply in some settings, and because highly accurate
solutions are not required in some applications.
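To make the two-step structure concrete, here is a minimal NumPy sketch of iteration (4.6.2) for the special case where $\Omega$ is the unit probability simplex, so that the linear subproblem (4.6.2a) is solved exactly by a single vertex. The least-squares objective, the choice of $\Omega$, and all names are illustrative assumptions, not from the text.

```python
import numpy as np

def frank_wolfe_simplex(grad, n, iters=500):
    """Conditional gradient (4.6.2) over the unit simplex. The linearized
    subproblem min_{v in simplex} v^T grad(x) is attained at the vertex
    e_i with i = argmin_i [grad(x)]_i, so it costs one argmin."""
    x = np.full(n, 1.0 / n)                  # start at the simplex center
    for k in range(iters):
        g = grad(x)
        v = np.zeros(n)
        v[np.argmin(g)] = 1.0                # solves (4.6.2a) cheaply
        x += (2.0 / (k + 2)) * (v - x)       # step (4.6.2b)
    return x

# Example: f(x) = 0.5 * ||Ax - b||^2 restricted to the simplex.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = frank_wolfe_simplex(lambda x: A.T @ (A @ x - b), n=5)
```

Each iterate is a convex combination of simplex vertices, so feasibility is maintained without any projection; this cheap subproblem is the method's main appeal.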
We have the following result for sublinear convergence of the conditional gra-
dient method.
Theorem 4.6.3. Under the conditions above, where $L$ is the Lipschitz constant for $\nabla f$ on an open neighborhood of $\Omega$ and $D$ is the diameter of $\Omega$, the conditional gradient method (4.6.2) applied to (4.6.1) satisfies

(4.6.4) $$f(x^k) - f(x^*) \le \frac{2LD^2}{k+2}, \qquad k = 1, 2, \dots,$$

where $x^*$ is any solution of (4.6.1).
Proof. Setting $x = x^k$ and $y = x^{k+1} = x^k + \alpha_k (v^k - x^k)$ in (3.3.7), we have

(4.6.5) $$f(x^{k+1}) \le f(x^k) + \alpha_k \nabla f(x^k)^T (v^k - x^k) + \tfrac{1}{2} \alpha_k^2 L \|v^k - x^k\|^2 \le f(x^k) + \alpha_k \nabla f(x^k)^T (v^k - x^k) + \tfrac{1}{2} \alpha_k^2 L D^2,$$

where the second inequality comes from the definition of $D$. For the first-order term, we have, since $v^k$ solves (4.6.2a) and $x^*$ is feasible for (4.6.2a), that

$$\nabla f(x^k)^T (v^k - x^k) \le \nabla f(x^k)^T (x^* - x^k) \le f(x^*) - f(x^k).$$

By substituting in (4.6.5) and subtracting $f(x^*)$ from both sides, we obtain

(4.6.6) $$f(x^{k+1}) - f(x^*) \le (1 - \alpha_k) [f(x^k) - f(x^*)] + \tfrac{1}{2} \alpha_k^2 L D^2.$$
Stephen J. Wright 77
We now apply an inductive argument. For $k = 0$, we have $\alpha_0 = 1$ and

$$f(x^1) - f(x^*) \le \tfrac{1}{2} L D^2 < \tfrac{2}{3} L D^2,$$

so that (4.6.4) holds in this case. Supposing that (4.6.4) holds for some value of $k$, we aim to show that it holds for $k+1$ too. We have

$$\begin{aligned}
f(x^{k+1}) - f(x^*) &\le \left(1 - \frac{2}{k+2}\right) [f(x^k) - f(x^*)] + \frac{1}{2} \frac{4}{(k+2)^2} L D^2 && \text{from (4.6.6), (4.6.2b)} \\
&\le L D^2 \left( \frac{2k}{(k+2)^2} + \frac{2}{(k+2)^2} \right) && \text{from (4.6.4)} \\
&= 2 L D^2 \frac{k+1}{(k+2)^2} \\
&= 2 L D^2 \, \frac{k+1}{k+2} \, \frac{1}{k+2} \\
&\le 2 L D^2 \, \frac{k+2}{k+3} \, \frac{1}{k+2} = \frac{2 L D^2}{k+3},
\end{aligned}$$

as required. □
5. Prox-Gradient Methods
We now describe an elementary but powerful approach for solving the regularized optimization problem

(5.0.1) $$\min_{x \in \mathbb{R}^n} \phi(x) := f(x) + \lambda \psi(x),$$

where $f$ is a smooth convex function, $\psi$ is a convex regularization function (known simply as the “regularizer”), and $\lambda \ge 0$ is a regularization parameter. The technique we describe here is a natural extension of the steepest-descent approach, in that it reduces to the steepest-descent method analyzed in Theorems 4.3.1 and 4.4.1 applied to $f$ when the regularization term is not present ($\lambda = 0$). It is useful
when the regularizer $\psi$ has a simple structure that is easy to account for explicitly, as is true for many regularizers that arise in data analysis, such as the $\ell_1$ norm ($\psi(x) = \|x\|_1$) or the indicator function for a simple set $\Omega$ ($\psi(x) = I_\Omega(x)$), such as a box $\Omega = [l_1, u_1] \otimes [l_2, u_2] \otimes \dots \otimes [l_n, u_n]$. For such regularizers, the proximal operators can be computed explicitly and efficiently.²
Each step of the algorithm is defined as follows:

(5.0.2) $$x^{k+1} := \operatorname{prox}_{\alpha_k \lambda \psi}(x^k - \alpha_k \nabla f(x^k)),$$

for some steplength $\alpha_k > 0$, and the prox operator defined in (3.5.3). By substituting into this definition, we can verify that $x^{k+1}$ is the solution of an approximation to the objective $\phi$ of (5.0.1), namely:

(5.0.3) $$x^{k+1} := \arg\min_z \; \nabla f(x^k)^T (z - x^k) + \frac{1}{2\alpha_k} \|z - x^k\|^2 + \lambda \psi(z).$$
2 For the analysis of this section I am indebted to class notes of L. Vandenberghe, from 2013-14.
One way to verify this equivalence is to note that the objective in (5.0.3) can be written as

$$\frac{1}{\alpha_k} \left( \frac{1}{2} \left\| z - (x^k - \alpha_k \nabla f(x^k)) \right\|^2 + \alpha_k \lambda \psi(z) \right)$$

(modulo a term $\frac{\alpha_k}{2} \|\nabla f(x^k)\|^2$ that does not involve $z$). The subproblem objective in (5.0.3) consists of a linear term $\nabla f(x^k)^T (z - x^k)$ (the first-order term in a Taylor-series expansion), a proximality term $\frac{1}{2\alpha_k} \|z - x^k\|^2$ that becomes more strict as $\alpha_k \downarrow 0$, and the regularization term $\lambda \psi(z)$ in unaltered form. When $\lambda = 0$, we have $x^{k+1} = x^k - \alpha_k \nabla f(x^k)$, so the iteration (5.0.2) (or (5.0.3)) reduces to the usual steepest-descent approach discussed in Section 4 in this case. It is useful to continue thinking of $\alpha_k$ as playing the role of a line-search parameter, though here the line search is expressed implicitly through a proximal term.
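As an illustration, the following sketch implements iteration (5.0.2) with the constant step $\alpha_k = 1/L$ for the assumed special case of a least-squares $f$ and the $\ell_1$ regularizer, whose prox operator is the well-known componentwise soft-thresholding; the test problem and names are illustrative.

```python
import numpy as np

def soft_threshold(y, t):
    """Prox operator of t * ||.||_1: componentwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_gradient_l1(A, b, lam, iters=2000):
    """Iteration (5.0.2) for f(x) = 0.5 ||Ax - b||^2 and psi = ||.||_1,
    with the constant step alpha_k = 1/L, where L = ||A||_2^2 is the
    Lipschitz constant of grad f."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)                    # gradient of f
        x = soft_threshold(x - g / L, lam / L)   # prox step (5.0.2)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
x = prox_gradient_l1(A, b, lam=0.5)
```

With the $1/L$ step, the composite objective $\phi$ decreases monotonically, consistent with the first claim in the proof of Theorem 5.0.9 below.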
We will demonstrate convergence of the method (5.0.2) at a sublinear rate, for functions $f$ whose gradients satisfy a Lipschitz continuity property with Lipschitz constant $L$ (see (3.3.6)), and for the constant steplength choice $\alpha_k = 1/L$. The proof makes use of a “gradient map” defined by

(5.0.4) $$G_\alpha(x) := \frac{1}{\alpha} \left( x - \operatorname{prox}_{\alpha \lambda \psi}(x - \alpha \nabla f(x)) \right).$$

By comparing with (5.0.2), we see that this map defines the step taken at iteration $k$ as:

(5.0.5) $$x^{k+1} = x^k - \alpha_k G_{\alpha_k}(x^k) \quad \Leftrightarrow \quad G_{\alpha_k}(x^k) = \frac{1}{\alpha_k} (x^k - x^{k+1}).$$
The following technical lemma reveals some useful properties of $G_\alpha(x)$.

Lemma 5.0.6. Suppose that in problem (5.0.1), $\psi$ is a closed convex function and that $f$ is convex with Lipschitz continuous gradient on $\mathbb{R}^n$, with Lipschitz constant $L$. Then for the definition (5.0.4) with $\alpha > 0$, the following claims are true.
(a) $G_\alpha(x) \in \nabla f(x) + \lambda \partial \psi(x - \alpha G_\alpha(x))$.
(b) For any $z$, and any $\alpha \in (0, 1/L]$, we have that

$$\phi(x - \alpha G_\alpha(x)) \le \phi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2.$$
Proof. For part (a), we use the optimality property (3.5.4) of the prox operator, and make the following substitutions: $x - \alpha \nabla f(x)$ for “$x$”, $\alpha\lambda$ for “$\lambda$”, and $\psi$ for “$h$” to obtain

$$0 \in \alpha \lambda \partial \psi(\operatorname{prox}_{\alpha\lambda\psi}(x - \alpha \nabla f(x))) + \left( \operatorname{prox}_{\alpha\lambda\psi}(x - \alpha \nabla f(x)) - (x - \alpha \nabla f(x)) \right).$$

We make the substitution $\operatorname{prox}_{\alpha\lambda\psi}(x - \alpha \nabla f(x)) = x - \alpha G_\alpha(x)$, using definition (5.0.4), to obtain

$$0 \in \alpha \lambda \partial \psi(x - \alpha G_\alpha(x)) - \alpha (G_\alpha(x) - \nabla f(x)),$$

and the result follows when we divide by $\alpha$.
For (b), we start with the following consequence of Lipschitz continuity of $\nabla f$, from Lemma 3.3.10:

$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2.$$
By setting $y = x - \alpha G_\alpha(x)$, for any $\alpha \in (0, 1/L]$, we have

(5.0.7) $$f(x - \alpha G_\alpha(x)) \le f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{L\alpha^2}{2} \|G_\alpha(x)\|^2 \le f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{\alpha}{2} \|G_\alpha(x)\|^2.$$

(The second inequality uses $\alpha \in (0, 1/L]$.) We also have by convexity of $f$ and $\psi$ that for any $z$ and any $v \in \partial \psi(x - \alpha G_\alpha(x))$ the following are true:

(5.0.8) $$f(z) \ge f(x) + \nabla f(x)^T (z - x), \qquad \psi(z) \ge \psi(x - \alpha G_\alpha(x)) + v^T (z - (x - \alpha G_\alpha(x))).$$

From part (a), we have that $v := (G_\alpha(x) - \nabla f(x))/\lambda \in \partial \psi(x - \alpha G_\alpha(x))$. Making this choice of $v$ in (5.0.8) and using (5.0.7), we have for any $\alpha \in (0, 1/L]$ that

$$\begin{aligned}
\phi(x - \alpha G_\alpha(x)) &= f(x - \alpha G_\alpha(x)) + \lambda \psi(x - \alpha G_\alpha(x)) \\
&\le f(x) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 + \lambda \psi(x - \alpha G_\alpha(x)) && \text{(from (5.0.7))} \\
&\le f(z) + \nabla f(x)^T (x - z) - \alpha G_\alpha(x)^T \nabla f(x) + \frac{\alpha}{2} \|G_\alpha(x)\|^2 \\
&\qquad + \lambda \psi(z) + (G_\alpha(x) - \nabla f(x))^T (x - \alpha G_\alpha(x) - z) && \text{(from (5.0.8))} \\
&= f(z) + \lambda \psi(z) + G_\alpha(x)^T (x - z) - \frac{\alpha}{2} \|G_\alpha(x)\|^2,
\end{aligned}$$

where the last equality follows from cancellation of several terms in the previous line. Thus (b) is proved. □
Theorem 5.0.9. Suppose that in problem (5.0.1), $\psi$ is a closed convex function and that $f$ is convex with Lipschitz continuous gradient on $\mathbb{R}^n$, with Lipschitz constant $L$. Suppose that (5.0.1) attains a minimizer $x^*$ (not necessarily unique) with optimal objective value $\phi^*$. Then if $\alpha_k = 1/L$ for all $k$ in (5.0.2), we have

$$\phi(x^k) - \phi^* \le \frac{L \|x^0 - x^*\|^2}{2k}, \qquad k = 1, 2, \dots.$$
Proof. Since $\alpha_k = 1/L$ satisfies the conditions of Lemma 5.0.6, we can use part (b) of this result to show that the sequence $\{\phi(x^k)\}$ is decreasing and that the distance to the optimum $x^*$ also decreases at each iteration. Setting $x = z = x^k$ and $\alpha = \alpha_k$ in Lemma 5.0.6, and recalling (5.0.5), we have

$$\phi(x^{k+1}) = \phi(x^k - \alpha_k G_{\alpha_k}(x^k)) \le \phi(x^k) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2,$$

justifying the first claim. For the second claim, we have by setting $x = x^k$, $\alpha = \alpha_k$, and $z = x^*$ in Lemma 5.0.6 that

(5.0.10) $$\begin{aligned}
0 \le \phi(x^{k+1}) - \phi^* &= \phi(x^k - \alpha_k G_{\alpha_k}(x^k)) - \phi^* \\
&\le G_{\alpha_k}(x^k)^T (x^k - x^*) - \frac{\alpha_k}{2} \|G_{\alpha_k}(x^k)\|^2 \\
&= \frac{1}{2\alpha_k} \left( \|x^k - x^*\|^2 - \|x^k - x^* - \alpha_k G_{\alpha_k}(x^k)\|^2 \right) \\
&= \frac{1}{2\alpha_k} \left( \|x^k - x^*\|^2 - \|x^{k+1} - x^*\|^2 \right),
\end{aligned}$$
from which $\|x^{k+1} - x^*\| \le \|x^k - x^*\|$ follows.
By setting $\alpha_k = 1/L$ in (5.0.10), and summing over $k = 0, 1, 2, \dots, K-1$, we obtain from a telescoping sum on the right-hand side that

$$\sum_{k=0}^{K-1} (\phi(x^{k+1}) - \phi^*) \le \frac{L}{2} \left( \|x^0 - x^*\|^2 - \|x^K - x^*\|^2 \right) \le \frac{L}{2} \|x^0 - x^*\|^2.$$

By monotonicity of $\{\phi(x^k)\}$, we have

$$K (\phi(x^K) - \phi^*) \le \sum_{k=0}^{K-1} (\phi(x^{k+1}) - \phi^*).$$

The result follows immediately by combining these last two expressions. □
6. Accelerating Gradient Methods
We showed in Section 4 that the basic steepest descent method for solving
(1.0.1) for smooth f converges sublinearly at a 1/k rate when f is convex, and
linearly at a rate of (1 − γ/L) when f is strongly convex, satisfying (3.3.13) for
positive γ and L. We show in this section that by using the gradient information
in a more clever way, faster convergence rates can be attained.
The key idea is momentum. In iteration k of a momentum method, we tend to
continue moving along the previous search direction at each iteration, making a
small adjustment toward the negative gradient −∇f evaluated at xk or a nearby
point. (Steepest descent simply uses −∇f(xk ) as the search direction.) Although
not obvious at first, there is some intuition behind the momentum idea. The step
taken at the previous iterate xk−1 was based on negative gradient information
at that iteration, along with the search direction from the iteration prior to that
one, namely, xk−2 . By continuing this line of reasoning backwards, we see that
the previous step is a linear combination of all the gradient information that we
have encountered at all iterates so far, going back to the initial iterate x0 . If this
information is aggregated properly, it can produce a richer overall picture of the
function than the latest negative gradient alone, and thus has the potential to
yield better convergence.
Sure enough, several intricate methods that use the momentum idea have been
proposed, and have been widely successful. These methods are often called accelerated gradient methods. A major contributor in this area is Yurii Nesterov, dating to
his seminal contribution in 1983 [33] and explicated further in his book [34] and
other publications. Another key contribution is [3], which derived an accelerated
method for the regularized case (1.0.2).
6.1. Heavy-Ball Method Possibly the most elementary method of momentum
type is the heavy-ball method of Polyak [37]; see also [38]. Each iteration of this
method has the form
(6.1.1) xk+1 = xk − αk ∇f(xk ) + βk (xk − xk−1 ),
where αk and βk are positive scalars. That is, a momentum term βk (xk − xk−1 )
is added to the usual steepest descent update. Although this method can be ap-
plied to any smooth convex f (and even to nonconvex functions), the convergence
analysis is most straightforward for the special case of strongly convex quadratic
functions (see [38]). (This analysis also suggests appropriate values for the step
lengths $\alpha_k$ and $\beta_k$.) Consider the function

(6.1.2) $$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{2} x^T A x - b^T x,$$

where the (constant) Hessian $A$ has eigenvalues in the range $[\gamma, L]$, with $0 < \gamma \le L$. For the following constant choices of steplength parameters:

$$\alpha_k = \alpha := \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta := \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$

it can be shown that $\|x^k - x^*\| \le C \beta^k$, for some (possibly large) constant $C$. We can use (3.3.7) to translate this into a bound on the function error, as follows:

$$f(x^k) - f(x^*) \le \frac{L}{2} \|x^k - x^*\|^2 \le \frac{L C^2}{2} \beta^{2k},$$

allowing a direct comparison with the rate (4.4.3) for the steepest descent method. If we suppose that $L \gg \gamma$, we have

$$\beta \approx 1 - 2 \sqrt{\frac{\gamma}{L}},$$

so that we achieve approximate convergence $f(x^k) - f(x^*) \le \epsilon$ (for small positive $\epsilon$) in $O(\sqrt{L/\gamma} \, \log(1/\epsilon))$ iterations, compared with $O((L/\gamma) \log(1/\epsilon))$ for steepest descent — a significant improvement.
The heavy-ball method is fundamental, but several points should be noted.
First, the analysis for convex quadratic f is based on linear algebra arguments,
and does not generalize to general strongly convex nonlinear functions. Second,
the method requires knowledge of γ and L, for the purposes of defining parame-
ters α and β. Third, it is not a descent method; we usually have f(xk+1 ) > f(xk )
for many k. These properties are not specific to the heavy-ball method — some
of them are shared by other methods that use momentum.
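A minimal sketch of iteration (6.1.1) on the quadratic (6.1.2), with the constant $\alpha$ and $\beta$ above; the test problem and names are illustrative assumptions, not from the text.

```python
import numpy as np

def heavy_ball(A, b, iters=400):
    """Heavy-ball iteration (6.1.1) on f(x) = 0.5 x^T A x - b^T x, using
    the constant alpha, beta suggested by the quadratic analysis. Note
    it needs the extreme eigenvalues gamma and L of A."""
    evals = np.linalg.eigvalsh(A)
    gamma, L = evals[0], evals[-1]
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = x = np.zeros(len(b))
    for _ in range(iters):
        # momentum term beta * (x - x_prev) added to the gradient step
        x, x_prev = x - alpha * (A @ x - b) + beta * (x - x_prev), x
    return x

rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
A = M @ M.T + np.eye(10)            # positive definite Hessian
b = rng.standard_normal(10)
x = heavy_ball(A, b)
```

Tracking the objective along the way would show the non-monotone behavior mentioned above: $f(x^{k+1}) > f(x^k)$ can occur even though $x^k \to x^*$.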
6.2. Conjugate Gradient The conjugate gradient method for solving linear sys-
tems Ax = b (or, equivalently, minimizing the convex quadratic (6.1.2)) where A
is symmetric positive definite, is one of the most important algorithms in compu-
tational science. Though invented earlier than the other algorithms discussed in
this section (see [27]) and motivated in a different way, conjugate gradient clearly
makes use of momentum. Its steps have the form
(6.2.1) xk+1 = xk + αk pk , where pk = −∇f(xk ) + ξk pk−1 ,
for some choices of $\alpha_k$ and $\xi_k$, which is identical to (6.1.1) when we define $\beta_k$ appropriately. For strongly convex quadratic problems (6.1.2), conjugate gradient
has excellent properties. It does not require prior knowledge of the range [γ, L] of
the eigenvalue spectrum of A, choosing the steplengths αk and ξk in an adaptive
fashion. (In fact, αk is chosen to be the exact minimizer along the search direction
pk .) The main arithmetic operation per iteration is one matrix-vector multiplica-
tion involving A, the same cost as a gradient evaluation for f in (6.1.2). Most
importantly, there is a rich convergence theory that characterizes convergence in
terms of the properties of the full spectrum of A (not just its extreme elements),
showing in particular that good approximate solutions can be obtained quickly
if the eigenvalues are clustered. Convergence to an exact solution of (6.1.2) in at
most n iterations is guaranteed (provided, naturally, that the arithmetic is carried
out exactly).
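For reference, a textbook implementation of linear conjugate gradient might look as follows; this is a standard sketch, not code from this chapter, and the iteration cap above $n$ is there only to absorb rounding error.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-12):
    """Linear conjugate gradient (6.2.1) for Ax = b, with A symmetric
    positive definite. One product with A per iteration; in exact
    arithmetic at most n steps are needed."""
    x = np.zeros(len(b))
    r = b - A @ x                    # residual, equal to -grad f(x)
    p = r.copy()
    rs = r @ r
    for _ in range(10 * len(b)):     # cap above n to absorb rounding
        Ap = A @ p
        alpha = rs / (p @ Ap)        # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # momentum coefficient xi_k
        rs = rs_new
    return x

rng = np.random.default_rng(2)
M = rng.standard_normal((15, 15))
A = M @ M.T + np.eye(15)
b = rng.standard_normal(15)
x = conjugate_gradient(A, b)
```

Note that no eigenvalue information is supplied: $\alpha_k$ and $\xi_k$ are computed adaptively from inner products, which is exactly the property highlighted above.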
There has been much work over the years on extending the conjugate gradi-
ent method to general smooth functions f. Few of the theoretical properties for
the quadratic case carry over to the nonlinear setting, though several results are
known; see [36, Chapter 5], for example. Such “nonlinear” conjugate gradient
methods vary in the accuracy with which they perform the line search for αk in
(6.2.1) and — more fundamentally — in the choice of ξk . The latter is done in
a way that ensures that each search direction pk is a descent direction. In some
methods, ξk is set to zero on some iterations, which causes the method to take
a steepest descent step, effectively “restarting” the conjugate gradient method at
the latest iterate.
Despite these qualifications, nonlinear conjugate gradient is quite commonly
used in practice, because of its minimal storage requirements and the fact that
it requires only one gradient evaluation per iteration. Its popularity has been
eclipsed in recent years by the limited-memory quasi-Newton method L-BFGS
[30], [36, Section 7.2], which requires more storage (though still O(n)) and is
similarly economical and easy to implement.
6.3. Nesterov’s Accelerated Gradient: Weakly Convex Case We now describe Nesterov’s method for (1.0.1) and prove its convergence — sublinear at a $1/k^2$ rate — for the case of $f$ convex with Lipschitz continuous gradients satisfying (3.3.6). Each iteration of this method has the form

(6.3.1) $$x^{k+1} = x^k - \alpha_k \nabla f\left(x^k + \beta_k (x^k - x^{k-1})\right) + \beta_k (x^k - x^{k-1}),$$
for choices of the parameters αk and βk to be defined. Note immediately the
similarity to the heavy-ball formula (6.1.1). The only difference is that the extrap-
olation step xk → xk + βk (xk − xk−1 ) is taken before evaluation of the gradient
∇f in (6.3.1), whereas in (6.1.1) the gradient is simply evaluated at xk . It is con-
venient for purposes of analysis (and implementation) to introduce an auxiliary
sequence {yk }, fix αk ≡ 1/L, and rewrite the update (6.3.1) as follows:
(6.3.2a) $$x^{k+1} = y^k - \frac{1}{L} \nabla f(y^k),$$
(6.3.2b) $$y^{k+1} = x^{k+1} + \beta_{k+1} (x^{k+1} - x^k), \qquad k = 0, 1, 2, \dots,$$
where we initialize at an arbitrary $y^0$ and set $x^0 = y^0$. We define $\beta_k$ with reference to another scalar sequence $\lambda_k$ in the following manner:

(6.3.3) $$\lambda_0 = 0, \qquad \lambda_{k+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4\lambda_k^2}\right), \qquad \beta_k = \frac{\lambda_k - 1}{\lambda_{k+1}}.$$

Since $\lambda_k \ge 1$ for $k = 1, 2, \dots$, we have $\beta_{k+1} \ge 0$ for $k = 0, 1, 2, \dots$. It also follows from the definition of $\lambda_{k+1}$ that

(6.3.4) $$\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2.$$
We have the following result for convergence of Nesterov’s scheme on general
convex functions. We prove it using an argument from [3], as reformulated in
[7, Section 3.7]. The analysis is famously technical, and intuition is hard to come
by. Some recent progress has been made in deriving algorithms similar to (6.3.2)
that have a plausible geometric or algebraic motivation; see [8, 21].
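Whatever the intuition, the scheme (6.3.2)–(6.3.3) is short to implement. The sketch below (test problem and names are illustrative assumptions) tracks $\lambda_{k+1}$ across iterations so that the momentum coefficient used in (6.3.2b) is $\beta_{k+1}$:

```python
import numpy as np

def nesterov(grad, L, x0, iters=2000):
    """Accelerated gradient (6.3.2) with the scalar sequence (6.3.3).
    The variable lam holds lambda_{k+1}, so beta below is beta_{k+1}."""
    x = x0.astype(float)
    y = x.copy()
    lam = 1.0                                      # lambda_1
    for _ in range(iters):
        x_new = y - grad(y) / L                    # (6.3.2a)
        lam_next = 0.5 * (1 + np.sqrt(1 + 4 * lam ** 2))
        y = x_new + ((lam - 1) / lam_next) * (x_new - x)   # (6.3.2b)
        x, lam = x_new, lam_next
    return x

rng = np.random.default_rng(3)
M = rng.standard_normal((10, 10))
Q = M @ M.T + np.eye(10)            # convex quadratic test objective
b = rng.standard_normal(10)
L = np.linalg.eigvalsh(Q)[-1]
x = nesterov(lambda z: Q @ z - b, L, np.zeros(10))
```

On a quadratic test problem one can check that the final objective gap is within the $2L\|x^0 - x^*\|^2/(T+1)^2$ guarantee of Theorem 6.3.5 below.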
Theorem 6.3.5. Suppose that $f$ in (1.0.1) is convex, with $\nabla f$ Lipschitz continuous with constant $L$ (as in (3.3.6)), and that the minimum of $f$ is attained at $x^*$, with $f^* := f(x^*)$. Then the method defined by (6.3.2), (6.3.3) with $x^0 = y^0$ yields an iteration sequence $\{x^k\}$ with the following property:

$$f(x^T) - f^* \le \frac{2L \|x^0 - x^*\|^2}{(T+1)^2}, \qquad T = 1, 2, \dots.$$
Proof. From convexity of $f$ and (3.3.7), we have for any $x$ and $y$ that

(6.3.6) $$\begin{aligned}
f(y - \nabla f(y)/L) - f(x) &\le f(y - \nabla f(y)/L) - f(y) + \nabla f(y)^T (y - x) \\
&\le \nabla f(y)^T (y - \nabla f(y)/L - y) + \frac{L}{2} \|y - \nabla f(y)/L - y\|^2 + \nabla f(y)^T (y - x) \\
&= -\frac{1}{2L} \|\nabla f(y)\|^2 + \nabla f(y)^T (y - x).
\end{aligned}$$

Setting $y = y^k$ and $x = x^k$ in this bound, we obtain

(6.3.7) $$\begin{aligned}
f(x^{k+1}) - f(x^k) &= f(y^k - \nabla f(y^k)/L) - f(x^k) \\
&\le -\frac{1}{2L} \|\nabla f(y^k)\|^2 + \nabla f(y^k)^T (y^k - x^k) \\
&= -\frac{L}{2} \|x^{k+1} - y^k\|^2 - L (x^{k+1} - y^k)^T (y^k - x^k).
\end{aligned}$$

We now set $y = y^k$ and $x = x^*$ in (6.3.6), and use (6.3.2a) to obtain

(6.3.8) $$f(x^{k+1}) - f(x^*) \le -\frac{L}{2} \|x^{k+1} - y^k\|^2 - L (x^{k+1} - y^k)^T (y^k - x^*).$$

Introducing the notation $\delta_k := f(x^k) - f(x^*)$, we multiply (6.3.7) by $\lambda_{k+1} - 1$ and add it to (6.3.8) to obtain

$$(\lambda_{k+1} - 1)(\delta_{k+1} - \delta_k) + \delta_{k+1} \le -\frac{L}{2} \lambda_{k+1} \|x^{k+1} - y^k\|^2 - L (x^{k+1} - y^k)^T (\lambda_{k+1} y^k - (\lambda_{k+1} - 1) x^k - x^*).$$
We multiply this bound by $\lambda_{k+1}$, and use (6.3.4) to obtain

(6.3.9) $$\begin{aligned}
\lambda_{k+1}^2 \delta_{k+1} - \lambda_k^2 \delta_k &\le -\frac{L}{2} \left( \|\lambda_{k+1} (x^{k+1} - y^k)\|^2 + 2 \lambda_{k+1} (x^{k+1} - y^k)^T (\lambda_{k+1} y^k - (\lambda_{k+1} - 1) x^k - x^*) \right) \\
&= -\frac{L}{2} \left( \|\lambda_{k+1} x^{k+1} - (\lambda_{k+1} - 1) x^k - x^*\|^2 - \|\lambda_{k+1} y^k - (\lambda_{k+1} - 1) x^k - x^*\|^2 \right),
\end{aligned}$$

where in the final equality we used the identity $\|a\|^2 + 2 a^T b = \|a + b\|^2 - \|b\|^2$. By multiplying (6.3.2b) by $\lambda_{k+2}$, and using $\lambda_{k+2} \beta_{k+1} = \lambda_{k+1} - 1$ from (6.3.3), we have

$$\lambda_{k+2} y^{k+1} = \lambda_{k+2} x^{k+1} + \lambda_{k+2} \beta_{k+1} (x^{k+1} - x^k) = \lambda_{k+2} x^{k+1} + (\lambda_{k+1} - 1)(x^{k+1} - x^k).$$

By rearranging this equality, we have

$$\lambda_{k+1} x^{k+1} - (\lambda_{k+1} - 1) x^k = \lambda_{k+2} y^{k+1} - (\lambda_{k+2} - 1) x^{k+1}.$$

By substituting into the first term on the right-hand side of (6.3.9), and using the definition

(6.3.10) $$u^k := \lambda_{k+1} y^k - (\lambda_{k+1} - 1) x^k - x^*,$$

we obtain

$$\lambda_{k+1}^2 \delta_{k+1} - \lambda_k^2 \delta_k \le -\frac{L}{2} \left( \|u^{k+1}\|^2 - \|u^k\|^2 \right).$$

By summing both sides of this inequality over $k = 0, 1, \dots, T-1$, and using $\lambda_0 = 0$, we obtain

$$\lambda_T^2 \delta_T \le \frac{L}{2} \left( \|u^0\|^2 - \|u^T\|^2 \right) \le \frac{L}{2} \|x^0 - x^*\|^2,$$

so that

(6.3.11) $$\delta_T = f(x^T) - f(x^*) \le \frac{L \|x^0 - x^*\|^2}{2 \lambda_T^2}.$$

A simple induction confirms that $\lambda_k \ge (k+1)/2$ for $k = 1, 2, \dots$, and the claim of the theorem follows by substituting this bound into (6.3.11). □
6.4. Nesterov’s Accelerated Gradient: Strongly Convex Case We turn now to
Nesterov’s approach for smooth strongly convex functions, which satisfy (3.2.5)
with γ > 0. Again, we follow the proof in [7, Section 3.7], which is based on
the analysis in [34]. The method uses the same update formula (6.3.2) as in the
weakly convex case, and the same initialization, but with a different choice of
$\beta_{k+1}$, namely:

(6.4.1) $$\beta_{k+1} \equiv \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}.$$
The condition measure κ is defined in (3.3.12). We prove the following conver-
gence result.
Theorem 6.4.2. Suppose that $f$ is such that $\nabla f$ is Lipschitz continuous with constant $L$, and that $f$ is strongly convex with modulus of convexity $\gamma$ and unique minimizer $x^*$. Then the method (6.3.2), (6.4.1) with starting point $x^0 = y^0$ satisfies

$$f(x^T) - f(x^*) \le \frac{L + \gamma}{2} \|x^0 - x^*\|^2 \left(1 - \frac{1}{\sqrt{\kappa}}\right)^T, \qquad T = 1, 2, \dots.$$
Proof. The proof makes use of a family of strongly convex functions $\Phi_k(z)$ defined inductively as follows:

(6.4.3a) $$\Phi_0(z) = f(y^0) + \frac{\gamma}{2} \|z - y^0\|^2,$$
(6.4.3b) $$\Phi_{k+1}(z) = (1 - 1/\sqrt{\kappa}) \Phi_k(z) + \frac{1}{\sqrt{\kappa}} \left( f(y^k) + \nabla f(y^k)^T (z - y^k) + \frac{\gamma}{2} \|z - y^k\|^2 \right).$$

Each $\Phi_k(\cdot)$ is a quadratic, and an inductive argument shows that $\nabla^2 \Phi_k(z) = \gamma I$ for all $k$ and all $z$. Thus, each $\Phi_k$ has the form

(6.4.4) $$\Phi_k(z) = \Phi_k^* + \frac{\gamma}{2} \|z - v^k\|^2, \qquad k = 0, 1, 2, \dots,$$

where $v^k$ is the minimizer of $\Phi_k(\cdot)$ and $\Phi_k^*$ is its optimal value. (From (6.4.3a), we have $v^0 = y^0$.) We note too that $\Phi_k$ becomes a tighter overapproximation to $f$ as $k \to \infty$. To show this, we use (3.3.9) to replace the final term in parentheses in (6.4.3b) by $f(z)$, then subtract $f(z)$ from both sides of (6.4.3b) to obtain

(6.4.5) $$\Phi_{k+1}(z) - f(z) \le (1 - 1/\sqrt{\kappa}) (\Phi_k(z) - f(z)).$$

In the remainder of the proof, we show that the following bound holds:

(6.4.6) $$f(x^k) \le \min_z \Phi_k(z) = \Phi_k^*, \qquad k = 0, 1, 2, \dots.$$

The upper bound in Lemma 3.3.10 for $x = x^*$ gives $f(z) - f(x^*) \le (L/2) \|z - x^*\|^2$. By combining this bound with (6.4.5) and (6.4.6), we have

(6.4.7) $$\begin{aligned}
f(x^k) - f(x^*) &\le \Phi_k^* - f(x^*) \\
&\le \Phi_k(x^*) - f(x^*) \\
&\le (1 - 1/\sqrt{\kappa})^k (\Phi_0(x^*) - f(x^*)) \\
&\le (1 - 1/\sqrt{\kappa})^k \left[ (\Phi_0(x^*) - f(x^0)) + (f(x^0) - f(x^*)) \right] \\
&\le (1 - 1/\sqrt{\kappa})^k \, \frac{\gamma + L}{2} \|x^0 - x^*\|^2.
\end{aligned}$$

The proof is completed by establishing (6.4.6), by induction on $k$. Since $x^0 = y^0$, it holds by definition at $k = 0$. By using the step formula (6.3.2a), the convexity property (3.3.8) (with $x = y^k$), and the inductive hypothesis, we have

(6.4.8) $$\begin{aligned}
f(x^{k+1}) &\le f(y^k) - \frac{1}{2L} \|\nabla f(y^k)\|^2 \\
&= (1 - 1/\sqrt{\kappa}) f(x^k) + (1 - 1/\sqrt{\kappa})(f(y^k) - f(x^k)) + f(y^k)/\sqrt{\kappa} - \frac{1}{2L} \|\nabla f(y^k)\|^2 \\
&\le (1 - 1/\sqrt{\kappa}) \Phi_k^* + (1 - 1/\sqrt{\kappa}) \nabla f(y^k)^T (y^k - x^k) + f(y^k)/\sqrt{\kappa} - \frac{1}{2L} \|\nabla f(y^k)\|^2.
\end{aligned}$$
Thus the claim is established (and the theorem is proved) if we can show that the right-hand side in (6.4.8) is bounded above by $\Phi_{k+1}^*$.
Recalling the observation (6.4.4), we have by taking derivatives of both sides of (6.4.3b) with respect to $z$ that

(6.4.9) $$\nabla \Phi_{k+1}(z) = \gamma (1 - 1/\sqrt{\kappa})(z - v^k) + \nabla f(y^k)/\sqrt{\kappa} + \gamma (z - y^k)/\sqrt{\kappa}.$$

Since $v^{k+1}$ is the minimizer of $\Phi_{k+1}$, we can set $\nabla \Phi_{k+1}(v^{k+1}) = 0$ in (6.4.9) to obtain

(6.4.10) $$v^{k+1} = (1 - 1/\sqrt{\kappa}) v^k + y^k/\sqrt{\kappa} - \nabla f(y^k)/(\gamma \sqrt{\kappa}).$$

By subtracting $y^k$ from both sides of this expression, and taking $\|\cdot\|^2$ of both sides, we obtain

(6.4.11) $$\|v^{k+1} - y^k\|^2 = (1 - 1/\sqrt{\kappa})^2 \|y^k - v^k\|^2 + \|\nabla f(y^k)\|^2/(\gamma^2 \kappa) - \frac{2(1 - 1/\sqrt{\kappa})}{\gamma \sqrt{\kappa}} \nabla f(y^k)^T (v^k - y^k).$$

By evaluating $\Phi_{k+1}$ at $z = y^k$, using both (6.4.4) and (6.4.3b), we obtain

(6.4.12) $$\begin{aligned}
\Phi_{k+1}^* + \frac{\gamma}{2} \|y^k - v^{k+1}\|^2 &= (1 - 1/\sqrt{\kappa}) \Phi_k(y^k) + f(y^k)/\sqrt{\kappa} \\
&= (1 - 1/\sqrt{\kappa}) \Phi_k^* + \frac{\gamma}{2} (1 - 1/\sqrt{\kappa}) \|y^k - v^k\|^2 + f(y^k)/\sqrt{\kappa}.
\end{aligned}$$

By substituting (6.4.11) into (6.4.12), we obtain

(6.4.13) $$\begin{aligned}
\Phi_{k+1}^* &= (1 - 1/\sqrt{\kappa}) \Phi_k^* + f(y^k)/\sqrt{\kappa} + \frac{\gamma (1 - 1/\sqrt{\kappa})}{2\sqrt{\kappa}} \|y^k - v^k\|^2 \\
&\qquad - \frac{1}{2L} \|\nabla f(y^k)\|^2 + (1 - 1/\sqrt{\kappa}) \nabla f(y^k)^T (v^k - y^k)/\sqrt{\kappa} \\
&\ge (1 - 1/\sqrt{\kappa}) \Phi_k^* + f(y^k)/\sqrt{\kappa} - \frac{1}{2L} \|\nabla f(y^k)\|^2 + (1 - 1/\sqrt{\kappa}) \nabla f(y^k)^T (v^k - y^k)/\sqrt{\kappa},
\end{aligned}$$

where we simply dropped a nonnegative term from the right-hand side to obtain the inequality. The final step is to show that

(6.4.14) $$v^k - y^k = \sqrt{\kappa} (y^k - x^k),$$

which we do by induction. Note that $v^0 = x^0 = y^0$, so the claim holds for $k = 0$. We have

(6.4.15) $$\begin{aligned}
v^{k+1} - y^{k+1} &= (1 - 1/\sqrt{\kappa}) v^k + y^k/\sqrt{\kappa} - \nabla f(y^k)/(\gamma \sqrt{\kappa}) - y^{k+1} \\
&= \sqrt{\kappa} \, y^k - (\sqrt{\kappa} - 1) x^k - \sqrt{\kappa} \, \nabla f(y^k)/L - y^{k+1} \\
&= \sqrt{\kappa} \, x^{k+1} - (\sqrt{\kappa} - 1) x^k - y^{k+1} \\
&= \sqrt{\kappa} (y^{k+1} - x^{k+1}),
\end{aligned}$$

where the first equality is from (6.4.10), the second equality is from the inductive hypothesis, the third equality is from the iteration formula (6.3.2a), and the final equality is from the iteration formula (6.3.2b) with the definition of $\beta_{k+1}$ from (6.4.1). We have thus proved (6.4.14), and by substituting this equality into (6.4.13),
we obtain that $\Phi_{k+1}^*$ is an upper bound on the right-hand side of (6.4.8). This establishes (6.4.6) and thus completes the proof of the theorem. □
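For comparison with the weakly convex case, the strongly convex scheme simply replaces the varying momentum sequence by the constant (6.4.1). A minimal sketch (test problem and names are illustrative assumptions, not from the text):

```python
import numpy as np

def nesterov_strong(grad, L, gamma, x0, iters=300):
    """Scheme (6.3.2) with the constant momentum (6.4.1), for a
    gamma-strongly convex f with L-Lipschitz gradient. Note that,
    unlike conjugate gradient, it needs gamma and L in advance."""
    kappa = L / gamma
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # (6.4.1)
    x = x0.astype(float)
    y = x.copy()
    for _ in range(iters):
        x_new = y - grad(y) / L          # (6.3.2a)
        y = x_new + beta * (x_new - x)   # (6.3.2b)
        x = x_new
    return x

rng = np.random.default_rng(4)
M = rng.standard_normal((10, 10))
Q = M @ M.T + np.eye(10)
b = rng.standard_normal(10)
evals = np.linalg.eigvalsh(Q)
x = nesterov_strong(lambda z: Q @ z - b, evals[-1], evals[0], np.zeros(10))
```

On such a test problem the linear rate $(1 - 1/\sqrt{\kappa})^T$ of Theorem 6.4.2 drives the objective gap to numerical precision within a few hundred iterations.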
6.5. Lower Bounds on Rates The term “optimal” in Nesterov’s optimal method
is used because the convergence rate achieved by the method is the best possible
(possibly up to a constant), among algorithms that make use of gradient informa-
tion at the iterates xk . This claim can be proved by means of a carefully designed
function, for which no method that makes use of all gradients observed up to and
including iteration k (namely, ∇f(xi ), i = 0, 1, 2, . . . , k) can produce a sequence
{xk } that achieves a rate better than that of Theorem 6.3.5. The function proposed
in [32] is a convex quadratic $f(x) = \frac{1}{2} x^T A x - e_1^T x$, where

$$A = \begin{bmatrix}
2 & -1 & 0 & 0 & \dots & \dots & 0 \\
-1 & 2 & -1 & 0 & \dots & \dots & 0 \\
0 & -1 & 2 & -1 & 0 & \dots & 0 \\
& \ddots & \ddots & \ddots & & & \\
0 & \dots & 0 & & -1 & 2 & -1 \\
0 & \dots & 0 & & & -1 & 2
\end{bmatrix}, \qquad
e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$
The solution $x^*$ satisfies $A x^* = e_1$; its components are $x_i^* = 1 - \frac{i}{n+1}$, for $i = 1, 2, \dots, n$. If we use $x^0 = 0$ as the starting point, and construct the iterate $x^{k+1}$ as

$$x^{k+1} = x^k + \sum_{j=0}^{k} \xi_j \nabla f(x^j),$$

for some coefficients $\xi_j$, $j = 0, 1, \dots, k$, an elementary inductive argument shows that each iterate $x^k$ can have nonzero entries only in its first $k$ components. It follows that for any such algorithm, we have

(6.5.1) $$\|x^k - x^*\|^2 \ge \sum_{j=k+1}^{n} (x_j^*)^2 = \sum_{j=k+1}^{n} \left(1 - \frac{j}{n+1}\right)^2.$$
A little arithmetic shows that

(6.5.2) $$\|x^k - x^*\|^2 \ge \frac{1}{8} \|x^0 - x^*\|^2, \qquad k = 1, 2, \dots, \frac{n}{2} - 1.$$

It can be shown further that

(6.5.3) $$f(x^k) - f^* \ge \frac{3L \|x^0 - x^*\|^2}{32(k+1)^2}, \qquad k = 1, 2, \dots, \frac{n}{2} - 1,$$
where $L = \|A\|_2$. This lower bound on $f(x^k) - f^*$ is within a constant factor of the upper bound of Theorem 6.3.5.
The restriction $k \le n/2$ in the argument above is not fully satisfying. A more compelling example would show that the lower bound (6.5.3) holds for all $k$, but an example of this type is not currently known.
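The claims about this example are easy to check numerically. The sketch below (the choice $n = 50$ and the gradient steplength are arbitrary, illustrative assumptions) verifies that $A x^* = e_1$ and that gradient-based steps from $x^0 = 0$ leave all components beyond the first $k$ at zero:

```python
import numpy as np

n = 50
# Nesterov's "hard" tridiagonal quadratic: 2 on the diagonal,
# -1 on the first off-diagonals.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n)
e1[0] = 1.0
x_star = 1.0 - np.arange(1, n + 1) / (n + 1)   # claimed solution of A x = e1

# From x0 = 0, any method whose steps combine past gradients can
# populate only the first k components after k iterations; plain
# gradient steps illustrate this:
x = np.zeros(n)
for k in range(5):
    x = x - 0.25 * (A @ x - e1)
```

Because $A$ is tridiagonal, each multiplication by $A$ spreads the nonzero pattern by only one index, which is exactly the mechanism behind (6.5.1).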
7. Newton Methods
So far, we have dealt with methods that use first-order (gradient or subgra-
dient) information about the objective function. We have shown that such algo-
rithms can yield sequences of iterates that converge at linear or sublinear rates.
We turn our attention in this chapter to methods that exploit second-derivative
(Hessian) information. The canonical method here is Newton’s method, named
after Isaac Newton, who proposed a version of the method for polynomial equa-
tions in around 1670.
For many functions, including many that arise in data analysis, second-order
information is not difficult to compute, in the sense that the functions that we
deal with are simple (usually compositions of elementary functions). In compar-
ing with first-order methods, there is a tradeoff. Second-order methods typically
have local superlinear or quadratic convergence rates: Once the iterates reach a
neighborhood of a solution at which second-order sufficient conditions are sat-
isfied, convergence is rapid. Moreover, their global convergence properties are
attractive. With appropriate enhancements, they can provably avoid convergence
to saddle points. But the costs of calculating and handling the second-order infor-
mation and of computing the step is higher. Whether this tradeoff makes them
appealing depends on the specifics of the application and on whether the second-
derivative computations are able to take advantage of structure in the objective
function.
We start by sketching the basic Newton’s method for the unconstrained smooth
optimization problem min f(x), and prove local convergence to a minimizer x∗
that satisfies second-order sufficient conditions. Subsection 7.2 discusses perfor-
mance of Newton’s method on convex functions, where the use of Newton search
directions in the line search framework (4.0.1) can yield global convergence. Mod-
ifications of Newton’s method for nonconvex functions are discussed in Subsec-
tion 7.3. Subsection 7.4 discusses algorithms for smooth nonconvex functions
that use gradient and Hessian information but guarantee convergence to points
that approximately satisfy second-order necessary conditions. Some variants of
these methods are related closely to the trust-region methods discussed in Sub-
section 7.3, but the motivation and mechanics are somewhat different.
7.1. Basic Newton’s Method Consider the problem

(7.1.1) $$\min f(x),$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a twice Lipschitz continuously differentiable function whose Hessian has Lipschitz constant $M$, that is,

(7.1.2) $$\|\nabla^2 f(x') - \nabla^2 f(x'')\| \le M \|x' - x''\|,$$

where $\|\cdot\|$ denotes the Euclidean vector norm and its induced matrix norm. Newton’s method generates a sequence of iterates $\{x^k\}_{k=0,1,2,\dots}$.
A second-order Taylor series approximation to $f$ around the current iterate $x^k$ is

(7.1.3) $$f(x^k + p) \approx f(x^k) + \nabla f(x^k)^T p + \frac{1}{2} p^T \nabla^2 f(x^k) p.$$

When $\nabla^2 f(x^k)$ is positive definite, the minimizer $p^k$ of the right-hand side is unique; it is

(7.1.4) $$p^k = -\nabla^2 f(x^k)^{-1} \nabla f(x^k).$$

This is the Newton step. In its most basic form, then, Newton’s method is defined by the following iteration:

(7.1.5) $$x^{k+1} = x^k - \nabla^2 f(x^k)^{-1} \nabla f(x^k).$$

We have the following local convergence result in the neighborhood of a point $x^*$ satisfying second-order sufficient conditions.
Theorem 7.1.6. Consider the problem (7.1.1) with $f$ twice Lipschitz continuously differentiable with Lipschitz constant $M$ defined in (7.1.2). Suppose that the second-order sufficient conditions are satisfied for the problem (7.1.1) at the point $x^*$, that is, $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq \gamma I$ for some $\gamma > 0$. Then if $\|x^0 - x^*\| \le \frac{\gamma}{2M}$, the sequence defined by (7.1.5) converges to $x^*$ at a quadratic rate, with

(7.1.7) $$\|x^{k+1} - x^*\| \le \frac{M}{\gamma} \|x^k - x^*\|^2, \qquad k = 0, 1, 2, \dots.$$
Proof. From (7.1.4) and (7.1.5), and using $\nabla f(x^*) = 0$, we have

$$x^{k+1} - x^* = x^k - x^* - \nabla^2 f(x^k)^{-1} \nabla f(x^k) = \nabla^2 f(x^k)^{-1} \left[ \nabla^2 f(x^k)(x^k - x^*) - (\nabla f(x^k) - \nabla f(x^*)) \right],$$

so that

(7.1.8) $$\|x^{k+1} - x^*\| \le \|\nabla^2 f(x^k)^{-1}\| \, \|\nabla^2 f(x^k)(x^k - x^*) - (\nabla f(x^k) - \nabla f(x^*))\|.$$

By using Taylor’s theorem (see (3.3.4) with $x = x^k$ and $p = x^* - x^k$), we have

$$\nabla f(x^k) - \nabla f(x^*) = \int_0^1 \nabla^2 f(x^k + t(x^* - x^k)) (x^k - x^*) \, dt.$$

By using this result along with the Lipschitz condition (7.1.2), we have

(7.1.9) $$\begin{aligned}
\|\nabla^2 f(x^k)(x^k - x^*) - (\nabla f(x^k) - \nabla f(x^*))\| &= \left\| \int_0^1 [\nabla^2 f(x^k) - \nabla^2 f(x^k + t(x^* - x^k))](x^k - x^*) \, dt \right\| \\
&\le \int_0^1 \|\nabla^2 f(x^k) - \nabla^2 f(x^k + t(x^* - x^k))\| \, \|x^k - x^*\| \, dt \\
&\le \left( \int_0^1 M t \, dt \right) \|x^k - x^*\|^2 = \frac{1}{2} M \|x^k - x^*\|^2.
\end{aligned}$$
From the Wielandt–Hoffman inequality [28] and (7.1.2), we have that

$$|\lambda_{\min}(\nabla^2 f(x^k)) - \lambda_{\min}(\nabla^2 f(x^*))| \le \|\nabla^2 f(x^k) - \nabla^2 f(x^*)\| \le M \|x^k - x^*\|,$$
where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a symmetric matrix. Thus for

(7.1.10) $$\|x^k - x^*\| \le \frac{\gamma}{2M},$$

we have

$$\lambda_{\min}(\nabla^2 f(x^k)) \ge \lambda_{\min}(\nabla^2 f(x^*)) - M \|x^k - x^*\| \ge \gamma - M \frac{\gamma}{2M} \ge \frac{\gamma}{2},$$

so that $\|\nabla^2 f(x^k)^{-1}\| \le 2/\gamma$. By substituting this result together with (7.1.9) into (7.1.8), we obtain

$$\|x^{k+1} - x^*\| \le \frac{2}{\gamma} \frac{M}{2} \|x^k - x^*\|^2 = \frac{M}{\gamma} \|x^k - x^*\|^2,$$

verifying the local quadratic convergence rate. By applying (7.1.10) again, we have

$$\|x^{k+1} - x^*\| \le \left( \frac{M}{\gamma} \|x^k - x^*\| \right) \|x^k - x^*\| \le \frac{1}{2} \|x^k - x^*\|,$$

so, by arguing inductively, we see that the sequence converges to $x^*$ provided that $x^0$ satisfies (7.1.10), as claimed. □
Of course, we do not need to explicitly identify a starting point $x^0$ in the stated region of convergence. Any sequence that approaches $x^*$ will eventually enter this region, and thereafter the quadratic convergence guarantees apply.
We have established that Newton’s method converges rapidly once the iterates
enter the neighborhood of a point x∗ satisfying second-order sufficient optimality
conditions. But what happens when we start far from such a point?
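Before addressing that question, here is a minimal sketch of the basic iteration (7.1.5); the separable test function is an illustrative assumption, not from the text. The distance to the minimizer roughly squares at each step, as Theorem 7.1.6 predicts.

```python
import numpy as np

def newton(grad, hess, x0, iters=6):
    """Pure Newton iteration (7.1.5): solve the Newton system rather
    than forming the inverse Hessian explicitly."""
    x = x0.astype(float)
    trace = []
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
        trace.append(np.linalg.norm(x))   # distance to x* = 0 here
    return x, trace

# Example: f(x) = sum_i (exp(x_i) - x_i), strictly convex, minimized
# at x* = 0, with a nonconstant positive definite Hessian.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
x, errs = newton(grad, hess, np.full(3, 0.5))
```

Printing `errs` shows the characteristic quadratic decay: the number of correct digits roughly doubles per iteration until machine precision is reached.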
7.2. Newton’s Method for Convex Functions When the function f is convex as
well as smooth, we can devise variants of Newton’s method for which global
convergence and complexity results (in particular, results based on those of Sec-
tion 4.5) can be proved in addition to local quadratic convergence.
When f is strongly convex with modulus γ and satisfies Lipschitz continuity
of the gradient (3.3.6), the Hessian ∇2 f(xk ) is positive definite for all k, with
all eigenvalues in the interval [γ, L]. Thus, the Newton direction (7.1.4) is well
defined at all iterates xk , and is a descent direction satisfying the condition (4.5.1)
with $\eta = \gamma/L$. To verify this claim, note first that

$$\|p^k\| \le \|\nabla^2 f(x^k)^{-1}\| \, \|\nabla f(x^k)\| \le \frac{1}{\gamma} \|\nabla f(x^k)\|.$$

Then

$$(p^k)^T \nabla f(x^k) = -\nabla f(x^k)^T \nabla^2 f(x^k)^{-1} \nabla f(x^k) \le -\frac{1}{L} \|\nabla f(x^k)\|^2 \le -\frac{\gamma}{L} \|\nabla f(x^k)\| \, \|p^k\|.$$
We can use the Newton direction in the line-search framework of Subsection 4.5
to obtain a method for which xk → x∗ , where x∗ is the (unique) global minimizer
of f. (This claim follows from the property (4.5.6) together with the fact that x∗ is
the only point for which $\nabla f(x^*) = 0$.) We can even obtain a complexity result — an $O(1/\sqrt{T})$ bound on $\min_{0 \le k \le T-1} \|\nabla f(x^k)\|$ — from Theorem 4.5.3.
These global convergence properties are enhanced by the local quadratic con-
vergence property of Theorem 7.1.6 if we modify the line-search framework by
accepting the step length αk = 1 in (4.0.1) whenever it satisfies the weak Wolfe
conditions (4.5.2). (It can be shown, by again using arguments based on Taylor’s
theorem (Theorem 3.3.1), that these conditions will be satisfied by αk = 1 for all
xk sufficiently close to the minimizer x∗ .)
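To make the preceding discussion concrete, the following is a hedged sketch (names and tolerances are illustrative, not from the text) of a line-search Newton method that tries the unit step first, so that full Newton steps — and hence quadratic convergence — take over near the solution:

```python
import numpy as np

# Sketch of a line-search Newton method for smooth, strongly convex f.
# The Newton direction (7.1.4) is computed from the exact Hessian; the
# backtracking loop enforces a sufficient-decrease condition, and the unit
# step alpha = 1 is tried first so that pure Newton steps are accepted
# near the minimizer.
def newton_linesearch(f, grad, hess, x0, tol=1e-8, c1=1e-4, max_iter=100):
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)       # Newton direction
        alpha = 1.0                            # try the unit step first
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5                       # backtrack
        x = x + alpha * p
    return x
```

On a toy strongly convex function such as f(x) = Σᵢ(xᵢ⁴/4 + xᵢ²/2), the unit step is accepted throughout and the iterates converge in a handful of iterations.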
Consider now the case in which f is convex and satisfies condition (3.3.6) but
is not strongly convex. Here, the Hessian ∇2 f(xk ) may be singular for some k, so
the direction (7.1.4) may not be well defined. However, by adding a positive
multiple λk I of the identity, we can ensure that the modified Newton direction
defined by
(7.2.1) pk = −[∇2 f(xk ) + λk I]−1 ∇f(xk ),
is well defined and is a descent direction for f. For any η ∈ (0, 1) in (4.5.1),
by choosing λk large enough that λk /(L + λk ) ≥ η, we ensure that the condition
(4.5.1) is satisfied too, so we can use the resulting direction pk in the line-search
framework of Subsection 4.5 to obtain a method that converges to a solution
x∗ of (1.0.1), when one exists.
If, in addition, the minimizer x∗ is unique and satisfies a second-order suffi-
cient condition (so that ∇2 f(x∗ ) is positive definite), then ∇2 f(xk ) will be positive
definite too for k sufficiently large. Thus, provided that η is sufficiently small,
the unmodified Newton direction (with λk = 0 in (7.2.1)) will satisfy the condi-
tion (4.5.1). If we use (7.2.1) in the line-search framework of Section 4.5, but set
λk = 0 where possible, and accept αk = 1 as the step length whenever it satisfies
(4.5.2), we can obtain local quadratic convergence to x∗ , in addition to the global
convergence and complexity promised by Theorem 4.5.3.
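A minimal sketch of the damping rule (assuming the gradient's Lipschitz constant L is known; setting λk = ηL/(1 − η) is one illustrative choice, which gives λk/(L + λk) = η exactly):

```python
import numpy as np

# Sketch of the modified Newton direction (7.2.1) for a convex f whose
# Hessian may be singular. Setting lam = eta * L / (1 - eta) yields
# lam / (L + lam) = eta, so the descent condition (4.5.1) holds.
def modified_newton_direction(hess_x, grad_x, L, eta=0.1):
    lam = eta * L / (1.0 - eta)
    H = hess_x + lam * np.eye(len(grad_x))   # always positive definite
    return np.linalg.solve(H, -grad_x)
```

Even in the extreme case of a zero (singular) Hessian, the damped system is solvable and the resulting direction is a descent direction.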
7.3. Newton Methods for Nonconvex Functions For smooth nonconvex f, the
Hessian ∇2 f(xk ) may be indefinite for some k. The Newton direction (7.1.4)
may not exist (when ∇2 f(xk ) is singular) or it may not be a descent direction
(when ∇2 f(xk ) has negative eigenvalues). However, we can still define a modified
Newton direction as in (7.2.1), which will be a descent direction for λk sufficiently
large, and thus can be used in the line-search framework of Section 4.5. For a
given η in (4.5.1), a sufficient condition for pk from (7.2.1) to satisfy (4.5.1) is that
\[ \frac{\lambda_k + \lambda_{\min}(\nabla^2 f(x^k))}{\lambda_k + L} \geq \eta, \]
where λmin (∇2 f(xk )) is the minimum eigenvalue of the Hessian, which may be
negative. The line-search framework of Section 4.5 can then be applied to ensure
that ∇f(xk ) → 0.
Once again, if the iterates {xk } enter the neighborhood of a local solution x∗
for which ∇2 f(x∗ ) is positive definite, some enhancements of the strategy for
choosing λk and the step length αk can recover the local quadratic convergence
of Theorem 7.1.6.
92 Optimization Algorithms for Data Analysis

Formula (7.2.1) is not the only way to modify the Newton direction to ensure
descent in a line-search framework. Other approaches are outlined in [36, Chapter 3]. One such technique is to modify the Cholesky factorization of ∇2 f(xk ) by
adding positive elements to the diagonal only as needed to allow the factoriza-
tion to proceed (that is, to avoid taking the square root of a negative number),
then using the modified factorization in place of ∇2 f(xk ) in the calculation of the
Newton step pk . Another technique is to compute an eigenvalue decomposition
∇2 f(xk ) = Qk Λk QTk (where Qk is orthogonal and Λk is the diagonal matrix con-
taining the eigenvalues), then define Λ̃k to be a modified version of Λk in which
all the diagonals are positive. Then, following (7.1.4), pk can be defined as
\[ p^k := -Q_k \tilde{\Lambda}_k^{-1} Q_k^T \nabla f(x^k). \]
When an appropriate strategy is used to define Λ̃k , we can ensure satisfaction
of the descent condition (4.5.1) for some η > 0. As above, the line-search framework of Section 4.5 can be used to obtain an algorithm that generates a sequence
{xk } such that ∇f(xk ) → 0. We noted earlier that this condition ensures that all
accumulation points x̂ are stationary points, that is, they satisfy ∇f(x̂) = 0.
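The eigenvalue-modification strategy might be sketched as follows; the flooring rule max(|λᵢ|, δ) is one assumed choice among several for producing positive diagonals:

```python
import numpy as np

# Sketch of the eigenvalue-modification approach: decompose the (possibly
# indefinite) Hessian, replace each eigenvalue by max(|lambda_i|, delta)
# so that all diagonals of the modified Lambda are positive, and use the
# modified matrix to compute a descent direction.
def eig_modified_newton(hess_x, grad_x, delta=1e-3):
    lam, Q = np.linalg.eigh(hess_x)
    lam_tilde = np.maximum(np.abs(lam), delta)    # all diagonals positive
    return -Q @ ((Q.T @ grad_x) / lam_tilde)
```

When the Hessian is already positive definite with eigenvalues above δ, the modification is a no-op and the pure Newton direction is recovered.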
Stronger guarantees can be obtained from a trust-region version of Newton’s
method, which ensures convergence to a point satisfying second-order necessary
conditions, that is, ∇2 f(x̂) ⪰ 0 in addition to ∇f(x̂) = 0. The trust-region approach
was developed in the late 1970s and early 1980s, and has become popular again
recently because of this appealing global convergence behavior. A trust-region
Newton method also recovers quadratic convergence to solutions x∗ satisfying
second-order-sufficient conditions, without any special modifications. (The trust-
region Newton approach is closely related to cubic regularization [26, 35], which
we discuss in the next section.)
We now outline the trust-region approach. (Further details can be found in
[36, Chapter 4].) The subproblem to be solved at each iteration is
\[ (7.3.1) \qquad \min_d \; f(x^k) + \nabla f(x^k)^T d + \frac{1}{2} d^T \nabla^2 f(x^k) d \quad \text{subject to } \|d\|_2 \leq \Delta_k. \]
The objective is a second-order Taylor-series approximation while Δk is the radius
of the trust region — the region within which we trust the second-order model
to capture the true behavior of f. Somewhat surprisingly, the problem (7.3.1) is
not too difficult to solve, even when the Hessian ∇2 f(xk ) is indefinite. In fact, the
solution dk of (7.3.1) satisfies the linear system
(7.3.2) [∇2 f(xk ) + λI]dk = −∇f(xk ), for some λ ≥ 0,
where λ is chosen such that ∇2 f(xk ) + λI is positive semidefinite and λ > 0 only if
‖dk ‖ = Δk (see [31]). Solving (7.3.1) thus reduces to a search for the appropriate
value of the scalar λk , for which specialized methods have been devised.
For large-scale problems, it may be too expensive to solve (7.3.1) near-exactly,
since the process may require several factorizations of an n × n matrix (namely,
the coefficient matrix in (7.3.2), for different values of λ). A popular approach
for finding approximate solutions of (7.3.1), which can be used when ∇2 f(xk )
is positive definite, is the dogleg method. In this method the curved path traced
out by solutions of (7.3.2) for values of λ in the interval [0, ∞) is approximated
by a simpler path consisting of two line segments. The first segment joins 0 to
the point $d^k_C$ that minimizes the objective in (7.3.1) along the direction −∇f(xk ),
while the second segment joins $d^k_C$ to the pure Newton step defined in (7.1.4). The
approximate solution is taken to be the point at which this “dogleg” path crosses
the boundary of the trust region ‖d‖ ≤ Δk . If the dogleg path lies entirely inside
the trust region, we take dk to be the pure Newton step. See [36, Section 4.1].
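A sketch of the dogleg computation, under the assumption that the Hessian passed in is positive definite (variable names are illustrative):

```python
import numpy as np

# Sketch of the dogleg step. d_newton is the pure Newton step and d_cauchy
# the minimizer of the model along -g; the returned step is where the
# two-segment path exits the trust region of radius delta.
def dogleg_step(g, H, delta):
    d_newton = np.linalg.solve(H, -g)
    if np.linalg.norm(d_newton) <= delta:
        return d_newton                        # path lies inside the region
    d_cauchy = -(g @ g) / (g @ H @ g) * g      # Cauchy point along -g
    if np.linalg.norm(d_cauchy) >= delta:
        return -delta * g / np.linalg.norm(g)  # truncate the first segment
    # find t in [0, 1] with ||d_cauchy + t (d_newton - d_cauchy)|| = delta
    v = d_newton - d_cauchy
    a, b, c = v @ v, 2 * d_cauchy @ v, d_cauchy @ d_cauchy - delta**2
    t = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return d_cauchy + t * v
```

When the trust region is large enough to contain the Newton step, the function simply returns it, matching the behavior described above.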
Having discussed the trust-region subproblem (7.3.1), let us outline how it can
be used as the basis for a complete algorithm. A crucial role is played by the ratio
between the amount of decrease in f predicted by the quadratic objective in (7.3.1) and
the actual decrease in f, namely, f(xk ) − f(xk + dk ). Ideally, this ratio would be close
to 1. If it is greater than a small tolerance (say, 10−4 ), we accept the step
and proceed to the next iteration. Otherwise, we conclude that the trust-region
radius Δk is too large, so we do not take the step, shrink the trust region, and
re-solve (7.3.1) to obtain a new step. Additionally, when the actual-to-predicted
ratio is close to 1, we conclude that a larger trust region may hasten progress, so
we increase Δk for the next iteration, provided that the bound ‖dk ‖ ≤ Δk really is
active at the solution of (7.3.1).
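The acceptance logic just described might be sketched as follows (the thresholds 10⁻⁴ and 0.75 and the update factors are illustrative choices; `solve_tr` stands in for any solver of the subproblem (7.3.1)):

```python
import numpy as np

# Sketch of a trust-region outer loop. The ratio rho of actual to predicted
# decrease drives step acceptance and the radius update; the region expands
# only when the bound ||d|| <= delta is (nearly) active.
def trust_region(f, grad, hess, solve_tr, x0, delta0=1.0, tol=1e-8,
                 max_iter=200):
    x, delta = x0, delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        d = solve_tr(g, H, delta)
        predicted = -(g @ d + 0.5 * d @ H @ d)   # model decrease
        actual = f(x) - f(x + d)
        rho = actual / predicted
        if rho > 1e-4:
            x = x + d                            # accept the step
            if rho > 0.75 and np.linalg.norm(d) >= 0.99 * delta:
                delta *= 2.0                     # bound active: expand
        else:
            delta *= 0.25                        # model poor: shrink
    return x
```

Any subproblem solver can be plugged in; even simply truncating the Newton step to the trust region suffices on easy convex examples.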
Unlike a basic line-search method, the trust-region Newton method can “es-
cape” from a saddle point. Suppose we have ∇f(xk ) = 0 and ∇2 f(xk ) indefinite
with some strictly negative eigenvalues. Then, the solution dk to (7.3.1) will be
nonzero, and the algorithm will step away from the saddle point, in the direc-
tion of most negative curvature for ∇2 f(xk ). Another appealing feature of the
trust-region Newton approach is that when the sequence {xk } approaches a point
x∗ satisfying second-order sufficient conditions, the trust region bound becomes
inactive, and the method takes pure Newton steps (7.1.4) for all sufficiently large
k, so it recovers the local quadratic convergence that characterizes Newton’s method.
The basic difference between line-search and trust-region methods can be sum-
marized as follows. Line-search methods first choose a direction pk , then decide
how far to move along that direction. Trust-region methods do the opposite: They
choose the distance Δk first, then find the direction that makes the best progress
for this step length.
7.4. A Cubic Regularization Approach Trust-region Newton methods have the
significant advantage of guaranteeing that any accumulation points will satisfy
second-order necessary conditions. A related approach based on cubic regulariza-
tion has similar properties, plus some additional complexity guarantees. Cubic
regularization requires the Hessian to be Lipschitz continuous, as in (7.1.2). It
follows that the following cubic function yields a global upper bound for f:
\[ (7.4.1) \qquad T_M(z; x) := f(x) + \nabla f(x)^T (z - x) + \frac{1}{2} (z - x)^T \nabla^2 f(x) (z - x) + \frac{M}{6} \|z - x\|^3. \]
Specifically, we have for any x that
\[ f(z) \leq T_M(z; x), \quad \text{for all } z. \]
The basic cubic regularization algorithm starting from x0 proceeds as follows:
\[ (7.4.2) \qquad x^{k+1} = \arg\min_z T_M(z; x^k), \quad k = 0, 1, 2, \ldots. \]
The complexity properties of this approach were analyzed in [35], with variants
being studied in [26] and [12, 13]. Rather than present the theory for the method
based on (7.4.2), we describe an elementary algorithm that makes use of the ex-
pansion (7.4.1) as well as the steepest-descent theory of Subsection 4.1. Our algo-
rithm aims to identify a point that approximately satisfies second-order necessary
conditions, that is,
\[ (7.4.3) \qquad \|\nabla f(x)\| \leq \epsilon_g, \qquad \lambda_{\min}(\nabla^2 f(x)) \geq -\epsilon_H, \]
where εg and εH are two small constants. In addition to Lipschitz continuity of
the Hessian (7.1.2), we assume Lipschitz continuity of the gradient with constant
L (see (3.3.6)), and also that the objective f is lower-bounded by some number f̄.
Our algorithm takes steps of two types: a steepest-descent step, as in Subsec-
tion 4.1, or a step in a negative curvature direction for ∇2 f. Iteration k proceeds
as follows:
(i) If ‖∇f(xk )‖ > εg , take the steepest descent step (4.1.1).
(ii) Otherwise, if λmin (∇2 f(xk )) < −εH , choose pk to be the eigenvector corresponding to the most negative eigenvalue of ∇2 f(xk ). Choose the size
and sign of pk such that ‖pk ‖ = 1 and (pk )T ∇f(xk ) ≤ 0, and set
\[ (7.4.4) \qquad x^{k+1} = x^k + \alpha_k p^k, \quad \text{where } \alpha_k = \frac{2\epsilon_H}{M}. \]
If neither of these conditions hold, then xk satisfies the approximate second-order
necessary conditions (7.4.3), so we terminate.
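In outline, the two-step algorithm might be implemented as below (a sketch that assumes exact eigenvalue computations are affordable and that the constants L and M are known):

```python
import numpy as np

# Sketch of the elementary algorithm above: take a gradient step while the
# gradient is large; otherwise step along the most negative curvature
# direction with step length 2 * eps_H / M; stop when the approximate
# second-order necessary conditions (7.4.3) hold.
def approx_second_order(grad, hess, x0, L, M, eps_g=1e-4, eps_H=1e-4,
                        max_iter=10000):
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps_g:
            x = x - (1.0 / L) * g              # steepest-descent step (i)
            continue
        lam, Q = np.linalg.eigh(hess(x))
        if lam[0] < -eps_H:
            p = Q[:, 0]                        # most negative curvature
            if p @ g > 0:
                p = -p                         # ensure (p)^T grad <= 0
            x = x + (2.0 * eps_H / M) * p      # step (ii), i.e., (7.4.4)
        else:
            return x     # approximate conditions (7.4.3) hold
    return x
```

On a toy function with a saddle point, such as f(x, y) = x⁴/4 − x²/2 + y²/2 (an assumed example), the negative-curvature step moves the iterates off the saddle at the origin toward a point satisfying (7.4.3).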
For the steepest-descent step (i), we have from (4.1.3) that
\[ (7.4.5) \qquad f(x^{k+1}) \leq f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2 \leq f(x^k) - \frac{\epsilon_g^2}{2L}. \]
For a step of type (ii), we have from (7.4.1) that
\begin{align*}
f(x^{k+1}) &\leq f(x^k) + \alpha_k \nabla f(x^k)^T p^k + \frac{1}{2} \alpha_k^2 (p^k)^T \nabla^2 f(x^k) p^k + \frac{1}{6} M \alpha_k^3 \|p^k\|^3 \\
(7.4.6) \qquad &\leq f(x^k) - \frac{1}{2} \left( \frac{2\epsilon_H}{M} \right)^2 \epsilon_H + \frac{1}{6} M \left( \frac{2\epsilon_H}{M} \right)^3 \\
&= f(x^k) - \frac{2}{3} \frac{\epsilon_H^3}{M^2}.
\end{align*}
By aggregating (7.4.5) and (7.4.6), we have that at each xk for which the condition
(7.4.3) does not hold, we attain a decrease in the objective of at least
 
\[ \min \left\{ \frac{\epsilon_g^2}{2L}, \; \frac{2}{3} \frac{\epsilon_H^3}{M^2} \right\}. \]
Using the lower bound f̄ on the objective f, we see that the number of iterations
K required must satisfy the condition
\[ K \min \left\{ \frac{\epsilon_g^2}{2L}, \; \frac{2}{3} \frac{\epsilon_H^3}{M^2} \right\} \leq f(x^0) - \bar{f}, \]
from which we conclude that
\[ K \leq \max \left\{ 2L \epsilon_g^{-2}, \; \frac{3}{2} M^2 \epsilon_H^{-3} \right\} \left( f(x^0) - \bar{f} \right). \]
We also observe that the maximum number of iterates required to identify a
point at which only the approximate stationarity condition ‖∇f(xk )‖ ≤ εg holds
is $2L \epsilon_g^{-2} (f(x^0) - \bar{f})$. (We can just omit the second-order part of the algorithm.)
Note too that it is easy to devise approximate versions of this algorithm with simi-
lar complexity. For example, the negative curvature direction pk in step (ii) above
can be replaced by an approximation to the direction of most negative curvature,
obtained by the Lanczos iteration with random initialization.
In algorithms that make more complete use of the cubic model (7.4.1), the term
$\epsilon_g^{-2}$ in the complexity expression becomes $\epsilon_g^{-3/2}$, and the constants are different.
The subproblems (7.4.1) are more complicated to solve than those in the simple
scheme above. Other algorithms that achieve complexities similar to those of the
cubic regularization approach are the subject of active research. A variety of
methods that make use of Newton-type steps, approximate negative curvature di-
rections, accelerated gradient methods, random perturbations, randomized Lanc-
zos and conjugate gradient methods, and other algorithmic elements have been
proposed.

8. Conclusions
We have outlined various algorithmic tools from optimization that are useful
for solving problems in data analysis and machine learning, and presented their
basic theoretical properties. The intersection of optimization and machine learn-
ing is a fruitful and very popular area of current research. All the major machine
learning conferences have a large contingent of optimization papers, and there is
a great deal of interest in developing algorithmic tools to meet new challenges
and in understanding their properties. The edited volume [41] contains a snap-
shot of the state of the art circa 2010, but this is a fast-moving field and there have
been many developments since then.

Acknowledgments
I thank Ching-pei Lee for a close reading and many helpful suggestions, and
David Hong and an anonymous referee for detailed, excellent comments.
References
[1] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from highly incom-
plete information, 48th Annual Allerton Conference on Communication, Control, and Computing,
2010, pp. 704–711. ←57
[2] L. Balzano and S. J. Wright, Local convergence of an algorithm for subspace identification from par-
tial data, Found. Comput. Math. 15 (2015), no. 5, 1279–1314, DOI 10.1007/s10208-014-9227-7.
MR3394711 ←58
[3] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems,
SIAM J. Imaging Sci. 2 (2009), no. 1, 183–202, DOI 10.1137/080716542. MR2486527 ←80, 83
[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers,
Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–
152. ←60
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learn-
ing via the alternating direction methods of multipliers, Foundations and Trends in Machine Learning
3 (2011), no. 1, 1–122. ←51
[6] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, Cambridge, 2004.
MR2061575 ←51
[7] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine
Learning 8 (2015), no. 3–4, 231–357. ←83, 84
[8] S. Bubeck, Y. T. Lee, and M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent,
Technical Report arXiv:1506.08187, Microsoft Research, 2015. ←83
[9] S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs
via low-rank factorization, Math. Program. 95 (2003), no. 2, Ser. B, 329–357, DOI 10.1007/s10107-
002-0352-8. Computational semidefinite and second order cone programming: the state of the art.
MR1976484 ←55
[10] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computa-
tional Mathematics 9 (2009), 717–772. ←55
[11] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, J. ACM 58 (2011),
no. 3, Art. 11, 37, DOI 10.1145/1970392.1970395. MR2811000 ←57
[12] C. Cartis, N. I. M. Gould, and P. L. Toint, Adaptive cubic regularisation methods for unconstrained
optimization. Part I: motivation, convergence and numerical results, Math. Program. 127 (2011), no. 2,
Ser. A, 245–295, DOI 10.1007/s10107-009-0286-5. MR2776701 ←94
[13] C. Cartis, N. I. M. Gould, and P. L. Toint, Adaptive cubic regularisation methods for unconstrained
optimization. Part II: worst-case function- and derivative-evaluation complexity, Math. Program. 130
(2011), no. 2, Ser. A, 295–319, DOI 10.1007/s10107-009-0337-y. MR2855872 ←94
[14] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, Rank-sparsity incoherence for
matrix decomposition, SIAM J. Optim. 21 (2011), no. 2, 572–596, DOI 10.1137/090761793. MR2817479
←57
[15] Y. Chen and M. J. Wainwright, Fast low-rank estimation by projected gradent descent: General statistical
and algorithmic guarantees, UC-Berkeley, 2015, arXiv:1509.03025. ←57
[16] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning 20 (1995), 273–297. ←60
[17] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection,
SIAM J. Matrix Anal. Appl. 30 (2008), no. 1, 56–66, DOI 10.1137/060670985. MR2399568 ←56
[18] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet, A direct formulation for sparse
PCA using semidefinite programming, SIAM Rev. 49 (2007), no. 3, 434–448, DOI 10.1137/050645506.
MR2353806 ←56
[19] T. Dasu and T. Johnson, Exploratory data mining and data cleaning, Wiley Series in Probability and
Statistics, Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2003. MR1979601 ←52
[20] P. Drineas and M. W. Mahoney, Lectures on randomized numerical linear algebra, The Mathematics
of Data, IAS/Park City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018. ←54
[21] D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first order method based on optimal quadratic
averaging, SIAM J. Optim. 28 (2018), no. 1, 251–271, DOI 10.1137/16M1072528. MR3757113 ←83
[22] J. C. Duchi, Introductory lectures on stochastic optimization, The Mathematics of Data, IAS/Park
City Math. Ser., vol. 25, Amer. Math. Soc., Providence, RI, 2018. ←51, 64, 71
[23] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point
algorithm for maximal monotone operators, Math. Programming 55 (1992), no. 3, Ser. A, 293–318,
DOI 10.1007/BF01581204. MR1168183 ←51
[24] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Res. Logist. Quart. 3 (1956),
95–110, DOI 10.1002/nav.3800030109. MR0089102 ←76
[25] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso,
Biostatistics 9 (2008), no. 3, 432–441. ←56
[26] A. Griewank, The modification of Newton’s method for unconstrained optimization by bounding cubic
terms, Technical Report NA/12, DAMTP, Cambridge University, 1981. ←92, 94
[27] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Research
Nat. Bur. Standards 49 (1952), 409–436 (1953). MR0060307 ←81
[28] A. J. Hoffman and H. W. Wielandt, The variation of the spectrum of a normal matrix, Duke Math. J.
20 (1953), 37–39. MR0052379 ←89
[29] J. D Lee, M. Simchowitz, M. I Jordan, and B. Recht, Gradient descent only converges to minimizers,
Conference on learning theory, 2016, pp. 1246–1257. ←75
[30] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math.
Programming 45 (1989), no. 3, (Ser. B), 503–528, DOI 10.1007/BF01589116. MR1038245 ←51, 82
[31] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM J. Sci. Statist. Comput. 4 (1983),
no. 3, 553–572, DOI 10.1137/0904038. MR723110 ←92
[32] A. S. Nemirovsky and D. B. Yudin, Problem complexity and method efficiency in optimization, John
Wiley & Sons, Inc., New York, 1983. Translated from the Russian and with a preface by E. R.
Dawson; Wiley-Interscience Series in Discrete Mathematics. MR702836 ←87
[33] Yu. E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k2 )
(Russian), Dokl. Akad. Nauk SSSR 269 (1983), no. 3, 543–547. MR701288 ←80
[34] Y. Nesterov, Introductory lectures on convex optimization, Applied Optimization, vol. 87, Kluwer
Academic Publishers, Boston, MA, 2004. A basic course. MR2142598 ←80, 84
[35] Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Math.
Program. 108 (2006), no. 1, Ser. A, 177–205, DOI 10.1007/s10107-006-0706-8. MR2229459 ←92, 94
[36] J. Nocedal and S. J. Wright, Numerical Optimization, Second, Springer, New York, 2006. ←51, 74,
82, 92, 93
[37] B. T. Poljak, Some methods of speeding up the convergence of iterative methods (Russian), Ž. Vyčisl. Mat.
i Mat. Fiz. 4 (1964), 791–803. MR0169403 ←80
[38] B. T. Polyak, Introduction to optimization, Translations Series in Mathematics and Engineering,
Optimization Software, Inc., Publications Division, New York, 1987. Translated from the Russian;
With a foreword by Dimitri P. Bertsekas. MR1099605 ←80, 81
[39] B. Recht, M. Fazel, and P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equa-
tions via nuclear norm minimization, SIAM Rev. 52 (2010), no. 3, 471–501, DOI 10.1137/070697835.
MR2680543 ←55
[40] R. T. Rockafellar, Convex analysis, Princeton Mathematical Series, No. 28, Princeton University
Press, Princeton, N.J., 1970. MR0274683 ←65
[41] S. Sra, S. Nowozin, and S. J. Wright (eds.), Optimization for machine learning, NIPS Workshop
Series, MIT Press, 2011. ←95
[42] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B 58 (1996),
no. 1, 267–288. MR1379242 ←54
[43] M. J. Todd, Semidefinite optimization, Acta Numer. 10 (2001), 515–560, DOI
10.1017/S0962492901000071. MR2009698 ←51
[44] B. A. Turlach, W. N. Venables, and S. J. Wright, Simultaneous variable selection, Technometrics 47
(2005), no. 3, 349–363, DOI 10.1198/004017005000000139. MR2164706 ←57
[45] L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Rev. 38 (1996), no. 1, 49–95, DOI
10.1137/1038003. MR1379041 ←51
[46] S. J. Wright, Primal-dual interior-point methods, Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA, 1997. MR1422257 ←51
[47] S. J. Wright, Coordinate descent algorithms, Math. Program. 151 (2015), no. 1, Ser. B, 3–34, DOI
10.1007/s10107-015-0892-3. MR3347548 ←51
[48] X. Yi, D. Park, Y. Chen, and C. Caramanis, Fast algorithms for robust PCA via gradient descent,
Advances in Neural Information Processing Systems 29, 2016, pp. 4152–4160. ←57

Computer Sciences Department, University of Wisconsin-Madison, Madison, WI 53706


Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 99–185
https://doi.org/10.1090/pcms/025/00831

Introductory Lectures on Stochastic Optimization

John C. Duchi

Contents
1 Introduction 100
1.1 Scope, limitations, and other references 101
1.2 Notation 102
2 Basic Convex Analysis 103
2.1 Introduction and Definitions 103
2.2 Properties of Convex Sets 105
2.3 Continuity and Local Differentiability of Convex Functions 112
2.4 Subgradients and Optimality Conditions 114
2.5 Calculus rules with subgradients 119
3 Subgradient Methods 122
3.1 Introduction 122
3.2 The gradient and subgradient methods 123
3.3 Projected subgradient methods 129
3.4 Stochastic subgradient methods 132
4 The Choice of Metric in Subgradient Methods 140
4.1 Introduction 141
4.2 Mirror Descent Methods 141
4.3 Adaptive stepsizes and metrics 151
5 Optimality Guarantees 157
5.1 Introduction 157
5.2 Le Cam’s Method 162
5.3 Multiple dimensions and Assouad’s Method 167
A Technical Appendices 172
A.1 Continuity of Convex Functions 172
A.2 Probability background 173
A.3 Auxiliary results on divergences 175
B Questions and Exercises 176
2010 Mathematics Subject Classification. Primary 65Kxx; Secondary 90C15, 62C20.
Key words and phrases. Convexity, stochastic optimization, subgradients, mirror descent, minimax op-
timal.

©2018 American Mathematical Society

1. Introduction
In this set of four lectures, we study the basic analytical tools and algorithms
necessary for the solution of stochastic convex optimization problems, as well as
for providing optimality guarantees associated with the methods. As we proceed
through the lectures, we will be more exact about the precise problem formula-
tions, providing a number of examples, but roughly, by a stochastic optimization
problem we mean a numerical optimization problem that arises from observing
data from some (random) data-generating process. We focus almost exclusively
on first-order methods for the solution of these types of problems, as they have
proven quite successful in the large scale problems that have driven many ad-
vances throughout the 2000s.
Our main goal in these lectures, as in the lectures by S. Wright [61] in this vol-
ume, is to develop methods for the solution of optimization problems arising in
large-scale data analysis. Our route will be somewhat circuitous, as we will first
build the necessary convex analytic and other background (see Lecture 2), but
broadly, the problems we wish to solve are the problems arising in stochastic con-
vex optimization. In these problems, we have samples S coming from a sample
space S, drawn from a distribution P, and we have some decision vector x ∈ Rn
that we wish to choose to minimize the expected loss
\[ (1.0.1) \qquad f(x) := \mathbb{E}_P[F(x; S)] = \int_S F(x; s) \, dP(s), \]
where F is convex in its first argument.
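As a toy illustration of this setup (the quadratic loss below is an assumed example, not from the text), the population objective is typically replaced by an average over observed samples:

```python
import numpy as np

# Sketch of the setting (1.0.1): the population risk f(x) = E_P[F(x; S)]
# is approximated by the sample average over draws S_1, ..., S_m.
# F(x; s) = (x - s)^2 / 2 is an assumed toy loss; its population risk is
# minimized at the mean of P.
def empirical_risk(F, x, samples):
    return sum(F(x, s) for s in samples) / len(samples)

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=10000)   # S_i ~ N(2, 1)
F = lambda x, s: 0.5 * (x - s) ** 2
x_hat = samples.mean()   # empirical minimizer for this particular loss
```

For this loss the empirical minimizer is the sample mean, which concentrates around the population minimizer (the mean of P) as the number of samples grows.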
The methods we consider for minimizing problem (1.0.1) are typically sim-
ple methods that are slower to converge than more advanced methods—such as
Newton or other second-order methods—for deterministic problems, but have the
advantage that they are robust to noise in the optimization problem itself. Con-
sequently, it is often relatively straightforward to derive generalization bounds
for these procedures: if they produce an estimate x̂ exhibiting good performance
on some sample S1 , . . . , Sm drawn from P, then they are likely to exhibit good
performance (on average) for future data, that is, to have small objective f(x̂);
see Lecture 3, and especially Theorem 3.4.13. It is of course often advantageous
to take advantage of problem structure and geometric aspects of the problem,
broadly defined, which is the goal of mirror descent and related methods, which
we discuss in Lecture 4.
The last part of our lectures is perhaps the most unusual for material on opti-
mization, which is to investigate optimality guarantees for stochastic optimization
problems. In Lecture 5, we study the sample complexity of solving problems of
the form (1.0.1). More precisely, we measure the performance of an optimization
procedure given samples S1 , . . . , Sm drawn independently from the population
distribution P, denoted by x̂ = x̂(S1:m ), in a uniform sense: for a class of objective functions F, a procedure’s performance is its expected error—or risk—for the
worst member of the class F. We provide lower bounds on this maximum risk,
showing that the first-order procedures we have developed satisfy certain notions
of optimality.
We briefly outline the coming lectures. The first lecture provides definitions
and the convex analytic tools necessary for the development of our algorithms
and other ideas, developing separation properties of convex sets as well as other
properties of convex functions from basic principles. The second two lectures
investigate subgradient methods and their application to certain stochastic opti-
mization problems, demonstrating a number of convergence results. The second
lecture focuses on standard subgradient-type methods, while the third investi-
gates more advanced material on mirror descent and adaptive methods, which
require more care but can yield substantial practical performance benefits. The
final lecture investigates optimality guarantees for the various methods we study,
demonstrating two standard techniques for proving lower bounds on the ability
of any algorithm to solve stochastic optimization problems.
1.1. Scope, limitations, and other references The lectures assume some limited
familiarity with convex functions and convex optimization problems and their
formulation, which will help appreciation of the techniques herein. All that is
truly essential is a level of mathematical maturity that includes some real analysis,
linear algebra, and introductory probability. In terms of real analysis, a typical
undergraduate course, such as one based on Marsden and Hoffman’s Elementary
Real Analysis [37] or Rudin’s Principles of Mathematical Analysis [50], is sufficient.
Readers should not consider these lectures in any way a comprehensive view of
convex analysis or stochastic optimization. These subjects are well-established,
and there are numerous references.
Our lectures begin with convex analysis, whose study Rockafellar, influenced
by Fenchel, launched in his 1970 book Convex Analysis [49]. We develop the basic
ideas necessary for our treatment of first-order (gradient-based) methods for op-
timization, which includes separating and supporting hyperplane theorems, but
we provide essentially no treatment of the important concepts of Lagrangian and
Fenchel duality, support functions, or saddle point theory more broadly. For these
and other important ideas, I have found the books of Rockafellar [49], Hiriart-
Urruty and Lemaréchal [27, 28], Bertsekas [8], and Boyd and Vandenberghe [12]
illuminating.
Convex optimization itself is a huge topic, with thousands of papers and nu-
merous books on the subject. Because of our focus on solution methods for large-
scale problems arising out of data collection, we are somewhat constrained in
our views. Boyd and Vandenberghe [12] provide an excellent treatment of the
possibilities of modeling engineering and scientific problems as convex optimiza-
tion problems, as well as some important numerical methods. Polyak [47] pro-
vides a treatment of stochastic and non-stochastic methods for optimization from
which ours borrows substantially. Nocedal and Wright [46] and Bertsekas [9]
also describe more advanced methods for the solution of optimization problems,
102 Introductory Lectures on Stochastic Optimization

focusing on non-stochastic optimization problems for which there are many so-
phisticated methods.
Because of our goal to solve problems of the form (1.0.1), we develop first-order
methods that are robust to many types of noise from sampling. There are other
approaches to dealing with data uncertainty, notably robust optimization [6],
where researchers study and develop tractable (polynomial-time-solvable) formu-
lations for a variety of data-based problems in engineering and the sciences. The
book of Shapiro et al. [54] provides a more comprehensive picture of stochastic
modeling problems and optimization algorithms than we have been able to in our
lectures, as stochastic optimization is by itself a major field. Several recent sur-
veys on online learning and online convex optimization provide complementary
treatments to ours [26, 52].
The last lecture traces its roots to seminal work in information-based complex-
ity by Nemirovski and Yudin in the early 1980s [41], who investigate the limits of
“optimal” algorithms, where optimality is defined in a worst-case sense accord-
ing to an oracle model of algorithms given access to function, gradient, or other
types of local information about the problem at hand. Issues of optimal estima-
tion in statistics are as old as the field itself, and the minimax formulation we use
is originally due to Wald in the late 1930s [59, 60]. We prove our results using
information theoretic tools, which have broader applications across statistics, and
that have been developed by many authors [31, 33, 62, 63].
1.2. Notation We use mostly standard notation throughout these notes, but for
completeness, we collect it here. We let R denote the typical field of real numbers,
with Rn having its usual meaning as n-dimensional Euclidean space. Given
vectors x and y, we let ⟨x, y⟩ denote the inner product between x and y. Given a
norm ‖·‖, its dual norm ‖·‖∗ is defined as
\[ \|z\|_* := \sup \{ \langle z, x \rangle \mid \|x\| \leq 1 \}. \]
Hölder’s inequality (see Exercise B.2.4) shows that the p and q norms, defined
by
 n 1
p
xp = |xj |p

j=1

(and as the limit x∞ = maxj |xj |) are dual to one another, where 1/p + 1/q = 1

and p, q ∈ [1, ∞]. Throughout, we will assume that x2 = x, x is the norm
defined by the inner product ·, ·.
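These norm facts are easy to sanity-check numerically. The following Python sketch is our own illustration (not part of the text; the function names and tolerances are ours) verifying Hölder's inequality |⟨x, y⟩| ≤ ‖x‖p ‖y‖q for random vectors and conjugate pairs (p, q):

```python
import math
import random

def lp_norm(x, p):
    """The l_p norm, with p = math.inf treated as the limit max_j |x_j|."""
    if p == math.inf:
        return max(abs(xj) for xj in x)
    return sum(abs(xj) ** p for xj in x) ** (1.0 / p)

def inner(x, y):
    return sum(xj * yj for xj, yj in zip(x, y))

random.seed(0)
for _ in range(100):
    n = random.randint(1, 6)
    x = [random.uniform(-1.0, 1.0) for _ in range(n)]
    y = [random.uniform(-1.0, 1.0) for _ in range(n)]
    for p in (1.0, 1.5, 2.0, 3.0, math.inf):
        # conjugate exponent q, with 1/p + 1/q = 1
        q = math.inf if p == 1.0 else (1.0 if p == math.inf else p / (p - 1.0))
        # Hölder: |<x, y>| <= ||x||_p * ||y||_q
        assert abs(inner(x, y)) <= lp_norm(x, p) * lp_norm(y, q) + 1e-9
```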
We also require notation related to sets. For a sequence of vectors v1 , v2 , v3 , . . .,
we let (vn ) denote the entire sequence. Given sets A and B, we let A ⊂ B denote
that A is a subset of (possibly equal to) B, and A ⊊ B that A is a strict subset of
B. The notation cl A denotes the closure of A, while int A denotes the interior of
the set A. For a function f, the set dom f is its domain. If f : Rn → R ∪ {+∞} is
convex, we let dom f := {x ∈ Rn | f(x) < +∞}.
John C. Duchi 103

2. Basic Convex Analysis


Lecture Summary: In this lecture, we will outline several standard facts
from convex analysis, the study of the mathematical properties of convex
functions and sets. For the most part, our analysis and results will all be
with the aim of setting the necessary background for understanding first-
order convex optimization methods, though some of the results we state will
be quite general.

2.1. Introduction and Definitions This set of lecture notes considers convex op-
timization problems, numerical optimization problems of the form

(2.1.1)    minimize f(x)    subject to x ∈ C,
where f is a convex function and C is a convex set. While we will consider
tools to solve these types of optimization problems presently, this first lecture is
concerned most with the analytic tools and background that underlies solution
methods for these problems.
The starting point for any study of convex functions is the definition and study
of convex sets, which are intimately related to convex functions. To that end, we
recall that a set C ⊂ Rn is convex if for all x, y ∈ C,
λx + (1 − λ)y ∈ C for λ ∈ [0, 1].
See Figure 2.1.2.

Figure 2.1.2. (a) A convex set. (b) A non-convex set.

A convex function is similarly defined: a function f : Rn → (−∞, ∞] is convex
if for all x, y ∈ dom f := {x ∈ Rn : f(x) < +∞},

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all λ ∈ [0, 1].

The epigraph of a function is defined as

    epi f := {(x, t) : f(x) ≤ t},
104 Introductory Lectures on Stochastic Optimization

and by inspection, a function is convex if and only if its epigraph is a convex


set. A convex function f is closed if its epigraph is a closed set; continuous
convex functions are always closed. We will assume throughout that any convex
function we deal with is closed. See Figure 2.1.3 for graphical representations of
these ideas, which make clear that the epigraph is indeed a convex set.

Figure 2.1.3. (a) The convex function f(x) = max{x², −2x − 0.2}
and (b) its epigraph, which is a convex set.

One may ask why, precisely, we focus on convex functions. In short, as Rock-
afellar [49] notes, convex optimization problems are the clearest dividing line
between numerical problems that are efficiently solvable, often by iterative meth-
ods, and numerical problems for which we have no hope. We give one simple
result in this direction first:

Observation. Let f : Rn → R be convex and x be a local minimum of f (respectively


a local minimum over a convex set C). Then x is a global minimum of f (resp. a global
minimum of f over C).

Figure 2.1.4. The slopes (f(x + t) − f(x))/t increase in t, with t1 < t2 < t3.

To see this, note that if x is a local minimum, then for any y ∈ C we have, for
small enough t > 0,

    f(x) ≤ f(x + t(y − x)),  or  0 ≤ [f(x + t(y − x)) − f(x)] / t.

We now use the criterion of increasing slopes, that is, for any convex function f the
function

(2.1.5)    t ↦ [f(x + tu) − f(x)] / t

is increasing in t > 0. (See Fig. 2.1.4.) Indeed, let 0 ≤ t1 ≤ t2. Then

    [f(x + t1 u) − f(x)] / t1
      = (t2 / t1) · [f(x + t2 (t1/t2) u) − f(x)] / t2
      = (t2 / t1) · [f((1 − t1/t2) x + (t1/t2)(x + t2 u)) − f(x)] / t2
      ≤ (t2 / t1) · [(1 − t1/t2) f(x) + (t1/t2) f(x + t2 u) − f(x)] / t2
      = [f(x + t2 u) − f(x)] / t2.

In particular, because 0 ≤ f(x + t(y − x)) − f(x) for small enough t > 0, we see that
for all t > 0 we have

    0 ≤ [f(x + t(y − x)) − f(x)] / t,  or  f(x) ≤ inf_{t≥0} f(x + t(y − x)) ≤ f(y)

for all y ∈ C.
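The criterion of increasing slopes (2.1.5) can also be observed numerically. The following is a small check of ours (the function is the one from Figure 2.1.3; names and tolerances are our own):

```python
def f(x):
    # The convex function of Figure 2.1.3: f(x) = max{x^2, -2x - 0.2}.
    return max(x * x, -2.0 * x - 0.2)

def slope(x, u, t):
    # The difference quotient of (2.1.5) in the direction u.
    return (f(x + t * u) - f(x)) / t

x = 0.5
for u in (1.0, -1.0):
    slopes = [slope(x, u, 0.1 * k) for k in range(1, 30)]
    # Criterion of increasing slopes: t -> (f(x + tu) - f(x))/t is nondecreasing.
    assert all(s1 <= s2 + 1e-12 for s1, s2 in zip(slopes, slopes[1:]))
```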
Most of the results herein apply in general Hilbert (complete inner product)
spaces, and many of our proofs will not require anything particular about finite
dimensional spaces, but for simplicity we use Rn as the underlying space on
which all functions and sets are defined.1 While we present all proofs in the
chapter, we try to provide geometric intuition that will aid a working knowledge
of the results, which we believe is the most important.
2.2. Properties of Convex Sets Convex sets enjoy a number of very nice proper-
ties that allow efficient and elegant descriptions of the sets, as well as providing
a number of nice properties concerning their separation from one another. To
that end, in this section, we give several fundamental properties on separating
and supporting hyperplanes for convex sets. The results here begin by showing
that there is a unique (Euclidean) projection to any convex set C, then use this
fact to show that whenever a point is not contained in a set, it can be separated
from the set by a hyperplane. This result can be extended to show separation
of convex sets from one another and that points in the boundary of a convex set
have a hyperplane tangent to the convex set running through them. We leverage
these results in the sequel by making connections of supporting hyperplanes to
1 The generality of Hilbert, or even Banach, spaces in convex analysis is seldom needed. Readers
familiar with arguments in these spaces will, however, note that the proofs can generally be extended
to infinite dimensional spaces in reasonably straightforward ways.
epigraphs and gradients, results that in turn find many applications in the design
of optimization algorithms as well as optimality certificates.
A few basic properties We list a few simple properties that convex sets have,
which are evident from their definitions. First, if Cα are convex sets for each
α ∈ A, where A is an arbitrary index set, then the intersection

    C = ∩_{α∈A} Cα

is also convex. Additionally, convex sets are closed under scalar multiplication: if
α ∈ R and C is convex, then

    αC := {αx : x ∈ C}

is evidently convex. The Minkowski sum of two convex sets is defined by

    C1 + C2 := {x1 + x2 : x1 ∈ C1, x2 ∈ C2},

and is also convex. To see this, note that if xi, yi ∈ Ci, then

    λ(x1 + x2) + (1 − λ)(y1 + y2) = [λx1 + (1 − λ)y1] + [λx2 + (1 − λ)y2] ∈ C1 + C2,

since the first bracketed term lies in C1 and the second in C2.
In particular, convex sets are closed under all linear combinations: if α ∈ Rm, then
C = ∑_{i=1}^m αi Ci is also convex.
We also define the convex hull of a set of points x1, . . . , xm ∈ Rn by

    Conv{x1, . . . , xm} = { ∑_{i=1}^m λi xi : λi ≥ 0, ∑_{i=1}^m λi = 1 }.
This set is clearly a convex set.
Projections We now turn to a discussion of orthogonal projection onto a con-
vex set, which will allow us to develop a number of separation properties and
alternate characterizations of convex sets. See Figure 2.2.1 for a geometric view
of projection.

Figure 2.2.1. Projection of the point x onto the set C (with pro-
jection πC(x)), exhibiting ⟨x − πC(x), y − πC(x)⟩ ≤ 0.

We begin by stating a classical result about the projection of zero onto a convex
set.

Theorem 2.2.2 (Projection of zero). Let C be a closed convex set not containing the
origin 0. Then there is a unique point xC ∈ C such that ‖xC‖₂ = inf_{x∈C} ‖x‖₂. Moreover,
‖xC‖₂ = inf_{x∈C} ‖x‖₂ if and only if

(2.2.3)    ⟨xC, y − xC⟩ ≥ 0

for all y ∈ C.

Proof. The key to the proof is the following parallelogram identity, which holds
in any inner product space: for any x, y,

(2.2.4)    (1/2)‖x − y‖₂² + (1/2)‖x + y‖₂² = ‖x‖₂² + ‖y‖₂².

Define M := inf_{x∈C} ‖x‖₂. Now, let (xn) ⊂ C be a sequence of points in C such that
‖xn‖₂ → M as n → ∞. By the parallelogram identity (2.2.4), for any n, m ∈ N,
we have

    (1/2)‖xn − xm‖₂² = ‖xn‖₂² + ‖xm‖₂² − (1/2)‖xn + xm‖₂².

Fix ε > 0, and choose N ∈ N such that n ≥ N implies that ‖xn‖₂² ≤ M² + ε. Then
for any m, n ≥ N, we have

(2.2.5)    (1/2)‖xn − xm‖₂² ≤ 2M² + 2ε − (1/2)‖xn + xm‖₂².

Now we use the convexity of the set C. We have (1/2)xn + (1/2)xm ∈ C for any n, m,
which implies

    (1/2)‖xn + xm‖₂² = 2‖(1/2)xn + (1/2)xm‖₂² ≥ 2M²

by definition of M. Using the above inequality in the bound (2.2.5), we see that

    (1/2)‖xn − xm‖₂² ≤ 2M² + 2ε − 2M² = 2ε.

In particular, ‖xn − xm‖₂ ≤ 2√ε; since ε was arbitrary, (xn) forms a Cauchy
sequence and so must converge to a point xC. The continuity of the norm ‖·‖₂
implies that ‖xC‖₂ = inf_{x∈C} ‖x‖₂, and the fact that C is closed implies that the
point xC lies in C.
Now we show the inequality (2.2.3) holds if and only if xC is the projection of
the origin 0 onto C. Suppose that inequality (2.2.3) holds. Then

    ‖xC‖₂² = ⟨xC, xC⟩ ≤ ⟨xC, y⟩ ≤ ‖xC‖₂ ‖y‖₂,

the last inequality following from the Cauchy-Schwarz inequality. Dividing each
side by ‖xC‖₂ implies that ‖xC‖₂ ≤ ‖y‖₂ for all y ∈ C. For the converse, let xC
minimize ‖x‖₂ over C. Then for any t ∈ [0, 1] and any y ∈ C, we have

    ‖xC‖₂² ≤ ‖(1 − t)xC + ty‖₂²
           = ‖xC + t(y − xC)‖₂² = ‖xC‖₂² + 2t⟨xC, y − xC⟩ + t²‖y − xC‖₂².

Subtracting ‖xC‖₂² and t²‖y − xC‖₂² from both sides of the above inequality, we
have

    −t²‖y − xC‖₂² ≤ 2t⟨xC, y − xC⟩.

Dividing both sides of the above inequality by 2t, we have

    −(t/2)‖y − xC‖₂² ≤ ⟨xC, y − xC⟩

for all t ∈ (0, 1]. Letting t ↓ 0 gives the desired inequality. □
With this theorem in place, a simple shift gives a characterization of more
general projections onto convex sets.
Corollary 2.2.6 (Projection onto convex sets). Let C be a closed convex set and let
x ∈ Rn. Then there is a unique point πC(x), called the projection of x onto C, such
that ‖x − πC(x)‖₂ = inf_{y∈C} ‖x − y‖₂, that is, πC(x) = argmin_{y∈C} ‖y − x‖₂². The
projection is characterized by the inequality

(2.2.7)    ⟨πC(x) − x, y − πC(x)⟩ ≥ 0

for all y ∈ C.

Proof. When x ∈ C, the statement is clear. For x ∉ C, the corollary simply fol-
lows by considering the set C′ := C − x, then using Theorem 2.2.2 applied to the
recentered set. □
Corollary 2.2.8 (Non-expansive projections). Projections onto convex sets are non-
expansive; in particular,

    ‖πC(x) − y‖₂ ≤ ‖x − y‖₂

for any x ∈ Rn and y ∈ C.

Proof. When x ∈ C, the inequality is clear, so assume that x ∉ C. Now use
inequality (2.2.7) from the previous corollary. By adding and subtracting y in the
inner product, we have

    0 ≤ ⟨πC(x) − x, y − πC(x)⟩
      = ⟨πC(x) − y + y − x, y − πC(x)⟩
      = −‖πC(x) − y‖₂² + ⟨y − x, y − πC(x)⟩.

We rearrange the above and then use the Cauchy-Schwarz or Hölder inequality,
which gives

    ‖πC(x) − y‖₂² ≤ ⟨y − x, y − πC(x)⟩ ≤ ‖y − x‖₂ ‖y − πC(x)‖₂.

Now divide both sides by ‖πC(x) − y‖₂. □
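Both the characterization (2.2.7) and the non-expansiveness of Corollary 2.2.8 are easy to test for a set whose projection has a closed form, such as a box (coordinate-wise clipping). The sketch below is our own illustration; the box, names, and tolerances are our choices:

```python
import random

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (a closed convex set)."""
    return [min(max(xj, lo), hi) for xj in x]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

random.seed(1)
for _ in range(200):
    x = [random.uniform(-3.0, 3.0) for _ in range(3)]
    px = project_box(x, -1.0, 1.0)
    y = [random.uniform(-1.0, 1.0) for _ in range(3)]  # an arbitrary point of C
    # Characterization (2.2.7): <pi_C(x) - x, y - pi_C(x)> >= 0 for all y in C.
    assert inner([p - a for p, a in zip(px, x)],
                 [b - p for b, p in zip(y, px)]) >= -1e-9
    # Corollary 2.2.8: the projection is no farther from y in C than x is.
    assert dist(px, y) <= dist(x, y) + 1e-9
```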
Separation Properties Projections are important not just because of their exis-
tence, but because they also guarantee, first, that convex sets can be described by
the halfspaces that contain them and, second, that any two convex sets can be sepa-
rated by hyperplanes. Moreover, the separation can be strict if one of the sets is
compact.

Figure 2.2.9. Separation of the point x from the set C by the
vector v = x − πC(x).

Proposition 2.2.10 (Strict separation of points). Let C be a closed convex set. Given
any point x ∉ C, there is a vector v such that

(2.2.11)    ⟨v, x⟩ > sup_{y∈C} ⟨v, y⟩.

Moreover, we can take the vector v = x − πC(x), and ⟨v, x⟩ ≥ sup_{y∈C} ⟨v, y⟩ + ‖v‖₂².
See Figure 2.2.9.

Proof. Indeed, since x ∉ C, we have x − πC(x) ≠ 0. By setting v = x − πC(x), we
have from the characterization (2.2.7) that

    0 ≥ ⟨v, y − πC(x)⟩ = ⟨v, y − x + x − πC(x)⟩ = ⟨v, (y − x) + v⟩ = ⟨v, y − x⟩ + ‖v‖₂².

In particular, we see that ⟨v, x⟩ ≥ ⟨v, y⟩ + ‖v‖₂² for all y ∈ C. □
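For a set with an explicit projection, the separating vector v = x − πC(x) of Proposition 2.2.10 can be computed directly. A small sketch of ours, using the box C = [−1, 1]³ (our choice, for illustration); for this box, sup_{y∈C} ⟨v, y⟩ = ∑_j |v_j|:

```python
def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(xj, lo), hi) for xj in x]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [2.0, -1.5, 0.25]                  # a point outside C = [-1, 1]^3
px = project_box(x)
v = [a - p for a, p in zip(x, px)]     # v = x - pi_C(x), the separating direction
sup_C = sum(abs(vj) for vj in v)       # sup over y in C of <v, y> for the box
# Proposition 2.2.10: <v, x> >= sup_{y in C} <v, y> + ||v||_2^2 > sup_{y in C} <v, y>.
assert inner(v, x) >= sup_C + inner(v, v) - 1e-12
```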

Proposition 2.2.12 (Strict separation of convex sets). Let C1, C2 be closed convex
sets, with C2 compact. Then there is a vector v such that

    inf_{x∈C1} ⟨v, x⟩ > sup_{x∈C2} ⟨v, x⟩.

Proof. The set C = C1 − C2 is convex and closed.2 Moreover, we have 0 ∉ C, so
that there is a vector v such that 0 < inf_{z∈C} ⟨v, z⟩ by Proposition 2.2.10. Thus we
have

    0 < inf_{z∈C1−C2} ⟨v, z⟩ = inf_{x∈C1} ⟨v, x⟩ − sup_{x∈C2} ⟨v, x⟩,

which is our desired result. □


2 If C1 is closed and C2 is compact, then C1 + C2 is closed. Indeed, let zn = xn + yn (with xn ∈ C1,
yn ∈ C2) be a convergent sequence of points, say zn → z, with zn ∈ C1 + C2. We claim that
z ∈ C1 + C2. Indeed, passing to a subsequence if necessary, we may assume yn → y ∈ C2. Passing
to this subsequence, we have xn = zn − yn → z − y, so that xn is convergent and necessarily
converges to a point x ∈ C1.

We can also investigate the existence of hyperplanes that support the convex
set C, meaning that they touch only its boundary and never enter its interior.
Such hyperplanes—and the halfspaces associated with them—provide alternative
descriptions of convex sets and functions. See Figure 2.2.13.

Figure 2.2.13. Supporting hyperplanes to a convex set.

Theorem 2.2.14 (Supporting hyperplanes). Let C be a closed convex set and x ∈ bd C,
the boundary of C. Then there exists a vector v ≠ 0 supporting C at x, that is,

(2.2.15)    ⟨v, x⟩ ≥ ⟨v, y⟩ for all y ∈ C.

Proof. Let (xn) be a sequence of points approaching x from outside C, that is,
xn ∉ C for any n, but xn → x. For each n, we can take sn = xn − πC(xn) and
define vn = sn / ‖sn‖₂. Then (vn) is a sequence satisfying ⟨vn, xn⟩ > ⟨vn, y⟩ for
all y ∈ C (by Proposition 2.2.10), and since ‖vn‖₂ = 1, the sequence (vn) belongs to the compact set
{v : ‖v‖₂ ≤ 1}.3 Passing to a subsequence if necessary, it is clear that there is a
vector v such that vn → v, and taking limits we have ⟨v, x⟩ ≥ ⟨v, y⟩ for all y ∈ C. □
Theorem 2.2.16 (Halfspace intersections). Let C ⊊ Rn be a closed convex set. Then
C is the intersection of all the halfspaces containing it; moreover,

(2.2.17)    C = ∩_{x∈bd C} Hx,

where Hx denotes the intersection of the halfspaces contained in hyperplanes supporting
C at x.
Proof. It is clear that C ⊆ ∩_{x∈bd C} Hx. Indeed, let hx ≠ 0 define a hyperplane support-
ing C at x ∈ bd C, and consider Hx = {y : ⟨hx, y⟩ ≤ ⟨hx, x⟩}. By Theorem 2.2.14
we see that Hx ⊇ C.
3 In a general Hilbert space, this set is actually weakly compact by Alaoglu's theorem. However, in
a weakly compact set, any sequence has a weakly convergent subsequence, that is, there exists a
subsequence n(m) and a vector v such that ⟨v_{n(m)}, y⟩ → ⟨v, y⟩ for all y.
Now we show the other inclusion, ∩_{x∈bd C} Hx ⊆ C. Suppose for the sake
of contradiction that z ∈ ∩_{x∈bd C} Hx satisfies z ∉ C. We will construct a hy-
perplane supporting C that separates z from C, which will be a contradiction to
our supposition. Since C is closed, the projection πC(z) of z onto C satisfies
⟨z − πC(z), z⟩ > sup_{y∈C} ⟨z − πC(z), y⟩ by Proposition 2.2.10. In particular, defin-
ing vz = z − πC(z), the hyperplane {y : ⟨vz, y⟩ = ⟨vz, πC(z)⟩} is supporting to C
at the point πC(z) (Corollary 2.2.6), and the halfspace {y : ⟨vz, y⟩ ≤ ⟨vz, πC(z)⟩}
does not contain z but does contain C. This contradicts the assumption that
z ∈ ∩_{x∈bd C} Hx. □
As a not too immediate consequence of Theorem 2.2.16 we obtain the following
characterization of a convex function as the supremum of all affine functions that
minorize the function (that is, affine functions that are everywhere less than or
equal to the original function). This is intuitive: if f is a closed convex function,
meaning that epi f is closed, then epi f is the intersection of all the halfspaces
containing it. The challenge is showing that we may restrict this intersection to
non-vertical halfspaces. See Figure 2.2.18.

Figure 2.2.18. The function f (solid blue line) and affine under-
estimators (dotted lines).

Corollary 2.2.19. Let f be a closed convex function that is not identically −∞. Then

    f(x) = sup_{v∈Rn, b∈R} {⟨v, x⟩ + b : f(y) ≥ b + ⟨v, y⟩ for all y ∈ Rn} .

Proof. First, we note that epi f is closed by definition. Moreover, we know that we
can write

    epi f = ∩ {H : H ⊃ epi f},

where H denotes a halfspace. More specifically, we may index each halfspace by
(v, a, c) ∈ Rn × R × R, writing H_{v,a,c} = {(x, t) ∈ Rn × R : ⟨v, x⟩ + at ≤ c}.
Now, because H ⊃ epi f, we must be able to take t → ∞, so that a ≤ 0. If a < 0,
we may divide by |a| and assume without loss of generality that a = −1, while
otherwise a = 0. So if we let

    H1 := {(v, c) : H_{v,−1,c} ⊃ epi f}  and  H0 := {(v, c) : H_{v,0,c} ⊃ epi f},

then

    epi f = ∩_{(v,c)∈H1} H_{v,−1,c} ∩ ∩_{(v,c)∈H0} H_{v,0,c}.

We would like to show that epi f = ∩_{(v,c)∈H1} H_{v,−1,c}, as each set H_{v,0,c} is a vertical
halfspace separating the domain of f, dom f, from the rest of the space.
To that end, we show that if (v1, c1) ∈ H1 and (v0, c0) ∈ H0, then

    H := ∩_{λ≥0} H_{v1+λv0, −1, c1+λc0} = H_{v1,−1,c1} ∩ H_{v0,0,c0}.

Indeed, suppose that (x, t) ∈ H_{v1,−1,c1} ∩ H_{v0,0,c0}. Then

    ⟨v1, x⟩ − t ≤ c1  and  λ⟨v0, x⟩ ≤ λc0 for all λ ≥ 0.

Summing these, we have

(2.2.20)    ⟨v1 + λv0, x⟩ − t ≤ c1 + λc0 for all λ ≥ 0,

or (x, t) ∈ H. Conversely, if (x, t) ∈ H then inequality (2.2.20) holds, so that taking
λ → ∞ we have ⟨v0, x⟩ ≤ c0, while taking λ = 0 we have ⟨v1, x⟩ − t ≤ c1.
Noting that each H_{v1+λv0,−1,c1+λc0} ⊃ epi f, so that (v1 + λv0, c1 + λc0) ∈ H1, we see that

    epi f = ∩_{(v,c)∈H1} H_{v,−1,c} = {(x, t) ∈ Rn × R : ⟨v, x⟩ − t ≤ c for all (v, c) ∈ H1} .

This is equivalent to the claim in the corollary. □


2.3. Continuity and Local Differentiability of Convex Functions Here we dis-
cuss several important results concerning convex functions in finite dimensions.
We will see that assuming that a function f is convex is quite strong. In fact, we
will see the (intuitive if one pictures a convex function) facts that f is continuous,
has a directional derivative everywhere, and in fact is locally Lipschitz. We prove
the first two results on continuity in Appendix A.1, as they are not fully necessary
for our development.
We begin with the fact that if f is defined on a compact domain, then f has an
upper bound. The first step in this direction is to argue that this holds for ℓ1 balls,
which can be proved by a simple argument with the definition of convexity.

Lemma 2.3.1. Let f be convex and defined on B1 = {x ∈ Rn : ‖x‖1 ≤ 1}, the ℓ1 ball in
n dimensions. Then there exist −∞ < m ≤ M < ∞ such that m ≤ f(x) ≤ M for all
x ∈ B1.

We provide a proof of this lemma, as well as the coming theorem, in Appen-
dix A.1, as they are not central to our development, relying on a few results in
the sequel. The theorem makes use of the above lemma to show that on compact
domains, convex functions are Lipschitz continuous. The proof of the theorem
begins by showing that if a convex function is bounded in some set, then it is
Lipschitz continuous in the set; then, using Lemma 2.3.1, we can show that on
compact sets f is indeed bounded.
Theorem 2.3.2. Let f be convex and defined on a set C with non-empty interior. Let
B ⊆ int C be compact. Then there is a constant L such that |f(x) − f(y)| ≤ L‖x − y‖ on
B, that is, f is L-Lipschitz continuous on B.
The last result, which we make strong use of in the next section, concerns the
existence of directional derivatives for convex functions.
Definition 2.3.3. The directional derivative of a function f at a point x in the direc-
tion u is

    f′(x; u) := lim_{α↓0} (1/α)[f(x + αu) − f(x)].
This definition makes sense for convex f by our earlier arguments that convex
functions have increasing slopes (recall expression (2.1.5)). To see that the above
definition makes sense, we restrict our attention to x ∈ int dom f, so that we can
approach x from all directions. By taking u = y − x for any y ∈ dom f, we have

    f(x + α(y − x)) = f((1 − α)x + αy) ≤ (1 − α)f(x) + αf(y),

so that

    (1/α)[f(x + α(y − x)) − f(x)] ≤ (1/α)[αf(y) − αf(x)] = f(y) − f(x) = f(x + u) − f(x).

We also know from Theorem 2.3.2 that f is locally Lipschitz, so for small enough
α there exists some L such that f(x + αu) ≥ f(x) − Lα‖u‖, and thus for which
f′(x; u) ≥ −L‖u‖. Further, an argument by convexity (the criterion (2.1.5) of
increasing slopes) shows that the function

    α ↦ (1/α)[f(x + αu) − f(x)]

is increasing in α, so we can replace the limit in the definition of f′(x; u) with an
infimum over α > 0, that is, f′(x; u) = inf_{α>0} (1/α)[f(x + αu) − f(x)]. Noting that if
x is on the boundary of dom f and x + αu ∉ dom f for any α > 0, then we must
have f′(x; u) = +∞, we have proved the following theorem.

Theorem 2.3.4. For convex f, at any point x ∈ dom f and for any u, the directional
derivative f′(x; u) exists and is

    f′(x; u) = lim_{α↓0} (1/α)[f(x + αu) − f(x)] = inf_{α>0} (1/α)[f(x + αu) − f(x)].

If x ∈ int dom f, there exists a constant L < ∞ such that |f′(x; u)| ≤ L‖u‖ for any
u ∈ Rn. If f is Lipschitz continuous with respect to the norm ‖·‖, we can take L to be
the Lipschitz constant of f.
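For a smooth convex function one can watch the difference quotients decrease to f′(x; u) as α ↓ 0, in line with Theorem 2.3.4. A short numerical check of ours (the choice f(x) = x², with f′(x; u) = 2xu, is our own illustration):

```python
def f(x):
    # A smooth convex function; its directional derivative is f'(x; u) = 2xu.
    return x * x

def quotient(x, u, alpha):
    # The difference quotient (1/alpha)[f(x + alpha u) - f(x)].
    return (f(x + alpha * u) - f(x)) / alpha

x = 1.0
for u in (1.0, -1.0, 2.5):
    qs = [quotient(x, u, 2.0 ** -k) for k in range(25)]
    # Increasing slopes: the quotient is nondecreasing in alpha, so the limit
    # as alpha -> 0 equals the infimum over alpha > 0 (Theorem 2.3.4).
    assert all(a >= b - 1e-9 for a, b in zip(qs, qs[1:]))
    assert abs(qs[-1] - 2.0 * x * u) < 1e-6   # quotients approach f'(x; u) = 2xu
```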
Lastly, we state a well-known condition that is equivalent to convexity. This is
intuitive: if a function is bowl-shaped, it should have positive second derivatives.
Theorem 2.3.5. Let f : Rn → R be twice continuously differentiable. Then f is convex
if and only if ∇²f(x) ⪰ 0 for all x, that is, ∇²f(x) is positive semidefinite.

Proof. We may essentially reduce the argument to one-dimensional problems, be-
cause if f is twice continuously differentiable, then for each v ∈ Rn we may define
hv : R → R by

    hv(t) = f(x + tv),

and f is convex if and only if hv is convex for each v (because convexity is a
property only of lines, by definition). Moreover, we have

    hv″(0) = vᵀ∇²f(x)v,

and ∇²f(x) ⪰ 0 if and only if hv″(0) ≥ 0 for all v.
Thus, with no loss of generality, we assume n = 1 and show that f is convex if
and only if f″(x) ≥ 0. First, suppose that f″(x) ≥ 0 for all x. Then using that

    f(y) = f(x) + f′(x)(y − x) + (1/2)(y − x)² f″(x̃)

for some x̃ between x and y, we have that f(y) ≥ f(x) + f′(x)(y − x) for all x, y.
Let λ ∈ [0, 1]. Then we have

    f(y) ≥ f(λx + (1 − λ)y) + λ f′(λx + (1 − λ)y)(y − x)  and
    f(x) ≥ f(λx + (1 − λ)y) + (1 − λ) f′(λx + (1 − λ)y)(x − y).

Multiplying the first equation by 1 − λ and the second by λ, then adding, we
obtain

    (1 − λ)f(y) + λf(x) ≥ (1 − λ)f(λx + (1 − λ)y) + λf(λx + (1 − λ)y) = f(λx + (1 − λ)y),

that is, f is convex.
For the converse, let δ > 0 and define x1 = x + δ > x > x − δ = x0. Then we
have x1 − x0 = 2δ, and

    f(x1) = f(x) + f′(x)δ + (δ²/2) f″(x̃1)  and  f(x0) = f(x) − f′(x)δ + (δ²/2) f″(x̃0)

for some x̃1, x̃0 ∈ [x − δ, x + δ]. Defining cδ := f(x1) + f(x0) − 2f(x) ≥ 0 (the last
inequality by convexity), these equations show that

    cδ = (δ²/2)[f″(x̃1) + f″(x̃0)].

By continuity, we have f″(x̃i) → f″(x) as δ → 0, and as cδ/δ² ≥ 0 for all δ > 0,
we must have

    2f″(x) = lim_{δ→0} [f″(x̃1) + f″(x̃0)] = lim_{δ→0} (2cδ/δ²) ≥ 0.

This gives the result. □
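The one-dimensional reduction in the proof suggests a numerical check: for convex f, the second difference of hv(t) = f(x + tv), which estimates hv″(0) = vᵀ∇²f(x)v, must be nonnegative. The sketch below is our own illustration (log-sum-exp is our choice of convex function, and the finite-difference step is a heuristic of ours):

```python
import math
import random

def logsumexp(x):
    # f(x) = log(sum_j exp(x_j)), a canonical smooth convex function.
    m = max(x)
    return m + math.log(sum(math.exp(xj - m) for xj in x))

def second_diff(f, x, v, t=1e-3):
    """Finite-difference estimate of h_v''(0) = v^T Hess f(x) v, h_v(t) = f(x + tv)."""
    fp = f([xj + t * vj for xj, vj in zip(x, v)])
    fm = f([xj - t * vj for xj, vj in zip(x, v)])
    return (fp - 2.0 * f(x) + fm) / (t * t)

random.seed(2)
for _ in range(100):
    x = [random.uniform(-2.0, 2.0) for _ in range(4)]
    v = [random.uniform(-1.0, 1.0) for _ in range(4)]
    # Theorem 2.3.5: convexity of f forces v^T Hess f(x) v >= 0 for every v.
    assert second_diff(logsumexp, x, v) >= -1e-6
```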
2.4. Subgradients and Optimality Conditions The subgradient set of a function
f at a point x ∈ dom f is defined as follows:

(2.4.1)    ∂f(x) := {g : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y} .

Intuitively, since a function is convex if and only if epi f is convex, the subgradient
set ∂f should be non-empty and consist of supporting hyperplanes to epi f. That
is, f should always have global linear underestimators of itself. When a function f
is convex, the subgradient generalizes the derivative of f (which is a global linear
underestimator of f when f is differentiable), and is also intimately related to
optimality conditions for convex minimization.

Figure 2.4.2. Subgradients of a convex function. At the point
x1, the subgradient g1 is the gradient. At the point x2, there are
multiple subgradients, because the function is non-differentiable.
We show the linear functions x ↦ f(x2) + ⟨gi, x − x2⟩ given by g2, g3 ∈ ∂f(x2).

Existence and characterizations of subgradients Our first theorem guarantees


that the subdifferential set is non-empty.
Theorem 2.4.3. Let x ∈ int dom f. Then ∂f(x) is non-empty, closed, convex, and com-
pact.
Proof. The fact that ∂f(x) is closed and convex is straightforward. Indeed, all we
need to see this is to recognize that

    ∂f(x) = ∩_z {g : f(z) ≥ f(x) + ⟨g, z − x⟩},

which is an intersection of half-spaces, which are all closed and convex.
Now we need to show that ∂f(x) ≠ ∅. This will essentially follow from the
following fact: the set epi f has a non-zero supporting hyperplane at the point
(x, f(x)). Indeed, from Theorem 2.2.14, we know that there exist a vector v and a
scalar b, not both zero, such that

    ⟨v, x⟩ + bf(x) ≥ ⟨v, y⟩ + bt

for all (y, t) ∈ epi f (that is, y and t such that f(y) ≤ t). Rearranging slightly, we
have

    ⟨v, x − y⟩ ≥ b(t − f(x)),

and setting y = x shows that b ≤ 0. This is close to what we desire, since if b < 0
we set t = f(y) and see that

    −bf(y) ≥ −bf(x) + ⟨v, y − x⟩,  or  f(y) ≥ f(x) − ⟨v/b, y − x⟩

for all y, by dividing both sides by −b. In particular, −v/b is a subgradient. Thus,
suppose for the sake of contradiction that b = 0. Then we have ⟨v, x − y⟩ ≥ 0 for
all y ∈ dom f, but we assumed that x ∈ int dom f, so for small enough ε > 0, we
can set y = x + εv. This would imply that ⟨v, x − y⟩ = −ε⟨v, v⟩ ≥ 0, i.e. v = 0,
contradicting the fact that at least one of v and b must be non-zero.
For the compactness of ∂f(x), we use Lemma 2.3.1, which implies that f is
bounded in an ℓ1-ball around x. As x ∈ int dom f by assumption, there is
some ε > 0 such that x + εB ⊂ int dom f for the ℓ1-ball B = {v : ‖v‖1 ≤ 1}.
Lemma 2.3.1 implies that sup_{v∈B} f(x + εv) = M < ∞ for some M, so we have
M ≥ f(x + εv) ≥ f(x) + ε⟨g, v⟩ for all v ∈ B and g ∈ ∂f(x), or ‖g‖∞ ≤ (M − f(x))/ε.
Thus ∂f(x) is closed and bounded, hence compact. □
The next two results require a few auxiliary results related to the directional
derivative of a convex function. The reason for this is that both require connect-
ing the local properties of the convex function f with the sub-differential ∂f(x),
which is difficult in general since ∂f(x) can consist of multiple vectors. However,
by looking at directional derivatives, we can accomplish what we desire. The
connection between a directional derivative and the subdifferential is contained
in the next two lemmas.

Lemma 2.4.4. An equivalent characterization of the subdifferential ∂f(x) of f at x is

(2.4.5)    ∂f(x) = {g : ⟨g, u⟩ ≤ f′(x; u) for all u} .

Proof. Denote by S := {g : ⟨g, u⟩ ≤ f′(x; u) for all u} the set on the right hand side of
the equality (2.4.5), and let g ∈ S. By the increasing slopes condition, we have

    ⟨g, u⟩ ≤ f′(x; u) ≤ [f(x + αu) − f(x)] / α

for all u and α > 0; in particular, by taking α = 1 and u = y − x, we have the
standard subgradient inequality that f(x) + ⟨g, y − x⟩ ≤ f(y). So if g ∈ S, then
g ∈ ∂f(x). Conversely, for any g ∈ ∂f(x), the definition of a subgradient implies
that

    f(x + αu) ≥ f(x) + ⟨g, x + αu − x⟩ = f(x) + α⟨g, u⟩.

Subtracting f(x) from both sides and dividing by α gives that

    (1/α)[f(x + αu) − f(x)] ≥ ⟨g, u⟩

for all α > 0; taking the infimum over α > 0 gives f′(x; u) ≥ ⟨g, u⟩, that is, g ∈ S. □


The representation (2.4.5) gives another proof that ∂f(x) is compact, as claimed in
Theorem 2.4.3. Because we know that f (x; u) is finite for all u as x ∈ int dom f,
any g ∈ ∂f(x) satisfies
g2 = sup g, u  sup f (x; u) < ∞.
u:u2 1 u:u2 1

Lemma 2.4.6. Let f be closed convex and ∂f(x) ≠ ∅. Then

(2.4.7)    f′(x; u) = sup_{g∈∂f(x)} ⟨g, u⟩.

Proof. Certainly, Lemma 2.4.4 shows that f′(x; u) ≥ sup_{g∈∂f(x)} ⟨g, u⟩. We must
show the other direction. To that end, note that viewed as a function of u, f′(x; u)
is convex and positively homogeneous, meaning that f′(x; tu) = tf′(x; u) for t ≥ 0.
Thus, we can always write (by Corollary 2.2.19)

    f′(x; u) = sup {⟨v, u⟩ + b : f′(x; w) ≥ b + ⟨v, w⟩ for all w ∈ Rn} .

Using the positive homogeneity, we have f′(x; 0) = 0, and thus we must have
b = 0, so that u ↦ f′(x; u) is characterized as the supremum of linear functions:

    f′(x; u) = sup {⟨v, u⟩ : f′(x; w) ≥ ⟨v, w⟩ for all w ∈ Rn} .

But the set {v : ⟨v, w⟩ ≤ f′(x; w) for all w} is simply ∂f(x) by Lemma 2.4.4. □
A relatively straightforward calculation using Lemma 2.4.4, which we give
in the next proposition, shows that the subgradient is simply the gradient of
differentiable convex functions. Note that as a consequence of this, we have
the first-order inequality f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for any differentiable
convex function.

Proposition 2.4.8. Let f be convex and differentiable at a point x. Then ∂f(x) = {∇f(x)}.

Proof. If f is differentiable at a point x, then the chain rule implies that

    f′(x; u) = ⟨∇f(x), u⟩ ≥ ⟨g, u⟩

for any g ∈ ∂f(x), the inequality following from Lemma 2.4.4. By replacing u with
−u, we have f′(x; −u) = −⟨∇f(x), u⟩ ≥ −⟨g, u⟩ as well, or ⟨g, u⟩ = ⟨∇f(x), u⟩ for
all u. Letting u vary in (for example) the set {u : ‖u‖₂ ≤ 1} gives the result. □
Lastly, we have the following consequence of the previous lemmas, which re-
lates the norms of subgradients g ∈ ∂f(x) to the Lipschitzian properties of f.
Recall that a function f is L-Lipschitz with respect to the norm ‖·‖ over a set C if

    |f(x) − f(y)| ≤ L‖x − y‖

for all x, y ∈ C. Then the following proposition is an immediate consequence of
Lemma 2.4.6.

Proposition 2.4.9. Suppose that f is L-Lipschitz with respect to the norm ‖·‖ over a set
C, where C ⊂ int dom f. Then

    sup {‖g‖∗ : g ∈ ∂f(x), x ∈ C} ≤ L.

Examples We can provide a number of examples of subgradients. A general
rule of thumb is that, if it is possible to compute the function, it is possible to
compute its subgradients. As a first example, we consider

    f(x) = |x|.

Then by inspection, we have

    ∂f(x) = {−1}       if x < 0,
            [−1, 1]    if x = 0,
            {1}        if x > 0.

A more complex example is given by any vector norm ‖·‖. In this case, we use
the fact that the dual norm is defined by

    ‖y‖∗ := sup_{‖x‖≤1} ⟨x, y⟩.

Moreover, we have that ‖x‖ = sup_{‖y‖∗≤1} ⟨y, x⟩. Fixing x ∈ Rn, we thus see that
if ‖g‖∗ ≤ 1 and ⟨g, x⟩ = ‖x‖, then

    ‖x‖ + ⟨g, y − x⟩ = ‖x‖ − ‖x‖ + ⟨g, y⟩ ≤ sup_{‖v‖∗≤1} ⟨v, y⟩ = ‖y‖.

It is possible to show a converse—we leave this as an exercise for the interested
reader—and we claim that

    ∂‖x‖ = {g ∈ Rn : ‖g‖∗ ≤ 1, ⟨g, x⟩ = ‖x‖}.

For a more concrete example, we have

    ∂‖x‖₂ = {x/‖x‖₂}        if x ≠ 0,
            {u : ‖u‖₂ ≤ 1}  if x = 0.
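The claimed form of ∂‖x‖₂ can be checked against the defining inequality (2.4.1). The following sketch is our own illustration (function names and tolerances are ours):

```python
import random

def norm2(x):
    return sum(xj * xj for xj in x) ** 0.5

def subgrad_norm2(x):
    """One element of the subdifferential of ||.||_2: x/||x||_2 if x != 0, else 0."""
    n = norm2(x)
    return [xj / n for xj in x] if n > 0 else [0.0 for _ in x]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(3)
for _ in range(200):
    x = [random.uniform(-1.0, 1.0) for _ in range(3)]
    if random.random() < 0.1:
        x = [0.0, 0.0, 0.0]          # exercise the kink at the origin too
    g = subgrad_norm2(x)
    y = [random.uniform(-2.0, 2.0) for _ in range(3)]
    # Subgradient inequality (2.4.1): ||y||_2 >= ||x||_2 + <g, y - x>.
    assert norm2(y) >= norm2(x) + inner(g, [b - a for a, b in zip(x, y)][::1]) - 1e-9 if False else True
    assert norm2(y) >= norm2(x) + inner(g, [b - a for a, b in zip(x, y)]) - 1e-9
    # Characterization: ||g||_2 <= 1 and <g, x> = ||x||_2.
    assert norm2(g) <= 1.0 + 1e-12 and abs(inner(g, x) - norm2(x)) < 1e-9
```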
Optimality properties Subgradients also allow us to characterize solutions to
convex optimization problems, giving similar characterizations as those we pro-
vided for projections.

Figure 2.4.10. The point x⋆ minimizes f over C (the shown level
curves) if and only if for some g ∈ ∂f(x⋆), ⟨g, y − x⋆⟩ ≥ 0 for all
y ∈ C. Note that not all subgradients satisfy this inequality.

The next theorem, containing necessary and sufficient conditions for a point x
to minimize a convex function f, generalizes the standard first-order optimality
conditions for differentiable f (e.g., Section 4.2.3 in [12]). The intuition for Theo-
rem 2.4.11 is that there is a vector g in the subgradient set ∂f(x) such that −g is
a supporting hyperplane to the feasible set C at the point x. That is, the direc-
tions of decrease of the function f lie outside the optimization set C. Figure 2.4.10
shows this behavior.

Theorem 2.4.11. Let f be convex. The point x ∈ int dom f minimizes f over a convex
set C if and only if there exists a subgradient g ∈ ∂f(x) such that simultaneously for all
y ∈ C,

(2.4.12)    ⟨g, y − x⟩ ≥ 0.

Proof. One direction of the theorem is easy. Indeed, let g ∈ ∂f(x) satisfy
⟨g, y − x⟩ ≥ 0 for all y ∈ C. Then by definition,

    f(y) ≥ f(x) + ⟨g, y − x⟩ ≥ f(x).

This holds for any y ∈ C, so x is clearly optimal.
For the converse, suppose that x minimizes f over C. Then for any y ∈ C and
any t ≥ 0 such that x + t(y − x) ∈ C, we have

    f(x + t(y − x)) ≥ f(x),  or  0 ≤ [f(x + t(y − x)) − f(x)] / t.

Taking the limit as t ↓ 0, we have f′(x; y − x) ≥ 0 for all y ∈ C. Now, let
us suppose for the sake of contradiction that there exists a y ∈ C such that for all
g ∈ ∂f(x), we have ⟨g, y − x⟩ < 0. Because

    ∂f(x) = {g : ⟨g, u⟩ ≤ f′(x; u) for all u ∈ Rn}

by Lemma 2.4.4, the set ∂f(x) is compact, and f′(x; y − x) = sup_{g∈∂f(x)} ⟨g, y − x⟩
by Lemma 2.4.6, the supremum is attained, which would imply

    f′(x; y − x) < 0.

This is a contradiction. □
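Theorem 2.4.11 can be verified numerically when the minimizer is known in closed form, for instance when minimizing f(x) = ‖x − a‖₂² over a box, where the minimizer is the coordinate-wise clipping of a. The example below is our own illustration (the box and the point a are our choices):

```python
import random

def clip(a, lo=-1.0, hi=1.0):
    """Coordinate-wise clipping, the projection onto the box [lo, hi]^n."""
    return [min(max(aj, lo), hi) for aj in a]

def inner(u, v):
    return sum(p * q for p, q in zip(u, v))

a = [2.0, -3.0, 0.5]
xstar = clip(a)                       # minimizer of f(x) = ||x - a||_2^2 over [-1,1]^3
grad = [2.0 * (xj - aj) for xj, aj in zip(xstar, a)]   # grad f(x*), the only subgradient
random.seed(4)
for _ in range(200):
    y = [random.uniform(-1.0, 1.0) for _ in range(3)]
    # Theorem 2.4.11 with g = grad f(x*): <g, y - x*> >= 0 for all y in C.
    assert inner(grad, [yj - xj for yj, xj in zip(y, xstar)]) >= -1e-9
```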
2.5. Calculus rules with subgradients We present a number of calculus rules
that show how subgradients are, essentially, similar to derivatives, with a few
exceptions (see also Ch. VII of [27]). When we develop methods for optimization
problems based on subgradients, these basic calculus rules will prove useful.
Scaling. If we let h(x) = αf(x) for some α ≥ 0, then ∂h(x) = α∂f(x).

Finite sums. Suppose that f1, . . . , fm are convex functions and let f = Σ_{i=1}^m fi. Then

∂f(x) = Σ_{i=1}^m ∂fi(x),

where the addition is Minkowski addition. To see that Σ_{i=1}^m ∂fi(x) ⊂ ∂f(x), let gi ∈ ∂fi(x) for each i. Clearly, f(y) = Σ_{i=1}^m fi(y) ≥ Σ_{i=1}^m [fi(x) + ⟨gi, y − x⟩], so
120 Introductory Lectures on Stochastic Optimization

that Σ_{i=1}^m gi ∈ ∂f(x). The converse is somewhat more technical and is a special case of the results to come.
Integrals. More generally, we can extend this summation result to integrals, assuming the integrals exist. These calculations are essential for our development of stochastic optimization schemes based on stochastic (sub)gradient information in the coming lectures. Indeed, for each s ∈ S, where S is some set, let fs be convex. Let μ be a positive measure on the set S, and define the convex function f(x) = ∫ fs(x) dμ(s). In the notation of the introduction (Eq. (1.0.1)) and the problems coming in Section 3.4, we take μ to be a probability distribution on a set S, and if F(·; s) is convex in its first argument for all s ∈ S, then we may take

f(x) = E[F(x; S)]

and satisfy the conditions above. We shall see many such examples in the sequel. Then if we let gs(x) ∈ ∂fs(x) for each s ∈ S, we have (assuming the integral exists and that the selections gs(x) are appropriately measurable)

(2.5.1)  ∫ gs(x) dμ(s) ∈ ∂f(x).
To see the inclusion, note that for any y we have

⟨∫ gs(x) dμ(s), y − x⟩ = ∫ ⟨gs(x), y − x⟩ dμ(s) ≤ ∫ (fs(y) − fs(x)) dμ(s) = f(y) − f(x),

so the inclusion (2.5.1) holds. Eliding a few technical details, one generally obtains the equality

∂f(x) = { ∫ gs(x) dμ(s) : gs(x) ∈ ∂fs(x) for each s ∈ S }.
Returning to our running example of stochastic optimization, if we have a collection of functions F : Rn × S → R, where for each s ∈ S the function F(·; s) is convex, then f(x) = E[F(x; S)] is convex when we take expectations over S, and taking

g(x; s) ∈ ∂F(x; s)

gives a stochastic gradient with the property that E[g(x; S)] ∈ ∂f(x). For more on these calculations and conditions, see the classic paper of Bertsekas [7], which addresses the measurability issues.
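As a tiny sanity check of this expectation rule, the sketch below (a one-dimensional example invented for illustration) takes f(x) = E[|x − S|] with S uniform on a finite sample, averages the per-sample subgradients sign(x − s), and verifies the subgradient inequality numerically.

```python
# A one-dimensional instance of the integral/expectation rule:
# f(x) = E[|x - S|] with S uniform on a finite sample. The per-sample
# subgradient is g(x; s) = sign(x - s), and the average over s should be
# a subgradient of f. The sample points are invented for illustration.

samples = [-2.0, -0.5, 0.3, 1.7, 4.0]

def f(x):
    return sum(abs(x - s) for s in samples) / len(samples)

def sign(t):
    return (t > 0) - (t < 0)

def avg_subgrad(x):
    # E[g(x, S)] = (1/m) sum_i sign(x - s_i)
    return sum(sign(x - s) for s in samples) / len(samples)

# Verify the subgradient inequality f(y) >= f(x) + g (y - x) on a grid.
x = 0.9
g = avg_subgrad(x)
ok = all(f(y) >= f(x) + g * (y - x) - 1e-12
         for y in [-5 + 0.1 * j for j in range(100)])
```

Here the average subgradient at x = 0.9 is (3 − 2)/5 = 0.2, the slope of f there.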
Affine transformations. Let f : Rm → R be convex and A ∈ Rm×n and b ∈ Rm. Then h : Rn → R defined by h(x) = f(Ax + b) is convex and has subdifferential

∂h(x) = Aᵀ ∂f(Ax + b).

Indeed, let g ∈ ∂f(Ax + b), so that

h(y) = f(Ay + b) ≥ f(Ax + b) + ⟨g, (Ay + b) − (Ax + b)⟩ = h(x) + ⟨Aᵀg, y − x⟩,

giving the result.
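The affine rule is easy to check numerically. The sketch below (with an arbitrary small A and b, invented for illustration) takes f(z) = ‖z‖₁, so h(x) = ‖Ax + b‖₁, forms Aᵀ sign(Ax + b), and verifies the subgradient inequality for h on a grid.

```python
# A numerical check of the affine rule with f(z) = ||z||_1, so that
# h(x) = ||Ax + b||_1 and A^T sign(Ax + b) should lie in ∂h(x). The
# small A and b are invented for illustration.

A = [[1.0, 2.0], [-1.0, 0.5], [0.0, 3.0]]   # 3 x 2
b = [0.5, -1.0, 2.0]

def matvec(M, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

def h(x):
    return sum(abs(z + bi) for z, bi in zip(matvec(A, x), b))

def sign(t):
    return (t > 0) - (t < 0)

def subgrad_h(x):
    g = [sign(z + bi) for z, bi in zip(matvec(A, x), b)]  # g in ∂f(Ax+b)
    return [sum(A[i][j] * g[i] for i in range(len(A)))    # A^T g
            for j in range(len(x))]

x = [0.3, -0.7]
gx = subgrad_h(x)
grid = [[-2 + 0.5 * i, -2 + 0.5 * j] for i in range(9) for j in range(9)]
ok = all(
    h(y) >= h(x) + sum(gi * (yi - xi) for gi, yi, xi in zip(gx, y, x)) - 1e-9
    for y in grid
)
```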

Finite maxima. Let f(x) = max_{i≤m} fi(x), where fi, i = 1, . . . , m, are convex functions. Then we have

epi f = ∩_{i≤m} epi fi,

which is convex, and f is convex. Now, let i be any index such that fi(x) = f(x), and let gi ∈ ∂fi(x). Then we have for any y ∈ Rn that

f(y) ≥ fi(y) ≥ fi(x) + ⟨gi, y − x⟩ = f(x) + ⟨gi, y − x⟩.

So gi ∈ ∂f(x). More generally, we have the result that

(2.5.2)  ∂f(x) = Conv{∂fi(x) : fi(x) = f(x)},

that is, the subgradient set of f is the convex hull of the subgradients of the active functions at x, that is, those attaining the maximum. If there is only a single active function fi, then ∂f(x) = ∂fi(x). See Figure 2.5.3 for a graphical representation.

[Figure]

Figure 2.5.3. Subgradients of finite maxima. The function f(x) = max{f1(x), f2(x)}, where f1(x) = x² and f2(x) = −2x − 1/5; f is differentiable everywhere except at x0 = −1 + √(4/5).
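The figure's example can be checked directly in code. The sketch below evaluates f(x) = max{x², −2x − 1/5}, returns the derivative of an active function (or a convex combination at the kink), and verifies the subgradient inequality on a grid.

```python
import math

# The figure's example in code: f(x) = max{f1(x), f2(x)} with
# f1(x) = x^2 and f2(x) = -2x - 1/5. The derivative of any active
# function is a subgradient; at the kink any convex combination of
# active derivatives is as well.

def f(x):
    return max(x * x, -2 * x - 0.2)

def subgrad(x, theta=0.5):
    f1, f2 = x * x, -2 * x - 0.2
    if f1 > f2:
        return 2 * x                 # f1 uniquely active
    if f2 > f1:
        return -2.0                  # f2 uniquely active
    return theta * (2 * x) + (1 - theta) * (-2.0)   # kink

x0 = -1 + math.sqrt(4 / 5)           # the kink point from the figure
grid = [-3 + 0.05 * j for j in range(121)]
ok = all(f(y) >= f(x) + subgrad(x) * (y - x) - 1e-9
         for x in grid + [x0] for y in grid)
```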

Uncountable maxima (supremum). Lastly, consider f(x) = sup_{α∈A} fα(x), where A is an arbitrary index set and fα is convex for each α. First, let us assume that the supremum is attained at some α ∈ A. Then, identically to the above, we have that ∂fα(x) ⊂ ∂f(x). More generally, we have

∂f(x) ⊃ Conv{∂fα(x) : fα(x) = f(x)}.

Achieving equality in the preceding expression requires a number of conditions, and if the supremum is not attained, the function f may not be subdifferentiable.
Notes and further reading The study of convex analysis and optimization originates, essentially, with Rockafellar's 1970 book Convex Analysis [49]. Because of

the limited focus of these lecture notes, we have only barely touched on many topics in convex analysis, developing only those we need. Two omissions are perhaps the most glaring: except tangentially, we have provided no discussion of conjugate functions and conjugacy, and we have not discussed Lagrangian duality, both of which are central to any study of convex analysis and optimization. A number of books provide coverage of convex analysis in finite and infinite dimensional spaces and make excellent further reading. For broad coverage of convex optimization problems, theory, and algorithms, Boyd and Vandenberghe [12] is an excellent reference, also providing coverage of basic convex duality theory and conjugate functions. For deeper forays into convex analysis, personal favorites of mine include the books of Hiriart-Urruty and Lemaréchal [27, 28], as well as the shorter volume [29], and Bertsekas [8] also provides an elegant geometric picture of convex analysis and optimization. Our approach here follows Hiriart-Urruty and Lemaréchal's most closely. For a treatment of the issues of separation, convexity, duality, and optimization in infinite dimensional spaces, an excellent reference is the classic book by Luenberger [36].

3. Subgradient Methods
Lecture Summary: In this lecture, we discuss first order methods for the minimization of convex functions. We focus almost exclusively on subgradient-based methods, which are essentially universally applicable for convex optimization problems, because they rely very little on the structure of the problem being solved. This leads to effective but slow algorithms in classical optimization problems. In large scale problems arising out of machine learning and statistical tasks, however, subgradient methods enjoy a number of (theoretical) optimality properties and have excellent practical performance.

3.1. Introduction In this lecture, we explore a basic subgradient method, and a few variants thereof, for solving general convex optimization problems. Throughout, we will attack the problem

(3.1.1)  minimize_x f(x) subject to x ∈ C,

where f : Rn → R is convex (though it may take on the value +∞ for x ∉ dom f) and C is a closed convex set. Certainly in this generality, finding a universally good method for solving the problem (3.1.1) is hopeless, though we will see that the subgradient method does essentially apply in this generality.
Convex programming methodologies developed in the last fifty years or so have given powerful methods for solving optimization problems. The performance of many methods for solving convex optimization problems is measured by the amount of time or the number of iterations required of them to give an ε-optimal solution to the problem (3.1.1): roughly, how long it takes to find some x̂ such that f(x̂) − f(x⋆) ≤ ε and dist(x̂, C) ≤ ε for an optimal x⋆ ∈ C. Essentially any problem for which we can compute subgradients efficiently can be solved to accuracy ε in time polynomial in the dimension n of the problem and log(1/ε) by the ellipsoid method (cf. [41, 45]). Moreover, for somewhat better structured (but still quite general) convex problems, interior point and second order methods [12, 45] are practically and theoretically quite efficient, sometimes requiring only O(log log(1/ε)) iterations to achieve optimization error ε. (See the lectures by S. Wright in this volume.) These methods use the Newton method as a basic solver, along with specialized representations of the constraint set C, and can be quite powerful.
However, for large scale problems, the time complexity of standard interior point and Newton methods can be prohibitive. Indeed, for problems in n dimensions—that is, when x ∈ Rn—interior point methods scale at best as O(n³), and can be much worse. When n is large (where today, large may mean n ≈ 10⁹), this becomes highly non-trivial. In such large scale problems and problems arising from any type of data-collection process, it is reasonable to expect that our representation of problem data is inexact at best. In statistical machine learning problems, for example, this is often the case; generally, many applications do not require accuracy higher than, say, ε = 10⁻² or 10⁻³, in which case faster but less exact methods become attractive.
It is with this motivation that we approach the problem (3.1.1) in this lecture, showing classical subgradient algorithms. These algorithms have the advantage that their per-iteration costs are low—O(n) or smaller for n-dimensional problems—but they achieve low accuracy solutions to (3.1.1) very quickly. Moreover, depending on problem structure, they can sometimes achieve convergence rates that are independent of problem dimension. More precisely, and as we will see later, the methods we study will guarantee convergence to an ε-optimal solution to problem (3.1.1) in O(1/ε²) iterations, while methods that achieve better dependence on ε require at least n log(1/ε) iterations.
3.2. The gradient and subgradient methods We begin by focusing on the unconstrained case, that is, when the set C in problem (3.1.1) is C = Rn. That is, we wish to solve

minimize_{x∈Rn} f(x).
We first review the gradient descent method, using it as motivation for what
follows. In the gradient descent method, we minimize the objective (3.1.1) by
iteratively updating
(3.2.1) xk+1 = xk − αk ∇f(xk ),
where αk > 0 is a positive sequence of stepsizes. The original motivations for this choice of update come from the fact that a point x minimizes a convex f if and only if 0 = ∇f(x); we believe a more compelling justification comes from the idea of modeling the convex function being minimized. Indeed, the update (3.2.1) is

equivalent to

(3.2.2)  xk+1 = argmin_x { f(xk) + ⟨∇f(xk), x − xk⟩ + (1/(2αk))‖x − xk‖₂² }.

The interpretation is that the linear functional x → f(xk) + ⟨∇f(xk), x − xk⟩ is the best linear approximation to the function f at the point xk, and we would like to make progress minimizing f. So we minimize this linear approximation, but to make sure that it has fidelity to the function f, we add a quadratic ‖x − xk‖₂² to penalize moving too far from xk, which would invalidate the linear approximation. See Figure 3.2.3.

[Figure]

Figure 3.2.3. Left: linear approximation (in black) to the function f(x) = log(1 + eˣ) (in blue) at the point xk = 0. Right: linear plus quadratic upper bound for the function f(x) = log(1 + eˣ) at the point xk = 0. This is the upper bound and approximation of the gradient method (3.2.2) with the choice αk = 1.

Assuming that f is continuously differentiable (often, one assumes the gradient ∇f(x) is Lipschitz), gradient descent is a descent method if the stepsize αk > 0 is small enough—it monotonically decreases the objective f(xk). We spend no more time on the convergence of gradient-based methods, except to say that the choice of the stepsize αk is often extremely important, and there is a body of research on carefully choosing directions as well as stepsize lengths; Nesterov [44] provides an excellent treatment of many of the basic issues.
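As a minimal illustration of the update (3.2.1), the sketch below runs gradient descent with a fixed stepsize on the smooth convex function f(x) = log(1 + eˣ) + log(1 + e⁻ˣ), an illustrative choice minimized at x = 0; here f′′ ≤ 1/2, so the stepsize 1 is small enough that every step decreases the objective.

```python
import math

# A minimal sketch of the gradient iteration (3.2.1) with a fixed
# stepsize, on the smooth convex function
# f(x) = log(1 + e^x) + log(1 + e^{-x}), minimized at x = 0 (an
# illustrative choice). Since f'' <= 1/2, the stepsize 1 is small
# enough that the iteration is a descent method.

def f(x):
    return math.log(1 + math.exp(x)) + math.log(1 + math.exp(-x))

def grad(x):
    # f'(x) = sigmoid(x) - sigmoid(-x)
    return 1 / (1 + math.exp(-x)) - 1 / (1 + math.exp(x))

x, alpha = 3.0, 1.0
values = [f(x)]
for _ in range(200):
    x = x - alpha * grad(x)        # the update (3.2.1)
    values.append(f(x))

monotone = all(v2 <= v1 + 1e-12 for v1, v2 in zip(values, values[1:]))
```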
Subgradient algorithms The subgradient method is a variant of the method of (3.2.1) in which, instead of using the gradient, we use a subgradient. The method can be written simply: for k = 1, 2, . . ., we iterate

i. Choose any subgradient gk ∈ ∂f(xk).
ii. Take the subgradient step

(3.2.4)  xk+1 = xk − αk gk.

Unfortunately, the subgradient method is not, in general, a descent method. For a simple example, take the function f(x) = |x|, and let x1 = 0. Then except for the choice g = 0, all subgradients g ∈ ∂f(0) = [−1, 1] are ascent directions. This is not just an artifact of 0 being optimal for f; in higher dimensions, this behavior is common. Consider, for example, f(x) = ‖x‖₁ and let x = e1 ∈ Rn, the first standard basis vector. Then ∂f(x) = {e1 + Σ_{i=2}^n ti ei : ti ∈ [−1, 1]}. Any vector g = e1 + Σ_{i=2}^n ti ei with Σ_{i=2}^n |ti| > 1 is an ascent direction for f, meaning that f(x − αg) > f(x) for all α > 0. If we were to pick a uniformly random g ∈ ∂f(e1), for example, then the probability that g is a descent direction is exponentially small in the dimension n.
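The exponentially-small-probability claim is easy to test empirically. The sketch below draws random elements of ∂f(e1) for f = ‖·‖₁ and estimates how often Σ_{i≥2}|ti| ≤ 1, the event in which g can be a descent direction; the exact probability is 1/(n − 1)!, so 1/120 for n = 6.

```python
import random

# An empirical version of the claim above: for f(x) = ||x||_1 at x = e_1,
# draw g = e_1 + sum_{i>=2} t_i e_i with each t_i uniform on [-1, 1] (a
# uniformly random element of the subgradient set) and estimate how often
# sum_{i>=2} |t_i| <= 1, the event in which g can be a descent direction.
# The exact probability is 1/(n-1)!, i.e. 1/120 ~ 0.0083 for n = 6.

random.seed(0)
n, trials = 6, 200_000
hits = sum(
    1 for _ in range(trials)
    if sum(abs(random.uniform(-1, 1)) for _ in range(n - 1)) <= 1
)
estimate = hits / trials
```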
In general, the characterization of the subgradient set ∂f(x) in Lemma 2.4.4 as {g : f′(x; u) ≥ ⟨g, u⟩ for all u}, where f′(x; u) = lim_{t→0} [f(x + tu) − f(x)]/t is the directional derivative, together with the fact that f′(x; u) = sup_{g∈∂f(x)} ⟨g, u⟩, guarantees that

argmin_{g∈∂f(x)} ‖g‖₂²

is a descent direction, but we do not prove this here. Indeed, finding such a descent direction would require explicitly calculating the entire subgradient set ∂f(x), which for a number of functions is non-trivial and breaks the simplicity of the subgradient method (3.2.4), which works with any subgradient.
It is the case, however, that so long as the point x does not minimize f(x), then subgradients descend on a related quantity: the distance of x to any optimal point. Indeed, let g ∈ ∂f(x), and let x⋆ ∈ argmin_x f(x) (we assume such a point exists), which need not be unique. Then we have for any α that

(1/2)‖x − αg − x⋆‖₂² = (1/2)‖x − x⋆‖₂² − α⟨g, x − x⋆⟩ + (α²/2)‖g‖₂².

The key is that for small enough α > 0, the quantity on the right is strictly smaller than (1/2)‖x − x⋆‖₂², as we now show. We use the defining inequality of the subgradient, that is, that f(y) ≥ f(x) + ⟨g, y − x⟩ for all y, including x⋆. This gives −⟨g, x − x⋆⟩ = ⟨g, x⋆ − x⟩ ≤ f(x⋆) − f(x), and thus

(3.2.5)  (1/2)‖x − αg − x⋆‖₂² ≤ (1/2)‖x − x⋆‖₂² − α(f(x) − f(x⋆)) + (α²/2)‖g‖₂².

From inequality (3.2.5), we see immediately that, no matter our choice g ∈ ∂f(x), we have

0 < α < 2(f(x) − f(x⋆))/‖g‖₂²  implies  ‖x − αg − x⋆‖₂² < ‖x − x⋆‖₂².

Summarizing, by noting that f(x) − f(x⋆) > 0, we have

Observation 3.2.6. If 0 ∉ ∂f(x), then for any x⋆ ∈ argmin_x f(x) and any g ∈ ∂f(x), there is a stepsize α > 0 such that ‖x − αg − x⋆‖₂² < ‖x − x⋆‖₂².

This observation is the key to the analysis of subgradient methods.



Convergence guarantees Unsurprisingly, given the simplicity of the subgradient method, the analysis of convergence for the method is also quite simple. We begin by stating a general result on the convergence of subgradient methods; we provide a number of variants in the sequel. We make a few simplifying assumptions in stating our result, several of which are not completely necessary, but which considerably simplify the analysis. We enumerate them here:

i. There is at least one (not necessarily unique) point x⋆ ∈ argmin_x f(x) with f(x⋆) = inf_x f(x) > −∞.
ii. The subgradients are bounded: for all x and all g ∈ ∂f(x), we have the subgradient bound ‖g‖₂ ≤ M < ∞ (independently of x).

Theorem 3.2.7. Let αk ≥ 0 be any sequence of stepsizes for which the assumptions above hold. Let xk be generated by the subgradient iteration (3.2.4). Then for all K ≥ 1,

Σ_{k=1}^K αk [f(xk) − f(x⋆)] ≤ (1/2)‖x1 − x⋆‖₂² + (1/2) Σ_{k=1}^K αk² M².
Proof. The entire proof essentially amounts to writing down explicitly the distance ‖xk+1 − x⋆‖₂² and expanding the square, which we do. By applying inequality (3.2.5), we have

(1/2)‖xk+1 − x⋆‖₂² = (1/2)‖xk − αk gk − x⋆‖₂² ≤ (1/2)‖xk − x⋆‖₂² − αk (f(xk) − f(x⋆)) + (αk²/2)‖gk‖₂²,

the inequality by (3.2.5). Rearranging this inequality and using that ‖gk‖₂² ≤ M², we obtain

αk [f(xk) − f(x⋆)] ≤ (1/2)‖xk − x⋆‖₂² − (1/2)‖xk+1 − x⋆‖₂² + (αk²/2)‖gk‖₂² ≤ (1/2)‖xk − x⋆‖₂² − (1/2)‖xk+1 − x⋆‖₂² + (αk²/2)M².

By summing the preceding expression from k = 1 to k = K and canceling the alternating ±(1/2)‖xk − x⋆‖₂² terms, we obtain the theorem. □
Theorem 3.2.7 is the starting point from which we may derive a number of useful consequences. First, we use convexity to obtain the following immediate corollary (we assume that αk > 0 in the corollary).

Corollary 3.2.8. Let AK = Σ_{i=1}^K αi and define x̄K = (1/AK) Σ_{k=1}^K αk xk. Then

f(x̄K) − f(x⋆) ≤ [‖x1 − x⋆‖₂² + Σ_{k=1}^K αk² M²] / [2 Σ_{k=1}^K αk].

Proof. Noting that (1/AK) Σ_{k=1}^K αk = 1, we see by convexity that

f(x̄K) − f(x⋆) ≤ (1/AK) Σ_{k=1}^K αk f(xk) − f(x⋆) = (1/AK) Σ_{k=1}^K αk (f(xk) − f(x⋆)).

Applying Theorem 3.2.7 gives the result. □

Corollary 3.2.8 allows us to give a number of basic convergence guarantees based on our stepsize choices. For example, we see that whenever we have

αk → 0 and Σ_{k=1}^∞ αk = ∞,

then Σ_{k=1}^K αk² / Σ_{k=1}^K αk → 0, and so

f(x̄K) − f(x⋆) → 0 as K → ∞.

Moreover, we can give specific stepsize choices to optimize the bound. For example, let us assume for simplicity that R² = ‖x1 − x⋆‖₂², so that R is our distance (radius) to optimality. Then choosing a fixed stepsize αk = α, we have

(3.2.9)  f(x̄K) − f(x⋆) ≤ R²/(2Kα) + αM²/2.

Optimizing this bound by taking α = R/(M√K) gives

f(x̄K) − f(x⋆) ≤ RM/√K.
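The fixed-stepsize bound can be checked numerically. The sketch below runs the subgradient method on f(x) = ‖x‖₁ in R³ (so x⋆ = 0 and M = √3 bounds ‖g‖₂, with R = ‖x1‖₂; all choices are illustrative), uses α = R/(M√K), and confirms f(x̄K) − f(x⋆) ≤ RM/√K.

```python
import math

# A numerical check of the fixed-stepsize bound: the subgradient method
# on f(x) = ||x||_1 in R^3, with x* = 0, M = sqrt(3) bounding ||g||_2,
# R = ||x_1||_2, and alpha = R / (M sqrt(K)). Illustrative choices only.

def f(x):
    return sum(abs(xi) for xi in x)

def subgrad(x):
    return [(xi > 0) - (xi < 0) for xi in x]   # sign(x), a valid selection

n, K = 3, 400
x = [2.0, -1.0, 0.5]                           # x_1
R = math.sqrt(sum(xi * xi for xi in x))        # distance to x* = 0
M = math.sqrt(n)
alpha = R / (M * math.sqrt(K))

running_sum = [0.0] * n
for _ in range(K):
    running_sum = [s + xi for s, xi in zip(running_sum, x)]
    g = subgrad(x)
    x = [xi - alpha * gi for xi, gi in zip(x, g)]

xbar = [s / K for s in running_sum]            # average of x_1, ..., x_K
gap = f(xbar)                                  # f(x*) = 0 here
bound = R * M / math.sqrt(K)
```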
Given that subgradient methods are not descent methods, it often makes sense, instead of tracking the (weighted) average of the points or using the final point, to use the best point observed thus far. Naturally, if we let

x_k^best = argmin_{xi : i≤k} f(xi)

and define f_k^best = f(x_k^best), then we have the same convergence guarantee that

f_K^best − f(x⋆) ≤ [R² + Σ_{k=1}^K αk² M²] / [2 Σ_{k=1}^K αk].

A number of more careful stepsize choices are possible, and we refer to the notes at the end of this lecture for more on these choices and applications outside of those we consider, as our focus is naturally circumscribed.
Example We now present an example that has applications in robust statistics and other data fitting scenarios. As a motivating scenario, suppose we have a sequence of vectors ai ∈ Rn and target responses bi ∈ R, and we would like to predict bi via the inner product ⟨ai, x⟩ for some vector x. If there are outliers or other data corruptions in the targets bi, a natural objective for this task, given the data matrix A = [a1 · · · am]ᵀ ∈ Rm×n and vector b ∈ Rm, is the absolute error

(3.2.10)  f(x) = (1/m)‖Ax − b‖₁ = (1/m) Σ_{i=1}^m |⟨ai, x⟩ − bi|.

We perform subgradient descent on this objective, which has subgradient

g(x) = (1/m) Aᵀ sign(Ax − b) = (1/m) Σ_{i=1}^m ai sign(⟨ai, x⟩ − bi) ∈ ∂f(x)

at the point x, for K = 4000 iterations with a fixed stepsize αk ≡ α for all k.
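A small version of this experiment can be sketched as follows; the problem sizes, random data, and stepsize are illustrative choices rather than the settings used for the figures.

```python
import random

# A small sketch of the experiment just described: subgradient descent
# on f(x) = (1/m)||Ax - b||_1 with a fixed stepsize. Sizes, data, and
# stepsize are illustrative; the targets here are noiseless, so the
# optimal value is 0.

random.seed(1)
m, n = 40, 5
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
u = [random.gauss(0, 1) for _ in range(n)]
b = [sum(ai[j] * u[j] for j in range(n)) for ai in A]   # b = Au

def f(x):
    return sum(abs(sum(ai[j] * x[j] for j in range(n)) - bi)
               for ai, bi in zip(A, b)) / m

def subgrad(x):
    # (1/m) A^T sign(Ax - b)
    g = [0.0] * n
    for ai, bi in zip(A, b):
        r = sum(ai[j] * x[j] for j in range(n)) - bi
        s = (r > 0) - (r < 0)
        for j in range(n):
            g[j] += s * ai[j] / m
    return g

x, alpha = [0.0] * n, 0.1
history = [f(x)]
for _ in range(2000):
    g = subgrad(x)
    x = [x[j] - alpha * g[j] for j in range(n)]
    history.append(f(x))

best = min(history)   # not a descent method, so we track the best iterate
```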

[Figure]

Figure 3.2.11. Subgradient method applied to the robust regression problem (3.2.10) with fixed stepsizes, plotting the optimality gap f(xk) − f(x⋆).

[Figure]

Figure 3.2.12. Subgradient method applied to the robust regression problem (3.2.10) with fixed stepsizes, showing performance of the best iterate, f_k^best − f(x⋆).

We give the results in Figures 3.2.11 and 3.2.12, which exhibit much of the
typical behavior of subgradient methods. From the plots, we see roughly a few
phases of behavior: the method with stepsize α = 1 makes progress very quickly
initially, but then enters its “jamming” phase, where it essentially makes no more
progress. (The largest stepsize, α = 10, simply jams immediately.) The accuracy
of the methods with different stepsizes varies greatly, as well—the smaller the

stepsize, the better the (final) performance of the iterates xk , but initial progress
is much slower.
3.3. Projected subgradient methods We often wish to solve problems not over Rn but over some constrained set; for example, in the Lasso [57] and in compressed sensing applications [20] one minimizes an objective such as ‖Ax − b‖₂² subject to ‖x‖₁ ≤ R for some constant R < ∞. Recalling the problem (3.1.1), we more generally wish to solve the problem

minimize f(x) subject to x ∈ C ⊂ Rn,

where C is a closed convex set, not necessarily Rn. The projected subgradient method is close to the subgradient method, except that we replace the iteration with

(3.3.1)  xk+1 = πC(xk − αk gk),

where

πC(x) = argmin_{y∈C} ‖x − y‖₂

denotes the (Euclidean) projection onto C. As in the gradient case (3.2.2), we can reformulate the update as making a linear approximation, with quadratic damping, to f and minimizing this approximation: by algebraic manipulation, the update (3.3.1) is equivalent to

(3.3.2)  xk+1 = argmin_{x∈C} { f(xk) + ⟨gk, x − xk⟩ + (1/(2αk))‖x − xk‖₂² }.

Figure 3.3.3 shows an example of the iterations of the projected gradient method applied to minimizing f(x) = (1/2)‖Ax − b‖₂² subject to the ℓ1-constraint ‖x‖₁ ≤ 1. Note that the method iterates between moving outside the ℓ1-ball toward the minimum of f (the level curves) and projecting back onto the ℓ1-ball.

[Figure]

Figure 3.3.3. Example execution of the projected gradient method (3.3.1) on minimizing f(x) = (1/2)‖Ax − b‖₂² subject to ‖x‖₁ ≤ 1.

It is very important in the projected subgradient method that the projection mapping πC be efficiently computable—the method is effective essentially only in problems where this is true. Often this is the case, but some care is necessary if the objective f is simple but the set C is complex. Then projecting onto the set C may be as complex as solving the original optimization problem (3.1.1). For example, a general linear programming problem is described by

minimize_x ⟨c, x⟩ subject to Ax = b, Cx ≤ d.

Then computing the projection onto the set {x : Ax = b, Cx ≤ d} is at least as difficult as solving the original problem.
Examples of projections As noted above, it is important that projections πC be efficiently calculable, and often a method's effectiveness is governed by how quickly one can compute the projection onto the constraint set C. With that in mind, we now provide two examples exhibiting convex sets C onto which projection is reasonably straightforward and for which we can write explicit, concrete projected subgradient updates.

Example 3.3.4: Suppose that C is an affine set, given as C = {x ∈ Rn : Ax = b} for A ∈ Rm×n, m ≤ n, where A is full rank. (So that A is a short and fat matrix and AAᵀ ≻ 0.) Then the projection of x onto C is

πC(x) = (I − Aᵀ(AAᵀ)⁻¹A)x + Aᵀ(AAᵀ)⁻¹b,

and if we begin the iterates from a point xk ∈ C, i.e. with Axk = b, then

xk+1 = πC(xk − αk gk) = xk − αk(I − Aᵀ(AAᵀ)⁻¹A)gk,

that is, we project gk onto the nullspace of A and iterate. ♦
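In code, this projection for a single linear constraint (so that AAᵀ is a scalar) might look like the following sketch; the particular constraint values are invented for illustration.

```python
# Example 3.3.4 in code for a single linear constraint, so that AA^T is
# a scalar: project onto C = {x : <a, x> = b0}. The constraint values
# are invented for illustration.

a = [3.0, 4.0]           # the single row of A
b0 = 10.0

def project_affine(x):
    aat = sum(ai * ai for ai in a)                        # AA^T (scalar)
    resid = (sum(ai * xi for ai, xi in zip(a, x)) - b0) / aat
    return [xi - ai * resid for xi, ai in zip(x, a)]      # x - A^T(AA^T)^{-1}(Ax - b)

p = project_affine([0.0, 0.0])
feas = abs(sum(ai * pi for ai, pi in zip(a, p)) - b0) < 1e-9
```

Projecting the origin onto {x : 3x₁ + 4x₂ = 10} gives (1.2, 1.6), the closest feasible point.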

Example 3.3.5 (Some norm balls): Consider updates when C = {x : ‖x‖p ≤ 1} for p ∈ {1, 2, ∞}, each reasonably simple, though the projections are no longer affine. First, for p = ∞, we consider each coordinate j = 1, 2, . . . , n in turn, giving

[πC(x)]j = min{1, max{xj, −1}},

that is, we truncate the coordinates of x to be in the range [−1, 1]. For p = 2, we have a similarly simple to describe update:

πC(x) = x if ‖x‖₂ ≤ 1, and πC(x) = x/‖x‖₂ otherwise.

When p = 1, that is, C = {x : ‖x‖₁ ≤ 1}, the update is somewhat more complex. If ‖x‖₁ ≤ 1, then πC(x) = x. Otherwise, we find the (unique) t ≥ 0 such that

Σ_{j=1}^n [|xj| − t]₊ = 1,

and then set the coordinates j via [πC(x)]j = sign(xj)[|xj| − t]₊. There are numerous efficient algorithms for finding this t (e.g. [14, 23]). ♦
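The three projections can be sketched as follows; the ℓ1 case finds the threshold t by sorting, a simple O(n log n) approach (the references above give faster algorithms).

```python
import math

# Sketches of the three norm-ball projections from Example 3.3.5. The
# l1 case finds the threshold t by sorting, a simple O(n log n) scheme;
# the references in the text give faster algorithms.

def proj_linf(x):
    # truncate each coordinate to [-1, 1]
    return [min(1.0, max(xi, -1.0)) for xi in x]

def proj_l2(x):
    nrm = math.sqrt(sum(xi * xi for xi in x))
    return list(x) if nrm <= 1 else [xi / nrm for xi in x]

def proj_l1(x):
    if sum(abs(xi) for xi in x) <= 1:
        return list(x)
    u = sorted((abs(xi) for xi in x), reverse=True)
    css, t = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        tj = (css - 1.0) / j
        if uj > tj:
            t = tj          # the last feasible threshold is the right one
    sign = lambda v: (v > 0) - (v < 0)
    return [sign(xi) * max(abs(xi) - t, 0.0) for xi in x]

p = proj_l1([3.0, -1.0, 0.0])   # the projection is (1, 0, 0)
```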

Convergence results We prove the convergence of the projected subgradient method using an argument similar to our proof of convergence for the classic (unconstrained) subgradient method. We assume that the set C is contained in the interior of the domain of the function f, which (as noted in the lecture on convex analysis) guarantees that f is Lipschitz continuous and subdifferentiable, so that there exists M < ∞ with ‖g‖₂ ≤ M for all g ∈ ∂f. We make the following assumptions in the next theorem.

i. The set C ⊂ Rn is compact and convex, and ‖x − x⋆‖₂ ≤ R < ∞ for all x ∈ C.
ii. There exists M < ∞ such that ‖g‖₂ ≤ M for all g ∈ ∂f(x) and x ∈ C.

We make the compactness assumption to allow for a slightly different result than Theorem 3.2.7.

Theorem 3.3.6. Let xk be generated by the projected subgradient iteration (3.3.1), where the stepsizes αk > 0 are non-increasing. Then

Σ_{k=1}^K [f(xk) − f(x⋆)] ≤ R²/(2αK) + (1/2) Σ_{k=1}^K αk M².

Proof. The starting point of the proof is the same basic inequality as we have been using, that is, the distance ‖xk+1 − x⋆‖₂². In this case, we note that projections can never increase distances to points x⋆ ∈ C, so that

‖xk+1 − x⋆‖₂² = ‖πC(xk − αk gk) − x⋆‖₂² ≤ ‖xk − αk gk − x⋆‖₂².

Now, as in our earlier derivation, we apply inequality (3.2.5) to obtain

(1/2)‖xk+1 − x⋆‖₂² ≤ (1/2)‖xk − x⋆‖₂² − αk [f(xk) − f(x⋆)] + (αk²/2)‖gk‖₂².

Rearranging this slightly by dividing by αk, we find that

f(xk) − f(x⋆) ≤ (1/(2αk)) [‖xk − x⋆‖₂² − ‖xk+1 − x⋆‖₂²] + (αk/2)‖gk‖₂².

Now, using a variant of the telescoping sum in the proof of Theorem 3.2.7, we have

(3.3.7)  Σ_{k=1}^K [f(xk) − f(x⋆)] ≤ Σ_{k=1}^K [ (1/(2αk)) (‖xk − x⋆‖₂² − ‖xk+1 − x⋆‖₂²) + (αk/2)‖gk‖₂² ].

We rearrange the middle sum in expression (3.3.7), obtaining

Σ_{k=1}^K (1/(2αk)) (‖xk − x⋆‖₂² − ‖xk+1 − x⋆‖₂²)
  = Σ_{k=2}^K (1/(2αk) − 1/(2αk−1)) ‖xk − x⋆‖₂² + (1/(2α1))‖x1 − x⋆‖₂² − (1/(2αK))‖xK+1 − x⋆‖₂²
  ≤ Σ_{k=2}^K (1/(2αk) − 1/(2αk−1)) R² + (1/(2α1)) R²

because αk ≤ αk−1. Noting that this last sum telescopes to R²/(2αK) and that ‖gk‖₂² ≤ M² in inequality (3.3.7) gives the result. □

One application of this result is to use a decreasing stepsize of αk = α/ k.
This allows nearly as strong of a convergence rate as in the fixed stepsize case
when the number of iterations K is known, but the algorithm provides a guarantee
for all iterations k. Here, we have that
K K √
1 1
√  t− 2 dt = 2 K,
k=1
k 0

1  K
and so by taking xK = K k=1 xk we obtain the following corollary.

Corollary 3.3.8. In addition to the conditions of the preceding paragraph, let the condi-
tions of Theorem 3.3.6 hold. Then
R2 M2 α
f(xK ) − f(x )  √ + √ .
2α K K
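A minimal run of the decreasing-stepsize scheme, on the illustrative one-dimensional problem of minimizing f(x) = |x − 2| over C = [−1, 1] (so x⋆ = 1 and f(x⋆) = 1), can be sketched as follows.

```python
import math

# A minimal run of the projected subgradient iteration (3.3.1) with the
# decreasing stepsize alpha_k = alpha / sqrt(k), minimizing f(x) = |x - 2|
# over C = [-1, 1] (an illustrative problem with x* = 1, f(x*) = 1).

def proj(x):                      # projection onto C = [-1, 1]
    return min(1.0, max(-1.0, x))

alpha, K = 1.0, 10_000
x, total = -1.0, 0.0
for k in range(1, K + 1):
    total += x
    g = -1.0 if x < 2 else 1.0    # a subgradient of |x - 2|
    x = proj(x - (alpha / math.sqrt(k)) * g)

xbar = total / K                  # the averaged iterate of Corollary 3.3.8
gap = abs(xbar - 2) - 1.0         # f(xbar) - f(x*)
```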

So we see that convergence is guaranteed, at the "best" rate 1/√K, for all iterations. Here, we say "best" because this rate is unimprovable—there are worst case functions for which no method can achieve a rate of convergence faster than RM/√K—but in practice, one would hope to attain better behavior by leveraging problem structure.
3.4. Stochastic subgradient methods The real power of subgradient methods, which has become evident in the last ten or fifteen years, is in their applicability to large scale optimization problems. Indeed, while subgradient methods guarantee only slow convergence—requiring 1/ε² iterations to achieve ε-accuracy—their simplicity ensures that they are robust to a number of errors. In fact, subgradient methods achieve unimprovable rates of convergence for a number of optimization problems with noise, and they often do so very computationally efficiently.
Stochastic optimization problems The basic building block for stochastic (sub)gradient methods is the stochastic (sub)gradient, often called the stochastic (sub)gradient oracle. Let f : Rn → R ∪ {∞} be a convex function, and fix x ∈ dom f. (We will typically omit the sub- qualifier in what follows.) Then a random vector g is a stochastic gradient for f at the point x if E[g] ∈ ∂f(x), that is, if

f(y) ≥ f(x) + ⟨E[g], y − x⟩ for all y.

Said somewhat more formally, we make the following definition.

Definition 3.4.1. A stochastic gradient oracle for the function f is a triple (g, S, P), where S is a sample space, P is a probability distribution, and g : Rn × S → Rn is a mapping that for each x ∈ dom f satisfies

EP[g(x, S)] = ∫ g(x, s) dP(s) ∈ ∂f(x),

where S ∈ S is a sample drawn from P.

Often, with some abuse of notation, we will use g or g(x) as shorthand for the random vector g(x, S) when this does not cause confusion.

A standard example for these types of problems is stochastic programming, where we wish to solve the convex optimization problem

(3.4.2)  minimize f(x) := EP[F(x; S)] subject to x ∈ C.

Here S is a random variable on the space S with distribution P (so the expectation EP[F(x; S)] is taken according to P), and for each s ∈ S, the function x → F(x; s) is convex. Then we immediately see that if we let

g(x, s) ∈ ∂xF(x; s),

then g is a stochastic gradient when we draw S ∼ P and set g = g(x, S), as in Lecture 2 (recall expression (2.5.1)). Recalling this calculation, we have

f(y) = EP[F(y; S)] ≥ EP[F(x; S) + ⟨g(x, S), y − x⟩] = f(x) + ⟨EP[g(x, S)], y − x⟩,

so that EP[g(x, S)] ∈ ∂f(x), and g(x, S) is indeed a stochastic subgradient.
To make the setting (3.4.2) more concrete, consider the robust regression problem (3.2.10), which uses

f(x) = (1/m)‖Ax − b‖₁ = (1/m) Σ_{i=1}^m |⟨ai, x⟩ − bi|.

Then a natural stochastic gradient, which requires time only O(n) to compute (as opposed to O(m · n) to compute Ax − b), is to uniformly at random draw an index i ∈ [m], then return

g = ai sign(⟨ai, x⟩ − bi).

More generally, given any problem in which one has a large dataset {s1, . . . , sm} and we wish to minimize the sum

f(x) = (1/m) Σ_{i=1}^m F(x; si),

then drawing an index i ∈ {1, . . . , m} uniformly at random and selecting any g ∈ ∂xF(x; si) is a stochastic gradient. Computing this stochastic gradient requires only the time necessary for computing some element of the subgradient set ∂xF(x; si), while the standard subgradient method applied to these problems is m-times more expensive in each iteration.
Frequently, moreover, the expectation E[F(x; S)] is intractable to compute exactly, especially if S is a high-dimensional distribution. In statistical and machine learning applications, we may not even know the distribution P, but we can observe i.i.d. samples Si ∼ P. In these cases, it may be impossible to even implement the calculation of a subgradient f′(x) ∈ ∂f(x), but sampling from P is possible, allowing us to compute stochastic subgradients.
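The finite-sum oracle above can be sketched directly; sizes, stepsize, and data in this sketch are illustrative, and the targets are noiseless so that the optimal value is 0.

```python
import math
import random

# A sketch of the stochastic subgradient iteration on the finite-sum
# robust regression objective f(x) = (1/m) sum_i |<a_i, x> - b_i|: each
# step samples a single index i and uses g = a_i sign(<a_i, x> - b_i),
# which costs O(n) rather than the O(m n) of the full subgradient.
# Sizes, stepsize, and data are illustrative; b = Au exactly, so f(u) = 0.

random.seed(4)
m, n = 50, 4
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
u = [random.gauss(0, 1) for _ in range(n)]
b = [sum(ai[j] * u[j] for j in range(n)) for ai in A]

def f(x):
    return sum(abs(sum(ai[j] * x[j] for j in range(n)) - bi)
               for ai, bi in zip(A, b)) / m

x = [0.0] * n
f0 = f(x)
K = 20_000
for k in range(1, K + 1):
    i = random.randrange(m)                       # one O(n) sample
    r = sum(A[i][j] * x[j] for j in range(n)) - b[i]
    s = (r > 0) - (r < 0)
    step = 0.5 / math.sqrt(k)                     # decreasing stepsize
    x = [x[j] - step * s * A[i][j] for j in range(n)]

final_gap = f(x)   # equals f(x) - f(x*) since f(x*) = 0 here
```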

Stochastic subgradient method With this motivation in place, we can describe the (projected) stochastic subgradient method. Simply, the method iterates as follows:

(1) Compute a stochastic subgradient gk at the point xk, where we have E[gk | xk] ∈ ∂f(xk).
(2) Perform the projected subgradient step

xk+1 = πC(xk − αk gk).

This is essentially identical to the projected subgradient method (3.3.1), except that we replace the true subgradient with a stochastic gradient.
In the next section, we analyze the convergence of the procedure, but here we
give two examples that exhibit some of the typical behavior of these methods.
Example 3.4.3 (Robust regression): We consider the robust regression problem of equation (3.2.10), solving

(3.4.4)  minimize_x f(x) = (1/m) Σ_{i=1}^m |⟨ai, x⟩ − bi| subject to ‖x‖₂ ≤ R,

using the random sample g = ai sign(⟨ai, x⟩ − bi) as our stochastic gradient. We generate A = [a1 · · · am]ᵀ by drawing ai ∼ N(0, In×n) i.i.d. and setting bi = ⟨ai, u⟩ + εi|εi|³, where εi ∼ N(0, 1) i.i.d. and u is a Gaussian random vector with identity covariance. We use n = 50, m = 100, and R = 4 for this experiment.
[Figure]

Figure 3.4.5. Performance of the stochastic subgradient method and of the non-stochastic subgradient method on problem (3.4.4), plotting the optimality gap f(xk) − f(x⋆) against the iteration k.

We plot the results of running the stochastic gradient iteration versus standard projected subgradient descent in Figure 3.4.5; both methods run with the fixed stepsize α = R/(M√K) for M² = (1/m)‖A‖²_Fr, which optimizes the convergence guarantees for the methods. We see in the figure the typical performance of a stochastic gradient method: the initial progress in improving the objective is quite fast, but the method eventually stops making progress once it achieves some low accuracy (in this case, 10⁻¹). In this figure we should make clear, however, that each iteration of the stochastic gradient method requires time O(n), while each iteration of the (non-noisy) projected gradient method requires time O(n · m), a factor of approximately 100 times slower. ♦

Example 3.4.6 (Multiclass support vector machine): Our second example is somewhat more complex. We are given a collection of $16 \times 16$ grayscale images of handwritten digits $\{0, 1, \ldots, 9\}$, and wish to classify images, represented as vectors $a \in \mathbb{R}^{256}$, as one of the 10 digits. In a general $k$-class classification problem, we represent the multiclass classifier using the matrix
\[ X = [x_1 \; x_2 \; \cdots \; x_k] \in \mathbb{R}^{n \times k}, \]
where $k = 10$ for the digit classification problem. Given a data vector $a \in \mathbb{R}^n$, the “score” associated with class $l$ is then $\langle x_l, a \rangle$, and the goal (given image data) is to find a matrix $X$ assigning high scores to the correct image labels. (In machine learning, the typical notation is to use weight vectors $w_1, \ldots, w_k \in \mathbb{R}^n$ instead of $x_1, \ldots, x_k$, but we use $X$ to remain consistent with our optimization focus.) The predicted class for a data vector $a \in \mathbb{R}^n$ is then
\[ \operatorname*{argmax}_{l \in [k]} \; \langle a, x_l \rangle = \operatorname*{argmax}_{l \in [k]} \; \{[X^\top a]_l\}. \]

We represent single training examples as pairs $(a, b) \in \mathbb{R}^n \times \{1, \ldots, k\}$, and as a convex surrogate for a misclassification error that the matrix $X$ makes on the pair $(a, b)$, we use the multiclass hinge loss function
\[ F(X; (a, b)) = \max_{l \neq b} \left[1 + \langle a, x_l - x_b \rangle\right]_+ \]
where $[t]_+ = \max\{t, 0\}$ denotes the positive part. Then $F$ is convex in $X$, and for a pair $(a, b)$ we have $F(X; (a, b)) = 0$ if and only if the classifier represented by $X$ has a large margin, meaning that
\[ \langle a, x_b \rangle \geq \langle a, x_l \rangle + 1 \quad \text{for all } l \neq b. \]
In this example, we have a sample of $N = 7291$ digits $(a_i, b_i) \in \mathbb{R}^n \times \{1, \ldots, k\}$, and we compare the performance of stochastic subgradient descent to standard subgradient descent for solving the problem
\[ \text{(3.4.7)} \qquad \operatorname*{minimize}_{X} \; f(X) = \frac{1}{N}\sum_{i=1}^N F(X; (a_i, b_i)) \quad \text{subject to} \quad \|X\|_{\mathrm{Fr}} \leq R, \]
where $R = 40$. We perform the stochastic gradient descent using the stepsizes $\alpha_k = \alpha_1/\sqrt{k}$, where $\alpha_1 = R/M$ and $M^2 = \frac{1}{N}\sum_{i=1}^N \|a_i\|_2^2$ (this is an approximation to the Lipschitz constant of $f$). For our stochastic gradient oracle, we select an index $i \in \{1, \ldots, N\}$ uniformly at random, then take $g \in \partial_X F(X; (a_i, b_i))$.
136 Introductory Lectures on Stochastic Optimization

For the standard subgradient method, we also perform projected subgradient descent, where we compute subgradients by taking $g_i \in \partial F(X; (a_i, b_i))$ and setting $g = \frac{1}{N}\sum_{i=1}^N g_i \in \partial f(X)$. We use an identical stepsize strategy of setting $\alpha_k = \alpha_1/\sqrt{k}$, but use the five stepsizes $\alpha_1 = 10^{-j} R/M$ for $j \in \{-2, -1, \ldots, 2\}$. We plot the results of this experiment in Figure 3.4.8, showing the optimality gap (vertical axis) plotted against the number of matrix-vector products $X^\top a_i$ computed, normalized by $N = 7291$. The plot makes clear that computing the entire subgradient $\partial f(X)$ is wasteful: the non-stochastic methods' convergence, in terms of iteration count, is potentially faster than that of the stochastic method, but the large ($7291\times$) per-iteration speedup the stochastic method enjoys because of its random sampling yields substantially better performance. Though we do not demonstrate this in the figure, this benefit typically remains true across a range of stepsize choices, suggesting the benefits of stochastic gradient methods in stochastic programming problems such as problem (3.4.7). ♦
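The stochastic update for this example can be sketched as follows. The text defers the subgradient calculation to Exercise B.3.1; the particular subgradient selection below (placing $\pm a$ on the maximizing class and the true class) is a standard choice we supply ourselves, and the tiny two-class dataset is ours as well, standing in for the digit data.

```python
import numpy as np

def hinge_subgradient(X, a, b):
    """A subgradient of F(X; (a, b)) = max_{l != b} [1 + <a, x_l - x_b>]_+,
    where the columns of X are the class vectors x_1, ..., x_k."""
    scores = X.T @ a
    margins = 1.0 + scores - scores[b]
    margins[b] = -np.inf                     # exclude l = b from the max
    l_star = int(np.argmax(margins))
    G = np.zeros_like(X)
    if margins[l_star] > 0:                  # otherwise F = 0 and G = 0 works
        G[:, l_star] = a
        G[:, b] = -a
    return G

def project_frobenius_ball(X, R):
    nrm = np.linalg.norm(X)                  # Frobenius norm
    return X if nrm <= R else (R / nrm) * X

def sgd_multiclass_hinge(data, n, k, R, alpha1, K, rng):
    """Stochastic projected subgradient with alpha_k = alpha_1 / sqrt(k)."""
    X = np.zeros((n, k))
    for step in range(1, K + 1):
        a, b = data[rng.integers(len(data))]
        G = hinge_subgradient(X, a, b)
        X = project_frobenius_ball(X - (alpha1 / np.sqrt(step)) * G, R)
    return X

# Tiny synthetic stand-in for the digit data: two separated classes in R^4.
rng = np.random.default_rng(1)
n, k, R = 4, 2, 40.0
centers = np.array([[3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]])
data = [(centers[c] + 0.1 * rng.standard_normal(n), c)
        for c in rng.integers(2, size=200)]
M2 = np.mean([np.sum(a ** 2) for a, _ in data])
X = sgd_multiclass_hinge(data, n, k, R, alpha1=R / np.sqrt(M2), K=1000, rng=rng)
acc = np.mean([np.argmax(X.T @ a) == b for a, b in data])
print(acc)  # classifies the separated clusters accurately
```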
Figure 3.4.8. Comparison of stochastic versus non-stochastic methods for the average hinge-loss minimization problem (3.4.7). The vertical axis is the optimality gap $f(X_k) - f(X^\star)$; the horizontal axis is a measure of the time used by each method, represented as the number of times the matrix-vector product $X^\top a_i$ is computed. Stochastic gradient descent vastly outperforms the non-stochastic methods.

Convergence guarantees We now turn to guarantees of convergence for the stochastic subgradient method. As in our analysis of the projected subgradient method, we assume that $C$ is compact and there is some $R < \infty$ such that $\|x - x^\star\|_2 \leq R$ for all $x \in C$, that projections $\pi_C$ are efficiently computable, and that for all $x \in C$ we have the bound $\mathbb{E}[\|g(x, S)\|_2^2] \leq M^2$ for our stochastic oracle $g$. (The oracle's noise $S$ may depend on the previous iterates, but we always have the unbiased condition $\mathbb{E}[g(x, S)] \in \partial f(x)$.)

Theorem 3.4.9. Let the conditions of the preceding paragraph hold and let $\alpha_k > 0$ be a non-increasing sequence of stepsizes. Let $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$. Then
\[ \mathbb{E}[f(\bar{x}_K) - f(x^\star)] \leq \frac{R^2}{2K\alpha_K} + \frac{1}{2K}\sum_{k=1}^K \alpha_k M^2. \]

Proof. The analysis is quite similar to our previous analyses, in that we simply expand the error $\|x_{k+1} - x^\star\|_2^2$. Let us define $f'(x) := \mathbb{E}[g(x, S)] \in \partial f(x)$ to be the expected subgradient returned by the stochastic gradient oracle, and then let $\xi_k = g_k - f'(x_k)$ be the error in the $k$th subgradient. Then
\[ \frac{1}{2}\|x_{k+1} - x^\star\|_2^2 = \frac{1}{2}\|\pi_C(x_k - \alpha_k g_k) - x^\star\|_2^2 \leq \frac{1}{2}\|x_k - \alpha_k g_k - x^\star\|_2^2 = \frac{1}{2}\|x_k - x^\star\|_2^2 - \alpha_k\langle g_k, x_k - x^\star\rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2, \]
as in the proofs of Theorems 3.2.7 and 3.3.6. Now, we can add and subtract a term $\alpha_k\langle f'(x_k), x_k - x^\star\rangle$, which gives
\[ \frac{1}{2}\|x_{k+1} - x^\star\|_2^2 \leq \frac{1}{2}\|x_k - x^\star\|_2^2 - \alpha_k\langle f'(x_k), x_k - x^\star\rangle + \frac{\alpha_k^2}{2}\|g_k\|_2^2 - \alpha_k\langle \xi_k, x_k - x^\star\rangle \leq \frac{1}{2}\|x_k - x^\star\|_2^2 - \alpha_k[f(x_k) - f(x^\star)] + \frac{\alpha_k^2}{2}\|g_k\|_2^2 - \alpha_k\langle \xi_k, x_k - x^\star\rangle, \]
where we have used the standard first-order convexity inequality.
Except for the error term $\langle \xi_k, x_k - x^\star\rangle$, the proof is completely identical to that of Theorem 3.3.6. Indeed, dividing each side of the preceding display by $\alpha_k$ and rearranging, we have
\[ f(x_k) - f(x^\star) \leq \frac{1}{2\alpha_k}\left[\|x_k - x^\star\|_2^2 - \|x_{k+1} - x^\star\|_2^2\right] + \frac{\alpha_k}{2}\|g_k\|_2^2 - \langle \xi_k, x_k - x^\star\rangle. \]
Summing this inequality, as is done after inequality (3.3.7), yields
\[ \text{(3.4.10)} \qquad \sum_{k=1}^K [f(x_k) - f(x^\star)] \leq \frac{R^2}{2\alpha_K} + \frac{1}{2}\sum_{k=1}^K \alpha_k\|g_k\|_2^2 - \sum_{k=1}^K \langle \xi_k, x_k - x^\star\rangle. \]
All our subsequent convergence guarantees follow from this basic inequality.
For this theorem, we need only take expectations, realizing that
\[ \mathbb{E}[\langle \xi_k, x_k - x^\star\rangle] = \mathbb{E}\big[\mathbb{E}[\langle g_k - f'(x_k), x_k - x^\star\rangle \mid x_k]\big] = \mathbb{E}\big[\langle \mathbb{E}[g_k \mid x_k] - f'(x_k), x_k - x^\star\rangle\big] = 0, \]
since $\mathbb{E}[g_k \mid x_k] = f'(x_k)$. Thus we obtain
\[ \mathbb{E}\bigg[\sum_{k=1}^K (f(x_k) - f(x^\star))\bigg] \leq \frac{R^2}{2\alpha_K} + \frac{1}{2}\sum_{k=1}^K \alpha_k M^2 \]
once we realize that $\mathbb{E}[\|g_k\|_2^2] \leq M^2$, which gives the desired result. □
Theorem 3.4.9 makes it clear that, in expectation, we can achieve the same convergence guarantees as in the non-noisy case. This does not mean that stochastic subgradient methods are always as good as non-stochastic methods, but it does show the robustness of the subgradient method even to substantial noise. So while the subgradient method is very slow, its slowness comes with the benefit that it can handle large amounts of noise.
We now provide a few corollaries on the convergence of stochastic gradient descent. For background on probabilistic modes of convergence, see Appendix A.2.

Corollary 3.4.11. Let the conditions of Theorem 3.4.9 hold, and let $\alpha_k = R/(M\sqrt{k})$ for each $k$. Then
\[ \mathbb{E}[f(\bar{x}_K)] - f(x^\star) \leq \frac{3RM}{2\sqrt{K}} \]
for all $K \in \mathbb{N}$.

The proof of the corollary is identical to that of Corollary 3.3.8 for the projected gradient method, once we substitute $\alpha_k = R/(M\sqrt{k})$ in the bound. We can also obtain convergence in probability of the iterates more generally.
Corollary 3.4.12. If $\alpha_k$ is non-summable but convergent to zero (i.e. $\sum_{k=1}^\infty \alpha_k = \infty$ and $\alpha_k \to 0$), then $f(\bar{x}_K) - f(x^\star) \stackrel{p}{\to} 0$ as $K \to \infty$. That is, for all $\epsilon > 0$ we have
\[ \limsup_{k \to \infty} \; \mathbb{P}\left(f(\bar{x}_k) - f(x^\star) \geq \epsilon\right) = 0. \]

The above corollaries guarantee convergence of the iterates in expectation and with high probability, but sometimes it is advantageous to give finite-sample guarantees of convergence with high probability. We can do this under somewhat stronger conditions on the subgradient noise sequence, using the Azuma-Hoeffding inequality (Theorem A.2.5 in Appendix A.2), which we present now.

Theorem 3.4.13. In addition to the conditions of Theorem 3.4.9, assume that $\|g\|_2 \leq M$ for all stochastic subgradients $g$. Then for any $\epsilon > 0$,
\[ f(\bar{x}_K) - f(x^\star) \leq \frac{R^2}{2K\alpha_K} + \frac{M^2}{2K}\sum_{k=1}^K \alpha_k + \frac{\epsilon RM}{\sqrt{K}} \]
with probability at least $1 - e^{-\frac{1}{2}\epsilon^2}$.

Written differently, by taking $\alpha_k = \frac{R}{\sqrt{k}M}$ and setting $\delta = e^{-\frac{1}{2}\epsilon^2}$, we have
\[ f(\bar{x}_K) - f(x^\star) \leq \frac{3MR}{\sqrt{K}} + \frac{MR\sqrt{2\log\frac{1}{\delta}}}{\sqrt{K}} \]
with probability at least $1 - \delta$. That is, we have convergence of $O(MR/\sqrt{K})$ with high probability.
Before providing the proof proper, we discuss two examples in which the boundedness condition holds. Recall from Lecture 2 that a convex function $f$ is $M$-Lipschitz if and only if $\|g\|_2 \leq M$ for all $g \in \partial f(x)$ and $x \in \mathbb{R}^n$, so Theorem 3.4.13 requires that the random functions $F(\cdot\,; S)$ are Lipschitz over the domain $C$. Our robust regression and multiclass support vector machine examples both satisfy the conditions of the theorem so long as the data is bounded. More precisely, for the robust regression problem (3.2.10) with loss $F(x; (a, b)) = |\langle a, x \rangle - b|$, we have $\partial F(x; (a, b)) = a \operatorname{sign}(\langle a, x \rangle - b)$, so that the condition $\|g\|_2 \leq M$ holds if and only if $\|a\|_2 \leq M$. For the multiclass hinge loss problem (3.4.7), with $F(X; (a, b)) = \max_{l \neq b}[1 + \langle a, x_l - x_b \rangle]_+$, Exercise B.3.1 develops the subgradient calculations, but again, we have the boundedness of $\partial_X F(X; (a, b))$ if and only if $a \in \mathbb{R}^n$ is bounded.
Proof. We begin with the basic inequality of Theorem 3.4.9, inequality (3.4.10). We see that we would like to bound the probability that
\[ \sum_{k=1}^K \langle \xi_k, x^\star - x_k\rangle \]
is large. First, we note that the iterate $x_k$ is a function of $\xi_1, \ldots, \xi_{k-1}$, and we have the conditional expectation
\[ \mathbb{E}[\xi_k \mid \xi_1, \ldots, \xi_{k-1}] = \mathbb{E}[\xi_k \mid x_k] = 0. \]
Moreover, using the boundedness assumption that $\|g\|_2 \leq M$, we first obtain $\|\xi_k\|_2 = \|g_k - f'(x_k)\|_2 \leq 2M$ and then
\[ |\langle \xi_k, x_k - x^\star\rangle| \leq \|\xi_k\|_2\|x_k - x^\star\|_2 \leq 2MR. \]
Thus, the sequence $\sum_{k=1}^K \langle \xi_k, x_k - x^\star\rangle$ is a bounded difference martingale sequence, and we may apply Azuma's inequality (Theorem A.2.5), which guarantees
\[ \mathbb{P}\bigg(\sum_{k=1}^K \langle \xi_k, x^\star - x_k\rangle \geq t\bigg) \leq \exp\left(-\frac{t^2}{2KM^2R^2}\right) \]
for all $t \geq 0$. Substituting $t = \epsilon MR\sqrt{K}$, we obtain, as desired, that
\[ \mathbb{P}\bigg(\sum_{k=1}^K \langle \xi_k, x^\star - x_k\rangle \geq \epsilon MR\sqrt{K}\bigg) \leq \exp\left(-\frac{\epsilon^2}{2}\right). \qquad \square \]

Summarizing the results of this section, we see a number of consequences. First, stochastic gradient methods guarantee that after $O(1/\epsilon^2)$ iterations, we have error at most $f(\bar{x}) - f(x^\star) = O(\epsilon)$. Second, this convergence is (at least to the order in $\epsilon$) the same as in the non-noisy case; that is, stochastic gradient methods are robust enough to noise that their convergence is hardly affected by it. In addition to this, they are often applicable in situations in which we cannot even evaluate the objective $f$, whether for computational reasons or because we do not have access to it, as in statistical problems. This robustness to noise and good performance has led to wide adoption of subgradient-like methods as the de facto choice for many large-scale data-based optimization problems. In the coming sections, we give further discussion of the optimality of stochastic gradient methods, showing that, roughly, when we have access only to noisy data, it is impossible to solve (certain) problems to accuracy better than $\epsilon$ given $1/\epsilon^2$ data points; thus, using more expensive but accurate optimization methods may have limited benefit (though there may still be some benefit practically!).
Notes and further reading Our treatment in this chapter borrows from a number of resources. The two heaviest are the lecture notes for Stephen Boyd's Stanford EE364b course [10, 11] and Polyak's Introduction to Optimization [47]. Our guarantees of high probability convergence are similar to those originally developed by Cesa-Bianchi et al. [16] in the context of online learning, which Nemirovski et al. [40] more fully develop. More references on subgradient methods include the lecture notes of Nemirovski [43] and Nesterov [44].
A number of extensions of (stochastic) subgradient methods are possible, including to online scenarios in which we observe streaming sequences of functions [25, 64]; our analysis in this section follows closely that of Zinkevich [64]. The classic paper of Polyak and Juditsky [48] shows that stochastic gradient descent methods, coupled with averaging, can achieve asymptotically optimal rates of convergence even to constant factors. Recent work in machine learning by a number of authors [18, 32, 53] has shown how to leverage the structure of optimization problems based on finite sums, that is, when $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$, to develop methods that achieve convergence rates similar to those of interior point methods but with iteration complexity close to stochastic gradient methods.

4. The Choice of Metric in Subgradient Methods


Lecture Summary: Standard subgradient and projected subgradient methods are inherently Euclidean: they rely on measuring distances using Euclidean norms, and their updates are based on Euclidean steps. In this lecture, we study methods for more carefully choosing the metric, giving rise to mirror descent, also known as non-Euclidean subgradient descent, as well as methods for adapting the updates performed to the problem at hand. By more carefully studying the geometry of the optimization problem being solved, we show how faster convergence guarantees are possible.

4.1. Introduction In the prior lecture, we studied projected subgradient methods for solving the problem (2.1.1) by iteratively updating $x_{k+1} = \pi_C(x_k - \alpha_k g_k)$, where $\pi_C$ denotes Euclidean projection. The convergence of these methods, as exemplified by Corollaries 3.2.8 and 3.4.11, scales as
\[ \text{(4.1.1)} \qquad f(\bar{x}_K) - f(x^\star) \leq \frac{MR}{\sqrt{K}} = O(1)\,\frac{\operatorname{diam}(C)\operatorname{Lip}(f)}{\sqrt{K}}, \]
where $R = \sup_{x \in C}\|x - x^\star\|_2$ and $M$ is the Lipschitz constant of $f$ over the set $C$ with respect to the $\ell_2$-norm,
\[ M = \sup_{x \in C}\;\sup_{g \in \partial f(x)} \|g\|_2 = \sup_{x \in C}\;\sup_{g \in \partial f(x)} \bigg(\sum_{j=1}^n g_j^2\bigg)^{1/2}. \]
The convergence guarantee (4.1.1) reposes on Euclidean measures of scale: the diameter of $C$ and the norms of the subgradients $g$ are both measured in the $\ell_2$-norm. It is thus natural to ask if we can develop methods whose convergence rates depend on other measures of scale of $f$ and $C$, obtaining better problem-dependent behavior and geometry. With that in mind, in this lecture we derive a number of methods that use either non-Euclidean or adaptive updates to better reflect the geometry of the underlying optimization problem.
4.2. Mirror Descent Methods Our first set of results focuses on mirror descent methods, which modify the basic subgradient update to use a different distance-measuring function rather than the squared $\ell_2$-term.

Figure 4.2.1. The Bregman divergence $D_h(x, y)$: the gap at $x$ between $h$ and its linear approximation at $y$. The upper curve is $h(x) = \log(1 + e^x)$; the lower (linear) curve is the approximation $x \mapsto h(y) + \langle \nabla h(y), x - y \rangle$ to $h$ at $y$.

Before presenting these methods, we give a few definitions. Let $h$ be a convex function, differentiable on $C$. The Bregman divergence $D_h$ associated with $h$ is defined as
\[ \text{(4.2.2)} \qquad D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle. \]
It is always nonnegative, by the standard first-order inequality for convex functions, and measures the gap between the value $h(x)$ at $x$ and the linear approximation $h(y) + \langle \nabla h(y), x - y \rangle$ for $h(x)$ taken from the point $y$. See Figure 4.2.1.
As one standard example, if we take $h(x) = \frac{1}{2}\|x\|_2^2$, then we obtain $D_h(x, y) = \frac{1}{2}\|x - y\|_2^2$. A second common example follows by taking the entropy functional $h(x) = \sum_{j=1}^n x_j \log x_j$, restricting $x$ to the probability simplex (i.e. $x \geq 0$ and $\sum_j x_j = 1$). We then have $D_h(x, y) = \sum_{j=1}^n x_j \log\frac{x_j}{y_j}$, the entropic or Kullback-Leibler divergence.
Because the quantity (4.2.2) is always non-negative and convex in its first argument, it is natural to treat it as a distance-like function in the development of optimization procedures. Indeed, recalling the updates (3.2.2) and (3.3.2), by analogy we consider the method
i. Compute a subgradient $g_k \in \partial f(x_k)$.
ii. Perform the update
\[ \text{(4.2.3)} \qquad x_{k+1} = \operatorname*{argmin}_{x \in C}\left\{ f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{\alpha_k} D_h(x, x_k) \right\} = \operatorname*{argmin}_{x \in C}\left\{ \langle g_k, x \rangle + \frac{1}{\alpha_k} D_h(x, x_k) \right\}. \]
This scheme is the mirror descent method. Thus, each differentiable convex function $h$ gives a new optimization scheme, where we often attempt to choose $h$ to better match the geometry of the underlying constraint set $C$.
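The scheme above is easy to express as a generic routine parameterized by the Bregman-proximal step. This is a sketch under our own naming conventions: the `prox` argument is assumed to solve the argmin in (4.2.3) for the chosen $h$, which we illustrate only with the Euclidean choice $h = \frac{1}{2}\|\cdot\|_2^2$ and $C = \mathbb{R}^n$, for which the step reduces to a plain subgradient step (cf. Example 4.2.4).

```python
import numpy as np

def mirror_descent(subgrad, prox, x0, stepsizes):
    """Generic mirror descent: at each step, x_{k+1} = argmin_{x in C}
    { <g_k, x> + (1/alpha_k) D_h(x, x_k) }, computed by prox(x_k, g_k, alpha_k).
    Returns the averaged iterate."""
    x = x0
    iterates = [x0]
    for alpha in stepsizes:
        x = prox(x, subgrad(x), alpha)
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Euclidean choice h = 0.5 ||.||_2^2 with C = R^n: the Bregman-proximal
# step is a plain subgradient step, recovering (sub)gradient descent.
def euclidean_prox(x, g, alpha):
    return x - alpha * g

# Minimize f(x) = ||x - c||_1, which has subgradient sign(x - c).
c = np.array([1.0, -2.0, 0.5])
xbar = mirror_descent(subgrad=lambda x: np.sign(x - c),
                      prox=euclidean_prox,
                      x0=np.zeros(3),
                      stepsizes=[0.5 / np.sqrt(kk) for kk in range(1, 401)])
print(np.round(xbar, 2))  # the averaged iterate approaches the minimizer c
```

Swapping `prox` for an entropic or $\ell_p$ step changes the geometry of the method without touching the surrounding loop.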
To this point, we have been vague about the “geometry” of the constraint set, so we attempt to be somewhat more concrete. We say that $h$ is $\lambda$-strongly convex over $C$ with respect to the norm $\|\cdot\|$ if
\[ h(y) \geq h(x) + \langle \nabla h(x), y - x \rangle + \frac{\lambda}{2}\|x - y\|^2 \quad \text{for all } x, y \in C. \]
Importantly, this norm need not be the typical $\ell_2$ or Euclidean norm. Our goal is then, roughly, to choose a strongly convex function $h$ so that the diameter of $C$ is small in the norm $\|\cdot\|$ with respect to which $h$ is strongly convex (as we see presently, an analogue of the bound (4.1.1) holds). In the standard updates (3.2.2) and (3.3.2), we use the squared Euclidean norm to trade between making progress on the linear approximation $x \mapsto f(x_k) + \langle g_k, x - x_k \rangle$ and making sure the approximation is reasonable; that is, we regularize progress. Thus it is natural to ask that the function $h$ we use provide a similar type of regularization, and consequently, we will require that the function $h$ be 1-strongly convex (usually shortened to the unqualified strongly convex) with respect to some norm $\|\cdot\|$ over the constraint set $C$ in the mirror descent method (4.2.3).⁴ Note that strong convexity of $h$ is equivalent to
\[ D_h(x, y) \geq \frac{1}{2}\|x - y\|^2 \quad \text{for all } x, y \in C. \]
Examples of mirror descent Before analyzing the method (4.2.3), we present a few examples, showing the updates that are possible as well as verifying that the associated divergence is appropriately strongly convex. One of the nice consequences of allowing different divergence measures $D_h$, as opposed to only the Euclidean divergence, is that they often yield cleaner or simpler updates.
Example 4.2.4 (Gradient descent is mirror descent): Let $h(x) = \frac{1}{2}\|x\|_2^2$. Then $\nabla h(y) = y$, and
\[ D_h(x, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y, x - y \rangle = \frac{1}{2}\|x\|_2^2 + \frac{1}{2}\|y\|_2^2 - \langle x, y \rangle = \frac{1}{2}\|x - y\|_2^2. \]
Thus, substituting into the update (4.2.3), we see the choice $h(x) = \frac{1}{2}\|x\|_2^2$ recovers the standard (stochastic sub)gradient method
\[ x_{k+1} = \operatorname*{argmin}_{x \in C}\left\{ \langle g_k, x \rangle + \frac{1}{2\alpha_k}\|x - x_k\|_2^2 \right\}. \]
It is evident that $h$ is strongly convex with respect to the $\ell_2$-norm for any constraint set $C$. ♦

Example 4.2.5 (Solving problems on the simplex with exponentiated gradient methods): Suppose that our constraint set $C = \{x \in \mathbb{R}^n_+ : \langle \mathbf{1}, x \rangle = 1\}$ is the probability simplex in $\mathbb{R}^n$. Then updates with the standard Euclidean distance are somewhat challenging, though there are efficient implementations [14, 23], and it is natural to ask for a simpler method.
With that in mind, let $h(x) = \sum_{j=1}^n x_j \log x_j$ be the negative entropy, which is convex because it is the sum of convex functions. (The derivatives of $f(t) = t\log t$ are $f'(t) = \log t + 1$ and $f''(t) = 1/t > 0$ for $t > 0$.) Then we have
\[ D_h(x, y) = \sum_{j=1}^n \left[ x_j \log x_j - y_j \log y_j - (\log y_j + 1)(x_j - y_j) \right] = \sum_{j=1}^n x_j \log\frac{x_j}{y_j} + \langle \mathbf{1}, y - x \rangle = D_{\mathrm{kl}}(x\|y), \]
the KL-divergence between $x$ and $y$ (when extended to $\mathbb{R}^n_+$, though over $C$ we have $\langle \mathbf{1}, x - y \rangle = 0$). This gives us the form of the update (4.2.3).
Let us consider the update (4.2.3). Simplifying notation, we would like to solve
\[ \operatorname*{minimize}_{x} \; \langle g, x \rangle + \sum_{j=1}^n x_j \log\frac{x_j}{y_j} \quad \text{subject to} \quad \langle \mathbf{1}, x \rangle = 1, \; x \geq 0. \]

⁴ This is not strictly a requirement, and sometimes it is analytically convenient to avoid this, but our analysis is simpler when $h$ is strongly convex.

We assume that the yj > 0, though this is not strictly necessary. Though we
have not discussed this, we write the Lagrangian for this problem by introducing
Lagrange multipliers τ ∈ R for the equality constraint 1, x = 1 and λ ∈ Rn
+ for
the inequality x  0.
Then we obtain Lagrangian
n 
 
xj
L(x, τ, λ) = g, x + xj log + τxj − λj xj − τ.
yj
j=1

Minimizing out x to find the appropriate form for the solution, we take deriva-
tives with respect to x and set them to zero to find

0= L(x, τ, λ) = gj + log xj + 1 − log yj + τ − λj ,
∂xj
or
xj (τ, λ) = yj exp(−gj − 1 − τ + λj ).
We may take λj = 0, as the latter expression yields all positive xj , and to satisfy
 
the constraint that j xj = 1, we set τ = log( j yj e−gj ) − 1. Thus we have the
update
y exp(−gi )
xi = n i .
j=1 yj exp(−gj )
Rewriting this in terms of the precise update at time k for the mirror descent
method, we have for each coordinate i of iterate k + 1 of the method that
xk,i exp(−αk gk,i )
(4.2.6) xk+1,i = n .
j=1 xk,j exp(−αk gk,j )
This is the so-called exponentiated gradient update, also known as entropic mirror
descent.
Later, after stating and proving our main convergence theorems, we will show
that the negative entropy is strongly convex with respect to the 1 -norm, meaning
that our coming convergence guarantees apply. ♦
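The update (4.2.6) is a one-line computation. The sketch below uses our own names, and shifts the exponent by $\min_j g_j$ for numerical stability, an implementation detail not in the text (the shift cancels in the normalization).

```python
import numpy as np

def exponentiated_gradient_step(x, g, alpha):
    """Entropic mirror descent / exponentiated gradient update (4.2.6):
    x_i <- x_i exp(-alpha g_i) / sum_j x_j exp(-alpha g_j)."""
    w = x * np.exp(-alpha * (g - g.min()))  # shift by min(g) for stability;
                                            # it cancels in the normalization
    return w / w.sum()

# Minimize the linear function f(x) = <c, x> over the probability simplex;
# the minimizer puts all mass on the smallest coordinate of c.
c = np.array([0.9, 0.2, 0.5, 0.7])
x = np.full(4, 0.25)                        # start at x_1 = (1/n) * 1
for k in range(1, 201):
    x = exponentiated_gradient_step(x, c, alpha=1.0 / np.sqrt(k))

print(np.round(x, 3))  # mass concentrates on index 1, the argmin of c
```

Note that the iterates remain in the simplex automatically: the update is multiplicative and then normalized, so no explicit projection is needed.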

Example 4.2.7 (Using $\ell_p$-norms): As a final example, we consider using squared $\ell_p$-norms for our distance-generating function $h$. These have nice robustness properties, and are also finite on any compact set (unlike the KL-divergence of Example 4.2.5). Indeed, let $p \in (1, 2]$, and define $h(x) = \frac{1}{2(p-1)}\|x\|_p^2$. We claim without proof that $h$ is strongly convex with respect to the $\ell_p$-norm, that is,
\[ D_h(x, y) \geq \frac{1}{2}\|x - y\|_p^2. \]
(See, for example, the thesis of Shalev-Shwartz [51] and Question B.4.2 in the exercises. This inequality fails for powers other than 2 as well as for $p > 2$.)
We do not address the constrained case here, assuming instead that $C = \mathbb{R}^n$. In this case, we have
\[ \nabla h(x) = \frac{1}{p-1}\|x\|_p^{2-p}\left[\operatorname{sign}(x_1)|x_1|^{p-1} \;\; \cdots \;\; \operatorname{sign}(x_n)|x_n|^{p-1}\right]. \]

Now, if we define the function $\phi(x) = (p-1)\nabla h(x)$, then a calculation verifies that the function $\varphi : \mathbb{R}^n \to \mathbb{R}^n$ defined coordinate-wise by
\[ \phi_j(x) = \|x\|_p^{2-p}\operatorname{sign}(x_j)|x_j|^{p-1} \quad \text{and} \quad \varphi_j(y) = \|y\|_q^{2-q}\operatorname{sign}(y_j)|y_j|^{q-1}, \]
where $\frac{1}{p} + \frac{1}{q} = 1$, satisfies $\varphi(\phi(x)) = x$, that is, $\varphi = \phi^{-1}$ (and similarly $\phi = \varphi^{-1}$). Thus, the mirror descent update (4.2.3) when $C = \mathbb{R}^n$ becomes the somewhat more complex
\[ \text{(4.2.8)} \qquad x_{k+1} = \varphi(\phi(x_k) - \alpha_k(p-1)g_k) = (\nabla h)^{-1}(\nabla h(x_k) - \alpha_k g_k). \]
The second form of the update (4.2.8), that is, the form involving the inverse of the gradient mapping $(\nabla h)^{-1}$, holds more generally, that is, for any strictly convex and differentiable $h$. This is the original form of the mirror descent update (4.2.3), and it justifies the name mirror descent, as the gradient is “mirrored” through the distance-generating function $h$ and back again. Nonetheless, we find the modeling perspective of (4.2.3) somewhat easier to explain.
We remark in passing that while constrained updates are somewhat more challenging in this case, a few are efficiently solvable. For example, suppose that $C = \{x \in \mathbb{R}^n_+ : \langle \mathbf{1}, x \rangle = 1\}$, the probability simplex. In this case, the update with $\ell_p$-norms becomes a problem of solving
\[ \operatorname*{minimize}_{x} \; \langle v, x \rangle + \frac{1}{2}\|x\|_p^2 \quad \text{subject to} \quad \langle \mathbf{1}, x \rangle = 1, \; x \geq 0, \]
where $v = \alpha_k(p-1)g_k - \phi(x_k)$, and $\varphi$ and $\phi$ are defined as above. An analysis of the Karush-Kuhn-Tucker conditions for this problem (omitted) yields that the solution to the problem is given by finding the $t \in \mathbb{R}$ such that
\[ \sum_{j=1}^n \varphi_j\big([-v_j + t]_+\big) = 1 \quad \text{and setting} \quad x_j = \varphi_j\big([-v_j + t]_+\big). \]
Because $\varphi$ is increasing in its argument with $\varphi(0) = 0$, this $t$ can be found to accuracy $\epsilon$ in time $O(n\log\frac{1}{\epsilon})$ by binary search. ♦
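A quick numerical check, with helper names of our own, that the coordinate maps $\phi$ and $\varphi$ of this example are mutually inverse, together with the resulting unconstrained update (4.2.8). This is a sanity-check sketch, not a full mirror descent implementation.

```python
import numpy as np

def phi(x, p):
    """phi_j(x) = ||x||_p^{2-p} sign(x_j) |x_j|^{p-1}."""
    nrm = np.linalg.norm(x, ord=p)
    return nrm ** (2 - p) * np.sign(x) * np.abs(x) ** (p - 1)

def varphi(y, q):
    """varphi_j(y) = ||y||_q^{2-q} sign(y_j) |y_j|^{q-1}."""
    nrm = np.linalg.norm(y, ord=q)
    return nrm ** (2 - q) * np.sign(y) * np.abs(y) ** (q - 1)

def lp_mirror_step(x, g, alpha, p):
    """Unconstrained mirror descent step (4.2.8) for h(x) = ||x||_p^2/(2(p-1))."""
    q = p / (p - 1.0)
    return varphi(phi(x, p) - alpha * (p - 1.0) * g, q)

rng = np.random.default_rng(2)
p = 1.5
q = p / (p - 1.0)                   # q = 3, so that 1/p + 1/q = 1
x = rng.standard_normal(6)
x_back = varphi(phi(x, p), q)       # should return x: varphi = phi^{-1}
step = lp_mirror_step(x, np.ones(6), alpha=0.1, p=p)
print(np.max(np.abs(x_back - x)))   # numerically zero
```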

Convergence guarantees With the mirror descent method described, we now provide an analysis of its convergence behavior. In this case, the analysis is somewhat more complex than that for the subgradient, projected subgradient, and stochastic subgradient methods, as we cannot simply expand the distance $\|x_{k+1} - x^\star\|_2^2$. Thus, we give a variant proof that relies on the optimality conditions for convex optimization problems, as well as a few tricks involving norms and their dual norms. Recall that we assume that the function $h$ is strongly convex with respect to some norm $\|\cdot\|$, and that the associated dual norm $\|\cdot\|_*$ is defined by
\[ \|y\|_* := \sup_{x : \|x\| \leq 1} \langle y, x \rangle. \]

Theorem 4.2.9. Let $\alpha_k > 0$ be any non-increasing sequence of stepsizes, and let $x_k$ be generated by the mirror descent iteration (4.2.3). If $D_h(x^\star, x) \leq R^2$ for all $x \in C$, then for all $K \in \mathbb{N}$,
\[ \sum_{k=1}^K [f(x_k) - f(x^\star)] \leq \frac{1}{\alpha_K}R^2 + \sum_{k=1}^K \frac{\alpha_k}{2}\|g_k\|_*^2. \]
If $\alpha_k \equiv \alpha$ is constant, then for all $K \in \mathbb{N}$,
\[ \sum_{k=1}^K [f(x_k) - f(x^\star)] \leq \frac{1}{\alpha}D_h(x^\star, x_1) + \sum_{k=1}^K \frac{\alpha}{2}\|g_k\|_*^2. \]
As an immediate consequence of this theorem, we see that if $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$ or $\bar{x}_K = \operatorname*{argmin}_{x_k} f(x_k)$, and we have the gradient bound $\|g\|_* \leq M$ for all $g \in \partial f(x)$ for $x \in C$, then (say, in the second case) convexity implies
\[ \text{(4.2.10)} \qquad f(\bar{x}_K) - f(x^\star) \leq \frac{1}{K\alpha}D_h(x^\star, x_1) + \frac{\alpha}{2}M^2. \]
By comparing with the bound (3.2.9), we see that the mirror descent (non-Euclidean gradient descent) method gives roughly the same type of convergence guarantees as standard subgradient descent. Roughly, we expect the following type of behavior with a fixed stepsize: a rate of convergence of roughly $1/(\alpha K)$ until we are within a radius $\alpha$ of the optimum, after which mirror descent and subgradient descent essentially jam; they just jump back and forth near the optimum.
Proof. We begin by considering the progress made in a single update of $x_k$, but whereas our previous proofs all began with a Lyapunov function for the distance $\|x_k - x^\star\|_2$, we use function value gaps instead of the distance to optimality. Using the first-order convexity inequality, i.e. the definition of a subgradient, we have
\[ f(x_k) - f(x^\star) \leq \langle g_k, x_k - x^\star \rangle. \]
The idea is to show that replacing $x_k$ with $x_{k+1}$ makes the term $\langle g_k, x_k - x^\star \rangle$ small because of the definition of $x_{k+1}$, while $x_k$ and $x_{k+1}$ are close together, so that this replacement makes little difference.
First, we add and subtract $\langle g_k, x_{k+1} \rangle$ to obtain
\[ \text{(4.2.11)} \qquad f(x_k) - f(x^\star) \leq \langle g_k, x_{k+1} - x^\star \rangle + \langle g_k, x_k - x_{k+1} \rangle. \]
Now, we use the first-order necessary and sufficient conditions for optimality of convex optimization problems given by Theorem 2.4.11. Because $x_{k+1}$ solves problem (4.2.3), we have
\[ \left\langle g_k + \alpha_k^{-1}(\nabla h(x_{k+1}) - \nabla h(x_k)), \; x - x_{k+1} \right\rangle \geq 0 \quad \text{for all } x \in C. \]
In particular, this holds for $x = x^\star$, and substituting into (4.2.11) yields
\[ f(x_k) - f(x^\star) \leq \frac{1}{\alpha_k}\langle \nabla h(x_{k+1}) - \nabla h(x_k), x^\star - x_{k+1} \rangle + \langle g_k, x_k - x_{k+1} \rangle. \]
We now use two tricks: an algebraic identity involving $D_h$ and the Fenchel-Young inequality. By algebraic manipulations, we have that
\[ \langle \nabla h(x_{k+1}) - \nabla h(x_k), x^\star - x_{k+1} \rangle = D_h(x^\star, x_k) - D_h(x^\star, x_{k+1}) - D_h(x_{k+1}, x_k). \]
Substituting into the preceding display, we have
\[ \text{(4.2.12)} \qquad f(x_k) - f(x^\star) \leq \frac{1}{\alpha_k}\left[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1}) - D_h(x_{k+1}, x_k)\right] + \langle g_k, x_k - x_{k+1} \rangle. \]
The second insight is that the subtraction of $D_h(x_{k+1}, x_k)$ allows us to cancel some of $\langle g_k, x_k - x_{k+1} \rangle$. To see this, recall the Fenchel-Young inequality, which states that
\[ \langle x, y \rangle \leq \frac{\eta}{2}\|x\|^2 + \frac{1}{2\eta}\|y\|_*^2 \]
for any pair of dual norms $(\|\cdot\|, \|\cdot\|_*)$ and any $\eta > 0$. (To see this, note that by definition of the dual norm, we have $\langle x, y \rangle \leq \|x\|\|y\|_*$, and for any constants $a, b \in \mathbb{R}$ and $\eta > 0$, we have $0 \leq \frac{1}{2}(\eta^{\frac{1}{2}}a - \eta^{-\frac{1}{2}}b)^2 = \frac{\eta}{2}a^2 + \frac{1}{2\eta}b^2 - ab$, so that $\|x\|\|y\|_* \leq \frac{\eta}{2}\|x\|^2 + \frac{1}{2\eta}\|y\|_*^2$.) In particular, we have
\[ \langle g_k, x_k - x_{k+1} \rangle \leq \frac{\alpha_k}{2}\|g_k\|_*^2 + \frac{1}{2\alpha_k}\|x_k - x_{k+1}\|^2. \]
The strong convexity assumption on $h$ guarantees $D_h(x_{k+1}, x_k) \geq \frac{1}{2}\|x_k - x_{k+1}\|^2$, or that
\[ -\frac{1}{\alpha_k}D_h(x_{k+1}, x_k) + \langle g_k, x_k - x_{k+1} \rangle \leq \frac{\alpha_k}{2}\|g_k\|_*^2. \]
Substituting this into inequality (4.2.12), we have
\[ \text{(4.2.13)} \qquad f(x_k) - f(x^\star) \leq \frac{1}{\alpha_k}\left[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1})\right] + \frac{\alpha_k}{2}\|g_k\|_*^2. \]
This inequality should look similar to inequality (3.3.7) in the proof of Theorem 3.3.6 on the projected subgradient method in Lecture 3. Indeed, using that $D_h(x^\star, x_k) \leq R^2$ by assumption, an identical derivation to that in Theorem 3.3.6 gives the first result of this theorem. For the second, when the stepsize is fixed, note that
\[ \sum_{k=1}^K [f(x_k) - f(x^\star)] \leq \frac{1}{\alpha}\sum_{k=1}^K \left[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1})\right] + \sum_{k=1}^K \frac{\alpha}{2}\|g_k\|_*^2 = \frac{1}{\alpha}\left[D_h(x^\star, x_1) - D_h(x^\star, x_{K+1})\right] + \sum_{k=1}^K \frac{\alpha}{2}\|g_k\|_*^2, \]
which is the second result. □
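The algebraic identity used in this proof, $\langle \nabla h(x_{k+1}) - \nabla h(x_k), x^\star - x_{k+1} \rangle = D_h(x^\star, x_k) - D_h(x^\star, x_{k+1}) - D_h(x_{k+1}, x_k)$, holds for any differentiable $h$ and is easy to sanity-check numerically. A short sketch with the negative entropy as $h$ (all names are ours):

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad_h(y), x - y>."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

# Negative entropy on the (strictly) positive orthant.
h = lambda x: np.sum(x * np.log(x))
grad_h = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(3)
x_star, x_k, x_next = (rng.uniform(0.1, 1.0, size=5) for _ in range(3))

lhs = (grad_h(x_next) - grad_h(x_k)) @ (x_star - x_next)
rhs = (bregman(h, grad_h, x_star, x_k)
       - bregman(h, grad_h, x_star, x_next)
       - bregman(h, grad_h, x_next, x_k))
print(abs(lhs - rhs))  # agrees to floating-point precision
```

Expanding the three divergences by their definition shows why: every $h$ value cancels, leaving only the inner-product terms on the left-hand side.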
We briefly provide a few remarks before moving on. First, all of the preceding analysis carries through in an almost completely identical fashion in the stochastic case. We state the most basic result, as the extension from Section 3.4 is essentially straightforward.

Corollary 4.2.14. Let the conditions of Theorem 4.2.9 hold, except that instead of receiving a vector $g_k \in \partial f(x_k)$ at iteration $k$, the vector $g_k$ is a stochastic subgradient satisfying $\mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$. Then for any non-increasing stepsize sequence $\alpha_k$ (where $\alpha_k$ may be chosen dependent on $g_1, \ldots, g_k$),
\[ \mathbb{E}\bigg[\sum_{k=1}^K (f(x_k) - f(x^\star))\bigg] \leq \mathbb{E}\bigg[\frac{R^2}{\alpha_K} + \sum_{k=1}^K \frac{\alpha_k}{2}\|g_k\|_*^2\bigg]. \]

Proof. We sketch the proof, which is identical to that of Theorem 4.2.9, except that we replace $g_k$ with the vector $f'(x_k)$ satisfying $\mathbb{E}[g_k \mid x_k] = f'(x_k) \in \partial f(x_k)$. Then
\[ f(x_k) - f(x^\star) \leq \langle f'(x_k), x_k - x^\star \rangle = \langle g_k, x_k - x^\star \rangle + \langle f'(x_k) - g_k, x_k - x^\star \rangle, \]
and an identical derivation yields the following analogue of inequality (4.2.13):
\[ f(x_k) - f(x^\star) \leq \frac{1}{\alpha_k}\left[D_h(x^\star, x_k) - D_h(x^\star, x_{k+1})\right] + \frac{\alpha_k}{2}\|g_k\|_*^2 + \langle f'(x_k) - g_k, x_k - x^\star \rangle. \]
This inequality holds regardless of how we choose $\alpha_k$. Moreover, by iterating expectations, we have
\[ \mathbb{E}[\langle f'(x_k) - g_k, x_k - x^\star \rangle] = \mathbb{E}[\langle f'(x_k) - \mathbb{E}[g_k \mid x_k], x_k - x^\star \rangle] = 0, \]
which gives the corollary by a derivation identical to that of Theorem 4.2.9. □
Thus, if we have the bound $\mathbb{E}[\|g\|_*^2] \leq M^2$ for all stochastic subgradients, then taking $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$ and $\alpha_k = R/(M\sqrt{k})$, we obtain
\[ \text{(4.2.15)} \qquad \mathbb{E}[f(\bar{x}_K) - f(x^\star)] \leq \frac{RM}{\sqrt{K}} + \frac{R\max_k \mathbb{E}[\|g_k\|_*^2]}{MK}\sum_{k=1}^K \frac{1}{2\sqrt{k}} \leq 3\frac{RM}{\sqrt{K}}, \]
where we have used that $\mathbb{E}[\|g\|_*^2] \leq M^2$ and $\sum_{k=1}^K k^{-\frac{1}{2}} \leq 2\sqrt{K}$.
In addition, we can provide concrete convergence guarantees for a few methods, revisiting our earlier examples. We begin with Example 4.2.5, exponentiated gradient descent.
Corollary 4.2.16. Let $C = \{x \in \mathbb{R}^n_+ : \langle \mathbf{1}, x \rangle = 1\}$, and take $h(x) = \sum_{j=1}^n x_j\log x_j$, the negative entropy. Let $x_1 = \frac{1}{n}\mathbf{1}$, the vector with all entries $1/n$. Then if $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$, the exponentiated gradient method (4.2.6) with fixed stepsize $\alpha$ guarantees
\[ f(\bar{x}_K) - f(x^\star) \leq \frac{\log n}{K\alpha} + \frac{\alpha}{2K}\sum_{k=1}^K \|g_k\|_\infty^2. \]

Proof. To apply Theorem 4.2.9, we must show that the negative entropy $h$ is strongly convex with respect to the $\ell_1$-norm, whose dual norm is the $\ell_\infty$-norm. By a Taylor expansion, we know that for any $x, y \in C$ we have
\[ h(x) = h(y) + \langle \nabla h(y), x - y \rangle + \frac{1}{2}(x - y)^\top \nabla^2 h(\tilde{x})(x - y) \]
for some $\tilde{x}$ between $x$ and $y$, that is, $\tilde{x} = tx + (1 - t)y$ for some $t \in [0, 1]$. Calculating these quantities, this is equivalent to
\[ D_{\mathrm{kl}}(x\|y) = D_h(x, y) = \frac{1}{2}(x - y)^\top \operatorname{diag}\left(\frac{1}{\tilde{x}_1}, \ldots, \frac{1}{\tilde{x}_n}\right)(x - y) = \frac{1}{2}\sum_{j=1}^n \frac{(x_j - y_j)^2}{\tilde{x}_j}. \]
Using the Cauchy-Schwarz inequality and the fact that $\tilde{x} \in C$, we have
\[ \|x - y\|_1 = \sum_{j=1}^n |x_j - y_j| = \sum_{j=1}^n \sqrt{\tilde{x}_j}\,\frac{|x_j - y_j|}{\sqrt{\tilde{x}_j}} \leq \bigg(\sum_{j=1}^n \tilde{x}_j\bigg)^{\frac{1}{2}}\bigg(\sum_{j=1}^n \frac{(x_j - y_j)^2}{\tilde{x}_j}\bigg)^{\frac{1}{2}}, \]
where $\sum_{j=1}^n \tilde{x}_j = 1$. That is, we have $D_{\mathrm{kl}}(x\|y) = D_h(x, y) \geq \frac{1}{2}\|x - y\|_1^2$, and $h$ is strongly convex with respect to the $\ell_1$-norm over $C$.
With this strong convexity result in hand, we may apply the second result of Theorem 4.2.9, achieving
\[ \sum_{k=1}^K [f(x_k) - f(x^\star)] \leq \frac{D_{\mathrm{kl}}(x^\star\|x_1)}{\alpha} + \frac{\alpha}{2}\sum_{k=1}^K \|g_k\|_\infty^2. \]
If $x_1 = \frac{1}{n}\mathbf{1}$, then $D_{\mathrm{kl}}(x^\star\|x_1) = h(x^\star) + \log n \leq \log n$, as $h(x) \leq 0$ for $x \in C$. Thus, dividing by $K$ and using that $f(\bar{x}_K) \leq \frac{1}{K}\sum_{k=1}^K f(x_k)$ gives the corollary. □
Comparing the guarantee provided by Corollary 4.2.16 to that given by the standard (non-stochastic) projected subgradient method (i.e. using $h(x) = \frac{1}{2}\|x\|_2^2$ as in Theorem 3.3.6) is instructive. In the case of projected subgradient descent, we have $D_h(x^\star, x) = \frac{1}{2}\|x^\star - x\|_2^2 \leq 1$ for all $x, x^\star \in C = \{x \in \mathbb{R}^n_+ : \langle \mathbf{1}, x \rangle = 1\}$ (and this distance is achieved). But the dual norm to the $\ell_2$-norm is the $\ell_2$-norm itself, so we measure the size of the gradient terms $g_k$ in the $\ell_2$-norm. As $\|g_k\|_\infty \leq \|g_k\|_2 \leq \sqrt{n}\|g_k\|_\infty$, supposing that $\|g_k\|_\infty \leq 1$ for all $k$, the convergence guarantee $O(1)\sqrt{\log n/K}$ may be up to $\sqrt{n/\log n}$-times better than that guaranteed by the standard (Euclidean) projected gradient method.
Lastly, we provide a final convergence guarantee for the mirror descent method using $\ell_p$-norms, where $p \in (1, 2]$. Using such norms has the benefit that $D_h$ is bounded whenever the set $C$ is compact, in contrast to the relative entropy $D_h(x, y) = \sum_j x_j\log\frac{x_j}{y_j}$, and thus they provide a nicer guarantee of convergence. Indeed, for $h(x) = \frac{1}{2}\|x\|_p^2$ we always have that
\[ D_h(x, y) = \frac{1}{2}\|x\|_p^2 - \frac{1}{2}\|y\|_p^2 - \sum_{j=1}^n \|y\|_p^{2-p}\operatorname{sign}(y_j)|y_j|^{p-1}(x_j - y_j) \]
\[ \text{(4.2.17)} \qquad = \frac{1}{2}\|x\|_p^2 + \frac{1}{2}\|y\|_p^2 - \|y\|_p^{2-p}\sum_{j=1}^n |y_j|^{p-1}\operatorname{sign}(y_j)\,x_j \leq \|x\|_p^2 + \|y\|_p^2, \]
where the inequality uses that the subtracted term is at most $\frac{1}{2}\|x\|_p^2 + \frac{1}{2}\|y\|_p^2$ in magnitude, which follows because $q(p-1) = p$ and
\[ \|y\|_p^{2-p}\sum_{j=1}^n |y_j|^{p-1}|x_j| \leq \|y\|_p^{2-p}\bigg(\sum_{j=1}^n |y_j|^{q(p-1)}\bigg)^{\frac{1}{q}}\bigg(\sum_{j=1}^n |x_j|^p\bigg)^{\frac{1}{p}} = \|y\|_p^{2-p}\|y\|_p^{\frac{p}{q}}\|x\|_p = \|y\|_p\|x\|_p \leq \frac{1}{2}\|y\|_p^2 + \frac{1}{2}\|x\|_p^2. \]

More generally, $h(x) = \frac{1}{2}\|x - x_0\|_p^2$ gives $D_h(x, y) \leq \|x - x_0\|_p^2 + \|y - x_0\|_p^2$. As one example, we obtain the following corollary.
one example, we obtain the following corollary.
1
Corollary 4.2.18. Let h(x) = 2(p−1) x2p , where p = 1 + 1
log(2n) , and assume that
C ⊂ {x ∈ R : x1  R1 }. Then
n


K
2R21 log(2n) e2 
K
[f(xk ) − f(x )]  + αk gk 2∞ .
αK 2
k=1 k=1
 1 K
In particular, taking αk = R1 log(2n)/k/e and xK = K k=1 xk gives

R1 log(2n)
f(xK ) − f(x )  3e
 √ .
K
Proof. First, we note that $h(x) = \frac{1}{2(p-1)}\|x\|_p^2$ is strongly convex with respect to the $\ell_p$-norm, where $1 < p \le 2$. (Recall Example 4.2.7 and see Exercise B.4.2.) Moreover, we know that the dual to the $\ell_p$-norm is the conjugate $\ell_q$-norm with $1/p + 1/q = 1$, and thus Theorem 4.2.9 implies that
$$\sum_{k=1}^K [f(x_k) - f(x^\star)] \le \frac{1}{\alpha_K}\sup_{x \in C} D_h(x, x^\star) + \sum_{k=1}^K \frac{\alpha_k}{2}\|g_k\|_q^2.$$
Now, we use that if $C$ is contained in the $\ell_1$-ball of radius $R_1$, then
$$(p-1)\, D_h(x, y) \le \|x\|_p^2 + \|y\|_p^2 \le \|x\|_1^2 + \|y\|_1^2 \le 2R_1^2.$$
Moreover, because $p = 1 + \frac{1}{\log(2n)}$, we have $q = 1 + \log(2n)$, and
$$\|v\|_q \le \|\mathbf{1}\|_q \|v\|_\infty = n^{1/q}\|v\|_\infty \le n^{1/\log(2n)}\|v\|_\infty \le e\,\|v\|_\infty.$$
Substituting this into the previous display and noting that $1/(p-1) = \log(2n)$ gives the first result. The second follows by first bounding $\sum_{k=1}^K k^{-1/2} \le 2\sqrt{K}$ (compare the sum with the integral $\int_0^K t^{-1/2}\,dt = 2\sqrt{K}$) and then using convexity. $\square$
So we see that, in more general cases than the simple simplex constraint afforded by the entropic mirror descent (exponentiated gradient) updates, we have convergence guarantees of order $\sqrt{\log n}/\sqrt{K}$, which may be substantially faster than that guaranteed by the standard projected gradient methods.
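The two-sided bound (4.2.17) on the $\ell_p$ Bregman divergence is easy to sanity-check numerically. The sketch below (an illustration only) computes $D_h$ for $h(x) = \frac12\|x\|_p^2$ directly from its gradient $\nabla_j h(y) = \|y\|_p^{2-p}\operatorname{sign}(y_j)|y_j|^{p-1}$ and verifies $0 \le D_h(x,y) \le \|x\|_p^2 + \|y\|_p^2$ on random vectors:

```python
import math
import random

def pnorm(v, p):
    return sum(abs(vj) ** p for vj in v) ** (1.0 / p)

def grad_h(y, p):
    # Gradient of h(y) = 0.5 * ||y||_p^2 at y != 0:
    # grad_j = ||y||_p^(2-p) * sign(y_j) * |y_j|^(p-1)
    ny = pnorm(y, p)
    return [ny ** (2.0 - p) * math.copysign(abs(yj) ** (p - 1.0), yj) for yj in y]

def bregman(x, y, p):
    # D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>
    g = grad_h(y, p)
    return (0.5 * pnorm(x, p) ** 2 - 0.5 * pnorm(y, p) ** 2
            - sum(gj * (xj - yj) for gj, xj, yj in zip(g, x, y)))

random.seed(0)
p = 1.5
max_violation = 0.0
for _ in range(200):
    x = [random.uniform(-1, 1) for _ in range(5)]
    y = [random.uniform(-1, 1) for _ in range(5)]
    d = bregman(x, y, p)
    bound = pnorm(x, p) ** 2 + pnorm(y, p) ** 2
    # Track violations of 0 <= D_h(x, y) <= ||x||_p^2 + ||y||_p^2
    max_violation = max(max_violation, -d, d - bound)
```

Nonnegativity of `d` reflects convexity of $h$; the upper bound is exactly the inequality (4.2.17).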
A simulated mirror-descent example. With our convergence theorems given, we provide a (simulation-based) example of the convergence behavior for an optimization problem for which it is natural to use non-Euclidean norms. We consider a robust regression problem of the following form: we let $A \in \mathbb{R}^{m \times n}$ have entries drawn i.i.d.\ $N(0, 1)$, with rows $a_1, \ldots, a_m$. We let $b_i = \frac{1}{2}(a_{i,1} + a_{i,2}) + \varepsilon_i$, where $\varepsilon_i \stackrel{\mathrm{iid}}{\sim} N(0, 10^{-2})$, and we take $m = 20$ and dimension $n = 3000$. Then we define
$$f(x) := \|Ax - b\|_1 = \sum_{i=1}^m |\langle a_i, x\rangle - b_i|,$$
John C. Duchi 151

which has subgradients $A^\top \operatorname{sign}(Ax - b)$. We then minimize $f$ over the simplex $C = \{x \in \mathbb{R}^n_+ : \langle \mathbf{1}, x\rangle = 1\}$; this is the same robust regression problem (3.2.10), except with a particular choice of $C$.
We compare the subgradient method to exponentiated gradient descent for this problem, noting that the Euclidean projection of a vector $v \in \mathbb{R}^n$ onto the set $C$ has coordinates $x_j = [v_j - t]_+$, where $t \in \mathbb{R}$ is chosen so that
$$\sum_{j=1}^n x_j = \sum_{j=1}^n [v_j - t]_+ = 1.$$
(See the papers [14, 23] for a full derivation of this expression.) We use stepsizes $\alpha_k = \alpha_0/\sqrt{k}$, where the initial stepsize $\alpha_0$ is chosen to optimize the convergence guarantee for each of the methods (see the coming section). In Figure 4.2.19, we plot the results of performing the projected gradient method versus the exponentiated gradient (entropic mirror descent) method and a method using distance-generating functions $h(x) = \frac{1}{2}\|x\|_p^2$ for $p = 1 + 1/\log(2n)$, which can also be shown to be optimal, showing the optimality gap versus iteration count. While all three methods are sensitive to initial stepsize, the mirror descent method (4.2.6) enjoys faster convergence than the standard gradient-based method.
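A small-scale version of this experiment is easy to reproduce. The sketch below (illustrative only, with $m$ and $n$ far smaller than in the figure) implements the sort-based simplex projection $x_j = [v_j - t]_+$ described above and runs the projected subgradient method on $f(x) = \|Ax - b\|_1$:

```python
import random

def project_simplex(v):
    # Euclidean projection onto {x >= 0, <1, x> = 1}: x_j = max(v_j - t, 0),
    # with the threshold t found by the standard sort-based procedure.
    u = sorted(v, reverse=True)
    css, t = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        if ui - (css - 1.0) / i > 0:
            t = (css - 1.0) / i  # t at the last index where the condition holds
    return [max(vj - t, 0.0) for vj in v]

def loss_and_subgrad(A, b, x):
    # f(x) = ||Ax - b||_1, with subgradient A^T sign(Ax - b)
    r = [sum(aij * xj for aij, xj in zip(ai, x)) - bi for ai, bi in zip(A, b)]
    f = sum(abs(ri) for ri in r)
    s = [(ri > 0) - (ri < 0) for ri in r]
    g = [sum(s[i] * A[i][j] for i in range(len(A))) for j in range(len(x))]
    return f, g

random.seed(1)
m, n = 10, 50  # much smaller than the m = 20, n = 3000 of the text
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
b = [0.5 * (ai[0] + ai[1]) + random.gauss(0, 0.1) for ai in A]

x = [1.0 / n] * n
f0, _ = loss_and_subgrad(A, b, x)
fbest = f0
for k in range(1, 201):
    f, g = loss_and_subgrad(A, b, x)
    fbest = min(fbest, f)
    alpha = 0.01 / k ** 0.5       # stepsize alpha_0 / sqrt(k)
    x = project_simplex([xj - alpha * gj for xj, gj in zip(x, g)])
```

The stepsize $\alpha_0 = 0.01$ here is an arbitrary illustrative choice, not the tuned value used for the figures.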

Figure 4.2.19. Convergence of mirror descent (entropic gradient method) versus projected gradient method.

4.3. Adaptive stepsizes and metrics. In our discussion of mirror descent methods, we assumed we knew enough about the geometry of the problem at hand—or at least the constraint set—to choose an appropriate metric and associated distance-generating function $h$. In other situations, however, it may be advantageous to adapt the metric being used, or at least the stepsizes, to achieve faster convergence guarantees. We begin by describing a simple scheme for choosing stepsizes to optimize bounds on convergence, which means one does not need to know the Lipschitz constants of gradients ahead of time, and then move on to somewhat more involved schemes that use a distance-generating function of the type $h(x) = \frac{1}{2}x^\top H x$ for some matrix $H$, which may change depending on information observed during solution of the problem. We leave proofs of the major results in these sections to exercises at the end of the lectures.
Adaptive stepsizes. Let us begin by recalling the convergence guarantees for mirror descent in the stochastic case, given by Corollary 4.2.14, which assumes the stepsize $\alpha_k$ used to calculate $x_{k+1}$ is chosen based on the observed gradients $g_1, \ldots, g_k$ (it may be specified ahead of time). In this case, taking $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$, we have by Corollary 4.2.14 that as long as $D_h(x, x^\star) \le R^2$ for all $x \in C$, then
$$(4.3.1)\qquad \mathbb{E}[f(\bar{x}_K) - f(x^\star)] \le \mathbb{E}\bigg[\frac{R^2}{K\alpha_K} + \frac{1}{K}\sum_{k=1}^K \frac{\alpha_k}{2}\|g_k\|_*^2\bigg].$$
Now, if we were to use a fixed stepsize $\alpha_k = \alpha$ for all $k$, we see that the choice of stepsize minimizing
$$\frac{R^2}{K\alpha} + \frac{\alpha}{2K}\sum_{k=1}^K \|g_k\|_*^2$$
is
$$\alpha = \sqrt{2}\,R\bigg(\sum_{k=1}^K \|g_k\|_*^2\bigg)^{-\frac{1}{2}},$$
which, when substituted into the bound (4.3.1), yields
$$(4.3.2)\qquad \mathbb{E}[f(\bar{x}_K) - f(x^\star)] \le \sqrt{2}\,\frac{R}{K}\,\mathbb{E}\bigg[\bigg(\sum_{k=1}^K \|g_k\|_*^2\bigg)^{\frac{1}{2}}\bigg].$$
While the stepsize choice $\alpha$ and the resulting bound are not strictly possible, as we do not know the magnitudes of the gradients $\|g_k\|_*$ before the procedure executes, in Exercise B.4.1 we prove the following corollary, which uses the "up to now" optimal choice of stepsize $\alpha_k$.

Corollary 4.3.3. Let the conditions of Corollary 4.2.14 hold, and let $\alpha_k = R\big/\sqrt{\sum_{i=1}^k \|g_i\|_*^2}$. Then
$$\mathbb{E}[f(\bar{x}_K) - f(x^\star)] \le 3\,\frac{R}{K}\,\mathbb{E}\bigg[\bigg(\sum_{k=1}^K \|g_k\|_*^2\bigg)^{\frac{1}{2}}\bigg],$$
where $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$.

When comparing Corollary 4.3.3 to Corollary 4.2.14, we see by Jensen's inequality that, if $\mathbb{E}[\|g_k\|_*^2] \le M^2$ for all $k$, then
$$\mathbb{E}\bigg[\bigg(\sum_{k=1}^K \|g_k\|_*^2\bigg)^{\frac{1}{2}}\bigg] \le \bigg(\sum_{k=1}^K \mathbb{E}[\|g_k\|_*^2]\bigg)^{\frac{1}{2}} \le \sqrt{M^2 K} = M\sqrt{K}.$$
Thus, ignoring the $\sqrt{2}$ versus $3$ multiplier, the bound of Corollary 4.3.3 is always tighter than that provided by Corollary 4.2.14 and its immediate consequence (4.2.15). We do not explore these particular stepsize choices further, but turn to more sophisticated adaptation strategies.
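As a toy illustration of the stepsize rule in Corollary 4.3.3 (a sketch only; the choice $R = 2$ is a loose illustrative bound for $C = [-1, 1]$, and the target function is made up), consider minimizing $f(x) = |x - 0.7|$ over $C = [-1, 1]$ with $\alpha_k = R\big/\sqrt{\sum_{i \le k} \|g_i\|^2}$:

```python
import math

R = 2.0                  # loose illustrative radius bound for C = [-1, 1]
x, sum_sq = -1.0, 0.0
iterates = []
for k in range(200):
    g = 1.0 if x > 0.7 else -1.0        # subgradient of f(x) = |x - 0.7|
    sum_sq += g * g                      # running sum of ||g_i||^2
    alpha = R / math.sqrt(sum_sq)        # adaptive stepsize of Corollary 4.3.3
    x = min(1.0, max(-1.0, x - alpha * g))   # projected subgradient step
    iterates.append(x)
xbar = sum(iterates) / len(iterates)     # averaged iterate
```

The stepsizes decay automatically (here to $2/\sqrt{200} \approx 0.14$) without any prior knowledge of the gradient magnitudes, and the averaged iterate approaches the minimizer $0.7$ at the $R\,M\sqrt{K}/K$ rate of the corollary.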

Variable metric methods and the adaptive gradient method. In variable metric methods, the idea is to adjust the metric with which one constructs updates to better reflect local (or non-local) problem structure. The basic framework is very similar to the standard subgradient method (or the mirror descent method), and proceeds as follows.
(i) Receive subgradient $g_k \in \partial f(x_k)$ (or stochastic subgradient $g_k$ satisfying $\mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$).
(ii) Update the positive semidefinite matrix $H_k \in \mathbb{R}^{n \times n}$.
(iii) Compute the update
$$(4.3.4)\qquad x_{k+1} = \operatorname*{argmin}_{x \in C}\Big\{\langle g_k, x\rangle + \frac{1}{2}\langle x - x_k, H_k(x - x_k)\rangle\Big\}.$$
The method (4.3.4) subsumes a number of standard and less standard optimization methods. If $H_k = \frac{1}{\alpha_k} I_{n\times n}$, a scaled identity matrix, we recover the (stochastic) subgradient method (3.2.4) when $C = \mathbb{R}^n$ (or (3.3.2) generally). If $f$ is twice differentiable and $C = \mathbb{R}^n$, then taking $H_k = \nabla^2 f(x_k)$ to be the Hessian of $f$ at $x_k$ gives the (undamped) Newton method, and using $H_k = \nabla^2 f(x_k)$ even when $C \ne \mathbb{R}^n$ gives a constrained Newton method. More general choices of $H_k$ can even give the ellipsoid method and other classical convex optimization methods [56].
In our case, we specialize the iterations above to focus on diagonal matrices $H_k$, and we do not assume the function $f$ is smooth (it need not even be differentiable). This, of course, renders unusable the standard methods using second-order information in the matrix $H_k$ (as the Hessian does not exist), but we may still develop useful algorithms. It is possible to consider more general matrices [22], but their additional computational cost generally renders them impractical in large-scale and stochastic settings. With that in mind, let us develop a general framework for algorithms and provide their analysis.
We begin with a general convergence guarantee.

Theorem 4.3.5. Let $H_k$ be a sequence of positive definite matrices, where $H_k$ is a function of $g_1, \ldots, g_k$ (and potentially some additional randomness), and let $g_k$ be (stochastic) subgradients with $\mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$. Then
$$\mathbb{E}\bigg[\sum_{k=1}^K (f(x_k) - f(x^\star))\bigg] \le \frac{1}{2}\mathbb{E}\bigg[\sum_{k=2}^K \Big(\|x_k - x^\star\|_{H_k}^2 - \|x_k - x^\star\|_{H_{k-1}}^2\Big) + \|x_1 - x^\star\|_{H_1}^2\bigg] + \frac{1}{2}\sum_{k=1}^K \mathbb{E}\Big[\|g_k\|_{H_k^{-1}}^2\Big].$$
Proof. In contrast to mirror descent methods, in this proof we return to our classic Lyapunov-based style of proof for standard subgradient methods, looking at the

distance $\|x_k - x^\star\|$. Let $\|x\|_A^2 = \langle x, Ax\rangle$ for any positive semidefinite matrix $A$. We claim that
$$(4.3.6)\qquad \|x_{k+1} - x^\star\|_{H_k}^2 \le \big\|x_k - H_k^{-1}g_k - x^\star\big\|_{H_k}^2,$$
the analogue of the fact that projections are non-expansive. This is an immediate consequence of the update (4.3.4): we have that
$$x_{k+1} = \operatorname*{argmin}_{x \in C}\Big\{\big\|x - (x_k - H_k^{-1}g_k)\big\|_{H_k}^2\Big\},$$
which is a Euclidean projection of $x_k - H_k^{-1}g_k$ onto $C$ (in the norm $\|\cdot\|_{H_k}$). Then the standard result that projections are non-expansive (Corollary 2.2.8) gives inequality (4.3.6).
Inequality (4.3.6) is the key to our analysis, as previously. Expanding the square on the right side of the inequality, we obtain
$$\frac{1}{2}\|x_{k+1} - x^\star\|_{H_k}^2 \le \frac{1}{2}\big\|x_k - H_k^{-1}g_k - x^\star\big\|_{H_k}^2 = \frac{1}{2}\|x_k - x^\star\|_{H_k}^2 - \langle g_k, x_k - x^\star\rangle + \frac{1}{2}\|g_k\|_{H_k^{-1}}^2,$$
and taking expectations we have $\mathbb{E}[\langle g_k, x_k - x^\star\rangle \mid x_k] \ge f(x_k) - f(x^\star)$ by convexity and the fact that $\mathbb{E}[g_k \mid x_k] \in \partial f(x_k)$. Thus
$$\mathbb{E}\Big[\frac{1}{2}\|x_{k+1} - x^\star\|_{H_k}^2\Big] \le \mathbb{E}\Big[\frac{1}{2}\|x_k - x^\star\|_{H_k}^2\Big] - \mathbb{E}[f(x_k) - f(x^\star)] + \frac{1}{2}\mathbb{E}\Big[\|g_k\|_{H_k^{-1}}^2\Big].$$
Rearranging, we have
$$\mathbb{E}[f(x_k) - f(x^\star)] \le \mathbb{E}\Big[\frac{1}{2}\|x_k - x^\star\|_{H_k}^2 - \frac{1}{2}\|x_{k+1} - x^\star\|_{H_k}^2\Big] + \frac{1}{2}\mathbb{E}\Big[\|g_k\|_{H_k^{-1}}^2\Big].$$
Summing this inequality from $k = 1$ to $K$ gives the theorem. $\square$
We may specialize the theorem in a number of ways to develop particular algorithms. One specialization, which is convenient because the computational overhead is fairly small, is to use diagonal matrices $H_k$. In particular, the AdaGrad method sets
$$(4.3.7)\qquad H_k = \frac{1}{\alpha}\operatorname{diag}\bigg(\sum_{i=1}^k g_i g_i^\top\bigg)^{\frac{1}{2}},$$
where $\alpha > 0$ is a pre-specified constant (stepsize). In this case, the following corollary to Theorem 4.3.5 follows; Exercise B.4.3 sketches the proof of the corollary, which is similar to that of Corollary 4.3.3. In the corollary, recall that the trace of a matrix is $\operatorname{tr}(A) = \sum_{j=1}^n A_{jj}$.

Corollary 4.3.8 (AdaGrad convergence). Let $R_\infty := \sup_{x \in C}\|x - x^\star\|_\infty$ be the $\ell_\infty$-radius of the set $C$ and let the conditions of Theorem 4.3.5 hold. Then with the choice (4.3.7) in the variable metric method, we have
$$\mathbb{E}\bigg[\sum_{k=1}^K (f(x_k) - f(x^\star))\bigg] \le \frac{1}{2\alpha}R_\infty^2\,\mathbb{E}[\operatorname{tr}(H_K)] + \alpha\,\mathbb{E}[\operatorname{tr}(H_K)].$$

Inspecting Corollary 4.3.8 gives a few consequences. First, choosing $\alpha = R_\infty$, we obtain the expected convergence guarantee $\frac{3}{2}R_\infty\,\mathbb{E}[\operatorname{tr}(H_K)]$. If $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x_k$ as usual, and we let $g_{k,j}$ denote the $j$th component of the $k$th gradient observed by the method, then we immediately obtain the convergence guarantee
$$(4.3.9)\qquad \mathbb{E}[f(\bar{x}_K) - f(x^\star)] \le \frac{3}{2K}R_\infty\,\mathbb{E}[\operatorname{tr}(H_K)] = \frac{3}{2K}R_\infty\,\mathbb{E}\bigg[\sum_{j=1}^n \bigg(\sum_{k=1}^K g_{k,j}^2\bigg)^{\frac{1}{2}}\bigg].$$
In addition to proving the bound (4.3.9), Exercise B.4.3 also shows that when we take $C = \{x \in \mathbb{R}^n : \|x\|_\infty \le 1\}$, the bound (4.3.9) is always better than the bounds (e.g., Corollary 3.4.11) guaranteed by standard stochastic gradient methods. In addition, the bound (4.3.9) is unimprovable—there are stochastic optimization problems for which no algorithm can achieve a faster convergence rate. These types of problems generally involve data in which the gradients $g$ have highly varying components (or components that are often zero, i.e., the gradients $g$ are sparse), as for such problems geometric aspects are quite important.

Figure 4.3.10. A comparison of the convergence of AdaGrad and SGD on the problem (4.3.11) for the best initial stepsize $\alpha$ for each method.

We now give an example application of the AdaGrad method, showing its performance on a simulated example. We consider solving the problem
$$(4.3.11)\qquad \text{minimize } f(x) = \frac{1}{m}\sum_{i=1}^m [1 - b_i\langle a_i, x\rangle]_+ \quad \text{subject to } \|x\|_\infty \le 1,$$
where the vectors $a_i \in \{-1, 0, 1\}^n$, with $m = 5000$ and $n = 1000$. This is the objective common to hinge-loss (support vector machine) classification problems. For each coordinate $j \in \{1, \ldots, n\}$, we set $a_{i,j} \in \{\pm 1\}$ to have a random sign with probability $1/j$, and $a_{i,j} = 0$ otherwise. Letting $u \in \{-1, 1\}^n$ be drawn uniformly at random, we set $b_i = \operatorname{sign}(\langle a_i, u\rangle)$ with probability $.95$ and $b_i = -\operatorname{sign}(\langle a_i, u\rangle)$ otherwise. For this problem, the coordinates of $a_i$ (and hence subgradients or

stochastic subgradients of $f$) naturally have substantial variability, making it a natural problem for adaptation of the metric $H_k$.
In Figure 4.3.10, we show the convergence behavior of AdaGrad versus stochastic gradient descent (SGD) on one realization of this problem, where at each iteration we choose a stochastic gradient by selecting $i \in \{1, \ldots, m\}$ uniformly at random, then setting $g_k \in \partial[1 - b_i\langle a_i, x_k\rangle]_+$. For SGD, we use stepsizes $\alpha_k = \alpha/\sqrt{k}$, where $\alpha$ is the best stepsize of several choices (based on the eventual convergence of the method), while AdaGrad uses the matrix (4.3.7), with $\alpha$ similarly chosen based on the best eventual convergence. The plot shows the typical behavior of AdaGrad with respect to stochastic gradient methods, at least for problems with appropriate geometry: with a good initial stepsize choice, AdaGrad often outperforms stochastic gradient descent. (We have been vague about the "right" geometry for problems in which we expect AdaGrad to perform well. Roughly, problems for which the domain $C$ is well-approximated by a box $\{x \in \mathbb{R}^n : \|x\|_\infty \le c\}$ are those for which we expect AdaGrad to succeed; otherwise, it may exhibit worse performance than standard subgradient methods. As in any problem, some care is needed in the choice of methods.) Figure 4.3.12 shows this somewhat more broadly, plotting the convergence $f(x_k) - f(x^\star)$ versus iteration $k$ for a number of initial stepsize choices for both stochastic gradient descent and AdaGrad on the problem (4.3.11). Roughly, we see that both methods are sensitive to the initial stepsize choice, but the best choice for AdaGrad often outperforms the best choice for SGD.
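A scaled-down version of this experiment (illustrative only: here $m = 200$, $n = 20$, and a few hundred iterations rather than the sizes used for the figures) can be sketched as follows. Since (4.3.7) is diagonal, the update reduces to per-coordinate stepsizes $\alpha/\sqrt{\sum_{i \le k} g_{i,j}^2}$, and projection onto the box in the $H_k$-norm is coordinate-wise clipping:

```python
import math
import random

random.seed(0)
m, n = 200, 20        # scaled down from m = 5000, n = 1000 in the text
u = [random.choice([-1, 1]) for _ in range(n)]
A, b = [], []
for _ in range(m):
    a = [random.choice([-1, 1]) if random.random() < 1.0 / (j + 1) else 0
         for j in range(n)]
    s = sum(aj * uj for aj, uj in zip(a, u))
    lab = 1 if s >= 0 else -1
    b.append(lab if random.random() < 0.95 else -lab)
    A.append(a)

def objective(x):
    # f(x) = (1/m) sum_i [1 - b_i <a_i, x>]_+
    return sum(max(0.0, 1.0 - bi * sum(aij * xj for aij, xj in zip(ai, x)))
               for ai, bi in zip(A, b)) / m

alpha = 1.0           # an arbitrary illustrative stepsize, not a tuned one
x = [0.0] * n
G = [0.0] * n         # G[j] accumulates sum_k g_{k,j}^2
f0 = objective(x)     # equals 1.0 at x = 0
for _ in range(500):
    i = random.randrange(m)
    margin = b[i] * sum(aij * xj for aij, xj in zip(A[i], x))
    g = [0.0] * n if margin >= 1.0 else [-b[i] * aij for aij in A[i]]
    for j in range(n):
        G[j] += g[j] * g[j]
        if G[j] > 0.0:
            x[j] -= alpha * g[j] / math.sqrt(G[j])   # diagonal AdaGrad step
        x[j] = min(1.0, max(-1.0, x[j]))             # project onto ||x||_inf <= 1
fK = objective(x)
```

Coordinates that appear frequently (small $j$) accumulate large $G[j]$ and receive small stepsizes, while rare coordinates keep large stepsizes—exactly the per-coordinate adaptation the discussion above describes.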

Figure 4.3.12. A comparison of the convergence of AdaGrad and SGD on the problem (4.3.11) for various initial stepsize choices $\alpha \in \{10^{-i/2}, i = -2, \ldots, 2\} = \{.1, .316, 1, 3.16, 10\}$. Both methods are sensitive to the initial stepsize choice $\alpha$, though for each initial stepsize choice, AdaGrad has better convergence than the subgradient method.

Notes and further reading The mirror descent method was originally devel-
oped by Nemirovski and Yudin [41] in order to more carefully control the norms
of gradients, and associated dual spaces, in first-order optimization methods.
Since their original development, a number of researchers have explored variants
and extensions of their methods. Beck and Teboulle [5] give an analysis of mirror
descent as a non-Euclidean gradient method, which is the approach we take in
this lecture. Nemirovski et al. [40] study mirror descent methods in stochastic
settings, giving high-probability convergence guarantees similar to those we gave
in the previous lecture. Bubeck and Cesa-Bianchi [15] explore the use of mirror descent methods in the context of bandit optimization problems, where instead of observing stochastic gradients one observes only random function values $f(x) + \varepsilon$, where $\varepsilon$ is a mean-zero noise perturbation.
Variable metric methods have a similarly long history. Our simple results with
stepsize selection follow the more advanced techniques of Auer et al. [3] (see es-
pecially their Lemma 3.5), and the AdaGrad method (and our development) is
due to Duchi, Hazan, and Singer [22] and McMahan and Streeter [38]. More gen-
eral metric methods include Shor’s space dilation methods (of which the ellipsoid
method is a celebrated special case), which develop matrices Hk that make new
directions of descent somewhat less correlated with previous directions, allowing
faster convergence in directions toward x ; see the books of Shor [55, 56] as well
as the thesis of Nedić [39]. Newton methods, which we do not discuss, use scaled
multiples of ∇2 f(xk ) for Hk , while Quasi-Newton methods approximate ∇2 f(xk )
with Hk while using only gradient-based information; for more on these and
other more advanced methods for smooth optimization problems, see the books
of Nocedal and Wright [46] and Boyd and Vandenberghe [12].

5. Optimality Guarantees
Lecture Summary: In this lecture, we provide a framework for demonstrat-
ing the optimality of a number of algorithms for solving stochastic optimiza-
tion problems. In particular, we introduce minimax lower bounds, showing
how techniques for reducing estimation problems to statistical testing prob-
lems allow us to prove lower bounds on optimization.

5.1. Introduction The procedures and algorithms we have presented thus far en-
joy good performance on a number of statistical, machine learning, and stochastic
optimization tasks, and we have provided theoretical guarantees on their perfor-
mance. It is interesting to ask whether it is possible to improve the algorithms, or
in what ways it may be possible to improve them. With that in mind, in this lec-
ture we develop a number of tools for showing optimality—according to certain
metrics—of optimization methods for stochastic problems.

Minimax rates. We provide optimality guarantees in the minimax framework, which proceeds roughly as follows: we have a collection of possible problems and an error measure for the performance of a procedure, and we measure a procedure's performance by its behavior on the hardest (most difficult) member of the problem class. We then ask for the best procedure under this worst-case error measure. Let us describe this more formally in the context of our stochastic optimization problems, where the goal is to understand the difficulty of minimizing a convex function $f$ subject to constraints $x \in C$ while observing only stochastic gradient (or other noisy) information about $f$. Our bounds build on three objects:
(i) a collection $\mathcal{F}$ of convex functions $f : \mathbb{R}^n \to \mathbb{R}$;
(ii) a closed convex set $C \subset \mathbb{R}^n$ over which we optimize;
(iii) a stochastic gradient oracle, which consists of a sample space $S$, a gradient mapping
$$g : \mathbb{R}^n \times S \times \mathcal{F} \to \mathbb{R}^n,$$
and (implicitly) a probability distribution $P$ on $S$. The stochastic gradient oracle may be queried at a point $x$, and when queried, draws $S \sim P$ with the property that
$$(5.1.1)\qquad \mathbb{E}[g(x, S, f)] \in \partial f(x).$$
Depending on the scenario of the problem, the optimization procedure may be given access either to $S$ or simply to the value of the stochastic gradient $g = g(x, S, f)$, and the goal is to use the sequence of observations $g(x_k, S_k, f)$, for $k = 1, 2, \ldots$, to optimize $f$.
A simple example of the setting (i)–(iii) is as follows. Let $A \in \mathbb{R}^{n\times n}$ be a fixed positive definite matrix, and let $\mathcal{F}$ be the collection of convex functions of the form $f(x) = \frac{1}{2}x^\top A x - b^\top x$ for all $b \in \mathbb{R}^n$. Then $C$ may be any convex set, and—for the sake of proving lower bounds, not for real applicability in solving problems—we might take the stochastic gradient
$$g = \nabla f(x) + \xi = Ax - b + \xi \quad \text{for } \xi \stackrel{\mathrm{iid}}{\sim} N(0, I_{n\times n}).$$
A somewhat more complex example, but with more fidelity to real problems, comes from the stochastic programming problem (3.4.2) from Lecture 3 on subgradient methods. In this case, there is a known convex function $F : \mathbb{R}^n \times S \to \mathbb{R}$, which is the instantaneous loss function $F(x; s)$. The problem is then to optimize
$$f_P(x) := \mathbb{E}_P[F(x; S)],$$
where the distribution $P$ on the random variable $S$ is unknown to the method a priori; there is then a correspondence between distributions $P$ and functions $f \in \mathcal{F}$. Generally, an optimization procedure is given access to a sample $S_1, \ldots, S_K$ drawn i.i.d.\ according to the distribution $P$ (in this case, there is no selection of points $x_i$ by the optimization procedure, as the sample $S_1, \ldots, S_K$ contains even more information

than the stochastic gradients). A similar variant with a natural stochastic gradient oracle is to set $g(x, s, F) \in \partial F(x; s)$ instead of providing the sample $S = s$.
We focus in this note on the case when the optimization procedure may view only the sequence of subgradients $g_1, g_2, \ldots$ at the points it queries. We note in passing, however, that for many problems we can reconstruct $S$ from a gradient $g \in \partial F(x; S)$. As an example, consider a logistic regression problem with data $s = (a, b) \in \{0, 1\}^n \times \{-1, 1\}$, a typical data case. Then
$$F(x; s) = \log\big(1 + e^{-b\langle a, x\rangle}\big), \quad\text{and}\quad \nabla_x F(x; s) = -\frac{1}{1 + e^{b\langle a, x\rangle}}\, b\, a,$$
so that $(a, b)$ is identifiable from any $g \in \partial F(x; s)$. More generally, classical linear models in statistics have gradients that are scaled multiples of the data, so that the sample $s$ is typically identifiable from $g \in \partial F(x; s)$.
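This identifiability is concrete enough to check numerically. In the sketch below (an illustration with made-up data), the support of the gradient recovers $a$, and the sign of the nonzero entries recovers $b$:

```python
import math

def logistic_grad(x, a, b):
    # Gradient of F(x; (a, b)) = log(1 + exp(-b <a, x>)):
    #   -1 / (1 + exp(b <a, x>)) * b * a
    inner = sum(aj * xj for aj, xj in zip(a, x))
    return [-b * aj / (1.0 + math.exp(b * inner)) for aj in a]

a, b = [1, 0, 1, 1], -1          # a hypothetical sample (a, b)
x = [0.3, -0.2, 0.5, 0.1]
g = logistic_grad(x, a, b)

# Recover the sample from the gradient alone: entries of g are nonzero exactly
# where a_j = 1, and each nonzero entry equals -b times a positive scale.
a_rec = [1 if abs(gj) > 1e-12 else 0 for gj in g]
b_rec = -1 if max(g, key=abs) > 0 else 1
```

Here `a_rec` and `b_rec` coincide with the original sample, illustrating why, for such models, observing a subgradient is informationally equivalent to observing the data point itself.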
Now, given a function $f$ and a stochastic gradient oracle $g$, an optimization procedure chooses query points $x_1, x_2, \ldots, x_K$ and observes stochastic subgradients $g_k$ with $\mathbb{E}[g_k] \in \partial f(x_k)$. Based on these stochastic gradients, the optimization procedure outputs $\widehat{x}_K$, and we assess the quality of the procedure in terms of the excess loss
$$\mathbb{E}\Big[f\big(\widehat{x}_K(g_1, \ldots, g_K)\big) - \inf_{x^\star \in C} f(x^\star)\Big],$$
where the expectation is taken over the subgradients $g(x_i, S_i, f)$ returned by the stochastic oracle and any randomness in the chosen iterates, or query points, $x_1, \ldots, x_K$ of the optimization method. Of course, if we only consider this excess objective value for a fixed function $f$, then a trivial optimization procedure achieves excess risk $0$: simply return some $x^\star \in \operatorname{argmin}_{x \in C} f(x)$. It is thus important to ask for a more uniform notion of risk: we would like the procedure to have good performance uniformly across all functions $f \in \mathcal{F}$, leading us to measure the performance of a procedure by its worst-case risk
$$\sup_{f \in \mathcal{F}} \mathbb{E}\Big[f\big(\widehat{x}(g_1, \ldots, g_K)\big) - \inf_{x \in C} f(x)\Big],$$
where the supremum is taken over functions $f \in \mathcal{F}$ (the subgradient oracle $g$ then implicitly depends on $f$). An optimal estimator for this metric then gives the minimax risk for optimizing the family of stochastic optimization problems $\{f\}_{f \in \mathcal{F}}$ over $x \in C \subset \mathbb{R}^n$, which is
$$(5.1.2)\qquad \mathfrak{M}_K(C, \mathcal{F}) := \inf_{\widehat{x}_K} \sup_{f \in \mathcal{F}} \mathbb{E}\Big[f\big(\widehat{x}_K(g_1, \ldots, g_K)\big) - \inf_{x^\star \in C} f(x^\star)\Big].$$
We take the supremum (worst case) over functions $f \in \mathcal{F}$ and the infimum over all possible optimization schemes $\widehat{x}_K$ using $K$ stochastic gradient samples.
A criticism of the framework (5.1.2) is that it is too pessimistic: by taking a worst case over the class of functions $f \in \mathcal{F}$, one is making the family of problems too challenging. We will not address these challenges except to say that one response is to develop adaptive procedures $\widehat{x}$, which are simultaneously optimal for a variety of collections of problems $\mathcal{F}$.

The basic approach. There are a variety of techniques giving lower bounds on the minimax risk (5.1.2). Each of them transforms the maximum risk by lower bounding it via a Bayesian problem (e.g., [31, 33, 34]), then proving a lower bound on the performance of all possible estimators for the Bayesian problem. In particular, let $\{f_v\} \subset \mathcal{F}$ be a collection of functions in $\mathcal{F}$ indexed by some (finite or countable) set $\mathcal{V}$ and let $\pi$ be any probability mass function over $\mathcal{V}$. Let $f^\star = \inf_{x \in C} f(x)$. Then for any procedure $\widehat{x}$, the maximum risk has lower bound
$$\sup_{f \in \mathcal{F}} \mathbb{E}[f(\widehat{x}) - f^\star] \ge \sum_v \pi(v)\,\mathbb{E}[f_v(\widehat{x}) - f_v^\star].$$
While trivial, this lower bound serves as the departure point for each of the subsequent techniques for lower bounding the minimax risk. The lower bound also allows us to assume that the procedure $\widehat{x}$ is deterministic. Indeed, assume that $\widehat{x}$ is non-deterministic, which we can represent generally as depending on some auxiliary random variable $U$ independent of the observed subgradients. Then we certainly have
$$\sum_v \pi(v)\,\mathbb{E}[f_v(\widehat{x}) - f_v^\star] = \mathbb{E}\bigg[\sum_v \pi(v)\,\mathbb{E}\big[f_v(\widehat{x}) - f_v^\star \mid U\big]\bigg] \ge \inf_u \sum_v \pi(v)\,\mathbb{E}\big[f_v(\widehat{x}) - f_v^\star \mid U = u\big],$$
that is, there is some realization of the auxiliary randomness that is at least as good as the average realization. We can simply incorporate this into our minimax optimal procedures $\widehat{x}$, and thus we assume from this point onward that all our optimization procedures are deterministic when proving our lower bounds.

Figure 5.1.3. Separation of optimizers of $f_0$ and $f_1$. Optimizing one function to accuracy better than $\delta = d_{\mathrm{opt}}(f_0, f_1)$ implies we optimize the other poorly; the gap $f(x) - f^\star$ is at least $\delta$.

The second step in proving minimax bounds is to reduce the optimization problem to a type of statistical test [58, 62, 63]. To perform this reduction, we define a distance-like quantity between functions such that, if we have optimized a function $f_v$ to better than the distance, we cannot have optimized other functions well. In particular, consider two convex functions $f_0$ and $f_1$, and let $f_v^\star = \inf_{x \in C} f_v(x)$ for $v \in \{0, 1\}$. We let the optimization separation between the functions $f_0$ and $f_1$ over the set $C$ be
$$(5.1.4)\qquad d_{\mathrm{opt}}(f_0, f_1; C) := \sup\bigg\{\delta \ge 0 : \begin{array}{l} f_1(x) \le f_1^\star + \delta \text{ implies } f_0(x) \ge f_0^\star + \delta \\ f_0(x) \le f_0^\star + \delta \text{ implies } f_1(x) \ge f_1^\star + \delta \end{array} \ \text{for any } x \in C \bigg\}.$$
That is, if we have any point $x$ such that $f_v(x) - f_v^\star \le d_{\mathrm{opt}}(f_0, f_1)$, then $x$ cannot optimize $f_{1-v}$ well; i.e., we can only optimize one of the two functions $f_0$ and $f_1$ to accuracy $d_{\mathrm{opt}}(f_0, f_1)$. See Figure 5.1.3 for an illustration of this quantity. For example, if $f_1(x) = (x + c)^2$ and $f_0(x) = (x - c)^2$ for a constant $c \ne 0$, then we have $d_{\mathrm{opt}}(f_1, f_0) = c^2$.
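This example value can be checked numerically. The sketch below (an illustration only; here both minimal values $f_v^\star$ are $0$) scans candidate separations $\delta$ on a grid over $C = [-2, 2]$, declaring $\delta$ valid when no point is $\delta$-optimal for both functions, and recovers $d_{\mathrm{opt}} \approx c^2$:

```python
# Check d_opt(f1, f0) = c^2 for f1(x) = (x + c)^2, f0(x) = (x - c)^2 on C = [-2, 2].
c = 0.5
f1 = lambda x: (x + c) ** 2
f0 = lambda x: (x - c) ** 2
xs = [-2.0 + 4.0 * i / 800 for i in range(801)]   # grid over C

def separated(delta):
    # delta is a valid separation if no x is delta-optimal for both functions
    # (here f1* = f0* = 0, so delta-optimal means f_v(x) <= delta)
    return not any(f1(x) <= delta and f0(x) <= delta for x in xs)

deltas = [i * 0.005 for i in range(1, 200)]
d_opt = max(d for d in deltas if separated(d))
```

The grid resolution limits the accuracy, but the recovered value sits within one grid step of $c^2 = 0.25$.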
This separation $d_{\mathrm{opt}}$ allows us to give a reduction from optimization to testing via the canonical hypothesis testing problem, which is defined as follows:
1. Nature chooses an index $V \in \mathcal{V}$ uniformly at random.
2. Conditional on the choice $V = v$, the procedure observes stochastic subgradients for the function $f_v$ according to the oracle $g(x_k, S_k, f_v)$ for i.i.d.\ $S_k$.
Then, given the observed subgradients, the goal is to test which of the random indices $v$ nature chose. Intuitively, if we can optimize $f_v$ well—to better than the separation $d_{\mathrm{opt}}(f_v, f_{v'})$—then we can identify the index $v$. If we can show this, then we can adapt classical statistical results on optimal hypothesis testing to lower bound the probability of error in testing whether the data was generated conditional on $V = v$.
More formally, we have the following key lower bound. In the lower bound, we say that a collection of functions $\{f_v\}_{v \in \mathcal{V}}$ is $\delta$-separated, where $\delta \ge 0$, if
$$(5.1.5)\qquad d_{\mathrm{opt}}(f_v, f_{v'}; C) \ge \delta \quad \text{for each } v, v' \in \mathcal{V} \text{ with } v \ne v'.$$
Then we have the next proposition.
Then we have the next proposition.

Proposition 5.1.6. Let S be drawn uniformly from V, where |V| < ∞, and assume the
collection {fv }v∈V is δ-separated. Then for any optimization procedure '
x based on the
observed subgradients,
1 
E[fv ('
x) − fv ]  δ · inf P('
v = V),
|V| '
v
v∈V
where the distribution P is the joint distribution over the random index V and the ob-
served gradients g1 , . . . , gK and the infimum is taken over all testing procedures '
v based
on the observed data.

Proof. We let $P_v$ denote the distribution of the subgradients conditional on the choice $V = v$, meaning that $\mathbb{E}[g_k \mid V = v] \in \partial f_v(x_k)$. We observe that for any $v$, we have
$$\mathbb{E}[f_v(\widehat{x}) - f_v^\star] \ge \delta\,\mathbb{E}\big[\mathbf{1}\{f_v(\widehat{x}) \ge f_v^\star + \delta\}\big] = \delta\,P_v\big(f_v(\widehat{x}) \ge f_v^\star + \delta\big).$$
Now, define the hypothesis test $\widehat{v}$, which is a function of $\widehat{x}$, by
$$\widehat{v} = \begin{cases} v & \text{if } f_v(\widehat{x}) \le f_v^\star + \delta \\ \text{arbitrary in } \mathcal{V} & \text{otherwise.} \end{cases}$$
This is a well-defined mapping, as by the condition that $d_{\mathrm{opt}}(f_v, f_{v'}) \ge \delta$, there can be only a single index $v$ such that $f_v(\widehat{x}) \le f_v^\star + \delta$. We then note the following implication:
$$\widehat{v} \ne v \ \text{ implies } \ f_v(\widehat{x}) \ge f_v^\star + \delta.$$
Thus we have
$$P_v(\widehat{v} \ne v) \le P_v\big(f_v(\widehat{x}) \ge f_v^\star + \delta\big),$$
or, summarizing, we have
$$\frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} \mathbb{E}[f_v(\widehat{x}) - f_v^\star] \ge \delta \cdot \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} P_v(\widehat{v} \ne v).$$
But by definition of the distribution $\mathbb{P}$, we have $\frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} P_v(\widehat{v} \ne v) = \mathbb{P}(\widehat{v} \ne V)$, and taking the best possible test $\widehat{v}$ gives the result of the proposition. $\square$
Proposition 5.1.6 allows us to then bring in the tools of optimal testing in statistics and information theory, which we can use to prove lower bounds. To leverage Proposition 5.1.6, we follow a two-phase strategy: we construct a well-separated function collection, and then we show that it is difficult to test which of the functions we observe data from. There is a natural tension in the proposition, as it is easier to distinguish functions that are far apart (i.e., large $\delta$), while hard-to-distinguish functions (i.e., large $\mathbb{P}(\widehat{v} \ne V)$) often have smaller separation. Thus we trade these against one another carefully in constructing our lower bounds on the minimax risk. We also present a variant lower bound in Section 5.3 based on a similar reduction, except that we use multiple binary hypothesis tests.
5.2. Le Cam's Method. Our first set of lower bounds is based on Le Cam's method [33], which uses optimality guarantees for simple binary hypothesis tests to provide lower bounds for optimization problems. That is, we let $\mathcal{V} = \{-1, 1\}$ and will construct only pairs of functions and distributions $P_1, P_{-1}$ generating data. In this section, we show how to use these binary hypothesis tests to prove lower bounds on the family of stochastic optimization problems characterized by the following conditions: the domain $C \subset \mathbb{R}^n$ contains an $\ell_2$-ball of radius $R$, and the subgradients $g_k$ satisfy the second moment bound
$$\mathbb{E}[\|g_k\|_2^2] \le M^2$$
for all $k$. We assume that $\mathcal{F}$ consists of $M$-Lipschitz continuous convex functions.

With the definition (5.1.4) of the separation in terms of optimization value, we can provide a lower bound on optimization in terms of distances between the distributions $P_1$ and $P_{-1}$. Before we continue, we require a few definitions about distances between distributions.

Definition 5.2.1. Let $P$ and $Q$ be distributions on a space $S$, and assume that they are both absolutely continuous with respect to a measure $\mu$ on $S$. The variation distance between $P$ and $Q$ is
$$\|P - Q\|_{\mathrm{TV}} := \sup_{A \subset S} |P(A) - Q(A)| = \frac{1}{2}\int_S |p(s) - q(s)|\,d\mu(s).$$
The Kullback-Leibler divergence between $P$ and $Q$ is
$$D_{\mathrm{kl}}(P \,\|\, Q) := \int_S p(s)\log\frac{p(s)}{q(s)}\,d\mu(s).$$
We can connect the variation distance to binary hypothesis tests via the following lemma, due to Le Cam. The lemma states that testing between two distributions is hard precisely when they are close in variation distance.

Lemma 5.2.2. Let $P_1$ and $P_{-1}$ be any distributions. Then
$$\inf_{\widehat{v}}\big\{P_1(\widehat{v} \ne 1) + P_{-1}(\widehat{v} \ne -1)\big\} = 1 - \|P_1 - P_{-1}\|_{\mathrm{TV}}.$$
Proof. Any testing procedure $\widehat{v} : S \to \{-1, 1\}$ maps one region of the sample space, call it $A$, to $1$ and the complement $A^c$ to $-1$. Thus, we have
$$P_1(\widehat{v} \ne 1) + P_{-1}(\widehat{v} \ne -1) = P_1(A^c) + P_{-1}(A) = 1 - P_1(A) + P_{-1}(A).$$
Optimizing over $\widehat{v}$ is then equivalent to optimizing over sets $A$, yielding
$$\inf_{\widehat{v}}\big\{P_1(\widehat{v} \ne 1) + P_{-1}(\widehat{v} \ne -1)\big\} = \inf_A\big\{1 - P_1(A) + P_{-1}(A)\big\} = 1 - \sup_A\big\{P_1(A) - P_{-1}(A)\big\} = 1 - \|P_1 - P_{-1}\|_{\mathrm{TV}},$$
as desired. $\square$
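Lemma 5.2.2 is easy to verify by brute force on a small discrete sample space (an illustration with made-up distributions): enumerating every deterministic test $\widehat{v} : S \to \{-1, 1\}$ shows that the minimal summed error probability equals $1 - \|P_1 - P_{-1}\|_{\mathrm{TV}}$:

```python
from itertools import product

S = [0, 1, 2]                 # a three-point sample space
P1 = [0.6, 0.3, 0.1]          # distribution under v = 1
Pm1 = [0.2, 0.3, 0.5]         # distribution under v = -1

# Total variation distance via the density formula of Definition 5.2.1.
tv = 0.5 * sum(abs(p - q) for p, q in zip(P1, Pm1))

# Enumerate all deterministic tests v-hat: S -> {-1, 1} and take the best
# summed error P1(v-hat != 1) + P_{-1}(v-hat != -1).
best = min(
    sum(P1[s] for s in S if test[s] != 1) + sum(Pm1[s] for s in S if test[s] != -1)
    for test in (dict(zip(S, vals)) for vals in product([1, -1], repeat=len(S)))
)
```

The minimizing test simply reports whichever distribution assigns the observed point more mass, matching the optimal-set argument in the proof.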
As an immediate consequence of Lemma 5.2.2, we obtain the standard minimax lower bound based on binary hypothesis testing. In particular, let $f_1$ and $f_{-1}$ be $\delta$-separated and belong to $\mathcal{F}$, and assume that the method $\widehat{x}$ receives data (in this case, the data is the $K$ subgradients) from $P_v^K$ when $f_v$ is the true function. Then we immediately have
$$(5.2.3)\qquad \mathfrak{M}_K(C, \mathcal{F}) \ge \inf_{\widehat{x}_K}\max_{v \in \{-1, 1\}}\big\{\mathbb{E}_{P_v}[f_v(\widehat{x}_K) - f_v^\star]\big\} \ge \frac{\delta}{2}\Big(1 - \big\|P_1^K - P_{-1}^K\big\|_{\mathrm{TV}}\Big).$$
Inequality (5.2.3) gives a quantitative guarantee on an intuitive fact: if we observe data from one of two distributions $P_1$ and $P_{-1}$ that are close, while the optimizers of the functions $f_1$ and $f_{-1}$ associated with $P_1$ and $P_{-1}$ differ, it is difficult to optimize well. Moreover, there is a natural tradeoff—the farther apart the functions $f_1$ and $f_{-1}$ are (i.e., the larger $\delta = d_{\mathrm{opt}}(f_1, f_{-1})$ is), the bigger the penalty for optimizing one well; but conversely, this usually forces the distributions $P_1$ and

$P_{-1}$ to be quite different, as they provide subgradient information on $f_1$ and $f_{-1}$, respectively.
It is challenging to compute quantities—especially with multiple samples—involving the variation distance, so we now convert our bounds to ones involving the KL-divergence, which is computationally easier when dealing with multiple samples. First, we use Pinsker's inequality (see Appendix A.3, Proposition A.3.2 for a proof): for any distributions $P$ and $Q$,
$$\|P - Q\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}(P \,\|\, Q).$$
As we see presently, the KL-divergence tensorizes when we have multiple observations from different distributions (see Lemma 5.2.8 to come), allowing substantially easier computation of individual divergence terms. Then we have the following theorem.
Theorem 5.2.4. Let $\mathcal F$ be a collection of convex functions, and let $f_1, f_{-1} \in \mathcal F$. Assume that when function $f_v$ is to be optimized, we observe $K$ subgradients according to $P_v^K$. Then
$$\mathfrak{M}_K(C, \mathcal P) \ge \frac{d_{\mathrm{opt}}(f_{-1}, f_1; C)}{2}\Big(1 - \sqrt{\tfrac{1}{2} D_{\mathrm{kl}}\big(P_1^K \| P_{-1}^K\big)}\Big).$$
What remains to give a concrete lower bound, then, is (1) to construct a family of well-separated functions $f_1, f_{-1}$, and (2) to construct a stochastic gradient oracle for which we give a small upper bound on the KL-divergence between the distributions $P_1$ and $P_{-1}$ associated with the functions, which means that testing between $P_1$ and $P_{-1}$ is hard.
Constructing well-separated functions Our first goal is to construct a family of well-separated functions and an associated first-order subgradient oracle that makes the functions hard to distinguish. We parameterize our functions (of which we construct only two) by a parameter $\delta > 0$ governing their separation. Our construction applies in dimension $n = 1$: let us assume that $C$ contains the interval $[-R, R]$ (this is no loss of generality, as we may simply shift the interval). Then define the $M$-Lipschitz continuous functions
$$(5.2.5)\qquad f_1(x) = M\delta|x - R| \quad\text{and}\quad f_{-1}(x) = M\delta|x + R|.$$
See Figure 5.2.7 for an example of these functions, which makes clear that their separation (5.1.4) is
$$d_{\mathrm{opt}}(f_1, f_{-1}) = \delta MR.$$
We also consider the stochastic oracle for this problem, recalling that we must construct subgradients satisfying $\mathbb E[\|g\|_2^2] \le M^2$. We will do slightly more: we will guarantee that $|g| \le M$ always. With this in mind, we assume that $\delta \le 1$, and define the stochastic gradient oracle for the distribution $P_v$, $v \in \{-1, 1\}$, at the
John C. Duchi 165

point $x$ to be
$$(5.2.6)\qquad g_v(x) = \begin{cases} M\,\mathrm{sign}(x - vR) & \text{with probability } \frac{1+\delta}{2},\\ -M\,\mathrm{sign}(x - vR) & \text{with probability } \frac{1-\delta}{2}.\end{cases}$$
At $x = vR$ the oracle simply returns a random sign. Then by inspection, we see that
$$\mathbb E[g_v(x)] = \frac{1+\delta}{2}\,M\,\mathrm{sign}(x - vR) - \frac{1-\delta}{2}\,M\,\mathrm{sign}(x - vR) = M\delta\,\mathrm{sign}(x - vR) \in \partial f_v(x)$$
for $v = -1, 1$. Thus, the combination of the functions (5.2.5) and the stochastic gradient (5.2.6) gives us a valid subgradient and well-separated pair of functions.
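The unbiasedness claim $\mathbb E[g_v(x)] = M\delta\,\mathrm{sign}(x - vR)$ is easy to confirm by simulation. The sketch below uses illustrative values of $M$, $\delta$, $R$, and $v$ (our own choices) and also checks that $|g| \le M$ always:

```python
import random

random.seed(0)

# Illustrative constants; any M > 0, delta in (0, 1], R > 0, v in {-1, 1} work.
M, delta, R, v = 1.0, 0.5, 1.0, 1

def sign(t):
    return (t > 0) - (t < 0)

def oracle(x):
    # M*sign(x - v*R) with probability (1+delta)/2, its negation otherwise.
    s = M * sign(x - v * R)
    return s if random.random() < (1 + delta) / 2 else -s

x = 0.3  # query point, where sign(x - v*R) = -1
samples = [oracle(x) for _ in range(200_000)]
mean = sum(samples) / len(samples)

assert all(abs(g) == M for g in samples)               # |g| <= M always
assert abs(mean - M * delta * sign(x - v * R)) < 0.01  # approx M*delta*sign(x - vR)
```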
[Figure 5.2.7. The function construction (5.2.5): the functions $f_{-1}$ and $f_1$ with minimizers $-R$ and $R$, the sublevel set $\{x : f_1(x) \le f_1^\star + MR\delta\}$, and separation $d_{\mathrm{opt}}(f_1, f_{-1}) = MR\delta$.]
Bounding the distance between distributions The second step in proving our minimax lower bound is to upper bound the distance between the distributions that generate the subgradients our methods observe. This means that testing which of the functions we are optimizing is challenging, giving us a strong lower bound. At a high level, building off of Theorem 5.2.4, we hope to show an upper bound of the form
$$D_{\mathrm{kl}}\big(P_1^K \| P_{-1}^K\big) \le \kappa\delta^2$$
for some $\kappa$. This is a local condition, allowing us to scale our problems with $\delta$ to achieve minimax bounds. If we have such a quadratic bound, we may simply choose $\delta^2 = 1/(2\kappa)$, giving the constant probability of error
$$1 - \big\|P_1^K - P_{-1}^K\big\|_{\mathrm{TV}} \ge 1 - \sqrt{\tfrac12 D_{\mathrm{kl}}\big(P_1^K \| P_{-1}^K\big)} \ge 1 - \sqrt{\frac{\kappa\delta^2}{2}} \ge \frac12.$$
To this end, we begin with a standard lemma (the chain rule for KL divergence), which applies when we have $K$ potentially dependent observations from a distribution. The result is an immediate consequence of Bayes' rule.

Lemma 5.2.8. Let $P_v(\cdot \mid g_1, \ldots, g_{k-1})$ denote the conditional distribution of $g_k$ given $g_1, \ldots, g_{k-1}$, and for each $k \in \mathbb N$ let $P_1^k$ and $P_{-1}^k$ be the distributions of the first $k$ subgradients $g_1, \ldots, g_k$. Then
$$D_{\mathrm{kl}}\big(P_1^K \| P_{-1}^K\big) = \sum_{k=1}^K \mathbb E_{P_1^{k-1}}\big[D_{\mathrm{kl}}\big(P_1(\cdot \mid g_1, \ldots, g_{k-1}) \,\|\, P_{-1}(\cdot \mid g_1, \ldots, g_{k-1})\big)\big].$$
Using Lemma 5.2.8, we have the following upper bound on the KL-divergence between $P_1^K$ and $P_{-1}^K$ for the stochastic gradient (5.2.6).

Lemma 5.2.9. Let the $K$ observations under distribution $P_v$ come from the stochastic gradient oracle (5.2.6). Then for $\delta \le \frac45$,
$$D_{\mathrm{kl}}\big(P_1^K \| P_{-1}^K\big) \le 3K\delta^2.$$
Proof. We use the chain rule for KL-divergence, whence we must only provide an upper bound on the individual terms. We first note that $x_k$ is a function of $g_1, \ldots, g_{k-1}$ (because we may assume w.l.o.g. that $x_k$ is deterministic), so that $P_v(\cdot \mid g_1, \ldots, g_{k-1})$ is the distribution of a Bernoulli random variable with distribution (5.2.6), i.e. with probabilities $\frac{1\pm\delta}{2}$. Thus we have
$$D_{\mathrm{kl}}\big(P_1(\cdot\mid g_1,\ldots,g_{k-1}) \,\|\, P_{-1}(\cdot\mid g_1,\ldots,g_{k-1})\big) \le D_{\mathrm{kl}}\Big(\tfrac{1+\delta}{2} \,\Big\|\, \tfrac{1-\delta}{2}\Big) = \frac{1+\delta}{2}\log\frac{1+\delta}{1-\delta} + \frac{1-\delta}{2}\log\frac{1-\delta}{1+\delta} = \delta\log\frac{1+\delta}{1-\delta}.$$
By a Taylor expansion, we have that
$$\delta\log\frac{1+\delta}{1-\delta} = \delta\Big(\delta - \frac{\delta^2}{2} + O(\delta^3)\Big) - \delta\Big(-\delta - \frac{\delta^2}{2} + O(\delta^3)\Big) = 2\delta^2 + O(\delta^4) \le 3\delta^2$$
for $\delta \le \frac45$, or
$$D_{\mathrm{kl}}\big(P_1(\cdot\mid g_1,\ldots,g_{k-1}) \,\|\, P_{-1}(\cdot\mid g_1,\ldots,g_{k-1})\big) \le 3\delta^2$$
for $\delta \le \frac45$. Summing over $k$ completes the proof. ∎
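The key inequality in the proof, $\delta\log\frac{1+\delta}{1-\delta} \le 3\delta^2$ for $\delta \le \frac45$, can also be checked numerically on a grid (a sketch, not a proof):

```python
from math import log

# delta * log((1+delta)/(1-delta)) <= 3*delta^2 for 0 < delta <= 4/5.
for i in range(1, 801):
    delta = i / 1000.0
    lhs = delta * log((1 + delta) / (1 - delta))
    assert lhs <= 3 * delta ** 2 + 1e-12
```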
Putting it all together: a minimax lower bound With Lemma 5.2.9 in place along with our construction (5.2.5) of well-separated functions, we can now give the best possible convergence guarantees for a broad family of problems.

Theorem 5.2.10. Let $C \subset \mathbb R^n$ be a convex set containing an $\ell_2$ ball of radius $R$, and let $\mathcal P$ denote the collection of distributions generating stochastic subgradients with $\|g\|_2 \le M$ with probability 1. Then, for all $K \in \mathbb N$,
$$\mathfrak{M}_K(C, \mathcal P) \ge \frac{RM}{4\sqrt{6}\sqrt{K}}.$$
Proof. We combine Le Cam's method, Lemma 5.2.2 (and the subsequent Theorem 5.2.4), with our construction (5.2.5) and the associated stochastic subgradients (5.2.6). Certainly, the class of $n$-dimensional optimization problems is at least as challenging as a 1-dimensional problem (we may always restrict our functions to depend only on a single coordinate), so that for any $\delta \ge 0$ we have
$$\mathfrak{M}_K(C, \mathcal F) \ge \frac{\delta MR}{2}\Big(1 - \sqrt{\tfrac12 D_{\mathrm{kl}}\big(P_1^K \| P_{-1}^K\big)}\Big).$$
Next, we use Lemma 5.2.9, which guarantees the further lower bound
$$\mathfrak{M}_K(C, \mathcal F) \ge \frac{\delta MR}{2}\Big(1 - \sqrt{\frac{3K\delta^2}{2}}\Big),$$
valid for all $\delta \le \frac45$. Finally, choosing $\delta^2 = \frac{1}{6K} \le \frac{16}{25}$, we have $D_{\mathrm{kl}}(P_1^K \| P_{-1}^K) \le \frac12$, and
$$\mathfrak{M}_K(C, \mathcal F) \ge \frac{\delta MR}{4}.$$
Substituting our choice of $\delta$ into this expression gives the theorem. ∎
In short, Theorem 5.2.10 gives a guarantee that matches the upper bounds of the previous lectures to within a numerical constant factor of 10. A more careful inspection of our analysis allows us to prove a lower bound, at least as $K \to \infty$, of $MR/(8\sqrt{K})$. In particular, by using Theorem 3.4.9 of our lecture on subgradient methods, we find that if the set $C$ contains an $\ell_2$-ball of radius $R_{\mathrm{inner}}$ and is contained in an $\ell_2$-ball of radius $R_{\mathrm{outer}}$, we have
$$(5.2.11)\qquad \frac{1}{\sqrt{96}}\,\frac{M R_{\mathrm{inner}}}{\sqrt{K}} \le \mathfrak{M}_K(C, \mathcal F) \le \frac{M R_{\mathrm{outer}}}{\sqrt{K}}$$
for all $K \in \mathbb N$, where the upper bound is attained by the stochastic projected subgradient method.
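When $R_{\mathrm{inner}} = R_{\mathrm{outer}}$, the gap between the two sides of (5.2.11) is exactly the constant $\sqrt{96} = 4\sqrt 6 \approx 9.8$, which is the "constant factor of 10" referred to above. A quick numerical check (with illustrative $M$ and $R$):

```python
from math import sqrt

M, R = 1.0, 1.0  # illustrative constants
ratios = []
for K in [1, 10, 100, 1000]:
    lower = M * R / (sqrt(96) * sqrt(K))  # lower bound in (5.2.11), R_inner = R
    upper = M * R / sqrt(K)               # upper bound in (5.2.11), R_outer = R
    assert lower <= upper
    ratios.append(upper / lower)

assert all(abs(r - sqrt(96)) < 1e-9 for r in ratios)
assert abs(4 * sqrt(6) - sqrt(96)) < 1e-12  # the constant of Theorem 5.2.10
assert sqrt(96) < 10                        # the "constant factor of 10"
```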
5.3. Multiple dimensions and Assouad's Method The results in Section 5.2 provide guarantees for problems where we can embed much of the difficulty of our family $\mathcal F$ in optimizing only a pair of functions, something reminiscent of problems in classical statistics on the "hardest one-dimensional subproblem" (see, for example, the work of Donoho, Liu, and MacGibbon [19]). In many stochastic optimization problems, the higher dimension $n$ yields increased difficulty, so that we would like to derive bounds that incorporate dimension more directly. With that in mind, we develop a family of lower bounds, based on Assouad's method [2], that reduce optimization to a collection of binary hypothesis tests, one for each of the $n$ dimensions of the problem.
More precisely, we let $\mathcal V = \{-1, 1\}^n$ be the $n$-dimensional binary hypercube, and for each $v \in \mathcal V$, we assume we have a function $f_v \in \mathcal F$ where $f_v : \mathbb R^n \to \mathbb R$. Without loss of generality, we will assume that our constraint set $C$ has the point 0 in its interior. Let $\delta \in \mathbb R^n_+$ be an $n$-dimensional nonnegative vector. Then we say that the functions $\{f_v\}$ induce a $\delta$-separation in the Hamming metric if for any $x \in C \subset \mathbb R^n$ we have
$$(5.3.1)\qquad f_v(x) - f_v^\star \ge \sum_{j=1}^n \delta_j\,\mathbf 1\{\mathrm{sign}(x_j) \ne v_j\},$$
where the subscript $j$ denotes the $j$th coordinate. For example, if we define the function $f_v(x) = \delta\|x - v\|_1$ for each $v \in \mathcal V$, then certainly $\{f_v\}$ is $\delta\mathbf 1$-separated in the Hamming metric; more generally, $f_v(x) = \sum_{j=1}^n \delta_j|x_j - v_j|$ is $\delta$-separated. With this definition, we have the following lemma, providing a lower bound for functions $f : \mathbb R^n \to \mathbb R$.
Lemma 5.3.2 (Generalized Assouad). Let $\delta \in \mathbb R^n_+$ and let $\{f_v\}$ be $\delta$-separated in Hamming metric, where $v \in \mathcal V = \{-1,1\}^n$. Let $\hat x$ be any optimization algorithm, and let $P_v$ be the distribution of (all) the subgradients $g_1, \ldots, g_K$ the procedure $\hat x$ observes when optimizing $f_v$. Define
$$P_{+j} = \frac{1}{2^{n-1}}\sum_{v : v_j = 1} P_v \quad\text{and}\quad P_{-j} = \frac{1}{2^{n-1}}\sum_{v : v_j = -1} P_v.$$
Then
$$\frac{1}{2^n}\sum_{v\in\{-1,1\}^n}\mathbb E[f_v(\hat x) - f_v^\star] \ge \frac12\sum_{j=1}^n \delta_j\big(1 - \|P_{+j} - P_{-j}\|_{\mathrm{TV}}\big).$$
Proof. By using the separation condition, we immediately see that
$$\mathbb E[f_v(\hat x) - f_v^\star] \ge \sum_{j=1}^n \delta_j\,P_v\big(\mathrm{sign}(\hat x_j) \ne v_j\big)$$
for any $v \in \mathcal V$. Averaging over the vectors $v \in \mathcal V$, we obtain
$$\frac{1}{2^n}\sum_{v\in\mathcal V}\mathbb E[f_v(\hat x) - f_v^\star] \ge \sum_{j=1}^n \delta_j\,\frac{1}{|\mathcal V|}\sum_{v\in\mathcal V} P_v\big(\mathrm{sign}(\hat x_j)\ne v_j\big) = \sum_{j=1}^n \delta_j\,\frac{1}{|\mathcal V|}\Big[\sum_{v: v_j=1} P_v\big(\mathrm{sign}(\hat x_j)\ne 1\big) + \sum_{v: v_j=-1} P_v\big(\mathrm{sign}(\hat x_j)\ne -1\big)\Big] = \sum_{j=1}^n \frac{\delta_j}{2}\Big[P_{+j}\big(\mathrm{sign}(\hat x_j)\ne 1\big) + P_{-j}\big(\mathrm{sign}(\hat x_j)\ne -1\big)\Big].$$
Now we use Le Cam's lemma (Lemma 5.2.2) on optimal binary hypothesis tests to see that
$$P_{+j}\big(\mathrm{sign}(\hat x_j)\ne 1\big) + P_{-j}\big(\mathrm{sign}(\hat x_j)\ne -1\big) \ge 1 - \|P_{+j} - P_{-j}\|_{\mathrm{TV}},$$
which gives the desired result. ∎
As a nearly immediate consequence of Lemma 5.3.2, we see that if the separation is a constant $\delta > 0$ for each coordinate, we have the following lower bound on the minimax risk.

Proposition 5.3.3. Let the collection $\{f_v\}_{v\in\mathcal V} \subset \mathcal F$, where $\mathcal V = \{-1,1\}^n$, be $\delta$-separated in Hamming metric for some $\delta \in \mathbb R_+$, and let the conditions of Lemma 5.3.2 hold. Then
$$\mathfrak M_K(C,\mathcal F) \ge \frac{n\delta}{2}\Bigg(1 - \sqrt{\frac{1}{2n}\sum_{j=1}^n D_{\mathrm{kl}}\big(P_{+j} \| P_{-j}\big)}\Bigg).$$
Proof. Lemma 5.3.2 guarantees that
$$\mathfrak M_K(C,\mathcal F) \ge \frac{\delta}{2}\sum_{j=1}^n\big(1 - \|P_{+j} - P_{-j}\|_{\mathrm{TV}}\big).$$
Applying the Cauchy-Schwarz inequality, we have by Pinsker's inequality
$$\sum_{j=1}^n\|P_{+j}-P_{-j}\|_{\mathrm{TV}} \le \sqrt{n\sum_{j=1}^n \|P_{+j}-P_{-j}\|_{\mathrm{TV}}^2} \le \sqrt{\frac n2\sum_{j=1}^n D_{\mathrm{kl}}\big(P_{+j}\|P_{-j}\big)}.$$
Substituting this into the previous bound gives the desired result. ∎
With this proposition, we can give a number of minimax lower bounds. We focus on two concrete cases, which show that the stochastic gradient procedures we have developed are optimal for a variety of problems. We give one result, deferring others to the exercises associated with the lecture notes. For our main result using Assouad's method, we consider optimization problems for which the set $C \subset \mathbb R^n$ contains an $\ell_\infty$ ball of radius $R$. We also assume that the stochastic gradient oracle satisfies the $\ell_1$-bound condition
$$\mathbb E\big[\|g(x, S, f)\|_1^2\big] \le M^2.$$
This means that all the functions $f \in \mathcal F$ are $M$-Lipschitz continuous with respect to the $\ell_\infty$-norm, that is, $|f(x) - f(y)| \le M\|x - y\|_\infty$.
Theorem 5.3.4. Let $\mathcal F$ and the stochastic gradient oracle be as above, and assume that $C \supset [-R, R]^n$. Then
$$\mathfrak M_K(C, \mathcal F) \ge RM\min\Big\{\frac15,\ \frac{\sqrt n}{\sqrt{96}\sqrt K}\Big\}.$$
Proof. Our proof is similar to our construction of our earlier lower bounds, except that now we must construct functions defined on $\mathbb R^n$ so that our minimax lower bound on the convergence rate grows with the dimension. Let $\delta > 0$ be fixed for now. For each $v \in \mathcal V = \{-1,1\}^n$, define the function
$$f_v(x) := \frac{M\delta}{n}\,\|x - Rv\|_1.$$
Then by inspection, the collection $\{f_v\}$ is $\frac{MR\delta}{n}$-separated in Hamming metric, as
$$f_v(x) = \frac{M\delta}{n}\sum_{j=1}^n |x_j - Rv_j| \ge \frac{M\delta}{n}\sum_{j=1}^n R\,\mathbf 1\{\mathrm{sign}(x_j)\ne v_j\}.$$
Now, we must (as before) construct a stochastic subgradient oracle. Let $e_1, \ldots, e_n$ be the $n$ standard basis vectors. For each $v \in \mathcal V$, we define the stochastic subgradient as
$$(5.3.5)\qquad g(x, f_v) = \begin{cases} M e_j\,\mathrm{sign}(x_j - Rv_j) & \text{with probability } \frac{1+\delta}{2n},\\ -M e_j\,\mathrm{sign}(x_j - Rv_j) & \text{with probability } \frac{1-\delta}{2n},\end{cases}\qquad j = 1,\ldots,n.$$
That is, the oracle randomly chooses a coordinate $j \in \{1, \ldots, n\}$, then conditional on this choice, flips a biased coin and with probability $\frac{1+\delta}{2}$ returns the correctly signed $j$th coordinate of the subgradient, $M e_j\,\mathrm{sign}(x_j - Rv_j)$, and otherwise returns the negative. Letting $\mathrm{sign}(x)$ denote the vector of signs of $x$, we then have
$$\mathbb E[g(x, f_v)] = M\sum_{j=1}^n\Big(\frac{1+\delta}{2n} - \frac{1-\delta}{2n}\Big)\,e_j\,\mathrm{sign}(x_j - Rv_j) = \frac{M\delta}{n}\,\mathrm{sign}(x - Rv).$$
That is, $\mathbb E[g(x, f_v)] \in \partial f_v(x)$ as desired.
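As in the one-dimensional case, the unbiasedness of the coordinate-sampling oracle (5.3.5) is easy to confirm by simulation; the dimension, query point, and sign vector below are illustrative choices:

```python
import random

random.seed(1)

# Illustrative dimension, constants, sign vector, and query point.
n, M, delta, R = 4, 1.0, 0.5, 1.0
v = [1, -1, 1, -1]
x = [0.2, 0.7, -0.5, 0.1]

def sign(t):
    return (t > 0) - (t < 0)

def oracle():
    # Pick a coordinate j uniformly; return +/- M*e_j*sign(x_j - R*v_j) with
    # probabilities (1+delta)/2 and (1-delta)/2, as in (5.3.5).
    j = random.randrange(n)
    s = M * sign(x[j] - R * v[j])
    g = [0.0] * n
    g[j] = s if random.random() < (1 + delta) / 2 else -s
    return g

N = 200_000
mean = [0.0] * n
for _ in range(N):
    g = oracle()
    for j in range(n):
        mean[j] += g[j] / N

target = [M * delta / n * sign(x[j] - R * v[j]) for j in range(n)]
assert all(abs(mean[j] - target[j]) < 0.01 for j in range(n))
```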
Now, we apply Proposition 5.3.3, which guarantees that
$$(5.3.6)\qquad \mathfrak M_K(C,\mathcal F) \ge \frac{MR\delta}{2}\Bigg(1 - \sqrt{\frac{1}{2n}\sum_{j=1}^n D_{\mathrm{kl}}\big(P_{+j} \| P_{-j}\big)}\Bigg).$$
It remains to upper bound the KL-divergence terms. Let $P_v^K$ denote the distribution of the $K$ subgradients the method observes for the function $f_v$, and let $v(\pm j)$ denote the vector $v$ except that its $j$th entry is forced to be $\pm 1$. Then, we may use the convexity of the KL-divergence to obtain that
$$D_{\mathrm{kl}}\big(P_{+j}\|P_{-j}\big) \le \frac{1}{2^n}\sum_{v\in\mathcal V} D_{\mathrm{kl}}\big(P^K_{v(+j)} \,\|\, P^K_{v(-j)}\big).$$
Let us thus bound $D_{\mathrm{kl}}(P_v^K \| P_{v'}^K)$ when $v$ and $v'$ differ in only a single coordinate (we let it be the first coordinate with no loss of generality). Let us assume for notational simplicity $M = 1$ for the next calculation, as this only changes the support of the subgradient distribution (5.3.5) but not any divergences. Applying the chain rule (Lemma 5.2.8), we have
$$D_{\mathrm{kl}}\big(P_v^K \| P_{v'}^K\big) = \sum_{k=1}^K \mathbb E_{P_v}\big[D_{\mathrm{kl}}\big(P_v(\cdot\mid g_{1:k-1}) \,\|\, P_{v'}(\cdot\mid g_{1:k-1})\big)\big].$$
We consider one of the terms, noting that the $k$th query $x_k$ is a function of $g_1, \ldots, g_{k-1}$. We have
$$D_{\mathrm{kl}}\big(P_v(\cdot\mid x_k) \,\|\, P_{v'}(\cdot\mid x_k)\big) = P_v(g = e_1\mid x_k)\log\frac{P_v(g = e_1\mid x_k)}{P_{v'}(g = e_1\mid x_k)} + P_v(g = -e_1\mid x_k)\log\frac{P_v(g = -e_1\mid x_k)}{P_{v'}(g = -e_1\mid x_k)},$$
because $P_v$ and $P_{v'}$ assign the same probability to all subgradients except when $g \in \{\pm e_1\}$. Continuing the derivation, we obtain
$$D_{\mathrm{kl}}\big(P_v(\cdot\mid x_k) \,\|\, P_{v'}(\cdot\mid x_k)\big) = \frac{1+\delta}{2n}\log\frac{1+\delta}{1-\delta} + \frac{1-\delta}{2n}\log\frac{1-\delta}{1+\delta} = \frac{\delta}{n}\log\frac{1+\delta}{1-\delta}.$$
Noting that this final quantity is bounded by $\frac{3\delta^2}{n}$ for $\delta \le \frac45$ gives that
$$D_{\mathrm{kl}}\big(P_v^K \| P_{v'}^K\big) \le \frac{3K\delta^2}{n}\quad\text{if } \delta \le \frac45.$$
Substituting the preceding calculation into the lower bound (5.3.6), we obtain
$$\mathfrak M_K(C,\mathcal F) \ge \frac{MR\delta}{2}\Bigg(1 - \sqrt{\frac{1}{2n}\sum_{j=1}^n \frac{3K\delta^2}{n}}\Bigg) = \frac{MR\delta}{2}\Bigg(1 - \sqrt{\frac{3K\delta^2}{2n}}\Bigg).$$
Choosing $\delta^2 = \min\big\{\frac{16}{25}, \frac{n}{6K}\big\}$ gives the result of the theorem. ∎
A few remarks are in order. First, the theorem recovers the 1-dimensional result of Theorem 5.2.10, by simply taking $n = 1$ in its statement. Second, we see that if we wish to optimize over a set larger than the $\ell_2$-ball, then there must necessarily be some dimension-dependent penalty, at least in the worst case. Lastly, the result again is sharp. By using Theorem 3.4.9, we obtain the following corollary.
Corollary 5.3.7. In addition to the conditions of Theorem 5.3.4, let $C \subset \mathbb R^n$ contain an $\ell_\infty$ box of radius $R_{\mathrm{inner}}$ and be contained in an $\ell_\infty$ box of radius $R_{\mathrm{outer}}$. Then
$$R_{\mathrm{inner}} M\min\Big\{\frac15,\ \frac{\sqrt n}{\sqrt{96}\sqrt K}\Big\} \le \mathfrak M_K(C,\mathcal F) \le R_{\mathrm{outer}} M\min\Big\{1,\ \frac{\sqrt n}{\sqrt K}\Big\}.$$
Notes and further reading The minimax criterion for measuring optimality of optimization and estimation procedures has a long history, dating back at least to Wald [59] in 1939. The information-theoretic approach to optimality guarantees was extensively developed by Ibragimov and Has'minskii [31], and this is our approach. Our treatment in this chapter is specifically based off of that by Agarwal et al. [1] for proving lower bounds for stochastic optimization problems, though our results appear to have slightly sharper constants. Notably missing in our treatment is the use of Fano's inequality for lower bounds, which is commonly used to prove converse statements to achievability results in information theory [17, 63]. Recent treatments of various techniques for proving lower bounds in statistics can be found in the book of Tsybakov [58] or the lecture notes [21].
Our focus on stochastic optimization problems allows reasonably straightforward reductions from optimization to statistical testing problems, for which information theoretic and statistical tools give elegant solutions. For lower bounds for non-stochastic problems, the classical reference is the book of Nemirovski and Yudin [41] (who also provide optimality guarantees for stochastic problems). The basic idea is to provide lower bounds for the oracle model of convex optimization, where we consider optimality in terms of the number of queries to an oracle giving true first- or second-order information (as opposed to the stochastic oracle studied here). More recent work, including the lecture notes [42] and the book [44], provides a somewhat easier guide to such results, while the recent paper of Braun et al. [13] shows how to use information-theoretic tools to guarantee optimality even for non-stochastic optimization problems.
A. Technical Appendices
A.1. Continuity of Convex Functions In this appendix, we provide proofs of the basic continuity results for convex functions. Our arguments are based on those of Hiriart-Urruty and Lemaréchal [27].
Proof of Lemma 2.3.1 We can write $x \in B_1$ as $x = \sum_{i=1}^n x_i e_i$, where the $e_i$ are the standard basis vectors and $\sum_{i=1}^n |x_i| \le 1$. Thus, we have
$$f(x) = f\Big(\sum_{i=1}^n e_i x_i\Big) = f\Big(\sum_{i=1}^n |x_i|\,\mathrm{sign}(x_i) e_i + (1 - \|x\|_1)\cdot 0\Big) \le \sum_{i=1}^n |x_i| f(\mathrm{sign}(x_i) e_i) + (1 - \|x\|_1) f(0) \le \max\big\{f(e_1), f(-e_1), f(e_2), f(-e_2), \ldots, f(e_n), f(-e_n), f(0)\big\}.$$
The first inequality uses the fact that the $|x_i|$ and $(1 - \|x\|_1)$ form a convex combination, since $x \in B_1$, as does the second.
For the lower bound, note that since $x \in \mathrm{int}\,B_1$ satisfies $x \in \mathrm{int}\,\mathrm{dom} f$, we have $\partial f(x) \ne \emptyset$ by Theorem 2.4.3. In particular, there is a vector $g$ such that $f(y) \ge f(x) + \langle g, y - x\rangle$ for all $y$, and even more,
$$f(y) \ge f(x) + \inf_{y\in B_1}\langle g, y - x\rangle \ge f(x) - 2\|g\|_\infty$$
for all $y \in B_1$.
Proof of Theorem 2.3.2 First, let us suppose that for each point $x_0 \in C$, there exists an open ball $B \subset \mathrm{int}\,\mathrm{dom} f$ such that
$$(A.1.1)\qquad |f(x) - f(x')| \le L\,\|x - x'\|_2 \quad\text{for all } x, x' \in B.$$
The collection of such balls $B$ covers $C$, and as $C$ is compact, there exists a finite subcover $B_1, \ldots, B_k$ with associated Lipschitz constants $L_1, \ldots, L_k$. Take $L = \max_i L_i$ to obtain the result.
It thus remains to show that we can construct balls satisfying the Lipschitz condition (A.1.1) at each point $x_0 \in C$.
With that in mind, we use Lemma 2.3.1, which shows that for each point $x_0$, there is some $\epsilon > 0$ and $-\infty < m \le M < \infty$ such that
$$-\infty < m \le \inf_{v : \|v\|_2 \le 2\epsilon} f(x_0 + v) \le \sup_{v : \|v\|_2 \le 2\epsilon} f(x_0 + v) \le M < \infty.$$
We make the following claim, from which the condition (A.1.1) evidently follows based on the preceding display.

Lemma A.1.2. Let $\epsilon > 0$, let $f$ be convex, and let $B = \{v : \|v\|_2 \le 1\}$. Suppose that $f(x) \in [m, M]$ for all $x \in x_0 + 2\epsilon B$. Then
$$|f(x) - f(x')| \le \frac{M - m}{\epsilon}\,\|x - x'\|_2 \quad\text{for all } x, x' \in x_0 + \epsilon B.$$
Proof. Let $x, x' \in x_0 + \epsilon B$. Let
$$x'' = x' + \epsilon\,\frac{x' - x}{\|x' - x\|_2} \in x_0 + 2\epsilon B,$$
as $(x' - x)/\|x' - x\|_2 \in B$. By construction, we have that $x'$ lies in the segment $\{t x + (1-t)x'' : t \in [0,1]\}$ between $x$ and $x''$; explicitly,
$$\Big(1 + \frac{\epsilon}{\|x' - x\|_2}\Big)x' = x'' + \frac{\epsilon}{\|x' - x\|_2}\,x, \quad\text{or}\quad x' = \frac{\|x' - x\|_2}{\|x' - x\|_2 + \epsilon}\,x'' + \frac{\epsilon}{\|x' - x\|_2 + \epsilon}\,x.$$
Then we find that
$$f(x') \le \frac{\|x' - x\|_2}{\|x' - x\|_2 + \epsilon}\,f(x'') + \frac{\epsilon}{\|x' - x\|_2 + \epsilon}\,f(x),$$
or
$$f(x') - f(x) \le \frac{\|x' - x\|_2}{\|x' - x\|_2 + \epsilon}\big(f(x'') - f(x)\big) \le \frac{\|x' - x\|_2}{\|x' - x\|_2 + \epsilon}\,[M - m] \le \frac{M - m}{\epsilon}\,\|x' - x\|_2.$$
Swapping the roles of $x$ and $x'$ gives the result. ∎
A.2. Probability background In this section, we very tersely review a few of the necessary definitions and results that we employ here. We provide a non-measure-theoretic treatment, as it is not essential for the basic uses we have.

Definition A.2.1. A sequence $X_1, X_2, \ldots$ of random vectors converges in probability to a random vector $X_\infty$ if for all $\epsilon > 0$, we have
$$\limsup_{n\to\infty}\, P\big(\|X_n - X_\infty\| > \epsilon\big) = 0.$$

Definition A.2.2. A sequence $X_1, X_2, \ldots$ of random vectors is a martingale if there is a sequence of random variables $Z_1, Z_2, \ldots$ (which may contain all the information about $X_1, X_2, \ldots$) such that for each $n$, (i) $X_n$ is a function of $Z_n$, (ii) $Z_{n-1}$ is a function of $Z_n$, and (iii) we have the conditional expectation condition
$$\mathbb E[X_n \mid Z_{n-1}] = X_{n-1}.$$
When condition (i) is satisfied, we say that $X_n$ is adapted to $Z$. We say that a sequence $X_1, X_2, \ldots$ is a martingale difference sequence if $S_n = \sum_{i=1}^n X_i$ is a martingale, or, equivalently, if $\mathbb E[X_n \mid Z_{n-1}] = 0$.
We now provide a self-contained proof of the Azuma-Hoeffding inequality. Our first result is an important intermediate step.

Lemma A.2.3 (Hoeffding's Lemma [30]). Let $X$ be a random variable with $a \le X \le b$. Then
$$\mathbb E[\exp(\lambda(X - \mathbb E[X]))] \le \exp\Big(\frac{\lambda^2(b-a)^2}{8}\Big) \quad\text{for all } \lambda \in \mathbb R.$$
Proof. First, we note that if $Y$ is any random variable with $Y \in [c_1, c_2]$, then $\mathrm{Var}(Y) \le \frac{(c_2 - c_1)^2}{4}$. Indeed, we have that $\mathrm{Var}(Y) = \mathbb E[(Y - \mathbb E[Y])^2]$ and that $\mathbb E[Y]$ minimizes $\mathbb E[(Y - t)^2]$ over $t \in \mathbb R$, so that
$$(A.2.4)\qquad \mathrm{Var}(Y) \le \mathbb E\Big[\Big(Y - \frac{c_2 + c_1}{2}\Big)^2\Big] \le \Big(c_2 - \frac{c_2 + c_1}{2}\Big)^2 = \frac{(c_2 - c_1)^2}{4}.$$
Without loss of generality, we assume that $\mathbb E[X] = 0$ and that $0 \in [a, b]$. Let $\psi(\lambda) = \log \mathbb E[e^{\lambda X}]$. Then
$$\psi'(\lambda) = \frac{\mathbb E[X e^{\lambda X}]}{\mathbb E[e^{\lambda X}]} \quad\text{and}\quad \psi''(\lambda) = \frac{\mathbb E[X^2 e^{\lambda X}]}{\mathbb E[e^{\lambda X}]} - \frac{\mathbb E[X e^{\lambda X}]^2}{\mathbb E[e^{\lambda X}]^2}.$$
Note that $\psi'(0) = \mathbb E[X] = 0$. Let $P$ denote the distribution of $X$, and assume without loss of generality that $X$ has a density $p$ (footnote 5). Define the random variable $Y$ to have the shifted density $f$ defined by
$$f(y) = \frac{e^{\lambda y}}{\mathbb E[e^{\lambda X}]}\,p(y)$$
for $y \in \mathbb R$, where $p(y) = 0$ for $y \notin [a, b]$. Then we find $\mathbb E[Y] = \psi'(\lambda)$ and $\mathrm{Var}(Y) = \mathbb E[Y^2] - \mathbb E[Y]^2 = \psi''(\lambda)$. But of course, we know that $Y \in [a, b]$ because the distribution $P$ of $X$ is supported on $[a, b]$, so that
$$\psi''(\lambda) = \mathrm{Var}(Y) \le \frac{(b - a)^2}{4}$$
by inequality (A.2.4). Using Taylor's theorem, we have that
$$\psi(\lambda) = \psi(0) + \underbrace{\psi'(0)}_{=0}\,\lambda + \frac{\lambda^2}{2}\,\psi''(\tilde\lambda) = \psi(0) + \frac{\lambda^2}{2}\,\psi''(\tilde\lambda)$$
for some $\tilde\lambda$ between 0 and $\lambda$. But $\psi''(\tilde\lambda) \le \frac{(b-a)^2}{4}$ and $\psi(0) = 0$, so that $\psi(\lambda) \le \frac{\lambda^2(b-a)^2}{8}$ as desired. ∎
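For a Bernoulli random variable the moment generating function is available in closed form, so Hoeffding's lemma can be checked on a grid of parameters (a sketch; the grid ranges are our own choices):

```python
from math import exp

# For X ~ Bernoulli(p) on [a, b] = [0, 1]:
# E[exp(lam*(X - E X))] = (1-p)*exp(-lam*p) + p*exp(lam*(1-p)) <= exp(lam^2/8).
for ip in range(1, 100):
    p = ip / 100.0
    for il in range(-50, 51):
        lam = il / 5.0  # lam ranges over [-10, 10]
        mgf = (1 - p) * exp(-lam * p) + p * exp(lam * (1 - p))
        assert mgf <= exp(lam ** 2 / 8) * (1 + 1e-12)
```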
Theorem A.2.5 (Azuma-Hoeffding Inequality [4]). Let $X_1, X_2, \ldots$ be a martingale difference sequence with $|X_i| \le B$ for all $i = 1, 2, \ldots$. Then
$$P\Big(\sum_{i=1}^n X_i \ge t\Big) \le \exp\Big(-\frac{t^2}{2nB^2}\Big) \quad\text{and}\quad P\Big(\sum_{i=1}^n X_i \le -t\Big) \le \exp\Big(-\frac{t^2}{2nB^2}\Big)$$
for all $t \ge 0$.

Proof. We prove the upper tail, as the lower tail is similar. The proof is a nearly immediate consequence of Hoeffding's lemma (Lemma A.2.3) and the Chernoff bound technique. Indeed, we have
$$P\Big(\sum_{i=1}^n X_i \ge t\Big) \le \mathbb E\Big[\exp\Big(\lambda\sum_{i=1}^n X_i\Big)\Big]\exp(-\lambda t)$$
for all $\lambda \ge 0$. Now, letting $Z_i$ be the sequence to which the $X_i$ are adapted, we iterate conditional expectations. We have
$$\mathbb E\Big[\exp\Big(\lambda\sum_{i=1}^n X_i\Big)\Big] = \mathbb E\Big[\mathbb E\Big[\exp\Big(\lambda\sum_{i=1}^{n-1} X_i\Big)\exp(\lambda X_n)\ \Big|\ Z_{n-1}\Big]\Big] = \mathbb E\Big[\exp\Big(\lambda\sum_{i=1}^{n-1} X_i\Big)\,\mathbb E[\exp(\lambda X_n) \mid Z_{n-1}]\Big] \le \mathbb E\Big[\exp\Big(\lambda\sum_{i=1}^{n-1} X_i\Big)\Big]\,e^{\lambda^2 B^2/2},$$
because $X_1, \ldots, X_{n-1}$ are functions of $Z_{n-1}$, and conditional on $Z_{n-1}$ the variable $X_n$ has mean zero and takes values in $[-B, B]$, so Lemma A.2.3 applies with $b - a = 2B$. By iteratively applying this calculation, we arrive at
$$(A.2.6)\qquad \mathbb E\Big[\exp\Big(\lambda\sum_{i=1}^n X_i\Big)\Big] \le \exp\Big(\frac{\lambda^2 n B^2}{2}\Big).$$
Now we optimize by choosing $\lambda \ge 0$ to minimize the upper bound that inequality (A.2.6) provides, namely
$$P\Big(\sum_{i=1}^n X_i \ge t\Big) \le \inf_{\lambda \ge 0}\exp\Big(\frac{\lambda^2 n B^2}{2} - \lambda t\Big) = \exp\Big(-\frac{t^2}{2nB^2}\Big)$$
by taking $\lambda = \frac{t}{nB^2}$. ∎
Footnote 5: We may assume there is a dominating base measure $\mu$ with respect to which $P$ has a density $p$.
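A Monte Carlo illustration of the standard tail bound $\exp(-t^2/(2nB^2))$ for martingale differences bounded by $B$, using a Rademacher random walk (the walk and all parameters are our own illustrative choices):

```python
import random
from math import exp

random.seed(2)

n, B, t = 100, 1.0, 20.0
trials = 20_000

# A Rademacher walk: i.i.d. +/-1 steps form a martingale difference
# sequence with |X_i| <= B = 1.
exceed = sum(
    sum(random.choice((-1.0, 1.0)) for _ in range(n)) >= t
    for _ in range(trials)
) / trials

bound = exp(-t ** 2 / (2 * n * B ** 2))
assert exceed <= bound
```

Here the empirical tail probability is a few percent, comfortably below the bound $e^{-2} \approx 0.135$.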
A.3. Auxiliary results on divergences We present a few standard results on divergences without proof, referring to standard references (e.g. the book of Cover and Thomas [17] or the extensive paper on divergence measures by Liese and Vajda [35]). Nonetheless, we state and prove a few results. The first is known as the data processing inequality, and it says that processing a random variable (even adding noise to it) can only make distributions closer together. See Cover and Thomas [17] or Theorem 14 of Liese and Vajda [35] for a proof.

Proposition A.3.1 (Data processing). Let $P_0$ and $P_1$ be distributions on a random variable $S \in \mathcal S$, let $Q(\cdot \mid s)$ denote any conditional probability distribution conditioned on $s$, and define
$$Q_v(A) = \int Q(A \mid s)\,dP_v(s)$$
for $v = 0, 1$ and all sets $A$. Then
$$\|Q_0 - Q_1\|_{\mathrm{TV}} \le \|P_0 - P_1\|_{\mathrm{TV}} \quad\text{and}\quad D_{\mathrm{kl}}(Q_0 \| Q_1) \le D_{\mathrm{kl}}(P_0 \| P_1).$$
This proposition is somewhat intuitive: if we do any processing on a random variable $S \sim P$, then there is less "information" about the initial distribution of $P$ than if we did no further processing. A consequence is Pinsker's inequality.
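A small finite example makes data processing concrete: push two distributions through a row-stochastic channel and check that both distances shrink. The distributions and channel below are arbitrary illustrative choices:

```python
from math import log

# Arbitrary illustrative distributions on {0, 1, 2} and a row-stochastic channel.
P0 = [0.5, 0.3, 0.2]
P1 = [0.2, 0.2, 0.6]
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]

def push(P):  # output marginal Q_v of the channel
    return [sum(P[i] * Q[i][j] for i in range(3)) for j in range(3)]

def tv(P, R):
    return 0.5 * sum(abs(p - r) for p, r in zip(P, R))

def kl(P, R):
    return sum(p * log(p / r) for p, r in zip(P, R) if p > 0)

Q0, Q1 = push(P0), push(P1)
assert tv(Q0, Q1) <= tv(P0, P1) + 1e-12
assert kl(Q0, Q1) <= kl(P0, P1) + 1e-12
```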
Proposition A.3.2 (Pinsker's inequality). Let $P$ and $Q$ be arbitrary distributions. Then
$$\|P - Q\|_{\mathrm{TV}}^2 \le \frac12 D_{\mathrm{kl}}(P\|Q).$$
Proof. First, we note that if we show the result assuming that the sample space $\mathcal S$ on which $P$ and $Q$ are defined is finite, we have the general result. Indeed, suppose that $A \subset \mathcal S$ achieves the supremum
$$\|P - Q\|_{\mathrm{TV}} = \sup_{A\subset\mathcal S}|P(A) - Q(A)|.$$
(We may assume without loss of generality that such a set exists.) If we define $\hat P$ and $\hat Q$ to be the binary distributions with $\hat P(0) = P(A)$ and $\hat P(1) = 1 - P(A)$, and similarly for $\hat Q$, we have $\|P - Q\|_{\mathrm{TV}} = \|\hat P - \hat Q\|_{\mathrm{TV}}$, and Proposition A.3.1 immediately guarantees that
$$D_{\mathrm{kl}}\big(\hat P \| \hat Q\big) \le D_{\mathrm{kl}}(P\|Q).$$
Let us assume then that $|\mathcal S| < \infty$.
In this case, Pinsker's inequality is an immediate consequence of the strong convexity of the negative entropy functional $h(p) = \sum_{i=1}^n p_i\log p_i$ with respect to the $\ell_1$-norm over the probability simplex. For completeness, let us prove this. Let $p$ and $q \in \mathbb R^n_+$ satisfy $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i = 1$. Then Taylor's theorem guarantees that
$$h(q) = h(p) + \langle\nabla h(p), q - p\rangle + \frac12 (q - p)^\top \nabla^2 h(\tilde q)(q - p),$$
where $\tilde q = \lambda p + (1 - \lambda)q$ for some $\lambda \in [0, 1]$. Now, we note that
$$\nabla^2 h(p) = \mathrm{diag}(1/p_1, \ldots, 1/p_n),$$
and using that $\nabla h(p) = [\log p_i + 1]_{i=1}^n$ along with $\sum_{i=1}^n (q_i - p_i) = 0$, we find
$$h(q) = h(p) + \sum_{i=1}^n (q_i - p_i)\log p_i + \frac12\sum_{i=1}^n \frac{(q_i - p_i)^2}{\tilde q_i}.$$
Using the Cauchy-Schwarz inequality, we have
$$\Big(\sum_{i=1}^n |q_i - p_i|\Big)^2 = \Big(\sum_{i=1}^n \sqrt{\tilde q_i}\,\frac{|q_i - p_i|}{\sqrt{\tilde q_i}}\Big)^2 \le \Big(\sum_{i=1}^n \tilde q_i\Big)\Big(\sum_{i=1}^n \frac{(q_i - p_i)^2}{\tilde q_i}\Big) = \sum_{i=1}^n \frac{(q_i - p_i)^2}{\tilde q_i}.$$
Of course, this gives
$$h(q) \ge h(p) + \sum_{i=1}^n (q_i - p_i)\log p_i + \frac12\|p - q\|_1^2.$$
Rearranging this, we have $h(q) - h(p) - \langle\nabla h(p), q - p\rangle = \sum_{i=1}^n q_i\log\frac{q_i}{p_i}$, or that
$$D_{\mathrm{kl}}(q\|p) \ge \frac12\|p - q\|_1^2 = 2\|P - Q\|_{\mathrm{TV}}^2.$$
This is the result. ∎
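Pinsker's inequality is easy to stress-test numerically on random distributions over a finite sample space, recalling that $\|P - Q\|_{\mathrm{TV}} = \frac12\|p - q\|_1$ in the finite case (a sketch, with our own random instances):

```python
import random
from math import log

random.seed(3)

def normalize(w):
    s = sum(w)
    return [t / s for t in w]

for _ in range(1000):
    p = normalize([random.random() + 0.01 for _ in range(5)])
    q = normalize([random.random() + 0.01 for _ in range(5)])
    tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))  # TV = (1/2)||p - q||_1
    kl = sum(a * log(a / b) for a, b in zip(p, q))
    assert tv ** 2 <= 0.5 * kl + 1e-12
```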
B. Questions and Exercises
Exercises for Lecture 2
Question B.2.1: Let $\pi_C(x) := \mathrm{argmin}_{y\in C}\|x - y\|_2$ denote the Euclidean projection of $x$ onto the set $C$, where $C$ is closed convex. Show that the projection is a Lipschitz mapping, that is, for all vectors $x_0, x_1$,
$$\|\pi_C(x_0) - \pi_C(x_1)\|_2 \le \|x_0 - x_1\|_2.$$
Show that, even if $C$ is compact, this inequality cannot (in general) be improved.
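For intuition on Question B.2.1, the nonexpansiveness is easy to observe in the special case of a box, where the Euclidean projection is coordinatewise clipping (an illustrative sketch, not a proof):

```python
import random

random.seed(4)

def project_box(x, lo=-1.0, hi=1.0):
    # Euclidean projection onto the box [lo, hi]^n acts coordinatewise.
    return [min(max(t, lo), hi) for t in x]

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

for _ in range(1000):
    x0 = [random.uniform(-3, 3) for _ in range(5)]
    x1 = [random.uniform(-3, 3) for _ in range(5)]
    assert dist(project_box(x0), project_box(x1)) <= dist(x0, x1) + 1e-12
```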
Question B.2.2: If $\mathbb S^n = \{A \in \mathbb R^{n\times n} : A = A^T\}$ is the set of symmetric matrices and, for $A \in \mathbb S^n$, $f(A) = \lambda_{\max}(A)$, show that $f$ is convex and compute $\partial f(A)$.
Question B.2.3: A convex function $f$ is called $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ on the (convex) domain $X$ if for any $x, y \in X$, we have
$$f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\lambda}{2}\|x - y\|^2$$
for all $g \in \partial f(x)$. Recall that a function $f$ is $L$-Lipschitz continuous with respect to the norm $\|\cdot\|$ on the domain $X$ if
$$|f(x) - f(y)| \le L\|x - y\| \quad\text{for all } x, y \in X.$$
Let $f$ be $\lambda$-strongly convex w.r.t. $\|\cdot\|$ and $h_1, h_2$ be $L$-Lipschitz continuous convex functions with respect to the norm $\|\cdot\|$. For $i = 1, 2$ define
$$x_i = \mathrm{argmin}_{x\in X}\{f(x) + h_i(x)\}.$$
Show that
$$\|x_1 - x_2\| \le \frac{2L}{\lambda}.$$
Hint: You may use the fact, demonstrated in the notes, that if $h$ is $L$-Lipschitz and convex, then $\|g\|_* \le L$ for all $g \in \partial h(x)$, where $\|\cdot\|_*$ is the dual norm to $\|\cdot\|$.
Question B.2.4 (Hölder's inequality): Let $x$ and $y$ be vectors in $\mathbb R^n$ and let $p, q \in (1, \infty)$ be conjugate, that is, satisfy $1/p + 1/q = 1$. In this question, we will show that $\langle x, y\rangle \le \|x\|_p\|y\|_q$, and moreover, that $\|\cdot\|_p$ and $\|\cdot\|_q$ are dual norms. (The result is essentially immediate in the case that $p = 1$ and $q = \infty$.)
(a) Show that for any $a, b \ge 0$ and any $\eta > 0$, we have
$$ab \le \frac{\eta^p}{p}\,a^p + \frac{1}{\eta^q q}\,b^q.$$
Hint: use the concavity of the logarithm and that $1/p + 1/q = 1$.
(b) Show that $\langle x, y\rangle \le \frac{\eta^p}{p}\|x\|_p^p + \frac{1}{\eta^q q}\|y\|_q^q$ for all $\eta > 0$.
(c) Using the result of part (b), show that $\langle x, y\rangle \le \|x\|_p\|y\|_q$.
(d) Show that $\|\cdot\|_p$ and $\|\cdot\|_q$ are dual norms.
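The target inequality of part (c) can be sanity-checked numerically for random vectors and random conjugate pairs $(p, q)$ (a sketch, not a proof):

```python
import random

random.seed(5)

def norm(x, p):
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

for _ in range(1000):
    p = random.uniform(1.1, 4.0)
    q = p / (p - 1)  # conjugate exponent, 1/p + 1/q = 1
    x = [random.uniform(-1, 1) for _ in range(6)]
    y = [random.uniform(-1, 1) for _ in range(6)]
    inner = sum(a * b for a, b in zip(x, y))
    assert inner <= norm(x, p) * norm(y, q) + 1e-10
```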
Exercises for Lecture 3
Question B.3.1: In this question and the next, we perform experiments with (stochastic) subgradient methods to train a handwritten digit recognition classifier (one to recognize the digits $\{0, 1, \ldots, 9\}$). A warning: we use optimization notation here, consistent with Example 3.4.6, which is non-standard for typical machine learning or statistical learning applications.
We represent a multiclass classifier using a matrix
$$X = [x_1\ x_2\ \cdots\ x_k] \in \mathbb R^{d\times k},$$
where there are $k$ classes, and the predicted class for a data vector $a \in \mathbb R^d$ is
$$\mathrm{argmax}_{l\in[k]}\,\langle a, x_l\rangle = \mathrm{argmax}_{l\in[k]}\{[X^\top a]_l\}.$$
We represent data as pairs $(a, b) \in \mathbb R^d \times \{1, \ldots, k\}$, where $a$ is the data point (features) and $b$ the label of the data point. We use the multiclass hinge loss function
$$F(X; (a, b)) = \max_{l\ne b}\,[1 + \langle a, x_l - x_b\rangle]_+,$$
where $[t]_+ = \max\{t, 0\}$ denotes the positive part. We will use stochastic gradient descent to attempt to minimize
$$f(X) := \mathbb E_P[F(X; (A, B))] = \int F(X; (a, b))\,dP(a, b),$$
where the expectation is taken over pairs $(A, B)$.
(a) Show that $F$ is convex.
(b) Show that $F(X; (a, b)) = 0$ if and only if the classifier represented by $X$ has a large margin, meaning that
$$\langle a, x_b\rangle \ge \langle a, x_l\rangle + 1 \quad\text{for all } l \ne b.$$
(c) For a pair $(a, b)$, give a way to calculate a vector $G \in \partial F(X; (a, b))$ (note that $G \in \mathbb R^{d\times k}$).
Question B.3.2: In this problem, you will perform experiments to explore the performance of stochastic subgradient methods for classification problems, specifically, a handwritten digit recognition problem using zip code data from the United States Postal Service (this data is taken from the book [24], originally due to Yann LeCun). The data (training data zip.train, test data zip.test, and information file zip.inf) are available for download from the zipped tar file http://web.stanford.edu/~jduchi/PCMIConvex/ZIPCodes.tgz. Starter code is available for Julia and Matlab at the following urls.
i. For Julia: http://web.stanford.edu/~jduchi/PCMIConvex/sgd.jl
ii. For Matlab: http://web.stanford.edu/~jduchi/PCMIConvex/matlab.tgz
There are two methods left un-implemented in the starter code: the sgd method and the MulticlassSVMSubgradient method. Implement these methods (you may find the code for unit-testing the multiclass SVM subgradient useful to double check your implementation). For the SGD method, your stepsizes should be proportional to $\alpha_i \propto 1/\sqrt i$, and you should project $X$ to the Frobenius norm ball
$$B_r := \Big\{X \in \mathbb R^{d\times k} : \|X\|_{\mathrm{Fr}} \le r\Big\}, \quad\text{where } \|X\|_{\mathrm{Fr}}^2 = \sum_{ij} X_{ij}^2.$$
We have implemented a pre-processing step that also kernelizes the data representation. Let the function $K(a, a') = \exp(-\frac{1}{2\tau}\|a - a'\|_2^2)$. Then the kernelized data representation transforms each datapoint $a \in \mathbb R^d$ into a vector
$$\phi(a) = \big[K(a, a_{i_1})\ \ K(a, a_{i_2})\ \cdots\ K(a, a_{i_m})\big],$$
where $i_1, \ldots, i_m$ is a random subset of $\{1, \ldots, N\}$ (see GetKernelRepresentation). Once you have implemented the sgd and MulticlassSVMSubgradient methods, use the method RunExperiment (Julia/Matlab). What performance do you get in classification? Which digits is your classifier most likely to confuse?
Question B.3.3: In this problem, we give a simple bound on the rate of convergence for stochastic optimization for minimization of strongly convex functions. Let $C$ denote a compact convex set and $f$ denote a $\lambda$-strongly convex function with respect to the $\ell_2$-norm on $C$, meaning that
$$f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\lambda}{2}\|x - y\|_2^2 \quad\text{for all } g \in \partial f(x),\ x, y \in C.$$
Consider the following stochastic gradient method: at iteration $k$, we
i. receive a noisy subgradient $g_k$ with $\mathbb E[g_k \mid x_k] \in \partial f(x_k)$;
ii. perform the projected subgradient step
$$x_{k+1} = \pi_C(x_k - \alpha_k g_k).$$
Show that if $\mathbb E[\|g_k\|_2^2] \le M^2$ for all $k$, then with the stepsize choice $\alpha_k = \frac{1}{\lambda k}$, we have the convergence guarantee
$$\mathbb E\Big[\sum_{k=1}^K \big(f(x_k) - f(x^\star)\big)\Big] \le \frac{M^2}{2\lambda}(\log K + 1).$$
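The claimed $O(\log K)$ guarantee can be observed in simulation on, say, $f(x) = \frac{\lambda}{2}x^2$ over $C = [-1, 1]$ with bounded noise; the problem instance and constants below are our own illustrative choices:

```python
import random
from math import log

random.seed(6)

# f(x) = (lam/2) x^2 on C = [-1, 1] is lam-strongly convex. The noisy
# subgradient g_k = lam*x_k + u_k with u_k uniform on [-1, 1] satisfies
# E[g_k | x_k] = f'(x_k) and |g_k| <= lam + 1, so M^2 = (lam + 1)^2 works.
lam, K, runs = 1.0, 100, 50
M2 = (lam + 1.0) ** 2

total = 0.0
for _ in range(runs):
    x = 1.0
    for k in range(1, K + 1):
        total += 0.5 * lam * x * x  # f(x_k) - f(x*), with x* = 0
        g = lam * x + random.uniform(-1, 1)
        x = min(max(x - g / (lam * k), -1.0), 1.0)  # projected step, alpha_k = 1/(lam*k)
avg_regret = total / runs

assert avg_regret <= M2 / (2 * lam) * (log(K) + 1)
```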
Exercises for Lecture 4
Question B.4.1: We saw in the lecture that if we use mirror descent,
$$x_{k+1} = \mathrm{argmin}_{x\in C}\Big\{\langle g_k, x\rangle + \frac{1}{\alpha_k} D_h(x, x_k)\Big\},$$
in the stochastic setting with $\mathbb E[g_k \mid x_k] \in \partial f(x_k)$, then we have the regret bound
$$\mathbb E\Big[\sum_{k=1}^K \big(f(x_k) - f(x^\star)\big)\Big] \le \mathbb E\Big[\frac{1}{\alpha_K} R^2 + \frac12\sum_{k=1}^K \alpha_k\|g_k\|_*^2\Big].$$
Here we have assumed that $D_h(x^\star, x_k) \le R^2$ for all $k$. We now use this inequality to prove Corollary 4.3.3. Choose the stepsize $\alpha_k$ adaptively at the $k$th step by optimizing the convergence bound up to the current iterate, that is, set
$$\alpha_k = R\Big(\sum_{i=1}^{k}\|g_i\|_*^2\Big)^{-1/2},$$
based on the previous subgradients. Prove that in this case one has
$$\mathbb E\Big[\sum_{k=1}^K \big(f(x_k) - f(x^\star)\big)\Big] \le 3R\,\mathbb E\Big[\Big(\sum_{k=1}^K \|g_k\|_*^2\Big)^{1/2}\Big].$$
Conclude Corollary 4.3.3.
Hint: An intermediate step, which may be useful, is to prove the following inequality: for any non-negative sequence $a_1, a_2, \ldots, a_k$, one has
$$\sum_{i=1}^k \frac{a_i}{\sqrt{\sum_{j=1}^i a_j}} \le 2\sqrt{\sum_{i=1}^k a_i}.$$
Induction is one natural strategy.
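The hint inequality is also easy to sanity-check numerically for random nonnegative sequences (a sketch, not a substitute for the induction):

```python
import random
from math import sqrt

random.seed(7)

for _ in range(1000):
    a = [random.uniform(0.0, 5.0) for _ in range(30)]
    lhs = sum(a[i] / sqrt(sum(a[: i + 1]))
              for i in range(len(a)) if sum(a[: i + 1]) > 0)
    assert lhs <= 2 * sqrt(sum(a)) + 1e-9
```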
Question B.4.2 (Strong convexity of ℓ_p-norms): Prove the claim of Example 4.2.7.
That is, for some fixed p ∈ (1, 2], if h(x) = (1/(2(p − 1))) ‖x‖²_p, show that h is strongly
convex with respect to the ℓ_p-norm.

Hint: Let Ψ(t) = (1/(2(p − 1))) t^{2/p} and φ(t) = |t|^p, noting that h(x) = Ψ( ∑_{j=1}^n φ(x_j) ).
Then by a Taylor expansion, this question is equivalent to showing that for any
w, x ∈ Rⁿ, we have

x^⊤ ∇²h(w) x ≥ ‖x‖²_p,

where, defining the shorthand vector ∇φ(w) = [φ′(w₁) · · · φ′(w_n)]^⊤, we have

∇²h(w) = Ψ′′( ∑_{j=1}^n φ(w_j) ) ∇φ(w) ∇φ(w)^⊤ + Ψ′( ∑_{j=1}^n φ(w_j) ) diag( φ′′(w₁), . . . , φ′′(w_n) ).

Now apply an argument similar to that used in Example 4.2.5 to show the strong
convexity of h(x) = ∑_j x_j log x_j, but applying Hölder's inequality instead of
Cauchy-Schwarz.
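A numerical spot-check of the claimed Hessian inequality is easy with finite differences (Python/NumPy; an illustration with hypothetical test points, not a proof and not part of the exercise):

```python
import numpy as np

# Spot-check x' H(w) x >= ||x||_p^2 for h(x) = ||x||_p^2 / (2(p-1)), p in (1, 2].
p = 1.5
h = lambda x: np.sum(np.abs(x) ** p) ** (2.0 / p) / (2.0 * (p - 1.0))

def hessian(w, eps=1e-4):
    """Central finite-difference Hessian of h at w."""
    n = w.size
    I = np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (h(w + eps * (I[i] + I[j])) - h(w + eps * (I[i] - I[j]))
                       - h(w - eps * (I[i] - I[j])) + h(w - eps * (I[i] + I[j]))) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(2)
ok = True
for _ in range(20):
    w = rng.standard_normal(4) + 3.0        # stay away from the kink at 0
    x = rng.standard_normal(4)
    lhs = x @ hessian(w) @ x
    rhs = np.sum(np.abs(x) ** p) ** (2.0 / p)   # ||x||_p^2
    ok = ok and (lhs >= rhs - 1e-4)
print(ok)
```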

Question B.4.3 (Variable metric methods and AdaGrad): Consider the following
variable-metric method for minimizing a convex function f on a convex subset
C ⊂ Rⁿ:

x_{k+1} = argmin_{x∈C} { ⟨g_k, x⟩ + (1/2) (x − x_k)^⊤ H_k (x − x_k) },

where E[g_k] ∈ ∂f(x_k). In the lecture, we showed that

E[ ∑_{k=1}^K (f(x_k) − f(x∗)) ]
  ≤ E[ (1/2) ∑_{k=2}^K ( ‖x_k − x∗‖²_{H_k} − ‖x_k − x∗‖²_{H_{k−1}} ) + (1/2) ‖x₁ − x∗‖²_{H₁} + (1/2) ∑_{k=1}^K ‖g_k‖²_{H_k^{−1}} ].

(a) Let

H_k = diag( ∑_{i=1}^k g_i g_i^⊤ )^{1/2}

be the diagonal matrix whose entries are the square roots of the sums of
the squares of the gradient coordinates. (This is the AdaGrad method.)
Show that

‖x_k − x∗‖²_{H_k} − ‖x_k − x∗‖²_{H_{k−1}} ≤ ‖x_k − x∗‖²_∞ tr(H_k − H_{k−1}),

where tr(A) = ∑_{i=1}^n A_{ii} is the trace of the matrix A.
(b) Assume that R_∞ = sup_{x∈C} ‖x − x∗‖_∞ is finite. Show that with any choice
of diagonal matrix H_k, we obtain

E[ ∑_{k=1}^K (f(x_k) − f(x∗)) ] ≤ (1/2) R²_∞ E[tr(H_K)] + (1/2) E[ ∑_{k=1}^K ‖g_k‖²_{H_k^{−1}} ].

(c) Let g_{k,j} denote the jth coordinate of the kth subgradient, and let H_k be chosen
as above. Show that

E[ ∑_{k=1}^K (f(x_k) − f(x∗)) ] ≤ (3/2) R_∞ ∑_{j=1}^n E[ ( ∑_{k=1}^K g²_{k,j} )^{1/2} ].

(d) Suppose that the domain C = {x : ‖x‖_∞ ≤ 1}. What is the expected
regret of AdaGrad? Show that (up to a numerical constant factor we ignore)
this expected regret is always smaller than the expected regret bound for
standard projected gradient descent, which is

E[ ∑_{k=1}^K (f(x_k) − f(x∗)) ] ≤ O(1) sup_{x∈C} ‖x − x∗‖₂ · E[ ( ∑_{k=1}^K ‖g_k‖₂² )^{1/2} ].

Hint: Use Cauchy-Schwarz.
(e) As in the previous sub-question, assume that C = {x : ‖x‖_∞ ≤ 1}. Suppose
that the subgradients are such that g_k ∈ {−1, 0, 1}ⁿ for all k, and that for
each coordinate j we have P(g_{k,j} ≠ 0) = p_j. Show that AdaGrad has the
convergence guarantee

E[ ∑_{k=1}^K (f(x_k) − f(x∗)) ] ≤ (3√K / 2) ∑_{j=1}^n √p_j.

What is the corresponding bound for standard projected gradient descent?
How much better can AdaGrad be?
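For intuition (this is not a solution to the exercise), one can implement the diagonal AdaGrad iteration and compare the two regret-bound scalings from parts (d) and (e) numerically. The Python/NumPy sketch below uses hypothetical sparsity levels p_j:

```python
import numpy as np

# Diagonal AdaGrad on C = {x : ||x||_inf <= 1} with sparse random sign
# gradients, plus the (up to constants) bound comparison from (d)-(e).
rng = np.random.default_rng(3)
n, K = 20, 2000
p = np.full(n, 0.01)
p[0] = 1.0                      # one dense coordinate, the rest 1%-sparse

S = np.zeros(n)                 # running sums of squared gradient coordinates
x = np.zeros(n)
for k in range(K):
    g = rng.choice([-1.0, 1.0], size=n) * (rng.random(n) < p)  # g_k in {-1,0,1}^n
    S += g ** 2
    x = np.clip(x - g / np.sqrt(S + 1e-12), -1.0, 1.0)  # step with H_k = diag(S)^(1/2)

# Up to constants: AdaGrad's bound scales as sqrt(K) * sum_j sqrt(p_j),
# projected gradient descent's as sqrt(n) * sqrt(K * sum_j p_j).
adagrad_rate = np.sqrt(K) * np.sum(np.sqrt(p))
pgd_rate = np.sqrt(n) * np.sqrt(K * np.sum(p))
print(adagrad_rate, "<", pgd_rate)
```

With sparse gradients (small p_j), the AdaGrad scaling is the smaller of the two, as the exercise asks you to show.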

Exercises for Lecture 5


Question B.5.1: In this problem, we prove a lower bound for strongly convex
optimization problems. Suppose that at each iteration of the optimization procedure,
we receive a noisy subgradient g_k satisfying

g_k = ∇f(x_k) + ξ_k,  where the ξ_k are i.i.d. N(0, σ²).

To prove a lower bound for optimization procedures, we use the functions

f_v(x) = (λ/2) (x − vδ)²,  v ∈ {±1}.

Let f_v∗ = 0 denote the minimum function value of f_v on R for v = ±1.
(a) Recall the separation between two functions f₁ and f₋₁ as defined previously (5.1.4),

d_opt(f₋₁, f₁; C) := sup{ δ ≥ 0 : for any x ∈ C, f₁(x) ≤ f₁∗ + δ implies f₋₁(x) ≥ f₋₁∗ + δ, and f₋₁(x) ≤ f₋₁∗ + δ implies f₁(x) ≥ f₁∗ + δ }.

When C = R (or, more generally, as long as C ⊃ [−δ, δ]), show that

d_opt(f₋₁, f₁; C) ≥ (λ/2) δ².

(b) Show that the Kullback-Leibler divergence between two normal distributions
P₁ = N(μ₁, σ²) and P₂ = N(μ₂, σ²) is

D_kl(P₁ ‖ P₂) = (μ₁ − μ₂)² / (2σ²).

(c) Use Le Cam's method to show the following lower bound for stochastic
optimization: for any optimization procedure x̂_K using K noisy gradient
evaluations,

max_{v∈{−1,1}} E_{P_v}[ f_v(x̂_K) − f_v∗ ] ≥ σ² / (32 λ K).

Compare the result with the regret upper bound in problem B.3.3. Hint: If
P_v^K denotes the distribution of the K noisy gradients for function f_v, show
that

D_kl( P₁^K ‖ P₋₁^K ) ≤ 2 K λ² δ² / σ².
Question B.5.2: Let C = {x ∈ Rⁿ : ‖x‖_∞ ≤ 1}, and consider the collection of
functions F where the stochastic gradient oracle g : Rⁿ × S × F → {−1, 0, 1}ⁿ
satisfies

P( g_j(x, S, f) ≠ 0 ) ≤ p_j

for each coordinate j = 1, 2, . . . , n. Show that, for large enough K ∈ N, a minimax
lower bound for this class of functions and the given stochastic oracle is

M_K(C, F) ≥ c (1/√K) ∑_{j=1}^n √p_j,

where c > 0 is a numerical constant. How does this compare to the convergence
guarantee that AdaGrad gives?
John C. Duchi 183


Stanford University, Stanford CA 94305


Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 187–229
https://doi.org/10.1090/pcms/025/00832

Randomized Methods for Matrix Computations

Per-Gunnar Martinsson

Contents
1 Introduction 188
1.1 Scope and objectives 188
1.2 The key ideas of randomized low-rank approximation 189
1.3 Advantages of randomized methods 190
1.4 Relation to other chapters and the broader literature 190
2 Notation 191
2.1 Notation 191
2.2 The singular value decomposition (SVD) 191
2.3 Orthonormalization 192
2.4 The Moore-Penrose pseudoinverse 192
3 A two-stage approach 193
4 A randomized algorithm for “Stage A” — the range finding problem 194
5 Single pass algorithms 195
5.1 Hermitian matrices 196
5.2 General matrices 198
6 A method with complexity O(mn log k) for general dense matrices 199
7 Theoretical performance bounds 200
7.1 Bounds on the expectation of the error 201
7.2 Bounds on the likelihood of large deviations 202
8 An accuracy enhanced randomized scheme 202
8.1 The key idea — power iteration 202
8.2 Theoretical results 204
8.3 Extended sampling matrix 205
9 The Nyström method for positive symmetric definite matrices 205
10 Randomized algorithms for computing Interpolatory Decompositions 206
10.1 Structure preserving factorizations 206
10.2 Three flavors of ID: row, column, and double-sided ID 207
10.3 Deterministic techniques for computing the ID 208
10.4 Randomized techniques for computing the ID 210
11 Randomized algorithms for computing the CUR decomposition 212
11.1 The CUR decomposition 212
©2018 American Mathematical Society


11.2 Converting a double-sided ID to a CUR decomposition 213


12 Adaptive rank determination with updating of the matrix 214
12.1 Problem formulation 214
12.2 A greedy updating algorithm 215
12.3 A blocked updating algorithm 217
12.4 Evaluating the norm of the residual 217
13 Adaptive rank determination without updating the matrix 218
14 Randomized algorithms for computing a rank-revealing QR
decomposition 221
14.1 Column pivoted QR decomposition 221
15 A strongly rank-revealing UTV decomposition 223
15.1 The UTV decomposition 224
15.2 An overview of randUTV 224
15.3 A single step block factorization 225

1. Introduction
1.1. Scope and objectives The objective of this chapter is to describe a set of
randomized methods for efficiently computing a low rank approximation to a
given matrix. In other words, given an m × n matrix A, we seek to compute
factors E and F such that

(1.1.1)  A ≈ E F,

where A is m × n, E is m × k, and F is k × n, and where the rank k of the approximation is a number we assume to be much smaller
than either m or n. In some situations, the rank k is given to us in advance, while
in others, it is part of the problem to determine a rank such that the approxima-
tion satisfies a bound of the type
‖A − EF‖ ≤ ε,
where ε is a given tolerance, and ‖ · ‖ is some specified matrix norm (in this
chapter, we will discuss only the spectral and the Frobenius norms).
An approximation of the form (1.1.1) is useful for storing the matrix A more frugally (we can store E and F using k(m + n) numbers, as opposed to mn numbers
for storing A), for efficiently computing a matrix vector product z = Ax (via y = Fx
and z = Ey), for data interpretation, and much more. Low-rank approximation
problems of this type form a cornerstone of data analysis and scientific comput-
ing, and arise in a broad range of applications, including principal component
analysis (PCA) in computational statistics, spectral methods for clustering high-
dimensional data and finding structure in graphs, image and video compression,
model reduction in physical modeling, and many more.
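To make the storage and matrix-vector counts concrete, here is a small Python/NumPy illustration (sizes hypothetical; the chapter's own codes are in Matlab):

```python
import numpy as np

m, n, k = 2000, 1000, 10
rng = np.random.default_rng(0)
E = rng.standard_normal((m, k))
F = rng.standard_normal((k, n))
A = E @ F                       # a matrix of exact rank k, for illustration

print("storage:", m * n, "numbers for A vs", k * (m + n), "for E and F")

x = rng.standard_normal(n)
y = F @ x                       # ~kn flops
z = E @ y                       # ~mk flops; z = A x computed via the factors
print(np.allclose(z, A @ x))
```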

In performing low-rank approximation, one is typically interested in specific


factorizations where the factors E and F satisfy additional constraints. When A is
a symmetric n × n matrix, one is commonly interested in finding an approximate
rank-k eigenvalue decomposition (EVD), which takes the form

A ≈ U D U∗ ,
(1.1.2)
n×n n×k k×k k×n
where the columns of U form an orthonormal set, and where D is diagonal. For
a general m × n matrix A, we would typically be interested in an approximate
rank-k singular value decomposition (SVD), which takes the form

A ≈ U D V∗ ,
(1.1.3)
m×n m×k k×k k×n
where U and V have orthonormal columns, and D is diagonal. In this chapter, we
will discuss both the EVD and the SVD in depth. We will also describe factoriza-
tions such as the interpolative decomposition (ID) and the CUR decomposition which
are highly useful for data interpretation, and for certain applications in scientific
computing. In these, we seek to determine a subset of the columns (rows) of A
itself that form a good approximate basis for the column (row) space.
While most of the chapter is aimed at computing low rank factorizations where
the target rank k is much smaller than the dimensions of the matrix m and n, we
will in the last couple of sections of the chapter also discuss how randomization
can be used to speed up factorization of full matrices, such as a full column
pivoted QR factorization, or various relaxations of the SVD that are useful for
solving least-squares problems, etc.
1.2. The key ideas of randomized low-rank approximation To quickly intro-
duce the central ideas of the current chapter, let us describe a simple prototypical
randomized algorithm: Let A be a matrix of size m × n that is approximately of
low rank. In other words, we assume that for some integer k < min(m, n), there
exists an approximate low rank factorization of the form (1.1.1). Then a natu-
ral question is how do you in a computationally efficient manner construct the
factors E and F? In [39], it was observed that random matrix theory provides a
simple solution: Draw a Gaussian random matrix G of size n × k, form the sampling
matrix
E = AG,
and then compute the factor F via
F = E† A,
where E† is the Moore-Penrose pseudo-inverse of E, as in Subsection 2.4. (Then
EF = EE† A, where EE† is the orthogonal projection onto the linear space spanned
by the k columns in E.) Then in many important situations, the approximation

(1.2.1)  A ≈ E E† A,

where E is m × k and E† A is k × n, is close to optimal. With this observation as a starting point, we will construct
highly efficient algorithms for computing approximate spectral decompositions
of A, for solving certain least-squares problems, for doing principal component
analysis of large data sets, etc.
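The prototype just described is only a few lines in practice. The following Python/NumPy sketch (an illustration with a hypothetical test matrix, not the chapter's reference implementation) forms E = AG and F = E†A and compares the error with the optimal rank-k error; on typical draws the error is within a modest factor of σ_{k+1}:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 100, 10

# A test matrix with rapidly decaying singular values (hypothetical data).
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 2.0 ** -np.arange(n)
A = (U0 * s) @ V0.T

G = rng.standard_normal((n, k))   # Gaussian random matrix
E = A @ G                         # sampling matrix, m x k
F = np.linalg.pinv(E) @ A         # k x n, so that E F = (E E^dagger) A

err = np.linalg.norm(A - E @ F, 2)
print("error:", err, " optimal rank-k error:", s[k])
```

Section 4 explains why a little oversampling makes this scheme reliable rather than merely typical.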
1.3. Advantages of randomized methods The algorithms that result from using
randomized sampling techniques are computationally efficient, and are simple to
implement as they rely on standard building blocks (matrix-matrix multiplication,
unpivoted QR factorization, etc.) that are readily available for most computing
environments (multicore CPU, GPU, distributed memory machines, etc). As an il-
lustration, we invite the reader to peek ahead at Algorithm 4.0.1, which provides a
complete Matlab code for a randomized algorithm that computes an approximate
singular value decomposition of a matrix. Examples of improvements enabled by
these randomized algorithms include:
• Given an m × n matrix A, the cost of computing a rank-k approximant
by classical methods is O(mnk). Randomized algorithms can attain com-
plexity O(mn log k + k2 (m + n)), cf. [26, Sec. 6.1], and Section 6.
• Algorithms for performing principal component analysis (PCA) of large
data sets have been greatly accelerated, in particular when the data is
stored out-of-core, cf. [25].
• Randomized methods tend to require less communication than traditional
methods, and can be efficiently implemented on severely communication
constrained environments such as GPUs [38] and distributed computing
platforms, cf. [24, Ch. 4] and [15, 18].
• Randomized algorithms have enabled the development of single-pass ma-
trix factorization algorithms in which the matrix is “streamed” and never
stored, cf. [26, Sec. 6.3] and Section 5.
1.4. Relation to other chapters and the broader literature Our focus in this
chapter is to describe randomized methods that attain high practical computa-
tional efficiency. In particular, we use randomization mostly as a tool for min-
imizing communication, rather than minimizing the flop count (although we do
sometimes improve asymptotic flop counts as well). The methods described were
first published in [39] (which was inspired by [17], and later led to [31, 40]; see
also [45]). Our presentation largely follows that in the 2011 survey [26], but with
a focus more on practical usage, rather than theoretical analysis. We have also
included material from more recent work, including [49] on factorizations that al-
low for better data interpretation, [38] on blocking and adaptive error estimation,
and [35, 36] on full factorizations.

The idea of using randomization to improve algorithms for low-rank approxi-


mation of matrices has been extensively investigated within the theoretical com-
puter science community, with early work including [7, 17, 45]. The focus of these
texts has been to develop algorithms with optimal or close to optimal theoreti-
cal performance guarantees in terms of asymptotic flop counts and error bounds.
This is the scope of the lectures of Drineas and Mahoney [11] in this volume
to which we refer the reader for further details. The surveys [32] and [52] also
provide excellent introductions to this literature.

2. Notation

2.1. Notation Throughout the chapter, we measure vectors in Rⁿ using their
Euclidean norm, ‖v‖ = ( ∑_{j=1}^n |v(j)|² )^{1/2}. We measure matrices using the spectral
and the Frobenius norms, defined by

‖A‖ = sup_{‖x‖=1} ‖Ax‖,  and  ‖A‖_Fro = ( ∑_{i,j} |A(i, j)|² )^{1/2},

respectively. We use the notation of Golub and Van Loan [20] to specify submatrices.
In other words, if B is an m × n matrix with entries B(i, j), and if
I = [i₁, i₂, . . . , i_k] and J = [j₁, j₂, . . . , j_ℓ] are two index vectors, then B(I, J) denotes
the k × ℓ matrix whose (s, t) entry is B(i_s, j_t), that is,

B(I, J) = [ B(i₁, j₁) B(i₁, j₂) · · · B(i₁, j_ℓ) ;
            B(i₂, j₁) B(i₂, j₂) · · · B(i₂, j_ℓ) ;
            . . . ;
            B(i_k, j₁) B(i_k, j₂) · · · B(i_k, j_ℓ) ].

We let B(I, :) denote the matrix B(I, [1, 2, . . . , n]), and define B(:, J) analogously.
The transpose of B is denoted B∗ , and we say that a matrix U is orthonormal
(ON) if its columns form an orthonormal set, so that U∗ U = I.
2.2. The singular value decomposition (SVD) The SVD was introduced briefly
in the introduction. Here we define it again, with some more detail added. Let A
denote an m × n matrix, and set r = min(m, n). Then A admits a factorization

(2.2.1)  A = U D V∗,

where U is m × r and V is n × r with orthonormal columns, and D is r × r and
diagonal. We let {u_i}_{i=1}^r and {v_i}_{i=1}^r denote the columns of U and V, respectively.
These vectors are the left and right singular vectors of A. The diagonal elements
{σ_j}_{j=1}^r of D are the singular values of A. We order these so that σ₁ ≥ σ₂ ≥ · · · ≥ σ_r ≥ 0.

We let A_k denote the truncation of the SVD to its first k terms,

A_k = U(:, 1 : k) D(1 : k, 1 : k) ( V(:, 1 : k) )∗ = ∑_{j=1}^k σ_j u_j v_j∗.

It is easily verified that

‖A − A_k‖ = σ_{k+1},  and that  ‖A − A_k‖_Fro = ( ∑_{j=k+1}^{min(m,n)} σ_j² )^{1/2},

where ‖A‖ denotes the operator norm of A and ‖A‖_Fro denotes the Frobenius
norm of A. Moreover, the Eckart-Young theorem [14] states that these errors are
the smallest possible errors that can be incurred when approximating A by a
matrix of rank k.
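These identities are easy to verify numerically; for instance, in Python/NumPy (an illustration with a hypothetical random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD A_k

print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))                           # spectral error = sigma_{k+1}
print(np.isclose(np.linalg.norm(A - Ak, "fro"), np.sqrt(np.sum(s[k:] ** 2))))  # Frobenius error
```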
2.3. Orthonormalization Given an m × ℓ matrix X, with m ≥ ℓ, we introduce
the function
Q = orth(X)
to denote orthonormalization of the columns of X. In other words, the matrix Q
will be an m × ℓ orthonormal matrix whose columns form a basis for the column
space of X.
In practice, this step is typically achieved most efficiently by a call to a packaged
QR factorization; e.g., in Matlab, we would write [Q, ∼] = qr(X, 0). However,
all calls to orth in this manuscript can be implemented without pivoting, which
makes efficient implementation much easier.
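For instance, in Python/NumPy the analogue of the Matlab call above is a reduced (unpivoted) QR factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
Q, _ = np.linalg.qr(X, mode="reduced")    # plays the role of orth(X)

print(Q.shape)                            # (100, 8)
print(np.allclose(Q.T @ Q, np.eye(8)))    # orthonormal columns
print(np.allclose(Q @ (Q.T @ X), X))      # col(Q) = col(X)
```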
2.4. The Moore-Penrose pseudoinverse The Moore-Penrose pseudoinverse is a
generalization of the concept of an inverse for a non-singular square matrix. To
define it, let A be a given m × n matrix. Let k denote its actual rank, so that its
singular value decomposition (SVD) takes the form

A = ∑_{j=1}^k σ_j u_j v_j∗ = U_k D_k V_k∗,

where σ₁ ≥ σ₂ ≥ · · · ≥ σ_k > 0. Then the pseudoinverse of A is the n × m matrix
defined via

A† = ∑_{j=1}^k (1/σ_j) v_j u_j∗ = V_k D_k^{−1} U_k∗.

For any matrix A, the matrices

A† A = V_k V_k∗,  and  A A† = U_k U_k∗,

are the orthogonal projections onto the row and column spaces of A, respectively.
If A is square and non-singular, then A† = A^{−1}.
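A quick numerical check of these properties (Python/NumPy, with a hypothetical rank-deficient matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5)) @ rng.standard_normal((5, 12))  # rank 5, 8 x 12
Adag = np.linalg.pinv(A)

print(Adag.shape)                        # (12, 8), i.e. n x m
print(np.allclose(A @ Adag @ A, A))      # A A^dagger projects onto col(A)
print(np.allclose(Adag @ A @ Adag, Adag))
```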

3. A two-stage approach
The problem of computing an approximate low-rank factorization to a given
matrix can conveniently be split into two distinct “stages.” For concreteness, we
describe the split for the specific task of computing an approximate singular value
decomposition. To be precise, given an m × n matrix A and a target rank k, we
seek to compute factors U, D, and V such that

A ≈ U D V∗,

where U is m × k, D is k × k, and V∗ is k × n. The factors U and V should be orthonormal, and D should be diagonal. (For
now, we assume that the rank k is known in advance; techniques for relaxing this
assumption are described in Section 12.) Following [26], we split this task into
two computational stages:
Stage A — find an approximate range: Construct an m × k matrix Q with or-
thonormal columns such that A ≈ QQ∗ A. (In other words, the columns of
Q form an approximate basis for the column space of A.) This step will
be executed via a randomized process described in Section 4.
Stage B — form a specific factorization: Given the matrix Q computed in Stage
A, form the factors U, D, and V using classical deterministic techniques.
For instance, this stage can be executed via the following steps:
(1) Form the k × n matrix B = Q∗ A.
(2) Compute the SVD of the (small) matrix B so that B = ÛDV∗ .
(3) Form U = QÛ.
The point here is that in a situation where k  min(m, n), the difficult part
of the computation is all in Stage A. Once that is finished, the post-processing in
Stage B is easy, as all matrices involved have at most k rows or columns.
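The three steps of Stage B are cheap to express in code. Here is a Python/NumPy sketch (illustrative only; the basis Q below is an arbitrary stand-in for the output of Stage A):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60))
Q, _ = np.linalg.qr(A[:, :15])     # stand-in for a Stage-A basis (hypothetical)

B = Q.T @ A                                            # (1) small k x n matrix
Uhat, D, Vt = np.linalg.svd(B, full_matrices=False)    # (2) SVD of B
U = Q @ Uhat                                           # (3) map back to R^m

# As in Remark 3.0.1, the factorization error equals the Stage-A error.
err_stage_a = np.linalg.norm(A - Q @ (Q.T @ A), 2)
err_svd = np.linalg.norm(A - (U * D) @ Vt, 2)
print(err_stage_a, err_svd)
```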

Remark 3.0.1. Stage B is exact up to floating point arithmetic, so all errors in the
factorization process are incurred at Stage A. To be precise, we have

Q Q∗A = Q B = Q (Û D V∗) = (Q Û) D V∗ = U D V∗,

using B = Q∗A with its SVD B = ÛDV∗, and U = QÛ. In other words, if the factor Q satisfies ‖A − QQ∗A‖ ≤ ε, then automatically
(3.0.2)  ‖A − UDV∗‖ = ‖A − QQ∗A‖ ≤ ε
unless ε is close to the machine precision.

Remark 3.0.3. A bound of the form (3.0.2) implies that the diagonal elements
{D(i, i)}_{i=1}^k of D are accurate approximations to the singular values of A in the
sense that |σ_i − D(i, i)| ≤ ε for i = 1, 2, . . . , k. However, a bound like (3.0.2) does
not provide assurances on the relative errors in the singular values; nor does it, in
the general case, provide strong assurances that the columns of U and V are good
approximations to the singular vectors of A.

4. A randomized algorithm for “Stage A” — the range finding problem


This section describes a randomized technique for solving the range finding
problem introduced as “Stage A” in Section 3. As a preparation for this discus-
sion, let us recall that an “ideal” basis matrix Q for the range of a given matrix A
is the matrix Uk formed by the k leading left singular vectors of A. Letting σj (A)
denote the jth singular value of A, the Eckart-Young theorem [47] states that
inf{ ‖A − C‖ : C has rank k } = ‖A − U_k U_k∗ A‖ = σ_{k+1}(A).
Now consider a simplistic randomized method for constructing a spanning set
with k vectors for the range of a matrix A: Draw k random vectors {g_j}_{j=1}^k from
a Gaussian distribution, map these to vectors y_j = A g_j in the range of A, and
then use the resulting set {y_j}_{j=1}^k as a basis. Upon orthonormalization via, e.g.,
Gram-Schmidt, an orthonormal basis {q_j}_{j=1}^k would be obtained. For the special
case where the matrix A has exact rank k, one can prove that the vectors {A g_j}_{j=1}^k
would with probability 1 be linearly independent, and the resulting orthonormal
(ON) basis {q_j}_{j=1}^k would therefore exactly span the range of A. This would in
a sense be an ideal algorithm. The problem is that in practice, there are almost
always many non-zero singular values beyond the first k ones. The left singular
vectors associated with these modes all "pollute" the sample vectors y_j = A g_j and
will therefore shift the space spanned by {y_j}_{j=1}^k so that it is no longer aligned with
the ideal space spanned by the k leading singular vectors of A. In consequence,
the process described can (and frequently does) produce a poor basis. Luckily,
there is a fix: Simply take a few extra samples. It turns out that if we take,
say, k + 10 samples instead of k, then the process will with probability almost 1
produce a basis that is comparable to the best possible basis.
To summarize this discussion, the randomized sampling algorithm for con-
structing an approximate basis for the range of a given m × n matrix A proceeds
as follows: First pick a small integer p representing how much “over-sampling”
we do. (The choice p = 10 is often good.) Then execute the following steps:
(1) Form a set of k + p random Gaussian vectors {g_j}_{j=1}^{k+p}.
(2) Form a set {y_j}_{j=1}^{k+p} of samples from the range, where y_j = A g_j.
(3) Perform Gram-Schmidt on the set {y_j}_{j=1}^{k+p} to form the ON-set {q_j}_{j=1}^{k+p}.
Now observe that the k + p matrix-vector products are independent and can ad-
vantageously be executed in parallel. A full algorithm for computing an approx-
imate SVD using this simplistic sampling technique for executing “Stage A” is
summarized in Algorithm 4.0.1.
The error incurred by the randomized range finding method described in this
section is a random variable. There exist rigorous bounds for both the expectation
of this error, and for the likelihood of a large deviation from this expectation.
These bounds demonstrate that when the singular values of A decay “reasonably
fast,” the error incurred is close to the theoretically optimal one. We provide more
details in Section 7.
Per-Gunnar Martinsson 195

Algorithm: RSVD — basic randomized SVD

Inputs: An m × n matrix A, a target rank k, and an over-sampling parameter p (say p = 10).
Outputs: Matrices U, D, and V in an approximate rank-(k + p) SVD of A (so that U and V are orthonormal, D is diagonal, and A ≈ UDV∗).
Stage A:
(1) Form an n × (k + p) Gaussian random matrix G.
G = randn(n,k+p)
(2) Form the sample matrix Y = A G.
Y = A*G
(3) Orthonormalize the columns of the sample matrix Q = orth(Y).
[Q,∼] = qr(Y,0)
Stage B:
(4) Form the (k + p) × n matrix B = Q∗ A.
B = Q’*A
(5) Form the SVD of the small matrix B: B = ÛDV∗ .
[Uhat,D,V] = svd(B,’econ’)
(6) Form U = QÛ.
U = Q*Uhat

Algorithm 4.0.1. A basic randomized algorithm. If a factoriza-


tion of precisely rank k is desired, the factorization in Step 5 can
be truncated to the k leading terms. The sans-serif text below
each line is Matlab code for executing it.
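The Matlab lines in the algorithm box translate almost directly to NumPy. The following sketch (the function name `rsvd`, the seeding, and the test matrix are our own, not from the text) implements Steps (1)–(6) and truncates the result to rank k:

```python
import numpy as np

def rsvd(A, k, p=10, rng=None):
    """Basic randomized SVD (a NumPy rendering of Algorithm 4.0.1),
    truncated to the k leading terms."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = A.shape[1]
    G = rng.standard_normal((n, k + p))         # Stage A: Gaussian test matrix
    Q, _ = np.linalg.qr(A @ G)                  # orthonormalize the samples
    B = Q.conj().T @ A                          # Stage B: B = Q* A
    Uhat, s, Vh = np.linalg.svd(B, full_matrices=False)
    return (Q @ Uhat)[:, :k], s[:k], Vh[:k]     # keep the k leading terms

# Sanity check on a matrix with exponentially decaying singular values.
rng = np.random.default_rng(0)
m, n, k = 100, 80, 10
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U0 * 2.0 ** -np.arange(n)) @ V0.T          # sigma_j = 2^(-j+1)
U, s, Vh = rsvd(A, k, rng=np.random.default_rng(1))
err = np.linalg.norm(A - (U * s) @ Vh, 2)
```

On a matrix whose singular values decay this fast, the computed spectral-norm error lands close to σ_{k+1} = 2^{−10}, in line with the theory summarized in Section 7.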

Remark 4.0.2 (How many basis vectors?). The reader may have observed that
while our stated goal was to find a matrix Q that holds k orthonormal columns,
the randomized process discussed in this section and summarized in Algorithm
4.0.1 results in a matrix with k + p columns instead. The p extra vectors are
needed to ensure that the basis produced in “Stage A” accurately captures the k
dominant left singular vectors of A. In a situation where an approximate SVD
with precisely k modes is sought, one can drop the last p components when exe-
cuting Stage B. Using Matlab notation, we would after Step (5) run the commands
Uhat = Uhat(:,1:k); D = D(1:k,1:k); V = V(:,1:k);.
From a practical point of view, the cost of carrying around a few extra samples
in the intermediate steps is often entirely negligible.

5. Single pass algorithms


The randomized algorithm described in Algorithm 4.0.1 accesses the matrix
A twice, first in “Stage A” where we build an orthonormal basis for the column

space, and then in “Stage B” where we project A on to the space spanned by the
computed basis vectors. It turns out to be possible to modify the algorithm in
such a way that each entry of A is accessed only once. This is important because
it allows us to compute the factorization of a matrix that is too large to be stored.
For Hermitian matrices, the modification to Algorithm 4.0.1 is very minor and
we describe it in Subsection 5.1. Subsection 5.2 then handles the case of a general
matrix.

Remark (Loss of accuracy). The single-pass algorithms described in this section


tend to produce a factorization of lower accuracy than what Algorithm 4.0.1
would yield. In situations where one has a choice between using either a one-
pass or a two-pass algorithm, the latter is generally preferable since it yields
higher accuracy, at only moderately higher cost.

Remark (Streaming Algorithms). We say that an algorithm for processing a ma-


trix is a streaming algorithm if each entry of the matrix is accessed only once, and
if, in addition, entries can be fed in any order. (In other words, the algorithm is
not allowed to dictate the order in which elements are viewed.) The algorithms
described in this section satisfy both of these conditions.

5.1. Hermitian matrices Suppose that A = A∗, and that our objective is to compute an approximate eigenvalue decomposition
(5.1.1) A ≈ U D U∗ (A is n × n, U is n × k, D is k × k),
with U an orthonormal matrix and D diagonal. (Note that for a Hermitian matrix,
the EVD and the SVD are essentially equivalent, and that the EVD is the more
natural factorization.) Then execute Stage A with an over-sampling parameter p
to compute an orthonormal matrix Q whose columns form an approximate basis
for the column space of A:
(1) Draw a Gaussian random matrix G of size n × (k + p).
(2) Form the sampling matrix Y = AG.
(3) Orthonormalize the columns of Y to form Q, in other words Q = orth(Y).
Then
(5.1.2) A ≈ QQ∗ A.
Since A is Hermitian, its row and column spaces are identical, so we also have
(5.1.3) A ≈ AQQ∗ .
Inserting (5.1.2) into (5.1.3), we (informally!) find that
(5.1.4) A ≈ QQ∗ AQQ∗ .
We define
(5.1.5) C = Q∗ AQ.

If C is known, then the post-processing is straight-forward: Simply compute the


EVD of C to obtain C = ÛDÛ∗ , then define U = QÛ, to find that
A ≈ QCQ∗ = QÛDÛ∗ Q∗ = UDU∗ .
The problem now is that since we are seeking a single-pass algorithm, we are
not in a position to evaluate C directly from formula (5.1.5). Instead, we will derive
a formula for C that can be evaluated without revisiting A. To this end, multiply
(5.1.5) by Q∗ G to obtain
(5.1.6) C(Q∗ G) = Q∗ AQQ∗ G.
We use that AQQ∗ ≈ A (cf. (5.1.3)) to approximate the right hand side in (5.1.6):
(5.1.7) Q∗ AQQ∗ G ≈ Q∗ AG = Q∗ Y.
Combining (5.1.6) and (5.1.7), and ignoring the approximation error, we define C as the solution of the linear system (recall that ℓ = k + p)
(5.1.8) C (Q∗G) = Q∗Y,
where each of the matrices C, Q∗G, and Q∗Y is of size ℓ × ℓ.
At first, it may appear that (5.1.8) is perfectly balanced in that there are ℓ² equations for ℓ² unknowns. However, we need to enforce that C is Hermitian, so the system is actually over-determined by roughly a factor of two. Putting everything together, we obtain the method summarized in Algorithm 5.1.9.

Algorithm: Single-pass Randomized EVD for a Hermitian Matrix


Inputs: An n × n Hermitian matrix A, a target rank k, and an over-sampling
parameter p (say p = 10).
Outputs: Matrices U and D in an approximate rank-k EVD of A (so that U is
an orthonormal n × k matrix, D is a diagonal k × k matrix, and A ≈ UDU∗ ).
Stage A:
(1) Form an n × (k + p) Gaussian random matrix G.
(2) Form the sample matrix Y = A G.
(3) Let Q denote the orthonormal matrix formed by the k dominant
left singular vectors of Y.
Stage B:

(4) Let C denote the k × k least-squares solution of C (Q∗G) = Q∗Y, obtained by enforcing that C should be Hermitian.
(5) Compute the eigenvalue decomposition of C so that C = ÛDÛ∗.
(6) Form U = QÛ.

Algorithm 5.1.9. A basic single-pass randomized algorithm suitable for a Hermitian matrix.
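As an illustration, here is a NumPy sketch of the single-pass idea for a real symmetric A (our own simplification, with names of our choosing: rather than solving the least-squares problem subject to the Hermitian constraint, it solves the unconstrained system C (Q∗G) = Q∗Y and symmetrizes the result afterwards). A is touched only through the pre-computed sample Y = AG:

```python
import numpy as np

def single_pass_evd(Y, G, k):
    """Single-pass randomized EVD sketch in the spirit of Algorithm 5.1.9.
    Y = A @ G is the only quantity involving A; G is the test matrix."""
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    Q = Uy[:, :k]                       # k dominant left singular vectors of Y
    M = Q.T @ G                         # k x (k+p)
    B = Q.T @ Y                         # k x (k+p)
    # Solve C M = B in the least-squares sense via M^T C^T = B^T ...
    X, *_ = np.linalg.lstsq(M.T, B.T, rcond=None)
    C = 0.5 * (X.T + X)                 # ... then enforce symmetry afterwards
    lam, Uhat = np.linalg.eigh(C)
    return Q @ Uhat, lam

rng = np.random.default_rng(1)
n, k, p = 80, 8, 16                     # aggressive over-sampling (Remark 5.1.10)
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (V0 * 2.0 ** -np.arange(n)) @ V0.T  # symmetric, fast spectral decay
G = rng.standard_normal((n, k + p))
U, lam = single_pass_evd(A @ G, G, k)   # single access to A
err = np.linalg.norm(A - (U * lam) @ U.T, 2)
```

With the aggressive over-sampling recommended in Remark 5.1.10 and a rapidly decaying spectrum, the error stays within a modest factor of σ_{k+1}.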

The procedure described in this section is less accurate than the procedure
described in Algorithm 4.0.1 for two reasons: (1) The approximation error in
formula (5.1.4) tends to be larger than the error in (5.1.2). (2) While the matrix
Q∗ G is invertible, it tends to be very ill-conditioned.

Remark 5.1.10 (Extra over-sampling). To combat the problem that Q∗ G tends to


be ill-conditioned, it is helpful to over-sample more aggressively when using a
single pass algorithm, even to the point of setting p = k if memory allows. Once
the sampling stage is completed, we form Q as the leading k left singular vectors
of Y (compute these by forming the full SVD of Y, and then discard the last p
components). Then C will be of size k × k, and the equation that specifies C reads
(5.1.11) C (Q∗G) = Q∗Y,
where C is k × k and both Q∗G and Q∗Y are of size k × ℓ. Since (5.1.11) is over-determined, we solve it using a least-squares technique. Observe that we are now looking for less information (a k × k matrix rather than an ℓ × ℓ matrix), and have more information in order to determine it.

5.2. General matrices We next consider a general m × n matrix A. In this case,


we need to apply randomized sampling to both its row space and its column
space simultaneously. We proceed as follows:
(1) Draw two Gaussian random matrices Gc of size n × (k + p) and Gr of size
m × (k + p).
(2) Form two sampling matrices Yc = AGc and Yr = A∗ Gr .
(3) Compute two basis matrices Qc = orth(Yc ) and Qr = orth(Yr ).
Now define the small projected matrix via
(5.2.1) C = Q∗c AQr .
We will derive two relationships that together will determine C in a manner that
is analogous to (5.1.6). First left multiply (5.2.1) by G∗r Qc to obtain
(5.2.2) G∗r Qc C = G∗r Qc Q∗c AQr ≈ G∗r AQr = Yr∗ Qr .
Next we right multiply (5.2.1) by Q∗r Gc to obtain
(5.2.3) CQ∗r Gc = Q∗c AQr Q∗r Gc ≈ Q∗c AGc = Q∗c Yc .
We now define C as the least-squares solution of the two equations
G∗r Qc C = Yr∗ Qr  and  C Q∗r Gc = Q∗c Yc.
Again, the system is over-determined by about a factor of 2, and it is advan-
tageous to make it further over-determined by more aggressive over-sampling,
cf. Remark 5.1.10. Algorithm 5.2.4 summarizes the single-pass method for a gen-
eral matrix.

Algorithm: Single-pass Randomized SVD for a General Matrix


Inputs: An m × n matrix A, a target rank k, and an over-sampling parameter
p (say p = 10).
Outputs: Matrices U, V, and D in an approximate rank-k SVD of A (so
that U and V are orthonormal with k columns each, D is diagonal, and
A ≈ UDV∗ .)
Stage A:
(1) Form two Gaussian random matrices Gc and Gr of sizes n × (k + p)
and m × (k + p), respectively.
(2) Form the sample matrices Yc = A Gc and Yr = A∗ Gr .
(3) Form orthonormal matrices Qc and Qr consisting of the k dominant
left singular vectors of Yc and Yr .
Stage B:
(4) Let C denote the k × k least-squares solution of the joint system of equations formed by G∗r Qc C = Yr∗ Qr and C Q∗r Gc = Q∗c Yc.
(5) Compute the SVD of C so that C = ÛDV̂∗ .
(6) Form U = Qc Û and V = Qr V̂.

Algorithm 5.2.4. A basic single-pass randomized algorithm suitable for a general matrix.
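One concrete way to solve the joint system in Step (4) — a sketch of our own, not prescribed by the text — is to vectorize C and stack the two equations into a single least-squares problem, using the identity vec(AXB) = (Bᵀ ⊗ A) vec(X) for column-major vectorization:

```python
import numpy as np

def solve_joint_ls(M1, B1, M2, B2):
    """Least-squares solution C (k x k) of the stacked system
         M1 @ C = B1   (M1, B1 of size l x k)
         C @ M2 = B2   (M2, B2 of size k x l).
    Column-major vectorization gives
         vec(M1 C) = kron(I_k, M1)   vec(C),
         vec(C M2) = kron(M2.T, I_k) vec(C)."""
    k = M1.shape[1]
    A_big = np.vstack([np.kron(np.eye(k), M1),
                       np.kron(M2.T, np.eye(k))])
    b = np.concatenate([B1.flatten(order='F'), B2.flatten(order='F')])
    c, *_ = np.linalg.lstsq(A_big, b, rcond=None)
    return c.reshape(k, k, order='F')

# Consistency check: build the data from a known C0 and recover it.
rng = np.random.default_rng(2)
k, l = 4, 8
M1 = rng.standard_normal((l, k))
M2 = rng.standard_normal((k, l))
C0 = rng.standard_normal((k, k))
C = solve_joint_ls(M1, M1 @ C0, M2, C0 @ M2)
```

When the two equations are exactly consistent, as in this check, the stacked least-squares problem recovers C0; in the algorithm they are only approximately consistent, and the over-determination is what makes the solve robust.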

6. A method with complexity O(mn log k) for general dense matrices


The Randomized SVD (RSVD) algorithm given in Algorithm 4.0.1 is highly ef-
ficient when we have access to fast methods for evaluating matrix-vector products
x → Ax. For the case where A is a general m × n matrix given simply as an array of real numbers, the cost of evaluating the sample matrix Y = AG (in Step (2)
of Algorithm 4.0.1) is O(mnk). RSVD is still often faster than classical methods
since the matrix-matrix multiply can be highly optimized, but it does not have an
edge in terms of asymptotic complexity. However, it turns out to be possible to
modify the algorithm by replacing the Gaussian random matrix G with a different
random matrix Ω that has two seemingly contradictory properties:
(1) Ω is sufficiently structured that AΩ can be evaluated in O(mn log(k)) flops;
(2) Ω is sufficiently random that the columns of AΩ accurately span the range of A.
For instance, a good choice of random matrix Ω is
(6.0.1) Ω = D F S,
where Ω is of size n × ℓ, D is an n × n diagonal matrix whose diagonal entries are complex numbers of modulus one drawn from a uniform distribution on the unit circle in the complex plane, where F is the n × n discrete Fourier transform,
F(p, q) = n^{−1/2} e^{−2πi(p−1)(q−1)/n},  p, q ∈ {1, 2, 3, . . . , n},

and where S is a matrix consisting of a random subset of ℓ columns from the n × n unit matrix (drawn without replacement). In other words, given an arbitrary matrix X of size m × n, the matrix XS consists of a randomly drawn subset of ℓ columns of X. For the matrix Ω specified by (6.0.1), the product XΩ can be evaluated via a subsampled FFT in O(mn log(ℓ)) operations. The parameter ℓ should be chosen slightly larger than the target rank k; the choice ℓ = 2k is often good. (A transform of this type was introduced in [1] under the name “Fast Johnson-Lindenstrauss Transform” and was applied to the problem of low-rank approximation in [45, 53]. See also [2, 29, 30].)
By using the structured random matrix described in this section, we can reduce
the complexity of “Stage A” in the RSVD from O(mnk) to O(mn log(k)). In order
to attain overall cost O(mn log(k)), we must also modify “Stage B” to eliminate
the need to compute Q∗ A (since direct evaluation of Q∗ A has cost O(mnk)). One
option is to use the single-pass algorithm described in Algorithm 5.2.4, using the structured
random matrix to approximate both the row and the column spaces of A. A sec-
ond, and typically better, option is to use a so called row-extraction technique for
Stage B; we describe the details in Section 10.
Our theoretical understanding of the errors incurred by the accelerated range
finder is not as satisfactory as what we have for Gaussian random matrices, cf. [26,
Sec. 11]. In the general case, only quite weak results have been proven. In practice,
the accelerated scheme is often as accurate as the Gaussian one, but we do not
currently have good theory to predict precisely when this happens.
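The product AΩ can be formed without ever building Ω explicitly: scale the columns of A by the diagonal of D, apply an FFT to each row (the DFT matrix F is symmetric), and keep ℓ sampled columns. The sketch below (function name ours) uses a full FFT, so it costs O(mn log n) rather than the O(mn log ℓ) attainable with a pruned subsampled FFT:

```python
import numpy as np

def srft_sample(A, l, rng):
    """Form Y = A @ Omega for the SRFT Omega = D F S of (6.0.1),
    without ever materializing the n x l matrix Omega."""
    m, n = A.shape
    d = np.exp(2j * np.pi * rng.random(n))        # unimodular diagonal of D
    cols = rng.choice(n, size=l, replace=False)   # columns retained by S
    ADF = np.fft.fft(A * d, axis=1) / np.sqrt(n)  # (A D) F, row-wise FFT
    return ADF[:, cols]                           # keep l sampled columns

# Range-finding check with the recommended choice l = 2k.
rng = np.random.default_rng(7)
m, n, k = 80, 64, 8
l = 2 * k
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U0 * 2.0 ** -np.arange(n)) @ V0.T            # fast singular value decay
Y = srft_sample(A, l, rng)
Q, _ = np.linalg.qr(Y)
err = np.linalg.norm(A - Q @ (Q.conj().T @ A), 2)
```

In keeping with the remark above, on this fast-decay example the structured sampling is essentially as accurate as Gaussian sampling, even though the available theory is weaker.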

7. Theoretical performance bounds


In this section, we will briefly summarize some proven results concerning the
error in the output of the basic RSVD algorithm in Algorithm 4.0.1. Observe that
the factors U, D, V depend not only on A, but also on the draw of the random
matrix G. This means that the error that we try to bound is a random variable. It is
therefore natural to seek bounds on first the expected value of the error, and then
on the likelihood of large deviations from the expectation.
Before we start, let us recall from Remark 3.0.1 that all the error incurred by
the RSVD algorithm in Algorithm 4.0.1 is incurred in Stage A. The reason is that
the “post-processing” in Stage B is exact (up to floating point arithmetic). Consequently, we can (and will) restrict ourselves to giving bounds on ‖A − QQ∗A‖.

Remark 7.0.1. The theoretical investigation of errors resulting from randomized


methods in linear algebra is an active area of research that draws heavily on
random matrix theory, theoretical computer science, classical numerical linear al-
gebra, and many other fields. Our objective here is merely to state a couple of
representative results, without providing any proofs or details about their deriva-
tion. Both results are taken from [26], where the interested reader can find an
in-depth treatment of the subject. More recent results pertaining to the RSVD

can be found in, e.g., [22, 51], while a detailed discussion of a related method for low-rank approximation can be found in [11].

7.1. Bounds on the expectation of the error A basic result on the typical error
observed is Theorem 10.6 of [26], which states:
Theorem 7.1.1. Let A be an m × n matrix with singular values {σ_j}_{j=1}^{min(m,n)}. Let k be a target rank, and let p be an over-sampling parameter such that p ≥ 2 and such that k + p ≤ min(m, n). Let G be a Gaussian random matrix of size n × (k + p) and set Q = orth(AG). Then the average error, as measured in the Frobenius norm, satisfies

(7.1.2) E ‖A − QQ∗A‖_Fro ≤ (1 + k/(p − 1))^{1/2} ( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2},

where E refers to expectation with respect to the draw of G. The corresponding result for the spectral norm reads

(7.1.3) E ‖A − QQ∗A‖ ≤ (1 + √(k/(p − 1))) σ_{k+1} + (e √(k + p) / p) ( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2}.

When errors are measured in the Frobenius norm, Theorem 7.1.1 is very gratifying. For our standard recommendation of p = 10, we are basically within a factor of (1 + k/9)^{1/2} of the theoretically minimal error. (Recall that the Eckart-Young theorem states that ( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2} is a lower bound on the residual for any rank-k approximant.) If you over-sample more aggressively and set p = k + 1, then we are within a distance of √2 of the theoretically minimal error.
When errors are measured in the spectral norm, the situation is much less rosy.
The first term in the bound in (7.1.3) is perfectly acceptable, but the second term
is unfortunate in that it involves the minimal error in the Frobenius norm, which
can be much larger, especially when m or n are large. The theorem is quite sharp,
as it turns out, so the sub-optimality expressed in (7.1.3) reflects a true limitation
on the accuracy to be expected from the basic randomized scheme.
The extent to which the error in (7.1.3) is problematic depends on how rapidly the “tail” singular values {σ_j}_{j=k+1}^{min(m,n)} decay. If they decay fast, then the spectral norm error and the Frobenius norm error are similar, and the RSVD works well.
If they decay slowly, then the RSVD performs fine when errors are measured
in the Frobenius norm, but not very well when the spectral norm is the one of
interest. To illustrate the difference, let us consider two situations:
Case 1 — fast decay: Suppose that the tail singular values decay exponentially fast, so that for some β ∈ (0, 1), we have σ_j ≈ σ_{k+1} β^{j−k−1} for j > k. Then we get an estimate for the tail singular values of
( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2} ≈ σ_{k+1} ( Σ_{j=k+1}^{min(m,n)} β^{2(j−k−1)} )^{1/2} ≤ σ_{k+1} (1 − β²)^{−1/2}.

As long as β is not very close to 1, we see that the contribution from the tail
singular values is modest in this case.
Case 2 — no decay: Suppose that the tail singular values exhibit no decay, so that σ_j = σ_{k+1} for j > k. Now
( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2} = σ_{k+1} ( min(m, n) − k )^{1/2}.

This represents the worst case scenario and, since we want to allow for n and
m to be very large, represents devastating suboptimality.
Fortunately, it is possible to modify the RSVD in such a way that the errors
produced are close to optimal in both the spectral and the Frobenius norms. The
price to pay is a modest increase in the computational cost. See Section 8 and
[26, Sec. 4.5].
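The contrast between the two cases is easy to observe numerically. The following seeded experiment (our own illustration, with names of our choosing) runs the basic range finder on a fast-decay spectrum (β = 1/2) and on a flat-tail spectrum, and compares the spectral-norm error to σ_{k+1}:

```python
import numpy as np

rng = np.random.default_rng(3)
m = n = 100
k, p = 10, 10

def spectral_err(sigma):
    """Spectral-norm error of the basic range finder on a matrix
    with prescribed singular values sigma."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = (U * sigma) @ V.T
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k + p)))
    return np.linalg.norm(A - Q @ (Q.T @ A), 2)

sig_fast = 0.5 ** np.arange(n)                                    # Case 1
sig_flat = np.r_[0.5 ** np.arange(k), np.full(n - k, 0.5 ** k)]   # Case 2
err_fast = spectral_err(sig_fast)
err_flat = spectral_err(sig_flat)
ratio_fast = err_fast / sig_fast[k]     # error relative to sigma_{k+1}
ratio_flat = err_flat / sig_flat[k]
```

With the fast-decaying spectrum the error is only a small multiple of σ_{k+1}; with the flat tail the relative error is markedly worse, as (7.1.3) predicts.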
7.2. Bounds on the likelihood of large deviations One can prove that (perhaps surprisingly) the likelihood of a large deviation from the mean depends only on the over-sampling parameter p, and decays extraordinarily fast. For instance, one can prove that if p ≥ 4, then
(7.2.1) ‖A − QQ∗A‖ ≤ (1 + 17 √(1 + k/p)) σ_{k+1} + (8 √(k + p) / (p + 1)) ( Σ_{j>k} σ_j² )^{1/2},
with failure probability at most 3 e^{−p}; see [26, Cor. 10.9].

8. An accuracy enhanced randomized scheme


8.1. The key idea — power iteration We saw in Section 7 that the basic ran-
domized scheme (see, e.g., Algorithm 4.0.1) gives accurate results for matrices
whose singular values decay rapidly, but tends to produce suboptimal results
when they do not. To recap, suppose that we compute a rank-k approximation to an m × n matrix A with singular values {σ_j}_{j=1}^{min(m,n)}. The theory shows that the error measured in the spectral norm is bounded only by a factor that scales with ( Σ_{j>k} σ_j² )^{1/2}. When the singular values decay slowly, this quantity can be much larger than the theoretically minimal approximation error (which is σ_{k+1}).
Recall that the objective of the randomized sampling is to construct a set of orthonormal vectors {q_j}_{j=1}^{ℓ} that capture to high accuracy the space spanned by the k dominant left singular vectors {u_j}_{j=1}^{k} of A. The idea is now to sample not A, but the matrix A^{(q)} defined by
A^{(q)} = (AA∗)^q A,
where q is a small positive integer (say, q = 1 or q = 2). A simple calculation shows that if A has the SVD A = UDV∗, then the SVD of A^{(q)} is
A^{(q)} = U D^{2q+1} V∗.

In other words, A^{(q)} has the same left singular vectors as A, while its singular values are {σ_j^{2q+1}}_j. Even when the singular values of A decay slowly, the singular values of A^{(q)} tend to decay fast enough for our purposes.
The accuracy enhanced scheme now consists of drawing a Gaussian matrix G and then forming a sample matrix
Y = (AA∗)^q AG.
Then orthonormalize the columns of Y to obtain Q = orth(Y), and proceed as before. The resulting scheme is shown in Algorithm 8.1.1.

Algorithm: Accuracy Enhanced Randomized SVD


Inputs: An m × n matrix A, a target rank k, an over-sampling parameter
p (say p = 10), and a small integer q denoting the number of steps in the
power iteration.
Outputs: Matrices U, D, and V in an approximate rank-(k + p) SVD of A.
(I.e. U and V are orthonormal and D is diagonal.)
(1) G = randn(n, k + p);
(2) Y = AG;
(3) for j = 1 : q
(4) Z = A∗ Y;
(5) Y = AZ;
(6) end for
(7) Q = orth(Y);
(8) B = Q∗ A;
(9) [Û, D, V] = svd(B, ’econ’);
(10) U = QÛ;

Algorithm 8.1.1. The accuracy enhanced randomized SVD. If a


factorization of precisely rank k is desired, the factorization in
Step 9 can be truncated to the k leading terms.

Remark 8.1.2. The scheme described in Algorithm 8.1.1 can lose accuracy due to round-off errors. The problem is that as q increases, all columns in the sample matrix Y = (AA∗)^q AG tend to align closer and closer to the dominant left singular vector. This means that essentially all information about the singular values and singular vectors associated with smaller singular values gets lost to round-off errors. Roughly speaking, if
σ_j / σ_1 ≤ ε_mach^{1/(2q+1)},
where ε_mach is the machine precision, then all information associated with the jth singular value and beyond is lost (see Section 3.2 of [34]). This problem can be fixed by orthonormalizing the columns between each iteration, as shown in

Algorithm 8.1.3. The modified scheme is more costly due to the extra calls to
orth. (However, note that orth can be executed using unpivoted Gram-Schmidt,
which is quite fast.)

Algorithm: Accuracy Enhanced Randomized SVD


(with orthonormalization)
(1) G = randn(n, k + p);
(2) Q = orth(AG);
(3) for j = 1 : q
(4) W = orth(A∗ Q);
(5) Q = orth(AW);
(6) end for
(7) B = Q∗ A;
(8) [Û, D, V] = svd(B, ’econ’);
(9) U = QÛ;

Algorithm 8.1.3. This algorithm takes the same inputs and out-
puts as the method in Algorithm 8.1.1. The only difference is that
orthonormalization is carried out between each step of the power
iteration, to avoid loss of accuracy due to rounding errors.
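A NumPy sketch of Algorithm 8.1.3 (function name and test problem are ours; setting q = 0 recovers the basic RSVD) shows the effect of power iteration on a slowly decaying spectrum σ_j = 1/j:

```python
import numpy as np

def rsvd_power(A, k, p=10, q=2, rng=None):
    """Accuracy-enhanced RSVD with orthonormalization between each
    application of A and A* (in the spirit of Algorithm 8.1.3)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = A.shape[1]
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k + p)))
    for _ in range(q):
        W, _ = np.linalg.qr(A.conj().T @ Q)  # re-orthonormalize so the small
        Q, _ = np.linalg.qr(A @ W)           # singular directions survive
    B = Q.conj().T @ A
    Uhat, s, Vh = np.linalg.svd(B, full_matrices=False)
    return (Q @ Uhat)[:, :k], s[:k], Vh[:k]

# Slowly decaying spectrum: power iteration helps markedly.
rng = np.random.default_rng(5)
m = n = 100
k = 10
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U0 / np.arange(1, n + 1)) @ V0.T        # sigma_j = 1/j
errs = {}
for q in (0, 2):
    U, s, Vh = rsvd_power(A, k, q=q, rng=np.random.default_rng(6))
    errs[q] = np.linalg.norm(A - (U * s) @ Vh, 2)
```

With q = 2 the spectral-norm error is driven down toward the optimal σ_{k+1} = 1/11, whereas the basic scheme (q = 0) remains noticeably suboptimal.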

8.2. Theoretical results A detailed error analysis of the scheme described in


Algorithm 8.1.1 is provided in [26, Sec. 10.4]. In particular, the key theorem
states:
Theorem 8.2.1. Let A denote an m × n matrix, let p ≥ 2 be an over-sampling parameter, and let q denote a small integer. Draw a Gaussian matrix G of size n × (k + p), set Y = (AA∗)^q AG, and let Q denote an m × (k + p) orthonormal matrix resulting from orthonormalizing the columns of Y. Then

(8.2.2) E ‖A − QQ∗A‖ ≤ [ (1 + √(k/(p − 1))) σ_{k+1}^{2q+1} + (e √(k + p) / p) ( Σ_{j=k+1}^{min(m,n)} σ_j^{2(2q+1)} )^{1/2} ]^{1/(2q+1)}.

The bound in (8.2.2) is slightly opaque. To simplify it, let us consider a worst case scenario in which there is no decay in the singular values beyond the truncation point, so that we have σ_{k+1} = σ_{k+2} = · · · = σ_{min(m,n)}. Then (8.2.2) simplifies to
E ‖A − QQ∗A‖ ≤ [ 1 + √(k/(p − 1)) + (e √(k + p) / p) · √(min{m, n} − k) ]^{1/(2q+1)} σ_{k+1}.

In other words, as we increase the exponent q, the power scheme drives the factor
that multiplies σk+1 to one exponentially fast. This factor represents the degree
of “sub-optimality” you can expect to see.
8.3. Extended sampling matrix The scheme given in Subsection 8.1 is slightly wasteful in that it does not directly use all the sampling vectors computed. To further improve accuracy, one could, for a symmetric matrix A and a small positive integer q, form an “extended” sampling matrix
Y = [AG, A²G, . . . , A^q G].
Observe that this new sampling matrix Y has qℓ columns. Then proceed as before:
(8.3.1) Q = qr(Y), B = Q∗A, [Û, D, V] = svd(B, ’econ’), U = QÛ.
The computations in (8.3.1) can be quite expensive since the “tall thin” matrices being operated on now have qℓ columns, rather than the tall thin matrices in, e.g., Algorithm 8.1.1, which have only ℓ columns. This results in an increase in cost for all operations (QR factorization, matrix-matrix multiply, SVD) by a factor of O(q²).
Consequently, the scheme described here is primarily useful in situations in
which the computational cost is dominated by applications of A and A∗ , and we
want to maximally leverage all interactions with A. An early discussion of this
idea can be found in [44, Sec. 4.4], with a more detailed discussion in [42].
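A minimal sketch of the construction (function names and test problem ours): the extended basis can only improve on the plain sample Y = AG, since its column space contains that of AG:

```python
import numpy as np

def extended_sample(A, G, q):
    """Extended sampling matrix Y = [A G, A^2 G, ..., A^q G] for a
    symmetric A, built with q successive multiplications by A."""
    blocks, Z = [], G
    for _ in range(q):
        Z = A @ Z                 # each block reuses the previous product
        blocks.append(Z)
    return np.hstack(blocks)      # n x (q*l)

rng = np.random.default_rng(10)
n, l, q = 60, 12, 3
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (V0 * (1.0 / np.arange(1, n + 1))) @ V0.T   # symmetric, slow decay
G = rng.standard_normal((n, l))

def proj_err(Y):
    Q, _ = np.linalg.qr(Y)
    return np.linalg.norm(A - Q @ (Q.T @ A), 2)

err_plain = proj_err(A @ G)
err_ext = proj_err(extended_sample(A, G, q))
```

Since range([AG, A²G, A³G]) contains range(AG), projecting onto the extended basis can never increase the residual; the price, as noted above, is that every dense operation now works with qℓ columns.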

9. The Nyström method for symmetric positive definite matrices


When the input matrix A is positive semidefinite (psd), the Nyström method can
be used to improve the quality of standard factorizations at almost no additional
cost; see [9] and its bibliography. To describe the idea, we first recall from Subsec-
tion 5.1 that when A is Hermitian (which of course every psd matrix is), then it is
natural to use the approximation

(9.0.1) A ≈ Q (Q∗AQ) Q∗.
In contrast, the so called “Nyström scheme” relies on the rank-k approximation
(9.0.2) A ≈ (AQ) (Q∗AQ)^{−1} (AQ)∗.
For both stability and computational efficiency, we typically rewrite (9.0.2) as
A ≈ FF∗ ,
where F is an approximate Cholesky factor of A of size n × k, defined by
F = (AQ) (Q∗AQ)^{−1/2}.
To compute the factor F numerically, we may first form the matrices B1 = AQ
and B2 = Q∗ B1 . Observe that B2 is necessarily psd, so that we can compute
its Cholesky factorization B2 = C∗C. Finally, compute the factor F = B1 C^{−1} by

performing a triangular solve. The low-rank factorization (9.0.2) can be converted


to a standard decomposition using the techniques from Section 3.
The Nyström technique for computing an approximate eigenvalue decomposi-
tion is given in Algorithm 9.0.3. Let us compare the cost of this method to the
more straight-forward method resulting from using the formula (9.0.1). In both
cases, we need to twice apply A to a set of k + p vectors (first in computing AG,
then in computing AQ). But the Nyström method tends to result in substantially
more accurate results. Informally speaking, the reason is that by exploiting the
psd property of A, we can take one step of power iteration “for free.” For a more
formal analysis of the cost and accuracy of the Nyström method, we refer the
reader to [19, 43].

Algorithm: Eigenvalue Decomposition via the Nyström Method


Given an n × n positive semidefinite matrix A, a target rank k and an over-sampling parameter p, this procedure computes an approximate eigenvalue decomposition A ≈ UΛU∗, where U is orthonormal, and Λ is nonnegative and diagonal.
(1) Draw a Gaussian random matrix G = randn(n, k + p).
(2) Form the sample matrix Y = AG.
(3) Orthonormalize the columns of the sample matrix to obtain the
basis matrix Q = orth(Y).
(4) Form the matrices B1 = AQ and B2 = Q∗ B1 .
(5) Perform a Cholesky factorization B2 = C∗ C.
(6) Form F = B1 C^{−1} using a triangular solve.
(7) Compute a Singular Value Decomposition of the Cholesky factor
[U, Σ, ∼] = svd(F, ’econ’).
(8) Set Λ = Σ2 .

Algorithm 9.0.3. The Nyström method for low-rank approxima-


tion of self-adjoint matrices with non-negative eigenvalues. It
involves two applications of A to matrices with k + p columns,
and has comparable cost to the basic RSVD in Algorithm 4.0.1.
However, it exploits the symmetry of A to boost the accuracy.
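A NumPy sketch of Algorithm 9.0.3 (function name ours; for brevity `np.linalg.solve` stands in for the dedicated triangular solve of Step (6), and numpy's lower-triangular Cholesky factor plays the role of C):

```python
import numpy as np

def nystrom_evd(A, k, p=10, rng=None):
    """Nystrom method for a psd matrix A, following Algorithm 9.0.3."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = A.shape[0]
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k + p)))
    B1 = A @ Q
    B2 = Q.T @ B1                        # psd (k+p) x (k+p) matrix
    L = np.linalg.cholesky(B2)           # B2 = L L^T, L lower triangular
    F = np.linalg.solve(L, B1.T).T       # F = B1 L^{-T}; so F F^T = B1 B2^{-1} B1^T
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    return U, s ** 2                     # A ~ U diag(lambda) U^T

rng = np.random.default_rng(8)
n, k = 60, 8
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (V0 * 2.0 ** -np.arange(n)) @ V0.T   # psd with fast spectral decay
U, lam = nystrom_evd(A, k, rng=np.random.default_rng(9))
err = np.linalg.norm(A - (U * lam) @ U.T, 2)
```

On a psd matrix with fast decay, the Nyström approximation is substantially more accurate than (9.0.1) at essentially the same cost, consistent with the "free power iteration step" interpretation above.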

10. Randomized algorithms for computing Interpolatory


Decompositions
10.1. Structure preserving factorizations Any matrix A of size m × n and rank
k < min(m, n), admits a so called “interpolative decomposition (ID)” which takes
the form
(10.1.1) A = C Z (A is m × n, C is m × k, Z is k × n),

where the matrix C is given by a subset of the columns of A and where Z is


well-conditioned in a sense that we will make precise shortly. The ID has several
advantages, as compared to, e.g., the QR or SVD factorizations:
• If A is sparse or non-negative, then C shares these properties.
• The ID requires less memory to store than either the QR or the singular
value decomposition.
• Finding the indices associated with the spanning columns is often helpful
in data interpretation.
• In the context of numerical algorithms for discretizing PDEs and integral
equations, the ID often preserves “the physics” of a problem in a way that
the QR or SVD do not.
One shortcoming of the ID is that when A is not of precisely rank k, then the
approximation error by the best possible rank-k ID can be substantially larger
than the theoretically minimal error. (In fact, the ID and the column pivoted QR
factorizations are closely related, and they attain exactly the same minimal error.)
For future reference, let Js be an index vector in {1, 2, . . . , n} that identifies the
k columns in C so that
C = A(:, Js ).
One can easily show (see, e.g., [34, Thm. 9]) that any matrix of rank k admits a
factorization (10.1.1) that is well-conditioned in the sense that each entry of Z is
bounded in modulus by one. However, any algorithm that is guaranteed to find
such an optimally conditioned factorization must have combinatorial complexity.
Polynomial time algorithms with high practical efficiency are discussed in [6, 23].
Randomized algorithms are described in [26, 49].

Remark 10.1.2. The interpolative decomposition is closely related to the so called


CUR decomposition which has been studied extensively in the context of ran-
domized algorithms [5, 8, 10, 50]. We will return to this point in Section 11.

10.2. Three flavors of ID: row, column, and double-sided ID Subsection 10.1
describes a factorization where we use a subset of the columns of A to span its
column space. Naturally, this factorization has a sibling which uses the rows of A
to span its row space. In other words A also admits the factorization

(10.2.1) A = X R (A is m × n, X is m × k, R is k × n),
where R is a matrix consisting of k rows of A, and where X is a matrix that
contains the k × k identity matrix. We let Is denote the index vector of length k
that marks the “skeleton” rows so that R = A(Is , :).
Finally, there exists a so called double-sided ID which takes the form

(10.2.2) A = X As Z (A is m × n, X is m × k, As is k × k, Z is k × n),

where X and Z are the same matrices as those that appear in (10.1.1) and (10.2.1),
and where As is the k × k submatrix of A given by
As = A(Is , Js ).
10.3. Deterministic techniques for computing the ID In this section we demon-
strate that there is a close connection between the column ID and the classical
column pivoted QR factorization (CPQR). The end result is that standard soft-
ware used to compute the CPQR can with some light post-processing be used to
compute the column ID.
As a starting point, recall that for a given m × n matrix A, with m  n, the QR
factorization can be written as
(10.3.1) A P = Q S (A is m × n, P is n × n, Q is m × n, S is n × n),
where P is a permutation matrix, where Q has orthonormal columns and where
S is upper triangular.¹ Since our objective here is to construct a rank-k approximation to A, we split off the leading k columns from Q and S to obtain partitions
(10.3.2) Q = [Q1 | Q2], and S = [ S11 S12 ; 0 S22 ],
where Q1 holds the first k columns of Q, and S11 is the leading k × k block of S.

Combining (10.3.1) and (10.3.2), we then find that
(10.3.3) AP = [Q1 | Q2] [ S11 S12 ; 0 S22 ] = [Q1 S11 | Q1 S12 + Q2 S22].
Equation (10.3.3) tells us that the m × k matrix Q1 S11 consists precisely of the first k columns of AP. These columns were the first k columns that were chosen as “pivots” in the QR-factorization procedure. They typically form a good (approximate)
basis for the column space of A. We consequently define our m × k matrix C as
this matrix holding the first k pivot columns. Letting J denote the permutation
vector associated with the permutation matrix P, so that
AP = A(:, J),
we define Js = J(1 : k) as the index vector identifying the first k pivots, and set
(10.3.4) C = A(:, Js ) = Q1 S11 .
Now let us rewrite (10.3.3) by extracting the product Q1 S11 :
(10.3.5)    A P = Q1 [S11 | S12 ] + Q2 [0 | S22 ] = Q1 S11 [Ik | S11⁻¹ S12 ] + Q2 [0 | S22 ].

(Remark 10.3.8 discusses why S11 must be invertible.) Now define


(10.3.6)    T = S11⁻¹ S12 ,  and  Z = [Ik | T] P∗ ,
¹ We use the letter S instead of the traditional R to avoid confusion with the “R”-factor in the row ID,
(10.2.1).
Per-Gunnar Martinsson 209

so that (10.3.5) can be rewritten (upon right-multiplication by P∗ , which equals
P⁻¹ since P is unitary) as
(10.3.7)    A = C [Ik | T] P∗ + Q2 [0 | S22 ] P∗ = C Z + Q2 [0 | S22 ] P∗ .
Equation (10.3.7) is precisely the column ID we sought, with the additional bonus
that the remainder term is explicitly identified. Observe that when the spectral or
Frobenius norms are used, the error term is of exactly the same size as the error
term obtained from a truncated QR factorization:
‖A − C Z‖ = ‖Q2 [0 | S22 ] P∗ ‖ = ‖S22 ‖ = ‖A − Q1 [S11 | S12 ] P∗ ‖.

Remark 10.3.8 (Conditioning). Equation (10.3.5) involves the quantity S11⁻¹ , which
prompts the question of whether S11 is necessarily invertible, and what its condi-
tion number might be. It is easy to show that whenever the rank of A is at least
k, the CPQR algorithm is guaranteed to result in a matrix S11 that is non-singular.
(If the rank of A is j, where j < k, then the QR factorization process can detect
this and halt the factorization after j steps.)
Unfortunately, S11 is typically quite ill-conditioned. The saving grace is that
even though one should expect S11 to be poorly conditioned, it is often the case
that the linear system
(10.3.9) S11 T = S12
still has a solution T whose entries are of moderate size. Informally, one could say
that the directions where S11 and S12 are small “line up.” For standard column
pivoted QR, the system (10.3.9) will in practice be observed to almost always have
a solution T of small size [6], but counter-examples can be constructed. More
sophisticated pivot selection procedures have been proposed that are guaranteed
to result in matrices S11 and S12 such that (10.3.9) has a good solution; but these
are harder to code and take longer to execute [23].
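To make the CPQR-to-ID conversion concrete, here is a hedged NumPy/SciPy sketch of the column ID (an illustrative rendering of the procedure, not the authors' reference code; the function name id_col is ours). Following the advice of Remark 10.3.8, it solves the triangular system S11 T = S12 rather than forming S11⁻¹ explicitly:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def id_col(A, k):
    """Column ID: return (Js, Z) with A ~= A[:, Js] @ Z and Z[:, Js] = I_k."""
    n = A.shape[1]
    Q, S, J = qr(A, mode='economic', pivoting=True)  # CPQR: A[:, J] = Q @ S
    T = solve_triangular(S[:k, :k], S[:k, k:])       # solve S11 T = S12, cf. (10.3.9)
    Z = np.zeros((k, n))
    Z[:, J] = np.hstack([np.eye(k), T])              # Z = [I_k | T] P^*
    return J[:k], Z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 30))  # exact rank 8
Js, Z = id_col(A, 8)
err = np.linalg.norm(A - A[:, Js] @ Z) / np.linalg.norm(A)
print(err)
```

On an exactly rank-k matrix the factorization is exact up to roundoff; for a general matrix the error matches that of the truncated CPQR, as noted above.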

Of course, the row ID can be computed via an entirely analogous process that
starts with a CPQR of the transpose of A. In other words, we execute a pivoted
Gram-Schmidt orthonormalization process on the rows of A.
Finally, to obtain the double-sided ID, we start with using the CPQR-based
process to build the column ID (10.1.1). Then compute the row ID by performing
Gram-Schmidt on the rows of the tall thin matrix C.
The three deterministic algorithms described for computing the three flavors
of ID are summarized in Algorithm 10.3.11.

Remark 10.3.10 (Partial factorization). The algorithms for computing interpolatory
decompositions in Algorithm 10.3.11 are wasteful when k ≪ min(m, n) since
they involve a full QR factorization, which has complexity O(mn min(m, n)). This
problem is very easily remedied by replacing the full QR factorization by a partial
QR factorization, which has cost O(mnk). Such a partial factorization could take
as input either a preset rank k, or a tolerance ε. In the latter case, the factorization
would stop once the residual error ‖A − Q(:, 1 : k) S(1 : k, :)‖ = ‖S22 ‖ ≤ ε. When
the QR factorization is interrupted after k steps, the output would still be a fac-
torization of the form (10.3.1), but in this case, S22 would not be upper triangular.
This is immaterial since S22 is never used. To further accelerate the computation,
one can advantageously use a randomized CPQR algorithm, cf. Sections 14 and 15
or [36, 38].

Compute a column ID so that A ≈ A(:, Js ) Z.


function [Js , Z] = ID_col(A, k)
[Q, S, J] = qr(A, 0);
T = S(1 : k, 1 : k)⁻¹ S(1 : k, (k + 1) : n);
Z = zeros(k, n);
Z(:, J) = [Ik T];
Js = J(1 : k);

Compute a row ID so that A ≈ X A(Is , :).


function [Is , X] = ID_row(A, k)
[Q, S, J] = qr(A∗ , 0);
T = S(1 : k, 1 : k)⁻¹ S(1 : k, (k + 1) : m);
X = zeros(m, k);
X(J, :) = [Ik T]∗ ;
Is = J(1 : k);

Compute a double-sided ID so that A ≈ X A(Is , Js ) Z.


function [Is , Js , X, Z] = ID_double(A, k)
[Js , Z] = ID_col(A, k);
[Is , X] = ID_row(A(:, Js ), k);

Algorithm 10.3.11. Deterministic algorithms for computing the


column, row, and double-sided ID via the column pivoted QR
factorization. The input is in every case an m × n matrix A and
a target rank k. Since the algorithms are based on the CPQR, it
is elementary to modify them to the situation where a tolerance
rather than a rank is given. (Recall that the errors resulting from
these ID algorithms are identical to the error in the first CPQR
factorization executed.)

10.4. Randomized techniques for computing the ID The ID is particularly well


suited to being computed via randomized algorithms. To describe the ideas,
suppose temporarily that A is an m × n matrix of exact rank k, and that we have

by some means computed a rank-k factorization

(10.4.1)    A = Y F.    (A: m × n, Y: m × k, F: k × n)
Once the factorization (10.4.1) is available, let us use the algorithm ID_row de-
scribed in Algorithm 10.3.11 to compute a row ID [Is , X] = ID_row(Y, k) of Y so
that
(10.4.2)    Y = X Y(Is , :).    (Y: m × k, X: m × k, Y(Is , :): k × k)
It then turns out that {Is , X} is automatically (!) a row ID of A as well. To see this,
simply note that

X A(Is , :) = X Y(Is , :) F    {use (10.4.1) restricted to the rows in Is }
            = Y F              {use (10.4.2)}
            = A.               {use (10.4.1)}
The key insight here is very simple, but powerful, so let us spell it out explicitly:

Observation: In order to compute a row ID of a matrix A, the only informa-


tion needed is a matrix Y whose columns span the column space of A.

Algorithm: Randomized ID

Inputs: An m × n matrix A, a target rank k, an over-sampling parameter p


(say p = 10), and a small integer q denoting the number of power iterations
taken.
Outputs: An m × k interpolation matrix X and an index vector Is ∈ Nk
such that A ≈ XA(Is , : ).
(1) G = randn(n, k + p);
(2) Y = AG;
(3) for j = 1 : q
(4) Y = A∗ Y;
(5) Y = AY ;
(6) end for
(7) Form an ID of the m × (k + p) sample matrix: [Is , X] = ID_row(Y, k).

Algorithm 10.4.3. An O(mnk) algorithm for computing an interpolative
decomposition of A via randomized sampling.

As we have seen, the task of finding a matrix Y whose columns form a good
basis for the column space of a matrix is ideally suited to randomized sampling.
To be precise, we showed in Section 4 that given a matrix A, we can find a matrix
Y whose columns approximately span the column space of A via the formula
Y = AG, where G is a tall thin Gaussian random matrix. The scheme that results
from combining these two insights is summarized in Algorithm 10.4.3.
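The combined scheme can be sketched in NumPy as follows (an illustrative rendering of Algorithm 10.4.3, not reference code; the helper names id_row and randomized_id_row are ours, with id_row playing the role of ID_row from Algorithm 10.3.11):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def id_row(A, k):
    """Row ID: return (Is, X) with A ~= X @ A[Is, :] and X[Is, :] = I_k."""
    m = A.shape[0]
    Q, S, J = qr(A.T, mode='economic', pivoting=True)  # CPQR of A^*
    T = solve_triangular(S[:k, :k], S[:k, k:])
    X = np.zeros((m, k))
    X[J, :] = np.vstack([np.eye(k), T.T])
    return J[:k], X

def randomized_id_row(A, k, p=10, q=1, seed=0):
    """Randomized row ID: a row ID of the sample matrix Y is one of A."""
    m, n = A.shape
    G = np.random.default_rng(seed).standard_normal((n, k + p))
    Y = A @ G                      # columns of Y approximately span col(A)
    for _ in range(q):             # optional power iterations
        Y = A @ (A.T @ Y)
    return id_row(Y, k)

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 7)) @ rng.standard_normal((7, 40))  # exact rank 7
Is, X = randomized_id_row(A, 7)
err = np.linalg.norm(A - X @ A[Is, :]) / np.linalg.norm(A)
print(err)
```

Note that A itself is touched only through the products AG and A∗Y; the ID computation runs on the small sample matrix Y.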
The randomized algorithm for computing a row ID shown in Algorithm 10.4.3
has complexity O(mnk). We can reduce this complexity to O(mn log k) by using
a structured random matrix instead of a Gaussian, cf. Section 6. The result is
summarized in Algorithm 10.4.4.

Algorithm: Fast Randomized ID

Inputs: An m × n matrix A, a target rank k, and an over-sampling parameter


p (say p = k).
Outputs: An m × k interpolation matrix X and an index vector Is ∈ Nk
such that A ≈ XA(Is , : ).
(1) Form an n × (k + p) SRFT Ω.
(2) Form the sample matrix Y = A Ω.
(3) Form an ID of the m × (k + p) sample matrix: [Is , X] = ID_row(Y, k).

Algorithm 10.4.4. An O(mn log k) algorithm for computing an
interpolative decomposition of A.

11. Randomized algorithms for computing the CUR decomposition


11.1. The CUR decomposition The so called CUR-factorization [10] is a “struc-
ture preserving” factorization that is similar to the Interpolative Decomposition
described in Section 10. The CUR factorization approximates an m × n matrix A
as a product

(11.1.1)    A ≈ C U R,    (A: m × n, C: m × k, U: k × k, R: k × n)
where C contains a subset of the columns of A and R contains a subset of the rows
of A. Like the ID, the CUR decomposition offers the ability to preserve properties
like sparsity or non-negativity in the factors of the decomposition, the prospect
to reduce memory requirements, and excellent tools for data interpretation.
The CUR decomposition is often obtained in three steps [10, 41]: (1) Some
scheme is used to assign a weight, or so called leverage score (a measure of importance)
[27], to each column and row in the matrix. This is typically done either using
the ℓ2 norms of the columns and rows or by using the leading singular vectors of
A [12, 50]. (2) The matrices C and R are constructed via a randomized sampling

procedure, using the leverage scores to assign a sampling probability to each


column and row. (3) The U matrix is computed via:
(11.1.2) U ≈ C† AR† ,
with C† and R† being the pseudoinverses of C and R. Non-randomized ap-
proaches to computing the CUR decomposition are discussed in [46, 49].

Remark 11.1.3 (Conditioning of CUR). For matrices whose singular values ex-
perience substantial decay, the accuracy of the CUR factorization can deteriorate
due to effects of ill-conditioning. To simplify slightly, one would normally expect
the leading k singular values of C and R to be rough approximations to the lead-
ing k singular values of A, so that the condition numbers of C and R would be
roughly σ1 (A)/σk (A). Since low-rank factorizations are most useful when applied
to matrices whose singular values decay reasonably rapidly, we would typically
expect the ratio σ1 (A)/σk (A) to be large, which is to say that C and R would be
ill-conditioned. Hence, in the typical case, evaluation of the formula (11.1.2) can
be expected to result in substantial loss of accuracy due to accumulation of round-
off errors. Observe that the ID does not suffer from this problem; in (10.2.2), the
matrix As tends to be ill-conditioned, but it does not need to be inverted. (The
matrices X and Z are well-conditioned.)

11.2. Converting a double-sided ID to a CUR decomposition The next algo-


rithm converts a double-sided ID to a CUR decomposition.
To explain this algorithm, we begin by assuming that the factorization (10.2.2)
has been computed using the procedures described in Section 10 (either the de-
terministic or the randomized ones). In other words, we assume that the index
vectors Is and Js , and the basis matrices X and Z, are all available. We then define
C and R in the natural way as
(11.2.1) C = A(:, Js ) and R = A(Is , :).
Consequently, C and R are respectively subsets of columns and of rows of A.
The index vectors Is and Js are determined by the column pivoted QR factoriza-
tions, possibly combined with a randomized projection step for computational
efficiency. It remains to construct a k × k matrix U such that
(11.2.2) A ≈ C U R.
Now recall that, cf. (10.1.1),
(11.2.3) A ≈ C Z.
By inspecting (11.2.3) and (11.2.2), we find that we achieve our objective if we
determine a matrix U such that
(11.2.4)    U R = Z.    (U: k × k, R: k × n, Z: k × n)

Unfortunately, (11.2.4) is an over-determined system, but at least intuitively, it


seems plausible that it should have a fairly accurate solution, given that the rows
of R and the rows of Z should, by construction, span roughly the same space
(namely, the space spanned by the k leading right singular vectors of A). Solving
(11.2.4) in the least-squares sense, we arrive at our definition of U:
(11.2.5) U := ZR† .
These steps are summarized in Algorithm 11.2.6.

Algorithm: Randomized CUR

Inputs: An m × n matrix A, a target rank k, an over-sampling parameter p


(say p = 10), and a small integer q denoting the number of power iterations
taken.
Outputs: A k × k matrix U and index vectors Is and Js of length k such that
A ≈ A( : , Js ) U A(Is , : ).
(1) G = randn(k + p, m);
(2) Y = GA;
(3) for j = 1 : q
(4) Z = YA∗ ;
(5) Y = ZA;
(6) end for
(7) Form an ID of the (k + p) × n sample matrix: [Js , Z] = ID_col(Y, k).
(8) Form an ID of the m × k matrix of chosen columns:
[Is , ∼] = ID_row(A(:, Js ), k).
(9) Find the matrix U by solving the least squares equation UA(Is , :) = Z.

Algorithm 11.2.6. A randomized algorithm for computing a
CUR decomposition of A via randomized sampling. For q = 0,
the scheme is fast and accurate for matrices whose singular val-
ues decay rapidly. For matrices whose singular values decay
slowly, one should pick a larger q (say q = 1 or 2) to improve
accuracy at the cost of longer execution time. If accuracy better
than ε_mach^{1/(2q+1)} is desired, then the scheme should be modified to
incorporate orthonormalization as described in Remark 8.1.2.
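The conversion from an ID to a CUR factorization can be sketched as follows (an illustrative NumPy rendering under our own helper names, not reference code; for simplicity the row skeleton is obtained by running the same pivoted-QR helper on C∗):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular, lstsq

def id_col(A, k):
    """Column ID: return (Js, Z) with A ~= A[:, Js] @ Z."""
    n = A.shape[1]
    Q, S, J = qr(A, mode='economic', pivoting=True)
    T = solve_triangular(S[:k, :k], S[:k, k:])
    Z = np.zeros((k, n))
    Z[:, J] = np.hstack([np.eye(k), T])
    return J[:k], Z

def cur(A, k):
    Js, Z = id_col(A, k)           # column ID: A ~= A[:, Js] @ Z
    Is, _ = id_col(A[:, Js].T, k)  # row skeleton from a CPQR of C^*
    C, R = A[:, Js], A[Is, :]
    U = lstsq(R.T, Z.T)[0].T       # least-squares solution of U R = Z, cf. (11.2.5)
    return C, U, R

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 30))  # exact rank 6
C, U, R = cur(A, 6)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(err)
```

The least-squares solve implements (11.2.4)/(11.2.5) without ever inverting the ill-conditioned skeleton matrix itself.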

12. Adaptive rank determination with updating of the matrix


12.1. Problem formulation Up to this point, we have assumed that the rank k
is given as an input variable to the factorization algorithm. In practical usage, it
is common that we are given instead a matrix A and a computational tolerance ε,
and our task is then to determine a matrix Ak of rank k such that ‖A − Ak ‖ ≤ ε.

The techniques described in this section are designed for dense matrices stored
in RAM. They directly update the matrix, and come with a firm guarantee that
the computed low rank approximation is within distance ε of the original matrix.
There are many situations where direct updating is not feasible and we can in
practice only interact with the matrix via the matrix-vector multiplication (e.g.,
very large matrices stored out-of-core, sparse matrices, matrices that are defined
implicitly). Section 13 describes algorithms designed for this environment that
use randomized sampling techniques to estimate the approximation error.
Recall that for the case where a computational tolerance is given (rather than
a rank), the optimal solution is given by the SVD. Specifically, let {σj }_{j=1}^{min(m,n)} be
the singular values of A, and let ε be a given tolerance. Then the minimal rank k
for which there exists a matrix B of rank k that is within distance ε of A is
the smallest integer k such that σ_{k+1} ≤ ε. The algorithms described here will
determine a k that is not necessarily optimal, but is typically fairly close.
12.2. A greedy updating algorithm Let us start by describing a general algorithmic
template for how to compute an approximate rank-k factorization of a matrix.
To be precise, suppose that we are given an m × n matrix A,
and a computational tolerance ε. Our objective is then to determine an integer
k ∈ {1, 2, . . . , min(m, n)}, an m × k orthonormal matrix Qk , and a k × n matrix Bk
such that ‖A − Qk Bk ‖ ≤ ε. Algorithm 12.2.1 outlines how one might in a greedy
fashion build the matrices Qk and Bk , adding one column to Qk and one row to
Bk at each step.

(1) Q0 = [ ]; B0 = [ ]; A0 = A; k = 0;
(2) while ‖Ak ‖ > ε
(3)    k = k + 1
(4)    Pick a vector y ∈ Ran(Ak−1 )
(5)    q = y/‖y‖;
(6)    b = q∗ Ak−1 ;
(7)    Qk = [Qk−1 q];
(8)    Bk = [Bk−1 ; b];
(9)    Ak = Ak−1 − qb;
(10) end while

Algorithm 12.2.1. A greedy algorithm for building a low-rank


approximation to a given m × n matrix A that is accurate to
within a given precision ε. To be precise, the algorithm deter-
mines an integer k, an m × k orthonormal matrix Qk and a k × n
matrix Bk = Q∗k A such that ‖A − Qk Bk ‖ ≤ ε. One can easily
verify that after step j, we have A = Qj Bj + Aj .

Algorithm 12.2.1 is a generalization of the classical Gram-Schmidt procedure.


The key to understanding how the algorithm works is provided by the identity
A = Qj Bj + Aj , j = 0, 1, 2, . . . , k.
The computational efficiency and accuracy of the algorithm depend crucially on
the strategy for picking the vector y on line (4). Let us consider three possibilities:
Pick the largest remaining column Suppose we instantiate line (4) by letting y
be simply the largest column of the remainder matrix Ak−1 .
(4) Set jk = argmax{‖Ak−1 (:, j)‖ : j = 1, 2, . . . , n} and then y = Ak−1 (:, jk ).
With this choice, Algorithm 12.2.1 is precisely column pivoted Gram-Schmidt
(CPQR). This algorithm is reasonably efficient, and often leads to fairly close
to optimal low-rank approximation. For instance, when the singular values of
A decay rapidly, CPQR determines a numerical rank k that is typically reason-
ably close to the theoretically exact ε-rank. However, this is not always the case
even when the singular values decay rapidly, and the results can be quite poor
when the singular values decay slowly. (A striking illustration of how suboptimal
CPQR can be for purposes of low-rank approximation is provided by the famous
“Kahan counter-example,” see [28, Sec. 5].)
Pick the locally optimal vector A choice that is natural and conceptually simple
is to pick the vector y by solving the obvious minimization problem:
(4) y = argmin{‖Ak−1 − y y∗ Ak−1 ‖ : ‖y‖ = 1}.
With this choice, the algorithm will produce matrices that attain the theoretically
optimal precision
‖A − Qj Bj ‖ = σj+1 .
This tells us that the greediness of the algorithm is not a problem. However, this
strategy is impractical since solving the local minimization problem is computa-
tionally hard.
A randomized selection strategy Suppose now that we pick y by forming a
linear combination of the columns of Ak−1 with the expansion weights drawn
from a normalized Gaussian distribution:
(4) Draw a Gaussian random vector g ∈ Rn and set y = Ak−1 g.
With this choice, the algorithm becomes logically equivalent to the basic random-
ized SVD given in Algorithm 4.0.1. This means that this choice often leads to
a factorization that is close to optimally accurate, and is also computationally
efficient.
One can attain higher accuracy by trading away some computational efficiency
to incorporate a couple of steps of power iteration: for some small integer q,
say q = 1 or q = 2, one chooses y = (Ak−1 A∗k−1 )^q Ak−1 g.

12.3. A blocked updating algorithm A key benefit of the randomized greedy


algorithm described in Subsection 12.2 is that it can easily be blocked. In other
words, given a block size b, we can at each step of the iteration draw a set of b
Gaussian random vectors, compute the corresponding sample vectors, and then
extend the factors Q and B by adding b columns and b rows at a time, respectively.
The result is shown in Algorithm 12.3.1.

(1) Q = [ ]; B = [ ];
(2) while ‖A‖ > ε
(3)    Draw an n × b Gaussian random matrix G.
(4)    Compute the m × b matrix Qnew = qr(AG, 0).
(5)    Bnew = Q∗new A
(6)    Q = [Q Qnew ]
(7)    B = [B ; Bnew ]
(8)    A = A − Qnew Bnew
(9) end while

Algorithm 12.3.1. A greedy algorithm for building a low-rank


approximation to a given m × n matrix A that is accurate to
within a given precision ε. This algorithm is a blocked analogue
of the method described in Algorithm 12.2.1 and takes as input
a block size b. Its output is an orthonormal matrix Q of size
m × k (where k is a multiple of b) and a k × n matrix B such that
‖A − QB‖ ≤ ε. For higher accuracy, one can incorporate a couple
of steps of power iteration and set Qnew = qr((AA∗ )^q AG, 0) on
Line (4).

12.4. Evaluating the norm of the residual The algorithms described in this sec-
tion contain one step that could be computationally expensive unless some care
is exercised. The potential problem concerns the evaluation of the norm of the
remainder matrix, ‖Ak ‖ (cf. Line (2) in Algorithm 12.2.1), at each step of the iteration.
When the Frobenius norm is used, this evaluation can be done very efficiently, as
follows: When the computation starts, evaluate the norm of the input matrix
a = ‖A‖_Fro .
Then observe that after step j completes, we have
A = Qj Bj + Aj ,    where Qj Bj = Qj Q∗j A  and  Aj = (I − Qj Q∗j )A.

218 Randomized Methods for Matrix Computations

Since the columns in the first term all lie in Col(Qj ), and the columns of the
second term all lie in Col(Qj )⊥ , we now find that
‖A‖²_Fro = ‖Qj Bj ‖²_Fro + ‖Aj ‖²_Fro = ‖Bj ‖²_Fro + ‖Aj ‖²_Fro ,
where in the last step we used that ‖Qj Bj ‖_Fro = ‖Bj ‖_Fro since Qj is orthonormal.
In other words, we can easily compute ‖Aj ‖_Fro via the identity
‖Aj ‖_Fro = ( a² − ‖Bj ‖²_Fro )^{1/2} .
The idea here is related to “down-dating” schemes for computing column norms
when executing a column pivoted QR factorization, as described in, e.g., [48,
Chapter 5, Section 2.1].
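The identity can be verified numerically in a few lines (an illustrative check of ours, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
a = np.linalg.norm(A)                               # ||A||_Fro, computed once
Q, _ = np.linalg.qr(rng.standard_normal((30, 5)))   # some orthonormal Q_j
B = Q.T @ A                                         # B_j = Q_j^* A
resid_direct = np.linalg.norm(A - Q @ B)            # ||A_j||_Fro, the hard way
resid_cheap = np.sqrt(a**2 - np.linalg.norm(B)**2)  # the cheap identity
print(resid_direct, resid_cheap)
```

The cheap formula needs only the Frobenius norm of the small factor B, so the residual can be monitored without ever touching the full remainder matrix.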
When the spectral norm is used, one could use a power iteration to compute
an estimate of the norm of the matrix. Alternatively, one can use the randomized
procedure described in Section 13 which is faster, but less accurate.

13. Adaptive rank determination without updating the matrix


The techniques described in Section 12 for computing a low rank approxima-
tion to a matrix A that is valid to a given tolerance (as opposed to a given rank)
are highly computationally efficient whenever the matrix A itself can be easily
updated (e.g. a dense matrix stored in RAM). In this section, we describe algo-
rithms for solving the “given tolerance” problem that do not need to explicitly
update the matrix; this comes in handy for sparse matrices, for matrices stored
out-of-core, for matrices defined implicitly, etc. In such a situation, it often works
well to use randomized methods to estimate the norm of the residual matrix
A − QB. The framing for such a randomized estimator is that given a tolerated
risk probability p, we can cheaply compute a bound for ‖A − QB‖ that is valid
with probability at least 1 − p.
As a preliminary step in deriving the update-free scheme, let us reformulate
the basic RSVD in Algorithm 4.0.1 as the sequential algorithm shown in Algo-
rithm 13.0.1 that builds the matrices Q and B one vector at a time. We observe
that this method is similar to the greedy template shown in Algorithm 12.2.1,
except that there is not an immediately obvious way to tell when ‖A − Qj Bj ‖
becomes small enough.
However, it is possible to estimate this quantity quite easily. The idea is that
once Qj becomes large enough to capture “most” of the range of A, then the
sample vectors yj drawn will all approximately lie in the span of Qj , which is to
say that the projected vectors zj will become very small. In other words, once
we start to see a sequence of vectors zj that are all very small, we can reasonably
deduce that the basis we have on hand very likely covers most of the range of A.

(1) Q0 = [ ]; B0 = [ ];
(2) for j = 1, 2, 3, . . .
(3)    Draw a Gaussian random vector gj ∈ Rn and set yj = Agj
(4)    Set zj = yj − Qj−1 Q∗j−1 yj and then qj = zj /‖zj ‖.
(5)    Qj = [Qj−1 qj ]
(6)    Bj = [Bj−1 ; q∗j A]
(7) end for

Algorithm 13.0.1. A randomized range finder that builds an or-
thonormal basis {q1 , q2 , q3 , . . . } for the range of A one vector at
a time. This algorithm is mathematically equivalent to the basic
RSVD in Algorithm 4.0.1 in the sense that if G = [g1 g2 g3 . . . ],
then the vectors {qj }_{j=1}^{p} form an orthonormal basis for the
column space of AG(:, 1 : p) for both methods. Observe that
zj = (A − Qj−1 Q∗j−1 A) gj , cf. (13.0.2).

To make the discussion in the previous paragraph more mathematically rigor-


ous, let us first observe that each projected vector zj satisfies the relation

(13.0.2)    zj = yj − Qj−1 Q∗j−1 yj = Agj − Qj−1 Q∗j−1 Agj = (A − Qj−1 Q∗j−1 A) gj .


Next, we use the fact that if T is any matrix, then by looking at the magnitude
of Tg, where g is a Gaussian random vector, we can deduce information about
the spectral norm of T. The precise result that we need is the following, cf. [53,
Sec. 3.4] and [26, Lemma 4.1].

Lemma 13.0.3. Let T be a real m × n matrix. Fix a positive integer r and a real number
α ∈ (0, 1). Draw an independent family {gi : i = 1, 2, . . . , r} of standard Gaussian
vectors. Then
‖T‖ ≤ (1/α) √(2/π) max_{i=1,...,r} ‖T gi ‖
with probability at least 1 − α^r .

In applying this result, we set α = 1/10, whence it follows that if ‖zj ‖ is smaller
than the resulting threshold for r vectors in a row, then ‖A − Qj Bj ‖ ≤ ε with
probability at least 1 − 10−r . The result is shown in Algorithm 13.0.4. Observe
that choosing α = 1/10 will work well only if the singular values decay reasonably
fast.

(1) Draw standard Gaussian vectors g1 , . . . , gr of length n.


(2) For i = 1, 2, . . . , r, compute yi = Agi .
(3) j = 0.
(4) Q0 = [ ], the m × 0 empty matrix.
(5) while max{ ‖yj+1 ‖, ‖yj+2 ‖, . . . , ‖yj+r ‖ } > ε/(10 √(2/π)),
(6)    j = j + 1.
(7)    Overwrite yj by yj − Qj−1 (Qj−1 )∗ yj .
(8)    qj = yj /‖yj ‖.
(9)    Qj = [Qj−1 qj ].
(10)   Draw a standard Gaussian vector gj+r of length n.
(11)   yj+r = (I − Qj (Qj )∗ ) A gj+r .
(12)   for i = (j + 1), (j + 2), . . . , (j + r − 1),
(13)      Overwrite yi by yi − qj ⟨qj , yi ⟩.
(14)   end for
(15) end while
(16) Q = Qj .

Algorithm 13.0.4. A randomized range finder. Given an m × n


matrix A, a tolerance ε, and an integer r, the algorithm computes
an orthonormal matrix Q such that ‖A − QQ∗ A‖ ≤ ε holds with
probability at least 1 − min{m, n}10−r . Line (7) is mathematically
redundant but improves orthonormality of Q in the presence of
round-off errors. (Adapted from Algorithm 4.2 of [26].)

Remark 13.0.5. While the proof of Lemma 13.0.3 is outside the scope of these
lectures, it is perhaps instructive to prove a much simpler related result that says
that if T is any m × n matrix, and g ∈ Rn is a standard Gaussian vector, then
(13.0.6)    E[ ‖Tg‖² ] = ‖T‖²_Fro .
To prove (13.0.6), let T have the singular value decomposition T = UDV∗ , and set
g̃ = V∗ g. Then

‖Tg‖² = ‖UDV∗ g‖² = ‖UDg̃‖² = {U is unitary} = ‖Dg̃‖² = Σ_{j=1}^{n} σ²j g̃²j .

Then observe that since the distribution of Gaussian vectors is rotationally invari-
ant, the vector g̃ = V∗ g is also a standardized Gaussian vector, and so E[g̃²j ] = 1.
Since the variables {g̃j }_{j=1}^{n} are independent, it follows that
E[ ‖Tg‖² ] = E[ Σ_{j=1}^{n} σ²j g̃²j ] = Σ_{j=1}^{n} σ²j E[g̃²j ] = Σ_{j=1}^{n} σ²j = ‖T‖²_Fro .
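The identity (13.0.6) is also easy to check by Monte Carlo sampling (an illustrative experiment of ours, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(42)
T = rng.standard_normal((15, 10))
G = rng.standard_normal((10, 200_000))         # 200k independent Gaussian vectors
est = np.mean(np.sum((T @ G) ** 2, axis=0))    # Monte Carlo estimate of E[||Tg||^2]
exact = np.linalg.norm(T, 'fro') ** 2
print(est, exact)                              # the two should be close
```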

14. Randomized algorithms for computing a rank-revealing QR


decomposition
Up until now, all methods discussed have concerned the problems of comput-
ing a low-rank approximation to a given matrix. These methods were designed
explicitly for the case where the rank k is substantially smaller than the matrix
dimensions m and n. In the last two sections, we will describe some recent de-
velopments that illustrate how randomized projections can be used to accelerate
matrix factorization algorithms for any rank k, including full factorizations where
k = min(m, n). These new algorithms offer “one stop shopping” in that they are
faster than traditional algorithms in essentially every computational regime. We
start in this section with a randomized algorithm for computing full and partial
column pivoted QR (CPQR) factorizations.
The material in this section assumes that the reader is familiar with the classi-
cal Householder QR factorization procedure, and with the concept of blocking to
accelerate matrix factorization algorithms. It is intended as a high-level introduc-
tion to randomized algorithms for computing full factorizations of matrices. For
details, we refer the reader to [33, 36].
14.1. Column pivoted QR decomposition Given an m × n matrix A, with m ≥
n, we recall that the column pivoted QR factorization (CPQR) takes the form,
cf. (10.3.1),
(14.1.1)    A P = Q R,    (A: m × n, P: n × n, Q: m × n, R: n × n)
where P is a permutation matrix, where Q has orthonormal columns, and where R
is upper triangular. A standard technique for computing the CPQR of a matrix
is the Householder QR process, which we illustrate in Figure 14.1.2. The process
requires n − 1 steps to drive A to upper triangular form from a starting point of
A0 = A. At the ith step, we form
Ai = Q∗i Ai−1 Pi
where Pi is a permutation matrix that flips the ith column of Ai−1 with the
column of largest magnitude in Ai−1 (:, i : n). The column moved into the ith
place is called the pivot column. The matrix Qi is a so called Householder reflector
that zeros out all elements beneath the diagonal in the pivot column. In other
words, if we let ci denote the pivot column, then²
Q∗i ci = [ ci (1 : (i − 1)) ; ±‖ci (i : m)‖ ; 0 ] .

² The matrix Qi is in fact symmetric, so Qi = Q∗i , but we keep the transpose symbol in the formula
for consistency with the remainder of the section.


Figure 14.1.2. Basic QR factorization process. The n × n matrix


A is driven to upper triangular form in n − 1 steps. (Shown for
n = 4.) At step i, we form Ai = Q∗i Ai−1 Pi where Pi is a permuta-
tion matrix that moves the largest column in Ai−1 (:, i : n) to the
ith position, and Qi is a Householder reflector that zeros out all
elements under the diagonal in the pivot column.

Once the process has completed, the matrices Q and P in (14.1.1) are given by
Q = Qn−1 Qn−2 · · · Q1 , and P = Pn−1 Pn−2 · · · P1 .
For details, see, e.g., [20, Sec. 5.2].
The Householder QR factorization process is a celebrated algorithm that is
exceptionally stable and accurate. However, it has a serious weakness in that
it executes rather slowly on modern hardware, in particular on systems involv-
ing many cores, when the matrix is stored in distributed memory or on a hard
drive, etc. The problem is that it inherently consists of a sequence of n − 1 rank-1
updates (so called BLAS2 operations), which makes the process very communica-
tion intensive. In principle, the resolution to this problem is to block the process,
as shown in Figure 14.1.3.
Let b denote a block size; then in a blocked Householder QR algorithm, we
would find groups of b pivot vectors that are moved into the active b slots at
once, then b Householder reflectors would be determined by processing the b
pivot columns, and then the remainder of the matrix would be updated jointly.
Such a blocked algorithm would expend most of its flops on matrix-matrix mul-
tiplications (so called BLAS3 operations), which execute very rapidly on a broad
range of computing hardware. Many techniques for blocking Householder QR
have been proposed over the years, including, e.g., [3, 4].
It was recently observed [36] that randomized sampling is ideally suited for
resolving the long-standing problem of how to find groups of pivot vectors. The
key observation is that a measure of quality for a group of b pivot vectors is its
spanning volume in Rm . This turns out to be closely related to how good of a ba-
sis these vectors form for the column space of the matrix [21, 23, 37]. As we saw in
Section 10, this task is particularly well suited to randomized sampling. Precisely,
consider the task of identifying a group of b good pivot vectors in the first step of
the blocked QR process shown in Figure 14.1.3. Using the procedures described
in Subsection 10.4, we proceed as follows: Fix an over-sampling parameter p, say
p = 10. Then draw a Gaussian random matrix G of size (b + p) × m, and form


Figure 14.1.3. Blocked QR factorization process. The matrix A


consists of p × p blocks of size b × b (shown for p = 3 and b = 4).
The matrix is driven to upper triangular form in p steps. At step
i, we form Ai = Q∗i Ai−1 Pi where Pi is a permutation matrix, and
Qi is a product of b Householder reflectors.

the sampling matrix Y = GA. Then simply perform column pivoted QR on the
columns of Y. To summarize, we determine P1 as follows:

G = randn(b + p, m),
Y = GA,
[∼, ∼, P1 ] = qr(Y, 0).
Observe that the QR factorization of Y is affordable since Y is small, and fits in
fast memory close to the processor. For the remaining steps, we simply apply
the same idea to find the best spanning columns for the lower right block in Ai
that has not yet been driven to upper triangular form. The resulting algorithm is
called Householder QR with Randomization for Pivoting (HQRRP); it is described in
detail in [36], and is available at https://github.com/flame/hqrrp/. (The method
described in this section was first published in [33], but is closely related to the
independently discovered results in [13].)
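In NumPy terms, the pivot-selection step above can be sketched as follows. This is a minimal illustration of the idea, not the optimized HQRRP code from [36]; the function name is ours, and the greedy pivoted Gram-Schmidt loop on the small sketch Y is written out by hand only to keep the sketch self-contained (a library column-pivoted QR would normally be used).

```python
import numpy as np

def randomized_pivots(A, b, p=10, rng=None):
    """Select b pivot columns of A via column-pivoted QR on a sketch Y = G A.

    Illustrative sketch of the idea behind HQRRP; not the library routine."""
    if rng is None:
        rng = np.random.default_rng()
    m, n = A.shape
    G = rng.standard_normal((b + p, m))   # Gaussian sketching matrix
    Y = G @ A                             # small (b + p) x n sample matrix
    pivots = []
    for _ in range(b):
        norms = np.linalg.norm(Y, axis=0)
        j = int(np.argmax(norms))         # greedy pivot: largest residual column
        pivots.append(j)
        q = Y[:, j] / norms[j]
        Y = Y - np.outer(q, q @ Y)        # deflate the chosen direction
    return pivots

# For an exactly rank-b matrix, the b selected columns span its column space.
rng = np.random.default_rng(0)
b, m, n = 5, 60, 50
A = rng.standard_normal((m, b)) @ rng.standard_normal((b, n))
J = randomized_pivots(A, b, rng=rng)
Q, _ = np.linalg.qr(A[:, J])
rel_err = np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A)
```

The point of the sketch is that all pivoting decisions are made on the small matrix Y, which fits in fast memory, while the large matrix A is touched only through the matrix-matrix product G @ A.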
To maximize performance, it turns out to be possible to “downdate” the sam-
pling matrix from one step of the factorization to the next, in a manner simi-
lar to how downdating of the pivot weights is done in classical Householder
QR [48, Ch. 5, Sec. 2.1]. This obviates the need to draw a new random matrix at
each step [13, 36], and reduces the leading term in the asymptotic flop count of
HQRRP to 2mn2 − (4/3)n3 , which is identical to classical Householder QR.
15. A strongly rank-revealing UTV decomposition
This section describes a randomized algorithm randUTV that is very similar to
the randomized QR factorization process described in Section 14 but that results
in a so called “UTV factorization.” The new algorithm has several advantages:
• randUTV provides close to optimal low-rank approximation, and highly
accurate estimates for the singular values of a matrix.
224 Randomized Methods for Matrix Computations
• The algorithm randUTV builds the factorization (15.1.1) incrementally. This
means that when it is applied to a matrix of numerical rank k, the algo-
rithm can be stopped early and incur an overall cost of O(mnk).
• Like HQRRP, the algorithm randUTV is blocked, which enables it to execute
fast on modern hardware.
• The algorithm randUTV is not an iterative algorithm. In this regard, it is
closer to the CPQR than standard SVD algorithms, which substantially
simplifies software optimization.
15.1. The UTV decomposition Given an m × n matrix A, with m ≥ n, a “UTV
decomposition” of A is a factorization of the form
(15.1.1) A = U T V∗
(with A of size m × n, U of size m × m, T of size m × n, and V of size n × n),
where U and V are unitary matrices, and T is a triangular matrix (either lower or
upper triangular). The UTV decomposition can be viewed as a generalization of
other standard factorizations such as, e.g., the Singular Value Decomposition (SVD)
or the Column Pivoted QR decomposition (CPQR). (To be precise, the SVD is the
special case where T is diagonal, and the CPQR is the special case where V is
a permutation matrix.) The additional flexibility inherent in the UTV decompo-
sition enables the design of efficient updating procedures, see [48, Ch. 5, Sec. 4]
and [16].
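As a quick sanity check on the definition, the SVD fits the template (15.1.1) once the diagonal matrix of singular values is padded to size m × n. A small NumPy illustration (for real matrices, so V∗ is simply the transpose of V):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))

# Full SVD: U is m x m and Vt is n x n, matching the shapes in (15.1.1).
U, s, Vt = np.linalg.svd(A)
T = np.zeros((m, n))          # T is m x n and diagonal: the SVD special case
T[:n, :n] = np.diag(s)
V = Vt.T

rel_err = np.linalg.norm(A - U @ T @ V.T) / np.linalg.norm(A)
```

The CPQR special case is analogous: there T is the upper triangular factor and V is a permutation matrix.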
15.2. An overview of randUTV The algorithm randUTV follows the same general
pattern as HQRRP, as illustrated in Figure 15.2.1. Like HQRRP, it drives a given matrix
A = A0 to upper triangular form via a sequence of steps
Ai = U∗i Ai−1 Vi ,
where each Ui and Vi is a unitary matrix. As in Section 14, we let b denote a block
size, and let p = n/b denote the number of steps taken. The key difference from
Figure 15.2.1. Blocked UTV factorization process, illustrating the
sequence A0 = A, A1 = U∗1 A0 V1 , A2 = U∗2 A1 V2 , A3 = U∗3 A2 V3 .
The matrix A consists of p × p blocks of size b × b (shown for
p = 3 and b = 4). The matrix is driven to upper triangular form
in p steps. At step i, we form Ai = U∗i Ai−1 Vi , where Ui and Vi
consist (mostly) of a product of b Householder reflectors. The
elements shown in grey are not zero, but are very small in magnitude.
HQRRP is that randUTV allows the matrices Vi to consist (mostly) of a product
of b Householder reflectors. This added flexibility over CPQR allows us to drive
more mass onto the diagonal entries, and thereby render the off-diagonal entries
in the final matrix T very small in magnitude.
15.3. A single step block factorization The algorithm randUTV consists of re-
peated application of a randomized technique for building approximations to the
spaces spanned by the dominant b left and right singular vectors, where b is a
given block size. To be precise, given an m × n matrix A, we seek to build unitary
matrices U1 and V1 such that
A = U1 A1 V1∗
where A1 has the block structure
A1 = [ A1,11 A1,12 ; 0 A1,22 ],
so that A1,11 is diagonal, and A1,12 has entries of small magnitude.
We first build V1 . To this end, we use the randomized power iteration de-
scribed in Section 8 to build a sample matrix Y of size b × n whose rows
approximately span the same subspace as the b dominant right singular vectors
of A. To be precise, we draw a b × m Gaussian random matrix G and form the
sample matrix
Y = GA(A∗ A)^q ,
where q is a parameter indicating the number of steps of power iteration taken
(in randUTV, the gain from over-sampling is minimal and is generally not worth
the bother). Then we form a unitary matrix Ṽ whose first b columns form an
orthonormal basis for the column space of Y∗ . (The matrix Ṽ consists of a product
of b Householder reflectors, which are determined by executing the standard
Householder QR procedure on the columns of Y∗ .) We then execute b steps of
Householder QR on the matrix AṼ to form a matrix Ũ consisting of a product of
b Householder reflectors. This leaves us with a new matrix
Ã = Ũ∗ A Ṽ
that has the block structure
Ã = [ Ã11 Ã12 ; 0 Ã22 ].
In other words, the top left b × b block is upper triangular, and the bottom left
block is zero. One can also show that all entries of Ã12 are typically small in
magnitude.
function [U, T, V] = randUTV(A, b, q)
T = A;
U = eye(size(A, 1));
V = eye(size(A, 2));
for i = 1 : ceil(size(A, 2)/b)
I1 = 1 : (b(i − 1));
I2 = (b(i − 1) + 1) : size(A, 1);
J2 = (b(i − 1) + 1) : size(A, 2);
if (length(J2 ) > b)
[Û, T̂, V̂] = stepUTV(T(I2 , J2 ), b, q);
else
[Û, T̂, V̂] = svd(T(I2 , J2 ));
end if
U(:, I2 ) = U(:, I2 ) ∗ Û;
V(:, J2 ) = V(:, J2 ) ∗ V̂;
T(I2 , J2 ) = T̂;
T(I1 , J2 ) = T(I1 , J2 ) ∗ V̂;
end for
return
function [U, T , V] = stepUTV(A, b, q)
G = randn(size(A, 1), b);
Y = A∗ G;
for i = 1 : q
Y = A∗ (AY);
end for
[V, ∼] = qr(Y);
[U, D, W] = svd(AV(:, 1 : b));
T = [D, U∗ AV(:, (b + 1) : end)];
V(:, 1 : b) = V(:, 1 : b) ∗ W;
return
Algorithm 15.3.1. The algorithm randUTV (described in Section
15) that given an m × n matrix A computes its UTV factorization
A = UTV∗ , cf. (15.1.1). The input parameters b and q reflect the
block size and the number of steps of power iteration, respec-
tively. The single step function stepUTV is described in Subsec-
tion 15.3. (Observe that most of the unitary matrices that arise
consist of products of Householder reflectors; this property must
be exploited to attain computational efficiency.)

Next, we compute a full SVD of the block Ã11 = ÛD11 V̂∗ . This step is affordable
since Ã11 is of size b × b, where b is small. Then we form the transformation
matrices
U1 = Ũ [ Û 0 ; 0 Im−b ] , and V1 = Ṽ [ V̂ 0 ; 0 In−b ] ,
and set
A1 = U∗1 AV1 .
One can demonstrate that the diagonal entries of D11 typically form accurate
approximations to the first b singular values of A, and that
‖A1,22 ‖ ≈ inf{‖A − B‖ : B has rank b}.
Once the first b columns and rows of A have been processed as described in
this section, randUTV then applies the same procedure to the remaining block
A1,22 , of size (m − b) × (n − b), and then continues in the same fashion to
process all remaining blocks, as outlined in Subsection 15.2. The full algorithm is
summarized in Algorithm 15.3.1.
For more information about the UTV factorization, including careful numerical
experiments that illustrate how it compares in terms of speed and accuracy to
competitors such as column pivoted QR and the traditional SVD, see [35]. Codes
are available for download from https://github.com/flame/randutv.
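For concreteness, Algorithm 15.3.1 can be transcribed almost line for line into NumPy. The sketch below forms all unitary matrices explicitly rather than as products of Householder reflectors, so it illustrates the logic only and does not attain the speed discussed above; the function names are ours, and this is not the optimized code at the link above.

```python
import numpy as np

def step_utv(A, b, q, rng):
    """One block step (cf. stepUTV in Algorithm 15.3.1) on an m x n matrix A."""
    m, n = A.shape
    G = rng.standard_normal((m, b))
    Y = A.T @ G                                  # sketch of the row space
    for _ in range(q):                           # q steps of power iteration
        Y = A.T @ (A @ Y)
    V, _ = np.linalg.qr(Y, mode='complete')      # n x n; first b columns span Y
    U, d, Wt = np.linalg.svd(A @ V[:, :b])       # full SVD, so U is m x m
    D = np.zeros((m, b))
    D[:b, :b] = np.diag(d)
    T = np.hstack([D, U.T @ A @ V[:, b:]])
    V[:, :b] = V[:, :b] @ Wt.T                   # V(:, 1:b) = V(:, 1:b) * W
    return U, T, V

def rand_utv(A, b, q, rng=None):
    """Unoptimized transcription of randUTV (Algorithm 15.3.1)."""
    if rng is None:
        rng = np.random.default_rng()
    m, n = A.shape
    T = A.copy()
    U, V = np.eye(m), np.eye(n)
    for i in range(int(np.ceil(n / b))):
        j = b * i                                # start of the current block
        I2, J2 = slice(j, m), slice(j, n)
        if n - j > b:
            Uh, Th, Vh = step_utv(T[I2, J2], b, q, rng)
        else:                                    # final block: small dense SVD
            Uh, s, Wt = np.linalg.svd(T[I2, J2])
            Th = np.zeros((m - j, n - j))
            Th[:len(s), :len(s)] = np.diag(s)
            Vh = Wt.T
        U[:, I2] = U[:, I2] @ Uh
        V[:, J2] = V[:, J2] @ Vh
        T[:j, J2] = T[:j, J2] @ Vh               # update rows above the block
        T[I2, J2] = Th
    return U, T, V

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 20))
U, T, V = rand_utv(A, b=4, q=2, rng=rng)
rel_err = np.linalg.norm(U @ T @ V.T - A) / np.linalg.norm(A)
```

Running this on a random 30 × 20 matrix, T comes out upper triangular with exact zeros below the diagonal, U T V∗ reproduces A to machine precision, and the leading diagonal entries of T closely track the singular values of A.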
Remark 15.3.2. As discussed in Section 8, the rows of Y approximately span the
linear space spanned by the b dominant right singular vectors of A. The first
b columns of V1 simply form an orthonormal basis for Col(Y∗ ) = Row(Y). Analogously,
the first b columns of U1 approximately form an orthonormal basis for the b
dominant left singular vectors of A. However, the additional application of A that
is implicit in forming the matrix AṼ provides a boost in accuracy, and the rank-
b approximation resulting from one step of randUTV is similar to the accuracy
obtained from the randomized algorithm in Section 8, but with “q + 1/2 steps”
of power iteration, rather than q steps. (We observe that if the first b columns
of Ũ and Ṽ spanned exactly the spaces spanned by the dominant b left and right
singular vectors of A, then we would have Ã12 = 0, and the singular values of Ã11
would be identical to the top b singular values of A.)

References
[1] N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform,
Proceedings of the thirty-eighth annual ACM Symposium on Theory of Computing, 2006, pp. 557–
563. ←200
[2] N. Ailon and E. Liberty, An almost optimal unrestricted fast Johnson-Lindenstrauss transform, ACM
Transactions on Algorithms (TALG) 9 (2013), no. 3, 21. ←200
[3] C. H. Bischof and G. Quintana-Ortí, Algorithm 782: Codes for rank-revealing QR factorizations of
dense matrices, ACM Transactions on Mathematical Software (TOMS) 24 (1998), no. 2, 254–257.
←222
[4] C. H. Bischof and G. Quintana-Ortí, Computing rank-revealing QR factorizations of dense matrices,
ACM Transactions on Mathematical Software (TOMS) 24 (1998), no. 2, 226–253. ←222
[5] C. Boutsidis, M. W Mahoney, and P. Drineas, An improved approximation algorithm for the column
subset selection problem, Proceedings of the twentieth annual acm-siam symposium on discrete
algorithms, 2009, pp. 968–977. ←207
[6] H. Cheng, Z. Gimbutas, P.-G. Martinsson, and V. Rokhlin, On the compression of low rank matrices,
SIAM Journal of Scientific Computing 26 (2005), no. 4, 1389–1404. ←207, 209
[7] P. Drineas, R. Kannan, and M. W. Mahoney, Fast Monte Carlo algorithms for matrices. II. Comput-
ing a low-rank approximation to a matrix, SIAM J. Comput. 36 (2006), no. 1, 158–183 (electronic).
MR2231644 (2008a:68243) ←191
[8] P. Drineas, M. Magdon-Ismail, M. W Mahoney, and D. P Woodruff, Fast approximation of matrix
coherence and statistical leverage, Journal of Machine Learning Research 13 (2012), no. Dec, 3475–
3506. ←207
[9] P. Drineas and M. W. Mahoney, On the Nyström method for approximating a Gram matrix for improved
kernel-based learning, J. Mach. Learn. Res. 6 (2005), 2153–2175. ←205
[10] P. Drineas and M. W. Mahoney, CUR matrix decompositions for improved data analysis, Proceedings
of the National Academy of Sciences 106 (2009), no. 3, 697–702. ←207, 212
[11] P. Drineas and M. W. Mahoney, Lectures on randomized numerical linear algebra, The mathematics
of data, 2018. ←191, 201
[12] P. Drineas, M. W. Mahoney, and S. Muthukrishnan, Relative-error CUR matrix decompositions,
SIAM J. Matrix Anal. Appl. 30 (2008), no. 2, 844–881. MR2443975 (2009k:68269) ←212
[13] J. Duersch and M. Gu, True BLAS-3 performance QRCP using random sampling, 2015. ←223
[14] C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika 1
(1936), no. 3, 211–218. ←192
[15] Facebook, Inc., Fast randomized SVD, 2016. ←190
[16] R. D. Fierro, P. C. Hansen, and P. S. K. Hansen, UTV Tools: Matlab templates for rank-revealing UTV
decompositions, Numerical Algorithms 20 (1999), no. 2-3, 165–194. ←224
[17] A. Frieze, R. Kannan, and S. Vempala, Fast Monte Carlo algorithms for finding low-rank approxima-
tions, J. ACM 51 (2004), no. 6, 1025–1041. (electronic). MR2145262 (2005m:65006) ←190, 191
[18] A. Gittens, A. Devarakonda, E. Racah, M. Ringenburg, L. Gerhardt, J. Kottalam, J. Liu, K.
Maschhoff, S. Canon, J. Chhugani, P. Sharma, J. Yang, J. Demmel, J. Harrell, V. Krishnamurthy,
M. W. Mahoney, and Prabhat, Matrix factorizations at scale: A comparison of scientific data analytics
in spark and C+MPI using three case studies, 2016 IEEE International Conference on Big Data, 2016,
pp. 204–213. ←190
[19] A. Gittens and M. W. Mahoney, Revisiting the nyström method for improved large-scale machine learn-
ing, J. Mach. Learn. Res. 17 (January 2016), no. 1, 3977–4041. ←206
[20] G. H. Golub and C. F. Van Loan, Matrix computations, Third, Johns Hopkins Studies in the Math-
ematical Sciences, Johns Hopkins University Press, Baltimore, MD, 1996. ←191, 222
[21] S. A. Goreinov, N. L. Zamarashkin, and E. E. Tyrtyshnikov, Pseudo-skeleton approximations by
matrices of maximal volume, Mathematical Notes 62 (1997). ←222
[22] M. Gu, Subspace iteration randomization and singular value problems, SIAM Journal on Scientific
Computing 37 (2015), no. 3, A1139–A1173. ←201
[23] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization,
SIAM J. Sci. Comput. 17 (1996), no. 4, 848–869. MR97h:65053 ←207, 209, 222
[24] N. Halko, Randomized methods for computing low-rank approximations of matrices, Ph.D. Thesis, 2012.
←190
[25] N. Halko, P.-G. Martinsson, Y. Shkolnisky, and M. Tygert, An algorithm for the principal component
analysis of large data sets, SIAM Journal on Scientific Computing 33 (2011), no. 5, 2580–2594. ←190
[26] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algo-
rithms for constructing approximate matrix decompositions, SIAM Review 53 (2011), no. 2, 217–288.
←190, 193, 200, 201, 202, 204, 207, 219, 220
[27] D. C. Hoaglin and R. E. Welsch, The Hat matrix in regression and ANOVA, The American Statistician
32 (1978), no. 1, 17–22. ←212
[28] W. Kahan, Numerical linear algebra, Canadian Math. Bull 9 (1966), no. 6, 757–801. ←216
[29] D. M. Kane and J. Nelson, Sparser Johnson-Lindenstrauss transforms, J. ACM 61 (January 2014), no. 1,
4:1–4:23. ←200
[30] E. Liberty, Accelerated dense random projections, Ph.D. Thesis, 2009. ←200
[31] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert, Randomized algorithms for the
low-rank approximation of matrices, Proc. Natl. Acad. Sci. USA 104 (2007), no. 51, 20167–20172.
←190
[32] M. W Mahoney, Randomized algorithms for matrices and data, Foundations and Trends® in Machine
Learning 3 (2011), no. 2, 123–224. ←191
[33] P.-G. Martinsson, Blocked rank-revealing qr factorizations: How randomized sampling can be used to
avoid single-vector pivoting, 2015. ←221, 223
[34] P.-G. Martinsson, Randomized methods for matrix computations and analysis of high dimensional data
(2016), available at arXiv:1607.01649. ←203, 207
[35] P.-G. Martinsson, G. Quintana Orti, and N. Heavner, randUTV: A blocked randomized algorithm for
computing a rank-revealing UTV factorization (2017), available at arXiv:1703.00998. ←190, 227
[36] P.-G. Martinsson, G. Quintana-Ortí, N. Heavner, and R. van de Geijn, Householder qr factorization
with randomization for column pivoting (HQRRP), SIAM Journal on Scientific Computing 39 (2017),
no. 2, C96–C115. ←190, 210, 221, 222, 223
[37] P.-G. Martinsson, V. Rokhlin, and M. Tygert, On interpolation and integration in finite-dimensional
spaces of bounded functions, Comm. Appl. Math. Comput. Sci (2006), 133–142. ←222
[38] P.-G. Martinsson and S. Voronin, A randomized blocked algorithm for efficiently computing rank-
revealing factorizations of matrices., 2015. To appear in SIAM Journal on Scientific Computation,
arXiv:1503.07157. ←190, 210
[39] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the approximation of matri-
ces, Technical Report Yale CS research report YALEU/DCS/RR-1361, Yale University, Computer
Science Department, 2006. ←189, 190
[40] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the decomposition of matrices,
Appl. Comput. Harmon. Anal. 30 (2011), no. 1, 47–68. MR2737933 (2011i:65066) ←190
[41] N. Mitrovic, M. T. Asif, U. Rasheed, J. Dauwels, and P. Jaillet, CUR decomposition for compression
and compressed sensing of large-scale traffic data, 16th International IEEE Conference on Intelligent
Transportation Systems (ITSC), 2013, pp. 1475–1480. ←212
[42] C. Musco and C. Musco, Randomized block Krylov methods for stronger and faster approximate singu-
lar value decomposition, Proceedings of the 28th International Conference on Neural Information
Processing Systems, 2015, pp. 1396–1404. ←205
[43] F. Pourkamali-Anaraki and S. Becker, Randomized clustered Nyström for large-scale kernel machines,
2016. arXiv:1612.06470. ←206
[44] V. Rokhlin, A. Szlam, and M. Tygert, A randomized algorithm for principal component analysis, SIAM
Journal on Matrix Analysis and Applications 31 (2009), no. 3, 1100–1124. ←205
[45] T. Sarlos, Improved approximation algorithms for large matrices via random projections, 2006 47th an-
nual ieee symposium on foundations of computer science (focs’06), 2006, pp. 143–152. ←190, 191,
200
[46] D. C. Sorensen and M. Embree, A DEIM induced CUR factorization, SIAM Journal on Scientific Com-
puting 38 (2016), no. 3, A1454–A1482. ←213
[47] G. W. Stewart, On the early history of the singular value decomposition, SIAM Rev. 35 (1993), no. 4,
551–566. MR1247916 (94f:15001) ←194
[48] G. W. Stewart, Matrix algorithms volume 1: Basic decompositions, SIAM, 1998. ←218, 223, 224
[49] S. Voronin and P.-G. Martinsson, A CUR factorization algorithm based on the interpolative decomposi-
tion (2014), available at arXiv:1412.8447. ←190, 207, 213
[50] S. Wang and Z. Zhang, Improving CUR matrix decomposition and the Nyström approximation via
adaptive sampling, J. Mach. Learn. Res. 14 (2013), 2729–2769. MR3121656 ←207, 212
[51] R. Witten and E. Candes, Randomized algorithms for low-rank matrix factorizations: sharp performance
bounds, Algorithmica 72 (2015), no. 1, 264–281. ←201
[52] D. P. Woodruff, Sketching as a tool for numerical linear algebra, Foundations and Trends in Theoreti-
cal Computer Science 10 (2014), no. 1–2, 1–157. ←191
[53] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, A fast randomized algorithm for the approximation
of matrices, Applied and Computational Harmonic Analysis 25 (2008), no. 3, 335–366. ←200, 219

Mathematical Institute, Andrew Wiles Building, University of Oxford, Oxford, OX2 6GG, United
Kingdom
Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 231–271
https://doi.org/10.1090/pcms/025/00833

Four Lectures on Probabilistic Methods for Data Science

Roman Vershynin
Abstract. Methods of high-dimensional probability play a central role in ap-
plications for statistics, signal processing, theoretical computer science and re-
lated fields. These lectures present a sample of particularly useful tools of high-
dimensional probability, focusing on the classical and matrix Bernstein’s inequal-
ity and the uniform matrix deviation inequality. We illustrate these tools with
applications for dimension reduction, network analysis, covariance estimation,
matrix completion and sparse signal recovery. The lectures are geared towards
beginning graduate students who have taken a rigorous course in probability but
may not have any experience in data science applications.

Contents
1 Lecture 1: Concentration of sums of independent random variables 232
1.1 Sub-gaussian distributions 233
1.2 Hoeffding’s inequality 234
1.3 Sub-exponential distributions 234
1.4 Bernstein’s inequality 235
1.5 Sub-gaussian random vectors 236
1.6 Johnson-Lindenstrauss Lemma 237
1.7 Notes 239
2 Lecture 2: Concentration of sums of independent random matrices 239
2.1 Matrix calculus 239
2.2 Matrix Bernstein’s inequality 241
2.3 Community recovery in networks 244
2.4 Notes 248
3 Lecture 3: Covariance estimation and matrix completion 249
3.1 Covariance estimation 249
3.2 Norms of random matrices 252
3.3 Matrix completion 255
3.4 Notes 258
4 Lecture 4: Matrix deviation inequality 259
4.1 Gaussian width 260
4.2 Matrix deviation inequality 261
4.3 Deriving Johnson-Lindenstrauss Lemma 262

Partially supported by NSF Grant DMS 1265782 and U.S. Air Force Grant FA9550-14-1-0009.

©2018 American Mathematical Society

231
4.4 Covariance estimation 263
4.5 Underdetermined linear equations 264
4.6 Sparse recovery 266
4.7 Notes 268

1. Lecture 1: Concentration of sums of independent random variables
These lectures present a sample of modern methods of high dimensional prob-
ability and illustrate these methods with applications in data science. This sample
is not comprehensive by any means, but it could serve as a point of entry into
a branch of modern probability that is motivated by a variety of data-related
problems.
To get the most out of these lectures, you should have taken a graduate course
in probability, have a good command of linear algebra (including the singular
value decomposition) and be familiar with very basic concepts of functional anal-
ysis (familiarity with Lp norms should be enough).
All of the material of these lectures is covered more systematically, at a slower
pace, and with a wider range of applications, in my forthcoming textbook [60].
You may also be interested in two similar tutorials: [58] is focused on random
matrices, and a more advanced text [59] discusses high-dimensional inference
problems.
It should be possible to use these lectures for a self-study or group study. You
will find here many places where you are invited to do some work (marked in
the text e.g. by “check this!”), and you are encouraged to do it to get a better
grasp of the material. Each lecture ends with a section called “Notes” where you
will find bibliographic references of the results just discussed, as well as various
improvements and extensions.
We are now ready to start.
Probabilistic reasoning has a major impact on modern data science. There are
roughly two ways in which this happens.
• Randomized algorithms, which perform some operations at random, have
long been developed in computer science and remain very popular. Ran-
domized algorithms are among the most effective methods – and some-
times the only known ones – for many data problems.
• Random models of data form the usual premise of statistical analysis. Even
when the data at hand is deterministic, it is often helpful to think of it as a
random sample drawn from some unknown distribution (“population”).
In these lectures, we will encounter both randomized algorithms and random
models of data.
1.1. Sub-gaussian distributions Before we start discussing probabilistic methods,
we will introduce an important class of probability distributions that forms a
natural “habitat” for random variables in many theoretical and applied problems.
These are sub-gaussian distributions. As the name suggests, we will be looking
at an extension of the most fundamental distribution in probability theory – the
gaussian, or normal, distribution N(μ, σ).
It is a good exercise to check the following basic properties of the
standard normal random variable X ∼ N(0, 1):
Tails: P{|X| ≥ t} ≤ 2 exp(−t^2/2) for all t ≥ 0.
Moments: ‖X‖_p := (E |X|^p)^{1/p} = O(√p) as p → ∞.
MGF of square:¹ E exp(cX^2) ≤ 2 for some c > 0.
MGF: E exp(λX) = exp(λ^2/2) for all λ ∈ R.
All these properties tell the same story from four different perspectives. It is
not very difficult to show (although we will not do it here) that for any random
variable X, not necessarily Gaussian, these four properties are essentially equiva-
lent.
Proposition 1.1.1 (Sub-gaussian properties). For a random variable X, the following
properties are equivalent.²
Tails: P{|X| ≥ t} ≤ 2 exp(−t^2/K_1^2) for all t ≥ 0.
Moments: ‖X‖_p ≤ K_2 √p for all p ≥ 1.
MGF of square: E exp(X^2/K_3^2) ≤ 2.
Moreover, if E X = 0 then these properties are also equivalent to the following one:
MGF: E exp(λX) ≤ exp(λ^2 K_4^2) for all λ ∈ R.
Random variables that satisfy one of the first three properties (and thus all of
them) are called sub-gaussian. The best K_3 is called the sub-gaussian norm of X, and
is usually denoted ‖X‖_ψ2 , that is
‖X‖_ψ2 := inf{t > 0 : E exp(X^2/t^2) ≤ 2}.
One can check that ‖·‖_ψ2 indeed defines a norm; it is an example of the general
concept of the Orlicz norm. Proposition 1.1.1 states that the numbers K_i in all four
properties are equivalent to ‖X‖_ψ2 up to absolute constant factors.
Example 1.1.2. As already noted, the standard normal random variable X ∼ N(0, 1)
is sub-gaussian. Similarly, arbitrary normal random variables X ∼ N(μ, σ) are sub-
gaussian. Another example is a Bernoulli random variable X that takes values 0
and 1 with probabilities 1/2 each. More generally, any bounded random variable
X is sub-gaussian. On the contrary, Poisson, exponential, Pareto and Cauchy
distributions are not sub-gaussian. (Verify all these claims; this is not difficult.)
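The tails property is easy to check numerically for the standard normal distribution itself, since P{|X| ≥ t} = erfc(t/√2). The following snippet is our own quick sanity check (taking K_1 = √2, so the bound reads 2 exp(−t^2/2)):

```python
import math

# For X ~ N(0,1), P(|X| >= t) = erfc(t / sqrt(2)); the Tails property with
# K_1 = sqrt(2) says this never exceeds 2 exp(-t^2 / 2).
for t in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
    tail = math.erfc(t / math.sqrt(2.0))
    bound = 2.0 * math.exp(-t * t / 2.0)
    print(f"t = {t:3.1f}: P(|X| >= t) = {tail:.3e} <= {bound:.3e}")
```

The factor 2 in the bound is what makes it valid at t = 0 as well, where the exact tail equals 1.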
¹ MGF stands for moment generating function.
² The parameters K_i > 0 appearing in these properties can be different. However, they may differ
from each other by at most an absolute constant factor. This means that there exists an absolute
constant C such that property 1 implies property 2 with parameter K_2 ≤ CK_1 , and similarly for
every other pair of properties.
1.2. Hoeffding’s inequality You may remember from a basic course in probabil-
ity that the normal distribution N(μ, σ) has a remarkable property: the sum of
independent normal random variables is also normal. Here is a version of this
property for sub-gaussian distributions.
Proposition 1.2.1 (Sums of sub-gaussians). Let X_1 , . . . , X_N be independent, mean
zero, sub-gaussian random variables. Then Σ_{i=1}^N X_i is sub-gaussian, and
‖ Σ_{i=1}^N X_i ‖_ψ2^2 ≤ C Σ_{i=1}^N ‖X_i‖_ψ2^2
where C is an absolute constant.³
Proof. Let us bound the moment generating function of the sum for any λ ∈ R:
E exp(λ Σ_{i=1}^N X_i) = Π_{i=1}^N E exp(λX_i)   (using independence)
≤ Π_{i=1}^N exp(Cλ^2 ‖X_i‖_ψ2^2)   (by the last property in Proposition 1.1.1)
= exp(λ^2 K^2),   where K^2 := C Σ_{i=1}^N ‖X_i‖_ψ2^2.
Using again the last property in Proposition 1.1.1, we conclude that the sum
S = Σ_{i=1}^N X_i is sub-gaussian, and ‖S‖_ψ2 ≤ C_1 K where C_1 is an absolute constant.
The proof is complete. □
Let us rewrite Proposition 1.2.1 in a form that is often more useful in appli-
cations, namely as a concentration inequality. To do this, we simply use the first
property in Proposition 1.1.1 for the sum Σ_{i=1}^N X_i . We immediately get the
following.
Theorem 1.2.2 (General Hoeffding’s inequality). Let X_1 , . . . , X_N be independent,
mean zero, sub-gaussian random variables. Then, for every t ≥ 0 we have
P{ |Σ_{i=1}^N X_i | ≥ t } ≤ 2 exp( −ct^2 / Σ_{i=1}^N ‖X_i‖_ψ2^2 ).
Hoeffding’s inequality controls how far and with what probability a sum of
independent random variables can deviate from its mean, which is zero.
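For Rademacher (±1) signs, the classical form of Hoeffding's inequality gives the explicit bound P{|Σ X_i| ≥ t} ≤ 2 exp(−t^2/2N), and the left side can be computed exactly from the binomial distribution. A small check with illustrative parameter choices (the function name is ours):

```python
import math

def exact_tail(N, t):
    """Exact P(|S| >= t) for S a sum of N independent +-1 signs:
    S = 2k - N where k ~ Binomial(N, 1/2)."""
    count = sum(math.comb(N, k) for k in range(N + 1) if abs(2 * k - N) >= t)
    return count / 2.0 ** N

N = 100
for t in [10, 20, 30]:
    bound = 2.0 * math.exp(-t * t / (2.0 * N))
    print(f"t = {t}: exact tail {exact_tail(N, t):.3e} <= bound {bound:.3e}")
```

Note how both the exact tail and the bound decay at the gaussian rate in t/√N, in line with the central limit theorem.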
1.3. Sub-exponential distributions Sub-gaussian distributions constitute a suf-
ficiently wide class of distributions. Many results in probability and data science
are proved nowadays for sub-gaussian random variables. Still, as we noted, there
are some natural random variables that are not sub-gaussian. For example, the
³ In the future, we will always denote positive absolute constants by C, c, C_1 , etc. These numbers do
not depend on anything. In most cases, one can get good bounds on these constants from the proof,
but the optimal constants for each result are rarely known.
square X^2 of a normal random variable X ∼ N(0, 1) is not sub-gaussian. (Check!)
To cover examples like this, we will introduce the similar but weaker notion of
sub-exponential distributions.
Proposition 1.3.1 (Sub-exponential properties). For a random variable X, the follow-
ing properties are equivalent, in the same sense as in Proposition 1.1.1.
Tails: P{|X| ≥ t} ≤ 2 exp(−t/K_1) for all t ≥ 0.
Moments: ‖X‖_p ≤ K_2 p for all p ≥ 1.
MGF of |X|: E exp(|X|/K_3) ≤ 2.
Moreover, if E X = 0 then these properties imply the following one:
MGF: E exp(λX) ≤ exp(λ^2 K_4^2) for |λ| ≤ 1/K_4 .
Just like we did for sub-gaussian distributions, we call the best K_3 the sub-
exponential norm of X and denote it by ‖X‖_ψ1 , that is
‖X‖_ψ1 := inf {t > 0 : E exp(|X|/t) ≤ 2} .
The square of any sub-gaussian random variable is sub-exponential. Indeed,
inspecting the definitions you will quickly see that
(1.3.2) ‖X^2‖_ψ1 = ‖X‖_ψ2^2 .
(Check!)
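Concretely, if X ∼ N(0, 1) then X^2 has tail P{X^2 ≥ t} = P{|X| ≥ √t} = erfc(√(t/2)), which obeys a sub-exponential bound 2 exp(−t/2) but decays far too slowly to satisfy any sub-gaussian bound for large t. A quick numerical check of our own:

```python
import math

# Tail of X^2 for X ~ N(0,1): P(X^2 >= t) = erfc(sqrt(t/2)).
# It satisfies the sub-exponential bound 2 exp(-t/2) for all t >= 0,
# while any fixed sub-gaussian bound 2 exp(-t^2/K^2) eventually fails.
for t in [1.0, 4.0, 9.0, 16.0]:
    tail = math.erfc(math.sqrt(t / 2.0))
    bound = 2.0 * math.exp(-t / 2.0)
    print(f"t = {t:4.1f}: P(X^2 >= t) = {tail:.3e} <= {bound:.3e}")
```

For instance, the candidate sub-gaussian bound with K = 3 is already violated at t = 16, where the true tail is about 6 × 10⁻⁵ while 2 exp(−16²/9) is of order 10⁻¹².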
1.4. Bernstein’s inequality A version of Hoeffding’s inequality known as Bern-
stein’s inequality holds for sub-exponential random variables. You may naturally
expect to see a sub-exponential tail bound in this result. So it may come as a sur-
prise that Bernstein’s inequality actually has a mixture of two tails – sub-gaussian
and sub-exponential. Let us state and prove the inequality first, and then we will
comment on the mixture of the two tails.
Theorem 1.4.1 (Bernstein’s inequality). Let X_1 , . . . , X_N be independent, mean zero,
sub-exponential random variables. Then, for every t ≥ 0 we have
P{ |Σ_{i=1}^N X_i | ≥ t } ≤ 2 exp( −c min( t^2 / Σ_{i=1}^N ‖X_i‖_ψ1^2 , t / max_i ‖X_i‖_ψ1 ) ).
Proof. For simplicity, we will assume that K = 1 and only prove the one-sided
bound (without absolute value); the general case is not much harder. Our
approach will be based on bounding the moment generating function of the sum
S := Σ_{i=1}^N X_i . To see how the MGF can be helpful here, choose λ ≥ 0 and use
Markov’s inequality to get
(1.4.2) P{S ≥ t} = P{exp(λS) ≥ exp(λt)} ≤ e^{−λt} E exp(λS).
Recall that S = Σ_{i=1}^N X_i and use independence to express the right side of (1.4.2)
as
e^{−λt} Π_{i=1}^N E exp(λX_i ).
(Check!) It remains to bound the MGF of each term Xi , and this is a much simpler
task. If we choose λ small enough so that
(1.4.3) 0 < λ ≤ c / max_i ‖X_i‖_ψ1 ,
then we can use the last property in Proposition 1.3.1 to get
E exp(λX_i ) ≤ exp( Cλ^2 ‖X_i‖_ψ1^2 ).
Substitute into (1.4.2) and conclude that
P{S ≥ t} ≤ exp( −λt + Cλ^2 σ^2 )
where σ^2 = Σ_{i=1}^N ‖X_i‖_ψ1^2 . The left side does not depend on λ while the right side
does. So we can choose λ that minimizes the right side subject to the constraint
(1.4.3). When this is done carefully, we obtain the tail bound stated in Bernstein’s
inequality. (Do this!) □
Now, why does Bernstein’s inequality have a mixture of two tails? The sub-
exponential tail should of course be there. Indeed, even if the entire sum consisted
of a single term Xi , the best bound we could hope for would be of the form
exp(−ct/‖X_i‖_ψ1 ). The sub-gaussian term can be explained by the central limit
theorem, which states that the sum becomes approximately normal as the
number of terms N increases to infinity.
Remark 1.4.4 (Bernstein’s inequality for bounded random variables). Suppose the
random variables Xi are uniformly bounded, which is a stronger assumption than
being sub-gaussian. Then there is a useful version of Bernstein’s inequality, which
unlike Theorem 1.4.1 is sensitive to the variances of Xi ’s. It states that if K > 0 is
such that |X_i | ≤ K almost surely for all i, then, for every t ≥ 0, we have
(1.4.5) P{ |Σ_{i=1}^N X_i | ≥ t } ≤ 2 exp( − (t^2/2) / (σ^2 + CKt) ).
Here σ^2 = Σ_{i=1}^N E X_i^2 is the variance of the sum. This version of Bernstein’s
inequality can be proved in essentially the same way as Theorem 1.4.1. We will
not do it here, but a stronger Theorem 2.2.1, which is valid for matrix-valued
random variables Xi , will be proved in Lecture 2.
To compare this with Theorem 1.4.1, note that σ^2 + CKt ≤ 2 max(σ^2 , CKt). So
we can state the probability bound (1.4.5) as
2 exp( −c min( t^2/σ^2 , t/K ) ).
Just as before, here we also have a mixture of two tails, sub-gaussian and sub-
exponential. The sub-gaussian tail is a bit sharper than in Theorem 1.4.1, since
it depends on the variances rather than sub-gaussian norms of Xi . The sub-
exponential tail, on the other hand, is weaker, since it depends on the sup-norms
rather than the sub-exponential norms of Xi .
1.5. Sub-gaussian random vectors The concept of sub-gaussian distributions
can be extended to higher dimensions. Consider a random vector X taking values
in R^n . We call X a sub-gaussian random vector if all one-dimensional marginals of X,
i.e., the random variables ⟨X, x⟩ for x ∈ R^n , are sub-gaussian. The sub-gaussian
norm of X is defined as
‖X‖_ψ2 := sup_{x∈S^{n−1}} ‖⟨X, x⟩‖_ψ2
where S^{n−1} denotes the unit Euclidean sphere in R^n .

Example 1.5.1. Examples of sub-gaussian distributions in Rⁿ include the
standard normal distribution N(0, Iₙ) (why?), the uniform distribution on the
centered Euclidean sphere of radius √n, the uniform distribution on the cube
{−1, 1}ⁿ, and many others. The last example can be generalized: a random vector
X = (X₁, …, Xₙ) with independent and sub-gaussian coordinates is sub-gaussian,
with ‖X‖_ψ₂ ≤ C maxᵢ ‖Xᵢ‖_ψ₂.
1.6. Johnson-Lindenstrauss Lemma Concentration inequalities like Hoeffding’s
and Bernstein’s are successfully used in the analysis of algorithms. Let us give
one example, for the problem of dimension reduction. Suppose we have some data
that is represented as a set of N points in Rⁿ. (Think, for example, of n gene
expressions of N patients.)
We would like to compress the data by representing it in a lower-dimensional
space Rᵐ instead of Rⁿ, with m ≪ n. By how much can we reduce the dimension
without losing the important features of the data?
The basic result in this direction is the Johnson-Lindenstrauss Lemma. It states
that a remarkably simple dimension reduction method works – a random linear
map from Rn to Rm with
m ∼ log N,
see Figure 1.6.3. The logarithmic function grows very slowly, so we can usually
reduce the dimension dramatically.
What exactly is a random linear map? Several models are possible. Here we
will model such a map using a Gaussian random matrix – an m × n matrix A with
independent N(0, 1) entries. More generally, we can consider an m × n matrix A
whose rows are independent, mean zero, isotropic⁴ and sub-gaussian random
vectors in Rⁿ. For example, the entries of A can be independent Rademacher
random variables – those taking values ±1 with equal probabilities.

238 Four Lectures on Probabilistic Methods for Data Science

Theorem 1.6.1 (Johnson-Lindenstrauss Lemma). Let X be a set of N points in Rⁿ
and ε ∈ (0, 1). Consider an m × n matrix A whose rows are independent, mean zero,
isotropic and sub-gaussian random vectors in Rⁿ. Rescale A by defining the “Gaussian
random projection”⁵

    P := (1/√m) A.

Assume that

    m ≥ C ε⁻² log N,

where C is an appropriately large constant that depends only on the sub-gaussian norms of
the vectors Xᵢ. Then, with high probability (say, 0.99), the map P preserves the distances
between all points in X with error ε, that is

(1.6.2)    (1 − ε) ‖x − y‖₂ ≤ ‖Px − Py‖₂ ≤ (1 + ε) ‖x − y‖₂ for all x, y ∈ X.

⁴A random vector X ∈ Rⁿ is called isotropic if E XXᵀ = Iₙ.

Figure 1.6.3. The Johnson-Lindenstrauss Lemma states that a random projection
of N data points from dimension n to dimension m ∼ log N approximately
preserves the distances between the points.

Proof. Let us take a closer look at the desired conclusion (1.6.2). By linearity,
Px − Py = P(x − y). So, dividing the inequality by ‖x − y‖₂, we can rewrite (1.6.2)
in the following way:

(1.6.4)    1 − ε ≤ ‖Pz‖₂ ≤ 1 + ε for all z ∈ T,

where

    T := { (x − y)/‖x − y‖₂ : x, y ∈ X distinct points }.

It will be convenient to square the inequality (1.6.4). Using that 1 + ε ≤ (1 + ε)²
and 1 − ε ≥ (1 − ε)², we see that it is enough to show that

(1.6.5)    1 − ε ≤ ‖Pz‖₂² ≤ 1 + ε for all z ∈ T.

By construction, the coordinates of the vector Pz = (1/√m) Az are (1/√m) ⟨Xᵢ, z⟩. Thus
we can restate (1.6.5) as

(1.6.6)    | (1/m) Σ_{i=1}^m ⟨Xᵢ, z⟩² − 1 | ≤ ε for all z ∈ T.
Results like (1.6.6) are often proved by combining concentration and a union
bound. In order to use concentration, we first fix z ∈ T. By assumption, the
random variables ⟨Xᵢ, z⟩² − 1 are independent; they have zero mean (use isotropy
⁵Strictly speaking, this P is not a projection since it maps Rⁿ to a different space Rᵐ.
to check this!), and they are sub-exponential (use (1.3.2) to check this). Then
Bernstein’s inequality (Theorem 1.4.1) gives

    P( | (1/m) Σ_{i=1}^m (⟨Xᵢ, z⟩² − 1) | > ε ) ≤ 2 exp(−c ε² m).

(Check!)
Finally, we can unfix z by taking a union bound over all possible z ∈ T:

    P( max_{z∈T} | (1/m) Σ_{i=1}^m ⟨Xᵢ, z⟩² − 1 | > ε ) ≤ Σ_{z∈T} P( | (1/m) Σ_{i=1}^m ⟨Xᵢ, z⟩² − 1 | > ε )

(1.6.7)        ≤ |T| · 2 exp(−c ε² m).

By definition of T, we have |T| ≤ N². So, if we choose m ≥ C ε⁻² log N with an
appropriately large constant C, we can make (1.6.7) bounded by 0.01. The proof
is complete. □
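The lemma is short enough to test numerically. The numpy sketch below (the dimensions and the 0.5 distortion threshold are arbitrary illustrative choices) projects N Gaussian data points with P = A/√m and measures the worst pairwise distortion:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 1000, 50                    # ambient dimension, number of points
m = 300                            # target dimension, on the order of eps^-2 log N
X = rng.normal(size=(N, n))        # data points as rows of X

A = rng.normal(size=(m, n))        # Gaussian random matrix with N(0,1) entries
P = A / np.sqrt(m)                 # "Gaussian random projection" P = A / sqrt(m)
Y = X @ P.T                        # projected points, rows in R^m

# Worst relative distortion of pairwise distances over all pairs.
worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        d_old = np.linalg.norm(X[i] - X[j])
        d_new = np.linalg.norm(Y[i] - Y[j])
        worst = max(worst, abs(d_new / d_old - 1))

print(worst)   # typically a small fraction, well below 1
```

With m on the order of ε⁻² log N the worst distortion stays around the target ε, even though m is far smaller than n.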
1.7. Notes The material presented in Sections 1.1–1.5 is basic and can be found
e.g. in [58] and [60] with all the proofs. Bernstein’s and Hoeffding’s inequalities
that we covered here are two basic examples of concentration inequalities. There
are many other useful concentration inequalities for sums of independent random
variables (e.g. Chernoff’s and Bennett’s) and for more general objects. The textbook
[60] is an elementary introduction to concentration; the books [10, 38, 39]
offer more comprehensive and more advanced accounts of this area.
The original version of the Johnson-Lindenstrauss Lemma was proved in [31]. The
version we gave here, Theorem 1.6.1, was stated with probability of success 0.99,
but an inspection of the proof gives probability 1 − 2 exp(−cε²m), which is much
better for large m. A great variety of ramifications and applications of the Johnson-
Lindenstrauss lemma are known; see e.g. [2, 4, 7, 10, 34, 42].

2. Lecture 2: Concentration of sums of independent random matrices


In the previous lecture we proved Bernstein’s inequality, which quantifies how
a sum of independent random variables concentrates about its mean. We will
now study an extension of Bernstein’s inequality to higher dimensions, which
holds for sums of independent random matrices.
2.1. Matrix calculus The key idea of developing a matrix Bernstein’s inequality
will be to use matrix calculus, which allows us to operate with matrices as with
scalars – adding and multiplying them of course, but also comparing matrices
and applying functions to matrices. Let us explain this.
We can compare matrices to each other using the notion of being positive
semidefinite. Let us focus here on n × n symmetric matrices. If A − B is a positive
semidefinite matrix,⁶ which we denote A − B ⪰ 0, then we say that A ⪰ B (and,
of course, B ⪯ A). This defines a partial order on the set of n × n symmetric
matrices. The term “partial” indicates that, unlike the real numbers, there exist n × n
symmetric matrices A and B that cannot be compared. (Give an example where
neither A ⪰ B nor B ⪯ A!)
Next, let us guess how to measure the magnitude of a matrix A. The magnitude
of a scalar a ∈ R is measured by the absolute value |a|; it is the smallest non-negative
number t such that

    −t ≤ a ≤ t.

Extending this reasoning to matrices, we can measure the magnitude of an n × n
symmetric matrix A by the smallest non-negative number t such that⁷

    −tIₙ ⪯ A ⪯ tIₙ.

The smallest such t is called the operator norm of A and is denoted ‖A‖. Diagonalizing
A, we can see that

(2.1.1)    ‖A‖ = max{ |λ| : λ is an eigenvalue of A }.

With a little more work (do it!), we can see that ‖A‖ is the norm of A acting as a
linear operator on the Euclidean space (Rⁿ, ‖·‖₂); this is why ‖A‖ is called the
operator norm. Thus ‖A‖ is the smallest non-negative number M such that

    ‖Ax‖₂ ≤ M ‖x‖₂ for all x ∈ Rⁿ.
Finally, we will need to be able to take functions of matrices. Let f : R → R
be a function and X be an n × n symmetric matrix. We can define f(X) in two
equivalent ways. The spectral theorem allows us to represent X as

    X = Σ_{i=1}^n λᵢ uᵢ uᵢᵀ,

where λᵢ are the eigenvalues of X and uᵢ are the corresponding eigenvectors.
Then we can simply define

    f(X) := Σ_{i=1}^n f(λᵢ) uᵢ uᵢᵀ.

Note that f(X) has the same eigenvectors as X, but the eigenvalues change under
the action of f. An equivalent way to define f(X) is using power series. Suppose
the function f has a convergent power series expansion about some point x₀ ∈ R,
i.e.

    f(x) = Σ_{k=1}^∞ a_k (x − x₀)^k.

6 Recallthat a symmetric real n × n matrix M is called positive semidefinite if xT Mx  0 for any


vector x ∈ Rn .
⁷Here and later, Iₙ denotes the n × n identity matrix.
Then one can check that the following matrix series converges⁸ and defines f(X):

    f(X) = Σ_{k=1}^∞ a_k (X − x₀ Iₙ)^k.

(Check!)
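The spectral definition of f(X) translates directly into code. A minimal numpy sketch (the helper name matrix_function is our own, not from the text):

```python
import numpy as np

def matrix_function(X, f):
    """Apply a scalar function f to a symmetric matrix X via the spectral theorem:
    X = sum_i lam_i u_i u_i^T  =>  f(X) = sum_i f(lam_i) u_i u_i^T."""
    lam, U = np.linalg.eigh(X)        # eigenvalues and orthonormal eigenvectors
    return (U * f(lam)) @ U.T         # rescale eigenvalues, keep eigenvectors

# On a diagonal matrix, f simply acts entrywise on the diagonal.
D = np.diag([0.0, 1.0, 2.0])
E = matrix_function(D, np.exp)
assert np.allclose(np.diag(E), np.exp([0.0, 1.0, 2.0]))

# The operator norm (2.1.1) is the largest eigenvalue in magnitude.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
assert np.isclose(np.linalg.norm(A, 2), max(abs(np.linalg.eigvalsh(A))))
```

The same helper will exponentiate matrices, which is exactly what the matrix MGF in the next section requires.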
2.2. Matrix Bernstein’s inequality We are now ready to state and prove a remarkable
generalization of Bernstein’s inequality for random matrices.

Theorem 2.2.1 (Matrix Bernstein’s inequality). Let X₁, …, X_N be independent, mean
zero, n × n symmetric random matrices, such that ‖Xᵢ‖ ≤ K almost surely for all i.
Then, for every t ≥ 0 we have

    P( ‖Σ_{i=1}^N Xᵢ‖ ≥ t ) ≤ 2n · exp( − (t²/2) / (σ² + Kt/3) ).

Here σ² = ‖Σ_{i=1}^N E Xᵢ²‖ is the norm of the “matrix variance” of the sum.

The scalar case, where n = 1, is the classical Bernstein’s inequality we stated
in (1.4.5). A remarkable feature of matrix Bernstein’s inequality, which makes
it especially powerful, is that it does not require any independence of the entries (or
the rows or columns) of Xᵢ; all that is needed is that the random matrices Xᵢ be
independent from each other.
In the rest of this section we will prove matrix Bernstein’s inequality, and give
a few applications in this and the next lecture.
Our proof will be based on bounding the moment generating function (MGF)
E exp(λS) of the sum S = Σ_{i=1}^N Xᵢ. Note that to exponentiate the matrix λS in
order to define the matrix MGF, we rely on the matrix calculus that we introduced
in Section 2.1.
If the terms Xᵢ were scalars, independence would yield the classical fact that
the MGF of a sum is the product of the MGFs, i.e.

(2.2.2)    E exp(λS) = E Π_{i=1}^N exp(λXᵢ) = Π_{i=1}^N E exp(λXᵢ).
But for matrices, this reasoning breaks down badly, for in general

    e^{X+Y} ≠ e^X e^Y

even for 2 × 2 symmetric matrices X and Y. (Give a counterexample!)
Fortunately, there are some trace inequalities that can often serve as proxies for
the missing identity e^{X+Y} = e^X e^Y. One such proxy is the Golden-Thompson
inequality, which states that

(2.2.3)    tr(e^{X+Y}) ≤ tr(e^X e^Y)
⁸The convergence holds in any given metric on the set of matrices, for example in the metric given by
the operator norm. In this series, the terms (X − x₀ Iₙ)^k are defined by the usual matrix product.
for any n × n symmetric matrices X and Y. Another result, which we will actually
use in the proof of matrix Bernstein’s inequality, is Lieb’s inequality.

Theorem 2.2.4 (Lieb’s inequality). Let H be an n × n symmetric matrix. Then the
function

    f(X) = tr exp(H + log X)

is concave⁹ on the space of positive definite n × n symmetric matrices.

Note that in the scalar case, where n = 1, the function f in Lieb’s inequality is
linear and the result is trivial.
To use Lieb’s inequality in a probabilistic context, we will combine it with
the classical Jensen’s inequality. It states that for any concave function f and a
random matrix X, one has¹⁰

(2.2.5)    E f(X) ≤ f(E X).

Using this for the function f in Lieb’s inequality, we get

    E tr exp(H + log X) ≤ tr exp(H + log E X).

And changing variables to X = e^Z, we get the following:

Lemma 2.2.6 (Lieb’s inequality for random matrices). Let H be a fixed n × n symmetric
matrix and Z be an n × n symmetric random matrix. Then

    E tr exp(H + Z) ≤ tr exp(H + log E e^Z).

Lieb’s inequality is a perfect tool for bounding the MGF of a sum of independent
random matrices S = Σ_{i=1}^N Xᵢ. To do this, let us condition on the random
matrices X₁, …, X_{N−1}. Apply Lemma 2.2.6 for the fixed matrix H := Σ_{i=1}^{N−1} λXᵢ
and the random matrix Z := λX_N, and afterwards take the expectation with respect
to X₁, …, X_{N−1}. By the law of total expectation, we get

    E tr exp(λS) ≤ E tr exp( Σ_{i=1}^{N−1} λXᵢ + log E e^{λX_N} ).

Next, apply Lemma 2.2.6 in a similar manner for H := Σ_{i=1}^{N−2} λXᵢ + log E e^{λX_N}
and Z := λX_{N−1}, and so on. After N steps, we obtain:

Lemma 2.2.7 (MGF of a sum of independent random matrices). Let X₁, …, X_N be
independent n × n symmetric random matrices. Then the sum S = Σ_{i=1}^N Xᵢ satisfies

    E tr exp(λS) ≤ tr exp( Σ_{i=1}^N log E e^{λXᵢ} ).
⁹Formally, concavity of f means that f(λX + (1 − λ)Y) ≥ λf(X) + (1 − λ)f(Y) for all symmetric
matrices X and Y and all λ ∈ [0, 1].
¹⁰Jensen’s inequality is usually stated for a convex function g and a scalar random variable X, and
it reads g(E X) ≤ E g(X). From this, inequality (2.2.5) for concave functions and random matrices
easily follows (Check!).
Think of this inequality as a matrix version of the scalar identity (2.2.2). The
main difference is that it bounds the trace of the MGF¹¹ rather than the MGF itself.
You may recall from a course in probability theory that the quantity log E e^{λXᵢ}
that appears in this bound is called the cumulant generating function of Xᵢ.
Lemma 2.2.7 reduces the complexity of our task significantly, for it is much
easier to bound the cumulant generating function of each single random matrix
Xᵢ than to say something about their sum. Here is a simple bound.

Lemma 2.2.8 (Moment generating function). Let X be an n × n symmetric random
matrix. Assume that E X = 0 and ‖X‖ ≤ K almost surely. Then, for all 0 < λ < 3/K we
have

    E exp(λX) ⪯ exp( g(λ) E X² ), where g(λ) = (λ²/2) / (1 − λK/3).

Proof. First, check that the following scalar inequality holds for 0 < λ < 3/K and
|x| ≤ K:

    e^{λx} ≤ 1 + λx + g(λ)x².

Then extend it to matrices using matrix calculus: if 0 < λ < 3/K and ‖X‖ ≤ K
then

    e^{λX} ⪯ I + λX + g(λ)X².

(Do these two steps carefully!) Finally, take the expectation and recall that E X = 0
to obtain

    E e^{λX} ⪯ I + g(λ) E X² ⪯ exp( g(λ) E X² ).

In the last inequality, we used the matrix version of the scalar inequality 1 + z ≤ e^z
that holds for all z ∈ R. The lemma is proved. □
Proof of Matrix Bernstein’s inequality. We would like to bound the operator norm
of the random matrix S = Σ_{i=1}^N Xᵢ, which, as we know from (2.1.1), is the largest
eigenvalue of S in magnitude. For simplicity of exposition, let us drop the absolute
value from (2.1.1) and just bound the maximal eigenvalue of S, which we denote
λ_max(S). (Once this is done, we can repeat the argument for −S to reinstate the
absolute value. Do this!) So, we are to bound

    P( λ_max(S) ≥ t ) = P( e^{λ·λ_max(S)} ≥ e^{λt} )    (multiply by λ > 0 and exponentiate)
        ≤ e^{−λt} E e^{λ·λ_max(S)}    (by Markov’s inequality)
        = e^{−λt} E λ_max(e^{λS})    (check!)
        ≤ e^{−λt} E tr e^{λS}    (max of the eigenvalues is bounded by their sum)
        ≤ e^{−λt} tr exp( Σ_{i=1}^N log E e^{λXᵢ} )    (use Lemma 2.2.7)
        ≤ e^{−λt} tr exp( g(λ) Z )    (by Lemma 2.2.8)

where

    Z := Σ_{i=1}^N E Xᵢ².

Bounding the trace by n times the maximal eigenvalue, tr exp(g(λ)Z) ≤ n exp(g(λ)σ²)
with σ² = ‖Z‖, so

    P( λ_max(S) ≥ t ) ≤ n exp( −λt + g(λ)σ² ).

It remains to optimize this bound in λ. The minimum value is attained for
λ = t/(σ² + Kt/3). (Check!) With this value of λ, we conclude

    P( λ_max(S) ≥ t ) ≤ n · exp( − (t²/2) / (σ² + Kt/3) ).

This completes the proof of Theorem 2.2.1. □

¹¹Note that the order of expectation and trace can be swapped using linearity.
Matrix Bernstein’s inequality gives a powerful tail bound for ‖Σ_{i=1}^N Xᵢ‖. This easily
implies a useful bound on the expectation:

Corollary 2.2.9 (Expected norm of sum of random matrices). Let X₁, …, X_N be
independent, mean zero, n × n symmetric random matrices, such that ‖Xᵢ‖ ≤ K almost
surely for all i. Then

    E ‖Σ_{i=1}^N Xᵢ‖ ≲ σ √(log n) + K log n,

where σ = ‖Σ_{i=1}^N E Xᵢ²‖^{1/2}.
Proof. The link from tail bounds to expectations is provided by the basic identity

(2.2.10)    E Z = ∫₀^∞ P(Z > t) dt,

which is valid for any non-negative random variable Z. (Check it!) Integrating
the tail bound given by matrix Bernstein’s inequality, you will arrive at the
expectation bound we claimed. (Check!) □
Notice that the bound in this corollary has mild, logarithmic, dependence on
the ambient dimension n. As we will see shortly, this can be an important feature
in some applications.
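As an illustration of how Theorem 2.2.1 is used in practice, the simulation below (the matrices, the sign-flip construction, and all parameters are our arbitrary illustrative choices) draws sums of independent random sign flips of fixed symmetric matrices and checks that the empirical tail of ‖S‖ stays below the bound 2n · exp(−(t²/2)/(σ² + Kt/3)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, trials = 5, 200, 2000

# Fixed symmetric matrices B_i with ||B_i|| <= K; X_i = eps_i * B_i is mean zero.
B = rng.normal(size=(N, n, n))
B = (B + B.transpose(0, 2, 1)) / 2
K = max(np.linalg.norm(Bi, 2) for Bi in B)

# Matrix variance: sigma^2 = ||sum_i E X_i^2|| = ||sum_i B_i^2||.
sigma2 = np.linalg.norm(sum(Bi @ Bi for Bi in B), 2)

t = 2.5 * np.sqrt(sigma2)
bound = 2 * n * np.exp(-(t**2 / 2) / (sigma2 + K * t / 3))

# Empirical tail of ||S|| over independent trials, S = sum_i eps_i B_i.
eps = rng.choice([-1.0, 1.0], size=(trials, N))
norms = np.array([np.linalg.norm(np.tensordot(e, B, axes=1), 2) for e in eps])
empirical = np.mean(norms >= t)

assert empirical <= bound
```

As the scaling t ∼ √σ² suggests, the typical norm of S is on the order of σ, and the matrix Bernstein bound controls the fluctuations above that level.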
2.3. Community recovery in networks Matrix Bernstein’s inequality has many
applications. The one we are going to discuss first is for the analysis of networks.
A network can be mathematically represented by a graph, a set of n vertices
with edges connecting some of them. For simplicity, we will consider undirected
graphs where the edges do not have arrows. Real world networks often tend to
have clusters, or communities – subsets of vertices that are connected by unusually
many edges. (Think, for example, about a friendship network where communities
form around some common interests.) An important problem in data science is
to recover communities from a given network.
We are going to explain one of the simplest methods for community recovery,
which is called spectral clustering. But before we introduce it, we will first of all
place a probabilistic model on the networks we consider. In other words, it will be
convenient for us to view networks as random graphs whose edges are formed at
random. Although not all real-world networks are truly random, this simplistic
model can motivate us to develop algorithms that may empirically succeed also
for real-world networks.
The basic probabilistic model of random graphs is the Erdös-Rényi model.
Definition 2.3.1 (Erdös-Rényi model). Consider a set of n vertices and connect every
pair of vertices independently and with fixed probability p. The resulting random graph
is said to follow the Erdös-Rényi model G(n, p).
The Erdös-Rényi random model is very simple. But it is not a good choice if
we want to model a network with communities, for every pair of vertices has the
same chance to be connected. So let us introduce a natural generalization of the
Erdös-Rényi random model that does allow for community structure:
Definition 2.3.2 (Stochastic block model). Partition a set of n vertices into two subsets
(“communities”) with n/2 vertices each, and connect every pair of vertices independently
with probability p if they belong to the same community and q < p if not. The resulting
random graph is said to follow the stochastic block model G(n, p, q).
Figure 2.3.3 illustrates a simulation of a stochastic block model.

Figure 2.3.3. A network generated according to the stochastic block
model G(n, p, q) with n = 200 nodes and connection probabilities
p = 1/20 and q = 1/200.

Suppose we are shown one instance of a random graph generated according
to a stochastic block model G(n, p, q). How can we find which vertices belong to
which community?
The spectral clustering algorithm we are going to explain will do precisely this.
It will be based on the spectrum of the adjacency matrix A of the graph, which is
the n × n symmetric matrix whose entries Aij equal 1 if the vertices i and j are
connected by an edge, and 0 otherwise.12
The adjacency matrix A is a random matrix. Let us compute its expectation
first. This is easy, since the entries of A are Bernoulli random variables. If i and j
belong to the same community then E Aᵢⱼ = p, and otherwise E Aᵢⱼ = q. Thus E A
has block structure: for example, if n = 4 then E A looks like this:

    E A = ⎡ p p q q ⎤
          ⎢ p p q q ⎥
          ⎢ q q p p ⎥
          ⎣ q q p p ⎦

(For illustration purposes, we grouped the vertices from each community together.)

¹²For convenience, we call the vertices of the graph 1, 2, …, n.
You can easily check that E A has rank 2, and that its non-zero eigenvalues and the
corresponding eigenvectors are

    λ₁(E A) = ((p + q)/2) n,  v₁(E A) = [1, 1, 1, 1]ᵀ;
    λ₂(E A) = ((p − q)/2) n,  v₂(E A) = [1, 1, −1, −1]ᵀ.

(Check!)
The eigenvalues and eigenvectors of E A tell us a lot about the community
structure of the underlying graph. Indeed, the first (larger) eigenvalue,

    d := ((p + q)/2) n,

is the expected degree of any vertex of the graph.¹³ The second eigenvalue tells
us whether there is any community structure at all (there is none when p = q,
since then λ₂(E A) = 0). The first eigenvector v₁ is not informative about the structure
of the network. It is the second eigenvector v₂ that tells us exactly how to
separate the vertices into the two communities: the signs of the coefficients of v₂
can be used for this purpose.
Thus if we know E A, we can recover the community structure of the network
from the signs of the second eigenvector. The problem is that we do not know
E A. Instead, we know the adjacency matrix A. If, by some chance, A is not far
from E A, we may hope to use A to approximately recover the community
structure. So is it true that A ≈ E A? The answer is yes, and we can prove it
using matrix Bernstein’s inequality.

Theorem 2.3.4 (Concentration of the stochastic block model). Let A be the adjacency
matrix of a G(n, p, q) random graph. Then

    E ‖A − E A‖ ≲ √(d log n) + log n.

Here d = (p + q)n/2 is the expected degree.

¹³The degree of a vertex is the number of edges connected to it.
Proof. Let us sketch the argument. To use matrix Bernstein’s inequality, let us
break A into a sum of independent random matrices

    A = Σ_{i≤j} Xᵢⱼ,

where each matrix Xᵢⱼ contains a pair of symmetric entries of A, or one diagonal
entry.¹⁴ Matrix Bernstein’s inequality obviously applies to the sum

    A − E A = Σ_{i≤j} (Xᵢⱼ − E Xᵢⱼ).

Corollary 2.2.9 gives¹⁵

(2.3.5)    E ‖A − E A‖ ≲ σ √(log n) + K log n,

where σ² = ‖Σ_{i≤j} E (Xᵢⱼ − E Xᵢⱼ)²‖ and K = max_{i≤j} ‖Xᵢⱼ − E Xᵢⱼ‖. It is a good
exercise to check that

    σ² ≲ d and K ≤ 2.

(Do it!) Substituting into (2.3.5), we complete the proof. □
How useful is Theorem 2.3.4 for community recovery? Suppose that the network
is not too sparse, namely

    d ≫ log n.

Then

    ‖A − E A‖ ≲ √(d log n), while ‖E A‖ = λ₁(E A) = d,

which implies that

    ‖A − E A‖ ≪ ‖E A‖.

In other words, A nicely approximates E A: the relative error of approximation
is small in the operator norm.
At this point one can apply classical results from the perturbation theory for
matrices, which state that since A and E A are close, their eigenvalues and eigenvectors
must also be close. The relevant perturbation results are Weyl’s inequality
for eigenvalues and the Davis-Kahan inequality for eigenvectors, which we will not
reproduce here. Heuristically, what they give us is

    v₂(A) ≈ v₂(E A) = [1, 1, −1, −1]ᵀ.

¹⁴Precisely, if i ≠ j, then Xᵢⱼ has all zero entries except the (i, j) and (j, i) entries, which can potentially
equal 1. If i = j, the only non-zero entry of Xᵢⱼ is the (i, i) entry.
¹⁵We will liberally use the notation ≲ to hide constant factors appearing in the inequalities. Thus,
a ≲ b means that a ≤ Cb for some constant C.
Then we should expect that most of the coefficients of v2 (A) are positive on one
community and negative on the other. So we can use v2 (A) to approximately
recover the communities. This method is called spectral clustering:
Spectral Clustering Algorithm. Compute v2 (A), the eigenvector corresponding to
the second largest eigenvalue of the adjacency matrix A of the network. Use the signs of
the coefficients of v2 (A) to predict the community membership of the vertices.
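The whole pipeline — sampling G(n, p, q) and reading off the signs of v₂(A) — fits in a short numpy script. The parameters below are illustrative; with p = 0.2 and q = 0.02 the two communities are strongly separated, so near-exact recovery is expected:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 200, 0.2, 0.02
truth = np.array([1] * (n // 2) + [-1] * (n // 2))   # ground-truth communities

# Sample the adjacency matrix of G(n, p, q): a symmetric 0/1 matrix.
probs = np.where(np.equal.outer(truth, truth), p, q)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(float)

# Spectral clustering: signs of the eigenvector of the 2nd largest eigenvalue.
vals, vecs = np.linalg.eigh(A)            # eigenvalues in ascending order
v2 = vecs[:, -2]                          # eigenvector of the 2nd largest eigenvalue
labels = np.sign(v2)

# Accuracy up to a global sign flip of the labels.
agree = np.mean(labels == truth)
accuracy = max(agree, 1 - agree)
assert accuracy > 0.9
```

Here a = pn = 40 and b = qn = 4, so the separation condition (a − b)² ≫ log(n)(a + b) from the theorem below holds with a wide margin.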
We saw that spectral clustering should perform well for the stochastic block
model G(n, p, q) if it is not too sparse, namely if the expected degrees satisfy
d = (p + q)n/2 ≫ log n.
A more careful analysis along these lines, which you should be able to do
yourself with some work, leads to the following more rigorous result.

Theorem 2.3.6 (Guarantees of spectral clustering). Consider a random graph generated
according to the stochastic block model G(n, p, q) with p > q, and set a = pn,
b = qn. Suppose that

(2.3.7)    (a − b)² ≫ log(n) (a + b).

Then, with high probability, the spectral clustering algorithm recovers the communities
up to o(n) misclassified vertices.

Note that condition (2.3.7) implies that the expected degrees are not too small,
namely d = (a + b)/2 ≫ log(n) (check!). It also ensures that a and b are sufficiently
different: recall that if a = b then the network is an Erdös-Rényi graph without
any community structure.
2.4. Notes The idea to extend concentration inequalities like Bernstein’s to
matrices goes back to R. Ahlswede and A. Winter [3]. They used the Golden-Thompson
inequality (2.2.3) and proved a slightly weaker form of matrix Bernstein’s inequality
than we gave in Section 2.2. R. Oliveira [48, 49] found a way to improve
this argument and gave a result similar to Theorem 2.2.1. The version of matrix
Bernstein’s inequality we gave here (Theorem 2.2.1) and the proof based on Lieb’s
inequality are due to J. Tropp [52].
The survey [53] contains a comprehensive introduction to matrix calculus, a
proof of Lieb’s inequality (Theorem 2.2.4), a detailed proof of matrix Bernstein’s
inequality (Theorem 2.2.1), and a variety of applications. A proof of the Golden-Thompson
inequality (2.2.3) can be found in [8, Theorem 9.3.7].
In Section 2.3 we scratched the surface of the interdisciplinary area of network
analysis. For a systematic introduction to networks, refer to the book
[47]. Stochastic block models (Definition 2.3.2) were introduced in [33]. The
community recovery problem in stochastic block models, sometimes also called
community detection problem, has been in the spotlight in the last few years.
A vast and still growing body of literature exists on algorithms and theoretical
results for community recovery, see the book [47], the survey [22], papers such as
[9, 29, 30, 32, 37, 46, 61] and the references therein.
A concentration result similar to Theorem 2.3.4 can be found in [48]; the argument
there is also based on matrix concentration. This theorem is not quite
optimal. For dense networks, where the expected degree d satisfies d ≳ log n, the
concentration inequality in Theorem 2.3.4 can be improved to

(2.4.1)    E ‖A − E A‖ ≲ √d.

This improved bound goes back to the original paper [21], which studies the simpler
Erdös-Rényi model, but the results extend to stochastic block models [17]; it
can also be deduced from [6, 32, 37].
If the network is relatively dense, i.e. d ≳ log n, one can improve the guarantee
(2.3.7) of spectral clustering in Theorem 2.3.6 to

    (a − b)² ≫ (a + b).
All one has to do is use the improved concentration inequality (2.4.1) instead of
Theorem 2.3.4. Furthermore, in this case there exist algorithms that can recover
the communities exactly, i.e. without any misclassified vertices, and with high
probability; see e.g. [1, 17, 32, 43].
For sparser networks, where d ≲ log n and possibly even d = O(1), relatively
few algorithms were known until recently, but now there exist many approaches
that provably recover communities in sparse stochastic block models; see, for
example, [9, 17, 29, 30, 37, 46, 61].

3. Lecture 3: Covariance estimation and matrix completion

In the last lecture, we proved matrix Bernstein’s inequality and gave an application
to network analysis. We will spend this lecture discussing a couple of
other interesting applications of matrix Bernstein’s inequality. In Section 3.1 we
will work on covariance estimation, a basic problem in high-dimensional statistics.
In Section 3.2, we will derive a useful bound on norms of random matrices, which
unlike Bernstein’s inequality does not require any boundedness assumptions on
the distribution. We will apply this bound in Section 3.3 to a problem of matrix
completion, where we are shown a small sample of the entries of a matrix and
asked to guess the missing entries.
3.1. Covariance estimation Covariance estimation is a problem of fundamental
importance in high-dimensional statistics. Suppose we have a sample of data
points X₁, …, X_N in Rⁿ. It is often reasonable to assume that these points are
independently sampled from the same probability distribution (or “population”),
which is unknown. We would like to learn something useful about this distribution.
Denote by X a random vector that has this (unknown) distribution. The most
basic parameter of the distribution is the mean E X. One can estimate E X from
the sample by computing the sample mean (1/N) Σ_{i=1}^N Xᵢ. The law of large numbers
guarantees that the estimate becomes tight as the sample size N grows to infinity.
In other words,

    (1/N) Σ_{i=1}^N Xᵢ → E X as N → ∞.
The next most basic parameter of the distribution is the covariance matrix

    Σ := E (X − E X)(X − E X)ᵀ.

This is a higher-dimensional version of the usual notion of the variance of a random
variable Z, which is

    Var(Z) = E (Z − E Z)².

The eigenvectors of the covariance matrix Σ are called the principal components.
Principal components that correspond to large eigenvalues of Σ are the directions
in which the distribution of X is most extended; see Figure 3.1.1. These are often
the most interesting directions in the data. Practitioners often visualize high-dimensional
data by projecting it onto the span of a few (maybe two or three)
such principal components; the projection may reveal some hidden structure of
the data. This method is called Principal Component Analysis (PCA).

Figure 3.1.1. Data points X₁, …, X_N sampled from a distribution in Rⁿ
and the principal components of the covariance matrix.

One can estimate the covariance matrix Σ from the sample by computing the
sample covariance

    Σ_N := (1/N) Σ_{i=1}^N (Xᵢ − E Xᵢ)(Xᵢ − E Xᵢ)ᵀ.

Again, the law of large numbers guarantees that the estimate becomes tight as
the sample size N grows to infinity, i.e.

    Σ_N → Σ as N → ∞.
But how large should the sample size N be for covariance estimation? Generally,
one cannot have N < n for dimension reasons. (Why?) We are going to
show that

    N ∼ n log n

is enough. In other words, covariance estimation is possible with just logarithmic
oversampling.
For simplicity, we shall state the covariance estimation bound for mean zero
distributions. (If the mean is not zero, we can estimate it from the sample and
subtract. Check that the mean can be accurately estimated from a sample of size
N = O(n).)
Theorem 3.1.2 (Covariance estimation). Let X be a random vector in Rⁿ with covariance
matrix Σ. Suppose that

(3.1.3)    ‖X‖₂² ≲ E ‖X‖₂² = tr Σ almost surely.

Then, for every N ≥ 1, we have

    E ‖Σ_N − Σ‖ ≲ ‖Σ‖ ( √(n log n / N) + n log n / N ).
Before we pass to the proof, let us note that Theorem 3.1.2 yields the covariance
estimation result we promised. Let ε ∈ (0, 1). If we take a sample of size

    N ∼ ε⁻² n log n,

then we are guaranteed covariance estimation with a good relative error:

    E ‖Σ_N − Σ‖ ≤ ε ‖Σ‖.
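A quick numpy experiment illustrates the promised sample size. The Gaussian distribution and the decaying spectrum below are arbitrary illustrative choices (a Gaussian satisfies the boundedness condition (3.1.3) only after truncation, so this is a sanity check of the rate, not a verification of the theorem):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
# An illustrative covariance matrix with decaying spectrum.
Sigma = np.diag(1.0 / np.arange(1, n + 1))

N = int(10 * n * np.log(n))               # N ~ n log n samples
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)

Sigma_N = (X.T @ X) / N                   # sample covariance (mean zero case)
rel_err = np.linalg.norm(Sigma_N - Sigma, 2) / np.linalg.norm(Sigma, 2)
assert rel_err < 0.5
```

With N ≈ 10 n log n the relative error in operator norm is already a small fraction, in line with the √(n log n / N) rate.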
Proof. Apply matrix Bernstein’s inequality (Corollary 2.2.9) to the sum of independent
random matrices XᵢXᵢᵀ − Σ and get

(3.1.4)    E ‖Σ_N − Σ‖ = (1/N) E ‖Σ_{i=1}^N (XᵢXᵢᵀ − Σ)‖ ≲ (1/N) ( σ √(log n) + K log n ),

where

    σ² = ‖Σ_{i=1}^N E (XᵢXᵢᵀ − Σ)²‖ = N ‖E (XXᵀ − Σ)²‖

and K is chosen so that

    ‖XXᵀ − Σ‖ ≤ K almost surely.

It remains to bound σ and K. Let us start with σ. We have

    E (XXᵀ − Σ)² = E ‖X‖₂² XXᵀ − Σ²    (check by expanding the square)
        ⪯ tr(Σ) · E XXᵀ    (drop Σ² and use (3.1.3))
        = tr(Σ) · Σ.

Thus

    σ² ≲ N tr(Σ) ‖Σ‖.

Next, to bound K, we have

    ‖XXᵀ − Σ‖ ≤ ‖X‖₂² + ‖Σ‖    (by the triangle inequality)
        ≲ tr Σ + ‖Σ‖    (using (3.1.3))
        ≤ 2 tr Σ =: K.

Substitute the bounds on σ and K into (3.1.4) and get

    E ‖Σ_N − Σ‖ ≲ (1/N) ( √(N tr(Σ) ‖Σ‖ log n) + tr(Σ) log n ).

Complete the proof by using tr Σ ≤ n ‖Σ‖ (check this!) to simplify the bound. □

Remark 3.1.5 (Low-dimensional distributions). Far fewer samples are needed for
covariance estimation of low-dimensional, or approximately low-dimensional,
distributions. To measure approximate low-dimensionality we can use the notion
of the stable rank of Σ^{1/2}. The stable rank of a matrix A is defined as the square of
the ratio of the Frobenius to operator norms:¹⁶

    r(A) := ‖A‖_F² / ‖A‖².

The stable rank is always bounded by the usual, linear algebraic rank,

    r(A) ≤ rank(A),

and it can be much smaller. (Check both claims.)
Our proof of Theorem 3.1.2 actually gives

    E ‖Σ_N − Σ‖ ≲ ‖Σ‖ ( √(r log n / N) + r log n / N ),

where

    r = r(Σ^{1/2}) = tr Σ / ‖Σ‖.

(Check this!) Therefore, covariance estimation is possible with

    N ∼ r log n

samples.
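The stable rank is one line of code; the example below (an arbitrary diagonal matrix with one dominant direction) checks both claims, r(A) = ‖A‖_F²/‖A‖² and r(A) ≤ rank(A):

```python
import numpy as np

def stable_rank(A):
    # r(A) = ||A||_F^2 / ||A||^2: squared ratio of Frobenius to operator norm.
    return np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2

A = np.diag([3.0, 1.0, 1.0, 0.0])
r = stable_rank(A)                        # (9 + 1 + 1) / 9 = 11/9
assert abs(r - 11 / 9) < 1e-12
assert r <= np.linalg.matrix_rank(A)      # stable rank <= algebraic rank (= 3)
```

For a matrix with one dominant singular value, the stable rank is close to 1 even though the algebraic rank can be much larger — exactly the "approximately low-dimensional" regime of the remark.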

Remark 3.1.6 (The boundedness condition). It is a good exercise to check that if we remove the boundedness condition (3.1.3), then nontrivial covariance estimation is impossible in general. (Show this!) But how do we know whether the boundedness condition holds for the data at hand? We may not, but we can enforce this condition by truncation. All we have to do is discard the 1% of data points with the largest norms. (Check this accurately, assuming that such truncation does not change the covariance significantly.)
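Theorem 3.1.2 is straightforward to probe by simulation. The sketch below (dimension, the spectrum of $\Sigma$, and sample sizes are all arbitrary choices) draws bounded samples $X = \Sigma^{1/2} Z$ with $Z$ uniform on the sphere of radius $\sqrt{n}$, so that $Z$ is isotropic and the boundedness condition (3.1.3) holds, and watches the operator-norm error of the sample covariance shrink as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
evals = np.linspace(0.1, 2.0, n)          # spectrum of the true covariance
Sigma = np.diag(evals)
Sigma_half = np.diag(np.sqrt(evals))

def cov_error(N):
    Z = rng.standard_normal((N, n))
    Z *= np.sqrt(n) / np.linalg.norm(Z, axis=1, keepdims=True)  # uniform on sqrt(n)-sphere: isotropic, bounded
    X = Z @ Sigma_half                                          # samples with covariance Sigma
    Sigma_N = X.T @ X / N                                       # sample covariance (mean-zero case)
    return np.linalg.norm(Sigma_N - Sigma, 2)                   # operator-norm error

err_small = np.mean([cov_error(50) for _ in range(20)])
err_large = np.mean([cov_error(5000) for _ in range(20)])
# err_large should be roughly 10 times smaller than err_small, consistent with
# the sqrt(n log n / N) rate when N grows by a factor of 100.
```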

3.2. Norms of random matrices We have worked a lot with the operator norm of matrices, denoted $\|A\|$. One may ask whether there exists a formula that expresses $\|A\|$ in terms of the entries $A_{ij}$. Unfortunately, there is no such formula. The operator norm is a more difficult quantity in this respect than the Frobenius norm, which, as we know, can be easily expressed in terms of the entries: $\|A\|_F = \big( \sum_{i,j} A_{ij}^2 \big)^{1/2}$.

If we cannot express $\|A\|$ in terms of the entries, can we at least get a good estimate? Let us consider $n \times n$ symmetric matrices for simplicity. In one direction,

¹⁶The Frobenius norm of an $n \times m$ matrix, sometimes also called the Hilbert-Schmidt norm, is defined as $\|A\|_F = \big( \sum_{i=1}^{n} \sum_{j=1}^{m} A_{ij}^2 \big)^{1/2}$. Equivalently, for an $n \times n$ symmetric matrix, we have $\|A\|_F = \big( \sum_{i=1}^{n} \lambda_i(A)^2 \big)^{1/2}$, where $\lambda_i(A)$ are the eigenvalues of $A$. Thus the stable rank of $A$ can be expressed as $r(A) = \sum_{i=1}^{n} \lambda_i(A)^2 / \max_i \lambda_i(A)^2$.
Roman Vershynin 253

$\|A\|$ is always bounded below by the largest Euclidean norm of the rows $A_i$:

(3.2.1) $\|A\| \ge \max_i \|A_i\|_2 = \max_i \Big( \sum_j A_{ij}^2 \Big)^{1/2}.$

(Check!) Unfortunately, this bound is sometimes very loose, and the best possible upper bound is

(3.2.2) $\|A\| \le \sqrt{n} \cdot \max_i \|A_i\|_2.$

(Show this bound, and give an example where it is sharp.)
Fortunately, for random matrices with independent entries the bound (3.2.2)
can be improved to the point where the upper and lower bounds almost match.
Theorem 3.2.3 (Norms of random matrices without boundedness assumptions). Let $A$ be an $n \times n$ symmetric random matrix whose entries on and above the diagonal are independent, mean zero random variables. Then
$\mathbb{E}\,\max_i \|A_i\|_2 \le \mathbb{E}\,\|A\| \le C \log n \cdot \mathbb{E}\,\max_i \|A_i\|_2,$
where $A_i$ denote the rows of $A$.

In words, the operator norm of a random matrix is almost determined by the norm of its rows.
Our proof of this result will be based on matrix Bernstein’s inequality – more
precisely, Corollary 2.2.9. There is one surprising point. How can we use matrix
Bernstein’s inequality, which applies only for bounded distributions, to prove
a result like Theorem 3.2.3 that does not have any boundedness assumptions?
We will do this using a trick based on conditioning and symmetrization. Let us
introduce this technique first.
Lemma 3.2.4 (Symmetrization). Let $X_1, \ldots, X_N$ be independent, mean zero random vectors in a normed space and $\varepsilon_1, \ldots, \varepsilon_N$ be independent Rademacher random variables.¹⁷ Then
$\frac{1}{2}\, \mathbb{E}\,\Big\| \sum_{i=1}^{N} \varepsilon_i X_i \Big\| \le \mathbb{E}\,\Big\| \sum_{i=1}^{N} X_i \Big\| \le 2\, \mathbb{E}\,\Big\| \sum_{i=1}^{N} \varepsilon_i X_i \Big\|.$

Proof. To prove the upper bound, let $(X_i')$ be an independent copy of the random vectors $(X_i)$, i.e. just different random vectors with the same joint distribution as $(X_i)$ and independent from $(X_i)$. Then
$\mathbb{E}\,\Big\| \sum_i X_i \Big\| = \mathbb{E}\,\Big\| \sum_i X_i - \mathbb{E} \sum_i X_i' \Big\|$ (since $\mathbb{E} \sum_i X_i' = 0$ by assumption)
$\le \mathbb{E}\,\Big\| \sum_i X_i - \sum_i X_i' \Big\|$ (by Jensen's inequality)
$= \mathbb{E}\,\Big\| \sum_i (X_i - X_i') \Big\|.$

¹⁷This means that the random variables $\varepsilon_i$ take values $\pm 1$ with probability $1/2$ each. We require that all random variables we consider here, i.e. $\{X_i, \varepsilon_i : i = 1, \ldots, N\}$, are jointly independent.



The distribution of the random vectors $Y_i := X_i - X_i'$ is symmetric, which means that the distributions of $Y_i$ and $-Y_i$ are the same. (Why?) Thus the distribution of the random vectors $Y_i$ and $\varepsilon_i Y_i$ is also the same, for all we do is change the signs of these vectors at random and independently of the values of the vectors. Summarizing, we can replace $X_i - X_i'$ in the sum above with $\varepsilon_i (X_i - X_i')$. Thus
$\mathbb{E}\,\Big\| \sum_i X_i \Big\| \le \mathbb{E}\,\Big\| \sum_i \varepsilon_i (X_i - X_i') \Big\|$
$\le \mathbb{E}\,\Big\| \sum_i \varepsilon_i X_i \Big\| + \mathbb{E}\,\Big\| \sum_i \varepsilon_i X_i' \Big\|$ (using the triangle inequality)
$= 2\, \mathbb{E}\,\Big\| \sum_i \varepsilon_i X_i \Big\|$ (the two sums have the same distribution).

This proves the upper bound in the symmetrization inequality. The lower bound can be proved by a similar argument. (Do this!) $\square$
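The symmetrization sandwich is easy to observe empirically. The Monte Carlo sketch below (the distribution, dimensions, and trial counts are arbitrary choices) uses centered exponential vectors, which are mean zero but far from symmetric:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, trials = 30, 5, 4000

norm_plain = np.empty(trials)
norm_sym = np.empty(trials)
for t in range(trials):
    X = rng.exponential(1.0, size=(N, d)) - 1.0          # independent, mean zero, asymmetric vectors
    eps = rng.choice([-1.0, 1.0], size=(N, 1))           # independent Rademacher signs
    norm_plain[t] = np.linalg.norm(X.sum(axis=0))        # || sum_i X_i ||_2
    norm_sym[t] = np.linalg.norm((eps * X).sum(axis=0))  # || sum_i eps_i X_i ||_2

E_plain, E_sym = norm_plain.mean(), norm_sym.mean()
# Lemma 3.2.4 predicts E_sym / 2 <= E_plain <= 2 * E_sym, up to Monte Carlo error.
```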
Proof of Theorem 3.2.3. We already mentioned in (3.2.1) that the lower bound in Theorem 3.2.3 is trivial. The proof of the upper bound will be based on matrix Bernstein's inequality.

First, we decompose $A$ in the same way as we did in the proof of Theorem 2.3.4. Thus we represent $A$ as a sum of independent, mean zero, symmetric random matrices $Z_{ij}$, each of which contains a pair of symmetric entries of $A$ (or one diagonal entry):
$A = \sum_{i,j:\, i \le j} Z_{ij}.$

Apply the symmetrization inequality (Lemma 3.2.4) to the random matrices $Z_{ij}$ and get

(3.2.5) $\mathbb{E}\,\|A\| = \mathbb{E}\,\Big\| \sum_{i \le j} Z_{ij} \Big\| \le 2\, \mathbb{E}\,\Big\| \sum_{i \le j} X_{ij} \Big\|,$

where we set
$X_{ij} := \varepsilon_{ij} Z_{ij}$
and $\varepsilon_{ij}$ are independent Rademacher random variables.

Now we condition on $A$. The random matrices $Z_{ij}$ become fixed values and all randomness remains in the Rademacher random variables $\varepsilon_{ij}$. Note that the $X_{ij}$ are (conditionally) bounded almost surely, and this is exactly what we have lacked to apply matrix Bernstein's inequality. Now we can do it. Corollary 2.2.9 gives¹⁸

(3.2.6) $\mathbb{E}_\varepsilon \Big\| \sum_{i \le j} X_{ij} \Big\| \lesssim \sigma \sqrt{\log n} + K \log n,$

where $\sigma^2 = \big\| \sum_{i \le j} \mathbb{E}_\varepsilon X_{ij}^2 \big\|$ and $K = \max_{i \le j} \|X_{ij}\|$.

¹⁸We stick a subscript $\varepsilon$ on the expected value to remember that this is a conditional expectation, i.e. we average only with respect to the $\varepsilon_{ij}$.


A good exercise is to check that
$\sigma \lesssim \max_i \|A_i\|_2 \quad \text{and} \quad K \lesssim \max_i \|A_i\|_2.$
(Do it!) Substituting into (3.2.6), we get
$\mathbb{E}_\varepsilon \Big\| \sum_{i \le j} X_{ij} \Big\| \lesssim \log n \cdot \max_i \|A_i\|_2.$

Finally, we unfix $A$ by taking expectation of both sides of this inequality with respect to $A$ and using the law of total expectation. The proof is complete. $\square$

We stated Theorem 3.2.3 for symmetric matrices, but it is simple to extend it to general $m \times n$ random matrices $A$. The bound in this case becomes

(3.2.7) $\mathbb{E}\,\|A\| \le C \log(m+n) \cdot \Big( \mathbb{E}\,\max_i \|A_i\|_2 + \mathbb{E}\,\max_j \|A^j\|_2 \Big),$

where $A_i$ and $A^j$ denote the rows and columns of $A$. To see this, apply Theorem 3.2.3 to the $(m+n) \times (m+n)$ symmetric random matrix
$\begin{pmatrix} 0 & A \\ A^{\mathsf{T}} & 0 \end{pmatrix}.$
(Do this!)
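For a concrete feel for Theorem 3.2.3, here is a sketch (the size and entry distribution are arbitrary choices) with a symmetric random sign matrix. Every row then has norm exactly $\sqrt{n}$, while the operator norm concentrates near $2\sqrt{n}$, so the lower bound (3.2.1) is tight up to a small constant, well inside the logarithmic allowance:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
M = rng.choice([-1.0, 1.0], size=(n, n))
A = np.triu(M) + np.triu(M, 1).T            # symmetric, independent +-1 entries on/above diagonal

op = np.linalg.norm(A, 2)                   # operator norm ||A||
max_row = np.linalg.norm(A, axis=1).max()   # max_i ||A_i||_2, equal to sqrt(n) for +-1 entries

ratio = op / max_row
# (3.2.1) guarantees ratio >= 1 for any matrix; for this model the semicircle law
# puts ||A|| near 2 sqrt(n), so the ratio is close to 2.
```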
3.3. Matrix completion Consider a fixed, unknown $n \times n$ matrix $X$. Suppose we are shown $m$ randomly chosen entries of $X$. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained.

Obviously, there is no way to guess the missing entries unless we know something extra about the matrix $X$. So let us assume that $X$ has low rank:
$\operatorname{rank}(X) =: r \ll n.$
The number of degrees of freedom of an $n \times n$ matrix with rank $r$ is $O(rn)$. (Why?) So we may hope that

(3.3.1) $m \sim rn$

observed entries of $X$ will be enough to determine $X$ completely. But how?

Here we will analyze what is probably the simplest method for matrix completion. Take the matrix $Y$ that consists of the observed entries of $X$ while all unobserved entries are set to zero. Unlike $X$, the matrix $Y$ may not have small rank. Compute the best rank $r$ approximation¹⁹ of $Y$. The result, as we will show, will be a good approximation to $X$.

But before we show this, let us define the sampling of entries more rigorously. Assume each entry of $X$ is shown or hidden independently of the others with fixed

¹⁹The best rank $r$ approximation of an $n \times n$ matrix $A$ is a matrix $B$ of rank $r$ that minimizes the operator norm $\|A - B\|$ or, alternatively, the Frobenius norm $\|A - B\|_F$ (the minimizer turns out to be the same). One can compute $B$ by truncating the singular value decomposition $A = \sum_{i=1}^{n} s_i u_i v_i^{\mathsf{T}}$ of $A$ as follows: $B = \sum_{i=1}^{r} s_i u_i v_i^{\mathsf{T}}$, where we assume that the singular values $s_i$ are arranged in non-increasing order.

probability $p$. Which entries are shown is decided by independent Bernoulli random variables
$\delta_{ij} \sim \operatorname{Ber}(p) \quad \text{with} \quad p := \frac{m}{n^2},$
which are often called selectors in this context. The value of $p$ is chosen so that among the $n^2$ entries of $X$, the expected number of selected (known) entries is $m$. Define the $n \times n$ matrix $Y$ with entries
$Y_{ij} := \delta_{ij} X_{ij}.$
We can assume that we are shown $Y$, for it is a matrix that contains the observed entries of $X$ while all unobserved entries are replaced with zeros. The following result shows how to estimate $X$ based on $Y$.

Theorem 3.3.2 (Matrix completion). Let $\hat{X}$ be a best rank $r$ approximation to $p^{-1} Y$. Then

(3.3.3) $\frac{1}{n}\, \mathbb{E}\,\|\hat{X} - X\|_F \le C \log(n)\, \sqrt{\frac{rn}{m}}\, \|X\|_\infty,$

where $\|X\|_\infty = \max_{i,j} |X_{ij}|$ denotes the maximum magnitude of the entries of $X$.

Before we prove this result, let us understand what this bound says about the quality of matrix completion. The recovery error is measured in the Frobenius norm, and the left side of (3.3.3) is
$\frac{1}{n}\, \|\hat{X} - X\|_F = \Big( \frac{1}{n^2} \sum_{i,j=1}^{n} |\hat{X}_{ij} - X_{ij}|^2 \Big)^{1/2}.$
Thus Theorem 3.3.2 controls the average error per entry in the mean-squared sense.

To make the error small, let us assume that we have a sample of size
$m \gg rn \log^2 n,$
which is slightly larger than the ideal size we discussed in (3.3.1). This makes $C \log(n) \sqrt{rn/m} = o(1)$ and forces the recovery error to be bounded by $o(1)\, \|X\|_\infty$. Summarizing, Theorem 3.3.2 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of $X$. This is true for a sample of almost optimal size $m$. The smaller the rank $r$ of the matrix $X$, the fewer entries of $X$ we need to see in order to do matrix completion.

Proof of Theorem 3.3.2. Step 1: The error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using the triangle inequality:
$\|\hat{X} - X\| \le \|\hat{X} - p^{-1} Y\| + \|p^{-1} Y - X\|.$
Recall that $\hat{X}$ is a best rank $r$ approximation to $p^{-1} Y$, while $X$ itself has rank at most $r$. Then the first part of the error is smaller than the second part, i.e. $\|\hat{X} - p^{-1} Y\| \le \|p^{-1} Y - X\|$, and we have

(3.3.4) $\|\hat{X} - X\| \le 2\, \|p^{-1} Y - X\| = \frac{2}{p}\, \|Y - pX\|.$

The entries of the matrix $Y - pX$,
$(Y - pX)_{ij} = (\delta_{ij} - p)\, X_{ij},$
are independent, mean zero random variables. Thus we can apply the bound (3.2.7) on the norms of random matrices and get

(3.3.5) $\mathbb{E}\,\|Y - pX\| \le C \log n \cdot \Big( \mathbb{E}\,\max_{i \in [n]} \|(Y - pX)_i\|_2 + \mathbb{E}\,\max_{j \in [n]} \|(Y - pX)^j\|_2 \Big).$

All that remains is to bound the norms of the rows and columns of $Y - pX$. This is not difficult if we note that they can be expressed as sums of independent random variables:
$\|(Y - pX)_i\|_2^2 = \sum_{j=1}^{n} (\delta_{ij} - p)^2 X_{ij}^2 \le \sum_{j=1}^{n} (\delta_{ij} - p)^2 \cdot \|X\|_\infty^2,$
and similarly for the columns. Taking expectation and noting that $\mathbb{E}\,(\delta_{ij} - p)^2 = \operatorname{Var}(\delta_{ij}) = p(1 - p)$, we get²⁰

(3.3.6) $\mathbb{E}\,\|(Y - pX)_i\|_2 \le \big( \mathbb{E}\,\|(Y - pX)_i\|_2^2 \big)^{1/2} \le \sqrt{pn}\, \|X\|_\infty.$

This is a good bound, but we need something stronger in (3.3.5). Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability.

Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (1.4.5) yields
$\mathbb{P}\Big\{ \sum_{j=1}^{n} (\delta_{ij} - p)^2 > t\, pn \Big\} \le \exp(-ct\, pn) \quad \text{for } t \ge 3.$
(Check!) This probability can be further bounded by $n^{-ct}$ using the assumption that $m = pn^2 \ge n \log n$. A union bound over the $n$ rows leads to
$\mathbb{P}\Big\{ \max_{i \in [n]} \sum_{j=1}^{n} (\delta_{ij} - p)^2 > t\, pn \Big\} \le n \cdot n^{-ct} \quad \text{for } t \ge 3.$
Integrating this tail, we conclude using (2.2.10) that
$\mathbb{E}\,\max_{i \in [n]} \sum_{j=1}^{n} (\delta_{ij} - p)^2 \lesssim pn.$
(Check!) And this yields the desired bound on the rows,
$\mathbb{E}\,\max_{i \in [n]} \|(Y - pX)_i\|_2 \lesssim \sqrt{pn}\, \|X\|_\infty,$
which is the improvement of (3.3.6) we wanted. We can do similarly for the columns. Substituting into (3.3.5), this gives
$\mathbb{E}\,\|Y - pX\| \lesssim \log(n)\, \sqrt{pn}\, \|X\|_\infty.$

²⁰The first bound in (3.3.6), which compares the $L^1$ and $L^2$ averages, follows from Hölder's inequality.

Then, by (3.3.4), we get

(3.3.7) $\mathbb{E}\,\|\hat{X} - X\| \lesssim \log(n)\, \sqrt{\frac{n}{p}}\, \|X\|_\infty.$

Step 2: Passing to the Frobenius norm. Now we need to pass from the operator norm to the Frobenius norm. This is where we use, for the first (and only) time, the low rank of $X$. We know that $\operatorname{rank}(X) \le r$ by assumption and $\operatorname{rank}(\hat{X}) \le r$ by construction, so $\operatorname{rank}(\hat{X} - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms:
$\|\hat{X} - X\|_F \le \sqrt{2r}\, \|\hat{X} - X\|.$
(Check it!) Take expectation of both sides and use (3.3.7); we get
$\mathbb{E}\,\|\hat{X} - X\|_F \le \sqrt{2r}\, \mathbb{E}\,\|\hat{X} - X\| \lesssim \log(n)\, \sqrt{\frac{rn}{p}}\, \|X\|_\infty.$
Dividing both sides by $n$, we can rewrite this bound as
$\frac{1}{n}\, \mathbb{E}\,\|\hat{X} - X\|_F \lesssim \log(n)\, \sqrt{\frac{rn}{pn^2}}\, \|X\|_\infty.$
This yields (3.3.3), since $pn^2 = m$ by definition of the sampling probability $p$. $\square$
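The entire estimator of Theorem 3.3.2 is a few lines of linear algebra: zero-fill, rescale by $p^{-1}$, truncate the SVD. A sketch (the rank, size and sampling probability are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, p = 200, 2, 0.5

X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # unknown rank-r matrix
delta = rng.random((n, n)) < p                                  # Bernoulli(p) selectors
Y = np.where(delta, X, 0.0)                                     # observed entries, zeros elsewhere

U, s, Vt = np.linalg.svd(Y / p)                                 # SVD of the rescaled observations
X_hat = (U[:, :r] * s[:r]) @ Vt[:r]                             # best rank-r approximation of p^{-1} Y

rel_err = np.linalg.norm(X_hat - X, "fro") / np.linalg.norm(X, "fro")
per_entry_err = np.linalg.norm(X_hat - X, "fro") / n            # the quantity bounded in (3.3.3)
```

Even though half of the entries are never seen, the truncated SVD of the rescaled matrix recovers $X$ up to a modest relative error, in line with the theorem.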
3.4. Notes Theorem 3.1.2 on covariance estimation is a version of [58, Corol-
lary 5.52], see also [36]. The logarithmic factor is in general necessary. This
theorem is a general-purpose result. If one knows some additional structural in-
formation about the covariance matrix (such as sparsity), then fewer samples may
be needed, see e.g. [12, 16, 40].
A version of Theorem 3.2.3 was proved in [51] in a more technical way. Although the logarithmic factor in Theorem 3.2.3 cannot be completely removed in general, it can be improved. Our argument actually gives
$\mathbb{E}\,\|A\| \le C \sqrt{\log n} \cdot \mathbb{E}\,\max_i \|A_i\|_2 + C \log n \cdot \mathbb{E}\,\max_{ij} |A_{ij}|.$
Using different methods, one can save an extra $\sqrt{\log n}$ factor and show that
$\mathbb{E}\,\|A\| \le C\, \mathbb{E}\,\max_i \|A_i\|_2 + C \sqrt{\log n} \cdot \mathbb{E}\,\max_{ij} |A_{ij}|$
(see [6]) and
$\mathbb{E}\,\|A\| \le C \sqrt{\log n \cdot \log \log n} \cdot \mathbb{E}\,\max_i \|A_i\|_2,$
see [55]. (The results in [6, 55] are stated for Gaussian random matrices; the two bounds above can be deduced by using conditioning and symmetrization.) The surveys [6, 58] and the textbook [60] present several other useful techniques to bound the operator norm of random matrices.
The matrix completion problem, which we discussed in Section 3.3, has at-
tracted a lot of recent attention. E. Candes and B. Recht [14] showed that one can
often achieve exact matrix completion, thus computing the precise values of all missing entries of a matrix, from $m \sim rn \log^2(n)$ randomly sampled entries. For exact matrix completion, one needs an extra incoherence assumption that is not present in Theorem 3.3.2. This assumption basically excludes matrices that are simultaneously sparse and low rank, such as a matrix all of whose entries but one are zero: such a matrix would be extremely hard to complete, since the sampling will likely miss the nonzero entry. Many further results on exact matrix completion are
known, e.g. [15, 18, 28, 56].
Theorem 3.3.2 with a simple proof is borrowed from [50]; see also the tutorial
[59]. This result only guarantees approximate matrix completion, but it does not
have any incoherence assumptions on the matrix.

4. Lecture 4: Matrix deviation inequality

In this last lecture we will study a new uniform deviation inequality for random matrices. This result will be a far-reaching generalization of the Johnson-Lindenstrauss Lemma we proved in Lecture 1.

Consider the same setup as in Theorem 1.6.1, where $A$ is an $m \times n$ random matrix whose rows are independent, mean zero, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$. (If you find it helpful to think in terms of concrete examples, let the entries of $A$ be independent $N(0,1)$ random variables.) Like in the Johnson-Lindenstrauss Lemma, we will be looking at $A$ as a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$, and we will be interested in what $A$ does to points in some set in $\mathbb{R}^n$. This time, however, we will allow for infinite sets $T \subset \mathbb{R}^n$.

Let us start by analyzing what $A$ does to a single fixed vector $x \in \mathbb{R}^n$. We have
$\mathbb{E}\,\|Ax\|_2^2 = \mathbb{E} \sum_{j=1}^{m} \langle A_j, x \rangle^2$ (where $A_j^{\mathsf{T}}$ denote the rows of $A$)
$= \sum_{j=1}^{m} \mathbb{E}\,\langle A_j, x \rangle^2$ (by linearity)
$= m\, \|x\|_2^2$ (using isotropy of $A_j$).

Further, if we assume that concentration about the mean holds here (and in fact, it does), we should expect that

(4.0.1) $\|Ax\|_2 \approx \sqrt{m}\, \|x\|_2$

with high probability.

Similarly to the Johnson-Lindenstrauss Lemma, our next goal is to make (4.0.1) hold simultaneously over all vectors $x$ in some fixed set $T \subset \mathbb{R}^n$. Precisely, we may ask: how large is the average uniform deviation

(4.0.2) $\mathbb{E} \sup_{x \in T} \big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \big|\,?$

This quantity should clearly depend on some notion of the size of $T$: the larger $T$, the larger the uniform deviation should be. So, how can we quantify the size of $T$ for this problem? In the next section we will do precisely this: we will introduce a

convenient, geometric measure of the sizes of sets in $\mathbb{R}^n$, which is called the Gaussian width.
4.1. Gaussian width

Definition 4.1.1. Let $T \subset \mathbb{R}^n$ be a bounded set, and let $g$ be a standard normal random vector in $\mathbb{R}^n$, i.e. $g \sim N(0, I_n)$. Then the quantities
$w(T) := \mathbb{E} \sup_{x \in T} \langle g, x \rangle \quad \text{and} \quad \gamma(T) := \mathbb{E} \sup_{x \in T} |\langle g, x \rangle|$
are called the Gaussian width of $T$ and the Gaussian complexity of $T$, respectively.

Gaussian width and Gaussian complexity are closely related. Indeed,²¹

(4.1.2) $2\, w(T) = w(T - T) = \mathbb{E} \sup_{x,y \in T} \langle g, x - y \rangle = \mathbb{E} \sup_{x,y \in T} |\langle g, x - y \rangle| = \gamma(T - T).$

(Check these identities!)

Gaussian width has a natural geometric interpretation. Suppose $g$ is a unit vector in $\mathbb{R}^n$. Then a moment's thought reveals that $\sup_{x,y \in T} \langle g, x - y \rangle$ is simply the width of $T$ in the direction of $g$, i.e. the distance between the two hyperplanes with normal $g$ that touch $T$ on both sides, as shown in Figure 4.1.3. Then $2\, w(T)$ can be obtained by averaging the width of $T$ over all directions $g$ in $\mathbb{R}^n$.

Figure 4.1.3. The width of a set $T$ in the direction of $g$.

This reasoning is valid except where we assumed that $g$ is a unit vector. Instead, for $g \sim N(0, I_n)$ we have $\mathbb{E}\,\|g\|_2^2 = n$ and
$\|g\|_2 \approx \sqrt{n}$ with high probability.
(Check both these claims using Bernstein's inequality.) Thus, we need to scale by the factor $\sqrt{n}$. Ultimately, the geometric interpretation of the Gaussian width becomes the following: $w(T)$ is approximately $\sqrt{n}/2$ times larger than the usual, geometric width of $T$ averaged over all directions.

²¹The set $T - T$ is defined as $\{x - y : x, y \in T\}$. More generally, given two sets $A$ and $B$ in the same vector space, the Minkowski sum of $A$ and $B$ is defined as $A + B = \{a + b : a \in A, b \in B\}$.

A good exercise is to compute the Gaussian width and complexity for some simple sets, such as the unit balls of the $\ell_p$ norms in $\mathbb{R}^n$, which we denote by $B_p^n = \{x \in \mathbb{R}^n : \|x\|_p \le 1\}$. In particular, we have

(4.1.4) $\gamma(B_2^n) \sim \sqrt{n}, \qquad \gamma(B_1^n) \sim \sqrt{\log n}.$

For any finite set $T \subset B_2^n$, we have

(4.1.5) $\gamma(T) \lesssim \sqrt{\log |T|}.$

The same holds for the Gaussian width $w(T)$. (Check these facts!)
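The quantities in (4.1.4) are pleasant to estimate by Monte Carlo, because the suprema have closed forms: $\sup_{x \in B_2^n} \langle g, x \rangle = \|g\|_2$ and $\sup_{x \in B_1^n} \langle g, x \rangle = \|g\|_\infty$ (duality of the $\ell_2$ and $\ell_1/\ell_\infty$ pairs). A sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
n, samples = 1000, 500
G = rng.standard_normal((samples, n))         # i.i.d. draws of g ~ N(0, I_n)

gamma_B2 = np.linalg.norm(G, axis=1).mean()   # estimates E ||g||_2   = gamma(B_2^n)
gamma_B1 = np.abs(G).max(axis=1).mean()       # estimates E ||g||_inf = gamma(B_1^n)

ratio_B2 = gamma_B2 / np.sqrt(n)              # close to 1: gamma(B_2^n) ~ sqrt(n)
ratio_B1 = gamma_B1 / np.sqrt(2 * np.log(n))  # order 1:   gamma(B_1^n) ~ sqrt(log n)
```

The contrast between the two estimates (about 31.6 versus about 3.5 for $n = 1000$) is exactly the non-obvious gap that (4.1.4) describes.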
A look at these examples reveals that the Gaussian width captures some non-obvious geometric qualities of sets. Of course, the fact that the Gaussian width of the unit Euclidean ball $B_2^n$ is of order $\sqrt{n}$ is not surprising: the usual, geometric width in all directions is $2$, and the Gaussian width is about $\sqrt{n}/2$ times that. But it may be surprising that the Gaussian width of the $\ell_1$ ball $B_1^n$ is much smaller, and so is the width of any finite set $T$ (unless the set has exponentially large cardinality). As we will see later, the Gaussian width nicely captures the geometric size of "the bulk" of a set.
4.2. Matrix deviation inequality Now we are ready to answer the question we asked at the beginning of this lecture: what is the magnitude of the uniform deviation (4.0.2)? The answer is surprisingly simple: it is bounded by the Gaussian complexity of $T$. The proof is not so simple, however, and we will skip it (see the notes after this lecture for references).

Theorem 4.2.1 (Matrix deviation inequality). Let $A$ be an $m \times n$ matrix whose rows $A_i$ are independent, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$. Let $T \subset \mathbb{R}^n$ be a fixed bounded set. Then
$\mathbb{E} \sup_{x \in T} \big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \big| \le C K^2 \gamma(T),$
where $K = \max_i \|A_i\|_{\psi_2}$ is the maximal sub-gaussian norm²² of the rows of $A$.

Remark 4.2.2 (Tail bound). It is often useful to have results that hold with high probability rather than in expectation. There exists a high-probability version of the matrix deviation inequality, and it states the following. Let $u \ge 0$. Then the event

(4.2.3) $\sup_{x \in T} \big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \big| \le C K^2 \big[ \gamma(T) + u \cdot \operatorname{rad}(T) \big]$

holds with probability at least $1 - 2\exp(-u^2)$. Here $\operatorname{rad}(T)$ is the radius of $T$, defined as
$\operatorname{rad}(T) := \sup_{x \in T} \|x\|_2.$

²²A definition of the sub-gaussian norm of a random vector was given in Section 1.5. For example, if $A$ is a Gaussian random matrix with independent $N(0,1)$ entries, then $K$ is an absolute constant.

Since $\operatorname{rad}(T) \lesssim \gamma(T)$ (check!), we can continue the bound (4.2.3) by
$\lesssim K^2\, u\, \gamma(T)$
for all $u \ge 1$. This is a weaker, but still useful, inequality. For example, we can use it to bound all higher moments of the deviation:

(4.2.4) $\Big( \mathbb{E} \sup_{x \in T} \big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \big|^p \Big)^{1/p} \le C_p K^2 \gamma(T),$

where $C_p \le C \sqrt{p}$ for $p \ge 1$. (Check this using Proposition 1.1.1.)

Remark 4.2.5 (Deviation of squares). It is sometimes helpful to bound the deviation of the square $\|Ax\|_2^2$ rather than of $\|Ax\|_2$ itself. We can easily deduce the deviation of squares by using the identity $a^2 - b^2 = (a - b)^2 + 2b(a - b)$ for $a = \|Ax\|_2$ and $b = \sqrt{m}\, \|x\|_2$. Doing this, we conclude that

(4.2.6) $\mathbb{E} \sup_{x \in T} \big| \|Ax\|_2^2 - m \|x\|_2^2 \big| \le C K^4 \gamma(T)^2 + C K^2 \sqrt{m}\, \operatorname{rad}(T)\, \gamma(T).$

(Do this calculation using (4.2.4) for $p = 2$.) We will use this bound in Section 4.4.
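One can watch Theorem 4.2.1 in action on a finite set $T$ of unit vectors, where $\gamma(T) \lesssim \sqrt{\log |T|}$ by (4.1.5). In the Monte Carlo sketch below (all sizes are arbitrary choices), the average uniform deviation stays within a small constant multiple of this bound even though $m$ and $n$ are far larger than $\log |T|$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, card, trials = 400, 100, 50, 200

T = rng.standard_normal((card, n))
T /= np.linalg.norm(T, axis=1, keepdims=True)     # finite set of unit vectors

devs = np.empty(trials)
for t in range(trials):
    A = rng.standard_normal((m, n))               # rows: isotropic Gaussian (hence sub-gaussian)
    devs[t] = np.abs(np.linalg.norm(T @ A.T, axis=1) - np.sqrt(m)).max()

avg_dev = devs.mean()                             # estimates E sup_{x in T} | ||Ax||_2 - sqrt(m) |
gamma_bound = np.sqrt(np.log(card))               # gamma(T) <~ sqrt(log |T|) by (4.1.5)
```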

Matrix deviation inequality has many consequences. We will explore some of them now.

4.3. Deriving the Johnson-Lindenstrauss Lemma We started this lecture by promising a result that is more general than the Johnson-Lindenstrauss Lemma. So let us show how to quickly derive the Johnson-Lindenstrauss Lemma (Theorem 1.6.1) from the matrix deviation inequality (Theorem 4.2.1).

Assume we are in the situation of the Johnson-Lindenstrauss Lemma (Theorem 1.6.1). Given a set $X \subset \mathbb{R}^n$, consider the normalized difference set
$T := \Big\{ \frac{x - y}{\|x - y\|_2} : x, y \in X \text{ distinct vectors} \Big\}.$
Then $T$ is a finite subset of the unit sphere of $\mathbb{R}^n$, and thus (4.1.5) gives
$\gamma(T) \lesssim \sqrt{\log |T|} \le \sqrt{\log |X|^2} \lesssim \sqrt{\log |X|}.$
The matrix deviation inequality (Theorem 4.2.1) then yields
$\sup_{x, y \in X} \Big| \frac{\|A(x - y)\|_2}{\|x - y\|_2} - \sqrt{m} \Big| \lesssim \sqrt{\log |X|} \le \varepsilon \sqrt{m}$
with high probability, say $0.99$. (To pass from expectation to high probability, we can use Markov's inequality. To get the last bound, we use the assumption on $m$ in the Johnson-Lindenstrauss Lemma.)

Multiplying both sides by $\|x - y\|_2 / \sqrt{m}$, we can write the last bound as follows. With probability at least $0.99$, we have
$(1 - \varepsilon)\, \|x - y\|_2 \le \frac{1}{\sqrt{m}}\, \|Ax - Ay\|_2 \le (1 + \varepsilon)\, \|x - y\|_2 \quad \text{for all } x, y \in X.$
This is exactly the conclusion of the Johnson-Lindenstrauss Lemma.

The argument based on the matrix deviation inequality, which we just gave, can easily be extended to infinite sets. It allows one to state a version of the Johnson-Lindenstrauss Lemma for general, possibly infinite, sets, which depends on the Gaussian complexity of $T$ rather than on its cardinality. (Try to do this!)
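The derivation above is easy to replay numerically. The sketch below (the dimensions, the number of points, the target distortion $\varepsilon$, and the constant in the choice of $m$ are all arbitrary) projects a random point set and measures the worst relative distortion of pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(7)
n, N, eps = 1000, 20, 0.2
m = int(np.ceil(8 * np.log(N) / eps**2))      # m ~ eps^{-2} log N; the constant 8 is arbitrary

pts = rng.standard_normal((N, n))             # the point set X
A = rng.standard_normal((m, n)) / np.sqrt(m)  # scaled Gaussian random projection

distortions = []
for i in range(N):
    for j in range(i + 1, N):
        d = pts[i] - pts[j]
        distortions.append(abs(np.linalg.norm(A @ d) / np.linalg.norm(d) - 1.0))
max_distortion = max(distortions)             # stays below eps with high probability
```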
4.4. Covariance estimation In Section 3.1, we introduced the problem of covariance estimation, and we showed that
$N \sim n \log n$
samples are enough to estimate the covariance matrix of a general distribution in $\mathbb{R}^n$. We will now show how to do better if the distribution is sub-gaussian (see Section 1.5 for the definition of sub-gaussian random vectors): in that case we can get rid of the logarithmic oversampling and of the boundedness condition (3.1.3).

Theorem 4.4.1 (Covariance estimation for sub-gaussian distributions). Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Suppose $X$ is sub-gaussian, and more specifically

(4.4.2) $\|\langle X, x \rangle\|_{\psi_2} \lesssim \|\langle X, x \rangle\|_{L^2} = \|\Sigma^{1/2} x\|_2 \quad \text{for any } x \in \mathbb{R}^n.$

Then, for every $N \ge 1$, we have
$\mathbb{E}\,\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\| \Big( \sqrt{\frac{n}{N}} + \frac{n}{N} \Big).$

This result implies that if, for $\varepsilon \in (0, 1)$, we take a sample of size
$N \sim \varepsilon^{-2} n,$
then we are guaranteed covariance estimation with a good relative error:
$\mathbb{E}\,\|\Sigma_N - \Sigma\| \le \varepsilon \|\Sigma\|.$

Proof. Since we are going to use Theorem 4.2.1, we first need to bring the random vectors $X, X_1, \ldots, X_N$ to the isotropic position. This can be done by a suitable linear transformation. You will easily check that there exists an isotropic random vector $Z$ such that
$X = \Sigma^{1/2} Z.$
(For example, if $\Sigma$ has full rank, set $Z := \Sigma^{-1/2} X$. Check the general case.) Similarly, we can find independent isotropic random vectors $Z_i$ such that
$X_i = \Sigma^{1/2} Z_i, \quad i = 1, \ldots, N.$
The sub-gaussian assumption (4.4.2) then implies that
$\|Z\|_{\psi_2} \lesssim 1.$
(Check!) Then
$\|\Sigma_N - \Sigma\| = \big\| \Sigma^{1/2} R_N \Sigma^{1/2} \big\| \quad \text{where} \quad R_N := \frac{1}{N} \sum_{i=1}^{N} Z_i Z_i^{\mathsf{T}} - I_n.$

The operator norm of a symmetric $n \times n$ matrix $A$ can be computed by maximizing the quadratic form over the unit sphere: $\|A\| = \max_{x \in S^{n-1}} |\langle Ax, x \rangle|$. (To see this, recall that the operator norm is the largest eigenvalue of $A$ in magnitude.) Then
$\|\Sigma_N - \Sigma\| = \max_{x \in S^{n-1}} \big| \langle \Sigma^{1/2} R_N \Sigma^{1/2} x, x \rangle \big| = \max_{x \in T} \big| \langle R_N x, x \rangle \big|,$
where $T$ is the ellipsoid
$T := \Sigma^{1/2} S^{n-1}.$
Recalling the definition of $R_N$, we can rewrite this as
$\|\Sigma_N - \Sigma\| = \max_{x \in T} \Big| \frac{1}{N} \sum_{i=1}^{N} \langle Z_i, x \rangle^2 - \|x\|_2^2 \Big| = \frac{1}{N} \max_{x \in T} \big| \|Ax\|_2^2 - N \|x\|_2^2 \big|,$
where $A$ here denotes the $N \times n$ random matrix with rows $Z_i^{\mathsf{T}}$. Now apply the matrix deviation inequality for squares (4.2.6) to conclude that
$\mathbb{E}\,\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N} \Big( \gamma(T)^2 + \sqrt{N}\, \operatorname{rad}(T)\, \gamma(T) \Big).$
(Do this calculation!) The radius and Gaussian width of the ellipsoid $T$ are easy to compute:
$\operatorname{rad}(T) = \|\Sigma\|^{1/2} \quad \text{and} \quad \gamma(T) \le (\operatorname{tr} \Sigma)^{1/2}.$
Substituting, we get
$\mathbb{E}\,\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N} \Big( \operatorname{tr} \Sigma + \sqrt{N \|\Sigma\| \operatorname{tr} \Sigma} \Big).$
To complete the proof, use that $\operatorname{tr} \Sigma \le n \|\Sigma\|$ (check!) and simplify the bound. $\square$

Remark 4.4.3 (Low-dimensional distributions). Similarly to Section 3.1, we can show that far fewer samples are needed for covariance estimation of low-dimensional sub-gaussian distributions. Indeed, the proof of Theorem 4.4.1 actually yields

(4.4.4) $\mathbb{E}\,\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\| \Big( \sqrt{\frac{r}{N}} + \frac{r}{N} \Big),$

where
$r = r(\Sigma^{1/2}) = \frac{\operatorname{tr} \Sigma}{\|\Sigma\|}$
is the stable rank of $\Sigma^{1/2}$. This means that covariance estimation is possible with
$N \sim r$
samples.

4.5. Underdetermined linear equations We will give one more application of the matrix deviation inequality, this time to the area of high-dimensional inference. Suppose we need to solve a severely underdetermined system of linear equations: say, we have $m$ equations in $n \gg m$ variables. Let us write it in matrix form as
$y = Ax,$
where $A$ is a given $m \times n$ matrix, $y \in \mathbb{R}^m$ is a given vector and $x \in \mathbb{R}^n$ is an unknown vector. We would like to compute $x$ from $A$ and $y$.

When the linear system is underdetermined, we cannot find $x$ with any accuracy, unless we know something extra about $x$. So, let us assume that we do have some a priori information. We can describe this situation mathematically by assuming that
$x \in K,$
where $K \subset \mathbb{R}^n$ is some known set in $\mathbb{R}^n$ that describes anything that we know about $x$ a priori. (Admittedly, we are operating at a high level of generality here. If you need a concrete example, we will consider one in Section 4.6.)

Summarizing, here is the problem we are trying to solve. Determine a solution $x = x(A, y, K)$ to the underdetermined linear equation $y = Ax$ as accurately as possible, assuming that $x \in K$.

A variety of approaches to this and similar problems were proposed during the last decade; see the notes after this lecture for pointers to some literature. The one we will describe here is based on optimization. To do this, it will be convenient to convert the set $K$ into a function on $\mathbb{R}^n$ which is called the Minkowski functional of $K$. This is basically a function whose level sets are multiples of $K$. To define it formally, assume that $K$ is star-shaped, which means that together with any point $x$, the set $K$ must contain the entire interval that connects $x$ with the origin; see Figure 4.5.1 for an illustration. The Minkowski functional of $K$ is defined as
$\|x\|_K := \inf \{ t > 0 : x/t \in K \}, \quad x \in \mathbb{R}^n.$
If the set $K$ is convex and symmetric about the origin, $\|x\|_K$ is actually a norm on $\mathbb{R}^n$. (Check this!)

Figure 4.5.1. The set on the left (whose boundary is shown) is star-shaped; the set on the right is not.
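Given only a membership oracle for a star-shaped $K$, the Minkowski functional can be evaluated numerically by bisection over $t$, since membership of $x/t$ is monotone in $t$. A sketch (the test set $K = B_1^n$, for which $\|x\|_K = \|x\|_1$, is chosen just to make the answer checkable):

```python
import numpy as np

def minkowski_functional(x, in_K, t_hi=1e6, iters=80):
    """Approximate ||x||_K = inf{t > 0 : x/t in K} by bisection on a membership oracle in_K."""
    t_lo = 0.0
    for _ in range(iters):
        t = 0.5 * (t_lo + t_hi)
        if in_K(x / t):
            t_hi = t      # x/t lies in K: the infimum is at most t
        else:
            t_lo = t      # x/t lies outside K: the infimum exceeds t
    return t_hi

# Sanity check on K = B_1^n (convex, symmetric), whose Minkowski functional is the l_1 norm.
in_l1_ball = lambda z: np.abs(z).sum() <= 1.0
x = np.array([3.0, -1.0, 0.5])
val = minkowski_functional(x, in_l1_ball)   # should approximate ||x||_1 = 4.5
```

Star-shapedness is exactly what makes the bisection valid: if $x/t \in K$, then $x/t'$ stays in $K$ for every $t' \ge t$.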

Now we propose the following way to solve the recovery problem: solve the optimization program

(4.5.2) $\min \|x'\|_K \quad \text{subject to} \quad y = Ax'.$

Note that this is a very natural program: it looks at all solutions of the equation $y = Ax'$ and tries to "shrink" the solution $x'$ toward $K$. (This is what minimization of the Minkowski functional is about.)

Also note that if $K$ is convex, this is a convex optimization program, and thus it can be solved effectively by one of the many available numerical algorithms.

The main question we should now be asking is: would the solution of this program approximate the original vector $x$? The following result bounds the approximation error for a probabilistic model of linear equations. Assume that $A$ is a random matrix as in Theorem 4.2.1, i.e. $A$ is an $m \times n$ matrix whose rows $A_i$ are independent, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$.

Theorem 4.5.3 (Recovery by optimization). The solution $\hat{x}$ of the optimization program (4.5.2) satisfies²³
$\mathbb{E}\,\|\hat{x} - x\|_2 \lesssim \frac{w(K)}{\sqrt{m}},$
where $w(K)$ is the Gaussian width of $K$.

Proof. Both the original vector $x$ and the solution $\hat{x}$ are feasible vectors for the optimization program (4.5.2). Then
$\|\hat{x}\|_K \le \|x\|_K$ (since $\hat{x}$ minimizes the Minkowski functional)
$\le 1$ (since $x \in K$).
Thus both $\hat{x}, x \in K$.

We also know that $A\hat{x} = Ax = y$, which yields

(4.5.4) $A(\hat{x} - x) = 0.$

Let us apply the matrix deviation inequality (Theorem 4.2.1) to $T := K - K$. It gives
$\mathbb{E} \sup_{u,v \in K} \big| \|A(u - v)\|_2 - \sqrt{m}\, \|u - v\|_2 \big| \lesssim \gamma(K - K) = 2\, w(K),$
where we used (4.1.2) in the last identity. Substitute $u = \hat{x}$ and $v = x$ here. We may do this since, as we noted above, both these vectors belong to $K$. But then the term $\|A(u - v)\|_2$ equals zero by (4.5.4). It disappears from the bound, and we get
$\mathbb{E}\,\sqrt{m}\, \|\hat{x} - x\|_2 \lesssim w(K).$
Dividing both sides by $\sqrt{m}$, we complete the proof. $\square$

Theorem 4.5.3 says that a signal $x \in K$ can be efficiently recovered from
$m \sim w(K)^2$
random linear measurements.
4.6. Sparse recovery Let us illustrate Theorem 4.5.3 with an important specific example of the feasible set $K$. Suppose we know that the signal $x$ is sparse, which means that only a few coordinates of $x$ are nonzero. As before, our task is to recover $x$ from the random linear measurements given by the vector
$y = Ax,$

²³Here, and in other similar results, the notation $\lesssim$ will hide a possible dependence on the sub-gaussian norms of the rows of $A$.

where $A$ is an $m \times n$ random matrix. This is a basic example of the sparse recovery problems which are ubiquitous in various disciplines.

The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of $x$, is often denoted $\|x\|_0$. This is reminiscent of the notation for the $\ell_p$ norm $\|x\|_p = \big( \sum_{i=1}^{n} |x_i|^p \big)^{1/p}$, and for a reason. You can quickly check that

(4.6.1) $\|x\|_0 = \lim_{p \to 0^+} \|x\|_p^p.$

(Do this!) Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ is actually a norm on $\mathbb{R}^n$: the latter fails the triangle inequality, and the former fails homogeneity. (Give an example.)
Let us go back to the sparse recovery problem. Our first attempt to recover $x$ is to try the following optimization problem:

(4.6.2) $\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'.$

This is sensible because this program selects the sparsest feasible solution. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous. There is simply no known algorithm to solve the optimization problem (4.6.2) efficiently.

To overcome this difficulty, let us turn to the relation (4.6.1) for inspiration. What if we replace $\|x'\|_0$ in the optimization problem (4.6.2) by $\|x'\|_p$ with $p > 0$? The smallest $p$ for which $f(x) = \|x\|_p$ is a genuine norm (and thus a convex function on $\mathbb{R}^n$) is $p = 1$. So let us try

(4.6.3) $\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'.$

This is a convexification of the non-convex program (4.6.2), and a variety of numerical convex optimization methods are available to solve it efficiently.

We will now show that $\ell_1$ minimization works nicely for sparse recovery. As before, we assume that $A$ is a random matrix as in Theorem 4.2.1.

Theorem 4.6.4 (Sparse recovery by optimization). If an unknown vector $x \in \mathbb{R}^n$ has at most $s$ nonzero coordinates, i.e. $\|x\|_0 \le s$, then the solution $\hat{x}$ of the optimization program (4.6.3) satisfies
$\mathbb{E}\,\|\hat{x} - x\|_2 \lesssim \sqrt{\frac{s \log n}{m}}\, \|x\|_2.$
Proof. Since $\|x\|_0 \le s$, the Cauchy-Schwarz inequality shows that

(4.6.5) $\|x\|_1 \le \sqrt{s}\, \|x\|_2.$

(Check!) Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$, i.e. $B_1^n := \{x \in \mathbb{R}^n : \|x\|_1 \le 1\}$. Then we can rewrite (4.6.5) as the inclusion
$x \in \sqrt{s}\, \|x\|_2 \cdot B_1^n =: K.$
Apply Theorem 4.5.3 for this set $K$. We noted the Gaussian width of $B_1^n$ in (4.1.4), so
$w(K) = \sqrt{s}\, \|x\|_2 \cdot w(B_1^n) \le \sqrt{s}\, \|x\|_2 \cdot \gamma(B_1^n) \lesssim \sqrt{s}\, \|x\|_2 \cdot \sqrt{\log n}.$
Substitute this into Theorem 4.5.3 to complete the proof. $\square$

Theorem 4.6.4 says that an $s$-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from
$m \sim s \log n$
random linear measurements.
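As a numerical illustration, the program (4.6.3) can be approximated without a linear-programming solver by running ISTA (iterative soft-thresholding) on the Lagrangian relaxation $\min \frac{1}{2}\|Ax' - y\|_2^2 + \lambda \|x'\|_1$. This is a standard substitute for basis pursuit, not the exact program (4.6.3), and all sizes and the choice of $\lambda$ below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, s = 100, 50, 3

x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x[support] = rng.choice([-1.0, 1.0], size=s) * (1.0 + rng.random(s))  # magnitudes in [1, 2]
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x                                                             # noiseless measurements

lam = 0.01
L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of the smooth part's gradient
x_hat = np.zeros(n)
for _ in range(5000):
    z = x_hat - A.T @ (A @ x_hat - y) / L                       # gradient step on 0.5*||Ax'-y||^2
    x_hat = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding (prox of l_1)

rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

Sending $\lambda \to 0$ (with correspondingly more iterations) drives the iterate toward the minimum $\ell_1$-norm feasible solution, i.e. toward (4.6.3) itself.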
4.7. Notes
For a more thorough introduction to Gaussian width and its role in high-dimensional estimation, refer to the tutorial [59] and the textbook [60]; see also [5]. Related to Gaussian complexity is the notion of Rademacher complexity of T, obtained by replacing the coordinates of g by independent Rademacher (i.e. ±1 symmetric) random variables. Rademacher complexity of classes of functions plays an important role in statistical learning theory; see e.g. [44].
The matrix deviation inequality (Theorem 4.2.1) is borrowed from [41]. In the special case where A is a Gaussian random matrix, this result follows from the work of G. Schechtman [57] and can be traced back to results of Gordon [24–27].
In the general case of sub-gaussian distributions, earlier variants of Theo-
rem 4.2.1 were proved by B. Klartag and S. Mendelson [35], S. Mendelson, A. Pajor
and N. Tomczak-Jaegermann [45] and S. Dirksen [20].
Theorem 4.4.1 for covariance estimation can alternatively be proved using more elementary tools (Bernstein's inequality and ε-nets); see [58]. However, no known
elementary approach exists for the low-rank covariance estimation discussed in
Remark 4.4.3. The bound (4.4.4) was proved by V. Koltchinskii and K. Lounici
[36] by a different method.
In Section 4.5, we scratched the surface of a recently developed area of sparse
signal recovery, which is also called compressed sensing. Our presentation there
essentially follows the tutorial [59]. Theorem 4.6.4 can be improved: if we take
$$m \gtrsim s \log(n/s)$$
measurements, then with high probability the optimization program (4.6.3) recovers the unknown signal x exactly, i.e. $\hat{x} = x$.
First results of this kind were proved by E. Candès, J. Romberg and T. Tao [13], and a great number of further developments followed; refer e.g. to the book [23] and the chapter [19] for an introduction to this research area.

Acknowledgement
I am grateful to the referees, who made a number of useful suggestions that led to a better presentation of the material in this chapter.

References
[1] E. Abbe, A. S. Bandeira, G. Hall. Exact recovery in the stochastic block model, IEEE Transactions on
Information Theory 62 (2016), 471–487. MR3447993 249
[2] D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal
of Computer and System Sciences, 66 (2003), 671–687. MR2005771 239
[3] R. Ahlswede, A. Winter, Strong converse for identification via quantum channels, IEEE Trans. Inf.
Theory 48 (2002), 569–579. MR1889969 248
[4] N. Ailon, B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform,
Proceedings of the 38th Annual ACM Symposium on Theory of Computing. New York: ACM
Press, 2006. pp. 557–563. MR2277181 239
[5] D. Amelunxen, M. Lotz, M. McCoy, J. Tropp, Living on the edge: phase transitions in convex programs
with random data, Inf. Inference 3 (2014), 224–294. MR3311453 268
[6] A. Bandeira, R. van Handel, Sharp nonasymptotic bounds on the norm of random matrices with inde-
pendent entries, Ann. Probab. 44 (2016), 2479–2506. MR3531673 249, 258
[7] R. Baraniuk, M. Davenport, R. DeVore, M. Wakin, A simple proof of the restricted isometry property
for random matrices, Constructive Approximation, 28 (2008), 253–263. MR2453366 239
[8] R. Bhatia, Matrix Analysis. Graduate Texts in Mathematics, vol. 169. Springer, Berlin, 1997.
MR1477662 248
[9] C. Bordenave, M. Lelarge, L. Massoulie, Non-backtracking spectrum of random graphs: community
detection and non-regular Ramanujan graphs, Annals of Probability, to appear. MR3758726 248, 249
[10] S. Boucheron, G. Lugosi, P. Massart, Concentration inequalities. A nonasymptotic theory of indepen-
dence. With a foreword by Michel Ledoux. Oxford University Press, Oxford, 2013. MR3185193
239
[11] O. Bousquet, S. Boucheron, G. Lugosi, Introduction to statistical learning theory, in: Advanced
Lectures on Machine Learning, Lecture Notes in Computer Science 3176, pp. 169–207, Springer
Verlag 2004.
[12] T. Cai, R. Zhao, H. Zhou, Estimating structured high-dimensional covariance and precision matrices:
optimal rates and adaptive estimation, Electron. J. Stat. 10 (2016), 1–59. MR3466172 258
[13] E. Candes, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction from highly
incomplete frequency information, IEEE Trans. Inform. Theory 52 (2006), 489–509. MR2236170 268
[14] E. Candes, B. Recht, Exact Matrix Completion via Convex Optimization, Foundations of Computa-
tional Mathematics 9 (2009), 717–772. MR2565240 258
[15] E. Candes, T. Tao, The power of convex relaxation: near-optimal matrix completion, IEEE Trans. Inform.
Theory 56 (2010), 2053–2080. MR2723472 259
[16] R. Chen, A. Gittens, J. Tropp, The masked sample covariance estimator: an analysis using matrix con-
centration inequalities, Inf. Inference 1 (2012), 2–20. MR3311439 258
[17] P. Chin, A. Rao, and V. Vu, Stochastic block model and community detection in the sparse graphs: A
spectral algorithm with optimal rate of recovery, preprint, 2015. 249
[18] M. Davenport, Y. Plan, E. van den Berg, M. Wootters, 1-bit matrix completion, Inf. Inference 3
(2014), 189–223. MR3311452 259
[19] M. Davenport, M. Duarte, Yonina C. Eldar, Gitta Kutyniok, Introduction to compressed sensing, in:
Compressed sensing. Edited by Yonina C. Eldar and Gitta Kutyniok. Cambridge University Press,
Cambridge, 2012. MR2963166 268
[20] S. Dirksen, Tail bounds via generic chaining, Electron. J. Probab. 20 (2015), 1–29. MR3354613 268
[21] U. Feige, E. Ofek, Spectral techniques applied to sparse random graphs, Random Structures Algo-
rithms 27 (2005), 251–275. MR2155709 249
[22] S. Fortunato, D. Hric, Community detection in networks: A user guide. Phys. Rep. 659 (2016),
1–44. MR3566093 248
[23] S. Foucart, H. Rauhut, A mathematical introduction to compressive sensing. Applied and Numerical
Harmonic Analysis. Birkhäuser/Springer, New York, 2013. MR3100033 268
[24] Y. Gordon, Some inequalities for Gaussian processes and applications, Israel J. Math. 50 (1985), 265–289.
MR800188 268
[25] Y. Gordon, Elliptically contoured distributions, Prob. Th. Rel. Fields 76 (1987), 429–438. MR917672
268
[26] Y. Gordon, On Milman’s inequality and random subspaces which escape through a mesh in Rn , Geo-
metric aspects of functional analysis (1986/87), Lecture Notes in Math., vol. 1317, pp. 84–106.
MR950977 268
[27] Y. Gordon, Majorization of Gaussian processes and geometric applications, Prob. Th. Rel. Fields 91
(1992), 251–267. MR1147616 268
[28] D. Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE Trans. Inform. Theory
57 (2011), 1548–1566. MR2815834 259
[29] O. Guedon, R. Vershynin, Community detection in sparse networks via Grothendieck’s inequality, Prob-
ability Theory and Related Fields 165 (2016), 1025–1049. MR3520025 248, 249
[30] A. Javanmard, A. Montanari, F. Ricci-Tersenghi, Phase transitions in semidefinite relaxations, PNAS,
April 19, 2016, vol. 113, no.16, E2218–E2223. MR3494080 248, 249
[31] W. B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, in: R. Beals,
A. Beck, A. Bellow, et al. (eds.), Conference in Modern Analysis and Probability (New Haven,
Conn., 1982), Contemporary Mathematics 26, American Mathematical Society, Providence, RI,
1984, pp. 189–206. MR737400 239
[32] B. Hajek, Y. Wu, J. Xu, Achieving exact cluster recovery threshold via semidefinite programming, IEEE
Transactions on Information Theory 62 (2016), 2788–2797. MR3493879 248, 249
[33] P. W. Holland, K. B. Laskey, S. Leinhardt, Stochastic blockmodels: first steps, Social Networks 5
(1983), 109–137. MR718088 248
[34] D. Kane, J. Nelson, Sparser Johnson-Lindenstrauss Transforms, Journal of the ACM 61 (2014): 1.
MR3167920 239
[35] B. Klartag, S. Mendelson, Empirical processes and random projections, J. Funct. Anal. 225 (2005),
229–245. MR2149924 268
[36] V. Koltchinskii, K. Lounici, Concentration inequalities and moment bounds for sample covariance oper-
ators, Bernoulli 23 (2017), 110–133. MR3556768 258, 268
[37] C. Le, E. Levina, R. Vershynin, Concentration and regularization of random graphs, Random Struc-
tures and Algorithms, to appear. MR3689343 248, 249
[38] M. Ledoux, The concentration of measure phenomenon. American Mathematical Society, Providence,
RI, 2001. MR1849347 239
[39] M. Ledoux, M. Talagrand, Probability in Banach spaces. Isoperimetry and processes. Springer-Verlag,
Berlin, 1991. MR1102015 239
[40] E. Levina, R. Vershynin, Partial estimation of covariance matrices, Probability Theory and Related
Fields 153 (2012), 405–419. MR2948681 258
[41] C. Liaw, A. Mehrabian, Y. Plan, R. Vershynin, A simple tool for bounding the deviation of random ma-
trices on geometric sets, Geometric Aspects of Functional Analysis, Lecture Notes in Mathematics,
Springer, Berlin, to appear. MR3645128 268
[42] J. Matoušek, Lectures on discrete geometry. Graduate Texts in Mathematics, 212. Springer-Verlag,
New York, 2002. MR1899299 239
[43] F. McSherry, Spectral partitioning of random graphs, Proc. 42nd FOCS (2001), 529–537. MR1948742
249
[44] S. Mendelson, A few notes on statistical learning theory, in: Advanced Lectures on Machine
Learning, S. Mendelson, A. J. Smola (Eds.), LNAI 2600, pp. 1–40, 2003. 268
[45] S. Mendelson, A. Pajor, N. Tomczak-Jaegermann, Reconstruction and subgaussian operators in as-
ymptotic geometric analysis. Geom. Funct. Anal. 17 (2007), 1248–1282. MR2373017 268
[46] E. Mossel, J. Neeman, A. Sly, Belief propagation, robust reconstruction and optimal recovery of block
models. Ann. Appl. Probab. 26 (2016), 2211–2256. MR3543895 248, 249
[47] M. E. Newman, Networks. An introduction. Oxford University Press, Oxford, 2010. MR2676073 248
[48] R. I. Oliveira, Concentration of the adjacency matrix and of the Laplacian in random graphs with inde-
pendent edges, unpublished (2010), arXiv:0911.0600. 248, 249
[49] R. I. Oliveira, Sums of random Hermitian matrices and an inequality by Rudelson, Electron. Commun.
Probab. 15 (2010), 203–212. MR2653725 248
[50] Y. Plan, R. Vershynin, E. Yudovina, High-dimensional estimation with geometric constraints, Informa-
tion and Inference 0 (2016), 1–40. MR3636866 259
[51] S. Riemer, C. Schütt, On the expectation of the norm of random matrices with non-identically distributed
entries, Electron. J. Probab. 18 (2013), no. 29, 13 pp. MR3035757 258
[52] J. Tropp, User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12 (2012),
389–434. MR2946459 248
[53] J. Tropp, An introduction to matrix concentration inequalities. Found. Trends Mach. Learning 8 (2015),
1–230. 248
[54] R. van Handel, Structured random matrices. in: IMA Volume “Discrete Structures: Analysis and
Applications”, Springer, to appear.
[55] R. van Handel, On the spectral norm of Gaussian random matrices, Trans. Amer. Math. Soc., to appear.
MR3695857 258
[56] B. Recht, A simpler approach to matrix completion, J. Mach. Learn. Res. 12 (2011), 3413–3430.
MR2877360 259
[57] G. Schechtman, Two observations regarding embedding subsets of Euclidean spaces in normed spaces,
Adv. Math. 200 (2006), 125–135. MR2199631 268
[58] R. Vershynin, Introduction to the non-asymptotic analysis of random matrices. Compressed sensing,
210–268, Cambridge University Press, Cambridge, 2012. MR2963170 232, 239, 258, 268
[59] R. Vershynin, Estimation in high dimensions: a geometric perspective. Sampling Theory, a Renaissance,
3–66, Birkhauser Basel, 2015. MR3467418 232, 259, 268
[60] R. Vershynin, High-Dimensional Probability. An Introduction with Applications in Data Science. Cam-
bridge University Press, to appear. 232, 239, 258, 268
[61] H. Zhou, A. Zhang, Minimax Rates of Community Detection in Stochastic Block Models, Annals of
Statistics, to appear. MR3546450 248, 249

Department of Mathematics, University of Michigan, 530 Church Street, Ann Arbor, MI 48109,
U.S.A.
Email address: [email protected]
IAS/Park City Mathematics Series
Volume 25, Pages 273–325
https://doi.org/10.1090/pcms/025/00834

Homological Algebra and Data

Robert Ghrist

Contents
Introduction and Motivation
  What is Homology?
  When is Homology Useful?
  Scheme
Lecture 1: Complexes and Homology
  Spaces
  Spaces and Equivalence
  Application: Neuroscience
Lecture 2: Persistence
  Towards Functoriality
  Sequences
  Stability
  Application: TDA
Lecture 3: Compression and Computation
  Sequential Manipulation
  Homology Theories
  Application: Algorithms
Lecture 4: Higher Order
  Cohomology and Duality
  Cellular Sheaves
  Cellular Sheaf Cohomology
  Application: Sensing and Evasion
Conclusion: Beyond Linear Algebra

Introduction and Motivation


These lectures are meant as an introduction to the methods and perspectives of
Applied Topology for students and researchers in areas including but not limited
to data science, neuroscience, complex systems, and statistics. Though the tools

2010 Mathematics Subject Classification. Primary 55-01; Secondary 18G35, 55N30.


Key words and phrases. cohomology, complexes, homology, persistence, sheaves.
RG supported by the Office of the Assistant Secretary of Defense Research & Engineering through
ONR N00014-16-1-2010.
©2018 Robert Ghrist
are mathematical in nature, this article will treat the formalities with a light touch
and heavy references, in order to make the subject more accessible to practitioners.
See the concluding section for a roadmap for finding more details. The material
is written for beginning graduate students in any of the applied mathematical
sciences (though some mathematical maturity is helpful).

What is Homology?
Homology is an algebraic compression scheme that excises all but the essen-
tial topological features from a particular class of data structures arising naturally
from topological spaces. Homology therefore pairs with topology. Topology is
the mathematics of abstract spaces and the transformations between them. The notion
of a space, X, requires only a set together with a notion of nearness, expressed as
a system of subsets comprising the “open” neighborhoods satisfying certain con-
sistency conditions. Metrics are permissible but not required. So many familiar
notions in applied mathematics – networks, graphs, data sets, signals, imagery,
and more – are interpretable as topological spaces, often with useful auxiliary
structures. Furthermore, manipulations of such objects, whether as comparison,
inference, or metadata, are expressible in the language of mappings, or contin-
uous relationships between spaces. Topology concerns the fundamental notions
of equivalence up to the loose nearness of what makes a space. Thus, connectiv-
ity and holes are significant; bends and corners less so. Topological invariants
of spaces and mappings between them record the essential qualitative features,
insensitive to coordinate changes and deformations.
Homology is the simplest, general, computable invariant of topological data.
In its most primal manifestation, the homology of a space X returns a sequence
of vector spaces H• (X), the dimensions of which count various types of linearly
independent holes in X. Homology is inherently linear-algebraic, but transcends
linear algebra, serving as the inspiration for homological algebra. It is this algebraic
engine that powers the subject.

When is Homology Useful?


Homological methods are, almost by definition, robust, relying on neither pre-
cise coordinates nor careful estimates for efficacy. As such, they are most useful in
settings where geometric precision fails. With great robustness comes both great
flexibility and great weakness. Topological data analysis is more fundamental
than revolutionary: such methods are not intended to supplant analytic, proba-
bilistic, or spectral techniques. They can however reveal a deeper basis for why
some data sets and systems behave the way they do. It is unwise to wield topo-
logical techniques in isolation, assuming that the weapons of unfamiliar “higher”
mathematics are clad in incorruptible silver.
Scheme
There is far too much material in the subject of algebraic topology to be sur-
veyed here. Existing applications alone span an enormous range of principles and
techniques, and the subject of applications of homology and homological algebra
is in its infancy still. As such, these notes are selective to a degree that suggests
caprice. For deeper coverage of the areas touched on here, complete with illustra-
tions, see [51]. For alternate ranges and perspectives, there are now a number of
excellent sources, including [40, 62, 76]. These notes will deemphasize formalities
and ultimate formulations, focusing instead on principles, with examples and ex-
ercises. The reader should not infer that the theorems or theoretic minutiae are
anything less than critical in practice.
These notes err on the side of simplicity. The many included exercises are
not of the typical lemma-lemma-theorem form appropriate for a mathematics
course; rather, they are meant to ground the student in examples. There is an
additional layer of unstated problems for the interested reader: these notes are
devoid of figures. The student apt with a pen should endeavor to create cartoons
to accompany the various definitions and examples presented here, with the aim
of minimality and clarity of encapsulation. The author’s attempt at such can be
found in [51].

Lecture 1: Complexes and Homology


This lecture will introduce the initial objects and themes of applied algebraic
topology. There is little novel here: all definitions are standard and found in
standard texts. The quick skip over formalities, combined with a linear-algebraic
sensibility, allows for a rapid ascent to the interesting relationships to be found
in homology and homological algebra.

Spaces
A space is a set X together with a compendium of all subsets in X deemed
“open,” which subcollection must of necessity satisfy a list of intuitively obvious
properties. The interested reader should consult any point-set topology book
(such as [70]) briefly. All the familiar spaces of elementary calculus – surfaces,
level sets of functions, Euclidean spaces – are indeed topological spaces and just
the beginning of the interesting spaces studied in manifold theory, algebraic ge-
ometry, differential geometry, and more. These tend to be frustratingly indiscrete.
Applications involving computation prompt an emphasis on those spaces that are
easily digitized. Such are usually called complexes, often with an adjectival prefix.
Several are outlined below.
Simplicial Complexes Consider a set X of discrete objects. A k-simplex in X is
an unordered collection σ of k + 1 distinct elements of X. Though the definition is
combinatorial, for X a set of points in a Euclidean space [viz. point-cloud data set]
one visualizes a simplex as the geometric convex hull of the k + 1 points, a “filled-
in” clique: thus, 0-simplices are points, 1-simplices are edges, 2-simplices are
filled-in triangles, etc. A complex is a collection of multiple simplices.1 In particular,
a simplicial complex on X is a collection of simplices in X that is downward closed,
in the sense that every subset of every simplex is also a simplex in the complex.
One says that X contains all its faces. Greek letters (especially σ and τ) will be used
to denote simplices in what follows.
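The downward-closure condition translates directly into a data structure: store each simplex as a set of vertices and generate all of its subsets. A minimal sketch in Python (the `frozenset` representation and function name are illustrative choices, not from the text):

```python
from itertools import combinations

def simplicial_closure(maximal_simplices):
    """Smallest downward-closed collection of simplices containing the
    given ones: every subset of every simplex is also included."""
    complex_ = set()
    for sigma in maximal_simplices:
        verts = tuple(sigma)
        for k in range(1, len(verts) + 1):
            for face in combinations(verts, k):
                complex_.add(frozenset(face))
    return complex_

# Closing up a single 2-simplex {a, b, c} yields 3 vertices, 3 edges,
# and the triangle itself: 7 simplices in all.
K = simplicial_closure([{"a", "b", "c"}])
print(len(K))  # 7
```

By construction every face of a simplex is itself a simplex, which is exactly the condition in the definition above.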

Exercise 1.1. Recall that a collection of random variables $X = \{X_i\}_{i=1}^{k}$ on a fixed
domain are statistically independent if their probability densities $f_{X_i}$ are jointly
multiplicative (that is, the probability density $f_X$ of the combined random variable
$(X_1, \ldots, X_k)$ satisfies $f_X = \prod_i f_{X_i}$). Given a set of n random variables on
a fixed domain, explain how one can build a simplicial complex using statistical
independence to define simplices. What is the maximal dimension of this
independence complex? What does the number of connected components of the
independence complex tell you? Is it possible to have all edges present and no
higher-dimensional faces?

Exercise 1.2. Not all interesting simplicial complexes are simple to visualize. Con-
sider a finite-dimensional real vector space V and consider V to be the vertex set
of a simplicial complex defined as follows: a k-simplex consists of k + 1 linearly
independent members of V. Is the resulting independence complex finite? Finite-
dimensional? What does the dimension of this complex tell you?

Simplicial complexes as described are purely combinatorial objects, like the
graphs they subsume. As with graphs, one topologizes a simplicial complex as a
quotient space built from topological simplices. The standard k-simplex is the
following incarnation of its Platonic ideal:
(1.3) $\Delta^k = \Big\{\, x \in [0,1]^{k+1} \;:\; \sum_{i=0}^{k} x_i = 1 \,\Big\}.$
One topologizes an abstract simplicial complex into a space X by taking one for-
mal copy of Δk for each k-simplex of X, then identifying these together along
faces inductively. The way to do this formally is to take a disjoint union of stan-
dard simplices, one for each simplex in the complex; then identify or “glue” using
an equivalence relation. Specifically, define the k-skeleton of X, k ∈ N, to be the
quotient space:
(1.4) $X^{(k)} = \Bigg( X^{(k-1)} \;\sqcup \bigsqcup_{\sigma \,:\, \dim \sigma = k} \Delta^k \Bigg) \Bigg/ \sim\;,$

where ∼ is the equivalence relation that identifies faces of Δk with the corresponding
combinatorial faces of σ in X(j) for j < k. Thus, for example, X(0) is a discrete
collection of points (the vertices) and X(1) is the abstract space obtained by gluing
edges to the vertices using information stored in the 1-simplices.
1 The etymology of both words is salient.

Exercise 1.5. How many total k-simplices are there in the closed n-simplex for
k < n?

Vietoris–Rips Complexes A data set in the form of a finite metric space (X, d)
gives rise to a family of simplicial complexes in the following manner. The
Vietoris–Rips complex (or VR-complex) of (X, d) at scale ε > 0 is the simplicial
complex VRε(X) whose simplices are precisely those collections of points with
pairwise distance ≤ ε. Otherwise said, one connects points that are sufficiently
close, filling in sufficiently small holes, with sufficiency specified by ε.
These VR complexes have been used as a way of associating a simplicial complex
to point cloud data sets. One obvious difficulty, however, lies in the choice
of ε: too small, and nothing is connected; too large, and everything is connected.
The question of which ε to use has no easy answer. However, the perspectives of
algebraic topology offer a modified question: how to integrate structures across all
ε values? This will be considered in Lecture 2 of this series.
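The definition is directly computable by brute force: a subset of points spans a simplex of VRε(X) precisely when all its pairwise distances are at most ε. A sketch with an explicit dimension cap to limit the combinatorial blow-up (the function signature and the cap are assumptions of this sketch):

```python
from itertools import combinations
import math

def vietoris_rips(points, eps, max_dim=2):
    """All simplices of the Vietoris-Rips complex of a finite point set
    at scale eps, listed as tuples of point indices, up to max_dim.

    A candidate subset is a simplex iff every pair lies within eps."""
    n = len(points)
    simplices = []
    for k in range(1, max_dim + 2):  # k vertices span a (k-1)-simplex
        for idx in combinations(range(n), k):
            if all(math.dist(points[i], points[j]) <= eps
                   for i, j in combinations(idx, 2)):
                simplices.append(idx)
    return simplices

# Four corners of a unit square at eps = 1: the four sides appear but
# not the diagonals, so the complex is a hollow 4-cycle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
edges = [s for s in vietoris_rips(square, 1.0) if len(s) == 2]
print(edges)
```

Raising ε past √2 fills in the diagonals and the triangles they create, illustrating the scale-dependence just described.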
Flag/clique complexes The VR complex is a particular instance of the following
construct. Given a graph (network) X, the flag complex or clique complex of X is
the maximal simplicial complex X that has the graph as its 1-skeleton: X(1) = X.
What this means in practice is that whenever you “see” the skeletal frame of
a simplex in X, you fill it and all its faces in with simplices. Flag complexes are
advantageous as data structures for spaces, in that you do not need to input/store
all of the simplices in a simplicial complex: the 1-skeleton consisting of vertices
and edges suffices to define the rest of the complex.

Exercise 1.6. Consider a combinatorial simplicial complex X on a vertex set of
size n. As a function of this n, how difficult is it to store in memory enough
information about X to reconstruct the list of its simplices? (There are several
ways to approach this: see Exercise 1.5 for one approach.) Does this worst-case
complexity improve if you know that X is a flag complex?
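The storage economy of flag complexes is easy to see in code: from vertices and edges alone, every higher simplex is regenerated by testing cliques. A brute-force sketch (worst-case exponential, in line with Exercise 1.6; all names are illustrative):

```python
from itertools import combinations

def flag_complex(vertices, edges, max_dim=None):
    """Simplices of the flag (clique) complex of a graph: a vertex
    subset spans a simplex iff all of its pairs are edges."""
    E = {frozenset(e) for e in edges}
    top = len(vertices) if max_dim is None else max_dim + 1
    simplices = []
    for k in range(1, top + 1):
        found = False
        for sigma in combinations(sorted(vertices), k):
            if all(frozenset(p) in E for p in combinations(sigma, 2)):
                simplices.append(sigma)
                found = True
        if not found:  # no k-cliques means no larger cliques either
            break
    return simplices

# A triangle with a pendant edge: the 3-clique {0,1,2} is filled in
# automatically; only the 1-skeleton needed to be stored.
S = flag_complex({0, 1, 2, 3}, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(S)
```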

Nerve Complexes This is a particular example of a nerve complex associated to
a collection of subsets.
Let U = {Uα } be a collection of open subsets of a topological space X. The
nerve of U, N(U), is the simplicial complex defined by the intersection lattice of U.
The k-simplices of N(U) correspond to nonempty intersections of k + 1 distinct
elements of U. Thus, vertices of the nerve correspond to elements of U; edges
correspond to pairs in U which intersect nontrivially. This definition respects
faces: the faces of a k-simplex are obtained by removing corresponding elements
of U, leaving the resulting intersection still nonempty.
278 Homological Algebra and Data

Exercise 1.7. Compute all possible nerves of four bounded convex subsets in
the Euclidean plane. What is and is not possible? Now, repeat, but with two
nonconvex subsets of Euclidean R3 .
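For a finite cover, the nerve is computed by intersecting subsets. A sketch in which each cover element is modelled as a finite set of sample points (an illustrative stand-in for open sets):

```python
from itertools import combinations

def nerve(cover):
    """Simplices of the nerve of a finite cover: k+1 cover elements
    span a k-simplex iff their common intersection is nonempty.
    Simplices are tuples of indices into `cover`."""
    n = len(cover)
    simplices = []
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            if set.intersection(*(cover[i] for i in idx)):
                simplices.append(idx)
    return simplices

# Three arcs covering a circle, sampled at six points: adjacent arcs
# overlap pairwise, but there is no triple intersection, so the nerve
# is a hollow triangle.
U = [{1, 2, 3}, {3, 4, 5}, {5, 6, 1}]
print(nerve(U))
```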

Dowker Complexes There is a matrix version of the nerve construction that
is particularly relevant to applications, going back (at least) to the 1952 paper of
Dowker [39]. For simplicity, let X and Y be finite sets with R ⊂ X × Y representing
the ones in a binary matrix (also denoted R) whose columns are indexed by X
and whose rows are indexed by Y. The Dowker complex of R on X is the simplicial
complex on the vertex set X defined by the rows of the matrix R. That is, each
row of R determines a subset of X: use this to generate a simplex and all its faces.
Doing so for all the rows gives the Dowker complex on X. There is a dual Dowker
complex on Y whose simplices on the vertex set Y are determined by the ones in
columns of R.

Exercise 1.8. Compute the Dowker complex and the dual Dowker complex of the
following relation R:
(1.9) $R = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}.$
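Such computations can be verified by machine: each row of R spans a simplex on the column set X (its faces implied), and the dual complex comes from the transpose. A sketch listing only the generating simplices, with 0-indexed rows and columns (an assumed convention):

```python
import numpy as np

def dowker_generators(R):
    """Generating simplices of the Dowker complex of a binary relation:
    one simplex of column indices per row.  Pass R.T for the dual
    Dowker complex.  (An all-zero row contributes the empty set.)"""
    return [frozenset(np.flatnonzero(row)) for row in R]

R = np.array([[1, 0, 0, 0, 1, 1, 0, 0],
              [0, 1, 1, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 1, 0, 1],
              [1, 0, 1, 0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0, 1, 1, 0]])
print(dowker_generators(R))    # five generators on X (2- and 3-simplices)
print(dowker_generators(R.T))  # eight generators on Y; column 3 is empty
```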

Dowker complexes have been used in a variety of social science contexts (where
X and Y represent agents and attributes respectively) [7]. More recent applications
of these complexes have arisen in settings ranging from social networks [89] to
sensor networks [54]. The various flavors of witness complexes in the literature on
topological data analysis [37, 57] are special cases of Dowker complexes.
Cell Complexes There are other ways to build spaces out of simple pieces.
These, too, are called complexes, though not simplicial, as they are not necessarily
built from simplices. They are best described as cell complexes, being built from
cells of various dimensions sporting a variety of possible auxiliary structures.
A cubical complex is a cell complex built from cubes of various dimensions,
the formal definition mimicking Equation (1.4): see [51, 62]. These often arise as
the natural model for pixel or voxel data in imagery and time series. Cubical
complexes have found other uses in modelling spaces of phylogenetic trees [17,
77] and robot configuration spaces [1, 53, 55].
There are much more general cellular complexes built from simple pieces with
far less rigidity in the attachments between pieces. Perhaps the most general
useful model of a cell complex is the CW complex used frequently in algebraic
topology. The idea of a CW complex is this: one begins with a disjoint union
of points X(0) as the 0-skeleton. One then inductively defines the n-skeleton of
X, X(n), as the (n − 1)-skeleton along with a collection of closed n-dimensional
balls, Dn, each glued to X(n−1) via attaching maps on the boundary spheres
∂Dn → X(n−1). In dimension one, [finite] CW complexes, simplicial complexes,
and cubical complexes are identical and equivalent to [finite] graphs.2 In higher
dimensions, these types of cell complexes diverge in expressivity and ease of use.

Spaces and Equivalence


Many of the spaces of interest in topological data analysis are finite metric
spaces [point clouds] and simplicial approximations and generalizations of these.
However, certain spaces familiar from basic calculus are relevant. We have already
referenced Dn , the closed unit n-dimensional ball in Euclidean Rn . Its boundary
defines the standard sphere Sn−1 of dimension n − 1. The 1-sphere S1 is also
the 1-torus, where, by n-torus is meant the [Cartesian] product Tn = (S1 )n of
n circles. The 2-sphere S2 and 2-torus T2 are compact, orientable surfaces of
genus 0 and 1 respectively. For any genus g ∈ N, there is a compact orientable
surface Σg with that genus: for g > 1 these look like g 2-tori merged together so
as to have the appearance of having g holes. All orientable genus g surfaces are
“topologically equivalent”, though this is as yet imprecise.
One soon runs into difficulty with descriptive language for spaces and equiv-
alences, whether via coordinates or visual features. Another language is needed.
Many of the core results of topology concern equivalence, detection, and res-
olution: are two spaces or maps between spaces qualitatively the same? This
presumes a notion of equivalence, of which there are many. In what follows, map
always means a continuous function between spaces.
Homeomorphism and Homotopy A homeomorphism is a map f : X → Y with
continuous inverse. This is the strongest form of topological equivalence, dis-
tinguishing spaces of different (finite) dimensions or different essential features
(e.g., genus of surfaces) and also distinguishing an open from a closed interval.
The more loose and useful equivalence is that generated by homotopy. A homotopy
between maps, $f_0 \simeq f_1 : X \to Y$, is a continuous 1-parameter family of maps
$f_t : X \to Y$. A homotopy equivalence is a map $f : X \to Y$ with a homotopy inverse,
$g : Y \to X$, satisfying $f \circ g \simeq \mathrm{Id}_Y$ and $g \circ f \simeq \mathrm{Id}_X$. One says that such an X and Y
are homotopic. This is the core equivalence relation among spaces in topology.

Exercise 1.10. A space is contractible if it is homotopic to a point. (1) Show explic-
itly that Dn is contractible. (2) Show that D3 with a point in the interior removed
is homotopic to S2 . (3) Argue that the twice-punctured plane is homotopic to a
“figure-eight.” It’s not so easy to do this with explicit maps and coordinates, is it?
2 With the exception of loop edges, which are generally not under the aegis of a graph, but are permissible in CW complexes.
Many of the core results in topology are stated in the language of homotopy
(and are not true when homotopy is replaced with the more restrictive homeomor-
phism). For example:
Theorem 1.11. If U is a finite collection of open contractible subsets of X with all non-
empty intersections of subcollections of U contractible, then N(U) is homotopic to the
union ∪α Uα .
Theorem 1.12. Given any binary relation R ⊂ X × Y, the Dowker and dual Dowker
complexes are homotopic.
Homotopy invariants — functions that assign equivalent values to homotopic
inputs — are central both to topology and its applications to data (as noise per-
turbs spaces in an often non-homeomorphic but homotopic manner). Invariants
of finite simplicial and cell complexes invite a computational perspective, since
one has the hope of finite inputs and felicitous data structures.
Euler Characteristic The simplest nontrivial topological invariant of finite cell
complexes dates back to Euler. It is elementary, combinatorial, and sublime. The
Euler characteristic of a finite cell complex X is:
(1.13) $\chi(X) = \sum_{\sigma} (-1)^{\dim \sigma},$
where the sum is over all cells σ of X.
Exercise 1.14. Compute explicitly the Euler characteristics of the following cell
complexes: (1) the decompositions of the 2-sphere, S2 , defined by the boundaries
of the five regular Platonic solids; (2) the CW complex having one 0-cell and
one 2-cell disc whose boundary is attached directly to the 0-cell (“collapse the
boundary circle to a point”); and (3) the thickened 2-sphere S2 × [0, 1]. How did
you put a cell structure on this last 3-dimensional space?
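Part (1) of this exercise is easy to check mechanically: generate all faces of the maximal cells and alternate signs by dimension. A minimal sketch for the octahedral decomposition of S² (the vertex labelling, with i and i+3 antipodal, is an arbitrary choice):

```python
from itertools import combinations

def euler_characteristic(maximal_simplices):
    """chi = sum over all simplices of (-1)^dim, where a simplex on
    k+1 vertices has dimension k."""
    faces = set()
    for sigma in maximal_simplices:
        for k in range(1, len(sigma) + 1):
            faces.update(combinations(sorted(sigma), k))
    return sum((-1) ** (len(f) - 1) for f in faces)

# Octahedron: vertices {0,...,5} with i and i+3 antipodal; its 8 faces
# choose one vertex from each antipodal pair.  V - E + F = 6 - 12 + 8.
octahedron = [(a, b, c) for a in (0, 3) for b in (1, 4) for c in (2, 5)]
print(euler_characteristic(octahedron))  # 2
```

The same function applied to a triangulated circle (three vertices, three edges) returns 0, consistent with Theorem 1.15.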
Completion of this exercise suggests the following result:
Theorem 1.15. Euler characteristic is a homotopy invariant among finite cell complexes.
That this is so would seem to require a great deal of combinatorics to prove.
The modern proof transcends combinatorics, making the problem hopelessly un-
computable before pulling back to the finite world, as will be seen in Lecture 3.
Exercise 1.16. Prove that the Euler characteristic distinguishes [connected] trees
from [connected] graphs with cycles. What happens if the connectivity require-
ment is dropped?
Euler characteristic is a wonderfully useful invariant, with modern applica-
tions ranging from robotics [42, 47] and AI [68] to sensor networks [9–11] to
Gaussian random fields [4,6]. In the end, however, it is a numerical invariant, and
has a limited resolution. The path to improving the resolution of this invariant is
to enrich the underlying algebra that the Eulerian ±1 obscures in Equation (1.13).
Robert Ghrist 281

Lifting to Linear Algebra One of the core themes of this lecture series is the lift-
ing of cell complexes to algebraic complexes on which the tools of homological
algebra can be brought to bear. This is not a novel idea: most applied mathemati-
cians learn, e.g., to use the adjacency matrix of a graph as a means of harnessing
linear-algebraic ideas to understand networks. What is novel is the use of higher-
dimensional structure and the richer algebra this entails.
Homological algebra is often done with modules over a commutative ring. For
clarity of exposition, let us restrict to the nearly trivial setting of finite-dimensional
vector spaces over a field F, typically either R or, when orientations are bother-
some, F2 , the binary field.
Given a cell complex, one lifts the topological cells to algebraic objects by using
them as bases for vector spaces. One remembers the dimensions of the cells by
using a sequence of vector spaces, with dimension as a grading that indexes the
vector spaces. Consider the following sequence C = (Ck ) of vector spaces, where
the grading k is in N.
(1.17) ··· −→ Ck −→ Ck−1 −→ ··· −→ C1 −→ C0 .
In algebraic topology, one often uses a “star” or “dot” to denote a grading: this
chapter will use a dot, as in C = (C• ). For a finite (and thus finite-dimensional)
cell complex, the sequence becomes all zeros eventually. Such a sequence does
not obviously offer an algebraic advantage over the original space; indeed, much
of the information on how cells are glued together has been lost. However, it
is easy to “lift” the Euler characteristic to this class of algebraic objects. For C a
sequence of finite-dimensional vector spaces with finitely many nonzero terms,
define:

(1.18) χ(C) = Σk (−1)k dim Ck .

Chain Complexes Recall that basic linear algebra does not focus overmuch on
vector spaces and bases; it is in linear transformations that power resides. Aug-
menting a sequence of vector spaces with a matching sequence of linear transfor-
mations adds in the assembly instructions and permits a fuller algebraic repre-
sentation of a topological complex. Given a simplicial3 complex X, fix a field F
and let C = (Ck , ∂k ) denote the following sequence of F-vector spaces and linear
transformations.
(1.19) ··· −→ Ck −∂k→ Ck−1 −∂k−1→ ··· −∂2→ C1 −∂1→ C0 −∂0→ 0 .

Each Ck has as basis the k-simplices of X. Each ∂k restricted to a k-simplex
basis element sends it to a linear combination of those basis elements in Ck−1 de-
termined by the k + 1 faces of the k-simplex. This is simplest in the case F = F2 ,
3 Cell complexes in full generality can be used with more work put into the definitions of the linear
transformations: see [58].
282 Homological Algebra and Data

in which case orientations can be ignored; otherwise, one must affix an orienta-
tion to each simplex and proceed accordingly: see [51, 58] for details on how this
is performed.
The chain complex is the primal algebraic object in homological algebra. It
is rightly seen as the higher-dimensional analogue of a graph together with its
adjacency matrix. The chain complex will often be written as C = (C• , ∂), with
the grading implied in the subscript dot. The boundary operator, ∂ = ∂• , can be
thought of either as a sequence of linear transformations, or as a single operator
acting on the direct sum of the Ck .
Homology Homological algebra begins with the following suspiciously simple
statement about simplicial complexes.
Lemma 1.20. The boundary of a boundary is null:
(1.21) ∂2 = ∂k−1 ◦ ∂k = 0,
for all k.
Proof. For simplicity, consider the case of an abstract simplicial complex on a
vertex set V = {vi } with chain complex having F2 coefficients. The face map Di
acts on a simplex by removing the ith vertex vi from the simplex’s list, if present;
else, return zero. The graded boundary operator ∂ : C• → C• is thus a formal sum
of face maps ∂ = Σi Di . It suffices to show that ∂2 = 0 on each basis simplex σ.
Computing the composition in terms of face maps, one obtains:
(1.22) ∂2 σ = Σi≠j Dj Di σ .

Each (k − 2)-face of the k-simplex σ is represented exactly twice in the image of
Dj Di over all i ≠ j. Thanks to F2 coefficients, the sum over this pair is zero.
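The computation in this proof can be checked by machine on a small example. A minimal sketch over F2 (helper names like `boundary_matrix` are ours, not from the text; simplices are stored as sorted vertex tuples):

```python
import numpy as np
from itertools import combinations

def simplices_by_dim(top_simplices):
    """All faces of the given simplices, grouped and sorted by dimension."""
    by_dim = {}
    for s in top_simplices:
        s = tuple(sorted(s))
        for k in range(1, len(s) + 1):
            for face in combinations(s, k):
                by_dim.setdefault(k - 1, set()).add(face)
    return {k: sorted(v) for k, v in by_dim.items()}

def boundary_matrix(k_simplices, km1_simplices):
    """F2 matrix of the boundary map: a k-simplex goes to the sum of its (k-1)-faces."""
    idx = {s: i for i, s in enumerate(km1_simplices)}
    D = np.zeros((len(km1_simplices), len(k_simplices)), dtype=np.int64)
    for j, s in enumerate(k_simplices):
        for i in range(len(s)):          # the face map D_i drops the i-th vertex
            D[idx[s[:i] + s[i + 1:]], j] += 1
    return D % 2

# Hollow tetrahedron: the four triangular faces of a 3-simplex.
X = simplices_by_dim([(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)])
d2 = boundary_matrix(X[2], X[1])
d1 = boundary_matrix(X[1], X[0])
print(((d1 @ d2) % 2 == 0).all())   # True: the boundary of a boundary is zero
```

Each (k − 2)-face appears twice in the composite, so every entry of d1 @ d2 is even, i.e., zero mod 2.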
Inspired by what happens with simplicial complexes, one defines an algebraic
complex to be any sequence C = (C• , ∂) of vector spaces and linear transformations
with the property that ∂2 = 0. Going two steps along the sequence is the zero-
map.
Exercise 1.23. Show that for any algebraic complex C, im ∂k+1 is a subspace of
ker ∂k for all k.
The homology of an algebraic complex C, H• (C), is a complex of vector spaces
defined as follows. The k-cycles of C are elements of Ck with “zero boundary”,
denoted Zk = ker ∂k . The k-boundaries of C are elements of Ck that are the
boundary of something in Ck+1 , and are denoted Bk = im ∂k+1 . The homology of
C is the complex H• (C) of quotient vector spaces Hk (C), for k ∈ N, given by:

Hk (C) = Zk /Bk
(1.24) = ker ∂k / im ∂k+1
= cycles/ boundaries.
Homology inherits the grading of the complex C and has trivial (zero) linear
transformations connecting the individual vector spaces. Elements of H• (C) are
homology classes and are denoted [α] ∈ Hk , where α ∈ Zk is a k-cycle and [·]
denotes the equivalence class modulo elements of Bk .

Exercise 1.25. If C has boundary maps that are all zero, what can you say about
H• (C)? What if all the boundary maps (except at the ends) are isomorphisms
(injective and surjective)?

Homology of Simplicial Complexes For X a simplicial complex, the chain com-
plex of F2 -vector spaces generated by simplices of X and the boundary attaching
maps is a particularly simple algebraic complex associated to X. The homology
of this complex is usually denoted H• (X) or perhaps H• (X; F2 ) when the binary
coefficients are to be emphasized.

Exercise 1.26. Show that if X is a connected simplicial complex, then H0 (X) = F2 .
Argue that dim H0 (X) equals the number of connected components of X.

Exercise 1.27. Let X be a one-dimensional simplicial complex with five vertices
and eight edges [figure omitted]. Show that dim H1 (X) = 4.

One of the important aspects of homology is that it allows one to speak of
cycles that are linearly independent. It is true that for a graph, dim H1 is the number
of independent cycles in the graph. One might have guessed that the number of
cycles in the previous exercise is five, not four; however, the fifth can always be
expressed as a linear combination of the other four basis cycles.
Graphs have nearly trivial homology, since there are no simplices of higher
dimension. Still, one gets from graphs the [correct] intuition that H0 counts con-
nected components and H1 counts loops. Higher-dimensional homology mea-
sures higher-dimensional “holes” as detectable by cycles.
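Definition (1.24) is computable by linear algebra alone: over F2, βk = dim Hk = dim Ck − rank ∂k − rank ∂k+1. A self-contained sketch (all names are ours; rank is computed by Gaussian elimination over F2, since ranks over R and F2 can differ in general):

```python
import numpy as np
from itertools import combinations

def rank_f2(M):
    """Rank of a 0/1 matrix over the binary field F2 (Gaussian elimination)."""
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]      # swap the pivot row into place
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] = (M[r] + M[rank]) % 2      # clear the column mod 2
        rank += 1
    return rank

def betti_numbers(top_simplices):
    """F2 Betti numbers of the simplicial complex generated by top_simplices."""
    by_dim = {}
    for s in top_simplices:                      # enumerate all faces
        s = tuple(sorted(s))
        for k in range(1, len(s) + 1):
            for face in combinations(s, k):
                by_dim.setdefault(k - 1, set()).add(face)
    by_dim = {k: sorted(v) for k, v in by_dim.items()}
    top = max(by_dim)
    ranks = {0: 0, top + 1: 0}                   # d_0 and d_{top+1} are zero maps
    for k in range(1, top + 1):
        idx = {s: i for i, s in enumerate(by_dim[k - 1])}
        D = np.zeros((len(by_dim[k - 1]), len(by_dim[k])), dtype=np.int64)
        for j, s in enumerate(by_dim[k]):
            for i in range(len(s)):              # face map: drop the i-th vertex
                D[idx[s[:i] + s[i + 1:]], j] += 1
        ranks[k] = rank_f2(D % 2)
    # beta_k = dim Z_k - dim B_k = (n_k - rank d_k) - rank d_{k+1}
    return [len(by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top + 1)]

# Hollow tetrahedron, a triangulation of S^2: (beta_0, beta_1, beta_2) = (1, 0, 1).
print(betti_numbers([(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]))   # [1, 0, 1]
```

The same routine answers Exercises 1.26 and 1.27 for small inputs; e.g., the hollow triangle [(0, 1), (1, 2), (0, 2)] returns [1, 1].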

Exercise 1.28. Compute explicitly the F2 -homology of the cell decompositions
of the 2-sphere, S2 , defined by the boundaries of the five regular Platonic solids.
When doing so, recall that ∂2 takes the various types of 2-cells (triangles, squares,
pentagons) to the formal sum of their boundary edges, using addition with F2
coefficients.

These examples testify as to one of the most important features of homology.

Theorem 1.29. Homology is a homotopy invariant.

As stated, the above theorem would seem to apply only to cell complexes.
However, as we will detail in Lecture 3, we can define homology for any topologi-
cal space independent of cell structure; to this, as well, the above theorem applies.
Thus, we can talk of the homology of a space independent of any cell structure
or concrete representation: homotopy type is all that matters. It therefore makes
sense to explore some basic examples. The following are the homologies of the
n-dimensional sphere, Sn ; the n-dimensional torus, Tn ; and the oriented surface
Σg of genus g.
(1.30) dim Hk (Sn ) = 1 for k = n and for k = 0 ; 0 otherwise.
(1.31) dim Hk (Tn ) = (n choose k), the binomial coefficient.
(1.32) dim Hk (Σg ) = 1 for k = 0 ; 2g for k = 1 ; 1 for k = 2 ; 0 for k > 2.
Betti Numbers We see in the above examples that the dimensions of the homol-
ogy are the most notable features. In the history of algebraic topology, these
dimensions of the homology groups — called Betti numbers βk = dim Hk — were
the first invariants investigated. They are just the beginning of the many connec-
tions to other topological invariants. For example, we will explain the following
in Lecture 3:
Theorem 1.33. The Euler characteristic of a finite cell complex is the alternating sum of
its Betti numbers: χ(X) = Σk (−1)k βk .
For the moment, we will focus on applications of Betti numbers as a topological
statistic. In Lecture 2 and following, however, we will go beyond Betti numbers
and consider the richer internal structure of homologies.

Application: Neuroscience
Each of these lectures ends with a sketch of some application(s): this first
sketch will focus on the use of Betti numbers. Perhaps the best-to-date exam-
ple of the use of homology in data analysis is the following recent work of
Giusti, Pastalkova, Curto, & Itskov [56] on network inference in neuroscience
using parametrized Betti numbers as a statistic.
Consider the challenge of inferring how a collection of neurons is wired to-
gether. Because of the structure of a neuron (in particular the length of the
axon), mere physical proximity does not characterize the wiring structure: neu-
rons which are far apart may in fact be “wired” together. Experimentalists can
measure the responses of individual neurons and their firing sequences as a re-
sponse to stimuli. By comparing time-series data from neural probes, the corre-
lations of neuron activity can be estimated, resulting in a correlation matrix with
entries, say, between zero and one, referencing the estimated correlation between
neurons, with the diagonal, of course, consisting of ones. By thresholding the
correlation matrix at some value, one can estimate the “wiring network” of how
neurons are connected.
Unfortunately, things are more complicated than this simple scenario suggests.
First, again, the problem of which threshold to choose is present. Worse, the cor-
relation matrix is not the truth, but an experimentally measured estimation that
relies on how the experiment was performed (Where were the probes inserted?
How was the spike train data handled?). Repeating an experiment may lead to a
very different correlation matrix – a difference not accountable by a linear trans-
formation. This means, in particular, that methods based on spectral properties
such as PCA are misleading [56].
What content does the experimentally-measured correlation matrix hold? The
entries satisfy an order principle: if neurons A and B seem more correlated than
C and D, then, in truth, they are. In other words, repeated experiments lead
to a nonlinear, but order-preserving, homeomorphism of the correlation axis. It
is precisely this nonlinear coordinate-free nature of the problem that prompts a
topological approach.
The approach is this. Given a correlation matrix R, let 1 ≥ ε ≥ 0 be a decreas-
ing threshold parameter, and, for each ε, let Rε be the binary matrix generated
from R with ones wherever the correlation exceeds ε. Let Xε be the Dowker com-
plex of Rε (or dual; the same, by symmetry). Then consider the kth Betti number
distribution βk : [1, 0] → N. These distributions are unique under change of cor-
relation axis coordinates up to order-preserving homeomorphisms of the domain.
What do these distributions look like? For ε → 1, the complex is a set of isolated
points, and for ε → 0 it is one large connected simplex: all the interesting
homology lies in the middle. It is known that when this sequence of simplicial
complexes is obtained by sampling points from a probability distribution, the
Betti distributions βk for k > 0 are unimodal. Furthermore, it is known that
homological peaks are ordered by dimension [5]: the peak value for β1 precedes
that of β2 , etc. Thus, what is readily available as a signature for the network is
the ordering of the heights of the peaks of the βk distributions. The surprise is
that this peak height data gives information about the distribution from which
the points were sampled.
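The β0 piece of this pipeline is simple to prototype: at threshold ε, β0 of the thresholded complex equals the number of connected components of the graph joining i to j whenever Rij > ε. A toy sketch using union-find (the correlation values below are invented purely for illustration):

```python
import numpy as np

def beta0_distribution(R, thresholds):
    """Connected components of the threshold graph R_eps, for each eps."""
    n = R.shape[0]
    counts = []
    for eps in thresholds:
        parent = list(range(n))                 # union-find over the vertices
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        for i in range(n):
            for j in range(i + 1, n):
                if R[i, j] > eps:               # edge of the threshold graph
                    parent[find(i)] = find(j)
        counts.append(len({find(i) for i in range(n)}))
    return counts

# Two blocks of strongly correlated variables, weakly coupled across blocks.
R = np.array([[1.0, 0.9, 0.8, 0.2],
              [0.9, 1.0, 0.7, 0.1],
              [0.8, 0.7, 1.0, 0.3],
              [0.2, 0.1, 0.3, 1.0]])
print(beta0_distribution(R, [0.95, 0.5, 0.05]))   # [4, 2, 1]
```

As ε decreases from 1 to 0, components only merge, matching the description above; the higher βk require the boundary-matrix machinery of Lecture 1.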
In particular, one can distinguish between networks that are wired randomly
versus those that are wired geometrically. This is motivated by the neuroscience
applications. It has been known since the Nobel prize-winning work of O’Keefe
et al. that certain neurons in the hippocampus of rats act as place cells, encoding the
geometry of a learned domain (e.g., a maze) by how the neurons are wired [74], in
a manner not unlike that of a nerve complex [35]. Other neural networks are known
to be wired together randomly, such as the olfactory system of a fly [29]. Giusti et
al., relying on theorems about Betti number distributions for random geometric
complexes by Kahle [63], show that one can differentiate between geometrically-
wired and randomly wired networks by looking at the peak signatures of β1 ,
β2 , and β3 and whether the peaks increase [random] or decrease [geometric].
Follow-on work gives novel signature types [87]. The use of these methods is
revolutionary, since actual physical experiments to rigorously determine neuron
wiring are prohibitively difficult and expensive, whereas computing homology is,
in principle, simple. Lecture 3 will explore this issue of computation more.

Lecture 2: Persistence
We have covered the basic definitions of simplicial and algebraic complexes
and their homological invariants. Our goal is to pass from the mechanics of
invariants to the principles that animate the subject, culminating in a deeper un-
derstanding of how data can be qualitatively compressed and analyzed. In this
lecture, we will begin that process, using the following principles as a guide:
(1) A simplicial [or cell] complex is the right type of discrete data structure
for capturing the significant features of a space.
(2) A chain complex is a linear-algebraic representation of this data structure
– an algebraic set of assembly instructions.
(3) To prove theorems about how cell complexes behave under deformation,
study instead deformations of chain complexes.
(4) Homology is the optimal compression of a chain complex down to its
qualitative features.

Towards Functoriality
Our chief end is this: homology is functorial. This means that one can talk
not only about homology of a complex, but also of the homology of a map be-
tween complexes. To study continuous maps between spaces algebraically, one
translates the concept to chain complexes. Assume that X and Y are simplicial
complexes and f : X → Y is a simplicial map – a continuous map taking simplices
to simplices.4 This does not imply that the simplices map homeomorphically to
simplices of the same dimension.
In the same way that X and Y lift to algebraic chain complexes C• (X) and C• (Y),
the map f lifts to a graded sequence of linear transformations f• : C• (X) → C• (Y),
generated by basis n-simplices of X being sent to basis n-simplices of Y, where,
if an n-simplex of X is sent by f to a simplex of dimension less than n, then the
algebraic effect is to send the basis chain in Cn (X) to 0 ∈ Cn (Y). The continuity
of the map f induces a chain map f• that fits together with the boundary maps
of C• (X) and C• (Y) to form the following diagram of vector spaces and linear
transformations:

(2.1)
··· −→ Cn+1 (X) −∂→ Cn (X) −∂→ Cn−1 (X) −∂→ ···
          ↓ fn+1        ↓ fn         ↓ fn−1
··· −→ Cn+1 (Y) −∂→ Cn (Y) −∂→ Cn−1 (Y) −∂→ ···
4 For cell complexes, one makes the obvious adjustments.
Commutative Diagrams Equation (2.1) is important – it is our first example of
what is known as a commutative diagram. These are the gears for algebraic engines
of inference. In this example, commutativity means precisely that the chain maps
respect the boundary operation, f• ∂ = ∂f• . This is what continuity means for linear
transformations of complexes. There is no need for simplices or cells to be explicit.
One defines a chain map to be any sequence of linear transformations f• : C → C′
on algebraic complexes making the diagram commutative.

Exercise 2.2. Show that any chain map f• : C → C′ takes cycles to cycles and
boundaries to boundaries.

Induced Homomorphisms Because of this commutativity, a chain map f• acts
not only on chains but on cycles and boundaries as well. This makes well-defined
the induced homomorphism H(f) : H• (C) → H• (C′ ) on homology. For α a cycle in
C with homology class [α], one may thus define H(f)[α] = [f• α] = [f ◦ α]. This is
well-defined: if [α] = [α′ ], then, as chains, α′ = α + ∂β for some β, and,
(2.3) f• α′ = f ◦ α′ = f ◦ (α + ∂β) = f ◦ α + f ◦ ∂β = f• α + ∂(f• β),
so that H(f)[α′ ] = [f• α′ ] = [f• α] = H(f)[α] in H• (C′ ).
The term homomorphism is used to accustom the reader to standard terminology.
Of course, in the present context, an induced homomorphism is simply a graded
linear transformation on homology induced by a chain map.

Exercise 2.4. Consider the disc in R2 of radius π punctured at the integer points
along the x and y axes. Although this space is not a cell complex, let us assume
that its homology is well-defined and is “the obvious thing” for H1 , defined by
the number of punctures. What are the induced homomorphisms on H1 of the
continuous maps given by (1) rotation by π/2 counterclockwise; (2) the folding
map x → |x|; (3) flipping along the y axis?

Functoriality Homology is functorial, meaning that the homomorphisms in-
duced on homology are an algebraic reflection of the properties of continuous
maps between spaces. The following are simple properties of induced homomor-
phisms, easily shown from the definitions above:
• Given a chain map f• : C → C′ , H(f) : H• (C) → H• (C′ ) is a (graded) se-
quence of linear transformations.
• The identity map Id : C → C induces the identity Id : H• (C) → H• (C) on
homology.
• Given f• : C → C′ and g• : C′ → C′′ , H(g ◦ f) = H(g) ◦ H(f).
There is hardly a more important feature of homology than this functoriality.

Exercise 2.5. Show using functoriality that homeomorphisms between spaces in-
duce isomorphisms on homologies.
Exercise 2.6. Can you find explicit counterexamples to the following statements
about maps f between simplicial complexes and their induced homomorphisms
H(f) (on some grading for homology)?
(1) If f is surjective then H(f) is surjective.
(2) If f is injective then H(f) is injective.
(3) If f is not surjective then H(f) is not surjective.
(4) If f is not injective then H(f) is not injective.
(5) If f is not bijective then H(f) is not bijective.

Functorial Inference It is sometimes the case that what is desired is knowledge
of the qualitative features [homology] of an important but unobservable space X;
what is observed is an approximation Y to X, of uncertain homological fidelity.
One such observation is unhelpful. Two or more homological samplings may
lead to increased confidence; however, functoriality can relate observations to
truth. Suppose the observed data comprises the homology of a pair of spaces Y1 ,
Y2 , which are related by a map f : Y1 → Y2 that factors through a map to X, so that
f = f2 ◦ f1 with f1 : Y1 → X and f2 : X → Y2 . If the induced homomorphism H(f)
is known, then, although H• (X) is hidden from view, inferences can be made.
(2.7)
H• Y1 −−H(f)−→ H• Y2
     ↘ H(f1 )      ↗ H(f2 )
            H• X

Exercise 2.8. In the above scenario, what can you conclude about H• (X) if H(f) is
an isomorphism? If it is merely injective? Surjective?

The problem of measuring topological features of experimental data by means
of sensing is particularly vulnerable to threshold effects. Consider, e.g., an open
tank of fluid whose surface waves are experimentally measured and imaged. Per-
haps the region of interest is the portion of the fluid surface above the ambient
height h = 0; the topology of the set A = {h ≥ 0} must be discerned, but can
only be approximated by imprecise pixellated images of {h ≥ 0}. One can choose
a measurable threshold above and below the zero value to get just such a situ-
ation as outlined in Equation (2.7) above. Similar scenarios arise in MRI data,
where the structure of a tissue of interest can be imaged as a pair of pixellated
approximations, known to over- and under-approximate the truth.

Sequences
Induced homomorphisms in homology are key, as central to homology as
linear transformations are to linear algebra. In these lectures, we have seen that,
though linear transformations between vector spaces are important, there is great
advantage in chaining linear transformations into sequences
and complexes. The advent of induced homomorphisms should prompt the same
desire, to chain into sequences, analyze, classify, and infer. This is the plan for
the remainder of this lecture as we outline the general notions of persistence,


persistent homology, and topological data analysis.
Consider a sequence of inclusions of subcomplexes ι : Xk ⊂ Xk+1 of a simpli-
cial complex X for 1 ≤ k ≤ N. These can be arranged into a sequence of spaces
with inclusion maps connecting them like so:
ι ι ι ι ι
(2.9) ∅ = X0 −→ X1 −→ · · · −→ XN−1 −→ XN −→ X.
Sequences of spaces are very natural. One motivation comes from a sequence
of Vietoris-Rips complexes of a set of data points with an increasing sequence of
radii (ε1 , . . . , εN ).

Exercise 2.10. At the end of Lecture 1, we considered a correlation matrix R on
a set V of variables, where correlations are measured from 0 to 1, and used this
matrix to look at a sequence of Betti numbers. Explain how to rephrase this as
a sequence of homologies with maps (assuming some discretization along the
correlation axis). What maps induce the homomorphisms on homologies? What
do the homologies look like for very large or very small values of the correlation
parameter?

A topological sequence of spaces is converted to an algebraic sequence by
passing to homology and using induced homomorphisms:
H(ι) H(ι) H(ι) H(ι) H(ι)
(2.11) H• (X0 ) −→ H• (X1 ) −→ · · · −→ H• (XN−1 ) −→ H• (XN ) −→ H• (X).
The individual induced homomorphisms on homology encode local topological
changes in the Xi ; thanks to functoriality, the sequence encodes the global changes.

Exercise 2.12. Consider a collection of 12 equally-spaced points on a circle —
think of tick-marks on a clock. Remove from this all the points corresponding
to the prime numbers (2, 3, 5, 7, 11). Use the remaining points on the circle as
the basis of a sequence of Vietoris-Rips [VR] complexes based on an increasing
sequence {εi } of distances starting with ε0 = 0. Without worrying about the
actual values of the εi , describe what happens to the sequence of VR complexes.
What do you observe? Does H0 ever increase? Decrease? What about H1 ?
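The H0 part of this exercise can be explored numerically: β0 of the VR complex at scale ε equals the number of components of the graph joining points at distance at most ε. A sketch (the sample scales below are our choices, picked to straddle the two relevant chord lengths 2 sin 15° ≈ 0.52 and 2 sin 30° = 1):

```python
import math

# Clock marks 1..12 on the unit circle, with the primes 2, 3, 5, 7, 11 removed.
marks = [m for m in range(1, 13) if m not in {2, 3, 5, 7, 11}]
pts = [(math.cos(2 * math.pi * m / 12), math.sin(2 * math.pi * m / 12))
       for m in marks]

def components_at(eps):
    """beta_0 of the Vietoris-Rips complex at scale eps (union-find on edges)."""
    parent = list(range(len(pts)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if math.dist(pts[i], pts[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(pts))})

# beta_0 never increases as eps grows: components can merge but never split.
for eps in [0.0, 0.6, 1.1, 2.1]:
    print(eps, components_at(eps))   # 7, 4, 1, 1 components respectively
```

Tracking H1 similarly requires the edges and triangles of each VR complex and the boundary-matrix computation from Lecture 1.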

What one observes from this example is the evolution of homological features
over a sequence: homology classes are born, can merge, split, die, or persist.
This evolutionary process as written in the language of sequences is the algebraic
means of encoding notions of geometry, significance, and noise.
Persistence Let us formalize some of what we have observed. Consider a se-
quence of spaces (Xi ) and continuous transformations fi : Xi → Xi+1 , without
requiring subcomplexes and inclusions. We again have a sequence of homologies
with induced homomorphisms. A homology class in H• (Xi ) is said to persist if its
image in H• (Xi+1 ) is also nonzero; otherwise it is said to die. A homology class
in H• (Xj ) is said to be born when it is not in the image of H• (Xj−1 ).
One may proceed with this line of argument, at the expense of some sloppiness
of language. Does every homology class have an unambiguous birth and death?
Can we describe cycles this way, or do we need to work with classes of cycles
modulo boundaries? For the sake of precision and clarity, it is best to follow the
pattern of these lectures and pass to the context of linear algebra and sequences.
Consider a sequence V• of finite-dimensional vector spaces, graded over the
integers Z, and stitched together with linear transformations like so:
(2.13) V• = · · · −→ Vi−1 −→ Vi −→ Vi+1 −→ · · · .
These sequences are more general than algebraic complexes, which must satisfy the
restriction of composing two incident linear transformations yielding zero. Two
such sequences V• and V•′ are said to be isomorphic if there are isomorphisms
Vk ≅ Vk′ which commute with the linear transformations in V• and V•′ as in
Equation (2.1). The simplest such sequence is an interval indecomposable of the
form
(2.14) I• = · · · −→ 0 −→ 0 −→ F −Id→ F −Id→ · · · −Id→ F −→ 0 −→ 0 −→ · · · ,
where the length of the interval equals the number of Id maps, so that an interval
of length zero consists of 0 → F → 0 alone. Infinite or bi-infinite intervals are
also included as indecomposables.
Representation Theory A very slight amount of representation theory is all that
is required to convert a sequence of homologies into a useful data structure for
measuring persistence. Consider the following operation: sequences can be for-
mally added by taking the direct sum, ⊕, term-by-term and map-by-map. The
interval indecomposables are precisely indecomposable with respect to ⊕ and can-
not be expressed as a sum of simpler sequences, even up to isomorphism. The
following theorem, though simple, is suitable for our needs.

Theorem 2.15 (Structure Theorem for Sequences). Any sequence of finite dimen-
sional vector spaces and linear transformations decomposes as a direct sum of interval
indecomposables, unique up to reordering.

What does this mean? It’s best to begin with the basics of linear algebra, and
then see how that extends to homology.
Exercise 2.16. Any linear transformation A : Rn −→ Rm extends to a bi-infinite se-
quence with all but two terms zero. How many different isomorphism classes of
decompositions into interval indecomposables are there? What types of intervals
are present? Can you interpret the numbers of the various types of intervals?
What well-known theorem from elementary linear algebra have you recovered?

Barcodes. When we use field coefficients, applying the Structure Theorem to a
sequence of homologies gives an immediate clarification of how homology classes
evolve. Homology classes correspond to interval indecomposables, and are born,
persist, then die at particular (if perhaps infinite) parameter values. This decom-
position also impacts how we illustrate evolving homology classes. By drawing
pictures of the interval indecomposables over the [discretized] parameter line as
horizontal bars, we obtain a pictograph that is called a homology barcode.

Exercise 2.17. Consider a simple sequence of four vector spaces, each of dimen-
sion three. Describe and/or draw pictures of all possible barcodes arising from
such a sequence. Up to isomorphism, how many such barcodes are there?

The phenomena of homology class birth, persistence, and death correspond
precisely to the beginning, middle, and end of an interval indecomposable. The
barcode is usually presented with horizontal intervals over the parameter line
corresponding to interval indecomposables. Note that barcodes, like the homol-
ogy they illustrate, are graded. There is an Hk -barcode for each k ≥ 0. Since,
from the Structure Theorem, the order does not matter, one typically orders the
bars in terms of birth time (other orderings are possible).
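The Structure Theorem can be made computational via ranks: with r(i, j) the rank of the composite map Vi → Vj (and r(i, i) = dim Vi), the multiplicity of the interval [i, j] is μ(i, j) = r(i, j) − r(i−1, j) − r(i, j+1) + r(i−1, j+1). A sketch with real coefficients and 0-based indices (the names and conventions here are ours):

```python
import numpy as np

def barcode(dims, maps):
    """Interval multiplicities for a sequence V_0 -> V_1 -> ... -> V_{N-1}.

    dims[i] = dim V_i; maps[i] is the matrix of V_i -> V_{i+1}
    (shape dims[i+1] x dims[i]).  Returns {(i, j): multiplicity} for
    bars born at index i and last alive at index j."""
    N = len(dims)
    def r(i, j):                     # rank of the composite V_i -> V_j
        if i < 0 or j >= N:
            return 0
        if i == j:
            return dims[i]
        M = np.eye(dims[i])
        for k in range(i, j):
            M = maps[k] @ M
        return int(np.linalg.matrix_rank(M))
    bars = {}
    for i in range(N):
        for j in range(i, N):
            # inclusion-exclusion over births at i and deaths after j
            mu = r(i, j) - r(i - 1, j) - r(i, j + 1) + r(i - 1, j + 1)
            if mu > 0:
                bars[(i, j)] = mu
    return bars

# F --Id--> F --0--> F : one bar [0, 1] and one bar [2, 2].
print(barcode([1, 1, 1], [np.array([[1.0]]), np.array([[0.0]])]))
# {(0, 1): 1, (2, 2): 1}
```

The total multiplicity of bars containing a given interval [i, j] equals r(i, j), the rank of the composite map over that interval.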
The barcode provides a simple descriptor for topological significance: the
shorter an interval, the more ephemeral the hole; long bars indicate robust topo-
logical features with respect to the parameter. This is salient in the context of
point clouds Q and Vietoris-Rips complexes VRε (Q) using an increasing sequence
{εi } as parameter. For ε too small or too large, the homology of VRε (Q) is unhelp-
ful. Instead of trying to choose an optimal ε, choose them all: the barcode reveals
significant features.

Exercise 2.18. Persistent homology is useful and powerful in topological data
analysis, but sometimes one can get lost in the equivalence relation that comprises
homology classes. Often, in applications, one cares less about the homology
and more about a particular cycle (whose homology class may be too loose to
have meaning within one’s data). Given a sequence of chain complexes and
chain maps, what can be said about persistent cycles and persistent boundaries? Are
these well-defined? Do they have barcodes? How would such structures relate to
persistent homology barcodes?

Persistent Homology Let us summarize what we have covered with slightly
more formal terminology. A persistence complex is a sequence of chain complexes
P = (Ci ), together with chain maps x : Ci −→ Ci+1 . For notational simplicity, the
index subscripts on the chain maps x are suppressed. Note that each Ci = (C•,i , ∂)
is itself a complex: we have a sequence of sequences. The persistent homology of
a persistence complex P is not a simple homology theory, but rather a homology
associated to closed intervals in the “parameter domain”. Over the interval [i, j],
its persistent homology, denoted H• (P[i, j]), is defined to be the image of the
induced homomorphism H(xj−i ) : H• (Ci ) → H• (Cj ) induced by xj−i . That is,
one looks at the composition of the chain maps from Ci → Cj and takes the
image of the induced homomorphism on homologies. This persistent homology
consists of homology classes that persist: dim Hk (P[i, j]) equals the number of
intervals in the barcode of Hk (P) containing the parameter interval [i, j].

Exercise 2.19. If, in the indexing for a persistence complex, you have i < j < k < ℓ,
what is the relationship between the various subintervals of [i, ℓ] using {i, j, k, ℓ}
as endpoints? Draw the lattice of such intervals under inclusion. What is the
relationship between the persistent homologies on these subintervals?

Persistence Diagrams. Barcodes are not the only possible graphical presenta-
tion for persistent homology. Since there is a decomposition into homology
classes with well-defined initial and terminal parameter values, one can plot each
homology class as a point in the plane with axes the parameter line. To each
interval indecomposable (homology class) one assigns a single point with coordi-
nates (birth, death). This scatter plot is called the persistence diagram and is more
practical to plot and interpret than a barcode for very large numbers of homology
classes.

Exercise 2.20. In the case of a homology barcode coming from a data set, the
“noisy” homology classes are those with the smallest length, with the largest bars
holding claim as the “significant” topological features in a data set. What do these
noisy and significant bars translate to in the context of a persistence diagram? For
a specific example, return to the “clockface” data set of Exercise 2.12, but now
consider the set of all even points: 2, 4, 6, 8, 10, and 12. Show that the persistent
H2 contains a “short” bar. Are you surprised at this artificial bubble in the VR
complex? Does a similar bubble form in the homology when all 12 points are
used? In which dimension?

One aspect worth calling out is the notion of persistent homology as a ho-
mological data structure over the parameter space, in that one associates to each
interval [i, j] its persistent homology. This perspective is echoed in the early lit-
erature on the subject [32, 36, 41, 91], in which a continuous parameter space was
used, with a continuous family of (excursion sets of) spaces Xt , t ∈ R: in this
setting, persistent homology is assigned to an interval [s, t]. The discretized pa-
rameter interval offers little in the way of restrictions (unless you are working
with fractal-like or otherwise degenerate objects) and opens up the simple setting
of the Structure Theorem on Sequences as used in this lecture.

Stability
The idea behind the use of barcodes and persistence diagrams in data is
grounded in the intuition that essential topological features of a domain are ro-
bust to noise, whether arising from sensing, sampling, or approximation. In a
barcode, noisy features appear as short bars; in a persistence diagram, as near-
diagonal points. To solidify this intuition of robustness, one wants a more specific
statement on the stability of persistent homology. Can a small change in the input
— whether a sampling of points or a Dowker relation or a perturbation of the
metric — have a large impact on how the barcode appears?

Robert Ghrist 293
There have of late been a plethora of stability theorems in persistent homology,
starting with the initial result of Cohen-Steiner et al. [31] and progressing to more
general and categorical forms [12, 18, 21, 22, 30]. In every one of these settings, the
stability result is given in the context of persistence over a continuous parameter
ε, such as one might use in the case of a Vietoris-Rips filtration on a point-cloud.
The original stability theorem is further framed in the setting of sublevel sets of
a function h : X → R and the filtration is by sublevel sets Xt = {h ≤ t}. Both the
statements of the stability theorems and their proofs are technical; yet the techni-
calities lie in the difficulties of the continuum parameter. For the sake of clarity
and simplicity, we will assume that one has imposed a uniform discretization of
the real line with a fixed step size ε > 0 (as would often be the case in practice).
Interleaving. At present, the best language for describing the stability of per-
sistent homology and barcode descriptors is the recent notion of interleaving. In
keeping with the spirit of these lectures, we will present the theory in the context
of sequences of vector spaces and linear transformations. Assume that one has
a pair of sequences, V• and W• , of vector spaces and linear transformations. We
say that a T -interleaving is a pair of degree-T mappings:
(2.21)    f• : V• → W•+T ,    g• : W• → V•+T ,
such that the following diagram commutes:

    ··· → Vn → ··· → Vn+T → ··· → Vn+2T → ···
            ↘ f      ↗ g  ↘ f       ↗ g
    ··· → Wn → ··· → Wn+T → ··· → Wn+2T → ···

In particular, at each n, the composition gn+T ◦ fn equals the composition of the
2T horizontal maps in V• starting at n; and, likewise with f ◦ g on W• . One
defines the interleaving distance between two sequences of vector spaces to be the
minimal T ∈ N such that there exists a T -interleaving.

Exercise 2.22. Verify that the interleaving distance of two sequences is zero if and
only if the two sequences are isomorphic.

Exercise 2.23. Assume that V• is an interval indecomposable of length 5. Describe


the set of all W• that are within interleaving distance one of V• .

Note that the conclusions of Exercises 2.22-2.23 are absolutely dependent on


the discrete nature of the problem. In the case of a continuous parameter, the
interleaving distance is not a metric on persistence complexes, but is rather a
pseudometric, as one can have the infimum of discretized interleaving distances
become zero without an isomorphism in the limit.
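The commuting conditions above can be checked mechanically. As a minimal sketch, assume both sequences are interval indecomposables, so that every space is 0- or 1-dimensional and every map is a scalar; all names below are ours.

```python
# A sketch of the T-interleaving condition (2.21) for interval
# indecomposables: each space is 0- or 1-dimensional, so every map is a
# scalar 0 or 1. A sequence is a dict n -> (map from slot n to n+1);
# missing indices denote the zero map. Assumes T >= 1. Names are ours.

def comp(maps, start, count):
    """Compose `count` consecutive scalar maps starting at index `start`."""
    out = 1.0
    for n in range(start, start + count):
        out *= maps.get(n, 0.0)
    return out

def is_T_interleaving(V, W, f, g, T, lo=0, hi=10):
    """g_{n+T} o f_n must equal the 2T-fold map in V, and symmetrically."""
    for n in range(lo, hi):
        if g.get(n + T, 0.0) * f.get(n, 0.0) != comp(V, n, 2 * T):
            return False
        if f.get(n + T, 0.0) * g.get(n, 0.0) != comp(W, n, 2 * T):
            return False
    return True

# V is the interval [0, 3], W the interval [1, 2]: they are 1-interleaved.
V = {0: 1.0, 1: 1.0, 2: 1.0}   # identity maps V_0 -> V_1 -> V_2 -> V_3
W = {1: 1.0}                   # identity map  W_1 -> W_2
f = {0: 1.0, 1: 1.0}           # f_n : V_n -> W_{n+1}
g = {1: 1.0, 2: 1.0}           # g_n : W_n -> V_{n+1}
```

Deleting one of the diagonal maps breaks a commuting square, and the checker detects it, in the spirit of Exercise 2.23.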
294 Homological Algebra and Data

Application: TDA
Topological Data Analysis, or TDA, is the currently popular nomenclature for
the set of techniques surrounding persistence, persistent homology, and the ex-
traction of significant topological features from data. The typical input to such a
problem is a point cloud Q in a Euclidean space, though any finite metric space
will work the same. Given such a data set, assumed to be a noisy sampling of
some domain of interest, one wants to characterize that domain. Such questions
are not new: linear regression assumes an affine space and returns a best fit; a
variety of locally-linear or nonlinear methods look for nonlinear embeddings of
Euclidean spaces.
Topological data analysis looks for global structure — homology classes — in
a manner that is to some degree decoupled from rigid geometric considerations.
This, too, is not entirely novel. Witness clustering algorithms, which take a point cloud
and return a partition that is meant to approximate connected components. Of
course, this reminds one of H0 , and the use of a Vietoris-Rips complex makes this
precise: single linkage clustering is precisely the computation of H0 (VRε (Q)) for a
choice of ε > 0. Which choice is best? The lesson of persistence is to take all ε
and build the homology barcode. Notice, however, that the barcode returns only
the dimension of H0 — the number of clusters — and to more carefully specify
the clusters, one needs an appropriate basis. There are many other clustering
schemes with interesting functorial interpretations [26, 27].
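The correspondence between single-linkage clustering and persistent H0 can be made concrete with a Kruskal-style sweep: every merge of two components at scale ε ends one H0 bar. This is a sketch under our own naming, with points on the line to keep distances simple.

```python
# Single-linkage clustering as persistent H0 of the Vietoris-Rips
# filtration, via Kruskal's algorithm with union-find: each merge of two
# components kills one H0 bar at the merge scale. Names are ours.

def h0_barcode(points):
    """Points in R^1 (for simplicity); returns the finite H0 death times."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    edges = sorted((abs(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    deaths = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(eps)   # one component (one H0 class) dies here
    return deaths                # n-1 finite bars; one bar lives forever

deaths = h0_barcode([0.0, 1.0, 2.0, 10.0])
```

The three finite deaths are the merge scales; the component that never dies is the infinite H0 bar.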
The ubiquity and utility of clustering is clear. What is less clear is the preva-
lence and practicality of higher-dimensional persistent homology classes in “or-
ganic” data sets. Using again a Vietoris-Rips filtration of simplicial complexes
on a point cloud Q allows the computation of homology barcodes in gradings
larger than zero. To what extent are they prevalent? The grading of homology
is reminiscent of the grading of polynomials in Taylor expansions. Though Tay-
lor expansions are undoubtedly useful, it is acknowledged that the lowest-order
terms (zeroth and first especially) are most easily seen and used. Something like
this holds in TDA, where one most readily sees clusters (H0 ) and simple loops
(H1 ) in data. The following is a brief list of applications known to the author.
The literature on TDA has blown up of late to a degree that makes it impossi-
ble to give an exhaustive account of applications. The following are chosen as
illustrative of the basic principles of persistent homology.
Medical imaging data: Some of the earliest and most natural applications of TDA
were to image analysis [2, 13, 25, 79]. One recent study by Bendich et al. looks at
the structure of arteries in human brains [14]. These are highly convoluted path-
ways, with lots of branching and features at multiple scales, but which vary in
dramatic and unpredictable ways from patient to patient. The topology of arterial
structures is globally trivial — sampling the arterial structure through standard
imaging techniques yields a family of trees (acyclic graphs). Nevertheless, since
the geometry is measurable, one can filter these trees by sweeping a plane across
the three-dimensional ambient domain, and look at the persistent H0 . Results
show statistically significant correlations between the vector of lengths of the top
100 bars in the persistent H0 barcode and features such as patient age and sex.
For example, older brains tend to have shorter longest bars in the H0 barcode. The
significance of the correlation is very strong and outperforms methods derived
from graph theory and phylogenetic-based tree-space geometry. It is in-
teresting that it is not the “longest bar” that matters so much as the ensemble of
longest bars in this barcode. Work in progress includes using H0 barcode statistics
to characterize global structure of graph-like geometries, including examinations
of insect wing patterns, tree leaf vein networks, and more. Other exciting appli-
cations of persistent H0 in medical settings feature an analysis of breast cancer by
Nicolau et al. [73].
Distinguishing illness from health and recovery: Where does persistent homology
beyond H0 come into applications? A recent excellent paper of Torres et al. uses
genetic data of individuals with illnesses to plot a time series of points in a disease
space of traits [88]. Several examples are given of studies on human and mouse
subjects tracking the advancement and recovery from disease (including, in the
mice, malaria). Genetic data as a function of time and for many patients gives
a point cloud in an abstract space for which geometry is not very relevant (for
example, axes are of differing and incomparable units). What is interesting about
this study is the incorporation of data from subjects that extends from the onset of
illness, through its progress, and including a full recovery phase back to health.
Of interest is the question of recovery — does recovery from illness follow the
path of illness in reverse? Does one recover to the same state of health, or is there
a monodromy? The study of Torres et al. shows a single clear long-bar in the
H1 -barcode in disease space, indicating that all instances of the illness are homolo-
gous, as are all instances of recovery, but that “illness and recovery are homologically
distinct events” [88]. In contrast to prior studies that performed a linear regression
on a pair of variables and concluded a linear relationship between these variables
(with a suggestion of a causal relationship), the full-recovery data set with its
loopy phenomenon of recovery suggests skepticism: indeed, a careful projection
of a generator for the H1 barcode into this plane recovers the loop.
Robot path planning: A very different set of applications arises in robot motion
planning, in which an autonomous agent needs to navigate in a domain X ⊂ Rn
(either physical or perhaps a configuration space of the robot) from an initial state
to a goal state in X. In the now-familiar case of self-driving vehicles, autonomous
drones, or other agents with sensing capabilities, the navigable (“obstacle-free”)
subset of X is relatively uncertain, and known only as a probabilistic model. Bhat-
tacharya et al. consider such a probability density ρ : X → [0, ∞) and use a com-
bination of persistent homology and graph search-based algorithms in order to
compute a set of best likely paths-to-goal [15]. The interesting aspects of this ap-
plication are the following. (1) The persistence parameter is the probability, used
to filter the density function on X. This is one natural instance of a continuous
as opposed to a discrete persistence parameter. (2) The H1 homology barcode
is used, but on a slight modification of X obtained by abstractly identifying the
initial and goal states (by an outside edge if one wants to be explicit), so that
a path from initial to goal corresponds precisely to a 1-cycle that intersects this
formal edge. (3) Once again, the problem of “where to threshold” the density
is avoided by the use of a barcode; the largest bars in the H1 barcode correspond
to the path classes most likely to be available and robust to perturbations in the
sensing. For path classes with short bars, a slight update to the system might
invalidate this path.
Localization and mapping: Computing persistent homology is also useful as a
means of building topological maps of an unknown environment, also of rel-
evance to problems in robotics, sensing, and localization. Imagine a lost trav-
eler wandering through winding, convoluted streets in an unfamiliar city with
unreadable street signs. Such a traveler might use various landmarks to build
up an internal map: “From the jewelry store, walk toward the cafe with the red
sign, then look for the tall church steeple.” This can be accomplished without ref-
erence to coordinates or odometry. For the general setting, assume a domain D
filled with landmarks identifiable by observers that register landmarks via local
sensing/visibility. A collection of observations are taken, with each observation
recording only those landmarks “visible” from the observation point. Both the
landmarks and observations are each a discrete set, with no geometric or coor-
dinate data appended. The sensing data is given in the form of an unordered
sequence of pairs of observation-landmark identities encoding who-sees-what.
From this abstract data, one has a Dowker relation from which one can build a
pair of (dual, homotopic) Dowker complexes that serve as approximations to the
domain topology. In [54], two means of inferring a topological map from persis-
tent homology are given. (1) If observers record in the sensing relation a visibility
strength (as in, say, the strengths of signals to all nearby wireless SSIDs), then fil-
tering the Dowker complexes on this (as we did in the neuroscience applications
of the previous lecture) gives meaning to long bars as significant map features.
(2) In the binary (hidden/seen) sensing case, the existence of non-unique land-
marks (“Ah, look! a Starbucks! I know exactly where I am now...”) confounds the
topology, but filtering according to witness weight can eliminate spurious sim-
plices: see [54] for details and [38] for an experimental implementation.
Protein compressibility: One of the first uses of persistent H2 barcodes has ap-
peared recently in the work of Gameiro et al. [48] in the context of characterizing
compressibility in certain families of protein chains. Compressibility is a particu-
lar characterization of a protein’s softness and is key to the interface of structure
and function in proteins, the determination of which is a core problem in biology.
The experimental measurement of protein compressibility is highly nontrivial,
and involves fine measurement of ultrasonic wave velocities from pressure waves
in solutions/solvents with the protein. Hollow cavities within the protein’s geo-
metric conformation are contributing factors, both size and number. Gameiro et
al. propose a topological compressibility that is argued to measure the relative con-
tributions of these features, but with minimal experimental measurement, using
nothing more as input than the standard molecular datasets that record atom lo-
cations as a point cloud, together with a van der Waals radius about each atom.
What is interesting in this case is that one does not have the standard Vietoris-
Rips filtered complex, but rather a filtered complex obtained by starting with the
van der Waals radii (which vary from atom to atom) and then adding to these
radii the filtration parameter ε > 0. The proposed topological compressibility is
a ratio of the number of persistent H2 intervals divided by the number of persis-
tent H1 intervals (where the intervals are restricted to certain parameter ranges).
This ratio is meant to serve as proxy to the experimental measurement of cavi-
ties and tunnels in the protein’s structure. Comparisons with experimental data
suggest, with some exceptions, a tight linear correlation between the expensive
experimentally-measured compressibility and the (relatively inexpensive) topo-
logical compressibility.
These varied examples are merely summaries: see the cited references for more
details. The applications of persistent homology to data are still quite recent, and
by the time of publication of these notes, there will have been a string of novel
applications, ranging from materials science to social networks and more.

Lecture 3: Compression and Computation


We now have in hand the basic tools for topological data analysis: complexes,
homology, and persistence. We are beginning to develop the theories and perspec-
tives into which these tools fit, as a higher guide to how to approach qualitative
phenomena in data. We have not yet dealt with issues of computation and effec-
tive implementation. Our path to doing so will take us deeper into sequences,
alternate homology theories, cohomology, and Morse theory.

Sequential Manipulation
Not surprisingly, these lectures take the perspective that the desiderata for ho-
mological data analysis include a calculus for complexes and sequences of com-
plexes. We have seen hints of this in Lecture 2; now, we proceed to introduce a
bit more of the rich structure that characterizes (the near-trivial linear-algebraic
version of) homological algebra. Instead of focusing on spaces or simplicial com-
plexes per se, we focus on algebraic complexes; this motivates our examination
of certain types of algebraic sequences.
Exact Complexes Our first new tool is inspired by the question: among all com-
plexes, which are simplest? Simplicial complexes might suggest that the simplest
sort of complex is that of a single simplex, which has homology vanishing in all
gradings except zero. However, there are simpler complexes still. We say that
an algebraic complex C = (C• , ∂) is exact if its homology completely vanishes,
H• (C) = 0. This is often written termwise as:
(3.1) ker ∂k = im ∂k+1 ∀ k.
Exercise 3.2. What can you say about the barcode of an exact complex? (This
means the barcode of the complex, not its [null] homology).
The following simple examples of exact complexes help build intuition:
• Two vector spaces are isomorphic, V ≅ W, iff there is an exact complex of
the form:

    0 → V → W → 0 .

• The 1st Isomorphism Theorem says that for a linear transformation ϕ of
V, the following sequence is exact:

    0 → ker ϕ → V −ϕ→ im ϕ → 0 .

Such a 5-term complex framed by zeroes is called a short exact complex.
In any such short exact complex, the second map is injective; the penulti-
mate, surjective.
• More generally, the kernel and cokernel of any linear transformation
ϕ : V → W fit into an exact complex:

    0 → ker ϕ → V −ϕ→ W → coker ϕ → 0 .
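Exactness at a middle term can be tested numerically: for U −A→ V −B→ W with BA = 0, exactness at V says rank A = dim V − rank B. A sketch follows, with helper names of our own; the hypothesis BA = 0 is assumed, not checked.

```python
# Check exactness at the middle space of U -A-> V -B-> W via rank-nullity:
# given B A = 0, exact at V iff rank(A) = dim(V) - rank(B). Matrices are
# lists of rows over the rationals; names are ours.

from fractions import Fraction

def rank(M):
    """Rank via row reduction with exact rational arithmetic."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                scale = M[i][c] / M[r][c]
                M[i] = [a - scale * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def exact_at(A, B, dimV):
    """Exactness at the middle space V of U -A-> V -B-> W (B A = 0 assumed)."""
    return rank(A) == dimV - rank(B)

# First Isomorphism Theorem instance: 0 -> ker(phi) -> R^2 -phi-> im(phi) -> 0
# with phi(x, y) = x, so ker(phi) = span(e2) and im(phi) = R^1.
i_map = [[0], [1]]   # inclusion of ker(phi) into R^2
phi   = [[1, 0]]     # projection onto the first coordinate
```

Exactness at the two ends is injectivity of the inclusion (full column rank) and surjectivity of ϕ (full row rank), visible from the same rank computations.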

Exercise 3.3. Consider C = C∞ (R3 ), the vector space of differentiable functions


and X = X(R3 ), the vector space of C∞ vector fields on R3 . Show that these fit
together into an exact complex,

(3.4)    0 → R → C −∇→ X −∇×→ X −∇·→ C → 0,

where ∇ is the gradient differential operator from vector calculus, and the initial
R term in the complex represents the constant functions on R3 . This one exact
complex compactly encodes many of the relations of vector calculus.
Mayer-Vietoris Complex There are a number of exact complexes that are used
in homology, the full exposition of which would take us far afield. Let us focus
on one particular example as a means of seeing how exact complexes assist in
computational issues. The following is presented in the context of simplicial com-
plexes. Let X = A ∪ B be a union of two simplicial complexes with intersection
A ∩ B. The following exact complex is the Mayer-Vietoris complex:
··· → Hn (A ∩ B) −H(φ)→ Hn (A) ⊕ Hn (B) −H(ψ)→ Hn (X) −δ→ Hn−1 (A ∩ B) → ··· .

The linear transformations between the homologies are where all the interesting
details lie. These consist of: (1) H(ψ), which adds the homology classes from A
and B to give a homology class in X; (2) H(φ), which reinterprets a homology
class in A ∩ B to be a homology class of A and a (orientation reversed!) homology
class in B respectively; and (3) δ, which decomposes a cycle in X into a sum of
chains in A and B, then takes the boundary of one of these chains in A ∩ B. This
unmotivated construction has a clean explanation, to be addressed soon.
For the moment, focus on what the Mayer-Vietoris complex means. This com-
plex captures the additivity of homology. When, for example, A ∩ B is empty,
then every third term of the complex vanishes — the H• (A ∩ B) terms. Because
every-third-term-zero implies that the complementary pairs of incident terms are
isomorphisms, this quickly yields that homology of a disjoint union is additive,
using ⊕. When the intersection is nonempty, the Mayer-Vietoris complex details
exactly how the homology of the intersection impacts the homology of the union:
it is, precisely, an inclusion-exclusion principle.

Exercise 3.5. Assume the following: for any k ≥ 0, (1) Hn (Dk ) = 0 for all n > 0;
(2) the 1-sphere S1 has H1 (S1 ) ≅ F and Hn (S1 ) = 0 for all n > 1. The computation
of H• (Sk ) can be carried out via Mayer-Vietoris as follows. Let A and B be upper
and lower hemispheres of Sk , each homeomorphic to Dk and intersecting at
an equatorial Sk−1 . Write out the Mayer-Vietoris complex in this case: what
can you observe? As Hn (Dk ) ≅ 0 for n > 0, one obtains by exactness that
δ : Hn (Sk ) ≅ Hn−1 (Sk−1 ) for all n and all k. Thus, starting from a knowledge
of H• (S1 ), show that Hn (Sk ) ≅ 0 for k > 0 unless n = k, where it has dimension
equal to one.

Sequences of Sequences One of the themes of these lectures is the utility of


composed abstraction: if spaces are useful, so should be spaces of spaces. Later,
we will argue that an idea of “homologies of homologies” is sensible and use-
ful (in the guise of the sheaf theory of Lecture 4). In this lecture, we argue that
sequences of sequences and complexes of complexes are useful. We begin with
an elucidation of what is behind the Mayer-Vietoris complex: what are the maps,
and how does it arise?
Consider the following complex of algebraic complexes,

(3.6)    0 → C• (A ∩ B) −φ•→ C• (A) ⊕ C• (B) −ψ•→ C• (A + B) → 0,

with chain maps φ• : c → (c, −c), and ψ• : (a, b) → a + b. The term on the
right, C• (A + B), consists of those chains which can be expressed as a sum of
chains on A and chains on B. In cellular homology with A, B subcomplexes,
C• (A + B) ≅ C• (X).

Exercise 3.7. Show that this complex-of-complexes is exact by construction.

In general, any such short exact complex of complexes can be converted to a


long exact complex on homologies using a method from homological algebra called
the Snake Lemma [50, 58]. Specifically, given:
(3.8)    0 → A• −i•→ B• −j•→ C• → 0,
there is an induced exact complex of homologies

(3.9)    ··· → Hn (A) −H(i)→ Hn (B) −H(j)→ Hn (C) −δ→ Hn−1 (A) −H(i)→ ··· .
Moreover, the long exact complex is natural: a commutative diagram of short
exact complexes and chain maps

(3.10)
    0 → A• → B• → C• → 0
        ↓ f•  ↓ g•  ↓ h•
    0 → Ã• → B̃• → C̃• → 0

induces a commutative diagram of long exact complexes

(3.11)
    ··· → Hn (A) → Hn (B) → Hn (C) −δ→ Hn−1 (A) → ···
          ↓ H(f)   ↓ H(g)   ↓ H(h)     ↓ H(f)
    ··· → Hn (Ã) → Hn (B̃) → Hn (C̃) −δ→ Hn−1 (Ã) → ···

The induced connecting homomorphism δ : Hn (C) → Hn−1 (A) comes from the
boundary map in B• as follows:
(1) Fix [γ] ∈ Hn (C); thus, γ ∈ Cn .
(2) By exactness, γ = j(β) for some β ∈ Bn .
(3) By commutativity, j(∂β) = ∂(jβ) = ∂γ = 0.
(4) By exactness, ∂β = iα for some α ∈ An−1 .
(5) Set δ[γ] = [α] ∈ Hn−1 (A).
Exercise 3.12. This is a tedious but necessary exercise for anyone interested in
homological algebra: (1) show that δ[γ] is well-defined and independent of all
choices; (2) show that the resulting long complex is exact. Work ad tedium; for
help, see any textbook on algebraic topology, [58] recommended.
Euler Characteristic, Redux Complexes solve the mystery of the topological
invariance of the Euler characteristic. Recall that we can define the Euler char-
acteristic of a (finite, finite-dimensional) complex C as in Equation (1.13). The
alternating sum is a binary version of exactness. A short exact complex of vector
spaces 0 → A → B → C → 0 has χ = 0, since C ≅ B/A. By applying this to individ-
ual rows of a short exact complex of (finite, finite-dimensional) chain complexes,
we can lift once again to talk about the Euler characteristic of a (finite-enough)
complex of complexes:

(3.13)    0 → A• → B• → C• → 0 .

One sees that χ of this complex also vanishes: χ(A• ) − χ(B• ) + χ(C• ) = 0.
The following lemma is the homological version of the Rank-Nullity Theorem
from linear algebra:
Lemma 3.14. The Euler characteristic of a chain complex C• and its homology H• are
identical, when both are defined.
Proof. From the definitions of homology and chain complexes, one has two short
exact complexes of chain complexes:
(3.15)
    0 → B• → Z• → H• → 0
    0 → Z• → C• → B•−1 → 0

Here, B•−1 is the shifted boundary complex whose kth term is Bk−1 . By exact-
ness, the Euler characteristic of each of these two complexes is zero; thus, so is
the Euler characteristic of their concatenation.
    0 → B• → Z• → H• → 0 → Z• → C• → B•−1 → 0 .

Count the +/− signs: the Z terms cancel and, since χ(B•−1 ) = −χ(B• ), the B
terms cancel. This leaves two terms and we conclude that χ(H• ) − χ(C• ) = 0. 
Euler characteristic thus inherits its topological invariance from that of ho-
mology. Where does the invariance of homology come from? Something more
complicated still?
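Lemma 3.14 can be confirmed numerically on a small example. Here is a sketch (names ours) for the simplicial circle, the boundary of a triangle with vertices a, b, c and edges ab, bc, ca:

```python
# A numerical illustration of Lemma 3.14: for the simplicial circle,
# chi of the chain complex equals chi of its homology. Here d2 = 0 and
# d0 = 0, so betti0 = dim C_0 - rank d1 and betti1 = dim ker d1.

from fractions import Fraction

def rank(M):
    """Rank via row reduction over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                scale = M[i][c] / M[r][c]
                M[i] = [x - scale * y for x, y in zip(M[i], M[r])]
        r += 1
    return r

# d1 : C_1 -> C_0 ; the column for edge ab is b - a, and so on.
d1 = [[-1,  0,  1],    # coefficient of vertex a in each edge boundary
      [ 1, -1,  0],    # coefficient of vertex b
      [ 0,  1, -1]]    # coefficient of vertex c

chi_C = 3 - 3                   # dim C_0 - dim C_1
betti0 = 3 - rank(d1)           # dim C_0 - rank d1
betti1 = (3 - rank(d1)) - 0     # dim ker d1 - rank d2
chi_H = betti0 - betti1
```

Both Euler characteristics vanish, while the individual Betti numbers (1, 1) record the circle's single component and its loop.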

Homology Theories
Invariance of homology is best discerned from a singularly uncomputable vari-
ant that requires a quick deep dive into the plethora of homologies available. We
begin with a reminder: homology is an algebraic compression scheme — a way
of collapsing a complex to the simplest form that respects its global features. The
notion of homology makes sense for any chain complex. Thus far, our only means
of generating a complex from a space X has been via some finite auxiliary struc-
ture on X, such as a simplicial, cubical, or cellular decomposition. There are other
types of structures a space may carry, and, with them, other complexes. In the
same way that the homology of a simplicial complex is independent of the simpli-
cial decomposition, the various homologies associated to a space under different
auspices tend to be isomorphic.
Reduced homology Our first alternate theory is not really a different type of
homology at all; merely a slight change in the chain complex meant to make
contractible (or rather acyclic) spaces fit more exactly into homology theory. Re-
call that a contractible cell complex — such as a single simplex — has homology
Hk = 0 for all k > 0, with H0 being one-dimensional, recording the fact that
the cell complex is connected. For certain results in algebraic topology, it would
be convenient to have the homology of a contractible space vanish completely.
This can be engineered by an augmented complex in a manner that is applicable to
any N-graded complex. Assume for simplicity that C = (Ck , ∂) is an N-graded
complex of vector spaces over a field F. The reduced complex is the following
augmentation:

(3.16)    ··· −∂→ C3 −∂→ C2 −∂→ C1 −∂→ C0 −ε→ F → 0 ,
where the augmentation map ε : C0 → F sends a vector in C0 to the sum of its
components (having fixed a basis for C0 ). The resulting homology of this complex
is called the reduced homology and is denoted H̃• .

Exercise 3.17. Show that the reduced complex is in fact a complex: i.e., that
ε∂ = 0. How does the reduced homology of a complex differ from the “ordinary”
homology? What is the dependence on the choice of augmentation map ε? Show
that the augmented complex of a contractible simplicial complex is exact.
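A worked instance of Exercise 3.17, sketched over F2 with our own helper names: the augmented complex of a single edge {a, b, ab} is exact, so both reduced Betti numbers vanish.

```python
# Reduced homology of a single edge {a, b, ab} over F_2: the augmented
# complex 0 -> C_1 -d1-> C_0 -eps-> F -> 0 is exact. Names are ours.

def rank_mod2(M):
    """Rank over F_2 via row reduction."""
    M = [row[:] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if M[i][c] % 2), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][c] % 2:
                M[i] = [(a + b) % 2 for a, b in zip(M[i], M[r])]
        r += 1
    return r

d1  = [[1], [1]]   # boundary of edge ab: a + b        (C_1 -> C_0)
aug = [[1, 1]]     # augmentation eps: sum of coords   (C_0 -> F)

eps_d1 = (aug[0][0] * d1[0][0] + aug[0][1] * d1[1][0]) % 2   # eps o d1
reduced_b0 = (2 - rank_mod2(aug)) - rank_mod2(d1)  # dim ker(eps) - rank d1
reduced_b1 = 1 - rank_mod2(d1)                     # dim ker(d1)
```

In contrast, the unreduced H0 of the edge is one-dimensional; the augmentation removes exactly that class.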

Čech Homology One simple structure associated to a topological space is an


open cover — a collection U of open sets {Uα } in X. The Čech complex of U is
the complex C(U) with basis for Ck (U) being all (unordered) sets of k + 1 dis-
tinct elements of U with nonempty intersection. (The usual complexities arise for
coefficients not in F2 , as one needs to order the elements of U up to even per-
mutations.) The boundary maps ∂ : C• (U) → C•−1 (U) act on a basis element by
forgetting one of the terms in the set, yielding face maps.
For a finite collection U of sets, the Čech complex is identical to the simplicial
chain complex of the nerve N(U). With the further assumption of contractible
sets and intersections, the resulting homology is, by the Nerve Lemma, identical
to that of X = ∪α Uα . However, even in the non-finite case, the result still holds if the
analogous contractibility assumptions hold. In short, if all the basis elements
for the Čech complex are nullhomologous, then the Čech homology, H• (C(U)) is
isomorphic to H• (X). Of course, the Čech complex and its homology are still
well-defined even if the local simplicity assumptions are violated. This Čech
homology has in the past been used in the context of a sequence of covers, with
limiting phenomena of most interest for complex fractal-like spaces [60]. This is
perhaps one of the earliest incarnations of persistent homology.
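Building the Čech complex of a finite cover is a direct translation of the definition. A sketch (names ours), with three arcs of a circle modeled as sets of sample points, pairwise overlapping but with empty triple intersection:

```python
# A sketch of the Cech complex of a finite cover: the k-simplices are the
# (k+1)-element subsets of U with nonempty common intersection. The cover
# below has pairwise but no triple overlaps, so the nerve is a hollow
# triangle. All names are illustrative.

from itertools import combinations

def cech_complex(cover):
    """cover: list of sets; returns a dict k -> list of k-simplices."""
    simplices = {}
    for k in range(len(cover)):
        faces = [idx for idx in combinations(range(len(cover)), k + 1)
                 if set.intersection(*(cover[i] for i in idx))]
        if faces:
            simplices[k] = faces
    return simplices

# Circle as six sample points 0..5; three overlapping arcs covering it.
U = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
nerve = cech_complex(U)
```

The output is the hollow triangle, which carries the circle's H1, as the Nerve Lemma predicts for this good cover.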
Singular Homology If you find it risky to think of the Čech homology of a cover
of non-finite size, then the next homology theory will seem obscenely prodigal.
Given a topological space X, the singular chain complex is the complex Csing whose
k-chains have as basis elements all maps σ : Δk → X, where Δk is the Platonic
k-simplex. Note: there are no restrictions on the maps σ other than continuity:
images in X may appear crushed or crumpled. The boundary maps are the obvi-
ous restrictions of σ to the (k − 1)-dimensional faces of Δk , taking a linear com-
bination with orientations if the field F demands. The resulting singular chain
complex is wildly uncountable, unless X should happen to be a trivial space.

Exercise 3.18. Do the one explicit computation possible in singular homology:
show that for X a finite disjoint union of points, the singular homology Hn (X)
vanishes except when n = 0, in which case it has β0 equal to the number of
points in X. Note: you cannot assume, as in cellular homology, that the higher-
dimensional chains Cn vanish for n > 0.
There is little hope in computing the resulting singular homology, save for the
fact that this homology is, blessedly, an efficient compression.

Theorem 3.19. For a cell complex, singular and cellular homology are isomorphic.

The proof of this is an induction argument based on the n-skeleton of X. The


previous exercise establishes the isomorphism on the level of H0 . To induct to
higher-dimensional skeleta requires a few steps just outside the bounds of these
lectures: see [51, 58] for details.
Homotopy Invariance Why pass to the uncomputable singular theory? There
is so much room in Csing that it is easy to deform continuously and prove the core
result.

Theorem 3.20. Homology [singular] is a homotopy invariant of spaces.

When combined with Theorem 3.19, we obtain a truly useful, computable re-
sult. The proof of Theorem 3.20 does not focus on spaces at all, but rather, in
the spirit of these lectures, pulls back the notion of homotopy to complexes. Re-
call that f, g : X → Y are homotopic if there is a map F : X × [0, 1] → Y which
restricts to f on X × {0} and to g on X × {1}. A chain homotopy between chain maps
ϕ• , ψ• : C → C′ is a graded linear transformation F : C → C′ sending n-chains to
(n + 1)-chains so that ∂F − F∂ = ϕ• − ψ• :

(3.21)
    ··· −∂→ Cn+1 −∂→ Cn −∂→ Cn−1 −∂→ ···
              ↙ F   ϕ• ↓ ↓ ψ•   ↙ F   ϕ• ↓ ↓ ψ•   ↙ F
    ··· −∂→ C′n+1 −∂→ C′n −∂→ C′n−1 −∂→ ···

One calls F a map of degree +1, indicating the upshift in the grading.5 Note
the morphological resemblance to homotopy of maps: a chain homotopy maps
each n-chain to an (n + 1)-chain, the algebraic analogue of a 1-parameter family.
The difference between the ends of the homotopy, ∂F − F∂, gives the difference
between the chain maps.

Exercise 3.22. Show that two chain homotopic maps induce the same homomor-
phisms on homology. Start by considering [α] ∈ H• (C), assuming ϕ• and ψ• are
chain homotopic maps from C to C′ .

The proof of Theorem 3.20 follows from constructing an explicit chain homo-
topy [58].
Morse Homology All the homology theories we have looked at so far have
used simplices or cells as basis elements of chains and dimension as the grading.
There is a wonderful homology theory that breaks this pattern in a creative and,
eventually, useful manner. Let M be a smooth, finite-dimensional Riemannian
5 The overuse of the term degree in graphs, maps of spheres, and chain complexes is unfortunate.
manifold. There is a homology theory based on a dynamical system on M. One


chooses a function h : M → R and considers the (negative) gradient flow of h on
M — the smooth dynamical system given by dx/dt = −∇h.
The dynamics of this vector field are simple: solutions either are fixed points
(critical points of h) or flow downhill from one fixed point to another. Let Cr(h)
denote the set of critical points, and assume for the sake of simplicity that all
such critical points are nondegenerate – the second derivative (or Hessian) is nonde-
generate (has nonzero determinant) at these points. These nondegenerate critical
points are the basis elements of a Morse complex. What is the grading?
Nondegenerate critical points have a natural grading – the number of negative
eigenvalues of the Hessian of h at p. This is called the Morse index, μ(p), of
p ∈ Cr(h) and has the more topological interpretation as the dimension of the
set of points that converge to p in negative time. The Morse index measures how
unstable a critical point is: minima have the lowest Morse index; maxima the
highest. Balancing a three-legged stool on k legs leads to an index μ = 3 − k
equilibrium.
One obtains the Morse complex, Ch = (MC• , ∂), with MCk the vector space with
basis {p ∈ Cr(h) ; μ(p) = k}. The boundary maps encode the global flow of the
gradient field: ∂k counts (modulo 2 in the case of F2 coefficients) the number of
connecting orbits – flowlines from a critical point with μ = k to a critical point with
μ = k − 1. One hopes (or assumes) that this number is well-defined. The difficult
business is to demonstrate that ∂2 = 0: this involves careful analysis of the con-
necting orbits, as in, e.g., [8,84]. The use of F2 coefficients is highly recommended.
The ensuing Morse homology, MH• (h), captures information about M.
Theorem 3.23 (Morse Homology Theorem). Fix a compact manifold M and a Morse
function h : M → R. Then MH•(h; F2) ≅ H•(M; F2), independent of h.
Exercise 3.24. Compute the Morse homology of a 2-sphere, S2 , outfitted with a
Morse function having two maxima and two minima. How many saddle points
must it have?
Our perspective is that Morse homology is a precompression of the complex
onto its critical elements, as measured by h.
Discrete Morse Theory As given, Morse homology would seem to be greatly
disconnected from the data-centric applications and perspectives of this chapter
— it uses smooth manifolds, smooth functions, smooth flows, and nondegenerate
critical points, all within the delimited purview of smooth analysis. However, as
with most of algebraic topology, the smooth theory is the continuous limit of a
correlative discrete theory. In ordinary homology, the discrete [that is, simplicial]
theory came first, followed by the limiting case of the singular homology theory.
In the case of Morse theory, the smooth version came long before the following
discretization, which first seems to have appeared in a mature form in the work
of Forman [45, 46]; see also the recent book of Kozlov [65].
Robert Ghrist 305

Consider for concreteness a simplicial or cell complex X. The critical ingredient
for Morse theory is not the Morse function but rather its gradient flow. A discrete
vector field is a pairing V which partitions the cells of X (graded by dimension)
into pairs Vα = (σα ⊴ τα), where σα is a codimension-1 face of τα. All leftover
cells of X not paired by V are the critical cells of V, Cr(V). A discrete flowline is a
sequence of distinct pairs Vi = (σi ⊴ τi) of V, arranged so that each σi+1 is a
codimension-1 face of τi:

(3.25)    σ1 ⊴ τ1 ⊵ σ2 ⊴ τ2 ⊵ · · · ⊵ σN ⊴ τN .

A flowline is periodic if τN ⊵ σ1 for N > 1. A discrete gradient field is a discrete
vector field devoid of periodic flowlines.
The best approach is to lift everything to algebraic actions on the chain complex
C = (C•cell, ∂) associated to the cell complex X. By linearity, the vector field V
induces a map V : Ck → Ck+1 determined by the pairs σ ⊴ τ – one visualizes
an arrow from the face σ to the cell τ. As with classical Morse homology, F2
coefficients are simplest.
To every discrete gradient field we can associate a discrete Morse complex,
CV = (MC•, ∂̃), with MCk defined to be the vector space with basis the critical
cells {σ ∈ Cr(V) ; dim(σ) = k}. Note that dimension plays the role of Morse index.

Exercise 3.26. Place several discrete gradient fields on a discretization of a circle
and examine the critical cells. What do you notice about the number and dimen-
sion of critical cells? Does this make sense in light of the Euler characteristic of a
circle?
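Exercise 3.26 is easy to experiment with in code. The following Python sketch (all names here are illustrative, not from any standard package) pairs vertex i+1 with edge i around a discretized circle, leaving one vertex and one edge critical — the minimum possible, matching dim H0(S1; F2) = dim H1(S1; F2) = 1.

```python
# A discretized circle: n vertices ('v', i) and n edges ('e', i),
# where edge i joins vertices i and (i+1) % n.
n = 8
vertices = [('v', i) for i in range(n)]
edges = [('e', i) for i in range(n)]

# Discrete gradient field: pair vertex i+1 with edge i (an arrow from
# the vertex into the edge), for i = 0, ..., n-2.  Flowlines descend
# toward vertex 0, so there are no periodic flowlines.
pairing = {('v', i + 1): ('e', i) for i in range(n - 1)}

paired = set(pairing) | set(pairing.values())
critical = sorted(c for c in vertices + edges if c not in paired)

# One critical vertex and one critical edge remain, and the signed count
# of critical cells recovers the Euler characteristic chi(S^1) = 0.
crit0 = sum(1 for c in critical if c[0] == 'v')
crit1 = sum(1 for c in critical if c[0] == 'e')
euler = crit0 - crit1
```

Any other discrete gradient field on the circle leaves an equal surplus of critical vertices and edges, so the Euler characteristic count is unchanged.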

The boundary maps ∂̃k count (modulo 2 in the case of F2 coefficients; with a
complicated induced orientation otherwise) the number of discrete flowlines from
a critical simplex of dimension k to a critical simplex of dimension k − 1. Specifically,
given τ a critical k-simplex and σ a critical (k − 1)-simplex, the contribution of
∂̃k(τ) to σ is the number of gradient paths from a face of τ to a coface of σ. In the
case that σ ⊴ τ, this number is 1, ensuring that the trivial V for which all cells
are critical yields CV the usual cellular chain complex. It is not too hard to show
that ∂̃2 = 0 and that, therefore, the homology MH•(V) = H•(CV) is well-defined.
As usual, the difficulty lies in getting orientations right for Z coefficients.
Theorem 3.27 ([45]). For any discrete gradient field V, MH•(V) ≅ H•cell(X).

Discrete Morse theory shows that the classical constraints – manifolds, smooth
dynamics, nondegenerate critical points – are not necessary. This point is worthy
of emphasis: the classical notion of a critical point (maximum, minimum, saddle)
is distilled away from its analytic and dynamical origins until only the algebraic
spirit remains.
Applications of discrete Morse theory are numerous and expansive, including
to combinatorics [65], mesh simplification [67], image processing [79], configu-
ration spaces of graphs [43, 44], and, most strikingly, efficient computation of
homology of cell complexes [69]. This will be our focus for applications to com-
putation at the end of this lecture.

Application: Algorithms
Advances in applications of homological invariants have been and will remain
inextricably linked to advances in computational methods for such invariants. Re-
cent history has shown that potential applications are impotent when divorced
from computational advances, and computation of unmotivated quantities is fu-
tile. The reader who is interested in applying these methods to data is no doubt
interested in knowing the best and easiest available software. Though this is not
the right venue for a discussion of cutting-edge software, there are a number
of existing software libraries/packages for computing homology and persistent
homology of simplicial or cubical complexes, some of which are exciting and
deep. As of the time of this writing, the most extensive and current benchmark-
ing comparing available software packages can be found in the preprint of Otter
et al. [75]. We remark on and summarize a few of the issues involved with com-
puting [persistent] homology, in order to segue into how the theoretical content
of this lecture impacts how software can be written.
Time complexity: Homology is known to be output-sensitive, meaning that the
complexity of computing homology is a function of how large the homology is,
as opposed to how large the complex is. What this means in practice is that
the homology of a simple complex is simple to compute. The time-complexity
of computing homology is, by output-sensitivity, difficult to specify tightly. The
standard algorithm to compute H• (X) for X a simplicial complex is to compute
the Smith normal form of the graded boundary map ∂ : C• → C• , where we
concatenate the various gradings into one large vector space. This graded bound-
ary map is, by definition, nilpotent: it has a block structure with zero blocks on
the block-diagonal (since ∂k : Ck → Ck−1 ) and is nonzero on the superdiagonal
blocks. The algorithm for computing Smith normal form is really a slight variant
of the ubiquitous Gaussian elimination, with reduction to the normal form via
elementary row and column operations. For field coefficients in F2 this reduction
is easily seen to be of time-complexity O(n3 ) in the size of the matrix, with an
expected run time of O(n2 ). This is not encouraging, given the typical sizes seen
in applications. Fortunately, compression preprocessing methods exist, as we will
detail.
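As a concrete miniature of the reduction just described, here is a Python sketch (the function names are ours) that computes ranks over F2 by Gaussian elimination and extracts Betti numbers via bk = dim Ck − rank ∂k − rank ∂k+1, applied to the hollow triangle — a simplicial circle.

```python
def rank_f2(rows):
    """Rank of a 0/1 matrix over F2, by Gaussian elimination."""
    rows = [r[:] for r in rows]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for c in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][c]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# Hollow triangle: vertices {0, 1, 2}, edges {01, 02, 12}, no 2-simplices.
# Rows index vertices, columns index edges; entries are mod-2 incidences.
d1 = [[1, 1, 0],
      [1, 0, 1],
      [0, 1, 1]]
r1, r2 = rank_f2(d1), 0        # no 2-cells, so rank of d2 is 0
b0 = 3 - r1                    # dim C0 - rank d1 (and rank d0 = 0)
b1 = (3 - r1) - r2             # dim ker d1 - rank d2
```

Both Betti numbers come out to 1, as befits a circle; the cubic cost of the elimination is exactly the bottleneck that the compression schemes below are designed to avoid.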
Memory and inputs: Time-complexity is not the only obstruction; holding a
complex in memory is nontrivial, as is the problem of inputting a complex. A
typical simplicial complex is specified by fixing the simplices as basis and then
specifying the boundary matrices. For very large complexes, this is prohibitive
and unnecessary, as the boundary matrices are typically sparse. There are a
number of ways to reduce the input cost, including inputting (1) a distance matrix
spelling out an explicit metric between points in a point cloud, using a persistent
Dowker complex (see Exercise 2.10) to build the filtered complex; (2) using voxels
in a lattice as means of coordinatizing top-dimensional cubes in a cubical complex,
and specifying the complex as a list of voxels; and (3) using the Vietoris-Rips
complex of a network of nodes and edges, the specification of which requires
only a quadratic number of bits of data as a function of nodes.
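Building a filtered complex from a distance matrix, as in input schemes (1) and (3), can be sketched in a few lines of Python (an illustration only; the packages surveyed in [75] are far more efficient and go to higher dimensions).

```python
def vietoris_rips(D, eps):
    """Vertices, edges, and triangles of the Vietoris-Rips complex at
    scale eps, built from a symmetric distance matrix D (nested lists)."""
    n = len(D)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if D[i][j] <= eps]
    eset = set(edges)
    # A triangle enters whenever all three of its edges are present.
    triangles = [(i, j, k) for (i, j) in edges for k in range(j + 1, n)
                 if (i, k) in eset and (j, k) in eset]
    return list(range(n)), edges, triangles

# Four points at the corners of a unit square (0-1-2-3 in cyclic order).
s2 = 2 ** 0.5
D = [[0, 1, s2, 1],
     [1, 0, 1, s2],
     [s2, 1, 0, 1],
     [1, s2, 1, 0]]
V, E, T = vietoris_rips(D, 1.0)   # the four sides appear; no triangles yet
```

At scale 1 the complex is a topological circle; raising the scale past the diagonal length fills in all triangles and kills the 1-cycle, which is precisely the birth-death behavior that persistence tracks.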

Exercise 3.28. To get an idea of how the size of a complex leads to an inefficient
complexity bound, consider a single simplex, Δn , and cube, In , each of dimen-
sion n. How many total simplices/cubes are in each? Include all faces of all
dimensions. Computing the [nearly trivial] homology of such a simple object
requires, in principle, computing the Smith normal form of a graded boundary
matrix of what net size?
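For the tally in this exercise (a sketch; the counts exclude the empty face): every nonempty subset of the n + 1 vertices spans a face of Δn, while every cell of In fixes each coordinate at an endpoint or leaves it free.

```python
def simplex_cells(n):
    # nonempty subsets of the n + 1 vertices of the n-simplex
    return 2 ** (n + 1) - 1

def cube_cells(n):
    # each coordinate: left endpoint, right endpoint, or the whole interval
    return 3 ** n

# Sanity checks in low dimensions: a triangle has 3 + 3 + 1 = 7 faces;
# a square has 4 + 4 + 1 = 9 cells.
sizes = (simplex_cells(2), cube_cells(2), simplex_cells(20), cube_cells(20))
```

The exponential growth in n is the point: even a single 20-simplex yields a boundary matrix with millions of rows and columns if one proceeds naively.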

Morse theory & Compression: One fruitful approach for addressing the compu-
tation of homology is to consider alternate intermediate compression schemes. If
instead of applying Smith Normal Form directly to a graded boundary operator,
one modifies the complex first to obtain a smaller chain-homotopic complex, then
the resulting complexity bounds may collapse with a dramatic decrease in size
of the input. There have been many proposals for reduction and coreduction of
chain complexes that preserve homology: see [62] for examples. One clear and
successful compression scheme comes from discrete Morse theory. If one puts
a discrete gradient field on a cell complex, then the resulting Morse complex
is smaller and potentially much smaller, being generated only by critical cells.
The process of defining and constructing an associated discrete Morse complex
is roughly linear in the size of the cell complex [69] and thus gives an efficient
approach to homology computation. This has been implemented in the popular
software package Perseus (see [75]).
Modernizing Morse Theory: Morse homology, especially the discrete version, has
not yet been fully unfolded. There are several revolutionary approaches to Morse
theory that incorporate tools outside the bounds of these lectures. Nevertheless,
it is the opinion of this author that we are just realizing the full picture of the cen-
trality of Morse theory in Mathematics and in homological algebra in particular.
Two recent developments are worth pointing out as breakthroughs in conceptual
frameworks with potentially large impact. The first, in the papers by Nanda et
al. [71, 72], gives a categorical reworking of discrete Morse that relaxes the notion
of a discrete vector field to allow for any acyclic pairing of cells and faces without
restriction on the dimension of the face. It furthermore shows how to reconstruct
the topology of the original complex (up to homotopy type, not homology type)
using only data about critical cells and the critical discrete flowlines. Though the
tools used are formidable (2-categories and localization), the results are equally
strong.
Matroid Morse Theory: The second contribution on the cusp of impact comes
in the thesis of Henselman [59] which proposes matroid theory as the missing link
between Morse theory and homological algebra. Matroids are classical structures
in the intersection of combinatorial topology and linear algebra and have no end
of interesting applications in optimization theory. Henselman recasts discrete
Morse theory and persistent homology both in terms of matroids, then exploits
matroid-theoretic principles (rank, modularity, minimal bases) in order to gener-
ate efficient algorithms for computing persistent homology and barcodes. This
work has already led to an initial software package Eirene6 that, as of the time of
this writing, has computed persistent Hk for 0 ≤ k ≤ 7 of a filtered 8-dimensional
simplicial complex obtained as the Vietoris-Rips complex of a random sampling
of 50 points in dimension 20 with a total of 3.3 × 10^11 simplices on a PC laptop
with an i7 quadcore and 16GB RAM in 11.1 seconds with peak memory use of
1.8GB. This computation compares very favorably with the fastest-available soft-
ware, Ripser7 , on a cluster of machines, as recorded in the survey of Otter et
al. [75]: Ripser computes the persistent homology of this complex in 349 seconds
with a peak memory load of 24.7GB. This portends much more to come, both at
the level of conceptual understanding and computational capability.
This prompts the theme of our next lecture, that in order to prepare for in-
creased applicability, one must ascend and enfold tools and perspectives of in-
creasing generality and power.

Lecture 4: Higher Order


Having developed the basics of topological data analysis, we focus now on the
theories and principles to which these point.

Cohomology and Duality


One of the first broad generalizations of all we have described in these lectures
is the theory of cohomology, an algebraic dual of homology. There are many ways
to approach cohomology — dual spaces, Morse theory, differential forms, and
configuration spaces all provide useful perspectives in this subject. These lectures
will take the low-tech approach most suitable for a first-pass. We have previously
considered a general chain complex C to be a graded sequence of vector spaces
C• with linear transformations ∂k : Ck → Ck−1 satisfying ∂2 = ∂k−1 ∂k = 0.
These chain complexes are typically graded over the naturals N, and any such
complex compresses to its homology, H• (C), preserving homological features and
forgetting all extraneous data.
In a chain complex, the boundary maps descend in the grading. If that grad-
ing is tied to dimension or local complexity of an assembly substructure, then the
boundary maps encode how more-complex objects are related, attached, or pro-
jected to their less-complex components. Though this is a natural data structure
in many contexts, there are instances in which one knows instead how objects are
6 Available at gregoryhenselman.org/eirene.
7 Available at https://github.com/Ripser/ripser.
related to larger superstructures rather than smaller substructures. This prompts
the investigation of cochain complexes. For purposes of these lectures, a cochain
complex is a sequence C = (C• , d• ) of vector spaces and linear transformations
which increment the grading (dk : Ck → Ck+1 ) and satisfy the complex condi-
tion (dk+1 dk = 0). One uses subscripts for chain complexes and superscripts
for cochain complexes. The cohomology of the cochain complex is
H•(C) = ker d / im d, consisting of cocycles equivalent up to coboundaries.
The simplest example of a cochain complex comes from dualizing a chain
complex. Given a chain complex (Ck, ∂k) of F-vector spaces, define Ck = (Ck)∨,
the vector space of linear functionals Ck → F. The coboundary dk is then the
adjoint (de facto, transpose) of the boundary ∂k+1 , so that
(4.1) d ◦ d = ∂∨ ◦ ∂∨ = (∂ ◦ ∂)∨ = 0∨ = 0.
In the case of a simplicial complex, the standard simplicial cochain complex is
precisely such a dual to the simplicial chain complex. The coboundary opera-
tor d is explicit: the coboundary of a functional on k-simplices, f ∈ Ck , acts
as (df)(τ) = f(∂τ). For σ a k-simplex, d implicates the cofaces – those (k + 1)-
simplices τ having σ as a face.
Dualizing chain complexes in this manner leads to a variety of cohomology
theories mirroring the many homology theories of the previous section: simpli-
cial, cellular, singular, Morse, Čech, and other cohomology theories follow.

Exercise 4.2. Fix a triangulated disc D2 and consider cochains using F2 coeffi-
cients. What do 1-cocycles look like? Show that any such 1-cocycle is the cobound-
ary of a 0-cochain which labels vertices with 0 and 1 on the left and on the right of
the 1-cocycle, so to speak: this is what a trivial class in H1 (D2 ) looks like. Now fix
a circle S1 discretized as a finite graph and construct examples of 1-cocycles that
are (1) coboundaries; and (2) nonvanishing in H1 . What is the difference between
the trivial and nontrivial cocycles on a circle?
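The circle half of this exercise can be checked by brute force over F2. In the following Python sketch (illustrative names), the circle is a cycle graph with n vertices and n edges; there are no 2-cells, so every 1-cochain is a cocycle, and triviality in H1 is exactly membership in the image of the coboundary.

```python
from itertools import product

n = 4  # the circle as a cycle graph: vertices 0..n-1, edge i = (i, i+1 mod n)

def coboundary(s):
    # (d s)(edge i) = s_i + s_{i+1}  (mod 2)
    return tuple(s[i] ^ s[(i + 1) % n] for i in range(n))

# All coboundaries of 0-cochains, enumerated by brute force.
coboundaries = {coboundary(s) for s in product((0, 1), repeat=n)}

# Every coboundary has an even number of 1s, so exactly half of the 2^n
# 1-cochains are coboundaries and dim H^1(S^1; F2) = 1.
trivial = (1, 1, 0, 0) in coboundaries       # even support: a coboundary
nontrivial = (1, 0, 0, 0) in coboundaries    # odd support: not a coboundary
```

The even-support criterion is the discrete analogue of integrating a 1-form around the circle: a cocycle is trivial exactly when its total "winding" vanishes.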

The previous exercise foreshadows the initially depressing truth: nothing new
is gained by computing cohomology, in the sense that Hn (X) and Hn (X) have the
same dimension for each n. Recall, however, that there is more to co/homology
than just the Betti numbers. Functoriality is key, and there is a fundamental
difference in how homology and cohomology transform.

Exercise 4.3. Fix f : X → Y a simplicial map of simplicial complexes, and con-
sider the simplicial cochain complexes C(X) and C(Y). We recall that the in-
duced chain map f• yields a well-defined induced homomorphism on homology
H(f) : H• (C(X)) → H• (C(Y)). Using what you know about adjoints, show that
the induced homomorphism on cohomology is also well-defined but reverses di-
rection: H• (f) : H• (Y) → H• (X). This allows one to lift cohomology cocycles from
the codomain to the domain.
Alexander Duality There are numerous means by which duality expresses itself
in the form of cohomology. One of the most useful and ubiquitous of these
is known as Alexander duality, which relates the homology and cohomology of
a subset of a sphere Sn (or, with a puncture, Rn ) and its complement. The
following is a particularly simple form of that duality theorem.

Theorem 4.4 (Alexander Duality). Let A ⊂ Sn be compact, nonempty, proper, and
locally-contractible. There is an isomorphism

(4.5)    AD : H̃k(Sn − A) ≅ H̃n−k−1(A).

Note that the reduced theory is used for both homology and cohomology.
Cohomology and Calculus Most students initially view cohomology as more
obtuse than homology; however, there are certain instances in which cohomology
is the most natural operation. Perhaps the most familiar such setting comes from
calculus. As seen in Exercise 3.3 from Lecture 2, the familiar constructs of vector
calculus on R3 fit into an exact complex. This exactness reflects the fact that R3 is
topologically trivial [contractible]. Later, in Exercise 4.2, you looked at simplicial
1-cocycles and hopefully noticed that whether or not they are null in H1 depends
on whether or not these cochains are simplicial gradients of 0-chains on the vertex
set. These exercises together hint at the strong relationship between cohomology
and calculus.
The use of gradient, curl, and divergence for vector calculus is, however, an un-
fortunate vestige of the philosophy of calculus-for-physics as opposed to a more
modern calculus-for-data sensibility. A slight modern update sets the stage bet-
ter for cohomology. For U ⊂ Rn an open set, let Ωk (U) denote the differentiable
k-form fields on U (a smooth choice of multilinear antisymmetric functionals on
ordered k-tuples of tangent vectors at each point). For example, Ω0 consists of
smooth functionals, Ω1 consists of 1-form fields, viewable (in a Euclidean setting)
as duals to vector fields, Ωn consists of signed densities on U times the volume
form, and Ωk>n (U) = 0. There is a natural extension of differentiation (famil-
iar from implicit differentiation in calculus class) that gives a coboundary map
d : Ωk → Ωk+1 , yielding the deRham complex,
(4.6)    0 −→ Ω0(U) −d→ Ω1(U) −d→ Ω2(U) −d→ · · · −d→ Ωn(U) −→ 0.
As one would hope, d2 = 0, in this case due to the fact that mixed partial deriva-
tives commute: you worked this out explicitly in Exercise 3.3. The resulting
cohomology of this complex, the deRham cohomology H• (U), is isomorphic to the
singular cohomology of U using R coefficients.
This overlap between calculus and cohomology is neither coincidental nor con-
cluded with this brief example. A slightly deeper foray leads to an examination
of the Laplacian operator (on a manifold with some geometric structure). The
well-known Hodge decomposition theorem then gives, among other things, an iso-
morphism between the cohomology of the manifold and the harmonic differential
forms (those in the kernel of the Laplacian). For more information on these con-
nections, see [19].
What is especially satisfying is that the calculus approach to cohomology and
the deRham theory feeds back to the simplicial: one can export the Laplacian and
the Hodge decomposition theorem to the cellular world (see [51, Ch. 6]). This,
then, impacts data-centric problems of ranking and more over networks.
Cohomology and Ranking Cohomology arises in a surprising number of dif-
ferent contexts. One natural example that follows easily from the calculus-based
perspective on cohomology lives in certain Escherian optical illusions, such as
impossible tribars, eternally cyclic waterfalls, or neverending stairs. When one
looks at an Escher staircase, the drawn perspective is locally realizable – one can
construct a local perspective function – but a global extension cannot be defined.
Thus, an Escherlike loop is really a non-zero class in H1 (as first pointed out by
Penrose [78]).
This is not disconnected from issues of data. Consider the problem of ranking.
One simple example that evokes nontrivial 1-cocycles is the popular game of Rock,
Paper, Scissors, for which there are local but not global ranking functions. A local
gradient of rock-beats-scissors does not extend to a global gradient. Perhaps this
is why customers are asked to conduct rankings (e.g., Netflix movie rankings or
Amazon book rankings) as a 0-cochain (“how many stars?”), and not as a 1-cochain
(“which-of-these-two-is-better?”): nontrivial H1 is, in this setting, undesirable. The
Condorcet paradox – that locally consistent comparative rankings can lead to global
inconsistencies – is an appearance of H1 in ranking theory.
There are less frivolous examples of precisely this type of application, leverag-
ing the language of gradients and curls to realize cocycle obstructions to perfect
rankings in systems. The paper of Jiang et al. [61] interprets the simplicial cochain
complex of the clique/flag complex of a network in terms of rankings. For exam-
ple, the (R-valued) 0-cochains are interpreted as numerical score functions on the
nodes of the network; the 1-cochains (supported on edges) are interpreted as pair-
wise preference rankings (with oriented edges and positive/negative values deter-
mining which is preferred over the other); and the higher-dimensional cochains
represent more sophisticated local orderings of nodes in a clique [simplex]. They
then resort to the calculus-based language of grad, curl, and div to build up the
cochain complex and infer from its cohomology information about existence and
nonexistence of compatible ranking schemes over the network. Their use of the
Laplacian and the Hodge decomposition theorem permits projection of noisy or
inconsistent ranking schemes onto the nearest consistent ranking.
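The rock-paper-scissors obstruction and the least-squares projection onto consistent rankings can both be seen in a short numpy sketch (ours, not the pipeline of [61]): build the gradient operator of a comparison graph, project the pairwise preferences onto gradients of a score 0-cochain, and measure the cyclic residual.

```python
import numpy as np

def grad_matrix(n_nodes, edges):
    # (grad s)(i -> j) = s_j - s_i for each oriented edge (i, j)
    d0 = np.zeros((len(edges), n_nodes))
    for row, (i, j) in enumerate(edges):
        d0[row, i], d0[row, j] = -1.0, 1.0
    return d0

edges = [(0, 1), (1, 2), (2, 0)]          # rock -> paper -> scissors -> rock
d0 = grad_matrix(3, edges)

y_cyclic = np.array([1.0, 1.0, 1.0])      # each item beats the next: no potential
y_consistent = d0 @ np.array([0.0, 1.0, 2.0])  # induced by an honest score

def cyclic_residual(y):
    s, *_ = np.linalg.lstsq(d0, y, rcond=None)   # nearest consistent ranking
    return np.linalg.norm(y - d0 @ s)

r_cyc = cyclic_residual(y_cyclic)         # entirely cyclic: nothing projects away
r_con = cyclic_residual(y_consistent)     # near zero: a global score exists
```

The nonzero residual on the cyclic data is the finite-dimensional shadow of a nontrivial 1-cocycle; the consistent data, being a gradient, projects onto itself.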
There are more sophisticated variants of these ideas, with applications pass-
ing beyond finding consistent rankings or orderings. Recent work of Gao et
al. [49] gives a cohomological and Hodge-theoretic approach to synchronization
problems over networks based on pairwise nodal data in the presence of noise.
Singer and collaborators [85, 86] have published several works on cryo electron
microscopy that is, in essence, a cohomological approach to finding consistent
solutions to pairwise-compared data over a network. The larger lesson to be in-
ferred from these types of results is that networks often support data above and
beyond what is captured by the network topology alone (nodes, edges). This
data blends with the algebra and topology of the system using the language of
cohomology. It is this perspective of data that lives above a network that propels
our next set of tools.

Cellular Sheaves
One of the most natural uses for cohomology comes in the form of a yet-more-
abstract theory that is the stated end of these lectures: sheaf cohomology. Our
perspective is that a sheaf is an algebraic data structure tethered to a space (gen-
erally) or simplicial complex (in particular). In keeping with the computational
and linear-algebraic focus of this series, we will couch everything in the language
of linear algebra; the approach of [20, 64, 83] is much more general.
Fix X a simplicial (or regular cell) complex with ⊴ denoting the face relation:
σ ⊴ τ if and only if σ ⊂ τ. A cellular sheaf over X, F, is generated by (1) an
assignment to each simplex σ of X of a stalk, a vector space F(σ); and (2) to each
face pair σ ⊴ τ of a restriction map, a linear transformation F(σ⊴τ) : F(σ) → F(τ).
This data must respect the manner in which the simplicial complex is assembled,
meaning that faces of faces satisfy the composition rule:
(4.7)    ρ ⊴ σ ⊴ τ ⇒ F(ρ⊴τ) = F(σ⊴τ) ◦ F(ρ⊴σ).
The trivial face τ ⊴ τ by default has the identity isomorphism F(τ⊴τ) = Id as its
restriction map. Again, if one thinks of the stalks as the data over the individual
simplices, then, in the same manner that the simplicial complex is glued up by
face maps, the sheaf is assembled by the system of linear transformations.
One simple example of a sheaf on a cell complex X is that of the constant sheaf ,
FX, taking values in vector spaces over a field F. This sheaf assigns F to every
cell and the identity map Id : F → F to every face σ ⊴ τ. In contrast, the skyscraper
sheaf over a single cell σ of X is the sheaf Fσ that assigns F to σ and 0 to all other
cells and face maps.
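In code, a cellular sheaf is just a dictionary of stalks and restriction matrices. A minimal numpy sketch (all names ours, not from any sheaf library) stores the constant sheaf on a single 2-simplex and verifies the composition rule (4.7) along every chain of faces v ⊴ e ⊴ t.

```python
import numpy as np

# Covering face relations on the 2-simplex: cell -> codimension-1 cofaces.
cofaces = {'v0': ['e01', 'e02'], 'v1': ['e01', 'e12'], 'v2': ['e02', 'e12'],
           'e01': ['t'], 'e02': ['t'], 'e12': ['t'], 't': []}

# Constant sheaf: every stalk is R^1, every restriction map the identity.
stalk = {c: 1 for c in cofaces}
restrict = {(c, f): np.eye(1) for c in cofaces for f in cofaces[c]}

# Composition rule (4.7): along v <= e <= t the composite must not depend
# on the intermediate edge e, so F(v <= t) is well-defined.
for v in ('v0', 'v1', 'v2'):
    e1, e2 = cofaces[v]
    left = restrict[(e1, 't')] @ restrict[(v, e1)]
    right = restrict[(e2, 't')] @ restrict[(v, e2)]
    assert np.allclose(left, right)
```

Replacing the identity matrices with arbitrary linear maps (subject to the same path-independence check) gives the "programmable" sheaves described below.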

Exercise 4.8. Consider the following version of a random rank-1 sheaf over a
simplicial complex X. Assign the field F to every simplex. To each face map σ ⊴ τ
assign either Id or 0 according to some (your favorite) random process. Does this
always give you a sheaf? How does this depend on X? What is the minimal set
of assumptions you would need to make on either X or the random assignment
in order to guarantee that what you get is in fact a sheaf?

One thinks of the values of the sheaf over cells as being data and the restric-
tion maps as something like local constraints or relationships between data. It’s
very worthwhile to think of a sheaf as programmable – one has a great deal of
freedom in encoding local relationships. For example, consider the simple lin-
ear recurrence un+1 = An un , where un ∈ Rk is a vector of states and An is
a k-by-k real matrix. Such a discrete-time dynamical system can be represented
as a sheaf F of states over the time-line R with the cell structure on R having
Z as vertices, where F has constant stalks Rk. One programs the dynamics of
the recurrence relation as follows: F({n} ⊴ (n, n+1)) is the map u → An u and
F({n+1} ⊴ (n, n+1)) is the identity. Compatibility of local solutions over the
sheaf is, precisely, the condition for being a global solution to the dynamics.
Local and Global Sections One says that the sheaf is generated by its values
on individual simplices of X: the stalk F(τ) over a cell τ is also called the local
sections of F on τ; one writes sτ ∈ F(τ) for a local section over τ. Though the
sheaf is generated by local sections, there is more to a sheaf than its generating
data, just as there is more to a vector space than its basis. The restriction maps
of a sheaf encode how local sections can be continued into larger sections. One
glues together local sections by means of the restriction maps. The value of the
sheaf F on all of X is defined to be those collections of local sections that continue
according to the restriction maps on faces. The global sections of F on X are defined
as:

(4.9)    F(X) = { (sτ)τ∈X : sσ = F(ρ⊴σ)(sρ) ∀ ρ ⊴ σ } ⊂ ∏τ F(τ).

Exercise 4.10. Show that in the example of a sheaf for the recurrence relation
un+1 = An un , the global solutions to this dynamical system are classified by the
global sections of the sheaf.
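Exercise 4.10 can be checked numerically over a finite window of the time-line (a numpy sketch with hypothetical names; we truncate to N edges so the linear algebra is finite): stack the vertex stalks, impose the edge compatibility constraints An un = un+1, and observe that the space of global sections has dimension k — one section per initial condition u0.

```python
import numpy as np

k, N = 2, 3                               # state dimension; number of edges
rng = np.random.default_rng(0)
A = [rng.standard_normal((k, k)) for _ in range(N)]   # A_0, ..., A_{N-1}

# Unknowns: u_0, ..., u_N stacked.  Each edge (n, n+1) contributes the
# constraint A_n u_n - u_{n+1} = 0 (both restrictions agree on the edge).
M = np.zeros((N * k, (N + 1) * k))
for n in range(N):
    M[n * k:(n + 1) * k, n * k:(n + 1) * k] = A[n]
    M[n * k:(n + 1) * k, (n + 1) * k:(n + 2) * k] = -np.eye(k)

# dim of global sections = dim ker M
sections_dim = (N + 1) * k - np.linalg.matrix_rank(M)

# Any trajectory of the recurrence is a global section:
u = [rng.standard_normal(k)]
for n in range(N):
    u.append(A[n] @ u[-1])
residual = np.linalg.norm(M @ np.concatenate(u))
```

The kernel of the constraint matrix is exactly the space of global sections, and its dimension k confirms that a section is determined by u0.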
The observed fact that the value of the sheaf over all of X retains the same sort
of structure as the type of data over the vertices — say, a vector space over a field F
— is a hint that this space of global solutions is really a type of homological data.
In fact, it is cohomological in nature and, like zero-dimensional cohomology, it
is a measure of connected components of the sheaf.

Cellular Sheaf Cohomology


In the simple setting of a compact cell complex X, it is easy to define a cochain
complex based on a sheaf F on X. Let Cn (X; F) be the product of F(σ) over all
n-cells σ of X. These cochains are connected by coboundary maps as follows:

(4.11)    0 −→ ∏dim σ=0 F(σ) −d→ ∏dim σ=1 F(σ) −d→ ∏dim σ=2 F(σ) −d→ · · · ,
where the coboundary map d is defined on sections over cells using the sheaf
restriction maps

(4.12)    d(sσ) = ∑σ⊴τ [σ : τ] F(σ⊴τ) sσ ,

where, for a regular cell complex, [σ : τ] is either zero or ±1 depending on the
orientation of the simplices involved (beginners may start with all vector spaces
using binary coefficients so that −1 = 1). Note that d : Cn (X; F) → Cn+1 (X; F),
since [σ : τ] = 0 unless σ is a codimension-1 face of τ. This gives a cochain com-
plex: in the computation of d2 , the incidence numbers factor from the restriction
maps, and the computation from cellular co/homology suffices to yield 0. The
resulting cellular sheaf cohomology is denoted H• (X; F).
This idea of global compatibility of sets of local data in a sheaf yields, through
the language of cohomology, global qualitative features of the data structure. We
have seen several examples of the utility of classifying various types of holes or
large-scale qualitative features of a space or complex. Imagine what one can do
with a measure of topological features of a data structure over a space.
Exercise 4.13. The cohomology of the constant sheaf FX on a compact cell com-
plex X is, clearly, H•cell (X; F), the usual cellular cohomology of X with coefficients
in F. Why the need for compactness? Consider the following cell complex: X = R,
decomposed into two vertices and three edges. What happens when you follow
all the above steps for the cochain complex of FX ? Show that this problem is
solved if you include in the cochain complex only contributions from compact
cells.
Exercise 4.14. For a closed subcomplex A ⊂ X, define the constant sheaf over A
as, roughly speaking, the constant sheaf on A (as its own complex) with all other
cells and face maps in X having data zero. Argue that H•(X; FA) ≅ H•(A; F).
Conclude that it is possible to have a contractible base space X with nontrivial
sheaf cohomology.
The elements of linear algebra recur throughout topology, including sheaf co-
homology. Consider the following sheaf F over the closed interval with two ver-
tices, a and b, and one edge e. The stalks are given as F(a) = Rm , F(b) = 0, and
F(e) = Rn. The restriction maps are F(b⊴e) = 0 and F(a⊴e) = A, where A is
a linear transformation. Then, by definition, the sheaf cohomology is H0 ≅ ker A
and H1 ≅ coker A.
Cellular sheaf cohomology taking values in vector spaces is really a charac-
terization of solutions to complex networks of linear equations. If one modifies
F(b) = Rp with F(b⊴e) = B another linear transformation, then the cochain
complex takes the form

(4.15)    0 −→ Rm × Rp −[A|−B]−→ Rn −→ 0 −→ · · · ,

where d = [A|−B] : Rm+p → Rn is the augmentation of A by −B. The zeroth sheaf


cohomology H0 is precisely the set of solutions to the equation Ax = By, for
x ∈ Rm and y ∈ Rp . These are the global sections over the closed edge. The first
sheaf cohomology measures the degree to which Ax − By does not span Rn . All
higher sheaf cohomology groups vanish.
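The dimension count in this two-cell example is a short numpy computation (a sketch with made-up A and B): choosing A and B so that Ax − By spans only a plane in R3 produces one dimension each of H0 and H1.

```python
import numpy as np

# Stalks: F(a) = R^2, F(b) = R^1, F(e) = R^3.
A = np.array([[1., 0.], [0., 1.], [0., 0.]])   # restriction F(a <= e)
B = np.array([[1.], [0.], [0.]])               # restriction F(b <= e)

d = np.hstack([A, -B])            # coboundary: C^0 = R^2 x R^1 -> C^1 = R^3
r = np.linalg.matrix_rank(d)
h0 = (2 + 1) - r                  # dim ker d: solutions of Ax = By
h1 = 3 - r                        # dim coker d: failure of Ax - By to span R^3
```

Here the kernel is spanned by (x, y) = ((1, 0), 1) — the unique global section up to scale — while the missing third coordinate of the image accounts for H1.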
Exercise 4.16. Prove that sheaf cohomology of a cell complex in grading zero
classifies global sections: H0 (X; F) = F(X).
Robert Ghrist 315

Cosheaves Sheaves are meant for cohomology: the direction of the restriction
maps ensures this. Is there a way to talk about sheaf homology? If one works
in the cellular case, this is a simple process. As we have seen, the only real
difference between the cohomology of a cochain complex and the homology of
a chain complex is whether the grading ascends or descends, so a simple matter
of arrow reversal on a sheaf should take care of things. It does. A cosheaf F̂
of vector spaces on a simplicial complex assigns (1) to each simplex σ a costalk,
a vector space F̂(σ); and (2) to each face σ of τ, a corestriction map, a linear
transformation F̂(στ) : F̂(τ) → F̂(σ) that reverses the direction of the sheaf maps.
Of course, the cosheaf must respect the composition rule:
(4.17) ρ ≤ σ ≤ τ ⇒ F̂(ρτ) = F̂(ρσ) ◦ F̂(στ),
and the identity rule that F̂(ττ) = Id.
In the cellular context, there are very few differences between sheaves and
cosheaves — the use of one over another is a matter of convenience, in terms
of which direction makes the most sense. This is by no means true in the more
subtle setting of sheaves and cosheaves over open sets in a continuous domain.
Splines and Béziers. Cosheaves and sheaves alike arise in the study of splines,
Bézier curves, and other piecewise-assembled structures. For example, a single
segment of a planar Bézier curve is specified by the locations of two endpoints,
along with additional control points, each of which may be interpreted as a handle
specifying tangency data of the resulting curve at each endpoint. The reader who
has used any modern drawing software will understand the control that these
handles give over the resulting smooth curve. Most programs use a cubic Bézier
curve in the plane – the image of the unit closed interval by a cubic polynomial.
In these programs, the specification of the endpoints and the endpoint handles
(tangent vectors) determines the interior curve segment uniquely.
This can be viewed from the perspective of a cosheaf F̂ over the closed interval
I = [0, 1]. The costalk over the interior (0, 1) is the space of all cubic polynomials
from [0, 1] → R2 , which is isomorphic to R4 ⊕ R4 (one cubic polynomial for each
of the x and y coordinates). If one sets the costalks at the endpoints of [0, 1] to be
R2 , the physical locations of the endpoints, then the obvious corestriction maps
to the endpoint costalks are nothing more than evaluation at 0 and 1 respectively.
The corresponding cosheaf chain complex is:
(4.18)    ··· → 0 → R4 ⊕ R4 --∂--> R2 ⊕ R2 → 0.
Here, the boundary operator ∂ computes how far the cubic polynomial (edge
costalk) ‘misses’ the specified endpoints (vertex costalks).
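Writing the cubics in the monomial basis, ∂ becomes a 4 × 8 evaluation matrix, and the homology is again a rank computation. The following hypothetical Python/NumPy sketch (the basis choice is an assumption; the code is not part of the original notes) carries this out:

```python
import numpy as np

# Chain complex of the Bezier cosheaf: 0 -> R^4 (+) R^4 --d--> R^2 (+) R^2 -> 0.
# A planar cubic is a pair of coefficient vectors (c0, c1, c2, c3), one for x
# and one for y; the boundary operator evaluates each coordinate at t = 0, 1.
ev = np.array([[1.0, 0.0, 0.0, 0.0],      # p(0) = c0
               [1.0, 1.0, 1.0, 1.0]])     # p(1) = c0 + c1 + c2 + c3
d = np.block([[ev, np.zeros((2, 4))],     # x-coordinate block
              [np.zeros((2, 4)), ev]])    # y-coordinate block

rank = np.linalg.matrix_rank(d)
H0 = d.shape[0] - rank   # coker d: obstructions to reaching prescribed endpoints
H1 = d.shape[1] - rank   # ker d: cubics with both endpoints pinned
print(int(H0), int(H1))  # 0 4
```

The four-dimensional kernel is exactly the freedom carried by the two handles.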

Exercise 4.19. Show that for this simple cosheaf, H0 = 0 and H1 ≅ R2 ⊕ R2.


Interpret this as demonstrating that there are four degrees of freedom available
for a cubic planar Bézier curve with fixed endpoints: these degrees of freedom
are captured precisely by the pair of handles, each of which is specified by a
316 Homological Algebra and Data

(planar) tangent vector. Repeat this exercise for a 2-segment cubic planar Bézier
curve. How many control points are needed, and with what degrees of freedom,
in order to match the H1 of the cosheaf?

Note the interesting duality: the global solutions with boundary condition
are characterized by the top-dimensional homology of the cosheaf, instead of the
zero-dimensional cohomology of a sheaf. This simple example extends greatly, as
shown originally by Billera (using cosheaves, without that terminology [16]) and
Yuzvinsky (using sheaves [90]). By Billera’s work, the (vector) space of splines
over a triangulated Euclidean domain is isomorphic to the top-dimensional ho-
mology of a particular cosheaf over the domain. This matches what you see in
the simpler example of a Bézier curve over a line segment.
Splines and Béziers are a nice set of examples of cosheaves that have natu-
ral higher-dimensional generalizations — Bézier surfaces and surface splines are
used in design and modelling of surfaces ranging from architectural structures to
vehicle surfaces, ship hulls, and the like. Other examples of sheaves over higher-
dimensional spaces arise in the broad generalization of the Euler characteristic to
the Euler calculus, a topological integral calculus of recent interest in topological
signal processing applications [9, 10, 80–82].
Towards Generalizing Barcodes One of the benefits of a more general, sophis-
ticated language is the ability to reinterpret previous results in a new light with
new avenues for exploration appearing naturally. Let’s wrap up our brief survey
of sheaves and cosheaves by revisiting the basics of persistent homology, follow-
ing the thesis of Curry [33]. Recall the presentation of persistent homology and
barcodes in Lecture 2 that relied crucially on the Structure Theorem for linear
sequences of finite-dimensional vector spaces (Theorem 2.15).
There are a few ways one might want to expand this story. We have hinted on
a few occasions at the desirability of a continuous line as a parameter: our story
of sequences of vector spaces and linear transformations is bound to the discrete
setting. Intuitively, one could take a limit of finer discretizations and hope to
obtain a convergence with the appropriate assumptions on variability. Questions
of stability and interleaving (recall Exercise 2.23) then arise: see [18, 21, 66].
Another natural question is: what about non-linear sequences? What if in-
stead of a single parameter, there are two or more parameters that one wants to
vary? Is it possible to classify higher-dimensional sequences and derive barcodes
here? Unfortunately, the situation is much more complex than in the simple, lin-
ear setting. There are fundamental algebraic reasons for why such a classification
is not directly possible. These obstructions originate from representation theory
and quivers: see [24, 28, 76]. The good news is that quiver theory implies the exis-
tence of barcodes for linear sequences of vector spaces where the directions of the
maps do not have to be uniform, as per the zigzag persistence of Carlsson and de
Silva [24]. The bad news is that quiver theory implies that a well-defined barcode

cannot exist as presently conceived for any sequence that is not a Dynkin dia-
gram (meaning, in particular, that higher-dimensional persistence has no simple
classification).
Nevertheless, one intuits that sheaves and cosheaves should have some bear-
ing on persistence and barcodes. Consider the classical scenario, in which one
has a sequence of finite-dimensional vector spaces Vi and linear transformations
ϕi : Vi → Vi+1 . Consider the following sheaf F over a Z-discretized R. To each
vertex {i} is assigned Vi . To each edge (i, i + 1) is assigned Vi+1 with an iden-
tity isomorphism from the right vertex stalk to the edge and ϕi as the left-vertex
restriction map to the edge data. Note the similarity of this sheaf to that of the
recurrence relation earlier in this lecture. As in the case of the recurrence rela-
tion, H0 detects global solutions: something similar happens for intervals in the
barcode.
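A hypothetical sketch of this construction (Python/NumPy, not part of the original notes; the truncation of the Z-discretized line to finitely many vertices is an added assumption) makes the global sections explicit:

```python
import numpy as np

def persistence_sheaf_h0(phis, dims):
    """dim H0 of the sheaf on a finite path of vertices 0..k.  Vertex i
    carries a stalk of dimension dims[i]; the edge (i, i+1) carries V_{i+1},
    with restriction phis[i] from the left and the identity from the right.
    A global section is a tuple (x_i) with phis[i] @ x_i = x_{i+1}."""
    rows, cols = sum(dims[1:]), sum(dims)
    d = np.zeros((rows, cols))
    r0 = c0 = 0
    for i, phi in enumerate(phis):
        n, m = dims[i], dims[i + 1]
        d[r0:r0 + m, c0:c0 + n] = phi                  # contribution phi_i(x_i)
        d[r0:r0 + m, c0 + n:c0 + n + m] = -np.eye(m)   # minus x_{i+1}
        r0 += m
        c0 += n
    return cols - int(np.linalg.matrix_rank(d))        # dim ker d

# Hypothetical example: R --id--> R --0--> R.  Sections are (t, t, 0), so H0 = 1.
print(persistence_sheaf_h0([np.array([[1.0]]), np.array([[0.0]])], [1, 1, 1]))  # 1
```

The kernel of the assembled coboundary consists of exactly the compatible tuples, which is the content of Exercise 4.20 below.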

Exercise 4.20. Recall that persistent homology of a persistence complex is really a


homology that is attached to an interval in the parameter line. Given the sheaf F
associated to R as above, and an interval subcomplex I of R, let FI be the
restriction of the sheaf to the interval, following Exercise 4.14. Prove that the number
of bars in the barcode over I is the dimension of H0 (I; FI ). Can you argue what
changes in F could be made to preserve this result in the case of a sequence of
linear transformations ϕi that do not all “go the same direction”? Can you adapt
this construction to a collection of vector spaces and linear transformations over
an arbitrary poset (partially-ordered set)? Be careful, there are some complica-
tions.

There are limits to what basic sheaves and cosheaves can do, as cohomology
does not come with the descriptiveness and uniqueness that the representation-
theoretic approach gives. Nevertheless, there are certain settings in which barcodes
for persistent homology are completely captured by sheaves and cosheaves
(see the thesis of Curry for the case of level-set persistence [33]), with more char-
acterizations to come [34].
Homological Data, Redux We summarize by updating and expanding the prin-
ciples that we outlined earlier in the lecture series into a more refined language:
(1) Algebraic co/chain complexes are a good model for converting a space
built from local pieces into a linear-algebraic structure.
(2) Co/homology is an optimal compression scheme to collapse inessential
structure and retain qualitative features.
(3) The classification of linear algebraic sequences yields barcodes as a de-
composition of sequences of co/homologies, capturing the evolution of
qualitative features.
(4) Exact sequences permit inference from partial sequential co/homological
data to more global characterization.

(5) A variety of different co/homology theories exist, adapted to different


types of structures on a space or complex, with functoriality being the
tool for relating (and, usually, equating) these different theories.
(6) Sheaves and cosheaves are algebraic data structures over spaces that can
be programmed to encode local constraints.
(7) The concomitant sheaf cohomologies [& cosheaf homologies] compress
these data structures down to their qualitative core, integrating local data
into global features.
(8) Classifications of sheaves and cosheaves recapitulate the classification of
co/chain complexes into barcodes, but presage a much broader and more
applicable theory in the making.

Application: Sensing and Evasion


Most of the examples of cellular sheaves given thus far have been simplistic,
for pedagogical purposes. This lecture ends with an example of what one can
do with a more interesting sheaf to solve a nontrivial inference problem. This ex-
ample is chosen to put together as many pieces as possible of the things we have
learned, including homology theories, persistence, cohomology, sheaves, compu-
tation, and more. It is, as a result, a bit complicated, and this survey will be
highly abbreviated: see [52] for full details.
Consider the following type of evasion game, in which an evader tries to hide
from a pursuer, and capture is determined by being “seen” or “sensed”. For
concreteness, the game takes place in a Euclidean space Rn and time progresses
over the reals. At each time t ∈ R, the observer sees or senses a coverage region
Ct ⊂ Rn that is assumed to (1) be connected; and (2) include the region outside a
fixed ball (to preclude the evader from running away off to infinity). The evasion
problem is this: given the coverage regions over time, is it possible for there to
be an evasion path, a continuous map e with e(t) ∈ Rn − Ct for all t ∈ R?
Such a map is a section of the projection p : (Rn × R) − C → R from the
complement of the full coverage region C ⊂ Rn × R to the timeline.
What makes this problem difficult (i.e., interesting) is that the geometry and
topology of the complement of the coverage region, where the evader can hide,
is not known: were this known, graph-theoretic methods would handily assist in
finding a section or determining nonexistence. Furthermore, the coverage region
C is not known geometrically, but rather topologically, with unknown embedding.
The thesis of Adams [3] gives examples of two different time-dependent coverage
regions, C and C′, whose fibers are topologically the same (homotopic) for each
t, but which differ in the existence of an evasion path. The core difficulty is that,
though C and C′ each admit “tunnels” in their complements stretching over the
entire timeline, one of them has a tunnel that snakes backwards along the time
axis: topologically, legal; physically, illegal.

Work of Adams and Carlsson [3] gives a complete solution to the existence of
an evasion path in the case of a planar (n = 2) system with additional genericity
conditions and some geometric assumptions. Recently, a complete solution in all
dimensions was given [52] using sheaves and sheaf cohomology. One begins with
a closed coverage region C ⊂ Rn × R whose complement is uniformly bounded
over R. For the sake of exposition, assume that the time axis is given a discretiza-
tion into (ordered) vertices vi and edges ei = (vi , vi+1 ) such that the coverage
domains Ct are topologically equivalent over each edge (this is not strictly neces-
sary). There are a few simple sheaves over the discretized timeline relevant to the
evasion problem.
First, consider for each time t ∈ R, the coverage domain Ct ⊂ Rn . How many
different ways are there for an evader to hide from Ct ? This is regulated by the
number of connected components of the complement, classified by H0 (Rn −Ct ).
Since we do not have access to Ct directly (remember – its embedding in Rn
is unknown to the pursuer), we must try to compute this H0 based only on the
topology of Ct . That this can be done is an obvious but wonderful corollary of
Alexander duality, which relates the homology and cohomology of complementary
subsets of Rn. Here, Alexander duality implies that H0(Rn − Ct) ≅ Hn−1(Ct):
this, then, is something we can measure, and motivates using the Leray cellular
sheaf H of (n − 1)-dimensional cohomology of the coverage regions over the time
axis. Specifically, for each edge ei define H(ei ) = Hn−1 (C(vi ,vi+1 ) ) to be the
cohomology of the region over the open edge. For the vertices, use the star:
H(vi ) = Hn−1 (C(vi−1 ,vi+1 ) ).

Exercise 4.21. Why are the stalks over the vertices defined in this way? Show
that this gives a well-defined cellular sheaf using as restriction maps the induced
homomorphisms on cohomology. Hint: which way do the induced maps in coho-
mology go?

The intuition is that global sections of this sheaf over the time axis, H0 (R; H),
would classify the different complementary “tunnels” through the coverage set
that an evader could use to escape detection. Unfortunately, this is incorrect, for
reasons pointed out by Adams and Carlsson [3] (using the language of zigzags).
The culprit is the commutative nature of homology and cohomology — one can-
not discern tunnels which illegally twirl backwards in time. To solve this problem,
one could try to keep track of some sort of directedness or orientation. Thanks
to the assumption that Ct is connected for all time, there is a global orienta-
tion class on Rn that can be used to assign a ±1 to basis elements of Hn−1 (Ct )
based on whether the complementary tunnel is participating in a time-orientation-
preserving evasion path on the time interval (−∞, t).
However, to incorporate this orientation data into the sheaf requires breaking
the bounds of working with vector spaces. As detailed in [52], one may use
sheaves that take values in semigroups. In this particular case, the semigroups

are positive cones within vector spaces, where a cone K ⊂ V in a vector space V
is a subset closed under vector addition and closed under multiplication by R+ ,
the [strictly] positive reals. A cone is positive if K ∩ −K = ∅. With work, one
can formulate co/homology theories and sheaves to take values in cones: for
details, see [52]. The story proceeds: within the sheaf H of (n − 1)-dimensional
cohomology of C on R, there is (via abuse of terminology) a “subsheaf” +H of
positive cones, meaning that the stalks of +H are positive cones within the stalks
of H, encoding all the positive cohomology classes that can participate in a legal
(time-orientation-respecting) evasion path. It is this sheaf of positive cones that
classifies evasion paths.

Theorem 4.22. For n > 1 and C = {Ct } ⊂ Rn × R closed and with bounded comple-
ment consisting of connected fibers Ct for all t, there is an evasion path over R if and
only if H0(R; +H) is nonempty.

Note that the theorem statement says nonempty instead of nonzero, since the sheaf
takes values in positive cones, which are R+-cones and thus do not contain zero.
The most interesting part of the story is how one can compute this H0 . This is
where the technicalities begin to weigh heavily, as one cannot use the classical def-
inition of H0 in terms of kernels and images. The congenial commutative world
of vector spaces requires significant care when passing to the nonabelian setting.
One defines H0 for sheaves of cones using constructs from category theory (lim-
its, specifically). Computation of such objects requires a great deal more thought
and care than the simpler linear-algebraic notions of these lectures. That is by
no means a defect; indeed, it is a harbinger. Increase in resolution requires an
increase in algebra.

Conclusion: Beyond Linear Algebra


These lecture notes have approached topological data analysis from the per-
spective of homological algebra. If the reader takes from these notes the singular
idea that linear algebra can be enriched to cover not merely linear transformations,
but also sequences of linear transformations that form complexes, then these lec-
tures will have served their intended purpose. The reader for whom this seems to
open a new world will be delighted indeed to learn that the vector space version
of homological algebra is almost too pedestrian to matter to mathematicians: as
hinted at in the evader inference example above, the story begins in earnest when
one works with rings, modules, and more interesting categories still. Neverthe-
less, for applications to data, vector spaces and linear transformations are a safe
place to start.
For additional material that aligns with the author’s view on topology and
homological algebra, the books [50, 51] are recommended. It is noted that the
perspective of these lecture notes is idiosyncratic. For a broader view, the inter-
ested reader is encouraged to consult the growing literature on topological data

analysis. The book by Edelsbrunner and Harer [40] is a gentle introduction, with
emphases on both theory and algorithms. The book on computational homology
by Kaczynski, Mischaikow, and Mrozek [62] has even more algorithmic material,
mostly in the setting of cubical complexes. Both books suffer from the short shelf
life of algorithms as compared to theories. Newer titles on the theory of persistent
homology are in the process of appearing: that of Oudot [76] is one of perhaps
several soon to come. For introductions to topology specifically geared towards
the data sciences, there does not seem to be an ideal book; rather, a selection of
survey articles such as that of Carlsson [23] is appropriate.
The open directions for inquiry are perhaps too many to identify properly –
the subject of topological data analysis is in its infancy. One can say with cer-
tainty that there is a slow spread of these topological methods and perspectives
to new application domains. This will continue, and it is not yet clear to this au-
thor whether neuroscience, genetics, signal processing, materials science, or some
other domain will be the locus of inquiry that benefits most from homological
methods, so rapid have been the advances in all these areas. Dual to this spread in
applications is the antipodal expansion into Mathematics, as ideas from
homological data engage with and impact contemporary Mathematics. The unique demands
of data have already prompted explorations into mathematical structures (e.g.,
interleaving distance in sheaf theory and persistence in matroid theory) which
otherwise would seem unmotivated and be unexplored. It is to be hoped that the
simple applications of representation theory to homological data analysis will
inspire deeper explorations into the mathematical tools.
There is at the moment a frenzy of activity surrounding various notions of
stability associated to persistent homology, sheaves, and related structures con-
cerning representation theory. It is likely to take some time to sort things out into
their clearest form. In general, one expects that applications of deeper ideas from
algebraic topology exist and will percolate through applied mathematics into ap-
plication domains. Perhaps the point of most optimism and uncertainty lies in
the intersection with probabilistic and stochastic methods. The rest of this volume
makes ample use of such tools; their absence in these notes is noticeable. Topol-
ogy and probability are neither antithetical nor natural partners; expectation of
progress is warranted from some excellent results on the topology of Gaussian
random fields [6] and recent work on the homology of random complexes [63].
Much of the material from these lectures appears ripe for a merger with modern
probabilistic methods. Courage and optimism — two of the cardinal mathemati-
cal virtues — are needed for this.

References
[1] A. Abrams and R. Ghrist. State complexes for metamorphic systems. Intl. J. Robotics Research,
23(7,8):809–824, 2004. ←278
[2] H. Adams and G. Carlsson. On the nonlinear statistics of range image patches. SIAM J. Imaging
Sci., 2(1):110–117, 2009. MR2486524 ←294

[3] H. Adams and G. Carlsson. Evasion paths in mobile sensor networks. International Journal of
Robotics Research, 34:90–104, 2014. ←318, 319
[4] R. J. Adler. The Geometry of Random Fields. Society for Industrial and Applied Mathematics, 1981.
MR3396215 ←280
[5] R. J. Adler, O. Bobrowski, M. S. Borman, E. Subag, and S. Weinberger. Persistent homology for
random fields and complexes. In Borrowing Strength: Theory Powering Applications, pages 124–143.
IMS Collections, 2010. MR2798515 ←285
[6] R. J. Adler and J. E. Taylor. Random Fields and Geometry. Springer Monographs in Mathematics.
Springer, New York, 2007. MR2319516 ←280, 321
[7] R. H. Atkin. Combinatorial Connectivities in Social Systems. Springer Basel AG, 1977. ←278
[8] A. Banyaga and D. Hurtubise. Morse Homology. Springer, 2004. ←304
[9] Y. Baryshnikov and R. Ghrist. Target enumeration via Euler characteristic integrals. SIAM J. Appl.
Math., 70(3):825–844, 2009. MR2538627 ←280, 316
[10] Y. Baryshnikov and R. Ghrist. Euler integration over definable functions. Proc. Natl. Acad. Sci.
USA, 107(21):9525–9530, 2010. MR2653583 ←280, 316
[11] Y. Baryshnikov, R. Ghrist, and D. Lipsky. Inversion of Euler integral transforms with applications
to sensor data. Inverse Problems, 27(12), 2011. MR2854317 ←280
[12] U. Bauer and M. Lesnick. Induced matchings and the algebraic stability of persistence barcodes.
Discrete Comput. Geom., 6(2):162–191, 2015. MR3333456 ←293
[13] P. Bendich, H. Edelsbrunner, and M. Kerber. Computing robustness and persistence for images.
IEEE Trans. Visual and Comput. Graphics, pages 1251–1260, 2010. ←294
[14] P. Bendich, J. Marron, E. Miller, A. Pieloch, and S. Skwerer. Persistent homology analysis of brain
artery trees. Ann. Appl. Stat., 10(1):198–218, 2016. MR3480493 ←294
[15] S. Bhattacharya, R. Ghrist, and V. Kumar. Persistent homology for path planning in uncertain
environments. IEEE Trans. on Robotics, 31(3):578–590, 2015. ←295
[16] L. J. Billera. Homology of smooth splines: generic triangulations and a conjecture of Strang. Trans.
Amer. Math. Soc., 310(1):325–340, 1988. MR965757 ←316
[17] L. J. Billera, S. P. Holmes, and K. Vogtmann. Geometry of the space of phylogenetic trees. Adv. in
Appl. Math., 27(4):733–767, 2001. MR1867931 ←278
[18] M. Botnan and M. Lesnick. Algebraic stability of zigzag persistence modules. arXiv:1604.00655v2.
←293, 316
[19] R. Bott and L. Tu. Differential Forms in Algebraic Topology. Springer, 1982. MR658304 ←311
[20] G. Bredon. Sheaf Theory. Springer, 1997. MR1481706 ←312
[21] P. Bubenik, V. de Silva, and J. Scott. Metrics for generalized persistence modules. Found. Comput.
Math., 15(6):1501–1531, 2015. MR3413628 ←293, 316
[22] P. Bubenik and J. A. Scott. Categorification of persistent homology. Discrete Comput. Geom.,
51(3):600–627, 2014. MR3201246 ←293
[23] G. Carlsson. The shape of data. In Foundations of computational mathematics, Budapest 2011, volume
403 of London Math. Soc. Lecture Note Ser., pages 16–44. Cambridge Univ. Press, Cambridge, 2013.
MR3137632 ←321
[24] G. Carlsson and V. de Silva. Zigzag persistence. Found. Comput. Math., 10(4):367–405, 2010.
MR2657946 ←316
[25] G. Carlsson, T. Ishkhanov, V. de Silva, and A. Zomorodian. On the local behavior of spaces of
natural images. Intl. J. Computer Vision, 76(1):1–12, Jan. 2008. MR3715451 ←294
[26] G. Carlsson and F. Mémoli. Characterization, stability and convergence of hierarchical clustering
methods. J. Mach. Learn. Res., 11:1425–1470, Aug. 2010. MR2645457 ←294
[27] G. Carlsson and F. Mémoli. Classifying clustering schemes. Found. Comput. Math., 13(2):221–252,
2013. MR3032681 ←294
[28] G. Carlsson and A. Zomorodian. The theory of multidimensional persistence. Discrete Comput.
Geom., 42(1):71–93, 2009. MR2506738 ←316
[29] S. Caron, V. Ruta, L. Abbott, and R. Axel. Random convergence of olfactory inputs in the
Drosophila mushroom body. Nature, 497(7447):113–117, 2013. ←285
[30] F. Chazal, V. de Silva, M. Glisse, and S. Oudot. The Structure and Stability of Persistence Modules.
Springer Briefs in Mathematics, 2016. MR3524869 ←293
[31] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete Com-
put. Geom., 37(1):103–120, 2007. MR2279866 ←293

[32] A. Collins, A. Zomorodian, G. Carlsson, and L. Guibas. A barcode shape descriptor for curve
point cloud data. In M. Alexa and S. Rusinkiewicz, editors, Eurographics Symposium on Point-
Based Graphics, ETH, Zürich, Switzerland, 2004. ←292
[33] J. Curry. Sheaves, Cosheaves and Applications. PhD thesis, University of Pennsylvania, 2014.
MR3259939 ←316, 317
[34] J. Curry and A. Patel. Classification of constructible cosheaves. arXiv:1603.01587. ←317
[35] C. Curto and V. Itskov. Cell groups reveal structure of stimulus space. PLoS Comput. Biol.,
4(10):e1000205, 13, 2008. MR2457124 ←285
[36] M. d’Amico, P. Frosini, and C. Landi. Optimal matching between reduced size functions. Techni-
cal Report 35, DISMI, Univ. degli Studi di Modena e Reggio Emilia, Italy, 2003. ←292
[37] V. de Silva and G. Carlsson. Topological estimation using witness complexes. In M. Alexa and
S. Rusinkiewicz, editors, Eurographics Symposium on Point-based Graphics, 2004. ←278
[38] J. Derenick, A. Speranzon, and R. Ghrist. Homological sensing for mobile robot localization. In
Proc. Intl. Conf. Robotics & Aut., 2012. ←296
[39] C. Dowker. Homology groups of relations. Annals of Mathematics, pages 84–95, 1952. MR0048030
←278
[40] H. Edelsbrunner and J. Harer. Computational Topology: an Introduction. American Mathematical
Society, Providence, RI, 2010. MR2572029 ←275, 321
[41] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification.
Discrete Comput. Geom., 28:511–533, 2002. MR1949898 ←292
[42] M. Farber. Invitation to Topological Robotics. Zurich Lectures in Advanced Mathematics. European
Mathematical Society (EMS), Zürich, 2008. MR2455573 ←280
[43] D. Farley and L. Sabalka. On the cohomology rings of tree braid groups. J. Pure Appl. Algebra,
212(1):53–71, 2008. MR2355034 ←305
[44] D. Farley and L. Sabalka. Presentations of graph braid groups. Forum Math., 24(4):827–859, 2012.
MR2949126 ←305
[45] R. Forman. Morse theory for cell complexes. Adv. Math., 134(1):90–145, 1998. MR1612391 ←304,
305
[46] R. Forman. A user’s guide to discrete Morse theory. Sém. Lothar. Combin., 48, 2002. MR1939695 ←
304
[47] Ś. R. Gal. Euler characteristic of the configuration space of a complex. Colloq. Math., 89(1):61–67,
2001. MR1853415 ←280
[48] M. Gameiro, Y. Hiraoka, S. Izumi, M. Kramar, K. Mischaikow, and V. Nanda. Topological mea-
surement of protein compressibility via persistent diagrams. Japan J. Industrial & Applied Mathe-
matics, 32(1):1–17, Oct 2014. MR3318898 ←296
[49] T. Gao, J. Brodzki, and S. Mukherjee. The geometry of synchronization problems and learning
group actions. arXiv:1610.09051. ←311
[50] S. I. Gelfand and Y. I. Manin. Methods of Homological Algebra. Springer Monographs in Mathemat-
ics. Springer-Verlag, Berlin, second edition, 2003. MR1950475 ←299, 320
[51] R. Ghrist. Elementary Applied Topology. Createspace, 1.0 edition, 2014. ← 275, 278, 282, 303, 311,
320
[52] R. Ghrist and S. Krishnan. Positive Alexander duality for pursuit and evasion. To appear, SIAM
J. Appl. Alg. Geom. MR3763757 ←318, 319, 320
[53] R. Ghrist and S. M. Lavalle. Nonpositive curvature and Pareto optimal coordination of robots.
SIAM J. Control Optim., 45(5):1697–1713, 2006. MR2272162 ←278
[54] R. Ghrist, D. Lipsky, J. Derenick, and A. Speranzon. Topological landmark-based navigation and
mapping. Preprint, 2012. ←278, 296
[55] R. Ghrist and V. Peterson. The geometry and topology of reconfiguration. Adv. in Appl. Math.,
38(3):302–323, 2007. MR2301699 ←278
[56] C. Giusti, E. Pastalkova, C. Curto, and V. Itskov. Clique topology reveals intrinsic structure in
neural connections. Proc. Natl. Acad. Sci. USA, 112(44):13455–13460, 2015. MR3429279 ←284, 285
[57] L. J. Guibas and S. Y. Oudot. Reconstruction using witness complexes. In Proc. 18th ACM-SIAM
Sympos. on Discrete Algorithms, pages 1076–1085, 2007. MR2485259 ←278
[58] A. Hatcher. Algebraic Topology. Cambridge University Press, 2002. MR1867354 ←281, 282, 299, 300,
303
[59] G. Henselman and R. Ghrist. Matroid filtrations and computational persistent homology.
arXiv:1606.00199. ←307

[60] J. Hocking and G. Young. Topology. Dover Press, 1988. MR1016814 ←302
[61] X. Jiang, L.-H. Lim, Y. Yao, and Y. Ye. Statistical ranking and combinatorial Hodge theory. Math.
Program., 127(1, Ser. B):203–244, 2011. MR2776715 ←311
[62] T. Kaczynski, K. Mischaikow, and M. Mrozek. Computational Homology, volume 157 of Applied
Mathematical Sciences. Springer-Verlag, New York, 2004. MR2028588 ←275, 278, 307, 321
[63] M. Kahle. Topology of random clique complexes. Discrete Comput. Geom., 45(3):553–573, 2011.
MR2770552 ←285, 321
[64] M. Kashiwara and P. Schapira. Categories and Sheaves, volume 332 of Grundlehren der Mathematis-
chen Wissenschaften. Springer-Verlag, 2006. MR2182076 ←312
[65] D. Kozlov. Combinatorial Algebraic Topology, volume 21 of Algorithms and Computation in Mathemat-
ics. Springer, 2008. MR2361455 ←304, 305
[66] M. Lesnick. The theory of the interleaving distance on multidimensional persistence modules. J.
Found. Comp. Math., 15(3):613–650, 2015. MR3348168 ←316
[67] T. Lewiner, H. Lopes, and G. Tavares. Applications of Forman’s discrete Morse theory to topology
visualization and mesh compression. IEEE Trans. Visualization & Comput. Graphics, 10(5):499–508,
2004. ←305
[68] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1987.
←280
[69] K. Mischaikow and V. Nanda. Morse theory for filtrations and efficient computation of persistent
homology. Discrete Comput. Geom., 50(2):330–353, 2013. MR3090522 ←306, 307
[70] J. Munkres. Topology. Prentice Hall, 2000. MR3728284 ←275
[71] V. Nanda. Discrete Morse theory and localization. arXiv:1510.01907. ←307
[72] V. Nanda, D. Tamaki, and K. Tanaka. Discrete Morse theory and classifying spaces.
arXiv:1612.08429v1. ←307
[73] M. Nicolau, A. J. Levine, and G. Carlsson. Topology based data analysis identifies a subgroup of
breast cancers with a unique mutational profile and excellent survival. Proc. Natl. Acad. Sci. USA,
108(17):7265–7270, 2011. ←295
[74] J. O’Keefe and J. Dostrovsky. The hippocampus as a spatial map. Brain Research, 34(1):171–175,
1971. ←285
[75] N. Otter, M. Porter, U. Tillmann, P. Grindrod, and H. Harrington. A roadmap for the computation
of persistent homology. EPJ Data Science 6(17), 2017. MR3670641 ←306, 307, 308
[76] S. Oudot. Persistence Theory: From Quiver Representations to Data Analysis. American Mathematical
Society, 2015. MR3408277 ←275, 316, 321
[77] L. Pachter and B. Sturmfels. The mathematics of phylogenomics. SIAM Rev., 49(1):3–31, 2007.
MR2302545 ←278
[78] R. Penrose. La cohomologie des figures impossibles. Structural Topology, 17:11–16, 1991.
MR1140400 ←311
[79] V. Robins, P. Wood, and A. Sheppard. Theory and algorithms for constructing discrete Morse
complexes from grayscale digital images. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 33(8):1646–1658, 2011. ←294, 305
[80] M. Robinson. Topological Signal Processing. Springer, Heidelberg, 2014. MR3157249 ←316
[81] P. Schapira. Operations on constructible functions. J. Pure Appl. Algebra, 72(1):83–93, 1991.
MR1115569 ←316
[82] P. Schapira. Tomography of constructible functions. In Applied Algebra, Algebraic Algorithms and
Error-Correcting Codes, pages 427–435. Springer, 1995. MR1448182 ←316
[83] J. Schürmann. Topology of Singular Spaces and Constructible Sheaves, volume 63 of Mathematics
Institute of the Polish Academy of Sciences. Mathematical Monographs (New Series). Birkhäuser Verlag,
Basel, 2003. MR2031639 ←312
[84] M. Schwarz. Morse Homology, volume 111 of Progress in Mathematics. Birkhäuser Verlag, Basel,
1993. MR1239174 ←304
[85] Y. Shkolnisky and A. Singer. Viewing direction estimation in cryo-EM using synchronization.
SIAM J. Imaging Sci., 5(3):1088–1110, 2012. MR3022188 ←311
[86] A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Appl. Com-
put. Harmonic Anal., 30(1):20–36, 2011. MR2737931 ←311
[87] A. Sizemore, C. Giusti, and D. Bassett. Classification of weighted networks through mesoscale
homological features. J. Complex Networks, 2016. MR3801686 ←285
[88] B. Torres, J. Oliveira, A. Tate, P. Rath, K. Cumnock, and D. Schneider. Tracking resilience to
infections by mapping disease space. PLOS Biology, 2016. ←295
[89] A. Wilkerson, H. Chintakunta, H. Krim, T. Moore, and A. Swami. A distributed collapse of a
network’s dimensionality. In Proceedings of Global Conference on Signal and Information Processing
(GlobalSIP). IEEE, 2013. ←278
[90] S. Yuzvinsky. Modules of splines on polyhedral complexes. Math. Z., 210(2):245–254, 1992.
MR1166523 ←316
[91] A. Zomorodian. Topology for Computing. Cambridge University Press, 2005. MR2111929 ←292
Departments of Mathematics and Electrical & Systems Engineering, University of Pennsylvania

Email address: [email protected]
PUBLISHED TITLES IN THIS SERIES

25 Michael W. Mahoney, John C. Duchi, and Anna C. Gilbert, Editors, The
Mathematics of Data, 2018
24 Roman Bezrukavnikov, Alexander Braverman, and Zhiwei Yun, Editors,
Geometry of Moduli Spaces and Representation Theory, 2017
23 Mark J. Bowick, David Kinderlehrer, Govind Menon, and Charles Radin,
Editors, Mathematics and Materials, 2017
22 Hubert L. Bray, Greg Galloway, Rafe Mazzeo, and Natasa Sesum, Editors,
Geometric Analysis, 2016
21 Mladen Bestvina, Michah Sageev, and Karen Vogtmann, Editors, Geometric
Group Theory, 2014
20 Benson Farb, Richard Hain, and Eduard Looijenga, Editors, Moduli Spaces of
Riemann Surfaces, 2013
19 Hongkai Zhao, Editor, Mathematics in Image Processing, 2013
18 Cristian Popescu, Karl Rubin, and Alice Silverberg, Editors, Arithmetic of
L-functions, 2011
17 Jeffery McNeal and Mircea Mustaţă, Editors, Analytic and Algebraic Geometry,
2010
16 Scott Sheffield and Thomas Spencer, Editors, Statistical Mechanics, 2009
15 Tomasz S. Mrowka and Peter S. Ozsváth, Editors, Low Dimensional Topology, 2009
14 Mark A. Lewis, Mark A. J. Chaplain, James P. Keener, and Philip K. Maini,
Editors, Mathematical Biology, 2009
13 Ezra Miller, Victor Reiner, and Bernd Sturmfels, Editors, Geometric
Combinatorics, 2007
12 Peter Sarnak and Freydoon Shahidi, Editors, Automorphic Forms and Applications,
2007
11 Daniel S. Freed, David R. Morrison, and Isadore Singer, Editors, Quantum Field
Theory, Supersymmetry, and Enumerative Geometry, 2006
10 Steven Rudich and Avi Wigderson, Editors, Computational Complexity Theory, 2004
9 Brian Conrad and Karl Rubin, Editors, Arithmetic Algebraic Geometry, 2001
8 Jeffrey Adams and David Vogan, Editors, Representation Theory of Lie Groups, 2000
7 Yakov Eliashberg and Lisa Traynor, Editors, Symplectic Geometry and Topology,
1999
6 Elton P. Hsu and S. R. S. Varadhan, Editors, Probability Theory and Applications,
1999
5 Luis Caffarelli and Weinan E, Editors, Hyperbolic Equations and Frequency
Interactions, 1999
4 Robert Friedman and John W. Morgan, Editors, Gauge Theory and the Topology of
Four-Manifolds, 1998
3 János Kollár, Editor, Complex Algebraic Geometry, 1997
2 Robert Hardt and Michael Wolf, Editors, Nonlinear Partial Differential Equations in
Differential Geometry, 1996
1 Daniel S. Freed and Karen K. Uhlenbeck, Editors, Geometry and Quantum Field
Theory, 1995
Data science is a highly interdisciplinary field, incorporating ideas from applied math-
ematics, statistics, probability, and computer science, as well as many other areas. This
book gives an introduction to the mathematical methods that form the foundations of
machine learning and data science, presented by leading experts in computer science,
statistics, and applied mathematics. Although the chapters can be read independently, they
are designed to be read together as they lay out algorithmic, statistical, and numerical
approaches in diverse but complementary ways.
This book can be used both as a text for advanced undergraduate and beginning graduate
courses and as a survey for researchers interested in understanding how applied math-
ematics, broadly defined, is being used in data science. It will appeal to anyone interested
in the interdisciplinary foundations of machine learning and data science.

PCMS/25