Algorithmic Aspects of Machine Learning
This book bridges theoretical computer science and machine learning by exploring
what the two sides can teach each other. It emphasizes the need for flexible, tractable
models that better capture not what makes machine learning hard but what makes it
easy. Theoretical computer scientists will be introduced to important models in
machine learning and to the main questions within the field. Machine learning
researchers will be introduced to cutting-edge research in an accessible format and
will gain familiarity with a modern algorithmic toolkit, including the method of
moments, tensor decompositions, and convex programming relaxations.
The treatment goes beyond worst-case analysis to build a rigorous understanding
about the approaches used in practice and to facilitate the discovery of exciting new
ways to solve important, long-standing problems.
ANKUR MOITRA
Massachusetts Institute of Technology
www.cambridge.org
Information on this title: www.cambridge.org/9781107184589
DOI: 10.1017/9781316882177
© Ankur Moitra 2018
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2018
Printed in the United States of America by Sheridan Books, Inc.
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Moitra, Ankur, 1985– author.
Title: Algorithmic aspects of machine learning / Ankur Moitra,
Massachusetts Institute of Technology.
Description: Cambridge, United Kingdom ; New York, NY, USA : Cambridge
University Press, 2018. | Includes bibliographical references.
Identifiers: LCCN 2018005020 | ISBN 9781107184589 (hardback) |
ISBN 9781316636008 (paperback)
Subjects: LCSH: Machine learning–Mathematics. | Computer algorithms.
Classification: LCC Q325.5 .M65 2018 | DDC 006.3/1015181–dc23
LC record available at https://lccn.loc.gov/2018005020
ISBN 978-1-107-18458-9 Hardback
ISBN 978-1-316-63600-8 Paperback
Contents
1 Introduction 1
2 Nonnegative Matrix Factorization 4
2.1 Introduction 4
2.2 Algebraic Algorithms 11
2.3 Stability and Separability 16
2.4 Topic Models 22
2.5 Exercises 27
3 Tensor Decompositions: Algorithms 29
3.1 The Rotation Problem 29
3.2 A Primer on Tensors 31
3.3 Jennrich’s Algorithm 35
3.4 Perturbation Bounds 40
3.5 Exercises 46
4 Tensor Decompositions: Applications 48
4.1 Phylogenetic Trees and HMMs 48
4.2 Community Detection 55
4.3 Extensions to Mixed Models 58
4.4 Independent Component Analysis 65
4.5 Exercises 69
5 Sparse Recovery 71
5.1 Introduction 71
5.2 Incoherence and Uncertainty Principles 74
5.3 Pursuit Algorithms 77
Bibliography 143
Index 150
1 Introduction

2 Nonnegative Matrix Factorization
2.1 Introduction
In order to better understand the motivations behind the nonnegative matrix
factorization problem and why it is useful in applications, it will be helpful
to first introduce the singular value decomposition and then compare the two.
Eventually, we will apply both of these to text analysis later in this section.
Writing the singular value decomposition of M as a sum of rank-one terms, we have

M = ∑_{i=1}^{r} σ_i u_i v_i^T

where u_i is the ith column of U, v_i is the ith column of V, and σ_i is the ith diagonal entry of Σ. Throughout this section we will fix the convention that σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0. In this case, the rank of M is precisely r.
Throughout this book, we will have occasion to use this decomposition as
well as the (perhaps more familiar) eigendecomposition. If M is an n×n matrix
and is diagonalizable, its eigendecomposition is written as
M = PDP−1
Two Applications
Two of the most important properties of the singular value decomposition are
that it can be used to find the best rank k approximation and that it can be used
for dimension reduction. We explore these next. First, let’s formalize what we
mean by the best rank k approximation problem. One way to do this is to work
with the Frobenius norm:
Definition 2.1.1 (Frobenius norm) ‖M‖_F = √(∑_{i,j} M_{i,j}^2)
It is easy to see that the Frobenius norm is invariant under rotations. For
example, this follows by considering each of the columns of M separately
as a vector. The square of the Frobenius norm of a matrix is the sum of
squares of the norms of its columns. Then left-multiplying by an orthogonal
matrix preserves the norm of each of its columns. An identical argument holds
for right-multiplying by an orthogonal matrix (but working with the rows
instead). This invariance allows us to give an alternative characterization of
the Frobenius norm that is quite useful:
‖M‖_F = ‖U^T M V‖_F = ‖Σ‖_F = √(∑_i σ_i^2)
The first equality is where all the action is happening and uses the rotational
invariance property we established above.
Then the Eckart–Young theorem asserts that the best rank k approximation
to some matrix M (in terms of Frobenius norm) is given by its truncated
singular value decomposition:
Theorem 2.1.2 (Eckart–Young) argmin_{rank(B)≤k} ‖M − B‖_F = ∑_{i=1}^{k} σ_i u_i v_i^T
In fact, the best rank k approximation to M can be obtained directly from its best rank k + 1 approximation. This is not
always the case, as we will see in the next chapter when we work with tensors.
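To make this concrete, here is a minimal numerical sketch (in numpy; the matrix and its dimensions are made up for illustration) that forms the best rank k approximation from the truncated singular value decomposition and checks the Eckart–Young guarantee against an arbitrary rank k matrix.

import numpy as np

np.random.seed(0)
M = np.random.randn(50, 30)   # an arbitrary 50 x 30 matrix
k = 5

# Truncated SVD: keep only the top k singular values and vectors.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the norm of the discarded singular values ...
best_err = np.linalg.norm(M - M_k, 'fro')
print(np.isclose(best_err, np.sqrt(np.sum(s[k:] ** 2))))   # True

# ... and no other rank k matrix does better.
B = np.random.randn(50, k) @ np.random.randn(k, 30)        # an arbitrary rank k matrix
print(np.linalg.norm(M - B, 'fro') >= best_err)            # True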
Next, we give an entirely different application of the singular value decom-
position in the context of data analysis before we move on to applications of it
in text analysis. Recall that M is an m × n matrix. We can think of it as defining
a distribution on n-dimensional vectors, which we obtain from choosing one
of its columns uniformly at random. Further suppose that E[x] = 0; i.e., the
columns sum to the all-zero vector. Let Pk be the space of all projections onto
a k-dimensional subspace.
Theorem 2.1.5 argmax_{P∈P_k} E[‖Px‖^2] = ∑_{i=1}^{k} u_i u_i^T
This is another basic theorem about the singular value decomposition, and
from it we can readily compute the k-dimensional projection that maximizes
the projected variance. This theorem is often invoked in visualization, where
one can visualize high-dimensional vector data by projecting it to a more
manageable lower-dimensional subspace.
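As a quick sanity check of Theorem 2.1.5 (again only a sketch with made-up data, not code from the text), projecting onto the span of the top k left singular vectors captures at least as much variance as projecting onto a random k-dimensional subspace:

import numpy as np

np.random.seed(1)
m, n, k = 20, 500, 3
M = np.random.randn(m, n)
M = M - M.mean(axis=1, keepdims=True)      # center the columns so E[x] = 0

def projected_variance(M, Q):
    """Average squared norm of the columns of M after projecting onto span(Q)."""
    P = Q @ Q.T
    return np.mean(np.sum((P @ M) ** 2, axis=0))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
Q_top = U[:, :k]                                       # top k left singular vectors
Q_rand, _ = np.linalg.qr(np.random.randn(m, k))        # a random k-dimensional subspace
print(projected_variance(M, Q_top) >= projected_variance(M, Q_rand))   # True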
This quantity computes the probability that a randomly chosen word w from
document i and a randomly chosen word w from document j are the same. But
what makes this a bad measure is that when documents are sparse, they may not
have many words in common just by accident because of the particular words
each author chose to use to describe the same types of things. Even worse,
some documents could be deemed to be similar because they contain many of
the same common words, which have little to do with what the documents are
actually about.
Deerwester et al. [60] proposed to use the singular value decomposition of
M to compute a more reasonable measure of similarity, and one that seems
to work better when the term-by-document matrix is sparse (as it usually is).
Let M = UΣV^T and let U_{1...k} and V_{1...k} be the first k columns of U and V, respectively. The approach is to compute

⟨U_{1...k}^T M_i, U_{1...k}^T M_j⟩
for each pair of documents. The intuition is that there are some topics that occur
over and over again in the collection of documents. And if we could represent
each document Mi on the basis of topics, then their inner product on that basis
would yield a more meaningful measure of similarity. There are some models –
i.e., hypotheses for how the data is stochastically generated – where it can be
shown that this approach provably recovers the true topics [118]. This is the
ideal interaction between theory and practice – we have techniques that work
(somewhat) well, and we can analyze/justify them.
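A minimal sketch of this similarity computation, assuming the term-by-document matrix is available as a dense numpy array (for a real corpus one would use sparse matrices and a truncated SVD routine):

import numpy as np

def lsi_similarity(M, k):
    """Inner products of documents after projecting onto the top k left singular vectors."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    U_k = U[:, :k]
    reps = U_k.T @ M            # k-dimensional representation of each document (column)
    return reps.T @ reps        # entry (i, j) is the inner product for documents i and j

# Toy usage on a random nonnegative "term-by-document" matrix.
M = np.abs(np.random.randn(1000, 50))
S = lsi_similarity(M, k=10)
print(S.shape)                  # (50, 50)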
However, there are many failings of latent semantic indexing that have motivated alternative approaches. If we associate the top singular vectors with topics, then we are implicitly assuming that the topics are orthonormal, since singular vectors always are. However, topics like politics and finance actually contain many words in common, so they cannot be orthonormal.
Later on, we will get some insight into why nonnegative matrix factorization
is NP-hard. But what approaches are used in practice to actually compute such
a factorization? The usual approach is alternating minimization:
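The algorithm box itself is not reproduced here, so the following is only a sketch of one standard instance of alternating minimization, the well-known multiplicative update heuristic, written in numpy; the variant the text has in mind may differ (for example, by solving a nonnegative least-squares problem exactly in each half-step).

import numpy as np

def nmf_alternating(M, r, iters=500, eps=1e-9):
    """Sketch of alternating minimization for M ~ A @ W with A, W entrywise nonnegative."""
    m, n = M.shape
    rng = np.random.default_rng(0)
    A = np.abs(rng.standard_normal((m, r)))
    W = np.abs(rng.standard_normal((r, n)))
    for _ in range(iters):
        W *= (A.T @ M) / (A.T @ A @ W + eps)    # fix A, improve W
        A *= (M @ W.T) / (A @ W @ W.T + eps)    # fix W, improve A
    return A, W

# Toy usage on a matrix that has an exact nonnegative factorization.
A0, W0 = np.random.rand(40, 5), np.random.rand(5, 60)
A, W = nmf_alternating(A0 @ W0, r=5)
print(np.linalg.norm(A0 @ W0 - A @ W) / np.linalg.norm(A0 @ W0))   # small relative error

Each update fixes one factor and improves the other, which is exactly the alternating structure referred to above; such heuristics can get stuck in local minima, consistent with the worst-case hardness discussed in this section.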
Hence rank(M) ≤ 3. (In fact, rank(M) = 3.) However, M has zeros along
the diagonal and nonzeros off the diagonal. Furthermore, for any rank one
entrywise nonnegative matrix Mi , its pattern of zeros and nonzeros is a
combinatorial rectangle – i.e., the intersection of some set of rows and
columns – and it can be shown that one needs at least log n such rectangles
to cover the nonzeros of M without covering any of its zeros. Hence:
Fact 2.2.4 rank+ (M) ≥ log n
A word of caution: For this example, a number of authors have incorrectly
tried to prove a much stronger lower bound (e.g., rank+ (M) = n). In fact (and
somewhat surprisingly), it turns out that rank+ (M) ≤ 2 log n. The usual error
is in thinking that because the rank of a matrix is the largest r such that it has
r linearly independent columns, the nonnegative rank is the largest r such that
there are r columns where no column is a convex combination of the other
r − 1. This is not true!
has a solution. This system consists of quadratic equality constraints (one for
each entry of M) and linear inequalities that require A and W to be entrywise
nonnegative. Before we worry about fast algorithms, we should ask a more
basic question (whose answer is not at all obvious):
Question 4 Is there any finite time algorithm for deciding if rank+ (M) ≤ r?
There is an algorithm with running time (nDL)^{O(k)}
that decides whether the system has a solution. Moreover, if it does have a
solution, then it outputs a polynomial and an interval (one for each variable)
in which there is only one root, which is the value of the variable in the true
solution.
Notice that this algorithm finds an implicit representation of the solution, since
you can find as many bits of the solution as you would like by performing a
binary search for the root. Moreover, this algorithm is essentially optimal, and
improving it would yield subexponential time algorithms for 3-SAT.
We can use these algorithms to solve nonnegative matrix factorization, and
it immediately implies that there is an algorithm for deciding if rank+ (M) ≤ r
runs in exponential time. However, the number of variables we would need in
the naive representation is nr + mr, one for each entry in A or W. So even if
r = O(1), we would need a linear number of variables and the running time
would still be exponential. It turns out that even though the naive representation
uses many variables, there is a more clever representation that uses many fewer
variables.
Variable Reduction
Here we explore the idea of finding a system of polynomial equations that
expresses the nonnegative matrix factorization problem using many fewer
variables. In [13, 112], Arora et al. and Moitra gave a system of polynomial
inequalities with f(r) = 2r^2 variables that has a solution if and only if
rank+ (M) ≤ r. This immediately yields a polynomial time algorithm to
compute a nonnegative matrix factorization of inner-dimension r (if it exists)
for any r = O(1). These algorithms turn out to be essentially optimal in a
worst-case sense, and prior to this work the best known algorithms even for
the case r = 4 ran in exponential time.
We will focus on a special case to illustrate the basic idea behind variable
reduction. Suppose that rank(M) = r, and our goal is to decide whether or
not rank+ (M) = r. This is called the simplicial factorization problem. Can
we find an alternate system of polynomial inequalities that expresses this
decision problem but uses many fewer variables? The following simple but
useful observation will pave the way:
Claim 2.2.6 In any solution to the simplicial factorization problem, A and W
must have full column and row rank, respectively.
Proof: If M = AW, then the column span of A must contain the columns
of M, and similarly, the row span of W must contain the rows of M. Since
rank(M) = r, we conclude that A and W must have r linearly independent
columns and rows, respectively. Since A has r columns and W has r rows, this
implies the claim.
Hence we know that A has a left pseudo-inverse A^+ and W has a right pseudo-inverse W^+ so that A^+A = WW^+ = I_r, where I_r is the r × r identity
matrix. We will make use of these pseudo-inverses to reduce the number of
variables in our system of polynomial inequalities. In particular:
A+ AW = W
need to know how S+ acts on the rows of M that span an r-dimensional space.
Hence we can apply a change of basis to write
MC = MU
Notice that S and T are both r × r matrices, and hence there are 2r^2 variables
in total. Moreover, this formulation is equivalent to the simplicial factorization
problem in the following sense:
Claim 2.2.7 If rank(M) = rank+ (M) = r, then (2.3) has a solution.
Proof: Using the notation above, we can set S = U + W + and T = A+ V + .
Then MC S = MUU + W + = A and similarly TMR = A+ V + VM = W, and this
implies the claim.
This is often called completeness, since if there is a solution to the original
problem, we want there to be a valid solution to our reformulation. We also
need to prove soundness, that any solution to the reformulation yields a valid
solution to the original problem:
Claim 2.2.8 If there is a solution to (2.3), then there is a solution to (2.1).
Proof: For any solution to (2.3), we can set A = MC S and W = TMR , and it
follows that A, W ≥ 0 and M = AW.
It turns out to be quite involved to extend the ideas above to nonnegative
matrix factorization in general. The main idea in [112] is to first establish a
new normal form for nonnegative matrix factorization, and use the observation
that even though A could have exponentially many maximal sets of linearly
independent columns, their pseudo-inverses are algebraically dependent and can be expressed over a common set of r^2 variables using Cramer's rule.
Additionally, Arora et al. [13] showed that any algorithm that solves even the
simplicial factorization problem in (nm)^{o(r)} time yields a subexponential time
algorithm for 3-SAT, and hence the algorithms above are nearly optimal under
standard complexity assumptions.
Further Remarks
Earlier in this section, we gave a simple example that illustrates a separation
between the rank and the nonnegative rank. In fact, there are more interesting
examples of separations that come up in theoretical computer science, where
a natural question is to express a particular polytope P in n dimensions,
which has exponentially many facets as the projection of a higher dimensional
polytope Q with only polynomially many facets. This is called an extended
formulation, and a deep result of Yannakakis is that the minimum number
of facets of any such Q – called the extension complexity of P – is precisely
equal to the nonnegative rank of some matrix that has to do with the geometric
arrangement between vertices and facets of P [144]. Then the fact that there are
explicit polytopes P whose extension complexity is exponential is intimately
related to finding explicit matrices that exhibit large separations between their
rank and nonnegative rank.
Furthermore, the nonnegative rank also has important applications in com-
munication complexity, where one of the most important open questions – the
log-rank conjecture [108] – can be reformulated by asking: Given a Boolean
matrix M, is log rank+ (M) ≤ (log rank(M))O(1) ? Thus, in the example above,
the fact that the nonnegative rank cannot be bounded by any function of the
rank could be due to the entries of M taking on many distinct values.
Reductions
We will prove that the simplicial factorization problem and the intermediate simplex problem are equivalent in the sense that there is a polynomial time reduction from each one to the other.
Caution: The fact that you can start out with an arbitrary factorization and
ask to rotate it into a nonnegative matrix factorization of minimum inner
dimension but haven’t painted yourself into a corner is particular to the
simplicial factorization problem only! It is generally not true when rank(M) <
rank+ (M).
Now we can give a geometric interpretation of P1:
(1) Let u1 , u2 , . . . , um be the rows of U.
(2) Let t1 , t2 , . . . , tr be the columns of T.
(3) Let v1 , v2 , . . . , vn be the columns of V.
We will first work with an intermediate cone problem, but its connection to
the intermediate simplex problem will be immediate. Toward that end, let P
be the cone generated by u1 , u2 , . . . , um , and let K be the cone generated by
t1 , t2 , . . . , tr . Finally, let Q be the cone given by
Q = {x | ⟨u_i, x⟩ ≥ 0 for all i}.
It is not hard to see that Q is a cone in the sense that it is generated as all
nonnegative combinations of a finite set of vectors (its extreme rays), but we
Geometric Gadgets
Vavasis made use of the equivalences in the previous section to construct
certain geometric gadgets to prove that nonnegative matrix factorization is NP-
hard. The idea was to construct a two-dimensional gadget where there are only
two possible intermediate triangles, which can then be used to represent the
truth assignment for a variable xi . The description of the complete reduction
and the proof of its soundness are involved (see [139]).
Theorem 2.3.9 [139] Nonnegative matrix factorization, simplicial factoriza-
tion, intermediate simplex, intermediate cone, and P1 are all NP-hard.
Arora et al. [13] improved upon this reduction by constructing low-
dimensional gadgets with many more choices. This allows them to reduce from
the d-SUM problem, where we are given a set of n numbers and the goal is to
find a set of d of them that sum to zero. The best known algorithms for this
problem run in time roughly n^{d/2}. Again, the full construction and the proof
of soundness are involved.
Theorem 2.3.10 Nonnegative matrix factorization, simplicial factorization,
intermediate simplex, intermediate cone, and P1 all require time at least (nm)^{Ω(r)} unless there is a subexponential time algorithm for 3-SAT.
In all of the topics we will cover, it is important to understand what makes
the problem hard in order to identify what makes it easy. The common feature
in all of the above gadgets is that the gadgets themselves are highly unstable
and have multiple solutions, and so it is natural to look for instances where
the answer itself is robust and unique in order to identify instances that can be
solved more efficiently than in the worst case.
Separability
In fact, Donoho and Stodden [64] were among the first to explore the question
of what sorts of conditions imply that the nonnegative matrix factorization of
minimum inner dimension is unique. Their original examples came from toy
problems in image segmentation, but it seems like the condition itself can most
naturally be interpreted in the setting of text analysis.
Definition 2.3.11 We call A separable if, for every column i of A, there is a
row j where the only nonzero is in the ith column. Furthermore, we call j an
anchor word for column i.
In fact, separability is quite natural in the context of text analysis. Recall
that we interpret the columns of A as topics. We can think of separability as the
promise that these topics come with anchor words; informally, for each topic
there is an unknown anchor word, and if it occurs in a document, the document
is (partially) about the given topic. For example, 401k could be an anchor word
for the topic personal finance. It seems that natural language contains many
such highly specific words.
We will now give an algorithm for finding the anchor words and for
solving instances of nonnegative matrix factorization where the unknown A
is separable in polynomial time.
Theorem 2.3.12 [13] If M = AW and A is separable and W has full row rank,
then the Anchor Words Algorithm outputs A and W (up to rescaling).
Why do anchor words help? It is easy to see that if A is separable, then the
rows of W appear as rows of M (after scaling). Hence we just need to determine
which rows of M correspond to anchor words. We know from our discussion
in Section 2.3 that if we scale M, A, and W so that their rows sum to one, the
convex hull of the rows of W contains the rows of M. But since these rows
appear in M as well, we can try to find W by iteratively deleting rows of M that
do not change its convex hull.
Let M^i denote the ith row of M and let M^I denote the restriction of M to the rows in I for I ⊆ [n]. So now we can find the anchor words using the following
simple procedure:
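The procedure box is not reproduced here; the following is a sketch of the deletion procedure just described, written in Python with scipy's linear programming routine used for the convex hull membership test (the helper names and the use of scipy are illustrative assumptions, not from the text). It assumes the redundant rows discussed below, rows that are scalar multiples of each other, have already been collapsed and that the rows have been rescaled to sum to one.

import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, others):
    """Decide whether `point` is a convex combination of the rows of `others`
    by solving a linear programming feasibility problem."""
    k = others.shape[0]
    A_eq = np.vstack([others.T, np.ones((1, k))])
    b_eq = np.concatenate([point, [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * k, method="highs")
    return res.success

def find_anchor_rows(M_rows):
    """Delete rows that do not change the convex hull of the remaining rows."""
    I = list(range(M_rows.shape[0]))
    changed = True
    while changed:
        changed = False
        for i in list(I):
            rest = [j for j in I if j != i]
            if rest and in_convex_hull(M_rows[i], M_rows[rest]):
                I.remove(i)
                changed = True
    return I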
Here in the first step, we want to remove redundant rows. If two rows are scalar
multiples of each other, then one being in the cone generated by the rows of W
implies the other is too, so we can safely delete one of the two rows. We do this
for all rows, so that in the equivalence class of rows that are scalar multiples of
each other, exactly one remains. We will not focus on this technicality in our
discussion, though.
It is easy to see that deleting a row of M that is not an anchor word
will not change the convex hull of the remaining rows, and so the above
algorithm terminates with a set I that only contains anchor words. Moreover,
at termination
conv({M^i | i ∈ I}) = conv({M^j}_j).
Alternatively, the convex hull is the same as at the start. Hence the anchor
words that are deleted are redundant and we can just as well do without them.
For i = 1 to n
Sample Wi from μ
Generate L words by sampling i.i.d. from the distribution AWi
End
generates a random sample from the same distribution AWi , but each word
w1 = j is annotated with the topic from which it came, t1 = i (i.e., the column
of A we sampled it from). We can now define the topic co-occurrence matrix:
Definition 2.4.2 Let R denote the r × r matrix where R_{i,i′} = P[t1 = i, t2 = i′].
Note that we can estimate the entries of G directly from our samples, but
we cannot directly estimate the entries of R. Nevertheless, these matrices are
related according to the following identity:
Lemma 2.4.3 G = ARA^T
Proof: We have

G_{j,j′} = P[w1 = j, w2 = j′] = ∑_{i,i′} P[w1 = j, w2 = j′ | t1 = i, t2 = i′] P[t1 = i, t2 = i′]
        = ∑_{i,i′} P[w1 = j | t1 = i] P[w2 = j′ | t2 = i′] P[t1 = i, t2 = i′]
        = ∑_{i,i′} A_{j,i} A_{j′,i′} R_{i,i′}
where the last line follows because the posterior distribution on the topic t2 = i′, given that w2 is an anchor word for topic i′, is equal to one if and only if i = i′. Finally, the proof follows by invoking Claim 2.4.4.
Now we can proceed:

P[w1 = j | w2 = j′] = ∑_{i′} P[w1 = j | w2 = π(i′)] P[t2 = i′ | w2 = j′]

Hence this is a linear system in the unknowns P[w1 = j | w2 = π(i′)], and it is not hard to show that if R has full rank, then it has a unique solution.
Finally, by Bayes’s rule we can compute the entries of A:
Theorem 2.4.6 [14] There is a polynomial time algorithm to learn the topic
matrix for any separable topic model, provided that R is full rank.
Remark 2.4.7 The running time and sample complexity of this algorithm depend polynomially on m, n, r, σ_min(R), p, 1/ε, and log 1/δ, where p is a lower bound on the probability of each anchor word, ε is the target accuracy, and δ is the failure probability.
Note that this algorithm works for short documents, even for L = 2.
Experimental Results
Now we have provable algorithms for nonnegative matrix factorization and
topic modeling under separability. But are natural topic models separable or
close to being separable? Consider the following experiment:
(1) UCI Dataset: A collection of 300,000 New York Times articles
(2) MALLET: A popular topic-modeling toolkit
We trained MALLET on the UCI dataset and found that with r = 200,
about 0.9 fraction of the topics had a near anchor word – i.e., a word where
P[t = i|w = j] had a value of at least 0.9 on some topic. Indeed, the algorithms
we gave can be shown to work in the presence of some modest amount of
error – deviation from the assumption of separability. But can they work with
this much modeling error?
We then ran the following additional experiment:
(1) Run MALLET on the UCI dataset, learn a topic matrix (r = 200).
(2) Use A to generate a new set of documents synthetically from an LDA
model.
(3) Run MALLET and our algorithm on a new set of documents, and
compare their outputs to the ground truth. In particular, compute the
minimum cost matching between the columns of the estimate and the
columns of the ground truth.
It is important to remark that this is a biased experiment – biased against
our algorithm! We are comparing how well we can find the hidden topics (in a
setting where the topic matrix is only close to separable) to how well MALLET
can find its own output again. And with enough documents, we can find it
more accurately and hundreds of times faster! This new algorithm enables us
to explore much larger collections of documents than ever before.
2.5 Exercises
Problem 2-1: Which of the following are equivalent definitions of nonnegative
rank? For each, give a proof or a counterexample.
(a) The smallest r such that M can be written as the sum of r rank-one
nonnegative matrices
(b) The smallest r such that there are r nonnegative vectors v1 , v2 , . . . , vr such
that the cone generated by them contains all the columns of M
(c) The largest r such that there are r columns of M, M1 , M2 , . . . , Mr such
that no column in the set is contained in the cone generated by the
remaining r − 1 columns
Problem 2-2: Let M ∈ R^{n×n} where M_{i,j} = (i − j)^2. Prove that rank(M) = 3 and that rank_+(M) ≥ log_2 n. Hint: To prove a lower bound on rank_+(M), it
suffices to consider just where it is zero and where it is nonzero.
Problem 2-3: Papadimitriou et al. [118] considered the following document
model: M = AW and each column of W has only one nonzero and the support
of each column of A is disjoint. Prove that the left singular vectors of M
are the columns of A (after rescaling). You may assume that all the nonzero
singular values of M are distinct. Hint: MM^T is block diagonal after applying
a permutation π to its rows and columns.
Problem 2-4: Consider the following algorithm:
Set S = ∅
For i = 2 to r
Project the rows of M orthogonal to the span of vectors in S
Add the row with the largest ℓ_2 norm to S
End
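A runnable reading of this pseudocode (a sketch in numpy; the function name is made up, and the loop is taken to run r times):

import numpy as np

def greedy_anchor_rows(M, r):
    """Repeatedly project the rows of M orthogonal to the span of the rows chosen
    so far, and pick the row with the largest remaining l2 norm."""
    S = []
    R = M.astype(float).copy()
    for _ in range(r):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))
        S.append(i)
        v = R[i] / (norms[i] + 1e-12)
        R = R - np.outer(R @ v, v)     # remove the component along the new direction
    return S

On a separable instance M = AW, the rows selected this way are natural candidates for the anchor words.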
In this chapter, we will study tensors and various structural and computational
problems we can ask about them. Generally, many problems that are easy over
matrices become ill-posed or NP-hard when working over tensors instead.
Contrary to popular belief, this isn’t a reason to pack up your bags and go
home. Actually, there are things we can get out of tensors that we can’t get
out of matrices. We just have to be careful about what types of problems we
try to solve. More precisely, in this chapter we will give an algorithm with
provable guarantees for low-rank tensor decomposition – that works in natural
but restricted settings – as well as some preliminary applications of it to factor
analysis.
3.1 The Rotation Problem
arranged his data into a 1000 × 10 matrix M. He believed that how a student
performed on a given test was determined by some hidden variables that had
to do with the student and the test. Imagine that each student is described by
a two-dimensional vector where the two coordinates give numerical scores
quantifying his or her mathematical and verbal intelligence, respectively.
Similarly, imagine that each test is also described by a two-dimensional vector,
but the coordinates represent the extent to which it tests mathematical and
verbal reasoning. Spearman set out to find this set of two-dimensional vectors,
one for each student and one for each test, so that how a student performs on a
test is given by the inner product between their two respective vectors.
Let’s translate the problem into a more convenient language. What we are
looking for is a particular factorization
M = ABT
where A is size 1000 × 2 and B is size 10 × 2 that validates Spearman’s theory.
The trouble is, even if there is a factorization M = ABT where the columns of
A and the rows of B can be given some meaningful interpretation (that would
corroborate Spearman’s theory) how can we find it? There can be many other
factorizations of M that have the same inner dimension but are not the factors
we are looking for. To make this concrete, suppose that O is a 2 × 2 orthogonal
matrix. Then we can write
M = ABT = (AO)(OT BT )
and we can just as easily find the factorization M = ÂB̂T where  = AO and
B̂ = BO instead. So even if there is a meaningful factorization that would
explain our data, there is no guarantee that we find it, and in general what we
find might be an arbitrary inner rotation of it that itself is difficult to interpret.
This is called the rotation problem. This is the stumbling block that we alluded
to earlier, which we encounter if we use matrix techniques to perform factor
analysis.
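A short numerical check of this ambiguity (a sketch in numpy, with made-up dimensions matching Spearman's example):

import numpy as np

np.random.seed(2)
A = np.random.randn(1000, 2)            # "student" factors
B = np.random.randn(10, 2)              # "test" factors
M = A @ B.T

# Any 2 x 2 orthogonal matrix O gives an equally valid factorization.
theta = 0.7
O = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A_hat, B_hat = A @ O, B @ O
print(np.allclose(M, A_hat @ B_hat.T))  # True: the data cannot tell the factors apart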
What went wrong here is that low-rank matrix decompositions are not
unique. Let’s elaborate on what exactly we mean by unique in this context.
Suppose we are given a matrix M and are promised that it has some meaningful
low-rank decomposition
M = ∑_{i=1}^{r} a^{(i)} (b^{(i)})^T.
Our goal is to recover the factors a(i) and b(i) . The trouble is that we could
compute the singular value decomposition M = UV T and find another low-
rank decomposition
M = ∑_{i=1}^{r} σ_i u^{(i)} (v^{(i)})^T.
These are potentially two very different sets of factors that just happen to
recreate the same matrix. In fact, the vectors u(i) are necessarily orthonormal,
because they came from the singular value decomposition, even though there
is a priori no reason to think that the true factors a(i) that we are looking for are
orthonormal too. So now we can qualitatively answer the question we posed at
the outset. Why are we interested in tensors? It’s because they solve the rotation
problem and their decomposition is unique under much weaker conditions than
their matrix decomposition counterparts.
Ti,j,k = ui vj wk .
T =u⊗v⊗w
We can now define the rank of a tensor:
with matrices. So, what is so subtle about working with tensors? For starters,
what makes linear algebra so elegant and appealing is how something like
the rank of a matrix M admits a number of equivalent definitions. When we
defined the rank of a tensor, we were careful to say that what we were doing
was taking one of the definitions of the rank of a matrix and writing down the
natural generalization to tensors. But what if we took a different definition for
the rank of a matrix and generalized it in the natural way? Would we get the
same notion of rank for a tensor? Usually not!
Let’s try it out. Instead of defining the rank of a matrix M as the smallest
number of rank-one matrices we need to add up to get M, we could define the
rank through the dimension of its column/row space. This next claim just says
that we’d get the same notion of rank.
Claim 3.2.4 The rank of a matrix M is equal to the dimension of its
column/row space. More precisely,
where the first 2 × 2 matrix is the first slice through the tensor and the second
2 × 2 matrix is the second slice. It is not hard to show that rankR (T) ≥ 3. But it
is easy to check that
T = 1/2 [ (1, −i) ⊗ (1, i) ⊗ (1, −i) + (1, i) ⊗ (1, −i) ⊗ (1, i) ].
We will refer to the factors u(i) , v(i) , and w(i) as the hidden factors to emphasize
that we do not know them but want to find them. We should be careful here.
What do we mean by find them? There are some ambiguities that we can never
hope to resolve. We can only hope to recover the factors up to an arbitrary
reordering (of the sum) and up to certain rescalings that leave the rank-one
tensors themselves unchanged. This motivates the following definition, which
takes into account these issues:
are equivalent if there is a permutation π : [r] → [r] such that for all i
and moreover, the factors (u(i) , v(i) , w(i) ) and (û(i) , v̂(i) , ŵ(i) ) are equivalent.
The original result of Jennrich [84] was stated as a uniqueness theorem, that
under the conditions on the factors u(i) , v(i) , and w(i) above, any decomposition
of T into at most r rank-one tensors must use an equivalent set of factors. It just
so happened that the way that Jennrich proved this uniqueness theorem was
by giving an algorithm that finds the decomposition, although in the paper it
was never stated that way. Intriguingly, this seems to be a major contributor
to why the result was forgotten. Much of the subsequent literature cited a
Recall that T·,·,i denotes the ith matrix slice through T. Thus T (a) is just the
weighted sum of matrix slices through T, each weighted by ai .
The first step in the analysis is to express T (a) and T (b) in terms of the
hidden factors. Let U and V be size m × r and n × r matrices, respectively,
whose columns are u(i) and v(i) . Let D(a) and D(b) be r × r diagonal matrices
whose entries are ⟨w^{(i)}, a⟩ and ⟨w^{(i)}, b⟩, respectively. Then
Lemma 3.3.3 T^{(a)} = UD^{(a)}V^T and T^{(b)} = UD^{(b)}V^T
Proof: Since the operation of computing T (a) from T is linear, we can apply it
to each of the rank-one tensors in the low-rank decomposition of T. It is easy
to see that if we are given the rank-one tensor u ⊗ v ⊗ w, then the effect of
taking the weighted sum of matrix slices, where the ith slice is weighted by ai ,
is that we obtain the matrix ⟨w, a⟩ u ⊗ v.
Thus by linearity we have
T^{(a)} = ∑_{i=1}^{r} ⟨w^{(i)}, a⟩ u^{(i)} ⊗ v^{(i)}
which yields the first part of the lemma. The second part follows analogously
with a replaced by b.
It turns out that we can now recover the columns of U and the columns of V
through a generalized eigendecomposition. Let’s do a thought experiment. If
we are given a matrix M of the form M = UDU −1 where the entries along
the diagonal matrix D are distinct and nonzero, the columns of U will be
eigenvectors, except that they are not necessarily unit vectors. Since the entries
of D are distinct, the eigendecomposition of M is unique, and this means we
can recover the columns of U (up to rescaling) as the eigenvectors of M.
Now, if we are instead given two matrices of the form A = UD(a) V T and
B = UD(b) V T , then if the entries of D(a) (D(b) )−1 are distinct and nonzero,
we can recover the columns of U and V (again up to rescaling) through an
eigendecomposition of
AB−1 = UD(a) (D(b) )−1 U −1 and (A−1 B)T = VD(b) (D(a) )−1 V −1
respectively. It turns out that instead of actually forming the matrices above, we
could instead look for all the vectors v that satisfy Av = λv Bv, which is called
a generalized eigendecomposition. In any case, this is the main idea behind the
following lemma, although we need to take some care, since in our setting the
matrices U and V are not necessarily square, let alone invertible matrices.
Lemma 3.3.4 Almost surely, the columns of U and V are the unique eigenvec-
tors corresponding to nonzero eigenvalues of T (a) (T (b) )+ and ((T (a) )+ T (b) )T ,
respectively. Moreover, the eigenvalue corresponding to u(i) is the reciprocal
of the eigenvalue corresponding to v(i) .
Proof: We can use the formula for T (a) and T (b) in Lemma 3.3.3 to compute
The entries of D^{(a)}(D^{(b)})^+ are ⟨w^{(i)}, a⟩/⟨w^{(i)}, b⟩. Then, because every pair of
vectors in {w(i) }i is linearly independent, we have that almost surely over the
choice of a and b, the entries along the diagonal of D(a) (D(b) )+ will all be
nonzero and distinct.
Now, returning to the formula above for T (a) (T (b) )+ , we see that it is an
eigendecomposition and, moreover, that the nonzero eigenvalues are distinct.
Thus the columns of U are the unique eigenvectors of T (a) (T (b) )+ with nonzero
eigenvalue, and the eigenvalue corresponding to u(i) is w(i) , a/w(i) , b. An
identical argument shows that the columns of V are the unique eigenvectors of
α_1 ⟨v^{(1)}, a⟩ u^{(1)} = 0
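Putting the pieces of this section together, here is a compact sketch of Jennrich's algorithm in numpy (exact-arithmetic setting, no noise handling; the eigenvalue pairing and the least-squares step for the third factor are implementation choices, not prescriptions from the text):

import numpy as np

def jennrich(T, r):
    """Sketch of Jennrich's algorithm for T = sum_i u_i ⊗ v_i ⊗ w_i under the
    genericity assumptions of this section."""
    m, n, p = T.shape
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(p), rng.standard_normal(p)
    Ta = np.tensordot(T, a, axes=([2], [0]))        # weighted sum of slices T[:, :, k]
    Tb = np.tensordot(T, b, axes=([2], [0]))

    evals_u, U = np.linalg.eig(Ta @ np.linalg.pinv(Tb))
    evals_v, V = np.linalg.eig((np.linalg.pinv(Ta) @ Tb).T)

    # Keep the r eigenvalues of largest magnitude and pair them up: the eigenvalue
    # attached to u_i is the reciprocal of the one attached to v_i.
    iu = np.argsort(-np.abs(evals_u))[:r]
    iv = np.argsort(-np.abs(evals_v))[:r]
    U, V = U[:, iu], V[:, iv]
    order = [int(np.argmin(np.abs(evals_v[iv] - 1.0 / evals_u[j]))) for j in iu]
    V = V[:, order]

    # Recover the third factors by solving a linear least-squares problem.
    K = np.stack([np.outer(U[:, i], V[:, i]).ravel() for i in range(r)], axis=1)
    W = np.linalg.lstsq(K, T.reshape(m * n, p), rcond=None)[0].T
    return np.real(U), np.real(V), np.real(W)

# Quick check on a random rank-4 tensor.
m = n = p = 8
r = 4
U0, V0, W0 = (np.random.randn(d, r) for d in (m, n, p))
T = np.einsum('ir,jr,kr->ijk', U0, V0, W0)
U, V, W = jennrich(T, r)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', U, V, W)) / np.linalg.norm(T))  # ~ 0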
The term ‖b̃ − b‖/‖b‖ is often called the relative error and is a popular distance
to measure closeness in numerical linear algebra. What the discussion above
tells us is that the condition number controls the relative error when solving a
linear system.
Now let’s tie this back in to our earlier discussion. It turns out that our
perturbation bounds for eigendecompositions will also have to depend on the
condition number of U. Intuitively, this is because, given U and U −1 , finding
the eigenvalues of M is like solving a linear system that depends on U and
U −1 . This can be made more precise, but we won’t do so here.
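A small numerical illustration of this point (a sketch; constructing matrices with a prescribed condition number is just one convenient choice): the relative error of the solution grows with the condition number when the right-hand side is perturbed.

import numpy as np

np.random.seed(3)
n = 50

def matrix_with_condition(kappa):
    """Random n x n matrix whose condition number is kappa."""
    Q1, _ = np.linalg.qr(np.random.randn(n, n))
    Q2, _ = np.linalg.qr(np.random.randn(n, n))
    return Q1 @ np.diag(np.linspace(1.0, 1.0 / kappa, n)) @ Q2

for kappa in (1e1, 1e6):
    A = matrix_with_condition(kappa)
    x = np.random.randn(n)
    b = A @ x
    b_noisy = b + 1e-8 * np.linalg.norm(b) * np.random.randn(n)
    x_noisy = np.linalg.solve(A, b_noisy)
    print(kappa, np.linalg.norm(x - x_noisy) / np.linalg.norm(x))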
Thus λ ∈ D(M_{ii}, R_i), the disk of radius R_i = ∑_{j≠i} |M_{ij}| centered at M_{ii}.
Now we can return to the task of showing that M̃ is diagonalizable. The idea is straightforward and comes from digesting a single expression. Consider

U^{-1} M̃ U = U^{-1}(M + E)U = D + U^{-1}EU.

What does this expression tell us? The right-hand side is a perturbation of a diagonal matrix, so we can use Gershgorin's disk theorem to say that its eigenvalues are close to those of D. Now, because left multiplying by U^{-1} and right multiplying by U is a similarity transformation, this in turn tells us about M̃'s eigenvalues.
Let's put this plan into action and apply Gershgorin's disk theorem to understand the eigenvalues of D̃ = D + U^{-1}EU. First, we can bound the magnitude of the entries of Ẽ = U^{-1}EU as follows. Let ‖A‖_∞ denote the matrix max norm, which is the largest absolute value of any entry in A.

Lemma 3.4.3 ‖Ẽ‖_∞ ≤ κ(U)‖E‖

Proof: For any i and j, we can regard Ẽ_{i,j} as the quadratic form of the ith row of U^{-1} and the jth column of U on E. Now, the jth column of U has Euclidean norm at most σ_max(U), and similarly the ith row of U^{-1} has Euclidean norm at most σ_max(U^{-1}) = 1/σ_min(U). Together, this yields the desired bound.
Now let's prove that, under the appropriate conditions, the eigenvalues of M̃ are distinct. Let R = max_i ∑_{j≠i} |Ẽ_{i,j}| and let δ = min_{i≠j} |D_{i,i} − D_{j,j}| be the minimum separation of the eigenvalues of D.

Lemma 3.4.4 If R < δ/2, then the eigenvalues of M̃ are distinct.
Proof: First we use Gershgorin’s disk theorem to conclude that the eigenvalues
of D are contained in disjoint disks, one for each row. There’s a minor
technicality, that Gershgorin’s disk theorem works with a radius that is the sum
of the absolute values of the entries in a row, except for the diagonal entry. But
we leave it as an exercise to check that the calculation still goes through.
Actually, we are not done yet.¹ Even if Gershgorin's disk theorem implies that there are disjoint disks (one for each row) that contain the eigenvalues of D̃, how do we know that no disk contains more than one eigenvalue and that no disk contains no eigenvalues? It turns out that the eigenvalues of a matrix are a continuous function of the entries, so as we trace out a path

γ(t) = (1 − t)D + t(D̃)

from D to D̃ as t goes from zero to one, the disks in Gershgorin's disk theorem are always disjoint and no eigenvalue can jump from one disk to another. Thus, at D̃ we know that there really is exactly one eigenvalue in each disk, and since the disks are disjoint, we have that the eigenvalues of D̃ are distinct, as desired. Of course the eigenvalues of D̃ and M̃ are the same, because they are related by a similarity transformation.
¹ Thanks to Santosh Vempala for pointing out this gap in an earlier version of this book. See also [79].
Proof: The first part of the theorem follows by combining Lemma 3.4.3 and
Lemma 3.4.4. For the second part of the theorem, let’s fix i and let P be
the projection onto the orthogonal complement of ui . Then, using elementary
geometry and the fact that the eigenvectors are all unit vectors, we have
‖u_i − ũ_{π(i)}‖ ≤ 2‖P ũ_{π(i)}‖.
Lemma 3.4.5 supplies the bounds on the coefficients cj , which completes the
proof of the theorem.
You were warned early on that the bound would be messy! It is also by
no means optimized. But what you should instead take away is the qualitative
corollary that we were after: If ‖E‖ ≤ poly(1/n, σ_min(U), 1/σ_max(U), δ) (i.e., if the sampling noise is small enough compared to the dimensions of the matrix, the condition number of U, and the minimum separation), then U and Ũ are
close.
3.5 Exercises
Problem 3-1:
(a) Suppose we want to solve the linear system Ax = b (where A ∈ Rn×n is
square and invertible) but we are only given access to a noisy vector b̃
satisfying ‖b − b̃‖/‖b‖ ≤ ε
and a noisy matrix Ã satisfying ‖A − Ã‖ ≤ δ (in operator norm). Let x̃ be
the solution to Ãx̃ = b̃. Show that
Figure 3.1: Three shifted copies of the true signal x are shown in gray. Noisy samples y_i are shown in red. (Figure credit: [23].)
Many exciting problems fit into the following paradigm: First, we choose some
parametric family of distributions that are rich enough to model things like
evolution, writing, and the formation of social networks. Second, we design
algorithms for learning the unknown parameters — which you should think
of as a proxy for finding hidden structure in our data, like a tree of life
that explains how species evolved from each other, the topics that underlie a
collection of documents, or the communities of strongly connected individuals
in a social network. In this chapter, all of our algorithms will be based on
tensor decomposition. We will construct a tensor from the moments of our
distribution and apply Jennrich’s algorithm to find the hidden factors, which in
turn will reveal the unknown parameters of our model.
4.1 Phylogenetic Trees and HMMs
(a) A rooted binary tree with root r (the leaves do not necessarily have the
same depth).
(b) A set of states Σ, for example Σ = {A, C, G, T}. Let k = |Σ|.
(c) A Markov model on the tree; i.e., a distribution πr on the state of the root
and a transition matrix Puv for each edge (u, v).
We can generate a sample from the model as follows: We choose a state for the
root according to πr , and for each node v with parent u we choose the state of
v according to the distribution defined by the ith row of Puv , where i is the state
of u. Alternatively, we can think of s(·) : V → Σ as a random function that assigns states to vertices, where the marginal distribution on s(r) is π_r.
Reconstructing Quartets
Now we will use Steel’s evolutionary distance to compute the topology by
piecing together the picture four nodes at a time.
Our goal is to determine which of these induced topologies is the true
topology, given the pairwise distances.
Lemma 4.1.3 If all distances in the tree are strictly positive, then it is possible
to determine the induced topology on any four nodes a, b, c, and d given an
oracle that can compute the distance between any pair of them.
Proof: The proof is by case analysis. Consider the three possible induced
topologies between the nodes a, b, c, and d, as depicted in Figure 4.1. Here
by induced topology, we mean delete edges not on any shortest path between
any pair of the four leaves and contract paths to a single edge if possible.
It is easy to check that under topology (a) we have
ψ(a, b) + ψ(c, d) < min {ψ(a, c) + ψ(b, d), ψ(a, d) + ψ(b, c)}.
But under topology (b) or (c) this inequality would not hold. There is an
analogous way to identify each of the other topologies, again based on the
Figure 4.1: The three possible induced topologies on the leaves a, b, c, and d: (a) pairs {a, b} and {c, d}, (b) pairs {a, c} and {b, d}, (c) pairs {a, d} and {b, c}.
pairwise distances. What this means is that we can simply compute three
values: ψ(a, b) + ψ(c, d), ψ(a, c) + ψ(b, d), and ψ(a, d) + ψ(b, c). Whichever
is the smallest determines the induced topology as being (a), (b), or (c),
respectively.
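The quartet test in this proof is easy to state as code (a sketch; psi is assumed to be an oracle for Steel's evolutionary distance, and the returned pairing labels are illustrative):

def quartet_topology(psi, a, b, c, d):
    """Return the pairing with the smallest sum of distances, which identifies the induced topology."""
    sums = {
        (('a', 'b'), ('c', 'd')): psi(a, b) + psi(c, d),
        (('a', 'c'), ('b', 'd')): psi(a, c) + psi(b, d),
        (('a', 'd'), ('b', 'c')): psi(a, d) + psi(b, c),
    }
    return min(sums, key=sums.get)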
Indeed, from just these quartet tests, we can recover the topology of the tree.
Lemma 4.1.4 If for any quadruple of leaves a, b, c, and d we can determine
the induced topology, it is possible to determine the topology of the tree.
Proof: The approach is to first determine which pairs of leaves have the same
parent, and then determine which pairs have the same grandparent, and so on.
First, fix a pair of leaves a and b. It is easy to see that they have the same parent
if and only if for every other choice of leaves c and d, the quartet test returns
topology (a). Now, if we want to determine whether a pair of leaves a and b
have the same grandparent, we can modify the approach as follows: They have
the same grandparent if and only if for every other choice of leaves c and d,
neither of which is a sibling of a or b, the quartet test returns topology (a).
Essentially, we are building up the tree by finding the closest pairs first.
An important technical point is that we can only approximate F ab from our
samples. This translates into a good approximation of ψab when a and b are
close, but is noisy when a and b are far away. Ultimately, the approach in [69]
of Erdos, Steel, Szekely, and Warnow is to use quartet tests only where all the
distances are short.
These are third-order moments of our distribution that we can estimate from
samples. We will assume throughout this section that the transition matrices
are full rank. This means that we can reroot the tree arbitrarily. Now consider
the unique node that lies on all of the shortest paths among a, b, and c. Let’s
let this be the root. Then
T^{abc} = ∑_ℓ P(s(r) = ℓ) P(s(a) = · | s(r) = ℓ) ⊗ P(s(b) = · | s(r) = ℓ) ⊗ P(s(c) = · | s(r) = ℓ)
        = ∑_ℓ P(s(r) = ℓ) P^{ra}_ℓ ⊗ P^{rb}_ℓ ⊗ P^{rc}_ℓ
and if we can find Puw and Pvw (using the star tests above), then we can
compute Puv = Puw (Pvw )−1 since we have assumed that the transition matrices
are invertible.
However, there are two serious complications:
(a) As in the case of finding the topology, long paths are very noisy.
Mossel and Roch showed that one can recover the transition matrices also
using only queries to short paths.
(b) In the above star test, we could apply any permutation to the states of r and permute the rows of the transition matrices P^{ra}, P^{rb}, and P^{rc} accordingly, so that the resulting joint distribution on a, b, and c is unchanged.
However, the approach of Mossel and Roch is to work instead in the
probably approximately correct learning framework of Valiant [138], where
the goal is to learn a generative model that produces almost the same joint
distribution on the leaves. In particular, if there are multiple ways to label the
internal nodes to produce the same joint distribution on the leaves, we are
indifferent to them.
Remark 4.1.5 Hidden Markov models are a special case of phylogenetic trees,
where the underlying topology is a caterpillar. But note that for the above
algorithm, we need that the transition matrices and the observation matrices
are full rank.
More precisely, we require that the transition matrices are invertible and that the observation matrices (with the convention that the rows are indexed by the states of the corresponding hidden node and the columns are indexed by the output symbols) each have full row rank.
Given m samples (X^{(j)}, b^{(j)}), for a candidate set T we can compute

(1/m) ∑_{j=1}^{m} 1_{χ_T(X^{(j)}) = b^{(j)}}.
From standard concentration bounds, it follows that with high probability this
value is larger than (say) 3/5 if and only if S = T.
The best-known algorithm due to Blum, Kalai, and Wasserman [37] has
running time and sample complexity 2^{n/log n}. It is widely believed that there
is no polynomial time algorithm for noisy parity even given any polynomial
number of samples. This is an excellent example of a problem whose sample
complexity and computational complexity are (conjectured to be) wildly
different.
Next we show how to embed samples from a noisy parity problem into an
HMM; however, to do so, we will make use of transition matrices that are not
full rank. Consider an HMM that has n hidden nodes, where the ith hidden node
is used to represent the ith coordinate of X and the running parity

χ_{S_i}(X) := ∑_{i′ ≤ i, i′ ∈ S} X(i′) mod 2.
Hence each node has four possible states. We can define the following
transition matrices. Let s(i) = (xi , si ) be the state of the ith internal node, where
si = χSi (X).
We can define the following transition matrices. If i + 1 ∈ S, then from state (x_i, s_i) the next state under P^{i,i+1} is

(0, s_i) with probability 1/2, (1, s_i + 1 mod 2) with probability 1/2, and anything else with probability 0,

and if i + 1 ∉ S, then from state (x_i, s_i) the next state under P^{i,i+1} is

(0, s_i) with probability 1/2, (1, s_i) with probability 1/2, and anything else with probability 0.
At each internal node we observe xi , and at the last node we also observe
χS (X) with probability 2/3 and otherwise 1 − χS (X). Each sample from the
noisy parity problem is a set of observations from this HMM, and if we could
learn its transition matrices, we would necessarily learn S and solve the noisy
parity problem.
Note that here the observation matrices are certainly not full rank, because
we only observe two possible emissions even though each internal node has
four possible states! Hence these problems become much harder when the
transition (or observation) matrices are not full rank!
4.2 Community Detection
• π : V → [k] partitions the vertices V into k disjoint groups (we will relax
this condition later).
• Each possible edge (u, v) is chosen independently, with P[(u, v) ∈ E] = q if π(u) = π(v) and P[(u, v) ∈ E] = p otherwise.
In our setting we will set q > p, which is called the assortative case, but this
model also makes sense when q < p, which is called the disassortative case.
For example, when q = 0, we are generating a random graph that has a planted
k-coloring. Regardless, we observe a random graph generated from the above
model and our goal is to recover the partition described by π .
When is this information theoretically possible? In fact, even for k = 2,
where π is a bisection, we need
q − p > Ω(√(log n / n))
in order for the true bisection to be the uniquely smallest cut that bisects the
random graph G with high probability. If q − p is smaller, then it is not even
information theoretically possible to find π . Indeed, we should also require
that each part of the partition is large, and for simplicity we will assume that
k = O(1) and |{u | π(u) = i}| = Ω(n).
There has been a long line of work on partitioning random graphs in the
stochastic block model, culminating in the work of McSherry [109]:
Theorem 4.2.1 [109] There is an efficient algorithm that recovers π (up to
relabeling) if
(q − p)/q > c √( log(n/δ) / (qn) ).
Let Π denote the n × k matrix with Π_{u,i} = 1 if and only if π(u) = i, so that each row of Π contains exactly one 1. Finally, let R be the k × k matrix whose entries are the connection probabilities. In particular, R_{i,j} = q if i = j and R_{i,j} = p if i ≠ j.
Consider the product ΠR. The ith column of ΠR encodes the probability that there is an edge from a node in community i to the node corresponding to the given row:

(ΠR)_{x,i} = Pr[(x, a) ∈ E | π(a) = i].

We will use (ΠR)_{A,i} to denote the matrix ΠR restricted to the ith column and the rows in A, and similarly for B and C. Moreover, let p_i be the fraction of nodes in X that are in community i. Then our algorithm revolves around the following tensor:

T = ∑_i p_i (ΠR)_{A,i} ⊗ (ΠR)_{B,i} ⊗ (ΠR)_{C,i}.
Actually, we need the factors to be not just full rank, but also well conditioned. The same type of argument as in the previous lemma shows that as long as each community is well represented in A, B, and C (which happens with high probability if A, B, and C are large enough and chosen at random), then the factors {(ΠR)_{A,i}}_i, {(ΠR)_{B,i}}_i, and {(ΠR)_{C,i}}_i will be well conditioned.
Now let's recover the community structure from the hidden factors: First, if we have {(ΠR)_{A,i}}_i, then we can partition A into communities just by grouping
together nodes whose corresponding rows are the same. In turn, if A is large
enough, then we can extend this partitioning to the whole graph: We add a node
x∈ / A to community i if and only if the fraction of nodes a ∈ A with π(a) = i
that x is connected to is close to q. If A is large enough and we have recovered
its community structure correctly, then with high probability this procedure
will recover the true communities in the entire graph.
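A sketch of this two-step recovery in numpy (all names, the tolerance, and the adjacency-matrix representation are illustrative assumptions, not from the text):

import numpy as np

def communities_from_factor(F, graph, A_nodes, tol=1e-6):
    """F is an estimate of the factor restricted to the rows in A (one column per community);
    graph is an adjacency matrix."""
    # Step 1: group the nodes of A whose rows of F are (approximately) identical.
    labels, reps = {}, []
    for idx, u in enumerate(A_nodes):
        for c, rep in enumerate(reps):
            if np.linalg.norm(F[idx] - rep) < tol:
                labels[u] = c
                break
        else:
            labels[u] = len(reps)
            reps.append(F[idx])

    # Step 2: assign every remaining node to the community it connects to most often
    # (the edge fraction closest to q in the assortative case).
    members = [[u for u in A_nodes if labels[u] == c] for c in range(len(reps))]
    for x in range(graph.shape[0]):
        if x not in labels:
            fracs = [graph[x, members[c]].mean() for c in range(len(reps))]
            labels[x] = int(np.argmax(fracs))
    return labels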
For a full analysis of the algorithm, including its sample complexity
and accuracy, see [9]. Anandkumar et al. also give an algorithm for mixed
membership models, where each πu is chosen from a Dirichlet distribution.
We will not cover this latter extension, because we will instead explain those
types of techniques in the setting of topic models next.
Discussion
We note that there are powerful extensions to the stochastic block model
that are called semirandom models. Roughly, these models allow a monotone
adversary to add edges between nodes in the same cluster and delete edges
between clusters after G is generated. It sounds like the adversary is only
making your life easier by strengthening ties within a community and breaking
ties across them. If the true community structure is the partition of G into
k parts that cuts the fewest edges, then this is only more true after the
changes. Interestingly, many tensor and spectral algorithms break down in the
semirandom model, but there are elegant techniques for recovering π even
in this more general setting (see [71], [72]). This is some food for thought
and begs the question: How much are we exploiting brittle properties of our
stochastic model?
In Section 2.4 we constructed the Gram matrix, which represents the joint
distribution of pairs of words. Here we will use the joint distribution of triples
of words. Let w1 , w2 , and w3 denote the random variables for its first, second,
and third words, respectively.
Definition 4.3.1 Let T denote the m × m × m tensor where
T = ∑_{i=1}^{r} p_i A_i ⊗ A_i ⊗ A_i
So how can we recover the topic matrix given samples from a pure topic model? We can construct an estimate T̂ where T̂_{a,b,c} counts the fraction of documents in our sample whose first, second, and third words are a, b, and c, respectively. If the number of documents is large enough, then T̂ converges to T.
Now we can apply Jennrich’s algorithm. Provided that A has full column
rank, we will recover the true factors in the decomposition up to a rescaling.
However, since each column in A is a distribution, we can properly normalize
whatever hidden factors we find and compute the values of pi too. To really
make this work, we need to analyze how many documents we need in order for T̂ to be close to T, and then apply the results in Section 3.4, where we
analyzed the noise tolerance of Jennrich’s algorithm. The important point is
that the columns of our estimated A converge to columns of A at an inverse
polynomial rate with the number of samples we are given, where the rate of
convergence depends on things like how well conditioned the columns of A are.
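A sketch of forming the empirical tensor T̂ from a sample of documents (the document representation is an assumption of this illustration):

import numpy as np

def empirical_first_three_words_tensor(docs, m):
    """Estimate T_{a,b,c} as the fraction of documents whose first three words are a, b, c.
    `docs` is assumed to be a list of documents, each a list of word ids in {0, ..., m-1}."""
    T_hat = np.zeros((m, m, m))
    for doc in docs:
        a, b, c = doc[0], doc[1], doc[2]
        T_hat[a, b, c] += 1.0
    return T_hat / len(docs)

# With enough documents, T_hat converges to T = sum_i p_i A_i ⊗ A_i ⊗ A_i, and one can run
# a (noise-tolerant) tensor decomposition, such as the Jennrich sketch above, on T_hat.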
This model is already more realistic in the following way. When documents
are long (say Nj > m log m), then in a pure topic model, pairs of documents
would necessarily have nearly identical empirical distributions on words. But
this is no longer the case in mixed models like the one above.
The basic issue in extending our tensor decomposition approach for learning
pure topic models to mixed models is that the third-order tensor that counts the
joint distribution of triples of words now satisfies the following expression:
T = ∑_{i,j,k} D_{i,j,k} A_i ⊗ A_j ⊗ A_k
where Di,j,k is the probability that the first three words in a random document
are generated from topics i, j, and k, respectively. In a pure topic model, Di,j,k
is diagonal, but for a mixed model it is not!
Other Tensors
We described the tensor T based on the following experiment: Let Ta,b,c be
the probability that the first three words in a random document are a, b, and
c, respectively. But we could just as well consider alternative experiments.
The two other experiments we will need in order to give a tensor spectral
algorithm for LDA are:
(a) Choose three documents at random, and look at the first word of each
document.
(b) Choose two documents at random, and look at the first two words of the
first document and the first word of the second document.
These two new experiments combined with the old experiment result in three
tensors whose Tucker decompositions use the same factors but whose core
tensors differ.
Definition 4.3.4 Let μ, M, and D be the first-, second-, and third-order
moments of the Dirichlet distribution.
More precisely, let μi be the probability that the first word in a random
document is generated from topic i. Let Mi,j be the probability that the first
and second words in a random document are generated from topics i and j,
respectively. And as before, let Di,j,k be the probability that the first three words
in a random document are generated from topics i, j, and k, respectively. Then
let T 1 , T 2 , and T 3 be the expectations of the first (choose three documents),
second (choose two documents), and third (choose one document) experiment,
respectively.
Lemma 4.3.5 (a) T 1 = i,j,k [μ ⊗ μ ⊗ μ]i,j,k Ai ⊗ Aj ⊗ Ak
(b) T 2 = i,j,k [M ⊗ μ]i,j,k Ai ⊗ Aj ⊗ Ak
(c) T 3 = i,j,k Di,j,k Ai ⊗ Aj ⊗ Ak
Proof: Let w1 denote the first word and let t1 denote the topic of w1 (and
similarly for the other words). We can expand P[w1 = a, w2 = b, w3 = c] as:
P[w1 = a, w2 = b, w3 = c|t1 = i, t2 = j, t3 = k]P[t1 = i, t2 = j, t3 = k]
i,j,k
The important point is that we can estimate the terms on the left-hand side from
our sample (if we assume we know α0 ) and we can apply Jennrich’s algorithm
to the tensor on the right-hand side to recover the topic model, provided that A
has full column rank. In fact, we can compute α0 from our samples (see [8]),
but we will focus instead on proving the above identity.
For example, for Ti,i,k this is the probability that the first two balls are color i
and the third ball is color k. The probability that the first ball is color i is αα0i ,
and since we place it back with one more of its own color, the probability that
the second ball is also color i is αα0i +1
+1 . And the probability that the third ball is
color k is α0α+2
k
. It is easy to check the above formulas in the other cases too.
Note that it is much easier to think about only the numerators in the above
formulas. If we can prove that following relation for just the numerators
D + 2μ⊗3 − M ⊗ μ(all three ways) = diag({2αi }i )
it is easy to check that we would obtain our desired formula by multiplying
through by α03 (α0 + 1)(α0 + 2).
Definition 4.3.6 Let R = num(D)+num(2μ⊗3 )−num(M⊗μ)(all three ways).
Then the main lemma is:
64 4 Tensor Decompositions
Epilogue
The algorithm of Anandkumar et al. [9] for learning mixed membership
stochastic block models follows the same pattern. Once again, the Dirichlet
distribution plays a key role. Instead of each node belonging to just one
community, as in the usual stochastic block model, each node is described
4.4 Independent Component Analysis 65
Such problems are also often referred to as blind source separation. We will
follow an approach of Frieze, Jerrum, and Kannan [74]. What’s really neat
about their approach is that it uses nonconvex optimization.
66 4 Tensor Decompositions
y = Ax + b
but where for all i, E[xi ] = 0, E[xi2 ] = 1. The idea is that if any variable xi
were not mean zero, we could make it mean zero and add a correction to b.
And similarly, if xi were not variance one, we could rescale both it and the
corresponding column of A to make its variance be one. These changes are
just notational and do not affect the distribution on samples that we observe.
So from here on out, let’s assume that we are given samples in the above
canonical form.
We will give an algorithm based on nonconvex optimization for estimating
A and b. But first let’s discuss what assumptions we will need. We will
make two assumptions: (a) A is nonsingular and (b) every variable satisfies
E[xi4 ] = 3. You should be used to nonsingularity assumptions by now (it’s
what we need every time we use Jennrich’s algorithm). But what about the
second assumption? Where does it come from? It turns out that it is actually
quite natural and is needed to rule out a problematic case.
Claim 4.4.1 If each xi is an independent standard Gaussian, then for any
orthogonal transformation R, x and Rx, and consequently
y = Ax + b and y = ARx + b
Whitening
As usual, we cannot hope to learn A from just the second moments. This is
really the same issue that arose when we discussed the rotation problem. In the
case of tensor decompositions, we went directly to the third-order moment to
learn the columns of A through Jennrich’s algorithm. Here we will learn what
we can from the first and second moments, and then move on to the fourth
moment. In particular, we will use the first and second moments to learn b and
to learn A up to a rotation:
Lemma 4.4.2 E[y] = b and E[yyT ] = AAT
Proof: The first identity is obvious. For the second, we can compute
where the last equality follows from the condition that E[xi ] = 0 and E[xi2 ] = 1
and that each xi is independent.
What this means is that we can estimate b and M = AAT to arbitrary
precision by taking enough samples. What I claim is that this determines A
up to a rotation. Since M 0, we can find B such that M = BBT using the
Cholesky factorization. But how are B and A related?
Lemma 4.4.3 There is an orthogonal transformation R so that BR = A.
Proof: Recall that we assumed A is nonsingular, hence M = AAT and B are
also nonsingular. So we can write
z = B−1 (y − b) = B−1 Ax = Rx
This is called whitening (think white noise), because it makes the first moments
of our distribution zero and the second moments all one (in every direction).
The key to our analysis is the following functional:
We will want to minimize it over the unit sphere. As u ranges over the unit
sphere, so does vT = uT R. Hence our optimization problem is equivalent to
minimizing
From this expression, it is easy to check that the local minima of H(v)
correspond exactly to setting v = ±ei for some i.
Recall that vT = uT R, and so this characterization implies that the local
minima of F(u) correspond to setting u to be a column of ±R. The algorithm
proceeds by using gradient descent (and a lower bound on the Hessian) to show
that you can find local minima of F(u) quickly. The intuition is that if you keep
following steep gradients, you decrease the objective value. Eventually, you
must get stuck at a point where the gradients are small, which is an approximate
local minimum. Any such u must be close to some column of ±R, and we can
then recurse on the orthogonal complement to the vector we have found to find
the other columns of R. This idea requires some care to show that the errors
do not accumulate too badly; see [74], [140], [17]. Note that when E[xi4 ] = 3
instead of the stronger assumption that E[xi4 ] < 3, we can follow the same
approach, but we need to consider local minima and local maxima of F(u).
Also, Vempala and Xiao [140] gave an algorithm that works under weaker
conditions whenever there is a constant order moment that is different from
that of the standard Gaussian.
4.5 Exercises 69
4.5 Exercises
Problem 4-1: Let u v denote the Khatri-Rao product between two vectors,
where if u ∈ Rm and v ∈ Rn , then u v ∈ Rmn and corresponds to flattening
the matrix uvT into a vector, column by column. Also recall that the Kruskal
rank k-rank of a collection of vectors u1 , u2 , . . . , um ∈ Rn is the largest k such
that every set of k vectors is linearly independent.
In this problem, we will explore properties of the Khatri-Rao product and
use it to design algorithms for decomposing higher-order tensors.
In fact, for random or perturbed vectors, the Khatri-Rao product has a much
stronger effect of multiplying their Kruskal rank. These types of properties
can be used to obtain algorithms for decomposing higher-order tensors in the
highly overcomplete case where r is some polynomial in n.
Problem 4-2: In Section 4.4 we saw how to solve independent component
analysis using nonconvex optimization. In this problem we will see how
to solve it using tensor decomposition instead. Suppose we observe many
samples of the form y = Ax, where A is an unknown nonsingular square
matrix and each coordinate of x is independent and satisfies E[xj ] = 0 and
70 4 Tensor Decompositions
E[xj4 ] = 3 E[xj2 ]2 . The distribution of xj is unknown and might not be the same
for all j.
+ ,⊗2
(a) Write down expressions for E[y⊗4 ] and E[y⊗2 ] in terms of A and the
moments of x. (You should not have any A’s inside the expectation.)
(b) Using part (a), show how to use the moments of y to produce a tensor of
the form j cj a⊗4 j , where aj denotes column j of A and the cj are nonzero
scalars.
(c) Show how to recover the columns of A (up to permutation and scalar
multiple) using Jennrich’s algorithm.
5
Sparse Recovery
In this chapter, we will witness the power of sparsity for the first time.
Let’s get a sense of what it’s good for. Consider the problem of solving an
underdetermined linear system Ax = b. If we are given A and b, there’s no
chance to recover x uniquely, right? Well, not if we know that x is sparse.
In that case, there are natural conditions on A where we actually will be able to
recover x even though the number of rows of A is comparable to the sparsity of
x rather than its dimension. Here we will cover the theory of sparse recovery.
And in case you’re curious, it’s an area that not only has some theoretical gems,
but also has had major practical impact.
5.1 Introduction
In signal processing (particularly imaging), we are often faced with the task
of recovering some unknown signal given linear measurements of it. Let’s fix
our notation. Throughout this chapter, we will be interested in solving a linear
system Ax = b where A is an m×n matrix and x and b are n and m dimensional
vectors, respectively. In our setup, both A and b are known. You can think of A
as representing the input-output functionality of some measurement device we
are using.
Now, if m < n, then we cannot hope to recover x uniquely. At best we could
find some solution y that satisfies Ay = b and we would have the promise that
x = y + z, where z belongs to the kernel of A. This tells us that if we want
to recover an n-dimensional signal, we need at least n linear measurements.
This is quite natural. Sometimes you will hear this referred to as the Shannon-
Nyquist rate, although I find that a rather opaque way to describe what is going
on. The amazing idea that will save us is that if x is sparse — i.e., b is a linear
combination of only a few columns of A — then we really will be able to get
71
72 5 Sparse Recovery
away with many fewer linear measurements and still be able to reconstruct
x exactly.
What I want to do in this section is explain why you actually should not
be surprised by it. If you ignore algorithms (which we won’t do later on), it’s
actually quite simple. It turns out that assuming that x is sparse isn’t enough
by itself. We will always have to make some structural assumption about A as
well. Let’s consider the following notion:
Definition 5.1.1 The Kruskal rank of a set of vectors {Ai }i is the maximum r
such that all subsets of at most r vectors are linearly independent.
If you are given a collection of n vectors in n dimensions, they can all be
linearly independent, in which case their Kruskal rank is n. But if you have
n vectors in m dimensions — like when we take the columns of our sensing
matrix A – and m is smaller than n, the vectors can’t be all linearly independent,
but they can still have Kruskal rank m. In fact, this is the common case:
Claim 5.1.2 If A1 , A2 , . . . , An are chosen uniformly at random from Sm−1 , then
almost surely their Kruskal rank is m.
Now let’s prove our first main result about sparse recovery. Let x0 be the
number of nonzero entries of x. We will be interested in the following highly
non-convex optimization problem:
Let’s show that if we could solve (P0 ), we could find x from much fewer than
n linear measurements:
Lemma 5.1.3 Let A be an m × n matrix whose columns have Kruskal rank at
least r. Let x be an r/2-sparse vector and let Ax = b. Then the unique optimal
solution to (P0 ) is x.
Proof: We know that x is a solution to Ax = b that has objective value
x0 = r/2. Now suppose there were any other solution y that satisfies Ay = b.
Consider the difference between these solutions, i.e., z = x − y. We
know that z is in the kernel of A. However, z0 ≥ r + 1, because by
assumption every set of at most r columns of A is linearly independent. Finally,
we have
y0 ≥ z0 − x0 ≥ r/2 + 1
which implies that y has larger objective value than x. This completes the
proof.
5.1 Introduction 73
The difference between this and the standard moment curve is in the last term,
where we have αim instead of αim−1 .
Lemma 5.1.4 A set I with |I| = m has i∈I αi = 0 if and only if the vectors
{ (αi )}i∈I are linearly dependent.
Proof: Consider the determinant of the matrix whose columns are { (αi )}i∈I .
Then the proof is based on the following observations:
(a) The
+m, determinant is a polynomial in the variables αi with total degree
2 + 1, which can be seen by writing the determinant in terms of its
Laplace expansion (see, e.g., [88]).
-
(b) Moreover, the determinant is divisible by i<j αi − αj , since the
determinant is zero if any αi = αj .
Hence we can write the determinant as
&" '& '
(αi − αj ) αi .
i<j i∈I
i,j∈I
We have assumed that the αi ’s are distinct, and consequently the determinant
is zero if and only if the sum of αi = 0.
We can now prove a double whammy. Not only is solving (P0 ) NP-hard,
but so is computing the Kruskal rank:
Theorem 5.1.5 Both computing the Kruskal rank and finding the sparsest
solution to a system of linear equations are NP-hard.
74 5 Sparse Recovery
Proof: First let’s prove that computing the Kruskal rank is NP-hard. Consider
the vectors { (αi )}i . It follows from Lemma 5.1.4 that if there is a set I with
|I| = m that satisfies i∈I αi = 0, then the Kruskal rank of { (αi )}i is at
most m − 1, and otherwise is exactly m. Since subset sum is NP-hard, so too is
deciding whether the Kruskal rank is m or at most m − 1.
Now let’s move on to showing that finding the sparsest solution to a linear
system is NP-hard. We will use a one-to-many reduction. For each j, consider
the following optimization problem:
) *
(Pj ) min w0 s.t. (α1 ), . . . , (αj−1 ), (αj+1 ), . . . , (αn ) w = (αj )
It is easy to see that the Kruskal rank of { (αi )}i is at most m − 1 if and only
if there is some j so that (Pj ) has a solution whose objective value is at most
m − 2. Thus (P0 ) is also NP-hard.
In the rest of this chapter, we will focus on algorithms. We will give simple
greedy methods as well as ones based on convex programming relaxations.
These algorithms will work under more stringent assumptions on the sensing
matrix A than just that its columns have large Kruskal rank. Nevertheless, all of
the assumptions we make will still be met by a randomly chosen A, as well as
many others. The algorithms we give will even come with stronger guarantees
that are meaningful in the presence of noise.
zT AT Az = 0.
complex plane centered at one with radius μ|S| < 1. Thus B is nonsingular
and we have a contradiction.
Actually, we can prove a stronger uniqueness result when A is the union
of two orthonormal bases, as is the case for the spikes-and-sines matrix. Let’s
first prove the following result, which we will mysteriously call an uncertainty
principle:
Lemma 5.2.4 Let A = [U, V] be an n×2n matrix that is μ-incoherent where U
and V are n×n orthogonal matrices. If b = Uα = Vβ, then α0 +β0 ≥ μ2 .
Proof: Since U and V are orthonormal, we have that b2 = α2 = β2 .
We can rewrite b as either Uα or Vβ, and hence b22 = |β T (V T U)α|. Because
A is incoherent, we can conclude that each entry of V T U has absolute value at
most μ(A), and so |β T (V T U)α| ≤ μ(A)α1 β1 . Using Cauchy-Schwarz, it
√
follows that α1 ≤ α0 α2 and thus
.
b22 ≤ μ(A) α0 β0 α2 β2 .
√
Rearranging, we have μ(A) 1
≤ α0 β0 . Finally, applying the AM-GM
inequality, we get 2
μ ≤ α0 + β0 and this completes the proof.
This proof was short and simple. Perhaps the only confusing part is why
we called it an uncertainty principle. Let’s give an application of Lemma 5.2.4
to clarify this point. If we set A to be the spikes-and-sines matrix, we get that
√
any non-zero signal must have at least n nonzeros in the standard basis or in
the Fourier basis. What this means is that no signal can be sparse in both the
time and frequency domains simultaneously! It’s worth taking a step back. If
we had just proven this result, you would have naturally associated it with the
Heisenberg uncertainty principle. But it turns out that what’s really driving it
is just the incoherence of the time and frequency bases for our signal, and it
applies equally well to many other pairs of bases.
Let’s use our uncertainty principle to prove an even stronger uniqueness
result:
Claim 5.2.5 Let A = [U, V] be an n × 2n matrix that is μ-incoherent where
U and V are n × n orthogonal matrices. If Ax = b and x0 < μ1 , then x is the
uniquely sparsest solution to the linear system.
Proof: Consider any alternative solution A x = b. Set y = x − x, in which
case y ∈ ker(A). Write y as y = [αy , βy ] , and since Ay = 0, we have that
T
Uαy = −Vβy . We can now apply the uncertainty principle and conclude that
y0 = αy 0 +βy 0 ≥ μ2 . It is easy to see that
x0 ≥ y0 −x0 > μ1 , and
so
x has strictly more nonzeros than x does, and this completes the proof.
5.3 Pursuit Algorithms 77
Initialize: x0 = 0, r0 = Ax0 − b, S = ∅
For = 1, 2, . . . , k
|A ,r−1 |
Choose column j that maximizes j 2 .
Aj 2
Add j to S.
Set r = projU ⊥ (b), where U = span(AS ).
If r = 0, break.
End
Solve for xS : AS xS = b. Set xS̄ = 0.
This is what makes matching pursuit faster in practice; however, the analysis
is more cumbersome because we need to keep track of how the error (due to
not projecting b on the orthogonal complement of the columns we’ve chosen
so far) accumulates.
As we did before, we will simplify the notation and write ω = ei2π/n for the nth
root of unity. With this notation, the entry in row a, column b is ω(a−1)(b−1) .
The matrix F has a number of important properties, including:
= 1 + λ1 z + . . . + λ k z k
Claim 5.4.5 If we know p(z), we can find supp(x).
Proof: In fact, an index b is in the support of x if and only if p(ωb ) = 0. So
we can evaluate p at powers of ω, and the exponents where p evaluates to a
nonzero are exactly the support of x.
82 5 Sparse Recovery
The basic idea of Prony’s method is to use the first 2k values of the discrete
Fourier transform to find p, and hence the support of x. We can then solve a
linear system to actually find the values of x. Our first goal is to find the helper
polynomial. Let
Geometric Properties of
Here we will establish some important geometric properties of C-almost
Euclidean subsections. Throughout this section, let S = n/C2 . First we show
that cannot contain any sparse, nonzero vectors:
Claim 5.5.9 Let v ∈ , then either v = 0 or |supp(v)| ≥ S.
86 5 Sparse Recovery
& m '
k ≤ S/16 = (n/C2 ) =
log n/m
from m linear measurements.
Next we will consider stable recovery. Our main theorem is:
Theorem 5.5.13 Let = ker(A) be a C-almost Euclidean subsection. Let
S = Cn2 . If Ax = Aw = b and w1 ≤ x1 , we have
x − w1 ≤ 4 σ S (x) .
16
1
x − w1 ≤ x − w1 + 2σ S (x) .
2 16
Epilogue
Finally, we will end with one of the main open questions in compressed
sensing, which is to give a deterministic construction of matrices that satisfy
the restricted isometry property:
Question 7 (Open) Is there a deterministic algorithm to construct a matrix
with the restricted isometry property? Alternatively, is there a deterministic
algorithm to construct an almost Euclidean subsection ?
Avi Wigderson likes to refer to these types of problems as “finding hay in a
haystack.” We know that a randomly chosen A satisfies the restricted isometry
property with high probability. Its kernel is also an almost Euclidean subspace
with high probability. But can we remove the randomness? The best known
deterministic construction is due to Guruswami, Lee, and Razborov [82]:
88 5 Sparse Recovery
5.6 Exercises
Problem 5-1: In this question, we will explore uniqueness conditions for
sparse recovery and conditions under which 1 -minimization provably works.
(a) Let A1x = b, and suppose A has n columns. Further suppose 2k ≤ m. Prove
that for every 1 x0 ≤ k, 1
x with 1 x is the uniquely sparsest solution to the
linear system if and only if the k-rank of the columns of A is at least 2k.
(b) Let U = kernel(A), and U ⊂ Rn . Suppose that for each nonzero x ∈ U,
and for any set S ⊂ [n] with |S| ≤ k,
1
xS 1 < x1
2
where xS denotes the restriction of x to the coordinates in S. Prove that
(P1) min x1 s.t. Ax = b
recovers x = 1 x = b and 1
x, provided that A1 x0 ≤ k.
(c) Challenge: Can you construct a subspace U ⊂ Rn of dimension (n)
that has the property that every nonzero x ∈ U has at least (n) nonzero
coordinates? Hint: Use an expander.
Problem 5-2: Let1 x be a k-sparse vector in n-dimensions. Let ω be the nth root
of unity. Suppose we are given v = nj=1 1 xj ωj for = 0, 1, . . . , 2k − 1. Let
A, B ∈ R k×k be defined so that Ai,j = vi+j−2 and Bi,j = vi+j−1 .
(a) Express both A and B in the form A = VDA V T and B = VDB V T , where V
is a Vandermonde matrix and DA , DB are diagonal.
(b) Prove that the solutions to the generalized eigenvalue problem Ax = λBx
can be used to recover the locations of the nonzeros in 1 x.
(c) Given the locations of the nonzeros in 1x and v0 , v1 , . . . , vk−1 , give an
algorithm to recover the values of the nonzero coefficients in 1 x.
This is called the matrix pencil method. If you squint, it looks like Prony’s
method (Section 5.4) and has similar guarantees. Both are (somewhat) robust
to noise if and only if the Vandermonde matrix is well-conditioned, and exactly
when that happens is a longer story. See Moitra [113].
6
Sparse Coding
Many types of signals turn out to be sparse, either in their natural basis or
in a hand-designed basis (e.g., a family of wavelets). But if we are given a
collection of signals and we don’t know the basis in which they are sparse,
can we automatically learn it? This problem goes by various names, including
sparse coding and dictionary learning. It was introduced in the context of
neuroscience, where it was used to explain how neurons get the types of
activation patterns they have. It also has applications to compression and deep
learning. In this chapter, we will give algorithms for sparse coding that leverage
convex programming relaxations as well as iterative algorithms where we will
prove that greedy methods successfully minimize a nonconvex function in an
appropriate stochastic model.
6.1 Introduction
Sparse coding was introduced by Olshausen and Field [117], who were
neuroscientists interested in understanding properties of the mammalian visual
cortex. They were able to measure the receptive field of neurons — essentially
how neurons respond to various types of stimuli. But what they found surprised
them. The response patterns were always
(a) spatially localized, which means that each neuron was sensitive only to
light in a particular region of the image;
(b) bandpass, in the sense that adding high-frequency components had a
negligible effect on the response; and
(c) oriented, in that rotating images with sharp edges produced responses
only when the edge was within some range of angles.
89
90 6 Sparse Coding
Guess 1
A
Given 1
A, compute a column sparse 1 X so that 1
A1X ≈ B (using, e.g.,
matching pursuit [111] or basis pursuit [50]).
X , compute the 1
Given 1 A that minimizes 1
A1X − BF .
End
6.1 Introduction 91
K-SVD [5]
Input: Matrix B, whose columns can be jointly sparsely represented
Output: A basis 1
A and representation 1
X
Guess 1A
Given 1
A, compute a column sparse 1 X so that 1
A1X ≈ B (using, e.g.,
matching pursuit [111] or basis pursuit [50]).
End
You should think about these algorithms as variants of the alternating mini-
mization algorithm we gave for nonnegative matrix factorization. They follow
the same style of heuristic. The difference is that k-SVD is more clever about
how it corrects for the contribution of the other columns in our basis 1
A when
performing an update, which makes it the heuristic of choice in practice.
Empirically, both of these algorithms are sensitive to their initialization but
work well aside from this issue.
We want algorithms with provable guarantees. Then it is natural to focus on
the case where A is a basis for which we know how to solve sparse recovery
problems. Thus we could consider both the undercomplete case, where A has
full column rank, and the overcomplete case, where there are more columns
than rows and A is either incoherent or has the restricted isometry property.
That’s exactly what we’ll do in this chapter. We’ll also assume a stochastic
92 6 Sparse Coding
model for how the x(i) ’s are generated, which helps prevent lots of pathologies
that can arise (e.g., a column in A is never represented).
uT B = (uT A)A−1 B = vT X.
This is the usual trick of replacing the sparsity of a vector with its 1 norm. The
constraint rT w = 1 is needed just to fix a normalization, to prevent us from
returning the all-zero vector as a solution. We will choose r to be a column in
B for reasons that will become clear later. Our goal is to show that the optimal
solution to (P1 ) is a scaled row of X. In fact, we can transform the above linear
program into a simpler one that will be easier to analyze:
and it is easy to check that you can go from a solution to (Q1 ) to a solution to
(P1 ) in the analogous way.
94 6 Sparse Coding
holds with high probability for any fixed z1 . We can take a union bound over
an -net of all possible unit vectors z1 and conclude by rescaling that the bound
holds for all nonzero z1 ’s.
because the feasible regions of (Q1 ) and (R1 ) are the same, and their objective
value is nearly the same after rescaling. The final step is the following:
96 6 Sparse Coding
where the only the ith coordinate of z is nonzero. Hence it is a scaled copy of
the ith row of X. Now, since the generative model chooses the nonzero entries
of x from a standard Gaussian, almost surely there is a coordinate that is the
strictly largest in absolute value.
In fact, even more is true. For any fixed coordinate i, with high probability it
will be the strictly largest coordinate in absolute value for some column of X.
This means that if we repeatedly solve (P1 ) by setting r to be different columns
of B, then with high probability every row of X will show up. Now, once we
know the rows of X, we can solve for A as follows. With high probability, if
we take enough samples, then X will have a left pseudo-inverse and we can
compute A = BX + , which will recover A up to a permutation and rescaling of
its columns. This completes the proof.
are even accelerated methods that get faster rates by leveraging connections
to physics, like momentum. You could write an entire book on iterative
methods. And indeed there are many terrific sources, such as Nesterov [116]
and Rockefellar [127].
In this section we will prove some basic results about gradient descent in
the simplest setting, where f is twice differentiable, β-smooth, and α-strongly
convex. We will show that the difference between the current value of our
objective and the optimal value decays exponentially. Ultimately, our interest
in gradient descent will be in applying it to nonconvex problems. Some of
the most interesting problems, like fitting parameters in a deep network, are
nonconvex. When faced with a nonconvex function f , you just run gradient
descent anyway.
It is very challenging to prove guarantees about nonconvex optimization
(except for things like being able to reach a local minimum). Nevertheless, our
approach for overcomplete sparse coding will be based on an abstraction of the
analysis of gradient descent. What is really going on under the hood is that the
gradient always points you somewhat in the direction of the globally minimal
solution. In nonconvex settings, we will still be able to get some mileage out
of this intuition by showing that under the appropriate stochastic assumptions,
even simple update rules make progress in a similar manner. In any case, let’s
now define gradient descent:
Gradient Descent
Given: A convex, differentiable function f : Rn → R
Output: A point xT that is an approximate minimizer of f
For t = 1 to T
xt+1 = xt − η∇f (xt )
End
The parameter η is called the learning rate. You want to make it large, but
not so large that you overshoot. Our analysis of gradient descent will hinge
on multivariable calculus. A useful ingredient for us will be the following
multivariate Taylor’s theorem:
Theorem 6.3.1 Let f : Rn → R be a convex, differentiable function. Then
1
f (y) = f (x) + (∇f (x))T (y − x) + (y − x)T ∇ 2 f (x)(y − x) + o(y − x2 ).
2
98 6 Sparse Coding
Now let’s precisely define the conditions on f that we will impose. First we
need the gradient to not change too quickly:
Definition 6.3.2 We will say that f is β-smooth if for all x and y, we have
α
f (y) ≥ f (x) + (∇f (x))T (y − x) + y − x2 .
2
Now let’s state the main result we will prove in this section:
Theorem 6.3.4 Let f be twice differentiable, β-smooth, and α-strongly convex.
Let x∗ be the minimizer of f and η ≤ β1 . Then gradient descent starting from
x1 satisfies
& ηα 't−1
f (xt ) − f (x∗ ) ≤ β 1 − x1 − x∗ 2 .
2
We will make use of the following helper lemma:
Lemma 6.3.5 If f is twice differentiable, β-smooth, and α-strongly con-
vex, then
α 1
∇f (xt )T (xt − x∗ ) ≥ xt − x∗ 2 + ∇f (xt )2 .
4 2β
Let’s come back to its proof. For now, let’s see how it can be used to
establish Theorem 6.3.4:
6.3 Gradient Descent 99
Proof: Let α = α
4 and β = 1
2β . Then we have
1
∇f (x)T (x − x∗ ) ≥ .
β
Taking the average of the two main inequalities completes the proof.
Actually, our proof works even when the direction you move in is just
an approximation to the gradient. This is an important shortcut when, for
example, f is a loss function that depends on a very large number of training
examples. Instead of computing the gradient of f , you can sample some
training examples, compute your loss function on just those, and follow its
gradient. This is called stochastic gradient descent. The direction it moves
in is a random variable whose expectation is the gradient of f . The beauty
of it is that the usual proofs of convergence for gradient descent carry over
straightforwardly (provided your sample is large enough).
There is an even further abstraction we can make. What if the direction
you move in isn’t a stochastic approximation of the gradient, but is just some
direction that satisfies the conditions shown in Lemma 6.3.5? Let’s call this
abstract gradient descent, just to give it a name:
For t = 1 to T
xt+1 = xt − ηgt
End
solution x∗ . It turns out that the proof we gave of Theorem 6.3.4 generalizes
immediately to this more abstract setting:
Theorem 6.3.8 Suppose that gt is (α , β , t )-correlated with a point x∗ and,
moreover, η ≤ 2β . Then abstract gradient descent starting from x1 satisfies
& ηα 't−1 maxt t
xt − x∗ 2 ≤ 1 − x1 − x∗ 2 + .
2 α
Now we have the tools we need for overcomplete sparse coding. We’ll
prove convergence bounds for iterative methods in spite of the fact that the
underlying function they are attempting to minimize is nonconvex. The key is
to use the above framework and exploit the stochastic properties of our model.
being a hard penalty function that is infinite when x has more than k nonzero
coordinates and is zero otherwise. It could also be your favorite sparsity-
inducing soft penalty function.
Many iterative algorithms attempt to minimize an energy function like the
one above that balances how well your basis explains each sample and how
sparse each representation is. The trouble is that the function is nonconvex, so
if you want to give provable guarantees, you would have to figure out all kinds
of things, like why it doesn’t get stuck in a local minimum or why it doesn’t
spend too much time moving slowly around saddle points.
Question 8 Instead of viewing iterative methods as attempting to minimize
a known nonconvex function, can we view them as minimizing an unknown
convex function?
What we mean is: What if, instead of the 1
x ’s, we plug in the true sparse
representations x? Our energy function becomes
p
E(1
A, X) = b(i) − 1
Ax(i) 2
i=1
which is convex, because only the basis A is unknown. Moreover, it’s natural to
expect that in our stochastic model (and probably many others), the minimizer
of E(1A, X) converges to the true basis A. So now we have a convex function
where there is a path from our initial solution to the optimal solution via
gradient descent. The trouble is that we cannot evaluate or compute gradients
of the function E(1A, X), because X is unknown.
The path we will follow in this section is to show that simple, iterative algo-
rithms for sparse coding move in a direction that is an approximation to the gra-
dient of E(1
A, X). More precisely, we will show that under our stochastic model,
the direction our update rule moves in meets the conditions in Definition 6.3.7.
That’s our plan of action. We will study the following iterative algorithm:
For t = 0 to T
x(i) = threshold1/2 (1
1 AT b(i) )
6.4 The Overcomplete Case 103
q(t+1)
1
A ←1
A+η (b(i) − 1 x (i) )T
x(i) ) sign(1
A1
i=qt+1
End
Bi − Ai ≤ δ
and 1
A is (1/ log n, 2)-close to A. Then decoding succeeds; i.e.,
sign(threshold1/2 (1
AT b)) = sign(x)
We will not prove this lemma here. The idea is that for any j, we can write
(1
AT b)j = ATj Aj xj + (1
Aj − Aj )T Aj xj + 1
ATj Ai xi
i∈S\{j}
where S = supp(x). The first term is xj . The second term is at most 1/ log n in
absolute value. And the third term is a random variable whose variance can be
appropriately bounded. For the full details, see Arora et al. [16]. Keep in mind
√
that for incoherent dictionaries, we think of μ = 1/ n.
Let γ denote any vector whose norm is negligible (say n−ω(1) ). We will use
γ to collect various sorts of error terms that are small, without having to worry
about what the final expression looks like. Consider the expected direction that
our Hebbian update moves in when restricted to some column j. We have
gj = E[(b − 1
A1
x ) sign(1
xj )]
where the expectation is over a sample Ax = b from our model. This is a priori
a complicated expression to analyze, because b is a random variable of our
model and1 x is a random variable that arises from our decoding rule. Our main
lemma is the following:
Lemma 6.4.4 Suppose that 1
A and A are (1/ log n, 2)-close. Then
gj = pj qj (I − 1
Aj1 A−j Q1
ATj )Aj + pj1 AT−j Aj ± γ
where qj = P[j ∈ S], qi,j = P[i, j ∈ S] and pj = E[xj sign(xj )|j ∈ S]. Moreover,
Q = diag({qi,j }i ).
Proof: Using the fact that the decoding step recovers the correct signs of x
with high probability, we can play various tricks with the indicator variable for
whether or not the decoding succeeds and be able to replace 1 x ’s with x’s. For
now, let’s state the following claim, which we will prove later:
Claim 6.4.5 gj = E[(I − 1
AS1
ATS )Ax sign(xj ))] ± γ
Now let S = supp(x). We will imagine first sampling the support of x, then
choosing the values of its nonzero entries. Thus we can rewrite the expectation
using subconditioning as
gj = E[E[(I − 1
AS1
ATS )Ax sign(xj ))]|S] ± γ
S xS
= E[E[(I − 1
AS1
ATS )Aj xj sign(xj ))]|S] ± γ
S xS
= pj E[(I − 1
AS1
ATS )Aj ] ± γ
S
= pj qj (I − 1
Aj1 A−j Q1
ATj )Aj + pj1 AT−j Aj ± γ .
6.4 The Overcomplete Case 105
The second equality uses the fact that the coordinates are uncorrelated,
conditioned on the support S. The third equality uses the definition of pj . The
fourth equality follows from separating the contribution from j from all the
other coordinates, where A−j denotes the matrix we obtain by deleting the jth
column. This now completes the proof of the main lemma.
So why does this lemma tell us that our update rule meets the conditions in
Definition 6.3.7? When 1
A and A are close, you should think of the expression
as follows:
gj = pj qj (I − 1Aj1 A−j Q1
ATj )Aj + pj1 AT−j Aj ±γ
≈pj qj (Aj −1
Aj ) systemic error
And so the expected direction that the update rule moves in is almost the
ideal direction Aj − 1Aj , pointing toward the true solution. What this tells us
is that sometimes the way to get around nonconvexity is to have a reasonable
stochastic model. Even though in a worst-case sense you can still get stuck in
a local minimum, in the average case you often make progress with each step
you take. We have not discussed the issue of how to initialize it. But it turns
out that there are simple spectral algorithms to find a good initialization. See
Arora et al. [16] for the full details, as well as the guarantees of the overall
algorithm.
Let’s conclude by proving Claim 6.4.5:
Proof: Let F denote the event that decoding recovers the correct signs of x.
From Lemma 6.4.3, we know that F holds with high probability. First let’s use
the indicator variable for event F to replace the 1
x inside the sign function with
x at the expense of adding a negligible error term:
gj = E[(b − 1
A1 xj )1F ] + E[(b − 1
x) sign(1 A1
x) sign(1
xj )1F ]
1x) sign(xj )1F ] ± γ
= E[(b − A1
gj = E[(b − 1
A threshold1/2 (1 AT b)) sign(xj )1F ] ± γ
= E[(b − 1
AS1ATS b) sign(xj )1F ] ± γ
= E[(I − 1
AS1
ATS )b sign(xj )1F ] ± γ
Here we have used the fact that threshold1/2 (1 AT b) keeps all coordinates in S
the same and zeros out the rest when event F occurs. Now we can play some
more tricks with the indicator variable to get rid of it:
106 6 Sparse Coding
6.5 Exercises
Problem 6-1: Consider the sparse coding model y = Ax where A is a fixed
n × n matrix with orthonormal columns ai , and x has i.i.d. coordinates drawn
from the distribution
⎧
⎨ +1 with probability α/2,
xi = −1 with probability α/2,
⎩
0 with probability 1 − α.
The goal is to recover the columns of A (up to sign and permutation) given
many independent samples y. Construct the matrix
) *
M = Ey y(1) , yy(2) , yyyT
where y(1) = Ax(1) and y(2) = Ax(2) are two fixed samples from the
sparse coding model, and the expectation is over a third sample y from
the sparse coding model. Let ẑ be the (unit-norm) eigenvector of M corre-
sponding to the largest (in absolute value) eigenvalue.
(a) Write an expression for M in terms of α, x(1) , x(2) , {ai }.
(b) Assume for simplicity that x(1) and x(2) both have support size exactly αn
and that their supports intersect at a single coordinate i∗ . Show that
ẑ, ai∗ 2 ≥ 1 − O(α 2 n) in the limit α → 0.
This method can be used to find a good starting point for alternating
minimization.
7
Gaussian Mixture Models
7.1 Introduction
Karl Pearson was one of the luminaries of statistics and helped to lay its
foundation. He introduced revolutionary new ideas and methods, such as:
(a) p-values, which are now the de facto way to measure statistical
significance
(b) The chi-squared test, which measures goodness of fit to a Gaussian
distribution
(c) Pearson’s correlation coefficient
(d) The method of moments for estimating the parameters of a distribution
(e) Mixture models for modeling the presence of subpopulations
107
108 7 Gaussian Mixture Models
Believe it or not, the last two were introduced in the same influential study
from 1894 that represented Pearson’s first foray into biometrics [120]. Let’s
understand what led Pearson down this road. While on vacation, his colleague
Walter Weldon and his wife had meticulously collected 1,000 Naples crabs
and measured 23 different physical attributes of each of them. But there was a
surprise lurking in the data. All but one of these statistics was approximately
Gaussian. So why weren’t they all Gaussian?
Everyone was quite puzzled, until Pearson offered an explanation: Maybe
the Naples crab is not one species, but rather two species. Then it is natural to
model the observed distribution as a mixture of two Gaussians, rather than
just one. Let’s be more formal. Recall that the density function of a one-
dimensional Gaussian with mean μ and variance σ 2 is
2
1 −(x − μ)2
N (μ, σ , x) = √
2
exp .
2π σ 2 2σ 2
And for a mixture of two Gaussians, it is
F(x) = w1 N (μ1 , σ12 , x) +(1 − w1 ) N (μ2 , σ22 , x) .
F1 (x) F2 (x)
We will use F1 and F2 to denote the two Gaussians in the mixture. You can also
think of it in terms of how you’d generate a sample from it: Take a biased coin
that is heads with probability w1 and tails with the remaining probability 1−w1 .
Then for each sample you flip the coin; i.e., decide which subpopulation your
sample comes from. If it’s heads, you output a sample from the first Gaussian,
otherwise you output a sample from the second one.
This is already a powerful and flexible statistical model (see Figure 7.1). But
Pearson didn’t stop there. He wanted to find the parameters of a mixture of two
Gaussians that best fit the observed data to test out his hypothesis. When it’s
just one Gaussian, it’s easy, because you can set μ and σ 2 to be the empirical
mean and empirical variance, respectively. But what should you do when there
are five unknown parameters and for each sample there is a hidden variable
representing which subpopulation it came from? Pearson used the method
of moments, which we will explain in the next subsection. The parameters
he found seemed to be a good fit, but there were still a lot of unanswered
questions, such as: Does the method of moments always find a good solution
if there is one?
Method of Moments
Here we will explain how Pearson used the method of moments to find the
unknown parameters. The key observation is that the moments of a mixture
7.1 Introduction 109
20
15
10
5
0
And so the rth raw moment of a mixture of two Gaussians is itself a degree
r + 1 polynomial, which we denote by Pr , in the parameters we would like to
learn.
110 7 Gaussian Mixture Models
Pearson’s Sixth Moment Test: We can estimate Ex←F [xr ] from random
samples. Let S be our set of samples. Then we can compute:
1 r
r =
M x
|S|
x∈S
r will be
And given a polynomial number of samples (for any r = O(1)), M
additively close to Ex←F(x) [xr ]. Pearson’s approach was:
• Solve this system. Each solution is a setting of all five parameters that
explains the first five empirical moments.
Expectation Maximization
The workhorse in modern statistics is the maximum likelihood estimator, which
sets the parameters so as to maximize the probability that the mixture would
generate the observed samples. This estimator has lots of wonderful properties.
Under certain technical conditions, it is asymptotically efficient, meaning that
no other estimator can achieve asymptotically smaller variance as a function of
the number of samples. Even the law of its distribution can be characterized,
and is known to be normally distributed with a variance related to what’s called
the Fisher information. Unfortunately, for most of the problems we will be
interested in, it is NP-hard to compute [19].
The popular alternative is known as expectation maximization and was
introduced in an influential paper by Dempster, Laird, and Rubin [61]. It is
important to realize that this is just a heuristic for computing the maximum
likelihood estimator and does not inherit any of its statistical guarantees.
7.2 Clustering-Based Algorithms 111
In practice, it seems to work well. But it can get stuck in local maxima of the
likelihood function. Even worse, it can be quite sensitive to how it is initialized
(see, e.g., [125]).
• Cluster all of the samples S into two sets S1 and S2 depending on whether
they were generated by the first or second component.
• Output the empirical mean and covariance of each Si along with the
empirical mixing weight |S|S|1 | .
The details of how we will implement the first step and what types of
conditions we need to impose will vary from algorithm to algorithm. But first
let’s see that if we could design a clustering algorithm that succeeds with high
probability, the parameters we find would be provably good estimates for the
true ones. This is captured by the following lemmas. Let |S| = m be the number
of samples.
Lemma 7.2.3 If m ≥ C log 1/δ
2 and clustering succeeds, then
|1
w1 − w1 | ≤
with probability at least 1 − δ.
Now let wmin = min(w1 , 1 − w1 ). Then
Lemma 7.2.4 If m ≥ C nwlog 1/δ
2
and clustering succeeds, then
min
1
μi − μi 2 ≤
for each i, with probability at least 1 − δ.
7.2 Clustering-Based Algorithms 113
1i − i ≤
for each i, with probability at least 1 − δ.
All of these lemmas can be proven via standard concentration bounds. The first
two follow from concentration bounds for scalar random variables, and the
third requires more high-powered matrix concentration bounds. However, it is
easy to prove a version of this that has a worse but still polynomial dependence
on n by proving that each entry of 1i and i are close and using the union
bound. What these lemmas together tell us is that if we really could solve
clustering, then we would indeed be able to provably estimate the unknown
parameters.
√
Dasgupta [56]: ( n) Separation
Dasgupta gave the first provable algorithms for learning mixtures of Gaussians,
√
and required that μi − μj 2 ≥ ( nσmax ) where σmax is the maximum
variance of any Gaussian in any direction (e.g., if the components are not
spherical). Note that the constant in the separation depends on wmin , and we
assume we know this parameter (or a lower bound on it).
The basic idea behind the algorithm is to project the mixture onto log k
dimensions uniformly at random. This projection will preserve distances
between each pair of centers μi and μj with high probability, but will contract
distances between samples from the same component and make each com-
ponent closer to spherical, thus making it easier to cluster. Informally, we can
think of this separation condition as: if we think of each Gaussian as a spherical
ball, then if the components are far enough apart, these balls will be disjoint.
i=1
Hence the top left singular vectors of E[xxT ],whose singular value is strictly
larger than σ 2 , exactly span T. We can then estimate E[xxT ] from sufficiently
many random samples, compute its singular value decomposition, and project
the mixture onto T and invoke the algorithm of [19].
7.3 Discussion of Density Estimation 115
Using the above claim, we know that there is a coupling between F1 and F2
that agrees with probability 1/2. Hence, instead of thinking about sampling
from a mixture of two Gaussians in the usual way (choose which component,
then choose a random sample from it), we can alternatively sample as follows:
1. Choose (x, y) from the best coupling between F1 and F2 .
2. If x = y, output x with probability 1/2, and otherwise output y.
3. Else output x with probability 1/2, and otherwise output y.
This procedure generates a random sample from F just as before. What’s
important is that if you reach the second step, the value you output doesn’t
depend on which component the sample came from. So you can’t predict
it better than randomly guessing. This is a useful way to think about the
assumptions that clustering-based algorithms make. Some are stronger than
others, but at the very least they need to take at least n samples and cluster all
of them correctly. In order for this to be possible, we must have
dTV (F1 , F2 ) ≥ 1 − 1/n.
But who says that algorithms for learning must first cluster? Can we hope to
learn the parameters even when the components almost entirely overlap, such
as when dTV (F1 , F2 ) = 1/n?
Now is a good time to discuss the types of goals we could aim for and how
they relate to each other.
(a) Improper Density Estimation
This is the weakest learning goal. If we’re given samples from some distribu-
tion F in some class C (e.g., C could be all mixtures of two Gaussians), then
we want to find any other distribution 1 F that satisfies dTV (F, 1
F ) ≤ ε. We do
not require 1F to be in class C too. What’s important to know about improper
density estimation is that in one dimension it’s easy. You can solve it using a
kernel density estimate, provided that F is smooth.
Here’s how kernel density estimates work. First you take many samples and
construct an empirical point mass distribution G. Now, G is not close to F. It’s
not even smooth, so how can it be? But you can fix this by convolving with a
Gaussian with small variance. In particular, if you set 1 F = G ∗ N (0, σ 2 ) and
choose the parameters and number of samples appropriately, what you get will
satisfy dTV (F, 1
F ) ≤ ε with high probability. This scheme doesn’t use much
about the distribution F, but it pays the price in high dimensions. The issue is
that you just won’t get enough samples that are close to each other. In general,
kernel density estimates need the number of samples to be exponential in the
dimension in order to work.
7.3 Discussion of Density Estimation 117
Note that F and 1 F must necessarily be close as mixtures too: dTV (F, 1 F) ≤
4ε. However, we can have mixtures F and 1 F that are both mixtures of k
Gaussians and are close as distributions, but are not close on a component-
by-component basis. So why should we aim for such a challenging goal? It
turns out that if 1
F is ε-close to F, then given a typical sample, we can estimate
the posterior accurately [94]. What this means is that even if you can’t cluster
all of your samples into which component they came from, you can still figure
out which ones it’s possible to be confident about. This is one of the main
advantages of parameter learning over some of the weaker learning goals.
It’s good to achieve the strongest types of learning goals you can hope for,
but you should also remember that lower bounds for these strong learning goals
(e.g., parameter learning) do not imply lower bounds for weaker problems
(e.g., proper density estimation). We will give algorithms for learning the
parameters of a mixture of k Gaussians that run in polynomial time for any
k = O(1) but have an exponential dependence on k. But this is necessary, in
that there are pairs of mixtures of k Gaussians F and 1 F that are not close on
1
a component-by-component basis but have dTV (F, F ) ≤ 2−k [114]. So any
algorithm for parameter learning would be able to tell them apart, but that
118 7 Gaussian Mixture Models
takes at least 2k samples, again by a coupling argument. But maybe for proper
density estimation it’s possible to get an algorithm that is polynomial in all of
the parameters.
Open Question 1 Is there a poly(n, k, 1/ε) time algorithm for proper density
estimation for mixtures of k Gaussians in n dimensions? What about in one
dimension?
Again, we need a quantitative lower bound on dTV (Fi , Fj ), say dTV (Fi , Fj ) ≥ ε,
for each i = j so that if we take a reasonable number of samples, we will
get at least one sample from the nonoverlap region between various pairs of
components.
Theorem 7.4.2 [94], [114] If wi ≥ ε for each i and dTV (Fi , Fj ) ≥ ε for each
i = j, then there is an efficient algorithm that learns an ε-close estimate 1
F to F
whose running time and sample complexity are poly(n, 1/ε, log 1/δ) and that
succeeds with probability 1 − δ.
7.4 Clustering-Free Algorithms 119
Outline
We can now describe the basic outline of the algorithm, although there will be
many details to fill in:
Pairing Lemma
Next we will encounter the second problem: Suppose we project onto direction
F r = 12 1
r and s and learn 1 F1r + 12 1 F s = 12 1
F2r and 1 F1s + 12 1
F2s , respectively. Then
the mean and variance of 1 r
F1 yield a linear constraint on one of the two high-
dimensional Gaussians, and similarly for 1 F1s .
Problem 3 How do we know that they yield constraints on the same high-
dimensional component?
Ultimately we want to set up a system of linear constraints to solve for
the parameters of F1 , but when we project F onto different directions (say,
122 7 Gaussian Mixture Models
s s
r r
Figure 7.2: The projected mean and projected variance vary continuously as we
sweep from r to s.
r and s), we need to pair up the components from these two directions. The
key observation is that as we vary r to s, the parameters of the mixture vary
continuously. (See Figure 7.2). Hence when we project onto r, we know
from the isotropic projection lemma that the two components will have either
noticeably different means or variances. Suppose their means are different by
ε3 ; then if r and s are close (compared to ε1 ), the parameters of each component
in the mixture do not change much and the component in projr [F] with larger
mean will correspond to the same component as the one in projs [F] with
larger mean. A similar statement applies when it is the variances that are at
least ε3 apart.
Lemma 7.4.8 If r − s ≤ ε2 = poly(1/n, ε3 ), then:
(a) If |rT μ1 − rT μ2 | ≥ ε3 , then the components in projr [F] and projs [F] with
the larger mean correspond to the same high-dimensional component.
(b) Else if |rT 1 r − rT 2 r| ≥ ε3 , then the components in projr [F] and
projs [F] with the larger variance correspond to the same
high-dimensional component.
Hence if we choose r randomly and only search over directions s with
r − s ≤ ε2 , we will be able to pair up the components correctly in the
different one-dimensional mixtures.
Grid Search
Input: Samples from F()
1 = (1
Output: Parameters w1 , 1
μ1 , 1
σ12 , 1
μ2 , 1
σ22 )
There are many ways we could think about testing the closeness of our
estimate with the true parameters of the model. For example, we could
empirically estiamte the first six moments of F() from our samples, and pass
1 if its first six moments are each within some additive tolerance τ of the
empirical moments. (This is really a variant on Pearson’s sixth moment test.)
It is easy to see that if we take enough samples and set τ appropriately, then if
we round the true parameters to any valid grid point whose parameters are
multiples of εC , the resulting 1 will with high probability pass our test. This
is called the completeness. The much more challenging part is establishing the
1 except for ones
soundness; after all, why is there no other set of parameters
close to that pass our test?
Alternatively, we want to prove that any two mixtures F and 1 F whose
parameters do not match within an additive ε must have one of their first six
moments noticeably different. The main lemma is:
Lemma 7.5.2 (Six Moments Suffice) For any F and 1 F that are not ε-close in
parameters, there is an r ∈ {1, 2, . . . , 6} where
1 ) ≥ εO(1)
Mr () − Mr (
where the first term is at most τ because the test passes, and the second term
is small because we can take enough samples (but still poly(1/τ )) so that the
empirical moments and the true moments are close. Hence we can apply the
above lemma in the contrapositive, and conclude that if the grid search outputs
7.5 A Univariate Algorithm 125
F1(x)
^
p(x) F1(x) ^ (x)
F2
F2(x)
^
f(x) = F(x) − F(x)
Figure 7.3: If f (x) has at most six zero crossings, we can find a polynomial of
degree at most six that agrees with its sign.
1, then and
1 must be ε-close in parameters, which gives us an efficient
univariate algorithm!
So our main goal is to prove that if F and 1 F are not ε-close, one of their
first six moments is noticeably different. In fact, even the case of ε = 0 is
challenging: If F and 1 F are different mixtures of two Gaussians, why is one
of their first six moments necessarily different? Our main goal is to prove this
statement using the heat equation.
In fact, let us consider the following thought experiment. Let f (x) = F(x) −
1
F (x) be the pointwise difference between the density functions F and 1 F . Then
the heart of the problem is: Can we prove that f (x) crosses the x-axis at most
six times? (See Figure 7.3.)
Lemma 7.5.3 If f (x) crosses the x-axis at most six times, then one of the first
six moments of F and 1
F is different.
Proof: In fact, we can construct a (nonzero) degree at most six polynomial
p(x) that agrees with the sign of f (x); i.e., p(x)f (x) ≥ 0 for all x. Then
5 5 6
0 < p(x)f (x)dx = pr xr f (x)dx
x x r=1
6
1 ).
≤ |pr |Mr () − Mr (
r=1
Lemma 7.5.4 Let f (x) = ki=1 αi N (μi , σi2 , x) be a linear combination of k
Gaussians (αi can be negative). Then if f (x) is not identically zero, f (x) has at
most 2k − 2 zero crossings.
We will rely on the following tools:
Theorem 7.5.5 Given f (x) : R → R that is analytic and has n zero crossings,
then for any σ 2 > 0, the function g(x) = f (x) ∗ N (0, σ 2 ) has at most n zero
crossings.
This theorem has a physical interpretation. If we think of f (x) as the heat
profile of an infinite one-dimensional rod, then what does the heat profile look
like at some later time? In fact, it is precisely g(x) = f (x) ∗ N (0, σ 2 ) for an
appropriately chosen σ 2 . Alternatively, the Gaussian is the Green’s function of
the heat equation. And hence many of our physical intuitions for diffusion have
consequences for convolution – convolving a function by a Gaussian has the
effect of smoothing it, and it cannot create new local maxima (and relatedly, it
cannot create new zero crossings).
Finally, we recall the elementary fact:
Fact 7.5.6 N (0, σ12 ) ∗ N (0, σ22 ) = N (0, σ12 + σ22 )
Now, we are ready to prove the above lemma and conclude that if we knew
the first six moments of a mixture of two Gaussians, exactly, then we would
know its parameters exactly too. Let us prove the above lemma by induction,
and assume that for any linear combination of k = 3 Gaussians, the number
of zero crossings is at most four. Now consider an arbitrary linear combination
of four Gaussians, and let σ 2 be the smallest variance of any component. (See
Figure 7.4a.) We can consider a related mixture where we subtract σ 2 from the
variance of each component. (See Figure 7.4b.)
Now, if we ignore the delta function, we have a linear combination of three
Gaussians, and by induction we know that it has at most four zero crossings.
But how many zero crossings can we add when we add back in the delta
function? We can add at most two, one on the way up and one on the way
down (here we are ignoring some real analysis complications of working with
delta functions for ease of presentation). (See Figure 7.4c.) And now we can
convolve the function by N (0, σ 2 ) to recover the original linear combination
of four Gaussians, but this last step does not increase the number of zero
crossings! (See Figure 7.4d.)
Figure 7.4: (a) Linear combination of four Gaussians; (b) subtracting σ² from
each variance; (c) adding back in the delta function; (d) convolving by N(0, σ²)
to recover the original linear combination.
This proves Lemma 7.5.4, and hence (by Lemma 7.5.3) the system of equations
M_r(Θ̂) = M_r(Θ), r = 1, 2, . . . , 6
has only two solutions (the true parameters, and we can also interchange
which component is which). In fact, this system of polynomial equations
is also stable, and there is an analogue of condition numbers for systems
of polynomial equations that implies a quantitative version of what we have
just proved: if F and F̂ are not ε-close, then one of their first six moments is
noticeably different. This gives us our univariate algorithm.
Polynomial Families
We will analyze the method of moments for the following class of
distributions:
In a polynomial family, each moment M_r(Θ) is a polynomial in the parameters Θ,
and the distribution is determined by its moments (Fact 7.6.3):
∀r, M_r(Θ) = M_r(Θ̂) ⟹ F(Θ) = F(Θ̂).
Our goal is to show that for any polynomial family, a finite number of its
moments suffice. First we introduce the relevant definitions:
Definition 7.6.4 Given a ring R, the ideal I generated by g_1, g_2, · · · , g_n ∈ R,
denoted by I = ⟨g_1, g_2, · · · , g_n⟩, is defined as
I = { Σ_i r_i g_i : r_i ∈ R }.
Definition 7.6.5 A Noetherian ring is a ring such that for any ascending sequence of
ideals
I_1 ⊆ I_2 ⊆ I_3 ⊆ · · · ,
there is an N so that I_N = I_{N+1} = I_{N+2} = · · · .
for some polynomials p_{ij} ∈ R[Θ, Θ̂]. Thus, if M_r(Θ) = M_r(Θ̂) for all
r ∈ {1, 2, · · · , N}, then M_r(Θ) = M_r(Θ̂) for all r, and from Fact 7.6.3 we
conclude that F(Θ) = F(Θ̂).
The other side of the theorem is obvious.
The theorem above does not give any finite bound on N, since the basis
theorem does not either. This is because the basis theorem is proved by
contradiction, but more fundamentally, it is not possible to give a bound on N
that depends only on the choice of the ring. Consider the following example:
Example 2 Consider the Noetherian ring R[x]. Let I_i = ⟨x^{N−i}⟩ for
i = 0, · · · , N. It is a strictly ascending chain of ideals for i = 0, · · · , N.
Therefore, even if the ring R[x] is fixed, there is no universal bound on N.
Bounds such as those in Theorem 7.6.7 are often referred to as ineffective.
Consider an application of the above result to mixtures of Gaussians: from the
above theorem, we have that any two mixtures F and F̂ of k Gaussians are
identical if and only if these mixtures agree on their first N moments. Here
N is a function of k and N is finite, but we cannot write down any explicit
bound on N as a function of k using the above tools. Nevertheless, these tools
apply much more broadly than the specialized ones based on the heat equation
that we used in the previous section to prove that 4k − 2 moments suffice for
mixtures of k Gaussians.
7.7 Exercises
Problem 7-1: Suppose we are given a mixture of two Gaussians where the
variances of each component are equal:
F(x) = w1 N (μ1 , σ 2 , x) + (1 − w1 )N (μ2 , σ 2 , x)
Show that four moments suffice to uniquely determine the parameters of the
mixture.
Problem 7-2: Suppose we are given access to an oracle that, for any direction r,
returns the projected means and variances; i.e., r^T μ_1 and r^T Σ_1 r for one
component and r^T μ_2 and r^T Σ_2 r. The trouble is that you do not know which
parameters correspond to which component.
(a) Design an algorithm to recover μ1 and μ2 (up to permuting which
component is which) that makes at most O(d2 ) queries to the oracle
where d is the dimension. Hint: Recover the entries of
(μ1 − μ2 )(μ1 − μ2 )T .
(b) Challenge: Design an algorithm to recover Σ_1 and Σ_2 (up to permuting
which component is which) that makes O(1) queries to the oracle when
d = 2.
Note that here we are not assuming anything about how far apart the projected
means or variances are on some direction r.
8
Matrix Completion
8.1 Introduction
In 2006, Netflix issued a grand challenge to the machine learning community:
Beat our prediction algorithms for recommending movies to users by more
than 10 percent, and we’ll give you a million dollars. It took a few years, but
eventually the challenge was won and Netflix paid out. During that time, we
all learned a lot about how to build good recommendation systems. In this
chapter, we will cover one of the main ingredients, which is called the matrix
completion problem.
The starting point is to model our problem of predicting movie ratings as
a problem of predicting the unobserved entries of a matrix from the ones we
do observe. More precisely, if user i rates movie j (from one to five stars),
we set Mi,j to be the numerical score. Our goal is to use the entries Mi,j that
we observe to predict the ones that we don’t know. If we could predict these
accurately, it would give us a way to suggest movies to users that we think
they might like. A priori, there's no reason
to believe you can do this. If we think about the entire matrix M that we would
get by coercing every user to rate every movie (and in the Netflix dataset there
are 480,189 users and 17,770 movies), then in principle the entries Mi,j that
we observe might tell us nothing about the unobserved entries.
We’re in the same conundrum we were in when we talked about compressed
sensing. A priori, there is no reason to believe you can take fewer linear
measurements of a vector x than its dimension and reconstruct x. What we need
is some assumption about the structure. In compressed sensing, we assumed
that x is sparse or approximately sparse. In matrix completion, we will assume
that M is low-rank or approximately low-rank. It’s important to think about
where this assumption comes from. If M were low-rank, we could write it as
M = u^{(1)}(v^{(1)})^T + u^{(2)}(v^{(2)})^T + · · · + u^{(r)}(v^{(r)})^T.
The hope is that each of these rank-one terms represents some category of
movies. For example, the first term might represent the category drama, and
the entries in u(1) might represent for every user, to what extent does he or she
like drama movies? Then each entry in v(1) would represent for every movie,
to what extent would it appeal to someone who likes drama? This is where the
low-rank assumption comes from. What we’re hoping is that there are some
categories underlying our data that make it possible to fill in missing entries.
When I have a user’s ratings for movies in each of the categories, I could then
recommend other movies in the category that he or she likes by leveraging the
data I have from other users.
In this chapter, our main result is that there are efficient algorithms for
recovering M exactly when the number of observed entries is roughly mr log m,
where m ≥ n and rank(M) ≤ r. This is similar to compressed sensing, where we
were able to recover a k-sparse signal x from O(k log(n/k)) linear measurements,
which is much smaller
than the dimension of x. Here too we can recover a low-rank matrix M from a
number of observations that is much smaller than the dimension of M.
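To preview what such an algorithm looks like in practice, here is a minimal sketch of the nuclear norm relaxation analyzed later in this chapter: minimize the nuclear norm of X subject to X agreeing with M on the observed entries. The sketch assumes the cvxpy package is available (it is not part of the book), and the problem sizes and sampling probability are arbitrary.

```python
import numpy as np
import cvxpy as cp  # assumed external dependency

rng = np.random.default_rng(0)
n, r = 40, 2

# A random rank-r "ratings" matrix M = U V^T.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Observe each entry independently with probability p (the set Omega).
p = 0.35
mask = (rng.random((n, n)) < p).astype(float)

# Minimize the nuclear norm subject to matching the observed entries.
X = cp.Variable((n, n))
constraints = [cp.multiply(mask, X) == mask * M]
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
prob.solve()

err = np.linalg.norm(X.value - M) / np.linalg.norm(M)
print("relative recovery error:", err)
```

With enough observed entries the printed relative error should be small; with too few observations it grows, mirroring the bounds discussed in this chapter.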
Let us examine the assumptions above. The assumption that should give
us pause is that Ω, the set of observed entries, is uniformly random. This is
somewhat unnatural, since it
would be more believable if the probability that we observe Mi,j depended on
the value itself. Alternatively, a user should be more likely to rate a movie if he
or she actually liked it.
We already discussed the second assumption. In order to understand the
third assumption, suppose our observations are indeed uniformly random.
Consider
M = Π [ I_r 0 ; 0 0 ] Π^T
where Π is a uniformly random permutation matrix and [ I_r 0 ; 0 0 ] denotes the
block matrix with I_r in its top-left corner and zeros elsewhere. M is low-rank, but unless
we observe all of the ones along the diagonal, we will not be able to recover
M uniquely. Indeed, the top singular vectors of M are standard basis vectors.
But if we were to assume that the singular vectors of M are incoherent with
respect to the standard basis, we would avoid this snag, because the vectors in
our low-rank decomposition of M are spread out over many rows and columns.
Definition 8.1.1 The coherence μ(U) of a subspace U ⊆ R^n of dimension
dim(U) = r is
μ(U) = (n/r) · max_i ‖P_U e_i‖²,
where P_U denotes the orthogonal projection onto U and e_i is the ith standard basis
element.
It is easy to see that if we choose U uniformly at random, then μ(U) = O(1).
Also we have that 1 ≤ μ(U) ≤ n/r and the upper bound is attained if U
contains any ei . We can now see that if we set U to be the top singular vectors
of the above example, then U has high coherence. We will need the following
conditions on M:
(a) Let M = UΣV^T; then μ(U), μ(V) ≤ μ_0.
(b) ‖UV^T‖_∞ ≤ μ_1 √r / √(nm), where ‖ · ‖_∞ denotes the maximum absolute value of
any entry.
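Coherence is easy to compute directly from Definition 8.1.1. The sketch below (illustrative, not from the book) compares a uniformly random r-dimensional subspace of R^n, whose coherence is small, with a subspace spanned by standard basis vectors, which attains the upper bound n/r.

```python
import numpy as np

def coherence(U):
    """mu(U) = (n/r) * max_i ||P_U e_i||^2 for an n x r orthonormal basis U."""
    n, r = U.shape
    # Row i of U is U^T e_i, and ||P_U e_i||^2 = ||U^T e_i||^2 because the
    # columns of U are orthonormal.
    return (n / r) * np.max(np.sum(U ** 2, axis=1))

rng = np.random.default_rng(0)
n, r = 1000, 10

# A uniformly random r-dimensional subspace: orthonormalize a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
print("random subspace:     mu =", coherence(Q))       # small

# A subspace spanned by r standard basis vectors attains mu = n/r.
E = np.eye(n)[:, :r]
print("coordinate subspace: mu =", coherence(E), "= n/r =", n / r)
```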
The main result of this chapter (Theorem 8.1.2) is that if M satisfies these
conditions and Ω is chosen uniformly at random with |Ω| ≥ max(μ_1², μ_0) r(n + m) log²(n + m),
then with high probability the convex program (P1) below recovers M exactly.
8.2 Nuclear Norm
The program (P1) minimizes the nuclear norm ‖X‖_∗ of X (the sum of its singular
values) subject to the constraints X_{i,j} = M_{i,j} for all (i, j) ∈ Ω. Throughout
this section, ⟨X, B⟩ = trace(B^T X) denotes the matrix inner product.
Lemma 8.2.3 ‖X‖_∗ = max_{‖B‖≤1} ⟨X, B⟩
To get a feel for this, consider the special case where we restrict X and B to be
diagonal. Moreover, let X = diag(x) and B = diag(b). Then ‖X‖_∗ = ‖x‖_1 and
the constraint ‖B‖ ≤ 1 (the spectral norm of B is at most one) is equivalent to
‖b‖_∞ ≤ 1. So we can recover a more familiar characterization of vector norms
in the special case of diagonal matrices:
‖x‖_1 = max_{‖b‖_∞ ≤ 1} b^T x
Proof: We will only prove one direction of the above lemma. What B should
we use to certify the nuclear norm of X? Let X = U_X Σ_X V_X^T; then we will
choose B = U_X V_X^T. Then
⟨X, B⟩ = trace(B^T X) = trace(V_X U_X^T U_X Σ_X V_X^T) = trace(V_X Σ_X V_X^T) = trace(Σ_X) = ‖X‖_∗
where we have used the basic fact that trace(ABC) = trace(BCA). Hence this
proves ‖X‖_∗ ≤ max_{‖B‖≤1} ⟨X, B⟩, and the other direction is not much more
difficult (see, e.g., [88]).
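The certificate used in this proof is easy to verify numerically. The sketch below (illustrative, not from the book) checks on a random matrix that B = U_X V_X^T has spectral norm one and that ⟨X, B⟩ equals the sum of the singular values of X.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))

# Singular value decomposition X = U_X Sigma_X V_X^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
B = U @ Vt                            # the certificate from the proof

nuclear_norm = s.sum()
inner_product = np.trace(B.T @ X)     # <X, B> = trace(B^T X)

print("||B||   =", np.linalg.norm(B, 2))   # spectral norm, equals 1
print("<X, B>  =", inner_product)
print("||X||_* =", nuclear_norm)           # the last two agree
```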
How can we show that the solution to (P1) is M? Our basic approach will
be a proof by contradiction. Suppose not; then the solution is M + Z for some
nonzero Z that is supported in the complement of Ω, so that M + Z still agrees
with M on the observed entries. Our goal will be to construct a matrix B of
spectral norm at most one for which
‖M + Z‖_∗ ≥ ⟨M + Z, B⟩ > ‖M‖_∗.
Hence M +Z would not be the optimal solution to (P1 ). This strategy is similar
to the one in compressed sensing, where we hypothesized some other solution
w that differs from x by a vector y in the kernel of the sensing matrix A. There,
our strategy was to use geometric properties of ker(A) to prove that w has
strictly larger ℓ1 norm than x. The proof here will be in the same spirit, but
considerably more technical and involved.
Let us introduce some basic projection operators that will be crucial in our
proof. Recall that M = UΣV^T; let u_1, . . . , u_r be the columns of U, and let v_1, . . . , v_r
be columns of V. Choose ur+1 , . . . , un so that u1 , . . . , un form an orthonormal
basis for all of Rn ; i.e., ur+1 , . . . , un is an arbitrary orthonormal basis of U ⊥ .
Similarly, choose vr+1 , . . . , vn so that v1 , . . . , vn form an orthonormal basis for
all of Rn . We will be interested in the following linear spaces over matrices:
Definition 8.2.4 T = span{u_i v_j^T | 1 ≤ i ≤ r or 1 ≤ j ≤ r or both}
The orthogonal projection onto T is given by
P_T[Z] = Σ_{(i,j) ∈ [n]×[n] − [r+1,n]×[r+1,n]} ⟨Z, u_i v_j^T⟩ · u_i v_j^T = P_U Z + Z P_V − P_U Z P_V.
We are now ready to describe the outline of the proof of Theorem 8.1.2. The
proof will be based on the following:
(a) We will assume that a certain helper matrix Y exists, and show that this is
enough to imply ‖M + Z‖_∗ > ‖M‖_∗ for any Z supported in the complement of Ω.
(b) We will construct such a Y using quantum golfing [80].
Consider the following choice for B:
B = [ U U_⊥ ] [ V V_⊥ ]^T = UV^T + U_⊥V_⊥^T.
Claim 8.2.5 ‖B‖ ≤ 1
Proof: By construction, U T U⊥ = 0 and V T V⊥ = 0, and hence the above
expression for B is its singular value decomposition, and the claim now
follows.
Hence we can plug in our choice for B and simplify:
‖M + Z‖_∗ ≥ ⟨M + Z, B⟩
= ⟨M + Z, UV^T + U_⊥V_⊥^T⟩
= ⟨M, UV^T⟩ + ⟨Z, UV^T + U_⊥V_⊥^T⟩
= ‖M‖_∗ + ⟨Z, UV^T + U_⊥V_⊥^T⟩,
where in the last line we use the fact that M is orthogonal to U_⊥V_⊥^T and that
⟨M, UV^T⟩ = ‖M‖_∗. Now, using the fact that Y and Z have disjoint supports (so
that ⟨Z, Y⟩ = 0), we can conclude:
‖M + Z‖_∗ ≥ ‖M‖_∗ + ⟨Z, UV^T + U_⊥V_⊥^T − Y⟩.
Therefore, in order to prove the main result in this section, it suffices to prove
that ⟨Z, UV^T + U_⊥V_⊥^T − Y⟩ > 0. We can expand this quantity in terms of its
projections onto T and T^⊥; this step and the
construction of the helper matrix Y will both make use of the matrix Bernstein
inequality, which we present in the next section.
We will explain how this bound fits into the framework of the matrix
Bernstein inequality, but for a full proof, see [123]. Note that E[P_T R_Ω P_T] =
P_T E[R_Ω] P_T = (m/n²) P_T, and so we just need to show that P_T R_Ω P_T does not
deviate too far from its expectation. Let e1 , e2 , . . . , ed be the standard basis
vectors. Then we can expand:
/ 0
PT (Z) = PT (Z), ea eTb ea eTb
a,b
/ 0
= Z, PT (ea eTb ) ea eTb
a,b
/ 0
Hence R PT (Z) = (a,b)∈Z, PT (ea eTb ) ea eTb , and finally we conclude that
/ 0
PT R PT (Z) = Z, PT (ea eTb ) PT (ea eTb ).
(a,b)∈
We can/ think of PT0 R PT as the sum of random operators of the form τa,b :
Z → Z, PT (ea eTb ) PT (ea eTb ), and the lemma follows by applying the matrix
Bernstein inequality to the random operator (a,b)∈ τa,b .
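As a sanity check on this framework, the sketch below (illustrative, not from the book; the dimensions and sample sizes are arbitrary) estimates E[P_T R_Ω P_T] by Monte Carlo and compares it with (m/n²) P_T applied to a test matrix; the matrix Bernstein inequality quantifies how tightly a single draw of Ω concentrates around this mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 30, 3, 200

# Orthonormal bases U, V for the column and row spaces of a rank-r matrix.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
PU, PV = U @ U.T, V @ V.T

def P_T(Z):
    # Projection onto T: P_U Z + Z P_V - P_U Z P_V.
    return PU @ Z + Z @ PV - PU @ Z @ PV

def R_Omega(Z):
    # Restriction of Z to a fresh uniformly random set Omega of m entries.
    mask = np.zeros(n * n)
    mask[rng.choice(n * n, size=m, replace=False)] = 1.0
    return Z * mask.reshape(n, n)

Z = P_T(rng.standard_normal((n, n)))   # a test matrix already lying in T

# Average P_T R_Omega P_T (Z) over many draws of Omega.
avg = np.mean([P_T(R_Omega(Z)) for _ in range(2000)], axis=0)
expected = (m / n ** 2) * P_T(Z)
print("relative deviation from (m/n^2) P_T(Z):",
      np.linalg.norm(avg - expected) / np.linalg.norm(expected))
```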
We can now complete the deferred proof of part (a):
Lemma 8.3.5 If Ω is chosen uniformly at random and m ≥ nr log n, then with
high probability, for any Z supported in the complement of Ω we have
‖P_{T^⊥}(Z)‖_∗ > √(r/(2n)) · ‖P_T(Z)‖_F.
Proof: Using Lemma 8.3.3 and the definition of the operator norm (see the
remark), we have
⟨Z, (P_T R_Ω P_T − (m/n²) P_T) Z⟩ ≥ −(m/(2n²)) ‖Z‖_F².
Furthermore, we can upper-bound the left-hand side as
⟨Z, P_T R_Ω P_T Z⟩ = ⟨Z, P_T R_Ω² P_T Z⟩ = ‖R_Ω(Z − P_{T^⊥}(Z))‖_F² = ‖R_Ω(P_{T^⊥}(Z))‖_F² ≤ ‖P_{T^⊥}(Z)‖_F²,
where the middle equality uses R_Ω(Z) = 0, which holds because Z is supported in
the complement of Ω.
Y_{i+1} = Y_i + (n²/m) R_{Ω_{i+1}}(W_i)
and update W_{i+1} = UV^T − P_T(Y_{i+1}). It is easy to see that E[(n²/m) R_{Ω_{i+1}}] = I.
Intuitively, this means that at each step Yi+1 − Yi is an unbiased estimator
for Wi , and so we should expect the remainder to decrease quickly (here we
will rely on the concentration bounds we derived from the noncommutative
Bernstein inequality). Now we can explain the nomenclature quantum golfing:
at each step, we hit our golf ball in the direction of the hole, but here our target
is to approximate the matrix UV T , which for various reasons is the type of
question that arises in quantum mechanics.
It is easy to see that Y = Σ_i Y_i is supported in Ω and that P_T(W_i) = W_i for
all i. Hence we can compute
‖P_T(Y_i) − UV^T‖_F = ‖(n²/m) P_T R_{Ω_i}(W_{i−1}) − W_{i−1}‖_F
= ‖(n²/m) P_T R_{Ω_i} P_T(W_{i−1}) − P_T(W_{i−1})‖_F
≤ (n²/m) ‖P_T R_{Ω_i} P_T − (m/n²) P_T‖ · ‖W_{i−1}‖_F ≤ (1/2) ‖W_{i−1}‖_F
where the last inequality follows from Lemma 8.3.3. Therefore, the Frobenius
norm of the remainder decreases geometrically, and it is easy to guarantee that
Y satisfies condition (a).
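The geometric decrease is easy to observe in simulation. Below is a small sketch (illustrative, not from the book; the dimensions and batch sizes are arbitrary, and each step draws a fresh uniformly random batch Ω_i rather than partitioning a fixed Ω) of the golfing iteration, printing the Frobenius norm of the remainder W_i after each correction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 60, 2
batches, m_i = 8, 1200            # |Omega_i| = m_i sampled entries per batch

U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
PU, PV = U @ U.T, V @ V.T
P_T = lambda Z: PU @ Z + Z @ PV - PU @ Z @ PV

target = U @ V.T                   # we want P_T(Y) to approach U V^T
Y = np.zeros((n, n))
W = target - P_T(Y)                # W_0 = U V^T

for i in range(batches):
    mask = np.zeros(n * n)
    mask[rng.choice(n * n, size=m_i, replace=False)] = 1.0
    R_W = W * mask.reshape(n, n)   # R_{Omega_i}(W_i), supported in Omega_i
    Y = Y + (n ** 2 / m_i) * R_W   # unbiased correction toward W_i
    W = target - P_T(Y)            # new remainder
    print(f"step {i + 1}: ||W_i||_F = {np.linalg.norm(W):.3e}")
```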
The more technically involved part is showing that Y also satisfies condition
(b). However, the intuition is that ‖P_{T^⊥}(Y_1)‖ is itself not too large, and since
the norm of the remainder W_i decreases geometrically, we should expect that
‖P_{T^⊥}(Y_i)‖ does too, and so most of the contribution to
‖P_{T^⊥}(Y)‖ ≤ Σ_i ‖P_{T^⊥}(Y_i)‖
comes from the first term. For full details, see [123]. This completes the proof
that computing the solution to the convex program indeed finds M exactly,
provided that M is incoherent and |Ω| ≥ max(μ_1², μ_0) r(n + m) log²(n + m).
Further Remarks
There are many other approaches to matrix completion. What makes the above
argument so technically involved is that we wanted to solve exact matrix
completion. When our goal is to recover an approximation to M, it becomes
much easier to show bounds on the performance of (P1 ). Srebro and Shraibman
[132] used Rademacher complexity and matrix concentration bounds to show
that (P1 ) recovers a solution that is close to M. Moreover, their argument
extends straightforwardly to the arguably more practically relevant case when
M is only entrywise close to being low-rank. Jain et al. [93] and Hardt [83]
gave provable guarantees for alternating minimization. These guarantees are
worse in terms of their dependence on the coherence, rank, and condition
number of M, but alternating minimization has much better running time and
space complexity and is the most popular approach in practice. Barak and
Moitra [26] studied noisy tensor completion and showed that it is possible to
complete tensors better than naively flattening them into matrices, and showed
lower bounds based on the hardness of refuting random constraint satisfaction
problems.
Following the work on matrix completion, convex programs have proven to
be useful in many other related problems, such as separating a matrix into the
sum of a low-rank and a sparse part [44]. Chandrasekaran et al. [46] gave a
general framework for analyzing convex programs for linear inverse problems
and applied it in many settings. An interesting direction is to use reductions and
convex programming hierarchies as a framework for exploring computational
versus statistical trade-offs [29, 45, 24].
Bibliography
[14] S. Arora, R. Ge, and A. Moitra. Learning topic models – going beyond SVD. In
FOCS, pages 1–10, 2012.
[15] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and
overcomplete dictionaries. arXiv:1308.6273, 2013.
[16] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms
for sparse coding. In COLT, pages 113–149, 2015.
[17] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown
Gaussian noise, and implications for Gaussian mixtures and autoencoders. In
NIPS, pages 2384–2392, 2012.
[18] S. Arora, R. Ge, S. Sachdeva, and G. Schoenebeck. Finding overlapping
communities in social networks: Towards a rigorous approach. In EC, 2012.
[19] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians.
Ann. Appl. Probab., 15(1A):69–92, 2005.
[20] M. Balcan, A. Blum, and A. Gupta. Clustering under approximation stability.
J. ACM, 60(2): 1–34, 2013.
[21] M. Balcan, A. Blum, and N. Srebro. On a theory of learning with similarity
functions. Mach. Learn., 72(1–2):89–112, 2008.
[22] M. Balcan, C. Borgs, M. Braverman, J. Chayes, and S.-H. Teng. Finding
endogenously formed communities. In SODA, 2013.
[23] A. Bandeira, P. Rigollet, and J. Weed. Optimal rates of estimation for multi-
reference alignment. arXiv:1702.08546, 2017.
[24] B. Barak, S. Hopkins, J. Kelner, P. Kothari, A. Moitra, and A. Potechin. A nearly
tight sum-of-squares lower bound for the planted clique problem. In FOCS,
pages 428–437, 2016.
[25] B. Barak, J. Kelner, and D. Steurer. Dictionary learning and tensor decomposi-
tion via the sum-of-squares method. In STOC, pages 143–151, 2015.
[26] B. Barak and A. Moitra. Noisy tensor completion via the sum-of-squares
hierarchy. In COLT, pages 417–445, 2016.
[27] M. Belkin and K. Sinha. Toward learning Gaussian mixtures with arbitrary
separation. In COLT, pages 407–419, 2010.
[28] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS,
pages 103–112, 2010.
[29] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse
principal component detection. In COLT, pages 1046–1066, 2013.
[30] A. Bhaskara, M. Charikar, and A. Vijayaraghavan. Uniqueness of tensor
decompositions with applications to polynomial identifiability. In COLT, pages
742–778, 2014.
[31] A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis
of tensor decompositions. In STOC, pages 594–603, 2014.
[32] Y. Bilu and N. Linial. Are stable instances easy? In Combinatorics, Probability
and Computing, 21(5):643–660, 2012.
[33] V. Bittorf, B. Recht, C. Re, and J. Tropp. Factoring nonnegative matrices with
linear programs. In NIPS, 2012.
[34] D. Blei. Introduction to probabilistic topic models. Commun. ACM, 55(4):77–84,
2012.
[35] D. Blei and J. Lafferty. A correlated topic model of science. Ann. Appl. Stat.,
1(1):17–35, 2007.
[36] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res.,
3:993–1022, 2003.
[37] A. Blum, A. Kalai, and H. Wasserman. Noise-tolerant learning, the parity
problem, and the statistical query model. J. ACM, 50:506–519, 2003.
[38] A. Blum and J. Spencer. Coloring random and semi-random k-colorable graphs.
Journal of Algorithms, 19(2):204–234, 1995.
[39] K. Borgwardt. The Simplex Method: A Probabilistic Analysis. New York:
Springer, 2012.
[40] S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In
FOCS, pages 551–560, 2008.
[41] E. Candes and B. Recht. Exact matrix completion via convex optimization.
Found. Comput. Math., 9(6):717–772, 2008.
[42] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and
inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.
[43] E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. Inf.
Theory, 51(12):4203–4215, 2005.
[44] E. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis?
J. ACM, 58(3):1–37, 2011.
[45] V. Chandrasekaran and M. Jordan. Computational and statistical tradeoffs via
convex relaxation. Proc. Natl. Acad. Sci. U.S.A., 110(13):E1181–E1190, 2013.
[46] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky. The convex geometry
of linear inverse problems. Found. Comput. Math., 12(6):805–849, 2012.
[47] J. Chang. Full reconstruction of Markov models on evolutionary trees: Identifia-
bility and consistency. Math. Biosci., 137(1):51–73, 1996.
[48] K. Chaudhuri and S. Rao. Learning mixtures of product distributions using
correlations and independence. In COLT, pages 9–20, 2008.
[49] K. Chaudhuri and S. Rao. Beyond Gaussians: Spectral methods for learning
mixtures of heavy-tailed distributions. In COLT, pages 21–32, 2008.
[50] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit.
SIAM J. Sci. Comput., 20(1):33–61, 1998.
[51] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term
approximation. J. AMS, 22(1):211–231, 2009.
[52] J. Cohen and U. Rothblum. Nonnegative ranks, decompositions and factoriza-
tions of nonnegative matrices. Linear Algebra Appl., 190:149–168, 1993.
[53] P. Comon. Independent component analysis: A new concept? Signal Processing,
36(3):287–314, 1994.
[54] A. Dasgupta. Asymptotic Theory of Statistics and Probability. New York:
Springer, 2008.
[55] A. Dasgupta, J. Hopcroft, J. Kleinberg, and M. Sandler. On learning mixtures of
heavy-tailed distributions. In FOCS, pages 491–500, 2005.
[56] S. Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644, 1999.
[57] S. Dasgupta and L. J. Schulman. A two-round variant of EM for Gaussian
mixtures. In UAI, pages 152–159, 2000.
[79] N. Goyal, S. Vempala, and Y. Xiao. Fourier PCA. In STOC, pages 584–593,
2014.
[80] D. Gross. Recovering low-rank matrices from few coefficients in any basis.
arXiv:0910.1879, 2009.
[81] D. Gross, Y.-K. Liu, S. Flammia, S. Becker, and J. Eisert. Quantum state
tomography via compressed sensing. Phys. Rev. Lett., 105(15):150401, 2010.
[82] V. Guruswami, J. Lee, and A. Razborov. Almost Euclidean subspaces of ℓ₁ⁿ via
expander codes. Combinatorica, 30(1):47–68, 2010.
[83] M. Hardt. Understanding alternating minimization for matrix completion. In
FOCS, pages 651–660, 2014.
[84] R. Harshman. Foundations of the PARAFAC procedure: model and conditions
for an “explanatory” multi-mode factor analysis. UCLA Working Papers in
Phonetics, 16:1–84, 1970.
[85] J. Håstad. Tensor rank is NP-complete. J. Algorithms, 11(4):644–654, 1990.
[86] C. Hillar and L.-H. Lim. Most tensor problems are NP-hard. arXiv:0911.1393v4,
2013.
[87] T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289–296,
1999.
[88] R. Horn and C. Johnson. Matrix Analysis. New York: Cambridge University
Press, 1990.
[89] D. Hsu and S. Kakade. Learning mixtures of spherical Gaussians: Moment
methods and spectral decompositions. In ITCS, pages 11–20, 2013.
[90] P. J. Huber. Projection pursuit. Ann. Stat., 13:435–475, 1985.
[91] R. A. Hummel and B. C. Gidas. Zero crossings and the heat equation. Courant
Institute of Mathematical Sciences, TR-111, 1984.
[92] R. Impagliazzo and R. Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci.,
62(2):367–375, 2001.
[93] P. Jain, P. Netrapalli, and S. Sanghavi. Low rank matrix completion using
alternating minimization. In STOC, pages 665–674, 2013.
[94] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two
Gaussians. In STOC, pages 553–562, 2010.
[95] R. Karp. Probabilistic analysis of some combinatorial search problems. In
Algorithms and Complexity: New Directions and Recent Results. New York:
Academic Press, 1976, pages 1–19.
[96] B. Kashin and V. Temlyakov. A remark on compressed sensing. Manuscript,
2007.
[97] L. Khachiyan. On the complexity of approximating extremal determinants in
matrices. J. Complexity, 11(1):138–153, 1995.
[98] D. Koller and N. Friedman. Probabilistic Graphical Models. Cambridge, MA:
MIT Press, 2009.
[99] J. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions
with applications to arithmetic complexity and statistics. Linear Algebra Appl.,
18(2):95–138, 1977.
[100] A. Kumar, V. Sindhwani, and P. Kambadur. Fast conical hull algorithms for near-
separable non-negative matrix factorization. In ICML, pages 231–239, 2013.
[101] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix
factorization. Nature, 401(6755):788-791, 1999.
[102] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. In NIPS,
pages 556–562, 2000.
[103] S. Leurgans, R. Ross, and R. Abel. A decomposition for three-way arrays. SIAM
J. Matrix Anal. Appl., 14(4):1064–1083, 1993.
[104] M. Lewicki and T. Sejnowski. Learning overcomplete representations. Comput.,
12:337–365, 2000.
[105] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models
of topic correlations. In ICML, pages 633–640, 2007.
[106] B. Lindsay. Mixture Models: Theory, Geometry and Applications. Hayward, CA:
Institute for Mathematical Statistics, 1995.
[107] B. F. Logan. Properties of high-pass signals. PhD thesis, Columbia University,
1965.
[108] L. Lovász and M. Saks. Communication complexity and combinatorial lattice
theory. J. Comput. Syst. Sci., 47(2):322–349, 1993.
[109] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537,
2001.
[110] S. Mallat. A Wavelet Tour of Signal Processing. New York: Academic Press,
1998.
[111] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries.
IEEE Trans. Signal Process., 41(12):3397–3415, 1993.
[112] A. Moitra. An almost optimal algorithm for computing nonnegative rank. In
SODA, pages 1454–1464, 2013.
[113] A. Moitra. Super-resolution, extremal functions and the condition number of
Vandermonde matrices. In STOC, pages 821–830, 2015.
[114] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of
Gaussians. In FOCS, pages 93–102, 2010.
[115] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov
models. In STOC, pages 366–375, 2005.
[116] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course.
New York: Springer, 2004.
[117] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set:
A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[118] C. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic
indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2):217–235, 2000.
[119] Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit:
Recursive function approximation with applications to wavelet decomposition.
Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.
[120] K. Pearson. Contributions to the mathematical theory of evolution. Philos. Trans.
Royal Soc. A, 185: 71–110, 1894.
[121] Y. Rabani, L. Schulman, and C. Swamy. Learning mixtures of arbitrary distribu-
tions over large discrete domains. In ITCS, pages 207–224, 2014.
[122] R. Raz. Tensor-rank and lower bounds for arithmetic formulas. In STOC, pages
659–666, 2010.