Spectral Graph Theory
Spectral graph theory is the study of a graph through the properties of the eigenvalues and eigenvectors of its associated Laplacian matrix. In the following, we use G = (V, E) to represent an undirected n-vertex graph with no self-loops, and write V = {1, . . . , n}, with the degree of vertex i denoted $d_i$. For undirected graphs our convention will be that if there is an edge then both (i, j) ∈ E and (j, i) ∈ E. Thus $\sum_{(i,j)\in E} 1 = 2|E|$. If we wish to sum over edges only once, we will write {i, j} ∈ E for the unordered pair. Thus $\sum_{\{i,j\}\in E} 1 = |E|$.
Note that this is not necessarily a symmetric matrix. But it is a column-stochastic matrix:
each column has non-negative entries that sum to 1. This means that if p ∈ Rn+ is a
probability vector defined over the vertices, then Ap is another probability vector, obtained
by “randomly walking” along an edge, starting from a vertex chosen at random according to
p. We will explore the connection between A and random walks on G in much more detail
in the next lecture.
Finally we introduce the Laplacian matrix, which will provide us with a very useful
quadratic form associated to G:
Definition 4.4 (Laplacian and normalized Laplacian Matrix). The Laplacian matrix is defined as
$$L = D - A.$$
The normalized Laplacian is defined as
$$\mathcal{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}.$$
Note that L and $\mathcal{L}$ are always symmetric. They are best thought of as quadratic forms: for any $x \in \mathbb{R}^n$,
$$x^T L x = \sum_i d_i x_i^2 - \sum_{(i,j)\in E} x_i x_j = \sum_{\{i,j\}\in E} (x_i - x_j)^2.$$
A similar computation holds for the normalized Laplacian:
Claim 4.5. For any $x \in \mathbb{R}^n$,
$$x^T \mathcal{L} x = \sum_{\{i,j\}\in E}\Big(\frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}}\Big)^2. \qquad (4.1)$$
Proof.
$$x^T \mathcal{L} x = x^T x - x^T D^{-1/2} A D^{-1/2} x = \sum_i x_i^2 - \sum_{i,j} \frac{x_i}{\sqrt{d_i}} A_{ij} \frac{x_j}{\sqrt{d_j}} = \sum_i d_i \Big(\frac{x_i}{\sqrt{d_i}}\Big)^2 - \sum_{(i,j)\in E} \frac{x_i}{\sqrt{d_i}}\cdot\frac{x_j}{\sqrt{d_j}} = \sum_{\{i,j\}\in E}\Big(\frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}}\Big)^2.$$
Claim 4.5 provides the following interpretation of the Laplacian: if we think of the vector x as assigning a weight, or "potential", $x_i \in \mathbb{R}$ to every vertex $i \in V$, then the Laplacian measures the average variation of the potential over all edges. The expression $x^T \mathcal{L} x$ will be small when the potential x is close to constant across all edges (when appropriately weighted by the corresponding degrees), and large when it varies a lot, for instance when the potentials associated with the endpoints of an edge have different signs.
We will return to this interpretation soon. Let’s first see some examples. It will be
convenient to always order the eigenvalues of A in decreasing order, µ1 ≥ · · · ≥ µn , and those
of L in increasing order, λ1 ≤ · · · ≤ λn . So what do the eigenvalues of A or L have to say?
Example 4.6. Consider the graph shown in Figure 4.1.
The adjacency matrix is given by
$$A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.$$
We can also compute
$$D = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix} \quad\text{and}\quad \mathcal{L} = \begin{pmatrix} 1 & -1/2 & -1/2 \\ -1/2 & 1 & -1/2 \\ -1/2 & -1/2 & 1 \end{pmatrix}.$$
The eigenvalues of $\mathcal{L}$ are 0 and 3/2 (the latter with multiplicity two), where since the second eigenvalue 3/2 is degenerate we have freedom in choosing a basis for the associated 2-dimensional subspace.
Example 4.8. As a last example, consider the path of length two, pictured in Figure 4.3.
We’ve seen three examples — do you notice any pattern? 0 seems to always be the
smallest eigenvalue. Moreover, in two cases the associated eigenvector has all its coefficients
equal. In the case of the path, the middle coefficient is larger — this seems to reflect the
degree distribution in some way. Anything else? The largest eigenvalue is not always the
same. Sometimes there is a degenerate eigenvalue.
Exercise 1. Show that the largest eigenvalue λn of the normalized Laplacian of a connected
graph G is such that λn = 2 if and only if G is bipartite.
We will see that much more can be read about combinatorial properties of G from L in
a systematic way. The main connection is provided by the Courant-Fischer theorem:
Theorem 4.9 (Variational Characterization of Eigenvalues). Let $M \in \mathbb{R}^{n\times n}$ be a symmetric matrix with eigenvalues µ1 ≥ · · · ≥ µn, and let the corresponding eigenvectors be v1, . . . , vn. Then
$$\mu_1 = \sup_{\substack{x\in\mathbb{R}^n \\ x\neq 0}} \frac{x^T M x}{x^T x}, \qquad \mu_2 = \sup_{\substack{x\in\mathbb{R}^n \\ x\perp v_1}} \frac{x^T M x}{x^T x}, \qquad \ldots \qquad \mu_n = \sup_{\substack{x\in\mathbb{R}^n \\ x\perp v_1,\ldots,v_{n-1}}} \frac{x^T M x}{x^T x} = \inf_{\substack{x\in\mathbb{R}^n \\ x\neq 0}} \frac{x^T M x}{x^T x}.$$
One direction is immediate: the k-th supremum is at least $\mu_k$, because by taking $x = v_k$ and using (4.2) together with $v_i^T v_k = 0$ for $i \neq k$ we immediately get
$$\frac{x^T M x}{x^T x} = \mu_k.$$
To show the reverse inequality, observe that any x such that $x^T x = 1$ and $x \perp v_1, \ldots, v_{k-1}$ can be decomposed as $x = \sum_{j=k}^n \alpha_j v_j$ with $\sum_j \alpha_j^2 = 1$. Now
$$x^T M x = \sum_{i,j=k}^{n} \sum_{l=1}^{n} \mu_l\, \alpha_i \alpha_j\, v_i^T v_l v_l^T v_j = \sum_{l=k}^{n} \mu_l \alpha_l^2 \leq \mu_k.$$
4.2 Eigenvalues of the Laplacian
Using the variational characterization of eigenvalues given in Theorem 4.9, we can connect
the quadratic form associated with the normalized Laplacian in Claim 4.5 to the eigenvalues
of L.
Claim 4.10. For any graph G with normalized Laplacian L, 0 ≤ L ≤ 2I. Moreover, if λ1 is
the smallest eigenvalue of L then λ1 = 0 with multiplicity equal to the number of connected
components of G.
Proof. From (4.1) we see that $x^T \mathcal{L} x \geq 0$ for any x, and using $(a-b)^2 \leq 2(a^2 + b^2)$ we also have $x^T \mathcal{L} x \leq 2 x^T x$. Using the variational characterization
$$\lambda_1 = \inf_{x\neq 0} \frac{x^T \mathcal{L} x}{x^T x}, \qquad \lambda_n = \sup_{x\neq 0} \frac{x^T \mathcal{L} x}{x^T x},$$
where $\lambda_n$ is the largest eigenvalue, we see that $0 \preceq \mathcal{L} \preceq 2I$. To see that $\lambda_1 = 0$ always with multiplicity at least 1 it suffices to consider the vector
$$v_1 = \begin{pmatrix} \sqrt{d_1} \\ \vdots \\ \sqrt{d_n} \end{pmatrix},$$
which by (4.1) satisfies $v_1^T \mathcal{L} v_1 = 0$, and hence $\mathcal{L} v_1 = 0$ since $\mathcal{L} \succeq 0$.
The conductance of S is
$$\phi(S) = \frac{|\partial S|}{\min(d(S), d(V\setminus S))},$$
where $d(S) := \sum_{i\in S} d_i$ is a natural measure of volume: the total number of edges incident on vertices in S. If G is d-regular, then this simplifies to
$$\phi(S) = \frac{|\partial S|}{d\cdot\min(|S|, |V\setminus S|)}.$$
Definition 4.12 (Conductance). The conductance of a graph G is defined as
$$\phi(G) = \min_{\emptyset \subsetneq S \subsetneq V} \phi(S).$$
• If G is a clique, then
$$\phi(G) = \min_{1\leq k\leq n/2} \frac{k(n-k)}{(n-1)k} = \frac{n}{2(n-1)} \approx \frac{1}{2}.$$
• If G is a cycle, then
$$\phi(G) = \min_{1\leq k\leq n/2} \frac{2}{2k} = \frac{2}{n}.$$
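For very small graphs, Definition 4.12 can be checked directly by enumerating all cuts. The following brute-force sketch (my own illustration, exponential in n and meant only for sanity checks) computes φ(G) from an adjacency matrix and reproduces the value 2/n for the cycle.

```python
import itertools
import numpy as np

def conductance(A):
    """Brute-force phi(G) for a small undirected graph with adjacency matrix A."""
    n = A.shape[0]
    d = A.sum(axis=1)
    d_total = d.sum()
    best = float("inf")
    for size in range(1, n):
        for S in itertools.combinations(range(n), size):
            S = set(S)
            boundary = sum(A[i, j] for i in S for j in range(n) if j not in S)
            vol_S = sum(d[i] for i in S)
            best = min(best, boundary / min(vol_S, d_total - vol_S))
    return best

# Example: the 6-cycle; Definition 4.12 predicts phi = 2/6 = 1/3
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
print(conductance(A))   # 0.333...
```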
Exercise 2. Compute the conductance of the hypercube G = (V, E) where $V = \{0,1\}^n$ and $E = \{\{u, v\} : u, v \in V,\ d_H(u, v) = 1\}$, where $d_H$ is the Hamming distance.
The following theorem is a fundamental result relating conductance and the second smallest eigenvalue of the normalized Laplacian.
Theorem 4.14 (Cheeger's inequality). Let G be an undirected graph with normalized Laplacian $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$. Let $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ be the eigenvalues of $\mathcal{L}$. Then
$$\frac{\lambda_2}{2} \leq \phi(G) \leq \sqrt{2\lambda_2}.$$
Remark 4.15. • Both sides of the inequality are interesting. The left-hand side says that if there is a good cut, that is, a cut of small conductance, then there is an eigenvector, orthogonal to the eigenvector associated with the smallest eigenvalue, whose eigenvalue is small. This is called the "easy" side of Cheeger.
• The right-hand side says that if λ2 is small, then there must exist a poorly connected
set. This is called the “hard” side of Cheeger.
• We will give “algorithmic” proofs of both inequalities: for the left-hand side, given a
set S of low conductance we will show how to construct a vector v ⊥ v1 that achieves
a low value in (4.1). For the right-hand side, given a vector v2 ⊥ v1 achieving a low
value in (4.1) we will construct a set S of low conductance.
• The next exercise shows that both sides of the inequality are tight.
Exercise 3. Show that the left-hand side of Cheeger’s inequality is tight by computing the
eigenvalues and eigenvectors of the hypercube (hint: Fourier basis). Show that the right-hand
side is also tight by considering the example of the n-cycle.
Proof of Cheeger's inequality. We first prove the "easy side". Let S be a set of vertices such that φ(S) = φ(G). We claim
$$\lambda_2 = \min_{\substack{x\in\mathbb{R}^n \\ x\perp v_1 = (\sqrt{d_1},\ldots,\sqrt{d_n})^T}} \frac{x^T \mathcal{L} x}{x^T x} \leq 2\phi(G).$$
To see this, set $\sigma = d(S)/d(V\setminus S)$ and consider the test vector x with coordinates $x_i = \sqrt{d_i}$ for $i\in S$ and $x_i = -\sigma\sqrt{d_i}$ for $i\in V\setminus S$; this choice of σ ensures $x \perp v_1$. We then have
$$x^T x = \sum_{i\in S} d_i + \sigma^2 \sum_{i\in V\setminus S} d_i = d(S) + \frac{d(S)^2}{d(V\setminus S)^2}\cdot d(V\setminus S) = \frac{d(S)\, d(V)}{d(V\setminus S)},$$
and
$$x^T \mathcal{L} x = \sum_{\{i,j\}\in E}\Big(\frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}}\Big)^2 = \sum_{\{i,j\}\in \partial S} (1+\sigma)^2 = \sum_{\{i,j\}\in \partial S}\Big(\frac{d(V\setminus S)+d(S)}{d(V\setminus S)}\Big)^2 = |\partial S|\,\frac{d(V)^2}{d(V\setminus S)^2}.$$
This finally implies
$$\frac{x^T \mathcal{L} x}{x^T x} = \frac{|\partial S|\, d(V)}{d(S)\, d(V\setminus S)} \leq \frac{2|\partial S|}{\min(d(S), d(V\setminus S))} = 2\phi(G),$$
where the inequality can be seen by considering the cases $d(S) \leq d(V\setminus S)$ and $d(S) > d(V\setminus S)$ separately.
Now let's turn to the "hard side" of the inequality. Let $y \in \mathbb{R}^n$ be such that
$$\frac{y^T \mathcal{L} y}{y^T y} \leq \lambda_2 \qquad (4.4)$$
and $y \perp v_1 = (\sqrt{d_1}, \ldots, \sqrt{d_n})^T$. Our main idea is going to be to think of the coordinates of y as providing an ordering of the vertices, with each coordinate telling us how likely (small, negative $y_i$) or unlikely (large, positive $y_i$) the vertex is to be in a set with small conductance: this intuition comes from the form of the vector we found in the proof of the easy case, which provides such an ordering.
Rather than use the inequality (4.4) as a starting point, it will be more convenient to work with the analogous formulation for the Laplacian L, so we start with a few manipulations. Let $z = D^{-1/2} y - \sigma \mathbf{1}$ for some σ to be determined soon, where $\mathbf{1} = (1, \ldots, 1)^T$. Since $\mathbf{1}^T L \mathbf{1} = 0$, we see that $z^T L z = y^T \mathcal{L} y$. Moreover, $D^{1/2}\mathbf{1} = v_1$, thus
$$z^T D z = y^T y - 2\sigma \underbrace{v_1^T y}_{=0} + \sigma^2 d(V) \geq y^T y,$$
so that
$$\frac{z^T L z}{z^T D z} \leq \lambda_2.$$
We make the following conventions, without loss of generality: the vertices are ordered so that $z_1 \leq \cdots \leq z_n$; the shift σ is chosen such that $z_{i_0} = 0$, where $i_0$ is an index satisfying
$$d(\{1, \ldots, i_0 - 1\}) \leq \frac{d(V)}{2} \quad\text{and}\quad d(\{i_0, \ldots, n\}) \leq \frac{d(V)}{2}; \qquad (4.5)$$
and z is rescaled so that $z_1^2 + z_n^2 = 1$.
Let $t \in [z_1, z_n]$ be chosen according to the distribution with density $2|t|$ (the scaling on z assumed above ensures that this is a properly normalized probability measure). Observe
that for any a < b,
$$\Pr(t\in[a,b]) = \int_a^b 2|t|\,dt = \begin{cases} b^2 + a^2 & \text{if } a < 0 < b, \\ b^2 - a^2 & \text{if } b > a > 0, \\ a^2 - b^2 & \text{otherwise,}\end{cases} \qquad\text{so that}\quad \Pr(t\in[a,b]) \leq |b-a|\,(|a|+|b|),$$
an inequality that is easily verified in all three cases. For any t, let $S_t = \{i : z_i \leq t\}$. Then
$$\mathbb{E}_t\, d(S_t) = \sum_i \Pr(i\in S_t)\, d_i = \sum_i \Pr(z_i \leq t)\, d_i.$$
Our choice of the index $i_0$ in (4.5) ensures that, if t < 0 then $\min(d(S_t), d(V\setminus S_t)) = d(S_t)$, while if t ≥ 0 then $\min(d(S_t), d(V\setminus S_t)) = d(V\setminus S_t)$. Thus
$$\mathbb{E}_t \min\big(d(S_t), d(V\setminus S_t)\big) = \sum_{i<i_0} \Pr(z_i \leq t \wedge t < 0)\, d_i + \sum_{i\geq i_0} \Pr(z_i > t \wedge t \geq 0)\, d_i = \sum_{i<i_0} z_i^2 d_i + \sum_{i\geq i_0} z_i^2 d_i \qquad (4.6)$$
$$= z^T D z. \qquad (4.7)$$
Next we compute
$$\mathbb{E}_t\, |\partial S_t| = \sum_{\{i,j\}\in E} \Pr(z_i \leq t \leq z_j) \leq \sum_{\{i,j\}\in E} |z_j - z_i|\,(|z_i|+|z_j|) \leq \underbrace{\sqrt{\sum_{\{i,j\}\in E} (z_i - z_j)^2}}_{\sqrt{z^T L z}}\; \underbrace{\sqrt{\sum_{\{i,j\}\in E} (|z_i|+|z_j|)^2}}_{\leq \sqrt{2\sum_{\{i,j\}\in E}(|z_i|^2+|z_j|^2)} = \sqrt{2 z^T D z}} \leq \sqrt{2\lambda_2}\; z^T D z, \qquad (4.8)$$
where the second inequality is Cauchy-Schwarz.
From there we deduce that there exists a choice of t such that
$$\phi(S_t) \leq \sqrt{2\lambda_2},$$
which immediately gives us that $\phi(G) \leq \sqrt{2\lambda_2}$, as desired. (Indeed, since $\mathbb{E}_t |\partial S_t| \leq \sqrt{2\lambda_2}\, \mathbb{E}_t \min(d(S_t), d(V\setminus S_t))$, there must be some t for which $|\partial S_t| \leq \sqrt{2\lambda_2}\, \min(d(S_t), d(V\setminus S_t))$.)
We note that the proof given above is algorithmic, in that it describes an efficient algorithm that, given a graph, will produce a set with conductance at most $\sqrt{2\lambda_2}$: simply compute an eigenvector associated with the second smallest eigenvalue (using, e.g., the power method), and output the set $S_t$ which has smallest conductance among the n possibilities; this can be checked efficiently.
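Here is a sketch of the resulting "sweep cut" procedure in NumPy (my own code, not from the notes; for simplicity it uses a dense eigensolver instead of the power method mentioned above). It follows the proof: compute the second eigenvector y of the normalized Laplacian, sort the vertices by $z = D^{-1/2} y$, and return the best of the n threshold cuts $S_t$.

```python
import numpy as np

def sweep_cut(A):
    """Return a cut S with phi(S) <= sqrt(2 * lambda_2), following the proof above."""
    n = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt     # normalized Laplacian

    eigvals, eigvecs = np.linalg.eigh(L_norm)            # eigenvalues in ascending order
    y = eigvecs[:, 1]                                    # eigenvector for lambda_2
    z = y / np.sqrt(d)                                   # z = D^{-1/2} y, as in the proof

    order = np.argsort(z)                                # sweep vertices by increasing z_i
    d_total = d.sum()
    best_phi, best_S, vol = np.inf, None, 0.0
    for k in range(n - 1):
        S = order[: k + 1]
        vol += d[order[k]]
        boundary = A[np.ix_(S, order[k + 1:])].sum()     # edges leaving S
        phi = boundary / min(vol, d_total - vol)
        if phi < best_phi:
            best_phi, best_S = phi, set(S.tolist())
    return best_S, best_phi
```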
4.3 Sparsity
The sparsity is another natural measure of "disconnectedness" of a graph, which turns out to be very closely related to the conductance.
Definition 4.16. For a d-regular graph G and ∅ ⊊ S ⊊ V, the sparsity of S is defined as
$$\sigma(S) = \frac{\mathbb{E}_{(i,j)\in E}\,|1_S(i) - 1_S(j)|}{\mathbb{E}_{(i,j)\in V^2}\,|1_S(i) - 1_S(j)|} = \frac{\frac{1}{dn}|\partial S|}{\frac{2}{n^2}|S|\cdot|V\setminus S|} = \frac{|V|\,|\partial S|}{2d\,|S|\,|V\setminus S|},$$
and so
$$\frac{\phi(S)}{2} \leq \sigma(S) \leq \phi(S).$$
Note that the definition of the sparsity lets us give a different, almost immediate proof of the "easy" side of Cheeger's inequality:
$$\sigma(G) = \min_S \sigma(S) = \min_{\substack{x\in\{0,1\}^n \\ x\neq 0,\, \mathbf{1}}} \frac{\frac{1}{dn}\sum_{(i,j)\in E} |x_i - x_j|^2}{\frac{2}{n^2}\sum_{(i,j)\in V^2} |x_i - x_j|^2} \geq \min_{\substack{x\in\mathbb{R}^n \\ x\neq 0,\, x\perp \mathbf{1}}} \frac{\frac{2}{dn}\sum_{\{i,j\}\in E} (x_i - x_j)^2}{\frac{2}{n^2}\sum_{(i,j)\in V^2} (x_i - x_j)^2}$$
$$= \min_{\substack{x\in\mathbb{R}^n \\ x\neq 0,\, x\perp \mathbf{1}}} \frac{x^T \mathcal{L} x}{\frac{1}{n}\sum_{(i,j)\in V^2} (x_i - x_j)^2} = \min_{\substack{x\in\mathbb{R}^n \\ x\neq 0,\, x\perp \mathbf{1}}} \frac{x^T \mathcal{L} x}{\frac{1}{n}\big(2n\sum_{i\in V} x_i^2 - 2\sum_{i,j} x_i x_j\big)} = \min_{\substack{x\in\mathbb{R}^n \\ x\neq 0,\, x\perp \mathbf{1}}} \frac{x^T \mathcal{L} x}{2\, x^T x} = \frac{\lambda_2}{2},$$
where the inequality follows since we are taking the minimum over a larger set (the constraint x ⊥ 1 is without loss of generality since the whole expression is invariant under translation of the vector x by an additive constant multiple of 1).
Using the above inequality σ(G) ≤ φ(G), we have re-proven the left-hand side of Cheeger's inequality, simply by observing that the second eigenvalue of $\mathcal{L}$ could be seen as a natural relaxation of the sparsity, itself very closely related to the conductance.
The second eigenvalue gives us a good approximation to the conductance in case λ2 is not too small, say it is a constant. If however λ2 goes to 0 with n, say λ2 ∼ 1/n, then the approximation can be very bad: it can be off by a multiplicative factor $\lambda_2^{-1/2} \sim \sqrt{n}$.
Here are two other relaxations that have been considered, and do much better. The first one is due to Leighton and Rao, and can be defined as
$$\mathrm{LR}(G) = \min_{\substack{w\in\mathbb{R}^{n\times n},\; w_{ij}\geq 0,\; w_{ii}=0 \\ w_{ij}\leq w_{ik}+w_{kj}}} \frac{\mathbb{E}_{(i,j)\in E}\, w_{ij}}{\mathbb{E}_{(i,j)\in V^2}\, w_{ij}}. \qquad (4.9)$$
This can be interpreted as a minimization over all semi-metrics: distance measures $d(i,j) = w_{ij}$ on the graph that are always non-negative and satisfy the triangle inequality. One example of such a metric is $(i,j)\mapsto w_{ij} = |x_i - x_j|$ (for any fixed vector x), but there are others. The advantage of allowing all semi-metrics is that LR(G) can be computed using a linear program. Moreover, Leighton and Rao showed that
$$O(\log n)\,\mathrm{LR}(G) \;\geq\; \sigma(G) \;\geq\; \mathrm{LR}(G),$$
thus we get a much tighter approximation to the sparsity than the one given by λ2 in cases when the sparsity is small. An even tighter relaxation was introduced by Arora, Rao and Vazirani, who considered
$$\mathrm{ARV}(G) = \min_{\substack{w\in\mathbb{R}^{n\times n},\; w_{ij}\geq 0,\; w_{ii}=0 \\ w_{ij}^2\leq w_{ik}^2+w_{kj}^2}} \frac{\mathbb{E}_{(i,j)\in E}\, w_{ij}}{\mathbb{E}_{(i,j)\in V^2}\, w_{ij}}. \qquad (4.10)$$
its negation), is it possible to find an assignment to the variables that satisfies all clauses?
For k ≥ 3 the problem is NP-hard, but for k = 2 there are efficient algorithms.
Here is a simple candidate. Start with a random assignment to the variables. At each
step, choose a clause that is not satisfied. Pick one of the two variables it acts on at random,
and flip it. Repeat.
Is this going to work? And if so, how long will it take? Here is the key idea. Suppose there exists a satisfying assignment, and fix it. Now consider the distance between the current assignment, maintained by the algorithm, and this satisfying assignment. This distance is an integer between 0 and n, and at each step it either increases or decreases by 1. If a clause is violated, at least one of the two variables involved must have a different value in the current assignment than in the satisfying assignment. With probability 1/2 we flip this variable, so that at each step with probability at least 1/2 we decrease the distance by 1. How many steps will it take to find the satisfying assignment? The answer is O(n²), and we'll see how to show this very easily once we've covered some of the basics of the analysis of random walks on arbitrary graphs. (Note the algorithm we just described is not the best, and it is possible to solve 2SAT in deterministic linear time...)
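A minimal sketch of the random-walk algorithm just described (my own code; the clause representation as pairs of signed literals and the 2n² step budget are choices made for the illustration, not part of the notes):

```python
import random

def random_walk_2sat(n, clauses, max_steps=None):
    """Clauses are pairs of literals; literal +i / -i means x_i is true / false (1-indexed).
    Returns a satisfying assignment, or None if none is found within the step budget."""
    if max_steps is None:
        max_steps = 2 * n * n                  # O(n^2) steps suffice with good probability
    assign = {i: random.random() < 0.5 for i in range(1, n + 1)}

    def satisfied(lit):
        return assign[abs(lit)] == (lit > 0)

    for _ in range(max_steps):
        unsat = [c for c in clauses if not (satisfied(c[0]) or satisfied(c[1]))]
        if not unsat:
            return assign
        clause = random.choice(unsat)          # pick a violated clause
        var = abs(random.choice(clause))       # flip one of its two variables at random
        assign[var] = not assign[var]
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(random_walk_2sat(3, [(1, 2), (-1, 3), (-2, -3)]))
```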
which in matrix form can be written as p(1) = AD−1 p(0) . This version of the random walk
has one major drawback, which is that it does not always converge: consider for instance a
graph with a single edge, or more generally a bipartite graph; the walk started at a vertex on
the left will continue hopping back and forth between left and right without ever converging.
To overcome this issue it is customary to consider instead the lazy random walk: with
probability 1/2, do not move, and with probability 1/2, do as before. The update rule is
then
$$p^{(t)} = \Big(\frac{I}{2} + \frac{1}{2} A D^{-1}\Big) p^{(t-1)},$$
which in matrix form is $p^{(t)} = W p^{(t-1)}$ where W is the random walk matrix:
Definition 4.17. Let G = (V, E) be an n-vertex weighted, undirected graph with weights $w_{ij} \geq 0$. Define the lazy random walk on G: if $p \in \mathbb{R}^n_+$ is a distribution on the vertices V = {1, . . . , n}, one step of the random walk brings p to Wp where
$$W = \frac{1}{2} I + \frac{1}{2} A D^{-1}.$$
Note that W is not symmetric, so the spectral theorem does not directly apply to it. But observe that
$$W = D^{1/2}\Big(I - \frac{1}{2}\big(I - D^{-1/2} A D^{-1/2}\big)\Big) D^{-1/2} = D^{1/2}\Big(I - \frac{\mathcal{L}}{2}\Big) D^{-1/2},$$
where $\mathcal{L}$ is the normalized Laplacian associated with G. Thus if v is an eigenvector for $\mathcal{L}$ with eigenvalue λ, then $w = D^{1/2} v$ is a right eigenvector for W with eigenvalue (1 − λ/2); thus W has n eigenvalues $w_i = 1 - \lambda_i/2$ that are directly related to those of the normalized Laplacian. This will let us transfer our understanding of the $\lambda_i$ to derive convergence properties of the random walk.
Therefore,
$$\|W^t p - \pi\|_1 = \Big\|\sum_{i\geq 2} w_i^t \alpha_i D^{1/2} v_i\Big\|_1 \leq \sqrt{n}\,\Big\|\sum_{i\geq 2} \alpha_i w_i^t D^{1/2} v_i\Big\|_2 \leq \sqrt{n}\,\sqrt{d_{\max}}\Big(\sum_{i\geq 2} \alpha_i^2 w_i^{2t}\Big)^{1/2} \leq \sqrt{n}\,\sqrt{n}\; w_2^t,$$
where for the last step we bounded the maximum degree by n and used that $\sum_{i\geq 2}\alpha_i^2 \leq 1$ together with $0 \leq w_i \leq w_2$ for all $i \geq 2$.
• For the n-vertex path $P_n$, we saw $\lambda_2 \sim 1/n^2$, and you can check that $\tau_\varepsilon \geq n^2 \log(\frac{1}{\varepsilon})$: starting on the leftmost vertex, it takes $\sim n^2$ steps before we hit the rightmost vertex, and the log(1/ε) term is overhead before the walk becomes sufficiently close to uniform. Note this proves the bound on the convergence time for the randomized algorithm for 2-SAT we saw earlier.
• For the dumbbell graph $K_{n/3}$–$P_{n/3}$–$K_{n/3}$ (two (n/3)-cliques linked by a path of length n/3), the conductance is at most $\sim 1/n^2$ (cut in the middle), so by Cheeger $\lambda_2 = O(1/n^2)$. Consider a random walk starting on the left-most vertex. Then, one step leads to a uniform distribution on the left $K_{n/3}$ clique. From there the probability to enter the bridge is $\sim 1/n^2$, and moreover the probability of making it through the bridge without falling back into the left clique is about $\sim 1/n$. Using this intuition you can show that the mixing time is of order $n^3$, so that in fact $\lambda_2 = O(1/n^3)$.
[Figure: a path on vertices $v_1, v_2, \ldots, v_n$]
How bad can it get? From the definition we see that the conductance φ(G) is always at least $1/n^2$, so by Cheeger's inequality as long as G is connected we have $\lambda_2 = \Omega(n^{-4})$. In fact we can prove something slightly better.
Claim 4.22. Let G be a connected, unweighted graph, λ2 the second smallest eigenvalue of the normalized Laplacian and r the diameter of G. Then
$$\lambda_2 \geq \frac{2}{r(n-1)^2}.$$
Proof. For any two vertices u, v let $E^{u,v}$ be the Laplacian associated to the graph having the single edge {u, v}, and $P^{u,v}$ the Laplacian of a path of length at most r from u to v in G. Then $E^{u,v} \preceq r\, P^{u,v}$, as can be seen from the associated quadratic forms: $(x_u - x_v)^2 \leq r\big((x_u - x_{u_1})^2 + \cdots + (x_{u_{r-1}} - x_v)^2\big)$, by Cauchy-Schwarz. Since any pair of vertices is connected by a path of length at most r in G, $L_{K_n} \preceq r\binom{n}{2} L_G$, where $K_n$ is the clique on n vertices. Since the normalized Laplacian for $K_n$ has second smallest eigenvalue $\frac{n}{n-1}$, we get the claimed bound on λ2.
that is distributed according to a distribution p = p(x, δ) such that $\|p - \pi\|_1 \leq \delta$, where f(x) = (S, π).
A famous theorem by Jerrum, Valiant and Vazirani shows that approximate counting
and approximate sampling are essentially equivalent:
Theorem 4.24. For “nicely behaved” counting problems (the technical term is “downward
self-reducible”), the existence of an FPRAS is equivalent to the existence of an FPAUS.
Proof sketch. For concreteness we prove the theorem for the problem of counting the number
of satisfying assignments to any formula ϕ.
FPAUS =⇒ FPRAS. Take a polynomial number of satisfying assignments for ϕ sampled by the FPAUS. This lets us estimate p0 and p1, the fraction of satisfying assignments in which x1 = 0 and x1 = 1 respectively. Assume p0 ≥ 1/2, the other case being symmetric. Make a recursive call to approximate the number of satisfying assignments to ϕ|x1=0. Let N̂0 be the estimate returned, and output N̂0/p0.
It is clear that this is correct in expectation. Moreover, using that p0 is not too small,
a Chernoff bound shows that the estimate obtained from the FPAUS samples will be very
accurate with good probability. It is then not hard to prove by induction that provided a
polynomial number of samples are taken at each of the n recursive calls (where n is the
number of variables), the overall estimate can be made sufficiently accurate, with good
probability.
FPRAS =⇒ FPAUS. First we run the FPRAS to obtain good estimates for the number of
satisfying assignments N to ϕ and N0 to ϕ|x1 =0 . We can assume the estimates returned are
such that N̂0 /N̂ ≥ 1/2, as otherwise we exchange the roles of 0 and 1. Next we flip a coin
with bias N̂0 /N̂ . If it comes up heads we set x1 = 0 and recurse; if it comes up tails we set
x1 = 1 and recurse. Provided the estimates N̂ , N̂0 , . . . are accurate enough the distribution
on assignments produced by this procedure will be close in statistical distance to the uniform
distribution on satisfying assignments.
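As a rough illustration of the FPAUS ⇒ FPRAS direction, here is a sketch in Python. The sampler `sample_satisfying(phi)` (returning an approximately uniform satisfying assignment as a dict) and the helper `restrict(phi, var, val)` are hypothetical placeholders standing in for the FPAUS and for formula restriction; error handling and the precise number of samples are glossed over.

```python
def approx_count(phi, variables, sample_satisfying, restrict, samples=10000):
    """Estimate the number of satisfying assignments of `phi`, given an approximately
    uniform sampler over its satisfying assignments (the FPAUS)."""
    if not variables:
        return 1                                            # empty formula: one (empty) assignment
    x1, rest = variables[0], variables[1:]
    draws = [sample_satisfying(phi) for _ in range(samples)]
    p0 = sum(1 for a in draws if a[x1] == 0) / samples      # fraction of assignments with x1 = 0
    if p0 >= 0.5:
        return approx_count(restrict(phi, x1, 0), rest, sample_satisfying, restrict, samples) / p0
    return approx_count(restrict(phi, x1, 1), rest, sample_satisfying, restrict, samples) / (1 - p0)
```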
• Membership oracle (resp. weak membership oracle): given a query x ∈ Rn , the oracle
answers whether x ∈ K (resp. x ∈ K or d(x, K) ≥ ε; the oracle is allowed to fail
whenever neither condition is satisfied).
• Separation oracle (resp. weak separation oracle): given a query x, the oracle returns the same answer as the membership oracle, but in case x ∉ K (resp. d(x, K) ≥ ε) it also returns a $y \in \mathbb{R}^n$ such that $y^T z > y^T x$ for all z ∈ K (resp. $y^T z > y^T x - \varepsilon/2$ for all z ∈ K).
It is always possible to derive a weak separation oracle from a weak membership oracle,
and in this lecture we won’t worry about the difference — in fact all we’ll need is a weak
membership oracle.
But is this enough? What if we query x, and we learn x ∉ K, with y = (1, 0, . . . , 0)? Then we know we should increase x1. Say we double every coordinate, to x′ = 2x. But suppose we get the same answer, again and again. We never know how far K is! It seems like a boundedness assumption is necessary, so we'll assume the following: there exist (known) values r, R > 0 such that $B_\infty(0, r) \subseteq K \subseteq B_\infty(0, R)$ with $R/r < 2^{\mathrm{poly}(n)}$, where $B_\infty(0, r)$ denotes the ball of radius r for the $\ell_\infty$ norm (a square box with sides of length 2r). In fact, by scaling we may as well assume r = 1, and for simplicity we'll also assume R = n². It is not so obvious at first that this is without loss of generality, but it is; with some further re-scaling and shifting of things around it is not too hard to reduce to this case.
(The idea is to slowly grow a simplex inside K. At each step we can perform a change of basis so that $\mathrm{Conv}(e_1, \ldots, e_n) \subseteq K$. Then for i = 1, . . . , n we check if there is a point $x \in K$ such that $|x_i| \geq 1 + 1/n^2$. If so we include it, rescale, and end up with a simplex of volume at least $1 + 1/n^2$ times the previous one, and this guarantees we won't have to go through too many steps. If there is no such point, we stop, as we've achieved the desired ratio. Finally we need to check that we can find such a point, if it exists, in polynomial time; this is the case as if $x \in K$ then $(0, \ldots, 0, x_i, 0, \ldots, 0) \in K$ as well (using convexity and the assumption that K contains the simplex), so it suffices to call the membership oracle n times.)
Now that we have a proper setup in place, can we estimate Vol(K)? Say we only want
a multiplicative approximation. An idea would be to use the separation oracle to get some
kind of a rough approximation of the boundary of K using hyperplanes, then do some kind
of triangulation, and estimate the volume by counting tetrahedrons (or, in dimension n,
simplices). In fact there is a very strong no-go theorem for deterministic algorithms:
Theorem 4.25. For any deterministic polynomial-time algorithm such that on input a convex body $K \subseteq \mathbb{R}^n$ (specified via a separation oracle) returns α(K), β(K) such that α(K) ≤ Vol(K) ≤ β(K), it must be that there exists a constant c > 0 and a sequence of convex bodies $\{K_n \subseteq \mathbb{R}^n\}_{n\geq 1}$ such that for all n ≥ 1,
$$\frac{\beta(K_n)}{\alpha(K_n)} \geq \Big(\frac{c\, n}{\log n}\Big)^n.$$
This is very bad: even an approximation within an exponential factor is ruled out! Note
however that a key to the above result is that the only access to K is given by a separation
oracle — if we have more knowledge about K then a polynomial-time algorithm might be
feasible (though we don’t know any).
Proof idea. The idea for the proof is to design an oracle that answers the queries made by
any deterministic algorithm in a way that is consistent with the final convex body being one
of two possible bodies, K or K ◦ , whose volume ratio is exponentially large; if we manage to
do this then the algorithm cannot provide an estimate that will be accurate for both K and
K ◦.
The oracle is very simple: upon any query $x \in \mathbb{R}^n$ it is very generous and says that $x/\|x\| \in K$, $-x/\|x\| \in K$, and moreover K is included in the "slab" $\{y : -\|x\| \leq \langle y, x\rangle \leq \|x\|\}$. Note that these answers are all consistent with K being the Euclidean unit ball. Now, if points $x_1, \ldots, x_m$ have been queried, define K to be the convex hull of the $\pm x_i/\|x_i\|$ and the $\pm e_j$, where the $e_j$ are the unit basis vectors. Define $K^\circ = \{y : \langle y, z\rangle \leq 1\ \forall z \in K\}$. Then you can check that the oracle's answers are all consistent with both K and $K^\circ$. But their volumes are very different, and one can show that $\mathrm{Vol}(K^\circ)/\mathrm{Vol}(K)$ is roughly of order $(n/\log(m/n))^n$; as long as m is not exponential in n this is exponentially large.
If we allow randomized algorithms the situation is much better:
Theorem 4.26 (Dyer, Frieze, Kannan 1991). There exists a fully polynomial randomized
approximation scheme for approximating Vol(K).
A fully polynomial randomized approximation scheme (FPRAS) means that ∀ε, δ > 0
the algorithm returns a (1 ± ε)-multiplicative approximation to the volume with probability
at least 1 − δ, and runs in time poly(n, 1/ε, log 1/δ). Volume estimation is one of these
relatively rare problems for which we have strong indication that randomized algorithms can
be exponentially more efficient than deterministic ones (primality testing used to be another
such problem before the AKS algorithm was discovered!).
The algorithm of Dyer, Frieze and Kannan had a running time that scaled like $\sim n^{23}$. Since then a lot of work has been done on the problem, and the current record is $\sim n^5$. In principle it's possible this could be lowered even more, to say $\sim n^2$; there are no good lower bounds for this problem.
Proof sketch. The main idea is to use random sampling. For instance, suppose we’d place
K inside a large, fine grid, as in Figure 4.4.
Figure 4.4: The region K is the feasible region.
We could then run a random walk on the grid until it mixes to uniform. This will take time roughly $n(R/\delta)^2$, where δ is the grid spacing; given we assumed r = 1, something like $\delta = 1/n^2$ would be reasonable.
Exercise 4. Show that the mixing time of the lazy random walk on $\{1, \ldots, N\}^n$, the n-dimensional grid with sides of length N, is $O(n^2 N^2)$.
At the end of the walk we can call the membership oracle to check if we are in K. Since $\Pr(x \in K) \sim \frac{\mathrm{Vol}_n(K)}{\mathrm{Vol}_n(\mathrm{grid})}$, by repeating the walk sufficiently many times we'd get a good estimate. While this works fine in two dimensions (you can estimate π = Area(unit disk) in this way!), in higher dimensions it fails dramatically, as all we know about Vol(K) is that it is at least $2^n$ (since it contains the unit ball for $\ell_\infty$), but the grid could have volume as large as $(2R)^n = (2n^2)^n$, so even assuming perfectly uniform mixing the probability that we actually obtain a point in K is tiny: it's exponentially small.
This is still roughly how we’ll proceed, but we’re going to have to be more careful. There
are three important steps. Here is a sketch:
• Step 1: Subdivision.
Set $K_0 = B_\infty(0,1)\cap K = B_\infty(0,1)$, $K_1 = B_\infty(0, 2^{1/n})\cap K$, . . . , $K_{2n\log n} = B_\infty(0, n^2)\cap K = K$. Then
$$\mathrm{Vol}(K) = \mathrm{Vol}(K_{2n\log n}) = \frac{\mathrm{Vol}(K_{2n\log n})}{\mathrm{Vol}(K_{2n\log n - 1})}\cdot\frac{\mathrm{Vol}(K_{2n\log n - 1})}{\mathrm{Vol}(K_{2n\log n - 2})}\cdots\frac{\mathrm{Vol}(K_1)}{\mathrm{Vol}(K_0)}\cdot 2^n,$$
since $\mathrm{Vol}(K_0) = 2^n$. So, we have reduced our problem to the following: given K ⊆ L both convex such that Vol(K) ≥ ½ Vol(L), estimate Vol(K)/Vol(L). This eliminates the "tiny ratio" issue we had initially, but now we have another problem: the enclosing set L is no longer a nice grid, but is an arbitrary convex set itself. Are we making any progress?
• Step 2: A random walk.
Our strategy will be as follows. Run a random walk on a grid that contains L, such that the stationary distribution of the random walk satisfies the two conditions that Pr(x ∈ L) is not too small, and the stationary distribution is close to uniform, conditioned on lying in L. If we can do this we're done: we repeatedly sample from the stationary distribution sufficiently many times that we obtain many samples in L, and we check the fraction of these samples that are also in K:
$$\frac{\mathrm{Vol}(K)}{\mathrm{Vol}(L)} \sim \frac{\Pr(x\in K)}{\Pr(x\in L)} = \Pr(x\in K \mid x\in L).$$
So the challenge is to figure out how to define this random walk around L. Here is a natural attempt. Start at an arbitrary point $x^{(0)} \in L$, say the origin. Set $x^{(1)}$ to be a random neighbor of $x^{(0)}$ on the grid, subject to $x^{(1)} \in L$ (we have 2n neighbors to consider, and for each we can call the membership oracle for L). Repeat sufficiently many times. This is the right idea (note that we really want to stay as close to L as possible, because if we allow ourselves to go outside too much we'll get this "tiny ratio" issue once more), but the boundary causes a lot of problems:
(a) Some points are never reached. L could be very pointy, in which case there could
be a grid point that lies in L, but none of its neighbors does. And this cannot be
solved just by making the grid finer; it is really an issue with the kinds of angles
that are permitted in L.
(b) The degree of the graph underlying our walk is not constant (it tends to be smaller
close to the boundary), so the stationary distribution will not be uniform.
(c) Some grid cubes have much bigger intersection with L than others.
It turns out we can fix all of these issues by doing a bit of "smoothing out" on L. Let δ be the width of the grid, and consider $L' = (1 + \delta\sqrt{n})L$, where $\delta\sqrt{n}$ is the diameter of a cube. Assuming $\delta \leq n^{-2}$, this doesn't blow up the volume by much, so in terms of volume ratio we're fine. Moreover, you can check easily that:
– Any grid point inside L has all of its neighbors in L′,
– All p ∈ L belong to a grid cube ⊆ L′.
These two points get rid of issue (a) above: all points in L are now reached by the walk. Moreover, we can easily get rid of the degree issue by adding self-loops. This guarantees that the stationary distribution will be uniform on grid points in L, clearing (b). There remains (c), the issue of uneven intersection between grid cubes and L. For this we do the following:
– Do a random walk on $\delta\mathbb{Z}^n \cap L'$ as described before.
– Arrive at a random grid point p. Choose a random vector $q \in B_\infty(0, 1)$ and output the point $p + \delta q$ if it is in L. Otherwise, restart the walk.
(A rough sketch of this sampler, under illustrative assumptions, appears right after this list.)
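The sketch below is my own illustration, with several simplifications: `in_body` is a placeholder membership oracle for the smoothed body L′, the lazy step stands in for the self-loops mentioned above, and the parameters in the toy example are arbitrary.

```python
import numpy as np

def grid_walk_sample(in_body, n, delta, start, num_steps):
    """Random walk on the grid delta*Z^n restricted to the (smoothed) body.
    Returns an approximately uniform point of the body, or None if the final
    smoothing step leaves the body (meaning: restart the walk)."""
    x = np.array(start, dtype=float)
    for _ in range(num_steps):
        if np.random.rand() < 0.5:            # self-loop (plays the role of the added self-loops)
            continue
        i = np.random.randint(n)              # pick one of the 2n grid neighbors
        step = delta if np.random.rand() < 0.5 else -delta
        y = x.copy()
        y[i] += step
        if in_body(y):                        # only move if the neighbor is inside the body
            x = y
    q = np.random.uniform(-1.0, 1.0, size=n)  # q in the ball B_infinity(0, 1)
    z = x + delta * q                         # smooth over the grid cube around the endpoint
    return z if in_body(z) else None          # None signals "restart the walk"

# Toy example: the body is the box [-2, 2]^3
sample = grid_walk_sample(lambda z: bool(np.all(np.abs(z) <= 2.0)),
                          n=3, delta=0.25, start=np.zeros(3), num_steps=5000)
print(sample)
```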
If we look at a body L that is a very thin needle, we see that for S that cuts across half
the needle the volume will be roughly (d/2) times the area, so the estimate provided
in the theorem is essentially optimal. The proof of the theorem is not hard but it does
involve quite a bit of re-arranging to argue that the needle is indeed the “worst-case
scenario”, and we’ll skip it.
In our setting we have $B_\infty(0,1) \subseteq K \subseteq L \subseteq L' = (1+\delta\sqrt{n})L \subseteq (1+\delta\sqrt{n})B_\infty(0, n^2)$. So, our grid has sides of length at most $\lesssim \frac{2(1+\delta\sqrt{n})n^2}{\delta} = O(n^4)$, and the diameter is $O(n^4)$. Thus the isoperimetry theorem implies that our random walk will mix in polynomial time.
a graph on V and select the points with a random walk. What types of graphs would be good for this procedure? We could throw in all edges; this would optimize the mixing time. But then at any step we need to choose amongst N possible neighbors, and this can be a complicated task, especially in situations such as we encountered in the last lecture, where determining if a point is a neighbor requires some computation (in that case, we had to make a call to the membership oracle, which could be expensive). To reduce the number of choices that the random walk has to make at each step, we want to minimize the degree. But we don't want to sacrifice the mixing time either. Expander graphs are often used because they reach the optimal tradeoff, achieving the best possible mixing time (or more precisely, the largest possible second eigenvalue) for a fixed degree.
Definition 4.28. Given d ∈ N and γ ∈ (0, 1), a one-sided (resp. two-sided) (d, γ) spectral expander is a graph G such that:
• G is d-regular.
• $|\lambda_2(\mathcal{L}) - 1| \leq \gamma$ (resp. $\forall i \geq 2$, $|\lambda_i(\mathcal{L}) - 1| \leq \gamma$), where $\mathcal{L}$ is the normalized Laplacian of G.
Two-sided expanders have all their eigenvalues close to 1, except λ1 = 0; one-sided ex-
panders can also have larger eigenvalues, up to the maximum possible of 2 (and in particular
they can be bipartite).
Remark 4.29. Using the mixing lemma from last lecture we find that the lazy random walk on expanders mixes fast:
$$\tau = O\Big(\frac{\log n}{1-\gamma}\Big).$$
For the case of expanders (and especially two-sided expanders, which are guaranteed not to be bipartite) we will usually run a normal random walk with walk matrix $W = AD^{-1}$, rather than the lazy random walk with walk matrix $W = I - \frac{1}{2}D^{1/2}\mathcal{L}D^{-1/2}$ that we considered previously.
Definition 4.31. A (d, γv ) vertex expander is a graph G such that:
Therefore $d^2(n-1)\gamma^2 \geq nd - d^2$, i.e.
$$\gamma^2 \geq \frac{n}{d(n-1)} - \frac{1}{n-1}.$$
For large n the last term is very small, and we get the bound $\gamma \geq (1 - o_n(1))\sqrt{1/d}$, which is off by just a factor 2.
Ramanujan graphs are graphs that achieve the optimal expansion for a given degree:
Unfortunately random graphs are not so useful in practice because they are just that: random. In particular, working with a random graph requires first computing the whole graph, then storing it in memory and performing the random walk. But recall that for our typical applications of expanders we are thinking of working with potentially large graphs for which we'd like to be able to compute the neighborhood structure locally very efficiently. Such a construction was presented by Margulis; efficiency aside, it is the first explicit construction of a good (meaning that both d and γ are constants independent of the size of the graph) family of expanders:
Example 4.35 (Margulis '73). Take $V = \mathbb{Z}_m \times \mathbb{Z}_m$ for some integer m. For each vertex $(x, y) \in V$ connect it to the following eight vertices:
$$N((x,y)) = \big\{(x,\, y\pm x),\; (x,\, y\pm(x+1)),\; (x\pm y,\, y),\; (x\pm(y+1),\, y)\big\}.$$
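The construction is completely explicit and local; the following few lines (my own illustration) list the eight neighbors of a given vertex, which is all a random walk on this graph ever needs to compute.

```python
def margulis_neighbors(x, y, m):
    """The eight neighbors of (x, y) in the Margulis graph on Z_m x Z_m (Example 4.35)."""
    return [(x % m, (y + x) % m),        (x % m, (y - x) % m),
            (x % m, (y + x + 1) % m),    (x % m, (y - x - 1) % m),
            ((x + y) % m, y % m),        ((x - y) % m, y % m),
            ((x + y + 1) % m, y % m),    ((x - y - 1) % m, y % m)]

print(margulis_neighbors(2, 3, m=5))
```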
Theorem 4.36 (Margulis). The graph given above is an (8, γ) two-sided spectral expander for some γ < 1 independent of m.
The construction of the graph is very simple, but the proof of the theorem is very difficult.
This construction also provides a very fast mixing time: it provides a way to mix a 2-dimensional m × m grid in time O(log m), whereas as we've seen the regular random walk would require time $O(m^2)$. This only requires us to double the degree and throw in a few "long-distance hops".
The first construction of explicit Ramanujan expanders was given by Lubotzky, Phillips and Sarnak:
Theorem 4.37 (LPS '88). For every n = p + 1 where:
• p is a prime such that p ≡ 1 (mod 4),
• d = q^m + 1, q prime, m an integer,
there exists a $\big(d,\ \gamma = 2\sqrt{\tfrac{1}{d} - \tfrac{1}{d^2}}\big)$ spectral expander.
Very recent works by Marcus, Spielman and Srivastava give explicit constructions of
bipartite expanders for every possible degree d and size n. It is still an open problem whether
(non-bipartite) Ramanujan expanders exist for all possible degrees.
4.7 Derandomization
A couple lectures ago we saw a randomized algorithm that could efficiently solve a problem,
volume estimation, that we also argued was too hard to solve deterministically. One might
ask whether it is still possible to remove the randomness from such algorithms to obtain
efficient deterministic algorithms. The process of reducing or eliminating randomness from
an algorithm is called derandomization. The question of derandomizing all polynomial-
time algorithms is the question as to whether P = BPP, where:
Definition 4.38. P is the class of problems that can be solved deterministically in polynomial time. BPP is the class of problems for which there is a randomized polynomial-time algorithm that makes the correct decision for every input with probability ≥ 2/3.
In the last lecture with volume estimation we seemed to show that for a specific problem in BPP it was impossible to have an equivalent deterministic algorithm; does this mean P ≠ BPP?
However the impossibility result relied on the assumption that the only access to the convex
set was through a membership/separation oracle. In “real life” it will in general be the case
that other details of the problem are available, and may be used to construct a deterministic
polynomial-time algorithm. This is what makes proving separations of complexity classes
difficult!
Definition 4.39. A pseudorandom generator (PRG) is a function $g : \{0,1\}^s \to \{0,1\}^n$, where n ≥ s and s is called the seed length. We say that a PRG g ε-fools a class of functions $\mathcal{C} \subseteq \{f : \{0,1\}^n \to \{0,1\}\}$ if for all $f \in \mathcal{C}$:
$$\Big|\Pr_{x\in\{0,1\}^n}\big[f(x) = 1\big] - \Pr_{y\in\{0,1\}^s}\big[f(g(y)) = 1\big]\Big| \leq \varepsilon,$$
where both probabilities are taken over a uniformly random choice of x, y respectively. (Intuitively, this means that from the point of view of any function f ∈ $\mathcal{C}$, the output of g looks random.)
Here is how we might use this to prove that P = BPP. Let $\mathcal{C}$ be the class of all functions that can be computed by a polynomial-time randomized algorithm with success probability at least 2/3. Such algorithms can be understood as functions f of two variables, the input x to the problem and the randomness r. Fixing all possible inputs x of a certain length gives us a large class of functions $\mathcal{C}'$ which are functions of the randomness only. If we design a PRG that ε-fools the class $\mathcal{C}'$ with ε < 1/3 and s = O(log n), then we could derandomize BPP by trying all $2^s = \mathrm{poly}(n)$ possible seeds to our PRG and computing a good estimate of the probability that f would accept on any input.
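Schematically, the derandomization described in this paragraph looks as follows (a toy sketch; `g` is a hypothetical PRG with seed length s and `f(x, r)` a hypothetical 0/1-valued randomized algorithm taking its random bits as the second argument).

```python
from itertools import product

def derandomized(f, x, g, s):
    """Run the randomized algorithm f(x, r) on the PRG output g(seed) for every one of
    the 2^s seeds, and accept iff a majority of seeds lead to acceptance."""
    accept = 0
    for seed in product([0, 1], repeat=s):
        r = g(seed)                 # pseudorandom bits fed to f in place of true randomness
        accept += f(x, r)
    return 2 * accept > 2 ** s      # majority vote over all seeds
```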
Pseudorandom generators are hard to construct. Nisan and Wigderson in 94 proved the
following “hardness vs randomness” trade-off. Here we state a strong form of their theorem
that incorporates a worst-case to average-case reduction by Impagliazzo and Wigderson.
Theorem 4.40 (NW'94, IW'97). Suppose there is a language L in EXP and δ > 0 such that for all n large enough, the minimal size of a Boolean circuit that computes $L_n$ is at least $2^{\delta n}$. Then there is a family of generators $g_m : \{0,1\}^{O(\log m)} \to \{0,1\}^m$ that are computable in poly(m) time and that 1/8-fool the class of functions computable by circuits of size at most 2m (in particular, P = BPP).
Theorem 4.41. Given any BPP algorithm which requires r random bits and has error ≤ 1/100, we can construct an algorithm solving the same problem with error $\leq \big(\frac{2}{\sqrt{5}}\big)^t$ and using only r + 9t random bits.
The idea for the algorithm is very simple. We fix a (d, γ) expander G on $V = \{0,1\}^r$ with d ≤ 400 and γ ≤ 1/10. We haven't seen how to construct such a thing but I promise you that it exists, and it's not that hard to get. We then pick a random vertex to start from using r bits, and for each of t iterations we perform a random walk step to a neighbor using $\log_2(400) \approx 9$ bits. For each of the vertices traversed we run the algorithm with the corresponding bits and take the majority outcome at the end.
Note that we do not need to actually compute the whole graph (which would have size exponential in r); we only need to be able to access a small number of vertices and their neighbors. The hope is that running the BPP algorithm on the highly correlated pseudo-random bits generated by this procedure will give similar results to uncorrelated random bits. This is not trivial, and in particular it does not follow from the analysis of the mixing time given in the previous lecture. That analysis only shows that mixing happens in $O(\log 2^r) = O(r)$ steps, but here t could be much smaller than r.
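Schematically, the procedure looks as follows (a sketch only; `expander_neighbor(v, j)` is a hypothetical function returning the j-th neighbor of vertex v in the fixed (d, γ) expander on {0,1}^r, and `alg(x, bits)` is the given BPP algorithm returning 0 or 1).

```python
import random

def amplified(alg, x, r, t, expander_neighbor, d=400):
    """Run a BPP algorithm `alg(x, bits)` along a length-t walk on a degree-d expander
    over {0,1}^r, using roughly r + 9t random bits, and return the majority answer."""
    v = random.getrandbits(r)              # r truly random bits choose the starting vertex
    outcomes = [alg(x, v)]
    for _ in range(t):
        j = random.randrange(d)            # about 9 fresh random bits choose a neighbor index
        v = expander_neighbor(v, j)
        outcomes.append(alg(x, v))
    return sum(outcomes) > len(outcomes) // 2   # majority over the t + 1 runs
```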
Proof. Given an input x, we know that
$$\Pr_{y\in\{0,1\}^r}\big(\text{Alg fails on input } x \text{ and randomness } y\big) \leq \frac{1}{100}.$$
Fix an input x, and let $X = \{y : \text{Alg fails on } y\}$, $Y = \{0,1\}^r \setminus X$, $v_0, v_1, \ldots, v_t$ the vertices selected by the random walk, and $S = \{i : v_i \in X\}$. Note that X is at most a fraction 1/100 of the graph and that since we are taking a majority at the end, $\Pr(\text{fail}) = \Pr(|S| \geq t/2)$.
We will use a regular "non-lazy" random walk, as there is no point in staying at a vertex. This gives us a random walk matrix $W = AD^{-1} = \frac{1}{d}A$ since the graph is d-regular.
Let $D_X$ be the diagonal matrix where entry i is 1 if $i \in X$ and 0 otherwise, and $D_Y$ the diagonal matrix where entry i is 1 if $i \in Y$ and 0 otherwise, so $D_X + D_Y = I$. Given a certain set of designated walk steps $R \subseteq \{0, \ldots, t\}$, the probability that all such steps are errors and all other steps are correct is
$$\Pr(R = S) = \frac{1}{n}\,\mathbf{1}^T D_t W \cdots D_1 W D_0\, \mathbf{1}, \qquad\text{where } D_i = \begin{cases} D_X & \text{if } i\in R, \\ D_Y & \text{if } i\notin R.\end{cases}$$
To bound each factor we bound the operator norm $\|D_X W\|$. Decompose an arbitrary $x \in \mathbb{R}^n$ as $x = \alpha\mathbf{1} + y$ with $y \perp \mathbf{1}$; then
$$\|x\|^2 = \alpha^2 n + \|y\|^2,$$
and
$$D_X W x = \alpha D_X W \mathbf{1} + D_X W y.$$
To bound the first term, use that $W\mathbf{1} = \mathbf{1}$ and so
$$\|\alpha D_X W \mathbf{1}\| = |\alpha|\sqrt{|X|} \leq \frac{\|x\|}{\sqrt{n}}\sqrt{|X|} \leq \frac{\|x\|}{10}.$$
For the second term,
$$\|D_X W y\| \leq \|W y\| \leq \frac{1}{10}\|y\| \leq \frac{\|x\|}{10},$$
where the second inequality follows since y ⊥ 1 (and G is an expander with γ ≤ 1/10). Thus we have shown that $\|D_X W\| \leq \frac{1}{5}$.
Finally, putting everything together,
$$\Pr(\text{fail}) = \Pr\Big(|S| \geq \frac{t}{2}\Big) = \sum_{R : |R| \geq t/2} \Pr[R = S] \leq \sum_{R : |R| \geq t/2} \|D_X W\|^{|R|} \leq 2^t \Big(\frac{1}{5}\Big)^{t/2} = \Big(\frac{2}{\sqrt{5}}\Big)^t.$$
Definition 4.42 (Polynomial Identity Testing). Given a (large) finite field F and an n-variate polynomial $p \in \mathbb{F}[x_1, \ldots, x_n]$ provided as an arithmetic circuit (e.g. $(x_1 + 3x_2 - x_3)(3x_1 + x_4 - 1)$), determine whether p ≡ 0, i.e. whether all coefficients of p when fully expanded are zero.
To solve the above problem we could simply multiply all the terms out and check whether
the final coefficient of every term is 0, but that could take exponential time in the size of the
circuit specifying p. No known deterministic algorithm can solve PIT in polynomial time,
but there is a simple randomized algorithm that can.
Lemma 4.43 (Schwartz-Zippel). Given a non-zero $p \in \mathbb{F}[x_1, \ldots, x_n]$ with $\deg(p) \leq d$, let $S \subseteq \mathbb{F}$ and $(s_1, \ldots, s_n)$ be n points selected independently and uniformly at random from S. Then:
$$\Pr\big(p(s_1, \ldots, s_n) = 0\big) \leq \frac{d}{|S|}.$$
Proof. The proof is by induction on n. For the base case, n = 1, p is a polynomial in one
variable and thus has at most d roots. Hence Pr(p(s1 ) = 0) ≤ d/|S|.
For the inductive step, we let k be the largest degree of $x_1$ in p and write
$$p(x_1, \ldots, x_n) = p_1(x_2, \ldots, x_n)\, x_1^k + p_2(x_1, \ldots, x_n),$$
where the total degree of $p_1$ is at most d − k and the degree of $x_1$ in $p_2$ is strictly less than k. For the purposes of analysis, we can assume that $s_2, \ldots, s_n$ are chosen first. Then, we let E be the event that $p_1(s_2, \ldots, s_n) = 0$. There are two cases:
• Case 1: E happens. From the induction hypothesis applied to $p_1$ (a non-zero polynomial in n − 1 variables of degree at most d − k), we know that $\Pr(E) \leq (d-k)/|S|$.
• Case 2: E does not happen. In this case, we let $p'$ be the polynomial in the one variable $x_1$ that remains after $x_2 = s_2, \ldots, x_n = s_n$ are substituted in $p(x_1, \ldots, x_n)$. Since $p_1(s_2, \ldots, s_n) \neq 0$, the coefficient of $x_1^k$ is non-zero, so $p'$ is a non-zero polynomial of degree k in one variable. It thus has at most k roots, so $\Pr\big(p'(s_1) = p(s_1, \ldots, s_n) = 0 \mid \neg E\big) \leq k/|S|$.
Putting the two cases together, we have
$$\Pr\big(p(s_1,\ldots,s_n) = 0\big) = \Pr\big(p(s_1,\ldots,s_n)=0 \mid E\big)\Pr(E) + \Pr\big(p(s_1,\ldots,s_n)=0 \mid \neg E\big)\Pr(\neg E) \leq (d-k)/|S| + k/|S| = d/|S|.$$
Thus, if we take the set S to have cardinality at least twice the degree of our polynomial, we can bound the probability of error by 1/2. This can be reduced to any desired small number by repeated trials, as usual. Note that the algorithm also works over finite fields, provided the field size is larger than the degree of the polynomial. Otherwise, the algorithm could not possibly work: for example, the polynomial $p(x) = x^3 - x$ is not the zero polynomial, but it evaluates to 0 at every point of $\mathbb{F}_3$.
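Here is a minimal sketch of the resulting randomized identity test (my own code, not from the notes; the polynomial is given as a Python callable standing in for the arithmetic circuit, the evaluation is over the integers for simplicity, and the degree bound d is assumed known).

```python
import random

def is_identically_zero(p, n, d, trials=20):
    """Randomized identity test via Schwartz-Zippel.
    p: a callable in n variables (standing in for the circuit); d: a bound on its total degree.
    If p is non-zero, each trial detects this with probability >= 1 - d/|S| = 1/2."""
    S = range(2 * d)                                   # |S| = 2d, so error per trial <= 1/2
    for _ in range(trials):
        point = [random.choice(S) for _ in range(n)]
        if p(*point) != 0:
            return False                               # a non-zero evaluation certifies p != 0
    return True                                        # zero with probability >= 1 - 2**(-trials)

# Example from Definition 4.42: (x1 + 3*x2 - x3) * (3*x1 + x4 - 1)
p = lambda x1, x2, x3, x4: (x1 + 3 * x2 - x3) * (3 * x1 + x4 - 1)
print(is_identically_zero(p, n=4, d=2))   # False
```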
Is there an efficient deterministic algorithm for identity testing? A rather devastating negative result was proved by Kabanets and Impagliazzo, who showed that if there exists a deterministic polynomial time algorithm for checking polynomial identities, then either:
• NEXP does not have polynomial size circuits; or
• the Permanent does not have polynomial-size arithmetic circuits.
This means that an efficient derandomization of the above Schwartz-Zippel algorithm (or
indeed, any efficient deterministic algorithm for identity testing) would necessarily entail a
major breakthrough in complexity theory.
Definition 4.44 (Tutte matrix). The Tutte matrix AG corresponding to the graph G is the
n × n matrix [aij ] such that aij is a variable xij if (i, j) ∈ E, and 0 otherwise.
Claim. (Here G is bipartite with n vertices on each side, rows of $A_G$ indexing one side and columns the other.) $\det(A_G) \not\equiv 0$ if and only if G contains a perfect matching.
Proof. By definition, $\det(A_G) = \sum_\sigma \mathrm{sgn}(\sigma)\prod_{i=1}^n a_{i\sigma(i)}$, where the sum is over all permutations σ of {1, . . . , n}. Note that each monomial in this sum corresponds to a possible perfect matching in G, and that the monomial will be non-zero if and only if the corresponding matching is present in G. Moreover, every pair of monomials differs in at least one (actually, at least two) variables, so there can be no cancellations between monomials. This implies that $\det(A_G) \not\equiv 0$ iff G contains a perfect matching.
The above Claim immediately yields an efficient algorithm for testing whether G contains a perfect matching: simply run the Schwartz-Zippel algorithm on $\det(A_G)$, which is a polynomial in $n^2$ variables of degree n. Note that the determinant can be computed in $O(n^3)$ time by Gaussian elimination. Moreover, the algorithm can be efficiently parallelized using the standard fact that an n × n determinant can be computed in $O(\log^2 n)$ time on $O(n^{3.5})$ processors.
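Here is a sketch of the resulting matching test for the bipartite case (my own code, not from the notes): random field elements are substituted for the variables $x_{ij}$, and the determinant is computed over a large prime field by Gaussian elimination, so a non-zero value certifies a perfect matching with high probability.

```python
import random

def det_mod_p(M, p):
    """Determinant of a square integer matrix over GF(p), by Gaussian elimination."""
    M = [row[:] for row in M]
    n = len(M)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] % p != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det
        det = (det * M[col][col]) % p
        inv = pow(M[col][col], p - 2, p)         # modular inverse, valid since p is prime
        for r in range(col + 1, n):
            factor = (M[r][col] * inv) % p
            for c in range(col, n):
                M[r][c] = (M[r][c] - factor * M[col][c]) % p
    return det % p

def has_perfect_matching_bipartite(edges, n, p=2**31 - 1, trials=10):
    """Bipartite G with left/right vertex sets {0,...,n-1}; edges are (left, right) pairs.
    Substitutes random field elements for the x_ij and applies Schwartz-Zippel to det(A_G)."""
    for _ in range(trials):
        M = [[0] * n for _ in range(n)]
        for (i, j) in edges:
            M[i][j] = random.randrange(1, p)
        if det_mod_p(M, p) != 0:
            return True                          # non-zero determinant certifies a matching
    return False                                 # with high probability, no perfect matching

print(has_perfect_matching_bipartite([(0, 0), (0, 1), (1, 1)], n=2))   # True
```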
Exercise 5. Generalize the above to a non-bipartite graph G. For this, you will need the skew-symmetric matrix $A_G = [a_{ij}]$ defined as:
$$a_{ij} = \begin{cases} x_{ij} & \text{if } (i,j)\in E \text{ and } i < j, \\ -x_{ij} & \text{if } (i,j)\in E \text{ and } i > j, \\ 0 & \text{otherwise.}\end{cases}$$
[Hint: You should show that the above Claim still holds, with this modified definition
of AG . This requires a bit more care than in the bipartite case, because the monomials in
det(AG ) do not necessarily correspond to perfect matchings, but to cycle covers in G. In this
case some cancellations will occur.]