
Advanced Algorithms

Computer Science, ETH Zürich

Mohsen Ghaffari

These notes will be updated regularly. Please read critically; there are typos
throughout, but there might also be mistakes. Feedback and comments would be
greatly appreciated and should be emailed to [email protected].
Last update: September 20, 2023
Contents

Notation and useful inequalities

I Basics of Approximation Algorithms 1


1 Greedy algorithms 3
1.1 Minimum set cover & vertex cover . . . . . . . . . . . . . . . . 3

2 Approximation schemes 13
2.1 Knapsack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Bin packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Minimum makespan scheduling . . . . . . . . . . . . . . . . . 22

3 Randomized approximation schemes 29


3.1 DNF counting . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Network reliability . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Counting graph colorings . . . . . . . . . . . . . . . . . . . . . 35

4 Rounding Linear Program Solutions 41


4.1 Minimum set cover . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Minimizing congestion in multi-commodity routing . . . . . . 47
4.3 Scheduling on unrelated parallel machines . . . . . . . . . . . 53

II Selected Topics in Approximation Algorithms 59


5 Distance-preserving tree embedding 61
5.1 A tight probabilistic tree embedding construction . . . . . . . 62
5.2 Application: Buy-at-bulk network design . . . . . . . . . . . . 73

6 L1 metric embedding & sparsest cut 77


6.1 Warm up: Min s-t Cut . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Sparsest Cut via L1 Embedding . . . . . . . . . . . . . . . . . 80
6.3 L1 Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

7 Oblivious Routing, Cut-Preserving Tree Embedding, and Balanced Cut 91
7.1 Oblivious Routing . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.2 Oblivious Routing via Trees . . . . . . . . . . . . . . . . . . . 92
7.3 Existence of the Tree Collection . . . . . . . . . . . . . . . . . 97
7.4 The Balanced Cut problem . . . . . . . . . . . . . . . . . . . . 100

8 Multiplicative Weights Update (MWU) 105


8.1 Learning from Experts . . . . . . . . . . . . . . . . . . . . . . 105
8.2 Approximating Covering/Packing LPs via MWU . . . . . . . . 109
8.3 Constructive Oblivious Routing via MWU . . . . . . . . . . . 114
8.4 Other Applications: Online routing of virtual circuits . . . . . 118

III Streaming and Sketching Algorithms 123


9 Basics and Warm Up with Majority Element 125
9.1 Typical tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.2 Majority element . . . . . . . . . . . . . . . . . . . . . . . . . 126

10 Estimating the moments of a stream 129


10.1 Estimating the first moment of a stream . . . . . . . . . . . . 129
10.2 Estimating the zeroth moment of a stream . . . . . . . . . . . 131
10.3 Estimating the k-th moment of a stream . . . . . . . . . . . 138

11 Graph sketching 147


11.1 Finding the single cut edge . . . . . . . . . . . . . . . . . . . . 148
11.2 Finding one out of k > 1 cut edges . . . . . . . . . . . . . . . 150
11.3 Finding one out of arbitrarily many cut edges . . . . . . . . . 152
11.4 Maximal forest with O(n log^4 n) memory . . . . . . . . . . 152

IV Graph sparsification 157


12 Preserving distances 159
12.1 α-multiplicative spanners . . . . . . . . . . . . . . . . . . . . . 160
12.2 β-additive spanners . . . . . . . . . . . . . . . . . . . . . . . . 162
13 Preserving cuts 171
13.1 Warm up: G = Kn . . . . . . . . . . . . . . . . . . . . . . . . 172
13.2 Prelim: Contractions and Cut Counting . . . . . . . . . . . . 172
13.3 Uniform edge sampling . . . . . . . . . . . . . . . . . . . . . . 174
13.4 Non-uniform edge sampling . . . . . . . . . . . . . . . . . . . 176

V Online Algorithms and Competitive Analysis 181


14 Warm up: Ski rental 183

15 Linear search 185


15.1 Amortized analysis . . . . . . . . . . . . . . . . . . . . . . . . 185
15.2 Move-to-Front . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

16 Paging 189
16.1 Types of adversaries . . . . . . . . . . . . . . . . . . . . . . . 190
16.2 Random Marking Algorithm (RMA) . . . . . . . . . . . . . . 191
16.3 Lower Bound for Paging via Yao’s Principle . . . . . . . . . . 194

17 The k-server problem 197


17.1 Special case: Points on a line . . . . . . . . . . . . . . . . . . 198
Notation and useful inequalities

Commonly used notation


• P: class of decision problems that can be solved on a deterministic
sequential machine in polynomial time with respect to input size

• N P: class of decision problems that can be solved non-deterministically


in polynomial time with respect to input size. That is, decision prob-
lems for which “yes” instances have a proof that can be verified in
polynomial time.

• A: usually denotes the algorithm under discussion

• I: usually denotes a problem instance

• ind.: independent / independently

• w.p.: with probability

• w.h.p.: with high probability


We say event X holds with high probability (w.h.p.) if

Pr[X] ≥ 1 − 1/poly(n),

say, Pr[X] ≥ 1 − 1/n^c for some constant c ≥ 2.

• L.o.E.: linearity of expectation

• u.a.r.: uniformly at random

• Integer range [n] = {1, . . . , n}

• e ≈ 2.718281828459: the base of the natural logarithm


Useful distributions
Bernoulli Coin flip w.p. p. Useful for indicators
Pr[X = 1] = p
E[X] = p
Var(X) = p(1 − p)

Binomial Number of successes out of n trials, each succeeding w.p. p;
sample with replacement out of n items, a p fraction of which are successes

Pr[X = k] = (n choose k) · p^k · (1 − p)^(n−k)
E[X] = np
Var(X) = np(1 − p) ≤ np

Geometric Number of Bernoulli trials until one success

Pr[X = k] = (1 − p)^(k−1) · p
E[X] = 1/p
Var(X) = (1 − p)/p^2

Hypergeometric Number of successes in n draws without replacement,
from a population of N items in which K are successful:

Pr[X = k] = (K choose k) · (N−K choose n−k) / (N choose n)
E[X] = n · K/N
Var(X) = n · (K/N) · ((N−K)/N) · ((N−n)/(N−1))

Exponential Parameter λ; written as X ∼ Exp(λ). The density is

f(x) = λe^(−λx) if x ≥ 0, and f(x) = 0 if x < 0
E[X] = 1/λ
Var(X) = 1/λ^2
Remark If x1 ∼ Exp(λ1), . . . , xn ∼ Exp(λn) are independent, then

• min{x1, . . . , xn} ∼ Exp(λ1 + · · · + λn)

• Pr[xk = min{x1, . . . , xn}] = λk / (λ1 + · · · + λn)

Useful inequalities

• (n/k)^k ≤ (n choose k) ≤ (en/k)^k

• (n choose k) ≤ n^k

• lim_{n→∞} (1 − 1/n)^n = e^(−1)

• Σ_{i=1}^∞ 1/i^2 = π^2/6

• (1 − x) ≤ e^(−x), for any x

• (1 + 2x) ≥ e^x, for x ∈ [0, 1]

• (1 + x/2) ≥ e^x, for x ∈ [−1, 0]

• (1 − x) ≥ e^(−x−x^2), for x ∈ (0, 1/2)

• 1/(1 − x) ≤ 1 + 2x, for x ∈ [0, 1/2]

Theorem (Linearity of Expectation).

E( Σ_{i=1}^n a_i X_i ) = Σ_{i=1}^n a_i E(X_i)

Theorem (Variance).

V(X) = E(X^2) − E(X)^2

Theorem (Variance of a Sum of Random Variables).

V(aX + bY ) = a^2 V(X) + b^2 V(Y ) + 2ab Cov(X, Y )

Theorem (AM-GM inequality). Given n nonnegative numbers x1, . . . , xn,

(x1 + · · · + xn)/n ≥ (x1 · · · xn)^(1/n)

The equality holds if and only if x1 = · · · = xn.
Theorem (Markov’s inequality). If X is a nonnegative random variable and
a > 0, then

Pr[X ≥ a] ≤ E(X)/a
Theorem (Chebyshev’s inequality). If X is a random variable (with finite
expected value µ and non-zero variance σ^2), then for any k > 0,

Pr[|X − µ| ≥ kσ] ≤ 1/k^2

Theorem (Bernoulli’s inequality). For every integer r ≥ 0 and every real
number x ≥ −1,

(1 + x)^r ≥ 1 + rx

Theorem (Chernoff bound). For independent Bernoulli variables X1, . . . , Xn,
let X = Σ_{i=1}^n Xi. Then,

Pr[X ≥ (1 + ε) · E(X)] ≤ exp(−ε^2 E(X)/3)   for 0 < ε
Pr[X ≤ (1 − ε) · E(X)] ≤ exp(−ε^2 E(X)/2)   for 0 < ε < 1

By union bound, for 0 < ε < 1, we have

Pr[|X − E(X)| ≥ ε · E(X)] ≤ 2 exp(−ε^2 E(X)/3)

Remark 1 There is actually a tighter form of Chernoff bounds:

∀ε > 0, Pr[X ≥ (1 + ε)E(X)] ≤ ( e^ε / (1 + ε)^(1+ε) )^E(X)

Remark 2 We usually apply the Chernoff bound to show that the probability
of a bad approximation is low, by picking parameters such that
2 exp(−ε^2 E(X)/3) ≤ δ, then negating to get Pr[|X − E(X)| ≤ ε · E(X)] ≥ 1 − δ.
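As an illustration of this usage, here is a small seeded simulation (not from the notes; the parameters are arbitrary choices): summing n fair coin flips, the empirical deviation probability is far below the Chernoff bound.

```python
import random

# Seeded simulation of Chernoff concentration: X is the sum of n fair coin
# flips, E[X] = n/2; count trials that deviate by at least eps * E[X].
random.seed(0)
n, eps, trials = 10_000, 0.1, 100
expectation = n / 2
deviations = 0
for _ in range(trials):
    x = sum(random.random() < 0.5 for _ in range(n))
    if abs(x - expectation) >= eps * expectation:
        deviations += 1

# The bound gives Pr[deviation] <= 2*exp(-eps^2 * E[X]/3) ~ 1.2e-7 here,
# so no deviating trial should be observed.
assert deviations == 0
```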

Theorem (Probabilistic Method). Let (Ω, A, P) be a probability space. If an
event E ∈ A has Pr[E] > 0, then there exists an outcome ω ∈ Ω at which E occurs.
Combinatorics taking k elements out of n:

• no repetition, no ordering: (n choose k)

• no repetition, ordering: n!/(n − k)!

• repetition, no ordering: (n + k − 1 choose k)

• repetition, ordering: n^k
Part I

Basics of Approximation Algorithms
Chapter 1

Greedy algorithms

Unless P = N P, we do not expect efficient algorithms for N P-hard problems.
However, we are often able to design efficient algorithms that give solutions
that are provably close/approximate to the optimum.
Definition 1.1 (α-approximation). An algorithm A is an α-approximation
algorithm for a minimization problem with respect to cost metric c if for any
problem instance I and for some optimum solution OP T ,
c(A(I)) ≤ α · c(OP T (I))
Maximization problems are defined similarly with c(OP T (I)) ≤ α·c(A(I)).

1.1 Minimum set cover & vertex cover


Consider a universe U = {e1, . . . , en} of n elements, a collection
S = {S1, . . . , Sm} of m subsets of U such that U = ∪_{i=1}^m Si, and a
non-negative¹ cost function c : S → R+. If Si = {e1, e2, e5}, then we say Si
covers elements e1, e2, and e5. For any subset T ⊆ S, define the cost of T as
the sum of the costs of all subsets in T. That is,

c(T ) = Σ_{Si∈T} c(Si)

Definition 1.2 (Minimum set cover problem). Given a universe of elements
U, a collection of subsets S, and a non-negative cost function c : S → R+,
find a subset S∗ ⊆ S such that:

(i) S∗ is a set cover: ∪_{Si∈S∗} Si = U

(ii) c(S∗), the cost of S∗, is minimized


¹If a set costs 0, then we can just remove all the elements covered by it for free.


Example

[Figure: subsets S1, S2, S3, S4 covering elements e1, . . . , e5.]

Suppose there are 5 elements e1, e2, e3, e4, e5, 4 subsets S1, S2, S3, S4,
and the cost function is defined as c(Si) = i^2. Even though S3 ∪ S4 covers all
elements, this costs c({S3, S4}) = c(S3) + c(S4) = 9 + 16 = 25. One can verify
that the minimum set cover is S∗ = {S1, S2, S3} with a cost of c(S∗) = 14.
Notice that we want a minimum cover with respect to c and not with respect
to the number of subsets chosen from S (unless c is the uniform cost function).

1.1.1 A greedy minimum set cover algorithm


Since finding the minimum set cover is N P-complete, we are interested in al-
gorithms that give a good approximation for the optimum. [Joh74] describes
a greedy algorithm GreedySetCover and proved that it gives an Hn -
approximation2 . The intuition is as follows: Spread the cost c(Si ) amongst
the vertices that are newly covered by Si . Denoting the price-per-item by
ppi(Si ), we greedily select the set that has the lowest ppi at each step unitl
we have found a set cover.

Algorithm 1 GreedySetCover(U, S, c)
T ← ∅ . Selected subsets of S
C ← ∅ . Covered elements
while C ≠ U do
    Si ← arg min_{Si∈S\T} c(Si)/|Si \ C| . Pick the set with the lowest price-per-item
    T ← T ∪ {Si} . Add Si to the selection
    C ← C ∪ Si . Update covered elements
end while
return T

²Hn = Σ_{i=1}^n 1/i = ln(n) + γ ≤ ln(n) + 0.6 ∈ O(log(n)), where γ is the Euler-Mascheroni
constant. See https://en.wikipedia.org/wiki/Euler-Mascheroni_constant.



Consider a run of GreedySetCover on the earlier example. In the first
iteration, ppi(S1) = 1/3, ppi(S2) = 4, ppi(S3) = 9/2, ppi(S4) = 16/3, so
S1 is chosen. In the second iteration, ppi(S2) = 4, ppi(S3) = 9, ppi(S4) = 16,
so S2 is chosen. In the third iteration, ppi(S3) = 9, ppi(S4) = ∞, so S3 is
chosen. Since all elements are now covered, the algorithm terminates
(coincidentally with the minimum set cover). Notice that the ppi of the
unchosen sets changes according to which elements remain uncovered.
Furthermore, one can simply ignore S4 once it no longer covers any
uncovered elements.
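This run can be replayed with a minimal Python sketch of GreedySetCover (not part of the notes). The figure does not fully specify the contents of S1, . . . , S4, so the sets below are one instantiation consistent with the ppi values above:

```python
# A direct sketch of GreedySetCover with c(Si) = i^2. The set contents
# are a hypothetical instantiation consistent with the example's ppi values.
def greedy_set_cover(universe, sets, cost):
    covered, chosen = set(), []
    while covered != universe:
        # Pick the set with the lowest price-per-item c(S) / |S \ covered|.
        name, s = min(
            ((n, s) for n, s in sets.items() if n not in chosen and s - covered),
            key=lambda ns: cost[ns[0]] / len(ns[1] - covered),
        )
        chosen.append(name)
        covered |= s
    return chosen

U = {"e1", "e2", "e3", "e4", "e5"}
S = {
    "S1": {"e1", "e2", "e3"},
    "S2": {"e4"},
    "S3": {"e3", "e5"},
    "S4": {"e1", "e2", "e4"},
}
c = {"S1": 1, "S2": 4, "S3": 9, "S4": 16}
picked = greedy_set_cover(U, S, c)
assert picked == ["S1", "S2", "S3"]           # same order as the example run
assert sum(c[s] for s in picked) == 14        # the minimum cover's cost
```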

Theorem 1.3. GreedySetCover is an Hn -approximation algorithm.

Proof. By construction, GreedySetCover terminates with a valid set cover
T . It remains to show that c(T ) ≤ Hn · c(OP T ) for any minimum set
cover OP T . Upon relabelling, let e1, . . . , en be the elements in the order
they are covered by GreedySetCover. Define price(ei) as the price-
per-item associated with ei at the time ei was purchased during the run
of the algorithm. Consider the moment in the algorithm where elements
Ck−1 = {e1, . . . , ek−1} are already covered by some sets Tk ⊂ T . Tk covers no
elements in {ek, . . . , en}. Since there is a cover³ of cost at most c(OP T ) for
the remaining n − k + 1 elements, there must be an element e∗ ∈ {ek, . . . , en}
whose price price(e∗) is at most c(OP T )/(n − k + 1).

[Figure: the universe U with elements e1, . . . , ek−1 already covered, the
remaining elements ek, . . . , en, and the sub-collection OP Tk ⊆ OP T that
covers them; the rest of OP T lies outside.]

We formalize this intuition with the argument below. Since OP T is a set
cover, there exists a subset OP Tk ⊆ OP T that covers ek, . . . , en.³ Suppose
OP Tk = {O1, . . . , Op} where Oi ∈ S, ∀i ∈ [p]. We make the following
observations:

³OP T is a valid cover (though probably not minimum) for the remaining elements.

1. Since no element in {ek, . . . , en} is covered by Tk, we have O1, . . . , Op ∈ S \ Tk.

2. Because some elements may be covered more than once,

   n − k + 1 = |U \ Ck−1|
             ≤ |O1 ∩ (U \ Ck−1)| + · · · + |Op ∩ (U \ Ck−1)|
             = Σ_{j=1}^p |Oj ∩ (U \ Ck−1)|

3. By definition, for each j ∈ {1, . . . , p}, ppi(Oj) = c(Oj) / |Oj ∩ (U \ Ck−1)|.

Since the greedy algorithm will pick a set in S \ Tk with the lowest price-per-
item, price(ek) ≤ ppi(Oj) for all j ∈ {1, . . . , p}. Substituting this expression
into the last equation and rearranging the terms, we get:

c(Oj) ≥ price(ek) · |Oj ∩ (U \ Ck−1)|,  ∀j ∈ {1, . . . , p}   (1.1)

Summing over all p sets, we have

c(OP T ) ≥ c(OP Tk)                                    (Since OP Tk ⊆ OP T )
         = Σ_{j=1}^p c(Oj)                             (Definition of c(OP Tk))
         ≥ Σ_{j=1}^p price(ek) · |Oj ∩ (U \ Ck−1)|     (By Equation (1.1))
         ≥ price(ek) · |U \ Ck−1|                      (By observation 2)
         = price(ek) · (n − k + 1)

Rearranging, price(ek) ≤ c(OP T )/(n − k + 1). Summing over all elements, we have:

c(T ) = Σ_{S∈T} c(S) = Σ_{k=1}^n price(ek) ≤ Σ_{k=1}^n c(OP T )/(n − k + 1) = c(OP T ) · Σ_{k=1}^n 1/k = Hn · c(OP T )

Remark By construction, price(e1 ) ≤ · · · ≤ price(en ).


Next we provide an example to show this bound is indeed tight.

Tight bound example for GreedySetCover Consider n = 2 · (2^k − 1)
elements, for some k ∈ N \ {0}. Partition the elements into groups of size
2 · 2^0, 2 · 2^1, 2 · 2^2, . . . , 2 · 2^(k−1). Let S = {S1, . . . , Sk, Sk+1, Sk+2}. For
1 ≤ i ≤ k, let Si cover the group of size 2 · 2^(i−1) = 2^i. Let Sk+1 and Sk+2
cover half of each group (i.e. 2^k − 1 elements each) such that Sk+1 ∩ Sk+2 = ∅.

[Figure: groups of 2, 4, 8 = 2 · 2^2, . . . , 2 · 2^(k−1) elements covered by
S1, S2, S3, . . . , Sk, with Sk+1 covering the top half of each group and Sk+2
the bottom half.]

Suppose c(Si) = 1, ∀i ∈ {1, . . . , k + 2}. The greedy algorithm will pick
Sk, then Sk−1, . . . , and finally S1. This is because 2 · 2^(k−1) > n/2 and,
for 1 ≤ i ≤ k − 1, 2 · 2^(i−1) > (n − Σ_{j=i+1}^k 2 · 2^(j−1))/2. This greedy
set cover costs k = O(log(n)). Meanwhile, the minimum set cover is
S∗ = {Sk+1, Sk+2} with a cost of 2.
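A small script (an illustration, not part of the notes) can instantiate this construction for k = 4 and replay the greedy choices:

```python
# Tight instance for k = 4: n = 2*(2^4 - 1) = 30 elements in groups of
# sizes 2*2^0, ..., 2*2^(k-1); Sk+1 and Sk+2 each take half of every group.
# All set costs are 1, so the greedy rule maximizes new coverage.
k = 4
groups = [{(i, j) for j in range(2 * 2 ** (i - 1))} for i in range(1, k + 1)]
universe = set().union(*groups)
sets = {i: groups[i - 1] for i in range(1, k + 1)}
sets[k + 1] = {(i, j) for (i, j) in universe if j % 2 == 0}  # half of each group
sets[k + 2] = {(i, j) for (i, j) in universe if j % 2 == 1}  # the other half

covered, chosen = set(), []
while covered != universe:
    best = min(
        (s for s in sets if sets[s] - covered),
        key=lambda s: 1 / len(sets[s] - covered),  # unit cost per new element
    )
    chosen.append(best)
    covered |= sets[best]

# Greedy pays k = Theta(log n), while {Sk+1, Sk+2} covers everything at cost 2.
assert chosen == [4, 3, 2, 1]
assert sets[k + 1] | sets[k + 2] == universe
```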
A series of works by Lund and Yannakakis [LY94], Feige [Fei98], and Dinur
and Steurer [DS14, Corollary 1.5] showed that it is N P-hard to always
approximate set cover to within (1 − ε) · ln |U|, for any constant ε > 0.

Theorem 1.4 ([DS14, Corollary 1.5]). It is N P-hard to always approximate
set cover to within (1 − ε) · ln |U|, for any constant ε > 0.

Proof. See [DS14, Corollary 1.5].

1.1.2 Special cases


In this section, we show that one may improve the approximation factor
from Hn if we have further assumptions on the set cover instance. View-
ing a set cover instance as a bipartite graph between sets and elements, let
∆ = maxi∈{1,...,m} degree(Si ) and f = maxi∈{1,...,n} degree(ei ) represent the
maximum degree of the sets and elements respectively. Consider the follow-
ing two special cases of set cover instances:

1. All sets are small. That is, ∆ is small.

2. Every element is covered by few sets. That is, f is small.



Special case: Small ∆


Theorem 1.5. GreedySetCover is an H∆-approximation algorithm.

Proof. Suppose OP T = {O1, . . . , Op}, where Oi ∈ S, ∀i ∈ [p]. Consider a set
Oi = {ei,1, . . . , ei,d} with degree(Oi) = d ≤ ∆. Without loss of generality,
suppose that the greedy algorithm covers ei,1, then ei,2, and so on. For
1 ≤ k ≤ d, when ei,k is covered, price(ei,k) ≤ c(Oi)/(d − k + 1) (it is an
equality if the greedy algorithm also chose Oi to first cover ei,k, . . . , ei,d).
Hence, the greedy cost of covering the elements in Oi (i.e. ei,1, . . . , ei,d) is
at most

Σ_{k=1}^d c(Oi)/(d − k + 1) = c(Oi) · Σ_{k=1}^d 1/k = c(Oi) · Hd ≤ c(Oi) · H∆

Summing over all p sets to cover all n elements, we have c(T ) ≤ H∆ · c(OP T ).

Remark We apply the same greedy algorithm for small ∆ but analyzed in
a more localized manner. Crucially, in this analysis, we always work with
the exact degree d and only use the fact d ≤ ∆ after summation. Observe
that ∆ ≤ n and the approximation factor equals that of Theorem 1.3 when
∆ = n.

Special case: Small f


We first look at the case when f = 2, show that it is related to another graph
problem, then generalize the approach for general f .

Vertex cover as a special case of set cover


Definition 1.6 (Minimum vertex cover problem). Given a graph G = (V, E),
find a subset S ⊆ V such that:
(i) S is a vertex cover: ∀e = {u, v} ∈ E, u ∈ S or v ∈ S
(ii) |S|, the size of S, is minimized
We next argue that each instance of minimum vertex cover can be seen as
an instance of the minimum set cover problem with f = 2, and (more
importantly for our approximation algorithm) any instance of the minimum
set cover problem with f = 2 can be reduced to an instance of minimum
vertex cover.
Minimum vertex cover corresponds to minimum set cover with f = 2 and
c(Si) = 1, ∀Si ∈ S. Given an instance I = ⟨G = (V, E)⟩ of minimum vertex
cover, we build an instance I∗ = ⟨U∗, S∗⟩ of minimum set cover as follows:

• Each edge ei ∈ E in G becomes an element e′i in U∗

• Each vertex vj ∈ V in G becomes a set Sj in S∗, with e′i ∈ Sj ⇐⇒
ei is incident to vj in I

Notice that every element e′i ∈ U∗ will be in exactly 2 sets of S∗, since every
edge is incident to exactly two vertices. Hence, I∗ has f = 2.
Moreover, we can reduce the minimum set cover problem with f = 2 to
an instance of minimum vertex cover. For each element that appears in
only one set, simply take the set that includes it, and repeat this until we
have removed all the elements that appear in exactly one set. At this point,
we are left with sets and elements such that each element appears in exactly
two sets. Then, we can view this as a simple graph by thinking of the sets as
the vertices and each element as an edge between the two vertices
corresponding to the two sets that contain it.
One way to obtain a 2-approximation to minimum vertex cover (and
hence a 2-approximation for this special case of set cover) is to use a maximal
matching.

Definition 1.7 (Maximal matching problem). Given a graph G = (V, E),


find a subset M ⊆ E such that:

(i) M is a matching: Distinct edges ei , ej ∈ M do not share an endpoint

(ii) M is maximal: ∀ek 6∈ M , M ∪ {ek } is not a matching

[Figure: a path on six vertices a, b, c, d, e, f; blue edges {a, b}, {c, d}, {e, f}
and red edges {b, c}, {d, e}.]

A related concept to maximal matching is maximum matching, where one
tries to maximize the size of M. By definition, any maximum matching is also
a maximal matching, but the converse is not necessarily true. Consider a path
of 6 vertices and 5 edges. Both the set of blue edges {{a, b}, {c, d}, {e, f}}
and the set of red edges {{b, c}, {d, e}} are valid maximal matchings, but
only the former is a maximum matching.

Remark Any maximal matching is a 2-approximation of maximum matching.

GreedyMaximalMatching is a greedy maximal matching algorithm.
The algorithm greedily adds any available edge ei whose endpoints are not
yet matched, then excludes all edges that are adjacent to ei.

Algorithm 2 GreedyMaximalMatching(V, E)
M ←∅ . Selected edges
C←∅ . Set of incident vertices
while E 6= ∅ do
ei = {u, v} ← Pick any edge from E
M ← M ∪ {ei } . Add ei to the matching
C ← C ∪ {u, v} . Add endpoints to incident vertices
Remove all edges in E that are incident to u or v
end while
return M

Theorem 1.8. The set of incident vertices C at the end of
GreedyMaximalMatching is a 2-approximation for minimum vertex cover.

[Figure: a maximal matching M and the vertex cover C formed by the
endpoints of M, where |C| = 2 · |M|.]

Proof. Suppose, for a contradiction, that GreedyMaximalMatching
terminated with a set C that is not a vertex cover. Then, there exists an edge
e = {u, v} ∈ E such that u ∉ C and v ∉ C. But then M′ = M ∪ {e} would
be a matching with |M′| > |M|, and GreedyMaximalMatching would
not have terminated. This is a contradiction, hence C is a vertex cover.
Consider the matching M . Any vertex cover has to include at least one
endpoint for each edge in M , hence the minimum vertex cover OP T has at
least |M | vertices (i.e. |OP T | ≥ |M |). By picking C as our vertex cover,
|C| = 2 · |M | ≤ 2 · |OP T |. Therefore, C is a 2-approximation.
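The argument can be replayed in a few lines of Python (an illustration, not part of the notes); the path a–b–c–d–e–f is the example from above, and the function is a direct sketch of GreedyMaximalMatching:

```python
# Sketch of GreedyMaximalMatching and the induced 2-approximate
# vertex cover C (all endpoints of the matching M).
def greedy_maximal_matching(edges):
    matched_vertices, matching = set(), []
    for u, v in edges:  # consider edges in the given order
        if u not in matched_vertices and v not in matched_vertices:
            matching.append((u, v))
            matched_vertices |= {u, v}
    return matching, matched_vertices

path = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
M, C = greedy_maximal_matching(path)
# C is a vertex cover: every edge has an endpoint in C ...
assert all(u in C or v in C for u, v in path)
# ... of size exactly 2|M|, hence at most 2 * |OPT|.
assert len(C) == 2 * len(M)
```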

We now generalize beyond f = 2 by considering hypergraphs. Hypergraphs
are a generalization of graphs in which an edge can join any number of
vertices. Formally, a hypergraph H = (X, E) consists of a set X of
vertices/elements and a set E of hyperedges, where each hyperedge is an
element of P(X) \ {∅} (where P(X) is the power set of X). The minimum
vertex cover problem and the maximal matching problem are defined
similarly on a hypergraph.

Remark A hypergraph H = (X, E) can be viewed as a bipartite graph with


partitions X and E, with an edge between element x ∈ X and hyperedge
e ∈ E if x ∈ e in H.

Example Suppose H = (X, E) where X = {a, b, c, d, e} and E = {{a, b, c},


{b, c}, {a, d, e}}. A minimum vertex cover of size 2 would be {a, c} (there
are multiple vertex covers of size 2). Maximal matchings would be {{a, b, c}}
and {{b, c}, {a, d, e}}, where the latter is the maximum matching.

Claim 1.9. Generalizing GreedyMaximalMatching to compute a maximal
matching in the hypergraph by greedily picking hyperedges yields an
f-approximation algorithm for minimum vertex cover.

Sketch of Proof Let C be the set of all vertices involved in the greedily
selected hyperedges. In a similar manner as the proof of Theorem 1.8, C can
be shown to be an f-approximation.
Chapter 2

Approximation schemes

In the last chapter, we described simple greedy algorithms that approximate
the optimum for minimum set cover, maximal matching and minimum vertex
cover within a constant factor of the optimum solution. We now want to
devise algorithms that come arbitrarily close to the optimum solution.
For that purpose we formalize the notion of efficient (1 + ε)-approximation
algorithms for minimization problems, à la [Vaz13].
Let I be an instance from the problem of interest (e.g. minimum set
cover). Denote |I| as the size of the problem instance in bits, and |Iu| as
the size of the problem instance in unary. For example, if the input is a
number x of at most n bits, then |I| = O(log2(x)) = O(n) while |Iu| = O(2^n).
This distinction of “size of input” will be important when we discuss the
knapsack problem later.
Definition 2.1 (Polynomial time approximation scheme (PTAS)). For a
given cost metric c, an optimal algorithm OP T and a parameter ε, an
algorithm Aε is a PTAS for a minimization problem if

• c(Aε(I)) ≤ (1 + ε) · c(OP T (I))

• Aε runs in poly(|I|) time

Note that ε is a parameter of the algorithm, and is not considered as input.
Thus the runtime for a PTAS may depend arbitrarily on ε. If we define ε as
an input parameter for the algorithm, we obtain a stricter definition, namely
that of fully polynomial time approximation schemes (FPTAS). Assuming
P ≠ N P, an FPTAS is the best one can hope for on N P-hard problems.
Definition 2.2 (Fully polynomial time approximation scheme (FPTAS)).
For a given cost metric c, an optimal algorithm OP T and input parameter
ε, an algorithm A is an FPTAS for a minimization problem if

• For any ε > 0, c(A(I)) ≤ (1 + ε) · c(OP T (I))

• A runs in poly(|I|, 1/ε) time

As before, one can define (1 − ε)-approximations, PTAS, and FPTAS for
maximization problems similarly.

2.1 Knapsack

Definition 2.3 (Knapsack problem). Consider a set S with n items. Each
item i has size(i) ∈ Z+ and profit(i) ∈ Z+. Given a budget B, find a
subset S∗ ⊆ S such that:

(i) Selection S∗ fits the budget: Σ_{i∈S∗} size(i) ≤ B

(ii) Selection S∗ has maximum value: Σ_{i∈S∗} profit(i) is maximized

Let pmax = max_{i∈{1,...,n}} profit(i) denote the highest profit of an item.
Also, notice that any item with size(i) > B cannot be chosen, due to the
size constraint, and therefore we can discard it. In O(n) time, we can remove
any such item and relabel the remaining ones as items 1, 2, 3, . . . Thus,
without loss of generality, we can assume that size(i) ≤ B, ∀i ∈ {1, . . . , n}.
Observe that pmax ≤ profit(OP T (I)) because we can always pick at
least one item, namely the highest valued one.

Example Denote the i-th item by i : ⟨size(i), profit(i)⟩. Consider an
instance with S = {1 : ⟨10, 130⟩, 2 : ⟨7, 103⟩, 3 : ⟨6, 91⟩, 4 : ⟨4, 40⟩, 5 : ⟨3, 38⟩}
and budget B = 10. Then, the best subset S∗ = {2 : ⟨7, 103⟩, 5 : ⟨3, 38⟩} ⊆ S
yields a total profit of 103 + 38 = 141.

2.1.1 An exact algorithm via dynamic programming

The maximum achievable profit is at most n · pmax, as we can have at most n
items, each having profit at most pmax. Define the size of a subset as the
sum of the sizes of the items involved. Using dynamic programming (DP),
we can fill an n-by-(n · pmax) matrix M where M[i, p] is the smallest size of
a subset chosen from {1, . . . , i} such that the total profit equals exactly p.
For the base cases, set M[1, 0] = 0, M[1, profit(1)] = size(1), and
M[1, p] = ∞ for every other p; to handle boundaries, define M[i, j] = ∞ for
j < 0. Then, we compute M[i + 1, p] as follows:

• If profit(i + 1) > p, then we cannot pick item i + 1.
So, M[i + 1, p] = M[i, p].

• If profit(i + 1) ≤ p, then we may pick item i + 1.
So, M[i + 1, p] = min{M[i, p], size(i + 1) + M[i, p − profit(i + 1)]}.

Since each cell can be computed in O(1) via the above recurrence, the matrix
M can be filled in O(n^2 · pmax) time, and S∗ may be extracted from the
table M[·, ·]: we find the maximum value j ∈ [pmax, n · pmax] for which
M[n, j] ≤ B, and we back-track from there to extract the optimal set S∗.
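The recurrence can be sketched compactly in Python, run here on the earlier knapsack example (the function name and row-by-row structure are illustrative choices; for brevity the sketch returns the best profit rather than the item set):

```python
# Profit-indexed DP: m[p] = smallest total size of a subset of the items
# seen so far whose total profit is exactly p (infinity if unachievable).
def knapsack_dp(sizes, profits, budget):
    n, pmax = len(sizes), max(profits)
    inf = float("inf")
    top = n * pmax  # maximum achievable profit
    m = [0] + [inf] * top  # no items: only profit 0 is achievable
    for i in range(n):
        prev = m[:]
        for p in range(profits[i], top + 1):
            # Either skip item i, or take it on top of profit p - profits[i].
            m[p] = min(prev[p], sizes[i] + prev[p - profits[i]])
    # Best profit whose minimum size fits the budget.
    return max(p for p in range(top + 1) if m[p] <= budget)

sizes = [10, 7, 6, 4, 3]
profits = [130, 103, 91, 40, 38]
assert knapsack_dp(sizes, profits, 10) == 141  # items 2 and 5: 103 + 38
```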

Remark This dynamic programming algorithm is not a PTAS because
O(n^2 · pmax) can be exponential in the input size |I|. Namely, the number
pmax is encoded with log2(pmax) bits in the input, which is at most O(n)
bits, but the value pmax itself can be exponential in n. As such, we say that
this dynamic program is a pseudo-polynomial time algorithm.

2.1.2 FPTAS via profit rounding

Algorithm 3 FPTAS-Knapsack(S, B, ε)
k ← max{1, ⌊ε · pmax/n⌋} . Choice of k to be justified later
for i ∈ {1, . . . , n} do
    profit′(i) ← ⌊profit(i)/k⌋ . Round and scale the profits
end for
Run the DP in Section 2.1.1 with B, size(i), and the re-scaled profit′(i).
return Items selected by the DP

FPTAS-Knapsack pre-processes the problem input by rounding each profit
down to the nearest multiple of k and then, since every value is now a multiple
of k, scaling down by a factor of k. FPTAS-Knapsack then calls the DP
algorithm described in Section 2.1.1. Since we scaled down the profits, the
new maximum profit is pmax/k, hence the DP now runs in O(n^2 · pmax/k)
time. To obtain an FPTAS, we pick k = max{1, ⌊ε · pmax/n⌋} so that
FPTAS-Knapsack is a (1 − ε)-approximation algorithm and runs in
poly(n, 1/ε) time.

Theorem 2.4. FPTAS-Knapsack is an FPTAS for the knapsack problem.

Proof. Suppose we are given a knapsack instance I = (S, B). Let loss(i)
denote the decrease in value from using the rounded profit′(i) for item i. By
the profit rounding definition, for each item i,

loss(i) = profit(i) − k · ⌊profit(i)/k⌋ ≤ k

Then, over all n items,

Σ_{i=1}^n loss(i) ≤ nk                       (loss(i) ≤ k for any item i)
                 ≤ ε · pmax                  (Since k ≤ ε · pmax/n)
                 ≤ ε · profit(OP T (I))      (Since pmax ≤ profit(OP T (I)))

Thus, profit(FPTAS-Knapsack(I)) ≥ (1 − ε) · profit(OP T (I)).
Furthermore, FPTAS-Knapsack runs in O(n^2 · pmax/k) = O(n^3/ε) ∈
poly(n, 1/ε) time.

Remark k = 1 when pmax ≤ n/ε. In that case, no rounding occurs and the
DP finds the exact solution in O(n^2 · pmax) ⊆ O(n^3/ε) ⊆ poly(n, 1/ε) time.

Example Recall the earlier example where budget B = 10 and S = {1 :
⟨10, 130⟩, 2 : ⟨7, 103⟩, 3 : ⟨6, 91⟩, 4 : ⟨4, 40⟩, 5 : ⟨3, 38⟩}. For ε = 1/2, one
would set k = max{1, ⌊ε · pmax/n⌋} = max{1, ⌊(130/2)/5⌋} = 13. After
rounding, we have S′ = {1 : ⟨10, 10⟩, 2 : ⟨7, 7⟩, 3 : ⟨6, 7⟩, 4 : ⟨4, 3⟩, 5 : ⟨3, 2⟩}.
An optimum subset from S′ is {3 : ⟨6, 7⟩, 4 : ⟨4, 3⟩}, which translates to
a total profit of 91 + 40 = 131 in the original problem. As expected,
131 = profit(FPTAS-Knapsack(I)) ≥ (1 − 1/2) · profit(OP T (I)) = 70.5.
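The rounding step can be checked directly on this running example; the following snippet (illustrative, not part of the notes) reproduces k = 13, the rounded instance S′, and the loss bound from the proof of Theorem 2.4:

```python
# Rounding step of FPTAS-Knapsack on the running example, with eps = 1/2.
profits = [130, 103, 91, 40, 38]
n, eps = len(profits), 0.5
pmax = max(profits)
k = max(1, int(eps * pmax / n))  # k = max{1, floor(eps * pmax / n)}
assert k == 13

rounded = [p // k for p in profits]
assert rounded == [10, 7, 7, 3, 2]  # matches the rounded instance S'

# Per-item rounding loss is at most k, so the total loss is at most
# n*k <= eps * pmax <= eps * OPT.
loss = [p - k * r for p, r in zip(profits, rounded)]
assert all(l <= k for l in loss) and sum(loss) <= n * k
```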

2.2 Bin packing

Definition 2.5 (Bin packing problem). Given a set S with n items where
each item i has size(i) ∈ (0, 1], find the minimum number of unit-sized bins
(i.e. bins of size 1) that can hold all n items.

For any problem instance I, let OP T (I) be an optimal bin assignment
and |OP T (I)| be the corresponding minimum number of bins required. One
can see that Σ_{i=1}^n size(i) ≤ |OP T (I)|.

Example Consider S = {0.5, 0.1, 0.1, 0.1, 0.5, 0.4, 0.5, 0.4, 0.4}, where
|S| = n = 9. Since Σ_{i=1}^n size(i) = 3, at least 3 bins are needed. One can
verify that 3 bins suffice: b1 = b2 = b3 = {0.5, 0.4, 0.1}. Hence, |OP T (S)| = 3.

[Figure: three bins b1, b2, b3, each packed with items of sizes 0.5, 0.4, and 0.1.]

2.2.1 First-fit: A 2-approximation algorithm


FirstFit processes items one-by-one, creating new bins if an item cannot
fit into one of the existing bins. For a unit-sized bin b, we use size(b) to
denote the sum of the size of items that are put into b, and define free(b) =
1 − size(b).

Algorithm 4 FirstFit(S)
B ← ∅ . Collection of bins
for i ∈ {1, . . . , n} do
    if size(i) ≤ free(b) for some bin b ∈ B then
        Pick the first (lowest-indexed) such bin b.
        free(b) ← free(b) − size(i) . Put item i into existing bin b
    else
        B ← B ∪ {b′} . Put item i into a fresh bin b′
        free(b′) ← 1 − size(i)
    end if
end for
return B

Lemma 2.6. Using FirstFit, at most one bin is half-full or less. That is,
|{b ∈ B : size(b) ≤ 1/2}| ≤ 1, where B is the output of FirstFit.

Proof. Suppose, for contradiction, that there are two bins bi and bj such that
i < j, size(bi) ≤ 1/2 and size(bj) ≤ 1/2. Then, FirstFit would have put
all items of bj into bi instead, and would never have created bj. This is a
contradiction.
Theorem 2.7. FirstFit is a 2-approximation algorithm for bin packing.

Proof. Suppose FirstFit terminates with |B| = m bins. By Lemma 2.6,
Σ_{i=1}^n size(i) > (m − 1)/2, as m − 1 bins are more than half-full. Since
Σ_{i=1}^n size(i) ≤ |OPT(I)|, we have

m − 1 < 2 · Σ_{i=1}^n size(i) ≤ 2 · |OPT(I)|.

That is, m ≤ 2 · |OPT(I)|, since both m and |OPT(I)| are integers.

Recall the example with S = {0.5, 0.1, 0.1, 0.1, 0.5, 0.4, 0.5, 0.4, 0.4}. First-
Fit uses 4 bins: b1 = {0.5, 0.1, 0.1, 0.1}, b2 = b3 = {0.5, 0.4}, b4 = {0.4}.
As expected, 4 = |FirstFit(S)| ≤ 2 · |OPT(S)| = 6.

[Figure: the FirstFit packing — b1 = {0.5, 0.1, 0.1, 0.1}, b2 = b3 = {0.5, 0.4}, b4 = {0.4}.]

Remark If we first sort the item sizes in non-increasing order, then one
can show that running FirstFit on the sorted items yields a
3/2-approximation algorithm for bin packing. See the footnote for details.¹

It is natural to wonder whether we can do better than a 3/2-approximation.
Unfortunately, unless P = NP, we cannot do so efficiently. To prove this, we
show that if we can efficiently derive a (3/2 − ε)-approximation for bin packing,
then the partition problem (which is NP-hard) can be solved efficiently.

Definition 2.8 (Partition problem). Given a multiset S of (possibly re-
peated) positive integers x1, . . . , xn, is there a way to partition S into S1
and S2 such that Σ_{x∈S1} x = Σ_{x∈S2} x?

Theorem 2.9. It is NP-hard to solve bin packing with an approximation
factor better than 3/2.

¹Curious readers can consult the following lecture notes for a proof on First-Fit-Decreasing:
http://ac.informatik.uni-freiburg.de/lak_teaching/ws11_12/combopt/notes/bin_packing.pdf
https://dcg.epfl.ch/files/content/sites/dcg/files/courses/2012%20-%20Combinatorial%20Optimization/12-BinPacking.pdf

Proof. Suppose some polytime algorithm A solves bin packing with a (3/2 −
ε)-approximation for ε > 0. Given an instance of the partition problem
with S = {x1, . . . , xn}, let X = Σ_{i=1}^n xi. Define a bin packing instance
S′ = {2x1/X, . . . , 2xn/X}. Since Σ_{x∈S′} x = 2, at least two bins are required. By
construction, one can bipartition S if and only if only two bins are required
to pack S′. Since A gives a (3/2 − ε)-approximation, if OPT on S′ returns 2
bins, then A on S′ will also return ⌊2 · (3/2 − ε)⌋ = 2 bins. Therefore, as A
solves bin packing with a (3/2 − ε)-approximation in polytime, we would get an
algorithm for solving the partition problem in polytime. Contradiction.
The above rules out the possibility of a proper PTAS for bin packing,
as a (1 + ε)-approximation of 2, for ε < 0.5, would be strictly less than 3.
Another way to view this negative result is to say that we need to allow the
approximation algorithm at least an additive +1 loss, in comparison
to the optimum. But we can still aim for an approximation that is within a
(1 + ε) factor of the optimum modulo an additive +1 error, i.e., achieving a number
of bins that is at most (1 + ε) · OPT + 1.

In the following sections, we work towards this goal, with a runtime that
is exponential in 1/ε. To do this, we first consider two simplifying assumptions
and design algorithms for them. Then, we see how to adapt the algorithm
and remove these two assumptions.

2.2.2 Special case 1: Exact solving with Aε

In this section, we make the following two assumptions:

Assumption (1) All items have size at least ε, for some ε > 0.

Assumption (2) There are only k different possible item sizes (k is a constant).

Define M = ⌈1/ε⌉. By assumption (1), there are at most M items in a bin.
In addition, define R = C(M + k, M), the number of ways to arrange up to M
items of k possible sizes in one bin; by assumption (2), R bounds the number of
item arrangements in a single bin. Since at most n bins are needed, the total number
of bin configurations is at most C(n + R, R) ≤ (n + R)^R = O(n^R). Since k and ε
are constants, R is also constant, and one can enumerate over all possible bin
configurations (denote this algorithm by Aε) to exactly solve bin packing, in
this special case, in O(n^R) ∈ poly(n) time.

Remark 1 The number of arrangements per bin is computed by solving a combi-
natorics problem of the following form: if xi denotes the number of items of
the i-th possible size, how many non-negative integer solutions are there to
x1 + · · · + xk ≤ M? Such problems can be solved by counting how
many ways there are to put M indistinguishable balls into k + 1 distinguishable
bins (one extra bin absorbing the slack), which is C(M + k, k); this technique is generally known as stars and bars.²
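As a quick sanity check (our own snippet, with arbitrary small parameters), the stars-and-bars count matches brute-force enumeration:

```python
from itertools import product
from math import comb

def count_bin_arrangements(M, k):
    """Number of non-negative integer solutions to x1 + ... + xk <= M,
    i.e., the number of possible item arrangements in a single bin."""
    return comb(M + k, k)

# Brute-force check for, e.g., M = 5 items per bin and k = 3 sizes.
M, k = 5, 3
brute = sum(1 for x in product(range(M + 1), repeat=k) if sum(x) <= M)
assert brute == count_bin_arrangements(M, k)  # C(8, 3) = 56
```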

Remark 2 The number of bin configurations is computed with n available bins
(i.e., one bin per item). One may use fewer than n bins, but this upper
bound suffices for our purposes.

2.2.3 Special case 2: PTAS

In this section, we remove the second assumption and require only:

Assumption (1) All items have size at least ε, for some ε > 0.

Our goal is to reuse the exact algorithm Aε on a slightly modified prob-
lem instance J that satisfies both assumptions. For this, we partition the
items into k = ⌈1/ε²⌉ non-overlapping groups of Q ≤ ⌊nε²⌋ elements each. To obtain a
constant number of different sizes, we round the sizes of all items in a group
up to the largest size in that group, resulting in at most k different item sizes.
We can now call Aε on J to solve the modified instance exactly in polyno-
mial time. Since J only rounds sizes up, Aε(J) yields a satisfying bin
assignment for instance I, possibly with some spare slack. The entire procedure
is described in PTAS-BinPacking.

Algorithm 5 PTAS-BinPacking(I = S, ε)
k ← ⌈1/ε²⌉
Q ← ⌊nε²⌋
Partition the n items into k non-overlapping groups, each with ≤ Q items
for i ∈ {1, . . . , k} do
    imax ← max_{item j in group i} size(j)
    for item j in group i do
        size(j) ← imax
    end for
end for
Denote the modified instance by J
return Aε(J)
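The grouping-and-rounding step can be isolated as a small sketch (our Python illustration of the rounding only; the exact solver Aε is treated as a black box and omitted, and we chunk the sorted items greedily into groups of size Q):

```python
from math import ceil, floor

def round_up_groups(sizes, eps):
    """Sort the items, chunk them into groups of Q = floor(n * eps^2)
    items, and round every size up to the maximum of its group; this
    leaves roughly 1/eps^2 distinct sizes."""
    n = len(sizes)
    Q = max(1, floor(n * eps ** 2))
    s = sorted(sizes)
    rounded = []
    for g in range(0, n, Q):
        group = s[g:g + Q]
        rounded += [max(group)] * len(group)  # round group up to its max
    return rounded

sizes = [0.5, 0.1, 0.1, 0.1, 0.5, 0.4, 0.5, 0.4, 0.4]
r = round_up_groups(sizes, 0.6)
assert all(a <= b for a, b in zip(sorted(sizes), r))   # only rounds up
assert len(set(r)) <= ceil(1 / 0.6 ** 2)               # few distinct sizes
```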

It remains to show that the solution to the modified instance, OPT(J),
yields a (1 + ε)-approximation of OPT(I). For this, consider another modified
instance J′ that is defined analogously to J, only with item sizes rounded down.
Since we rounded down the item sizes in J′, we have |OPT(J′)| ≤ |OPT(I)|.

²See slides 22 and 23 of http://www.cs.ucr.edu/~neal/2006/cs260/piyush.pdf for
an illustration of C(M + k, M) and C(n + R, R).

Figure 2.1: Partition the items into k groups, each with ≤ Q items, and label
the groups in ascending order of sizes; J rounds item sizes up, J′ rounds item sizes down.

Lemma 2.10. |OPT(J)| ≤ |OPT(J′)| + Q.

Proof. Label the k groups in J by J1, . . . , Jk, where the items in Ji have
smaller sizes than the items in Ji+1; label the k groups in J′ similarly (see
Figure 2.1). For i ∈ {1, . . . , k − 1}, the smallest item in J′_{i+1} has size
at least as large as the largest item in Ji, so any valid packing for J′_{i+1} serves as
a valid packing for Ji. For Jk (the group with the ≤ Q largest items of J), we use
a separate bin for each of its items (hence the additive Q term).

Lemma 2.11. |OPT(J)| ≤ |OPT(I)| + Q.

Proof. By Lemma 2.10 and the fact that |OPT(J′)| ≤ |OPT(I)|.

Theorem 2.12. PTAS-BinPacking is a (1 + ε)-approximation algorithm
for bin packing under assumption (1).

Proof. By assumption (1), all item sizes are at least ε, so |OPT(I)| ≥ nε.
Then, Q = ⌊nε²⌋ ≤ ε · |OPT(I)|. Apply Lemma 2.11.

2.2.4 General case


We now consider the general case, where we do not make any assumptions
on the problem instance I. First, we lower bound the minimum item size
by putting aside all items with size smaller than ε′ = min{1/2, ε/2}, thus allowing us
to use PTAS-BinPacking. Then, we add back the small items in a greedy
manner with FirstFit to complete the packing.

Theorem 2.13. Full-PTAS-BinPacking uses at most (1 + ε) · |OPT(I)| + 1 bins.

Algorithm 6 Full-PTAS-BinPacking(I = S, ε)
ε′ ← min{1/2, ε/2} . See the analysis for why we choose this ε′
X ← Items with size < ε′ . Put aside small items
P ← PTAS-BinPacking(S \ X, ε′) . By Theorem 2.12, |P| ≤ (1 + ε′) · |OPT(S \ X)|
P′ ← Using FirstFit, add the items in X to P . Handle small items
return Resultant packing P′

Proof. If FirstFit does not open a new bin, the theorem trivially holds.
Suppose FirstFit opens a new bin, using m bins in total; then at least
m − 1 bins are strictly more than (1 − ε′)-full. Hence,

|OPT(I)| ≥ Σ_{i=1}^n size(i) > (m − 1)(1 − ε′),

where the first step is the usual lower bound on |OPT(I)| and the second
follows from the observation above. Rearranging,

m < |OPT(I)| / (1 − ε′) + 1
  ≤ |OPT(I)| · (1 + 2ε′) + 1      (since 1/(1 − ε′) ≤ 1 + 2ε′ for ε′ ≤ 1/2)
  ≤ (1 + ε) · |OPT(I)| + 1        (by the choice of ε′ = min{1/2, ε/2}).

2.3 Minimum makespan scheduling

Definition 2.14 (Minimum makespan scheduling problem). Given n jobs,
let I = {p1, . . . , pn} be the set of processing times, where job i takes pi units of
time to complete. Find an assignment of the n jobs to m identical machines
such that the completion time (i.e. the makespan) is minimized.

For any problem instance I, let OPT(I) be an optimal job assignment
and |OPT(I)| be the corresponding makespan. One can see that:

• pmax = max_{i∈{1,...,n}} pi ≤ |OPT(I)|

• (1/m) · Σ_{i=1}^n pi ≤ |OPT(I)|

Denote by L(I) = max{pmax, (1/m) · Σ_{i=1}^n pi} the larger of the two lower bounds.
Then, L(I) ≤ |OPT(I)|.

Remark To prove approximation factors, it is often useful to relate to lower


bounds of |OP T (I)|.

Example Suppose we have 7 jobs with processing times I = {p1 = 3,
p2 = 4, p3 = 5, p4 = 6, p5 = 4, p6 = 5, p7 = 6} and m = 3 machines.
Then, the lower bound on the makespan is L(I) = max{6, 11} = 11. This is
achievable by allocating M1 = {p1, p2, p5}, M2 = {p3, p4}, M3 = {p6, p7}.

[Figure: an optimal schedule — M1 = {p1, p2, p5}, M2 = {p3, p4}, M3 = {p6, p7}, makespan 11.]

Graham [Gra66] is a 2-approximation greedy algorithm for the minimum
makespan scheduling problem. With slight modifications, we improve it to
ModifiedGraham, a 4/3-approximation algorithm. Finally, we end the
section with a PTAS for minimum makespan scheduling.

2.3.1 Greedy approximation algorithms

Algorithm 7 Graham(I = {p1, . . . , pn}, m)
M1, . . . , Mm ← ∅ . All machines are initially free
for i ∈ {1, . . . , n} do
    j ← argmin_{j∈{1,...,m}} Σ_{p∈Mj} p . Pick the least loaded machine
    Mj ← Mj ∪ {pi} . Add job i to this machine
end for
return M1, . . . , Mm

Theorem 2.15. Graham is a 2-approximation algorithm.

Proof. Suppose the last job that finishes (which takes plast time) was
assigned to machine Mj. Define t = (Σ_{p∈Mj} p) − plast as the makespan of
machine Mj before the last job was assigned to it. That is,

|Graham(I)| = t + plast.

As Graham assigns greedily to the least loaded machine, all machines take
at least t time, hence

t ≤ (1/m) · Σ_{i=1}^n pi ≤ |OPT(I)|,

as (1/m) · Σ_{i=1}^n pi is the average work done per machine. Since plast ≤
pmax ≤ |OPT(I)|, we have |Graham(I)| = t + plast ≤ 2 · |OPT(I)|.

Corollary 2.16. |OPT(I)| ≤ 2 · L(I), where L(I) = max{pmax, (1/m) · Σ_{i=1}^n pi}.

Proof. From the proof of Theorem 2.15, we have |Graham(I)| = t + plast
and t ≤ (1/m) · Σ_{i=1}^n pi. Since |OPT(I)| ≤ |Graham(I)| and plast ≤ pmax, we
have

|OPT(I)| ≤ (1/m) · Σ_{i=1}^n pi + pmax ≤ 2 · L(I).

Recall the example with I = {p1 = 3, p2 = 4, p3 = 5, p4 = 6, p5 =
4, p6 = 5, p7 = 6} and m = 3. Graham will schedule M1 = {p1, p4},
M2 = {p2, p5, p7}, M3 = {p3, p6}, yielding a makespan of 14. As expected,
14 = |Graham(I)| ≤ 2 · |OPT(I)| = 22.

[Figure: the Graham schedule — M1 = {p1, p4}, M2 = {p2, p5, p7}, M3 = {p3, p6}, makespan 14.]

Remark The approximation for Graham is loose because we have no


guarantees on plast beyond plast ≤ pmax . This motivates us to order the job
timings in descending order (see ModifiedGraham).

Algorithm 8 ModifiedGraham(I = {p1, . . . , pn}, m)
I′ ← I sorted in descending order
return Graham(I′, m)
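Both greedy schedulers can be sketched in a few lines of Python using a min-heap over machine loads (our illustration of the greedy rule, not the notes' code):

```python
import heapq

def graham(jobs, m):
    """Greedy list scheduling: assign each job to the currently
    least loaded of m machines; returns the resulting makespan."""
    loads = [(0, i) for i in range(m)]  # (load, machine id) min-heap
    heapq.heapify(loads)
    for p in jobs:
        load, i = heapq.heappop(loads)  # least loaded machine
        heapq.heappush(loads, (load + p, i))
    return max(load for load, _ in loads)

def modified_graham(jobs, m):
    """Longest-processing-time-first variant: sort the jobs in
    descending order before applying Graham's greedy rule."""
    return graham(sorted(jobs, reverse=True), m)

jobs = [3, 4, 5, 6, 4, 5, 6]
print(graham(jobs, 3), modified_graham(jobs, 3))  # prints: 14 13
```

These match the makespans 14 and 13 of the worked examples in this section.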

Let plast be the last job that finishes running. We consider the two cases
plast > (1/3) · |OPT(I)| and plast ≤ (1/3) · |OPT(I)| separately in the analysis.
Lemma 2.17. If plast > (1/3) · |OPT(I)|, then |ModifiedGraham(I)| = |OPT(I)|.

Proof. For m ≥ n, |ModifiedGraham(I)| = |OPT(I)| by trivially putting
one job on each machine. For m < n, without loss of generality³, we can
assume that every machine has a job.

Suppose, for a contradiction, that |ModifiedGraham(I)| > |OPT(I)|.
Then, there exists a sequence of jobs with descending sizes I = {p1, . . . , pn}
such that the last, smallest job pn causes ModifiedGraham(I) to have a
makespan larger than OPT(I)⁴. That is, |ModifiedGraham(I \ {pn})| ≤
|OPT(I)| and plast = pn. Let C be the configuration of the machines after
ModifiedGraham assigned {p1, . . . , pn−1}.

Observation 1 In C, each machine has either 1 or 2 jobs.
If there were a machine Mi with ≥ 3 jobs, Mi would take > |OPT(I)|
time, because all jobs take > (1/3) · |OPT(I)| time. This contradicts the
assumption |ModifiedGraham(I \ {pn})| ≤ |OPT(I)|.

Let us denote the jobs that are alone in C as heavy jobs, and the machines
they are on as heavy machines.

Observation 2 In OPT(I), all heavy jobs are alone.
By the assumption on pn, we know that assigning pn to any machine (in
particular, the heavy machines) in C causes the makespan to exceed
|OPT(I)|. Since pn is the smallest job, no other job can be assigned to
the heavy machines, otherwise |OPT(I)| cannot be attained by OPT(I).

Suppose there are k heavy jobs occupying a machine each in OPT(I). Then,
there are 2(m − k) + 1 jobs (two non-heavy jobs per machine in C, plus pn) to
be distributed across m − k machines. By the pigeonhole principle, at least
one machine M∗ will get ≥ 3 jobs in OPT(I). However, since the smallest
job pn takes > (1/3) · |OPT(I)| time, M∗ will spend > |OPT(I)| time. This is
a contradiction.

Theorem 2.18. ModifiedGraham is a 4/3-approximation algorithm.

Proof. By similar arguments as in Theorem 2.15, |ModifiedGraham(I)| =
t + plast ≤ (4/3) · |OPT(I)| when plast ≤ (1/3) · |OPT(I)|. Meanwhile, when plast >
(1/3) · |OPT(I)|, |ModifiedGraham(I)| = |OPT(I)| by Lemma 2.17.

³Suppose there is a machine Mi without a job; then there must be another machine
Mj with more than 1 job (by the pigeonhole principle). Shifting one of the jobs from Mj to
Mi will not increase the makespan.
⁴If adding pj for some j < n already causes |ModifiedGraham({p1, . . . , pj})| >
|OPT(I)|, we can truncate I to {p1, . . . , pj} so that plast = pj. Since pj ≥ pn >
(1/3) · |OPT(I)|, the antecedent still holds.

Recall the example with I = {p1 = 3, p2 = 4, p3 = 5, p4 = 6, p5 = 4, p6 =
5, p7 = 6} and m = 3. Putting I in decreasing order of sizes, I′ = ⟨p4 = 6, p7 = 6,
p3 = 5, p6 = 5, p2 = 4, p5 = 4, p1 = 3⟩, and ModifiedGraham will schedule
M1 = {p4, p2, p1}, M2 = {p7, p5}, M3 = {p3, p6}, yielding a makespan of 13.
As expected, 13 = |ModifiedGraham(I)| ≤ (4/3) · |OPT(I)| = 14.666 . . .

[Figure: the ModifiedGraham schedule — M1 = {p4, p2, p1}, M2 = {p7, p5}, M3 = {p3, p6}, makespan 13.]

2.3.2 PTAS for minimum makespan scheduling

Recall that any makespan scheduling instance (I, m) has the lower bound
L(I) = max{pmax, (1/m) · Σ_{i=1}^n pi}. From Corollary 2.16, we know that |OPT(I)| ∈
[L(I), 2L(I)]. Let Bin(I, t) be the minimum number of bins of size t that can
hold all jobs. By associating job processing times with item sizes, and scal-
ing bin sizes up by a factor of t, we can relate Bin(I, t) to the bin packing
problem. One can see that Bin(I, t) is monotonically decreasing in t and that
|OPT(I)| is the minimum t such that Bin(I, t) = m. Hence, to get a (1 + ε)-
approximate schedule, it suffices to find a t ≤ (1 + ε) · |OPT(I)| such that
Bin(I, t) ≤ m.

Given t, PTAS-Makespan transforms a makespan scheduling instance
into a bin packing instance, then solves for an approximate bin packing to
yield an approximate schedule. Ignoring small jobs (jobs of size ≤ εt)
and rounding job sizes down to the closest power of (1 + ε) in εt · {1, (1 +
ε), . . . , (1 + ε)^h = ε^{−1}}, exact bin packing Aε with bins of size t is used, yielding a
packing P. To get a bin packing for the original job sizes, PTAS-Makespan
follows P's bin packing but uses bins of size t(1 + ε) to account for the rounded-
down job sizes. For example, suppose jobs 1 and 2 with sizes p1 and p2 were rounded down
to p′1 and p′2, and P assigns them to the same bin (i.e., p′1 + p′2 ≤ t). Then, due to
the rounding process, their original sizes also fit into a bin of size t(1 + ε),
since p1 + p2 ≤ p′1(1 + ε) + p′2(1 + ε) ≤ t(1 + ε). Finally, small jobs are handled
using FirstFit. Let α(I, t, ε) be the final bin configuration produced by
PTAS-Makespan on parameter t, and let |α(I, t, ε)| be the number of bins used.
Since |OPT(I)| ∈ [L, 2L], there will be a t ∈ {L, L + εL, L + 2εL, . . . , 2L}
such that |α(I, t, ε)| ≤ Bin(I, t) ≤ m bins (see Lemma 2.19 for the first

Algorithm 9 PTAS-Makespan(I = {p1, . . . , pn}, m)
L ← max{pmax, (1/m) · Σ_{i=1}^n pi}
for t ∈ {L, L + εL, L + 2εL, L + 3εL, . . . , 2L} do
    I′ ← I \ {jobs with sizes ≤ εt} =: I \ X . Ignore small jobs
    h ← ⌈log_{1+ε}(1/ε)⌉ . To partition (εt, t] into powers of (1 + ε)
    for pi ∈ I′ do
        k ← Largest j ∈ {0, . . . , h} such that pi ≥ εt(1 + ε)^j
        pi ← εt(1 + ε)^k . Round down job size
    end for
    P ← Aε(I′) . Use Aε from Section 2.2.2 with bins of size t
    α(I, t, ε) ← Use bins of size t(1 + ε) to emulate P on the original sizes
    α(I, t, ε) ← Using FirstFit, add the items in X to α(I, t, ε)
    if α(I, t, ε) uses ≤ m bins then
        return Assign jobs to machines according to α(I, t, ε)
    end if
end for

inequality). Note that running a binary search on t also works, but we only
need polynomial time here.
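The geometric rounding in the inner loop — dropping jobs of size ≤ εt and rounding the rest down onto the grid εt(1 + ε)^j — can be isolated as a small sketch (our Python illustration; Aε is again treated as a black box and omitted):

```python
from math import ceil, log

def round_down_jobs(jobs, t, eps):
    """Drop jobs of size <= eps*t and round each remaining job down to
    the nearest eps*t*(1+eps)^j, so at most h+1 distinct sizes remain."""
    h = ceil(log(1 / eps) / log(1 + eps))
    big = [p for p in jobs if p > eps * t]
    rounded = []
    for p in big:
        j = 0  # largest j with eps*t*(1+eps)^j <= p
        while j + 1 <= h and eps * t * (1 + eps) ** (j + 1) <= p:
            j += 1
        rounded.append(eps * t * (1 + eps) ** j)
    return rounded

# With t = L = 11 from the earlier example and eps = 0.5, only the two
# jobs of size 6 exceed eps*t = 5.5, and both round down to 5.5.
print(round_down_jobs([3, 4, 5, 6, 4, 5, 6], t=11, eps=0.5))
```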
Lemma 2.19. For any t > 0, |α(I, t, ε)| ≤ Bin(I, t).

Proof. If FirstFit does not open a new bin, then |α(I, t, ε)| ≤ Bin(I, t), since
α(I, t, ε) uses an additional (1 + ε) buffer on each bin. If FirstFit does open a
new bin (say, totalling b bins), then at least b − 1 of the produced bins are
more than (t(1 + ε) − εt) = t full, as each small item has size ≤ εt and did not
fit into them. Hence, any bin packing with bins of size t must use strictly
more than (b − 1)t / t = b − 1 bins. In particular, Bin(I, t) ≥ b = |α(I, t, ε)|.
Theorem 2.20. PTAS-Makespan is a (1 + 3ε)-approximation for the min-
imum makespan scheduling problem.

Proof. Let t∗ = |OPT(I)| and let tα be the minimum t ∈ {L, L + εL, L +
2εL, . . . , 2L} such that |α(I, t, ε)| ≤ m. It follows that tα ≤ t∗ + εL. Since
L ≤ |OPT(I)|, and since we use bins of final size tα(1 + ε) to accommodate
the original sizes, we have |PTAS-Makespan(I)| ≤ tα(1 + ε) ≤ (t∗ +
εL)(1 + ε) ≤ (1 + ε)² · |OPT(I)|. For ε ∈ [0, 1] we have (1 + ε)² ≤ 1 + 3ε,
and thus the statement follows.
Theorem 2.21. PTAS-Makespan runs in poly(|I|, m) time.

Proof. There are at most ⌈1/ε⌉ + 1 ∈ O(1/ε) values of t to try. Filtering out
small jobs and rounding the remaining jobs takes O(n) time per value of t. From
Section 2.2.2, Aε runs in O(n^R) time for a constant R depending only on ε
(here the number of distinct sizes is k = h + 1), and FirstFit runs in O(nm) time.
Chapter 3

Randomized approximation
schemes

In this chapter, we study the class of algorithms which extends FPTAS by
allowing randomization.

Definition 3.1 (Fully polynomial randomized approximation scheme (FPRAS)).
For a cost metric c, an algorithm A is a FPRAS if

• for any ε > 0, Pr[ |c(A(I)) − c(OPT(I))| ≤ ε · c(OPT(I)) ] ≥ 3/4, and

• A runs in poly(|I|, 1/ε) time.

Intuition An FPRAS computes, with a high enough probability, a solution
which is not too far from the optimal one, in a reasonable amount of time.

Remark The probability 3/4 above is somewhat arbitrary; we can easily
amplify the success probability. In particular, for any desired δ > 0,
we can invoke O(log(1/δ)) independent copies of the algorithm A and return
the median of their outputs. The median is a correct estimation with probability greater than
1 − δ. This is known as probability amplification (see Section 9.1).
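To see why the median works, note that it fails only if at least half of the independent copies fail, each independently with probability at most 1/4; this tail probability can be computed exactly (a small self-contained check of ours, with an arbitrary number of copies):

```python
from math import comb

def median_failure_prob(k, p_fail=0.25):
    """Probability that at least half of k independent copies fail,
    which upper-bounds the probability that the median estimate is bad."""
    return sum(comb(k, i) * p_fail**i * (1 - p_fail)**(k - i)
               for i in range((k + 1) // 2, k + 1))

# With 33 copies, the median fails with probability below 1%.
assert median_failure_prob(33) < 0.01
```

The failure probability decays exponentially in k, which is exactly why O(log(1/δ)) copies suffice.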

3.1 DNF counting

Definition 3.2 (Disjunctive Normal Form (DNF)). A formula F on n Boolean
variables x1, . . . , xn is said to be in Disjunctive Normal Form (DNF) if

• F = C1 ∨ · · · ∨ Cm is a disjunction (that is, a logical OR) of clauses,

• ∀i ∈ [m], a clause Ci = l_{i,1} ∧ · · · ∧ l_{i,|Ci|} is a conjunction (that is, a logical
AND) of literals, and

• ∀i ∈ [n], a literal li ∈ {xi, ¬xi} is either the variable xi or its negation.

Let α : [n] → {0, 1} be a truth assignment to the n variables. Formula F
is said to be satisfiable if there exists a satisfying assignment α such that F
evaluates to true under α (i.e. F[α] = 1).

Any clause with both xi and ¬xi is trivially false. As they can be removed
in a single scan of F , we assume that F does not contain such trivial clauses.

Example Let F = (x1 ∧ ¬x2 ∧ ¬x4 ) ∨ (x2 ∧ x3 ) ∨ (¬x3 ∧ ¬x4 ) be a


Boolean formula on 4 variables, where C1 = x1 ∧ ¬x2 ∧ ¬x4 , C2 = x2 ∧ x3
and C3 = ¬x3 ∧ ¬x4 . Drawing the truth table, one sees that there are 9 sat-
isfying assignments to F , one of which is α(1) = 1, α(2) = α(3) = α(4) = 0.

Remark Another common normal form for representing Boolean formulas


is the Conjunctive Normal Form (CNF). Formulas in CNF are conjunctions
of disjunctions (as compared to disjunctions of conjunctions in DNF). In
particular, one can determine in polynomial time whether a DNF formula is
satisfiable but it is N P-complete to determine if a CNF formula is satisfiable.

In this section, we are interested in the number of satisfying assignments
of a given DNF formula. Suppose F is a Boolean formula in DNF. Let f(F) =
|{α : F[α] = 1}| be the number of satisfying assignments to F. If we let
Si = {α : Ci[α] = 1} be the set of satisfying assignments to clause Ci,
then we see that f(F) = |∪_{i=1}^m Si|. We are interested in polynomial-time
algorithms for computing or approximating f(F). In the above example,
|S1| = 2, |S2| = 4, |S3| = 4, and f(F) = 9.
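These counts are small enough to verify by brute force (a quick Python check of ours for the running example):

```python
from itertools import product

# Running example: F = (x1 ∧ ¬x2 ∧ ¬x4) ∨ (x2 ∧ x3) ∨ (¬x3 ∧ ¬x4)
clauses = [
    lambda a: a[0] and not a[1] and not a[3],   # C1
    lambda a: a[1] and a[2],                    # C2
    lambda a: not a[2] and not a[3],            # C3
]
assignments = list(product([False, True], repeat=4))
S = [sum(1 for a in assignments if C(a)) for C in clauses]      # |S_i|
f = sum(1 for a in assignments if any(C(a) for C in clauses))   # f(F)
print(S, f)  # clause counts |S_i| and total number of satisfying assignments
```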

In the following, we present two failed attempts to compute f (F ) and


then present DNF-Count, a FPRAS for DNF counting via sampling.

3.1.1 Failed attempt 1: Principle of Inclusion-Exclusion

By the definition f(F) = |∪_{i=1}^m Si|, one may be tempted to apply the Principle
of Inclusion-Exclusion and expand

|∪_{i=1}^m Si| = Σ_{i=1}^m |Si| − Σ_{i<j} |Si ∩ Sj| + · · ·

However, there are exponentially many terms, and there exist instances where
truncating the sum yields arbitrarily bad approximations.

3.1.2 Failed attempt 2: Sampling (wrongly)

Suppose we pick k assignments uniformly at random (u.a.r.). Let Xi be
the indicator variable of whether the i-th assignment satisfies F, and let
X = Σ_{i=1}^k Xi be the number of satisfying assignments among the k samples.
A u.a.r. assignment is satisfying with probability f(F)/2^n, so by
linearity of expectation, E(X) = k · f(F)/2^n. Unfortunately, since we only sample
k ∈ poly(n, 1/ε) assignments, and since f(F)/2^n can be exponentially small, it can be
quite likely (e.g., with probability much larger than 1 − 1/poly(n)) that none of
our sampled assignments is satisfying: in such a case, we cannot
infer much about the number of satisfying assignments using only poly(n)
samples. We would need exponentially many samples for X to yield a good
estimate of f(F). Thus, this approach does not yield a FPRAS for DNF
counting.

3.1.3 An FPRAS for DNF counting via sampling

Consider an m-by-f(F) Boolean matrix M where

M[i, j] = 1 if assignment αj satisfies clause Ci, and M[i, j] = 0 otherwise.

Remark We are trying to estimate f(F) and thus will never actually
build the matrix M. It is used here only to explain why this approach
works.

        α1   α2   · · ·   αf(F)
C1      0    1    · · ·   0
C2      1    1    · · ·   1
C3      0    0    · · ·   0
· · ·   · · ·
Cm      0    1    · · ·   1

Table 3.1: Visual representation of the matrix M. Red 1's indicate the
topmost clause Ci satisfied by each assignment αj.

Let |M | denote the total number of 1’s in M ; it is the sum of the number
of clauses satisfied by each assignment that satisfies F . Recall that PmSi is the
n−|Ci |
set of assignments
Pm n−|Ci | that satisfy C i . Since |Si | = 2 , |M | = i=1 |Si | =
i=1 2 .
We are now interested in the number of “topmost” 1’s in the matrix, where
“topmost” is defined column-wise. As every column represents a satisfying
assignment, at least one clause must be satisfied for each assignment and this
proves that there are exactly f (F ) “topmost” 1’s in the matrix M (i.e. one
by column).
DNF-Count estimates the fraction of “topmost” 1’s in M , then returns
this fraction times |M | as an estimate of f (F ).
To estimate the fraction of “topmost” 1’s:

• Pick a clause according to its length: shorter clauses are more likely.

• Uniformly select a satisfying assignment for the picked clause by flipping
coins for the variables not in the clause.

• Check if the assignment satisfies any clause with a smaller index.

Algorithm 10 DNF-Count(F, ε)
X ← 0 . Empirical number of "topmost" 1's sampled
for k = 9m/ε² times do
    Ci ← Sample one of the m clauses, where Pr[Ci chosen] = 2^{n−|Ci|} / |M|
    αj ← Sample one of the 2^{n−|Ci|} satisfying assignments of Ci
    IsTopmost ← True
    for l ∈ {1, . . . , i − 1} do . Check if αj is "topmost"
        if Cl[αj] = 1 then . Checkable in O(n) time
            IsTopmost ← False
        end if
    end for
    if IsTopmost then
        X ← X + 1
    end if
end for
return |M| · X / k
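A direct Python implementation of DNF-Count might look as follows (our sketch; we represent each clause as a dict mapping variable index to its required truth value):

```python
import random

def dnf_count(clauses, n, eps, rng=random):
    """Karp–Luby-style estimator: repeatedly sample a uniformly random
    '1' of the matrix M and count how often it is 'topmost'; return
    the estimate |M| * X / k."""
    m = len(clauses)
    weights = [2 ** (n - len(C)) for C in clauses]  # |S_i| = 2^(n-|C_i|)
    M = sum(weights)                                # total number of 1's
    k = int(9 * m / eps ** 2)
    X = 0
    for _ in range(k):
        i = rng.choices(range(m), weights=weights)[0]
        # uniform satisfying assignment of clause i: fix its literals,
        # flip fair coins for all remaining variables
        alpha = [rng.random() < 0.5 for _ in range(n)]
        for v, val in clauses[i].items():
            alpha[v] = val
        # 'topmost' iff no earlier clause is satisfied by alpha
        if not any(all(alpha[v] == val for v, val in clauses[l].items())
                   for l in range(i)):
            X += 1
    return M * X / k

# Running example: f(F) = 9 and |M| = 2 + 4 + 4 = 10.
F = [{0: True, 1: False, 3: False}, {1: True, 2: True}, {2: False, 3: False}]
random.seed(0)
est = dnf_count(F, n=4, eps=0.3)  # concentrates around 9
```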

Lemma 3.3. At each step, DNF-Count samples a '1' of the matrix M
uniformly at random.

Proof. Recall that the total number of 1's in M is |M| = Σ_{i=1}^m |Si| =
Σ_{i=1}^m 2^{n−|Ci|}. Then,

Pr[Ci and αj are chosen] = Pr[Ci is chosen] · Pr[αj is chosen | Ci is chosen]
= (2^{n−|Ci|} / Σ_{i=1}^m 2^{n−|Ci|}) · (1 / 2^{n−|Ci|})
= 1 / Σ_{i=1}^m 2^{n−|Ci|}
= 1 / |M|.

Lemma 3.4. In DNF-Count, Pr[ | (|M| · X)/k − f(F) | ≤ ε · f(F) ] ≥ 3/4.

Proof. Let Xi be the indicator variable of whether the i-th sampled assignment
is "topmost", and let p = Pr[Xi = 1]. By Lemma 3.3, p = f(F) / |M|.
Let X = Σ_{i=1}^k Xi be the empirical number of "topmost" 1's. Then, E(X) =
kp by linearity of expectation. Picking k = 9m/ε²,

Pr[ | (|M| · X)/k − f(F) | ≥ ε · f(F) ]
= Pr[ | X − k · f(F)/|M| | ≥ ε · k · f(F)/|M| ]    (multiplying by k/|M|)
= Pr[ |X − kp| ≥ ε · kp ]                          (since p = f(F)/|M|)
≤ 2 exp(−ε²kp/3)                                   (by a Chernoff bound)
= 2 exp(−3m · f(F)/|M|)                            (since k = 9m/ε² and p = f(F)/|M|)
≤ 2 exp(−3)                                        (since |M| ≤ m · f(F))
< 1/4.

Negating, we get

Pr[ | (|M| · X)/k − f(F) | ≤ ε · f(F) ] ≥ 1 − 1/4 = 3/4.
Lemma 3.5. DNF-Count runs in poly(|F|, 1/ε) = poly(n, m, 1/ε) time.

Proof. There are k ∈ O(m/ε²) iterations. In each iteration, we spend O(m + n)
time sampling Ci and αj, and O(nm) time checking whether the sampled αj is
"topmost". In total, DNF-Count runs in O(mn(m + n)/ε²) time.

Theorem 3.6. DNF-Count is a FPRAS for DNF counting.

Proof. By Lemmas 3.4 and 3.5.

3.2 Network reliability

Consider an undirected graph G = (V, E), where each edge e ∈ E fails with a
certain probability pe, independently of the other edges. In the
network reliability problem, we are interested in calculating or estimating the
probability that the network becomes disconnected. For simplicity, we study
the symmetric case, where pe = p for every edge e ∈ E.

As a side remark, note that we are aiming for a (1 ± ε)-
approximation of the quantity P := Pr[G is disconnected]. Such an approx-
imation is not necessarily a (1 ± ε)-approximation of the complementary
probability 1 − P = Pr[G is connected], and vice versa, since P may be very
close to 1.

Observation 3.7. For an edge cut consisting of k edges, the probability that
all k of its edges fail is p^k.

Reduction to DNF counting: If the graph had only a few cuts, there
would be a natural and easy way to formulate the problem as a variant of
DNF counting: each edge would be represented by a variable, and every
clause would correspond to a cut postulating that all of its edges have failed.
The probability of disconnecting can then be inferred from the fraction of
satisfying assignments, when each variable is true with probability p and false
otherwise. We note that the latter can be computed by an easy extension of
the DNF counting algorithm discussed in the previous section.

Unfortunately, there are exponentially many cuts in a general graph and


this method is thus inefficient. We discuss two cases based on the minimum
cut size c. The reduction we present here is due to Karger [Kar01].
• When p^c ≥ 1/n⁴, the probability of the network disconnecting is rather
high. As mentioned previously, this is not the case of greatest interest,
since the motivation behind studying this problem is understanding
how to build reliable networks, and a network with a rather
small cut is not reliable. Nevertheless, since the probability of disconnection
is rather high, the Monte Carlo method of sampling subsets of edges and
checking whether they disconnect the graph suffices, since we only need Õ(n⁴)
samples to achieve concentration.

• When p^c ≤ 1/n⁴, we show that the large cuts do not contribute much to the
probability of disconnection, and therefore they can safely be
ignored. Recall that the number of cuts of size αc, for α ≥ 1, is at most
O(n^{2α}).¹ When one selects a threshold γ = max{O(log_n ε^{−1}), Θ(1)} on
the cut size, we can bound the contribution of large cuts — those of
size γc and higher — to the overall probability as

∫_γ^∞ n^{2α} · p^{αc} dα < ε · p^c.

Hence, the error introduced by ignoring large cuts is at most an ε factor
of the lower bound on the probability of failure, i.e., p^c. Thus, we can
ignore those large cuts.

The number of cuts smaller than γc is clearly polynomial in n, and the
reduction to DNF counting is therefore efficient. The only remain-
ing task is finding those cuts — Karger's contraction algorithm provides us
with a way of sampling them, and we can thus use a coupon-collector
argument to enumerate them.²

3.3 Counting graph colorings

Definition 3.8 (Graph coloring). Let G = (V, E) be a graph on |V| = n
vertices and |E| = m edges, and denote the maximum degree by ∆. A valid q-
coloring of G is an assignment c : V → {1, . . . , q} such that adjacent vertices
have different colors, i.e., if u and v are adjacent in G, then c(u) ≠ c(v).

Example (3-coloring of the Petersen graph) [Figure: a valid 3-coloring of the Petersen graph.]

¹If you haven't seen this fact before (e.g., in the context of Karger's randomized con-
traction algorithm in your undergraduate algorithms classes), take this as a black-box
claim for now. You will see the proof later in Chapter 13.
²As you might have seen in your undergraduate classes, e.g., in the course Algorithms,
Probability, and Computing.
For q ≥ ∆ + 1, one can obtain a valid q-coloring by sequentially coloring
the vertices, greedily picking an available color for each. In this section, we show a FPRAS for
counting f(G), the number of valid q-colorings of a given graph G, under the
assumption that we have q ≥ 2∆ + 1 colors.

3.3.1 Sampling a coloring uniformly

When q ≥ 2∆ + 1, the Markov chain approach in SampleColor allows us
to sample an (almost) uniformly random coloring in O(n log(n/ε)) steps.

Algorithm 11 SampleColor(G = (V, E), ε)
Greedily color the graph
for k = O(n log(n/ε)) times do
    Pick a vertex v uniformly at random from V
    Pick u.a.r. an available color . Different from the colors of N(v)
    Color v with the new color . May end up with the same color
end for
return the coloring
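This Glauber-dynamics sampler can be sketched in Python (our illustration; the iteration count follows the O(n log(n/ε)) bound with an arbitrary constant):

```python
import math
import random

def sample_coloring(adj, q, eps=0.1, rng=random):
    """Glauber dynamics for q-colorings of a graph given as an adjacency
    list; assumes q >= 2*max_degree + 1, so resampling always succeeds."""
    n = len(adj)
    # greedy initial coloring (valid whenever q >= max_degree + 1)
    color = [None] * n
    for v in range(n):
        used = {color[u] for u in adj[v] if color[u] is not None}
        color[v] = next(c for c in range(q) if c not in used)
    # resampling steps: recolor a random vertex with a random available color
    for _ in range(10 * n * math.ceil(math.log(n / eps + 1))):
        v = rng.randrange(n)
        used = {color[u] for u in adj[v]}
        color[v] = rng.choice([c for c in range(q) if c not in used])
    return color

# 5-cycle: max degree 2, so q = 5 >= 2*2 + 1 suffices
adj = [[1, 4], [0, 2], [1, 3], [2, 4], [3, 0]]
c = sample_coloring(adj, q=5)
assert all(c[u] != c[v] for u in range(5) for v in adj[u])  # proper coloring
```

Note that every resampling step preserves properness by construction; the nontrivial part, addressed in Claim 3.9, is that the resulting distribution is close to uniform.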

Claim 3.9. For q ≥ 2∆ + 1, the distribution of colorings returned by Sam-
pleColor is ε-close to the uniform distribution on all valid q-colorings.

Notes on the Proof. A full proof of this claim is beyond the scope of this
course, but let us provide some helpful explanations. The coloring in the
course of the algorithm can be modelled as a Markov chain. Moreover, this
chain is aperiodic and irreducible: it is aperiodic because the color of a vertex
can be retained while resampling (when it is not used by any of its neighbours),
and it is irreducible because we can transform any coloring into any other
using at most 2n steps. Since the chain is aperiodic and irreducible, it
converges to its stationary distribution. Now, as the chain is clearly symmetric,
the stationary distribution is uniform over all valid colorings with q colors.
Finally, it is known that the chain is rapidly mixing, with mixing time at most
O(n log n). Therefore, after k = O(n log(n/ε)) resampling steps, the distribution
of the coloring will be ε-close to the uniform distribution on all valid q-colorings.

3.3.2 FPRAS for q ≥ 2∆ + 1 and ∆ ≥ 2


Fix an arbitrary ordering of the edges in E. For i ∈ {1, . . . , m}, let Gi = (V, Ei)
be the graph in which Ei = {e1, . . . , ei} is the set of the first i edges. Define
Ωi = {c : c is a valid coloring for Gi} as the set of all valid colorings of Gi,
and denote ri = |Ωi| / |Ωi−1|. Note that |Ω0| = q^n, since G0 has no edges.
We will estimate the number of graph colorings as

f(G) = |Ωm| = |Ω0| · (|Ω1|/|Ω0|) · · · (|Ωm|/|Ωm−1|) = |Ω0| · ∏_{i=1}^m ri = q^n · ∏_{i=1}^m ri
One can see that Ωi ⊆ Ωi−1, since removing the edge ei from Gi can only
increase the number of valid colorings. Furthermore, suppose ei = {u, v}.
Then, Ωi−1 \ Ωi = {c ∈ Ωi−1 : c(u) = c(v)}. That is, the colorings that are in
Ωi−1 but not in Ωi are exactly those that assign the same color to both
endpoints u and v of the edge ei. We can argue that |Ωi| / |Ωi−1| cannot be
too small. In particular, with each coloring in Ωi−1 \ Ωi we can associate (in
an injective manner) many different colorings in Ωi: fix the coloring of, say,
the lower-indexed vertex u. Then, there are at least q − ∆ ≥ 2∆ + 1 − ∆ = ∆ + 1
possible recolorings of v in Gi. Hence,

|Ωi| ≥ (∆ + 1) · |Ωi−1 \ Ωi|
⇐⇒ |Ωi| ≥ (∆ + 1) · (|Ωi−1| − |Ωi|)
⇐⇒ |Ωi| + (∆ + 1) · |Ωi| ≥ (∆ + 1) · |Ωi−1|
⇐⇒ (∆ + 2) · |Ωi| ≥ (∆ + 1) · |Ωi−1|
⇐⇒ |Ωi| / |Ωi−1| ≥ (∆ + 1)/(∆ + 2)

This implies that ri = |Ωi| / |Ωi−1| ≥ (∆ + 1)/(∆ + 2) ≥ 3/4 since ∆ ≥ 2.
Since f(G) = |Ωm| = q^n · ∏_{i=1}^m ri, if we can find a good estimate for
each ri with high probability, then we have an FPRAS for counting the number
of valid graph colorings of G.
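For intuition, the telescoping identity f(G) = q^n · ∏ ri can be checked by brute force on a tiny graph. The exhaustive counter below is only an illustration (it is exponential in n and is not part of the FPRAS).

```python
from itertools import product

def count_colorings(n, edges, q):
    """Brute-force count of valid q-colorings (exponential; tiny graphs only)."""
    return sum(all(c[u] != c[v] for u, v in edges)
               for c in product(range(q), repeat=n))

# Path on 3 vertices, q = 3: add the edges one at a time
n, q = 3, 3
edges = [(0, 1), (1, 2)]
counts = [count_colorings(n, edges[:i], q) for i in range(len(edges) + 1)]
# counts = [27, 18, 12], so r_1 = 18/27 and r_2 = 12/18
ratios = [counts[i] / counts[i - 1] for i in range(1, len(counts))]
estimate = q ** n
for r in ratios:
    estimate *= r
assert round(estimate) == counts[-1] == 12
```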

We now define Color-Count(G, ε) (Algorithm 12) as an algorithm that
estimates the number of valid colorings of a graph G using q ≥ 2∆ + 1 colors.
Lemma 3.10. For all i ∈ {1, . . . , m}, Pr[ |r̂i − ri| ≤ (ε/2m) · ri ] ≥ 1 − 1/(4m).

Algorithm 12 Color-Count(G, ε)
  r̂1, . . . , r̂m ← 0                      ▷ Estimates for ri
  for i = 1, . . . , m do
    for k = 128m³/ε² times do
      c ← Sample coloring of Gi−1          ▷ Using SampleColor
      if c is a valid coloring for Gi then
        r̂i ← r̂i + 1/k                    ▷ Update empirical estimate of ri = |Ωi| / |Ωi−1|
      end if
    end for
  end for
  return q^n · ∏_{i=1}^m r̂i

Proof. Let Xj be the indicator variable for the event that the j-th sampled
coloring for Ωi−1 is also a valid coloring for Gi, and let p = Pr[Xj = 1]. From
above, we know that p = |Ωi| / |Ωi−1| = ri ≥ 3/4. Let X = X1 + · · · + Xk be
the empirical number of sampled colorings that are valid for both Gi−1 and Gi,
so that X = k · r̂i. Then, E(X) = kp by linearity of expectation. Picking
k = 128m³/ε²,

Pr[|X − kp| ≥ (ε/2m) · kp] ≤ 2 exp(−(ε/2m)² · kp / 3)   (by Chernoff bound)
                           = 2 exp(−32mp/3)             (since k = 128m³/ε²)
                           ≤ 2 exp(−8m)                 (since p ≥ 3/4)
                           ≤ 1/(4m)                     (since e^(−x) ≤ 1/x for x > 0)

Dividing by k and negating, we have:

Pr[|r̂i − ri| ≤ (ε/2m) · ri] = 1 − Pr[|X − kp| ≥ (ε/2m) · kp] ≥ 1 − 1/(4m)
Lemma 3.11. Color-Count runs in poly(n, m, 1/ε) time.

Proof. There are m ratios ri to estimate. Each estimation uses k ∈ O(m³/ε²)
iterations. In each iteration, we spend O(n log(n/ε)) time to sample a coloring c
of Gi−1 and O(m) time to check whether c is a valid coloring for Gi. In total,
Color-Count runs in O(mk(n log(n/ε) + m)) = poly(n, m, 1/ε) time.
3.3. COUNTING GRAPH COLORINGS 39

Theorem 3.12. Color-Count is an FPRAS for counting the number of
valid graph colorings when q ≥ 2∆ + 1 and ∆ ≥ 2.

Proof. By Lemma 3.11, Color-Count runs in poly(n, m, 1/ε) time. Since
1 + x ≤ e^x for all real x, we have (1 + ε/2m)^m ≤ e^(ε/2) ≤ 1 + ε. The last
inequality3 holds because e^x ≤ 1 + 2x for 0 ≤ x ≤ 1.25643. On the other hand,
Bernoulli's inequality tells us that (1 − ε/2m)^m ≥ 1 − ε/2 ≥ 1 − ε. We know from
the proof of Lemma 3.10 that Pr[|r̂i − ri| > (ε/2m) · ri] ≤ 1/(4m) for each
estimate r̂i. Therefore, by a union bound, we have

Pr[|q^n · ∏_{i=1}^m r̂i − f(G)| > ε · f(G)] ≤ Σ_{i=1}^m Pr[|r̂i − ri| > (ε/2m) · ri]
                                           ≤ m · 1/(4m) = 1/4

Hence, Pr[|q^n · ∏_{i=1}^m r̂i − f(G)| ≤ ε · f(G)] ≥ 3/4.

Remark Recall from Claim 3.9 that SampleColor actually gives only an
approximately uniform coloring. A more careful analysis can absorb the
approximation error of SampleColor into Color-Count's ε factor.

3 See https://www.wolframalpha.com/input/?i=e%5Ex+%3C%3D+1%2B2x
Chapter 4

Rounding Linear Program Solutions

Linear programming (LP) and integer linear programming (ILP) are versa-
tile models but with different solving complexities — LPs are solvable in
polynomial time while ILPs are N P-hard.

Definition 4.1 (Linear program (LP)). The canonical form of an LP is

minimize cT x
subject to Ax ≥ b
x≥0

where x is the vector of n variables (to be determined), b and c are vectors


of (known) coefficients, and A is a (known) matrix of coefficients. cT x and
obj(x) are the objective function and objective value of the LP respectively.
For an optimal variable assignment x∗ , obj(x∗ ) is the optimal value.

ILPs are defined similarly with the additional constraint that variables
take on integer values. As we will be relaxing ILPs into LPs, to avoid confu-
sion, we use y for ILP variables to contrast against the x variables in LPs.

Definition 4.2 (Integer linear program (ILP)). The canonical form of an


ILP is

minimize cT y
subject to Ay ≥ b
y≥0
y ∈ Zn


where y is the vector of n variables (to be determined), b and c are vectors
of (known) coefficients, and A is a (known) matrix of coefficients. c^T y and
obj(y) are the objective function and objective value of the ILP respectively.
For an optimal variable assignment y∗, obj(y∗) is the optimal value.

Remark We can define LPs and ILPs for maximization problems similarly.
One can also solve a maximization problem with a minimization LP using
the same constraints but a negated objective function. The optimal value of
the solved LP will then be the negation of the maximized optimal value.
In this chapter, we illustrate how one can model set cover and multi-
commodity routing as ILPs, and how to perform rounding to yield approx-
imations for these problems. As before, Chernoff bounds will be a useful
inequality in our analysis toolbox.

4.1 Minimum set cover


Recall the minimum set cover problem and the example from Section 1.1.

Example Suppose there are n = 5 items and m = 4 subsets S = {S1, S2, S3, S4},
where S1 = {e1, e2, e5}, S2 = {e4}, S3 = {e2, e3}, and S4 = {e1, e4, e5}
(the incidence figure is omitted; the memberships can be read off the covering
constraints below), and the cost function is defined as c(Si) = i². Then, the
minimum set cover is S∗ = {S1, S2, S3} with a cost of c(S∗) = 14.
In Section 1.1, we saw that a greedy selection of sets that minimizes
the price-per-item of remaining sets gives an Hn-approximation for set cover.
Furthermore, in the special cases where ∆ = max_{i∈[m]} degree(Si) or
f = max_{j∈[n]} degree(ej) is small, one can obtain an H∆-approximation or an
f-approximation respectively.
We now show how to formulate set cover as an ILP, relax it into an LP,
and round the LP solution to yield an approximation for the original set
cover instance. Consider the following ILP:

ILPSet cover

minimize   Σ_{i=1}^m yi · c(Si)                      ▷ Cost of chosen set cover
subject to Σ_{i : ej∈Si} yi ≥ 1   ∀j ∈ [n]           ▷ Every item ej is covered
           yi ∈ {0, 1}   ∀i ∈ [m]                    ▷ Indicator whether set Si is chosen

Upon solving ILPSet cover, the set {Si : i ∈ [m] ∧ y∗i = 1} is an optimal
solution for a given set cover instance. However, as solving ILPs is NP-hard,
we consider relaxing the integrality constraint by replacing the binary variables
yi with real-valued/fractional xi ∈ [0, 1]. This relaxation yields the
corresponding LP:

LPSet cover

minimize   Σ_{i=1}^m xi · c(Si)                      ▷ Cost of chosen fractional set cover
subject to Σ_{i : ej∈Si} xi ≥ 1   ∀j ∈ [n]           ▷ Every item ej is fractionally covered
           0 ≤ xi ≤ 1   ∀i ∈ [m]                     ▷ Relaxed indicator variables

Since LPs can be solved in polynomial time, we can find the optimal
fractional solution to LPSet cover in polynomial time.

Observation As the set of feasible solutions of ILPSet cover is a subset of
that of LPSet cover, obj(x∗) ≤ obj(y∗).

Example The corresponding ILP for the example set cover instance is:

minimize   y1 + 4y2 + 9y3 + 16y4
subject to y1 + y4 ≥ 1              ▷ Sets covering e1
           y1 + y3 ≥ 1              ▷ Sets covering e2
           y3 ≥ 1                   ▷ Sets covering e3
           y2 + y4 ≥ 1              ▷ Sets covering e4
           y1 + y4 ≥ 1              ▷ Sets covering e5
           yi ∈ {0, 1}   ∀i ∈ {1, . . . , 4}

After relaxing:

minimize   x1 + 4x2 + 9x3 + 16x4
subject to x1 + x4 ≥ 1
           x1 + x3 ≥ 1
           x3 ≥ 1
           x2 + x4 ≥ 1
           x1 + x4 ≥ 1
           0 ≤ xi ≤ 1   ∀i ∈ {1, . . . , 4}    ▷ Relaxed indicator variables

Solving it using an LP solver1 yields: x1 = 1, x2 = 1, x3 = 1, x4 = 0. Since
the solved x∗ is integral, x∗ is also an optimal solution for the original
ILP. In general, the solved x∗ may be fractional, which does not immediately
yield a set selection.
We now describe two ways to round the fractional assignment x∗ into
binary variables y so that we can interpret them as proper set selections.

4.1.1 (Deterministic) Rounding for small f

We round x∗ as follows:

∀i ∈ [m], set yi = 1 if x∗i ≥ 1/f, and yi = 0 otherwise.

Theorem 4.3. The rounded y is a feasible solution to ILPSet cover.

Proof. Since x∗ is a feasible (not to mention, optimal) solution for LPSet cover,
each covering constraint contains at most f variables summing to at least 1, so
at least one x∗i in it is greater than or equal to 1/f. Hence, every element is
covered by some set Si with yi = 1 in the rounding.

Theorem 4.4. The rounded y is an f-approximation to ILPSet cover.

Proof. By the rounding, yi ≤ f · x∗i , ∀i ∈ [m]. Therefore,

obj(y) ≤ f · obj(x∗ ) ≤ f · obj(y ∗ )
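Given a fractional LP solution, the deterministic rounding itself is a one-liner. The sketch below assumes the LP has already been solved (the data follows the running example, with 0-indexed items); the function name is illustrative.

```python
def round_deterministic(x_star, sets, n_items):
    """Keep every set whose LP value is at least 1/f, where f is the
    maximum number of sets any single item appears in."""
    f = max(sum(1 for s in sets if e in s) for e in range(n_items))
    y = [1 if x >= 1 / f else 0 for x in x_star]
    return y, f

# Running example: S1..S4 covering items e1..e5 (0-indexed here)
sets = [{0, 1, 4}, {3}, {1, 2}, {0, 3, 4}]
x_star = [1.0, 1.0, 1.0, 0.0]  # the LP optimum from the text
y, f = round_deterministic(x_star, sets, 5)
chosen = [s for s, yi in zip(sets, y) if yi]
assert set().union(*chosen) == {0, 1, 2, 3, 4}  # every item covered
```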

1 Using Microsoft Excel. See tutorial: http://faculty.sfasu.edu/fisherwarre/lp_solver.html
Or, use an online LP solver such as: http://online-optimizer.appspot.com/?model=builtin:default.mod

4.1.2 (Randomized) Rounding for general f

If f is large, the f-approximation algorithm from the previous subsection
may be unsatisfactory. By introducing randomness into the rounding
process, we show that one can obtain a ln(n)-approximation (in expectation)
with arbitrarily high probability through probability amplification.
Consider the following rounding procedure:
1. Interpret each x∗i as probability for picking Si . That is, Pr[yi = 1] = x∗i .
2. For each i, independently set yi to 1 with probability x∗i .
Theorem 4.5. E(obj(y)) = obj(x∗)

Proof.

E(obj(y)) = E(Σ_{i=1}^m yi · c(Si))
          = Σ_{i=1}^m E(yi) · c(Si)         (by linearity of expectation)
          = Σ_{i=1}^m Pr(yi = 1) · c(Si)    (since each yi is an indicator variable)
          = Σ_{i=1}^m x∗i · c(Si)           (since Pr(yi = 1) = x∗i)
          = obj(x∗)

Although the rounded selection yields an objective cost that matches the
optimum of the LP in expectation, we still need to consider whether all
constraints are satisfied.
Theorem 4.6. For any j ∈ [n], item ej is not covered with probability ≤ e^(−1).

Proof. For any j ∈ [n],

Pr[item ej not covered] = Pr[Σ_{i : ej∈Si} yi = 0]
 = ∏_{i : ej∈Si} (1 − x∗i)       (since the yi are chosen independently)
 ≤ ∏_{i : ej∈Si} e^(−x∗i)        (since 1 − x ≤ e^(−x))
 = e^(−Σ_{i : ej∈Si} x∗i)
 ≤ e^(−1)

The last inequality holds because the optimal solution x∗ satisfies the j-th
constraint of the LP, namely Σ_{i : ej∈Si} x∗i ≥ 1.

Since e−1 ≈ 0.37, we would expect the rounded y not to cover several
items. However, one can amplify the success probability by considering in-
dependent roundings and taking the union (See ApxSetCoverILP).

Algorithm 13 ApxSetCoverILP(U, S, c)
  ILPSet cover ← Construct ILP of problem instance
  LPSet cover ← Relax integrality constraints on indicator variables y to x
  x∗ ← Solve LPSet cover
  T ← ∅                                       ▷ Selected subset of S
  for k · ln(n) times (for any constant k > 1) do
    for i ∈ [m] do
      yi ← Set to 1 with probability x∗i
      if yi = 1 then
        T ← T ∪ {Si}                          ▷ Add to selected sets T
      end if
    end for
  end for
  return T
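A runnable sketch of the amplified randomized rounding follows; the LP solve is assumed to have been done already, and the instance data is an illustrative feasible fractional cover (not the LP optimum).

```python
import math
import random

def apx_set_cover(x_star, sets, n_items, k, rng):
    """Repeat the randomized rounding k*ln(n) times and take the union."""
    T = set()  # indices of selected sets
    for _ in range(max(1, round(k * math.log(n_items)))):
        for i, xi in enumerate(x_star):
            if rng.random() < xi:
                T.add(i)
    return T

sets = [{0, 1, 4}, {3}, {1, 2}, {0, 3, 4}]
x_star = [0.5, 0.5, 1.0, 0.5]  # a feasible fractional cover for this instance
rng = random.Random(0)
T = apx_set_cover(x_star, sets, 5, k=10, rng=rng)
covered = set().union(*(sets[i] for i in T))
assert covered == {0, 1, 2, 3, 4}
```

With roughly 16 rounds, the probability that some item stays uncovered is astronomically small, mirroring the 1 − n^(1−k) bound of Theorem 4.7.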

Similarly to Theorem 4.5, one can see that E(obj(T)) ≤ (k · ln(n)) · obj(x∗) ≤ (k · ln(n)) · obj(y∗).
Furthermore, Markov's inequality tells us that the probability of obj(T) being
z times larger than its expectation is at most 1/z.

Theorem 4.7. ApxSetCoverILP gives a valid set cover with probability ≥ 1 − n^(1−k).

Proof. For all j ∈ [n],

Pr[item ej not covered by T] = Pr[ej not covered in all k ln(n) roundings]
                             ≤ (e^(−1))^(k ln(n))
                             = n^(−k)

Taking a union bound over all n items,

Pr[T is not a valid set cover] ≤ Σ_{j=1}^n n^(−k) = n^(1−k)

So, T is a valid set cover with probability ≥ 1 − n^(1−k).



Note that the success probability of 1 − n^(1−k) can be further amplified by
taking several independent samples of ApxSetCoverILP, and then returning
the lowest-cost valid set cover sampled. With z samples, the probability that
all repetitions fail is less than n^(z(1−k)), so we succeed with probability ≥ 1 − n^(z(1−k)).

4.2 Minimizing congestion in multi-commodity routing

A multi-commodity routing (MCR) problem involves routing multiple (si, ti)
flows across a network with the goal of minimizing congestion, where
congestion is defined as the largest ratio of flow over capacity of any edge in
the network. In this section, we discuss two variants of the multi-commodity
routing problem. In the first variant (a special case), we are given the set of
possible paths Pi for each (si, ti) source-target pair. In the second variant
(the general case), we are given only the network. In both cases, [RT87] showed
that one can obtain an approximation of O(log(m)/log log(m)) with high probability.

Definition 4.8 (Multi-commodity routing problem). Consider a directed
graph G = (V, E) where |E| = m and each edge e = (u, v) ∈ E has a capacity
c(u, v). The in-set/out-set of a vertex v are denoted by in(v) = {(u, v) ∈ E :
u ∈ V} and out(v) = {(v, u) ∈ E : u ∈ V} respectively. Given k triplets
(si, ti, di), where si ∈ V is the source, ti ∈ V is the target, and di ≥ 0 is
the demand of the i-th commodity, denote by f(e, i) ∈ [0, 1] the
fraction of di that flows through edge e. The task is to minimize the
congestion parameter λ by finding a path pi for each i ∈ [k], such that:

(i) (Valid sources): Σ_{e∈out(si)} f(e, i) − Σ_{e∈in(si)} f(e, i) = 1, ∀i ∈ [k]

(ii) (Valid sinks): Σ_{e∈in(ti)} f(e, i) − Σ_{e∈out(ti)} f(e, i) = 1, ∀i ∈ [k]

(iii) (Flow conservation): For each commodity i ∈ [k],

Σ_{e∈out(v)} f(e, i) − Σ_{e∈in(v)} f(e, i) = 0, ∀v ∈ V \ {si, ti}

(iv) (Single path): All demand for commodity i passes through a single path
pi (no repeated vertices).

(v) (Congestion factor): ∀e ∈ E, Σ_{i=1}^k di · 1_{e∈pi} ≤ λ · c(e), where the
indicator 1_{e∈pi} = 1 ⇐⇒ e ∈ pi.

(vi) (Minimum congestion): λ is minimized.

Example Consider a flow network on the vertices {s1, s2, s3, a, b, c, t1, t2, t3}
with k = 3 commodities and edge capacities as labelled (figure omitted).

For demands d1 = d2 = d3 = 10, there exists a (multi-path) flow assignment
such that the total demand flowing on each edge is below its capacity
(per-commodity flow figures omitted).

Although this assignment attains congestion λ = 1 (due to edge (s3, a)),
the path assignments for commodities 2 and 3 violate the "single path"
property. Forcing all demand of each commodity to flow through a single path,
we have a minimum congestion of λ = 1.25 (due to edges (s3, s2) and (a, t2));
the corresponding single-path assignment figures are omitted.

4.2.1 Special case: Given sets of si − ti paths Pi

For each commodity i ∈ [k], we are to select a path pi from a given set
of valid paths Pi, where every edge on every path in Pi has capacity ≥ di.
Because we intend to pick a single path for each commodity and send all
demand through it, constraints (i)-(iii) of MCR are fulfilled trivially. Using
yi,p as indicator variables for whether path p ∈ Pi is chosen, we can model the
following ILP:

ILPMCR-Given-Paths

minimize   λ                                                          ▷ (1)
subject to Σ_{i=1}^k di · (Σ_{p∈Pi : e∈p} yi,p) ≤ λ · c(e)   ∀e ∈ E    ▷ (2)
           Σ_{p∈Pi} yi,p = 1   ∀i ∈ [k]                               ▷ (3)
           yi,p ∈ {0, 1}   ∀i ∈ [k], p ∈ Pi                           ▷ (4)

▷ (1) Congestion parameter λ
▷ (2) Congestion factor relative to selected paths
▷ (3) Exactly one path chosen from each Pi
▷ (4) Indicator variable for path p ∈ Pi

Relax the integrality constraint on yi,p to xi,p ∈ [0, 1] and solve the
corresponding LP. Define λ∗ = obj(LPMCR-Given-Paths) and denote by x∗ a
fractional path selection that achieves λ∗. To obtain a valid path selection, for
each commodity i ∈ [k], pick path p ∈ Pi with weighted probability
x∗i,p / (Σ_{p′∈Pi} x∗i,p′) = x∗i,p. Note that by constraint (3), Σ_{p∈Pi} x∗i,p = 1.

Remark 1 For a fixed i, exactly one path is selected (cf. set cover's
roundings, where we may pick multiple sets covering an item).

Remark 2 The weighted sampling is independent across different
commodities. That is, the choice of path amongst Pi does not influence the
choice of path amongst Pj for i ≠ j.
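The independent per-commodity weighted sampling can be sketched with the standard library; the path and weight data below are illustrative assumptions.

```python
import random

def select_paths(paths, x_star, rng):
    """For each commodity i, pick one path from P_i with probability
    proportional to its LP value x*_{i,p} (the values sum to 1)."""
    return [rng.choices(paths[i], weights=x_star[i], k=1)[0]
            for i in range(len(paths))]

# Two commodities, each with two candidate paths (given as vertex tuples)
paths = [[("s1", "a", "t1"), ("s1", "b", "t1")],
         [("s2", "a", "t2"), ("s2", "c", "t2")]]
x_star = [[0.7, 0.3], [0.5, 0.5]]
rng = random.Random(0)
chosen = select_paths(paths, x_star, rng)
assert all(chosen[i] in paths[i] for i in range(2))
```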

Theorem 4.9. Pr[obj(y) ≥ (2c log m / log log m) · max{1, λ∗}] ≤ 1/m^(c−1)

Proof. Fix an arbitrary edge e ∈ E. For each commodity i, define an
indicator variable Ye,i for the event that edge e is part of the chosen path for
commodity i. By the randomized rounding, Pr[Ye,i = 1] = Σ_{p∈Pi : e∈p} x∗i,p.
Denoting by Ye = Σ_{i=1}^k di · Ye,i the total demand on edge e over all k
chosen paths,

E(Ye) = E(Σ_{i=1}^k di · Ye,i)
      = Σ_{i=1}^k di · E(Ye,i)                  (by linearity of expectation)
      = Σ_{i=1}^k di · Σ_{p∈Pi : e∈p} x∗i,p     (since Pr[Ye,i = 1] = Σ_{p∈Pi : e∈p} x∗i,p)
      ≤ λ∗ · c(e)                               (by the MCR constraint and optimality of the solved LP)

For every edge e ∈ E, applying2 the tight form of the Chernoff bound with
(1 + δ) = (2c log m)/(log log m) to the variable Ye/c(e) gives

Pr[Ye/c(e) ≥ (2c log m / log log m) · max{1, λ∗}] ≤ 1/m^c

Finally, take a union bound over all m edges.


2 See Corollary 2 of https://courses.engr.illinois.edu/cs598csc/sp2011/Lectures/lecture_9.pdf for details.

4.2.2 General: Given only a network

In the general case, we may not be given path sets Pi, and there may be
exponentially many si − ti paths in the network. However, we show that one
can still formulate an ILP and round it (slightly differently) to yield the same
approximation factor. Consider the following:

ILPMCR-Given-Network

minimize   λ                                                                   ▷ (1)
subject to Σ_{e∈out(si)} f(e, i) − Σ_{e∈in(si)} f(e, i) = 1   ∀i ∈ [k]          ▷ (2)
           Σ_{e∈in(ti)} f(e, i) − Σ_{e∈out(ti)} f(e, i) = 1   ∀i ∈ [k]          ▷ (3)
           Σ_{e∈out(v)} f(e, i) − Σ_{e∈in(v)} f(e, i) = 0   ∀i ∈ [k],           ▷ (4)
                                                            ∀v ∈ V \ {si, ti}
           Σ_{i=1}^k di · (Σ_{p∈Pi : e∈p} yi,p) ≤ λ · c(e)   ∀e ∈ E             (as before)
           Σ_{p∈Pi} yi,p = 1   ∀i ∈ [k]                                         (as before)
           yi,p ∈ {0, 1}   ∀i ∈ [k], p ∈ Pi                                     (as before)

▷ (1) Congestion parameter λ
▷ (2) Valid sources
▷ (3) Valid sinks
▷ (4) Flow conservation
Relax the integrality constraint on yi,p to xi,p ∈ [0, 1] and solve the
corresponding LP. To extract the path candidates Pi for each commodity,
perform a flow decomposition3. For each extracted path pi for commodity i,
treat the minimum flow value min_{e∈pi} f(e, i) on the path as its selection
probability (playing the role of x∗i,p in the previous section). By selecting
path pi with probability min_{e∈pi} f(e, i), one can show by similar arguments
as before that E(obj(y)) ≤ obj(x∗) ≤ obj(y∗).

3 See https://www.youtube.com/watch?v=zgutyzA9JM4&t=1020s (17:00 to 29:50) for a recap on flow decomposition.

4.3 Scheduling on unrelated parallel machines

We will now discuss a generalization of the makespan problem. In order to
get a good approximation, we will again use an LP formulation. However,
this time, randomized rounding will not be sufficient, and we will instead use
a more elaborate form of rounding, by utilizing the combinatorial structure
of the problem.

Setting. The setting is close to what we discussed in the past: We have a
collection J of n jobs/tasks, and we have a set M of m machines to process
them. Previously, each job had a fixed size regardless of which machine
processes it. Now we think about a more general case: tij ≥ 0 is the time for
machine j to process job i. The tij can be arbitrary. This means that some
machines can be a lot better than others on some jobs, and a lot worse on
others. This is what we mean when we speak of "unrelated" machines.

How can we formulate this problem as a Linear Program?

Naive LP. The most obvious way is to take xij to be an indicator for
assigning job i to machine j and then optimizing the following objective:

min t
s.t. ∀ machine j:        Σ_{i=1}^n tij · xij ≤ t
     ∀ job i:            Σ_{j=1}^m xij ≥ 1
     ∀ job i, machine j: xij ≥ 0

Here, t is an additional variable giving an upper bound for the finishing time
of the last job.
The problem with this LP is that the best fractional solution and the
best integral solution can be far apart, i.e., the LP has a large “integrality
gap”. Namely, the fractional solution to the LP is allowed to distribute a

big job among different machines. In particular, consider an instance with a
single large job with the same processing time on all machines. The integral
solution needs to assign this job to one of the machines, while the fractional
solution can evenly split it among all m machines. Therefore, we can lose a
factor as big as m.
Note that we get a correct solution for both the fractional and the integral
case. The problem is that we want to relate the two solutions in order to
prove a small approximation factor, and this is not possible for the given LP.

Improved LPs. We will now change the LP a bit. Suppose somebody
tells us that the processing time t is at most a certain λ. We will find a way
to check this claim up to an approximation factor. Once we have the upper
bound λ, only some of the assignments make sense, namely

Sλ = {(i, j) | tij ≤ λ}.

If a single job i takes more than λ time on a given machine j, we cannot
schedule it on this machine at all. The set Sλ contains all assignments of
single jobs to single machines that are not ruled out in this way. We can now
write an LP on |Sλ| variables that is specific to a given value of λ.

LP(λ):

∀ machine j:     Σ_{i : (i,j)∈Sλ} tij · xij ≤ λ
∀ job i:         Σ_{j : (i,j)∈Sλ} xij ≥ 1
∀ (i, j) ∈ Sλ:   xij ≥ 0

This time, we just want to check for feasibility. We have constraints (defining
a polytope), but there is no objective function. Using binary search4 , we can
find the smallest λ∗ for which we can find a fractional solution of LP(λ∗ ).
Note that it is easy to initialize the binary search, as there are trivial lower
and upper bounds on λ∗ , for example 0 and the sum of all processing times.
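The binary search over λ can be sketched against an abstract feasibility oracle; here the oracle is a stand-in lambda (a real implementation would test feasibility of LP(λ) with an LP solver), and all names are illustrative.

```python
def smallest_feasible(lo, hi, feasible, tol=1e-6):
    """Binary-search the smallest lambda with feasible(lambda) True,
    assuming feasibility is monotone in lambda."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Stand-in oracle: pretend LP(lambda) is feasible iff lambda >= 3.5
lam = smallest_feasible(0.0, 10.0, lambda lam: lam >= 3.5)
assert abs(lam - 3.5) < 1e-5
```

Monotonicity (footnote 4) is exactly what makes this search correct: once LP(λ) is feasible, it stays feasible for all larger λ.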
Now, somebody gives us a fractional solution of LP(λ∗): suppose this
solution is x∗ ∈ [0, 1]^|Sλ∗|. Instead of just assuming that x∗ is an arbitrary
solution, we will also assume that x∗ is a vertex of the polytope.5
4 Feasibility of LP(λ) is a monotone property in λ, as for λ ≤ λ′, a solution of LP(λ) can be extended to a solution of LP(λ′).
5 Also known as: a basic feasible solution, an extreme point, a generator of the polytope, a solution that cannot be written as a convex combination of other solutions, etc.

Rounding Algorithm. We want to round x∗ to an integral assignment of
jobs to machines. This is a place where the rounding will not be very direct.
However, there are some assignments for which rounding is obvious, in the
sense that xij = 1. For those cases, we can just assign job i to machine j.
So in the following we can assume that all remaining variables have fractional
values. The support of x∗ forms a graph H on jobs and machines (the edge
(i, j) is present iff x∗ij ∈ (0, 1)). This is a bipartite graph with jobs on one
side and machines on the other side.
We will prove that there is a matching in H that is perfect for the
fractionally-assigned jobs, and that we obtain a 2-approximation by combining
the obvious assignments with this matching. More explicitly, the algorithm is this:

1. For edges (i, j) such that xij = 1, assign job i to machine j. Let I be
the set of jobs assigned in this step. (I ⊆ J.)

2. Let H be the bipartite graph on jobs and machines where job i and
machine j are connected iff xi,j ∈ (0, 1).

3. Find a matching in H that is perfect for the remaining jobs F . (F = J \ I.)

4. For each matching edge (i, j), assign job i to machine j.
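Step 3 can be implemented with standard augmenting-path bipartite matching (Kuhn's algorithm); the sketch below assumes H is given as adjacency lists from jobs to machines, with all names illustrative.

```python
def match_jobs(H, machines):
    """Augmenting-path (Kuhn's) bipartite matching: returns a job -> machine
    assignment; on the pseudo-forest H from the text it covers every job."""
    match = {m: None for m in machines}  # machine -> matched job

    def try_assign(job, seen):
        for m in H[job]:
            if m not in seen:
                seen.add(m)
                # machine free, or its current job can be moved elsewhere
                if match[m] is None or try_assign(match[m], seen):
                    match[m] = job
                    return True
        return False

    for job in H:
        try_assign(job, set())
    return {j: m for m, j in match.items() if j is not None}

# Jobs j0, j1 and machines m0, m1, m2; H is a tree whose leaves are machines
H = {"j0": ["m0", "m1"], "j1": ["m1", "m2"]}
assignment = match_jobs(H, ["m0", "m1", "m2"])
assert set(assignment) == {"j0", "j1"}  # every job is matched
```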

Jobs assigned in step 1 take at most time λ∗ to complete. With the matching,
each machine gets at most one more job, whose cost is also at most λ∗ (by
definition of the set Sλ∗ ). Therefore, we can construct a solution with cost
at most 2 · λ∗ . As λ∗ is a lower bound on the optimal cost, this proves that
the algorithm is a 2-approximation. This is a much more careful rounding
method than randomized rounding.

Correctness of Rounding. It remains to be argued that the perfect
matching exists.

Definition 4.10. A pseudo-forest is a graph in which each connected
component is a pseudo-tree, where a pseudo-tree is either a tree or a tree with
one additional edge.

Claim: H is a bipartite pseudo-forest, where each leaf is a machine. It
is easy to prove that such bipartite pseudo-forests admit a matching that is
perfect for the jobs.

Lemma 4.11. A bipartite pseudo-forest whose leaves are machines contains
a matching that covers all the jobs.

Proof. We argue separately for each connected component. If a connected
component has a leaf that is not adjacent to a cycle, it is always possible to
pick a leaf not adjacent to a cycle such that if we remove it and its (unique)
neighbour, the produced components consist of at most one pseudo-tree with
machines as leaves and possibly a few isolated machines.
We can repeatedly pick a machine that is such a leaf and match it to
its neighbour. We then delete both the matched job and its machine, and
further, we delete all machines that have become isolated. The resulting
component is again a bipartite pseudo-tree whose leaves are machines. If we
repeat this process until no more steps can be taken, we are left with a graph
that is an even cycle, possibly with a few leaves attached. As those leaves
are machines, we can ignore them and use one of the two perfect matchings
of the cycle to assign the remaining jobs. By doing this for all components,
we can assign all jobs to machines.

We still need to prove that H is in fact a pseudo-forest.

Lemma 4.12. The graph H is a pseudo-forest.

Proof. We first prove that H is a pseudo-tree if it is connected, and then we
show how to extend the argument to the case where H has multiple connected
components.

H connected. If H is connected, it suffices to show that |E(H)| ≤ |V(H)|.
We will use the fact that x∗ is a vertex of the polytope. Let r = |Sλ∗|
be the number of variables. As x∗ is a vertex, there must exist r linearly
independent tight constraints. Among those r constraints, there can be at
most m constraints on machines and n constraints on job assignments (cf.
the LP). Therefore, at least r − (n + m) constraints of the form xij ≥ 0 must
be tight. Hence the number of variables that are non-zero in x∗ is at most
r − (r − (n + m)) = n + m. In particular, the number of fractional variables,
corresponding to edges in E(H), is at most n + m = |V(H)|.
(Aside: Recall that we defined I as the set of integrally-assigned jobs,
while F is the set of fractionally-assigned jobs. As all jobs fall into exactly
one of those categories, we have |I| + |F| = n. Each integral job i is associated
with at least one non-zero variable x∗ij0, while a fractional job is associated
with at least two non-zero variables x∗ij1 and x∗ij2 (because of the constraint
that Σj xij ≥ 1). We derive the inequality |I| + 2 · |F| ≤ n + m and can
conclude that |I| ≥ n − m. This means that if the number of machines is
small, many jobs will be assigned non-fractionally.)

H disconnected. Now we extend the argument to cover the case where
H has multiple connected components. Given some such component H′, we
can restrict the solution x∗ to this component, by ignoring all variables
corresponding to assignments of single jobs to single machines that are not both
in V(H′). We call this restricted vector x′∗. Claim: x′∗ is a vertex of the
polytope obtained by only considering variables associated with edges
connecting vertices in V(H′) and writing the analogue of the LP(λ∗) constraints
for them. Proof: Otherwise, we can pick two feasible solutions x′∗1 and x′∗2 of
the restricted LP such that x′∗ = ½(x′∗1 + x′∗2). We can then extend x′∗1 and
x′∗2 to solutions for the unrestricted LP by filling in the missing components
from x∗. The resulting solutions have x∗ as their midpoint, which
contradicts the assumption that x∗ is a vertex of the polytope associated with the
unrestricted LP. Therefore, the reasoning from above applies independently
to all connected components of H, and H is a pseudo-forest.
All leaves of H are machines, because fractionally-assigned jobs have at
least two neighbours. Using the two lemmas, we conclude that the rounding
algorithm is correct.
Part II

Selected Topics in
Approximation Algorithms

Chapter 5

Distance-preserving tree embedding

Many hard graph problems become easy if the graph is a tree: in particular,
some NP-hard problems are known to admit exact polynomial-time
solutions on trees, and for some other problems, we can obtain much better
approximations on trees. Motivated by this fact, one hopes to design the
following framework for a general graph G = (V, E) with distance metric
dG(u, v) between vertices u, v ∈ V:

1. Construct a tree T

2. Solve the problem on T efficiently

3. Map the solution back to G

4. Argue that the transformed solution from T is a good approximation
for the exact solution on G.

Ideally, we want to build a tree T such that dG(u, v) ≤ dT(u, v) and
dT(u, v) ≤ c · dG(u, v), where c is the stretch of the tree embedding.
Unfortunately, such a construction is hopeless1.
Instead, we relax the hard constraint dT(u, v) ≤ c · dG(u, v) and consider
a distribution over a collection of trees T, so that

• (Over-estimates cost): ∀u, v ∈ V, ∀T ∈ T, dG(u, v) ≤ dT(u, v)

• (Over-estimates by not too much): ∀u, v ∈ V, E_{T∈T}[dT(u, v)] ≤ c · dG(u, v)

• (T is a probability space): Σ_{T∈T} Pr[T] = 1

1 For a cycle G with n vertices, the excluded edge in a constructed tree will cause the stretch factor c ≥ n − 1. Exercise 8.7 in [WS11].

Bartal [Bar96, Theorem 8] gave a construction for probabilistic tree
embedding with a poly-logarithmic stretch factor c. He also proved in [Bar96,
Theorem 9] that a stretch factor c ∈ Ω(log n) is required for general graphs.
A construction that yields c ∈ O(log n), in expectation, was subsequently
found by Fakcharoenphol, Rao, and Talwar [FRT03].

5.1 A tight probabilistic tree embedding construction

In this section, we describe a probabilistic tree embedding construction due
to [FRT03] with a stretch factor c = O(log n). For a graph G = (V, E), let
the distance metric dG(u, v) be the distance between two vertices u, v ∈ V,
and denote by diam(C) = max_{u,v∈C} dG(u, v) the maximum distance between
any two vertices u, v ∈ C for any subset of vertices C ⊆ V. In particular,
diam(V) refers to the diameter of the whole graph. In the following, let
B(v, r) := {u ∈ V : dG(u, v) ≤ r} denote the ball of radius r around vertex v.

5.1.1 Idea: Ball carving

To sample an element of the collection T, we will recursively split our graph using a technique called ball carving.

Definition 5.1 (Ball carving). Given a graph G = (V, E), a subset C ⊆ V of vertices, and an upper bound D with diam(C) = max_{u,v∈C} dG(u, v) ≤ D, partition C into C1, . . . , Cl such that

(A) ∀i ∈ {1, . . . , l}, max_{u,v∈Ci} dG(u, v) ≤ D/2

(B) ∀u, v ∈ V, Pr[u and v not in same partition] ≤ α · dG(u, v)/D, for some α

Before using ball carving to construct a tree embedding with expected stretch α, we show that a reasonable value α ∈ O(log n) can be achieved.

5.1.2 Ball carving construction

The following algorithm concretely implements ball carving and thus gives a split of a given subset of the graph that satisfies properties (A) and (B) as defined above.
Algorithm 14 BallCarving(G = (V, E), C ⊆ V, D)
if |C| = 1 then
    return The only vertex in C
else                                      ▷ Say there are n vertices in C, where n > 1
    θ ← Uniform random value from the range [D/8, D/4]
    Pick a random permutation π on C      ▷ Denote πi as the ith vertex in π
    for i ∈ [n] do
        Vi ← B(πi, θ) \ ∪_{j=1}^{i−1} B(πj, θ)    ▷ V1, . . . , Vn is a partition of C
    end for
    return The non-empty sets V1, . . . , Vl    ▷ a Vi can be empty, i.e. Vi = ∅ ⇐⇒ ∀v ∈ B(πi, θ), ∃j < i : v ∈ B(πj, θ)
end if
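The carving step above can be sketched in Python as follows; the dictionary-of-dictionaries metric `dist`, the function name, and the fixed RNG seed are illustrative choices of this sketch, not part of the notes.

```python
import random

def ball_carving(dist, C, D, rng=None):
    """Sketch of Algorithm 14: partition C (of diameter <= D) into
    pieces of diameter <= D/2. dist[u][v] plays the role of d_G(u, v)."""
    rng = rng or random.Random(0)       # fixed seed only for reproducibility
    order = list(C)
    if len(order) == 1:
        return [set(order)]
    theta = rng.uniform(D / 8, D / 4)   # random radius theta in [D/8, D/4]
    rng.shuffle(order)                  # random permutation pi of C
    remaining = set(order)
    parts = []
    for center in order:                # carve the ball of pi_i minus earlier balls
        ball = {v for v in remaining if dist[center][v] <= theta}
        if ball:                        # drop the empty V_i
            parts.append(ball)
            remaining -= ball
    return parts
```

Each returned piece lies inside a ball of radius θ ≤ D/4, so its diameter is at most D/2, matching Claim 5.2 below.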

Notation Let π : C → N be an ordering of the vertices of C. For a vertex v ∈ C, denote π(v) as v's position in π and πi as the ith vertex. That is, v = π_{π(v)}.

Example Let C = {A, B, C, D, E, F} and π(A) = 3, π(B) = 2, π(C) = 5, π(D) = 1, π(E) = 6, π(F) = 4. Then π orders these vertices as (D, B, A, F, C, E); for instance, E = π6 = π_{π(E)}.
Figure 5.1 illustrates the process of ball carving on a set of vertices C = {N1, N2, . . . , N8}.
Claim 5.2. BallCarving(G, C, D) returns a partition V1, . . . , Vl such that

diam(Vi) = max_{u,v∈Vi} dG(u, v) ≤ D/2

for all i ∈ {1, . . . , l}.

Proof. Since θ ∈ [D/8, D/4], all constructed balls have diameter ≤ 2 · D/4 = D/2.
Definition 5.3 (Ball cut). A ball B(u, r) is cut if BallCarving puts the vertices of B(u, r) into at least two different partitions. We say Vi cuts B(u, r) if there exist w, y ∈ B(u, r) such that w ∈ Vi and y ∉ Vi.
Lemma 5.4. For any vertex u ∈ C and radius r ∈ R+,

Pr[B(u, r) is cut in BallCarving(G, C, D)] ≤ O(log n) · r/D

Proof. Let θ be the randomly chosen ball radius and π be the random permutation on C in BallCarving. We give another ordering of the vertices, by increasing distance to B(u, r), where the distance of a fixed point w to the ball B(u, r) is the distance of w to the closest point in B(u, r): this yields v1, v2, . . . , vn such that dG(B(u, r), v1) ≤ dG(B(u, r), v2) ≤ · · · ≤ dG(B(u, r), vn).
[Figure 5.1 here]

Figure 5.1: Ball carving on a set of vertices C = {N1, N2, . . . , N8}. The ordering of nodes is given by a random permutation π. Ball(N1) contains the vertices N1, N2, N5, so V1 = {N1, N2, N5}. In Ball(N2), only N3 has not been carved by the earlier balls, so V2 = {N3}. All vertices in Ball(N3) have already been carved, so V3 = ∅. In Ball(N4), only N4 has not been carved, so V4 = {N4}. All vertices in Ball(N5) have been carved, so V5 = ∅. Ball(N6) carves N6, N7, N8, so V6 = {N6, N7, N8}. As with N3 and N5, V7 = ∅ and V8 = ∅. Thus C is partitioned into the sets {N1, N2, N5}, {N3}, {N4}, and {N6, N7, N8}.

Observation 5.5. If Vi is the first partition that cuts B(u, r), a necessary condition is that in the random permutation π, vi appears before every vj with j < i (i.e., π(vi) < π(vj) for all 1 ≤ j < i).

Proof. Consider the largest 1 ≤ j < i such that π(vj) < π(vi):

• If B(u, r) ∩ B(vj, θ) = ∅, then B(u, r) ∩ B(vi, θ) = ∅ as well: B(u, r) ∩ B(vj, θ) = ∅ ⇐⇒ ∀u′ ∈ B(u, r), u′ ∉ B(vj, θ) ⇐⇒ ∀u′ ∈ B(u, r), dG(u′, vj) > θ ⇐⇒ dG(B(u, r), vj) > θ. Since dG(B(u, r), vi) ≥ dG(B(u, r), vj) > θ, none of B(u, r)'s vertices lies in B(vi, θ), and hence none lies in Vi.

• If B(u, r) ⊆ B(vj, θ), then all vertices of B(u, r) have been removed before vi is considered.

• If B(u, r) ∩ B(vj, θ) ≠ ∅ and B(u, r) ⊄ B(vj, θ), then Vi is not the first partition that cuts B(u, r), since Vj (or possibly an earlier partition) has already cut B(u, r).

In any case, if there is a 1 ≤ j < i such that π(vj) < π(vi), then Vi is not the first partition that cuts B(u, r).

Observation 5.6. Pr[Vi cuts B(u, r)] ≤ 2r/(D/8)

Proof. We ignore all the other partitions and consider only a necessary condition for a partition to cut a ball. If Vi cuts B(u, r), then there exist u1 ∈ B(u, r) with u1 ∈ B(vi, θ) and u2 ∈ B(u, r) with u2 ∉ B(vi, θ).

• u1 ∈ B(u, r) with u1 ∈ B(vi, θ) implies dG(u, vi) − r ≤ dG(u1, vi) ≤ θ.

• u2 ∈ B(u, r) with u2 ∉ B(vi, θ) implies dG(u, vi) + r ≥ dG(u2, vi) ≥ θ.

Together these bound θ: θ ∈ [dG(u, vi) − r, dG(u, vi) + r]. Since θ is chosen uniformly from [D/8, D/4],

Pr[θ ∈ [dG(u, vi) − r, dG(u, vi) + r]] ≤ ((dG(u, vi) + r) − (dG(u, vi) − r)) / (D/4 − D/8) = 2r/(D/8)

Therefore, Pr[Vi cuts B(u, r)] ≤ Pr[θ ∈ [dG(u, vi) − r, dG(u, vi) + r]] ≤ 2r/(D/8).

Thus,

Pr[B(u, r) is cut]
  = Pr[∪_{i=1}^{n} {Vi first cuts B(u, r)}]
  ≤ Σ_{i=1}^{n} Pr[Vi first cuts B(u, r)]                          (union bound)
  = Σ_{i=1}^{n} Pr[π(vi) = min_{j≤i} π(vj)] · Pr[Vi cuts B(u, r)]   (requires vi to appear first, Observation 5.5)
  = Σ_{i=1}^{n} (1/i) · Pr[Vi cuts B(u, r)]                         (π is a uniformly random permutation)
  ≤ Σ_{i=1}^{n} (1/i) · 2r/(D/8)                                    (diam(B(u, r)) ≤ 2r and θ ∈ [D/8, D/4])
  = 16 · (r/D) · Hn                                                 (Hn = Σ_{i=1}^{n} 1/i)
  ∈ O(log(n)) · r/D

Claim 5.7. BallCarving(G, C, D) returns a partition V1, . . . , Vl such that

∀u, v ∈ V, Pr[u and v not in same partition] ≤ α · dG(u, v)/D

Proof. Let r = dG(u, v); then v is on the boundary of B(u, r).

Pr[u and v not in same partition]
  ≤ Pr[B(u, r) is cut in BallCarving]
  ≤ O(log n) · r/D                       (by Lemma 5.4)
  = O(log n) · dG(u, v)/D                (since r = dG(u, v))

Note: α = O(log n), as previously claimed.

5.1.3 Construction of T

Using ball carving, ConstructT recursively partitions the vertices of a given graph until only one vertex remains. At each step, the upper bound D bounds the maximum distance between the vertices of C. The first call of ConstructT starts with C = V and D = diam(V). Figure 5.2 illustrates the process of building a tree T from a given graph G.

Algorithm 15 ConstructT(G = (V, E), C ⊆ V, D)
if |C| = 1 then
    return The only vertex in C           ▷ Return an actual vertex from V(G)
else
    V1, . . . , Vl ← BallCarving(G, C, D)  ▷ max_{u,v∈Vi} dG(u, v) ≤ D/2
    Create auxiliary vertex r             ▷ r is the root of the current subtree
    for i ∈ {1, . . . , l} do
        ri ← ConstructT(G, Vi, D/2)
        Add edge {r, ri} with weight D
    end for
    return Root of subtree r              ▷ Return an auxiliary vertex r
end if
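The full recursion, with the ball-carving step of Algorithm 14 inlined, can be sketched as one self-contained Python function; representing auxiliary roots by fresh negative ids and returning the tree as a list of weighted parent-child edges are choices of this sketch, not of the notes.

```python
import itertools
import random

def construct_tree(dist, vertices, rng=None):
    """Sketch of Algorithm 15: recursively ball-carve `vertices` and return
    (root, edges), where edges lists (parent, child, weight) triples and
    auxiliary nodes carry fresh negative ids. dist[u][v] is the metric d_G."""
    rng = rng or random.Random(0)
    fresh = itertools.count(-1, -1)     # -1, -2, ... for auxiliary nodes
    edges = []

    def carve(C, D):                    # one round of ball carving (Algorithm 14)
        theta = rng.uniform(D / 8, D / 4)
        order = list(C)
        rng.shuffle(order)
        remaining, parts = set(C), []
        for c in order:
            ball = {v for v in remaining if dist[c][v] <= theta}
            if ball:
                parts.append(ball)
                remaining -= ball
        return parts

    def build(C, D):
        C = list(C)
        if len(C) == 1:
            return C[0]                 # leaf: an actual vertex of G
        root = next(fresh)              # auxiliary root of this subtree
        for part in carve(C, D):
            edges.append((root, build(part, D / 2), D))  # level-i edges weigh D/2^i
        return root

    D0 = max(dist[u][v] for u in vertices for v in vertices)
    return build(list(vertices), D0), edges
```

Every original vertex ends up as a leaf, and every internal node is auxiliary, mirroring the structure of Figure 5.2.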

Lemma 5.8. For any two vertices u, v ∈ V and i ∈ N, if T separates u and v at level i, then 2D/2^i ≤ dT(u, v) ≤ 4D/2^i, where D = diam(V).

Proof. If T splits u and v at level i, then the path from u to v in T has to include two edges of weight D/2^i, hence dT(u, v) ≥ 2D/2^i. More precisely,

2D/2^i ≤ dT(u, v) = 2 · (D/2^i + D/2^{i+1} + · · · ) ≤ 4D/2^i

[Figure: r is the auxiliary node at level i that splits u and v; the u–v path in T climbs from u to r over edges of weights . . . , D/2^{i+1}, D/2^i and descends symmetrically to v.]

[Figure 5.2 here]

Figure 5.2: Recursive ball carving with ⌈log2(D)⌉ levels. Red vertices are auxiliary nodes that are not in the original graph G. Denoting the root as the 0th level, edges from level i to level i + 1 have weight D/2^i.

Remark If u, v ∈ V are separated before level i, then the u–v path in T still contains two edges of weight at least D/2^i, hence dT(u, v) ≥ 2D/2^i.

Claim 5.9. ConstructT(G, C = V, D = diam(V)) returns a tree T such that

dG(u, v) ≤ dT(u, v)

Proof. Consider u, v ∈ V. Say D/2^i ≤ dG(u, v) ≤ D/2^{i−1} for some i ∈ N. By property (A) of ball carving, T will separate them at, or before, level i. By Lemma 5.8, dT(u, v) ≥ 2D/2^i = D/2^{i−1} ≥ dG(u, v).

Claim 5.10. ConstructT(G, C = V, D = diam(V)) returns a tree T such that

E[dT(u, v)] ≤ 4α log(D) · dG(u, v)

Proof. Consider u, v ∈ V. Define Ei as the event that "vertices u and v get separated at the ith level", for i ∈ N. By the recursive nature of ConstructT, the subsets at the ith level have diameter at most D/2^i. So, property (B) of ball carving tells us that Pr[Ei] ≤ α · dG(u, v)/(D/2^i). Then,

E[dT(u, v)] = Σ_{i=0}^{log(D)−1} Pr[Ei] · [dT(u, v), given Ei]       (definition of expectation)
            ≤ Σ_{i=0}^{log(D)−1} Pr[Ei] · 4D/2^i                     (by Lemma 5.8)
            ≤ Σ_{i=0}^{log(D)−1} (α · dG(u, v)/(D/2^i)) · 4D/2^i     (property (B) of ball carving)
            = 4α log(D) · dG(u, v)                                   (simplifying)

If we apply Claim 5.7 together with Claim 5.10, we get

E[dT(u, v)] ≤ O(log(n) log(D)) · dG(u, v)

We can remove the log(D) factor and prove that the tree embedding built by the algorithm has stretch factor c = O(log n). For that, we need a tighter analysis of the ball carving process that only considers the vertices that may cut B(u, dG(u, v)), instead of all n vertices, at each level of the recursive partitioning. This sharper analysis is presented as a separate section below; see Theorem 5.13 in Section 5.1.4.

5.1.4 Sharper Analysis of Tree Embedding

Recall that applying Claim 5.7 with Claim 5.10 only gives E[dT(u, v)] ≤ O(log(n) log(D)) · dG(u, v). To remove the log(D) factor, so that the stretch factor becomes c = O(log n), we need a tighter analysis that only considers the vertices that may cut B(u, dG(u, v)), instead of all n vertices.

Tighter analysis of ball carving

Fix arbitrary vertices u and v, and let r = dG(u, v). Recall that θ is chosen uniformly at random from the range [D/8, D/4]. A ball B(vi, θ) can cut B(u, r) only when dG(u, vi) − r ≤ θ ≤ dG(u, vi) + r. In other words, one only needs to consider vertices vi such that D/8 − r ≤ θ − r ≤ dG(u, vi) ≤ θ + r ≤ D/4 + r.

Lemma 5.11. For i ∈ N, if r > D/16, then Pr[B(u, r) is cut at level i] ≤ 16r/D.

Proof. If r > D/16, then 16r/D > 1. As Pr[B(u, r) is cut at level i] is a probability and thus at most 1, the claim holds.

Remark Although Lemma 5.11 is not a very useful inequality on its own (any probability is ≤ 1), we use it to partition the value range of r so that we can say something stronger in the next lemma.

Lemma 5.12. For i ∈ N, if r ≤ D/16, then

Pr[B(u, r) is cut] ≤ O(log(|B(u, D/2)| / |B(u, D/16)|)) · r/D

Proof. Since Vi cuts B(u, r) only if D/8 − r ≤ dG(u, vi) ≤ D/4 + r, and r ≤ D/16, we only need to consider vertices with dG(u, vi) ∈ [D/16, 5D/16] ⊆ [D/16, D/2].

[Figure: only the vertices at distance between D/16 and D/2 from u can cut B(u, r).]

Suppose we arrange the vertices in ascending order of distance from u: u = v1, v2, . . . , vn. Denote:

• j − 1 = |B(u, D/16)| as the number of nodes that have distance ≤ D/16 from u

• k = |B(u, D/2)| as the number of nodes that have distance ≤ D/2 from u

We see that only the vertices vj, vj+1, . . . , vk have distance from u in the range [D/16, D/2]; pictorially, only the vertices in the shaded annulus could possibly cut B(u, r). As before, let π(v) be the position at which vertex v appears in the random permutation π. Then,

Pr[B(u, r) is cut]
  = Pr[∪_{i=j}^{k} {Vi cuts B(u, r)}]                               (only Vj, Vj+1, . . . , Vk can cut)
  ≤ Σ_{i=j}^{k} Pr[π(vi) < min_{z<i} π(vz)] · Pr[Vi cuts B(u, r)]    (union bound)
  = Σ_{i=j}^{k} (1/i) · Pr[Vi cuts B(u, r)]                          (π is a uniformly random permutation)
  ≤ Σ_{i=j}^{k} (1/i) · 2r/(D/8)                                     (diam(B(u, r)) ≤ 2r and θ ∈ [D/8, D/4])
  = 16 · (r/D) · (Hk − Hj−1)                                         (where Hk = Σ_{i=1}^{k} 1/i)
  ∈ O(log(|B(u, D/2)| / |B(u, D/16)|)) · r/D                         (since Hk − Hj−1 ∈ O(log(k/j)))

Plugging into ConstructT

Recall that ConstructT is a recursive algorithm which handles subsets of diameter ≤ D/2^i at level i. For a given pair of vertices u and v, there exists i* ∈ N such that D/2^{i*} ≤ r = dG(u, v) ≤ D/2^{i*−1}. In other words, (1/16) · D/2^{i*−4} ≤ r ≤ (1/16) · D/2^{i*−5}. So, Lemma 5.12 applies for levels i ∈ [0, i* − 5] and Lemma 5.11 applies for levels i ∈ [i* − 4, log(D) − 1].

Theorem 5.13. E[dT(u, v)] ∈ O(log n) · dG(u, v)

Proof. As before, let Ei be the event that "vertices u and v get separated at the ith level". For Ei to happen, the ball B(u, r) = B(u, dG(u, v)) must be cut at level i, so Pr[Ei] ≤ Pr[B(u, r) is cut at level i].

E[dT(u, v)]
  = Σ_{i=0}^{log(D)−1} Pr[Ei] · [dT(u, v), given Ei]                                    (1)
  ≤ Σ_{i=0}^{log(D)−1} Pr[Ei] · 4D/2^i                                                  (2)
  = Σ_{i=0}^{i*−5} Pr[Ei] · 4D/2^i + Σ_{i=i*−4}^{log(D)−1} Pr[Ei] · 4D/2^i              (3)
  ≤ Σ_{i=0}^{i*−5} O(log(|B(u, D/2^{i+1})| / |B(u, D/2^{i+4})|)) · (r/(D/2^i)) · 4D/2^i + Σ_{i=i*−4}^{log(D)−1} Pr[Ei] · 4D/2^i     (4)
  ≤ Σ_{i=0}^{i*−5} O(log(|B(u, D/2^{i+1})| / |B(u, D/2^{i+4})|)) · (r/(D/2^i)) · 4D/2^i + Σ_{i=i*−4}^{log(D)−1} (16r/(D/2^{i*−4})) · 4D/2^i     (5)
  = 4r · Σ_{i=0}^{i*−5} O(log(|B(u, D/2^{i+1})| / |B(u, D/2^{i+4})|)) + 4r · Σ_{i=i*−4}^{log(D)−1} 2^{i*−i}     (6)
  ≤ 4r · Σ_{i=0}^{i*−5} O(log(|B(u, D/2^{i+1})| / |B(u, D/2^{i+4})|)) + 2^7 · r          (7)
  = 4r · O(log(n)) + 2^7 · r                                                             (8)
  ∈ O(log n) · r

(1) Definition of expectation
(2) By Lemma 5.8
(3) Split into cases: (1/16) · D/2^{i*−4} ≤ r ≤ (1/16) · D/2^{i*−5}
(4) By Lemma 5.12, applied at level i with diameter bound D/2^i
(5) By Lemma 5.11, applied with respect to the diameter bound D/2^{i*−4}
(6) Simplifying
(7) Since Σ_{i=i*−4}^{log(D)−1} 2^{i*−i} ≤ 2^5
(8) log(x/y) = log(x) − log(y), so the sum telescopes, and |B(u, ·)| ≤ n

5.1.5 Removing auxiliary nodes from tree T

Note in Figure 5.2 that we introduce auxiliary vertices in our tree construction. We would next like to build a tree T′ without additional vertices (i.e., such that V(T′) = V(G)). In this section, the pseudo-code Contract explains how to remove the auxiliary vertices. It remains to show that the produced tree still preserves the desirable properties of a tree embedding.

Algorithm 16 Contract(T)
while T has an edge (u, w) such that u ∈ V and w is an auxiliary node do
    Contract edge (u, w) by merging the subtree rooted at u into w
    Identify the new node as u
end while
Multiply the weight of every edge by 4
return Modified tree T′

Claim 5.14. Contract returns a tree T′ such that

dT(u, v) ≤ dT′(u, v) ≤ 4 · dT(u, v)

Proof. Suppose auxiliary node w, at level i, is the closest common ancestor of two arbitrary vertices u, v ∈ V in the original tree T. Then,

dT(u, v) = dT(u, w) + dT(w, v) = 2 · Σ_{j=i}^{log D} D/2^j ≤ 4 · D/2^i

Since we do not contract actual vertices, at least one edge of weight D/2^i incident to w remains on the u–v path. After multiplying the weights of all remaining edges by 4, we get dT′(u, v) ≥ 4 · D/2^i ≥ dT(u, v).
For the other direction, if we only multiplied the weights along the u–v path by 4 without contracting any edge, the distance would be exactly 4 · dT(u, v). Since contracting edges can only decrease distances, dT′(u, v) ≤ 4 · dT(u, v).

Remark Claim 5.14 tells us that one can construct a tree T′ without auxiliary vertices by incurring an additional constant-factor overhead.

5.2 Application: Buy-at-bulk network design

Definition 5.15 (Buy-at-bulk network design problem). Consider a graph G = (V, E) with edge lengths le for e ∈ E. Let f : R+ → R+ be a sub-additive cost function, that is, f(x + y) ≤ f(x) + f(y). Given k commodity triplets (si, ti, di), where si ∈ V is the source, ti ∈ V is the target, and di ≥ 0 is the demand of the ith commodity, find a capacity assignment ce on the edges such that

• Σ_{e∈E} f(ce) · le is minimized

• ∀e ∈ E, ce ≥ total flow passing through e

• Flow conservation is satisfied and every commodity's demand is met
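To make the sub-additivity condition concrete, here is a small numeric spot-check in Python; the choice of √x as an example cost (modeling economies of scale) is our illustration, not from the notes.

```python
import itertools
import math

def is_subadditive(f, samples, tol=1e-12):
    """Spot-check f(x + y) <= f(x) + f(y) on all pairs of sample points.
    This only tests the given samples; it is a sanity check, not a proof."""
    return all(f(x + y) <= f(x) + f(y) + tol
               for x, y in itertools.product(samples, repeat=2))
```

Concave functions with f(0) ≥ 0, such as f(x) = √x, are sub-additive (buying capacity in bulk is cheaper per unit), while a strictly convex cost like f(x) = x² is not.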

Remark If f is linear (i.e., f(x + y) = f(x) + f(y)), one can obtain an optimum solution by finding the shortest path si → ti for each commodity i, then summing up the required capacities on each edge.

Algorithm 17 NetworkDesign(G = (V, E))
ce ← 0, ∀e ∈ E                                   ▷ Initialize capacities
T ← ConstructT(G)                                ▷ Build probabilistic tree embedding T of G
T ← Contract(T)                                  ▷ V(T) = V(G) after contraction
for i ∈ {1, . . . , k} do                         ▷ Solve problem on T
    P_{si,ti}^T ← Find shortest si–ti path in T   ▷ It is unique in a tree
    for each edge {u, v} of P_{si,ti}^T do
        P_{u,v}^G ← Find shortest u–v path in G
        ce ← ce + di for each edge e ∈ P_{u,v}^G
    end for
end for
return {ce : e ∈ E}

Let us denote the given instance by I = (G, f, {si, ti, di}_{i=1}^{k}). Let OPT(I, G) be the optimal solution on G. The general idea of our algorithm NetworkDesign is to first transform the original graph G into a tree T by the probabilistic tree embedding method and contract it, then find an optimal solution on the tree and map it back to graph G. Let A(I, G) be the solution produced by our algorithm on graph G, and A(I, T) the intermediate solution on the tree. Denote the corresponding costs by |OPT(I, G)|, |A(I, G)|, and |A(I, T)|.
We now compare the solutions OPT(I, G) and A(I, T) by comparing the costs of edges (u, v) ∈ E in G and in the tree embedding T. For the three claims below, we provide just proof sketches, without diving into the notation-heavy calculations. Please refer to Section 8.6 in [WS11] for the formal arguments.

Claim 5.16. |A(I, G)| using edges in G ≤ |A(I, T)| using edges in T.

Proof. (Sketch) This follows from two facts: (1) For any edge xy ∈ T, all of the paths sent in A(I, T) along edge xy are now sent along the shortest path connecting x and y in G, which by the first property of the tree embedding has length at most the length of the xy edge. (2) Several paths, corresponding to different edges in the tree T, might end up being routed through the same edge e of G. But by sub-additivity, the cost on edge e is at most the sum of the costs of those paths.

Claim 5.17. |A(I, T)| using edges in T ≤ |OPT(I, T)| using edges in T.

Proof. (Sketch) Since the shortest path in a tree is unique, A(I, T) is optimal for T. So, any other flow assignment has to incur higher edge capacities.

Claim 5.18. E[|OPT(I, T)| using edges in T] ≤ O(log n) · |OPT(I, G)|

Proof. (Sketch) Using sub-additivity, we can upper bound the cost of OPT(I, T) by the sum, over all edges e ∈ G, of the cost for the capacity of this edge in the optimal solution OPT(I, G), multiplied by the length of the path connecting the two endpoints of e in the tree T. We know that T stretches edges by at most a factor of O(log n) in expectation. Hence, the cost is in expectation upper bounded by the sum, over all edges e ∈ G, of the cost for the capacity of this edge in the optimal solution OPT(I, G), multiplied by the length of the edge e in G. The latter is simply the cost of OPT(I, G).

By the three claims above, NetworkDesign gives an O(log n)-approximation to the buy-at-bulk network design problem, in expectation.
Chapter 6

L1 metric embedding & sparsest cut

In this chapter, we see how viewing graphs as geometric objects helps us solve some cut problems. You can find more on this topic in the seminal work of Linial, London, and Rabinovich [LLR95].

6.1 Warm up: Min s-t Cut

In this section we study the minimum s-t cut problem in undirected graphs. Given a graph G = (V, E), we define a cut as a partition of the vertices into two sets (S, V \ S). We will typically work with connected graphs; in this situation, any non-trivial cut (S ≠ ∅, V \ S ≠ ∅) has some edges going across it. The minimum s-t cut problem searches for a cut that separates a source s ∈ V from a target t ∈ V while minimizing the capacity of the edges across the cut.
An intuitive way of thinking of the minimum s-t cut problem is to consider the capacities as prices to pay for removing edges, and to try to disconnect the node s from t while minimizing the price paid.
The minimum s-t cut problem can be solved in polynomial time by formulating it as a linear program. We will follow a slightly different approach: solving a relaxation of the problem and constructing a minimum-cost cut from the solution of the relaxed problem. The solution that we present is not computationally better than solving the minimum s-t cut directly, but it introduces a framework that can be generalized to more complex cut problems.

The mathematical formulation of the minimum cut problem is:

min_{cut S⊂V} Σ_{e={u,v}∈E, u∈S, v∈V\S} ce = min_{cut S⊂V} Σ_{e={u,v}∈E} ce |1S(u) − 1S(v)|     (6.1)

where 1S(u) = 1 if u ∈ S and 1S(u) = 0 otherwise is the indicator function of the set S, and |1S(u) − 1S(v)| = 1 exactly when u and v are not on the same side of the cut. Note that for any cut (S, V \ S), |1S(u) − 1S(v)| defines a pseudo-metric.
Definition 6.1. A pseudo-metric on a set S is a function d : S × S → R such that

(A) identity: ∀s ∈ S, d(s, s) = 0

(B) non-negativity: ∀s, t ∈ S, d(s, t) ≥ 0

(C) symmetry: ∀s, t ∈ S, d(s, t) = d(t, s)

(D) triangle inequality: ∀s, t, u ∈ S, d(s, u) ≤ d(s, t) + d(t, u)

A pseudo-metric is a generalization of the notion of a metric in which there is no requirement that d(s, t) = 0 ⇒ s = t.
Since |1S(u) − 1S(v)| defines a pseudo-metric that separates the source and the target, |1S(s) − 1S(t)| = 1, we can relax the minimization problem by allowing any pseudo-metric that satisfies d(s, t) = 1. This minimization problem can be formulated as the following linear program:

min_{d} Σ_{e={u,v}∈E} ce duv

subject to  duv ≥ 0            ∀u, v ∈ V
            duv = dvu          ∀u, v ∈ V
            duw ≤ duv + dvw    ∀u, v, w ∈ V
            dst = 1
With the solution d* of the linear program we can find the minimum s-t cut.

Claim 6.2. For the minimum s-t cut problem, the relaxed optimization problem has the same optimum value as the original problem:

Σ_{e∈E} ce d*(e) = Σ_{e={u,v}∈E} ce |1S*(u) − 1S*(v)|

Moreover, given any optimal pseudo-metric d*, we can explicitly construct a cut S* that achieves the optimal value of the original problem.

Proof. First we note that since d* solves a relaxation of the original problem, for any cut (S, V \ S) it holds that

Σ_{e∈E} ce d*(e) ≤ Σ_{e={u,v}∈E} ce |1S(u) − 1S(v)|

To prove the equality it suffices to find a cut S* that satisfies

Σ_{e={u,v}∈E} ce d*(u, v) ≥ Σ_{e={u,v}∈E} ce |1S*(u) − 1S*(v)|

Such a cut S* will be a minimum s-t cut. To find it, we first observe that since d* is the optimum of the linear program, it must satisfy the constraint d*(s, t) = 1. Then we order the vertices according to their distance to the source, defining vi ∈ V as the ith closest vertex to the source and xi = d*(s, vi) as its corresponding distance to the source. We also define the increments yi = xi+1 − xi.
Now we define the natural cuts as the cuts that separate the vertices according to their distance to the source: Si = {v ∈ V | d*(s, v) ≤ xi}. Figure 6.1 shows the vertices and the natural cuts.

[Figure 6.1 here]

Figure 6.1: The natural cuts xi and the corresponding increments yi.

We now show that one of the natural cuts achieves the optimum and therefore is a minimum s-t cut. First we observe that for any edge e = {vi, vj} with, w.l.o.g., xi ≤ xj, the triangle inequality gives us d(s, vj) ≤ d(s, vi) + d(vi, vj), hence

d(e) = d(vi, vj) ≥ d(s, vj) − d(s, vi) = xj − xi = Σ_{cuts k crossed by e} yk

Therefore we can write:

Σ_{e∈E} ce d*(e) ≥ Σ_{e∈E} ce Σ_{cuts k crossed by e} yk
                = Σ_{k} yk Σ_{edges e crossing the k-th cut} ce
                ≥ ( min_{k} Σ_{edges e crossing the k-th cut} ce ) · Σ_{k} yk

Since d*(s, t) = 1 and the vertices are ordered, Σ_{k=0}^{n−1} yk = xn ≥ xt = 1, hence

Σ_{e∈E} ce d*(e) ≥ min_{k} Σ_{edges e crossing the k-th cut} ce = Σ_{e={u,v}∈E} ce |1S*(u) − 1S*(v)|

where S* is defined as the natural cut with the lowest capacity, i.e., the one minimizing Σ_{edges e crossing the cut} ce. Therefore the cut S* achieves a cost no larger than the optimum of the relaxed problem and is a minimum s-t cut.
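The cut-extraction step of this proof — scan the natural cuts and keep the cheapest — can be sketched in Python; the sketch assumes an optimal LP solution d*(s, ·) is already given as a dictionary, and all identifiers are illustrative.

```python
def cheapest_natural_cut(dist_to_s, capacity, edges, t):
    """Enumerate the natural cuts S_i = {v : d*(s, v) <= x_i} over the
    distinct distance values x_i < d*(s, t) = 1, and return the cut of
    minimum crossing capacity, as in the proof of Claim 6.2."""
    best_S, best_cost = None, float("inf")
    for x in sorted({d for v, d in dist_to_s.items() if d < dist_to_s[t]}):
        S = {v for v, d in dist_to_s.items() if d <= x}
        cost = sum(capacity[e] for e in edges if (e[0] in S) != (e[1] in S))
        if cost < best_cost:
            best_S, best_cost = S, cost
    return best_S, best_cost
```

On a path s–a–b–t with LP distances 0, 1/3, 2/3, 1 and capacities 3, 1, 2, the cheapest natural cut separates {s, a} by paying only the middle edge.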

6.2 Sparsest Cut via L1 Embedding

We now move to the sparsest cut problem, which asks for a cut S ⊆ V that minimizes the following objective function:

L(S) = |E(S, V \ S)| / (|S| |V \ S|)     (6.2)

where E(S, V \ S) denotes the set of edges that cross the cut, {e = (u, v) ∈ E | u ∈ S and v ∈ V \ S}.

Observation 6.3 (Sparse cuts in complete graphs). For complete graphs, the number of edges between any cut S and V \ S is exactly |S| · |V \ S|, so L(S) = 1.

We can interpret the objective function (6.2) as a comparison between the cut S in the given graph and the same cut in a complete graph on the same vertices. We also note that the objective penalizes cuts where the two sets have very different sizes.
The sparsest cut problem is NP-hard. In this section we will provide a polynomial-time approximation using ideas closely related to those exposed in the previous section. We will prove the following theorem.

Theorem 6.4 (Sparsest cut approximation). Let SOPT ⊆ V be an optimal solution to the sparsest cut problem. There is a cut S ⊆ V, computable in polynomial time, such that

L(S) ≤ O(log(n)) · L(SOPT)     (6.3)

with high probability.


We start by formulating the sparsest cut problem with indicator functions:

L(S) = Σ_{{u,v}∈E} |1S(u) − 1S(v)| / Σ_{u,v∈V} |1S(u) − 1S(v)|.     (6.4)

Now, as we did for the minimum s-t cut problem, we relax the problem. We observe that for any cut (S, V \ S), the expression dS(u, v) = |1S(u) − 1S(v)| / Σ_{u,v∈V} |1S(u) − 1S(v)| defines a pseudo-metric and satisfies Σ_{u,v∈V} dS(u, v) = 1. Therefore we relax the problem to any pseudo-metric that satisfies Σ_{u,v∈V} duv = 1. We write the relaxed problem as the following linear program:
following linear program:
Definition 6.5 (Sparsest cut relaxation).

minimize Σ_{e={u,v}∈E} duv     (6.5)

subject to  duv ≥ 0            ∀u, v ∈ V,
            duv = dvu          ∀u, v ∈ V,
            duw ≤ duv + dvw    ∀u, v, w ∈ V,
            Σ_{u,v∈V} duv = 1

Now, from the optimal pseudo-metric d* we would like to obtain a cut that approximates the sparsest cut. Finding such a cut directly from d* is not easy; therefore, we approximate the optimal pseudo-metric d* by the L1 distance of an R^k space for some k. This approximation can be done using the following embedding result.
Theorem 6.6 (L1 embedding of a graph). Let d* be an arbitrary pseudo-metric on V. Then there exists an integer k = O(log²(n)) and a map f : V → R^k which preserves distances up to a log(n) factor:

d*(u, v)/Θ(log(n)) ≤ ||f(v) − f(u)||1 ≤ d*(u, v)     (6.6)

for all u, v ∈ V, with high probability. Moreover, f can be computed in polynomial time.
The L1 embedding theorem will be proved in the next section and can be assumed as a black box for now. The approximation of the optimal pseudo-metric d* by an L1 metric will allow us to find a cut (S*, V \ S*) whose cost is at most the cost given by the L1 metric; this cut will have an approximately lower cost than the cost of the relaxed problem. To find the cut we first prove a useful lemma.
Lemma 6.7. For any non-negative numbers a1, b1, a2, b2, . . . , ak, bk ≥ 0 with Σ_{i=1}^{k} bi ≠ 0, we have:

Σ_{i=1}^{k} ai / Σ_{i=1}^{k} bi ≥ min_{j=1}^{k} aj/bj.

Proof. For k = 1 this clearly holds, so suppose it holds for some k ≥ 1. We may assume that bk+1 ≠ 0 and Σ_{i=1}^{k} bi ≠ 0; otherwise the inequality quickly follows from the non-negativity of a1, . . . , ak+1. Then

(ak+1 + Σ_{i=1}^{k} ai) / (bk+1 + Σ_{i=1}^{k} bi)
  = ( ak+1/bk+1 + (Σ_{i=1}^{k} ai / Σ_{i=1}^{k} bi) · (Σ_{i=1}^{k} bi / bk+1) ) / ( 1 + Σ_{i=1}^{k} bi/bk+1 )
  ≥ ( min_{j=1}^{k+1} aj/bj + (min_{j=1}^{k} aj/bj) · Σ_{i=1}^{k} bi/bk+1 ) / ( 1 + Σ_{i=1}^{k} bi/bk+1 )     (by the induction hypothesis)
  ≥ ( min_{j=1}^{k+1} aj/bj ) · ( 1 + Σ_{i=1}^{k} bi/bk+1 ) / ( 1 + Σ_{i=1}^{k} bi/bk+1 )
  = min_{j=1}^{k+1} aj/bj.

Thus the proof of the claim follows by induction.
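Lemma 6.7 is easy to sanity-check numerically with exact rational arithmetic; the values below are an arbitrary illustration, not from the notes.

```python
from fractions import Fraction

def aggregate_and_min_ratio(pairs):
    """Return (sum(a_i)/sum(b_i), min_j a_j/b_j) for positive integer
    pairs (a_i, b_i) -- the two sides of Lemma 6.7 -- as exact fractions."""
    aggregate = Fraction(sum(a for a, _ in pairs), sum(b for _, b in pairs))
    minimum = min(Fraction(a, b) for a, b in pairs)
    return aggregate, minimum
```

For the pairs (1, 2), (3, 4), (5, 6), the aggregate ratio 9/12 = 3/4 is indeed at least the minimum individual ratio 1/2, as the lemma guarantees.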


Now we can use the previous lemma to find cut (S, V \ S) with lower cost
than the L1 cost.
1
The dimension k can be reduced to k = O(log(n)), see [Ind01].
6.2. SPARSEST CUT VIA L1 EMBEDDING 83

Claim 6.8 (Cut extraction). There is a cut S ⊆ V such that

Σ_{{u,v}∈E} |1S(u) − 1S(v)| / Σ_{u,v∈V} |1S(u) − 1S(v)| ≤ Σ_{{u,v}∈E} ||f(u) − f(v)||1 / Σ_{u,v∈V} ||f(u) − f(v)||1

where f is the L1 embedding defined in Theorem 6.6.


Proof. We write out the L1 norm as a sum of absolute values:

Σ_{{u,v}∈E} Σ_{i=1}^{k} |fi(u) − fi(v)| / Σ_{u,v∈V} Σ_{i=1}^{k} |fi(u) − fi(v)|

where fi : V → R is the ith component of f. Now we use Lemma 6.7 to reduce the problem to one dimension:

Σ_{{u,v}∈E} Σ_{i=1}^{k} |fi(u) − fi(v)| / Σ_{u,v∈V} Σ_{i=1}^{k} |fi(u) − fi(v)| ≥ min_{j=1,...,k} Σ_{{u,v}∈E} |fj(u) − fj(v)| / Σ_{u,v∈V} |fj(u) − fj(v)|     (6.7)

Let jmin be the index that minimizes the quotient on the right-hand side. Since the objective function is invariant under affine transformations of f (i.e., replacing f by f̂(u) = a · f(u) + b), we may assume that

max_{u∈V} fjmin(u) = 1 and min_{u∈V} fjmin(u) = 0.

Now let τ ∈ [0, 1] be a uniformly distributed threshold and define the cut

Sτ = {v ∈ V : fjmin(v) ≤ τ}     (6.8)

Note that |1Sτ(u) − 1Sτ(v)| equals 1 if min{fjmin(u), fjmin(v)} ≤ τ < max{fjmin(u), fjmin(v)}, and 0 otherwise, and hence

E[|1Sτ(u) − 1Sτ(v)|] = |fjmin(u) − fjmin(v)|.

Putting everything together, we obtain

Σ_{{u,v}∈E} ||f(u) − f(v)||1 / Σ_{u,v∈V} ||f(u) − f(v)||1
  ≥ Σ_{{u,v}∈E} |fjmin(u) − fjmin(v)| / Σ_{u,v∈V} |fjmin(u) − fjmin(v)|
  = Σ_{{u,v}∈E} E[|1Sτ(u) − 1Sτ(v)|] / Σ_{u,v∈V} E[|1Sτ(u) − 1Sτ(v)|]
  ≥ min_{τ} Σ_{{u,v}∈E} |1Sτ(u) − 1Sτ(v)| / Σ_{u,v∈V} |1Sτ(u) − 1Sτ(v)|.

The last step follows by choosing the minimum among the n distinct cuts Sτ and applying Lemma 6.7 again.
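Since there are only n − 1 non-trivial threshold cuts Sτ, we can simply scan all of them; a Python sketch of this cut-extraction step (the one-dimensional map `coord` plays the role of fjmin, and the example graph below is our illustration):

```python
def best_threshold_cut(coord, edges):
    """Scan the cuts S_tau = {v : coord[v] <= tau} over the distinct
    coordinate values and return the cut minimizing the sparsity
    L(S) = |E(S, V\\S)| / (|S| * |V\\S|)."""
    n = len(coord)
    best_S, best_ratio = None, float("inf")
    for tau in sorted(set(coord.values()))[:-1]:   # the largest value gives the trivial cut
        S = {v for v, x in coord.items() if x <= tau}
        crossing = sum(1 for u, v in edges if (u in S) != (v in S))
        ratio = crossing / (len(S) * (n - len(S)))
        if ratio < best_ratio:
            best_S, best_ratio = S, ratio
    return best_S, best_ratio
```

On a "dumbbell" of two triangles joined by a single edge, with coordinate 0 on one triangle and 1 on the other, the extracted cut is one triangle, with sparsity 1/(3 · 3) = 1/9.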
Finally, we can show that the cut obtained from the L1 metric is a good approximation of the sparsest cut.

Proof of Theorem 6.4. Let d* be an optimal pseudo-metric solution to the linear program (6.5), f : V → R^k an embedding as in Theorem 6.6, and S ⊆ V the cut extracted as described in Claim 6.8. Then with high probability we have

L(S) ≤ Σ_{{u,v}∈E} ||f(u) − f(v)||1 / Σ_{u,v∈V} ||f(u) − f(v)||1
     ≤ O(log(n)) · Σ_{{u,v}∈E} d*(u, v) / Σ_{u,v∈V} d*(u, v)
     ≤ O(log(n)) · L(SOPT).

Here the first inequality comes from the cut extraction (Claim 6.8), the second inequality comes from the L1 embedding theorem, and the last inequality comes from d* being the solution of the relaxed optimization problem.
Note that random threshold cuts Sτ can be defined along any dimension j ∈ {1, . . . , k} of f. The proof shows that among these at most n · k = O(n log²(n)) cuts Sτ, one of them is an O(log(n)) approximation of the sparsest cut, with high probability.

6.3 L1 Embedding

In the previous section, we saw how to find a Θ(log n) approximation of the sparsest cut by using, in a black-box manner, an embedding that maps the points to a space with the L1 norm while contracting pairwise distances by at most a Θ(log n) factor. In this section, we prove the existence of this embedding, i.e.,

Lemma 6.9. Given a pseudo-metric d : V × V → R+ on an n-point space V, we can construct a mapping f : V → R^k for k = Θ(log²(n)) such that for any two vertices u, v ∈ V, we have d(u, v)/Θ(log n) ≤ ||f(u) − f(v)||1 ≤ d(u, v).

Warm up & Intuition

Here, we provide some intuitive discussion that helps us understand how we arrive at the final construction.

Approach 1: Fix an arbitrary vertex s ∈ V, and define a one-dimensional map f : V → R, f(u) := d(s, u). This gives us a pseudo-metric, and it satisfies ||f(u) − f(v)||1 ≤ d(u, v) by the triangle inequality. What remains is to show that ||f(u) − f(v)||1 is not much smaller than d(u, v). This depends on the choice of the node s. A natural suggestion would be to pick s at random, and that works well in some scenarios, but not always. As the following pathological example shows, we do not in general have d(u, v)/Θ(log n) ≤ ||f(u) − f(v)||1.

Example 1: Consider a graph with n vertices in which two vertices u, v of degree n − 2 share the same neighbours; in this situation d(u, v) = 2 and d(u, vi) = d(v, vi) = 1 for all vi ≠ u, v. There are n − 2 vertices "in the middle of the segment uv". If we sample s ∈ V uniformly at random, then with large probability (n − 2)/n we choose a vertex "in the middle of uv". In that case ||f(u) − f(v)||1 = 0, so the bound d(u, v)/Θ(log n) ≤ ||f(u) − f(v)||1 fails.

Approach 2: The failure of approach 1 indicates that a single “source
vertex” s might not be enough for our purpose. To fix it, we can pick a set
of vertices S as the “source vertices”. More precisely, we choose a set S by
including each vertex from V in S with probability p = 1/2.
Suppose S is chosen. Define f : V → R as f (u) = d(S, u) := min_{s∈S} d(s, u).
The distance to a set S is a pseudo-metric, and from the triangle inequality
we can deduce that ||f (u) − f (v)||1 ≤ d(u, v) still holds.
As for the other direction of the inequality, we can check whether the approach
works on the pathological example. Notice that if we choose an
S such that u ∈ S, v ∉ S, then f (u) = 0 and f (v) = 1. This implies
||f (u) − f (v)||1 ≥ 1, satisfying d(u, v)/Θ(log n) ≤ ||f (u) − f (v)||1 . We also
notice that the event {S : u ∈ S, v ∉ S} happens with a constant probability
for sampling probability p = 1/2. Therefore this counterexample is resolved.
Example 2: We consider a graph formed by a path on n vertices and take
u, v ∈ V to be the end nodes of the path. In this situation d(u, v) = n − 1,
but when choosing S by sampling each vertex with probability 1/2, we are
very likely to sample vertices that are close to both u and v, and we will have
|f (u) − f (v)| ≈ O(1). Therefore the bound d(u, v)/Θ(log n) ≤ ||f (u) − f (v)||1
does not hold.
Approach 3: The failure of approach 2 indicates that there are some
graphs that require choosing the set S with a high sampling probability, as we
have seen in Example 1, and there are graphs that require a low probability, as
we have seen in Example 2. This motivates the use of a multi-dimensional
embedding f : V → R^k where k = log n. Each component of the embedding
will have a different sampling probability, solving the problems of Example
1 and Example 2. We define the component fi (u) := d(Si , u) where Si has
been sampled with probability pi = 1/2^i .
6.3.1 The algorithm and its analysis

The Algorithm With the above considerations we write the following
algorithm, where we have added additional dimensions to be able to amplify
the success probabilities.

Algorithm 18 L1 EMBEDDING
  for i = 1 to L = log n do
    for h = 1 to H = 1000 log n do            ▷ Probability amplification
      Define Sih by including each v ∈ V in it independently with prob. 1/2^i
      Define the coordinate of f by f_{(i−1)H+h} (u) = d(u, Sih )/(LH)
    end for
  end for
  The embedding is then given as f : V → R^{LH} , where each coordinate is defined above.
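Algorithm 18 can be sketched directly in Python. This is our own illustrative implementation (function name, the fixed random seed, and the smaller amplification factor in place of H = 1000 log n are our assumptions, chosen so the sketch runs quickly):

```python
import math
import random

def l1_embedding(vertices, d, H_factor=10, seed=0):
    """Bourgain-style L1 embedding: coordinate (i, h) is the distance to a
    random set S_ih (each vertex kept with probability 2^{-i}), scaled by 1/(LH)."""
    rng = random.Random(seed)
    n = len(vertices)
    L = max(1, math.ceil(math.log2(n)))
    H = H_factor * L  # the notes use H = 1000 log n; smaller here for illustration
    f = {v: [] for v in vertices}
    for i in range(1, L + 1):
        for _ in range(H):
            S = [v for v in vertices if rng.random() < 2.0 ** (-i)]
            for v in vertices:
                # distance from v to the sampled set (0 if S happens to be empty)
                dist = min((d[v][s] for s in S), default=0.0)
                f[v].append(dist / (L * H))
    return f
```

On the path metric, one can check numerically that the embedding never expands distances, matching the "easy side" of the analysis below.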
Analysis: We need to prove that for every pair of points u, v ∈ V , we have
d(u, v)/Θ(log n) ≤ ||f (u) − f (v)||1 ≤ d(u, v). For readability, denote the
((i − 1)H + h)-th coordinate of f as fih .
Let us start with the easy side. By construction, we have

    ||f (u) − f (v)||1 = Σ_{i=1}^{L} Σ_{h=1}^{H} |fih (u) − fih (v)|
                       = Σ_{i=1}^{L} Σ_{h=1}^{H} |d(u, Sih ) − d(v, Sih )| / (LH)
                       ≤ LH · d(u, v)/(LH) = d(u, v).

For the other direction, the following result provides the inequality with high
probability.
Claim 6.10. The L1 embedding algorithm provides an embedding f : V →
R^{LH} , where L = log n and H = 1000 log n, that satisfies the following
condition with high probability:

    ||f (u) − f (v)||1 ≥ Θ( d(u, v)/L )    ∀u, v ∈ V.
Proof. This result will be proved in two steps.

1. We fix a pair of vertices u, v ∈ V and focus on a single component
fih where i ∈ {1, . . . , L}, h ∈ {1, . . . , H}. We prove that for a certain
sequence ρ̂i the lower bound

    |fih (u) − fih (v)| ≥ (ρ̂i − ρ̂i−1)/(LH)

holds with constant probability in the component fih .

2. After showing that this result holds for one component, we amplify
the probability to make it hold, for every i ∈ {1, . . . , L}, in a constant
fraction of the components fih with h ∈ {1, . . . , H}. This will allow us to
provide the desired bound for the L1 embedding.
Let us start with the first step. For t ∈ {0, 1, . . . , log n} we define

    ρt = min { r : |Br (u)| ≥ 2^t and |Br (v)| ≥ 2^t },

where Br (u) and Br (v) are closed balls of radius r. From the definition of
ρt there exists an index j such that ρj < d(u, v)/2 and ρj+1 ≥ d(u, v)/2. We
define the truncated sequence

    ρ̂i = ρi             if i = 0, 1, . . . , j,
    ρ̂i = d(u, v)/2      if i = j + 1, . . . , L.
If we take an index i > j + 1 we have ρ̂i = ρ̂i−1 = d(u, v)/2, hence
|fih (u) − fih (v)| ≥ (ρ̂i − ρ̂i−1)/(LH) = 0 trivially holds. Let us focus on
some index i ≤ j + 1. Without loss of generality, we may assume that ρ̂i is
defined by u, that is, |B^open_{ρ̂i} (u)| < 2^i . We also have |B_{ρ̂i−1} (v)| ≥ 2^{i−1}
by construction. Since ρ̂i−1 ≤ ρ̂i ≤ d(u, v)/2, the balls B^open_{ρ̂i} (u) and
B_{ρ̂i−1} (v) are disjoint. Consider the events

• Aih = { Sih ∩ B^open_{ρ̂i} (u) = ∅ },

• Bih = { Sih ∩ B_{ρ̂i−1} (v) ≠ ∅ },

where Sih is the set of nodes that defines the coordinate fih . Because the
two balls are disjoint, the events Aih and Bih are independent. When the
events Aih and Bih both happen, we have d(u, Sih ) ≥ ρ̂i and d(v, Sih ) ≤ ρ̂i−1 ,
so

    |fih (u) − fih (v)| = |d(Sih , u) − d(Sih , v)|/(LH) ≥ (ρ̂i − ρ̂i−1)/(LH).
To conclude the proof of the first step it remains to show that the event
Aih ∩ Bih happens with a constant probability. The probabilities of Sih not
sampling a node in B^open_{ρ̂i} (u) and of Sih sampling a node in B_{ρ̂i−1} (v)
can be calculated as

    Pr[Aih ] = (1 − 1/2^i)^{|B^open_{ρ̂i}(u)|} ≥ (1 − 1/2^i)^{2^i} ≥ 4^{−1} = 1/4,

    Pr[Bih ] = 1 − (1 − 1/2^i)^{|B_{ρ̂i−1}(v)|} ≥ 1 − (1 − 1/2^i)^{2^{i−1}} ≥ 1 − e^{−1/2}.
In the preceding calculation, we used the fact that 4^{−x} ≤ 1 − x ≤ e^{−x} for
x ∈ [0, 1/2]. Since Aih and Bih are independent events,

    Pr[Aih ∩ Bih ] = Pr[Aih ] · Pr[Bih ] ≥ c,

where we have defined c := (1 − e^{−1/2})/4. This finishes the first step of the
proof: the inequality |fih (u) − fih (v)| ≥ (ρ̂i − ρ̂i−1)/(LH) holds in a component fih
with constant probability.
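The two probability bounds above can be verified numerically. The following small Python check (our own illustration, not part of the notes) evaluates the extreme cases |B^open_{ρ̂i}(u)| = 2^i and |B_{ρ̂i−1}(v)| = 2^{i−1} for a range of i:

```python
import math

def check_constant_probability_bounds(max_i=20):
    """Verify (1 - 2^-i)^{2^i} >= 1/4 and 1 - (1 - 2^-i)^{2^{i-1}} >= 1 - e^{-1/2}
    for i = 1, ..., max_i; returns the constant c = (1 - e^{-1/2})/4."""
    for i in range(1, max_i + 1):
        p = 2.0 ** (-i)
        pr_A = (1 - p) ** (2 ** i)             # worst case for Pr[A_ih]
        pr_B = 1 - (1 - p) ** (2 ** (i - 1))   # worst case for Pr[B_ih]
        assert pr_A >= 0.25 - 1e-9
        assert pr_B >= 1 - math.exp(-0.5) - 1e-9
    return (1 - math.exp(-0.5)) / 4
```

For i = 1 the first bound is tight: (1 − 1/2)² = 1/4; as i grows, both quantities approach 1/e and 1 − e^{−1/2} from above.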
In the second step of the proof we amplify the success probability. For every
i ∈ {1, . . . , L} let xi be the number of indices h such that the inequality holds.
Using the Chernoff bound, we have

    Pr[ xi ≤ (c/2)H ] ≤ Pr[ |xi − c · H| ≥ (c/2)H ]
                      ≤ 2 exp(−c · H/12) ≤ 2/n^{1000c/12} ≤ 2/n^5 ,

so for every i ∈ {1, . . . , L} the inequality proved in Step 1 holds for at least
cH/2 indices h with high probability 1 − 2/n^5 . By a union bound over all
i ∈ {1, . . . , L}, the overall probability of failure is at most Θ(log n/n^5) ≤ Θ(1/n^4).
Now, since the result holds for every index i and for at least cH/2 of the
indices h, we have

    ||f (u) − f (v)||1 = Σ_{i=1}^{L} Σ_{h=1}^{H} |fih (u) − fih (v)|
                       ≥ (c/2) H · Σ_{i=1}^{L} (ρ̂i − ρ̂i−1)/(LH)
                       = (c/4) · d(u, v)/L = Θ( d(u, v)/L ),

where the last equality comes from the telescoping sum and the definitions
ρ̂0 = 0 and ρ̂L = d(u, v)/2.
This proves that the embedding is valid for the pair of vertices u, v ∈ V that
we fixed at the beginning of the proof. If we consider the probability that
the embedding works for every pair of vertices, a union bound over the Θ(n²)
pairs amplifies the failure probability to Θ(n²/n^4) = Θ(1/n²). Therefore the
L1 embedding satisfies the distance contraction for all pairs of nodes with
high probability.
Chapter 7

Oblivious Routing, Cut-Preserving Tree Embedding, and Balanced Cut

In this section, we develop the notion of cut-preserving tree embeddings.
We will use the problem of oblivious routing as our initial motivation. But
then, we will see that these cut-preserving trees are also useful for other
approximation problems, and we discuss the balanced cut problem as one
such application.¹
7.1 Oblivious Routing

Consider an undirected graph G = (V, E) where every edge e ∈ E has a
given capacity ce . Suppose we have many routing demands duv for u, v ∈ V ,
where duv denotes the demand to be sent from node u to node v.
For any pair (u, v) we want to define a route, or more formally a flow of
size duv from vertex u to v. This is a function ruv : E → [0, 1] such that:

1. Σ_{e∈out(w)} ruv (e)duv − Σ_{e∈in(w)} ruv (e)duv = 0 for all w ≠ u, v,

2. Σ_{e∈out(u)} ruv (e)duv − Σ_{e∈in(u)} ruv (e)duv = duv ,

3. −Σ_{e∈out(v)} ruv (e)duv + Σ_{e∈in(v)} ruv (e)duv = duv .
¹Please read critically. This section has not been reviewed by the instructor yet. To
get a quick response, please post your questions and comments about this chapter on the
course moodle.
We define the congestion of an edge e to be the total amount of demand
sent through e divided by the capacity ce . Overall, our objective is to have
a small congestion over all the edges.
Definition 7.1. (Oblivious Routing Problem) Given an undirected graph
G = (V, E), capacities ce for all e ∈ E, and demands duv for u, v ∈ V , we
wish to find routes for the demands with the objective of minimizing the
maximum congestion over all the edges.
Moreover, we wish to find these routes in an oblivious manner, meaning that
we should pick the route of each demand duv independently of the existence
of the other demands.
Our goal is to do this in a way that is competitive in terms of congestion
with the best possible routing that we could have designed after knowing all of
the demands. In this section, we discuss approximation schemes that devise
these routes independently for each demand and yet are still competitive with
the optimal solutions that one can obtain after knowing all of the demands.
7.2 Oblivious Routing via Trees

Warm-up Suppose the graph which we are given is simply a tree. In this
case, we can achieve oblivious routing easily: for all vertices u, v ∈ V there
is a unique simple path between them in the tree, and any routing of the
demand duv must traverse every edge of this path. The solution is therefore
to send all of the demand duv along this path.
General Graphs, routing via one tree Our hope is to use a similar idea
for an arbitrary graph G. One strategy would be to pick a spanning tree
T ⊆ G and require that all demands are routed through this tree.
Note that if we pick an edge {x, y} from the edge set of T , then removing
this edge from T disconnects the tree into two connected components. Let
S(x, y) denote the set of vertices which are in the same connected component
as x.
Now any demand duv that has exactly one endpoint in S(x, y) will be routed
through {x, y} in our spanning tree. These are the demands that have to go
through cut(S(x, y), V \ S(x, y)). Since we are routing only along the chosen
tree, all these demands have to traverse the edge {x, y} in the tree. We
therefore define

    D(x, y) = Σ_{u∈S(x,y), v∈V \S(x,y)} duv
as the amount that will be passed through edge {x, y} in our scheme. Hence,
in our routing, the congestion on an edge e = {x, y} ∈ T is exactly
D(x, y)/cxy . On the other hand, in any routing scheme on G, this demand
D(x, y) has to be sent through edges in cut(S(x, y), V \ S(x, y)). We can
therefore lower bound the optimum congestion of any scheme by

    OPT ≥ D(x, y)/C(x, y),

where

    C(x, y) = Σ_{e∈cut(S(x,y),V \S(x,y))} ce .
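The quantities D(x, y), C(x, y), the tree congestion, and the resulting lower bound on OPT can all be computed directly. The following Python sketch (function and variable names are ours; it assumes every graph edge appears as a key of `cap` in a fixed tuple orientation) does this for a given spanning tree:

```python
def tree_routing_bounds(n, tree_edges, cap, demand):
    """For each tree edge {x,y}: find S(x,y), the load D(x,y) routed over it,
    and the cut capacity C(x,y); return the worst congestion of routing along
    the tree and the lower bound max D(x,y)/C(x,y) on OPT."""
    adj = {v: set() for v in range(n)}
    for (x, y) in tree_edges:
        adj[x].add(y)
        adj[y].add(x)

    def side_of(x, y):
        # vertices in x's component after deleting tree edge {x, y}
        seen, stack = {x}, [x]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if {v, w} != {x, y} and w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    worst_congestion, opt_lower_bound = 0.0, 0.0
    for (x, y) in tree_edges:
        S = side_of(x, y)
        D = sum(d for (u, v), d in demand.items() if (u in S) != (v in S))
        C = sum(c for (u, v), c in cap.items() if (u in S) != (v in S))
        worst_congestion = max(worst_congestion, D / cap[(x, y)])
        opt_lower_bound = max(opt_lower_bound, D / C)
    return worst_congestion, opt_lower_bound
```

On a unit-capacity 4-cycle with the path 0–1–2–3 as spanning tree and a single unit demand between the endpoints of the removed edge, the tree routing has congestion 1 while the cut lower bound is only 1/2.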
Thus, if we had the following edge condition, then our routing would be
α-competitive.

Definition 7.2. (Edge Condition) For each edge {u, v} ∈ T , we should
have that

    cuv ≥ (1/α) C(u, v).    (7.1)

The above discussion shows that this edge condition is a sufficient condition
for α-competitiveness. However, we claim it is also a necessary condition
for the routing scheme on the tree.
Claim 7.3. The edge condition is necessary for the routing scheme on the
tree to be α-competitive in congestion.
Proof. We wish to show that if our scheme is α-competitive, then the edge
condition holds. Note that if our scheme is α-competitive, then it must be
so for any possible demand. Consider the case in which the demands which
we want to send are equal to the capacities, i.e. duv = cuv for all {u, v} ∈ G.
These demands can be routed optimally with congestion 1, by sending each
demand completely along its corresponding edge. For some edge {u, v} ∈ T ,
our scheme would try to send D(u, v), which in this case is C(u, v), along
the edge. Therefore the congestion of each edge {u, v} ∈ T is C(u, v)/cuv . So
if our scheme is α-competitive, we must have that C(u, v)/cuv ≤ α.
Generalizing and Modifying the Scheme Unfortunately, routing along
one subtree will not provide a competitive scheme for all graphs (examples
are discussed in the exercise sessions). To remedy this, we generalise our
plan of routing along a tree, in the following two ways:

1. Instead of routing along one tree, we route along many trees, namely a
collection T = {T1 , T2 , . . . , Tk } where each tree Ti has probability λi of
being chosen. So Σ_{i=1}^{k} λi = 1 and with probability λi we pick Ti and
route through it.
2. The trees Ti do not need to be subgraphs of our graph G. Tree Ti can be
a virtual graph on the vertices V where sending along an edge {x, y} ∈ Ti
actually means sending along some fixed path Pi (x, y) in G.
Given these two generalizations, we reach a natural modified variant of
the edge condition, as we describe next. We will show that this condition is
necessary and sufficient for α-competitiveness of the routing scheme. Let Ci
and Di be defined analogously to C and D on the tree Ti .

Definition 7.4. (Updated Edge Condition) For each edge {u, v} ∈ G,
we should have that

    cuv ≥ (1/α) Σ_i λi Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Ci (x, y).    (7.2)
Claim 7.5. The updated edge condition is necessary for the routing scheme
to be α-competitive in congestion.
Proof. As in the proof of Claim 7.3, we consider the case in which all demands
duv are equal to the capacities cuv . The optimal congestion in this case is 1.
We now consider the congestion achieved by our scheme. Consider an edge
{u, v} ∈ G. Suppose our scheme chooses tree Ti out of the collection of trees
T . Note that for every edge {x, y} ∈ Ti , if the path Pi (x, y) goes through
{u, v} then our scheme will send Di (x, y), which in this case is Ci (x, y),
through {u, v}. So the demand routed through {u, v} under Ti will be

    Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Ci (x, y).

But each tree Ti is chosen with probability λi out of our collection of trees T .
Therefore we can write the expected demand which will be routed through
{u, v} as

    Σ_i λi Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Ci (x, y).

Therefore, if our scheme is α-competitive with the optimal congestion, this
sum must be at most α · cuv , and therefore the edge condition holds.
Next, we will show that this updated edge condition is also sufficient for
achieving α-competitiveness, i.e. that satisfying inequality (7.2) is enough
to imply that our routing scheme is α-competitive in congestion.

Claim 7.6. The updated edge condition is sufficient for the routing scheme
to be α-competitive in congestion.
Proof. Assume the updated edge condition is satisfied, i.e. inequality (7.2)
holds. Let us consider an arbitrary set of demands and see how they must
be routed through our collection T of trees T1 , . . . , Tk as compared to the
optimal routing in G.
It is clear that the amount routed through some tree edge {x, y} ∈ Ti
must, in G, be routed through every edge {u, v} that lies on the fixed path
corresponding to {x, y}, i.e. through every {u, v} ∈ Pi (x, y).
If a tree Ti is chosen, we route Di (x, y) amount of flow through each
{u, v} ∈ Pi (x, y). With this strategy, we end up sending a total expected
amount of

    Σ_i λi Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Di (x, y)

through edge {u, v} ∈ G, where the expectation is taken over the possible
choices of a tree Ti . Together with our assumption that (7.2) is satisfied, we
can now upper bound the congestion on edge {u, v} by

    [ Σ_i λi Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Di (x, y) ] / [ (1/α) Σ_i λi Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Ci (x, y) ]
        ≤(∗) α · max_{i, {x,y}} Di (x, y)/Ci (x, y) ≤(∗∗) α · OPT.

Here, (∗) follows from (a1 + · · · + ak )/(b1 + · · · + bk ) ≤ max_i ai /bi for
nonnegative a1 , . . . , ak and positive b1 , . . . , bk , and (∗∗) is because, as seen
before, each Di (x, y)/Ci (x, y) is a lower bound on OPT.
7.3 Existence of the Tree Collection

In this section, our goal is to show the existence of a tree collection that
achieves O(log n)-competitiveness in oblivious routing. The proof proceeds
by writing the constrained minimization of the competitiveness ratio α as an
LP. We then write down the dual of this LP and show that its optimal
solution is in O(log n). By strong LP duality, we can then also claim the
minimum competitiveness ratio achievable in the primal to be O(log n).
To cast the competitiveness of the oblivious routing scheme as a linear
program LPOblivious Routing , we rely on the updated edge condition of the
previous section, which we know is necessary and sufficient:
LPOblivious Routing :

    minimize α                                                     / competitiveness ratio
    subject to α·cuv − Σ_i λi Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Ci (x, y) ≥ 0   ∀ {u, v} ∈ G   / edge condition
               Σ_i λi ≥ 1                                          / valid probabilities
               λi ≥ 0   ∀i ∈ [k]

Here we allow the probabilities λi to sum to more than 1, but this is not an
issue, as there always exists an optimal solution that satisfies Σ_i λi = 1.
Claim 7.7. The optimal value of LPOblivious Routing is in O(log n).
Proof. We begin by writing down the dual of LPOblivious Routing as below.

Dual-LPOblivious Routing :

    maximize z                                                          / (1)
    subject to Σ_{ {u,v}∈G } cuv luv ≤ 1                                / (2)
               z − Σ_{ {u,v}∈G } Σ_{ {x,y}∈Ti : {u,v}∈Pi(x,y) } Ci (x, y) · luv ≤ 0   ∀i ∈ [k]   / (3)
               luv ≥ 0   ∀ {u, v} ∈ G                                   / (4)
In order to provide an upper bound on the optimal solution of Dual-
LPOblivious Routing , we will think of the variable luv as denoting the length
of edge {u, v}. In any valid solution, these lengths need to satisfy inequality
(2) together with the edges’ given capacities. We further rewrite the
constraints in line (3). Instead of summing over all edges {u, v} and
restricting the inner sum to tree edges {x, y} whose corresponding path
Pi (x, y) contains {u, v}, we can also do the outer sum over all tree edges and
the inner one over just the relevant {u, v}’s. This way, the constraint of line
(3) is rewritten as:

    z ≤ Σ_{ {x,y}∈Ti } Ci (x, y) Σ_{ {u,v}∈Pi(x,y) } luv =: Ai    ∀i ∈ [k].

To arrive at the tightest condition on z, and thus also the tightest lower
bound on OPT, we are interested in the tree Ti that minimizes Ai under the
constraint that our lengths satisfy condition (2). In the following, we show
that there exists such a tree Ti among our collection T with Ai ≤ O(log n).
For that, recall from Chapter 5 on probabilistic distance-preserving tree
embeddings that for any graph G = (V, E) one can find a distribution over
a collection of trees Ti = (V, Ei ), each being chosen with probability λi , such
that for any x, y ∈ V we have

    dl (x, y) ≤ Ti (x, y)   and   E[Ti (x, y)] = Σ_i λi Ti (x, y) ≤ O(log n) · dl (x, y),

where dl (x, y) denotes the distance of the vertices x, y in G with respect to
the edge lengths l, and Ti (x, y) their distance in Ti . Below, we apply the
existence of such an embedding to our graph with edge lengths luv .
Consider the expectation of Ai with respect to our distribution over the
trees Ti with probabilities λi :
    E_{Ti} [Ai ] = Σ_i λi Ai
    = Σ_i λi Σ_{ {x,y}∈Ti } Ci (x, y) Σ_{ {u,v}∈Pi(x,y) } luv
    ≤ Σ_i λi Σ_{ {x,y}∈Ti } Ci (x, y) dl (x, y)      (taking Pi (x, y) to be a shortest path w.r.t. l)
    ≤ Σ_i λi Σ_{ {x,y}∈Ti } Ci (x, y) Ti (x, y)      (distance-preserving embedding)
    = Σ_i λi Σ_{ {x,y}∈Ti } Ti (x, y) Σ_{ {u,v}∈cut(Si(x,y),V \Si(x,y)) } cuv   (definition of Ci )
    = Σ_i λi Σ_{ {u,v}∈G } cuv Σ_{ {x,y}∈Pi(u,v) } Ti (x, y)    (definition of Si )
    = Σ_i λi Σ_{ {u,v}∈G } cuv Ti (u, v)             (rearranging the sum)
    = Σ_{ {u,v}∈G } cuv Σ_i λi Ti (u, v)             (rearranging, again)
    ≤ Σ_{ {u,v}∈G } cuv · O(log n) · dl (u, v)       (distance-preserving embedding)
    ≤ Σ_{ {u,v}∈G } cuv · O(log n) · luv             (property of distance)
    = O(log n) · Σ_{ {u,v}∈G } cuv luv ≤ O(log n).   (by (2))
From E_{Ti} [Ai ] ≤ O(log n) it follows directly that one of the Ai ’s must be
in O(log n). Therefore, the OPT of Dual-LPOblivious Routing is in O(log n),
and by strong duality the OPT of LPOblivious Routing is in O(log n).
Since any valid solution to LPOblivious Routing with objective value α
corresponds to an α-congestion-competitive oblivious routing scheme, we
have shown that O(log n)-competitiveness can be achieved.
Notice that we have only given an existence proof for a satisfactory mixture
of trees, not a construction. Additionally, we have not shown yet how many
trees are necessary in order to achieve O(log n)-competitiveness. In Section
8.3, we show how to construct a polynomially large mixture of trees in
polynomial time.
7.4 The Balanced Cut problem

We now want to tackle a new problem using the tree embedding technique
to gain a good approximation for a hard problem on general graphs: the
balanced cut problem. We aim to split the vertices of an undirected graph
into two sets of equal size so that the weight of the edges going across the
cut is minimized.

Definition 7.8 (minimum balanced cut). Given an undirected graph G =
(V, E) with edge costs ce for all e ∈ E, a minimum balanced cut (S, V \ S)
is a cut consisting of exactly half the vertices which minimizes the sum of
the costs of the edges crossing the cut:

    min_{ S⊂V, |S|=⌊|V |/2⌋ } Σ_{e∈(S,V \S)} ce .
Computing the balanced cut is of special interest to us as it can be used
as a powerful building block for other algorithms. In “divide & conquer”
approaches, we can split the graph into two sub-problems of roughly equal
size in a way that the dependencies (edges) between them are small, using
the minimum balanced cut.
Note that the technique seen here is much more generally applicable. It
allows us to go from a problem on general graphs to working on trees, which
often drastically simplifies the problem. Because the tree construction in
some sense preserves properties of the cuts, we can hope to translate the
solution back to the general graph without losing too much in the guarantees
on the optimality of our solution.
To find the minimum balanced cut, we use the following procedure:

1. Construct a collection of trees T .

2. Virtually compute the edge costs as the weight of the induced cut in Ti :

    costTi (x, y) = Ci (x, y) = Σ_{e∈(Si(x,y),V \Si(x,y))} ce

3. Compute the optimal minimum balanced cut Xi on tree Ti using dynamic
programming.

4. Take the cut Xi∗ with minimal cost on graph G as the solution.
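Step 3 can be done with a standard subtree DP. Here is a minimal Python sketch (our own illustrative implementation, not from the notes) over states (vertex, side of the vertex, number of subtree vertices placed in S):

```python
import math

def min_balanced_cut_on_tree(n, tree_adj, cost):
    """Return the minimum total cost of tree edges cut by a set S with
    |S| = n // 2. State: table[side][k] = best cost with k vertices of the
    current subtree in S and its root on side `side` (1 = inside S)."""
    INF = math.inf

    def solve(v, parent):
        table = [[INF] * (n + 1), [INF] * (n + 1)]
        table[0][0] = 0.0  # v outside S
        table[1][1] = 0.0  # v inside S
        for c in tree_adj[v]:
            if c == parent:
                continue
            child = solve(c, v)
            new = [[INF] * (n + 1), [INF] * (n + 1)]
            for side in (0, 1):
                for k in range(n + 1):
                    if table[side][k] == INF:
                        continue
                    for cside in (0, 1):
                        # pay for edge {v, c} exactly when it crosses the cut
                        edge = cost[frozenset((v, c))] if side != cside else 0.0
                        for ck in range(n + 1 - k):
                            if child[cside][ck] == INF:
                                continue
                            cand = table[side][k] + edge + child[cside][ck]
                            if cand < new[side][k + ck]:
                                new[side][k + ck] = cand
            table = new
        return table

    t = solve(0, -1)
    half = n // 2
    return min(t[0][half], t[1][half])
```

On the path 0–1–2–3 with edge costs 5, 1, 5, the best balanced cut separates {0, 1} from {2, 3} and costs 1.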
Definition 7.9 (Cut preserving tree embedding). We call a collection of
trees T with Σ_i λi = 1, λi ≥ 0 a cut preserving tree embedding if and only
if, for every cut (S, V \ S),

(I) CutG (S, V \ S) ≤ CutTi (S, V \ S) for every tree Ti , and

(II) Σ_i λi CutTi (S, V \ S) ≤ O(log n) · CutG (S, V \ S).
We can interpret the λi ’s as probabilities of the different trees in the
collection. Then the cut value in any tree of a cut preserving tree embedding
is always at least the original cut value, and in expectation it is at most an
O(log n) factor greater.
Claim 7.10. The minimum balanced cut of a cut preserving tree embedding
yields an O(log(n)) approximation for the minimum balanced cut of the whole
graph.
Proof. Let (S ∗ , V \ S ∗ ) be an optimal cut for the minimum balanced cut
problem on the whole graph G. Furthermore, let (Xi , V \ Xi ) be the cut
computed for tree Ti of the embedding, and let (X ∗ , V \ X ∗ ) be the one with
minimum cost on G among them. Then

    CutG (X ∗ , V \ X ∗ ) = min_i CutG (Xi , V \ Xi )
    ≤ Σ_i λi CutG (Xi , V \ Xi )          (convexity)
    ≤ Σ_i λi CutTi (Xi , V \ Xi )         (Property I)
    ≤ Σ_i λi CutTi (S ∗ , V \ S ∗ )       (Xi optimal on Ti )
    ≤ O(log n) · CutG (S ∗ , V \ S ∗ ).    (Property II)
The collection of trees we used for the oblivious routing problem in the
previous section fulfills both properties of a cut preserving tree embedding.
We prove this in Lemmas 7.11 and 7.12.
Lemma 7.11. For every tree Ti in the collection of trees T and every S ⊆ V ,
it holds that

    CutG (S, V \ S) ≤ CutTi (S, V \ S).
Proof. The key insight to why the inequality holds is that each edge in the
original cut (on the left) gets counted at least once on the right-hand side.
Consider an arbitrary edge {u, v} over the cut (S, V \ S), with u ∈ S and
v ∈ V \ S. On the unique path from u to v in the tree Ti there exists at
least one edge xy ∈ Ti for which x ∈ S, y ∈ V \ S. By definition,
costTi (x, y) = Ci (x, y) = Σ_{e∈(Si(x,y),V \Si(x,y))} ce . Because x, y lie along
the tree path from u to v, we have u ∈ Si (x, y) and v ∈ V \ Si (x, y), and
therefore the edge uv is included in the sum Ci (x, y). Hence,

    CutG (S, V \ S) = Σ_{ uv∈(S,V \S) } cuv                              (def. CutG )
    ≤ Σ_{ xy∈Ti , xy∈(S,V \S) } Σ_{ e∈(Si(x,y),V \Si(x,y)) } ce          (above discussion)
    = Σ_{ xy∈Ti , xy∈(S,V \S) } costTi (x, y)                            (def. edge costs)
    = CutTi (S, V \ S).                                                  (def. CutTi )
Lemma 7.12. For every mixture of trees satisfying the updated edge condition
(7.2) with α = O(log n), Σ_i λi = 1, and λi ≥ 0, it holds that

    Σ_i λi CutTi (S, V \ S) ≤ O(log n) · CutG (S, V \ S).

Proof. Recall the edge condition (7.2) for α = O(log n):

    Σ_i λi Σ_{ xy∈Ti : uv∈Pi(x,y) } Ci (x, y) ≤ O(log n) · cuv .
The key insight to rewriting the summation is that each xy ∈ Ti going
across the cut (S, V \ S) must have at least one edge uv on its path Pi (x, y)
which also crosses the cut. If this were not the case, all vertices on the path,
including x and y, would belong to the same side of the cut, a contradiction.
This allows us to upper bound the edge weights in the tree Ti :

    Σ_{ xy∈Ti , xy∈(S,V \S) } Ci (x, y) ≤ Σ_{ uv∈(S,V \S) } Σ_{ xy∈Ti : uv∈Pi(x,y) } Ci (x, y).
Now we can apply the definitions and combine them with the two bounds
above to prove the claim:

    Σ_i λi CutTi (S, V \ S) = Σ_i λi Σ_{ xy∈Ti , xy∈(S,V \S) } Ci (x, y)   (def. CutTi )
    ≤ Σ_i λi Σ_{ uv∈(S,V \S) } Σ_{ xy∈Ti : uv∈Pi(x,y) } Ci (x, y)          (bound on tree weights)
    = Σ_{ uv∈(S,V \S) } Σ_i λi Σ_{ xy∈Ti : uv∈Pi(x,y) } Ci (x, y)
    ≤ Σ_{ uv∈(S,V \S) } O(log n) · cuv                                      (edge condition (7.2))
    = O(log n) · Σ_{ uv∈(S,V \S) } cuv
    = O(log n) · CutG (S, V \ S).                                           (def. CutG )
Chapter 8

Multiplicative Weights Update (MWU)

In this lecture, we discuss the Multiplicative Weights Update (MWU) method.
A comprehensive survey on MWU and its applications can be found in
[AHK12].
8.1 Learning from Experts

Definition 8.1 (The learning from experts problem). Every day, we are to
make a binary decision. At the end of the day, a binary outcome is revealed
and we incur a mistake if our decision did not match the outcome. Suppose
we have access to n experts e1 , . . . , en , each of which makes a recommendation
for the binary decision to take per day. How does one make use of the experts
to minimize the total number of mistakes on an online binary sequence?
Toy setting Consider a stock market with only a single stock. Every day,
we decide whether to buy the stock or not. At the end of the day, the stock
value will be revealed and we incur a mistake/loss of 1 if we did not buy
when the stock value rose, or bought when the stock value fell. Let σ be the
sequence of true outcomes. Furthermore, we denote the true outcome on day
j as σj .
Example — Why it is non-trivial Suppose n = 3 and σ = (1, 1, 0, 0, 1).
Days/σ   1 1 0 0 1
e1       1 1 0 0 1
e2       1 0 0 0 1
e3       1 1 1 1 0
In hindsight, e1 is always correct, so we would have incurred 0 mistakes if
we always followed e1 ’s recommendation. However, we do not know which
expert is perfect (assuming a perfect expert even exists). Furthermore, it is
not necessarily true that the best expert always incurs the least number of
mistakes on any prefix of the sequence σ. Ignoring e1 , one can check that e2
outperforms e3 on the example sequence. However, at the end of day 2, e3
had incurred 0 mistakes while e2 had incurred 1 mistake. The goal is as
follows: if a perfect expert exists, we hope to eventually converge to always
following him/her. If not, we hope to not do much worse than the best expert
on the entire sequence.
Warm up: Perfect expert exists As a warm up, suppose there exists a
perfect expert. Then the problem would be easy to solve: do the following
on each day:

• Make a decision by taking the majority vote of the remaining experts.

• If we incur a loss, remove the experts that were wrong.
Theorem 8.2. We incur at most log₂ n mistakes on any given sequence.

Proof. Whenever we incur a mistake, at least half the remaining experts were
wrong and were removed. Hence, the total number of experts is at least halved
whenever a mistake occurs. After at most log₂ n removals, the only expert
left will be the perfect expert, and we will always be correct thereafter.
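The warm-up algorithm can be written in a few lines of Python (our own sketch; the names and the convention that `predictions[e][day]` holds expert e's recommendation are our assumptions, with ties in the vote broken towards 1):

```python
def perfect_expert_halving(predictions, outcomes):
    """Follow the majority of surviving experts; after each of our mistakes,
    remove every expert that was wrong on that day."""
    alive = set(range(len(predictions)))
    mistakes = 0
    for day, truth in enumerate(outcomes):
        ones = sum(1 for e in alive if predictions[e][day] == 1)
        guess = 1 if 2 * ones >= len(alive) else 0
        if guess != truth:
            mistakes += 1
            alive = {e for e in alive if predictions[e][day] == truth}
    return mistakes
```

On the 3-expert example above (where e1 is perfect) this makes 0 mistakes, consistent with the log₂ n bound of Theorem 8.2.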
8.1.1 A deterministic MWU algorithm

Suppose that there may not be a perfect expert. The idea is similar, but we
update our trust in each expert instead of completely removing an expert
when he/she makes a mistake. Consider the following deterministic algorithm
(DMWU):

• Initialize weights wi = 1 for expert ei , for i ∈ {1, . . . , n}.

• On each day:

  – Make a decision based on the weighted majority.

  – If we incur a loss, set wi to (1 − ε) · wi for each wrong expert, for
    some constant ε ∈ (0, 1/2).
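DMWU can be sketched directly in Python (function and variable names are ours; ties in the weighted vote are broken towards 1):

```python
def dmwu(predictions, outcomes, eps=0.25):
    """Deterministic weighted majority: follow the weighted vote, and on days
    we err, multiply the weight of every wrong expert by (1 - eps)."""
    n = len(predictions)
    w = [1.0] * n
    mistakes = 0
    for day, truth in enumerate(outcomes):
        w_one = sum(w[i] for i in range(n) if predictions[i][day] == 1)
        guess = 1 if 2 * w_one >= sum(w) else 0
        if guess != truth:
            mistakes += 1
            w = [w[i] * (1 - eps) if predictions[i][day] != truth else w[i]
                 for i in range(n)]
    return mistakes
```

Theorem 8.3 below then bounds the mistake count of this procedure against the best expert in hindsight.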
Theorem 8.3. Suppose the best expert makes m∗ mistakes and DMWU
makes m mistakes. Then,

    m ≤ 2(1 + ε)m∗ + (2 log n)/ε.

Proof. Observe that when DMWU makes a mistake, the weighted majority
was wrong, and their weight decreases by a factor of (1 − ε). Suppose that
x = Σ_{i=1}^{n} wi at the start of the day. If we make a mistake, x drops to
at most (x/2)(1 − ε) + x/2 = x(1 − ε/2). That is, the overall weight reduces
by at least a factor of (1 − ε/2). Since the best expert e∗ makes m∗ mistakes,
his/her weight at the end is (1 − ε)^{m∗} , since the initial weight is 1. By the
above observation, the total weight of all experts at the end of the sequence
is at most n(1 − ε/2)^m , since we make m mistakes and the initial sum of
weights is n. Then we can bound m in terms of m∗ :

    (1 − ε)^{m∗} ≤ n(1 − ε/2)^m           (expert e∗ ’s weight is part of the overall weight)
    ⇒ m∗ log(1 − ε) ≤ log n + m log(1 − ε/2)      (taking log on both sides)
    ⇒ m∗ (−ε − ε²) ≤ log n + m(−ε/2)      (since −x − x² ≤ log(1 − x) ≤ −x for x ∈ (0, 1/2))
    ⇒ m ≤ 2(1 + ε)m∗ + (2 log n)/ε        (rearranging)
Remark 1 In the warm up toy example, m∗ = 0.
Remark 2 For x ∈ (0, 1/2), the inequality −x − x² ≤ log(1 − x) ≤ −x is due
to the Taylor expansion¹ of log. A more familiar equivalent form would be:
e^{−x−x²} ≤ 1 − x ≤ e^{−x} .
Theorem 8.4. No deterministic algorithm A can be better than 2-competitive.

Proof. Consider only two experts e0 and e1 , where e0 always outputs 0 and
e1 always outputs 1. Any binary sequence σ must contain at least |σ|/2 zeroes
or at least |σ|/2 ones, so m∗ ≤ |σ|/2. On the other hand, the adversary looks
at A and produces a sequence σ which forces A to incur a loss every day.
Thus, m = |σ| ≥ 2m∗ .
¹See https://en.wikipedia.org/wiki/Taylor_series#Natural_logarithm
8.1.2 A randomized MWU algorithm

The 2-factor in DMWU is due to the fact that DMWU deterministically takes
the (weighted) majority at each step. Let us instead interpret the weights as
probabilities. Consider the following randomized algorithm (RMWU):

• Initialize weights wi = 1 for expert ei, for i ∈ {1, . . . , n}.

• On each day:

	– Pick a random expert with probability proportional to their weight
	  (i.e. pick ei with probability wi / Σ_{j=1}^n wj).
	– Follow that expert's recommendation.
	– For each wrong expert ei, set wi to (1 − ε) · wi, for some constant
	  ε ∈ (0, 1/2).

Another way to think about the probabilities is to split all experts into two
groups A = {experts that output 0} and B = {experts that output 1}.
Then, decide '0' with probability wA/(wA + wB) and '1' with probability
wB/(wA + wB), where wA = Σ_{ei∈A} wi and wB = Σ_{ei∈B} wi are the sums of
weights in each group.
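The steps above can be simulated directly. The following Python sketch runs RMWU on a sequence of expert predictions; the 0/1 prediction-matrix input format and all names are our own choices for this illustration, not part of the notes.

```python
import random

def rmwu(predictions, outcomes, eps=0.1, rng=None):
    """Randomized multiplicative weights update (illustrative sketch).

    predictions[t][i] -- expert i's 0/1 prediction on day t (hypothetical format)
    outcomes[t]       -- the true outcome on day t
    Returns the number of mistakes made over the whole sequence.
    """
    rng = rng or random.Random(0)
    n = len(predictions[0])
    w = [1.0] * n                      # one weight per expert
    mistakes = 0
    for preds, truth in zip(predictions, outcomes):
        total = sum(w)
        # Pick an expert with probability proportional to its weight ...
        r, acc, choice = rng.random() * total, 0.0, n - 1
        for i in range(n):
            acc += w[i]
            if r <= acc:
                choice = i
                break
        # ... and follow its recommendation.
        if preds[choice] != truth:
            mistakes += 1
        # Penalize every wrong expert by a (1 - eps) factor.
        for i in range(n):
            if preds[i] != truth:
                w[i] *= (1 - eps)
    return mistakes
```

With one always-correct expert, the wrong experts' weights decay geometrically, so the expected number of mistakes stays bounded, matching Theorem 8.5 with m∗ = 0.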
Theorem 8.5. Suppose the best expert makes m∗ mistakes and RMWU
makes m mistakes. Then,

	E[m] ≤ (1 + ε)m∗ + (log n)/ε

Proof. Fix a day j ∈ {1, . . . , |σ|}. Let A = {experts that output 0 on day j}
and B = {experts that output 1 on day j}, and let wA = Σ_{ei∈A} wi and
wB = Σ_{ei∈B} wi be the sums of weights in each set. Let Fj be the weighted
fraction of wrong experts on day j and F̄j the weighted fraction of correct
experts on day j. If σj = 0, then Fj = wB/(wA + wB). If σj = 1, then
Fj = wA/(wA + wB). By definition of Fj, RMWU makes a mistake on day j
with probability Fj. By linearity of expectation, E[m] = Σ_{j=1}^{|σ|} Fj.

Since the best expert e∗ makes m∗ mistakes, his/her weight at the end is
(1 − ε)^{m∗}. On each day, RMWU reduces the overall weight by a factor of
(1 − ε · Fj) by penalizing wrong experts (new weights / old weights):

	((1 − ε)Fj + F̄j)(wA + wB) / (wA + wB) = (1 − ε)Fj + F̄j = Fj + F̄j − ε · Fj = 1 − ε · Fj

Hence, the total weight of all experts is n · Π_{j=1}^{|σ|} (1 − ε · Fj) at the end
of the sequence, because we have n experts, all initialized with weight one,
whose total weight shrinks by the factor above on each day. Then,

	(1 − ε)^{m∗} ≤ n · Π_{j=1}^{|σ|} (1 − ε · Fj)	Expert e∗'s weight is part of the overall weight
	⇒ (1 − ε)^{m∗} ≤ n · e^{Σ_{j=1}^{|σ|} (−ε·Fj)} = n · e^{−ε·E[m]}	Since (1 − x) ≤ e^{−x} and E[m] = Σ_{j=1}^{|σ|} Fj
	⇒ m∗ log(1 − ε) ≤ log n − ε · E[m]	Taking log on both sides
	⇒ E[m] ≤ −(log(1 − ε)/ε) m∗ + (log n)/ε	Rearranging
	⇒ E[m] ≤ (1 + ε)m∗ + (log n)/ε	Since −log(1 − x) ≤ −(−x − x²) = x + x²


Generalization

The above results can be generalized in a straightforward manner. Denote
the loss of expert i on day t as m_i^t ∈ [−ρ, ρ], for some constant ρ. When we
incur a loss, update the weights of affected experts from wi to (1 − ε · m_i^t/ρ) wi.
Note that m_i^t/ρ is essentially the normalized loss, which lies in [−1, 1].

Claim 8.6. [Without proof] With RMWU, we have

	Σ_t p^t · m^t ≤ min_i (Σ_t m_i^t + ε Σ_t |m_i^t| + (ρ log n)/ε)

where p^t is the weight distribution over the experts on day t, so that the
left-hand side is the expected total loss E[m].

Remark If each expert has a different ρi, one can modify the update rule
and claim to use ρi instead of a uniform ρ accordingly.

8.2 Approximating Covering/Packing LPs via MWU
In this section, we see how ideas from multiplicative weights updates, as
discussed above, give us a simple algorithm that computes approximate so-
lutions for covering/packing linear programs, e.g. set cover, bin packing.
We particularly discuss the case of covering LPs; the approach for packing
LPs is similar. In general, a covering LP can be formulated as follows:
110 CHAPTER 8. MULTIPLICATIVE WEIGHTS UPDATE (MWU)

Covering LP:

	minimize	Σ_{i=1}^n ci xi
	subject to	Σ_{i=1}^n aij xi ≥ bj	∀j ∈ {1, 2, . . . , m}
			xi ≥ 0	∀i ∈ {1, 2, . . . , n}

Here, we assume that all coefficients ci, aij, and bj are non-negative.

In the covering LP, xi can be viewed as the number of objects of type i that
is bought. The goal is to minimize the total cost of all bought objects such
that all covering constraints are satisfied. For set cover, we assumed all aij
and bj to be 1 (it is enough if any single set covers an element).

First, we turn the problem into a feasibility question by turning the objective
function into another constraint Σ_{i=1}^n ci xi = K. Using this modified
feasibility variant, we can solve the original covering LP via binary search
(find the correct K by having a binary search over the possible K values).
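The binary-search wrapper around the feasibility oracle can be sketched as follows; `feasible` stands for a hypothetical oracle for the feasibility variant, which is monotone (if budget K admits a solution, so does any larger K), and all names are illustrative.

```python
def minimize_via_feasibility(feasible, lo, hi, tol=1e-6):
    """Binary search for the smallest K with feasible(K) == True.

    `feasible` is assumed monotone in K. `lo` must be infeasible and
    `hi` feasible; the loop maintains this invariant.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid   # mid is enough budget; try smaller
        else:
            lo = mid   # mid is too small
    return hi
```

For example, with an oracle that is feasible exactly for K ≥ 3.7, the search converges to 3.7 up to the tolerance.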

Feasibility variant of a Covering LP:

Are there xi, for i ∈ [n], such that

	Σ_{i=1}^n ci xi = K
	subject to	Σ_{i=1}^n aij xi ≥ bj	∀j ∈ {1, 2, . . . , m}
			xi ≥ 0	∀i ∈ {1, 2, . . . , n}

Here, we assume that all coefficients ci, aij, and bj are non-negative.

We will now use MWU to find an approximate solution x = (x1, x2, . . . , xn)
such that (A) Σ_{i=1}^n ci xi = K, and (B) Σ_{i=1}^n aij xi ≥ bj − δ, for all j ∈
{1, 2, . . . , m}. Here, δ > 0 is a desirably small positive parameter. By setting
δ smaller and smaller, we get a better and better approximate solution.
However, the runtime also grows as it depends on δ. To find such an approximate
solution, we appeal to the multiplicative weight updates method as
follows: we think of each of the constraints Σ_{i=1}^n aij xi ≥ bj as one expert.

Thus, we have m experts. We start with a weight of wj = 1 for each expert


j ∈ {1, 2, . . . , m}.

Each Iteration of MWU In any iteration t, define a coefficient p_j^t for
each expert/constraint by normalizing the weights, i.e., setting p_j^t = wj / Σ_j wj.
Instead of asking for all the m constraints to be satisfied, we ask that a
linear mixture of them with these coefficients should be satisfied. That is, in
iteration t, we find x^t = (x_1^t, x_2^t, . . . , x_n^t) such that we have:

	Σ_{i=1}^n ci x_i^t = K,  and
	Σ_{j=1}^m p_j^t Σ_{i=1}^n aij x_i^t ≥ Σ_{j=1}^m p_j^t · bj

This is a much simpler problem with just two constraints. If the aforementioned
feasibility problem has a YES answer—i.e. if it is feasible—then the
same solution satisfies the above inequalities. We can easily find a solution
x^t = (x_1^t, x_2^t, . . . , x_n^t) that satisfies these two constraints (assuming the feasibility
of the original problem for objective value K). We can maximize the
left-hand side of the inequality, subject to the equality constraint, using a
greedy approach: we find the index i with the best value/cost ratio, i.e.,
the one which maximizes the ratio (Σ_j p_j^t aij)/ci. Intuitively, the numerator can
be seen as gain and the denominator as cost per unit. Then, we set variable
x_i^t = K/ci to satisfy the equality constraint, while we set all other variables
as x_{i′}^t = 0.
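The greedy step of one iteration can be sketched as follows; the function name and the dense list-of-lists input format are illustrative choices, not from the notes.

```python
def mwu_iteration(a, b, c, w, K):
    """One greedy iteration of the MWU covering-LP solver (sketch).

    a[j][i] -- constraint coefficients, b[j] -- right-hand sides,
    c[i]    -- objective coefficients, w[j] -- current expert weights,
    K       -- target objective value.
    Returns x^t with sum_i c[i]*x[i] = K, maximizing the weighted mixture.
    """
    m, n = len(b), len(c)
    total = sum(w)
    p = [wj / total for wj in w]                 # normalized weights p_j^t
    # Best "value per unit cost": maximize (sum_j p_j a_ji) / c_i.
    best = max(range(n),
               key=lambda i: sum(p[j] * a[j][i] for j in range(m)) / c[i])
    x = [0.0] * n
    x[best] = K / c[best]                        # spend the whole budget on one type
    return x
```

With one constraint x1 + 2·x2 ≥ 2 and unit costs, the greedy step puts the entire budget on the variable with the larger coefficient.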

Weights update Once we have found such a solution x^t = (x_1^t, x_2^t, . . . , x_n^t),
we have to see how it fits the original feasibility problem. If this x^t already
satisfies the original feasibility problem, we are done. More interestingly,
suppose that this is not the case; then we need to update the weights of constraints.
Intuitively, we want to increase the importance of the constraints
that are violated by this solution x^t and decrease the importance of those
that are satisfied. This should also take into account how strongly the constraint
is violated/satisfied. Concretely, for the j-th constraint, we define
m_j^t = (Σ_{i=1}^n aij x_i^t) − bj. Notice that m_j^t is positive if we fulfill the inequality
and negative if not. Then, we update the weight of this constraint as
wj ← wj (1 − ε · m_j^t/ρ). Here, ε is a small positive constant; we will discuss later
how to set it so that we get the aforementioned approximation with additive

δ slack in each constraint. Also, ρ is a normalization factor and we set

	ρ = max{1, max_j max_{x s.t. Σ_i ci xi = K} |(Σ_{i=1}^n aij xi) − bj|}

Note that ρ does not depend on the current iteration t and can be computed
using only the problem input.

Now that we have updated the weights, we can start the next iteration, updating
the p_j^t first and then using the same greedy approach again.

Final Output We run the above procedure for a number T of iterations;
the proper value of T will be discussed. At the end, we output the average
of all iterations as the final output, that is, we output

	x̄ = (Σ_{t=1}^T x^t) / T
Theorem 8.7. Suppose that the original feasibility problem is feasible. Then,
for any given parameter δ > 0, set ε = δ/(4ρ) and run the above procedure for
T = (10ρ² log m)/δ² iterations. The final output x̄ = (Σ_{t=1}^T x^t)/T satisfies each
constraint up to additive δ error, besides fully satisfying the equality constraint
Σ_{i=1}^n ci x̄i = K. That is, for each constraint j ∈ {1, 2, . . . , m}, we have

	Σ_{i=1}^n aij x̄i ≥ bj − δ

Proof. We make use of Claim 8.6, which gives us:

	Σ_t p^t m^t ≤ min_j (Σ_t m_j^t + ε Σ_t |m_j^t| + (ρ log m)/ε)	(8.1)

Here, we have p^t = (p_1^t, p_2^t, . . . , p_m^t) and m^t = (m_1^t, m_2^t, . . . , m_m^t).

Notice that in iteration t, we chose x^t = (x_1^t, x_2^t, . . . , x_n^t) such that we have
Σ_j p_j^t (Σ_{i=1}^n aij x_i^t) ≥ Σ_j p_j^t · bj. By rearranging this expression we get:

	Σ_j p_j^t ((Σ_{i=1}^n aij x_i^t) − bj) = p^t m^t ≥ 0

Hence, we conclude that the left-hand side of (8.1) is non-negative, i.e., Σ_t p^t m^t ≥ 0.
Therefore, so is the right-hand side. In particular, for any j, we have

	0 ≤ Σ_t m_j^t + ε Σ_t |m_j^t| + (ρ log m)/ε

From this we get:

	0 ≤ Σ_t m_j^t + ε Σ_t |m_j^t| + (ρ log m)/ε
	⇒ 0 ≤ Σ_t m_j^t + εTρ + (ρ log m)/ε	Since ρ ≥ |m_j^t| for all j and t
	⇒ 0 ≤ Σ_t ((Σ_{i=1}^n aij x_i^t) − bj) + εTρ + (ρ log m)/ε	By definition of m_j^t
	⇒ 0 ≤ Σ_t ((Σ_{i=1}^n aij x_i^t) − bj)/T + ερ + (ρ log m)/(ε · T)	Divide by T

Now, since ε = δ/(4ρ) and T = (10ρ² log m)/δ², we see that:

	0 ≤ Σ_t ((Σ_{i=1}^n aij x_i^t) − bj)/T + δ/4 + (4ρ² log m)/(δ · T)	Plug in ε
	⇒ 0 ≤ Σ_t ((Σ_{i=1}^n aij x_i^t) − bj)/T + δ/4 + 2δ/5	Plug in T
	⇒ 0 ≤ Σ_t ((Σ_{i=1}^n aij x_i^t) − bj)/T + δ	Round up to δ
	⇒ Σ_t bj/T − δ ≤ Σ_t (Σ_{i=1}^n aij x_i^t)/T	Split up the sum and move it to the LHS
	⇒ bj − δ ≤ Σ_{i=1}^n aij x̄i	LHS: the sum is independent of t; RHS: definition of x̄

That is, the output x̄ = (x̄1, x̄2, . . . , x̄n) satisfies the j-th inequality up to
additive error δ. This is proved for all constraints.

Note that if the LP is not feasible, this will be detected during one of the
iterations of the algorithm, when no greedy solution can be found.

8.3 Constructive Oblivious Routing via MWU

In this section, we use the multiplicative weight updates method to provide
an efficient algorithm that constructs the oblivious routing scheme that we
discussed and showed to exist in the previous section.

Problem Recap Concretely, the algorithm finds a collection of trees Ti,
each with a coefficient λi ≥ 0, such that we have Σ_i λi ≥ 1 and still, the
following edge congestion condition is satisfied for each edge {u, v} of the
graph G:

	Σ_i λi · Σ_{{x,y}∈Ti s.t. {u,v}∈Pi(x,y)} Ci(x, y) ≤ O(log n) · cuv

Recall that in the above, for each tree Ti and each tree edge {x, y} ∈ Ti,
we use Pi(x, y) to denote the path that corresponds to the virtual edge {x, y}
and connects x and y in G. Moreover, we have

	Ci(x, y) = Σ_{e∈cut(S(x,y), V\S(x,y))} ce

where S(x, y) is one side of the cut that results from removing the edge {x, y}
from the tree Ti. For convenience, let us define

	loadi(u, v) = (Σ_{{x,y}∈Ti s.t. {u,v}∈Pi(x,y)} Ci(x, y)) / cuv

as the relative load that the i-th tree in the collection places on the edge {u, v}.
Thus, our task is to find a collection so that

	Σ_i λi · loadi(u, v) ≤ O(log n)
Construction Plan We start with an empty collection and add trees to the
collection. In iteration j, we add a new tree Tj with a coefficient λj ∈ (0, 1].
The construction ends once we find jend such that Σ_{i=1}^{jend} λi ≥ 1. During the
construction, we think of each constraint Σ_i λi · loadi(u, v) ≤ O(log n) as
one of our experts. We have one constraint for each edge {u, v} ∈ G. To

track the performance of each of the constraints (experts) in the course of
this construction, we use a potential function defined as follows:

	Φj = Σ_{{u,v}∈G} exp(Σ_{i=1}^j λi · loadi(u, v))

The initial potential is equal to the number of the edges of the graph. Thus,
Φ0 < n². When we add the tree Tj with coefficient λj in iteration j of the
construction, we want:

	Φj/Φj−1 = (Σ_{{u,v}∈G} exp(Σ_{i=1}^j λi · loadi(u, v))) / (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v))) ≤ exp(λj · O(log n))

This ensures Φjend ≤ n² · exp(Σ_i λi · O(log n)). Note that at this point we
have 1 ≤ Σ_i λi ≤ 2 and thus Φjend ≤ n² · exp(O(log n)) = exp(O(log n)).
Monotonicity of the logarithm then implies for each edge {u, v} ∈ G:

	Σ_{i=1}^{jend} λi · loadi(u, v) ≤ O(log n)

It remains to show how we find a tree in each iteration and to bound the
number of iterations.

One Iteration, Finding One Tree We now focus on iteration j and we
explain how we find a tree Tj with coefficient λj such that the addition of
this tree to the collection ensures Φj/Φj−1 ≤ exp(λj · O(log n)).

Let us first examine the potential change factor Φj/Φj−1 and bound it in a
more convenient manner:

	Φj/Φj−1 = (Σ_{{u,v}∈G} exp(Σ_{i=1}^j λi · loadi(u, v))) / (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)))
	= 1 + (Σ_{{u,v}∈G} [exp(Σ_{i=1}^j λi · loadi(u, v)) − exp(Σ_{i=1}^{j−1} λi · loadi(u, v))]) / (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)))
	= 1 + (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)) · [exp(λj · loadj(u, v)) − 1]) / (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)))

We choose λj small enough such that for each edge {u, v} ∈ G, we have
λj · loadj(u, v) ≤ 1. Observing this, we can make use of the inequality
e^z ≤ 1 + 2z, for all z ∈ [0, 1]. We can now upper bound

	Φj/Φj−1 ≤ 1 + (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)) · 2λj · loadj(u, v)) / (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)))

We next show that we can find a tree Tj such that

	(Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)) · 2 · loadj(u, v)) / (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v))) ≤ O(log n)

This then gives Φj/Φj−1 ≤ exp(λj · O(log n)), with the use of the inequality
1 + y ≤ e^y.

In order to find such a tree, we define the length of an edge {u, v} by

	ℓ(u, v) = exp(Σ_{i=1}^{j−1} λi · loadi(u, v)) / (cuv · Σ_{{u′,v′}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u′, v′)))

After inserting the definition of the relative load of an edge, the condition on
tree Tj from above translates to:

	Σ_{{u,v}∈G} ℓ(u, v) · (Σ_{{x,y}∈Tj s.t. {u,v}∈Pj(x,y)} Cj(x, y)) ≤ O(log n)

Recall the probabilistic distance-preserving tree embedding discussed in
Chapter 5. Let T be the corresponding collection of trees. For an edge {x, y}
of some tree of the collection T, we define the corresponding path P(x, y) as
a shortest path on the graph G. Then, we can bound the left-hand side of
the above inequality, in expectation, as follows:

	E_{T∈T}[Σ_{{u,v}∈G} ℓ(u, v) · Σ_{{x,y}∈T s.t. {u,v}∈P(x,y)} C(x, y)]
	= E_{T∈T}[Σ_{{u,v}∈G} ℓ(u, v) · Σ_{{x,y}∈T s.t. {u,v}∈P(x,y)} Σ_{e∈cut(S(x,y), V\S(x,y))} ce]
	= E_{T∈T}[Σ_e ce · Σ_{{x,y}∈T s.t. e∈cut(S(x,y), V\S(x,y))} Σ_{{u,v}∈P(x,y)} ℓ(u, v)]
	= E_{T∈T}[Σ_e ce · distT(e)] = Σ_e ce · E_{T∈T}[distT(e)]
	≤ Σ_e ce · O(log n) · ℓ(e) = O(log n) · Σ_e ce · ℓ(e)
	= O(log n) · Σ_e ce · exp(Σ_{i=1}^{j−1} λi · loadi(e)) / (ce · Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)))
	= O(log n) · Σ_e exp(Σ_{i=1}^{j−1} λi · loadi(e)) / (Σ_{{u,v}∈G} exp(Σ_{i=1}^{j−1} λi · loadi(u, v)))
	= O(log n)
The equations above show that a tree sampled from the distribution satisfies
the inequality in expectation. Hence, by Markov's inequality, after increasing
the O(log n) bound on the right-hand side by a constant factor, the inequality
is satisfied with probability at least 1/2. Therefore, if we run the algorithm
for O(log n) independent repetitions, with high probability, we find one tree
that satisfies the inequality. This is the tree Tj that we add to our collection.
Moreover, we set λj such that max_{{u,v}∈G} λj · loadj(u, v) = 1.

Bounding the Number of Iterations We now show that the construction
uses only O(m log n) iterations. In each iteration j, we set λj for the
new tree so that for at least one edge {u, v}, we have

	λj · loadj(u, v) = 1

As argued above, at the end all edges {u, v} satisfy:

	Σ_{i=1}^{jend} λi · loadi(u, v) ≤ O(log n)

Each edge can therefore be the saturated edge in at most O(log n) iterations,
so we have at most O(m log n) iterations in total.



8.4 Other Applications: Online routing of virtual circuits

Definition 8.8 (The online routing of virtual circuits problem). Consider
a graph G = (V, E) where each edge e ∈ E has a capacity ue. A request is
denoted by a triple ⟨s(i), t(i), d(i)⟩, where s(i) ∈ V is the source, t(i) ∈ V is
the target, and d(i) > 0 is the demand of the i-th request, respectively. Given
the i-th request, we have to build a connection (a single path Pi) from s(i) to
t(i) with flow d(i). The objective is to minimize the maximum congestion over
all edges as we handle requests in an online manner. To be precise, we wish
to minimize max_{e∈E} Σ_{i≤|σ|: e∈Pi} d(i)/ue on the input sequence σ, where the
sum ranges over requests i whose path Pi includes the edge e.

Remark This is similar to the multi-commodity routing problem in chap-


ter 4. However, in this problem, each commodity flow cannot be split into
multiple paths, and the commodities appear in an online fashion.

Example Consider the following graph G = (V, E) with 5 vertices and 5
edges, with the edge capacities ue annotated for each edge e ∈ E. Suppose
there are 2 requests: σ = (⟨v1, v4, 5⟩, ⟨v5, v2, 8⟩).

[Figure: the graph G, with edges {v1, v3}, {v3, v4}, {v4, v5}, {v3, v5}, {v2, v3}
of capacities 13, 10, 21, 8, 11, respectively.]

Upon seeing σ(1) = ⟨v1, v4, 5⟩, in an online algorithm (red edges) we commit
to P1 = v1–v3–v4 as it minimizes the congestion to 5/10. When σ(2) =
⟨v5, v2, 8⟩ appears, P2 = v5–v3–v2 minimizes the congestion given that we
committed to P1. This causes the congestion to be 8/8 = 1. On the other
hand, the optimal offline algorithm (blue edges) can attain a congestion of
8/10 via P1 = v1–v3–v5–v4 and P2 = v5–v4–v3–v2.

[Figure: edge loads after each step — after P1 (online), after P2 (online),
and for the offline routing.]



To facilitate further discussion, we define the following notation:

• pe(i) = d(i)/ue is the demand of the i-th request with respect to the capacity
of edge e.

• le(j) = Σ_{i≤j: e∈Pi} pe(i) is the relative load of edge e after request j.

• le∗(j) is the optimal offline algorithm's relative load of edge e after request j.

In other words, the objective is to minimize max_{e∈E} le(|σ|) for a given sequence
σ. Denoting Λ as the (unknown) optimal congestion factor, we normalize
p̃e(i) = pe(i)/Λ, l̃e(j) = le(j)/Λ, and l̃e∗(j) = le∗(j)/Λ. Let a be a constant to be
determined. Consider the algorithm A, which does the following on request
i + 1:
• Denote the cost of edge e by ce = a^{l̃e(i)+p̃e(i+1)} − a^{l̃e(i)}.

• Return a shortest (smallest total cost) s(i + 1)–t(i + 1) path Pi+1 on G with
edge weights ce.
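The two steps above can be sketched in Python; the dictionary-based input format, the `key` helper for undirected edges, and all names are illustrative choices for this sketch, not part of the notes.

```python
import heapq

def route_request(adj, load, a, s, t, p_new):
    """Pick a path for one request, as algorithm A does (illustrative sketch).

    adj    -- adjacency dict: adj[u] = list of neighbours v
    load   -- dict mapping edge (u, v), u < v, to its normalized load
    a      -- the base, 1 + 1/(2*gamma)
    p_new  -- dict mapping each edge to the new request's normalized demand
    Returns a minimum-cost s-t path w.r.t. c_e = a^(l + p) - a^l.
    """
    def key(u, v):
        return (u, v) if u < v else (v, u)

    def cost(u, v):
        e = key(u, v)
        return a ** (load[e] + p_new[e]) - a ** load[e]

    # Standard Dijkstra over the edge costs c_e.
    dist, prev = {s: 0.0}, {}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + cost(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, u = [t], t                  # reconstruct the path backwards
    while u != s:
        u = prev[u]
        path.append(u)
    return path[::-1]
```

Dijkstra applies because the costs c_e are non-negative (a > 1 and loads only grow).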
Finding the shortest path with respect to the cost function ce tries to minimize
the load impact of the new (i + 1)-th request. To analyze A, we consider
the following potential function: Φ(j) = Σ_{e∈E} a^{l̃e(j)} (γ − l̃e∗(j)), for some constant
γ ≥ 2. Because of normalization, l̃e∗(j) ≤ 1, so γ − l̃e∗(j) ≥ 1. Initially,
we have Φ(0) = Σ_{e∈E} γ = mγ.

Lemma 8.9. For γ ≥ 1 and 0 ≤ x ≤ 1, we have (1 + 1/(2γ))^x < 1 + x/γ.

Proof. By the Taylor series², (1 + 1/(2γ))^x = 1 + x/(2γ) + O((x/(2γ))²) < 1 + x/γ.

Lemma 8.10. For a = 1 + 1/(2γ) and γ ≥ 1, we have Φ(j + 1) − Φ(j) ≤ 0.

Proof. Let Pj+1 be the path that algorithm A found and let P∗j+1 be the
path that the optimal offline algorithm assigned to the (j + 1)-th request
⟨s(j + 1), t(j + 1), d(j + 1)⟩. For any edge e, observe the following:

• If e ∉ P∗j+1, the load on e caused by the optimal offline algorithm
remains unchanged, that is, l̃e∗(j + 1) = l̃e∗(j). On the other hand, if
e ∈ P∗j+1, then l̃e∗(j + 1) = l̃e∗(j) + p̃e(j + 1).

• Similarly, if e ∉ Pj+1, then l̃e(j + 1) = l̃e(j); if e ∈ Pj+1, then
l̃e(j + 1) = l̃e(j) + p̃e(j + 1).

²See https://en.wikipedia.org/wiki/Taylor_series#Binomial_series

• Thus, if e is neither in Pj+1 nor in P∗j+1, then a^{l̃e(j+1)}(γ − l̃e∗(j + 1)) =
a^{l̃e(j)}(γ − l̃e∗(j)). So only edges used by Pj+1 or P∗j+1 affect Φ(j + 1) − Φ(j).

Using the observations above together with Lemma 8.9 and the fact that A
computes a shortest path, one can show that Φ(j + 1) − Φ(j) ≤ 0. In detail:

	Φ(j + 1) − Φ(j)
	= Σ_{e∈E} [a^{l̃e(j+1)}(γ − l̃e∗(j + 1)) − a^{l̃e(j)}(γ − l̃e∗(j))]
	= Σ_{e∈Pj+1\P∗j+1} (a^{l̃e(j+1)} − a^{l̃e(j)})(γ − l̃e∗(j))
	  + Σ_{e∈P∗j+1} [a^{l̃e(j+1)}(γ − l̃e∗(j) − p̃e(j + 1)) − a^{l̃e(j)}(γ − l̃e∗(j))]	(1)
	= Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)})(γ − l̃e∗(j)) − Σ_{e∈P∗j+1} a^{l̃e(j+1)} p̃e(j + 1)
	≤ Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)})γ − Σ_{e∈P∗j+1} a^{l̃e(j+1)} p̃e(j + 1)	(2)
	≤ Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)})γ − Σ_{e∈P∗j+1} a^{l̃e(j)} p̃e(j + 1)	(3)
	= Σ_{e∈Pj+1} (a^{l̃e(j)+p̃e(j+1)} − a^{l̃e(j)})γ − Σ_{e∈P∗j+1} a^{l̃e(j)} p̃e(j + 1)	(4)
	= Σ_{e∈Pj+1} ce γ − Σ_{e∈P∗j+1} a^{l̃e(j)} p̃e(j + 1)	(5)
	≤ Σ_{e∈P∗j+1} (ce γ − a^{l̃e(j)} p̃e(j + 1))	(6)
	= Σ_{e∈P∗j+1} ((a^{l̃e(j)+p̃e(j+1)} − a^{l̃e(j)})γ − a^{l̃e(j)} p̃e(j + 1))
	= Σ_{e∈P∗j+1} a^{l̃e(j)} ((a^{p̃e(j+1)} − 1)γ − p̃e(j + 1))
	= Σ_{e∈P∗j+1} a^{l̃e(j)} (((1 + 1/(2γ))^{p̃e(j+1)} − 1)γ − p̃e(j + 1))	(7)
	≤ 0	(8)

(1) From the observations above

(2) l̃e∗(j) ≥ 0

(3) l̃e(j + 1) ≥ l̃e(j) and a > 1

(4) For e ∈ Pj+1, l̃e(j + 1) = l̃e(j) + p̃e(j + 1)

(5) Since ce = a^{l̃e(j)+p̃e(j+1)} − a^{l̃e(j)}

(6) Since Pj+1 is a shortest path with respect to ce, so Σ_{e∈Pj+1} ce ≤ Σ_{e∈P∗j+1} ce

(7) Since a = 1 + 1/(2γ)

(8) Lemma 8.9 with 0 ≤ p̃e(j + 1) ≤ 1

Theorem 8.11. Let L = max_{e∈E} l̃e(|σ|) be the maximum normalized load at
the end of the input sequence σ. For γ ≥ 2 and a = 1 + 1/(2γ), we have
L ∈ O(log n). Hence, A is O(log n)-competitive.

Proof. Since Φ(0) = mγ and Φ(j + 1) − Φ(j) ≤ 0, we see that Φ(j) ≤ mγ, for
all j ∈ {1, . . . , |σ|}. Consider the edge e with the highest congestion. Since
γ − l̃e∗(j) ≥ 1, we see that

	(1 + 1/(2γ))^L ≤ a^L · (γ − l̃e∗(j)) ≤ Φ(j) ≤ mγ ≤ n²γ

Taking the logarithm on both sides and rearranging, we get:

	L ≤ (2 log(n) + log(γ)) · 1/log(1 + 1/(2γ)) ∈ O(log n)

Handling unknown Λ Since Λ is unknown but is needed for the run of A
(to compute ce when a request arrives), we use a dynamically estimated Λ̃.
Let β be a constant such that A is β-competitive according to Theorem 8.11.
The following modification to A is 4β-competitive: on the first request,
we can explicitly compute Λ̃ = Λ. Whenever the actual congestion exceeds
Λ̃β, we reset³ the edge loads to 0, update our estimate to 2Λ̃, and start a
new phase. Thus, we have:

• By the updating procedure, Λ̃ ≤ 2βΛ in all phases.

• Let T be the total number of phases. In any phase i ≤ T, the congestion
incurred by the end of phase i is at most 2βΛ/2^{T−i}. Across all phases, we
have Σ_{i=1}^T 2βΛ/2^{T−i} ≤ 4βΛ.

³Existing paths are preserved, just that we ignore them in the subsequent computations
of ce.
Part III

Streaming and Sketching Algorithms
Chapter 9

Basics and Warm Up with Majority Element

Thus far, we have been ensuring that our algorithms run fast. What if our
system does not have sufficient memory to store all data for post-processing?
For example, a router has a relatively small amount of memory while a tremendous
amount of routing data flows through it. In a memory-constrained setting,
can one compute something meaningful, possibly approximately, with a
limited amount of memory?

More formally, we now look at a slightly different class of algorithms
where data elements from [n] = {1, . . . , n} arrive one at a time, in a stream
S = a1, . . . , am, where ai ∈ [n] arrives in the i-th time step. At each step, our
algorithm performs some computation¹ and discards the item ai. At the end
of the stream², the algorithm should give us a value that approximates some
value of interest.

9.1 Typical tricks

Before we begin, let us first describe two typical tricks used to amplify success
probabilities of randomized algorithms. Suppose we have a randomized
algorithm A that returns an unbiased estimate of a quantity of interest X
on a problem instance I, with success probability p > 0.5.

Trick 1: Reduce variance Run j independent copies of A on I, and return
the mean (1/j) Σ_{i=1}^j A(I). The expected outcome E[(1/j) Σ_{i=1}^j A(I)] will still
be X, while the variance drops by a factor of j.
¹Usually this is constant time, so we ignore the runtime.
²In general, the length of the stream, m, may not be known.


Trick 2: Improve success Run k independent copies of A on I, and return
the median. As each copy of A succeeds (independently) with
probability p > 0.5, the probability that more than half of them fail
(and hence the median fails) drops exponentially with respect to k.

Let ε > 0 and δ > 0 denote the precision factor and failure probability,
respectively. Robust combines the above-mentioned two tricks to yield a
(1 ± ε)-approximation to X that succeeds with probability > 1 − δ.

Algorithm 19 Robust(A, I, ε, δ)
	C ← ∅	. Initialize candidate outputs
	for k = O(log(1/δ)) times do
		sum ← 0
		for j = O(1/ε²) times do
			sum ← sum + A(I)
		end for
		Add sum/j to candidates C	. Include new sample of the mean
	end for
	return Median of C	. Return median
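A minimal Python version of Robust follows; the concrete constants stand in for the O(1/ε²) and O(log(1/δ)) choices and are an assumption of this sketch.

```python
import math
import random
import statistics

def robust(estimator, eps, delta):
    """Median-of-means wrapper around an unbiased randomized estimator.

    `estimator()` returns one unbiased sample of the quantity of interest.
    The constants 4 and 8 below are illustrative, not from the notes.
    """
    k = max(1, math.ceil(8 * math.log(1 / delta)))   # number of candidate means
    j = max(1, math.ceil(4 / eps ** 2))              # samples averaged per candidate
    candidates = [sum(estimator() for _ in range(j)) / j for _ in range(k)]
    return statistics.median(candidates)
```

Averaging drives the variance down; taking the median over independent candidates drives the failure probability down exponentially in k.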

9.2 Majority element

Definition 9.1 ("Majority in a stream" problem). Given a stream S =
{a1, . . . , am} of items from [n] = {1, . . . , n}, with an element j ∈ [n] that
appears strictly more than m/2 times in S, find j.

Example Consider a stream S = {1, 3, 3, 7, 5, 3, 2, 3}. The table below
shows how guess and count are updated as each element arrives.

	Stream elements	1	3	3	7	5	3	2	3
	Guess		1	3	3	3	5	3	2	3
	Count		1	1	2	1	1	1	1	1

One can verify that MajorityStream uses O(log n + log m) bits to
store guess and count.

Claim 9.2. MajorityStream correctly finds the element j ∈ [n] which appears
> m/2 times in S = {a1, . . . , am}.

Proof. (Sketch) Match each other element in S with a distinct instance of j.
Since j appears > m/2 times, at least one j is unmatched. As each matching
cancels out count, only j could be the final guess.

Algorithm 20 MajorityStream(S = {a1, . . . , am})
	guess ← 0
	count ← 0
	for ai ∈ S do	. Items arrive in streaming fashion
		if ai = guess then
			count ← count + 1
		else if count > 1 then
			count ← count − 1
		else
			guess ← ai
			count ← 1
		end if
	end for
	return guess

Remark If no element appears > m/2 times, then MajorityStream is
not guaranteed to return the most frequent element. For example, for S =
{1, 3, 4, 3, 2}, MajorityStream(S) returns 2 instead of 3.
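In code, MajorityStream is often written in the equivalent form below, where count is allowed to reach 0 before the guess is replaced; this is the classical Boyer–Moore majority vote, and it uses only O(log n + log m) bits just like Algorithm 20.

```python
def majority_stream(stream):
    """Streaming majority vote (Boyer-Moore variant of Algorithm 20).

    Returns the majority element if one appears strictly more than
    len(stream)/2 times; the answer is meaningless otherwise.
    """
    guess, count = None, 0
    for a in stream:
        if count == 0:          # no surviving candidate: adopt a
            guess, count = a, 1
        elif a == guess:
            count += 1
        else:
            count -= 1          # a cancels one occurrence of guess
    return guess
```

As with Algorithm 20, the output is only guaranteed when a strict majority exists.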
Chapter 10

Estimating the moments of a stream

One class of interesting problems is computing moments of a given stream S.
For items j ∈ [n], define fj as the number of times j appears in a stream S.
Then, the k-th moment of a stream S is defined as Σ_{j=1}^n (fj)^k. When k = 1,
the first moment Σ_{j=1}^n fj = m is simply the number of elements in the stream
S. When k = 0, by associating 0⁰ = 0, the zeroth moment Σ_{j=1}^n (fj)⁰ is the
number of distinct elements in the stream S.

10.1 Estimating the first moment of a stream

A trivial exact solution would be to use O(log m) bits to maintain a counter,
incrementing it for each element observed. For some upper bound M, consider
the sequence (1 + ε), (1 + ε)², . . . , (1 + ε)^{log_{1+ε} M}. For any stream length m,
there exists i ∈ N such that (1 + ε)^i ≤ m ≤ (1 + ε)^{i+1}. Thus, to obtain
a (1 + ε)-approximation, it suffices to track the exponent i to estimate the
length m. For ε ∈ Θ(1), this can be done in O(log log m) bits.

Algorithm 21 Morris(S = {a1 , . . . , am })


x←0
for ai ∈ S do . Items arrive in streaming fashion
r ← Random probability from [0, 1]
if r ≤ 2−x then . If not, x is unchanged.
x←x+1
end if
end for
return 2x − 1 . Estimate m by 2x − 1
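A direct simulation of Morris can be sketched as follows; pairing it with the averaging trick from Section 9.1 (as in the test below) recovers an estimate close to m.

```python
import random

def morris(m, rng):
    """One run of the Morris approximate counter over a stream of length m.

    Only the counter x is stored; the items themselves are irrelevant,
    so the stream is just simulated as m arrivals.
    """
    x = 0
    for _ in range(m):
        if rng.random() <= 2.0 ** (-x):   # increment with probability 2^-x
            x += 1
    return 2 ** x - 1                     # unbiased estimate of m
```

A single run has variance up to m²/2, so in practice one averages many independent runs.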


The intuition behind Morris [Mor78] is to increase the counter (and


hence double the estimate) when we expect to observe 2x new items. For
analysis, denote Xm as the value of counter x after exactly m items arrive.

Theorem 10.1. E[2^{Xm} − 1] = m. That is, Morris is an unbiased estimator
for the length of the stream.

Proof. Equivalently, let us prove E[2^{Xm}] = m + 1, by induction on m ∈ N⁺.
On the first element (m = 1), x increments with probability 1, so E[2^{X1}] =
2¹ = m + 1. Suppose the claim holds for some m ∈ N; then

	E[2^{Xm+1}] = Σ_{j=1}^m E[2^{Xm+1} | Xm = j] · Pr[Xm = j]	Condition on Xm
	= Σ_{j=1}^m (2^{j+1} · 2^{−j} + 2^j · (1 − 2^{−j})) · Pr[Xm = j]	Increment x w.p. 2^{−j}
	= Σ_{j=1}^m (2^j + 1) · Pr[Xm = j]	Simplifying
	= Σ_{j=1}^m 2^j · Pr[Xm = j] + Σ_{j=1}^m Pr[Xm = j]	Splitting the sum
	= E[2^{Xm}] + Σ_{j=1}^m Pr[Xm = j]	Definition of E[2^{Xm}]
	= E[2^{Xm}] + 1	Σ_{j=1}^m Pr[Xm = j] = 1
	= (m + 1) + 1	Induction hypothesis
	= m + 2

Note that we sum up to m because x ∈ [1, m] after m items.

Claim 10.2. E[2^{2Xm}] = (3/2)m² + (3/2)m + 1

Proof. Exercise.

Claim 10.3. Var(2^{Xm} − 1) = E[(2^{Xm} − 1 − m)²] ≤ m²/2

Proof. Exercise. Use Claim 10.2.

Theorem 10.4. For ε > 0, Pr[|(2^{Xm} − 1) − m| > εm] ≤ 1/(2ε²)

Proof.

	Pr[|(2^{Xm} − 1) − m| > εm] ≤ Var(2^{Xm} − 1)/(εm)²	Chebyshev's inequality
	≤ (m²/2)/(ε²m²)	By Claim 10.3
	= 1/(2ε²)

Remark Using the discussion in Section 9.1, we can run Morris multiple
times to obtain a (1 ± ε)-approximation of the first moment of a stream that
succeeds with probability > 1 − δ. For instance, repeating Morris 10/ε² times
and reporting the mean m̂, we get Pr[|m̂ − m| > εm] ≤ 1/20 because the variance
is reduced by a factor of 10/ε².

10.2 Estimating the zeroth moment of a stream

Trivial exact solutions could either use O(n) bits to track whether each element
exists, or use O(m log n) bits to remember the whole stream. Suppose there are D
distinct items in the whole stream. In this section, we show that one can in
fact make do with only O(log n) bits to obtain an approximation of D.

10.2.1 An idealized algorithm

Consider the following algorithm sketch:

1. Take a uniformly random hash function h : [n] → [0, 1]

2. As items ai ∈ S arrive, track z = min{h(ai)}

3. In the end, output 1/z − 1

Since we are randomly hashing elements into the range [0, 1], we expect
the minimum hash output¹ to be 1/(D + 1), so E[1/z − 1] = D. Unfortunately,
storing a uniformly random hash function that maps to the interval [0, 1] is
infeasible. As storing real numbers is memory intensive, one possible fix is to
discretize the interval [0, 1], using O(log n) bits per hash output. However,
storing this hash function would still require O(n log n) space.

¹See https://en.wikipedia.org/wiki/Order_statistic

10.2.2 An actual algorithm

Instead of a uniformly random hash function, we select a random hash from
a family of pairwise independent hash functions.

Definition 10.5 (Family of pairwise independent hash functions). Hn,m is
a family of pairwise independent hash functions if

• (Hash definition): ∀h ∈ Hn,m, h : {1, . . . , n} → {1, . . . , m}

• (Uniform hashing): ∀x ∈ {1, . . . , n}, Pr_{h∈Hn,m}[h(x) = i] = 1/m

• (Pairwise independence): ∀x, y ∈ {1, . . . , n} with x ≠ y,
Pr_{h∈Hn,m}[h(x) = i ∧ h(y) = j] = 1/m²

Remark For now, we care only about m = n, and write Hn,n as Hn.

Claim 10.6. Let n be a prime number. Then,

	Hn = {ha,b : ha,b(x) = ax + b mod n, ∀a, b ∈ Zn}

is a family of pairwise independent hash functions.

Proof. (Sketch) For any given x ≠ y,

• There is a unique value of h(x) mod n, out of n possibilities.

• The system {ax + b = i mod n, ay + b = j mod n} has a unique
solution for (a, b) (note that the matrix [[x, 1], [y, 1]] ∈ Zn^{2×2} is non-singular
when x ≠ y), out of n² possibilities.

Remark If n is not a prime, we know there exists a prime p such that
n ≤ p ≤ 2n, so we round n up to p. Storing a random hash from Hn then
amounts to storing the numbers a and b in O(log n) bits.
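Drawing and storing such a hash is cheap, as the following sketch shows (here `p` is assumed prime, as in the claim):

```python
import random

def random_hash(p, rng):
    """Draw h(x) = (a*x + b) mod p uniformly from the family H_p."""
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda x: (a * x + b) % p
```

Uniformity and pairwise independence can be checked exhaustively for a small prime: for a fixed x, each output value is hit by exactly p of the p² pairs (a, b), and for x ≠ y each output pair is hit by exactly one pair (a, b).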
We now present an algorithm [FM85] which estimates the zeroth moment
of a stream and defer the analysis to the next lecture. In FM, zeros(h(ai))
refers to the number of trailing zeroes in the binary representation of h(ai). For
example, if h(ai) = 20 = (...10100)₂, then zeros(h(ai)) = 2.
Recall that the k-th moment of a stream S is defined as Σ_{j=1}^n (fj)^k. Since
the hash h is deterministic after picking a random hash from Hn,n, we have
h(ai) = h(aj) whenever ai = aj. We first prove a useful lemma.
10.2. ESTIMATING THE ZEROTH MOMENT OF A STREAM 133

Algorithm 22 FM(S = {a1 , . . . , am })
h ← Random hash from Hn,n
Z←0
for ai ∈ S do                          . Items arrive in streaming fashion
    Z ← max{Z, zeros(h(ai ))}
    (zeros(h(ai )) = # trailing zeroes in the binary representation of h(ai ))
end for
return 2^Z · √2                        . Estimate of D
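A minimal Python rendition of FM (function names are ours; in place of a hash from Hn,n we use a concrete ha,b with a prime p ≥ n supplied by the caller):

```python
import random

def zeros(x):
    """Number of trailing zeroes in the binary representation of x (0 for x = 0)."""
    count = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        count += 1
    return count

def fm(stream, p):
    """One run of the FM estimator; p is assumed to be a prime >= n."""
    a, b = random.randrange(1, p), random.randrange(p)  # hash h(x) = ax + b mod p
    z = 0
    for item in stream:  # items arrive in streaming fashion
        z = max(z, zeros((a * item + b) % p))
    return (2 ** z) * (2 ** 0.5)  # estimate of D
```

Only z and the two hash coefficients are stored, so the working memory is O(log n) bits, as intended.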

Lemma 10.7. If X1 , . . . , Xn are pairwise independent indicator random variables
and X = Σ_{i=1}^n Xi , then Var(X) ≤ E[X].

Proof.

Var(X) = Σ_{i=1}^n Var(Xi )                The Xi ’s are pairwise independent
       = Σ_{i=1}^n (E[Xi²] − (E[Xi ])²)    Definition of variance
       ≤ Σ_{i=1}^n E[Xi²]                  Ignore the negative part
       = Σ_{i=1}^n E[Xi ]                  Xi² = Xi since the Xi ’s are indicators
       = E[Σ_{i=1}^n Xi ] = E[X]           Linearity of expectation

Theorem 10.8. There exists a constant C > 0 such that

Pr[D/3 ≤ 2^Z · √2 ≤ 3D] > C

Proof. We will prove Pr[(2^Z · √2 < D/3) or (2^Z · √2 > 3D)] ≤ 1 − C by
separately analyzing Pr[2^Z · √2 ≤ D/3] and Pr[2^Z · √2 ≥ 3D], then applying
a union bound. Define indicator variables

Xi,r = 1 if zeros(h(ai )) ≥ r, and 0 otherwise

and let Xr be the number of distinct elements ai ∈ S with zeros(h(ai )) ≥ r; each
distinct element is counted once, since h hashes equal elements to the same value.
Notice that Xn ≤ Xn−1 ≤ · · · ≤ X1 , since zeros(h(ai )) ≥ r + 1 ⇒ zeros(h(ai )) ≥ r.
Writing d1 , . . . , dD for the D distinct elements of the stream, Xr = Σ_{i=1}^D X_{di ,r} . Now,

E[Xr ] = Σ_{i=1}^D E[X_{di ,r} ]            By linearity of expectation
       = Σ_{i=1}^D Pr[zeros(h(di )) ≥ r]    Since the X_{di ,r} are indicator variables
       = Σ_{i=1}^D 1/2^r                    h is a uniform hash, so h(di ) has ≥ r trailing zeroes w.p. 1/2^r
       = D/2^r


Denote by τ1 the smallest integer such that 2^{τ1} · √2 > 3D, and by τ2 the
largest integer such that 2^{τ2} · √2 < D/3. We see that if τ2 < Z < τ1 , then
2^Z · √2 is a 3-approximation of D.

(Figure: the possible values of Z on the number line of r, with τ2 and τ2 + 1
to the left of log2 (D/√2) and τ1 to its right.)

• If Z ≥ τ1 , then 2^Z · √2 ≥ 2^{τ1} · √2 > 3D

• If Z ≤ τ2 , then 2^Z · √2 ≤ 2^{τ2} · √2 < D/3

Pr[Z ≥ τ1 ] ≤ Pr[Xτ1 ≥ 1]       Since Z ≥ τ1 ⇒ Xτ1 ≥ 1
           ≤ E[Xτ1 ] / 1         By Markov’s inequality
           = D / 2^{τ1}          Since E[Xr ] = D/2^r
           ≤ √2 / 3              Since 2^{τ1} · √2 > 3D

Pr[Z ≤ τ2 ] ≤ Pr[Xτ2+1 = 0]                           Since Z ≤ τ2 ⇒ Xτ2+1 = 0
           ≤ Pr[E[Xτ2+1 ] − Xτ2+1 ≥ E[Xτ2+1 ]]        Implied by Xτ2+1 = 0
           ≤ Pr[|Xτ2+1 − E[Xτ2+1 ]| ≥ E[Xτ2+1 ]]      Adding the absolute sign
           ≤ Var[Xτ2+1 ] / (E[Xτ2+1 ])²               By Chebyshev’s inequality
           ≤ E[Xτ2+1 ] / (E[Xτ2+1 ])² = 1/E[Xτ2+1 ]   By Lemma 10.7
           = 2^{τ2+1} / D                             Since E[Xr ] = D/2^r
           ≤ √2 / 3                                   Since 2^{τ2} · √2 < D/3
Putting it together,

Pr[(2^Z · √2 < D/3) or (2^Z · √2 > 3D)]
  ≤ Pr[2^Z · √2 ≤ D/3] + Pr[2^Z · √2 ≥ 3D]     By union bound
  ≤ √2/3 + √2/3 = 2√2/3                          From above
  = 1 − C                                        For C = 1 − 2√2/3 > 0

Although the analysis only guarantees a small success probability
(C = 1 − 2√2/3 ≈ 0.0572), one can use t independent hashes and output the
mean (1/t) Σ_{i=1}^t 2^{Zi} · √2 (recall Trick 1). With t hashes, the variance drops
by a factor of 1/t, improving the analysis for Pr[Z ≤ τ2 ]. Once the success
probability C exceeds 0.5 (for instance, after t ≥ 17 repetitions), one can then call
the routine k times independently and return the median (recall Trick 2).
While Tricks 1 and 2 allow us to strengthen the success probability C, more
work needs to be done to improve the approximation factor from 3 to (1 + ε).
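Tricks 1 and 2 together are the standard median-of-means amplification. A generic Python sketch (the name boost is ours; estimator stands for any randomized routine such as one run of FM):

```python
import random
import statistics

def boost(estimator, t, k):
    """Median (Trick 2) of k means (Trick 1) of t independent runs each."""
    means = [sum(estimator() for _ in range(t)) / t for _ in range(k)]
    return statistics.median(means)
```

For instance, boosting a noisy unbiased estimator such as `lambda: random.gauss(10, 5)` with t = 30 and k = 9 concentrates the output sharply around 10.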
To do this, we look at a slight modification of FM, due to [BYJK+ 02].

Algorithm 23 FM+(S = {a1 , . . . , am }, ε)
N ← n³
t ← c/ε² ∈ O(1/ε²)                    . For some constant c ≥ 28
h ← Random hash from Hn,N             . Hash to a larger space
T ←∅                                  . Maintain the t smallest h(ai )’s
for ai ∈ S do                         . Items arrive in streaming fashion
    T ← t smallest values from T ∪ {h(ai )}
    (If |T ∪ {h(ai )}| ≤ t, then T = T ∪ {h(ai )})
end for
Z ← max_{v∈T} v                       . The t th smallest hash value seen
return tN/Z                           . Estimate of D
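A Python sketch of FM+ (names are ours; the Mersenne prime 2⁶¹ − 1 stands in for a prime ≥ N, and T is kept as a set of the t smallest distinct hash values). Note the estimate is only meaningful when the stream has at least t distinct elements:

```python
import random

P = 2 ** 61 - 1  # a Mersenne prime, large enough to serve as the hash field

def fm_plus(stream, n, eps, c=28):
    N = n ** 3
    t = max(1, int(c / eps ** 2))
    a, b = random.randrange(1, P), random.randrange(P)
    # Hash into [N]; equal items collide. (The double mod is only
    # approximately uniform, which is fine for a sketch.)
    h = lambda x: ((a * x + b) % P) % N + 1
    T = set()  # the t smallest distinct hash values seen so far
    for item in stream:
        T.add(h(item))
        if len(T) > t:
            T.remove(max(T))
    Z = max(T)  # the t-th smallest hash value
    return t * N / Z  # estimate of D
```

Keeping T as a set implements the rule in Algorithm 23: re-inserting an already-present hash value leaves T unchanged.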

Remark For a cleaner analysis, we treat the integer interval [N ] as a con-
tinuous interval in Theorem 10.9. Note that there may be a rounding error
of 1/N , but this is relatively small, and a suitable c can be chosen to make the
analysis still work.

Theorem 10.9. In FM+, for any given 0 < ε < 1/2, Pr[|tN/Z − D| ≤ εD] > 3/4.

Proof. We first analyze Pr[tN/Z > (1 + ε)D] and Pr[tN/Z < (1 − ε)D] separately.
Then, taking a union bound and negating yields the theorem’s statement.

If tN/Z > (1 + ε)D, then tN/((1 + ε)D) > Z = t th smallest hash value, implying
that there are ≥ t hashes smaller than tN/((1 + ε)D). Since the hash uniformly
distributes [n] over [N ], for each element ai ,

Pr[h(ai ) ≤ tN/((1 + ε)D)] = (tN/((1 + ε)D)) / N = t/((1 + ε)D)

Let d1 , . . . , dD be the D distinct elements in the stream. Define indicator
variables

Xi = 1 if h(di ) ≤ tN/((1 + ε)D), and 0 otherwise
and X = Σ_{i=1}^D Xi is the number of hashes that are smaller than tN/((1 + ε)D).
From above, Pr[Xi = 1] = t/((1 + ε)D). By linearity of expectation, E[X] = t/(1 + ε).
Then, by Lemma 10.7, Var(X) ≤ E[X]. Now,

Pr[tN/Z > (1 + ε)D] ≤ Pr[X ≥ t]         Since the former implies the latter
  = Pr[X − E[X] ≥ t − E[X]]             Subtracting E[X] from both sides
  ≤ Pr[X − E[X] ≥ εt/2]                 Since E[X] = t/(1 + ε) ≤ (1 − ε/2)t
  ≤ Pr[|X − E[X]| ≥ εt/2]               Adding the absolute sign
  ≤ Var(X)/(εt/2)²                      By Chebyshev’s inequality
  ≤ E[X]/(εt/2)²                        Since Var(X) ≤ E[X]
  ≤ 4(1 − ε/2)t/(ε²t²)                  Since E[X] ≤ (1 − ε/2)t
  ≤ 4/c                                 Simplifying with t = c/ε² and (1 − ε/2) < 1

Similarly, if tN/Z < (1 − ε)D, then tN/((1 − ε)D) < Z = t th smallest hash value,
implying that there are < t hashes smaller than tN/((1 − ε)D). Since the hash
uniformly distributes [n] over [N ], for each element ai ,

Pr[h(ai ) ≤ tN/((1 − ε)D)] = (tN/((1 − ε)D)) / N = t/((1 − ε)D)

Let d1 , . . . , dD be the D distinct elements in the stream. Define indicator
variables

Yi = 1 if h(di ) ≤ tN/((1 − ε)D), and 0 otherwise

and Y = Σ_{i=1}^D Yi is the number of hashes that are smaller than tN/((1 − ε)D).
From above, Pr[Yi = 1] = t/((1 − ε)D). By linearity of expectation, E[Y ] = t/(1 − ε).
Then, by Lemma 10.7, Var(Y ) ≤ E[Y ]. Now,

Pr[tN/Z < (1 − ε)D]
  ≤ Pr[Y ≤ t]                      Since the former implies the latter
  = Pr[Y − E[Y ] ≤ t − E[Y ]]      Subtracting E[Y ] from both sides
  ≤ Pr[Y − E[Y ] ≤ −εt]            Since E[Y ] = t/(1 − ε) ≥ (1 + ε)t
  ≤ Pr[−(Y − E[Y ]) ≥ εt]          Swap sides
  ≤ Pr[|Y − E[Y ]| ≥ εt]           Adding the absolute sign
  ≤ Var(Y )/(εt)²                  By Chebyshev’s inequality
  ≤ E[Y ]/(εt)²                    Since Var(Y ) ≤ E[Y ]
  ≤ (1 + 2ε)t/(ε²t²)               Since E[Y ] = t/(1 − ε) ≤ (1 + 2ε)t for ε < 1/2
  ≤ 3/c                            Simplifying with t = c/ε² and (1 + 2ε) < 3

Putting it together,

Pr[|tN/Z − D| > εD] ≤ Pr[tN/Z > (1 + ε)D] + Pr[tN/Z < (1 − ε)D]   By union bound
                    ≤ 4/c + 3/c = 7/c                              From above
                    ≤ 1/4                                          For c ≥ 28

10.3 Estimating the k th moment of a stream

In this section, we describe algorithms from [AMS96] that estimate the k th
moment of a stream, first for k = 2, then for general k. Recall that the k th
moment of a stream S is defined as Fk = Σ_{i=1}^n (fi )^k , where for each element
i ∈ [n], fi denotes the number of times value i appears in the stream.

10.3.1 k=2
For each element i ∈ [n], we associate a random variable ri ∈u.a.r. {−1, +1}.

Algorithm 24 AMS-2(S = {a1 , . . . , am })
Assign ri ∈u.a.r. {−1, +1}, ∀i ∈ [n]     . For now, this takes O(n) space
Z←0
for ai ∈ S do                            . Items arrive in streaming fashion
    Z ← Z + r_{ai}                       . At the end, Z = Σ_{i=1}^n ri fi
end for
return Z²                                . Estimate of F2 = Σ_{i=1}^n fi²

Lemma 10.10. In AMS-2, if the random variables {ri }i∈[n] are pairwise in-
dependent, then E[Z²] = Σ_{i=1}^n fi² = F2 . That is, AMS-2 is an unbiased
estimator for the second moment.

Proof.

E[Z²] = E[(Σ_{i=1}^n ri fi )²]                                 Since Z = Σ_{i=1}^n ri fi at the end
  = E[Σ_{i=1}^n ri² fi² + 2 Σ_{1≤i<j≤n} ri rj fi fj ]          Expanding (Σ_{i=1}^n ri fi )²
  = Σ_{i=1}^n E[ri² fi²] + 2 Σ_{1≤i<j≤n} E[ri rj fi fj ]       Linearity of expectation
  = Σ_{i=1}^n E[ri²]fi² + 2 Σ_{1≤i<j≤n} E[ri rj ]fi fj         fi ’s are (unknown) constants
  = Σ_{i=1}^n fi² + 2 Σ_{1≤i<j≤n} E[ri ]E[rj ]fi fj            Since ri² = 1, ∀i ∈ [n], and the {ri }i∈[n] are pairwise independent
  = Σ_{i=1}^n fi²                                              Since E[ri ] = 0, ∀i ∈ [n]
  = F2                                                         Since F2 = Σ_{i=1}^n fi²

So we have an unbiased estimator for the second moment, but we are also
interested in the probability of error. We want a small probability for the
output Z² to deviate by more than εF2 from the true value, i.e.,
Pr[|Z² − F2 | > εF2 ] should be small.
Lemma 10.11. In AMS-2, if the random variables {ri }i∈[n] are 4-wise inde-
pendent², then Var[Z²] ≤ 2(E[Z²])².

Proof. As before, E[ri ] = 0 and ri² = 1 for all i ∈ [n]. By 4-wise independence,
the expectation of any product of at most 4 different ri ’s is the product of
their expectations. Thus we get E[ri rj rk rl ] = E[ri ]E[rj ]E[rk ]E[rl ] = 0, as well
as E[ri³ rj ] = E[ri rj ] = 0 and E[ri² rj rk ] = E[rj rk ] = 0, where the indices
i, j, k, l are pairwise different. This allows us to compute E[Z⁴]:

E[Z⁴] = E[(Σ_{i=1}^n ri fi )⁴]                               Since Z = Σ_{i=1}^n ri fi at the end
  = Σ_{i=1}^n E[ri⁴]fi⁴ + 6 Σ_{1≤i<j≤n} E[ri² rj²]fi² fj²    L.o.E. and 4-wise independence
  = Σ_{i=1}^n fi⁴ + 6 Σ_{1≤i<j≤n} fi² fj²                    Since ri⁴ = ri² = 1, ∀i ∈ [n] .

Note that the coefficient of Σ_{1≤i<j≤n} E[ri² rj²]fi² fj² is (4 choose 2) = 6, and that all
other terms vanish by the computation above.

Var[Z²] = E[(Z²)²] − (E[Z²])²                                 Definition of variance
  = Σ_{i=1}^n fi⁴ + 6 Σ_{1≤i<j≤n} fi² fj² − (E[Z²])²          From above
  = Σ_{i=1}^n fi⁴ + 6 Σ_{1≤i<j≤n} fi² fj² − (Σ_{i=1}^n fi²)²  By Lemma 10.10
  = 4 Σ_{1≤i<j≤n} fi² fj²                                     Expand and simplify
  ≤ 2(Σ_{i=1}^n fi²)²                                         Introducing fi⁴ terms
  = 2(E[Z²])²                                                 By Lemma 10.10
² The random variables {ri }i∈[n] are said to be 4-wise independent if
Pr[(ri1 , ri2 , ri3 , ri4 ) = (v1 , v2 , v3 , v4 )] = Π_{j=1}^4 Pr[rij = vj ] for all pairwise distinct
indices i1 , i2 , i3 , i4 and all values v1 , v2 , v3 , v4 ∈ {−1, +1}.
Note that 4-wise independence implies pairwise independence.

Theorem 10.12. In AMS-2, if {ri }i∈[n] are 4-wise independent, then we
have Pr[|Z² − F2 | > εF2 ] ≤ 2/ε² for any ε > 0.

Proof.

Pr[|Z² − F2 | > εF2 ] = Pr[|Z² − E[Z²]| > εE[Z²]]   By Lemma 10.10
  ≤ Var(Z²)/(εE[Z²])²                               By Chebyshev’s inequality
  ≤ 2(E[Z²])²/(ε²(E[Z²])²)                          By Lemma 10.11
  = 2/ε²


We can again apply the mean trick to decrease the variance by a factor
of k and obtain a smaller upper bound on the probability of error.
In particular, if we perform k = 10/ε² repetitions of AMS-2 and output the mean
of the outputs Z², we have:

Pr[error] ≤ (Var[Z²]/k) / (ε²(E[Z²])²) ≤ (1/k) · (2/ε²) = 1/5

Claim 10.13. O(k log n) bits of randomness suffice to obtain a set of k-wise
independent random variables.

Proof. Recall the definition of the hash family Hn,m . In a similar fashion³, we
consider hashes from the family (for prime p):

{h_{ak−1 ,ak−2 ,...,a1 ,a0} : h(x) = Σ_{i=0}^{k−1} ai x^i mod p
                                  = ak−1 x^{k−1} + ak−2 x^{k−2} + · · · + a1 x + a0 mod p,
                             ∀ak−1 , ak−2 , . . . , a1 , a0 ∈ Zp }

This requires k random coefficients, which can be stored with O(k log n)
bits.
3
See https://en.wikipedia.org/wiki/K-independent_hashing
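The polynomial family from Claim 10.13 can be sketched in Python (names ours; p must be a prime larger than the domain, and we fold each hash value into a ±1 sign, which is only approximately unbiased since p is odd):

```python
import random

def kwise_hash(k, p):
    """A random degree-(k-1) polynomial over Z_p: k-wise independent outputs."""
    coeffs = [random.randrange(p) for _ in range(k)]  # a_0, ..., a_{k-1}
    def h(x):
        value = 0
        for a in reversed(coeffs):  # Horner's rule
            value = (value * x + a) % p
        return value
    return h

def sign_of(hash_value, p):
    """Fold a hash value in Z_p into a sign in {-1, +1} (bias O(1/p))."""
    return 1 if hash_value < (p + 1) // 2 else -1
```

Storing the k coefficients takes O(k log p) bits, matching the claim; with k = 4 this yields the 4-wise independent signs that AMS-2 needs.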

Observe that the above analysis only requires {ri }i∈[n] to be 4-wise in-
dependent. Claim 10.13 thus implies that AMS-2 only needs O(4 log n) =
O(log n) bits to represent {ri }i∈[n] .
Although the failure probability 2/ε² is large for small ε, one can repeat t
times and output the mean (recall Trick 1). With t ∈ O(1/ε²) samples, the
failure probability drops to 2/(tε²) ∈ O(1). When the failure probability is less
than 1/2, one can then call the routine k times independently and return the
median (recall Trick 2). On the whole, for any given ε > 0 and δ > 0,
O(log(n) log(1/δ)/ε²) space suffices to yield a (1 ± ε)-approximation algorithm that
succeeds with probability > 1 − δ.

10.3.2 General k

Algorithm 25 AMS-k(S = {a1 , . . . , am })
m ← |S|                              . For now, assume we know m = |S|
J ∈u.a.r. [m]                        . Pick a random index
r←0
for ai ∈ S do                        . Items arrive in streaming fashion
    if i ≥ J and ai = aJ then
        r ←r+1
    end if
end for
Z ← m(r^k − (r − 1)^k )
return Z                             . Estimate of Fk = Σ_{i=1}^n (fi )^k

Remark At the end of AMS-k, r = |{i ∈ [m] : i ≥ J and ai = aJ }| is
the number of occurrences of aJ in the suffix of the stream starting at position J.
The assumption of known m in AMS-k can be removed via reservoir
sampling⁴. The idea is as follows: initialize the stream length and J to 0.
When ai arrives, replace J with i with probability 1/i. If J
is replaced, reset r to 0 and start counting from this stream suffix onwards.
It can be shown that the choice of J is uniform over the current stream length.
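The reservoir-sampling variant just described can be sketched in Python (function name ours; a single pass, with no prior knowledge of m):

```python
import random

def ams_k(stream, k):
    """One AMS-k estimate of F_k, choosing J by reservoir sampling."""
    m = 0
    chosen = None  # the value a_J
    r = 0          # occurrences of a_J from position J onward
    for item in stream:
        m += 1
        if random.randrange(m) == 0:  # replace J with probability 1/m
            chosen = item
            r = 0
        if item == chosen:
            r += 1
    return m * (r ** k - (r - 1) ** k)
```

Note that when J is replaced by the current position, r is reset and then immediately incremented to 1, since a_J occurs at position J itself. Averaging independent runs recovers F_k, as Lemma 10.14 below shows.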

Lemma 10.14. In AMS-k, E[Z] = Σ_{i=1}^n fi^k = Fk . That is, AMS-k is an
unbiased estimator for the k th moment.
4
See https://en.wikipedia.org/wiki/Reservoir_sampling

Proof. When aJ = i, there are fi equally likely choices for J. By telescoping sums, we have

E[Z | aJ = i]
  = (1/fi )[m(fi^k − (fi − 1)^k )] + (1/fi )[m((fi − 1)^k − (fi − 2)^k )] + · · · + (1/fi )[m(1^k − 0^k )]
  = (m/fi )[(fi^k − (fi − 1)^k ) + ((fi − 1)^k − (fi − 2)^k ) + · · · + (1^k − 0^k )]
  = (m/fi ) fi^k .

Thus,

E[Z] = Σ_{i=1}^n E[Z | aJ = i] · Pr[aJ = i]   Condition on the choice of J
  = Σ_{i=1}^n E[Z | aJ = i] · (fi /m)         Since the choice of J is uniform at random
  = Σ_{i=1}^n (m/fi ) fi^k · (fi /m)          From above
  = Σ_{i=1}^n fi^k                            Simplifying
  = Fk                                        Since Fk = Σ_{i=1}^n fi^k .

Lemma 10.15. For positive reals f1 , f2 , . . . , fn and a positive integer k, we
have

(Σ_{i=1}^n fi )(Σ_{i=1}^n fi^{2k−1} ) ≤ n^{1−1/k} (Σ_{i=1}^n fi^k )² .

Proof. Let M = max_{i∈[n]} fi . Then fi ≤ M for any i ∈ [n], and M^k ≤ Σ_{i=1}^n fi^k .
Hence,

(Σ_{i=1}^n fi )(Σ_{i=1}^n fi^{2k−1} )
  ≤ (Σ_{i=1}^n fi ) M^{k−1} (Σ_{i=1}^n fi^k )                      Since fi^{2k−1} ≤ M^{k−1} fi^k
  ≤ (Σ_{i=1}^n fi )(Σ_{i=1}^n fi^k )^{(k−1)/k} (Σ_{i=1}^n fi^k )   Since M^k ≤ Σ_{i=1}^n fi^k
  = (Σ_{i=1}^n fi )(Σ_{i=1}^n fi^k )^{(2k−1)/k}                    Merging the last two terms
  ≤ n^{1−1/k} (Σ_{i=1}^n fi^k )^{1/k} (Σ_{i=1}^n fi^k )^{(2k−1)/k} Fact (power mean): (Σ_{i=1}^n fi )/n ≤ (Σ_{i=1}^n fi^k /n)^{1/k}
  = n^{1−1/k} (Σ_{i=1}^n fi^k )²                                   Merging the last two terms .

Remark f1 = n^{1/k} , f2 = · · · = fn = 1 is a tight example for Lemma 10.15,
up to a constant factor.
Theorem 10.16. In AMS-k, Var(Z) ≤ k n^{1−1/k} (E[Z])².

Proof. Let us first analyze E[Z²].

E[Z²] = m [ (1^k − 0^k )² + (2^k − 1^k )² + · · · + (f1^k − (f1 − 1)^k )²                          (1)
          + (1^k − 0^k )² + (2^k − 1^k )² + · · · + (f2^k − (f2 − 1)^k )²
          + ...
          + (1^k − 0^k )² + (2^k − 1^k )² + · · · + (fn^k − (fn − 1)^k )² ]
  ≤ m [ k·1^{k−1} (1^k − 0^k ) + k·2^{k−1} (2^k − 1^k ) + · · · + k·f1^{k−1} (f1^k − (f1 − 1)^k )  (2)
      + k·1^{k−1} (1^k − 0^k ) + k·2^{k−1} (2^k − 1^k ) + · · · + k·f2^{k−1} (f2^k − (f2 − 1)^k )
      + ...
      + k·1^{k−1} (1^k − 0^k ) + k·2^{k−1} (2^k − 1^k ) + · · · + k·fn^{k−1} (fn^k − (fn − 1)^k ) ]
  ≤ m [ k f1^{2k−1} + k f2^{2k−1} + · · · + k fn^{2k−1} ]                                          (3)
  = k m F_{2k−1}                                                                                   (4)
  = k F1 F_{2k−1}                                                                                  (5)

(1) Condition on J, which is uniform over [m], and expand Z² = m²(r^k − (r − 1)^k )²
    as in the proof of Lemma 10.14; each term carries a factor m² · (1/m) = m.

(2) For all 0 < b < a,

    a^k − b^k = (a − b)(a^{k−1} + a^{k−2} b + · · · + a b^{k−2} + b^{k−1} ) ≤ (a − b) k a^{k−1} ,

    in particular, (a^k − (a − 1)^k )² ≤ k a^{k−1} (a^k − (a − 1)^k ).

(3) In each row, bound a^{k−1} ≤ fi^{k−1} and telescope:
    Σ_{a=1}^{fi} (a^k − (a − 1)^k ) = fi^k , giving k fi^{k−1} fi^k = k fi^{2k−1}

(4) F_{2k−1} = Σ_{i=1}^n fi^{2k−1}

(5) F1 = Σ_{i=1}^n fi = m

Then,

Var(Z) = E[Z²] − (E[Z])²       Definition of variance
  ≤ E[Z²]                      Ignore the negative part
  ≤ k F1 F_{2k−1}              From above
  ≤ k n^{1−1/k} Fk²            By Lemma 10.15
  = k n^{1−1/k} (E[Z])²        By Lemma 10.14

Remark Proofs for Lemma 10.15 and Theorem 10.16 were omitted in class.
The above proofs are presented in a style consistent with the rest of the scribe
notes. Interested readers can refer to [AMS96] for details.

Remark One can apply an analysis similar to the case k = 2, then
use Tricks 1 and 2.

Claim 10.17. For k > 2, a lower bound of Θ̃(n^{1−2/k} ) is known, and it is achievable.

Proof. Theorem 3.1 in [BYJKS04] gives the lower bound. See [IW05] for an
algorithm that achieves it.
Chapter 11

Graph sketching

Definition 11.1 (Streaming connected components problem). Consider a
graph on n vertices and a stream S of edge updates {⟨et , ±⟩}_{t∈N+} , where edge
et is either added (+) or removed (−). Assume that S is “well-behaved”,
that is, existing edges are not added, and an edge is deleted only if it is already
present in the graph.
At time t, the edge set Et of the graph Gt = (V, Et ) is the set of edges
present after accounting for all stream updates up to time t. How much
memory do we need if we want to be able to query the connected components
of Gt for any t ∈ N+ ?

Remark In this chapter, we mainly focus on the amount of memory needed


without worrying about processing time. However, the solution that we will
see can be computed in polynomial time.
Let m be the total number of distinct edges in the stream. There are two
ways to represent connected components on a graph:
1. Every vertex stores a label such that vertices in the same connected
component have the same label
2. Explicitly build a tree for each connected component — This yields a
maximal forest
For now, we are interested in building a maximal forest for Gt . This
can be done with memory size of O(m) words1 , or — in the special case
of only edge additions — O(n) words2 . However, these are unsatisfactory
1
Toggle edge additions/deletion per update. Compute connected components on de-
mand.
2
We just need to store a label for each node of the graph so that nodes of a same con-
nected component have the same label and then use the Union-Find data structure as new
edges arrive. See https://en.wikipedia.org/wiki/Disjoint-set_data_structure


as m ∈ O(n²) on a complete graph, and we may have edge deletions. We
show how one can maintain a data structure with O(n log⁴ n) memory, with
a randomized algorithm that succeeds in building the maximal forest with
success probability ≥ 1 − 1/n^{10} .

Coordinator model For a change in perspective3 , consider the following


computation model where each vertex acts independently from each other.
Then, upon request of connected components, each vertex sends some infor-
mation to a centralized coordinator to perform computation and outputs the
maximal forest.
The coordinator model will be helpful in our analysis of the algorithm
later as each vertex will send O(log4 n) amount of data (a local sketch of the
graph) to the coordinator, totalling O(n log4 n) memory as required.

11.1 Finding the single cut edge


Definition 11.2 (The single cut problem). Fix an arbitrary subset A ⊆ V .
Suppose there is exactly 1 cut edge {u, v} between A and V \ A. How do we
output the cut edge {u, v} using O(log n) bits of memory?
Without loss of generality, assume u ∈ A and v ∈ V \ A. Note that this is
not a trivial problem at first glance, since it already takes O(n) bits for any
vertex to enumerate all its adjacent edges. To solve the problem, we use a bit
trick which exploits the fact that any edge {a, b} with both endpoints in A is
considered twice by vertices in A. Since one can uniquely identify each vertex
with O(log n) bits, consider the following:
• Identify an edge e = {u, v} by the concatenation of the identifiers of
its endpoints: id(e) = id(u) ◦ id(v) if id(u) < id(v)
• Locally, every vertex u maintains

XORu = ⊕{id(e) : e ∈ S ∧ u is an endpoint of e}

Thus XORu represents the bit-wise XOR of the identifiers of all edges
that are adjacent to u.
• All vertices send the coordinator their value XORu and the coordinator
computes
XORA = ⊕{XORu : u ∈ A}
3
In reality, the algorithm simulates all the vertices’ actions so it is not a real multi-party
computation setup.

Example Suppose V = {v1 , v2 , v3 , v4 , v5 } where id(v1 ) = 000, id(v2 ) = 001,
id(v3 ) = 010, id(v4 ) = 011, and id(v5 ) = 100. Then, id({v1 , v3 }) = id(v1 ) ◦
id(v3 ) = 000010, and so on. Suppose

S = {⟨{v1 , v2 }, +⟩, ⟨{v2 , v3 }, +⟩, ⟨{v1 , v3 }, +⟩, ⟨{v4 , v5 }, +⟩, ⟨{v2 , v5 }, +⟩, ⟨{v1 , v2 }, −⟩}

and we query for the cut edge {v2 , v5 } with A = {v1 , v2 , v3 } at t = |S|. The
figure below shows the graph G6 at time t = 6:
v1 v4

v2

v3 v5

Vertex v1 sees ⟨{v1 , v2 }, +⟩, ⟨{v1 , v3 }, +⟩, and ⟨{v1 , v2 }, −⟩. So,

XOR1 ⇒ 000000 Initialize


⇒ 000000 ⊕ id((v1 , v2 )) = 000000 ⊕ 000001 = 000001 Due to h{v1 , v2 }, +i
⇒ 000001 ⊕ id((v1 , v3 )) = 000001 ⊕ 000010 = 000011 Due to h{v1 , v3 }, +i
⇒ 000011 ⊕ id((v1 , v2 )) = 000011 ⊕ 000001 = 000010 Due to h{v1 , v2 }, −i

Repeating the simulation for all vertices,

XOR1 = 000010 = id({v1 , v2 }) ⊕ id({v1 , v3 }) ⊕ id({v1 , v2 })


= 000001 ⊕ 000010 ⊕ 000001
XOR2 = 000110 = id({v1 , v2 }) ⊕ id({v2 , v3 }) ⊕ id({v2 , v5 }) ⊕ id({v1 , v2 })
= 000001 ⊕ 001010 ⊕ 001100 ⊕ 000001
XOR3 = 001000 = id({v2 , v3 }) ⊕ id({v1 , v3 })
= 001010 ⊕ 000010
XOR4 = 011100 = id({v4 , v5 })
= 011100
XOR5 = 010000 = id({v4 , v5 }) ⊕ id({v2 , v5 })
= 011100 ⊕ 001100

Thus, XORA = XOR1 ⊕ XOR2 ⊕ XOR3 = 000010 ⊕ 000110 ⊕ 001000 =


001100 = id({v2 , v5 }) as expected. Notice that after adding or deleting an
edge e = (u, v), updating XORu and XORv can be done by doing a bit-wise
XOR of each of these values together with id(e). Also, the identifier of every
edge with both endpoints in A contributes two times to XORA .

Claim 11.3. XORA = ⊕{XORu : u ∈ A} is the identifier of the cut edge.

Proof. For any edge e = (a, b) such that a, b ∈ A, id(e) contributes to both
XORa and XORb . So, XORa ⊕ XORb will cancel out the contribution
of id(e) because id(e) ⊕ id(e) = 0. Hence, the only remaining value in
XORA = ⊕{XORu : u ∈ A} will be the identifier of the cut edge since only
one of its endpoints lies in A.
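The example above can be replayed in Python (vertices are numbered 0–4 for v1–v5, so the 3-bit ids match; function names are ours):

```python
def edge_id(u, v, bits=3):
    """Concatenate the two endpoint ids, smaller first, each on `bits` bits."""
    u, v = min(u, v), max(u, v)
    return (u << bits) | v

def xor_sketch(stream, vertices, bits=3):
    """Per-vertex XOR of incident edge ids; a deletion XORs the id back out."""
    xor = {w: 0 for w in vertices}
    for (u, v), _sign in stream:
        xor[u] ^= edge_id(u, v, bits)
        xor[v] ^= edge_id(u, v, bits)
    return xor

# The stream from the example, with A = {v1, v2, v3} = {0, 1, 2}
stream = [((0, 1), '+'), ((1, 2), '+'), ((0, 2), '+'),
          ((3, 4), '+'), ((1, 4), '+'), ((0, 1), '-')]
xor = xor_sketch(stream, range(5))
xor_A = xor[0] ^ xor[1] ^ xor[2]
assert xor_A == edge_id(1, 4)  # recovers the single cut edge {v2, v5}
```

Edges with both endpoints in A contribute twice to xor_A and cancel, exactly as in the proof of Claim 11.3.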

Remark Bit tricks are often used in the random linear network coding
literature (e.g. [HMK+ 06]).

11.2 Finding one out of k > 1 cut edges


Definition 11.4 (The k cut problem). Fix an arbitrary subset A ⊆ V .
Suppose there are exactly k cut edges {u, v} between A and V \ A, and we are
given an estimate k̂ such that k̂/2 ≤ k ≤ k̂. How do we output a cut edge {u, v}
using O(log² n) bits of memory, with high probability?

A straightforward idea is to independently mark each edge, each with
probability 1/k̂. In expectation, we expect one cut edge to be marked. Denote
the set of marked cut edges by E′.

Pr[|E′| = 1]
  = k · Pr[Cut edge {u, v} is marked; the other cut edges are not]
  = k · (1/k̂)(1 − 1/k̂)^{k−1}     Edges marked independently w.p. 1/k̂
  ≥ (k̂/2)(1/k̂)(1 − 1/k̂)^{k̂}     Since k̂/2 ≤ k ≤ k̂
  ≥ (1/2) · 4^{−1}               Since 1 − x ≥ 4^{−x} for x ≤ 1/2
  = 1/8 ≥ 1/10

Remark The above analysis assumes that vertices can locally mark the
edges in a consistent manner (i.e. both endpoints of any edge make the same
decision whether to mark the edge or not). This can be achieved with a
sufficiently large string of randomness, shared across all the nodes.

From above, we know that Pr[|E′| = 1] ≥ 1/10. If |E′| = 1, we can
re-use the idea from Section 11.1. However, if |E′| ≠ 1, then XORA may
correspond erroneously to another edge in the graph. In the above example,
id({v1 , v2 }) ⊕ id({v2 , v4 }) = 000001 ⊕ 001011 = 001010 = id({v2 , v3 }).
To fix this, we use random bits as edge IDs instead of simply concate-
nating vertex IDs: randomly assign (in a consistent manner) to each edge a
random ID of 20 log n bits. Since the XOR of random bits is random,
for any edge e, Pr[XORA = id(e) | |E′| ≠ 1] = (1/2)^{20 log n} . Hence,

Pr[XORA = id(e) for some edge e | |E′| ≠ 1]
  ≤ Σ_{e∈(V choose 2)} Pr[XORA = id(e) | |E′| ≠ 1]   Union bound over all possible edges
  = (n choose 2) (1/2)^{20 log n}                     There are (n choose 2) possible edges
  ≤ 2^{−18 log n}                                     Since (n choose 2) ≤ n² = 2^{2 log n}
  = 1/n^{18}                                          Rewriting

Therefore, if we sample two or more edges from the cut, with high proba-
bility the XOR of their identifiers will be distinguishable from the identifier
of any actual edge, allowing us to determine whether |E′| = 1.
The probability of sampling a single edge across the cut is Pr[|E′| = 1] ≥
1/10. To amplify it, we can perform t = C log n ∈ O(log n) parallel repetitions,
each time sampling the edges independently and computing for every node
the XOR of the sampled edges. We succeed in finding an edge across the cut
if at least one repetition succeeds, which happens with probability at least
1 − (9/10)^{C log n} ≥ 1 − 1/n^{10} , by setting the constant C appropriately.
Overall, the size of the message that each node sends to the coordinator
is O(log² n) bits.
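The whole subsection condenses into a Python sketch (names ours): each trial draws fresh random edge ids (playing the role of the shared randomness), marks edges with probability 1/k_hat, and is accepted only if the XOR over the cut matches the id of an actual cut edge — the validity check that detects |E′| ≠ 1.

```python
import random

def sample_cut_edge(edges, A, k_hat, id_bits=40, trials=100):
    """Try to output one edge crossing the cut (A, V \\ A), given the estimate k_hat."""
    A = set(A)
    for _ in range(trials):
        ids = {e: random.getrandbits(id_bits) for e in edges}  # shared random ids
        x = 0
        for e in edges:
            # Marked edges inside A would cancel in XOR_A anyway, so we only
            # accumulate marked edges that cross the cut.
            if random.random() < 1.0 / k_hat and (e[0] in A) != (e[1] in A):
                x ^= ids[e]
        by_id = {i: e for e, i in ids.items()}
        e = by_id.get(x)
        if e is not None and (e[0] in A) != (e[1] in A):
            return e  # the XOR is a valid id of a cut edge: accept
    return None
```

With probability ≥ 1/10 per trial exactly one cut edge is marked and its id is returned; otherwise the XOR almost surely matches no valid id and the trial is rejected.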

11.3 Finding one out of arbitrarily many cut edges

For an arbitrary cut, the number of edges across the cut is k ∈ [0, n²]. To find
a single edge across the cut, we can apply the procedure described in Section
11.2 using all the powers of 2, k̂ ∈ {2⁰, 2¹, . . . , 2^{⌈log n²⌉} }, as estimates of k.
Section 11.2 proves that if the estimate is within a 2-factor of the number
of cut edges k, i.e., k̂/2 ≤ k ≤ k̂, then we find a single edge across the cut with
probability at least 1 − 1/n^{10} .
For the values of k̂ that are much bigger than k, the sampling probability
is such that in expectation no edges across the cut are sampled. Conversely,
if k̂ is much smaller than k, more than one edge across the cut is expected to
be sampled, but with high probability the XOR of their identifiers will not
be a valid edge ID.
In total, there are ⌈log n²⌉ + 1 ∈ O(log n) powers of 2 which are used
as estimates for k, so overall each node sends an O(log³ n)-bit message
to the coordinator: for each estimate k̂, we perform O(log n) independent
samplings of edges (to amplify the success probability), and every time each
node computes the XOR of the identifiers of the incident sampled edges, which
has size O(log n).

11.4 Maximal forest with O(n log⁴ n) memory

The procedure described in Section 11.3 finds an edge across a single cut with
high probability. This can be used to find a maximal forest of the graph, but
we are only allowed to call it for poly(n) cuts: each time the procedure is
called, it has a failure probability of at most 1/n^{10} , and the union bound blows
up if we consider all possible cuts, whose number is exponential in n.
We recall Borůvka’s algorithm4 for building a minimum spanning tree:
• Start with each node being one connected component
• Find the cheapest edge leaving each connected component and add it
into the MST (ignoring cycles)
• Repeat until there is only one connected component
The number of connected components decreases by at least half per iteration,
so it converges in O(log n) iterations.
4
For a detailed explanation, see https://en.wikipedia.org/wiki/Bor%C5%AFvka%
27s_algorithm
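Setting sketches aside for a moment, Borůvka's component-merging loop for an unweighted maximal forest can be written directly in Python (a plain in-memory sketch using union-find; names ours):

```python
def maximal_forest(n, edges):
    """Borůvka-style: repeatedly pick one outgoing edge per component."""
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    changed = True
    while changed:  # each round at least halves the number of components
        changed = False
        pick = {}  # component root -> one edge leaving that component
        for (u, v) in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                pick.setdefault(ru, (u, v))
                pick.setdefault(rv, (u, v))
        for (u, v) in pick.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # skip edges whose endpoints merged this round
                parent[ru] = rv
                forest.append((u, v))
                changed = True
    return forest
```

In the sketching version, the step "find one edge leaving each component" is exactly the XOR-based cut procedure of Section 11.3, with A being the component.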

Since here we do not care about edge weights, the step of finding the
cheapest edge leaving each component amounts to finding one out of many
cut edges, which we solved in section 11.3.
Therefore, each node sends the coordinator a message of size O(log⁴ n),
obtained by repeating log n times independently the message construction
outlined in Section 11.3. The entire procedure is presented in ComputeSketch.

Algorithm 26 ComputeSketch(v ∈ V )
for h = 1 to log n do                           . Iterations of Borůvka
    for i ∈ {0, 1, . . . , ⌈log n²⌉} do         . ⌈log n²⌉ + 1 guesses of k
        for t = C log n times do                . Amplify success probability
            Sample each edge w.p. p = 1/2^i
            Send XOR of sampled edges incident to v
        end for
    end for
end for

When the coordinator receives all the messages from the nodes, it is able
to compute a maximal forest by simulating the steps of Borůvka’s algorithm:

• initially, every node is a single component of the graph;

• for each step h ∈ {1, . . . , log n}:

    – use part h of the messages (of size O(log³ n)) to find, for every
      component, a single edge connecting it to another component (if
      there is one) with high probability;

    – merge the newly formed components and exclude any edges forming a cycle.

This scheme can also be implemented in the streaming setting, by main-
taining a message of size O(log⁴ n) for every node as described above, and
updating it after every edge insertion or deletion. ComputeSketches
and StreamingMaximalForest outline this procedure.
and StreamingMaximalForest outline this procedure.
More precisely, every vertex in ComputeSketches maintains O(log³ n)
copies of edge XORs, using random edge IDs and marking probabilities. In
order to have consistent edge IDs and marking probabilities among the vertices,
we use a source of shared randomness R. Then, StreamingMaximalForest
simulates Borůvka using the output of ComputeSketches. In total,
this requires O(n log⁴ n) memory to compute a maximal forest of the graph.

Algorithm 27 ComputeSketches(S = {⟨e, ±⟩, . . . }, ε, R)
for i = 1, . . . , n do
    XORi ← 0^{(20 log n)·log³ n}                . Initialize log³ n copies
end for
for edge update ⟨e = (u, v), ±⟩ ∈ S do          . Streaming edge updates
    for h = log n times do                      . For Borůvka simulation later
        for i ∈ {0, 1, . . . , ⌈log n²⌉} do     . ⌈log n²⌉ + 1 guesses of k
            for t = C log n times do            . Amplify success probability
                Rh,i,t ← Randomness for this specific instance, based on R
                if edge e is marked w.p. 1/k̂ = 2^{−i} , according to Rh,i,t then
                    Compute id(e) using R
                    XORu [h, i, t] ← XORu [h, i, t] ⊕ id(e)
                    XORv [h, i, t] ← XORv [h, i, t] ⊕ id(e)
                end if
            end for
        end for
    end for
end for
return XOR1 , . . . , XORn

Algorithm 28 StreamingMaximalForest(S = {⟨e, ±⟩, . . . }, ε)
R ← Generate O(log² n) bits of shared randomness
XOR1 , . . . , XORn ← ComputeSketches(S, ε, R)
F ← (VF = V, EF = ∅)                            . Initialize empty forest
for h = log n times do                          . Simulate Borůvka
    C←∅                                         . Initialize candidate edges
    for every connected component A in F do
        for i ∈ {1, 2, . . . , ⌈log n²⌉} do     . Guess A has [2^{i−1} , 2^i ] cut edges
            for t = C log n times do            . Amplify success probability
                Rh,i,t ← Randomness for this specific instance
                XORA ← ⊕{XORu [h, i, t] : u ∈ A}
                if XORA = id(e) for some edge e = (u, v) then
                    C ← C ∪ {(u, v)}            . Add cut edge (u, v) to candidates
                    Go to next connected component in F
                end if
            end for
        end for
    end for
    EF ← EF ∪ C, removing cycles if necessary   . Add candidates
end for
return F

At each step, we fail to find one cut edge leaving a connected component
with probability ≤ (1 − 1/10)^t , which can be made to be in O(1/n^{10} ). Applying
a union bound over all O(log³ n) computations of XORA , we see that

Pr[Any XORA corresponds wrongly to some edge ID] ≤ O(log³ n / n^{18} ) ⊆ O(1/n^{10} )

So, StreamingMaximalForest succeeds with high probability.

Remark One can drop the memory constraint per vertex from O(log⁴ n)
to O(log³ n) by using a constant t instead of t ∈ O(log n), such that the
success probability is a constant larger than 1/2, and then simulating Borůvka
for ⌈2 log n⌉ steps. See [AGM12] (note that they use a slightly different
sketch).

Theorem 11.5. Any randomized distributed sketching protocol for computing
a spanning forest with success probability ε must have expected average
sketch size Ω(log³ n), for any constant ε > 0.

Proof. See [NY18].

Claim 11.6. A polynomial number of bits provides sufficient independence for
the procedure described above.

Remark One can generate a polynomial number of random bits from
O(log² n) truly random bits. Interested readers can check out small-bias sample
spaces⁵. The construction is out of the scope of the course, but this implies that
the shared randomness R can be obtained within our memory constraints.

5
See https://en.wikipedia.org/wiki/Small-bias_sample_space
Part IV

Graph sparsification

Chapter 12

Preserving distances

Given a simple, unweighted, undirected graph G with n vertices and m edges,
can we sparsify G by ignoring some edges such that certain desirable prop-
erties still hold? Throughout, we consider simple, unweighted, undirected graphs.
For any pair of vertices u, v ∈ G, denote the shortest path between them
by Pu,v . Then, the distance between u and v in graph G, denoted by dG (u, v),
is simply the length of the shortest path Pu,v between them.
Definition 12.1 ((α, β)-spanners). Consider a graph G = (V, E) with |V | =
n vertices and |E| = m edges. For given α ≥ 1 and β ≥ 0, an (α, β)-spanner
is a subgraph G′ = (V, E′) of G, where E′ ⊆ E, such that

dG (u, v) ≤ dG′ (u, v) ≤ α · dG (u, v) + β

Remark The first inequality holds because G′ has fewer edges than G. The
second inequality upper bounds how much the distances “blow up” in the
sparser graph G′.

For an (α, β)-spanner, α is called the multiplicative stretch and β the additive stretch of the spanner. One would like to construct spanners with small |E′| and small stretch factors. An (α, 0)-spanner is called an α-multiplicative spanner, and a (1, β)-spanner is called a β-additive spanner. We shall first look at α-multiplicative spanners, then β-additive spanners, in a systematic fashion:
1. State the result (the number of edges and the stretch factor)
2. Give the construction
3. Bound the total number of edges |E 0 |
4. Prove that the stretch factor holds


Remark One way to prove the existence of an (α, β)-spanner is the probabilistic method: instead of giving an explicit construction, one designs a random process and argues that the probability that the spanner exists is strictly larger than 0. However, this may be somewhat unsatisfying, as such proofs do not usually yield a usable construction. On the other hand, the randomized constructions shown later are explicit and yield a spanner with high probability¹.

12.1 α-multiplicative spanners


Let us first state a fact regarding the girth of a graph G. The girth of G, denoted g(G), is defined as the length of the shortest cycle in G. Suppose g(G) > 2k. Then, for any vertex v, the subgraph formed by the k-hop neighbourhood of v is a tree with distinct vertices: any cycle inside the k-hop neighbourhood of v would have length at most 2k, contradicting g(G) > 2k.

[Figure: the k-hop neighbourhood of a vertex v forms a tree of depth k when g(G) > 2k.]

Theorem 12.2. [ADD+93] For a fixed k ≥ 1, every graph G on n vertices has a (2k − 1)-multiplicative spanner with O(n^{1+1/k}) edges.

Proof.

Construction

1. Initialize E′ = ∅.

2. For each e = {u, v} ∈ E (in arbitrary order):
   If currently dG′(u, v) ≥ 2k, add {u, v} into E′.
   Otherwise, ignore it.
¹This is shown by invoking concentration bounds such as Chernoff.
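To make the greedy construction concrete, here is a minimal Python sketch (an illustration, not part of the original notes; vertices are assumed to be labelled 0, …, n − 1, and the test dG′(u, v) ≥ 2k is implemented as a depth-limited BFS in the partial spanner):

```python
def dist_at_most(adj, s, t, limit):
    """Return True iff dist(s, t) <= limit in the graph given by `adj`."""
    if s == t:
        return True
    seen = {s}
    frontier = [s]
    for _ in range(limit):  # expand at most `limit` BFS levels
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v == t:
                    return True
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return False

def greedy_spanner(n, edges, k):
    """Greedy (2k-1)-multiplicative spanner: keep {u, v} iff the current
    spanner distance between u and v is at least 2k."""
    adj = {v: set() for v in range(n)}
    kept = []
    for u, v in edges:
        if not dist_at_most(adj, u, v, 2 * k - 1):
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))
    return kept
```

On the complete graph K5 with k = 2, scanning edges in lexicographic order keeps only a star of 4 edges, and every pair of vertices stays within multiplicative stretch 2k − 1 = 3.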

Number of edges We claim that |E′| ∈ O(n^{1+1/k}). Suppose, for a contradiction, that |E′| > 2n^{1+1/k}. Let G″ = (V″, E″) be the graph obtained by iteratively removing vertices of degree ≤ n^{1/k} from G′. Then |E″| > n^{1+1/k}, since at most n · n^{1/k} edges are removed. Observe the following:

• g(G″) ≥ g(G′) ≥ 2k + 1: an edge is added to E′ only when its endpoints are currently at distance ≥ 2k, so every cycle of G′ has length ≥ 2k + 1, and girth does not decrease when edges are removed.

• Every vertex in G″ has degree ≥ n^{1/k} + 1, by construction.

• Pick an arbitrary vertex v ∈ V″ and look at its k-hop neighbourhood. Since g(G″) > 2k, this neighbourhood is a tree, so

  n ≥ |V″|                                                    (by construction)
    ≥ |{v}| + Σ_{i=1}^{k} |{u ∈ V″ : dG″(u, v) = i}|          (look only at the k-hop neighbourhood of v)
    ≥ 1 + Σ_{i=1}^{k} (n^{1/k} + 1)(n^{1/k})^{i−1}            (vertices distinct, degrees ≥ n^{1/k} + 1)
    = 1 + (n^{1/k} + 1) · ((n^{1/k})^k − 1)/(n^{1/k} − 1)     (sum of geometric series)
    > 1 + (n − 1)                                             (since n^{1/k} + 1 > n^{1/k} − 1)
    = n

This is a contradiction, since we showed n > n. Hence, |E′| ≤ 2n^{1+1/k} ∈ O(n^{1+1/k}).

Stretch factor For every e = {u, v} ∈ E we have dG′(u, v) ≤ (2k − 1) · dG(u, v): we leave e out of E′ only if, at the moment e is considered, dG′(u, v) ≤ 2k − 1. For arbitrary u, v ∈ V, let Pu,v = (u, w1, . . . , wℓ, v) be the shortest u–v path in G. Then,

dG′(u, v) ≤ dG′(u, w1) + · · · + dG′(wℓ, v)                       (simulating Pu,v in G′)
 ≤ (2k − 1) · dG(u, w1) + · · · + (2k − 1) · dG(wℓ, v)            (apply the edge stretch to each edge)
 = (2k − 1) · (dG(u, w1) + · · · + dG(wℓ, v))                     (rearrange)
 = (2k − 1) · dG(u, v)                                            (definition of Pu,v)

Let us consider the family G of graphs on n vertices with girth > 2k. It can be shown by contradiction that a graph G with n vertices and girth > 2k cannot have a proper (2k − 1)-spanner²: assume G′ is a proper (2k − 1)-spanner with some edge {u, v} removed. Since G′ is a (2k − 1)-spanner, dG′(u, v) ≤ 2k − 1. Adding {u, v} back to G′ then forms a cycle of length at most 2k, contradicting the assumption that G has girth > 2k.
Let g(n, k) be the maximum possible number of edges in a graph from G. By the above argument, a graph on n vertices with g(n, k) edges cannot have a proper (2k − 1)-spanner. Note that the greedy construction of Theorem 12.2 always produces a (2k − 1)-spanner with girth > 2k, hence with ≤ g(n, k) edges. The size of the spanner is therefore asymptotically tight if Conjecture 12.3 holds.

Conjecture 12.3. [Erd64] For a fixed k ≥ 1, there exists a family of graphs on n vertices with girth at least 2k + 1 and Ω(n^{1+1/k}) edges.

Remark 1 By considering edges in increasing weight order, the greedy construction also works for weighted graphs [FS16].

Remark 2 The girth conjecture is confirmed for k ∈ {1, 2, 3, 5} [Wen91, Woo06].

12.2 β-additive spanners


In this section, we will use a random process to select a subset of vertices by
independently selecting vertices to join the subset. The following claim will
be useful for analysis:

Claim 12.4. If one picks vertices independently with probability p to be in S ⊆ V, where |V| = n, then

1. E[|S|] = np.

2. For any vertex v with degree d(v) and neighbourhood N(v) = {u ∈ V : (u, v) ∈ E},

   • E[|N(v) ∩ S|] = d(v) · p

   • Pr[|N(v) ∩ S| = 0] ≤ e^{−d(v)·p}

Proof. For each v ∈ V, let Xv be the indicator variable of the event v ∈ S. By construction, E[Xv] = Pr[Xv = 1] = p.
²A proper subgraph in this case refers to removing at least one edge.

1. We have

   E[|S|] = E[Σ_{v∈V} Xv]          (by construction of S)
    = Σ_{v∈V} E[Xv]                (linearity of expectation)
    = Σ_{v∈V} p                    (since E[Xv] = Pr[Xv = 1] = p)
    = np                           (since |V| = n)

2. Similarly,

   E[|N(v) ∩ S|] = E[Σ_{u∈N(v)} Xu]    (by definition of N(v) ∩ S)
    = Σ_{u∈N(v)} E[Xu]                 (linearity of expectation)
    = Σ_{u∈N(v)} p                     (since E[Xu] = Pr[Xu = 1] = p)
    = d(v) · p                         (since |N(v)| = d(v))

   The probability that none of the neighbours of v is in S is

   Pr[|N(v) ∩ S| = 0] = (1 − p)^{d(v)} ≤ (e^{−p})^{d(v)} = e^{−p·d(v)},

   since 1 − x ≤ e^{−x} for any x.
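The final inequality (1 − p)^{d(v)} ≤ e^{−p·d(v)} can be sanity-checked numerically; the snippet below (an illustration, not from the notes) compares the exact miss probability with the exponential bound:

```python
import math

def miss_probability(d, p):
    """Exact probability that none of d independent Bernoulli(p) picks succeeds."""
    return (1 - p) ** d

def exp_bound(d, p):
    """The bound e^{-p d} from Claim 12.4."""
    return math.exp(-p * d)
```

The bound holds for every degree d ≥ 0 and probability p ∈ [0, 1], since 1 − p ≤ e^{−p} pointwise.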

Remark Õ hides logarithmic factors. For example, O(n log^{1000} n) ⊆ Õ(n).

Theorem 12.5. [ACIM99] Every graph G on n vertices has a 2-additive spanner with Õ(n^{3/2}) edges.

Proof.
Construction Partition the vertex set V into light vertices L and heavy vertices H, where

L = {v ∈ V : deg(v) ≤ n^{1/2}} and H = {v ∈ V : deg(v) > n^{1/2}}

1. Let E1′ be the set of all edges incident to some vertex in L.

2. Initialize E2′ = ∅.

   • Choose S ⊆ V by independently putting each vertex into S with probability 10n^{−1/2} log n.
   • For each s ∈ S, add a Breadth-First-Search (BFS) tree rooted at s to E2′.

Select the edges of the spanner to be E′ = E1′ ∪ E2′.
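The two-part construction above can be sketched in Python (illustrative only, not from the notes; the constant 10 mirrors the sampling probability 10n^{−1/2} log n, and the light/heavy threshold is √n):

```python
import math
import random
from collections import deque

def bfs_tree_edges(adj, root):
    """Edges of a BFS tree rooted at `root`, normalized as (min, max) pairs."""
    parent = {root: None}
    queue = deque([root])
    tree = []
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                tree.append((min(u, v), max(u, v)))
                queue.append(v)
    return tree

def two_additive_spanner(n, edges, c=10, seed=0):
    """Sketch of the 2-additive construction: E1' (edges at light vertices)
    plus E2' (BFS trees rooted at a random hitting set S)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    light = {v for v in range(n) if len(adj[v]) <= n ** 0.5}
    E1 = {(min(u, v), max(u, v)) for u, v in edges if u in light or v in light}
    p = min(1.0, c * math.log(n) / math.sqrt(n))  # ~ 10 n^{-1/2} log n
    S = [v for v in range(n) if rng.random() < p]
    E2 = set()
    for s in S:
        E2.update(bfs_tree_edges(adj, s))
    return E1 | E2
```

On a graph where every vertex is light (e.g., a path), E1′ already contains all edges, so the spanner preserves all distances exactly.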
Number of edges We can bound the expected number of edges in the spanner. There are at most n light vertices, each with degree at most n^{1/2}, so

|E1′| ≤ n · n^{1/2} = n^{3/2}.

By Claim 12.4 with p = 10n^{−1/2} log n, the expected size of S is

E[|S|] = n · 10n^{−1/2} log n = 10n^{1/2} log n.

The number of edges in each BFS tree is at most n − 1, so E[|E2′|] ≤ n · E[|S|]. Therefore,

E[|E′|] = E[|E1′ ∪ E2′|] ≤ |E1′| + E[|E2′|] ≤ n^{3/2} + 10n^{3/2} log n ∈ Õ(n^{3/2}).


Stretch factor Consider two arbitrary vertices u and v with shortest path Pu,v in G. Let h be the number of heavy vertices on Pu,v. We split the analysis into two cases: (i) h ≤ 1; (ii) h ≥ 2. Recall that a heavy vertex has degree more than n^{1/2}.
Case (i) All edges in Pu,v are adjacent to a light vertex and are thus in E1′. Hence, dG′(u, v) = dG(u, v), with additive stretch 0.
Case (ii)
Claim 12.6. Suppose there exists a vertex w ∈ Pu,v such that {w, s} ∈ E for some s ∈ S. Then dG′(u, v) ≤ dG(u, v) + 2.

[Figure: the path Pu,v passing through w, which has a neighbour s ∈ S.]

Proof.

dG′(u, v) ≤ dG′(u, s) + dG′(s, v)                      (1)
 = dG(u, s) + dG(s, v)                                 (2)
 ≤ dG(u, w) + dG(w, s) + dG(s, w) + dG(w, v)           (3)
 ≤ dG(u, w) + 1 + 1 + dG(w, v)                         (4)
 ≤ dG(u, v) + 2                                        (5)

(1) By the triangle inequality
(2) Since we add the BFS tree rooted at s, which preserves distances from s
(3) By the triangle inequality
(4) Since {s, w} ∈ E, dG(w, s) = dG(s, w) = 1
(5) Since w lies on Pu,v

Let w be a heavy vertex in Pu,v with degree d(w) > n^{1/2}. By Claim 12.4 with p = 10n^{−1/2} log n,

Pr[|N(w) ∩ S| = 0] ≤ e^{−10 log n} = n^{−10}.

Taking a union bound over all (n choose 2) pairs of vertices u and v,

Pr[∃ u, v ∈ V : Pu,v has h ≥ 2 and no neighbour in S] ≤ (n choose 2) · n^{−10} ≤ n^{−8}.

Then Claim 12.6 tells us that the additive stretch is at most 2 with probability ≥ 1 − n^{−8}.

Therefore, with high probability (≥ 1 − n^{−8}), the construction yields a 2-additive spanner.

Remark A way to remove the log factors from Theorem 12.5 is to sample only n^{1/2} nodes into S, and then add all edges incident to nodes that do not have a neighbour in S. The same argument then shows that this costs O(n^{3/2}) edges in expectation.
Theorem 12.7. [Che13] Every graph G on n vertices has a 4-additive spanner with Õ(n^{7/5}) edges.

Proof.
Construction Partition the vertex set V into light vertices L and heavy vertices H, where

L = {v ∈ V : deg(v) ≤ n^{2/5}} and H = {v ∈ V : deg(v) > n^{2/5}}

1. Let E1′ be the set of all edges incident to some vertex in L.

2. Initialize E2′ = ∅.

   • Choose S ⊆ V by independently putting each vertex into S with probability 30n^{−3/5} log n.
   • For each s ∈ S, add a Breadth-First-Search (BFS) tree rooted at s to E2′.

3. Initialize E3′ = ∅.

   • Choose S′ ⊆ V by independently putting each vertex into S′ with probability 10n^{−2/5} log n.
   • For each heavy vertex w ∈ H, if there exists an edge (w, s′) for some s′ ∈ S′, add one such edge to E3′.
   • For all s, s′ ∈ S′, add the shortest path among all s–s′ paths with ≤ n^{1/5} internal heavy vertices.
     Note: if all paths between s and s′ contain > n^{1/5} heavy vertices, do not add any edge to E3′ for this pair.

Select the edges of the spanner to be E′ = E1′ ∪ E2′ ∪ E3′.

Number of edges

• Since there are at most n light vertices, |E1′| ≤ n · n^{2/5} = n^{7/5}.

• By Claim 12.4 with p = 30n^{−3/5} log n, E[|S|] = n · 30n^{−3/5} log n = 30n^{2/5} log n. Then, since every BFS tree has at most n − 1 edges³, E[|E2′|] ≤ n · E[|S|] = 30n^{7/5} log n ∈ Õ(n^{7/5}).

• Since there are ≤ n heavy vertices, ≤ n edges of the form (w, s′) for w ∈ H, s′ ∈ S′ are added to E3′. Then, for the shortest s–s′ paths with ≤ n^{1/5} internal heavy vertices, only the edges adjacent to the heavy vertices need to be counted, because those adjacent to light vertices are already accounted for in E1′. By Claim 12.4 with p = 10n^{−2/5} log n, E[|S′|] = n · 10n^{−2/5} log n = 10n^{3/5} log n. As |S′| is highly concentrated around its expectation, we have E[|S′|²] ∈ Õ(n^{6/5}). So E3′ contributes ≤ n + (|S′| choose 2) · n^{1/5} ∈ Õ(n^{7/5}) edges in expectation to the count of |E′|.

³Though we may have repeated edges

Stretch factor Consider two arbitrary vertices u and v with shortest path Pu,v in G. Let h be the number of heavy vertices on Pu,v. We split the analysis into three cases: (i) h ≤ 1; (ii) 2 ≤ h ≤ n^{1/5}; (iii) h > n^{1/5}. Recall that a heavy vertex has degree more than n^{2/5}.
Case (i) All edges in Pu,v are adjacent to a light vertex and are thus in E1′. Hence, dG′(u, v) = dG(u, v), with additive stretch 0.
Case (ii) Denote the first and last heavy vertices in Pu,v by w and w′ respectively. Recall that in Case (ii), including w and w′, there are at most n^{1/5} heavy vertices between w and w′. By Claim 12.4 with p = 10n^{−2/5} log n,

Pr[|N(w) ∩ S′| = 0], Pr[|N(w′) ∩ S′| = 0] ≤ e^{−n^{2/5}·10n^{−2/5} log n} = n^{−10}.

Let s, s′ ∈ S′ be vertices adjacent in G′ to w and w′ respectively. Observe that s − w − · · · − w′ − s′ is a path between s and s′ with at most n^{1/5} internal heavy vertices. Let P*_{s,s′} be the shortest path of length l* from s to s′ with at most n^{1/5} internal heavy vertices. By construction, we have added P*_{s,s′} to E3′. Observe:

• By definition of P*_{s,s′}, we have l* ≤ dG(s, w) + dG(w, w′) + dG(w′, s′) = dG(w, w′) + 2.

• Since there are no internal heavy vertices between u and w, nor between w′ and v, Case (i) tells us that dG′(u, w) = dG(u, w) and dG′(w′, v) = dG(w′, v).
Thus,

dG′(u, v)
 ≤ dG′(u, w) + dG′(w, w′) + dG′(w′, v)                                  (1)
 ≤ dG′(u, w) + dG′(w, s) + dG′(s, s′) + dG′(s′, w′) + dG′(w′, v)        (2)
 ≤ dG′(u, w) + dG′(w, s) + l* + dG′(s′, w′) + dG′(w′, v)                (3)
 ≤ dG′(u, w) + dG′(w, s) + dG(w, w′) + 2 + dG′(s′, w′) + dG′(w′, v)     (4)
 = dG′(u, w) + 1 + dG(w, w′) + 2 + 1 + dG′(w′, v)                       (5)
 = dG(u, w) + 1 + dG(w, w′) + 2 + 1 + dG(w′, v)                         (6)
 ≤ dG(u, v) + 4                                                         (7)

(1) Decomposing Pu,v in G′

(2) Triangle inequality
(3) P*_{s,s′} is added to E3′
(4) Since l* ≤ dG(w, w′) + 2
(5) Since (w, s) ∈ E′, (s′, w′) ∈ E′ and dG′(w, s) = dG′(s′, w′) = 1
(6) Since dG′(u, w) = dG(u, w) and dG′(w′, v) = dG(w′, v)
(7) By definition of Pu,v

[Figure: the path Pu,v with first heavy vertex w and last heavy vertex w′; s, s′ ∈ S′ are adjacent to w and w′ respectively, and P*_{s,s′} of length l* connects s and s′.]

Case (iii)
Claim 12.8. There cannot be a vertex y that is a common neighbour of more than 3 heavy vertices in Pu,v.

Proof. Suppose, for a contradiction, that y is adjacent to w1, w2, w3, w4 ∈ Pu,v, appearing in this order along the path. The subpath of Pu,v from w1 to w4 has length at least 3, while w1 − y − w4 has length 2, so u − · · · − w1 − y − w4 − · · · − v is a shorter u–v path than Pu,v, contradicting the fact that Pu,v is a shortest u–v path.

Note that if y lies on Pu,v it can have at most two neighbours on Pu,v.

Claim 12.8 tells us that |⋃_{w∈H∩Pu,v} N(w)| ≥ (1/3) · Σ_{w∈H∩Pu,v} |N(w)|. Let

Nu,v = {x ∈ V : (x, w) ∈ E for some w ∈ Pu,v}.

Applying Claim 12.4 with p = 30n^{−3/5} log n together with Claim 12.8, we get

Pr[|Nu,v ∩ S| = 0] ≤ e^{−p·|Nu,v|} ≤ e^{−p·(1/3)·|H∩Pu,v|·n^{2/5}} ≤ e^{−10 log n} = n^{−10}.

Taking a union bound over all (n choose 2) pairs of vertices u and v,

Pr[∃ u, v ∈ V : Pu,v has h > n^{1/5} and no neighbour in S] ≤ (n choose 2) · n^{−10} ≤ n^{−8}.

Then Claim 12.6 tells us that in this case the additive stretch is at most 2, hence at most 4 overall.

Therefore, with high probability (≥ 1 − O(n^{−8})), the construction yields a 4-additive spanner.

Remark Suppose the shortest u–v path Pu,v contains a vertex s ∈ S. Then a shortest u–v path is contained in E′: the BFS tree rooted at s contains a shortest u–s path and a shortest s–v path, which concatenate to a shortest u–v path because s lies on Pu,v. In other words, the triangle inequality between u, s, v becomes tight.

Concluding remarks

           Additive β   Number of edges   Remarks
[ACIM99]   2            Õ(n^{3/2})        Almost⁴ tight [Woo06]
[Che13]    4            Õ(n^{7/5})        Open: is Õ(n^{4/3}) possible?
[BKMP05]   ≥ 6          Õ(n^{4/3})        Tight [AB17]

Remark 1 A k-additive spanner is also a (k + 1)-additive spanner.

Remark 2 The additive stretch factors appear in even numbers because current constructions “leave” the shortest path, then “re-enter” it later, introducing an even number of extra edges. Regardless, it is a folklore theorem that it suffices to only consider additive spanners with even error. Specifically, any construction of an additive (2k + 1)-spanner on ≤ E(n) edges implies a construction of an additive 2k-spanner on O(E(n)) edges. Proof sketch: copy the input graph G and put edges between the two copies to yield a bipartite graph H; run the spanner construction on H; “collapse” the parts back into one. The distance error must be even over a bipartite graph, and so the additive (2k + 1)-spanner construction must actually give an additive 2k-spanner, by showing that the error bound is preserved over the “collapse”.

⁴O(n^{4/3}/2^{Ω(√log n)}) is still conceivable; i.e., the gap is bigger than polylog, but still subpolynomial.
Chapter 13

Preserving cuts

In the previous chapter, we looked at preserving distances via spanners. In


this chapter, we look at preserving cut sizes.

Definition 13.1 (Cut and minimum cut). Consider a graph G = (V, E).

• For S ⊆ V, S ≠ ∅, S ≠ V, a non-trivial cut in G is defined as the edge set CG(S, V \ S) = {(u, v) ∈ E : u ∈ S, v ∈ V \ S}.

• The cut size is defined as EG(S, V \ S) = Σ_{e∈CG(S,V\S)} w(e). If the graph G is unweighted, we have w(e) = 1 for all e ∈ E, so EG(S, V \ S) = |CG(S, V \ S)|.

• The minimum cut size of the graph G is the minimum over all non-trivial cuts, denoted µ(G) = min_{S⊆V, S≠∅, S≠V} EG(S, V \ S). For an unweighted graph, the minimum cut size is the smallest number of edges whose removal disconnects the graph.

• A cut CG(S, V \ S) is said to be minimum if EG(S, V \ S) = µ(G).

Given an undirected unweighted¹ graph G = (V, E), our goal in this chapter is to construct a sparse² weighted graph H = (V, E′) with E′ ⊆ E and weight function w : E′ → R⁺ such that

(1 − ε) · EG(S, V \ S) ≤ EH(S, V \ S) ≤ (1 + ε) · EG(S, V \ S)

for every S ⊆ V, S ≠ ∅, S ≠ V.
¹This can also be generalized to weighted graphs.
²For now, sparse means an almost linear number of edges in n; we will make this concrete soon.


13.1 Warm up: G = Kn


As a warm up, imagine that G is a complete graph. Consider the following
procedure to construct H:

1. Let p = Ω(log n/(ε²n)).

2. Independently put each edge e ∈ E into E′ with probability p.

3. Define w(e) = 1/p for each edge e ∈ E′.

One can check that this suffices for G = Kn . For that, fix an arbitrary cut
(any cut size is ≥ n − 1), and analyze the probability of the above condition
on the size of the cut being badly estimated in H. Then, take a union bound
over all cuts. In the exercise, we discuss an even more general form of this
warm up, where G is a graph with a constant edge expansion.
The rest of this section is devoted to proving a similar result for general
graphs. For that, we first need to review cut counting results that you might
have seen in previous courses (e.g., Algorithms, Probability, and Computing).

13.2 Prelim: Contractions and Cut Counting


Recall Karger’s random contraction algorithm [Kar93]³:

Algorithm 29 RandomContraction(G = (V, E))


while |V | > 2 do
e ← Pick an edge uniformly at random from E
G ← G/e . Contract edge e
end while
return The remaining cut . This may be a multi-graph
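The pseudocode above can be sketched in Python using a union–find structure to merge super-vertices (an illustration, not part of the original notes; the pseudocode of Algorithm 29 is the reference):

```python
import random

def random_contraction(n, edges, seed=None):
    """Karger's random contraction on vertices 0..n-1; returns one side S of
    the resulting cut and the list of cut edges."""
    rng = random.Random(seed)
    parent = list(range(n))  # union-find forest over super-vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    remaining = n
    multi = list(edges)  # the surviving multigraph edges (no self-loops)
    while remaining > 2 and multi:
        u, v = multi[rng.randrange(len(multi))]  # uniform over remaining edges
        parent[find(u)] = find(v)                # contract the chosen edge
        remaining -= 1
        # discard edges that became self-loops after the contraction
        multi = [(a, b) for a, b in multi if find(a) != find(b)]
    side = find(0)
    S = {v for v in range(n) if find(v) == side}
    cut = [(a, b) for a, b in edges if (a in S) != (b in S)]
    return S, cut
```

On a triangle, every non-trivial cut has exactly 2 edges, so any run returns a cut of size 2.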

Theorem 13.2. For a fixed minimum cut S* in the graph, RandomContraction returns it with probability ≥ 1/(n choose 2).
Proof. Fix a minimum cut S* in the graph and suppose |S*| = k. In order for RandomContraction to successfully return S*, none of the edges in S* may be selected in the whole contraction process.
Consider the i-th step of the loop of RandomContraction, for i = 0, 1, . . . . By construction, there will be n − i vertices in the graph at this point. Since µ(G) = k, each vertex has degree at least k (otherwise that vertex itself
³Also, see https://en.wikipedia.org/wiki/Karger%27s_algorithm

gives a cut of size smaller than k), so there are at least (n − i)k/2 edges in the graph. Thus,

Pr[Success] ≥ (1 − k/(nk/2)) · (1 − k/((n−1)k/2)) · · · (1 − k/(3k/2))
 = (1 − 2/n) · (1 − 2/(n−1)) · · · (1 − 2/3)
 = ((n−2)/n) · ((n−3)/(n−1)) · · · (1/3)
 = 2/(n(n − 1))
 = 1/(n choose 2)

Corollary 13.3. There are at most (n choose 2) minimum cuts in a graph.

Proof. Let C1, . . . , CN be the minimum cuts in G. According to Theorem 13.2, we know

Pr[Ci is found by RandomContraction(G)] ≥ 1/(n choose 2).

Now, observe that for any two distinct indices i, j ∈ [N], the events “Ci is found by RandomContraction(G)” and “Cj is found by RandomContraction(G)” are disjoint. Therefore, it follows that

Pr[a minimum cut is found by RandomContraction(G)]
 = Σ_{i=1}^{N} Pr[Ci is found by RandomContraction(G)] ≥ N/(n choose 2).

Since this quantity is a probability, it is at most 1, and we obtain N ≤ (n choose 2).

Remark There exist (multi-)graphs with (n choose 2) minimum cuts: consider a cycle where there are µ(G)/2 parallel edges between every pair of adjacent vertices (the bound is tight when µ(G) is even).

More generally, we can bound the number of cuts of size at most α · µ(G), for α ≥ 1.

Theorem 13.4. In an undirected graph, the number of α-minimum cuts is at most n^{2α}.

Proof. The proof is analogous to that of Theorem 13.2, except that we now bound the probability that any fixed α-minimum cut is output. We continue contracting until r = ⌈2α⌉ vertices remain, and then pick one of the 2^{r−1} cuts of the resulting graph uniformly at random. Following the calculations of Theorem 13.2, with a more careful lower bound on the success probability, one shows that the probability of picking the fixed cut is at least 1/n^{2α}, which means the number of α-minimum cuts is at most n^{2α}. For more details, see Lemma 2.2 and Appendix A (in particular, Corollary A.7) of a version⁴ of [Kar99].

13.3 Uniform edge sampling


Given a graph G with minimum cut size µ(G) = k, consider the following
procedure to construct H:

1. Set p = min{1, c log n/(ε²k)} for some constant c.

2. Independently put each edge e ∈ E into E′ with probability p.

3. Define w(e) = 1/p for each edge e ∈ E′.
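The three steps above can be sketched as follows (illustrative only; n, the minimum cut size k, and eps are parameters, and the constant c is arbitrary here):

```python
import math
import random

def uniform_sparsify(n, edges, k, eps, c=3.0, seed=0):
    """Uniform sampling sketch: keep each edge with probability
    p = min(1, c*log(n) / (eps^2 * k)), where k = mu(G), and give
    kept edges weight 1/p so every cut is preserved in expectation."""
    rng = random.Random(seed)
    p = min(1.0, c * math.log(n) / (eps ** 2 * k))
    return {e: 1.0 / p for e in edges if rng.random() < p}, p

def cut_weight(H, S):
    """Weighted size of the cut (S, V \\ S) in the sampled graph H."""
    return sum(w for (u, v), w in H.items() if (u in S) != (v in S))
```

Note that for small graphs (or small minimum cut), p clips to 1 and the "sparsifier" is just the graph itself, matching the discussion of the dumbbell example below.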

Theorem 13.5. With high probability, for every S ⊆ V, S ≠ ∅, S ≠ V,

(1 − ε) · EG(S, V \ S) ≤ EH(S, V \ S) ≤ (1 + ε) · EG(S, V \ S)

Proof. Fix an arbitrary cut CG(S, V \ S) and suppose EG(S, V \ S) = k′ = α · k for some α ≥ 1.
⁴Version available at: http://people.csail.mit.edu/karger/Papers/skeleton-journal.ps

[Figure: a cut (S, V \ S) of size k′.]
Let Xe be the indicator for the edge e ∈ CG(S, V \ S) being inserted into E′. By construction, E[Xe] = Pr[Xe = 1] = p. Then, by linearity of expectation, E[|CH(S, V \ S)|] = Σ_{e∈CG(S,V\S)} E[Xe] = k′p. As we put weight 1/p on each edge in E′, E[EH(S, V \ S)] = k′. Using a Chernoff bound, for sufficiently large c, we get:

Pr[cut CG(S, V \ S) is badly estimated in H]
 = Pr[|EH(S, V \ S) − E[EH(S, V \ S)]| > ε · k′]         (definition of bad estimation)
 = Pr[|CH(S, V \ S) − E[|CH(S, V \ S)|]| > ε · k′p]      (multiply both sides by p)
 ≤ 2e^{−ε²k′p/3}                                         (Chernoff bound)
 = 2e^{−ε²αkp/3}                                         (since k′ = αk)
 ≤ n^{−10α}                                              (for sufficiently large c)

Using Theorem 13.4 and a union bound over all possible cuts in G,

Pr[any cut is badly estimated in H] ≤ ∫₁^∞ n^{2α} · n^{−10α} dα     (from Theorem 13.4 and the above)
 ≤ n^{−5}                                                           (loose upper bound)

Therefore, all cuts in G are well estimated in H with high probability.

Theorem 13.6. [Kar99] Given a graph G, consider sampling every edge e ∈ E into E′ with independent random weights in the interval [0, 1]. Let H = (V, E′) be the sampled graph and suppose that the expected weight of every cut in H is ≥ c log n/ε², for some constant c. Then, with high probability, every cut in H has weighted size within (1 ± ε) of its expectation.

Theorem 13.6 can be proved using a variant of the earlier proof. Interested readers can see Theorem 2.1 of [Kar99].

13.4 Non-uniform edge sampling


Unfortunately, uniform sampling does not work well on graphs with small
minimum cut. Consider the following example of a graph composed of two
cliques of size n with only one edge connecting them:

Running uniform edge sampling will not sparsify the above dumbbell graph: µ(G) = 1 forces the sampling probability p = min{1, c log n/ε²} to be large (in fact 1), so essentially all edges are kept.

Before we describe a non-uniform edge sampling process [BK96], we first


introduce the definition of k-strong components.

Definition 13.7 (k-connected). A graph is k-connected if the value of each


cut of G is at least k.

Definition 13.8 (k-strong component). A k-strong component is a maximal


k-connected vertex-induced subgraph.

Definition 13.9 (edge strength). Given an edge e, its strength (or strong
connectivity) ke is the maximum k such that e is in a k-strong component.
We say an edge is k-strong if ke ≥ k.

Remark The (standard) connectivity of an edge e = (u, v) is the minimum


cut size over all cuts that separate its endpoints u and v. In particular, an
edge’s strong connectivity is no more than the edge’s (standard) connectivity
since a cut size of k between u and v implies there is no (k + 1)-connected
component containing both u and v.

Lemma 13.10. The following holds for k-strong components:

1. ke is uniquely defined for every edge e.

2. For any k, the k-strong components are disjoint.

3. For any two values k1 < k2, the k2-strong components are a refinement of the k1-strong components.
13.4. NON-UNIFORM EDGE SAMPLING 177

4. Σ_{e∈E} 1/ke ≤ n − 1.
   Intuition: if the graph is a tree, each ke equals one; because there are n − 1 edges in a tree, the sum equals n − 1. If there are many edges (the graph is not a tree), then many of them have high strength, and the sum is therefore less than n − 1.
Proof.
[Figure: k2-strong components (k1 < k2) form a refinement of the k1-strong components.]

1. ke is defined as a maximum, and the maximum of a set is unique; therefore ke is uniquely defined.

2. Suppose, for a contradiction, there are two different intersecting k-


strong components. Since their union is also k-strong, this contradicts
the fact that they were maximal.

3. For k1 < k2 , a k2 -strong component is also k1 -strong, so it is a subset


of some k1 -strong component.

4. Consider a minimum cut CG(S, V \ S). Since ke ≥ µ(G) for all edges e ∈ CG(S, V \ S), these edges contribute at most µ(G) · 1/µ(G) = 1 to the sum. Remove these edges from G and repeat the argument on the remaining connected components (excluding isolated vertices). Since each cut removal contributes at most 1 to the sum, and the process stops after at most n − 1 removals (when we reach n components), Σ_{e∈E} 1/ke ≤ n − 1.

For a graph G with minimum cut size µ(G) = k, consider the following procedure to construct H:

1. Set q = c log n/ε² for some constant c.

2. Independently put each edge e ∈ E into E′ with probability pe = min{1, q/ke}.

3. Define w(e) = 1/pe for each edge e ∈ E′.
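As a sketch of the procedure (illustrative only): computing the strengths ke exactly is itself non-trivial, so the snippet below assumes they are given as input:

```python
import math
import random

def nonuniform_sparsify(n, edges, strength, eps, c=3.0, seed=0):
    """Non-uniform sampling sketch: edge e is kept with probability
    p_e = min(1, q / k_e) and weighted 1/p_e, where strength[e] plays
    the role of k_e (assumed precomputed)."""
    rng = random.Random(seed)
    q = c * math.log(n) / eps ** 2
    H = {}
    for e in edges:
        p_e = min(1.0, q / strength[e])
        if rng.random() < p_e:
            H[e] = 1.0 / p_e
    return H
```

Edges inside highly connected (large-ke) parts are sampled aggressively, while low-strength edges, such as the bridge of the dumbbell graph, are kept with probability 1.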

Lemma 13.11. E[|E′|] ∈ O(n log n/ε²).

Proof. Let Xe be the indicator random variable of whether edge e ∈ E was selected into E′. By construction, E[Xe] = Pr[Xe = 1] = pe. Then,

E[|E′|] = E[Σ_{e∈E} Xe]        (by definition)
 = Σ_{e∈E} E[Xe]               (linearity of expectation)
 = Σ_{e∈E} pe                  (since E[Xe] = Pr[Xe = 1] = pe)
 ≤ Σ_{e∈E} q/ke                (since pe = min{1, q/ke} ≤ q/ke)
 ≤ q(n − 1)                    (since Σ_{e∈E} 1/ke ≤ n − 1)
 ∈ O(n log n/ε²)               (since q = c log n/ε² for some constant c)

Remark If we apply a Chernoff bound, we see that |E′| is highly concentrated around its expectation:

Pr[ | |E′| − E[|E′|] | ≥ ε · E[|E′|] ] ≤ 2 exp(−c log n/3)

So the probability that |E′| is not close to its expectation is very low.
Theorem 13.12. With high probability, for every S ⊆ V, S ≠ ∅, S ≠ V,

(1 − ε) · EG(S, V \ S) ≤ EH(S, V \ S) ≤ (1 + ε) · EG(S, V \ S)

Proof. Let k1 < k2 < · · · < ks be all possible strength values in the graph. Consider G′ as a weighted graph with edge weight ke/q on each edge e ∈ E, and a family of unweighted graphs F1, . . . , Fs, where Fi = (V, Ei) is the graph with edges Ei = {e ∈ E : ke ≥ ki} belonging to the ki-strong components of G. Observe that:

• s ≤ |E| since each edge has only 1 strength value

• By construction of Fi ’s, if an edge e has strength ki in Fi , ke = ki in G

• F1 = G

• For each i ≤ s − 1, Fi+1 is a subgraph of Fi

• By defining k0 = 0, one can write G′ = Σ_{i=1}^{s} ((ki − k_{i−1})/q) · Fi. This is because an edge with strength ki appears in Fi, F_{i−1}, . . . , F1, and the terms telescope to yield a weight of ki/q.
The sampling process in G0 directly translates to a sampling process in
each graph in {Fi }i∈[s] — when we add an edge e into E 0 , we also add it to
the edge sets of Fke , . . . , F1 , i.e., we use the same decision of whether to keep
edge e or not in G0 for all Fi -s that contain it.
First, consider the sampling on the graph F1 = G. We know that each edge e ∈ E is sampled with probability pe = q/ke, where ke ≥ k1 by construction of F1. In this graph, consider any non-trivial cut C_{F1}(S, V \ S) and let e be any edge of this cut. We can observe that ke ≤ E_{F1}(S, V \ S), as otherwise this cut would contradict the strength ke of e. Then, using the indicator random variables Xe for whether edge e ∈ E1 has been sampled, the expected size of this cut in F1 after the sampling is

E[E_{F1}(S, V \ S)] = E[ Σ_{e∈C_{F1}(S,V\S)} Xe ]
 = Σ_{e∈C_{F1}(S,V\S)} E[Xe]                  (linearity of expectation)
 = Σ_{e∈C_{F1}(S,V\S)} q/ke                   (since E[Xe] = Pr[Xe = 1] = q/ke)
 ≥ Σ_{e∈C_{F1}(S,V\S)} q/E_{F1}(S, V \ S)     (since ke ≤ E_{F1}(S, V \ S))
 = q = c log n/ε²                             (the cut has E_{F1}(S, V \ S) edges)

Since this holds for any cut in F1, we can apply Theorem 13.6 to conclude that, with high probability, all cuts in F1 have size within (1 ± ε) of their expectation. Note that the same holds after scaling the edge weights, i.e., for ((k1 − k0)/q) · F1 = (k1/q) · F1.
In a similar way, consider any other subgraph Fi ⊆ G as previously
defined. Since an Fi contains the edges from the ki -strong components of

G, any edge e ∈ Ei belongs only to one of them. Let D be the ki -strong


component such that e ∈ D. By observing that e necessarily belongs to a
ke -connected subgraph of G by definition, and that ke ≥ ki , then such a
ke -connected subgraph is entirely contained in D. Hence, the strength of e
with respect to the graph D is equal to ke . By a similar argument as done
for F1 , we can show that the expected size of a cut CD (S, V \ S) in D after
the sampling of the edges is
E[E_D(S, V \ S)] = Σ_{e∈C_D(S,V\S)} q/ke      (since E[Xe] = Pr[Xe = 1] = q/ke)
 ≥ Σ_{e∈C_D(S,V\S)} q/E_D(S, V \ S)           (since ke ≤ E_D(S, V \ S))
 = q = c log n/ε²
Therefore, we can once again apply Theorem 13.6 to the subgraph D, which states that, with high probability, all cuts in D are within (1 ± ε) of their expected value. We arrive at the conclusion that this also holds for Fi by applying the same argument to all the ki-strong components of Fi.
To sum up, for each i ∈ [s], Theorem 13.6 tells us that every cut in Fi is well-estimated with high probability. Then, a union bound over {Fi}_{i∈[s]} lower bounds the probability that all Fi's have all cuts within (1 ± ε) of their expected values, and this also happens with high probability. This tells us that any cut in G is well-estimated with high probability, also because all multiplicative factors ki − k_{i−1} in the decomposition G′ = Σ_{i=1}^{s} ((ki − k_{i−1})/q) Fi are positive.
Part V

Online Algorithms and


Competitive Analysis

Chapter 14

Warm up: Ski rental

We now study the class of online problems where one has to commit to
provably good decisions as data arrive in an online fashion. To measure the
effectiveness of online algorithms, we compare the quality of the produced
solution against the solution from an optimal offline algorithm that knows
the whole sequence of information a priori. The tool we will use for doing
such a comparison is competitive analysis.

Remark We do not assume that the optimal offline algorithm has to be


computationally efficient. Under the competitive analysis framework, only
the quality of the best possible solution matters.

Definition 14.1 (α-competitive online algorithm). Let σ be an input se-


quence, c be a cost function, A be the online algorithm and OP T be the
optimal offline algorithm. Then, denote cA (σ) as the cost incurred by A
on σ and cOP T (σ) as the cost incurred by OPT on the same sequence. We
say that an online algorithm is α-competitive if for any input sequence σ,
cA (σ) ≤ α · cOP T (σ).

Definition 14.2 (Ski rental problem). Suppose we wish to ski every day but
we do not have any skiing equipment initially. On each day, we can choose
between:

• Rent the equipment for a day, at CHF 1

• Buy the equipment (once and for all), for CHF B

In the toy setting where we may break our leg on each day (and cannot ski
thereafter), let d be the (unknown) total number of days we ski. What is the
best online strategy for renting/buying?


Claim 14.3. A = “Rent for B days, then buy on day B+1” is a 2-competitive
algorithm.

Proof. If d ≤ B, the optimal offline strategy is to rent every day, incurring
a cost of c_OPT(d) = d. A will also rent for d days and incur a cost of
c_A(d) = d = c_OPT(d). If d > B, the optimal offline strategy is to buy
the equipment immediately, incurring a cost of c_OPT(d) = B. A will rent
for B days and then buy the equipment for CHF B, incurring a cost of
c_A(d) = 2B ≤ 2 · c_OPT(d). Thus, for any d, c_A(d) ≤ 2 · c_OPT(d).
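The break-even strategy and its competitive ratio can be checked with a small simulation (a sketch, not part of the notes; the function names are ours):

```python
# Sketch: costs of the "rent for B days, then buy on day B+1" strategy
# versus the optimal offline strategy, for every possible number of ski days d.

def online_cost(d, B):
    """Cost of the 2-competitive strategy over d ski days."""
    return d if d <= B else 2 * B   # rent d days, or rent B days then buy at B

def offline_cost(d, B):
    """Optimal offline cost: rent if d <= B, otherwise buy immediately."""
    return min(d, B)

B = 10
ratios = [online_cost(d, B) / offline_cost(d, B) for d in range(1, 100)]
assert max(ratios) <= 2   # never worse than twice OPT
```

The worst case d = B + 1 attains the ratio 2 exactly, matching the claim.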
Chapter 15

Linear search

Definition 15.1 (Linear search problem). We have a stack of n papers on


the desk. Given a query, we do a linear search from the top of the stack.
Suppose the i-th paper in the stack is queried. Since we have to go through
i papers to reach the queried paper, we incur a cost of i doing so. We have
the option to perform two types of swaps in order to change the stack:

Free swap Move the queried paper from position i to the top of the stack
for 0 cost.

Paid swap For any consecutive pair of items (a, b) before i, swap their rel-
ative order to (b, a) for 1 cost.

What is the best online strategy for manipulating the stack to minimize total
cost on a sequence of queries?

Remark One can reason that the free swap costs 0 because we already
incurred a cost of i to reach the queried paper.

15.1 Amortized analysis


Amortized analysis1 is a way to analyze the complexity of an algorithm over a
sequence of operations. Instead of looking at the worst-case performance of a
single operation, it measures the total cost of a batch of operations.
The dynamic resizing process of hash tables is a classical example of
amortized analysis. An insertion or deletion operation will typically cost
1
See https://en.wikipedia.org/wiki/Amortized_analysis


O(1) unless the hash table is almost full or almost empty, in which case we
double or halve the hash table of size m, incurring a runtime of O(m).
Worst case analysis tells us that dynamic resizing will incur O(m) run
time per operation. However, resizing only occurs after O(m) insertion/dele-
tion operations, each costing O(1). Amortized analysis allows us to conclude
that this dynamic resizing runs in amortized O(1) time. There are two equiv-
alent ways to see it:
• Split the O(m) resizing overhead and “charge” O(1) to each of the
earlier O(m) operations.
• The total run time for every sequential chunk of m operations is O(m).
Hence, each step takes O(m)/m = O(1) amortized run time.
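The doubling argument can be made concrete with a short simulation (a sketch, not from the notes; it only models the growing direction of resizing):

```python
# Sketch: a dynamically resizing array.  Each append costs 1, plus m when the
# backing store of size m is full and must be doubled (copying all elements).
# The total cost over n appends stays O(n), i.e. O(1) amortized.

def append_costs(n):
    capacity, size, costs = 1, 0, []
    for _ in range(n):
        cost = 1
        if size == capacity:        # full: copy everything into a 2x store
            cost += capacity
            capacity *= 2
        size += 1
        costs.append(cost)
    return costs

total = sum(append_costs(1024))
assert total <= 3 * 1024            # amortized O(1): at most ~3 per append
```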

15.2 Move-to-Front
Move-to-Front (MTF) [ST85] is an online algorithm for the linear search
problem where we move the queried item to the top of the stack (and do no
other swaps). We will show that MTF is a 2-competitive algorithm for linear
search. Before we analyze MTF, let us first define a potential function Φ and
look at examples to gain some intuition.
Let Φt be the number of pairs of papers (i, j) that are ordered differently
in MTF’s stack and OPT’s stack at time step t. By definition, Φt ≥ 0 for
any t. We also know that Φ0 = 0 since MTF and OPT operate on the same
initial stack sequence.

Example One way to interpret Φ is to count the number of inversions


between MTF’s stack and OPT’s stack. Suppose we have the following stacks
(visualized horizontally) with n = 6:
1 2 3 4 5 6
MTF’s stack a b c d e f
OPT’s stack a b e d c f
We have the inversions (c, d), (c, e) and (d, e), so Φ = 3.
Scenario 1 We swap (b, e) in OPT’s stack — A new inversion (b, e) was
created due to the swap.

1 2 3 4 5 6
MTF’s stack a b c d e f
OPT’s stack a e b d c f

Now, we have the inversions (b, e), (c, d), (c, e) and (d, e), so Φ = 4.

Scenario 2 We swap (e, d) in OPT’s stack — The inversion (d, e) was de-
stroyed due to the swap.

1 2 3 4 5 6
MTF’s stack a b c d e f
OPT’s stack a b d e c f

Now, we have the inversions (c, d) and (c, e), so Φ = 2.

In either case, we see that any paid swap results in ±1 inversions, which
changes Φ by ±1.
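Both MTF and the potential Φ are easy to simulate (a sketch, not from the notes; `mtf_cost` and `potential` are our own helper names):

```python
# Sketch: simulate Move-to-Front and count inversions (the potential Φ)
# between two stacks over the same set of items.

def mtf_cost(stack, queries):
    """Serve queries on a copy of `stack`, moving each queried item to the
    front; return the total search cost (1-indexed positions)."""
    stack, total = list(stack), 0
    for q in queries:
        i = stack.index(q)            # 0-indexed position of the queried item
        total += i + 1                # cost of scanning down to it
        stack.insert(0, stack.pop(i)) # free swap: move it to the top
    return total

def potential(s1, s2):
    """Number of pairs ordered differently in the two stacks (inversions)."""
    pos = {x: i for i, x in enumerate(s2)}
    return sum(1 for i in range(len(s1)) for j in range(i + 1, len(s1))
               if pos[s1[i]] > pos[s1[j]])

assert potential("abcdef", "abedcf") == 3   # inversions (c,d), (c,e), (d,e)
```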

Claim 15.2. MTF is 2-competitive.

Proof. We will consider the potential function Φ as before and perform amor-
tized analysis on any given input sequence σ. Let at = cM T F (t) + (Φt − Φt−1 )
be the amortized cost of MTF at time step t, where cM T F (t) is the cost MTF
incurs at time t. Suppose the queried item x at time step t is at position k
in MTF’s stack. Denote:

F = {Items on top of x in MTF’s stack and on top of x in OPT’s stack}


B = {Items on top of x in MTF’s stack and underneath x in OPT’s stack}

Let |F | = f and |B| = b. There are k − 1 items in front of x, so f + b = k − 1.

(Figure omitted: in MTF's stack, the k − 1 items above x are exactly F ∪ B;
in OPT's stack, the f items of F lie above x, while the b items of B lie
below x.)

Since x is the k-th item, MTF will incur cM T F (t) = k = f + b + 1 to


reach item x, then move it to the top. On the other hand, OPT needs to

spend at least f + 1 to reach x. Suppose OPT does p paid swaps, then


cOP T (t) ≥ f + 1 + p.
To measure the change in potential, we first look at the swaps done by
MTF and how OPT’s swaps can affect them. Let ∆M T F (Φt ) be the change
in Φ due to MTF and ∆OP T (Φt ) be the change in Φt due to OPT. Thus,
∆(Φt ) = ∆M T F (Φt ) + ∆OP T (Φt ). In MTF, moving x to the top destroys
b inversions and creates f inversions, so the change in Φ due to MTF is
∆M T F (Φt ) = f − b. If OPT chooses to do a free swap, Φ does not increase
as both stacks now have x before any element in F . For every paid swap
that OPT performs, Φ changes by one since inversions only locally affect the
swapped pair and thus, ∆OP T (Φt ) ≤ p.
Therefore, the effect on Φ from both processes is: ∆(Φt ) = ∆M T F (Φt ) +
∆OP T (Φt ) ≤ (f − b) + p. Putting together, we have cOP T (t) ≥ f + 1 + p and
at = cM T F (t) + (Φt − Φt−1 ) = k + ∆(Φt ) ≤ 2f + 1 + p ≤ 2 · cOP T (t). Summing
up over all queries in the sequence yields:
2 · c_OPT(σ) = ∑_{t=1}^{|σ|} 2 · c_OPT(t) ≥ ∑_{t=1}^{|σ|} a_t

With a_t = c_MTF(t) + (Φ_t − Φ_{t−1}) and using the fact that the sum over
the change in potential telescopes, we get:

∑_{t=1}^{|σ|} a_t = ∑_{t=1}^{|σ|} (c_MTF(t) + (Φ_t − Φ_{t−1})) = ∑_{t=1}^{|σ|} c_MTF(t) + (Φ_{|σ|} − Φ_0)

Since Φ_{|σ|} ≥ 0 = Φ_0 and c_MTF(σ) = ∑_{t=1}^{|σ|} c_MTF(t):

∑_{t=1}^{|σ|} c_MTF(t) + (Φ_{|σ|} − Φ_0) ≥ ∑_{t=1}^{|σ|} c_MTF(t) = c_MTF(σ)

We have shown that cM T F (σ) ≤ 2 · cOP T (σ) which completes the proof.
Chapter 16

Paging

Definition 16.1 (Paging problem [ST85]). Suppose we have a fast memory


(cache) that can fit k pages and an unbounded sized slow memory. Accessing
items in the cache costs 0 units of time while accessing items in the slow
memory costs 1 unit of time. After accessing an item in the slow memory,
we can bring it into the cache by evicting an incumbent item if the cache was
full. What is the best online strategy for maintaining items in the cache to
minimize the total access cost on a sequence of queries?
Denote cache miss as accessing an item that is not in the cache. Any
sensible strategy should aim to reduce the number of cache misses. For
example, if k = 3 and σ = {1, 2, 3, 4, . . . , 2, 3, 4}, keeping item 1 in the cache
will incur several cache misses. Instead, the strategy should aim to keep items
{2, 3, 4} in the cache. We formalize this notion in the following definition of
conservative strategy.
Definition 16.2 (Conservative strategy). A strategy is conservative if on
any consecutive subsequence that includes only k distinct pages, there are at
most k cache misses.

Remark Some natural paging strategies such as “Least Recently Used


(LRU)” and “First In First Out (FIFO)” are conservative.
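For concreteness, here is a sketch of LRU with a miss counter (not from the notes; `lru_misses` is our own helper). On any window containing only k distinct pages it misses each page at most once, which is exactly the conservative property:

```python
# Sketch: Least Recently Used (LRU), one of the conservative strategies
# mentioned above, counting cache misses on a request sequence.
from collections import OrderedDict

def lru_misses(queries, k):
    cache, misses = OrderedDict(), 0
    for p in queries:
        if p in cache:
            cache.move_to_end(p)           # p becomes most recently used
        else:
            misses += 1
            if len(cache) == k:
                cache.popitem(last=False)  # evict the least recently used
            cache[p] = True
    return misses

# With only k distinct pages, LRU misses each page at most once.
assert lru_misses([1, 2, 3, 1, 2, 3, 1], k=3) == 3
```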
Claim 16.3. If A is a deterministic online algorithm that is α-competitive,
then α ≥ k.
Proof. Consider the following input sequence σ on k + 1 pages: since the
cache has size k, at least one item is not in the cache at any point in time.
Iteratively pick σ(t + 1) as the item not in the cache after time step t.
Since A is deterministic, the adversary can simulate A for |σ| steps and
build σ accordingly. By construction, cA (σ) = |σ|.


On the other hand, since OPT can see the entire sequence σ, OPT can
choose to evict the page i that is requested furthest in the future. The next
request for page i has to be at least k requests ahead in the future, since by
definition of i all other pages j ≠ i in {1, ..., k + 1} have to be requested before
i. Thus, in every k steps, OPT has at most 1 cache miss. Therefore, c_OPT ≤ |σ|/k,
which implies: k · c_OPT ≤ |σ| = c_A(σ).
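The offline eviction rule used in this proof (evict the page requested furthest in the future, often attributed to Belady) can be sketched as follows (not from the notes; `belady_misses` is our own name):

```python
# Sketch: offline "furthest in the future" eviction, counting cache misses.

def belady_misses(queries, k):
    cache, misses = set(), 0
    for t, p in enumerate(queries):
        if p in cache:
            continue
        misses += 1
        if len(cache) == k:
            def next_use(q):
                fut = queries[t + 1:]
                return fut.index(q) if q in fut else float("inf")
            cache.remove(max(cache, key=next_use))  # evict furthest-used page
        cache.add(p)
    return misses

# k = 2, requests cycling over 3 pages: OPT misses only 4 times out of 6.
assert belady_misses([1, 2, 3, 1, 2, 3], k=2) == 4
```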

Claim 16.4. Any conservative online algorithm A is k-competitive.

Proof. For any given input sequence σ, partition σ into m maximal phases
— P1 , P2 , . . . , Pm — where each phase has k distinct pages, and a new phase
is created only if the next element is different from the ones in the current
phase. Let xi be the first item that does not belong in Phase i.

σ = ⟨ k distinct pages ⟩ x1 ⟨ k distinct pages ⟩ x2 · · ·
(Phase 1 consists of the first k distinct pages; x1 is the first page outside
Phase 1 and begins Phase 2, and so on.)

By construction, OP T has to pay ≥ 1 to handle the elements in Pi ∪{xi },


for any i; so cOP T ≥ m. On the other hand, since A is conservative, A has
≤ k cache misses per phase. Hence, cA (σ) ≤ k · m ≤ k · cOP T (σ).

Remark A randomized algorithm can achieve O(log k)-competitiveness.


This will be covered in the next lecture.

16.1 Types of adversaries


Since online algorithms are analyzed on all possible input sequences, it helps
to consider adversarial inputs that may induce the worst case performance
for a given online algorithm A. To this end, one may wish to classify the
classes of adversaries designing the input sequences (in increasing power):

Oblivious The adversary designs the input sequence σ at the beginning. It


does not know any randomness used by algorithm A.

Adaptive At each time step t, the adversary knows all randomness used
by algorithm A thus far. In particular, it knows the exact state of the
algorithm. With these in mind, it then picks the (t + 1)-th element in
the input sequence.

Fully adaptive The adversary knows all possible randomness that will be
used by the algorithm A when running on the full input sequence σ. For

instance, assume the adversary has access to the same pseudorandom


number generator used by A and can invoke it arbitrarily many times
while designing the adversarial input sequence σ.

Remark If A is deterministic, then all three classes of adversaries have the


same power.

16.2 Random Marking Algorithm (RMA)


Consider the Random Marking Algorithm (RMA), a O(log k)-competitive
algorithm for paging against oblivious adversaries:

• Initialize all pages as marked

• Upon request of a page p

– If p is not in cache,
∗ If all pages in cache are marked, unmark all
∗ Evict a random unmarked page
– Mark page p
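A minimal sketch of the marking algorithm (not from the notes; it starts from an empty cache and takes an explicit randomness source, and `rma_misses` is our own name):

```python
# Sketch: the Random Marking Algorithm, counting cache misses.
import random

def rma_misses(queries, k, rng=random.Random(0)):
    cache, marked, misses = set(), set(), 0
    for p in queries:
        if p not in cache:
            misses += 1
            if len(cache) == k:             # need to evict a page
                if marked == cache:         # all marked: new phase, unmark all
                    marked = set()
                victim = rng.choice(sorted(cache - marked))
                cache.remove(victim)        # evict a random unmarked page
            cache.add(p)
        marked.add(p)                       # mark the requested page
    return misses

assert rma_misses([1, 2, 3, 1, 2, 3], 3) == 3   # fills up, then only hits
```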

Example Suppose k = 3, σ = (2, 5, 2, 1, 3), and the cache initially holds
pages 1, 3, 4, all marked.

• σ(1) = 2: all pages in the cache were marked, so all get unmarked. Suppose
  the random eviction chooses page ‘3’. The newly added page ‘2’ is then
  marked.
  Cache: 1 2 4, Marked: ✗ ✓ ✗

• σ(2) = 5: suppose the random eviction (between the unmarked pages ‘1’
  and ‘4’) chooses page ‘4’. The newly added page ‘5’ is then marked.
  Cache: 1 2 5, Marked: ✗ ✓ ✓

• σ(3) = 2: page ‘2’ in the cache is already marked (no change).
  Cache: 1 2 5, Marked: ✗ ✓ ✓

• σ(4) = 1: page ‘1’ in the cache is marked. At this point, any page request
  that is not from {1, 2, 5} will cause a full unmarking of all pages in the
  cache.
  Cache: 1 2 5, Marked: ✓ ✓ ✓

• σ(5) = 3: all pages were marked, so all get unmarked. Suppose the random
  eviction chooses page ‘5’. The newly added page ‘3’ is then marked.
  Cache: 1 2 3, Marked: ✗ ✗ ✓

We denote a phase as the time period between 2 consecutive full unmark-


ing steps. That is, each phase is a maximal run where we access k distinct
pages. In the above example, {2, 5, 2, 1} is such a phase for k = 3.

Observation As pages are only unmarked at the beginning of a new phase,


the number of unmarked pages is monotonically decreasing within a phase.

Figure 16.1 (plot omitted): the number of marked pages within a phase is
monotonically increasing over time, resetting at each phase boundary.

Theorem 16.5. RMA is O(log k)-competitive against any oblivious adver-


sary.
Proof. Let Pi be the set of pages at the start of phase i. Since requesting a
marked page does not incur any cost, it suffices to analyze the first time any
request occurs within the phase.
Let mi be the number of unique new requests (pages that are not in Pi )
and oi as the number of unique old requests (pages that are in Pi ). By
definition, oi ≤ k and mi + oi = k.
We have cRM A (Phase i) = (Cost due to new requests) + (Cost due to old
requests). We first focus on the extra cost incurred from the old requests,

that is when an old page is requested that has already been kicked out upon
the arrival of a new request.

Order the old requests in the order in which they appear in the phase and
let x_j be the j-th old request, for j ∈ {1, . . . , o_i}. Define l_j as the number of
distinct new requests before x_j.
For j ∈ {1, . . . , o_i}, consider the first time the j-th old request x_j occurs.
Since the adversary is oblivious, x_j is equally likely to be in any position in
the cache at the start of the phase. After seeing (j − 1) old requests and
marking their cache positions, there are k − (j − 1) initial positions in the
cache that x_j could be in. Since we have only seen l_j new requests and (j − 1)
old requests, there are at least¹ k − l_j − (j − 1) old pages remaining in the
cache. So, the probability that x_j is in the cache when requested is at least
(k − l_j − (j − 1)) / (k − (j − 1)). Then,

Cost due to old requests = ∑_{j=1}^{o_i} Pr[x_j is not in cache when requested]   (sum over old requests)
                         ≤ ∑_{j=1}^{o_i} l_j / (k − (j − 1))                      (from above)
                         ≤ ∑_{j=1}^{o_i} m_i / (k − (j − 1))                      (since l_j ≤ m_i)
                         ≤ m_i · ∑_{j=1}^{k} 1 / (k − (j − 1))                    (since o_i ≤ k)
                         = m_i · ∑_{j=1}^{k} 1/j                                   (rewriting)
                         = m_i · H_k                                               (since ∑_{i=1}^{n} 1/i = H_n)

Since every new request incurs a unit cost, the cost due to these requests
is mi .
Together for new and old requests, we get cRM A (Phase i) ≤ mi + mi · Hk .
We now analyze OPT's performance. By definition of phases, among all
requests between two consecutive phases (say, i − 1 and i), a total of k + m_i
distinct pages are requested. So, OPT has to incur a cost of at least m_i to
bring in these new pages. To avoid double counting, we lower bound c_OPT(σ)
for both odd and even i: c_OPT(σ) ≥ ∑_{odd i} m_i and c_OPT(σ) ≥ ∑_{even i} m_i. Together,

2 · c_OPT(σ) ≥ ∑_{odd i} m_i + ∑_{even i} m_i ≥ ∑_i m_i

¹ We get an equality if all these requests kicked out an old page.

Therefore, we have:

c_RMA(σ) ≤ ∑_i (m_i + m_i · H_k) = O(log k) · ∑_i m_i ≤ O(log k) · c_OPT(σ)

Remark In the above example, k = 3, phase 1 = (2, 5, 2, 1), P1 = {1, 3, 4},


new requests = {2, 5}, old requests = {1}. Although ‘2’ appeared twice, we
only care about analyzing the first time it appeared.

16.3 Lower Bound for Paging via Yao's Principle
Yao’s Principle Often, it is considerably easier to obtain (distributional)
lower bounds against deterministic algorithms, than to (directly) obtain de-
terministic lower bound instances against randomized algorithms. We use
Yao’s principle to bridge this gap. Informally, this principle tells us that if
no deterministic algorithm can do well on a given distribution of random
inputs (D), then for any randomized algorithm, there is a deterministic bad
input so that the cost of the randomized algorithm on this particular input
will be high (C). We next state and prove Yao’s principle.
Before getting to the principle, let us observe that given the sequence of
random bits used, a randomized algorithm behaves deterministically. Hence,
one may view a randomized algorithm as a random choice from a distribution
of deterministic algorithms.
Let X be the space of problem inputs and A be the space of all possible
deterministic algorithms. Denote probability distributions over A and X by
pa = Pr[A = a] and qx = Pr[X = x], where X and A are random variables
for input and deterministic algorithm, respectively. Define c(a, x) as the cost
of algorithm a ∈ A on input x ∈ X.

Theorem 16.6 ([Yao77]).

C = max_{x∈X} E_p[c(A, x)] ≥ min_{a∈A} E_q[c(a, X)] = D

Proof.

C = ∑_x q_x · C                      (sum over all possible inputs x)
  ≥ ∑_x q_x E_p[c(A, x)]             (since C = max_{x∈X} E_p[c(A, x)])
  = ∑_x q_x ∑_a p_a c(a, x)          (definition of E_p[c(A, x)])
  = ∑_a p_a ∑_x q_x c(a, x)          (swap summations)
  = ∑_a p_a E_q[c(a, X)]             (definition of E_q[c(a, X)])
  ≥ ∑_a p_a · D                      (since D = min_{a∈A} E_q[c(a, X)])
  = D                                (sum over all possible algorithms a)
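Since the inequality C ≥ D holds for arbitrary distributions p and q, it can be sanity-checked numerically on a random cost matrix (a sketch, not from the notes):

```python
# Sketch: numerically check C >= D for a random cost matrix c(a, x), with
# p a distribution over algorithms and q a distribution over inputs.
import random

rng = random.Random(1)
A, X = 4, 5                                   # 4 algorithms, 5 inputs
c = [[rng.uniform(0, 10) for _ in range(X)] for _ in range(A)]
p = [1 / A] * A                               # distribution over algorithms
q = [1 / X] * X                               # distribution over inputs

C = max(sum(p[a] * c[a][x] for a in range(A)) for x in range(X))
D = min(sum(q[x] * c[a][x] for x in range(X)) for a in range(A))
assert C >= D                                 # Yao's principle
```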

Application to the paging problem

Theorem 16.7. Any (randomized) algorithm has competitive ratio Ω(log k)


against an oblivious adversary.

Proof. Fix an arbitrary deterministic algorithm A. Let |σ| = m. Consider


the following random input sequence σ where the i-th page is drawn from
{1, . . . , k + 1} uniformly at random.
By construction of σ, the probability of a cache miss at each step is 1/(k + 1)
for A, regardless of what A does. Hence, E[c_A(σ)] = m/(k + 1).
On the other hand, an optimal offline algorithm may choose to evict the
page that is requested furthest in the future. As before, we denote a phase
as a maximal run where there are k distinct page requests. This means that
E[c_OPT(σ)] = Expected number of phases = m / (Expected phase length).
To analyze the expected length of a phase, suppose there are i distinct
pages so far, for 0 ≤ i ≤ k. The probability of the next request being new
is (k + 1 − i)/(k + 1), so one expects to see (k + 1)/(k + 1 − i) requests before
having i + 1 distinct pages. Thus, the expected length of a phase is
∑_{i=0}^{k} (k + 1)/(k + 1 − i) = (k + 1) · H_{k+1}.
Therefore, E[c_OPT(σ)] = m / ((k + 1) · H_{k+1}).
Since every deterministic algorithm A has E[c_A(σ)] = m/(k + 1) on this
distribution, we have obtained D = m/(k + 1); from Yao's Minimax
Principle we know that C ≥ D, hence we can also compare the competitive
ratios: C / E[c_OPT(σ)] ≥ D / E[c_OPT(σ)] = H_{k+1} = Θ(log k).

Remark The length of a phase is essentially the coupon collector problem


with n = k + 1 coupons.
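The (k + 1) · H_{k+1} phase length can be checked by simulating the coupon collector process (a sketch, not from the notes; `phase_length` is our own name):

```python
# Sketch: estimate the expected phase length for the random request sequence
# above -- the number of uniform draws from {1, ..., k+1} until all k+1
# distinct pages have appeared (coupon collector), which should be close to
# (k+1) * H_{k+1}.
import random

def phase_length(k, rng):
    seen, draws = set(), 0
    while len(seen) < k + 1:
        seen.add(rng.randint(1, k + 1))
        draws += 1
    return draws

rng = random.Random(0)
k = 10
estimate = sum(phase_length(k, rng) for _ in range(5000)) / 5000
harmonic = sum(1 / j for j in range(1, k + 2))   # H_{k+1}
assert abs(estimate - (k + 1) * harmonic) < 1.5
```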
Chapter 17

The k-server problem

Definition 17.1 (k-server problem [MMS90]). Consider a metric space (V, d)


where V is a set of n points and d : V × V → R is a distance metric between
any two points. Suppose there are k servers placed on V and we are given
an input sequence σ = (v1 , v2 , . . . ). Upon request of vi ∈ V , we have to move
one server to point vi to satisfy that request. What is the best online strategy
to minimize the total distance travelled by servers to satisfy the sequence of
requests?

Remark We do not fix the starting positions of the k servers, but we com-
pare the performance of OPT on σ with same initial starting positions.

The paging problem is a special case of the k-server problem where the
points are all possible pages, the distance metric is unit cost between any
two different points, and the servers represent the pages in cache of size k.

Progress It is conjectured that a deterministic k-competitive algorithm
exists, and that a randomized poly(log k)-competitive algorithm exists. The
long-standing conjecture that an O(log k)-competitive randomized algorithm
exists was disproved in 2023 [BCR23]. The table below shows the current
progress on this problem.

            Competitive ratio                         Type
[MMS90]     k-competitive, for k = 2 and k = n − 1    Deterministic
[FRR90]     2^{O(k log k)}-competitive                Deterministic
[Gro91]     2^{O(k)}-competitive                      Deterministic
[KP95]      (2k − 1)-competitive                      Deterministic
[BBMN11]    poly(log n, log k)-competitive            Randomized


Remark [BBMN11] uses a probabilistic tree embedding, a concept we have


seen in earlier lectures.

17.1 Special case: Points on a line


Consider the metric space where V are points on a line and d(u, v) is the
distance between points u, v ∈ V . One can think of all points lying on the
1-dimensional number line R.

17.1.1 Greedy is a bad idea


A natural greedy idea would be to pick the closest server to serve any given
request. However, this can be arbitrarily bad. Consider the following:

(Figure omitted: all of the servers lie to the left of point 0 on the line, while
the request points 1 + ε and 2 + ε lie to the right.)

Without loss of generality, suppose all servers currently lie to the left of “0”.
For ε > 0, consider the sequence σ = (1 + ε, 2 + ε, 1 + ε, 2 + ε, . . . ). The first
request will move a single server s∗ to “1 + ε”. By the greedy algorithm,
subsequent requests then repeatedly use s∗ to satisfy requests from both
“1 + ε” and “2 + ε”, since s∗ is always the closest server. This incurs a total
cost of ≥ |σ|, while OPT could station 2 servers on “1 + ε” and “2 + ε” and
incur only a constant total cost on input sequence σ.
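The blow-up of the greedy strategy is easy to reproduce (a sketch, not from the notes; we fix ε = 0.5 and two servers):

```python
# Sketch: greedy nearest-server on a line versus an offline strategy that
# parks one server on each of the two request points.

def greedy_cost(servers, requests):
    servers, total = list(servers), 0.0
    for r in requests:
        i = min(range(len(servers)), key=lambda j: abs(servers[j] - r))
        total += abs(servers[i] - r)   # move the closest server to r
        servers[i] = r
    return total

eps = 0.5
requests = [1 + eps, 2 + eps] * 50             # alternating requests
greedy = greedy_cost([-1.0, 0.0], requests)    # both servers left of 0
# Offline: move one server to each request point once, then pay nothing.
offline = (1 + eps - 0.0) + (2 + eps - (-1.0))
assert greedy > 10 * offline                   # greedy is far worse
```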

17.1.2 Double coverage


The double coverage algorithm does the following:

• If request r is on one side of all servers, move the closest server to cover
it

• If request r lies between two servers, move both towards it at constant
speed until r is covered

(Figure omitted: in the one-sided case only the closest server moves to r; in
the two-sided case the two surrounding servers move toward r at equal speed
until one of them covers it.)
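The two movement rules can be sketched as a single step function (not from the notes; positions are floats on the line and `double_coverage` is our own name):

```python
# Sketch: one step of the double coverage rule on a line.

def double_coverage(servers, r):
    """Move servers (positions on the line) to serve request r; return the
    new sorted positions and the total distance travelled."""
    s = sorted(servers)
    if r <= s[0]:                         # r to the left of all servers
        cost, s[0] = s[0] - r, r
    elif r >= s[-1]:                      # r to the right of all servers
        cost, s[-1] = r - s[-1], r
    else:                                 # r between two servers: move both
        i = max(j for j in range(len(s)) if s[j] <= r)
        z = min(r - s[i], s[i + 1] - r)   # equal-speed travel until covered
        s[i], s[i + 1] = s[i] + z, s[i + 1] - z
        cost = 2 * z                      # both servers moved distance z
    return s, cost

s, c = double_coverage([0.0, 10.0], 3.0)
assert s == [3.0, 7.0] and c == 6.0
```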

Theorem 17.2. Double coverage (DC) is k-competitive on a line.


Proof. Without loss of generality,
• Suppose location of DC’s servers on the line are: x1 ≤ x2 ≤ · · · ≤ xk
• Suppose location of OPT’s servers on the line are: y1 ≤ y2 ≤ · · · ≤ yk
Define the potential function Φ = Φ1 + Φ2 = k · ∑_{i=1}^{k} |x_i − y_i| + ∑_{i<j} (x_j − x_i),
where Φ1 is k times the “paired distances” between the x_i and y_i, and Φ2 is
the sum of pairwise distances between any two servers in DC.
We denote the potential function at time step t by Φt = Φt,1 + Φt,2 . For
a given request r at time step t, we will first analyze OPT’s action then
DC’s action. We analyze the change in potential ∆(Φ) by looking at ∆(Φ1 )
and ∆(Φ2 ) separately, and further distinguish the effects of DC and OPT on
∆(Φ) via ∆DC (Φ) and ∆OP T (Φ) respectively.
Suppose OPT moves server s∗ by a distance of x = d(s∗ , r) to reach the
point r. Then, cOP T (t) ≥ x. Since s∗ moved by x, ∆(Φt,1 ) ≤ kx. Since OPT
does not move DC’s servers, ∆(Φt,2 ) = 0. Hence, ∆OP T (Φt ) ≤ kx.
There are three cases for DC, depending on where r appears.
1. r appears exactly on a current server position
DC does nothing. So, cDC (t) = 0 and ∆DC (Φt ) = 0. Hence,

cDC (t) + ∆(Φt ) = cDC (t) + ∆DC (Φt ) + ∆OP T (Φt )


≤ 0 + kx + 0 = kx
≤ k · cOP T (t)

2. r appears on one side of all servers x1 , . . . , xk (say r > xk without loss


of generality)
DC will move server xk by a distance y = d(xk , r) to reach point r. That
is, cDC (t) = y. Since OPT has a server at r, yk ≥ r. So, ∆DC (Φt,1 ) =
−ky. Since only xk moved, ∆DC (Φt,2 ) = (k − 1)y. Hence,
cDC (t) + ∆(Φt ) = cDC (t) + ∆DC (Φt ) + ∆OP T (Φt )
≤ y − ky + (k − 1)y + kx
= kx
≤ k · cOP T (t)

3. r appears between two servers xi < r < xi+1


Without loss of generality, say r is closer to xi and denote z = d(xi , r).
DC will move server xi by a distance of z to reach point r, and server
xi+1 by a distance of z to reach xi+1 − z. That is, cDC (t) = 2z.

Claim 17.3. At least one of xi or xi+1 is moving closer to its partner


(yi or yi+1 respectively).

Proof. Suppose, for a contradiction, that both xi and xi+1 are moving
away from their partners. That means yi ≤ xi < r < xi+1 ≤ yi+1 at
the end of OPT’s action (before DC moved xi and xi+1 ). This is a
contradiction since OPT must have a server at r but there is no server
between yi and yi+1 by definition.

Since at least one of xi or xi+1 is moving closer to its partner, ∆DC (Φt,1 ) ≤
z − z = 0.
Meanwhile, since x_i and x_{i+1} each move a distance of z towards each
other, the gap (x_{i+1} − x_i) shrinks by 2z, while the total change over all other
pairwise distances cancels out, so ∆DC (Φt,2 ) = −2z.
Hence,

cDC (t)+∆(Φt ) = cDC (t)+∆DC (Φt )+∆OP T (Φt ) ≤ 2z−2z+kx = kx ≤ k·cOP T (t)

In all cases, we see that c_DC(t) + ∆(Φ_t) ≤ k · c_OPT(t). Hence,

∑_{t=1}^{|σ|} (c_DC(t) + ∆(Φ_t)) ≤ ∑_{t=1}^{|σ|} k · c_OPT(t)       (summing over σ)
⇒ ∑_{t=1}^{|σ|} c_DC(t) + (Φ_{|σ|} − Φ_0) ≤ k · c_OPT(σ)           (telescoping)
⇒ ∑_{t=1}^{|σ|} c_DC(t) − Φ_0 ≤ k · c_OPT(σ)                       (since Φ_{|σ|} ≥ 0)
⇒ c_DC(σ) ≤ k · c_OPT(σ) + Φ_0                                      (since c_DC(σ) = ∑_{t=1}^{|σ|} c_DC(t))

Since Φ_0 is a constant that captures the initial state, DC is k-competitive.

Remark One can generalize the approach of double coverage to points on


a tree. The idea is as follows: For a given request point r, consider the
set of servers S such that for s ∈ S, there is no other server s0 between
s and r. Move all servers in S towards r “at the same speed” until one

of them reaches r. This generalization gives k-competitiveness on a
tree; building on this, we can use the probabilistic tree embedding approach
(which stretches distances by only O(log n) in expectation) to immediately
obtain O(k log n)-competitiveness in expectation on general graphs.
Bibliography

[AB17] Amir Abboud and Greg Bodwin. The 4/3 additive spanner expo-
nent is tight. Journal of the ACM (JACM), 64(4):28, 2017.

[ACIM99] Donald Aingworth, Chandra Chekuri, Piotr Indyk, and Ra-


jeev Motwani. Fast estimation of diameter and shortest paths
(without matrix multiplication). SIAM Journal on Computing,
28(4):1167–1181, 1999.

[ADD+ 93] Ingo Althöfer, Gautam Das, David Dobkin, Deborah Joseph,
and José Soares. On sparse spanners of weighted graphs. Discrete
& Computational Geometry, 9(1):81–100, 1993.

[AGM12] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Ana-
lyzing graph structure via linear measurements. In Proceedings
of the twenty-third annual ACM-SIAM symposium on Discrete
Algorithms, pages 459–467. SIAM, 2012.

[AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplica-
tive weights update method: a meta-algorithm and applications.
Theory of Computing, 8(1):121–164, 2012.

[AMS96] Noga Alon, Yossi Matias, and Mario Szegedy. The space com-
plexity of approximating the frequency moments. In Proceedings
of the twenty-eighth annual ACM symposium on Theory of com-
puting, pages 20–29. ACM, 1996.

[Bar96] Yair Bartal. Probabilistic approximation of metric spaces and its


algorithmic applications. In Foundations of Computer Science,
1996. Proceedings., 37th Annual Symposium on, pages 184–193.
IEEE, 1996.

[BBMN11] Nikhil Bansal, Niv Buchbinder, Aleksander Madry, and Joseph


Naor. A polylogarithmic-competitive algorithm for the k-server


problem. In Foundations of Computer Science (FOCS), 2011


IEEE 52nd Annual Symposium on, pages 267–276. IEEE, 2011.

[BCR23] Sébastien Bubeck, Christian Coester, and Yuval Rabani. The


randomized k-server conjecture is false! In Barna Saha and
Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM
Symposium on Theory of Computing, STOC 2023, Orlando, FL,
USA, June 20-23, 2023, pages 581–594. ACM, 2023.

[BK96] András A Benczúr and David R Karger. Approximating s-t min-
imum cuts in Õ(n^2) time. In Proceedings of the twenty-eighth
annual ACM symposium on Theory of computing, pages 47–55.
ACM, 1996.

[BKMP05] Surender Baswana, Telikepalli Kavitha, Kurt Mehlhorn, and


Seth Pettie. New constructions of (α, β)-spanners and purely
additive spanners. In Proceedings of the sixteenth annual ACM-
SIAM symposium on Discrete algorithms, pages 672–681. Soci-
ety for Industrial and Applied Mathematics, 2005.

[BYJK+ 02] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and


Luca Trevisan. Counting distinct elements in a data stream. In
International Workshop on Randomization and Approximation
Techniques in Computer Science, pages 1–10. Springer, 2002.

[BYJKS04] Ziv Bar-Yossef, Thathachar S Jayram, Ravi Kumar, and


D Sivakumar. An information statistics approach to data stream
and communication complexity. Journal of Computer and Sys-
tem Sciences, 68(4):702–732, 2004.

[Che13] Shiri Chechik. New additive spanners. In Proceedings of the


twenty-fourth annual ACM-SIAM symposium on Discrete algo-
rithms, pages 498–512. Society for Industrial and Applied Math-
ematics, 2013.

[DS14] Irit Dinur and David Steurer. Analytical approach to parallel


repetition. In Proceedings of the forty-sixth annual ACM sym-
posium on Theory of computing, pages 624–633. ACM, 2014.

[Erd64] P. Erdös. Extremal problems in graph theory. In “Theory of


graphs and its applications,” Proc. Symposium Smolenice, pages
29–36, 1964.

[Fei98] Uriel Feige. A threshold of ln n for approximating set cover.


Journal of the ACM (JACM), 45(4):634–652, 1998.
[FM85] Philippe Flajolet and G Nigel Martin. Probabilistic counting
algorithms for data base applications. Journal of computer and
system sciences, 31(2):182–209, 1985.
[FRR90] Amos Fiat, Yuval Rabani, and Yiftach Ravid. Competitive k-
server algorithms. In Foundations of Computer Science, 1990.
Proceedings., 31st Annual Symposium on, pages 454–463. IEEE,
1990.
[FRT03] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight
bound on approximating arbitrary metrics by tree metrics. In
Proceedings of the thirty-fifth annual ACM symposium on Theory
of computing, pages 448–455. ACM, 2003.
[FS16] Arnold Filtser and Shay Solomon. The greedy spanner is exis-
tentially optimal. In Proceedings of the 2016 ACM Symposium
on Principles of Distributed Computing, pages 9–17. ACM, 2016.
[Gra66] Ronald L Graham. Bounds for certain multiprocessing anoma-
lies. Bell System Technical Journal, 45(9):1563–1581, 1966.
[Gro91] Edward F Grove. The harmonic online k-server algorithm is
competitive. In Proceedings of the twenty-third annual ACM
symposium on Theory of computing, pages 260–266. ACM, 1991.
[HMK+ 06] Tracey Ho, Muriel Médard, Ralf Koetter, David R Karger,
Michelle Effros, Jun Shi, and Ben Leong. A random linear net-
work coding approach to multicast. IEEE Transactions on In-
formation Theory, 52(10):4413–4430, 2006.
[Ind01] Piotr Indyk. Algorithmic applications of low-distortion geomet-
ric embeddings. In Proceedings 42nd IEEE Symposium on Foun-
dations of Computer Science, pages 10–33. IEEE, 2001.
[IW05] Piotr Indyk and David Woodruff. Optimal approximations of
the frequency moments of data streams. In Proceedings of the
thirty-seventh annual ACM symposium on Theory of computing,
pages 202–208. ACM, 2005.
[Joh74] David S Johnson. Approximation algorithms for combinatorial
problems. Journal of computer and system sciences, 9(3):256–
278, 1974.

[Kar93] David R Karger. Global min-cuts in RNC, and other ramifications
of a simple min-cut algorithm. In SODA, volume 93, pages 21–
30, 1993.
[Kar99] David R Karger. Random sampling in cut, flow, and network de-
sign problems. Mathematics of Operations Research, 24(2):383–
413, 1999.
[Kar01] David R. Karger. A randomized fully polynomial time approxi-
mation scheme for the all-terminal network reliability problem.
SIAM Rev., 43(3):499–522, March 2001.
[KP95] Elias Koutsoupias and Christos H Papadimitriou. On the k-
server conjecture. Journal of the ACM (JACM), 42(5):971–983,
1995.
[LLR95] Nathan Linial, Eran London, and Yuri Rabinovich. The geom-
etry of graphs and some of its algorithmic applications. Combi-
natorica, 15(2):215–245, 1995.
[LY94] Carsten Lund and Mihalis Yannakakis. On the hardness of
approximating minimization problems. Journal of the ACM
(JACM), 41(5):960–981, 1994.
[MMS90] Mark S Manasse, Lyle A McGeoch, and Daniel D Sleator. Com-
petitive algorithms for server problems. Journal of Algorithms,
11(2):208–230, 1990.
[Mor78] Robert Morris. Counting large numbers of events in small reg-
isters. Communications of the ACM, 21(10):840–842, 1978.
[NY18] Jelani Nelson and Huacheng Yu. Optimal lower bounds for
distributed and streaming spanning forest computation. arXiv
preprint arXiv:1807.05135, 2018.
[RT87] Prabhakar Raghavan and Clark D Tompson. Randomized round-
ing: a technique for provably good algorithms and algorithmic
proofs. Combinatorica, 7(4):365–374, 1987.
[ST85] Daniel D Sleator and Robert E Tarjan. Amortized efficiency
of list update and paging rules. Communications of the ACM,
28(2):202–208, 1985.
[Vaz13] Vijay V Vazirani. Approximation algorithms. Springer Science
& Business Media, 2013.

[Wen91] Rephael Wenger. Extremal graphs with no C4's, C6's, or C10's.
Journal of Combinatorial Theory, Series B, 52(1):113–116, 1991.

[Woo06] David P Woodruff. Lower bounds for additive spanners, emu-


lators, and more. In Foundations of Computer Science, 2006.
FOCS’06. 47th Annual IEEE Symposium on, pages 389–398.
IEEE, 2006.

[WS11] David P Williamson and David B Shmoys. The design of ap-


proximation algorithms. Cambridge university press, 2011.

[Yao77] Andrew Chi-Chin Yao. Probabilistic computations: Toward a


unified measure of complexity. In Foundations of Computer Sci-
ence, 1977., 18th Annual Symposium on, pages 222–227. IEEE,
1977.
