CSE 291 Lecture 6 — Online and streaming algorithms for clustering Spring 2008
Here 2R is roughly the cost of the current clustering, as we will shortly make precise. The lemmas below
describe invariants that hold at lines (A), (B), and (C) in the code; each refers to the start of the line (that
is, before the line is executed).
Lemma 1. All data points seen so far are (i) within distance 2R of T at (B) and (ii) within distance 4R of
T at (C).
Proof. Use induction on the number of iterations through the main loop. Notice that (i) holds on the first
iteration, and that if (i) holds on the pth iteration then (ii) holds on the pth iteration and (i) holds on the
(p + 1)st iteration.
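The pseudocode itself (with its lines (A), (B), (C)) appears just before this point in the notes, so what follows is only a minimal sketch of a doubling-style routine consistent with the invariants of Lemma 1; the initialization, the ordering of the phases, and all names are assumptions made for the example, not the notes' own code.

```python
def online_k_center(stream, k, dist):
    """Doubling-style online k-center sketch: maintain at most k centers T and
    a radius R such that every point seen so far is within 2R of T."""
    points = iter(stream)
    # Assumed initialization: the first k+1 (distinct) points become centers,
    # and 2R is the smallest distance between any two of them.
    T = [next(points) for _ in range(k + 1)]
    R = min(dist(a, b) for i, a in enumerate(T) for b in T[i + 1:]) / 2.0

    while True:
        # Merge phase: keep a maximal subset of centers that are pairwise more
        # than 2R apart.  Each dropped center is within 2R of a kept one, so
        # every point seen so far (previously within 2R of T) is now within 4R
        # of the new, smaller T.
        merged = []
        for t in T:
            if all(dist(t, c) > 2 * R for c in merged):
                merged.append(t)
        T = merged
        R = 2 * R  # double the radius: all points are again within 2R of T

        # Absorb phase: read new points, opening a new center whenever a point
        # falls farther than 2R from every current center; once k centers no
        # longer suffice, go back and merge.
        try:
            while len(T) <= k:
                x = next(points)
                if min(dist(x, c) for c in T) > 2 * R:
                    T.append(x)
        except StopIteration:
            return T, R  # every point seen is within 2R of T
```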
To prove an approximation guarantee, we also need to lower bound the cost of the optimal clustering in
terms of R.
2. If a node is associated with xi , then one of its children must also be associated with xi .
3. All nodes at depth j are at distance at least 1/2^j from each other.
4. Each node at depth j + 1 is within distance 1/2^j of its parent (at depth j).
This is described as an infinite tree for simplicity of analysis, but it would not be stored as such. In practice,
there is no need to duplicate a node as its own child, and so the tree would take up O(n) space.
The figure below gives an example of a cover tree for a data set of five points. This is just the top few
levels of the tree, but the rest of it is simply a duplication of the bottom row. From the structure of the tree
we can conclude, for instance, that x1 , x2 , x5 are all at distance ≥ 1/2 from each other (since they are all at
depth 1), and that the distance between x2 and x3 is ≤ 1/4 (since x3 is at depth 3, and is a child of x2 ).
[Figure: a data set of five points x1, . . . , x5 and the top levels of its cover tree. Depth 0: x1. Depth 1: x2, x1, x5. Depth 2: x2, x1, x5, x4. Depth 3: x3, x2, x1, x5, x4.]
What makes cover trees especially convenient is that they can be built on-line, one point at a time. To
insert a new point x, find the largest j such that x is within 1/2^j of some node p at depth j in the tree, and
make x a child of p (at depth j + 1). Do you see why this maintains the four defining properties?
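As a concrete illustration, here is a minimal sketch of this insertion rule in code. Everything in it (the class, the function names, the brute-force scan) is invented for the example; it assumes distinct points with all interpoint distances ≤ 1, and each insertion takes O(n) time, in line with the O(n²) construction time noted below.

```python
import math

class Node:
    """One node of the explicitly stored cover tree; the chain of implicit
    self-children at deeper levels is not stored."""
    def __init__(self, point, depth):
        self.point = point
        self.depth = depth
        self.children = []

def insert(root, x, dist):
    """Find the largest j such that x is within 1/2**j of some node p existing
    at depth j (recall each node also exists, implicitly, at every level below
    its own), and make x a child of p at depth j + 1."""
    best_node, best_j = None, None
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        r = dist(x, node.point)
        if r == 0:
            return  # duplicate point: already in the tree
        if r <= 0.5 ** node.depth:  # node covers x at its own depth or deeper
            j = math.floor(math.log2(1.0 / r))  # deepest level at which it covers x
            if best_j is None or j > best_j:
                best_node, best_j = node, j
    # Property 3 holds at depth best_j + 1 precisely because best_j is maximal.
    best_node.children.append(Node(x, best_j + 1))
```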
Once the tree is built, it is easy to obtain k-clusterings from it.
Lemma 4. For any k, consider the deepest level of the tree with ≤ k nodes, and let T_k be those nodes. Then cost(T_k) is at most 8 times the cost of the optimal k-clustering.
Proof. Fix any k, and suppose j is the deepest level with ≤ k nodes. By Property 4, each node of T_k has its children within distance 1/2^j, its grandchildren within distance 1/2^j + 1/2^{j+1}, and so on. Therefore

    cost(T_k) ≤ 1/2^j + 1/2^{j+1} + · · · = 1/2^{j−1}.

Meanwhile, level j + 1 has ≥ k + 1 nodes, and by Property 3, these are at distance ≥ 1/2^{j+1} from each other. Any k centers must, by the pigeonhole principle, serve two of these k + 1 points with a single center, which then lies at distance ≥ 1/2^{j+2} from at least one of them. Therefore the optimal k-clustering has cost at least 1/2^{j+2}.
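Continuing the hypothetical sketch above, T_k can be read off by collecting points level by level until adding the next level would exceed k (recall that level j implicitly contains every point whose explicit depth is ≤ j).

```python
from collections import defaultdict

def centers_for_k(root, k):
    """Return T_k: the deepest level of the (implicit) tree with at most k nodes."""
    by_depth = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        by_depth[node.depth].append(node.point)
        stack.extend(node.children)
    max_depth = max(by_depth)
    centers, j = [], 0
    # Level j = all points of explicit depth <= j; keep descending while the
    # cumulative count stays within k.
    while j <= max_depth and len(centers) + len(by_depth[j]) <= k:
        centers += by_depth[j]
        j += 1
    return centers
```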
To get rid of the assumption that all interpoint distances are ≤ 1, simply allow the tree to have depths
−1, −2, −3 and so on. Initially, the root is at depth 0, but if a node arrives that is at distance ≥ 8 (say)
from all existing nodes, then it is made the new root and is put at depth −3.
Also, the creation of the data structure appears to take O(n²) time and O(n) space. But if we are only
interested in values of k from 1 to K, then we need only keep the top few levels of the tree, those with ≤ K
nodes. This reduces the time and space requirements to O(nK) and O(K), respectively.
is not too much more than the optimal cost achievable by the best k centers for x1 , . . . , xt .
Problem 2. Develop an on-line algorithm for k-means clustering that has a performance guarantee under
either of the two criteria above.
On-line                       | Streaming
Endless stream of data        | Stream of (known) length n
Fixed amount of memory        | Memory available is o(n)
Tested at every time step     | Tested only at the very end
Each point is seen only once  | More than one pass may be possible
There are some standard ways to convert regular algorithms (that assume the data fits in memory) into
streaming algorithms. These include divide-and-conquer and random sampling. We’ll now see k-medoid
algorithms based on each of these.
[Figure 6.1. A streaming algorithm for the k-medoid problem, based on divide and conquer: batches S1, S2, S3, . . . , Sl are each fed to an (a, b)-approximation algorithm, producing medoid sets T1, T2, T3, . . . , Tl; T∗ denotes the optimal k medoids for the whole stream.]
Earlier, we saw an LP-based algorithm that finds 2k centers whose cost is at most 4 times that of the best
k-medoid solution. This extends to the case when the problem is weighted, that is, when each data point x
has an associated weight w(x), and the cost function is
    cost(T) = Σ_{x∈S} w(x) · ρ(x, T).
We will call it a (2, 4)-approximation algorithm. It turns out that (a, b)-approximation algorithms are
available for many combinations (a, b).
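In code, the weighted cost is just a weighted sum of distances to the nearest medoid; the helper below is only an illustration, with w and ρ passed in as functions.

```python
def weighted_cost(S, w, T, rho):
    # cost(T) = sum over x in S of w(x) * rho(x, T),
    # where rho(x, T) is the distance from x to its closest medoid in T.
    return sum(w(x) * min(rho(x, t) for t in T) for x in S)
```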
A natural way to deal with a huge data stream S is to read as much of it as will fit into memory (call
this portion S1 ), solve this sub-instance, then read the next batch S2 , solve this sub-instance, and so on. At
the end, the partial solutions need to be combined somehow.
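Here is a hedged sketch of that divide-and-conquer scheme (the one pictured in Figure 6.1). The subroutine approx_k_medoid(points, weights, k) stands in for any (a, b)-approximation to the weighted k-medoid problem, such as the LP-based (2, 4)-approximation mentioned above; the batching and the convention of weighting each intermediate medoid by the size of its cluster are assumptions about details not spelled out on this page.

```python
def stream_k_medoid(stream, k, batch_size, approx_k_medoid, dist):
    """Divide and conquer: run an (a, b)-approximation on each batch S_i to get
    weighted medoids T_i, then cluster the union of the T_i down to k medoids."""
    reps, rep_weights = [], []  # union of the T_i, with their weights
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            _absorb(batch, k, approx_k_medoid, dist, reps, rep_weights)
            batch = []
    if batch:
        _absorb(batch, k, approx_k_medoid, dist, reps, rep_weights)
    # Second-level clustering of the weighted representatives.
    return approx_k_medoid(reps, rep_weights, k)

def _absorb(batch, k, approx_k_medoid, dist, reps, rep_weights):
    T_i = approx_k_medoid(batch, [1] * len(batch), k)  # at most a*k medoids
    counts = [0] * len(T_i)
    for x in batch:  # weight each medoid by the number of points it represents
        nearest = min(range(len(T_i)), key=lambda idx: dist(x, T_i[idx]))
        counts[nearest] += 1
    reps.extend(T_i)
    rep_weights.extend(counts)
```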
The next lemma says that when clustering a data set S ′ , picking centers from S ′ is at most twice as bad
as picking centers from the entire underlying metric space X .
Lemma 7. For any S′ ⊂ X, we have

    min_{T′⊂S′} cost(S′, T′) ≤ 2 · min_{T′⊂X} cost(S′, T′).
Proof. Let T′ ⊂ X be the optimal solution chosen from X. For each induced cluster of S′, replace its center t′ ∈ T′ by the point of S′ closest to t′. This at most doubles the cost: for any x in the cluster, the replacement center is within ρ(t′, x) of t′ (since x itself belongs to S′), so by the triangle inequality x is within ρ(x, t′) + ρ(t′, x) = 2ρ(x, t′) of it.
Our final goal is to upper-bound cost(S, T), and we will do so by bounding the two terms on the right-hand side of Lemma 6. Let's start with the first of them. We'd certainly hope that Σ_i cost(S_i, T_i) is smaller than cost(S, T∗); after all, the former uses way more representatives (about akl of them) to approximate the same set S. We now give a coarse upper bound to this effect.
Lemma 8. Σ_{i=1}^l cost(S_i, T_i) ≤ 2b · cost(S, T∗).
Proof. Each T_i is a b-approximate solution to the k-medoid problem for S_i. Thus

    Σ_{i=1}^l cost(S_i, T_i) ≤ b · Σ_{i=1}^l min_{T′⊂S_i} cost(S_i, T′)
                             ≤ 2b · Σ_{i=1}^l min_{T′⊂X} cost(S_i, T′)      (by Lemma 7)
                             ≤ 2b · Σ_{i=1}^l cost(S_i, T∗)  =  2b · cost(S, T∗).
The theorem follows immediately by putting together the last two lemmas with Lemma 6.
Problem 3. Notice that this streaming algorithm uses two k-medoid subroutines. Even if both are perfect,
that is, if both are (1, 1)-approximations, the overall bound on the approximation factor is 8. Can a better
factor be achieved?
Lemma 11. With probability ≥ 1 − δ, for all i ∈ L: |S_i′| ≥ (1/2) · (s/n) · |S_i|.
Proof. Use a multiplicative form of the Chernoff bound.
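For concreteness, here is one way the calculation can go, under assumptions that match the surrounding statements: S′ is a sample of s points drawn independently from S (as in the expectation computed in Lemma 12), and a large cluster is one with |S_i| ≥ (8n/s) log(k/δ), the threshold used in the final proof below. For a fixed large cluster S_i, |S_i′| is a sum of s independent indicator variables with mean μ = (s/n)|S_i| ≥ 8 log(k/δ), so the multiplicative Chernoff bound gives

    Pr[ |S_i′| < μ/2 ] ≤ e^{−μ/8} ≤ δ/k,

and a union bound over the at most k large clusters gives overall failure probability ≤ δ.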
We’ll next show that the optimal centers T ∗ do a pretty good job on the random sample S ′ .
Lemma 12. With probability ≥ 1 − δ, cost(S′, T∗) ≤ (1/δ) · (s/n) · cost(S, T∗).
Proof. The expected value of cost(S ′ , T ∗ ) is exactly (s/n)cost(S, T ∗ ), whereupon the lemma follows by
Markov’s inequality.
Henceforth assume that the high-probability events of Lemmas 11 and 12 hold. Lemma 12 implies that
T ′ is also pretty good for S ′ .
Lemma 13. cost(S′, T′) ≤ 2b · (1/δ) · (s/n) · cost(S, T∗).
Proof. This is by now a familiar argument.
Proof. We'll show that for each optimal medoid t_i∗ of a large cluster, there must be a point in T′ relatively close by. To show this, we use the triangle inequality ρ(t_i∗, T′) ≤ ρ(t_i∗, y) + ρ(y, t′(y)), where t′(y) ∈ T′ is the medoid in T′ closest to point y. This inequality holds for all y, so we can average it over all y ∈ S_i′:

    ρ(t_i∗, T′) ≤ (1/|S_i′|) · Σ_{y∈S_i′} (ρ(y, t_i∗) + ρ(y, t′(y))).
Now we can bound the cost of the large clusters with respect to the medoids T′:

    cost(∪_{i∈L} S_i, T′) ≤ Σ_{i∈L} Σ_{x∈S_i} (ρ(x, t_i∗) + ρ(t_i∗, T′))
                          ≤ cost(S, T∗) + Σ_{i∈L} |S_i| · ρ(t_i∗, T′)
                          ≤ cost(S, T∗) + Σ_{i∈L} (|S_i|/|S_i′|) · Σ_{y∈S_i′} (ρ(y, t_i∗) + ρ(y, t′(y)))
                          ≤ cost(S, T∗) + (2n/s) · Σ_{y∈S′} (ρ(y, t∗(y)) + ρ(y, t′(y)))
                          = cost(S, T∗) + (2n/s) · (cost(S′, T∗) + cost(S′, T′)),
where the last inequality is from Lemma 11. The rest follows from Lemmas 12 and 13.
Thus the large clusters are well-represented by T ′ . But this isn’t exactly what we need. Looking back at
the algorithm, we see that medoids T ′′ will take care of S ′′ , which means that we need medoids T ′ to take
care of S \ S ′′ .
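For orientation, here is a hedged reconstruction of the sampling-based procedure being analyzed, pieced together from the notation used above (S′, T′, S′′, T′′) and from the quantities appearing in the final proof; the actual algorithm is stated earlier in the notes and may differ in its details. approx_k_medoid(points, k) again denotes an assumed (a, b)-approximation black box.

```python
import math
import random

def sample_based_k_medoid(S, k, s, delta, approx_k_medoid, dist):
    n = len(S)
    S_prime = random.choices(S, k=s)       # assumed: sample of size s, with replacement
    T_prime = approx_k_medoid(S_prime, k)  # medoids for the sample
    # S'' = the (8kn/s) log(k/delta) points of S farthest from T'
    m = math.ceil((8 * k * n / s) * math.log(k / delta))
    by_dist = sorted(S, key=lambda x: min(dist(x, t) for t in T_prime))
    S_dblprime = by_dist[n - m:]           # the m farthest points
    T_dblprime = approx_k_medoid(S_dblprime, k)
    return T_prime + T_dblprime            # T' covers S \ S'', T'' covers S''
```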
Proof. A large cluster is one with at least (8n/s) log(k/δ) points. Therefore, since there are at most k small
clusters, the total number of points in small clusters is at most (8kn/s) log(k/δ). This means that the large
clusters account for the vast majority of the data, at least n − (8kn/s) log(k/δ) points.
S \ S′′ is exactly the n − (8kn/s) log(k/δ) points of S closest to T′. Since ∪_{i∈L} S_i contains at least this many points, it follows that cost(S \ S′′, T′) ≤ cost(∪_{i∈L} S_i, T′).
Problem 4. The approximation factor contains a highly unpleasant 1/δ term. Is there a way to reduce this
to, say, log(1/δ)?