
CSE 291: Unsupervised learning Spring 2008

Lecture 6 — Online and streaming algorithms for clustering

6.1 On-line k-clustering


To the extent that clustering takes place in the brain, it happens in an on-line manner: each data point
comes in, is processed, and then goes away never to return. To formalize this, imagine an endless stream of
data points x1 , x2 , . . . and a k-clustering algorithm that works according to the following paradigm:
repeat forever:
get a new data point x
update the current set of k centers
This algorithm cannot store all the data it sees, because the process goes on ad infinitum. More precisely, we
will allow it space proportional to k. And at any given moment in time, we want the algorithm’s k-clustering
to be close to the optimal clustering of all the data seen so far.
This is a tall order, but nonetheless we can manage it for the simplest of our cost functions, k-center.

6.2 The on-line k-center problem


The setting is, as usual, a metric space (X , ρ). Recall that for data set S ⊂ X and centers T ⊂ X we define
cost(T) = max_{x∈S} ρ(x, T). In the on-line setting, our algorithm must conform to the following template.
repeat forever:
get x ∈ X
update centers T ⊂ X , |T | = k
And for all times t, we would like it to be the case that the cost of T for the points seen so far is close to
the optimal cost achievable for those particular points. We will look at two schemes that fit this bill.
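In code, this objective is a one-liner. The following Python fragment is just an illustrative sketch (not part of any algorithm below); dist stands for the metric ρ.

    def kcenter_cost(points, centers, dist):
        # cost(T) = max over x in S of rho(x, T), where rho(x, T) is the
        # distance from x to the closest center in T
        return max(min(dist(x, t) for t in centers) for x in points)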

6.2.1 A doubling algorithm


This algorithm is due to Charikar, Chekuri, Feder, and Motwani (1997).
T ← {first k distinct data points}
R ← smallest interpoint distance in T
repeat forever:
    while |T| ≤ k:
        (A) get new point x
            if ρ(x, T) > 2R: T ← T ∪ {x}
    (B) T′ ← { }
        while there exists z ∈ T such that ρ(z, T′) > 2R:
            T′ ← T′ ∪ {z}
        T ← T′
    (C) R ← 2R
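For concreteness, here is one way this pseudocode might be realized in Python; it is an illustrative sketch rather than the authors' implementation. dist stands for the metric ρ, and the generator yields the current centers each time it reaches location (A).

    def doubling_kcenter(stream, k, dist):
        # On-line k-center via the doubling scheme (assumes k >= 2 and that the
        # stream contains at least k distinct points).
        stream = iter(stream)
        # T <- first k distinct data points; R <- smallest interpoint distance in T
        T = []
        while len(T) < k:
            x = next(stream)
            if all(dist(x, t) > 0 for t in T):
                T.append(x)
        R = min(dist(a, b) for i, a in enumerate(T) for b in T[i + 1:])
        while True:                                      # repeat forever
            while len(T) <= k:
                yield list(T)                            # location (A): cost(T) <= 2R
                try:
                    x = next(stream)                     # get new point x
                except StopIteration:
                    return
                if min(dist(x, t) for t in T) > 2 * R:   # x is far from every center
                    T.append(x)
            # location (B): thin T to a maximal subset of centers pairwise > 2R apart
            T_new = []
            for z in T:
                if not T_new or min(dist(z, t) for t in T_new) > 2 * R:
                    T_new.append(z)
            T = T_new
            R = 2 * R                                    # location (C): double the scale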


Here 2R is roughly the cost of the current clustering, as we will shortly make precise. The lemmas below
describe invariants that hold at lines (A), (B), and (C) in the code; each refers to the start of the line (that
is, before the line is executed).

Lemma 1. All data points seen so far are (i) within distance 2R of T at (B) and (ii) within distance 4R of
T at (C).

Proof. Use induction on the number of iterations through the main loop. Notice that (i) holds on the first
iteration, and that if (i) holds on the pth iteration then (ii) holds on the pth iteration and (i) holds on the
(p + 1)st iteration.

To prove an approximation guarantee, we also need to lower bound the cost of the optimal clustering in
terms of R.

Lemma 2. At (B), there are k + 1 centers at distance ≥ R from each other.


Proof. A similar induction. It is easiest to simultaneously establish that at (C), all centers are distance ≥ 2R
apart.

We now put these together to give a performance guarantee.

Theorem 3. Whenever the algorithm is at (A),

cost(T ) ≤ 8 · cost(optimal k centers for data seen so far).

Proof. At location (A), we have |T| ≤ k and cost(T) ≤ 2R (at the preceding (C) every point was within 4R of T by Lemma 1, R has since doubled, and each point arriving after that is either within 2R of T or added to it).

The last time the algorithm was at (B), the value of R was half of its current value, so by Lemma 2 there were k + 1 centers at distance ≥ R/2 from each other (in terms of the current R). Any k centers must serve two of these k + 1 points with the same center, so the optimal cost is ≥ R/4. The ratio is therefore at most 2R/(R/4) = 8.

Problem 1. A commonly-used scheme for on-line k-means works as follows.

initialize the k centers t1, . . . , tk in any way
create counters n1, . . . , nk, all initialized to zero
repeat forever:
    get data point x
    let ti be its closest center
    set ti ← (ni ti + x)/(ni + 1) and ni ← ni + 1
What can be said about this scheme?
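For reference, a minimal NumPy sketch of this scheme (the function and variable names are chosen only for illustration):

    import numpy as np

    def sequential_kmeans(stream, init_centers):
        # Each arriving point pulls its closest center toward it; counts[i] plays
        # the role of n_i, so centers[i] is the running mean of the points
        # assigned to it so far.
        centers = np.array(init_centers, dtype=float)    # t_1, ..., t_k (k x d)
        counts = np.zeros(len(centers))                  # n_1, ..., n_k
        for x in stream:
            x = np.asarray(x, dtype=float)
            i = int(np.argmin(np.linalg.norm(centers - x, axis=1)))  # closest center t_i
            centers[i] = (counts[i] * centers[i] + x) / (counts[i] + 1)
            counts[i] += 1
        return centers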

6.2.2 An algorithm based on the cover tree


The algorithm of the previous section works for a pre-specified value of k. Is there some way to handle all k
simultaneously? Indeed, there is a clean solution to this; it is made possible by a data structure called the
cover tree, which was nicely formalized by Beygelzimer, Kakade, and Langford (2003), although it had been
in the air for a while before then.
Assume for the moment that all interpoint distances are ≤ 1. Of course this is ridiculous, but we will
get rid of this assumption shortly. A cover tree on data points x1 , . . . , xn is a rooted infinite tree with the
following properties.
1. Each node of the tree is associated with one of the data points xi .

2. If a node is associated with xi , then one of its children must also be associated with xi .


3. All nodes at depth j are at distance at least 1/2^j from each other.

4. Each node at depth j + 1 is within distance 1/2^j of its parent (at depth j).

This is described as an infinite tree for simplicity of analysis, but it would not be stored as such. In practice,
there is no need to duplicate a node as its own child, and so the tree would take up O(n) space.
The figure below gives an example of a cover tree for a data set of five points. This is just the top few
levels of the tree, but the rest of it is simply a duplication of the bottom row. From the structure of the tree
we can conclude, for instance, that x1 , x2 , x5 are all at distance ≥ 1/2 from each other (since they are all at
depth 1), and that the distance between x2 and x3 is ≤ 1/4 (since x3 is at depth 3, and is a child of x2 ).

[Figure: a five-point data set x1, . . . , x5 (left) and the top levels of its cover tree (right): depth 0 contains x1; depth 1 contains x2, x1, x5; depth 2 contains x2, x1, x5, x4; depth 3 contains x3, x2, x1, x5, x4, with x3 a child of x2.]

What makes cover trees especially convenient is that they can be built on-line, one point at a time. To
insert a new point x: find the largest j such that x is within 1/2^j of some node p at depth j in the tree; and
make x a child of p. Do you see why this maintains the four defining properties?
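For concreteness, here is a rough Python sketch of this insertion rule, under the same assumption that all interpoint distances are at most 1. The representation (a list levels, where levels[j] maps each depth-j node to its parent) is a simplification chosen for the sketch, not the bookkeeping of an actual cover tree implementation.

    def insert(levels, x, dist):
        # levels[j] is a dict mapping each node at depth j to its parent at depth
        # j-1 (None for the root).  Points must be hashable (e.g. tuples), and all
        # interpoint distances are assumed to be <= 1.
        if not levels:
            levels.append({x: None})                     # x becomes the root at depth 0
            return
        if any(dist(x, p) == 0 for p in levels[-1]):     # exact duplicate: nothing to add
            return
        # Descend while some node at the next depth d lies within 1/2^d of x;
        # depth 0 always qualifies because all distances are <= 1.
        j, parent = 0, next(iter(levels[0]))
        while True:
            d = j + 1
            nodes = levels[d] if d < len(levels) else levels[-1]   # deeper levels repeat
            near = [p for p in nodes if dist(x, p) <= 0.5 ** d]
            if not near:
                break
            j, parent = d, near[0]
        # Make x a child of that node, i.e. place it at depth j + 1 ...
        while len(levels) <= j + 1:
            levels.append({p: p for p in levels[-1]})    # property 2: nodes repeat below
        levels[j + 1][x] = parent
        for d in range(j + 2, len(levels)):              # ... and at every deeper level
            levels[d][x] = x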
Once the tree is built, it is easy to obtain k-clusterings from it.
Lemma 4. For any k, consider the deepest level of the tree with ≤ k nodes, and let Tk be those nodes. Then:

cost(Tk ) ≤ 8 · cost(optimal k centers).

Proof. Fix any k, and suppose j is the deepest level with ≤ k nodes. By Property 4, all of Tk's children are within distance 1/2^j of it, its grandchildren are within distance 1/2^j + 1/2^{j+1} of it, and so on. Therefore,

cost(Tk) ≤ 1/2^j + 1/2^{j+1} + · · · = 1/2^{j−1}.

Meanwhile, level j + 1 has ≥ k + 1 nodes, and by Property 3, these are at distance ≥ 1/2^{j+1} from each other. Therefore the optimal k-clustering has cost at least 1/2^{j+2}.
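In the representation used in the insertion sketch above, reading off Tk is just a matter of taking the deepest explicit level with at most k nodes:

    def centers_for_k(levels, k):
        # Lemma 4: T_k is the deepest level of the cover tree with at most k nodes
        # (levels[j] as in the insertion sketch; level sizes only grow with depth).
        T_k = list(levels[0])
        for level in levels:
            if len(level) <= k:
                T_k = list(level)
        return T_k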

To get rid of the assumption that all interpoint distances are ≤ 1, simply allow the tree to have depths
−1, −2, −3 and so on. Initially, the root is at depth 0, but if a node arrives that is at distance ≥ 8 (say)
from all existing nodes, then it is made the new root and is put at depth −3.
Also, the creation of the data structure appears to take O(n^2) time and O(n) space. But if we are only
interested in values of k from 1 to K, then we need only keep the top few levels of the tree, those with ≤ K
nodes. This reduces the time and space requirements to O(nK) and O(K), respectively.


6.2.3 On-line clustering algorithms: epilogue


It is an open problem to develop a good on-line algorithm for k-means clustering. There are at least two
different ways in which a proposed scheme might be analyzed. The first supposes at each time t, the algorithm
sees a new data point xt , and then outputs a set of k centers Tt . The hope is that for some constant α ≥ 1,
for all t,
cost(Tt ) ≤ α · cost(best k centers for x1 , . . . , xt ).
This is the pattern we have followed in our k-center examples.
The second kind of analysis is the more usual setting of the on-line learning literature. It supposes that
at each time t, the algorithm announces a set of k centers Tt , then sees a new point xt and incurs a loss
equal to the cost of xt under Tt (that is, the squared distance from xt to the closest center in Tt ). The hope
is that at any time t, the total loss incurred up to time t,

∑_{t′ ≤ t} min_{z ∈ T_{t′}} ‖x_{t′} − z‖²,

is not too much more than the optimal cost achievable by the best k centers for x1 , . . . , xt .
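To make this second criterion concrete, the following sketch charges an algorithm its total loss; the announce/update interface is purely illustrative and not any standard API.

    import numpy as np

    def cumulative_loss(algorithm, stream):
        # Total on-line k-means loss: at each step the algorithm announces its
        # centers T_t, then sees x_t and pays min over z in T_t of ||x_t - z||^2.
        # announce() / update() are hypothetical method names for this sketch.
        total = 0.0
        for x in stream:
            T = np.asarray(algorithm.announce(), dtype=float)   # T_t, a k x d array
            x = np.asarray(x, dtype=float)
            total += float(np.min(np.sum((T - x) ** 2, axis=1)))
            algorithm.update(x)                                 # only now may it use x_t
        return total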

Problem 2. Develop an on-line algorithm for k-means clustering that has a performance guarantee under
either of the two criteria above.

6.3 Streaming algorithms for clustering


The streaming model of computation is inspired by massive data sets that are too large to fit in memory.
Unlike on-line learning, which continues forever, there is a finite job to be done. But it is a very big job,
and so only a small portion of the input can be held in memory at any one time. For an input of size n,
one typically seeks algorithms that use memory o(n) (say, √n) and that need to make just one or two passes
through the data.

On-line Streaming
Endless stream of data Stream of (known) length n
Fixed amount of memory Memory available is o(n)
Tested at every time step Tested only at the very end
Each point is seen only once More than one pass may be possible

There are some standard ways to convert regular algorithms (that assume the data fits in memory) into
streaming algorithms. These include divide-and-conquer and random sampling. We’ll now see k-medoid
algorithms based on each of these.

6.3.1 A streaming k-medoid algorithm based on divide-and-conquer


Recall the k-medoid problem: the setting is a metric space (X , ρ).

Input: Finite set S ⊂ X ; integer k.


Output: T ⊂ S with |T| = k.

Goal: Minimize cost(T) = ∑_{x∈S} ρ(x, T).
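Written out in Python (with dist standing in for ρ, and with the weighted variant that appears later in this section), the objective is simply:

    def kmedoid_cost(points, medoids, dist, weights=None):
        # cost(T) = sum over x in S of w(x) * rho(x, T); weights default to 1,
        # which recovers the unweighted objective above.
        if weights is None:
            weights = [1.0] * len(points)
        return sum(w * min(dist(x, t) for t in medoids)
                   for x, w in zip(points, weights))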


[Figure 6.1: the huge data stream S is split into batches S1, S2, . . . , Sl; an (a, b)-approximation on each batch Si produces medoids Ti; the weighted instance Sw = T1 ∪ · · · ∪ Tl (at most akl points) is then clustered by an (a′, b′)-approximation to obtain the final medoids T, which are compared against the optimal k medoids T∗.]

Figure 6.1. A streaming algorithm for the k-medoid problem, based on divide and conquer.

Earlier, we saw an LP-based algorithm that finds 2k centers whose cost is at most 4 times that of the best
k-medoid solution. This extends to the case when the problem is weighted, that is, when each data point x
has an associated weight w(x), and the cost function is
cost(T) = ∑_{x∈S} w(x) ρ(x, T).

We will call it a (2, 4)-approximation algorithm. It turns out that (a, b)-approximation algorithms are
available for many combinations (a, b).
A natural way to deal with a huge data stream S is to read as much of it as will fit into memory (call
this portion S1 ), solve this sub-instance, then read the next batch S2 , solve this sub-instance, and so on. At
the end, the partial solutions need to be combined somehow.

Divide S into groups S1, S2, . . . , Sl
for each i = 1, 2, . . . , l:
    run an (a, b)-approximation alg on Si to get ≤ ak medoids Ti = {ti1, ti2, . . .}
    suppose Si1 ∪ Si2 ∪ · · · are the induced clusters of Si
Sw ← T1 ∪ T2 ∪ · · · ∪ Tl, with weights w(tij) ← |Sij|
run an (a′, b′)-approximation algorithm on weighted Sw to get ≤ a′k centers T
return T
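A sketch of this scheme in Python might look as follows. Here approx(points, weights, k) stands for an arbitrary (a, b)-approximation subroutine that returns at most ak medoids chosen from points; that interface, like the function names, is an assumption made for the illustration.

    def streaming_divide_and_conquer(stream, k, approx, dist, batch_size):
        # Read the stream one batch S_i at a time, keep only each batch's weighted
        # medoids, and cluster the weighted instance S_w at the very end.
        medoids, weights = [], []                 # the weighted instance S_w

        def flush(batch):
            if not batch:
                return
            T_i = approx(batch, [1.0] * len(batch), k)   # <= a*k medoids for S_i
            counts = [0] * len(T_i)                      # |S_ij| for each t_ij
            for x in batch:
                j = min(range(len(T_i)), key=lambda j: dist(x, T_i[j]))
                counts[j] += 1
            medoids.extend(T_i)
            weights.extend(counts)

        batch = []
        for x in stream:
            batch.append(x)
            if len(batch) >= batch_size:          # one in-memory batch S_i is full
                flush(batch)
                batch = []
        flush(batch)                              # leftover points form the last batch
        return approx(medoids, weights, k)        # second round, on weighted S_w

The only state retained between batches is the list of weighted medoids, which is what keeps the memory footprint small.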

Figure 6.1 shows this pictorially. An interesting case to consider is when stream S has length n and each batch Si consists of √(nk) points. In this case the second clustering problem Sw is also of size roughly √(nk), and so the whole algorithm operates with just a single pass and O(√(nk)) memory.
Before analyzing the algorithm we introduce some notation. Let T ∗ = {t∗1 , . . . , t∗k } be the optimal k
medoids for data set S. Let t∗ (x) ∈ T ∗ be the medoid closest to point x. Likewise, let t(x) ∈ T be the point
in T closest to x, and ti (x) the point in Ti closest to x. Since we are dividing the data into subsets Si , we
will need to talk about the costs of clustering these subsets as well as the overall cost of clustering S. To


this end, define cost(S′, T′) to be the cost of medoids T′ for data S′:

cost(S′, T′) = ∑_{x∈S′} ρ(x, T′)          (unweighted instance)
cost(S′, T′) = ∑_{x∈S′} w(x) ρ(x, T′)     (weighted instance)

Theorem 5. This streaming algorithm is an (a′ , 2b + 2b′ (2b + 1))-approximation.


The a′ part is obvious; the rest will be shown over the course of the next few lemmas. The first step
is to bound the overall cost in terms of the costs of two clustering steps; this is a simple application of the
triangle inequality.
Lemma 6. cost(S, T) ≤ ∑_{i=1}^{l} cost(Si, Ti) + cost(Sw, T).
Proof. Recall that the second clustering problem Sw consists of all medoids tij from the first step, with
weights w(tij ) = |Sij |.
cost(S, T) = ∑_{i=1}^{l} ∑_{x∈Si} ρ(x, T)
           ≤ ∑_{i=1}^{l} ∑_{x∈Si} (ρ(x, ti(x)) + ρ(ti(x), T))
           = ∑_{i=1}^{l} cost(Si, Ti) + ∑_{i=1}^{l} ∑_j |Sij| ρ(tij, T)
           = ∑_{i=1}^{l} cost(Si, Ti) + cost(Sw, T).

The next lemma says that when clustering a data set S ′ , picking centers from S ′ is at most twice as bad
as picking centers from the entire underlying metric space X .
Lemma 7. For any S′ ⊂ X, we have

min_{T′ ⊂ S′, |T′| = k} cost(S′, T′) ≤ 2 · min_{T′ ⊂ X, |T′| = k} cost(S′, T′).

Proof. Let T ′ ⊂ X be the optimal solution chosen from X . For each induced cluster of S ′ , replace its center
t′ ∈ T ′ by the closest neighbor of t′ in S ′ . This at most doubles the cost, by the triangle inequality.
Our final goal is to upper-bound cost(S, T), and we will do so by bounding the two terms on the right-hand side of Lemma 6. Let's start with the first of them. We'd certainly hope that ∑_i cost(Si, Ti) is smaller than cost(S, T∗); after all, the former uses way more representatives (about akl of them) to approximate the same set S. We now give a coarse upper bound to this effect.
Lemma 8. ∑_{i=1}^{l} cost(Si, Ti) ≤ 2b · cost(S, T∗).

Proof. Each Ti is a b-approximation solution to the k-medoid problem for Si. Thus

∑_{i=1}^{l} cost(Si, Ti) ≤ b · ∑_{i=1}^{l} min_{T′ ⊂ Si} cost(Si, T′)
                         ≤ 2b · ∑_{i=1}^{l} min_{T′ ⊂ X} cost(Si, T′)
                         ≤ 2b · ∑_{i=1}^{l} cost(Si, T∗) = 2b · cost(S, T∗).


The second inequality is from Lemma 7.


Finally, we bound the second term on the right-hand side of Lemma 6.
Lemma 9. cost(Sw, T) ≤ 2b′ · (∑_{i=1}^{l} cost(Si, Ti) + cost(S, T∗)).

Proof. It is enough to upper-bound cost(Sw, T∗) and then invoke

cost(Sw, T) ≤ b′ · min_{T′ ⊂ Sw} cost(Sw, T′) ≤ 2b′ · min_{T′ ⊂ X} cost(Sw, T′) ≤ 2b′ · cost(Sw, T∗).

To do so, we need only the triangle inequality.

cost(Sw, T∗) = ∑_{i,j} |Sij| ρ(tij, T∗) ≤ ∑_{i,j} ∑_{x∈Sij} (ρ(x, tij) + ρ(x, t∗(x)))
             = ∑_i ∑_{x∈Si} (ρ(x, ti(x)) + ρ(x, t∗(x)))
             = ∑_i cost(Si, Ti) + cost(S, T∗).

The theorem follows immediately by putting together the last two lemmas with Lemma 6.
Problem 3. Notice that this streaming algorithm uses two k-medoid subroutines. Even if both are perfect,
that is, if both are (1, 1)-approximations, the overall bound on the approximation factor is 8. Can a better
factor be achieved?

6.3.2 A streaming k-medoid algorithm based on random sampling


In this approach, we randomly sample from stream S, taking as many points as will fit in memory. We then
solve the clustering problem on this subset S ′ ⊂ S. The resulting cluster centers should work well for all of
S, unless S contains a cluster that has few points (and is thus not represented in S ′ ) and is far away from
the rest of the data. A second round of clustering is used to deal with this contingency.
The algorithm below is due to Indyk (1999). Its input is a stream S of length n, and also a confidence
parameter δ.
let S ′ ⊂ S be a random subset of size s
run an (a, b)-approximation alg on S ′ to get ≤ ak medoids T ′
let S ′′ ⊂ S be the (8kn/s) log(k/δ) points furthest from T ′
run an (a, b)-approximation alg on S ′′ to get ≤ ak medoids T ′′
return T ′ ∪ T ′′
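The scheme is short enough to sketch directly in Python. For clarity the sketch takes the data as an in-memory list S (a faithful streaming implementation would instead make the two passes described above), and approx(points, k) again stands for an assumed (a, b)-approximation subroutine.

    import math
    import random

    def streaming_random_sampling(S, k, approx, dist, delta):
        # Two-round sampling: cluster a random sample S', then separately cluster
        # the points that the sample's medoids serve worst (S'').
        n = len(S)
        s = min(n, int(math.sqrt(8 * k * n * math.log(k / delta))))  # sample size
        S_prime = random.sample(S, s)
        T_prime = approx(S_prime, k)                                 # medoids for S'
        m = min(n, int((8 * k * n / s) * math.log(k / delta)))       # |S''|
        by_dist = sorted(S, key=lambda x: min(dist(x, t) for t in T_prime))
        S_second = by_dist[-m:]                                      # furthest from T'
        T_second = approx(S_second, k)                               # medoids for S''
        return T_prime + T_second                                    # T' U T''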
This algorithm returns at most 2ak medoids. It requires two passes through the data, and if s is set to √(8kn log(k/δ)), it uses memory O(√(8kn log(k/δ))).
Theorem 10. With probability ≥ 1 − 2δ, this is a (2a, (1 + 2b)(1 + 2/δ))-approximation.
As before, let T ∗ = {t∗1 , . . . , t∗k } denote the optimal medoids. Suppose these induce a clustering S1 , . . . , Sk
of S. In the initial random sample S ′ , the large clusters — those with a lot of points — will be well-
represented. We now formalize this. Let L be the large clusters:
L = {i : |Si | ≥ (8n/s) log(k/δ)}.
The sample S ′ contains points Si′ = Si ∩ S ′ from cluster Si .


Lemma 11. With probability ≥ 1 − δ, for all i ∈ L: |Si′| ≥ (1/2) · (s/n) · |Si|.
Proof. Use a multiplicative form of the Chernoff bound.
We’ll next show that the optimal centers T ∗ do a pretty good job on the random sample S ′ .
Lemma 12. With probability ≥ 1 − δ, cost(S′, T∗) ≤ (1/δ) · (s/n) · cost(S, T∗).
Proof. The expected value of cost(S ′ , T ∗ ) is exactly (s/n)cost(S, T ∗ ), whereupon the lemma follows by
Markov’s inequality.
Henceforth assume that the high-probability events of Lemmas 11 and 12 hold. Lemma 12 implies that
T ′ is also pretty good for S ′ .
Lemma 13. cost(S′, T′) ≤ 2b · (1/δ) · (s/n) · cost(S, T∗).
Proof. This is by now a familiar argument.

cost(S′, T′) ≤ b · min_{T ⊂ S′} cost(S′, T) ≤ 2b · min_{T ⊂ X} cost(S′, T) ≤ 2b · cost(S′, T∗)

and the rest follows from Lemma 12.


Since the large clusters are well-represented in S ′ , we would expect centers T ′ to do a good job with
them.
Lemma 14. cost(∪_{i∈L} Si, T′) ≤ cost(S, T∗) · (1 + 2(1 + 2b)/δ).


Proof. We’ll show that for each optimal medoid t∗i of a large cluster, there must be a point in T ′ relatively
close by. To show this, we use the triangle inequality ρ(t∗i , T ′ ) ≤ ρ(t∗i , y) + ρ(y, t′ (y)), where t′ (y) ∈ T ′ is the
medoid in T ′ closest to point y. This inequality holds for all y, so we can average it over all y ∈ Si′ :
ρ(t∗i, T′) ≤ (1/|Si′|) ∑_{y∈Si′} (ρ(y, t∗i) + ρ(y, t′(y))).

Now we can bound the cost of the large clusters with respect to medoids T ′ .
cost(∪_{i∈L} Si, T′) ≤ ∑_{i∈L} ∑_{x∈Si} (ρ(x, t∗i) + ρ(t∗i, T′))
                     ≤ cost(S, T∗) + ∑_{i∈L} |Si| ρ(t∗i, T′)
                     ≤ cost(S, T∗) + ∑_{i∈L} (|Si|/|Si′|) ∑_{y∈Si′} (ρ(y, t∗i) + ρ(y, t′(y)))
                     ≤ cost(S, T∗) + (2n/s) ∑_{y∈S′} (ρ(y, t∗(y)) + ρ(y, t′(y)))
                     = cost(S, T∗) + (2n/s) (cost(S′, T∗) + cost(S′, T′)),
where the last inequality is from Lemma 11. The rest follows from Lemmas 12 and 13.
Thus the large clusters are well-represented by T ′ . But this isn’t exactly what we need. Looking back at
the algorithm, we see that medoids T ′′ will take care of S ′′ , which means that we need medoids T ′ to take
care of S \ S ′′ .


Lemma 15. cost(S \ S′′, T′) ≤ cost(S, T∗) · (1 + 2(1 + 2b)/δ).




Proof. A large cluster is one with at least (8n/s) log(k/δ) points. Therefore, since there are at most k small
clusters, the total number of points in small clusters is at most (8kn/s) log(k/δ). This means that the large
clusters account for the vast majority of the data, at least n − (8kn/s) log(k/δ) points.
S \ S′′ is exactly the n − (8kn/s) log(k/δ) points of S closest to T′. Therefore cost(S \ S′′, T′) ≤ cost(∪_{i∈L} Si, T′).

To prove the main theorem, it just remains to bound the cost of S ′′ .

Lemma 16. cost(S ′′ , T ′′ ) ≤ 2b · cost(S, T ∗ ).

Proof. This is routine.

cost(S′′, T′′) ≤ b · min_{T ⊂ S′′} cost(S′′, T) ≤ 2b · min_{T ⊂ X} cost(S′′, T) ≤ 2b · cost(S′′, T∗) ≤ 2b · cost(S, T∗).

Combining Lemmas 15 and 16, cost(S, T′ ∪ T′′) ≤ cost(S \ S′′, T′) + cost(S′′, T′′) ≤ (1 + 2b)(1 + 2/δ) · cost(S, T∗), which is the bound of Theorem 10.

Problem 4. The approximation factor contains a highly unpleasant 1/δ term. Is there a way to reduce this
to, say, log(1/δ)?
