Mining Data Streams
Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
[Course topics map: Locality-sensitive hashing; Filtering data streams; PageRank, SimRank; Recommender systems; SVM; Dimensionality reduction; Duplicate document detection; Spam detection; Web advertising; Perceptron, kNN; ...]
[Figure: a data-stream management system. Streams enter over time (e.g., ... 1, 5, 2, 7, 0, 9, 3 / ... a, r, v, t, y, h, b / ... 0, 0, 1, 0, 1, 1, 0); each stream is composed of elements/tuples. A Processor answers Standing Queries and produces Output, using Limited Working Storage, with Archival Storage behind it.]
Hash each tuple into b buckets; pick the tuple if its hash value is at most a.
How to generate a 30% sample?
Hash into b = 10 buckets; take the tuple if it hashes to one of the first 3 buckets.
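A minimal Python sketch of this bucket scheme (the helper name and the MD5 choice are illustrative, not from the slides); hashing a tuple's key, e.g. the user, keeps each key consistently in or out of the sample:

```python
import hashlib

def in_sample(key, num_buckets=10, accept=3):
    """Keep a tuple iff its key hashes to one of the first `accept`
    of `num_buckets` buckets (3 of 10 = a 30% sample)."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
    return bucket < accept

# Sample ~30% of users from a stream of (user, query) tuples:
stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q3"), ("carol", "q4")]
sample = [(u, q) for (u, q) in stream if in_sample(u)]
```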
Sampling from a Data Stream: Sampling a Fixed-Size Sample
As the stream grows, the sample stays at a fixed size
Problem 2: Maintaining a fixed-size sample
Suppose we need to maintain a random sample S of size exactly s tuples
E.g., due to a main-memory size constraint
Why fixed size? We don't know the length of the stream in advance
Suppose at time n we have seen n items
We want each item to be in the sample S with equal probability s/n
How to think about the problem: say s = 2
Stream: a x c y z k c d e g …
At n = 5, each of the first 5 tuples should be in the sample S with equal probability s/n = 2/5
At n = 7, each of the first 7 tuples should be in the sample S with equal probability 2/7
An impractical solution would be to store all n tuples seen so far and pick s of them at random
Solution: Fixed-Size Sample
Algorithm (a.k.a. Reservoir Sampling)
Store all of the first s elements of the stream in S
Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
With probability s/n, keep the nth element; else discard it
If we picked the nth element, it replaces one of the s elements in the sample S, picked uniformly at random
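A short Python sketch of this algorithm (the function name and demo stream are illustrative):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of exactly s stream elements."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)                 # store the first s elements
        elif random.random() < s / n:           # keep the nth element w.p. s/n
            sample[random.randrange(s)] = item  # replace a uniformly chosen victim
    return sample

print(reservoir_sample("axcyzkcdeg", s=2))      # e.g. ['x', 'd']; varies per run
```

One can check by induction that after n elements, every element seen so far is in S with probability s/n, which is exactly the property claimed above.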
Sliding Windows
[Figure: a window over the most recent elements of the stream q w e r t y u i o p a s d f g h j k l z x c v b n m, sliding forward as elements arrive; Past lies to the left of the window, Future to the right.]
Problem: how many 1s are in the last N bits of the stream?
Obvious solution:
Store the most recent N bits
When a new bit comes in, discard the (N+1)st-oldest bit
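A minimal sketch of this exact approach in Python (class name illustrative); it needs N bits of state, which motivates the approximations below:

```python
from collections import deque

class ExactWindowCount:
    """Exact count of 1s among the last N bits, at the cost of N bits of state."""
    def __init__(self, N):
        self.window = deque(maxlen=N)   # appending to a full deque evicts the oldest bit
        self.ones = 0

    def add(self, bit):
        if len(self.window) == self.window.maxlen and self.window[0] == 1:
            self.ones -= 1              # the discarded oldest bit was a 1
        self.window.append(bit)
        self.ones += bit

w = ExactWindowCount(N=6)
for b in [0, 1, 0, 0, 1, 1, 0, 1, 1]:
    w.add(b)
print(w.ones)                           # 4 ones among the last 6 bits
```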
[Example: stream 010011011101010110110110 with window size N = 6; Past to the left, Future to the right.]
If we cannot afford to store N bits, maintain 2 counters:
S: number of 1s from the beginning of the stream
Z: number of 0s from the beginning of the stream
How many 1s are in the last N bits? Estimate: N · S / (S + Z), assuming uniformity
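In code, the estimate is a single ratio (toy values, illustrative names):

```python
stream = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # toy bit stream
N = 4                                     # window size being asked about
S = sum(stream)                           # 1s since the beginning: 6
Z = len(stream) - S                       # 0s since the beginning: 4
print(N * S / (S + Z))                    # 2.4 estimated 1s in the last N bits,
                                          # assuming 1s are uniformly spread over time
```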
But what if the stream is non-uniform?
What if the distribution changes over time?
[Datar, Gionis, Indyk, Motwani]
DGIM Method
A DGIM solution does not assume uniformity
We store O(log² N) bits per stream
The solution gives an approximate answer, never off by more than 50%
The error factor can be reduced to any fraction ε > 0, with a more complicated algorithm and proportionally more stored bits
[Figure: how many 1s are in the last N bits of the example stream 010011100010100100010110110111001010110011010?]
[Datar, Gionis, Indyk, Motwani]
Constraint on buckets: the number of 1s in a bucket must be a power of 2
This is what makes a bucket's size storable in O(log log N) bits: since the count is a power of 2 no larger than N, it suffices to record its exponent, an integer of at most log₂ N, which takes O(log log N) bits (e.g., for N = 2³⁰ the exponent is at most 30 and fits in 5 bits)
[Figure: the stream 1001010110001011010101010101011010101010101110101010111010100010110010 partitioned into buckets over a window of the last N bits.]
Representing a Stream by Buckets
Either one or two buckets with the same power-of-2 number of 1s
Buckets do not overlap in timestamps
[Figure: bucketized stream 1001010110001011010101010101011010101010101110101010111010100010110010; at least one bucket of size 16 (partially beyond the window), two of size 8, two of size 4, one of size 2, and two of size 1.]
Updating buckets: when the next bit 1 arrives, a new bucket of size 1 is created (an arriving 0 leaves the buckets unchanged); here bits 1, 0, 1 arrive in turn:
[Figure: updated stream 0101100010110101010101010110101010101011101010101110101000101100101101. When a third bucket of size 1 appears, the two oldest size-1 buckets merge into one bucket of size 2, and such merges can cascade to larger sizes.]
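A compact Python sketch of the DGIM bookkeeping described above (timestamps are kept unwrapped for clarity; a space-conscious version would store them modulo N):

```python
from collections import deque

class DGIM:
    """Approximate count of 1s among the last N bits.
    A bucket is (timestamp of its most recent 1, size); sizes are powers
    of 2, and at most two buckets of any one size exist."""

    def __init__(self, N):
        self.N = N
        self.t = 0               # current time
        self.buckets = deque()   # newest bucket at index 0, oldest at the end

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return               # 0s never change the buckets
        self.buckets.appendleft((self.t, 1))
        # Whenever three buckets share a size, merge the two oldest of them;
        # the merge can cascade to the next larger size.
        i = 0
        while i + 2 < len(self.buckets) and (
            self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]
        ):
            newer_t, size = self.buckets[i + 1]     # newer of the two oldest
            del self.buckets[i + 2]
            self.buckets[i + 1] = (newer_t, 2 * size)
            i += 1

    def count(self):
        """All bucket sizes except the oldest, plus half the oldest bucket's
        size -- the source of the at-most-50% error mentioned above."""
        sizes = [size for (_, size) in self.buckets]
        return sum(sizes[:-1]) + sizes[-1] // 2 if sizes else 0

d = DGIM(N=10)
for b in [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]:
    d.add(b)
print(d.count())   # 7 (the true count in the last 10 bits is 8)
```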
Further Reducing the Error
Instead of maintaining 1 or 2 buckets of each size, we allow either r-1 or r buckets (r > 2)
Except for the buckets of the largest size; of those we can have any number between 1 and r
The error is then at most O(1/r)
By picking r appropriately, we can trade off between the number of bits we store and the error