Data Stream Sampling
We need k hash functions to compute the hashes of a given input. To add an item x to the filter,
the bits at the k indices h1(x), h2(x), …, hk(x) are set to 1, where each index is computed by one of
the hash functions.
Example – Suppose we want to add “neeraj” to the filter, using 3 hash functions and a bit array of
length 10 with all bits initially set to 0.
First, we’ll calculate the hashes as follows:
h1(“neeraj”) % 10 = 1
h2(“neeraj”) % 10 = 4
h3(“neeraj”) % 10 = 7
Note: These outputs are arbitrary values chosen for illustration only.
Now we will set the bits at indices 1, 4 and 7 to 1.
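The insertion step above can be sketched in Python. This is a minimal illustration of the scheme commonly known as a Bloom filter, not a definitive implementation. The class name BloomFilter, the method names, and the salted SHA-256 construction standing in for h1, h2, h3 are all assumptions made for this sketch, so the indices it sets for “neeraj” will generally differ from the illustrative 1, 4 and 7. The lookup method is not described in the text above; it is included only to show how the set bits are used.

```python
import hashlib


class BloomFilter:
    """Minimal filter sketch: a bit array of length m and k hash functions."""

    def __init__(self, m=10, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m  # all bits start at 0

    def _indices(self, item):
        # Hypothetical h1..hk: SHA-256 of the item prefixed with a per-function salt,
        # reduced modulo the bit-array length.
        indices = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            indices.append(int.from_bytes(digest, "big") % self.m)
        return indices

    def add(self, item):
        # Set the bit at each of the k computed indices.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True may be a false positive; False means the item was never added.
        return all(self.bits[idx] == 1 for idx in self._indices(item))


bf = BloomFilter(m=10, k=3)
bf.add("neeraj")
print(bf.bits)
print(bf.might_contain("neeraj"))
```

Two hash functions may map an item to the same index, so adding one item sets at most k bits, not always exactly k.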
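Tail lengths for an example stream (the column values are consistent with the hash h(a) = (6a + 1) mod 5):
a    6a + 1    (6a + 1) mod 5    binary    r(a)
1    7         2                 00010     1
3    19        4                 00100     2
2    13        3                 00011     0
1    7         2                 00010     1
2    13        3                 00011     0
3    19        4                 00100     2
4    25        0                 00000     5
3    19        4                 00100     2
1    7         2                 00010     1
2    13        3                 00011     0
3    19        4                 00100     2
1    7         2                 00010     1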
R = max(r(a)) = 5
So, the estimated number of distinct elements is N = 2^R = 2^5 = 32
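The table and the estimate above can be reproduced with a short Python sketch. The hash h(a) = (6a + 1) mod 5 is an assumption inferred from the table values, and the 5-bit width and the function names are likewise chosen for this illustration. An all-zero bit string is treated as having a tail length equal to its full width, matching the row for a = 4.

```python
def tail_length(value, width):
    """Number of trailing 0s in the width-bit binary form of value."""
    if value == 0:
        return width  # all-zero string: every bit counts as a trailing 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def h(a):
    # Assumed hash, inferred from the table values: (6a + 1) mod 5.
    return (6 * a + 1) % 5


stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
WIDTH = 5  # bit-string length used in the table

R = max(tail_length(h(a), WIDTH) for a in stream)
estimate = 2 ** R
print(R, estimate)  # 5 32
```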
• We may want to know how many different elements have appeared in the
stream.
• For example, we may wish to know how many distinct users have visited the website so far
or in the last 2 hours.
• If distinct elements must be counted for many streams at once, keeping all the data
in main memory becomes a challenge.
• The Flajolet-Martin (FM) algorithm gives an efficient way to count the distinct elements in a stream.
• It is possible to estimate the number of distinct elements by hashing the elements
of the universal set to a bit string that is sufficiently long.
• The bit string must be long enough that there are more possible
results of the hash function than there are elements in the universal set.
• Whenever we apply a hash function h to a stream element a, the bit string h(a) will end
in some number of 0s, possibly none.
• Call this number the tail length for a and h.
• Let R be the maximum tail length of any a seen so far in the stream.
• Then we use 2^R as the estimate of the number of distinct elements seen in the stream.
• Consider the stream:
S = {1, 2, 1, 3}
Let the hash function be h(x) = (2x + 2) mod 4
• When we apply the hash function, we get the remainders 0, 2, 0, 0, represented in binary as:
000, 010, 000, 000, considering a bit string length of 3.
• The maximum tail length R is 3 (the all-zero string 000 has tail length 3).
• The estimated number of distinct elements is 2^R = 2^3 = 8
• Here the estimate may be too high or too low depending on the hash function.
• We may apply multiple hash functions and combine their estimates to get a more accurate
value.
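The estimation procedure in the bullets above can be sketched as follows. The first call reproduces the small example with h(x) = (2x + 2) mod 4 and 3-bit strings. The second part illustrates combining several hash functions; the salted SHA-256 family and the choice of the median as the combining rule are assumptions made for this sketch, since the text only says that the estimates are combined.

```python
import hashlib
from statistics import median


def tail_length(value, width):
    """Number of trailing 0s in the width-bit binary form of value."""
    if value == 0:
        return width  # all-zero string: every bit counts as a trailing 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def fm_estimate(stream, hash_fn, width):
    """Single-hash estimate: 2 raised to the maximum tail length seen."""
    return 2 ** max(tail_length(hash_fn(a), width) for a in stream)


def salted_hash(seed, width):
    """Hypothetical hash family: SHA-256 of 'seed:element', kept to width bits."""
    def h(a):
        digest = hashlib.sha256(f"{seed}:{a}".encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** width)
    return h


def combined_estimate(stream, num_hashes=15, width=32):
    # Combining by median is an assumption; the text does not name a rule.
    estimates = [
        fm_estimate(stream, salted_hash(seed, width), width)
        for seed in range(num_hashes)
    ]
    return median(estimates)


# Worked example from the text: S = {1, 2, 1, 3}, h(x) = (2x + 2) mod 4, 3 bits.
small = [1, 2, 1, 3]
print(fm_estimate(small, lambda x: (2 * x + 2) % 4, 3))  # 8

# Several hash functions combined into one estimate.
print(combined_estimate(small))
```

Because only the maximum tail length per hash function is kept, repeated elements in the stream do not change the result, and the memory used does not grow with the stream length.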