Experiment No 8
THEORY:
The Flajolet–Martin algorithm approximates the number of distinct elements in a stream in a single pass, using space that is logarithmic in the maximum number of possible distinct elements in the stream. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 paper "Probabilistic Counting Algorithms for Data Base Applications."
The idea behind the Flajolet–Martin algorithm is that the more distinct elements we see in the stream, the more distinct hash values we shall see. As we see more distinct hash values, it becomes more likely that one of these values will be "unusual." The particular unusual property we exploit here is that the value ends in many 0's, although many other options exist. Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0's, possibly none. Call this number the tail length for a and h. For example, if h(a) = 12 = 1100 in binary, the tail length is 2. Let R be the maximum tail length of any a seen so far in the stream. Then we shall use 2^R as our estimate of the number of distinct elements seen in the stream.
This estimate makes intuitive sense. The probability that a given stream element a has h(a) ending in at least r 0's is 2^(-r). Suppose there are m distinct elements in the stream. Then the probability that none of them has tail length at least r is (1 − 2^(-r))^m. This sort of expression should be familiar by now. We can rewrite it as ((1 − 2^(-r))^(2^r))^(m · 2^(-r)). Assuming r is reasonably large, the inner expression is of the form (1 − ε)^(1/ε), which is approximately 1/e. Thus, the probability of not finding a stream element with as many as r 0's at the end of its hash value is e^(-m · 2^(-r)).
We can conclude:
1. If m is much larger than 2^r, then the probability that we shall find a tail of length at least r approaches 1.
2. If m is much less than 2^r, then the probability of finding a tail of length at least r approaches 0.
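As a quick numerical check (an illustration added here, not part of the original derivation), the following Python snippet compares the exact probability (1 − 2^(-r))^m with the approximation e^(-m · 2^(-r)) for a few sample values of m and r:

import math

# Compare the exact probability that no element has tail length >= r
# with the approximation e^(-m * 2^(-r)) for a few (m, r) pairs.
for m, r in [(100, 4), (1000, 10), (10000, 10), (10000, 20)]:
    exact = (1 - 2 ** (-r)) ** m
    approx = math.exp(-m * 2 ** (-r))
    print(f"m={m:6d} r={r:2d} exact={exact:.6f} approx={approx:.6f}")

When m is much larger than 2^r the probability is close to 0, and when m is much smaller than 2^r it is close to 1, matching the two conclusions above.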
ALGORITHM:
Assume that we are given a hash function hash(x) which maps its input to integers in the range [0, 2^L − 1], and whose outputs are sufficiently uniformly distributed. Note that the set of integers from 0 to 2^L − 1 corresponds to the set of binary strings of length L. For any non-negative integer y, define bit(y, k) to be the k-th bit in the binary representation of y, such that:

y = Σ bit(y, k) · 2^k, where the sum is over all k ≥ 0.

We then define a function P(y) which outputs the position of the least significant 1-bit in the binary representation of y:

P(y) = min{ k ≥ 0 : bit(y, k) ≠ 0 }

where P(0) = L. Note that with the above definition we are using 0-indexing for the positions. For example, P(13) = P(1101) = 0 since the least significant bit is a 1, and P(8) = P(1000) = 3 since the least significant 1-bit is at the third position. At this point, note that under the assumption that the output of our hash function is uniformly distributed, the probability of observing a hash output ending with 2^k (a one, followed by k zeroes) is 2^(−(k+1)), since this corresponds to flipping k heads and then a tail with a fair coin.
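The two helper functions can be sketched directly in Python (an illustration added for clarity; the value of L is an assumption):

L = 16  # assumed width of the hash output in bits

def bit(y, k):
    # k-th bit (0-indexed from the least significant end) of y's binary representation
    return (y >> k) & 1

def P(y):
    # Position of the least significant 1-bit in y; defined to be L when y == 0.
    if y == 0:
        return L
    k = 0
    while bit(y, k) == 0:
        k += 1
    return k

print(P(13))  # 13 = 1101 in binary, so P(13) = 0
print(P(8))   # 8 = 1000 in binary, so P(8) = 3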
Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset M is as follows:
1. Initialize a bit-vector BITMAP to be of length L and contain all 0's.
2. For each element x in M:
1. index = P(hash(x))
2. BITMAP[index] = 1
3. Let R denote the smallest index i such that BITMAP[i] = 0, and take 2^R as the estimate of the number of distinct elements.
The idea is that if n is the number of distinct elements in the multiset M, then BITMAP[0] is accessed approximately n/2 times, BITMAP[1] is accessed approximately n/4 times, and so on. Consequently, if i >> log2(n) then BITMAP[i] is almost certainly 0, and if i << log2(n) then BITMAP[i] is almost certainly 1. If i ≈ log2(n) then BITMAP[i] can be expected to be either 1 or 0.
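A minimal sketch of this bitmap variant in Python (assuming a simple linear hash; this is illustrative and separate from the code submitted below):

import random

L = 32                           # assumed width of the hash output in bits
a = random.randrange(1, 2 ** L)  # assumed hash parameters
b = random.randrange(2 ** L)

def h(x):
    # Simple linear hash into [0, 2^L - 1]; an assumption for this sketch.
    return (a * x + b) % (2 ** L)

def P(y):
    # Position of the least significant 1-bit in y; L when y == 0.
    if y == 0:
        return L
    k = 0
    while (y >> k) & 1 == 0:
        k += 1
    return k

def fm_estimate(stream):
    bitmap = [0] * (L + 1)       # one extra slot so that P(0) = L stays in range
    for x in stream:
        bitmap[P(h(x))] = 1
    R = bitmap.index(0)          # smallest index i with BITMAP[i] = 0
    return 2 ** R

print(fm_estimate(random.randint(0, 9999) for _ in range(10000)))

Because a single run can easily be off by a factor of two in either direction, the variance-reduction scheme described next is usually applied on top of this.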
A problem with the Flajolet–Martin algorithm in the above form is that the results vary a lot. A common solution is to run the algorithm multiple times with different hash functions and combine the results from the different runs. One idea is to take the mean of the results from each hash function, obtaining a single estimate of the cardinality. The problem with this is that the mean is very susceptible to outliers (which are likely here). A different idea is to use the median, which is less prone to influence by outliers. The problem with this is that the results can only take the form 2^R, where R is an integer. A common solution is to combine both the mean and the median: create k · l hash functions and split them into l distinct groups (each of size k). Within each group use the median to aggregate the k results, and finally take the mean of the l group estimates as the final estimate.
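This grouping scheme can be sketched as follows (the helper single_fm_estimate and the group sizes k and l are illustrative assumptions, not part of the submitted code):

import random
import statistics

def single_fm_estimate(stream, a, b, L=32):
    # One run of Flajolet-Martin with hash h(x) = (a*x + b) mod 2^L.
    max_tail = 0
    for x in stream:
        y = (a * x + b) % (2 ** L)
        tail = L if y == 0 else (y & -y).bit_length() - 1  # trailing-zero count
        max_tail = max(max_tail, tail)
    return 2 ** max_tail

def combined_estimate(stream, k=5, l=4):
    # k*l independent runs: median within each group of k, mean of the l medians.
    stream = list(stream)
    group_medians = []
    for _ in range(l):
        runs = [single_fm_estimate(stream,
                                   random.randrange(1, 2 ** 32),
                                   random.randrange(2 ** 32))
                for _ in range(k)]
        group_medians.append(statistics.median(runs))
    return statistics.mean(group_medians)

data = [random.randint(0, 9999) for _ in range(10000)]
print(combined_estimate(data))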
CODE:
import numpy as np

# Generate a random input stream of 10,000 integers to simplify testing.
def generateSeq():
    arr = np.random.randint(0, 9999, size=(10000))
    return arr

# Linear hash function from the class slides: h(x) = (a*x + b) mod 10000.
def hashFunction(a, b, num):
    ret = (num * a + b) % 10000
    return ret

# Count the trailing zeros (tail length) of a binary string such as bin(12) = '0b1100'.
def trailingZeros(binNum):
    ret = 0
    binNum = binNum[2:]            # strip the '0b' prefix
    for i in range(len(binNum) - 1, -1, -1):
        if binNum[i] == "0":
            ret += 1
        else:
            break
    if ret == len(binNum):         # the hash value was 0 (all zeros)
        return 1
    else:
        return ret

inpData = generateSeq()
a = np.random.randint(0, 9999)
b = np.random.randint(0, 9999)
maxTZ = -1                         # maximum tail length R seen so far
for num in inpData:
    hashedNum = hashFunction(a, b, num)
    tz = trailingZeros(bin(hashedNum))   # tail length of the current element
    if maxTZ == -1 or maxTZ < tz:
        maxTZ = tz
print(maxTZ)                       # R
print(2 ** maxTZ)                  # Flajolet-Martin estimate 2^R
CONCLUSION:
If the number of distinct elements is too great, or if there are too many streams that need to be processed at once (e.g., Yahoo! wants to count the number of unique users viewing each of its pages in a month), then we cannot store the needed data in main memory. The Flajolet–Martin algorithm lets us estimate the number of distinct elements in a single pass using only a small amount of memory.
OUTPUT: