Experiment No 8

The document describes an experiment to implement a data streaming algorithm using MapReduce. It provides background on the Flajolet–Martin algorithm, which can approximate the number of distinct elements in a data stream using only a single pass over the data and logarithmic space. The algorithm works by tracking the maximum number of trailing zeros in hashed values of elements. The document outlines the Flajolet–Martin algorithm and provides Python code to implement it on a sample random data set to estimate the number of distinct elements. The conclusion notes that the Flajolet–Martin algorithm is useful when the distinct elements are too great to store in memory.

Uploaded by

Aman Jain

Experiment No.

Aim : Write a program to implement a Data Streaming algorithm using MapReduce.


Lab Outcome No. : 8.ITL 801.5
Lab Outcome : Design and implement algorithms to analyze Big Data like streams, Web
Graphs and Social Media data and construct recommendation systems.
Date of Performance: 8/4/22
Date of Submission: 13/4/22

Program formation / Execution / Ethical practices (07) | Documentation (02) | Timely Submission (03) | Viva Answer (03) | Experiment Marks (15) | Teacher Signature with date
EXPERIMENT NO : 8

AIM : Write a program to implement a Data Streaming algorithm using MapReduce.

THEORY :
The Flajolet–Martin algorithm approximates the number of distinct elements in a stream in a
single pass, using space that is logarithmic in the maximum number of possible distinct
elements in the stream. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin
in their 1984 paper "Probabilistic Counting Algorithms for Data Base Applications".

The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in the
stream, the more different hash-values we shall see. As we see more different hash-values, it
becomes more likely that one of these values will be “unusual.” The particular unusual property
we shall exploit is that the value ends in many 0’s, although many other options exist. Whenever
we apply a hash function h to a stream element a, the bit string h(a) will end in some number of
0’s, possibly none. Call this number the tail length for a and h. Let R be the maximum tail length
of any a seen so far in the stream. Then we shall use the estimate $2^R$ for the number of distinct
elements seen in the stream.
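
The tail-length idea can be sketched in a few lines of Python. This is only an illustrative sketch: the names `tail_length`, `hash32` and `estimate_distinct` are hypothetical, and the MD5-based 32-bit hash is an assumption chosen for convenience, not something the text prescribes.

```python
import hashlib


def tail_length(value: int, width: int = 32) -> int:
    """Number of trailing 0 bits in value; an all-zero value counts as width."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit; its position is the tail length
    return (value & -value).bit_length() - 1


def hash32(element) -> int:
    """Illustrative 32-bit hash taken from an MD5 digest (an assumption)."""
    digest = hashlib.md5(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def estimate_distinct(stream) -> int:
    """Return 2**R, where R is the largest tail length seen in the stream."""
    max_tail = 0
    for element in stream:
        max_tail = max(max_tail, tail_length(hash32(element)))
    return 2 ** max_tail
```

Because only the maximum tail length is stored, repeating an element never changes the estimate, which is why the method counts distinct elements rather than total elements.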

This estimate makes intuitive sense. The probability that a given stream element $a$ has $h(a)$
ending in at least $r$ 0’s is $2^{-r}$. Suppose there are $m$ distinct elements in the stream. Then the
probability that none of them has tail length at least $r$ is $(1 - 2^{-r})^m$. This sort of expression
should be familiar by now. We can rewrite it as $\left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}}$. Assuming $r$ is
reasonably large, the inner expression is of the form $(1 - \epsilon)^{1/\epsilon}$, which is approximately
$1/e$. Thus, the probability of not finding a stream element with as many as $r$ 0’s at the end of its
hash value is $e^{-m 2^{-r}}$.

We can conclude:

1. If $m$ is much larger than $2^r$, then the probability that we shall find a tail of length at least $r$
approaches 1.

2. If $m$ is much less than $2^r$, then the probability of finding a tail length at least $r$ approaches 0.
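
The two conclusions can be checked numerically. The sketch below (function names are hypothetical) evaluates the probability of seeing at least one tail of length r or more among m distinct elements, both exactly and with the exponential approximation derived above.

```python
import math


def prob_tail_at_least(m: int, r: int) -> float:
    """Exact probability that at least one of m elements has tail length >= r."""
    return 1.0 - (1.0 - 2.0 ** -r) ** m


def prob_tail_at_least_approx(m: int, r: int) -> float:
    """Same probability using the approximation 1 - e^(-m * 2^-r)."""
    return 1.0 - math.exp(-m * 2.0 ** -r)


if __name__ == "__main__":
    m = 1000
    for r in (5, 10, 20):
        print(r, prob_tail_at_least(m, r), prob_tail_at_least_approx(m, r))
```

With m = 1000, a small r such as 5 (where m is much larger than 2^r) gives a probability close to 1, while r = 20 (where m is much smaller than 2^r) gives a probability close to 0.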

ALGORITHM :
Assume that we are given a hash function hash(x) which maps input $x$ to integers in the range
$[0; 2^L - 1]$, and where the outputs are sufficiently uniformly distributed. Note that the set of
integers from 0 to $2^L - 1$ corresponds to the set of binary strings of length $L$. For any
non-negative integer $y$, define bit(y, k) to be the $k$-th bit in the binary representation of $y$,
such that:

$y = \sum_{k \ge 0} \mathrm{bit}(y, k) \, 2^k$

We then define a function $P(y)$ which outputs the position of the least significant 1-bit in the
binary representation of $y$:

$P(y) = \min \{ k \ge 0 : \mathrm{bit}(y, k) \ne 0 \}$

where $P(0) = L$. Note that with the above definition we are using 0-indexing for the positions,
starting from the least significant bit. For example, $P(13) = P(1101_2) = 0$ since the least
significant bit is a 1, and $P(8) = P(1000_2) = 3$ since the least significant 1-bit is at position 3.
At this point, note that under the assumption that the output of our hash function is uniformly
distributed, the probability of observing a hash output ending with $2^k$ (a one, followed by $k$
zeroes) is $2^{-(k+1)}$, since this corresponds to flipping $k$ heads and then a tail with a fair coin.
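
The two helper functions defined above translate directly into Python. This is a minimal sketch: the name `lsb_position` is a hypothetical stand-in for P(y), and the default length of 32 bits is an assumption.

```python
def bit(y: int, k: int) -> int:
    """k-th bit of y, counting from the least significant bit (position 0)."""
    return (y >> k) & 1


def lsb_position(y: int, length: int = 32) -> int:
    """Position of the least significant 1-bit of y; returns length when y == 0."""
    if y == 0:
        return length
    k = 0
    while bit(y, k) == 0:
        k += 1
    return k


if __name__ == "__main__":
    print(lsb_position(13))  # binary 1101
    print(lsb_position(8))   # binary 1000
```

Counting over all 8-bit values shows the coin-flip argument at work: exactly 32 of the 256 values have their lowest 1-bit at position 2, a fraction of 2^-3.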
Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset $M$ is as follows:
1. Initialize a bit-vector BITMAP to be of length $L$ and contain all 0's.
2. For each element $x$ in $M$:
   1. index = $P(\mathrm{hash}(x))$
   2. BITMAP[index] = 1
3. Let $R$ denote the smallest index $i$ such that BITMAP[i] = 0.
4. Estimate the cardinality of $M$ as $2^R / \varphi$, where $\varphi \approx 0.77351$.

The idea is that if $n$ is the number of distinct elements in the multiset $M$, then BITMAP[0] is
accessed approximately $n/2$ times, BITMAP[1] is accessed approximately $n/4$ times and so on.
Consequently, if $i \gg \log_2 n$ then BITMAP[i] is almost certainly 0, and if $i \ll \log_2 n$ then
BITMAP[i] is almost certainly 1. If $i \approx \log_2 n$ then BITMAP[i] can be expected to be either 1
or 0.
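
The four steps above can be sketched as follows. The SHA-256-based hash, the length of 32 bits and the function names are assumptions made for illustration; the correction factor 0.77351 is the constant from step 4. A hash value of 0 would map to index L, which lies outside the bitmap, so this sketch simply skips it.

```python
import hashlib

PHI = 0.77351  # correction factor used in step 4
LENGTH = 32    # assumed bitmap length L


def lsb_position(y: int, length: int = LENGTH) -> int:
    """Position of the least significant 1-bit of y; length when y == 0."""
    if y == 0:
        return length
    return (y & -y).bit_length() - 1


def hash_bits(element, length: int = LENGTH) -> int:
    """Illustrative hash into [0, 2**length - 1] built from SHA-256 (an assumption)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << length)


def flajolet_martin(multiset, length: int = LENGTH) -> float:
    """Estimate the number of distinct elements using the bitmap form."""
    bitmap = [0] * length                      # step 1
    for x in multiset:                         # step 2
        index = lsb_position(hash_bits(x, length), length)
        if index < length:                     # skip the all-zero hash value
            bitmap[index] = 1
    # step 3: smallest index whose bit is still 0 (length if every bit is set)
    r = next((i for i, b in enumerate(bitmap) if b == 0), length)
    return 2 ** r / PHI                        # step 4
```

A single run is only a rough estimate; it can be off by a factor of two or more, which motivates combining several hash functions.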

A problem with the Flajolet–Martin algorithm in the above form is that the results vary a lot. A
common solution is to run the algorithm multiple times with different hash functions, and
combine the results from the different runs. One idea is to take the mean of the results from
each hash function, obtaining a single estimate of the cardinality. The problem with this is
that averaging is very susceptible to outliers (which are likely here). A different idea is to use the
median, which is less prone to be influenced by outliers. The problem with this is that the results
can only take the form $2^R / \varphi$, where $R$ is an integer. A common solution is to combine both
the mean and the median: create $k \cdot l$ hash functions and split them into $k$ distinct groups
(each of size $l$). Within each group use the median for aggregating together the $l$ results, and
finally take the mean of the $k$ group estimates as the final estimate.
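
The combination step described above can be sketched as a small helper. The function name `combine_estimates` and the sample numbers are hypothetical; the grouping order (median inside each group, then mean across groups) follows the paragraph above.

```python
from statistics import mean, median


def combine_estimates(estimates, group_size):
    """Median within each group of group_size estimates, then mean of the medians."""
    if group_size <= 0 or len(estimates) % group_size != 0:
        raise ValueError("number of estimates must be a multiple of group_size")
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    group_medians = [median(group) for group in groups]
    return mean(group_medians)


if __name__ == "__main__":
    # Nine made-up estimates from nine hash functions, in three groups of three
    sample = [2, 4, 1024, 4, 8, 8, 16, 4, 4]
    print(combine_estimates(sample, 3))
```

In the sample, the outlier 1024 is discarded by the median of its group, and the final mean can take values that are not powers of two.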

CODE:

import numpy as np

# Generate the input array of 10000 random integers in the range 0-9999.
# In a real-life scenario the input would arrive as a live stream rather
# than being stored in a list, but since 10000 is not a huge number it is
# generated beforehand, and not on the fly, to simplify testing.
def generateSeq():
    # randint excludes the upper bound, so 10000 yields values 0-9999
    arr = np.random.randint(0, 10000, size=10000)
    return arr

# Hash the given number with the equation (a*x + b) modulus m.
# The value of m here is 10000.
# The values of a and b are chosen at random, following the algorithm
# discussed in the class slides.
def hashFunction(a, b, num):
    ret = (num * a + b) % 10000
    return ret

# Count the trailing zeros in the binary string of the hashed value
# (as produced by bin(), including the "0b" prefix).
# Returns 1 in case the binNum variable has all zeros.
def trailingZeros(binNum):
    ret = 0
    binNum = binNum[2:]  # strip the "0b" prefix
    for i in range(len(binNum) - 1, -1, -1):
        if binNum[i] == "0":
            ret += 1
        else:
            break
    if ret == len(binNum):
        return 1
    else:
        return ret

# Generate input array
inpData = generateSeq()

# Current maximum number of trailing zeros seen while iterating over the
# input array; -1 means no element has been processed yet
maxTZ = -1

# Generate random values for the variables used in the hashing function.
# a starts at 1 because a = 0 would hash every element to the same value.
a = np.random.randint(1, 10000)
b = np.random.randint(0, 10000)

# Iterate over input array
for num in inpData:
    # calculate hash value of current number
    hashedNum = hashFunction(a, b, int(num))
    # calculate number of trailing zeros in the binary string
    tz = trailingZeros(bin(hashedNum))
    # if current value of tz is greater than maxTZ, update value of maxTZ
    if maxTZ == -1 or maxTZ < tz:
        maxTZ = tz

# Print max number of trailing zeroes in any hashed value
print(maxTZ)

# Print the estimated number of distinct elements in the input array,
# given by 2 ** (max number of trailing zeros)
print("Number of distinct elements: ")
print(2 ** maxTZ)
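
The aim mentions MapReduce, and the computation above fits that model naturally because taking a maximum is associative: each mapper can report the largest tail length in its own chunk, and a reducer keeps the maximum of those partial results. The following is only a single-process sketch of that structure using plain Python; it does not use Hadoop or any other real MapReduce framework, and the function names and the chunking scheme are assumptions.

```python
from functools import reduce


def hash_function(a, b, num, m=10000):
    """Same (a*x + b) mod m hash as in the program above."""
    return (a * num + b) % m


def trailing_zeros(value):
    """Trailing 0 bits of value; returns 1 for value 0, as in the program above."""
    if value == 0:
        return 1
    return (value & -value).bit_length() - 1


def mapper(chunk, a, b):
    """Map phase: largest tail length among the hashed elements of one chunk."""
    return max((trailing_zeros(hash_function(a, b, x)) for x in chunk), default=0)


def reducer(left, right):
    """Reduce phase: keep the larger of two partial maxima."""
    return max(left, right)


def fm_map_reduce(data, a, b, n_chunks=4):
    """Split data into chunks, map each chunk, reduce, and return 2**R."""
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    partial_maxima = [mapper(chunk, a, b) for chunk in chunks]
    return 2 ** reduce(reducer, partial_maxima, 0)
```

The result does not depend on how the data is split into chunks, which is what allows the map phase to run in parallel on separate machines.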
CONCLUSION :
If the number of distinct elements is too great, or if there are too many streams that need to be
processed at once (e.g., Yahoo! wants to count the number of unique users viewing each of its
pages in a month), then we cannot store the needed data in main memory. In such cases the
Flajolet–Martin algorithm lets us estimate the number of distinct elements in a single pass over
the stream, using only a small amount of memory.

OUTPUT:
