Experiment No 8
THEORY:
The Flajolet–Martin algorithm approximates the number of distinct elements in a stream in a single pass, using space that is logarithmic in the maximum number of possible distinct elements in the stream. The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 paper "Probabilistic Counting Algorithms for Data Base Applications."
The idea behind the Flajolet–Martin algorithm is that the more distinct elements we see in the stream, the more distinct hash values we shall see. As we see more distinct hash values, it becomes more likely that one of these values will be "unusual." The particular unusual property we exploit here is that the value ends in many 0's, although many other options exist. Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0's, possibly none. Call this number the tail length for a and h. For example, if h(a) = 12 = 1100 in binary, the tail length is 2. Let R be the maximum tail length of any a seen so far in the stream. Then we shall use 2^R as our estimate of the number of distinct elements seen in the stream.
This estimate makes intuitive sense. The probability that a given stream element a has h(a) ending in at least r 0's is 2^(-r). Suppose there are m distinct elements in the stream. Then the probability that none of them has tail length at least r is (1 − 2^(-r))^m. This sort of expression should be familiar by now. We can rewrite it as ((1 − 2^(-r))^(2^r))^(m · 2^(-r)). Assuming r is reasonably large, the inner expression is of the form (1 − ε)^(1/ε), which is approximately 1/e. Thus, the probability of not finding a stream element with as many as r 0's at the end of its hash value is e^(-m · 2^(-r)).
We can conclude:
1. If m is much larger than 2^r, then the probability that we shall find a tail of length at least r approaches 1.
2. If m is much less than 2^r, then the probability of finding a tail of length at least r approaches 0.
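As a quick numerical check (an illustration added here, not part of the original derivation), the following Python snippet compares the exact probability (1 − 2^(-r))^m with the approximation e^(-m · 2^(-r)) for a few sample values of m and r:

import math

# Compare the exact probability that no element has tail length >= r
# with the approximation e^(-m * 2^(-r)) for a few (m, r) pairs.
for m, r in [(100, 4), (1000, 10), (10000, 10), (10000, 20)]:
    exact = (1 - 2 ** (-r)) ** m
    approx = math.exp(-m * 2 ** (-r))
    print(f"m={m:6d} r={r:2d} exact={exact:.6f} approx={approx:.6f}")

When m is much larger than 2^r the probability is close to 0, and when m is much smaller than 2^r it is close to 1, matching the two conclusions above.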
ALGORITHM:
Assume that we are given a hash function hash(x) which maps its input to integers in the range [0, 2^L − 1], and whose outputs are sufficiently uniformly distributed. Note that the set of integers from 0 to 2^L − 1 corresponds to the set of binary strings of length L. For any non-negative integer y, define bit(y, k) to be the k-th bit in the binary representation of y, such that:

y = Σ bit(y, k) · 2^k, where the sum is over all k ≥ 0.

We then define a function P(y) which outputs the position of the least significant 1-bit in the binary representation of y:

P(y) = min{ k ≥ 0 : bit(y, k) ≠ 0 }

where P(0) = L. Note that with the above definition we are using 0-indexing for the positions. For example, P(13) = P(1101) = 0 since the least significant bit is a 1, and P(8) = P(1000) = 3 since the least significant 1-bit is at the third position. At this point, note that under the assumption that the output of our hash function is uniformly distributed, the probability of observing a hash output ending with 2^k (a one, followed by k zeroes) is 2^(−(k+1)), since this corresponds to flipping k heads and then a tail with a fair coin.
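The two helper functions can be sketched directly in Python (an illustration added for clarity; the value of L is an assumption):

L = 16  # assumed width of the hash output in bits

def bit(y, k):
    # k-th bit (0-indexed from the least significant end) of y's binary representation
    return (y >> k) & 1

def P(y):
    # Position of the least significant 1-bit in y; defined to be L when y == 0.
    if y == 0:
        return L
    k = 0
    while bit(y, k) == 0:
        k += 1
    return k

print(P(13))  # 13 = 1101 in binary, so P(13) = 0
print(P(8))   # 8 = 1000 in binary, so P(8) = 3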
Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset M is as follows:
1. Initialize a bit-vector BITMAP to be of length L and contain all 0's.
2. For each element x in M:
1. index = P(hash(x))
2. BITMAP[index] = 1
3. Let R denote the smallest index i such that BITMAP[i] = 0, and take 2^R as the estimate of the number of distinct elements.
The idea is that if n is the number of distinct elements in the multiset M, then BITMAP[0] is accessed approximately n/2 times, BITMAP[1] is accessed approximately n/4 times, and so on. Consequently, if i >> log2(n) then BITMAP[i] is almost certainly 0, and if i << log2(n) then BITMAP[i] is almost certainly 1. If i ≈ log2(n) then BITMAP[i] can be expected to be either 1 or 0.
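A minimal sketch of this bitmap variant in Python (assuming a simple linear hash; this is illustrative and separate from the code submitted below):

import random

L = 32                           # assumed width of the hash output in bits
a = random.randrange(1, 2 ** L)  # assumed hash parameters
b = random.randrange(2 ** L)

def h(x):
    # Simple linear hash into [0, 2^L - 1]; an assumption for this sketch.
    return (a * x + b) % (2 ** L)

def P(y):
    # Position of the least significant 1-bit in y; L when y == 0.
    if y == 0:
        return L
    k = 0
    while (y >> k) & 1 == 0:
        k += 1
    return k

def fm_estimate(stream):
    bitmap = [0] * (L + 1)       # one extra slot so that P(0) = L stays in range
    for x in stream:
        bitmap[P(h(x))] = 1
    R = bitmap.index(0)          # smallest index i with BITMAP[i] = 0
    return 2 ** R

print(fm_estimate(random.randint(0, 9999) for _ in range(10000)))

Because a single run can easily be off by a factor of two in either direction, the variance-reduction scheme described next is usually applied on top of this.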
A problem with the Flajolet–Martin algorithm in the above form is that the results vary a lot. A common solution is to run the algorithm multiple times with different hash functions and combine the results from the different runs. One idea is to take the mean of the results from each hash function, obtaining a single estimate of the cardinality. The problem with this is that the mean is very susceptible to outliers (which are likely here). A different idea is to use the median, which is less prone to influence by outliers. The problem with this is that the results can only take the form 2^R, where R is an integer. A common solution is to combine both the mean and the median: create k · l hash functions and split them into l distinct groups (each of size k). Within each group use the median to aggregate the k results, and finally take the mean of the l group estimates as the final estimate.
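This grouping scheme can be sketched as follows (the helper single_fm_estimate and the group sizes k and l are illustrative assumptions, not part of the submitted code):

import random
import statistics

def single_fm_estimate(stream, a, b, L=32):
    # One run of Flajolet-Martin with hash h(x) = (a*x + b) mod 2^L.
    max_tail = 0
    for x in stream:
        y = (a * x + b) % (2 ** L)
        tail = L if y == 0 else (y & -y).bit_length() - 1  # trailing-zero count
        max_tail = max(max_tail, tail)
    return 2 ** max_tail

def combined_estimate(stream, k=5, l=4):
    # k*l independent runs: median within each group of k, mean of the l medians.
    stream = list(stream)
    group_medians = []
    for _ in range(l):
        runs = [single_fm_estimate(stream,
                                   random.randrange(1, 2 ** 32),
                                   random.randrange(2 ** 32))
                for _ in range(k)]
        group_medians.append(statistics.median(runs))
    return statistics.mean(group_medians)

data = [random.randint(0, 9999) for _ in range(10000)]
print(combined_estimate(data))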
CODE:
import numpy as np

# Generate a random input stream of 10,000 integers to simplify testing.
def generateSeq():
    arr = np.random.randint(0, 9999, size=(10000))
    return arr

# Linear hash function from the class slides: h(x) = (a*x + b) mod 10000.
def hashFunction(a, b, num):
    ret = (num * a + b) % 10000
    return ret

# Count the trailing zeros (tail length) of a binary string such as bin(12) = '0b1100'.
def trailingZeros(binNum):
    ret = 0
    binNum = binNum[2:]            # strip the '0b' prefix
    for i in range(len(binNum) - 1, -1, -1):
        if binNum[i] == "0":
            ret += 1
        else:
            break
    if ret == len(binNum):         # the hash value was 0 (all zeros)
        return 1
    else:
        return ret

inpData = generateSeq()
a = np.random.randint(0, 9999)
b = np.random.randint(0, 9999)
maxTZ = -1                         # maximum tail length R seen so far
for num in inpData:
    hashedNum = hashFunction(a, b, num)
    tz = trailingZeros(bin(hashedNum))   # tail length of the current element
    if maxTZ == -1 or maxTZ < tz:
        maxTZ = tz
print(maxTZ)                       # R
print(2 ** maxTZ)                  # Flajolet-Martin estimate 2^R
CONCLUSION:
If the number of distinct elements is too great, or if there are too many streams that need to be processed at once (e.g., Yahoo! wants to count the number of unique users viewing each of its pages in a month), then we cannot store the needed data in main memory. The Flajolet–Martin algorithm lets us estimate the number of distinct elements in a single pass using only a small amount of memory.
OUTPUT: