Viden Io Data Analytics Lecture7 Data Stream Filtering PDF

The document discusses data stream filtering using Bloom filters. Bloom filters are a space-efficient randomized data structure used to represent a set and answer membership queries. They can produce false positives but never false negatives. The document explains how Bloom filters work by using hash functions to map elements to bit positions in a bit array. It also discusses calculating the probability of false positives in Bloom filters.

Uploaded by

Ram Chandu

Data Stream Filtering

Filtering and Streaming


• The randomized algorithms and data structures we have seen so
far always produce the correct answer but have a small
probability of being slow.
• In this lecture, we will consider randomized algorithms that are
always fast, but have a small probability of returning the wrong
answer.
• More generally, we are interested in tradeoffs between the
(likely) efficiency of the algorithm and the (likely) quality of its
output.
Bloom Filters
"Whenever a list or set is used, and space is a consideration, a Bloom
filter should be considered. When using a Bloom filter, consider
the potential effects of false positives."

• It is a randomized data structure used to represent a set.
• It answers membership queries.
• It can give a FALSE POSITIVE when answering a membership
query (with small probability).
• It can never return a FALSE NEGATIVE.
• So a query has two possible answers:
• POSSIBLY IN SET
• DEFINITELY NOT IN SET
• It is space efficient.
Bloom Filters
• Bloom filters are a natural variant of hashing proposed by Burton
Bloom in 1970 as a mechanism for supporting membership
queries in sets.
• Applications:
• Example: Email spam filtering
• We know 1 billion “good” email addresses
• If an email comes from one of these, it is NOT spam
Filtering Stream Content

• To motivate the Bloom-filter idea, consider a web crawler.
• It keeps, centrally, a list of all the URLs it has found so far.
• It assigns these URLs to any of a number of parallel tasks;
these tasks stream back the URLs they find in the links they
discover on a page.
• It needs to filter out those URLs it has seen before.
Role of the Bloom Filter

• A Bloom filter placed on the stream of URLs will declare that
certain URLs have been seen before.
• Others will be declared new, and will be added to the list of
URLs that need to be crawled.
• Unfortunately, the Bloom filter can have false positives.
• It can declare a URL has been seen before when it hasn't.
• But if it says “never seen”, then it is truly new.
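
The crawler use case above can be sketched roughly as follows. This is a minimal illustration, not the lecture's implementation: the array size N_BITS, the number of hash functions K_HASHES, the salted SHA-256 hashing in positions(), and the example URLs are all assumptions made for the sketch.

```python
import hashlib

N_BITS = 1 << 16   # size of the bit array (assumed for illustration)
K_HASHES = 3       # number of hash functions (assumed)

def positions(url: str) -> list[int]:
    """Derive K_HASHES array positions by salting SHA-256 with the hash index."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{url}".encode()).digest()[:8], "big") % N_BITS
        for i in range(K_HASHES)
    ]

def filter_new_urls(stream):
    """Yield only the URLs that the Bloom filter declares new."""
    bits = bytearray(N_BITS)  # one byte per bit, for readability
    for url in stream:
        pos = positions(url)
        if all(bits[p] for p in pos):
            continue  # declared "seen before" (possibly a false positive): skip it
        for p in pos:
            bits[p] = 1  # record the URL, then pass it on for crawling
        yield url

# Hypothetical stream of URLs reported by the parallel tasks.
found = ["http://a.example/", "http://b.example/", "http://a.example/"]
to_crawl = list(filter_new_urls(found))  # the repeated URL is filtered out
```

A false positive here means a genuinely new URL is silently skipped, which is the cost the crawler accepts in exchange for not storing every URL in full.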
How Does a Bloom Filter Work?
• A Bloom filter is an array of bits, together with a number of
hash functions.
• The argument of each hash function is a stream element,
and it returns a position in the array.
• Initially, all bits are 0.
• When input x arrives, we set to 1 the bit at position h(x), for each
hash function h.
The Set Membership Task

• x: an element
• S: a set of elements
• Input: x, S
• Output:
• TRUE if x is in S
• FALSE if x is not in S
• A Bloom filter consists of a vector of n Boolean values, initially all
set to false, as well as k independent hash functions h1, h2, ..., hk,
each with range {0, 1, ..., n-1}.

Initial setup, n = 10:

Bit:      0 0 0 0 0 0 0 0 0 0
Position: 0 1 2 3 4 5 6 7 8 9

• For each element x in S, the Boolean values at positions h1(x),
h2(x), ..., hk(x) are set to true.

After installing two elements x1 and x2 (each hashed by h1, h2, h3):

Bit:      0 1 0 0 1 0 1 1 0 1
Position: 0 1 2 3 4 5 6 7 8 9
INSERT(x)
Find h1(x), h2(x), ..., hk(x)
Set the bits at all these positions in the Bloom filter to 1

QUERY(y)
Find h1(y), h2(y), ..., hk(y)
IF (the bits at positions h1(y), h2(y), ..., hk(y) are all 1)
    RETURN(1)
ELSE
    RETURN(0)

We assume each hash function maps an element to one of the bit
positions 0, 1, 2, ..., n-1.

m = number of elements inserted into (present in) the set
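
The INSERT and QUERY steps above can be written as a short runnable sketch. The three hash functions below are hypothetical toy functions on integers, chosen only so the positions are easy to check by hand; a real filter would use well-mixed hash functions.

```python
N = 10  # number of bits, as in the n = 10 setup above

# Hypothetical toy hash functions on integers, each with range {0, ..., N-1}.
HASHES = [
    lambda x: x % N,
    lambda x: (3 * x + 1) % N,
    lambda x: (7 * x + 4) % N,
]

def insert(bits, x):
    """Set the bit at every position h1(x), ..., hk(x)."""
    for h in HASHES:
        bits[h(x)] = True

def query(bits, y):
    """True means 'possibly in set'; False means 'definitely not in set'."""
    return all(bits[h(y)] for h in HASHES)

bits = [False] * N
insert(bits, 12)   # sets positions 2, 7, 8
insert(bits, 25)   # sets positions 5, 6, 9

print(query(bits, 12))  # True: inserted elements are always found
print(query(bits, 30))  # False: position 0 is still unset
print(query(bits, 19))  # True: positions 9, 8, 7 were set by 25 and 12 (false positive)
```

The last query shows how a false positive arises: 19 was never inserted, but every bit it hashes to was already set by other elements.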
Error types

• False Negative: answering “not there” on an element that is in the
set.
‒ Never happens for Bloom filters.

• False Positive: answering “is there” on an element that is not in
the set.
‒ We design the filter so that the probability of a false positive is very small.
Calculating the Probability of False Positives
12-bit Bloom filter, k = 3 hash functions:

Bit:       1  0  1  0  1  1  1  0  0  0  1  0
Position:  1  2  3  4  5  6  7  8  9 10 11 12

INSERT(x1)          INSERT(x2)
h1(x1) = 3          h1(x2) = 3
h2(x1) = 5          h2(x2) = 5
h3(x1) = 11         h3(x2) = 11

QUERY(x3)           QUERY(x4)
h1(x3) = 3          h1(x4) = 4
h2(x3) = 11         h2(x4) = 9
h3(x3) = 7          h3(x4) = 10

• x3: bits 3, 11 and 7 are all 1, so x3 is reported as present. This is a case of FALSE POSITIVE.
• x4: bits 4, 9 and 10 are all 0, so x4 is not PRESENT.
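
The two queries in this example can be replayed directly against the bit array. This is only a sketch: the helper name query is an assumption, and the hash values are simply copied from the example rather than computed.

```python
# Bit array from the example, positions numbered 1..12 as on the slide.
bits = [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]

def query(positions):
    """Return True ('possibly in set') only if every listed position holds a 1."""
    return all(bits[p - 1] == 1 for p in positions)  # p - 1 converts to 0-based indexing

x3 = [3, 11, 7]   # h1(x3), h2(x3), h3(x3)
x4 = [4, 9, 10]   # h1(x4), h2(x4), h3(x4)

print(query(x3))  # True: reported as present although x3 was never inserted
print(query(x4))  # False: x4 is definitely not present
```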
Example-1
(1/e ≈ 0.37...)
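
As a hedged sketch of the standard calculation, assume the k hash functions are independent and spread elements uniformly over the n bit positions (an assumption not stated explicitly in the slides). With m elements inserted:

```latex
\[
\Pr[\text{a given bit is still } 0] = \left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}
\]
\[
\Pr[\text{false positive}] \approx \left(1 - e^{-km/n}\right)^{k}
\]
```

When km = n, the first quantity is about 1/e ≈ 0.37, which appears to be the constant quoted in Example-1: roughly 37% of the bits remain 0 in that case.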
