
Data Stream Sampling

The document discusses data stream sampling, methods for sampling from continuous data streams, and the importance of filtering and counting distinct elements in these streams. It covers various sampling techniques such as Reservoir Sampling and Bloom Filters, which are used to efficiently manage and analyze large volumes of data without storing all elements. Additionally, it introduces the Flajolet-Martin algorithm for estimating the number of unique elements in a data stream, highlighting its efficiency and probabilistic nature.


Data Streams

Sampling Data in a Stream


Data Stream Sampling is the process of selecting a
subset of data from a continuous and potentially infinite
stream of data.
Data streams are typically generated in real-time, such as
in sensor networks, user interactions, or social media
feeds, and often involve massive volumes of data that
cannot be stored or processed entirely.
Sampling helps reduce the amount of data that needs to be
stored and processed, while still retaining useful
information for analysis.
…….
Selecting a subset of a stream so that any query can be
asked about the selected subset and have the answer be
statistically representative of the stream as a whole.
A search engine receives a stream of queries and would
like to study the behavior of typical users. Assume the
stream consists of tuples (user, query, time).
Suppose that we want to answer queries such as "What
fraction of the typical user's queries were repeated over
the past month?"
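One common way to make such per-user questions answerable is to sample users rather than individual tuples: hash the user id into buckets and keep every tuple whose user lands in the retained buckets. A minimal sketch (the name `keep_user` and the bucket counts are illustrative, not from the source):

```python
import hashlib

def keep_user(user_id: str, buckets: int = 10, kept: int = 1) -> bool:
    """Hash the user id into `buckets` buckets and keep the first `kept` of them."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % buckets < kept

# Filter a stream of (user, query, time) tuples down to a 1/10 sample of *users*,
# so per-user statistics such as repeated-query fractions remain unbiased.
stream = [("alice", "weather", 1), ("bob", "news", 2), ("alice", "weather", 3)]
kept_tuples = [t for t in stream if keep_user(t[0])]
```

Because the decision depends only on the hash of the user id, either all of a user's tuples are kept or none are, which is what per-user statistics require.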
Sampling Methods
 Reservoir Sampling: To sample a fixed-size subset of data from a
stream of unknown or very large size.
 Random Sampling: To select data points randomly from the
stream.
 Systematic Sampling: To sample data points at regular intervals
from the stream.
 Sliding Window Sampling: To sample the most recent k elements
from a stream, continuously updating the sample as new data
arrives.
 Stratified Sampling: To ensure that different "strata" or categories
in the data stream are represented proportionally in the sample.
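Of the methods above, sliding-window sampling has a particularly direct implementation: a bounded double-ended queue that automatically evicts the oldest element. A minimal sketch (names are illustrative):

```python
from collections import deque

def sliding_window_sample(stream, k):
    """Keep only the k most recent elements; older ones fall out automatically."""
    window = deque(maxlen=k)
    for item in stream:
        window.append(item)
        yield list(window)  # the current sample after each arrival

# The last 3 elements of a stream of 6:
samples = list(sliding_window_sample(range(6), 3))
# samples[-1] == [3, 4, 5]
```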
Reservoir Sampling
Randomly sample k elements from a stream of data where
the total number of elements (n) is unknown or too large to
store in memory.
The key advantage of this method is that it ensures that
each element in the data stream has an equal probability
of being included in the sample, even if the stream is
infinite or its size is not known in advance.
This is commonly used in scenarios where it is not feasible
to store all incoming data.
 Initialization:
◦ Start by filling the reservoir with the first k elements of the stream.
These k elements will be the initial sample.
 Processing Subsequent Elements:
◦ After the first k elements, when the (n+1)-th element arrives (where
n is the number of elements processed so far):
● With probability k/(n+1), replace a uniformly random element in the
reservoir with the new element.
● Otherwise (with probability 1 − k/(n+1)), the new element is
discarded.
 Termination:
◦ Once the stream ends (or you decide to stop sampling), the k
elements in the reservoir represent a random sample of the entire
stream.
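The steps above can be sketched in a few lines of Python (a minimal illustration, not the document's own code):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k elements from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream):      # n counts elements seen so far (0-based)
        if n < k:
            reservoir.append(item)         # fill the reservoir with the first k items
        else:
            j = random.randrange(n + 1)    # uniform in [0, n]; P(j < k) = k/(n+1)
            if j < k:
                reservoir[j] = item        # replace a random slot with prob k/(n+1)
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
```

Note that the stream is consumed once and only k elements are ever held in memory, which is exactly why the method suits unbounded streams.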
Filtering Stream
Data filtering is the process of choosing a smaller part of
your data set and using that subset for viewing or analysis.
Filtering is generally (but not always) temporary – the
complete data set is kept, but only part of it is used for the
calculation. Filtering may be used to:
◦ Look at results for a particular period of time.
◦ Calculate results for particular groups of interest.
◦ Exclude erroneous or "bad" observations from an analysis.
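The three uses above can be sketched on a stream of hypothetical (timestamp, sensor_id, reading) records (the field layout and thresholds are made up for illustration):

```python
records = [
    (100, "s1", 21.5), (200, "s2", -999.0), (300, "s1", 22.1), (400, "s3", 23.0),
]
filtered = [
    r for r in records
    if r[0] >= 200             # a particular period of time
    and r[1] in {"s1", "s3"}   # a particular group of interest
    and -50.0 <= r[2] <= 50.0  # exclude erroneous ("bad") observations
]
# filtered == [(300, "s1", 22.1), (400, "s3", 23.0)]
```

The complete `records` list is untouched; only the filtered subset feeds the calculation, matching the "temporary" nature of filtering described above.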
Bloom Filter
 Before understanding Bloom filters, one must know what hashing is. Hashing
is the process of generating a fixed-size output from a variable-size input
using mathematical formulas known as hash functions. This technique
determines an index or location for storing an item in a data structure.
There are three major components of hashing:
◦ Key: A key can be anything, such as a string or an integer, that is fed as input
to the hash function, which determines an index or location for storage of the
item in a data structure.
◦ Hash Function: The hash function receives the input key and returns the index of
an element in an array called a hash table. The index is known as the hash index.
◦ Hash Table: A hash table is a data structure that maps keys to values using a
special function called a hash function. It stores the data in an associative
manner in an array, where each data value has its own unique index.
Collision
The hashing process generates a small number for a big key, so two different keys
can produce the same value. The situation where a newly inserted key maps to an
already occupied slot is called a collision, and it must be handled using some
collision-handling technique.
Bloom Filter
 A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a
member of a set.
 For example, suppose you are creating an account on Facebook. You enter a username
and get the message "Username is already taken". But have you ever thought about how
quickly Facebook checks the availability of a username by searching the millions of
usernames registered with it? There are many ways to do this job –
 Linear search: Bad idea!
 Binary search: Store all usernames alphabetically and compare the entered username
with the middle one in the list. If it matches, the username is taken; otherwise,
determine whether the entered username comes before or after the middle one, discard
the half that cannot contain it, and repeat the process until you get a match or the
search ends with no match. This technique is better and promising, but it still
requires multiple steps.
 Interesting Properties of Bloom Filters
 Unlike a standard hash table, a Bloom filter of fixed size can represent a set with an
arbitrarily large number of elements.
 Adding an element never fails. However, the false positive rate increases as elements are
added, until all bits in the filter are set to 1, at which point all queries produce a positive result.
 Bloom filters never generate a false negative result, i.e., they will never tell you that a
username doesn't exist when it actually does.
 Deleting elements from the filter is not possible: if we deleted a single element by clearing
the bits at the indices generated by its k hash functions, we might also delete other elements.
Example – in the worked example below, deleting "neeraj" by clearing bits 1, 4 and 7 would
set bit 4 to 0, and the Bloom filter would then claim that "ramesh" is not present.
 Working of Bloom Filter
 An empty Bloom filter is a bit array of m bits, all set to zero.
 We need k number of hash functions to calculate the hashes for a given input. When we want to add an
item in the filter, the bits at k indices h1(x), h2(x), … hk(x) are set, where indices are calculated using hash
functions.
Example – Suppose we want to enter "neeraj" into the filter, using 3 hash functions and
a bit array of length 10, all set to 0 initially.
First, we’ll calculate the hashes as follows:
h1(“neeraj”) % 10 = 1
h2(“neeraj”) % 10 = 4
h3(“neeraj”) % 10 = 7
 Note: These outputs are random for explanation only.
 Now we will set the bits at indices 1, 4 and 7 to 1.

 Next, we want to enter "ramesh"; similarly, we calculate the hashes:


h1(“ramesh”) % 10 = 3
h2(“ramesh”) % 10 = 5
h3(“ramesh”) % 10 = 4
Set the bits at indices 3, 4 and 5 to 1.
 Now, to check whether "neeraj" is present in the filter, we calculate the respective hashes
using h1, h2 and h3 and check whether all those indices are set to 1 in the bit array. If all
the bits are set, we can say that "neeraj" is probably present. If any of the bits at these
indices is 0, then "neeraj" is definitely not present.
 False Positive in Bloom Filters
 The question is why we said "probably present" – why this uncertainty? Let's understand
this with an example. Suppose we want to check whether "ram" is present or not. We
calculate the hashes using h1, h2 and h3:
h1("ram") % 10 = 1
h2("ram") % 10 = 3
h3("ram") % 10 = 7
 If we check the bit array, the bits at these indices are set to 1, but we know that "ram" was
never added to the filter. Bits 1 and 7 were set when we added "neeraj", and bit 3 was set
when we added "ramesh".
 So, because the bits at the calculated indices were already set by other items, the Bloom
filter mistakenly claims that "ram" is present, generating a false positive result.
 We can control the probability of a false positive by controlling the size of the Bloom filter:
more space means fewer false positives. The number of hash functions also matters – for a
given filter size and number of inserted elements there is an optimal number of hash
functions, and using too many actually increases the false positive rate.
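This trade-off can be made concrete with the standard approximation p ≈ (1 − e^(−kn/m))^k for the false positive rate of a filter with m bits, k hash functions, and n inserted elements (the helper name `fp_rate` is ours):

```python
import math

def fp_rate(m: int, k: int, n: int) -> float:
    """Approximate false-positive probability of a Bloom filter with
    m bits, k hash functions, and n inserted elements."""
    return (1.0 - math.exp(-k * n / m)) ** k

# More bits lower the rate; k has an optimum of about (m/n) * ln 2.
m, n = 10_000, 1_000
k_opt = round((m / n) * math.log(2))   # ≈ 7 for these values
```

For example, doubling m while holding k and n fixed strictly lowers the estimated false positive rate.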
 Operations that a Bloom Filter supports
 insert(x): To insert an element in the Bloom Filter.
 lookup(x): To check whether an element is already present in the Bloom Filter, with some false positive probability.
 NOTE: We cannot delete an element in Bloom Filter.
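Putting the pieces together, a minimal Bloom filter sketch (seeded md5 is an illustrative choice of hash family, not prescribed by the source):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: an m-bit array with k hash functions
    derived from md5 with different seeds (an illustrative choice)."""

    def __init__(self, m: int = 10, k: int = 3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item: str):
        # k hash indices for `item`, each in [0, m)
        for seed in range(self.k):
            h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item: str):
        for i in self._indices(item):
            self.bits[i] = 1

    def lookup(self, item: str) -> bool:
        # True means "probably present"; False means "definitely not present".
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=100, k=3)
bf.insert("neeraj")
bf.insert("ramesh")
```

There is deliberately no delete method, in line with the NOTE above.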
Counting Distinct Elements in a Stream
 Counting distinct elements in a data stream refers to the problem of
determining how many unique elements exist in a sequence of data
that arrives in real-time, without storing all the elements.
 This is particularly useful when dealing with large-scale or infinite
datasets, where storing every element is not feasible due to memory
or time constraints.
 Memory Constraints: In a data stream, you can’t store all elements,
especially when dealing with a large number of unique elements.
 Infinite Data: The data stream can potentially be infinite, so you can’t rely
on keeping track of everything.
Flajolet-Martin Algorithm
 The Flajolet-Martin algorithm is a probabilistic algorithm.
 It was invented by Philippe Flajolet and G. Nigel Martin in 1983.
 The Flajolet-Martin algorithm approximates the number of unique
elements in a stream or a database in one pass.
 If the stream contains n elements with m of them unique, this
algorithm runs in O(n) time and needs O(log m) memory.
 It is a part of a family of approximate counting algorithms that
use hashing to efficiently provide an estimate of the cardinality
without storing all the elements.
Algorithm….
 Create a bit vector (bit array) of sufficient length L, such that 2^L > n,
the number of elements in the stream. Usually a 64-bit vector is
sufficient, since 2^64 is quite large for most purposes.
 The i-th bit in this vector represents whether we have seen a hash
value whose binary representation ends in 0^i (i trailing zeros). So,
initialize each bit to 0.
 Once the input is exhausted, get the index of the first 0 in the bit array
(call this R). This is just the number of consecutive 1s at the start of the
array (i.e., we have seen tails 0, 00, ..., 0^(R−1) as outputs of the hash
function).
 Estimate the number of unique elements as 2^R/φ, where φ ≈ 0.77351.
A proof can be found in the original paper.
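A compact sketch of the algorithm (it tracks the maximum tail length R directly rather than materializing the bit vector, and md5 stands in for the hash function; both are illustrative choices):

```python
import hashlib

def trailing_zeros(x: int, width: int = 64) -> int:
    """Number of trailing 0 bits in x; a hash of 0 counts as `width` zeros."""
    if x == 0:
        return width
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream, phi: float = 0.77351) -> float:
    """One-pass Flajolet-Martin estimate of the number of distinct elements."""
    R = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return (2 ** R) / phi
```

Only R is kept between elements, so the memory footprint is logarithmic in the number of distinct elements, as claimed above.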
……
 The standard deviation of R is a constant: σ(R) ≈ 1.12. (In other
words, R can be off by about 1 for 1 − 0.68 = 32% of the
observations, off by 2 for about 1 − 0.95 = 5% of the observations,
and off by 3 for 1 − 0.997 = 0.3% of the observations, using the
empirical rule of statistics.) This implies that our count can be off
by a factor of 2 for 32% of the observations, off by a factor of 4
for 5% of the observations, off by a factor of 8 for 0.3% of the
observations, and so on.
Example
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5
Assume the bit string length |b| = 5.
h(1) = (6*1 + 1) mod 5 = 7 mod 5 = 2

x   6x+1   h(x)   Binary   r(a)
1     7      2    00010     1
3    19      4    00100     2
2    13      3    00011     0
1     7      2    00010     1
2    13      3    00011     0
3    19      4    00100     2
4    25      0    00000     5
3    19      4    00100     2
1     7      2    00010     1
2    13      3    00011     0
3    19      4    00100     2
1     7      2    00010     1

R = max(r(a)) = 5
So, the number of distinct elements is estimated as N = 2^R = 2^5 = 32.
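The worked example above can be checked in a few lines (the helper name `tail_length` is ours):

```python
def tail_length(rem: int, bits: int = 5) -> int:
    """Trailing zeros of rem in a `bits`-bit representation (rem == 0 gives `bits`)."""
    if rem == 0:
        return bits
    r = 0
    while rem % 2 == 0:
        rem //= 2
        r += 1
    return r

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
R = max(tail_length((6 * x + 1) % 5) for x in stream)
estimate = 2 ** R   # R == 5 here (from h(4) = 0 = 00000), so the estimate is 32
```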
• We may want to know how many different elements have appeared in the
stream.
• For example, we may wish to know how many distinct users visited the website
so far, or in the last 2 hours.
• If the number of distinct elements must be computed over many streams, then
keeping all the data in main memory is a challenge.
• The FM algorithm gives an efficient way to count the distinct elements in a stream.
• It is possible to estimate the number of distinct elements by hashing the elements
of the universal set to a bit string that is sufficiently long.
• The bit string must be long enough that there are more possible results of the
hash function than there are elements in the universal set.
•Whenever we apply a hash function h to a stream element a, the bit string h(a) will end
in some number of 0s, possibly none.
•Call this the tail length of the hash.
•Let R be the maximum tail length of any a seen so far in the stream.
•Then we use 2^R as the estimate of the number of distinct elements seen in the stream.
•Consider the stream:
S = {1, 2, 1, 3}
Let the hash function be h(x) = (2x + 2) mod 4.
•Applying the hash function gives remainders 0, 2, 0, represented as 3-bit strings:
000, 010, 000.
•The maximum tail length R is 3.
•The number of distinct elements is estimated as 2^R = 2^3 = 8.
•Here the estimate may be too large or too small depending on the hash function.
•We may apply multiple hash functions and combine their estimates to get a more
accurate value.
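Combining several hash functions, as suggested above, can be sketched by taking the median of the per-hash estimates (seeded md5 is an illustrative hash family; `fm_combined` is our name):

```python
import hashlib
from statistics import median

def tail_length(x: int, width: int = 32) -> int:
    """Number of trailing 0 bits in x (x == 0 counts as `width` zeros)."""
    if x == 0:
        return width
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_combined(stream, num_hashes: int = 8) -> float:
    """Run FM with several seeded hash functions and take the median of the
    per-hash estimates 2^R to damp the variance of any single hash."""
    Rs = [0] * num_hashes
    for item in stream:
        for seed in range(num_hashes):
            h = int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            Rs[seed] = max(Rs[seed], tail_length(h))
    return median(2 ** R for R in Rs)
```

The median is a common combining rule because a single unlucky hash cannot drag the estimate far; duplicates in the stream do not change any R, so the estimate depends only on the distinct elements.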
