Data Science 5

The document discusses data stream mining: the stream data model, sampling, filtering streams with Bloom filters, and counting distinct elements in a stream. It introduces the count-distinct problem of determining how many different elements have appeared in a data stream, and describes the Flajolet-Martin algorithm, which uses hash functions to estimate distinct counts in sub-linear space. It also covers estimating moments, counting ones in a window, decaying windows, and the issues of processing high-speed streams in real time with limited resources.

Uploaded by

kagome
Copyright
© All Rights Reserved

AACS1573

Introduction to
Data Science
Chapter 5
Mining Data Stream
Assumptions

1. Data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever.

2. Data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database).
The Stream Data Model
- A Data-Stream-Management System
- Examples of Stream Sources
- Stream Queries
- Issues in Stream Processing
The Stream Data Model
http://www.climate4you.com/SeaTemperatures.html
Stream Queries
There are two ways that queries get asked about
streams.

○ standing queries

○ ad-hoc queries
Standing queries
● Fig. 4.1 shows a place within the processor where
standing queries are stored.
● These queries are, in a sense, permanently
executing, and produce outputs at
appropriate times.
Ad-hoc queries
● A question asked once.
Issues in Stream Processing
● We must process elements in real time, or we lose the opportunity
to process them at all.
● Thus, it often is important that the stream-processing algorithm is
executed in main memory, without access to secondary storage or
with only rare accesses to secondary storage.
● Moreover, even when streams are “slow,” as in the sensor-data
(ocean behavior - 3.5 terabytes arriving every day), there may be
many such streams.
● Even if each stream by itself can be processed using a small
amount of main memory, the requirements of all the streams
together can easily exceed the amount of available main memory.
Issues in Stream Processing
● Thus, many problems about streaming data would be easy to solve
IF WE HAD ENOUGH MEMORY,
but become rather hard and require the invention of new
techniques in order to execute them at a realistic rate on a
machine of realistic size.
5.2 Sampling Data in a
Stream
next…
Sampling Algorithms
● Simple Random Sampling
● Stratified Random Sampling
● Reservoir Sampling
● Undersampling and Oversampling
5.3 Filtering Streams
- The Bloom Filter
- Analysis of Bloom Filtering
next…
The Bloom Filter
● A Bloom filter is represented by a bit array of m bits. At the beginning all bits are set
to 0.
● To insert an element into the filter, we:
○ calculate the values of all k hash functions for the element and
○ use their output values as indices in the array where we set bits to 1.
● To test if an element is in the filter, we:
○ calculate all k hash functions for the element and
○ check the bits at all corresponding indices:
■ if all bits are set, then the answer is "maybe"
■ if at least 1 bit isn't set, then the answer is "definitely not"
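The insert and test steps above can be sketched in Python as follows. This is a minimal illustration, not a production filter: the class name BloomFilter, the email address, and the way the k hash functions are derived (salting SHA-256 with the function number) are assumptions made for the example, since the slides do not fix a particular hashing scheme.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits with k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # at the beginning all bits are set to 0

    def _indices(self, element):
        # Assumption: the i-th hash function is SHA-256 salted with i,
        # reduced modulo m to give an index into the bit array.
        indices = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            indices.append(int.from_bytes(digest[:8], "big") % self.m)
        return indices

    def add(self, element):
        # Set the bit at every index produced by the k hash functions.
        for idx in self._indices(element):
            self.bits[idx] = 1

    def query(self, element):
        # All bits set -> "maybe"; at least one bit unset -> "definitely not".
        if all(self.bits[idx] for idx in self._indices(element)):
            return "maybe"
        return "definitely not"


bf = BloomFilter(m=50, k=3)
bf.add("alice@example.com")           # hypothetical allowed address
print(bf.query("alice@example.com"))  # maybe
```

An element that was added always answers "maybe"; an element that was never added usually answers "definitely not", but can occasionally answer "maybe" when other elements happen to have set all of its bits.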
Hashing

Hash functions accelerate table or database lookup by detecting duplicated
records in a large file.
Hash function
● A hash function is any function that can be used
to map data of arbitrary size to data of fixed size.
● The values returned by a hash function are
called hash values, hash codes, digests, or
simply hashes.
● One use is a data structure called a hash table,
widely used in computer software for rapid data
lookup.
● Hash functions accelerate table or database
lookup by detecting duplicated records in a
large file.
Hash Functions

● DJB2
● DJB2a
● FNV
● murmur
● SDBM
Example:

• Suppose we have a set, S, of 1000 allowed email addresses: those we will
allow through because we believe them not to be spam.
Bloom Filter (example)

[Figure: an email, split into the email address and the email itself]

Hash functions:
• A hash function is a function that can be used to map data to fixed-size values.
• The values returned by a hash function are called hash values, hash codes,
digests, or simply hashes.
The Bloom Filter

For example,
• the bit vector can be m=50 cells in length
• k=3 means 3 hash functions will be applied to the key.
How it works (Bloom Filter)
1. Initiate an empty bit array of length m. m=10
0 1 2 3 4 5 6 7 8 9

0 0 0 0 0 0 0 0 0 0

2. Select K different hash functions,
e.g.: K = 2, Func. A and Func. B

3. To add an element, calculate all the K hashes and set the corresponding bits to 1.
e.g.: element “[email protected]”
- Func. A: position = 2
- Func. B: position = 3
0 1 2 3 4 5 6 7 8 9

0 0 1 1 0 0 0 0 0 0
How it works (Bloom Filter)

4. Repeat “[email protected]” : Func. A = 0, Func. B = 2


0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 0 0 0 0

5. Repeat “[email protected]” : Func. A = 6, Func. B = 7

0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 1 1 0 0

Bloom filter with 3 elements. It consists of 10 bits and uses 2 hash functions.
How it works (Bloom Filter)
● To test if [email protected] is in the set:
● For [email protected]: Func. A = 7, Func. B = 9
0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 1 1 0 0

● Bit 7 is set but bit 9 is not, so the element [email protected] is “definitely not” in the set.

● To test [email protected]: Func. A = 0, Func. B = 6
0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 1 1 0 0

● Bits 0 and 6 are both set, so the element [email protected] is “maybe” in the set.


How it works (Bloom filter)
● Although [email protected] is not in the set, the filter concludes that it is “maybe” in
the set.
● This is an example of a false positive.
The Bloom Filter
● Bloom filter can be used anywhere where false positives are OK and can be
handled but false negatives are not acceptable.
● Bloom filters can also be used to reduce I/O operations and thus increase
performance.
The Bloom Filter
● Example
● Consider Bloom filter of 16 bits (m=16):

● As an example, consider 2 hash functions (k=2): MurmurHash3 and Fowler-Noll-Vo (FNV).
● Now, let's add element ferret to the filter:
MurmurHash3(ferret)=1, FNV(ferret)=11.
● Therefore, we need to set bits #1 and #11 in the filter:

The Bloom Filter
● Add element bernau to the filter:
MurmurHash3(bernau)=4, FNV(bernau)=4
● So, we need to set only the bit #4:

● But what about element paris? Its hashes are MurmurHash3(paris)=11,
FNV(paris)=4.

● If we check bits #4 and #11, they are both set in the filter, so we shall conclude
that paris may be in the set.
● This is an example of a false positive response, because we know that we
didn't add paris to the filter and the appropriate bits were set independently
by 2 different elements (bit #4 was set by bernau and bit #11 by ferret).
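The ferret/bernau/paris example can be replayed in a few lines of Python. This sketch looks the hash positions up in a table instead of computing MurmurHash3 and FNV, so the positions are exactly those stated in the example; the names POSITIONS, add and query are illustrative.

```python
# Hash positions as stated in the example (MurmurHash3, FNV); a real filter
# would compute them instead of looking them up.
POSITIONS = {
    "ferret": (1, 11),
    "bernau": (4, 4),
    "paris": (11, 4),
}

bits = [0] * 16  # Bloom filter of 16 bits (m = 16)


def add(element):
    # Set the bit at each hash position of the element.
    for p in POSITIONS[element]:
        bits[p] = 1


def query(element):
    # "maybe" only if every hash position of the element is set.
    if all(bits[p] for p in POSITIONS[element]):
        return "maybe"
    return "definitely not"


add("ferret")
add("bernau")
set_bits = [i for i, b in enumerate(bits) if b]
print(set_bits)        # [1, 4, 11]
print(query("paris"))  # maybe (false positive: paris was never added)
```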
Question (Bloom Filter)
● A Bloom filter helps to answer a “Yes”/“No” question: “Is the item in the list?”
● Add “Cat” and “Dog” into the 12-bit Bloom filter.
○ “Cat” hashes to 3, 4, 10
○ “Dog” hashes to 1, 2, 5
● Build the Bloom filter diagram.

● “elephant” hashes to 2, 4, 5. Is “elephant” in the set?

● Determine the type of error and describe the error.


Examples of applications
AACS1573
Introduction to
Data Science
Chapter 5
Mining Data Stream
(Part B)
We will learn about…

5.4 Counting Distinct Elements in a Stream
5.5 Estimating Moments
5.6 Counting Ones in a Window
5.7 Decaying Windows
We have learned about…

5.4 Counting Distinct Elements in a Stream
5.5 Estimating Moments
5.6 Counting Ones in a Window
5.7 Decaying Windows

Coming next…
Chapter 6 Case Study
5.4 Counting
Distinct
Elements in a
Stream
next…
5.4 Counting Distinct Elements in a
Stream

- The Count-Distinct Problem


- The Flajolet-Martin Algorithm
- Space Requirements
Counting Distinct Elements in a Stream
● In this section we look at a third simple kind
of processing we might want to do on a
stream.
● As with the previous examples – sampling
and filtering – it is somewhat tricky to do
what we want in a reasonable amount of
main memory, so we use a variety of hashing
and a randomized algorithm to get
approximately what we want with little space
needed per stream.
The Count-Distinct Problem

● Suppose stream elements are chosen from some universal set.
○ Problem or Question: We would like to
know how many different elements have
appeared in the stream.

○ Solution: Count from the beginning of the stream
(or from some known time in the past).
● Example 1: As a useful example of this problem, consider a Web site gathering
statistics on how many unique users it has seen in each given month. The
universal set is the set of logins for that site, and a stream element is
generated each time someone logs in. This measure is appropriate for a site
like Amazon, where the typical user logs in with their unique login name.

● Example 2: A similar problem arises for a Web site like Google that does not require
login to issue a search query, and may be able to identify users only by the IP
address from which they send the query. There are about 4 billion IP
addresses; sequences of four 8-bit bytes will serve as the universal set in this
case.
The Count-Distinct Problem
● The obvious way to solve the problem is to keep all the elements seen so far
in main memory, in an efficient search structure such as a hash table or search
tree, so one can quickly add new elements and check whether or not the
element that just arrived on the stream was already seen.
● As long as the number of distinct elements is not too great, this structure
can fit in main memory and there is little problem obtaining an exact
answer to the question how many distinct elements appear in the stream.
● However, if the number of distinct elements is too great, or if there are too
many streams that need to be processed at once (e.g., Yahoo! wants to
count the number of unique users viewing each of its pages in a month),
then we cannot store the needed data in main memory.
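The exact approach described above can be sketched with Python's built-in set, which is a hash-based search structure. The function name and the login stream are hypothetical; the point is that memory grows with the number of distinct elements.

```python
def count_distinct_exact(stream):
    """Exact distinct count; memory grows with the number of distinct elements."""
    seen = set()           # hash-based search structure kept in main memory
    for element in stream:
        seen.add(element)  # adding an already-seen element has no effect
    return len(seen)


logins = ["ann", "bob", "ann", "cat", "bob", "ann"]  # hypothetical login stream
print(count_distinct_exact(logins))  # 3
```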
The Count-Distinct Problem
● There are several options.

○ We could use more machines, each machine
handling only one or several of the streams.

○ We could store most of the data structure in
secondary memory and batch stream elements so
whenever we brought a disk block to main memory
there would be many tests and updates to be
performed on the data in that block.

○ We could use the strategy to be discussed in this
section, where we only estimate the number of
distinct elements but use much less memory than the
number of distinct elements.
The Flajolet-Martin Algorithm
● The idea behind the Flajolet-Martin Algorithm is that the more different
elements we see in the stream, the more different hash-values we shall see.
● As we see more different hash-values, it becomes more likely that one of
these values will be “unusual.”
● The particular unusual property we shall exploit is that the value ends in
many 0’s, although many other options exist.
● Whenever we apply a hash function h to a stream element a, the bit string
h(a) will end in some number of 0’s, possibly none.
● Call this number the tail length for a and h.
● Let R be the maximum tail length of any a seen so far in the stream.
● Then we shall use the estimate $2^R$ for the number of distinct elements seen in
the stream.
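A minimal sketch of this idea in Python, under two assumptions that the slides do not fix: the hash function is the first 32 bits of SHA-256, and an all-zero hash value counts as a tail of 32 zeros. The names tail_length, h32 and fm_estimate are illustrative.

```python
import hashlib


def tail_length(value, width=32):
    """Number of trailing 0 bits in value (width if value is 0)."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit; its index is the tail length.
    return (value & -value).bit_length() - 1


def h32(element):
    # Assumption: a 32-bit hash taken from SHA-256 of the element's text form.
    digest = hashlib.sha256(str(element).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def fm_estimate(stream):
    """Estimate the number of distinct elements as 2 ** R."""
    R = 0  # maximum tail length seen so far
    for a in stream:
        R = max(R, tail_length(h32(a)))
    return 2 ** R


print(tail_length(0b10100))  # 2
print(fm_estimate(["a", "b", "a", "c"]))
```

Because R depends only on which hash values occur, repeating an element never changes the estimate. With a single hash function the estimate is always a power of two and can be far from the true count.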
The Flajolet-Martin Algorithm
● Calculate the hash value h(x) of every stream element
● Convert every calculated hash value to binary
● Count the trailing zeros of each binary value
● Estimate the number of distinct elements from the maximum count of trailing zeros
Question (Flajolet-Martin Algorithm)
● Input stream of integers: x = 3, 4, 4, 6, 2
● Hash function: h(x) = (6x + 1) mod 5
● Compute the binary equivalent of every calculated hash value
● Count the trailing zeros in each hash value
● Show the working for estimating the number of distinct elements using FM
● List the distinct values.
Space Requirements

● Observe that as we read the stream it is not
necessary to store the elements seen.
● The only thing we need to keep in main memory is
one integer per hash function; this integer records
the largest tail length seen so far for that hash
function and any stream element.
Space Requirements
● If we are processing only one stream, we could use
millions of hash functions, which is far more than we
need to get a close estimate.
● Only if we are trying to process many streams at the
same time would main memory constrain the number
of hash functions we could associate with any one
stream.
● In practice, the time it takes to compute hash values for
each stream element would be the more significant
limitation on the number of hash functions we use.
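The space argument can be made concrete with a sketch that keeps exactly one integer per hash function. Several details here are assumptions, not taken from the slides: the hash functions are SHA-256 salted with the function number, hash values are 32 bits wide, and the per-function estimates are combined by taking their median (one simple choice among several).

```python
import hashlib
import statistics


class FMSketch:
    """Flajolet-Martin state: one integer (max tail length) per hash function."""

    def __init__(self, num_hashes):
        self.max_tail = [0] * num_hashes  # the only state kept in memory

    def _hash(self, i, element):
        # Assumption: the i-th hash function is SHA-256 salted with i (32 bits).
        data = f"{i}:{element}".encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

    def add(self, element):
        for i in range(len(self.max_tail)):
            v = self._hash(i, element)
            tail = 32 if v == 0 else (v & -v).bit_length() - 1
            if tail > self.max_tail[i]:
                self.max_tail[i] = tail

    def estimate(self):
        # Assumption: combine the per-function estimates 2 ** R by their median.
        return statistics.median(2 ** r for r in self.max_tail)


sketch = FMSketch(num_hashes=8)
for x in range(1000):
    sketch.add(x)
print(len(sketch.max_tail))  # 8 integers, however long the stream is
print(sketch.estimate())
```

The work per stream element is one hash computation per hash function, which is why computation time rather than memory tends to limit how many hash functions are practical.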
5.5
Estimating
Moments
next…
Estimating Moments
● In this section we consider a generalization of the problem of
counting distinct elements in a stream: computing moments.
Moments
● Computing moments involves the distribution of frequencies of
different elements in the stream.
Definition of Moments
● Let $m_i$ be the number of occurrences of the i-th element of the universal set.
The 1st moment is the sum of the $m_i$, which must be the length of
the stream.

○ Thus, first moments are especially easy to compute; just count the length of the
stream seen so far.
Definition of Moments
● The second moment is the sum of the squares of the $m_i$.
● It is sometimes called the surprise number, since it measures how uneven the
distribution of elements in the stream is.
● To see the distinction:

○ Suppose we have a stream of length 100, in which 11 different elements appear.

○ The most even distribution of these 11 elements would have one appearing 10
times and the other ten appearing 9 times each.

■ In this case, the surprise number is $10^2 + 10 \times 9^2 = 910$.

○ At the other extreme, one element could appear 90 times and the other ten
appear once each.

■ In this case, the surprise number is $90^2 + 10 \times 1^2 = 8110$.
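When the counts do fit in memory, the moments can be computed directly. The sketch below assumes the usual definition of the k-th moment as the sum of the k-th powers of the occurrence counts; the element names are made up, and the uneven stream (one element 90 times, ten elements once each) is an added illustration of the opposite extreme.

```python
from collections import Counter


def moment(stream, k):
    """k-th moment: sum over distinct elements of (occurrence count) ** k."""
    return sum(m ** k for m in Counter(stream).values())


# Even case: one element appears 10 times, ten others appear 9 times each.
even = ["e0"] * 10 + [f"e{j}" for j in range(1, 11) for _ in range(9)]
# Uneven case (added illustration): one element 90 times, ten others once each.
uneven = ["e0"] * 90 + [f"e{j}" for j in range(1, 11)]

print(moment(even, 0))    # 11   (number of distinct elements)
print(moment(even, 1))    # 100  (length of the stream)
print(moment(even, 2))    # 910  (surprise number)
print(moment(uneven, 2))  # 8110
```

The k = 0 case counts each distinct element once, which is why moments generalize the count-distinct problem.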


The Alon-Matias-Szegedy Algorithm for Second
Moments
● For now, let us assume that a stream has a particular length n.
● Suppose we do not have enough space to count all the $m_i$ for all
the elements of the stream.
● We can still estimate the second moment of the stream using a
limited amount of space; the more space we use, the more
accurate the estimate will be.
The Alon-Matias-Szegedy Algorithm for Second
Moments
● We compute some number of variables. For each variable X, we
store:

○ 1. A particular element of the universal set, which we refer to as X.element, and

○ 2. An integer X.value, which is the value of the variable. To determine the value of
a variable X, we choose a position in the stream between 1 and n, uniformly and at
random. Set X.element to be the element found there, and initialize X.value to 1.
As we read the stream, add 1 to X.value each time we encounter another
occurrence of X.element.
The Alon-Matias-Szegedy Algorithm for Second
Moments
● Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c,
a, a, b.
● The length of the stream is n = 15.
a : 5 times,
b : 4 times
c : 3 times
d : 3 times
● The second moment for the stream is
$5^2 + 4^2 + 3^2 + 3^2 = 59$
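The variables X can be simulated on this stream. The slides stop at how X.element and X.value are maintained; the sketch below additionally assumes the standard Alon-Matias-Szegedy estimate n(2·X.value − 1) for each variable, and the positions 3, 8 and 13 are picked by hand for illustration rather than at random.

```python
from collections import Counter

stream = list("abcbdacdabdcaab")  # a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
n = len(stream)                   # 15


def ams_estimate(stream, position):
    """Second-moment estimate from one variable X at a 1-based position."""
    element = stream[position - 1]                # X.element
    value = stream[position - 1:].count(element)  # X.value: occurrences from here on
    return len(stream) * (2 * value - 1)          # assumed estimate n(2*X.value - 1)


exact = sum(m * m for m in Counter(stream).values())
estimates = [ams_estimate(stream, p) for p in (3, 8, 13)]

print(exact)                            # 59
print(estimates)                        # [75, 45, 45]
print(sum(estimates) / len(estimates))  # 55.0
```

Averaging the estimate over all 15 possible positions gives exactly 59, which is why averaging more variables tends to bring the estimate closer to the true second moment.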
Exercise: Calculate the
surprise number using Alon-
Matias-Szegedy Algorithm
● 5,5,5,5,5
● 9,9,1,1,5
● 10,9,9,9,9,9,9,9,9,9,9
Answer: Calculate the
surprise number using Alon-
Matias-Szegedy Algorithm
● 5,5,5,5,5 = $5^2 = 25$
● 9,9,1,1,5 = $2^2 + 2^2 + 1^2 = 9$
● 10,9,9,9,9,9,9,9,9,9,9 = $1^2 + 10^2 = 101$
5.6 Counting
Ones in a
Window
next…
The Datar-Gionis-Indyk-Motwani Algorithm

● This version of the algorithm uses $O(\log^2 N)$ bits to represent a
window of N bits, and allows us to estimate the number of 1’s in the
window with an error of no more than 50%.
● To begin, each bit of the stream has a timestamp, the position in
which it arrives.
● The first bit has timestamp 1, the second has timestamp 2, and so
on.
Maintaining the DGIM Conditions
5.7 Decaying
Windows
next…
Problem
• Given a stream, which items are currently
popular?
– E.g., popular movie tickets, popular items being
sold on Amazon, etc.
– An item is popular if it appears more than s times in the window
• Possible solution
– Keep one stream per item; at each timepoint it holds “1” if the item
appears in the original stream and “0” otherwise
– Use DGIM to estimate the count of 1’s for each item
Example
• Original Stream:
1, 2, 1, 1, 3, 2, 4

Stream for 1: 1, 0, 1, 1, 0, 0, 0

Stream for 2: 0, 1, 0, 0, 0, 1, 0

Stream for 3: 0, 0, 0, 0, 1, 0, 0

Stream for 4: 0, 0, 0, 0, 0, 0, 1

• Problem with this approach? Too many streams!
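The per-item bit streams in this example can be generated mechanically, as in the short sketch below. It only illustrates the construction; the variable names are made up for the example.

```python
original = [1, 2, 1, 1, 3, 2, 4]

# One bit stream per distinct item: 1 where the item appears, 0 elsewhere.
bit_streams = {
    item: [1 if x == item else 0 for x in original]
    for item in sorted(set(original))
}

for item, bits in bit_streams.items():
    print(f"Stream for {item}: {bits}")
print(len(bit_streams))  # 4
```

The number of bit streams equals the number of distinct items, which is what makes this approach impractical when there are many items.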
Exponentially Decaying Windows

• A heuristic for selecting frequent items
– Gives more weight to more recent popular items
– Instead of computing count in the last N elements, compute smooth
aggregation over entire stream

• If the stream is $a_1, a_2, \ldots$ and we are taking the sum over the stream, the
answer at time t is defined as

$$\sum_{i=1}^{t} a_i (1-c)^{t-i}$$

where c is a tiny constant, about $10^{-6}$


Exponentially Decaying Windows (cont)

• If the stream is $a_1, a_2, \ldots$ and we are taking the sum over the stream, the
answer at time t is defined as

$$\sum_{i=1}^{t} a_i (1-c)^{t-i}$$

where c is a tiny constant, about $10^{-6}$

• When a new $a_{t+1}$ arrives,
– (1) multiply the current sum by (1-c) and
– (2) add $a_{t+1}$
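The two-step update can be checked against the definition with a short sketch. The stream values are hypothetical, and c = 0.1 is chosen far larger than the $10^{-6}$ mentioned above so that the decay is visible on a five-element stream.

```python
def decayed_sum_direct(a, c):
    """Evaluate sum_{i=1..t} a_i * (1 - c) ** (t - i) straight from the definition."""
    t = len(a)
    return sum(a[i - 1] * (1 - c) ** (t - i) for i in range(1, t + 1))


def decayed_sum_incremental(a, c):
    """Maintain the same sum with one multiplication and one addition per arrival."""
    total = 0.0
    for x in a:
        total = total * (1 - c) + x  # (1) multiply by (1 - c), (2) add the new element
    return total


a = [3, 1, 4, 1, 5]  # hypothetical stream values
c = 0.1              # illustrative decay constant
print(decayed_sum_direct(a, c))
print(decayed_sum_incremental(a, c))
```

Both functions give the same value up to floating-point rounding, but the incremental form never needs to revisit earlier elements of the stream.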
Property of Decaying Windows

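• Remember, for each item x, we have a running score

$$\sum_{i=1}^{t} \delta_i^x (1-c)^{t-i}$$

where $\delta_i^x = 1$ if the i-th stream element is x, and 0 otherwise

• Summing over all running scores we get

$$\sum_{x} \sum_{i=1}^{t} \delta_i^x (1-c)^{t-i} = \sum_{i=1}^{t} (1-c)^{t-i} \xrightarrow{\; t \to \infty \;} \frac{1}{c}$$
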
Memory Requirements
• Suppose we want to find items with weight greater
than ½
• Since the sum of all scores is 1/c, there can’t be more
than 2/c items with weight ½ or more!
• So, 2/c is a limit on the number of scores being
counted at any time
– For other weight requirements, we would get a different
bound, in a similar manner
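The bound can be observed in a small simulation. The sketch assumes a common way of maintaining the scores that the slides do not spell out: on each arrival every score is multiplied by (1 - c), the arriving item gets 1 added, and scores that fall below ½ are dropped. The stream of item ids and the value c = 0.1 are hypothetical.

```python
def update_scores(scores, item, c, threshold=0.5):
    """Process one arrival: decay all scores, credit the item, prune small scores."""
    for key in scores:
        scores[key] *= 1 - c                    # (1) multiply every score by (1 - c)
    scores[item] = scores.get(item, 0.0) + 1.0  # (2) add 1 for the arriving item
    # Assumption of this sketch: scores below the threshold are no longer tracked.
    for key in [k for k, s in scores.items() if s < threshold]:
        del scores[key]


c = 0.1  # far larger than 1e-6, so the effect shows on a short stream
scores = {}
max_tracked = 0
for t in range(10_000):
    update_scores(scores, t % 500, c)  # hypothetical stream of item ids
    max_tracked = max(max_tracked, len(scores))

print(max_tracked, "<=", 2 / c)
```

Every tracked score is at least ½ and the scores sum to at most 1/c, so the number of tracked items stays within 2/c no matter how many distinct items the stream contains.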
