Data Science 5
Introduction to
Data Science
Chapter 5
Mining Data Streams
Assumptions
○ standing queries
○ ad-hoc queries
Standing queries
● Fig. 4.1 shows a place within the processor where
standing queries are stored.
● These queries are, in a sense, permanently
executing, and produce outputs at
appropriate times.
Ad-hoc queries
● A question asked once.
Issues in Stream Processing
● We must process elements in real time, or we lose the opportunity
to process them at all.
● Thus, it often is important that the stream-processing algorithm is
executed in main memory, without access to secondary storage or
with only rare accesses to secondary storage.
● Moreover, even when streams are “slow,” as in the sensor-data
example (ocean behavior, 3.5 terabytes arriving every day), there may be
many such streams.
● Even if each stream by itself can be processed using a small
amount of main memory, the requirements of all the streams
together can easily exceed the amount of available main memory.
Issues in Stream Processing
● Thus, many problems about streaming data would be easy to solve
if we had enough memory.
Sampling Algorithms
● Simple Random Sampling
● Stratified Random Sampling
● Reservoir Sampling
● Undersampling and Oversampling
5.3 Filtering Streams
- The Bloom Filter
- Analysis of Bloom Filtering
next…
The Bloom Filter
● A Bloom filter is represented by a bit array of m bits. At the beginning, all bits are set
to 0.
● To insert an element into the filter, we:
○ calculate the values of all k hash functions for the element, and
○ use their output values as indices into the array, setting the bits at those indices to 1.
● To test whether an element is in the filter, we:
○ calculate all k hash functions for the element, and
○ check the bits at all corresponding indices:
■ if all bits are set, the answer is "maybe"
■ if at least one bit isn't set, the answer is "definitely not"
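The insert and test steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the method names, and the way the k hash functions are simulated (SHA-256 salted with the function number) are assumptions made for the example, and the email address is hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # at the beginning all bits are 0

    def _indices(self, item: str) -> list[int]:
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the function number; any k independent hashes would do.
        indices = []
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            indices.append(int.from_bytes(digest[:8], "big") % self.m)
        return indices

    def add(self, item: str) -> None:
        # Insert: set the bit at every hashed index to 1.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # True means "maybe"; False means "definitely not".
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(m=50, k=3)
bf.add("alice@example.com")  # hypothetical key
```

A key that was added always tests as "maybe"; a key that was never added usually tests as "definitely not", but can collide with bits set by other keys.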
Hashing
● DJB2
● DJB2a
● FNV
● murmur
● SDBM
Example: (email address, email itself)
Hash functions:
• A hash function is a function that can be used to map data to fixed-size values.
• The values returned by a hash function are called hash values or hash codes.
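As an illustration of mapping data to a fixed-size value, here is a small sketch of DJB2, one of the hash functions named on this slide. The 32-bit truncation and the helper `bucket` (which reduces the hash to an array position) are choices made for this example.

```python
def djb2(text: str) -> int:
    """DJB2 string hash, truncated to 32 bits."""
    h = 5381
    for byte in text.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF  # keep the value fixed-size (32 bits)
    return h


def bucket(text: str, m: int) -> int:
    """Map a key to one of m array positions (hypothetical helper)."""
    return djb2(text) % m
```

For instance, `bucket(key, 10)` always returns an index between 0 and 9, which is how a hash value becomes a position in a bit array of length 10.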
The Bloom Filter
For example:
• the bit vector can be m = 50 cells in length
• k = 3 means 3 hash functions will be applied to the key.
How it works (Bloom Filter)
1. Initialize an empty bit array of length m (here m = 10).
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0
3. To add an element, calculate all k hashes and set the corresponding bits to 1.
e.g.: element “[email protected]”
- Func. A: position = 2
- Func. B: position = 3
0 1 2 3 4 5 6 7 8 9
0 0 1 1 0 0 0 0 0 0
How it works (Bloom Filter)
1 0 1 1 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9
1 0 1 1 0 0 1 1 0 0
Bloom filter with 3 elements. It consists of 10 bits and uses 2 hash functions (Func. A and Func. B).
How it works (Bloom Filter)
● Test whether [email protected] is in the set.
● For [email protected]: Func. A = 7, Func. B = 9
0 1 2 3 4 5 6 7 8 9
1 0 1 1 0 0 1 1 0 1
1 0 1 1 0 0 1 1 0 1
● Example 2: A similar problem arises for a Web site like Google that does not require
login to issue a search query and may be able to identify users only by the IP
address from which they send the query. There are about 4 billion IP
addresses; sequences of four 8-bit bytes will serve as the universal set in this
case.
The Count-Distinct Problem
● Keep the elements seen so far in an efficient search structure such as a hash table or search
tree, so one can quickly add new elements and check whether or not the
element that just arrived on the stream was already seen.
● As long as the number of distinct elements is not too great, this structure
can fit in main memory and there is little problem obtaining an exact
answer to the question of how many distinct elements appear in the stream.
● However, if the number of distinct elements is too great, or if there are too
many streams that need to be processed at once (e.g., Yahoo! wants to
count the number of unique users viewing each of its pages in a month),
then we cannot store the needed data in main memory.
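The exact approach described above can be sketched with Python's built-in `set`, which is backed by a hash table. The function name `count_distinct` is chosen for this example; the sketch works only while all distinct elements fit in main memory.

```python
from typing import Hashable, Iterable


def count_distinct(stream: Iterable[Hashable]) -> int:
    """Exact distinct count; memory grows with the number of distinct elements."""
    seen = set()  # hash-table-backed search structure
    for element in stream:
        if element not in seen:  # was this element already seen?
            seen.add(element)
    return len(seen)
```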
The Count-Distinct Problem
● There are several options.
○ Thus, first moments are especially easy to compute; just count the length of the
stream seen so far.
Definition of Moments
● The second moment is the sum of the squares of the m_i, where m_i is the number of
occurrences of the i-th element.
● It is sometimes called the surprise number, since it measures how uneven the
distribution of elements in the stream is.
● To see the distinction:
○ The most even distribution of these 11 elements would have one appearing 10
times and the other ten appearing 9 times each.
○ 2. An integer X.value, which is the value of the variable. To determine the value of
a variable X, we choose a position in the stream between 1 and n, uniformly and at
random. Set X.element to be the element found there, and initialize X.value to 1.
As we read the stream, add 1 to X.value each time we encounter another
occurrence of X.element .
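The description of X.element and X.value can be sketched as below. The function names and the short example stream are chosen for illustration. The estimate n(2·X.value − 1) is the usual Alon-Matias-Szegedy estimator; it is not stated on this slide, so treat it as an assumption of the sketch.

```python
import random
from typing import Sequence, Tuple


def ams_variable(stream: Sequence[str], position: int) -> Tuple[str, int]:
    """Return (X.element, X.value) for a 1-based starting position."""
    element = stream[position - 1]
    value = 0
    for item in stream[position - 1:]:
        if item == element:
            value += 1  # counts the starting occurrence and every later one
    return element, value


def ams_estimate(stream: Sequence[str], position: int) -> int:
    # Assumption: the usual AMS estimator n * (2 * X.value - 1).
    _, value = ams_variable(stream, position)
    return len(stream) * (2 * value - 1)


stream = list("abcbdacdabdcaab")  # example stream of length n = 15
position = random.randint(1, len(stream))  # uniform between 1 and n
estimate = ams_estimate(stream, position)
```

A single variable gives a noisy estimate; averaging the estimates of many variables brings the result closer to the true second moment.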
The Alon-Matias-Szegedy Algorithm for Second
Moments
● Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b.
● The length of the stream is n = 15.
a: 5 times
b: 4 times
c: 3 times
d: 3 times
● The second moment for the stream is
5² + 4² + 3² + 3² = 59
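The value 59 can be checked by counting occurrences directly. This sketch computes the exact second moment (it is not the streaming estimate); the function name `second_moment` is chosen for this example.

```python
from collections import Counter
from typing import Hashable, Iterable


def second_moment(stream: Iterable[Hashable]) -> int:
    """Sum of squared occurrence counts (the surprise number)."""
    counts = Counter(stream)
    return sum(m * m for m in counts.values())


stream = "a b c b d a c d a b d c a a b".split()
print(len(stream), second_moment(stream))  # 15 59
```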
Exercise: Calculate the surprise number using the Alon-Matias-Szegedy Algorithm
● 5,5,5,5,5
● 9,9,1,1,5
● 10,9,9,9,9,9,9,9,9,9,9
Answer: Calculate the surprise number using the Alon-Matias-Szegedy Algorithm
● 5,5,5,5,5: 5² = 25
● 9,9,1,1,5: 2² + 2² + 1² = 9
● 10,9,9,9,9,9,9,9,9,9,9: 1² + 10² = 101
5.6 Counting Ones in a Window
next…
The Datar-Gionis-Indyk-Motwani Algorithm
Too many streams!
Stream for 1: 1, 0, 1, 1, 0, 0, 0
Stream for 2: 0, 1, 0, 0, 0, 1, 0
Stream for 3: 0, 0, 0, 0, 1, 0, 0
Stream for 4: 0, 0, 0, 0, 0, 0, 1
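The four bit streams above appear to come from a single stream of items, with one 0/1 stream per distinct item. The sketch below rebuilds them; the item stream 1, 2, 1, 1, 3, 2, 4 is inferred from the bit patterns and is an assumption, as is the function name.

```python
from typing import Dict, List, Sequence


def item_bit_streams(items: Sequence[int]) -> Dict[int, List[int]]:
    """One 0/1 stream per distinct item: 1 where that item arrived, else 0."""
    return {
        item: [1 if seen == item else 0 for seen in items]
        for item in sorted(set(items))
    }


# Assumption: this item stream is inferred from the four bit streams above.
items = [1, 2, 1, 1, 3, 2, 4]
for item, bits in item_bit_streams(items).items():
    print(f"Stream for {item}: {', '.join(map(str, bits))}")
```

Because every distinct item needs its own bit stream, the number of streams grows with the number of items, which is the problem the slide title points at.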
Exponentially Decaying Windows
• If the stream is a_1, a_2, … and we are taking the sum over the stream, the
answer at time t is defined as
$\sum_{i=1}^{t} a_i (1 - c)^{t-i}$
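The decaying sum does not need the whole history: when a new element arrives, multiply the current sum by (1 − c) and add the new element. The sketch below shows this update next to a direct evaluation of the definition; the function names are chosen for this example.

```python
from typing import Iterable, List


def decaying_sum(stream: Iterable[float], c: float) -> float:
    """Exponentially decaying sum, updated in O(1) per arriving element."""
    total = 0.0
    for a in stream:
        total = total * (1.0 - c) + a  # age all earlier terms, add the new one
    return total


def decaying_sum_direct(stream: List[float], c: float) -> float:
    """The same sum computed straight from the definition, for comparison."""
    t = len(stream)
    return sum(a * (1.0 - c) ** (t - i) for i, a in enumerate(stream, start=1))
```

Both functions return the same value up to floating-point rounding; only the first one is suitable for a stream, since it keeps a single number in memory.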
Property of Decaying Windows
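• The score of item x at time t is
$\sum_{i=1}^{t} \delta_i^x (1 - c)^{t-i}$, where $\delta_i^x = 1$ if $a_i = x$ and 0 otherwise
• Summing the scores over all items x:
$\sum_x \sum_{i=1}^{t} \delta_i^x (1 - c)^{t-i} = \sum_{i=1}^{t} (1 - c)^{t-i} \xrightarrow{t \to \infty} \frac{1}{c}$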
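The limit 1/c for the sum of all weights follows from a geometric series. A short derivation, assuming 0 < c < 1 and substituting j = t − i:

```latex
\sum_{i=1}^{t} (1 - c)^{t-i}
  = \sum_{j=0}^{t-1} (1 - c)^{j}
  = \frac{1 - (1 - c)^{t}}{1 - (1 - c)}
  = \frac{1 - (1 - c)^{t}}{c}
  \xrightarrow{t \to \infty} \frac{1}{c}
```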
Memory Requirements
• Suppose we want to find items with weight greater
than ½
• Since the sum of all scores is 1/c, there can't be more
than 2/c items with weight ½ or more!
• So, 2/c is a limit on the number of scores being
counted at any time
– For other weight requirements, we would get a different
bound, in a similar manner
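One way to use this bound is to keep a decaying score per item and drop any score that falls below ½. The sketch below follows that idea; the function name, the `threshold` parameter, and the policy of dropping low scores at every step are assumptions made for this example rather than something stated on the slide.

```python
from typing import Dict, Hashable, Iterable


def decaying_counts(stream: Iterable[Hashable], c: float,
                    threshold: float = 0.5) -> Dict[Hashable, float]:
    """Keep a decaying score per item, dropping scores below the threshold."""
    scores: Dict[Hashable, float] = {}
    for item in stream:
        # Age every tracked score by one time step.
        for key in list(scores):
            scores[key] *= 1.0 - c
        # The arriving item contributes weight 1 at the current time.
        scores[item] = scores.get(item, 0.0) + 1.0
        # Drop scores that fell below the threshold.
        for key in [k for k, s in scores.items() if s < threshold]:
            del scores[key]
    return scores
```

With c = 0.1 the dictionary never holds more than 2/c = 20 items, however many distinct items the stream contains.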