Data Science 5

The document discusses data stream mining and counting distinct elements in a stream. It introduces the count-distinct problem of determining how many different elements have appeared in a data stream. It describes the Flajolet-Martin algorithm for approximately solving this problem in sub-linear space. The algorithm uses hash functions to estimate distinct counts while using little memory; Bloom filters are covered separately as a probabilistic structure for filtering streams. Issues with processing high-speed streams in real time with limited resources are also covered.

Uploaded by

kagome

AACS1573

Introduction to
Data Science
Chapter 5
Mining Data Stream
Assumptions

1. Data arrives in a stream or streams, and if it is not processed
immediately or stored, then it is lost forever.

2. Data arrives so rapidly that it is not feasible to store it all in
active storage (i.e., in a conventional database).
The Stream Data Model
- A Data-Stream-Management System
- Examples of Stream Sources
- Stream Queries
- Issues in Stream Processing
The Stream Data Model
http://www.climate4you.com/SeaTemperatures.html
Stream Queries
There are two ways that queries get asked about
streams.

○ standing queries

○ ad-hoc queries
Standing queries
● Fig. 4.1 shows a place within the processor where
standing queries are stored.
● These queries are, in a sense, permanently
executing, and produce outputs at
appropriate times.
Ad-hoc queries
● A question asked once.
Issues in Stream Processing
● We must process elements in real time, or we lose the opportunity
to process them at all.
● Thus, it often is important that the stream-processing algorithm is
executed in main memory, without access to secondary storage or
with only rare accesses to secondary storage.
● Moreover, even when streams are “slow,” as in the sensor-data
(ocean behavior - 3.5 terabytes arriving every day), there may be
many such streams.
● Even if each stream by itself can be processed using a small
amount of main memory, the requirements of all the streams
together can easily exceed the amount of available main memory.
Issues in Stream Processing
● Thus, many problems about streaming data would be easy to solve
IF WE HAD ENOUGH MEMORY, but become rather hard and require the
invention of new techniques in order to execute them at a realistic
rate on a machine of realistic size.
5.2 Sampling Data in a Stream
next…
Sampling Algorithms
● Simple Random Sampling
● Stratified Random Sampling
● Reservoir Sampling
● Undersampling and Oversampling
5.3 Filtering Streams
- The Bloom Filter
- Analysis of Bloom Filtering
next…
The Bloom Filter
● A Bloom filter is represented by a bit array of m bits. At the beginning all bits are set
to 0.
● To insert an element into the filter, we:
○ calculate the values of all k hash functions for the element and
○ use their output values as indices in the array where we set bits to 1.
● To test if an element is in the filter, we:
○ calculate all k hash functions for the element and
○ check the bits at all corresponding indices:
■ if all bits are set, then the answer is "maybe"
■ if at least 1 bit isn't set, then the answer is "definitely not"
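The insert and test steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the class name `BloomFilter` and the way the k hash functions are derived (salting SHA-256 with the function number) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits with k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # at the beginning all bits are 0

    def _indices(self, item):
        # Assumed scheme: derive k hash functions by salting SHA-256
        # with the function number i, then reduce modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bit at every index produced by the k hash functions.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def query(self, item):
        # All bits set: "maybe"; any bit unset: "definitely not".
        if all(self.bits[idx] for idx in self._indices(item)):
            return "maybe"
        return "definitely not"


bf = BloomFilter(m=50, k=3)
bf.add("ferret")
print(bf.query("ferret"))
```

An element that was added always answers "maybe"; an element that was never added usually answers "definitely not", but can occasionally answer "maybe" when other elements happened to set all of its bits.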
Hashing

Hash functions accelerate table or database lookup by detecting duplicated
records in a large file.
Hash function
● A hash function is any function that can be used
to map data of arbitrary size to data of fixed size.
● The values returned by a hash function are
called hash values, hash codes, digests, or
simply hashes.
● One use is a data structure called a hash table,
widely used in computer software for rapid data
lookup.
● Hash functions accelerate table or database
lookup by detecting duplicated records in a
large file.
Hash Functions
● DJB2
● DJB2a
● FNV
● murmur
● SDBM
Example:

• Suppose we have a set, S, of 1000 allowed email addresses – those we will
allow through because we believe them not to be spam
Bloom Filter (example)

Email Email
address itself

Hash functions:
• A hash function is a function that can be used to map data to fixed-size values.
• The values returned by a hash function are called hash values, hash codes,
digests, or simply hashes.
The Bloom Filter

For example,
• the bit vector can be m=50 cells in length
• k=3 means 3 hash functions will be applied to the key.
How it works (Bloom Filter)
1. Initiate an empty bit array of length m. m=10
0 1 2 3 4 5 6 7 8 9

0 0 0 0 0 0 0 0 0 0

2. Select K different Hash functions

e.g.: K= 2, Func. A and Func. B

3. To add element, calculate all the K hashes and set the corresponding bits to 1.
e.g.: element “[email protected]”
- Func. A: position = 2
- Func. B: position = 3
0 1 2 3 4 5 6 7 8 9

0 0 1 1 0 0 0 0 0 0
How it works (Bloom Filter)

4. Repeat “[email protected]” : Func. A = 0, Func. B = 2

0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 0 0 0 0

5. Repeat “[email protected]” : Func. A = 6, Func. B = 7

0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 1 1 0 0

Bloom filter with 3 elements. It consists of 10 bits and uses 2 hash functions.
How it works (Bloom Filter)
● To test if [email protected] is in the set:
● For [email protected]: Func. A = 7, Func. B = 9
0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 1 1 0 0

● Bit 9 is not set, so the element [email protected] is “definitely not” in the set.

● To test [email protected]: Func. A = 0, Func. B = 6

0 1 2 3 4 5 6 7 8 9

1 0 1 1 0 0 1 1 0 0

● Bits 0 and 6 are both set, so the element [email protected] is “maybe” in the set.

How it works (Bloom filter)
● If [email protected] is not actually in the set, it is still reported as “maybe” in
the set.
● This is an example of a false positive.
The Bloom Filter
● Bloom filter can be used anywhere where false positives are OK and can be
handled but false negatives are not acceptable.
● Bloom filters can also be used to reduce I/O operations and thus increase
performance.
The Bloom Filter
● Example
● Consider Bloom filter of 16 bits (m=16):

● As an example, consider 2 hash functions (k=2): MurmurHash3 and
Fowler-Noll-Vo (FNV).
● Now, let's add element ferret to the filter: MurmurHash3(ferret)=1,
FNV(ferret)=11.
● Therefore, we need to set bits #1 and #11 in the filter.
The Bloom Filter
● Add element bernau to the filter:
MurmurHash3(bernau)=4, FNV(bernau)=4
● So, we need to set only the bit #4:

● But what about element paris? Its hashes are MurmurHash3(paris)=11,
FNV(paris)=4.
● If we check bits #4 and #11, they are both set in the filter and we shall conclude
that paris may be in the set.
● This is an example of a false positive response, because we know that we
didn't add paris to the filter and the appropriate bits were set independently
by 2 different elements (bit #4 was set by bernau and bit #11 by ferret).
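The ferret/bernau/paris example can be replayed with a small script. This is only a sketch: the hash positions are copied from the example rather than computed with real MurmurHash3 or FNV implementations, and the names `POSITIONS` and `query` are made up for the illustration.

```python
M = 16  # filter size from the example (m=16)

# Hash positions copied from the example: (MurmurHash3, FNV).
POSITIONS = {
    "ferret": (1, 11),
    "bernau": (4, 4),
    "paris": (11, 4),
}

bits = [0] * M

# Only ferret and bernau are added to the filter.
for element in ("ferret", "bernau"):
    for pos in POSITIONS[element]:
        bits[pos] = 1


def query(element):
    """Return "maybe" if every bit for the element is set."""
    if all(bits[pos] for pos in POSITIONS[element]):
        return "maybe"
    return "definitely not"


print([i for i, b in enumerate(bits) if b])  # indices of set bits
print(query("paris"))  # paris was never added, yet both of its bits are set
```

Only three bits end up set (#1, #4, #11), and the query for paris still answers "maybe", which is the false positive described above.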
Question (Bloom Filter)
● A Bloom filter helps to answer a yes/no question: “Is the item in the list?”
● Add “Cat” and “Dog” into the 12-bit Bloom filter.
○ “Cat” hashes to 3, 4, 10
○ “Dog” hashes to 1, 2, 5
● Build the Bloom filter diagram

● “elephant” hashes to 2, 4, 5. Is “elephant” in the set?

● Determine the type of error and describe the error.

Examples of applications
AACS1573
Introduction to
Data Science
Chapter 5
Mining Data Stream
(Part B)
We will learn about…
● 5.4 Counting Distinct Elements in a Stream
● 5.5 Estimating Moments
● 5.6 Counting Ones in a Window
● 5.7 Decaying Windows
We have learned about…
● 5.4 Counting Distinct Elements in a Stream
● 5.5 Estimating Moments
● 5.6 Counting Ones in a Window
● 5.7 Decaying Windows

Coming next…
Chapter 6 Case Study
5.4 Counting Distinct Elements in a Stream
next…
- The Count-Distinct Problem
- The Flajolet-Martin Algorithm
- Space Requirements
Counting Distinct Elements in a Stream
● In this section we look at a third simple kind
of processing we might want to do on a
stream.
● As with the previous examples – sampling
and filtering – it is somewhat tricky to do
what we want in a reasonable amount of
main memory, so we use a variety of hashing
and a randomized algorithm to get
approximately what we want with little space
needed per stream.
The Count-Distinct Problem

● Suppose stream elements are chosen from some universal set.
○ Problem or Question: We would like to know how many different
elements have appeared in the stream.
○ Solution: Counting from the beginning of the stream.
● Example 1: As a useful example of this problem, consider a Web site gathering
statistics on how many unique users it has seen in each given month. The
universal set is the set of logins for that site, and a stream element is
generated each time someone logs in. This measure is appropriate for a site
like Amazon, where the typical user logs in with their unique login name.

● Example 2: A similar problem is a Web site like Google that does not require
login to issue a search query, and may be able to identify users only by the IP
address from which they send the query. There are about 4 billion IP
addresses; sequences of four 8-bit bytes will serve as the universal set in this
case.
The Count-Distinct Problem
● The obvious approach is to keep the elements seen so far in an efficient
search structure such as a hash table or search tree, so one can quickly add
new elements and check whether or not the element that just arrived on the
stream was already seen.
● As long as the number of distinct elements is not too great, this structure
can fit in main memory and there is little problem obtaining an exact
answer to the question how many distinct elements appear in the stream.
● However, if the number of distinct elements is too great, or if there are too
many streams that need to be processed at once (e.g., Yahoo! wants to
count the number of unique users viewing each of its pages in a month),
then we cannot store the needed data in main memory.
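For comparison with the approximate method that follows, the exact approach can be sketched as below. It is a minimal illustration using Python's built-in `set` as the hash-based search structure; the function name `count_distinct_exact` is hypothetical.

```python
def count_distinct_exact(stream):
    """Count distinct elements exactly by remembering every element seen."""
    seen = set()  # hash-based structure: fast insert and membership test
    for element in stream:
        seen.add(element)  # adding an already-seen element changes nothing
    return len(seen)


print(count_distinct_exact([3, 4, 4, 6, 2]))
```

The memory used grows with the number of distinct elements, which is exactly the cost the estimation approach tries to avoid.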
The Count-Distinct Problem
● There are several options.

○ We could use more machines, each machine handling only one or
several of the streams.

○ We could store most of the data structure in secondary memory and
batch stream elements so whenever we brought a disk block to main
memory there would be many tests and updates to be performed on the
data in that block.

○ We could use the strategy to be discussed in this section, where we
only estimate the number of distinct elements but use much less memory
than the number of distinct elements.
The Flajolet-Martin Algorithm
● The idea behind the Flajolet-Martin Algorithm is that the more different
elements we see in the stream, the more different hash-values we shall see.
● As we see more different hash-values, it becomes more likely that one of
these values will be “unusual.”
● The particular unusual property we shall exploit is that the value ends in
many 0’s, although many other options exist.
● Whenever we apply a hash function h to a stream element a, the bit string
h(a) will end in some number of 0’s, possibly none.
● Call this number the tail length for a and h.
● Let R be the maximum tail length of any a seen so far in the stream.
● Then we shall use the estimate 2^R for the number of distinct elements seen in
the stream.
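A short derivation may help explain why 2^R is a reasonable estimate. It assumes that h behaves like a uniformly random bit string for each element, and writes m for the true number of distinct elements (m is a symbol introduced here, not taken from the slides).

```latex
% Probability that a single hash value ends in at least r zeros:
\Pr[\text{tail length of } h(a) \ge r] = 2^{-r}

% Probability that none of the m distinct elements has tail length >= r:
\Pr[\text{no tail length} \ge r] = \left(1 - 2^{-r}\right)^{m} \approx e^{-m 2^{-r}}

% If m \gg 2^{r}: e^{-m 2^{-r}} \to 0, so some tail of length >= r almost surely appears.
% If m \ll 2^{r}: e^{-m 2^{-r}} \to 1, so a tail of length >= r is unlikely.
```

So the largest tail length R seen is unlikely to make 2^R either much larger or much smaller than m.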
The Flajolet-Martin Algorithm
● Calculate the hash function, h(x), for every stream element
● Calculate the binary representation of every hash value
● Find the number of trailing zeros in each binary value
● Find the estimated number of distinct elements
Question (Flajolet-Martin Algorithm)
● Input stream of integers, x = 3, 4, 4, 6, 2
● Hash function, h(x) = (6x + 1) mod 5
● Compute the binary equivalent of every hash value calculated
● Compute the count of trailing zeros in each hash value
● Show the working for estimating the number of distinct values using FM
● List the distinct values.
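The steps of the question can be checked with a short script. This is a sketch under two stated assumptions: the hash is read as (6x + 1) mod 5, and a hash value of 0 is treated as having no trailing zeros (other treatments of 0 exist). The function names are made up for the illustration.

```python
def h(x):
    """Hash function from the question, read as (6x + 1) mod 5."""
    return (6 * x + 1) % 5


def trailing_zeros(value):
    """Count trailing 0 bits; 0 is treated as having none (assumed convention)."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def flajolet_martin(stream, hash_fn):
    """Return 2^R, where R is the largest tail length seen in the stream."""
    max_tail = 0
    for x in stream:
        max_tail = max(max_tail, trailing_zeros(hash_fn(x)))
    return 2 ** max_tail


stream = [3, 4, 4, 6, 2]
for x in stream:
    # element, hash value, 3-bit binary form, tail length
    print(x, h(x), format(h(x), "03b"), trailing_zeros(h(x)))

estimate = flajolet_martin(stream, h)
print(estimate)
```

With this hash the values are 4, 0, 0, 2, 3 (binary 100, 000, 000, 010, 011), the largest tail length is R = 2, and the estimate 2^2 = 4 happens to match the four distinct values 3, 4, 6, 2.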
Space Requirements

● Observe that as we read the stream it is not necessary to store the
elements seen.
● The only thing we need to keep in main memory is
one integer per hash function; this integer records
the largest tail length seen so far for that hash
function and any stream element.
Space Requirements
● If we are processing only one stream, we could use
millions of hash functions, which is far more than we
need to get a close estimate.
● Only if we are trying to process many streams at the
same time would main memory constrain the number
of hash functions we could associate with any one
stream.
● In practice, the time it takes to compute hash values for
each stream element would be the more significant
limitation on the number of hash functions we use.
5.5 Estimating Moments
next…
Estimating Moments
● In this section we consider a generalization of the problem of
counting distinct elements in a stream, called computing “moments.”
● Moments involve the distribution of frequencies of different
elements in the stream.
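Definition of Moments
● Let m_i be the number of occurrences of the ith element of the
universal set in the stream. The kth moment is the sum over all i of (m_i)^k.
● The 1st moment is the sum of the m_i, which must be the length of
the stream.

○ Thus, first moments are especially easy to compute; just count the length of the
stream seen so far.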
Definition of Moments
● The second moment is the sum of the squares of the m_i.
● It is sometimes called the surprise number, since it measures how uneven
the distribution of elements in the stream is.
● To see the distinction:

○ Suppose we have a stream of length 100, in which 11 different elements appear.

○ The most even distribution of these 11 elements would have one appearing 10
times and the other ten appearing 9 times each.

■ In this case, the surprise number is 10^2 + 10 × 9^2 = 910.
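When memory is not a concern, moments can be computed directly from the element counts. The sketch below rebuilds the 11-element example; the element labels `e0` to `e10` and the function name `moment` are made up for the illustration.

```python
from collections import Counter


def moment(stream, k):
    """kth moment: sum over distinct elements of (occurrence count)^k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


# One element appears 10 times, ten other elements appear 9 times each.
stream = ["e0"] * 10 + [f"e{i}" for i in range(1, 11) for _ in range(9)]

print(moment(stream, 0))  # number of distinct elements
print(moment(stream, 1))  # length of the stream
print(moment(stream, 2))  # surprise number
```

The 0th moment is the count of distinct elements, which is why moments generalize the count-distinct problem.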

The Alon-Matias-Szegedy Algorithm for Second
Moments
● For now, let us assume that a stream has a particular length n.
● Suppose we do not have enough space to count all the m_i for all
the elements of the stream.
● We can still estimate the second moment of the stream using a
limited amount of space; the more space we use, the more
accurate the estimate will be.
The Alon-Matias-Szegedy Algorithm for Second
Moments
● We compute some number of variables. For each variable X, we
store:

○ 1. A particular element of the universal set, which we refer to as X.element, and

○ 2. An integer X.value, which is the value of the variable. To determine the value of
a variable X, we choose a position in the stream between 1 and n, uniformly and at
random. Set X.element to be the element found there, and initialize X.value to 1.
As we read the stream, add 1 to X.value each time we encounter another
occurrence of X.element.
The Alon-Matias-Szegedy Algorithm for Second
Moments
● Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b.
● The length of the stream is n = 15.
a : 5 times
b : 4 times
c : 3 times
d : 3 times
● The second moment for the stream is
5^2 + 4^2 + 3^2 + 3^2 = 59
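The variables X can be tried out on this stream. The slides do not state how a variable is turned into an estimate; the sketch below assumes the standard Alon-Matias-Szegedy estimate n × (2 × X.value − 1), averaged over several variables. The positions 3, 8 and 13 are arbitrary choices for illustration, and `ams_estimate` is a hypothetical name.

```python
def ams_estimate(stream, position):
    """Second-moment estimate from one variable X picked at a 1-based position."""
    n = len(stream)
    element = stream[position - 1]                # X.element
    value = stream[position - 1:].count(element)  # X.value: occurrences from there on
    return n * (2 * value - 1)                    # assumed standard AMS estimate


stream = list("abcbdacdabdcaab")
positions = [3, 8, 13]  # illustrative choices, not random here

estimates = [ams_estimate(stream, p) for p in positions]
average = sum(estimates) / len(estimates)
print(estimates, average)

# Averaging over every possible position gives the exact second moment.
all_positions = range(1, len(stream) + 1)
exact = sum(ams_estimate(stream, p) for p in all_positions) / len(stream)
print(exact)
```

With these three positions the variables hold (c, 3), (d, 2) and (a, 2), giving estimates 75, 45 and 45 with average 55, reasonably close to the true value 59. Using more variables brings the average closer to 59 on expectation.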
Exercise: Calculate the surprise number using the Alon-Matias-Szegedy Algorithm
● 5,5,5,5,5
● 9,9,1,1,5
● 10,9,9,9,9,9,9,9,9,9,9
Answer: Calculate the surprise number using the Alon-Matias-Szegedy Algorithm
● 5,5,5,5,5 = 5^2 = 25
● 9,9,1,1,5 = 2^2 + 2^2 + 1^2 = 9
● 10,9,9,9,9,9,9,9,9,9,9 = 1^2 + 10^2 = 101
5.6 Counting Ones in a Window
next…
The Datar-Gionis-Indyk-Motwani Algorithm

● This version of the algorithm uses O(log^2 N) bits to represent a
window of N bits, and allows us to estimate the number of 1’s in the
window with an error of no more than 50%.
● To begin, each bit of the stream has a timestamp, the position in
which it arrives.
● The first bit has timestamp 1, the second has timestamp 2, and so
on.
Maintaining the DGIM Conditions
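The bucket rules themselves are not spelled out in the text of these slides, so the following is only a sketch based on the usual textbook description of DGIM: each bucket records the timestamp of its most recent 1 and a size that is a power of 2, at most two buckets of any size are kept, and the oldest bucket counts for half its size in the estimate. The class and method names are hypothetical.

```python
class DGIM:
    """Approximate count of 1's in the last `window_size` bits (sketch)."""

    def __init__(self, window_size):
        self.n = window_size
        self.t = 0          # timestamp of the most recent bit
        self.buckets = []   # newest first; each bucket is [end_timestamp, size]

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        # A new 1 becomes a bucket of size 1.
        self.buckets.insert(0, [self.t, 1])
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] != self.buckets[i + 2][1]:
                break
            newer, older = self.buckets[i + 1], self.buckets[i + 2]
            merged = [newer[0], newer[1] + older[1]]  # keep the newer timestamp
            self.buckets[i + 1:i + 3] = [merged]
            i += 1  # the merge may create three buckets of the next size

    def count_ones(self):
        """All buckets in full except the oldest, which counts for half."""
        if not self.buckets:
            return 0
        newer_total = sum(size for _, size in self.buckets[:-1])
        return newer_total + self.buckets[-1][1] / 2


dgim = DGIM(window_size=100)
for _ in range(10):
    dgim.add(1)
print(dgim.buckets)       # bucket sizes 1, 1, 2, 2, 4 (newest first)
print(dgim.count_ones())  # estimate for a window that really holds ten 1's
```

After ten 1's the buckets have sizes 1, 1, 2, 2, 4, so the estimate is 1 + 1 + 2 + 2 + 4/2 = 8 against a true count of 10, inside the 50% error bound.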
5.7 Decaying Windows
next…
Problem
• Given a stream, which items are currently popular?
– E.g., popular movie tickets, popular items being
sold on Amazon, etc.
– Popular items are those that appear more than s times in the window
• Possible solution
– Keep one stream per item; at each timepoint it holds “1” if the item
appears in the original stream and “0” otherwise
– Use DGIM to estimate the count of 1’s for each item
Example
• Original Stream: 1, 2, 1, 1, 3, 2, 4

Stream for 1: 1, 0, 1, 1, 0, 0, 0

Stream for 2: 0, 1, 0, 0, 0, 1, 0

Stream for 3: 0, 0, 0, 0, 1, 0, 0

Stream for 4: 0, 0, 0, 0, 0, 0, 1

• Problem with this approach? Too many streams!
Exponentially Decaying Windows

• A heuristic for selecting frequent items

– Gives more weight to more recent popular items
– Instead of computing count in the last N elements, compute smooth
aggregation over entire stream

• If the stream is a_1, a_2, … and we are taking the sum over the stream, the
answer at time t is defined as

$$\sum_{i=1}^{t} a_i (1 - c)^{t-i}$$

where c is a tiny constant, ~ 10^-6

Exponentially Decaying Windows (cont)

• If the stream is a_1, a_2, … and we are taking the sum over the stream, the
answer at time t is defined as

$$\sum_{i=1}^{t} a_i (1 - c)^{t-i}$$

where c is a tiny constant, ~ 10^-6

• When a new a_{t+1} arrives,
– (1) multiply the current sum by (1 - c) and
– (2) add a_{t+1}
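The two-step update is enough to maintain the decayed sum without storing the stream. The sketch below compares the incremental update with the defining sum; the function names are hypothetical and c = 0.5 is chosen only so the arithmetic is easy to follow by hand.

```python
def decayed_sum(stream, c):
    """Maintain the exponentially decayed sum one element at a time."""
    total = 0.0
    for a in stream:
        total = total * (1 - c) + a  # (1) decay the old sum, (2) add the new element
    return total


def decayed_sum_direct(stream, c):
    """Evaluate the defining sum a_i * (1 - c)^(t - i) over the whole stream."""
    t = len(stream)
    return sum(a * (1 - c) ** (t - i) for i, a in enumerate(stream, start=1))


stream = [1, 2, 3]
print(decayed_sum(stream, c=0.5))         # 1*0.25 + 2*0.5 + 3
print(decayed_sum_direct(stream, c=0.5))
```

Both give the same value, but the incremental version needs only one number of state per sum.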
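Property of Decaying Windows

• Remember, for each item x, we have a running score

$$\sum_{i=1}^{t} \delta_i^x (1 - c)^{t-i}$$

where $\delta_i^x$ is 1 if the ith stream element is x and 0 otherwise.

• Summing over all running scores we get

$$\sum_{x} \sum_{i=1}^{t} \delta_i^x (1 - c)^{t-i} = \sum_{i=1}^{t} (1 - c)^{t-i} \xrightarrow{t \to \infty} \frac{1}{c}$$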
Memory Requirements
• Suppose we want to find items with weight greater
than ½
• Since the sum of all scores is 1/c, there cannot be more
than 2/c items with weight ½ or more!
• So, 2/c is a limit on the number of scores being
counted at any time
– For other weight requirements, we would get a different
bound, in a similar manner
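The bound can be seen in a small sketch of the bookkeeping: every arriving element decays all scores, adds 1 to its own item, and scores that fall below the threshold ½ are dropped. The drop rule is an assumption about how the threshold is applied (the slides only state the bound), and `decaying_popular` is a hypothetical name.

```python
def decaying_popular(stream, c, threshold=0.5):
    """Track decayed scores per item, dropping scores below the threshold."""
    scores = {}
    for item in stream:
        # Decay every tracked score; forget items that became too light.
        for key in list(scores):
            scores[key] *= (1 - c)
            if scores[key] < threshold:
                del scores[key]
        # The arriving item gains weight 1 (or starts being tracked).
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores


c = 0.25
scores = decaying_popular([1, 2, 1, 1, 3, 2, 4], c)
print(scores)
print(len(scores), "scores tracked; bound is", 2 / c)
```

Because every kept score is at least ½ and all scores together never exceed 1/c, the dictionary can never hold more than 2/c entries, whatever the stream contains.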
