
MINING MASSIVE DATASETS

(20DS7T07)
R20 :: CSE (DATA SCIENCE) :: IV–I Semester
LECTURE NOTES

UNIT III: Mining Data Streams: The Stream Data Model, A Data-Stream-Management System, Examples of Stream Sources, Stream Queries, Issues in Stream Processing. Sampling Data in a Stream, The General Sampling Problem, Varying the Sample Size. Filtering Streams, A Motivating Example, The Bloom Filter, Analysis of Bloom Filtering.

3. Mining Data Streams


In this unit we make a different assumption: data arrives in a stream or streams, and if it is not processed immediately or stored, it is lost forever. Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database) and then interact with it at a time of our choosing.
The algorithms for processing streams each involve summarization of the stream in some way.
3.1 The Stream Data Model:
This section covers the elements of streams and stream processing: the difference between streams and databases, the special problems that arise when dealing with streams, and the typical applications where the stream model applies.
3.1.1 A Data-Stream-Management System
A stream processor can be viewed as a kind of data-management system, the high-level organization of which is suggested in Fig. 4.1. Any number of streams can enter the system. Each stream can provide elements on its own schedule; streams need not have the same data rates or data types, and the time between elements of one stream need not be uniform.
The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system.



Streams may be archived in a large archival store, but we assume it is not possible to
answer queries from the archival store. It could be examined only under special circumstances
using time-consuming retrieval processes. There is also a working store, into which summaries or
parts of streams may be placed, and which can be used for answering queries. The working store
might be disk, or it might be main memory, depending on how fast we need to process queries.
But either way, it is of sufficiently limited capacity that it cannot store all the data from all the
streams.
3.1.2 Examples of Stream Sources
Some of the ways in which stream data arises naturally:
Sensor Data
Imagine a temperature sensor bobbing about in the ocean, sending back to a base station a reading of the surface temperature each hour. The data produced by this sensor is a stream of real numbers.
Now, give the sensor a GPS unit, and let it report surface height instead of temperature. The
surface height varies quite rapidly compared with temperature, so we might have the sensor send
back a reading every tenth of a second. If it sends a 4-byte real number each time, then it produces
3.5 megabytes per day. It will still take some time to fill up main memory, let alone a single disk.
To learn something about ocean behavior, we might want to deploy a million sensors, each sending
back a stream, at the rate of ten per second. A million sensors isn’t very many; there would be one
for every 150 square miles of ocean. Now we have 3.5 terabytes arriving every day, and we
definitely need to think about what can be kept in working storage and what can only be archived.
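To see where these figures come from: 10 readings per second × 4 bytes × 86,400 seconds per day ≈ 3.5 megabytes per sensor per day, and one million sensors × 3.5 MB/day ≈ 3.5 terabytes per day.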
Image Data
Satellites often send down to earth streams consisting of many terabytes of images per day.
Surveillance cameras produce images with lower resolution than satellites, but there can be many
of them, each producing a stream of images at intervals like one second. London is said to have six
million such cameras, each producing a stream.
Internet and Web Traffic
A switching node in the middle of the Internet receives streams of IP packets from many inputs
and routes them to its outputs. Normally, the job of the switch is to transmit data and not to retain
it or query it. But there is a tendency to put more capability into the switch, e.g., the ability to
detect denial-of-service attacks or the ability to reroute packets based on information
about congestion in the network.
Web sites receive streams of various types. For example, Google receives several hundred million
search queries per day. Yahoo! accepts billions of “clicks” per day on its various sites. Many
interesting things can be learned from these streams.
3.1.3 Stream Queries
There are two ways that queries get asked about streams. The first is standing queries: Fig. 4.1 shows a place within the processor where standing queries are stored. These queries are, in a sense, permanently executing, and produce outputs at appropriate times. The second is ad-hoc queries: questions asked once about the current state of a stream or streams. A common approach to supporting ad-hoc queries is to store a sliding window of each stream in the working store.
Example: Web sites often like to report the number of unique users over the past month. If we think of each login as a stream element, we can maintain a window containing all logins in the most recent month. We must associate the arrival time with each login, so we know when it no longer belongs to the window. If we think of the window as a relation Logins(name, time), then it is simple to get the number of unique users over the past month.
The SQL query is:
SELECT COUNT(DISTINCT(name))
FROM Logins
WHERE time >= t;
Here, t is a constant that represents the time one month before the current time.
Note that we must be able to maintain the entire stream of logins for the past month in working
storage. However, for even the largest sites, that data is not more than a few terabytes, and so
surely can be stored on disk.
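As a minimal sketch of how such a window might be maintained in the working store (the class and its names are illustrative, not from the textbook):

from collections import deque

MONTH = 30 * 24 * 3600              # assume a 30-day month, in seconds

class LoginWindow:
    # Holds only the logins from the most recent month.
    def __init__(self):
        self.window = deque()       # (time, name) pairs in arrival order

    def add(self, t, name):
        self.window.append((t, name))

    def unique_users(self, now):
        # Expire logins that have fallen out of the one-month window.
        while self.window and self.window[0][0] < now - MONTH:
            self.window.popleft()
        # The equivalent of SELECT COUNT(DISTINCT(name)) over the window.
        return len({name for _, name in self.window})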
3.1.4 Issues in Stream Processing
First, streams often deliver elements very rapidly. We must process elements in real time, or we lose the opportunity to process them at all without accessing the archival storage. Thus, it is often important that the stream-processing algorithm be executed in main memory, without access to secondary storage or with only rare accesses to secondary storage.
Moreover, even when streams are “slow,” as in the sensor-data example of Section 3.1.2, there
may be many such streams. Even if each stream by itself can be processed using a small amount of
main memory, the requirements of all the streams together can easily exceed the amount of
available main memory.
3.2 Sampling Data in a Stream:
The general problem we shall address is selecting a subset of a stream so that we can ask queries
about the selected subset and have the answers be statistically representative of the stream as a
whole. If we know what queries are to be asked, then there are a number of methods that might
work, but we are looking for a technique that will allow ad-hoc queries on the sample.
Our running example is the following. A search engine receives a stream of queries, and it would
like to study the behavior of typical users.
We assume the stream consists of tuples (user, query, time). Suppose that we want to answer
queries such as “What fraction of the typical user’s queries were repeated over the past month?”
Assume also that we wish to store only 1/10th of the stream elements.
3.2.1 The General Sampling Problem:
The running example is typical of the following general problem. Our stream consists of tuples
with n components. A subset of the components are the key components, on which the selection of the sample will be based. In our running example, there are three components – user, query, and
time – of which only user is in the key. However, we could also take a sample of queries by
making query be the key, or even take a sample of user-query pairs by making both those
components form the key.
To take a sample of size a/b, we hash the key value for each tuple to b buckets, and accept the
tuple for the sample if the hash value is less than a. If the key consists of more than one
component, the hash function needs to combine the values for those components to make a single
hash-value. The result will be a sample consisting of all tuples with certain key values. The
selected key values will be approximately a/b of all the key values appearing in the stream.
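A minimal sketch of this scheme in Python (the example tuples and the md5-based hash are illustrative choices, not prescribed by the text):

import hashlib

def in_sample(key: str, a: int, b: int) -> bool:
    # Hash the key to one of b buckets; accept iff the bucket number
    # is less than a, selecting roughly a/b of all key values.
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % b
    return bucket < a

# Keep all queries of roughly 1/10th of the users.
stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "cats", 3)]
sample = [t for t in stream if in_sample(t[0], a=1, b=10)]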
3.2.2 Varying the Sample Size:
Often, the sample will grow as more of the stream enters the system. In our running example, we
retain all the search queries of the selected 1/10th of the users, forever. As time goes on, more
searches for the same users will be accumulated, and new users that are selected for the sample
will appear in the stream.
If we have a budget for how many tuples from the stream can be stored as the sample, then the
fraction of key values must vary, lowering as time goes on. In order to assure that at all times the sample consists of all tuples from a subset of the key values, we choose a hash function h from key values to a very large number of values 0, 1, ..., B−1. We maintain a threshold t, which initially can be the largest bucket number, B − 1. At all times, the sample consists of those tuples whose key K satisfies h(K) ≤ t. New tuples from the stream are added to the sample if and only if they satisfy the same condition.
If the number of stored tuples of the sample exceeds the allotted space, we lower t to t−1 and
remove from the sample all those tuples whose key K hashes to t. For efficiency, we can lower t by
more than 1, and remove the tuples with several of the highest hash values, whenever we need to
throw some key values out of the sample. Further efficiency is obtained by maintaining an index
on the hash value, so we can find all those tuples whose keys hash to a particular value quickly.
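A sketch of this threshold scheme (the names and md5-based hash are illustrative assumptions; a real implementation would keep the index on hash values mentioned above rather than rescanning the sample):

import hashlib

B = 2**32                            # number of hash buckets

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % B

class ThresholdSample:
    # Keeps all tuples whose key hashes to at most t; lowers t
    # whenever the stored sample outgrows its space budget.
    def __init__(self, budget: int):
        self.budget = budget
        self.t = B - 1               # initially accept every key value
        self.tuples = []             # the stored sample: (key, tuple) pairs

    def add(self, key: str, tup):
        if h(key) <= self.t:
            self.tuples.append((key, tup))
        while len(self.tuples) > self.budget:
            # Evict the key value(s) with the highest hash, lowering t
            # below the current maximum (often by more than 1 at a time).
            self.t = max(h(k) for k, _ in self.tuples) - 1
            self.tuples = [(k, v) for k, v in self.tuples if h(k) <= self.t]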
3.3 Filtering Streams:
Another common process on streams is selection, or filtering. We want to accept those tuples in the
stream that meet a criterion. Accepted tuples are passed to another process as a stream, while other
tuples are dropped. If the selection criterion is a property of the tuple that can be calculated (e.g.,
the first component is less than 10), then the selection is easy to do. The problem becomes harder when the criterion involves lookup for membership in a set. It is especially hard when that set is too large to store in main memory. In this section, we shall discuss the technique
known as “Bloom filtering” as a way to eliminate most of the tuples that do not meet the criterion.
3.3.1 A Motivating Example:
Suppose we have a set S of one billion allowed email addresses – those that we will allow through
because we believe them not to be spam. The stream consists of pairs: an email address and the
email itself. Since the typical email address is 20 bytes or more, it is not reasonable to store S in
main memory. Thus, we can either use disk accesses to determine whether or not to let through any
given stream element, or we can devise a method that requires no more main memory than we
have available, and yet will filter most of the undesired stream elements.
Suppose for argument’s sake that we have one gigabyte of available main memory. In the
technique known as Bloom filtering, we use that main memory as a bit array. In this case, we have
room for eight billion bits, since one byte equals eight bits. Devise a hash function h from email
addresses to eight billion buckets. Hash each member of S to a bit, and set that bit to 1. All other
bits of the array remain 0.
Since there are one billion members of S, approximately 1/8th of the bits will be 1. The exact
fraction of bits set to 1 will be slightly less than 1/8th, because it is possible that two members of S
hash to the same bit. We shall discuss the exact fraction of 1’s in Section 3.3.3. When a stream
element arrives, we hash its email address. If the bit to which that email address hashes is 1, then
we let the email through. But if the email address hashes to a 0, we are certain that the address is
not in S, so we can drop this stream element.
Unfortunately, some spam email will get through. Approximately 1/8th of the stream elements
whose email address is not in S will happen to hash to a bit whose value is 1 and will be let
through. Nevertheless, since the majority of emails are spam (about 80% according to some
reports), eliminating 7/8th of the spam is a significant benefit. Moreover, if we want to eliminate
every spam email, we need only check for membership in S those good and bad emails that get through
the filter. Those checks will require the use of secondary memory to access S itself. There are also
other options, as we shall see when we study the general Bloom-filtering technique.
As a simple example, we could use a cascade of filters, each of which would eliminate 7/8th of the
remaining spam.
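A scaled-down sketch of this one-hash-function filter (the array size and addresses are illustrative; a real deployment would use the full eight-billion-bit array):

import hashlib

N_BITS = 8_000                       # stand-in for eight billion bits

def h(addr: str) -> int:
    return int(hashlib.md5(addr.encode()).hexdigest(), 16) % N_BITS

# Build the filter: hash each member of S to a bit and set that bit to 1.
S = ["alice@example.com", "bob@example.com"]    # stand-in for the real S
bits = bytearray(N_BITS // 8)
for addr in S:
    i = h(addr)
    bits[i // 8] |= 1 << (i % 8)

def let_through(addr: str) -> bool:
    # A 0 bit means the address is certainly not in S; a 1 bit means
    # it is probably in S (with a small false-positive chance).
    i = h(addr)
    return bool(bits[i // 8] & (1 << (i % 8)))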



3.3.2 The Bloom Filter:

A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps key values to n buckets, corresponding to the n bits of the array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.
To initialize the bit array, begin with all bits 0. Take each key value in S and hash it using each of the k hash functions. Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
To test a key K that arrives in the stream, check whether all of h1(K), h2(K), . . . , hk(K) are 1’s in the bit array. If so, let the stream element through. If one or more of these bits are 0, then K could not be in S, so reject the stream element.
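A compact sketch of the full filter with k hash functions (deriving the k functions by salting one base hash is an implementation assumption, not part of the definition):

import hashlib

class BloomFilter:
    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)

    def _hashes(self, key: str):
        # Simulate k independent hash functions h1..hk by salting
        # one base hash with the function's index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key: str):
        # Set to 1 each bit hi(K) for every hash function hi.
        for i in self._hashes(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key: str) -> bool:
        # Let the element through only if all k bits are 1.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._hashes(key))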

3.3.3 Analysis of Bloom Filtering:


If a key value is in S, then the stream element will surely pass through the Bloom filter. However, if the key value is not in S, it might still pass. We need to understand how to calculate the probability of a false positive, as a function of n, the length of the bit array; m, the number of members of S; and k, the number of hash functions.
The model to use is throwing darts at targets. Suppose we have x targets and y darts. Any dart is equally likely to hit any target. After throwing the darts, how many targets can we expect to be hit at least once? The analysis is similar to the analysis in Section 3.4.2, and goes as follows. The probability that a given dart will not hit a given target is (x − 1)/x. The probability that none of the y darts hits a given target is ((x − 1)/x)^y, which we can write as (1 − 1/x)^y. Using the approximation (1 − 1/x)^x ≈ 1/e for large x, this probability is approximately e^(−y/x).
To apply the model to Bloom filtering, think of each bit as a target and each application of a hash function to a member of S as a dart, so there are x = n targets and y = km darts. The probability that a given bit remains 0 is approximately e^(−km/n), and the fraction of bits set to 1 is approximately 1 − e^(−km/n). A key that is not in S passes the filter only if all k of its hash functions land on bits that are 1, which happens with probability approximately (1 − e^(−km/n))^k. This is the probability of a false positive.
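As a check against the motivating example of Section 3.3.1: with n = 8 × 10^9 bits, m = 10^9 members, and k = 1 hash function, the fraction of 1 bits is approximately 1 − e^(−1/8) ≈ 0.1175, slightly less than the rough estimate of 1/8, and this is also the false-positive rate. Using k = 2 hash functions instead gives a false-positive rate of approximately (1 − e^(−1/4))^2 ≈ 0.0489.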
