
Mining Data Streams

Introduction

• This chapter considers data arriving in fast streams that cannot be
stored indefinitely.
• Because the data arrives rapidly and is lost if not processed, it must
be summarized as it arrives.
• Stream-processing algorithms create useful samples, filter out
undesirable elements, and estimate counts of elements with minimal
storage.
• An alternative method for summarizing streams involves utilizing a fixed-size
"window" that encompasses the last n elements, where n is typically large.
• This window behaves like a relational database, enabling queries to be performed on
its content.
• When dealing with multiple streams or large window sizes, it might be impractical to
store complete windows for each stream.
• Consequently, summarization techniques are necessary to condense the window data.
• The main focus is on solving the fundamental challenge of approximating the count of
occurrences of "1"s within the window of a binary stream.
• The aim is to achieve this approximation while using significantly less storage space
compared to storing the entire window.
• This technique has broader applications and can be extended to approximate various
types of sums beyond just counting occurrences.
The Stream Data Model
• The difference between streams and databases.
• The special problems that arise when dealing with streams.
A Data-Stream-Management System
•Stream processors can be likened to data-management systems.
•The structure resembles Fig. 4.1, with multiple streams entering the system.
•Streams have varied schedules, rates, and data types.
•Unlike a database system, a stream processor does not control the rate at
which elements arrive, so it risks losing data it cannot process in time.
•Streams can be archived, but the archive is not suited to answering queries
directly; retrieval is too time-consuming.
•A working store, with limited capacity (disk or memory), holds summaries or parts
of streams for query responses.
•The choice of working store depends on query processing speed requirements.
•The working store can't accommodate all data from all streams due to limited
capacity.
Examples of Stream Sources
Sensor Data:
• Temperature sensor in the ocean provides hourly surface
temperature readings.
• Low data rate, can be stored in main memory indefinitely.
• Adding GPS makes it report surface height, requiring readings
every tenth of a second.
• With 4-byte real numbers, it produces 3.5 megabytes per day.
• For a million sensors each sending ten readings per second, that is 3.5
terabytes of data daily (see the arithmetic check below).
• Challenge: deciding what to keep in working storage and what
to archive.
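A quick back-of-the-envelope check of the rates quoted above; the inputs (ten 4-byte readings per second) come from the example, the rest is plain arithmetic:

```python
# Back-of-the-envelope check of the sensor data rates quoted above.
readings_per_second = 10          # one reading every tenth of a second
bytes_per_reading = 4             # a 4-byte real number
seconds_per_day = 24 * 60 * 60    # 86,400 seconds

bytes_per_sensor_per_day = readings_per_second * bytes_per_reading * seconds_per_day
print(bytes_per_sensor_per_day)   # 3,456,000 bytes -- about 3.5 megabytes

sensors = 1_000_000
print(sensors * bytes_per_sensor_per_day)  # ~3.5 * 10**12 bytes -- about 3.5 terabytes
```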
Image Data:
• Satellites and surveillance cameras produce
image streams.
• Satellites send terabytes of images daily.
• Surveillance cameras (London alone is said to have six million) each
produce an image roughly every second.
Stream Queries
• Two Query Approaches:
• Standing Queries:
– Permanently executing queries.
– Outputs generated at suitable times.
– Stored in processor for continuous execution.
• Ad-hoc Queries:
– One-time questions about current stream state.
– Handling depends on stored summaries or parts.
Standing Query Examples:
• Temperature Alert:
– Stream from ocean-surface-temperature sensor.
– Alert if temperature exceeds 25°C.
– Based on most recent element.
• Average of Recent Readings:
– Standing query computes average of last 24 readings.
– Easily answered by storing recent stream elements.
– Oldest element removed with arrival of new one.
• Maximum Temperature Query:
– Retain summary of maximum temperature.
– Updated when new element arrives.
– Answered by current maximum value.
• Average Temperature Over Time:
– Record the count of readings and their sum.
– The answer is the quotient of the sum and the count.
– Easily adjusted as new readings arrive (see the sketch below).
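A minimal sketch of the four standing queries above; the class name TemperatureStream and anything not stated in the bullets (method names, printing the alert) are illustrative assumptions:

```python
from collections import deque

class TemperatureStream:
    """Standing queries over a stream of temperature readings (a sketch)."""

    def __init__(self, window=24):
        self.recent = deque(maxlen=window)  # last 24 readings; oldest drops out
        self.maximum = float("-inf")        # summary for the maximum-temperature query
        self.count = 0                      # summaries for the all-time average:
        self.total = 0.0                    # a count and a running sum

    def receive(self, reading):
        if reading > 25.0:                  # alert: depends only on the newest element
            print("ALERT: temperature above 25 C")
        self.recent.append(reading)
        self.maximum = max(self.maximum, reading)
        self.count += 1
        self.total += reading

    def average_recent(self):               # average of the last 24 readings
        return sum(self.recent) / len(self.recent)

    def average_all_time(self):             # sum divided by count
        return self.total / self.count
```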
Ad-hoc Query Handling:
• Ad-hoc queries for current stream state.
• Can't store all streams; prepare for expected queries.
• Store appropriate summaries or parts for specific queries.
Wide Variety of Ad-hoc Queries:
• Approach: Store sliding window of each stream in working store.
• Sliding window: Most recent 'n' elements or elements in last 't'
time units.
• Treat window as relation, query using SQL.
• Stream-management system keeps the window fresh by removing the
oldest elements as new ones arrive (sketched below).
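A minimal sketch of keeping a time-based window fresh, assuming integer timestamps; a window of the most recent n elements is even simpler (a deque with maxlen=n):

```python
from collections import deque

def fresh_window(window, element, now, t):
    """Append the newest element and expel those older than t time units (a sketch)."""
    window.append((now, element))
    while window and window[0][0] < now - t:
        window.popleft()

w = deque()
for ts, e in [(1, "a"), (2, "b"), (12, "c")]:
    fresh_window(w, e, ts, t=10)
print(list(w))  # [(2, 'b'), (12, 'c')] -- 'a' has fallen out of the 10-unit window
```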
Unique Users Query:
• Web sites report unique users over past month.
• Stream elements: Logins; Maintain window of recent
month's logins.
• Window treated as relation Logins(name, time).
• A SQL query over the window counts distinct users in the past month
(see the sketch below).
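A minimal sketch, using an in-memory SQLite table to stand in for the windowed Logins relation; the integer timestamps and sample rows are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Logins(name TEXT, time INTEGER)")
con.executemany("INSERT INTO Logins VALUES (?, ?)",
                [("alice", 100), ("bob", 120), ("alice", 150), ("carol", 40)])

now, month = 150, 100  # toy timestamps; a real system would use dates
(unique_users,) = con.execute(
    "SELECT COUNT(DISTINCT name) FROM Logins WHERE time >= ?",
    (now - month,)).fetchone()
print(unique_users)    # 2 -- alice and bob; carol's login is older than a month
```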
Data Storage Consideration:
• Must maintain stream data for queries.
• Even for large sites, data is a few terabytes.
• Storage on disk feasible for handling stream data.
Issues in Stream Processing
Constraints of Stream Processing
• Real-Time Processing:
– Streams deliver elements rapidly.
– Must process elements immediately or lose the chance.
– Minimize reliance on archival storage.
– Execute algorithms in main memory.
– Rare access to secondary storage, if needed.
• Multiple Streams and Memory Limitations:
– Even slow streams can collectively overwhelm memory.
– Many streams may exceed available memory.
– New techniques needed for realistic processing on realistic machines.
Key Insights:
• Approximation vs. Exact Solution:
– Efficient to get an approximate answer rather than
exact.
– Trade-off between speed and precision.
• Hashing Techniques:
– Similar to techniques in Chapter 3.
– Introduce controlled randomness to algorithms.
– Produces close-to-true approximate answers.
Takeaways:
• Stream processing demands real-time action.
• Memory constraints require creative
solutions.
• Approximations often offer practical efficiency.
• Hashing techniques yield approximations that are provably
close to the true answer.
Sampling Data in a Stream
Introduction
• Managing streaming data through reliable
samples.
• Unique use of hashing for stream algorithms.
A Motivating Example
• Analyzing User Query Behavior in a Stream
• Stream Characteristics and Objective:
– Search engine receives a query stream (user, query, time).
– Objective: Study typical user behavior.
– Query: "What fraction of typical user's queries were repeated in the
past month?"
– Store only 1/10th of stream elements.
• Initial Sampling Approach:
– Generate a random number for each query (0 to 9).
– Store tuple if random number is 0.
– Each user's queries have about 1/10th stored.
– Law of large numbers compensates for statistical noise.
• Challenge and Misleading Result:
– Misleading result for average duplicate queries.
– Example: a user issued s queries once, d queries twice, and no query
more than twice.
– In a 1/10th sample, s/10 of the singleton queries are expected to appear.
– Of the d duplicate queries, only d/100 appear twice in the sample, since
both occurrences must be selected (probability 1/10 × 1/10).
– 18d/100 of the duplicate queries appear exactly once, since one occurrence
is selected and the other is not (probability 2 × 1/10 × 9/10).
• Correct Answer vs. Sample Outcome:
– Correct answer: fraction of repeated searches = d/(s+d).
– Outcome computed from the sample: d/(10s+19d), derived below from the
expected counts d/100, s/10, and 18d/100.
• Analysis of Sampling Outcome:
– The fraction appearing twice in the sample is d/100 divided by
(d/100 + s/10 + 18d/100).
– This ratio simplifies to d/(10s + 19d), which does not match the correct
answer.
– For positive s and d, d/(s + d) ≠ d/(10s + 19d).
• Key Takeaway:
• Sampling approach can yield misleading results for certain queries.
• Theoretical correct answer doesn't align with the derived outcome
from the sample.
• Careful consideration and alternative approaches are needed for
accurate analysis (the simulation below reproduces the bias).
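A minimal simulation of the example above; the parameter values (s = d = 10,000 and a 1/10 sampling rate) are illustrative assumptions:

```python
import random

def simulate(s=10_000, d=10_000, p=0.1, seed=1):
    rng = random.Random(seed)
    # The stream: s queries issued once, d queries issued twice.
    stream = [("single", i) for i in range(s)] + [("double", i) for i in range(d)] * 2
    counts = {}                        # sampled query -> occurrences in the sample
    for q in stream:
        if rng.random() < p:           # sample each stream element independently
            counts[q] = counts.get(q, 0) + 1
    doubles = sum(1 for c in counts.values() if c == 2)
    return doubles / len(counts)       # fraction of sampled queries seen twice

print(simulate())                  # ~0.034, i.e. d/(10s + 19d) = 1/29
print(10_000 / (10_000 + 10_000))  # 0.5, the true fraction d/(s + d)
```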
Obtaining a Representative Sample
• Challenges in Sample Selection:
– Certain statistical queries can't be answered via a simple
sample of each user's queries.
– Goal: Choose 1/10th of users, take all their queries, excluding
others.
• Storing User Information Approach:
– Maintain list of users and whether they're in the sample.
– Process each query, check if user is in the sample list.
– If yes, add query to the sample; if no, exclude query.
– Generate random integer for unseen users to decide
inclusion.
• Memory Constraints and Hash Function:
– Memory limitations hinder storing all users' info.
– Hash function enables decision without user list.
– Hash user names to buckets (0 to 9).
– If hashed to bucket 0, include query; otherwise, exclude.
• Hash Function as Random Number Generator:
– Hash function acts as a pseudo-random generator.
– Same user hashed multiple times yields consistent result.
– Enables reevaluating inclusion decision as new queries arrive.
• Generalizing for Variable Sample Sizes:
– Hash user names to b buckets (0 to b−1).
– To sample a fraction a/b of the users, include a query if its user's
hash value is less than a (see the sketch below).
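A minimal sketch of the bucket decision. SHA-1 stands in for the hash function because, unlike Python's salted built-in hash(), it gives the same user the same bucket on every run, which is the consistency the scheme depends on:

```python
import hashlib

def in_sample(user, a=1, b=10):
    """Keep a fraction a/b of users: same user, same decision, every time."""
    bucket = int(hashlib.sha1(user.encode()).hexdigest(), 16) % b
    return bucket < a

sample = []
for user, query, time in [("alice", "cats", 1), ("bob", "dogs", 2)]:
    if in_sample(user):            # all of a chosen user's queries are kept
        sample.append((user, query, time))
```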
Key Takeaway:
• Direct sampling of each user's queries may not work for specific queries.
• Maintaining user information might not be feasible due to memory
constraints.
• Hash functions offer a practical solution to make inclusion decisions based
on buckets.
• This technique enables obtaining a representative sample of any desired
fraction of users.
The General Sampling Problem
• Problem Overview:
– Stream contains n-component tuples.
– Some components form the key for sample selection.
– Key components influence which tuples are included in the sample.
– In our example the tuple's components are user, query, and time, and
the key is the user component alone.
• Flexibility in Sample Composition:
– Key components can be selected to vary the focus of sampling.
– Choices include individual components (user, query) or
combinations (user-query pairs).
• Hash-Based Sampling Approach:
– To achieve a sample of size a/b, hash key values to b buckets.
– Tuple is added to the sample if hash value is less than a.
• Handling Multi-Component Keys:
– When the key comprises multiple components, the hash function
combines them into a single hash value (see the sketch below).
– The resulting sample consists of all tuples whose key hashes to the
chosen buckets, preserving the desired sample ratio.
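A minimal sketch of hashing a multi-component key; the separator character is an assumption that keeps, e.g., ("ab", "c") and ("a", "bc") from colliding:

```python
import hashlib

def key_bucket(*components, b=10):
    """Hash a multi-component key into one of b buckets."""
    joined = "\x1f".join(str(c) for c in components)  # separator avoids collisions
    return int(hashlib.sha1(joined.encode()).hexdigest(), 16) % b

# Sample 1/10 of user-query pairs: the key is (user, query); time is excluded.
def keep(user, query, a=1, b=10):
    return key_bucket(user, query, b=b) < a
```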
• Key Takeaway:
• General problem involves selecting tuples based on key
components.
• Versatility in selecting key components influences the
focus of sampling.
• Hash-based approach adapts for single or multi-
component keys, maintaining desired sample size and
composition.
Varying the Sample Size
• Sample Growth Over Time:
– Sample size often expands as more stream data is processed.
– Example retains 1/10th of users' search queries indefinitely.
– Over time, additional queries for existing users and new users
appear.
• Adjusting Sample Size with Budget:
– If sample storage has a budget, key value fraction must
decrease.
– Goal: Maintain a representative sample of selected key values.
• Maintaining Sample Composition:
– Use hash function h: key values → large range of values (0 to B−1).
– Set threshold t initially at B − 1.
– Sample comprises tuples with key K such that h(K) ≤ t.
• Dynamically Updating the Sample:
– New tuples included in the sample if they satisfy h(K) ≤ t
condition.
– Exceeding storage limit prompts reduction in t value.
– Remove tuples with key values hashed to t to maintain budget.
• Efficiency Enhancement:
– Lower t by more than 1 when a single step would not free enough space.
– When removing key values, remove the tuples with the highest hash
values (sketched below).
– An index on the hash value gives quick access to all tuples with a
given hash.
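A minimal sketch of the whole mechanism, assuming a hash range of B = 2**32 and a plain list in place of the hash-value index the slides recommend:

```python
import hashlib

B = 2**32  # hash range 0..B-1

def h(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % B

class ShrinkingSample:
    """Fixed-budget sample whose fraction of key values shrinks over time."""

    def __init__(self, budget):
        self.budget = budget
        self.t = B - 1               # threshold: accept every key at first
        self.tuples = []             # (hash value, tuple) pairs

    def add(self, key, tup):
        hv = h(key)
        if hv <= self.t:             # sample = tuples with h(K) <= t
            self.tuples.append((hv, tup))
        while len(self.tuples) > self.budget:
            # Over budget: lower t past the highest hash value present,
            # evicting every tuple whose key now hashes above t.
            self.t = max(v for v, _ in self.tuples) - 1
            self.tuples = [(v, tp) for v, tp in self.tuples if v <= self.t]
```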
• Key Takeaway:
• As stream data accumulates, the sample size may increase.
• Adjusting sample size based on storage budget ensures continued
sample representativeness.
• Hash function, threshold t, and dynamic adjustments maintain sample
composition.
Exercise 4.2.1 Summary: Constructing a Sample for Queries
• Given a stream of tuples with the schema Grades(university, courseID,
studentID, grade), where universities are unique but courseIDs and
studentIDs are unique only within their university, the goal is to answer
queries approximately from a 1/20th sample of the data. For each query
below, the key attributes used to construct the sample are:
(a) Query: Estimate the average number of students in a course.
• Key Attributes: university, courseID
• Construction: Hash the pair (university, courseID) to 20 buckets; include
the tuple in the sample if it hashes to bucket 0. All grade tuples for a
sampled course are then present, so its student count is exact.
(b) Query: Estimate the fraction of students with GPA ≥ 3.5.
• Key Attributes: university, studentID
• Construction: Hash the pair (university, studentID) to 20 buckets; include
the tuple in the sample if it hashes to bucket 0. All of a sampled
student's grades are then present, so the GPA can be computed.
(c) Query: Estimate the fraction of courses in which at least 50% of the
students received a grade of "A".
• Key Attributes: university, courseID
• Construction: Hash the pair (university, courseID) to 20 buckets; include
the tuple in the sample if it hashes to bucket 0. All grades for a sampled
course are then present, so its fraction of "A" grades is exact.
Key Takeaway:
• Sample construction hashes the query's key attributes to buckets.
• Each query's key attributes are chosen so that every tuple needed to
answer the query for a sampled key value lands in the sample.
• Keeping one bucket out of 20 selects each key value with probability
1/20 (see the sketch below).
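A minimal sketch for query (a); queries (b) and (c) differ only in the key attributes (swap courseID for studentID in part (b)):

```python
import hashlib

BUCKETS = 20

def bucket(*key):
    joined = "|".join(map(str, key))
    return int(hashlib.sha1(joined.encode()).hexdigest(), 16) % BUCKETS

sample = []

def process(university, courseID, studentID, grade):
    # Query (a): key = (university, courseID); keep a 1/20 fraction of courses.
    if bucket(university, courseID) == 0:
        sample.append((university, courseID, studentID, grade))
```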
