0% found this document useful (0 votes)
2 views

Data Stream Management

Uploaded by

cb250802
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

Data Stream Management

Uploaded by

cb250802
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 46

Data Stream Systems

santanoo

Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”,
PODS’02.
An Overview Data and
of Streams Query Models
Outline of
this Talk Other
Approximation
Research
Queries
Issues
Data Streams
 Traditional DBMS – data stored in finite,
persistent data sets
 New Applications – data input as continuous,
ordered data streams
A data stream as a growing relational table of
potentially infinite size
Using Traditional Database
User/Application

Query Result
Query Result
… …

Loader
New Approach for Data
Streams
User/Application

Register Query
Results

Stream Query
Processor
New Approach for Data
Streams
User/Application

Register Query
Results
Data
Stream Query Stream
Processor Management
System
(DSMS)
Scratch Space
(Memory and/or Disk)
Sample Applications
 Network management and traffic engineering
(e.g., Sprint)
 Streams of measurements and packet traces
 Queries: detect anomalies, adjust routing

 Telecom call data


(e.g., AT&T)
 Streams of call records
 Queries: fraud, customer call patterns, billing
Sample Applications
 Sensor Networks
 Large number of cheap, wireless sensors
 streams of real-world measurements
 Queries: monitoring, aggregate, alert

 Web tracking and personalization


(e.g., Yahoo, Google)
 Clickstreams,user query streams, log records
 Queries: monitoring, analysis, personalization
Challenges
 Multiple,continuous, rapid, time-varying,
ordered streams
 Main memory computations
 Queries may be continuous (not just one-time)
 Evaluatedcontinuously as stream data arrives
 Answer updated over time

 Queries may be ad-hoc


 Beyond relational queries (scientific, data
mining)
Meta-Questions
 Killer-apps
 Application
stream rates exceed DBMS capacity?
 Can DSMS handle high rates anyway?

 Motivation
 Need for general-purpose DSMS?
 Not ad-hoc, application-specific systems?
 Non-Trivial
 DSMS = merely DBMS with enhanced support for
triggers, temporal constructs, data rate mgmt?
DBMS versus DSMS
 Persistent relations  Transient streams
DBMS versus DSMS
 Persistent relations  Transient streams
 One-time queries  Continuous queries
DBMS versus DSMS
 Persistent relations  Transient streams
 One-time queries  Continuous queries
 Random access  Sequential access
DBMS versus DSMS
 Persistent relations  Transient streams
 One-time queries  Continuous queries
 Random access  Sequential access
 “Unbounded” disk store  Bounded main memory
DBMS versus DSMS
 Persistent relations  Transient streams
 One-time queries  Continuous queries
 Random access  Sequential access
 “Unbounded” disk store  Bounded main memory
 Only current state matters  History/arrival-order is
critical
DBMS versus DSMS
 Persistent relations  Transient streams
 One-time queries  Continuous queries
 Random access  Sequential access
 “Unbounded” disk store  Bounded main memory
 Only current state matters  History/arrival-order is
 Relatively low update rate critical
 Possibly multi-GB arrival
rate
DBMS versus DSMS
 Persistent relations  Transient streams
 One-time queries  Continuous queries
 Random access  Sequential access
 “Unbounded” disk store  Bounded main memory
 Only current state matters  History/arrival-order is
 Relatively low update rate critical
 No real-time services  Possibly multi-GB arrival
rate
 Real-time requirements
DBMS versus DSMS
 Persistent relations  Transient streams
 One-time queries  Continuous queries
 Random access  Sequential access
 “Unbounded” disk store  Bounded main memory
 Only current state matters  History/arrival-order is
 Relatively low update rate critical
 No real-time services  Possibly multi-GB arrival
 Assume precise data rate
 Real-time requirements
 Data stale/imprecise
Outline of this Talk
 An Overview of Streams
 Data and Query Models
 Approximation Queries
 Other Research Issues
Aurora/STREAM Overview
Synopses Output streams
Query Plans
Running Op
Ready Op
Applications register
Waiting Op x p continuous queries

s s x Users issue
continuous and
ad-hoc queries

Historical
Storage Administrator monitors
query execution and adjusts
run-time parameters
Input streams
Data Model
 Append-only
 Call records
 Updates
 Stock tickers
 Deletes
 Transactional data
 Meta-Data
 Control signals, punctuations

System Internals – probably need all above


Query Model
User/Application
Query Registration
Answer Availability
• Predefined
• One-time
• Ad-hoc • Event/timer based
• Predefined, inactive • Multiple-time, periodic
until invoked • Continuous (stored or
streamed)

Query Processor

Stream Access
• Arbitrary
• Weighted history
• Sliding window
DSMS
Example Queries
John May

Central Central
Office Office

Outgoing (call_ID, caller, time, event)


Incoming (call_ID, callee, time, event)

DSMS
event = start or end
Query 1 (self-join)
 Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time – O1.time > 2
AND O1.call_ID = O2.call_ID
AND O1.event = start
AND O2.event = end)
 Resultrequires unbounded storage
 Can provide result as data stream
 Can output after 2 min, without seeing end
Query 2 (join)
 Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
 Canstill provide result as data stream
 Requires unbounded temporary storage
Query 3 (group-by
aggregation)
 Total connection time for each caller
SELECT O1.caller, sum(O2.time – O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
AND O1.event = start
AND O2.event = end)
GROUP BY O1.caller
 Cannot provide result in (append-only) stream
 Output updates?
 Provide current value on demand?
Outline of this Talk
 An Overview of Streams
 Data and Query Model
 Approximation Queries
 Other Research Issues
Impact of Limited Memory
 Continuousstreams grow unboundedly
 Queries may require unbounded memory
 [ABBMW 02]
a priori memory bounds for query
 Conjunctive queries with arithmetic comparisons
 Impact of duplication elimination

 Open – general queries


Approximate Query Evaluation
 Why?
 Handling load – streams coming too fast
 Data stream is archived in a off-site data
warehouse, expensive access of archived
data
 Avoid unbounded storage and computation
 Ad hoc queries need approximate history
 Try to look at the data items only once and
in a fixed order
Approximate Query Evaluation
 How? Sliding windows, synopsis, samples, load-
shed
 Major Issues?
 Metric for set-valued queries
 Composition of approximate operators
 How is it understood/controlled by user?
 Integrate into query language
 Query planning and interaction with resource allocation
 Accuracy-efficiency-storage tradeoff and global metric
Synopses
 Queries may access or aggregate past data
 Need bounded-memory history-approximation
 Synopsis?
 Succinct summary of old stream tuples
 Like indexes/materialized-views, but base data is
unavailable
 Examples
 Sliding Windows
 Samples
 Sketches
 Histograms
 Wavelet representation
Sketching Techniques
 Self-Join Size Estimation
 Stream of values from D = {1,2,…,n}
 Let f = frequency of value i
i
 Consider S = Σ f 2, or Gini’s index of
i
homogeneity.
 Useful in parallel DB applications, error
estimation in query result size estimation and
access plan costs.
 Equivalent query: count (R |><| R)
D
Evaluating S = Σ fi2
 To update S, keep a counter fi for each value i
in the domain D  (n) space
 Has to be kept for each self-join
 Question – estimating S in sub-linear space?

(O(log n))
Self-Join Size Estimation
 AMS Technique (randomized sketches)
 Given (f ,f ,…,f )
1 2 N

Z = random{-1,1}
i

X = Σ fiZi (X incrementally computable)


 Theorem Exp[X2] = Σ fi2
 Cross-terms fiZi fjZj have 0 expectation
 Square-terms fiZi fiZi = fi2
 Space = log (N + Σ fi)
 Independent samples Xk reduce variance
Estimation Quality
 How can independent samples Xk improve
the quality of estimation?
 Keep s x s samples for X
1 2 k

s reduces variance, s2 boosts confidence


1
s1
Atomic Avg(X1j2)
Sketch
Avg(X2j2)
Median
s2 Avg(X ) 2
3j (sketch)
Avg(X4j2)
Avg(X5j2)
Sample Run of AMS
V = 3 6 2 5 7

Z1 = 1 1 -1 1 -1 Z2 = -1 1 -1 1 1

Σvi2 = 123 X1= 5, X12 = 25 X2= 14, X22 = 196 Est = 110.5

V = 4 6 2 5 7

Z1 = 1 1 -1 1 -1 Z2 = -1 1 -1 1 1

Σ vi2 = 130, X1= 6, X12 = 36, X2= 12, X22 = 144, Est = 90
Comments on AMS
 The self-join size can be computed on-line
 Sufficiently small variance (controlled by s 1 and s2)
 Can this method be extended to answer other
queries?
Complex Aggregate Queries
 A. Dobra et al. extend the idea of AMS to provide
approximate answers to complex aggregate queries.
 SELECT AGG FROM R1,R2,…,Rr where E
 AGG: COUNT/SUM/AVERAGE
 E: conjunction of (Ri.Aj = Rk.Al)
 It is proved that the error of these estimates is at
most ε with probability 1-δ.
Basic Notions of
Approximation
 For aggregate queries (e.g., SUM, COUNT), approximation
quality can be measured by relative error:
 (Estimated value – Actual value) / Actual value
 Open question: for queries involving more than simple
aggregation, how should we define approximation?
 Consider S |><|BT: (S: {A,B}, T: {B,C})

A B C A B C
10 20.5 Doctor 8 10.3 Lawyer
8 10.3 Lawyer 3 10.2 Teacher
3 10.2 Teacher

Actual Result Approximate Result


Basic Notions of
Approximation (2)
 Can we accept this kind of approximation?
A B C A B C
10 20.5 Doctor 11 21.6 Doctor
8 10.3 Lawyer 8 10.3 Student
3 10.2 Teacher 3 10.2 Teacher

Actual Result Approximate Result


Basic Notions of
Approximation (3)
 Can we provide useful (semantically correct) but stale results?

A B C A B C
10 20.5 Doctor 10 20.5 Doctor
8 10.3 Lawyer 8 10.3 Lawyer
3 10.2 Teacher
Approximate Result
Actual Result (at time t) (correct result at time t -
)
Outline of this Talk
 An Overview of Streams
 Data and Query Model
 Approximation Queries
 Other Research Issues
Data Mining
 High-Speed Stream Data Mining
 Association Rules
 Stream Clustering
 Decision Trees

 Single-pass algorithms for infering


interesting patterns on-line (as the data
stream arrives)
 Useful for mission-critical tasks like telecom
fraud detection
Conclusion: Future Work
 Query Processing
 Stream Algebra and Query Languages
 Approximations
 Blocking Operators, Constraints, Punctuations
 Runtime Management
 Scheduling, Memory Management, Rate Management
 Query Optimization (Adaptive, Multi-Query, Ad-hoc)
 Distributed processing
 Synopses and Algorithmic Problems
 Systems
 UI, statistics, crash recovery and transaction management
 System development and deployment
References
B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom.
Models and Issues in Data Stream Systems, PODS
’02. (Paper and Talk Slides)
A. Arasu, B. Babcock, S. Babu, J. McAlister, J. Widom.
Characterizing Memory Requirements for Queries
over Continuous Data Streams, PODS ’02.
A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi.
Processing Complex Aggregate Queries over Data
Streams, SIGMOD ’02.
N. Alon, Y. Matias, M. Szegedy. The Space
Complexity of Approximating the Frequency
Moments, STOC ’96.
Thank You!

You might also like