Data Stream Management
santanoo
Based on slides by B. Babcock et al., “Models and Issues in Data Stream Systems”,
PODS’02.
Outline of this Talk
An Overview of Streams
Data and Query Models
Approximation Queries
Other Research Issues
Data Streams
Traditional DBMS – data stored in finite,
persistent data sets
New Applications – data input as continuous,
ordered data streams
A data stream as a growing relational table of
potentially infinite size
Using Traditional Database
[Figure labels: User/Application, Query, Result, Loader]
New Approach for Data Streams
[Figure labels: User/Application, Register Query, Results, Stream Query Processor, Scratch Space (Memory and/or Disk), Data Stream Management System (DSMS)]
Sample Applications
Network management and traffic engineering
(e.g., Sprint)
Streams of measurements and packet traces
Queries: detect anomalies, adjust routing
Motivation
Need for general-purpose DSMS?
Not ad-hoc, application-specific systems?
Non-Trivial
DSMS = merely DBMS with enhanced support for
triggers, temporal constructs, data rate mgmt?
DBMS versus DSMS
DBMS                          DSMS
Persistent relations          Transient streams
One-time queries              Continuous queries
Random access                 Sequential access
“Unbounded” disk store        Bounded main memory
Only current state matters    History/arrival-order is critical
Relatively low update rate    Possibly multi-GB arrival rate
No real-time services         Real-time requirements
Assume precise data           Data stale/imprecise
Aurora/STREAM Overview
[Figure labels: Input streams, Query Plans (Running Op, Ready Op, Waiting Op), Synopses, Historical Storage, Output streams]
Applications register continuous queries
Users issue continuous and ad-hoc queries
Administrator monitors query execution and adjusts run-time parameters
Data Model
Append-only: call records
Updates: stock tickers
Deletes: transactional data
Meta-data: control signals, punctuations
Stream access by the query processor of the DSMS:
• Arbitrary
• Weighted history
• Sliding window
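Of the stream access modes listed above, the sliding window is the easiest to sketch. The following is a minimal illustration only, assuming a count-based window (the last `size` tuples) and a SUM aggregate; the function name `sliding_sum` and its parameters are hypothetical and do not come from the slides.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_sum(stream: Iterable[float], size: int) -> Iterator[float]:
    """Yield the sum of the most recent `size` values after each arrival."""
    window: Deque[float] = deque()  # holds at most `size` values
    total = 0.0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > size:
            # The oldest tuple falls out of the window; undo its contribution.
            total -= window.popleft()
        yield total


sums = list(sliding_sum([1, 2, 3, 4], size=2))
```

Memory stays bounded by the window size no matter how long the stream runs, which is why windowed access suits a bounded-memory DSMS.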
Example Queries
[Figure labels: John, May, Central Office (twice), DSMS]
event = start or end
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time - O1.time > 2
AND O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
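The last point can be sketched in code. This is a rough illustration under stated assumptions, not the evaluation strategy of any particular DSMS: each Outgoing tuple is assumed to look like `(call_ID, caller, time, event)` with `time` in minutes, and tuples are assumed to arrive in time order. The name `long_calls` is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed tuple layout: (call_ID, caller, time in minutes, event)
Event = Tuple[int, str, float, str]


def long_calls(outgoing: Iterable[Event],
               threshold: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (call_ID, caller) as soon as a call is known to exceed `threshold`."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start time)
    for call_id, caller, time, event in outgoing:
        # Stream time has advanced to `time`. Any call that has been open for
        # longer than the threshold qualifies, even though its end tuple has
        # not been seen yet.
        overdue = [c for c, (_, t0) in open_calls.items() if time - t0 > threshold]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Short call, or a long call that was already reported: drop state.
            open_calls.pop(call_id, None)


events = [
    (1, "John", 0.0, "start"),
    (2, "Ann", 0.5, "start"),
    (2, "Ann", 1.0, "end"),
    (3, "Bob", 3.0, "start"),  # at time 3.0, call 1 has been open for 3 minutes
    (3, "Bob", 4.0, "end"),
    (1, "John", 5.0, "end"),
]
result = list(long_calls(events))
```

Call 1 is reported when the tuple at time 3.0 arrives, two tuples before its own end event. The only state kept is the set of currently open calls.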
Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage
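One way to see where the temporary storage goes is a symmetric hash join, sketched below. This is an illustration under assumptions, not the slides' own algorithm: the two streams are assumed to be merged into one sequence of `(stream, call_ID, party)` tuples, where `stream` is `"O"` (party is the caller) or `"I"` (party is the callee), with one tuple per call on each stream. The name `pair_up` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def pair_up(events: Iterable[Tuple[str, int, str]]) -> Iterator[Tuple[str, str]]:
    """Emit (caller, callee) pairs incrementally as matching tuples arrive."""
    callers: Dict[int, List[str]] = defaultdict(list)  # state for Outgoing
    callees: Dict[int, List[str]] = defaultdict(list)  # state for Incoming
    for stream, call_id, party in events:
        if stream == "O":
            callers[call_id].append(party)
            # Probe the other side's state for partners seen so far.
            for callee in callees.get(call_id, []):
                yield party, callee
        else:
            callees[call_id].append(party)
            for caller in callers.get(call_id, []):
                yield caller, party


pairs = list(pair_up([
    ("O", 1, "John"),
    ("I", 2, "Sue"),
    ("I", 1, "May"),
    ("O", 2, "Bob"),
]))
```

Each result is appended to the output as soon as both halves are present, so the answer is a stream. Neither hash table can be pruned without extra knowledge (for example, a bound on how late the matching tuple may arrive), so the state grows with the input.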
Query 3 (group-by aggregation)
Total connection time for each caller
SELECT O1.caller, sum(O2.time - O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID = O2.call_ID
AND O1.event = 'start'
AND O2.event = 'end')
GROUP BY O1.caller
Cannot provide result in (append-only) stream
Output updates?
Provide current value on demand?
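The "output updates" option can be sketched as follows. This is a minimal illustration, assuming the same hypothetical tuple layout `(call_ID, caller, time, event)` as above and assuming each emitted pair supersedes the previous total for that caller; `connection_time_updates` is an invented name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Assumed tuple layout: (call_ID, caller, time, event)
Event = Tuple[int, str, float, str]


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit (caller, new total) each time a call ends; later pairs replace earlier ones."""
    starts: Dict[int, float] = {}                    # call_ID -> start time
    totals: Dict[str, float] = defaultdict(float)    # caller -> running total
    for call_id, caller, time, event in outgoing:
        if event == "start":
            starts[call_id] = time
        elif event == "end" and call_id in starts:
            totals[caller] += time - starts.pop(call_id)
            yield caller, totals[caller]


updates = list(connection_time_updates([
    (1, "John", 0.0, "start"),
    (1, "John", 3.0, "end"),
    (2, "John", 5.0, "start"),
    (2, "John", 6.5, "end"),
]))
```

The output is a stream of updates rather than an append-only result: the second pair for John invalidates the first. Serving the current value on demand would instead mean keeping `totals` and reading it when asked.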
Impact of Limited Memory
Continuous streams grow unboundedly
Queries may require unbounded memory
[ABBMW 02]
a priori memory bounds for query
Conjunctive queries with arithmetic comparisons
Impact of duplicate elimination
(O(log n))
Self-Join Size Estimation
AMS Technique (randomized sketches)
Given (f_1, f_2, …, f_N)
Z_i = random{-1, 1}
Z_1 = (1, 1, -1, 1, -1), Z_2 = (-1, 1, -1, 1, 1)
Σ v_i^2 = 123; X_1 = 5, X_1^2 = 25; X_2 = 14, X_2^2 = 196; Est = 110.5
V = (4, 6, 2, 5, 7)
Z_1 = (1, 1, -1, 1, -1), Z_2 = (-1, 1, -1, 1, 1)
Σ v_i^2 = 130; X_1 = 6, X_1^2 = 36; X_2 = 12, X_2^2 = 144; Est = 90
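The second worked example, V = (4, 6, 2, 5, 7), can be reproduced with a few lines of code. This is a sketch only: it takes the two sign vectors from the example as given instead of drawing them at random, and it simply averages the squared counters, where the full AMS technique combines many counters by medians of means. The function names are hypothetical.

```python
from typing import Iterable, List, Sequence


def ams_estimate(v: Sequence[int], signs: Sequence[Sequence[int]]) -> float:
    """Average of X_k^2, where X_k = sum_i Z_k[i] * v[i]."""
    xs = [sum(z_i * v_i for z_i, v_i in zip(z, v)) for z in signs]
    return sum(x * x for x in xs) / len(xs)


def ams_online(stream: Iterable[int], signs: Sequence[Sequence[int]]) -> float:
    """Same estimate, maintained one arrival at a time (value i adds Z_k[i] to X_k)."""
    xs: List[int] = [0] * len(signs)
    for i in stream:
        for k, z in enumerate(signs):
            xs[k] += z[i]
    return sum(x * x for x in xs) / len(xs)


V = [4, 6, 2, 5, 7]          # frequency of each of the 5 distinct values
Z1 = [1, 1, -1, 1, -1]
Z2 = [-1, 1, -1, 1, 1]

exact = sum(v_i * v_i for v_i in V)        # true self-join size
estimate = ams_estimate(V, [Z1, Z2])       # sketch-based estimate

# A stream in which value i occurs V[i] times gives the same counters.
stream = [i for i, f in enumerate(V) for _ in range(f)]
online = ams_online(stream, [Z1, Z2])
```

Here `exact` is 130 and both `estimate` and `online` are 90.0, matching the figures above. The online version stores only one counter per sign vector, independent of the stream length.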
Comments on AMS
The self-join size can be computed on-line
Sufficiently small variance (controlled by s1 and s2)
Can this method be extended to answer other
queries?
Complex Aggregate Queries
A. Dobra et al. extend the idea of AMS to provide
approximate answers to complex aggregate queries.
SELECT AGG FROM R1, R2, …, Rr WHERE E
AGG: COUNT/SUM/AVERAGE
E: conjunction of (Ri.Aj = Rk.Al)
It is proved that the error of these estimates is at
most ε with probability at least 1-δ.
Basic Notions of Approximation
For aggregate queries (e.g., SUM, COUNT), approximation
quality can be measured by relative error:
(Estimated value – Actual value) / Actual value
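As a small worked instance of this formula, the sketch below plugs in the self-join example from earlier in the talk (estimate 90, actual value 130). The helper name `relative_error` is hypothetical.

```python
def relative_error(estimated: float, actual: float) -> float:
    """Signed relative error: (estimated - actual) / actual."""
    return (estimated - actual) / actual


err = relative_error(90, 130)  # -40 / 130, i.e. an underestimate of about 31%
```

The sign shows the direction of the error; take the absolute value when only its magnitude matters.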
Open question: for queries involving more than simple
aggregation, how should we define approximation?
Consider S ⋈_B T, with S: {A, B} and T: {B, C}

A    B      C                  A    B      C
10   20.5   Doctor             8    10.3   Lawyer
8    10.3   Lawyer             3    10.2   Teacher
3    10.2   Teacher

A    B      C                  A    B      C
10   20.5   Doctor             10   20.5   Doctor
8    10.3   Lawyer             8    10.3   Lawyer
3    10.2   Teacher

Actual Result (at time t)      Approximate Result (correct result at time t - )
Data Mining
High-Speed Stream Data Mining
Association Rules
Stream Clustering
Decision Trees