The document discusses the characteristics and challenges of processing data streams, which are continuous, fast-changing, and often infinite in volume. It highlights various applications of stream data, such as telecommunications and financial markets, and outlines methodologies for processing such data, including random sampling and sliding windows. Additionally, it addresses the need for multi-dimensional processing and the feasibility of implementing stream data cubes for efficient analysis.
Data Stream Unit4
Characteristics of Data Streams
= Data Streams
  = Data streams: continuous, ordered, changing, fast, huge amount
  = Traditional DBMS: data stored in finite, persistent data sets
= Characteristics
  = Huge volumes of continuous data, possibly infinite
  = Fast changing and requires fast, real-time response
  = Data stream captures nicely our data processing needs of today
  = Random access is expensive: single scan algorithm (can only have one look)
  = Store only the summary of the data seen thus far
  = Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing
April 25, 2020 Data Mining: Concepts and Techniques
Stream Data Applications
= Telecommunication calling records
= Business: credit card transaction flows
= Network monitoring and traffic engineering
= Financial market: stock exchange
= Engineering & industrial processes: power supply &
manufacturing
= Sensor, monitoring & surveillance: video streams, RFIDs
= Security monitoring
= Web logs and Web page click streams
= Massive data sets (even saved but random access is too
expensive)
Challenges of Stream Data Processing
= Multiple, continuous, rapid, time-varying, ordered streams
= Main memory computations
= Queries are often continuous
  = Evaluated continuously as stream data arrives
  = Answer updated over time
= Queries are often complex
  = Beyond element-at-a-time processing
  = Beyond stream-at-a-time processing
  = Beyond relational queries (scientific, data mining, OLAP)
= Multi-level/multi-dimensional processing and data mining
  = Most stream data are at low-level or multi-dimensional in nature
Processing Stream Queries
= Query types
  = One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
  = Predefined query vs. ad-hoc query (issued on-line)
= Unbounded memory requirements
  = For real-time response, main memory algorithm should be used
  = Memory requirement is unbounded if one will join future tuples
= Approximate query answering
  = With bounded memory, it is not always possible to produce exact answers
  = High-quality approximate answers are desired
  = Data reduction and synopsis construction methods
    = Sketches, random sampling, histograms, wavelets, etc.
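The difference between a one-time query and a continuous query can be illustrated with a running average. This is only a minimal sketch: the function names `one_time_average` and `continuous_average` are hypothetical, and the stream is assumed to be any Python iterable of numbers.

```python
from typing import Iterable, Iterator, Sequence


def one_time_average(snapshot: Sequence[float]) -> float:
    """One-time query: evaluated once over a stored, finite snapshot."""
    return sum(snapshot) / len(snapshot)


def continuous_average(stream: Iterable[float]) -> Iterator[float]:
    """Continuous query: the answer is updated as each value arrives.

    Only two numbers (count and total) are kept in memory, so the state
    stays bounded no matter how long the stream runs.
    """
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # updated answer after this arrival


if __name__ == "__main__":
    for answer in continuous_average([2, 4, 6]):
        print(answer)
```

The continuous version never stores the stream itself, which is the point of the main-memory, single-scan requirement on this slide.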
Methodologies for Stream Data Processing
= Major methods
  = Random sampling
  = Histograms
  = Sliding windows
  = Multi-resolution model
  = Sketches
  = Randomized algorithms
= Random sampling (but without knowing the total length in advance)
  = Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N, where N is the number of elements seen so far) of replacing an old element in the reservoir.
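Reservoir sampling as described above can be sketched as follows. This is one common formulation (often called Algorithm R), not necessarily the exact variant the slides had in mind; the function name `reservoir_sample` and the optional `rng` parameter are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], s: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform random sample of s elements from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):  # n = elements seen so far
        if n <= s:
            # The first s elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Pick a slot uniformly in [0, n); it falls inside the
            # reservoir with probability s/n, in which case the new
            # element replaces the old one stored there.
            j = rng.randrange(n)
            if j < s:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5))
```

The stream is scanned once and only s elements are ever held in memory, which matches the single-scan constraint stated earlier.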
Stream Data Processing Methods (1)
= Sliding windows
  = Make decisions based only on recent data of sliding window size w
  = An element arriving at time t expires at time t + w
= Histograms
  = Approximate the frequency distribution of element values in a stream
  = Partition data into a set of contiguous buckets
= Multi-resolution models
  = Popular models: balanced binary trees, micro-clusters, and wavelets
= Sketches
  = Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
  = Frequency moments of a stream A = {a_1, ..., a_N}: $F_k = \sum_{i=1}^{v} m_i^k$, where v is the universe (domain) size and m_i is the frequency of value i in the sequence
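The sliding-window rule on this slide (an element arriving at time t expires at time t + w) can be sketched with a queue. The class name `SlidingWindowSum` is hypothetical, and the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowSum:
    """Maintain the sum of values that arrived within the last w time units."""

    def __init__(self, w: float) -> None:
        self.w = w
        self.items: Deque[Tuple[float, float]] = deque()  # (arrival time, value)
        self.total = 0.0

    def expire(self, now: float) -> None:
        # An element that arrived at time t expires at time t + w.
        while self.items and self.items[0][0] + self.w <= now:
            _, old_value = self.items.popleft()
            self.total -= old_value

    def add(self, t: float, value: float) -> None:
        # Assumes t is not smaller than any earlier timestamp.
        self.expire(t)
        self.items.append((t, value))
        self.total += value


if __name__ == "__main__":
    window = SlidingWindowSum(w=10)
    for t, v in [(0, 1.0), (5, 2.0), (10, 4.0), (16, 1.0)]:
        window.add(t, v)
        print(t, window.total)
```

Memory use here is proportional to the number of elements inside the window, so this plain version only suits windows that fit in main memory.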
Challenges for Mining Dynamics in Data Streams
= Most stream data are at a pretty low level or multi-dimensional in nature: they need multi-level/multi-dimensional (ML/MD) processing
= Analysis requirements
  = Multi-dimensional trends and unusual patterns
  = Capturing important changes at multi-dimensions/levels
  = Fast, real-time detection and response
  = Comparing with data cube: similarities and differences
= Stream (data) cube or stream OLAP: Is this feasible?
  = Can we implement it efficiently?
A Stream Cube Architecture
= A tilted time frame
  = Different time granularities
    = second, minute, quarter, hour, day, week, ...
= Critical layers
  = Minimum interest layer (m-layer)
  = Observation layer (o-layer)
  = User: watches at the o-layer and occasionally needs to drill down to the m-layer
= Partial materialization of stream cubes
  = Full materialization: too space and time consuming
  = No materialization: slow response at query time
  = Partial materialization: what do we mean by "partial"?
A Tilted Time Model
= Natural tilted time frame:
  = Example: Minimal unit: quarter, then 4 quarters → 1 hour, 24 hours → 1 day, ...
  [Figure: time axis divided into 12 months, 31 days, 24 hours, 4 quarters]
= Logarithmic tilted time frame:
  = Example: Minimal unit: 1 minute, then 1, 2, 4, 8, 16, 32, ...
  [Figure: time axis divided into slots of 64t, 32t, 16t, 8t, 4t, 2t, t, t]
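The saving from a tilted time frame can be worked out with a little arithmetic. The sketch below counts storage units for the two frames on this slide; the function names and the choice of a 366-day year for the full-resolution comparison are assumptions made for illustration.

```python
from typing import List


def natural_frame_units() -> int:
    """Units kept by the natural frame in the figure:
    4 quarters + 24 hours + 31 days + 12 months."""
    return 4 + 24 + 31 + 12


def full_resolution_units(days: int = 366) -> int:
    """Units needed if every quarter of an hour were kept for `days` days."""
    return days * 24 * 4


def logarithmic_frame_sizes(span: int) -> List[int]:
    """Slot sizes t, t, 2t, 4t, ... (most recent first) needed to cover
    `span` minimal time units."""
    sizes = [1]
    covered = 1
    while covered < span:
        sizes.append(covered)  # next slot is as long as everything so far
        covered *= 2
    return sizes


if __name__ == "__main__":
    print(natural_frame_units(), "vs", full_resolution_units())
    print(logarithmic_frame_sizes(128))
    print(len(logarithmic_frame_sizes(365 * 24 * 60)), "slots for a year of minutes")
```

Under these assumptions the natural frame keeps 71 units instead of 35,136, and the logarithmic frame covers a year at one-minute precision with about 21 slots.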
A Tilted Time Model (2)
= Pyramidal tilted time frame:
  = Example: Suppose there are 5 frames and each takes maximal 3 snapshots
  = Given a snapshot number N, if N mod 2^d = 0, insert into the frame number d. If there are more than 3 snapshots, "kick out" the oldest one.
  [Table: snapshots (by clock time) per frame; only one row survives: 69 67 65]
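The pyramidal rule can be sketched as below. Two points are assumptions rather than something stated on the slide: each snapshot is placed in the frame with the largest d such that 2^d divides N, and d is capped at the last frame (frames are numbered 0 to 4 for "5 frames"). The function name `pyramidal_frames` is hypothetical.

```python
from typing import Dict, List


def pyramidal_frames(num_snapshots: int, num_frames: int = 5,
                     capacity: int = 3) -> Dict[int, List[int]]:
    """Distribute snapshots 1..num_snapshots over a pyramidal tilted time frame.

    Each frame lists its snapshots newest first and holds at most
    `capacity` of them.
    """
    frames: Dict[int, List[int]] = {d: [] for d in range(num_frames)}
    for n in range(1, num_snapshots + 1):
        # Assumed reading of the rule: largest d with n mod 2^d == 0,
        # capped at the last frame.
        d = 0
        while d < num_frames - 1 and n % (2 ** (d + 1)) == 0:
            d += 1
        frame = frames[d]
        frame.insert(0, n)          # newest snapshot goes first
        if len(frame) > capacity:
            frame.pop()             # "kick out" the oldest one
    return frames


if __name__ == "__main__":
    for d, snapshots in pyramidal_frames(70).items():
        print(d, snapshots)
```

After 70 snapshots, frame 0 holds 69, 67, 65 under this reading, which agrees with the one table row that survives on the slide.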
* Infinite length: impractical to store and use all historical data
  * Requires infinite storage
  * Running time
* Concept-drift: as concepts drift, new features may appear
* Concept-evolution: a new type of class normally holds a new set of features
* Feature-evolution: new features involved
  * Infinite data streams
    * Normally, the global feature set is unknown
    * New features may appear

* Uses past labeled data to build a classification model
* Predicts the labels of future instances using the model
* Helps decision making

Sketches
* Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
* Frequency moments of a stream A = {a_1, ..., a_N}: $F_k = \sum_{i=1}^{v} m_i^k$, where v is the universe or domain size and m_i is the frequency of i in the sequence
* Given N elements and v values, sketches can approximate F_0, F_1, F_2 in O(log v + log N) space
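To make the frequency moments concrete, the sketch below computes F_k exactly from counts and also estimates F_2 in a single pass with an AMS-style ("tug-of-war") estimator. Several simplifications are assumptions of this illustration: the random signs come from a salted BLAKE2 hash rather than a true 4-wise independent hash family, the estimate is a plain average of the counters rather than a median of means, and the names `exact_moment` and `F2Sketch` are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Hashable, Iterable, List


def exact_moment(stream: Iterable[Hashable], k: int) -> int:
    """F_k = sum over distinct values i of m_i ** k (stores every count)."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


def _sign(item: Hashable, j: int) -> int:
    """Pseudo-random +1/-1 for (item, counter j); stands in for a hash family."""
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8,
                             salt=j.to_bytes(8, "little")).digest()
    return 1 if digest[0] & 1 else -1


class F2Sketch:
    """Single-pass estimator of F_2 using a fixed number of counters."""

    def __init__(self, num_counters: int = 256) -> None:
        self.z: List[int] = [0] * num_counters

    def update(self, item: Hashable) -> None:
        # Each counter adds +1 or -1 depending on the item's sign.
        for j in range(len(self.z)):
            self.z[j] += _sign(item, j)

    def estimate(self) -> float:
        # E[z_j ** 2] equals F_2, so averaging the squares estimates it.
        return sum(z * z for z in self.z) / len(self.z)


if __name__ == "__main__":
    data = list("aaabbc")
    sketch = F2Sketch()
    for x in data:
        sketch.update(x)
    print(exact_moment(data, 2), sketch.estimate())
```

The sketch keeps only the counters, independent of the number of distinct values, whereas the exact computation needs one count per distinct value.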
* Randomized algorithms
  * Monte Carlo algorithm: bound on running time but may not return the correct result
  * Chebyshev's inequality:
    * Let X be a random variable with mean μ and standard deviation σ; then $P(|X - \mu| > k) \le \sigma^2 / k^2$ for any k > 0
  * Chernoff bound:
    * Let X be the sum of independent Poisson trials X_1, ..., X_n, δ in (0, 1)
    * The probability decreases exponentially as we move from the mean

* Most stream data are at pretty low-level or multi-dimensional in nature: needs ML/MD processing
* Analysis requirements
* Multi-dimensional trends and unusual patterns
* Capturing important changes at multi-dimensions/levels
* Fast, real-time detection and response
* Comparing with data cube: Similarity and differences
* Stream (data) cube or stream OLAP: Is this feasible?
* Can we implement it efficiently?

* Analysis of Web click streams
  * Raw data at low levels: seconds, web page addresses, user IP addresses, ...
  * Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
  * E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."
* Analysis of power consumption streams
  * Raw data: power consumption flow for every household, every minute
  * Patterns one may find: average hourly power consumption surges up 30% for manufacturing companies in Chicago in the last 2 hours today compared with the same day a week ago

Streaming Data
* What is Streaming Data
[Figure: stream processing engine and its output]

Streaming Data
1) Manage small time window
2) Unbounded data
3) Data not correlated
4) Data can only be timestamped or geo-tagged

Stream Model
Streams may be archived in a large archival store, but we
assume it is not possible to answer queries from the
archival store.
It could be examined only under special circumstances
using time-consuming retrieval processes.
There is also a working store, into which summaries or
parts of streams may be placed, and which can be used for
answering queries.
The working store might be disk, or it might be main
memory, depending on how fast we need to process
queries. But either way, it is of sufficiently limited capacity
that it cannot store all the data from all the streams.

Data Streams
* In many data mining situations, we do not know
the entire data set in advance
* Stream Management is important when the
input rate is controlled externally:
— Google queries
— Twitter or Facebook status updates
* We can think of the data as infinite and
non-stationary (the distribution changes
over time)
Data Strear