Data Stream Unit4

The document discusses the characteristics and challenges of processing data streams, which are continuous, fast-changing, and often infinite in volume. It highlights various applications of stream data, such as telecommunications and financial markets, and outlines methodologies for processing such data, including random sampling and sliding windows. Additionally, it addresses the need for multi-dimensional processing and the feasibility of implementing stream data cubes for efficient analysis.
Characteristics of Data Streams
* Data streams
  - Data streams: continuous, ordered, changing, fast, huge amount
  - Traditional DBMS: data stored in finite, persistent data sets
* Characteristics
  - Huge volumes of continuous data, possibly infinite
  - Fast changing and requires fast, real-time response
  - Data stream captures nicely our data processing needs of today
  - Random access is expensive: single scan algorithm (can only have one look)
  - Store only the summary of the data seen thus far
  - Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing

April 25, 2020 | Data Mining: Concepts and Techniques

Stream Data Applications
* Telecommunication calling records
* Business: credit card transaction flows
* Network monitoring and traffic engineering
* Financial market: stock exchange
* Engineering & industrial processes: power supply & manufacturing
* Sensor, monitoring & surveillance: video streams, RFIDs
* Security monitoring
* Web logs and Web page click streams
* Massive data sets (even saved but random access is too expensive)

Challenges of Stream Data Processing
* Multiple, continuous, rapid, time-varying, ordered streams
* Main memory computations
* Queries are often continuous
  - Evaluated continuously as stream data arrives
  - Answer updated over time
* Queries are often complex
  - Beyond element-at-a-time processing
  - Beyond stream-at-a-time processing
  - Beyond relational queries (scientific, data mining, OLAP)
* Multi-level/multi-dimensional processing and data mining
  - Most stream data are at low-level or multi-dimensional in nature

Processing Stream Queries
* Query types
  - One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
  - Predefined query vs.
    ad-hoc query (issued on-line)
* Unbounded memory requirements
  - For real-time response, main memory algorithm should be used
  - Memory requirement is unbounded if one will join future tuples
* Approximate query answering
  - With bounded memory, it is not always possible to produce exact answers
  - High-quality approximate answers are desired
  - Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing
* Major methods
  - Random sampling
  - Histograms
  - Sliding windows
  - Multi-resolution model
  - Sketches
  - Randomized algorithms
* Random sampling (but without knowing the total length in advance)
  - Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.

Stream Data Processing Methods (1)
* Sliding windows
  - Make decisions based only on recent data of sliding window size w
  - An element arriving at time t expires at time t + w
* Histograms
  - Approximate the frequency distribution of element values in a stream
  - Partition data into a set of contiguous buckets
* Multi-resolution models
  - Popular models: balanced binary trees, micro-clusters, and wavelets
* Sketches
  - Histograms and wavelets require multi-passes over the data but sketches can operate in a single pass
  - Frequency moments of a stream A = {a_1, ..., a_N}: $F_k = \sum_{i=1}^{v} m_i^k$ (v: the universe or domain size; m_i: the frequency of i in the sequence)

Challenges for Mining Dynamics in Data Streams
* Most stream data are at pretty low-level or multi-dimensional in nature: needs ML/MD processing
* Analysis requirements
  - Multi-dimensional trends and unusual patterns
  - Capturing important changes at multi-dimensions/levels
  - Fast, real-time detection and response
  - Comparing with data cube: similarity and differences
* Stream (data) cube or stream OLAP: Is this feasible?
  - Can we implement it efficiently?

A Stream Cube Architecture
* A tilted time frame
  - Different time granularities: second, minute, quarter, hour, day, week, ...
* Critical layers
  - Minimum interest layer (m-layer)
  - Observation layer (o-layer)
  - User: watches at o-layer and occasionally needs to drill down to m-layer
* Partial materialization of stream cubes
  - Full materialization: too space and time consuming
  - No materialization: slow response at query time
  - Partial materialization: what do we mean by "partial"?

A Tilted Time Model
* Natural tilted time frame:
  - Example: Minimal: quarter, then 4 quarters -> 1 hour, 24 hours -> day, ...
  - [Figure: time axis divided into 12 months, 31 days, 24 hours, 4 qtrs]
* Logarithmic tilted time frame:
  - Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, ...
  - [Figure: time axis divided into slots 64t, 32t, 16t, 8t, 4t, 2t, t, t]

A Tilted Time Model (2)
* Pyramidal tilted time frame:
  - Example: Suppose there are 5 frames and each takes maximal 3 snapshots
  - Given a snapshot number N, if N mod 2^d = 0, insert into the frame number d. If there are more than 3 snapshots, "kick out" the oldest one.
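
The pyramidal insertion rule above can be sketched in Python. This is only an illustration under stated assumptions, not the slides' own implementation: the names `frame_number` and `PyramidalTimeFrame` are hypothetical, frames are assumed to be numbered 0 through `max_frame` (5 here), and when N is divisible by several powers of two the snapshot is assumed to go into the largest qualifying frame only.

```python
from collections import deque


def frame_number(n: int, max_frame: int) -> int:
    """Return the largest d <= max_frame with n mod 2**d == 0 (n must be >= 1)."""
    d = 0
    while d < max_frame and n % (2 ** (d + 1)) == 0:
        d += 1
    return d


class PyramidalTimeFrame:
    """Keeps at most `capacity` snapshots per frame, dropping the oldest first."""

    def __init__(self, max_frame: int = 5, capacity: int = 3):
        self.max_frame = max_frame
        # deque(maxlen=...) discards the oldest entry automatically on overflow.
        self.frames = [deque(maxlen=capacity) for _ in range(max_frame + 1)]

    def insert(self, n: int) -> None:
        """Store snapshot number n in the frame chosen by frame_number()."""
        self.frames[frame_number(n, self.max_frame)].append(n)

    def snapshots(self):
        """Return one list per frame, newest snapshot first."""
        return [list(reversed(frame)) for frame in self.frames]


ptf = PyramidalTimeFrame()
for n in range(1, 71):  # snapshot numbers 1..70, one per clock tick
    ptf.insert(n)
table = ptf.snapshots()
for d, row in enumerate(table):
    print(d, row)
```

Under these assumptions, recent snapshots are kept at fine granularity in the low-numbered frames while older history survives only at coarser spacing in the high-numbered frames, which is the point of a tilted time frame.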
[Table from the pyramidal tilted time frame example: snapshots (by clock time) per frame number; only the entries 69, 67, 65 survive]

* Infinite length: impractical to store and use all historical data
  - Requires infinite storage
  - Running time
* Concept-drift: as concepts drift, new features may appear
* Concept-evolution: a new type of class normally holds a new set of features
* Feature-evolution: involves new features
  - Infinite data streams
    - Normally, global feature set is unknown
    - New features may appear

* Uses past labeled data to build classification model
* Predicts the labels of future instances using the model
* Helps decision making

* Sketches
  - Histograms and wavelets require multi-passes over the data, but sketches can operate in a single pass
  - Frequency moments of a stream A = {a_1, ..., a_N}: $F_k = \sum_{i=1}^{v} m_i^k$, where v is the universe or domain size and m_i is the frequency of i in the sequence
  - Given N elements and v values, sketches can approximate F_0, F_1, F_2 in O(log v + log N) space
* Randomized algorithms
  - Monte Carlo algorithm: bound on running time but may not return correct result
  - Chebyshev's inequality:
    - Let X be a random variable with mean μ and standard deviation σ
  - Chernoff bound:
    - Let X be the sum of independent Poisson trials X_1, ..., X_n, δ in (0, 1)
    - The probability decreases exponentially as we move from the mean

* Analysis of Web click streams
  - Raw data at low levels: seconds, web page addresses, user IP addresses, ...
  - Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
  - E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."
* Analysis of power consumption streams
  - Raw data: power consumption flow for every household, every minute
  - Patterns one may find: average hourly power consumption for manufacturing companies in Chicago over the last 2 hours today is 30% higher than on the same day a week ago

Streaming Data
* What is Streaming Data
  - [Figure labels: Streaming Data; Stream Processing Engine; Output]

Streaming Data
1) Manage small time window
2) Unbounded data
3) Data not correlated
4) Data can only be timestamped or geo-tagged

Stream Model
Streams may be archived in a large archival store, but we assume it is not possible to answer queries from the archival store. It could be examined only under special circumstances using time-consuming retrieval processes. There is also a working store, into which summaries or parts of streams may be placed, and which can be used for answering queries. The working store might be disk, or it might be main memory, depending on how fast we need to process queries. But either way, it is of sufficiently limited capacity that it cannot store all the data from all the streams.

Data Streams
* In many data mining situations, we do not know the entire data set in advance
* Stream management is important when the input rate is controlled externally:
  - Google queries
  - Twitter or Facebook status updates
* We can think of the data as infinite and non-stationary (the distribution changes over time)
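
Reservoir sampling, described in the methodologies slide earlier, fits the setting where the stream length is not known in advance. The following is a minimal Python sketch of the classic single-pass scheme, assuming a reservoir of size `s`; the function name `reservoir_sample` and the optional `rng` parameter are illustrative choices, not taken from the slides.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of up to s items from an iterable stream."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            # The first s elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # The n-th element replaces a random slot with probability s/n.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < s:
                reservoir[j] = item
    return reservoir


# Example: sample 10 items from a stream of 1000 values in a single pass.
sample = reservoir_sample(range(1, 1001), s=10, rng=random.Random(42))
print(sample)
```

Only the `s` reservoir slots and a counter are kept in memory, so the space used does not grow with the stream.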

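The frequency moments defined in the sketches slide can be made concrete with a small exact computation. The Python sketch below is not a sketch data structure in the slides' sense: it stores one counter per distinct value, which is exactly the memory cost that sketches are meant to avoid, and it is shown only to illustrate what F_0, F_1 and F_2 measure. The function name `frequency_moment` and the toy stream are hypothetical.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of m_i**k over the values that occur."""
    counts = Counter(stream)  # m_i for every value i seen at least once
    return sum(m ** k for m in counts.values())


stream = ["a", "b", "a", "c", "a", "b"]  # toy stream; frequencies a:3, b:2, c:1
f0 = frequency_moment(stream, 0)  # number of distinct values
f1 = frequency_moment(stream, 1)  # stream length N
f2 = frequency_moment(stream, 2)  # sum of squared frequencies
print(f0, f1, f2)  # prints: 3 6 14
```

F_2 grows when a few values dominate the stream, which is why it is often used as a measure of skew.
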