Data Stream Unit4

The document discusses the characteristics and challenges of processing data streams, which are continuous, fast-changing, and often infinite in volume. It highlights various applications of stream data, such as telecommunications and financial markets, and outlines methodologies for processing such data, including random sampling and sliding windows. Additionally, it addresses the need for multi-dimensional processing and the feasibility of implementing stream data cubes for efficient analysis.

Uploaded by

tiyasachowdhury473


Characteristics of Data Streams
* Data streams
  * Data streams: continuous, ordered, changing, fast, huge amount
  * Traditional DBMS: data stored in finite, persistent data sets
* Characteristics
  * Huge volumes of continuous data, possibly infinite
  * Fast changing; requires fast, real-time response
  * Data streams capture our present-day data processing needs well
  * Random access is expensive: single-scan algorithms (only one look at the data)
  * Store only a summary of the data seen thus far
  * Most stream data are low-level or multi-dimensional in nature and need multi-level and multi-dimensional processing

April 25, 2020 Data Mining: Concepts and Techniques

Stream Data Applications
* Telecommunication calling records
* Business: credit card transaction flows
* Network monitoring and traffic engineering
* Financial market: stock exchange
* Engineering & industrial processes: power supply & manufacturing
* Sensor, monitoring & surveillance: video streams, RFIDs
* Security monitoring
* Web logs and Web page click streams
* Massive data sets (even when saved, random access is too expensive)

Challenges of Stream Data Processing
* Multiple, continuous, rapid, time-varying, ordered streams
* Main memory computations
* Queries are often continuous
  * Evaluated continuously as stream data arrives
  * Answer updated over time
* Queries are often complex
  * Beyond element-at-a-time processing
  * Beyond stream-at-a-time processing
  * Beyond relational queries (scientific, data mining, OLAP)
* Multi-level/multi-dimensional processing and data mining
  * Most stream data are low-level or multi-dimensional in nature

Processing Stream Queries
* Query types
  * One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  * Predefined query vs. ad-hoc query (issued on-line)
* Unbounded memory requirements
  * For real-time response, a main memory algorithm should be used
  * The memory requirement is unbounded if a query must join with future tuples
* Approximate query answering
  * With bounded memory, it is not always possible to produce exact answers
  * High-quality approximate answers are desired
  * Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing
* Major methods
  * Random sampling
  * Histograms
  * Sliding windows
  * Multi-resolution model
  * Sketches
  * Randomized algorithms
* Random sampling (without knowing the total length in advance)
  * Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.

Stream Data Processing Methods (1)
* Sliding windows
  * Make decisions based only on recent data within a sliding window of size w
  * An element arriving at time t expires at time t + w
* Histograms
  * Approximate the frequency distribution of element values in a stream
  * Partition data into a set of contiguous buckets
* Multi-resolution models
  * Popular models: balanced binary trees, micro-clusters, and wavelets
* Sketches
  * Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
  * Frequency moments of a stream $A = \{a_1, \ldots, a_N\}$: $F_k = \sum_{i=1}^{v} m_i^k$

Challenges for Mining Dynamics in Data Streams
* Most stream data are low-level or multi-dimensional in nature: needs ML/MD processing
* Analysis requirements
  * Multi-dimensional trends and unusual patterns
  * Capturing important changes at multiple dimensions/levels
  * Fast, real-time detection and response
  * Comparing with data cube: similarity and differences
* Stream (data) cube or stream OLAP: Is this feasible?
  * Can we implement it efficiently?

A Stream Cube Architecture
* A tilted time frame
  * Different time granularities: second, minute, quarter, hour, day, week, ...
* Critical layers
  * Minimum interest layer (m-layer)
  * Observation layer (o-layer)
  * User: watches at the o-layer and occasionally needs to drill down to the m-layer
* Partial materialization of stream cubes
  * Full materialization: too space and time consuming
  * No materialization: slow response at query time
  * Partial materialization: what do we mean by "partial"?

A Tilted Time Model
* Natural tilted time frame:
  * Example: Minimal: quarter, then 4 quarters > 1 hour, 24 hours > day, ...
  * [Figure: time axis marked 12 months, 31 days, 24 hours, 4 qtrs]
* Logarithmic tilted time frame:
  * Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, ...
  * [Figure: time axis marked 64t, 32t, 16t, 8t, 4t, 2t, t, t]

A Tilted Time Model (2)
* Pyramidal tilted time frame:
  * Example: Suppose there are 5 frames and each takes at most 3 snapshots
  * Given a snapshot number N, if $N \bmod 2^d = 0$, insert it into frame number d. If there are more than 3 snapshots, "kick out" the oldest one.
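The pyramidal insertion rule above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the slides' own implementation: it reads the insertion condition as N mod 2^d = 0, numbers the frames 0 to 4, and stores each snapshot only in the highest-numbered frame whose condition it satisfies. The names `frame_index` and `pyramidal_frames` are hypothetical.

```python
from collections import deque


def frame_index(n, num_frames):
    """Return the largest d < num_frames such that n is divisible by 2**d."""
    d = 0
    while d + 1 < num_frames and n % (2 ** (d + 1)) == 0:
        d += 1
    return d


def pyramidal_frames(last_snapshot, num_frames=5, capacity=3):
    """Replay snapshots 1..last_snapshot and return each frame, newest first.

    Assumption: a snapshot goes only into the highest frame it qualifies for.
    """
    # A bounded deque drops its oldest entry when a new one arrives,
    # which plays the role of "kicking out" the oldest snapshot.
    frames = [deque(maxlen=capacity) for _ in range(num_frames)]
    for n in range(1, last_snapshot + 1):
        frames[frame_index(n, num_frames)].append(n)
    return [list(reversed(frame)) for frame in frames]


if __name__ == "__main__":
    for d, snapshots in enumerate(pyramidal_frames(70)):
        print(d, snapshots)
```

Under these assumptions, after 70 snapshots frame 0 holds 69, 67, 65 and frame 4 holds 64, 48, 32. At most 15 snapshots are kept in total, with recent history stored densely and older history at progressively coarser spacing.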
Snapshots (by clock time): 69 67 65

* Infinite length: impractical to store and use all historical data
  * Requires infinite storage
  * Running time
* Concept-drift: as concepts drift, new features may appear
* Concept-evolution: a new type of class normally holds a new set of features
* Feature-evolution: involves new features
  * Infinite data streams
    * Normally, the global feature set is unknown
    * New features may appear

* Uses past labeled data to build a classification model
* Predicts the labels of future instances using the model
* Helps decision making

Sketches
* Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
* Frequency moments of a stream $A = \{a_1, \ldots, a_N\}$: $F_k = \sum_{i=1}^{v} m_i^k$, where v is the universe or domain size and $m_i$ is the frequency of i in the sequence
* Given N elements and v values, sketches can approximate $F_0$, $F_1$, $F_2$ in $O(\log v + \log N)$ space

Randomized algorithms
* Monte Carlo algorithm: bound on running time, but may not return the correct result
* Chebyshev's inequality:
  * Let X be a random variable with mean $\mu$ and standard deviation $\sigma$. Then for any $k > 0$, $P(|X - \mu| \ge k) \le \sigma^2 / k^2$
* Chernoff bound:
  * Let X be the sum of independent Poisson trials $X_1, \ldots, X_n$, with mean $\mu = E[X]$ and $\delta$ in (0, 1). A standard form is $P(X \le (1 - \delta)\mu) \le e^{-\mu \delta^2 / 2}$
  * The probability decreases exponentially as we move away from the mean

Analysis of Web click streams
* Raw data at low levels: seconds, web page addresses, user IP addresses, ...
* Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
* E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."

Analysis of power consumption streams
* Raw data: power consumption flow for every household, every minute
* Patterns one may find: average hourly power consumption for manufacturing companies in Chicago in the last 2 hours today is 30% higher than on the same day a week ago

Streaming Data
* What is Streaming Data
* [Figure: data streams feeding a Stream Processing Engine, which produces Output]

Streaming Data
1) Manage small time window
2) Unbounded data
3) Data not correlated
4) Data can only be timestamped or geo-tagged

Stream Model
Streams may be archived in a large archival store, but we assume it is not possible to answer queries from the archival store. It could be examined only under special circumstances using time-consuming retrieval processes. There is also a working store, into which summaries or parts of streams may be placed, and which can be used for answering queries. The working store might be disk, or it might be main memory, depending on how fast we need to process queries. But either way, it is of sufficiently limited capacity that it cannot store all the data from all the streams.

Data Streams
* In many data mining situations, we do not know the entire data set in advance
* Stream management is important when the input rate is controlled externally:
  * Google queries
  * Twitter or Facebook status updates
* We can think of the data as infinite and non-stationary (the distribution changes over time)
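One summary that fits in a bounded working store is the reservoir sample that these notes describe under random sampling: keep s candidates, and let the N-th element replace an old one with probability s/N. The following is a minimal Python sketch of that idea, assuming the stream is any iterable and that the replaced slot is chosen uniformly at random. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of up to s items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            # The first s elements fill the reservoir directly.
            reservoir.append(item)
        elif rng.random() < s / n:
            # The n-th element enters with probability s/n and
            # overwrites a uniformly chosen slot (an assumption of this sketch).
            reservoir[rng.randrange(s)] = item
    return reservoir


if __name__ == "__main__":
    # Seeded generator so repeated runs pick the same sample.
    sample = reservoir_sample(range(1_000_000), s=10, rng=random.Random(0))
    print(sample)
```

The stream is scanned once and only s items are held in memory at any time, which matches the single-scan, summary-only constraints listed in these notes.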
