Dwdm Unit 5 Part One
UNIT-V
Mining Stream, Time-Series, and Sequence Data: Mining Data Streams, Mining
Time-Series Data, Mining Sequence Patterns in Transactional Databases
Unlike traditional data sets, stream data flow in and out of a computer system
continuously and with varying update rates. They are temporally ordered, fast changing,
massive, and potentially infinite. It may be impossible to store an entire data stream or
to scan through it multiple times due to its tremendous volume. Moreover, stream data
tend to be of a rather low level of abstraction, whereas most analysts are interested in
relatively high-level dynamic changes, such as trends and deviations.
◼ Data Streams
◼ Characteristics
Dept of AI 1
K.Radhika Asst.prof KITS
Most stream data are low-level and multidimensional in nature, and thus require multi-
level and multidimensional processing.
◼ Methodology
◼ Major methods
◼ Random sampling
◼ Histograms
◼ Sliding windows
◼ Multi-resolution model
◼ Sketches
◼ Randomized algorithms
1) Random sampling
2) Sliding windows
3) Histograms
4) Multi-resolution models
5) Sketches
6) Randomized algorithms
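Random sampling over an unbounded stream is commonly implemented with reservoir sampling, which keeps a uniform sample of fixed size k without knowing the stream length in advance. A minimal Python sketch (the function name and parameters are illustrative, not from the source):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i survives with probability k/(i+1); replace a random slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=5, seed=42)
print(sample)  # 5 items; every stream element was equally likely to survive
```

Each incoming item replaces a random reservoir slot with probability k/(i+1), which is exactly what keeps the sample uniform over everything seen so far.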
a. Monte Carlo algorithm: bound on running time, but may not return a correct result
b. Chebyshev's inequality:
   P(|X − μ| ≥ k) ≤ σ²/k²
c. Chernoff bound:
   P[X < (1 + δ)μ] < e^(−μδ²/4)
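Chebyshev's inequality can be checked empirically. This sketch (the uniform distribution, sample sizes, and constants are assumptions for illustration) estimates the mean of n uniform draws many times and compares the observed deviation rate with the σ²/k² bound:

```python
import random
import statistics

# Empirically check Chebyshev's inequality P(|X - mu| >= k) <= sigma^2 / k^2
# for X = the sample mean of n Uniform(0, 1) draws.
rng = random.Random(0)
n, trials, k = 50, 2000, 0.1
mu = 0.5                    # mean of Uniform(0, 1)
sigma2 = (1 / 12) / n       # variance of the mean of n draws

deviations = 0
for _ in range(trials):
    xbar = statistics.fmean(rng.random() for _ in range(n))
    if abs(xbar - mu) >= k:
        deviations += 1

empirical = deviations / trials
bound = sigma2 / k ** 2
# The empirical rate should fall well below the (loose) Chebyshev bound.
print(f"empirical={empirical:.4f}  Chebyshev bound={bound:.4f}")
```

Chebyshev is distribution-free and therefore loose; the Chernoff bound is much tighter but needs stronger independence assumptions.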
◼ Raw data: power consumption flow for every household, every minute
There are many possible ways to design a tilted time frame. Here we introduce three
models, as illustrated in Figure 8.1: (1) the natural tilted time frame model, (2) the logarithmic
tilted time frame model, and (3) the progressive logarithmic tilted time frame model.
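A natural tilted time frame can be sketched as a hierarchy of rollups: fine-grained recent values are aggregated into coarser, older units, so recent time is registered at fine granularity and distant time at coarse granularity. The granularities and capacities below (minutes → quarter-hours → hours) are illustrative assumptions:

```python
from collections import deque

class NaturalTiltedFrame:
    """Sketch of a natural tilted time frame: per-minute totals roll up into
    15-minute quarters, and quarters roll up into hours."""

    def __init__(self):
        self.minutes = deque(maxlen=15)   # most recent one-minute totals
        self.quarters = deque(maxlen=4)   # most recent quarter-hour totals
        self.hours = deque(maxlen=24)     # most recent hourly totals

    def add_minute(self, value):
        self.minutes.append(value)
        if len(self.minutes) == 15:       # roll 15 minutes into one quarter
            self.quarters.append(sum(self.minutes))
            self.minutes.clear()
        if len(self.quarters) == 4:       # roll 4 quarters into one hour
            self.hours.append(sum(self.quarters))
            self.quarters.clear()

f = NaturalTiltedFrame()
for v in range(60):                       # one hour of per-minute readings
    f.add_minute(v)
print(len(f.hours), len(f.quarters), len(f.minutes))  # → 1 0 0
```

Only a small, bounded number of registers is kept per level, which is what makes tilted frames suitable for potentially infinite streams.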
◼ Critical layers
Critical Layers
Fig: Two critical layers in a “power supply station” stream data cube: the observation layer (*, city, quarter) and the minimal interesting layer (user_group, street_block, minute); raw data arrive at the (individual_user, street_address, second) level.
The Lossy Counting algorithm has three nice properties: (1) there are no false negatives,
that is, no true frequent item fails to be output; (2) false positives are
quite “positive” as well, since the output items have a frequency of at least (σ − ε)N,
where σ is the minimum support threshold and ε the error bound; and (3) the frequency
of a frequent item can be underestimated by at most εN. For frequent
items, this underestimation is only a small fraction of the true frequency, so the
approximation is acceptable.
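The properties above follow from how Lossy Counting buckets the stream and prunes low counts. A compact Python sketch of the Manku–Motwani algorithm (variable names and the example stream are illustrative, not from the source):

```python
import math

def lossy_counting(stream, epsilon):
    """Manku-Motwani Lossy Counting: approximate item frequencies with
    an undercount of at most epsilon * N and no false negatives."""
    width = math.ceil(1 / epsilon)      # bucket width
    counts, deltas = {}, {}             # item -> count, item -> max undercount
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1   # it may have been seen before pruning
        if n % width == 0:              # prune at each bucket boundary
            for key in [k for k in counts
                        if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
    return counts, n

counts, n = lossy_counting(["a", "b", "a", "c", "a", "b"] * 100, epsilon=0.01)
sigma = 0.2                             # minimum support threshold (assumed)
frequent = {k: v for k, v in counts.items() if v >= (sigma - 0.01) * n}
print(frequent)  # → {'a': 300, 'b': 200}
```

Reporting every item whose stored count reaches (σ − ε)N is exactly what guarantees no false negatives while bounding the frequency of any false positive from below.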
◼ Strength
◼ A simple idea
◼ Weakness:
◼ The output is based on all previous data. But sometimes, we are only
interested in recent data
2) Are decision trees good for modeling fast-changing data, e.g., stock market
analysis?
◼ Naïve Bayesian
Hoeffding Tree
The Hoeffding tree algorithm is a decision tree learning method for stream data
classification. It was initially used to track Web clickstreams and construct models to
predict which Web hosts and Web sites a user is likely to access. It typically runs in
sublinear time and produces a nearly identical decision tree to that of traditional batch
learners. It uses Hoeffding trees, which exploit the idea that a small sample can often
be enough to choose an optimal splitting attribute. This idea is supported
mathematically by the Hoeffding bound (or additive Chernoff bound).
r: random variable
R: range of r
n: # independent observations
Hoeffding bound: with probability 1 − δ, the true mean of r is within ε of the sample mean, where
ε = √(R² ln(1/δ) / 2n)
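The bound can be evaluated directly. This small helper (the function name is illustrative) computes ε for a given range R, confidence parameter δ, and sample count n, showing how few examples suffice to pick a splitting attribute with high confidence:

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon such that the true mean lies within epsilon of the sample mean
    with probability 1 - delta, given n independent observations of range R."""
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))

# e.g., information gain in [0, 1] (R = 1), 10,000 samples, 99.9% confidence
eps = hoeffding_bound(R=1.0, delta=0.001, n=10_000)
print(round(eps, 4))  # → 0.0186
```

If the best and second-best attributes' gains differ by more than ε, the tree can split immediately; otherwise it simply waits for more samples, which is why the method runs in sublinear time.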
◼ Hoeffding Tree Input
S: sequence of examples
X: attributes
G( ): evaluation function
δ: desired accuracy (split confidence)
Fig: The nodes of the Hoeffding tree are created incrementally as more
samples stream in. The root first splits the data stream on “Packets > 10”
(yes/no); with more samples, a branch is further refined by a test such as
“protocol = http” or “protocol = ftp”.
1) Strengths
◼ Scales better than traditional methods
◼ Sublinear with sampling
◼ Very small memory utilization
◼ Incremental
◼ Make class predictions in parallel
◼ New examples are added as they come
There are, however, weaknesses to the Hoeffding tree algorithm. For
example, the algorithm spends a great deal of time with attributes that have
nearly identical splitting quality. In addition, the memory utilization can be
further optimized. Finally, the algorithm cannot handle concept drift, because
once a node is created, it can never change.
◼ CVFDT (Concept-adapting Very Fast Decision Tree)
◼ Increments counts with each new example
◼ Decrements counts as old examples leave the sliding window
◼ Sliding window of recent examples
◼ Nodes assigned monotonically increasing IDs
◼ Grows alternate subtrees for questionable splits
◼ When an alternate subtree becomes more accurate, it replaces the old one
◼ O(w) better runtime than VFDT-window
A Classifier Ensemble Approach to Stream Data Classification
There are several reasons for involving more than one classifier. Decision trees
are not necessarily the most natural method for handling concept drift. Specifically,
if an attribute near the root of the tree in CVFDT no longer passes the Hoeffding
bound, a large portion of the tree must be regrown. Many other classifiers, such as
naïve Bayes, are not subject to this weakness. In addition, naïve Bayesian classifiers
also supply relative probabilities along with the class labels.
Trend Analysis
A time series involving a variable Y, representing, say, the daily closing price
of a share in a stock market, can be viewed as a function of time t, that is, Y =
F(t). Such a function can be illustrated as a time-series graph, as shown in Figure
8.4, which describes a point moving with the passage of time.
In general, there are two goals in time-series analysis: (1) modeling time series
(i.e., gaining insight into the mechanisms or underlying forces that generate the
time series), and (2) forecasting time series (i.e., predicting future values of
the time-series variables). Trend analysis characterizes time-series data in terms of
four major components or movements: the trend (T), cyclic (C), seasonal (S), and
irregular or random (I) movements.
Figure 8.4 Time-series data of the stock price of AllElectronics over time. The
trend is shown with a dashed curve, calculated by a moving average.
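A moving average like the one used for the trend curve can be computed with a simple sliding window; the price series below is illustrative:

```python
def moving_average(series, window):
    """Simple moving average: smooths a time series to expose the trend."""
    if window > len(series):
        return []
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

prices = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(prices, 3))  # → [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
```

A wider window smooths more aggressively, suppressing the cyclic and irregular movements at the cost of lag.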
2) Time series analysis: decomposition of a time series into these four basic
movements
◼ Additive Model: TS = T + C + S + I
◼ Multiplicative Model: TS = T × C × S × I
◼ Find the curve minimizing the sum of the squares of the deviation of
points on the curve from the corresponding data points
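For a straight-line trend y = a + b·t, the curve minimizing the sum of squared deviations has a closed form. A sketch with illustrative data (function and variable names are assumptions):

```python
def least_squares_line(t, y):
    """Fit y = a + b*t minimizing the sum of squared deviations (closed form)."""
    n = len(t)
    mt = sum(t) / n
    my = sum(y) / n
    num = sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
    den = sum((ti - mt) ** 2 for ti in t)
    b = num / den
    a = my - b * mt
    return a, b

t = [0, 1, 2, 3, 4]
y = [1.0, 3.1, 4.9, 7.2, 8.8]   # roughly y = 1 + 2t, with noise
a, b = least_squares_line(t, y)
print(round(a, 2), round(b, 2))  # → 1.06 1.97
```

The same least-squares idea extends to higher-order polynomials when the trend is visibly curved.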
◼ Medical diagnosis
Data Reduction and Transformation Techniques
Due to the tremendous size and high-dimensionality of time-series data, data reduction
often serves as the first step in time-series analysis. Major strategies for data reduction
include attribute subset selection (which removes irrelevant or redundant attributes or
dimensions), dimensionality reduction (which typically employs signal processing
techniques to obtain a reduced version of the original data), and numerosity
reduction (where data are replaced or estimated by alternative, smaller representations,
such as histograms, clustering, and sampling). Because time series can be viewed as
data of very high dimensionality where each point of time can be viewed as a
dimension, dimensionality reduction is our major concern here.
1) Many techniques for signal analysis require the data to be in the frequency
domain.
For efficient accessing, a multidimensional index can be constructed using the first
few Fourier coefficients. When a similarity query is submitted to the system, the
index can be used to retrieve the sequences that are at most a certain small distance
away from the query sequence. Post processing is then performed by computing
the actual distance between sequences in the time domain and discarding any false
matches.
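The filter-and-refine idea can be sketched with a naive DFT: a distance computed on the first few Fourier coefficients lower-bounds the true Euclidean distance (by Parseval's theorem, under the unitary scaling used here), so the index filter never discards a true match. Names, the choice of k, and the data are illustrative:

```python
import cmath
import math

def dft_coefficients(x, k):
    """First k DFT coefficients of a real sequence x, with unitary-style
    1/sqrt(n) scaling (naive O(n*k) transform; fine for a sketch)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / math.sqrt(n)
            for f in range(k)]

def reduced_distance(x, y, k=3):
    """Euclidean distance in the truncated frequency domain. It lower-bounds
    the true time-domain distance, so filtering with it causes no false
    dismissals; false matches are removed in a postprocessing step."""
    cx, cy = dft_coefficients(x, k), dft_coefficients(y, k)
    return sum(abs(a - b) ** 2 for a, b in zip(cx, cy)) ** 0.5

q = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]   # query sequence
s = [1.1, 2.1, 2.9, 4.2, 3.1, 1.8, 1.0, 0.2]   # candidate sequence
print(reduced_distance(q, s) <= 1.0)  # cheap filter before exact comparison
```

Because the lower bound can only under-report distance, candidates that pass the filter still need the exact time-domain comparison, exactly as the postprocessing step above describes.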
For subsequence matching, each sequence can be broken down into a set of
“pieces” of windows with length w. In one approach, the features of the subsequence
inside each window are then extracted. Each sequence is mapped to a “trail” in the
feature space. The trail of each sequence is divided into “subtrails,” each represented
by a minimum bounding rectangle. A multipiece assembly algorithm can then be used
to search for longer sequence matches.
1) Atomic matching
◼ Find all pairs of gap-free windows of a small length that are similar
2) Window stitching
◼ Stitch similar windows to form pairs of large similar subsequences
allowing gaps between atomic matches
3) Subsequence Ordering
◼ Linearly order the subsequence matches to determine whether enough
similar pieces exist.
Find all of the sequences that are similar to some sequence in class A, but not
similar to any sequence in class B.
◼ Allows users to define and query the overall shape of time sequences
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
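Support counting over such a sequence database reduces to testing subsequence containment: each itemset of the pattern must be a subset of some later event of the data sequence, in order. A sketch using the database above (function names are illustrative):

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`: each pattern itemset
    must be contained in a later event of the sequence, preserving order.
    Greedy earliest matching is correct for subsequence containment."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

# The four data sequences from the table, as lists of itemsets
db = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

pattern = [{"a", "b"}, {"c"}]     # the sequential pattern <(ab)c>
support = sum(contains(seq, pattern) for seq in db.values())
print(support)  # → 2  (contained in sequences 10 and 30)
```

Note that items within one itemset are unordered (they occur together), while the itemsets themselves must appear in order, which is exactly what the subset test inside the ordered scan captures.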
“How can we develop efficient and scalable methods for sequential pattern
mining?” Recent developments have made progress in two directions: (1) efficient
methods for mining the full set of sequential patterns, and (2) efficient methods for
mining only the set of closed sequential patterns, where a sequential pattern s is closed
if there exists no sequential pattern s′ such that s′ is a proper supersequence of s and s′
has the same (frequency) support as s.
All three approaches either directly or indirectly explore the Apriori property,
stated as follows: every nonempty subsequence of a sequential pattern is a sequential
pattern. (Recall that for a pattern to be called sequential, it must be frequent. That is, it
must satisfy minimum support.) The Apriori property is antimonotonic (or downward-
closed) in that, if a sequence cannot pass a test (e.g., regarding minimum support), all
of its supersequences will also fail the test. Use of this property to prune the search
space can help make the discovery of sequential patterns more efficient.
The mapping from horizontal to vertical format requires one scan of the
database. A major advantage of using this format is that we can determine the
support of any k-sequence by simply joining the ID lists of any two of its (k−1)-
length subsequences. The length of the resulting ID list (i.e., the number of unique
sequence IDs) is equal to the support of the k-sequence, which tells us whether
the sequence is frequent.
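The vertical-format join can be sketched directly: one scan builds item → (sequence_id, event_id) lists, after which the support of a 2-sequence ⟨x y⟩ is the number of distinct sequence IDs in which x occurs strictly before y. The toy database and names below are illustrative:

```python
from collections import defaultdict

# Toy horizontal database: sequence_id -> ordered list of events (itemsets)
db = {
    1: [["a"], ["b"], ["a", "c"]],
    2: [["b"], ["a"]],
    3: [["a"], ["b"]],
}

# One scan builds the vertical format: item -> list of (sequence_id, event_id)
vertical = defaultdict(list)
for sid, events in db.items():
    for eid, itemset in enumerate(events):
        for item in itemset:
            vertical[item].append((sid, eid))

def support_ab(x, y):
    """Support of the 2-sequence <x y>: count sequence IDs where some
    occurrence of x precedes some occurrence of y (temporal ID-list join)."""
    sids = set()
    for sid_x, eid_x in vertical[x]:
        for sid_y, eid_y in vertical[y]:
            if sid_x == sid_y and eid_x < eid_y:
                sids.add(sid_x)
    return len(sids)

print(support_ab("a", "b"))  # → 2  (sequences 1 and 3)
```

Longer patterns are handled the same way: joining the ID lists of two (k−1)-length subsequences yields the ID list, and hence the support, of the k-sequence without rescanning the database.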