
K. Radhika, Asst. Prof., KITS

UNIT-V

Mining Stream, Time-Series, and Sequence Data: Mining Data Streams, Mining
Time-Series Data, Mining Sequence Patterns in Transactional Databases

Mining Stream, Time-Series, and Sequence Data

Mining Data Streams


Tremendous and potentially infinite volumes of stream data are often generated by real-time surveillance systems, communication networks, Internet traffic, online transactions in the financial market or retail industry, electric power grids, industrial production processes, remote sensors, and other dynamic environments.

Unlike traditional data sets, stream data flow in and out of a computer system continuously and with varying update rates. They are temporally ordered, fast changing, massive, and potentially infinite. It may be impossible to store an entire data stream or to scan through it multiple times due to its tremendous volume. Moreover, stream data tend to be of a rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes, such as trends and deviations.

◼ Data Streams

◼ Data streams—continuous, ordered, changing, fast, huge amount

◼ Traditional DBMS—data stored in finite, persistent data sets

◼ Characteristics

◼ Huge volumes of continuous data, possibly infinite

◼ Fast changing and requires fast, real-time response

◼ The data stream model nicely captures many of today's data processing needs

◼ Random access is expensive—single-scan algorithms (can only have one look)

◼ Store only a summary of the data seen thus far


Most stream data are at a rather low level of abstraction and multidimensional in nature, requiring multilevel and multidimensional processing.

Stream Data Applications


◼ Telecommunication calling records
◼ Business: credit card transaction flows
◼ Network monitoring and traffic engineering
◼ Financial market: stock exchange
◼ Engineering & industrial processes: power supply & manufacturing
◼ Sensor, monitoring & surveillance: video streams, RFIDs
◼ Security monitoring
◼ Web logs and Web page click streams
Methodologies for Stream Data Processing
◼ Major challenges

◼ Keep track of a large universe, e.g., pairs of IP addresses, not ages

◼ Methodology

◼ Synopses (trade-off between accuracy and storage)

◼ Use synopsis data structures, much smaller (O(log^k N) space) than their base data set (O(N) space)

◼ Compute an approximate answer within a small error range (factor ε of the actual answer)

◼ Major methods

◼ Random sampling

◼ Histograms

◼ Sliding windows

◼ Multi-resolution model

◼ Sketches

◼ Randomized algorithms


1) Random sampling

◼ Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.
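A minimal Python sketch of this idea (the classic Algorithm R); after processing n elements, each element has had probability s/n of remaining in the reservoir:

```python
import random

def reservoir_sample(stream, s):
    """One-pass uniform random sample of size s (Algorithm R)."""
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(element)       # fill the reservoir first
        else:
            j = random.randrange(n)         # uniform in [0, n)
            if j < s:                       # true with probability s/n
                reservoir[j] = element      # replace a random old element
    return reservoir

print(reservoir_sample(range(10_000), 5))   # e.g. [1220, 9973, 4859, 3016, 7503]
```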

2) Sliding windows

a. Make decisions based only on recent data of sliding window size w

b. An element arriving at time t expires at time t + w

3) Histograms

a. Approximate the frequency distribution of element values in a stream

b. Partition data into a set of contiguous buckets

c. Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)

4) Multi-resolution models

a. Popular models: balanced binary trees, micro-clusters, and wavelets

5) Sketches

a. Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass

b. Frequency moments of a stream A = {a1, …, aN}:

F_k = Σ_{i=1}^{v} (m_i)^k

where v is the universe (domain) size and m_i is the frequency of value i in the sequence

◼ Given N elements and v values, sketches can approximate F0, F1, and F2 in O(log v + log N) space
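For reference, the frequency moments can be computed exactly in one pass when memory is no concern; sketches approximate the same quantities in sublinear space. A small Python illustration:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum over distinct values i of (m_i)^k."""
    counts = Counter(stream)              # m_i for each distinct value i
    return sum(m ** k for m in counts.values())

a = ['x', 'y', 'x', 'z', 'x', 'y']
print(frequency_moment(a, 0))   # F0 = 3, the number of distinct values
print(frequency_moment(a, 1))   # F1 = 6, the stream length N
print(frequency_moment(a, 2))   # F2 = 9 + 4 + 1 = 14 (the "surprise number")
```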

6) Randomized algorithms

a. Monte Carlo algorithm: bound on running time, but may not return the correct result

b. Chebyshev's inequality:

i. Let X be a random variable with mean μ and standard deviation σ

P(|X − μ| ≥ k) ≤ σ²/k²

c. Chernoff bound:

i. Let X be the sum of independent Poisson trials X1, …, Xn, δ in (0, 1]

P[X > (1 + δ)μ] < e^(−μδ²/4)

ii. The probability decreases exponentially as we move away from the mean.

Stream OLAP and Stream Data Cubes


Stream data are generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. It is impossible to store such data streams completely in a data warehouse.

Multi-Dimensional Stream Analysis: Examples


◼ Analysis of Web click streams

◼ Raw data at low levels: seconds, web page addresses, user IP addresses, …

◼ Analysts want: changes, trends, unusual patterns, at reasonable levels of detail

◼ E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."

◼ Analysis of power consumption streams

◼ Raw data: power consumption flow for every household, every minute

◼ Patterns one may find: e.g., the average hourly power consumption of manufacturing companies in Chicago is 30% higher in the last 2 hours today than on the same day a week ago

Time Dimension with Compressed Time Scale: Tilted Time Frame


In stream data analysis, people are interested in recent changes at a fine scale, but in long-term changes at a coarse scale. Naturally, we can register time at different levels of granularity. The most recent time is registered at the finest granularity; the more distant time is registered at a coarser granularity; and the level of coarseness depends on the application requirements and on how old the time point is (from the current time). Such a time dimension model is called a tilted time frame.

There are many possible ways to design a tilted time frame. Here we introduce three models, as illustrated in Figure 8.1: (1) the natural tilted time frame model, (2) the logarithmic tilted time frame model, and (3) the progressive logarithmic tilted time frame model.

◼ A tilted time frame

◼ Different time granularities

◼ second, minute, quarter, hour, day, week, …

◼ Critical layers

◼ Minimum interest layer (m-layer)

◼ Observation layer (o-layer)

◼ User: watches at the o-layer and occasionally drills down to the m-layer

◼ Partial materialization of stream cubes

◼ Full materialization: too space- and time-consuming

◼ No materialization: slow response at query time


◼ Partial materialization: what do we mean by "partial"?

A Tilted Time Model

◼ Natural tilted time frame:

◼ Example: Minimal: quarter, then 4 quarters → 1 hour, 24 hours → 1 day, …

(Schematic: … | 12 months | 31 days | 24 hours | 4 qtrs | → current time)

◼ Logarithmic tilted time frame:

◼ Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, … time units

(Schematic: … | 64t | 32t | 16t | 8t | 4t | 2t | t | t | → current time)

A Tilted Time Model (continued)

◼ Pyramidal (progressive logarithmic) tilted time frame:

◼ Example: Suppose there are 5 frames, each holding at most 3 snapshots

◼ Given a snapshot number N, if N mod 2^d = 0, insert it into frame number d. If a frame then holds more than 3 snapshots, "kick out" the oldest one. (A sketch of this insertion rule follows the table below.)

Frame no.   Snapshots (by clock time)
0           69, 67, 65
1           70, 66, 62
2           68, 60, 52
3           56, 40, 24
4           48, 16
5           64, 32
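The insertion rule can be simulated directly. The following Python sketch (variable names are our own) reproduces the frame table above for snapshots 1 through 70, with 5 frames and a capacity of 3 snapshots per frame:

```python
def pyramidal_insert(frames, snapshot, max_frame=5, capacity=3):
    """Insert snapshot N into the highest frame d (d <= max_frame) with
    N mod 2^d == 0; each frame keeps its `capacity` most recent snapshots."""
    d = 0
    while d < max_frame and snapshot % (2 ** (d + 1)) == 0:
        d += 1
    frames.setdefault(d, []).insert(0, snapshot)   # newest first
    if len(frames[d]) > capacity:
        frames[d].pop()                            # kick out the oldest
    return frames

frames = {}
for t in range(1, 71):              # clock times 1..70
    pyramidal_insert(frames, t)
for d in sorted(frames):
    print(d, frames[d])             # reproduces the table above
```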


Critical Layers

Fig: Two critical layers in a "power supply station" stream data cube:

- observation layer (o-layer): (*, city, quarter)
- minimal interest layer (m-layer): (user_group, street_block, minute)
- primitive data layer: (individual_user, street_address, second)

In many applications, it is beneficial to dynamically and incrementally compute and store two critical cuboids (or layers), which are determined based on their conceptual and computational importance in stream data analysis. The first layer, called the minimal interest layer, is the minimally interesting layer that an analyst would like to study. It is necessary to have such a layer because it is often neither cost-effective nor interesting in practice to examine the minute details of stream data. The second layer, called the observation layer, is the layer at which an analyst (or an automated system) would like to continuously study the data. This can involve making decisions regarding the signaling of exceptions, or drilling down along certain paths to lower layers to find cells indicating data exceptions.
Frequent-Pattern Mining in Data Streams


◼ Frequent pattern mining is valuable in stream applications


◼ e.g., network intrusion mining (Dokas, et al’02)
◼ Mining precise freq. patterns in stream data: unrealistic
◼ Even storing them in a compressed form, such as an FP-tree, is impractical
◼ How to mine frequent patterns with good approximation?
◼ Approximate frequent patterns (Manku & Motwani VLDB’02)
◼ Keep only current frequent patterns? No changes can be detected
◼ Mining evolution freq. patterns (C. Giannella, J. Han, X. Yan, P.S. Yu, 2003)
◼ Use tilted time window frame
◼ Mining evolution and dramatic changes of frequent patterns
◼ Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)

Mining Approximate Frequent Patterns

◼ Mining precise freq. patterns in stream data: unrealistic

◼ Even storing them in a compressed form, such as an FP-tree, is impractical

◼ Approximate answers are often sufficient (e.g., trend/pattern analysis)

◼ Example: a router is interested in all flows:

◼ whose frequency is at least 1% (σ) of the entire traffic stream seen so far

◼ and feels that 1/10 of σ (ε = 0.1%) error is comfortable

◼ How to mine frequent patterns with good approximation?

◼ Lossy Counting Algorithm (Manku & Motwani, VLDB’02)

◼ Major idea: do not track an item until it may become frequent

◼ Adv: guaranteed error bound


◼ Disadv: must keep a large set of traces

Lossy Counting Algorithm

The Lossy Counting algorithm has three nice properties: (1) there are no false negatives, that is, no true frequent item is missing from the output; (2) false positives are quite "positive" as well, since the output items have a frequency of at least (σ − ε)N; and (3) the frequency of a frequent item can be underestimated by at most εN. For frequent items, this underestimation is only a small fraction of the true frequency, so the approximation is acceptable.
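A minimal single-item Python sketch of Lossy Counting, assuming the usual bucket formulation with bucket width ⌈1/ε⌉; an entry is pruned at each bucket boundary when its count plus its maximum possible undercount no longer exceeds the current bucket number:

```python
import math

def lossy_counting(stream, epsilon):
    """Approximate counts with undercount at most epsilon * N, in one pass."""
    width = math.ceil(1 / epsilon)          # bucket width
    counts, deltas = {}, {}                 # count and max undercount per item
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1       # may have been seen and pruned before
        if n % width == 0:                  # bucket boundary: prune
            for x in list(counts):
                if counts[x] + deltas[x] <= bucket:
                    del counts[x], deltas[x]
    return counts, n

# Report items whose true frequency may reach sigma * N: no false negatives,
# and any reported count underestimates the truth by at most epsilon * N.
counts, n = lossy_counting(['a', 'b', 'a', 'c'] * 2500, epsilon=0.001)
frequent = {x: c for x, c in counts.items() if c >= (0.01 - 0.001) * n}
print(frequent)   # {'a': 5000, 'b': 2500, 'c': 2500}
```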

◼ Strength

◼ A simple idea

◼ Can be extended to frequent itemsets

◼ Weakness:

◼ Space Bound is not good

◼ For frequent itemsets, it must scan each record multiple times

◼ The output is based on all previous data. But sometimes, we are only
interested in recent data

Classification for Dynamic Data Streams


1) Decision tree induction for stream data classification

◼ VFDT (Very Fast Decision Tree) / CVFDT (Domingos, Hulten, Spencer, KDD'00/KDD'01)

2) Is decision-tree good for modeling fast changing data, e.g., stock market
analysis?

3) Other stream classification methods

◼ Instead of decision-trees, consider other models


◼ Naïve Bayesian

◼ Ensemble (Wang, Fan, Yu, Han. KDD’03)

◼ K-nearest neighbors (Aggarwal, Han, Wang, Yu. KDD’04)

◼ Tilted time framework, incremental updating, dynamic maintenance, and model construction

◼ Compare models to find changes

Hoeffding Tree
The Hoeffding tree algorithm is a decision tree learning method for stream data classification. It was initially used to track Web clickstreams and construct models to predict which Web hosts and Web sites a user is likely to access. It typically runs in sublinear time and produces a nearly identical decision tree to that of traditional batch learners. It uses Hoeffding trees, which exploit the idea that a small sample can often be enough to choose an optimal splitting attribute. This idea is supported mathematically by the Hoeffding bound (or additive Chernoff bound).

◼ Hoeffding Bound (Additive Chernoff Bound)

r: random variable

R: range of r

n: # independent observations

Mean of r is at least r_avg − ε, with probability 1 − δ, where

ε = √(R² ln(1/δ) / 2n)
◼ Hoeffding Tree Input

S: sequence of examples

X: attributes

G( ): evaluation function


δ: desired accuracy (confidence parameter)

◼ Hoeffding Tree Algorithm

1. for each example in S
2.   retrieve G(Xa) and G(Xb)   // the two attributes with the highest G(Xi)
3.   if (G(Xa) − G(Xb) > ε)
4.     split on Xa
5.     recurse to next node
6.     break
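A hedged Python sketch of the split decision at a leaf; `node.gain`, `node.attributes`, `node.n`, and `node.split_on_best` are hypothetical helpers standing in for the leaf's sufficient statistics, not part of any published implementation:

```python
import math

def try_split(node, delta=1e-7, R=1.0):
    """Split a leaf once the best attribute beats the runner-up by epsilon."""
    gains = sorted((node.gain(X) for X in node.attributes), reverse=True)
    g_a, g_b = gains[0], gains[1]                      # two highest G(Xi)
    epsilon = math.sqrt(R * R * math.log(1 / delta) / (2 * node.n))
    if g_a - g_b > epsilon:      # Xa really is best, with probability 1 - delta
        node.split_on_best()     # grow children; each new leaf recurses
        return True
    return False
```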

Fig: The nodes of the Hoeffding tree are created incrementally as more examples stream in. (Left: after some data, a root test "Packets > 10" with a leaf "protocol = http". Right: after more data, the leaf is expanded into a further split on "Protocol = http" vs. "Protocol = ftp".)


Hoeffding Tree: Strengths and Weaknesses

1) Strengths
◼ Scales better than traditional methods
◼ Sublinear with sampling
◼ Very small memory utilization
◼ Incremental
◼ Make class predictions in parallel
◼ New examples are added as they come
There are, however, weaknesses to the Hoeffding tree algorithm. For
example, the algorithm spends a great deal of time with attributes that have
nearly identical splitting quality. In addition, the memory utilization can be
further optimized. Finally, the algorithm cannot handle concept drift, because
once a node is created, it can never change.

VFDT (Very Fast Decision Tree)


The VFDT (Very Fast Decision Tree) algorithm makes several modifications to the
Hoeffding tree algorithm to improve both speed and memory utilization. The
modifications include breaking near-ties during attribute selection more aggressively,
computing the G function after a number of training examples, deactivating the least
promising leaves whenever memory is running low, dropping poor splitting
attributes, and improving the initialization method.
◼ Modifications to Hoeffding Tree
◼ Near-ties broken more aggressively
◼ G computed only once every n_min examples
◼ Deactivates certain leaves to save memory
◼ Poor attributes dropped
◼ Initialize with traditional learner (helps learning curve)
◼ Compare to Hoeffding Tree: Better time and memory


◼ Compare to traditional decision tree


◼ Similar accuracy
◼ Better runtime with 1.61 million examples
◼ 21 minutes for VFDT
◼ 24 hours for C4.5
◼ Still does not handle concept drift

CVFDT (Concept-adapting VFDT)


The Concept-adapting Very Fast Decision Tree algorithm uses a sliding-window approach; however, it does not construct a new model from scratch each time.
◼ Concept Drift
◼ Time-changing data streams

◼ Incorporate new and eliminate old

◼ CVFDT
◼ Increments count with new example
◼ Decrement old example
◼ Sliding window
◼ Nodes assigned monotonically increasing IDs
◼ Grows alternate subtrees
◼ When an alternate subtree becomes more accurate, replace the old one
◼ O(w) better runtime than VFDT-window
A Classifier Ensemble Approach to Stream Data Classification
There are several reasons for involving more than one classifier. Decision trees are not necessarily the most natural method for handling concept drift. Specifically, if an attribute near the root of the tree in CVFDT no longer passes the Hoeffding bound, a large portion of the tree must be regrown. Many other classifiers, such as naïve Bayes, are not subject to this weakness. In addition, naïve Bayesian classifiers also supply relative probabilities along with the class labels, which express the confidence of a decision. Furthermore, CVFDT's automatic elimination of old examples may not be prudent. Rather than keeping only the most up-to-date examples, the ensemble approach discards the least accurate classifiers. Experimentation shows that the ensemble approach achieves greater accuracy than any one of the single classifiers.

Clustering Evolving Data Streams


For effective clustering of stream data, several new methodologies have been developed, as follows:

1) Compute and store summaries of past data: Due to limited memory space and fast response requirements, compute summaries of the previously seen data, store the relevant results, and use such summaries to compute important statistics when required.

2) Apply a divide-and-conquer strategy: Divide data streams into chunks based on order of arrival, compute summaries for these chunks, and then merge the summaries. In this way, larger models can be built out of smaller building blocks.

3) Incremental clustering of incoming data streams: Because stream data enter the system continuously and incrementally, the clusters derived must be incrementally refined.

4) Perform microclustering as well as macroclustering analysis: Stream clusters can be computed in two steps: (1) compute and store summaries at the microcluster level, where microclusters are formed by applying a hierarchical bottom-up clustering algorithm (Section 7.5.1), and (2) compute macroclusters (such as by using another clustering algorithm to group the microclusters) at the user-specified level. This two-step computation effectively compresses the data and often results in a smaller margin of error.

5) Explore multiple time granularities for the analysis of cluster evolution: Because the more recent data often play a different role from that of the remote (i.e., older) data in stream data analysis, use a tilted time frame model to store snapshots of summarized data at different points in time.

6) Divide stream clustering into on-line and off-line processes: While data are streaming in, basic summaries of data snapshots should be computed, stored, and incrementally updated. Therefore, an on-line process is needed to maintain such dynamically changing clusters. Meanwhile, a user may pose queries about past, current, or evolving clusters. Such analysis can be performed off-line, or as a process independent of on-line cluster maintenance.

CluStream: A Framework for Clustering Evolving Data Streams

CluStream is an algorithm for the clustering of evolving data streams based on user-specified, on-line clustering queries. It divides the clustering process into on-line and off-line components. The on-line component computes and stores summary statistics about the data stream using microclusters, and performs incremental on-line computation and maintenance of the microclusters. The off-line component performs macroclustering and answers various user questions using the stored summary statistics, which are based on the tilted time frame model.
◼ Design goal

◼ High-quality clustering of evolving data streams, with greater functionality

◼ While keeping the stream mining requirements in mind:

◼ One pass over the original stream data

◼ Limited space usage and high efficiency

◼ CluStream: a framework for clustering evolving data streams

◼ Divide the clustering process into on-line and off-line components

◼ On-line component: periodically stores summary statistics about the stream data

◼ Off-line component: answers various user questions based on the stored summary statistics
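The on-line component's microclusters are commonly represented by a temporal extension of the BIRCH cluster feature vector. The following Python sketch is an assumption about that representation, not CluStream's exact code; it shows why incremental on-line maintenance is cheap: the vector is additive, so absorbing a point costs O(d) and no raw data needs to be kept.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MicroCluster:
    """Temporal cluster-feature vector: (n, LS, SS, LS_t, SS_t)."""
    n: int = 0
    ls: np.ndarray = None    # linear sum of the points seen
    ss: np.ndarray = None    # squared sum of the points seen
    ls_t: float = 0.0        # linear sum of timestamps
    ss_t: float = 0.0        # squared sum of timestamps

    def absorb(self, x, t):
        """Additivity makes absorbing a point O(d)."""
        x = np.asarray(x, dtype=float)
        if self.ls is None:
            self.ls, self.ss = np.zeros_like(x), np.zeros_like(x)
        self.n += 1
        self.ls += x
        self.ss += x * x
        self.ls_t += t
        self.ss_t += t * t

    def centroid(self):
        return self.ls / self.n

mc = MicroCluster()
for t, point in enumerate([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]):
    mc.absorb(point, t)
print(mc.n, mc.centroid())    # 3 [1.03333333 1.96666667]
```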


Mining Time-Series Data


A time-series database consists of sequences of values or events obtained over repeated measurements of time. The values are typically measured at equal time intervals (e.g., hourly, daily, weekly). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, utility studies, inventory studies, yield projections, workload projections, process and quality control, observation of natural phenomena (such as atmosphere, temperature, wind, earthquake), scientific and engineering experiments, and medical treatments. A time-series database is also a sequence database. However, a sequence database is any database that consists of sequences of ordered events, with or without concrete notions of time. For example, Web page traversal sequences and customer shopping transaction sequences are sequence data, but they may not be time-series data.
◼ Time-series database
◼ Consists of sequences of values or events changing with time
◼ Data is recorded at regular intervals
◼ Characteristic time-series components
◼ Trend, cycle, seasonal, irregular
◼ Applications
◼ Financial: stock price, inflation
◼ Industry: power consumption
◼ Scientific: experiment results
◼ Meteorological: precipitation

Trend Analysis
A time series involving a variable Y, representing, say, the daily closing price of a share in a stock market, can be viewed as a function of time t, that is, Y = F(t). Such a function can be illustrated as a time-series graph, as shown in Figure 8.4, which describes a point moving with the passage of time.


In general, there are two goals in time-series analysis: (1) modeling time series (i.e., gaining insight into the mechanisms or underlying forces that generate them), and (2) forecasting time series (i.e., predicting the future values of the time-series variables). Trend analysis considers the following four major components or movements for characterizing time-series data:

1) Trend or long-term movements: These indicate the general direction in which a time-series graph is moving over a long interval of time. This movement is displayed by a trend curve, or a trend line. For example, the trend curve of Figure 8.4 is indicated by a dashed curve. Typical methods for determining a trend curve or trend line include the weighted moving average method and the least squares method, discussed later.
2) Cyclic movements or cyclic variations: These refer to the cycles, that is,
the long-term oscillations about a trend line or curve, which may or may
not be periodic. That is, the cycles need not necessarily follow exactly
similar patterns after equal intervals of time.

Figure 8.4 Time-series data of the stock price of AllElectronics over time. The trend is shown with a dashed curve, calculated by a moving average. (Axes: price vs. time; solid line: AllElectronics stock; dashed line: 10-day moving average.)

3) Seasonal movements or seasonal variations: These are systematic or calendar-related. Examples include events that recur annually, such as the sudden increase in sales of chocolates and flowers before Valentine's Day or of department store items before Christmas. The observed increase in water consumption in summer due to warm weather is another example. In these examples, seasonal movements are the identical or nearly identical patterns that a time series appears to follow during corresponding months of successive years.
4) Irregular or random movements: These characterize the sporadic motion of
time series due to random or chance events, such as labor disputes, floods, or
announced personnel changes within companies.

Categories of Time-Series Movements

1) Categories of Time-Series Movements

◼ Long-term or trend movements (trend curve): the general direction in which a time series is moving over a long interval of time

◼ Cyclic movements or cyclic variations: long-term oscillations about a trend line or curve

◼ e.g., business cycles; may or may not be periodic

◼ Seasonal movements or seasonal variations

◼ i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years

◼ Irregular or random movements.

2) Time series analysis: decomposition of a time series into these four basic
movements

◼ Additive model: TS = T + C + S + I

◼ Multiplicative model: TS = T × C × S × I

Estimation of Trend Curve

1) The freehand method

◼ Fit the curve by looking at the graph


◼ Costly and barely reliable for large-scale data mining

2) The least-square method

◼ Find the curve minimizing the sum of the squares of the deviation of
points on the curve from the corresponding data points

3) The moving-average method: moving average of order n

◼ Smoothes the data

◼ Eliminates cyclic, seasonal, and irregular movements

◼ Loses the data at the beginning and end of a series

◼ Sensitive to outliers (this can be reduced by using a weighted moving average)
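A small Python sketch of both variants; note how the simple moving average of order n loses the first n − 1 points:

```python
def moving_average(y, n):
    """Simple moving average of order n (loses the first n - 1 points)."""
    return [sum(y[i - n + 1:i + 1]) / n for i in range(n - 1, len(y))]

def weighted_moving_average(y, weights):
    """Weighted variant: damps the influence of outliers in the window."""
    n, total = len(weights), sum(weights)
    return [sum(v * w for v, w in zip(y[i - n + 1:i + 1], weights)) / total
            for i in range(n - 1, len(y))]

prices = [20, 22, 21, 25, 24, 28, 27, 30]
print(moving_average(prices, 3))                  # [21.0, 22.67, 23.33, ...]
print(weighted_moving_average(prices, [1, 2, 1]))
```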

Similarity Search in Time-Series Analysis

1) Normal database query finds exact match


2) Similarity search finds data sequences that differ only slightly from the given
query sequence
3) Two categories of similarity queries
◼ Whole matching: find a sequence that is similar to the query sequence

◼ Subsequence matching: find all data sequences that contain subsequences similar to a given query sequence
4) Typical Applications
◼ Financial market
◼ Market basket data analysis
◼ Scientific databases


◼ Medical diagnosis
Data Reduction and Transformation Techniques
Due to the tremendous size and high dimensionality of time-series data, data reduction often serves as the first step in time-series analysis. Major strategies for data reduction include attribute subset selection (which removes irrelevant or redundant attributes or dimensions), dimensionality reduction (which typically employs signal processing techniques to obtain a reduced version of the original data), and numerosity reduction (where data are replaced or estimated by alternative, smaller representations, such as histograms, clustering, and sampling). Because a time series can be viewed as data of very high dimensionality, where each point in time can be viewed as a dimension, dimensionality reduction is our major concern here.

1) Many techniques for signal analysis require the data to be in the frequency
domain.

2) Usually data-independent transformations are used


◼ The transformation matrix is determined a priori
◼ discrete Fourier transform (DFT)
◼ discrete wavelet transform (DWT)
3) The distance between two signals in the time domain is the same as their
Euclidean distance in the frequency domain
Several dimensionality reduction techniques can be used in time-series analysis. Examples include (1) the discrete Fourier transform (DFT) as the classical data reduction technique, (2) the more recently developed discrete wavelet transform (DWT), (3) Singular Value Decomposition (SVD) based on Principal Components Analysis (PCA), and (4) random projection-based sketch techniques, which can also give a good-quality synopsis of the data. Many techniques for signal analysis require the data to be in the frequency domain. Therefore, distance-preserving orthonormal transformations are often used to transform the data from the time domain to the frequency domain.
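A short numpy illustration of why this works: with the orthonormal DFT, time-domain and frequency-domain Euclidean distances agree (Parseval's theorem), and truncating to the first k coefficients can only shrink the distance, so an index built on those coefficients admits false matches but never false dismissals:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
y = rng.standard_normal(128)

# Orthonormal DFT preserves Euclidean distance:
X, Y = np.fft.fft(x, norm="ortho"), np.fft.fft(y, norm="ortho")
print(np.linalg.norm(x - y), np.linalg.norm(X - Y))   # (almost) identical

# Keeping only the first k coefficients lower-bounds the true distance:
k = 8
print(np.linalg.norm(X[:k] - Y[:k]) <= np.linalg.norm(x - y))   # True
```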

Indexing Methods for Similarity Search


For efficient access, a multidimensional index can be constructed using the first few Fourier coefficients. When a similarity query is submitted to the system, the index can be used to retrieve the sequences that are at most a certain small distance away from the query sequence. Postprocessing is then performed by computing the actual distance between sequences in the time domain and discarding any false matches.

For subsequence matching, each sequence can be broken down into a set of
“pieces” of windows with length w. In one approach, the features of the subsequence
inside each window are then extracted. Each sequence is mapped to a “trail” in the
feature space. The trail of each sequence is divided into “subtrails,” each represented
by a minimum bounding rectangle. A multipiece assembly algorithm can then be used
to search for longer sequence matches.

Similarity Search Methods


For similarity analysis of time-series data, Euclidean distance is typically used as a
similarity measure. Here, the smaller the distance between two sets of time-series data,
the more similar are the two series. However, we cannot directly apply the Euclidean
distance. Instead, we need to consider differences in the baseline and scale
(or amplitude) of our two series. For example, one stock’s value may have a baseline of
around $20 and fluctuate with a relatively large amplitude (such as between $15 and
$25), while another could have a baseline of around $100 and fluctuate with a
relatively small amplitude (such as between $90 and $110). The distance from one
baseline to another is referred to as the offset.
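A common way to remove these offset and amplitude differences before applying Euclidean distance is z-normalization. The sketch below uses illustrative series with the baselines described above:

```python
import numpy as np

def z_normalize(series):
    """Remove the baseline (offset) and rescale the amplitude."""
    s = np.asarray(series, dtype=float)
    return (s - s.mean()) / s.std()

a = [20, 15, 25, 20, 15, 25]        # baseline ~$20, amplitude ~$5
b = [100, 90, 110, 100, 90, 110]    # baseline ~$100, amplitude ~$10
print(np.linalg.norm(z_normalize(a) - z_normalize(b)))   # 0.0: same shape
```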

Steps for Performing a Similarity Search

1) Atomic matching
◼ Find all pairs of gap-free windows of a small length that are similar
2) Window stitching
◼ Stitch similar windows to form pairs of large similar subsequences
allowing gaps between atomic matches
3) Subsequence Ordering
◼ Linearly order the subsequence matches to determine whether enough
similar pieces exist.


Figure 8.5 Subsequence matching in time-series data (illustrating atomic matching, window stitching, and subsequence ordering).

Query Languages for Time Sequences

1) Time-sequence query language

◼ Should be able to specify sophisticated queries like

Find all of the sequences that are similar to some sequence in class A, but not
similar to any sequence in class B.

◼ Should be able to support various kinds of queries: range queries, all-pair queries, and nearest neighbor queries

2) Shape definition language

◼ Allows users to define and query the overall shape of time sequences


◼ Uses a human-readable series of sequence transitions or macros

◼ Ignores the specific details

◼ E.g., the pattern (up, Up, UP) can be used to describe increasing degrees of rising slopes

Mining Sequence Patterns in Transactional Databases


A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. There are many applications involving sequence data. Typical examples include customer shopping sequences, Web clickstreams, biological sequences, and sequences of events in science and engineering and in natural and social developments.

Sequential Pattern Mining: Concepts and Primitives

Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns. An example of a sequential pattern is "Customers who buy a Canon digital camera are likely to buy an HP color printer within a month." For retail data, sequential patterns are useful for shelf placement and promotions. This industry, as well as telecommunications and other businesses, may also use sequential patterns for targeted marketing, customer retention, and many other tasks. Other areas in which sequential patterns can be applied include Web access pattern analysis, weather prediction, production processes, and network intrusion detection.

Table 8.1: A Sequence database

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Given the support threshold min_sup = 2, <(ab)c> is a sequential pattern.
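The support of a sequential pattern can be checked directly against Table 8.1. A small Python sketch, representing each sequence as a list of itemsets and a pattern as a list of itemsets that must be contained in order:

```python
def contains(sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs, in order, in `sequence`."""
    i = 0
    for event in sequence:
        if i < len(pattern) and pattern[i] <= event:   # itemset containment
            i += 1
    return i == len(pattern)

db = {10: [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],
      20: [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],
      30: [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],
      40: [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}]}

pattern = [{'a', 'b'}, {'c'}]                    # the pattern <(ab)c>
support = sum(contains(s, pattern) for s in db.values())
print(support)                                   # 2, so <(ab)c> is frequent
```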


Scalable Methods for Mining Sequential Patterns


Sequential pattern mining is computationally challenging because such mining may generate and/or test a combinatorially explosive number of intermediate subsequences.

"How can we develop efficient and scalable methods for sequential pattern mining?" Recent developments have made progress in two directions: (1) efficient methods for mining the full set of sequential patterns, and (2) efficient methods for mining only the set of closed sequential patterns, where a sequential pattern s is closed if there exists no sequential pattern s′ such that s′ is a proper supersequence of s and s′ has the same (frequency) support as s.

Three approaches for sequential pattern mining are represented by the algorithms GSP, SPADE, and PrefixSpan, respectively. GSP adopts a candidate generate-and-test approach using the horizontal data format (where the data are represented as ⟨sequence ID : sequence of itemsets⟩, with each itemset being an event). SPADE adopts a candidate generate-and-test approach using the vertical data format (where the data are represented as ⟨itemset : (sequence ID, event ID)⟩). The vertical data format can be obtained by transforming a horizontally formatted sequence database in just one scan. PrefixSpan is a pattern-growth method, which does not require candidate generation.

All three approaches either directly or indirectly explore the Apriori property, stated as follows: every nonempty subsequence of a sequential pattern is a sequential pattern. (Recall that for a pattern to be called sequential, it must be frequent; that is, it must satisfy minimum support.) The Apriori property is antimonotonic (or downward-closed) in that, if a sequence cannot pass a test (e.g., regarding minimum support), all of its supersequences will also fail the test. Use of this property to prune the search space can help make the discovery of sequential patterns more efficient.

GSP: A Sequential Pattern Mining Algorithm Based on


Candidate Generate-and-Test

GSP (Generalized Sequential Pattern) is an extension of the seminal Apriori algorithm for frequent itemset mining. GSP uses the downward-closure property of sequential patterns and adopts a multiple-pass, candidate generate-and-test approach.

1) GSP (Generalized Sequential Pattern) mining algorithm


◼ proposed by Agrawal and Srikant, EDBT’96


2) Outline of the method

◼ Initially, every item in the DB is a candidate of length 1

◼ For each level (i.e., sequences of length k):

◼ Scan the database to collect the support count for each candidate sequence

◼ Generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori

◼ Repeat until no frequent sequence or no candidate can be found

3) Major strength: candidate pruning by Apriori

Examining GSP using an example:

Initial candidates: all singleton sequences

◼ <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>

Scan the database once, counting support for the candidates.

SPADE: An Apriori-Based Vertical Data Format


Sequential Pattern Mining Algorithm
SPADE (Sequential PAttern Discovery using Equivalence classes) was developed by Zaki (2001). The Apriori-like sequential pattern mining approach (based on candidate generate-and-test) can also be explored by mapping a sequence database into the vertical data format. In the vertical data format, the database becomes a set of tuples of the form itemset : (sequence ID, event ID). The event identifier serves as a timestamp within a sequence. The event ID of the ith itemset (or event) in a sequence is i. Note that an itemset can occur in more than one sequence. The set of (sequence ID, event ID) pairs for a given itemset forms the ID list of the itemset.

The mapping from horizontal to vertical format requires one scan of the database. A major advantage of using this format is that we can determine the support of any k-sequence by simply joining the ID lists of any two of its (k−1)-length subsequences. The length of the resulting ID list (i.e., the number of unique sequence ID values) equals the support of the k-sequence, which tells us whether the sequence is frequent.
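A simplified Python sketch of the vertical format and the temporal ID-list join (the ID lists below are illustrative toy data, not derived from Table 8.1); joining the lists for ⟨a⟩ and ⟨b⟩ yields the ID list, and hence the support, of the 2-sequence ⟨a b⟩:

```python
# Vertical format: itemset -> ID list of (sequence ID, event ID) pairs.
id_list = {
    'a': [(10, 1), (10, 2), (20, 1), (20, 4)],
    'b': [(10, 2), (20, 3)],
}

def temporal_join(list_x, list_y):
    """ID list of <x y>: y must occur after x within the same sequence."""
    return [(sid_y, eid_y)
            for (sid_x, eid_x) in list_x
            for (sid_y, eid_y) in list_y
            if sid_x == sid_y and eid_x < eid_y]

ab = temporal_join(id_list['a'], id_list['b'])
support = len({sid for sid, _ in ab})    # count distinct sequence IDs
print(ab, support)                       # [(10, 2), (20, 3)] 2
```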

PrefixSpan: Prefix-Projected Sequential Pattern Growth

Pattern growth is a method of frequent-pattern mining that does not require candidate generation. The technique originated in the FP-growth algorithm for transaction databases. The general idea of this approach is as follows: it finds the frequent single items, then compresses this information into a frequent-pattern tree, or FP-tree. The FP-tree is used to generate a set of projected databases, each associated with one frequent item. Each of these databases is mined separately. The algorithm builds prefix patterns, which it concatenates with suffix patterns to find frequent patterns, avoiding candidate generation. Here, we look at PrefixSpan, which extends the pattern-growth approach to instead mine sequential patterns.

Mining Closed Sequential Patterns


◼ A closed sequential pattern s: there exists no superpattern s′ such that s′ ⊃ s and s′ and s have the same support.

◼ Motivation: reduces the number of (redundant) patterns while retaining the same expressive power.

◼ Uses backward-subpattern and backward-superpattern pruning to prune the redundant search space.

CloSpan is an efficient closed sequential pattern mining method. The method is based on a property of sequence databases called equivalence of projected databases, stated as follows: two projected sequence databases, S|α = S|β (α ⊑ β, i.e., α is a subsequence of β), are equivalent if and only if the total number of items in S|α is equal to the total number of items in S|β.

Mining Multidimensional, Multilevel Sequential Patterns


Sequence identifiers (representing individual customers, for example) and sequence


items (such as products bought) are often associated with additional pieces of
information. Sequential pattern mining should take advantage of such additional
information to discover interesting patterns in multidimensional, multilevel
information space. Take customer shopping transactions, for instance. In a sequence
database for such data, the additional information associated with sequence IDs could
include customer age, address, group, and profession. Information associated with
items could include item category, brand, model type, model number, place
manufactured, and manufacture date. Mining multidimensional, multilevel sequential
patterns is the discovery of interesting patterns in such a broad dimensional space, at
different levels of detail.

Constraint-Based Mining of Sequential Patterns


Constraint-based mining incorporates user-specified constraints to reduce the search space and derive only patterns that are of interest to the user. Constraints can be expressed in many forms. They may specify desired relationships between attributes, attribute values, or aggregates within the resulting patterns mined. Regular expressions can also be used as constraints in the form of "pattern templates," which specify the desired form of the patterns to be mined. The key idea to note is that these kinds of constraints can be used during the mining process to confine the search space, thereby improving (1) the efficiency of the mining and (2) the interestingness of the resulting patterns found. This idea is also referred to as "pushing the constraints deep into the mining process."

1) Constraint-based sequential pattern mining


◼ Constraints: User-specified, for focused mining of desired
patterns
◼ How to explore efficient mining with constraints? —
Optimization
2) Classification of constraints

◼ Anti-monotone: e.g., value_sum(S) < 150, min(S) > 10

◼ Monotone: e.g., count(S) > 5, S ⊇ {PC, digital_camera}

◼ Succinct: e.g., length(S) ≥ 10, S ⊇ {Pentium, MS/Office, MS/Money}

◼ Convertible: e.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) − min(S) > 5

◼ Inconvertible: e.g., avg(S) − median(S) = 0
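A toy Python sketch (the item prices are invented for illustration) showing how the two most common constraint classes drive pruning during mining:

```python
# Invented item prices for illustration only.
prices = {'PC': 120, 'digital_camera': 60, 'MS/Office': 25}

def anti_monotone_ok(items):          # value_sum(S) < 150
    return sum(prices[i] for i in items) < 150

def monotone_ok(items):               # S contains both PC and digital_camera
    return {'PC', 'digital_camera'} <= items

candidate = {'PC', 'digital_camera'}
# Anti-monotone: once violated, every super-pattern also fails -> prune now.
if not anti_monotone_ok(candidate):
    print("prune all super-patterns of", candidate)
# Monotone: once satisfied, every super-pattern satisfies it -> stop testing.
print(monotone_ok(candidate))         # True
```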
