Change-Point Detection in Time Series Data by Relative Density-Ratio Estimation

Neural Networks
where P_t and P_{t+n} are probability distributions of samples in Y(t) and Y(t + n), respectively. D(P ‖ P′) denotes the f-divergence (Ali & Silvey, 1966; Csiszár, 1967):

D(P ‖ P′) := ∫ p′(Y) f( p(Y) / p′(Y) ) dY,   (2)

where f is a convex function such that f(1) = 0, and p(Y) and p′(Y) are probability density functions of P and P′, respectively. We assume that p(Y) and p′(Y) are strictly positive. Since the f-divergence is asymmetric (i.e., D(P ‖ P′) ≠ D(P′ ‖ P)), we symmetrize it in our dissimilarity measure (1) for all divergence-based methods.^3

The f-divergence includes various popular divergences, such as the Kullback–Leibler (KL) divergence given by f(t) = t log t (Kullback & Leibler, 1951) and the Pearson (PE) divergence given by f(t) = (1/2)(t − 1)^2 (Pearson, 1900):

KL(P ‖ P′) := ∫ p(Y) log( p(Y) / p′(Y) ) dY,   (3)

PE(P ‖ P′) := (1/2) ∫ p′(Y) ( p(Y)/p′(Y) − 1 )^2 dY.   (4)

Since the probability densities p(Y) and p′(Y) are unknown in practice, we cannot directly compute the f-divergence (and thus the dissimilarity measure). A naive way to cope with this problem is to perform density estimation and plug the estimated densities p̂(Y) and p̂′(Y) into the definition of the f-divergence. However, density estimation is known to be a hard problem (Vapnik, 1998), and thus such a plug-in approach is not reliable in practice.

Recently, a novel method of divergence approximation based on direct density-ratio estimation was explored (Kanamori et al., 2009; Nguyen, Wainwright, & Jordan, 2010; Sugiyama et al., 2008). The basic idea of direct density-ratio estimation is to learn the density-ratio function p(Y)/p′(Y) without going through separate density estimation of p(Y) and p′(Y). An intuitive rationale is that knowing the two densities p(Y) and p′(Y) means knowing their ratio, but not vice versa; knowing the ratio p(Y)/p′(Y) does not necessarily mean knowing the two densities p(Y) and p′(Y), because such a decomposition is not unique (see Fig. 1). This implies that estimating the density ratio is substantially easier than estimating the densities, and thus directly estimating the density ratio would be more promising^4 (Sugiyama et al., 2012a).

In the rest of this section, we review three methods of directly estimating the density ratio p(Y)/p′(Y) from samples {Y_i}_{i=1}^n and {Y′_j}_{j=1}^n drawn from p(Y) and p′(Y): the KL importance estimation procedure (KLIEP) (Sugiyama et al., 2008) in Section 3.2, unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009) in Section 3.3, and relative uLSIF (RuLSIF) (Yamada et al., in press) in Section 3.4.

3.2. KLIEP

KLIEP (Sugiyama et al., 2008) is a direct density-ratio estimation algorithm that is suitable for estimating the KL divergence.

3.2.1. Density-ratio model

Let us model the density ratio p(Y)/p′(Y) by the following kernel model:

g(Y; θ) := Σ_{ℓ=1}^n θ_ℓ K(Y, Y_ℓ),   (5)

where θ := (θ_1, …, θ_n)⊤ are parameters to be learned from data samples, and K(Y, Y′) is a kernel basis function. In practice, we use the Gaussian kernel:

K(Y, Y′) = exp( −‖Y − Y′‖^2 / (2σ^2) ),

where σ (> 0) is the kernel width. In all our experiments, the kernel width σ is determined by cross-validation.

3.2.2. Learning algorithm

The parameters θ in the model g(Y; θ) are determined so that the KL divergence from p(Y) to g(Y; θ)p′(Y) is minimized:

KL = ∫ p(Y) log[ p(Y) / (p′(Y) g(Y; θ)) ] dY
   = ∫ p(Y) log[ p(Y)/p′(Y) ] dY − ∫ p(Y) log g(Y; θ) dY.

After ignoring the first term, which is irrelevant to g(Y; θ), and approximating the second term with its empirical estimate, the KLIEP optimization problem is given as follows:

max_θ (1/n) Σ_{i=1}^n log( Σ_{ℓ=1}^n θ_ℓ K(Y_i, Y_ℓ) ),

s.t. (1/n) Σ_{j=1}^n Σ_{ℓ=1}^n θ_ℓ K(Y′_j, Y_ℓ) = 1 and θ_1, …, θ_n ≥ 0.

The equality constraint serves the normalization purpose, because g(Y; θ)p′(Y) should be a probability density function. The inequality constraint comes from the non-negativity of the density-ratio function. Since this is a convex optimization problem, the unique globally optimal solution θ̂ can be obtained simply, for example, by gradient-projection iterations. Finally, a density-ratio estimator is given as

ĝ(Y) = Σ_{ℓ=1}^n θ̂_ℓ K(Y, Y_ℓ).

---
^3 In the previous work (Kawahara & Sugiyama, 2012), the asymmetric dissimilarity measure D(P_t ‖ P_{t+n}) was used. As we numerically illustrate in Section 4, the use of the symmetrized divergence contributes highly to improving the performance. For this reason, we decided to use the symmetrized dissimilarity measure (1).

^4 Vladimir Vapnik advocated in his seminal book (Vapnik, 1998) that one should avoid solving a more difficult problem as an intermediate step. The support vector machine (Cortes & Vapnik, 1995) is a representative example that demonstrates the usefulness of this principle: it avoids solving the more general problem of estimating the data-generating probability distributions, and only learns a decision boundary that is sufficient for pattern recognition. The idea of direct density-ratio estimation also follows Vapnik's principle.
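As a concrete illustration of the gradient-projection iteration described in Section 3.2.2, here is a minimal NumPy sketch of KLIEP. The kernel width, learning rate, and iteration count below are illustrative choices (the paper determines the kernel width by cross-validation), not the authors' settings.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    # K(x, c) = exp(-||x - c||^2 / (2 sigma^2)) for all pairs (x, c).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kliep(Y_num, Y_den, sigma=1.0, lr=1e-2, n_iter=1000):
    """Gradient-projection sketch of KLIEP: maximize the empirical
    log-likelihood objective subject to the normalization and
    non-negativity constraints, with kernel centers at the numerator
    samples."""
    n = Y_num.shape[0]
    K_num = gaussian_kernel(Y_num, Y_num, sigma)   # K(Y_i, Y_l)
    K_den = gaussian_kernel(Y_den, Y_num, sigma)   # K(Y'_j, Y_l)
    b = K_den.mean(axis=0)                         # normalization-constraint vector
    theta = np.ones(n) / n
    for _ in range(n_iter):
        g = K_num @ theta + 1e-12                  # g(Y_i; theta), kept positive
        theta = theta + lr * (K_num.T @ (1.0 / g)) / n   # gradient ascent step
        theta = np.maximum(theta, 0.0)             # project onto theta_l >= 0
        theta = theta / (b @ theta)                # enforce (1/n) sum_j g(Y'_j) = 1
    g_hat = K_num @ theta
    KL_hat = np.mean(np.log(g_hat))                # KL-divergence approximator
    return theta, KL_hat
```

With samples from the two windows Y(t) and Y(t + n), `KL_hat` plays the role of the (asymmetric) KLIEP-based change-point score of Section 3.2.3.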
S. Liu et al. / Neural Networks 43 (2013) 72–83
KLIEP was shown to achieve the optimal non-parametric convergence rate (Nguyen et al., 2010; Sugiyama et al., 2008).

3.2.3. Change-point detection by KLIEP

Given a density-ratio estimator ĝ(Y), an approximator of the KL divergence is given as

KL̂ := (1/n) Σ_{i=1}^n log ĝ(Y_i).

In the previous work (Kawahara & Sugiyama, 2012), this KLIEP-based KL-divergence estimator was applied to change-point detection and demonstrated to be promising in experiments.

3.3. uLSIF

Recently, another direct density-ratio estimator called uLSIF was proposed (Kanamori et al., 2009, 2012b), which is suitable for estimating the PE divergence.

3.3.1. Learning algorithm

In uLSIF, the same density-ratio model as in KLIEP is used (see Section 3.2.1). However, its training criterion is different: the density-ratio model is fitted to the true density ratio under the squared loss. More specifically, the parameter θ in the model g(Y; θ) is determined so that the following squared loss J(θ) is minimized:

J(θ) = (1/2) ∫ ( p(Y)/p′(Y) − g(Y; θ) )^2 p′(Y) dY
     = (1/2) ∫ ( p(Y)/p′(Y) )^2 p′(Y) dY − ∫ p(Y) g(Y; θ) dY + (1/2) ∫ g(Y; θ)^2 p′(Y) dY.

Since the first term is a constant, we focus on the last two terms. By substituting for g(Y; θ) the model stated in (5) and approximating the integrals by empirical averages, the uLSIF optimization problem is given as follows:

min_{θ ∈ R^n} [ (1/2) θ⊤ Ĥ θ − ĥ⊤ θ + (λ/2) θ⊤ θ ],   (6)

where the penalty term (λ/2) θ⊤θ is included for regularization purposes. λ (≥ 0) denotes the regularization parameter, which is chosen by cross-validation (Sugiyama et al., 2008). Ĥ is the n × n matrix with the (ℓ, ℓ′)-th element given by

Ĥ_{ℓ,ℓ′} := (1/n) Σ_{j=1}^n K(Y′_j, Y_ℓ) K(Y′_j, Y_{ℓ′}).   (7)

ĥ is the n-dimensional vector with the ℓ-th element given by

ĥ_ℓ := (1/n) Σ_{i=1}^n K(Y_i, Y_ℓ).

It is easy to confirm that the solution θ̂ of (6) can be obtained analytically as

θ̂ = (Ĥ + λ I_n)^{−1} ĥ,   (8)

where I_n denotes the n-dimensional identity matrix. Finally, a density-ratio estimator is given as

ĝ(Y) = Σ_{ℓ=1}^n θ̂_ℓ K(Y, Y_ℓ).

3.3.2. Change-point detection by uLSIF

Given a density-ratio estimator ĝ(Y), an approximator of the PE divergence can be constructed as

PÊ := −(1/(2n)) Σ_{j=1}^n ĝ(Y′_j)^2 + (1/n) Σ_{i=1}^n ĝ(Y_i) − 1/2.

This approximator is derived from the following expression of the PE divergence (Sugiyama, Kawanabe, & Chui, 2010; Sugiyama, Yamada et al., 2011):

PE(P ‖ P′) = −(1/2) ∫ ( p(Y)/p′(Y) )^2 p′(Y) dY + ∫ ( p(Y)/p′(Y) ) p(Y) dY − 1/2.   (9)

The first two terms of (9) are actually the negative of the uLSIF optimization objective without regularization. This expression can also be obtained from the fact that the f-divergence D(P ‖ P′) is lower-bounded via the Legendre–Fenchel convex duality (Rockafellar, 1970) as follows (Keziou, 2003; Nguyen, Wainwright, & Jordan, 2007):

D(P ‖ P′) = sup_h [ ∫ p(Y) h(Y) dY − ∫ p′(Y) f*(h(Y)) dY ],   (10)

where f* is the convex conjugate of the convex function f defined in (2). The PE divergence corresponds to f(t) = (1/2)(t − 1)^2, for which the convex conjugate is given by f*(t*) = (t*)^2/2 + t*. For f(t) = (1/2)(t − 1)^2, the supremum is achieved when p(Y)/p′(Y) = h(Y) + 1. Substituting h(Y) = p(Y)/p′(Y) − 1 into (10), we obtain (9).

uLSIF has some notable advantages: its solution can be computed analytically (Kanamori et al., 2009), and it possesses the optimal non-parametric convergence rate (Kanamori et al., 2012b). Moreover, it has optimal numerical stability (Kanamori et al., in press), and it is more robust than KLIEP (Sugiyama et al., 2012b). In Section 4, we will experimentally demonstrate that uLSIF-based change-point detection compares favorably with the KLIEP-based method.

3.4. RuLSIF

Depending on the condition of the denominator density p′(Y), the density-ratio value p(Y)/p′(Y) can be unbounded (i.e., it can be infinite). This is problematic because the non-parametric convergence rate of uLSIF is governed by the sup-norm of the true density-ratio function: max_Y p(Y)/p′(Y). To overcome this problem, relative density-ratio estimation was introduced (Yamada et al., in press).

3.4.1. Relative PE divergence

Let us consider the α-relative PE-divergence for 0 ≤ α < 1:

PE_α(P ‖ P′) := PE(P ‖ αP + (1 − α)P′) = (1/2) ∫ p′_α(Y) ( p(Y)/p′_α(Y) − 1 )^2 dY,

where p′_α(Y) = α p(Y) + (1 − α) p′(Y) is the α-mixture density. We refer to

r_α(Y) = p(Y) / ( α p(Y) + (1 − α) p′(Y) )

as the α-relative density-ratio. The α-relative density-ratio reduces to the plain density-ratio if α = 0, and it tends to be smoother as α gets larger. Indeed, one can confirm that the α-relative density-ratio is bounded above by 1/α for α > 0, even when the plain density-ratio p(Y)/p′(Y) is unbounded. This was proved to contribute to improving the estimation accuracy (Yamada et al., in press).

As explained in Section 3.1, we use the symmetrized divergence PE_α(P ‖ P′) + PE_α(P′ ‖ P) as the change-point score, where each term is estimated separately.
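The analytic solution (8) carries over to the relative setting. The following sketch computes a PE-divergence estimate for both plain uLSIF (α = 0) and RuLSIF; the α-weighted counterpart of Ĥ and the PE_α estimator follow the RuLSIF formulation of Yamada et al. (in press) rather than formulas stated in this section, and σ and λ are illustrative values (the paper chooses them by cross-validation).

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    # K(x, c) = exp(-||x - c||^2 / (2 sigma^2)) for all pairs (x, c).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rulsif_pe(Y_num, Y_den, alpha=0.1, sigma=1.0, lam=0.1):
    """Analytic (Ru)LSIF fit and alpha-relative PE-divergence estimate.
    alpha = 0 recovers plain uLSIF with H-hat of Eq. (7) and the
    analytic solution of Eq. (8)."""
    n = Y_num.shape[0]
    K_num = gaussian_kernel(Y_num, Y_num, sigma)       # K(Y_i, Y_l)
    K_den = gaussian_kernel(Y_den, Y_num, sigma)       # K(Y'_j, Y_l)
    # alpha-mixture of the empirical second-moment matrices
    H = (alpha * K_num.T @ K_num / K_num.shape[0]
         + (1 - alpha) * K_den.T @ K_den / K_den.shape[0])
    h = K_num.mean(axis=0)                             # h-hat of the text
    theta = np.linalg.solve(H + lam * np.eye(n), h)    # analytic solution, Eq. (8)
    g_num = K_num @ theta                              # g-hat at numerator samples
    g_den = K_den @ theta                              # g-hat at denominator samples
    # alpha-relative PE-divergence approximator
    return (-alpha * np.mean(g_num ** 2) / 2
            - (1 - alpha) * np.mean(g_den ** 2) / 2
            + np.mean(g_num) - 0.5)
```

The symmetrized change-point score of Section 3.1 is then `rulsif_pe(A, B) + rulsif_pe(B, A)` for the two segments A and B.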
Fig. 4. Illustrative time-series samples (upper) and the change-point scores obtained by the RuLSIF-based method (lower). The true change points are marked by black vertical lines in the upper graphs.
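The windowed scoring scheme behind curves like those in Fig. 4 can be sketched generically. The subsequence embedding Y(t) and the hyper-parameters k and n are taken from the formulation in Section 2 (an assumption, since the construction is defined earlier in the paper); the toy mean-difference "divergence" passed in below is a hypothetical stand-in for the RuLSIF-based PE_α estimator.

```python
import numpy as np

def change_scores(x, k=10, n=50, divergence=None):
    """Slide two adjacent groups of n length-k subsequences along a 1-d
    series and score every split point with a symmetrized divergence
    estimate, as in the paper's retrospective formulation."""
    # Subsequence (Hankel-style) embedding: row t covers samples t .. t+k-1.
    Y = np.array([x[t:t + k] for t in range(len(x) - k + 1)])
    scores = []
    for t in range(n, Y.shape[0] - n + 1):
        A, B = Y[t - n:t], Y[t:t + n]          # n past / n future subsequences
        scores.append(divergence(A, B) + divergence(B, A))  # symmetrized score
    return np.array(scores)
```

With an abrupt mean shift in the input series, the resulting score sequence peaks near the true change point, mirroring the behavior shown in Fig. 4.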
where ε_t is Gaussian noise with mean μ and standard deviation 1.5. The initial values are set as y(1) = y(2) = 0. A change point is inserted at every 100 time steps by setting the noise mean μ at time t as

μ_N = 0 for N = 1, and μ_N = μ_{N−1} + N/16 for N = 2, …, 49,

where N is a natural number such that 100(N − 1) + 1 ≤ t ≤ 100N.

Dataset 2 (Scaling variance): The same auto-regressive model as in Dataset 1 is used, but a change point is inserted at every 100 time steps by setting the noise standard deviation σ at time t as

σ_N = 1 for N = 1, 3, …, 49, and σ_N = ln(e + N/4) for N = 2, 4, …, 48.

Dataset 3 (Switching covariance): 2-dimensional samples of size 5000 are drawn from the origin-centered normal distribution, and a change point is inserted at every 100 time steps by setting the covariance matrix Σ at time t as

Σ_N = [[1, c_N], [c_N, 1]], with off-diagonal element c_N = −(4/5 + (N − 2)/500) for N = 1, 3, …, 49 and c_N = +(4/5 + (N − 2)/500) for N = 2, 4, …, 48.

Dataset 4 (Changing frequency): 1-dimensional samples of size 5000 are generated as

y(t) = sin(ω t) + ε_t,

where ε_t is origin-centered Gaussian noise with standard deviation 0.8. A change point is inserted at every 100 points by changing the frequency ω at time t as

ω_N = 1 for N = 1, and ω_N = ω_{N−1} ln(e + N/2) for N = 2, …, 49.

Note that, to explore the ability to detect change points with different significance, we purposely made the latter change points more significant than the earlier ones in the above datasets.

Fig. 4 shows examples of these datasets for the last 10 change points and the corresponding change-point scores obtained by the proposed RuLSIF-based method. Although the last 10 change points are the most significant, we can see from the graphs that, for Datasets 3 and 4, these change points can hardly be identified even by humans. Nevertheless, the change-point score obtained by the proposed RuLSIF-based method increases rapidly after changes occur.

Next, we compare the performance of the RuLSIF-based, uLSIF-based, and KLIEP-based methods in terms of receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) values. We define the true positive rate and false positive rate in the following way (Kawahara & Sugiyama, 2012):

True positive rate (TPR): n_cr / n_cp,
False positive rate (FPR): (n_al − n_cr) / n_al,

where n_cr denotes the number of times change points are correctly detected, n_cp denotes the number of all change points, and n_al is the number of all detection alarms.

Following the strategy of previous research (Desobry et al., 2005; Harchaoui, Bach, & Moulines, 2009), peaks of the change-point score are regarded as detection alarms. More specifically, a detection alarm at step t is regarded as correct if there exists a true alarm at step t* such that t ∈ [t* − 10, t* + 10]. To avoid duplication, we remove the k-th alarm at step t_k if t_k − t_{k−1} < 20.

We set up a threshold η for filtering out all alarms whose change-point scores are lower than or equal to η. Initially, we set η to be equal to the score of the highest peak. Then, by lowering η gradually, both TPR and FPR become non-decreasing. For each η, we plot TPR and FPR on the graph, and thus a monotone curve can be drawn.

Fig. 5 illustrates ROC curves averaged over 50 runs with different random seeds for each dataset. Table 1 describes the mean and standard deviation of the AUC values over the 50 runs. The best method and methods comparable to it by the t-test at significance level 5% are shown in boldface. The experimental results show
Fig. 6. AUC plots for n = 25, 50, 75 and k = 5, 10, 15. The horizontal axes denote k, while the vertical axes denote AUC values.
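The TPR/FPR bookkeeping defined above can be sketched directly. The text does not fully specify whether one true change point may be credited to several alarms, so this sketch counts each true point at most once (an assumption):

```python
def tpr_fpr(alarms, true_points, tol=10):
    """TPR = n_cr / n_cp and FPR = (n_al - n_cr) / n_al, where an alarm
    at step t is correct if some true change point t* satisfies
    |t - t*| <= tol; each true point is matched at most once."""
    matched = set()
    n_cr = 0
    for t in alarms:
        for t_star in true_points:
            if t_star not in matched and abs(t - t_star) <= tol:
                matched.add(t_star)
                n_cr += 1
                break
    n_al, n_cp = len(alarms), len(true_points)
    tpr = n_cr / n_cp if n_cp else 0.0
    fpr = (n_al - n_cr) / n_al if n_al else 0.0
    return tpr, fpr
```

Sweeping the threshold η from the highest peak downward and recomputing (TPR, FPR) for the surviving alarms traces out the ROC curve described above.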
three-axis accelerometers. The task of change-point detection is to segment the time-series data according to the 6 behaviors: "stay", "walk", "jog", "skip", "stair up", and "stair down". The starting time of each behavior is arbitrarily decided by each user. Because the orientation of the accelerometers is not necessarily fixed, we take the ℓ2-norm of the 3-dimensional (i.e., x-, y-, and z-axis) data.

In Fig. 7(a), examples of the original time-series, true change points, and change-point scores obtained by the RuLSIF-based method are plotted. This shows that the change-point score clearly captures trends of changing behaviors, except the changes around times 1200 and 1500. However, because these changes are difficult to recognize even for humans, we do not regard them as critical flaws. Fig. 7(b) illustrates ROC curves averaged over 10 datasets, and Fig. 7(c) describes AUC values for each of the 10 datasets. The experimental results show that the proposed RuLSIF-based method tends to perform better than the other methods.

Next, we use the IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC) dataset provided by the National
(a) One of the original signals and the change-point scores obtained by the RuLSIF-based method. (b) Average ROC curves. (c) AUC values. The best method and methods comparable to it by the t-test at significance level 5% are shown in boldface.
Institute of Informatics (NII),^6 which records human voice in a noisy environment. The task is to extract speech sections from the recorded signals. This dataset offers several voice recordings with different background noises (e.g., noise of a highway and of a restaurant). Segmentation of the beginning and ending of human voice is manually annotated. Note that we only use the annotations as the ground truth for the final performance evaluation, not for change-point detection (i.e., this experiment is still completely unsupervised).

Fig. 8(a) illustrates an example of the original signals, true change points, and change-point scores obtained by the proposed RuLSIF-based method. This shows that the proposed method still gives clear indications for speech segments. Fig. 8(b) and (c) show average ROC curves over 10 datasets and AUC values for each of the 10 datasets. The results show that the proposed method significantly outperforms the other methods.

4.3. Twitter dataset

Finally, we apply the proposed change-point detection method to the CMU Twitter dataset,^7 which is an archive of Twitter messages collected from February 2010 to October 2010 via the Twitter application programming interface.

Here we track the degree of popularity of a given topic by monitoring the frequency of selected keywords. More specifically, we focus on events related to the Deepwater Horizon oil spill in the Gulf of Mexico, which occurred on April 20, 2010,^8 and was widely broadcast in the Twitter community. We use the frequencies of 10 keywords: "gulf", "spill", "bp", "oil", "hayward", "mexico", "coast", "transocean", "halliburton", and "obama" (see Fig. 9(a)). We perform change-point detection directly on the 10-dimensional data, with the hope that we can capture correlation changes between multiple keywords, in addition to changes in the frequency of each keyword.

For quantitative evaluation, we referred to the Wikipedia entry "Timeline of the Deepwater Horizon oil spill"^9 as a real-world event source. The change-point score obtained by the proposed RuLSIF-based method is plotted in Fig. 9(b), where four occurrences of important real-world events show the development of this news story.

As we can see from Fig. 9(b), the change-point score increases immediately after the initial explosion of the Deepwater Horizon oil platform and soon reaches its first peak when oil was found on the shore of Louisiana on April 30. Shortly after BP announced its preliminary estimate of the amount of leaking oil, the change-point score rises quickly again and reaches its second peak at the

^6 http://research.nii.ac.jp/src/eng/list/index.html.
^7 http://www.ark.cs.cmu.edu/tweets/.
^8 http://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill.
^9 http://en.wikipedia.org/wiki/Timeline_of_the_Deepwater_Horizon_oil_spill.
(a) One of the original signals and the change-point scores obtained by the RuLSIF-based method. (b) Average ROC curves. (c) AUC values. The best method and methods comparable to it by the t-test at significance level 5% are shown in boldface.
end of May, at which time President Obama visited Louisiana to assure local residents of the federal government's support. On June 25, BP stock was at its one-year lowest price, while the change-point score spikes for the third time. Finally, BP cut off the spill on July 15, as the score reaches its last peak.

5. Conclusion and future perspectives

In this paper, we first formulated the problem of retrospective change-point detection as the problem of comparing two probability distributions over two consecutive time segments. We then provided a comprehensive review of state-of-the-art density-ratio and divergence estimation methods, which are key building blocks of our change-point detection methods. Our contributions in this paper were to extend the existing KLIEP-based change-point detection method (Kawahara & Sugiyama, 2012), and to propose the use of uLSIF as a building block. uLSIF has various theoretical and practical advantages; for example, the uLSIF solution can be computed analytically, it possesses the optimal non-parametric convergence rate, it has optimal numerical stability, and it has higher robustness than KLIEP. We further proposed the use of RuLSIF, a novel divergence estimation paradigm that recently emerged in the machine learning community. RuLSIF inherits the good properties of uLSIF, and moreover it possesses an even better non-parametric convergence property. Through extensive experiments on artificial datasets and real-world datasets including human-activity sensing, speech, and Twitter messages, we demonstrated that the proposed RuLSIF-based change-point detection method is promising.

Though we estimated a density ratio between two consecutive segments, some earlier research (Basseville & Nikiforov, 1993; Gustafsson, 1996, 2000) introduced a hyper-parameter that controls the size of a margin between the two segments. In our preliminary experiments, however, we did not observe a significant improvement by changing the margin. For this reason, we decided to use a straightforward model in which the two segments have no margin in between.

Through the experiment illustrated in Fig. 6 in Section 4.1, we can see that the performance of the proposed method is affected by the choice of the hyper-parameters n and k. However, discovering optimal values for these parameters remains a challenge, which will be investigated in our future work.

RuLSIF was shown to possess a better convergence property than uLSIF (Yamada et al., in press) in terms of density-ratio estimation. However, how this theoretical advantage in density-ratio estimation can be translated into practical performance improvement in change detection is still not clear, beyond the intuition that a better divergence estimator gives a better change score. We will address this issue more formally in future work.

Although the proposed RuLSIF-based change-point detection was shown to work well even for multi-dimensional time-series data, its accuracy may be further improved by incorporating dimensionality reduction. Recently, several attempts were made to combine dimensionality reduction with direct density-ratio estimation (Sugiyama et al., 2010; Sugiyama, Yamada et al., 2011; Yamada & Sugiyama, 2011). Our future work will apply these techniques to change-point detection and evaluate their practical usefulness.

Compared with other approaches, methods based on density-ratio estimation tend to be computationally more expensive because of the cross-validation procedure for model selection. However, thanks to the analytic solution, the RuLSIF- and uLSIF-based methods are computationally more efficient than the KLIEP-based method, which requires an iterative optimization procedure
(b) Change-point score obtained by the RuLSIF-based method and exemplary real-world events.
(see Fig. 9 in Kanamori et al. (2009) for a detailed time comparison between uLSIF and KLIEP). An important piece of our future work is to further improve the computational efficiency of the RuLSIF-based method.

In this paper, we focused on computing the change-point score that represents the plausibility of change points. Another possible formulation is hypothesis testing, which provides a useful threshold to determine whether a point is a change point. Methodologically, it is straightforward to extend the proposed method to produce p-values, following the recent literature (Kanamori, Suzuki, & Sugiyama, 2012a; Sugiyama, Suzuki, Itoh, Kanamori, & Kimura, 2011). However, computing the p-value is often time-consuming, particularly in a non-parametric setup. Thus, overcoming this computational bottleneck is important future work for making this approach more practical.

Recent reports pointed out that Twitter messages can be indicative of real-world events (Petrović, Osborne, & Lavrenko, 2010; Sakaki, Okazaki, & Matsuo, 2010). Following this line, we showed in Section 4.3 that our change-detection method can be used as a novel tool for analyzing Twitter messages. An important future challenge along this line includes automatic keyword selection for topics of interest.

Acknowledgments

SL was supported by the NII internship fund and the JST PRESTO program. MY and MS were supported by the JST PRESTO program. NC was supported by the NII Grand Challenge project fund.

References

Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. Technical Report. arXiv:0710.3742v1 [stat.ML].
Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28(1), 131–142.
Basseville, M., & Nikiforov, I. V. (1993). Detection of abrupt changes: theory and application. Upper Saddle River, NJ, USA: Prentice-Hall.
Bellman, R. (1961). Adaptive control processes: a guided tour. Princeton, NJ, USA: Princeton University Press.
Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on machine learning (pp. 81–88).
Brodsky, B., & Darkhovsky, B. (1993). Nonparametric methods in change-point problems. Dordrecht, the Netherlands: Kluwer Academic Publishers.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
Csörgő, M., & Horváth, L. (1988). Nonparametric methods for changepoint problems. In P. R. Krishnaiah, & C. R. Rao (Eds.), Handbook of statistics, vol. 7 (pp. 403–425). Amsterdam, the Netherlands: Elsevier.
Desobry, F., Davy, M., & Doncarli, C. (2005). An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53(8), 2961–2974.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY, USA: Chapman & Hall/CRC.
Garnett, R., Osborne, M. A., & Roberts, S. J. (2009). Sequential Bayesian prediction in the presence of changepoints. In Proceedings of the 26th annual international conference on machine learning (pp. 345–352).
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., & Schölkopf, B. (2009). Covariate shift by kernel mean matching. In J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, & N. Lawrence (Eds.), Dataset shift in machine learning (pp. 131–160). Cambridge, MA, USA: MIT Press (Chapter 8).
Guralnik, V., & Srivastava, J. (1999). Event detection from time series data. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 33–42).
Gustafsson, F. (1996). The marginalized likelihood ratio test for detecting abrupt changes. IEEE Transactions on Automatic Control, 41(1), 66–78.
Gustafsson, F. (2000). Adaptive filtering and change detection. Chichester, UK: Wiley.
Harchaoui, Z., Bach, F., & Moulines, E. (2009). Kernel change-point analysis. Advances in Neural Information Processing Systems, 21, 609–616.
Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publications.
Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. (2011). Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2), 309–336.
Ide, T., & Tsuda, K. (2007). Change-point detection using Krylov subspace learning. In Proceedings of the SIAM international conference on data mining (pp. 515–520).
Itoh, N., & Kurths, J. (2010). Change-point detection of climate time series by nonparametric method. In Proceedings of the world congress on engineering and computer science, 2010, vol. 1.
Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012a). f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2), 708–720.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012b). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3), 335–367.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: a condition number analysis. Machine Learning (in press).
Kawahara, Y., & Sugiyama, M. (2012). Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2), 114–127.
Kawahara, Y., Yairi, T., & Machida, K. (2007). Change-point detection in time-series data based on subspace identification. In Proceedings of the 7th IEEE international conference on data mining (pp. 559–564).
Keziou, A. (2003). Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336(10), 857–862.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86.
Moskvina, V., & Zhigljavsky, A. (2003a). Application of singular-spectrum analysis to change-point detection in time series. School of Mathematics, Cardiff University. http://slb.cf.ac.uk/maths/subsites/stats/changepoint/CH_P_T_S.pdf.
Moskvina, V., & Zhigljavsky, A. (2003b). An algorithm based on singular spectrum analysis for change-point detection. Communications in Statistics: Simulation and Computation, 32(2), 319–352.
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2007). Nonparametric estimation of the likelihood ratio and divergence functionals. In Proceedings of the IEEE international symposium on information theory (pp. 2016–2020).
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
Paquet, U. (2007). Empirical Bayesian change point detection. Graphical Models, 1995, 1–20.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
Petrović, S., Osborne, M., & Lavrenko, V. (2010). Streaming first story detection with application to Twitter. In Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 181–189).
Reeves, J., Chen, J., Wang, X. L., Lund, R., & Lu, Q. (2007). A review and comparison of changepoint detection techniques for climate data. Journal of Applied Meteorology and Climatology, 46(6), 900–915.
Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ, USA: Princeton University Press.
Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World Wide Web (pp. 851–860).
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Sugiyama, M., Kawanabe, M., & Chui, P. L. (2010). Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Networks, 23(1), 44–59.
Sugiyama, M., Suzuki, T., Itoh, Y., Kanamori, T., & Kimura, M. (2011). Least-squares two-sample test. Neural Networks, 24(7), 735–751.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012a). Density ratio estimation in machine learning. Cambridge, UK: Cambridge University Press.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012b). Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.
Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Buenau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4), 699–746.
Sugiyama, M., Yamada, M., von Bünau, P., Suzuki, T., Kanamori, T., & Kawanabe, M. (2011). Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks, 24(2), 183–198.
Takeuchi, J., & Yamanishi, K. (2006). A unifying framework for detecting outliers and change points from non-stationary time series data. IEEE Transactions on Knowledge and Data Engineering, 18(4), 482–492.
Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.
Wang, Y., Wu, C., Ji, Z., Wang, B., & Liang, Y. (2011). Non-parametric change-point method for differential gene expression detection. PLoS ONE, 6(5), e20060.
Yamada, M., & Sugiyama, M. (2011). Direct density-ratio estimation with dimensionality reduction via hetero-distributional subspace analysis. In Proceedings of the twenty-fifth AAAI conference on artificial intelligence (pp. 549–554).
Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., & Sugiyama, M. (2013). Relative density-ratio estimation for robust distribution comparison. Neural Computation (in press).
Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 320–324).