
Neural Networks 43 (2013) 72–83
doi:10.1016/j.neunet.2013.01.012


Change-point detection in time-series data by relative density-ratio estimation


Song Liu (a,*), Makoto Yamada (b), Nigel Collier (c,d), Masashi Sugiyama (a)

a Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan
b NTT Communication Science Laboratories, 2-4, Hikaridai, Seika-cho, Kyoto 619-0237, Japan
c National Institute of Informatics (NII), 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
d European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

* Corresponding author. Tel.: +81 080 4464 8710.
E-mail addresses: [email protected] (S. Liu), [email protected] (M. Yamada), [email protected] (N. Collier), [email protected] (M. Sugiyama).
URL: http://sugiyama-www.cs.titech.ac.jp/sugi (M. Sugiyama).

Article history:
Received 21 February 2012
Received in revised form 26 December 2012
Accepted 15 January 2013

Keywords:
Change-point detection
Distribution comparison
Relative density-ratio estimation
Kernel methods
Time-series data

Abstract

The objective of change-point detection is to discover abrupt property changes lying behind time-series data. In this paper, we present a novel statistical change-point detection algorithm based on non-parametric divergence estimation between time-series samples from two retrospective segments. Our method uses the relative Pearson divergence as a divergence measure, and it is accurately and efficiently estimated by a method of direct density-ratio estimation. Through experiments on artificial and real-world datasets including human-activity sensing, speech, and Twitter messages, we demonstrate the usefulness of the proposed method.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Detecting abrupt changes in time-series data, called change-point detection, has attracted researchers in the statistics and data mining communities for decades (Basseville & Nikiforov, 1993; Brodsky & Darkhovsky, 1993; Gustafsson, 2000). Depending on the delay of detection, change-point detection methods can be classified into two categories: real-time detection (Adams & MacKay, 2007; Garnett, Osborne, & Roberts, 2009; Paquet, 2007) and retrospective detection (Basseville & Nikiforov, 1993; Moskvina & Zhigljavsky, 2003a; Takeuchi & Yamanishi, 2006).

Real-time change-point detection targets applications that require immediate responses, such as robot control. On the other hand, although retrospective change-point detection requires longer reaction periods, it tends to give more robust and accurate detection. Retrospective change-point detection accommodates various applications that allow certain delays, for example, climate change detection (Reeves, Chen, Wang, Lund, & Lu, 2007), genetic time-series analysis (Wang, Wu, Ji, Wang, & Liang, 2011), signal segmentation (Basseville & Nikiforov, 1993), and intrusion detection in computer networks (Yamanishi, Takeuchi, Williams, & Milne, 2000). In this paper, we focus on the retrospective change-point detection scenario and propose a novel non-parametric method.

Over the decades, pioneering works have demonstrated good change-point detection performance by comparing the probability distributions of time-series samples over past and present intervals (Basseville & Nikiforov, 1993). As both intervals move forward, a typical strategy is to issue an alarm for a change point when the two distributions become significantly different. Various change-point detection methods follow this strategy, for example, the cumulative sum (Basseville & Nikiforov, 1993), the generalized likelihood-ratio method (Gustafsson, 1996), and the change finder (Takeuchi & Yamanishi, 2006). Such a strategy has also been employed in novelty detection (Guralnik & Srivastava, 1999) and outlier detection (Hido, Tsuboi, Kashima, Sugiyama, & Kanamori, 2011).

Another group of methods that has attracted high popularity in recent years is the subspace methods (Ide & Tsuda, 2007; Kawahara, Yairi, & Machida, 2007; Moskvina & Zhigljavsky, 2003a, 2003b). By using a pre-designed time-series model, a subspace is discovered by principal component analysis from trajectories in past and present intervals, and their dissimilarity is measured by the distance between the subspaces. One of the major approaches is called subspace identification, which compares the subspaces spanned by the columns of an extended observability matrix generated by a state-space model with system noise (Kawahara et al., 2007). Recent efforts along this line of research have led to a computationally efficient algorithm based on Krylov subspace learning (Ide & Tsuda, 2007) and a successful application to detecting climate change in south Kenya (Itoh & Kurths, 2010).

The methods explained above rely on pre-designed parametric models, such as underlying probability distributions (Basseville & Nikiforov, 1993; Gustafsson, 1996), auto-regressive models (Takeuchi & Yamanishi, 2006), and state-space models (Ide & Tsuda, 2007; Kawahara et al., 2007; Moskvina & Zhigljavsky, 2003a, 2003b), for tracking specific statistics such as the mean, the variance, and the spectrum. As alternatives, non-parametric methods such as kernel density estimation (Brodsky & Darkhovsky, 1993; Csörgő & Horváth, 1988) are designed with no particular parametric assumption. However, they tend to be less accurate in high-dimensional problems because of the so-called curse of dimensionality (Bellman, 1961; Vapnik, 1998).

To overcome this difficulty, a new strategy was introduced recently, which estimates the ratio of probability densities directly without going through density estimation (Sugiyama, Suzuki, & Kanamori, 2012a). The rationale of this density-ratio estimation idea is that knowing the two densities implies knowing the density ratio, but not vice versa; knowing the ratio does not necessarily imply knowing the two densities because such a decomposition is not unique (Fig. 1). Thus, direct density-ratio estimation is substantially easier than density estimation (Sugiyama et al., 2012a). Following this idea, methods of direct density-ratio estimation have been developed (Sugiyama, Suzuki, & Kanamori, 2012b), e.g., kernel mean matching (Gretton et al., 2009), the logistic-regression method (Bickel, Brückner, & Scheffer, 2007), and the Kullback–Leibler importance estimation procedure (KLIEP) (Sugiyama et al., 2008). In the context of change-point detection, KLIEP was reported to outperform other approaches (Kawahara & Sugiyama, 2012) such as the one-class support vector machine (Desobry, Davy, & Doncarli, 2005; Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001) and singular-spectrum analysis (Moskvina & Zhigljavsky, 2003b). Thus, change-point detection based on direct density-ratio estimation is promising.

Fig. 1. Rationale of direct density-ratio estimation.

The goal of this paper is to further advance this line of research. More specifically, our contributions in this paper are twofold. The first contribution is to apply a recently proposed density-ratio estimation method called unconstrained least-squares importance fitting (uLSIF) (Kanamori, Hido, & Sugiyama, 2009) to change-point detection. The basic idea of uLSIF is to directly learn the density-ratio function in the least-squares fitting framework. Notable advantages of uLSIF are that its solution can be computed analytically (Kanamori et al., 2009), it achieves the optimal non-parametric convergence rate (Kanamori, Suzuki, & Sugiyama, 2012b), it has the optimal numerical stability (Kanamori, Suzuki, & Sugiyama, in press), and it has higher robustness than KLIEP (Sugiyama et al., 2012b). Through experiments on a range of datasets, we demonstrate the superior detection accuracy of the uLSIF-based change-point detection method.

The second contribution of this paper is to further improve the uLSIF-based change-point detection method by employing a state-of-the-art extension of uLSIF called relative uLSIF (RuLSIF) (Yamada, Suzuki, Kanamori, Hachiya, & Sugiyama, in press). A potential weakness of the density-ratio based approach is that density ratios can be unbounded (i.e., they can be infinity) if the denominator density is not well-defined. The basic idea of RuLSIF is to consider relative density ratios, which are smoother and always bounded from above. Theoretically, it was proved that RuLSIF possesses a better non-parametric convergence property than plain uLSIF (Yamada et al., in press), implying that RuLSIF gives an even better estimate from a small number of samples. We experimentally demonstrate that our RuLSIF-based change-point detection method compares favorably with other approaches.

The rest of this paper is structured as follows: In Section 2, we formulate our change-point detection problem. In Section 3, we describe our proposed change-point detection algorithms based on uLSIF and RuLSIF, together with a review of the KLIEP-based method. In Section 4, we report experimental results on various artificial and real-world datasets including human-activity sensing, speech, and Twitter messages from February 2010 to October 2010. Finally, in Section 5, conclusions together with future perspectives are stated.

2. Problem formulation

In this section, we formulate our change-point detection problem.

Let $y(t) \in \mathbb{R}^d$ be a $d$-dimensional time-series sample at time $t$. Let

$$Y(t) := [\,y(t)^\top, y(t+1)^\top, \ldots, y(t+k-1)^\top\,]^\top \in \mathbb{R}^{dk}$$

be a subsequence¹ of the time series at time $t$ with length $k$, where $\top$ represents the transpose. Following the previous work (Kawahara & Sugiyama, 2012), we treat the subsequence $Y(t)$ as a sample, instead of a single $d$-dimensional time-series sample $y(t)$, by which time-dependent information can be incorporated naturally (see Fig. 2). Let $\mathcal{Y}(t)$ be a set of $n$ retrospective subsequence samples starting at time $t$:

$$\mathcal{Y}(t) := \{\,Y(t), Y(t+1), \ldots, Y(t+n-1)\,\}.$$

Note that $[\,Y(t), Y(t+1), \ldots, Y(t+n-1)\,] \in \mathbb{R}^{dk \times n}$ forms a Hankel matrix and plays a key role in change-point detection based on subspace learning (Kawahara et al., 2007; Moskvina & Zhigljavsky, 2003a).

For change-point detection, let us consider two consecutive segments $\mathcal{Y}(t)$ and $\mathcal{Y}(t+n)$. Our strategy is to compute a certain dissimilarity measure between $\mathcal{Y}(t)$ and $\mathcal{Y}(t+n)$, and use it as the plausibility of change points. More specifically, the higher the dissimilarity measure is, the more likely the point is a change point.²

Now the problems that need to be addressed are what kind of dissimilarity measure we should use and how we estimate it from data. We will discuss these issues in the next section.

¹ In fact, only in the case of a one-dimensional time series is $Y(t)$ a subsequence. For higher-dimensional time series, $Y(t)$ concatenates the subsequences of all dimensions into a one-dimensional vector.

² Another possible formulation is to compare distributions of samples in $\mathcal{Y}(t)$ and $\mathcal{Y}(t+n)$ in the framework of hypothesis testing (Henkel, 1976). Although this gives a useful threshold to determine whether a point is a change point, computing the p-value is often time consuming, particularly in a non-parametric setup (Efron & Tibshirani, 1993). For this reason, we do not take the hypothesis testing approach in this paper, although it is methodologically straightforward to extend the proposed approach to the hypothesis testing framework.

Fig. 2. An illustrative example of notations on one-dimensional time-series data.

3. Change-point detection via density-ratio estimation

In this section, we first define our dissimilarity measure, and then show methods for estimating it.

3.1. Divergence-based dissimilarity measure and density-ratio estimation

In this paper, we use a dissimilarity measure of the following form:

$$D(P_t \,\|\, P_{t+n}) + D(P_{t+n} \,\|\, P_t), \qquad (1)$$

where $P_t$ and $P_{t+n}$ are probability distributions of samples in $\mathcal{Y}(t)$ and $\mathcal{Y}(t+n)$, respectively. $D(P \,\|\, P')$ denotes the $f$-divergence (Ali & Silvey, 1966; Csiszár, 1967):

$$D(P \,\|\, P') := \int p'(Y)\, f\!\left(\frac{p(Y)}{p'(Y)}\right) dY, \qquad (2)$$

where $f$ is a convex function such that $f(1) = 0$, and $p(Y)$ and $p'(Y)$ are probability density functions of $P$ and $P'$, respectively. We assume that $p(Y)$ and $p'(Y)$ are strictly positive. Since the $f$-divergence is asymmetric (i.e., $D(P \,\|\, P') \neq D(P' \,\|\, P)$), we symmetrize it in our dissimilarity measure (1) for all divergence-based methods.³

The $f$-divergence includes various popular divergences, such as the Kullback–Leibler (KL) divergence by $f(t) = t \log t$ (Kullback & Leibler, 1951) and the Pearson (PE) divergence by $f(t) = \frac{1}{2}(t-1)^2$ (Pearson, 1900):

$$\mathrm{KL}(P \,\|\, P') := \int p(Y) \log \frac{p(Y)}{p'(Y)}\, dY, \qquad (3)$$

$$\mathrm{PE}(P \,\|\, P') := \frac{1}{2} \int p'(Y) \left(\frac{p(Y)}{p'(Y)} - 1\right)^2 dY. \qquad (4)$$

Since the probability densities $p(Y)$ and $p'(Y)$ are unknown in practice, we cannot directly compute the $f$-divergence (and thus the dissimilarity measure). A naive way to cope with this problem is to perform density estimation and plug the estimated densities $\hat{p}(Y)$ and $\hat{p}'(Y)$ into the definition of the $f$-divergence. However, density estimation is known to be a hard problem (Vapnik, 1998), and thus such a plug-in approach is not reliable in practice.

Recently, a novel method of divergence approximation based on direct density-ratio estimation was explored (Kanamori et al., 2009; Nguyen, Wainwright, & Jordan, 2010; Sugiyama et al., 2008). The basic idea of direct density-ratio estimation is to learn the density-ratio function $p(Y)/p'(Y)$ without going through separate density estimation of $p(Y)$ and $p'(Y)$. An intuitive rationale of direct density-ratio estimation is that knowing the two densities $p(Y)$ and $p'(Y)$ means knowing their ratio, but not vice versa; knowing the ratio $p(Y)/p'(Y)$ does not necessarily mean knowing the two densities, because such a decomposition is not unique (see Fig. 1). This implies that estimating the density ratio is substantially easier than estimating the densities, and thus directly estimating the density ratio would be more promising⁴ (Sugiyama et al., 2012a).

In the rest of this section, we review three methods of directly estimating the density ratio $p(Y)/p'(Y)$ from samples $\{Y_i\}_{i=1}^{n}$ and $\{Y'_j\}_{j=1}^{n}$ drawn from $p(Y)$ and $p'(Y)$: the KL importance estimation procedure (KLIEP) (Sugiyama et al., 2008) in Section 3.2, unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009) in Section 3.3, and relative uLSIF (RuLSIF) (Yamada et al., in press) in Section 3.4.

³ In the previous work (Kawahara & Sugiyama, 2012), the asymmetric dissimilarity measure $D(P_t \,\|\, P_{t+n})$ was used. As we numerically illustrate in Section 4, the use of the symmetrized divergence contributes highly to improving the performance. For this reason, we decided to use the symmetrized dissimilarity measure (1).

⁴ Vladimir Vapnik advocated in his seminal book (Vapnik, 1998) that one should avoid solving a more difficult problem as an intermediate step. The support vector machine (Cortes & Vapnik, 1995) is a representative example that demonstrates the usefulness of this principle: it avoids solving the more general problem of estimating data-generating probability distributions, and only learns a decision boundary that is sufficient for pattern recognition. The idea of direct density-ratio estimation also follows Vapnik's principle.

3.2. KLIEP

KLIEP (Sugiyama et al., 2008) is a direct density-ratio estimation algorithm that is suitable for estimating the KL divergence.

3.2.1. Density-ratio model

Let us model the density ratio $p(Y)/p'(Y)$ by the following kernel model:

$$g(Y; \theta) := \sum_{\ell=1}^{n} \theta_\ell K(Y, Y_\ell), \qquad (5)$$

where $\theta := (\theta_1, \ldots, \theta_n)^\top$ are parameters to be learned from data samples, and $K(Y, Y')$ is a kernel basis function. In practice, we use the Gaussian kernel:

$$K(Y, Y') = \exp\!\left(-\frac{\|Y - Y'\|^2}{2\sigma^2}\right),$$

where $\sigma\ (> 0)$ is the kernel width. In all our experiments, the kernel width is determined based on cross-validation.

3.2.2. Learning algorithm

The parameters $\theta$ in the model $g(Y; \theta)$ are determined so that the KL divergence from $p(Y)$ to $g(Y; \theta)\,p'(Y)$ is minimized:

$$\mathrm{KL} = \int p(Y) \log \frac{p(Y)}{p'(Y)\, g(Y; \theta)}\, dY = \int p(Y) \log \frac{p(Y)}{p'(Y)}\, dY - \int p(Y) \log g(Y; \theta)\, dY.$$

After ignoring the first term, which is irrelevant to $g(Y; \theta)$, and approximating the second term with its empirical estimate, the KLIEP optimization problem is given as follows:

$$\max_{\theta}\ \frac{1}{n} \sum_{i=1}^{n} \log\!\left(\sum_{\ell=1}^{n} \theta_\ell K(Y_i, Y_\ell)\right), \quad \text{s.t.}\quad \frac{1}{n} \sum_{j=1}^{n} \sum_{\ell=1}^{n} \theta_\ell K(Y'_j, Y_\ell) = 1 \ \ \text{and} \ \ \theta_1, \ldots, \theta_n \geq 0.$$

The equality constraint is for normalization, because $g(Y; \theta)\,p'(Y)$ should be a probability density function. The inequality constraint comes from the non-negativity of the density-ratio function. Since this is a convex optimization problem, the unique global optimal solution $\hat{\theta}$ can be obtained simply, for example, by gradient-projection iteration. Finally, a density-ratio estimator is given as

$$\hat{g}(Y) = \sum_{\ell=1}^{n} \hat{\theta}_\ell K(Y, Y_\ell).$$

KLIEP was shown to achieve the optimal non-parametric convergence rate (Nguyen et al., 2010; Sugiyama et al., 2008).

3.2.3. Change-point detection by KLIEP

Given a density-ratio estimator $\hat{g}(Y)$, an approximator of the KL divergence is given as

$$\widehat{\mathrm{KL}} := \frac{1}{n} \sum_{i=1}^{n} \log \hat{g}(Y_i).$$

In the previous work (Kawahara & Sugiyama, 2012), this KLIEP-based KL-divergence estimator was applied to change-point detection and demonstrated to be promising in experiments.
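For concreteness, the following is a compact sketch of the KLIEP procedure above (Python/NumPy; this is our own simplified rendering, not the authors' code — the step size, iteration count, and plain projected gradient ascent are illustrative assumptions, and σ would in practice be chosen by the cross-validation described in Section 3.2.1):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K[i, l] = exp(-||A_i - B_l||^2 / (2 sigma^2)); rows are samples."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kliep_fit(X_num, X_den, sigma, n_iter=1000, step=1e-2):
    """KLIEP estimate of p/p' from X_num ~ p and X_den ~ p' (rows = samples).

    Kernel centers are the numerator samples, as in the model (5).
    """
    K_num = gaussian_kernel(X_num, X_num, sigma)  # appears in the objective
    K_den = gaussian_kernel(X_den, X_num, sigma)  # appears in the constraint
    n = X_num.shape[0]
    theta = np.ones(n) / n
    b = K_den.mean(axis=0)                        # constraint: b @ theta = 1
    theta /= b @ theta
    for _ in range(n_iter):
        # gradient of (1/n) sum_i log(K_num[i] @ theta)
        grad = (K_num / (K_num @ theta)[:, None]).mean(axis=0)
        theta = np.maximum(theta + step * grad, 0.0)  # non-negativity
        theta /= b @ theta                            # re-normalize
    return theta

def kl_score(theta, X_num, sigma):
    """KL-hat: mean log estimated density ratio over the numerator samples."""
    g = gaussian_kernel(X_num, X_num, sigma) @ theta
    return np.log(np.maximum(g, 1e-12)).mean()
```

In the change-point setting, X_num and X_den hold the subsequence samples of the two consecutive segments as rows.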

3.3. uLSIF

Recently, another direct density-ratio estimator called uLSIF was proposed (Kanamori et al., 2009, 2012b), which is suitable for estimating the PE divergence.

3.3.1. Learning algorithm

In uLSIF, the same density-ratio model as in KLIEP is used (see Section 3.2.1). However, its training criterion is different: the density-ratio model is fitted to the true density ratio under the squared loss. More specifically, the parameter $\theta$ in the model $g(Y; \theta)$ is determined so that the following squared loss $J(Y)$ is minimized:

$$J(Y) = \frac{1}{2} \int \left(\frac{p(Y)}{p'(Y)} - g(Y; \theta)\right)^2 p'(Y)\, dY = \frac{1}{2} \int \frac{p(Y)^2}{p'(Y)}\, dY - \int p(Y)\, g(Y; \theta)\, dY + \frac{1}{2} \int g(Y; \theta)^2\, p'(Y)\, dY.$$

Since the first term is a constant, we focus on the last two terms. By substituting $g(Y; \theta)$ with our model (5) and approximating the integrals by empirical averages, the uLSIF optimization problem is given as follows:

$$\min_{\theta \in \mathbb{R}^n}\ \frac{1}{2} \theta^\top \hat{H} \theta - \hat{h}^\top \theta + \frac{\lambda}{2} \theta^\top \theta, \qquad (6)$$

where the penalty term $\frac{\lambda}{2}\theta^\top\theta$ is included for regularization. $\lambda\ (\geq 0)$ denotes the regularization parameter, which is chosen by cross-validation (Sugiyama et al., 2008). $\hat{H}$ is the $n \times n$ matrix with the $(\ell, \ell')$-th element given by

$$\hat{H}_{\ell,\ell'} := \frac{1}{n} \sum_{j=1}^{n} K(Y'_j, Y_\ell)\, K(Y'_j, Y_{\ell'}). \qquad (7)$$

$\hat{h}$ is the $n$-dimensional vector with the $\ell$-th element given by

$$\hat{h}_\ell := \frac{1}{n} \sum_{i=1}^{n} K(Y_i, Y_\ell).$$

It is easy to confirm that the solution $\hat{\theta}$ of (6) can be obtained analytically as

$$\hat{\theta} = (\hat{H} + \lambda I_n)^{-1} \hat{h}, \qquad (8)$$

where $I_n$ denotes the $n$-dimensional identity matrix. Finally, a density-ratio estimator is given as

$$\hat{g}(Y) = \sum_{\ell=1}^{n} \hat{\theta}_\ell K(Y, Y_\ell).$$

3.3.2. Change-point detection by uLSIF

Given a density-ratio estimator $\hat{g}(Y)$, an approximator of the PE divergence can be constructed as

$$\widehat{\mathrm{PE}} := -\frac{1}{2n} \sum_{j=1}^{n} \hat{g}(Y'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \hat{g}(Y_i) - \frac{1}{2}.$$

This approximator is derived from the following expression of the PE divergence (Sugiyama, Kawanabe, & Chui, 2010; Sugiyama, Yamada et al., 2011):

$$\mathrm{PE}(P \,\|\, P') = -\frac{1}{2} \int \left(\frac{p(Y)}{p'(Y)}\right)^2 p'(Y)\, dY + \int \frac{p(Y)}{p'(Y)}\, p(Y)\, dY - \frac{1}{2}. \qquad (9)$$

The first two terms of (9) are actually the negative uLSIF optimization objective without regularization. This expression can also be obtained from the fact that the $f$-divergence $D(P \,\|\, P')$ is lower-bounded via Legendre–Fenchel convex duality (Rockafellar, 1970) as follows (Keziou, 2003; Nguyen, Wainwright, & Jordan, 2007):

$$D(P \,\|\, P') = \sup_{h} \left(\int p(Y)\, h(Y)\, dY - \int p'(Y)\, f^*(h(Y))\, dY\right), \qquad (10)$$

where $f^*$ is the convex conjugate of the convex function $f$ defined in (2). The PE divergence corresponds to $f(t) = \frac{1}{2}(t-1)^2$, for which the convex conjugate is given by $f^*(t^*) = \frac{(t^*)^2}{2} + t^*$. For $f(t) = \frac{1}{2}(t-1)^2$, the supremum is achieved when $\frac{p(Y)}{p'(Y)} = h(Y) + 1$. Substituting $h(Y) = \frac{p(Y)}{p'(Y)} - 1$ into (10), we obtain (9).

uLSIF has some notable advantages: its solution can be computed analytically (Kanamori et al., 2009), and it possesses the optimal non-parametric convergence rate (Kanamori et al., 2012b). Moreover, it has the optimal numerical stability (Kanamori et al., in press), and it is more robust than KLIEP (Sugiyama et al., 2012b). In Section 4, we will experimentally demonstrate that uLSIF-based change-point detection compares favorably with the KLIEP-based method.
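Because (8) is available in closed form, the uLSIF step reduces to a few lines of linear algebra. A sketch of this (Python/NumPy, reusing gaussian_kernel from the KLIEP sketch above; σ and λ are assumed to be chosen by the cross-validation described in the text):

```python
import numpy as np

def ulsif_fit(X_num, X_den, sigma, lam):
    """Analytic uLSIF solution (8): theta = (H + lam I)^{-1} h."""
    K_num = gaussian_kernel(X_num, X_num, sigma)    # rows: numerator samples
    K_den = gaussian_kernel(X_den, X_num, sigma)    # rows: denominator samples
    n = X_num.shape[0]
    H = K_den.T @ K_den / n                         # Eq. (7)
    h = K_num.mean(axis=0)                          # h_l = (1/n) sum_i K(Y_i, Y_l)
    return np.linalg.solve(H + lam * np.eye(n), h)  # Eq. (8)

def pe_score(theta, X_num, X_den, sigma):
    """PE-hat = -(1/2n) sum_j g(Y'_j)^2 + (1/n) sum_i g(Y_i) - 1/2."""
    g_num = gaussian_kernel(X_num, X_num, sigma) @ theta
    g_den = gaussian_kernel(X_den, X_num, sigma) @ theta
    return -0.5 * (g_den ** 2).mean() + g_num.mean() - 0.5
```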

3.4. RuLSIF

Depending on the condition of the denominator density $p'(Y)$, the density-ratio value $p(Y)/p'(Y)$ can be unbounded (i.e., it can be infinity). This is problematic because the non-parametric convergence rate of uLSIF is governed by the sup-norm of the true density-ratio function, $\max_Y \frac{p(Y)}{p'(Y)}$. To overcome this problem, relative density-ratio estimation was introduced (Yamada et al., in press).

3.4.1. Relative PE divergence

Let us consider the $\alpha$-relative PE divergence for $0 \leq \alpha < 1$:

$$\mathrm{PE}_\alpha(P \,\|\, P') := \mathrm{PE}(P \,\|\, \alpha P + (1-\alpha) P') = \frac{1}{2} \int p'_\alpha(Y) \left(\frac{p(Y)}{p'_\alpha(Y)} - 1\right)^2 dY,$$

where $p'_\alpha(Y) = \alpha\, p(Y) + (1-\alpha)\, p'(Y)$ is the $\alpha$-mixture density. We refer to

$$r_\alpha(Y) = \frac{p(Y)}{\alpha\, p(Y) + (1-\alpha)\, p'(Y)}$$

as the $\alpha$-relative density ratio. The $\alpha$-relative density ratio reduces to the plain density ratio if $\alpha = 0$, and it tends to be smoother as $\alpha$ gets larger. Indeed, one can confirm that the $\alpha$-relative density ratio is bounded above by $1/\alpha$ for $\alpha > 0$, even when the plain density ratio $p(Y)/p'(Y)$ is unbounded. This was proved to contribute to improving the estimation accuracy (Yamada et al., in press).

As explained in Section 3.1, we use the symmetrized divergence

$$\mathrm{PE}_\alpha(P \,\|\, P') + \mathrm{PE}_\alpha(P' \,\|\, P)$$

as a change-point score, where each term is estimated separately.

3.4.2. Learning algorithm

For approximating the $\alpha$-relative density ratio $r_\alpha(Y)$, we still use the same kernel model $g(Y; \theta)$ given by (5). In the same way as in the uLSIF method, the parameter $\theta$ is learned by minimizing the squared loss between the true and estimated relative ratios:

$$J(Y) = \frac{1}{2} \int p'_\alpha(Y) \left(r_\alpha(Y) - g(Y; \theta)\right)^2 dY = \frac{1}{2} \int p'_\alpha(Y)\, r_\alpha^2(Y)\, dY - \int p(Y)\, g(Y; \theta)\, dY + \frac{\alpha}{2} \int p(Y)\, g(Y; \theta)^2\, dY + \frac{1-\alpha}{2} \int p'(Y)\, g(Y; \theta)^2\, dY.$$

Again, by ignoring the constant and approximating the expectations by sample averages, the $\alpha$-relative density ratio can be learned in the same way as the plain density ratio. Indeed, the optimization problem of the relative variant of uLSIF, called RuLSIF, takes the same form as (6); the only difference is the definition of the matrix $\hat{H}$:

$$\hat{H}_{\ell,\ell'} := \frac{\alpha}{n} \sum_{i=1}^{n} K(Y_i, Y_\ell)\, K(Y_i, Y_{\ell'}) + \frac{1-\alpha}{n} \sum_{j=1}^{n} K(Y'_j, Y_\ell)\, K(Y'_j, Y_{\ell'}).$$

Thus, the advantages of uLSIF regarding the analytic solution, numerical stability, and robustness are maintained in RuLSIF. Furthermore, RuLSIF possesses an even better non-parametric convergence property than uLSIF (Yamada et al., in press).

3.4.3. Change-point detection by RuLSIF

By using an estimator $\hat{g}(Y)$ of the $\alpha$-relative density ratio, the $\alpha$-relative PE divergence can be approximated as

$$\widehat{\mathrm{PE}}_\alpha := -\frac{\alpha}{2n} \sum_{i=1}^{n} \hat{g}(Y_i)^2 - \frac{1-\alpha}{2n} \sum_{j=1}^{n} \hat{g}(Y'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \hat{g}(Y_i) - \frac{1}{2}.$$

In Section 4, we will experimentally demonstrate that RuLSIF-based change-point detection performs even better than the plain uLSIF-based method.

4. Experiments

In this section, we experimentally investigate the performance of the proposed and existing change-point detection methods on artificial and real-world datasets including human-activity sensing, speech, and Twitter messages. The MATLAB implementation of the proposed method is available at http://sugiyama-www.cs.titech.ac.jp/~song/change_detection/.

For all experiments, we fix the parameters at $n = 50$ and $k = 10$. The parameter $\alpha$ in the RuLSIF-based method is fixed to 0.1. Sensitivity to different parameter choices and further issues regarding algorithm-specific parameter tuning will be discussed below.

4.1. Artificial datasets

As mentioned in Section 3.1, we use the symmetrized divergence for change-point detection. We first illustrate how symmetrization of the PE divergence affects the change-point detection performance.

The top graph in Fig. 3 shows an artificial time-series signal that consists of three segments of equal length 200. The samples are drawn from $N(0, 2^2)$, $N(0, 1^2)$, and $N(0, 2^2)$, respectively, where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. Thus, the variance changes at times 200 and 400. In this experiment, we consider three types of divergence measures:

PE (Symmetric): $\mathrm{PE}_\alpha(P_t \,\|\, P_{t+n}) + \mathrm{PE}_\alpha(P_{t+n} \,\|\, P_t)$,
PE (Forward): $\mathrm{PE}_\alpha(P_t \,\|\, P_{t+n})$,
PE (Backward): $\mathrm{PE}_\alpha(P_{t+n} \,\|\, P_t)$.

Fig. 3. (Top) The original signal (blue) is segmented into 3 sections of equal length. Samples are drawn from the normal distributions N(0, 2²), N(0, 1²), and N(0, 2²), respectively. (Bottom) Symmetric (red) and asymmetric (black and green) PE divergences. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The three divergences are compared in the bottom graph of Fig. 3. As we can see from the graphs, PE (Forward) detects the first change point successfully, but not the second one. On the other hand, PE (Backward) behaves oppositely. This implies that combining the forward and backward divergences can improve the overall change-point detection performance. For this reason, we use only PE (Symmetric) as the change-point score of the proposed method from here on.

Next, we illustrate the behavior of our proposed RuLSIF-based method, and then compare its performance with the uLSIF-based and KLIEP-based methods. In our implementation, two sets of candidate parameters,

$$\sigma \in \{0.6\, d_{\mathrm{med}},\ 0.8\, d_{\mathrm{med}},\ d_{\mathrm{med}},\ 1.2\, d_{\mathrm{med}},\ 1.4\, d_{\mathrm{med}}\} \quad \text{and} \quad \lambda \in \{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}\},$$

are provided to the cross-validation procedure, where $d_{\mathrm{med}}$ denotes the median distance between samples. The best combination of these parameters is chosen by grid search via cross-validation. We use 5-fold cross-validation for all experiments.

We use the following 4 artificial time-series datasets that contain manually inserted change points:

Dataset 1 (Jumping mean): The following 1-dimensional auto-regressive model, borrowed from Takeuchi and Yamanishi (2006), is used to generate 5000 samples (i.e., $t = 1, \ldots, 5000$):

$$y(t) = 0.6\, y(t-1) - 0.5\, y(t-2) + \epsilon_t,$$

where $\epsilon_t$ is Gaussian noise with mean $\mu$ and standard deviation 1.5. The initial values are set as $y(1) = y(2) = 0$. A change point is inserted at every 100 time steps by setting the noise mean at time $t$ as

$$\mu_N = \begin{cases} 0 & N = 1,\\ \mu_{N-1} + \dfrac{N}{16} & N = 2, \ldots, 49, \end{cases}$$

where $N$ is a natural number such that $100(N-1) + 1 \leq t \leq 100N$.

Dataset 2 (Scaling variance): The same auto-regressive model as in Dataset 1 is used, but a change point is inserted at every 100 time steps by setting the noise standard deviation at time $t$ as

$$\sigma_N = \begin{cases} 1 & N = 1, 3, \ldots, 49,\\ \ln\!\left(e + \dfrac{N}{4}\right) & N = 2, 4, \ldots, 48. \end{cases}$$

Dataset 3 (Switching covariance): 2-dimensional samples of size 5000 are drawn from the origin-centered normal distribution, and a change point is inserted at every 100 time steps by setting the covariance matrix at time $t$ as

$$\Sigma_N = \begin{pmatrix} 1 & -\frac{4}{5} - \frac{N-2}{500}\\[2pt] -\frac{4}{5} - \frac{N-2}{500} & 1 \end{pmatrix} \ (N = 1, 3, \ldots, 49), \qquad \Sigma_N = \begin{pmatrix} 1 & \frac{4}{5} + \frac{N-2}{500}\\[2pt] \frac{4}{5} + \frac{N-2}{500} & 1 \end{pmatrix} \ (N = 2, 4, \ldots, 48).$$

Dataset 4 (Changing frequency): 1-dimensional samples of size 5000 are generated as

$$y(t) = \sin(\omega t) + \epsilon_t,$$

where $\epsilon_t$ is origin-centered Gaussian noise with standard deviation 0.8. A change point is inserted at every 100 time steps by changing the frequency $\omega$ at time $t$ as

$$\omega_N = \begin{cases} 1 & N = 1,\\ \omega_{N-1} \ln\!\left(e + \dfrac{N}{2}\right) & N = 2, \ldots, 49. \end{cases}$$

Note that, to explore the ability to detect change points of different significance, we purposely made the later change points more significant than the earlier ones in the above datasets.

Fig. 4. Illustrative time-series samples (upper) and the change-point score obtained by the RuLSIF-based method (lower) for (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, and (d) Dataset 4. The true change points are marked by black vertical lines in the upper graphs.

Fig. 4 shows examples of these datasets for the last 10 change points and the corresponding change-point scores obtained by the proposed RuLSIF-based method. Although the last 10 change points are the most significant, we can see from the graphs that, for Datasets 3 and 4, these change points can hardly be identified even by humans. Nevertheless, the change-point score obtained by the proposed RuLSIF-based method increases rapidly after changes occur.

Next, we compare the performance of the RuLSIF-based, uLSIF-based, and KLIEP-based methods in terms of receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) values. We define the true positive rate and false positive rate in the following way (Kawahara & Sugiyama, 2012):

True positive rate (TPR): $n_{\mathrm{cr}} / n_{\mathrm{cp}}$,
False positive rate (FPR): $(n_{\mathrm{al}} - n_{\mathrm{cr}}) / n_{\mathrm{al}}$,

where $n_{\mathrm{cr}}$ denotes the number of times change points are correctly detected, $n_{\mathrm{cp}}$ denotes the number of all change points, and $n_{\mathrm{al}}$ is the number of all detection alarms.

Following the strategy of previous research (Desobry et al., 2005; Harchaoui, Bach, & Moulines, 2009), peaks of a change-point score are regarded as detection alarms. More specifically, a detection alarm at step $t$ is regarded as correct if there exists a true change point at step $t^*$ such that $t^* \in [t - 10, t + 10]$. To avoid duplication, we remove the $k$-th alarm at step $t_k$ if $t_k - t_{k-1} < 20$.

We set up a threshold for filtering out all alarms whose change-point scores are lower than or equal to it. Initially, we set the threshold equal to the score of the highest peak. Then, by lowering the threshold gradually, both TPR and FPR become non-decreasing. For each threshold value, we plot TPR and FPR on the graph, and thus a monotone curve can be drawn.
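Our reading of this evaluation protocol in code form (Python; the ±10-step tolerance and the 20-step de-duplication window are taken from the text above, while the simple local-maximum peak definition is our own assumption):

```python
def alarms_from_score(score, threshold):
    """Peaks of the change-point score above the threshold, de-duplicated."""
    peaks = [t for t in range(1, len(score) - 1)
             if score[t] > score[t - 1] and score[t] >= score[t + 1]
             and score[t] > threshold]
    alarms = []
    for t in peaks:
        if not alarms or t - alarms[-1] >= 20:  # drop near-duplicate alarms
            alarms.append(t)
    return alarms

def tpr_fpr(alarms, true_points, tol=10):
    """TPR = n_cr / n_cp and FPR = (n_al - n_cr) / n_al.

    n_cr counts alarms lying within `tol` steps of some true change point.
    """
    n_cr = sum(any(abs(t - s) <= tol for s in true_points) for t in alarms)
    n_al = len(alarms)
    tpr = n_cr / len(true_points)
    fpr = 0.0 if n_al == 0 else (n_al - n_cr) / n_al
    return tpr, fpr
```

Sweeping the threshold from the highest peak downward and plotting the resulting (FPR, TPR) pairs traces the ROC curve described above.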

Fig. 5. Average ROC curves of the RuLSIF-based, uLSIF-based, and KLIEP-based methods for (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, and (d) Dataset 4.

Fig. 5 illustrates the ROC curves averaged over 50 runs with different random seeds for each dataset. Table 1 describes the mean and standard deviation of the AUC values over the 50 runs. The best method and methods comparable to it by the t-test at significance level 5% are described in boldface. The experimental results show that the uLSIF-based method tends to outperform the KLIEP-based method, and the RuLSIF-based method performs even better than the uLSIF-based method.

Table 1
The AUC values of the RuLSIF-based, uLSIF-based, and KLIEP-based methods. The best and comparable methods by the t-test with significance level 5% are described in boldface.

            RuLSIF        uLSIF         KLIEP
Dataset 1   .848 (.023)   .763 (.023)   .713 (.036)
Dataset 2   .846 (.031)   .806 (.035)   .623 (.040)
Dataset 3   .972 (.012)   .943 (.015)   .904 (.017)
Dataset 4   .844 (.031)   .801 (.024)   .602 (.036)

Finally, we investigate the sensitivity of the performance to different choices of $n$ and $k$ in terms of AUC values. In Fig. 6, the AUC values of RuLSIF ($\alpha = 0.1$ and $0.2$), uLSIF (which corresponds to RuLSIF with $\alpha = 0$), and KLIEP are plotted for $k = 5, 10,$ and $15$ under a specific choice of $n$ in each graph. We generated such graphs for all 4 datasets with $n = 25, 50,$ and $75$. The results show that the proposed method consistently performs better than the other methods, and that the ordering of the methods by performance is unchanged over the various choices of $n$ and $k$. Moreover, the RuLSIF methods with $\alpha = 0.1$ and $0.2$ perform rather similarly. For this reason, we keep using the medium parameter values among the candidates in the following experiments: $n = 50$, $k = 10$, and $\alpha = 0.1$.

Fig. 6. AUC plots for n = 25, 50, 75 and k = 5, 10, 15. The horizontal axes denote k, while the vertical axes denote AUC values. Panels (a)–(c), (d)–(f), (g)–(i), and (j)–(l) correspond to Datasets 1–4, each with n = 25, 50, and 75.

4.2. Real-world datasets

Next, we evaluate the performance of the density-ratio estimation based methods and other existing change-point detection methods using two real-world datasets: human-activity sensing and speech.

We include the following methods in our comparison:

Singular spectrum transformation (SST) (Ide & Tsuda, 2007; Itoh & Kurths, 2010; Moskvina & Zhigljavsky, 2003a): Change-point scores are evaluated on two consecutive trajectory matrices using distance-based singular spectrum analysis. This corresponds to a state-space model with no system noise. For this method, we use the first 4 eigenvectors to compare the difference between two subspaces, which was confirmed to be a reasonable choice in our preliminary experiments.

Subspace identification (SI) (Kawahara et al., 2007): SI identifies a subspace in which the time-series data are constrained, and evaluates the distance of target sequences from that subspace. The subspace spanned by the columns of an observability matrix is used for estimating the distance from the subspace spanned by subsequences of the time-series data. For this method, we use the top 4 significant singular values, according to our preliminary experimental results.

Auto-regressive (AR) (Takeuchi & Yamanishi, 2006): AR first fits an AR model to the time-series data, and then an auxiliary time series is generated from the AR model. With an extra AR model fitting, the change-point score is given by the log-likelihood. The order of the AR model is chosen by Schwarz's Bayesian information criterion (Schwarz, 1978).

One-class support vector machine (OSVM) (Desobry et al., 2005): Change-point scores are calculated by OSVM using two sets of descriptors of signals. The kernel width is set to the median value of the distances between samples, a popular heuristic in kernel methods (Schölkopf & Smola, 2002); a sketch of this heuristic is given after this list. The parameter $\nu$ is set to 0.2, which indicates the proportion of outliers.

First, we use a human-activity dataset. This is a subset of the Human Activity Sensing Consortium (HASC) challenge 2011,⁵ which provides human activity information collected by portable three-axis accelerometers. The task of change-point detection is to segment the time-series data according to 6 behaviors: "stay", "walk", "jog", "skip", "stair up", and "stair down". The starting time of each behavior was arbitrarily decided by each user. Because the orientation of the accelerometers is not necessarily fixed, we take the ℓ2-norm of the 3-dimensional (i.e., x-, y-, and z-axis) data; a preprocessing sketch is given at the end of this discussion.

In Fig. 7(a), an example of the original time series, the true change points, and the change-point scores obtained by the RuLSIF-based method are plotted. This shows that the change-point score clearly captures the trends of the changing behaviors, except for the changes around times 1200 and 1500. However, because these changes are difficult to recognize even by humans, we do not regard them as critical flaws. Fig. 7(b) illustrates ROC curves averaged over 10 datasets, and Fig. 7(c) describes the AUC values for each of the 10 datasets. The experimental results show that the proposed RuLSIF-based method tends to perform better than the other methods.

Fig. 7. HASC human-activity dataset. (a) One of the original signals and the change-point scores obtained by the RuLSIF-based method. (b) Average ROC curves. (c) AUC values; the best and comparable methods by the t-test with significance level 5% are described in boldface.

⁵ http://hasc.jp/hc2011/.
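The orientation-invariant preprocessing described above amounts to one line per sample; a small sketch (Python/NumPy, with a placeholder array standing in for the real HASC recordings) is:

```python
import numpy as np

# accel: array of shape (T, 3) holding x-, y-, z-axis accelerometer readings
accel = np.random.randn(5000, 3)           # placeholder for real HASC data
magnitude = np.linalg.norm(accel, axis=1)  # orientation-invariant l2-norm
series = magnitude[:, None]                # shape (T, 1): a d = 1 time series
# `series` can now be fed to `subsequences` and `change_point_score` above.
```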

Next, we use the IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC) dataset provided by the National Institute of Informatics (NII),⁶ which records human voice in noisy environments. The task is to extract speech sections from the recorded signals. This dataset offers several voice recordings with different background noises (e.g., the noise of a highway or of a restaurant). The beginning and ending of each human-voice segment are manually annotated. Note that we use the annotations only as the ground truth for the final performance evaluation, not for change-point detection (i.e., this experiment is still completely unsupervised).

Fig. 8(a) illustrates an example of the original signals, the true change points, and the change-point scores obtained by the proposed RuLSIF-based method. This shows that the proposed method gives clear indications of the speech segments. Fig. 8(b) and (c) show the average ROC curves over 10 datasets and the AUC values for each of the 10 datasets. The results show that the proposed method significantly outperforms the other methods.

Fig. 8. CENSREC speech dataset. (a) One of the original signals and the change-point scores obtained by the RuLSIF-based method. (b) Average ROC curves. (c) AUC values; the best and comparable methods by the t-test with significance level 5% are described in boldface.

4.3. Twitter dataset

Finally, we apply the proposed change-point detection method to the CMU Twitter dataset,⁷ an archive of Twitter messages collected from February 2010 to October 2010 via the Twitter application programming interface.

Here we track the degree of popularity of a given topic by monitoring the frequency of selected keywords. More specifically, we focus on events related to the Deepwater Horizon oil spill in the Gulf of Mexico, which occurred on April 20, 2010,⁸ and was widely broadcast in the Twitter community. We use the frequencies of 10 keywords: "gulf", "spill", "bp", "oil", "hayward", "mexico", "coast", "transocean", "halliburton", and "obama" (see Fig. 9(a)). We perform change-point detection directly on the 10-dimensional data, with the hope of capturing changes in the correlations between multiple keywords, in addition to changes in the frequency of each keyword.

For quantitative evaluation, we referred to the Wikipedia entry "Timeline of the Deepwater Horizon oil spill"⁹ as a real-world event source. The change-point score obtained by the proposed RuLSIF-based method is plotted in Fig. 9(b), where four occurrences of important real-world events illustrate the development of this news story.

As we can see from Fig. 9(b), the change-point score increases immediately after the initial explosion of the Deepwater Horizon oil platform and soon reaches its first peak, when oil was found on the shore of Louisiana on April 30. Shortly after BP announced its preliminary estimate of the amount of leaking oil, the change-point score rises quickly again, reaching its second peak at the end of May, at which time President Obama visited Louisiana to assure local residents of the federal government's support. On June 25, BP stock reached its one-year lowest price, and the change-point score spikes for the third time. Finally, BP cut off the spill on July 15, as the score reaches its last peak.

Fig. 9. Twitter dataset. (a) Normalized frequencies of the 10 keywords. (b) Change-point score obtained by the RuLSIF-based method and exemplary real-world events.

⁶ http://research.nii.ac.jp/src/eng/list/index.html.
⁷ http://www.ark.cs.cmu.edu/tweets/.
⁸ http://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill.
⁹ http://en.wikipedia.org/wiki/Timeline_of_the_Deepwater_Horizon_oil_spill.

5. Conclusion and future perspectives

In this paper, we first formulated the problem of retrospective change-point detection as the problem of comparing two probability distributions over two consecutive time segments. We then provided a comprehensive review of state-of-the-art density-ratio and divergence estimation methods, which are the key building blocks of our change-point detection methods. Our contributions in this paper were to extend the existing KLIEP-based change-point detection method (Kawahara & Sugiyama, 2012) and to propose the use of uLSIF as a building block. uLSIF has various theoretical and practical advantages: for example, the uLSIF solution can be computed analytically, it possesses the optimal non-parametric convergence rate, it has the optimal numerical stability, and it has higher robustness than KLIEP. We further proposed to use RuLSIF, a novel divergence estimation paradigm that recently emerged in the machine learning community. RuLSIF inherits the good properties of uLSIF and, moreover, possesses an even better non-parametric convergence property. Through extensive experiments on artificial and real-world datasets including human-activity sensing, speech, and Twitter messages, we demonstrated that the proposed RuLSIF-based change-point detection method is promising.

Though we estimated a density ratio between two consecutive segments, some earlier studies (Basseville & Nikiforov, 1993; Gustafsson, 1996, 2000) introduced a hyper-parameter that controls the size of a margin between the two segments. In our preliminary experiments, however, we did not observe a significant improvement from changing the margin. For this reason, we decided to use the straightforward model in which the two segments have no margin in between.

Through the experiment illustrated in Fig. 6 in Section 4.1, we can see that the performance of the proposed method is affected by the choice of the hyper-parameters n and k. However, discovering optimal values for these parameters remains a challenge, which will be investigated in our future work.

RuLSIF was shown to possess a better convergence property than uLSIF (Yamada et al., in press) in terms of density-ratio estimation. However, how this theoretical advantage in density-ratio estimation translates into a practical performance improvement in change detection is still not clear, beyond the intuition that a better divergence estimator gives a better change score. We will address this issue more formally in future work.

Although the proposed RuLSIF-based change-point detection was shown to work well even for multi-dimensional time-series data, its accuracy may be further improved by incorporating dimensionality reduction. Recently, several attempts were made to combine dimensionality reduction with direct density-ratio estimation (Sugiyama et al., 2010; Sugiyama, Yamada et al., 2011; Yamada & Sugiyama, 2011). Our future work will apply these techniques to change-point detection and evaluate their practical usefulness.

Compared with other approaches, methods based on density-ratio estimation tend to be computationally more expensive because of the cross-validation procedure for model selection. However, thanks to the analytic solution, the RuLSIF- and uLSIF-based methods are computationally more efficient than the KLIEP-based method, which requires an iterative optimization procedure (see Fig. 9 in Kanamori et al. (2009) for a detailed time comparison between uLSIF and KLIEP). An important direction of future work is to further improve the computational efficiency of the RuLSIF-based method.

In this paper, we focused on computing the change-point score, which represents the plausibility of change points. Another possible formulation is hypothesis testing, which provides a useful threshold for determining whether a point is a change point. Methodologically, it is straightforward to extend the proposed method to produce p-values, following the recent literature (Kanamori, Suzuki, & Sugiyama, 2012a; Sugiyama, Suzuki, Itoh, Kanamori, & Kimura, 2011). However, computing the p-value is often time consuming, particularly in a non-parametric setup. Thus, overcoming this computational bottleneck is important future work for making this approach more practical.

Recent reports pointed out that Twitter messages can be indicative of real-world events (Petrović, Osborne, & Lavrenko, 2010; Sakaki, Okazaki, & Matsuo, 2010). Following this line, we showed in Section 4.3 that our change-detection method can be used as a novel tool for analyzing Twitter messages. An important future challenge along this line includes automatic keyword selection for topics of interest.

Acknowledgments

SL was supported by the NII internship fund and the JST PRESTO program. MY and MS were supported by the JST PRESTO program. NC was supported by the NII Grand Challenge project fund.

References

Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. Technical Report. arXiv:0710.3742v1 [stat.ML].
Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28(1), 131–142.
Basseville, M., & Nikiforov, I. V. (1993). Detection of abrupt changes: theory and application. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
Bellman, R. (1961). Adaptive control processes: a guided tour. Princeton, NJ, USA: Princeton University Press.
Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on machine learning (pp. 81–88).
Brodsky, B., & Darkhovsky, B. (1993). Nonparametric methods in change-point problems. Dordrecht, the Netherlands: Kluwer Academic Publishers.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
Csörgő, M., & Horváth, L. (1988). 20 nonparametric methods for changepoint problems. In P. R. Krishnaiah, & C. R. Rao (Eds.), Handbook of statistics, vol. 7 (pp. 403–425). Amsterdam, the Netherlands: Elsevier.
Desobry, F., Davy, M., & Doncarli, C. (2005). An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53(8), 2961–2974.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY, USA: Chapman & Hall/CRC.
Garnett, R., Osborne, M. A., & Roberts, S. J. (2009). Sequential Bayesian prediction in the presence of changepoints. In Proceedings of the 26th annual international conference on machine learning (pp. 345–352).
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., & Schölkopf, B. (2009). Covariate shift by kernel mean matching. In J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, & N. Lawrence (Eds.), Dataset shift in machine learning (pp. 131–160). Cambridge, MA, USA: MIT Press (Chapter 8).

Guralnik, V., & Srivastava, J. (1999). Event detection from time series data. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 33–42).
Gustafsson, F. (1996). The marginalized likelihood ratio test for detecting abrupt changes. IEEE Transactions on Automatic Control, 41(1), 66–78.
Gustafsson, F. (2000). Adaptive filtering and change detection. Chichester, UK: Wiley.
Harchaoui, Z., Bach, F., & Moulines, E. (2009). Kernel change-point analysis. Advances in Neural Information Processing Systems, 21, 609–616.
Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publications.
Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. (2011). Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2), 309–336.
Ide, T., & Tsuda, K. (2007). Change-point detection using Krylov subspace learning. In Proceedings of the SIAM international conference on data mining (pp. 515–520).
Itoh, N., & Kurths, J. (2010). Change-point detection of climate time series by nonparametric method. In Proceedings of the world congress on engineering and computer science 2010, vol. 1.
Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012a). f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2), 708–720.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012b). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3), 335–367.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: a condition number analysis. Machine Learning (in press).
Kawahara, Y., & Sugiyama, M. (2012). Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2), 114–127.
Kawahara, Y., Yairi, T., & Machida, K. (2007). Change-point detection in time-series data based on subspace identification. In Proceedings of the 7th IEEE international conference on data mining (pp. 559–564).
Keziou, A. (2003). Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336(10), 857–862.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86.
Moskvina, V., & Zhigljavsky, A. (2003a). Application of singular-spectrum analysis to change-point detection in time series. School of Mathematics, Cardiff University. http://slb.cf.ac.uk/maths/subsites/stats/changepoint/CH_P_T_S.pdf.
Moskvina, V., & Zhigljavsky, A. (2003b). An algorithm based on singular spectrum analysis for change-point detection. Communications in Statistics: Simulation and Computation, 32(2), 319–352.
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2007). Nonparametric estimation of the likelihood ratio and divergence functionals. In Proceedings of the IEEE international symposium on information theory (pp. 2016–2020).
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
Paquet, U. (2007). Empirical Bayesian change point detection. Graphical Models, 1995, 1–20.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
Petrović, S., Osborne, M., & Lavrenko, V. (2010). Streaming first story detection with application to Twitter. In Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 181–189).
Reeves, J., Chen, J., Wang, X. L., Lund, R., & Lu, Q. (2007). A review and comparison of changepoint detection techniques for climate data. Journal of Applied Meteorology and Climatology, 46(6), 900–915.
Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ, USA: Princeton University Press.
Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World Wide Web (pp. 851–860).
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge, MA, USA: MIT Press.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Sugiyama, M., Kawanabe, M., & Chui, P. L. (2010). Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Networks, 23(1), 44–59.
Sugiyama, M., Suzuki, T., Itoh, Y., Kanamori, T., & Kimura, M. (2011). Least-squares two-sample test. Neural Networks, 24(7), 735–751.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012a). Density ratio estimation in machine learning. Cambridge, UK: Cambridge University Press.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012b). Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.
Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4), 699–746.
Sugiyama, M., Yamada, M., von Bünau, P., Suzuki, T., Kanamori, T., & Kawanabe, M. (2011). Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks, 24(2), 183–198.
Takeuchi, J., & Yamanishi, K. (2006). A unifying framework for detecting outliers and change points from non-stationary time series data. IEEE Transactions on Knowledge and Data Engineering, 18(4), 482–492.
Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.
Wang, Y., Wu, C., Ji, Z., Wang, B., & Liang, Y. (2011). Non-parametric change-point method for differential gene expression detection. PLoS ONE, 6(5), e20060.
Yamada, M., & Sugiyama, M. (2011). Direct density-ratio estimation with dimensionality reduction via hetero-distributional subspace analysis. In Proceedings of the twenty-fifth AAAI conference on artificial intelligence (pp. 549–554).
Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., & Sugiyama, M. (2013). Relative density-ratio estimation for robust distribution comparison. Neural Computation (in press).
Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 320–324).
