Change-Point Detection in Time Series Data by Relative Density-Ratio Estimation

Neural Networks
where P_t and P_{t+n} are probability distributions of samples in Y(t) and Y(t + n), respectively. D(P ‖ P′) denotes the f-divergence (Ali & Silvey, 1966; Csiszár, 1967):

D(P ‖ P′) := ∫ p′(Y) f( p(Y) / p′(Y) ) dY,   (2)

where f is a convex function such that f(1) = 0, and p(Y) and p′(Y) are probability density functions of P and P′, respectively. We assume that p(Y) and p′(Y) are strictly positive. Since the f-divergence is asymmetric (i.e., D(P ‖ P′) ≠ D(P′ ‖ P)), we symmetrize it in our dissimilarity measure (1) for all divergence-based methods.^3

The f-divergence includes various popular divergences, such as the Kullback–Leibler (KL) divergence given by f(t) = t log t (Kullback & Leibler, 1951) and the Pearson (PE) divergence given by f(t) = (1/2)(t − 1)^2 (Pearson, 1900):

KL(P ‖ P′) := ∫ p(Y) log( p(Y) / p′(Y) ) dY,   (3)

PE(P ‖ P′) := (1/2) ∫ p′(Y) ( p(Y)/p′(Y) − 1 )^2 dY.   (4)

Since the probability densities p(Y) and p′(Y) are unknown in practice, we cannot directly compute the f-divergence (and thus the dissimilarity measure). A naive way to cope with this problem is to perform density estimation and plug the estimated densities p̂(Y) and p̂′(Y) into the definition of the f-divergence. However, density estimation is known to be a hard problem (Vapnik, 1998), and thus such a plug-in approach is not reliable in practice.

Recently, a novel method of divergence approximation based on direct density-ratio estimation was explored (Kanamori et al., 2009; Nguyen, Wainwright, & Jordan, 2010; Sugiyama et al., 2008). The basic idea of direct density-ratio estimation is to learn the density-ratio function p(Y)/p′(Y) without going through separate density estimation of p(Y) and p′(Y). An intuitive rationale is that knowing the two densities p(Y) and p′(Y) means knowing their ratio, but not vice versa; knowing the ratio p(Y)/p′(Y) does not necessarily mean knowing the two densities p(Y) and p′(Y), because such a decomposition is not unique (see Fig. 1). This implies that estimating the density ratio is substantially easier than estimating the densities, and thus directly estimating the density ratio would be more promising^4 (Sugiyama et al., 2012a).

In the rest of this section, we review three methods of directly estimating the density ratio p(Y)/p′(Y) from samples {Y_i}_{i=1}^n and {Y′_j}_{j=1}^n drawn from p(Y) and p′(Y): the KL importance estimation procedure (KLIEP) (Sugiyama et al., 2008) in Section 3.2, unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009) in Section 3.3, and relative uLSIF (RuLSIF) (Yamada et al., in press) in Section 3.4.

3.2. KLIEP

KLIEP (Sugiyama et al., 2008) is a direct density-ratio estimation algorithm that is suitable for estimating the KL divergence.

3.2.1. Density-ratio model

Let us model the density ratio p(Y)/p′(Y) by the following kernel model:

g(Y; θ) := Σ_{ℓ=1}^n θ_ℓ K(Y, Y_ℓ),   (5)

where θ := (θ_1, …, θ_n)⊤ are parameters to be learned from data samples, and K(Y, Y′) is a kernel basis function. In practice, we use the Gaussian kernel:

K(Y, Y′) = exp( −‖Y − Y′‖^2 / (2σ^2) ),

where σ (> 0) is the kernel width. In all our experiments, the kernel width σ is determined by cross-validation.

3.2.2. Learning algorithm

The parameters θ in the model g(Y; θ) are determined so that the KL divergence from p(Y) to g(Y; θ)p′(Y) is minimized:

KL = ∫ p(Y) log[ p(Y) / (p′(Y) g(Y; θ)) ] dY
   = ∫ p(Y) log[ p(Y)/p′(Y) ] dY − ∫ p(Y) log g(Y; θ) dY.

After ignoring the first term, which is irrelevant to g(Y; θ), and approximating the second term with its empirical estimate, the KLIEP optimization problem is given as follows:

max_θ (1/n) Σ_{i=1}^n log( Σ_{ℓ=1}^n θ_ℓ K(Y_i, Y_ℓ) ),

s.t. (1/n) Σ_{j=1}^n Σ_{ℓ=1}^n θ_ℓ K(Y′_j, Y_ℓ) = 1 and θ_1, …, θ_n ≥ 0.

The equality constraint serves the normalization purpose, because g(Y; θ)p′(Y) should be a probability density function. The inequality constraint comes from the non-negativity of the density-ratio function. Since this is a convex optimization problem, the unique globally optimal solution θ̂ can be obtained simply, for example, by gradient-projection iterations. Finally, a density-ratio estimator is given as

ĝ(Y) = Σ_{ℓ=1}^n θ̂_ℓ K(Y, Y_ℓ).

---
^3 In the previous work (Kawahara & Sugiyama, 2012), the asymmetric dissimilarity measure D(P_t ‖ P_{t+n}) was used. As we numerically illustrate in Section 4, the use of the symmetrized divergence contributes highly to improving the performance. For this reason, we decided to use the symmetrized dissimilarity measure (1).

^4 Vladimir Vapnik advocated in his seminal book (Vapnik, 1998) that one should avoid solving a more difficult problem as an intermediate step. The support vector machine (Cortes & Vapnik, 1995) is a representative example that demonstrates the usefulness of this principle: it avoids solving the more general problem of estimating the data-generating probability distributions, and only learns a decision boundary that is sufficient for pattern recognition. The idea of direct density-ratio estimation also follows Vapnik's principle.
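As a concrete illustration of the gradient-projection iteration described in Section 3.2.2, here is a minimal NumPy sketch of KLIEP. The kernel width, learning rate, and iteration count below are illustrative choices (the paper determines the kernel width by cross-validation), not the authors' settings.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    # K(x, c) = exp(-||x - c||^2 / (2 sigma^2)) for all pairs (x, c).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kliep(Y_num, Y_den, sigma=1.0, lr=1e-2, n_iter=1000):
    """Gradient-projection sketch of KLIEP: maximize the empirical
    log-likelihood objective subject to the normalization and
    non-negativity constraints, with kernel centers at the numerator
    samples."""
    n = Y_num.shape[0]
    K_num = gaussian_kernel(Y_num, Y_num, sigma)   # K(Y_i, Y_l)
    K_den = gaussian_kernel(Y_den, Y_num, sigma)   # K(Y'_j, Y_l)
    b = K_den.mean(axis=0)                         # normalization-constraint vector
    theta = np.ones(n) / n
    for _ in range(n_iter):
        g = K_num @ theta + 1e-12                  # g(Y_i; theta), kept positive
        theta = theta + lr * (K_num.T @ (1.0 / g)) / n   # gradient ascent step
        theta = np.maximum(theta, 0.0)             # project onto theta_l >= 0
        theta = theta / (b @ theta)                # enforce (1/n) sum_j g(Y'_j) = 1
    g_hat = K_num @ theta
    KL_hat = np.mean(np.log(g_hat))                # KL-divergence approximator
    return theta, KL_hat
```

With samples from the two windows Y(t) and Y(t + n), `KL_hat` plays the role of the (asymmetric) KLIEP-based change-point score of Section 3.2.3.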
S. Liu et al. / Neural Networks 43 (2013) 72–83
KLIEP was shown to achieve the optimal non-parametric convergence rate (Nguyen et al., 2010; Sugiyama et al., 2008).

3.2.3. Change-point detection by KLIEP

Given a density-ratio estimator ĝ(Y), an approximator of the KL divergence is given as

KL̂ := (1/n) Σ_{i=1}^n log ĝ(Y_i).

In the previous work (Kawahara & Sugiyama, 2012), this KLIEP-based KL-divergence estimator was applied to change-point detection and demonstrated to be promising in experiments.

3.3. uLSIF

Recently, another direct density-ratio estimator called uLSIF was proposed (Kanamori et al., 2009, 2012b), which is suitable for estimating the PE divergence.

3.3.1. Learning algorithm

In uLSIF, the same density-ratio model as in KLIEP is used (see Section 3.2.1). However, its training criterion is different: the density-ratio model is fitted to the true density ratio under the squared loss. More specifically, the parameter θ in the model g(Y; θ) is determined so that the following squared loss J(θ) is minimized:

J(θ) = (1/2) ∫ ( p(Y)/p′(Y) − g(Y; θ) )^2 p′(Y) dY
     = (1/2) ∫ ( p(Y)/p′(Y) )^2 p′(Y) dY − ∫ p(Y) g(Y; θ) dY + (1/2) ∫ g(Y; θ)^2 p′(Y) dY.

Since the first term is a constant, we focus on the last two terms. By substituting for g(Y; θ) the model stated in (5) and approximating the integrals by empirical averages, the uLSIF optimization problem is given as follows:

min_{θ ∈ R^n} [ (1/2) θ⊤ Ĥ θ − ĥ⊤ θ + (λ/2) θ⊤ θ ],   (6)

where the penalty term (λ/2) θ⊤θ is included for regularization purposes. λ (≥ 0) denotes the regularization parameter, which is chosen by cross-validation (Sugiyama et al., 2008). Ĥ is the n × n matrix with the (ℓ, ℓ′)-th element given by

Ĥ_{ℓ,ℓ′} := (1/n) Σ_{j=1}^n K(Y′_j, Y_ℓ) K(Y′_j, Y_{ℓ′}).   (7)

ĥ is the n-dimensional vector with the ℓ-th element given by

ĥ_ℓ := (1/n) Σ_{i=1}^n K(Y_i, Y_ℓ).

It is easy to confirm that the solution θ̂ of (6) can be obtained analytically as

θ̂ = (Ĥ + λ I_n)^{−1} ĥ,   (8)

where I_n denotes the n-dimensional identity matrix. Finally, a density-ratio estimator is given as

ĝ(Y) = Σ_{ℓ=1}^n θ̂_ℓ K(Y, Y_ℓ).

3.3.2. Change-point detection by uLSIF

Given a density-ratio estimator ĝ(Y), an approximator of the PE divergence can be constructed as

PÊ := −(1/(2n)) Σ_{j=1}^n ĝ(Y′_j)^2 + (1/n) Σ_{i=1}^n ĝ(Y_i) − 1/2.

This approximator is derived from the following expression of the PE divergence (Sugiyama, Kawanabe, & Chui, 2010; Sugiyama, Yamada et al., 2011):

PE(P ‖ P′) = −(1/2) ∫ ( p(Y)/p′(Y) )^2 p′(Y) dY + ∫ ( p(Y)/p′(Y) ) p(Y) dY − 1/2.   (9)

The first two terms of (9) are actually the negative of the uLSIF optimization objective without regularization. This expression can also be obtained from the fact that the f-divergence D(P ‖ P′) is lower-bounded via the Legendre–Fenchel convex duality (Rockafellar, 1970) as follows (Keziou, 2003; Nguyen, Wainwright, & Jordan, 2007):

D(P ‖ P′) = sup_h [ ∫ p(Y) h(Y) dY − ∫ p′(Y) f*(h(Y)) dY ],   (10)

where f* is the convex conjugate of the convex function f defined in (2). The PE divergence corresponds to f(t) = (1/2)(t − 1)^2, for which the convex conjugate is given by f*(t*) = (t*)^2/2 + t*. For f(t) = (1/2)(t − 1)^2, the supremum is achieved when p(Y)/p′(Y) = h(Y) + 1. Substituting h(Y) = p(Y)/p′(Y) − 1 into (10), we obtain (9).

uLSIF has some notable advantages: its solution can be computed analytically (Kanamori et al., 2009), and it possesses the optimal non-parametric convergence rate (Kanamori et al., 2012b). Moreover, it has optimal numerical stability (Kanamori et al., in press), and it is more robust than KLIEP (Sugiyama et al., 2012b). In Section 4, we will experimentally demonstrate that uLSIF-based change-point detection compares favorably with the KLIEP-based method.

3.4. RuLSIF

Depending on the condition of the denominator density p′(Y), the density-ratio value p(Y)/p′(Y) can be unbounded (i.e., it can be infinite). This is problematic because the non-parametric convergence rate of uLSIF is governed by the sup-norm of the true density-ratio function: max_Y p(Y)/p′(Y). To overcome this problem, relative density-ratio estimation was introduced (Yamada et al., in press).

3.4.1. Relative PE divergence

Let us consider the α-relative PE-divergence for 0 ≤ α < 1:

PE_α(P ‖ P′) := PE(P ‖ αP + (1 − α)P′) = (1/2) ∫ p′_α(Y) ( p(Y)/p′_α(Y) − 1 )^2 dY,

where p′_α(Y) = α p(Y) + (1 − α) p′(Y) is the α-mixture density. We refer to

r_α(Y) = p(Y) / ( α p(Y) + (1 − α) p′(Y) )

as the α-relative density-ratio. The α-relative density-ratio reduces to the plain density-ratio if α = 0, and it tends to be smoother as α gets larger. Indeed, one can confirm that the α-relative density-ratio is bounded above by 1/α for α > 0, even when the plain density-ratio p(Y)/p′(Y) is unbounded. This was proved to contribute to improving the estimation accuracy (Yamada et al., in press).

As explained in Section 3.1, we use the symmetrized divergence PE_α(P ‖ P′) + PE_α(P′ ‖ P) as the change-point score, where each term is estimated separately.
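The analytic solution (8) carries over to the relative setting. The following sketch computes a PE-divergence estimate for both plain uLSIF (α = 0) and RuLSIF; the α-weighted counterpart of Ĥ and the PE_α estimator follow the RuLSIF formulation of Yamada et al. (in press) rather than formulas stated in this section, and σ and λ are illustrative values (the paper chooses them by cross-validation).

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    # K(x, c) = exp(-||x - c||^2 / (2 sigma^2)) for all pairs (x, c).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rulsif_pe(Y_num, Y_den, alpha=0.1, sigma=1.0, lam=0.1):
    """Analytic (Ru)LSIF fit and alpha-relative PE-divergence estimate.
    alpha = 0 recovers plain uLSIF with H-hat of Eq. (7) and the
    analytic solution of Eq. (8)."""
    n = Y_num.shape[0]
    K_num = gaussian_kernel(Y_num, Y_num, sigma)       # K(Y_i, Y_l)
    K_den = gaussian_kernel(Y_den, Y_num, sigma)       # K(Y'_j, Y_l)
    # alpha-mixture of the empirical second-moment matrices
    H = (alpha * K_num.T @ K_num / K_num.shape[0]
         + (1 - alpha) * K_den.T @ K_den / K_den.shape[0])
    h = K_num.mean(axis=0)                             # h-hat of the text
    theta = np.linalg.solve(H + lam * np.eye(n), h)    # analytic solution, Eq. (8)
    g_num = K_num @ theta                              # g-hat at numerator samples
    g_den = K_den @ theta                              # g-hat at denominator samples
    # alpha-relative PE-divergence approximator
    return (-alpha * np.mean(g_num ** 2) / 2
            - (1 - alpha) * np.mean(g_den ** 2) / 2
            + np.mean(g_num) - 0.5)
```

The symmetrized change-point score of Section 3.1 is then `rulsif_pe(A, B) + rulsif_pe(B, A)` for the two segments A and B.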
Fig. 4. Illustrative time-series samples (upper) and the change-point scores obtained by the RuLSIF-based method (lower). The true change points are marked by black vertical lines in the upper graphs.
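The windowed scoring scheme behind curves like those in Fig. 4 can be sketched generically. The subsequence embedding Y(t) and the hyper-parameters k and n are taken from the formulation in Section 2 (an assumption, since the construction is defined earlier in the paper); the toy mean-difference "divergence" passed in below is a hypothetical stand-in for the RuLSIF-based PE_α estimator.

```python
import numpy as np

def change_scores(x, k=10, n=50, divergence=None):
    """Slide two adjacent groups of n length-k subsequences along a 1-d
    series and score every split point with a symmetrized divergence
    estimate, as in the paper's retrospective formulation."""
    # Subsequence (Hankel-style) embedding: row t covers samples t .. t+k-1.
    Y = np.array([x[t:t + k] for t in range(len(x) - k + 1)])
    scores = []
    for t in range(n, Y.shape[0] - n + 1):
        A, B = Y[t - n:t], Y[t:t + n]          # n past / n future subsequences
        scores.append(divergence(A, B) + divergence(B, A))  # symmetrized score
    return np.array(scores)
```

With an abrupt mean shift in the input series, the resulting score sequence peaks near the true change point, mirroring the behavior shown in Fig. 4.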
where ε_t is Gaussian noise with mean μ and standard deviation 1.5. The initial values are set as y(1) = y(2) = 0. A change point is inserted at every 100 time steps by setting the noise mean μ at time t as

μ_N = 0 for N = 1, and μ_N = μ_{N−1} + N/16 for N = 2, …, 49,

where N is a natural number such that 100(N − 1) + 1 ≤ t ≤ 100N.

Dataset 2 (Scaling variance): The same auto-regressive model as in Dataset 1 is used, but a change point is inserted at every 100 time steps by setting the noise standard deviation σ at time t as

σ_N = 1 for N = 1, 3, …, 49, and σ_N = ln(e + N/4) for N = 2, 4, …, 48.

Dataset 3 (Switching covariance): 2-dimensional samples of size 5000 are drawn from the origin-centered normal distribution, and a change point is inserted at every 100 time steps by setting the covariance matrix Σ at time t as

Σ_N = [[1, c_N], [c_N, 1]], with off-diagonal element c_N = −(4/5 + (N − 2)/500) for N = 1, 3, …, 49 and c_N = +(4/5 + (N − 2)/500) for N = 2, 4, …, 48.

Dataset 4 (Changing frequency): 1-dimensional samples of size 5000 are generated as

y(t) = sin(ω t) + ε_t,

where ε_t is origin-centered Gaussian noise with standard deviation 0.8. A change point is inserted at every 100 points by changing the frequency ω at time t as

ω_N = 1 for N = 1, and ω_N = ω_{N−1} ln(e + N/2) for N = 2, …, 49.

Note that, to explore the ability to detect change points with different significance, we purposely made the latter change points more significant than the earlier ones in the above datasets.

Fig. 4 shows examples of these datasets for the last 10 change points and the corresponding change-point scores obtained by the proposed RuLSIF-based method. Although the last 10 change points are the most significant, we can see from the graphs that, for Datasets 3 and 4, these change points can hardly be identified even by humans. Nevertheless, the change-point score obtained by the proposed RuLSIF-based method increases rapidly after changes occur.

Next, we compare the performance of the RuLSIF-based, uLSIF-based, and KLIEP-based methods in terms of receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) values. We define the true positive rate and false positive rate in the following way (Kawahara & Sugiyama, 2012):

True positive rate (TPR): n_cr / n_cp,
False positive rate (FPR): (n_al − n_cr) / n_al,

where n_cr denotes the number of times change points are correctly detected, n_cp denotes the number of all change points, and n_al is the number of all detection alarms.

Following the strategy of previous research (Desobry et al., 2005; Harchaoui, Bach, & Moulines, 2009), peaks of the change-point score are regarded as detection alarms. More specifically, a detection alarm at step t is regarded as correct if there exists a true alarm at step t* such that t ∈ [t* − 10, t* + 10]. To avoid duplication, we remove the k-th alarm at step t_k if t_k − t_{k−1} < 20.

We set up a threshold η for filtering out all alarms whose change-point scores are lower than or equal to η. Initially, we set η to be equal to the score of the highest peak. Then, by lowering η gradually, both TPR and FPR become non-decreasing. For each η, we plot TPR and FPR on the graph, and thus a monotone curve can be drawn.

Fig. 5 illustrates ROC curves averaged over 50 runs with different random seeds for each dataset. Table 1 describes the mean and standard deviation of the AUC values over the 50 runs. The best method and methods comparable to it by the t-test at significance level 5% are shown in boldface. The experimental results show
Fig. 6. AUC plots for n = 25, 50, 75 and k = 5, 10, 15. The horizontal axes denote k, while the vertical axes denote AUC values.
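The TPR/FPR bookkeeping defined above can be sketched directly. The text does not fully specify whether one true change point may be credited to several alarms, so this sketch counts each true point at most once (an assumption):

```python
def tpr_fpr(alarms, true_points, tol=10):
    """TPR = n_cr / n_cp and FPR = (n_al - n_cr) / n_al, where an alarm
    at step t is correct if some true change point t* satisfies
    |t - t*| <= tol; each true point is matched at most once."""
    matched = set()
    n_cr = 0
    for t in alarms:
        for t_star in true_points:
            if t_star not in matched and abs(t - t_star) <= tol:
                matched.add(t_star)
                n_cr += 1
                break
    n_al, n_cp = len(alarms), len(true_points)
    tpr = n_cr / n_cp if n_cp else 0.0
    fpr = (n_al - n_cr) / n_al if n_al else 0.0
    return tpr, fpr
```

Sweeping the threshold η from the highest peak downward and recomputing (TPR, FPR) for the surviving alarms traces out the ROC curve described above.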
three-axis accelerometers. The task of change-point detection is to segment the time-series data according to the 6 behaviors: "stay", "walk", "jog", "skip", "stair up", and "stair down". The starting time of each behavior is arbitrarily decided by each user. Because the orientation of the accelerometers is not necessarily fixed, we take the ℓ2-norm of the 3-dimensional (i.e., x-, y-, and z-axis) data.

In Fig. 7(a), examples of the original time-series, true change points, and change-point scores obtained by the RuLSIF-based method are plotted. This shows that the change-point score clearly captures trends of changing behaviors, except the changes around times 1200 and 1500. However, because these changes are difficult to recognize even for humans, we do not regard them as critical flaws. Fig. 7(b) illustrates ROC curves averaged over 10 datasets, and Fig. 7(c) describes AUC values for each of the 10 datasets. The experimental results show that the proposed RuLSIF-based method tends to perform better than the other methods.

Next, we use the IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC) dataset provided by the National
(a) One of the original signals and the change-point scores obtained by the RuLSIF-based method. (b) Average ROC curves. (c) AUC values. The best method and methods comparable to it by the t-test at significance level 5% are shown in boldface.
Institute of Informatics (NII),^6 which records human voice in a noisy environment. The task is to extract speech sections from the recorded signals. This dataset offers several voice recordings with different background noises (e.g., noise of a highway and of a restaurant). Segmentation of the beginning and ending of human voice is manually annotated. Note that we only use the annotations as the ground truth for the final performance evaluation, not for change-point detection (i.e., this experiment is still completely unsupervised).

Fig. 8(a) illustrates an example of the original signals, true change points, and change-point scores obtained by the proposed RuLSIF-based method. This shows that the proposed method still gives clear indications for speech segments. Fig. 8(b) and (c) show average ROC curves over 10 datasets and AUC values for each of the 10 datasets. The results show that the proposed method significantly outperforms the other methods.

4.3. Twitter dataset

Finally, we apply the proposed change-point detection method to the CMU Twitter dataset,^7 which is an archive of Twitter messages collected from February 2010 to October 2010 via the Twitter application programming interface.

Here we track the degree of popularity of a given topic by monitoring the frequency of selected keywords. More specifically, we focus on events related to the Deepwater Horizon oil spill in the Gulf of Mexico, which occurred on April 20, 2010,^8 and was widely broadcast in the Twitter community. We use the frequencies of 10 keywords: "gulf", "spill", "bp", "oil", "hayward", "mexico", "coast", "transocean", "halliburton", and "obama" (see Fig. 9(a)). We perform change-point detection directly on the 10-dimensional data, with the hope that we can capture correlation changes between multiple keywords, in addition to changes in the frequency of each keyword.

For quantitative evaluation, we referred to the Wikipedia entry "Timeline of the Deepwater Horizon oil spill"^9 as a real-world event source. The change-point score obtained by the proposed RuLSIF-based method is plotted in Fig. 9(b), where four occurrences of important real-world events show the development of this news story.

As we can see from Fig. 9(b), the change-point score increases immediately after the initial explosion of the Deepwater Horizon oil platform and soon reaches its first peak when oil was found on the shore of Louisiana on April 30. Shortly after BP announced its preliminary estimate of the amount of leaking oil, the change-point score rises quickly again and reaches its second peak at the

^6 http://research.nii.ac.jp/src/eng/list/index.html.
^7 http://www.ark.cs.cmu.edu/tweets/.
^8 http://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill.
^9 http://en.wikipedia.org/wiki/Timeline_of_the_Deepwater_Horizon_oil_spill.
(a) One of the original signals and the change-point scores obtained by the RuLSIF-based method. (b) Average ROC curves. (c) AUC values. The best method and methods comparable to it by the t-test at significance level 5% are shown in boldface.
end of May, at which time President Obama visited Louisiana to assure local residents of the federal government's support. On June 25, BP stock was at its one-year lowest price, while the change-point score spikes for the third time. Finally, BP cut off the spill on July 15, as the score reaches its last peak.

5. Conclusion and future perspectives

In this paper, we first formulated the problem of retrospective change-point detection as the problem of comparing two probability distributions over two consecutive time segments. We then provided a comprehensive review of state-of-the-art density-ratio and divergence estimation methods, which are key building blocks of our change-point detection methods. Our contributions in this paper were to extend the existing KLIEP-based change-point detection method (Kawahara & Sugiyama, 2012), and to propose the use of uLSIF as a building block. uLSIF has various theoretical and practical advantages; for example, the uLSIF solution can be computed analytically, it possesses the optimal non-parametric convergence rate, it has optimal numerical stability, and it has higher robustness than KLIEP. We further proposed the use of RuLSIF, a novel divergence estimation paradigm that recently emerged in the machine learning community. RuLSIF inherits the good properties of uLSIF, and moreover it possesses an even better non-parametric convergence property. Through extensive experiments on artificial datasets and real-world datasets including human-activity sensing, speech, and Twitter messages, we demonstrated that the proposed RuLSIF-based change-point detection method is promising.

Though we estimated a density ratio between two consecutive segments, some earlier research (Basseville & Nikiforov, 1993; Gustafsson, 1996, 2000) introduced a hyper-parameter that controls the size of a margin between the two segments. In our preliminary experiments, however, we did not observe a significant improvement by changing the margin. For this reason, we decided to use a straightforward model in which the two segments have no margin in between.

Through the experiment illustrated in Fig. 6 in Section 4.1, we can see that the performance of the proposed method is affected by the choice of the hyper-parameters n and k. However, discovering optimal values for these parameters remains a challenge, which will be investigated in our future work.

RuLSIF was shown to possess a better convergence property than uLSIF (Yamada et al., in press) in terms of density-ratio estimation. However, how this theoretical advantage in density-ratio estimation can be translated into practical performance improvement in change detection is still not clear, beyond the intuition that a better divergence estimator gives a better change score. We will address this issue more formally in future work.

Although the proposed RuLSIF-based change-point detection was shown to work well even for multi-dimensional time-series data, its accuracy may be further improved by incorporating dimensionality reduction. Recently, several attempts were made to combine dimensionality reduction with direct density-ratio estimation (Sugiyama et al., 2010; Sugiyama, Yamada et al., 2011; Yamada & Sugiyama, 2011). Our future work will apply these techniques to change-point detection and evaluate their practical usefulness.

Compared with other approaches, methods based on density-ratio estimation tend to be computationally more expensive because of the cross-validation procedure for model selection. However, thanks to the analytic solution, the RuLSIF- and uLSIF-based methods are computationally more efficient than the KLIEP-based method, which requires an iterative optimization procedure
(b) Change-point score obtained by the RuLSIF-based method and exemplary real-world events.
(see Fig. 9 in Kanamori et al. (2009) for a detailed time comparison between uLSIF and KLIEP). An important piece of our future work is to further improve the computational efficiency of the RuLSIF-based method.

In this paper, we focused on computing the change-point score that represents the plausibility of change points. Another possible formulation is hypothesis testing, which provides a useful threshold to determine whether a point is a change point. Methodologically, it is straightforward to extend the proposed method to produce p-values, following the recent literature (Kanamori, Suzuki, & Sugiyama, 2012a; Sugiyama, Suzuki, Itoh, Kanamori, & Kimura, 2011). However, computing the p-value is often time-consuming, particularly in a non-parametric setup. Thus, overcoming this computational bottleneck is important future work for making this approach more practical.

Recent reports pointed out that Twitter messages can be indicative of real-world events (Petrović, Osborne, & Lavrenko, 2010; Sakaki, Okazaki, & Matsuo, 2010). Following this line, we showed in Section 4.3 that our change-detection method can be used as a novel tool for analyzing Twitter messages. An important future challenge along this line includes automatic keyword selection for topics of interest.

Acknowledgments

SL was supported by the NII internship fund and the JST PRESTO program. MY and MS were supported by the JST PRESTO program. NC was supported by the NII Grand Challenge project fund.

References

Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. Technical Report. arXiv:0710.3742v1 [stat.ML].
Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28(1), 131–142.
Basseville, M., & Nikiforov, I. V. (1993). Detection of abrupt changes: theory and application. Upper Saddle River, NJ, USA: Prentice-Hall.
Bellman, R. (1961). Adaptive control processes: a guided tour. Princeton, NJ, USA: Princeton University Press.
Bickel, S., Brückner, M., & Scheffer, T. (2007). Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on machine learning (pp. 81–88).
Brodsky, B., & Darkhovsky, B. (1993). Nonparametric methods in change-point problems. Dordrecht, the Netherlands: Kluwer Academic Publishers.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
Csörgő, M., & Horváth, L. (1988). Nonparametric methods for changepoint problems. In P. R. Krishnaiah, & C. R. Rao (Eds.), Handbook of statistics, vol. 7 (pp. 403–425). Amsterdam, the Netherlands: Elsevier.
Desobry, F., Davy, M., & Doncarli, C. (2005). An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53(8), 2961–2974.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY, USA: Chapman & Hall/CRC.
Garnett, R., Osborne, M. A., & Roberts, S. J. (2009). Sequential Bayesian prediction in the presence of changepoints. In Proceedings of the 26th annual international conference on machine learning (pp. 345–352).
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., & Schölkopf, B. (2009). Covariate shift by kernel mean matching. In J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, & N. Lawrence (Eds.), Dataset shift in machine learning (pp. 131–160). Cambridge, MA, USA: MIT Press (Chapter 8).
Guralnik, V., & Srivastava, J. (1999). Event detection from time series data. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 33–42).
Gustafsson, F. (1996). The marginalized likelihood ratio test for detecting abrupt changes. IEEE Transactions on Automatic Control, 41(1), 66–78.
Gustafsson, F. (2000). Adaptive filtering and change detection. Chichester, UK: Wiley.
Harchaoui, Z., Bach, F., & Moulines, E. (2009). Kernel change-point analysis. Advances in Neural Information Processing Systems, 21, 609–616.
Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publications.
Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. (2011). Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2), 309–336.
Ide, T., & Tsuda, K. (2007). Change-point detection using Krylov subspace learning. In Proceedings of the SIAM international conference on data mining (pp. 515–520).
Itoh, N., & Kurths, J. (2010). Change-point detection of climate time series by nonparametric method. In Proceedings of the world congress on engineering and computer science, 2010, vol. 1.
Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012a). f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2), 708–720.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012b). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3), 335–367.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: a condition number analysis. Machine Learning (in press).
Kawahara, Y., & Sugiyama, M. (2012). Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2), 114–127.
Kawahara, Y., Yairi, T., & Machida, K. (2007). Change-point detection in time-series data based on subspace identification. In Proceedings of the 7th IEEE international conference on data mining (pp. 559–564).
Keziou, A. (2003). Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336(10), 857–862.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86.
Moskvina, V., & Zhigljavsky, A. (2003a). Application of singular-spectrum analysis to change-point detection in time series. School of Mathematics, Cardiff University. http://slb.cf.ac.uk/maths/subsites/stats/changepoint/CH_P_T_S.pdf.
Moskvina, V., & Zhigljavsky, A. (2003b). An algorithm based on singular spectrum analysis for change-point detection. Communications in Statistics: Simulation and Computation, 32(2), 319–352.
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2007). Nonparametric estimation of the likelihood ratio and divergence functionals. In Proceedings of the IEEE international symposium on information theory (pp. 2016–2020).
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
Paquet, U. (2007). Empirical Bayesian change point detection. Graphical Models, 1995, 1–20.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
Petrović, S., Osborne, M., & Lavrenko, V. (2010). Streaming first story detection with application to Twitter. In Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics (pp. 181–189).
Reeves, J., Chen, J., Wang, X. L., Lund, R., & Lu, Q. (2007). A review and comparison of changepoint detection techniques for climate data. Journal of Applied Meteorology and Climatology, 46(6), 900–915.
Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ, USA: Princeton University Press.
Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World Wide Web (pp. 851–860).
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Sugiyama, M., Kawanabe, M., & Chui, P. L. (2010). Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Networks, 23(1), 44–59.
Sugiyama, M., Suzuki, T., Itoh, Y., Kanamori, T., & Kimura, M. (2011). Least-squares two-sample test. Neural Networks, 24(7), 735–751.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012a). Density ratio estimation in machine learning. Cambridge, UK: Cambridge University Press.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012b). Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.
Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Buenau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4), 699–746.
Sugiyama, M., Yamada, M., von Bünau, P., Suzuki, T., Kanamori, T., & Kawanabe, M. (2011). Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks, 24(2), 183–198.
Takeuchi, J., & Yamanishi, K. (2006). A unifying framework for detecting outliers and change points from non-stationary time series data. IEEE Transactions on Knowledge and Data Engineering, 18(4), 482–492.
Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.
Wang, Y., Wu, C., Ji, Z., Wang, B., & Liang, Y. (2011). Non-parametric change-point method for differential gene expression detection. PLoS ONE, 6(5), e20060.
Yamada, M., & Sugiyama, M. (2011). Direct density-ratio estimation with dimensionality reduction via hetero-distributional subspace analysis. In Proceedings of the twenty-fifth AAAI conference on artificial intelligence (pp. 549–554).
Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., & Sugiyama, M. (2013). Relative density-ratio estimation for robust distribution comparison. Neural Computation (in press).
Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 320–324).