Benchmarking Time Series Classification – Functional Data vs. Machine Learning Approaches
Florian Pfisterer, Xudong Sun, Laura Beggel, Fabian Scheipl, Bernd Bischl
Department of Statistics, Ludwig-Maximilians-Universität München
Ludwigstr. 33, Munich, Germany
Abstract
Time series classification problems have drawn increasing attention in the ma-
chine learning and statistical community. Closely related is the field of func-
tional data analysis (FDA): it refers to the range of problems that deal with
the analysis of data that is continuously indexed over some domain. While of-
ten employing different methods, both fields strive to answer similar questions,
a common example being classification or regression problems with functional
covariates. We study methods from functional data analysis, such as functional
generalized additive models, as well as functionality to concatenate (functional)
feature extraction or basis representations with traditional machine learning al-
gorithms like support vector machines or classification trees. In order to assess
the methods and implementations, we run a benchmark on a wide variety of
representative (time series) data sets, with in-depth analysis of empirical re-
sults, and strive to provide a reference ranking for which method(s) to use for
non-expert practitioners. Additionally, we provide a software framework in R
for functional data analysis for supervised learning, including machine learning
methods as well as more classical, linear approaches from statistics. This allows
convenient access, and, in connection with the machine-learning toolbox mlr,
these methods can now also be tuned and benchmarked.
Keywords: Functional Data Analysis, Time Series, Classification, Regression
1. Introduction
The analysis of functional data is becoming more and more important in many
areas of application such as medicine, economics, or geology (cf. Ullah and
Finch [1], Wang et al. [2]), where this type of data occurs naturally. In industry,
functional data are often a by-product of continuous monitoring of production
processes, yielding great potential for data mining tasks. A common type of
functional data are time series, as time series can often be considered as dis-
cretized functions over time.
Many researchers publish software implementations of their algorithms, thereby
simplifying access to already established methods. Even though such a
readily available, broad range of methods to choose from is desirable in general,
it also makes it harder for non-expert users to decide which method to apply
to a problem at hand and how to optimize its performance.
As a result, there is an increasing demand for automated model selection and
parameter tuning.
Furthermore, the functionality of available pipeline steps ranges from simple
data structures for functional data, to feature extraction methods and packages
offering direct modeling procedures for regression and classification. Users are
again faced with a multiplicity of software implementations to choose from and,
in many instances, combining several implementations may be required. This
can be difficult and time-consuming, since the various implementations utilize a
multiplicity of different workflows which the user needs to become familiar with
and synchronize in order to correctly carry out the desired analysis.
A wide variety of R [3] packages that provide functionality for analyzing
functional data is available. Examples range from
the fda [4] package which includes object types for functional data and allows
for smoothing and simple regression, to, e.g., boosted additive regression models
for functional data in FDboost [5]. For an extensive overview, see the CRAN
task view [6].
Many of those packages are designed to provide algorithmic solutions for one
specific problem, and each of them requires the user to become familiar with
its user interface. Some of the packages, however, such as fda.usc [7] or refund
[8] are not designed for only one specific analysis task, but combine several
approaches. Nevertheless, these packages do not offer unified frameworks or
consistent user interfaces for their various methods, and most of the packages
can still only be applied separately.
A crucial advantage of providing several algorithms in one package with a unified
and principled user interface is that it becomes much easier to compare the
provided methods with the intention to find the best solution for a problem at
hand. But to determine the best alternative, one still has to be able to compare
the methods at their best performance on the considered data, which requires
hyperparameter search and, more preferably, efficient tuning methods.
While the different underlying packages are often difficult and sometimes even
impossible to extend to new methods, custom implementations and extensions
can be easily included in the accompanying software.
We want to stress that the focus of this paper does not lie in proposing new al-
gorithms for functional data analysis. Its added value lies in a large comparison
of algorithms while providing a unified and easily accessible interface for com-
bining statistical methods for functional data with the broad range of functions
provided by mlr, most importantly benchmarking and tuning. Additionally, the
often overlooked possibility of extracting non-functional features from functional
data is integrated, which enables the user to apply classical machine learning
algorithms such as support vector machines [9] to functional data problems.
In a benchmark study similar to Bagnall et al. [10] and Fawaz et al. [11], we
explore the performance of the implemented methods and try to answer the
following questions: i) how do machine learning algorithms combined with feature
extraction perform on functional data, ii) how do they compare to classical time
series classification approaches, iii) what is the effect of tuning hyperparameters,
and iv) which algorithm(s) can be recommended for new classification problems.
2. Related Work
In the remainder of the paper, we focus on comparing algorithms from the func-
tional data analysis and the machine learning domain. Functional data analysis
traditionally values interpretable results and valid statistical inference over pre-
diction quality. Therefore, functional data algorithms are often not compared
with respect to their predictive performance in the literature. We aim to close this
gap. On the other hand, machine learning algorithms often do not yield inter-
pretable results. While we consider both aspects to be important, we want to
focus on predictive performance in this paper.
fourier transforms data from the time domain into the frequency domain
using the fast Fourier transform [18]. Extracted features are either phase
or amplitude coefficients.
bsignal uses B-spline representations from package FDboost [5] as
feature extractors. Given the knot vector and the effective degrees of freedom,
we extract the design matrix for the functional data using mboost.
wavelets [15] applies a discrete wavelet transform to time series or func-
tional data, e.g., with Haar or Daubechies wavelets. The extracted features
are wavelet coefficients at several resolution levels.
PCA projects the data onto their principal component vectors. Only the
subset of principal component scores representing a given proportion
of the signal variance is retained.
DTWKernel computes dynamic time warping (DTW) distances of functional
or time series data to a (user-specified) set of reference curves. The distances
of each observation to the reference curves are then extracted as a vector-valued
feature. The reference curves can either be supplied by the user, e.g., several
typical functions for the respective classes, or they can be obtained from the
training data. In order to compute dynamic time warping distances, we use
a fast dynamic time warping [19] implementation from package rucrdtw [20].
MultiResFeatures extracts features, such as the mean, at different levels
of resolution (zoom-in steps). Inspired by image pyramids and wavelet
methods, this multi-resolution feature extraction computes features such as the
mean and variance over windows of varying widths: starting from the full
sequence, the sequence is repeatedly divided into smaller pieces, and at
each resolution level a scalar value is extracted. All extracted features
are concatenated to form the final feature vector. Minimal sketches of some of
these extractors are given below.
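To make the descriptions above concrete, the following sketch illustrates heavily simplified versions of four of these extractors (Fourier amplitudes, PCA scores, multi-resolution means, and DTW distances to reference curves) in base R. It is an illustration only: the actual mlrFDA extractors (extractFDAFourier(), extractFDAPCA(), extractFDAMultiResFeatures(), extractFDADTWKernel()) differ in their defaults and, for DTW, use the fast rucrdtw backend.

set.seed(1)
X <- matrix(rnorm(20 * 64), nrow = 20)  # 20 toy curves observed at 64 grid points

# fourier: amplitude coefficients via the fast Fourier transform
fourier_feats <- t(apply(X, 1, function(x) Mod(fft(x))[1:(length(x) / 2)]))

# PCA: scores of the leading components explaining, e.g., 95% of the variance
pc <- prcomp(X)
k  <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= 0.95)[1]
pca_feats <- pc$x[, seq_len(k), drop = FALSE]

# MultiResFeatures: means over repeatedly halved windows (3 resolution levels)
multires_feats <- t(apply(X, 1, function(x) {
  unlist(lapply(1:3, function(level) {
    pieces <- split(x, cut(seq_along(x), 2^level))  # 2, 4, 8 windows
    vapply(pieces, mean, numeric(1))
  }))
}))

# DTWKernel: distances to reference curves, here with a naive O(L^2) DTW
dtw_dist <- function(x, y) {
  D <- matrix(Inf, length(x) + 1, length(y) + 1); D[1, 1] <- 0
  for (i in seq_along(x)) for (j in seq_along(y)) {
    D[i + 1, j + 1] <- abs(x[i] - y[j]) + min(D[i, j + 1], D[i + 1, j], D[i, j])
  }
  D[length(x) + 1, length(y) + 1]
}
refs <- X[1:3, , drop = FALSE]  # e.g., three training curves as references
dtw_feats <- t(apply(X, 1, function(obs) apply(refs, 1, dtw_dist, y = obs)))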
2.3. Toolboxes for functional data analysis
The package fda [4] contains several object types for functional data and allows
for smoothing and regression for functional data. Analogously, the R-package
fda.usc [7] contains several classification algorithms that can be used with func-
tional data. In Python, scikit-fda [27] offers both representation of and (pre-
)processing methods for functional data, but only a very small set of machine
learning methods for classification or regression problems is implemented at the
time of writing.
As a byproduct of the Time-Series Classification Bake-off [10], a wide variety
of algorithms were implemented and made available. But this implementation
emphasizes the benchmark over providing a data analysis toolbox, and
is therefore not easily usable for inexperienced users.
2.4. Benchmarks
The recently published benchmark analysis Time-Series Classification Bake-off
by Bagnall et al. [10] provides an overview of the performance of 18 state-of-
the-art algorithms for time series classification. They re-implement (in Java)
and compare 18 algorithms designed especially for time series classification on
85 benchmark time series data sets from Bagnall et al. [28]. In their analysis,
they also include results from several standard machine learning algorithms.
They note that the rotation forest [12] and random forest [29] are competitive
with their time series classification baseline [1-nearest neighbor with dynamic
time warping distance; 30]. Their results show that ensemble methods such as
the collective of transformation-based ensembles [COTE; 31] perform best, but
at the price of considerable runtime.
Deep learning methods applied to time series classification tasks have also shown
competitive predictive performance. For example, [11] provide a comprehensive review
of state-of-the-art methods. The authors compared both generative models and
discriminative models, including fully connected neural networks, convolutional
neural networks, auto-encoders and echo state networks, whereas only discrimi-
native end-to-end approaches were incorporated in the benchmark study.
The benchmark study conducted in this work does not aim to replicate or com-
pete with earlier studies like [10], but instead tries to extend their results.
3. Functional Data
While for scalar features each observation is measured as a vector of scalar
components, functional features are function-valued over their domain. The
features x = (x1, ..., xp) can thus also be functions, i.e., xj = gj(t), gj : T → R.
In practice, functional data comes in the form of observed values gj(t), t ∈ {1, ..., L},
where each t corresponds to a discrete point on the continuum. Those observed
values stem from an underlying function f evaluated over a set of points. A frequent
type of functional data is time series data, i.e., values of a process recorded at
discrete time points.
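As a simple illustration (our own toy example, not from the paper), such discretized functional observations can be stored as a numeric matrix whose rows are observations and whose columns are the evaluation points t ∈ {1, ..., L}; this is also the matrix-column representation used by makeFunctionalData() in Appendix A.

n <- 5; L <- 100
t_grid <- seq(0, 1, length.out = L)
g <- function(t, a) sin(2 * pi * t) + a * t         # underlying functions g(t)
X <- t(sapply(rnorm(n), function(a) g(t_grid, a)))  # n x L matrix: one row per observed curve
dim(X)  # 5 100: each row is one functional feature evaluated at L grid points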
For example, in some electrical engineering applications, signals are obtained
over time at a certain sampling rate, but other domains are possible as well.
Spectroscopic data, for example, are functional data recorded over certain parts
of the electromagnetic spectrum. One such example is depicted in Figure 1. It
shows spectroscopy data of fossil fuels [32] where the measured signal represents
reflected energies in the ultraviolet-visible (UV-VIS) and the near infrared spec-
trum (NIR). In the plot, different colors correspond to different instances. This
is a typical example of a scalar-on-function regression problem, where the inputs
are a collection of spectroscopic curves for a fuel, and the prediction target is
the heating value of the fossil fuel.
In Figure 2, we display two functional classification scenarios. The goal in
these scenarios is to predict the class of each curve, which can be understood
as a scalar-on-function problem. Figure 2a shows the vertical position of an
actor's hand while either drawing a toy gun and aiming at a target, or just
imitating the motion with an empty hand. This position is measured over time.
The two different classes of curves can be distinguished by the color scheme.
Figure 2b shows a data set built for distinguishing images of beetles from images
of flies based on their outlines. While following the outline, the distance to the
center of the object is measured which is then used for classification purposes.
The latter data sets are available from [28].
The interested reader is referred to Ramsay [21] and Kokoszka and Reimherr
[35] for more in-depth introductions to this topic.
[Figure 1: Heat emission for different fossil fuels. Two panels show the UVVIS Energy and NIR Energy curves; color encodes the heat value (Heat [mJ]).]
5. Benchmark Experiment
Figure 2: Excerpts from two time series classification data sets. (a): Gunpoint data [33], (b):
BeetleFly Data [34].
We do not aim to exactly replicate the results obtained by Bagnall et al. [10] or
Fawaz et al. [11]. Instead, we focus on providing a benchmark complementary to
previous benchmarks. This is done because i) the experiments require large amounts
of computational resources, and ii) the added value of an exact replication of the
experiments (with open source code) is comparatively small. Nonetheless, we aim for
results that can be compared, and thus extend the results obtained by Bagnall et al. [10] by
staying close to their setup. The experiments were carried out on a high per-
formance computing cluster, supported by the Leibniz Rechenzentrum Munich.
Individual runs were allowed up to 2.2 GB of RAM and 4 hours run-time for
each evaluation. We want to stress that this benchmark compares implemen-
tations, which does not always necessarily correspond to the performance of
the corresponding theoretical algorithm. Additionally, methods for functional
data analysis are traditionally more focused on valid statistical inference and
interpretable results, which does not necessarily coincide with high predictive
performance.
Data sets     51 data sets, see Table B.7
Algorithms    Function (Package)
Machine Learning:
  - glmnet (glmnet)
  - rpart (rpart)
  - ksvm* (kernlab)
  - ranger* (ranger)
  - xgboost* (xgboost)
having varying training set sizes or measurement lengths. For more detailed
information about the data sets, see Bagnall et al. [28].
We selected data using the following criteria: in order to reduce the required
computational resources, we i) did not run data sets that have multiple versions,
ii) excluded data sets with fewer than 3 examples in each class, and iii) removed
data sets with more than 10,000 instances or time series longer than 750 measurements.
As some of the classifiers only handle multi-class targets via 1-vs-all classification,
we iv) additionally excluded data sets with more than 40 classes.
In essence, we benchmark small and medium-sized data sets with a moderate
number of different classes.
We add 7 new algorithms and 6 feature extraction methods which can be combined
with arbitrary machine learning methods for scalar features (cf. Table 1).
Additionally, we test 5 classical machine learning methods, in order to obtain
a broader perspective on expected performance if the functional nature of the
data is ignored. As we benchmark default settings as well as tuned algorithms,
in total 80 different algorithms are evaluated across all data sets. When com-
bining feature extraction and machine learning methods, we fuse the learning
algorithm and the preprocessing, thus treating them as a pipeline where data
is internally transformed before applying the learner. This allows us to jointly
tune the hyperparameters of learning algorithm and preprocessing method. The
respective defaults and parameter ranges can be obtained from Table 3 (feature
extractors) and Table 5 (learning algorithms). A more detailed description of the
hyperparameters can be found in the respective packages' documentation.
In order to generate train/test splits, and thus obtain an unbiased estimate of
each algorithm's performance, we use stratified subsampling. We use 20 different
train/test splits for each data set in order to reduce variance, and report the
average. For tuned models, we use nested cross-validation [37] to ensure
unbiased estimates, where the outer loop is again subsampling with 20 splits,
and the inner resampling for tuning is a 3-fold (stratified) cross-validation. All
80 compared algorithms are presented with exactly the same index sets for the 20
outer train/test subsampling splits.
Mean misclassification error (MMCE) is chosen as the measure of predictive
performance in order to stay consistent with Bagnall et al. [10]. Other measures,
such as the area under the curve (AUC), require predicted probabilities and do not
trivially extend to multi-class settings.
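For reference, MMCE is simply the fraction of misclassified observations; in R (a toy illustration):

mmce_val <- function(truth, response) mean(truth != response)  # mean misclassification error
mmce_val(c("a", "b", "a", "a"), c("a", "b", "b", "a"))          # 0.25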
While Bagnall et al. [10] tune all algorithms across a carefully handcrafted grid,
we use Bayesian optimization [38]. In order to stay comparable, we analogously
fix the number of tuning iterations to 100.
We use mlrMBO [39] in order to perform Bayesian optimization of the hyperpa-
rameters of the respective algorithm. Additionally, in order to scale the method
to a larger number of data sets and machines, the R-package batchtools (Bischl
et al. [40], Lang et al. [41]) is used. This enables running benchmark experiments
on high-performance clusters. For the benchmark experiment, a job is defined
as re-sampling of a single algorithm (or tuning thereof) on a single version of a
data set. This allows for parallelization to an arbitrary number of CPUs, while
at the same time guaranteeing reproducibility. The code for the benchmark is
available from https://github.com/compstat-lmu/2019_fda_benchmark for
reproducibility.
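A condensed sketch of how this setup can be expressed with mlr, mlrMBO, and the feature-extraction wrapper is given below. It is a simplified assumption rather than the actual experiment code (which is in the repository linked above): the example task, the functional feature name "fd", and the parameter set are illustrative placeholders.

library(mlr)
library(mlrMBO)  # model-based (Bayesian) optimization backend

# Outer resampling: 20 stratified subsampling splits; inner: 3-fold stratified CV
outer <- makeResampleDesc("Subsample", iters = 20, stratify = TRUE)
inner <- makeResampleDesc("CV", iters = 3, stratify = TRUE)

# Fuse a learner with a feature extraction method into one pipeline
lrn <- makeExtractFDAFeatsWrapper("classif.ranger",
  feat.methods = list("fd" = extractFDAFourier()))  # "fd": name of the functional feature (assumption)

# Joint tuning of pipeline hyperparameters with MBO, budget of 100 evaluations
ps   <- makeParamSet(makeIntegerParam("num.trees", lower = 100, upper = 1000))
ctrl <- makeTuneControlMBO(budget = 100)
tuned_lrn <- makeTuneWrapper(lrn, resampling = inner, par.set = ps, control = ctrl)

# Benchmark tuned and default pipelines on a task with identical outer splits
task <- gunpoint.task  # example functional classification task shipped with mlr (assumed available)
res <- benchmark(list(tuned_lrn, lrn), tasks = task, resamplings = outer, measures = mmce)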
5.2. Results
This section tries to answer the questions posed in Section 1. We evaluate i)
various machine learning algorithms in combination with feature extraction, ii)
classical time series classification approaches, iii) the effect of tuning hyperpa-
rameters for several methods, and iv) try to give recommendations with respect
to which algorithm(s) to choose for new classification problems.
Algorithms evaluated in this benchmark have been divided into three groups:
Algorithms specifically tailored to functional data, classical machine learning
algorithms without feature extraction and classical machine learning algorithms
in combination with feature extraction.
Figure 3: Performances for functional data analysis algorithms in default settings (untuned)
across all 51 data sets.
Figure 4: Results for feature extraction-based machine learning algorithms with default and
tuned (MBO) hyperparameters across 51 data sets. Hyperparameters are tuned jointly for
learner and feature extraction method.
Feature extraction performs well in combination with conventional machine learning
algorithms, even at their default hyperparameters. Among the learners, random
forests, especially in combination with bsignal, show quite advantageous performance.
In addition, we find an obvious improvement from hyperparameter tuning for the
Fourier feature extraction. In terms of learners, random forest and gradient boosted
trees (xgboost) perform better than support vector machines.
[Figure 5: Accuracy of the classical machine learning learners (glmnet, ksvm, ranger, rpart, xgboost) without feature extraction, with default and tuned hyperparameters.]
All experiments were run on equivalent hardware on high-performance computing
infrastructure. Due to fluctuations in server load, this does not allow for an exact
comparison with respect to computation time, but we hope to achieve comparable
results as we repeatedly evaluate on sub-samples. Note that we restrict the tuning
to 3 algorithms where tuning traditionally leads to higher performance.
Figure 6: Comparison of running time for the different learner classes with default and tuned
hyperparameters across 51 data sets. A log transformation on the running time in seconds is
applied, and the mean running time is visualized for each stratification as a horizontal line
within the violin plot.
Table 2: Top 10 algorithms by average rank across all data sets. Percent Accuracy describes
the fraction of the maximal accuracy reached for each task.
If the only criterion for model selection is predictive performance, (tuned) machine
learning models in combination with feature extraction are a competitive baseline.
This class of methods achieves within 95% of the optimal performance on 47 out of
51 data sets, while including the best performing classifier in 35 cases.
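As an illustration of this "percent accuracy" notion (cf. Table 2), assuming a toy accuracy matrix with one row per task and one column per algorithm:

acc <- rbind(task1 = c(alg1 = 0.90, alg2 = 0.88, alg3 = 0.70),
             task2 = c(alg1 = 0.60, alg2 = 0.65, alg3 = 0.66))
pct_acc <- acc / apply(acc, 1, max)   # per task: fraction of the best accuracy achieved
colSums(pct_acc >= 0.95)              # per algorithm: number of tasks within 95% of the optimum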
id              type  values                               def.    trafo
bsignal
  bsignal.knots int   {3,...,500}                          10      -
  bsignal.df    int   {1,...,10}                           3       -
multires
  res.level     int   {2,...,5}                            -       -
  shift         num   [0.01,1]                             -       -
pca
  rank.         int   {1,...,30}                           -       -
wavelets
  filter        chr   d4,d8,d20,la8,la20,bl14,bl20,c6,c24  -       -
  boundary      chr   periodic,reflection                  -       -
fourier
  trafo.coeff   chr   phase,amplitude                      -       -
dtwkernel
  ref.method    chr   random,all                           random  -
  n.refs        num   [0,1]                                -       -
  dtwwindow     num   [0,1]                                -       -
Table 3: Parameter spaces and default settings for feature extraction methods.
• The user is not required to learn and deal with the vast complexity of the
different interfaces the underlying packages expose.
• All of the existing functionality (e.g., preprocessing, resampling, perfor-
mance measures, tuning, parallelization) of the mlr ecosystem can now be
used in conjunction with already existing algorithms for functional data.
• We expose functionality that allows us to work with functional data using
traditional machine learning methods via feature extraction methods.
• Integration of additional preprocessing methods or models is (fairly) trivial
and automatically benefits from the full mlr ecosystem.
Name                           Algorithm          Setting  Accuracy
Beef                           xgboost wavelet    tuned    0.83
ChlorineConcentration          ksvm none          tuned    0.91
DistalPhalanxOutlineAgeGroup   ranger none        default  0.83
DistalPhalanxOutlineCorrect    ranger dtwkernel   default  0.83
DistalPhalanxTW                ranger bsignal     default  0.76
Earthquakes                    FDboost none       default  0.80
Ham                            xgboost wavelet    tuned    0.84
InsectWingbeatSound            ranger wavelet     default  0.65
SonyAIBORobotSurface1          ksvm wavelet       default  0.94
Table 4: Data sets for which our mlrFDA learners were able to improve accuracy in the conducted experiments, together with the corresponding learner, setting, and accuracy.
• Tuning only a subset of the presented learners and feature extraction methods, i.e.,
the methods listed in Table 2, is sufficient to achieve good performance
on almost all data sets in our benchmark.
• A simple random forest without any preprocessing can also be a reasonable
baseline for time series data. It achieves an average rank of 15.59 (top 4)
in our benchmark.
parameter            type  values            default  trafo
ksvm
  C                  num   [-15,15]          -        2^x
  sigma              num   [-15,10]          -        2^x
ranger
  mtry.power         num   [0,1]             -        p^x
  min.node.size      num   [0,0.99]          -        2^(log2(n)*x)
  sample.fraction    num   [0.1,1]           -        -
xgboost
  nrounds            int   {1,...,5000}      100      -
  eta                num   [-10,0]           -        2^x
  subsample          num   [0.1,1]           -        -
  booster            chr   gbtree,gblinear   -        -
  max_depth          int   {1,...,15}        -        -
  min_child_weight   num   [0,7]             -        2^x
  colsample_bytree   num   [0,1]             -        -
  colsample_bylevel  num   [0,1]             -        -
  lambda             num   [-10,10]          -        2^x
  alpha              num   [-10,10]          -        2^x
FDboost
  mstop              int   {1,...,5000}      100      -
  nu                 num   [0,1]             0.01     -
  df                 num   [1,5]             4        -
  knots              int   {5,...,100}       10       -
  degree             int   {1,...,4}         3        -
Table 5: Parameter spaces and defaults used for tuning machine learning and functional data
algorithms. In case no default is provided, package defaults are used. Additional information
can be found in the respective packages' documentation.
Figure 7: Comparing accuracy between our mlrFDA learners and the classical time series
classification algorithms in [10]. For each data set, only the best accuracy for each of the two
benchmarks is shown. We observe that for 9 of the evaluated data sets the classification per-
formance can directly be improved solely by applying our mlrFDA learners, while we perform
on par with the classical time series classification algorithms (when rounding to 3 decimal
digits) on two data sets.
• Most algorithms for functional data (e.g., FDboost) do not perform well in
our benchmark study. As those algorithms are fully interpretable and offer
statistically valid coefficients, they can still be useful in some applications,
and should thus not be ruled out.
• Feature extraction techniques, such as B-spline representations (bsignal)
and wavelet extraction, work well in conjunction with machine learning
techniques for vector-valued features, such as xgboost and random forest.
• Tuning leads to an average reduction in absolute MMCE of 3.59% (ranger),
5.69% (xgboost), 7.78% (ksvm) (across feature extraction techniques), and
11% (FDboost). This holds for all feature extraction techniques, where
improvements range from 1.12% (multires) to 20.3% (fourier).
In future work, we will continue to expand the available toolbox along with
benchmarks of new methods, and provide the R community with a wider range of
methods that can be used for the analysis of functional data. This includes
not only integrating many already available packages, thereby enabling
preprocessing operations such as smoothing (e.g., fda [4]) and alignment
(e.g., fdasrvf [43] or tidyfun [44]), but also exploring and integrating advanced
imputation methods for functional data. Further work will also extend the current
implementation to support data that is measured on unequal or irregular
grids. Additionally, we aim to implement some of the current state-of-the-art
machine learning models from the time series classification bake-off [10], such as
the Collective of Transformation-Based Ensembles (COTE) [31]. This enables
researchers to use and compare with current state-of-the-art methods.
Acknowledgements
This work has been funded by the German Federal Ministry of Education and
Research (BMBF) under Grant No. 01IS18036A. The authors of this work take
full responsibility for its content.
References
[13] C. Stachl, M. Bühner, Show me how you drive and I'll tell you who you are. Recognizing gender using automotive driving parameters, Procedia Manufacturing 3 (2015) 5587–5594. URL: http://www.sciencedirect.com/science/article/pii/S2351978915007441. doi:10.1016/j.promfg.2015.07.743. 6th International Conference on Applied Human Factors and Ergonomics (AHFE 2015) and the Affiliated Conferences, AHFE 2015.
[14] R. Hyndman, E. Wang, Y. Kang, T. Talagala, Y. Yang, tsfeatures: Time Series Feature Extraction, 2018. URL: https://github.com/robjhyndman/tsfeatures/. R package version 0.1.
[26] T. Maierhofer, F. Pfisterer, classiFunc: Classification of Functional Data, 2018. URL: https://CRAN.R-project.org/package=classiFunc. R package version 0.1.1.
[27] Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid, scikit-fda: Functional Data Analysis in Python, 2019. URL: https://fda.readthedocs.io.
[28] A. Bagnall, J. Lines, W. Wickers, E. Keogh, The UEA & UCR time series classification repository, 2017. URL: www.timeseriesclassification.com.
[29] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32. URL: https://doi.org/10.1023/A:1010933404324. doi:10.1023/A:1010933404324.
[30] P. Tormene, T. Giorgino, S. Quaglini, M. Stefanelli, Matching incomplete time series with dynamic time warping: An algorithm and an application to post-stroke rehabilitation, Artificial Intelligence in Medicine 45 (2008) 11–34. doi:10.1016/j.artmed.2008.11.007.
[38] J. Snoek, H. Larochelle, R. P. Adams, Practical Bayesian optimization of machine learning algorithms, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 2951–2959. URL: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf.
[39] B. Bischl, J. Richter, J. Bossek, D. Horn, J. Thomas, M. Lang, mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions, 2017. URL: http://arxiv.org/abs/1703.03373.
[43] J. D. Tucker, fdasrvf: Elastic Functional Data Analysis, 2016. URL: https://CRAN.R-project.org/package=fdasrvf. R package version 1.6.0.
[44] F. Scheipl, J. Goldsmith, tidyfun, 2019. URL: https://github.com/fabian-s/tidyfun.
[45] L. Jin, Q. Niu, Y. Jiang, H. Xian, Y. Qin, M. Xu, Driver sleepiness detection system based on eye movements variables, Advances in Mechanical Engineering 5 (2013) 648431. URL: https://doi.org/10.1155/2013/648431. doi:10.1155/2013/648431.
[46] M. Murugappan, M. Rizon, R. Nagarajan, S. Yaacob, EEG feature extraction for classifying emotions using FCM and FKM, in: Proceedings of the 7th WSEAS International Conference on Applied Computer and Applied Computational Science, ACACOS'08, World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, 2008, pp. 299–304. URL: http://dl.acm.org/citation.cfm?id=1415743.1415793.
[47] S. Soltani, On the use of the wavelet decomposition for time series prediction, Neurocomputing 48 (2002) 267–277. URL: http://www.sciencedirect.com/science/article/pii/S0925231201006488. doi:10.1016/S0925-2312(01)00648-8.
Appendix A. API overview
For the interested reader, we provide a brief overview of the API and its
functionality.
R> library(mlr)
R> library(FDboost)  # provides the fuelSubset spectroscopy data
R> df = data.frame(fuelSubset[c("heatan", "h2o", "UVVIS", "NIR")])
The first step when setting up an experiment in any analysis is to make the data
accessible for the specific algorithms that will be applied. In mlr, the data itself
and additional information, such as which column corresponds to the target
variable, are stored as a Task, requiring the input data to be of type data.frame.
The list of column positions of the functional features is then passed as argu-
ment fd.features to makeFunctionalData(), which returns an object of type
data.frame in which the columns corresponding to each functional feature are
combined into matrix columns. 3
3 As an alternative, a list of the column names containing the functional features is also
valid as argument to fd.features, which is especially useful if columns are already labeled.
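A call along the following lines produces the structure printed below; the exact column indices are an assumption derived from the dimensions in the output (134 UVVIS and 231 NIR sampling points after the two scalar columns):

R> fd.features = list("UVVIS" = 3:136, "NIR" = 137:367)
R> fdf = makeFunctionalData(df, fd.features = fd.features)
R> str(fdf)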
$ heatan: num 26.8 27.5 23.8 18.2 17.5 ...
$ h2o   : num 2.3 3 2 1.85 2.39 ...
$ UVVIS : num [1:129, 1:134] 0.145 -1.584 -0.814 -1.311 -1.373 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : NULL
 .. ..$ : chr "UVVIS.1" "UVVIS.2" "UVVIS.3" "UVVIS.4" ...
$ NIR   : num [1:129, 1:231] 0.2818 0.2916 -0.0042 -0.034 -0.1804 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : NULL
 .. ..$ : chr "NIR.1" "NIR.2" "NIR.3" "NIR.4" ...
We additionally specify the name "fuelsubset" and the target variable "heatan".
The structure of the functional Task object is rather similar to the non-functional
Task, with the additional entry functionals, which states how many
functional features are present in the underlying data.
R> tsk1 = makeRegrTask("fuelsubset", data = fdf, target = "heatan")
R> print(tsk1)
Supervised task: fuelsubset
Type: regr
Target: heatan
Observations: 129
Features:
numerics factors ordered functionals
1 0 0 2
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
After defining the task, a learner is created by calling makeLearner. This
contains the algorithm that will be fitted on the data in order to obtain a
model. Currently, mlrFDA supports both functional regression and functional
classification. A list of supported learners can be found in Table 1.
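For the fuelsubset regression task, a functional learner can, for instance, be constructed and trained as follows (the learner ID "regr.FDboost" is used for illustration; any other integrated learner works analogously):

R> lrn = makeLearner("regr.FDboost")
R> mod = train(lrn, tsk1)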
Name Function Package
Discrete Wavelet Transform extractFDAWavelets() wavelets
Fast Fourier Transform extractFDAFourier() stats
Principal Component Analysis extractFDAPCA() stats
B-Spline Features extractFDABsignal() FDboost
Multi-Resolution Feature Extraction extractFDAMultiResFeatures() -
Time Series Features extractFDATsfeatures() tsfeatures
Dynamic Time-Warping Kernel extractFDADTWKernel() rucrdtw
Table A.6: Feature extraction methods currently implemented in mlrFDA and underlying
packages
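The output shown below stems from applying such a feature extraction to the task; a call of roughly the following form (the choice of the Fourier extractor for both features is an assumption) returns the transformed task together with a description object that allows reapplying the same extraction to new data:

R> extracted = extractFDAFeatures(tsk1,
+    feat.methods = list("UVVIS" = extractFDAFourier(), "NIR" = extractFDAFourier()))
R> extracted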
$task
Supervised task: fuelsubset
Type: regr
Target: heatan
Observations: 129
Features:
numerics factors ordered functionals
137 0 0 0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
$desc
Extraction of features from functional data:
Target: heatan
Functional Features: 2; Extracted features: 2
In the same way, we can train and predict on data, or benchmark multiple
learners across multiple data sets. Additionally, we can apply a tuneWrapper to
our learner in order to automatically tune hyperparameters of the learner and
the preprocessing method during cross-validation.
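A minimal sketch of such a setup is shown below; the wrapped learner, parameter set, and tuning budget are illustrative assumptions rather than the settings used in the benchmark:

R> lrn2 = makeExtractFDAFeatsWrapper("regr.ranger",
+    feat.methods = list("UVVIS" = extractFDAFourier(), "NIR" = extractFDAFourier()))
R> ps = makeParamSet(makeIntegerParam("num.trees", lower = 50, upper = 500))
R> tuned = makeTuneWrapper(lrn2, resampling = cv3, par.set = ps,
+    control = makeTuneControlRandom(maxit = 10))
R> r = resample(tuned, tsk1, cv5)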
Table B.7 contains all data sets used in the benchmark along with additional
data properties.
Experiments for some algorithm/data set combinations failed due to implementation
details or algorithm properties. In order to increase transparency, failed
algorithms are listed here and, if available, reasons for failure are provided.
At the time of the benchmark, the implementation in the tsfeatures package
was not stable enough to be included.
• classif.fgam
Data sets: BeetleFly, BirdChicken, Coffee, Computers, DistalPhalanx-
OutlineCorrect, Earthquakes, ECG200, ECGFiveDays, ElectricDeviceOn,
GunPoint, Ham, Herring, ItalyPowerDemand, Lightning2, MoteStrain,
ShapeletSim, SonyAIBORobotSurface1, Strawberry, ToeSegmentation1,
TwoLeadECG, Wafer, Wine, Yoga
Reason: Too few instances in some classes, such that p > n.
• classif.fdausc.kernel and classif.fdausc.np
Data sets: ElectricDeviceOn, ShapeletSim
• classif.fdausc.knn
Data sets: DistalPhalanxTW, EpilepsyX
[Figure 8: dot plot of average ranks (x-axis: Average_Rank, 10–50) for the compared algorithms, from Euclidean_1NN, xgboost_wavelet_default, and glmnet_bsignal_default at one end to BOSS, ST, Flat.COTE, and HIVE.COTE at the other.]
Figure 8: Comparing sorted average performance ranks between our mlrFDA learners (algorithm
names in lower case) and the classical time series classification algorithms (algorithm
names in capitals) in [10]. The mean rank of each individual learner over all 49 data sets is
displayed. Only the first half of all algorithms being compared are displayed here. We observe
that ensemble methods like HIVE.COTE, Flat.COTE, ST, BOSS, and EE occupy the top
tier, while the rest of the rank space is interleaved by our mlrFDA algorithms and algorithms
from [10].
Table B.7: Data sets from the UEA & UCR time series classification repository [28] used in the benchmark.