
Neural Networks 142 (2021) 252–268


Robust Optimization and Validation of Echo State Networks for learning chaotic dynamics

Alberto Racca a, Luca Magri a,b,c,d,∗

a Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK
b Aeronautics Department, Imperial College London, Exhibition Rd, London, SW7 2AZ, UK
c The Alan Turing Institute, 96 Euston Road, London, England, NW1 2DB, UK
d Institute for Advanced Study, Technical University of Munich, Lichtenbergstrasse 2a, 85748 Garching, Germany¹

Article history:
Received 9 February 2021
Received in revised form 3 May 2021
Accepted 6 May 2021
Available online 14 May 2021

Keywords:
Chaotic dynamical systems
Reservoir Computing
Robustness

Abstract

An approach to the time-accurate prediction of chaotic solutions is by learning temporal patterns from data. Echo State Networks (ESNs), which are a class of Reservoir Computing, can accurately predict the chaotic dynamics well beyond the predictability time. Existing studies, however, also showed that small changes in the hyperparameters may markedly affect the network's performance. The overarching aim of this paper is to improve the robustness in the selection of hyperparameters in Echo State Networks for the time-accurate prediction of chaotic solutions. We define the robustness of a validation strategy as its ability to select hyperparameters that perform consistently between validation and test sets. The goal is three-fold. First, we investigate routinely used validation strategies. Second, we propose the Recycle Validation, and the chaotic versions of existing validation strategies, to specifically tackle the forecasting of chaotic systems. Third, we compare Bayesian optimization with the traditional grid search for optimal hyperparameter selection. Numerical tests are performed on prototypical nonlinear systems that have chaotic and quasiperiodic solutions, such as the Lorenz and Lorenz-96 systems, and the Kuznetsov oscillator. Both model-free and model-informed Echo State Networks are analysed. By comparing the networks' performance in learning chaotic (unpredictable) versus quasiperiodic (predictable) solutions, we highlight fundamental challenges in learning chaotic solutions. The proposed validation strategies, which are based on the dynamical systems properties of chaotic time series, are shown to outperform the state-of-the-art validation strategies. Because the strategies are principled – they are based on chaos theory such as the Lyapunov time – they can be applied to other Recurrent Neural Networks architectures with little modification. This work opens up new possibilities for the robust design and application of Echo State Networks, and Recurrent Neural Networks, to the time-accurate prediction of chaotic systems.

© 2021 Elsevier Ltd. All rights reserved.

1. Introduction

Chaotic systems naturally appear in many branches of science and engineering, from turbulent flows (e.g., Bec et al., 2006; Boffetta, Cencini, Falcioni, & Vulpiani, 2002; Deissler, 1986), through vibrations (Moon & Shaw, 1983), electronics and telecommunications (Kennedy, Rovatti, & Setti, 2000), quantum mechanics (Stöckmann, 2000), reacting flows (Hassanaly & Raman, 2019; Nastac, Labahn, Magri, & Ihme, 2017), to epidemic modelling (Bolker & Grenfell, 1993), to name only a few. The time-accurate computation of chaotic systems is hindered by the "butterfly effect" (Lorenz, 1963): an error in the system's knowledge – e.g., initial conditions and parameters – grows exponentially until nonlinear saturation. Practically, it is not possible to time-accurately predict chaotic solutions after a time scale, known as the predictability time. The predictability time scales with the inverse of the dominant Lyapunov exponent, which is typically a small characteristic scale of the system under investigation (Boffetta et al., 2002).

An approach to the prediction of chaotic dynamics is data-driven. Given a time series (data), we wish to learn the underlying chaotic dynamics to predict the future evolution. The data-driven approach, also known as model-free, traces back to the delay coordinate embedding by Takens (1981), which is widely used in time series analysis, in particular, in low-dimensional systems (Guckenheimer & Holmes, 2013). An alternative data-driven approach to inferring (or, equivalently, learning) chaotic dynamics from data is machine learning. Machine learning is establishing itself as a paradigm that is complementary to first-principles

∗ Corresponding author at: Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK. E-mail address: [email protected] (L. Magri).
1 Visiting fellowship.

https://doi.org/10.1016/j.neunet.2021.05.004
0893-6080/© 2021 Elsevier Ltd. All rights reserved.

modelling of nonlinear systems in computational science and engineering (Baker et al., 2019). In the realm of neural networks, which is the focus of this paper, the feed-forward neural network is the archetypical architecture, which may excel at classification and regression problems (Goodfellow, Bengio, & Courville, 2016). The feed-forward neural network, however, is not the optimal architecture for chaotic time series forecasting because it is not designed to learn temporal correlations. Specifically, in time series forecasting, inputs and outputs are ordered sequentially, in other words, they are temporally correlated. To overcome the limitations of feed-forward neural networks, Recurrent Neural Networks (RNNs) (Rumelhart, Hinton, & Williams, 1986) have been designed to learn temporal correlations. Examples of successful applications span from speech recognition (Sak, Senior, & Beaufays, 2014), through language translation (Sutskever, Vinyals, & Le, 2014), fluids (Brunton, Noack, & Koumoutsakos, 2020; Doan, Polifke, & Magri, 2021; Nakai & Saiki, 2018; Vlachas, Byeon, Wan, Sapsis, & Koumoutsakos, 2018; Wan & Sapsis, 2018), to thermo-acoustic oscillations (Huhn & Magri, 2020), among many others. RNNs take into account the sequential nature of the inputs by updating a hidden time-varying state through an internal loop. As a result of the long-lasting time dependencies of the hidden state, however, training RNNs with Back Propagation Through Time (Werbos, 1990) is notoriously difficult. This is because the repeated backwards multiplication of intermediate gradients causes the final gradient to either vanish or become unbounded depending on the spectral radius of the gradient matrix (Bengio, Simard, & Frasconi, 1994; Werbos, 1988). This makes the training ill-posed, which may negatively affect the computation of the optimal set of weights. To overcome this problem, two main types of RNN architectures have been proposed: Gated Structures and Reservoir Computing. Gated Structures prevent gradients from vanishing or becoming unbounded by regularizing the passage of information inside the network, as accomplished in architectures such as Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRU) networks (Cho, van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, & Bengio, 2014). Alternatively, in Reservoir Computing (RC) (Jaeger & Haas, 2004; Maass, Natschläger, & Markram, 2002), a high-dimensional dynamical system, the reservoir, acts both as a nonlinear expansion of the inputs and as the memory of the system (Lukoševičius, 2012). At each time step, the output is computed as a linear combination of the reservoir state's components, the weights of which are the only trainable parameters of the machine. Training is, therefore, reduced to a linear regression problem, which bypasses the issue of repeated gradients multiplication in RNNs.

In chaotic attractors, Reservoir Computing has been employed to achieve at least four different goals: to (i) learn ergodic properties, such as Lyapunov exponents (Lu, Hunt, & Ott, 2018; Pathak, Lu, Hunt, Girvan, & Ott, 2017) and statistics (Huhn & Magri, 2020; Lu et al., 2018); (ii) filter out noise to recover the deterministic dynamics (Doan, Polifke, & Magri, 2020b); (iii) reconstruct unmeasured (hidden) variables (Doan, Polifke, & Magri, 2020a; Lu et al., 2017; Racca & Magri, 2021); and (iv) time-accurately predict the dynamics (Doan, Polifke, & Magri, 2019; Pathak, Hunt, Girvan, Lu, & Ott, 2018; Wikner et al., 2020). In this work, we focus on the time-accurate short-term prediction of chaotic attractors. A successful Reservoir Computing architecture is the Echo State Network (ESN) (Jaeger & Haas, 2004), which is a universal approximator (Gonon & Ortega, 2021; Grigoryeva & Ortega, 2018) suitable for the prediction of chaotic time series (Pathak, Hunt, Girvan, Lu, & Ott, 2018). There are two broad categories of Echo State Networks: model-free (Lukoševičius, 2012; Pathak, Hunt, Girvan, Lu, & Ott, 2018) and model-informed (Doan et al., 2019; Pathak et al., 2018). On the one hand, in model-free ESNs, which are the original networks, the training is performed on data only (Lukoševičius, 2012). On the other hand, in model-informed ESNs, the governing equations, or a reduced-order form of them, are embedded in the architecture, for example, in the reservoir in hybrid ESNs (Pathak et al., 2018), or in the loss function in physics-informed ESNs (Doan et al., 2019). In chaotic time series forecasting, model-informed ESNs typically outperform model-free ESNs (Doan et al., 2019; Pathak et al., 2018; Wikner et al., 2020). Both model-free and model-informed Echo State Networks perform as well as LSTMs and GRUs, requiring less computational resources for training (Chattopadhyay, Hassanzadeh, & Subramanian, 2020; Vlachas et al., 2020). However, there are two major challenges when using Echo State Networks. First, a random initialization is required to create the reservoir (Lukoševičius, 2012). Networks with different initializations may perform substantially differently, even after hyperparameter tuning (Haluszczynski & Räth, 2019); thus, network testing through an ensemble of network realizations is required. Secondly, there is high hyperparameter sensitivity (Jiang & Lai, 2019; Lukoševičius, 2012). The most common validation strategy to compute the hyperparameters for learning chaotic dynamics is the Single Shot Validation, which minimizes the error in an interval subsequent to the training interval. Other validation strategies have been investigated, such as the Walk Forward Validation and the K-Fold cross Validation (Lukoševičius & Uselis, 2019), but the study was restricted to non-chaotic systems. The computation of the optimal set of hyperparameters is typically performed by Grid Search (Doan et al., 2021; Jaeger & Haas, 2004; Pathak, Hunt, Girvan, Lu, & Ott, 2018; Pathak et al., 2018), although other optimization strategies such as Evolutionary Algorithms (Ferreira, Ludermir, & De Aquino, 2013; Ishu, van der Zant, Becanovic, & Ploger, 2004), Stochastic Gradient Descent (Thiede & Parlitz, 2019), Particle Swarm Optimization (Wang & Yan, 2015) and Bayesian Optimization (Yperman & Becker, 2016) have been proposed. In particular, Bayesian Optimization (BO) has proved to improve the performance of reservoir-computing architectures in the prediction of chaotic time series, outperforming the commonly used Grid Search strategy (Griffith, Pomerance, & Gauthier, 2019). Bayesian Optimization is a gradient-free search strategy; thereby, it is less sensitive to local minima with respect to gradient descent methods (Thiede & Parlitz, 2019; Yperman & Becker, 2016). Moreover, Bayesian Optimization is based on Gaussian Process (GP) regression (Rasmussen, 2003); therefore, it naturally quantifies the uncertainty on the computation.

This paper studies the robust selection of hyperparameters in Echo State Networks for the time-accurate prediction of chaotic attractors, with a focus on the performance of the hyperparameters between validation and test sets across different network realizations. We define the robustness of a validation strategy as its ability to select hyperparameters that perform consistently between validation and test sets. The objective of this work is three-fold. First, we investigate the robustness of the Single Shot Validation, the Walk Forward Validation and the K-Fold cross Validation. Second, we propose the Recycle Validation and the chaotic version of existing validation strategies to specifically tackle the forecasting of chaotic systems. Third, we compare Bayesian optimization with grid search for optimal hyperparameter selection. The Lorenz system (Lorenz, 1963), the Lorenz-96 model (Lorenz, 1996) and the Kuznetsov oscillator (Kuznetsov, Kuznetsov, & Stankevich, 2010) are considered as prototypical nonlinear deterministic systems. We highlight fundamental challenges in the robustness of ESNs for chaotic solutions with a comparative investigation on quasiperiodic oscillations. Both model-free and model-informed architectures are analysed.

The paper is organized as follows. Section 2 presents the model-free and model-informed Echo State Network architectures. Section 3 describes the validation strategies. Section 4

investigates the robustness of the Single Shot Validation in forecasting chaotic time series. Section 5 analyses the new validation strategies to improve the robustness in forecasting chaotic time series. Section 6 investigates the robustness of the validation strategies in forecasting quasiperiodic time series. Finally, we summarize the results of this study and discuss future work in the conclusions (Section 7).

2. Echo state networks

As shown in Fig. 1, in the Echo State Network, at any time ti the input vector, uin(ti) ∈ R^{Nu}, is mapped into the reservoir state by the input matrix, Win ∈ R^{Nr×Nu}, where Nr ≫ Nu. The reservoir state, r(ti) ∈ R^{Nr}, is updated at each time iteration as a function of the current input and its previous value

    r(ti+1) = tanh(Win uin(ti) + W r(ti)),    (1)

where W ∈ R^{Nr×Nr} is the state matrix. The predicted output, up(ti+1) ∈ R^{Nu}, is obtained as

    up(ti+1) = r̂(ti+1)^T Wout,    r̂(ti+1) = g(r(ti+1));    (2)

where g(·) is a nonlinear transformation, r̂(ti+1) ∈ R^{Nr̂} is the updated reservoir state, and Wout ∈ R^{Nr̂×Nu} is the output matrix. The input matrix, Win, and state matrix, W, are (pseudo)randomly generated and fixed, while the weights of the output matrix, Wout, are computed by training the network. In this work, the input matrix, Win, has only one element different from zero per row, which is sampled from a uniform distribution in [−σin, σin], where σin is the input scaling. The state matrix, W, is an Erdős–Rényi matrix with average connectivity d, in which each neuron (each row of W) has on average only d connections (non-zero elements). The non-zero elements are obtained by sampling from a uniform distribution in [−1, 1]; the entire matrix is then rescaled by a multiplication factor to set the spectral radius, ρ. The spectral radius is key to enforcing the echo state property. (In a network with the echo state property, the state loses its dependence on its previous values for sufficiently large times and, therefore, it is uniquely defined by the sequence of inputs.) While the echo state property may hold for a wider range of spectral radii (Yildiz, Jaeger, & Kiebel, 2012), the condition ρ < 1 is typically chosen (Lukoševičius, 2012).

The ESN can be run either in open-loop or closed-loop configuration. In the open-loop configuration, first, we feed data as the input at each time step to compute and store r̂(ti) (1)–(2). In the initial transient of this process, the washout interval, we do not compute the output, up(ti). The purpose of the washout interval is for the reservoir state to satisfy the echo state property, thereby becoming independent of the arbitrarily chosen initial reservoir state, r(t0) = 0. Secondly, we train the output matrix, Wout, by minimizing the Mean Square Error (MSE) between the outputs, up(ti), and the data, ud(ti), over a training set of Ntr points

    MSE ≜ (1 / (Ntr Nu)) Σ_{i=1}^{Ntr} ‖up(ti) − ud(ti)‖²,    (3)

where ‖·‖ is the L2 norm. Minimizing (3) is a least-squares minimization problem, which can be solved as a linear system through ridge regression

    (R R^T + β I) Wout = R Ud^T,    (4)

where R ∈ R^{Nr̂×Ntr} and Ud ∈ R^{Nu×Ntr} are the horizontal concatenations of the updated reservoir states, r̂(ti), and the data, ud(ti), respectively; I is the identity matrix and β is the user-defined Tikhonov regularization parameter (Tikhonov, Goncharsky, Stepanov, & Yagola, 2013). We solve the linear system through the linalg.solve function in numpy (Harris et al., 2020). In the closed-loop configuration, starting from an initial data point as an input and an initial reservoir state obtained after the washout interval, the output, up(ti), is fed back to the network as an input for the next time step prediction. In doing so, the network is able to autonomously evolve in the future. The closed-loop configuration is used during validation and testing.

2.1. Model-free and model-informed architectures

We consider model-free and model-informed architectures (Fig. 1). The basic model-free ESN is obtained by setting g(r(ti)) = r(ti). This architecture, however, generates symmetric solutions in the closed-loop configuration (Huhn & Magri, 2020; Lu et al., 2017), which can cause the predicted trajectory to stray away from the actual attractor towards a symmetric attractor, which is not a solution of the dynamical system (but it is a solution of the network). To break the symmetry, we add biases in the input and output layers

    r(ti+1) = tanh(Win [uin(ti); bin] + W r(ti)),    r̂(ti+1) = [r(ti+1); 1],    up(ti+1) = r̂(ti+1)^T Wout;    (5)

where [ · ; · ] indicates vertical concatenation, bin is the scalar input bias and Win ∈ R^{Nr×(Nu+1)}. In the model-informed ESN, also known as hybrid as proposed by Pathak et al. (2018), information about the governing equations (model knowledge) is embedded into the model through a function of the input, K(uin(ti)), which, for example, may be a reduced order model that provides information about the output at the next time step as

    r̂(ti+1) = [r(ti+1); 1; K(uin(ti))].    (6)

In this work, we use K(uin(ti)) only to update the reservoir state (Wikner et al., 2020), in order to use the same input matrix, Win, and state matrix, W, of the model-free architecture. This allows us to directly compare the performances of the model-free and model-informed architectures.

3. Validation

The purpose of the validation is to determine the hyperparameters by minimizing an error. We make a distinction between the hyperparameters (i) that require re-initialization of Win and W, and (ii) that do not require re-initialization. The size of the reservoir, Nr, and the connectivity, d, require re-initialization, whereas the input scaling, σin, the spectral radius, ρ, the Tikhonov parameter, β, and the input bias, bin, do not. The fundamental difference between (i) and (ii) is that the random component of the re-initialization of Win and W makes the objective function to be optimized random, which significantly increases the complexity of the optimization. In this study, we minimize the error with respect to the input scaling, σin, and spectral radius, ρ, which are key hyperparameters for the performance of the network (Jiang & Lai, 2019; Lukoševičius, 2012). For convenience, we rewrite the reservoir state equation (1) as

    r(ti+1) = tanh(σin Ŵin [uin(ti); bin] + ρ Ŵ r(ti)),    (7)

where the non-zero elements of Ŵin are sampled from the uniform distribution in [−1, 1] and Ŵ has been scaled to have a unitary spectral radius.
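The open-loop training described above, i.e., the reservoir update of Eq. (7) followed by the ridge regression of Eq. (4), can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the authors' code: the sizes, hyperparameter values, and the random placeholder time series are chosen only for demonstration, and the washout interval is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (placeholders, not the paper's values)
N_u, N_r, N_tr = 3, 100, 1000
sigma_in, rho, beta, b_in = 0.1, 0.9, 1e-6, 1.0
d = 3  # average connectivity of the Erdos-Renyi state matrix

# W_in_hat: one non-zero element per row, sampled from U[-1, 1] (Eq. (7));
# the extra column multiplies the input bias b_in
W_in_hat = np.zeros((N_r, N_u + 1))
W_in_hat[np.arange(N_r), rng.integers(0, N_u + 1, N_r)] = rng.uniform(-1, 1, N_r)

# W_hat: sparse random matrix rescaled to unitary spectral radius
W_hat = rng.uniform(-1, 1, (N_r, N_r)) * (rng.random((N_r, N_r)) < d / N_r)
W_hat /= np.max(np.abs(np.linalg.eigvals(W_hat)))

def step(r, u):
    """Reservoir update, Eq. (7)."""
    return np.tanh(sigma_in * W_in_hat @ np.concatenate([u, [b_in]]) + rho * W_hat @ r)

# Open-loop pass over the data (washout omitted for brevity)
U = rng.standard_normal((N_tr + 1, N_u))  # placeholder time series u_d(t_i)
r = np.zeros(N_r)
R = np.empty((N_r + 1, N_tr))  # columns are the updated states r_hat = [r; 1]
for i in range(N_tr):
    r = step(r, U[i])
    R[:, i] = np.concatenate([r, [1.0]])

# Ridge regression, Eq. (4): (R R^T + beta I) W_out = R U_d^T,
# with the targets u_d(t_{i+1}) stacked as the rows of U[1:]
W_out = np.linalg.solve(R @ R.T + beta * np.eye(N_r + 1), R @ U[1:])
```

In closed-loop, the prediction up(ti+1) = r̂(ti+1)^T Wout would then be fed back as the next input, as described in Section 2.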

Fig. 1. Schematic representation of (a) model-free and (b) model-informed Echo State Networks (ESNs).

3.1. Performance metrics

We determine the hyperparameters by minimizing the Mean Squared Error (3) in the validation interval of fixed length. The networks are tested on multiple starting points along the attractor by using both the Mean Squared Error and the Prediction Horizon (PH), the latter of which is defined as the time interval during which the normalized error is smaller than a user-defined threshold k (Doan et al., 2019; Pathak et al., 2018)

    ‖up(ti) − ud(ti)‖ / √( (1/NPH) Σ_{j=1}^{NPH} ‖ud(tj)‖² ) < k,    (8)

where NPH is the number of timesteps in the Prediction Horizon, and k = 0.2 unless otherwise specified. The Mean Squared Error and the Prediction Horizon for the same starting point in the attractor are strictly correlated (see Supplementary Materials S.1). We use the Mean Squared Error to partition the dataset in intervals of fixed length during validation, while we use the Prediction Horizon in the test set because it is the most physical quantity when assessing the time-accurate prediction of chaotic systems (e.g., Doan et al., 2020b; Pathak, Hunt, Girvan, Lu, & Ott, 2018).

3.2. Strategies

The most common validation strategy for ESNs is the Single Shot Validation (SSV), which splits the available data in a training set and a single subsequent validation set (Fig. 2a). The time interval of the validation set, during which the hyperparameters are tuned, is small and represents only a fraction of the attractor. In nonlinear time series prediction, the choice of the validation strategy has to take into account (i) the intervals we are interested in predicting and (ii) the nature of the signal we are trying to reproduce. Here, we are interested in predicting multiple intervals as the trajectory spans the attractor, rather than a specific interval starting from a specific initial condition. Moreover, the trajectory that spans the attractor has no underlying time-varying statistics, e.g., there is no time-dependency of the mean of the signal; hence, trajectories return indefinitely in nearby regions of the attractor (or, more technically, chaotic dynamics is topologically mixing) (e.g., Guckenheimer & Holmes, 2013). This implies that we can obtain information regarding the intervals that we are interested in predicting from any interval of the trajectory that constitutes our dataset, regardless of the interval's position in time within the dataset. This means that (i) all the parts of the dataset are equally important in determining the hyperparameters and (ii) the validation should be performed on the entire dataset and not only on the last portion of it. For this reason, as shown in Section 4, the Single Shot Validation strategy is not suited for chaotic time series prediction.

These observations lead us to use validation strategies based on multiple validation intervals, which may precede the training set, such as the Walk Forward Validation (WFV) and the K-Fold cross Validation (KFV). We also propose an ad-hoc validation strategy, the Recycle Validation (RV). The objective of these strategies is to tune the hyperparameters over an effectively larger portion of the trajectory, by minimizing the average of the objective function (error) over multiple validation intervals. The regular version of these strategies consists of creating subsequent folds by moving the validation set forward in time by its own length. Additionally, we propose the chaotic version, in which we move the fold forward in time by one Lyapunov Time (LT) (Fig. 2). The Lyapunov Time is a key time scale in chaotic dynamical systems, which is defined as the inverse of the leading Lyapunov exponent Λ of the system, which, in turn, is the exponential rate at which infinitesimally close trajectories, δq(0), diverge (e.g., Boffetta et al., 2002)

    ‖δq(t)‖ ∼ ‖δq(0)‖ exp(Λt),    t → ∞, ‖δq(0)‖ → 0.    (9)

Walk Forward Validation. In the Walk Forward Validation (WFV) (Fig. 2b), we partition the available data in multiple splits, while maintaining the sequentiality of the data. From a starting dataset of length n, the first m points (m < n) are taken as the first fold, with Ntr points for training and v points for validation (v + Ntr = m). These quantities must respect (n − m) = (k1 − 1)v; k1 ∈ N. The remaining (k1 − 1) folds are generated by moving the training plus validation set forward in time by a number of points v. This way, the original dataset is partitioned in k1 folds, and the hyperparameters are selected to minimize the average MSE over the folds. For every set of hyperparameters and every fold, the output matrix, Wout, is recomputed.

K-Fold cross Validation. Although the K-Fold cross Validation (KFV) (Fig. 2c) is a common strategy in regression and classification, it is not commonly used in time series prediction because the validation and training intervals are not sequential to each other. This strategy partitions the available data in k2 splits. Over the entire dataset of length n, after an initial bv points, with 0 ≤ b < 1, needed to have an integer number of splits, the remaining n − bv points are used as k2 validation intervals, each of length v. For each validation interval we define a different fold, in which we use all the remaining data points for training. We determine the hyperparameters by minimizing the average of the MSE between the folds. For every set of hyperparameters and every fold, the output matrix, Wout, is recomputed.
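The fold construction described above can be sketched with a hypothetical helper (illustrative names, not the authors' implementation): the regular version shifts each fold forward by the validation length v, while the chaotic version shifts it by the number of points in one Lyapunov Time, which is assumed here to be 25 points.

```python
def wfv_folds(n, n_tr, v, step):
    """Walk Forward Validation folds as (train, validation) index ranges.

    n     : total number of points in the dataset
    n_tr  : training points per fold
    v     : validation points per fold
    step  : shift between consecutive folds; step = v gives the regular
            version, step = points-per-Lyapunov-Time the chaotic version
    """
    folds = []
    start = 0
    while start + n_tr + v <= n:
        folds.append((range(start, start + n_tr),               # training interval
                      range(start + n_tr, start + n_tr + v)))   # validation interval
        start += step
    return folds

# Regular version: the fold moves forward by its own validation length
regular = wfv_folds(n=1000, n_tr=600, v=100, step=100)
# Chaotic version: the fold moves forward by one Lyapunov Time
# (assumed here to correspond to 25 points)
chaotic = wfv_folds(n=1000, n_tr=600, v=100, step=25)
print(len(regular), len(chaotic))  # the chaotic version yields more folds
```

The overlapping chaotic folds increase the number of validation intervals extracted from the same dataset, which is the stated purpose of the chaotic version.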

Fig. 2. Partition of the data in the different validation strategies. In (b–d), bar 1 shows the first fold, bar 2 shows the second fold, and bar 2c shows the second fold
in the chaotic version (shifted by one Lyapunov time).

Recycle Validation. We propose the Recycle Validation (RV) (Fig. 2d), which exploits the information obtained by both open-loop and closed-loop configurations. Because the network works in two different configurations, it can obtain additional information when validating on data already used in training. To do so, first, we train Wout only once per set of hyperparameters using the entire dataset of n points. Second, we validate the network on k2 splits of length v from data that has already been used to train the output weights. Each split is imposed by moving the previous validation interval forward in time by v points. After an initial bv points, with 0 ≤ b < 1, needed to have an integer number of splits, the remaining n − bv points are used as k2 validation intervals. We determine the hyperparameters by computing the average of the MSE between the splits. This strategy has four main advantages. First, it can be used in small datasets, where the partition of the dataset in separate training and validation sets may cause the other strategies to perform poorly. In small datasets, the validation intervals represent a larger percentage of the dataset, since each validation interval needs to be multiple Lyapunov Times to capture the divergence of chaotic trajectories. Therefore, the training set becomes substantially smaller than the dataset, and the output matrix used during validation differs substantially from the output matrix of the whole dataset. This results in a poor selection of hyperparameters. Second, for a given dataset, we maximize the number of validation splits, using the same validation intervals of the K-Fold cross Validation. Third, we tune the hyperparameters using the same output matrix, Wout, that we use in the test set. Fourth, it has lower computational cost than the K-Fold cross Validation because it does not require retraining the output matrix for the different folds, which makes it computationally cheaper (Appendix A).

Chaotic version. The chaotic version consists of shifting the validation intervals forward in time, not by their own length, but by one Lyapunov Time when constructing the next fold. In doing so, different splits will overlap, but, since the trajectory related to the split that started one Lyapunov Time (LT) earlier has strayed away from the attractor on average by e^{Λ×1LT} = e, the two intervals contain different information. The purpose of this version is to further increase the number of intervals on which the network is validated. The regular and chaotic versions for each validation strategy are shown in frames (b–d) of Fig. 2 in bars 2 and 2c, respectively. The chaotic versions of the Walk Forward Validation, the K-Fold cross Validation and the Recycle Validation are denoted by the subscript c.

Computing Wout. In each validation strategy, for each pair of input scaling, σin, and spectral radius, ρ, we recompute the output matrix, Wout. Moreover, for a pair of σin and ρ, in each fold of the K-Fold cross Validation and Walk Forward Validation a different Wout is computed. Even with the same hyperparameters, the folds have different Wout because the training data is different. For each validation strategy, once Wout is determined in open-loop, the error that is minimized is that obtained by running the network in closed-loop in the validation interval(s). After training and validation are completed – i.e., we have selected the hyperparameters – the Wout to be used in the test set is computed on the entire dataset used for training plus validation using the optimal hyperparameters.

3.3. Bayesian optimization and grid search

To find the minimum of the Mean Squared Error (3) of the validation set in the hyperparameter space, we use Bayesian Optimization (BO), which is compared to Grid Search (GS). Bayesian Optimization has been shown to outperform other state-of-the-art optimization methods when the number of evaluations of an expensive objective function is limited (Brochu, Cora, & De Freitas, 2010; Snoek, Larochelle, & Adams, 2012). It is a global search method, which is able to incorporate prior knowledge about the objective function and to use information from the entire search space. It treats the objective function as a black box; therefore, it does not require gradient information. Starting from an initial Nst evaluations of the objective function, BO performs a Gaussian Process (GP) regression (Rasmussen, 2003) to reconstruct the function in the search space, using function evaluations as data. Once the GP fitting is available, we select the new point at which to evaluate the objective function so that the new point maximizes the acquisition function. The acquisition function is evaluated on the mean and standard deviation of the GP reconstruction. After the objective function is evaluated at a new point, the enlarged data set, comprising the new point, is used to perform another GP regression, select a new point and so
256
A. Racca and L. Magri Neural Networks 142 (2021) 252–268

on and so forth. In this work, we use the gp-hedge Bayesian Optimization algorithm implemented in the scikit-optimize library in Python (Hoffman, Brochu, & de Freitas, 2011; Virtanen et al., 2020). The details of the formulation are explained in Appendix B and in the Supplementary Material S.5.

4. Robustness of the single shot validation

As a prototypical chaotic system, we investigate the Lorenz system (Lorenz, 1963), which is a reduced-order model of Rayleigh–Bénard convection

ẋ = σL (y − x),
ẏ = x(ρL − z) − y,
ż = xy − βL z,    (10)

where [σL, βL, ρL] = [10, 8/3, 28] is selected to generate chaotic solutions. The system is integrated with a forward Euler scheme with step dt = 0.009 LT. The Lyapunov Time is LT = Λ⁻¹ ≈ 1.1 (Viswanath, 1998).

We analyse the performance of the Single Shot Validation (SSV), which is employed for training (1 LT to 9 LTs), validation (9 LTs to 12 LTs), and testing (12 LTs to 15 LTs), as shown in Fig. 3. The input, uin(ti), is normalized by its maximum variation. (This is done because we are using a single scalar quantity σin to scale all the components of the input.) The network has a fixed number of neurons, Nr = 100, connectivity, d = 3, Tikhonov parameter, βt = 10⁻¹¹, and input bias, bin = 1 (Lukoševičius, 2012). The bias, bin, is set to have the same order of magnitude as the normalized input. The input scaling, σin, and spectral radius, ρ, are tuned during validation in the range [0.5, 5] × [0.1, 1] to minimize log10(MSE). The range of the spectral radius, ρ, is selected for the network to respect the echo state property, whereas the range of the input scaling, σin, is selected to normalize the inputs. The optimization is performed with (i) Grid Search (GS), consisting of 7 × 7 points, and (ii) Bayesian Optimization (BO), consisting of 5 × 5 starting points and 24 points acquired by the gp-hedge algorithm. The two optimization schemes are applied to an ensemble of Nens = 50 networks, which differ by the random initialization of the input matrix, Win, and the state matrix, W. Nens is selected after a test on statistical convergence (more details in Supplementary Material S.2).

Fig. 4 shows the performance of the optimal hyperparameters computed by Grid Search and Bayesian Optimization for the ensemble members. First, we analyse the performance in validation (panel (a)). As shown by the medians reported in the caption, Bayesian Optimization markedly outperforms Grid Search. Second, we analyse the performance in the test set (panel (b)). The performance of each network is assessed by computing the MSE in the test set for the hyperparameters found in the validation set. For this, the output matrix, Wout, of the test set is obtained by retraining over both the training and validation sets. There is a significant difference between the performance of the ensemble in the validation set (a) and test set (b). This means that the Single Shot Validation is not robust, i.e., it is not able to select hyperparameters that perform consistently between validation and test sets. This marginal robustness causes the overall performance of the networks, and the benefit of using Bayesian Optimization, to be markedly reduced. This is a signature of chaos, whose unpredictability results in a weak correlation between validation and test sets. We further verify the low correlation between the sets by computing the mean of the Gaussian Process reconstruction from a 30 × 30 grid of log10(MSE) for a representative network of the ensemble (Fig. 5). The performance of the hyperparameters can deteriorate by four, or more, orders of magnitude from the validation set to the test set (panel (c)).

To assess quantitatively the correlation of the optimal hyperparameters' performance between the validation and test sets, we use the Spearman coefficient (Spearman, 1904)

r̃S(x, y) = Σ_{j=1}^{2Nens} (z(x)_j − Nens)(z(y)_j − Nens) / [ √(Σ_{j=1}^{2Nens} (z(x)_j − Nens)²) √(Σ_{j=1}^{2Nens} (z(y)_j − Nens)²) ],

x = [m_Val^(BO); m_Val^(GS)],  y = [m_Test^(BO); m_Test^(GS)],    (11)

where z(x) is the ranking function; m ∈ R^Nens contains the MSE for the optimal hyperparameters in validation (subscript Val), or test (subscript Test), obtained by Bayesian Optimization (superscript BO), or Grid Search (superscript GS). r̃S quantifies the correlation between the MSE of the optimal hyperparameters, obtained during validation by both Bayesian Optimization and Grid Search, and the MSE for the same hyperparameters in the test set over the ensemble. The values r̃S = {−1, 0, 1} indicate anticorrelation, no correlation and correlation, respectively.

Fig. 6 shows the correlation analysis. The scatter plot for x and y (panel (a)) shows that the MSEs of the optimal hyperparameters in the validation and test sets are weakly correlated, with r̃S = 0.32. Panels (b, c) show the values of the optimal hyperparameters, which vary substantially from one network realization to another.

4.1. Remarks

First, because the MSE and optimal hyperparameters vary significantly in different network realizations, we advise performing optimization separately for each network to increase the performance (as further verified in Appendix C). Second, hyperparameters that are optimal in the validation set may have a poor performance in the test set, which may greatly reduce the benefit of using Bayesian Optimization to select the hyperparameters. This highlights a fundamental challenge in learning chaotic solutions, in which validation and test sets may be topologically different portions of the attractor. We, thus, advise that the Single Shot Validation not be used in the validation of Echo State Networks in chaotic attractors. To improve robustness, we analyse different validation strategies (Section 3.2) in the next section.

5. Validation strategies in chaotic systems

5.1. Lorenz system

We compare different validation strategies on the ensemble of Nens = 50 networks in a "short" dataset (12 LTs) and a "long" dataset (24 LTs) of the Lorenz system. The long dataset is obtained by the integration of the time series in Fig. 3. In addition to the short dataset, we analyse the long dataset for two reasons. First, we wish to test validation strategies that require larger datasets to fully perform, such as the Walk Forward Validation. Second, we wish to investigate how the robustness is affected by the size of the dataset. We use the Single Shot Validation (SSV), Walk Forward Validation (WFV), K-Fold Validation (KFV), Recycle Validation (RV), and the corresponding chaotic versions (subscript c). The long dataset allows us to define an additional chaotic Walk Forward Validation (WFVc), denoted by the superscript ∗, as detailed in the Supplementary Material (S.3).

The test set has Nt = 100 starting points on the attractor to sample different regions of the solutions (more details in Supplementary Material S.2). For each starting point in the test set, the initial reservoir state vector is obtained by performing 1 LT of washout. The Prediction Horizon is globally quantified as an arithmetic mean, PHtest, with threshold k = 0.2, whereas the Mean Squared Error is globally quantified as a geometric mean, MSETest, in intervals of 3 LTs.
Fig. 3. Solution of the Lorenz system. (a) Time series, and (b) phase plot for a longer time window, where the red trajectory indicates the 0 to 12 LTs interval. Time
is expressed in Lyapunov time (LT) units. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
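A time series such as the one in Fig. 3 can be generated with a few lines. The sketch below uses the forward Euler scheme and parameters of Section 4; the initial condition and function names are our own choices.

```python
import numpy as np

def lorenz_rhs(u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system, Eq. (10)."""
    x, y, z = u
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def generate_lorenz(n_steps, dt, u0=(1.0, 1.0, 1.0)):
    """Forward-Euler integration (the paper uses dt = 0.009 LT, LT ~ 1.1)."""
    u = np.empty((n_steps + 1, 3))
    u[0] = u0
    for i in range(n_steps):
        u[i + 1] = u[i] + dt * lorenz_rhs(u[i])
    return u
```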
Fig. 4. Performance of the optimal hyperparameters computed by Grid Search (GS) and Bayesian Optimization (BO) in (a) validation and (b) test sets. Vertical lines
indicate the median of Grid Search (dash-dotted) and Bayesian Optimization (dashed). The medians are [5.4, 23.0] × 10−6 in the validation set and [64.8, 60.5] × 10−6
in the test set for BO and GS, respectively.
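The gp-hedge algorithm itself is provided by scikit-optimize (see Appendix B); the fit–acquire–evaluate cycle that it relies on can be illustrated with a self-contained sketch. The sketch below replaces gp-hedge with a single lower-confidence-bound acquisition and uses a fixed-length-scale RBF kernel, so all parameter values and names are illustrative, not the paper's.

```python
import numpy as np

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a GP regression."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_query, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def bayes_opt(f, lo, hi, n_init=4, n_iter=10, kappa=2.0, seed=0):
    """Minimal BO loop: fit a GP, minimize the acquisition, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n_init)
    y = np.array([f(xi) for xi in x])
    grid = np.linspace(lo, hi, 200)
    for _ in range(n_iter):
        mean, std = gp_posterior(x, y, grid)
        x_new = grid[np.argmin(mean - kappa * std)]  # lower confidence bound
        x = np.append(x, x_new)
        y = np.append(y, f(x_new))
    i_best = np.argmin(y)
    return x[i_best], y[i_best]
```

The acquisition trades off the posterior mean (exploitation) against the posterior standard deviation (exploration), which is the mechanism described in the text; gp-hedge additionally hedges over several such acquisition functions.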
Fig. 5. Mean of the Gaussian Process reconstruction of the MSE in the (a) validation and (b) test sets for a representative network in the ensemble. Frame (c) shows
the difference between the two sets. The MSE is saturated to be ≤ 1 in (a, b), whereas the error is saturated to be ≤ 104 in (c). The reconstruction is performed on
a grid of 30 × 30 evaluations of log10 (MSE). For the same hyperparameters, the MSE can differ by orders of magnitude between the validation and test sets.
Model-free ESN

Fig. 7 shows the mean of the Gaussian Process reconstruction of log10(MSE) in the hyperparameter space for a representative network of the ensemble. Panels (a, b, c) show the performance of three validation strategies in the validation set, whereas panel (d) shows the performance of the network in the test set. Because the error in (b, c) is similar to the error in (d), and the error in (a) differs from (d), we conclude that in the test set the hyperparameters computed through KFVc and RVc perform well, but the hyperparameters computed through SSV perform poorly.

A correlation analysis is shown in Table 1 with the Spearman correlation coefficients, r̃S (11) (short and long datasets), and in Fig. 8 with scatter plots of the optimal hyperparameters' performance (long dataset, for brevity). The Single Shot Validation has the lowest correlation among all the validation strategies in both datasets. The chaotic versions of the validation strategies correlate better than the corresponding regular versions. In particular, the chaotic K-Fold Validation and the chaotic Recycle Validation have the highest correlations. In general, increasing the size of the dataset increases the correlation, but the Single Shot Validation in the long dataset has a lower correlation than
Fig. 6. (a) Linear regression (LinReg) and scatter plot of the MSE of the optimal hyperparameters obtained from Bayesian Optimization (BO) and Grid Search
(GS) for each network. Optimal hyperparameters for each network and corresponding MSE in (b) validation and (c) test sets. For different networks the optimal
hyperparameters, and their performance, vary significantly.
Fig. 7. Mean of the Gaussian Process reconstruction of the MSE for the short dataset of the Lorenz system for a representative network of the ensemble. Validation
set for (a) Single Shot Validation (SSV), (b) chaotic K-Fold Validation (KFVc ), and (c) chaotic Recycle Validation (RVc ); and test set (d). The MSE is saturated to be
≤ 1. The Gaussian Process is based on a grid of 30 × 30 data points.
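The index bookkeeping behind the chaotic folds compared above is simple: successive validation intervals are shifted forward by one Lyapunov Time rather than by their own length. A sketch of this bookkeeping (argument names and the exact indexing convention are ours):

```python
def chaotic_folds(n_train, n_val, n_folds, steps_per_lt):
    """(start, end) step indices of the validation intervals: fold k starts
    one Lyapunov Time (steps_per_lt steps) after fold k-1, so consecutive
    folds overlap while their trajectories decorrelate by a factor e per LT."""
    return [(n_train + k * steps_per_lt, n_train + k * steps_per_lt + n_val)
            for k in range(n_folds)]
```

In the regular (non-chaotic) versions, the shift would instead be the interval length n_val, producing non-overlapping folds.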
Fig. 8. Linear regression (LinReg) and scatter plot of the MSE of the optimal hyperparameters obtained from Bayesian Optimization (BO) and Grid Search (GS) for
each network. Single Shot Validation (SSV), Walk Forward Validation (WFV), K-Fold Validation (KFV), Recycle Validation (RV), and their chaotic versions (subscript
c). Long dataset of the Lorenz system.
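The correlations of this kind are computed with Eq. (11), which is the Pearson correlation of the ranks of the stacked BO/GS errors. A minimal sketch (no handling of tied ranks; names ours):

```python
import numpy as np

def ranks(v):
    """Ranking function z: 1-based position of each entry in sorted order,
    assuming no ties."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(1, len(v) + 1)
    return r

def spearman(x, y):
    """Pearson correlation of the ranks, cf. Eq. (11), with x and y the
    stacked validation/test MSEs of BO and GS over the ensemble."""
    zx, zy = ranks(np.asarray(x)), ranks(np.asarray(y))
    zx -= zx.mean()
    zy -= zy.mean()
    return float(np.sum(zx * zy) / np.sqrt(np.sum(zx**2) * np.sum(zy**2)))
```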
Fig. 9. Comparison between hyperparameter optimization by Bayesian Optimization (BO) and Grid Search (GS). The performance metrics are the Mean Square error
(MSE) and Predictability Horizon (PH). 25th (lower bar), 50th (marker) and 75th (upper bar) percentiles. (a, c) short dataset, (b, d) long dataset. Lorenz system.
Table 1
Spearman coefficients between validation and test sets for the model-free ESN in the Lorenz system. Bold text indicates the highest correlation in the dataset.

r̃S                      SSV   WFV   WFVc  WFV∗c  KFV   KFVc  RV    RVc
Short dataset (12 LTs)  0.31  0.31  0.50  –      0.60  0.65  0.59  0.62
Long dataset (24 LTs)   0.49  0.51  0.61  0.70   0.70  0.85  0.67  0.81

the K-Fold Validations and the Recycle Validations in the short dataset. This further demonstrates the poor robustness of the Single Shot Validation. Last, but not least, the Recycle Validation is computationally cheaper than the K-Fold Validation because the output matrix is the same for the different folds (more analysis on the computational time can be found in Appendix A).

A comparison between Bayesian Optimization (BO) and Grid Search (GS) is shown in Fig. 9. Panels (a, b) show the ratio of the MSE between the optimal hyperparameters obtained by Bayesian Optimization and Grid Search in the validation (Val) and test (Test) sets. In both datasets, Bayesian Optimization outperforms Grid Search in the validation set in ∼75% of the networks (except for one outlier). However, BO and GS perform similarly in the test set, especially in the short dataset (a). In the long dataset (b), Bayesian Optimization on average outperforms Grid Search, although there is a decrease in performance with respect to the validation set. Panels (c, d) show the Prediction Horizon (PH) in the test set. The chaotic K-Fold Validation and the chaotic Recycle Validation increase the Prediction Horizon by 0.5 LTs on average with respect to the Single Shot Validation. The Prediction Horizon of the long dataset (d) is ≳ 0.5 LTs larger than that of the short dataset (c). This results in the performance of the KFVc and RVc in the short dataset being closer to the performance of the SSV in the long dataset. Because Bayesian Optimization does not produce a substantial increase in the Prediction Horizon with respect to Grid Search, we conclude that the performance of the networks is more sensitive to the validation strategy than to the optimization scheme.

Fig. 10 shows the performance and robustness of selected validation strategies for different sizes of the reservoir. The chaotic K-Fold Validation and the chaotic Recycle Validation outperform, and are more robust than, the Single Shot Validation in all cases studied (apart from one outlier). In small reservoirs, the chaotic K-Fold Validation and the chaotic Recycle Validation are robust (r̃S ≥ 0.9), but as the number of neurons increases, we observe a decrease in robustness for all validation strategies, with smaller improvements in the Prediction Horizon. This may be caused by a slight overfitting: the networks are large with respect to the relatively simple and small datasets of the Lorenz system. The overfitting becomes more significant in the Recycle Validation, as compared to the K-Fold Validation, because the validation uses data that has already been seen by the network during training.

Model-informed ESN

We leverage knowledge about the governing equations through K(uin(ti)) in the model in Eq. (6). In this testcase, we use a reduced-order model obtained through Proper Orthogonal Decomposition (POD) (Lumley, 1967; Weiss, 2019) to define a POD-informed ESN. POD provides a fixed-rank subspace, E, of the state space, in which the projection of the original state vector optimally preserves its energy. The POD modes/energies are the eigenvectors/eigenvalues of the data covariance matrix, C = (1/(M − 1)) Uᵀ U. The M × Nu matrix U is the vertical concatenation of the M snapshots of the Nu-dimensional timeseries used for washout, training and validation of the network, from which its mean, d ∈ R^Nu, is subtracted column-wise. We create an NPOD-dimensional reduced-order model by taking the modes, φi, associated with the NPOD largest eigenvalues of C. Because C is a symmetric matrix, its eigenvectors form an orthonormal basis, which is stored in the orthogonal matrix Φ = [φ1; . . . ; φNPOD]. The state vector, q, is expressed as a function of its components ξ in the subspace, E, spanned by Φ, and its components η in the orthogonal complement of E, spanned by the basis Ψ: q = Φξ + Ψη + d. The evolution equations are then obtained by using a flat Galerkin approximation (Matthies & Meyer, 2003), which neglects the contribution of the orthogonal complement: Ψη ≃ 0. The dynamical system, q̇ = f(q), is projected onto E through ξ = Φᵀ(q − d) as

ξ̇ = Φᵀ f(Φξ + d).    (12)

In the POD-informed ESN (q ≡ uin), we use NPOD = 2 to generate the reduced-order model, which accounts for 96% of the energy of the original signal. We use the evolution of the trajectory on the POD subspace, E, to inform the ESN through K(uin(ti)) = ξ(ti+1).
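The POD construction above reduces to an eigendecomposition of the data covariance matrix and a projected right-hand side, Eq. (12). A sketch (function names are ours):

```python
import numpy as np

def pod_basis(U_snapshots, n_pod):
    """POD modes of the (M x Nu) snapshot matrix: eigenvectors of the
    data covariance associated with the n_pod largest eigenvalues."""
    d = U_snapshots.mean(axis=0)
    A = U_snapshots - d                       # mean-subtracted snapshots
    C = A.T @ A / (len(U_snapshots) - 1)      # covariance, C = U^T U/(M-1)
    eigval, eigvec = np.linalg.eigh(C)        # eigh returns ascending order
    idx = np.argsort(eigval)[::-1][:n_pod]
    return eigvec[:, idx], eigval[idx], d

def galerkin_rhs(f, Phi, d):
    """Flat Galerkin projection of q_dot = f(q): xi_dot = Phi^T f(Phi xi + d),
    Eq. (12)."""
    return lambda xi: Phi.T @ f(Phi @ xi + d)
```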
Fig. 10. Prediction Horizon medians (continuous lines) and Spearman coefficients (dashed lines) as functions of the size of the reservoir. The networks are optimized
through Bayesian Optimization in the Single Shot Validation (SSV), chaotic K-Fold Validation (KFVc ) and chaotic Recycle Validation (RVc ). Short (a) and long (b)
datasets. The x-axis scale highlights the logarithmic relationship between the performance and the size of the reservoir. Lorenz system.
Table 2
Spearman coefficients between validation and test sets for the model-informed ESN in the Lorenz system. Bold text indicates the highest correlation in the dataset.

r̃S                      SSV   WFV   WFVc  WFV∗c  KFV   KFVc  RV    RVc
Short dataset (12 LTs)  0.15  0.39  0.34  –      0.32  0.73  0.41  0.56
Long dataset (24 LTs)   0.19  0.42  0.36  0.51   0.59  0.80  0.55  0.80

Table 3
Spearman coefficients between validation and test sets in the Lorenz-96 model. Bold text indicates the highest correlation in the dataset.

r̃S         SSV   WFV   WFVc  KFV   KFVc  RV    RVc
Lorenz-96  0.20  0.62  0.76  0.84  0.92  0.69  0.84

We solve the ODE system in Eq. (12) using, at each time step, forward Euler with initial condition ξ(ti) = Φᵀ(uin(ti) − d), which is the projection of the input to the network onto E.

As compared to the model-free ESN, there is a decrease in correlation between validation and test sets for almost all the validation strategies (Table 2). The Single Shot Validation is still outperformed by the other strategies, while the chaotic K-Fold Validation and chaotic Recycle Validation have the highest correlation. The POD-informed architecture increases the performance by 1 LT on average (see additional results in the Supplementary Material S.4), but it does not increase the robustness of the networks with respect to the model-free ESN.

5.2. High-dimensional chaos: Lorenz-96

The Lorenz-96 model (Lorenz, 1996) describes the time evolution of an atmospheric scalar that is spatially discretized at J points. It consists of a series of coupled ODEs for the components of the state vector x = [x1, x2, . . . , xJ]

ẋj = (xj+1 − xj−2) xj−1 − xj + F.    (13)

The objective of this testcase is to verify that the results observed for low-dimensional chaos in the Lorenz system hold in a higher-dimensional system. We consider the case with periodic boundary conditions, J = 10 and F = 8, which we integrate in time using a fourth-order Runge–Kutta scheme with time step dt = 0.024 LTs (1 LT = Λ⁻¹ ≈ 0.83). The dataset used for training and validation consists of 10⁴ data points, which span a 240 LTs time series. The network parameters, the size of the ensemble, and the optimization strategies are the same as in the previous sections. We study larger networks, N = 1000, and modify the input bias, bin = 0.1, for it to have the same order of magnitude as the input, which is obtained by normalizing the signal by its maximum variation (component-wise). We select the hyperparameters by maximizing the arithmetic mean of the Prediction Horizon (PH) in the validation intervals, instead of minimizing the geometric average of the Mean Squared Error (MSE). We do this for two reasons. First, from preliminary runs and previous studies (Vlachas et al., 2020), we expect the PH to be smaller than 1 LT in most intervals. This means that the optimization of the MSE in intervals that last multiple LTs may not be strongly correlated with maximizing the PH in the same intervals. Second, in contrast to the three-dimensional Lorenz system, the large size of the dataset with respect to the PH allows us to define multiple validation intervals to optimize the PH directly. We define the Spearman coefficients (11) with the PH. We test the validation strategies (detailed in Supplementary Material S.3) by computing the arithmetic mean PHtest of the Prediction Horizon on Nt = 100 starting points along the attractor. For each point, the initial reservoir state vector is obtained by performing 100 steps of washout.

The Spearman coefficients are shown in Table 3. The Single Shot Validation is the least robust strategy, whereas the chaotic Recycle Validation and chaotic K-Fold Validation are the most robust. Apart from the Single Shot Validation, the Spearman coefficients are markedly larger than in the three-dimensional Lorenz system (Table 1). This is due to the large size of the dataset with respect to the Lyapunov Time of the system, which allows us to use a large number of splits in the validation strategies. This means that, in large chaotic datasets, robust results are obtained by selecting the hyperparameters through the average of multiple intervals in the dataset.

Fig. 11 shows the performance of the validation strategies. Bayesian Optimization (BO) outperforms Grid Search (GS) in all the strategies in the validation (Val) set (panel (a)). On the one hand, the poor robustness of the Single Shot Validation causes the two optimization methods to perform similarly in the test set. On the other hand, BO outperforms GS in more than 75% of the networks in the test set in the K-Fold Validations and Recycle Validations. Panel (b) shows the Prediction Horizon in the test set. The Single Shot Validation has the smallest PH, whereas the chaotic K-Fold Validation has the largest. The performance for reservoirs of different sizes is shown in panel (c). The Prediction Horizon increases with the size of the reservoir, while the robustness and relative performance of the selected validation strategies are qualitatively similar to the 1000-neuron case.

6. Quasiperiodic versus chaotic solutions

We analyse the nonlinear oscillator proposed by Kuznetsov et al. (2010), which physically represents a self-oscillatory discharge in an electric circuit. The oscillator is a three-dimensional
Fig. 11. 25th (lower bar), 50th (marker) and 75th (upper bar) percentiles of the Predictability Horizon (PH) for hyperparameter optimization by Bayesian Optimization
(BO) and Grid Search (GS), (a, b). PH medians from BO (continuous lines) and Spearman coefficients (dashed lines) as a function of the size of the reservoir, (c).
Lorenz-96 model.
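The Lorenz-96 dynamics of Eq. (13), with periodic boundary conditions and fourth-order Runge–Kutta time stepping as in the text, can be sketched as follows (function names ours):

```python
import numpy as np

def l96_rhs(x, F=8.0):
    """Lorenz-96 right-hand side, Eq. (13), with periodic boundaries."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt, rhs=l96_rhs):
    """One classic fourth-order Runge-Kutta step."""
    k1 = rhs(x)
    k2 = rhs(x + 0.5 * dt * k1)
    k3 = rhs(x + 0.5 * dt * k2)
    k4 = rhs(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Note that x_j = F for all j is an (unstable) fixed point, which provides a quick correctness check of the indexing.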
system, which can display periodic, quasiperiodic and chaotic behaviours as a function of the parameters [λ, ω0, µ]

ẋ = y,
ẏ = y(λ + z + x² − (1/2) x⁴) − ω0² x,
ż = µ − x².    (14)

The primary purpose of this testcase is to compare the robustness of Echo State Networks in forecasting quasiperiodic solutions versus chaotic solutions. This enables us to determine whether the challenges encountered in Section 5 are specific to learning chaotic time series. We obtain quasiperiodic and chaotic solutions by setting λ = 0, ω0 = 2.7, and µQp = 0.9 and µCh = 0.5, respectively. In both cases, we generate the solution using the adaptive integration scheme implemented in scipy.odeint and store the solution every dt = 0.0008 LT, where LT ≈ 25 for the chaotic case (Kuznetsov et al., 2010). In both cases, the dataset consists of 7.5 LTs to be used for washout, training and validation.

The network parameters, the size of the ensemble, and the optimization strategies are the same as those of Section 4. We modify the input bias, bin = 0.1, for it to have the same order of magnitude as the input, which is obtained by normalizing the signal by its maximum variation component-wise. We study the enlarged interval ρ = [0.01, 1] because we observed empirically that the optimal hyperparameters often lie in the range ρ ≤ 0.1. Given the multiple orders of magnitude of the spectral radius, the hyperparameter space is analysed in a logarithmic scale. We use the same architecture and validation strategies for the quasiperiodic and chaotic cases (as detailed in the Supplementary Material S.3). The different strategies are tested by computing the arithmetic mean PHtest of the Prediction Horizon on Nt starting points for the chaotic case, and by computing the geometric mean MSEtest of the Mean Squared Error in 2 LTs intervals starting from the same points. In the chaotic case, we select Nt = 75, whereas in the quasiperiodic case we select Nt = 50, through the procedure described in Supplementary Material S.2. For each starting point in the test set, the initial reservoir state vector is obtained by performing 0.5 LTs of washout. The performance in the quasiperiodic dataset is assessed only through the Mean Squared Error because the Prediction Horizon is infinite, i.e., a quasiperiodic solution has zero dominant Lyapunov exponents (Kantz & Schreiber, 2004).

Table 4
Spearman coefficients between validation and test sets for the model-free ESN in the Kuznetsov oscillator. Bold text indicates the highest correlation in the dataset.

r̃S                     SSV   WFV   WFVc  KFV   KFVc  RV    RVc
Quasiperiodic dataset  0.80  0.75  0.71  0.93  0.92  0.97  0.97
Chaotic dataset        0.49  0.48  0.58  0.70  0.76  0.66  0.81

Model-free ESN

The Spearman coefficients (Table 4) show that the correlation between validation and test sets is higher in the quasiperiodic dataset than in the chaotic dataset. Notably, the peak r̃S = 0.97 obtained in the Recycle Validations indicates almost complete correlation. As before, the Single Shot Validation is outperformed by the K-Fold Validation and Recycle Validation, but its correlation in the quasiperiodic dataset is higher than that of the chaotic cases. The high correlation in the quasiperiodic dataset is identified as the dense clustering around the linear regression of Fig. 12. Two remarks can be made. On the one hand, the high correlation in the quasiperiodic dataset implies that the challenges in producing robust results with Echo State Networks in chaotic attractors are due to the complexity of the chaotic signal, rather than the properties of the networks. On the other hand, the marked difference in performance between different networks is still present in the quasiperiodic dataset, which means that ESNs are sensitive to the realizations (further analysis is reported in Appendix C). Practically, we advise that different networks be optimized independently in the quasiperiodic case as well.

Panels (a, b) of Fig. 13 show the ratio of the MSE between the optimal hyperparameters obtained by Bayesian Optimization (BO) and the optimal hyperparameters from Grid Search (GS) in the validation and test sets. On the one hand, in the quasiperiodic case (a) the performance in the validation set is similar to the test
Fig. 12. Linear regression (LinReg) and scatter plot of the MSE of the optimal hyperparameters obtained from Bayesian Optimization (BO) and Grid Search (GS)
for each network. (a, d) Single Shot Validation, (b, e) chaotic K-Fold Validation and (c, f) chaotic Recycle Validation. (a–c) quasiperiodic and (d–f) chaotic datasets.
Kuznetsov oscillator.
Fig. 13. Comparison between hyperparameter optimization by Bayesian Optimization (BO) and Grid Search (GS) for the two performance metrics (MSE, PH). 25th
(lower bar), 50th (marker) and 75th (upper bar) percentiles. (a, c) quasiperiodic dataset, (b, d) chaotic dataset. Kuznetsov oscillator.
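The right-hand side of Eq. (14), which the paper integrates with scipy's adaptive odeint, can be written as follows (parameter defaults taken from Section 6; µ = 0.5 gives the chaotic regime and µ = 0.9 the quasiperiodic one):

```python
import numpy as np

def kuznetsov_rhs(u, lam=0.0, omega0=2.7, mu=0.5):
    """Right-hand side of the Kuznetsov oscillator, Eq. (14)."""
    x, y, z = u
    return np.array([
        y,
        y * (lam + z + x**2 - 0.5 * x**4) - omega0**2 * x,
        mu - x**2,
    ])
```

It can be passed, e.g., to scipy.integrate.odeint via lambda u, t: kuznetsov_rhs(u).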
set. On the other hand, in the chaotic case (b) BO outperforms GS in the validation set, although the two schemes perform similarly in the test set. In panels (c, d), we show the performance of the networks in the test set using the MSE for the quasiperiodic dataset (c) and the Prediction Horizon in the chaotic dataset (d). In the quasiperiodic dataset, Bayesian Optimization outperforms Grid Search, and the K-Fold Validations and Recycle Validations outperform the other validation strategies. In the chaotic dataset, as seen in the Lorenz system, Bayesian Optimization only slightly outperforms Grid Search, while the K-Fold Validations and Recycle Validations still outperform the other validation strategies. An analysis of the performance and robustness of the validation strategies for different sizes of the reservoir (up to 500 neurons) is reported in the Supplementary Material S.4. The results are qualitatively similar to those observed for 100 neurons. In all cases (except for one outlier), the chaotic Recycle Validation shows the smallest MSE, the largest Prediction Horizon and the highest robustness.

Model-informed ESN

We design a Forward Euler (FE) informed ESN (6) by integrating in time with forward Euler the y-equation of Eq. (14) only

K(uin) = y + dt (y(λ + z + x² − (1/2) x⁴) − ω0² x).    (15)
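Eq. (15) in code: a single forward-Euler step of the y-equation, used as the extra knowledge input K(u_in) of the model-informed ESN (a sketch; names ours):

```python
def fe_knowledge(u, dt, lam=0.0, omega0=2.7):
    """Forward-Euler knowledge term of Eq. (15):
    K(u_in) = y + dt*(y*(lam + z + x**2 - x**4/2) - omega0**2 * x)."""
    x, y, z = u
    return y + dt * (y * (lam + z + x**2 - 0.5 * x**4) - omega0**2 * x)
```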
Table 5
Spearman coefficients between validation and test sets for the model-informed ESN in the Kuznetsov oscillator. Bold text indicates the highest correlation in the dataset.

r̃S                     SSV   WFV   WFVc  KFV   KFVc  RV    RVc
Quasiperiodic dataset  0.78  0.65  0.67  0.71  0.80  0.98  0.98
Chaotic dataset        0.57  0.63  0.63  0.75  0.79  0.71  0.85

Table 5 shows the Spearman coefficients for the FE-informed model. In the quasiperiodic dataset, the correlation decreases for all the validation strategies with respect to the model-free case (see Table 4), except for the Recycle Validations, which have the highest correlation. However, in the chaotic dataset, the correlation increases for all the validation strategies. Here, the chaotic K-Fold Validation and chaotic Recycle Validation are the strategies with the highest correlation. In the same fashion as in the Lorenz system, the FE-informed architecture per se enhances the performance, but it does not enhance the robustness of Echo State Networks (additional results can be found in the Supplementary Material S.4).

7. Conclusions

The Echo State Network (ESN) is a reservoir computing architecture that is able to learn accurately the nonlinear dynamics of systems from data. The overarching objective of this paper is to investigate and improve the robustness of ESNs, with a focus on the forecasting of chaotic systems. First, we analyse the Single Shot Validation, which is the commonly used strategy to select the hyperparameters. We show that the Single Shot Validation is the least performing strategy to fine-tune the hyperparameters. Second, we validate the ESNs on multiple points of the chaotic attractor, for which the validation set is not necessarily subsequent in time to the training set. We propose the Recycle Validation and the chaotic version of existing validation strategies based on multiple folds, such as the Walk Forward Validation and the K-Fold Cross Validation. The K-Fold Validation and Recycle Validation offer the greatest robustness and performance, with their chaotic versions outperforming the corresponding regular versions. Importantly, the Recycle Validation is computationally cheaper than the K-Fold Cross Validation. Third, we compare Bayesian Optimization with Grid Search to compute the optimal hyperparameters. We find that, in the validation set, Bayesian Optimization consistently finds a set of hyperparameters that perform significantly better than the Grid Search. On the one hand, in learning quasiperiodic solutions, hyperparameters that work optimally in the validation

of ESNs, we recommend computing the optimal hyperparameters for each network. In the tests performed in the paper, this can increase the network's Prediction Horizon by up to six Lyapunov Times as compared to using the same set of hyperparameters for all realizations.

This work opens up new possibilities for using Echo State Networks and, in general, recurrent neural networks, for robust learning of chaotic dynamics.

Code and supplementary material

The code for the validation strategies and optimization schemes used in this work can be found in the openly-available GitLab repository https://gitlab.com/ar994/robust-validation-esn. Supplementary material related to this article can be found online at https://doi.org/10.1016/j.neunet.2021.05.004.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

A. Racca is supported by the EPSRC-DTP and the Cambridge Commonwealth, European & International Trust under a Cambridge European Scholarship. L. Magri is supported by the Royal Academy of Engineering Research Fellowship scheme and the visiting fellowship at the Technical University of Munich – Institute for Advanced Study, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement no. 291763. The authors would like to thank Dr. N. A. K. Doan and F. Huhn for insightful discussions. The authors are grateful to two anonymous reviewers and the handling editor (Prof. H. Jaeger) for their thorough comments.

Appendix A. Computational time

In Fig. A.14, we show the CPU time required by the validation strategies to perform a Grid Search in the hyperparameter space for a single network. The computational advantage of the Recycle Validation with respect to the K-Fold Validation increases with the size of the dataset and the size of the reservoir. We expect the improvement in computational time to be more significant in RNN architectures whose training is more expensive, such as LSTMs and GRUs.

The Bayesian Optimization described in Section 4 costs ap-
proximately 6 s more per network in all the cases shown. This
set continue to work optimally in the test set. This is because
is because the additional cost of the Bayesian Optimization is
quasiperiodic solutions are predictable (i.e., they do not have
independent of the cost of the evaluation function.
positive Lyapunov exponents). This finding is, thus, expected
to generalize to other predictable solutions, such as frequency-
Implementation
locked solutions and limit cycles. On the other hand, in learning
chaotic solutions, hyperparameters that work optimally in the
To reduce the computational cost of the Walk Forward Valida-
validation set do not necessarily work optimally in the test set.
tion and of the K-Fold Validation, for each set of hyperparameters,
We argue that this occurs because of the chaotic nature of the
we store R, RRT and RUTd in Eq. (4) for the entire dataset. For the
attractor, in which the nonlinear dynamics, although determinis-
i-th fold, we then compute Ri RTi (and in the same way Ri UTdi ) that
tic, manifest themselves as unpredictable variations. Fourth, we
we need for ridge regression through
analyse the model-free ESN, which is fully data-driven, and the
model-informed ESN, which leverages knowledge of the gov- Ri RTi = RRT − R̂i R̂Ti , (A.1)
erning equations. We find that the model-informed architecture
markedly improves the network’s prediction capabilities, but it where R̂i is the complement to Ri for the entire dataset, namely
does not improve the robustness. Finally, we find that the optimal R ∈ RNr̂ ×Ntr , Ri ∈ RNr̂ ×Ni and R̂i ∈ RNr̂ ×N̂i , where Ni + N̂i =
hyperparameters are significantly sensitive to the random initial- Ntr . Since for large datasets computing Ri RTi (and Ri UTdi ) is the
ization of the ESN. Practically, when working with an ensemble main computational cost of solving Eq. (4) and Ni ≫ N̂i , using
A. Racca and L. Magri Neural Networks 142 (2021) 252–268

Fig. A.14. CPU time required for a single network of size N to perform a 7 × 7 Grid Search in hyperparameters space in the 12 LT and 24 LT datasets of the
Lorenz system. The validation strategies are the Single Shot Validation (SSV), the Walk Forward Validation (WFV), K-Fold Validation (KFV), Recycle Validation (RV),
and respective chaotic versions (subscript c). The runs are on a single Intel i7-8750H processor.

Eq. (A.1) reduces significantly the computational time required by the strategies. For example, in the Lorenz-96 testcase, the total time required by the chaotic K-Fold is reduced by a factor of three with respect to computing R_i R_i^T directly.

Appendix B. Bayesian optimization for hyperparameters

After we evaluate the objective function at N_st starting points, the objective function is reconstructed in the hyperparameter search space using the function evaluations as data points for noise-free Gaussian Process Regression. The computational cost of the regression is proportional to N_d^3, where N_d is the number of data points, because of the inversion of the covariance matrix. The inversion is performed by Cholesky factorization, regularized by the addition of α = 10⁻¹⁰ to the diagonal elements.

Once the Gaussian Process Regression is performed, the next point at which to evaluate the objective function is selected in the hyperparameter space by maximizing the acquisition function. The acquisition function evaluates a potential point's usefulness in finding the global minimum, so that points with a high value of the acquisition function are selected during the search. A new point can be chosen for one of two reasons: (i) to try to find a new minimum by using the current knowledge of the search space, and (ii) to increase the knowledge of the space by exploring new regions. This trade-off is called the balance between exploitation and exploration. Practically, the most used acquisition functions in the literature are the Probability of Improvement (PI), the Expected Improvement (EI) and the Lower Confidence Bound (LCB) (Brochu et al., 2010). On a given testcase, it is difficult to determine a priori which acquisition function will perform better. For this reason, we use the gp-hedge algorithm (Hoffman et al., 2011), which improves the performance with respect to the single acquisition functions. In the algorithm, when deciding the next point of the search, the three acquisition functions are evaluated over the search space. Each acquisition function provides its own optimal point as a candidate. The next point at which the function is going to be evaluated is selected among the three candidates with probability given by the softmax function. The softmax function is evaluated on the cumulative rewards of the candidate points previously proposed by the acquisition functions, so that the strategy leans towards exploitation as the search progresses. Once the point is selected, the Gaussian Process Regression is performed again using the updated set of data points, until the prescribed maximum number of function evaluations is reached. More details are reported in the Supplementary Material S.5.

Appendix C. Hyperparameter variations for different realizations

As shown in Fig. 6 for the Lorenz system, different network realizations have different optimal hyperparameters, which vary significantly from one realization to another. This suggests that different networks need to be tuned independently. If we select a fixed set of hyperparameters, some networks will perform poorly (Haluszczynski & Räth, 2019). In this appendix, we quantify the difference in performance between optimizing each network independently and using a fixed set of hyperparameters for the entire ensemble. Fig. C.15 shows the mean of the Gaussian Process reconstruction of the log10(MSE) in the test set for the short dataset of the Lorenz system. In panels (a, b), we show the MSE in the test set for two representative networks from the ensemble, while in panel (c), we show the error between the two networks. The two networks differ substantially: the same hyperparameters may result in MSEs that differ by more than four orders of magnitude.

To quantitatively evaluate the performance of the networks, we assess two possible choices of fixed hyperparameters: (i) we search for the optimal fixed hyperparameters by minimizing the geometric mean, over the 50 networks, of the MSE in the validation set; (ii) we use the hyperparameters obtained by performing the search on a representative network from the ensemble, and use those hyperparameters for all the networks. In both (i) and (ii), we perform the search using Bayesian Optimization in the chaotic K-Fold Cross Validation (KFVc) and chaotic Recycle Validation (RVc). Fig. C.16 shows the violin plots and 25th, 50th and 75th percentiles of the Prediction Horizon in the test set for the Lorenz system. Using fixed hyperparameters yields a decrease in the percentiles of around 0.5 LTs with (i), and of more than 1 LT with (ii). In addition, the tail of the distribution extends to values of the Prediction Horizon below 1 LT, which means that the fixed hyperparameters perform poorly in a fraction of the networks. Finally, we note

Fig. C.15. Mean of the Gaussian Process reconstruction of the MSE in the test set for (a, b) two representative networks in the short dataset of the Lorenz system,
and (c) difference between the two networks. For visualization purposes we saturate the MSE to be ≤ 100 and the error to be ≤ 104 . The Gaussian Process is based
on a grid of 30 × 30 data points. For the same hyperparameters, the MSE can differ by orders of magnitude between the two networks.

Fig. C.16. Violin plots and 25th (lower bar), 50th (marker) and 75th (upper bar) percentiles of the Prediction Horizon in the test set for the 50 networks ensemble
in the (a) short (b) and long datasets in the Lorenz system. Independent optimization (Ind) of each network, optimal set of fixed hyperparameters (Fix (i)), and
optimal hyperparameters of a single network (Fix (ii)). We use Bayesian Optimization in the chaotic K-Fold Validation (KFVc ) and chaotic Recycle Validation (RVc ).

Fig. C.17. Mean of the Gaussian Process reconstruction of the MSE in the test set for (a, b) two representative networks in the quasiperiodic dataset, and (c) difference
between the two networks. For visualization purposes we saturate the MSE to be ≤ 1 and the error to be ≤ 104 . The Gaussian Process is based on a grid of 30 × 30
data points.

that the decrease in the Prediction Horizon percentiles for (ii) is larger than the improvement that we obtain when using the new validation strategies, the increased size of the dataset, or the model-informed architecture. This means that optimizing each network independently, and therefore not reusing hyperparameters obtained from validating one network for another network, is key in Echo State Networks. Similar conclusions can be drawn for the quasiperiodic dataset of the Kuznetsov oscillator (Figs. C.17, C.18). This indicates that the difference between realizations is caused by the ESNs, and not by chaos.
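The comparison above between independent optimization and fixed hyperparameters can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation in the repository: it assumes that the validation MSE of every network in the ensemble has already been evaluated on a common list of candidate hyperparameter sets, and all array names and shapes are ours.

```python
import numpy as np

def fixed_hyperparameters(val_mse):
    """Choice (i): the single hyperparameter set that minimizes the geometric
    mean of the validation MSE over the ensemble of networks.

    val_mse: array of shape (n_networks, n_candidate_sets), with MSE > 0.
    """
    # geometric mean = exp(mean(log)); the argmin of the mean log-MSE
    # is the argmin of the geometric mean of the MSE
    return int(np.argmin(np.mean(np.log(val_mse), axis=0)))

def independent_hyperparameters(val_mse):
    """Independent optimization: each network selects its own best set."""
    return np.argmin(val_mse, axis=1)

# toy ensemble: 3 networks, 4 candidate hyperparameter sets
val_mse = np.array([[0.1, 2.0, 5.0, 9.0],
                    [4.0, 0.2, 6.0, 8.0],
                    [3.0, 0.5, 7.0, 0.4]])
i_fix = fixed_hyperparameters(val_mse)        # one index shared by the ensemble
i_ind = independent_hyperparameters(val_mse)  # one index per network
```

In this toy case the ensemble-wide choice (i) selects candidate 1, although networks 0 and 2 individually prefer candidates 0 and 3; this mimics, on a small scale, the loss of performance of fixed hyperparameters observed in Fig. C.16.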

Fig. C.18. Violin plots and 25th (lower bar), 50th (marker) and 75th (upper bar) percentiles for the 50 networks ensemble of the MSE in the quasiperiodic dataset,
(a), and the Prediction Horizon for the chaotic dataset, (b), in the test set in the Kuznetsov Oscillator. Independent optimization (Ind) of each network, optimal set
of fixed hyperparameters (Fix (i)), and optimal hyperparameters of a single network (Fix (ii)). We use Bayesian Optimization in the chaotic K-Fold Validation (KFVc )
and chaotic Recycle Validation (RVc ).

References

Baker, N., Alexander, F., Bremer, T., Hagberg, A., Kevrekidis, Y., Najm, H., et al. (2019). Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence: Tech. rep., Washington, DC (United States): USDOE Office of Science (SC).
Bec, J., Biferale, L., Boffetta, G., Cencini, M., Musacchio, S., & Toschi, F. (2006). Lyapunov exponents of heavy particles in turbulence. Physics of Fluids, 18(9), 1–5. http://dx.doi.org/10.1063/1.2349587, arXiv:0606024.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Boffetta, G., Cencini, M., Falcioni, M., & Vulpiani, A. (2002). Predictability: A way to characterize complexity. Physics Reports, 356(6), 367–474. http://dx.doi.org/10.1016/S0370-1573(01)00025-4, arXiv:0101029.
Bolker, B. M., & Grenfell, B. T. (1993). Chaos and biological complexity in measles dynamics. Proceedings of the Royal Society of London, Series B, 251(1330), 75–81.
Brochu, E., Cora, V. M., & De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
Brunton, S. L., Noack, B. R., & Koumoutsakos, P. (2020). Machine learning for fluid mechanics. Annual Review of Fluid Mechanics, 52, 477–508.
Chattopadhyay, A., Hassanzadeh, P., & Subramanian, D. (2020). Data-driven predictions of a multiscale Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural network, and long short-term memory network. Nonlinear Processes in Geophysics, 27(3), 373–389.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1724–1734). Doha, Qatar: Association for Computational Linguistics. http://dx.doi.org/10.3115/v1/D14-1179, https://www.aclweb.org/anthology/D14-1179, arXiv preprint arXiv:1406.1078.
Deissler, R. G. (1986). Is Navier-Stokes turbulence chaotic? Physics of Fluids, 29(5), 1453–1457. http://dx.doi.org/10.1063/1.865663, URL https://aip.scitation.org/doi/10.1063/1.865663.
Doan, N. A. K., Polifke, W., & Magri, L. (2019). Physics-informed echo state networks for chaotic systems forecasting. In International Conference on Computational Science (pp. 192–198). Springer.
Doan, N. A. K., Polifke, W., & Magri, L. (2020a). Learning hidden states in a chaotic system: A physics-informed echo state network approach. In International Conference on Computational Science (pp. 117–123). Springer.
Doan, N. A. K., Polifke, W., & Magri, L. (2020b). Physics-informed echo state networks. Journal of Computational Science, 47, Article 101237. http://dx.doi.org/10.1016/j.jocs.2020.101237, URL http://www.sciencedirect.com/science/article/pii/S1877750320305408.
Doan, N. A. K., Polifke, W., & Magri, L. (2021). Short- and long-term prediction of a chaotic flow: A physics-constrained reservoir computing approach. arXiv preprint arXiv:2102.07514.
Ferreira, A. A., Ludermir, T. B., & De Aquino, R. R. (2013). An approach to reservoir computing design and training. Expert Systems with Applications, 40(10), 4172–4182.
Gonon, L., & Ortega, J.-P. (2021). Fading memory echo state networks are universal. Neural Networks, http://dx.doi.org/10.1016/j.neunet.2021.01.025, URL https://www.sciencedirect.com/science/article/pii/S0893608021000332.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Griffith, A., Pomerance, A., & Gauthier, D. J. (2019). Forecasting chaotic systems with very low connectivity reservoir computers. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(12), Article 123108.
Grigoryeva, L., & Ortega, J.-P. (2018). Echo state networks are universal. Neural Networks, 108, 495–508. http://dx.doi.org/10.1016/j.neunet.2018.08.025, URL https://www.sciencedirect.com/science/article/pii/S089360801830251X.
Guckenheimer, J., & Holmes, P. (2013). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields, Vol. 42. Springer Science & Business Media.
Haluszczynski, A., & Räth, C. (2019). Good and bad predictions: Assessing and improving the replication of chaotic attractors by means of reservoir computing. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(10), Article 103143.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., et al. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. http://dx.doi.org/10.1038/s41586-020-2649-2.
Hassanaly, M., & Raman, V. (2019). Ensemble-LES analysis of perturbation response of turbulent partially-premixed flames. Proceedings of the Combustion Institute, 37(2), 2249–2257. http://dx.doi.org/10.1016/j.proci.2018.06.209.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hoffman, M., Brochu, E., & de Freitas, N. (2011). Portfolio allocation for Bayesian optimization. In UAI'11, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (pp. 327–336).
Huhn, F., & Magri, L. (2020). Learning ergodic averages in chaotic systems. In International Conference on Computational Science (pp. 124–132). Springer.
Ishu, K., van der Zant, T., Becanovic, V., & Ploger, P. (2004). Identification of motion with echo state network. In Oceans '04 MTS/IEEE Techno-Ocean '04 (IEEE Cat. No. 04CH37600): Vol. 3 (pp. 1205–1210).
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jiang, J., & Lai, Y.-C. (2019). Model-free prediction of spatiotemporal dynamical systems with recurrent neural networks: Role of network spectral radius. Physical Review Research, 1(3), Article 033056.
Kantz, H., & Schreiber, T. (2004). Nonlinear time series analysis, Vol. 7. Cambridge University Press.
Kennedy, M., Rovatti, R., & Setti, G. (2000). Chaotic electronics in telecommunications. CRC Press.
Kuznetsov, A., Kuznetsov, S., & Stankevich, N. (2010). A simple autonomous quasiperiodic self-oscillator. Communications in Nonlinear Science and Numerical Simulation, 15(6), 1676–1681.
Lorenz, E. N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2), 130–141.
Lorenz, E. N. (1996). Predictability: A problem partly solved. In Proc. seminar on predictability: Vol. 1.
Lu, Z., Hunt, B. R., & Ott, E. (2018). Attractor reconstruction by machine learning. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(6), Article 061104.
Lu, Z., Pathak, J., Hunt, B., Girvan, M., Brockett, R., & Ott, E. (2017). Reservoir observers: Model-free inference of unmeasured variables in chaotic systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(4), Article 041102.
Lukoševičius, M. (2012). A practical guide to applying echo state networks. In Neural networks: Tricks of the trade (pp. 659–686). Springer.
Lukoševičius, M., & Uselis, A. (2019). Efficient cross-validation of echo state networks. In International Conference on Artificial Neural Networks (pp. 121–133). Springer.


Lumley, J. L. (1967). The structure of inhomogeneous turbulent flows. In Atmospheric turbulence and radio wave propagation. Nauka.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Matthies, H. G., & Meyer, M. (2003). Nonlinear Galerkin methods for the model reduction of nonlinear dynamical systems. Computers and Structures, 81(12), 1277–1286.
Moon, F. C., & Shaw, S. W. (1983). Chaotic vibrations of a beam with non-linear boundary conditions. International Journal of Non-Linear Mechanics, 18(6), 465–477.
Nakai, K., & Saiki, Y. (2018). Machine-learning inference of fluid variables from data using reservoir computing. Physical Review E, 98(2), Article 023111.
Nastac, G., Labahn, J. W., Magri, L., & Ihme, M. (2017). Lyapunov exponent as a metric for assessing the dynamic content and predictability of large-eddy simulations. Physical Review Fluids, 2(9), Article 094606. http://dx.doi.org/10.1103/PhysRevFluids.2.094606.
Pathak, J., Hunt, B., Girvan, M., Lu, Z., & Ott, E. (2018). Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Physical Review Letters, 120(2), Article 024102.
Pathak, J., Lu, Z., Hunt, B. R., Girvan, M., & Ott, E. (2017). Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(12), Article 121102.
Pathak, J., Wikner, A., Fussell, R., Chandra, S., Hunt, B. R., Girvan, M., et al. (2018). Hybrid forecasting of chaotic processes: Using machine learning in conjunction with a knowledge-based model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(4), Article 041101.
Racca, A., & Magri, L. (2021). Automatic-differentiated physics-informed echo state network (API-ESN). In International Conference on Computational Science (accepted). arXiv preprint arXiv:2101.00002.
Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine learning (pp. 63–71). Springer.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
Sak, H., Senior, A. W., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (pp. 2951–2959).
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15(1), 72–101. URL http://www.jstor.org/stable/1412159.
Stöckmann, H.-J. (2000). Quantum chaos: An introduction. American Association of Physics Teachers.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, 27. Curran Associates, Inc. arXiv preprint arXiv:1409.3215.
Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980 (pp. 366–381). Springer.
Thiede, L. A., & Parlitz, U. (2019). Gradient based hyperparameter optimization in echo state networks. Neural Networks, 115, 23–29.
Tikhonov, A. N., Goncharsky, A., Stepanov, V., & Yagola, A. G. (2013). Numerical methods for the solution of ill-posed problems, Vol. 328. Springer Science & Business Media.
Virtanen, P., et al. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272.
Viswanath, D. (1998). Lyapunov exponents from random Fibonacci sequences to the Lorenz equations: Tech. rep., Cornell University.
Vlachas, P. R., Byeon, W., Wan, Z. Y., Sapsis, T. P., & Koumoutsakos, P. (2018). Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks. Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 474(2213), Article 20170844.
Vlachas, P. R., Pathak, J., Hunt, B. R., Sapsis, T. P., Girvan, M., Ott, E., et al. (2020). Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks.
Wan, Z. Y., & Sapsis, T. P. (2018). Machine learning the kinematics of spherical particles in fluid flows. Journal of Fluid Mechanics, 857.
Wang, H., & Yan, X. (2015). Optimizing the echo state network with a binary particle swarm optimization algorithm. Knowledge-Based Systems, 86, 182–193.
Weiss, J. (2019). A tutorial on the proper orthogonal decomposition. In AIAA Aviation 2019 Forum (p. 3333).
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339–356.
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
Wikner, A., Pathak, J., Hunt, B., Girvan, M., Arcomano, T., Szunyogh, I., et al. (2020). Combining machine learning with knowledge-based modeling for scalable forecasting and subgrid-scale closure of large, complex, spatiotemporal systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(5), Article 053111.
Yildiz, I. B., Jaeger, H., & Kiebel, S. J. (2012). Re-visiting the echo state property. Neural Networks, 35, 1–9. http://dx.doi.org/10.1016/j.neunet.2012.07.005, URL https://www.sciencedirect.com/science/article/pii/S0893608012001852.
Yperman, J., & Becker, T. (2016). Bayesian optimization of hyper-parameters in reservoir computing. arXiv preprint arXiv:1611.05193.
