Trajectory Flow Matching
1 McGill University, 2 Mila - Quebec AI Institute, 3 Yale School of Medicine, 4 School of Clinical Medicine, University of Cambridge, 5 Université de Montréal, 6 CIFAR Fellow
Abstract
1 Introduction
Real-world problems often involve systems that evolve continuously over time, yet these systems are usually noisy and irregularly sampled. In addition, real-world time series often depend on other covariates, leading to complex patterns such as intersecting trajectories. For instance, in the context of clinical trajectories in healthcare, patients' vital sign evolution can follow drastically different, crossing paths even if the initial measurements are similar, due to the influence of covariates such as medication interventions and underlying health conditions. These covariates can be time-varying or static, and are often sparse.
Differential equation-based dynamical models are proficient at learning continuous variables without imputation [Chen et al., 2018, Rubanova et al., 2019, Kidger et al., 2021b]. Nevertheless, systems governed by ordinary differential equations (ODEs) or stochastic differential equations (SDEs) are unable to accommodate intersecting trajectories, and thus require modifications such as augmentation or modelling higher-order derivatives [Dupont et al., 2019]. While ODEs model deterministic systems, SDEs contain a diffusion term and can better represent the inherent uncertainty and fluctuations present in many real-world systems. However, fitting stochastic equations to real-life data is challenging because it has thus far required time-consuming backpropagation through an SDE integration.

∗ Joint first authorship
† Joint senior authorship. Correspondence to [email protected]
Code available at: https://github.com/nZhangx/TrajectoryFlowMatching
In the domain of generative models, diffusion models [Ho et al., 2020, Nichol and Dhariwal, 2021, Song et al., 2021] and more recently flow matching models [Lipman et al., 2023, Albergo et al., 2023, Li et al., 2020] have had enormous success by training dynamical models in a simulation-free framework. The simulation-free framework facilitates the training of much larger models with significantly improved speed and stability. In this work we generalize simulation-free training for fitting stochastic differential equations to time-series data, learning population trajectories while preserving individual characteristics via conditioning. We present this method as Trajectory Flow Matching. We demonstrate that our method outperforms current state-of-the-art time series modelling architectures, including RNN-based, ODE-based, and flow matching methods. We empirically demonstrate the utility of our method in clinical applications where hemodynamic trajectories are critical for ongoing dynamic monitoring and care. We apply our method to the following longitudinal electronic health record datasets from multiple clinical settings: medical intensive care unit (MICU) data of patients with sepsis, Emergency Department (ED) data of patients with acute gastrointestinal bleeding, and MICU data of patients with acute gastrointestinal bleeding.
Our main contributions are:
• We prove the conditions under which continuous-time dynamics can be trained simulation-free using matching techniques.
• We extend the approach to irregularly sampled trajectories with a time-predictive loss and to uncertainty estimation with an uncertainty prediction loss.
• We empirically demonstrate that our approach reduces error by 15-83% when applied to real-world clinical data modelling.
2 Preliminaries
2.1 Notation
We consider the setting of a distribution of trajectories over $\mathbb{R}^d$ denoted $\mathcal{X} := \{x^1, x^2, \ldots, x^n\}$, where each $x^i$ is a trajectory of length $T$, i.e. $x^i := \{x^i_1, x^i_2, \ldots, x^i_T\}$, with associated times $t^i := \{t^i_1, t^i_2, \ldots, t^i_T\}$. Let $x^i_{[t-h, t-1]}$ denote a vector of the last $h$ observed time points. We denote a (Lipschitz smooth) time-dependent vector field conditioned on arbitrary conditions $c \in \mathbb{R}^e$ as $v(t, x_t, x_{[t-h,t-1]}, c) \approx \frac{dx}{dt} : ([0, 1], \mathbb{R}^d, \mathbb{R}^{h \times d}, \mathbb{R}^e) \to \mathbb{R}^d$, with flow $\phi_t(v)$, which induces the time-dependent density $p_t = \phi_t(v)_\#(p_0)$ for any density $p_0 : \mathbb{R}^d \to \mathbb{R}_+$ with $\int_{\mathbb{R}^d} p_0 = 1$. We also consider the coupling $\pi(x_0, x_1)$, which operates on the product space of the marginal distributions $p_0, p_1$.
2.2 Neural Stochastic Differential Equations
A stochastic differential equation (SDE) can be expressed in terms of a smooth drift $f : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$ and diffusion $g : [0, T] \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ in the Itô sense as:
$$dx_t = f\,dt + g\,dW_t$$
where $W_t : [0, T] \to \mathbb{R}^d$ is the $d$-dimensional Wiener process. A density $p_0(x_0)$ evolved according to an SDE induces a collection of marginal distributions $p_t(x_t)$, viewed as a function $p : [0, T] \times \mathbb{R}^d \to \mathbb{R}_+$. In a Neural SDE [Li et al., 2020, Kidger et al., 2021a,b] the drift and diffusion terms are parameterized with neural networks $f_\theta(t, x_t)$ and $g_\theta(t, x_t)$:
$$dx_t = f_\theta(t, x_t)\,dt + g_\theta(t, x_t)\,dW_t \tag{1}$$
where the goal is to select $\theta$ to enforce $x_T \sim \mathcal{X}_{\text{true}}$ for some distributional notion of similarity such as the Wasserstein distance [Kidger et al., 2021b] or Kullback-Leibler divergence [Li et al., 2020]. However, these objectives are simulation-based, requiring backpropagation through an SDE solver, which suffers from severe speed and stability issues. While some issues such as memory and numerical truncation can be ameliorated using the adjoint state method and advanced numerical solvers [Kidger et al., 2021b], optimization of Neural SDEs remains a significant issue.
We note that in the special case of zero diffusion (i.e. $g_\theta(t, x_t) = 0$) this reduces to a neural ordinary differential equation (Neural ODE) [Chen et al., 2018], which is easier to optimize than an SDE but still presents challenges to scalability.
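To make the parameterization concrete, a minimal PyTorch sketch of such drift and diffusion networks might look as follows (the architecture, a diagonal diffusion, and the `NeuralSDE` name are our own illustrative choices, not those of the cited works):

```python
import torch
import torch.nn as nn

class NeuralSDE(nn.Module):
    """Drift f_theta(t, x) and diagonal diffusion g_theta(t, x), as in eq. (1)."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # Both networks take (t, x_t) concatenated as input.
        self.drift = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )
        self.diffusion = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, dim), nn.Softplus(),  # keep g >= 0
        )

    def f(self, t: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        tx = torch.cat([t.expand(x.shape[0], 1), x], dim=-1)
        return self.drift(tx)

    def g(self, t: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        tx = torch.cat([t.expand(x.shape[0], 1), x], dim=-1)
        return self.diffusion(tx)
```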
2.3 Matching algorithms
Matching algorithms are a simulation-free class of training algorithms which are able to bypass
backpropagation through the solver during training by constructing the marginal distribution as a
mixture of tractable conditional probability paths.
The marginal density $p_t$ induced by eq. 1 evolves according to the Fokker-Planck equation (FPE):
$$\partial_t p_t = -\nabla \cdot (p_t f_t) + \frac{g^2}{2} \Delta p_t \tag{2}$$
where ∆pt = ∇ · (∇pt ) denotes the Laplacian of pt and gradients are taken with respect to xt .
Matching algorithms first construct a factorization of pt into conditional densities pt (xt |z) such that
pt = Eq(z) [pt (xt |z)] and where pt (xt |z) is generated by an SDE dxt = vt (xt |z)dt + σt (xt |z)dWt .
Given this construction it can be shown that the minimizer of
$$\mathcal{L}_{\text{match}}(\theta) := \mathbb{E}_{t, q(z), p_t(x|z)} \left[ \|f_\theta(t, x_t) - v_t(x_t|z)\|^2 + \lambda_t^2 \|g_\theta(t, x_t) - \sigma_t(x_t|z)\|^2 \right] \tag{3}$$
satisfies the FPE of the marginal $p_t$. This is especially useful in the generative modeling setting where $q_0$ is samplable noise (e.g. $\mathcal{N}(0, 1)$) and $q_1$ is the data distribution. Then we can define $z := (x_0, x_1)$ as a tuple of noise and data with $q(z) := q_0(x_0) \otimes q_1(x_1)$. This makes eq. 3 optimize a model which will draw new samples according to the data distribution $q_1(x_1)$ using
$$x_0 \sim q_0; \qquad x_1 = \int_0^1 f_\theta(t, x_t)\,dt + g_\theta(t, x_t)\,dW_t \tag{4}$$
with the integration computed numerically using any off-the-shelf SDE solver. While this is guaran-
teed to preserve the distribution over time, it is not guaranteed to preserve the coupling of q0 and q1
(if given).
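Concretely, eq. 4 can be integrated with a simple Euler-Maruyama scheme; a hedged sketch, reusing the `NeuralSDE` module sketched above (any off-the-shelf SDE solver would also work):

```python
import torch

@torch.no_grad()
def sample(model: NeuralSDE, x0: torch.Tensor, n_steps: int = 100) -> torch.Tensor:
    """Integrate dx = f dt + g dW from t=0 to t=1 by Euler-Maruyama."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)
        dw = torch.randn_like(x) * dt ** 0.5  # Wiener increment
        x = x + model.f(t, x) * dt + model.g(t, x) * dw
    return x
```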
Paired bridge matching In generative modeling, random pairings [Liu et al., 2023c, Albergo and Vanden-Eijnden, 2023, Albergo et al., 2023] or optimal transport pairings [Tong et al., 2024, Pooladian et al., 2023] are constructed for the conditional distribution $q(z)$. However, in some problems we would like to match given pairs of points, as in image-to-image translation [Isola et al., 2017, Liu et al., 2023a, Somnath et al., 2023], where training data comes as pairs $(x_0, x_1)$. In this case we set $q(z) := q(x_0, x_1)$ to be samples from these known pairs and optimize eq. 3. While empirically these models perform well, there are no guarantees that the coupling will be preserved outside of the special case when data comes from the (entropic) optimal transport coupling $\pi^*_\varepsilon(q_0, q_1)$, defined as:
$$\pi^*_\varepsilon(q_0, q_1) = \arg\min_{\pi \in U(q_0, q_1)} \int d(x_0, x_1)^2\, d\pi(x_0, x_1) + \varepsilon\, \mathrm{KL}(\pi \| q_0 \otimes q_1), \tag{5}$$
where $U(q_0, q_1)$ is the set of admissible transport plans (i.e. joint distributions over $x_0$ and $x_1$ whose marginals are equal to $q_0$ and $q_1$), as shown in [Shi et al., 2023], for some regularization parameter $\varepsilon \in \mathbb{R}_{\geq 0}$.
Algorithm 1 General Trajectory Flow Matching
Input: Trajectories $\mathcal{X}$, noise $\sigma$, initial networks $v_\theta$, $\sigma_\theta$.
while training do
    $z \sim \mathcal{U}(\mathcal{X})$, $k \sim \mathcal{U}\{1, \ldots, T-1\}$, $t \sim \mathcal{U}(0, 1)$
    $\mu_t \leftarrow t x_{k+1} + (1 - t) x_k$
    $x_t \sim \mathcal{N}(\mu_t, \sigma^2 t(1 - t) I)$
    $\mathcal{L}_{\text{TFM}}(\theta) \leftarrow \left\| v_\theta(k + t, x_t) - \frac{x_{k+1} - x_t}{1 - t} \right\|^2$
    $\mathcal{L}_{\sigma_t}(\theta) \leftarrow \| \sigma_\theta(k + t, x_t) - \mathcal{L}_{\text{TFM}} \|^2$
    $\theta \leftarrow \text{Update}(\theta, \nabla_\theta \mathcal{L}_{\text{TFM}}(\theta), \nabla_\theta \mathcal{L}_{\sigma_t}(\theta))$
return $v_\theta$, $\sigma_\theta$
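A minimal PyTorch sketch of one step of Algorithm 1 (our own rendering; `v_net`, `sigma_net`, and the dense batch layout `X` of shape `(B, T, d)` are illustrative assumptions):

```python
import torch

def tfm_step(v_net, sigma_net, opt, X, sigma=0.1):
    """One step of Algorithm 1 on a dense batch of trajectories X: (B, T, d)."""
    B, T, d = X.shape
    k = torch.randint(0, T - 1, (B,))                 # segment index (0-based)
    t = torch.rand(B, 1)                              # position within segment
    x_k, x_k1 = X[torch.arange(B), k], X[torch.arange(B), k + 1]
    mu_t = t * x_k1 + (1 - t) * x_k                   # bridge mean
    x_t = mu_t + sigma * (t * (1 - t)).sqrt() * torch.randn_like(mu_t)
    target = (x_k1 - x_t) / (1 - t)                   # conditional flow
    s = k.unsqueeze(1).float() + t                    # global time k + t
    loss_tfm = ((v_net(s, x_t) - target) ** 2).sum(-1)
    # Regress sigma_theta onto the flow-matching loss; detach so the
    # uncertainty loss does not backpropagate into v_net (a sketch choice).
    loss_sigma = ((sigma_net(s, x_t) - loss_tfm.detach().unsqueeze(1)) ** 2).sum(-1)
    opt.zero_grad()
    (loss_tfm.mean() + loss_sigma.mean()).backward()
    opt.step()
    return loss_tfm.mean().item(), loss_sigma.mean().item()
```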
where $\Pi(u)^\star$ represents the coupling of a model which attains minimal loss according to eq. 3 and $\Pi^\star(x_{1:T})$ is the coupling of the data distribution. Intuitively, as long as no two paths cross given conditionals $c$, the coupling is preserved. In prior work $c = \emptyset$, and the coupling is only preserved in special cases such as eq. 5.
We next enumerate three assumptions under which the coupling is guaranteed to be preserved at the optimum:
(A1) $c = x_0$ and there exists $T : \mathcal{X} \to \mathcal{X}$ such that $T(x_0) = x_1$ if and only if $(x_0, x_1) \in \operatorname{supp}(\Pi^\star)$. We note that this is equivalent to asserting the existence of a Monge map $T^\star$ for the coupling $\Pi^\star$.
(A2) There exist no two trajectories $x^i, x^j$ such that $x^i_t = x^j_t$ for $h$ consecutive observations, and $g = 0$.
(A3) Trajectories are associated with unique conditional vectors $c$ independent of $t$.
Even in cases when (A1)-(A3) do not hold exactly, TFM can often still learn useful models of the data. In some sense, uniqueness up to some history length is enough, as it shows TFM is as powerful as discrete-time autoregressive models. Proofs and further examples are available in §A.1.
3.2 Target prediction reparameterization
While flow matching generally predicts the flow, there is a target-predicting equivalent: given $v_\theta(t, x) := \frac{\hat{x}^{\lceil t \rceil}_\theta(t, x) - x_t}{\lceil t \rceil - t}$ and $u_t(x|z) := \frac{x^{\lceil t \rceil} - x_t}{\lceil t \rceil - t}$, which is equivalent to $x_1 - x_0$ when $x_t = t x_1 + (1 - t) x_0$, the target-predicting loss is equivalent to a time-weighted flow-matching loss. Specifically, let the target-predicting loss be
$$\mathcal{L}_{\text{target}}(\theta) = \mathbb{E}_{t, q(z), p_t(x|z)} \|\hat{x}^{\lceil t \rceil}_\theta(t, x) - x_{\lceil t \rceil}\|^2 \tag{10}$$
Then it is easy to show that:
Proposition 3.3. There exists a scaling function $c(t) : \mathbb{R}_+ \to \mathbb{R}$ such that $\mathcal{L}_{\text{target}}(\theta) = c(t) \mathcal{L}_{\text{match}}(\theta)$.
3.3 Irregularly sampled trajectories
We next consider irregularly sampled time series of the form $x^i := \{(x^i_1, t^i_1), (x^i_2, t^i_2), \ldots, (x^i_T, t^i_T)\}$ with $t^i_1 < t^i_2 < \cdots < t^i_T$, with $t_{\text{next}}$ denoting the next timepoint observed after time $t$. In this case, when combined with the target-predicting reparameterization in §3.2, we can predict the time until the next observation. We therefore parameterize an auxiliary model $h_\theta(t, x_t) : [0, T] \times \mathbb{R}^d \to [0, T]$ which predicts the next observation time. This is useful numerically but also, perhaps more importantly, in a clinical setting, where the spacing between measurements can be as informative as the measurements themselves [Allam et al., 2021]. $h_\theta$ is trained with the time-predictive loss:
$$\mathcal{L}_{\text{tp}}(\theta) = \sum_{t \in t^i} \|h_\theta(t, x_t) - (t_{\text{next}} - t)\|_2^2 \tag{11}$$
where $t_{\text{next}}$ is the time of the next measurement. This can be used in conjunction with the $x_{\text{next}}$ predictor to calculate the flow at time $t$ as
$$v_\theta(t, x_t) := \frac{\hat{x}^1_\theta(t, x_t) - x_t}{h_\theta(t, x_t) - t} \tag{12}$$
which can be used for inference on new trajectories.
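A hedged sketch of how eqs. 10-12 fit together at one observed transition (the names `x_hat_net` and `h_net` are ours):

```python
import torch

def irregular_losses(x_hat_net, h_net, t, x_t, t_next, x_next):
    """Target-prediction loss (eq. 10) and time-predictive loss (eq. 11)."""
    loss_tp = ((h_net(t, x_t) - (t_next - t)) ** 2).sum(-1)    # eq. 11
    loss_target = ((x_hat_net(t, x_t) - x_next) ** 2).sum(-1)  # eq. 10
    return loss_target, loss_tp

def flow_at(x_hat_net, h_net, t, x_t):
    """Implied vector field for inference on new trajectories (eq. 12)."""
    return (x_hat_net(t, x_t) - x_t) / (h_net(t, x_t) - t)
```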
3.4 Uncertainty prediction
Finally, we consider uncertainty prediction. Until now we have defined conditional probability paths using a fixed noise parameter $\sigma$. However, this does not have to be fixed. Instead, we consider a learned $\sigma_\theta(t, x_t)$, trained iteratively with the loss:
$$\mathcal{L}_{\text{uncertainty}}(\theta, x) = \sum_{t \in t^i} \left( \sigma_\theta(t, x_t) - \|\hat{x}_\theta(t, x_t) - x_{\text{next}}\|_2^2 \right)^2 \tag{13}$$
which learns to predict the error in the estimate of $x_t$. This loss can be interpreted as training an epistemic uncertainty predictor, similar to that proposed in direct epistemic uncertainty prediction (DEUP) [Lahlou et al., 2023].
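A corresponding sketch of eq. 13 (again with our own naming; the `detach` keeps the uncertainty head from influencing the state predictor, an assumption of this sketch):

```python
import torch

def uncertainty_loss(sigma_net, x_hat_net, t, x_t, x_next):
    """Regress sigma_theta onto the squared prediction error (eq. 13)."""
    # Detach so this head does not alter the state predictor (sketch choice).
    err2 = ((x_hat_net(t, x_t).detach() - x_next) ** 2).sum(-1, keepdim=True)
    return ((sigma_net(t, x_t) - err2) ** 2).sum(-1)
```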
Figure 2: 1D harmonic oscillator overfitting experiment results. Left: TFM-ODE (ours) with memory
= 3. Middle: TFM-ODE (ours) without memory. Right: Aligned FM [Liu et al., 2023a, Somnath
et al., 2023].
4 Experimental Results
In this section we empirically evaluate the performance of the trajectory flow matching objective
in terms of time series modeling error, but also uncertainty quantification. We also evaluate a
variety of simulation-based and simulation-free methods including both stochastic and deterministic
methods. Stochastic methods are in general more difficult to fit, but can be used to better model
uncertainty and variance. Further experimental details can be found in §B. Experiments were run on
a computing cluster with a heterogenous cluster of NVIDIA RTX8000, V100, A40, and A100 GPUs
for approximately 24,000 GPU hours. Individual training runs require approximately one gpu day.
Baselines In addition to different ablations of trajectory flow matching, we also evaluate NeuralODE [Chen et al., 2018], NeuralSDE [Li et al., 2020, Kidger et al., 2021b, Kidger, 2022], Latent NeuralODE [Rubanova et al., 2019], and an aligned flow matching method (Aligned FM) [Liu et al., 2023a, Somnath et al., 2023] where the couplings are sampled according to the ground truth coupling during training.
Metrics We primarily make use of two metrics. The average mean squared error (Mean MSE) over held-out time series measures the time series modeling error:
$$\text{MSE}(\hat{x}, x) = \frac{1}{T - 1} \sum_{t \in [2, T]} \|\hat{x}_t - x_t\|_2^2, \tag{14}$$
where $\hat{x}$ and $x$ are the predicted and true trajectories respectively. We also use the maximum mean discrepancy with a radial basis function kernel (RBF MMD), which measures how well the distribution over the next observation is modelled by comparing the predicted distribution to the distribution over next states in the ground truth trajectory. Specifically, we compute:
$$\text{RBF-MMD}(\theta, \hat{x}, x) := \frac{1}{T - 1} \sum_{t \in [2, T]} \text{MMD}(\hat{\Delta}_t, \Delta_t) \tag{15}$$
where $\hat{\Delta}_t = \hat{x}_t - x_{t-1}$, $\Delta_t = x_t - x_{t-1}$, and $\hat{x}_t := \int_{s=t-1}^{t} f_\theta(s, x_s)\,ds + g_\theta(s, x_s)\,dW_s$ is a set of samples from the model prediction at time $t$.
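Hedged sketches of both metrics (the RBF bandwidth `gamma` and the simple biased MMD estimator are our own choices):

```python
import torch

def mean_mse(x_hat, x):
    """Eq. 14: average squared error over time steps 2..T of one trajectory."""
    return ((x_hat[1:] - x[1:]) ** 2).sum(-1).mean()

def rbf_mmd2(a, b, gamma=1.0):
    """Biased MMD^2 estimate between sample sets a, b with an RBF kernel."""
    def k(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return torch.exp(-gamma * d2)
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()
```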
4.1 Exploring coupling preservation with 1D harmonic oscillators
We begin by evaluating how trajectory flow matching performs in a simple one dimensional setting of
harmonic oscillators. We show that vanilla conditional flow and bridge matching [Liu et al., 2023c,b,
Albergo and Vanden-Eijnden, 2023], specifically aligned approaches [Somnath et al., 2023, Liu
et al., 2023a] are unable to preserve the coupling even in a simple one dimensional setting. However,
augmented with our trajectory flow matching approach, and specifically using (A2), which includes
information on previous observations, the model is able to fit the harmonic oscillator dataset well.
The harmonic oscillator dataset consists of one-dimensional oscillatory trajectories from a damped harmonic oscillator, with each trajectory distinguished by a unique damping coefficient $c$. Specifically, we sample trajectories $x$ from:
$$x_i = x_{i-1} + v_{i-1}(t_i - t_{i-1}); \quad x_0 = 1 \tag{16}$$
where $v$ is the velocity of the oscillator, updated by
$$v_i = v_{i-1} + \left( -\frac{c}{m} v_{i-1} - \frac{k}{m} x_{i-1} \right)(t_i - t_{i-1}); \quad v_0 = 0 \tag{17}$$
with $t_i = 0.1 \cdot i$ for $i = 0, 1, 2, \ldots, 99$, spring constant $k = 1$, and mass $m = 1$.
As c increases, the trajectories evolve from underdamped scenarios with prolonged oscillations to
critically and overdamped states where the oscillator quickly stabilizes. This leads to intersecting
trajectories due to frequency and phase differences, despite their shared starting point. We perform
overfitting experiments on three trajectories generated by varying c.
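A small NumPy sketch of this data-generating process, using the damping coefficients reported in Appendix B.1:

```python
import numpy as np

def oscillator(c, k=1.0, m=1.0, n=100, dt=0.1):
    """Euler-integrated damped harmonic oscillator, following eqs. 16-17."""
    x, v = np.empty(n), np.empty(n)
    x[0], v[0] = 1.0, 0.0
    for i in range(1, n):
        x[i] = x[i - 1] + v[i - 1] * dt                                # eq. 16
        v[i] = v[i - 1] + (-(c / m) * v[i - 1] - (k / m) * x[i - 1]) * dt  # eq. 17
    return x

# The three crossing trajectories used in the overfitting experiment (App. B.1).
trajectories = [oscillator(c) for c in (0.25, 2.0, 3.75)]
```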
As shown in Figure 2, models without history information are unable to distinguish between the three crossing trajectories that share the same starting point, resulting in overlapping predictions. In contrast, TFM-ODE incorporating three previous observations fits the crossing trajectories with high accuracy, with the predicted trajectories almost completely overlapping the ground truth. This is because the dataset satisfies (A2) with h = 4 (TFM-ODE) but not h = 0 (TFM-ODE without memory and Aligned FM).
4.2 Experiments on clinical datasets
Next we compared the performance of TFM and TFM-ODE with the current SDE and ODE baselines,
respectively, for modeling real-world patient trajectories formed with heart rate and mean arterial
blood pressure measurements within the first 24 hours of admission across three different datasets.
These are clinical measurements that are taken most frequently and used to evaluate the hemodynamic
status of patients, a key indicator of disease severity. Additionally, we evaluated our models against
flow matching on these datasets, each with distinct characteristics, to assess their ability to generalize
across different distributions. A full description of the datasets are available in Appendix B.2 with
the publicly available datasets used under The PhysioNet Credentialed Health Data License Version
1.5.0 and the EHR dataset with local institutional IRB approval:
• ICU Sepsis: a subset of the eICU Collaborative Research Database v2.0 [Pollard et al.,
2019] of patients admitted with sepsis as the primary diagnosis
• ICU Cardiac Arrest: a subset of the eICU Collaborative Research Database v2.0 [Pollard
et al., 2019] of patients at risk for cardiac arrest
• ICU GIB: a subset of the Medical Information Mart for Intensive Care III [Johnson et al.,
2016] of patients with gastrointestinal bleeding as the primary diagnosis
• ED GIB: patients presenting with signs and symptoms of acute gastrointestinal bleeding to
the emergency department of a large tertiary care academic health system
4.2.1 Prediction accuracy and precision: TFM and TFM-ODE
TFM-ODE yields more accurate trajectory prediction Across the three datasets TFM-ODE outperformed the baseline models by 15% to 20%, as seen in Table 1. We noticed that TFM has similar performance to TFM-ODE; in one case, on the ICU GIB dataset, TFM outperformed the non-stochastic TFM-ODE. For ICU Sepsis, the performance improvement over the baseline is the most significant, around 83%. This coincides with the ICU Sepsis dataset having the largest number of measurements per trajectory. The improvement is seen in both TFM and TFM-ODE, possibly indicating they are able to learn better given more data, resulting in a more precise flow. Though not formally measured, we noted that given the same time constraint, FM-based models were significantly faster and often finished training before the time limit.
TFM yields better uncertainty prediction Though TFM-ODE had lower test MSE in two out of three cases, TFM yielded better uncertainty prediction overall, as seen in Table 2. Notably, TFM also had less variance in the uncertainty prediction than TFM-ODE. A plausible explanation is a sacrifice in bias that subsequently decreases the variance for the stochastic implementation, reflecting the bias-variance trade-off. Sample predictions of TFM are shown in Figure 3; notably, the model is able to detect the measurement uncertainty at certain timepoints, matching the increase in amplitude of oscillation in patient trajectories.
4.2.2 Trajectory Variance Distribution Comparison
TFM trajectories accurately match the noise distribution in the data TFM is able to match the noise distribution in addition to the overall trajectory shape, which is useful in settings where data has high stochasticity. We compared our models to NeuralODE and NeuralSDE in matching the variance of neighboring data points, as seen in Table 3. We verify that, between the baselines, NeuralSDE has a lower MMD than NeuralODE and is better able to match the data points. We find that on the ICU GIB and ED GIB datasets, TFM outperforms both in matching the variance in the data. Notably, the performance pattern is reversed between the MMD and mean MSE metrics for TFM and TFM-ODE: better MSE corresponds to worse MMD and vice versa. This further confirms the bias-variance trade-off for both the TFM and TFM-ODE implementations.

Table 1: Mean ± std. deviation MSE (×10⁻³) by models and datasets, split into deterministic (top) and stochastic (bottom) models. Top performing model for each setting and dataset in bold.
Table 2: Uncertainty test MSE loss for TFM-ODE and TFM on two different ICU datasets.
Figure 3: Three samples of predicted trajectory and uncertainty on the ICU GIB test set. Top: predicted (orange) and ground truth (blue) mean arterial pressure (MAP). Bottom: the absolute value of the uncertainty predicted by TFM.
Table 3: Data variance MMD by models and datasets, split into deterministic (top) and stochastic (bottom) models. Top performing model for each setting and dataset in bold.
Table 4: Mean MSE (×10⁻³) by ablated versions of TFM, TFM-ODE, and datasets.

Model     Uncertainty  Memory  Hidden  ICU Sepsis      ICU Cardiac     ICU GIB         ED GIB
          Prediction           Size                    Arrest
TFM-ODE   ✓            ✓       256     0.793 ± 0.017   2.762 ± 0.017   2.673 ± 0.069   8.245 ± 0.495
                       ✓       256     1.170 ± 0.014   2.759 ± 0.015   3.097 ± 0.054   8.659 ± 0.429
                               256     1.555 ± 0.122   3.242 ± 0.050   2.981 ± 0.161   6.381 ± 0.451
                               64      1.936 ± 0.262   3.244 ± 0.025   4.003 ± 0.347   11.253 ± 4.597
TFM       ✓            ✓       256     0.796 ± 0.026   2.596 ± 0.079   2.762 ± 0.021   8.613 ± 0.260
                       ✓       256     0.816 ± 0.031   2.778 ± 0.021   2.754 ± 0.095   8.600 ± 0.389
                               64      1.965 ± 0.289   3.271 ± 0.031   4.037 ± 0.314   7.549 ± 0.737
Uncertainty improves performance of trajectory prediction For TFM and TFM-ODE, the flow network used to learn the uncertainty $\sigma_{x_t}$ is separate from the flow network learning $x_t$, and the loss of the network learning $x_t$ is independent of the uncertainty network. It was therefore unexpected that removing the uncertainty prediction increased the MSE test loss for learning $x_t$. This suggests a synergistic effect between the $x_t$ flow and the $\sigma_{x_t}$ flow.
Trajectory memory may improve performance in high-frequency measurement settings We conditioned the model on a sliding window of trajectory history to disentangle data points that otherwise look indistinguishable to FM models. This improved interpolation performance on the ICU Sepsis and ICU GIB datasets. Notably, this modification did not improve performance on the ED GIB dataset, which could be due to shorter patient trajectories and lower measurement frequency in the defined time period, or to the decreased severity of disease in the ED compared to the ICU. Adding memory as a condition may be more suitable for patients whose clinical trajectories have a higher frequency of measurements.
5 Related Work
Continuous-time neural network architectures have outperformed traditional RNN methods in modeling irregularly sampled clinical time series for interpolation and extrapolation. Neural ODEs with latent representations of trajectories [Rubanova et al., 2019] outperformed RNN-based approaches [Lipton et al., 2016, Che et al., 2018, Cao et al., 2018, Rajkomar et al., 2018] for interpolation while providing explicit uncertainty estimates about latent states. More recently, Neural SDEs appear to outperform LSTM [Hochreiter and Schmidhuber, 1997], Neural ODE [Chen et al., 2018, De Brouwer et al., 2019, Dupont et al., 2019, Lechner and Hasani, 2020], and attention-based [Shukla and Marlin, 2021, Lee et al., 2022] approaches in interpolation performance while natively handling uncertainty through drift and diffusion terms [Oh et al., 2024].
Discrete-time approaches offer an alternative to our continuous-time model: transformers use a discrete-time representation with sequential processing [Gao et al., 2024, Nie et al., 2023, Woo et al., 2024, Ansari et al., 2024, Dong et al., 2024, Garza and Mergenthaler-Canseco, 2023, Das et al., 2024, Liu et al., 2024, Kuvshinova et al., 2024] for traditional time series modeling. Adaptations of the baseline transformer include structuring observations into text with finetuning [Zhang et al., 2023, Zhou et al., 2023] or without finetuning [Xue and Salim, 2024, Gruver et al., 2023], and using vision transformers to model unevenly spaced time series by converting them into images [Li et al., 2023].
Continuous-time systems are also of great interest for learning causal representations, using observations to directly modify the system state [De Brouwer et al., 2022, Jia and Benson, 2019]. Variations include intervention modeling with separate ODEs for intervention and outcome processes [Gwak et al., 2020], liquid time-constant networks [Hasani et al., 2021, Vorbach et al., 2021], and modeling treatment effects with either one [Bellot and van der Schaar, 2021] or multiple interventions [Seedat et al., 2022]. Accounting for external interventions is a particular challenge in clinical data trajectories, where interventions (changes in environment due to treatment decisions or clinical context such as the ED or ICU) are common.
6 Conclusion
In this work we present Trajectory Flow Matching, a simulation-free training algorithm for neural
differential equation models. We show theoretically when trajectory flow matching is valid, then demonstrate its usefulness empirically in a clinical setting. The ability to model the underlying
continuous physiologic processes during critical illness using irregular, sparsely sampled, and noisy
data has the potential for broad impacts in care settings such as the emergency department or ICU.
These models could be used to improve clinical decision making, inform monitoring strategies,
and optimize resource allocation by identifying which patients are likely to deteriorate or recover.
These use cases will require thorough prospective validation and calibration for specific clinical
outcomes, for example using the likelihood of a patient crossing a specific heart rate or blood pressure
threshold for decisions on level of care (ICU versus inpatient floors) or specific interventions such
as transfusions. In these applications, it will be important to assess and control for bias that may be
present due to which patient subpopulations are present in training data.
Limitations Limitations of the method include the selective utility of integrating memory in clinical settings with high measurement frequency and no current capacity for estimating causal representations, though the latter is an important future research direction. Potential harms include erroneous predictions that result in either delayed care or overutilization of the health system. Accurate trajectory predictions have the potential to inform clinical decision-making regarding the appropriate level of care, leading to more timely and appropriate interventions.
Future work We hope to extend our method to cover other types of time series with periodic components, potentially incorporating Fourier transforms [Li et al., 2021] and physics-informed neural networks (PINNs). Since interpretability is an important factor for clinical reliability, we are developing methods to further elucidate the key components affecting predictions. We also hope to incorporate functional flow matching for the fully continuous setting [Kerrigan et al., 2024].
7 Broader Impact
Our work extends flow matching into the domain of time series modeling, demonstrating a specific instance of clinical time series prediction. In contrast to large transformer-based models, our method has fewer parameters and requires less training time; notably, it scales well with the number of parameters. In addition, our parameterization of stochastic differential equations (SDEs) allows faster training than traditional SDE integration.
Accurate time series modeling in healthcare has the potential for significant benefits, but also introduces risks. Benefits that could be derived from more accurate prediction of clinical courses include improved treatment decisions, resource allocation, and more informative discussions of prognosis with patients or family members. Risks may come from inaccuracies in predictions, which could lead to harm by biasing the decision making of clinical teams. False negative predictions (trajectories with falsely favorable outcomes) may lead to undertreatment, while false positive predictions (trajectories with incorrectly detrimental outcomes) may lead to overtreatment. These inaccuracies may also propagate biases in training data.
To move towards broad impact in the clinical domain, this work will require validation and bias esti-
mates. Furthermore, models deployed in domains with high-stakes prediction require interpretability,
which can help identify biases, miscalibration, discordance with domain knowledge, as well as build
trust with teams using predictions from the model. At this time, flow-based methods have limited
tools for interpretability, and we recognize this as a gap in need of future work.
Acknowledgements
The authors would like to thank Mathieu Blanchette for useful comments on early versions of this
manuscript. We are also grateful to the anonymous reviewers for suggesting numerous improvements.
The authors acknowledge funding from the National Institutes of Health, UNIQUE, CIFAR, NSERC,
Intel, and Samsung. The research was enabled in part by computational resources provided by the Dig-
ital Research Alliance of Canada (https://alliancecan.ca), Mila (https://mila.quebec),
Yale School of Medicine and NVIDIA.
References
M. S. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants.
In The Eleventh International Conference on Learning Representations, 2023. URL https:
//openreview.net/forum?id=li7qeBbCR1t.
M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework
for flows and diffusions. CoRR, abs/2303.08797, 2023. URL https://doi.org/10.48550/
arXiv.2303.08797.
A. Allam, S. Feuerriegel, M. Rebhan, and M. Krauthammer. Analyzing patient trajectories with
artificial intelligence. J Med Internet Res, 23(12):e29812, Dec 2021. ISSN 1438-8871. doi:
10.2196/29812. URL https://www.jmir.org/2021/12/e29812.
A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram,
S. Pineda Arango, S. Kapoor, J. Zschiegner, D. C. Maddix, H. Wang, M. W. Mahoney, K. Torkkola,
A. Gordon Wilson, M. Bohlke-Schneider, and Y. Wang. Chronos: Learning the language of time
series. arXiv preprint arXiv:2403.07815, 2024.
A. Bellot and M. van der Schaar. Policy analysis using synthetic controls in continuous-time. In
M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine
Learning, volume 139 of Proceedings of Machine Learning Research, pages 759–768. PMLR,
18–24 Jul 2021. URL https://proceedings.mlr.press/v139/bellot21a.html.
W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y. Li. Brits: Bidirectional recurrent imputation
for time series. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran As-
sociates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/
file/734e6bfcd358e25ac1db0a4241b95651-Paper.pdf.
Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate
time series with missing values. Scientific Reports, 8(1), 2018. doi: 10.1038/s41598-018-24271-9.
R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential
equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Gar-
nett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Asso-
ciates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/
file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf.
T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794, 2016.
M. M. Churpek, T. C. Yuen, S. Y. Park, D. O. Meltzer, J. B. Hall, and D. P. Edelson. Derivation of a
cardiac arrest prediction model using ward vital signs. Critical Care Medicine, 2012.
A. Das, W. Kong, R. Sen, and Y. Zhou. A decoder-only foundation model for time-series forecasting.
In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.
net/forum?id=jn2iTJas6h.
E. De Brouwer, J. Simm, A. Arany, and Y. Moreau. Gru-ode-bayes: Continuous modeling of
sporadically-observed time series. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, vol-
ume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_
files/paper/2019/file/455cb2657aaa59e32fad80cb0b65b9dc-Paper.pdf.
E. De Brouwer, J. Gonzalez, and S. Hyland. Predicting the impact of treatments over time with
uncertainty aware neural differential equations. In G. Camps-Valls, F. J. R. Ruiz, and I. Valera,
editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics,
volume 151 of Proceedings of Machine Learning Research, pages 4705–4722. PMLR, 28–30 Mar
2022. URL https://proceedings.mlr.press/v151/de-brouwer22a.html.
J. Dong, H. Wu, Y. Wang, Y.-Z. Qiu, L. Zhang, J. Wang, and M. Long. Timesiam: A pre-training
framework for siamese time-series modeling. In Forty-first International Conference on Machine
Learning, 2024. URL https://openreview.net/forum?id=wrTzLoqbCg.
E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural odes. In H. Wal-
lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, edi-
tors, Advances in Neural Information Processing Systems, volume 32. Curran Associates,
Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/
21be9a4bd4f81549a9d1d241981cec3c-Paper.pdf.
S. Gao, T. Koker, O. Queen, T. Hartvigsen, T. Tsiligkaridis, and M. Zitnik. Units: Building a unified
time series model. arXiv, 2024. URL https://arxiv.org/pdf/2403.00131.pdf.
N. Gruver, M. Finzi, S. Qiu, and A. G. Wilson. Large language models are zero-shot time series
forecasters. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,
Advances in Neural Information Processing Systems, volume 36, pages 19622–19635. Curran As-
sociates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
file/3eb7ca52e8207697361b2c0fb3926511-Paper-Conference.pdf.
D. Gwak, G. Sim, M. Poli, S. Massaroli, J. Choo, and E. Choi. Neural Ordinary Differential
Equations for Intervention Modeling. arXiv e-prints, art. arXiv:2010.08304, Oct. 2020. doi:
10.48550/arXiv.2010.08304.
R. Hasani, M. Lechner, A. Amini, D. Rus, and R. Grosu. Liquid time-constant networks. Proceedings
of the AAAI Conference on Artificial Intelligence, 35(9):7657–7666, May 2021. doi: 10.1609/aaai.
v35i9.16936. URL https://ojs.aaai.org/index.php/AAAI/article/view/16936.
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 33:6840–6851, 2020.
P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial
networks. CVPR, 2017.
G. Kerrigan, G. Migliorini, and P. Smyth. Functional flow matching. In S. Dasgupta, S. Mandt, and
Y. Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and
Statistics, volume 238 of Proceedings of Machine Learning Research, pages 3934–3942. PMLR,
02–04 May 2024. URL https://proceedings.mlr.press/v238/kerrigan24a.html.
P. Kidger, J. Foster, X. Li, H. Oberhauser, and T. Lyons. Neural sdes as infinite-dimensional gans. In
International conference on machine learning. PMLR, 2021a.
P. Kidger, J. Foster, X. C. Li, and T. Lyons. Efficient and accurate gradients for neural sdes. In
M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in
Neural Information Processing Systems, volume 34, pages 18747–18761. Curran Associates,
Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/
9ba196c7a6e89eafd0954de80fc1b224-Paper.pdf.
M. Lechner and R. Hasani. Learning long-term dependencies in irregularly-sampled time series.
arXiv preprint arXiv:2006.04418, 2020.
Y. Lee, E. Jun, J. Choi, and H.-I. Suk. Multi-view integrative attention-based deep representation
learning for irregular clinical time-series data. IEEE Journal of Biomedical and Health Informatics,
26(8):4270–4280, 2022. doi: 10.1109/JBHI.2022.3172549.
X. Li, T.-K. L. Wong, R. T. Q. Chen, and D. K. Duvenaud. Scalable gradients and variational
inference for stochastic differential equations. In C. Zhang, F. Ruiz, T. Bui, A. B. Dieng, and
D. Liang, editors, Proceedings of The 2nd Symposium on Advances in Approximate Bayesian
Inference, volume 118 of Proceedings of Machine Learning Research, pages 1–28. PMLR, 08 Dec
2020. URL https://proceedings.mlr.press/v118/li20a.html.
Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar.
Fourier neural operator for parametric partial differential equations. In International Conference on
Learning Representations, 2021. URL https://openreview.net/forum?id=c8P9NQVtmnO.
Z. Li, S. Li, and X. Yan. Time series as images: Vision transformer for irregularly sampled time
series. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,
Advances in Neural Information Processing Systems, volume 36, pages 49187–49204. Curran As-
sociates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
file/9a17c1eb808cf012065e9db47b7ca80d-Paper-Conference.pdf.
Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative
modeling. In The Eleventh International Conference on Learning Representations, 2023. URL
https://openreview.net/forum?id=PqvMRDCJT9t.
Z. C. Lipton, D. Kale, and R. Wetzel. Directly modeling missing data in sequences with rnns:
Improved classification of clinical time series. In F. Doshi-Velez, J. Fackler, D. Kale, B. Wallace,
and J. Wiens, editors, Proceedings of the 1st Machine Learning for Healthcare Conference,
volume 56 of Proceedings of Machine Learning Research, pages 253–270, Northeastern University,
Boston, MA, USA, 18–19 Aug 2016. PMLR. URL https://proceedings.mlr.press/v56/
Lipton16.html.
G.-H. Liu, A. Vahdat, D.-A. Huang, E. A. Theodorou, W. Nie, and A. Anandkumar. I2sb: image-
to-image schrödinger bridge. In Proceedings of the 40th International Conference on Machine
Learning, ICML’23. JMLR.org, 2023a.
X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with
rectified flow. The Eleventh International Conference on Learning Representations (ICLR), 2023b.
URL https://par.nsf.gov/biblio/10445517.
X. Liu, L. Wu, M. Ye, and Q. Liu. Learning diffusion bridges on constrained domains. In
The Eleventh International Conference on Learning Representations, 2023c. URL https://
openreview.net/forum?id=WH1yCa0TbB.
Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long. Timer: Transformers for Time Series
Analysis at Scale. arXiv e-prints, art. arXiv:2402.02368, Feb. 2024. doi: 10.48550/arXiv.2402.
02368.
A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In M. Meila and
T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume
139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR, 18–24 Jul 2021.
URL https://proceedings.mlr.press/v139/nichol21a.html.
Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam. A time series is worth 64 words: Long-
term forecasting with transformers. In The Eleventh International Conference on Learning
Representations, 2023. URL https://openreview.net/forum?id=Jbdc0vTOcol.
Y. Oh, D. Lim, and S. Kim. Stable neural stochastic differential equations in analyzing irregular time
series data. In The Twelfth International Conference on Learning Representations, 2024. URL
https://openreview.net/forum?id=4VIgNuQ1pY.
T. Pollard, A. Johnson, J. Raffa, L. A. Celi, O. Badawi, and R. Mark. eICU Collaborative Research Database (version 2.0), 2019. URL https://doi.org/10.13026/C2WM1R.
A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun,
P. Sundberg, H. Yee, K. Zhang, Y. Zhang, G. Flores, G. E. Duggan, J. Irvine, Q. Le, K. Litsch,
A. Mossin, J. Tansuwan, D. Wang, J. Wexler, J. Wilson, D. Ludwig, S. L. Volchenboum, K. Chou,
M. Pearson, S. Madabushi, N. H. Shah, A. J. Butte, M. D. Howell, C. Cui, G. S. Corrado, and
J. Dean. Scalable and accurate deep learning with electronic health records. npj Digital Medicine,
1(1):18, 2018. doi: 10.1038/s41746-018-0029-1.
N. Seedat, F. Imrie, A. Bellot, Z. Qian, and M. van der Schaar. Continuous-time modeling of
counterfactual outcomes using neural controlled differential equations. In K. Chaudhuri, S. Jegelka,
L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International
Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research,
pages 19497–19521. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/
seedat22b.html.
S. N. Shukla and B. Marlin. Multi-time attention networks for irregularly sampled time series. In
International Conference on Learning Representations, 2021. URL https://openreview.net/
forum?id=4c0J6lwQ4_.
V. R. Somnath, M. Pariset, Y.-P. Hsieh, M. R. Martinez, A. Krause, and C. Bunne. Aligned diffusion
Schrödinger bridges. In R. J. Evans and I. Shpitser, editors, Proceedings of the Thirty-Ninth
Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine
Learning Research, pages 1985–1995. PMLR, 31 Jul–04 Aug 2023. URL https://proceedings.
mlr.press/v216/somnath23a.html.
G. Woo, C. Liu, A. Kumar, C. Xiong, S. Savarese, and D. Sahoo. Unified training of universal time
series forecasting transformers. In Forty-first International Conference on Machine Learning,
2024. URL https://openreview.net/forum?id=Yd8eHMY1wz.
H. Xue and F. D. Salim. PromptCast: A New Prompt-Based Learning Paradigm for Time
Series Forecasting . IEEE Transactions on Knowledge & Data Engineering, 36(11):6851–
6864, Nov. 2024. ISSN 1558-2191. doi: 10.1109/TKDE.2023.3342137. URL https:
//doi.ieeecomputersociety.org/10.1109/TKDE.2023.3342137.
Y. Zhang, K. Gong, K. Zhang, H. Li, Y. Qiao, W. Ouyang, and X. Yue. Meta-transformer: A unified
framework for multimodal learning. arXiv preprint arXiv:2307.10802, 2023.
T. Zhou, P. Niu, X. Wang, L. Sun, and R. Jin. One fits all: Power general time series analysis by
pretrained lm. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,
Advances in Neural Information Processing Systems, volume 36, pages 43322–43355. Curran As-
sociates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
file/86c17de05579cde52025f9984e6e2ebb-Paper-Conference.pdf.
J. E. Zimmerman, A. A. Kramer, D. S. McNair, and F. M. Malila. Acute Physiology and Chronic Health Evaluation (APACHE) IV: hospital mortality assessment for today's critically ill patients. Critical Care Medicine, 34(5):1297-1310, 2006.
A Proof of theorems
We first prove a Lemma which shows TFM learns valid flows between distributions with the target
prediction reparameterization trick.
Lemma A.1. If $p_t(x) > 0$ and $\delta_{\text{data}}$ is Lipschitz continuous for all $x \in \mathbb{R}^d$ and $t \in [0, 1]$, then the gradients of $\mathcal{L}_{FM}$ and $\mathcal{L}_{TFM}$ are equal:
$$\nabla_\theta \mathcal{L}_{FM}(\theta) = \nabla_\theta \mathcal{L}_{TFM}(\theta)$$
Proof. This proof is a simple extension of Lipman et al. [2023], Tong et al. [2024], which proved that $\mathcal{L}_{CFM}$ and $\mathcal{L}_{FM}$ are equal under a similar constraint.
Given $\delta_{\text{data}} = t_1 - t_0$, we have $u_t(x) = \frac{x_1 - x_0}{\delta_{\text{data}}}$, where $t_0$ is the previous time in the time series and $t_1$ is the current time for inference. For the time series data, we assume Lipschitz continuity: there exists $L \geq 0$ such that for all $x, y \in \mathbb{R}^n$, $|f(x) - f(y)| \leq L \|x - y\|$.
$$\nabla_\theta \mathbb{E}_{p_t(x)} \|v_\theta(t, x) - u_t(x)\|^2 = \nabla_\theta \mathbb{E}_{t, q(z), p_t(x|z)} \frac{1}{(1 - t)^2} \|\hat{x}^1_\theta(t, x) - x_1\|^2 \tag{18}$$
$$= \nabla_\theta \mathbb{E}_{t, q(z), p_t(x|z)} \frac{1}{(1 - t)^2} \left[ \|\hat{x}^1_\theta(t, x)\|^2 - 2 \langle \hat{x}^1_\theta(t, x), x_1 \rangle + x_1^2 \right] \tag{19}$$
$$= \nabla_\theta \mathbb{E}_{t, q(z), p_t(x|z)} \frac{1}{(1 - t)^2} \left[ \|\hat{x}^1_\theta(t, x)\|^2 - 2 \langle \hat{x}^1_\theta(t, x), x_1 \rangle \right] \tag{20}$$
$$\mathbb{E}_{t, q(z), p_t(x|z)} \|v_\theta(t, x) - u_t(x|z)\|^2 = \mathbb{E}_{t, q(z), p_t(x|z)} \left\| \frac{\hat{x}^{\lceil t \rceil}_\theta(t, x) - x_t}{\lceil t \rceil - t} - \frac{x^{\lceil t \rceil} - x_t}{\lceil t \rceil - t} \right\|^2 \tag{22}$$
$$= \mathbb{E}_{t, q(z), p_t(x|z)} \frac{1}{(\lceil t \rceil - t)^2} \|\hat{x}^{\lceil t \rceil}_\theta(t, x) - x^{\lceil t \rceil}\|^2 \tag{23}$$
Lemma 3.1. The SDE $dx_t = u_t(x|z)dt + \sigma^2 dW_t$, where $u_t$ is defined in eq. 9, generates $p_t(x|z)$ in eq. 8 with initial condition $p_0 := \delta_{x_1}$, where $\delta$ is the Dirac delta function.
Proof. For simplicity of notation, we first show the case $\lceil t \rceil = 1$:
$$dx_t = u_t(x|z)dt + \sigma^2 dW_t = \frac{x_1 - x_t}{1 - t}\,dt + \sigma^2 dW_t \tag{24}$$
which is equivalent to the $d$-dimensional Brownian bridge, which has marginal
$$\mathcal{N}((1 - t)x_0 + t x_1, \sigma^2 t(1 - t)) \tag{25}$$
completing the proof for $\lceil t \rceil = 1$.
Proposition 3.2 (Coupling Preservation). Under mild regularity criteria on $u_t(\cdot|z)$, $p_t$, and $q$, if
$$\mathbb{E}_{t \sim U(0,T),\, z \sim q(z),\, c \sim q(c|z),\, x_t \sim p_t(x_t|z)} \|u_t(x_t|z, c) - u_t(x_t|c)\|_2^2 = 0$$
and $z$, $q(z)$, $p_t(x|z)$, and $u_t(x|z)$ are as defined in eqs. 6-9, then $\Pi(u)^\star = \Pi^\star(x_{1:T})$.
Proof. We prove the deterministic case with $T = 1$; the extensions to the stochastic case and $T > 1$ are evident. The couplings are equal if the marginal vector field satisfies $u_t(x_t|c) = u_t(x_t|z, c)$ everywhere, as the coupling is governed by the push-forward flows $\phi(x_0, c) = \int_0^1 u_t(x_t|c)\,dt$ and $\phi(x_0, c, z) = \int_0^1 u_t(x_t|z, c)\,dt$. If
$$\mathbb{E}_{t \sim U(0,T),\, z \sim q(z),\, c \sim q(c|z),\, x_t \sim p_t(x_t|z)} \|u_t(x_t|z, c) - u_t(x_t|c)\|_2^2 = 0$$
then $\phi(x_0, c, z) = \phi(x_0, c)$ for all $x_0$, and therefore the couplings of the optimal map are equivalent. We note that this requires an exchange of integrals under the same conditions as Lemma A.1.
Figure 4: Left: Distribution of number of complete vital measurements per patient trajectory within
the first 24 hours of admission in each clinical dataset. Right: Distribution of raw heart rate values in
each clinical dataset.
B Experimental Details
B.1 1D Oscillators
The three oscillation trajectories correspond to c = 0.25 (the red trajectory in Figure 2), c = 2 (blue), and c = 3.75 (green). Before being used as an input, t was scaled to [0, 1] by dividing by 10.
primary admission diagnosis (2689 patients in training set, 336 in validation set, and 337 in test
set). The following data fields were extracted: patient sex, age, heart rate, mean arterial pressure,
norepinephrine dose and infusion rate, and a validated ICU score (APACHE-IV). Each patient’s
complete pair measurements of heart rate and mean arterial pressure over time form one trajectory to
be modeled.
Norepinephrine infusion rates were calculated by converting drug doses or infusion rates to µg/kg/min,
and where drug doses were not explicitly available, the dose was inferred from the free text given in
the drug name. Start and end times for norepinephrine infusion were calculated by dividing the dose
by the infusion rate. Where there appeared to be multiple infusions at the same time, the maximum
infusion rate was taken as the infusion rate. As a conditional input to the models, the norepinephrine
infusion doses are then scaled to between 0 and 1 by dividing by the maximum norepinephrine value
in the dataset.
The APACHE-IV score, a validated critical care risk score, predicts individual patient mortality risk [Zimmerman et al., 2006]. In data preprocessing, we used logistic regression of the score against binary hospital mortality to generate a probability for each patient, which serves as an additional input condition for the models.
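A hedged scikit-learn sketch of this preprocessing step (the arrays here are random placeholders, not the actual patient data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random placeholders standing in for APACHE-IV scores and hospital mortality.
apache_scores = np.random.uniform(0, 200, size=(500, 1))
mortality = (np.random.rand(500) < apache_scores[:, 0] / 400).astype(int)

lr = LogisticRegression().fit(apache_scores, mortality)
# Per-patient mortality probability, used as an additional input condition.
p_mortality = lr.predict_proba(apache_scores)[:, 1]
```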
Intensive Care Unit Cardiac Arrest (ICU Cardiac Arrest) Dataset This dataset was extracted from the eICU Collaborative Research Database v2.0 [Pollard et al., 2019] described above to reflect ICU patients at risk for cardiac arrest. It excludes patients who presented with myocardial infarction (MI) and includes variables used in the Cardiac Arrest Risk Triage (CART) score [Churpek et al., 2012]: respiratory rate, heart rate, diastolic blood pressure, and age at the time of ICU admission. As an input to the model, age was z-score normalized. 51,671 patients were included in the training set, with 6,459 patients each in the validation and test sets.
Intensive Care Unit Acute Gastrointestinal Bleeding (ICU GIB) Dataset The Medical Information Mart for Intensive Care III (MIMIC-III) critical care database contains data for over 40,000 patients requiring an ICU stay at the Beth Israel Deaconess Medical Center from 2001 to 2012 [Johnson et al., 2016]. We selected a cohort of 2,602 ICU patients with a primary diagnosis of gastrointestinal bleeding to form the ICU GIB dataset, split into a training set of 2,082 patients and a validation set and test set of 260 patients each. We extracted the following variables: age, sex, heart rate, systolic blood pressure, diastolic blood pressure, vasopressor usage, blood product usage, packed red blood cell usage, and liver disease. Since vasopressor and blood product usage are encoded as binary values and may not represent the actual infusion amounts, which most likely decay over time, we experimented with adding a Gaussian decay to them for use as conditional inputs. Likewise, the trajectories to model consist of complete pairs of heart rate and mean arterial pressure (calculated from systolic and diastolic blood pressure) measurements.
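A sketch of one way such a Gaussian decay could be applied to a binary usage indicator (the time scale `tau` and the exact functional form are our own placeholder choices):

```python
import numpy as np

def gaussian_decay(event_times, query_times, tau=1.0):
    """Decay a binary usage event so its influence fades smoothly over time."""
    dt = query_times[:, None] - event_times[None, :]
    dt = np.where(dt >= 0, dt, np.inf)        # only past events contribute
    nearest = dt.min(axis=1)                  # time since most recent event
    return np.exp(-(nearest / tau) ** 2)      # 1 at the event, -> 0 afterwards

# e.g. vasopressor given at t = 1 and t = 5, queried on a regular grid:
signal = gaussian_decay(np.array([1.0, 5.0]), np.linspace(0.0, 10.0, 11), tau=2.0)
```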
Emergency Department Acute Gastrointestinal Bleeding (ED GIB) Dataset This dataset reflects 3,348 patients presenting with signs and symptoms of acute gastrointestinal bleeding to two hospital campuses of Yale New Haven Hospital between 2014 and 2018. The patients were split into a training set, a validation set, and a test set of 2,636, 352, and 360 patients respectively. Variables extracted include patient sex, age, heart rate, mean arterial pressure, initial measurements of 24 lab tests, and 17 pre-existing medical conditions as determined by ICD-10 codes. As with the ICU Sepsis data, the trajectories consist of complete pairs of heart rate and mean arterial pressure measurements.
Age, initial lab test measurements (three labs omitted due to missing data), and pre-existing medical conditions were used to train an XGBoost model [Chen and Guestrin, 2016] to predict the binary outcome variable indicating the need for hospital-based care. The resulting probability of requiring hospital-based care (outcome of 1) for each patient was then calculated using the trained model and used as conditional input to conditional models in experiments on this dataset.
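A hedged sketch of the risk-score construction (placeholder arrays and hyperparameters; the actual feature engineering follows the description above):

```python
import numpy as np
from xgboost import XGBClassifier

# Random placeholders for the feature matrix (age, labs, comorbidities)
# and the composite hospital-based-care outcome defined below.
features = np.random.rand(200, 40)
needs_care = np.random.randint(0, 2, size=200)

clf = XGBClassifier(n_estimators=100, max_depth=4).fit(features, needs_care)
# Probability of requiring hospital-based care, used as a conditional input.
risk = clf.predict_proba(features)[:, 1]
```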
Of note, the outcome variable was defined as 1 if a patient (1) required red blood cell transfusion, (2) required urgent intervention (endoscopic, interventional radiologic, or surgical) to stop bleeding, or (3) died from any cause within 30 days. Labs and medical conditions included in this dataset are listed below; labs in bold were excluded from the XGBoost risk score calculation due to missing data.
• Labs: Sodium, Potassium, Chloride, Carbon Dioxide, Blood Urea Nitrogen, Creatinine,
International Normalized Ratio, Partial Thromboplastin Time, White Blood Cell Count,
Hemoglobin, Platelet Count, Hematocrit, Mean Corpuscular Volume, Mean Corpuscular
Hemoglobin, Mean Corpuscular Hemoglobin Concentration, Red Cell Distribution Width,
Figure 5: Sigma mean MSE comparison
Figure 6: Memory Mean MSE comparison