Automatically Learning Hybrid Digital Twins of

Dynamical Systems

Samuel Holt∗, Tennison Liu∗ & Mihaela van der Schaar


DAMTP, University of Cambridge
Cambridge, UK
{sih31, tl522, mv472}@[Link]

Abstract
Digital Twins (DTs) are computational models that simulate the states and temporal
dynamics of real-world systems, playing a crucial role in prediction, understanding,
and decision-making across diverse domains. However, existing approaches to
DTs often struggle to generalize to unseen conditions in data-scarce settings,
a crucial requirement for such models. To address these limitations, our work
begins by establishing the essential desiderata for effective DTs. Hybrid Digital
Twins (HDTwins) represent a promising approach to address these requirements,
modeling systems using a composition of both mechanistic and neural components.
This hybrid architecture simultaneously leverages (partial) domain knowledge and
neural network expressiveness to enhance generalization, with its modular design
facilitating improved evolvability. While existing hybrid models rely on expert-
specified architectures with only parameters optimized on data, automatically
specifying and optimizing HDTwins remains intractable due to the complex search
space and the need for flexible integration of domain priors. To overcome this
complexity, we propose an evolutionary algorithm (HDTwinGen) that employs
Large Language Models (LLMs) to autonomously propose, evaluate, and optimize
HDTwins.² Specifically, LLMs iteratively generate novel model specifications,
while offline tools are employed to optimize emitted parameters. Correspondingly,
proposed models are evaluated and evolved based on targeted feedback, enabling
the discovery of increasingly effective hybrid models. Our empirical results reveal
that HDTwinGen produces generalizable, sample-efficient, and evolvable models,
significantly advancing DTs’ efficacy in real-world applications.

1 Introduction
Digital Twins (DTs) are computational models that accurately simulate the states and temporal
dynamics of real-world systems [1, 2]. They are particularly useful in modeling dynamical systems,
which consist of multiple interdependent components that evolve over time [3, 4]. Take, for example,
the epidemiological dynamics of a contagious disease containing various components, including
infection rates, recovery rates, population movement, and intervention strategies. DTs can integrate
these factors to simulate future outcomes (e.g. predict disease spread), understand system changes
(e.g. examining shifts in disease dynamics for varying demographics), and evaluate the impact of
control measures (e.g. to curb disease transmission) [5, 6].
Desiderata. A notable differentiator between DTs and general machine learning (ML) models
is the emphasis on generalization. DTs are designed to simulate completely unseen scenarios or
interventions at inference time. Therefore, a crucial consideration is ▶ [P1] out-of-distribution

∗ Equal contributions; authors listed in randomized order.
² Code is available at [Link]

38th Conference on Neural Information Processing Systems (NeurIPS 2024).


generalization: the ability to generalize to state-action distributions beyond those observed during
training. This challenge is often compounded by the scarcity of observational data available to accu-
rately learn dynamics, highlighting the importance of ▶ [P2] sample-efficient learning. Additionally,
the model should be ▶ [P3] evolvable: capable of efficiently adapting (i.e. with minimal retraining)
to changes in the underlying system dynamics. This is particularly crucial in healthcare domains,
such as epidemiological modeling and treatment planning, where DTs are regularly updated to
reflect fundamental changes in disease transmission patterns (caused by viral mutations, vaccination
coverage) or evolving drug resistance mechanisms, often with minimal additional data of emergent
dynamics [6, 7].
Existing approaches for creating DTs primarily utilize two approaches: mechanistic models or ML-
based neural models. Mechanistic models, denoted as fmech , are closed-form equations grounded
in domain knowledge such as biological or physical principles. They offer high accuracy and
generalization given sufficient domain understanding but are limited in their ability to model systems
where scientific knowledge is incomplete [8, 9]. Of related note, techniques have been introduced
to discover governing equations directly from data, but face challenges in scaling to more complex
problem settings [10, 11]. Conversely, neural approaches, fneural , leverage neural networks (NN) to
learn DTs directly from data, often requiring minimal knowledge [12, 13, 14, 15]. Such models are
effective given sufficient training data that provides adequate coverage of state-action distributions,
but struggle in data-scarce settings and are difficult to evolve to reflect changing conditions due to
their overparameterized, monolithic nature.
Key considerations. Informed by this context, Hybrid Digital Twins (HDTwins) combine the strengths
of both approaches through compositions of neural and mechanistic components, i.e. f = fmech ◦
fneural . Here, fmech symbolically incorporates domain-grounded priors, improving generalization and
regularization while simplifying the complexity of patterns that have to be learned by the neural
component. In other terms, fneural complements the mechanistic component by modeling complex
temporal patterns in regions where the mechanistic model might be oversimplified or incomplete.
Consequently, HDTwins can more accurately and robustly capture system dynamics, particularly in
settings with (limited) empirical data and (partial) domain knowledge.
Conceptually, hybrid modeling involves two stages: model specification, determining the model
structure (e.g. neural architecture, symbolic equations), and model parameterization, estimating
model parameters (e.g. neural weights, coefficients). This process, with model specification in
particular, has traditionally relied heavily on human expertise to craft problem-specific models
[16, 17, 18, 19]. In this work, we investigate the feasibility of automatically designing hybrid models
with minimal expert involvement, which would significantly enhance the efficiency and scalability
of model development. This task is challenging, as it requires searching for optimal specification
and corresponding parameters within a vast combinatorial model space [20, 21]. To address this, we
introduce HDTwinGen, a novel evolutionary framework that autonomously and efficiently designs
HDTwins. At a high level, our method represents hybrid model specifications in code and leverages
large language models (LLMs) for their domain knowledge, contextual understanding, and learning
capabilities to propose symbolically represented models and search the model space [22, 23, 24].
This is coupled with offline optimization tools to empirically estimate model parameters from training
data. More specifically, HDTwinGen utilizes two LLM agents: the modeling agent, whose task is
to generate novel model specifications, and the evaluation agent, which analyzes performance and
provides targeted recommendations for improvement. Through multiple iterations, HDTwinGen
efficiently evolves better performing hybrid models with informed modifications.
Contributions: (1) Conceptually, we present the first work in automated hybrid model design, jointly optimizing model specification and parameterization of hybrid digital twins. (2) Technically, we introduce HDTwinGen, a novel evolutionary framework employing LLMs and offline optimization tools to propose, evaluate, and iteratively enhance hybrid models. (3) Empirically, we demonstrate
that our method learns more accurate DTs, achieving ▶ better out-of-distribution generalization, ▶
sample-efficient learning, and ▶ increased flexibility for modular evolvability.

2 Digital Twins of Dynamical Systems

A dynamical system S := (X, U, Φ) is a tuple of its dX-dimensional state space X ⊆ R^dX, an (optional) dU-dimensional action space U ⊆ R^dU, and a dynamics model Φ. The state at time t ∈ T ⊆ R+ is represented as a vector x(t) ∈ X, and similarly the action taken is represented as a vector u(t) ∈ U. The continuous-time dynamics of the system can be described by dx(t)/dt = Φ(x(t), u(t), t), where Φ : X × U × T → X. We optionally consider the existence of some policy
π : X → P (U) that acts on the system by mapping a state x(t) to a distribution over actions u(t).
Digital Twins. Digital twins (DTs) aim to approximate Φ : X × U × T → X using a computational
model fθ,ω(θ) ∈ F learned from data. Here, we use θ ∈ Θ to denote the specification of the model
(e.g. linear) and ω(θ) ∈ Ω(θ) to indicate the set of parameters specified by θ. Additionally, F, Θ,
and Ω(θ) are the spaces of all possible models, specifications, and parameters, respectively. Next, we
outline the key desiderata for a DT:
[P1] Generalization to unseen state-action distributions. As DTs are required to simulate vary-
ing conditions, they should extrapolate to state-action distributions not observed during training
time. Formally, the generalization error E_{(x(t),u(t),y(t))∼p_OOD}[L(f_{θ,ω(θ)}(x(t), u(t)), y(t))] should be minimized, where L is some loss function and p_OOD represents the out-of-distribution scenario.
[P2] Sample-efficient learning. Given the often limited availability of real-world data, DTs should
learn robustly from minimal empirical data. In other words, they must have good sample complexity,
achieving the desired level of generalization with a limited number of observations [25].
[P3] Evolvability. Dynamical systems are, by nature, non-stationary and evolve over time [26, 27].
From a modeling perspective, the DT should be easily evolved to reflect changing underlying
dynamics, minimizing the need for additional data or expensive model re-development, i.e. θ and
ω(θ) should be easily adjustable to reflect changing system dynamics.
For the purpose of model learning, we assume access to an offline dataset containing N ∈ N+ trajectories, where measurements of the system are made at discrete time points [T] = [t_1, t_2, ..., T]. This dataset, D = {(x^(n)(t), u^(n)(t), y^(n)(t)) | t ∈ [T]}_{n=1}^{N}, contains state-action trajectories sampled regularly over time, where y^(n)(t) = x^(n)(t + ∆t) represents the subsequent state.
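
To make the data setup concrete, the following sketch shows how such an offline dataset could be assembled; `simulate_step` and all dimensions are illustrative assumptions, standing in for whichever real system or simulator generates the trajectories.

```python
import numpy as np

def make_offline_dataset(simulate_step, n_traj=20, horizon=50, dt=1.0,
                         d_x=3, d_u=1, rng=None):
    """Assemble D = {(x^(n)(t), u^(n)(t), y^(n)(t)) | t in [T]} with y = x(t + dt).

    `simulate_step(x, u, dt)` is an assumed black-box step function returning
    the next state as a NumPy array; it stands in for the real system.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    dataset = []
    for _ in range(n_traj):
        x = rng.normal(size=d_x)          # initial state x(t_1)
        trajectory = []
        for _ in range(horizon):
            u = rng.normal(size=d_u)      # action drawn from some policy pi
            y = simulate_step(x, u, dt)   # subsequent state x(t + dt)
            trajectory.append((x.copy(), u.copy(), np.asarray(y).copy()))
            x = np.asarray(y)
        dataset.append(trajectory)
    return dataset
```
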

3 Hybrid Digital Twins

HDTwin. A Hybrid Digital Twin is a composition of mechanistic and neural components, represented
as fθ,ω(θ) = fmech ◦ fneural [18, 28]. This class of hybrid models offers several advantages that align
with our desiderata. The mechanistic component allows partial knowledge to be encoded through
its symbolic form, which, while not sufficient alone to accurately predict underlying dynamics, is
complemented by the neural components that learn from available data. This combination aids
in generalization ([P1]), especially moving beyond conditions observed in training, and improves
sample complexity ([P2]). Furthermore, the mechanistic component can be quickly and easily
updated with new parameters due to its simpler, lower-dimensional structure, allowing the overall
model to adapt efficiently to remain accurate in changing conditions ([P3]). In this work, we focus
on additive compositions, fθ,ω(θ) = fmech + fneural , as they are more interpretable. Additionally, it
enables individual contributions of mechanistic and neural components to be easily disentangled and
simplifies the optimization to allow gradient-based methods [16]. Nonetheless, we encourage future
works to investigate alternative composition strategies (e.g. branching composition) to develop more
advanced HDTwins [29].
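
To ground the additive composition fθ,ω(θ) = fmech + fneural, here is a minimal PyTorch sketch; the logistic-growth prior, parameter names, and network size are illustrative assumptions rather than the paper's generated models.

```python
import torch
import torch.nn as nn

class HybridDynamics(nn.Module):
    """Additive HDTwin: dx/dt = f_mech(x, u; omega_mech) + f_neural(x, u; omega_nn)."""

    def __init__(self, d_x: int, d_u: int, hidden: int = 32):
        super().__init__()
        # Mechanistic parameters (e.g. growth rate r, carrying capacity K) are
        # ordinary learnable scalars with a symbolic, interpretable role.
        self.r = nn.Parameter(torch.tensor(0.1))
        self.K = nn.Parameter(torch.tensor(10.0))
        # Neural residual that models whatever the mechanistic prior leaves out.
        self.residual = nn.Sequential(
            nn.Linear(d_x + d_u, hidden), nn.Tanh(), nn.Linear(hidden, d_x))

    def f_mech(self, x, u):
        # Illustrative logistic-growth prior applied to the first state variable.
        dx = torch.zeros_like(x)
        dx[..., 0] = self.r * x[..., 0] * (1.0 - x[..., 0] / self.K)
        return dx

    def forward(self, x, u):
        # Additive composition: domain prior plus learned residual.
        return self.f_mech(x, u) + self.residual(torch.cat([x, u], dim=-1))
```

Because the two parts are summed, their individual contributions can be inspected separately, which is the disentanglement referred to above.
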
Learning the hybrid model can be decomposed into two steps: (1) model specification, or learning the
structure, θ ∈ Θ, of the dynamics function that describes how the system evolves over time; and (2)
model parameterization, which estimates the specific values of parameters ω(θ) ∈ Ω(θ) for a given
specification θ. For instance, the logistic-growth model specifies a structure for population growth,
while parameterization involves estimating the growth rate and carrying capacity.³ More generally,
this learning problem can be mathematically formulated as a bilevel optimization problem:

θ∗ = arg min_{θ∈Θ} Louter(θ, ω∗(θ)),  where  ω∗(θ) = arg min_{ω∈Ω(θ)} Linner(θ, ω(θ))    (1)

Here, the upper-level problem involves finding the optimal specification θ∗ that minimizes the outer
objective Louter , while the lower-level problem involves finding the optimal parameters ω ∗ (θ) for a
given specification θ that minimizes the inner objective function Linner.
³ This model relates population size N, growth rate r, and carrying capacity K: ∂N/∂t = rN(1 − N/K) [30].

Figure 1: HDTwinGen evolutionary framework. The process begins with user-provided modeling context S^context and D = {Dtrain, Dval}. 1) In iteration g, the modeling agent generates a model specification as a Python program fθ,ω(θ). 2) Parameters are optimized using the offline optimization tool to yield fθ,ω∗(θ). 3) The HDTwin is evaluated based on the model loss υ and the component-wise loss δ. Subsequently, the model pool P^(g) is updated with the top-K models. 4) The evaluation agent provides targeted feedback for model improvement H^(g) by analyzing the models in P^(g) against the performance metrics and requirements outlined in S^context. This iterative loop repeats for G iterations.

To be more concrete, the outer objective Louter measures generalization performance, empirically estimated on the validation set (Lval), while the inner objective Linner measures the fitting error, as evaluated on the training set (Ltrain).
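
As a minimal sketch of how this bilevel structure plays out in practice (the helper callables `build_model` and `fit_parameters` are assumptions, and the model is assumed to map (x, u) to a next-state prediction):

```python
import torch

def validation_loss(model, val_pairs):
    """Outer objective: empirical MSE of a fitted model on D_val."""
    mse = torch.nn.MSELoss()
    with torch.no_grad():
        losses = [mse(model(x, u), y).item() for x, u, y in val_pairs]
    return sum(losses) / len(losses)

def bilevel_search(candidate_specs, build_model, fit_parameters,
                   train_pairs, val_pairs):
    """theta* = argmin_theta L_outer(theta, omega*(theta)): the inner problem
    fits each specification on D_train, the outer problem ranks on D_val."""
    scored = []
    for spec in candidate_specs:
        model = fit_parameters(build_model(spec), train_pairs)  # inner problem
        scored.append((validation_loss(model, val_pairs), model))
    return min(scored, key=lambda pair: pair[0])                # outer problem
```
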
Combinatorial search space. The space of possible specifications Θ (e.g. different networks,
functional forms) is discrete and combinatorially large, while Ω(θ) represents the continuous space
of parameters to be optimized. Selecting the optimal θ, ω(θ) thus involves searching through a vast
combinatorial space. Performing this search through traditional means, such as genetic programming
[21] or evolutionary algorithms [20], is computationally challenging, time-consuming, and often
technically infeasible. To the best of our knowledge, our work is the first to address the problem of
automatic HDTwin development, where we incorporate LLMs (combined with offline optimization
tools) to automatically optimize both the specification and the parameterization of hybrid models.

4 HDTwinGen: Automatic Design of HDTwins


Human experts craft models by making strategic design decisions based on their domain knowledge,
starting with a sensible initial model specification and performing intelligent modifications based
on empirical evaluations. Our key insight is that LLMs can effectively emulate these capabilities
to efficiently navigate the search space in Equation (1) and autonomously design HDTwins. More
specifically, our method utilizes LLMs for three major purposes: ▶ source of domain knowledge,
where LLMs inject domain-consistent knowledge into the model specification, particularly through
the symbolic representation fmech ; ▶ efficient search, by making intelligent modifications to the
specification to converge more efficiently on the optimal hypothesis; and ▶ contextual understand-
ing, enabling the algorithm to incorporate task-specific context and targeted feedback for model
improvement [22, 23, 24].
Overview. We operationalize this insight through HDTwinGen, an evolutionary algorithm that
iteratively evolves a population of candidate solutions to automatically search for the best HDTwin.
Our approach employs a framework comprising three key elements: (1) human experts provide
an initial system description, modeling objectives, and requirements as a structured prompt; (2)
a modeling agent proposes new model specifications, optimizes their parameters on a training
dataset, and collects validation performance metrics; (3) an evaluation agent assesses the proposed
models using both data-driven performance metrics and qualitative evaluations against expert-defined
objectives and requirements. The agents communicate using natural language and a custom code
format representing the HDTwin model, facilitating autonomous and iterative model enhancement.
An overview of our method is presented in Figure 1, with pseudocode in Appendix E.1.
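
A simplified Python reading of the loop in Figure 1 is sketched below; `llm_propose_model`, `llm_give_feedback`, `fit_parameters`, and `evaluate` are placeholder callables standing in for the two LLM agents and the offline tools, not the actual implementation (Appendix E.1 gives the authoritative pseudocode).

```python
def hdtwingen(s_context, d_train, d_val, llm_propose_model, llm_give_feedback,
              fit_parameters, evaluate, generations=15, top_k=5):
    """Sketch of the HDTwinGen evolutionary loop (all callables are assumed)."""
    pool, feedback = [], None          # P^(0) = {}, H^(0) = None
    for g in range(generations):
        # 1) Modeling agent proposes a specification (as code) given the context,
        #    the current top-K pool, and the latest feedback.
        spec = llm_propose_model(s_context, pool, feedback)
        # 2) Offline optimization fits the emitted parameters on D_train.
        model = fit_parameters(spec, d_train)
        # 3) Evaluate: overall and per-component validation losses.
        upsilon, delta = evaluate(model, d_val)
        pool.append((model, delta, upsilon))
        pool = sorted(pool, key=lambda t: t[2])[:top_k]   # keep top-K by val loss
        # 4) Evaluation agent produces natural-language feedback H^(g).
        feedback = llm_give_feedback(s_context, pool)
    return min(pool, key=lambda t: t[2])[0]               # best model in P^(G)
```
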
Initial prompt design. The optimization process begins with a human expert providing a structured
prompt, referred to as the modeling context S context . This modeling context outlines the system
description, modeling objectives L, and requirements R:

1. The system description semantically describes the system, including state and action variables,
giving the algorithm the contextual understanding necessary for informed model development.
2. The modeling objective specifies quantitative performance requirements via a metric L.
3. The modeling requirements R are qualitative and described in natural language, detailing aspects
such as interpretability (e.g. fully mechanistic or hybrid model) and additional scientific knowledge
(e.g. a log-linear relationship between variables).
In practice, R can incorporate various requirements, allowing for the design of both purely mechanis-
tic and hybrid models, a flexibility that we demonstrate experimentally. The model is represented
in Python, where purely mechanistic specifications are represented in native Python and neural
components are represented using PyTorch [31]. Moreover, S context includes a skeleton code
to guide the synthesis of executable code in a predetermined format. For illustrative purposes, an
example of S context is provided in Appendix E.4.
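
For illustration only, a hypothetical modeling context might be organized as follows; the field names, wording, and skeleton are assumptions, and the paper's actual prompt is given in Appendix E.4.

```python
s_context = {
    "system_description": (
        "Lung cancer tumor growth under chemotherapy. States: tumor volume v(t), "
        "chemotherapy drug concentration c(t). Action: chemotherapy dose u(t)."
    ),
    "modeling_objective": (
        "Minimize mean squared error of next-state predictions on a held-out "
        "validation set."
    ),
    "modeling_requirements": [
        "Prefer a hybrid model: interpretable mechanistic terms plus a small "
        "neural residual per component.",
        "Tumor growth is expected to saturate (e.g. logistic-like behaviour).",
    ],
    "skeleton_code": '''
class HDTwin(torch.nn.Module):
    def forward(self, x, u):
        # fill in: one dynamics function per state component
        raise NotImplementedError
''',
}
```
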
Evolutionary optimization overview. Given S context as input, HDTwinGen performs G iterations of
optimization, where G ∈ N+ . The population of proposed HDTwins at iteration g is represented as
P (g) . Each iteration creates a new candidate model based on previously created models in P (g) and
feedback. Only the top K models are retained after each iteration, except when g < K, in which
case all generated models are kept, i.e. max_{g∈[G]} |P^(g)| = K. Each model in P^(g) is characterized
by a tuple containing its model specification (represented symbolically through code) and validation
metrics. After completing G iterations, the model with the best validation performance in P (G) is
selected as the final model.

4.1 Modeling Agent

Proposing HDTwins. The goal of the modeling step is to propose novel HDTwins based on previously
proposed models and feedback from the evaluation agent. Specifically, on the g-th iteration, the
modeling agent takes as input ▶ P (g−1) : the set of top-K previously generated models; ▶ H (g−1) :
the most recent feedback produced by the evaluation agent (where on the initial step, g = 1, both
are empty, i.e., H (0) = ∅, P (0) = ∅); and ▶ S context : the modeling context. The modeling
agent generates a model specification θ using a predefined code format (i.e. skeleton code). By observing multiple previously best-performing models and their performances, the modeling agent can exploit this context as a rich form of in-context learning and evolve improved specifications in subsequent generations [22]. Each generated specification emits its corresponding parameters ω(θ), which are fitted to the training set Dtrain. More formally, we represent this generative procedure as fθ,ω(θ) ∼ LLM_model(H^(g−1), P^(g−1), S^context).
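
In code, this generative step might look roughly like the following; `llm_complete` is a stand-in for whichever chat-completion interface is used, the pool is assumed to store (source, δ, υ) triples, and the prompt wording is an assumption rather than the paper's actual prompt.

```python
def propose_specification(llm_complete, s_context, pool, feedback):
    """f_{theta,omega(theta)} ~ LLM_model(H^(g-1), P^(g-1), S^context)."""
    prior_models = "\n\n".join(
        f"# val loss {upsilon:.4g}, per-component losses {delta}\n{source}"
        for source, delta, upsilon in pool       # top-K models kept as code + metrics
    )
    prompt = (
        f"{s_context['system_description']}\n"
        f"Objective: {s_context['modeling_objective']}\n"
        f"Requirements: {s_context['modeling_requirements']}\n"
        f"Skeleton:\n{s_context['skeleton_code']}\n"
        f"Previously best models:\n{prior_models or 'None yet.'}\n"
        f"Latest feedback:\n{feedback or 'None yet.'}\n"
        "Propose an improved model specification by filling in the skeleton."
    )
    return llm_complete(prompt)   # returns Python source for the new HDTwin
```
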
Model specification. To generate model specifications, the modeling agent decomposes the system
into a set of components, with each component describing the dynamics of a specific state variable. In
other words, for a system with dX state variables, there will be dX components. Each component is
characterized by its own set of inputs and a unique dynamics function that describes the dynamics of
its associated state variable over time. This modular representation enables independent analysis and
optimization of individual components. In cases where R specifies purely mechanistic equations, the
component dynamics are entirely defined using closed-form equations. Conversely, in a hybrid model,
the mechanistic equation can be augmented with a neural network (implemented in PyTorch) to model
residuals (i.e. in an additive fashion). The choice between mechanistic and hybrid models is left to
the user, balancing the trade-off between transparency and predictive performance. Concretely, the
specification step involves ‘filling in’ the skeleton code with a detailed body of code, specifying
the decomposition, and delineating each component’s dynamics function as a separate code structure
(for a generated HDTwin example, see Appendix I).
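
As an illustration of what a filled-in component could look like for a tumor-volume state variable, consider the sketch below; the equation form, parameter names, and network size are assumptions and not the generated model shown in Appendix I.

```python
import torch
import torch.nn as nn

class TumorVolumeComponent(nn.Module):
    """d(volume)/dt = logistic growth - chemo kill term + MLP residual."""

    def __init__(self):
        super().__init__()
        # Placeholder parameter values, later optimized against D_train.
        self.growth_rate = nn.Parameter(torch.tensor(0.1))
        self.carrying_capacity = nn.Parameter(torch.tensor(30.0))
        self.chemo_effect = nn.Parameter(torch.tensor(0.05))
        self.residual = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 1))

    def forward(self, volume, chemo_conc, dose):
        # Mechanistic prior: saturating growth minus a drug-dependent kill term.
        mech = (self.growth_rate * volume * (1 - volume / self.carrying_capacity)
                - self.chemo_effect * chemo_conc * volume)
        inputs = torch.stack([volume, chemo_conc, dose], dim=-1)
        return mech + self.residual(inputs).squeeze(-1)
```
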
Model optimization. The generated specification emits ω(θ), which are treated as placeholder values,
and are then optimized against the training dataset. Specifically, we minimize the mean squared error over the parameters, ω∗(θ) = arg min_{ω(θ)∈Ω(θ)} L(fθ,ω(θ), Dtrain). In this work, we consider ω(θ) to be continuous variables, and as such, we optimize ω(θ) by stochastic gradient descent, using the Adam optimizer [32]. However, we note other optimization algorithms, such as
black-box optimizers, could also be used (for more details, see Appendix F, Equation (5)). The
parameter optimization step then yields the complete model, fθ,ω∗ (θ) .
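
A minimal sketch of this parameter-fitting step follows; for brevity it assumes the model outputs dx/dt and uses a single explicit Euler step to form next-state predictions, which simplifies the rollout and solver details used in practice (Appendix F, Equation (5)).

```python
import torch

def fit_parameters(model, train_pairs, dt=1.0, epochs=300, lr=1e-2):
    """omega*(theta) = argmin_omega MSE on D_train, optimized with Adam.

    `train_pairs` is assumed to be a list of (x_t, u_t, y_t) tensors with
    y_t = x(t + dt); the model is assumed to output dx/dt.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, u, y in train_pairs:          # y = x(t + dt)
            y_pred = x + dt * model(x, u)    # one Euler step through the dynamics
            loss = mse(y_pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```
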
Quantitative evaluation. Each generated model is evaluated quantitatively. Specifically, we collect the validation mean squared error per component, which we denote as δ = [δ1, δ2, ..., δdX] (Appendix F, Equation (6)). We also compute the validation loss of the overall model, υ = L(fθ,ω∗(θ), Dval). Finally, the generated model and its validation losses
are included in a tuple and added to the top-K models P (g) ← P (g−1) ⊕ (fθ,ω∗ (θ) , δ, υ), where P (g)
automatically removes the lowest performing models, and also only adds a new model to P (g) if it is
unique. We highlight that we consider the top-K models only to apply selection pressure, such that
only the best-performing models are considered when generating the next HDTwin [33].
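
One way to compute these quantities and maintain the pool is sketched below; the uniqueness check, averaging, and the assumption that the model maps (x, u) to a next-state prediction are simplifications for illustration.

```python
import torch

def component_losses(model, val_pairs, d_x):
    """delta = [delta_1, ..., delta_dX]: per-component validation MSE,
    plus the overall validation loss upsilon."""
    sq_err, count = torch.zeros(d_x), 0
    with torch.no_grad():
        for x, u, y in val_pairs:
            err = (model(x, u) - y) ** 2                 # shape (..., d_x)
            sq_err += err.reshape(-1, d_x).sum(dim=0)
            count += err.reshape(-1, d_x).shape[0]
    delta = sq_err / count
    return delta.tolist(), delta.mean().item()           # (delta, upsilon)

def update_pool(pool, candidate, top_k=5):
    """P^(g) <- top-K of P^(g-1) plus the new (model, delta, upsilon) tuple,
    skipping duplicates (here crudely proxied by identical validation loss)."""
    _, _, upsilon = candidate
    if all(abs(upsilon - u) > 1e-12 for _, _, u in pool):
        pool = pool + [candidate]
    return sorted(pool, key=lambda t: t[2])[:top_k]
```
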

4.1.1 Evaluation Agent


Model evaluation. The goal of the evaluation step is to reflect on the current set of top-K models P^(g) against the requirements R and provide actionable, detailed feedback to the modeling agent for model
improvement: H (g) ∼ LLMeval (R, P (g) ). We note that H (g) is provided in natural language and can
be viewed as a dense feedback signal, a notable distinction from traditional learning methods, where
feedback often takes the form of simple scalar values, such as loss gradients or rewards. Leveraging
natural language feedback allows the agent to (1) engage in comparative analysis, identifying effective
specifications in P (g) contributing to higher performance and discerning patterns common in less
effective models, informing its suggestions for further model improvement; (2) qualitatively evaluate
models against qualitative requirements R—leveraging the LLM’s capacity to reason about proposed
HDTwins to reflect these requirements via model improvement feedback.
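
A sketch of how the evaluation agent's call might be issued; `llm_complete` and the prompt wording are assumptions about the interface rather than the paper's actual prompts (Appendix J shows real feedback examples).

```python
def generate_feedback(llm_complete, requirements, pool):
    """H^(g) ~ LLM_eval(R, P^(g)): natural-language critique of the top-K models."""
    summary = "\n\n".join(
        f"Model {i} (val loss {upsilon:.4g}, per-component losses {delta}):\n{source}"
        for i, (source, delta, upsilon) in enumerate(pool)
    )
    prompt = (
        f"Requirements:\n{requirements}\n\n"
        f"Current top-K models:\n{summary}\n\n"
        "Compare these models, identify which components fit poorly, and give "
        "specific, actionable suggestions (e.g. alternative decompositions, "
        "parameters to remove, non-linear terms to add) for the next model."
    )
    return llm_complete(prompt)
```
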
Enhancing search. By providing rich feedback to improve model specification, the evaluation and modeling agents collaborate to efficiently evolve high-performing models. Empirically, in Appendix J,
we observe that the evaluation agent provides targeted and specific feedback, including component-
specific suggestions, proposing alternative decompositions, removing parameters, or introducing
non-linear terms. It is noteworthy that the feedback H (g) , expressed flexibly in natural language,
could easily be further enriched through direct human feedback. We demonstrate this human-in-the-loop capability by including expert feedback during the optimization process through H^(g), observing that it was integrated into newly generated HDTwins. Though further investigation is
beyond the scope of this work, this demonstration highlights promising avenues for augmenting
human-machine collaboration in the autonomous design of DTs.

5 Related Works
For an extended related work, refer to Appendix B. Our work focuses on autonomously learning DTs
from data, with several relevant research strands:
Neural sequence models. ML approaches commonly address learning system dynamics as a se-
quential modeling problem. In these settings, fθ,ω(θ) are typically black-box models, where θ ∈ Θ
is the NN architecture and ω(θ) are its weights. Early models like Hidden Markov Models [34]
and Kalman filters [35] made simplifying Markovian and linearity assumptions, later extended to
nonlinear settings [36, 37]. Subsequent models, including recurrent neural networks [38], along
with their advanced variants [39, 40, 41], introduced the capability to model longer-term depen-
dencies. More recent advancements include attention mechanisms [42] and Transformer models
[43], significantly improving the handling of long-term dependencies in sequence data. Another line
of work, Neural Ordinary Differential Equations (NODE) [14, 44, 45], interprets neural network
operations as differential equations. These methods have found utility in modeling a range of complex
systems [46, 47, 48, 49]. While deep sequence models are proficient at capturing complex dynamics,
they are heavily reliant on training data for generalization ([P1, P2]), and their monolithic and
overparameterized structures limit evolvability ([P3]).
Mechanistic (discovery) models. Beyond purely neural approaches, another line of work aims to
discover a system’s governing equations directly from data. Here θ ∈ Θ are closed-form equations
and ω(θ) are their parameters. These include symbolic regression techniques [10], Eureqa [50],
SINDy [11], D-CODE [51, 52], among others [52, 53] that search for θ and ω(θ) from data. These
techniques struggle to scale to higher-dimensional settings and rely on experts to perform variable
selection and define the function set and primitives available to the search algorithms.
Hybrid models. Recent efforts have also created hybrid models by integrating physical laws with
neural models. Physics-informed neural networks [15, 54], and methods including Hamiltonian
Neural Networks [55], Lagrangian Neural Networks [56] integrate structural priors of physical

systems to improve generalization. These techniques introduce specialized mechanisms to incorporate
precisely known physical principles. Additionally, [57] integrates prior ODE/PDE knowledge into
a hybrid model, using specialized regularization to penalize the neural component’s information
content. [58, 59] consider settings where an expert equation is known, but equation variables are
latent and unobserved. Correspondingly, they employ two sets of latent variables: one governed
by expert equations and another linked to neural components. [60] performs data augmentation by
sampling out-of-distribution trajectories from expert models. While existing approaches rely on
expert models to perform the hybrid model design, HDTwinGen is an automated approach to jointly
optimize hybrid model specification and its parameters.

6 Experiments and Evaluation


In this section, we evaluate HDTwinGen and verify that it significantly outperforms state-of-the-art
methods in modeling system dynamics over time from an observed dataset and corresponding system
description.⁴
Benchmark datasets. We evaluate on six complex-system datasets, each of which is either a real-world dataset or sampled from an accurate simulator designed by human experts. Three are derived from a state-of-the-art biomedical Pharmacokinetic-Pharmacodynamic
(PKPD) model of lung cancer tumor growth, used to simulate the combined effects of chemotherapy
and radiotherapy in lung cancer [61] (Equation (2))—this has been extensively used by other works
[62, 63, 64]. Here we use this bio-mathematical lung cancer model to create three variations of
lung cancer under the effect of no treatments (Lung Cancer), chemotherapy only (Lung Cancer
(with Chemo.)), and chemotherapy combined with radiotherapy (Lung Cancer (with Chemo. &
Radio.)). We also compare against an accurate and complex COVID-19 epidemic agent-based
simulator (COVID-19) [65], which is capable of modeling non-pharmaceutical interventions, such
as physical distancing during a lockdown. Furthermore, we compare against an ecological model
of a microcosm of algae, flagellate, and rotifer populations (Plankton Microcosm)—replicating
an experimental three-species prey-predator system [66]. Moreover, we also compare against a
real-world dataset of hare and lynx populations (Hare-Lynx), replicating predator-prey dynamics
[67]. We detail all benchmark datasets details in Appendix C.
Evaluation Metrics. We employ mean squared error (MSE) to evaluate the benchmark methods on a
held-out test dataset of state-action trajectories, denoted as Dtest , using the loss defined in Equation (5)
and report this as TMSE. Each metric is averaged over ten runs with different random seeds, and we
present these averages along with their 95% confidence intervals, further detailed in Appendix G.
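
For reference, the reported statistic can be computed as a mean over seeds with a normal-approximation 95% confidence interval; the exact interval construction used by the paper is specified in Appendix G and may differ from this sketch.

```python
import numpy as np

def mean_and_ci95(per_seed_mse):
    """Average T_MSE over random seeds with a +/- 95% CI (normal approximation)."""
    values = np.asarray(per_seed_mse, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(len(values))   # standard error of the mean
    return mean, 1.96 * sem

# Example: mean_and_ci95([4.2, 4.5, 4.3, 4.6, 4.4, 4.1, 4.5, 4.4, 4.3, 4.7])
```
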
Benchmark methods. To assess whether HDTwinGen is state-of-the-art, we compare it with the
most competitive and popular neural network models which, when modeling the dynamics of a system over time, become a form of ODE model, namely a neural ODE [14] with action inputs (DyNODE) [68]. Moreover, we also compare against a recurrent neural network (RNN) [69] and a
state-of-the-art transformer (Transformer) [64]. We also compare against mechanistic dynamical
equations derived from equation discovery methods for ODEs, including Genetic Programming (GP)
[10] and Sparse Identification of Nonlinear Dynamics (SINDy) [11]. Lastly, we compare against
a hybrid model (APHYNITY) that integrates prior knowledge in the form of ODEs into hybrid
models, while penalizing the information content from the neural component [57]. Moreover, we
compare against ablations of our method: the zero-shot generated HDTwin (ZeroShot) and the same model with subsequently optimized parameters (ZeroOptim). We provide method implementation,
hyperparameter, and experimental details in Appendix D.

7 Main Results
We evaluated all benchmark methods across all datasets; the results are tabulated in Table 1. HDTwinGen
models the system the most accurately, achieving the lowest test prediction mean squared error on
the held-out test dataset of state-action trajectories. In the interest of space, we include additional
experimental evaluations in the appendix. Specifically, we also evaluate ▶ HDTwinGen performance
on a suite of synthetically and procedurally generated benchmarks (Appendix H.9); ▶ comparisons against domain-specific baselines (Appendix H.8); and ▶ various ablation experiments, including ablations of LLM hyperparameters, prompt design, and algorithm settings (Appendices H.5 to H.7).
⁴ Our implementation is available at [Link]. We also provide a wider lab code repository at [Link]

Table 1: Benchmark method performance. Reporting the test prediction MSE (TMSE ↓) of the produced system models on held-out test datasets across all benchmark datasets. HDTwinGen achieves the lowest test prediction error. The results are averaged over ten random seeds, with ± indicating 95% confidence intervals.

Method        Lung Cancer          Lung Cancer (with Chemo.)   Lung Cancer (with Chemo. & Radio.)   Hare-Lynx            Plankton Microcosm   COVID-19
SINDy         327±5.79             11.8±0.395                  13.7±0.573                           388±4.29e-14         0.00135±0            93.4±0.458
GP            158±94.1             154±505                     171±8.99                             514±381              0.00474±0.0564       10.1±18
DyNODE        327±5.8              52±47.1                     16.3±5.58                            439±0                0.00036±0.00078      74±2.36
RNN           1.17e+06±3.08e+04    708±86.1                    136±5.6                              3.71e+03±3.39e+03    0.0281±0.0406        1.38e+04±1.65e+03
Transformer   7.48±1.06            0.348±0.0618                0.216±0.0345                         716±42.5             3.69e-05±1.83e-05    0.309±0.222
APHYNITY      9.06±1.37            81.6±81.3                   1.21e+03±1.69e+03                    321±12.6             4.21e-05±3.45e-05    88.8±9.97
ZeroShot      5.45e+03±6.71e+03    292±80.2                    5.81e+03±4.02e+03                    338±0                0.325±0.242          2.31e+03±2.24e+03
ZeroOptim     216±172              31.2±45                     6.08±7.9                             353±0                0.0132±0.00116       7.88±0.0414
HDTwinGen     4.41±8.07            0.0889±0.0453               0.131±0.198                          291±30.3             2.51e-06±2.2e-06     1.72±2.28


Table 2: Out-of-distribution shifts. On a variation of the Lung Cancer (with Chemo. & Radio.) dataset, HDTwinGen is more robust to OOD shifts in unseen state-action distributions.

              Lung Cancer (with Chemo. & Radio.)
Method        IID TMSE ↓             OOD TMSE ↓
DyNODE        0.0115±0.0121          1.75±0.769
SINDy         0.302±0.286            5.9±2.55
RNN           1.43e+04±2.02e+03      1.84e+05±4.06e+04
Transformer   0.0262±0.00514         1.19e+04±2.78e+03
ZeroShot      4.95e+03±1.43e+04      1.91e+04±6.36e+04
ZeroOptim     3.49±0.0364            4.84±5.17
HDTwinGen     0.00872±0.0187         0.0846±0.0891

Figure 2: Sample efficiency. Analyzing performance as a function of the number of training trajectories in the Lung Cancer (with Chemo. & Radio.) dataset. We observe that HDTwinGen achieves the lowest test prediction error, even in the very challenging low-data regime. This highlights the role of priors embedded in HDTwin in sample-efficient generalization.


7.1 Insight Experiments

This section provides an in-depth analysis of HDTwinGen’s effectiveness related to its benchmark
counterparts. Specifically, we examine the core desiderata for an effective DT described in Section 2:
[P1] out-of-distribution generalization, [P2] sample-efficient learning, and [P3] evolvability.
[P1] Can an HDTwin generalize to out-of-distribution shifts? To explore out-of-distribution shifts,
we adapt the Lung Cancer (with Chemo. & Radio.) dataset to produce a training dataset whose state range lies outside that observed in the test set across all trajectories (Appendix H.1). The results are tabulated in Table 2. Empirically, we find that HDTwinGen is more robust to out-of-distribution shifts than
existing methods, benefiting from explicit decomposition and robust hybrid models. Notably, the
neural network method DyNODE shows the largest relative error increase from IID to OOD by two
orders of magnitude, while the mechanistic method SINDy exhibits a smaller increase by only one
order of magnitude. This demonstrates the importance of hybrid models that leverage both neural
and mechanistic components to enhance generalization performance under distribution shifts.
[P2] Can HDTwinGen improve sample-efficiency in model learning? To explore the low data
settings, we re-ran all benchmark methods with fewer samples in their training dataset on the Lung
Cancer (with Chemo. & Radio.) dataset. We plot this in Figure 2. Empirically, we observe that
HDTwinGen achieves lower prediction errors, especially in low-sample regimes.
[P3] Can HDTwinGen evolve its modular HDTwin to fit the system? We analyze this from an
empirical point of view to determine if HDTwinGen can correctly evolve the generated HDTwin
and reduce its prediction error over subsequent generations. We observe that HDTwinGen can
indeed understand, reason, and iteratively evolve the generated code representation of the HDTwin
to incorporate a better fitting HDTwin, as observed in Figure 3. In particular, the annotated results
demonstrate that HDTwinGen effectively refines the hybrid model by strategically adjusting its neural
and mechanistic components (in a fashion akin to human experts), leading to significant improvements in accuracy and robustness. This iterative evolution process demonstrates HDTwinGen's ability to adapt and optimize its modular components.
[Figure 3 plot: validation MSE versus number of generations, showing the Top-1 HDTwin Pareto front and the HDTwin generated at each step. Annotated models include: an initial white-box linear model; an enhanced white-box model with logistic growth for tumor volume, an interaction term for combined therapy effect, and bi-exponential decay for chemotherapy drug concentration; white- and black-box models with logistic growth for tumor volume, Michaelis-Menten saturation for chemotherapy effect, a sigmoid function for radiotherapy effect, resistance development, and a (simplified) black-box MLP for residuals; and a white- and black-box model with dynamic carrying capacity and decay rate, a non-linear interaction term, and an updated black-box MLP for residuals.]
Figure 3: HDTwinGen effectively evolves HDTwin. Validation MSE of the HDTwin generated in each iteration, showing the Pareto-front of the best generated HDTwin (Top-1 HDTwin) and the generated HDTwin per generation step, additionally with a few of the HDTwins labeled with their model descriptions. HDTwinGen can efficiently understand, modify, and hence evolve the HDTwin to achieve a better-fitting model (Appendix H.4).

[Figure 4 plot: exposed individuals over time t (days), showing the observed and trained-on window, a sudden lockdown intervention, and the unobserved window to predict; curves for the true system, HDTwin (with intervention), SINDy, and DyNODE.]

Figure 4: COVID-19 unobserved intervention. The symbolic code-based representation of HDTwin can be easily adapted to unobserved interventions through targeted adjustments of parameters.

Can HDTwinGen Understand and Modify Its HDTwin? We investigate whether large language
model (LLM) agents can take an optimized hybrid digital twin (HDTwin) from an existing
benchmark dataset and adapt it to model an unobserved intervention that is not present in the training
data. We note that this intervention emulates scenarios where the dynamics of the underlying system
change. We answer this question affirmatively by constructing a scenario where our COVID-19
simulator incorporates an unobserved intervention of a lockdown policy, which reduces physical
interactions between individuals (Appendix H.2). As demonstrated in Figure 4, we observe that
the code-model representation of the HDTwin can be (1) understood by the modeling agent LLM
and (2) adapted in its parameters to accurately model and reflect this intervention. We find that
HDTwinGen is the only method capable of changing the overall functional behavior by modifying a
single parameter in the model; in contrast, all other existing data-driven methods require a dataset of
state-action trajectories under the new dynamics introduced by this intervention.
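
Conceptually, such an adaptation amounts to a single targeted edit of a symbolic parameter in the code-represented twin. The toy class below is purely illustrative; the parameter names, values, and mass-action form are assumptions and not the benchmark's actual COVID-19 model.

```python
# Hypothetical code-represented twin exposing a symbolic contact-rate parameter.
class SEIRHDTwin:
    def __init__(self, contact_rate=10.0, infection_prob=0.02):
        self.contact_rate = contact_rate      # physical interactions per day
        self.infection_prob = infection_prob

    def exposure_rate(self, susceptible, infectious, population):
        # New exposures per day under a simple mass-action assumption.
        return (self.contact_rate * self.infection_prob
                * susceptible * infectious / population)

# A lockdown that reduces physical interactions can be reflected by a single
# targeted parameter edit, without any new trajectory data.
twin = SEIRHDTwin()
twin.contact_rate *= 0.3
```
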
Ablation Studies. We conducted ablation studies on HDTwinGen and found several key insights.
First, retaining the top-K models within the LLM context leads to improved model generation (Ap-
pendix H.5). Additionally, HDTwinGen is compatible with various LLMs and different temperature
settings (Appendix H.6). It also benefits from including textual descriptions of the variables to be
modeled as prior information (Appendix H.7). Finally, HDTwinGen can be specifically instructed to
generate mechanistic white-box models if desired (Appendix H.10).

8 Limitations and Discussions
In summary, this work addresses the problem of learning digital twins for continuous-time dynamical
systems. After establishing clear learning objectives and key requirements, we introduce Hybrid
Digital Twins (HDTwins)—a promising approach that combines mechanistic understanding with
neural architectures. HDTwins encode domain knowledge symbolically while leveraging neural
networks for enhanced expressiveness. Conventional hybrid models, however, rely heavily on
expert specification with learning limited to parameter optimization, constraining their scalability
and applicability. To overcome these limitations, we propose a novel approach to automatically
specify and parameterize HDTwins through HDTwinGen, an evolutionary framework that leverages
LLMs to iteratively search for and optimize high-performing hybrid twins. Our empirical results
demonstrate that evolved HDTwins consistently outperform existing approaches across multiple
criteria, exhibiting superior out-of-distribution generalization, enhanced sample efficiency, and
improved modular evolvability.
Limitations. While our results are promising, several important limitations remain. HDTwinGen’s
efficacy depends critically on human experts providing initial system specifications and on the
underlying LLM’s domain knowledge and model generation capabilities. Our current implementation
focuses exclusively on continuous-time systems, which, although broadly applicable, represent only
a subset of real-world systems. Future work could extend our approach through human-in-the-loop
feedback mechanisms, integration with external tools, and expansion to broader system classes.
Ethical implications. We acknowledge the risk of bias transmission from the black-box LLMs
into the evolved models. While our hybrid approach enables greater expert scrutiny through its
human-interpretable components, we strongly recommend a comprehensive evaluation of evolved
models for fairness, bias, and privacy concerns before deployment in sensitive applications.

Acknowledgments and Disclosure of Funding


We thank the anonymous reviewers, area and program chairs, members of the van der Schaar lab,
and Andrew Rashbass for many insightful comments and suggestions. TL and SH would like to
acknowledge and thank AstraZeneca for their sponsorship and support. This work was supported by
Microsoft’s Accelerate Foundation Models Academic Research initiative.

References
[1] F. Tao, J. Cheng, Q. Qi, M. Zhang, H. Zhang, and F. Sui, “Digital twin-driven product design,
manufacturing and service with big data,” The International Journal of Advanced Manufacturing
Technology, vol. 94, pp. 3563–3576, 2018.
[2] J. Corral-Acero, F. Margara, M. Marciniak, C. Rodero, F. Loncaric, Y. Feng, A. Gilbert, J. F.
Fernandes, H. A. Bukhari, A. Wajdan et al., “The ‘digital twin’ to enable the vision of precision cardiology,” European Heart Journal, vol. 41, no. 48, pp. 4556–4564, 2020.
[3] H. A. Simon, The sciences of the artificial. MIT press, 1996.
[4] J. Ladyman, J. Lambert, and K. Wiesner, “What is a complex system?” European Journal for
Philosophy of Science, vol. 3, pp. 33–67, 2013.
[5] Q. Qi and F. Tao, “Digital twin and big data towards smart manufacturing and industry 4.0: 360
degree comparison,” IEEE Access, vol. 6, pp. 3585–3593, 2018.
[6] V. Iranzo and S. Pérez-González, “Epidemiological models and covid-19: a comparative view,”
History and Philosophy of the Life Sciences, vol. 43, no. 3, p. 104, 2021.
[7] I. Bozic, J. G. Reiter, B. Allen, T. Antal, K. Chatterjee, P. Shah, Y. S. Moon, A. Yaqubie,
N. Kelly, D. T. Le et al., “Evolutionary dynamics of cancer in response to targeted combination
therapy,” eLife, vol. 2, p. e00747, 2013.
[8] R. Rosen, G. Von Wichert, G. Lo, and K. D. Bettenhausen, “About the importance of autonomy
and digital twins for the future of manufacturing,” Ifac-papersonline, vol. 48, no. 3, pp. 567–572,
2015.

[9] T. Erol, A. F. Mendi, and D. Doğan, “The digital twin revolution in healthcare,” in 2020 4th
international symposium on multidisciplinary studies and innovative technologies (ISMSIT).
IEEE, 2020, pp. 1–7.
[10] J. R. Koza, “Genetic programming as a means for programming computers by natural selection,”
Statistics and computing, vol. 4, pp. 87–112, 1994.
[11] S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Discovering governing equations from data by
sparse identification of nonlinear dynamical systems,” Proceedings of the national academy of
sciences, vol. 113, no. 15, pp. 3932–3937, 2016.
[12] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” Advances in
neural information processing systems, vol. 31, 2018.
[13] J. Yoon, D. Jarrett, and M. Van der Schaar, “Time-series generative adversarial networks,”
Advances in neural information processing systems, vol. 32, 2019.
[14] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential
equations,” Advances in neural information processing systems, vol. 31, 2018.
[15] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks: A deep
learning framework for solving forward and inverse problems involving nonlinear partial
differential equations,” Journal of Computational physics, vol. 378, pp. 686–707, 2019.
[16] L. Faure, B. Mollet, W. Liebermeister, and J.-L. Faulon, “A neural-mechanistic hybrid approach
improving the predictive power of genome-scale metabolic models,” Nature Communications,
vol. 14, no. 1, p. 4669, 2023.
[17] J. Pinto, J. R. Ramos, R. S. Costa, and R. Oliveira, “A general hybrid modeling framework for
systems biology applications: Combining mechanistic knowledge with deep neural networks
under the sbml standard,” AI, vol. 4, no. 1, pp. 303–318, 2023.
[18] P. Wang, Z. Zhu, W. Liang, L. Liao, and J. Wan, “Hybrid mechanistic and neural network
modeling of nuclear reactors,” Energy, vol. 282, p. 128931, 2023.
[19] R. Cheng, A. Verma, G. Orosz, S. Chaudhuri, Y. Yue, and J. Burdick, “Control regularization for
reduced variance reinforcement learning,” in International Conference on Machine Learning.
PMLR, 2019, pp. 1141–1150.
[20] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin,
“Large-scale evolution of image classifiers,” in International conference on machine learning.
PMLR, 2017, pp. 2902–2911.
[21] T. N. Mundhenk, M. Landajuela, R. Glatt, C. P. Santiago, D. M. Faissol, and B. K. Petersen,
“Symbolic regression via neural-guided genetic programming population seeding,” in Proceed-
ings of the 35th International Conference on Neural Information Processing Systems, 2021, pp.
24912–24923.
[22] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural
information processing systems, vol. 33, pp. 1877–1901, 2020.
[23] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-
thought prompting elicits reasoning in large language models,” Advances in neural information
processing systems, vol. 35, pp. 24 824–24 837, 2022.
[24] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W.
Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,”
Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[25] M. J. Kearns and U. Vazirani, An introduction to computational learning theory. MIT press,
1994.
[26] H. A. Simon, “The architecture of complexity,” Proceedings of the American philosophical
society, vol. 106, no. 6, pp. 467–482, 1962.

[27] T. L. Rogers, B. J. Johnson, and S. B. Munch, “Chaos is not rare in natural ecosystems,” Nature
Ecology & Evolution, vol. 6, no. 8, pp. 1105–1111, 2022.

[28] M. Sokolov, M. von Stosch, H. Narayanan, F. Feidl, and A. Butté, “Hybrid modeling—a
key enabler towards realizing digital twins in biopharma?” Current Opinion in Chemical
Engineering, vol. 34, p. 100715, 2021.

[29] S. Chaudhuri, K. Ellis, O. Polozov, R. Singh, A. Solar-Lezama, Y. Yue et al., “Neurosymbolic programming,” Foundations and Trends® in Programming Languages, vol. 7, no. 3, pp. 158–243, 2021.

[30] A. Tsoularis and J. Wallace, “Analysis of logistic growth models,” Mathematical biosciences,
vol. 179, no. 1, pp. 21–55, 2002.

[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.

[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.

[33] J. H. Holland, “Genetic algorithms,” Scientific american, vol. 267, no. 1, pp. 66–73, 1992.

[34] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech
recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[35] R. Kalman, “A new approach to linear filtering and prediction problems,” Trans. ASME, D,
vol. 82, pp. 35–44, 1960.

[36] L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, E. Valentin, and H. Sahli, “Hybrid
deep neural network–hidden markov model (dnn-hmm) based speech emotion recognition,”
in 2013 Humaine association conference on affective computing and intelligent interaction.
IEEE, 2013, pp. 312–317.

[37] R. G. Krishnan, U. Shalit, and D. Sontag, “Deep kalman filters,” arXiv preprint
arXiv:1511.05121, 2015.

[38] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990.

[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9,
no. 8, pp. 1735–1780, 1997.

[40] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[41] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,”
Advances in neural information processing systems, vol. 27, 2014.

[42] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align
and translate,” arXiv preprint arXiv:1409.0473, 2014.

[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.

[44] E. Dupont, A. Doucet, and Y. W. Teh, “Augmented neural odes,” Advances in neural information
processing systems, vol. 32, 2019.

[45] S. I. Holt, Z. Qian, and M. van der Schaar, “Neural laplace: Learning diverse classes of
differential equations in the laplace domain,” in International Conference on Machine Learning.
PMLR, 2022, pp. 8811–8832.

[46] M. A. Zaytar and C. El Amrani, “Sequence to sequence weather forecasting with long short-term
memory recurrent neural networks,” International Journal of Computer Applications, vol. 143,
no. 11, pp. 7–11, 2016.
[47] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional
transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[48] L. Sehovac and K. Grolinger, “Deep learning for load forecasting: Sequence to sequence
recurrent neural networks with attention,” Ieee Access, vol. 8, pp. 36 411–36 426, 2020.
[49] S. Holt, A. Hüyük, Z. Qian, H. Sun, and M. van der Schaar, “Neural laplace control for
continuous-time delayed systems,” in International Conference on Artificial Intelligence and
Statistics. PMLR, 2023, pp. 1747–1778.
[50] M. Schmidt and H. Lipson, “Distilling free-form natural laws from experimental data,” science,
vol. 324, no. 5923, pp. 81–85, 2009.
[51] Z. Qian, K. Kacprzyk, and M. van der Schaar, “D-CODE: Discovering closed-form ODEs
from observed trajectories,” in International Conference on Learning Representations, 2022.
[Online]. Available: [Link]
[52] K. Kacprzyk, Z. Qian, and M. van der Schaar, “D-cipher: discovery of closed-form partial
differential equations,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[53] K. Kacprzyk, T. Liu, and M. van der Schaar, “Towards transparent time series forecasting,” in
The Twelfth International Conference on Learning Representations, 2024. [Online]. Available:
[Link]
[54] S. Cuomo, V. S. Di Cola, F. Giampaolo, G. Rozza, M. Raissi, and F. Piccialli, “Scientific
machine learning through physics–informed neural networks: Where we are and what’s next,”
Journal of Scientific Computing, vol. 92, no. 3, p. 88, 2022.
[55] S. Greydanus, M. Dzamba, and J. Yosinski, “Hamiltonian neural networks,” Advances in neural
information processing systems, vol. 32, 2019.
[56] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho, “Lagrangian neural
networks,” arXiv preprint arXiv:2003.04630, 2020.
[57] Y. Yin, V. Le Guen, J. Dona, E. de Bézenac, I. Ayed, N. Thome, and P. Gallinari, “Augmenting
physical models with deep networks for complex dynamics forecasting,” Journal of Statistical
Mechanics: Theory and Experiment, vol. 2021, no. 12, p. 124012, 2021.
[58] N. Takeishi and A. Kalousis, “Physics-integrated variational autoencoders for robust and
interpretable generative modeling,” Advances in Neural Information Processing Systems, vol. 34,
pp. 14 809–14 821, 2021.
[59] Z. Qian, W. Zame, L. Fleuren, P. Elbers, and M. van der Schaar, “Integrating expert odes
into neural odes: pharmacology and disease progression,” Advances in Neural Information
Processing Systems, vol. 34, pp. 11 364–11 383, 2021.
[60] A. Wehenkel, J. Behrmann, H. Hsu, G. Sapiro, G. Louppe, and J.-H. Jacobsen, “Robust
hybrid learning with expert augmentation,” Transactions on Machine Learning Research, 2023.
[Online]. Available: [Link]
[61] C. Geng, H. Paganetti, and C. Grassberger, “Prediction of Treatment Response for Com-
bined Chemo- and Radiation Therapy for Non-Small Cell Lung Cancer Patients Using a
Bio-Mathematical Model,” Scientific Reports, vol. 7, no. 1, p. 13542, Oct. 2017.
[62] I. Bica, A. M. Alaa, J. Jordon, and M. van der Schaar, “Estimating counterfactual treatment
outcomes over time through adversarially balanced representations,” in International Conference
on Learning Representations, 2020.
[63] N. Seedat, F. Imrie, A. Bellot, Z. Qian, and M. van der Schaar, “Continuous-time modeling
of counterfactual outcomes using neural controlled differential equations,” arXiv preprint
arXiv:2206.08311, 2022.

[64] V. Melnychuk, D. Frauen, and S. Feuerriegel, “Causal transformer for estimating counterfactual
outcomes,” in International Conference on Machine Learning. PMLR, 2022, pp. 15 293–
15 329.
[65] C. C. Kerr, R. M. Stuart, D. Mistry, R. G. Abeysuriya, K. Rosenfeld, G. R. Hart, R. C. Núñez,
J. A. Cohen, P. Selvaraj, B. Hagedorn et al., “Covasim: an agent-based model of covid-19
dynamics and interventions,” PLOS Computational Biology, vol. 17, no. 7, p. e1009149, 2021.
[66] T. Hiltunen, L. Jones, S. Ellner, and N. G. Hairston Jr, “Temporal dynamics of a simple
community with intraguild predation: an experimental test,” Ecology, vol. 94, no. 4, pp. 773–
779, 2013.
[67] E. P. Odum and G. W. Barrett, “Fundamentals of ecology,” The Journal of Wildlife Management,
vol. 36, no. 4, p. 1372, 1972.
[68] V. M. M. Alvarez, R. Roşca, and C. G. Fălcuţescu, “Dynode: Neural ordinary differential
equations for dynamics modeling in continuous control,” arXiv preprint arXiv:2009.04278,
2020.
[69] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-
propagating errors,” nature, vol. 323, no. 6088, pp. 533–536, 1986.
[70] S. Holt, A. Hüyük, and M. van der Schaar, “Active observing in continuous-time control,”
Advances in Neural Information Processing Systems, vol. 36, 2024.
[71] S. Holt, Z. Qian, and M. van der Schaar, “Deep generative symbolic regression,” in The
Eleventh International Conference on Learning Representations, 2023. [Online]. Available:
[Link]
[72] K. Kacprzyk, S. Holt, J. Berrevoets, Z. Qian, and M. van der Schaar, “ODE
discovery for longitudinal heterogeneous treatment effects inference,” in The Twelfth
International Conference on Learning Representations, 2024. [Online]. Available:
[Link]
[73] T. Liu, N. Astorga, N. Seedat, and M. van der Schaar, “Large language models to enhance
bayesian optimization,” in The Twelfth International Conference on Learning Representations,
2024. [Online]. Available: [Link]
[74] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen, “Large language models
as optimizers,” in The Twelfth International Conference on Learning Representations, 2024.
[Online]. Available: [Link]
[75] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda,
N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv
preprint arXiv:2107.03374, 2021.
[76] S. Holt, M. R. Luyten, and M. van der Schaar, “L2MAC: Large language model automatic
computer for extensive code generation,” in The Twelfth International Conference on Learning
Representations, 2024. [Online]. Available: [Link]
[77] S. Holt, Z. Qian, T. Liu, J. Weatherall, and M. van der Schaar, “Data-driven discovery of
dynamical systems in pharmacology using large language models,” in The Thirty-eighth Annual
Conference on Neural Information Processing Systems, 2024.
[78] N. Astorga, T. Liu, N. Seedat, and M. van der Schaar, “Partially observable cost-aware active-
learning with large language models,” in The Thirty-Eighth Annual Conference on Neural
Information Processing Systems, 2024.
[79] W. Bonnaffé and T. Coulson, “Fast fitting of neural ordinary differential equations by bayesian
neural gradient matching to infer ecological interactions from time-series data,” Methods in
Ecology and Evolution, vol. 14, no. 6, pp. 1543–1563, 2023.
[80] S. K. Kumar, “On weight initialization in deep neural networks,” arXiv preprint
arXiv:1704.08863, 2017.

[81] A. Graves, S. Fernández, and J. Schmidhuber, “Multi-dimensional recurrent neural networks,”
in International conference on artificial neural networks. Springer, 2007, pp. 549–558.
[82] B. K. Petersen, M. L. Larma, T. N. Mundhenk, C. P. Santiago, S. K. Kim, and J. T. Kim, “Deep
symbolic regression: Recovering mathematical expressions from data via risk-seeking policy
gradients,” in International Conference on Learning Representations, 2020.
[83] S. Hsiang, D. Allen, S. Annan-Phan, K. Bell, I. Bolliger, T. Chong, H. Druckenmiller, L. Y.
Huang, A. Hultgren, E. Krasovich et al., “The effect of large-scale anti-contagion policies on
the covid-19 pandemic,” Nature, vol. 584, no. 7820, pp. 262–267, 2020.
[84] O. N. Bjørnstad, K. Shea, M. Krzywinski, and N. Altman, “The seirs model for infectious
disease dynamics.” Nature methods, vol. 17, no. 6, pp. 557–559, 2020.
[85] F. Brauer, C. Castillo-Chavez, and C. Castillo-Chavez, Mathematical models in population
biology and epidemiology. Springer, 2012, vol. 2.

Appendix
Table of Contents
A HDTwinGen Overview 17

B Extended Related Work 17

C Benchmark Dataset Environment Details 18


C.1 Cancer PKPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
C.2 COVID-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
C.3 Plankton Microcosm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
C.4 Hare-Lynx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

D Benchmark Method Implementation Details 20

E HDTwinGen Implementation Details 21


E.1 HDTwin pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
E.2 Training HDTwins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
E.3 HDTwinGen Prompt Templates . . . . . . . . . . . . . . . . . . . . . . . . . . 22
E.4 HDTwinGen System Requirements Prompts . . . . . . . . . . . . . . . . . . . 23

F Model Optimization Losses 27

G Evaluation Metrics 27

H Additional Results 28
H.1 Out-of-distribution Experiment and Setup . . . . . . . . . . . . . . . . . . . . . 28
H.2 COVID-19 Unobserved Intervention Experiment and Setup . . . . . . . . . . . 28
H.3 HDTwinGen top-1 decreases over time . . . . . . . . . . . . . . . . . . . . . . 28
H.4 HDTwinGen Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
H.5 HDTwinGen Ablation No Memory . . . . . . . . . . . . . . . . . . . . . . . . 32
H.6 Evaluating Different LLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
H.7 Prompt Ablations with Varying Amounts of Prior Information . . . . . . . . . . 32
H.8 Domain-Specific Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
H.9 Procedurally Generated Synthetic Model Benchmark . . . . . . . . . . . . . . . 34
H.10 Interpretability Scale, Performance of only White-Box Models . . . . . . . . . . 35
H.11 HDTwinGen Flexibly Integrates Expert-in-the-loop Feedback . . . . . . . . . . 35
H.12 HDTwinGen Accelerates Model Development and Enhances Performance . . . . 36

I Hybrid Model Output Examples 38

J HDTwinGen can reason about HDTwins 40

A HDTwinGen Overview
We provide an illustrative example of HDTwinGen working in practice in Figure 5.

[Figure 5 (schematic): the user-given system requirements prompt (e.g., "Develop a PyTorch module to simulate tumor and drug concentration changes in lung cancer treatment, ensuring interpretability and a validation loss of ≤ 1e-6..."), the actor (modeling) LLM generating HDTwin code, training and validation on the dataset, and the critic (evaluation) LLM reflecting on per-component validation losses, with the validation MSE decreasing across generations (4.41, 4.23, 1.25, 0.0723).]
Figure 5: HDTwinGen Illustrative Example in Operation. HDTwinGen can generate and further evolve HDTwins for a particular system based on user-given system requirements and a dataset D = {D_train, D_test} of state-action trajectories. First, the system requirements, which include dataset statistics, are incorporated into a prompt and fed into the modeling agent that returns the code for the HDTwin. This HDTwin is then trained on the training dataset D_train, and a validation loss is computed with D_val. In subsequent generations, the evaluation agent is given the existing generated top-K HDTwins, their corresponding validation losses, and validation losses per component, and is asked to reflect on how to improve the HDTwin. This provides detailed, actionable feedback, leveraged from the LLM's inherent understanding, which is given as verbal feedback H; this feedback is then used with the modeling agent to generate the next HDTwin [P3]. This process iterates over several generations, and the best-performing HDTwin (w.r.t. validation performance) is returned. Overall, this produces an HDTwin that fulfills [P1-P3].

B Extended Related Work

Sequence models. ML approaches frequently address system dynamics as a sequential modeling problem. Initial models like Hidden Markov Models [34] and Kalman filters [35] made simplifying Markovian and linearity assumptions, later extended to nonlinear settings [36, 37]. Subsequent models, including recurrent neural networks [38], along with their advanced variants [39, 40, 41], introduced the capability to model longer-term dependencies. More recent advancements include
attention mechanisms [42] and transformer models [43], significantly improving the handling of
long-term dependencies in sequence data. Another line of work, Neural Ordinary Differential
Equations (NODE) [14, 44, 45], interprets neural network operations as integrations of differential
equations to model continuous-time processes. Despite being initially driven by natural language
processing applications [47], these methods have found utility in modeling complex systems like
weather forecasting [46] and energy systems [48]. Furthermore, sequence models can be used in
model-based RL [70].
Physics-inspired models. Beyond purely data-centric approaches, recent efforts have focused on
integrating physical laws into neural system models. Physics-informed neural networks [15, 54]
embed physical laws, often as partial differential equations, directly into the learning process. Other
notable methods include Hamiltonian Neural Networks [55] and Lagrangian Neural Networks [56],
which respect the structural principles of physical systems. These methods are primarily concerned
with modeling physics-related phenomena and require relatively precise knowledge about the
system being modeled (e.g. specific differential equations or energy conservation principles) and
specialized mechanisms to incorporate them. Regardless, they have demonstrated that the integration
of known principles can significantly improve extrapolation abilities beyond the range of training
data. We are similarly inspired to incorporate prior knowledge. In contrast, our work aims to integrate
more general or partial knowledge flexibly into a hybrid model using LLMs within an evolutionary
multi-agent framework, while introducing more generalized mechanisms to incorporate loose-form
prior knowledge.

Discovering closed-form models. Closely aligned with our research are techniques aimed at dis-
covering closed-form mathematical expressions from data. Symbolic regression [10, 71, 72] and
methods like Eureqa [50], SINDy [11], and D-CODE [51, 52] have showcased their prowess in
discovering physical laws from experimental observations. However, these techniques can struggle in
higher-dimensional settings and rely on experts to perform the system decomposition to identify the
most relevant variables before feeding this information to the algorithm. Additionally, they also rely
on experts to specify the function set and mathematical operations that the algorithm uses to search for
symbolic expressions. In contrast, our method autonomously learns both the system decomposition
and the functional forms of component dynamics, potentially enhancing scalability and efficiency.
Moreover, the incorporation of LLMs facilitates the flexible integration of prior knowledge at various
stages of the search process [73, 74]. Furthermore, prior work has shown that LLM multi-agent frameworks excel at large code-generation tasks [75, 76], which we expect could be applied here to scale up the size of the generated models in future work. Such future work could also explore acquiring features [77, 78].

C Benchmark Dataset Environment Details

In the following, we outline the six system dynamics datasets; each is either a real-world dataset or has been sampled from an accurate simulator designed by human experts.

C.1 Cancer PKPD

Three of the environments from which we sample datasets are derived from a state-of-the-art biomedical Pharmacokinetic-Pharmacodynamic (PKPD) model of lung cancer tumor growth, used to simulate the combined effects of chemotherapy and radiotherapy in lung cancer [61]; this model has been used extensively by other works [62, 63, 64]. We use this bio-mathematical lung cancer model to create three variations of lung cancer under the effect of no treatment (Lung Cancer), chemotherapy only (Lung Cancer (with Chemo.)), and chemotherapy combined with radiotherapy (Lung Cancer (with Chemo. & Radio.)); for each variation we sample a respective dataset. We first detail the general case of Lung Cancer (with Chemo. & Radio.), which comes from the general model (Cancer PKPD Model), and then detail the variations.
Cancer PKPD Model. This is a state-of-the-art biomedical Pharmacokinetic-Pharmacodynamic (PKPD) model of tumor growth that simulates the combined effects of chemotherapy and radiotherapy in lung cancer [61] (Equation (2)), and it has been extensively used by other works [62, 63, 64]. Specifically, it models the volume of the tumor x(t) for days t after the cancer diagnosis, where the outcome is one-dimensional. The model has two binary treatments: (1) radiotherapy u_t^r and (2) chemotherapy u_t^c.
 
dx(t)/dt = ( ρ log(K / x(t)) − β_c c(t) − (α_r d(t) + β_r d(t)^2) ) x(t),    (2)

where the three terms correspond to tumor growth, chemotherapy, and radiotherapy, respectively. The parameters K, ρ, β_c, α_r, and β_r for each simulated patient are detailed in [61] and are also described in Table 3.

Table 3: Cancer PKPD parameter values.

Model Variable    Parameter                  Parameter Value
Tumor growth      Growth parameter ρ         7.00 × 10^−5
Tumor growth      Carrying capacity K        30
Radiotherapy      Radio cell kill (α) α_r    0.0398
Radiotherapy      Radio cell kill (β) β_r    Set s.t. α/β = 10
Chemotherapy      Chemo cell kill β_c        0.028

Additionally, the chemotherapy drug concentration c(t) follows an exponential decay relationship with a half-life of one day:

dc(t)/dt = −0.5 c(t),    (3)

where the chemotherapy binary action represents increasing the concentration c(t) by 5.0 mg/m^3 of vinblastine given at time t, whereas the radiotherapy concentration d(t) represents 2.0 Gy fractions of radiotherapy given at timestep t, where Gy is the gray unit of ionizing radiation dose.
Time-dependent confounding. We introduce time-varying confounding into the data generation
process. This is accomplished by characterizing the allocation of chemotherapy and radiotherapy as
Bernoulli random variables. The associated probabilities, pc and pr , are determined by the tumor
diameter as follows:
   
p_c(t) = σ( (γ_c / D_max) (D̄(t) − δ_c) ),    p_r(t) = σ( (γ_r / D_max) (D̄(t) − δ_r) ),    (4)

where D_max = 13 cm represents the largest tumor diameter, δ_c = δ_r = D_max/2, and D̄(t) denotes the mean tumor diameter. The parameters γ_c and γ_r control the extent of time-varying confounding; we use γ_c = γ_r = 2.
Sampling datasets. Using the above Cancer PKPD model, we sample N = 1,000 patient trajectories, drawing each initial tumor volume from a uniform distribution x(0) ∼ U(0, 1149) and using the Cancer PKPD Equation (2) along with the action policy of Equation (4) to forward-simulate each patient trajectory for 60 days with an Euler solver. This forms one dataset sample. We repeat this process with independent random seeds to generate D_train, D_val, and D_test; specifically, for each benchmark method run with a given random seed, we re-sample the datasets. For each variation described above, we include the chemotherapy dosing action, the chemotherapy and radiotherapy dosing actions, or neither. We further outline this dataset's system description and variable descriptions with the prompt template given in Appendix E.4.

C.2 COVID-19

We use the accurate and complex epidemic agent-based simulator of COVASIM [65] to simulate
COVID-19 epidemics. This is an advanced simulator that is capable of simulating non-pharmaceutical
interventions (such as lockdowns through social distancing, and school closures) and pharmaceutical
interventions (such as vaccinations). As this is an agent-based simulator, each agent is an individual in a population and can minimally be in one of the following states: susceptible to COVID-19, exposed, infectious, or recovered (which includes deaths). We use the simulator with the default parameters set by its open-source implementation5. To ensure an accurate simulation, we simulate 24 countries, collecting trajectories for each. In each simulation we use a population size of 1,000,000 individuals, simulate each individual separately (disabling simulation rescaling), start with a random number of individuals infected with COVID-19, I(0) ∼ U(10,000, 100,000), and forward-simulate for 60 days. We repeat this process with independent random seeds to generate D_train, D_val, and D_test; specifically, for each benchmark method run with a given random seed, we re-sample the datasets. We further outline this dataset's system description and variable descriptions with the prompt template given in Appendix E.4.

C.3 Plankton Microcosm

This describes an ecological model of a microcosm of algae, flagellate, and rotifer populations,
thus replicating an experimental three-species prey-predator system [66]. We use the dataset made
available by [79]6 . The dataset consists of a single trajectory of 102 time steps, and we use a train,
val, test split of 70%, 15% and 15%, ensuring the splits are along the time dimension to maintain
the integrity of temporal causality, following their chronological order. We further outline this
dataset’s system description and variable descriptions with the following prompt template as given in
Appendix E.4.

5 COVASIM is an open-source simulator, which we access at [Link]InstituteforDiseaseModeling/covasim.
6 The Plankton Microcosm and Hare-Lynx datasets are both open source and available from [Link]/WillemBonnaffe/NODEBNGM.

C.4 Hare-Lynx

This describes a real-world dataset of hare and lynx populations, replicating predator-prey dynamics
[67]. We use the dataset made available by [79]. The dataset consists of a single trajectory of 92 time
steps, and we use a train, val, test split of 70%, 15% and 15%, ensuring the splits are along the time
dimension to maintain the integrity of temporal causality, following their chronological order. We
further outline this dataset’s system description and variable descriptions with the following prompt
template as given in Appendix E.4.
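Both ecological datasets (Appendices C.3 and C.4) use the same chronological 70%/15%/15% split; a minimal sketch, with illustrative array and function names, is:

import numpy as np

def chronological_split(states: np.ndarray, train_frac=0.70, val_frac=0.15):
    # states: array of shape (T, state_dim), ordered chronologically.
    T = len(states)
    n_train, n_val = int(train_frac * T), int(val_frac * T)
    d_train = states[:n_train]
    d_val = states[n_train:n_train + n_val]
    d_test = states[n_train + n_val:]        # remaining ~15%
    return d_train, d_val, d_test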

D Benchmark Method Implementation Details

To be competitive, we compare against popular black-box models which, when modeling the dynamics of a system over time, take the form of an ODE model, namely a neural ODE [14] with action inputs (DyNODE) [68]. We also compare against transparent dynamical equations derived from equation discovery methods for ODEs, such as Sparse Identification of Nonlinear Dynamics (SINDy) [11]. Moreover, we compare against ablations of our method: the zero-shot generated HDTwin (ZeroShot) and the same model with subsequently optimized parameters (ZeroOptim).
DyNODE
DyNODE is a black-box neural-network-based dynamics model [68] that captures the underlying dynamics of a system by incorporating control inputs into the standard neural ordinary differential equation framework [14]. We use a DyNODE model with a 3-layer multilayer perceptron (MLP), a hidden dimension of 128 units, and tanh activation functions, and make it competitive by using Xavier weight initialization [80]. To be competitive, we use the same objective, optimizer, and optimizer hyperparameters as in HDTwinGen: an Adam optimizer [32] with a learning rate of 0.01, a batch size of 1,000, early stopping with a patience of 20, and training for 2,000 epochs to ensure convergence.
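For concreteness, a minimal sketch of an action-conditioned neural ODE dynamics model consistent with the hyperparameters above (3-layer MLP, 128 hidden units, tanh activations, Xavier initialization) is shown below; the class and helper names are illustrative and not the original DyNODE code.

import torch
import torch.nn as nn

class DyNODEDynamics(nn.Module):
    """Black-box dynamics model: d(state)/dt = MLP(state, action)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )
        for m in self.net:
            if isinstance(m, nn.Linear):          # Xavier weight initialization
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def euler_rollout(model, state0, actions, dt=1.0):
    # Integrate the learned differential with a simple Euler solver.
    states, s = [state0], state0
    for a in actions.unbind(dim=1):               # actions: (batch, T, action_dim)
        s = s + dt * model(s, a)
        states.append(s)
    return torch.stack(states, dim=1)             # (batch, T + 1, state_dim)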
Causal Transformer
Causal Transformer is a state-of-the-art transformer model for estimating counterfactual outcomes [64]. The full Causal Transformer incorporates three separate transformer networks, one each for processing covariates, past treatments, and past outcomes, a design specific to estimating counterfactual outcomes in treatment effect settings; we therefore implement a single transformer to model the past outcomes, which is applicable to our datasets and task domains. Specifically, this consists of a standard transformer encoder, where the input dataset is normalized to the training dataset. We encode the observed state-action input into an embedding vector of dimension 250 through a linear layer, followed by the addition of a standard positional encoding [64]; this is fed into a transformer encoder layer with a head size of 10 and dropout of 0.1, and its output is fed into a linear layer to reconstruct the next-step-ahead state, of size equal to the state dimension. We train this model using the AdamW [32] optimizer with a learning rate of 0.00005 and a step learning rate scheduler with step size 1.0 and gamma 0.95; we also apply gradient clipping at 0.7, use a batch size of 1,000 and early stopping with a patience of 20, and train for 2,000 epochs to ensure convergence.
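A condensed sketch of this simplified single-encoder variant is given below. The module name, the sinusoidal positional encoding, and the reading of "head size of 10" as 10 attention heads are our own assumptions reconstructed from the description above.

import math
import torch
import torch.nn as nn

class NextStepTransformer(nn.Module):
    """Single causal transformer encoder predicting the next-step state."""
    def __init__(self, in_dim: int, state_dim: int, d_model: int = 250,
                 nhead: int = 10, dropout: float = 0.1, max_len: int = 200):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        # Standard sinusoidal positional encoding.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dropout=dropout, batch_first=True)
        self.head = nn.Linear(d_model, state_dim)

    def forward(self, seq):                        # seq: (batch, T, in_dim), normalized
        T = seq.size(1)
        h = self.embed(seq) + self.pe[:T]
        causal_mask = torch.triu(                  # prevent attending to future steps
            torch.full((T, T), float("-inf"), device=seq.device), diagonal=1)
        h = self.encoder(h, src_mask=causal_mask)
        return self.head(h)                        # (batch, T, state_dim)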
RNN
The Recurrent Neural Network [81] is a standard baseline that is widely used for autoregressive next-step-ahead time series prediction. We implement it with the input dataset normalized to the training dataset. It consists of a two-layer gated recurrent unit (GRU) RNN that maps the state-action input to a hidden dimension of size 250; the output is then fed to a linear layer that maps the hidden dimension back to the state dimension to predict the next step ahead. To be competitive, we use the same objective, optimizer, and optimizer hyperparameters as in HDTwinGen: an Adam optimizer [32] with a learning rate of 0.01, a batch size of 1,000, early stopping with a patience of 20, and training for 2,000 epochs to ensure convergence.
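A compact sketch of this baseline follows; the class name is illustrative.

import torch.nn as nn

class GRUBaseline(nn.Module):
    """Two-layer GRU mapping normalized state-action sequences to next-step states."""
    def __init__(self, in_dim: int, state_dim: int, hidden: int = 250):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, state_dim)

    def forward(self, seq):                 # seq: (batch, T, in_dim)
        h, _ = self.rnn(seq)
        return self.head(h)                 # (batch, T, state_dim) next-step predictions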
SINDy
Sparse Identification of Nonlinear Dynamics (SINDy) [11] is a data-driven framework that aims to discover the governing dynamical system equations directly from time-series data, yielding a white-box closed-form mathematical model. The algorithm works by iteratively performing sparse regression on a library of candidate functions to identify the sparsest yet most accurate representation of the dynamical system.
In our implementation, we use a polynomial library of order two, i.e., a feature library of L = {1, x_0, x_1, x_0 x_1}. Finite-difference approximations of order one are used to compute time derivatives from the input time-series data. The alpha parameter is kept constant at 0.5 across all experiments, and the sparsity threshold is set to 0.02 for all experiments, apart from the COVID-19 dataset where it is set to 1 × 10^−5.
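A minimal sketch of this configuration using the pysindy package is shown below; we assume its standard PolynomialLibrary, STLSQ, and FiniteDifference components, and the exact benchmark code may differ.

import numpy as np
import pysindy as ps

def fit_sindy(states: np.ndarray, dt: float = 1.0, threshold: float = 0.02, alpha: float = 0.5):
    # states: (T, state_dim) trajectory sampled at a fixed time step dt.
    model = ps.SINDy(
        feature_library=ps.PolynomialLibrary(degree=2),        # candidate terms up to order two
        optimizer=ps.STLSQ(threshold=threshold, alpha=alpha),   # sparsity threshold and ridge alpha
        differentiation_method=ps.FiniteDifference(order=1),    # first-order finite differences
    )
    model.fit(states, t=dt)
    model.print()  # prints the discovered closed-form ODEs
    return model

For the COVID-19 dataset, threshold would instead be set to 1e-5.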
APHYNITY
APHYNITY [60] is implemented using domain-specific expert models as defined in Appendix H.8
combined with a 3-layer MLP, with the same hyper-parameters as in [60].
GP
Genetic programming (GP) is implemented using the implementation and hyper-parameters from the
baseline in [82].
HDTwinGen
See Appendix E for the implementation details. Specifically, ZeroShot and ZeroOptim are ablations of our method using exactly the same setup, hyperparameters, and prompts. ZeroShot generates one HDTwin and does not fit its parameters, thus evaluating the loss of the model output directly from the LLM, whereas ZeroOptim repeats ZeroShot with the additional step of optimizing the parameters of the generated HDTwin, again using the same training procedure as detailed in Appendix E.2.

E HDTwinGen Implementation Details


Our proposed method follows the framework as described in Section 4. We present pseudocode in
Appendix E.1, how the code-generated HDTwins are trained in Appendix E.2, prompt templates
in Appendix E.3, system requirements prompts in Appendix E.4 for each dataset, and we provide
examples of training runs in Appendix J. Specifically, we find that a top-K of K = 16 is sufficient. Additionally, we use the GPT-4-1106-Preview LLM with a temperature of 0.7.

E.1 HDTwin pseudocode

Algorithm 1 Pseudocode for Hybrid Digital Twin Generator Framework


1: Input: modeling context S context ; training dataset Dtrain , validation dataset Dval , maximum
generations G, top K programs to consider, R
2: Output: Best fitting hybrid model fθ,ω(θ)∗ .
3: P ← ∅, H ← ∅
4: for g = 1 to G do
5: fθ,ω(θ) ∼ LLMmodel (H, P (g) , S context ) {Generate HDTwin from modeling agent}
6: ω(θ)∗ = arg minω(θ)∈Ω(θ) L(fθ,ω(θ) , Dtrain ) {Fit the model}
7: Compute validation loss per component and overall δ, υ
8: P (g+1) ← P (g) ⊕ (fθ,ω(θ)∗ , δ, υ) {Add HDTwin to the set of top-K HDTwins}
9: H ∼ LLMeval (R, P (g) ) {Generate self-reflection on how to improve the HDTwin.}
10: end for
11: Return: fθ,ω(θ)∗ {The best fitting model that scored the lowest validation loss}

E.2 Training HDTwins

Once the modeling agent has generated an HDTwin f_{θ,ω(θ)}, it is emitted as code. Specifically, the agent outputs code for a PyTorch [31] neural network module; this code string is executed, and the resulting module is then trained on the training dataset. Importantly, the agent observes a code skeleton within its system requirements context S^context, examples of which are given in Appendix E.4. We stipulate that the skeleton must be a “[Link]”, be called “StateDifferential”, have its parameters initialized, and define a forward function for computing the state differential, where the state and action for that system are input variables to the function. The LLM is instructed not to modify the code skeleton, only to complete it and return it. This makes it straightforward to parse the code from text, execute the module, and then train the model.
Specifically, we train the model on the training dataset using the standard MSE loss of Equation (5), optimized with the Adam optimizer [32]. For a fair comparison, we use the same optimizer hyperparameters as the black-box neural network method: a learning rate of 0.01, a batch size of 1,000, early stopping with a patience of 20, and training for 2,000 epochs to ensure convergence.
Once the model is trained, we compute the validation MSE and the validation MSE per component, which corresponds to the validation loss per state output dimension, Equation (6). Notably, when we append the trained HDTwin back into P(g), we include a string representation of it, which contains the values of any named parameters initialized in the model. We observe that feeding in the previously optimized parameters helps the LLM in subsequent generations to suggest good starting values for the named parameters, which can then be further refined in the optimization step.
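To make this procedure concrete, the following is an abbreviated sketch of executing a generated StateDifferential code string and fitting it. The helper names (instantiate_hdtwin, rollout, fit_hdtwin) and the generic model(state, action) call are illustrative assumptions; the actual skeleton passes each state and action variable as a separate batch vector (see Appendix E.4), and mini-batching is omitted for brevity.

import torch
import torch.nn as nn

def instantiate_hdtwin(code_str: str) -> nn.Module:
    # Execute the LLM-generated code string and build the `StateDifferential` module.
    namespace = {"torch": torch, "nn": nn}
    exec(code_str, namespace)
    return namespace["StateDifferential"]()   # assumes a no-argument constructor

def rollout(model, x0, actions, dt=1.0):
    # Euler integration of the learned state differential dx/dt = model(x, u).
    xs, x = [], x0
    for u in actions.unbind(dim=1):              # actions: (batch, T-1, action_dim)
        x = x + dt * model(x, u)
        xs.append(x)
    return torch.stack(xs, dim=1)                # predicted states at steps 1..T-1

def fit_hdtwin(model, train_states, train_actions, val_states, val_actions,
               lr=0.01, epochs=2000, patience=20):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state, waited = float("inf"), None, 0
    for _ in range(epochs):
        opt.zero_grad()
        pred = rollout(model, train_states[:, 0], train_actions)
        loss = nn.functional.mse_loss(pred, train_states[:, 1:])   # standard MSE loss, Eq. (5)
        loss.backward()
        opt.step()
        with torch.no_grad():
            val_pred = rollout(model, val_states[:, 0], val_actions)
            val_mse = nn.functional.mse_loss(val_pred, val_states[:, 1:])
            # Per-component validation loss: MSE per state output dimension, Eq. (6).
            per_component = ((val_pred - val_states[:, 1:]) ** 2).mean(dim=(0, 1))
        if val_mse.item() < best_val:
            best_val, waited = val_mse.item(), 0
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        else:
            waited += 1
            if waited >= patience:               # early stopping
                break
    model.load_state_dict(best_state)
    return model, best_val, per_component

The string representation of the fitted module, including its optimized named parameters, is what gets appended to the top-K set P(g).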

E.3 HDTwinGen Prompt Templates

In the following we detail the prompt templates used. We always use the system prompt when
interacting with the LLM.
System prompt
Objective: Write code to create an effective differential equation simulator for a given task.
Please note that the code should be fully functional. No placeholders.

You must act autonomously and you will receive no human input at any stage. You have to return as output the complete code for completing this task, and correctly improve the code to create the most accurate and realistic simulator possible.
You always write out the code contents. You always indent code with tabs.
You cannot visualize any graphical output. You exist within a machine. The code can include black box multi-layer perceptions where required.

Use the functions provided. When calling functions only provide a RFC8259 compliant JSON request following this format without deviation.

Defined function schema prompt


{
    "name": "complete_StateDifferential_code",
    "description": "Write out the code body for the `StateDifferential` torch model.",
    "parameters": {
        "type": "object",
        "properties": {
            "StateDifferential_code": {
                "type": "string",
                "description": 'Code for the `StateDifferential` torch model, inclusive of the model definition. If you are unsure, take your best guess. This must be a nonempty string.',
            },
            "code_description": {
                "type": "string",
                "description": 'A concise description of the code model, indicating if it is a white box only or white and black box model.',
            }
        },
        "required": ["StateDifferential_code", "code_description"],
    },
},

Modeling agent first task prompt

"""
You will get a system description to code a differential equation simulator for.

System Description: ```
{system_description}
```

Modelling goals: ```
* The parameters of the model will be optimized to an observed training dataset with the given simulator.
* The observed training dataset has very few samples, and the model must be able to generalize to unseen data.
```

Requirement Specification: ```
* The code generated should achieve the lowest possible validation loss, of 1e-10 or less.
* The code generated should be interpretable, and fit the dataset as accurately as possible.
```

Skeleton code to fill in: ```
{skeleton_code}
```

Useful to know: ```
* You are a code evolving machine, and you will be called {generations} times to generate code, and improve the code to achieve the lowest possible validation loss.
* The model defines the state differential and will be used with an ODE solver to fit the observed training dataset.
* You can use any parameters you want and any black box neural network components (multi-layer perceptrons); however, you have to define these.
* It is preferable to decompose the system into differential equations (compartments) if possible.
* You can use any unary functions, for example log, exp, power etc.
* Under no circumstance can you change the skeleton code function definitions, only fill in the code.
* The input tensors are vectors of shape (batch_size).
* Use initially white box models first and then switch to hybrid white and black box models for the residuals, only after no further best program iteration improvement with white box models.
* Make sure your code follows the exact code skeleton specification.
* Use PyTorch.
```

Think step-by-step, and then give the complete full working code. You are generating code for iteration {current_iteration} out of {generations}.
"""

Reflection prompt

"""
You generated the following code completions, which then had their parameters optimized to the training dataset. Please reflect on how you can improve the code to minimize the validation loss to 1e-6 or less. The code examples are delineated by ###.

Here are your previous iterations the best programs generated. Use it to see if you have exhausted white box models, i.e. when a white box model repeats with the same val loss and then only add black box models to the white box models: ```
{history_best_completions_str}
```

Here are the top code completions so far that you have generated, sorted for the lowest validation loss last: ```
{completions}
```

Please reflect on how you can improve the code to fit the dataset as accurately as possible, and be interpretable. Think step-by-step. Provide only actionable feedback, that has direct changes to the code. Do not write out the code, only describe how it can be improved. Where applicable use the values of the optimized parameters to reason how the code can be improved to fit the dataset as accurately as possible. This is for generating new code for the next iteration {iteration} out of {self.config.run.generations}.
"""

Modeling agent in subsequent generations

"""
Please now regenerate the code function, with the aim to improve the code to achieve a lower validation error. Use the feedback where applicable. You are generating code for iteration {generation_id} out of {self.config.run.generations} total iterations. When generating code if you are unsure about something, take your best guess. You have to generate code, and cannot give an empty string answer.

Please always only fill in the following code skeleton: ```
{prompts.get_skeleton_code(self.env.env_name)}
```
You cannot change the code skeleton, or input variables.
"""

E.4 HDTwinGen System Requirements Prompts

Following our proposed system requirements format, we constructed a prompt for each of the evaluated datasets; these prompts are listed below.
Lung Cancer (with Chemo. & Radio.)
You will get a system description to code a differential equation simulator for.

System Description: ```
Prediction of Treatment Response for Combined Chemo and Radiation Therapy for Non-Small Cell Lung Cancer Patients Using a Bio-Mathematical Model

Here you must model the state differential of tumor_volume, and chemotherapy_drug_concentration; with the input actions of chemotherapy_dosage, and radiotherapy_dosage.

Description of the variables:
* tumor_volume: Volume of the tumor with units cm^3
* chemotherapy_drug_concentration: Concentration of the chemotherapy drug vinblastine with units mg/m^3
* chemotherapy_dosage: Dosage of the chemotherapy drug vinblastine with units mg/m^3
* radiotherapy_dosage: Dosage of the radiotherapy with units Gy

The time units is in days.

Additionally these variables have the ranges of:
* tumor_volume: [0.01433, 1170.861]
* chemotherapy_drug_concentration: [0, 9.9975]
* chemotherapy_dosage: [0, 5.0]
* radiotherapy_dosage: [0, 2.0]

The training dataset consists of 1000 patients, where each patient is observed for 60 days.
```

Modelling goals: ```
* The parameters of the model will be optimized to an observed training dataset with the given simulator.
* The observed training dataset has very few samples, and the model must be able to generalize to unseen data.
```

Requirement Specification: ```
* The code generated should achieve the lowest possible validation loss, of 1e-6 or less.
* The code generated should be interpretable, and fit the dataset as accurately as possible.
```

Skeleton code to fill in: ```
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # TODO: Fill in the code here

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor, chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # TODO: Fill in the code here
        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
```

Useful to know: ```
* You are a code evolving machine, and you will be called 20 times to generate code, and improve the code to achieve the lowest possible validation loss.
* The model defines the state differential and will be used with an ODE solver to fit the observed training dataset.
* You can use any parameters you want and any black box neural network components (multi-layer perceptrons); however, you have to define these.
* It is preferable to decompose the system into differential equations (compartments) if possible.
* You can use any unary functions, for example log, exp, power etc.
* Under no circumstance can you change the skeleton code function definitions, only fill in the code.
* The input tensors are vectors of shape (batch_size).
* Use initially white box models first and then switch to hybrid white and black box models for the residuals, only after no further best program iteration improvement with white box models.
* Make sure your code follows the exact code skeleton specification.
* Use PyTorch.
```

Think step-by-step, and then give the complete full working code. You are generating code for iteration 0 out of 20.

Lung Cancer (with Chemo.)


You will get a system description to code a differential equation simulator for.

System Description: ```
Prediction of Treatment Response for Combined Chemo and Radiation Therapy for Non-Small Cell Lung Cancer Patients Using a Bio-Mathematical Model

Here you must model the state differential of tumor_volume, and chemotherapy_drug_concentration; with the input actions of chemotherapy_dosage.

Description of the variables:
* tumor_volume: Volume of the tumor with units cm^3
* chemotherapy_drug_concentration: Concentration of the chemotherapy drug vinblastine with units mg/m^3
* chemotherapy_dosage: Dosage of the chemotherapy drug vinblastine with units mg/m^3

The time units is in days.

Additionally these variables have the ranges of:
* tumor_volume: [0.64196031, 1260.60290569]
* chemotherapy_drug_concentration: [0, 9.9975]
* chemotherapy_dosage: [0, 5.0]

The training dataset consists of 1000 patients, where each patient is observed for 60 days.
```

Modelling goals: ```
* The parameters of the model will be optimized to an observed training dataset with the given simulator.
* The observed training dataset has very few samples, and the model must be able to generalize to unseen data.
```

Requirement Specification: ```
* The code generated should achieve the lowest possible validation loss, of 1e-6 or less.
* The code generated should be interpretable, and fit the dataset as accurately as possible.
```

Skeleton code to fill in: ```
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # TODO: Fill in the code here

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor, chemotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # TODO: Fill in the code here
        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
```

Useful to know: ```
* You are a code evolving machine, and you will be called 20 times to generate code, and improve the code to achieve the lowest possible validation loss.
* The model defines the state differential and will be used with an ODE solver to fit the observed training dataset.
* You can use any parameters you want and any black box neural network components (multi-layer perceptrons); however, you have to define these.
* It is preferable to decompose the system into differential equations (compartments) if possible.
* You can use any unary functions, for example log, exp, power etc.
* Under no circumstance can you change the skeleton code function definitions, only fill in the code.
* The input tensors are vectors of shape (batch_size).
* Use initially white box models first and then switch to hybrid white and black box models for the residuals, only after no further best program iteration improvement with white box models.
* Make sure your code follows the exact code skeleton specification.
* Use PyTorch.
```

Think step-by-step, and then give the complete full working code. You are generating code for iteration 0 out of 20.

Lung Cancer
You will get a system description to code a differential equation simulator for.

System Description: ```
Prediction of Treatment Response for Combined Chemo and Radiation Therapy for Non-Small Cell Lung Cancer Patients Using a Bio-Mathematical Model

Here you must model the state differential of tumor_volume. There are no treatments applied.

Description of the variables:
* tumor_volume: Volume of the tumor with units cm^3

The time units is in days.

Additionally these variables have the ranges of:
* tumor_volume: [0.64196031, 4852.45734281]

The training dataset consists of 1000 patients, where each patient is observed for 60 days.
```

Modelling goals: ```
* The parameters of the model will be optimized to an observed training dataset with the given simulator.
* The observed training dataset has very few samples, and the model must be able to generalize to unseen data.
```

Requirement Specification: ```
* The code generated should achieve the lowest possible validation loss, of 1e-6 or less.
* The code generated should be interpretable, and fit the dataset as accurately as possible.
```

Skeleton code to fill in: ```
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # TODO: Fill in the code here

    def forward(self, tumor_volume: torch.Tensor) -> Tuple[torch.Tensor]:
        # TODO: Fill in the code here
        return (d_tumor_volume__dt)
```

Useful to know: ```
* You are a code evolving machine, and you will be called 20 times to generate code, and improve the code to achieve the lowest possible validation loss.
* The model defines the state differential and will be used with an ODE solver to fit the observed training dataset.
* You can use any parameters you want and any black box neural network components (multi-layer perceptrons); however, you have to define these.
* It is preferable to decompose the system into differential equations (compartments) if possible.
* You can use any unary functions, for example log, exp, power etc.
* Under no circumstance can you change the skeleton code function definitions, only fill in the code.
* The input tensors are vectors of shape (batch_size).
* Use initially white box models first and then switch to hybrid white and black box models for the residuals, only after no further best program iteration improvement with white box models.
* Make sure your code follows the exact code skeleton specification.
* Use PyTorch.
```

Think step-by-step, and then give the complete full working code. You are generating code for iteration 0 out of 20.

Hare-Lynx
You will get a system description to code a differential equation simulator for.

System Description: ```
"Modeling Di-Trophic Prey-Predator Dynamics in a Hare and Lynx Ecological System

Here you must model the state differential of hare_population, and lynx_population; with the additional input of time_in_years. This aims to simulate the population dynamics within a simplified di-trophic ecological system comprising prey (hares), and predators (lynxes). The interactions include direct predation and competition for resources, mirroring natural predator-prey mechanisms.

Description of the variables:
* hare_population: Annual count of hare pelts, serving as a proxy for the hare population size, in tens of thousands.
* lynx_population: Annual count of lynx pelts, serving as a proxy for the lynx population size, in tens of thousands.

The model should capture the dynamics of these populations, reflecting the di-trophic prey-predator interactions, and predict the population sizes based on historical data. The data exhibits 10-year long characteristic oscillations due to prey-predator dynamics.

Additionally these variables have the ranges of:
* hare_population: [1.80, 152.65]
* lynx_population: [3.19, 79.35]
* time_in_years: [1845, 1935]

The training dataset consists of 63 timesteps, validation and training dataset consists of 14 timesteps each.
```

Modelling goals: ```
* The parameters of the model will be optimized to an observed training dataset with the given simulator.
* The observed training dataset has very few samples, and the model must be able to generalize to unseen data.
```

Requirement Specification: ```
* The code generated should achieve the lowest possible validation loss, of 1e-6 or less.
* The code generated should be interpretable, and fit the dataset as accurately as possible.
```

Skeleton code to fill in: ```
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # TODO: Fill in the code here

    def forward(self, hare_population: torch.Tensor, lynx_population: torch.Tensor, time_in_years: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # TODO: Fill in the code here
        return (d_hare_population__dt, d_lynx_population__dt)
```

Useful to know: ```
* You are a code evolving machine, and you will be called 20 times to generate code, and improve the code to achieve the lowest possible validation loss.
* The model defines the state differential and will be used with an ODE solver to fit the observed training dataset.
* You can use any parameters you want and any black box neural network components (multi-layer perceptrons); however, you have to define these.
* It is preferable to decompose the system into differential equations (compartments) if possible.
* You can use any unary functions, for example log, exp, power etc.
* Under no circumstance can you change the skeleton code function definitions, only fill in the code.
* The input tensors are vectors of shape (batch_size).
* Use initially white box models first and then switch to hybrid white and black box models for the residuals, only after no further best program iteration improvement with white box models.
* Make sure your code follows the exact code skeleton specification.
* Use PyTorch.
```

Think step-by-step, and then give the complete full working code. You are generating code for iteration 0 out of 20.

Plankton Microcosm
You will get a system description to code a differential equation simulator for.

System Description: ```
"Modeling Artificial Tri-Trophic Prey-Predator Oscillations in a Simplified Ecological System

Here you must model the state differential of algae_population, flagellate_population, and rotifer_population; with no input actions. This aims to simulate the population dynamics within a simplified tri-trophic ecological system comprising prey (algae), intermediate predators (flagellates), and top predators (rotifers). The interactions include direct predation and competition for resources, mirroring natural intraguild predation mechanisms.

Description of the variables:
* prey_population: Total count of algae, serving as the primary prey
* intermediate_population: Total count of flagellates, acting as intermediate predators and prey
* top_predators_population: Total count of rotifers, representing top predators

The dataset encapsulates daily population counts across multiple simulated ecosystems over a period of 100 days, allowing for the analysis of temporal oscillations and phase lags between species.

Additionally these variables have the ranges of:
* prey_population: [0.095898, 2.469735]
* intermediate_population: [0.008438, 1.500000]
* top_predators_population: [0.030316, 0.739244]

The training dataset consists of 70 timesteps, validation and training dataset consists of 15 timesteps each.
```

Modelling goals: ```
* The parameters of the model will be optimized to an observed training dataset with the given simulator.
* The observed training dataset has very few samples, and the model must be able to generalize to unseen data.
```

Requirement Specification: ```
* The code generated should achieve the lowest possible validation loss, of 1e-6 or less.
* The code generated should be interpretable, and fit the dataset as accurately as possible.
```

Skeleton code to fill in: ```
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # TODO: Fill in the code here

    def forward(self, prey_population: torch.Tensor, intermediate_population: torch.Tensor, top_predators_population: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # TODO: Fill in the code here
        return (d_prey_population__dt, d_intermediate_population__dt, d_top_predators_population__dt)
```

Useful to know: ```
* You are a code evolving machine, and you will be called 20 times to generate code, and improve the code to achieve the lowest possible validation loss.
* The model defines the state differential and will be used with an ODE solver to fit the observed training dataset.
* You can use any parameters you want and any black box neural network components (multi-layer perceptrons); however, you have to define these.
* It is preferable to decompose the system into differential equations (compartments) if possible.
* You can use any unary functions, for example log, exp, power etc.
* Under no circumstance can you change the skeleton code function definitions, only fill in the code.
* The input tensors are vectors of shape (batch_size).
* Use initially white box models first and then switch to hybrid white and black box models for the residuals, only after no further best program iteration improvement with white box models.
* Make sure your code follows the exact code skeleton specification.
* Use PyTorch.
```

Think step-by-step, and then give the complete full working code. You are generating code for iteration 0 out of 20.

COVID-19
You will get a system description to code a differential equation simulator for.

System Description: ```
Prediction model of COVID-19 Epidemic Dynamics

Here you must model the state differential of susceptible, exposed, infected and recovered; with the input action of a constant total_population. There are no interventions applied. Here the states are normalized ratios of the total fixed population.

Description of the variables:
* susceptible: Ratio of the population that is susceptible to the virus.
* exposed: Ratio of the population that is exposed to the virus, not yet infectious.
* infected: Ratio of the population that is actively carrying and transmitting the virus.
* recovered: Ratio of the population that have recovered from the virus, including those who are deceased.
* total_population: Total population of the country, a constant.

The time units is in days.

Additionally these variables have the ranges of:
* susceptible: [0, 1]
* exposed: [0, 1]
* infected: [0, 1]
* recovered: [0, 1]
* total_population: [10000, 10000]

The training dataset consists of 24 countries, where each country is observed for 60 days.
```

Modelling goals: ```
* The parameters of the model will be optimized to an observed training dataset with the given simulator.
* The observed training dataset has very few samples, and the model must be able to generalize to unseen data.
```

Requirement Specification: ```
* The code generated should achieve the lowest possible validation loss, of 1e-10 or less.
* The code generated should be interpretable, and fit the dataset as accurately as possible.
```

Skeleton code to fill in: ```
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # TODO: Fill in the code here

    def forward(self, susceptible: torch.Tensor, exposed: torch.Tensor, infected: torch.Tensor, recovered: torch.Tensor, total_population: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        # TODO: Fill in the code here
        return (d_susceptible__dt, d_exposed__dt, d_infected__dt, d_recovered__dt)
```

Useful to know: ```
* You are a code evolving machine, and you will be called 20 times to generate code, and improve the code to achieve the lowest possible validation loss.
* The model defines the state differential and will be used with an ODE solver to fit the observed training dataset.
* You can use any parameters you want and any black box neural network components (multi-layer perceptrons); however, you have to define these.
* It is preferable to decompose the system into differential equations (compartments) if possible.
* You can use any unary functions, for example log, exp, power etc.
* Under no circumstance can you change the skeleton code function definitions, only fill in the code.
* The input tensors are vectors of shape (batch_size).
* Use initially white box models first and then switch to hybrid white and black box models for the residuals, only after no further best program iteration improvement with white box models.
* Make sure your code follows the exact code skeleton specification.
* Use PyTorch.
```

Think step-by-step, and then give the complete full working code. You are generating code for iteration 0 out of 20.

F Model Optimization Losses


We consider the optimization loss of mean squared error on a dataset $\mathcal{D}$ and also consider a higher fidelity mean squared error loss per component.

MSE Loss. Specifically, we optimize the following mean squared error objective,

$$\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{N \times T} \sum_{n=1}^{N} \sum_{i=0}^{T} \left\lVert f_{\theta,\omega(\theta)}\big(x^{(n)}(t_i), u^{(n)}(t_i), t_i\big)\,\Delta t - y^{(n)}(t_i) \right\rVert^2 \tag{5}$$

where $N \times T$ is the total number of state-action pairs in the dataset. For a given model we find the parameters $\theta^*$ that minimize this loss, i.e. $\theta^* = \arg\min_\theta \mathcal{L}(\theta, \mathcal{D}_{\text{train}})$. Here we optimize $\theta$ by stochastic gradient descent using the Adam optimizer [32]; however, we note that other optimization algorithms, such as black-box optimizers, could also be used.

MSE Loss per component. We seek to collect detailed quantitative statistics on how well the generated trained system model performs. Therefore, we collect the validation loss per component. Here, we use $(j)$ to indicate the predictions for the $j$-th component,

$$\omega_j(\theta^*, \mathcal{D}_{\text{val}}) = \frac{1}{N_{\text{val}} \times T_{\text{val}}} \sum_{n=1}^{N_{\text{val}}} \sum_{i=0}^{T_{\text{val}}} \Big( f^{(j)}_{\theta,\omega(\theta)}\big(x^{(n)}(t_i), u^{(n)}(t_i), t_i\big)\,\Delta t - y_j^{(n)}(t_i) \Big)^2 \tag{6}$$

and collect these scalar per-component validation losses into a vector $\omega = [\omega_1, \omega_2, \ldots, \omega_m]$, computing its mean as $v = \frac{1}{m}\sum_{j=1}^{m} \omega_j(\theta^*, \mathcal{D}_{\text{val}})$, i.e. the validation loss.
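As a minimal, illustrative sketch (assuming the model's predicted derivatives and targets have already been gathered into tensors over all state-action pairs; the helper names below are ours and not necessarily those of the actual implementation), Equation (5) can be optimized with Adam and Equation (6) evaluated per component as follows:

```python
import torch

def mse_loss(pred_derivatives: torch.Tensor, targets: torch.Tensor, dt: float) -> torch.Tensor:
    # pred_derivatives: f_theta(x, u, t) stacked over all N x T state-action pairs, shape (N*T, m)
    # targets: the corresponding y^(n)(t_i) values, same shape
    return ((pred_derivatives * dt - targets) ** 2).mean()          # Eq. (5)

def per_component_loss(pred_derivatives: torch.Tensor, targets: torch.Tensor, dt: float) -> torch.Tensor:
    # Returns the vector omega = [omega_1, ..., omega_m], one entry per state component (Eq. (6)).
    return ((pred_derivatives * dt - targets) ** 2).mean(dim=0)

def fit(model: torch.nn.Module, predict_fn, targets: torch.Tensor, dt: float,
        epochs: int = 1000, lr: float = 1e-2) -> torch.nn.Module:
    # predict_fn() re-evaluates the model on the training inputs and returns pred_derivatives.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = mse_loss(predict_fn(), targets, dt)
        loss.backward()
        optimizer.step()
    return model
```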

G Evaluation Metrics
We employ mean squared error (MSE) to evaluate the benchmark methods on a held-out test dataset of state-action trajectories, denoted as Dtest, using the loss defined in Equation (5), and report this as T_MSE. Each metric is averaged over ten runs with different random seeds, and we present these averages along with their 95% confidence intervals of the mean (footnote 7). For each random seed run, we generate a new train, validation, and test dataset independently when we have access to a simulator. Additionally, when sampling a dataset from a simulator, we collect validation and test sets with the same number of trajectories as was used to generate the training set. We then train each baseline on the training dataset and use the validation dataset for early stopping when the method supports this. We then evaluate the performance of each baseline on the test dataset. We repeat this process for each random seed run. We perform all experiments and training using a single Intel Core i9-12900K CPU @ 3.20GHz, 64GB RAM with an Nvidia RTX3090 GPU 24GB.
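For completeness, a small sketch of how the reported mean and 95% confidence interval of the mean can be computed from the per-seed test MSEs (the values below are placeholders, not results from the paper):

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    # values: per-seed test MSEs (one entry per random seed run)
    a = np.asarray(values, dtype=float)
    m = a.mean()
    sem = stats.sem(a)  # standard error of the mean
    h = sem * stats.t.ppf((1 + confidence) / 2.0, len(a) - 1)
    return m, h         # report as m +/- h

test_mses = [0.91, 1.05, 0.87, 0.99, 1.12, 0.95, 1.01, 0.89, 1.07, 0.93]  # hypothetical values
mean, half_width = mean_confidence_interval(test_mses)
print(f"T_MSE = {mean:.3f} +/- {half_width:.3f}")
```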

H Additional Results

H.1 Out-of-distribution Experiment and Setup

To explore out-of-distribution shifts, we adapt the Lung Cancer (with Chemo. & Radio.) simulator so that the training dataset contains tumor volumes in a range that lies entirely outside the tumor volume range within the test set over all trajectories. To do this, we sampled a training dataset with starting-state tumor volumes drawn from x ~ U(0, 574) and then collected 1,000 trajectories for 60 time steps. We then sampled a test dataset out of distribution from that seen in training, with a starting initial state of x ~ U(804, 1149). To ensure that the test state-action pairs are completely outside the training distribution, we slowed down the time interval of the simulator from per day to per hour, i.e., simulating at a resolution of ∆t = 1/24 days rather than ∆t = 1 day. We then also verified that the range of tumor volumes seen within the test set does not overlap at all with those in the training set, both throughout and at the end of each trajectory.
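A sketch of how this out-of-distribution split can be constructed is shown below; `simulate_trajectory` is a hypothetical stand-in for the Lung Cancer (with Chemo. & Radio.) simulator interface, so the exact call signature will differ in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(vol_low, vol_high, n_patients, dt, horizon_days, simulate_trajectory):
    """Roll out trajectories from uniformly sampled initial tumor volumes.

    `simulate_trajectory(x0, dt, n_steps)` is a hypothetical simulator interface.
    """
    n_steps = int(horizon_days / dt)
    initial_volumes = rng.uniform(vol_low, vol_high, size=n_patients)
    return [simulate_trajectory(x0, dt, n_steps) for x0 in initial_volumes]

# Training: in-distribution initial volumes, daily resolution.
# train_set = sample_dataset(0.0, 574.0, 1000, dt=1.0, horizon_days=60, simulate_trajectory=sim)
# Test: disjoint, larger initial volumes, hourly resolution (dt = 1/24 days).
# test_set = sample_dataset(804.0, 1149.0, 1000, dt=1.0 / 24.0, horizon_days=60, simulate_trajectory=sim)
```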

H.2 COVID-19 Unobserved Intervention Experiment and Setup

To assess whether HDTwinGen can understand and modify its HDTwin, we set up an experiment to test whether the modeling agent can adapt an already optimized, best-discovered HDTwin to an unobserved intervention, that is, a latent change in the true system dynamics. We constructed a scenario with our COVID-19 simulator featuring a realistic lockdown intervention, which reduces the physical distance between individuals. We implemented this in COVASIM so that the lockdown takes effect at day 19, dramatically altering how the COVASIM simulator behaves after the intervention is applied. A lockdown is known to reduce the effective contact rate β in an SEIR model [83]; for this intervention, it approximately corresponds to the β parameter reducing its value by 75% after the lockdown is applied. Importantly, we sampled the training and validation datasets from the simulator before the intervention was applied, and only sampled the test dataset after the intervention was applied.
We took the best-found trained HDTwin model for COVID-19, itself an SEIR model with black-box residual components, as seen in Appendix I. The agent was then specifically instructed to adapt this HDTwin to the unobserved intervention, which was described purely in words. The LLM was able to reason about it and correspondingly decreased the HDTwin's internal β parameter by approximately 70%, allowing it to model the unobserved intervention when rolling out from the current state at day 19. This shows the utility of such a hybrid DT framework. Interestingly, the competing methods, such as the black-box method, DyNODE, and SINDy, incorrectly continue the expected trajectory, as they are unaware that the underlying system has been intervened on and that its behavior now differs from what they were optimized to fit.
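The kind of edit the agent makes can be sketched as follows; the attribute name `beta` is an assumption based on the SEIR-style HDTwin in Appendix I, and the rollout call is only indicative:

```python
import torch

def apply_lockdown(hdtwin: torch.nn.Module, reduction: float = 0.70) -> None:
    # Scale down the learned contact-rate parameter of the trained SEIR-style HDTwin
    # to reflect a lockdown described only in text. `beta` is an assumed attribute name.
    with torch.no_grad():
        hdtwin.beta.mul_(1.0 - reduction)  # e.g. beta -> 0.3 * beta

# Rollout from the day-19 state would then proceed with the edited model, e.g.:
# states = odeint(hdtwin_derivative, state_at_day_19, t_span_after_day_19)
```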

H.3 HDTwinGen top-1 decreases over time

We observe in Figure 6 that, averaged over 10 random seed runs, the validation loss of the top-K HDTwins found at each iteration step decreases, implying that on average the HDTwin improves during the early stages of the iteration loop.

7 We use the code at [Link]compute-a-confidence-interval-from-sample-data to compute these.

[Figure 6: Average evolution over generations for HDTwinGen on the Lung Cancer (with Chemo. & Radio.) dataset; validation MSE (top-1, top-3, and mean) plotted against the number of generations.]
H.4 HDTwinGen Evolution


Here we list the specific HDTwins generated at the generation steps labeled in Figure 3.
Iteration 1. Val Loss: 4.414876937866211 Description: White box only.
Val Loss: 4.41 (Where the val loss per dimension is tumor_volume val loss: 2.77, chemotherapy_drug_concentration val loss: 6.06)
Iteration: 1
###
```
import torch
import torch.nn as nn
from typing import Tuple

class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Parameters for the tumor growth model
        self.alpha = nn.Parameter(torch.tensor(0.1))   # Tumor growth rate
        self.beta = nn.Parameter(torch.tensor(0.05))   # Tumor death rate due to chemotherapy
        self.gamma = nn.Parameter(torch.tensor(0.03))  # Tumor death rate due to radiotherapy

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor, chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Tumor volume change rate
        d_tumor_volume__dt = self.alpha * tumor_volume - self.beta * chemotherapy_drug_concentration * tumor_volume - self.gamma * radiotherapy_dosage * tumor_volume

        # Chemotherapy drug concentration change rate (assuming a simple model where the drug is administered and then decays)
        d_chemotherapy_drug_concentration__dt = chemotherapy_dosage - chemotherapy_drug_concentration

        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
```
optimized_parameters = {'alpha': 0.04550161585211754, 'beta': 0.02731170691549778, 'gamma': 0.0489218533039093}
###

Iteration 2. Val Loss: 4.233582019805908 Description: White box model with logistic growth for
tumor volume and decay rate for chemotherapy drug concentration.
Val Loss: 4.23 (Where the val loss per dimension is tumor_volume val loss: 8.47, chemotherapy_drug_concentration val loss: 8.95e-14)
Iteration: 2
###
```
import torch
import torch.nn as nn
from typing import Tuple

class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Parameters for the tumor growth model
        self.alpha = nn.Parameter(torch.tensor(0.1))     # Tumor growth rate
        self.beta = nn.Parameter(torch.tensor(0.05))     # Tumor death rate due to chemotherapy
        self.gamma = nn.Parameter(torch.tensor(0.03))    # Tumor death rate due to radiotherapy
        self.kappa = nn.Parameter(torch.tensor(1000.0))  # Carrying capacity of the environment
        self.delta = nn.Parameter(torch.tensor(0.01))    # Decay rate of the chemotherapy drug

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor, chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Logistic growth model for tumor volume
        d_tumor_volume__dt = self.alpha * tumor_volume * (1 - tumor_volume / self.kappa) - self.beta * chemotherapy_drug_concentration * tumor_volume - self.gamma * radiotherapy_dosage * tumor_volume

        # Chemotherapy drug concentration change rate with decay
        d_chemotherapy_drug_concentration__dt = chemotherapy_dosage - self.delta * chemotherapy_drug_concentration

        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
```
optimized_parameters = {'alpha': 0.05583872273564339, 'beta': 0.026076778769493103, 'gamma': 0.037345148622989655, 'kappa': 1015.9735107421875, 'delta': 0.5000000596046448}
###

Iteration 3. Val Loss: 2.902040958404541 Description: White box model with logistic growth for tu-
mor volume, interaction term for combined therapy effect, and bi-exponential decay for chemotherapy
drug concentration.
Val Loss: 2.9 (Where the val loss per dimension is tumor_volume val loss: 5.36, chemotherapy_drug_concentration val loss: 0.448)
Iteration: 3
###
```
import torch
import torch.nn as nn
from typing import Tuple

class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Parameters for the tumor growth model
        self.alpha = nn.Parameter(torch.tensor(0.05583872273564339))   # Tumor growth rate
        self.beta = nn.Parameter(torch.tensor(0.026076778769493103))   # Tumor death rate due to chemotherapy
        self.gamma = nn.Parameter(torch.tensor(0.037345148622989655))  # Tumor death rate due to radiotherapy
        self.kappa = nn.Parameter(torch.tensor(1015.9735107421875))    # Carrying capacity of the environment
        self.delta = nn.Parameter(torch.tensor(0.5000000596046448))    # Decay rate of the chemotherapy drug
        self.eta = nn.Parameter(torch.tensor(0.01))                    # Interaction term between chemotherapy and radiotherapy

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor, chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Logistic growth model for tumor volume with interaction term
        d_tumor_volume__dt = self.alpha * tumor_volume * (1 - tumor_volume / self.kappa) - self.beta * chemotherapy_drug_concentration * tumor_volume - self.gamma * radiotherapy_dosage * tumor_volume - self.eta * chemotherapy_drug_concentration * radiotherapy_dosage * tumor_volume

        # Bi-exponential decay model for chemotherapy drug concentration
        d_chemotherapy_drug_concentration__dt = chemotherapy_dosage - self.delta * chemotherapy_drug_concentration ** 2

        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
```
optimized_parameters = {'alpha': 0.0490952804684639, 'beta': 0.023840805515646935, 'gamma': 0.03170385956764221, 'kappa': 1032.3638916015625, 'delta': 0.06566499918699265, 'eta': 0.002417867537587881}
###

Iteration 4: Val loss: 1.2533262968063354 Description: White and black box model with logistic
growth for tumor volume, Michaelis-Menten saturation for chemotherapy effect, sigmoid function
for radiotherapy effect, resistance development, and a black box MLP for residuals.
Val Loss: 1.25 (Where the val loss per dimension is tumor_volume val loss: 2.49, chemotherapy_drug_concentration val loss: 0.0203)
Iteration: 4
###
```
import torch
import torch.nn as nn
from typing import Tuple

class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Parameters for the tumor growth model
        self.alpha = nn.Parameter(torch.tensor(0.0490952804684639))    # Tumor growth rate
        self.beta = nn.Parameter(torch.tensor(0.023840805515646935))   # Tumor death rate due to chemotherapy
        self.gamma = nn.Parameter(torch.tensor(0.03170385956764221))   # Tumor death rate due to radiotherapy
        self.kappa = nn.Parameter(torch.tensor(1032.3638916015625))    # Carrying capacity of the environment
        self.delta = nn.Parameter(torch.tensor(0.06566499918699265))   # Decay rate of the chemotherapy drug
        self.eta = nn.Parameter(torch.tensor(0.002417867537587881))    # Interaction term between chemotherapy and radiotherapy
        self.theta = nn.Parameter(torch.tensor(0.5))                   # Michaelis-Menten saturation constant for chemotherapy
        self.rho = nn.Parameter(torch.tensor(0.5))                     # Sigmoid steepness for radiotherapy effect
        self.zeta = nn.Parameter(torch.tensor(0.1))                    # Resistance development rate

        # Black box component for capturing residuals
        self.residual_mlp = nn.Sequential(
            nn.Linear(4, 10),
            nn.ReLU(),
            nn.Linear(10, 2)
        )

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor, chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Logistic growth model for tumor volume with interaction term and resistance
        resistance = 1 + self.zeta * tumor_volume
        d_tumor_volume__dt = self.alpha * tumor_volume * (1 - tumor_volume / self.kappa) - (self.beta * chemotherapy_drug_concentration * tumor_volume) / (self.theta + chemotherapy_drug_concentration) - self.gamma * radiotherapy_dosage * tumor_volume / (1 + torch.exp(-self.rho * (radiotherapy_dosage - 1))) - self.eta * chemotherapy_drug_concentration * radiotherapy_dosage * tumor_volume / resistance

        # Bi-exponential decay model for chemotherapy drug concentration
        d_chemotherapy_drug_concentration__dt = chemotherapy_dosage - self.delta * chemotherapy_drug_concentration ** 2

        # Black box residual component
        residuals = self.residual_mlp(torch.stack((tumor_volume, chemotherapy_drug_concentration, chemotherapy_dosage, radiotherapy_dosage), dim=1))

        # Combine white box model with residuals
        d_tumor_volume__dt += residuals[:, 0]
        d_chemotherapy_drug_concentration__dt += residuals[:, 1]

        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
```
optimized_parameters = {'alpha': 0.029592309147119522, 'beta': 0.27041515707969666, 'gamma': 0.08552337437868118, 'kappa': 1032.418701171875, 'delta': 0.03764118626713753, 'eta': 0.037968311458826065, 'theta': 5.688083648681641, 'rho': 0.444132924079895, 'zeta': 0.1205417588353157}
###

Iteration 5. Val loss: 0.1756206750869751 Description: White and black box model with dynamic
carrying capacity and decay rate, enhanced interaction term, and an expanded black box MLP for
residuals.
Val L o s s : 0 . 1 7 6 ( Where t h e v a l l o s s p e r d i m e n s i o n i s t u m o r _ v o l u m e v a l l o s s : 0 . 3 3 5 , c h e m o t h e r a p y _ d r u g _ c o n c e n t r a t i o n v a l l o s s : 0 . 0 1 6 3 )
Iteration : 5
###
‘‘‘
import torch
i m p o r t t o r c h . nn a s nn
from t y p i n g i m p o r t T u p l e

c l a s s S t a t e D i f f e r e n t i a l ( nn . Module ) :
def __init__ ( s e l f ) :
super ( S t a t e D i f f e r e n t i a l , s e l f ) . __init__ ()
# P a r a m e t e r s f o r t h e t u m o r g r o w t h model
s e l f . a l p h a = nn . P a r a m e t e r ( t o r c h . t e n s o r ( 0 . 0 2 9 5 9 2 3 0 9 1 4 7 1 1 9 5 2 2 ) ) # Tumor g r o w t h r a t e
s e l f . b e t a = nn . P a r a m e t e r ( t o r c h . t e n s o r ( 0 . 2 7 0 4 1 5 1 5 7 0 7 9 6 9 6 6 6 ) ) # Tumor d e a t h r a t e due t o c h e m o t h e r a p y
s e l f . gamma = nn . P a r a m e t e r ( t o r c h . t e n s o r ( 0 . 0 8 5 5 2 3 3 7 4 3 7 8 6 8 1 1 8 ) ) # Tumor d e a t h r a t e due t o r a d i o t h e r a p y
        self.kappa_base = nn.Parameter(torch.tensor(1032.418701171875))  # Base carrying capacity of the environment
        self.kappa_mod = nn.Parameter(torch.tensor(0.1))  # Modifier for carrying capacity based on treatment
        self.delta_base = nn.Parameter(torch.tensor(0.03764118626713753))  # Base decay rate of the chemotherapy drug
        self.delta_mod = nn.Parameter(torch.tensor(0.1))  # Modifier for decay rate based on tumor volume
        self.eta = nn.Parameter(torch.tensor(0.037968311458826065))  # Interaction term between chemotherapy and radiotherapy
        self.theta = nn.Parameter(torch.tensor(5.688083648681641))  # Michaelis-Menten saturation constant for chemotherapy
        self.rho = nn.Parameter(torch.tensor(0.444132924079895))  # Sigmoid steepness for radiotherapy effect
        self.zeta = nn.Parameter(torch.tensor(0.1205417588353157))  # Resistance development rate

        # Black box component for capturing residuals
        self.residual_mlp = nn.Sequential(
            nn.Linear(4, 20),
            nn.ReLU(),
            nn.Linear(20, 20),
            nn.ReLU(),
            nn.Linear(20, 2)
        )

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor,
                chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Adjusted carrying capacity based on treatment
        kappa = self.kappa_base + self.kappa_mod * (chemotherapy_dosage + radiotherapy_dosage)

        # Adjusted decay rate based on tumor volume
        delta = self.delta_base + self.delta_mod * tumor_volume

        # Logistic growth model for tumor volume with interaction term and resistance
        resistance = 1 + self.zeta * tumor_volume
        d_tumor_volume__dt = (
            self.alpha * tumor_volume * (1 - tumor_volume / kappa)
            - (self.beta * chemotherapy_drug_concentration * tumor_volume) / (self.theta + chemotherapy_drug_concentration)
            - self.gamma * radiotherapy_dosage * tumor_volume / (1 + torch.exp(-self.rho * (radiotherapy_dosage - 1)))
            - self.eta * chemotherapy_drug_concentration * radiotherapy_dosage * tumor_volume / resistance
        )

        # Non-linear decay model for chemotherapy drug concentration
        d_chemotherapy_drug_concentration__dt = chemotherapy_dosage - delta * torch.pow(chemotherapy_drug_concentration, 1.5)

        # Black box residual component
        residuals = self.residual_mlp(torch.stack((tumor_volume, chemotherapy_drug_concentration,
                                                   chemotherapy_dosage, radiotherapy_dosage), dim=1))

        # Combine white box model with residuals
        d_tumor_volume__dt += residuals[:, 0]
        d_chemotherapy_drug_concentration__dt += residuals[:, 1]

        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
‘‘‘
optimized_parameters = {'alpha': 0.02282065525650978, 'beta': 0.343279629945755, 'gamma': 0.08097328990697861, 'kappa_base': 1031.5604248046875, 'kappa_mod': -2.13154935836792, 'delta_base': -0.02879924327135086, 'delta_mod': 0.00033023624564521015, 'eta': 0.0037569233682006598, 'theta': 8.673722267150879, 'rho': 0.37428149580955505, 'zeta': 0.156700998544693}
###

Iteration 12. Val loss: 0.07228073477745056 Description: White and black box model with logistic
growth for tumor volume, Michaelis-Menten saturation for chemotherapy effect, sigmoid function
for radiotherapy effect, resistance development, and a simplified black box MLP for residuals.
Val Loss: 0.0723 (Where the val loss per dimension is tumor_volume val loss: 0.133, chemotherapy_drug_concentration val loss: 0.012)
Iteration: 11
###
‘‘‘
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Parameters for the tumor growth model
        self.alpha = nn.Parameter(torch.tensor(0.03))  # Tumor growth rate
        self.beta = nn.Parameter(torch.tensor(0.4))  # Tumor death rate due to chemotherapy
        self.gamma = nn.Parameter(torch.tensor(0.08))  # Tumor death rate due to radiotherapy
        self.kappa_base = nn.Parameter(torch.tensor(1030.0))  # Base carrying capacity of the environment
        self.kappa_mod = nn.Parameter(torch.tensor(-2.0))  # Modifier for carrying capacity based on treatment
        self.delta_base = nn.Parameter(torch.tensor(0.1))  # Base decay rate of the chemotherapy drug
        self.delta_mod = nn.Parameter(torch.tensor(0.0003))  # Modifier for decay rate based on tumor volume
        self.eta = nn.Parameter(torch.tensor(0.004))  # Interaction term between chemotherapy and radiotherapy
        self.theta = nn.Parameter(torch.tensor(10.0))  # Michaelis-Menten saturation constant for chemotherapy
        self.rho = nn.Parameter(torch.tensor(0.3))  # Sigmoid steepness for radiotherapy effect
        self.zeta = nn.Parameter(torch.tensor(0.15))  # Resistance development rate

        # Black box component for capturing residuals
        self.residual_mlp = nn.Sequential(
            nn.Linear(4, 16),
            nn.LeakyReLU(0.1),
            nn.Linear(16, 8),
            nn.LeakyReLU(0.1),
            nn.Linear(8, 2)
        )

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor,
                chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Adjusted carrying capacity based on treatment
        kappa = self.kappa_base + self.kappa_mod * (chemotherapy_dosage + radiotherapy_dosage)

        # Adjusted decay rate based on tumor volume
        delta = self.delta_base + self.delta_mod * tumor_volume

        # Logistic growth model for tumor volume with interaction term and resistance
        resistance = 1 + self.zeta * tumor_volume
        d_tumor_volume__dt = (
            self.alpha * tumor_volume * (1 - tumor_volume / kappa)
            - (self.beta * chemotherapy_drug_concentration * tumor_volume) / (self.theta + chemotherapy_drug_concentration)
            - self.gamma * radiotherapy_dosage * tumor_volume / (1 + torch.exp(-self.rho * (radiotherapy_dosage - 1)))
            - self.eta * chemotherapy_drug_concentration * radiotherapy_dosage * tumor_volume / resistance
        )

        # Non-linear decay model for chemotherapy drug concentration
        d_chemotherapy_drug_concentration__dt = chemotherapy_dosage - delta * torch.pow(chemotherapy_drug_concentration, 1.5)

        # Black box residual component
        residuals = self.residual_mlp(torch.stack((tumor_volume, chemotherapy_drug_concentration,
                                                   chemotherapy_dosage, radiotherapy_dosage), dim=1))

        # Combine white box model with residuals
        d_tumor_volume__dt += residuals[:, 0]
        d_chemotherapy_drug_concentration__dt += residuals[:, 1]

        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)
‘‘‘
optimized_parameters = {'alpha': 0.016903197392821312, 'beta': 0.5582640171051025, 'gamma': 0.08495774865150452, 'kappa_base': 1029.802490234375, 'kappa_mod': -2.4545891284942627, 'delta_base': 0.12832798063755035, 'delta_mod': -3.1435782148037106e-05, 'eta': 0.006932476069778204, 'theta': 13.699580192565918, 'rho': 0.24815633893013, 'zeta': 0.13727830350399017}
###

H.5 HDTwinGen Ablation No Memory

Ablation Study. We also ablate HDTwinGen by removing its memory, retaining only the last hybrid model it generated. We observe decreased performance, as shown in Table 4.

Table 4: HDTwinGen Ablation

Lung Cancer (with Chemo. & Radio.)
Method TMSE ↓
HDTwinGen 0.0889±0.0453
HDTwinGen-no-memory 17.6±215

H.6 Evaluating Different LLMs

We performed a complete re-run of our main experiments under the same settings, now using a different LLM, GPT-3.5, within our HDTwinGen framework. These new results are included in Table 5. HDTwinGen can operate fully with a less capable LLM; however, the performance of the generated and discovered models correlates with the capability of the underlying LLM, as expected.

Table 5: Ablation of using different LLMs. Test MSE (TMSE) averaged over ten random seeds. HDTwinGen is capable of using other LLMs; however, the best results are obtained with better-performing LLMs (e.g. GPT-4). The results are presented with ± indicating 95% confidence intervals.
Lung Cancer Lung Cancer (with Chemo.) Lung Cancer (with Chemo. & Radio.) Hare-Lynx Plankton Microcosm COVID-19
Method TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓
HDTwinGen (GPT-3.5) 2.89e+03±6.17e+03 151±57.8 46.4±52.1 298±19.2 0.0007±0.000233 56±28.2
HDTwinGen (GPT-4) 4.41±8.07 0.0889±0.0453 0.131±0.198 291±30.3 2.51e-06±2.2e-06 1.72±2.28

We also explored the effect of changing the LLM's temperature hyperparameter, here using GPT-4 and varying the temperature over 0.0, 0.7, and 2.0, where 0.7 is the value used throughout all our experiments (Appendix E). As outlined in Table 6, we observe that HDTwinGen is still able to operate with different underlying LLM temperatures.

H.7 Prompt Ablations with Varying Amounts of Prior Information

We conducted a complete re-run of our main experiments, ablating the prompt that provides prior information in the form of a textual prior to HDTwinGen. We provide the ablation results in Table 7.

Table 6: Ablation of varying the LLM's temperature. HDTwinGen can still operate with different LLM temperatures. Reporting the test prediction MSE (TMSE) averaged over ten random seeds.
Lung Cancer Lung Cancer (with Chemo.) Lung Cancer (with Chemo. & Radio.) Hare-Lynx Plankton Microcosm COVID-19
Method TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓
HDTwinGen (Temp 0.0) 8.84±17.8 0.353±0.936 0.414±1.12 258±32.9 4.06e-05±0.000102 0.0461±2.37
HDTwinGen (Temp 0.7) 4.41±8.07 0.0889±0.0453 0.131±0.198 291±30.3 2.51e-06±2.2e-06 1.72±2.28
HDTwinGen (Temp 2.0) 1.05±2.27 0.548±0.698 8.43±26.4 249±53 2.32e-06±3.26e-06 0.0447±3.95

Specifically, the prompt, as outlined in Appendix E, is structured to include separate {system description}, {skeleton code}, and {useful to know} components. Prior information is conveyed through the {system description}, which describes the system to be modeled, its features, and their ranges; minor system prior information is also conveyed through the {skeleton code}, as this includes task-specific feature names as input variables. The other components of the prompt (e.g. {useful to know}) do not include any task-specific information; they provide general instructions that make the framework work, such as instructing the LLM to generate a PyTorch model as code in its response.
We ablate these components of the prompt: first, by removing the task-specific prior {system description}, labeled HDTwinGen (Partial Context); second, by removing all task-specific priors, i.e., both {system description} and {skeleton code} (where we change the feature names to meaningless names such as x1, x2, etc.), labeled HDTwinGen (No Context); and third, by removing only the {useful to know} information that helps the framework, such as instructions to decompose the system and to combine white-box models with black-box models for the white-box model residuals, labeled HDTwinGen (No Instructions). We observe in the tabulated results (Table 7) that HDTwinGen can still operate without any task-specific prior information; however, having a textual prior aids in generating better-performing models, and partially removing HDTwinGen's operation instructions makes it generate slightly weaker models.

Table 7: Prompt Ablations with Varying Amounts of Prior Information. Test MSE TMSE averaged over ten random seeds. HDTwinGen can still operate without any task-specific prior information; however, having a textual prior aids in generating better-performing models. The results are averaged over ten random seeds, with ± indicating 95% confidence intervals.
Lung Cancer Lung Cancer (with Chemo.) Lung Cancer (with Chemo. & Radio.) Hare-Lynx Plankton Microcosm COVID-19
Method TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓
HDTwinGen (Partial Context) 6.77±6.4 0.601±1.83 0.061±0.159 277±54.3 3.9e-06±8.87e-06 2.3±4.24
HDTwinGen (No Context) 30.3±50.3 2.57±2.44 1.52±1.98 297±51.9 6.91e-06±4.11e-06 5.12e+10±1.31e+11
HDTwinGen (No Instructions) 2.31±2.78 0.0933±0.287 0.212±0.0487 313±63.1 0.0016±0.00407 17.3±47.7
HDTwinGen 4.41±8.07 0.0889±0.0453 0.131±0.198 291±30.3 2.51e-06±2.2e-06 1.72±2.28

H.8 Domain-Specific Baselines

We performed a complete re-run on our main datasets using domain-specific white-box baselines, as determined by a human expert, as shown in Table 8. Specifically, we fit an SEIR model [84] for COVID-19 modeling, a Lotka–Volterra model for predator-prey population dynamics (Hare-Lynx & Plankton Microcosm datasets) [85], and a logistic tumor growth model with chemotherapy and radiotherapy effects for the lung cancer datasets. HDTwinGen still models the systems most accurately, achieving the lowest test prediction MSE on the held-out test dataset of state-action trajectories.
SEIR Model for COVID-19 Modeling. The SEIR model is a compartmental model used in
epidemiology to simulate how a disease spreads through a population. It divides the population into
four compartments: susceptible (S), exposed (E), infectious (I), and recovered (R). The transitions
between these compartments are governed by the following differential equations:

\frac{dS}{dt} = -\beta \frac{SI}{N}, \qquad
\frac{dE}{dt} = \beta \frac{SI}{N} - \sigma E, \qquad
\frac{dI}{dt} = \sigma E - \gamma I, \qquad
\frac{dR}{dt} = \gamma I,

Table 8: Domain-specific baselines compared to HDTwinGen. (Test MSE TMSE averaged over ten random seeds)
Method Lung Cancer Lung Cancer (with Chemo.) Lung Cancer (with Chemo. & Radio.) Hare-Lynx Plankton Microcosm COVID-19
Domain Specific Baselines Description Logistic Tumor Growth Logistic Tumor Growth (with Chemo.) Logistic Tumor Growth (with Chemo. & Radio.) Lotka–Volterra Multi-species Lotka–Volterra SEIR
Domain Specific Baselines 904±162 200±71.1 6.39±0.637 346±6.7 0.0127±0.00203 7.88±0.046
HDTwinGen 4.41±8.07 0.0889±0.0453 0.131±0.198 291±30.3 2.51e-06±2.2e-06 1.72±2.28

where N is the total population (assumed constant), β is the transmission rate, σ is the rate at which
exposed individuals become infectious, and γ is the recovery rate. These parameters are crucial for
capturing the dynamics of the disease spread and are estimated from data or literature.
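For concreteness, a minimal sketch of how such an SEIR baseline can be written in the same PyTorch state-differential style used throughout this appendix is shown below; the class name and the initial parameter values are illustrative placeholders, not the exact fitted baseline.
```python
import torch
import torch.nn as nn
from typing import Tuple

class SEIRStateDifferential(nn.Module):
    """Minimal SEIR baseline sketch; initial parameter values are illustrative placeholders."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(0.3))   # Transmission rate
        self.sigma = nn.Parameter(torch.tensor(0.2))  # Rate at which exposed individuals become infectious
        self.gamma = nn.Parameter(torch.tensor(0.1))  # Recovery rate

    def forward(self, S: torch.Tensor, E: torch.Tensor, I: torch.Tensor, R: torch.Tensor,
                N: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        # SEIR compartmental equations with constant total population N
        dS__dt = -self.beta * S * I / N
        dE__dt = self.beta * S * I / N - self.sigma * E
        dI__dt = self.sigma * E - self.gamma * I
        dR__dt = self.gamma * I
        return (dS__dt, dE__dt, dI__dt, dR__dt)
```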
Lotka–Volterra Model for Predator-Prey Dynamics. The Lotka-Volterra model describes the
dynamics of biological systems in which two species interact, one as a predator and the other as prey.
The model is represented by a set of first-order, non-linear, differential equations:
\frac{dx}{dt} = \alpha x - \beta x y, \qquad
\frac{dy}{dt} = \delta x y - \gamma y,
where x and y represent the prey and predator populations, respectively. The parameters α, β, γ, and
δ denote the prey reproduction rate, the predation rate upon the prey, the predator mortality rate, and
the rate at which predators increase by consuming prey, respectively.
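A minimal sketch of this two-species baseline in the same PyTorch style is shown below; the class name and initial parameter values are illustrative assumptions, not the fitted values.
```python
import torch
import torch.nn as nn
from typing import Tuple

class LotkaVolterraStateDifferential(nn.Module):
    """Minimal two-species Lotka-Volterra baseline sketch; initial values are illustrative."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # Prey reproduction rate
        self.beta = nn.Parameter(torch.tensor(0.02))   # Predation rate upon the prey
        self.gamma = nn.Parameter(torch.tensor(0.5))   # Predator mortality rate
        self.delta = nn.Parameter(torch.tensor(0.01))  # Predator increase per unit of prey consumed

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # x: prey population, y: predator population
        dx__dt = self.alpha * x - self.beta * x * y
        dy__dt = self.delta * x * y - self.gamma * y
        return (dx__dt, dy__dt)
```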
Lotka-Volterra Triple Species Model. The extended Lotka-Volterra model incorporating a third
species involves additional interactions that can represent various ecological relationships such as
competition, predation, or mutualism. For the sake of illustration, let’s consider a system with two
predators and one prey. The model is described by the following set of differential equations:
\frac{dx}{dt} = x(\alpha - \beta y - \delta z), \qquad
\frac{dy}{dt} = y(-\gamma + \epsilon x), \qquad
\frac{dz}{dt} = z(-\mu + \nu x),
where x represents the prey population, y and z represent the two predator populations, α is the natural growth rate of the prey in the absence of predation, β and δ are the predation rates of the first and second predators on the prey, respectively, γ and µ are the natural death rates of the first and second predators, respectively, in the absence of the prey, and ϵ and ν are the growth rates of the first and second predators per unit of prey consumed.
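A corresponding sketch for this three-species case, with illustrative placeholder values, could look as follows.
```python
import torch
import torch.nn as nn
from typing import Tuple

class TriSpeciesLotkaVolterra(nn.Module):
    """Illustrative three-species extension (prey x, two predators y and z); values are placeholders."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))     # Prey growth rate
        self.beta = nn.Parameter(torch.tensor(0.02))     # Predation rate of the first predator on the prey
        self.delta = nn.Parameter(torch.tensor(0.02))    # Predation rate of the second predator on the prey
        self.gamma = nn.Parameter(torch.tensor(0.4))     # Death rate of the first predator
        self.epsilon = nn.Parameter(torch.tensor(0.01))  # Growth of the first predator per unit of prey
        self.mu = nn.Parameter(torch.tensor(0.4))        # Death rate of the second predator
        self.nu = nn.Parameter(torch.tensor(0.01))       # Growth of the second predator per unit of prey

    def forward(self, x: torch.Tensor, y: torch.Tensor,
                z: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        dx__dt = x * (self.alpha - self.beta * y - self.delta * z)
        dy__dt = y * (-self.gamma + self.epsilon * x)
        dz__dt = z * (-self.mu + self.nu * x)
        return (dx__dt, dy__dt, dz__dt)
```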
Logistic Tumor Growth Model with Treatment Effects. The logistic tumor growth model with
chemotherapy and radiotherapy effects incorporates the logistic growth model’s capacity to simulate
the saturation effect observed in tumor growth, alongside treatment effects. The model can be
described as:
 
\frac{dN}{dt} = r N \left(1 - \frac{N}{K}\right) - C(N) - R(N),
where N is the tumor cell population, r is the intrinsic growth rate of the tumor, and K is the carrying
capacity of the environment. C(N ) and R(N ) represent the effects of chemotherapy and radiotherapy
on the tumor cell population, respectively. These treatment functions are often modeled based on
dose-response curves and can vary depending on the specific drugs and radiation doses used.
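The sketch below illustrates one possible realization of this baseline, assuming simple linear dose-response forms for C(N) and R(N); the class name, treatment forms, and parameter values are assumptions for illustration and may differ from the exact expert-specified baseline.
```python
import torch
import torch.nn as nn

class LogisticTumorStateDifferential(nn.Module):
    """Illustrative logistic tumor growth baseline with assumed linear dose-response treatment terms."""
    def __init__(self):
        super().__init__()
        self.r = nn.Parameter(torch.tensor(0.1))        # Intrinsic tumor growth rate
        self.K = nn.Parameter(torch.tensor(1000.0))     # Carrying capacity of the environment
        self.beta_c = nn.Parameter(torch.tensor(0.05))  # Chemotherapy kill coefficient
        self.beta_r = nn.Parameter(torch.tensor(0.03))  # Radiotherapy kill coefficient

    def forward(self, N: torch.Tensor, chemo_concentration: torch.Tensor,
                radio_dosage: torch.Tensor) -> torch.Tensor:
        C_N = self.beta_c * chemo_concentration * N  # Assumed linear form of C(N)
        R_N = self.beta_r * radio_dosage * N         # Assumed linear form of R(N)
        dN__dt = self.r * N * (1 - N / self.K) - C_N - R_N
        return dN__dt
```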

H.9 Procedurally Generated Synthetic Model Benchmark

We performed a complete re-run of our main baselines on a new, entirely procedurally generated synthetic model benchmark. Procedurally generating synthetic models allows us to test how HDTwinGen performs when the LLM has never seen such a model. To create diverse synthetic models, we modified the structure of the underlying lung cancer (with chemo. & radio.) model to incorporate non-biological random modifications, including trigonometric operators and division operators. In the following, we provide the exact changes made and the structure of each underlying equation.

Synthetic 1 (inc. γ sin(ωt)). Here the underlying equation is
\frac{dx(t)}{dt} = \left[ \rho \log\left(\frac{K}{x(t)}\right) - \beta_c C(t) - \left(\alpha_r d(t) + \beta_r d(t)^2\right) + \gamma \sin(\omega t) \right] x(t)

Synthetic 2 (inc. −δI(t)).
\frac{dx(t)}{dt} = \left[ \rho \log\left(\frac{K}{x(t)}\right) - \beta_c C(t) - \left(\alpha_r d(t) + \beta_r d(t)^2\right) - \delta I(t) \right] x(t)

Synthetic 3 (inc. log(K/(x(t)+N(t)))).
\frac{dx(t)}{dt} = \left[ \rho \log\left(\frac{K}{x(t)+N(t)}\right) - \beta_c C(t) - \left(\alpha_r d(t) + \beta_r d(t)^2\right) \right] x(t)

Synthetic 4 (inc. ϵ cos(ϕt)).
\frac{dx(t)}{dt} = \left[ \rho \log\left(\frac{K}{x(t)}\right) - \beta_c C(t) - \left(\alpha_r d(t) + \beta_r d(t)^2\right) + \epsilon \cos(\phi t) \right] x(t)

Synthetic 5 (inc. θC(t)d(t)).
\frac{dx(t)}{dt} = \left[ \rho \log\left(\frac{K}{x(t)}\right) - \beta_c C(t) - \left(\alpha_r d(t) + \beta_r d(t)^2\right) - \theta C(t) d(t) \right] x(t)

We observe in the tabulated results in Table 9 that HDTwinGen can still generate models that perform
well.

Table 9: Procedurally Generated Synthetic Model Benchmark. Test MSE TM SE averaged over
ten random seeds. Reporting the test prediction MSE (TM SE ) of the produced system models on
held-out test datasets across all synthetic datasets. HDTwinGen achieves the lowest test prediction
error. The results are averaged over ten random seeds, with ± indicating 95% confidence intervals.
Synthetic 1 (inc. γ sin(ωt)) Synthetic 2 (inc. −δI(t)) Synthetic 3 (inc. log(K/(x(t)+N(t)))) Synthetic 4 (inc. ϵ cos(ϕt)) Synthetic 5 (inc. θC(t)d(t))
Method TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓
DyNODE 65.9±5.82 17.6±15.3 12.8±5.32 63±4.28 15.1±8.64
SINDy 69±1.87 16.2±0.972 13.7±0.574 68.4±1.45 14.2±0.598
ZeroShot 6.05e+03±3.77e+03 1.18e+04±2.25e+04 6e+03±4.16e+03 3.86e+03±3.73e+03 4.56e+03±3.84e+03
ZeroOptim 56.2±1.96 14.5±1.17 1.82±0.774 56.9±1.82 3.07±1.11
HDTwinGen 54.2±2.55 0.0707±0.113 0.245±0.377 54.8±1.98 0.0683±0.0464

H.10 Interpretability Scale, Performance of only White-Box Models

To investigate how well the white-box models that HDTwinGen generates perform, we explore an ablation of HDTwinGen in which the generated models are constrained to be white-box only, i.e., mathematical equations with no black-box neural network components. We tabulate the results in Table 10 and observe that even when constrained to generate only white-box models (HDTwinGen (Only White-Box)), HDTwinGen still performs acceptably, indicating that the generated white-box models capture the underlying system well when fitted to the dataset.

Table 10: Interpretability Scale, Performance of only White-Box Models. Reporting the test
prediction MSE (TM SE ) of the produced system models on held-out test datasets across all benchmark
datasets. The results are averaged over ten random seeds, with ± indicating 95% confidence intervals.
Lung Cancer Lung Cancer (with Chemo.) Lung Cancer (with Chemo. & Radio.) Hare-Lynx Plankton Microcosm COVID-19
Method TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓ TM SE ↓
HDTwinGen (Only White-Box) 59.4±101 4.8±11.8 2.42±2.02 337±25.4 0.000111±0.000125 5.92±1.17
HDTwinGen (White-Box & Black-box residuals) 4.41±8.07 0.0889±0.0453 0.131±0.198 291±30.3 2.51e-06±2.2e-06 1.72±2.28

H.11 HDTwinGen Flexibly Integrates Expert-in-the-loop Feedback

Experts play an active role in model development in two main ways:

1. Initial prompt. Experts can describe the system, and specify modeling goals and task-specific requirements through S_context (Appendix E).
2. Direct model feedback. In each iteration of HDTwinGen, the expert can provide direct feedback through H to guide model improvement.

To further demonstrate the flexibility of expert involvement in the modeling process, we supply two
additional experiments:

1. The expert specifies instructions to produce a fully white-box model through S_context; this experiment is provided in Appendix H.10.
2. The expert provides specific feedback on model improvement during the development process through H. We performed this experiment by pausing HDTwinGen during its iterations, providing targeted human expert feedback, and then observing whether the subsequently generated model incorporated the feedback by making targeted changes to the underlying model. We confirm that this is the case, which indicates that the underlying LLM is able to interpret the HDTwin code and selectively change parts of it. Figure 7 illustrates this result.

[Figure 7 content: the left panel shows the HDTwin code at Iteration 5 (validation loss 4.486); the expert feedback box reads "Create a model to predict changes in tumor volume and chemotherapy drug concentration in NSCLC patients, using differential equations that incorporate logistic tumor growth with a log term, chemotherapy, and radiotherapy effects, with parameters estimated from patient data."; the right panel shows the revised code at Iteration 6 (validation loss 0.948), which incorporates the requested logistic growth with a log term, a carrying capacity for the tumor environment, and a chemotherapy clearance rate.]

Figure 7: HDTwinGen can flexibly integrate expert-in-the-loop feedback, if it is provided.

H.12 HDTwinGen Accelerates Model Development and Enhances Performance

We seek to determine the runtime of HDTwinGen model development compared to the human-clock time of a human (Mechanical Turk) experiment in which participants are asked to refine models, and how performant the resulting models are compared to Bayesian optimization (BO) for a fixed model budget. To assess this, we performed two additional experiments:

1. Human-driven model development: We hired two experienced software engineers (Mechanical Turk workers) to develop and refine models, providing them with exactly the same prompts that HDTwinGen uses, i.e., using a human instead of the LLM in the model development loop.
2. AutoML: Hyperparameter tuning (HPT) for DyNODE and SINDy using BO (HPT search space detailed in Table 11).

We performed these two experiments on the Lung Cancer (with Chemo. & Radio.) dataset, and the results are tabulated in Table 12. We report the time to generate an individual model (which includes generating the model and training time), the test MSE TMSE after one hour of run-time, and the test MSE separately for a budget of only 15 model evaluations.
Analyzing the results provides the following insights:

1. HDTwinGen Makes Model Development Notably Faster Compared to Human-Clock Time: HDTwinGen takes an average of 45.56 minutes to complete an experiment using 15 model evaluations/generations, whereas the human experts took 148.1 minutes (2 hours and 28 minutes) to generate and iterate over 15 models. Using HDTwinGen is therefore considerably faster in human-clock time, which is a further advantage of HDTwinGen.

2. HDTwinGen Generates Better Performing Models for a Fixed Model Budget: HDTwinGen generates better-performing models than both the human experts and Bayesian hyperparameter optimization (HPO) of the DyNODE and SINDy baselines for a fixed budget of 15 model evaluations. Figure 8 plots the best-performing model against each generation, showing that across model evaluations HDTwinGen consistently produces well-performing models.

Table 11: HPT details. BO HPT search space for Figure 8.
Method Hyperparameter Search Space Best Params
DyNode learning_rate [1e-5, 1e-1] 0.0123
DyNode weight_decay [1e-5, 1e-1] 0.00029
DyNode hidden_dim [32, 1024] 788
DyNode model_activation [tanh, silu, ELU] tanh
DyNode model_initialization [xavier, normal] normal
SINDy polynomial_library_degree [1, 3] 2
SINDy polynomial_library_interaction_only [True, False] True
SINDy threshold [1e-5, 1e-1] 0.0194
SINDy alpha [1e-5, 1e-1] 0.0015

Table 12: Method performance comparison. Reporting the time to generate a model (in minutes), the test MSE (TMSE) after one hour of run-time, and the test MSE with a budget of only 15 model evaluations. Here ∗ indicates that 73.11% of iteration time is consumed by LLM querying.
Method Time to generate a model (Minutes) ↓ Test MSE TMSE (After one hour of run-time) ↓ Test MSE TMSE (Budget of only 15 model evaluations) ↓
HPO for DyNODE 0.37 1.122 2.209
HPO for SINDy 0.16 13.225 13.245
Human Experts (Mechanical Turk) 9.875 68.004 2.209
HDTwinGen 3.037∗ 0.072 0.072

Figure 8: Best-performing model performance against each generation, for the setup in Appendix H.12. HDTwinGen Generates Better Performing Models for a Fixed Model Budget. [The plot shows Test MSE (log scale) against the number of generations (2 to 14) for NSDT (HDTwinGen), Human Experts, HPO for DyNode, and HPO for SINDy.]

I Hybrid Model Output Examples
These are the final discovered hybrid models generated by our method, HDTwinGen, for each respective environment.
Cancer (with Chemo & Radio)
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Parameters for the tumor growth model
        self.alpha = nn.Parameter(torch.tensor(0.03))  # Tumor growth rate
        self.beta = nn.Parameter(torch.tensor(0.4))  # Tumor death rate due to chemotherapy
        self.gamma = nn.Parameter(torch.tensor(0.08))  # Tumor death rate due to radiotherapy
        self.kappa_base = nn.Parameter(torch.tensor(1030.0))  # Base carrying capacity of the environment
        self.kappa_mod = nn.Parameter(torch.tensor(-2.0))  # Modifier for carrying capacity based on treatment
        self.delta_base = nn.Parameter(torch.tensor(0.1))  # Base decay rate of the chemotherapy drug
        self.delta_mod = nn.Parameter(torch.tensor(0.0003))  # Modifier for decay rate based on tumor volume
        self.eta = nn.Parameter(torch.tensor(0.004))  # Interaction term between chemotherapy and radiotherapy
        self.theta = nn.Parameter(torch.tensor(10.0))  # Michaelis-Menten saturation constant for chemotherapy
        self.rho = nn.Parameter(torch.tensor(0.3))  # Sigmoid steepness for radiotherapy effect
        self.zeta = nn.Parameter(torch.tensor(0.15))  # Resistance development rate

        # Black box component for capturing residuals
        self.residual_mlp = nn.Sequential(
            nn.Linear(4, 16),
            nn.LeakyReLU(0.1),
            nn.Linear(16, 8),
            nn.LeakyReLU(0.1),
            nn.Linear(8, 2),
        )

    def forward(self, tumor_volume: torch.Tensor, chemotherapy_drug_concentration: torch.Tensor,
                chemotherapy_dosage: torch.Tensor, radiotherapy_dosage: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Adjusted carrying capacity based on treatment
        kappa = self.kappa_base + self.kappa_mod * (chemotherapy_dosage + radiotherapy_dosage)

        # Adjusted decay rate based on tumor volume
        delta = self.delta_base + self.delta_mod * tumor_volume

        # Logistic growth model for tumor volume with interaction term and resistance
        resistance = 1 + self.zeta * tumor_volume
        d_tumor_volume__dt = (
            self.alpha * tumor_volume * (1 - tumor_volume / kappa)
            - (self.beta * chemotherapy_drug_concentration * tumor_volume) / (self.theta + chemotherapy_drug_concentration)
            - self.gamma * radiotherapy_dosage * tumor_volume / (1 + torch.exp(-self.rho * (radiotherapy_dosage - 1)))
            - self.eta * chemotherapy_drug_concentration * radiotherapy_dosage * tumor_volume / resistance
        )

        # Non-linear decay model for chemotherapy drug concentration
        d_chemotherapy_drug_concentration__dt = chemotherapy_dosage - delta * torch.pow(chemotherapy_drug_concentration, 1.5)

        # Black box residual component
        residuals = self.residual_mlp(torch.stack((tumor_volume, chemotherapy_drug_concentration,
                                                   chemotherapy_dosage, radiotherapy_dosage), dim=1))

        # Combine white box model with residuals
        d_tumor_volume__dt += residuals[:, 0]
        d_chemotherapy_drug_concentration__dt += residuals[:, 1]

        return (d_tumor_volume__dt, d_chemotherapy_drug_concentration__dt)

COVID-19
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Initialize the parameters for the SEIR model using optimized values
        self.beta = nn.Parameter(torch.tensor(0.2607165277004242))   # Transmission rate
        self.sigma = nn.Parameter(torch.tensor(0.23686641454696655)) # Incubation rate
        self.gamma = nn.Parameter(torch.tensor(0.105068139731884))   # Recovery rate
        self.delta = nn.Parameter(torch.tensor(0.01))                # Death rate

        # Define a neural network for capturing complex patterns
        # Adjust the architecture based on previous iterations
        self.residual_nn = nn.Sequential(
            nn.Linear(4, 12),
            nn.ReLU(),
            nn.Linear(12, 12),
            nn.ReLU(),
            nn.Linear(12, 4)
        )

    def forward(self, susceptible: torch.Tensor, exposed: torch.Tensor, infected: torch.Tensor, recovered: torch.Tensor,
                total_population: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        # SEIR model differential equations
        d_susceptible__dt = -self.beta * susceptible * infected
        d_exposed__dt = self.beta * susceptible * infected - self.sigma * exposed
        d_infected__dt = self.sigma * exposed - (self.gamma + self.delta) * infected
        d_recovered__dt = self.gamma * infected - self.delta * infected

        # Calculate residuals using the neural network
        states = torch.stack([susceptible, exposed, infected, recovered], dim=1)
        residuals = self.residual_nn(states)

        # Add residuals to the differential equations
        d_susceptible__dt += residuals[:, 0]
        d_exposed__dt += residuals[:, 1]
        d_infected__dt += residuals[:, 2]
        d_recovered__dt += residuals[:, 3]

        return (d_susceptible__dt, d_exposed__dt, d_infected__dt, d_recovered__dt)

Plankton Microcosm
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Define the parameters for the improved model with carrying capacities
        self.alpha = nn.Parameter(torch.tensor(0.022967826575040817))  # Prey growth rate
        self.beta = nn.Parameter(torch.tensor(0.6899635791778564))     # Prey death rate due to predation
        self.gamma = nn.Parameter(torch.tensor(0.15562176704406738))   # Intermediate predator efficiency
        self.delta = nn.Parameter(torch.tensor(0.8135092854499817))    # Top predator efficiency
        # Carrying capacities for each population
        self.K_prey = nn.Parameter(torch.tensor(0.4680666923522949))          # Carrying capacity for the prey population
        self.K_intermediate = nn.Parameter(torch.tensor(0.8180080652236938))  # Carrying capacity for the intermediate predator population
        self.K_top = nn.Parameter(torch.tensor(0.4186957776546478))           # Carrying capacity for the top predator population
        # Competition coefficients
        self.sigma = nn.Parameter(torch.tensor(-0.27261480689048767))  # Competition coefficient for prey and intermediate predators
        self.eta = nn.Parameter(torch.tensor(0.06442223489284515))     # Competition coefficient for intermediate predators and top predators
        # MLP for residuals with refined architecture
        self.residual_mlp = nn.Sequential(
            nn.Linear(3, 128),
            nn.LeakyReLU(0.01),
            nn.Dropout(0.25),
            nn.Linear(128, 128),
            nn.LeakyReLU(0.01),
            nn.Dropout(0.25),
            nn.Linear(128, 3)
        )

    def forward(self, prey_population: torch.Tensor, intermediate_population: torch.Tensor,
                top_predators_population: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # Improved differential equations with carrying capacities
        d_prey_population__dt = (
            self.alpha * prey_population * (1 - prey_population / self.K_prey)
            - self.beta * prey_population * intermediate_population
        )
        d_intermediate_population__dt = (
            (self.gamma * prey_population * intermediate_population / (1 + prey_population))
            - self.delta * intermediate_population * top_predators_population
            - self.sigma * intermediate_population ** 2 * (1 - intermediate_population / self.K_intermediate)
        )
        d_top_predators_population__dt = (
            -self.gamma * top_predators_population
            + self.delta * intermediate_population * top_predators_population
            - self.eta * top_predators_population ** 2 * (1 - top_predators_population / self.K_top)
        )

        # Calculate residuals using MLP
        residuals = self.residual_mlp(torch.stack([prey_population, intermediate_population, top_predators_population], dim=1))

        # Correct the differential equations with the residuals
        d_prey_population__dt += residuals[:, 0]
        d_intermediate_population__dt += residuals[:, 1]
        d_top_predators_population__dt += residuals[:, 2]

        return (d_prey_population__dt, d_intermediate_population__dt, d_top_predators_population__dt)

Hare-Lynx
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # Define the parameters for the modified Lotka-Volterra equations
        self.alpha = nn.Parameter(torch.tensor(0.067))   # Prey birth rate
        self.beta = nn.Parameter(torch.tensor(0.005))    # Predation rate
        self.gamma = nn.Parameter(torch.tensor(0.253))   # Predator death rate
        self.delta = nn.Parameter(torch.tensor(0.0028))  # Predator reproduction rate
        # Additional parameters for carrying capacity and time-dependent effects
        self.kappa = nn.Parameter(torch.tensor(1.0))                # Carrying capacity for hares
        self.omega = nn.Parameter(torch.tensor(2 * torch.pi / 10))  # Frequency of oscillations (10-year period)
        self.phi = nn.Parameter(torch.tensor(0.0))                  # Phase shift for oscillations
        # Neural network for capturing residuals
        self.residual_mlp = nn.Sequential(
            nn.Linear(3, 10),
            nn.ReLU(),
            nn.Linear(10, 2)
        )

    def forward(self, hare_population: torch.Tensor, lynx_population: torch.Tensor,
                time_in_years: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Modified Lotka-Volterra equations with carrying capacity and time-dependent effects
        d_hare_population__dt = (
            self.alpha * hare_population * (1 - hare_population / self.kappa)
            - self.beta * hare_population * lynx_population
        )
        d_lynx_population__dt = -self.gamma * lynx_population + self.delta * hare_population * lynx_population
        # Time-dependent oscillatory component
        time_effect = torch.sin(self.omega * time_in_years + self.phi)
        # Combine white box model with neural network residuals
        residuals = self.residual_mlp(torch.stack((hare_population, lynx_population, time_effect), dim=1))
        d_hare_population__dt += residuals[:, 0]
        d_lynx_population__dt += residuals[:, 1]
        return (d_hare_population__dt, d_lynx_population__dt)

J HDTwinGen can reason about HDTwins

Below is a worked log output from part of an HDTwinGen run on the Plankton Microcosm dataset, showing that it can reason about model structures and parameters.
You w i l l g e t a s y s t e m d e s c r i p t i o n t o c o d e a d i f f e r e n t i a l e q u a t i o n s i m u l a t o r f o r .

System D e s c r i p t i o n : ‘ ‘ ‘
" M o d e l i n g A r t i f i c i a l T r i − T r o p h i c Prey − P r e d a t o r O s c i l l a t i o n s i n a S i m p l i f i e d E c o l o g i c a l System

Here you must model t h e s t a t e d i f f e r e n t i a l o f a l g a e _ p o p u l a t i o n , f l a g e l l a t e _ p o p u l a t i o n , and r o t i f e r _ p o p u l a t i o n ; w i t h no i n p u t a c t i o n s


. This aims t o s i m u l a t e t h e p o p u l a t i o n dynamics w i t h i n a s i m p l i f i e d t r i − t r o p h i c e c o l o g i c a l system c o m p r i s i n g prey ( a l g a e ) ,
i n t e r m e d i a t e p r e d a t o r s ( f l a g e l l a t e s ) , and t o p p r e d a t o r s ( r o t i f e r s ) . The i n t e r a c t i o n s i n c l u d e d i r e c t p r e d a t i o n and c o m p e t i t i o n
f o r r e s o u r c e s , m i r r o r i n g n a t u r a l i n t r a g u i l d p r e d a t i o n mechanisms .

D e s c r i p t i o n of the v a r i a b l e s :
* prey_population : Total count of algae , serving as the primary prey
* i n t e r m e d i a t e _ p o p u l a t i o n : T o t a l c o u n t o f f l a g e l l a t e s , a c t i n g a s i n t e r m e d i a t e p r e d a t o r s and p r e y
* t o p _ p r e d a t o r s _ p o p u l a t i o n : Total count of r o t i f e r s , r e p r e s e n t i n g top p r e d a t o r s

The d a t a s e t e n c a p s u l a t e s d a i l y p o p u l a t i o n c o u n t s a c r o s s m u l t i p l e s i m u l a t e d e c o s y s t e m s o v e r a p e r i o d o f 100 days , a l l o w i n g f o r t h e


a n a l y s i s o f t e m p o r a l o s c i l l a t i o n s and p h a s e l a g s b e t w e e n s p e c i e s .

A d d i t i o n a l l y t h e s e v a r i a b l e s have t h e r a n g e s of :
* prey_population : [0.095898 , 2.469735]
* intermediate_population : [0.008438 , 1.500000]
* top_predators_population : [0.030316 , 0.739244]

The t r a i n i n g d a t a s e t c o n s i s t s o f 70 t i m e s t e p s , v a l i d a t i o n and t r a i n i n g d a t a s e t c o n s i s t s o f 15 t i m e s t e p s e a c h .

‘‘‘

Modelling goals : ‘ ‘ ‘
* The p a r a m e t e r s o f t h e model w i l l be o p t i m i z e d t o an o b s e r v e d t r a i n i n g d a t a s e t w i t h t h e g i v e n s i m u l a t o r .
* The o b s e r v e d t r a i n i n g d a t a s e t h a s v e r y few s a m p l e s , and t h e model must be a b l e t o g e n e r a l i z e t o u n s e e n d a t a .
‘‘‘

Requirement S p e c i f i c a t i o n : ‘ ‘ ‘
* The c o d e g e n e r a t e d s h o u l d a c h i e v e t h e l o w e s t p o s s i b l e v a l i d a t i o n l o s s , o f 1 e −6 o r l e s s .
* The c o d e g e n e r a t e d s h o u l d be i n t e r p r e t a b l e , and f i t t h e d a t a s e t a s a c c u r a t e l y a s p o s s i b l e .
‘‘‘

S k e l e t o n code t o f i l l i n : ‘ ‘ ‘
c l a s s S t a t e D i f f e r e n t i a l ( nn . Module ) :
def __init__ ( s e l f ) :
super ( S t a t e D i f f e r e n t i a l , s e l f ) . __init__ ()
# TODO: F i l l i n t h e c o d e h e r e

def forward ( s e l f , p r e y _ p o p u l a t i o n : t o r c h . Tensor , i n t e r m e d i a t e _ p o p u l a t i o n : t o r c h . Tensor , t o p _ p r e d a t o r s _ p o p u l a t i o n : t o r c h . Tensor )


−> T u p l e [ t o r c h . T e n s o r , t o r c h . T e n s o r , t o r c h . T e n s o r ] :
# TODO: F i l l i n t h e c o d e h e r e
return ( d_prey_population__dt , d_intermediate_population__dt , d_top_predators_population__dt )
‘‘‘

U s e f u l t o know : ‘ ‘ ‘
* You a r e a c o d e e v o l v i n g machine , and you w i l l be c a l l e d 20 t i m e s t o g e n e r a t e code , and i m p r o v e t h e c o d e t o a c h i e v e t h e l o w e s t
possible validation loss .
* The model d e f i n e s t h e s t a t e d i f f e r e n t i a l and w i l l be u s e d w i t h an ODE s o l v e r t o f i t t h e o b s e r v e d t r a i n i n g d a t a s e t .
* You c a n u s e any p a r a m e t e r s you want and any b l a c k box n e u r a l n e t w o r k c o m p o n e n t s ( m u l t i − l a y e r p e r c e p t r o n s ) ; however , you h a v e t o
define these .
* I t i s p r e f e r a b l e t o decompose t h e s y s t e m i n t o d i f f e r e n t i a l e q u a t i o n s ( c o m p a r t m e n t s ) i f p o s s i b l e .
* You c a n u s e any u n a r y f u n c t i o n s , f o r e x a m p l e l o g , exp , power e t c .
* Under no c i r c u m s t a n c e c a n you c h a n g e t h e s k e l e t o n c o d e f u n c t i o n d e f i n i t i o n s , o n l y f i l l i n t h e c o d e .
* The i n p u t t e n s o r s a r e v e c t o r s o f s h a p e ( b a t c h _ s i z e ) .
* Use i n i t i a l l y w h i t e box m o d e l s f i r s t and t h e n s w i t c h t o h y b r i d w h i t e and b l a c k box m o d e l s f o r t h e r e s i d u a l s , o n l y a f t e r no f u r t h e r
b e s t p r o g r a m i t e r a t i o n i m p r o v e m e n t w i t h w h i t e box m o d e l s .
* Make s u r e y o u r c o d e f o l l o w s t h e e x a c t c o d e s k e l e t o n s p e c i f i c a t i o n .
* Use PyTorch .
‘‘‘

T h i n k s t e p −by − s t e p , and t h e n g i v e t h e c o m p l e t e f u l l w o r k i n g c o d e . You a r e g e n e r a t i n g c o d e f o r i t e r a t i o n 0 o u t o f 2 0 .

M a i n P r o c e s s | 2024 −01 −29 0 0 : 0 2 : 0 1 , 0 8 2 , 8 2 m u l t i p r o c e s s i n g INFO [ U s e r ]

40
You g e n e r a t e d t h e f o l l o w i n g c o d e c o m p l e t i o n s , which t h e n had t h e i r p a r a m e t e r s o p t i m i z e d t o t h e t r a i n i n g d a t a s e t . P l e a s e r e f l e c t on
how you c a n i m p r o v e t h e c o d e t o m i n i m i z e t h e v a l i d a t i o n l o s s t o 1 e −6 o r l e s s . The c o d e e x a m p l e s a r e d e l i n e a t e d by # # # .

Here a r e y o u r p r e v i o u s i t e r a t i o n s t h e b e s t p r o g r a m s g e n e r a t e d . Use i t t o s e e i f you h a v e e x h a u s t e d w h i t e box models , i . e . when a


w h i t e box model r e p e a t s w i t h t h e same v a l l o s s and t h e n o n l y add b l a c k box m o d e l s t o t h e w h i t e box m o d e l s : ‘ ‘ ‘
I t e r a t i o n 0 . B e s t Val L o s s : 0 . 0 1 2 2 2 4 0 6 7 9 3 3 8 5 7 4 4 1 . Model d e s c r i p t i o n : w h i t e box model
I t e r a t i o n 1 . B e s t Val L o s s : 0 . 0 0 3 1 2 9 9 0 2 9 9 4 2 6 0 1 9 2 . Model d e s c r i p t i o n : w h i t e box model w i t h a d d i t i o n a l p a r a m e t e r s f o r c o m p e t i t i o n and
c a r r y i n g c a p a c i t y , and i m p r o v e d d i f f e r e n t i a l e q u a t i o n s t o b e t t e r c a p t u r e t h e t r i − t r o p h i c d y n a m i c s .
‘‘‘

Here a r e t h e t o p c o d e c o m p l e t i o n s s o f a r t h a t you h a v e g e n e r a t e d , s o r t e d f o r t h e l o w e s t v a l i d a t i o n l o s s l a s t : ‘ ‘ ‘

Val L o s s : 0 . 0 1 2 2 ( Where t h e v a l l o s s p e r d i m e n s i o n i s p r e y _ p o p u l a t i o n v a l l o s s : 0 . 0 3 1 6 , i n t e r m e d i a t e _ p o p u l a t i o n v a l l o s s : 2 . 1 3 e −05 ,


top_predators_population val loss : 0.00505) I t e r a t i o n : 0
###
‘‘‘
import torch
i m p o r t t o r c h . nn a s nn
from t y p i n g i m p o r t T u p l e

c l a s s S t a t e D i f f e r e n t i a l ( nn . Module ) :
def __init__ ( s e l f ) :
super ( S t a t e D i f f e r e n t i a l , s e l f ) . __init__ ()
# D e f i n e t h e p a r a m e t e r s f o r t h e Lotka − V o l t e r r a e q u a t i o n s
s e l f . a l p h a = nn . P a r a m e t e r ( t o r c h . r a n d ( 1 ) ) # P r e y g r o w t h r a t e
s e l f . b e t a = nn . P a r a m e t e r ( t o r c h . r a n d ( 1 ) ) # P r e y d e a t h r a t e due t o p r e d a t i o n
s e l f . gamma = nn . P a r a m e t e r ( t o r c h . r a n d ( 1 ) ) # P r e d a t o r g r o w t h r a t e due t o p r e d a t i o n
s e l f . d e l t a = nn . P a r a m e t e r ( t o r c h . r a n d ( 1 ) ) # P r e d a t o r d e a t h r a t e

def forward ( s e l f , p r e y _ p o p u l a t i o n : t o r c h . Tensor , i n t e r m e d i a t e _ p o p u l a t i o n : t o r c h . Tensor , t o p _ p r e d a t o r s _ p o p u l a t i o n : t o r c h . Tensor )


−> T u p l e [ t o r c h . T e n s o r , t o r c h . T e n s o r , t o r c h . T e n s o r ] :
# Lotka − V o l t e r r a e q u a t i o n s
d_prey_population__dt = s e l f . alpha * prey_population − s e l f . beta * prey_population * intermediate_population
d _ i n t e r m e d i a t e _ p o p u l a t i o n _ _ d t = s e l f . gamma * p r e y _ p o p u l a t i o n * i n t e r m e d i a t e _ p o p u l a t i o n − s e l f . d e l t a *
intermediate_population * top_predators_population
d _ t o p _ p r e d a t o r s _ p o p u l a t i o n _ _ d t = − s e l f . gamma * t o p _ p r e d a t o r s _ p o p u l a t i o n + s e l f . d e l t a * i n t e r m e d i a t e _ p o p u l a t i o n *
top_predators_population

return ( d_prey_population__dt , d_intermediate_population__dt , d_top_predators_population__dt )

‘‘‘
o p t i m i z e d _ p a r a m e t e r s = { ’ a l p h a ’ : 0 . 1 0 9 7 7 2 0 1 9 0 8 8 2 6 8 2 8 , ’ b e t a ’ : 0 . 6 8 7 0 3 6 9 3 1 5 1 4 7 4 , ’gamma ’ : 0 . 1 2 2 3 4 0 5 3 7 6 0 7 6 6 9 8 3 , ’ d e l t a ’ :
0.8221619725227356}
###

Val L o s s : 0 . 0 0 3 1 3 ( Where t h e v a l l o s s p e r d i m e n s i o n i s p r e y _ p o p u l a t i o n v a l l o s s : 0 . 0 0 7 7 6 , i n t e r m e d i a t e _ p o p u l a t i o n v a l l o s s : 2 . 3 1 e
−05 , t o p _ p r e d a t o r s _ p o p u l a t i o n v a l l o s s : 0 . 0 0 1 6 1 ) I t e r a t i o n : 1
###
‘‘‘
import torch
i m p o r t t o r c h . nn a s nn
from t y p i n g i m p o r t T u p l e

c l a s s S t a t e D i f f e r e n t i a l ( nn . Module ) :
def __init__ ( s e l f ) :
super ( S t a t e D i f f e r e n t i a l , s e l f ) . __init__ ()
# D e f i n e t h e p a r a m e t e r s f o r t h e i m p r o v e d model
s e l f . a l p h a = nn . P a r a m e t e r ( t o r c h . t e n s o r ( 0 . 1 0 9 7 7 2 0 1 9 0 8 8 2 6 8 2 8 ) ) # P r e y g r o w t h r a t e
s e l f . b e t a = nn . P a r a m e t e r ( t o r c h . t e n s o r ( 0 . 6 8 7 0 3 6 9 3 1 5 1 4 7 4 ) ) # P r e y d e a t h r a t e due t o p r e d a t i o n
s e l f . gamma = nn . P a r a m e t e r ( t o r c h . t e n s o r ( 0 . 1 2 2 3 4 0 5 3 7 6 0 7 6 6 9 8 3 ) ) # P r e d a t o r g r o w t h r a t e due t o p r e d a t i o n
s e l f . d e l t a = nn . P a r a m e t e r ( t o r c h . t e n s o r ( 0 . 8 2 2 1 6 1 9 7 2 5 2 2 7 3 5 6 ) ) # P r e d a t o r d e a t h r a t e
# A d d i t i o n a l p a r a m e t e r s f o r c o m p e t i t i o n and c a r r y i n g c a p a c i t y
s e l f . s i g m a = nn . P a r a m e t e r ( t o r c h . r a n d ( 1 ) ) # C o m p e t i t i o n c o e f f i c i e n t f o r p r e y and i n t e r m e d i a t e p r e d a t o r s
s e l f . e t a = nn . P a r a m e t e r ( t o r c h . r a n d ( 1 ) ) # C o m p e t i t i o n c o e f f i c i e n t f o r i n t e r m e d i a t e p r e d a t o r s and t o p p r e d a t o r s
s e l f . K = nn . P a r a m e t e r ( t o r c h . r a n d ( 1 ) ) # Carrying c a p a c i t y for the prey population

def forward ( s e l f , p r e y _ p o p u l a t i o n : t o r c h . Tensor , i n t e r m e d i a t e _ p o p u l a t i o n : t o r c h . Tensor , t o p _ p r e d a t o r s _ p o p u l a t i o n : t o r c h . Tensor )


−> T u p l e [ t o r c h . T e n s o r , t o r c h . T e n s o r , t o r c h . T e n s o r ] :
# Improved d i f f e r e n t i a l e q u a t i o n s
d _ p r e y _ p o p u l a t i o n _ _ d t = s e l f . a l p h a * p r e y _ p o p u l a t i o n * ( 1 − p r e y _ p o p u l a t i o n / s e l f . K) − s e l f . b e t a * p r e y _ p o p u l a t i o n *
intermediate_population
d _ i n t e r m e d i a t e _ p o p u l a t i o n _ _ d t = s e l f . gamma * p r e y _ p o p u l a t i o n * i n t e r m e d i a t e _ p o p u l a t i o n − s e l f . d e l t a *
i n t e r m e d i a t e _ p o p u l a t i o n * t o p _ p r e d a t o r s _ p o p u l a t i o n − s e l f . s i g m a * i n t e r m e d i a t e _ p o p u l a t i o n ** 2
d _ t o p _ p r e d a t o r s _ p o p u l a t i o n _ _ d t = − s e l f . gamma * t o p _ p r e d a t o r s _ p o p u l a t i o n + s e l f . d e l t a * i n t e r m e d i a t e _ p o p u l a t i o n *
t o p _ p r e d a t o r s _ p o p u l a t i o n − s e l f . e t a * t o p _ p r e d a t o r s _ p o p u l a t i o n ** 2

return ( d_prey_population__dt , d_intermediate_population__dt , d_top_predators_population__dt )


‘‘‘
o p t i m i z e d _ p a r a m e t e r s = { ’ a l p h a ’ : − 0 . 0 1 3 4 3 8 1 3 5 3 8 5 5 1 3 3 0 6 , ’ b e t a ’ : 0 . 6 1 8 6 3 1 6 6 0 9 3 8 2 6 2 9 , ’gamma ’ : 0 . 3 6 3 3 5 3 5 2 0 6 3 1 7 9 0 1 6 , ’ d e l t a ’ :
0 . 6 2 1 3 7 6 2 7 6 0 1 6 2 3 5 4 , ’ sigma ’ : 0 . 4 7 4 4 1 9 3 2 5 5 9 0 1 3 3 6 7 , ’ e t a ’ : − 0 . 4 3 2 1 2 1 7 8 3 4 9 4 9 4 9 3 4 , ’K ’ : 0 . 3 4 9 3 5 0 0 3 5 1 9 0 5 8 2 3 }
###

‘‘‘

P l e a s e r e f l e c t on how you c a n i m p r o v e t h e c o d e t o f i t t h e d a t a s e t a s a c c u r a t e l y a s p o s s i b l e , and be i n t e r p r e t a b l e . T h i n k s t e p −by −


s t e p . P r o v i d e o n l y a c t i o n a b l e f e e d b a c k , t h a t h a s d i r e c t c h a n g e s t o t h e c o d e . Do n o t w r i t e o u t t h e code , o n l y d e s c r i b e how i t
c a n be i m p r o v e d . Where a p p l i c a b l e u s e t h e v a l u e s o f t h e o p t i m i z e d p a r a m e t e r s t o r e a s o n how t h e c o d e c a n be i m p r o v e d t o f i t
t h e d a t a s e t a s a c c u r a t e l y a s p o s s i b l e . T h i s i s f o r g e n e r a t i n g new c o d e f o r t h e n e x t i t e r a t i o n 2 o u t o f 2 0 .

MainProcess | 2024-01-29 00:02:01,082,82 multiprocessing INFO [Assistant] To improve the code for iteration 2, consider the following actionable feedback:

1. **Incorporate Resource Limitation for Intermediate Predators and Top Predators**: The current model includes a carrying capacity term (K) for the prey population but does not account for resource limitation for intermediate predators and top predators. Introduce carrying capacity parameters for these populations to model their growth saturation.

2. **Refine Interaction Terms**: The interaction terms currently use simple multiplicative factors between populations. Consider more complex interactions that may include nonlinear effects or functional responses, such as a Holling Type II functional response, to better capture the dynamics of predation.

3. **Adjust Negative Growth Rate**: The optimized parameter 'alpha' for the prey growth rate has a negative value, which is biologically unrealistic as it would imply the prey population decreases in the absence of predators. Ensure that the growth rate parameters are constrained to be positive during optimization.

4. **Review Competition Coefficients**: The optimized parameters 'sigma' and 'eta' for competition are positive and negative, respectively. While 'sigma' being positive is consistent with competition, 'eta' being negative is not typical for a competition coefficient. Re-evaluate the competition terms in the equations to ensure they are biologically meaningful and constrain the coefficients to be non-negative during optimization.

5. **Introduce Density-Dependent Mortality for Top Predators**: The model currently does not include a term for density-dependent mortality of top predators. Add a term to the differential equation for top predators that accounts for intraspecific competition or other density-dependent mortality factors.

6. **Parameter Initialization Based on Optimized Values**: Use the optimized parameter values from iteration 1 as initial values for the parameters in iteration 2. This can provide a better starting point for the optimization process.

7. **Consider Time Delays**: In real ecosystems, there can be time delays between predation and changes in population sizes due to factors like gestation or maturation periods. If the validation loss does not improve significantly with the above changes, consider adding time delay terms to the model.

8. **Hybrid White and Black Box Model**: If the validation loss does not improve below the threshold of 1e-6 with the above white box model improvements, consider introducing a black box component, such as a multi-layer perceptron, to capture the residuals of the white box model. This hybrid approach can potentially model complex interactions that are not captured by the white box differential equations.

By implementing these changes, the model should become more biologically realistic and may fit the dataset more accurately, potentially achieving a lower validation loss.
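For concreteness, feedback points 2-4 above could be realized roughly as in the following sketch. This is a minimal illustration rather than the model actually generated in this run; the class name, the softplus reparameterization, and the Holling Type II half-saturation parameter `h` are assumptions introduced here, and only the prey equation is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConstrainedPreyDifferential(nn.Module):
    """Sketch: positivity-constrained rates with a Holling Type II predation term."""

    def __init__(self):
        super().__init__()
        # Unconstrained parameters; softplus in forward() keeps the effective rates positive.
        self.raw_alpha = nn.Parameter(torch.rand(1))  # prey growth rate
        self.raw_beta = nn.Parameter(torch.rand(1))   # maximum predation rate on prey
        self.raw_h = nn.Parameter(torch.rand(1))      # half-saturation constant (illustrative)
        self.raw_K = nn.Parameter(torch.rand(1))      # prey carrying capacity

    def forward(self, prey_population: torch.Tensor, intermediate_population: torch.Tensor) -> torch.Tensor:
        alpha = F.softplus(self.raw_alpha)
        beta = F.softplus(self.raw_beta)
        h = F.softplus(self.raw_h)
        K = F.softplus(self.raw_K)
        # Holling Type II functional response: predation saturates at high prey density.
        predation = beta * prey_population * intermediate_population / (h + prey_population)
        # Logistic prey growth minus saturating predation.
        return alpha * prey_population * (1 - prey_population / K) - predation
```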
MainProcess | 2024-01-29 00:02:01,082,82 multiprocessing INFO [User]
Please now regenerate the code function, with the aim to improve the code to achieve a lower validation error. Use the feedback where applicable. You are generating code for iteration 2 out of 20 total iterations. When generating code if you are unsure about something, take your best guess. You have to generate code, and cannot give an empty string answer.

Please always only fill in the following code skeleton: ```
class StateDifferential(nn.Module):
    def __init__(self):
        super(StateDifferential, self).__init__()
        # TODO: Fill in the code here

    def forward(self, prey_population: torch.Tensor, intermediate_population: torch.Tensor, top_predators_population: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # TODO: Fill in the code here
        return (d_prey_population__dt, d_intermediate_population__dt, d_top_predators_population__dt)
```
You cannot change the code skeleton, or input variables.
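To illustrate the hybrid white/black-box suggestion in point 8 of the feedback, a mechanistic core augmented with a small MLP that models residual derivatives could look roughly like the sketch below. The class name, network width, residual formulation, and the simplified mechanistic terms are illustrative assumptions, not the architecture produced by this run of the method.

```python
import torch
import torch.nn as nn
from typing import Tuple


class HybridStateDifferential(nn.Module):
    """Sketch: mechanistic (white-box) derivatives plus an MLP (black-box) residual."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.rand(1))  # prey growth rate
        self.beta = nn.Parameter(torch.rand(1))   # predation rate on prey
        self.gamma = nn.Parameter(torch.rand(1))  # conversion of prey into intermediate predators
        self.delta = nn.Parameter(torch.rand(1))  # predation rate on intermediate predators
        # Small black-box component that captures dynamics the mechanistic terms miss.
        self.residual_net = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 3))

    def forward(self, prey_population: torch.Tensor, intermediate_population: torch.Tensor,
                top_predators_population: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # White-box part: simplified Lotka-Volterra-style terms.
        d_prey = self.alpha * prey_population - self.beta * prey_population * intermediate_population
        d_intermediate = (self.gamma * prey_population * intermediate_population
                          - self.delta * intermediate_population * top_predators_population)
        d_top = (self.delta * intermediate_population * top_predators_population
                 - self.gamma * top_predators_population)
        # Black-box residual correction computed from the current state.
        state = torch.stack([prey_population, intermediate_population, top_predators_population], dim=-1)
        residual = self.residual_net(state)
        return (d_prey + residual[..., 0], d_intermediate + residual[..., 1], d_top + residual[..., 2])
```

The mechanistic terms keep the model interpretable, while the residual network can absorb interactions the equations do not capture, matching the intent of the hybrid modelling suggestion.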

NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the
paper’s contributions and scope?
Answer: [Yes]
Justification: The abstract and introduction clearly state the main claims, which are backed
up by empirical evidence.
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims
made in the paper.
• The abstract and/or introduction should clearly state the claims made, including the
contributions made in the paper and important assumptions and limitations. A No or
NA answer to this question will not be perceived well by the reviewers.
• The claims made should match theoretical and experimental results, and reflect how
much the results can be expected to generalize to other settings.
• It is fine to include aspirational goals as motivation as long as it is clear that these goals
are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The main limitations of the work are discussed in Section 8, including the
assumption of semantic priors and types of system studied.
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that
the paper has limitations, but those are not discussed in the paper.
• The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to
violations of these assumptions (e.g., independence assumptions, noiseless settings,
model well-specification, asymptotic approximations only holding locally). The authors
should reflect on how these assumptions might be violated in practice and what the
implications would be.
• The authors should reflect on the scope of the claims made, e.g., if the approach was
only tested on a few datasets or with a few runs. In general, empirical results often
depend on implicit assumptions, which should be articulated.
• The authors should reflect on the factors that influence the performance of the approach.
For example, a facial recognition algorithm may perform poorly when image resolution
is low or images are taken in low lighting. Or a speech-to-text system might not be
used reliably to provide closed captions for online lectures because it fails to handle
technical jargon.
• The authors should discuss the computational efficiency of the proposed algorithms
and how they scale with dataset size.
• If applicable, the authors should discuss possible limitations of their approach to
address problems of privacy and fairness.
• While the authors might fear that complete honesty about limitations might be used by
reviewers as grounds for rejection, a worse outcome might be that reviewers discover
limitations that aren’t acknowledged in the paper. The authors should use their best
judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers
will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and
a complete (and correct) proof?

Answer: [NA]
Justification: This paper does not introduce any theoretical results.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and cross-
referenced.
• All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if
they appear in the supplemental material, the authors are encouraged to provide a short
proof sketch to provide intuition.
• Inversely, any informal proof provided in the core of the paper should be complemented
by formal proofs provided in appendix or supplemental material.
• Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions
of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: In Section 6 and Appendices C to G, we thoroughly outline the experimental
procedure. This, combined with the released code, will enable all results in the paper to be
reproduced.
Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived
well by the reviewers: Making the paper reproducible is important, regardless of
whether the code and data are provided or not.
• If the contribution is a dataset and/or model, the authors should describe the steps taken
to make their results reproducible or verifiable.
• Depending on the contribution, reproducibility can be accomplished in various ways.
For example, if the contribution is a novel architecture, describing the architecture fully
might suffice, or if the contribution is a specific model and empirical evaluation, it may
be necessary to either make it possible for others to replicate the model with the same
dataset, or provide access to the model. In general, releasing code and data is often
one good way to accomplish this, but reproducibility can also be provided via detailed
instructions for how to replicate the results, access to a hosted model (e.g., in the case
of a large language model), releasing of a model checkpoint, or other means that are
appropriate to the research performed.
• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the
nature of the contribution. For example
(a) If the contribution is primarily a new algorithm, the paper should make it clear how
to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe
the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should
either be a way to access this model for reproducing the results or a way to reproduce
the model (e.g., with an open-source dataset or instructions for how to construct
the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case
authors are welcome to describe the particular way they provide for reproducibility.
In the case of closed-source models, it may be that access to the model is limited in
some way (e.g., to registered users), but it should be possible for other researchers
to have some path to reproducing or verifying the results.
5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental
material?
Answer: [Yes]
Justification: We provide code at [Link]. Additionally, the datasets and preprocessing procedure are described in Appendix C.
Guidelines:
• The answer NA means that the paper does not include experiments requiring code.
• Please see the NeurIPS code and data submission guidelines ([Link]/public/guides/CodeSubmissionPolicy) for more details.
• While we encourage the release of code and data, we understand that this might not be
possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not
including code, unless this is central to the contribution (e.g., for a new open-source
benchmark).
• The instructions should contain the exact command and environment needed to run to
reproduce the results. See the NeurIPS code and data submission guidelines (https://[Link]/public/guides/CodeSubmissionPolicy) for more details.
• The authors should provide instructions on data access and preparation, including how
to access the raw data, preprocessed data, intermediate data, and generated data, etc.
• The authors should provide scripts to reproduce all experimental results for the new
proposed method and baselines. If only a subset of experiments are reproducible, they
should state which ones are omitted from the script and why.
• At submission time, to preserve anonymity, the authors should release anonymized
versions (if applicable).
• Providing as much information as possible in supplemental material (appended to the
paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the
results?
Answer: [Yes]
Justification: Appendices C to G detail all training and test settings, including data splits and hyperparameter tuning procedures for all baselines.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The experimental setting should be presented in the core of the paper to a level of detail
that is necessary to appreciate the results and make sense of them.
• The full details can be provided either with the code, in appendix, or as supplemental
material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate
information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We report results over 10 seeded runs for all methods, giving the mean and 95% confidence interval.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support
the main claims of the paper.

• The factors of variability that the error bars are capturing should be clearly stated (for
example, train/test split, initialization, random drawing of some parameter, or overall
run with given experimental conditions).
• The method for calculating the error bars should be explained (closed form formula,
call to a library function, bootstrap, etc.)
• The assumptions made should be given (e.g., Normally distributed errors).
• It should be clear whether the error bar is the standard deviation or the standard error
of the mean.
• It is OK to report 1-sigma error bars, but one should state it. The authors should
preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis
of Normality of errors is not verified.
• For asymmetric distributions, the authors should be careful not to show in tables or
figures symmetric error bars that would yield results that are out of range (e.g. negative
error rates).
• If error bars are reported in tables or plots, the authors should explain in the text how
they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce
the experiments?
Answer: [Yes]
Justification: In Appendix G, we report all computer resources required to reproduce the
results.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The paper should indicate the type of compute workers CPU or GPU, internal cluster,
or cloud provider, including relevant memory and storage.
• The paper should provide the amount of compute required for each of the individual
experimental runs as well as estimate the total compute.
• The paper should disclose whether the full research project required more compute
than the experiments reported in the paper (e.g., preliminary or failed experiments that
didn’t make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the
NeurIPS Code of Ethics [Link]
Answer: [Yes]
Justification: The authors have read the Code of Ethics and confirm that the paper conforms
to the code.
Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
• If the authors answer No, they should explain the special circumstances that require a
deviation from the Code of Ethics.
• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative
societal impacts of the work performed?
Answer: [Yes]
Justification: Section 8 discusses the societal impacts of the work.
Guidelines:
• The answer NA means that there is no societal impact of the work performed.

46
• If the authors answer NA or No, they should explain why their work has no societal
impact or why the paper does not address societal impact.
• Examples of negative societal impacts include potential malicious or unintended uses
(e.g., disinformation, generating fake profiles, surveillance), fairness considerations
(e.g., deployment of technologies that could make decisions that unfairly impact specific
groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied
to particular applications, let alone deployments. However, if there is a direct path to
any negative applications, the authors should point it out. For example, it is legitimate
to point out that an improvement in the quality of generative models could be used to
generate deepfakes for disinformation. On the other hand, it is not needed to point out
that a generic algorithm for optimizing neural networks could enable people to train
models that generate Deepfakes faster.
• The authors should consider possible harms that could arise when the technology is
being used as intended and functioning correctly, harms that could arise when the
technology is being used as intended but gives incorrect results, and harms following
from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation
strategies (e.g., gated release of models, providing defenses in addition to attacks,
mechanisms for monitoring misuse, mechanisms to monitor how a system learns from
feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible
release of data or models that have a high risk for misuse (e.g., pretrained language models,
image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper does not release any pretrained models or collated datasets that
might pose a risk to misuse.
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with
necessary safeguards to allow for controlled use of the model, for example by requiring
that users adhere to usage guidelines or restrictions to access the model or implementing
safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors
should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do
not require this, but we encourage authors to take this into account and make a best
faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in
the paper, properly credited and are the license and terms of use explicitly mentioned and
properly respected?
Answer: [Yes]
Justification: The paper cites original owners of all code (for baselines) and datasets used in
the paper.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a
URL.
• The name of the license (e.g., CC-BY 4.0) should be included for each asset.

• For scraped data from a particular source (e.g., website), the copyright and terms of
service of that source should be provided.
• If assets are released, the license, copyright information, and terms of use in the
package should be provided. For popular datasets, [Link]/datasets
has curated licenses for some datasets. Their licensing guide can help determine the
license of a dataset.
• For existing datasets that are re-packaged, both the original license and the license of
the derived asset (if it has changed) should be provided.
• If this information is not available online, the authors are encouraged to reach out to
the asset’s creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation
provided alongside the assets?
Answer: [NA]
Justification: This paper does not release any new assets.
Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their
submissions via structured templates. This includes details about training, license,
limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose
asset is used.
• At submission time, remember to anonymize your assets (if applicable). You can either
create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper
include the full text of instructions given to participants and screenshots, if applicable, as
well as details about compensation (if any)?
Answer: [NA]
Justification: This work does not involve crowdsourcing or research with human subjects.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with
human subjects.
• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be
included in the main paper.
• According to the NeurIPS Code of Ethics, workers involved in data collection, curation,
or other labor should be paid at least the minimum wage in the country of the data
collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human
Subjects
Question: Does the paper describe potential risks incurred by study participants, whether
such risks were disclosed to the subjects, and whether Institutional Review Board (IRB)
approvals (or an equivalent approval/review based on the requirements of your country or
institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing or research with human subjects.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with
human subjects.

• Depending on the country in which research is conducted, IRB approval (or equivalent)
may be required for any human subjects research. If you obtained IRB approval, you
should clearly state this in the paper.
• We recognize that the procedures for this may vary significantly between institutions
and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the
guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if
applicable), such as the institution conducting the review.

