
PDE-LEARN: Using Deep Learning to Discover Partial Differential

Equations from Noisy, Limited Data


Robert Stephany∗,a and Christopher Earlsa,b

a Center for Applied Mathematics, Cornell University, Ithaca, NY 14850, United States
b School of Civil & Environmental Engineering, Cornell University, Ithaca, NY 14850, United States

arXiv:2212.04971v2 [cs.LG] 10 Feb 2023

Abstract

In this paper, we introduce PDE-LEARN, a novel deep learning algorithm that can identify governing partial
differential equations (PDEs) directly from noisy, limited measurements of a physical system of interest.
PDE-LEARN uses a Rational Neural Network, U , to approximate the system response function and a sparse,
trainable vector, ξ, to characterize the hidden PDE that the system response function satisfies. Our approach
couples the training of U and ξ using a loss function that (1) makes U approximate the system response
function, (2) encapsulates the fact that U satisfies a hidden PDE that ξ characterizes, and (3) promotes
sparsity in ξ using ideas from iteratively reweighted least-squares. Further, PDE-LEARN can simultaneously
learn from several data sets, allowing it to incorporate results from multiple experiments. This approach
yields a robust algorithm to discover PDEs directly from realistic scientific data. We demonstrate the efficacy
of PDE-LEARN by identifying several PDEs from noisy and limited measurements.

Keywords: Deep Learning, Sparse Regression, Partial Differential Equation Discovery, Physics Informed
Machine Learning

∗ Corresponding Author.

Email address: [email protected]

1 Introduction
Scientific progress is contingent upon finding predictive models for the natural world. Historically, scientists
have discovered new laws by studying physical systems to distill the first principles that govern those sys-
tems. This first-principles approach has yielded predictive models in many fields, including fluid mechanics,
population dynamics, quantum mechanics, and general relativity. Unfortunately, despite years of concerted
effort, many systems currently lack predictive models, particularly those in the biological sciences [1], [3].

These roadblocks call for a new approach to identifying governing equations. At its core, discovering new
scientific laws relies on identifying simple governing equations from complex data sets. Pattern recognition
is, notably, one of the core goals of machine learning. Advances in machine learning, specifically deep
learning, offer an intriguing alternative approach to discovering scientific laws. The burgeoning field of
physics-informed machine learning seeks to use machine learning for scientific and engineering applications
by designing models whose architecture incorporates knowledge of the physical system. PDE discovery,
a sub-field of physics-informed machine learning, seeks to use machine learning models to identify partial
differential equations (PDEs) from data. PDE discovery may be able to discover predictive models for
physical systems that have proven difficult to model.

Crucially, any approach that aims to discover scientific laws must be able to work with scientific data.
Scientific data sets are often limited (sometimes only a few hundred data points) and may contain substantial
noise. Therefore, PDE discovery algorithms must be able to work with limited, noisy data. Further, scientists
often perform multiple experiments on the same system. As such, PDE discovery algorithms should be able
to incorporate information from multiple data sets. This paper presents a novel PDE discovery algorithm
that meets these challenges.

Contributions: In this paper, we develop a novel PDE discovery algorithm, PDE-LEARN, that can learn a
general class of PDEs directly from data. In particular, PDE-LEARN can identify both linear and nonlinear
PDEs with several spatial variables. Significantly, PDE-LEARN is also robust to noise and limited data.
Further, PDE-LEARN can pool information from multiple experiments by learning from multiple data sets
simultaneously. We confirm PDE-LEARN’s efficacy through a set of numerical experiments. These experiments
demonstrate that PDE-LEARN can discover a variety of linear and nonlinear PDEs, even when its training set
is limited and noisy.

Outline: The rest of this paper is organized as follows: First, in section 2, we state our assumptions on
the data and the form of the hidden PDE. Next, in section 3, we survey related work on PDE discovery,
focusing on methods that encode the hidden PDE into their loss functions. In section 4, we give a compre-
hensive description of the PDE-LEARN algorithm. In section 5, we demonstrate the efficacy of PDE-LEARN by
discovering a variety of PDEs from limited, noisy data sets. Section 6 discusses our rationale behind the
PDE-LEARN algorithm, practical considerations, limitations, and potential future research directions. Finally,
we provide concluding remarks in section 7.

2 Problem Statement
Our algorithm uses noisy measurements of one or more system response functions to identify a hidden PDE
that those system response functions satisfy. This section mainly serves to make the preceding statement
more precise. First, we introduce the notation we use throughout this paper. We then define what we mean
by system response functions and state our assumptions about the hidden PDE they satisfy. We then specify
our assumptions about our noisy measurements of the system response function. Finally, we conclude this
section by formally stating the problem that our approach seeks to solve.

Problem domain: Let nD , nS ∈ N and Ω1 , . . . , ΩnS ⊆ RnD be a collection of open, connected sets. We
refer to Ωi as the ith spatial problem domain. We assume that for each i ∈ {1, 2, . . . , nS }, a physical process
evolves on the domain Ωi during the time interval (0, Ti ], for some Ti > 0. Thus, there exists a system
response function ui : (0, Ti ] × Ωi → R that describes the state of the ith physical process at each position

and time. In particular, ui (t, X) denotes the system state at time t ∈ (0, Ti ] and position X ∈ Ωi . We refer
to the Cartesian product (0, Ti ] × Ωi as the ith problem domain. If nS = 1, we drop the subscript notation
and refer to (0, T ] × Ω as the problem domain.

Roughly speaking, we assume that each system response function corresponds to an experiment. Thus, by
allowing multiple system response functions, we can deal with the case when the user has data from
several experiments. It is also possible, and perfectly admissible, to have just one problem domain and
response function.

Derivative Notation: In this paper, Dsn g denotes the nth partial derivative of any sufficiently smooth
function g with respect to the variable s. Thus, for example,

Dx2 g(t, x, y, z) = ∂ 2 g(t, x, y, z)/∂x2 .


For brevity, we abbreviate Ds1 g as Ds g. Throughout this paper, it will be helpful to have a concise notation
for all partial derivative operators below a certain order. With that in mind, let

∂ˆ0 u, ∂ˆ1 u, . . . , ∂ˆNM u


denote an enumeration of the partial derivatives of u of order ≤ M (with the convention that the identity
map is a 0th-order partial derivative). Note that ∂ˆ0 u, ∂ˆ1 u, . . . , ∂ˆNM u includes mixed partials if u is a function
of multiple variables.

As an example, let’s consider the case when M = 2 and nD = 3. Thus, we are interested in all partial
derivatives of order ≤ 2 of a function of three variables, x, y, and z. In this case, NM = 9, and one possible
enumeration is

∂ˆ0 u = u, ∂ˆ1 u = Dx u, ∂ˆ2 u = Dx2 u, ∂ˆ3 u = Dy u, ∂ˆ4 u = Dy2 u,
∂ˆ5 u = Dz u, ∂ˆ6 u = Dz2 u, ∂ˆ7 u = Dx Dy u, ∂ˆ8 u = Dy Dz u, ∂ˆ9 u = Dx Dz u.

As this example demonstrates, ∂ˆ5 u does not necessarily represent a fifth-order partial derivative of u. Rather,
∂ˆ5 u denotes the fifth term in our specific enumeration of the partial derivatives of u.

Hidden PDE: We assume there exists a hidden PDE of order ≤ M such that ui satisfies the hidden PDE
on (0, Ti ] × Ωi . We further assume this PDE takes the following form:

f0 (∂ˆ0 ui , . . . , ∂ˆNM ui ) = \sum_{k=1}^{K} ck fk (∂ˆ0 ui , . . . , ∂ˆNM ui ).    (1)

Critically, we assume that equation 1 holds with the same coefficients for each i ∈ {1, 2, . . . , nS }. In this
expression, we refer to the functions f0 , f1 , . . . , fK as the library terms. We refer to f0 as the left-hand side
term or LHS term for short. We also refer to f1 , . . . , fK as the right-hand side terms or RHS terms for short.
Each library term is a function of u and its partial derivatives (both in space and time) of order ≤ M .

At a high level, the algorithm we propose attempts to learn the coefficients c1 , . . . , cK using noisy, limited
measurements of the system response functions and an assumed set of library terms. Before we can state
our algorithm precisely, we need to make a few assumptions about the library and the system response data.

Monomial Library Terms: Many physical systems are governed by equation 1 for a particular M and
set of coefficients. In many cases of practical interest (e.g. solid and fluid mechanics, thermodynamics,
and quantum mechanics), the governing PDE consists of terms that are monomials of u and its partial
derivatives. That is, the terms are of the form
fk (∂ˆ0 u, . . . , ∂ˆNM u) = \prod_{m=0}^{N_M} (∂ˆm u)^{p_k(m)},    (2)

for some pk (0), . . . , pk (NM ) ∈ N ∪ {0}. We call library functions of this form monomial library functions. In
this paper, we consider libraries consisting of monomial library terms. In principle, however, our proposed
algorithm can work with any library.
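To make equation 2 concrete, a monomial library term can be evaluated from a table of partial-derivative values in a few lines of code. The sketch below is illustrative only; the function name, arguments, and tensor layout are our own assumptions and are not part of PDE-LEARN's interface.

import torch

def monomial_term(derivs: torch.Tensor, powers: list) -> torch.Tensor:
    """Evaluate a monomial library term f_k = prod_m (d_hat_m u)^{p_k(m)}.

    derivs : tensor of shape (num_points, N_M + 1); column m holds the value of
             the m-th partial derivative of u at each point.
    powers : list of N_M + 1 non-negative integers, the exponents p_k(m).
    """
    result = torch.ones_like(derivs[:, 0])
    for m, p in enumerate(powers):
        if p > 0:
            result = result * derivs[:, m] ** p
    return result

For instance, with the enumeration of the preceding example, the Burgers term (Dx u)(u) corresponds to powers = [1, 1, 0, . . . , 0].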
Data Points: Let i ∈ {1, 2, . . . , nS } and let {(t_j^{(i)}, X_j^{(i)})}_{j=1}^{NData(i)} ⊆ (0, Ti ] × Ωi be a collection of data points in the ith problem domain. We assume that we have noisy measurements of ui at these data points, which we denote by

{ũi (t_j^{(i)}, X_j^{(i)})}_{j=1}^{NData(i)}.

We refer to the collection of these measurements as the ith noisy data set. Further, we refer to the corresponding set {ui (t_j^{(i)}, X_j^{(i)})}_{j=1}^{NData(i)} as the ith noise-free data set. Critically, we only assume knowledge of the noisy data set. In general, since the measurements are noisy,

ui (t_j^{(i)}, X_j^{(i)}) ≠ ũi (t_j^{(i)}, X_j^{(i)}).

We assume, however, that for each j ∈ {1, 2, . . . , NData (i)},

ui (t_j^{(i)}, X_j^{(i)}) − ũi (t_j^{(i)}, X_j^{(i)}) ∼ N(0, σi^2),

for some σi > 0. We refer to this difference as the noise at the data point (t_j^{(i)}, X_j^{(i)}). We assume that the noises at different data points are independent and identically distributed. We define the noise level of a data set as the ratio of σi to the standard deviation of the noise-free data set. Finally, if nS = 1, we write NData for NData (1).
Goals: Our goal is to use the noisy data sets, {ũi (t_j^{(i)}, X_j^{(i)})}_{j=1}^{NData(i)}, and the library, f0 , . . . , fK , to learn
the coefficients c1 , . . . , cK in equation 1. To do this, we learn an approximation to each ui that satisfies
equation 1 for some set of coefficients. To maximize the applicability of our approach, we make as few
assumptions as possible about the hidden PDE. In particular, we generally use a large library. The goal here
is to select a library that is broad enough to include the terms that are present (have non-zero coefficients)
in the hidden PDE without requiring the user to identify those terms beforehand. This approach means our
library includes many extraneous terms; i.e., most of the coefficients should be zero. Therefore, we tacitly
assume the right-hand side of equation 1 is sparse.

For reference, table 1 lists the notation we introduced in this section.

3 Related Work
In this section, we discuss relevant previous work on PDE discovery to contextualize our contributions. PDE
discovery began in the late 2000s with two papers, [9] and [29]. While neither paper is directly concerned with
identifying PDEs (the former focuses on discovering dynamical systems, while the latter focuses on identifying
invariant and conservation laws), they set the stage for using machine learning to discover scientific laws
from data. Both approaches use genetic algorithms to learn relationships (hidden dynamical system in the
former and conservation laws in the latter) that the system response function satisfies.

A significant advance came a few years after [29] when Rudy et al. introduced PDE-FIND [27]. Developed
as a modification of the Sparse Identification of Nonlinear Dynamics (SINDy) algorithm [13], PDE-FIND
represents one of the earliest and most important breakthroughs in identifying PDEs directly from data.
PDE-FIND uses similar assumptions to the ones listed in section 2 but additionally assumes that nS = 1 (a
2 For a given problem, all spatial domains come from the SAME Euclidean subspace. That is, Ωi ⊆ RnD for each i.

Notation Meaning
nD The number of spatial dimensions in the spatial problem domain.
nS The number of system response functions.
Ωi The ith spatial domain2 : an open, connected subset of RnD.
(0, Ti ] ⊆ R The time interval over which ui evolves on Ωi.
ui : (0, Ti ] × Ωi → R The ith system response function. If nS = 1, we denote u1 = u.
NM The number of distinct partial derivatives of u of order ≤ M.
∂ˆ0 , . . . , ∂ˆNM An enumeration of the partial derivatives of order ≤ M (including the identity map).
Dsn g The nth derivative of some function g with respect to the variable s.
Ds g Abbreviated notation for Ds1 g.
NData (i) The number of data points in the ith noisy data set. If nS = 1, we write NData (1) = NData.
{(t_j^{(i)}, X_j^{(i)})}_{j=1}^{NData(i)} The data points for the ith system response function.
ũi (t, X) A noisy measurement of ui at (t, X) ∈ (0, Ti ] × Ωi.
K The number of library functions. See equation 1.
f0 The left-hand side term. See equation 1.
f1 , . . . , fK The right-hand side terms. See equation 1.
c1 , . . . , cK The coefficients of the RHS terms f1 , . . . , fK in equation 1.
Noise level The ratio of the standard deviation of the noise to that of the noise-free data set.

Table 1: The notation and terminology of section 2.

single experiment), f0 (u) = Dt u, and the RHS terms depend only on the lone system response function,
u, and its spatial partial derivatives. Critically, PDE-FIND also assumes the data points occur on a regular
grid. This additional constraint allows PDE-FIND to use numerical differentiation techniques to approximate
the partial derivatives of u. Using these approximations, PDE-FIND can evaluate the library terms at the
data points, which engenders a linear system for the coefficients c1 , . . . , cK . PDE-FIND then finds a sparse,
approximate solution to this system using an algorithm called Sequentially Thresholded Ridge regression, or
ST-Ridge for short. PDE-FIND can successfully identify a wide range of PDEs directly from data. With that
said, PDE-FIND does have some fundamental limitations. In particular, since numerical differentiation tends
to amplify noise, PDE-FIND’s performance decreases considerably in the presence of moderate noise levels.
Further, requiring the data to occur on a regular grid is a cumbersome limitation for scientific applications,
where data can be difficult and expensive to acquire.

In addition to PDE-FIND, two other notable examples of early PDE discovery algorithms are [28] and [8].
The former is similar to PDE-FIND but uses spectral methods to approximate the derivatives. This change
enables their approach to identify a variety of PDEs, even in the presence of significant noise. Like PDE-FIND,
however, [28] does require that the data points occur on a regular grid. The latter trains a neural network,
U : (0, T ] × Ω → R, to match a noisy data set (thereby learning an approximation to the system response
function) and then uses sparse regression to identify the coefficients c1 , ..., cK . Using a network to interpolate
the data allows the data points to be dispersed anywhere in the problem domain.

Another significant contribution to PDE discovery came a few years later with DeepMoD [11]. DeepMoD
learns an approximation, ξ, to the coefficients c1 , ..., cK , while simultaneously training a neural network,
U : (0, T ] × Ω → R, to match a noisy data set. Thus, their approach learns the hidden PDE and the
system response function at the same time. Further, like [8], the data points for DeepMoD can be arbitrarily

distributed throughout the problem domain. DeepMoD uses Automatic Differentiation [7] to calculate the
partial derivatives of the neural network at randomly selected collocation points in the problem domain.
DeepMoD can then use these partial derivatives to evaluate the library terms at the collocation points. To
train U and ξ, DeepMoD uses a three-part loss function. The first part measures how well U matches the
data set, the second measures how well U satisfies the hidden PDE (equation 1) with the components of ξ in place
of c1 , . . . , cK , and the third is the L1 norm of ξ (this promotes sparsity in ξ). The second and third parts
of the loss function embed the LASSO loss function within DeepMoD’s loss function. These parts encourage
U to learn the function that roughly matches the data but also satisfies a PDE of the form of equation 1.
This approach embeds the fact that the system response function satisfies a PDE characterized by ξ into the
loss function. This loss function produces a robust algorithm that can identify PDEs, even from noisy and
limited data. DeepMoD served as the original inspiration for PDE-LEARN. Finally, it is worth noting that [15]
proposed an approach similar to DeepMoD but uses an innovative training scheme that achieves impressive
results on several PDEs.

More recently, the authors of this paper proposed another algorithm, PDE-READ [31]. PDE-READ uses two
neural networks: The first, U : (0, T ] × Ω → R, learns the system response function, and the second, N ,
learns an abstract representation of the right-hand side of equation 1. That approach is based on Raissi’s
deep hidden physics models algorithm [26]. Like DeepMoD, PDE-READ learns an approximation of the system
response function while simultaneously identifying the hidden PDE. Significantly, PDE-READ utilizes Rational
Neural Networks [12], a type of fully connected neural network whose activation functions are trainable
rational functions. After training both networks, PDE-READ uses a modified version of the Recursive Feature
Elimination algorithm [19] to extract c1 , . . . , cK from N . This approach proves impressively robust, as
PDE-READ can identify a variety of PDEs even from limited measurements with very high noise levels.

The methods discussed above mostly use a standard fully connected neural network to approximate the
system response function and identify the hidden PDE. The PDE-discovery community has, however, pro-
posed many other approaches. [18] and [24] learn the hidden PDE via a weak-formulation approach. Using
weak forms places additional restrictions on the form of the hidden PDE but engenders an algorithm that
is remarkably robust to noise. Further, [4] uses a Gaussian process to approximate the system response
function and a genetic algorithm to identify the hidden PDE. Finally, [10] uses Bayesian Neural networks to
learn the system response function and identify the hidden PDE.

In this paper, we introduce a new PDE discovery algorithm, PDE-LEARN. PDE-LEARN can learn a broad class of
PDEs directly from noisy and limited data. Like PDE-READ, PDE-LEARN utilizes Rational Neural Networks [12]
to learn an approximation of the system response functions. Further, like DeepMoD, PDE-LEARN uses a sparse,
trainable vector, ξ, to learn an approximation to the coefficients c1 , . . . , cK . What sets PDE-LEARN apart,
however, is its three-part loss function that incorporates aspects of the iteratively reweighted least-squares
algorithm [14] to help promote sparsity in ξ. Equally important, PDE-LEARN uses an adaptive procedure to
place additional collocation points in regions where the collocation loss is the greatest. This process helps
accelerate convergence and yields an algorithm that is highly effective in the low-data, high-noise limit.

4 Methodology
In this section, we describe our algorithm - PDE discovery via L0 Error Approximation and Rational
Neural networks - or PDE-LEARN for short. PDE-LEARN does the following: First, it uses noisy data sets,
{ũi (t_j^{(i)}, X_j^{(i)})}_{j=1}^{NData(i)}, to learn an approximation to each system response function, ui . Second, it uses the
fact that the system response functions satisfy a PDE of the form of equation 1 to learn the coefficients
c1 , . . . , cK . Significantly, PDE-LEARN can learn any PDE of the form of equation 1 and can operate with
multiple spatial variables.

PDE-LEARN uses a Rational Neural Network [12], Ui : (0, T ] × Ω → R, to approximate the ith system response
function, ui . It approximates the coefficients c1 , . . . , cK in equation 1 using a trainable vector, ξ. During
training, PDE-LEARN learns ξ and each Ui by minimizing the loss function

Loss (U1 , . . . , UnS , ξ) = \sum_{i=1}^{nS} wData LossData (Ui ) + \sum_{i=1}^{nS} wColl LossColl (Ui , ξ) + wLp LossLp (ξ).    (3)

Here, wData , wColl , and wLp are user-selected scalar hyperparameters. Further,

LossData (Ui ) = \frac{1}{NData(i)} \sum_{j=1}^{NData(i)} ( Ui (t_j^{(i)}, X_j^{(i)}) − ũi (t_j^{(i)}, X_j^{(i)}) )^2    (4)

LossColl (Ui , ξ) = \frac{1}{NColl(i)} \sum_{j=1}^{NColl(i)} ( R_{PDE} (Ui , t̂_j^{(i)}, X̂_j^{(i)}) )^2    (5)

LossLp (ξ) = \sum_{k=1}^{K} ak ξk^2.    (6)

In equation 6, PDE-LEARN updates the constants ak at the start of each epoch such that the Lp loss approx-
imates the p norm of ξ. We discuss this in detail in section 4.2. In equation 5, RP DE is the PDE-Residual,
defined by

R_{PDE} (Ui , t̂_j^{(i)}, X̂_j^{(i)}) = f0 (t̂_j^{(i)}, X̂_j^{(i)}) − \sum_{k=1}^{K} ξk fk (t̂_j^{(i)}, X̂_j^{(i)}),    (7)

where fk (t̂_j^{(i)}, X̂_j^{(i)}) is an abbreviation for fk (∂ˆ0 U(t̂_j^{(i)}, X̂_j^{(i)}), . . . , ∂ˆNM U(t̂_j^{(i)}, X̂_j^{(i)})). PDE-LEARN uses automatic differentiation [7] to calculate the partial derivatives of U and subsequently evaluate the library functions. The points {(t̂_j^{(i)}, X̂_j^{(i)})}_{j=1}^{NColl(i)} ⊆ (0, Ti ] × Ωi are the collocation points. We discuss these in detail below in section 4.1.

We refer to LossData , LossColl , and LossLp as the Data, Collocation, and Lp losses, respectively. Figures 1
and 2 depict how PDE-LEARN evaluates the Data and Collocation Losses, respectively.

The Data Loss forces U to satisfy the noisy data, {ũ(ti , Xi )}_{i=1}^{NData}, at the data points. It is the mean square
error between U ’s predictions and the noisy data set. The collocation loss forces U to satisfy the PDE
encoded in

f0 (t̂_j^{(i)}, X̂_j^{(i)}) = \sum_{k=1}^{K} ξk fk (t̂_j^{(i)}, X̂_j^{(i)}),

at the collocation points {(t̂_j^{(i)}, X̂_j^{(i)})}_{j=1}^{NColl(i)}. It is what couples the training of U and ξ. Finally, LossLp
encodes our assumption that most of the coefficients in equation 1 are zero by promoting sparsity in ξ. It is
a weighted sum of the squares of components of ξ. PDE-LEARN accomplishes this by re-selecting the weights,
ak , at the start of each epoch. We discuss this in detail in section 4.2.
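For readers who prefer code, the sketch below assembles the three losses of equations 3 through 7 in PyTorch-style pseudocode. It is a minimal sketch under our own assumptions about tensor shapes and argument names, not the authors' implementation; in particular, the library values f0 and F are assumed to have already been evaluated, via automatic differentiation, at the collocation points.

import torch

def pde_learn_loss(U_list, xi, data_sets, library_vals, a, w_data, w_coll, w_lp):
    """Assemble the loss of equation 3 for one epoch (illustrative sketch).

    U_list       : list of networks U_i, one per system response function.
    xi           : trainable coefficient vector with K components.
    data_sets    : list of (inputs, noisy_targets) pairs, one per experiment.
    library_vals : list of (f0, F) pairs, where f0 has shape (N_coll,) and
                   F has shape (N_coll, K); they hold the LHS term and the RHS
                   library terms evaluated at that experiment's collocation points.
    a            : IRLS weights a_1, ..., a_K, held fixed during the epoch.
    """
    loss = torch.zeros(())
    for U_i, (X_data, u_noisy), (f0, F) in zip(U_list, data_sets, library_vals):
        # Data loss (equation 4): mean-square misfit at the data points.
        loss = loss + w_data * torch.mean((U_i(X_data).squeeze() - u_noisy) ** 2)
        # Collocation loss (equation 5): mean-square PDE residual (equation 7).
        residual = f0 - F @ xi
        loss = loss + w_coll * torch.mean(residual ** 2)
    # Lp loss (equation 6): weighted sum of squared coefficients.
    loss = loss + w_lp * torch.sum(a * xi ** 2)
    return loss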

To use PDE-LEARN, one must provide a noisy data set, select an architecture for U , and select an appropriate
collection of library terms. Here, appropriate means that the right-hand side of the hidden PDE can be
expressed as a sparse linear combination of the library terms. Once PDE-LEARN has finished training, it
reports the identified PDE.

4.1 Collocation Loss


   
(i) (i) (i) (i)
In equation 7, RP DE Ui , t̂j , X̂j is the PDE residual of Ui at the collocation point t̂j , X̂j ∈ (0, T ] ×
Ω. PDE-LEARN uses two types of collocation points: random collocation points and targeted collocation

Figure 1: This figure depicts the process that PDE-LEARN uses to evaluate the data loss. The white circles on the
left side of the figure represent the data points. Moving from left to right, we feed data points to the rational neural
network, U . We then compare the resulting predictions with the noisy measurements of the system response function
(in this case, for Burgers' equation in section 5.1). This process yields the data loss (right side of the figure).

points. Each problem domain has collocation points (both random and targeted). For each problem domain,
PDE-LEARN selects the random collocation points by repeatedly sampling from a uniform distribution over
the problem domain,
 
(t̂_j^{(i)}, X̂_j^{(i)}) ∼ Unif((0, Ti ] × Ωi ).

The number of random collocation points in each problem domain, NColl Random , is a hyperparameter.
PDE-LEARN re-samples the random collocation points for each problem domain at the start of each epoch.
The targeted collocation points are the random collocation points from previous epochs at which the PDE
residual is unusually large. At the start of training, we initialize the targeted collocation points to be the
empty set. During subsequent epochs, PDE-LEARN uses the following procedure for each system response
function:

1. During each epoch, PDE-LEARN records the absolute value of the PDE residual at each collocation point (both random and targeted) for the ith system response function, yielding a set of non-negative values.
2. PDE-LEARN then computes the mean and standard deviation of this set.

3. PDE-LEARN then determines which collocation points have an absolute PDE-Residual that is more than
three standard deviations larger than the mean. These points are the targeted collocation points for
the next epoch.

This procedure accelerates training by adaptively focusing the training of Ui and ξ on regions of the problem
domain where the PDE residual is large. If the PDE residual is large in a particular region of the ith problem
domain, any collocation points in that region will become targeted collocation points. Because the random
collocation points are re-sampled, new collocation points will appear in the problematic region at the start of
each epoch. Thus, targeted collocation points will accumulate in that region. Eventually, the PDE residual
in that region will dominate the collocation loss. This forces Ui and ξ to adjust until the PDE residual in
the problematic region shrinks.
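A minimal sketch of this selection rule, assuming the absolute PDE residuals for one system response function have already been collected in a tensor, could look as follows (the function name is ours):

import torch

def select_targeted_points(coll_points: torch.Tensor,
                           abs_residuals: torch.Tensor) -> torch.Tensor:
    """Return the collocation points whose absolute PDE residual exceeds the
    mean by more than three standard deviations; these become the targeted
    collocation points for the next epoch."""
    mean = abs_residuals.mean()
    std = abs_residuals.std()
    mask = abs_residuals > mean + 3.0 * std
    return coll_points[mask]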

Though PDE-LEARN can work with an arbitrary collection of library functions, we only consider mono-
mial library functions in this paper. To evaluate these functions efficiently, PDE-LEARN records every
partial derivative operator that is present in at least one library function (equivalently, {∂ˆj : ∃ k ∈
{0, 1, 2, . . . , K} such that ∂ˆj u is one of the sub-terms of fk }). At the start of each epoch, PDE-LEARN eval-
uates these partial derivatives of each Ui at each collocation point. We implemented this process to be as

Figure 2: This figure depicts the process that PDE-LEARN uses to calculate the collocation loss. The white and black
circles on the left side of the figure represent the random and targeted collocation points, respectively. Moving from
left to right, PDE-LEARN evaluates U at each collocation point. It then uses automatic differentiation to evaluate ∂ˆ0 U, . . . , ∂ˆNM U, and the library functions at the collocation points. Finally, using these values and ξ, PDE-LEARN evaluates the Collocation Loss.

efficient as possible. In particular, it starts by computing the lowest-order partial derivatives of Ui . For
subsequent partial derivatives, PDE-LEARN uses the following rule: if we can express a partial derivative of Ui
as a partial derivative operator applied to another partial derivative of Ui that we have already computed,
then compute the new partial derivative from the old one. This approach allows us to compute all the
necessary partial derivatives of Ui without any redundant computations. After computing the partial deriva-
tives of Ui , PDE-LEARN uses them to evaluate the library terms at the collocation points. It then evaluates
the PDE-Residual, equation 7, at each collocation point. Finally, from the PDE residuals, PDE-LEARN can
compute the collocation loss, equation 5. Figure 2 depicts the process that PDE-LEARN uses to calculate the
Collocation Loss.
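As an illustration of this reuse rule, the sketch below obtains Dx U and then computes Dx2 U by differentiating the already-computed first derivative rather than the network output. It assumes a network U that maps (t, x) pairs to scalars and uses PyTorch automatic differentiation; it is a simplified example, not PDE-LEARN's internal bookkeeping.

import torch

def spatial_derivatives(U, t, x):
    """Evaluate U, Dx U, and Dx^2 U at points (t, x), computing each higher
    derivative from a previously computed lower one."""
    t = t.detach().clone().requires_grad_(True)
    x = x.detach().clone().requires_grad_(True)
    u = U(torch.stack((t, x), dim=1)).squeeze()

    # First derivative, computed from the network output.
    du_dx = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    # Second derivative, computed from the first derivative (no recomputation from u).
    d2u_dx2 = torch.autograd.grad(du_dx, x, grad_outputs=torch.ones_like(du_dx),
                                  create_graph=True)[0]
    return u, du_dx, d2u_dx2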

4.2 Lp Loss
The collocation and Lp losses embed the iteratively reweighted least-squares [14] loss function into ours. At
the start of each epoch, PDE-LEARN updates the weights a1 , . . . , aK in equation 6 using the following rule:
 
ak = 1 / max{δ, |ξk |^{2−p}}.    (8)
Here, δ > 0 is a small constant to prevent division by zero, and p ∈ (0, 2) is a hyperparameter. Crucially,
the value ξk that appears in this equation is the kth component of ξ at the start of the epoch; thus ak is
treated as a constant during backpropagation. Notably, if ξk is sufficiently large, then

ak ξk^2 = |ξk |^p,

and so

LossLp = \sum_{k=1}^{K} |ξk |^p ≈ ||ξ||_0.

Critically, PDE-LEARN evaluates each ak at the start of every epoch and treats them as constants during
that epoch. This subtle detail allows LossLp to closely approximate ||ξ||_0 while remaining a smooth, convex
function of ξ’s components. We discuss this in section 6.3.
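A minimal sketch of this reweighting step is given below. The max in the denominator follows equation 8; the default values of p and δ are illustrative assumptions only, and the weights are detached so they act as constants during backpropagation.

import torch

def update_irls_weights(xi: torch.Tensor, p: float = 0.1,
                        delta: float = 1e-7) -> torch.Tensor:
    """Recompute the weights a_1, ..., a_K of equation 8 at the start of an
    epoch; with these weights held fixed, a_k * xi_k^2 approximates |xi_k|^p."""
    with torch.no_grad():
        return 1.0 / torch.clamp(torch.abs(xi) ** (2.0 - p), min=delta)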

4.3 Training
PDE-LEARN identifies the hidden PDE using a three-step training process. We use the Adam [20] optimizer
to minimize 3 in all three steps.

We call the first step the burn-in step. PDE-LEARN first initializes ξ and the Ui ’s. It initializes ξ to a vector
of zeros. It initializes the weights matrices and bias vectors in Ui using the initialization procedure in [17].
Finally, it initializes the rational activation functions using the procedure in [12]. PDE-LEARN then sets wLp
to zero and begins training. During this step, Ui learns an approximation to ui . Since wLp = 0, almost all
components of ξ become non-zero; we do not attempt to identify the PDE during this step.

At the end of the burn-in step, PDE-LEARN prunes ξ by eliminating all RHS terms whose corresponding
component of ξ is smaller than a threshold. Throughout this paper, we select the threshold to be slightly
larger than √ε, where ε is machine epsilon for single-precision floating-point numbers. We discuss the implications
of pruning in section 6.2.
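As an illustration, the pruning step could be sketched as follows; the helper name, the exact threshold constant, and the representation of the library are our own assumptions rather than PDE-LEARN's interface.

import torch

SQRT_EPS32 = float(torch.finfo(torch.float32).eps) ** 0.5   # about 3.45e-4

def prune(xi: torch.Tensor, term_names: list, threshold: float = 1.1 * SQRT_EPS32):
    """Remove RHS terms whose coefficient magnitude falls below the pruning
    threshold (chosen slightly above sqrt(machine epsilon) for single precision;
    the factor 1.1 is our own illustrative choice)."""
    keep = torch.abs(xi) >= threshold
    kept_names = [name for name, k in zip(term_names, keep.tolist()) if k]
    return xi[keep].clone(), kept_names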

In the second step, which we call the sparsification step, we set wLp to a small, non-zero value and then
resume training. During this step, ξ becomes sparse, only retaining the components of ξ that correspond to
RHS terms that are present in the hidden PDE. After training, we repeat the pruning process (which usually
eliminates the bulk of the extraneous RHS terms). By the end of this step, PDE-LEARN identifies which RHS
terms have non-zero coefficients. However, since the Lp loss encourages each coefficient to go to zero, the
magnitudes of the retained coefficients are usually too small at this point.

In the third step, which we call the fine-tuning step, we set wLp to zero once again and resume training.
This step retains only the RHS terms that survived the sparsification step. By removing the Lp loss, the
components of ξ are no longer under pressure to be as close to 0 as possible, which allows them to converge to
the values in the hidden PDE. PDE-LEARN trains until the Lp loss stops increasing. PDE-LEARN then reports
the identified PDE encoded in ξ.

For brevity, we will let NBurn−in , NSparse , and NF ine−tune denote the number of burn-in, sparsification,
and fine-tuning epochs, respectively.

Table 2 lists the notation we introduced in this section.

5 Experiments
We implemented PDE-LEARN as an open-source Python library. Our implementation is publicly available
at https://github.com/punkduckable/PDE-LEARN, along with auxiliary MATLAB scripts to generate our
data sets.

As stated in section 4, each Ui is a rational neural network [12]. Thus, Ui ’s activation functions are trainable
(3, 2) rational functions (third-order polynomial in the numerator, second-order polynomial in the denominator). In other words, the coefficients that define the numerator and denominator polynomials are trainable
parameters that the network learns along with its weight matrices and bias vectors. Each hidden layer gets
its own activation function, which we apply to each hidden unit in that layer.
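For concreteness, a trainable (3, 2) rational activation of this kind can be written as a small PyTorch module. The sketch below is a simplified stand-in: the initial coefficient values are placeholders rather than the initialization of [12], and the absolute value in the denominator is our own guard against division by zero.

import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Trainable (3, 2) rational activation: a degree-3 numerator over a
    degree-2 denominator, applied element-wise; one instance per hidden layer."""
    def __init__(self):
        super().__init__()
        self.num = nn.Parameter(torch.tensor([1.0, 1.0, 0.0, 0.0]))  # a0..a3 (placeholders)
        self.den = nn.Parameter(torch.tensor([1.0, 0.0]))            # b1, b2 (placeholders)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a0, a1, a2, a3 = self.num
        b1, b2 = self.den
        numerator = a0 + a1 * x + a2 * x ** 2 + a3 * x ** 3
        denominator = 1.0 + torch.abs(b1 * x + b2 * x ** 2)
        return numerator / denominator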

In this section, we test PDE-LEARN on several PDEs, both linear and non-linear. All of the data we use in
these experiments come from numerical simulations. The data sets from these simulations represent our
noise-free data sets. To create noisy, limited data sets with NData ∈ N points and a noise level q ≥ 0, we use
the following procedure:

1. Calculate the standard deviation, σnf , of the samples of the system response function in the noise-free data set.

Notation Meaning
Ui Neural network used to approximate ui.
ξ A trainable vector in RK whose components approximate c1 , . . . , cK . See equation 1.
p A hyperparameter representing the “p” in “Lp ”. See equation 6.
LossData The data loss. It measures the mean square error between Ui and the noisy measurements of ui at the ith data points. See equation 4.
LossColl The collocation loss. It measures how well U satisfies the hidden PDE encoded in ξ at the collocation points. See equation 5.
LossLp The Lp loss. It represents the Lp quasi-norm of ξ raised to the pth power. See equation 6.
R_{PDE} The PDE residual. See equation 7.
fk (t̂_j^{(i)}, X̂_j^{(i)}) An abbreviation of fk (∂ˆ0 U(t̂_j^{(i)}, X̂_j^{(i)}), . . . , ∂ˆNM U(t̂_j^{(i)}, X̂_j^{(i)})).
wData , wColl , wLp Hyperparameters that specify the weights of LossData , LossColl , and LossLp , respectively. See equation 3.
NColl (i) The number of collocation points for ui. See equation 5.
NRandom Coll A hyperparameter specifying the number of random collocation points. We use the same value for each system response function. See sub-section 4.1.
{(t̂_j^{(i)}, X̂_j^{(i)})}_{j=1}^{NColl(i)} The ith set of collocation points. See equation 5.
NBurn-in The number of burn-in epochs.
NSparse The number of sparsification epochs.
NFine-tune The number of fine-tuning epochs.

Table 2: The notation and terminology of section 4.

2. Select a subset of size NData from the noise-free data set by sampling NData points from the noise-free
data set without replacement. The resulting subset is the limited data set.
3. Independently sample a Gaussian distribution with mean 0 and standard deviation q·σnf once for each point in the limited data set. Add the ith value to the ith point in the limited data set. The resulting set is our noisy, limited data set. A short code sketch of this procedure appears after this list.
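The following NumPy sketch mirrors the three steps above; the function and variable names are ours.

import numpy as np

def make_noisy_limited_data(t, x, u, n_data: int, q: float, seed: int = 0):
    """Build a noisy, limited data set from a noise-free one.

    t, x, u : 1-D arrays of equal length holding the noise-free data points and
              the corresponding samples of the system response function.
    n_data  : number of points to keep.
    q       : noise level (ratio of noise std to the std of the noise-free data).
    """
    rng = np.random.default_rng(seed)
    sigma_nf = u.std()                                          # step 1
    idx = rng.choice(u.shape[0], size=n_data, replace=False)    # step 2
    noise = rng.normal(0.0, q * sigma_nf, size=n_data)          # step 3
    return t[idx], x[idx], u[idx] + noise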

As discussed in section 4.3, we use a three-step training process to train ξ and {U1 , . . . , UnS }. In our
experiments, we stop the burn-in step when the loss stops decreasing, which often takes between 1000 and 1500
epochs. For the sparsification step, we select a small wLp value (usually 0.0001) and train until the Lp loss
stabilizes for a few hundred epochs (usually 1, 000 to 2, 000 epochs after burn-in). Finally, for the fine-tuning
step, we stop training once the Lp loss stops increasing or once an equation-specific3 number of fine-tuning epochs is complete.

In our experiments, all three steps use the Adam optimizer [20] with a learning rate of 10−3 . Though we do
not use it in our experiments, our implementation supports the LBFGS optimizer [23].

In every experiment in this section, we set p = 0.1 and NRandom Coll = 3, 000 (see section 4.3). We did not
attempt to optimize these values and do not claim they are optimal. However, we found them to be sufficient
in our experiments.
3 For every equation except the Allen-Cahn equation, we use an upper limit of 2, 000 fine-tuning epochs. In these experiments,

the Lp loss generally stops changing by that time. For the Allen-Cahn equation, however, the Lp loss takes much longer to
stabilize, so we use an upper limit of 10, 000 fine-tuning epochs.

Figure 3: Noise-free Burgers’ equation data set.

5.1 Burgers’ equation


Burgers' equation is a second-order non-linear PDE. It was first studied in [6] but arises in many
contexts, including fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow [5]. Burgers'
equation has the form

Dt u = ν(Dx2 u) − (u)(Dx u), (9)


where velocity, u, is a function of t and x. Here, ν > 0 is the diffusion coefficient. Significantly, solutions to
Burgers’ equation can develop shocks (discontinuities).

We test PDE-LEARN on Burgers' equation with ν = 0.1 on the domain (t, x) ∈ [0, 10] × [−8, 8]. Thus, for
these experiments, nS = 1. For this data set,

u(0, x) = −sin(πx/8).
To make the noise-free data set, we partition the problem domain using a spatiotemporal grid with 257
grid lines along the x-axis and 201 along the t-axis. Thus, each grid square has a length of 1/16 along the
x-axis and a length of 0.05 along the t-axis. Our script Burgers Sine.m (in the MATLAB sub-directory of
our repository) uses Chebfun’s [16] spin class to find a numerical solution to Burgers’ equation on this grid.
Figure 3 depicts the noise-free data set.

Using the procedure outlined at the beginning of section 5, we generate several noisy, limited data sets from
the noise-free data set. We test PDE-LEARN on each data set. In each experiment, U contains five layers with
20 neurons per layer. For these experiments, the left-hand side term is
 
f0 (∂ˆ0 u, . . . , ∂ˆNM u) = Dt U.

Likewise, the right-hand side terms are

U, Dx U, Dx2 U, Dx3 U,
(U )2 , (Dx U )U, (Dx2 U )U, (Dx U )2 ,
(U )3 , (Dx U )(U )2 , (Dx2 U )U 2 , (Dx U )2 U,
(U )4 , (Dx U )(U )3 , (Dx2 U )(U )3 , (Dx U )2 (U )2 , (Dx U )3 U

Thus, our library includes terms with up to third-order spatial derivatives and fourth-order multiplicative
products. For these experiments, we use 1, 000 burn-in epochs (NBurn−in ), 1, 000 sparsification epochs
(NSparse ) with wLp = 0.0001, and a variable number of fine-tuning epochs (NF ine−tune ). Table 3 reports
the results of our experiments with Burgers’ equation.

Table 3: Experimental results for Burgers’ equation

NData Noise NBurn−in NSparse NF ine−tune Identified PDE


4, 000 50% 1, 000 1, 000 2, 000 Dt U = 0.0987(Dx2 U ) − 0.9704(Dx U )(U )
4, 000 75% 1, 000 1, 000 1, 000 Dt U = 0.1025(Dx2 U ) − 0.9883(Dx U )(U )
4, 000 100% 1, 000 1, 000 400 Dt U = 0.0824(Dx2 U ) − 0.8476(Dx U )(U )
2, 000 50% 1, 000 1, 000 1, 000 Dt U = 0.0850(Dx2 U ) − 0.9010(Dx U )(U )
2, 000 75% 1, 000 1, 000 0 Dt U = 0.0608(Dx2 U ) − 0.7067(Dx U )(U )
2, 000 100% 1, 000 1, 000 750 Dt U = 0.0582(Dx2 U ) − 0.6756(Dx U )(U )
1, 000 25% 1, 000 1, 000 1, 000 Dt U = 0.1003(Dx2 U ) − 0.9880(Dx U )(U )
1, 000 50% 1, 000 1, 000 200 Dt U = 0.0839(Dx2 U ) − 0.7977(Dx U )(U )
1, 000 75%1 1, 000 1, 000 0 Dt U = 0.0386(Dx2 U ) − 0.5327(Dx U )(U )
500 10% 1, 000 1, 000 1, 000 Dt U = 0.1002(Dx2 U ) − 0.9723(Dx U )(U )
500 25% 1, 000 1, 000 1, 000 Dt U = 0.0876(Dx2 U ) − 0.9714(Dx U )(U )
500 50%1 1, 000 1, 000 0 Dt U = 0.0394(Dx2 U ) − 0.7171(Dx U )(U )
250 10% 1, 000 1, 000 200 Dt U = −0.0638(U ) − 0.0969(Dx U )(U )
250 25% 1, 000 1, 000 1, 000 Dt U = 0.0948(Dx2 U ) − 0.8346(Dx U )(U )
1 For these experiments, we use wLp = 0.0002 during the sparsification step. If we set wLp = 0.0001, PDE-LEARN identifies Dt U = −0.0373(U ) + 0.0385(Dx2 U ) − 0.5168(Dx U )(U ) in the NData = 1, 000, 75% noise experiment and Dt U = −0.0379(U ) + 0.0327(Dx2 U ) − 0.6407(Dx U )(U ) − 0.0132(Dx U )2 in the NData = 500, 50% noise experiment.

Thus, PDE-LEARN successfully learns Burgers’ equation in all but one of the experiments. PDE-LEARN can
identify Burgers’ equation from 2, 000 data points even when we corrupt the data set with 100% noise. If the
noise level decreases to just 50%, PDE-LEARN can reliably identify Burgers’ equation with as few as 500 data
points. Increasing the noise tends to increase the relative error between the identified coefficients and the corresponding coefficients in the hidden PDE. With that said, in the 75% noise and 4, 000 data point experiment, the relative error of the identified coefficients is ≈ 1%.

Notably, PDE-LEARN fails to identify Burgers’ equation in one of the NData = 250 experiments. Even in this
experiment, however, PDE-LEARN correctly identifies the RHS term (Dx U )(U ). This result suggests that even
when PDE-LEARN fails, it may still extract useful information about the hidden PDE. Interestingly, when
NData = 250, PDE-LEARN successfully identifies Burgers’ equation when the noise level is 25% but not when
it is 10%. This result suggests that the number of measurements, not the noise level, is the main limiting
factor in the low-data limit.

5.2 KdV Equation
Next, we consider the Korteweg–De Vries (KdV) equation, a non-linear third-order equation. [21] derived
the KdV equation to describe the evolution of one-dimensional, shallow-water waves. With appropriate
scaling, the KdV equation is

Dt u = −(u)(Dx u) − Dx3 u, (10)


where wave height, u, is a function of x and t.

We test PDE-LEARN on the KdV equation on the domain (t, x) ∈ [0, 40] × [−20, 20]. For this equation, we
consider two initial conditions:
u(0, x) = −sin(πx/20),

and

u(0, x) = exp(−π(x/30)^2) cos(πx/10).
We partition the problem domain into a spatiotemporal grid with 257 grid lines along the x-axis and 201 along
the t-axis. We use Chebfun’s [16] spin class to solve the KdV equation with each initial condition on this
grid. We refer to the resulting solutions as our noise-free KdV-sin and KdV-exp-cos data sets, respectively.
The scripts KdV Sine.m and KdV Exp Cos.m in the MATLAB sub-directory of our repository generate these
data sets. Figures 4 and 5 depict the data sets.

In all of our experiments with the KdV equation, U contains four layers with 40 neurons per layer4 . Further,
we use the same left and right-hand side terms as in the Burgers’ experiments.

5.2.1 KdV sin data set


We test PDE-LEARN on several noisy, limited data sets built from the noise-free KdV-sin data set. For these experiments, burn-in takes between 1, 000 and 2, 000 epochs using
the Adam optimizer (stopping once the data loss stops decreasing). For the sparsification step, we use
wLp = 0.0002 and train for between 1, 000 and 2, 000 epochs (stopping once the Lp loss remains roughly
constant for at least 200 epochs). Table 4 reports our experimental results with the KdV-sin data set.

These results show that PDE-LEARN can identify the KdV equation from limited measurements, even at high
noise levels. As in the Burgers experiments, the identified coefficients tend to be more accurate in lower
noise experiments. Interestingly, PDE-LEARN can identify the KdV equation from the sin data set with as
few as 500 data points and 25% noise.

With that said, PDE-LEARN does have its limits. In particular, PDE-LEARN fails to identify the KdV equation
in a few experiments. The identified PDE is too sparse in the 1, 000 data points, 50% noise, and 500
data points, 50% noise experiments. In both cases, however, PDE-LEARN correctly identifies the RHS-term
(Dx U )(U ). By contrast, the identified PDE contains extra terms in the 250 data points, 10% noise, and
250 data points, 25% noise experiments. In these experiments, the identified PDE contains both terms of
the KdV equation (in addition to some extra ones). These results strengthen our assertion that PDE-LEARN
identifies useful information even when it fails. If the identified PDE is too sparse, the terms in the identified
PDE are likely to be present in the hidden PDE. Likewise, if the identified PDE is not sparse enough, the terms
of the hidden PDE are likely to be in the identified PDE.

4 We tried using the same architecture as in the Burgers’ experiments. However, we found that architecture was too simple

to learn the intricacies of the KdV data sets.

Figure 4: Noise-free KdV-sin data set.

Figure 5: Noise-free KdV-exp-cos data set.

Table 4: Experimental results for the KdV-sin data set

NData Noise NBurn−in NSparse NF ine−tune Identified PDE


4, 000 50% 2, 000 2, 000 1, 000 Dt U = −0.8010(Dx3 U ) − 0.8339(Dx U )(U )
4, 000 75% 1, 000 1, 000 300 Dt U = −0.5650(Dx3 U ) − 0.5954(Dx U )(U )
2, 000 25% 1, 000 1, 000 2, 000 Dt U = −0.9101(Dx3 U ) − 0.9217(Dx U )(U )
2, 000 50% 1, 500 2, 000 2, 000 Dt U = −0.8680(Dx3 U ) − 0.8778(Dx U )(U )
1, 000 10% 1, 000 1, 000 2, 000 Dt U = −0.9249(Dx3 U ) − 0.9245(Dx U )(U )
1, 000 25% 1, 500 1, 500 2, 000 Dt U = −0.8752(Dx3 U ) − 0.8937(Dx U )(U )
1, 000 50% 1, 000 1, 000 100 Dt U = −0.0889(Dx U )(U )
500 10% 1, 000 1, 000 2, 000 Dt U = −0.9183(Dx3 U ) − 0.9311(Dx U )(U )
500 25% 1, 000 1, 500 500 Dt U = −0.6641(Dx3 U ) − 0.7350(Dx U )(U )
500 50% 1, 000 1, 000 0 Dt U = −0.0947(Dx U )(U )
250 10% 1, 000 1, 000 0 Dt U = −0.4514(Dx3 U ) − 0.4400(Dx U )(U ) − 0.1034(Dx U )3 (U )
250 25% 1, 000 1, 000 0 Dt U = −0.0943(Dx3 U ) − 0.0725(Dx U )(U ) − 0.1347(Dx U )3 (U )

5.2.2 KdV exp-cos data set


Next, we test PDE-LEARN on the KdV exp-cos data set. For these experiments, burn-in takes 1, 000 epochs.
For the sparsification step, we set wLp = 0.0001 and train for 1, 000 epochs. Table 5 reports our experimental
results with the KdV exp-cos data set.

Once again, PDE-LEARN correctly identified the KdV equation under several noise levels across several data
set sizes. Significantly, it identifies the KdV equation from just 250 data points with 25% noise. Further,
as with the sin data set, PDE-LEARN successfully identified the KdV equation from 4, 000 data points and
75% noise.

As with the sin data set, however, PDE-LEARN does have limits. It fails to identify the KdV equation in the
1, 000 data points, 50% noise experiment, and the 500 data points, 50% noise experiments. In the former,
the identified PDE is too sparse, though the lone term in the identified PDE, (Dx U )(U ), is present in the
KdV equation. The latter is more concerning, as the term Dx U is in the identified PDE, while one of the
terms of the KdV equation, Dx3 U , is not. With that said, even in this case, PDE-LEARN did correctly identify
one of the terms in the KdV equation, (Dx U )(U ). These results suggest that PDE-LEARN fails only under
extreme conditions and can still yield useful information when it does fail.

5.2.3 Combined data sets


Finally, to demonstrate that PDE-LEARN can learn from multiple data sets simultaneously (assuming the
corresponding system response functions satisfy a common PDE), we test with the sin and exp-cos data
sets simultaneously. For these experiments, burn-in takes 1, 000 epochs using the Adam optimizer (stopping
once the data loss stops decreasing). For the sparsification step, we use wLp = 0.0001 and train for 1, 000 epochs. For the combined data sets, we only consider conditions that cause PDE-LEARN trouble when
learning from a single data set. Table 6 reports our experimental results for the combined KdV data set
experiments.

Notably, PDE-LEARN successfully identifies the KdV equation when each data set contains 1000 data points
and 50% noise, even though it can not identify the KdV equation from either data set individually. Further,

Table 5: Experimental results for the KdV-exp-cos data set

NData Noise NBurn−in NSparse NF ine−tune Identified PDE


4, 000 50% 1, 000 1, 000 2, 000 Dt U = −0.8302(Dx3 U ) − 0.8700(Dx U )(U ))
4, 000 75% 1, 000 1, 000 300 Dt U = −0.5758(Dx3 U ) − 0.6384(Dx U )(U )
2, 000 25% 1, 000 1, 000 2, 000 Dt U = −0.9101(Dx3 U ) − 0.9217(Dx U )(U )
2, 000 50% 1, 000 1, 000 1, 000 Dt U = −0.5585(Dx3 U ) − 0.6066(Dx U )(U )
1, 000 10% 1, 000 1, 000 2, 000 Dt U = −0.9145(Dx3 U ) − 0.9229(Dx U )(U )
1, 000 25% 1, 000 1, 000 2, 000 Dt U = −0.7976(Dx3 U ) − 0.8157(Dx U )(U )
1, 000 50% 6001 1, 000 1, 000 Dt U = −0.1006(Dx U )(U )
500 10% 1, 000 1, 000 2, 000 Dt U = −0.9698(Dx3 U ) − 0.9322(Dx U )(U )
500 25% 1, 000 1, 000 200 Dt U = −0.1303(Dx U )(U ) − 0.0584(Dx U )3 (U )
500 50% 1, 000 1, 000 700 Dt U = 0.1323(Dx U ) − 0.1818(Dx U )(U )
250 10% 1, 000 1, 000 500 Dt U = −0.7927(Dx3 U ) − 0.8440(Dx U )(U )
250 25% 1, 000 1, 000 200 Dt U = −0.1028(Dx3 U ) − 0.2467(Dx U )(U )
1 We stop the burn-in step after just 600 epochs for the 1, 000 data point, 50% noise experiment. This is because the solution network began over-fitting the data set.

Table 6: Experimental results for the combined KdV data sets

NData Noise NBurn−in NSparse NF ine−tune Identified PDE


1, 000 25% 1, 000 1, 000 2, 000 (Dt U ) = −0.0233(Dx3 U ) − 0.1286(Dx U )(U )
1, 000 50% 1, 000 1, 000 0 (Dt U ) = −0.8475(Dx3 U ) − 0.8719(Dx U )(U )
500 25% 1, 000 1, 000 1, 000 (Dt U ) = −0.6778(Dx3 U ) − 0.7282(Dx U )(U )
500 50% 500 1, 000 0 (Dt U ) = −0.0794(Dx U )(U )
250 25% 1, 000 1, 000 0 (Dt U ) = −0.1267(Dx U )(U )
1 Due to over-fitting, we stop the burn-in step after 600 epochs in the 1, 000 data point, 50% noise experiment.

in the experiments we ran on both the individual and the combined data sets, the identified coefficients tend
to be more accurate in the combined experiments. These results suggest that using multiple data sets can
improve PDE-LEARN’s performance.

Even with the combined data set, PDE-LEARN has limitations. It fails to identify the KdV equation in the 500
data points, 50% noise, and the 250 data points, 25% noise experiments. As in previous experiments, even
when PDE-LEARN fails, the terms in the identified PDEs are present in the KdV equation. Thus, PDE-LEARN
can still uncover useful information, even when it can not identify the hidden PDE.

Our experiments with the KdV equation suggest that PDE-LEARN’s performance degrades when using fewer
than 500 data points, irrespective of the noise level. Above this threshold, however, PDE-LEARN appears to
be reliable, even in the presence of significant noise.

5.3 Kuramoto–Sivashinsky equation


The Kuramoto–Sivashinsky (KS) equation [22] [30] is a non-linear fourth-order equation that arises in many
physical contexts, including flame propagation, plasma physics, chemical physics, and combustion dynamics
[25]. In one dimension, the KS equation takes the following form:

Figure 6: Noise-free KS equation data set.

Dt u = νDx2 u − µDx4 u − λ(u)(Dx u) (11)


If ν < 0, solutions to the KS equation can be chaotic with violent shocks [22] [25].

We test PDE-LEARN on the KS equation with ν = −1, µ = 1, and λ = 1. For these experiments, our problem
domain is (t, x) ∈ (0, T ] × S = (0, 5] × (−5, 5). We partition S = (−5, 5) into 255 equally-sized sub-intervals
and (0, T ] = (0, 5] into 200 equally-sized sub-intervals. This partition engenders a regular grid with 256
equally-spaced grid lines along the x-axis and 201 along the t-axis. We use Chebfun’s spin class to solve the
KS equation on this grid subject to the initial condition
 
u(0, x) = cos(2πx/5) (1 + sin(πx/5)).
The resulting solution is our noise-free sin data set, which we depict in figure 6.

Using the procedure outlined at the beginning of section 5, we generate several noisy, limited data sets from
the noise-free data set. We test PDE-LEARN on each data set. In each experiment, U contains four layers
with 40 neurons per layer. Further, we use the same left-hand and right-hand side terms as in the previous
experiments, except that we add Dx4 U to the RHS terms. For these experiments, we use 2, 000 burn-in
epochs (NBurn−in ), 1, 000 − 2, 000 sparsification epochs (NSparse ) with wLp = 0.0003, and a variable number
of fine-tuning epochs (NFine-tune). Table 7
reports the results of our experiments with the KS equation.

Thus, PDE-LEARN can successfully identify the KS equation with up to 15% noise. In the 20% noise ex-
periment, the identified PDE does not contain the Dx4 U term but does contain terms that are not present
in the KS equation. This result is a notable departure from the results we observe with other equations,
where misidentified PDEs contain too many or too few terms but never both. Even in this case, however,
the identified PDE contains two of the terms of the KS equation. Thus, PDE-LEARN still recovers useful

Table 7: Experimental results for the KS sin data set

NData Noise NBurn−in NSparse NF ine−tune Identified PDE


4, 000 5% 2, 000 1, 000 2, 000 Dt U = −0.8118(Dx2 U ) − 0.8202(Dx4 U ) − 0.8579(Dx U )(U )
4, 000 10% 2, 000 1, 000 2, 000 Dt U = −0.7747(Dx2 U ) − 0.7895(Dx4 U ) − 0.8252(Dx U )(U )
4, 000 15% 2, 000 2, 000 1, 000 Dt U = −0.6875(Dx2 U ) − 0.7067(Dx4 U ) − 0.7662(Dx U )(U )
4, 000 20% 2, 000 1, 000 500 Dt U = 0.1395(U ) + 0.1808(Dx2 U ) − 0.4439(Dx U )(U ) + 0.0592(Dx U )(U )3 + 0.0834(Dx U )3 (U )

information. These results suggest that while PDE-LEARN can identify the KS equation, it appears to be less
robust with this equation than with other equations we consider in this section. We discuss this result in
section 6.5.

Figure 7: Noise-free Allen-Cahn equation data set.

5.4 Allen-Cahn equation


Next, we consider the Allen-Cahn equation. This second-order non-linear PDE describes the separation of
the constituent metals in a multi-component molten alloy mixture [2]. The Allen-Cahn equation has the
following form:

Dt u = ε(Dx2 u) − u3 + u. (12)

We test PDE-LEARN on the Allen-Cahn equation with ε = 0.003 on the domain (t, x) ∈ [0, 40] × [−20, 20] with
the initial condition
5
u(0, x) = −0.2 sin (2πx) + 0.8 sin (5πx) .

The Allen-Cahn equation with this value of ε presents an interesting challenge for PDE-LEARN because it
contains a small coefficient. As described in section 4, PDE-LEARN eliminates all components of ξ which are
smaller than a pre-defined threshold. In all our experiments5 , the threshold is around 10−4 . As such,
the Allen-Cahn equation contains a coefficient close to the threshold. Therefore, if the identified coefficient is
too far off from the true value, it risks being thresholded. Notwithstanding, PDE-LEARN performs admirably,
even when the hidden PDE contains coefficients that are close to the threshold.

We partition the problem domain into a spatiotemporal grid with 257 equally-spaced grid lines along the
x-axis and 201 along the t-axis. We use Chebfun’s [16] spin class to solve the Allen-Cahn equation on this
grid. We refer to the resulting solution as our noise-free Allen-Cahn data set. The script Allen Cahn.m in
the MATLAB sub-directory of our repository generates this data set. Figure 7 depicts the data set.

For these experiments, U contains five layers with 20 neurons per layer. Further, we use the same left-hand
and right-hand side terms as in the Burgers’ and KdV experiments. Table 8 reports the results of our Allen-
Cahn equation experiments. In each experiment, the burn-in step lasts for 2, 000 epochs (NBurn−in = 2, 000),
the sparsification step lasts 1, 000 epochs (NSparse = 1, 000), and the fine-tuning step lasts for 10, 000 epochs
(NF ine−tune = 10, 000). Further, during the sparsification step, we set wLp = 0.0002.

Table 8: Experimental results for the Allen-Cahn equation

NData Noise NBurn−in NSparse NF ine−tune Identified PDE


4, 000 25% 2, 000 1, 000 10, 000 Dt U = 0.9576(U ) + 0.0029(Dx2 U ) − 0.9673(U )3
4, 000 50% 2, 000 1, 000 10, 000 Dt U = 0.9165(U ) + 0.0026(Dx2 U ) − 0.9290(U )3
4, 000 75% 2, 000 1, 000 10, 000 Dt U = 0.9391(U ) + 0.0028(Dx2 U ) − 0.9123(U )3
4, 000 100% 2, 000 1, 000 10, 000 Dt U = 0.6921(U ) + 0.0020(Dx2 U ) − 0.6211(U )3
1, 000 25% 2, 000 1, 000 10, 000 Dt U = 0.8686(U ) + 0.0028(Dx2 U ) − 0.8158(U )3
1, 000 50% 2, 000 1, 000 10, 000 Dt U = 0.1405(U ) − 0.0910(U )3
500 10% 2, 000 1, 000 10, 000 Dt U = 0.9473(U ) + 0.0028(Dx2 U ) − 0.9386(U )3
500 25%1 2, 000 1, 000 10, 000 Dt U = 0.8562(U ) + 0.0025(Dx2 U ) − 0.8239(U )3
1 For this experiment, we use wLp = 0.0003 during the sparsification step. If we set wLp =
0.0002, PDE-LEARN identifies Dt U = 0.4520(U )+0.0018(Dx2 U )−0.3729(U )3 +0.0019(Dx U )2 (U ).

Thus, PDE-LEARN correctly identifies the Allen-Cahn equation in all but one of our experiments. Significantly,
in every experiment, the right-hand side of the identified PDE contains the U 3 term with its small coefficient.
These results suggest that PDE-LEARN can reliably identify small coefficients in hidden PDEs, even when they
are close to the threshold. Notably, PDE-LEARN identifies the Allen-Cahn equation with up to 100% noise
and 4, 000 data points. Further, it correctly identifies the Allen-Cahn equation with 25% noise using just 500
data points. With that said, PDE-LEARN's robustness to noise does decrease as the number of data points decreases.
In particular, it fails to identify the Allen-Cahn equation in the 1, 000 data points, 50% noise experiment.
In this experiment, the identified PDE is too sparse, though the two RHS terms that PDE-LEARN identifies
are present in the Allen-Cahn equation.

5.5 2D Wave equation


All of the experiments we have considered thus far deal with a single spatial variable (x) and a hidden PDE
whose left-hand side is a time derivative. While PDEs of this form are not uncommon in physics, they do
not constitute all possible PDEs of practical interest. As stated in section 4, PDE-LEARN can learn a more
general class of PDEs. In this subsection, we demonstrate this ability by testing PDE-LEARN on the 2D wave
equation. This second-order PDE appears in electromagnetism, structural mechanics, etc. With two spatial
variables, the wave equation takes on the following form:

Dt2 u = c2 (Dx2 u + Dy2 u).
Here, c > 0 is called the wave speed. Rearranging the above equation gives

Dx2 u = εDt2 u − Dy2 u, (13)


where ε = 1/c2 . We test PDE-LEARN on this form of the 2D wave equation with ε = 1 and the problem
domain (t, x, y) ∈ (0, 10] × [−5, 5] × [−5, 5]. We use a known solution to generate the noise-free data set. In
particular,

u(t, x, y) = − sin (t − x) + exp (0.05(t − x − y)) + sin (t − y),


satisfies the wave equation on the problem domain. To make our noise-free data set, we evaluate this function at 4,000 points drawn from a uniform distribution over the problem domain. We then corrupt this data set at varying noise levels, yielding our noisy, limited data sets.
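For concreteness, the following minimal Python sketch illustrates this procedure. The noise model shown (zero-mean Gaussian noise whose standard deviation is the noise level times the sample standard deviation of the clean data) and all function and variable names are assumptions made for illustration, not a verbatim excerpt of our implementation.

import numpy as np

def make_wave_data(n_points=4000, noise_level=0.25, seed=0):
    # Draw sample points uniformly from the problem domain
    # (t, x, y) in (0, 10] x [-5, 5] x [-5, 5].
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 10.0, n_points)
    x = rng.uniform(-5.0, 5.0, n_points)
    y = rng.uniform(-5.0, 5.0, n_points)

    # Evaluate the known solution to obtain the noise-free data set.
    u_clean = -np.sin(t - x) + np.exp(0.05 * (t - x - y)) + np.sin(t - y)

    # Corrupt the data: add zero-mean Gaussian noise scaled by the chosen
    # noise level and the sample standard deviation of the clean data.
    u_noisy = u_clean + noise_level * u_clean.std() * rng.standard_normal(n_points)

    return np.stack([t, x, y], axis=1), u_noisy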

In all of our experiments with the wave equation, U contains four layers with 40 neurons per layer. For these
experiments, we use the left-hand side term
 
f0 (∂ˆ0 u, . . . , ∂ˆNM u) = Dx2 U.

Further, we use the following library of right-hand side terms

U, Dt U, Dt2 U, Dt3 U, Dy U, Dy2 U, Dy3 U,
(U )2 , (Dt U )U, (Dt2 U )U, (Dt U )2 , (Dy U )U, (Dy2 U )U, (Dy U )2 , (Dt U )(Dy U ),
(U )3 , (Dt U )(U )2 , (Dt2 U )U 2 , (Dt U )2 U, (Dy U )U 2 , (Dy2 U )U 2 , (Dy U )2 U, (Dt U )(Dy U )U.
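To make the role of these library terms concrete, the following Python sketch shows one way a few of them could be evaluated from a trained network U at a batch of points using automatic differentiation. The PyTorch-based structure and all helper names are illustrative assumptions, not a verbatim excerpt of our implementation.

import torch

def d(f, var):
    # First derivative of the batched scalar output f with respect to one input
    # variable; create_graph=True keeps the graph so higher derivatives can be taken.
    return torch.autograd.grad(f, var, grad_outputs=torch.ones_like(f), create_graph=True)[0]

def evaluate_library(U, t, x, y):
    # U is assumed to map (t, x, y) to the predicted solution u.
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    y = y.clone().requires_grad_(True)
    u = U(torch.stack([t, x, y], dim=1)).squeeze(-1)

    u_t, u_x, u_y = d(u, t), d(u, x), d(u, y)
    u_tt, u_xx, u_yy = d(u_t, t), d(u_x, x), d(u_y, y)

    lhs = u_xx                                      # left-hand side term Dx^2 U
    rhs = torch.stack([u, u_t, u_tt, u_y, u_yy,     # a few degree-one terms
                       u ** 2, u_t * u, u_t ** 2,   # a few degree-two terms
                       u ** 3], dim=1)              # a degree-three term
    return lhs, rhs

The collocation residual at these points is then lhs - rhs @ xi, where xi is the trainable coefficient vector.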

Table 9 reports the results of our wave equation experiments. In each experiment, the burn-in step lasts for
2, 000 epochs (NBurn−in = 2, 000), the sparsification step lasts 2, 000 − 3, 000 epochs (2, 000 ≤ NSparse ≤
3, 000), and the fine-tuning step lasts for up to 1, 000 epochs (NF ine−tune ≤ 1, 000). Further, during the
sparsification step, we set wLp = 0.0003.

Table 9: Experimental results for the 2D wave equation

NData Noise NBurn−in NSparse NF ine−tune Identified PDE


4, 000 0% 2, 000 2, 000 1, 000 Dx2 U = 0.9973(Dt2 U ) − 0.9938(Dy2 U )
4, 000 25% 2, 000 2, 000 1, 000 Dx2 U = 0.9981(Dt2 U ) − 0.9919(Dy2 U )
4, 000 50% 2, 000 2, 000 1, 000 Dx2 U = 0.9656(Dt2 U ) − 0.9568(Dy2 U )
4, 000 75% 2, 000 2, 000 400 Dx2 U = 0.9268(Dt2 U ) − 0.9179(Dy2 U )
4, 000 100%1 2, 000 3, 000 1, 000 Dx2 U = 0.9004(Dt2 U ) − 0.8794(Dy2 U )
1 For this experiment, we use wLp = 0.001 during the sparsification step. If we set wLp = 0.0003, PDE-LEARN identifies Dx2 U = 0.8270(Dt2 U ) − 0.6916(Dy2 U ) − 0.2471(Dy2 U )(U ) − 0.0075(Dt2 U )(U )2 + 0.0739(Dy2 U )(U )2 .

Thus, PDE-LEARN successfully identifies the wave equation in all experiments, though we did have to increase
wLp in the 100% noise experiment. Further, in most cases, the coefficients in the identified PDE closely match
those of the true PDE, deviating less than 1% from their true values in the 0% and 25% noise experiments.
However, as with other equations, these experiments reveal that PDE-LEARN does have some limitations. If
we do not increase wLp to 0.001 in the 100% noise experiment, the identified PDE contains terms that are

not present in the wave equation. However, even in this case, the identified PDE contains both RHS terms
of the wave equation. Further, the coefficients of the extra terms are significantly smaller than those of the
RHS terms present in the wave equation. This result adds to our observation that PDE-LEARN extracts useful
information about the hidden PDE even when it fails. These experiments demonstrate that PDE-LEARN can
identify PDEs with multiple spatial variables and works with arbitrary left-hand side terms.

6 Discussion
This section discusses further aspects of PDE-LEARN, with a special emphasis on our rationale behind the
algorithm’s design. First, in section 6.1, we discuss hyperparameter selection with PDE-LEARN. Section 6.2
discusses why pruning (between the burn-in, sparsification, and fine-tuning steps) is necessary. In section
6.3, we discuss the Lp loss. Section 6.4 analyzes why the coefficients in the identified PDE tend to be smaller
than the corresponding coefficients in the hidden PDE. Finally, in section 6.5, we discuss some limitations
of PDE-LEARN as well as potential future directions.

6.1 Hyperparameter Selection


PDE-LEARN contains many hyperparameters, including the library terms, loss-function weights, p, and the network architecture. We did not perform a systematic hyperparameter search in our experiments. Therefore, we do not claim that the choices used in our experiments are optimal; they may not be suitable in every situation.

We found that p = 0.1 works well for the equations in our experiments. With that said, other values of p
may work well in other situations. p is a hyperparameter and should be treated as such (with p = 0.1 as a
good default value). Our loss function weights work well in our experiments. However, we believe it may
make sense to use different weights for the data and collocation losses if the data set is unusually limited
or noisy. As for the library terms, our experiments suggest that PDE-LEARN can identify sparse PDEs even
from a relatively large library of RHS terms. Using a large library increases the likelihood that the terms
in the hidden PDE are in the library. Therefore, choosing a large library is a good default choice. Finally,
for the network architecture, we believe the default choice of network architecture should contain the fewest
parameters possible to learn the underlying data set (which can be empirically determined).

6.2 Pruning
Even after the sparsification step, most components of ξ are small (a few orders of magnitude above machine epsilon) but non-zero. One likely reason for this is that PDE-LEARN works with finite-precision floating-point arithmetic. Let ε denote machine epsilon. If a component of ξ, ξk , is smaller than √ε, then ξk2 in equation 6 is smaller than machine epsilon, meaning that PDE-LEARN cannot accurately compute ξk2 . These results make thresholding necessary and are why we set the threshold slightly above √ε.

The biggest drawback of pruning is that PDE-LEARN cannot identify coefficients in the hidden PDE whose
magnitude is smaller than the threshold. In our experiments, we use single-precision floating-point arithmetic,
meaning that our threshold is around 10−4 . However, PDE-LEARN can be implemented using double-precision
floating-point arithmetic, allowing for a smaller threshold should the need arise.
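As a concrete illustration, the pruning operation amounts to something like the sketch below; the default threshold value and all names are illustrative assumptions.

import torch

def prune(xi, threshold=1e-4):
    # Zero out every component of xi whose magnitude falls below the threshold
    # and return a mask recording which library terms survive, so that later
    # training steps only update the surviving coefficients.
    with torch.no_grad():
        keep = xi.abs() >= threshold
        xi[~keep] = 0.0
    return keep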

6.3 Lp loss
As discussed in section 4, the Lp loss effectively embeds the Iteratively Reweighted Least Squares loss into our loss function. At the start of each epoch, the Lp loss is equal to ‖ξ‖pp. Crucially, however, since the weights, ak, in the Lp loss are treated as constants during back-propagation, the Lp loss is a convex function of ξ. By contrast, the map ξ → ‖ξ‖pp for 0 < p < 1 is not convex. In particular, it contains sharp cusps along the coordinate axes. These cusps make it nearly impossible for standard optimizers (such as the Adam optimizer we use to train PDE-LEARN) to converge to a minimum of the p-norm. To illustrate this point, we tried replacing the Lp loss with ‖ξ‖pp directly. Unsurprisingly, this change rendered PDE-LEARN unusable; it failed to converge, even when training on noise-free data sets. Thus, the Iteratively Reweighted Least Squares loss function is an essential aspect of PDE-LEARN.
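A minimal sketch of such an IRLS-style penalty is given below. In PDE-LEARN the weights are refreshed once per epoch, whereas this sketch recomputes them on every call, and the small stabilizer delta is an assumption added to avoid division by zero.

import torch

def lp_loss(xi, p=0.1, delta=1e-8):
    # IRLS-style surrogate for the p-norm penalty: the weights a are computed
    # from the current coefficients and detached, so the penalty seen by the
    # optimizer is a convex, quadratic function of xi. At the moment the weights
    # are computed, a * xi**2 equals |xi|**p component-wise (up to delta).
    a = (xi.detach().abs() + delta) ** (p - 2.0)
    return torch.sum(a * xi ** 2)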

6.4 The coefficients in the identified PDEs are too small


In many of our experiments, the coefficients in the identified PDE are smaller than those in the hidden PDE.
This effect appears to get worse as the noise level increases. We believe this is a result of how we sparsify
the PDE.

During the sparsification step, the Lp loss pushes the components of ξ to zero. Our choice of p = 0.1 means
that the Lp loss does a reasonable job of penalizing the number of non-zero terms, though it still pushes all
components of ξ towards zero. In principle, the collocation loss will grow if the components of ξ deviate too
much from corresponding values in the hidden PDE. Thus, the components of ξ must balance the collocation
and Lp losses. The result is usually a compromise; the components of ξ become as small as they can be
without causing a significant increase in the collocation loss. Thus, the components of ξ corresponding
to terms in the hidden PDE generally survive the sparsification step but end up with artificially small
magnitudes. This result is why we include the fine-tuning step, during which the coefficients recover and
approach the corresponding values in the hidden PDE. With that said, noise makes it difficult for PDE-LEARN
to precisely resolve the coefficients. This means that the collocation loss does not significantly decrease once
the components of ξ are reasonably close to the corresponding values in the hidden PDE. This result may
explain why the coefficients in the identified PDE tend to shrink as the noise level increases.

Running more fine-tuning epochs generally improves the accuracy of the identified coefficients. However, if
the noise is high and the data is limited, the networks can over-fit the data set. Over-fitting begins when
the testing data loss increases while the training data loss decreases. In our experiments, we use an early
stopping procedure to stop the fine-tuning step as soon as over-fitting begins.
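A minimal sketch of such an early-stopping check appears below; the patience window is an assumption made for illustration, since our procedure only requires that fine-tuning stop once over-fitting begins.

def stop_fine_tuning(test_losses, patience=5):
    # test_losses holds one test-set data loss per fine-tuning epoch. We stop
    # once the best loss inside the most recent `patience` epochs is no better
    # than the best loss seen before that window, a simple proxy for the onset
    # of over-fitting.
    if len(test_losses) <= patience:
        return False
    best_before_window = min(test_losses[:-patience])
    return min(test_losses[-patience:]) >= best_before_window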

6.5 Limitations and Future Directions


Our experiments in section 5 demonstrate that PDE-LEARN can learn a wide variety of PDEs. PDE-LEARN
places relatively weak assumptions on the form of the underlying PDE. While the previous works discussed
in section 3 assume the left-hand side of the hidden PDE is a time derivative, PDE-LEARN can learn PDEs
with arbitrary left-hand side terms. With that said, the hidden PDE must be in the form of equation 1.
Therefore, the user must select an appropriate library of terms. Without specialized domain knowledge,
selecting an appropriate library may be challenging.

Our implementation of PDE-LEARN exclusively uses monomial library terms. However, monomial terms cannot be expected to work in all cases; fortunately, our implementation of PDE-LEARN can be modified to work with other library terms. Even with this change, it is unclear how PDE-LEARN would perform when trying to identify PDEs whose terms are not monomials of the ui's and their associated partial derivatives. Therefore, generalizing PDE-LEARN to use other library terms, and exploring how this impacts PDE-LEARN's performance, represents a promising direction for future research.

Finally, it is worth noting that PDE-LEARN's performance varies by equation. PDE-LEARN had little trouble discovering the Burgers', KdV, Allen-Cahn, and wave equations. However, PDE-LEARN could only identify the KS equation with up to 15% noise. It is not immediately clear why this equation is more challenging for PDE-LEARN to identify, though past works have reported similar results [15]. Identifying the factors (both in the hidden PDE and in the data set) that impact PDE-LEARN's performance represents an important area of future research.

7 Conclusion
This paper introduced PDE-LEARN, a novel PDE-discovery algorithm to identify human-readable PDEs from
noisy and limited data. PDE-LEARN utilizes Rational Neural Networks, targeted collocation points, and a
three-part loss function inspired by Iteratively Reweighted Least Squares. Further, unlike many previous
works, PDE-LEARN can identify PDEs with multiple spatial variables and arbitrary left-hand side terms (see
equation 1). The general form of the hidden PDE, equation 1, gives PDE-LEARN tremendous flexibility in
discovering PDEs from data. Our experiments in section 5 demonstrate that PDE-LEARN can identify a
variety of PDEs from noisy, limited data sets.

PDE-LEARN appears to be an effective tool for PDE discovery. Its ability to identify a variety of linear and
non-linear PDEs, including those with multiple spatial variables, suggests that PDE-LEARN may be useful in
discovering governing equations for physical systems that, until now, have evaded such descriptions.

8 Acknowledgements
This work is supported by the Office of Naval Research (ONR), under grant N00014-22-1-2055. Further,
Robert Stephany is supported by his NDSEG fellowship.

References
[1] Clare I Abreu et al. “Mortality causes universal changes in microbial community composition”. In:
Nature communications 10.1 (2019), pp. 1–9.
[2] Samuel M Allen and John W Cahn. “A microscopic theory for antiphase boundary motion and its
application to antiphase domain coarsening”. In: Acta metallurgica 27.6 (1979), pp. 1085–1095.
[3] Daniel R Amor, Christoph Ratzke, and Jeff Gore. “Transient invaders can induce shifts between alter-
native stable states of microbial communities”. In: Science advances 6.8 (2020), eaay8676.
[4] Steven Atkinson et al. “Data-driven discovery of free-form governing differential equations”. In: arXiv
preprint arXiv:1910.05117 (2019).
[5] Cea Basdevant et al. “Spectral and finite difference solutions of the Burgers equation”. In: Computers
& fluids 14.1 (1986), pp. 23–41.
[6] Harry Bateman. “Some recent researches on the motion of fluids”. In: Monthly Weather Review 43.4
(1915), pp. 163–170.
[7] Atilim Gunes Baydin et al. “Automatic differentiation in machine learning: a survey”. In: Journal of
machine learning research 18 (2018).
[8] Jens Berg and Kaj Nyström. “Neural network augmented inverse problems for PDEs”. In: arXiv
preprint arXiv:1712.09685 (2017).
[9] Josh Bongard and Hod Lipson. “Automated reverse engineering of nonlinear dynamical systems”. In:
Proceedings of the National Academy of Sciences 104.24 (2007), pp. 9943–9948.
[10] Christophe Bonneville and Christopher J Earls. “Bayesian Deep Learning for Partial Differential Equa-
tion Parameter Discovery with Sparse and Noisy Data”. In: arXiv preprint arXiv:2108.04085 (2021).
[11] Gert-Jan Both et al. “DeepMoD: Deep learning for Model Discovery in noisy data”. In: Journal of
Computational Physics 428 (2021), p. 109985.
[12] Nicolas Boullé, Yuji Nakatsukasa, and Alex Townsend. “Rational neural networks”. In: Advances in
Neural Information Processing Systems 33 (2020), pp. 14243–14253.
[13] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. “Discovering governing equations from data
by sparse identification of nonlinear dynamical systems”. In: Proceedings of the national academy of
sciences 113.15 (2016), pp. 3932–3937.

[14] Rick Chartrand and Wotao Yin. “Iteratively reweighted algorithms for compressive sensing”. In: 2008
IEEE international conference on acoustics, speech and signal processing. IEEE. 2008, pp. 3869–3872.
[15] Zhao Chen, Yang Liu, and Hao Sun. “Physics-informed learning of governing equations from scarce
data”. In: Nature communications 12.1 (2021), pp. 1–13.
[16] Tobin A Driscoll, Nicholas Hale, and Lloyd N Trefethen. Chebfun guide. 2014.
[17] Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neu-
ral networks”. In: Proceedings of the thirteenth international conference on artificial intelligence and
statistics. JMLR Workshop and Conference Proceedings. 2010, pp. 249–256.
[18] Daniel R Gurevich, Patrick AK Reinbold, and Roman O Grigoriev. “Robust and optimal sparse regres-
sion for nonlinear PDE models”. In: Chaos: An Interdisciplinary Journal of Nonlinear Science 29.10
(2019), p. 103113.
[19] Isabelle Guyon et al. “Gene selection for cancer classification using support vector machines”. In:
Machine learning 46.1 (2002), pp. 389–422.
[20] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint
arXiv:1412.6980 (2014).
[21] Diederik Johannes Korteweg and Gustav De Vries. “XLI. On the change of form of long waves advancing
in a rectangular canal, and on a new type of long stationary waves”. In: The London, Edinburgh, and
Dublin Philosophical Magazine and Journal of Science 39.240 (1895), pp. 422–443.
[22] Yoshiki Kuramoto and Toshio Tsuzuki. “Persistent propagation of concentration waves in dissipative
media far from thermal equilibrium”. In: Progress of theoretical physics 55.2 (1976), pp. 356–369.
[23] Dong C Liu and Jorge Nocedal. “On the limited memory BFGS method for large scale optimization”.
In: Mathematical programming 45.1 (1989), pp. 503–528.
[24] Daniel A Messenger and David M Bortz. “Weak SINDy for partial differential equations”. In: Journal
of Computational Physics 443 (2021), p. 110525.
[25] Demetrios T Papageorgiou and Yiorgos S Smyrlis. “The route to chaos for the Kuramoto-Sivashinsky
equation”. In: Theoretical and Computational Fluid Dynamics 3.1 (1991), pp. 15–42.
[26] Maziar Raissi. “Deep hidden physics models: Deep learning of nonlinear partial differential equations”.
In: The Journal of Machine Learning Research 19.1 (2018), pp. 932–955.
[27] Samuel H Rudy et al. “Data-driven discovery of partial differential equations”. In: Science Advances
3.4 (2017), e1602614.
[28] Hayden Schaeffer. “Learning partial differential equations via data discovery and sparse optimization”.
In: Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473.2197
(2017), p. 20160446.
[29] Michael Schmidt and Hod Lipson. “Distilling free-form natural laws from experimental data”. In:
science 324.5923 (2009), pp. 81–85.
[30] Gregory I Sivashinsky. “Nonlinear analysis of hydrodynamic instability in laminar flames—I. Derivation
of basic equations”. In: Acta astronautica 4.11 (1977), pp. 1177–1206.
[31] Robert Stephany and Christopher Earls. “PDE-READ: Human-readable partial differential equation
discovery using deep learning”. In: Neural Networks 154 (2022), pp. 360–382.

