Dymchenko and Raffin - 2023 - Loss-Driven Sampling Within Hard-To-Learn Areas Fo
Abstract
This paper focuses on active learning methods for training neural networks from
synthetic input samples that can be generated on-demand. This includes Physics
Informed Neural Networks (PINNs), simulation-based inference, deep surrogates and
deep reinforcement learning. An adaptive process observes the training progress
and steers the data generation with the goal of speeding up training and increasing
its quality. We propose a novel adaptive sampling method that concentrates
samples close to the areas showing high loss values. When training a PINN on the
Allen Cahn equation, our algorithm converges to a validation loss of 0.5 in 6000
iterations, whereas the state-of-the-art R3 sampling takes 25000 iterations to
reach a loss of 0.7.
1 Introduction
The machine learning community has recently shown a growing interest in applying deep
neural networks (DNNs) to physical science [Lavin et al., 2021]. The novel approaches take different
perspectives on the subject [Das and Tesfamariam, 2022]. Some design specific types of neural
models and loss functions, such as graph-based networks [Meyer et al., 2021] and physics-informed
neural networks (PINNs) [Raissi et al., 2019]. Others use DNNs to learn probability distributions
for experimental design [Foster et al., 2021], probabilistic programming [Baydin et al., 2019], and
simulation-based inference [Cranmer et al., 2022]. Deep surrogates are combined with traditional
numerical solvers and trained through different modalities on supercomputers, as in [Meyer et al., 2022],
[Brace et al., 2021] and [Ward et al., 2021]. In this paper we focus on adaptive sampling methods to
optimize DNN training in the case where samples can be generated on-demand, usually through a
solver code.
Classical DNN training methods work with fixed datasets that are repeatedly presented during training
across multiple epochs. The ability to perform training using synthetic data, which can be generated
on-demand, opens the way to different strategies to improve the training process. Synthetic data is
generated from a set of input parameters sampled within a given bounded domain. The inputs can
be directly used for DNN training without transformation, as in data-free PINN training, or be the
initial conditions for autoregressive solvers, which produce data series. Adaptive sampling monitors
training progress to guide the sampling process towards selecting the inputs that provide the most
effective data. The associated computing cost should also be reduced so as not to slow down training.
The question of data selection and sample similarity originated in the context of training DNNs with a
finite dataset. It is tackled in various ways: measuring sample uncertainty by approximating training
dynamics [Kye et al., 2022, Wang et al., 2022], calculating sample influence [K and Søgaard, 2021],
and selecting a representative subset using gradients [Killamsetty et al., 2022, Fayyaz et al., 2022,
Katharopoulos and Fleuret, 2019].
The contributions of this paper are:
• a loss-driven sampling method called Breed (Balance Ratio and EnhancE Density);
• a concept of exploration-concentration balance control with a ratio value;
• a novel benchmark designed for evaluating sampling strategies for simulation-based training;
• a comparison with the state-of-the-art R3 sampling on two tasks, which shows a significant
performance improvement in both convergence speed and validation loss.
2 Proposed method
We first introduce the notation. The sampling selects input points $x \in X \subseteq \mathbb{R}^d$. A point $x$
can be used directly as a collocation point, as in PINN training, or go through a function $f(x) = y$,
where $y \in Y \subseteq \mathbb{R}^d$ is the output, the tuple $(x, y)$ being used for training. The trained neural
network is denoted $f_\theta(x)$, where $\theta$ are the neural network parameters. In the case of surrogate
training, the goal is to have $f_\theta$ approximate $f$.
To balance a training set, we introduce a concentrate-explore value $r(i) \in [0, 1]$: the ratio of points
to sample non-uniformly by concentrating in areas of high loss. The remaining points are sampled
uniformly in the domain in order to (1) provide examples that help the network remember what was
already learned and (2) explore areas with samples that might show high losses. The value $r(i)$
changes over the training iterations $i$ of the neural network, growing linearly from the starting value
$s_r$ to the ending value $e_r$ over $c_r$ iterations, after which it stays constant at $e_r$. The configuration
of the $r$ value is called a scenario and is denoted as a triplet $(s_r, e_r, c_r)$.
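The scenario described above can be sketched as a small schedule function. This is a minimal illustration rather than the authors' implementation: the function name `concentration_ratio` is hypothetical, and it assumes that the ramp reaches $e_r$ exactly at iteration $c_r$ and that a non-positive $c_r$ means a constant ratio.

```python
def concentration_ratio(i: int, s_r: float, e_r: float, c_r: int) -> float:
    """Return r(i) for the scenario (s_r, e_r, c_r).

    Assumption: linear ramp from s_r at i = 0 to e_r at i = c_r,
    then constant at e_r for all later iterations.
    """
    if c_r <= 0 or i >= c_r:
        # Past the ramp (or no ramp at all): the ratio stays at e_r.
        return e_r
    # Linear interpolation between the starting and ending values.
    return s_r + (e_r - s_r) * i / c_r
```

With this sketch, a scenario such as `(0.0, 1.0, 10)` starts with purely uniform sampling and ends with purely loss-driven sampling after 10 iterations.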
The sampling strategy is based on loss statistics from the neural network similar to [Wu et al., 2022].
The loss is normalized to construct a distribution that will provide a number of points to sample
within sample neighbourhoods.
The algorithm is repeated for $N$ iterations, i.e. $i \in [N]$, where we denote $[N] = \{0, \dots, N - 1\}$.
The covariance $\Sigma = I_d \cdot \sigma$ defines the (fixed) radius of the spherical neighbourhood. The values
$r(i) \in [0, 1]$, which control the concentrate-explore trade-off ratio, are predefined (subsection 2.1). The
initial training set $S^{(0)}$ is sampled uniformly and the number of samples $|S^{(i)}| = N_s$ is fixed.
At each iteration $i$, the neural network $f_{\theta^{(i)}}(x)$ provides a loss value $L(x_j^{(i)}; \theta^{(i)}) = l_j^{(i)} \geq 0$ per
sample $x_j^{(i)} \in S^{(i)}$ for $j \in [N_s]$. The values of the vector $l^{(i)}$ are then sum-normalized to have
distribution properties, i.e. $l'^{(i)}_j = l_j^{(i)} / \sum_{k \in [N_s]} l_k^{(i)}$. The categorical distribution over the samples
of $S^{(i)}$ is constructed proportionally to the loss, i.e. $P(x_j^{(i)}) = l'^{(i)}_j$. This distribution trialed $n$
times is a multinomial distribution $P(n, l'^{(i)})$.
It models the number of points, called children, in the next training set to be sampled around each
point, called parent, of the current training set. The sampling is run with replacement, so a parent
with high loss will have several children while a parent with low loss might not have any. This allows
us to adaptively refine the sampling density in areas with high loss. The name of the algorithm, Breed,
reflects this mechanism of breeding the most interesting points from the point of view of the
training process.
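The parent and children mechanism can be illustrated with a short standard-library sketch. It is not the authors' code: the function name `children_counts` is hypothetical, and the uniform fallback when every loss is zero is an assumption, since the text does not cover that case. The sketch relies on the fact that drawing n parents independently with replacement, with probability proportional to the loss, and counting the picks per parent is one draw from the multinomial distribution.

```python
import random
from collections import Counter


def children_counts(losses, n_children, rng):
    """Draw the number of children per parent as one multinomial sample.

    losses:     non-negative per-sample losses, one per parent
    n_children: total number of children to distribute
    rng:        a random.Random instance (for reproducibility)
    """
    total = sum(losses)
    if total > 0:
        # Sum normalization turns the losses into a categorical distribution.
        probs = [loss / total for loss in losses]
    else:
        # Assumption: with all-zero losses, fall back to uniform weights.
        probs = [1.0 / len(losses)] * len(losses)
    # n_children categorical draws with replacement; counting the draws per
    # parent gives the multinomial counts.
    picks = rng.choices(range(len(losses)), weights=probs, k=n_children)
    counts = Counter(picks)
    return [counts.get(j, 0) for j in range(len(losses))]


counts = children_counts([0.0, 1.0, 3.0], n_children=8, rng=random.Random(0))
```

The counts always sum to `n_children`, and a parent with zero loss receives no children as long as at least one other loss is positive.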
The next training set $S^{(i+1)}$ consists of two sets. The concentration set $S_c^{(i+1)}$ is a set of points
sampled in a loss-driven manner. Its size depends on the $r(i)$ value, i.e. $|S_c^{(i+1)}| = N_c^{(i)} = \lfloor N_s \times r(i) \rfloor$.
To construct it, the number of children for each parent is sampled as $\{m_j^{(i)}\}_{j \in [N_s]} \sim P(N_c^{(i)}, l'^{(i)})$.
Note that $\sum_{j \in [N_s]} m_j^{(i)} = N_c^{(i)}$. A parent point acts as the centre of a Gaussian with $\Sigma$ width, which
we call a neighbourhood. We sample the children set from each neighbourhood, i.e.:

$$S_c^{(i+1)} = \bigcup_{j \in [N_s]} C_\Sigma(x_j^{(i)}) = \bigcup_{j \in [N_s]} \left\{ x_k^{(i+1)} \sim \mathcal{N}(x_j^{(i)}, \Sigma),\ k \in [m_j^{(i)}] \right\}. \tag{1}$$

Finally, the composed training set is $S^{(i+1)} := S_c^{(i+1)} \cup S_u^{(i+1)}$, where $S_u^{(i+1)}$ denotes the remaining
$N_s - N_c^{(i)}$ points, sampled uniformly in the domain.
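One resampling step, combining the concentration set of Equation (1) with the uniformly sampled remainder, might look as follows. This is a hedged sketch under several assumptions that the text does not settle: the function name `breed_step` is hypothetical, the domain is taken to be an axis-aligned box given by `bounds`, $\sigma$ is treated as the diagonal variance of $\Sigma$ (so the per-coordinate standard deviation is $\sqrt{\sigma}$), children are not clipped to the domain, and uniform weights are used when all losses are zero.

```python
import math
import random


def breed_step(parents, losses, r, sigma, bounds, rng):
    """Build (S_c, S_u) for the next iteration from the current training set.

    parents: list of d-dimensional points (tuples), the current set
    losses:  non-negative per-sample losses, same length as parents
    r:       concentrate-explore ratio r(i) in [0, 1]
    sigma:   diagonal entry of Sigma = I_d * sigma (assumed to be a variance)
    bounds:  list of (low, high) pairs, one per dimension (assumed box domain)
    rng:     a random.Random instance
    """
    n_s = len(parents)
    n_c = math.floor(n_s * r)  # size of the concentration set
    # Picking n_c parents with replacement, proportionally to the loss, is
    # equivalent to one multinomial draw of the children counts m_j.
    weights = losses if sum(losses) > 0 else None  # uniform fallback (assumption)
    parent_idx = rng.choices(range(n_s), weights=weights, k=n_c)
    std = math.sqrt(sigma)
    # Each child is a Gaussian perturbation of its parent, N(x_j, Sigma).
    s_c = [tuple(rng.gauss(x, std) for x in parents[j]) for j in parent_idx]
    # The remaining n_s - n_c points explore the domain uniformly.
    s_u = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
           for _ in range(n_s - n_c)]
    return s_c, s_u
```

The next training set is then the concatenation `s_c + s_u`, which keeps the total number of samples fixed at $N_s$.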
3 Experiments
We compare Breed sampling to the baseline uniform dynamic sampling, which creates a training
set for each iteration by selecting Ns uniformly distributed points in X , and the R3 sampling
[Daw et al., 2023] on two benchmark problems.
[Figure 2 plots: (a) Imbalanced pits GD, Huber loss vs. iteration (0 to 100), on the validation set
and the hard validation set; (b) Allen Cahn equation (averaged) and (c) Allen Cahn equation (not
averaged), relative L2 error vs. iteration (0 to 40000), on the validation set. Curves: Uniform, R3, Breed.]
Figure 2: Comparison of validation errors over training iterations for Breed, R3 and uniform sampling
for (a) Imbalanced Pits Gradient Descent and (b-c) Allen Cahn PINN. The plots (a-b) are presented
as mean and standard deviation computed over 5 random seeds, whereas in the plot (c) each line
represents one run.
Results analysis. Results are presented in Figure 2 as errors over training iterations. For all
experiments, Breed shows both faster convergence and lower errors than Uniform and
R3 sampling. Notice that for Breed the error decrease happens near iteration $c_r$, showing the
importance of the exploration-concentration r scheme. In Figure 2a, R3 sampling performs worse
than the baseline on both validation sets, while Breed reaches half the error of the baseline even
on the hard validation set. In Figure 2b, the variation of the error for R3 is explained by the instability
of the method visible in Figure 2c: R3 converged to the same error value in 3 out of the 5 training runs.
In contrast, Breed shows stable convergence for all runs, while uniform sampling converges for
only 1 run.
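For readers who want to reproduce this kind of comparison, the sketch below shows one way to compute the two quantities reported in Figure 2. It rests on assumptions: the relative L2 error is not defined in this text, so the usual definition (norm of the difference divided by the norm of the reference) is used, the function names are hypothetical, and the sample standard deviation is assumed for the spread over seeds.

```python
import math
import statistics


def relative_l2_error(predicted, reference):
    """Relative L2 error ||predicted - reference||_2 / ||reference||_2.

    Assumption: the standard definition; the reference must be non-zero.
    """
    num = math.sqrt(sum((p - u) ** 2 for p, u in zip(predicted, reference)))
    den = math.sqrt(sum(u ** 2 for u in reference))
    return num / den


def summarize_over_seeds(errors_per_seed):
    """Mean and sample standard deviation of one error value per seed."""
    return statistics.mean(errors_per_seed), statistics.stdev(errors_per_seed)
```

Applying `summarize_over_seeds` at each logged iteration gives the mean curve and the band around it, whereas plotting each run separately, as in panel (c), exposes instabilities that the averaged view can hide.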
4 Conclusion
We presented Breed, a novel adaptive sampling algorithm for training DNNs with synthetic data.
Breed relies on the per-sample loss value to identify hard-to-train areas and uses a dual
exploration-concentration scheme: uniform sampling to discover potential hard areas and
to remember trivial ones, and Gaussian multinomial sampling to focus on hard areas. The experiments
demonstrate overall better performance in quality, convergence speed, and stability compared to the
baseline uniform sampling and the state-of-the-art R3 sampling. Future work will focus on validating
Breed with more benchmarks, including higher-dimensional problems and simulation-based scenarios
with functions generating time series.
5 Acknowledgements
References
[Baydin et al., 2019] Baydin, A. G., Shao, L., Bhimji, W., Heinrich, L., Meadows, L., Liu, J., Munk, A.,
Naderiparizi, S., Gram-Hansen, B., Louppe, G., Ma, M., Zhao, X., Torr, P., Lee, V., Cranmer, K., Prabhat, and
Wood, F. (2019). Etalumis: Bringing probabilistic programming to scientific simulators at scale. Publisher:
IEEE Computer Society.
[Brace et al., 2021] Brace, A., Yakushin, I., Ma, H., Trifan, A., Munson, T., Foster, I., Ramanathan, A., Lee,
H., Turilli, M., and Jha, S. (2021). Coupling streaming AI and HPC ensembles to achieve 100-1000x faster
biomolecular simulations.
[Cranmer et al., 2022] Cranmer, K., Brehmer, J., and Louppe, G. (2022). The frontier of simulation-based
inference.
[Das and Tesfamariam, 2022] Das, S. and Tesfamariam, S. (2022). State-of-the-art review of design of
experiments for physics-informed deep learning. arXiv:2202.06416.
[Daw et al., 2023] Daw, A., Bu, J., Wang, S., Perdikaris, P., and Karpatne, A. (2023). Mitigating propagation
failures in physics-informed neural networks using retain-resample-release (R3) sampling. In Proceedings of
the 40th International Conference on Machine Learning, pages 7264–7302. PMLR. ISSN: 2640-3498.
[Fayyaz et al., 2022] Fayyaz, M., Aghazadeh, E., Modarressi, A., Pilehvar, M. T., Yaghoobzadeh, Y., and
Kahou, S. E. (2022). BERT on a data diet: Finding important examples by gradient-based pruning. In
NeurIPS.
[Foster et al., 2021] Foster, A., Ivanova, D. R., Malik, I., and Rainforth, T. (2021). Deep adaptive design:
Amortizing sequential bayesian experimental design.
[K and Søgaard, 2021] K, K. and Søgaard, A. (2021). Revisiting methods for finding influential examples.
[Katharopoulos and Fleuret, 2019] Katharopoulos, A. and Fleuret, F. (2019). Not all samples are created equal:
Deep learning with importance sampling.
[Killamsetty et al., 2022] Killamsetty, K., Abhishek, G. S., Ramakrishnan, G., Evfimievski, A. V., Popa, L., and
Iyer, R. (2022). AUTOMATA: Gradient based data subset selection for compute-efficient hyper-parameter
tuning. In Advances in Neural Information Processing Systems.
[Kye et al., 2022] Kye, S. M., Choi, K., and Chang, B. (2022). TiDAL: Learning training dynamics for active
learning. arXiv, version 1.
[Lavin et al., 2021] Lavin, A., Zenil, H., Paige, B., Krakauer, D., Gottschlich, J., Mattson, T., Anandkumar, A.,
Choudry, S., Rocki, K., Baydin, A. G., Prunkl, C., Paige, B., Isayev, O., Peterson, E., McMahon, P. L., Macke,
J., Cranmer, K., Zhang, J., Wainwright, H., Hanuka, A., Veloso, M., Assefa, S., Zheng, S., and Pfeffer, A.
(2021). Simulation intelligence: Towards a new generation of scientific methods.
[Meyer et al., 2021] Meyer, L., Pottier, L., Ribés, A., and Raffin, B. (2021). Deep surrogate for direct time fluid
dynamics. pages 1–7.
[Meyer et al., 2022] Meyer, L., Ribés, A., and Raffin, B. (2022). Simulation-based parallel training.
[Nabian et al., 2021] Nabian, M. A., Gladstone, R. J., and Meidani, H. (2021). Efficient training of physics-
informed neural networks via importance sampling. 36(8):962–977.
[Raissi et al., 2019] Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2019). Physics-informed neural networks:
A deep learning framework for solving forward and inverse problems involving nonlinear partial differential
equations. 378:686–707.
[Wang et al., 2022] Wang, H., Huang, W., Wu, Z., Tong, H., Margenot, A. J., and He, J. (2022). Deep active
learning by leveraging training dynamics.
[Ward et al., 2021] Ward, L., Sivaraman, G., Pauloski, J. G., Babuji, Y., Chard, R., Dandu, N., Redfern, P. C.,
Assary, R. S., Chard, K., Curtiss, L. A., Thakur, R., and Foster, I. (2021). Colmena: Scalable machine-
learning-based steering of ensemble simulations for high performance computing. pages 9–20.
[Wu et al., 2022] Wu, C., Zhu, M., Tan, Q., Kartha, Y., and Lu, L. (2022). A comprehensive study of non-
adaptive and residual-based adaptive sampling for physics-informed neural networks.
[Yang et al., 2022] Yang, Z., Qiu, Z., and Fu, D. (2022). DMIS: Dynamic mesh-based importance sampling for
training physics-informed neural networks.