\newaliascnt

lemmaequation\aliascntresetthelemma \newaliascnttheoremequation\aliascntresetthetheorem \newaliascntdefinitionequation\aliascntresetthedefinition

Evaluating Uncertainty in Deep Gaussian Processes

Matthijs van der Lende
Jeremias Lino Ferrao
Niclas Müller-Hof
Bernoulli Institute, University of Groningen
Abstract

Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines like Deep Ensembles remain understudied. This work evaluates these models on regression (CASP dataset) and classification (ESR dataset) tasks, assessing predictive performance (MAE, Accuracy), calibration using Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), alongside robustness under various synthetic feature-level distribution shifts. Results indicate DSPPs provide strong in-distribution calibration leveraging their sigma point approximations. However, compared to Deep Ensembles, which demonstrated superior robustness in both performance and calibration under the tested shifts, the GP-based methods showed vulnerabilities, exhibiting particular sensitivity in the observed metrics. Our findings underscore ensembles as a robust baseline, suggesting that while deep GP methods offer good in-distribution calibration, their practical robustness under distribution shift requires careful evaluation. To facilitate reproducibility, we make our code available at https://github.com/matthjs/xai-gp.

1 Introduction

Modern machine learning models, particularly in high-stakes domains like healthcare and autonomous systems, require not only accurate predictions but also reliable estimates of uncertainty. While deep learning models excel at prediction tasks, their deterministic formulation leads to overconfident/uncalibrated uncertainty estimates, especially on out-of-distribution data. Gaussian processes (GPs) [23, 19] are nonparametric Bayesian models that provide well-calibrated predictive distributions, effectively capturing epistemic uncertainty—the uncertainty that arises from not knowing the true underlying model. They are used, for instance, in Bayesian optimization, where they act as a probabilistic surrogate model that is used to find promising hyperparameters in a hyperparameter space. There is also an interesting theoretical relationship between neural networks and GPs, namely, as the width of a neural network approaches infinity, the model converges to a GP with a specific kernel determined by their architecture [19].

Vanilla GPs rely on a kernel or covariance function to define a notion of similarity between datapoints. As a result, the function class that can be modeled is limited by the choice of kernel. Deep GPs (DGPs), introduced as multi-layer compositions of GPs [6], inherit the nonparametric flexibility of GPs while enabling richer hierarchical feature learning. More recently, Deep Sigma Point Processes (DSPPs) [14] have been introduced as an alternative formulation of DGPs that admit a simpler training procedure.

Both models theoretically provide well-calibrated uncertainty quantification by propagating uncertainty through layers, yet their empirical calibration performance relative to uncertainty quantification baselines, such as deep ensembles, remains understudied. In particular, using metrics other than the negative log likelihood (NLL), which, although a proper scoring rule, can over-emphasize tail probabilities [21]. Additionally, there is a lack of empirical results on the use of non-Gaussian likelihoods with DGPs/DSPPs that arise in classification [14].

In particular, we are interested in determining how well calibrated the uncertainty quantification is of DGPs and DSPPs compared to a baseline (e.g., ensembles) for both regression and classification. For this purpose, we include several standardized uncertainty quantification evaluation methods, like expected calibration error and calibration plots. We will also look at out-of-distribution detection, in particular looking at how calibration behaves under distribution shift, which has not been studied in the model’s respective papers, following the methodology from Ovadia et al. [20].

2 Background

2.1 Gaussian Processes

A GP is a collection of random variables (RVs) {fGP(𝐱)𝐱𝒳}conditional-setsubscript𝑓GP𝐱𝐱𝒳\{f_{\text{GP}}(\mathbf{x})\mid\mathbf{x}\in\mathcal{X}\}{ italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x ) ∣ bold_x ∈ caligraphic_X }, any finite number of which has a joint Gaussian distribution [23]. The index set 𝒳𝒳\mathcal{X}caligraphic_X is related to the input of some function f:𝒳𝒴:𝑓𝒳𝒴f:\mathcal{X}\rightarrow\mathcal{Y}italic_f : caligraphic_X → caligraphic_Y that we want to approximate. It can be interpreted as: At each point 𝐱𝒳𝐱𝒳\mathbf{x}\in\mathcal{X}bold_x ∈ caligraphic_X, the output of a GP model is a RV denoted fGP(𝐱)subscript𝑓GP𝐱f_{\text{GP}}(\mathbf{x})italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x ).

A GP, denoted fGP(𝐱)𝒢𝒫(m(𝐱),k(𝐱,𝐱))similar-tosubscript𝑓GP𝐱𝒢𝒫𝑚𝐱𝑘𝐱superscript𝐱f_{\text{GP}}(\mathbf{x})\sim\mathcal{GP}(m(\mathbf{x}),k(\mathbf{x},\mathbf{x% }^{\prime}))italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x ) ∼ caligraphic_G caligraphic_P ( italic_m ( bold_x ) , italic_k ( bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ), is fully specified by a mean function m:𝒳:𝑚𝒳m:\mathcal{X}\rightarrow\mathbb{R}italic_m : caligraphic_X → blackboard_R and covariance or kernel function k:𝒳×𝒳:𝑘𝒳𝒳k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}italic_k : caligraphic_X × caligraphic_X → blackboard_R which are defined as:

m(𝐱)𝑚𝐱\displaystyle m(\mathbf{x})italic_m ( bold_x ) =𝔼[fGP(𝐱)],absent𝔼delimited-[]subscript𝑓GP𝐱\displaystyle=\mathbb{E}[f_{\text{GP}}(\mathbf{x})],= blackboard_E [ italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x ) ] , (1)
k(𝐱,𝐱)𝑘𝐱superscript𝐱\displaystyle k(\mathbf{x},\mathbf{x}^{\prime})italic_k ( bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) =𝔼[(fGP(𝐱)m(𝐱))(fGP(𝐱)m(𝐱))].absent𝔼delimited-[]subscript𝑓GP𝐱𝑚𝐱subscript𝑓GPsuperscript𝐱𝑚superscript𝐱\displaystyle=\mathbb{E}[(f_{\text{GP}}(\mathbf{x})-m(\mathbf{x}))(f_{\text{GP% }}(\mathbf{x}^{\prime})-m(\mathbf{x}^{\prime}))].= blackboard_E [ ( italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x ) - italic_m ( bold_x ) ) ( italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_m ( bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) ] .

A GP offers a Bayesian approach to nonparametric regression or classification. Without any data, the kernel function represents our prior belief about the function we are trying to model, as it encodes similarity between data points, with closer points having higher covariance. A GP can then be conditioned on a dataset, 𝒟𝒟\mathcal{D}caligraphic_D, to get a posterior GP fGP(𝐱)𝒢𝒫(mpost(𝐱),kpost(𝐱,𝐱))similar-tosubscript𝑓GP𝐱𝒢𝒫subscript𝑚post𝐱subscript𝑘post𝐱superscript𝐱f_{\text{GP}}(\mathbf{x})\sim\mathcal{GP}(m_{\text{post}}(\mathbf{x}),k_{\text% {post}}(\mathbf{x},\mathbf{x}^{\prime}))italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x ) ∼ caligraphic_G caligraphic_P ( italic_m start_POSTSUBSCRIPT post end_POSTSUBSCRIPT ( bold_x ) , italic_k start_POSTSUBSCRIPT post end_POSTSUBSCRIPT ( bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ), which is our posterior belief about the function. A common choice of kernel function is the Radial Basis Function (RBF) kernel:

kRBF(𝐱,𝐱;l,σf)=σfexp(𝐱𝐱22l2),subscript𝑘RBF𝐱superscript𝐱𝑙subscript𝜎𝑓subscript𝜎𝑓superscriptnorm𝐱superscript𝐱22superscript𝑙2k_{\text{RBF}}(\mathbf{x},\mathbf{x}^{\prime};l,\sigma_{f})=\sigma_{f}\exp% \bigg{(}-\frac{\parallel\mathbf{x}-\mathbf{x}^{\prime}\parallel^{2}}{2l^{2}}% \bigg{)},italic_k start_POSTSUBSCRIPT RBF end_POSTSUBSCRIPT ( bold_x , bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ; italic_l , italic_σ start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ) = italic_σ start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT roman_exp ( - divide start_ARG ∥ bold_x - bold_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_l start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) , (2)

with lengthscale and outputscale parameters l,σf𝑙subscript𝜎𝑓l,\sigma_{f}\in\mathbb{R}italic_l , italic_σ start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ∈ blackboard_R, which when used result in smooth infinitely differentiable functions being sampled from the GP.

Aside from computing the posterior, GP hyperparameters θ𝜃\thetaitalic_θ, such as parameters of the kernel, can be fit to data by maximizing the marginal log likelihood (MLL):

p(𝐲|𝐗,θ)=Np(𝐲|𝐟,𝐗)p(𝐟|𝐗,θ)𝑑𝐟,𝑝conditional𝐲𝐗𝜃subscriptsuperscript𝑁𝑝conditional𝐲𝐟𝐗𝑝conditional𝐟𝐗𝜃differential-d𝐟p(\mathbf{y}|\mathbf{X},\theta)=\int_{\mathbb{R}^{N}}p(\mathbf{y}|\mathbf{f},% \mathbf{X})p(\mathbf{f}|\mathbf{X},\theta)\,d\mathbf{f},italic_p ( bold_y | bold_X , italic_θ ) = ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_p ( bold_y | bold_f , bold_X ) italic_p ( bold_f | bold_X , italic_θ ) italic_d bold_f , (3)

where 𝐗=(𝐱1,𝐱N)N×D𝐗subscript𝐱1subscript𝐱𝑁superscript𝑁𝐷\mathbf{X}=(\mathbf{x}_{1},\ldots\mathbf{x}_{N})\in\mathbb{R}^{N\times D}bold_X = ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_D end_POSTSUPERSCRIPT are data points with targets 𝐲N𝐲superscript𝑁\mathbf{y}\in\mathbb{R}^{N}bold_y ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT where yi=f(𝐱i)+ϵy,ϵy𝒩(0,σy)formulae-sequencesubscript𝑦𝑖𝑓subscript𝐱𝑖subscriptitalic-ϵ𝑦similar-tosubscriptitalic-ϵ𝑦𝒩0subscript𝜎𝑦y_{i}=f(\mathbf{x}_{i})+\epsilon_{y},\epsilon_{y}\sim\mathcal{N}(0,\sigma_{y})italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_f ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_ϵ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_ϵ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) and 𝐟N𝐟superscript𝑁\mathbf{f}\in\mathbb{R}^{N}bold_f ∈ blackboard_R start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT are the instantiations of the function latent variables fGP(𝐱i)subscript𝑓GPsubscript𝐱𝑖f_{\text{GP}}(\mathbf{x}_{i})italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for each 𝐱isubscript𝐱𝑖\mathbf{x}_{i}bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Both the MLL p(𝐲|𝐗,θ)𝑝conditional𝐲𝐗𝜃p(\mathbf{y}|\mathbf{X},\theta)italic_p ( bold_y | bold_X , italic_θ ) and the posterior distribution p(𝐟|𝐗,𝐲,𝐗)𝑝conditionalsubscript𝐟𝐗𝐲subscript𝐗p(\mathbf{f}_{*}|\mathbf{X},\mathbf{y},\mathbf{X}_{*})italic_p ( bold_f start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | bold_X , bold_y , bold_X start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) for test points 𝐗subscript𝐗\mathbf{X}_{*}bold_X start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT can be computed in closed form, but this required an inversion of the kernel matrix over all data points, which is O(N3)𝑂superscript𝑁3O(N^{3})italic_O ( italic_N start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) in time and O(N2)𝑂superscript𝑁2O(N^{2})italic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) in space.

2.2 Sparse Variational Gaussian Processes

Sparse Variational Gaussian Processes (SVGPs) [26, 12] approximate the GP predictive posterior distribution through variational inference111See Appendix B for a quick overview of variational inference.. This is done by introducing inducing variables (fGP(𝐳1),,fGP(𝐳M))=𝐮superscriptsubscript𝑓GPsubscript𝐳1subscript𝑓GPsubscript𝐳𝑀top𝐮(f_{\text{GP}}(\mathbf{z}_{1}),\ldots,f_{\text{GP}}(\mathbf{z}_{M}))^{\top}=% \mathbf{u}( italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT = bold_u that depend on variational parameters 𝐙=(𝐳1,,𝐳M)𝐙subscript𝐳1subscript𝐳𝑀\mathbf{Z}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{M})bold_Z = ( bold_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_z start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ) called inducing points where MNmuch-less-than𝑀𝑁M\ll Nitalic_M ≪ italic_N. The inducing points together with the associated inducing variables serve as an approximation of the full dataset 𝒟=(𝐱i,yi)i=1,,N𝒟subscriptsubscript𝐱𝑖subscript𝑦𝑖𝑖1𝑁\mathcal{D}=(\mathbf{x}_{i},y_{i})_{i=1,...,N}caligraphic_D = ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 1 , … , italic_N end_POSTSUBSCRIPTthat the GP conditions on.

SVGPs consider the following variational distribution:

q(𝐟,𝐮)=p(𝐟|𝐮,𝐗,𝐙)q(𝐮)q(𝐮)=𝒩(𝐦,𝐒).formulae-sequence𝑞𝐟𝐮𝑝conditional𝐟𝐮𝐗𝐙𝑞𝐮𝑞𝐮𝒩𝐦𝐒q(\mathbf{f},\mathbf{u})=p(\mathbf{f}|\mathbf{u},\mathbf{X},\mathbf{Z})q(% \mathbf{u})\quad q(\mathbf{u})=\mathcal{N}(\mathbf{m},\mathbf{S}).italic_q ( bold_f , bold_u ) = italic_p ( bold_f | bold_u , bold_X , bold_Z ) italic_q ( bold_u ) italic_q ( bold_u ) = caligraphic_N ( bold_m , bold_S ) . (4)

For an SVGP, the variational parameters are ψ={𝐙,𝐦,𝐒}𝜓𝐙𝐦𝐒\psi=\{\mathbf{Z},\mathbf{m},\mathbf{S}\}italic_ψ = { bold_Z , bold_m , bold_S } but may also include more parameters such as kernel hyperparameters. The approximate posterior over the function values is calculated by marginalizing over the inducing variables:

p(𝐟|𝒟,𝐗)Mp(𝐟|𝐮)q(𝐮)𝑑𝐮.𝑝conditionalsubscript𝐟𝒟subscript𝐗subscriptsuperscript𝑀𝑝conditionalsubscript𝐟𝐮𝑞𝐮differential-d𝐮p(\mathbf{f}_{*}|\mathcal{D},\mathbf{X}_{*})\approx\int_{\mathbb{R}^{M}}p(% \mathbf{f}_{*}|\mathbf{u})q(\mathbf{u})\;d\mathbf{u}.italic_p ( bold_f start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | caligraphic_D , bold_X start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) ≈ ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_p ( bold_f start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT | bold_u ) italic_q ( bold_u ) italic_d bold_u . (5)

The variational objective is derived by maximizing the Evidence Lower Bound (ELBO), which in this context becomes:

LSVGP(ψ|𝒟)=i=1N𝔼q(fGP(𝐱i))[logp(yi|fi)]D𝕂𝕃(q(𝐮)p(𝐮|𝐙)),L_{\text{SVGP}}(\psi|\mathcal{D})=\sum^{N}_{i=1}\mathbb{E}_{q(f_{\text{GP}}(% \mathbf{x}_{i}))}[\log p(y_{i}|f_{i})]-D_{\mathbb{KL}}(q(\mathbf{u})\parallel p% (\mathbf{u}|\mathbf{Z})),italic_L start_POSTSUBSCRIPT SVGP end_POSTSUBSCRIPT ( italic_ψ | caligraphic_D ) = ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_q ( italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) end_POSTSUBSCRIPT [ roman_log italic_p ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] - italic_D start_POSTSUBSCRIPT blackboard_K blackboard_L end_POSTSUBSCRIPT ( italic_q ( bold_u ) ∥ italic_p ( bold_u | bold_Z ) ) , (6)

where p(yi|fi)𝑝conditionalsubscript𝑦𝑖subscript𝑓𝑖p(y_{i}|f_{i})italic_p ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is the likelihood for the observations given latent function values fGP(𝐱i)=fisubscript𝑓GPsubscript𝐱𝑖subscript𝑓𝑖f_{\text{GP}}(\mathbf{x}_{i})=f_{i}italic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

It can be shown that the complexity now becomes O(NM2)𝑂𝑁superscript𝑀2O(NM^{2})italic_O ( italic_N italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) in time and O(NM+M2)𝑂𝑁𝑀superscript𝑀2O(NM+M^{2})italic_O ( italic_N italic_M + italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) in space [19]. It is worth noting that as M𝑀Mitalic_M increases, the approximation quality of exact inference is recovered. Too few inducing points may make the GP behave as if it was underfitting [2].

2.3 Deep Gaussian Processes and Deep Sigma Point Processes

DGPs extend the GP framework to multiple layers, allowing for the construction of hierarchies of latent functions [6, 19]:

𝒟𝒢𝒫(𝐱)=𝒇L𝒇1(𝐱),𝒇i()=[fGP,i(1)(),,fGP,i(Hi)()],fGP,i(j)𝒢𝒫(mi(),ki(,)).𝒟𝒢𝒫𝐱absentsubscript𝒇𝐿subscript𝒇1𝐱missing-subexpressionsubscript𝒇𝑖superscriptsuperscriptsubscript𝑓GP𝑖1superscriptsubscript𝑓GP𝑖subscript𝐻𝑖topmissing-subexpressionsimilar-tosuperscriptsubscript𝑓GP𝑖𝑗𝒢𝒫subscript𝑚𝑖subscript𝑘𝑖\begin{array}[]{l l}\mathcal{DGP}(\mathbf{x})&=\bm{f}_{L}\circ\cdots\circ\bm{f% }_{1}(\mathbf{x}),\\ &\bm{f}_{i}(\cdot)=[f_{\text{GP},i}^{(1)}(\cdot),\ldots,f_{\text{GP},i}^{(H_{i% })}(\cdot)]^{\top},\\ &f_{\text{GP},i}^{(j)}\sim\mathcal{GP}(m_{i}(\cdot),k_{i}(\cdot,\cdot)).\end{array}start_ARRAY start_ROW start_CELL caligraphic_D caligraphic_G caligraphic_P ( bold_x ) end_CELL start_CELL = bold_italic_f start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ∘ ⋯ ∘ bold_italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_x ) , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL bold_italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ⋅ ) = [ italic_f start_POSTSUBSCRIPT GP , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ( ⋅ ) , … , italic_f start_POSTSUBSCRIPT GP , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT ( ⋅ ) ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_f start_POSTSUBSCRIPT GP , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ∼ caligraphic_G caligraphic_P ( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ⋅ ) , italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ⋅ , ⋅ ) ) . end_CELL end_ROW end_ARRAY (7)

DGPs have a neural network-like structure with L𝐿Litalic_L layers, each containing H𝐻Hitalic_H GPs. Posterior inference in GPs is no longer tractable as it requires marginalizing over a large number of RVs, corresponding to the latent function values at each layer. The stochastic variational inference method from Section 2.2 can be generalized to DGPs [24], known as doubly stochastic variational inference. Each layer is agumented with inducing variables 𝐮(l)superscript𝐮𝑙\mathbf{u}^{(l)}bold_u start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT and corresponding variational distribution q(𝐮(l))𝑞superscript𝐮𝑙q(\mathbf{u}^{(l)})italic_q ( bold_u start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ). The variational ELBO for DGPs is then given by:

LDGP(ψ|𝒟)=i=1N𝔼q(𝐟iL)[logp(𝐲i|𝐟iL)]βl=1LD𝕂𝕃[q(𝐔l)p(𝐔l|𝐙l1)],L_{\text{DGP}}(\psi|\mathcal{D})=\sum^{N}_{i=1}\mathbb{E}_{q(\mathbf{f}^{L}_{i% })}[\log p(\mathbf{y}_{i}|\mathbf{f}_{i}^{L})]-\beta\sum^{L}_{l=1}D_{\mathbb{% KL}}[q(\mathbf{U}^{l})\|p(\mathbf{U}^{l}|\mathbf{Z}^{l-1})],italic_L start_POSTSUBSCRIPT DGP end_POSTSUBSCRIPT ( italic_ψ | caligraphic_D ) = ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_q ( bold_f start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ roman_log italic_p ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ) ] - italic_β ∑ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT blackboard_K blackboard_L end_POSTSUBSCRIPT [ italic_q ( bold_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ) ∥ italic_p ( bold_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | bold_Z start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT ) ] , (8)

where q(𝐟iL)𝑞subscriptsuperscript𝐟𝐿𝑖q(\mathbf{f}^{L}_{i})italic_q ( bold_f start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) denotes the approximate posterior of the final layer’s latent function for the i𝑖iitalic_ith data point, and q(𝐔l)𝑞superscript𝐔𝑙q(\mathbf{U}^{l})italic_q ( bold_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ) and p(𝐔l|𝐙l1)𝑝conditionalsuperscript𝐔𝑙superscript𝐙𝑙1p(\mathbf{U}^{l}|\mathbf{Z}^{l-1})italic_p ( bold_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | bold_Z start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT ) are the variational and prior distributions over the inducing variables at layer l𝑙litalic_l, respectively. β>0𝛽0\beta>0italic_β > 0 is a regularization constant. As with SVGP, the final output prediction can be integrated out analytically, however the remaining latent variables must be sampled. Resulting in ‘doubly‘ stochastic gradients from two levels of sampling: (1) data mini-batching to scale to large datasets and (2) sampling through the hidden layers (using the reparameterization trick for Gaussians) so that the latent 𝐟i(l)superscriptsubscript𝐟𝑖𝑙\mathbf{f}_{i}^{(l)}bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT are sampled at each layer. Note also that the expectation in (8) is approximated via Monte Carlo sampling.

DSPPs are hierarchical GP models with two key differences from DGPs [14]. First, DSPPs are trained via a regularized maximum likelihood objective rather than the ELBO in (8):

dspp(θ|𝒟)=i=1Nlogpdspp(𝐲i|𝐱i)βl=1LD𝕂𝕃[q(𝐔l)p(𝐔l|𝐙l1)],\mathcal{L}_{\text{dspp}}(\theta|\mathcal{D})=\sum^{N}_{i=1}\log p_{\text{dspp% }}(\mathbf{y}_{i}|\mathbf{x}_{i})-\beta\sum^{L}_{l=1}D_{\mathbb{KL}}[q(\mathbf% {U}^{l})\|p(\mathbf{U}^{l}|\mathbf{Z}^{l-1})],caligraphic_L start_POSTSUBSCRIPT dspp end_POSTSUBSCRIPT ( italic_θ | caligraphic_D ) = ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT dspp end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - italic_β ∑ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT blackboard_K blackboard_L end_POSTSUBSCRIPT [ italic_q ( bold_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ) ∥ italic_p ( bold_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT | bold_Z start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT ) ] , (9)

The logpDSPP(𝐲i|𝐱i)subscript𝑝DSPPconditionalsubscript𝐲𝑖subscript𝐱𝑖\log p_{\text{DSPP}}(\mathbf{y}_{i}|\mathbf{x}_{i})roman_log italic_p start_POSTSUBSCRIPT DSPP end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) directly maximizes the probability of the observed data given the model as a finite Gaussian mixture, obtained via sigma point approximations instead of Monte Carlo sampling, enabling maximum likelihood training. Just like DGPs the objective depends on parameters θ𝜃\thetaitalic_θ containing for each GP σysubscript𝜎𝑦\sigma_{y}italic_σ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT, 𝐦,𝐒,𝐙𝐦𝐒𝐙\mathbf{m},\mathbf{S},\mathbf{Z}bold_m , bold_S , bold_Z, kernel hyperparameters and likelihood hyperparameters (e.g., observation noise), which are jointly optimized using stochastic gradient descent methods. However unlike DGPs, DSPPs also parameterize the hidden latent function values using a learnable quadrature rule. To make this more explicit: in DGPs, for an input 𝐱isubscript𝐱𝑖\mathbf{x}_{i}bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to a layer containing W𝑊Witalic_W GPs, we sample:

fiw=μfw(𝐱i)+ϵσfw(𝐱i),ϵ𝒩(0,1),formulae-sequencesubscript𝑓𝑖𝑤subscript𝜇subscript𝑓𝑤subscript𝐱𝑖italic-ϵsubscript𝜎subscript𝑓𝑤subscript𝐱𝑖similar-toitalic-ϵ𝒩01f_{iw}=\mu_{f_{w}}(\mathbf{x}_{i})+\epsilon\sigma_{f_{w}}(\mathbf{x}_{i}),% \quad\epsilon\sim\mathcal{N}(0,1),italic_f start_POSTSUBSCRIPT italic_i italic_w end_POSTSUBSCRIPT = italic_μ start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_ϵ italic_σ start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_ϵ ∼ caligraphic_N ( 0 , 1 ) , (10)

where μfwsubscript𝜇subscript𝑓𝑤\mu_{f_{w}}italic_μ start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT and σfwsubscript𝜎subscript𝑓𝑤\sigma_{f_{w}}italic_σ start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT are the approximate posterior mean and standard deviation predictions of the w𝑤witalic_w-th GP. We then take S𝑆Sitalic_S samples to obtain an unbiased Monte Carlo estimate of 𝔼q(𝐟iL)[logp(𝐲i|𝐟iL)]subscript𝔼𝑞superscriptsubscript𝐟𝑖𝐿delimited-[]𝑝conditionalsubscript𝐲𝑖superscriptsubscript𝐟𝑖𝐿\mathbb{E}_{q(\mathbf{f}_{i}^{L})}[\log p(\mathbf{y}_{i}|\mathbf{f}_{i}^{L})]blackboard_E start_POSTSUBSCRIPT italic_q ( bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT [ roman_log italic_p ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ) ]. In DSPPs, we use the following quadrature rule:

fiw(j)=μgf(𝐱i)+ξw(j)σfw(𝐱i),superscriptsubscript𝑓𝑖𝑤𝑗subscript𝜇subscript𝑔𝑓subscript𝐱𝑖superscriptsubscript𝜉𝑤𝑗subscript𝜎subscript𝑓𝑤subscript𝐱𝑖f_{iw}^{(j)}=\mu_{g_{f}}(\mathbf{x}_{i})+\xi_{w}^{(j)}\sigma_{f_{w}}(\mathbf{x% }_{i}),italic_f start_POSTSUBSCRIPT italic_i italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT = italic_μ start_POSTSUBSCRIPT italic_g start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_ξ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (11)

where {ξw(j)}j=1Qsuperscriptsubscriptsuperscriptsubscript𝜉𝑤𝑗𝑗1𝑄\{\xi_{w}^{(j)}\}_{j=1}^{Q}{ italic_ξ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT are Q𝑄Qitalic_Q learnable quadrature points (also called sigma points), which have associated learnable quadrature weights {ω(j)}j=1Qsuperscriptsubscriptsuperscript𝜔𝑗𝑗1𝑄\{\omega^{(j)}\}_{j=1}^{Q}{ italic_ω start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT. pDSPP(𝐲i|𝐱i)subscript𝑝DSPPconditionalsubscript𝐲𝑖subscript𝐱𝑖p_{\text{DSPP}}(\mathbf{y}_{i}|\mathbf{x}_{i})italic_p start_POSTSUBSCRIPT DSPP end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is computed by evaluating the model at each of the Q𝑄Qitalic_Q quadrature sites, passing each output through the likelihood (which, for regression, is a Gaussian likelihood), and weighting them by ω(s)superscript𝜔𝑠\omega^{(s)}italic_ω start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT. This produces a Gaussian mixture with Q𝑄Qitalic_Q components:

pdspp(𝐲i|𝐱i)=j=1Qω(j)p(j)(𝐲i|𝐱i).subscript𝑝dsppconditionalsubscript𝐲𝑖subscript𝐱𝑖subscriptsuperscript𝑄𝑗1superscript𝜔𝑗superscript𝑝𝑗conditionalsubscript𝐲𝑖subscript𝐱𝑖p_{\text{dspp}}(\mathbf{y}_{i}|\mathbf{x}_{i})=\sum^{Q}_{j=1}\omega^{(j)}p^{(j% )}(\mathbf{y}_{i}|\mathbf{x}_{i}).italic_p start_POSTSUBSCRIPT dspp end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ∑ start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT italic_ω start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (12)

The benefit of this is that the resulting predictive distributions are no longer degraded by posterior approximations. Only mini-batches of datapoints are sampled, so the objective is now ‘singly stochastic‘. The intuition behind sigma points is that they represent a small, carefully selected set of points that capture the key characteristics (mean and variance) of a distribution, allowing us to approximate integrals or expectations more efficiently.

According to Jankowiak et al. [14], DSPPs offer better calibrated probabilities/uncertainty quantification than DGPs, however, they only measured this using the negative log likelihood.

2.4 Classification

In the previous sections, we assumed that we were doing regression with a Gaussian likelihood. To adapt GPs for classification, we have to use a different likelihood to output class probabilities. In GPs, the likelihood p(𝐲|𝐟)𝑝conditional𝐲𝐟p(\mathbf{y}|\mathbf{f})italic_p ( bold_y | bold_f ) defines the relationship between the latent variable 𝐟𝐟\mathbf{f}bold_f modeled by the GP and the observed data 𝐲𝐲\mathbf{y}bold_y. For regression and continuous outputs, the likelihood is typically a Gaussian p(𝐲|𝐟)=𝒩(𝐲|𝐟,σy2𝐈)𝑝conditional𝐲𝐟𝒩conditional𝐲𝐟subscriptsuperscript𝜎2𝑦𝐈p(\mathbf{y}|\mathbf{f})=\mathcal{N}(\mathbf{y}|\mathbf{f},\sigma^{2}_{y}% \mathbf{I})italic_p ( bold_y | bold_f ) = caligraphic_N ( bold_y | bold_f , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT bold_I ), which adds Gaussian noise ϵ𝒩(0,σy2)similar-toitalic-ϵ𝒩0subscriptsuperscript𝜎2𝑦\epsilon\sim\mathcal{N}(0,\sigma^{2}_{y})italic_ϵ ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) to the predictions.

In classification, fGP(𝐱)=fsubscript𝑓GP𝐱𝑓f_{\text{GP}}(\mathbf{x})=fitalic_f start_POSTSUBSCRIPT GP end_POSTSUBSCRIPT ( bold_x ) = italic_f are the logits. Suppose we have D𝐷Ditalic_D GP units 𝐟=(f1,,fD)𝐟subscript𝑓1subscript𝑓𝐷\mathbf{f}=(f_{1},\ldots,f_{D})bold_f = ( italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_f start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT ) in the final layer of a hierarchical GP model. To obtain class probabilities, we can use the Softmax likelihood:

p(𝐲|𝐟)=Softmax(𝐖𝐟),𝑝conditional𝐲𝐟Softmax𝐖𝐟p(\mathbf{y}|\mathbf{f})=\text{Softmax}(\mathbf{W}\mathbf{f}),italic_p ( bold_y | bold_f ) = Softmax ( bold_Wf ) , (13)

where 𝐖D×C𝐖superscript𝐷𝐶\mathbf{W}\in\mathbb{R}^{D\times C}bold_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_D × italic_C end_POSTSUPERSCRIPT is a learnable matrix of mixing weights applied to the latent functions 𝐟𝐟\mathbf{f}bold_f. This avoids needing as many GP units in the final layer as classes, which becomes computationally expensive if the number of classes is large. For our benchmarking, this was not the case, so we set 𝐖𝐖\mathbf{W}bold_W to be the identity matrix.

2.5 Ensembles for Uncertainty Quantification

Ensembles [16] provide a solid baseline for uncertainty estimation by training K𝐾Kitalic_K independent neural networks with parameters {θj}j=1Ksuperscriptsubscriptsubscript𝜃𝑗𝑗1𝐾\{\theta_{j}\}_{j=1}^{K}{ italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT with randomized initializations and aggregating their predictions, which provides a Monte Carlo estimate of the Bayesian predictive posterior distribution. For regression tasks, each ensemble member is a dual-output model predicting both a mean μi(𝐱)subscript𝜇𝑖𝐱\mu_{i}(\mathbf{x})italic_μ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_x ) and variance σi2(𝐱)superscriptsubscript𝜎𝑖2𝐱\sigma_{i}^{2}(\mathbf{x})italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ), trained via a Gaussian negative log-likelihood loss:

NN(θj)=i=1Npθj(𝐲i|𝐱i)=i=1N12log(2πσj2(𝐱i))+(𝐲iμj(𝐱i))22σj2(𝐱i),subscriptNNsubscript𝜃𝑗superscriptsubscript𝑖1𝑁subscript𝑝subscript𝜃𝑗conditionalsubscript𝐲𝑖subscript𝐱𝑖superscriptsubscript𝑖1𝑁122𝜋superscriptsubscript𝜎𝑗2subscript𝐱𝑖superscriptsubscript𝐲𝑖subscript𝜇𝑗subscript𝐱𝑖22superscriptsubscript𝜎𝑗2subscript𝐱𝑖\mathcal{L}_{\text{NN}}(\theta_{j})=\sum_{i=1}^{N}-p_{\theta_{j}}(\mathbf{y}_{% i}|\mathbf{x}_{i})=\sum_{i=1}^{N}\frac{1}{2}\log(2\pi\sigma_{j}^{2}(\mathbf{x}% _{i}))+\frac{(\mathbf{y}_{i}-\mu_{j}(\mathbf{x}_{i}))^{2}}{2\sigma_{j}^{2}(% \mathbf{x}_{i})},caligraphic_L start_POSTSUBSCRIPT NN end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT - italic_p start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_log ( 2 italic_π italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) + divide start_ARG ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG , (14)

which jointly optimizes accuracy and uncertainty calibration. During inference, predictions are combined into a Gaussian mixture 𝒩(μ(𝐱),σ2(𝐱))𝒩subscript𝜇𝐱superscriptsubscript𝜎2𝐱\mathcal{N}(\mu_{*}(\mathbf{x}),\sigma_{*}^{2}(\mathbf{x}))caligraphic_N ( italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ( bold_x ) , italic_σ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ) ), where

μ(𝐱)=1Kj=1Kμj(𝐱),σ2(𝐱)=1Kj=1K(σj2(𝐱)+μj2(𝐱))μ2(𝐱).formulae-sequencesubscript𝜇𝐱1𝐾superscriptsubscript𝑗1𝐾subscript𝜇𝑗𝐱superscriptsubscript𝜎2𝐱1𝐾superscriptsubscript𝑗1𝐾superscriptsubscript𝜎𝑗2𝐱superscriptsubscript𝜇𝑗2𝐱superscriptsubscript𝜇2𝐱\mu_{*}(\mathbf{x})=\frac{1}{K}\sum_{j=1}^{K}\mu_{j}(\mathbf{x}),\quad\sigma_{% *}^{2}(\mathbf{x})=\frac{1}{K}\sum_{j=1}^{K}\left(\sigma_{j}^{2}(\mathbf{x})+% \mu_{j}^{2}(\mathbf{x})\right)-\mu_{*}^{2}(\mathbf{x}).italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ( bold_x ) = divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_x ) , italic_σ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ) = divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ) + italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ) ) - italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ) . (15)

For classification, each member parameterizes a Gaussian distribution over logits 𝒩(μj(𝐱),σj2(𝐱))𝒩subscript𝜇𝑗𝐱superscriptsubscript𝜎𝑗2𝐱\mathcal{N}(\mu_{j}(\mathbf{x}),\sigma_{j}^{2}(\mathbf{x}))caligraphic_N ( italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_x ) , italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ) ) and the models are trained using the cross entropy loss. Class probabilities are estimated by drawing K𝐾Kitalic_K samples 𝐳^j𝒩(μj(𝐱),σj2(𝐱))similar-tosubscript^𝐳𝑗𝒩subscript𝜇𝑗𝐱superscriptsubscript𝜎𝑗2𝐱\hat{\mathbf{z}}_{j}\sim\mathcal{N}(\mu_{j}(\mathbf{x}),\sigma_{j}^{2}(\mathbf% {x}))over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∼ caligraphic_N ( italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_x ) , italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_x ) ), passing them through a softmax, and averaging:

p(𝐲|𝐱)=1Kj=1Ksoftmax(𝐳^j).𝑝conditional𝐲𝐱1𝐾superscriptsubscript𝑗1𝐾softmaxsubscript^𝐳𝑗p(\mathbf{y}|\mathbf{x})=\frac{1}{K}\sum_{j=1}^{K}\text{softmax}(\hat{\mathbf{% z}}_{j}).italic_p ( bold_y | bold_x ) = divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT softmax ( over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) . (16)

3 Methodology

3.1 Metrics

We used evaluation metrics, such as mean absolute error (MAE) for regression and accuracy (ACC) for classification. For uncertainty quantification, we are interested in the degree to which the model is calibrated; that is, error and misclassification should be proportional to the output uncertainty (e.g., predictive variance, max class probabilities) made by the model. So, if a classifier predicts p(y=cx)=0.5𝑝𝑦conditional𝑐𝑥0.5p(y=c\mid x)=0.5italic_p ( italic_y = italic_c ∣ italic_x ) = 0.5, then we expect c𝑐citalic_c to be the true label 50% of the time.

Let pθ(y|x)subscript𝑝𝜃conditional𝑦𝑥p_{\theta}(y|x)italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y | italic_x ) denote the output distribution of a predictive model, which we will assume has parameters θ𝜃\thetaitalic_θ. A common class of metrics that are used to evaluate predictive models with uncertainty are proper scoring rules. A proper scoring rule is a function S(pθ,(y,x))𝑆subscript𝑝𝜃𝑦𝑥S(p_{\theta},(y,x))italic_S ( italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , ( italic_y , italic_x ) ) that evaluates the quality of a predictive distribution pθ(y|x)subscript𝑝𝜃conditional𝑦𝑥p_{\theta}(y|x)italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y | italic_x ) against the true distribution p(y|x)superscript𝑝conditional𝑦𝑥p^{*}(y|x)italic_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_y | italic_x ), such that the expected score is maximized when the predicted distribution equals the true distribution. Formally:

𝔼(x,y)p[S(pθ,(y,x))]𝔼(x,y)p[S(p,(y,x))],subscript𝔼similar-to𝑥𝑦superscript𝑝delimited-[]𝑆subscript𝑝𝜃𝑦𝑥subscript𝔼similar-to𝑥𝑦superscript𝑝delimited-[]𝑆superscript𝑝𝑦𝑥\mathbb{E}_{(x,y)\sim p^{*}}[S(p_{\theta},(y,x))]\leq\mathbb{E}_{(x,y)\sim p^{% *}}[S(p^{*},(y,x))],blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ italic_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_S ( italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , ( italic_y , italic_x ) ) ] ≤ blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ italic_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_S ( italic_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , ( italic_y , italic_x ) ) ] , (17)

with equality if and only if pθ(y|x)=p(y|x)subscript𝑝𝜃conditional𝑦𝑥superscript𝑝conditional𝑦𝑥p_{\theta}(y|x)=p^{*}(y|x)italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y | italic_x ) = italic_p start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_y | italic_x ).

The following evaluation metrics for uncertainty were used. We denote \downarrow to indicate that lower is better and vice versa for \uparrow. Given a evaluation dataset (xi,yi)i=1,,Nsubscriptsubscript𝑥𝑖subscript𝑦𝑖𝑖1𝑁(x_{i},y_{i})_{i=1,\ldots,N}( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 1 , … , italic_N end_POSTSUBSCRIPT we averaged these scores over the samples:

Negative Log-Likelihood (NLL) \downarrow: A proper scoring rule, although it can overemphasize tail probabilities [27]. This is the only evaluation metric used for uncertainty in Jankowiak et al. [14]. For classification, we used a categorical likelihood:

SNLL(pθ,(y,x))=logpθ(y|x),subscript𝑆NLLsubscript𝑝𝜃𝑦𝑥subscript𝑝𝜃conditional𝑦𝑥S_{\text{NLL}}(p_{\theta},(y,x))=-\log p_{\theta}(y|x),italic_S start_POSTSUBSCRIPT NLL end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , ( italic_y , italic_x ) ) = - roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y | italic_x ) , (18)

and for regression the Gaussian likelihood:

SNLL(pθ,(y,x))=12log(2πσ2(x))+(yμ(x))22σ2(x),subscript𝑆NLLsubscript𝑝𝜃𝑦𝑥122𝜋superscript𝜎2𝑥superscript𝑦𝜇𝑥22superscript𝜎2𝑥S_{\text{NLL}}(p_{\theta},(y,x))=\frac{1}{2}\log(2\pi\sigma^{2}(x))+\frac{(y-% \mu(x))^{2}}{2\sigma^{2}(x)},italic_S start_POSTSUBSCRIPT NLL end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , ( italic_y , italic_x ) ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_log ( 2 italic_π italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_x ) ) + divide start_ARG ( italic_y - italic_μ ( italic_x ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_x ) end_ARG , (19)

where μ(x)𝜇𝑥\mu(x)italic_μ ( italic_x ) and σ2(x)superscript𝜎2𝑥\sigma^{2}(x)italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_x ) are the model-predicted mean and variance for input x𝑥xitalic_x.

Expected Calibration Error (ECE) \downarrow: Not a proper scoring method, but useful for assessing calibration. It is related to reliability plots. In case of classification we look at B𝐵Bitalic_B bins each with indices b={i[1,N]:pθ(yi|xi)(b1B,bB]}subscript𝑏conditional-set𝑖1𝑁subscript𝑝𝜃conditionalsubscript𝑦𝑖subscript𝑥𝑖𝑏1𝐵𝑏𝐵\mathcal{B}_{b}=\{i\in[1,N]:p_{\theta}(y_{i}|x_{i})\in(\frac{b-1}{B},\frac{b}{% B}]\}caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT = { italic_i ∈ [ 1 , italic_N ] : italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ ( divide start_ARG italic_b - 1 end_ARG start_ARG italic_B end_ARG , divide start_ARG italic_b end_ARG start_ARG italic_B end_ARG ] } such that each bin has predictions whose confidence fall within a confidence interval. We can compute the accuracy within bin b𝑏bitalic_b as

acc(b)=1|b|nb𝕀(y^n=yn),accsubscript𝑏1subscript𝑏subscript𝑛subscript𝑏𝕀subscript^𝑦𝑛subscript𝑦𝑛\text{acc}(\mathcal{B}_{b})=\frac{1}{|\mathcal{B}_{b}|}\sum_{n\in\mathcal{B}_{% b}}\mathbb{I}(\hat{y}_{n}=y_{n}),acc ( caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_n ∈ caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_I ( over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) , (20)

where y^nsubscript^𝑦𝑛\hat{y}_{n}over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is the max probability class and ynsubscript𝑦𝑛y_{n}italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT the true class. We then define conf(b)confsubscript𝑏\text{conf}(\mathcal{B}_{b})conf ( caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) as the average confidence (predictive variance or max probability) within a bin and plot the accuracy against the confidence, obtaining the reliability plot. The ECE is computed as:

ECE=b=1B|b|B|acc(b)conf(b)|.ECEsubscriptsuperscript𝐵𝑏1subscript𝑏𝐵accsubscript𝑏confsubscript𝑏\text{ECE}=\sum^{B}_{b=1}\frac{|\mathcal{B}_{b}|}{B}|\text{acc}(\mathcal{B}_{b% })-\text{conf}(\mathcal{B}_{b})|.ECE = ∑ start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT divide start_ARG | caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT | end_ARG start_ARG italic_B end_ARG | acc ( caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) - conf ( caligraphic_B start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ) | . (21)

For regression, we can consider confidence interval accuracy instead, and a set of equally spaced confidences αSα𝛼subscript𝑆𝛼\alpha\in S_{\alpha}italic_α ∈ italic_S start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT dividing [0,1]01[0,1][ 0 , 1 ] into B𝐵Bitalic_B equally spaced values. For each prediction interval at confidence level α𝛼\alphaitalic_α, the confidence interval accuracy is the fraction of true target values y𝑦yitalic_y falling within the interval:

acc(α)=1Ni=1N𝕀(yi[li,ui]),acc𝛼1𝑁subscriptsuperscript𝑁𝑖1𝕀subscript𝑦𝑖subscript𝑙𝑖subscript𝑢𝑖\text{acc}(\alpha)=\frac{1}{N}\sum^{N}_{i=1}\mathbb{I}(y_{i}\in[l_{i},u_{i}]),acc ( italic_α ) = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT blackboard_I ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] ) , (22)

where l=μ|zη2|σ,u=μ+|zη2|σformulae-sequence𝑙𝜇subscript𝑧𝜂2𝜎𝑢𝜇subscript𝑧𝜂2𝜎l=\mu-|z_{\frac{\eta}{2}}|\sigma,u=\mu+|z_{\frac{\eta}{2}}|\sigmaitalic_l = italic_μ - | italic_z start_POSTSUBSCRIPT divide start_ARG italic_η end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT | italic_σ , italic_u = italic_μ + | italic_z start_POSTSUBSCRIPT divide start_ARG italic_η end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT | italic_σ with η=1+α𝜂1𝛼\eta=1+\alphaitalic_η = 1 + italic_α and zη2subscript𝑧𝜂2z_{\frac{\eta}{2}}italic_z start_POSTSUBSCRIPT divide start_ARG italic_η end_ARG start_ARG 2 end_ARG end_POSTSUBSCRIPT being the z score corresponding to the η/2𝜂2\eta/2italic_η / 2 quantile.

Calibration error is then calculated analogously to the classification case for expected calibration error:

CEreg=1Bi=1B|αkacc(αk)|.subscriptCEreg1𝐵subscriptsuperscript𝐵𝑖1subscript𝛼𝑘accsubscript𝛼𝑘\text{CE}_{\text{reg}}=\frac{1}{B}\sum^{B}_{i=1}|\alpha_{k}-\text{acc}(\alpha_% {k})|.CE start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_B end_ARG ∑ start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT | italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - acc ( italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) | . (23)

3.2 Datasets

We evaluate our models on two distinct tasks using publicly available datasets: regression on protein structure data and classification on epileptic seizure recognition data.

Physicochemical Properties of Protein Tertiary Structure (CASP): For the regression task, we used the dataset detailing the physicochemical properties of protein tertiary structure, sourced from CASP 5-9 and originally compiled by Prashant Rana [22]. This dataset consists of 45,730 protein decoys. The objective is regression analysis to predict a continuous variable ranging from 0 to 21 Angstroms. The input features comprise 9 numerical attributes (F1 through F9) that describe various physicochemical properties of a protein. All features are real-valued. For our experiments, we employed an 80:20 split for training and testing, respectively. The features were standardized (zero mean, unit variance) before training.

Epileptic Seizure Recognition (ESR): For the classification task, we utilized the Epileptic Seizure Recognition dataset [1]. This dataset contains 11,500 instances, each representing a one-second segment of EEG recordings. It includes 178 numerical features (X1 through X178) corresponding to EEG signal values at different time points. The original dataset is formulated as a 5-class classification problem, where class 1 represents seizure activity, and classes 2-5 represent various non-seizure states (e.g., tumor area, healthy area with eyes open/closed). For our study, we converted this into a binary classification task: predicting the presence (class 1) versus the absence (classes 2-5 merged into class 0) of a seizure. The features were standardized (zero mean, unit variance) before training. Similar to the regression task, we used an 80:20 train/test split for evaluation.

3.3 Hyperparameter Tuning

We performed hyperparameter optimization using Bayesian optimization (BayesOpt). See Table 1 for the set of hyperparameters we used and the range of their values. Some hyperparameters we fixed, while others were tunable by BayesOpt. To prevent data leakage, we further divide the training set into an 80:20 ratio for validation during hyper-parameter tuning. As a reminder, the BayesOpt algorithm proceeds as follows: We start with an initial dataset 𝒟0=(𝐱i,yi)i=1,,n0subscript𝒟0subscriptsubscript𝐱𝑖subscript𝑦𝑖𝑖1subscript𝑛0\mathcal{D}_{0}=(\mathbf{x}_{i},y_{i})_{i=1,...,n_{0}}caligraphic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 1 , … , italic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT where the target outputs yi=f(𝐱i)+ϵisubscript𝑦𝑖𝑓subscript𝐱𝑖subscriptitalic-ϵ𝑖y_{i}=f(\mathbf{x}_{i})+\epsilon_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_f ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are assumed to be noisy outputs of the function f𝑓fitalic_f we want to optimize and 𝐱isubscript𝐱𝑖\mathbf{x}_{i}bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are sobol (quasi-random, low-discrepancy) points in the input space. Afterwords, at each iteration n𝑛nitalic_n, a dataset 𝒟nsubscript𝒟𝑛\mathcal{D}_{n}caligraphic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is maintained. A GP222This is a separate GP from the (deep) GP that is fit on the train set in case we are optimizing the hyperparameters of a (deep) GP. Furthermore, this GP uses exact inference instead of a variational approximation. can then be used to estimate p(f|𝒟)𝑝conditional𝑓𝒟p(f|\mathcal{D})italic_p ( italic_f | caligraphic_D ), a distribution over f𝑓fitalic_f. An acquisition function α(𝐱;𝒟n)𝛼𝐱subscript𝒟𝑛\alpha(\mathbf{x};\mathcal{D}_{n})italic_α ( bold_x ; caligraphic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) is then used to select a new candidate 𝐱𝐱\mathbf{x}bold_x based on its expected utility. Once yn+1=f(𝐱n+1)+ϵn+1subscript𝑦𝑛1𝑓subscript𝐱𝑛1subscriptitalic-ϵ𝑛1y_{n+1}=f(\mathbf{x}_{n+1})+\epsilon_{n+1}italic_y start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_f ( bold_x start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) + italic_ϵ start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT has been observed, the GP is updated by computing p(f|𝒟n+1)𝑝conditional𝑓subscript𝒟𝑛1p(f|\mathcal{D}_{n+1})italic_p ( italic_f | caligraphic_D start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ).

For hyperparameter tuning we used the expected improvement acquisition function and ran the optimization loop for 20 trials, 5 of which were used for random initialization. We optimized the negative log likelihood SNLLsubscript𝑆NLLS_{\text{NLL}}italic_S start_POSTSUBSCRIPT NLL end_POSTSUBSCRIPT as it is a proper scoring rule.

For the remaining experimental setup sections, we used the optimal hyperparameters that were obtained for a particular dataset and model after BayesOpt.

Table 1: Hyperparameter Setup for Model Training: Kernel (RBF) and likelihood (Gaussian or Softmax) were shared for GP-based models. And for NN we used ReLU activation functions. Learning rates are optimized on a log scale. Notation for layers: [n0,,nL]subscript𝑛0subscript𝑛𝐿[n_{0},\ldots,n_{L}][ italic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_n start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ] where L𝐿Litalic_L is the number of hidden layers and nlsubscript𝑛𝑙n_{l}italic_n start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT is the number of units in layer l𝑙litalic_l. [][\;][ ] is used to indicate no hidden layers.
Hyperparameter Value or Range Applicable Models
Monte Carlo samples Fixed: 10 DGP
Quadrature sites Fixed: 8 DSPP
Scaling of KL divergence (β)𝛽(\beta)( italic_β ) Fixed: 1 DGP/DSPP
Epochs Fixed: Until convergence All
(CASP: 20, ESR: 30)
Learning rate Varied: 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT101superscript10110^{-1}10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT (log) All
Model architecture (layers) Varied: [],[1],[1,1],[3],[3,3]delimited-[]111delimited-[]333[\;],[1],[1,1],[3],[3,3][ ] , [ 1 ] , [ 1 , 1 ] , [ 3 ] , [ 3 , 3 ] (Protein) DGP, DSPP
Varied: [],[1],[1,1],[5],[5,5]delimited-[]111delimited-[]555[\;],[1],[1,1],[5],[5,5][ ] , [ 1 ] , [ 1 , 1 ] , [ 5 ] , [ 5 , 5 ] (ESR) DGP, DSPP
Number of inducing points (M𝑀Mitalic_M) Varied: 50–200 DGP, DSPP
NN architecture (layers/units) Varied: [2n,n],n[8,16,32,64]2𝑛𝑛𝑛8163264[2n,n],n\in[8,16,32,64][ 2 italic_n , italic_n ] , italic_n ∈ [ 8 , 16 , 32 , 64 ] Deep Ensemble
# Models Varied: 2-10 Deep Ensemble
Optimizer Adam All

3.4 Experiment Setup

Here, we describe the experiment conducted to obtain the final evaluation results. The same procedure was applied to both the regression (Protein) and classification (ESR) datasets.

We trained three models—DGPs, DSPPs, and an ensemble of MLPs—on the training split and evaluated them on the test split using the metrics covered in Section 3.1, namely NLL, ECE, accuracy and MAE. This includes producing a calibration plot showing the confidence of the models against their accuracy. We also kept track of training and validation loss curves to assess model fit.

3.5 Distribution Shift Experiment Setup

To evaluate the robustness and calibration of our models under inputs that deviate from the training distribution, we designed a synthetic feature-level shift framework that can be applied uniformly across regression and classification tasks. This enables a controlled, systematic stress test of model performance and uncertainty quantification beyond in-distribution settings. We define five classes of perturbations, each parameterized by a severity level s[0,1]𝑠01s\in[0,1]italic_s ∈ [ 0 , 1 ]. Let 𝐗=(𝐱1,,𝐱N)𝐗subscript𝐱1subscript𝐱𝑁\mathbf{X}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})bold_X = ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) denote the data matrix of feature vectors:

Gaussian Noise

𝐗=𝐗+ε,ε𝒩(0,(σs)2).formulae-sequencesuperscript𝐗𝐗𝜀similar-to𝜀𝒩0superscript𝜎𝑠2\mathbf{X}^{\prime}=\mathbf{X}+\varepsilon,\quad\varepsilon\sim\mathcal{N}(0,(% \sigma s)^{2}).bold_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = bold_X + italic_ε , italic_ε ∼ caligraphic_N ( 0 , ( italic_σ italic_s ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) . (24)

We add Gaussian Noise to the features: the corrupted features 𝐗superscript𝐗\mathbf{X}^{\prime}bold_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT are obtained by adding noise ε𝜀\varepsilonitalic_ε to the original features 𝐗𝐗\mathbf{X}bold_X, where the noise is drawn from a zero-mean normal distribution with standard deviation σs𝜎𝑠\sigma sitalic_σ italic_s.

Feature Masking

𝐗i,j={0with probability s,𝐗i,jotherwise.subscriptsuperscript𝐗𝑖𝑗cases0with probability 𝑠subscript𝐗𝑖𝑗otherwise\mathbf{X}^{\prime}_{i,j}=\begin{cases}0&\text{with probability }s,\\ \mathbf{X}_{i,j}&\text{otherwise}.\end{cases}bold_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = { start_ROW start_CELL 0 end_CELL start_CELL with probability italic_s , end_CELL end_ROW start_ROW start_CELL bold_X start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_CELL start_CELL otherwise . end_CELL end_ROW (25)

We simulate missing or occluded features by randomly setting each feature value to zero with probability s𝑠sitalic_s, leaving it unchanged otherwise.

Feature Scaling

𝐗=(1+s)𝐗.superscript𝐗1𝑠𝐗\mathbf{X}^{\prime}=(1+s)\cdot\mathbf{X}.bold_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ( 1 + italic_s ) ⋅ bold_X . (26)

We apply a global scaling factor 1+s1𝑠1+s1 + italic_s to all features, modeling scenarios such as unnormalized sensor inputs or calibration mismatches.

Feature Permutation

𝐗i,j={𝐗π(i),jwith probability s,𝐗i,jotherwise.subscriptsuperscript𝐗𝑖𝑗casessubscript𝐗𝜋𝑖𝑗with probability 𝑠subscript𝐗𝑖𝑗otherwise\mathbf{X}^{\prime}_{i,j}=\begin{cases}\mathbf{X}_{\pi(i),j}&\text{with % probability }s,\\ \mathbf{X}_{i,j}&\text{otherwise}.\end{cases}bold_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = { start_ROW start_CELL bold_X start_POSTSUBSCRIPT italic_π ( italic_i ) , italic_j end_POSTSUBSCRIPT end_CELL start_CELL with probability italic_s , end_CELL end_ROW start_ROW start_CELL bold_X start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT end_CELL start_CELL otherwise . end_CELL end_ROW (27)

Each feature value 𝐗i,jsubscript𝐗𝑖𝑗\mathbf{X}_{i,j}bold_X start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT is randomly replaced with another value from the same column (row π(i)𝜋𝑖\pi(i)italic_π ( italic_i )) with probability s𝑠sitalic_s, disrupting inter-feature dependencies while preserving marginal distributions.

Outlier Injection

𝐗i,j=𝐗i,j+δ,δ{±3σj} with probability s.formulae-sequencesubscriptsuperscript𝐗𝑖𝑗subscript𝐗𝑖𝑗𝛿𝛿plus-or-minus3subscript𝜎𝑗 with probability 𝑠\mathbf{X}^{\prime}_{i,j}=\mathbf{X}_{i,j}+\delta,\quad\delta\in\{\pm 3\cdot% \sigma_{j}\}\text{ with probability }s.bold_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = bold_X start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT + italic_δ , italic_δ ∈ { ± 3 ⋅ italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } with probability italic_s . (28)

Outliers are injected by perturbing a fraction s𝑠sitalic_s of the features with large positive or negative deviations proportional to the pre-feature standard deviation σjsubscript𝜎𝑗\sigma_{j}italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.

To assess robustness under distribution shift, we performed N=5𝑁5N=5italic_N = 5 independent training runs per model, each initialized with different random seeds. In each run, the model is re-initialized, trained from scratch, and subsequently evaluated on test sets perturbed by all corruption types at increasing severity level s{0.0,0.1,0.2,0.4,0.6,0.8}𝑠0.00.10.20.40.60.8s\in\{0.0,0.1,0.2,0.4,0.6,0.8\}italic_s ∈ { 0.0 , 0.1 , 0.2 , 0.4 , 0.6 , 0.8 }.

At each severity level, we computed ECE, accuracy and MAE for classification and regression, respectively. Results were aggregated across all runs and five perturbation types, yielding 25 metric values per severity. These are visualized using boxplots to reflect variation due to both training stochasticity and corruption type. As models are evaluated without retraining on the shifted test sets, the results quantify robustness to distributional changes that preserve overall structure while altering feature-level statistics.

Our shift analysis framework builds on work in robustness and uncertainty evaluation under corruptions, covariate shift, and non-adversarial perturbations [10, 8, 13, 11, 20], and complements regularization-based approaches to improving model robustness [25].

4 Results

This section presents the empirical evaluation of our models on the selected datasets. The experiments utilized the optimal hyperparameters identified via Bayesian optimization, as detailed in Appendix A (Table 3). We evaluate the models on regression (CASP dataset) and classification (ESR dataset) tasks, focusing on predictive performance, uncertainty calibration, and robustness under distribution shift.

An examination of Figure 1 reveals that all models demonstrate expected loss curves, without significant overfitting to the training set. However, the loss gap between train and validation sets can still be remedied using regularization techniques like dropout and weight penalties. Additionally, in the case of DSPPs, the β𝛽\betaitalic_β hyper-parameter can be lowered to reduce the constraint that the approximate posterior should not deviate too much from the prior.

Table 2 summarizes the primary performance metrics on the respective test sets. For the CASP regression task, DSPP demonstrated superior performance in terms of uncertainty quantification, achieving the lowest NLL of 2.985 and the best ECE of 0.026. While DSPP excelled in calibration, the DGP model obtained the lowest MAE of 3.425, indicating the most accurate point predictions on average for this task. The Deep Ensemble baseline was outperformed by both GP variants in NLL and MAE, although its ECE (0.112) was slightly better than that of the DGP (0.132).

The results on the ESR classification task reveal a different trend between models. The Deep Ensemble baseline achieved the highest accuracy at 0.976 and the lowest NLL (0.073), suggesting strong predictive performance. However, DSPP provided the best calibration with the lowest ECE (0.035) while maintaining a high accuracy of 0.969. The DGP model lagged behind the other two methods in both accuracy (0.912) and NLL (0.258) on this classification benchmark, although its calibration (ECE 0.054) was not substantially higher than its competitors.

Visual inspection of the calibration curves provides further insight into model confidence. Figure 2 displays the reliability plots for the CASP regression task. The DSPP model’s curve closely tracks the ideal diagonal line, visually confirming its excellent calibration reported in Table 2 and indicating neither significant over nor underconfidence. The DGP model shows a curve consistently lying below the diagonal across nearly all confidence levels, indicating a clear tendency towards overconfidence; the model predicts higher certainty than its accuracy justifies. The Deep Ensemble model displays a more mixed pattern: it appears slightly overconfident at low confidence levels but becomes increasingly underconfident at mid-to-high confidence levels.

For the ESR classification task (Figure 3), all three models demonstrate good calibration. The calibration curves for the Deep Ensemble, DSPP, and DGP all closely follow the ideal diagonal line, indicating that the models are generally well-calibrated. The visual differences between the plots are minor, aligning with the strong quantitative ECE performance observed for all models on this dataset (Table 2).

To assess performance under data distribution shifts, we evaluated the models on perturbed test sets using the methodology described in Section 3.5. The results are summarized through grouped boxplots, where each severity level aggregates performance across all types of corruptions.

Figure 4 shows the ECE under increasing shift severity for the CASP regression task. While both DGP and DSPP exhibit rising ECE with severity, the Deep Ensemble remains stable around 0.11. DGP shows the highest ECE and is most affected by shift, whereas DSPP maintains the lowest ECE across most severity levels and increases more gradually. Overall, DSPP demonstrates the strongest calibration performance under distributional shift.

Turning to the ESR classification task calibration trends on ESR are illustrated in Figure 4. All three models maintain low ECE values across the full range of shift severities, indicating generally strong calibration performance. While DSPP continues to exhibit competitive calibration with the lowest ECE median at most levels, its variability increases under stronger shifts. The Deep Ensemble also maintains low calibration error but shows a slightly broader distribution at the highest severity. Notably, DGP displays stable calibration with relatively narrow interquartile ranges, though its median ECE is slightly higher than the others in the later stages.

Finally, ablation studies were conducted to understand the impact of key hyperparameters (Appendix C). The results, shown in Figures 5 and 6, indicate that model performance (measured by NLL) is sensitive to both the number of inducing points M𝑀Mitalic_M and the model depth (number of layers). Generally, increasing M𝑀Mitalic_M improved NLL up to a certain point for both DGP and DSPP (Figure 5), highlighting the trade-off between approximation quality and computational cost inherent in sparse GP methods. The effect of depth was dataset-dependent (Figure 6); for CASP, NLL tended to increase with more layers for DSPP, while for ESR, deeper models consistently performed better.

Refer to caption
Refer to caption
Figure 1: Training and validation loss curves of our optimized models on CASP and ESR
Refer to caption
Refer to caption
Refer to caption
Figure 2: Calibration curves of our models for the Protein regression dataset.
Refer to caption
Refer to caption
Refer to caption
Figure 3: Calibration curves of our models for the Epileptic Sezure Recognition dataset.
Refer to caption
Refer to caption
Figure 4: Box plots showing ECE under distributional shift for the CASP regression (left) and ESR classification (right) tasks across three methods: Deep Ensemble (blue), DGP (orange), and DSPP (green).
Table 2: Performance metrics for our optimized models on their corresponding datasets.
Model Dataset NLL \downarrow ECE \downarrow MAE \downarrow ACC \uparrow
DGP CASP 3.412 0.132 3.425
Deep Ensemble CASP 3.387 0.112 5.525
DSPP CASP 2.985 0.026 4.016
DGP ESR 0.258 0.054 0.912
Deep Ensemble ESR 0.073 0.040 0.976
DSPP ESR 0.091 0.035 0.969

5 Discussion

Our results highlight the potential of deep probabilistic models in trustworthy machine learning. In particular, we observe that GP-based models (DGPs and DSPPs) provide well-calibrated uncertainty estimates as well as good predictive performance, but this advantage may vary depending on the dataset and task type. We confirm empirical results from Jankowiak et al. [14] that the use of deterministic approximation (via sigma points) to propagate uncertainty through nonlinear transformations provides better calibrated uncertainty estimates compared to the standard sampling-based approach in DGPs (which also makes the model more computationally efficient). Furthermore, our experiments demonstrate the effectiveness of this approach for non-Gaussian likelihoods (specifically, the Softmax likelihood in the ESR classification task), an aspect not previously explored empirically for DSPPs.

A crucial finding emerged from the distribution shift experiments. Deep Ensembles displayed notable resilience, maintaining relatively stable performance and calibration across increasing perturbation severities (Figures 4, 7). In contrast, the GP-based models showed greater sensitivity. DGPs were particularly affected on the regression task, with significant degradation in both prediction accuracy (MAE) and calibration (ECE). DSPPs presented a more nuanced picture under shift: on the CASP regression task, they maintained the best calibration (lowest median ECE) despite the increasing shift, outperforming DGPs significantly in this regard. However, on the ESR classification task, while competitive at low severities, DSPP accuracy degraded more sharply than ensembles, and its ECE showed increasing variability under stronger shifts, indicating less reliable calibration compared to ensembles in that scenario. This suggests that while the sigma point method aids calibration robustness in some cases, the overall model structure’s resilience to feature perturbations might be lower than that of ensembles, depending on the task and dataset. This underscores that good in-distribution calibration does not guarantee robustness, necessitating evaluation under shift.

Our study, however, has limitations. The focus on moderately complex, tabular datasets restricts the generalizability of these findings to high-dimensional data common in fields like computer vision or NLP, where architectural differences might significantly influence robustness. Furthermore, while DSPPs avoid sampling during inference, potentially offering speed advantages over DGPs, we did not perform rigorous wall-clock time comparisons. Meaningful assessment of computational efficiency requires strictly controlled and standardized environments to ensure fair comparisons free from confounding system variables.

5.1 Future Work

In this work we focused on standard likelihoods for regression and classification with deep GPs. Future work could expand upon this by including a larger class of likelihoods. For regression tasks, Laplace or Student-t𝑡titalic_t likelihoods can provide greater resilience to outliers due to their heavier tails. For classification, Bernoulli and categorical likelihoods are standard choices, but Ordinal or Dirichlet-based likelihoods may be better suited for tasks involving ranked or multi-class uncertainty. A Poisson likelihood could be used for Poisson regression to model count data where outcomes represent the number of events occurring in a fixed interval. In this scenario the underlying log rate function is modeled by a GP. More generally, a Poisson process where the log rate function is itself a stochastic process is known as a ‘Cox process‘ or ‘doubly stochastic Poisson process‘ [5, 19].

It is also worth re-exploring the idea of deep kernel learning [28] but for DGPs/DSPPs. Deep kernel learning involves combining neural networks with GPs by using a neural network as a feature extractor before applying the GP. In other words, the GP operates on the latent space produced by a neural network as opposed to the raw input space. Previous work in this area primarily focused on deep kernel learning with a single GP. While DGPs/DSPPs can, in theory, learn hierarchical feature representations, their MLP-like structure makes them less suitable for certain data types. For example, it is difficult to scale these models to image data due to the large number of features corresponding to the pixels in the image. A potential solution is to use a convolutional neural network or vision transformer followed by the deep GP, which can then be trained in an end-to-end fashion.

This work focused on standard on the standard regression and classification. However a promising area of research is the use of GPs for uncertainty quantification in Reinforcement Learning (RL) [7, 4, 9, 15, 17]. The use of uncertainty quantification in RL has several use cases [18]. RL algorithms that quantify (epistemic) uncertainty could address sample complexity through more intelligent exploration of the environment by letting the agent focus on high-uncertainty regions likely to yield informative transitions. The field of safe RL [3] uses uncertainty quantification to adhere to safety constraints during the learning process, which is important in robotics applications, where overly aggressive exploration could potentially damage the physical system. Recently, Lende et al. [17] explored using DGPs for estimating the action-value function, quantifying the uncertainty of actions in a state in addition to its expected value. However, this framework has not been extended to DSPPs or to estimating the policy directly.

6 Conclusion

Our experiments demonstrate that deep Gaussian process models can deliver competitive predictive performance while providing robust and well-calibrated uncertainty estimates. The results suggest that DSPPs often outperform DGPs and ensembles in terms of calibration, showing strong reliability across varied tasks and moderate distributional shifts. These findings highlight the promise of DSPPs for trustworthy AI solutions, where model confidence is as important as predictive accuracy. Future research can explore wider applications, include more complex likelihoods, and further examine how these methods adapt to larger and more diverse datasets.

References

  • Andrzejak et al. [2002] Ralph Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical review. E, Statistical, nonlinear, and soft matter physics, 64:061907, 01 2002. doi: 10.1103/PhysRevE.64.061907.
  • Bauer et al. [2016] Matthias Bauer, Mark Van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse gaussian process approximations. Advances in neural information processing systems, 29, 2016. URL https://arxiv.org/abs/1606.04820.
  • Berkenkamp [2019] Felix Berkenkamp. Safe exploration in reinforcement learning: Theory and applications in robotics. PhD thesis, ETH Zurich, 2019. URL https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/370833/root.pdf.
  • Chowdhary et al. [2014] Girish Chowdhary, Miao Liu, Robert Grande, Thomas Walsh, Jonathan How, and Lawrence Carin. Off-policy reinforcement learning with gaussian processes. IEEE/CAA Journal of Automatica Sinica, 1(3):227–238, 2014. doi: 10.1109/JAS.2014.7004680.
  • Cox [2018] D. R. Cox. Some statistical methods connected with series of events. Journal of the Royal Statistical Society: Series B (Methodological), 17(2):129–157, 12 2018. ISSN 0035-9246. doi: 10.1111/j.2517-6161.1955.tb00188.x. URL https://doi.org/10.1111/j.2517-6161.1955.tb00188.x.
  • Damianou and Lawrence [2013] Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In Carlos M. Carvalho and Pradeep Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 207–215, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR. URL https://proceedings.mlr.press/v31/damianou13a.html.
  • Engel et al. [2005] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, page 201–208, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1595931805. doi: 10.1145/1102351.1102377. URL https://doi.org/10.1145/1102351.1102377.
  • Geirhos et al. [2018] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in neural information processing systems, 31, 2018.
  • Grande et al. [2014] Robert Grande, Thomas Walsh, and Jonathan How. Sample efficient reinforcement learning with gaussian processes. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1332–1340, Bejing, China, 22–24 Jun 2014. PMLR. URL https://proceedings.mlr.press/v32/grande14.html.
  • Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
  • Hendrycks et al. [2019] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019.
  • Hensman et al. [2015] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable Variational Gaussian Process Classification. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360, San Diego, California, USA, 09–12 May 2015. PMLR. URL https://proceedings.mlr.press/v38/hensman15.html.
  • Ilyas et al. [2019] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. Advances in neural information processing systems, 32, 2019.
  • Jankowiak et al. [2020] Martin Jankowiak, Geoff Pleiss, and Jacob Gardner. Deep sigma point processes. In Jonas Peters and David Sontag, editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 789–798. PMLR, 03–06 Aug 2020. URL https://proceedings.mlr.press/v124/jankowiak20a.html.
  • Kameda and Tanaka [2023] Kiseki Kameda and Fuyuhiko Tanaka. Reinforcement learning with gaussian process regression using variational free energy. Journal of Intelligent Systems, 32(1):20220205, 2023. doi: doi:10.1515/jisys-2022-0205. URL https://doi.org/10.1515/jisys-2022-0205.
  • Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf.
  • Lende et al. [2025] Matthijs van der Lende, Matthia Sabatelli, and Juan Cardenas-Cartagena. Interpretable function approximation with gaussian processes in value-based model-free reinforcement learning. In Tetiana Lutchyn, Adín Ramírez Rivera, and Benjamin Ricaud, editors, Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL), volume 265 of Proceedings of Machine Learning Research, pages 141–154. PMLR, 07–09 Jan 2025. URL https://proceedings.mlr.press/v265/lende25a.html.
  • Lockwood and Si [2022] Owen Lockwood and Mei Si. A review of uncertainty for deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 18(1):155–162, Oct. 2022. doi: 10.1609/aiide.v18i1.21959. URL https://ojs.aaai.org/index.php/AIIDE/article/view/21959.
  • Murphy [2023] Kevin P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023. URL http://probml.github.io/book2.
  • Ovadia et al. [2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift, 2019. URL https://arxiv.org/abs/1906.02530.
  • Quinonero-Candela et al. [2005] Joaquin Quinonero-Candela, Carl Edward Rasmussen, Fabian Sinz, Olivier Bousquet, and Bernhard Schölkopf. Evaluating predictive uncertainty challenge. In Machine Learning Challenges Workshop, pages 1–27. Springer, 2005.
  • Rana [2013] Prashant Rana. Physicochemical Properties of Protein Tertiary Structure. UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C5QW3H.
  • Rasmussen and Williams [2004] Carl Edward Rasmussen and Christopher K. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, 2004. ISBN 978-0-262-25683-4. OCLC: 1178958074.
  • Salimbeni and Deisenroth [2017] Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep gaussian processes. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/8208974663db80265e9bfe7b222dcb18-Paper.pdf.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Titsias [2009] Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR. URL https://proceedings.mlr.press/v5/titsias09a.html.
  • Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8:279–292, 1992.
  • Wilson et al. [2016] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 370–378, Cadiz, Spain, 09–11 May 2016. PMLR. URL https://proceedings.mlr.press/v51/wilson16.html.

Appendix A Optimized Model Hyper-parameters

We report the results of our hyper-parameter tuning for each combination of model and dataset in Table 3. Besides the variables mentioned in the table, we fix the number of quadrature points in DSPP to 8 and the number of Monte Carlo samples in DGP to 10.

Table 3: Optimized hyper-parameters after Bayesian search.
Model Dataset LR # Epochs Arch # Inducing Points # Models
DGP CASP 0.1 20 [3] 159
Deep Ensemble CASP 0.025 20 [128, 64] 9
DSPP CASP 0.055 20 [ ] 50
DGP ESR 0.1 30 [5, 2] 200
Deep Ensemble ESR 0.001 30 [128, 64] 10
DSPP ESR 0.068 30 [5, 5, 2] 50

Appendix B Variational Inference

To make training scalable, GP based methods make use of variational inference (VI). We briefly cover here how VI works in general.

VI is a method for approximating complex probability distributions, particularly posterior distributions in Bayesian inference. Instead of sampling from the posterior (as in Markov Chain Monte Carlo methods), VI reformulates inference as an optimization problem.

Consider a model with unknown latent variables 𝐳𝐳\mathbf{z}bold_z, known variables 𝐱𝐱\mathbf{x}bold_x, and parameters θ𝜃\thetaitalic_θ (in the case of a parametric model). We want to compute the posterior

pθ(𝐳|𝐱)=pθ(𝐱|𝐳)pθ(𝐳)pθ(𝐱),subscript𝑝𝜃conditional𝐳𝐱subscript𝑝𝜃conditional𝐱𝐳subscript𝑝𝜃𝐳subscript𝑝𝜃𝐱p_{\theta}(\mathbf{z}|\mathbf{x})=\frac{p_{\theta}(\mathbf{x}|\mathbf{z})p_{% \theta}(\mathbf{z})}{p_{\theta}(\mathbf{x})},italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_z | bold_x ) = divide start_ARG italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x | bold_z ) italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_z ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x ) end_ARG , (29)

where we assumes the normalization constant pθ(𝐱)=pθ(𝐱,𝐳)𝑑𝐳subscript𝑝𝜃𝐱subscript𝑝𝜃𝐱𝐳differential-d𝐳p_{\theta}(\mathbf{x})=\int p_{\theta}(\mathbf{x},\mathbf{z})d\mathbf{z}italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x ) = ∫ italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x , bold_z ) italic_d bold_z is intractable.

VI approximates p(𝐳|𝐱)𝑝conditional𝐳𝐱p(\mathbf{z}|\mathbf{x})italic_p ( bold_z | bold_x ) with a simpler distribution qψ(𝐳)subscript𝑞𝜓𝐳q_{\psi}(\mathbf{z})italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) from a parametric family 𝒬𝒬\mathcal{Q}caligraphic_Q (e.g., Gaussians) such that:

ψ=argminψD𝕂𝕃(qψ(𝐳)pθ(𝐳|𝐱))=argminψ𝔼qψ(𝐳)[qψ(𝐳)log(pθ(𝐱|𝐳)pθ(𝐳)pθ(𝐱))]=argminψ𝔼qψ(𝐳)[logqψ(𝐳)logpθ(𝐱|𝐳)logpθ(𝐳)](θ,ψ|𝐱)+logpθ(𝐱).\begin{array}[]{l l}\psi^{*}&=\text{argmin}_{\psi}\;D_{\mathbb{KL}}(q_{\psi}(% \mathbf{z})\|p_{\theta}(\mathbf{z}|\mathbf{x}))\\ &=\text{argmin}_{\psi}\mathbb{E}_{q_{\psi}(\mathbf{z})}\big{[}q_{\psi}(\mathbf% {z})-\log\big{(}\frac{p_{\theta}(\mathbf{x}|\mathbf{z})p_{\theta}(\mathbf{z})}% {p_{\theta}(\mathbf{x})}\big{)}\big{]}\\ &=\text{argmin}_{\psi}\underbrace{\mathbb{E}_{q_{\psi}(\mathbf{z})}[\log q_{% \psi}(\mathbf{z})-\log p_{\theta}(\mathbf{x}|\mathbf{z})-\log p_{\theta}(% \mathbf{z})]}_{\mathcal{L}(\theta,\psi|\mathbf{x})}+\log p_{\theta}(\mathbf{x}% ).\end{array}start_ARRAY start_ROW start_CELL italic_ψ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_CELL start_CELL = argmin start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT blackboard_K blackboard_L end_POSTSUBSCRIPT ( italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) ∥ italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_z | bold_x ) ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = argmin start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) end_POSTSUBSCRIPT [ italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) - roman_log ( divide start_ARG italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x | bold_z ) italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_z ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x ) end_ARG ) ] end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = argmin start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT under⏟ start_ARG blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) end_POSTSUBSCRIPT [ roman_log italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) - roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x | bold_z ) - roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_z ) ] end_ARG start_POSTSUBSCRIPT caligraphic_L ( italic_θ , italic_ψ | bold_x ) end_POSTSUBSCRIPT + roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x ) . end_CELL end_ROW end_ARRAY (30)

ψ𝜓\psiitalic_ψ are known as the variational parameters, which we optimize to obtain our approximate distribution by minimizing (θ,ψ|𝐱)=𝔼qψ(𝐳)[logpθ(𝐱,𝐳)+logqψ(𝐳)]𝜃conditional𝜓𝐱subscript𝔼subscript𝑞𝜓𝐳delimited-[]subscript𝑝𝜃𝐱𝐳subscript𝑞𝜓𝐳\mathcal{L}(\theta,\psi|\mathbf{x})=\mathbb{E}_{q_{\psi}(\mathbf{z})}[-\log p_% {\theta}(\mathbf{x},\mathbf{z})+\log q_{\psi}(\mathbf{z})]caligraphic_L ( italic_θ , italic_ψ | bold_x ) = blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) end_POSTSUBSCRIPT [ - roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x , bold_z ) + roman_log italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) ]. This objective can be rewritten as maximizing the evidence lower bound (ELBO):

L(ψ,θ|𝐱)=𝔼qψ(𝐳)[logpθ(𝐱|𝐳)]expected log likelihoodD𝕂𝕃(qψ(𝐳)pθ(𝐳))KL from posterior to prior.𝐿𝜓conditional𝜃𝐱subscriptsubscript𝔼subscript𝑞𝜓𝐳delimited-[]subscript𝑝𝜃conditional𝐱𝐳expected log likelihoodsubscriptsubscript𝐷𝕂𝕃conditionalsubscript𝑞𝜓𝐳subscript𝑝𝜃𝐳KL from posterior to priorL(\psi,\theta|\mathbf{x})=\underbrace{\mathbb{E}_{q_{\psi}(\mathbf{z})}[\log p% _{\theta}(\mathbf{x}|\mathbf{z})]}_{\text{expected log likelihood}}-% \underbrace{D_{\mathbb{KL}}(q_{\psi}(\mathbf{z})\|p_{\theta}(\mathbf{z}))}_{% \text{KL from posterior to prior}}.italic_L ( italic_ψ , italic_θ | bold_x ) = under⏟ start_ARG blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) end_POSTSUBSCRIPT [ roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x | bold_z ) ] end_ARG start_POSTSUBSCRIPT expected log likelihood end_POSTSUBSCRIPT - under⏟ start_ARG italic_D start_POSTSUBSCRIPT blackboard_K blackboard_L end_POSTSUBSCRIPT ( italic_q start_POSTSUBSCRIPT italic_ψ end_POSTSUBSCRIPT ( bold_z ) ∥ italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_z ) ) end_ARG start_POSTSUBSCRIPT KL from posterior to prior end_POSTSUBSCRIPT . (31)

The KL term acts as a regularization term, ensuring the approximate posterior does not diverge too much from the prior distribution. As the name implies, L(θ,ψ|𝐱)𝐿𝜃conditional𝜓𝐱L(\theta,\psi|\mathbf{x})italic_L ( italic_θ , italic_ψ | bold_x ) is a lower bound of the evidence logpθ(𝐱)subscript𝑝𝜃𝐱\log p_{\theta}(\mathbf{x})roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x ):

L(θ,ψ|𝐱)logpθ(𝐱).𝐿𝜃conditional𝜓𝐱subscript𝑝𝜃𝐱L(\theta,\psi|\mathbf{x})\leq\log p_{\theta}(\mathbf{x}).italic_L ( italic_θ , italic_ψ | bold_x ) ≤ roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x ) . (32)

Appendix C Ablation Experiments

C.1 Impact of Inducing Points

For this experiment, we were interested in the relationship between the number of inducing points and calibration error for our dataset. We considered DGPs/DSPPs with no hidden layers and train for 20 epochs with a learning rate of 0.010.010.010.01. We performed training and evaluation for M{32,64,128,256,512}𝑀3264128256512M\in\{32,64,128,256,512\}italic_M ∈ { 32 , 64 , 128 , 256 , 512 } inducing points. The results can be seen in Figure 5.

Refer to caption
Refer to caption
Figure 5: Negative log likelihood on test set against the number of inducing points. Left: CASP, Right: ESR

C.2 Impact of Depth/Number of Layers

For this experiment, we were interested in the relationship between the number of and the negative log likelihood for our dataset. Both the DGP and DSPP were trained for 20 epochs with a learning rate of 0.010.010.010.01. For simplicity, we used a single GP per hidden layer with 128 inducing points and performed training and evaluation for d{1,2,4,8}𝑑1248d\in\{1,2,4,8\}italic_d ∈ { 1 , 2 , 4 , 8 } number of layers. The results can be seen in Figure 6.

Refer to caption
Refer to caption
Figure 6: Negative log likelihood on test set against the number of hidden layers. Left: CASP, Right: ESR

Appendix D Additional Distribution Shift Results

Refer to caption
Refer to caption
Figure 7: Box plots showing MAE and accuracy for the ESR classification (right) and CASP regression (left) tasks under distributional shift across three methods: Deep Ensemble (blue), DGP (orange), and DSPP (green).

Figure 7 (Left) presents the MAE under increasing shift severity for the CASP regression task. DGP initially achieves the lowest error but degrades more rapidly under shift. DSPP shows a more gradual increase in error and outperforms DGP at higher severities. The Deep Ensemble maintains stable but consistently higher error across all levels. Overall, DSPP demonstrates the best robustness and predictive accuracy under distributional shift.

In the ESR classification task, Figure 7 (Right) summarizes model accuracy across severity levels. The Deep Ensemble consistently maintains the highest accuracy across all shift intensities, showing strong robustness to corruption. While DSPP initially performs well, it experiences a sharper and less stable decline in accuracy as severity increases. In contrast, DGP exhibits a lower median accuracy at high shift levels but with tighter interquartile ranges, indicating more consistent performance. DSPP maintains a higher median accuracy than DGP at the most severe shifts, yet the wider spread in its predictions reflects greater variability and less reliability under strong distributional shifts.