DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models (CVPR 2023)
Abstract
Various hand-crafted regularizers and learned priors have been proposed to tackle these issues: hand-engineered priors to constrain the scene geometry [2, 21], learned priors that force plausible renderings from arbitrary views [21], and methods that use single image depth and normal estimation [38, 46] to provide high-level constraints on the estimated scene geometry. However, there are no approaches that learn a joint probability distribution of the scene geometry and color.

Our contribution is leveraging denoising diffusion models (DDMs) as a learned prior over color and geometry. Specifically, we use an existing synthetic dataset to generate a dataset of RGBD patches to train our DDM. DDMs do not predict a probability for the RGBD patch distribution. Rather, they provide the gradient of the log-probability of the RGBD patch distribution, i.e. stepping in the negative direction of the noise predicted by the DDM is equivalent to moving towards the modes of the RGBD patch distribution. As NeRFs are trained with stochastic gradient descent, gradients of log-probabilities are sufficient, as they can be backpropagated to NeRF networks during training to act as a regularizer; probabilities are not required for this purpose. We demonstrate that the DDM gradient encourages NeRFs to fit density and color fields that are more physically plausible on the LLFF and DTU datasets.

2. Related work

Geometry modeling The geometry of the scene can be modeled as a density field [17], occupancy field [22, 23] or signed distance field [40, 43, 44]. Geometry models can be rendered using differentiable surface/volumetric rendering, so that the training loss for a NeRF model is the photometric reconstruction loss [17]. Signed distance fields also require regularization with an Eikonal loss [6] to constrain the distance field to be valid. Our regularizer operates on rendered color and depth patches, so it can be applied to any geometry representation.

Field representation NeRFs [17] represent geometry with a multi-layer perceptron that is queried with a 3D coordinate. Positional encoding of coordinates, where coordinate values are evaluated with sinusoids at different frequencies, allows modeling of high-frequency density signals with MLPs [35]. Alternatively, [7, 29] encode scalar opacity and spherical harmonic coefficients in a sparse voxel representation, and show that novel views can be synthesized without MLPs. Similarly, Neural Sparse Voxel Fields [15] stores feature encodings in a sparse voxel octree structure that can be trilinearly interpolated and passed through an MLP to predict density and color, thus improving the modeling capacity and rendering speed of NeRFs. MVSNeRF [3] predicts a volume of feature encodings by constructing a 3D cost volume and processing it with 3D CNNs. Density and color MLPs trilinearly interpolate the feature encoding volume to train NeRFs. The 3D CNN can be pretrained on a large number of scenes, which allows faster convergence on novel scenes.

Instant Neural Graphics Primitives [19] uses multi-scale hash tables to store feature encodings of all coordinates in a fixed memory block. This allows storing features at varying spatial resolutions, and consequently reduces the size of the MLP that models geometry and color. With a GPU-optimized implementation, Instant NGP can train NeRFs in minutes without quality degradation. Our contribution is in the priors used for NeRF optimization, and hence our method is agnostic to the underlying geometry representation. As Instant NGP is fast to train and render, we use it as a backbone for our experiments.

Density regularization Mip-NeRF 360 [2] proposes a density regularizer that encourages compactness of the density along conical frustums. In addition to our learned regularizer, we use the density regularizer of [2] as it helps to sharpen the distribution of densities along sampled rays.

Regularization with loss terms Loss terms to regularize NeRFs can play an important role in the final result, as they provide additional supervision to under-constrained geometry and color fields. Some regularizers are hand-crafted to encourage depth and normal smoothness, e.g. [2, 21, 23, 48]. In [11], a semantic loss is introduced to make high-level semantic attributes consistent across renderings from random views. In [27] a loss term regularizes rendered depth maps with depths estimated using Structure-from-Motion and depth completion methods. MonoSDF [46] regularizes occupancy fields with loss terms that incorporate depth and normal maps predicted with a single-image depth prediction model. Similarly, [38] introduces loss terms that use a single-image normal prediction model to regularize rendered normal maps. While all these approaches introduce high-level geometric supervision to NeRFs, the predicted depth and normals are fixed during NeRF fitting and hence the depth and normal models provide a unimodal prior over geometry. Furthermore, the additional supervision is not adapted to the NeRF reconstructions and hence the monocular depth and normal predictions are trusted blindly.

Regularization with Normalizing Flows RegNeRF [21] uses a 2D depth patch smoothness prior and a normalizing flow model as a learned prior over 2D RGB patches. The color patches are rendered while fitting the NeRF and a term proportional to the log probability density assigned to the patch by the normalizing flow model is added to the loss function.

However, the underlying cause of NeRF’s dramatic performance degradation in the few-view case is that the geometry is poor, so we argue that it is preferable to regularize the geometry directly, rather than indirectly via RGB patches. By learning a distribution over RGBD patches we also benefit from the fact that color and depth are strongly correlated, and therefore attempting to regularize them separately discards information.
Figure 2. Illustration of our method. The scene is sampled with training-view rays and rays originating from random patches. Color and density are predicted by MLPs for the 3D points sampled along the rays. Volumetric rendering is used to estimate the expected color C(r) and depth D(r), as well as the weights of color contributions {wi} and positions of samples {ti}. These estimates are used to compute gradients of losses that are backpropagated to the color and density MLPs. The DDM model ϵθ uses RGBD patches to predict color and density gradients that are passed to the MLPs directly. Instant NGP’s multi-scale hash table of feature encodings is not illustrated for simplicity.
RegNeRF [21] uses MLPs to model color and density fields, hence during NeRF training the patch rendering cost can extend NeRF training time substantially. Thus, RegNeRF renders 8 × 8 patches for the prior model, which severely limits the amount of context visible to the normalizing flow model. We use Instant NGP for our NeRF representation, which has a fast rendering time, allowing us to model priors over 48 × 48 patches.

Normalizing flows are generative models that learn to transform a simple probability distribution into a more complex data distribution [13]. The model is built of blocks that fulfil the requirements of (i) preserving the number of dimensions of input and output features; (ii) being invertible, i.e. the input to the block can be calculated from the output; and (iii) the Jacobian of each block must be tractable so that the log probability density can be computed. These constraints can lead to trade-offs in which model expressiveness is sacrificed for tractability. Diffusion models do not have such constraints on their structures and may therefore be more suitable to model data priors.

Denoising Diffusion Models DDMs [8, 20, 31] are powerful generative models that learn to estimate gradients of the log data distribution. Once trained, Langevin dynamics sampling [42] can be used to generate novel samples by performing a sequence of denoising steps starting from a random sample of a standard Gaussian distribution. Denoising Diffusion Models have successfully been used to learn and sample images [8, 34], video [9], speech [4, 14], etc. Recently, multiple DDM-based models were proposed for the task of text-to-image synthesis, e.g. DALL-E 2 [25] and Imagen [28]. Concurrently to our work, Dreamfusion [24] has incorporated Imagen into NeRF optimization to generate novel 3D assets from a text input. Unlike our work, they use DDMs to guide optimization of NeRFs to match input text, while we use DDMs to regularize NeRFs given input training images.

3. Method

We start by covering preliminaries like NeRF and DDM training. Next, we describe the relation of DDMs to the gradient of the log-likelihood of the data, and show how we incorporate DDMs as NeRF regularizers. An overview of our method is shown in Fig. 2.

3.1. NeRFs

Given a set of images of a scene I with camera intrinsic parameters and poses, we are interested in optimizing a density field σ : ℝ³ → ℝ⁺ and color field c : ℝ³ × S² → [0, 1]³, where the density field can be evaluated at any 3D coordinate (x, y, z) ∈ ℝ³ and the color field can be evaluated at any 3D coordinate and viewing direction d ∈ S².

The density and color fields can be used to synthesize views of the scene from arbitrary cameras using differentiable rendering techniques. The expected color C(r) of a ray r(t) = o + td can be estimated using discrete samples t_{0:N} (where t_{i+1} > t_i > 0), so

\mathbf{C}(\mathbf{r}) \approxeq \sum_{i=1}^N w_i \, \mathbf{c}(\mathbf{r}(t_i), \mathbf{d}) + \Big( 1 - \sum_{i=1}^N w_i \Big) \mathbf{c}_\text{bg} , \quad (1)

where the weights of color contributions are w_i = T(t_i) ρ(t_i), defined with

\rho(t_i) = 1 - \exp\big( -\sigma(\mathbf{r}(t_i)) (t_{i+1} - t_i) \big) \quad (2)
and

T(t_i) = \prod_{j=1}^{i-1} \big( 1 - \rho(t_j) \big) \quad (3)

is the accumulated transmittance function, i.e. the probability of the ray r(t) starting at camera center o and reaching coordinate r(t_i) without being absorbed. The c_bg is the background color, which we set to white.

Similarly, one can compute the expected depth as

\mathbf{D}(\mathbf{r}) = \frac{\sum_{i=1}^N w_i t_i}{\sum_{i=1}^N w_i} . \quad (4)
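To make Eqs. (1)–(4) concrete, here is a minimal sketch of the discrete quadrature along a single ray in PyTorch. It is an illustration only: the function and argument names (render_ray, sigma, rgb, t, c_bg) are ours, not taken from the paper or from torch-ngp, and the handling of the last interval length is an assumption.

```python
import torch

def render_ray(sigma, rgb, t, c_bg):
    """Discrete volume rendering along one ray, following Eqs. (1)-(4).

    sigma: (N,) densities sigma(r(t_i)) at the sample positions
    rgb:   (N, 3) colors c(r(t_i), d)
    t:     (N,) sorted sample positions t_i along the ray
    c_bg:  (3,) background color (white in the paper)
    """
    # interval lengths t_{i+1} - t_i; the last one is repeated so shapes match
    delta = t[1:] - t[:-1]
    delta = torch.cat([delta, delta[-1:]])
    rho = 1.0 - torch.exp(-sigma * delta)                          # Eq. (2)
    # accumulated transmittance T(t_i) = prod_{j<i} (1 - rho(t_j)), Eq. (3)
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - rho[:-1]]), dim=0)
    w = T * rho                                                    # weights w_i
    color = (w[:, None] * rgb).sum(0) + (1.0 - w.sum()) * c_bg     # Eq. (1)
    depth = (w * t).sum() / w.sum().clamp_min(1e-8)                # Eq. (4)
    return color, depth, w

# toy usage: 64 random samples on a ray, white background
t = torch.sort(torch.rand(64) * 4.0 + 0.1).values
color, depth, w = render_ray(torch.rand(64) * 2.0, torch.rand(64, 3), t, torch.ones(3))
```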
The density and color fields are optimized to reduce the photometric reconstruction loss, e.g. the L2 difference between input images and renderings from the same views is

\mathcal{L}_\text{photo}(\sigma, \mathbf{c}) = \sum_{i=1}^{|\mathcal{I}|} || I_i - \mathbf{C}_i ||_2 . \quad (5)

The weights of color contributions w_i in Eq. 5 can be regularized to have a compact distribution [2]:

\mathcal{L}_\text{dist} = \frac{1}{D(\mathbf{r})} \Big( \sum_{i, j} w_i w_j \Big| \frac{t_i + t_{i+1}}{2} - \frac{t_j + t_{j+1}}{2} \Big| + \frac{1}{3} \sum_{i=1}^N w_i^2 (t_{i+1} - t_i) \Big) , \quad (6)

where we deviate from the original formulation by dividing through by the expected depth for the ray, which has the effect of increasing the strength of this regularizer for geometry that is close to the camera.

We also encourage the weights to sum to unity, because in real scenes we always expect a ray to be absorbed fully by the scene geometry:

\mathcal{L}_\text{fg} = \Big( 1 - \sum_{i=1}^N w_i \Big)^2 . \quad (7)

In the few-view case, NeRFs frequently collapse to a degenerate solution in which each camera is fully or partially “covered up” with a copy of the corresponding training image. To prevent this, we introduce a regularization approach in which the placement of density that is contained in only one view frustum is penalized as

\mathcal{L}_\text{fr} = \sum_i w_i \, \mathbf{1}(n_i \leq 1), \quad (8)

where n_i is the number of training view frustums in which the point along the ray r(t_i) is contained, so that only weights which lie in fewer than two training frustums are included in the sum. This reflects our prior that most of the scene should be within the frustum of more than one of the training views.

Combining these geometric regularizers into a loss function already gives a very strong baseline,

\mathcal{L}_\text{geom} = \mathcal{L}_\text{photo} + \lambda_\text{fg} \mathcal{L}_\text{fg} + \lambda_\text{fr} \mathcal{L}_\text{fr} + \lambda_\text{dist} \mathcal{L}_\text{dist} . \quad (9)

The λ coefficients control the contributions of the regularizers. In our experiments we refer to this combination of losses as our “geometric baseline”.
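For illustration, the per-ray regularizers of Eqs. (6)–(8) and the combined objective of Eq. (9) can be written roughly as follows. This is a sketch under our own assumptions: here t holds the N+1 interval boundaries, n_views (the per-sample frustum counts n_i) is assumed to be precomputed, and the default λ values are placeholders rather than the paper's settings.

```python
import torch

def distortion_loss(w, t, depth):
    """Depth-normalized distortion regularizer, Eq. (6).

    w: (N,) weights, t: (N+1,) sample interval boundaries, depth: scalar D(r).
    """
    mid = 0.5 * (t[:-1] + t[1:])                  # interval midpoints
    pairwise = (w[:, None] * w[None, :] * (mid[:, None] - mid[None, :]).abs()).sum()
    self_term = (w ** 2 * (t[1:] - t[:-1])).sum() / 3.0
    return (pairwise + self_term) / depth.clamp_min(1e-8)

def foreground_loss(w):
    """Encourage the weights to sum to one, Eq. (7)."""
    return (1.0 - w.sum()) ** 2

def frustum_loss(w, n_views):
    """Penalize density visible in at most one training frustum, Eq. (8)."""
    return (w * (n_views <= 1).float()).sum()

def geometric_loss(photo, w, t, depth, n_views, lam_fg=1.0, lam_fr=1.0, lam_dist=1e-4):
    """Eq. (9): photometric loss plus the three geometric regularizers."""
    return (photo
            + lam_fg * foreground_loss(w)
            + lam_fr * frustum_loss(w, n_views)
            + lam_dist * distortion_loss(w, t, depth))
```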
3.2. Score functions and DDMs

Per Bayes’ theorem, the a posteriori probability of density and color fields given training views I is

p(\sigma, \mathbf{c} \,|\, \mathcal{I}) \propto p(\mathcal{I} \,|\, \sigma, \mathbf{c}) \, p(\sigma, \mathbf{c}), \quad (10)

where we drop the normalizing constant since it depends only on I. The log-posterior is

\log(p(\mathcal{I} \,|\, \sigma, \mathbf{c})) + \log(p(\sigma, \mathbf{c})) . \quad (11)

In practice, we are interested in maximizing p(σ, c | I) with stochastic gradient descent, which only requires computation of the gradient of the log-likelihood ∇_{σ,c} log(p(I | σ, c)) and the gradient of the log-prior ∇_{σ,c} log(p(σ, c)), i.e. the score function. Notice that explicit computation of the probabilities of the density and color fields p(σ, c) is not required. Below, we describe how DDMs are learned and their relation to the score function.

The forward diffusion process progressively adds small Gaussian noise to a data sample x_0 ∼ q(x) to produce progressively noisier versions, so

\mathbf{x}_\tau = \sqrt{\alpha_\tau} \mathbf{x}_{\tau-1} + \sqrt{\beta_\tau} \epsilon_{\tau-1}, \quad (12)

where ϵ_{τ−1} ∼ N(0, I) and α_τ = 1 − β_τ, i.e. the variances {β_τ}_{τ=1}^T control the noise schedule. As the noise function is Gaussian, it follows from the reparameterization trick that

q(\mathbf{x}_\tau \,|\, \mathbf{x}_0) = \mathcal{N}\big( \mathbf{x}_\tau ; \sqrt{\bar{\alpha}_\tau} \, \mathbf{x}_0, (1 - \bar{\alpha}_\tau) \mathbf{I} \big), \quad (13)

where ᾱ_τ = \prod_{s=0}^{\tau} α_s, allowing efficient generation of noised samples for arbitrary τ. As T → ∞ the distribution of noised samples x_T is equivalent to an isotropic unit Gaussian.

The DDM [8, 20, 31] is tasked to learn the reverse diffusion process:

p(\mathbf{x}_{\tau-1} \,|\, \mathbf{x}_\tau) = \mathcal{N}\big( \mathbf{x}_{\tau-1} ; \mathbf{\mu}(\mathbf{x}_\tau, \tau), \tilde{\beta}_\tau \mathbf{I} \big), \quad (14)

where β̃_τ = (1 − ᾱ_{τ−1}) β_τ / (1 − ᾱ_τ).

Since x_τ is available as input to µ(x_τ, τ), the mean µ(x_τ, τ) can be computed by predicting noise ϵ_{τ−1} from the noised input [8]:

\mathbf{\mu}(\mathbf{x}_\tau, \tau) = \frac{1}{\sqrt{\alpha_\tau}} \Big( \mathbf{x}_\tau - \frac{\beta_\tau}{\sqrt{1 - \bar{\alpha}_\tau}} \, \epsilon_\theta(\mathbf{x}_\tau, \tau) \Big), \quad (15)
using a neural network ϵ_θ(x_τ, τ).

Thus, one can learn the reverse diffusion process by training a neural network ϵ_θ(x_τ, τ) to estimate noise given a noised input and noise-level using the loss function:

\mathbb{E}_{\mathbf{x}_0, \epsilon} \left[ \frac{\beta_\tau}{2 \alpha_\tau (1 - \bar{\alpha}_\tau)} \, || \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_\tau} \, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_\tau} \, \epsilon, \tau \big) || \right] , \quad (16)

where ϵ ∼ N(0, I). Fig. 3 (a) illustrates the forward and backwards processes.

Figure 3. (a) Illustration of forward and reverse diffusion processes. (b) Example RGBD patches in the training set of the DDM model extracted from the Hypersim dataset. (c) Example RGBD patches generated with our DDM model trained on the Hypersim dataset. Depths are shown as normalized inverse depths for visualization purposes. The noise in the samples is due to noise that is injected during the sampling process.

Importantly, it was shown in [8, 37] that a DDM noise estimator has a connection to score matching [10, 32, 33] and is proportional to the score function:

\epsilon_\theta(\mathbf{x}_\tau, \tau) \propto -\nabla_\mathbf{x} \log p(\mathbf{x}) . \quad (17)

Hence, taking steps in the negative direction to the noise predicted by the model is equivalent to moving towards the modes of the data distribution. This can be used to generate samples from the data distribution using Langevin dynamics [8, 32, 42].
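As a concrete illustration, one DDM training step on a batch of RGBD patches could look as follows, combining the closed-form noising of Eq. (13) with the noise-prediction objective of Eq. (16) (here with the per-step weighting dropped, as in the simplified objective of [8]). eps_model stands for any network with the ϵ_θ(x_τ, τ) interface and is an assumption, not the exact architecture used in the paper.

```python
import torch

def ddm_training_step(eps_model, x0, alpha_bar, optimizer):
    """One denoising diffusion training step on a batch of RGBD patches.

    x0:        (B, 4, 48, 48) clean RGBD patches
    alpha_bar: (T,) cumulative products of alpha_tau (the noise schedule)
    """
    B, T = x0.shape[0], alpha_bar.shape[0]
    tau = torch.randint(0, T, (B,), device=x0.device)    # random timestep per patch
    a_bar = alpha_bar[tau].view(B, 1, 1, 1)
    eps = torch.randn_like(x0)                           # eps ~ N(0, I)
    # Eq. (13): sample x_tau directly from x_0
    x_tau = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    loss = (eps - eps_model(x_tau, tau)).pow(2).mean()   # simplified Eq. (16)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```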
In this work, we want to use a DDM model as a score function estimator to regularize NeRF reconstructions according to Eq. 11. Hence, we model a prior over (σ, c) by modeling the score function over the distribution of RGBD patches ϵ_θ({C(r), D(r) | r ∈ P}), where P is a set of rays that pass through a random 48 × 48 patch of pixels cast from a random camera. To allow control of the magnitude of the gradients, we further normalize the output of ϵ_θ({C(r), D(r) | r ∈ P}), and refer to this regularization function as ϵ_θ (see supplementary for details).

To train our DDM we use Hypersim [26], a photorealistic synthetic dataset for indoor scene understanding with ground truth images and depth maps. Specifically, we sample 48 × 48 patches of images and depth maps to generate training data for the DDM (removing problematic images and scenes as per the dataset instructions); see Fig. 3(b) for examples. Fig. 3(c) shows samples of RGBD patches generated by our DDM model. The quality of the samples indicates that the DDM successfully learns the data distribution of the RGBD Hypersim patches.
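A hedged sketch of how such 48 × 48 RGBD training patches might be cut from one image/depth pair is given below; the channel layout and the absence of any depth normalization are our own simplifications, not the paper's exact preprocessing.

```python
import numpy as np

def sample_rgbd_patches(rgb, depth, n_patches=16, size=48, rng=None):
    """Cut random size x size RGBD patches from one Hypersim-style frame.

    rgb:   (H, W, 3) float array in [0, 1]
    depth: (H, W) float array of depths
    Returns an (n_patches, size, size, 4) array with depth as the 4th channel.
    """
    rng = rng or np.random.default_rng()
    H, W = depth.shape
    patches = []
    for _ in range(n_patches):
        y = rng.integers(0, H - size + 1)
        x = rng.integers(0, W - size + 1)
        rgbd = np.concatenate([rgb[y:y + size, x:x + size],
                               depth[y:y + size, x:x + size, None]], axis=-1)
        patches.append(rgbd)
    return np.stack(patches)
```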
3.3. Regularizing NeRFs with DDMs

The gradient of the log-posterior (11), which forms our loss function, is

\nabla \log p(\sigma, \mathbf{c} \,|\, \mathcal{I}) = \nabla \log p(\sigma, \mathbf{c}) + \nabla \log p(\mathcal{I} \,|\, \sigma, \mathbf{c}) . \quad (18)

By plugging (17) into the above, we can use a diffusion model as a prior over (σ, c). For the second term on the RHS we use the loss in Eq. 9, resulting in the following gradient for our loss function:

\nabla \mathcal{L} = \nabla \mathcal{L}_\text{photo} + \lambda_\text{fg} \nabla \mathcal{L}_\text{fg} + \lambda_\text{fr} \nabla \mathcal{L}_\text{fr} + \lambda_\text{dist} \nabla \mathcal{L}_\text{dist} - \lambda_\text{DDM} \, \epsilon_\theta , \quad (19)

where λ_DDM controls the weight of our regularizer.

During NeRF optimization we compute the gradient of the loss as per Eq. 19 and backpropagate as usual to obtain gradients for the NeRF density and color field parameters.
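To show how Eq. (19) can be realized in practice, here is a rough PyTorch sketch of one NeRF optimization step: the geometric losses are backpropagated normally, and the DDM prior enters as an extra gradient pushed directly into the rendered RGBD patch. The sign convention, the normalization of the DDM output and all names are our own reading of the description above ("stepping in the negative direction of the predicted noise moves towards the modes"), not code from the paper.

```python
import torch

def nerf_step_with_ddm_prior(geom_loss, rgbd_patch, eps_model, tau, lam_ddm, optimizer):
    """One optimization step applying the combined gradient of Eq. (19).

    geom_loss:  scalar L_geom (Eq. 9) built from rendered training rays
    rgbd_patch: (4, 48, 48) differentiable RGBD patch rendered from a random camera
    """
    optimizer.zero_grad()
    geom_loss.backward(retain_graph=True)          # gradients of the geometric terms
    with torch.no_grad():
        eps = eps_model(rgbd_patch.detach()[None], tau)[0]
        eps = eps / eps.norm().clamp_min(1e-8)     # normalized regularization function
    # Inject lam_ddm * eps as d(loss)/d(patch): the descent step then nudges the
    # patch along -eps_theta, i.e. towards the modes of the RGBD prior (Eq. 17).
    rgbd_patch.backward(gradient=lam_ddm * eps)
    optimizer.step()
```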
3.4. Implementation Details

We use the training protocol of [8, 39] to train our DDM model. We optimize the DDM for 650,000 steps with batch size 32 on 1 GPU.

We use the torch-ngp [36] implementation of Instant NGP [19] with the tiny-cuda-nn [18] back-end as the NeRF model for our experiments. NeRFs are optimized for 12,000 steps, where the first 2500 steps are optimized with λ_dist = 0 and the diffusion time parameter τ smoothly interpolates from 0.1 to 0; hence we set ᾱ_τ = cos(0.5π(τ + 0.008)/1.008) and other variables are derived accordingly. By scheduling τ this way the diffusion model is conditioned to expect progressively less noisy inputs as the NeRF trains and generates increasingly more accurate colors and depths. After 3000 steps, λ_dist linearly increases from 0 until it reaches its maximum value at 8000 steps, where the maximum value is 1 × 10⁻⁴ for the DTU dataset and 1.5 × 10⁻⁵ for the LLFF dataset. We empirically found that this schedule of τ and regularization weights produces the best results. On a single Nvidia A100 GPU our NeRF model trains in approximately 30 minutes per scene.

Furthermore, 25% of the time we use a training pose for patch rendering, and sample the RGB component of the RGBD patch directly from the training image. This is helpful in the early stages, when NeRF renderings are not yet accurate.
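Written out, the scheduling described above amounts to two small functions. The paper states the endpoints (τ from 0.1 towards 0, λ_dist ramping from step 3000 to 8000) but not the exact interpolation, so the linear ramps and the function names below are assumptions.

```python
import math

def tau_schedule(step, total_steps=12000, tau_start=0.1):
    """Diffusion time parameter and the corresponding alpha_bar.

    A linear decay of tau over training is assumed here; the paper only says
    the interpolation from 0.1 to 0 is 'smooth'.
    """
    tau = max(0.0, tau_start * (1.0 - step / total_steps))
    alpha_bar = math.cos(0.5 * math.pi * (tau + 0.008) / 1.008)
    return tau, alpha_bar

def lambda_dist_schedule(step, lam_max, ramp_start=3000, ramp_end=8000):
    """lambda_dist: 0 before step 3000, then a linear ramp to lam_max by step 8000."""
    if step < ramp_start:
        return 0.0
    if step >= ramp_end:
        return lam_max
    return lam_max * (step - ramp_start) / (ramp_end - ramp_start)

# lam_max is 1e-4 for DTU and 1.5e-5 for LLFF in our experiments.
```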
Method | Setting | PSNR ↑ (3 / 6 / 9-view) | SSIM ↑ (3 / 6 / 9-view) | LPIPS ↓ (3 / 6 / 9-view) | Average ↓ (3 / 6 / 9-view)

LLFF:
mip-NeRF [1] | Optimized per Scene | 14.62 20.87 24.26 | 0.351 0.692 0.805 | 0.495 0.255 0.172 | 0.246 0.114 0.073
DietNeRF [11] | Optimized per Scene | 14.94 21.75 24.28 | 0.370 0.717 0.801 | 0.496 0.248 0.183 | 0.240 0.105 0.073
PixelNeRF ft [45] | DTU + ft per Scene | 16.17 17.03 18.92 | 0.438 0.473 0.535 | 0.512 0.477 0.430 | 0.217 0.196 0.163
MVSNeRF ft [3] | DTU + ft per Scene | 17.88 19.99 20.47 | 0.584 0.660 0.695 | 0.327 0.264 0.244 | 0.157 0.122 0.111
RegNeRF [21] | Optimized per Scene | 19.08 21.10 24.86 | 0.587 0.760 0.820 | 0.336 0.206 0.161 | 0.146 0.086 0.067
Geometric Baseline | Optimized per Scene | 19.88 24.28 25.10 | 0.590 0.765 0.802 | 0.192 0.101 0.084 | 0.118 0.071 0.060
DiffusioNeRF (Ours) | Optimized per Scene | 19.79 23.79 25.02 | 0.568 0.747 0.785 | 0.209 0.114 0.096 | 0.127 0.075 0.064

DTU:
mip-NeRF [1] | Optimized per Scene | 8.68 16.54 23.58 | 0.571 0.741 0.879 | 0.353 0.198 0.092 | 0.323 0.148 0.056
DietNeRF [11] | Optimized per Scene | 11.85 20.63 23.83 | 0.633 0.778 0.823 | 0.314 0.201 0.173 | 0.243 0.101 0.068
PixelNeRF ft [45] | DTU + ft per Scene | 18.95 20.56 21.83 | 0.710 0.753 0.781 | 0.269 0.223 0.203 | 0.125 0.104 0.090
MVSNeRF ft [3] | DTU + ft per Scene | 18.54 20.49 22.22 | 0.769 0.822 0.853 | 0.197 0.155 0.135 | 0.113 0.089 0.069
RegNeRF [21] | Optimized per Scene | 18.89 22.20 24.93 | 0.745 0.841 0.884 | 0.190 0.117 0.089 | 0.112 0.071 0.047
Geometric Baseline | Optimized per Scene | 13.60 16.43 22.01 | 0.661 0.759 0.853 | 0.212 0.147 0.071 | 0.185 0.092 0.056
DiffusioNeRF (Ours) | Optimized per Scene | 16.20 20.34 25.18 | 0.698 0.818 0.883 | 0.160 0.093 0.046 | 0.135 0.052 0.033

Table 1. DiffusioNeRF vs. SOTA in the novel view synthesis task on the LLFF and DTU datasets with few input views [21, 45]. We report scores on PSNR, SSIM, LPIPS and Average metrics averaged over all 8 scenes when NeRFs are fitted with 3, 6 and 9 training views. For each view/metric combination the first and second scores are highlighted.
4. Experiments

Datasets We experiment on two datasets: LLFF and DTU. The LLFF [16] dataset has 8 scenes with 20-62 images per scene captured with a handheld camera. The scenes are reconstructed with COLMAP [30] to estimate camera intrinsics, camera poses and the 3D bounds of the scenes. A few images are used for training and the test images are used to evaluate novel view synthesis quality. We select LLFF for evaluations as it allows comparison against other SOTA NeRF models, such as RegNeRF [21].

The DTU [12] dataset consists of images of objects placed on a table against a black background. Images and depth maps are captured with a structured light scanner mounted on an industrial robot arm. The dataset provides images, poses, and ground truth point clouds for evaluation. For novel-view synthesis in the few-view setting on DTU, we use the test set of 15 scans of PixelNeRF [45], allowing comparison against other methods.

We use the test set of 15 scans defined in [23, 43, 46] to evaluate geometry quality, e.g. via the surface method of evaluation as described in UNISURF [23]. Traditionally, geometry estimated by the density field of a NeRF may not allow accurate surface reconstruction compared to occupancy and SDF-based approaches [23], which score higher on DTU, e.g. [23, 43, 44, 46].

Metrics For the task of novel-view synthesis, hold-out views of the scene are used as ground truth to compare against synthesized views. Image similarity metrics such as PSNR, SSIM [41] and LPIPS [47] are measured for each test view and the average score per scene is reported. We also report an “Average” score, specifically the geometric mean of the three metrics as per [1]: \sqrt[3]{10^{-\text{PSNR}/10} \cdot \sqrt{1 - \text{SSIM}} \cdot \text{LPIPS}}.

For the geometry estimation task, we convert an isosurface of the density field into a mesh using marching cubes. The mesh is culled to retain only parts that are visible in at least one training view and the background surfaces are masked out. We then sample the mesh to generate a point cloud, and report the average chamfer L1 distance between the estimated and ground truth point clouds.
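As a small worked example, the “Average” score defined above reduces to a one-line function (note that the paper averages this per scene, so plugging in already-averaged PSNR/SSIM/LPIPS values from Table 1 will not exactly reproduce its Average column):

```python
def average_metric(psnr, ssim, lpips):
    """Geometric mean of 10^(-PSNR/10), sqrt(1 - SSIM) and LPIPS, as per [1]."""
    return (10.0 ** (-psnr / 10.0) * (1.0 - ssim) ** 0.5 * lpips) ** (1.0 / 3.0)

print(round(average_metric(20.0, 0.75, 0.15), 3))  # hypothetical metric values
```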
4.1. Evaluations

Table 1 shows a comparison of our geometric baseline and our model against SOTA methods on the LLFF and DTU datasets when trained with 3, 6 and 9 views. When the number of views is low, the regularizer can have a large impact on the final result, which allows easier comparison of regularizers. As seen from Table 1, the geometric baseline and our method both compare favorably to other methods, achieving the best scores in the PSNR, LPIPS and Average metrics. Our geometric baseline has higher metrics on LLFF, however there are artifacts in the generated test views that can be seen in Fig. 4. Our diffusion model-based method generates more plausible depths compared to the geometric baseline, see Section 4.2. One side-effect is over-smoothing of thin structures (e.g. the top row in Fig. 4). It is also noteworthy that test views contain parts of the scene that are not visible in any of the training views. These occluded parts of the scene can impact reconstruction scores significantly (see supplementary for details).

Table 2 shows an evaluation of reconstruction quality on 15 scans of the DTU dataset when NeRFs are fitted with all views.
Figure 4. Qualitative results for the task of novel view synthesis on the LLFF dataset (columns: Ground Truth, RegNeRF, Geometric Baseline, DiffusioNeRF (Ours)). NeRF models are trained with 3 views and rendered from one of the test views. Our DDM model encourages more realistic geometry as seen in the depth maps.
SDF-based Methods | Mean Chamfer-L1 ↓ || NeRF-based Methods | Mean Chamfer-L1 ↓
UNISURF [23] | 1.02 || Instant NGP [19] | 1.71
NeuS [40] | 0.84 || NeRF [17] | 1.49
VolSDF [43] | 0.86 || Geometric Baseline | 1.36
MonoSDF [46] | 0.73 || DiffusioNeRF | 1.21

Table 2. DiffusioNeRF vs. SOTA in geometry reconstruction on the DTU dataset with all views [5].

In the large number of views regime, the priors are less important as training views provide more information about the scene. Nevertheless, the priors should not introduce any undesirable artifacts and can help with ambiguous regions such as a textureless table. Despite the DDM being trained on images of indoor room-sized scenes, it shows good generalization to the object-centric reconstruction task. Our density-based method performs adequately when compared to occupancy and SDF-based methods.

In Fig. 5 the qualitative results indicate that density-based methods struggle with shiny objects (rows 2 and 4) but can have higher fidelity geometry on diffuse and textured surfaces (rows 1 and 3). The textured regions alone are not sufficient for high quality output, e.g. our geometric baseline struggles to complete the geometry of a house in row 1, and our DDM model provides a complementary signal to the geometric regularizers, resulting in fewer holes and smoother surfaces.

4.2. Ablation studies

In Table 3 we show contributions of each of our optimization terms evaluated on the LLFF and DTU datasets for novel view synthesis and reconstruction quality. As reported, the geometric baseline scores favorably on the LLFF dataset, but has issues in geometry as reflected in the DTU scores. Qualitative results in Fig. 4 demonstrate that the geometry estimated by the geometric baseline is not realistic, even if the appearance scores are high. Our DDM-based approach improves on DTU scores, but its performance on the novel view synthesis metrics is hampered by its tendency to introduce details in areas of the scene that are not pictured in any training view.

In Table 3 we also show ablations of some of the finer details of our model. This table suggests that a model trained on 24 × 24 patches outperforms a model trained on 48 × 48 patches on LLFF, but underperforms on DTU.

The ablations show the significance of feeding patches from input images to the DDM 25% of the time during NeRF fitting. It can be especially important early on, when rendered patches are very different from input images.

Unsurprisingly, reducing the amount of training data for the DDM (only using 20% of the Hypersim scenes) slightly reduces the scores. The RGB-only regularization with DDMs is similar to RegNeRF’s normalizing flow model regularization, but with larger patch sizes. Interestingly, the RGBD regularizer trained with 20% of the data is still better than the RGB-only regularizer that was trained with 100% of the data. The last two rows of the ablation show that careful scheduling of τ and DDM gradient weights is necessary to produce good results. This is an active area of research, having previously been noted in [24]. The DDM weight λDDM trades off the accuracy of reconstruction around thin structures against the overall depth smoothness.
Figure 5. Qualitative comparison of our method against SOTA on geometry reconstruction evaluated on the DTU dataset (scans 24, 69, 83 and 110; columns: RGB, NeuS [40], VolSDF [43], MonoSDF [46], Geometric Baseline, Ours).
Method | LLFF Average ↓ (3-view / 6-view / 9-view) | DTU Average ↓ (3-view / 6-view / 9-view) | DTU Chamfer-L1 ↓ (all views)
∇L = ∇Lphoto 0.210 0.128 0.090 0.203 0.142 0.119 2.87
∇L = ∇Lphoto + λfg ∇Lfg 0.210 0.128 0.090 0.195 0.126 0.092 1.71
∇L = ∇Lphoto + λfg ∇Lfg + λfr ∇Lfr 0.135 0.089 0.072 0.215 0.128 0.093 1.71
∇L = ∇Lphoto + λfg ∇Lfg + λfr ∇Lfr − λDDM ϵθ 0.145 0.085 0.066 0.190 0.097 0.072 1.67
∇L = ∇Lphoto + λfg ∇Lfg + λfr ∇Lfr + λdist ∇Ldist 0.118 0.071 0.060 0.185 0.092 0.056 1.36
∇L = ∇Lphoto + λfg ∇Lfg + λfr ∇Lfr + λdist ∇Ldist − λDDM ϵθ 0.127 0.075 0.064 0.135 0.052 0.033 1.21
DDM regularizer using 24x24 patches 0.126 0.074 0.061 0.195 0.068 0.043 1.22
24x24 patch DDM & NeRF fitted with 4 × λDDM 0.129 0.074 0.062 0.260 0.080 0.050 1.22
Patches from input images are not given to DDM 0.139 0.078 0.066 0.159 0.063 0.049 1.91
DDM trained with 20% of Hypersim scenes 0.132 0.078 0.066 0.163 0.057 0.035 1.65
RGB-only DDM regularizer 0.134 0.083 0.070 0.189 0.081 0.058 1.31
τ = 0 (no schedule) during NeRF fitting 0.137 0.081 0.067 0.152 0.055 0.042 1.31
NeRF fitted with 4 × λDDM 0.146 0.088 0.076 0.220 0.134 0.071 2.56
Table 3. Ablation study of our method. Note that for DTU, λfr is set to 0, hence the 2nd and 3rd rows have identical scores on DTU.
Geometric baseline corresponds to the model in the 5th row.
In this paper we address the problem of regularization of NeRFs. Our approach uses a DDM trained on RGBD patches to approximate a score function, i.e. the gradient of the logarithm of an RGBD patch distribution. Experimentally, we demonstrate that the proposed regularization scheme improves performance on novel view synthesis and 3D reconstruction.

While we show regularization using color and depth patches as input, the proposed framework is versatile and can be used to regularize the 3D voxel grid of densities, density weights sampled along the ray, etc. Indeed, instead of generating RGBD patches, we can generate 3D voxel blocks of densities to train a DDM and use it during NeRF optimization to regularize the density field directly.

One avenue of future work is formulating a principled combination of the DDM gradient with the NeRF objective to avoid heuristics-based τ and gradient scheduling.

Our work is focused on NeRF optimization; however, the general approach of using DDMs as a regularizer could potentially be used for other tasks that are optimized with gradient descent, e.g. self-supervised monocular depth estimation [5], or self-supervised stereo matching [49, 50].

Acknowledgements We thank Niantic colleagues, especially Gabriel Brostow, for discussions and suggestions. We are also grateful for Jiaxiang Tang’s Pytorch implementation of Instant-NGP [36], Phil Wang’s implementation of DDM [39], and to Thomas Müller for tiny-cuda-nn [18].
References

[1] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. In ICCV, 2021. 6
[2] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. In CVPR, 2022. 2, 4
[3] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo. In ICCV, 2021. 2, 6
[4] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating Gradients for Waveform Generation. In ICLR, 2020. 3
[5] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In CVPR, 2017. 7, 8
[6] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit Geometric Regularization for Learning Shapes. In ICML, 2020. 2
[7] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. Baking Neural Radiance Fields for Real-Time View Synthesis. ICCV, 2021. 2
[8] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020. 3, 4, 5
[9] Jonathan Ho, Tim Salimans, Alexey A Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video Diffusion Models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022. 3
[10] Aapo Hyvärinen. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 2005. 5
[11] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis. In ICCV, 2021. 2, 6
[12] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large Scale Multi-view Stereopsis Evaluation. In CVPR, 2014. 6
[13] Ivan Kobyzev, Simon JD Prince, and Marcus A Brubaker. Normalizing Flows: An Introduction and Review of Current Methods. IEEE TPAMI, 2020. 3
[14] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In ICLR, 2020. 3
[15] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural Sparse Voxel Fields. In NeurIPS, 2020. 2
[16] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling Guidelines. ACM TOG, 2019. 1, 6
[17] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020. 1, 2, 7
[18] Thomas Müller. Tiny CUDA Neural Network Framework, 2021. github.com/nvlabs/tiny-cuda-nn. 5, 8
[19] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM TOG, 2022. 2, 5, 7
[20] Alexander Quinn Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. In ICML, 2021. 3, 4
[21] Michael Niemeyer, Jonathan T. Barron, Ben Mildenhall, Mehdi S. M. Sajjadi, Andreas Geiger, and Noha Radwan. RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs. In CVPR, 2022. 1, 2, 3, 6
[22] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision. In CVPR, 2020. 2
[23] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying Neural Implicit Surfaces and Radiance Fields for Multi-View Reconstruction. In ICCV, 2021. 2, 6, 7
[24] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. In ICLR, 2023. 3, 7
[25] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125, 2022. 3
[26] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. In ICCV, 2021. 5
[27] Barbara Roessle, Jonathan T. Barron, Ben Mildenhall, Pratul P. Srinivasan, and Matthias Nießner. Dense Depth Priors for Neural Radiance Fields from Sparse Input Views. In CVPR, 2022. 2
[28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In NeurIPS, 2022. 3
[29] Sara Fridovich-Keil and Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance Fields without Neural Networks. In CVPR, 2022. 2
[30] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. In CVPR, 2016. 6
[31] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In ICML, 2015. 3, 4
[32] Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. In NeurIPS, 2019. 5
[33] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced Score Matching: A Scalable Approach to Density and Score Estimation. In Uncertainty in Artificial Intelligence, 2020. 5
[34] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In ICLR, 2021. 3
[35] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In NeurIPS, 2020. 2
[36] Jiaxiang Tang. Torch-NGP: a PyTorch implementation of Instant-NGP, 2022. github.com/ashawkey/torch-ngp. 5, 8
[37] Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 2011. 5
[38] Jiepeng Wang, Peng Wang, Xiaoxiao Long, Christian Theobalt, Taku Komura, Lingjie Liu, and Wenping Wang. NeuRIS: Neural Reconstruction of Indoor Scenes Using Normal Priors. In ECCV, 2022. 2
[39] Phil Wang. Denoising Diffusion Probabilistic Model in Pytorch, 2022. github.com/lucidrains/denoising-diffusion-pytorch. 5, 8
[40] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In NeurIPS, 2021. 2, 7, 8
[41] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE TIP, 2004. 6
[42] Max Welling and Yee W Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In ICML, 2011. 3, 5
[43] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume Rendering of Neural Implicit Surfaces. In NeurIPS, 2021. 2, 6, 7, 8
[44] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance. In NeurIPS, 2020. 2, 6
[45] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. In CVPR, 2021. 6
[46] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction. In NeurIPS, 2022. 2, 6, 7, 8
[47] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR, 2018. 6
[48] Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Freeman, and Jonathan T Barron. NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination. ACM TOG, 2021. 2
[49] Yiran Zhong, Yuchao Dai, and Hongdong Li. Self-Supervised Learning for Stereo Matching with Self-Improving Ability. arXiv preprint arXiv:1709.00930, 2017. 8
[50] Chao Zhou, Hong Zhang, Xiaoyong Shen, and Jiaya Jia. Unsupervised Learning of Stereo Matching. In ICCV, 2017. 8