
Learning Controllable 3D Diffusion Models from Single-view Images

Jiatao Gu†, Qingzhe Gao§, Shuangfei Zhai†, Baoquan Chen¶, Lingjie Liu‡, Josh Susskind†

†Apple   ‡University of Pennsylvania   §Shandong University   ¶Peking University
†{jgu32,szhai,jsusskind}@apple.com   ‡[email protected]   §[email protected]   ¶[email protected]

arXiv:2304.06700v1 [cs.CV] 13 Apr 2023

Figure 1: Left is the generation process, where a diffusion model (optionally with conditioning/guidance) samples a tri-plane through the backward diffusion process, which a neural renderer then turns into images. Right are examples of controllable generation given various conditioning inputs, showing generated frontal and side views from Control3Diff: (a) 3D inversion; (b) SR + 3D inversion; (c) seg-map to 3D; (d) edge-map to 3D; (e) shape to 3D; (f) text prompt to 3D. The faces shown are all generated by models without real identities due to concerns about individual consent, except for the input in (a).

Abstract

Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulty of acquiring 3D ground-truth data for training. On the other hand, 3D GANs that integrate implicit 3D representations into GANs have shown remarkable 3D-aware generation when trained only on single-view image datasets. However, 3D GANs do not provide straightforward ways to precisely control image synthesis. To address these challenges, we present Control3Diff, a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis on single-view datasets. Control3Diff explicitly models the underlying latent distribution (optionally conditioned on external inputs), thus enabling direct control during the diffusion process. Moreover, our approach is general and applicable to any type of controlling input, allowing us to train it with the same diffusion objective without any auxiliary supervision. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various conditioning inputs such as images, sketches, and text prompts. Please see the project website (https://jiataogu.me/control3diff) for video comparisons.

1. Introduction
The synthesis of photo-realistic 3D-aware images of real-world scenes from sparse controlling inputs is a long-standing problem in both computer vision and computer graphics, with applications including robotics simulation, gaming, and virtual reality. Depending on the task, the sparse inputs can be single-view images, guiding poses, or text instructions, and the objective is to recover 3D representations and synthesize consistent images from novel viewpoints. This is a challenging problem, as the sparse inputs typically contain insufficient information to predict complete 3D details. Consequently, the selection of an appropriate prior during controllable generation is crucial for resolving uncertainties. Recently, significant progress has been made in 2D image generation through diffusion-based generative models [87, 31, 89, 20], which learn such a prior and have achieved remarkable success in conditional applications such as super-resolution [77, 49, 27], in-painting [54], image translation [75, 100], and text-guided synthesis [70, 73, 76, 30]. It is natural to consider applying similar approaches to 3D generation. However, learning diffusion models typically relies heavily on the availability of ground-truth data, which is not commonly available for 3D content, especially when only single-view images are given.

To address this limitation, we propose a framework called Control3Diff, which links diffusion models to generative adversarial networks (GANs) [23] and takes advantage of the success of GANs in 3D-aware image synthesis [81, 11, 62, 25, 12, 64, 86]. The core idea behind 3D GANs is to learn a generator based on neural fields that fuses 3D inductive bias with volume rendering. By training 3D GANs on single-view data with random noises and viewpoints as inputs, we can avoid the need for 3D ground truth. Control3Diff predicts the internal states of 3D GANs given any conditioning input by modeling the prior distribution of the underlying manifold of real data with diffusion models. Furthermore, the proposed framework can be trained on synthetic samples from a 3D GAN, providing effectively unlimited training examples without the risk of over-fitting. Finally, by applying the guidance techniques [20, 32] developed for 2D diffusion models, we are able to learn controllable 3D generation with a single loss function for all conditional tasks. This eliminates the ad-hoc supervision and constraints commonly needed in existing conditional 3D generation [9, 19].

To validate the proposed framework, we use a variant of the recently proposed EG3D [12], which learns an efficient tri-plane representation, as the basis of Control3Diff. We conduct extensive experiments on six types of inputs and demonstrate the effectiveness of Control3Diff on standard benchmarks including FFHQ, AFHQ-cat, and ShapeNet.

2. Preliminaries: Controllable Image Synthesis

In this section, we first define the problem of controllable image synthesis in the 2D and 3D-aware settings and review the 2D solutions based on diffusion models. We then discuss the difficulties of applying similar methods in the 3D scenario.

2.1. Definition

2D. The goal of controllable synthesis is to learn a generative model that synthesizes diverse 2D images x conditioned on an input control signal c. This is mainly done by sampling images in one of the following two ways:

\[ x \sim \exp\left[-\ell(c, x)\right] \cdot p_\theta(x) \quad \text{or} \quad x \sim p_\theta(x \mid c), \tag{1} \]

where θ denotes the parameters of the generative model. The former is called guidance: at test time, an energy function ℓ(c, x) measures the alignment between the synthesized image x and the input c and is used to guide the prior generation p_θ(x). Note that guidance is only applicable to controllable tasks for which such an energy function ℓ(c, x) can be defined. The latter directly formulates the task as a conditional generation problem p_θ(x | c), provided paired data (x, c) is available. As c typically contains less information than x, it is crucial to handle the remaining uncertainty with generative models.

3D. The above formulation extends naturally to 3D. In this work, we assume a 3D scene is represented by a latent representation z, and we synthesize 3D-consistent images by rendering x = R(z, π) for different camera poses π. Here, we do not restrict the space of z, meaning that it can be any high-dimensional structure that describes the 3D scene. We can then define 3D-aware controllable image synthesis by replacing x with z in Eq. (1).

2.2. Diffusion Models

Standard diffusion models [87, 89, 31] are explicit generative models defined by a Markovian process. Given an image x, a diffusion model defines continuous-time latent variables {z_t | t ∈ [0, 1], z_0 = x} based on a fixed schedule {α_t, σ_t}: q(z_t | z_s) = N(z_t; α_{t|s} z_s, σ²_{t|s} I) for 0 ≤ s < t ≤ 1, where α_{t|s} = α_t / α_s and σ²_{t|s} = σ²_t − α²_{t|s} σ²_s. Following this definition, the latent z_t at any time can be derived directly from q(z_t | z_0) = N(z_t; α_t z_0, σ²_t I). The model θ then learns the reverse process by denoising z_t back to the clean target with a weighted reconstruction loss:

\[ \mathcal{L}_{\text{Diff}} = \mathbb{E}_{z_t \sim q(z_t \mid z_0),\, t \sim [0,1]} \left[ \omega_t \cdot \lVert z_\theta(z_t) - z_0 \rVert_2^2 \right]. \tag{2} \]

Typically, θ is parameterized as a U-Net [74, 31] or a ViT [66]. Sampling from a learned model p_θ can be performed with ancestral sampling [31]: starting from pure Gaussian noise z_1 ∼ N(0, I), we iterate over pairs s, t following a uniformly spaced sequence from 1 to 0:

\[ z_s = \alpha_s z_\theta(z_t) + \sqrt{\sigma_s^2 - \bar{\sigma}^2}\, \epsilon_\theta(z_t) + \bar{\sigma} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{3} \]

where σ̄ = σ_s σ_{t|s} / σ_t and ε_θ(z_t) = (z_t − α_t z_θ(z_t)) / σ_t.
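To make Eqs. (2) and (3) concrete, the sketch below implements the forward corruption q(z_t | z_0), the signal-prediction loss, and one ancestral sampling step in PyTorch. The cosine schedule and the denoiser signature `denoiser(z_t, t)` are illustrative assumptions of this sketch, not the paper's released implementation.

```python
import math
import torch

def alpha_sigma(t):
    # A variance-preserving schedule (an assumption here): alpha_t^2 + sigma_t^2 = 1.
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

def q_sample(z0, t):
    # Forward process q(z_t | z_0) = N(alpha_t z_0, sigma_t^2 I).
    alpha_t, sigma_t = alpha_sigma(t)
    return alpha_t * z0 + sigma_t * torch.randn_like(z0)

def diffusion_loss(denoiser, z0):
    # Eq. (2): weighted reconstruction of the clean signal; omega_t = 1 for brevity.
    t = torch.rand(z0.shape[0], device=z0.device).view(-1, *([1] * (z0.dim() - 1)))
    z_t = q_sample(z0, t)
    return ((denoiser(z_t, t) - z0) ** 2).mean()

@torch.no_grad()
def ancestral_step(denoiser, z_t, t, s):
    # Eq. (3): one reverse step from time t to s < t (t, s broadcastable tensors).
    t = torch.as_tensor(t, dtype=z_t.dtype, device=z_t.device)
    s = torch.as_tensor(s, dtype=z_t.dtype, device=z_t.device)
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    sigma_ts2 = sigma_t ** 2 - (alpha_t / alpha_s) ** 2 * sigma_s ** 2   # sigma_{t|s}^2
    z0_hat = denoiser(z_t, t)                                            # z_theta(z_t)
    eps_hat = (z_t - alpha_t * z0_hat) / sigma_t                         # epsilon_theta(z_t)
    sigma_bar = sigma_s * sigma_ts2.sqrt() / sigma_t                     # sigma_s * sigma_{t|s} / sigma_t
    noise = torch.randn_like(z_t)
    return alpha_s * z0_hat + (sigma_s ** 2 - sigma_bar ** 2).clamp_min(0).sqrt() * eps_hat + sigma_bar * noise
```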
By decomposing the sophisticated generative process into hundreds of denoising steps, diffusion models effectively expand the modeling capacity and have shown superior performance over other types of generative models [20]. For better efficiency, Latent Diffusion Models (LDMs [73]) move the process into a latent space by learning an additional encoder z_0 = E(x) that maps input images into that space.

Due to their autoregressive nature, diffusion models are well suited for controllable generation (§ 2.1). Prior research has studied guidance with a variety of classifiers, constraints, or auxiliary loss functions [20, 44, 17, 24, 33, 26, 6]. Other works explore learning conditional diffusion models with parallel data (e.g., class labels [32], text prompts [73], aligned image maps [100]). Importantly, classifier-free guidance [32], which balances sample quality against diversity in controlled generation, has become a basic building block of large-scale diffusion models [76, 70].

2.3. 3D-aware Image Synthesis

When extending image synthesis to 3D, one can model each 3D scene, corresponding to a latent representation z (§ 2.1), as a neural radiance field (NeRF [57]) f_z : R⁵ → R⁴₊ that maps every spatial point and viewing direction to its radiance and density. f_z is parameterized as an MLP [57] or as tri-planes [12, 15] with upsamplers [62, 25]. 3D-consistent images are then synthesized via volume rendering [56]. Despite the success in the 2D scenario, diffusion models have rarely been applied directly to controllable 3D-aware image synthesis with NeRF. There are three key challenges:

1. Learning diffusion models requires the 3D ground truth z (see Eq. (2)), which is often unavailable (the 3D scene contains only single-view information).
2. While there exist approaches to acquire high-quality 3D labels from dense multi-view image collections, in most cases only single-view images are available.
3. As discussed in § 2.1, we need either an energy function ℓ(., .) for guidance or paired data for conditional generation in controllable synthesis. Neither is straightforward to define in the latent space of implicit 3D representations (e.g., NeRF).

Figure 2: Top: comparison between reconstruction-based approaches (auto-encoder, which likely collapses unless heavily regularized; auto-decoder with prior constraints) and GAN-based approaches (noise → generator → real/fake) for obtaining latents z for learning the latent diffusion model. Bottom: learned samples from auto-decoders (Rebain et al., 2022) generally have lower quality than samples from 3D GANs (Chan et al., 2022) due to regularization and optimization difficulties.

More precisely, targeting Challenge #1, prior works [7, 83, 60, 94] first reconstruct the latent z from dense multi-view images of each scene; in the rest of the paper, we refer to them as reconstruction-based methods. That is, given a set of posed images {x_i}^N_{i=1}, one can minimize

\[ \mathcal{L}_{\text{RC}} = \mathbb{E}_{\{x_i\} \sim \text{data}} \left[ \sum_i \lVert R(z, \pi_i) - x_i \rVert_2^2 + \mathcal{H}(z) \right], \tag{4} \]

where π_i is the camera of x_i, R is the differentiable volume renderer of f_z, and H is a prior regularization over z. Here, z = E({x_i}^N_{i=1}) represents either the backward process of ∇_z L_RC (also known as an "auto-decoder" [85]) that updates z via SGD, or an amortized multi-view encoder [48, 78].
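As a minimal illustration of the reconstruction-based alternative in Eq. (4), an auto-decoder optimizes a per-scene latent against posed multi-view renderings. The renderer interface and the simple L2 prior below are assumptions of this sketch.

```python
import torch

def fit_scene_latent(renderer, images, cameras, latent_shape, steps=1000, lr=1e-2, prior_weight=1e-3):
    # Auto-decoder variant of Eq. (4): optimize z by gradient descent against all posed views of one scene.
    # `renderer(z, camera)` is assumed to be a differentiable volume renderer R(z, pi).
    z = torch.zeros(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = torch.stack([renderer(z, cam) for cam in cameras])
        loss = ((recon - images) ** 2).mean() + prior_weight * z.pow(2).mean()  # H(z): a simple L2 prior
        loss.backward()
        opt.step()
    return z.detach()
```

With a single view per scene this objective is severely under-constrained, which is exactly the failure mode discussed next.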
Despite good results when dense multi-view data is available for training, these methods perform poorly when only one view of each scene is available (Challenge #2). Single-view auto-decoders usually fail to learn geometry; even with strong regularization [71], the reconstructed quality remains limited. On the other hand, using an encoder E(x) may ease these issues once various auxiliary losses with novel-view rendering are adopted [9]. Yet, due to limited view coverage and object occlusion, an image encoder cannot predict fully determined 3D details, which introduces additional uncertainty. We illustrate a comparison in Fig. 2. Beyond Challenges #1 and #2, Challenge #3 has rarely been studied in prior research. In the next section, we elaborate on how we address these challenges to achieve controllable 3D-aware image synthesis with only single-view images for training.

3. Method: Control3Diff

We present Control3Diff, a controllable 3D-aware generation framework built on a 3D GAN (§ 3.1). We study two ways of controlling image synthesis with Control3Diff (§§ 3.2 and 3.3). The pipeline is illustrated in Fig. 3.

3.1. Latent Diffusion with 3D GANs

Instead of acquiring z from dense multi-view images {x_i} as in reconstruction-based methods, we directly sample from the learned distribution of z of a 3D GAN trained on single-view images. In this paper, considering its state-of-the-art performance, we build Control3Diff on EG3D [12]. EG3D first learns a tri-plane generator G : u ∈ R^512 → z ∈ R^{3×256×256×32}, mapping low-dimensional noise to an expressive tri-plane. The feature of a 3D point is obtained by projecting the point onto three orthogonal planes and gathering local features from the three planes; this feature is the input to the radiance function f_z for radiance and density prediction.
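The tri-plane lookup described above can be sketched as follows: a 3D point is projected onto the XY, XZ, and YZ planes, features are bilinearly sampled from each plane, and the three features are aggregated before a small MLP decodes them into density and radiance. The summation-based aggregation and the exact plane layout are simplifying assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def sample_triplane(triplane, points):
    # triplane: (3, C, H, W) feature planes for the XY, XZ, and YZ planes.
    # points:   (N, 3) coordinates normalized to [-1, 1].
    projections = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]
    feats = []
    for plane, uv in zip(triplane, projections):
        grid = uv.view(1, -1, 1, 2)                                   # (1, N, 1, 2) sampling grid
        f = F.grid_sample(plane[None], grid, mode='bilinear', align_corners=False)
        feats.append(f.view(plane.shape[0], -1).t())                  # (N, C) features from this plane
    return sum(feats)                                                  # aggregate the three plane features

# The aggregated feature is then decoded by the radiance function f_z (an MLP) into density and color.
```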

Figure 3: Pipeline of Control3Diff. (a) 3D GAN training (noise u → generator G → renderer R → fake image, judged against real images by the discriminator D); (b) the diffusion model is trained on the extracted tri-planes, with or without input conditioning (an encoder, concatenation, and optional attention); (c) controllable 3D generation with the learned diffusion model, optionally with guidance. The tri-plane features are shown as three color planes, and the camera poses are omitted for visual convenience.

Training an EG3D model requires the joint optimization of a camera-conditioned discriminator D, and we adopt the non-saturating logistic objective with R1 regularization:

\[ \mathcal{L}_{\text{GAN}} = \mathbb{E}_{u \sim \mathcal{N}(0,I),\, \pi \sim \Pi} \left[ h\big(D(R(G(u), \pi), \pi)\big) \right] + \mathbb{E}_{x, \pi \sim \text{data}} \left[ h\big(-D(x, \pi)\big) + \gamma \lVert \nabla_x D(x, \pi) \rVert_2^2 \right], \tag{5} \]

where h(u) = −log(1 + exp(−u)) and Π is the prior camera distribution. Adversarial learning enables training on single-view images, as it only forces the model output to match the training data distribution rather than learning a one-to-one mapping as an auto-encoder does. Note that, to make diffusion training more stable, in addition to Eq. (5) we bound G(u) with tanh(.) and apply an additional L2 loss similar to [83] when training EG3D. We observed in our experiments that these additional constraints do not affect the performance of EG3D.

After EG3D is trained, as the second stage, we train a diffusion model by denoising on the tri-planes with the renderer R frozen. Training follows the same denoising objective as Eq. (2) with z_0 = G(u). As u is randomly sampled, we can essentially learn from unlimited data. Different from [60, 94], we do not need any auxiliary loss or additional architectural change. Optionally, we can feed the control signal to the diffusion network as conditioning to formulate a conditional generation framework (§ 3.2). We note that training a diffusion model over G(u) samples differs from distilling a pre-trained GAN into another GAN generator, which would be unsuitable for the controlling tasks.
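A sketch of this second stage, under the same notational assumptions as the earlier sketches (G is the frozen EG3D tri-plane generator, and `diffusion_loss` is the objective from the sketch after Eq. (3)):

```python
import torch

def train_triplane_diffusion(G, denoiser, steps, batch_size=32, lr=1e-4, device='cuda'):
    # Stage 2: the frozen GAN provides unlimited "ground-truth" tri-planes z_0 = G(u); the renderer R is untouched.
    opt = torch.optim.AdamW(denoiser.parameters(), lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            u = torch.randn(batch_size, 512, device=device)   # latent noise of the 3D GAN
            z0 = G(u)                                          # sampled tri-planes, bounded by tanh
        loss = diffusion_loss(denoiser, z0)                    # same objective as Eq. (2)
        opt.zero_grad()
        loss.backward()
        opt.step()
```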
Although it is efficient to sample high-quality tri-planes z this way, GANs are implicit generative models [58] and do not model the likelihood in the learned latent space. That is to say, we do not have a proper prior p(z) over the latent representations of a 3D GAN. Especially in a high-dimensional space such as tri-planes, any control applied without knowing the underlying density easily falls off the learned manifold and produces degenerate results. As a consequence, almost all existing works [9, 19] that use 3D GANs for control focus on low-dimensional spaces, which can be approximately assumed Gaussian; this, however, sacrifices controllability. In contrast, diffusion models explicitly learn the score functions of the latent distribution even in high dimensions [89], which fills in the missing piece for 3D GANs. See also the experimental comparison in Table 1.

3.2. Conditional 3D Diffusion

We can synthesize controllable images by extending latent diffusion into a conditional generation framework. Conventionally, learning such conditional models requires a labeled parallel corpus, e.g., large-scale text-image pairs [80] for the text-to-image task. Compared to acquiring 2D paired data, creating paired data of control signals and 3D representations is much more difficult. In our method, however, we can easily synthesize an infinite number of pairs of control signals and tri-planes by applying an off-the-shelf predictor to images rendered from the tri-planes of the 3D GAN. The learning objective can then be written as

\[ \mathcal{L}_{\text{Cond}} = \mathbb{E}_{z_0, z_t, t, \pi} \left[ \omega_t \cdot \lVert z_\theta\big(z_t, A(R(z_0, \pi))\big) - z_0 \rVert_2^2 \right], \tag{6} \]

where z_0 = G(u) is the sampled tri-plane, A is the off-the-shelf prediction module that converts rendered images into c (e.g., an edge detector for edge-map-to-3D generation), and π ∼ Π is a pre-defined camera distribution based on the testing preference. Here z_θ is a conditional denoiser that learns to predict the denoised tri-plane given the condition. In early exploration, we noticed that the prior camera distribution Π significantly impacts the generalizability of the learned model: for some datasets (e.g., FFHQ, AFHQ), the biased camera distribution of the training set causes degenerate results for rare camera views. We therefore specifically re-sample the cameras for these datasets.
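Because every training pair is synthesized on the fly, the conditional objective of Eq. (6) can be trained as sketched below; `A` stands for any off-the-shelf predictor (e.g., an edge detector), `cameras.sample` is a placeholder for the re-sampled camera distribution Π, and `q_sample` is reused from the earlier sketch.

```python
import torch

def conditional_training_step(G, renderer, A, cond_denoiser, opt, cameras, batch_size=32, device='cuda'):
    # One step of Eq. (6): sample a tri-plane, render it, derive the control signal, and denoise.
    with torch.no_grad():
        u = torch.randn(batch_size, 512, device=device)
        z0 = G(u)                                   # sampled tri-plane
        pi = cameras.sample(batch_size)             # pre-defined camera distribution Pi (assumed API)
        c = A(renderer(z0, pi))                     # control signal predicted from the rendering
    t = torch.rand(batch_size, device=device).view(-1, *([1] * (z0.dim() - 1)))
    z_t = q_sample(z0, t)
    loss = ((cond_denoiser(z_t, t, c) - z0) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```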

Joint Diffusion with Camera Pose π_c.  Conditional models can be learned without camera input, in which case they implicitly map the input view into the global tri-plane space. This implies that conditional models can predict camera information through diffusion. In light of this observation, we propose to jointly predict the input camera pose π_c together with z in a single diffusion framework. Similar to 3D-aware generation, predicting the camera from a single view is itself a challenging problem that requires resolving ambiguities in natural images. Previous works either rely on external deterministic camera predictors [52] or optimize the cameras at inference time [46]. In this work, for simplicity, we flatten π_c into a vector, broadcast it spatially, and concatenate it to the channels of z as the new diffusion target.
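A sketch of how the camera pose can be folded into the diffusion target as described above; the exact pose dimensionality and the channel layout (planes folded into channels) are assumptions of this example.

```python
import torch

def append_camera_channels(z, pose):
    # z:    (B, C, H, W) tri-plane features (the three planes folded into channels for simplicity).
    # pose: (B, P) flattened camera parameters pi_c.
    B, _, H, W = z.shape
    pose_maps = pose.view(B, -1, 1, 1).expand(B, pose.shape[1], H, W)   # broadcast spatially
    return torch.cat([z, pose_maps], dim=1)                             # jointly denoised with the tri-plane

def split_camera_channels(z_aug, num_pose_dims):
    # After denoising, recover the predicted pose by averaging the broadcast channels.
    z, pose_maps = z_aug[:, :-num_pose_dims], z_aug[:, -num_pose_dims:]
    return z, pose_maps.mean(dim=(2, 3))
```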
3.3. Guided 3D Diffusion

An alternative way to control image synthesis is to follow a recipe similar to 2D (as defined in § 2.1) and perform test-time guidance based on a task-specific energy function ℓ(c, z). Nevertheless, directly defining such an energy function between c and a 3D representation (i.e., a tri-plane NeRF) is challenging. We circumvent this by defining ℓ(., .) to measure the closeness between c and the differentiably rendered image R(z, π_c). In this way, we can guide the 3D representation with 2D rendering-based objectives (e.g., a CLIP score [68] for text-to-3D, and MSE or perceptual loss [103] for image inversion). Using 2D guidance for a 3D representation is reasonable, since the final targets of most controlling tasks we care about are images synthesized from certain viewpoints. The 2D rendering guidance can be implemented efficiently by replacing z_θ(z_t) in Eq. (3) with ẑ_θ(z_t):

\[ \hat{z}_\theta(z_t) = z_\theta(z_t) - w_t \nabla_{z_t} \ell\left[c, R(z_\theta(z_t), \pi_c)\right], \tag{7} \]

where z_θ is the denoised tri-plane derived from the unconditional prior and w_t is a time-dependent weight.
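Eq. (7) amounts to shifting the denoised tri-plane along the gradient of a rendering-space energy before the ancestral update. A sketch follows; the energy `energy_fn`, the guidance weight w_t, and the renderer interface are assumptions of this example.

```python
import torch

def guided_denoise(denoiser, renderer, energy_fn, z_t, t, c, camera, w_t):
    # Eq. (7): z_hat = z_theta(z_t) - w_t * grad_{z_t} l(c, R(z_theta(z_t), camera)).
    z_t = z_t.detach().requires_grad_(True)
    z0_hat = denoiser(z_t, t)
    loss = energy_fn(c, renderer(z0_hat, camera))     # e.g. MSE, a perceptual loss, or a CLIP score
    grad = torch.autograd.grad(loss, z_t)[0]
    return (z0_hat - w_t * grad).detach()
```

During sampling, the returned estimate simply replaces z_θ(z_t) in the ancestral step sketched after Eq. (3).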
Langevin correction steps.  While the 2D rendering guidance provides a gradient for learning the 3D representation, the optimization is often unstable due to the nonlinearity of the 2D-to-3D mapping. Our initial experiments showed that early guidance steps can get stuck in a local minimum with incorrect geometry, which is hard to correct in later denoising stages as the noise level decreases. We therefore adopt ideas from predictor-corrector samplers [90, 33] and include additional Langevin correction steps before each diffusion step (Eq. (3)):

\[ z_t \leftarrow z_t - \tfrac{1}{2} \delta \sigma_t \hat{\epsilon}_\theta(z_t) + \sqrt{\delta}\, \sigma_t \epsilon', \quad \epsilon' \sim \mathcal{N}(0, I), \tag{8} \]

where δ is the step size and ε̂_θ is derived from ẑ_θ in Eq. (7). Following Langevin MCMC [55], these additional steps help z_t match the marginal distribution at noise level σ_t.
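The corrector of Eq. (8) can be sketched as a few Langevin updates at the current noise level, reusing the (optionally guided) clean estimate from the previous sketch; the number of steps and the step size δ are assumptions here.

```python
import torch

def langevin_correct(z0_estimate, z_t, t, alpha_t, sigma_t, delta=0.1, num_steps=2):
    # Eq. (8): z_t <- z_t - 0.5 * delta * sigma_t * eps_hat + sqrt(delta) * sigma_t * noise,
    # where eps_hat is derived from the clean estimate z_hat (e.g. the guided_denoise sketch above).
    for _ in range(num_steps):
        z0_hat = z0_estimate(z_t, t)
        eps_hat = (z_t - alpha_t * z0_hat) / sigma_t
        z_t = (z_t - 0.5 * delta * sigma_t * eps_hat
               + (delta ** 0.5) * sigma_t * torch.randn_like(z_t)).detach()
    return z_t
```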

Discussion: Conditioning vs. Guidance.  Compared to the guidance methods of § 3.3, training a conditional 3D diffusion model has several benefits. First, guided diffusion requires a properly designed differentiable ℓ(., .) to back-propagate the guidance gradient to the diffusion model, which is not available for all kinds of conditional tasks; conditional models have no such requirement and can fit any conditional distribution. Second, conditioning is computationally more efficient, because guidance requires rendering and back-propagating through the volume renderer R at every step. However, conditioning has a potential drawback: as our models are trained on images generated by a pretrained 3D GAN (Eq. (6)), the learned p(z | c) may suffer from a domain gap between real and synthesized images. In such cases, guidance-based methods become more reliable, since ℓ is computed directly on the real controls.

Optionally, we can combine the best of both worlds when ℓ is available: we learn a conditional diffusion model and generate samples jointly with guidance (see Fig. 3(c)). This paradigm can also be used when the test camera is not given, in which case the guidance updates the camera π_c predicted by the aforementioned conditional model.

4. Experiments

4.1. Experimental Settings

Dataset & Tasks.  We evaluate Control3Diff on three standard image generation benchmarks – FFHQ (512²) [41], AFHQ-cat (512²) [16], and ShapeNet (128²) [85], the latter including the two categories Cars and Chairs. Following EG3D [12], each image is associated with its camera pose. We consider six controllable 3D-aware generation tasks. For all datasets, we test standard image-to-3D inversion (with original-resolution and low-resolution inputs) and edge-map-to-3D generation. For faces, we further explore segmentation-to-3D, head-shape-to-3D, and text-description-to-3D tasks to validate controllability at various levels. To compare with previous work [19] on Seg-to-3D, we additionally train one model on CelebA-HQ [38]. Besides, we also report unconditional generation with guidance in the ablation.

Baselines.  We choose standard optimization-based and encoder-based [92, 47] approaches for image-to-3D inversion, and the recent Pix2Pix3D [19] as the major baseline on the Seg-to-3D task. Note that we do not aim at the state of the art on a single task such as inversion, but rather at highlighting the potential of our generic framework for 3D-aware generation. Thus, we limit our comparison to methods that do not fine-tune the model weights [72].

Evaluation Metrics.  For image synthesis quality, we report standard metrics: PSNR, SSIM, SG diversity [14], LPIPS [103], KID, and FID [29]. For faces, we compute the cosine similarity of the facial embeddings produced by a facial recognition network for a given pair of faces and use it as the ID metric. For conditional generation tasks, following Pix2Pix3D [19], we evaluate methods using mean Intersection-over-Union (mIoU) and mean pixel accuracy (MPA) on segmentation maps.

Implementation Details.  We implement all our models based on the standard U-Net architecture [20]; for conditional diffusion models, a U-Net-based encoder is adopted to encode the input image, similar to [26] (see Fig. 3(b)). We include the hyper-parameter details in the Appendix.

4.2. Image-to-3D Inversion

In this section, we evaluate Control3Diff on 3D inversion tasks in two settings: (1) standard inversion and (2) a more challenging 3D super-resolution task. To establish baselines, we directly optimize the low-dimensional latent vectors (W, W+)*, following [2], as well as the tri-planes. As conventional GANs do not have learned priors in these spaces, optimization is performed with noise-injection regularization. We also employ an encoder-based approach [47] that directly predicts W or tri-planes; to predict tri-planes, we train a separate encoder.

* W and W+ refer to the compact and expanded latent spaces of the GAN, respectively; see [2] for details.

The results are shown in Table 1 and Fig. 4: our method significantly outperforms the other methods in terms of both image quality and identity consistency. While direct optimization of tri-planes may yield higher accuracy on the input view, it consistently collapses on novel views due to the lack of a prior. We also show visual comparisons for 3D super-resolution in Fig. 5, where our diffusion-based approach shows even larger gains.

Figure 4: Comparison for 3D inversion of in-the-wild images (input image vs. Pred. W, Pred. Tri-plane, Opt. W, Opt. W+, Opt. Tri-plane, Ours, Ours w/o camera). We compare the proposed approach to direct prediction of the GAN's latent W and of the tri-plane with a learned encoder, as well as to optimization-based approaches that infer the latent W, the expanded latent W+, or the tri-plane, following [2]. Our method achieves better view consistency and higher output image quality than the baselines.

Table 1: Quantitative comparison on inversion. Although optimizing the tri-plane can fit the input view well, it falls short in generating realistic novel-view images. Overall, our method achieves the best performance.

             FFHQ                                                        AFHQ-Cat
             PSNR↑  SSIM↑  LPIPS↓  ID↑   nvFID↓  nvKID↓  nvID↑           PSNR↑  SSIM↑  LPIPS↓  nvFID↓  nvKID↓
  Opt. W     15.93  0.68   0.42    0.60  39.26   0.023   0.57            16.08  0.57   0.42    9.15    0.004
  Opt. W+    17.91  0.73   0.34    0.74  38.23   0.022   0.68            18.32  0.62   0.35    10.54   0.006
  Opt. Tri.  18.32  0.78   0.11    0.92  138.0   0.154   0.54            17.53  0.71   0.14    98.79   0.085
  Pred. W    14.82  0.64   0.54    0.37  45.06   0.018   0.35            14.56  0.52   0.55    20.87   0.006
  Ours       22.30  0.79   0.23    0.89  13.48   0.005   0.81            20.11  0.66   0.24    7.03    0.003

Figure 5: Comparison on the SR + inversion task (low-resolution input vs. Opt. W, Opt. W+, Opt. Tri-plane, Ours). By learning a proper prior with diffusion models, Control3Diff synthesizes realistic and faithful cat faces from low-resolution inputs, while optimization-based approaches fail completely due to the lack of a proper 3D prior.

4.3. Seg-to-3D & Edge-to-3D Synthesis

We evaluate our method on more general conditional 3D generation tasks where the input control is not necessarily the target view, e.g., Seg-to-3D and Edge-to-3D. For the Seg-to-3D task, we train two additional parsing networks [99] with labels provided by Pix2Pix3D [19], where the segmentation ground truth for cats is obtained by clustering DINO features as proposed by [3]. We observed that this clustering scheme adversely affects the performance of the cat-parsing network, resulting in lower accuracy than that achieved by the face-parsing network.

The results of our evaluation are presented in Table 2, which indicates that our method generates images with comparable alignment and quality. Furthermore, as illustrated in Fig. 6, our method produces more realistic faces in novel views. Our model also successfully generates consistent 3D objects from input edge maps, as illustrated in Fig. 7. Additional results can be found in Appendix D.

Table 2: Quantitative comparison on Seg2Face and Seg2Cat.

               Seg2Face                              Seg2Cat
               FID↓   SG↑   mIoU↑  MPA↑              FID↓   SG↑   mIoU↑  MPA↑
  Pix2Pix3D    21.28  0.46  0.52   0.63              15.46  0.50  0.64   0.76
  Ours         12.85  0.43  0.61   0.72              11.66  0.47  0.67   0.79

Figure 6: Comparison on Seg-to-3D generation (input segmentation map; output, overlay, and novel views for Pix2Pix3D and ours). All faces are model generated and are not real identities. Our method generates images with improved alignment to the segmentation map and greater 3D consistency.

Figure 7: Qualitative results on Edge-to-3D generation (input edge map; output, overlay, and novel views) on FFHQ, AFHQ, ShapeNet Cars, and ShapeNet Chairs. All faces are model generated.

4.4. Text-to-3D Synthesis

We demonstrate the versatility of our framework by applying it to text-to-3D generation. Qualitative results are shown in Fig. 8. For (a)-(c), we train Control3Diff as a conditional diffusion model, adopting the normalized CLIP embedding of the model's rendering as the conditioning. At test time, such a model can be seamlessly switched to text control thanks to the multi-modal embedding space of CLIP. We also conduct an experiment with text-feature guidance in (d), where we directly apply a pre-trained 2D diffusion model as a score function, similar to DreamFusion [67], to guide the generation of an unconditional 3D diffusion model.

Figure 8: Qualitative results on Text-to-3D synthesis for the prompts: (a) A middle-aged woman with curly brown hair and pink lips; (b) A middle-aged man with a receding hairline, a thick beard, and hazel eyes; (c) A young woman with freckles on her cheeks and brown hair in a pixie cut; (d) a photograph of Joker's face. All faces are model generated and are not real identities.

4.5. Ablation Study

We conducted an ablation study on the Image-to-3D inversion task, evaluating the effects of conditioning and guidance on the visual quality of our method. As illustrated in Fig. 9, the combination of conditioning and guidance leads to the best visual quality, while removing either results in artifacts or an inability to fit the target image. More specifically, applying guidance to the unconditional model alone produces visible artifacts, whereas using the conditional model alone fails to recover all details of the input image, especially in the background.

Figure 9: Comparison between conditional and guided diffusion (input image; unconditional + guided without Langevin; unconditional + guided; conditional only; conditional + guided, i.e., the final model).

5. Related Work

Diffusion for 3D-aware Generation.  There have been recent attempts [95, 7, 61, 59, 4, 84, 5] to extend diffusion models to 3D. The key challenge is obtaining 3D ground truth for training. Most works tackle this challenge by reconstructing 3D ground truth from dense multi-view data. In contrast, our method can be trained on single-view data alone by using a 3D GAN to synthesize unlimited ground-truth 3D data. Another line of work [67, 93, 104, 18, 26] applies 2D diffusion priors to sparse-view reconstruction or text-to-3D generation. For example, NerfDiff [26] performs a test-time optimization that distills 2D diffusion priors into a NeRF for single-view reconstruction. Different from NerfDiff [26], our focus is 3D-aware image synthesis controlled by various control signals, and we apply denoising directly in 3D. Furthermore, our method can be trained on single-view datasets without the need for multi-view data.

Controllable Image Synthesis with GANs.  Conventional GANs [22, 41, 42] can generate photo-realistic images from low-dimensional, randomly sampled latent vectors, but offer limited controllability. Follow-up works enable controllability either by adding a conditioning input alongside the sampled vectors ("conditional GANs") [35, 65] or by manipulating the sampled vectors [82, 28, 105]. These works focus on controllable 2D image synthesis and cannot explicitly control 3D properties (e.g., cameras) or synthesize multi-view consistent images. Recently, 3D GANs [81, 12, 13, 63, 25, 97] have been developed by integrating 3D representations and rendering into GANs. While these models can control 3D properties by manipulating the latent vectors, their controllability is limited to global camera poses or geometry. Many works [37, 91, 8, 36] support fine-grained geometry editing, but most have only demonstrated results on human faces or bodies. Other conditional 3D GANs for general objects [9, 19] need additional constraints or architecture changes, and their synthesis quality is still limited. In contrast, our method allows a variety of control signals (e.g., segmentation maps) for fine-level 3D-aware image synthesis on various kinds of objects.
6. Conclusion

In summary, we propose Control3Diff, a versatile approach for 3D-aware image synthesis that combines the strengths of 3D GANs and diffusion models. Our method enables precise control over image synthesis by explicitly modeling the underlying latent distribution. We validate our approach on standard benchmarks, demonstrating its efficacy with various types of conditioning inputs. Control3Diff represents a significant advancement in generative modeling in 3D, opening up new research possibilities in this area.

References

[13] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. CVPR, 2021.
[14] Anpei Chen, Ruiyang Liu, Ling Xie, Zhang Chen, Hao Su, and Jingyi Yu. Sofgan: A portrait image generator with dynamic styling. ACM Transactions on Graphics (TOG), 41(1):1–26, 2022.
[15] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. arXiv preprint arXiv:2203.09517, 2022.
[16] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha.
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Im- Stargan v2: Diverse image synthesis for multiple domains.
age2stylegan: How to embed images into the stylegan latent In Proceedings of the IEEE/CVF Conference on Computer
space? In Proceedings of the IEEE/CVF International Con- Vision and Pattern Recognition (CVPR), June 2020.
ference on Computer Vision, pages 4432–4441, 2019.
[17] Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Im-
Klasky, and Jong Chul Ye. Diffusion posterior sam-
age2stylegan++: How to edit the embedded images? In
pling for general noisy inverse problems. arXiv preprint
Proceedings of the IEEE/CVF conference on computer vi-
arXiv:2209.14687, 2022.
sion and pattern recognition, pages 8296–8305, 2020.
[18] Congyue Deng, Chiyu ”Max” Jiang, Charles R. Qi, Xinchen
[3] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel.
Yan, Yin Zhou, Leonidas Guibas, and Dragomir Anguelov.
Deep vit features as dense visual descriptors. arXiv preprint
Nerdi: Single-view nerf synthesis with language-guided dif-
arXiv:2112.05814, 2(3):4, 2021.
fusion as general image priors, 2022.
[4] Titas Anciukevicius, Zexiang Xu, Matthew Fisher, Paul Hen-
derson, Hakan Bilen, Niloy J. Mitra, and Paul Guerrero. [19] Kangle Deng, Gengshan Yang, Deva Ramanan, and Jun-Yan
RenderDiffusion: Image diffusion for 3D reconstruction, Zhu. 3d-aware conditional image synthesis. arXiv preprint
inpainting and generation. arXiv, 2022. arXiv:2302.08509, 2023.
[5] Anonymous. Anonymous, SEE COVER LETTER. [20] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. Advances in Neural Informa-
[6] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild,
tion Processing Systems, 34:8780–8794, 2021.
Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and
Tom Goldstein. Universal guidance for diffusion models. [21] Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart.
arXiv preprint arXiv:2302.07121, 2023. Learning an animatable detailed 3d face model from in-
[7] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Wal- the-wild images. ACM Transactions on Graphics (ToG),
ter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent 40(4):1–13, 2021.
Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, Af- [22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
shin Dehghan, and Josh Susskind. Gaudi: A neural architect Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
for immersive 3d scene generation. arXiv, 2022. Yoshua Bengio. Generative adversarial nets. In NeurIPS,
[8] Alexander W. Bergman, Petr Kellnhofer, Wang Yifan, Eric R. 2014.
Chan, David B. Lindell, and Gordon Wetzstein. Generative [23] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
neural articulated radiance fields. In NeurIPS, 2022. Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
[9] Shengqu Cai, Anton Obukhov, Dengxin Dai, and Luc Yoshua Bengio. Generative Adversarial Nets. Advances in
Van Gool. Pix2nerf: Unsupervised conditional π-gan for Neural Information Processing Systems, pages 2672–2680,
single image to neural radiance fields translation. arXiv 2014.
preprint arXiv:2202.13162, 2022. [24] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and
[10] Caroline Chan, Frédo Durand, and Phillip Isola. Learning to Dimitris Samaras. Diffusion models as plug-and-play priors.
generate line drawings that convey geometry and semantics. arXiv preprint arXiv:2206.09012, 2022.
In Proceedings of the IEEE/CVF Conference on Computer [25] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian
Vision and Pattern Recognition, pages 7915–7925, 2022. Theobalt. Stylenerf: A style-based 3d-aware genera-
[11] Eric Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, tor for high-resolution image synthesis. arXiv preprint
and Gordon Wetzstein. pi-gan: Periodic implicit generative arXiv:2110.08985, 2021.
adversarial networks for 3d-aware image synthesis. In arXiv, [26] Jiatao Gu, Alex Trevithick, Kai-En Lin, Josh Susskind,
2020. Christian Theobalt, Lingjie Liu, and Ravi Ramamoor-
[12] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. arXiv preprint arXiv:2112.07945, 2021.
[27] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Miguel Angel IEEE/CVF conference on computer vision and pattern recog-
Bautista, and Josh Susskind. f-dm: A multi-stage diffusion nition, pages 8110–8119, 2020.
model via progressive signal transformation. arXiv preprint [44] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming
arXiv:2210.04955, 2022. Song. Denoising diffusion restoration models. arXiv
[28] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and preprint arXiv:2201.11793, 2022.
Sylvain Paris. Ganspace: Discovering interpretable gan [45] Diederik P Kingma and Jimmy Ba. Adam: A method for
controls. arXiv preprint arXiv:2004.02546, 2020. stochastic optimization. arXiv preprint arXiv:1412.6980,
[29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern- 2014.
hard Nessler, and Sepp Hochreiter. Gans trained by a two [46] Jaehoon Ko, Kyusun Cho, Daewon Choi, Kwangrok Ryoo,
time-scale update rule converge to a local nash equilib- and Seungryong Kim. 3d gan inversion with pose optimiza-
rium. Advances in neural information processing systems, tion. In Proceedings of the IEEE/CVF Winter Conference on
30, 2017. Applications of Computer Vision, pages 2967–2976, 2023.
[30] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, [47] Jaehoon Ko, Kyusun Cho, Daewon Choi, Kwangrok Ryoo,
Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben and Seungryong Kim. 3d gan inversion with pose optimiza-
Poole, Mohammad Norouzi, David J Fleet, et al. Imagen tion. In Proceedings of the IEEE/CVF Winter Conference on
video: High definition video generation with diffusion mod- Applications of Computer Vision, pages 2967–2976, 2023.
els. arXiv preprint arXiv:2210.02303, 2022. [48] Adam R. Kosiorek, Heiko Strathmann, Daniel Zoran, Pol
[31] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- Moreno, Rosalia Schneider, Soňa Mokrá, and Danilo J.
sion probabilistic models. Advances in Neural Information Rezende. NeRF-VAE: A Geometry Aware 3D Scene Gener-
Processing Systems, 33:6840–6851, 2020. ative Model. ICML, 2021.
[32] Jonathan Ho and Tim Salimans. Classifier-free diffusion [49] Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun
guidance. arXiv preprint arXiv:2207.12598, 2022. Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single
[33] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William image super-resolution with diffusion probabilistic models.
Chan, Mohammad Norouzi, and David J Fleet. Video diffu- Neurocomputing, 479:47–59, 2022.
sion models. arXiv preprint arXiv:2204.03458, 2022.
[50] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier
[34] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim-
Romero. Learning a model of facial shape and expression
ple diffusion: End-to-end diffusion for high resolution im-
from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.
ages. arXiv preprint arXiv:2301.11093, 2023.
[51] Yu-Jhe Li, Tao Xu, Bichen Wu, Ningyuan Zheng, Xiaoliang
[35] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A
Dai, Albert Pumarola, Peizhao Zhang, Peter Vajda, and Kris
Efros. Image-to-image translation with conditional adver-
Kitani. 3d-aware encoding for style-based neural radiance
sarial networks. In CVPR, 2017.
fields. arXiv preprint arXiv:2211.06583, 2022.
[36] Kaiwen Jiang, Shu-Yu Chen, Feng-Lin Liu, Hongbo Fu,
[52] Connor Z Lin, David B Lindell, Eric R Chan, and Gordon
and Lin Gao. Nerffaceediting: Disentangled face editing in
Wetzstein. 3d gan inversion for controllable portrait image
neural radiance fields, 2022.
animation. arXiv preprint arXiv:2203.13441, 2022.
[37] Sun Jingxiang, Wang Xuan, Wang Lizhen, Li Xiaoyu, Zhang
Yong, Zhang Hongwen, and Liu Yebin. Next3d: Generative [53] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
neural texture rasterization for 3d-aware head avatars. arXiv regularization. arXiv preprint arXiv:1711.05101, 2017.
preprint arXiv:2205.15517, 2022. [54] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher
[38] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting
Progressive Growing of GANs for Improved Quality, Sta- using denoising diffusion probabilistic models. In Proceed-
bility, and Variation. International Conference on Learning ings of the IEEE/CVF Conference on Computer Vision and
Representations, 2018. Pattern Recognition, pages 11461–11471, 2022.
[39] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, [55] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe
Jaakko Lehtinen, and Timo Aila. Training generative adver- for stochastic gradient mcmc. Advances in neural informa-
sarial networks with limited data. In Proc. NeurIPS, 2020. tion processing systems, 28, 2015.
[40] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, [56] Nelson Max. Optical models for direct volume rendering.
Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free IEEE Transactions on Visualization and Computer Graphics,
generative adversarial networks. In Proc. NeurIPS, 2021. 1(2):99–108, 1995.
[41] Tero Karras, Samuli Laine, and Timo Aila. A style-based [57] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik,
generator architecture for generative adversarial networks. Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
In CVPR, pages 4401–4410, 2019. Representing scenes as neural radiance fields for view syn-
[42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, thesis. In European conference on computer vision, pages
Jaakko Lehtinen, and Timo Aila. Analyzing and improving 405–421. Springer, 2020.
the image quality of stylegan. In CVPR, pages 8110–8119, [58] Shakir Mohamed and Balaji Lakshminarayanan. Learn-
2020. ing in implicit generative models. arXiv preprint
[43] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, arXiv:1610.03483, 2016.
Jaakko Lehtinen, and Timo Aila. Analyzing and improv- [59] Norman Müller, , Yawar Siddiqui, Lorenzo Porzi, Samuel
ing the image quality of stylegan. In Proceedings of the Rota Bulò, Peter Kontschieder, and Matthias Nießner.
[60] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, tion. International Conference on Medical Image Computing
Samuel Rota Bulò, Peter Kontschieder, and Matthias and Computer-Assisted Intervention, pages 234–241, 2015.
Nießner. Diffrf: Rendering-guided 3d radiance field [75] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee,
diffusion. arXiv preprint arXiv:2212.01206, 2022. Jonathan Ho, Tim Salimans, David Fleet, and Mohammad
[61] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Norouzi. Palette: Image-to-image diffusion models. In
Mishkin, and Mark Chen. Point-e: A system for generating ACM SIGGRAPH 2022 Conference Proceedings, pages 1–
3d point clouds from complex prompts, 2022. 10, 2022.
[62] Michael Niemeyer and Andreas Geiger. GIRAFFE: Repre- [76] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
senting Scenes as Compositional Generative Neural Feature Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
Fields. CVPR, pages 11453–11464, 2021. Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi,
[63] Michael Niemeyer and Andreas Geiger. Giraffe: Repre- Rapha Gontijo Lopes, et al. Photorealistic text-to-image
senting scenes as compositional generative neural feature diffusion models with deep language understanding. arXiv
fields. In Proc. IEEE Conf. on Computer Vision and Pattern preprint arXiv:2205.11487, 2022.
Recognition (CVPR), 2021. [77] Chitwan Saharia, Jonathan Ho, William Chan, Tim Sali-
[64] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, mans, David J Fleet, and Mohammad Norouzi. Image super-
Jeong Joon Park, and Ira Kemelmacher-Shlizerman. resolution via iterative refinement. arXiv:2104.07636, 2021.
Stylesdf: High-resolution 3d-consistent image and geom- [78] Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs
etry generation. In Proceedings of the IEEE/CVF Confer- Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario
ence on Computer Vision and Pattern Recognition, pages Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszko-
13503–13513, 2022. reit, Thomas Funkhouser, and Andrea Tagliasacchi. Scene
[65] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun- Representation Transformer: Geometry-Free Novel View
Yan Zhu. Semantic image synthesis with spatially-adaptive Synthesis Through Set-Latent Scene Representations. CVPR,
normalization. In CVPR, 2019. 2022.
[66] William Peebles and Saining Xie. Scalable diffusion models [79] Tim Salimans and Jonathan Ho. Progressive distillation
with transformers. arXiv preprint arXiv:2212.09748, 2022. for fast sampling of diffusion models. arXiv preprint
[67] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- arXiv:2202.00512, 2022.
hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv [80] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
preprint arXiv:2209.14988, 2022. Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
[68] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, man, et al. Laion-5b: An open large-scale dataset for train-
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning ing next generation image-text models. arXiv preprint
transferable visual models from natural language supervi- arXiv:2210.08402, 2022.
sion. In International conference on machine learning, pages [81] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas
8748–8763. PMLR, 2021. Geiger. Graf: Generative radiance fields for 3d-aware im-
[69] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya age synthesis. Advances in Neural Information Processing
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Systems, 33:20154–20166, 2020.
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning [82] Yujun Shen and Bolei Zhou. Closed-form factorization of
transferable visual models from natural language supervi- latent semantics in gans. arXiv preprint arXiv:2007.06600,
sion. In International conference on machine learning, pages 2020.
8748–8763. PMLR, 2021. [83] J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner,
[70] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation
and Mark Chen. Hierarchical text-conditional image gen- using triplane diffusion. arXiv preprint arXiv:2211.16677,
eration with clip latents. arXiv preprint arXiv:2204.06125, 2022.
2022. [84] J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner,
[71] Daniel Rebain, Mark Matthews, Kwang Moo Yi, Dmitry La- Jiajun Wu, and Gordon Wetzstein. 3d neural field generation
gun, and Andrea Tagliasacchi. Lolnerf: Learn from one look. using triplane diffusion, 2022.
In Proceedings of the IEEE/CVF Conference on Computer [85] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein.
Vision and Pattern Recognition, pages 1558–1567, 2022. Scene Representation Networks: Continuous 3D-Structure-
[72] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Aware Neural Scene Representations. Advances in Neural
Cohen-Or. Pivotal tuning for latent-based editing of real Information Processing Systems, pages 1119–1130, 2019.
images. ACM Transactions on Graphics (TOG), 42(1):1–13, [86] Ivan Skorokhodov, Sergey Tulyakov, Yiqun Wang, and Peter
2022. Wonka. Epigraf: Rethinking training of 3d gans. arXiv
[73] Robin Rombach, Andreas Blattmann, Dominik Lorenz, preprint arXiv:2206.10535, 2022.
Patrick Esser, and Björn Ommer. High-resolution image [87] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
synthesis with latent diffusion models, 2021. and Surya Ganguli. Deep unsupervised learning using
[74] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net nonequilibrium thermodynamics. In International Confer-
: Convolutional Networks for Biomedical Image Segmenta- ence on Machine Learning, pages 2256–2265. PMLR, 2015.
[88] Jiaming Song, Chenlin Meng, and Stefano Ermon. De- [104] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distill-
noising diffusion implicit models. arXiv preprint ing view-conditioned diffusion for 3d reconstruction, 2022.
arXiv:2010.02502, 2020. [105] Jiapeng Zhu, Ceyuan Yang, Yujun Shen, Zifan Shi, Deli
[89] Yang Song and Stefano Ermon. Generative modeling by Zhao, and Qifeng Chen. Linkgan: Linking gan latents
estimating gradients of the data distribution. Advances in to pixels for controllable image synthesis. arXiv preprint
Neural Information Processing Systems, 32, 2019. arXiv:2301.04604, 2023.
[90] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab-
hishek Kumar, Stefano Ermon, and Ben Poole. Score-based
generative modeling through stochastic differential equa-
tions. arXiv preprint arXiv:2011.13456, 2020.
[91] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue
Wang, and Yebin Liu. Ide-3d: Interactive disentangled edit-
ing for high-resolution 3d-aware portrait synthesis. arXiv
preprint arXiv:2205.15517, 2022.
[92] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and
Daniel Cohen-Or. Designing an encoder for stylegan image
manipulation. arXiv preprint arXiv:2102.02766, 2021.
[93] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh,
and Greg Shakhnarovich. Score jacobian chaining: Lifting
pretrained 2d diffusion models for 3d generation, 2022.
[94] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin
Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang
Wen, Qifeng Chen, et al. Rodin: A generative model for
sculpting 3d digital avatars using diffusion. arXiv preprint
arXiv:2212.06135, 2022.
[95] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin
Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang
Wen, Qifeng Chen, and Baining Guo. Rodin: A generative
model for sculpting 3d digital avatars using diffusion, 2022.
[96] Daniel Watson, William Chan, Ricardo Martin-Brualla,
Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi.
Novel view synthesis with diffusion models. arXiv preprint
arXiv:2210.04628, 2022.
[97] Yinghao Xu, Sida Peng, Ceyuan Yang, Yujun Shen, and
Bolei Zhou. 3d-aware image synthesis via learning structural
and textural representations. 2021.
[98] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa.
pixelNeRF: Neural Radiance Fields from One or Few Im-
ages. IEEE Conference on Computer Vision and Pattern
Recognition, 2021.
[99] Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu,
Chunhua Shen, and Nong Sang. Bisenet v2: Bilateral net-
work with guided aggregation for real-time semantic segmen-
tation. International Journal of Computer Vision, 129:3051–
3068, 2021.
[100] Lvmin Zhang and Maneesh Agrawala. Adding conditional
control to text-to-image diffusion models, 2023.
[101] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin
Chen. An End-to-End Deep Learning Architecture for Graph
Classification. AAAI, 2018.
[102] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht-
man, and Oliver Wang. The Unreasonable Effectiveness of
Deep Features as a Perceptual Metric. IEEE Conference on
Computer Vision and Pattern Recognition, pages 586–595,
2018.
[103] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman,
and Oliver Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR, 2018.
Appendix Training We follow similar recipes [12] for training EG3D
models on four datasets. For FFHQ and ShapeNet, we train
A. Dataset Details EG3D from scratch with γ = 1 and γ = 0.3, respectively.
FFHQ contains 70k images of real human faces in reso- We first train FFHQ model at 64 × 64 resolution for 25M im-
lution of 10242 . We directly adopted the downsampled, re- ages, and another 2.5M images at 128 × 128. For ShapeNet,
aligned version provided by EG3D [12], which re-cropped we train both datasets with 10M images. AFHQ-cat is a much
the face and estimate the camera poses. smaller set, so we fine-tune the FFHQ checkpoint directly
at 128 × 128 with γ = 5 and data augmentation [39] for
AFHQ-cat contains in total 5K images of cat faces in res- 4.5M images. We additionally train an EG3D model on Cele-
olution of 5122 . The same as FFHQ, we directly download bAHQ for comparing on Seg-to-3D tasks. For this model,
the data with estimated camera poses. we fine-tune from the pre-trained FFHQ checkpoint with
cameras provided by [19]. Both human and cat face mod-
ShapeNet Cars & Chairs are standard benchmarks for els are trained with “generator pose conditioning (GPC)”.
single-image view synthesis [85]. We use the data modified Moreover, to encourage a smooth learned tri-plane space,
by pixelNeRF [98] † . The chairs dataset consists of 6591 we apply an additional regularization over the L2 norm over
scenes, and the cars dataset has 3514 scenes, both with a the tri-plane with weight λ = 1 for all experiments. We use
predefined train/val/test split. Each training scene contains a batch size of 32 on 8 NIVIDA A100 GPUs, and training
50 posed images taken from random points on a sphere. approximately takes 3 days for 25M images.
Inference The trained EG3D models are used in both 3D diffusion training and inference. More precisely, we keep the neural renderer (NeRF MLPs + 2D upsamplers; see Fig. 1 for an illustration) as the final stage of the tri-plane diffusion, which renders the denoised tri-plane into images given the camera input. To make sure the rendering depends solely on the tri-plane and the viewing directions, we adopt the center camera for GPC and feed the EMA style vector w_avg as well as constant noise to the upsamplers. We did not notice any quality difference caused by replacing the per-sample style vectors with this average vector.

B.2. 3D Diffusion Settings

Unconditional Model We use the improved UNet-based architecture [74, 20] for all of our main experiments on tri-plane diffusion. During the exploration stage, we also tried different architectures such as Transformers [66]; however, we did not notice a significant difference in generation quality and therefore keep the UNet as the basic backbone. Since the tri-plane size is fixed across datasets, we apply exactly the same architecture and hyperparameters in all experiments. Our initial experiments showed that predicting the noise ε (the default setting suggested by DDPM [31]) or the velocity v [79] tends to produce noticeable high-frequency artifacts on the generated tri-planes. We suspect this is because the tri-plane space is naturally noisier than images, and all our models are instead trained with the signal (z₀) prediction presented in Eq. (2), with ω_t = sigmoid(log(α_t²/σ_t²)).
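The weighted signal-prediction objective above can be written as a short training-step sketch. This is a schematic re-implementation that assumes a variance-preserving cosine schedule and a denoiser predicting the clean tri-plane z₀; it is not the exact training code.

```python
import torch

def z0_prediction_loss(denoiser, z0, cond=None):
    """Signal-prediction loss weighted by w_t = sigmoid(log(alpha_t^2 / sigma_t^2))."""
    B = z0.shape[0]
    t = torch.rand(B, device=z0.device)                     # continuous time in (0, 1)
    view = (-1,) + (1,) * (z0.dim() - 1)
    alpha = torch.cos(0.5 * torch.pi * t).view(view)        # cosine VP schedule
    sigma = torch.sin(0.5 * torch.pi * t).view(view)

    noise = torch.randn_like(z0)
    z_t = alpha * z0 + sigma * noise                        # diffused tri-plane
    z0_hat = denoiser(z_t, t, cond)                         # network predicts the clean signal

    log_snr = torch.log(alpha.pow(2) / sigma.pow(2).clamp_min(1e-8))
    w_t = torch.sigmoid(log_snr)                            # down-weights very noisy timesteps
    return (w_t * (z0_hat - z0) ** 2).mean()
```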
Conditional Model The main settings of the conditional diffusion models are identical to the unconditional ones, except for the interaction module with the conditioning input. For tasks such as 3D inversion, 3D SR, Seg-to-3D, and Edge-to-3D, we transform the input into RGB images and resize the spatial resolution to 256 × 256. We then jointly train a UNet-based encoder that has the same number of layers and hidden dimensions as the denoiser. Note that, due to the use of self-attention layers [20], the UNet-based encoder is able to globally adjust the features even when the input images are not spatially aligned with the canonical tri-plane space. Additionally, similar to [96, 26], we include a cross-attention layer after each pair of self-attention outputs of the encoder and the denoiser to strengthen the conditional modeling. On the other hand, for both the Shape-to-3D and Text-to-3D (with CLIP) tasks, we do not train another encoder, but instead treat the conditioning as vectors that are linearly transformed and combined with the time embeddings.
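The encoder–denoiser interaction can be summarized by the sketch below, in which denoiser features attend to the encoder features produced at the matching resolution. The block granularity, feature dimensions, and head count are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttnFusion(nn.Module):
    """Denoiser features attend to encoder features at one resolution level."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, denoiser_feat, encoder_feat):
        # Both inputs are (B, C, H, W) feature maps taken after self-attention.
        B, C, H, W = denoiser_feat.shape
        q = self.norm_q(denoiser_feat.flatten(2).transpose(1, 2))   # (B, H*W, C)
        kv = self.norm_kv(encoder_feat.flatten(2).transpose(1, 2))  # (B, H'*W', C)
        out, _ = self.attn(q, kv, kv)
        # Residual connection back into the denoiser branch.
        return denoiser_feat + out.transpose(1, 2).reshape(B, C, H, W)
```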
Training We adopt the same training scheme for all of our diffusion experiments, both unconditional and conditional: the AdamW [53] optimizer with a learning rate of 2e−5 and an EMA decay rate of 0.9999. To encourage the high-resolution denoiser to learn sufficiently from noisy tri-planes, we adopt a shifted cosine schedule (256² → 64²) inspired by [34]. We train all models with a batch size of 32 for 500K iterations on 8 NVIDIA A100 GPUs.
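The resolution shift amounts to lowering the log signal-to-noise ratio of the standard cosine schedule, so that the 256² tri-planes are trained at relatively higher noise levels. The sketch below follows the shifted-schedule formulation of [34] with a 64² reference resolution; the exact constants used in our training may differ.

```python
import math
import torch

def shifted_cosine_logsnr(t, train_res=256, base_res=64):
    """log-SNR of a cosine schedule shifted for higher-resolution inputs (t in (0, 1))."""
    logsnr = -2.0 * torch.log(torch.tan(0.5 * math.pi * t))   # standard cosine schedule
    shift = 2.0 * math.log(base_res / train_res)              # negative when train_res > base_res
    return logsnr + shift

def alpha_sigma(logsnr):
    # Variance-preserving parameterization: alpha^2 + sigma^2 = 1.
    return torch.sqrt(torch.sigmoid(logsnr)), torch.sqrt(torch.sigmoid(-logsnr))
```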
Conditioning camera As pointed out in § 3.2, it is critical to train conditional diffusion models with balanced camera poses, whereas the camera viewpoints of natural images (e.g., FFHQ, AFHQ) are typically biased toward the center view. Unlike training 3D GANs, where matching the camera distribution is important for learning the 3D space, we find it crucial to use an unbiased input camera distribution once the 3D space has already been learned; otherwise, the performance of conditional generation degenerates heavily when the input image is not center-aligned. Therefore, for human and cat faces, we re-sample input cameras that look at the origin and are distributed uniformly. To simulate errors in camera prediction, we augment the intrinsic matrix (focal lengths fx, fy) with random Gaussian noise. We do not perform resampling for ShapeNet and directly use the training-set cameras, as they already cover all viewpoints uniformly.
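A minimal sketch of this camera re-sampling and intrinsics jitter is shown below. The pose ranges, noise scale, and the `look_at_origin` helper that converts yaw/pitch into a 4 × 4 extrinsic matrix are hypothetical values introduced only for illustration.

```python
import numpy as np

def sample_conditioning_camera(rng, focal=4.26, jitter_std=0.02,
                               yaw_range=(-0.6, 0.6), pitch_range=(-0.3, 0.3)):
    """Sample an unbiased camera looking at the origin, with noisy intrinsics."""
    yaw = rng.uniform(*yaw_range)      # radians; uniform instead of the frontal bias of FFHQ
    pitch = rng.uniform(*pitch_range)
    extrinsic = look_at_origin(yaw, pitch, radius=2.7)   # hypothetical helper -> 4x4 pose

    # Normalized intrinsics; Gaussian jitter on (fx, fy) simulates camera-prediction error.
    fx = focal * (1.0 + jitter_std * rng.standard_normal())
    fy = focal * (1.0 + jitter_std * rng.standard_normal())
    intrinsic = np.array([[fx, 0.0, 0.5],
                          [0.0, fy, 0.5],
                          [0.0, 0.0, 1.0]])
    return extrinsic, intrinsic
```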
Sampling Due to the requirement of a proper score function ℓ(·, ·), we only explore guided diffusion for 3D inversion and super-resolution, while for the remaining tasks we use the standard sampling strategy. No classifier-free guidance [32] is applied. By default, standard ancestral sampling [31] takes 250 denoising steps in all experiments. For 3D inversion, we choose ℓ(·, ·) to be the VGG loss [101] with w_t = 7e5 · σ_t in Eq. (7); we notice that a large, decreasing weight is essential for the guidance to be effective. For super-resolution, we use exactly the same objective for guidance, but the loss is computed after down-sampling the rendered image to the input resolution. For cases using Langevin correction, we additionally apply 10 correction steps as described in Eq. (8) with δ = 0.25, and we only add Langevin steps to the first 50 denoising steps to save computational cost. The Langevin correction steps are particularly useful for unconditional models.
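The guided ancestral sampler with optional Langevin correction is sketched below. It is a schematic under the notation above (250 ancestral steps, guidance weight w_t = 7e5 · σ_t, 10 Langevin steps with δ = 0.25 on the first 50 iterations only); the `schedule`, `render`, and `perceptual_loss` objects are placeholders rather than the exact implementation of Eqs. (7) and (8).

```python
import torch

@torch.no_grad()
def guided_sampling(denoiser, render, perceptual_loss, target, camera, schedule,
                    steps=250, langevin_until=50, langevin_steps=10, delta=0.25):
    z = torch.randn(schedule.shape, device=target.device)    # initial tri-plane noise
    for i, (t, t_prev) in enumerate(schedule.step_pairs(steps)):
        sigma_t = schedule.sigma(t)

        # Guidance: nudge the predicted clean tri-plane toward the observation.
        with torch.enable_grad():
            z_in = z.detach().requires_grad_(True)
            z0_hat = denoiser(z_in, t)
            loss = perceptual_loss(render(z0_hat, camera), target)
            grad = torch.autograd.grad(loss, z_in)[0]
        z0_hat = z0_hat - 7e5 * sigma_t * grad                # w_t = 7e5 * sigma_t

        z = schedule.ancestral_step(z, z0_hat, t, t_prev)     # DDPM-style update

        if i < langevin_until:                                # Langevin correction (early steps)
            for _ in range(langevin_steps):
                score = schedule.score_from_z0(denoiser(z, t_prev), z, t_prev)
                z = z + 0.5 * delta * score + delta ** 0.5 * torch.randn_like(z)
    return z
```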
B.3. Application Details

Image-to-3D Inversion For this task, we independently and randomly select 1,000 images from each of the FFHQ and AFHQ datasets, with the results presented in the main paper. To enhance the experimental rigor, we additionally choose 1,000 random images from the test set of the CelebA-HQ dataset and follow the EG3D methodology to re-crop the faces and estimate the camera poses. The results on CelebA-HQ are presented in Table 3. We select 5 camera poses with yaw angles of -35°, -17°, 0°, 17°, and 35° and a roll angle of 0° to generate novel-view images. The generated images are used to compute the Fréchet Inception Distance against the original dataset (nvFID) and the ID metric with respect to the input image (nvID).

Seg-to-3D Following the recent work Pix2Pix3D [19], in the Seg2Face setting we randomly select 500 images from the CelebA-HQ dataset together with their segmentation maps and generate 10 images per input label using different random seeds. Subsequently, we predict the segmentation map of each generated image using a pretrained face-parsing network [99]. In the Seg2Cat task, we employ a similar setting; the main distinction lies in the segmentation prediction, where we use the labels from Pix2Pix3D to train the parsing network and then apply it to predict labels from the generated images. We evaluate the performance by calculating the mean Intersection over Union (MIoU) and the average pixel accuracy (MPA) between the input labels and the labels predicted from the generated images. The Fréchet Inception Distance (FID) is computed between the generated images and all images in the CelebA-HQ dataset. Single Generation Diversity (SG Diversity) is obtained by measuring the LPIPS metric between each pair of generated images given a single conditional input.
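For reference, the label-consistency metrics can be computed as below. This is a generic sketch over integer label maps, assuming the predicted and input segmentation maps share the same label set; it is not tied to a specific parsing network.

```python
import numpy as np

def miou_and_mpa(pred, gt, num_classes):
    """Mean IoU and average pixel accuracy between predicted and input label maps."""
    ious, accs = [], []
    for c in range(num_classes):
        pred_c, gt_c = (pred == c), (gt == c)
        if gt_c.sum() == 0 and pred_c.sum() == 0:
            continue                                  # class absent in both maps
        inter = np.logical_and(pred_c, gt_c).sum()
        union = np.logical_or(pred_c, gt_c).sum()
        ious.append(inter / union)
        if gt_c.sum() > 0:
            accs.append(inter / gt_c.sum())           # per-class pixel accuracy
    return float(np.mean(ious)), float(np.mean(accs))
```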

Edge-to-3D We extract the edges for all datasets using Informative Drawings [10]§.

§ https://github.com/carolineec/informative-drawings.git

Shape-to-3D We employ the FLAME template model [50] to represent facial shapes and utilize DECA [21] to extract the corresponding FLAME parameters.

Text-to-3D For this task, we utilize CLIP [69] to extract image and text features. During the training phase we employ the image features, while at test time we directly use the text features. While it is commonly known that the text and image spaces of CLIP are not fully aligned [70], we find the conditioning is effective as long as both features are normalized before diffusion.
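Since only the joint normalization matters here, the conditioning pipeline reduces to a few lines. The sketch below assumes the open-source CLIP package and the ViT-B/32 checkpoint purely for illustration; any CLIP variant with matched image and text encoders would work the same way.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)    # assumed CLIP variant

def clip_condition_from_image(pil_image):
    with torch.no_grad():
        feat = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device))
    return feat / feat.norm(dim=-1, keepdim=True)            # normalize before diffusion

def clip_condition_from_text(prompt):
    with torch.no_grad():
        feat = model.encode_text(clip.tokenize([prompt]).to(device))
    return feat / feat.norm(dim=-1, keepdim=True)            # same normalization at test time
```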
Table 3: Quantitative comparison on inversion (CelebA-HQ).

             PSNR ↑   SSIM ↑   LPIPS ↓   ID ↑   nvFID ↓   nvID ↑
  Opt. W     14.98    0.65     0.42      0.54   60.67     0.50
  Opt. W+    16.62    0.71     0.34      0.74   51.23     0.66
  Opt. Tri.  17.52    0.76     0.12      0.92   185.6     0.50
  Pred. W    14.55    0.59     0.54      0.28   68.66     0.26
  Ours       21.86    0.78     0.26      0.82   27.76     0.72

Figure 10: One failure case for conditional generation tasks on ShapeNet Chairs. While Control3Diff is always able to generate high-fidelity 3D objects, it sometimes fails to recover the texture information from the input view even with guided diffusion.

B.4. Baseline Details

GAN Inversion Our primary focus is to compare our approach with prevalent 2D GAN inversion methods, such as the direct optimization scheme introduced by [43], which inverts real images into the W space. Additionally, we examine a related method that extends to the W+ space [1], as well as direct optimization of the tri-plane, denoted as Tri. The implementation is based on EG3D-projector¶. We initialize all methods with the average w derived from the dataset. For the optimization process, we employ the LPIPS loss [102] and the Adam optimizer [45], conducting 400 optimization steps for each image. Additionally, we utilize the encoder proposed by [47] to directly estimate the w values from images, for which we employ their pretrained model.

¶ https://github.com/oneThousand1000/EG3D-projector
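The optimization-based baselines follow the standard projection recipe. A condensed sketch is given below, assuming the `lpips` package for the perceptual loss and a differentiable EG3D-style `render(w, camera)`; the actual baseline uses the EG3D-projector code referenced above.

```python
import torch
import lpips

def invert_w(render, target, camera, w_avg, steps=400, lr=0.01):
    """Optimization-based inversion into the W space (W+ / tri-plane variants are analogous)."""
    percep = lpips.LPIPS(net="vgg").to(target.device)
    w = w_avg.clone().requires_grad_(True)       # initialize from the dataset-average w
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        image = render(w, camera)                # differentiable EG3D-style rendering
        loss = percep(image, target).mean()      # LPIPS reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```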
Pix2Pix3D [19] We directly utilize the pretrained checkpoints provided by the authors||.

|| https://github.com/dunbar12138/pix2pix3D

Pix2NeRF [9] We use the values provided by the authors for our analysis. However, due to the absence of released models and complete quantitative results, our comparison is limited to the ShapeNet chair dataset.

C. Additional Quantitative Results

Inversion on ShapeNet We include additional quantitative results for ShapeNet Cars & Chairs in Table 4. For both datasets, we follow the standard evaluation protocol, which takes a fixed input view (typically view 64) as the input control and renders from all other cameras; evaluation is conducted on the test sets. As shown in Table 4, the proposed approach significantly improves over the existing 3D-GAN inversion baselines and achieves strong results on perceptual metrics such as LPIPS and FID, but it has a clear gap compared to PixelNeRF in terms of PSNR. The primary reason for this discrepancy is that PixelNeRF utilizes multi-view supervision during training, whereas our method relies solely on single-view information. Consequently, PixelNeRF can achieve improved performance in certain aspects. In contrast, our GAN-based approach demonstrates both enhanced 3D consistency and sharper outputs, which contribute to the lower FID and LPIPS scores.

Additional Results on CelebA-HQ To fully validate the generality of the proposed method, we conduct additional 3D inversion experiments on out-of-distribution (OOD) face data. As shown in Table 3, we directly apply the model trained on the FFHQ tri-plane space to CelebA-HQ and report the single-view inversion performance. Although tested OOD, the proposed Control3Diff performs stably and achieves larger gains over the standard inversion baselines.

D. Additional Qualitative Results

3D Inversion & SR We show additional qualitative results of Control3Diff across datasets for both the 3D inversion (Figs. 11 to 13) and super-resolution (Fig. 14) applications.

Seg-to-3D Editing Fig. 16 presents an application of our method that supports progressive 3D editing based on 2D segmentation maps.

Text-to-3D Editing The conditional diffusion of Control3Diff also supports interactive editing given a text prompt, as demonstrated in Fig. 17.

Shape-to-3D Fig. 15 presents qualitative results on this task, demonstrating that the generated images semantically preserve the identity. However, the color exhibits constant fluctuations: the current control mechanism is unable to effectively disentangle factors such as lighting.
Figure 11: Qualitative results on 3D inversion for ShapeNet Cars and Chairs.
Table 4: Quantitative comparison on single-image view synthesis on ShapeNet. *Models have multi-view supervision during training, while our methods, including the standard optimization-based 3D GAN-inversion baselines, are trained with single-view information only.

(a) ShapeNet-Cars

                     PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓
  PixelNeRF [98]*    23.17    0.89     0.146     59.24
  3DiM [96]*         21.01    0.57     –         8.99
  Opt. W             17.89    0.85     0.124     33.15
  Opt. W+            19.23    0.86     0.106     17.95
  Opt. Tri.          14.85    0.63     0.461     319.8
  Ours               20.16    0.89     0.090     9.76

(b) ShapeNet-Chairs

                     PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓
  PixelNeRF [98]*    23.72    0.90     0.128     38.49
  3DiM [96]*         17.05    0.53     –         6.57
  Pix2NeRF [9]       18.14    0.84     –         14.31
  Opt. W             18.28    0.86     0.110     10.96
  Opt. W+            19.30    0.87     0.099     12.70
  Opt. Tri.          14.11    0.64     0.412     237.4
  Ours               21.13    0.89     0.090     8.86

E. Limitations and Future Work

Our method has two major limitations. First, while learning from the latent space of GANs allows us to effectively learn controllable diffusion models for 3D, it also inherits the drawbacks that GANs commonly have. For instance, a common artifact of adversarial training is mode collapse in the learned space, which in turn affects the 3D diffusion learning in that it may not cover the full data space. In our experiments, we particularly noticed this collapsing effect on synthetic datasets with complex geometries such as ShapeNet (see Fig. 10). As future work, this issue could potentially be eased by jointly training the diffusion prior with the 3D GAN and including an additional image reconstruction loss. Second, compared to pure encoder-based approaches [51], the iterative nature of diffusion models generally leads to a slower generation process. However, our method can be easily integrated with existing work on speeding up diffusion models [79]. We leave this exploration as future work.

Figure 12: Qualitative results on 3D inversion for AFHQ-cat. The input images are randomly sampled from the AFHQ training set.
Figure 13: Qualitative results on 3D inversion for FFHQ. Due to concerns about individual consent, all the input faces are synthesized
and manually selected from a pre-trained StyleGAN3 [40] checkpoint. We perform exactly the same pre-processing procedure as EG3D [12]
over these synthetic images, which re-centers the faces and estimates the camera positions.
Figure 14: Qualitative results on 3D super-resolution tasks for AFHQ-cat and FFHQ.

Figure 15: Qualitative results on Shape-to-3D for FFHQ. The generated images semantically preserve the identity; however, the color exhibits constant fluctuations, as the current control mechanism cannot effectively disentangle factors such as lighting.
Figure 16: Progressive editing of Seg-to-3D synthesis. The input seg-maps are interactively edited. To achieve that, we fix the initial tri-plane
noise and use DDIM [88] to obtain diffusion samples.
Figure 17: Progressive editing of Text-to-3D synthesis. The text prompts are first transformed into normalized CLIP embeddings, on which the diffusion model directly conditions. To achieve this, we fix the initial tri-plane noise and use DDIM [88] to obtain diffusion samples.
