Control3Diff
Jiatao Gu†, Qingzhe Gao§, Shuangfei Zhai†, Baoquan Chen¶, Lingjie Liu‡, Josh Susskind†
† Apple   ‡ University of Pennsylvania   § Shandong University   ¶ Peking University
† {jgu32,szhai,jsusskind}@apple.com   ‡ [email protected]
§ [email protected]   ¶ [email protected]
Figure 1: Left: the generation process, in which a diffusion model samples a tri-plane that a neural renderer uses for image rendering. Right: examples of controllable generation given various conditioning inputs, showing frontal and side views generated by Control3Diff. Due to concerns about individual consent, all faces shown are model-generated and are not real identities, except for the input in (a).
Abstract

Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulty of acquiring 3D ground-truth data for training. On the other hand, 3D GANs, which integrate implicit 3D representations into GANs, have shown remarkable 3D-aware generation when trained only on single-view image datasets. However, 3D GANs do not provide straightforward ways to precisely control image synthesis. To address these challenges, we present Control3Diff, a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis on single-view datasets. Control3Diff explicitly models the underlying latent distribution (optionally conditioned on external inputs), thus enabling direct control during the diffusion process. Moreover, our approach is general and applicable to any type of controlling input, allowing us to train it with the same diffusion objective without any auxiliary supervision. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various conditioning inputs such as images, sketches, and text prompts. Please see the project website (https://jiataogu.me/control3diff) for video comparisons.

1. Introduction

The synthesis of photo-realistic 3D-aware images of real-world scenes from sparse controlling inputs is a long-standing problem in both computer vision and computer graphics, with various applications including robotics simulation, gaming, and virtual reality. Depending on the task, sparse inputs can be single-view images, guiding poses, or text instructions, and the objective is to recover 3D representations and synthesize consistent images from novel viewpoints. This is a challenging problem, as the sparse inputs typically contain insufficient information to predict complete 3D details. Consequently, the selection of an appropriate prior during controllable generation is crucial for resolving uncertainties. Recently, significant progress has been made in the field of 2D image generation through the use of diffusion-based generative models [87, 31, 89, 20], which learn such a prior and have achieved remarkable success in various conditional applications such as super-resolution [77, 49, 27], in-painting [54], image translation [75, 100], and text-guided synthesis [70, 73, 76, 30]. It is natural to consider applying similar approaches to 3D generation. However, learning diffusion models typically relies heavily on the availability of ground-truth data, which is not commonly available for 3D content, especially for single-view images.

To address this limitation, we propose a framework called Control3Diff, which links diffusion models to generative adversarial networks (GANs) [23] and takes advantage of the success of GANs in 3D-aware image synthesis [81, 11, 62, 25, 12, 64, 86]. The core idea behind 3D GANs is to learn a generator based on neural fields that fuses 3D inductive bias into the model via volume rendering. By training 3D GANs on single-view data with random noises and viewpoints as inputs, we can avoid the need for 3D ground truth. Our proposed framework, Control3Diff, predicts the internal states of 3D GANs given any conditioning inputs by modeling the prior distribution of the underlying manifolds of real data using diffusion models. Furthermore, the proposed framework can be trained on synthetic generations from a 3D GAN, allowing infinite examples to be used for training without worrying about over-fitting. Finally, by applying the guidance techniques [20, 32] from 2D diffusion models, we are able to learn controllable 3D generation with a single loss function for all conditional tasks. This eliminates the ad-hoc supervisions and constraints that were commonly needed in existing conditional 3D generation [9, 19].

To validate the proposed framework, we use a variant of the recently proposed EG3D [12], which learns an efficient tri-plane representation, as the basis for Control3Diff. We conduct extensive experiments on six types of inputs and demonstrate the effectiveness of Control3Diff on standard benchmarks including FFHQ, AFHQ-cat, and ShapeNet.

2. Preliminaries: Controllable Image Synthesis

In this section, we first define the problem of controllable image synthesis in 2D and 3D-aware manners and review the 2D solutions based on diffusion models. We then pose the difficulties of applying similar methods to the 3D scenario.

2.1. Definition

2D. The goal of controllable synthesis is to learn a generative model that synthesizes diverse 2D images x conditioned on an input control signal c. This can mainly be done by sampling images in one of the following two ways:

    x ∼ exp[−ℓ(c, x)] · p_θ(x)   or   x ∼ p_θ(x | c),   (1)

where θ denotes the parameters of the generative model. The former is called guidance: at test time, an energy function ℓ(c, x) measures the alignment between the synthesized image x and the input c and is used to guide the prior generation p_θ(x). Note that guidance techniques can only be applied to controllable tasks for which the energy function ℓ(c, x) can be defined. The latter directly formulates the task as a conditional generation problem p_θ(x | c) when paired data (x, c) is available. As c typically contains less information than x, it is crucial to handle uncertainties with generative models.

3D. The above formulation can be simply extended to 3D. In this work, we assume a 3D scene represented by a latent representation z, and we synthesize 3D-consistent images by rendering x = R(z, π) given different camera poses π. Here, we do not restrict the space of z, meaning that it can be any high-dimensional structure that describes the 3D scene. Similarly, we can define 3D-aware controllable image synthesis by replacing x with z in Eq. (1).
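To make the two modes in Eq. (1) concrete, the sketch below contrasts one guided update with one conditional update for a generic score-based sampler. This is a minimal illustration rather than the procedure used by Control3Diff: `prior_score`, `energy`, and `cond_score` are hypothetical stand-ins for a learned unconditional score ∇ log p_θ(x), a task-specific energy ℓ(c, x), and a conditional score model.

```python
import torch

def guided_step(x, c, prior_score, energy, step=1e-2):
    # Guidance (left option of Eq. 1): follow the prior score plus the gradient
    # of log exp[-l(c, x)], i.e. minus the gradient of the energy.
    x = x.detach().requires_grad_(True)
    grad_e = torch.autograd.grad(energy(c, x).sum(), x)[0]
    with torch.no_grad():
        score = prior_score(x) - grad_e               # score of exp[-l(c,x)] * p(x)
        return x + step * score + (2 * step) ** 0.5 * torch.randn_like(x)

def conditional_step(x, c, cond_score, step=1e-2):
    # Conditional generation (right option of Eq. 1): the model takes c as input.
    with torch.no_grad():
        return x + step * cond_score(x, c) + (2 * step) ** 0.5 * torch.randn_like(x)
```

Both functions perform one Langevin-style update; in practice many such steps (or a full diffusion sampler, as in § 2.2) would be chained together.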
2.2. Diffusion Models

Standard diffusion models [87, 89, 31] are explicit generative models defined by a Markovian process. Given an image x, a diffusion model defines continuous-time latent variables {z_t | t ∈ [0, 1], z_0 = x} based on a fixed schedule {α_t, σ_t}:

    q(z_t | z_s) = N(z_t; α_{t|s} z_s, σ²_{t|s} I),   0 ≤ s < t ≤ 1,

where α_{t|s} = α_t / α_s and σ²_{t|s} = σ²_t − α²_{t|s} σ²_s. Following this definition, we can easily derive the latent z_t at any time via q(z_t | z_0) = N(z_t; α_t z_0, σ²_t I). The model θ then learns the reverse process by denoising z_t to the clean target x (= z_0) with a weighted reconstruction loss:

    L_Diff = E_{z_t ∼ q(z_t|z_0), t ∼ [0,1]} [ ω_t · ‖z_θ(z_t) − z_0‖²₂ ].   (2)

Typically, θ is parameterized as a U-Net [74, 31] or a ViT [66]. Sampling from a learned model p_θ can be performed using ancestral sampling rules [31]: starting from pure Gaussian noise z_1 ∼ N(0, I), we sample s, t following a uniformly spaced sequence from 1 to 0:

    z_s = α_s z_θ(z_t) + √(σ²_s − σ̄²) ε_θ(z_t) + σ̄ ε,   ε ∼ N(0, I),   (3)

where σ̄ = σ_s σ_{t|s} / σ_t and ε_θ(z_t) = (z_t − α_t z_θ(z_t)) / σ_t. By decomposing the sophisticated generative process into
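As a concrete reference for the forward process and the ancestral sampling rule of Eq. (3), the following sketch implements both for a generic denoiser z_θ that predicts z_0 from z_t. The cosine schedule is an illustrative assumption, not necessarily the schedule used in the paper, and `denoiser` is a placeholder network.

```python
import math
import torch

def schedule(t):
    # Illustrative cosine schedule with alpha_t^2 + sigma_t^2 = 1; t is a tensor in [0, 1].
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

def q_sample(z0, t):
    # Forward process: q(z_t | z_0) = N(alpha_t z_0, sigma_t^2 I).
    alpha, sigma = schedule(t)
    return alpha * z0 + sigma * torch.randn_like(z0)

@torch.no_grad()
def ancestral_step(denoiser, zt, t, s):
    # One step of Eq. (3), moving from time t to an earlier time s < t.
    alpha_t, sigma_t = schedule(t)
    alpha_s, sigma_s = schedule(s)
    sigma_ts = (sigma_t**2 - (alpha_t / alpha_s) ** 2 * sigma_s**2).clamp(min=0).sqrt()
    z0_hat = denoiser(zt, t)                        # z_theta(z_t): predicted clean sample
    eps_hat = (zt - alpha_t * z0_hat) / sigma_t     # eps_theta(z_t)
    sigma_bar = sigma_s * sigma_ts / sigma_t
    return (alpha_s * z0_hat
            + (sigma_s**2 - sigma_bar**2).clamp(min=0).sqrt() * eps_hat
            + sigma_bar * torch.randn_like(zt))
```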
Figure: panels contrast an Auto-Encoder (reconstruction; likely collapsed unless heavily regularized), an Auto-Decoder (reconstruction with prior constraints), and a 3D-GAN (real/fake supervision), all given only single-view information.

3. As discussed in § 2.1, we need either an energy function ℓ(·, ·) for guidance or paired data for conditional generation in controllable synthesis. However, neither is straightforward to define in the latent space of implicit 3D representations (e.g., NeRF).

More precisely, targeting Challenge #1, prior arts [7, 83, 60, 94] first reconstruct the latent z from dense multi-view images of each scene. In the rest of the paper, we refer to them as reconstruction-based methods. That is, given a set of posed

Figure 3: Pipeline of Control3Diff. (a) 3D GAN training; (b) the diffusion model trained on the extracted tri-planes, with or without input conditioning; (c) controllable 3D generation with the learned diffusion model, optionally with guidance. The tri-plane features are shown as three color planes, and the camera poses are omitted for visual clarity.
planes, which are the input to the radiance function f_z for radiance and density prediction.

Training an EG3D model requires the joint optimization of a camera-conditioned discriminator D, and we adopt the non-saturating logistic objective with R1 regularization:

    L_GAN = E_{u ∼ N(0,I), π ∼ Π} [ h(D(R(G(u), π), π)) ] + E_{x, π ∼ data} [ h(−D(x, π)) + γ ‖∇_x D(x, π)‖²₂ ],   (5)

where h(u) = −log(1 + exp(−u)) and Π is the prior camera distribution. Adversarial learning enables training on single-view images, as it only forces the model output to match the training data distribution rather than learning a one-to-one mapping as an auto-encoder does. Note that, to make the subsequent diffusion training more stable, besides Eq. (5) we also bound G(u) with tanh(·) and apply an additional L2 loss similar to [83] when training EG3D. However, we observed in our experiments that these additional constraints do not affect the performance of EG3D.
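For reference, a minimal sketch of one alternating update under the objective in Eq. (5), written in its standard non-saturating softplus form with the R1 penalty, is shown below. `G`, `D`, and `render` are placeholders for the EG3D generator, the camera-conditioned discriminator, and the renderer R; the tanh bound on G(u) mentioned above is included, while the auxiliary L2 term is omitted.

```python
import torch
import torch.nn.functional as F

def d_loss(G, D, real_img, real_pose, u, fake_pose, render, gamma=1.0):
    # Discriminator: non-saturating logistic loss + R1 gradient penalty on reals.
    with torch.no_grad():
        fake = render(torch.tanh(G(u)), fake_pose)      # bounded tri-planes -> image
    loss = F.softplus(D(fake, fake_pose)).mean() + F.softplus(-D(real_img, real_pose)).mean()

    real_img = real_img.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(D(real_img, real_pose).sum(), real_img, create_graph=True)
    return loss + gamma * grad.pow(2).sum(dim=[1, 2, 3]).mean()

def g_loss(G, D, u, fake_pose, render):
    # Generator: non-saturating logistic loss on rendered fakes.
    fake = render(torch.tanh(G(u)), fake_pose)
    return F.softplus(-D(fake, fake_pose)).mean()
```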
After EG3D is trained, as the second stage, we apply denoising on the tri-planes to train a diffusion model with the renderer R frozen. Training follows the same denoising objective as Eq. (2), with z_0 = G(u). As u is randomly sampled, we can essentially learn from unlimited data. Different from [60, 94], we do not need any auxiliary loss or additional architectural change. Optionally, we can add the control signal as conditioning to the diffusion network to formulate a conditional generation framework for control (§ 3.2). We note that training a diffusion model over G(u) samples differs from distilling a pre-trained GAN into another GAN generator, which would be unsuitable for the controlling tasks.
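This second stage can be summarized by the short training step below: tri-planes z_0 = G(u) are sampled on the fly from the frozen generator and denoised with the objective of Eq. (2). The cosine schedule and ω_t = 1 are simplifying assumptions for illustration; `G` and `denoiser` are placeholders.

```python
import math
import torch

def triplane_diffusion_step(G, denoiser, opt, batch_size, latent_dim, device):
    # Sample "unlimited" training targets z0 = G(u) from the frozen 3D GAN.
    with torch.no_grad():
        u = torch.randn(batch_size, latent_dim, device=device)
        z0 = torch.tanh(G(u))                               # bounded tri-planes

    # Forward noising with an illustrative cosine schedule (alpha^2 + sigma^2 = 1).
    t = torch.rand(batch_size, device=device).view(-1, 1, 1, 1)
    alpha, sigma = torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)
    zt = alpha * z0 + sigma * torch.randn_like(z0)

    loss = (denoiser(zt, t) - z0).pow(2).mean()             # Eq. (2) with omega_t = 1
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```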
Although it is efficient to sample high-quality tri-planes z, GANs are implicit generative models [58] and do not model the likelihood in the learned latent space. That is to say, we do not have a proper prior p(z) over the latent representations of a 3D GAN. Especially in a high-dimensional space such as tri-planes, any control applied without knowing the underlying density will easily fall off the learned manifold and produce degenerate results. As a result, almost all existing works [9, 19] that use 3D GANs for control focus on low-dimensional spaces, which can be approximately assumed Gaussian. However, this sacrifices controllability. In contrast, diffusion models explicitly learn the score functions of the latent distribution even in high dimensions [89], which fills in the missing piece for 3D GANs. See also the experimental comparison in Table 1.

3.2. Conditional 3D Diffusion

We can synthesize controllable images by extending latent diffusion into a conditional generation framework. Conventionally, learning such conditional models requires labeled parallel corpora, e.g., large-scale text-image pairs [80] for the text-to-image task. Compared to acquiring 2D paired data, creating paired data of the control signal and the 3D representation is much more difficult. In our method, however, we can easily synthesize an infinite number of pairs of the control signal and tri-planes, by using the rendered images of the tri-plane from the 3D GAN to predict the control signal with an off-the-shelf method. Now, the learning objective
can be written as follows:

    L_Cond = E_{z_0, z_t, t, π} [ ω_t · ‖z_θ(z_t, A(R(z_0, π))) − z_0‖²₂ ],   (6)

where z_0 = G(u) is the sampled tri-plane, A is the off-the-shelf prediction module that converts rendered images into c (e.g., an "edge detector" for edge-map-to-3D generation), and π ∼ Π is a pre-defined camera distribution based on the testing preference. Here, z_θ represents a conditional denoiser that learns to predict denoised tri-planes given the condition. In early exploration, we noticed that the prior camera distribution Π significantly impacts the generalizability of the learned model: for some datasets (e.g., FFHQ, AFHQ), the biased camera distribution of the training set would cause degenerate results for rare camera views. Therefore, we specifically re-sample the cameras for these datasets.
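The sketch below spells out one training step of Eq. (6): a tri-plane is sampled from the frozen GAN, rendered under a camera drawn from Π, converted into a control signal by an off-the-shelf predictor A (e.g., an edge detector), and passed to the conditional denoiser. `G`, `render`, `predictor`, `denoiser`, and `sample_pose` are placeholders, and the simple schedule with ω_t = 1 follows the earlier sketches rather than the exact configuration of the paper.

```python
import math
import torch

def conditional_diffusion_step(G, render, predictor, denoiser, opt,
                               sample_pose, batch_size, latent_dim, device):
    with torch.no_grad():
        u = torch.randn(batch_size, latent_dim, device=device)
        z0 = torch.tanh(G(u))                      # sampled tri-plane z0 = G(u)
        pose = sample_pose(batch_size)             # pi ~ Pi (test-time camera prior)
        c = predictor(render(z0, pose))            # c = A(R(z0, pi)), e.g. an edge map

    t = torch.rand(batch_size, device=device).view(-1, 1, 1, 1)
    alpha, sigma = torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)
    zt = alpha * z0 + sigma * torch.randn_like(z0)

    loss = (denoiser(zt, t, c) - z0).pow(2).mean() # Eq. (6) with omega_t = 1
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```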
Langevin correction steps. While the 2D rendering guidance can provide a gradient for learning 3D representations, the optimization is often unstable due to the nonlinearity of the mapping from 2D to 3D. Our initial experiments showed that early guidance steps can get stuck in a local minimum with incorrect geometry prediction, which is hard to correct in the later denoising stages when the noise level decreases. Therefore, we adopt ideas similar to predictor-corrector samplers [90, 33] and include additional Langevin correction steps before the diffusion step (Eq. (3)):

    z_t ← z_t − ½ δ σ_t ε̂_θ(z_t) + √δ σ_t ε′,   ε′ ∼ N(0, I),   (8)

where δ is the step size and ε̂_θ is derived from ẑ_θ as in Eq. (7). According to Langevin MCMC [55], the additional steps help z_t match the marginal distribution at a given σ_t.
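A minimal sketch of the corrector update of Eq. (8) is given below. The (possibly guided) noise prediction ε̂_θ is abstracted into a callable, and the step size δ and the number of correction steps are illustrative defaults rather than tuned values.

```python
import torch

@torch.no_grad()
def langevin_correct(zt, t, eps_hat, sigma_t, delta=0.1, n_steps=2):
    # Eq. (8): z_t <- z_t - 0.5 * delta * sigma_t * eps_hat(z_t) + sqrt(delta) * sigma_t * noise
    for _ in range(n_steps):
        noise = torch.randn_like(zt)
        zt = zt - 0.5 * delta * sigma_t * eps_hat(zt, t) + delta ** 0.5 * sigma_t * noise
    return zt
```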
Figure 4: Comparison on 3D inversion of in-the-wild images. We compare the proposed approach with direct prediction of the GAN's latent W and of the tri-plane using a learned encoder, as well as with optimization-based approaches that infer the latent W, the expanded latent W+, or the tri-plane, following [2]. Our method achieves better view consistency and higher output image quality than the baselines.
Table 1: Quantitative comparison on inversion. Although optimizing the tri-plane can fit the input views well, it falls short in generating realistic novel-view images. Overall, our method achieves the best performance.

             FFHQ                                                      AFHQ-Cat
             PSNR↑   SSIM↑   LPIPS↓   ID↑    nvFID↓   nvKID↓   nvID↑   PSNR↑   SSIM↑   LPIPS↓   nvFID↓   nvKID↓
Opt. W       15.93   0.68    0.42     0.60   39.26    0.023    0.57    16.08   0.57    0.42      9.15    0.004
Opt. W+      17.91   0.73    0.34     0.74   38.23    0.022    0.68    18.32   0.62    0.35     10.54    0.006
Opt. Tri.    18.32   0.78    0.11     0.92   138.0    0.154    0.54    17.53   0.71    0.14     98.79    0.085
Pred. W      14.82   0.64    0.54     0.37   45.06    0.018    0.35    14.56   0.52    0.55     20.87    0.006
Ours         22.30   0.79    0.23     0.89   13.48    0.005    0.81    20.11   0.66    0.24      7.03    0.003
For faces, we further explore segmentation-to-3D, head-shape-to-3D, and text-description-to-3D tasks to validate controllability at various levels. To compare with previous work [19] on Seg-to-3D, we additionally train one model on CelebA-HQ [38]. Besides, we also report performance on unconditional generation with guidance in the ablation.

Evaluation Metrics. For image synthesis quality, we report standard metrics: PSNR, SSIM, SG diversity [14], LPIPS [103], KID, and FID [29]. For faces, we compute the cosine similarity of the facial embeddings produced by a facial recognition network for a given pair of faces and use it as the ID metric. For conditional generation tasks, following Pix2Pix3D [19], we evaluate methods using mean Intersection-over-Union (mIoU) and mean pixel accuracy (MPA) on segmentation maps.

Implementation Details. We implement all our models based on standard U-Net architectures [20]; for conditional diffusion models, a U-Net-based encoder is adopted to encode the input image, similar to [26] (see Fig. 3 (b)). We include the hyper-parameter details in the Appendix.

Table 2: Quantitative comparison on Seg2Face and Seg2Cat.

            Seg2Face                         Seg2Cat
metric      FID↓    SG↑    mIoU↑   MPA↑     FID↓    SG↑    mIoU↑   MPA↑
p2p3D       21.28   0.46   0.52    0.63     15.46   0.50   0.64    0.76
ours        12.85   0.43   0.61    0.72     11.66   0.47   0.67    0.79

Figure: qualitative comparison with columns LR Image, Input, Opt. W, Opt. W+, Opt. Tri-plane, and Ours.

Figure 6: Comparison on Seg-to-3D generation (columns: Input Seg Map, Output, Overlay, Novel views; rows: Pix2Pix3D, Ours). All faces are model-generated and are not real identities. Our proposed method generates images with better alignment to the segmentation map and greater 3D consistency.
Figure: columns show Input Edge Map, Output, Overlay, and Novel views.
Figure 8: Qualitative results on Text-to-3D synthesis for the given prompts: (a) A middle-aged woman with curly brown hair and pink lips; (b) A middle-aged man with a receding hairline, a thick beard, and hazel eyes; (c) A young woman with freckles on her cheeks and brown hair in a pixie cut; (d) a photography of Joker's face. All faces are model-generated and are not real identities.
We only add Langevin steps for the first 50 denoising steps.
Table 3: Quantitative comparison on inversion, CelebA-HQ.

            PSNR↑   SSIM↑   LPIPS↓   ID↑    nvFID↓   nvID↑
Opt. W      14.98   0.65    0.42     0.54   60.67    0.50
Opt. W+     16.62   0.71    0.34     0.74   51.23    0.66
Opt. Tri.   17.52   0.76    0.12     0.92   185.6    0.50
Pred. W     14.55   0.59    0.54     0.28   68.66    0.26
Ours        21.86   0.78    0.26     0.82   27.76    0.72

Figure 10: A failure case for conditional generation tasks on ShapeNet Chairs. While Control3Diff is always able to generate high-fidelity 3D objects, it sometimes fails to recover the texture information from the input view, even with guided diffusion.

B.4. Baseline Details

GAN Inversion. Our primary focus is to compare our approach with prevalent 2D GAN inversion methods, such as the direct optimization scheme introduced by [43], which inverts real images into the W space. Additionally, we examine a related method that extends to the W+ space [1], and we also directly optimize the tri-plane, denoted as Tri. The implementation is based on EG3D-projector. We initialize all methods with the average w derived from the dataset. For the optimization process, we employ the LPIPS loss [102] and the Adam optimizer [45], conducting 400 optimization steps per image. Additionally, we use the encoder proposed by [47] to directly estimate the w values from images, employing their pretrained model.
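For clarity, the optimization-based inversion baseline described above can be summarized as follows. The sketch initializes the latent from the dataset-average w and minimizes an LPIPS reconstruction loss with Adam for 400 steps; `G_synthesis` is a placeholder for the generator's rendering path, the `lpips` call follows that library's public API, and the hyper-parameters are only representative defaults.

```python
import torch
import lpips  # pip install lpips

def invert(G_synthesis, target, w_avg, steps=400, lr=0.01, device="cuda"):
    # target: (1, 3, H, W) image in [-1, 1]; w_avg: dataset-average latent.
    percept = lpips.LPIPS(net="vgg").to(device)
    w = w_avg.clone().to(device).requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        recon = G_synthesis(w)                     # render an image from the current latent
        loss = percept(recon, target.to(device)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```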
Pix2Pix3D [19]. We directly use the pretrained checkpoints provided by the authors.

Pix2NeRF [9]. We use the values provided by the authors for our analysis. However, due to the absence of released models and quantitative results, our comparison is limited to the ShapeNet chair dataset.

can achieve improved performance in certain aspects. In contrast, our GAN-based approach demonstrates both enhanced 3D consistency and sharper outputs, which contribute to the lower FID and LPIPS scores.

Additional Results on CelebA-HQ. To fully validate the generality of the proposed method, we conduct additional 3D inversion experiments on out-of-distribution (OOD) face data. As shown in Table 3, we directly apply the model trained on the FFHQ tri-plane space to CelebA-HQ and report the single-view inversion performance. Although tested OOD, the proposed Control3Diff performs stably and achieves larger gains over the standard inversion baselines.

D. Additional Qualitative Results

3D Inversion & SR. We show additional qualitative results of Control3Diff across datasets for both the 3D inversion (Figs. 11 to 13) and super-resolution (Fig. 14) applications.

Seg-to-3D Editing. Fig. 16 presents an application of our method that supports progressive 3D editing based on 2D segmentation maps.
(a) ShapeNet-Cars

                   PSNR↑   SSIM↑   LPIPS↓   FID↓
PixelNeRF [98]*    23.17   0.89    0.146    59.24
3DiM [96]*         21.01   0.57    –         8.99
Opt. W             17.89   0.85    0.124    33.15
Opt. W+            19.23   0.86    0.106    17.95
Opt. Tri.          14.85   0.63    0.461    319.8
Ours               21.13   0.89    0.090     8.86

(b) ShapeNet-Chairs

                   PSNR↑   SSIM↑   LPIPS↓   FID↓
PixelNeRF [98]*    23.72   0.90    0.128    38.49
3DiM [96]*         17.05   0.53    –         6.57
Pix2NeRF [9]       18.14   0.84    –        14.31
Opt. W             18.28   0.86    0.110    10.96
Opt. W+            19.30   0.87    0.099    12.70
Opt. Tri.          14.11   0.64    0.412    237.4
Ours               20.16   0.89    0.090     9.76
typically has mode collapse, which in turn affects the 3D diffusion learning in that it may not cover the full data space. In our experiments, we particularly noticed this collapsing effect on synthetic datasets with complex geometries such as ShapeNet (see Fig. 10). As future work, this issue could potentially be eased by jointly training the diffusion prior
Figure 12: Qualitative results on 3D inversion for AFHQ-cat. The input images are randomly sampled from the AFHQ training set.
Figure 15: Qualitative results on Shape-to-3D for FFHQ. The generated images semantically preserve the identity; however, the color fluctuates constantly, as the current control mechanisms are unable to effectively disentangle factors such as lighting.
Figure 16: Progressive editing of Seg-to-3D synthesis. The input seg-maps are interactively edited. To achieve that, we fix the initial tri-plane
noise and use DDIM [88] to obtain diffusion samples.
Figure 17: Progressive editing of Text-to-3D synthesis. The text prompts are first transformed into normalized CLIP embeddings, on which the diffusion model directly conditions. To achieve this, we fix the initial tri-plane noise and use DDIM [88] to obtain diffusion samples.
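To illustrate the editing setups of Figs. 16 and 17, the sketch below conditions a tri-plane denoiser on a normalized CLIP text embedding and runs a deterministic DDIM loop from a fixed initial noise, so that re-running with an edited prompt changes the sample smoothly. The `open_clip` calls follow that library's public API; `denoiser`, the cosine schedule, and the step count are illustrative placeholders rather than the configuration used in the paper.

```python
import math
import torch
import torch.nn.functional as F
import open_clip

@torch.no_grad()
def text_to_triplane(denoiser, prompt, shape, num_steps=50, device="cuda", seed=0):
    # Normalized CLIP text embedding used as the conditioning signal.
    clip_model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    clip_model = clip_model.to(device).eval()
    c = F.normalize(clip_model.encode_text(tokenizer([prompt]).to(device)), dim=-1)

    torch.manual_seed(seed)                        # fixed initial noise -> smooth edits
    z = torch.randn(shape, device=device)

    times = torch.linspace(1.0, 0.0, num_steps + 1).tolist()
    for t, s in zip(times[:-1], times[1:]):
        alpha_t, sigma_t = math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)
        alpha_s, sigma_s = math.cos(0.5 * math.pi * s), math.sin(0.5 * math.pi * s)
        t_batch = torch.full((shape[0],), t, device=device)
        z0_hat = denoiser(z, t_batch, c)           # predicted clean tri-plane
        eps_hat = (z - alpha_t * z0_hat) / sigma_t
        z = alpha_s * z0_hat + sigma_s * eps_hat   # deterministic DDIM update
    return z
```

Editing a prompt (or a seg-map, for Fig. 16) while keeping the seed fixed reuses the same initial noise, which is what makes the edits progressive rather than resampling an unrelated identity.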