Magic3D: High-Resolution Text-to-3D Content Creation (CVPR 2023)
Chen-Hsuan Lin∗ Jun Gao∗ Luming Tang∗ Towaki Takikawa∗ Xiaohui Zeng∗
Xun Huang Karsten Kreis Sanja Fidler† Ming-Yu Liu† Tsung-Yi Lin
NVIDIA Corporation
https://research.nvidia.com/labs/dir/magic3d
∗†: Equal contribution.

Abstract

DreamFusion [31] has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF) [23], achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image-space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate it with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2× faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.

1. Introduction

3D digital content has been in high demand for a variety of applications, including gaming, entertainment, architecture, and robotics simulation. It is slowly finding its way into virtually every possible domain: retail, online conferencing, virtual social presence, education, etc. However, creating professional 3D content is not for anyone: it requires immense artistic and aesthetic training with 3D modeling expertise. Developing these skill sets takes a significant amount of time and effort. Augmenting 3D content creation with natural language could considerably help democratize 3D content creation for novices and turbocharge expert artists.

Image content creation from text prompts [2, 28, 33, 36] has seen significant progress with the advances of diffusion models [13, 41, 42] for generative modeling of images. The key enablers are large-scale datasets comprising billions of samples (images with text) scraped from the Internet and massive amounts of compute. In contrast, 3D content generation has progressed at a much slower pace. Existing 3D object generation models [4, 9, 47] are mostly categorical: a trained model can only be used to synthesize objects for a single class, with early signs of scaling to multiple classes shown recently by Zeng et al. [47]. Therefore, what a user can do with these models is extremely limited and not yet ready for artistic creation. This limitation is largely due to the lack of diverse large-scale 3D datasets; compared to image and video content, 3D content is much less accessible on the Internet. This naturally raises the question of whether 3D generation capability can be achieved by leveraging powerful text-to-image generative models.

Recently, DreamFusion [31] demonstrated its remarkable ability for text-conditioned 3D content generation by utilizing a pre-trained text-to-image diffusion model [36] that generates images as a strong image prior. The diffusion model acts as a critic to optimize the underlying 3D representation. The optimization process ensures that rendered images from a 3D model, represented by Neural Radiance Fields (NeRF) [23], match the distribution of photorealistic images across different viewpoints, given the input text prompt. Since the supervision signal in DreamFusion operates on very low-resolution images (64 × 64), DreamFusion cannot synthesize high-frequency 3D geometric and texture details. Due to the use of inefficient MLP architectures for the NeRF representation, practical high-resolution synthesis may not even be possible, as the required memory footprint and the computation budget grow quickly with the resolution. Even at a resolution of 64 × 64, optimization times are measured in hours (1.5 hours per prompt on average using TPUv4).

In this paper, we present a method that can synthesize highly detailed 3D models from text prompts within a reduced computation time.
Figure 1 prompts: a silver platter piled high with fruits; an iguana holding a balloon; michelangelo style statue of a stuffed grey rabbit; an astronaut holding a pretend carrot.
Figure 1. Results and applications of Magic3D. Top: high-resolution text-to-3D generation. Magic3D can generate high-quality
and high-resolution 3D models from text prompts. Bottom: high-resolution prompt-based editing. Magic3D can edit 3D models by
fine-tuning with the diffusion prior using a different prompt. Taking the low-resolution 3D model as the input (left), Magic3D can modify
different parts of the 3D model corresponding to different input text prompts. Together with various creative controls on the generated 3D
models, Magic3D is a convenient tool for augmenting 3D content creation.
Specifically, we propose a coarse-to-fine optimization approach that uses multiple diffusion priors at different resolutions to optimize the 3D representation, enabling the generation of both view-consistent geometry as well as high-resolution details. In the first stage, we optimize a coarse neural field representation akin to DreamFusion, but with a memory- and compute-efficient scene representation based on a hash grid [25]. In the second stage, we switch to optimizing mesh representations, a critical step that allows us to utilize diffusion priors at resolutions as high as 512 × 512. As 3D meshes are amenable to fast graphics renderers that can render high-resolution images in real time, we leverage an efficient differentiable rasterizer [9, 26] and make use of camera close-ups to recover high-frequency details in geometry and texture.
As a result, our approach produces high-fidelity 3D content (see Fig. 1) that can conveniently be imported and visualized in standard graphics software, and does so at 2× the speed of DreamFusion. Furthermore, we showcase various creative controls over the 3D synthesis process by leveraging the advancements developed for text-to-image editing applications [2, 35]. Our approach, dubbed Magic3D, endows users with unprecedented control in crafting their desired 3D objects with text prompts and reference images, bringing this technology one step closer to democratizing 3D content creation.

In summary, our work makes the following contributions:

• We propose Magic3D, a framework for high-quality 3D content synthesis using text prompts by improving several major design choices made in DreamFusion. It consists of a coarse-to-fine strategy that leverages both low- and high-resolution diffusion priors for learning the 3D representation of the target content. Magic3D, which synthesizes 3D content with 8× higher-resolution supervision, is also 2× faster than DreamFusion. 3D content synthesized by our approach is significantly preferred by users (61.7%).

• We extend various image editing techniques developed for text-to-image models to 3D object editing and show their applications in the proposed framework.

2. Related Work

Text-to-image generation. We have witnessed significant progress in text-to-image generation with diffusion models in recent years. With improvements in modeling and data curation, diffusion models can compose complex semantic concepts from text descriptions (nouns, adjectives, artistic styles, etc.) to generate high-quality images of objects and scenes [2, 33, 34, 36]. Sampling images from diffusion models is time-consuming. To generate high-resolution images, these models either utilize a cascade of super-resolution models [2, 36] or sample from a lower-resolution latent space and decode latents into high-resolution images [34]. Despite the advances in high-resolution image generation, using language to describe and control 3D properties (e.g., camera viewpoints) while maintaining coherency in 3D remains an open, challenging problem.

3D generative models. There is a large body of work on 3D generative modeling, exploring different types of 3D representations such as 3D voxel grids [7, 12, 20, 40, 45], point clouds [1, 21, 24, 46, 47, 49], meshes [9, 48], implicit [6, 22], or octree [15] representations. Most of these approaches rely on training data in the form of 3D assets, which are hard to acquire at scale. Inspired by the success of neural volume rendering [23], recent works started investing in 3D-aware image synthesis [4, 5, 10, 11, 27, 29, 30, 38], which has the advantage of learning 3D generative models directly from images, a more widely accessible resource. However, volume rendering networks are typically slow to query, leading to a trade-off between long training time [5, 29] and lack of multi-view consistency [10]. EG3D [4] partially mitigates this problem by utilizing a dual discriminator. While obtaining promising results, these works remain limited to modeling objects within a single object category, such as cars, chairs, or human faces, thus lacking scalability and the creative control desired for 3D content creation. In our paper, we focus on text-to-3D synthesis, aiming to generate a 3D renderable representation of a scene based on a text prompt.

Text-to-3D generation. With the recent success of text-to-image generative modeling, text-to-3D generation has also gained a surge of interest from the learning community. Earlier works such as CLIP-forge [37] synthesize objects by learning a normalizing flow model to sample shape embeddings from textual input. However, it requires 3D assets in voxel representations during training, making it challenging to scale with data. DreamField [16] and CLIP-mesh [17] mitigate the training data issue by relying on a pre-trained image-text model [32] to optimize the underlying 3D representations (NeRFs and meshes), such that all 2D renderings reach high text-image alignment scores. While these approaches avoid the requirement of expensive 3D training data and mostly rely on pre-trained large-scale image-text models, they tend to produce less realistic 2D renderings.

Recently, DreamFusion [31] showcased impressive capability in text-to-3D synthesis by utilizing a powerful pre-trained text-to-image diffusion model [36] as a strong image prior. We build upon this work and improve over several design choices to bring significantly higher-fidelity 3D models into the hands of users with a much reduced generation time.

3. Background: DreamFusion

DreamFusion [31] achieves text-to-3D generation with two key components: a neural scene representation, which we refer to as the scene model, and a pre-trained text-to-image diffusion-based generative model. The scene model is a parametric function x = g(θ), which can produce an image x at the desired camera pose. Here, g is a volumetric renderer of choice, and θ is a coordinate-based MLP representing a 3D volume. The diffusion model ϕ comes with a learned denoising function ϵ_ϕ(x_t; y, t) that predicts the sampled noise ϵ given the noisy image x_t, noise level t, and text embedding y. It provides the gradient direction to update θ such that all rendered images are pushed to the high-probability density regions conditioned on the text embedding under the diffusion prior. Specifically, DreamFusion introduces Score Distillation Sampling (SDS), which computes the gradient:

∇_θ L_SDS(ϕ, g(θ)) = E_{t,ϵ}[ w(t) (ϵ_ϕ(x_t; y, t) − ϵ) ∂x/∂θ ].   (1)

Here, w(t) is a weighting function.
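To make the SDS update in Eq. (1) concrete, below is a minimal PyTorch-style sketch of one optimization step. It is an illustration under stated assumptions, not the authors' released implementation: `render`, `eps_phi`, and `alphas_cumprod` are hypothetical stand-ins for the scene model g, the frozen diffusion prior, and its noise schedule.

```python
import torch

def sds_step(render, eps_phi, alphas_cumprod, text_emb, optimizer):
    """One Score Distillation Sampling (SDS) step, cf. Eq. (1).

    Hypothetical interfaces (for illustration only):
      render()            -> image x, shape [B, 3, 64, 64], differentiable w.r.t. the scene parameters theta
      eps_phi(x_t, t, y)  -> predicted noise from the frozen diffusion prior, same shape as x_t
      alphas_cumprod      -> 1D tensor with the cumulative noise schedule, indexed by timestep t
    """
    x = render()                                          # x = g(theta), rendered at a sampled camera pose
    t = torch.randint(20, 980, (x.shape[0],), device=x.device)
    eps = torch.randn_like(x)                             # sampled ground-truth noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps   # forward diffusion of the rendering

    with torch.no_grad():                                 # never differentiate through the diffusion U-Net
        eps_pred = eps_phi(x_t, t, text_emb)

    w = 1.0 - a_bar                                       # one common choice for the weighting w(t)
    grad = w * (eps_pred - eps)                           # per-pixel SDS gradient w.r.t. x
    loss = (grad.detach() * x).sum()                      # d(loss)/dx = grad, so backprop reproduces Eq. (1)
    optimizer.zero_grad()
    loss.backward()                                       # accumulates w(t)(eps_phi - eps) * dx/dtheta
    optimizer.step()
```

The surrogate loss (grad.detach() * x).sum() is simply a device to inject the fixed SDS gradient at the rendered image, so that autograd only propagates it through the renderer into θ.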
Figure 2 example prompt: a stuffed grey rabbit holding a pretend carrot. (Diagram labels: 1. low-resolution optimization, rendering at low resolution against an image diffusion prior; 2. high-resolution optimization, rendering at high resolution against a latent diffusion prior through an encoder.)
Figure 2. Overview of Magic3D. We generate high-resolution 3D content from an input text prompt in a coarse-to-fine manner. In the first
stage, we utilize a low-resolution diffusion prior and optimize neural field representations (color, density, and normal fields) to obtain the
coarse model. We further differentiably extract textured 3D mesh from the density and color fields of the coarse model. Then we fine-tune it
using a high-resolution latent diffusion model. After optimization, our model generates high-quality 3D meshes with detailed textures.
We view the scene model g and the diffusion model ϕ as modular components of the framework, amenable to choice. In practice, the denoising function ϵ_ϕ is often replaced with another function ϵ̃_ϕ that uses classifier-free guidance [14], which allows one to carefully weigh the strength of the text conditioning (see Sec. 6). DreamFusion relies on large classifier-free guidance weights to obtain results with better quality.

DreamFusion adopts a variant of Mip-NeRF 360 [3] with an explicit shading model for the scene model, and Imagen [36] as the diffusion model. These choices result in two key limitations. First, high-resolution geometry or textures cannot be obtained, since the diffusion model only operates on 64 × 64 images. Second, the use of a large global MLP for volume rendering is both computationally expensive and memory intensive, making this approach scale poorly with the increasing resolution of images.

4. High-Resolution 3D Generation

Magic3D is a two-stage coarse-to-fine framework that uses efficient scene models to enable high-resolution text-to-3D synthesis (Fig. 2). We describe our method and key differences from DreamFusion [31] in this section.

4.1. Coarse-to-fine Diffusion Priors

Magic3D uses two different diffusion priors in a coarse-to-fine fashion to generate high-resolution geometry and textures. In the first stage, we use the base diffusion model described in eDiff-I [2], which is similar to the base diffusion model of Imagen [36] used in DreamFusion. This diffusion prior is used to compute gradients of the scene model via a loss defined on rendered images at a low resolution of 64 × 64. In the second stage, we use the latent diffusion model (LDM) [34], which allows backpropagating gradients into rendered images at a high resolution of 512 × 512; in practice, we choose to use the publicly available Stable Diffusion model [34]. Despite generating high-resolution images, the computation of the LDM is manageable because the diffusion prior acts on the latent z_t with resolution 64 × 64:

∇_θ L_SDS(ϕ, g(θ)) = E_{t,ϵ}[ w(t) (ϵ_ϕ(z_t; y, t) − ϵ) ∂z/∂x ∂x/∂θ ].   (2)

The increase in computation time mainly comes from computing ∂x/∂θ (the gradient of the high-resolution rendered image) and ∂z/∂x (the gradient of the encoder in the LDM).
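As a sketch of how Eq. (2) differs from Eq. (1) in implementation, the high-resolution rendering is first encoded into the latent space and the SDS gradient is taken with respect to the latent, so autograd carries it back through the encoder and the renderer. The helper names (`render_highres`, `encode`, `eps_phi_latent`) are assumptions for illustration, not a released API.

```python
import torch

def latent_sds_step(render_highres, encode, eps_phi_latent, alphas_cumprod,
                    text_emb, optimizer):
    """One high-resolution SDS step through a latent diffusion prior, cf. Eq. (2).

    Hypothetical interfaces:
      render_highres()           -> image x, shape [B, 3, 512, 512], differentiable w.r.t. scene parameters
      encode(x)                  -> latent z, shape [B, 4, 64, 64] (LDM encoder, kept differentiable)
      eps_phi_latent(z_t, t, y)  -> predicted noise in latent space from the frozen LDM U-Net
    """
    x = render_highres()                        # high-resolution rendering (mesh rasterization in the fine stage)
    z = encode(x)                               # gradients will flow back through the encoder: dz/dx
    t = torch.randint(20, 980, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps

    with torch.no_grad():
        eps_pred = eps_phi_latent(z_t, t, text_emb)

    w = 1.0 - a_bar
    grad = w * (eps_pred - eps)                 # SDS gradient w.r.t. the latent z
    loss = (grad.detach() * z).sum()            # backprop yields w(t)(eps_phi - eps) * dz/dx * dx/dtheta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```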
4.2. Scene Models

We tailor two different 3D scene representations to the two diffusion priors at the coarse and fine resolutions, accommodating the increased resolution of the rendered images required as input to the high-resolution prior; we discuss each below.

Neural fields as coarse scene models. The initial coarse stage of the optimization requires finding the geometry and textures from scratch. This can be challenging, as we need to accommodate complex topological changes in the 3D geometry and depth ambiguities from the 2D supervision signals. In DreamFusion [31], the scene model is a neural field (a coordinate-based MLP) based on Mip-NeRF 360 [3] that predicts albedo and density. This is a suitable choice, as neural fields can handle topological changes in a smooth, continuous fashion. However, Mip-NeRF 360 [3] is computationally expensive, as it is based on a large global coordinate-based MLP. As volume rendering requires dense samples along a ray to accurately render high-frequency geometry and shading, the cost of having to evaluate a large neural network at every sample point quickly stacks up.

For this reason, we opt to use the hash grid encoding from Instant NGP [25], which allows us to represent high-frequency details at a much lower computational cost.
We use the hash grid with two single-layer neural networks, one predicting albedo and density and the other predicting normals. We additionally maintain a spatial data structure that encodes scene occupancy and utilizes empty-space skipping [19, 43]. Specifically, we use the density-based voxel pruning approach from Instant NGP [25] with an octree-based ray sampling and rendering algorithm [44]. With these design choices, we drastically accelerate the optimization of coarse scene models while maintaining quality.
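A minimal sketch of what such a coarse scene model could look like is given below, assuming a hash-grid encoder in the spirit of Instant NGP (here a stand-in `encoder` module with an assumed feature dimension); the two single-layer heads mirror the albedo/density and normal predictions described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseSceneModel(nn.Module):
    """Hash-grid-encoded neural field with two single-layer heads (a sketch, not the released model)."""

    def __init__(self, encoder: nn.Module, feat_dim: int = 32):
        super().__init__()
        self.encoder = encoder                        # e.g., a multiresolution hash grid encoding [25]
        self.albedo_density = nn.Linear(feat_dim, 4)  # 3 albedo channels + 1 density
        self.normal = nn.Linear(feat_dim, 3)          # predicted normals (instead of finite differences)

    def forward(self, xyz: torch.Tensor):
        feat = self.encoder(xyz)                      # [N, feat_dim] features at sampled 3D points
        out = self.albedo_density(feat)
        albedo = torch.sigmoid(out[..., :3])          # keep albedo in [0, 1]
        density = F.softplus(out[..., 3:4])           # non-negative volume density
        normal = F.normalize(self.normal(feat), dim=-1)
        return albedo, density, normal
```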
Textured meshes as fine scene models. In our fine stage of optimization, we need to be able to accommodate very high-resolution rendered images to fine-tune our scene model with high-resolution diffusion priors. Using the same scene representation (the neural field) from the initial coarse stage of optimization could be a natural choice, since the weights of the model can directly carry over. Although this strategy can work to some extent (Figs. 4 and 5), it struggles to render very high-resolution (e.g., 512 × 512) images within reasonable memory constraints and computation budgets. To resolve this issue, we use textured 3D meshes as the scene representation for the fine stage of optimization. In contrast to volume rendering for neural fields, rendering textured meshes with differentiable rasterization can be performed efficiently at very high resolutions, making meshes a suitable choice for our high-resolution optimization stage. Using the neural field from the coarse stage as the initialization for the mesh geometry, we can also sidestep the difficulty of learning large topological changes in meshes.
Formally, we represent the 3D shape using a deformable tetrahedral grid (V_T, T), where V_T denotes the vertices in the grid T [8, 39]. Each vertex v_i ∈ V_T ⊂ R^3 contains a signed distance field (SDF) value s_i ∈ R and a deformation ∆v_i ∈ R^3 of the vertex from its initial canonical coordinate. We then extract a surface mesh from the SDF using the differentiable marching tetrahedra algorithm [39]. For textures, we use the neural color field as a volumetric texture representation.
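The following is a minimal sketch of this parameterization under the assumptions stated here: per-vertex SDF values s_i and deformations ∆v_i are the optimizable geometry parameters, the SDF is initialized from the coarse density field by subtracting a placeholder constant, and `marching_tetrahedra` stands in for the differentiable implementation of [39].

```python
import torch
import torch.nn as nn

class DeformableTetGrid(nn.Module):
    """Sketch of a deformable tetrahedral grid with per-vertex SDF values (cf. DMTet [39])."""

    def __init__(self, verts: torch.Tensor, tets: torch.Tensor, coarse_density_at):
        super().__init__()
        self.register_buffer("verts", verts)          # [V, 3] canonical vertex positions V_T
        self.register_buffer("tets", tets)            # [T, 4] tetrahedron vertex indices
        # Initialize the SDF from the coarse density field by subtracting a non-zero constant.
        sdf_init = (coarse_density_at(verts) - 1.0).detach()   # 1.0 is an assumed placeholder constant
        self.sdf = nn.Parameter(sdf_init)             # s_i, optimized with the high-resolution SDS gradient
        self.deform = nn.Parameter(torch.zeros_like(verts))    # delta v_i, per-vertex deformation

    def extract_mesh(self, marching_tetrahedra):
        """marching_tetrahedra is an assumed differentiable (verts, tets, sdf) -> (mesh_v, mesh_f) routine."""
        deformed = self.verts + self.deform
        mesh_v, mesh_f = marching_tetrahedra(deformed, self.tets, self.sdf)
        return mesh_v, mesh_f
```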
4.3. Coarse-to-fine Optimization

We describe our coarse-to-fine optimization procedure, which first operates on a coarse neural field representation and subsequently on a high-resolution textured mesh.
Neural field optimization. Similarly to Instant NGP [25], we initialize an occupancy grid of resolution 256³ with values set to 20 to encourage shapes to grow in the early stages of optimization. We update the grid every 10 iterations and generate an octree for empty-space skipping. We decay the occupancy grid by 0.6 in every update and follow Instant NGP with the same update and thresholding parameters.
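As an illustration of this bookkeeping, here is a small sketch; the initial value, decay factor, and update cadence come from the text, while the pruning threshold, the max-style update rule, and the octree rebuild are assumptions made for the example.

```python
import torch

class OccupancyGrid:
    """Sketch of the density-based occupancy grid used for empty-space skipping."""

    def __init__(self, resolution: int = 256, init_value: float = 20.0,
                 decay: float = 0.6, update_every: int = 10, threshold: float = 0.01):
        self.values = torch.full((resolution,) * 3, init_value)  # optimistic init so shapes can grow early on
        self.decay = decay
        self.update_every = update_every
        self.threshold = threshold                                # assumed pruning threshold

    def maybe_update(self, step: int, density_at_cell_centers):
        """density_at_cell_centers: callable returning a [R, R, R] tensor of current densities."""
        if step % self.update_every != 0:
            return
        with torch.no_grad():
            self.values = torch.maximum(self.values * self.decay, density_at_cell_centers())
        # An octree for empty-space skipping would be rebuilt here from (self.values > self.threshold).
```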
Instead of estimating normals from density differences, we use an MLP to predict the normals. Note that this does not violate geometric properties, since volume rendering is used instead of surface rendering; as such, the orientation of particles at continuous positions need not be oriented to the level set surface. This helps us significantly reduce the computational cost of optimizing the coarse model by avoiding the use of finite differencing. Accurate normals can be obtained in the fine stage of optimization when we use a true surface rendering model.

Similar to DreamFusion, we also model the background using an environment map MLP, which predicts RGB colors as a function of ray directions. Since our sparse representation model does not support scene reparametrization as in Mip-NeRF 360 [3], the optimization has a tendency to "cheat" by learning the essence of the object using the background environment map. As such, we use a tiny MLP for the environment map (hidden dimension size of 16) and weigh down the learning rate by 10× to allow the model to focus more on the neural field geometry.

Mesh optimization. To optimize a mesh from the neural field initialization, we convert the (coarse) density field to an SDF by subtracting a non-zero constant from it, yielding the initial s_i. We additionally initialize the volume texture field directly with the color field optimized from the coarse stage.

During optimization, we render the extracted surface mesh into high-resolution images using a differentiable rasterizer [18, 26]. We optimize both s_i and ∆v_i for each vertex v_i via backpropagation using the high-resolution SDS gradient (Eq. 2). When rendering the mesh to an image, we also track the 3D coordinates of each corresponding pixel projection, which are used to query colors in the corresponding texture field for joint optimization.

When rendering the mesh, we increase the focal length to zoom in on object details, which is a critical step towards recovering high-frequency details. We keep the same pre-trained environment map from the coarse stage of optimization and composite the rendered background with the rendered foreground object using differentiable antialiasing [18]. To encourage the smoothness of the surface, we further regularize the angular differences between adjacent faces on the mesh. This allows us to obtain well-behaved geometry even under supervision signals with high variance, such as the SDS gradient ∇_θ L_SDS.

5. Experiments

We focus on comparing our method with DreamFusion [31] on 397 text prompts taken from the website of DreamFusion¹. We train Magic3D on all of the text prompts and compare them with the results provided on the website.

Speed evaluation. Unless otherwise noted, the coarse stage is trained for 5000 iterations with 1024 samples along the ray (subsequently filtered by the sparse octree) with a batch size of 32, with a total runtime of around 15 minutes (upwards of 8 iterations/second, variable due to differences in sparsity). The fine stage is trained for 3000 iterations with a batch

¹ https://dreamfusion3d.github.io/gallery.html
Figure 3 prompts (each shown for Ours and DreamFusion [31]): a small saguaro cactus planted in a clay pot∗; a very beautiful tiny human heart organic sculpture made of copper wire and threaded pipes, very intricate, curved, Studio lighting, high resolution∗.
Figure 3. Qualitative comparison with DreamFusion [31]. We use the same text prompt as in DreamFusion. For each 3D model, we
render it from two views with a textureless rendering for each view and remove the background to focus on the actual 3D shape. For the
DreamFusion results, we take frames from the videos published on the official webpage. Our Magic3D generates much higher quality 3D
shapes on both geometry and texture compared with DreamFusion. ∗ a DSLR photo of... † a zoomed out DSLR photo of...
Table 1. User preference studies. We conducted user studies to measure preference for 3D models generated using 397 prompts released by DreamFusion. Overall, more raters (61.7%) prefer 3D models generated by Magic3D over DreamFusion. The majority of raters (87.7%) prefer fine models over coarse models in Magic3D, showing the effectiveness of our coarse-to-fine approach.

Comparison                                  Preference
Magic3D vs. DreamFusion [31]
  • More realistic                          58.3%
  • More detailed                           66.0%
  • More realistic & detailed               61.7%
Magic3D vs. Magic3D (coarse only)           87.7%

Figure 4 (prompt: a baby bunny sitting on top of a stack of pancakes†): (a) Single-stage model; (b) Coarse-to-fine model.
Figure 5 rows: Coarse NeRF; Fine-tuned NeRF; Fine-tuned Mesh. Prompts: baby dragon hatching out of a stone egg∗; bagel filled with cream cheese and lox∗; a marble bust of a mouse; a ladybug†.
Figure 5. Ablation on the fine-tuning stage. For each text prompt, we compare coarse and fine models with mesh and NeRF representations.
Mesh fine-tuning significantly improves the visual quality of generated 3D assets, providing more photo-realistic details on the 3D shapes.
Figure 7 inputs and edited prompts: a baby bunny sitting on top of a stack of pancakes† (low-resolution input before editing), edited to stained glass bunny, a plate of spaghetti; metal bunny, a stack of carrots; squirrel, a stack of books. A squirrel wearing a leather jacket, riding a motorcycle (low-resolution input before editing), edited to bunny, broomstick; cat, rocking horse; rat, scooter.
Figure 7. Magic3D with prompt-based editing. Given a coarse model (first column) generated with a base prompt, we replace the
underscored text with new text and fine-tune the NeRF to get a high-resolution NeRF model with LDM. We further fine-tune the high-
resolution mesh with the NeRF model. Such a prompt-based editing method gives artists better control over the 3D generation output.
In the context of text-to-3D generation, we would like to generate a 3D model of a given subject. This can be achieved by first fine-tuning our diffusion prior models with the DreamBooth approach, and then using the fine-tuned diffusion priors with the [V] identifier as part of the conditioning text prompt to provide the learning signal when optimizing the 3D model.
To demonstrate the applicability of DreamBooth in our framework, we collect 11 images of one cat and 4 images of one dog. We fine-tune eDiff-I [2] and LDM [34], binding the text identifier [V] to the given subject. Then, we optimize the 3D model with [V] in the text prompts. We use a batch size of 1 for all fine-tuning. For eDiff-I, we use the Adam optimizer with learning rate 1 × 10⁻⁵ for 1,500 iterations; for LDM, we fine-tune with learning rate 1 × 10⁻⁶ for 800 iterations. Fig. 6 shows our personalized text-to-3D results: we are able to successfully modify the 3D models while preserving the subjects in the given input images.
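The fine-tuning recipe above is simple enough to capture as a configuration; the sketch below restates the reported hyperparameters, with the trainer entry point (`finetune_dreambooth`) being a hypothetical placeholder rather than a released API.

```python
# Hyperparameters reported above for DreamBooth-style personalization of the two diffusion priors.
DREAMBOOTH_CONFIGS = {
    "ediff-i": {"optimizer": "adam", "lr": 1e-5, "iterations": 1500, "batch_size": 1},
    "ldm":     {"lr": 1e-6, "iterations": 800, "batch_size": 1},  # optimizer not specified in the text
}

def personalize(prior_name, subject_images, finetune_dreambooth, identifier="[V]"):
    """Bind the identifier token to a subject, then use the fine-tuned prior for 3D optimization.

    `finetune_dreambooth` is a hypothetical callable: (prior name, images, token, config) -> fine-tuned prior.
    """
    cfg = DREAMBOOTH_CONFIGS[prior_name]
    return finetune_dreambooth(prior_name, subject_images, identifier, cfg)
```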
Prompt-based editing through fine-tuning. Another way to control the generated 3D content is by fine-tuning a learned coarse model with a new prompt. Our prompt-based editing includes three stages. (a) We train a coarse model with a base prompt. (b) We modify the base prompt and fine-tune the coarse model with the LDM. This stage provides a well-initialized NeRF model for the next step. Directly applying mesh optimization with a new prompt would generate highly detailed textures but could deform the geometry only slightly. (c) We optimize the mesh with the modified text prompt. Our prompt-based editing can modify the texture of the shape or transform the geometry and texture according to the text. The resulting scene models preserve the layout and overall structure. Such an editing capability makes 3D content creation with Magic3D more controllable. In Fig. 7, we show two coarse NeRF models trained with the base prompt for the "bunny" and the "squirrel". We modify the base prompt, fine-tune the NeRF model at high resolution, and optimize the mesh. Results show that we can tune the scene model according to the prompt, e.g., changing the "baby bunny" to a "stained glass bunny" or "metal bunny" results in similar geometry but with a different texture.
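A compact sketch of the three-stage editing procedure described above, with the stage functions (`train_coarse_nerf`, `finetune_nerf_with_ldm`, `optimize_mesh`) as hypothetical placeholders for the corresponding optimization loops:

```python
def prompt_based_edit(base_prompt, edited_prompt,
                      train_coarse_nerf, finetune_nerf_with_ldm, optimize_mesh):
    """Three-stage prompt-based editing (a)-(c); the three callables are assumed stage implementations."""
    coarse_nerf = train_coarse_nerf(base_prompt)              # (a) coarse model from the base prompt
    edited_nerf = finetune_nerf_with_ldm(coarse_nerf,         # (b) fine-tune the NeRF at high resolution
                                         edited_prompt)       #     with the latent diffusion prior
    edited_mesh = optimize_mesh(edited_nerf, edited_prompt)   # (c) mesh optimization with the new prompt
    return edited_mesh
```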
7. Conclusion

We propose Magic3D, a fast and high-quality text-to-3D generation framework. We benefit from both efficient scene models and high-resolution diffusion priors in a coarse-to-fine approach. In particular, the 3D mesh models scale nicely with image resolution and enjoy the benefits of the higher-resolution supervision brought by the latent diffusion model without sacrificing speed. It takes 40 minutes from a text prompt to a high-quality 3D mesh model ready to be used in graphics engines. With extensive user studies and qualitative comparisons, we show that Magic3D is preferred by raters (61.7%) over DreamFusion, while enjoying a 2× speed-up. Lastly, we propose a set of tools for better controlling style and content in 3D generation. We hope that Magic3D can democratize 3D synthesis and open up everyone's creativity in 3D content creation.

Acknowledgements. We would like to thank Frank Shen, Yogesh Balaji, Seungjun Nah, James Lucas, David Luebke, Clement Fuji-Tsang, Charles Loop, Qinsheng Zhang, Zan Gojcic, and Jonathan Tremblay for helpful discussions and paper proofreading. We would also like to thank Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall for providing additional implementation details of DreamFusion.
References

[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning, pages 40–49. PMLR, 2018.
[2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[3] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
[4] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
[5] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021.
[6] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5939–5948, 2019.
[7] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. In 2017 International Conference on 3D Vision (3DV), pages 402–411. IEEE, 2017.
[8] Jun Gao, Wenzheng Chen, Tommy Xiang, Clement Fuji Tsang, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Learning deformable tetrahedral meshes for 3d reconstruction. In Advances in Neural Information Processing Systems, 2020.
[9] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances in Neural Information Processing Systems, 2022.
[10] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d aware generator for high-resolution image synthesis. In International Conference on Learning Representations, 2022.
[11] Zekun Hao, Arun Mallya, Serge Belongie, and Ming-Yu Liu. GANcraft: Unsupervised 3D neural rendering of Minecraft worlds. In ICCV, 2021.
[12] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato's cave: 3d shape from adversarial rendering. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[14] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[15] Moritz Ibing, Gregor Kobsik, and Leif Kobbelt. Octree transformer: Autoregressive 3d shape generation on hierarchically structured sequences. arXiv preprint arXiv:2111.12480, 2021.
[16] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. 2022.
[17] Nasir Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. ACM Transactions on Graphics (TOG), Proc. SIGGRAPH Asia, 2022.
[18] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6), 2020.
[19] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
[20] Sebastian Lunz, Yingzhen Li, Andrew Fitzgibbon, and Nate Kushman. Inverse graphics gan: Learning to generate 3d shapes from unstructured 2d data. arXiv preprint arXiv:2002.12674, 2020.
[21] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
[22] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
[23] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[24] Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas Guibas. Structurenet: Hierarchical graph networks for 3d shape generation. ACM Transactions on Graphics (TOG), Siggraph Asia 2019, 38(6):Article 242, 2019.
[25] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022.
[26] Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas Müller, and Sanja Fidler. Extracting triangular 3d models, materials, and lighting from images. In CVPR, pages 8280–8290, 2022.
[27] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. Hologan: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7588–7597, 2019.
[28] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[29] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
[30] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. Stylesdf: High-resolution 3d-consistent image and geometry generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13503–13513, 2022.
[31] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[33] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
[35] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[36] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[37] Aditya Sanghi, Hang Chu, Joseph G Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. Clip-forge: Towards zero-shot text-to-shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18603–18613, 2022.
[38] Katja Schwarz, Axel Sauer, Michael Niemeyer, Yiyi Liao, and Andreas Geiger. Voxgraf: Fast 3d-aware image synthesis with sparse voxel grids. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[39] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[40] Edward J Smith and David Meger. Improved adversarial systems for 3d object generation and reconstruction. In Conference on Robot Learning, pages 87–96. PMLR, 2017.
[41] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[42] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[43] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3d shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11358–11367, 2021.
[44] Towaki Takikawa, Or Perel, Clement Fuji Tsang, Charles Loop, Joey Litalien, Jonathan Tremblay, Sanja Fidler, and Maria Shugrina. Kaolin wisp: A pytorch library and engine for neural fields research. https://github.com/NVIDIAGameWorks/kaolin-wisp, 2022.
[45] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. Advances in Neural Information Processing Systems, 29, 2016.
[46] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4541–4550, 2019.
[47] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[48] Yuxuan Zhang, Wenzheng Chen, Huan Ling, Jun Gao, Yinan Zhang, Antonio Torralba, and Sanja Fidler. Image gans meet differentiable rendering for inverse graphics and interpretable 3d neural rendering. In International Conference on Learning Representations, 2021.
[49] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.