
Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era

CHENGHAO LI, KAIST, South Korea


CHAONING ZHANG∗ , Kyung Hee University, South Korea
ATISH WAGHWASE, KAIST, South Korea
LIK-HANG LEE, Hong Kong Polytechnic University, Hong Kong (China)
FRANCOIS RAMEAU, State University of New York, Korea
YANG YANG, University of Electronic Science and Technology, China
SUNG-HO BAE, Kyung Hee University, South Korea
CHOONG SEON HONG, Kyung Hee University, South Korea
arXiv:2305.06131v2 [cs.CV] 27 May 2023

Generative AI (AIGC, a.k.a. AI generated content) has made remarkable progress in the past few years, among which text-guided content generation is the most practical one since it enables the interaction between human instruction and AIGC. Due to the development in text-to-image as well as 3D modeling technologies (like NeRF), text-to-3D has become a newly emerging yet highly active research field. Our work conducts the first yet comprehensive survey on text-to-3D to help readers interested in this direction quickly catch up with its fast development. First, we introduce 3D data representations, including both Euclidean data and non-Euclidean data. On top of that, we introduce various foundation technologies as well as summarize how recent works combine those foundation technologies to realize satisfactory text-to-3D. Moreover, we summarize how text-to-3D technology is used in various applications, including avatar generation, texture generation, shape transformation, and scene generation.

CCS Concepts: • Computing methodologies → Reconstruction; Shape modeling.

Additional Key Words and Phrases: text-to-3D, generative AI, AIGC, 3D generation, metaverse

ACM Reference Format:
Chenghao Li, Chaoning Zhang, Atish Waghwase, Lik-Hang Lee, Francois Rameau, Yang Yang, Sung-Ho Bae, and Choong Seon Hong. 2023. Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era. 1, 1 (May 2023), 14 pages. https://doi.org/XXXXXXX.XXXXXXX

Contents
Abstract
Contents
1 Introduction
2 3D Data Representation
2.1 Euclidean data
2.2 non-Euclidean data
3 Text-to-3D Technologies
3.1 Foundation Technologies
3.2 Successful attempts
4 Text-to-3D Applications
4.1 Text Guided 3D Avatar Generation
4.2 Text Guided 3D Texture Generation
4.3 Text Guided 3D Scene Generation
4.4 Text Guided 3D Shape Transformation
5 Discussion
5.1 Fidelity
5.2 Inference velocity
5.3 Consistency
5.4 Controllability
5.5 Applicability
6 Conclusion
References

∗ Corresponding author: [email protected]

1 INTRODUCTION
Generative Artificial Intelligence, which produces high-quality content in large quantities (also known as Artificial Intelligence Generated Content, AIGC), has attracted great attention in the past few years. The content generation paradigm guided and constrained by natural language, such as text-to-text (e.g. ChatGPT [Zhang et al. 2023c]) and text-to-image [Zhang et al. 2023d] (e.g. DALLE-2 [Ramesh et al. 2022]), is the most practical one, as it allows for a simple interaction between human guidance and generative AI [Zhang et al. 2023e]. The accomplishment of generative AI in the field of text-to-image [Zhang et al. 2023d] is quite remarkable. As we live in a 3D world, it is necessary to extend AIGC to the 3D domain. There is a great demand for 3D digital content in many application scenarios, including games, movies, virtual reality, architecture and robotics, for tasks such as 3D character generation, 3D texture generation, and 3D scene generation. However, it requires a lot of artistic and aesthetic


training, as well as professional knowledge in 3D modeling, to cultivate a professional 3D modeler. Given the current trend of 3D model development, it is essential to utilize generative AI to generate high-quality and large-scale 3D models. In addition, text-to-3D AI modeling can greatly assist both newbies and professionals to realize free creation of 3D contents.

Previous methods for text-to-3D shapes have attempted to learn a cross-modal mapping by directly learning from text-3D pairs [Achlioptas et al. 2019; Chen et al. 2019] and generate joint representations. Compared to text-to-image, the task of generating 3D shapes from text is more challenging. Firstly, unlike 2D images, 3D shapes are mostly unstructured and irregular non-Euclidean data, making it difficult to apply traditional 2D deep learning models to these data. Moreover, there are a large number of large-scale image-text pair datasets available online to support text-to-image generation. However, to our knowledge, the largest text-to-3D dataset, proposed in [Fu et al. 2022], contains only 369K text-3D pairs and is limited to a few object categories. This is significantly smaller than the image datasets, which contain 5.85B text-image pairs [Schuhmann et al. 2022]. The lack of large-scale and high-quality training data makes the task of text-to-3D even more difficult.

Recently, the advent of some key technologies has enabled a new paradigm of text-to-3D tasks. Firstly, Neural Radiance Fields (NeRF) [Mildenhall et al. 2021] is an emergent 3D data representation approach. Initially, NeRF was found to perform well in the 3D reconstruction task, and recently NeRF and other neural 3D representations have been applied to novel view synthesis tasks that can use real-world RGB photos. NeRF is trained to reconstruct images from multiple viewpoints. As the learned radiance fields are shared between viewpoints, NeRF can smoothly and consistently interpolate between viewpoints. Due to its neural representation, NeRF can be sampled at high spatial resolution, unlike voxel representations and point clouds, and is easier to optimize than meshes and other explicit geometric representations, since it is topology-free. The advent of NeRF serves as a soft bridge from 2D to 3D representation, elegantly easing the problem of 3D data scarcity. Secondly, with the remarkable progress of multimodal AI [Radford et al. 2021] and diffusion models [Ho et al. 2020], text-guided image generation has made significant progress. The key driving factor is the large-scale datasets of billions of text-image pairs obtained from the Internet. Recent works have emerged that guide 3D model optimization by leveraging the prior of a pre-trained text-to-image generation model. In other words, they guide 3D model generation with text-to-image priors.

Overall, this work conducts the first yet comprehensive survey on text-to-3D. The rest of this work is organized as follows. Sec. 2 reviews the different representations of 3D data. Sec. 3 first introduces the technologies behind text-to-3D and then summarizes the recent papers. Sec. 4 introduces the applications of text-to-3D in various fields.

2 3D DATA REPRESENTATION
3D data can have different representations [Ahmed et al. 2018], divided into Euclidean and non-Euclidean. 3D Euclidean data has an underlying grid structure, which allows global parameterization and a common coordinate system. These properties make extending existing 2D deep learning paradigms to 3D data a simple task, where convolution operations remain the same as in 2D. On the other hand, 3D non-Euclidean data does not have a grid array structure and is not globally parameterized. Therefore, extending classical deep learning techniques to such representations is a challenging task. In real life, the research of deep learning techniques in non-Euclidean domains is of great significance. This is referred to as geometric deep learning [Cao et al. 2020].

2.1 Euclidean data
Euclidean data preserves the attributes of the grid structure, with global parameterization and a common coordinate system. The major 3D data representations in this category include voxel grids and multi-view images.

Fig. 1. Voxel representation of Stanford bunny, the picture is obtained from [Shi et al. 2022].

2.1.1 Voxel Grids. Voxels can be used to represent individual samples or data points on a regularly spaced three-dimensional grid, which is Euclidean structured data similar to pixels [Blinn 2005] in 2D space. Each data point can contain a single value, such as opacity, or multiple values, such as color and opacity. Voxels can also store high-dimensional feature vectors in data points, such as geometric occupancy [Mescheder et al. 2019], volumetric density [Minto et al. 2018], or signed distance values [Park et al. 2019]. A voxel only represents a point on this grid, not a volume; the space between voxels is not represented in a voxel-based dataset. Depending on the data type and the expected use of the dataset, this lost information can be reconstructed and/or approximated, for example, by interpolation. The voxel representation is simple and its spatial structure is clear, which makes it highly extensible and easily applicable to convolutional neural networks [Wang et al. 2019]. However, its efficiency is low, as it represents both the occupied and the unoccupied parts of the scene, which leads to a large amount of unnecessary storage and makes voxels unsuitable for representing high-resolution data. Voxel grids have many applications in rendering tasks [Hu et al. 2023; Rematas and Ferrari 2020]. Early methods store high-dimensional feature vectors in voxels to encode the geometry and appearance of the scene, usually referred to as a feature volume, which can be interpreted as a color image using projection and 2D convolutional neural networks. Applications also include volumetric imaging in medicine, as well as terrain representation in games and simulations.
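The voxel data structure described above can be made concrete with a short sketch. The following minimal NumPy example (the function name voxelize and the chosen resolution are illustrative, not from any surveyed method) converts a point set into a dense occupancy grid; note that the empty cells are stored alongside the occupied ones, which is exactly the storage overhead discussed above.

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point set into a dense occupancy voxel grid.

    Each voxel here stores a single value (occupied or not); it could equally
    store color, density, or a feature vector, as described above.
    """
    mins = points.min(axis=0)
    maxs = points.max(axis=0)
    # Map every point into [0, resolution) along each axis.
    scaled = (points - mins) / np.maximum(maxs - mins, 1e-8) * (resolution - 1)
    idx = np.clip(scaled.round().astype(int), 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # mark occupied voxels
    return grid

# Example: voxelize 1,000 random points on a unit sphere surface.
pts = np.random.randn(1000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
occupancy = voxelize(pts, resolution=32)
print(occupancy.shape, occupancy.sum())  # (32, 32, 32) and the number of occupied cells
```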
2.1.2 Multi-view Images. With the development of computer vision technology and the remarkable improvement in computing power,


coupled with the latest developments in modern digital cameras, it is now possible to easily capture large amounts of high-resolution images. There is an urgent need to extract 3D structure from these images for many applications, such as 3D reconstruction [Jin et al. 2020]. A multi-view image dataset aggregates multiple images, each representing an object or scene from a different perspective, such as the front, side and top. Since collecting 3D data from the real world is time-consuming, and deep learning paradigms rely on large amounts of data for training, the availability of multi-view image datasets is their greatest advantage. Their drawback is that multi-view images cannot be strictly defined as 3D model data, but they provide a bridge between 2D and 3D visualizations. Recently, NeRF [Mildenhall et al. 2021] has emerged as a novel approach for 3D reconstruction, and large-scale multi-view datasets [Yu et al. 2023] are well suited to the massive data requirements of learning-based generalizable NeRF methods. Multi-view images are also applicable to multi-view stereo [Furukawa et al. 2015] and view-consistent image understanding [Dong et al. 2022] tasks.

Fig. 2. Multi-view representation of Stanford bunny, the picture is obtained from [Park et al. 2016].

2.2 non-Euclidean data
The second type of 3D data representation is non-Euclidean data. This type of data does not have a global parameterization or a common coordinate system, which makes it difficult to extend 2D deep learning paradigms to it. Much effort has been made in learning such data representations and applying deep learning techniques to them. Researching deep learning techniques in non-Euclidean domains is of great importance; this is referred to as geometric deep learning [Cao et al. 2020]. The main types of non-Euclidean data are point clouds, 3D meshes and implicit representations (neural fields).

Fig. 3. Mesh representation of Stanford bunny, the picture is obtained from [Rossi et al. 2021].

2.2.1 Meshes. 3D meshes [Wang and Zhang 2022] are one of the most popular representations of 3D shapes. A 3D mesh is composed of a set of polygons, termed faces, which are described by a set of vertices holding coordinates in 3D space. These vertices are associated with a connectivity list that describes how they are interconnected. As meshes only model the surface of the scene, they are more compact. Meshes provide connectivity of surface points for modeling point relationships. Due to these advantages, polygon meshes are widely used in traditional computer graphics [Zhou et al. 2021b] applications such as geometry processing, animation, and rendering. However, on a global level, meshes are non-Euclidean data: the local geometry of a mesh can be represented as a subset of Euclidean space, but well-known properties of Euclidean space, such as shift-invariance, vector space operations, and a global parameterization system, are not well-defined. Thus, deep learning on 3D meshes is a challenging task. However, with the development of graph neural networks [Wu et al. 2020], meshes can be seen as graphs. MeshCNN [Hanocka et al. 2019] specifically designs convolutional and pooling layers for mesh edges and extracts edge features for shape analysis. 3D meshes are important in many fields and industries, such as architecture and construction, furniture and home living, gaming and entertainment, product design, and medical and life sciences. They can be used for designing, visualizing and analyzing architecture and products, creating character objects for games and movies, designing new products, and visualizing and analyzing anatomical structures, helping to increase understanding of diseases and treatment methods.
Fig. 4. Point cloud representation of Stanford bunny model, the figure is obtained from [Agarwal and Prabhakaran 2009].

2.2.2 Point Clouds. With the trend of inexpensive and convenient point cloud acquisition equipment, point clouds have been widely used in modeling and rendering, augmented reality, autonomous vehicles, etc. [Guo et al. 2020]. Point clouds are a disordered set of discrete samples of a three-dimensional shape in three-dimensional space. Traditionally, point clouds are treated as non-Euclidean data because point cloud data is not structured globally. However, point clouds can also be realized as a set of globally parameterized small Euclidean subsets. The definition of point cloud structure depends on whether the global or local structure of the object is considered; since most applications strive to capture the global characteristics of the object to perform complex tasks, point clouds are traditionally non-Euclidean data. Point clouds are the direct output of depth sensors [Liu et al. 2019] and therefore are very popular in 3D scene understanding tasks. Despite being easily obtainable, the irregularity of point clouds makes them hard to process with traditional 2D neural networks. Numerous geometric deep learning [Cao et al. 2020] methods have been proposed to effectively analyze three-dimensional point clouds, such as PointNet [Qi et al. 2017], a deep learning architecture that takes raw point cloud data


directly as input and uses a set of sparse keypoints to summarize the input point cloud; it processes data effectively, is robust to small perturbations of the input, and achieves good performance in tasks such as shape classification, part segmentation, and scene segmentation. 3D point cloud technology can be applied to many fields, such as architecture, engineering, civil building design, geological survey, machine vision, agriculture, spatial information, and autonomous driving, and can provide more accurate modeling and analysis, as well as more accurate positioning and tracking.
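A minimal sketch of the core PointNet idea referenced above, a shared per-point transformation followed by a symmetric max pooling that makes the output invariant to point ordering, is given below; the weights are random, untrained placeholders rather than the actual PointNet network.

```python
import numpy as np

def pointnet_global_feature(points, seed=0):
    """Toy PointNet-style encoder: a shared per-point MLP followed by max pooling.

    points: (N, 3) array. The max over points is the symmetric function that
    makes the global feature independent of the ordering of the input points.
    """
    rng = np.random.default_rng(seed)
    w1, b1 = rng.normal(size=(3, 64)), np.zeros(64)
    w2, b2 = rng.normal(size=(64, 256)), np.zeros(256)
    h = np.maximum(points @ w1 + b1, 0.0)   # shared MLP applied to every point
    h = np.maximum(h @ w2 + b2, 0.0)
    return h.max(axis=0)                    # global feature, order-invariant

pts = np.random.default_rng(1).normal(size=(1024, 3))
f1 = pointnet_global_feature(pts)
f2 = pointnet_global_feature(pts[::-1].copy())  # same cloud, different ordering
print(np.allclose(f1, f2))                      # True: permutation invariance
```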
Fig. 5. Neural field representation of Stanford bunny, the picture is obtained from [Park et al. 2019].

2.2.3 Neural Fields. Neural fields represent scenes or objects in 3D space with fields that are wholly or partially parameterized by neural networks. At each point in 3D space, a neural network maps the point's associated characteristics to attributes. Due to their continuous representation, neural fields are capable of representing 3D scenes or objects at any resolution and with unknown or complex topology. Additionally, compared to the above representations, only the parameters of the neural network need to be stored, resulting in lower memory consumption. The earliest work on neural fields was used for 3D shape representation [Peng and Shamsuddin 2004]. The signed distance function (SDF) [Park et al. 2019] is a classical approach to representing 3D shapes as neural fields: it is a continuous volumetric field that stores the distance to the surface and its sign at each point. Several works [Gao et al. 2022b; Shen et al. 2021] use SDFs as the representation to generate 3D shapes. NeRF [Mildenhall et al. 2021], a recently emerging neural field representation for 3D reconstruction, has the advantage of high-quality and realistic 3D model generation, presenting realistic object surfaces and texture details at any angle and distance. Furthermore, it can generate 3D models from any number of input images without specific processing or labeling of the inputs. Another advantage of neural fields is that the neural network can be run on low-power devices after it is trained: polygon ray tracing requires expensive graphics cards to render high-resolution, realistic scenes at high frame rates, whereas high-quality neural fields can be rendered on mobile phones and even in web browsers. However, neural field technology also has drawbacks, such as the large amount of computational resources and time needed for training, difficulty in handling large-scale scenes and complex lighting conditions, and its unstructured nature, which makes it difficult to apply directly to 3D assets. Neural fields are a newly emerging 3D representation technology with strong application prospects and can be used in 3D fields such as VR/AR and games.
the neural network can be operated on low-power devices after it is 𝑡𝑛
trained. Polygon ray tracing renders high-resolution and realistic The transmission coefficient 𝑇 (𝑡) is defined as the probability
scenes at high frame rates, which requires expensive graphic cards, that light is not absorbed from the near-field boundary 𝑡𝑛 to 𝑡.
but high-quality neural fields can be rendered on mobile phones In order to train NeRF network and optimize the predicted color
and even web browsers. However, there are also some drawbacks Ĉ to fit with the ray R corresponding to the pixel in the training
of neural field technology, such as the need for a large amount of images, gradient descent is used to optimize the network and match
computational resources and time for training, as well as difficulty the target pixel color by loss:
in handling large-scale scenes and complex lighting conditions, and
its inability to be structured data, which makes it difficult to be
∑︁
L= ∥C(r) − Ĉ(r)∥ 22 (3)
directly applied to 3D assets. Neural fields are a new emerging 3D r∈ R
representation technology with strong application prospects and
can be used in 3D fields such as VR/AR and games.

, Vol. 1, No. 1, Article . Publication date: May 2023.
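In practice, Eq. (1)-(2) are evaluated with a discrete quadrature along each ray, and the result is compared to the pixel color via Eq. (3). The following NumPy sketch shows that quadrature with a hand-written stand-in radiance field in place of a trained NeRF; all names are illustrative.

```python
import numpy as np

def toy_radiance_field(x, d):
    """Stand-in for f(x, d) -> (sigma, c): a soft density blob at the origin."""
    sigma = 5.0 * np.exp(-4.0 * np.sum(x**2, axis=-1))                      # density
    color = np.stack([np.clip(0.5 + 0.5 * x[..., i], 0, 1) for i in range(3)], axis=-1)
    return sigma, color

def render_ray(origin, direction, t_near=0.0, t_far=4.0, n_samples=64):
    """Discretize Eq. (1)-(2): alpha-composite sampled densities and colors."""
    t = np.linspace(t_near, t_far, n_samples)
    delta = np.append(np.diff(t), 1e10)                          # segment lengths
    pts = origin + t[:, None] * direction                        # sample points on the ray
    sigma, color = toy_radiance_field(pts, direction)
    alpha = 1.0 - np.exp(-sigma * delta)                         # per-segment opacity
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10)) # accumulated T(t), Eq. (2)
    weights = alpha * trans
    return (weights[:, None] * color).sum(axis=0)                # C_hat(r), Eq. (1)

c_hat = render_ray(origin=np.array([0.0, 0.0, -2.0]), direction=np.array([0.0, 0.0, 1.0]))
print(c_hat)  # estimated RGB; the loss in Eq. (3) compares it to the ground-truth pixel color
```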


3.1.2 CLIP. Recent advances in multimodal learning have enabled the development of cross-modal matching models such as CLIP [Radford et al. 2021] (Contrastive Language-Image Pre-training), which learn shared representations from image-text pairs. These models are able to produce a scalar score that indicates whether an image and its associated caption match or not.

Fig. 7. CLIP structure, picture obtained from [Radford et al. 2021].

In training, a standard image model jointly trains an image feature extractor and a linear classifier to predict labels; CLIP instead jointly trains an image encoder and a text encoder to predict the correct pairings within a batch of (image, text) training samples. The symmetric InfoNCE loss is used to train the image and text encoders, which can then be used for a number of downstream tasks. In testing, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's categories. Building on this, text-to-3D methods optimize a volume to produce a high-scoring image rather than using CLIP for reranking. The CLIP structure is shown in Figure 7.
3.1.3 Diffusion model. In the past few years, the use of diffusion models [Ho et al. 2020] has seen a dramatic increase. Also known as denoising diffusion probabilistic models (DDPMs) or score-based generative models, these models generate new data that is similar to the data used to train them. Drawing inspiration from non-equilibrium thermodynamics, DDPMs are defined as a parameterized Markov chain of diffusion steps that adds random noise to the training data and learns to reverse the diffusion process in order to produce the desired data samples from pure noise.

In the forward process, DDPM destroys the training data by gradually adding Gaussian noise. It starts from a data sample x_0 and iteratively generates noisier samples x_1, ..., x_T with q(x_t | x_{t-1}), using a Gaussian diffusion kernel:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \qquad (4)$$

$$q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \qquad (5)$$

where T is the number of steps and the β_t are hyper-parameters. We can obtain a noised image at an arbitrary step t with the Gaussian transition kernel N in Eq. 5 by setting α_t := 1 − β_t and ᾱ_t := ∏_{s=1}^{t} α_s:

$$q(x_t \mid x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right) \qquad (6)$$

The reverse denoising process of DDPM involves learning to undo the forward diffusion by performing iterative denoising, thereby generating data from random noise. This process is formally defined as a stochastic process whose goal is a model distribution p_θ(x_0) that follows the true data distribution q(x_0), obtained by starting from p_θ(x_T). In practice, the denoising network ε_θ is trained with the weighted noise-prediction objective:

$$\mathbb{E}_{t \sim \mathcal{U}(1,T),\, x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I)}\left[\lambda(t)\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right] \qquad (7)$$

Fig. 8. Overview of DDPM, the picture is obtained from [Ho et al. 2020].
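The forward marginal in Eq. (6) and the objective in Eq. (7) translate directly into code. The sketch below draws a timestep, jumps to x_t in closed form, and evaluates one Monte Carlo term of the loss with λ(t) = 1; eps_model is a placeholder for the trained network ε_θ.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)                # \bar{alpha}_t

def q_sample(x0, t, eps):
    """Eq. (6): x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_model(x_t, t):
    """Placeholder for the learned noise predictor eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

# One Monte Carlo estimate of the loss in Eq. (7) with lambda(t) = 1.
x0 = rng.normal(size=(4, 8))                  # a toy batch of "images"
t = rng.integers(0, T)
eps = rng.normal(size=x0.shape)
x_t = q_sample(x0, t, eps)
loss = np.mean((eps - eps_model(x_t, t)) ** 2)
print(t, loss)
```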
3.1.4 Pretrained text-guided image generation models. Recently, with the emergence of diffusion models, pre-trained text-to-image generation models based on diffusion have become good priors for text-to-3D tasks [Zhang et al. 2023d]. The pioneering text-to-image frameworks can be roughly categorized according to whether the diffusion prior operates in pixel space or in latent space. The first class of methods directly generates images at the high-dimensional pixel level and includes GLIDE [Nichol et al. 2021] and Imagen [Saharia et al. 2022]. The other approach compresses the image into a low-dimensional space first and then trains a diffusion model in this latent space; representative latent-space methods include Stable Diffusion [Rombach et al. 2022] and DALLE-2 [Ramesh et al. 2022].

The training process of text-to-image generation can be roughly divided into three steps. Firstly, the CLIP model is employed to learn the correlations between text and visual semantics, mapping text into a shared text-image representation space. Secondly, a prior model is trained to map a text embedding to a corresponding image embedding. Lastly, a diffusion decoder is used for inversion, mapping the image embedding back to pixel space and thus generating text-conditional images. The structure of DALLE-2 is shown in Figure 9.

Fig. 9. Structure of DALLE-2, picture obtained from [Ramesh et al. 2022].
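The three-step pipeline above can be summarized as a schematic. In the sketch below every function is a named placeholder standing for a component of a DALLE-2-style system (CLIP text encoder, prior, diffusion decoder); none of the names correspond to a real API.

```python
# Schematic of the two-stage text-to-image pipeline described above (DALLE-2 style).
# Every function is a placeholder for a pretrained component, not a real library call.

def clip_text_encoder(prompt: str):
    """Step 1: CLIP maps the prompt into the shared text-image embedding space."""
    raise NotImplementedError

def diffusion_prior(text_embedding):
    """Step 2: the prior translates a text embedding into a plausible image embedding."""
    raise NotImplementedError

def diffusion_decoder(image_embedding, prompt: str):
    """Step 3: a diffusion model inverts the image embedding back into pixels."""
    raise NotImplementedError

def text_to_image(prompt: str):
    z_text = clip_text_encoder(prompt)
    z_image = diffusion_prior(z_text)
    return diffusion_decoder(z_image, prompt)
```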


3.2 Successful attempts
Recent pioneering studies have demonstrated the utility of pre-trained text-to-image diffusion models for optimizing neural radiance fields, achieving significant text-to-3D synthesis results. However, this paradigm also has some issues, and a number of follow-up works aim to solve them. In this section, the first part presents the pioneers of this paradigm, and the second part presents works that enhance it.

Fig. 10. DreamFusion outcome. Picture obtained from [Poole et al. 2022].

3.2.1 Pioneers of CLIP-Based Text-Guided 3D Shape Generation. In recent years, with the success of text-to-image generation modelling [Zhang et al. 2023d], text-to-3D generation has also attracted the attention of the deep learning community [Han et al. 2023; Jain et al. 2022; Lin et al. 2022; Mohammad Khalid et al. 2022; Poole et al. 2022; Wang et al. 2022b; Xu et al. 2022b]. However, the scarcity of 3D data makes scaling up with data challenging. Dream Fields [Jain et al. 2022] and CLIP-Mesh [Mohammad Khalid et al. 2022] rely on pre-trained image-text models [Radford et al. 2021] to optimize underlying 3D representations (NeRFs and meshes) in order to alleviate the training data problem, achieving high text-image alignment scores for all 2D renderings. Although these methods avoid the costly requirement of 3D training data and primarily rely on a large-scale pre-trained image-text model, they often yield less realistic 2D renderings. Recently, DreamFusion [Poole et al. 2022] and Magic3D [Lin et al. 2022] have demonstrated impressive capabilities in text-to-3D synthesis by leveraging a powerful pre-trained text-to-image diffusion model as a strong image prior.

CLIP-Forge [Sanghi et al. 2022] was the first successful attempt to apply a pre-trained image-text model to 3D generation. CLIP-Forge is a novel approach to text-to-shape generation that involves no paired text-shape labels. It uses a pre-trained text encoder and an autoencoder to obtain a latent space for shapes. A normalizing flow network is then conditioned on text features to generate a shape embedding, which is then decoded into a 3D shape. CLIP-Forge has an efficient generation process which requires no inference-time optimization and can generate multiple shapes for a given text. See Figure 11 for details. This method also has the advantage of avoiding the expensive inference-time optimization employed in existing text-to-shape generation models. Extensive evaluation of the method in various zero-shot generation settings is provided, both qualitatively and quantitatively.

Fig. 11. Illustration of the main idea of CLIP-Forge, the picture is obtained from [Sanghi et al. 2022].

Dream Fields [Jain et al. 2022] is a concurrent work to CLIP-Forge [Sanghi et al. 2022], which uses CLIP to synthesize and manipulate 3D object representations. However, CLIP-Forge differs from Dream Fields in terms of features, resolution, and limitations: CLIP-Forge generalizes poorly outside of ShapeNet categories and requires ground-truth multi-view images and voxel data. Dream Fields proposes a method for synthesizing diverse three-dimensional objects based on natural language descriptions. This method combines neural rendering with multi-modal image and text representations, generating geometry and colors for various categories using an optimized multi-viewpoint neural radiance field without the need for three-dimensional supervision. To enhance realism and visual quality, Dream Fields introduces simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and a new MLP architecture. See Figure 12 for details. Experimental results show that Dream Fields can generate realistic and multi-viewpoint-consistent geometry and colors from a variety of natural language descriptions.

Fig. 12. Structure of Dream Fields, picture obtained from [Jain et al. 2022].

CLIP-NeRF [Wang et al. 2022a] is a novel disentangled conditional NeRF architecture that offers flexible control for editing NeRFs based on text prompts or reference images. It introduces a shape code and an appearance code to independently control the deformation of the volumetric field and the emitted colors. Two code mappers, fed by the pre-trained CLIP model, enable fast inference for editing different objects in the same category compared to optimization-based editing methods. In addition, an inversion method is proposed to infer the shape and appearance codes from a real image, enabling


users to edit the shape and appearance of existing data. CLIP-NeRF is contemporary work to Dream Fields [Jain et al. 2022]; unlike Dream Fields, it offers greater freedom in shape manipulation and supports global deformation, introducing two intuitive NeRF editing methods, using short text prompts or sample images, both of which are more user-friendly for novice users. The structure of CLIP-NeRF is shown in Figure 13.

Fig. 13. Structure of CLIP-NeRF, picture obtained from [Wang et al. 2022a].

DreamFusion [Poole et al. 2022] employs a similar approach to Dream Fields [Jain et al. 2022] to train a NeRF, using an optimization-based approach with a frozen image-text joint embedding model, but replacing the CLIP loss with a loss distilled from a 2D diffusion model. The DreamFusion architecture is shown in Figure 14, where the scene is represented by a neural radiance field initialized and trained from scratch for each description. With a pretrained text-to-image diffusion model, an image parameterized in the form of a NeRF, and a loss function whose minimization yields good samples, DreamFusion has all the necessary components for text-to-3D synthesis without the use of 3D data. For each textual prompt, DreamFusion starts its training from scratch with a randomly initialized NeRF, and each iteration of its optimization performs the same steps: randomly sample a camera and light source, render an image of the NeRF from that camera under the sampled lighting, compute gradients of the Score Distillation Sampling (SDS) loss with respect to the NeRF parameters, and update the NeRF parameters with an optimizer. By combining SDS with a NeRF variant tailored to this 3D generation task, DreamFusion maximizes the fidelity and coherence of the generated 3D shapes.

Fig. 14. Structure of DreamFusion, picture obtained from [Poole et al. 2022].
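The per-iteration update described above relies on the Score Distillation Sampling gradient, which in the DreamFusion formulation is w(t)(ε̂(x_t; y, t) − ε) ∂x/∂θ. The NumPy sketch below applies that update to a toy differentiable "renderer" whose Jacobian is the identity; the frozen noise predictor is a placeholder standing in for the pretrained text-to-image diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def frozen_eps_model(x_t, t, text_embedding):
    """Placeholder for the frozen, pretrained text-conditioned noise predictor."""
    return 0.1 * x_t                      # stand-in; a real model would be queried here

def render(theta):
    """Stand-in differentiable renderer g(theta): the 'image' is theta itself, so the
    Jacobian dx/dtheta is the identity. In DreamFusion, theta are the NeRF weights."""
    return theta

theta = rng.normal(size=(16, 16, 3))      # parameters of the toy 3D representation
text_embedding = rng.normal(size=(512,))

for step in range(100):
    x = render(theta)
    t = rng.integers(20, T)
    eps = rng.normal(size=x.shape)
    x_t = np.sqrt(alpha_bar[t]) * x + np.sqrt(1.0 - alpha_bar[t]) * eps
    w_t = 1.0 - alpha_bar[t]              # one common weighting choice
    # SDS gradient: w(t) * (eps_hat - eps) * dx/dtheta (identity Jacobian here)
    grad = w_t * (frozen_eps_model(x_t, t, text_embedding) - eps)
    theta -= 1e-2 * grad                  # gradient step on the 3D parameters

print(theta.mean())
```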
model to establish a bridge between text, 2D images, and 3D shapes,
Magic3D [Lin et al. 2022] is a framework for high-quality 3D and apply a conditional flow model to generate shape vectors condi-
content composition with textual prompts, which optimizes the tioned on the CLIP embeddings, as well as a latent diffusion model
generation process by improving several major design choices from conditioned on multi-class shape vectors. Experimental results show

, Vol. 1, No. 1, Article . Publication date: May 2023.


that the proposed framework outperforms existing methods [Poole et al. 2022; Valsesia et al. 2019; Yang et al. 2019; Zhou et al. 2021a].

3.2.2 Followers of CLIP-Based Text-Guided 3D Shape Generation. The pioneers of the text-to-3D field have reached new heights with these emerging technologies, while also bringing new challenges, such as weak text-3D alignment, slow rendering speed, and low resolution of the generated 3D models. Numerous research endeavors aim at addressing these issues, and they are investigated in this section.

DITTO-NeRF [Seo et al. 2023b] is a novel pipeline that can generate high-quality 3D NeRF models from textual prompts or single images. It introduces a progressive 3D object reconstruction scheme, covering scale, orientation and mask, which can propagate high-quality information from IB to OB. Compared with previous image/text-to-3D works such as DreamFusion [Poole et al. 2022] and NeuralLift-360 [Xu et al. 2022a], DITTO-NeRF achieves significant advantages in both quality and diversity, with shorter training time as well. 3D-CLFusion [Li and Kitani 2023] presents a novel text-to-3D creation method which utilizes pre-trained latent-variable NeRFs to complete 3D content creation rapidly, within less than a minute. To tackle the challenges faced by NeRFs, 3D-CLFusion adopts a novel approach called view-invariant diffusion, which utilizes contrastive learning [He et al. 2020] to learn the latent variables and can thus generate high-resolution 3D content rapidly at inference time. Experimental results demonstrate that 3D-CLFusion is up to 100 times faster than DreamFusion and can serve as a plug-and-play tool for text-to-3D with pre-trained NeRFs. CLIP-Sculptor [Sanghi et al. 2022] presents a text-conditioned 3D shape generation model that improves shape diversity and fidelity with only image-shape pairs as supervision, thus surpassing existing methods. The novelty of CLIP-Sculptor lies in its multi-resolution, voxel-based conditional generation scheme. Without text-shape pairs, CLIP-Sculptor learns to generate 3D shapes of common object categories from CLIP's joint text-image embedding. To achieve high-fidelity output, CLIP-Sculptor employs a multi-resolution approach. To generate diverse shapes, it employs a discrete latent representation obtained with a vector quantization scheme. To further enhance shape fidelity and diversity, it uses a masked transformer architecture.

3DFuse [Seo et al. 2023a] proposes a novel framework that incorporates 3D awareness into pre-trained 2D diffusion models to improve the robustness and 3D consistency of 2D diffusion model-based approaches. The view inconsistency problem in score-distillation-based text-to-3D generation, also known as the Janus problem [Hong et al. 2023] (as shown in Figure 16), is a major challenge that has to be overcome. 3DFuse builds a rough 3D structure given the text prompt and then uses the projected depth map of this structure as a condition for the diffusion model. In addition, a training strategy is introduced to enable the 2D diffusion model to handle the errors and sparsity of the rough 3D structure, together with a method to ensure semantic consistency across all viewpoints in the scene. Experimental results show that 3DFuse effectively addresses the 3D consistency problem and opens a new approach for 3D reconstruction with 2D diffusion models. Another work [Hong et al. 2023] proposes two debiasing methods to address the Janus problem in score distillation for 3D generation. These methods reduce artifacts and improve realism, while achieving a good balance between the fidelity of the 2D diffusion model and 3D consistency with low overhead.

Fig. 16. Comparison of the model with the Janus problem (left) [Wang et al. 2022b] and the improved model (right) [Hong et al. 2023], the picture is obtained from [Hong et al. 2023].

Although text-to-3D can produce impressive results, it is essentially unconstrained and may lack the ability to guide or enforce 3D structure. Latent-NeRF [Metzer et al. 2022] incorporates both textual guidance and shape guidance for image generation and 3D model generation, as well as a latent diffusion model for applying score distillation directly on 3D meshes. CompoNeRF [Lin et al. 2023] proposes a novel framework that explicitly combines editable 3D scene layouts, providing effective guidance at both the local and global levels for NeRFs, to address the guidance collapse problem faced in text-to-3D generation. CompoNeRF allows flexible editing and recombination of trained local NeRFs into a new scene via manipulation of the 3D layout or textual prompts, achieving faithful and editable text-to-3D results, as well as opening up potential directions for multi-object composition via text-guided editable 3D scene layouts.

DreamFusion generates volumetric representations instead of mesh representations, which makes it impractical in many downstream applications such as graphics, which require standard 3D representations such as meshes. Text2Mesh [Michel et al. 2022] presents a novel framework which is capable of editing the style of 3D objects, such as colors and local geometric details, from textual descriptions. The framework uses a fixed mesh and a learned neural field to handle low-quality meshes without UV parameterization. Furthermore, Text2Mesh does not require any pre-trained generative models or specialized 3D mesh datasets, and it can thus achieve style synthesis for a wide variety of 3D mesh shapes. TextMesh [Tsalicoglou et al. 2023] presents a new method for generating highly realistic 3D meshes from textual prompts, which addresses the problem that NeRF outputs are infeasible for most practical applications. To this end, the method extends NeRF to adopt an SDF framework, thus improving mesh extraction, and introduces a new method for fine-tuning mesh textures, eliminating over-saturation effects and enhancing


the detail of the output 3D mesh. Point·E [Nichol et al. 2022] presents an alternative approach for fast 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. The proposed method consists of two diffusion models: a text-to-image diffusion model and a point cloud diffusion model. Point·E's text-to-image model utilizes a large (text, image) corpus, allowing it to follow varied and complex prompts, while the image-to-3D model is trained on a smaller (image, 3D) pair dataset. To generate a 3D object from a textual prompt, Point·E first samples an image using its text-to-image model and then samples a 3D object conditioned on the sampled image. Both steps can be completed within a few seconds and do not require expensive optimization procedures.

4 TEXT-TO-3D APPLICATIONS
With the emergence of text-to-3D models guided by text-to-image priors, more fine-grained application domains have been developed, including text-to-avatar, text-to-texture, text-to-scene, etc. This section surveys text-guided 3D generation models in these domains that build on text-to-image priors.

4.1 Text Guided 3D Avatar Generation
In recent years, the creation of 3D graphical human models has drawn considerable attention due to its extensive applications in areas such as movie production, video gaming, AR/VR and human-computer interaction, and the creation of 3D avatars through natural language could save resources and holds great research prospects.

DreamAvatar [Cao et al. 2023] proposes a framework based on text and shape guidance for generating high-quality 3D human avatars with controllable poses. It utilizes a trainable NeRF to predict the density and color features of 3D points, as well as a pre-trained text-to-image diffusion model to provide 2D self-supervision. The SMPL model [Bogo et al. 2016] is used to provide rough pose and shape guidance for generation, together with a dual-space design comprising a canonical space and an observation space, which are related by a learnable deformation field through the NeRF, allowing optimized textures and geometry to be transferred from the canonical space to the target-pose avatar with detailed geometry and textures. Experimental results demonstrate that DreamAvatar significantly outperforms the state of the art, setting a new technical level for text- and shape-guided 3D human generation.

DreamFace [Zhang et al. 2023b] is a progressive scheme for personalized 3D facial generation guided by text. It enables ordinary users to naturally customize CG pipeline-compatible 3D facial assets with desired shapes, textures and fine-grained animation capabilities. DreamFace introduces a coarse-to-fine scheme to generate a topologically unified neutral face geometry, utilizes Score Distillation Sampling (SDS) [Rombach et al. 2022] to optimize subtle translations and normals, adopts a dual-path mechanism to generate the neutral appearance, and employs a two-stage optimization to enhance compact priors for fine-grained synthesis, as well as to improve the animation capability with personalized deformation features. DreamFace can generate realistic 3D facial assets with physically-based rendering quality and rich animation capabilities from video materials, even for fashion icons, cartoons and fictional aliens in movies.

AvatarCraft [Jiang et al. 2023] utilizes a diffusion model to guide the learning of neural avatar geometry and texture based on a single text prompt, thereby addressing the challenge of creating 3D character avatars with specified identity and artistic style that can be easily animated. It also carefully designs an optimization framework for neural implicit fields, including a coarse-to-fine multi-bounding-box training strategy, shape regularization and diffusion-based constraints, to generate high-quality geometry and texture and to make the character avatars animatable, thus simplifying the animation and reshaping of the generated avatars. Experiments demonstrate the effectiveness and robustness of AvatarCraft in creating character avatars and rendering novel views, poses, and shapes.

MotionCLIP [Tevet et al. 2022] is a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP is unique in that it aligns its latent space with that of the CLIP model, thus infusing the semantic knowledge of CLIP into the motion manifold. Furthermore, MotionCLIP leverages CLIP's visual understanding and a self-supervised motion-to-frame alignment. The contributions of this work are the text-to-motion capabilities it enables: out-of-domain actions, disentangled editing, and abstract language specification. In addition, MotionCLIP shows how the introduced latent space can be leveraged for motion interpolation, editing and recognition.

AvatarCLIP [Hong et al. 2022] introduces a text-driven framework for the production of 3D avatars and their motion generation. By utilizing the powerful vision-language model CLIP, AvatarCLIP enables non-expert users to craft customized 3D avatars with the shape and texture of their choice, and to animate them with natural language instructions. Extensive experiments indicate that AvatarCLIP exhibits superior zero-shot performance in generating unseen avatars and novel animations.

Fig. 17. 3D avatars created by a text-guided 3D generation model, picture obtained from [Cao et al. 2023].

4.2 Text Guided 3D Texture Generation
Recently, there have been a number of works on text-to-texture inspired by text-to-3D. This subsection reviews these works.

TEXTure [Richardson et al. 2023] presents a novel method for text-guided generation, editing and transfer of textures for 3D shapes. It utilizes a pre-trained depth-to-image diffusion model and an iterative scheme to paint the 3D model from different viewpoints, and proposes an elaborate sampling procedure that uses a three-state partitioning of the rendered view to generate

seamless textures from different viewpoints. Additionally, it presents a method for transferring the generated texture maps to new 3D geometries without explicit surface-to-surface mapping, a method for extracting semantic textures from a set of images without any explicit reconstruction, and a way to edit and refine existing textures using text prompts or user-provided scribbles.

TANGO [Chen et al. 2022] proposes a novel method for photorealistic rendering of appearance effects on surface meshes of arbitrary topology. Based on the CLIP model, TANGO decomposes the appearance style into spatially-varying bidirectional reflectance distribution functions, local geometric variations, and illumination conditions. This enables realistic 3D style transfer through the automatic prediction of reflectance effects, even for bare, low-quality meshes, without the need for training on task-specific datasets. Numerous experiments demonstrate that TANGO outperforms existing text-driven 3D style transfer methods in terms of realism, 3D geometric consistency, and robustness when stylizing low-quality meshes.

Fantasia3D [Chen et al. 2023a] presents a novel approach for high-quality text-to-3D content creation. The method decouples the modeling and learning of geometry and appearance, and uses a hybrid scene representation and spatially-varying bidirectional reflectance distribution function (BRDF) learning for the surface material to achieve photorealistic rendering of the generated surface. Experimental results show that the method outperforms existing approaches [Lin et al. 2022; Poole et al. 2022] and supports physically plausible simulation of relit, edited, and generated 3D assets. X-Mesh [Ma et al. 2023] presents a novel text-driven 3D stylization framework containing a novel text-guided dynamic attention module (TDAM) for more accurate attribute prediction and faster convergence. Additionally, a new standard text-mesh benchmark, MIT-30, and two automatic metrics are introduced so that future research can achieve fair and objective comparison.

Text2Tex [Chen et al. 2023b] proposes a novel approach for generating high-quality textures for 3D meshes from given text prompts. The goal of this method is to address the accumulated inconsistencies and stretching artifacts in text-driven texture generation. The method integrates inpainting and refinement into a pre-trained depth-aware image diffusion model to progressively synthesize high-resolution local textures from multiple viewpoints. Experiments show that Text2Tex significantly outperforms existing text-driven and GAN-based methods.

Fig. 18. Texturing results generated by a text-guided 3D texture model, picture obtained from [Richardson et al. 2023].

4.3 Text Guided 3D Scene Generation
3D scene modeling is a time-consuming task that usually requires professional 3D designers to complete. To make 3D scene modeling easier, 3D generation should be simple and intuitive to operate while retaining enough controllability to meet users' precise requirements. Recent works [Cohen-Bar et al. 2023; Fridman et al. 2023; Höllein et al. 2023; Po and Wetzstein 2023] in text-to-3D generation have made 3D scene modeling easier.

Set-the-Scene [Cohen-Bar et al. 2023] proposes a proxy-based global-local training framework for synthesizing 3D scenes, thus filling an important gap in controllable text-to-3D synthesis. It can learn a complete representation of each object while also creating harmonious scenes with matching style and lighting. The framework allows various editing options, such as adjusting the placement of each individual object, deleting objects from a scene, or refining objects.

[Po and Wetzstein 2023] proposes a locally conditioned diffusion approach to text-to-3D scene synthesis, which aims to make the generation of complex 3D scenes more intuitive and controllable. By providing control over semantic parts via text prompts and bounding boxes, the method ensures seamless transitions between these parts. Experiments show that the proposed method achieves higher fidelity in the composition of 3D scenes than related baselines [Liu et al. 2022; Wang et al. 2022b].

SceneScape [Fridman et al. 2023] proposes a novel text-driven perpetual view generation approach, which is capable of synthesizing long videos of arbitrary scenes solely from input text describing the scene and the camera poses. The framework combines the generative capacity of a pre-trained text-to-image model [Rombach et al. 2022] with the geometric priors learned by a pre-trained monocular depth prediction model [Ranftl et al. 2021, 2020] to generate videos in an online fashion, and achieves 3D consistency through online test-time training, yielding videos with geometry-consistent scenes. Compared with previous works that are limited to a restricted domain, this framework is able to generate diverse scenes, such as walkthroughs of a spaceship, a cave, or an ice city.

Text2Room [Höllein et al. 2023] proposes a method for generating room-scale textured 3D meshes from given text prompts. It is the first method to generate compelling textured room-scale 3D geometry solely from text input, which differs from existing methods that focus on generating single objects [Lin et al. 2022; Poole et al. 2022] or scaling trajectories (SceneScape) [Fridman et al. 2023] from text.

Text2NeRF [Zhang et al. 2023a] proposes a text-driven realistic 3D scene generation framework that combines a diffusion model with the NeRF representation to support zero-shot generation of various indoor and outdoor scenes from a variety of natural language prompts. Additionally, a progressive inpainting and updating (PIU) strategy is introduced to generate view-consistent novel content for 3D scenes, and a support set is built to provide multi-view constraints for the NeRF model during view-by-view updating. Moreover, a depth loss is employed to achieve depth-aware NeRF optimization, and a two-stage depth alignment strategy is introduced to eliminate estimated depth misalignment across different views. Experimental


Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era • 11

results demonstrate that the proposed Text2NeRF outperforms ex- that SKED can effectively modify the existing neural fields and
isting methods [Höllein et al. 2023; Mohammad Khalid et al. 2022; generate outputs that satisfy user sketches.
Poole et al. 2022; Wang et al. 2022b] in producing photo-realistic, TextDeformer [Gao et al. 2023] proposes an automatic technique
multiview consistent, and diverse 3D scenes from a variety of natural for generating input triangle mesh deformations guided entirely by
language prompts. text prompts. The framework is capable of generating large, low-
frequency shape changes as well as small, high-frequency details,
relying on differentiable rendering to connect geometry to power-
ful pre-trained image encoders such as CLIP[Radford et al. 2021]
and DINO [Caron et al. 2021]. In order to overcome the problems
of artifacts, TextDeformer proposes to use the Jacobian matrix to
represent the mesh deformation and encourages deep features to be
computed on 2D encoded rendering to ensure shape coherence from
all 3D viewpoints. Experimental results show that the method can
smoothly deform various source meshes and target text prompts to
Fig. 19. Controllable scenes generation from text prompts, the picture is achieve large modifications and add details.
obtained by [Cohen-Bar et al. 2023]

MAV3D (Make-A-Video3D) [Singer et al. 2023] is a method for generating 3D dynamic scenes from text descriptions. The motivation of MAV3D is to generate dynamic 3D content without requiring any 3D or 4D training data, thus saving a lot of time and money. The method adopts a 4D dynamic NeRF and optimizes scene appearance, density, and motion consistency by querying a diffusion-based Text-to-Video (T2V) model [Singer et al. 2022]. Quantitative and qualitative experiments show that MAV3D improves over previously established internal baselines. MAV3D is the first method to generate 3D dynamic scenes given text descriptions.
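As a rough schematic (our notation, not the exact MAV3D formulation), the static radiance field is extended with a time input,

\[
F_\theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)
\quad\longrightarrow\quad
F_\theta : (\mathbf{x}, \mathbf{d}, t) \mapsto (\mathbf{c}, \sigma),
\]

and the sequence of frames rendered from this 4D representation is supervised by the denoising error of the T2V diffusion model, in the spirit of the score distillation objective of [Poole et al. 2022] applied to videos rather than single images.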

4.4 Text Guided 3D Shape Transformation

The traditional process of editing 3D models involves dedicated tools and years of training in order to manually carve, extrude, and re-texture given objects. This process is laborious and costly in terms of resources, and recently some research works have attempted text-guided 3D model editing instead, as demonstrated in this section.
Instruct-NeRF2NeRF [Haque et al. 2023] presents a novel text-instructed editing method for NeRF scenes, which uses an iterative image-based diffusion model (InstructPix2Pix) [Brooks et al. 2022] to edit the input images while optimizing the underlying scene, thereby producing an optimized 3D scene that follows the edit instructions. Experimental results show that the method is capable of editing large-scale real-world scenes, achieving more realistic and targeted edits than previous works.
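The editing loop can be sketched as a simple iterative dataset update; the following is a minimal illustration in which render_view, instruct_pix2pix_edit, and nerf_train_step are placeholder stubs of ours, not the authors' API:

import random

def render_view(nerf_state, view_idx):
    # Render the current NeRF at training view `view_idx` (stub).
    return f"render_{view_idx}_step{nerf_state['step']}"

def instruct_pix2pix_edit(rendered, original, instruction):
    # Edit the rendered image with a text instruction, conditioned on the
    # originally captured image (stub for an InstructPix2Pix-style model).
    return f"edited({rendered} | {original} | {instruction})"

def nerf_train_step(nerf_state, images):
    # One optimization step of the underlying scene representation (stub).
    nerf_state["step"] += 1

def edit_nerf_scene(captures, instruction, rounds=10, steps_per_round=3):
    # `captures` holds the originally captured training images, one per view.
    nerf_state = {"step": 0}
    training_set = list(captures)  # images the NeRF is currently trained on
    for _ in range(rounds):
        idx = random.randrange(len(training_set))
        rendered = render_view(nerf_state, idx)
        # Replace one training image with its text-edited counterpart.
        training_set[idx] = instruct_pix2pix_edit(rendered, captures[idx], instruction)
        # Keep optimizing the scene on the progressively edited dataset.
        for _ in range(steps_per_round):
            nerf_train_step(nerf_state, training_set)
    return nerf_state, training_set

if __name__ == "__main__":
    state, images = edit_nerf_scene(["img0", "img1", "img2"], "make it look like autumn")
    print(state["step"], images[0])

Because edited images are fed back gradually while the scene keeps training, the individual 2D edits are consolidated into a single, view-consistent 3D edit.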
Instruct 3D-to-3D [Kamata et al. 2023] presents a novel 3D-to-3D conversion method that applies a pre-trained image-to-image diffusion model to transform a source 3D scene according to text instructions. Furthermore, the method proposes dynamic scaling, as well as explicit conditioning on the input source 3D scene, to enhance 3D consistency and controllability. Quantitative and qualitative evaluations demonstrate that Instruct 3D-to-3D achieves higher-quality 3D-to-3D transformation than the baseline methods [Poole et al. 2022; Wang et al. 2022a].

Fig. 20. Converting input 3D scenes according to text instructions; picture obtained from [Kamata et al. 2023].
SKED [Mikaeili et al. 2023] presents a sketch-based technique for editing 3D shapes represented by NeRF. The motivation is to introduce interactive editing into the text-to-3D pipeline, enabling users to edit from a more intuitive form of control. Results show that SKED can effectively modify existing neural fields and generate outputs that satisfy the user sketches.
TextDeformer [Gao et al. 2023] proposes an automatic technique for deforming an input triangle mesh guided entirely by text prompts. The framework is capable of producing large, low-frequency shape changes as well as small, high-frequency details, relying on differentiable rendering to connect the geometry to powerful pre-trained image encoders such as CLIP [Radford et al. 2021] and DINO [Caron et al. 2021]. To overcome artifact problems, TextDeformer represents the mesh deformation with per-face Jacobians and encourages the deep features computed on the 2D renderings to agree across views, ensuring shape coherence from all 3D viewpoints. Experimental results show that the method can smoothly deform various source meshes toward the target text prompts, achieving both large modifications and added detail.
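At a high level, this kind of Jacobian-based, encoder-guided deformation can be summarized by the following optimization; this is a sketch in our notation, not the exact objective of TextDeformer:

\[
\min_{\{J_f\}} \;\; \mathbb{E}_{v}\Big[\, 1 - \cos\big( E_{\mathrm{img}}\big(R_v(V^{\star})\big),\; E_{\mathrm{text}}(y) \big) \Big]
\qquad \text{with} \qquad
V^{\star} \;=\; \arg\min_{V} \sum_{f} A_f \big\| \nabla_f V - J_f \big\|_F^2,
\]

where $J_f$ is the Jacobian assigned to triangle $f$, $A_f$ its area, the deformed vertex positions $V^{\star}$ are recovered from the Jacobians by a Poisson solve, $R_v$ renders the mesh from viewpoint $v$, $E_{\mathrm{img}}$ and $E_{\mathrm{text}}$ are pre-trained encoders such as CLIP, and $y$ is the text prompt. Regularization of the Jacobians (e.g., keeping them close to the identity) and feature-consistency terms across viewpoints are what suppress the artifacts mentioned above.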

5 DISCUSSION

The text-to-3D generation paradigm that combines NeRF with text-to-image priors is an emerging research direction with the strong advantage of generating diverse content. However, it also suffers from a number of issues, such as long inference times, 3D consistency problems, poor controllability, and generated content that cannot yet be applied well to industrial needs.

5.1 Fidelity

For 3D asset generation, constrained by the weak supervision and low resolution of CLIP, the upscaled results are not perfect. In addition, it is difficult to generate more varied portraits under the same prompt, since for the same prompt the CLIP text features are always the same. Fidelity and speed are two indicators that often have to be traded off, as improving fidelity usually comes at the cost of slower inference. At the same time, downstream application requirements should also be taken into account: films require high-precision models, while games often require quantity more than film-level precision.

5.2 Inference velocity

A fatal issue of generating 3D content by leveraging pre-trained diffusion models as a powerful prior and learning objective is that the inference process is too slow. Even at a resolution of just 64×64, DreamFusion [Poole et al. 2022] takes 1.5 hours of inference per prompt on a TPUv4, and the inference time rises quickly as the resolution increases. This is mainly because the inference process for generating 3D content actually amounts to training a Neural Radiance Field [Mildenhall et al. 2021] from scratch; NeRF models are renowned for their slow training and inference, and training such a deep network takes a lot of time. Magic3D [Lin et al. 2022] addresses the time issue with a two-phase optimization framework: first, a coarse model is obtained by leveraging a low-resolution diffusion prior, and second, optimization is accelerated with a sparse 3D hash grid structure. 3D-CLFusion [Li and Kitani 2023] utilizes a pre-trained latent NeRF and performs fast 3D content creation in less than a minute.
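Concretely, each such inference run is a per-prompt optimization of the NeRF parameters $\theta$ against the frozen 2D diffusion prior via score distillation sampling (SDS), whose gradient, as introduced by DreamFusion [Poole et al. 2022], takes the form

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\big(\phi, \mathbf{x}=g(\theta)\big) \;=\;
\mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big( \hat{\epsilon}_\phi(\mathbf{x}_t;\, y, t) - \epsilon \big)\, \frac{\partial \mathbf{x}}{\partial \theta} \right],
\]

where $\mathbf{x}=g(\theta)$ is a view rendered from the current 3D representation, $\mathbf{x}_t$ its noised version at timestep $t$, $\hat{\epsilon}_\phi$ the denoiser of the pre-trained diffusion model conditioned on the prompt $y$, and $w(t)$ a weighting function. Every prompt therefore requires many thousands of render-and-denoise iterations, which is what drives the inference times reported above.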
5.3 Consistency

Distortion and ghosting are often encountered in the 3D scenes generated by DreamFusion [Poole et al. 2022], and unstable 3D scenes are observed when text prompts and random seeds are changed. This issue is mainly caused by the lack of 3D awareness in the 2D prior diffusion model: the diffusion model has no knowledge of which direction the object is observed from, so front-view geometry features are generated for all viewpoints, including the sides and the back, which seriously distorts the 3D scene. This failure mode is usually referred to as the Janus problem. [Hong et al. 2023] proposed two debiasing methods to address such issues: score debiasing, which gradually increases the truncation value applied to the 2D diffusion model's estimation throughout the optimization process, and prompt debiasing, which employs a language model to identify conflicts between the user prompt and the view prompts and to adjust the view prompts to the spatial camera pose of the object. 3DFuse [Seo et al. 2023a] optimizes the training process so that the 2D diffusion model learns to handle erroneous and sparse 3D structures for robust generation, and further introduces a way to guarantee the semantic consistency of all viewpoints in the scene.
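A minimal illustration of the view-dependent prompting that such prompt-level debiasing builds on is given below; the azimuth thresholds and wording are illustrative choices of ours and are not taken from [Hong et al. 2023]:

def view_dependent_prompt(prompt: str, azimuth_deg: float, elevation_deg: float) -> str:
    # Append a coarse view descriptor to the user prompt based on the camera pose.
    if elevation_deg > 60.0:
        view = "overhead view"
    else:
        a = azimuth_deg % 360.0
        if a < 45.0 or a >= 315.0:
            view = "front view"
        elif 135.0 <= a < 225.0:
            view = "back view"
        else:
            view = "side view"
    return f"{prompt}, {view}"

if __name__ == "__main__":
    for az in (0.0, 90.0, 180.0, 270.0):
        print(view_dependent_prompt("a DSLR photo of a corgi", az, 15.0))

Prompt debiasing then goes a step further by letting a language model detect words in the user prompt that conflict with such view words and by reconciling the view prompt with the actual camera pose.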
5.4 Controllability

Although text-to-3D can generate impressive results, the text-to-image diffusion models behind it are essentially unconstrained, so they generally tend to suffer from guidance collapse, which makes them less capable of accurately associating object semantics with specific 3D structures. The issue of poor controllability has long existed in the text-to-image generation task, where ControlNet [Zhang and Agrawala 2023] addresses it by adding extra input conditions, such as Canny edges, Hough lines, and depth maps, to make the generation process of large-scale text-to-image models more controllable. In text-to-3D, Latent-NeRF [Metzer et al. 2022] allows for increased control over the 3D generation process through its combination of text and shape guidance, and CompoNeRF [Lin et al. 2023] is capable of precisely associating guidance with particular structures via its integration of editable 3D layouts and multiple local NeRFs, addressing the guidance failure issue when generating multi-object 3D scenes.

5.5 Applicability

Although NeRF, as a novel 3D representation, cannot be directly applied in traditional 3D application scenarios, its powerful representation capability gives it broad application prospects. The greatest advantage of NeRF is that it can be trained from 2D images. Google has already begun to use NeRFs to transform street map images into immersive views on Google Maps. In the future, NeRFs can complement other technologies to represent 3D objects in the metaverse, augmented reality, and digital twins more efficiently, accurately, and realistically. To further improve these applications, future research may focus on extracting 3D meshes, point clouds, or SDFs from the density MLP, and on integrating faster NeRF models. It also remains to be seen whether the paradigm of purely text-guided shape generation can cope with all scenarios; incorporating a more intuitive guidance mechanism, such as sketch guidance or picture guidance, might be a more reasonable choice.

6 CONCLUSION

This work conducts the first yet comprehensive survey on text-to-3D. Specifically, we summarize text-to-3D from three aspects: data representations, technologies, and applications. We hope this survey can help readers quickly understand the field of text-to-3D and inspire more future works to explore text-to-3D.

REFERENCES
Panos Achlioptas, Judy Fan, Robert Hawkins, Noah Goodman, and Leonidas J Guibas. 2019. ShapeGlot: Learning language for shape differentiation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8938–8947.
Parag Agarwal and Balakrishnan Prabhakaran. 2009. Robust blind watermarking of point-sampled geometry. IEEE Transactions on Information Forensics and Security 4, 1 (2009), 36–48.
Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova, Rig Das, Gleb Gusev, Djamila Aouada, and Bjorn Ottersten. 2018. A survey on deep learning advances on different 3D data representations. arXiv preprint arXiv:1808.01462 (2018).
James F Blinn. 2005. What is a pixel? IEEE computer graphics and applications 25, 5 (2005), 82–87.
Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. 2016. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, 561–578.
Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2022. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800 (2022).
Wenming Cao, Zhiyue Yan, Zhiquan He, and Zhihai He. 2020. A comprehensive survey on geometric deep learning. IEEE Access 8 (2020), 35929–35949.
Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong. 2023. DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models. arXiv preprint arXiv:2304.00916 (2023).
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 9650–9660.
Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. 2023b. Text2Tex: Text-driven Texture Synthesis via Diffusion Models. arXiv preprint arXiv:2303.11396 (2023).
Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. 2019. Text2shape: Generating shapes from natural language by learning joint embeddings. In Computer Vision–ACCV 2018: 14th Asian Conference
on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Yiqi Lin, Haotian Bai, Sijia Li, Haonan Lu, Xiaodong Lin, Hui Xiong, and Lin Wang.
Part III 14. Springer, 100–116. 2023. CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D
Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. 2023a. Fantasia3D: Disentangling Scene Layout. arXiv preprint arXiv:2303.13843 (2023).
Geometry and Appearance for High-quality Text-to-3D Content Creation. arXiv Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. 2022.
preprint arXiv:2303.13873 (2023). Compositional visual generation with composable diffusion models. In Computer
Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, and Kui Jia. 2022. Tango: Text-driven Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022,
photorealistic and robust 3d stylization via lighting decomposition. arXiv preprint Proceedings, Part XVII. Springer, 423–439.
arXiv:2210.11277 (2022). Weiping Liu, Jia Sun, Wanyi Li, Ting Hu, and Peng Wang. 2019. Deep learning on point
Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, and Daniel Cohen-Or. 2023. clouds and its application: A survey. Sensors 19, 19 (2019), 4188.
Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes. Yiwei Ma, Xiaioqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang,
arXiv preprint arXiv:2303.13450 (2023). Weilin Zhuang, and Rongrong Ji. 2023. X-Mesh: Towards Fast and Accurate Text-
Yinpeng Dong, Shouwei Ruan, Hang Su, Caixin Kang, Xingxing Wei, and Jun Zhu. driven 3D Stylization via Dynamic Textual Guidance. arXiv preprint arXiv:2303.15764
2022. ViewFool: Evaluating the Robustness of Visual Recognition to Adversarial (2023).
Viewpoints. arXiv preprint arXiv:2210.03895 (2022). Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas
Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. 2023. Scenescape: Text- Geiger. 2019. Occupancy networks: Learning 3d reconstruction in function space.
driven consistent scene generation. arXiv preprint arXiv:2302.01133 (2023). In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, and Srinath Sridhar. 2022. Shapecrafter: 4460–4470.
A recursive text-conditioned 3d shape generation model. arXiv preprint Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. 2022.
arXiv:2207.09446 (2022). Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures. arXiv preprint
Yasutaka Furukawa, Carlos Hernández, et al. 2015. Multi-view stereo: A tutorial. arXiv:2211.07600 (2022).
Foundations and Trends® in Computer Graphics and Vision 9, 1-2 (2015), 1–148. Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. 2022.
Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF
Litany, Zan Gojcic, and Sanja Fidler. 2022b. Get3d: A generative model of high Conference on Computer Vision and Pattern Recognition. 13492–13502.
quality 3d textured shapes learned from images. Advances In Neural Information Aryan Mikaeili, Or Perel, Daniel Cohen-Or, and Ali Mahdavi-Amiri. 2023. SKED:
Processing Systems 35 (2022), 31841–31854. Sketch-guided Text-based 3D Editing. arXiv preprint arXiv:2303.10735 (2023).
Kyle Gao, Yina Gao, Hongjie He, Denning Lu, Linlin Xu, and Jonathan Li. 2022a. Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ra-
Nerf: Neural radiance field in 3d vision, a comprehensive review. arXiv preprint mamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields
arXiv:2210.00379 (2022). for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
William Gao, Noam Aigerman, Thibault Groueix, Vladimir G Kim, and Rana Hanocka. Ludovico Minto, Pietro Zanuttigh, and Giampaolo Pagnutti. 2018. Deep Learning for
2023. TextDeformer: Geometry Manipulation using Text Guidance. arXiv preprint 3D Shape Classification based on Volumetric Density and Surface Approximation
arXiv:2304.13348 (2023). Clues.. In VISIGRAPP (5: VISAPP). 317–324.
Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. 2022.
2020. Deep learning for 3d point clouds: A survey. IEEE transactions on pattern CLIP-Mesh: Generating textured meshes from text using pretrained image-text
analysis and machine intelligence 43, 12 (2020), 4338–4364. models. In SIGGRAPH Asia 2022 Conference Papers. 1–8.
Bo Han, Yitong Liu, and Yixuan Shen. 2023. Zero3D: Semantic-Driven Multi-Category Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob
3D Shape Generation. arXiv preprint arXiv:2301.13591 (2023). McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic
Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel image generation and editing with text-guided diffusion models. arXiv preprint
Cohen-Or. 2019. Meshcnn: a network with an edge. ACM Transactions on Graphics arXiv:2112.10741 (2021).
(TOG) 38, 4 (2019), 1–12. Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. 2022.
Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Point-E: A System for Generating 3D Point Clouds from Complex Prompts. arXiv
Kanazawa. 2023. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. arXiv preprint arXiv:2212.08751 (2022).
preprint arXiv:2303.12789 (2023). Jaesik Park, Sudipta N Sinha, Yasuyuki Matsushita, Yu-Wing Tai, and In So Kweon. 2016.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momen- Robust multiview photometric stereo using planar mesh parameterization. IEEE
tum contrast for unsupervised visual representation learning. In Proceedings of the transactions on pattern analysis and machine intelligence 39, 8 (2016), 1591–1604.
IEEE/CVF conference on computer vision and pattern recognition. 9729–9738. Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Love-
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic grove. 2019. Deepsdf: Learning continuous signed distance functions for shape
models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851. representation. In Proceedings of the IEEE/CVF conference on computer vision and
Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. 2023. pattern recognition. 165–174.
Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. arXiv Lim Wen Peng and Siti Mariyam Shamsuddin. 2004. 3D object reconstruction and rep-
preprint arXiv:2303.11989 (2023). resentation using neural networks. In Proceedings of the 2nd international conference
Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. on Computer graphics and interactive techniques in Australasia and South East Asia.
2022. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. 139–147.
arXiv preprint arXiv:2205.08535 (2022). Ryan Po and Gordon Wetzstein. 2023. Compositional 3D Scene Generation using
Susung Hong, Donghoon Ahn, and Seungryong Kim. 2023. Debiasing Scores and Locally Conditioned Diffusion. arXiv preprint arXiv:2303.12218 (2023).
Prompts of 2D Diffusion for Robust Text-to-3D Generation. arXiv preprint Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. Dreamfusion:
arXiv:2303.15413 (2023). Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022).
Dongting Hu, Zhenkai Zhang, Tingbo Hou, Tongliang Liu, Huan Fu, and Mingming Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. Pointnet: Deep
Gong. 2023. Multiscale Representation for Real-Time Anti-Aliasing Neural Render- learning on point sets for 3d classification and segmentation. In Proceedings of the
ing. arXiv preprint arXiv:2304.10075 (2023). IEEE conference on computer vision and pattern recognition. 652–660.
Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. 2022. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini
Zero-shot text-guided object generation with dream fields. In Proceedings of the Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021.
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 867–876. Learning transferable visual models from natural language supervision. In Interna-
Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong tional conference on machine learning. PMLR, 8748–8763.
Chen, and Jing Liao. 2023. AvatarCraft: Transforming Text into Neural Human Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022.
Avatars with Parameterized Shape and Pose Control. arXiv preprint arXiv:2303.17606 Hierarchical text-conditional image generation with clip latents. arXiv preprint
(2023). arXiv:2204.06125 (2022).
Yiwei Jin, Diqiong Jiang, and Ming Cai. 2020. 3d reconstruction using deep learning: a Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Rad-
survey. Communications in Information and Systems 20, 4 (2020), 389–413. ford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In
Hiromichi Kamata, Yuiko Sakuma, Akio Hayakawa, Masato Ishii, and Takuya Narihira. International Conference on Machine Learning. PMLR, 8821–8831.
2023. Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion. arXiv René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision transformers
preprint arXiv:2303.15780 (2023). for dense prediction. In Proceedings of the IEEE/CVF International Conference on
Yu-Jhe Li and Kris Kitani. 2023. 3D-CLFusion: Fast Text-to-3D Rendering with Con- Computer Vision. 12179–12188.
trastive Latent Diffusion. arXiv preprint arXiv:2303.11938 (2023). René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun.
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, 2020. Towards robust monocular depth estimation: Mixing datasets for zero-shot
Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2022. Magic3D: High- cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence
Resolution Text-to-3D Content Creation. arXiv preprint arXiv:2211.10440 (2022). 44, 3 (2020), 1623–1637.


Konstantinos Rematas and Vittorio Ferrari. 2020. Neural voxel renderer: Learning an Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang
accurate and controllable rendering tool. In Proceedings of the IEEE/CVF Conference Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al.
on Computer Vision and Pattern Recognition. 5417–5427. 2023. MVImgNet: A Large-scale Dataset of Multi-view Images. arXiv preprint
Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. 2023. arXiv:2303.06042 (2023).
Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721 (2023). Chaoning Zhang, Chenshuang Zhang, Chenghao Li, Yu Qiao, Sheng Zheng, Sumit Ku-
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. mar Dam, Mengchun Zhang, Jung Uk Kim, Seong Tae Kim, Jinwoo Choi, et al.
2022. High-resolution image synthesis with latent diffusion models. In Proceedings of 2023c. One small step for generative ai, one giant leap for agi: A complete survey
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695. on chatgpt in aigc era. arXiv preprint arXiv:2304.06488 (2023).
Alessandro Rossi, Marco Barbiero, Paolo Scremin, and Ruggero Carli. 2021. Robust Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. 2023d. Text-
Visibility Surface Determination in Object Space via Plücker Coordinates. Journal to-image Diffusion Model in Generative AI: A Survey. arXiv preprint arXiv:2303.07909
of Imaging 7, 6 (2021), 96. (2023).
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun
Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Zhang, Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, et al.
et al. 2022. Photorealistic text-to-image diffusion models with deep language under- 2023e. A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to
standing. Advances in Neural Information Processing Systems 35 (2022), 36479–36494. GPT-5 All You Need? arXiv preprint arXiv:2303.11717 (2023).
Aditya Sanghi, Rao Fu, Vivian Liu, Karl Willis, Hooman Shayani, AmirHosein Khasah- Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. 2023a. Text2NeRF:
madi, Srinath Sridhar, and Daniel Ritchie. 2022. CLIP-Sculptor: Zero-Shot Genera- Text-Driven 3D Scene Generation with Neural Radiance Fields. arXiv preprint
tion of High-Fidelity and Diverse Shapes from Natural Language. arXiv preprint arXiv:2305.11588 (2023).
arXiv:2211.01427 (Nov 2022). Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wight- diffusion models. arXiv preprint arXiv:2302.05543 (2023).
man, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye
man, et al. 2022. Laion-5b: An open large-scale dataset for training next generation Shi, Sibei Yang, Lan Xu, and Jingyi Yu. 2023b. DreamFace: Progressive Generation of
image-text models. arXiv preprint arXiv:2210.08402 (2022). Animatable 3D Faces under Text Guidance. arXiv preprint arXiv:2304.03117 (2023).
Hoigi Seo, Hayeon Kim, Gwanghyun Kim, and Se Young Chun. 2023b. DITTO- Hang Zhou, Weiming Zhang, Kejiang Chen, Weixiang Li, and Nenghai Yu. 2021b. Three-
NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model. arXiv preprint dimensional mesh steganography and steganalysis: a review. IEEE Transactions on
arXiv:2304.02827 (2023). Visualization and Computer Graphics (2021).
Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Linqi Zhou, Yilun Du, and Jiajun Wu. 2021a. 3d shape generation and completion
Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. 2023a. Let 2D Diffusion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Confer-
Model Know 3D-Consistency for Robust Text-to-3D Generation. arXiv preprint ence on Computer Vision. 5826–5835.
arXiv:2303.07937 (2023).
Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. 2021. Deep
marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis.
Advances in Neural Information Processing Systems 34 (2021), 6087–6101.
Zifan Shi, Sida Peng, Yinghao Xu, Yiyi Liao, and Yujun Shen. 2022. Deep generative
models on 3d representations: A survey. arXiv preprint arXiv:2210.15663 (2022).
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan
Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-a-video: Text-to-video
generation without text-video data. arXiv preprint arXiv:2209.14792 (2022).
Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokki-
nos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. 2023. Text-
to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280 (2023).
Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. 2022.
Motionclip: Exposing human motion generation to clip space. In Computer Vision–
ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings,
Part XXII. Springer, 358–374.
Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Fed-
erico Tombari. 2023. TextMesh: Generation of Realistic 3D Meshes From Text
Prompts. arXiv preprint arXiv:2304.12439 (2023).
Diego Valsesia, Giulia Fracastoro, and Enrico Magli. 2019. Learning localized generative
models for 3d point clouds via graph convolution. In International conference on
learning representations.
Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. 2022a. Clip-
nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3835–3844.
Cheng Wang, Ming Cheng, Ferdous Sohel, Mohammed Bennamoun, and Jonathan Li.
2019. NormalNet: A voxel-based CNN for 3D object classification and retrieval.
Neurocomputing 323 (2019), 139–147.
Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich.
2022b. Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D
Generation. arXiv preprint arXiv:2212.00774 (2022).
He Wang and Juyong Zhang. 2022. A Survey of Deep Learning-Based Mesh Processing.
Communications in Mathematics and Statistics 10, 1 (2022), 163–194.
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu
Philip. 2020. A comprehensive survey on graph neural networks. IEEE transactions
on neural networks and learning systems 32, 1 (2020), 4–24.
Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang.
2022a. NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360°
Views. arXiv e-prints (2022), arXiv–2211.
Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and
Shenghua Gao. 2022b. Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape
Prior and Text-to-Image Diffusion Models. arXiv preprint arXiv:2212.14704 (2022).
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath
Hariharan. 2019. Pointflow: 3d point cloud generation with continuous normalizing
flows. In Proceedings of the IEEE/CVF international conference on computer vision.
4541–4550.
