Generative AI Meets 3D: A Survey On Text-to-3D in AIGC Era

Authors' addresses: Chenghao Li, KAIST, South Korea, [email protected]; Chaoning Zhang, Kyung Hee University, South Korea, [email protected]; Atish Waghwase, KAIST, South Korea, [email protected]; Lik-Hang Lee, Hong Kong Polytechnic University, Hong Kong (China), [email protected]; Francois Rameau, State University of New York, Korea, [email protected]; Yang Yang, University of Electronic Science and Technology of China, [email protected]; Sung-Ho Bae, Kyung Hee University, South Korea, [email protected]; Choong Seon Hong, Kyung Hee University, South Korea, [email protected].

1 INTRODUCTION
Generative Artificial Intelligence, in which AI acts as the main body generating high-quality content in large quantities (also known as Artificial Intelligence Generated Content, AIGC), has aroused great attention in the past few years. The content generation paradigm guided and constrained by natural language, such as text-to-text (e.g. ChatGPT [Zhang et al. 2023c]) and text-to-image [Zhang et al. 2023d] (e.g. DALLE-2 [Ramesh et al. 2022]), is the most practical one, as it allows simple interaction between human guidance and generative AI [Zhang et al. 2023e]. The accomplishments of generative AI in the field of text-to-image [Zhang et al. 2023d] are quite remarkable. As we live in a 3D world, it is necessary to extend AIGC to the 3D domain. There is great demand for 3D digital content in many application scenarios, including games, movies, virtual reality, architecture and robotics, covering tasks such as 3D character generation, 3D texture generation and 3D scene generation.
However, cultivating a professional 3D modeler requires a great deal of artistic and aesthetic training, as well as professional knowledge of 3D modeling. Given the current trend of 3D model development, it is essential to utilize generative AI to generate high-quality and large-scale 3D models. In addition, text-to-3D AI modeling can greatly assist both novices and professionals in freely creating 3D content.

Previous methods for generating 3D shapes from text have attempted to learn a cross-modal mapping by directly training on text-3D pairs [Achlioptas et al. 2019; Chen et al. 2019] and generating joint representations. Compared to text-to-image, the task of generating 3D shapes from text is more challenging. Firstly, unlike 2D images, 3D shapes are mostly unstructured and irregular non-Euclidean data, making it difficult to apply traditional 2D deep learning models to them. Moreover, while many large-scale image-text pair datasets are available online to support text-to-image generation, to our knowledge the largest text-to-3D dataset, proposed in [Fu et al. 2022], contains only 369K text-3D pairs and is limited to a few object categories. This is far smaller than image-text datasets containing 5.85B pairs [Schuhmann et al. 2022]. The lack of large-scale and high-quality training data makes the task of text-to-3D even more difficult.
Recently, the advent of several key technologies has enabled a new paradigm for text-to-3D tasks. Firstly, Neural Radiance Fields (NeRF) [Mildenhall et al. 2021] is an emergent 3D data representation approach. Initially, NeRF was found to perform well in 3D reconstruction tasks, and recently NeRF and other neural 3D representations have been applied to novel view synthesis tasks that can use real-world RGB photos. NeRF is trained to reconstruct images from multiple viewpoints. As the learned radiance field is shared between viewpoints, NeRF can smoothly and consistently interpolate between them. Due to its neural representation, NeRF can be sampled at high spatial resolution, unlike voxel representations and point clouds, and is easier to optimize than meshes and other explicit geometric representations, since it is topology-free. The advent of NeRF breaks the stalemate of 3D data scarcity and serves as a soft bridge from 2D to 3D representation, elegantly alleviating the problem of 3D data scarcity. Secondly, with the remarkable progress of multimodal AI [Radford et al. 2021] and diffusion models [Ho et al. 2020], text-guided image generation has made significant progress. The key driving factor is the large-scale datasets of billions of text-image pairs obtained from the Internet. Recent works have emerged that guide 3D modeling optimization by leveraging the prior of a pre-trained text-to-image generation model. In other words, they perform text-guided 3D model generation with text-to-image priors.

Overall, this work conducts the first yet comprehensive survey on text-to-3D. The rest of this work is organized as follows. Sec. 2 reviews the different representations of 3D data. Sec. 3 first introduces the technologies behind text-to-3D and then summarizes recent papers. Sec. 4 introduces the applications of text-to-3D in various fields.

2 3D DATA REPRESENTATION
3D data can have different representations [Ahmed et al. 2018], divided into Euclidean and non-Euclidean. 3D Euclidean data has an underlying grid structure, which allows global parameterization and a common coordinate system. These properties make extending existing 2D deep learning paradigms to 3D data a simple task, where convolution operations remain essentially the same as in 2D. On the other hand, 3D non-Euclidean data has no gridded array structure and is not globally parameterized. Therefore, extending classical deep learning techniques to such representations is a challenging task. In real life, the research of deep learning techniques in non-Euclidean domains is of great significance; this is referred to as geometric deep learning [Cao et al. 2020].

2.1 Euclidean data
Euclidean data preserves the properties of a grid structure, with global parameterization and a common coordinate system. The major 3D data representations in this category are voxel grids and multi-view images.

Fig. 1. Voxel representation of the Stanford bunny; the picture is obtained from [Shi et al. 2022].

2.1.1 Voxel Grids. Voxels can be used to represent individual samples or data points on a regularly spaced three-dimensional grid, and are a Euclidean-structured data type analogous to pixels [Blinn 2005] in 2D space. A data point can contain a single value, such as opacity, or multiple values, such as color and opacity. Voxels can also store high-dimensional feature vectors at data points, such as geometric occupancy [Mescheder et al. 2019], volumetric density [Minto et al. 2018], or signed distance values [Park et al. 2019]. A voxel represents only a point on this grid, not a volume; the space between voxels is not represented in a voxel-based dataset. Depending on the data type and the intended use of the dataset, this lost information can be reconstructed and/or approximated, for example by interpolation. The voxel representation is simple and its spatial structure is clear; it is highly extensible and can easily be processed by convolutional neural networks [Wang et al. 2019]. However, its efficiency is low, since it represents both the occupied and the unoccupied parts of the scene, which leads to a large amount of unnecessary storage. This makes voxels unsuitable for representing high-resolution data. Voxel grids have many applications in rendering tasks [Hu et al. 2023; Rematas and Ferrari 2020]. Early methods store high-dimensional feature vectors in voxels to encode the geometry and appearance of the scene, usually referred to as a feature volume, which can be interpreted as a color image using projection and 2D convolutional neural networks. Voxel applications also include volumetric imaging in medicine, as well as terrain representation in games and simulations.
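The storage cost of dense voxel grids mentioned above can be made concrete with a small sketch. The following minimal NumPy example (not from the survey; the sphere and resolutions are arbitrary choices for illustration) builds a dense occupancy grid and prints how memory grows cubically with resolution even though most voxels carry no surface information.

```python
import numpy as np

def sphere_occupancy(resolution: int, radius: float = 0.4) -> np.ndarray:
    """Dense boolean occupancy grid of a sphere centered in the unit cube."""
    # Regularly spaced sample points in [0, 1]^3, one per voxel.
    coords = np.linspace(0.0, 1.0, resolution)
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    dist = np.sqrt((x - 0.5) ** 2 + (y - 0.5) ** 2 + (z - 0.5) ** 2)
    return dist <= radius  # True where the voxel is occupied

for res in (32, 64, 128, 256):
    grid = sphere_occupancy(res)
    # Memory grows as O(res^3) even though most voxels are empty or far from
    # the surface, which is why dense grids struggle at high resolution.
    print(f"{res}^3 grid: {grid.nbytes / 2**20:.1f} MiB, "
          f"{grid.mean() * 100:.1f}% occupied")
```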
2.1.2 Multi-view Images. With the development of computer vision technology and the remarkable improvement in computing power, multi-view images, i.e. collections of images of the same object or scene captured from different viewpoints, have become a practical Euclidean representation of 3D data.
2.2 Non-Euclidean data
Deep networks that operate directly on raw point clouds can take the point cloud data as input and use a set of sparse keypoints to summarize the input point cloud; they process data efficiently, are robust to small perturbations of the input, and achieve good performance in tasks such as shape classification, part segmentation, and scene segmentation. 3D point cloud technology can be applied to many fields, such as architecture, engineering, civil building design, geological survey, machine vision, agriculture, spatial information, and autonomous driving, and can provide more accurate modeling and analysis, as well as more accurate positioning and tracking.
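To make the point-cloud processing described above concrete, the sketch below shows a minimal permutation-invariant encoder in PyTorch: a shared per-point MLP followed by max pooling, which summarizes a cloud into a single feature vector. This is only an illustrative sketch of the general idea, not the architecture of any specific published network; the class name and feature size are made up for the example.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Minimal permutation-invariant encoder: shared per-point MLP + max-pool.

    The max-pooled feature summarizes the cloud; the points that attain the
    maximum in each channel act as a sparse set of "critical" keypoints.
    """
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) raw xyz coordinates
        per_point = self.mlp(points)            # (batch, num_points, feat_dim)
        global_feat, _ = per_point.max(dim=1)   # symmetric aggregation over points
        return global_feat                      # (batch, feat_dim)

# A global feature like this can feed a classification or segmentation head.
encoder = PointCloudEncoder()
cloud = torch.rand(2, 1024, 3)          # two clouds of 1024 points each
print(encoder(cloud).shape)             # torch.Size([2, 256])
```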
2.2.3 Neural Fields. Neural fields are fields that are wholly or partially parameterized by neural networks and represent scenes or objects in 3D space entirely or in part. At each point in 3D space, a neural network maps the associated characteristics to attributes. Due to their continuous representation, neural fields can represent 3D scenes or objects at arbitrary resolution and with unknown or complex topology. Additionally, compared to the representations above, only the parameters of the neural network need to be stored, resulting in lower memory consumption. The earliest work on neural fields used them for 3D shape representation [Peng and Shamsuddin 2004]. The signed distance function (SDF) [Park et al. 2019] is a classical approach to representing 3D shapes as neural fields: it defines a continuous volumetric field whose value at each point is the distance to the surface together with a sign indicating inside or outside. Several works [Gao et al. 2022b; Shen et al. 2021] use SDFs to generate 3D shapes. NeRF [Mildenhall et al. 2021], a recently emerging neural-field representation for 3D reconstruction, has the advantage of high-quality and realistic 3D model generation, presenting realistic object surfaces and texture details at any angle and distance. Furthermore, it can generate 3D models from any number of input images without specific processing or labeling of the inputs. Another advantage of neural fields is that the neural network can run on low-power devices once trained: polygon ray tracing requires expensive graphics cards to render high-resolution, realistic scenes at high frame rates, whereas high-quality neural fields can be rendered on mobile phones and even in web browsers. However, neural field technology also has drawbacks, such as the large amount of computational resources and time needed for training, difficulty in handling large-scale scenes and complex lighting conditions, and the fact that it is not structured data, which makes it difficult to apply directly to existing 3D asset pipelines. Neural fields are an emerging 3D representation technology with strong application prospects and can be used in 3D fields such as VR/AR and games.

3 TEXT-TO-3D TECHNOLOGIES
In the past few years, the success of deep generative models [Ho et al. 2020] on 2D images has been incredible. However, training generative models only in 2D space cannot meet the needs of some practical applications, as our physical world operates in 3D space, so 3D data generation is of paramount importance. The success of Neural Radiance Fields [Mildenhall et al. 2021] has transformed the 3D reconstruction race, bringing 3D data to a whole new level. Combining prior knowledge from text-to-image models [Ramesh et al. 2021], many pioneers have achieved remarkable results in text-to-3D generation. In this section, we first review the key techniques underlying text-to-3D generation and then survey recent text-to-3D models.

Fig. 6. An overview of the neural radiance field scene representation and differentiable rendering procedure; the picture is obtained from [Mildenhall et al. 2021].

3.1.1 NeRF. Neural Radiance Field (NeRF) [Gao et al. 2022a; Mildenhall et al. 2021] is a neural network-based implicit representation of 3D scenes that can render projected images from a given viewpoint and position. Specifically, given a 3D point x ∈ R^3 and a viewing direction d (expressed as two spherical angles, d ∈ R^2), NeRF encodes the scene as a continuous volumetric radiance field f that yields a volume density σ and an RGB color c: f(x, d) = (σ, c). Rendering an image from a desired perspective is achieved by integrating color along the ray r corresponding to each pixel, in accordance with the volume rendering equation:

\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(t)\,\mathbf{c}(t)\,dt,   (1)

T(t) = \exp\left(-\int_{t_n}^{t} \sigma(s)\,ds\right).   (2)

The transmittance T(t) is the probability that light travels from the near bound t_n to t without being absorbed. To train the NeRF network, the predicted color \hat{C} of the ray corresponding to each pixel in the training images is optimized by gradient descent to match the target pixel color using the loss

\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\| C(\mathbf{r}) - \hat{C}(\mathbf{r}) \right\|_2^2,   (3)

where \mathcal{R} denotes the set of rays corresponding to the training pixels.
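In practice, the integral in Eq. (1) is approximated by discrete samples along each ray using the standard alpha-compositing quadrature. The following NumPy sketch illustrates this for a single ray and evaluates the per-ray term of the loss in Eq. (3); the analytic density bump and constant color stand in for a trained NeRF network and are purely hypothetical.

```python
import numpy as np

def render_ray(sigma, color, t_vals):
    """Discretize Eq. (1): alpha-composite samples along one ray.

    sigma:  (N,)   volume density at each sample
    color:  (N, 3) RGB at each sample
    t_vals: (N,)   sample depths between t_near and t_far
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)             # spacing between samples
    alpha = 1.0 - np.exp(-sigma * deltas)                          # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # T(t), Eq. (2)
    weights = trans * alpha
    return (weights[:, None] * color).sum(axis=0)                  # estimated pixel color

# Hypothetical analytic field standing in for a trained NeRF MLP:
t = np.linspace(2.0, 6.0, 64)
sigma = np.exp(-(t - 4.0) ** 2 / 0.1) * 50.0            # a density bump near t = 4
color = np.tile(np.array([0.8, 0.3, 0.2]), (64, 1))     # constant reddish color

c_hat = render_ray(sigma, color, t)
c_gt = np.array([0.8, 0.3, 0.2])                        # ground-truth pixel color
loss = np.sum((c_gt - c_hat) ** 2)                      # per-ray term of Eq. (3)
print(c_hat, loss)
```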
3.1.2 CLIP. Recent advances in multimodal learning have enabled the development of cross-modal matching models such as CLIP (Contrastive Language-Image Pre-training) [Radford et al. 2021], which learn shared representations from image-text pairs. These models can produce a scalar score that indicates whether an image and its associated caption match.

Fig. 7. CLIP structure; the picture is obtained from [Radford et al. 2021].
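The matching score described above is simply a cosine similarity in CLIP's shared embedding space, which is what CLIP-guided 3D methods maximize over rendered views. The sketch below uses the open-source openai/CLIP package (installable via pip from the GitHub repository) following its published API; the image path and the two candidate captions are placeholders for this example.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "chair.png" is a placeholder path; any rendering or photo works.
image = preprocess(Image.open("chair.png")).unsqueeze(0).to(device)
texts = clip.tokenize(["a wooden chair", "a red sports car"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    # Cosine similarity in the shared embedding space is the image-text
    # matching score that CLIP-guided 3D methods optimize over rendered views.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = img_feat @ txt_feat.T

print(scores)   # higher value for the caption that matches the image
```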
3.1.3 Diffusion Models. Denoising diffusion probabilistic models (DDPM) [Ho et al. 2020] generate data by gradually corrupting training samples with Gaussian noise in a forward process and learning to invert this corruption. The forward process is defined as

q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}),   (4)

q(x_t \mid x_{t-1}) := \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),   (5)

where T is the number of steps and the β_t are hyper-parameters. We can obtain a noised image at an arbitrary step t with the Gaussian transition kernel \mathcal{N} in Eq. (5) by setting α_t := 1 − β_t and \bar{\alpha}_t := \prod_{s=0}^{t} \alpha_s:

q(x_t \mid x_0) := \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right).   (6)

The reverse denoising process of DDPM involves learning to undo the forward diffusion by performing iterative denoising, thereby generating data from random noise. This process is formally defined as a stochastic process whose optimization objective is that p_θ(x_0), obtained by starting from the Gaussian prior p(x_T), follows the true data distribution q(x_0).

Fig. 8. Overview of DDPM; the picture is obtained from [Ho et al. 2020].

Fig. 9. Structure of DALLE-2; the picture is obtained from [Ramesh et al. 2022].

Fig. 11. Illustration of the main idea of CLIP-Forge; the picture is obtained from [Sanghi et al. 2022].
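A small sketch makes Eqs. (4)-(6) concrete: with a β schedule fixed in advance, a noised sample x_t can be drawn directly from x_0 in a single step. The linear schedule values below are typical choices rather than anything prescribed by the survey.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # β_t schedule (typical values, assumed here)
alphas = 1.0 - betas                            # α_t := 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)       # ᾱ_t := Π α_s

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) in one step, Eq. (6)."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)                   # a clean "image"
for t in (0, 249, 999):
    xt = q_sample(x0, t)
    print(f"t={t:4d}  a_bar={alpha_bars[t]:.4f}  std(x_t)={xt.std():.3f}")
# As t approaches T, ᾱ_t approaches 0 and x_t approaches pure Gaussian noise,
# which is the starting point of the learned reverse denoising process.
```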
CLIP-NeRF [Wang et al. 2022a] allows users to edit the shape and appearance of existing data. CLIP-NeRF is a contemporary work of Dream Fields [Jain et al. 2022], and unlike Dream Fields, it offers greater freedom in shape manipulation and supports global deformation, introducing two intuitive NeRF editing methods, driven either by short text prompts or by exemplar images, both of which are friendly to novice users. The structure of CLIP-NeRF is shown in Fig. 13.

Fig. 13. Structure of CLIP-NeRF; the picture is obtained from [Wang et al. 2022a].

In DreamFusion [Poole et al. 2022], the supervision signal operates on very low-resolution images of 64 × 64, so DreamFusion is unable to synthesize high-frequency 3D geometry and texture details. Because of its inefficient MLP architecture for representing NeRF, practical high-resolution synthesis may even be impossible, as memory usage and the computation budget grow rapidly with resolution. Magic3D [Lin et al. 2022] proposes a two-stage optimization framework for text-to-3D synthesis with NeRF. In the first stage, Magic3D optimizes a coarse neural field representation similar to DreamFusion, but with a hash-grid-based, memory- and computation-efficient scene representation. In the second stage, Magic3D shifts to optimizing a mesh representation, leveraging a diffusion prior at resolutions up to 512 × 512; an overview of Magic3D is shown in Figure 15. As 3D meshes fit well with fast graphics rendering pipelines that can render high-resolution images in real time, Magic3D also uses an efficient differentiable rasterizer to recover high-frequency details in geometry and texture from camera close-ups. Magic3D synthesizes 3D content with 8 times higher resolution and 2 times faster speed than DreamFusion.
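The DreamFusion/Magic3D-style recipe discussed above couples a differentiable renderer with a frozen text-to-image diffusion prior via score distillation. The sketch below is only schematic: the tiny 2D "field", the stub prior, the placeholder text embedding, and the (1 − ᾱ_t) weighting are stand-ins chosen for this example, whereas a real system would render a NeRF or mesh and query an actual pre-trained diffusion model.

```python
import torch
import torch.nn as nn

class TinyField(nn.Module):
    """Stand-in for a NeRF/mesh representation: maps pixel coords to RGB."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                                 nn.Linear(64, 3), nn.Sigmoid())

    def render(self, res: int = 64) -> torch.Tensor:
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, res),
                                torch.linspace(-1, 1, res), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
        return self.net(coords).T.reshape(1, 3, res, res)   # (1, 3, H, W) "rendered view"

class StubDiffusionPrior(nn.Module):
    """Placeholder for a frozen pre-trained text-to-image diffusion model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)

    def predict_noise(self, x_t, t, text_embedding):
        return self.net(x_t)                                 # real models condition on t and text

field, prior = TinyField(), StubDiffusionPrior()
for p in prior.parameters():
    p.requires_grad_(False)                                  # the 2D prior stays frozen
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
text_emb = torch.randn(1, 77, 512)                           # placeholder text embedding

for step in range(100):
    image = field.render()                                   # differentiable rendering
    t = torch.randint(20, 980, (1,)).item()                  # random noise level
    noise = torch.randn_like(image)
    x_t = alpha_bars[t].sqrt() * image + (1 - alpha_bars[t]).sqrt() * noise
    eps_pred = prior.predict_noise(x_t, t, text_emb)
    # Score-distillation-style gradient: the prior's noise residual is pushed
    # back into the 3D parameters without differentiating through the prior.
    grad = ((1 - alpha_bars[t]) * (eps_pred - noise)).detach()
    opt.zero_grad()
    image.backward(gradient=grad)
    opt.step()
```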
Point·E [Nichol et al. 2022] presents an alternative approach for fast 3D object generation, producing 3D models in only 1-2 minutes on a single GPU. The proposed method consists of two diffusion models: a text-to-image diffusion model and a point cloud diffusion model. Point·E's text-to-image model is trained on a large (text, image) corpus, allowing it to follow a variety of complex prompts, while the image-to-3D model is trained on a smaller (image, 3D) pair dataset. To generate a 3D object from a text prompt, Point·E first samples an image using its text-to-image model and then samples a 3D object conditioned on the sampled image. Both steps can be completed within a few seconds and do not require expensive optimization processes.

4 TEXT-TO-3D APPLICATIONS
With the emergence of text-to-3D models guided by text-to-image priors, more fine-grained application domains have been developed, including text-to-avatar, text-to-texture, text-to-scene, etc. This section surveys text-guided 3D generation models built on text-to-image priors.

4.1 Text Guided 3D Avatar Generation
In recent years, the creation of 3D graphical human models has drawn considerable attention due to its extensive applications in areas such as movie production, video gaming, AR/VR and human-computer interaction, and creating 3D avatars through natural language could save resources and holds great research promise.

DreamAvatar [Cao et al. 2023] proposes a framework based on text and shape guidance for generating high-quality 3D human avatars with controllable poses. It utilizes a trainable NeRF to predict the density and color features of 3D points, and a pre-trained text-to-image diffusion model to provide 2D self-supervision. The SMPL model [Bogo et al. 2016] is used to provide rough pose and shape guidance for generation, together with a dual-space design consisting of a canonical space and an observation space related by a learnable deformation field through NeRF, allowing optimized textures and geometry to be transferred from the canonical space to the target-pose avatar with detailed geometry and textures. Experimental results demonstrate that DreamAvatar significantly outperforms the state of the art, setting a new technical level for 3D human generation based on text and shape guidance.

DreamFace [Zhang et al. 2023b] is a progressive scheme for personalized 3D facial generation guided by text. It enables ordinary users to naturally customize CG-pipeline-compatible 3D facial assets with desired shapes, textures and fine-grained animation capabilities. DreamFace introduces a coarse-to-fine scheme to generate a topologically unified neutral face geometry, utilizes Score Distillation Sampling (SDS) [Rombach et al. 2022] to optimize subtle translations and normals, adopts a dual-path mechanism to generate the neutral appearance, and employs two-stage optimization to enhance compact priors for fine-grained synthesis as well as to improve the animation capability of personalized deformation features. DreamFace can generate realistic 3D facial assets with physically based rendering quality and rich animation capabilities from video materials, even for fashion icons, cartoons and fictional aliens in movies.

AvatarCraft [Jiang et al. 2023] utilizes a diffusion model to guide the learning of neural avatar geometry and texture from a single text prompt, thereby addressing the challenge of creating 3D character avatars with a specified identity and artistic style that can be easily animated. It also carefully designs an optimization framework for neural implicit fields, including a coarse-to-fine multi-bounding-box training strategy, shape regularization and diffusion-based constraints, to generate high-quality geometry and texture and to make the character avatars animatable, thus simplifying the animation and reshaping of the generated avatars. Experiments demonstrate the effectiveness and robustness of AvatarCraft in creating character avatars and rendering novel views, poses, and shapes.

MotionCLIP [Tevet et al. 2022] is a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP is unique in that it aligns its latent space with that of the CLIP model, thus infusing the semantic knowledge of CLIP into the motion manifold. Furthermore, MotionCLIP leverages CLIP's visual understanding and a self-supervised motion-to-frame alignment. The contributions of this work are the text-to-motion capabilities it enables, out-of-domain actions, disentangled editing, and abstract language specification. In addition, MotionCLIP shows how the introduced latent space can be leveraged for motion interpolation, editing and recognition.

AvatarCLIP [Hong et al. 2022] introduces a text-driven framework for 3D avatar creation and motion generation. By utilizing the powerful vision-language model CLIP, AvatarCLIP enables non-expert users to craft customized 3D avatars with the shape and texture of their choice and to animate them with natural language instructions. Extensive experiments indicate that AvatarCLIP exhibits superior zero-shot performance in generating unseen avatars and novel animations.

Fig. 17. 3D avatar created by a text-guided 3D generation model; the picture is obtained from [Cao et al. 2023].

4.2 Text Guided 3D Texture Generation
Recently, a number of works on text-to-texture have appeared, inspired by text-to-3D. This subsection summarizes these works.

TEXTure [Richardson et al. 2023] presents a novel method for text-guided generation, editing and transfer of textures for 3D shapes. It utilizes a pre-trained depth-to-image diffusion model and an iterative scheme to paint the 3D model from different viewpoints, and proposes a novel elaborated sampling procedure based on a three-way segmentation map (trimap) to generate seamless textures across viewpoints. Additionally, it presents a method for transferring the generated texture maps to new 3D geometry without explicit surface-to-surface mapping, a method for extracting semantic textures from a set of images without any explicit reconstruction, and a way to edit and refine existing textures with text hints or user-provided scribbles.
TANGO [Chen et al. 2022] proposes a novel method for photorealistic rendering of appearance effects on surface meshes of arbitrary topology. Based on the CLIP model, it decomposes the appearance style into spatially varying bidirectional reflectance distribution functions, local geometric variations, and illumination conditions. This enables realistic 3D style transfer through the automatic prediction of reflectance effects, even for bare, low-quality meshes, without the need for training on task-specific datasets. Numerous experiments demonstrate that TANGO outperforms existing text-driven 3D style transfer methods in terms of realism, 3D geometry consistency, and robustness when stylizing low-quality meshes.

Fantasia3D [Chen et al. 2023a] presents a novel approach for high-quality text-to-3D content creation. The method disentangles geometry and appearance modeling and learning, and uses a hybrid scene representation together with spatially varying bidirectional reflectance distribution function (BRDF) learning for the surface material to achieve photorealistic rendering of the generated surface. Experimental results show that the method outperforms existing approaches [Lin et al. 2022; Poole et al. 2022] and supports physically plausible simulation of relit, edited, and generated 3D assets.

X-Mesh [Ma et al. 2023] presents a novel text-driven 3D stylization framework containing a text-guided dynamic attention module (TDAM) for more accurate attribute prediction and faster convergence. Additionally, a new standard text-mesh benchmark, MIT-30, and two automated metrics are introduced so that future research can make fair and objective comparisons.

Text2Tex [Chen et al. 2023b] proposes a novel approach for generating high-quality textures for 3D meshes from given text prompts. The goal of this method is to address the accumulated inconsistency and stretching artifacts in text-driven texture generation. The method integrates inpainting and refinement into a pre-trained depth-aware image diffusion model to progressively synthesize high-resolution local textures from multiple viewpoints. Experiments show that Text2Tex significantly outperforms existing text-driven and GAN-based methods.

Fig. 18. Texturing results generated by a text-guided 3D texture model; the picture is obtained from [Richardson et al. 2023].

4.3 Text Guided 3D Scene Generation
3D scene modeling is a time-consuming task that usually requires professional 3D designers. To make 3D scene modeling easier, 3D generation should be simple and intuitive to operate while retaining enough controllability to meet users' precise requirements. Recent works [Cohen-Bar et al. 2023; Fridman et al. 2023; Höllein et al. 2023; Po and Wetzstein 2023] in text-to-3D generation have made 3D scene modeling easier.

Set-the-Scene [Cohen-Bar et al. 2023] proposes a proxy-based global-local training framework for synthesizing 3D scenes, filling an important gap in controllable text-to-3D synthesis. It can learn a complete representation of each object while also creating harmonious scenes with matching style and lighting. The framework allows various editing options, such as adjusting the placement of each individual object, deleting objects from a scene, or refining objects.

[Po and Wetzstein 2023] propose a locally conditioned diffusion-based text-to-3D scene synthesis approach, which aims to make the generation of complex 3D scenes more intuitive and controllable. By providing control over semantic parts via text prompts and bounding boxes, the method ensures seamless transitions between these parts. Experiments show that the proposed method achieves higher fidelity in the composition of 3D scenes than related baselines [Liu et al. 2022; Wang et al. 2022b].

SceneScape [Fridman et al. 2023] proposes a novel text-driven approach for perpetual view generation, capable of synthesizing long videos of arbitrary scenes based solely on input text describing the scene and camera poses. The framework combines the generative capacity of a pre-trained text-to-image model [Rombach et al. 2022] with the geometric priors learned by a pre-trained monocular depth prediction model [Ranftl et al. 2021, 2020] to generate videos in an online fashion, and achieves 3D consistency through online test-time training, producing videos with geometrically consistent scenes. Compared with previous works that are limited to restricted domains, this framework can generate diverse scenes, such as walking through a spaceship, a cave, or an ice city.

Text2Room [Höllein et al. 2023] proposes a method for generating room-scale textured 3D meshes from given text prompts. It is the first method to generate compelling textured room-scale 3D geometry solely from text input, which distinguishes it from existing methods that focus on generating single objects [Lin et al. 2022; Poole et al. 2022] or zoom-out trajectories (SceneScape) [Fridman et al. 2023] from text.
Text2NeRF [Zhang et al. 2023a] proposes a text-driven realistic 3D scene generation framework that combines a diffusion model with the NeRF representation to support zero-shot generation of various indoor/outdoor scenes from a variety of natural language prompts. Additionally, a progressive inpainting and updating (PIU) strategy is introduced to generate view-consistent novel content for 3D scenes, and a support set is built to provide multi-view constraints for the NeRF model during view-by-view updating. Moreover, a depth loss is employed to achieve depth-aware NeRF optimization, and a two-stage depth alignment strategy is introduced to eliminate estimated depth misalignment across different views. Experimental results demonstrate that Text2NeRF outperforms existing methods [Höllein et al. 2023; Mohammad Khalid et al. 2022; Poole et al. 2022; Wang et al. 2022b] in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts.

Fig. 19. Controllable scene generation from text prompts; the picture is obtained from [Cohen-Bar et al. 2023].

SKED [Mikaeili et al. 2023] is a sketch-guided, text-based 3D editing method; experiments show that it can effectively modify existing neural fields and generate outputs that satisfy user sketches.

TextDeformer [Gao et al. 2023] proposes an automatic technique for deforming an input triangle mesh guided entirely by a text prompt. The framework is capable of producing large, low-frequency shape changes as well as small, high-frequency details, relying on differentiable rendering to connect the geometry to powerful pre-trained image encoders such as CLIP [Radford et al. 2021] and DINO [Caron et al. 2021]. To overcome artifact problems, TextDeformer proposes to represent the mesh deformation with Jacobians and encourages deep features computed on the 2D renderings to agree, ensuring shape coherence from all 3D viewpoints. Experimental results show that the method can smoothly deform various source meshes toward target text prompts, achieving large modifications and adding details.
5.2 Inference velocity
A critical issue with generating 3D content by leveraging pre-trained diffusion models as a powerful prior and learning objective is that the inference process is too slow. Even at a resolution of just 64×64, DreamFusion [Poole et al. 2022] takes 1.5 hours of optimization per prompt on a TPUv4, and inference time rises quickly as resolution increases. This is mainly because the inference process for generating 3D content actually trains a Neural Radiance Field [Mildenhall et al. 2021] from scratch. Notably, NeRF models are renowned for their slow training and inference speeds, and training a deep network takes a lot of time. Magic3D [Lin et al. 2022] has addressed the time issue by using a two-phase optimization framework: first, a coarse model is obtained by leveraging a low-resolution diffusion prior, and second, acceleration is achieved by using a sparse 3D hash grid structure. 3D-CLFusion [Li and Kitani 2023] utilizes a pre-trained latent NeRF and performs fast 3D content creation in less than a minute.

5.3 Consistency
Distortion and ghost images are often encountered in the 3D scenes generated by DreamFusion [Poole et al. 2022], and unstable 3D scenes are observed when text prompts and random seeds are changed. This issue is mainly caused by the lack of 3D awareness in 2D prior diffusion models: the 2D diffusion model has no knowledge of which direction the object is observed from, leading to serious distortion of 3D scenes by generating front-view geometry features from all viewpoints, including the sides and the back, which is usually referred to as the Janus problem. [Hong et al. 2023] proposed two debiasing methods to address such issues: score debiasing, which involves gradually increasing the truncation value of the 2D diffusion model's estimation throughout the optimization process; and prompt debiasing, which employs a language model to recognize conflicts between the user prompts and the view prompts and to adjust the discrepancy between the view prompts and the spatial camera pose of objects. 3DFuse [Seo et al. 2023a] optimizes the training process so that the 2D diffusion model learns to handle erroneous and sparse 3D structures for robust generation, and introduces a mechanism that ensures semantic consistency across all viewpoints of the scene.

5.4 Controllability
Although text-to-3D can generate impressive results, the underlying text-to-image diffusion models are essentially unconstrained, so they generally tend to suffer from guidance collapse, which makes them less capable of accurately associating object semantics with specific 3D structures. The issue of poor controllability has long been prominent in text-to-image generation, where ControlNet [Zhang and Agrawala 2023] addresses it by adding extra input conditions, such as Canny edges, Hough lines and depth maps, to make the generation process of large-scale text-to-image models more controllable. In the 3D setting, Latent-NeRF [Metzer et al. 2022] allows for increased control over the generation process through its unique combination of text and shape guidance. CompoNeRF [Lin et al. 2023] is capable of precisely associating guidance with particular structures via its integration of editable 3D layouts and multiple local NeRFs, addressing the guidance failure issue when generating multi-object 3D scenes.

5.5 Applicability
Although NeRF, a novel 3D representation, cannot be directly applied to traditional 3D application scenarios, its powerful representation capabilities give it broad application prospects. The greatest advantage of NeRF is that it can be trained with 2D images. Google has already begun to use NeRFs to transform street map images into immersive views on Google Maps. In the future, NeRFs can supplement other technologies to more efficiently, accurately, and realistically represent 3D objects in the metaverse, augmented reality, and digital twins. To further improve these applications, future research may focus on extracting 3D meshes, point clouds, or SDFs from the density MLP, and on integrating faster NeRF models. It remains to be seen whether the paradigm of purely text-guided shape generation can cope with all scenarios. Perhaps incorporating a more intuitive guidance mechanism, such as sketch guidance or image guidance, might be a more reasonable choice.

6 CONCLUSION
This work conducts the first yet comprehensive survey on text-to-3D. Specifically, we summarize text-to-3D from three aspects: data representations, technologies and applications. We hope this survey can help readers quickly understand the field of text-to-3D and inspire more future works to explore text-to-3D.

REFERENCES
Panos Achlioptas, Judy Fan, Robert Hawkins, Noah Goodman, and Leonidas J Guibas. 2019. ShapeGlot: Learning language for shape differentiation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8938–8947.
Parag Agarwal and Balakrishnan Prabhakaran. 2009. Robust blind watermarking of point-sampled geometry. IEEE Transactions on Information Forensics and Security 4, 1 (2009), 36–48.
Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova, Rig Das, Gleb Gusev, Djamila Aouada, and Bjorn Ottersten. 2018. A survey on deep learning advances on different 3D data representations. arXiv preprint arXiv:1808.01462 (2018).
James F Blinn. 2005. What is a pixel? IEEE Computer Graphics and Applications 25, 5 (2005), 82–87.
Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. 2016. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, 561–578.
Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2022. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800 (2022).
Wenming Cao, Zhiyue Yan, Zhiquan He, and Zhihai He. 2020. A comprehensive survey on geometric deep learning. IEEE Access 8 (2020), 35929–35949.
Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong. 2023. DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models. arXiv preprint arXiv:2304.00916 (2023).
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9650–9660.
Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. 2023b. Text2Tex: Text-driven Texture Synthesis via Diffusion Models. arXiv preprint arXiv:2303.11396 (2023).
Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. 2019. Text2Shape: Generating shapes from natural language by learning joint embeddings. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14. Springer, 100–116.
Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. 2023a. Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation. arXiv preprint arXiv:2303.13873 (2023).
Yongwei Chen, Rui Chen, Jiabao Lei, Yabin Zhang, and Kui Jia. 2022. TANGO: Text-driven photorealistic and robust 3D stylization via lighting decomposition. arXiv preprint arXiv:2210.11277 (2022).
Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, and Daniel Cohen-Or. 2023. Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes. arXiv preprint arXiv:2303.13450 (2023).
Yinpeng Dong, Shouwei Ruan, Hang Su, Caixin Kang, Xingxing Wei, and Jun Zhu. 2022. ViewFool: Evaluating the Robustness of Visual Recognition to Adversarial Viewpoints. arXiv preprint arXiv:2210.03895 (2022).
Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. 2023. SceneScape: Text-driven consistent scene generation. arXiv preprint arXiv:2302.01133 (2023).
Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, and Srinath Sridhar. 2022. ShapeCrafter: A recursive text-conditioned 3D shape generation model. arXiv preprint arXiv:2207.09446 (2022).
Yasutaka Furukawa, Carlos Hernández, et al. 2015. Multi-view stereo: A tutorial. Foundations and Trends® in Computer Graphics and Vision 9, 1-2 (2015), 1–148.
Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. 2022b. GET3D: A generative model of high quality 3D textured shapes learned from images. Advances in Neural Information Processing Systems 35 (2022), 31841–31854.
Kyle Gao, Yina Gao, Hongjie He, Denning Lu, Linlin Xu, and Jonathan Li. 2022a. NeRF: Neural radiance field in 3D vision, a comprehensive review. arXiv preprint arXiv:2210.00379 (2022).
William Gao, Noam Aigerman, Thibault Groueix, Vladimir G Kim, and Rana Hanocka. 2023. TextDeformer: Geometry Manipulation using Text Guidance. arXiv preprint arXiv:2304.13348 (2023).
Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. 2020. Deep learning for 3D point clouds: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 12 (2020), 4338–4364.
Bo Han, Yitong Liu, and Yixuan Shen. 2023. Zero3D: Semantic-Driven Multi-Category 3D Shape Generation. arXiv preprint arXiv:2301.13591 (2023).
Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or. 2019. MeshCNN: a network with an edge. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–12.
Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. 2023. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. arXiv preprint arXiv:2303.12789 (2023).
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9729–9738.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. 2023. Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. arXiv preprint arXiv:2303.11989 (2023).
Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. 2022. AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars. arXiv preprint arXiv:2205.08535 (2022).
Susung Hong, Donghoon Ahn, and Seungryong Kim. 2023. Debiasing Scores and Prompts of 2D Diffusion for Robust Text-to-3D Generation. arXiv preprint arXiv:2303.15413 (2023).
Dongting Hu, Zhenkai Zhang, Tingbo Hou, Tongliang Liu, Huan Fu, and Mingming Gong. 2023. Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering. arXiv preprint arXiv:2304.10075 (2023).
Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. 2022. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 867–876.
Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. 2023. AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control. arXiv preprint arXiv:2303.17606 (2023).
Yiwei Jin, Diqiong Jiang, and Ming Cai. 2020. 3D reconstruction using deep learning: a survey. Communications in Information and Systems 20, 4 (2020), 389–413.
Hiromichi Kamata, Yuiko Sakuma, Akio Hayakawa, Masato Ishii, and Takuya Narihira. 2023. Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion. arXiv preprint arXiv:2303.15780 (2023).
Yu-Jhe Li and Kris Kitani. 2023. 3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion. arXiv preprint arXiv:2303.11938 (2023).
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2022. Magic3D: High-Resolution Text-to-3D Content Creation. arXiv preprint arXiv:2211.10440 (2022).
Yiqi Lin, Haotian Bai, Sijia Li, Haonan Lu, Xiaodong Lin, Hui Xiong, and Lin Wang. 2023. CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout. arXiv preprint arXiv:2303.13843 (2023).
Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. 2022. Compositional visual generation with composable diffusion models. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII. Springer, 423–439.
Weiping Liu, Jia Sun, Wanyi Li, Ting Hu, and Peng Wang. 2019. Deep learning on point clouds and its application: A survey. Sensors 19, 19 (2019), 4188.
Yiwei Ma, Xiaioqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang, Weilin Zhuang, and Rongrong Ji. 2023. X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance. arXiv preprint arXiv:2303.15764 (2023).
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4460–4470.
Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. 2022. Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures. arXiv preprint arXiv:2211.07600 (2022).
Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. 2022. Text2Mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13492–13502.
Aryan Mikaeili, Or Perel, Daniel Cohen-Or, and Ali Mahdavi-Amiri. 2023. SKED: Sketch-guided Text-based 3D Editing. arXiv preprint arXiv:2303.10735 (2023).
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
Ludovico Minto, Pietro Zanuttigh, and Giampaolo Pagnutti. 2018. Deep Learning for 3D Shape Classification based on Volumetric Density and Surface Approximation Clues. In VISIGRAPP (5: VISAPP). 317–324.
Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. 2022. CLIP-Mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia 2022 Conference Papers. 1–8.
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. 2022. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. arXiv preprint arXiv:2212.08751 (2022).
Jaesik Park, Sudipta N Sinha, Yasuyuki Matsushita, Yu-Wing Tai, and In So Kweon. 2016. Robust multiview photometric stereo using planar mesh parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 8 (2016), 1591–1604.
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 165–174.
Lim Wen Peng and Siti Mariyam Shamsuddin. 2004. 3D object reconstruction and representation using neural networks. In Proceedings of the 2nd International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia. 139–147.
Ryan Po and Gordon Wetzstein. 2023. Compositional 3D Scene Generation using Locally Conditioned Diffusion. arXiv preprint arXiv:2303.12218 (2023).
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022).
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 652–660.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022).
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12179–12188.
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. 2020. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 3 (2020), 1623–1637.
Konstantinos Rematas and Vittorio Ferrari. 2020. Neural voxel renderer: Learning an accurate and controllable rendering tool. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5417–5427.
Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. 2023. TEXTure: Text-guided texturing of 3D shapes. arXiv preprint arXiv:2302.01721 (2023).
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
Alessandro Rossi, Marco Barbiero, Paolo Scremin, and Ruggero Carli. 2021. Robust Visibility Surface Determination in Object Space via Plücker Coordinates. Journal of Imaging 7, 6 (2021), 96.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35 (2022), 36479–36494.
Aditya Sanghi, Rao Fu, Vivian Liu, Karl Willis, Hooman Shayani, AmirHosein Khasahmadi, Srinath Sridhar, and Daniel Ritchie. 2022. CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language. arXiv preprint arXiv:2211.01427 (2022).
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022).
Hoigi Seo, Hayeon Kim, Gwanghyun Kim, and Se Young Chun. 2023b. DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model. arXiv preprint arXiv:2304.02827 (2023).
Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. 2023a. Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation. arXiv preprint arXiv:2303.07937 (2023).
Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. 2021. Deep marching tetrahedra: a hybrid representation for high-resolution 3D shape synthesis. Advances in Neural Information Processing Systems 34 (2021), 6087–6101.
Zifan Shi, Sida Peng, Yinghao Xu, Yiyi Liao, and Yujun Shen. 2022. Deep generative models on 3D representations: A survey. arXiv preprint arXiv:2210.15663 (2022).
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022).
Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. 2023. Text-to-4D dynamic scene generation. arXiv preprint arXiv:2301.11280 (2023).
Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. 2022. MotionCLIP: Exposing human motion generation to CLIP space. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. Springer, 358–374.
Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. 2023. TextMesh: Generation of Realistic 3D Meshes From Text Prompts. arXiv preprint arXiv:2304.12439 (2023).
Diego Valsesia, Giulia Fracastoro, and Enrico Magli. 2019. Learning localized generative models for 3D point clouds via graph convolution. In International Conference on Learning Representations.
Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. 2022a. CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3835–3844.
Cheng Wang, Ming Cheng, Ferdous Sohel, Mohammed Bennamoun, and Jonathan Li. 2019. NormalNet: A voxel-based CNN for 3D object classification and retrieval. Neurocomputing 323 (2019), 139–147.
Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. 2022b. Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation. arXiv preprint arXiv:2212.00774 (2022).
He Wang and Juyong Zhang. 2022. A Survey of Deep Learning-Based Mesh Processing. Communications in Mathematics and Statistics 10, 1 (2022), 163–194.
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2020), 4–24.
Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. 2022a. NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views. arXiv e-prints (2022), arXiv–2211.
Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. 2022b. Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models. arXiv preprint arXiv:2212.14704 (2022).
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. 2019. PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4541–4550.
Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. 2023. MVImgNet: A Large-scale Dataset of Multi-view Images. arXiv preprint arXiv:2303.06042 (2023).
Chaoning Zhang, Chenshuang Zhang, Chenghao Li, Yu Qiao, Sheng Zheng, Sumit Kumar Dam, Mengchun Zhang, Jung Uk Kim, Seong Tae Kim, Jinwoo Choi, et al. 2023c. One small step for generative AI, one giant leap for AGI: A complete survey on ChatGPT in AIGC era. arXiv preprint arXiv:2304.06488 (2023).
Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. 2023d. Text-to-image Diffusion Model in Generative AI: A Survey. arXiv preprint arXiv:2303.07909 (2023).
Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, et al. 2023e. A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need? arXiv preprint arXiv:2303.11717 (2023).
Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. 2023a. Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields. arXiv preprint arXiv:2305.11588 (2023).
Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023).
Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. 2023b. DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance. arXiv preprint arXiv:2304.03117 (2023).
Hang Zhou, Weiming Zhang, Kejiang Chen, Weixiang Li, and Nenghai Yu. 2021b. Three-dimensional mesh steganography and steganalysis: a review. IEEE Transactions on Visualization and Computer Graphics (2021).
Linqi Zhou, Yilun Du, and Jiajun Wu. 2021a. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5826–5835.