A Comprehensive Survey on 3D Content Generation
Jian Liu1,2, Xiaoshui Huang2*, Tianyu Huang1, Lu Chen2,4, Yuenan Hou2, Shixiang Tang5, Ziwei Liu3, Wanli Ouyang2, Wangmeng Zuo1, Junjun Jiang1, Xianming Liu1*
1 Harbin Institute of Technology, 2 Shanghai AI Laboratory, 3 S-Lab, Nanyang Technological University, 4 Zhejiang University, 5 The Chinese University of Hong Kong
[email protected], [email protected], [email protected]
* Corresponding author
[Figure 1: Timeline of milestone 3D native generative methods from April 2019 to April 2024, grouped into object, scene, and human generation; methods shown include SMPL, PIFu, Text2Shape, HeadNeRF, SMPLicit, ShapeCrafter, Point-E, Shap-E, TextField3D, LRM, and DMV3D.]
(DDPM) framework. DDPM trains a model to perform the reverse diffusion process: starting from a noisy signal, it applies iterative denoising steps to recover the original data distribution. Mathematically, this process can be represented as $x_t \sim p(x_t \mid x_{t-1})$, where $x_t$ is a noisy version of the original signal $x_0$ after $t$ diffusion steps with added Gaussian noise $\epsilon_t \sim \mathcal{N}(0, \sigma_t^2 I)$. By learning to denoise across different noise levels, the model can generate new samples by starting from random noise and applying the reverse diffusion process.
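To make the sampling procedure concrete, the sketch below implements a minimal DDPM-style ancestral sampling loop; the noise predictor `eps_model` is a hypothetical stand-in for a trained network, and the linear beta schedule is one common choice rather than a detail taken from any particular method discussed in this survey.

```python
import torch

def ddpm_sample(eps_model, shape, T=1000, device="cpu"):
    """Minimal DDPM ancestral sampling: start from Gaussian noise and
    iteratively denoise with a learned noise predictor eps_model(x_t, t)."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)       # noise schedule (assumed linear)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)                      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = eps_model(x, t_batch)                             # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise                 # sample x_{t-1}
    return x

# Usage with a placeholder predictor (in practice a trained U-Net or transformer):
dummy_eps = lambda x, t: torch.zeros_like(x)
sample = ddpm_sample(dummy_eps, shape=(1, 3, 32, 32), T=50)
```

Training such a model amounts to regressing the added noise $\epsilon_t$ from $x_t$ at randomly drawn timesteps; the loop above only illustrates the generation side.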
3 3D Native Generative Methods

3D native generative methods directly generate 3D representations under the supervision of 3D data, in which the representation and the supervision are two crucial components for generation quality. Existing 3D native generative methods can be classified into three categories: object, scene and human. Several milestone methods are presented in Figure 1.

3.1 Object

With proper conditional input, 3D native generators can be trained for object-level generation. An early attempt, Text2Shape [Chen et al., 2019], constructed many-to-many relations between language and 3D physical properties, enabling generation control over color and shape. However, Text2Shape only collected 75K language descriptions for 15K chairs and tables. ShapeCrafter [Fu et al., 2022] gradually evolved more phrases, constructing a dataset with 369K shape-text pairs, named Text2Shape++. To support recursive generation, ShapeCrafter captured local details with vector-quantized deep implicit functions. Recently, SDFusion [Cheng et al., 2023] proposed to embed conditional features into the denoising layers of diffusion training, allowing multi-modal input conditions.

However, restricted by the available 3D data and corresponding captions, previous 3D native generative models can only handle limited categories. To support large-vocabulary 3D generation, the pioneering works Point-E [Nichol et al., 2022] and Shap-E [Jun and Nichol, 2023] collected several million 3D assets and corresponding text captions. Point-E trained an image-to-point diffusion model, in which a CLIP visual latent code is fed into the transformer. Shap-E further introduced a latent projection to enable the reconstruction of an SDF representation. Nonetheless, the proposed dataset is not released to the public; instead, recent works have to conduct experiments based on the relatively smaller Objaverse dataset. LRM [Hong et al., 2024] proposed to learn an image-to-triplane latent space and then reshape the latent feature for reconstructing the triplane-based implicit representation. DMV3D [Xu et al., 2024] treated LRM as a denoising layer, further proposing a T-step diffusion model to generate high-quality results based on LRM. TextField3D [Huang et al., 2023a] is proposed for open-vocabulary generation, where a text latent space is injected with dynamic noise to expand the expression range of latent features.
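As background for the triplane-based implicit representation used by LRM and DMV3D, the sketch below shows one common way to query a triplane: a 3D point is projected onto three axis-aligned feature planes, the bilinearly interpolated features are aggregated, and a small decoder maps them to density and color. The plane resolution, channel count, summation-based aggregation, and linear decoder are illustrative assumptions rather than the exact design of either method.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, xyz, decoder):
    """planes: (3, C, R, R) feature planes for the XY, XZ and YZ slices.
    xyz: (N, 3) points in [-1, 1]^3. Returns decoder output per point."""
    projections = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]   # project onto the 3 planes
    feats = 0.0
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                                  # (1, N, 1, 2) sample locations
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode="bilinear", align_corners=True) # (1, C, N, 1)
        feats = feats + sampled.view(plane.shape[0], -1).t()         # accumulate (N, C) features
    return decoder(feats)

# Usage with random planes and a tiny decoder predicting (density, r, g, b):
C, R = 32, 64
planes = torch.randn(3, C, R, R)
decoder = torch.nn.Linear(C, 4)
out = query_triplane(planes, torch.rand(1024, 3) * 2 - 1, decoder)
```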
3.2 Scene

An early approach [Schwarz et al., 2020] utilizes a Generative Adversarial Network (GAN) that explicitly incorporates a parametric function, known as a radiance field. This function takes 3D coordinates and the camera pose as input and generates the corresponding density scalar and RGB values for each point in 3D space. However, GANs suffer from training pathologies, including mode collapse, and are difficult to train on data for which a canonical coordinate system does not exist, as is the case for 3D scenes [Bautista et al., 2022]. To overcome these problems, GAUDI [Bautista et al., 2022] learns a denoising diffusion model that is fit to a set of scene latents learned using an autodecoder. However, such models [Bautista et al., 2022] have the inherent weakness of attempting to capture the entire scene in a single vector that conditions a neural radiance field, which limits the ability to fit complex scene distributions. NeuralField-LDM [Kim et al., 2023] first expresses image and pose pairs as a latent code and learns a hierarchical diffusion model to complete scene generation. However, this method is time-consuming and its resolution is relatively low. The recent X³ employs a hierarchical voxel latent diffusion to generate a higher-resolution 3D representation in a coarse-to-fine manner.
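The generator in such GAN-based approaches can be pictured as a latent-conditioned radiance field. The toy module below maps a scene latent, a 3D position, and a viewing direction to density and color, matching the input/output interface described above; the layer sizes and the absence of positional encoding are simplifications and not the architecture of [Schwarz et al., 2020].

```python
import torch
import torch.nn as nn

class LatentRadianceField(nn.Module):
    """Toy conditional radiance field: (latent z, position x, direction d) -> (sigma, rgb)."""
    def __init__(self, z_dim=128, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(z_dim + 3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                                      nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, z, x, d):
        h = self.trunk(torch.cat([z, x], dim=-1))
        sigma = torch.relu(self.sigma_head(h))            # non-negative volume density
        rgb = self.rgb_head(torch.cat([h, d], dim=-1))    # view-dependent color in [0, 1]
        return sigma, rgb

# One latent code is shared by all sample points of a scene and rendered volumetrically.
model = LatentRadianceField()
z = torch.randn(1, 128).expand(4096, -1)
sigma, rgb = model(z, torch.rand(4096, 3), torch.randn(4096, 3))
```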
3.3 Human Avatar

Early approaches to 3D human avatar generation rely on parametric models, which use a set of predefined parameters to create a 3D mesh of an expressive human face or body. The 3D morphable model (3DMM) [Blanz and Vetter, 2003] is a statistical model that decomposes the intrinsic attributes of a human face into identity, expression, and reflectance. These at-
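For concreteness, a 3DMM-style face model is typically a linear combination of a mean shape with identity and expression bases (with an analogous linear model for reflectance). The sketch below uses randomly generated placeholder bases and dimensions purely for illustration; it does not reproduce the statistics of any released 3DMM.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 5023                                        # illustrative vertex count
mean_shape = rng.normal(size=(n_vertices * 3,))          # mean face, flattened (x, y, z) per vertex
id_basis   = rng.normal(size=(n_vertices * 3, 80))       # identity basis (placeholder, not learned)
exp_basis  = rng.normal(size=(n_vertices * 3, 64))       # expression basis (placeholder)

def morphable_face(id_coeffs, exp_coeffs):
    """Linear 3DMM-style reconstruction: mean shape + identity offsets + expression offsets."""
    shape = mean_shape + id_basis @ id_coeffs + exp_basis @ exp_coeffs
    return shape.reshape(n_vertices, 3)                  # (V, 3) vertex positions

vertices = morphable_face(rng.normal(size=80) * 0.1, rng.normal(size=64) * 0.1)
```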
[Figures: timelines of milestone 2D prior-based and hybrid 3D generation methods from September 2022 to January 2024, covering object, scene, human, and dynamic generation; methods shown include AvatarCLIP, Fantasia3D, Text2Room, HeadSculpt, SceneDreamer, DreamHuman, SceneTex, LucidDreamer, Zero123, One-2-3-45, MVDream, SyncDreamer, Wonder3D, One-2-3-45++, Instant3D, GSGEN, and 4DGen.]
two-stage approach. In the first stage, it performed multi-view generation. The second stage then directly regressed the NeRF from the generated images via a novel sparse-view reconstructor based on transformers. Combining multi-view Stable Diffusion and large-scale reconstruction models can effectively solve the problems of multi-face artifacts and generation speed.

5.2 Scene

Several methods for 3D scene generation have recently been proposed. MVDiffusion [Tang et al., 2023] simultaneously generates all images with global awareness, effectively addressing the common issue of error accumulation. The main feature of MVDiffusion is its ability to process perspective images in parallel using a pre-trained text-to-image diffusion model, while incorporating novel correspondence-aware attention layers to enhance cross-view interactions. ControlRoom3D [Schult et al., 2023] generates high-quality 3D room meshes from only a user-given textual description of the room style and a user-defined room layout. Naive layout-based 3D room generation does not produce plausible meshes; to address the poor geometry and ensure a consistent style, ControlRoom3D leverages guided panorama generation and geometry alignment modules. SceneWiz3D [Zhang et al., 2023b] introduces a method to synthesize high-fidelity 3D scenes from text. Given a text, a layout is first generated; then, a Particle Swarm Optimization technique is applied to automatically place the 3D objects based on the layout and optimize the 3D scene implicitly. SceneWiz3D also leverages an RGBD panorama diffusion model to further improve the scene geometry.
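To illustrate the kind of layout optimization that SceneWiz3D's placement step relies on, below is a generic Particle Swarm Optimization loop that searches for object placements minimizing a user-supplied cost; the cost function, inertia/attraction weights, and the flattened placement parameterization are illustrative assumptions, not details of SceneWiz3D.

```python
import numpy as np

def pso_place_objects(cost_fn, dim, n_particles=32, iters=200, seed=0):
    """Generic PSO: each particle is a candidate placement vector (e.g., object
    positions flattened); the swarm minimizes cost_fn(placement)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()
    best_cost = np.array([cost_fn(p) for p in pos])
    g_best = best_pos[best_cost.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, 1))
        vel = 0.7 * vel + 1.5 * r1 * (best_pos - pos) + 1.5 * r2 * (g_best - pos)
        pos = np.clip(pos + vel, -1.0, 1.0)
        cost = np.array([cost_fn(p) for p in pos])
        improved = cost < best_cost
        best_pos[improved], best_cost[improved] = pos[improved], cost[improved]
        g_best = best_pos[best_cost.argmin()].copy()
    return g_best

# Toy objective: place two objects near layout targets while avoiding overlap.
targets = np.array([0.5, 0.5, -0.5, -0.5])
cost = lambda p: np.sum((p - targets) ** 2) + max(0.0, 0.4 - np.linalg.norm(p[:2] - p[2:]))
placement = pso_place_objects(cost, dim=4)
```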
5.3 Human Avatar

Several studies on 3D human generation leverage both 2D and 3D datasets/priors to achieve more authentic and general synthesis of 3D humans, where 3D data provides accurate geometry and 2D data offers diverse appearance. SofGAN [Chen et al., 2022a] proposes a controllable human face generator with a decoupled latent space of geometry and texture learned from unpaired datasets of 2D images and 3D facial scans. The 3D geometry is encoded into a semantic occupancy field to facilitate consistent free-viewpoint image generation. Similarly, SCULPT [Sanyal et al., 2023] presents an unpaired learning procedure that effectively learns from medium-sized 3D scan datasets and large-scale 2D image datasets to obtain a disentangled distribution of the geometry and texture of full-body clothed humans. Get3DHuman [Xiong et al., 2023] circumvents the requirement of 3D training data by combining two pre-trained networks, a StyleGAN-Human image generator and a 3D reconstructor.

Driven by the significant progress of recent text-to-image synthesis models, researchers have begun to use 3D human datasets to enhance powerful 2D diffusion models so as to synthesize photorealistic 3D human avatars with high-frequency details. DreamFace [Zhang et al., 2023a] generates photorealistic animatable 3D head avatars by bridging vision-language models with animatable and physically-based facial assets. The realistic rendering quality is achieved by a dual-path appearance generation process, which combines a novel texture diffusion model trained on a carefully collected physically-based texture dataset with the pre-trained diffusion prior. HumanNorm [Huang et al., 2023c] proposes a two-stage diffusion pipeline for 3D human generation, which first generates detailed geometry with a normal-adapted diffusion model and then synthesizes photorealistic texture based on the generated geometry using a normal-aligned diffusion model. Both diffusion models are fine-tuned on a dataset of 2.9K 3D human models.

5.4 Dynamic

Jointly optimized by 2D, 3D, and video priors, dynamic 3D generation has been gaining significant attention recently. The pioneering work MAV3D [Singer et al., 2023] proposed to generate a static 3D asset and then animate it with text-to-video diffusion, in which a 4D representation named hex-plane is introduced to extend 3D space with a temporal dimension. Following MAV3D, a series of works created dynamic 3D content based on a static-to-dynamic pipeline, with different 4D representations and supervisions proposed to improve generation quality.
Animate124 [Zhao et al., 2023] introduced an image-to-4D framework in which the hex-plane is replaced with a 4D grid encoding. Besides the static and dynamic stages, a refinement stage is further proposed to guide the semantic alignment between the image input and the 4D creation with ControlNet. 4D-fy [Bahmani et al., 2023] proposed a multi-resolution hash encoding that represents 3D and temporal space separately. It highlighted the importance of 3D generation quality and leveraged a 3D prior to guide the optimization of the static stage.

Recent works attempted to reconstruct 3D scenes based on generated videos, introducing a new 4D pipeline that generates a video and then complements its 3D representation. 4DGen [Yin et al., 2023] made pseudo multi-view videos via a multi-view diffusion prior and optimized the reconstruction of Gaussian splats based on a multi-resolution hex-plane. DreamGaussian4D [Ren et al., 2023] deployed a 3D-aware diffusion prior to supervise the multi-view reconstruction of a given video and refined the corresponding scene with a video diffusion prior.
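To make the hex-plane representation used throughout this line of work concrete, the sketch below factorizes a 4D field over (x, y, z, t) into six feature planes, one per coordinate pair, analogous to the triplane query shown earlier but extended with time; the resolution, channel count, and product-based fusion are illustrative choices rather than the configuration of any specific method above.

```python
import torch
import torch.nn.functional as F

# Six planes, one per coordinate pair of (x, y, z, t); each is a (C, R, R) grid.
PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

def query_hexplane(planes, xyzt, decoder):
    """planes: list of six (C, R, R) grids; xyzt: (N, 4) points in [-1, 1]^4."""
    feats = 1.0
    for plane, (i, j) in zip(planes, PAIRS):
        grid = xyzt[:, [i, j]].view(1, -1, 1, 2)                 # (1, N, 1, 2) sample locations
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode="bilinear", align_corners=True)
        feats = feats * sampled.view(plane.shape[0], -1).t()     # fuse planes by product, (N, C)
    return decoder(feats)

C, R = 16, 64
planes = [torch.randn(C, R, R) * 0.1 + 1.0 for _ in PAIRS]      # init near 1 keeps products stable
decoder = torch.nn.Linear(C, 4)                                  # e.g., density + RGB at (x, y, z, t)
out = query_hexplane(planes, torch.rand(2048, 4) * 2 - 1, decoder)
```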
6 Future Directions

Despite the recent progress in 3D content generation, many problems remain unsolved that significantly impact the quality, efficiency and controllability of 3D content generation methods. In this section, we summarize these challenges and propose several future directions.

6.1 Challenges

In terms of quality, current AIGC-3D methods have some limitations. For geometry, they cannot generate compact meshes and fail to model reasonable wiring (edge topology). For textures, they lack the ability to produce rich detail maps, and it is difficult to eliminate the effects of lighting and shadows. Material properties are also not well supported. Regarding controllability, existing text/image/sketch-to-3D approaches cannot precisely output 3D assets that meet the conditional requirements, and editing capabilities are also insufficient. For speed, feed-forward and SDS-based methods built on Gaussian splatting are faster but offer lower quality than optimization approaches based on NeRF. Overall, generating 3D content at production-level quality, scale and precision remains unresolved.

6.2 Data

Regarding data, one challenge lies in collecting datasets containing billions of 3D objects, scenes and humans. This could potentially be achieved through an open-world 3D gaming platform, where users can freely create and upload their own custom 3D models. Additionally, it would be valuable to extract rich implicit 3D knowledge from multi-view images and videos. Large-scale datasets with such diverse, unlabeled 3D data hold great potential to advance unsupervised and self-supervised learning approaches for generative 3D content creation.

6.3 Model

There is a need to explore more effective 3D representations and model architectures capable of exhibiting scale-up performance alongside growing datasets. This presents a promising research avenue. Over the coming years, we may see the emergence of foundation models specialized for 3D content generation. Additionally, future large language models achieving high levels of multimodal intelligence, such as GPT-5/6, could theoretically understand images and text, and even programmatically operate 3D modeling software at an expert level. However, ensuring the beneficial development of such powerful systems will require extensive research.

6.4 Benchmark

Currently, 3D content quality evaluation mainly relies on human ratings. [Wu et al., 2024] introduced an automated human-aligned evaluator for text-to-3D generation. However, fully assessing 3D outputs is challenging since it requires comprehending both physical 3D properties and intended designs. Benchmarking of 3D generation has lagged behind progress in 2D image generation benchmarks. Developing robust metrics that holistically gauge geometric and textural fidelity based on photorealism standards could advance the field.

7 Conclusion

In this survey, we have conducted a comprehensive analysis of 3D content generation techniques, encompassing 3D native generation, 2D prior-based 3D generation, and hybrid 3D generation. We have introduced a novel taxonomy to succinctly summarize the advancements made in recent methods for generating 3D content. Additionally, we have identified and summarized the unresolved challenges in this field, while also proposing several promising research directions. We firmly believe that this study will serve as an invaluable resource, guiding further advancements in the field as researchers tackle intriguing open problems by drawing inspiration from the ideas presented in this work.

References

[Achiam et al., 2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[Bahmani et al., 2023] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. arXiv preprint arXiv:2311.17984, 2023.
[Bautista et al., 2022] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for immersive 3d scene generation. NeurIPS, 2022.
[Betker et al., 2023] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, 2023.
[Blanz and Vetter, 2003] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3d morphable model. TPAMI, 2003.
[Cao et al., 2023] Tianshi Cao, Karsten Kreis, Sanja Fidler, Nicholas Sharp, and Kangxue Yin. Texfusion: Synthesizing 3d textures with text-guided image diffusion models. In ICCV, 2023.
[Chen et al., 2019] Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. In ACCV, 2019.
[Chen et al., 2021] Lu Chen, Sida Peng, and Xiaowei Zhou. Towards efficient and photorealistic 3d human reconstruction: a brief survey. Visual Informatics, 2021.
[Chen et al., 2022a] Anpei Chen, Ruiyang Liu, Ling Xie, Zhang Chen, Hao Su, and Jingyi Yu. Sofgan: A portrait image generator with dynamic styling. TOG, 2022.
[Chen et al., 2022b] Xu Chen, Tianjian Jiang, Jie Song, Jinlong Yang, Michael J Black, Andreas Geiger, and Otmar Hilliges. gDNA: Towards generative detailed neural avatars. In CVPR, 2022.
[Chen et al., 2023a] Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Scenetex: High-quality texture synthesis for indoor scenes via diffusion priors. arXiv preprint arXiv:2311.17261, 2023.
[Chen et al., 2023b] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In ICCV, 2023.
[Chen et al., 2023c] Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Scenedreamer: Unbounded 3d scene generation from 2d image collections. TPAMI, 2023.
[Chen et al., 2023d] Zilong Chen, Feng Wang, and Huaping Liu. Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585, 2023.
[Cheng et al., 2023] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G Schwing, and Liang-Yan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In CVPR, 2023.
[Chung et al., 2023] Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023.
[Corona et al., 2021] Enric Corona, Albert Pumarola, Guillem Alenya, Gerard Pons-Moll, and Francesc Moreno-Noguer. Smplicit: Topology-aware generative model for clothed people. In CVPR, 2021.
[Fu et al., 2022] Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, and Srinath Sridhar. Shapecrafter: A recursive text-conditioned 3d shape generation model. NeurIPS, 2022.
[Han et al., 2023] Xiao Han, Yukang Cao, Kai Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, and Kwan-Yee K Wong. Headsculpt: Crafting 3d head avatars with text. In NeurIPS, 2023.
[Höllein et al., 2023] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. arXiv preprint arXiv:2303.11989, 2023.
[Hong et al., 2022a] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. arXiv preprint arXiv:2205.08535, 2022.
[Hong et al., 2022b] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In CVPR, 2022.
[Hong et al., 2024] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. ICLR, 2024.
[Huang et al., 2023a] Tianyu Huang, Yihan Zeng, Bowen Dong, Hang Xu, Songcen Xu, Rynson WH Lau, and Wangmeng Zuo. Textfield3d: Towards enhancing open-vocabulary 3d generation with noisy text fields. arXiv preprint arXiv:2309.17175, 2023.
[Huang et al., 2023b] Tianyu Huang, Yihan Zeng, Zhilu Zhang, Wan Xu, Hang Xu, Songcen Xu, Rynson WH Lau, and Wangmeng Zuo. Dreamcontrol: Control-based text-to-3d generation with 3d self-prior. arXiv preprint arXiv:2312.06439, 2023.
[Huang et al., 2023c] Xin Huang, Ruizhi Shao, Qi Zhang, Hongwen Zhang, Ying Feng, Yebin Liu, and Qing Wang. Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation. arXiv preprint arXiv:2310.01406, 2023.
[Huang et al., 2023d] Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, and Lei Zhang. Dreamwaltz: Make a scene with complex 3d animatable avatars. arXiv preprint arXiv:2305.12529, 2023.
[Jun and Nichol, 2023] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
[Kerbl et al., 2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023.
[Kim et al., 2023] Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, and Sanja Fidler. Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. In CVPR, 2023.
[Kolotouros et al., 2023] Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Fieraru, and Cristian Sminchisescu. Dreamhuman: Animatable 3d avatars from text. arXiv preprint arXiv:2306.09329, 2023.
[Li et al., 2023a] Chenghao Li, Chaoning Zhang, Atish Waghwase, Lik-Hang Lee, Francois Rameau, Yang Yang, Sung-Ho Bae, and Choong Seon Hong. Generative ai meets 3d: A survey on text-to-3d in aigc era. arXiv preprint arXiv:2305.06131, 2023.
[Li et al., 2023b] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023.
[Lin et al., 2023] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In CVPR, 2023.
[Liu et al., 2021] Jialin Liu, Sam Snodgrass, Ahmed Khalifa, Sebastian Risi, Georgios N Yannakakis, and Julian Togelius. Deep learning for procedural content generation. Neural Computing and Applications, 2021.
[Liu et al., 2023a] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv preprint arXiv:2311.07885, 2023.
[Liu et al., 2023b] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, Hao Su, et al. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. In NeurIPS, 2023.
[Liu et al., 2023c] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
[Liu et al., 2023d] Vivian Liu, Jo Vermeulen, George Fitzmaurice, and Justin Matejka. 3dall-e: Integrating text-to-image ai in 3d design workflows. In ACM DIS, 2023.
[Liu et al., 2023e] Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, and Ziwei Liu. Humangaussian: Text-driven 3d human generation with gaussian splatting. arXiv preprint arXiv:2311.17061, 2023.
[Liu et al., 2024] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. ICLR, 2024.
[Long et al., 2023] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
[Loper et al., 2023] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2. 2023.
[Nichol et al., 2022] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
[Poole et al., 2023] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
[Ren et al., 2023] Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142, 2023.
[Richardson et al., 2023] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. In SIGGRAPH, 2023.
[Saito et al., 2019] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[Sanyal et al., 2023] Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J Black, Justus Thies, and Timo Bolkart. SCULPT: Shape-conditioned unpaired learning of pose-dependent clothed and textured human meshes. arXiv preprint arXiv:2308.10638, 2023.
[Schult et al., 2023] Jonas Schult, Sam Tsai, Lukas Höllein, Bichen Wu, Jialiang Wang, Chih-Yao Ma, Kunpeng Li, Xiaofang Wang, Felix Wimbauer, Zijian He, et al. Controlroom3d: Room generation using semantic proxy rooms. arXiv preprint arXiv:2312.05208, 2023.
[Schwarz et al., 2020] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. NeurIPS, 2020.
[Shi et al., 2022] Zifan Shi, Sida Peng, Yinghao Xu, Andreas Geiger, Yiyi Liao, and Yujun Shen. Deep generative models on 3d representations: A survey. arXiv preprint arXiv:2210.15663, 2022.
[Shi et al., 2024] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. ICLR, 2024.
[Singer et al., 2023] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.
[Sun et al., 2023] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023.
[Tang et al., 2023] Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion, 2023.
[Tang et al., 2024] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. ICLR, 2024.
[Wang et al., 2023a] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR, 2023.
[Wang et al., 2023b] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023.
[Wu et al., 2024] Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, and Gordon Wetzstein. Gpt-4v(ision) is a human-aligned evaluator for text-to-3d generation. arXiv preprint arXiv:2401.04092, 2024.
[Xiong et al., 2023] Zhangyang Xiong, Di Kang, Derong Jin, Weikai Chen, Linchao Bao, Shuguang Cui, and Xiaoguang Han. Get3DHuman: Lifting StyleGAN-Human into a 3D generative model using pixel-aligned reconstruction priors. In ICCV, 2023.
[Xu et al., 2024] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. ICLR, 2024.
[Yin et al., 2023] Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, and Yunchao Wei. 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225, 2023.
[Zhang et al., 2023a] Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. Dreamface: Progressive generation of animatable 3d faces under text guidance. arXiv preprint arXiv:2304.03117, 2023.
[Zhang et al., 2023b] Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, and Hsin-Ying Lee. Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885, 2023.
[Zhao et al., 2023] Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603, 2023.