VFusion3D: Learning Scalable 3D Generative Models From Video Diffusion Models
Fig. 1: Comparisons of 3D generation from a single input image: OpenLRM, TriplaneGaussian, LGM, and VFusion3D (ours).
1 Introduction
The advent of 3D datasets [7, 9, 10] and the development of advanced neural
rendering methods [19, 37–39] have created new possibilities in computer vision
and computer graphics. AI for 3D content creation has shown great potential in
various fields, such as AR/VR/MR [22], 3D gaming [14, 50], and animation [20].
As a result, there is a growing demand for the development of foundation 3D
generative models that can efficiently create various high-quality 3D assets. De-
spite the clear need and potential, current methods fall short of expectations due
to their production of low-quality textures and geometric issues such as floaters.
The primary obstacle in constructing foundation 3D generative models is the
scarcity of 3D data available. Acquiring 3D data is a complex and challenging
task. Unlike images, which can be easily captured using standard cameras, 3D
data requires specialized equipment and capturing techniques. Another option is
3D modeling, a tedious process that can take up to weeks for a single asset. This
complexity results in a pool of publicly available 3D assets that is limited, with
the largest ones containing up to 10 million assets [7,9,65]. In reality, only a small
portion of them is usable, since a sizeable fraction of 3D assets are duplicates
and many of them lack texture or even a high-quality surface [16, 23, 49].
Conversely, foundation models such as GPT [1, 55, 56] and Diffusion mod-
els [3, 8, 47] have consistently showcased remarkable capabilities. These mod-
els have proven their prowess in handling intricate tasks, exhibiting advanced
problem-solving skills, and delivering exceptional performance across a spec-
trum of challenges. Such general and strong capabilities are derived from scaling
on both data and model size [3, 40, 56]. The primary prerequisite for construct-
ing foundation models is the accessibility of an extensive amount of high-quality
training data, often surpassing billions in size. This presents a clear discrep-
ancy between the current state of 3D datasets and the requirements for training
foundation 3D generative models.
In this paper, we propose to use video diffusion models, trained on large
amounts of text, images, and videos, as a 3D data generator. Specifically, we use
EMU Video [13], which has been trained on a variety of videos, including those
with intricate camera movements and drone footage, which inherently contain
cues about the 3D world. This suggests that video diffusion models have a latent
understanding of how to generate videos with 3D consistency [35]. By fine-tuning
this model using rendered multi-view videos from 100K 3D data, we unlock the
model’s inherent capability to generate diverse, 3D-consistent, multi-view videos
from text and image prompts. The resulting video diffusion model allows us to
scale the 3D data necessary for learning foundation 3D generative models. Using
millions of text prompts from web-scale data and a filtering system, we generate
a multi-view dataset comprising 3 million multi-view videos.
We utilize our synthetic multi-view dataset to train a 3D generative model
capable of reconstructing 3D assets from single images in a feed-forward manner.
Using the recently proposed Large Reconstruction Model (LRM) [16] as our
starting point, we introduce a series of training strategies to stabilize the training
process. These strategies assist the model in better learning from a substantial
volume of synthetic multi-view training samples. Post-training, we further fine-
tune with renderings from the 3D dataset, originally employed to fine-tune the
video diffusion model, to further improve our 3D generative model. As a result,
the proposed model, VFusion3D, can generate high-quality 3D assets from a
single image and render them from any viewing angle. We compare VFusion3D against several
distillation-based and feed-forward 3D generative models using a user study and
automated metrics. Finally, we undertake a comprehensive study to explore,
analyze, and discuss topics on 3D data versus synthetic multi-view data, as well
as the scaling trends of large 3D generative models.
2 Related Work
Dream Fields [17] suggests using CLIP [45] as a starting point, but textual supervision alone is
not enough for 3D generation. DreamFusion [41] optimizes an implicit 3D rep-
resentation by distilling knowledge from 2D diffusion models [11, 47, 48] using
a score-based distillation approach [41]. Such distillation-based approaches have
shown promising results and inspired subsequent research [26,32,54,59,63]. Fur-
thermore, the combination of 2D diffusion models with 3D data to achieve both
3D consistency and visually pleasing results is increasingly gaining popularity [25, 28–30,
42,49]. Concurrent research [35] suggests the reconstruction of 3D Gaussians [19]
using multi-view outputs derived from fine-tuned video diffusion models [2, 21].
Such reconstruction-based methods [6, 27, 28, 60] can also leverage multi-view
data produced by other methods [29, 49]. With respect to distillation, our work
also involves the distillation of 3D knowledge from video diffusion models. How-
ever, we distill knowledge from video diffusion models in an explicit way, elimi-
nating the need for score distillation sampling [41].
Feed-forward 3D Generative Models. In the fields of 3D reconstruction and
3D generation, a prominent trend involves learning directly from 3D datasets [12,
16,32,46,51,52,61,64]. Pioneering works [12,18,52] attempted to learn 3D repre-
sentations directly from 3D [10] and multi-view [65] data, but these approaches
faced limitations in performance due to low-quality or small-scale training
data. A series of recent studies have suggested using a tri-plane [5] as
the 3D representation, demonstrating that it can scale to larger datasets. These
methods train large-scale models to generate a 3D representation from various
inputs, including single images [16], textual information [23], posed multi-view
images [62], and unposed multi-view images [58]. Typically, these large mod-
els [36] learn to predict a tri-plane representation, which is subsequently converted
into a NeRF or combined with an optimized point generator for Gaussian
Splatting [67]. It is also possible to generate 3D Gaussians directly from multi-view
inputs; at inference time, off-the-shelf text-to-3D [57] or image-to-3D [49]
models can supply these multi-view inputs, as demonstrated in concurrent
work [53]. In this study, we adopt the most general framework, i.e.,
LRM [15,16], without architectural modifications. Our objective is not to develop
a new architecture. Instead, we strive to enhance the suitability and scalability
of general models for training on large-scale synthetic multi-view images.
3 Method
Section 3.1 provides the preliminaries on EMU Video [13] and LRM [16].
Our approach for transferring a video diffusion model into a 3D multi-view data
engine is presented in Section 3.2. The process of gradually improving the LRM
into our VFusion3D is described in Section 3.3. Figure 2 illustrates our pipeline.
3.1 Preliminaries
EMU Video. EMU Video [13] is a video diffusion model that builds upon
a pre-trained text-to-image diffusion model, EMU [8]. It initializes its weights
from the EMU model and adds new learnable temporal parameters. EMU Video
is conditioned on both a text prompt and an image prompt, where the image
prompt serves as the first frame and can either be provided or generated. The
EMU model is trained on 1.1 billion image-text pairs, and EMU Video is further
trained on an additional 34 million video-text pairs. EMU Video is capable of
generating high-quality and temporally consistent videos with up to 16 frames.
Large Reconstruction Model. LRM is a large feed-forward model for single
image 3D reconstruction. It initially utilizes a pre-trained vision transformer,
DINO [4], to encode image features. These features are subsequently used by an
image-to-triplane decoder module via cross-attention mechanisms. Camera pa-
rameters are also sent to the image-to-triplane decoder after undergoing process-
ing through a small camera embedding network. The tri-plane representation,
predicted by the image-to-triplane decoder, is upsampled and reshaped to query
3D point features. These features are then input into a multi-layer perceptron (MLP)
module to predict RGB and density for volumetric rendering. The training is
conducted on 3D multi-view datasets [10, 65] and is supervised using multi-view
images with LPIPS loss [66] and MSE loss.
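To make this concrete, the following is a minimal sketch of the triplane query step, i.e., how 3D point features can be sampled from a predicted tri-plane and decoded into RGB and density. This is not the LRM implementation; the class name, channel counts, and activation choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneNeRFSketch(nn.Module):
    """Minimal triplane -> (RGB, density) decoder, loosely following the LRM description."""

    def __init__(self, channels=32, hidden=64):
        super().__init__()
        # Small MLP over concatenated per-plane features; outputs 3 RGB channels + 1 density.
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 4),
        )

    def query(self, triplane, points):
        # triplane: (B, 3, C, H, W) predicted by the image-to-triplane decoder.
        # points:   (B, P, 3) query coordinates in [-1, 1], used for volume rendering.
        feats = []
        for plane_id, dims in enumerate([(0, 1), (0, 2), (1, 2)]):  # XY, XZ, YZ planes
            grid = points[..., list(dims)].unsqueeze(1)                # (B, 1, P, 2)
            sampled = F.grid_sample(triplane[:, plane_id], grid,
                                    align_corners=False)              # (B, C, 1, P)
            feats.append(sampled.squeeze(2).permute(0, 2, 1))          # (B, P, C)
        x = self.mlp(torch.cat(feats, dim=-1))                         # (B, P, 4)
        rgb, density = torch.sigmoid(x[..., :3]), F.softplus(x[..., 3:])
        return rgb, density
```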
3.2 Video Diffusion Models as 3D Multi-view Data Engines
We fine-tune EMU Video on multi-view renderings of 3D assets so that it generates 3D-consistent multi-view videos. During fine-tuning, only the temporal layers are updated; the spatial convolutional and attention layers are frozen to ensure that the fine-tuning does not degrade the visual
quality of generation.
3.3 VFusion3D
VFusion3D adheres to the architecture of LRM, which is considered the most
general method yet for feed-forward 3D creation. We pinpoint the ideal training
strategies and formulate a recipe that enhances suitability and scalability when
learning from synthetic multi-view data. Our ultimate objective is to establish
learning from video diffusion generated multi-view data as a potential paradigm
for training foundation 3D models.
Improved Training Strategies for Synthetic Data. The original training
setting of LRM was designed to work with Objaverse [10] and MVImgNet [65]
data. However, we found that this setup was not entirely suitable for direct
application to our synthetic data, as synthetic data tends to be noisier and does
not always maintain full 3D consistency. To stabilize and improve the learning
process on synthetic multi-view data, we implemented a series of strategies.
• A Multi-stage Training Recipe. LRM employs a patch rendering strategy,
which renders a small 128 × 128 resolution patch from a random rendering res-
olution that ranges from 128 to 384. However, this setting can potentially cause
instability in the learning process, particularly in the early stages, and may lead
to incorrect optimization directions, resulting in predictions with checkerboard
effects. To address this, we suggest using a multi-stage learning process, sketched
below, which involves gradually increasing the rendering resolution, thereby preparing the model
for higher rendering resolutions at each stage. In practice, we use rendering res-
olutions of 128, 192, 256, 320, and 384. Each stage is trained for 5 epochs, with
the exception of the final stage, which is trained for 10 epochs. After training on
the 192 resolution stage, we reset the learning rate to the initial learning rate.
This adjustment allows the model to better learn from later stages.
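A minimal sketch of this schedule follows. The trainer interface (train_one_epoch, reset_learning_rate) is hypothetical; the resolutions, epoch counts, and the learning-rate reset after the 192-resolution stage follow the description above.

```python
# (rendering resolution, number of epochs); the final stage trains for 10 epochs.
STAGES = [(128, 5), (192, 5), (256, 5), (320, 5), (384, 10)]
INITIAL_LR = 4e-4  # initial learning rate, see Section 4

def run_multistage_training(trainer):
    for stage_idx, (render_res, epochs) in enumerate(STAGES):
        if stage_idx == 2:
            # First stage after 192: reset the learning rate to its initial value
            # so the later, higher-resolution stages are learned effectively.
            trainer.reset_learning_rate(INITIAL_LR)
        for _ in range(epochs):
            trainer.train_one_epoch(render_resolution=render_res)
```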
• Image-level Supervision Instead of Pixel-level Supervision. Pixel-level
losses, such as L1 and L2, are commonly used. However, they may not be suitable
for synthetic multi-view data because they enforce strict pixel-level correspondence.
Minor inconsistencies in synthetic multi-view images are inevitable, and when
trained with pixel-level loss, these can lead to improper optimization and blurry
results. On the other hand, image-level loss methods, such as LPIPS [66], are
less stringent as they operate in feature spaces. This inherent flexibility has the
potential to strengthen our model, as it accommodates minor inconsistencies in
synthetic multi-view images without compromising the optimization process.
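A minimal sketch of this image-level supervision, using the lpips package that implements LPIPS [66]; the VGG backbone and the input normalization are assumptions.

```python
import lpips  # pip install lpips
import torch

perceptual = lpips.LPIPS(net="vgg")  # feature-space perceptual metric

def image_level_loss(rendered, target):
    # rendered, target: (B, 3, H, W) images in [0, 1]; LPIPS expects inputs in [-1, 1].
    return perceptual(rendered * 2 - 1, target * 2 - 1).mean()
```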
• Opacity Loss. In synthetic multi-view data, small colored patches occasion-
ally appear in images that should have a white background. To address this, we
run a saliency detection model [44] to obtain the masks of central objects in our
synthetic data. This allows us to apply an opacity loss, which helps maintain
focus on the foreground object. Additionally, extra opacity supervision can fa-
cilitate the training process, resulting in stronger models. We employ a general
method to extract these masks and introduce an opacity loss between rendering
density and masks. Combined with our strategy of avoiding pixel-level losses, this
approach efficiently mitigates noise from small color patches in the background.
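A minimal sketch of such an opacity term is given below, assuming the volume renderer exposes an accumulated alpha map; the binary cross-entropy formulation is an assumption, since the exact loss form is not specified.

```python
import torch.nn.functional as F

def opacity_loss(rendered_alpha, saliency_mask):
    # rendered_alpha: (B, 1, H, W) accumulated opacity along each ray, in [0, 1].
    # saliency_mask:  (B, 1, H, W) binary foreground mask from the saliency model [44].
    rendered_alpha = rendered_alpha.clamp(1e-5, 1 - 1e-5)  # keep BCE numerically stable
    return F.binary_cross_entropy(rendered_alpha, saliency_mask.float())
```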
• Camera Noise Injection. For object observation, the camera trajectory
of our synthetic multi-view data follows a fixed sequence of azimuth changes.
This could potentially restrict the model’s generalization capability due to the
limited viewing positions. To counteract this, we introduce camera noise during
both the 3D data rendering and VFusion3D training stages. We apply a random
minor offset (ranging from -0.05 to 0.05) during the rendering of multi-view
images, which are subsequently used to fine-tune EMU Video. Additionally,
we infuse camera noise (varying from -0.02 to 0.02) into both the intrinsic and
extrinsic matrices during the training process. This potentially enhances the
model’s robustness against incorrect camera parameters and improves its ability
to generalize to different viewing angles.
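A minimal sketch of the training-time noise injection, with uniform noise in [-0.02, 0.02]; adding the offset to every matrix entry is an assumption about the exact implementation.

```python
import torch

def add_camera_noise(intrinsics, extrinsics, scale=0.02):
    # intrinsics: (B, 3, 3) camera intrinsic matrices; extrinsics: (B, 4, 4) poses.
    noisy_k = intrinsics + (torch.rand_like(intrinsics) * 2 - 1) * scale
    noisy_e = extrinsics + (torch.rand_like(extrinsics) * 2 - 1) * scale
    return noisy_k, noisy_e
```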
Fig. 5: User study results comparing VFusion3D and previous work (OpenLRM and LGM), reported as VFusion3D win rates (%). VFusion3D consistently outperforms previous works by considerable margins in both generation quality and image faithfulness.
4 Experiments
Our VFusion3D model is trained on 128 NVIDIA A100 (80G) GPUs, and the
training process takes approximately 6 days to complete. We use a total batch
size of 1024, with 4 multi-view images in 128×128 resolution used for supervision
per batch, resulting in a total of 4096 multi-view images. The model is trained for
30 epochs with an initial learning rate of 4 × 10−4, following a cosine annealing
schedule with a restart after the first 10 epochs. The training begins with a warm-
up of 3000 iterations, and the optimizer used is AdamW [33]. More details are
presented in the Appendix A.
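For reference, a hedged PyTorch sketch of this optimization setup; the warm-up scheduler and the cosine restart cycle lengths are assumptions chosen to be consistent with the stated hyper-parameters.

```python
import torch

def build_optimizer_and_schedulers(model, steps_per_epoch):
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
    # Cosine annealing with a warm restart: the first cycle spans 10 epochs and the
    # second cycle (T_mult=2) spans the remaining 20 of the 30 training epochs.
    cosine = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10 * steps_per_epoch, T_mult=2)
    # Linear warm-up over the first 3000 iterations.
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1e-3, total_iters=3000)
    return optimizer, cosine, warmup
```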
Fig. 6: Qualitative text-to-3D comparisons among DreamFusion, Magic3D, ProlificDreamer, MVDream, OpenLRM, LGM, and VFusion3D on prompts such as "a DSLR photo of a ghost eating a hamburger", "beautiful, intricate butterfly", and "a DSLR photo of a frog wearing a sweater".
We also compare with the feed-forward methods OpenLRM [15] and LGM [53]. For text-to-3D models, we use rendered videos from the MVDream
website for evaluation. Results are presented in Table 2, and visual samples are
shown in Figure 6. When considering feed-forward models, LGM demonstrates
better text alignment, while VFusion3D exhibits stronger image alignment. Qual-
itatively, VFusion3D shows superior results compared to other feed-forward mod-
els, especially in terms of both color and shape 3D consistency.
User Study. To assess the overall quality of the generated content and its
faithfulness to the input images more accurately, we conducted a user study via
Amazon Mechanical Turk. We presented users with two 360◦ rendered videos,
one produced by our VFusion3D model and another by a baseline model, and
asked them to indicate a winner. A total of 65 videos (25 for 3D reconstruction, 40
for text-to-3D) were evaluated. Feedback was collected from 5 different users and
we present the majority results in the form of a winner rate. Figure 5 displays our
results. Our method surpasses SOTA baselines, demonstrating that VFusion3D
aligns closely with the original image content and exhibits high visual quality.
4.3 Ablation Study
We conduct an ablation study on several design choices, which include: (1) the
specifications for fine-tuning the EMU Video model, (2) the training strategies
proposed in VFusion3D, and (3) the amount of 3D data used for fine-tuning the pre-
trained VFusion3D. We use two evaluation sets for our study. First, we report
SSIM and LPIPS for 3D data evaluation, along with CLIP text similarity score
and CLIP image similarity score for text-to-3D evaluation. The 3D data evalua-
tions were conducted on a test dataset comprising 500 unseen 3D models. For
each model, we use one image as input, render 32 novel views from different
viewing angles, and compare them against their respective ground truths. The
text-to-3D evaluation follows the setting of the text-to-3D experiments in Section 4.2.
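As a minimal sketch, the CLIP text similarity score can be computed as below; the CLIP image similarity is analogous, comparing rendered views against the input image. The checkpoint name and averaging over rendered views are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_text_similarity(rendered_views, prompt):
    # rendered_views: list of PIL images rendered from the generated 3D asset.
    inputs = clip_processor(text=[prompt], images=rendered_views,
                            return_tensors="pt", padding=True)
    img = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip_model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()  # cosine similarity averaged over views
```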
How Much 3D Data are Needed for Fine-tuning Video Diffusion Mod-
els? We explore the amount of 3D data required for fine-tuning EMU Video.
Specifically, we experiment with 10K, 50K, and 100K 3D data. Table 3 presents
the results of both the good classification rate of multi-view sequences generated
by the fine-tuned EMU Video and the performance of VFusion3D models trained
on the synthetic datasets from the respective fine-tuned EMU Video. We gener-
ate 500K multi-view sequences before filtering for training VFusion3D variants.
As expected, using more 3D data enhances the multi-view sequence generation
ability of EMU Video without compromising visual quality. This suggests that
our approach could also scale with the collection of more 3D data.
Effect of Improved Training Strategies. We evaluate the impact of the
proposed training strategies incorporated in VFusion3D. We use 500K synthetic
Num of data SSIM↑ LPIPS↓ CLIP Text Sim↑ CLIP Image Sim↑
No fine-tune 0.832 0.160 0.261 0.839
10K 0.835 0.153 0.261 0.838
30K 0.837 0.149 0.262 0.846
50K 0.839 0.147 0.262 0.834
50K (random) 0.840 0.146 0.261 0.836
50K (16-views) 0.838 0.147 0.262 0.835
100K 0.842 0.143 0.266 0.836
Table 5: Ablation study on settings of 3D data fine-tuning. In general, the use
of more 3D data leads to improved performance, irrespective of other settings.
data only for training ablation variants. The results of these variants are pre-
sented in Table 4.
Our findings include: (1) The multi-stage training recipe significantly im-
proves the results by stabilizing the training process. (2) Without pixel-level
supervision, such as MSE, we notice a small improvement across all metrics, as
well as in the visual results. (3) The inclusion of opacity loss can further en-
hance performance. (4) While the injection of camera noise does not improve
performance according to metrics, further testing reveals that this variant does
enhance the model’s robustness. It enables the model to handle a wider range
of potentially inaccurate camera matrices. Therefore, the decision to apply it
becomes a matter of trade-off.
Settings of 3D Data Fine-tuning. We investigate the settings of 3D data
fine-tuning, including sizes, data selections, and number of views. We evaluate
datasets of varying sizes - 10K, 30K, 50K, and 100K - where the data is selected
based on aesthetic score rankings, from high to low. To determine the aesthetic
scores of 3D data, we train an aesthetic score predictor on top of a pre-trained
2D aesthetic score predictor. This predictor uses CLIP to extract features and
an MLP for prediction. We average the CLIP features extracted from multi-view
images to make it applicable to 3D data. To ascertain whether aesthetic score
ranking is beneficial, we further draw 50K 3D data randomly from the 100K 3D
data as a variant. By default, all 3D data are rendered to 32 views, but we also
explore a variant that only uses 16 views.
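A minimal sketch of such a multi-view aesthetic scorer; the CLIP feature dimension, MLP width, and the pool-then-regress structure are assumptions (in our setup the MLP is built on top of a pre-trained 2D aesthetic score predictor).

```python
import torch
import torch.nn as nn

class AestheticScorer3D(nn.Module):
    """Scores a 3D asset by pooling CLIP features of its multi-view renderings."""

    def __init__(self, clip_dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(clip_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, view_features):
        # view_features: (num_views, clip_dim) CLIP embeddings of the rendered views.
        pooled = view_features.mean(dim=0, keepdim=True)  # average over views
        return self.mlp(pooled).squeeze()                 # scalar aesthetic score
```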
Table 5 presents the quantitative results. The conclusions drawn are as fol-
lows: (1) Fine-tuning with more 3D data results in stronger models. (2) Unlike
the data size, other settings, such as aesthetic score ranking versus random selection and the number of rendered views, have little impact on performance.
In this analysis, our objective is to delve into the intricacies of synthetic multi-
view data, exploring their benefits, limitations, and potential in comparison to
3D data. Given our capacity to generate a significant amount of synthetic multi-
view data, which supports the data requirements for training foundation 3D
generative models, we also present scaling trends of 3D generative models. The
evaluation protocol in this section adheres to that of the ablation study in Section 4.3.
3D Data vs. Synthetic Multi-view Data. We aim to investigate the perfor-
mance of models trained on 3D data, synthetic data, and a combination of both
(pre-training with synthetic, then fine-tuning with 3D data). For 3D data, we
train a VFusion3D model using 150K 3D data from our internal dataset. This
includes the same 100K 3D data that were used in the fine-tuning process of the
EMU Video. Table 6 presents the quantitative results. It shows that 3D data
is more efficient than synthetic multi-view data in teaching the model to re-
construct common objects. Training with 100K 3D data points already matches
the performance of 2.7M synthetic data points. However, learning with a lim-
ited number of data cannot generalize to uncommon objects or scenes, where
large-scale synthetic data provides strong performance in generalization. An ad-
ditional advantage of synthetic data is that it can be further combined with 3D
data fine-tuning to achieve optimal performance.
Scaling Trends for 3D Generative Models.
By maintaining a fixed model architecture, we examine the impact of vary-
ing training dataset sizes, ranging from 100K, 300K, 500K, 1M, to 2.7M. Our
objective is to illustrate scaling trends that shed light on scalable 3D generative
modeling. Figure 7 presents these trends. We observe that the generation quality,
as measured by LPIPS and CLIP image similarity scores, consistently improves
with the size of the synthetic dataset. Given our ability to generate an unlimited
amount of synthetic data, this makes our approach highly scalable.
Fig. 7: Scaling trends of VFusion3D on synthetic data. The left and right figures
display the LPIPS and CLIP image similarity scores in relation to the dataset size,
respectively. The generation quality consistently improves as the dataset size increases.
Furthermore, our approach can also scale and improve with several other
factors. These include the development of stronger video diffusion models, the
availability of more 3D data for fine-tuning the video diffusion model and the
pre-trained 3D generative model, and the advancement of large 3D feed-forward
generative models. All these factors contribute to the scalability of our model,
positioning it as a promising avenue for foundation 3D generative models.
Limitations. The fine-tuned video diffusion model is less effective at generat-
ing multi-view sequences of specific objects, such as vehicles like cars, bicycles,
and motorcycles, and text-related content. Our filtering system excludes most
of these less-than-ideal results; however, this could potentially affect the perfor-
mance of the trained VFusion3D model due to an insufficient amount of data
related to vehicles and texts. This is a shortcoming inherited from the pre-trained
video diffusion model. With the development of progressively stronger video dif-
fusion models, this limitation should be mitigated.
5 Conclusion
A Training Details
Emu Video Fine-tuning. Following the training recipe of EMU Video [13], we freeze the spatial
convolutional and attention layers of Emu Video, while only fine-tuning the
temporal layers. We use the standard diffusion loss for this fine-tuning process.
The Emu Video model is fine-tuned over a period of 5 days using 80 A100 GPUs,
with a total batch size of 240 and a learning rate of 1 × 10−5. While the 3D
consistency continues to improve with extended fine-tuning, we do not observe
any decline in visual quality. One possible explanation is the static nature of
the spatial layers and the image-conditioned network, which ensures that the
generated 360◦ videos maintain high fidelity with the high-frequency texture
components of the input.
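A minimal sketch of this selective fine-tuning; identifying temporal parameters by a name substring is an assumption about how the model's modules are named.

```python
def freeze_spatial_layers(video_diffusion_model):
    """Freeze spatial layers; return the temporal parameters to be optimized."""
    trainable = []
    for name, param in video_diffusion_model.named_parameters():
        param.requires_grad = "temporal" in name  # train temporal layers only
        if param.requires_grad:
            trainable.append(param)
    return trainable  # passed to the optimizer (batch size 240, learning rate 1e-5)
```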
VFusion3D. The architecture of VFusion3D is identical to LRM [16]. In addi-
tion to the training details provided in the main paper, we use 0.95 as the second
beta parameter of the AdamW optimizer [33]. We apply a gradient clipping of
1.0 and a weight decay of 0.05. This weight decay is only applied to weights that
are neither bias nor part of normalization layers. For mixed precision training,
we use Bfloat16 precision.
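A hedged sketch of this optimizer configuration; detecting bias and normalization parameters via their dimensionality is a common convention and an assumption here, and the first beta is assumed to keep its default value of 0.9.

```python
import torch

def build_vfusion3d_optimizer(model, lr=4e-4):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Biases and normalization weights are 1-D; exclude them from weight decay.
        (no_decay if p.ndim <= 1 else decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.05},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.95))

# During training: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```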
Fine-tuning with 3D Data. We use 32 GPUs to fine-tune the pre-trained
VFusion3D model with 3D data. At this stage, we also employ the L2 loss func-
tion for novel view supervision. The model undergoes fine-tuning with a dataset
of 100K rendered multi-view images over 10 epochs, adhering to a cosine learning
rate schedule. We set the initial learning rate to 1 × 10−4. All other parameters
remain consistent with the VFusion3D pre-training phase.
C Failure Cases
The limitations section of the main paper notes that the fine-tuned video
generator does not always yield flawless results. This is particularly noticeable
in scenarios involving vehicles and texts, where the model sometimes generates
multi-view results that lack 3D consistency. Additional examples of this are
presented in Figure 8.
D Conversion to Meshes
We use the marching cubes algorithm [31] to extract meshes from the generated
NeRF results. Visualizations of sample converted meshes are shown in Figure 9.
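A minimal sketch of this conversion, assuming a callable that evaluates NeRF densities at query points; the grid resolution and density threshold are assumptions.

```python
import torch
from skimage import measure  # scikit-image implementation of marching cubes [31]

@torch.no_grad()
def nerf_to_mesh(query_density, resolution=256, threshold=10.0):
    # query_density: callable mapping (N, 3) points in [-1, 1] to (N,) densities.
    coords = torch.linspace(-1.0, 1.0, resolution)
    grid = torch.stack(torch.meshgrid(coords, coords, coords, indexing="ij"), dim=-1)
    density = query_density(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(density.cpu().numpy(), level=threshold)
    return verts, faces, normals
```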
References
1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida,
D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv
preprint arXiv:2303.08774 (2023) 2
2. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D.,
Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling
latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127
(2023) 4
3. Brooks, T., Peebles, B., Homes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D.,
Taylor, J., Luhman, T., Luhman, E., Ng, C.W.Y., Wang, R., Ramesh, A.: Video
generation models as world simulators (2024), https://openai.com/research/
video-generation-models-as-world-simulators 2
4. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin,
A.: Emerging properties in self-supervised vision transformers. In: Proceedings of
the International Conference on Computer Vision (ICCV) (2021) 5, 6
5. Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo,
O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d
generative adversarial networks. In: CVPR (2022) 4
6. Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A., Aittala,
M., De Mello, S., Karras, T., Wetzstein, G.: Genvs: Generative novel view synthesis
with 3d-aware diffusion models (2023) 4
7. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z.,
Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich
3d model repository. arXiv preprint arXiv:1512.03012 (2015) 2
8. Dai, X., Hou, J., Ma, C.Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende,
S., Wang, X., Dubey, A., et al.: Emu: Enhancing image generation models using
photogenic needles in a haystack. arXiv preprint arXiv:2309.15807 (2023) 2, 4, 8
9. Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A.,
Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d
objects. Advances in Neural Information Processing Systems 36 (2024) 2
10. Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E.,
Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of
annotated 3d objects. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 13142–13153 (2023) 2, 4, 5, 7
11. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS
(2021) 4
12. Erkoç, Z., Ma, F., Shan, Q., Nießner, M., Dai, A.: Hyperdiffusion: Generating
implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015
(2023) 4
13. Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S.S., Shah,
A., Yin, X., Parikh, D., Misra, I.: Emu video: Factorizing text-to-video generation
by explicit image conditioning. arXiv preprint arXiv:2311.10709 (2023) 2, 4, 15
14. Hao, Z., Mallya, A., Belongie, S., Liu, M.Y.: Gancraft: Unsupervised 3d neural
rendering of minecraft worlds. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. pp. 14072–14082 (2021) 2
15. He, Z., Wang, T.: Openlrm: Open-source large reconstruction models. https://
github.com/3DTopia/OpenLRM (2023) 4, 9
16. Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui,
T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. ICLR (2024)
2, 3, 4, 9, 15
17. Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided
object generation with dream fields. In: CVPR (2022) 3
18. Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv
preprint arXiv:2305.02463 (2023) 4
19. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for
real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023) 2,
4
20. Kolotouros, N., Alldieck, T., Zanfir, A., Bazavan, E., Fieraru, M., Sminchisescu, C.:
Dreamhuman: Animatable 3d avatars from text. Advances in Neural Information
Processing Systems 36 (2024) 2
21. Kwak, J.g., Dong, E., Jin, Y., Ko, H., Mahajan, S., Yi, K.M.: Vivid-1-to-3: Novel
view synthesis with video diffusion models. arXiv preprint arXiv:2312.01305 (2023)
4
22. Li, C., Li, S., Zhao, Y., Zhu, W., Lin, Y.: Rt-nerf: Real-time on-device neural
radiance fields towards immersive ar/vr rendering. In: Proceedings of the 41st
IEEE/ACM International Conference on Computer-Aided Design. pp. 1–9 (2022)
2
23. Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., Hong, Y., Sunkavalli, K.,
Shakhnarovich, G., Bi, S.: Instant3d: Fast text-to-3d with sparse-view generation
and large reconstruction model. ICLR (2024) 2, 4
24. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597 (2023) 9
25. Li, W., Chen, R., Chen, X., Tan, P.: Sweetdreamer: Aligning geometric priors in
2d diffusion for consistent text-to-3d. arXiv preprint arXiv:2310.02596 (2023) 4
26. Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler,
S., Liu, M.Y., Lin, T.Y.: Magic3D: High-resolution text-to-3d content creation.
arXiv preprint arXiv:2211.10440 (2022) 4, 9
27. Liu, M., Shi, R., Chen, L., Zhang, Z., Xu, C., Wei, X., Chen, H., Zeng, C., Gu, J.,
Su, H.: One-2-3-45++: Fast single image to 3d objects with consistent multi-view
generation and 3d diffusion. arXiv preprint arXiv:2311.07885 (2023) 4
28. Liu, M., Xu, C., Jin, H., Chen, L., Xu, Z., Su, H., et al.: One-2-3-45: Any single
image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint
arXiv:2306.16928 (2023) 4
29. Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-
1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision. pp. 9298–9309 (2023) 4
30. Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer:
Generating multiview-consistent images from a single-view image. arXiv preprint
arXiv:2309.03453 (2023) 4
31. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface con-
struction algorithm. In: Seminal graphics: pioneering efforts that shaped the field,
pp. 347–353 (1998) 16
32. Lorraine, J., Xie, K., Zeng, X., Lin, C.H., Takikawa, T., Sharp, N., Lin, T.Y., Liu,
M.Y., Fidler, S., Lucas, J.: Att3d: Amortized text-to-3d object synthesis. arXiv
preprint arXiv:2306.07349 (2023) 4
33. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint
arXiv:1711.05101 (2017) 9, 15
34. Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3d captioning with pretrained
models. arXiv preprint arXiv:2306.07279 (2023) 5
35. Melas-Kyriazi, L., Laina, I., Rupprecht, C., Neverova, N., Vedaldi, A., Gafni, O.,
Kokkinos, F.: Im-3d: Iterative multiview diffusion and reconstruction for high-
quality 3d generation. arXiv preprint arXiv:2402.08682 (2024) 3, 4
36. Mercier, A., Nakhli, R., Reddy, M., Yasarla, R., Cai, H., Porikli, F., Berger, G.:
Hexagen3d: Stablediffusion is just one step away from fast and diverse text-to-3d
generation. arXiv preprint arXiv:2401.07727 (2024) 4
37. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy
networks: Learning 3d reconstruction in function space. In: Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition. pp. 4460–4470
(2019) 2
38. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Commu-
nications of the ACM 65(1), 99–106 (2021) 2
39. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf: Learning
continuous signed distance functions for shape representation. In: Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition. pp. 165–
174 (2019) 2
40. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205
(2023) 2
41. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3d using
2d diffusion. arXiv (2022) 4, 9
42. Qian, G., Mai, J., Hamdi, A., Ren, J., Siarohin, A., Li, B., Lee, H.Y., Skorokhodov,
I., Wonka, P., Tulyakov, S., et al.: Magic123: One image to high-quality 3d object
generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843
(2023) 4
43. Qin, X., Dai, H., Hu, X., Fan, D.P., Shao, L., Gool, L.V.: Highly accurate dichoto-
mous image segmentation. In: ECCV (2022) 15
44. Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O., Jagersand, M.: U2-
net: Going deeper with nested u-structure for salient object detection. vol. 106,
p. 107404 (2020) 7
45. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable
visual models from natural language supervision. In: ICML (2021) 4, 9
46. Ren, X., Huang, J., Zeng, X., Museth, K., Fidler, S., Williams, F.: Xcube (x3):
Large-scale 3d generative modeling using sparse voxel hierarchies. arXiv preprint
arXiv:2312.03806 (2023) 4
47. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2,
4
48. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour,
S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J.,
Norouzi, M.: Photorealistic text-to-image diffusion models with deep language un-
derstanding. arXiv preprint arXiv:2205.11487 (2022) 4
49. Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion
for 3d generation. arXiv preprint arXiv:2308.16512 (2023) 2, 4, 9
50. Sun, C., Han, J., Deng, W., Wang, X., Qin, Z., Gould, S.: 3d-gpt: Procedural 3d
modeling with large language models. arXiv preprint arXiv:2310.12945 (2023) 2
51. Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view
3d reconstruction. arXiv preprint arXiv:2312.13150 (2023) 4
52. Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Viewset diffusion:(0-) image-
conditioned 3d generative models from 2d data. arXiv preprint arXiv:2306.07881
(2023) 4
53. Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-
view gaussian model for high-resolution 3d content creation. arXiv preprint
arXiv:2402.05054 (2024) 4, 9, 10, 15
54. Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian
splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)
4
55. Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalk-
wyk, J., Dai, A.M., Hauth, A., et al.: Gemini: a family of highly capable multimodal
models. arXiv preprint arXiv:2312.11805 (2023) 2
56. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bash-
lykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation
and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023) 2, 5
57. Wang, P., Shi, Y.: Imagedream: Image-prompt multi-view diffusion for 3d genera-
tion. arXiv preprint arXiv:2312.02201 (2023) 4
58. Wang, P., Tan, H., Bi, S., Xu, Y., Luan, F., Sunkavalli, K., Wang, W., Xu, Z.,
Zhang, K.: Pf-lrm: Pose-free large reconstruction model for joint pose and shape
prediction. ICLR (2024) 4
59. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-
fidelity and diverse text-to-3d generation with variational score distillation. arXiv
preprint arXiv:2305.16213 (2023) 4, 9
60. Wu, R., Mildenhall, B., Henzler, P., Park, K., Gao, R., Watson, D., Srinivasan,
P.P., Verbin, D., Barron, J.T., Poole, B., et al.: Reconfusion: 3d reconstruction
with diffusion priors. arXiv preprint arXiv:2312.02981 (2023) 4
61. Xu, D., Yuan, Y., Mardani, M., Liu, S., Song, J., Wang, Z., Vahdat, A.:
Agg: Amortized generative 3d gaussians for single image to 3d. arXiv preprint
arXiv:2401.04099 (2024) 4
62. Xu, Y., Tan, H., Luan, F., Bi, S., Wang, P., Li, J., Shi, Z., Sunkavalli, K., Wet-
zstein, G., Xu, Z., et al.: Dmv3d: Denoising multi-view diffusion using 3d large
reconstruction model. ICLR (2024) 4
63. Yi, T., Fang, J., Wang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang,
X.: Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d
and 3d diffusion models. arXiv preprint arXiv:2310.08529 (2023) 4
64. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from
one or few images. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 4578–4587 (2021) 4
65. Yu, X., Xu, M., Zhang, Y., Liu, H., Ye, C., Wu, Y., Yan, Z., Zhu, C., Xiong, Z.,
Liang, T., et al.: Mvimgnet: A large-scale dataset of multi-view images. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 9150–9161 (2023) 2, 4, 5, 7
66. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable
effectiveness of deep features as a perceptual metric. In: CVPR (2018) 5, 7
67. Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane
meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with
transformers. arXiv preprint arXiv:2312.09147 (2023) 4