
Preprint

DEPTHSPLAT: CONNECTING GAUSSIAN SPLATTING AND DEPTH

Haofei Xu1,2  Songyou Peng1  Fangjinhua Wang1  Hermann Blum1  Daniel Barath1
Andreas Geiger2  Marc Pollefeys1,3
1 ETH Zurich   2 University of Tübingen, Tübingen AI Center   3 Microsoft
Input & GT
arXiv:2410.13862v1 [cs.CV] 17 Oct 2024

etry&Ren
om d
Ge

e ri
ter

ng
Bet
MVSplat

Depth 3DGS

ns

g
U
up

in
n
e rai
DepthSplat

rvis
ed Pre-t

Image 1: Depth Image 2: Depth Novel View Validation curves of depth prediction error

Figure 1: DepthSplat enables cross-task interactions between Gaussian splatting and depth.
Left: Better depth leads to improved Gaussian splatting reconstruction. Right: Unsupervised depth
pre-training with Gaussian splatting leads to reduced depth prediction error.

ABSTRACT

Gaussian splatting and single/multi-view depth estimation are typically studied in
isolation. In this paper, we present DepthSplat to connect Gaussian splatting and
depth estimation and study their interactions. More specifically, we first contribute
a robust multi-view depth model by leveraging pre-trained monocular depth fea-
tures, leading to high-quality feed-forward 3D Gaussian splatting reconstructions.
We also show that Gaussian splatting can serve as an unsupervised pre-training
objective for learning powerful depth models from large-scale unlabelled datasets.
We validate the synergy between Gaussian splatting and depth estimation through
extensive ablation and cross-task transfer experiments. Our DepthSplat achieves
state-of-the-art performance on ScanNet, RealEstate10K and DL3DV datasets in
terms of both depth estimation and novel view synthesis, demonstrating the mutual
benefits of connecting both tasks. Our code, models, and video results are available
at https://haofeixu.github.io/depthsplat/.

1 INTRODUCTION

Novel view synthesis (Buehler et al., 2001; Zhou et al., 2018) and depth prediction (Schönberger
et al., 2016; Eigen et al., 2014) are two fundamental tasks in computer vision, serving as the driving
force behind numerous applications ranging from augmented reality to robotics and autonomous
driving. There have been notable advancements in both areas recently.
For novel view synthesis, 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has emerged as a popular
technique due to its impressive real-time performance while attaining high visual fidelity. Recently,
advances in feed-forward 3DGS models (Charatan et al., 2024; Chen et al., 2024; Szymanowicz et al.,
2024) have been introduced to alleviate the need for tedious per-scene optimization, also enabling
few-view 3D reconstruction. The state-of-the-art sparse-view method MVSplat (Chen et al., 2024)
relies on feature matching-based multi-view depth estimation (Xu et al., 2023) to localize the 3D
Gaussian positions, which makes it suffer from similar limitations (e.g., occlusions, texture-less
regions, and reflective surfaces) as other multi-view depth methods (Schönberger et al., 2016; Yao
et al., 2018; Gu et al., 2020; Wang et al., 2021; Duzceker et al., 2021).


On the other hand, significant progress has been made in monocular depth estimation, with recent
models (Yang et al., 2024a; Ke et al., 2024; Yin et al., 2023; Fu et al., 2024; Ranftl et al., 2020;
Eftekhar et al., 2021) achieving robust predictions on diverse in-the-wild data. However, these depths
typically lack consistency across views, constraining their performance in downstream tasks (Wang
et al., 2023; Yin et al., 2022). In addition, both state-of-the-art multi-view (Cao et al., 2022; Xu et al.,
2023) and monocular (Yang et al., 2024a; Piccinelli et al., 2024; Yang et al., 2024b) depth models are
trained with ground truth depth supervision, which prevents exploiting large unlabelled datasets for
more robust depth predictions.
The integration of 3DGS with single/multi-view depth estimation presents a compelling solution
to overcome the individual limitations of each technique while at the same time enhancing their
strengths. To this end, we introduce DepthSplat, which exploits the complementary nature of sparse-
view feed-forward 3DGS and robust single/multi-view depth estimation to improve the performance
for both tasks.
Specifically, we first contribute a robust multi-view depth model by integrating pre-trained monocular
depth features (Yang et al., 2024b) into the multi-view feature matching branch, which not only
maintains the consistency of multi-view depth models but also leads to more robust results in
situations that are hard to match (e.g., occlusions, texture-less regions and reflective surfaces). The
predicted multi-view depth maps are then unprojected to 3D as the Gaussian centers, and we use an
additional lightweight network to predict the remaining Gaussian parameters. These are combined
to achieve novel view synthesis with the splatting operation (Kerbl et al., 2023).
Thanks to our improved multi-view depth model, the quality of novel view synthesis with Gaussian
splatting is also significantly enhanced (see Fig. 1 left). In addition, our Gaussian splatting module is
fully differentiable, which requires only photometric supervision to optimize all model components.
This provides a new, unsupervised way to pre-train depth prediction models on large-scale unlabeled
datasets without requiring ground truth geometry information. The pre-trained depth model can be
further fine-tuned for specific depth tasks and achieves superior results over training from scratch
(see Fig. 1 right, where unsupervised pre-training leads to improved performance).
We conduct extensive experiments on the large-scale TartanAir (Wang et al., 2020), ScanNet (Dai
et al., 2017) and RealEstate10K (Zhou et al., 2018) datasets for depth estimation and Gaussian
splatting tasks, as well as the recently introduced DL3DV (Ling et al., 2023) dataset, which features
complex real-world scenes and thus is more challenging. Under various evaluation settings, our
DepthSplat achieves state-of-the-art results. The strong performance on both tasks demonstrates the
mutual benefits of connecting Gaussian splatting and depth estimation.

2 RELATED WORK

Multi-View Depth Estimation. As a core component of classical multi-view stereo pipelines (Schön-
berger et al., 2016), multi-view depth estimation exploits multi-view photometric consistency across
multiple images to perform feature matching and predict the depth map of the reference image. Re-
cently, many learning-based methods (Yao et al., 2018; Gu et al., 2020; Wang et al., 2021; Duzceker
et al., 2021; Wang et al., 2022; Ding et al., 2022; Cao et al., 2022) have been proposed to improve
depth accuracy. For example, MVSNet (Yao et al., 2018) uses the plane-sweep algorithm (Collins,
1996) to build a 3D cost volume, regularizes it with a 3D CNN, and then regresses the depth map.
Though these learning-based methods significantly improve the depth quality when compared to
traditional methods (Galliani et al., 2015; Schönberger et al., 2016; Xu & Tao, 2019), they can-
not handle challenging situations where the multi-view photometric consistency assumption is not
guaranteed, e.g., occlusions, low-textured areas, and non-Lambertian surfaces.
Monocular Depth Estimation. Recently, we have witnessed significant progress in depth estimation
from a single image (Ranftl et al., 2020; Bhat et al., 2023; Yin et al., 2023; Yang et al., 2024a; Ke
et al., 2024), and existing methods can produce surprisingly accurate results on diverse in-the-wild
data. However, monocular depth methods inherently suffer from scale ambiguities and are not able
to produce multi-view consistent depth predictions, which are crucial for downstream tasks like
3D reconstruction (Yin et al., 2022) and video depth estimation (Wang et al., 2023). In this paper,
we leverage the powerful features from a pre-trained monocular depth model (Yang et al., 2024b)
to augment feature-matching based multi-view depth estimation, which not only maintains high
multi-view consistency but also leads to significantly improved robustness in challenging situations
such as low-textured regions and reflective surfaces.
Monocular and Multi-View Depth Fusion. Several previous methods (Bae et al., 2022; Li et al.,
2023; Yang et al., 2022; Cheng et al., 2024) try to fuse monocular and multi-view depths to alleviate the
limitations of feature matching-based multi-view depth estimation. However, they either fuse single-
and multi-view depth predictions with additional networks or rely on sophisticated architectures. In
contrast, we identify the power of off-the-shelf pre-trained monocular depth models and propose to
augment multi-view cost volumes with monocular features, which leads not only to a simple model
architecture but also to strong performance.
Feed-Forward Gaussian Splatting. Several feed-forward 3D Gaussian splatting models (Charatan
et al., 2024; Szymanowicz et al., 2024; Chen et al., 2024; Wewer et al., 2024; Tang et al., 2024; Xu
et al., 2024b; Zhang et al., 2024) have been proposed in the literature thanks to their efficiency and ability to
handle sparse views. In particular, pixelSplat (Charatan et al., 2024) and Splatter Image (Szymanowicz
et al., 2024) predict 3D Gaussians from image features, while MVSplat (Chen et al., 2024) encodes the
feature matching information with cost volumes and achieves better geometry. However, it inherently
suffers from the limitation of feature matching in challenging situations like texture-less regions
and reflective surfaces. In this paper, we propose to integrate monocular features from pre-trained
monocular depth models for more robust depth prediction and Gaussian splatting reconstruction.
We additionally study the interactions between depth and Gaussian splatting tasks. Another line
of work like LGM (Tang et al., 2024), GRM (Xu et al., 2024b), and GS-LRM (Zhang et al., 2024)
relies significantly on the training data and compute, discarding explicit feature matching cues and
learning everything purely from data. This makes them expensive to train (e.g., GS-LRM (Zhang
et al., 2024) is trained with 64 A100 GPUs for 2 days), while our model can be trained in 1 day with
8 GPUs. Moreover, our Gaussian splatting module, in return, enables pre-training depth models on
large-scale unlabelled datasets without the need for ground truth depths.
Depth and Gaussian Splatting. In addition to the experimental setup (feed-forward) studied in this
paper, recently, there has been another line of work (Chung et al., 2024; Turkulainen et al., 2024)
which applies an additional depth loss in the Gaussian splatting optimization process. We remark
that these two approaches (feed-forward vs. per-scene optimization) are orthogonal. In particular,
our approach focuses on exploring the advanced network architectures and the power of large-scale
training data, while the optimization methods mainly study the effect of loss functions for regularizing
the optimization process.

3 DEPTHSPLAT

Given N input images {I^i}_{i=1}^N with I^i ∈ R^{H×W×3}, where H and W are the image sizes, and corresponding projection matrices {P^i}_{i=1}^N with P^i ∈ R^{3×4}, computed from the intrinsic and extrinsic matrices, our goal is to predict a dense per-pixel depth map D^i ∈ R^{H×W} and per-pixel Gaussian parameters {(µ_j, α_j, Σ_j, c_j)}_{j=1}^{H×W×N} for each image, where µ_j, α_j, Σ_j and c_j are the 3D Gaussian's position, opacity, covariance, and color information. As shown in Fig. 2, the core of our method is a multi-view depth model augmented with monocular depth features, where we obtain the position µ_j of each Gaussian by unprojecting depth to 3D with camera parameters, and other Gaussian parameters are predicted by an additional lightweight head.
More specifically, our depth model consists of two branches: one for modeling feature matching
information using cost volumes, and another for extracting monocular features from a pre-trained
monocular depth network. The cost volumes and monocular features are concatenated together for
subsequent depth regression with a 2D U-Net and a softmax layer. For the depth task, we train our
depth model with ground truth depth supervision. Our full model for novel view synthesis is trained
with the photometric rendering loss, which can also be used as an unsupervised pre-training stage for
the depth model. In the following, we introduce the individual components.

3.1 MULTI-VIEW FEATURE MATCHING

In this branch, we extract multi-view features with a multi-view Transformer architecture and then
build multiple cost volumes that correspond to each input view.


[Figure 2 diagram: multi-view images are processed by a multi-view Transformer into per-view cost volumes (multi-view branch) and by a ViT into per-view monocular features (monocular branch); both are concatenated, fed to a 2D U-Net, unprojected to 3D Gaussians, and rendered into novel views supervised by a loss against the GT novel view.]

Figure 2: DepthSplat connects depth estimation and 3D Gaussian splatting with a shared architecture,
which enables cross-task transfer. In particular, DepthSplat consists of a multi-view branch to
model feature-matching information and a single-view branch to extract monocular features. The
per-view cost volumes and monocular features are concatenated for depth regression with a 2D
U-Net architecture. For the depth estimation task, we train the depth model with ground truth depth
supervision. For the Gaussian splatting task, we first unproject all depth maps to 3D as the Gaussian
centers, and in parallel, we use an additional head to predict the remaining Gaussian parameters.
Novel views are rendered with the splatting operation. The full model for novel view synthesis is
trained with the photometric rendering loss, which can also be used as an unsupervised pre-training
stage for the depth model.

Multi-View Feature Extraction. For N input images, we first use a lightweight weight-sharing
ResNet (He et al., 2016) architecture to get s× downsampled features for each image independently.
To handle different image resolutions, we make the downsampling factor s flexible by controlling the
number of stride-2 3 × 3 convolutions. For example, the downsampling factor s is 4 if two stride-2
convolutions are used and 8 if three are used. To exchange information across different views, we use
a multi-view Swin Transformer (Liu et al., 2021; Xu et al., 2022; 2023) which contains six stacked
self- and cross-attention layers to obtain multi-view-aware features {F^i}_{i=1}^N with F^i ∈ R^{(H/s)×(W/s)×C},
where C is the feature dimension. More specifically, cross-attention is performed between each
reference view and other views. When more than two images (N > 2) are given as input, we perform
cross-attention between each reference view and its top-2 nearest neighboring views, which are
selected based on their camera position distances to the reference view. This makes the computation
tractable while maintaining cross-view interactions.
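To make the view-selection step concrete, the following minimal PyTorch sketch picks the top-2 nearest neighboring views from camera positions; it is illustrative rather than the released implementation, and it assumes camera-to-world extrinsics are available as 4×4 matrices:

```python
import torch

def select_nearest_views(extrinsics: torch.Tensor, num_neighbors: int = 2) -> torch.Tensor:
    """For each reference view, return the indices of its nearest neighboring views.

    extrinsics: (N, 4, 4) camera-to-world matrices; the last column holds each
    camera center in world coordinates. Returns (N, num_neighbors) view indices.
    """
    centers = extrinsics[:, :3, 3]              # (N, 3) camera positions
    dists = torch.cdist(centers, centers)       # (N, N) pairwise distances
    dists.fill_diagonal_(float("inf"))          # exclude the reference view itself
    return dists.topk(num_neighbors, largest=False).indices
```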
Feature Matching. We encode the feature matching information across different views with the
plane-sweep stereo approach (Collins, 1996; Xu et al., 2023). More specifically, for each view i, we
first uniformly sample D depth candidates {d_m}_{m=1}^D from the near and far depth ranges and then
warp the feature F^j of view j to view i with the camera projection matrix and each depth candidate
d_m. We thereby obtain D warped features {F^{j→i}_{d_m}}_{m=1}^D that correspond to feature F^i, and measure
their feature correlations with the dot-product operation (Xu & Zhang, 2020; Chen et al., 2024). The
cost volume C^i ∈ R^{(H/s)×(W/s)×D} for image i is obtained by stacking all correlations. Accordingly, we
obtain cost volumes {C^i}_{i=1}^N for all input images {I^i}_{i=1}^N. For more than two input views, similar to
the strategy in cross-view attention computation, we select the top-2 nearest views for each reference
view and compute feature correlations with only the selected views. This enables our cost volume
construction to achieve a good speed-accuracy trade-off and scale efficiently to a larger number of
input views. The correlation values for the two selected views are combined with averaging.
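As an illustration of the plane-sweep construction described above, the sketch below builds a cost volume for one reference view and one neighboring view; the warping function is left abstract (it would be implemented from the projection matrices), and the normalization by the feature dimension is a common choice rather than necessarily the exact one used here:

```python
import torch

def build_cost_volume(feat_i, feat_j, depth_candidates, warp_to_view_i):
    """Plane-sweep cost volume for reference view i against a neighboring view j.

    feat_i, feat_j: (C, H, W) view features.
    depth_candidates: (D,) depths uniformly sampled between near and far.
    warp_to_view_i: callable(feat_j, depth) -> (C, H, W), warping view-j features
        into view i for a fronto-parallel plane at `depth` (uses the projection
        matrices; omitted here).
    Returns a (D, H, W) cost volume of dot-product correlations.
    """
    C = feat_i.shape[0]
    correlations = []
    for d in depth_candidates:
        warped = warp_to_view_i(feat_j, d)                 # (C, H, W) warped features
        correlations.append((feat_i * warped).sum(0) / C)  # per-pixel dot product
    return torch.stack(correlations, dim=0)                # (D, H, W)

# With more than two views, one such volume is built per selected neighbor and the
# correlations are averaged, e.g. 0.5 * (cost_j1 + cost_j2).
```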

3.2 MONOCULAR DEPTH FEATURE EXTRACTION

Despite the remarkable progress in multi-view feature matching-based depth estimation (Yao et al.,
2018; Wang et al., 2022; Xu et al., 2023) and Gaussian splatting (Chen et al., 2024), they inherently
suffer from limitations in challenging situations like occlusions, texture-less regions, and reflective
surfaces. Thus, we propose to integrate pre-trained monocular depth features into the cost volume
representation to handle scenarios that are challenging or impossible to match.
More specifically, we leverage the pre-trained monocular depth backbone from the recent Depth
Anything (Yang et al., 2024b) model thanks to its impressive performance on diverse in-the-wild data.
The monocular backbone is a ViT (Dosovitskiy et al., 2020; Oquab et al., 2023) model with a patch size of 14, which outputs a feature map at 1/14 of the original image resolution. We simply bilinearly interpolate the monocular features to the same spatial resolution as the cost volume in Sec. 3.1 and obtain the monocular feature F^i_mono ∈ R^{(H/s)×(W/s)×C_mono} for input image I^i, where C_mono is the dimension of the monocular feature. This process is performed for all input images in parallel, yielding monocular features {F^i_mono}_{i=1}^N, which are subsequently used for per-view depth map estimation.
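A minimal sketch of this resizing step is shown below, assuming the frozen monocular backbone already returns an (N, C_mono, H/14, W/14) feature map; the align_corners choice is an assumption:

```python
import torch
import torch.nn.functional as F

def resize_mono_features(vit_features: torch.Tensor, h: int, w: int, s: int) -> torch.Tensor:
    """Bilinearly resize ViT patch features to the cost-volume resolution.

    vit_features: (N, C_mono, H/14, W/14) per-view monocular features.
    h, w: original image height and width; s: downsampling factor (4 or 8).
    Returns (N, C_mono, H/s, W/s) features aligned with the cost volume.
    """
    return F.interpolate(vit_features, size=(h // s, w // s),
                         mode="bilinear", align_corners=True)
```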

3.3 FEATURE FUSION AND DEPTH REGRESSION

To achieve robust and multi-view consistent depth predictions, we combine the monocular feature F^i_mono ∈ R^{(H/s)×(W/s)×C_mono} and cost volume C^i ∈ R^{(H/s)×(W/s)×D} via simple concatenation along the channel dimension. A subsequent 2D U-Net (Rombach et al., 2022; Ronneberger et al., 2015) is used to regress depth from the concatenated monocular features and cost volumes. This process is performed for all input images in parallel; for each image, it outputs a tensor of shape H/s × W/s × D, where D is the number of depth candidates. We then normalize the D dimension with the softmax operation and perform a weighted average of all depth candidates to obtain the depth output.
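The softmax-weighted depth regression at the end of this step can be summarized with a short sketch (shapes follow the notation above; this is an illustration, not the released code):

```python
import torch

def regress_depth(logits: torch.Tensor, depth_candidates: torch.Tensor) -> torch.Tensor:
    """Softmax-weighted average over depth candidates.

    logits: (B, D, H/s, W/s) per-pixel scores over D depth candidates
        produced by the 2D U-Net from the concatenated features.
    depth_candidates: (D,) sampled depth values.
    Returns (B, H/s, W/s) per-pixel depth.
    """
    probs = torch.softmax(logits, dim=1)                           # normalize over the D dimension
    return (probs * depth_candidates.view(1, -1, 1, 1)).sum(dim=1)
```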
We also apply a hierarchical matching (Gu et al., 2020) architecture where an additional refinement
step at 2× higher feature resolution is employed to improve the performance further. More specifically,
based on the coarse depth prediction, we perform a correspondence search on the 2× higher feature
maps within the neighbors of the 2× upsampled coarse depth prediction. Since we already have a
coarse depth prediction, we only need to search a smaller range at the higher resolution, and thus, we
construct a smaller cost volume compared to the coarse stage. Such a 2-scale hierarchical architecture
not only leads to improved efficiency since most computation is spent on low resolution, but also
leads to better results thanks to the use of higher-resolution features (Gu et al., 2020). A similar feature
fusion and depth regression procedure is used to obtain higher-resolution depth predictions, which are
subsequently upsampled to the full resolution with a learned upsampler (Ranftl et al., 2021).
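One way to realize the narrow correspondence search of the refinement stage is sketched below; the candidate count and relative search band are illustrative choices, not the exact hyper-parameters used in this work:

```python
import torch
import torch.nn.functional as F

def refine_candidates(coarse_depth: torch.Tensor, num_candidates: int = 32,
                      radius: float = 0.1) -> torch.Tensor:
    """Depth candidates for the 2x-higher-resolution refinement stage.

    coarse_depth: (B, H, W) coarse depth prediction.
    Returns (B, num_candidates, 2H, 2W) candidates sampled in a narrow band
    around the upsampled coarse prediction.
    """
    up = F.interpolate(coarse_depth.unsqueeze(1), scale_factor=2,
                       mode="bilinear", align_corners=True)      # (B, 1, 2H, 2W)
    offsets = torch.linspace(-radius, radius, num_candidates,
                             device=coarse_depth.device).view(1, -1, 1, 1)
    return up * (1.0 + offsets)                                   # relative band around the coarse depth
```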

3.4 GAUSSIAN PARAMETER PREDICTION

For the task of 3D Gaussian splatting, we directly unproject the per-pixel depth maps to 3D with
the camera parameters as the Gaussian centers µj . We append an additional lightweight network to
predict other Gaussian parameters αj , Σj , cj , which are opacity, covariance, and color, respectively.
With all the predicted 3D Gaussians, we can render novel view images with the Gaussian splatting
operation (Kerbl et al., 2023).
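For clarity, unprojecting a depth map to Gaussian centers amounts to back-projecting pixels with the inverse intrinsics and transforming to world space; the sketch below assumes a pinhole camera and the pixel-center (+0.5) convention:

```python
import torch

def unproject_depth(depth: torch.Tensor, intrinsics: torch.Tensor,
                    cam_to_world: torch.Tensor) -> torch.Tensor:
    """Unproject a per-pixel depth map to 3D Gaussian centers in world space.

    depth: (H, W) depth of one view; intrinsics: (3, 3) matrix K;
    cam_to_world: (4, 4) camera-to-world extrinsics.
    Returns (H*W, 3) Gaussian centers in world coordinates.
    """
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)   # (H, W, 3) homogeneous pixels
    rays = pix @ torch.inverse(intrinsics).T                            # back-project to camera rays
    pts_cam = rays * depth.unsqueeze(-1)                                # scale rays by depth
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]  # camera -> world
    return pts_world.reshape(-1, 3)
```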

3.5 TRAINING LOSS

We study the properties of our proposed model on two tasks: depth estimation and novel view
synthesis with 3D Gaussian splatting (Kerbl et al., 2023). The loss functions are introduced below.
Depth estimation. We train our depth model (without the Gaussian splatting head) with ℓ1 loss and
gradient loss between the inverse depths of prediction and ground truth:

L_depth = α · |D_pred − D_gt| + β · (|∂_x D_pred − ∂_x D_gt| + |∂_y D_pred − ∂_y D_gt|),   (1)

where ∂_x and ∂_y denote the gradients in the x and y directions, respectively. Following UniMatch (Xu et al., 2023), we use α = 20 and β = 20.
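A PyTorch sketch of Eq. 1 is given below; it operates directly on inverse depths, averages over pixels, and uses simple finite differences for the gradient terms (an assumption about the exact discretization):

```python
import torch

def depth_loss(inv_depth_pred: torch.Tensor, inv_depth_gt: torch.Tensor,
               alpha: float = 20.0, beta: float = 20.0) -> torch.Tensor:
    """L1 + gradient loss on inverse depths (Eq. 1), averaged over pixels.

    Both inputs have shape (B, H, W) and are assumed to be inverse depths.
    """
    l1 = (inv_depth_pred - inv_depth_gt).abs().mean()
    grad_x = lambda d: d[:, :, 1:] - d[:, :, :-1]   # finite differences along x (width)
    grad_y = lambda d: d[:, 1:, :] - d[:, :-1, :]   # finite differences along y (height)
    grad = (grad_x(inv_depth_pred) - grad_x(inv_depth_gt)).abs().mean() + \
           (grad_y(inv_depth_pred) - grad_y(inv_depth_gt)).abs().mean()
    return alpha * l1 + beta * grad
```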
View synthesis. We train our full model with a combination of mean squared error (MSE) and
LPIPS (Zhang et al., 2018) losses between rendered and ground truth image colors:
L_gs = Σ_{m=1}^{M} ( MSE(I^m_render, I^m_gt) + λ · LPIPS(I^m_render, I^m_gt) ),   (2)

where M is the number of novel views to render in a single forward pass. The LPIPS loss weight λ is
set to 0.05 following MVSplat (Chen et al., 2024).
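Eq. 2 can be implemented with the publicly available lpips package as sketched below; the VGG backbone choice and the [0, 1] input range are assumptions, not necessarily the exact configuration used here:

```python
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # frozen perceptual network; backbone choice is assumed

def render_loss(rendered: torch.Tensor, target: torch.Tensor, lam: float = 0.05) -> torch.Tensor:
    """MSE + weighted LPIPS loss over M rendered novel views (Eq. 2).

    Both tensors have shape (M, 3, H, W) with values in [0, 1]; LPIPS expects
    inputs in [-1, 1], hence the rescaling below.
    """
    mse = ((rendered - target) ** 2).mean(dim=(1, 2, 3))               # per-view MSE
    perceptual = lpips_fn(rendered * 2 - 1, target * 2 - 1).view(-1)   # per-view LPIPS
    return (mse + lam * perceptual).sum()
```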


Table 1: DepthSplat model variants. We explore different monocular backbones and different
multi-view models (1-scale and 2-scale features for hierarchical matching as described in Sec. 3.3),
where both larger monocular backbones and 2-scale hierarchical models lead to consistently improved
performance for both depth estimation and view synthesis tasks.

                         Depth (TartanAir)    3DGS (RealEstate10K)        Param
Monocular  Multi-View    Abs Rel ↓   δ1 ↑     PSNR ↑  SSIM ↑  LPIPS ↓     (M)
ViT-S      1-scale       8.46        93.02    26.76   0.877   0.123       37
ViT-B      1-scale       6.94        94.46    27.09   0.881   0.119       113
ViT-L      1-scale       6.07        95.52    27.34   0.885   0.118       354
ViT-S      2-scale       7.01        94.56    26.96   0.880   0.122       40
ViT-B      2-scale       6.22        95.31    27.27   0.885   0.120       117
ViT-L      2-scale       5.57        96.07    27.44   0.887   0.119       360

Table 2: Ablations. We evaluate the contribution of the monocular feature branch and the cost
volume branch, as well as different monocular features. Our results indicate that the monocular
feature and cost volume are complementary, with large performance drops when removing either one.
The pre-trained Depth Anything network weights achieve overall the best performance.

                                          Depth (TartanAir)    3DGS (RealEstate10K)
Module              Method                Abs Rel ↓   δ1 ↑     PSNR ↑  SSIM ↑  LPIPS ↓
Components          full                  8.46        93.02    26.76   0.877   0.123
                    w/o mono feature      12.25       88.00    26.27   0.866   0.130
                    w/o cost volume       11.34       90.02    23.09   0.761   0.187
                    single branch         11.26       90.84    25.99   0.858   0.134
Monocular features  ConvNeXt-T            10.50       91.13    26.28   0.867   0.130
                    Midas                 9.53        91.61    26.40   0.869   0.129
                    DINO v2               8.93        92.49    26.68   0.874   0.125
                    Depth Anything v1     8.38        93.23    26.70   0.875   0.125
                    Depth Anything v2     8.46        93.02    26.76   0.877   0.123

4 EXPERIMENTS

Implementation Details. We implement our method in PyTorch (Paszke et al., 2019) and optimize
our model with the AdamW (Loshchilov & Hutter, 2017) optimizer and a cosine learning rate
schedule with warm-up during the first 5% of the total training iterations. We adopt the xFormers
(Lefaudeux et al., 2022) library for our monocular ViT backbone implementation. We use a lower
learning rate of 2 × 10−6 for the pre-trained monocular backbone, while the remaining layers use a
learning rate of 2 × 10−4. The feature downsampling factor s in our multi-view branch (Sec. 3.1)
is chosen based on the image resolution. More specifically, for experiments on the 256 × 256
resolution RealEstate10K (Zhou et al., 2018) dataset, we choose s = 4. For higher resolution datasets
(e.g., TartanAir (Wang et al., 2020), ScanNet (Dai et al., 2017), KITTI (Geiger et al., 2013) and
DL3DV (Ling et al., 2023)), we choose s = 8. Our hierarchical matching models in Sec. 3.3 use
2-scale features, i.e., 1/8 and 1/4, or 1/4 and 1/2 resolutions.
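The optimizer setup with a lower backbone learning rate and a cosine schedule with warm-up can be sketched as follows; the `mono_backbone` attribute and the weight-decay value are assumptions for illustration:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_steps: int, warmup_frac: float = 0.05):
    """AdamW with a lower learning rate for the pre-trained monocular backbone
    and a cosine schedule with linear warm-up."""
    mono_params = list(model.mono_backbone.parameters())   # hypothetical module name
    mono_ids = {id(p) for p in mono_params}
    other_params = [p for p in model.parameters() if id(p) not in mono_ids]
    optimizer = torch.optim.AdamW([
        {"params": mono_params, "lr": 2e-6},    # pre-trained monocular backbone
        {"params": other_params, "lr": 2e-4},   # remaining layers
    ], weight_decay=0.01)                       # weight decay value is an assumption

    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay to zero

    return optimizer, LambdaLR(optimizer, lr_lambda)
```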
Training Details. For depth experiments, we mainly follow the setup of UniMatch (Xu et al., 2023)
for fair comparisons. More specifically, we train our depth model on 4 GH200 GPUs for 100K
iterations with a total batch size of 32. For Gaussian splatting experiments on RealEstate10K (Zhou
et al., 2018), we use the same training and testing splits of pixelSplat (Charatan et al., 2024) and
MVSplat (Chen et al., 2024), and train our model on 8 GH200 GPUs for 150K iterations with a total
batch size of 32, which takes about 1 day. For experiments on the DL3DV (Ling et al., 2023) dataset,
we evaluate on the official benchmark split with 140 scenes, while the remaining 9,896 scenes are
used for training. We fine-tune our RealEstate10K pre-trained model on 4 A100 GPUs for 100K
iterations with a total batch size of 4, where the number of input views is randomly sampled from 2
to 6. We evaluate the model's performance with different numbers of input views (2, 4, 6). Our code
and training scripts are available at https://github.com/cvg/depthsplat.


[Figure 3 panels: input images, predicted depth, and error maps, each with and without monocular features.]

Figure 3: Effect of monocular features for depth on ScanNet. The monocular feature greatly
improves predictions in challenging situations like texture-less regions (e.g., the wall in the first
example) and reflective surfaces (e.g., the refrigerator in the second example).

[Figure 4 panels: input images, depth with and without monocular features, render error, rendered novel view, and GT novel view.]

Figure 4: Effect of monocular features for 3DGS on RealEstate10K. Without monocular features,
the model struggles to predict reliable depth for pixels that cannot find correspondences
(e.g., the lounge chair highlighted with the red rectangle), which subsequently causes misalignment
in the rendered image due to the incorrect geometry.

4.1 MODEL VARIANTS

We first study several different model variants for both depth estimation and view synthesis tasks. In
particular, we explore different monocular backbones (Yang et al., 2024b) (ViT-S, ViT-B, ViT-L) and
different multi-view models (1-scale and 2-scale). We conduct depth experiments on the large-scale
TartanAir (Wang et al., 2020) synthetic dataset, which features both indoor and outdoor scenes and
has perfect ground truth depth. The Gaussian splatting experiments are conducted on the standard
RealEstate10K (Zhou et al., 2018) dataset. Following community standards, we report the depth
evaluation metrics (Eigen et al., 2014) of Abs Rel (relative ℓ1 error) and δ1 (percentage of correctly
estimated pixels within a threshold) and novel view synthesis metrics (Kerbl et al., 2023) of PSNR,
SSIM, and LPIPS. The results in Table 1 demonstrate that both larger monocular backbones and
2-scale hierarchical models lead to consistently improved performance for both tasks.
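For reference, the standard Abs Rel and δ1 metrics reported here are typically computed as in the short sketch below (evaluated over valid ground-truth pixels):

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Abs Rel and δ1 over valid (positive) ground-truth pixels."""
    valid = gt > eps
    pred, gt = pred[valid], gt[valid]
    abs_rel = ((pred - gt).abs() / gt).mean()             # mean relative L1 error
    ratio = torch.maximum(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean()                # fraction with max ratio < 1.25
    return abs_rel.item(), delta1.item()
```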
From Table 1, we can also see that better depth network architecture leads to improved view synthesis.
In Appendix A.1, we conduct additional experiments to study the effect of different initializations
of the depth network on the view synthesis performance. Our results show that a better depth model
initialization also contributes to improved rendering quality. In summary, both better depth network
architecture and better depth model initialization lead to improved novel view synthesis results.


Table 3: Unsupervised Depth Pre-Training with Gaussian Splatting. We first perform unsuper-
vised pre-training with Gaussian splatting on RealEstate10K and then measure the depth performance
on TartanAir, ScanNet and KITTI after fine-tuning for the depth task. Compared to previous su-
pervised pre-trained models Depth Anything (for monocular ViT) and UniMatch (for multi-view
Transformer), our new unsupervised pre-training improves performance consistently in all metrics.
The improvements are especially significant on challenging datasets such as TartanAir and KITTI.

                                            TartanAir            ScanNet              KITTI
Initialization                              Abs Rel ↓   δ1 ↑     Abs Rel ↓   δ1 ↑     Abs Rel ↓   δ1 ↑
Depth Anything (only mono)                  11.12       89.97    6.78        96.13    11.67       85.96
UniMatch + Depth Anything (mv & mono)       10.86       90.55    6.70        96.14    11.56       87.27
DepthSplat (full model)                     10.20       91.10    6.60        96.27    10.68       89.92

4.2 ABLATION STUDY AND ANALYSIS

In this section, we study the properties of our key components on the TartanAir dataset (for depth
estimation) and RealEstate10K dataset (for view synthesis with Gaussian splatting).
Monocular Features for Depth Estimation and Gaussian Splatting. In Table 2, we compare our
full model (full) with the model variant where the monocular depth feature branch (with a ViT-S
model pre-trained by Depth Anything v2 (Yang et al., 2024b)) is removed (w/o mono feature), leaving
only the multi-view branch. We can observe a clear performance drop for both depth and view
synthesis tasks. In Fig. 3, we visualize the depth predictions and error maps of both models on the
ScanNet dataset. The pure multi-view feature matching-based approach struggles considerably in texture-less
regions and on reflective surfaces. In contrast, our full model achieves reliable results thanks to
the powerful prior captured in monocular depth features. We also show the visual comparisons for
the Gaussian splatting task in Fig. 4 with two input views. For regions (e.g., the lounge chairs) that
only appear in a single image, the pure multi-view method is unable to find correspondences. It thus
produces unreliable depth predictions, leading to misalignment in the rendered novel views due to
the incorrect geometry.
We also experiment with removing the cost volume (w/o cost volume) in the multi-view branch
and observe a significant performance drop. This indicates that obtaining scale- and multi-view
consistent predictions with a pure monocular depth backbone is challenging, which limits the
quality of results in downstream tasks.
Fusion Strategy. We compare with alternative strategies for fusing the monocular features for
multi-view depth estimation. In particular, we compare with MVSFormer (Cao et al., 2022) which
constructs the cost volume with monocular features. More specifically, we replace our multi-view
feature extractor with a weight-sharing ViT model and use the ViT features to build the cost volume
as done in MVSFormer. This leads to a single-branch architecture (single branch in Table 2), unlike
our two-branch design where the monocular features are not used to build the cost volume. We can
observe that our fusion strategy performs significantly better than the single-branch design, potentially
because our two-branch design disentangles feature matching from monocular prior extraction, which
makes the learning task easier.
Different Monocular Features. In Table 2, we also evaluate different monocular features, including
the ConvNeXt-T (Liu et al., 2022) features used in AFNet (Cheng et al., 2024), and other popular
monocular features including Midas (Ranftl et al., 2020) and DINO v2 (Oquab et al., 2023). The
pre-trained Depth Anything monocular features achieve overall the best performance.

4.3 UNSUPERVISED DEPTH PRE-TRAINING WITH GAUSSIAN SPLATTING

By connecting Gaussian splatting and depth, our DepthSplat provides a way to pre-train the depth
model in a fully unsupervised manner. In particular, we first train our full model on the large-scale
unlabelled RealEstate10K dataset (containing ∼67K YouTube videos) with only the Gaussian splatting
rendering loss (Eqn. 2), without any direct supervision on the depth predictions. After pre-training,
we take the pre-trained depth model and further fine-tune it to the depth task on the mixed TartanAir
and VKITTI2 (Cabon et al., 2020) datasets with ground truth depth supervision.


Table 4: Two-view Depth Estimation on ScanNet. Our DepthSplat outperforms all prior methods
by clear margins.

Method              Abs Rel ↓   RMSE ↓   RMSElog ↓
DeMoN               0.231       0.761    0.289
BA-Net              0.161       0.346    0.214
DeepV2D             0.057       0.168    0.080
NeuralRecon         0.047       0.164    0.093
DRO                 0.053       0.168    0.081
UniMatch            0.059       0.179    0.082
DepthSplat (ours)   0.044       0.119    0.059

Table 5: Two-view 3DGS on RealEstate10K. Our DepthSplat achieves the best performance.

Method              PSNR ↑   SSIM ↑   LPIPS ↓
pixelNeRF           20.43    0.589    0.550
GPNR                24.11    0.793    0.255
AttnRend            24.78    0.820    0.213
MuRF                26.10    0.858    0.143
pixelSplat          25.89    0.858    0.142
MVSplat             26.39    0.869    0.128
DepthSplat (ours)   27.44    0.887    0.119

In Table 3, we evaluate the performance on both the in-domain TartanAir test set and the zero-shot
generalization on the unseen ScanNet and KITTI datasets. We compare with only initializing the
monocular backbone with Depth Anything and additionally initializing the multi-view Transformer
backbone with UniMatch (Xu et al., 2023). Our approach achieves the best results on all three
datasets. It is also interesting to observe that our pre-training contributes more on the more challenging
datasets (i.e., TartanAir and KITTI, which feature complex large-scale scenes, unlike ScanNet, which
contains only indoor scenes). We also note that both Depth Anything and UniMatch are trained with
ground truth supervision, while our DepthSplat is trained with only photometric losses. Given the
increasing popularity of view synthesis (Zheng & Vedaldi, 2024; Weng et al., 2023) and multi-view
generative models (Shi et al., 2023; Blattmann et al., 2023), with new multi-view datasets (Ling et al.,
2023) and models (Voleti et al., 2024) being gradually introduced, our approach provides a way to
pre-train depth models on large-scale unlabelled multi-view image datasets. This could potentially
further improve the multi-view consistency and robustness of existing depth models (Yang et al.,
2024a; Piccinelli et al., 2024), which are usually trained with ground truth depth supervision.

4.4 BENCHMARK COMPARISONS

Comparisons on ScanNet and RealEstate10K. Table 4 and Table 5 compare the depth and novel
view synthesis results on the standard ScanNet and RealEstate10K benchmarks, respectively. For both
comparisons, the number of input images is two. We can observe that our DepthSplat outperforms
previous methods (Teed & Deng, 2019; Yu et al., 2021; Xu et al., 2024a; Chen et al., 2024) by
clear margins on both datasets for both tasks. The visual comparison with previous best methods is
shown in Fig. 5 and Fig. 6, where our method significantly improves the performance on challenging
scenarios like texture-less regions and occlusions.
Comparisons on DL3DV. To further evaluate the performance on complex real-world scenes,
we conduct comparisons with the latest state-of-the-art Gaussian splatting model MVSplat (Chen
et al., 2024) on the recently introduced DL3DV dataset (Ling et al., 2023). We also compare the
results of different numbers of input views (2, 4 and 6) on this dataset. We fine-tune MVSplat
and our RealEstate10K pre-trained models on DL3DV training scenes and report the results on the
benchmark test set in Table 6. Our DepthSplat consistently outperforms MVSplat in all metrics.
Visual comparisons with MVSplat on the DL3DV dataset are shown in Fig. 7, where MVSplat’s depth
quality lags behind our DepthSplat due to matching failure, which leads to blurry and distorted view
synthesis results. We show more visual comparison results in Appendix A.2. It is also worth noting
that our method scales more efficiently to more input views thanks to our lightweight local feature
matching approach (Sec 3.1), which is unlike the expensive pair-wise matching used in MVSplat.
We invite the readers to our project page https://haofeixu.github.io/depthsplat/
for the video results with different numbers of input views (6 and 12), where our DepthSplat is able to
reconstruct larger-scale or 360° scenes from more input views, without any per-scene optimization.


[Figure 5 panels: image, depth and error maps for UniMatch and DepthSplat.]


Figure 5: Depth Comparison on ScanNet. Our DepthSplat performs significantly better than
UniMatch (Xu et al., 2023) on challenging parts (e.g., edges of the couch and pillows).

[Figure 6 panels: input, GT, pixelSplat, MVSplat, DepthSplat.]

Figure 6: View Synthesis on RealEstate10K. Our DepthSplat performs significantly better than
pixelSplat (Charatan et al., 2024) and MVSplat (Chen et al., 2024) in challenging regions.
Table 6: Comparisons on DL3DV. Our DepthSplat not only consistently outperforms MVSplat with
different numbers of input views, but also scales more efficiently to more input views.

Method       #Views   PSNR ↑   SSIM ↑   LPIPS ↓   Time (s)
MVSplat      2        17.54    0.529    0.402     0.072
DepthSplat   2        19.05    0.610    0.313     0.101
MVSplat      4        21.63    0.721    0.233     0.146
DepthSplat   4        22.82    0.766    0.188     0.124
MVSplat      6        22.93    0.775    0.193     0.263
DepthSplat   6        23.83    0.808    0.158     0.161
Input & GT
MVSplat
DepthSplat

Image 1: Depth Image 2: Depth Novel View

Figure 7: Visual Comparisons on DL3DV. Our DepthSplat performs significantly better than
MVSplat (Chen et al., 2024) on regions that are hard to match (e.g., repeated patterns).


5 CONCLUSION
In this paper, we introduce DepthSplat, a new approach to connecting Gaussian splatting and depth
that achieves state-of-the-art results on the ScanNet, RealEstate10K and DL3DV datasets for both depth
and view synthesis tasks. We also show that our model enables unsupervised pre-training of depth
models with a Gaussian splatting rendering loss, providing a way to leverage large-scale unlabelled multi-
view image datasets for training more multi-view consistent and robust depth models. Our current
model requires camera pose information as input along with the multi-view images; removing this
requirement would be exciting future work.

ACKNOWLEDGEMENT
We thank Yuedong Chen for his generous help with the DL3DV dataset. Andreas Geiger was
supported by the ERC Starting Grant LEGO-3D (850533) and the DFG EXC number 2064/1 - project
number 390727645.

REFERENCES
Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Multi-view depth estimation by fusing single-
view depth probability with multi-view geometry. In CVPR, 2022.
Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Zoedepth: Zero-shot transfer by
combining relative and metric depth. In CVPR, 2023.
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik
Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling
latent video diffusion models to large datasets. arXiv, 2023.
Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured
lumigraph rendering. In ACM TOG, 2001.
Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv preprint
arXiv:2001.10773, 2020.
Chenjie Cao, Xinlin Ren, and Yanwei Fu. Mvsformer: Learning robust image representations via
transformers and temperature-based depth for multi-view stereo. TMLR, 2022.
David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats
from image pairs for scalable generalizable 3d reconstruction. In CVPR, 2024.
Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen
Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In
ECCV, 2024.
JunDa Cheng, Wei Yin, Kaixuan Wang, Xiaozhi Chen, Shijie Wang, and Xin Yang. Adaptive fusion
of single-view and multi-view depth for autonomous driving. In CVPR, 2024.
Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3d gaussian
splatting in few-shot images. In CVPR, 2024.
Robert T Collins. A space-sweep approach to true multi-image matching. In CVPR, 1996.
Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias
Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao
Liu. Transmvsnet: Global context-aware multi-view stereo network with transformers. CVPR,
2022.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image
is worth 16x16 words: Transformers for image recognition at scale. arXiv, 2020.


Arda Duzceker, Silvano Galliani, Christoph Vogel, Pablo Speciale, Mihai Dusmanu, and Marc
Pollefeys. Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion. In
CVPR, 2021.
Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for
making multi-task mid-level vision datasets from 3d scans. In ICCV, 2021.
David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a
multi-scale deep network. NeurIPS, 2014.
Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and
Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a
single image. arXiv, 2024.
Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by
surface normal diffusion. In ICCV, 2015.
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti
dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume
for high-resolution multi-view stereo and stereo matching. In CVPR, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In CVPR, 2016.
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad
Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In
CVPR, 2024.
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting
for real-time radiance field rendering. ACM TOG, 2023.
Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean
Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, et al. xformers: A modular and hackable
transformer modelling library, 2022.
Rui Li, Dong Gong, Wei Yin, Hao Chen, Yu Zhu, Kaixuan Wang, Xiaozhi Chen, Jinqiu Sun, and
Yanning Zhang. Learning to fuse monocular and multi-view cues for multi-frame depth estimation
in dynamic scenes. In CVPR, 2023.
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun
Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision.
arXiv, 2023.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie.
A convnet for the 2020s. In CVPR, 2022.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv, 2017.
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov,
Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning
robust visual features without supervision. arXiv, 2023.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style,
high-performance deep learning library. In NeurIPS, 2019.
Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and
Fisher Yu. Unidepth: Universal monocular metric depth estimation. In CVPR, 2024.


René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust
monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI,
2020.
René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In
ICCV, 2021.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In CVPR, 2022.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In MICCAI. Springer, 2015.
Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view
selection for unstructured multi-view stereo. In ECCV, 2016.
Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view
diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast
single-view 3d reconstruction. In CVPR, 2024.
Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large
multi-view gaussian model for high-resolution 3d content creation. arXiv, 2024.
Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion. In
ICLR, 2019.
Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, and Juho Kannala.
Dn-splatter: Depth and normal priors for gaussian splatting and meshing. arXiv, 2024.
Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian
Laforte, Robin Rombach, and Varun Jampani. Sv3d: Novel multi-view synthesis and 3d generation
from a single image using latent video diffusion. arXiv, 2024.
Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. Patch-
matchnet: Learned multi-view patchmatch stereo. In CVPR, 2021.
Fangjinhua Wang, Silvano Galliani, Christoph Vogel, and Marc Pollefeys. Itermvs: Iterative probabil-
ity estimation for efficient multi-view stereo. In CVPR, 2022.
Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish
Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–4916.
IEEE, 2020.
Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, and Guosheng
Lin. Neural video depth stabilizer. In ICCV, 2023.
Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, CL Chen, and Lei Zhang. Consis-
tent123: Improve consistency for one image to 3d object synthesis. arXiv, 2023.
Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, and Jan Eric Lenssen. latentsplat: Autoencod-
ing variational gaussians for fast generalizable 3d reconstruction. arXiv preprint arXiv:2403.16292,
2024.
Haofei Xu and Juyong Zhang. Aanet: Adaptive aggregation network for efficient stereo matching. In
CVPR, 2020.
Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. Gmflow: Learning optical
flow via global matching. In CVPR, 2022.
Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger.
Unifying flow, stereo and depth estimation. IEEE TPAMI, 2023.


Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas
Geiger, and Fisher Yu. Murf: Multi-baseline radiance fields. In CVPR, 2024a.
Qingshan Xu and Wenbing Tao. Multi-scale geometric consistency guided multi-view stereo. In
CVPR, 2019.
Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and
Gordon Wetzstein. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and
generation. arXiv, 2024b.
Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth
anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024a.
Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang
Zhao. Depth anything v2. arXiv, 2024b.
Zhenpei Yang, Zhile Ren, Qi Shan, and Qixing Huang. Mvs2d: Efficient multi-view stereo via
attention-driven 2d convolutions. In CVPR, 2022.
Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured
multi-view stereo. In ECCV, 2018.
Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Simon Chen, Yifan Liu, and Chunhua Shen.
Towards accurate reconstruction of 3d scene shape from a single monocular image. IEEE TPAMI,
2022.
Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua
Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In CVPR, 2023.
Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from
one or few images. In CVPR, 2021.
Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu.
Gs-lrm: Large reconstruction model for 3d gaussian splatting. arXiv, 2024.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Chuanxia Zheng and Andrea Vedaldi. Free3d: Consistent novel view synthesis without 3d represen-
tation. In CVPR, 2024.
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification:
learning view synthesis using multiplane images. ACM TOG, 2018.

A APPENDIX

A.1 DEPTH PRE-TRAINING FOR GAUSSIAN SPLATTING

We further study the effect of depth pre-training for the Gaussian splatting experiments. Unlike the
pre-trained models Depth Anything (Yang et al., 2024b) and UniMatch (Xu et al., 2023), which are
trained for monocular and multi-view features separately, we perform joint training of our full
two-branch depth model on the depth datasets. We then compare the results of different initializations
of the depth network for the full Gaussian splatting model. We can see from Table 7 that improved
depth initialization leads to better view synthesis results.

A.2 MORE VISUAL COMPARISONS ON DL3DV

We show more visual comparison results on DL3DV in Fig. 8 with different numbers of input views,
where our DepthSplat consistently outperforms MVSplat in challenging regions.


Table 7: Depth to Gaussian Splatting Transfer. We compare different pre-trained network weights
for initializing the depth network when training our full DepthSplat model for view synthesis.
Compared to using Depth Anything and UniMatch pre-trained monocular and multi-view network
weights for initializing the monocular ViT and multi-view Transformer features, our jointly trained
2-branch depth model (full model) performs best on all metrics.

Initialization                              PSNR ↑   SSIM ↑   LPIPS ↓
Depth Anything (only mono)                  26.59    0.874    0.1256
UniMatch + Depth Anything (mv & mono)       26.76    0.877    0.1234
DepthSplat (full model)                     26.81    0.878    0.1225

[Figure 8 panels: Input, GT, MVSplat, DepthSplat renderings for 2-view, 4-view, and 6-view inputs.]

Figure 8: Different numbers of input views on DL3DV. Our DepthSplat performs consistently better
than MVSplat (Chen et al., 2024).
