[Figure 1 panels: input image and generated frames at t = 5, 20, 50, 100, 150, 200, 300, 500.]
Figure 1. Perpetual View Generation. Using a collection of aerial videos of nature scenes for training (left), our method learns to take a
single image and perpetually generate novel views for a camera trajectory covering a long distance (right). Our method can successfully
generate hundreds of frames of an aerial video from a single input image (up to 500 shown here).
the movement of pixels from camera motion is explicitly modeled using 3D geometry.

3. Perpetual View Generation

Given an RGB image I0 and a camera trajectory (P0, P1, P2, . . .) of arbitrary length, our task is to output a new image sequence (I0, I1, I2, . . .) that forms a video depicting a flythrough of the scene captured by the initial view. The trajectory is a series of 3D camera poses $P_t = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$, where $R \in \mathbb{R}^{3\times3}$ and $t \in \mathbb{R}^{3\times1}$ are 3D rotations and translations, respectively. In addition, each camera has an intrinsic matrix K. At training time camera data is obtained from video clips via structure-from-motion as in [52]. At test time, the camera trajectory may be pre-specified, generated by an auto-pilot algorithm, or controlled via a user interface.

3.1. Approach: Render, Refine, Repeat

Our framework applies established techniques (3D rendering, image-to-image translation, auto-regressive training) in a novel combination. We decompose perpetual view generation into three steps, as illustrated in Figure 2:

1. Render a new view from an old view, by warping the image according to a disparity map using a differentiable renderer,
2. Refine the rendered view and geometry to fill in missing content and add detail where necessary,
3. Repeat this process, propagating both image and disparity to generate each new view from the one before.

Our approach has several desirable characteristics. Representing geometry with a disparity map allows much of the heavy lifting of moving pixels from one frame to the next to be handled by differentiable rendering, ensuring local temporal consistency. The synthesis task then becomes one of image refinement, which comprises: 1) inpainting disoccluded regions, 2) outpainting of new image regions, and 3) super-resolving image content. Because every step is fully differentiable, we can train our refinement network by backpropagating through several view synthesis iterations. Our auto-regressive framework means that novel views may be infinitely generated with explicit view control, even though training data is finite in length.

Formally, for an image It with pose Pt we have an associated disparity (i.e., inverse depth) map Dt ∈ R^{H×W}, and we compute the next frame It+1 and its disparity Dt+1 as

    \hat{I}_{t+1}, \hat{D}_{t+1}, \hat{M}_{t+1} = \mathcal{R}(I_t, D_t, P_t, P_{t+1}),    (1)
    I_{t+1}, D_{t+1} = g_\theta(\hat{I}_{t+1}, \hat{D}_{t+1}, \hat{M}_{t+1}).    (2)

Here, Ît+1 and D̂t+1 are the result of rendering the image It and disparity Dt from the new camera Pt+1, using a differentiable renderer R [13]. This function also returns a mask M̂t+1 indicating which regions of the image are missing and need to be filled in. The refinement network gθ then inpaints, outpaints and super-resolves these inputs to produce the next frame It+1 and its disparity Dt+1. The process is repeated iteratively for T steps during training, and at test time for an arbitrarily long camera trajectory. Next we discuss each step in detail.

Geometry and Rendering. Our render step R uses a differentiable mesh renderer [13]. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system: (x, y, z) = K^{-1}(u, v, 1)/d. We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors, ready for rendering.

To avoid stretched triangle artifacts at depth discontinuities and to aid our refinement network by identifying regions to be inpainted, we compute a per-pixel binary mask Mt ∈ R^{H×W} by thresholding the gradient of the disparity image ∇Dt, computed with a Sobel filter:

    M_t = \begin{cases} 0 & \text{where } \lVert \nabla D_t \rVert > \alpha, \\ 1 & \text{otherwise.} \end{cases}    (3)

We use the 3D mesh to render both image and mask from the new view Pt+1, and multiply the rendered image element-wise by the rendered mask to give Ît+1. The renderer also outputs a depth map as seen from the new camera, which we invert and multiply by the rendered mask to obtain D̂t+1. This use of the mask ensures that any regions in Ît+1 and D̂t+1 that were occluded in It are masked out and set to zero (along with regions that were outside the field of view of the previous camera). These areas are ones that the refinement step will have to inpaint (or outpaint). See Figures 2 and 3 for examples of missing regions shown in pink.
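To make the render step concrete, the following sketch (ours, in NumPy/SciPy rather than the authors' TensorFlow pipeline) shows the two pieces that follow directly from the text above: the per-pixel unprojection (x, y, z) = K^{-1}(u, v, 1)/d and the gradient-threshold mask of Eq. (3). The threshold alpha and the function names are illustrative placeholders; the mesh construction and the differentiable rasterization are assumed to come from an external renderer and are not reproduced here.

```python
import numpy as np
from scipy import ndimage

def unproject(disparity, K):
    """Lift every pixel (u, v) with disparity d to the camera-space 3D point
    (x, y, z) = K^{-1} (u, v, 1) / d."""
    H, W = disparity.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    rays = np.linalg.inv(K) @ pix                                      # 3 x HW
    points = rays / disparity.reshape(1, -1)                           # divide by d
    return points.T.reshape(H, W, 3)

def disparity_mask(disparity, alpha=0.05):
    """Eq. (3): zero out pixels whose disparity gradient magnitude exceeds alpha,
    i.e. likely depth discontinuities that would create stretched triangles.
    alpha here is an illustrative value, not one taken from the paper."""
    gx = ndimage.sobel(disparity, axis=1)
    gy = ndimage.sobel(disparity, axis=0)
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return (grad_mag <= alpha).astype(np.float32)
```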
Figure 3. Illustration of the rendering and refinement steps. Left: Our differentiable rendering stage takes a paired RGB image and
disparity map from viewpoint P0 and creates a textured mesh representation, which we render from a new viewpoint P1 , warping the
textures, adjusting disparities, and returning a binary mask representing regions to fill in. Right: The refinement stage takes the output of the
renderer and uses a deep network to fill in holes and add details. The output is a new RGB image and disparity map that can be supervised
with reconstruction and adversarial losses.
Refinement and Synthesis. Given the rendered image Ît+1, its disparity D̂t+1 and its mask M̂t+1, our next task is to refine this image, which includes blurry regions and missing pixels. In contrast to prior inpainting work [49, 36], the refinement network also has to perform super-resolution and thus we cannot use a compositing operation in refining the rendered image. Instead we view the refinement step as a generative image-to-image translation task, and adopt the state-of-the-art SPADE network architecture [26] for our gθ, which directly outputs It+1, Dt+1. We encode I0 to provide the additional GAN noise input required by this architecture. See the appendix for more details.

Rinse and Repeat. The previous steps allow us to generate a single novel view. A crucial aspect of our approach is that we refine not only RGB but also disparity, so that scene geometry is propagated between frames. With this setup, we can use the refined image and disparity as the next input to train in an auto-regressive manner, with losses backpropagated over multiple steps. Other view synthesis methods, although not designed in this manner, may also be trained and evaluated in a recurrent setting, although naively repeating these methods without propagating the geometry as we do requires the geometry to be re-inferred from scratch in every step. As we show in Section 6, training and evaluating these baselines with a repeat step is still insufficient for perpetual view generation.

Geometric Grounding to Prevent Drift. A key challenge in generating long sequences is dealing with the accumulation of errors [28]. In a system where the current prediction affects future outputs, small errors in each iteration can compound, eventually generating predictions outside the distribution seen during training and causing unexpected behaviors. Repeating the generation loop in the training process and feeding the network with its own output ameliorates drift and improves visual quality as shown in our ablation study (Section 6.2). However, we notice that the disparity in particular can still drift at test time, especially over time horizons far longer than seen during training. Therefore we add an explicit geometric re-grounding of the disparity maps.

Specifically, we take advantage of the fact that the rendering process provides the correct range of disparity from a new viewpoint D̂t+1 for visible regions of the previous frame. The refinement network may modify these values as it refines the holes and blurry regions, which can lead to drift as the overall disparity becomes gradually larger or smaller than expected. However, we can geometrically correct this by rescaling the refined disparity map to the correct range by computing a scale factor γ via solving

    \min_{\gamma} \left\lVert M \left( \log(\gamma D_{t+1}) - \log(\hat{D}_{t+1}) \right) \right\rVert    (4)

By scaling the refined disparity by γ, our approach ensures that the disparity map stays at a consistent scale, which significantly reduces drift at test time as shown in Section 6.3.
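If the norm in Eq. (4) is taken as a least-squares objective over the valid (masked) pixels, the optimal γ has a simple closed form: log γ is the masked mean of log D̂t+1 − log Dt+1. A minimal sketch under that assumption (ours, not the released code; the epsilon is a numerical-safety placeholder):

```python
import numpy as np

def rescale_disparity(refined_disp, rendered_disp, mask, eps=1e-6):
    """Closed-form solution of min_gamma || mask * (log(gamma * refined) - log(rendered)) ||^2:
    log(gamma) equals the mean of log(rendered) - log(refined) over masked pixels."""
    m = mask > 0.5
    log_ratio = np.log(rendered_disp[m] + eps) - np.log(refined_disp[m] + eps)
    gamma = np.exp(log_ratio.mean())
    return gamma * refined_disp, gamma
```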
4. Aerial Coastline Imagery Dataset

Learning to generate long sequences requires real image sequences for training. Many existing datasets for view synthesis do not use sequences, but only a set of views from slightly different camera positions. Those that do have sequences are limited in length: RealEstate10K, for example, has primarily indoor scenes with limited camera movement [52]. To obtain long sequences with a moving camera and few dynamic objects, we turn to aerial footage of beautiful nature scenes available on the Internet. Nature scenes are a good starting point for our challenging problem, as GANs have shown promising results on nature textures [30, 33].

We collected 891 videos using keywords such as ‘coastal’ and ‘aerial footage’, and processed these videos with SLAM and structure from motion following the approach of Zhou et al. [52], yielding over 13,000 sequences with a total of 2.1 million frames. We have released the list of videos and SfM camera trajectories. See Fig. 4 for an illustrative example of our SfM pipeline running on a coastline video.

Figure 4. Processing video for ACID. We run structure from motion on coastline drone footage collected from YouTube to create the Aerial Coastline Imagery Dataset (ACID). See Section 4.

To obtain disparity maps for every frame, we use the off-the-shelf MiDaS single-view depth prediction method [27]. We find that MiDaS is quite robust and produces sufficiently accurate disparity maps for our method. Because MiDaS disparity is only predicted up to scale and shift, it must first be rescaled to match our data. To achieve this, we use the sparse point cloud computed for each scene during structure from motion. For each frame we consider only the points that were tracked in that frame, and use least squares to compute the scale and shift that minimize the disparity error on these points. We apply this scale and shift to the MiDaS output to obtain disparity maps (Di) that are scale-consistent with the SfM camera trajectories (Pi) for each sequence.

Due to the difference in camera motions between videos, we strategically sub-sample frames to ensure consistent camera speed in training sequences. See more details in the appendix.
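The per-frame scale-and-shift fit to the sparse SfM points described above can be written as an ordinary least-squares problem. A minimal sketch (ours); it assumes the MiDaS disparities sampled at a frame's tracked SfM points and the corresponding SfM-derived disparities have already been gathered into two 1-D arrays, and the helper name is hypothetical:

```python
import numpy as np

def align_midas_disparity(midas_disp_at_points, sfm_disp_at_points):
    """Least-squares scale a and shift b such that a * midas + b ~= sfm disparity,
    using only the sparse SfM points tracked in this frame."""
    A = np.stack([midas_disp_at_points,
                  np.ones_like(midas_disp_at_points)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, sfm_disp_at_points, rcond=None)
    return a, b

# The full MiDaS disparity map D for the frame is then corrected as a * D + b.
```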
5. Experimental Setup

Losses. We train our approach on a collection of image sequences {It}_{t=0}^{T} with corresponding camera poses {Pt}_{t=0}^{T} and disparity maps for each frame {Dt}_{t=0}^{T}. Following the literature on conditional generative models, we use an L1 reconstruction loss on RGB and disparity, a VGG perceptual loss on RGB [18] and a hinge-based adversarial loss with a discriminator (and feature matching loss) [26] for the T frames that we synthesize during training. We also use a KL-divergence loss [21] on our initial image encoder, LKLD = DKL(q(z|x) || N(0, 1)). Our complete loss function is

    \mathcal{L} = \mathcal{L}_{\text{reconst}} + \mathcal{L}_{\text{perceptual}} + \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{feat matching}} + \mathcal{L}_{\text{KLD}}.    (5)

The loss is computed over all iterations and over all samples in the mini-batch.
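The objective in Eq. (5) is a plain sum of the five terms, accumulated over the T synthesized frames. The sketch below only shows how the terms are assembled; every loss callable is a stand-in passed in by the caller, and none of the names come from the authors' code:

```python
def total_loss(pred_frames, pred_disps, gt_frames, gt_disps, z_mu, z_logvar,
               l1, perceptual, adversarial, feature_matching, kl_divergence):
    """Assemble Eq. (5) over T frames. All loss functions are caller-supplied
    placeholders (e.g. l1, VGG perceptual, hinge adversarial, feature matching, KL)."""
    loss = kl_divergence(z_mu, z_logvar)           # L_KLD on the initial image encoding
    for I_hat, D_hat, I, D in zip(pred_frames, pred_disps, gt_frames, gt_disps):
        loss += l1(I_hat, I) + l1(D_hat, D)        # reconstruction on RGB and disparity
        loss += perceptual(I_hat, I)               # VGG perceptual loss on RGB
        loss += adversarial(I_hat) + feature_matching(I_hat, I)
    return loss
```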
Metrics. Evaluating the quality of the generated images in a way that correlates with human judgement is a challenge. We use the Fréchet inception distance (FID), a common metric used in evaluating generative models of images. FID computes the difference between the mean and covariance of the embedding of real and fake images through a pretrained Inception network [17] to measure the realism of the generated images as well as their diversity. We precompute real statistics using 20k real image samples from our dataset. To measure changes in generated quality over time, we report FID over a sliding window: we write FID-w at t to indicate a FID value computed over all image outputs within a window of width w centered at time t, i.e. {Ii} for t − w/2 < i ≤ t + w/2. For short-range trajectories where ground truth images are available, we also report mean squared error (MSE) and LPIPS [51], a perceptual similarity metric that correlates better with human perceptual judgments than traditional metrics such as PSNR and SSIM.

Implementation Details. We train our model with T = 5 steps of render-refine-repeat at an image resolution of 160 × 256 (as most aerial videos have a 16:9 aspect ratio). The choice of T is limited by both memory and available training sequence lengths. The refinement network architecture is the same as that of the SPADE generator in [26], and we also employ the same multi-scale discriminator. We implement our models in TensorFlow, and train with a batch size of 4 over 10 GPUs for 7M iterations, which takes about 8 days. We then identify the model checkpoint with the best FID score over a validation set.

6. Evaluation

We compare our approach with three recent state-of-the-art single-image view synthesis methods—the 3D Photography method (henceforward ‘3D Photos’) [32], SynSin [45], and single-view MPIs [38]—as well as the SVG-LP video synthesis method [10]. We retrain each method on our ACID training data, with the exception of 3D Photos, which is trained on in-the-wild imagery and, like our method, takes MiDaS disparity as an input. SynSin and single-view MPI were trained at a resolution of 256 × 256. SVG-LP takes two input frames for context, and operates at a lower resolution of 128 × 128.

The view synthesis baseline methods were not designed for long camera trajectories; every new frame they generate comes from the initial frame I0 even though after enough camera movement there may be very little overlap between the two. Therefore we also compare against two variants of each of these methods. First, variants with iterated evaluation (SynSin–Iterated, MPI–Iterated): these methods use the same trained models as their baseline counterparts, but we apply them iteratively at test time to generate each new frame from the previous frame rather than the initial one. Second, variants trained with repeat (SynSin–Repeat, MPI–Repeat): these methods are trained autoregressively, with losses backpropagated across T = 5 steps, as in our full model. (We omit these variations for the 3D Photos method, which was unfortunately too slow to allow us to apply it iteratively, and which we are not able to retrain.)
                                   Over frames 1–10      Over frames 1–50
Method                             LPIPS ↓    MSE ↓      FID ↓
Baseline methods
  SVG-LP [10]                      0.60       0.020      135.9
  SynSin [45]                      0.32       0.018       98.1
  MPI [38]                         0.35       0.019       65.0
  3D Photos [32]                   0.30       0.020      123.6
Applied iteratively at test time
  SynSin–Iterated                  0.40       0.021      143.6
  MPI–Iterated                     0.47       0.020      201.2
Trained with repeat (T = 5)
  SynSin–Repeat                    0.44       0.036      153.3
  MPI–Repeat                       0.55       0.020      203.0
Ours                               0.32       0.020       50.6

Table 1. Quantitative evaluation. We compute LPIPS and MSE against ten frames of ground truth, and FID-50 over 50 frames generated from an input test image. See Section 6.1.

Figure 5. FID over time. Left: FID-20 over time for 50 frames generated by each method. Right: FID-50 over 500 frames generated by our method via autopilot. For comparison, we plot FID-50 for the baselines on the first 50 steps. Despite generating sequences an order of magnitude longer, our FID-50 is still lower than that of the baselines. See Sec. 6.1, 6.3.
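For concreteness, the sliding-window FID-w defined in Section 5 and plotted in Fig. 5 can be computed as in the sketch below. Here compute_fid is a placeholder for any standard FID implementation that compares precomputed real Inception statistics against a batch of generated frames; the function itself is not from the authors' code.

```python
def fid_over_time(generated_frames, real_stats, compute_fid, w=20):
    """FID-w at each step t: FID over generated frames Ii with t - w/2 < i <= t + w/2."""
    scores = {}
    n = len(generated_frames)
    for t in range(n):
        lo = max(0, t - w // 2 + 1)      # smallest index i with i > t - w/2
        hi = min(n, t + w // 2 + 1)      # slice end (exclusive) so that i <= t + w/2
        scores[t] = compute_fid(real_stats, generated_frames[lo:hi])
    return scores
```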
6.1. Short-to-medium range view synthesis

To evaluate short-to-medium-range synthesis, we select ACID test sequences with an input frame and 10 subsequent ground truth frames (subsampling as described in the appendix), with the camera moving forwards at an angle of up to 45°. Although our method is trained on all types of camera motions, this forward motion is appropriate for comparison with view synthesis methods which are not designed to handle extreme camera movements.

We then extrapolate the camera motion from the last two frames of each sequence to extend the trajectory for an additional 40 frames. To avoid the camera colliding with the scene, we check the final camera position against the disparity map of the last ground-truth frame, and discard sequences in which it is outside the image or at a depth large enough to be occluded by the scene.
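The trajectory extension described above amounts to repeatedly applying the relative transform between the last two known poses. A sketch (ours) under the assumption of 4×4 homogeneous pose matrices; the exact pose convention (camera-from-world vs. world-from-camera) is not specified in this excerpt:

```python
import numpy as np

def extrapolate_poses(P_prev, P_last, num_steps=40):
    """Extend a trajectory by repeatedly applying the relative transform
    between the last two 4x4 pose matrices."""
    delta = P_last @ np.linalg.inv(P_prev)   # relative motion over one step
    poses, P = [], P_last
    for _ in range(num_steps):
        P = delta @ P
        poses.append(P)
    return poses
```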
This yields a set of 279 sequences with camera trajectories of 50 steps and ground truth images for the first 10 steps. For short-range evaluation, we compare to ground truth on the first 10 steps. For medium-range evaluation, we compute FID scores over all 50 frames.

We apply each method to these sequences to generate novel views corresponding to the camera poses in each sequence (SVG-LP is the exception in that it does not take account of camera pose). See results in Table 1. While our goal is perpetual view generation, we find that our approach is competitive with recent view synthesis approaches for short-range synthesis on LPIPS and MSE metrics. For mid-range evaluation, we report FID-50 over 50 generated frames. Our approach has a dramatically lower FID-50 score than other methods, reflecting the more naturalistic look of its output. To quantify the degradation of each method over time, we report a sliding window FID-20 computed from t = 10 to 40. As shown in Fig. 5 (left), the image quality (measured by FID-20) of the baseline methods deteriorates more rapidly with increasing t compared to our approach.

Qualitative comparisons of these methods are shown in Fig. 6 and our supplementary video, which illustrates how the quality of each method’s output changes over time. Notable here are SVG-LP’s blurriness and inability to predict any camera motion at all; the increasingly stretched textures of 3D Photos’ output; and the way the MPI-based method’s individual layers become noticeable. SynSin does the best job of generating plausible texture, but still produces holes after a while and does not add new detail.

The –Iterated and –Repeat variants are consistently worse than the original SynSin and MPI methods, which suggests that simply applying an existing method iteratively, or retraining it autoregressively, is insufficient to deal with large camera movement. These variants show more drifting artifacts than their original versions, likely because (unlike our method) they do not propagate geometry from step to step. The MPI methods additionally become very blurry on repeated application, as they have no ability to add detail, lacking our refinement step.

In summary, our thoughtful combination of render-refine-repeat shows better results than these existing methods and variations. Figure 7 shows additional qualitative results from generating 15 and 30 frames on a variety of inputs.
Figure 6. Qualitative comparison over time. We show a generated sequence for each method at different time steps. Note that we only have ground truth images for 10 frames; the subsequent frames are generated using an extrapolated trajectory. Pink regions in ‘Ours (no-refine)’ indicate missing content uncovered by the moving camera.
Figure 7. Qualitative comparison. We show the diversity and quality of many generated results for each method at t = 15 and t = 30. Competing approaches result in missing or unrealistic frames. Our approach is able to generate plausible views of the scene.
6.2. Ablations

We investigate the benefit of training over multiple iterations of our render-refine-repeat loop by also training our model with T = 1 (‘No repeat’). As shown in Table 2, the performance on short-range generation, as measured in LPIPS and MSE, is similar to our full model, but when we look at FID, we observe that this method generates lower quality images and that they get substantially worse with increasing t (see Fig. 5, left). This shows the importance of a recurrent training setup for our method.

We next consider the refine step. Omitting this step completely results in a larger and larger portion of the image being completely missing as t increases: examples are shown as ‘Ours (no refine)’ in Fig. 6, where for clarity the missing pixels are highlighted in pink. In our full model, these regions are inpainted or outpainted by the refinement network at each step. Note also that even non-masked areas of the image are much blurrier when the refinement step is omitted, showing the benefit of the refinement network in super-resolving image content.

Table 2 also shows results on two further variations of our refinement step. First, replacing our refinement network with a simpler U-Net architecture yields substantially worse results (‘U-Net refinement’). Second, disabling geometric grounding (Section 3.1) also leads to slightly lower quality on this short-to-medium range view synthesis task (‘No re-grounding’).
[Figure 8 panels: generated frames at t = 0, 15, 35, 50, 80, 100, 150.]
Figure 8. Long trajectory generation. From a single image, our approach can generate 500 frames of video without suffering visually.
Please see the supplementary video for the full effect.
Ablations               LPIPS ↓    MSE ↓     FID-50 ↓
Full Model              0.32       0.020      50.6
No repeat (T = 1)       0.30       0.022      95.4
U-Net refinement        0.54       0.052     183.0
No re-grounding         0.34       0.022      64.3

Table 2. Ablations. We ablate aspects of our model to understand their contribution to the overall performance. See Section 6.2.

6.3. Perpetual view generation

See the supplementary video for more examples and the full effect of these generated fly-through videos.

6.4. User-controlled video generation

Because our rendering step takes camera poses as an input, we can render frames for arbitrary camera trajectories at test time, including trajectories controlled by a user in the loop. We have built an HTML interface that allows the user to steer our auto-pilot algorithm as it flies through this imaginary world. This demo runs over the internet and is capable of generating a few frames per second. Please see the supplementary video for a demonstration.
[Figure 11 panels: generated frames at t = 10, 20, 30, 40, 50, 60, 70, 80.]
Figure 11. Generation from smartphone photo. Our perpetual view generation applied to a photo captured by the authors on a smartphone.
We use MiDaS for the initial disparity, and assume a field of view of 90◦ .
the image. These heuristics determine a (camera-relative) target look direction and target movement direction. To ensure smooth camera movement, we interpolate the actual look and movement directions only a small fraction (0.05) of the way to the target directions at each frame. The next camera pose is then produced by moving a set distance in the move direction while looking in the look direction. To generate a wider variety of camera trajectories (as for example in Section C.4), or to allow user control, we can add an offset to the target look direction that varies over time: a horizontal sinusoidal variation in the look direction, for example, generates a meandering trajectory.

This approach generates somewhat reasonable trajectories, but an exciting future direction would be to train a model that learns how to choose each successive camera pose, using the camera poses in our training data.

We use this auto-pilot algorithm to seamlessly integrate user control and obstacle avoidance in our demo interface, which can be seen in Fig. 9.
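A sketch of one auto-pilot update as described above. The blend factor 0.05 matches the text; the heuristics that pick the target directions are outside this sketch, and the simple blend-and-renormalize interpolation below is our assumption rather than the paper's exact formulation.

```python
import numpy as np

def autopilot_step(position, look_dir, move_dir, target_look, target_move,
                   step_size, blend=0.05):
    """Ease the current look/move directions a small fraction of the way toward
    the targets, then advance the camera a fixed distance along the move
    direction while facing the look direction."""
    def ease(a, b, f):
        v = (1.0 - f) * a + f * b        # linear blend of unit vectors ...
        return v / np.linalg.norm(v)     # ... then renormalize (our simplification)
    look_dir = ease(look_dir, target_look, blend)
    move_dir = ease(move_dir, target_move, blend)
    position = position + step_size * move_dir
    return position, look_dir, move_dir
```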
A.8. Additional Frame Interpolation

For the purposes of presenting a very smooth and cinematic video with a high frame rate, we can additionally interpolate between frames generated by our model. Since our system produces not just RGB images but also disparity, and since we have camera poses for each frame, we can use this information to aid the interpolation. For each pair of frames (Pt, It, Dt) and (Pt+1, It+1, Dt+1) we proceed as follows:

First, we create additional camera poses (as many as desired) by linearly interpolating position and look-direction between Pt and Pt+1. Then, for each new pose P a fraction λ of the way between Pt and Pt+1, we use the differentiable renderer R to rerender It and It+1 from that viewpoint, and blend between the two resulting images:

    I'_t = \mathcal{R}(I_t, D_t, P_t, P),
    I'_{t+1} = \mathcal{R}(I_{t+1}, D_{t+1}, P_{t+1}, P),    (7)
    I = (1 - \lambda)\, I'_t + \lambda\, I'_{t+1}.

Note: we apply this interpolation to the long trajectory sequences in the supplementary video only, adding four new frames between each pair in the sequence. However, all short-to-mid range comparisons and all figures and metrics in the paper are computed on raw outputs without any interpolation.
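A sketch of the interpolation in Eq. (7). Here render stands in for the differentiable renderer R, interpolate_pose is an assumed helper that linearly interpolates position and look direction as described above, and the default of four in-between frames matches the note above.

```python
def interpolate_frames(I_t, D_t, P_t, I_t1, D_t1, P_t1, render, num_inbetween=4):
    """Insert num_inbetween blended frames between two generated frames."""
    frames = []
    for k in range(1, num_inbetween + 1):
        lam = k / (num_inbetween + 1)
        P = interpolate_pose(P_t, P_t1, lam)       # assumed helper: lerp position/look
        warped_t = render(I_t, D_t, P_t, P)        # I'_t   in Eq. (7)
        warped_t1 = render(I_t1, D_t1, P_t1, P)    # I'_{t+1}
        frames.append((1.0 - lam) * warped_t + lam * warped_t1)
    return frames
```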
A.9. Aerial Coastline Imagery Dataset

Our ACID dataset is available from our project page at https://infinite-nature.github.io, in the same format as RealEstate10K [52]. For each video we identified as aerial footage of nature scenes, we identified multiple frames for which we compute structure-from-motion poses and intrinsics within a globally consistent system. We divide ACID into train and test splits.

To get test sequences used during evaluation, we apply the same motion-based frame subsampling described in Section A.3 to match the distribution seen during training for all view synthesis approaches. Further, we constrain test items to only include forward motion, which is defined as trajectories that stay within a 90° frontal cone of the first frame. This was done to establish a fair setting with existing view synthesis methods which do not incorporate generative aspects. These same test items were used in the 50-frame FID experiments by repeatedly extrapolating the last two known poses to generate new poses. For the 500-generation FID, we compute future poses using the auto-pilot control described in Section A.7. To get “real” inception statistics to compare with, we use images from ACID.

B. Experimental implementation

B.1. SynSin training

We first trained SynSin [45] on our nature dataset with the default training settings (i.e. the presets used for the KITTI model). We then modified the default settings by changing the camera stride in order to train SynSin to perform better for the task of longer-range view synthesis. Specifically, we employ the same motion-based sampling for selecting pairs of images as described in Section A.3.
[Figure 12 grids: for two input images, rows show frames at t = 5, 10, 15, 25 and columns show Input, SVG-LP, 3D Photos, MPI, SynSin, MPI-Iter, SynSin-Iter, MPI-Repeat, SynSin-Repeat, Ours no-repeat, and Ours.]
Figure 12. Additional Qualitative Comparisons. As in Figure 6 in the main paper, we show more qualitative view synthesis results on various baselines. Notice how other methods produce artifacts like stretched pixels (3D Photos, MPI), or incomplete outpainting (3D Photos, SynSin, Ours no-repeat), or fail to move the camera at all (SVG-LP). Further, the –Iter and –Repeat variants do not improve results. Our approach generates realistic-looking images of zoomed-in views, which involves adding content and super-resolving stretched pixels.
Figure 13. Long Generation with Disparity. We show generation of a long sequence with its corresponding disparity output. Our
render-refine-repeat approach enables refinement of both geometry and RGB textures.
However, here we increase the upper end of the desired motion range by a factor of 5, which allows the network to train with longer camera strides. This obtains better performance than the default setting, and we use this model for all SynSin evaluations. We found no improvement going beyond a 5× camera motion range. We also implemented an exhaustive search for desirable image pairs within a sequence to maximize the training data.

We also experimented with SynSin-Iter to synthesize long videos by applying the aforementioned trained SynSin in an auto-regressive fashion at test time. But this performed worse than the direct long-range synthesis.

In addition to this, we also consider the repeat variant. SynSin-Repeat was implemented using a similar training setup, however instead we also train SynSin to take its own output and produce the next view for T = 5 steps. Due to memory and engineering constraints, we are unable to fit SynSin-Repeat with the original parameters into memory, so we did our best by reducing the batch size while keeping as faithful to the original implementation as possible. While this does not indicate that SynSin fails at perpetual view generation, it does suggest that certain approaches are better suited to solve this problem.

C. Additional Analysis and Results

This section contains additional results and analysis to better understand Infinite Nature’s behavior. In Fig. 12, we show additional view synthesis results given an input image across various baselines.

C.1. Limitations

As discussed in the main paper, our approach is essentially a memory-less Markov process that does not guarantee global consistency across multiple iterations. This manifests in two ways. First, on the geometry: when you look back, there is no guarantee that the same geometric structure that was observed in the past will be there. Second, there is also no global consistency enforced on the appearance: the appearance of the scene may change in short range, such as a sunny coastline turning into a cloudy coastline after several iterations. Similarly, after hundreds of steps, two different input images may end up in a scene that has similar stylistic appearance, although never exactly the same set of frames. Adding global memory to a system like ours and ensuring more control over what will happen in long-range synthesis is an exciting future direction.

C.2. Disparity Map

In addition to showing the RGB texture, we can also visualize the refined disparity to show the geometry. In Fig. 13, we show the long generation as well as its visualized disparity map in an unnormalized color scheme. Note that the disparity maps look plausible as well because we train our discriminator over RGB and disparity concatenated. Please also see our results in the supplementary video.

C.3. Effect of Disabling Geometric Grounding

We use geometric grounding as a technique to avoid drift. In particular we found that without this grounding, over a time period of many frames the render-refine-repeat loop gradually pushes disparity to very small (i.e. distant) values. Fig. 15 shows an example of this drifting disparity: the sequence begins plausibly but before frame 150 is reached, the disparity (here shown unnormalized) has become very small. It is notable that once this happens the RGB images then begin to deteriorate, drifting further away from the space of plausible scenes. Note that this is a test-time difference only: the results in Fig. 15 were generated using the same model checkpoint as our other results, but with geometric grounding disabled at test time. We show FID-50 results to quantitatively measure the impact of drifting in Fig. 14.

Figure 14. Geometric Grounding Ablation. Geometric grounding is used to explicitly ensure disparities produced by the refinement network match the geometry given by its input. We find this important as otherwise subtle drift can cause the generated results to diverge quickly, as visible in Fig. 15.

C.4. Results under Various Camera Motions

In addition to the demo, we also provide a quantitative experiment to measure how the model’s quality changes with different kinds of camera motion over long trajectories. As described in Section A.7, our auto-pilot algorithm can be steered by adding an offset to the target look direction. We add a horizontal offset which varies sinusoidally, causing the camera to turn alternately left and right every 50 frames.
[Figure 15 strip (w/o Geometric Grounding): RGB and disparity frames over time.]
Figure 15. Geometric Grounding Ablation. An example of running our pretrained model on the task of long trajectory generation but without using geometric grounding. Disparity maps are shown using an unnormalized color scale. Although the output begins plausibly, by the 150th frame the disparity map has drifted very far away, and subsequently the RGB output drifts after the 175th frame.
Fig. 16 compares the FID-50 scores of sequences generated where the relative magnitude of this offset is 0.0 (no offset), 0.5 (gentle turns), and 1.0 (stronger turns), and visualizes the resulting camera trajectories, viewed from above. This experiment shows that our method is resilient to different turning camera motions, with FID-50 scores that are comparable on long generation.
Figure 16. FID with different camera motion. We consider different types of camera motion generated by our auto-pilot algorithm with different parameters and their effect on generated quality. Right: Top-down view of three variations of camera motion that add different amounts of additional turning to the auto-pilot algorithm. Left: Even with strongly turning camera motion, our auto-pilot algorithm is able to generate sequences whose quality is only slightly worse than our full model evaluated only on forward translations. The unlabeled points refer to reported baselines on FID-50 from the main paper. See Section C.4.
[Figure 17 panels: palindromic pose sequences 0→5→0, 0→10→0, and 0→15→0 for two input images.]
Figure 17. Palindromic Poses. Here we show Infinite Nature generated on palindromic sequences of poses of different lengths. Because our model uses a memory-less representation, the forward-backward motion requires the model to hallucinate content it has previously seen but which has gone out of frame or been occluded, resulting in a generated image that does not match the original input image.