
Infinite Nature:

Perpetual View Generation of Natural Scenes from a Single Image

Andrew Liu*   Richard Tucker*   Varun Jampani   Ameesh Makadia   Noah Snavely   Angjoo Kanazawa
Google Research

arXiv:2012.09855v4 [cs.CV] 30 Nov 2021

[Figure 1 image: training videos (TRAIN, left); from a single test input image, generated output frames at t = 5, 20, 50, 100, 150, 200, 300, 500 (TEST, right).]
Figure 1. Perpetual View Generation. Using a collection of aerial videos of nature scenes for training (left), our method learns to take a
single image and perpetually generate novel views for a camera trajectory covering a long distance (right). Our method can successfully
generate hundreds of frames of an aerial video from a single input image (up to 500 shown here).

* indicates equal contribution

Abstract

We introduce the problem of perpetual view generation—long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which quickly degenerate when presented with large camera motions. Methods for video generation also have limited ability to produce long sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates both geometry and image synthesis in an iterative 'render, refine and repeat' framework, allowing for long-range generation that covers large distances after hundreds of frames. Our approach can be trained from a set of monocular video sequences. We propose a dataset of aerial footage of coastal scenes, and compare our method with recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories compared to existing methods. Project page at https://infinite-nature.github.io.

1. Introduction

Consider the input image of a coastline in Fig. 1. Imagine flying through this scene as a bird. Initially, we would see objects grow in our field of view as we approach them. Beyond, we might find a wide ocean or new islands. At the shore, we might see cliffs or beaches, while inland there could be mountains or forests. As humans, we are adept at imagining a plausible world from a single picture, based on our own experience.

How can we emulate this ability on a computer? One approach would be to attempt to generate an entire 3D planet with high-resolution detail from a single image. However, this would be extremely expensive and far beyond the current state of the art. So, we pose the more tractable problem of perpetual view generation: given a single image of a scene, the task is to synthesize a video corresponding to an arbitrary camera trajectory. Solving this problem can have applications in content creation, novel photo interactions, and methods that use learned world models like model-based reinforcement learning.

Perpetual view generation, though simple to state, is an extremely challenging task. As the viewpoint moves, we must extrapolate new content in unseen regions and also synthesize new detail in existing regions that are now closer to the camera. Two active areas of research, video synthesis and view synthesis, both fail to scale to this problem for different reasons.

Recent video synthesis methods apply developments in image synthesis [20] to the temporal domain, or rely on recurrent models [10]. But they can generate only limited numbers of novel frames (e.g., 25 [41] or 48 frames [9]). Additionally, such methods often neglect an important element of the video's structure—they model neither scene geometry nor camera movement. In contrast, many view synthesis methods do take advantage of geometry to synthesize high-quality novel views. However, these approaches can only operate within a limited range of camera motion. As shown in Figure 6, once the camera moves outside this range, such methods fail catastrophically.

We propose a hybrid framework that takes advantage of both geometry and image synthesis techniques to address these challenges. We use disparity maps to represent a scene's geometry, and decompose the perpetual view generation task into the framework of render-refine-and-repeat. First, we render the current frame from a new viewpoint, using disparity to ensure that scene content moves in a geometrically correct manner. Then, we refine the resulting image and geometry. This step adds detail and synthesizes new content in areas that require inpainting or outpainting. Because we refine both the image and disparity, the whole process can be repeated in a recurrent manner, allowing for perpetual generation with arbitrary trajectories.

To train our system, we curated a large dataset of drone footage of nature and coastal scenes from over 700 videos, spanning 2 million frames. We run a structure from motion pipeline to recover 3D camera trajectories, and refer to this as the Aerial Coastline Imagery Dataset (ACID). Our trained model can generate sequences of hundreds of frames while maintaining the aesthetic feel of an aerial coastal video, even though after just a few frames, the camera has moved beyond the limits of the scene depicted in the initial view. Our experiments show that our novel render-refine-repeat framework, with propagation of geometry via disparity maps, is key to tackling this problem. Compared to recent view synthesis and video generation baselines, our approach can produce plausible frames for much longer time horizons.

This work represents a significant step towards perpetual view generation, though it has limitations such as a lack of global consistency in the hallucinated world. We believe our method and dataset will lead to further advances in generative methods for large-scale scenes.

2. Related Work

Image extrapolation. Our work is inspired by the seminal work of Kaneva et al. [19], which proposed a non-parametric approach for generating 'infinite' images through stitching 2D-transformed images, and by patch-based non-parametric approaches for image extension [29, 1]. We revisit the 'infinite images' concept in a learning framework that also reasons about the 3D geometry behind each image. Also related to our work are recent deep learning approaches to the problem of outpainting, i.e., inferring unseen content outside image boundaries [44, 46, 36], as well as inpainting, the task of filling in missing content within an image [15, 50]. These approaches use adversarial frameworks and semantic information for in/outpainting. Our problem also incorporates aspects of super-resolution [14, 22]. Image-specific GAN methods also demonstrate a form of image extrapolation and super-resolution of textures and natural images [53, 34, 30, 33]. In contrast to the above methods, we reason about the 3D geometry behind each image and study image extrapolation in the context of temporal image sequence generation.

View synthesis. Many view synthesis methods operate by interpolating between multiple views of a scene [23, 3, 24, 12, 7], although recent work can generate new views from just a single input image, as in our work [5, 39, 25, 38, 31, 6]. However, in both settings, most methods only allow for a very limited range of output viewpoints. Even methods that explicitly allow for view extrapolation (not just interpolation) typically restrict the camera motion to small regions around a reference view [52, 35, 8].

One factor that limits camera motion is that many methods construct a static scene representation, such as a layered depth image [39, 32], multiplane image [52, 38], point cloud [25, 45], or radiance field [48, 37], and inpaint disoccluded regions. Such representations can allow for fast rendering, but the range of viable camera positions is limited by the finite bounds of the scene representation. Some methods augment this scene representation paradigm, enabling a limited increase in the range of output views. Niklaus et al. perform inpainting after rendering [25], while SynSin uses a post-rendering refinement network to produce realistic images from feature point-clouds [45]. We take inspiration from these methods by rendering and then refining our output. In contrast, however, our system does not construct a single 3D representation of a scene. Instead we proceed iteratively, generating each output view from the previous one, and producing a geometric scene representation in the form of a disparity map for each frame.

Some methods use video as training data. Monocular depth can be learned from 3D movie left-right camera pairs [27] or from video sequences analysed with structure-from-motion techniques [4]. Video can also be directly used for view synthesis [38, 45]. These methods use pairs of images, whereas our model is trained on sequences of several widely-spaced frames since we want to generate long-range video.

Video synthesis. Our work is related to methods that generate a video sequence from one or more images [42, 11, 43, 10, 40, 47]. Many such approaches have focused on predicting the future of dynamic objects with a static camera, often using simple videos of humans walking [2] or robot arms [11]. In contrast, we focus on mostly static scenes with a moving camera, using real aerial videos of nature. Some recent research addresses video synthesis from in-the-wild videos with moving cameras [9, 41], but without taking geometry explicitly into account, and with strict limits on the length of the generated video. In contrast, in our work the movement of pixels from camera motion is explicitly modeled using 3D geometry.
Figure 2. Overview. We first render an input image to a new camera view using the disparity. We then refine the image, synthesizing and
super-resolving missing content. As we output both RGB and geometry, this process can be repeated for perpetual view generation.

3. Perpetual View Generation

Given an RGB image I0 and a camera trajectory (P0, P1, P2, ...) of arbitrary length, our task is to output a new image sequence (I0, I1, I2, ...) that forms a video depicting a flythrough of the scene captured by the initial view. The trajectory is a series of 3D camera poses Pt = [R t; 0 1], where R ∈ R^{3×3} is a 3D rotation and t ∈ R^{3×1} a translation. In addition, each camera has an intrinsic matrix K. At training time camera data is obtained from video clips via structure-from-motion as in [52]. At test time, the camera trajectory may be pre-specified, generated by an auto-pilot algorithm, or controlled via a user interface.

3.1. Approach: Render, Refine, Repeat

Our framework applies established techniques (3D rendering, image-to-image translation, auto-regressive training) in a novel combination. We decompose perpetual view generation into three steps, as illustrated in Figure 2:
1. Render a new view from an old view, by warping the image according to a disparity map using a differentiable renderer,
2. Refine the rendered view and geometry to fill in missing content and add detail where necessary,
3. Repeat this process, propagating both image and disparity to generate each new view from the one before.

Our approach has several desirable characteristics. Representing geometry with a disparity map allows much of the heavy lifting of moving pixels from one frame to the next to be handled by differentiable rendering, ensuring local temporal consistency. The synthesis task then becomes one of image refinement, which comprises: 1) inpainting disoccluded regions, 2) outpainting of new image regions, and 3) super-resolving image content. Because every step is fully differentiable, we can train our refinement network by backpropagating through several view synthesis iterations. Our auto-regressive framework means that novel views may be infinitely generated with explicit view control, even though training data is finite in length.

Formally, for an image It with pose Pt we have an associated disparity (i.e., inverse depth) map Dt ∈ R^{H×W}, and we compute the next frame It+1 and its disparity Dt+1 as

Ît+1, D̂t+1, M̂t+1 = R(It, Dt, Pt, Pt+1),   (1)
It+1, Dt+1 = gθ(Ît+1, D̂t+1, M̂t+1).   (2)

Here, Ît+1 and D̂t+1 are the result of rendering the image It and disparity Dt from the new camera Pt+1, using a differentiable renderer R [13]. This function also returns a mask M̂t+1 indicating which regions of the image are missing and need to be filled in. The refinement network gθ then inpaints, outpaints and super-resolves these inputs to produce the next frame It+1 and its disparity Dt+1. The process is repeated iteratively for T steps during training, and at test time for an arbitrarily long camera trajectory. Next we discuss each step in detail.
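For concreteness, the following Python sketch (not part of the paper's released code) spells out the recurrence of Eqs. (1)–(2). The callables render and refine_net are placeholders standing in for the differentiable renderer R and the refinement network gθ; their exact interfaces are assumptions made only for illustration.

def generate_sequence(image0, disparity0, poses, render, refine_net):
    # Sketch of the render-refine-repeat loop.
    #   render(I_t, D_t, P_t, P_{t+1}) -> (I_hat, D_hat, M_hat)   [Eq. (1)]
    #   refine_net(I_hat, D_hat, M_hat) -> (I_{t+1}, D_{t+1})     [Eq. (2)]
    image, disparity = image0, disparity0
    frames = [image0]
    for t in range(len(poses) - 1):
        # Warp the current frame and its disparity to the next camera pose.
        i_hat, d_hat, m_hat = render(image, disparity, poses[t], poses[t + 1])
        # Inpaint/outpaint missing regions and super-resolve both outputs,
        # so the result can be fed straight back in at the next step.
        image, disparity = refine_net(i_hat, d_hat, m_hat)
        frames.append(image)
    return frames

Because each step consumes only the previous frame and disparity, the same loop runs for T = 5 steps during training and for arbitrarily many steps at test time.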
Geometry and Rendering. Our render step R uses a differentiable mesh renderer [13]. First, we convert each pixel coordinate (u, v) in It and its corresponding disparity d in Dt into a 3D point in the camera coordinate system: (x, y, z) = K^{-1}(u, v, 1)/d. We then convert the image into a 3D triangular mesh where each pixel is treated as a vertex connected to its neighbors, ready for rendering.

To avoid stretched triangle artifacts at depth discontinuities and to aid our refinement network by identifying regions to be inpainted, we compute a per-pixel binary mask Mt ∈ R^{H×W} by thresholding the gradient of the disparity image ∇D̂t, computed with a Sobel filter:

Mt = 0 where ||∇D̂t|| > α, and 1 otherwise.   (3)

We use the 3D mesh to render both image and mask from the new view Pt+1, and multiply the rendered image element-wise by the rendered mask to give Ît+1. The renderer also outputs a depth map as seen from the new camera, which we invert and multiply by the rendered mask to obtain D̂t+1. This use of the mask ensures that any regions in Ît+1 and D̂t+1 that were occluded in It are masked out and set to zero (along with regions that were outside the field of view of the previous camera). These areas are ones that the refinement step will have to inpaint (or outpaint). See Figures 2 and 3 for examples of missing regions shown in pink.
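As an illustration of the rendering geometry, the numpy/scipy sketch below unprojects pixels using (x, y, z) = K^{-1}(u, v, 1)/d and computes the disparity-gradient mask of Eq. (3). The default threshold value for α is an arbitrary placeholder, since the paper does not report it.

import numpy as np
from scipy import ndimage

def unproject(disparity, K):
    # Build the pixel grid (u, v) and lift each pixel to a 3D point
    # (x, y, z) = K^{-1} (u, v, 1) / d in the camera coordinate system.
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    homog = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = homog @ np.linalg.inv(K).T           # K^{-1} applied per pixel
    return rays / disparity[..., None]          # (H, W, 3) mesh vertices

def discontinuity_mask(disparity, alpha=0.1):
    # Eq. (3): zero out pixels whose Sobel disparity gradient exceeds alpha,
    # i.e. likely depth discontinuities that would produce stretched triangles.
    gx = ndimage.sobel(disparity, axis=1)
    gy = ndimage.sobel(disparity, axis=0)
    return (np.hypot(gx, gy) <= alpha).astype(np.float32)   # 1 = keep, 0 = inpaint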
Figure 3. Illustration of the rendering and refinement steps. Left: Our differentiable rendering stage takes a paired RGB image and
disparity map from viewpoint P0 and creates a textured mesh representation, which we render from a new viewpoint P1 , warping the
textures, adjusting disparities, and returning a binary mask representing regions to fill in. Right: The refinement stage takes the output of the
renderer and uses a deep network to fill in holes and add details. The output is a new RGB image and disparity map that can be supervised
with reconstruction and adversarial losses.

Refinement and Synthesis. Given the rendered image Ît+1, its disparity D̂t+1 and its mask M̂t+1, our next task is to refine this image, which includes blurry regions and missing pixels. In contrast to prior inpainting work [49, 36], the refinement network also has to perform super-resolution and thus we cannot use a compositing operation in refining the rendered image. Instead we view the refinement step as a generative image-to-image translation task, and adopt the state-of-the-art SPADE network architecture [26] for our gθ, which directly outputs It+1, Dt+1. We encode I0 to provide the additional GAN noise input required by this architecture. See the appendix for more details.

Rinse and Repeat. The previous steps allow us to generate a single novel view. A crucial aspect of our approach is that we refine not only RGB but also disparity, so that scene geometry is propagated between frames. With this setup, we can use the refined image and disparity as the next input to train in an auto-regressive manner, with losses backpropagated over multiple steps. Other view synthesis methods, although not designed in this manner, may also be trained and evaluated in a recurrent setting, although naively repeating these methods without propagating the geometry as we do requires the geometry to be re-inferred from scratch in every step. As we show in Section 6, training and evaluating these baselines with a repeat step is still insufficient for perpetual view generation.

Geometric Grounding to Prevent Drift. A key challenge in generating long sequences is dealing with the accumulation of errors [28]. In a system where the current prediction affects future outputs, small errors in each iteration can compound, eventually generating predictions outside the distribution seen during training and causing unexpected behaviors. Repeating the generation loop in the training process and feeding the network with its own output ameliorates drift and improves visual quality as shown in our ablation study (Section 6.2). However, we notice that the disparity in particular can still drift at test time, especially over time horizons far longer than seen during training. Therefore we add an explicit geometric re-grounding of the disparity maps.

Specifically, we take advantage of the fact that the rendering process provides the correct range of disparity from a new viewpoint D̂t+1 for visible regions of the previous frame. The refinement network may modify these values as it refines the holes and blurry regions, which can lead to drift as the overall disparity becomes gradually larger or smaller than expected. However, we can geometrically correct this by rescaling the refined disparity map to the correct range by computing a scale factor γ via solving

min_γ ||M (log(γ Dt+1) − log(D̂t+1))||.   (4)

By scaling the refined disparity by γ, our approach ensures that the disparity map stays at a consistent scale, which significantly reduces drift at test time as shown in Section 6.3.
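Read as a least-squares problem over the masked pixels, Eq. (4) has a closed-form solution: log γ is the masked mean of log D̂t+1 − log Dt+1. The sketch below assumes that L2 reading (the paper does not state which norm is used) and adds a small epsilon purely for numerical safety.

import numpy as np

def reground_disparity(refined_disparity, rendered_disparity, mask, eps=1e-8):
    # Solve min_gamma || M (log(gamma * D_{t+1}) - log(D_hat_{t+1})) ||  (Eq. 4)
    # in the least-squares sense; log(gamma) is then a masked mean.
    valid = mask > 0.5
    log_diff = (np.log(rendered_disparity[valid] + eps)
                - np.log(refined_disparity[valid] + eps))
    gamma = np.exp(log_diff.mean())
    # Rescaling keeps the refined disparity on the same scale as the rendered
    # disparity, which is what prevents drift over long trajectories.
    return gamma * refined_disparity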
4. Aerial Coastline Imagery Dataset

Learning to generate long sequences requires real image sequences for training. Many existing datasets for view synthesis do not use sequences, but only a set of views from slightly different camera positions. Those that do have sequences are limited in length: RealEstate10K, for example, has primarily indoor scenes with limited camera movement [52]. To obtain long sequences with a moving camera and few dynamic objects, we turn to aerial footage of beautiful nature scenes available on the Internet. Nature scenes are a good starting point for our challenging problem, as GANs have shown promising results on nature textures [30, 33].

We collected 891 videos using keywords such as 'coastal' and 'aerial footage', and processed these videos with SLAM and structure from motion following the approach of Zhou et al. [52], yielding over 13,000 sequences with a total of 2.1 million frames. We have released the list of videos and SfM camera trajectories. See Fig. 4 for an illustrative example of our SfM pipeline running on a coastline video.

Figure 4. Processing video for ACID. We run structure from motion on coastline drone footage collected from YouTube to create the Aerial Coastline Imagery Dataset (ACID). See Section 4.

To obtain disparity maps for every frame, we use the off-the-shelf MiDaS single-view depth prediction method [27]. We find that MiDaS is quite robust and produces sufficiently accurate disparity maps for our method. Because MiDaS disparity is only predicted up to scale and shift, it must first be rescaled to match our data. To achieve this, we use the sparse point-cloud computed for each scene during structure from motion. For each frame we consider only the points that were tracked in that frame, and use least-squares to compute the scale and shift that minimize the disparity error on these points. We apply this scale and shift to the MiDaS output to obtain disparity maps (Di) that are scale-consistent with the SfM camera trajectories (Pi) for each sequence. Due to the difference in camera motions between videos, we strategically sub-sample frames to ensure consistent camera speed in training sequences. See more details in the appendix.
5. Experimental Setup

Losses. We train our approach on a collection of image sequences {It}_{t=0}^T with corresponding camera poses {Pt}_{t=0}^T and disparity maps for each frame {Dt}_{t=0}^T. Following the literature on conditional generative models, we use an L1 reconstruction loss on RGB and disparity, a VGG perceptual loss on RGB [18] and a hinge-based adversarial loss with a discriminator (and feature matching loss) [26] for the T frames that we synthesize during training. We also use a KL-divergence loss [21] on our initial image encoder, LKLD = DKL(q(z|x) || N(0, 1)). Our complete loss function is

L = Lreconst + Lperceptual + Ladv + Lfeat matching + LKLD.   (5)

The loss is computed over all iterations and over all samples in the mini-batch.
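As a small illustration, the total objective of Eq. (5) is a plain weighted sum of the individual terms; the weights shown below are the values reported in Appendix A.5, and the per-term loss functions themselves are assumed to be computed elsewhere.

def total_loss(losses, weights=None):
    # losses: dict of scalar loss values, e.g.
    #   {'reconst': ..., 'perceptual': ..., 'adv': ..., 'feat_matching': ..., 'kld': ...}
    # Default weights follow Appendix A.5 of this paper.
    if weights is None:
        weights = {'reconst': 2.0, 'perceptual': 0.01, 'adv': 1.0,
                   'feat_matching': 10.0, 'kld': 0.05}
    return sum(weights[name] * losses[name] for name in weights)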
Metrics. Evaluating the quality of the generated images in a way that correlates with human judgement is a challenge. We use the Fréchet inception distance (FID), a common metric used in evaluating generative models of images. FID computes the difference between the mean and covariance of the embedding of real and fake images through a pre-trained Inception network [17] to measure the realism of the generated images as well as their diversity. We precompute real statistics using 20k real image samples from our dataset. To measure changes in generated quality over time, we report FID over a sliding window: we write FID-w at t to indicate a FID value computed over all image outputs within a window of width w centered at time t, i.e. {Ii} for t − w/2 < i ≤ t + w/2. For short-range trajectories where ground truth images are available, we also report mean squared error (MSE) and LPIPS [51], a perceptual similarity metric that correlates better with human perceptual judgments than traditional metrics such as PSNR and SSIM.
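For reference, the FID between two sets of Inception embeddings reduces to a Fréchet distance between Gaussian fits; a minimal numpy/scipy sketch is given below, with the Inception feature extraction itself assumed to happen elsewhere. FID-w then applies this distance to the embeddings of the outputs falling inside the stated window.

import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    # real_feats, fake_feats: (N, D) arrays of Inception embeddings.
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the covariance product; small imaginary parts
    # from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_f).real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2.0 * covmean))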
Implementation Details. We train our model with T = 5 steps of render-refine-repeat at an image resolution of 160 × 256 (as most aerial videos have a 16:9 aspect ratio). The choice of T is limited by both memory and available training sequence lengths. The refinement network architecture is the same as that of the SPADE generator in [26], and we also employ the same multi-scale discriminator. We implement our models in TensorFlow, and train with a batch size of 4 over 10 GPUs for 7M iterations, which takes about 8 days. We then identify the model checkpoint with the best FID score over a validation set.

6. Evaluation

We compare our approach with three recent state-of-the-art single-image view synthesis methods—the 3D Photography method (henceforward '3D Photos') [32], SynSin [45], and single-view MPIs [38]—as well as the SVG-LP video synthesis method [10]. We retrain each method on our ACID training data, with the exception of 3D Photos which is trained on in-the-wild imagery and, like our method, takes MiDaS disparity as an input. SynSin and single-view MPI were trained at a resolution of 256 × 256. SVG-LP takes two input frames for context, and operates at a lower resolution of 128 × 128.

The view synthesis baseline methods were not designed for long camera trajectories; every new frame they generate comes from the initial frame I0 even though after enough camera movement there may be very little overlap between the two. Therefore we also compare against two variants of each of these methods. First, variants with iterated evaluation (SynSin–Iterated, MPI–Iterated): these methods use
Method                            LPIPS ↓ (frames 1–10)   MSE ↓ (frames 1–10)   FID ↓ (frames 1–50)
Baseline methods
SVG-LP [10]                       0.60                     0.020                 135.9
SynSin [45]                       0.32                     0.018                  98.1
MPI [38]                          0.35                     0.019                  65.0
3D Photos [32]                    0.30                     0.020                 123.6
Applied iteratively at test time
SynSin–Iterated                   0.40                     0.021                 143.6
MPI–Iterated                      0.47                     0.020                 201.2
Trained with repeat (T = 5)
SynSin–Repeat                     0.44                     0.036                 153.3
MPI–Repeat                        0.55                     0.020                 203.0
Ours                              0.32                     0.020                  50.6

Table 1. Quantitative evaluation. We compute LPIPS and MSE against ten frames of ground truth, and FID-50 over 50 frames generated from an input test image. See Section 6.1.

Figure 5. FID over time. Left: FID-20 over time for 50 frames generated by each method. Right: FID-50 over 500 frames generated by our method via autopilot. For comparison, we plot FID-50 for the baselines on the first 50 steps. Despite generating sequences an order of magnitude longer, our FID-50 is still lower than that of the baselines. See Sec. 6.1, 6.3.

the same trained models as their baseline counterparts, but we apply them iteratively at test time to generate each new frame from the previous frame rather than the initial one. Second, variants trained with repeat (SynSin–Repeat, MPI–Repeat): these methods are trained autoregressively, with losses backpropagated across T = 5 steps, as in our full model. (We omit these variations for the 3D Photos method, which was unfortunately too slow to allow us to apply it iteratively, and which we are not able to retrain.)

6.1. Short-to-medium range view synthesis

To evaluate short-to-medium-range synthesis, we select ACID test sequences with an input frame and 10 subsequent ground truth frames (subsampling as described in the appendix), with the camera moving forwards at an angle of up to 45°. Although our method is trained on all types of camera motions, this forward motion is appropriate for comparison with view synthesis methods which are not designed to handle extreme camera movements.

We then extrapolate the camera motion from the last two frames of each sequence to extend the trajectory for an additional 40 frames. To avoid the camera colliding with the scene, we check the final camera position against the disparity map of the last ground-truth frame, and discard sequences in which it is outside the image or at a depth large enough to be occluded by the scene.

This yields a set of 279 sequences with camera trajectories of 50 steps and ground truth images for the first 10 steps. For short-range evaluation, we compare to ground truth on the first 10 steps. For medium-range evaluation, we compute FID scores over all 50 frames.

We apply each method to these sequences to generate novel views corresponding to the camera poses in each sequence (SVG-LP is the exception in that it does not take account of camera pose). See results in Table 1. While our goal is perpetual view generation, we find that our approach is competitive with recent view synthesis approaches for short-range synthesis on LPIPS and MSE metrics. For mid-range evaluation, we report FID-50 over 50 generated frames. Our approach has a dramatically lower FID-50 score than other methods, reflecting the more naturalistic look of its output. To quantify the degradation of each method over time, we report a sliding-window FID-20 computed from t = 10 to 40. As shown in Fig. 5 (left), the image quality (measured by FID-20) of the baseline methods deteriorates more rapidly with increasing t compared to our approach.

Qualitative comparisons of these methods are shown in Fig. 6 and our supplementary video, which illustrates how the quality of each method's output changes over time. Notable here are SVG-LP's blurriness and inability to predict any camera motion at all; the increasingly stretched textures of 3D Photos' output; and the way the MPI-based method's individual layers become noticeable. SynSin does the best job of generating plausible texture, but still produces holes after a while and does not add new detail.

The –Iterated and –Repeat variants are consistently worse than the original SynSin and MPI methods, which suggests that simply applying an existing method iteratively, or retraining it autoregressively, is insufficient to deal with large camera movement. These variants show more drifting artifacts than their original versions, likely because (unlike our method) they do not propagate geometry from step to step. The MPI methods additionally become very blurry on repeated application, as they have no ability to add detail, lacking our refinement step.

In summary, our thoughtful combination of render-refine-repeat shows better results than these existing methods and variations. Figure 7 shows additional qualitative results from generating 15 and 30 frames on a variety of inputs.
Figure 6. Qualitative comparison over time. We show a generated sequence for each method at different time steps. Note that we only have ground truth images for 10 frames; the subsequent frames are generated using an extrapolated trajectory. Pink regions in Ours (no refine) indicate missing content uncovered by the moving camera.

Figure 7. Qualitative comparison. We show the diversity and quality of many generated results for each method at t = 15 and t = 30. Competing approaches result in missing or unrealistic frames. Our approach is able to generate plausible views of the scene.

6.2. Ablations

We investigate the benefit of training over multiple iterations of our render-refine-repeat loop by also training our model with T = 1 ('No repeat'). As shown in Table 2, the performance on short-range generation, as measured in LPIPS and MSE, is similar to our full model, but when we look at FID, we observe that this method generates lower quality images and that they get substantially worse with increasing t (see Fig. 5, left). This shows the importance of a recurrent training setup for our method.

We next consider the refine step. Omitting this step completely results in a larger and larger portion of the image being completely missing as t increases: examples are shown as 'Ours (no refine)' in Fig. 6, where for clarity the missing pixels are highlighted in pink. In our full model, these regions are inpainted or outpainted by the refinement network at each step. Note also that even non-masked areas of the image are much blurrier when the refinement step is omitted, showing the benefit of the refinement network in super-resolving image content.

Table 2 also shows results on two further variations of our refinement step. First, replacing our refinement network with a simpler U-Net architecture yields substantially worse results ('U-Net refinement'). Second, disabling geometric grounding (Section 3.1) also leads to slightly lower quality on this short-to-medium range view synthesis task ('No re-grounding').
Figure 8. Long trajectory generation (frames shown at t = 0, 15, 35, 50, 80, 100, 150, 200, 250, 300, 350, 400, 450, 500). From a single image, our approach can generate 500 frames of video without suffering visually. Please see the supplementary video for the full effect.

Ablations               LPIPS ↓   MSE ↓   FID-50 ↓
Full Model              0.32      0.020    50.6
No repeat (T = 1)       0.30      0.022    95.4
U-Net refinement        0.54      0.052   183.0
No re-grounding         0.34      0.022    64.3

Table 2. Ablations. We ablate aspects of our model to understand their contribution to the overall performance. See Section 6.2.

6.3. Perpetual view generation

We also evaluate the ability of our model to perform perpetual view generation by synthesizing videos of 500 frames, using an auto-pilot algorithm to create an online camera trajectory that avoids flying directly into the ground, sky or obstacles such as mountains. This algorithm works iteratively in tandem with image generation to control the camera based on heuristics which measure the proportion of sky and of foreground obstacles in the scene. See the appendix for details.

We note that this task is exceptionally challenging and completely outside the capabilities of current generative and view synthesis methods. To further frame the difficulty, our refinement network has only seen videos of length 5 during training, yet we generate 500 frames for each of our test sequences. As shown in Fig. 5 (right), our FID-50 score over generated frames is remarkably robust: even after 500 frames, the FID is lower than that of all the baseline methods over 50 frames. Fig. 5 also shows the benefit of our proposed geometric grounding: when it is omitted, the image quality gradually deteriorates, indicating that resolving drift is an important contribution.

Fig. 8 shows a qualitative example of long sequence generation. In spite of the intrinsic difficulty of generating frames over large distances, our approach retains something of the aesthetic look of coastline, generating new islands, rocks, beaches, and waves as it flies through the world. The auto-pilot algorithm can receive additional inputs (such as a user-specified trajectory or random elements), allowing us to generate diverse videos from a single image. Please see the supplementary video for more examples and the full effect of these generated fly-through videos.

6.4. User-controlled video generation

Because our rendering step takes camera poses as an input, we can render frames for arbitrary camera trajectories at test time, including trajectories controlled by a user in the loop. We have built an HTML interface that allows the user to steer our auto-pilot algorithm as it flies through this imaginary world. This demo runs over the internet and is capable of generating a few frames per second. Please see the supplementary video for a demonstration.

7. Discussion

We introduce the new problem of perpetual view generation and present a novel framework that combines both geometric and generative techniques as a first step in tackling it. Our system can generate video sequences spanning hundreds of frames, which to our knowledge has not been shown for prior video or view synthesis methods. The results indicate that our hybrid approach is a promising step. Nevertheless, many challenges remain.

First, our render-refine-repeat loop is by design memory-less, an intentional choice which allows us to train on finite length videos yet generate arbitrarily long output using a finite memory and compute budget. As a consequence it aims for local consistency between nearby frames, but does not directly tackle questions of long-term consistency or a global representation. How to incorporate long-term memory in such a system is an exciting question for future work. Second, our refinement network, like other GANs, can produce images that seem realistic but not recognizable [16]. Further advancements in image and video synthesis methods that incorporate geometry would be an interesting future direction. Last, we do not model dynamic scenes: combining our geometry-aware approach with methods that can reason about object dynamics could be another fruitful direction.

Acknowledgements. We would like to thank Dominik Kaeser for directing and helping prepare our videos and Huiwen Chang for making the MiDaS models easily accessible.
References

[1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 28(3), Aug. 2009.
[2] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In ICCV, pages 1395–1402. IEEE, 2005.
[3] Gaurav Chaurasia, Sylvain Duchêne, Olga Sorkine-Hornung, and George Drettakis. Depth synthesis and local warps for plausible image-based navigation. Trans. on Graphics, 32:30:1–30:12, 2013.
[4] Weifeng Chen, Shengyi Qian, and Jia Deng. Learning single-image depth from videos using quality assessment networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[5] Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In ICCV, 2019.
[6] Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In ICCV, pages 4090–4100, 2019.
[7] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pages 7781–7790, 2019.
[8] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In ICCV, pages 7781–7790, 2019.
[9] Aidan Clark, Jeff Donahue, and Karen Simonyan. Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
[10] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
[11] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NeurIPS, pages 64–72, 2016.
[12] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[13] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T. Freeman. Unsupervised training for 3d morphable model regression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[14] Daniel Glasner, Shai Bagon, and Michal Irani. Super-resolution from a single image. In ICCV, pages 349–356, 2009.
[15] James Hays and Alexei A Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4–es, 2007.
[16] Aaron Hertzmann. Visual indeterminacy in generative neural art. arXiv preprint arXiv:1910.04639, 2019.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pages 6626–6637, 2017.
[18] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[19] Biliana Kaneva, Josef Sivic, Antonio Torralba, Shai Avidan, and William T. Freeman. Infinite images: Creating and exploring a large photorealistic virtual space. In Proceedings of the IEEE, 2010.
[20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[21] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[22] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[23] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of SIGGRAPH 96, Annual Conference Series, 1996.
[24] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
[25] Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 3D Ken Burns effect from a single image. ACM Transactions on Graphics (TOG), 2019.
[26] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[27] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[28] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
[29] Arno Schödl, Richard Szeliski, David H Salesin, and Irfan Essa. Video textures. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 489–498, 2000.
[30] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. SinGAN: Learning a generative model from a single natural image. In ICCV, pages 4570–4580, 2019.
[31] Lixin Shi, Haitham Hassanieh, Abe Davis, Dina Katabi, and Fredo Durand. Light field reconstruction using sparsity in the continuous Fourier domain. Trans. on Graphics, 34(1):12:1–12:13, Dec. 2014.
[32] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3D photography using context-aware layered depth inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[33] Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. InGAN: Capturing and remapping the "DNA" of a natural image. arXiv preprint arXiv:1812.00231, 2018.
[34] Assaf Shocher, Nadav Cohen, and Michal Irani. "Zero-shot" super-resolution using deep internal learning. In CVPR, pages 3118–3126, 2018.
[35] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[36] Piotr Teterwak, Aaron Sarna, Dilip Krishnan, Aaron Maschinot, David Belanger, Ce Liu, and William T Freeman. Boundless: Generative adversarial networks for image extension. In Proceedings of the IEEE International Conference on Computer Vision, pages 10521–10530, 2019.
[37] Alex Trevithick and Bo Yang. GRF: Learning a general radiance field for 3d scene representation and rendering. In arXiv:2010.04595, 2020.
[38] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[39] Shubham Tulsiani, Richard Tucker, and Noah Snavely. Layer-structured 3D scene inference via view synthesis. In The European Conference on Computer Vision (ECCV), September 2018.
[40] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, pages 1526–1535, 2018.
[41] Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V Le, and Honglak Lee. High fidelity video prediction with large stochastic recurrent neural networks. In Advances in Neural Information Processing Systems, pages 81–91, 2019.
[42] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NeurIPS, pages 613–621, 2016.
[43] Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In CVPR, pages 1020–1028, 2017.
[44] Yi Wang, Xin Tao, Xiaoyong Shen, and Jiaya Jia. Wide-context semantic image extrapolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1399–1408, 2019.
[45] Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. SynSin: End-to-end view synthesis from a single image. In CVPR, 2020.
[46] Zongxin Yang, Jian Dong, Ping Liu, Yi Yang, and Shuicheng Yan. Very long natural scenery image prediction by outpainting. In Proceedings of the IEEE International Conference on Computer Vision, pages 10561–10570, 2019.
[47] Yufei Ye, Maneesh Singh, Abhinav Gupta, and Shubham Tulsiani. Compositional video prediction. In ICCV, 2019.
[48] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021.
[49] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018.
[50] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In ICCV, 2019.
[51] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[52] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Trans. Graph., 37(4):65:1–65:12, 2018.
[53] Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Non-stationary texture synthesis by adversarial expansion. arXiv preprint arXiv:1805.04487, 2018.
Figure 9. Infinite Nature Demo. We built a lightweight demo interface so a user can run Infinite Nature and control the camera trajectory. In addition, the demo can take any uploaded image: the system will automatically run MiDaS to generate an initial depth map, then allow the user to hit "play" to navigate through the generated world and click to turn the camera towards the cursor. The demo runs at several frames per second using a free Google Colab GPU-enabled backend. Please see our video for the full effect of generating an interactive scene flythrough.

Appendix

A. Implementation Details

This section contains additional implementation details for our system, including data generation, network architecture, and inference procedure.

A.1. ACID Collection and Processing

To create the ACID dataset, we began by identifying over 150 proper nouns related to coastline and island locations such as Big Sur, Half Moon Bay, Moloka'i, Shi Shi Beach, Waimea Bay, etc. We combined each proper noun with a set of keywords ({aerial, drone, dji, and mavic}) and used these combinations of keywords to perform YouTube video search queries. We combined the top 10 video IDs from each query to form a set of candidate videos for our dataset.

We process all the videos through a SLAM and SfM pipeline as in Zhou et al. [52]. For each video, this process yields a set of camera trajectories, each containing camera poses corresponding to individual video frames. The pipeline also produces a set of 3D keypoints. We manually identify and remove videos that feature a static camera or are not aerial, as well as videos that feature a large number of people or man-made structures. In an effort to limit the potential privacy concerns of our work, we also discard frames that feature people. In particular, we run a state-of-the-art object detection network [?] to identify any humans present in the frames. If detected humans occupy more than 10% of a given frame, we discard that frame. The above filtering steps are applied in order to identify high-quality video sequences for training with limited privacy implications, and the remaining videos form our dataset.

Many videos, especially those that feature drone footage, are shot with cinematic horizontal borders, achieving a letterbox effect. We pre-process every frame to remove detected letterboxes and adjust the camera intrinsics accordingly to reflect this crop operation.

For the remaining sequences, we run the MiDaS system [27] on every frame to estimate dense disparity (inverse depth). MiDaS predicts disparity only up to an unknown scale and shift, so for each frame we use the 3D keypoints produced by running SfM to compute scale and shift parameters that best fit the MiDaS disparity values to the 3D keypoints visible in that frame. This results in disparity images that better align with the SfM camera trajectories during training. More specifically, the scale a and shift b are calculated via least-squares as:

argmin_{a,b} Σ_{(x,y,z)∈K} (a D̂xyz + b − z^{-1})^2   (6)

where K is the set of visible 3D keypoints from the local frame's camera viewpoint, D̂ is the disparity map predicted by MiDaS for that frame, and D̂xyz is the disparity value sampled from that map at texture coordinates corresponding to the projection of the point (x, y, z) according to the camera intrinsics. The disparity map D we use during training and rendering is then D = aD̂ + b.
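Eq. (6) is an ordinary linear least-squares fit of the MiDaS disparity samples against the keypoints' inverse depths; a minimal numpy sketch follows (variable names are ours, not from the released pipeline).

import numpy as np

def fit_scale_shift(midas_at_keypoints, keypoint_depths):
    # Solve argmin_{a,b} sum_{(x,y,z) in K} (a * D_hat_xyz + b - 1/z)^2  (Eq. 6).
    # midas_at_keypoints: (N,) MiDaS disparity sampled at each keypoint projection.
    # keypoint_depths:    (N,) depth z of each SfM keypoint visible in the frame.
    target = 1.0 / keypoint_depths
    A = np.stack([midas_at_keypoints, np.ones_like(midas_at_keypoints)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, target, rcond=None)
    return a, b   # D = a * D_hat + b is then scale-consistent with SfM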
A.2. Inference without Disparity Scaling

Scaling and shifting the disparity as described above requires a sparse point cloud, which is generated from SfM and in turn requires video or multi-view imagery. At test-time, however, we assume only a single view is available. Fortunately, this is not a problem in practice, as scaling and shifting the disparity is only necessary if we seek to compare generated frames at target poses against ground truth. If we just want to generate sequences, we can equally well use the original MiDaS disparity predictions. Fig. 10 compares long generation using scaled and original MiDaS outputs, and shows that using original MiDaS outputs has a negligible effect on the FID scores. Fig. 11 shows an example of a long sequence generated with the unscaled MiDaS prediction from a photo taken on a smartphone, demonstrating that our framework runs well on a single test image using the original MiDaS disparity.

Figure 10. Scaled MiDaS vs original MiDaS. We scale the MiDaS disparity maps to be consistent with the camera poses estimated by SfM during training. At test-time our approach only requires a single image with disparity. Here we show results of FID-50 long generation using the original MiDaS output vs the scaled MiDaS. Despite being only trained on scaled disparity, our model still performs competitively with (unscaled) MiDaS as its input.

A.3. Aligning Camera Speed

The speed of camera motion varies widely in our collected videos, so we normalize the amount of motion present in training image sequences by computing a proxy for camera speed. We use the translation magnitude of the estimated camera poses between frames after scale-normalizing the video as in Zhou et al. [52] to determine a range of rates at which each sequence can be subsampled in order to obtain a camera speed within a desired target range. We randomly select frame rates within this range to subsample videos. We picked a target speed range for training sequences that varies by up to 30% and, on average, leaves 90% of an image's content visible in the next sampled frame.

A.4. Network Architecture

We use Spatially Adaptive Normalization (SPADE) of Park et al. [26] as the basis for our refinement network. The generator consists of two parts, a variational image encoder and a SPADE generator. The variational image encoder maps a given image to the parameters of a multivariate Gaussian that represents its feature. We can use this new distribution to sample GAN noise used by the SPADE generator. We use the initial RGBD frame of a sequence as input to the encoder to obtain this distribution before repeatedly sampling from it (or using its mean at test-time) at every step of refinement.

Our SPADE generator is identical to the original SPADE architecture, except that the input has only 5 channels corresponding to RGB texture, disparity, and a mask channel indicating missing regions.

We also considered a U-net [?]–based approach by using the generator implementation of Pix2Pix [?], but found that such an approach struggles to achieve good results, taking longer to converge and in many cases, completely failing when evaluating beyond the initial five steps.

As our discriminator, we use the Pix2PixHD [?] multi-scale discriminator with two scales over generated RGBD frames. To make efficient use of memory, we run the discriminator on random crops of pixels and random generated frames over time.

A.5. Loss Weights

We used a subset of our training set to sweep over checkpoints and hyperparameter configurations. For our loss, we used λreconst = 2, λperceptual = 0.01, λadversarial = 1, λKLD = 0.05, λfeat matching = 10.

A.6. Data Source for Qualitative Illustrations

Note that for license reasons, we do not show generated qualitative figures and results on ACID. Instead, we collect input images with open source licenses from Pexels [?] and show the corresponding qualitative results in the paper and the supplementary video. The quantitative results are computed on the ACID test set.

A.7. Auto-pilot View Control

We use an auto-pilot view control algorithm when generating long sequences from a single input RGB-D image. This algorithm must generate the camera trajectory in tandem with the image generation, so that it can avoid crashing into the ground or obstacles in the scene. Our basic approach works as follows: at each step we take the current disparity image and categorize all points with disparity below a certain threshold as sky and all points with disparity above a second, higher threshold as near. (In our experiments these thresholds are set to 0.05 and 0.5.) Then we apply three simple heuristics for view-control: (1) look up or down so that a given percentage (typically 30%) of the image is sky, (2) look left or right, towards whichever side has more sky, (3) if more than 20% of the image is near, move up (and if less, down), otherwise move towards a horizontally-centered point 30% of the way from the top of the image.
Figure 11. Generation from smartphone photo (input image and generated frames at t = 10, 20, 30, 40, 50, 60, 70, 80). Our perpetual view generation applied to a photo captured by the authors on a smartphone. We use MiDaS for the initial disparity, and assume a field of view of 90°.

These heuristics determine a (camera-relative) target look direction and target movement direction. To ensure smooth camera movement, we interpolate the actual look and movement directions only a small fraction (0.05) of the way to the target directions at each frame. The next camera pose is then produced by moving a set distance in the move direction while looking in the look direction. To generate a wider variety of camera trajectories (as for example in Section C.4), or to allow user control, we can add an offset to the target look direction that varies over time: a horizontal sinusoidal variation in the look direction, for example, generates a meandering trajectory.

This approach generates somewhat reasonable trajectories, but an exciting future direction would be to train a model that learns how to choose each successive camera pose, using the camera poses in our training data.

We use this auto-pilot algorithm to seamlessly integrate user control and obstacle avoidance in our demo interface which can be seen in Fig. 9.
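The sketch below restates the measurements the three heuristics react to; the thresholds and fractions are the values quoted above, while the mapping from these quantities to an actual camera-relative direction and pose depends on axis conventions the paper does not spell out, so that part is deliberately left to the caller.

import numpy as np

def autopilot_measurements(disparity, sky_thresh=0.05, near_thresh=0.5,
                           sky_target=0.30, near_limit=0.20):
    # Classify pixels: low disparity = far away (sky), high disparity = near obstacle.
    h, w = disparity.shape
    sky = disparity < sky_thresh
    near = disparity > near_thresh
    # Rule 1: pitch up/down until roughly sky_target of the image is sky.
    pitch_error = sky_target - sky.mean()
    # Rule 2: yaw towards whichever half of the image contains more sky.
    yaw_sign = float(np.sign(sky[:, w // 2:].mean() - sky[:, :w // 2].mean()))
    # Rule 3: climb if more than near_limit of the image is near; otherwise head
    # for a horizontally-centered point 30% of the way down from the top.
    climb = bool(near.mean() > near_limit)
    return pitch_error, yaw_sign, climb

def smooth_towards(current_dir, target_dir, fraction=0.05):
    # Move only a small fraction of the way to the target each frame,
    # which keeps the generated camera path smooth.
    blended = (1.0 - fraction) * current_dir + fraction * target_dir
    return blended / np.linalg.norm(blended)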
to only include forward motion which is defined as trajec-
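The following is a minimal sketch of one step of this auto-pilot heuristic, written by us for illustration rather than taken from the released code. The camera convention (x right, y down, z forward), the helper names, the focal length assumed for heuristic (3), and the way the "up/down" and "move towards a point" clauses are combined are our own choices; the thresholds and the 0.05 interpolation fraction follow the text above.

    import numpy as np

    SKY_T, NEAR_T = 0.05, 0.5        # disparity thresholds from the text
    SKY_FRAC, NEAR_FRAC = 0.30, 0.20
    LERP = 0.05                      # fraction moved toward the targets each frame

    def _unit(v):
        return v / np.linalg.norm(v)

    def autopilot_step(disparity, look_dir, move_dir, step_size, look_offset=0.0):
        # disparity: (H, W) array; look_dir / move_dir: unit 3-vectors in camera coordinates.
        h, w = disparity.shape
        sky = disparity < SKY_T
        near = disparity > NEAR_T

        # (1) Look up or down so that roughly SKY_FRAC of the image is sky.
        up = SKY_FRAC - sky.mean()                              # positive -> aim the target look higher
        # (2) Look left or right, towards whichever side has more sky.
        turn = sky[:, w // 2:].mean() - sky[:, :w // 2].mean()  # positive -> turn right
        target_look = _unit(np.array([turn + look_offset, -up, 1.0]))

        # (3) One plausible reading of the third heuristic: nudge upward when more than
        # NEAR_FRAC of the image is near (downward when less), while heading toward a
        # horizontally-centred point 30% of the way down from the top of the image.
        f = 0.5 * w                                             # focal length for an assumed 90 degree FoV
        target_move = np.array([0.0, (0.3 * h - 0.5 * h) / f, 1.0])
        target_move[1] += -0.5 if near.mean() > NEAR_FRAC else 0.5
        target_move = _unit(target_move)

        # Interpolate only a small fraction of the way to the targets for smoothness,
        # then advance a fixed distance along the movement direction.
        look_dir = _unit(look_dir + LERP * (target_look - look_dir))
        move_dir = _unit(move_dir + LERP * (target_move - move_dir))
        return look_dir, move_dir, step_size * move_dir

A time-varying look_offset (for example, the sinusoidal schedule evaluated in Section C.4) produces the meandering or turning trajectories described above.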
A.8. Additional Frame Interpolation

For the purposes of presenting a very smooth and cinematic video with a high frame rate, we can additionally interpolate between the frames generated by our model. Since our system produces not just RGB images but also disparity, and since we have camera poses for each frame, we can use this information to aid the interpolation. For each pair of frames (P_t, I_t, D_t) and (P_{t+1}, I_{t+1}, D_{t+1}) we proceed as follows.

First, we create additional camera poses (as many as desired) by linearly interpolating position and look-direction between P_t and P_{t+1}. Then, for each new pose P a fraction λ of the way between P_t and P_{t+1}, we use the differentiable renderer R to re-render I_t and I_{t+1} from that viewpoint, and blend between the two resulting images:
I′_t = R(I_t, D_t, P_t, P),
I′_{t+1} = R(I_{t+1}, D_{t+1}, P_{t+1}, P),    (7)
I = (1 − λ) I′_t + λ I′_{t+1}.

Note: we apply this interpolation to the long trajectory sequences in the supplementary video only, adding four new frames between each pair in the sequence. However, all short-to-mid range comparisons and all figures and metrics in the paper are computed on raw outputs without any interpolation.
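A compact sketch of this interpolation procedure is given below. It is our illustration, not the released implementation: the differentiable renderer R is passed in as a callable render, poses are simplified to (position, look-direction) pairs, and interpolate_pose is a stand-in helper.

    import numpy as np

    def interpolate_pose(pose_a, pose_b, lam):
        # Linearly interpolate position and look direction (each a 3-vector).
        pos = (1.0 - lam) * pose_a[0] + lam * pose_b[0]
        look = (1.0 - lam) * pose_a[1] + lam * pose_b[1]
        return pos, look / np.linalg.norm(look)

    def interpolate_frames(render, frame_t, frame_t1, num_inbetween=4):
        # frame_t = (P_t, I_t, D_t), frame_t1 = (P_{t+1}, I_{t+1}, D_{t+1}).
        # render(image, disparity, src_pose, tgt_pose) plays the role of R in Eq. (7).
        (P_t, I_t, D_t), (P_t1, I_t1, D_t1) = frame_t, frame_t1
        frames = []
        for k in range(1, num_inbetween + 1):
            lam = k / (num_inbetween + 1)          # fraction of the way from P_t to P_{t+1}
            P = interpolate_pose(P_t, P_t1, lam)
            I_a = render(I_t, D_t, P_t, P)         # re-render I_t at the in-between pose
            I_b = render(I_t1, D_t1, P_t1, P)      # re-render I_{t+1} at the in-between pose
            frames.append((1.0 - lam) * I_a + lam * I_b)   # blend as in Eq. (7)
        return frames

With num_inbetween = 4, this matches the four additional frames per pair used for the supplementary video.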
A.9. Aerial Coastline Imagery Dataset

Our ACID dataset is available from our project page at https://infinite-nature.github.io, in the same format as RealEstate10K [52]. For each video we identified as aerial footage of nature scenes, we compute structure-from-motion poses and intrinsics for multiple frames within a globally consistent system. We divide ACID into train and test splits.

To obtain the test sequences used during evaluation, we apply the same motion-based frame subsampling described in Section A.3 to match the distribution seen during training for all view synthesis approaches. Further, we constrain test items to include only forward motion, defined as trajectories that stay within a 90° frontal cone of the first frame. This was done to establish a fair setting with existing view synthesis methods, which do not incorporate generative aspects. These same test items were used in the 50-frame FID experiments by repeatedly extrapolating the last two known poses to generate new poses. For the 500-generation FID, we compute future poses using the auto-pilot control described in Section A.7. To get "real" inception statistics to compare with, we use images from ACID.

B. Experimental implementation

B.1. SynSin training

We first trained SynSin [45] on our nature dataset with the default training settings (i.e. the presets used for the KITTI model). We then modified the default settings by changing the camera stride in order to train SynSin to perform better for the task of longer-range view synthesis. Specifically, we employ the same motion-based sampling for selecting pairs of images as described in Section A.3. However, here we increase the upper end of the desired motion range by a factor of 5, which allows the network to train with longer camera strides. This obtains better performance than the default setting, and we use this model for all SynSin evaluations. We found no improvement going beyond the 5× camera motion range. We also implemented an exhaustive search for desirable image pairs within a sequence to maximize the training data.

We also experimented with SynSin-iter to synthesize long videos by applying the aforementioned trained SynSin in an auto-regressive fashion at test time, but this performed worse than direct long-range synthesis.

In addition to this, we also consider the repeat variant. SynSin-repeat was implemented using a similar training setup, except that we also train SynSin to take its own output and produce the next view for T = 5 steps. Due to memory and engineering constraints, we were unable to fit SynSin-repeat with the original parameters into memory, so we did our best by reducing the batch size while keeping as faithful as possible to the original implementation. While this does not indicate that SynSin fails at perpetual view generation, it does suggest that certain approaches are better suited to solving this problem.

Figure 12. Additional Qualitative Comparisons. As in Figure 6 in the main paper, we show more qualitative view synthesis results (at t = 5, 10, 15, and 25) for SVG-LP, 3D Photos, MPI, SynSin, MPI-Iter, SynSin-Iter, MPI-Repeat, SynSin-Repeat, Ours no-repeat, and Ours, alongside the input. Notice how other methods produce artifacts such as stretched pixels (3D Photos, MPI) or incomplete outpainting (3D Photos, SynSin, Ours no-repeat), or fail to move the camera at all (SVG-LP). Furthermore, the iter and repeat variants do not improve results. Our approach generates realistic-looking images of zoomed-in views, which involves adding content and super-resolving stretched pixels.

Figure 13. Long Generation with Disparity. We show generation of a long sequence (frames 50 through 500) together with its corresponding disparity output. Our render-refine-repeat approach enables refinement of both geometry and RGB textures.
Figure 14. Geometric Grounding Ablation. Geometric grounding is used to explicitly ensure that the disparities produced by the refinement network match the geometry given by its input. We find this important, as otherwise subtle drift can cause the generated results to diverge quickly, as visible in Fig. 15.

C. Additional Analysis and Results

This section contains additional results and analysis to better understand Infinite Nature's behavior. In Fig. 12, we show additional view synthesis results given an input image across various baselines.

C.1. Limitations

As discussed in the main paper, our approach is essentially a memory-less Markov process that does not guarantee global consistency across multiple iterations. This manifests in two ways. First, on the geometry: when you look back, there is no guarantee that the same geometric structure that was observed in the past will be there. Second, there is also no global consistency enforced on the appearance: the appearance of the scene may change over a short range, such as a sunny coastline turning into a cloudy coastline after several iterations. Similarly, after hundreds of steps, two different input images may end up in scenes that have a similar stylistic appearance, although never exactly the same set of frames. Adding global memory to a system like ours and ensuring more control over what will happen in long-range synthesis is an exciting future direction.

C.2. Disparity Map

In addition to showing the RGB texture, we can also visualize the refined disparity to show the geometry. In Fig. 13, we show a long generation as well as its visualized disparity map in an unnormalized color scheme. Note that the disparity maps look plausible as well, because we train our discriminator over RGB and disparity concatenated. Please also see our results in the supplementary video.

C.3. Effect of Disabling Geometric Grounding

We use geometric grounding as a technique to avoid drift. In particular, we found that without this grounding, over a time period of many frames the render-refine-repeat loop gradually pushes disparity to very small (i.e. distant) values. Fig. 15 shows an example of this drifting disparity: the sequence begins plausibly, but before frame 150 is reached, the disparity (here shown unnormalized) has become very small. It is notable that once this happens, the RGB images then begin to deteriorate, drifting further away from the space of plausible scenes. Note that this is a test-time difference only: the results in Fig. 15 were generated using the same model checkpoint as our other results, but with geometric grounding disabled at test time. We show FID-50 results to quantitatively measure the impact of drifting in Fig. 14.

C.4. Results under Various Camera Motions

In addition to the demo, we also provide a quantitative experiment to measure how the model's quality changes with different kinds of camera motion over long trajectories. As described in Section A.7, our auto-pilot algorithm can be steered by adding an offset to the target look direction. We add a horizontal offset which varies sinusoidally, causing the camera to turn alternately left and right every 50 frames.
Figure 15. Geometric Grounding Ablation. An example of running our pretrained model on the task of long trajectory generation but without using geometric grounding. Disparity maps are shown using an unnormalized color scale. Although the output begins plausibly, by the 150th frame the disparity map has drifted very far away, and subsequently the RGB output drifts after the 175th frame.
Fig. 16 compares the FID-50 scores of sequences generated where the relative magnitude of this offset is 0.0 (no offset), 0.5 (gentle turns), and 1.0 (stronger turns), and visualizes the resulting camera trajectories, viewed from above. This experiment shows that our method is resilient to different turning camera motions, with FID-50 scores that remain comparable over long generation.
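The text does not give the exact offset schedule; a simple choice that changes sign every 50 frames, and that could be fed to an auto-pilot step such as the autopilot_step sketch in Section A.7, is the following (our illustration, with hypothetical names):

    import numpy as np

    def steering_offset(frame_index, magnitude=1.0, half_period=50):
        # magnitude 0.0, 0.5 and 1.0 correspond to no offset, gentle turns and stronger turns.
        return magnitude * np.sin(np.pi * frame_index / half_period)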

C.5. Generating Forward-Backwards Sequences


Because the Render-Refine-Repeat framework uses a memory-less representation to generate sequences, the appearance of content is not maintained across iterations. As a consequence, pixel content seen in one view is not guaranteed to be preserved later when seen again, particularly if it goes out of frame. We can observe such inconsistency by synthesizing forward camera motion followed by the same motion backwards (a palindromic camera trajectory), ending at the initial pose. While generating the forward sequence of frames, some of the content in the original input image will leave the field of view. Then, when synthesizing the backward motion, the model must regenerate this forgotten content anew, resulting in pixels that do not match the original input. Fig. 17 shows various input scenes generated for different lengths of forward-backward motion. The further the camera moves before returning to the initial position, the more content will leave the field of view, and so we find that the longer the palindromic sequence, the more the image generated upon returning to the initial pose will differ from the original input image.
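For reference, a palindromic trajectory of the kind used here simply appends the reversed pose list (omitting the duplicated endpoint). The helper below is our illustration, not code from the paper.

    def palindromic(poses):
        # e.g. [P0, P1, ..., Pn] -> [P0, ..., Pn, Pn-1, ..., P0], ending at the start pose.
        poses = list(poses)
        return poses + poses[-2::-1]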
Figure 16. FID with different camera motion (Forward Motion, Gentle Turns, Strong Turns). We consider different types of camera motion generated by our auto-pilot algorithm with different parameters and their effect on generated quality. Right: top-down view of three variations of camera motion that add different amounts of additional turning to the auto-pilot algorithm. Left: even with strongly turning camera motion, our auto-pilot algorithm is able to generate sequences whose quality is only slightly worse than our full model evaluated only on forward translations. The unlabeled points refer to reported baselines on FID-50 from the main paper. See Section C.4.

Figure 17. Palindromic Poses. Here we show Infinite Nature generated on palindromic sequences of poses of different lengths (0→5→0, 0→10→0, and 0→15→0, with the input at t = 0 and the forward-only frames 0→5, 0→10, 0→15 shown for reference). Because our model uses a memory-less representation, the forward-backward motion requires the model to hallucinate content it has previously seen but which has gone out of frame or been occluded, resulting in a generated image that does not match the original input.
