
Video-to-Video Synthesis

Ting-Chun Wang¹, Ming-Yu Liu¹, Jun-Yan Zhu², Guilin Liu¹, Andrew Tao¹, Jan Kautz¹, Bryan Catanzaro¹
¹NVIDIA, ²MIT CSAIL
{tingchunw,mingyul,guilinl,atao,jkautz,bcatanzaro}@nvidia.com, [email protected]

Abstract
We study the problem of video-to-video synthesis, whose goal is to learn a mapping
function from an input source video (e.g., a sequence of semantic segmentation
masks) to an output photorealistic video that precisely depicts the content of the
source video. While its image counterpart, the image-to-image translation problem,
is a popular topic, the video-to-video synthesis problem is less explored in the
literature. Without modeling temporal dynamics, directly applying existing image
synthesis approaches to an input video often results in temporally incoherent videos
of low visual quality. In this paper, we propose a video-to-video synthesis approach
under the generative adversarial learning framework. Through carefully-designed
generators and discriminators, coupled with a spatio-temporal adversarial objective,
we achieve high-resolution, photorealistic, temporally coherent video results on
a diverse set of input formats including segmentation masks, sketches, and poses.
Experiments on multiple benchmarks show the advantage of our method compared
to strong baselines. In particular, our model is capable of synthesizing 2K resolution
videos of street scenes up to 30 seconds long, which significantly advances the
state-of-the-art of video synthesis. Finally, we apply our method to future video
prediction, outperforming several competing systems. Code, models, and more
results are available at our website.

1 Introduction
The capability to model and recreate the dynamics of our visual world is essential to building
intelligent agents. Apart from purely scientific interests, learning to synthesize continuous visual
experiences has a wide range of applications in computer vision, robotics, and computer graphics.
For example, in model-based reinforcement learning [2, 24], a video synthesis model can be used to approximate the visual dynamics of the world so that an agent can be trained with less real experience
data. Using a learned video synthesis model, one can generate realistic videos without explicitly
specifying scene geometry, materials, lighting, and dynamics, which would be cumbersome but
necessary when using a standard graphics rendering engine [35].
The video synthesis problem exists in various forms, including future video prediction [15, 18, 42, 45,
50, 65, 68, 71, 77] and unconditional video synthesis [59, 67, 69]. In this paper, we study a new form:
video-to-video synthesis. At the core, we aim to learn a mapping function that can convert an input
video to an output video. To the best of our knowledge, a general-purpose solution to video-to-video
synthesis has not yet been explored by prior work, although its image counterpart, the image-to-image
translation problem, is a popular research topic [6, 31, 33, 43, 44, 63, 66, 73, 82, 83]. Our method is
inspired by previous application-specific video synthesis methods [58, 60, 61, 75].
We cast the video-to-video synthesis problem as a distribution matching problem, where the goal is
to train a model such that the conditional distribution of the synthesized videos given input videos
resembles that of real videos. To this end, we learn a conditional generative adversarial model [20]

Preprint. Work in progress.


Figure 1: Generating a photorealistic video from an input segmentation map video on Cityscapes.
Top left: input. Top right: pix2pixHD. Bottom left: COVST. Bottom right: vid2vid (ours). The
figure is best viewed with Acrobat Reader. Click the image to play the video clip.

given paired input and output videos. With carefully designed generators and discriminators, and a
new spatio-temporal learning objective, our method can learn to synthesize high-resolution, photore-
alistic, temporally coherent videos. Moreover, we extend our method to multimodal video synthesis.
Conditioning on the same input, our model can produce videos with diverse appearances.
We conduct extensive experiments on several datasets on the task of converting a sequence of
segmentation masks to photorealistic videos. Both quantitative and qualitative results indicate that
our synthesized videos look more photorealistic than those from strong baselines. See Figure 1
for an example. We further demonstrate that the proposed approach can generate photorealistic 2K
resolution videos, up to 30 seconds long. Our method also grants users flexible high-level control
over the video generation results. For example, a user can easily replace all the buildings with trees in
a street view video. In addition, our method works for other input video formats such as face sketches
and body poses, enabling many applications from face swapping to human motion transfer. Finally,
we extend our approach to future prediction and show that our method can outperform existing
systems. Please visit our website for code, models, and more results.

2 Related Work
Generative Adversarial Networks (GANs). We build our model on GANs [20]. During GAN
training, a generator and a discriminator play a zero-sum game. The generator aims to produce
realistic synthetic data so that the discriminator cannot differentiate between real and synthesized
data. In addition to noise distributions [14, 20, 55], various forms of data can be used as input to the
generator, including images [33, 43, 82], categorical labels [52, 53], and textual descriptions [56, 80].
Such conditional models are called conditional GANs, and allow flexible control over the output of
the model. Our method belongs to the category of conditional video generation with GANs. However,
instead of predicting future videos conditioning on the current observed frames [41, 50, 69], our
method synthesizes photorealistic videos conditioning on manipulable semantic representations, such
as segmentation masks, sketches, and poses.
Image-to-image translation algorithms translate an input image from one domain to a corresponding
image in another domain. There exists a large body of work for this problem [6, 31, 33, 43, 44, 63, 66,
73, 82, 83]. Our approach is their video counterpart. In addition to ensuring that each video frame
looks photorealistic, a video synthesis model also has to produce temporally coherent frames, which
is a challenging task, especially for long-duration videos.
Unconditional video synthesis. Recent work [59, 67, 69] extends the GAN framework for un-
conditional video synthesis, which learns a generator for converting a random vector to a video.

VGAN [69] uses a spatio-temporal convolutional network. TGAN [59] projects a latent code to a
set of latent image codes and uses an image generator to convert those latent image codes to frames.
MoCoGAN [67] disentangles the latent space to motion and content subspaces and uses a recurrent
neural network to generate a sequence of motion codes. Due to the unconditional setting, these
methods often produce low-resolution and short-length videos.
Future video prediction. Conditioning on the observed frames, video prediction models are trained
to predict future frames [15, 18, 36, 41, 42, 45, 50, 65, 68, 71, 72, 77]. Many of these models are trained
with image reconstruction losses, often producing blurry videos due to the classic regress-to-the-mean
problem. Also, they fail to generate long-duration videos even with adversarial training [42, 50]. The
video-to-video synthesis problem is substantially different because it does not attempt to predict
object motions or camera motions. Instead, our approach is conditional on an existing video and can
produce high-resolution and long-length videos in a different domain.
Video-to-video synthesis. While video super-resolution [61, 62], video matting and blending [3, 12],
and video inpainting [74] can be considered as special cases of the video-to-video synthesis problem,
existing approaches rely on problem-specific constraints and designs. Hence, these methods cannot
be easily applied to other applications. Video style transfer [10, 22, 28, 58], transferring the style of a
reference painting to a natural scene video, is also related. In Section 4, we show that our method
outperforms a strong baseline that combines a recent video style transfer with a state-of-the-art
image-to-image translation approach.

3 Video-to-Video Synthesis
Let $s_1^T \equiv \{s_1, s_2, \dots, s_T\}$ be a sequence of source video frames. For example, it can be a sequence of semantic segmentation masks or edge maps. Let $x_1^T \equiv \{x_1, x_2, \dots, x_T\}$ be the sequence of corresponding real video frames. The goal of video-to-video synthesis is to learn a mapping function that can convert $s_1^T$ to a sequence of output video frames, $\tilde{x}_1^T \equiv \{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_T\}$, so that the conditional distribution of $\tilde{x}_1^T$ given $s_1^T$ is identical to the conditional distribution of $x_1^T$ given $s_1^T$:
$$p(\tilde{x}_1^T \mid s_1^T) = p(x_1^T \mid s_1^T). \tag{1}$$
Through matching the conditional video distributions, the model learns to generate photorealistic,
temporally coherent output sequences as if they were captured by a video camera.
We propose a conditional GAN framework for this conditional video distribution matching task. Let $G$ be a generator that maps an input source sequence to a corresponding output frame sequence: $\tilde{x}_1^T = G(s_1^T)$. We train the generator by solving the minimax optimization problem given by
$$\max_D \min_G \; \mathbb{E}_{(x_1^T, s_1^T)}\big[\log D(x_1^T, s_1^T)\big] + \mathbb{E}_{s_1^T}\big[\log\big(1 - D(G(s_1^T), s_1^T)\big)\big], \tag{2}$$

where D is the discriminator. We note that solving (2) minimizes the Jensen-Shannon divergence between $p(\tilde{x}_1^T \mid s_1^T)$ and $p(x_1^T \mid s_1^T)$, as shown by Goodfellow et al. [20].
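For concreteness, a minimal PyTorch sketch of the objective in (2) is shown below, using the standard cross-entropy (non-saturating) form; `G`, `D`, and the tensor shapes are placeholders rather than the architectures proposed below, and the experiments in Section 4 actually use the least-squares variant [49].

```python
import torch
import torch.nn.functional as nnf

def gan_losses(G, D, real_video, source_video):
    """Cross-entropy form of the objective in Eq. (2).

    real_video, source_video: tensors of shape (B, T, C, H, W);
    G and D are placeholder modules, not the exact architectures of this paper.
    """
    fake_video = G(source_video)

    # Discriminator step: push D(x, s) toward 1 and D(G(s), s) toward 0.
    d_real = D(real_video, source_video)
    d_fake = D(fake_video.detach(), source_video)
    d_loss = (nnf.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + nnf.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator step: push D(G(s), s) toward 1 (non-saturating form of the min over G).
    g_fake = D(fake_video, source_video)
    g_loss = nnf.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss
```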
Solving the minimax optimization problem in (2) is a well-known, challenging task. Careful designs
of network architectures and objective functions are essential to achieve good performance as shown
in the literature [14, 21, 30, 37, 49, 51, 55, 73, 80]. We follow the same spirit and propose new network
designs and a spatio-temporal objective for video-to-video synthesis as detailed below.
Sequential generator. To simplify the video-to-video synthesis problem, we make a Markov assumption and factorize the conditional distribution $p(\tilde{x}_1^T \mid s_1^T)$ into a product form given by
$$p(\tilde{x}_1^T \mid s_1^T) = \prod_{t=1}^{T} p(\tilde{x}_t \mid \tilde{x}_{t-L}^{t-1}, s_{t-L}^{t}). \tag{3}$$
In other words, we assume the video frames can be generated sequentially, and the generation of the $t$-th frame $\tilde{x}_t$ only depends on three factors: 1) the current source frame $s_t$, 2) the past $L$ source frames $s_{t-L}^{t-1}$, and 3) the past $L$ generated frames $\tilde{x}_{t-L}^{t-1}$. We train a feed-forward network $F$ to model the conditional distribution $p(\tilde{x}_t \mid \tilde{x}_{t-L}^{t-1}, s_{t-L}^{t})$ via $\tilde{x}_t = F(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t})$. We obtain the final output $\tilde{x}_1^T$ by applying the function $F$ in a recursive manner. We found that a small $L$ (e.g., $L = 1$) causes training instability, while a large $L$ increases training time and GPU memory with minimal quality improvement. In our experiments, we set $L = 2$.
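As an illustration of this recursive generation (not the exact implementation), the loop below applies a placeholder per-frame network `F_net` with L = 2, padding the missing frames at the start of the sequence with zeros and repeated source maps:

```python
import torch

def synthesize_video(F_net, source_frames, L=2):
    """Recursively apply the per-frame generator F of Eq. (3).

    source_frames: list of T source tensors s_1..s_T, each of shape (B, C_s, H, W).
    F_net(prev_outputs, sources) is a placeholder for the network F, taking the
    last L generated frames and the source frames s_{t-L}..s_t.
    """
    B, _, H, W = source_frames[0].shape
    zero_img = source_frames[0].new_zeros(B, 3, H, W)  # stand-in for missing past frames
    outputs = []
    for t in range(len(source_frames)):
        prev_out = [outputs[i] if i >= 0 else zero_img for i in range(t - L, t)]
        prev_src = [source_frames[max(i, 0)] for i in range(t - L, t + 1)]
        outputs.append(F_net(prev_out, prev_src))
    return outputs
```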

Video signals contain a large amount of redundant information in consecutive frames. If the optical
flow [46] between consecutive frames is known, we can estimate the next frame by warping the
current frame [54, 70]. This estimation would be largely correct except for the occluded areas. Based
on this observation, we model F as
$$F(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t}) = (1 - \tilde{m}_t) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_t \odot \tilde{h}_t, \tag{4}$$
where $\odot$ is the element-wise product operator and $1$ is an image of all ones. The first part corresponds
to pixels warped from the previous frame, while the second part hallucinates new pixels. The
definitions of the other terms in Equation 4 are given below.
• $\tilde{w}_{t-1} = W(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t})$ is the estimated optical flow from $\tilde{x}_{t-1}$ to $\tilde{x}_t$, and W is the optical flow prediction network. We estimate the optical flow using both the input source images $s_{t-L}^{t}$ and the previously synthesized images $\tilde{x}_{t-L}^{t-1}$. By $\tilde{w}_{t-1}(\tilde{x}_{t-1})$, we denote the result of warping $\tilde{x}_{t-1}$ based on $\tilde{w}_{t-1}$.
• $\tilde{h}_t = H(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t})$ is the hallucinated image, synthesized directly by the generator H.
• $\tilde{m}_t = M(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t})$ is the occlusion mask with continuous values between 0 and 1, where M denotes the mask prediction network. Our occlusion mask is soft instead of binary to better handle the "zoom in" scenario. For example, when an object is moving closer to our camera, the object will become blurrier over time if we only warp previous frames. To increase the resolution of the object, we need to synthesize new texture details. By using a soft mask, we can add details by gradually blending the warped pixels and the newly synthesized pixels.
We use residual networks [26] for M, W, and H. To generate high-resolution videos, we adopt a coarse-to-fine generator design similar to the method of Wang et al. [73].
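To make the composition in Equation 4 concrete, the following is a minimal PyTorch sketch: the previous output is warped with the predicted flow via bilinear sampling and blended with the hallucinated frame using the soft occlusion mask. The pixel-unit flow convention and the helper names are assumptions for illustration, not the exact released implementation.

```python
import torch
import torch.nn.functional as nnf

def flow_warp(image, flow):
    """Backward-warp image (B,C,H,W) with flow (B,2,H,W) given in pixels."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=image.device),
                            torch.arange(W, device=image.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)      # (1,2,H,W), (x, y) order
    coords = base + flow
    # Normalize to [-1, 1] as expected by grid_sample, which takes a (B,H,W,2) grid.
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return nnf.grid_sample(image, grid, mode="bilinear", align_corners=True)

def compose_frame(prev_frame, flow, hallucinated, mask):
    """Eq. (4): soft blend of the warped previous frame and newly hallucinated pixels."""
    warped = flow_warp(prev_frame, flow)
    return (1.0 - mask) * warped + mask * hallucinated
```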
As using multiple discriminators can mitigate the mode collapse problem during GAN training [19, 67, 73], we design two types of discriminators, as detailed below.
Conditional image discriminator $D_I$. The purpose of $D_I$ is to ensure that each output frame resembles a real image given the same source image. This conditional discriminator should output 1 for a true pair $(x_t, s_t)$ and 0 for a fake one $(\tilde{x}_t, s_t)$.
Conditional video discriminator $D_V$. The purpose of $D_V$ is to ensure that consecutive output frames resemble the temporal dynamics of a real video given the same optical flow. While $D_I$ conditions on the source image, $D_V$ conditions on the flow. Let $w_{t-K}^{t-2}$ denote the $K-1$ optical flow maps for the $K$ consecutive real images $x_{t-K}^{t-1}$. This conditional discriminator $D_V$ should output 1 for a true pair $(x_{t-K}^{t-1}, w_{t-K}^{t-2})$ and 0 for a fake one $(\tilde{x}_{t-K}^{t-1}, w_{t-K}^{t-2})$.
We introduce two sampling operators to facilitate the discussion. First, let $\phi_I$ be a random image sampling operator such that $\phi_I(x_1^T, s_1^T) = (x_i, s_i)$, where $i$ is an integer uniformly sampled from 1 to $T$. In other words, $\phi_I$ randomly samples a pair of images from $(x_1^T, s_1^T)$. Second, we define $\phi_V$ as a sampling operator that randomly retrieves $K$ consecutive frames. Specifically, $\phi_V(w_1^{T-1}, x_1^T, s_1^T) = (w_{i-K}^{i-2}, x_{i-K}^{i-1}, s_{i-K}^{i-1})$, where $i$ is an integer uniformly sampled from $K+1$ to $T+1$. This operator retrieves $K$ consecutive frames and the corresponding $K-1$ optical flow maps. With $\phi_I$ and $\phi_V$, we are ready to present our learning objective function.
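A small sketch of these two operators, under the assumption that the frames and flow maps are stored in Python lists (0-based indexing, so `x[i]` corresponds to $x_{i+1}$ in the notation above):

```python
import random

def phi_I(x, s):
    """phi_I: draw a random aligned (frame, source) pair from x_1..x_T and s_1..s_T."""
    i = random.randrange(len(x))
    return x[i], s[i]

def phi_V(w, x, s, K):
    """phi_V: draw K consecutive frames plus the K-1 flow maps between them.

    w has length T-1, with w[t] the flow from frame t to frame t+1.
    """
    T = len(x)
    i = random.randrange(K, T + 1)         # the window covers frames i-K .. i-1
    return w[i - K:i - 1], x[i - K:i], s[i - K:i]
```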
Learning objective function. We train the sequential video synthesis function F by solving

$$\min_F \Big( \max_{D_I} \mathcal{L}_I(F, D_I) + \max_{D_V} \mathcal{L}_V(F, D_V) + \lambda_W \mathcal{L}_W(F) \Big), \tag{5}$$

where LI is the GAN loss on images defined by the conditional image discriminator DI , LV is the
GAN loss on K consecutive frames defined by DV , and LW is the flow estimation loss. The weight
λW is set to 10 throughout the experiments based on a grid search. In addition to the loss terms
in Equation 5, we use the discriminator feature matching loss [40, 73] and VGG feature matching
loss [16, 34, 73] as they improve the convergence speed and training stability [73]. Please see the
appendix for more details.
We further define the image-conditional GAN loss $\mathcal{L}_I$ [33] using the operator $\phi_I$:
$$\mathbb{E}_{\phi_I(x_1^T, s_1^T)}\big[\log D_I(x_i, s_i)\big] + \mathbb{E}_{\phi_I(\tilde{x}_1^T, s_1^T)}\big[\log\big(1 - D_I(\tilde{x}_i, s_i)\big)\big]. \tag{6}$$
Similarly, the video GAN loss $\mathcal{L}_V$ is given by
$$\mathbb{E}_{\phi_V(w_1^{T-1}, x_1^T, s_1^T)}\big[\log D_V(x_{i-K}^{i-1}, w_{i-K}^{i-2})\big] + \mathbb{E}_{\phi_V(w_1^{T-1}, \tilde{x}_1^T, s_1^T)}\big[\log\big(1 - D_V(\tilde{x}_{i-K}^{i-1}, w_{i-K}^{i-2})\big)\big]. \tag{7}$$

Recall that we synthesize a video $\tilde{x}_1^T$ by recursively applying F.
The flow loss $\mathcal{L}_W$ includes two terms. The first is the endpoint error between the ground truth and the estimated flow, and the second is the warping loss when the flow warps the previous frame to the next frame. Let $w_t$ be the ground truth flow from $x_t$ to $x_{t+1}$. The flow loss is given by
$$\mathcal{L}_W = \frac{1}{T-1} \sum_{t=1}^{T-1} \Big( \|\tilde{w}_t - w_t\|_1 + \|\tilde{w}_t(x_t) - x_{t+1}\|_1 \Big). \tag{8}$$
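A minimal sketch of this flow loss, reusing the `flow_warp` helper sketched after Equation 4 and assuming Python lists of per-frame tensors (in our experiments, the reference flow $w_t$ comes from FlowNet2 [32]):

```python
import torch

def flow_loss(pred_flows, gt_flows, real_frames):
    """Eq. (8): L1 endpoint error plus L1 warping error, averaged over the T-1 steps.

    pred_flows[t], gt_flows[t]: flow from frame t to t+1, shape (B,2,H,W);
    real_frames[t]: real frame x_t, shape (B,3,H,W). flow_warp is the helper above.
    """
    total = 0.0
    for t in range(len(pred_flows)):
        epe = torch.mean(torch.abs(pred_flows[t] - gt_flows[t]))
        warp = torch.mean(torch.abs(flow_warp(real_frames[t], pred_flows[t]) - real_frames[t + 1]))
        total = total + epe + warp
    return total / len(pred_flows)
```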

Foreground-background prior. When using semantic segmentation masks as the source video, we
can divide an image into foreground and background areas based on the semantics. For example,
buildings and roads belong to the background, while cars and pedestrians are considered as the
foreground. We leverage this strong foreground-background prior in the generator design to further
improve the synthesis performance of the proposed model.
In particular, we decompose the image hallucination network H into a foreground model $\tilde{h}_{F,t} = H_F(s_{t-L}^{t})$ and a background model $\tilde{h}_{B,t} = H_B(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t})$. We note that background motion can generally be modeled as a global transformation, for which optical flow can be estimated quite accurately. As a result, the background region can be generated accurately via warping, and the background hallucination network $H_B$ only needs to synthesize the occluded areas. On the other hand, a foreground object often has large motion and only occupies a small portion of the image, which makes optical flow estimation difficult. The network $H_F$ has to synthesize most of the foreground content from scratch. With this foreground-background prior, F is then given by
$$F(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t}) = (1 - \tilde{m}_t) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_t \odot \big( (1 - m_{B,t}) \odot \tilde{h}_{F,t} + m_{B,t} \odot \tilde{h}_{B,t} \big), \tag{9}$$
where $m_{B,t}$ is the background mask derived from the ground truth segmentation mask $s_t$. This prior improves the visual quality by a large margin at the cost of minor flickering artifacts. In Table 2, our user study shows that most people prefer the results with foreground-background modeling.
Multimodal synthesis. The synthesis network F is a unimodal mapping function. Given an input
source video, it can only generate one output video. To achieve multimodal synthesis [19, 73, 83], we
adopt a feature embedding scheme [73] for the source video that consists of instance-level semantic
segmentation masks. Specifically, at training time, we train an image encoder E to encode the ground
truth real image xt into a d-dimensional feature map (d = 3 in our experiments). We then apply
an instance-wise average pooling to the map so that all the pixels within the same object share the same feature vector. We then feed both the instance-wise averaged feature map zt and the input semantic segmentation mask st to the generator F. Once training is done, we fit a mixture of Gaussians to the feature vectors that belong to the same object class. At test time, we sample a
feature vector for each object instance using the estimated distribution of that object class. Given
different feature vectors, the generator F can synthesize videos with different visual appearances.
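As a sketch of the instance-wise average pooling step (per-image, with an integer instance-ID map; shapes and names are illustrative, not the released implementation):

```python
import torch

def instance_average_pooling(feat, inst_map):
    """Replace every pixel's feature with the mean feature of its object instance.

    feat: (d, H, W) output of the image encoder E; inst_map: (H, W) integer instance IDs.
    """
    pooled = feat.clone()
    for inst_id in inst_map.unique():
        mask = inst_map == inst_id                              # (H, W) boolean mask
        mean_vec = feat[:, mask].mean(dim=1)                    # (d,) mean over the instance
        pooled[:, mask] = mean_vec.unsqueeze(1).expand(feat.size(0), int(mask.sum()))
    return pooled
```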

4 Experiments

Implementation details. We train our network in a spatio-temporally progressive manner. In particular, we start by generating low-resolution videos with few frames and progressively scale up to full-resolution videos with 30 (or more) frames. Our coarse-to-fine generator consists of
three scales: 512 × 256, 1024 × 512, and 2048 × 1024 resolutions, respectively. The mask prediction
network M and flow prediction network W share all the weights except for the output layer. We
use the multi-scale PatchGAN discriminator architecture [33, 73] for the image discriminator DI . In
addition to multi-scale in the spatial resolution, our multi-scale video discriminator DV also looks
at different frame rates of the video to ensure both short-term and long-term consistency. See the
appendix for more details.
We train our model for 40 epochs using the ADAM optimizer [39] with lr = 0.0002 and (β1 , β2 ) =
(0.5, 0.999) on an NVIDIA DGX1 machine. We use the LSGAN loss [49]. Due to the high image
resolution, even with one short video per batch, we have to use all the GPUs in DGX1 (8 V100 GPUs,
each with 16GB memory) for training. We distribute the generator computation task to 4 GPUs and
the discriminator task to the other 4 GPUs. Training takes ∼ 10 days for 2K resolution.
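As a rough sketch of this training setup (the module names `generator` and `discriminators` are placeholders standing in for F and for the pair of discriminators, not the released code):

```python
import torch

def lsgan_loss(pred, is_real):
    """Least-squares GAN loss [49]: real predictions are pushed toward 1, fake toward 0."""
    target = torch.ones_like(pred) if is_real else torch.zeros_like(pred)
    return torch.mean((pred - target) ** 2)

def make_optimizers(generator, discriminators):
    """Adam with the hyper-parameters stated above; the modules are placeholders."""
    opt_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(discriminators.parameters(), lr=2e-4, betas=(0.5, 0.999))
    return opt_G, opt_D
```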
Datasets. We evaluate the proposed approach on several datasets.

Table 1: Comparison between competing video-to-video synthesis approaches on Cityscapes.

Fréchet Inception Distance    I3D    ResNeXt
pix2pixHD                     5.57   0.18
COVST                         5.55   0.18
vid2vid (ours)                4.66   0.15

Human Preference Score        short seq.    long seq.
vid2vid (ours) / pix2pixHD    0.87 / 0.13   0.83 / 0.17
vid2vid (ours) / COVST        0.84 / 0.16   0.80 / 0.20

Table 2: Ablation study. We compare the proposed approach to its three variants.
Human Preference Score
vid2vid (ours) / no background–foreground prior 0.80 / 0.20
vid2vid (ours) / no conditional video discriminator 0.84 / 0.16
vid2vid (ours) / no flow warping 0.67 / 0.33

Table 3: Comparison between future video prediction methods on Cityscapes.

Fréchet Inception Distance    I3D     ResNeXt
PredNet                       11.18   0.59
MCNet                         10.00   0.43
vid2vid (ours)                3.44    0.18

Human Preference Score
vid2vid (ours) / PredNet      0.92 / 0.08
vid2vid (ours) / MCNet        0.98 / 0.02

• Cityscapes [13]. The dataset consists of 2048 × 1024 street scene videos captured in several
German cities. Only a subset of images in the videos contains ground truth semantic segmentation
masks. To obtain the input source videos, we use those images to train a DeepLabV3 semantic
segmentation network [11] and apply the trained network to segment all the videos. We use
the optical flow extracted by FlowNet2 [32] as the ground truth flow w. We treat the instance
segmentation masks computed by Mask R-CNN [25] as our instance-level ground truth. In
summary, the training set contains 2975 videos, each with 30 frames. The validation set consists
of 500 videos, each with 30 frames. Finally, we test our method on three long sequences from
the Cityscapes demo videos, with 600, 1100, and 1200 frames, respectively. We will show that
although trained on short videos, our model can synthesize long videos.
• Apolloscape [29] consists of 73 street scene videos captured in Beijing, whose lengths vary from 100 to 1000 frames. Similar to Cityscapes, Apolloscape was constructed for the image/video semantic segmentation task, but we use it to synthesize videos from semantic segmentation masks. We split the dataset in half for training and validation.
• Face video dataset [57]. We use the real videos in the FaceForensics dataset, which contains 854 videos of news briefings from different reporters. We use this dataset for the sketch-to-face video synthesis task. To extract a sequence of sketches from a video, we first apply a face
alignment algorithm [38] to localize facial landmarks in each frame. The facial landmarks are
then connected to create the face sketch. For background, we extract Canny edges outside the
face regions. We split the dataset into 704 videos for training and 150 videos for validation.
• Dance video dataset. We download YouTube dance videos for the pose-to-human-motion synthesis task. Each video is about 3 to 4 minutes long at 1280 × 720 resolution, and we crop the central 512 × 720 region. We extract human poses with DensePose [23] and OpenPose [7], and
directly concatenate the results together. Each training set includes a dance video from a single
dancer, while the test set contains videos of other dance motions or from other dancers.
Baselines. We compare our approach to two baselines trained on the same data.
• pix2pixHD [73] is the state-of-the-art image-to-image translation approach. When applying the
approach to the video-to-video synthesis task, we process input videos frame-by-frame.
• COVST is built on the coherent video style transfer [10] by replacing the stylization network with
pix2pixHD. The key idea in COVST is to warp high-level deep features using optical flow for
achieving temporally coherent outputs. No additional adversarial training is applied. We feed in
ground truth optical flow to COVST, which is impractical for real applications. In contrast, our
model estimates optical flow from source videos.
Evaluation metrics. We use both subjective and objective metrics for evaluation.
• Human preference score. We perform a human subjective test for evaluating the visual quality
of synthesized videos. We use the Amazon Mechanical Turk (AMT) platform. During each

Figure 2: Apolloscape results. Left: pix2pixHD. Center: COVST. Right: proposed. The input
semantic segmentation mask video is shown in the left video. The figure is best viewed with Acrobat
Reader. Click the image to play the video clip.

Figure 3: Example multi-modal video synthesis results. These synthesized videos contain different
road surfaces. The figure is best viewed with Acrobat Reader. Click the image to play the video clip.

Figure 4: Example results of changing input semantic segmentation masks to generate diverse videos.
Left: tree→building. Right: building→tree. The original video is shown in Figure 3. The figure is
best viewed with Acrobat Reader. Click the image to play the video clip.
test, an AMT participant is first shown two videos at a time (results synthesized by two different
algorithms) and then asked which one looks more like a video captured by a real camera. We
specifically ask the worker to check for both temporal coherence and image quality. A worker
must have a lifetime task approval rate greater than 98% to participate in the evaluation. For each
question, we gather answers from 10 different workers. We evaluate the algorithm by the ratio
that the algorithm outputs are preferred.
• Fréchet Inception Distance (FID) [27] is a widely used metric for implicit generative models, as
it correlates well with the visual quality of generated samples. The FID was originally developed
for evaluating image generation. We propose a variant for video evaluation, which measures both
visual quality and temporal consistency. Specifically, we use a pre-trained video recognition CNN
as a feature extractor after removing the last few layers from the network. This feature extractor
will be our “inception” network. For each video, we extract a spatio-temporal feature map with
this CNN. We then compute the mean $\tilde{\mu}$ and covariance matrix $\tilde{\Sigma}$ of the feature vectors from all the synthesized videos. We also calculate the same quantities $\mu$ and $\Sigma$ for the ground truth videos. The FID is then calculated as $\|\mu - \tilde{\mu}\|^2 + \mathrm{Tr}\big(\Sigma + \tilde{\Sigma} - 2\sqrt{\Sigma\tilde{\Sigma}}\big)$. We use two different pre-trained video recognition CNNs in our evaluation: I3D [8] and ResNeXt [76].
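As a sketch of this distance computation, assuming the spatio-temporal features have already been extracted with the truncated video CNN and stacked into NumPy arrays of shape (N, d); the feature extraction itself is omitted:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID between two sets of feature vectors, each an (N, d) NumPy array."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r.dot(sigma_f), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real        # drop small imaginary parts from the matrix square root
    diff = mu_r - mu_f
    return diff.dot(diff) + np.trace(sigma_r + sigma_f - 2.0 * covmean)
```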

Main results. We compare the proposed approach to the baselines on the Cityscapes benchmark,
where we apply the learned models to synthesize 500 short video clips in the validation set. As shown
in Table 1, our results have a smaller FID and are often favored by the human subjects. We also
report the human preference scores on the three long test videos. Again, the videos rendered by our
approach are considered more realistic by the human subjects. The human preference scores for the
Apolloscape dataset are given in the appendix.

Figure 5: Example face→sketch→face results. Each set shows the original video, the extracted edges,
and our synthesized video. The figure is best viewed with Acrobat Reader. Click the image to play the
video clip.

Figure 6: Example dance→pose→dance results. Each set shows the original dancer, the extracted
poses, and the synthesized video. The figure is best viewed with Acrobat Reader. Click the image to
play the video clip.
Figures 1 and 2 show the video synthesis results. Although each frame rendered by pix2pixHD is
photorealistic, the resulting video lacks temporal coherence. The road lane markings and building
appearances are inconsistent across frames. While improving upon pix2pixHD, COVST still suffers
from temporal inconsistency. On the contrary, our approach produces a high-resolution, photorealistic,
temporally consistent video output. We can also generate 30-second-long videos, showing that our approach scales to longer durations.
We conduct an ablation study to analyze several design choices of our method. Specifically, we
create three variants. In one variant, we do not use the foreground-background prior, which is termed
no background–foreground prior. That is, instead of using Equation 9, we use Equation 4.
The second variant is no conditional video discriminator where we do not use the video
discriminator DV for training. In the last variant, we remove the optical flow prediction network
W and the mask prediction network M from the generator F in Equation 4 and only use H for
synthesis. This variant is referred to as no flow warping. We use the human preference score on
Cityscapes for this ablation study. Table 2 shows that the visual quality of output videos degrades
significantly without the ablated components. To evaluate the effectiveness of different components
in our network, we also experimented with directly using ground truth flows instead of the flows estimated by our network. We found the results visually similar, which suggests that our network is robust to errors in the estimated flows.
Multimodal results. Figure 3 shows example multimodal synthesis results. In this example, we
keep the sampled feature vectors of all the object instances in the video the same except for the road
instance. The figure shows temporally smooth videos with different road appearances.
Semantic manipulation. Our approach also allows the user to manipulate the semantics of source
videos. In Figure 4, we show an example of changing the semantic labels. In the left video, we
replace all trees with buildings in the original segmentation masks and synthesize a new video. On
the right, we show the result of replacing buildings with trees.
Sketch-to-video synthesis for face swapping. We train a sketch-to-face video synthesis model
using the real face videos in the FaceForensics dataset [57]. As shown in Figure 5, our model can
convert sequences of sketches to photorealistic output videos. This model can be used to change the
facial appearance of the original face videos [5].
Pose-to-video synthesis for human motion transfer. We also apply our method to the task of
converting sequences of human poses to photorealistic output videos. We note that the image
counterpart was studied in recent works [4, 17, 47, 48]. As shown in Figure 6, our model learns to
synthesize high-resolution photorealistic output dance videos that contain unseen body shapes and
motions. Our method can change the clothing [79, 81] for the same dancer (Figure 6 left) as well as
transfer the visual appearance to new dancers (Figure 6 right) as explored in concurrent work [1,9,78].

Figure 7: Future video prediction results. Top left: ground truth. Top right: PredNet [45]. Bottom
left: MCNet [68]. Bottom right: ours. The figure is best viewed with Acrobat Reader. Click the image
to play the video clip.

Future video prediction. We show an extension of our approach to the future video prediction
task: learning to predict the future video given a few observed frames. We decompose the task
into two sub-tasks: 1) synthesizing future semantic segmentation masks using the observed frames,
and 2) converting the synthesized segmentation masks into videos. In practice, after extracting the
segmentation masks from the observed frames, we train a generator to predict future semantic masks.
We then use the proposed video-to-video synthesis approach to convert the predicted segmentation
masks to a future video.
We conduct both quantitative and qualitative evaluations with comparisons to two state-of-the-art
approaches: PredNet [45] and MCNet [68]. We follow the prior work [41, 70] and report the human
preference score. We also include the FID scores. As shown in Table 3, our model produces smaller
FIDs, and the human subjects favor our resulting videos. In Figure 7, we visualize the future video
synthesis results. While the image quality of the results from the competing algorithms degrades
significantly over time, ours remains consistent.

5 Discussion
We present a general video-to-video synthesis framework based on conditional GANs. Through
carefully-designed generators and discriminators as well as a spatio-temporal adversarial objective,
we can synthesize high-resolution, photorealistic, and temporally consistent videos. Extensive
experiments demonstrate that our results are significantly better than the results by state-of-the-art
methods. Our method also compares favorably against the competing video prediction methods.
Although our approach outperforms previous methods, our model still fails in a couple of situations.
For example, our model struggles to synthesize turning cars due to insufficient information in
label maps. This could be potentially addressed by adding additional 3D cues, such as depth maps.
Furthermore, our model still cannot guarantee that an object has a consistent appearance across
the whole video. Occasionally, a car may change its color gradually. This issue might be alleviated
if object tracking information is used to enforce that the same object shares the same appearance
throughout the entire video. Finally, when we perform semantic manipulations such as turning trees
into buildings, visible artifacts occasionally appear because buildings and trees have different label shapes.
This might be resolved if we train our model with coarser semantic labels, as the trained model would
be less sensitive to label shapes.
Acknowledgements We thank Karan Sapra, Fitsum Reda, and Matthieu Le for generating the
segmentation maps for us. We also thank Lisa Rhee and Miss Ketsuki for allowing us to use their
dance videos for training. We thank William S. Peebles for proofreading the paper.

References
[1] K. Aberman, M. Shi, J. Liao, D. Lischinski, B. Chen, and D. Cohen-Or. Deep video-based performance
cloning. arXiv preprint arXiv:1808.06847, 2018.
[2] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. Deep reinforcement learning: A brief
survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.
[3] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video snapcut: robust video object cutout using localized
classifiers. ACM Transactions on Graphics (TOG), 28(3):70, 2009.
[4] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. Synthesizing images of humans in unseen
poses. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] D. Bitouk, N. Kumar, S. Dhillon, P. Belhumeur, and S. K. Nayar. Face swapping: automatically replacing
faces in photographs. In ACM SIGGRAPH, 2008.
[6] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain
adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017.
[7] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity
fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[8] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. In European Conference on
Computer Vision (ECCV) Workshop, 2018.
[10] D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua. Coherent online video style transfer. In IEEE International
Conference on Computer Vision (ICCV), 2017.
[11] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image
segmentation. arXiv preprint arXiv:1706.05587, 2017.
[12] T. Chen, J.-Y. Zhu, A. Shamir, and S.-M. Hu. Motion-aware gradient domain video composition. IEEE
Trans. Image Processing, 22(7):2532–2544, 2013.
[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and
B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2016.
[14] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid
of adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[15] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In
Advances in Neural Information Processing Systems (NIPS), 2017.
[16] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks.
In Advances in Neural Information Processing Systems (NIPS), 2016.
[17] P. Esser, E. Sutter, and B. Ommer. A variational u-net for conditional appearance and shape generation. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[18] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video
prediction. In Advances in Neural Information Processing Systems (NIPS), 2016.
[19] A. Ghosh, V. Kulharia, V. Namboodiri, P. H. Torr, and P. K. Dokania. Multi-agent diverse generative
adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.
Generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein
GANs. In Advances in Neural Information Processing Systems (NIPS), 2017.
[22] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei. Characterizing and improving stability in neural style
transfer. In IEEE International Conference on Computer Vision (ICCV), 2017.
[23] R. A. Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the wild. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[24] D. Ha and J. Schmidhuber. World models. In Advances in Neural Information Processing Systems (NIPS),
2018.
[25] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on
Computer Vision (ICCV), 2017.
[26] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2016.
[27] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale
update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems
(NIPS), 2017.
[28] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu. Real-time neural style transfer
for videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[29] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The ApolloScape dataset
for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[30] X. Huang, Y. Li, O. Poursaeed, J. E. Hopcroft, and S. J. Belongie. Stacked generative adversarial networks.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[31] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision (ECCV), 2018.
[32] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical
flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2017.
[33] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial
networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[34] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In
European Conference on Computer Vision (ECCV), 2016.
[35] J. T. Kajiya. The rendering equation. In ACM SIGGRAPH, 1986.
[36] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu.
Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
[37] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability,
and variation. In International Conference on Learning Representations (ICLR), 2018.
[38] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 2009.
[39] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on
Learning Representations (ICLR), 2015.
[40] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a
learned similarity metric. In International Conference on Machine Learning (ICML), 2016.
[41] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction.
arXiv preprint arXiv:1804.01523, 2018.
[42] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion GAN for future-flow embedded video prediction.
In Advances in Neural Information Processing Systems (NIPS), 2017.
[43] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in
Neural Information Processing Systems (NIPS), 2017.
[44] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information
Processing Systems (NIPS), 2016.
[45] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised
learning. In International Conference on Learning Representations (ICLR), 2017.
[46] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision.
International Joint Conference on Artificial Intelligence (IJCAI), 1981.
[47] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation.
In Advances in Neural Information Processing Systems (NIPS), 2017.
[48] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz. Disentangled person image generation.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[49] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial
networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[50] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In
International Conference on Learning Representations (ICLR), 2016.
[51] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial
networks. In International Conference on Learning Representations (ICLR), 2018.
[52] T. Miyato and M. Koyama. cGANs with projection discriminator. In International Conference on Learning
Representations (ICLR), 2018.
[53] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In
International Conference on Machine Learning (ICML), 2017.
[54] K. Ohnishi, S. Yamamoto, Y. Ushiku, and T. Harada. Hierarchical video generation from orthogonal
information: Optical flow and texture. In AAAI, 2018.
[55] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional
generative adversarial networks. In International Conference on Learning Representations (ICLR), 2015.
[56] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image
synthesis. In International Conference on Machine Learning (ICML), 2016.
[57] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics: A large-scale
video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018.
[58] M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transfer for videos. In German Conference on Pattern
Recognition, 2016.
[59] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping.
In IEEE International Conference on Computer Vision (ICCV), 2017.
[60] A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa. Video textures. ACM Transactions on Graphics (TOG),
2000.

[61] E. Shechtman, Y. Caspi, and M. Irani. Space-time super-resolution. IEEE Transactions on Pattern Analysis
and Machine Intelligence (TPAMI), 27(4):531–545, 2005.
[62] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time
single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[63] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and
unsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017.
[64] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
[65] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using
lstms. In International Conference on Machine Learning (ICML), 2015.
[66] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In International
Conference on Learning Representations (ICLR), 2017.
[67] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video
generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[68] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video
sequence prediction. In International Conference on Learning Representations (ICLR), 2017.
[69] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in
Neural Information Processing Systems (NIPS), 2016.
[70] C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2017.
[71] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using
variational autoencoders. In European Conference on Computer Vision (ECCV), 2016.
[72] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose
futures. In IEEE International Conference on Computer Vision (ICCV), 2017.
[73] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis
and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2018.
[74] Y. Wexler, E. Shechtman, and M. Irani. Space-time video completion. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2004.
[75] Y. Wexler, E. Shechtman, and M. Irani. Space-time completion of video. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 29(3), 2007.
[76] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural
networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[77] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via
cross convolutional networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[78] C. Yang, Z. Wang, X. Zhu, C. Huang, J. Shi, and D. Lin. Pose guided human video generation. In European
Conference on Computer Vision (ECCV), 2018.
[79] S. Yang, T. Ambert, Z. Pan, K. Wang, L. Yu, T. Berg, and M. C. Lin. Detailed garment recovery from a
single-view image. arXiv preprint arXiv:1608.01250, 2016.
[80] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic
image synthesis with stacked generative adversarial networks. In IEEE International Conference on
Computer Vision (ICCV), 2017.
[81] Z.-H. Zheng, H.-T. Zhang, F.-L. Zhang, and T.-J. Mu. Image-based clothes changing system. Computational
Visual Media, 3(4):337–347, 2017.
[82] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent
adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[83] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal
image-to-image translation. In Advances in Neural Information Processing Systems (NIPS), 2017.

Figure 8: The network architecture (G1) for low-resolution videos. The network takes in a number of semantic label maps and previously generated images, and outputs the intermediate frame as well as the flow map and the mask.

Figure 9: The network architecture (G2) for higher-resolution videos. The label maps and previous frames are downsampled and fed into the low-resolution network G1. Then, the features from the high-resolution network and the last layer of the low-resolution network are summed and fed into another series of residual blocks to output the final images.

A Network Architecture

A.1 Generators

Our network adopts a coarse-to-fine architecture. At the lowest resolution, the network takes in a number of semantic label maps $s_{t-L}^{t}$ and previously generated frames $\tilde{x}_{t-L}^{t-1}$ as input. The label maps are concatenated and passed through several residual blocks to form intermediate high-level features. We apply the same processing to the previously generated images. These two intermediate feature maps are then added and fed into two separate residual networks to output the hallucinated image $\tilde{h}_t$ as well as the flow map $\tilde{w}_t$ and the mask $\tilde{m}_t$ (Figure 8).
Next, to build from low-resolution results to higher-resolution results, we use another network G2 on top of the low-resolution network G1 (Figure 9). In particular, we first downsample the inputs and feed them into G1. Then, we extract features from the last feature layer of G1 and add them to the intermediate feature layer of G2. These summed features are then fed into another series of residual blocks to output the higher-resolution images.
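A schematic sketch of this composition follows; the module boundaries, names, and the downsampling call are illustrative assumptions rather than the exact implementation:

```python
import torch
import torch.nn.functional as nnf

def coarse_to_fine_forward(G1, G2_front, G2_back, inputs):
    """Sketch of the G1 / G2 composition in Figure 9.

    G1 returns (coarse_output, last_feature_map); G2_front maps full-resolution
    inputs to intermediate features; G2_back maps the fused features to the final
    frame. All three are placeholder modules.
    """
    low_res = nnf.interpolate(inputs, scale_factor=0.5, mode="bilinear", align_corners=False)
    _, g1_feat = G1(low_res)
    g2_feat = G2_front(inputs)
    # Sum the last G1 feature map with G2's intermediate features (resized if needed).
    g1_feat = nnf.interpolate(g1_feat, size=g2_feat.shape[-2:], mode="bilinear", align_corners=False)
    return G2_back(g2_feat + g1_feat)
```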

A.2 Discriminators

For our image discriminator DI, we adopt the multi-scale PatchGAN architecture [33, 73]. We also design a temporally multi-scale video discriminator DV by downsampling the frame rates of the real/generated videos. At the finest scale, the discriminator takes K consecutive frames of the original sequence as input. At the next scale, we subsample the video by a factor of K (i.e., skipping every K − 1 intermediate frames), and the discriminator takes K consecutive frames of this new sequence as input. We do this for up to three scales in our implementation and find that it helps ensure both short-term and long-term consistency. Note that, like DI, DV is also multi-scale in the spatial domain.
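The frame-rate subsampling can be sketched as follows (the clip selection here is simplified; in practice the temporal window is sampled randomly, as with $\phi_V$ in the main text):

```python
def temporal_pyramid(frames, K, num_scales=3):
    """Build K-frame clips at progressively coarser frame rates for the video discriminator.

    At scale n the video is subsampled by a factor of K**n, so the same K-frame clip
    spans a longer time window; frames is a list of per-frame tensors.
    """
    clips = []
    for n in range(num_scales):
        stride = K ** n
        subsampled = frames[::stride]
        if len(subsampled) >= K:
            clips.append(subsampled[:K])       # K consecutive frames of the subsampled sequence
    return clips
```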

A.3 Feature matching loss

In our learning objective function, we also add a VGG feature matching loss and a discriminator feature matching loss to improve training stability. For the VGG feature matching loss, we use the VGG network [64] as a feature extractor and minimize the L1 loss between the features extracted from the real and the generated images. In particular, we add $\sum_i \frac{1}{P_i} \big\|\psi^{(i)}(x) - \psi^{(i)}(G(s))\big\|_1$ to our objective, where $\psi^{(i)}$ denotes the $i$-th layer of the VGG network, containing $P_i$ elements. Similarly, we adopt the discriminator feature matching loss to match the statistics of the features extracted by the GAN discriminators. We use both the image discriminator $D_I$ and the video discriminator $D_V$.
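A minimal sketch of the VGG term, assuming a torchvision VGG19 (recent torchvision) as the feature extractor; the layer cut-points below are illustrative choices rather than the ones used in our experiments:

```python
import torch
import torchvision

class VGGFeatureMatchingLoss(torch.nn.Module):
    """L1 match between VGG19 features of real and generated images; mean() divides by P_i."""

    def __init__(self, layer_ids=(2, 7, 12, 21, 30)):   # illustrative cut-points
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)              # the feature extractor is kept frozen
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, real, fake):
        loss, h_real, h_fake = 0.0, real, fake
        for i, layer in enumerate(self.vgg):
            h_real, h_fake = layer(h_real), layer(h_fake)
            if i in self.layer_ids:
                loss = loss + torch.mean(torch.abs(h_real - h_fake))
        return loss
```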

B Evaluation for the Apolloscape Dataset


We provide both the FID and the human preference score on the Apolloscape dataset. Under both metrics, our method outperforms the baselines.

Table 4: Comparison between competing video-to-video synthesis approaches on Apolloscape.

Fréchet Inception Distance    I3D    ResNeXt
pix2pixHD                     2.33   0.128
COVST                         2.36   0.128
vid2vid (ours)                2.24   0.125

Human Preference Score
vid2vid (ours) / pix2pixHD    0.61 / 0.39
vid2vid (ours) / COVST        0.59 / 0.41

