
Fast-Vid2Vid: Spatial-Temporal Compression for

Video-to-Video Synthesis

Long Zhuo¹, Guangcong Wang², Shikai Li³, Wayne Wu¹,³, and Ziwei Liu²⋆
¹ Shanghai AI Laboratory
² S-Lab, Nanyang Technological University
³ SenseTime Research
⋆ Corresponding author

[email protected] {guangcong.wang, ziwei.liu}@ntu.edu.sg


[email protected] [email protected]

[Fig. 1 panels — Segmentation2City: Inputs; Vid2Vid (MACs: 1254G, FPS: 4.27); Fast-Vid2Vid (MACs: 151G (8.3×), FPS: 24.77 (5.8×)). Sketch2Face: Inputs; Vid2Vid (MACs: 2066G, FPS: 2.77); Fast-Vid2Vid (MACs: 282G (8.1×), FPS: 16.81 (6.1×)). Pose2Body: Inputs; Vid2Vid (MACs: 1769G, FPS: 3.01); Fast-Vid2Vid (MACs: 191G (9.3×), FPS: 21.39 (7.1×)).]

Fig. 1: Fast-Vid2Vid. Our proposed Fast-Vid2Vid accelerates the video-to-video synthesis and generates photo-realistic videos more efficiently compared to the original vid2vid model. On standard benchmarks, Fast-Vid2Vid achieves 16.81-24.77 FPS and saves 8.1-9.3× computational cost in the tasks of Sketch2Face, Segmentation2City and Pose2Body.

Abstract. Video-to-video synthesis (Vid2Vid) has achieved remarkable results in generating a photo-realistic video from a sequence of semantic maps. However, this pipeline suffers from high computational cost and long inference latency, which largely depend on two essential factors: 1) network architecture parameters, 2) sequential data stream. Recently, the parameters of image-based generative models have been significantly compressed via more efficient network architectures. Nevertheless, existing methods mainly focus on slimming network architectures and ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for compressing video tasks. In this paper, we present a spatial-temporal compression framework, Fast-Vid2Vid, which focuses on the data aspects of generative models. It makes the first attempt at the time dimension to reduce computational resources and accelerate inference. Specifically, we compress the input data stream spatially and reduce the temporal redundancy. After the proposed spatial-temporal knowledge distillation, our model can synthesize key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid interpolates intermediate frames by motion compensation with slight latency. On standard benchmarks, Fast-Vid2Vid achieves around real-time performance of 20 FPS and saves around 8× computational cost on a single V100 GPU. Code and models are publicly available.⁴

⁴ Project page: https://fast-vid2vid.github.io/. Code and models: https://github.com/fast-vid2vid/fast-vid2vid

Keywords: Video-to-Video Synthesis, GAN Compression

1 Introduction

Video-to-video synthesis (vid2vid) [44] aims to synthesize a photo-realistic video given a sequence of semantic maps as input. A wide range of applications are derived from this task, such as face-talking video generation (Sketch2Face) [44,43], driving video generation (Segmentation2City) [44,43] and human pose transfer (Pose2Body) [5,27,53]. With the advance of Generative Adversarial Networks (GANs) [15], vid2vid models [44,43] have made significant progress in video quality. However, these approaches need large-scale computational resources to yield their results, which makes them computationally prohibitive and environmentally unfriendly. For example, the standard vid2vid [44] consumes 2066 G MACs to generate each frame, which is 500× more than ResNet-50 [18]. Recent studies demonstrate that many recognition compression approaches have been successfully extended to image-based GAN compression methods [1,7,11,26,31,29]. Can we directly employ these existing image-based GAN compression methods to obtain promising vid2vid compression models?
In the literature, image-based GAN compression methods can be roughly categorized into three groups: knowledge distillation [1,7,26,31,2], network pruning [31,42], and neural architecture search (NAS) [29,14,11,13,30]. They focus on obtaining a compact network by cutting the architecture parameters of the original network. However, the input data, another factor that significantly affects the inference speed of a deep neural network, has been ignored by the existing GAN compression methods. Moreover, since they target image-based synthesis tasks, they do not consider the redundant temporal information hidden in the neighboring frames of a video. Therefore, directly applying image-based compression models to vid2vid synthesis hardly achieves the desired results.
In this work, we aim to compress the input data stream while maintaining the well-designed network parameters, and to generate photo-realistic results for vid2vid synthesis. Furthermore, we make an initial attempt at removing temporal redundancy to accelerate the vid2vid model.
There are three critical challenges for vid2vid compression. First, the typical vid2vid model [44] consists of several encoder-decoders to capture both spatial and temporal features. It is difficult to reduce the parameters of such a complicated structure due to the intricate connections between these encoders and decoders. Second, it is challenging to compress the input data stream while achieving decent performance for GAN generation, since the perceptual fields of GANs are much more erratic than those of image recognition models. Third, transferring knowledge from a teacher model to a student model temporally is hard to align with spatial knowledge distillation, as the temporal knowledge is implicitly hidden within adjacent frames and is more difficult to capture than the spatial knowledge.
To address the above issues, in this paper, we propose a novel spatial-temporal compression framework for vid2vid synthesis, named Fast-Vid2Vid. As shown in Fig. 2, we reduce the computational resources by only compressing the input data stream through Motion-Aware Inference (MAI), without destroying the well-designed and complicated network parameters of the original vid2vid model, which addresses challenge 1. For challenges 2 and 3, we propose a Spatial-Temporal Knowledge Distillation method (STKD) that transfers spatial and temporal knowledge from the original model to the student network using compressed input data. In particular, motivated by the spatial resolution-aware knowledge distillation method [10] that transfers the knowledge from large-size images to small-size ones for image recognition, our goal is to transfer the knowledge from large-size synthesized videos to small-size synthesized ones to make the GAN robust enough to gain promising visual performance when the input data is compressed.
We first train a spatially low-demand generator that takes low-resolution sequences as input but generates full-resolution sequences. We perform Spatial Knowledge Distillation (Spatial KD) to transfer the spatial knowledge from the original generator to the spatially low-demand generator so that it obtains high-resolution frame information. Furthermore, we train a part-time generator by uniformly sampling video frames from the sequences as real data. We perform Temporal-aware Knowledge Distillation (Temporal KD) to distill the temporal knowledge of the original generator into the part-time student generator so that it obtains full-time motion information, via two introduced losses, i.e., a local temporal knowledge distillation loss and a global temporal knowledge distillation loss. This design aims to capture the implicit knowledge in the time dimension.
To summarize, to the best of our knowledge, we make the first attempt to tackle the vid2vid compression problem from the data aspect. On a single V100 GPU, Fast-Vid2Vid achieves 18.56 FPS (6.1× acceleration) with 8.1× less computational cost on Sketch2Face, 24.77 FPS (5.8× acceleration) with 8.3× less computational cost on Segmentation2City, and 21.39 FPS (7.1× acceleration) with 9.3× less computational cost on Pose2Body. The main contributions of this paper are three-fold:
– We present Fast-Vid2Vid, a sequential data stream compression method in the spatial and temporal dimensions that greatly accelerates the vid2vid model.
– We introduce a Spatial KD method that transfers knowledge from a teacher model fed with high-resolution data to a student model fed with low-resolution data so that the student learns high-resolution information.
– We propose a Temporal KD method to distill knowledge from a full-time teacher model to a part-time student model. A new global temporal knowledge distillation loss is further presented to capture the time-series correlation.

2 Related Work
Video-to-Video Synthesis. Video-to-video synthesis (vid2vid) is a computer vision task that generates a photo-realistic sequence from the corresponding semantic sequence. Based on high-resolution image-based synthesis [45], Wang et al. [44] developed a standard vid2vid synthesis model by introducing temporal coherence. The few-shot vid2vid model [43] further extended vid2vid to a few-shot setting that uses only a few samples to achieve decent performance. Recently, vid2vid has been successfully extended to a wide range of video generation tasks, including video super-resolution [37,8,48], video inpainting [54,49], image-to-video synthesis [38,39] and human pose-to-body synthesis [5,12,53,27]. Most of these methods exploit temporal information to improve the quality of the generated videos. However, they focus on better visual performance rather than on vid2vid synthesis compression.
Model Compression. Model compression aims at reducing superfluous parameters of deep neural networks to accelerate inference. In computer vision, many model pruning approaches [17,28,24,32,19,52,42] have greatly cut the weights of neural networks and significantly sped up inference. Hu et al. [25] removed unnecessary channels with low activations. Small incoming weights [19,28] or outgoing weights [20] of convolution layers were used as saliency metrics for pruning. GAN compression has been shown [51] to be far more difficult than standard CNN compression. Due to the complex structures of GANs, a content-aware approach [31] was proposed to use salient regions to identify specific redundancy for GAN pruning. Wang et al. [42] reduced the redundant weights by NAS using a once-for-all scheme. Notably, the mentioned methods focus on simplifying the network structure and ignore the amount of input information. Furthermore, these approaches do not consider the temporal coherence that is essential for video-based GAN compression, and thus achieve sub-optimal results for vid2vid models. It is therefore necessary to remove temporal redundancy in vid2vid models.
Knowledge Distillation. Knowledge Distillation aims to make a student net-
work imitate its teacher. Hinton et al. [22] proposed an effective framework for

[Fig. 2 diagram — Vanilla Inference: the full-size input {X} (H×W) passes through the Full-time teacher Generator (FG) to produce the teacher output {Y}. Compression: STKD transfers knowledge to the Part-time student Generator (PG), which keeps the same parameters but takes the compressed (h×w) input {X}* and produces the student output {Y}*. Motion-aware Inference: key-frame selection on the low-resolution input {X}', key-frame prediction {Y*_k | k ∈ K} by PG, and zero-parameter motion compensation to obtain the full-size prediction {Y}*.]

Fig. 2: The pipeline of our Fast-Vid2Vid. It maintains the same amount of parameters as the original generator but compresses the input data in space and time dimensions. We perform spatial-temporal knowledge distillation (STKD) to transfer knowledge from the Full-time teacher Generator (FG) to the Part-time student Generator (PG). After STKD, Fast-Vid2Vid only infers the key-frames of the semantic sequence of low resolution and interpolates the intermediate frames by motion compensation.

model distillation in classification. Knowledge distillation has been widely used


in recognition models [6,7,33,34,50]. Recently, lots of response-based knowledge
distillation methods [1,7,11,2] were proposed for image-based GAN compression.
For example, Jin et al. [26] developed distillation techniques from [29] and used a
global kernel alignment module to gain more potential information. Liu et al. [31]
utilized a salient mask to guide the knowledge distillation process based on the
norm. These methods only address image-based knowledge distillation; thus only spatial knowledge is exploited and movements are not considered, so they cannot fully exploit temporal knowledge for vid2vid compression. Different from spatial-aware knowledge distillation, we incorporate both spatial and temporal information into knowledge distillation, which is tailored for vid2vid model compression. Most recently, Feng et al. [10] presented a resolution-aware knowledge distillation method that keeps the network parameters and compresses the input information for image recognition. In our work, we are the first to introduce this input data compression idea to GAN synthesis.

3 Fast-Vid2Vid
3.1 A Revisit of GAN Compression
The function of a deep neural network (DNN) can be written as f(X) = W ∗ X, where W denotes the parameters of the network, ∗ represents the operation of the DNN and X denotes the input data. Obviously, two essential factors account for the computational cost: the parameters and the input data. Existing GAN compression methods [1,7,11,26,31,29] have intended to cut the computational cost by reducing the parameters of the network structures. However, the network structures of GANs for specific tasks are carefully designed, and cutting the network parameters arbitrarily would cause poor visual results. Another way to reduce computational cost is to compress the input data. In this work, we seek to compress the input data instead of the parameters of well-designed networks. To the best of our knowledge, there is little literature on compressing data for GAN compression.
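As a rough illustration of why both factors matter, the Python sketch below is a back-of-the-envelope estimate (our own simplification, not the paper's exact cost accounting): the MACs of a fully convolutional generator scale roughly linearly with the input area and with the number of frames it must synthesize, so compressing the data stream in both dimensions multiplies the savings.

```python
def approx_speedup(spatial_factor: int = 2, keyframe_ratio: float = 0.5) -> float:
    """Rough multiplicative saving from compressing the input data stream.

    spatial_factor: per-side downsampling of the input semantic maps (d = 1 -> 2x).
    keyframe_ratio: fraction of frames synthesized by the generator; the remaining
    frames are filled in by (near) zero-cost motion compensation.
    """
    spatial_saving = spatial_factor ** 2        # input area shrinks quadratically
    temporal_saving = 1.0 / keyframe_ratio      # fewer frames pass through the net
    return spatial_saving * temporal_saving


if __name__ == "__main__":
    # e.g. 2x per-side spatial compression and synthesizing ~45% of the frames
    # lands in the same ballpark as the 8-9x MACs savings reported in the paper.
    print(approx_speedup(2, 0.45))  # ~8.9
```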

3.2 Overview of Fast-Vid2Vid


The typical vid2vid framework [44] takes a sequence of semantic maps {X}_0^T ∈ R^{T×H×W} with T frames and the initial real frames as input, and predicts a photo-realistic video sequence {Y}_0^T ∈ R^{T×H×W}. H and W denote the height and width of each frame. The vanilla inference of the vid2vid model (the full-time teacher generator), which utilizes a full-size sequential input data stream, is a consecutive process that synthesizes a video sequence frame by frame. Considering both image synthesis and temporal coherence, a vid2vid model often contains several encoder-decoders to capture spatial-temporal cues, which is computationally prohibitive and far from applicable on mobile devices. In this paper, we propose the Fast-Vid2Vid compression framework, an input data compression method, to reduce the computational resources of the vid2vid framework in both the space and time dimensions.
Fig. 2 illustrates the overview of the proposed method. Fast-Vid2Vid first replaces the resBlocks of the original vid2vid generator [44] with decomposed convolutional blocks [23] to obtain a modern network architecture, similar to [29]. During knowledge distillation, we train a student generator using the compressed data and distill knowledge from the teacher generator with our proposed spatial-temporal knowledge distillation method (STKD). STKD, including spatial knowledge distillation (Spatial KD) and temporal knowledge distillation (Temporal KD), performs spatial resolution compression and temporal sequential data compression. After STKD, a part-time generator cooperating with motion compensation synthesizes a full-size sequence by motion-aware inference (MAI).

3.3 Spatial Resolution Compression for Vid2vid


To reduce the spatial input data, a straightforward way [10] is to predict low-resolution results using low-resolution semantic maps as the input sequence and re-size them to the full resolution with a distortion algorithm. However, in our preliminary experiments, this straightforward method leads to severe artifacts since the distortion algorithm lacks high-frequency information and loses many important textures. Therefore, we make an adaptive change for vid2vid synthesis. We replace the downsampling layers with normal convolution layers to generate high-resolution results from the low-resolution semantic maps. Formally, the modified generator takes the low-resolution semantic sequence {X}_0^T ∈ R^{T×h×w} as input, where h × w = (H × W)/(2^d)^2 and d denotes the number of modified downsampling layers. d is set to 1. In this way, we obtain a spatially low-demand generator.
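The PyTorch sketch below illustrates one way such a spatially low-demand generator could be derived, assuming a pix2pixHD-style encoder whose downsampling is done by stride-2 convolutions (as in the vid2vid generator); the traversal logic and layer selection are our assumptions, not the released implementation.

```python
import copy

import torch.nn as nn


def make_spatially_low_demand(generator: nn.Module, d: int = 1) -> nn.Module:
    """Swap the first `d` stride-2 convolutions for stride-1 ones so that an input
    downsampled by 2^d per side still reaches the same bottleneck size and hence
    the same full-resolution output size."""
    student = copy.deepcopy(generator)
    # Collect the first d stride-2 convs (in module registration order), then swap them.
    targets = [(name, m) for name, m in student.named_modules()
               if isinstance(m, nn.Conv2d) and m.stride == (2, 2)][:d]
    for name, m in targets:
        new_conv = nn.Conv2d(m.in_channels, m.out_channels, m.kernel_size,
                             stride=1, padding=m.padding,
                             bias=m.bias is not None)
        parent = student
        *path, leaf = name.split(".")
        for p in path:
            parent = getattr(parent, p)
        setattr(parent, leaf, new_conv)  # new weights are re-learned during distillation
    return student
```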

[Fig. 3 diagram — the full-size input {X} (H×W) feeds the full-time teacher generator FG, which produces the teacher output {Y}; the low-resolution input {X}' (h×w) feeds the spatially low-demand student generator SG (same parameters), which produces the student output {Y}'; L_SKD (Eq. (1)) is computed between the two outputs.]

Fig. 3: The proposed Spatial Knowledge Distillation (Spatial KD). The spatially low-demand generator is fed with a sequence of low-resolution semantic maps and outputs full-resolution results. The results of the spatially low-demand generator are used for spatial knowledge distillation.

Next, the spatially low-demand generator is required to learn high-frequency


representation from the full-time teacher generator. We present a spatial knowl-
edge distillation method (Spatial KD) to model high-frequency knowledge from
the teacher net. Specifically, as shown in Fig. 3, Spatial KD shrinks the margin
between the low-resolution domain and the high-resolution domain to improve
the performance of the student network. Spatial KD implicitly transfers spatial
knowledge from the teacher net to the student net. Particularly, Spatial KD ap-
plies a knowledge distillation loss to mimic the visual features of the teacher net,
and the loss function LSKD can be written as:

L_{SKD} = \frac{1}{T}\sum_{t=0}^{T}\left[ MSE(Y_t, Y'_t) + L_{per}(Y_t, Y'_t) \right], \quad (1)

where t denotes the current timestamp, T is the total number of timestamps in the sequence, L_{SKD} denotes the spatial knowledge distillation loss, Y is the output sequence of the teacher net and Y' is the predicted sequence of the spatially low-demand generator. MSE represents the mean squared error between two frames, and L_{per} denotes a perceptual loss [44].
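A minimal PyTorch sketch of Eq. (1) is given below; `perceptual_loss` stands in for the VGG-based perceptual loss of vid2vid [44] and is assumed to be provided by the training code.

```python
import torch.nn.functional as F


def spatial_kd_loss(student_frames, teacher_frames, perceptual_loss):
    """Eq. (1): frame-wise MSE plus a perceptual term, averaged over the sequence.

    student_frames, teacher_frames: tensors of shape (T, C, H, W) holding the
    student's and the teacher's synthesized frames at the same timestamps.
    """
    T = student_frames.shape[0]
    loss = 0.0
    for t in range(T):
        y_s, y_t = student_frames[t:t + 1], teacher_frames[t:t + 1]
        loss = loss + F.mse_loss(y_s, y_t) + perceptual_loss(y_s, y_t)
    return loss / T
```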

3.4 Temporal Sequential Data Compression for Vid2vid

Each video sequence consists of dense video frames, which brings an enormous
burden on computational devices. How to efficiently synthesize dense frames with
a sequence of semantic maps is a difficult yet important issue for lightweight
vid2vid models.
In Section 3.3, we obtain a spatially low-demand generator. To ease the bur-
den of generating dense frames for each video, we re-train the spatially low-
demand generator on sparse video sequences, which are uniformly sampled from
dense video sequences. The sampling interval is randomly selected in each train-
ing iteration. The original vid2vid generator is regarded as a full-time teacher

[Fig. 4 diagram — top: the full-time teacher net FG performs consecutive generation with time gap 1, producing Y_0, ..., Y_gap, ..., Y_T (Seq-t) from X_0, X_1, ..., X_{gap+1}; bottom: the part-time student net PG takes re-sized (low-resolution) semantic maps X_0, X_gap, X_T with a random time gap and produces Y*_0, Y*_gap, Y*_T (Seq-s); L_LTKD (Eq. (4)) is applied to each corresponding frame pair and L_GTKD (Eq. (5)) to the two sequences.]

Fig. 4: The proposed temporal-aware knowledge distillation (Temporal KD). The full-time generator and part-time generator synthesize the current frame using the previous frames and the semantic maps. The full-time teacher generator takes full-resolution semantic maps as inputs and generates a full sequence, while the part-time student generator takes only several low-resolution semantic maps as inputs and generates the corresponding frames at random intervals.

generator and the re-trained spatially low-demand generator is regarded as a


part-time student generator. To distill the temporal knowledge from the full-
time teacher generator to the part-time student generator, we propose a lo-
cal temporal-aware knowledge distillation method and a global temporal-aware
knowledge distillation for temporal distillation, as shown in Fig. 4.
Both the full-time teacher generator and the part-time student generator take the previous p − 1 synthesized frames {Y}_1^{p−1} and p semantic maps {X}_1^{p} as input and generate the next frame. The previous frames are used to capture the temporal coherence of the sequences and to generate more coherent video frames. The generation process of the full-time teacher generator is consecutive. More generally, each generation iteration of the full-time teacher generator can be formulated as follows:

Y_k = f_{FG}\big(\{X\}_{k-p}^{k}, \{Y\}_{k-p}^{k-1}\big), \quad (2)

where Y_k denotes the predicted current frame of the full-time teacher generator, f_{FG} denotes the generation function of the full-time teacher generator, {X}_{k-p}^{k} denotes p+1 frames of semantic maps, and {Y}_{k-p}^{k-1} denotes the previous p generated frames.
Different from the full-time teacher generator, whose uniform sampling interval is 1, the uniform sampling interval of the part-time student generator is g, where 1 < g < T and g is randomly selected in each training iteration. Similarly, the frame generation of the part-time student generator can be formulated as follows:

Y^{*}_{k} = f_{PG}\big(f_{R}^{d}(\{X^{*}\}_{k-p}^{k}), \{Y^{*}\}_{k-p}^{k-1}\big), \quad (3)



where Y^{*}_{k} denotes the predicted current frame of the part-time student generator, f_{PG} denotes the generation function of the part-time student generator, f_{R}^{d} denotes the function that reduces the resolution by a factor of (2^d)^2, {X^{*}}_{k-p}^{k} includes p frames of the sparse semantic sequence, and {Y^{*}}_{k-p}^{k-1} are the previous frames of the synthesized sparse sequence.
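The sparse sampling itself can be sketched as below; this is an assumption about the data pipeline (gap range, bounds handling), not the authors' released code.

```python
import random


def sample_sparse_indices(num_frames: int, num_samples: int, max_gap: int):
    """Pick `num_samples` uniformly spaced frame indices with a random gap 1 < g <= max_gap."""
    assert num_samples >= 2 and (num_samples - 1) * 2 < num_frames, "clip too short"
    # clamp the gap so that the sampled indices fit inside the clip
    largest_fitting_gap = (num_frames - 1) // (num_samples - 1)
    g = random.randint(2, max(2, min(max_gap, largest_fitting_gap)))
    start = random.randrange(num_frames - (num_samples - 1) * g)  # random offset
    return [start + i * g for i in range(num_samples)]


# e.g. sample_sparse_indices(30, 4, 6) might return [3, 8, 13, 18] (g = 5)
```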
To better illustrate our proposed Temporal KD, we set p = 1 in Fig. 4. Specifically, the full-time teacher generator takes a semantic sequence {X}_0^T as input and generates an entire sequence {Y}_0^T frame by frame. For the k-th frame synthesis, the full-time generator takes X_{k−1}, X_k and Y_{k−1} as input and generates Y_k.
Because the full-time teacher generator is trained on sequences with dense frames and learns to generate dense coherent frames, it cannot directly skip frames to generate a sparse sequence, which leads to an expensive computational cost. The part-time student generator can generate sparse frames and interpolate the intermediate frames with slight computational cost. However, since the part-time student generator is trained on sequences with sampled sparse frames, the sampling rate could be less than twice the temporal motion frequency and thus lead to aliasing according to the Nyquist–Shannon sampling theorem. Our preliminary experiments also show that large changes between two non-adjacent frames cause remarkable inter-frame incoherence and degrade the results.
Local Temporal-aware Knowledge Distillation. We first introduce the lo-
cal temporal-aware knowledge distillation to optimize the part-time student gen-
erator. Our goal is to distill the knowledge from the full-time generator to the
part-time student generator to reduce aliasing. A straightforward idea is to align
the outputs of the full-time generator and the outputs of the part-time student
generator and reduce the distances between the corresponding synthesized frame
pairs. The loss function of local temporal-aware knowledge distillation is given
by

L_{LTKD} = \frac{1}{T}\sum_{t=0}^{T}\left[ MSE(Y_t^{*}, Y_t) + L_{per}(Y_t^{*}, Y_t) \right], \quad (4)

where L_{LTKD} denotes the local temporal-aware knowledge distillation loss, Y_t^{*} denotes the frame generated by the part-time student generator and Y_t denotes the corresponding frame generated by the teacher net. MSE represents the mean squared error between two frames, and L_{per} denotes a perceptual loss [44].
Global Temporal-aware Knowledge Distillation. Local temporal-aware
knowledge distillation allows the part-time student generator to imitate the lo-
cal motion of the full-time teacher generator. However, it does not consider the
global semantic consistency. Therefore, we further propose a global temporal-
aware knowledge distillation to distill global temporal coherence from the full-
time generator to the part-time student generator.
It is observed that the current frame synthesis deeply relies on the results of
the previous synthesis. This indicates that the generated current frame implic-
itly contains information from the previous frames. The global temporal-aware

[Fig. 5 diagram — the teacher sequence (Seq-t) and the student sequence (Seq-s), both of size H×W, are fed to a pre-trained I3D model; L_GTKD (Eq. (5)) is computed between the resulting feature distributions.]

Fig. 5: The proposed temporal loss for global temporal knowledge distillation. Time-series coherence features of the sequence of the teacher net (Seq-t) and the sequence of the student net (Seq-s) are extracted by the well-trained I3D model to calculate the distances.

knowledge distillation exploits the non-adjacent frames generated by the full-time teacher generator to distill the hidden information of temporal coherence. The non-adjacent frames synthesized by the part-time student generator are concatenated into a predicted sequence {Y*} (Seq-s). Similarly, the frames generated by the full-time teacher generator at the same timestamps are extracted and concatenated into a sequence {Y^S} (Seq-t) in time order, as shown in Fig. 5. We introduce a global temporal loss, namely the I3D-loss, to minimize the distance between {Y*} and {Y^S}. The global temporal-aware knowledge distillation employs a well-trained I3D [4] model, a well-known video recognition model, to extract the time-series features of neighboring frames of {Y*} and {Y^S}. The global temporal-aware knowledge distillation loss is given by
L_{GTKD} = MSE\big( f_{I3D}(\{Y^{*}\}), f_{I3D}(\{Y^{S}\}) \big), \quad (5)
where MSE calculates the mean squared error between two feature vectors and f_{I3D} is the function of the pre-trained I3D model. Finally, we obtain the temporal knowledge distillation loss, written as

L_{TKD} = \alpha L_{LTKD} + \beta L_{GTKD}, \quad (6)

where α and β control the importance of the local and global losses. We set α = 2 and β = 15 to enhance the global temporal coherence.
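A compact PyTorch sketch of Eqs. (4)-(6) follows; the frozen `i3d` feature extractor and `perceptual_loss` are stand-ins for the pre-trained I3D [4] model and the perceptual loss of [44], and the exact feature layer used for Eq. (5) is our assumption.

```python
import torch
import torch.nn.functional as F


def temporal_kd_loss(student_seq, teacher_seq, i3d, perceptual_loss,
                     alpha: float = 2.0, beta: float = 15.0):
    """student_seq, teacher_seq: (T, C, H, W) frames at the same timestamps."""
    # Eq. (4): local term, frame-wise MSE + perceptual distance
    l_local = sum(F.mse_loss(s[None], t[None]) + perceptual_loss(s[None], t[None])
                  for s, t in zip(student_seq, teacher_seq)) / student_seq.shape[0]
    # Eq. (5): global term, distance between I3D clip features
    # (the teacher's features carry no gradient; I3D expects (N, C, T, H, W) input)
    with torch.no_grad():
        feat_t = i3d(teacher_seq[None].transpose(1, 2))
    feat_s = i3d(student_seq[None].transpose(1, 2))
    l_global = F.mse_loss(feat_s, feat_t)
    # Eq. (6): weighted combination of local and global temporal KD losses
    return alpha * l_local + beta * l_global
```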

Full Objective Function. We integrate local temporal-aware and global temporal-


aware knowledge distillation into a unified optimization framework, which en-
ables the part-time student generator to imitate both global and local motions
of the full-time teacher generator. The full objective function is given by
L_{KD} = \sigma L_{SKD} + \gamma L_{TKD}, \quad (7)

where σ and γ control the weights of the spatial and temporal KD losses, respectively. In particular, σ is set to 1 and γ is set to 2 to learn more knowledge of the temporal features from the teacher net.

3.5 Semantic-driven Motion Compensation for Interpolation Synthesis

The temporal compression further greatly reduces the computational cost compared with the original vid2vid generator. However, the part-time student generator can only synthesize sparse frames. To compensate for this, we use a fast motion compensation method [35], a zero-parameter algorithm, to complete the sequence. Motion compensation enables the synthesis of the intermediate frames between key-frames. Since adjacent frames change only slightly, the final results maintain reliable visual quality while the temporal redundancy is reduced. During inference, another question is which frames should be synthesized by the part-time student generator as sparse frames, and how to determine these key-frames without having sufficient photo-realistic frames. In this paper, we find, surprisingly, that we can identify the key-frames {X'_k | k ∈ K}, where K is the set of key-frame indices, directly from the semantic maps {X}_0^T. With the key semantic maps, the part-time student generator generates the sparse frames {Y'_k | k ∈ K}, and finally we interpolate the remaining frames to obtain a full-size result sequence {Y} ∈ R^{T×H×W} by the fast motion compensation method.
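Putting the pieces together, motion-aware inference can be sketched at a high level as follows; `select_keyframes`, `part_time_generator`, `downscale` and `motion_compensate` are stand-ins for the components described in this paper, and the autoregressive conditioning of the generator on previously synthesized key-frames is omitted for brevity.

```python
def motion_aware_inference(semantic_maps, part_time_generator, downscale,
                           select_keyframes, motion_compensate):
    """semantic_maps: sequence of T semantic maps; returns T synthesized frames."""
    # 1) pick key-frames directly from the semantic maps (no real frames are needed)
    key_ids = select_keyframes(semantic_maps)
    # 2) synthesize only the key-frames from low-resolution semantic maps
    key_frames = {k: part_time_generator(downscale(semantic_maps[k])) for k in key_ids}
    # 3) fill in the remaining frames with zero-parameter motion compensation
    return motion_compensate(key_frames, num_frames=len(semantic_maps))
```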

4 Experiments

4.1 Experimental Setup

Models. We conduct our experiments on the vid2vid [44] model. The original vid2vid model uses a coarse-to-fine generator for high-resolution output. To simplify the compression problem, we only retrain the first-scale vid2vid model based on the official repository.⁵ We also evaluate three compression methods for image synthesis models, including NAS compression [29], CA compression [31] and CAT [26]. Since vid2vid is a 2D-based generation framework, these three methods can be transferred to this task with minor adjustments. For NAS compression, we adopt our full-time teacher generator and perform NAS with FVD as the metric. For CA compression, we use salience-aware regions as the content of interest to compress the model. For CAT, the vid2vid model with modified residual blocks is built and retrained, followed by pruning and distillation using the global alignment kernel.
Datasets. We evaluate our proposed compression method on several datasets. We pre-process the following datasets with the same settings as vid2vid: the Face Video dataset [36], the Cityscapes Video dataset [9] and the YouTube Dancing Video dataset [44]. Since we only apply the first scale of the vid2vid model, we re-size the datasets for convenience: the Face Video dataset is re-sized to 512 × 512, the Cityscapes Video dataset to 256 × 512, and the YouTube Dancing Video dataset to 384 × 512.
Evaluation Metrics. We apply three metrics for quantitative evaluation, in-
cluding FID [21], FVD [41] and pose error [43]. FID [21] indicates the similarity
⁵ https://github.com/NVIDIA/vid2vid

Table 1: Quantitative results of Fast-Vid2Vid. m.MACs represents the mean of the total computational MACs for a video sequence.
Task               Method    m.MACs(G)  FPS    FID(↓)  FVD(↓)  PE(↓)
Sketch2Face        original  2066       2.77   34.17   6.74    —
                   NAS       303        10.50  33.64   6.86    —
                   CA        290        11.07  35.82   6.95    —
                   CAT       294        10.89  33.90   6.43    —
                   Ours      282        18.56  29.02   5.79    —
Segmentation2City  original  1254       4.27   —       2.76    —
                   NAS       277        12.44  —       2.99    —
                   CA        187        15.48  —       3.98    —
                   CAT       178        13.29  —       3.44    —
                   Ours      132        24.77  —       2.33    —
Pose2Body          original  1769       3.01   —       12.31   2.60
                   NAS       280        12.57  —       12.53   3.28
                   CA        253        13.92  —       15.89   4.85
                   CAT       257        12.48  —       15.75   4.18
                   Ours      191        21.39  —       10.03   2.18

between real and generated images using a well-trained classifier network. FVD aims to reveal the similarity between the real video and the synthesized video. Pose error measures the absolute error in pixels between the estimated rendered poses and the original rendered poses predicted by OpenPose [3] and DensePose [16]. Lower scores on the three metrics represent better performance.
Key-frame Selection. We first calculate the residual maps between adjacent frames and sum up each map to draw a smooth statistical curve using sliding windows. The peaks of the curve represent local maxima of the difference between two adjacent frames and are used as key-frames. Note that our key-frame selection only consumes about 0.5 milliseconds to process a video of 30 frames.
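A plausible NumPy implementation of this selector is sketched below; the smoothing kernel, the peak rule and keeping the endpoints are our assumptions rather than details given in the paper.

```python
import numpy as np


def select_keyframes(semantic_maps: np.ndarray, window: int = 3):
    """semantic_maps: (T, H, W[, C]) array. Returns sorted key-frame indices."""
    # residual maps between adjacent frames, summed into one scalar per transition
    diffs = np.abs(np.diff(semantic_maps.astype(np.float32), axis=0))
    residual = diffs.reshape(diffs.shape[0], -1).sum(axis=1)      # (T-1,)
    # smooth the statistical curve with a sliding-window mean
    kernel = np.ones(window) / window
    smooth = np.convolve(residual, kernel, mode="same")
    # local maxima of the smoothed curve mark the largest inter-frame changes
    peaks = [i + 1 for i in range(1, len(smooth) - 1)
             if smooth[i] >= smooth[i - 1] and smooth[i] >= smooth[i + 1]]
    return sorted({0, len(semantic_maps) - 1, *peaks})            # keep the endpoints
```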
Motion Compensation. Motion compensation predicts a video frame from the previous and future frames, as in video compression, and leaves fewer remnants than linear interpolation. We adopt overlapped block motion compensation (OBMC) [35] and the enhanced predictive zonal search (EPZS) method [40] to generate the non-key-frames with the FFmpeg toolbox. EPZS consumes about 2 MACs for each 16×16 patch and OBMC consumes 5 MACs for each pixel, which amounts to 0.0008146G MACs for each video frame (512×512 resolution), much less than our generative model part (282G MACs).
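One way to run OBMC and EPZS motion-compensated interpolation with the FFmpeg toolbox is the `minterpolate` filter, as sketched below; whether this exact invocation matches the authors' pipeline is an assumption, and the key-frames are assumed to have been written to a low-frame-rate video first.

```python
import subprocess


def motion_compensate_video(keyframe_video: str, output_video: str, target_fps: int = 30):
    """Upsample a low-frame-rate key-frame video to `target_fps` using
    motion-compensated interpolation (OBMC motion compensation, EPZS motion search)."""
    vf = f"minterpolate=fps={target_fps}:mi_mode=mci:mc_mode=obmc:me=epzs"
    subprocess.run(["ffmpeg", "-y", "-i", keyframe_video, "-vf", vf, output_video],
                   check=True)
```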

4.2 Quantitative Results

We compare our compression method with the state-of-the-art GAN compression methods NAS [29], CA [31] and CAT [26] on three benchmark datasets to validate the effectiveness of our approach. For a fair comparison, we adopt a comparable pruning rate by removing around 60% of the channels of the vid2vid model

[Fig. 6 plots — (a) FVD vs. FID for w/o TKD, TKD-local, TKD-global, MMD-loss, LSTM and STKD; (b) FVD vs. key-frame selection window size (3-10) on Sketch2Face, Segmentation2City and Pose2Body.]

Fig. 6: Ablation study for Fast-Vid2Vid. (a) Ablation study for the temporal loss. (b) Trade-off experiments for the window size of the key-frame selector. Larger windows mean lower mMACs.

in CA and CAT, and use NAS to find the best network configuration with similar mMACs.
The experimental results are shown in Table 1. We can see that, given a lower computational budget, our method achieves the best FID and FVD on the three datasets. Specifically, the other GAN compression methods perform worse than the full-size model, while our method slightly outperforms the original model. The other compression methods speed up the original model by simply cutting the network structures and ignore the temporal coherence. Meanwhile, the original vid2vid model significantly accumulates errors during inference. Our proposed motion-aware inference accumulates fewer errors since it only generates several frames of the sequence. These results show the advantage of our spatial-temporal aware compression method.

4.3 Ablation Study

We adopt face video as the benchmark dataset for our ablation study.
Effectiveness of Temporal KD Loss. Based on the spatially low-demand generator mentioned above, we analyze knowledge distillation for the vid2vid model. We set up 6 different distillation schemes: (1) w/o TKD: the spatially low-demand generator is retrained on the dataset; (2) TKD-local: the generator receives only the local knowledge from the teacher net; (3) TKD-global: the generator receives only the global knowledge from the teacher net; (4) MMD: the generator receives the knowledge via the MMD-loss [10]; (5) LSTM: the generator receives the knowledge based on LSTM regularization [47]; (6) TKD: the generator receives both the local and the global knowledge from the teacher net.
14 L. Zhuo et al.

[Fig. 7 columns: GT, w/o TKD, TKD-Local, TKD-Global, LSTM, MMD, STKD]

Fig. 7: The comparison of the results of different knowledge distillation methods.

Table 2: Ablation Study for spatial compression with the proposed Temporal KD.
Method MACs(G) FPS FID FVD
CA 331 17.00 36.65 6.76
CAT 310 18.02 35.64 6.85
NAS 344 16.78 32.41 6.71
Spatial KD 282 18.56 29.02 5.79

As shown in Fig. 6(a), the local knowledge distillation loss improves the performance over the model without KD. Furthermore, the global temporal KD loss further improves over the common local KD loss, especially in FVD. Moreover, our proposed KD loss outperforms the MMD-loss and the LSTM-based KD loss. This indicates that the temporal KD loss effectively enhances the similarity, measured in the feature distribution of the video recognition network, between the videos generated by the teacher network and by the student network. We also provide a qualitative comparison among these KD methods in Fig. 7; our STKD generates more photo-realistic frames than the others.
Effectiveness of Spatial KD Loss. We conduct an ablative study for Spatial KD on the Sketch2Face benchmark. In the video setting, the spatial compression methods are used together with our proposed Temporal KD to perform vid2vid compression. Table 2 shows that our proposed Spatial KD performs better than other image compression methods. Our Spatial KD does not destroy the network structure of the original GAN, while the other methods tweak its carefully designed parameters.
Impact of Windows for Key-frame Selection. We investigate the sliding windows used to select the key-frames. A larger sliding window means that fewer key-frames are selected and thus fewer computational resources are used. We aim to find the best trade-off between the sliding window size and the performance. Interestingly, as shown in Fig. 6(b), FVD rises significantly as the sliding window is enlarged, and the best performance is achieved when the sliding window is three on all three tasks. This indicates that the part-time student generator needs enough independent motion to maintain decent performance.
Effectiveness of Interpolation. We compare two common interpolation methods for completing the video, namely linear interpolation and motion compensation, in an ablative study. As shown in Table 3, motion compensation outperforms simple linear interpolation.

Table 3: The performance of different interpolation methods.
Interpolation          FID     FVD
Linear Interpolation   31.55   6.45
Motion Compensation    29.02   5.79

Table 4: The performance of different time gaps.
Gap               FID     FVD
Fixed time gap    32.21   6.85
Random time gap   31.43   6.22
Key-frames        29.02   5.79

[Fig. 8 grids — columns: Inputs, CA, CAT, NAS, Fast-Vid2Vid, Vid2Vid, GT; one grid per task.]

Fig. 8: Qualitative results compared with the advanced GAN compression methods in the task of Sketch2Face, Segmentation2City and Pose2Body.

Therefore, we apply motion compensation as our zero-parameter interpolation method.
Effectiveness of Time Gap. We also study how to select the frames to be generated by the part-time student generator, using the same number of selected frames for each strategy. Specifically, the fixed time gap strategy selects frames at fixed time intervals, the random time gap strategy chooses frames at random intervals, and the key-frame strategy selects the key-frames as the frames to be synthesized by the part-time student generator. As shown in Table 4, the key-frame strategy outperforms the other strategies since the key-frames of a sequence contain all the essential motion and texture.

4.4 Qualitative Results

We illustrate the output sequences of the mentioned methods in Fig. 8. The


generators in vid2vid synthesis would accumulate the visual losses sequentially
since the current frame relies on the previously generated frames. We also visu-
alize more examples in Fig. 9, Fig. 10 and Fig. 11. Compared with other GAN
compression methods, our proposed method generates more realistic results.

5 Discussion

We discuss some future directions for this work. Recently emerging sequence-in, sequence-out methods, such as transformers, are challenging to compress. In contrast, our Fast-Vid2Vid accelerates vid2vid by optimizing a part-time student generator (via temporal-aware KD compression) and a lower-resolution spatial generator (via spatial KD compression), which is versatile across various networks. When combined with seq-in, seq-out transformers such as VisTR [46], Fast-Vid2Vid could first synthesize a partial video with a part-time transformer-based student generator (via fully parallel computation) and then recover the full video by motion compensation.

6 Conclusion

In this paper, we present Fast-Vid2Vid to accelerate vid2vid synthesis. We propose a spatial-temporal compression framework that accelerates inference by compressing the sequential input data stream while maintaining the parameters of the network. In the space dimension, we distill knowledge from the full-resolution domain to the low-resolution domain and obtain a spatially low-demand generator. In the time dimension, we use temporal-aware knowledge distillation of local and global knowledge to convert the spatially low-demand generator from a full-time generator into a part-time generator. Finally, the part-time generator is used for motion-aware inference, where it only generates the key-frames of the sequence and the intermediate frames are interpolated by motion compensation. By reducing the resolution of the input data and extracting the key-frames of the data stream, Fast-Vid2Vid saves computational resources significantly.

Acknowledgements

This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088), and
under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects
(IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the
industry partner(s).

References
1. Aguinaldo, A., Chiang, P.Y., Gain, A., Patil, A., Pearson, K., Feizi, S.: Compressing
gans using knowledge distillation. arXiv preprint arXiv:1902.00159 (2019) 2, 5
2. Belousov, S.: Mobilestylegan: A lightweight convolutional neural network for high-
fidelity image synthesis. arXiv preprint arXiv:2104.04767 (2021) 2, 5
3. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation
using part affinity fields. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 7291–7299 (2017) 12
4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the
kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 6299–6308 (2017) 10
5. Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 5933–5942
(2019) 2, 4
6. Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object
detection models with knowledge distillation. Advances in neural information pro-
cessing systems 30 (2017) 5
7. Chen, H., Wang, Y., Shu, H., Wen, C., Xu, C., Shi, B., Xu, C., Xu, C.: Distilling
portable generative adversarial networks for image translation. In: Proceedings of
the AAAI Conference on Artificial Intelligence. vol. 34, pp. 3585–3592 (2020) 2, 5
8. Chu, M., Xie, Y., Mayer, J., Leal-Taixé, L., Thuerey, N.: Learning temporal co-
herence via self-supervision for gan-based video generation. ACM Transactions on
Graphics (TOG) 39(4), 75–1 (2020) 4
9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 3213–3223 (2016) 11
10. Feng, Z., Lai, J., Xie, X.: Resolution-aware knowledge distillation for efficient in-
ference. IEEE Transactions on Image Processing 30, 6985–6996 (2021) 3, 5, 6,
13
11. Fu, Y., Chen, W., Wang, H., Li, H., Lin, Y., Wang, Z.: Autogan-distiller: Search-
ing to compress generative adversarial networks. arXiv preprint arXiv:2006.08198
(2020) 2, 5
12. Gafni, O., Wolf, L., Taigman, Y.: Vid2game: Controllable characters extracted
from real-world videos. arXiv preprint arXiv:1904.08379 (2019) 4
13. Gao, C., Chen, Y., Liu, S., Tan, Z., Yan, S.: Adversarialnas: Adversarial neural
architecture search for gans. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 5680–5689 (2020) 2
14. Gong, X., Chang, S., Jiang, Y., Wang, Z.: Autogan: Neural architecture search for
generative adversarial networks. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. pp. 3224–3234 (2019) 2
15. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural
Information Processing Systems. vol. 27 (2014) 2
16. Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation
in the wild. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp. 7297–7306 (2018) 12
17. Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for
efficient neural networks. arXiv preprint arXiv:1506.02626 (2015) 4

18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016) 2
19. He, Y., Kang, G., Dong, X., Fu, Y., Yang, Y.: Soft filter pruning for accelerating
deep convolutional neural networks. arXiv preprint arXiv:1808.06866 (2018) 4
20. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural
networks. In: Proceedings of the IEEE international conference on computer vision.
pp. 1389–1397 (2017) 4
21. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained
by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems 30 (2017) 11
22. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531 (2015) 4
23. Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu,
Y., Pang, R., Vasudevan, V., et al.: Searching for mobilenetv3. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 1314–1324
(2019) 6
24. Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: A data-driven neuron pruning
approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250
(2016) 4
25. Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: A data-driven
neuron pruning approach towards efficient deep architectures. arXiv preprint
arXiv:1607.03250 (2016) 4
26. Jin, Q., Ren, J., Woodford, O.J., Wang, J., Yuan, G., Wang, Y., Tulyakov, S.:
Teachers do more than teach: Compressing image-to-image models. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
13600–13611 (2021) 2, 5, 11, 12
27. Kappel, M., Golyanik, V., Elgharib, M., Henningson, J.O., Seidel, H.P., Castillo,
S., Theobalt, C., Magnor, M.: High-fidelity neural human motion transfer from
monocular video. In: Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition. pp. 1541–1550 (2021) 2, 4
28. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient
convnets. arXiv preprint arXiv:1608.08710 (2016) 4
29. Li, M., Lin, J., Ding, Y., Liu, Z., Zhu, J.Y., Han, S.: Gan compression: Efficient
architectures for interactive conditional gans. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 5284–5294 (2020)
2, 5, 6, 11, 12
30. Lin, J., Zhang, R., Ganz, F., Han, S., Zhu, J.Y.: Anycost gans for interactive image
synthesis and editing. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 14986–14996 (2021) 2
31. Liu, Y., Shu, Z., Li, Y., Lin, Z., Perazzi, F., Kung, S.Y.: Content-aware gan com-
pression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 12156–12166 (2021) 2, 4, 5, 11, 12
32. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolu-
tional networks through network slimming. In: Proceedings of the IEEE interna-
tional conference on computer vision. pp. 2736–2744 (2017) 4
33. Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.: Unifying distillation and
privileged information. arXiv preprint arXiv:1511.03643 (2015) 5
34. Luo, P., Zhu, Z., Liu, Z., Wang, X., Tang, X.: Face model compression by distilling
knowledge from neurons. In: Thirtieth AAAI conference on artificial intelligence
(2016) 5

35. Orchard, M.T., Sullivan, G.J.: Overlapped block motion compensation: An


estimation-theoretic approach. IEEE Transactions on Image Processing (1994) 11,
12
36. Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Face-
forensics: A large-scale video dataset for forgery detection in human faces. arXiv
preprint arXiv:1803.09179 (2018) 11
37. Sajjadi, M.S., Vemulapalli, R., Brown, M.: Frame-recurrent video super-resolution.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition. pp. 6626–6634 (2018) 4
38. Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion
model for image animation. Advances in Neural Information Processing Systems
32, 7137–7147 (2019) 4
39. Siarohin, A., Woodford, O.J., Ren, J., Chai, M., Tulyakov, S.: Motion representa-
tions for articulated animation. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 13653–13662 (2021) 4
40. Tourapis, A.M.: Enhanced predictive zonal search for single and multiple frame mo-
tion estimation. In: Visual Communications and Image Processing 2002. vol. 4671,
pp. 1069–1079. SPIE (2002) 12
41. Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly,
S.: Towards accurate generative models of video: A new metric & challenges. arXiv
preprint arXiv:1812.01717 (2018) 11
42. Wang, H., Gui, S., Yang, H., Liu, J., Wang, Z.: Gan slimming: All-in-one gan
compression by a unified optimization framework. In: European Conference on
Computer Vision. pp. 54–73. Springer (2020) 2, 4
43. Wang, T.C., Liu, M.Y., Tao, A., Liu, G., Kautz, J., Catanzaro, B.: Few-shot video-
to-video synthesis. arXiv preprint arXiv:1910.12713 (2019) 2, 4, 11
44. Wang, T.C., Liu, M.Y., Zhu, J.Y., Liu, G., Tao, A., Kautz, J., Catanzaro, B.:
Video-to-video synthesis. arXiv preprint arXiv:1808.06601 (2018) 2, 3, 4, 6, 7, 9,
11
45. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-
resolution image synthesis and semantic manipulation with conditional gans. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2018) 4
46. Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., Xia, H.: End-to-end
video instance segmentation with transformers. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 8741–8750 (2021)
16
47. Xiao, Z., Fu, X., Huang, J., Cheng, Z., Xiong, Z.: Space-time distillation for video
super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition. pp. 2113–2122 (2021) 13
48. Xu, G., Xu, J., Li, Z., Wang, L., Sun, X., Cheng, M.M.: Temporal modulation
network for controllable space-time video super-resolution. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6388–
6397 (2021) 4
49. Xu, R., Li, X., Zhou, B., Loy, C.C.: Deep flow-guided video inpainting. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 3723–3732 (2019) 4
50. Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: Fast opti-
mization, network minimization and transfer learning. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 4133–4141 (2017) 5

51. Yu, C., Pool, J.: Self-supervised gan compression. arXiv preprint arXiv:2007.01491
(2020) 4
52. Zhang, T., Ye, S., Zhang, K., Tang, J., Wen, W., Fardad, M., Wang, Y.: A system-
atic dnn weight pruning framework using alternating direction method of multi-
pliers. In: Proceedings of the European Conference on Computer Vision (ECCV).
pp. 184–199 (2018) 4
53. Zhou, Y., Wang, Z., Fang, C., Bui, T., Berg, T.: Dance dance generation: Mo-
tion transfer for internet videos. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision Workshops. pp. 0–0 (2019) 2, 4
54. Zou, X., Yang, L., Liu, D., Lee, Y.J.: Progressive temporal feature alignment net-
work for video inpainting. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition. pp. 16448–16457 (2021) 4

Fig. 9: Qualitative results of the testing data compared with the advanced GAN com-
pression methods in the task of Sketch2Face. From top to the bottom, rows are seman-
tic maps, CA’s results, CAT’s results, NAS’s results, Vid2Vid’s results, Fast-Vid2Vid’s
results and GT.

Fig. 10: Qualitative results of the testing data compared with the advanced GAN com-
pression methods in the task of Segmentation2City. From top to the bottom, rows are
semantic maps, CA’s results, CAT’s results, NAS’s results, Vid2Vid’s results, Fast-
Vid2Vid’s results and GT.

Fig. 11: Qualitative results of the testing data compared with the advanced GAN com-
pression methods in the task of Pose2Body. From top to the bottom, rows are semantic
maps, CA’s results, CAT’s results, NAS’s results, Vid2Vid’s results, Fast-Vid2Vid’s re-
sults and GT.
