Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis
Long Zhuo1, Guangcong Wang2, Shikai Li3, Wayne Wu1,3, and Ziwei Liu2⋆
1 Shanghai AI Laboratory   2 S-Lab, Nanyang Technological University   3 SenseTime Research
[Fig. 1 (teaser): Comparison of Vid2Vid and Fast-Vid2Vid on Segmentation2City, Sketch2Face, and Pose2Body. Vid2Vid: 1254G MACs, 4.27 FPS; Fast-Vid2Vid: 151G MACs (8.3× fewer), 24.77 FPS (5.8× faster).]
1 Introduction
2 Related Work
Video-to-Video Synthesis. Video-to-video synthesis (vid2vid) is a computer vision task that generates a photo-realistic video from a corresponding sequence of semantic maps. Building on high-resolution image-based synthesis [45], Wang et al. [44] developed a standard vid2vid synthesis model by introducing temporal coherence. The few-shot vid2vid model [43] further extended vid2vid to a few-shot setting, achieving decent performance with only a few samples. Recently, vid2vid has been successfully extended to a wide range of video generation tasks, including video super-resolution [37,8,48], video inpainting [54,49], image-to-video synthesis [38,39], and human pose-to-body synthesis [5,12,53,27]. Most of these methods exploit temporal information to improve the quality of generated videos. However, they focus on better visual quality rather than on compressing vid2vid synthesis.
Model Compression. Model compression aims to reduce the superfluous parameters of deep neural networks to accelerate inference. In computer vision, many model pruning approaches [17,28,24,32,19,52,42] have greatly cut the weights of neural networks and significantly sped up inference. Hu et al. [25] removed unnecessary channels with low activations. Small incoming weights [19,28] or outgoing weights [20] of convolution layers were used as saliency metrics for pruning. GAN compression has been shown [51] to be far more difficult than standard CNN compression. Due to the complex structures of GANs, a content-aware approach [31] was proposed that uses salient regions to identify task-specific redundancy for GAN pruning. Wang et al. [42] reduced redundant weights via NAS using a once-for-all scheme. Notably, these methods focus on simplifying the network structure and ignore the amount of input information. Furthermore, they do not consider the temporal coherence that is essential for video-based GAN compression, and thus achieve sub-optimal results on vid2vid models. Removing temporal redundancy is therefore required for compressing vid2vid models.
Knowledge Distillation. Knowledge distillation aims to make a student network imitate its teacher. Hinton et al. [22] proposed an effective framework for transferring knowledge from a large teacher network to a smaller student network by matching their softened output distributions.
Fig. 2: The pipeline of our Fast-Vid2Vid. It keeps the same number of parameters as the original generator but compresses the input data along the spatial and temporal dimensions. We perform spatial-temporal knowledge distillation (STKD) to transfer knowledge from the full-time teacher generator (FG) to the part-time student generator (PG). After STKD, Fast-Vid2Vid only infers key frames from the low-resolution semantic sequence and interpolates the intermediate frames by motion compensation.
3 Fast-Vid2Vid
3.1 A Revisit of GAN Compression
The function of a deep neural network (DNN) can be written as f(X) = W ∗ X, where W denotes the parameters of the network, ∗ represents the operation of the DNN, and X denotes the input data. Obviously, two essential factors accounting for computational cost are the parameters and the input data. Existing GAN compression methods [1,7,11,26,31,29] cut computational cost by reducing the parameters of the network structure. However, the network structures of GANs for specific tasks are carefully designed, and arbitrarily cutting their parameters leads to poor visual results. Another way to reduce computational cost is to compress the input data. In this work, we seek to compress the input data instead of the parameters of well-designed networks. To the best of our knowledge, little prior work compresses the input data for GAN compression.
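As a concrete illustration of this trade-off, the minimal sketch below (not the paper's code; conv_macs is a hypothetical helper) counts the approximate MACs of a single convolution layer and shows how shrinking the input resolution, rather than the weights, reduces the cost of f(X) = W ∗ X:

```python
def conv_macs(h, w, c_in, c_out, k=3, stride=1):
    """Approximate multiply-accumulate operations of one k x k conv layer."""
    out_h, out_w = h // stride, w // stride
    return out_h * out_w * c_in * c_out * k * k

# Same parameters W, different input sizes X.
full = conv_macs(512, 512, 64, 64)      # full-resolution input
small = conv_macs(256, 256, 64, 64)     # spatially compressed input
print(f"full: {full / 1e9:.2f} GMACs, compressed: {small / 1e9:.2f} GMACs "
      f"({full / small:.0f}x fewer)")
```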
Fig. 3: The proposed Spatial Knowledge Distillation (Spatial KD). The spatially low-
demand generator is fed with a sequence of low-resolution semantic maps and outputs
full-resolution results. The results of the spatially low-demand generator are used for
spatial knowledge distillation.
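A plausible form of the spatial knowledge distillation objective in Eq. (1), consistent with the terms defined below (the weight λ is an assumption rather than the paper's exact formulation), is

\[
\mathcal{L}_{\mathrm{SKD}} = \sum_{t=0}^{T} \Big[ \mathrm{MSE}\big(Y_t, Y'_t\big) + \lambda\, \mathcal{L}_{\mathrm{per}}\big(Y_t, Y'_t\big) \Big],
\]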
where t denotes the current timestep, T is the total number of timesteps in the sequence, LSKD denotes the spatial knowledge distillation loss, Y is the output sequence of the teacher network, Y′ is the predicted sequence of the spatially low-demand generator, MSE represents the mean squared error between two frames, and Lper denotes a perceptual loss [44].
Each video sequence consists of dense video frames, which imposes an enormous burden on computational devices. How to efficiently synthesize dense frames from a sequence of semantic maps is therefore a difficult yet important issue for lightweight vid2vid models.
In Section 3.3, we obtain a spatially low-demand generator. To ease the burden of generating dense frames for each video, we re-train the spatially low-demand generator on sparse video sequences, which are uniformly sampled from dense video sequences with a sampling interval that is randomly selected in each training iteration. The original vid2vid generator is regarded as the full-time teacher generator.
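A minimal sketch of this random-interval sampling (the function name and the max_gap bound are hypothetical; the paper only states that the interval is re-drawn every training iteration):

```python
import random

def sample_sparse_sequence(frames, max_gap):
    """Uniformly subsample a dense frame sequence with a random interval g, 1 < g <= max_gap."""
    g = random.randint(2, max_gap)
    return frames[::g], g

dense = list(range(30))                     # stand-in for 30 dense frames
sparse, g = sample_sparse_sequence(dense, max_gap=8)
print(f"interval g={g}, sparse indices: {sparse}")
```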
Fig. 4: The proposed temporal-aware knowledge distillation (Temporal KD). The full-
time generator and part-time generator synthesize the current frame using the previous
frames and the semantic maps. The full-time teacher generator takes full-resolution
semantic maps as inputs and generates a full sequence, while the part-time student
generator takes only several low-resolution semantic maps as inputs and generates the
corresponding frames at random intervals.
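The frame generation of the full-time teacher generator can be formulated as follows (a plausible reconstruction based on the notation defined below; the exact form is an assumption):

\[
Y_k = f_{FG}\big(\{X\}_{k-p}^{k},\; \{Y\}_{k-p}^{k-1}\big),
\]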
where Y_k denotes the current frame predicted by the full-time teacher generator, f_FG denotes the generation function of the full-time teacher generator, {X}^k_{k−p} denotes p+1 frames of semantic maps, and {Y}^{k−1}_{k−p} denotes the previous p generated frames.
Different from the full-time teacher generator, whose uniform sampling interval is 1, the part-time student generator uses a uniform sampling interval g with 1 < g < T, where g is randomly selected in each training iteration. Similarly, the frame generation of the part-time student generator can be formulated as follows:
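(A plausible reconstruction based on the notation defined below; the exact form is an assumption.)

\[
Y^{*}_{k} = f_{PG}\big(f_{R_d}(\{X\}^{*k}_{k-p}),\; \{Y\}^{*k-1}_{k-p}\big),
\]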
where Y*_k denotes the current frame predicted by the part-time student generator, f_PG denotes the generation function of the part-time student generator, f_Rd denotes the function that reduces the spatial resolution by a factor of (2^d)^2, {X}*^k_{k−p} includes p frames of the sparse semantic sequence, and {Y}*^{k−1}_{k−p} denotes the previous frames of the synthesized sparse sequence.
To better illustrate our proposed Temporal KD, we set p = 1 in Fig. 4. Specifically, the full-time teacher generator takes a semantic sequence {X}^T_0 as input and generates an entire sequence {Y}^T_0 frame by frame. For the k-th frame synthesis, the full-time generator takes X_{k−1}, X_k, and Y_{k−1} as input and generates Y_k.
Because the full-time teacher generator is trained on sequences with dense frames and learns to generate dense coherent frames, it cannot directly skip frames to generate sparse frames, which leads to expensive computational cost. The part-time student generator can generate sparse frames and interpolate the intermediate frames at slight computational cost. However, since the part-time student generator is trained on sequences with sparsely sampled frames, the sampling rate can fall below twice the temporal motion frequency, which leads to aliasing according to the Nyquist–Shannon sampling theorem. Our preliminary experiments also show that large changes between two non-adjacent frames cause remarkable inter-frame incoherence and degraded results.
Local Temporal-aware Knowledge Distillation. We first introduce local temporal-aware knowledge distillation to optimize the part-time student generator. Our goal is to distill knowledge from the full-time generator to the part-time student generator to reduce aliasing. A straightforward idea is to align the outputs of the full-time generator with those of the part-time student generator and reduce the distances between the corresponding synthesized frame pairs. The loss function of local temporal-aware knowledge distillation is given by
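(A plausible form; the sampled index set S and the weight λ are assumptions rather than the paper's exact terms.)

\[
\mathcal{L}_{\mathrm{TKD\text{-}local}} = \sum_{k \in \mathcal{S}} \Big[ \mathrm{MSE}\big(Y_k, Y^{*}_{k}\big) + \lambda\, \mathcal{L}_{\mathrm{per}}\big(Y_k, Y^{*}_{k}\big) \Big],
\]

where S denotes the set of sparse timesteps synthesized by the part-time student generator.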
Fig. 5: The proposed temporal loss for global temporal-aware knowledge distillation. Time-series coherence features are extracted from the teacher sequence (seq-t) and the student sequence (seq-s) by a well-trained I3D model to compute the distance between them.
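A minimal sketch of the global objective LGTKD of Eq. (5), consistent with the figure; the feature extractor φI3D and the distance D between feature distributions are assumed notation rather than the paper's exact formulation:

\[
\mathcal{L}_{\mathrm{GTKD}} = D\big(\phi_{\mathrm{I3D}}(\text{seq-t}),\; \phi_{\mathrm{I3D}}(\text{seq-s})\big).
\]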
The temporal compression further greatly reduces the computational cost compared with the original vid2vid generator. However, the part-time student generator can only synthesize sparse frames Y′. To compensate for this, we use a fast motion compensation method [35], a zero-parameter algorithm, to complete the sequence. Motion compensation enables the synthesis of the intermediate frames between key frames. As adjacent frames change only slightly, the final results retain reliable visual quality while temporal redundancy is reduced. During inference, another question is which frames should be synthesized by the part-time student generator as sparse frames, and how to determine these key frames without access to photo-realistic frames. In this paper, we surprisingly find that we can identify the key frames {X′_k | k ∈ K}, where K is the set of key-frame indices, directly from the semantic maps {X}^T_0. With the key semantic maps, the part-time student generator generates sparse frames {Y′_k | k ∈ K}, and finally we interpolate the remaining intermediate frames to obtain a full result sequence {Y} ∈ R^{T×H×W} using the fast motion compensation method.
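The overall motion-aware inference can be summarized by the following sketch; every function argument here is a placeholder for the components described above (key-frame selector, part-time student generator, motion compensation), not the authors' actual interfaces:

```python
def motion_aware_inference(semantic_maps, select_keyframes,
                           part_time_generator, motion_compensate):
    """semantic_maps: the T low-resolution semantic maps {X'}.
    Returns a full sequence {Y} of T frames."""
    key_ids = select_keyframes(semantic_maps)   # set K, found from semantic maps only
    key_frames = {k: part_time_generator(semantic_maps, k) for k in sorted(key_ids)}
    # Remaining frames are filled in by zero-parameter motion compensation
    # between the nearest synthesized key frames.
    return motion_compensate(key_frames, num_frames=len(semantic_maps))
```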
4 Experiments
FID measures the distance between the distributions of real and pseudo images using a well-trained classifier network. FVD reveals the similarity between the real video and the synthesized video. Pose error measures the absolute error in pixels between the estimated rendered poses and the original rendered poses predicted by OpenPose [3] and DensePose [16]. Lower scores on all three metrics indicate better performance.
Key-frame Selection. We first compute the residual maps between adjacent frames and sum up each map, then draw smooth statistical curves using sliding windows. The peaks of the curves correspond to local maxima of the difference between adjacent frames and are selected as key frames. Note that our key-frame selection only takes about 0.5 milliseconds to process a 30-frame video.
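A minimal sketch of this selection procedure, assuming the semantic maps are given as a (T, H, W) array; the smoothing and peak-picking details are our assumptions, not the exact implementation:

```python
import numpy as np

def select_keyframes(semantic_maps, window=3):
    """Pick key frames at local peaks of the smoothed inter-frame residual."""
    maps = np.asarray(semantic_maps, dtype=np.float32)
    residual = np.abs(np.diff(maps, axis=0)).sum(axis=(1, 2))  # per-step total change
    kernel = np.ones(window) / window
    smooth = np.convolve(residual, kernel, mode="same")        # sliding-window average
    peaks = [t + 1 for t in range(1, len(smooth) - 1)
             if smooth[t] > smooth[t - 1] and smooth[t] >= smooth[t + 1]]
    return sorted(set([0] + peaks + [len(maps) - 1]))          # always keep the endpoints

print(select_keyframes(np.random.rand(30, 64, 64)))            # toy 30-frame example
```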
Motion Compensation. In video compression, motion compensation predicts a video frame from previous and future frames, and it leaves fewer artifacts than linear interpolation. We adopt overlapped block motion compensation (OBMC) [35] and the enhanced predictive zonal search (EPZS) method [40] to generate the non-key frames via the FFmpeg toolbox. EPZS consumes about 2 MACs for each 16×16 patch and OBMC consumes 5 MACs per pixel, requiring about 0.0008146G MACs per video frame (512×512 resolution), which is far less than our generative model part (282G MACs).
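For reference, the non-key frames could be generated with FFmpeg's motion-compensated interpolation filter, which supports OBMC and EPZS; the exact settings below are illustrative, not the authors' configuration:

```python
import subprocess

def motion_compensate(sparse_video, output_video, target_fps=30):
    """Fill in intermediate frames with FFmpeg's minterpolate filter (OBMC + EPZS)."""
    vf = f"minterpolate=fps={target_fps}:mi_mode=mci:mc_mode=obmc:me=epzs"
    subprocess.run(["ffmpeg", "-y", "-i", sparse_video, "-vf", vf, output_video],
                   check=True)
```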
[Fig. 6 plots: (a) FVD vs. FID for w/o TKD, TKD-local, TKD-global, MMD-loss, LSTM, and STKD; (b) FVD vs. key-frame selection window size (3 to 10) for Sketch2Face, Segmentation2City, and Pose2Body.]
Fig. 6: Ablation studies for Fast-Vid2Vid. (a) Ablation study of the temporal loss. (b) Trade-off experiments for the window size of the key-frame selector. Larger windows mean fewer MACs.
in CA and CAT, and use NAS to find the best network configuration with similar MACs.
The experimental results are shown in Table 1. Given a lower computational budget, our method achieves the best FID and FVD on all three datasets. Specifically, the other GAN compression methods perform worse than the full-size model, while our method slightly outperforms the original model. The other compression methods speed up the original model by simply pruning the network structure and ignore temporal coherence. Meanwhile, the original vid2vid model accumulates significant errors during inference, whereas our proposed motion-aware inference accumulates fewer errors since it only generates a few frames of the sequence. These results show the advantage of our spatial-temporal aware compression method.
We adopt the face video dataset as the benchmark for our ablation studies.
Effectiveness of Temporal KD Loss. Based on the spatially low-demand generator described above, we analyze knowledge distillation for the vid2vid model. We compare six distillation schemes: (1) w/o TKD: the spatially low-demand generator is retrained on the dataset without distillation; (2) TKD-local: only local knowledge is transferred from the teacher net; (3) TKD-global: only global knowledge is transferred from the teacher net; (4) MMD: knowledge is transferred using the MMD loss [10]; (5) LSTM: knowledge is transferred with LSTM-based regularization [47]; (6) TKD: both local and global knowledge are transferred from the teacher net.
Table 2: Ablation Study for spatial compression with the proposed Temporal KD.
Method MACs(G) FPS FID FVD
CA 331 17.00 36.65 6.76
CAT 310 18.02 35.64 6.85
NAS 344 16.78 32.41 6.71
Spatial KD 282 18.56 29.02 5.79
As shown in Fig. 6(a), the local knowledge distillation loss improves upon the model trained without KD. The global temporal KD loss further improves upon the common local KD loss, especially in FVD. Moreover, our proposed KD loss outperforms the MMD loss and the LSTM-based KD loss. This indicates that the temporal KD loss effectively increases the similarity, measured in the feature space of a video recognition network, between the distributions of videos generated by the teacher and student networks. We also provide a qualitative comparison among these KD methods in Fig. 7; our STKD generates more photo-realistic frames than the others.
Effectiveness of Spatial KD Loss. We conduct an ablative study of Spatial KD on the Sketch2Face benchmark. In the video setting, spatial compression methods are combined with our proposed Temporal KD to perform vid2vid compression. Table 2 shows that our proposed Spatial KD performs better than the other image compression methods. Our Spatial KD does not alter the network structure of the original GAN, whereas the other methods modify its carefully tuned parameters.
Impact of Window Size for Key-frame Selection. We investigate the sliding window used to select key frames. A larger sliding window means that fewer key frames are selected, which uses fewer computational resources. We aim to find the best trade-off between the sliding window size and performance. As shown in Fig. 6(b), FVD rises significantly as the sliding window grows, and the best performance is achieved with a window size of three on all three tasks. This indicates that the part-time student generator needs enough independent motion to maintain decent performance.
Effectiveness of Interpolation. We compare two common interpolation methods for completing the video, namely linear interpolation and motion compensation. As shown in Table 3, motion compensation outperforms simple linear interpolation.
Fig. 8: Qualitative results compared with advanced GAN compression methods on the Sketch2Face, Segmentation2City, and Pose2Body tasks.
5 Discussion
We discuss some future directions for this work. Recently, sequence-in, sequence-out methods such as Transformers have become challenging targets for model compression. In contrast, our Fast-Vid2Vid accelerates vid2vid by optimizing a part-time student generator (via temporal-aware KD compression) and a lower-resolution spatial generator (via spatial KD compression), which is versatile across various networks. When combined with sequence-in, sequence-out Transformers such as VisTR [46], Fast-Vid2Vid would first synthesize a partial video with a part-time Transformer-based student generator (via fully parallel computation) and then recover the full video by motion compensation.
6 Conclusion
Acknowledgements
This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088), and
under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects
(IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the
industry partner(s).
References
1. Aguinaldo, A., Chiang, P.Y., Gain, A., Patil, A., Pearson, K., Feizi, S.: Compressing
gans using knowledge distillation. arXiv preprint arXiv:1902.00159 (2019) 2, 5
2. Belousov, S.: Mobilestylegan: A lightweight convolutional neural network for high-
fidelity image synthesis. arXiv preprint arXiv:2104.04767 (2021) 2, 5
3. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation
using part affinity fields. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 7291–7299 (2017) 12
4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the
kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 6299–6308 (2017) 10
5. Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 5933–5942
(2019) 2, 4
6. Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object
detection models with knowledge distillation. Advances in neural information pro-
cessing systems 30 (2017) 5
7. Chen, H., Wang, Y., Shu, H., Wen, C., Xu, C., Shi, B., Xu, C., Xu, C.: Distilling
portable generative adversarial networks for image translation. In: Proceedings of
the AAAI Conference on Artificial Intelligence. vol. 34, pp. 3585–3592 (2020) 2, 5
8. Chu, M., Xie, Y., Mayer, J., Leal-Taixé, L., Thuerey, N.: Learning temporal co-
herence via self-supervision for gan-based video generation. ACM Transactions on
Graphics (TOG) 39(4), 75–1 (2020) 4
9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 3213–3223 (2016) 11
10. Feng, Z., Lai, J., Xie, X.: Resolution-aware knowledge distillation for efficient in-
ference. IEEE Transactions on Image Processing 30, 6985–6996 (2021) 3, 5, 6,
13
11. Fu, Y., Chen, W., Wang, H., Li, H., Lin, Y., Wang, Z.: Autogan-distiller: Search-
ing to compress generative adversarial networks. arXiv preprint arXiv:2006.08198
(2020) 2, 5
12. Gafni, O., Wolf, L., Taigman, Y.: Vid2game: Controllable characters extracted
from real-world videos. arXiv preprint arXiv:1904.08379 (2019) 4
13. Gao, C., Chen, Y., Liu, S., Tan, Z., Yan, S.: Adversarialnas: Adversarial neural
architecture search for gans. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 5680–5689 (2020) 2
14. Gong, X., Chang, S., Jiang, Y., Wang, Z.: Autogan: Neural architecture search for
generative adversarial networks. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. pp. 3224–3234 (2019) 2
15. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural
Information Processing Systems. vol. 27 (2014) 2
16. Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation
in the wild. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp. 7297–7306 (2018) 12
17. Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for
efficient neural networks. arXiv preprint arXiv:1506.02626 (2015) 4
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016) 2
19. He, Y., Kang, G., Dong, X., Fu, Y., Yang, Y.: Soft filter pruning for accelerating
deep convolutional neural networks. arXiv preprint arXiv:1808.06866 (2018) 4
20. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural
networks. In: Proceedings of the IEEE international conference on computer vision.
pp. 1389–1397 (2017) 4
21. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained
by a two time-scale update rule converge to a local nash equilibrium. Advances in
neural information processing systems 30 (2017) 11
22. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531 (2015) 4
23. Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu,
Y., Pang, R., Vasudevan, V., et al.: Searching for mobilenetv3. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 1314–1324
(2019) 6
24. Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016) 4
25. Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: A data-driven
neuron pruning approach towards efficient deep architectures. arXiv preprint
arXiv:1607.03250 (2016) 4
26. Jin, Q., Ren, J., Woodford, O.J., Wang, J., Yuan, G., Wang, Y., Tulyakov, S.:
Teachers do more than teach: Compressing image-to-image models. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
13600–13611 (2021) 2, 5, 11, 12
27. Kappel, M., Golyanik, V., Elgharib, M., Henningson, J.O., Seidel, H.P., Castillo,
S., Theobalt, C., Magnor, M.: High-fidelity neural human motion transfer from
monocular video. In: Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition. pp. 1541–1550 (2021) 2, 4
28. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient
convnets. arXiv preprint arXiv:1608.08710 (2016) 4
29. Li, M., Lin, J., Ding, Y., Liu, Z., Zhu, J.Y., Han, S.: Gan compression: Efficient
architectures for interactive conditional gans. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 5284–5294 (2020)
2, 5, 6, 11, 12
30. Lin, J., Zhang, R., Ganz, F., Han, S., Zhu, J.Y.: Anycost gans for interactive image
synthesis and editing. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 14986–14996 (2021) 2
31. Liu, Y., Shu, Z., Li, Y., Lin, Z., Perazzi, F., Kung, S.Y.: Content-aware gan com-
pression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 12156–12166 (2021) 2, 4, 5, 11, 12
32. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolu-
tional networks through network slimming. In: Proceedings of the IEEE interna-
tional conference on computer vision. pp. 2736–2744 (2017) 4
33. Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.: Unifying distillation and
privileged information. arXiv preprint arXiv:1511.03643 (2015) 5
34. Luo, P., Zhu, Z., Liu, Z., Wang, X., Tang, X.: Face model compression by distilling
knowledge from neurons. In: Thirtieth AAAI conference on artificial intelligence
(2016) 5
51. Yu, C., Pool, J.: Self-supervised gan compression. arXiv preprint arXiv:2007.01491
(2020) 4
52. Zhang, T., Ye, S., Zhang, K., Tang, J., Wen, W., Fardad, M., Wang, Y.: A system-
atic dnn weight pruning framework using alternating direction method of multi-
pliers. In: Proceedings of the European Conference on Computer Vision (ECCV).
pp. 184–199 (2018) 4
53. Zhou, Y., Wang, Z., Fang, C., Bui, T., Berg, T.: Dance dance generation: Mo-
tion transfer for internet videos. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision Workshops. pp. 0–0 (2019) 2, 4
54. Zou, X., Yang, L., Liu, D., Lee, Y.J.: Progressive temporal feature alignment net-
work for video inpainting. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition. pp. 16448–16457 (2021) 4
Fig. 9: Qualitative results on the testing data compared with advanced GAN compression methods on the Sketch2Face task. From top to bottom, the rows are semantic maps, CA's results, CAT's results, NAS's results, Vid2Vid's results, Fast-Vid2Vid's results, and the ground truth.
Fig. 10: Qualitative results on the testing data compared with advanced GAN compression methods on the Segmentation2City task. From top to bottom, the rows are semantic maps, CA's results, CAT's results, NAS's results, Vid2Vid's results, Fast-Vid2Vid's results, and the ground truth.
Fig. 11: Qualitative results on the testing data compared with advanced GAN compression methods on the Pose2Body task. From top to bottom, the rows are semantic maps, CA's results, CAT's results, NAS's results, Vid2Vid's results, Fast-Vid2Vid's results, and the ground truth.