Structure and Content-Guided Video Synthesis With Diffusion Models (Esser et al., ICCV 2023)
Figure 1. Guided Video Synthesis: We present an approach based on latent video diffusion models that synthesizes videos (top and bottom) guided by content described through text (top) or images (bottom) while keeping the structure of an input video (middle).
Abstract
Current methods repurpose existing image models by either propagating edits with approaches that compute explicit correspondences [5] or by finetuning on each individual video [63]. We aim to circumvent expensive per-video training and correspondence calculation to achieve fast inference for arbitrary videos.

We propose a controllable structure- and content-aware latent video diffusion model trained on a large-scale dataset of uncaptioned videos and images. We opt to represent structure with monocular depth estimates, and content with embeddings predicted by a pre-trained neural network. Our approach offers several powerful modes of control. First, we train our model such that the content of inferred videos, e.g. their appearance or style, matches user-provided images or text prompts (Fig. 1). Second, we vary the fidelity of the structure representation during training to allow selecting the strength of structure preservation at test time. Finally, we adjust the inference process via a custom guidance method, inspired by classifier-free guidance, to enable control over temporal consistency.

In summary, we present the following contributions:

• We extend latent diffusion models to video generation by introducing temporal layers into a pre-trained image model and by joint training on images and videos.

• We present a structure- and content-aware model that edits videos given example images or text. Our method does not require per-video training or pre-processing.

• We demonstrate full control over temporal, content, and structure consistency. We show for the first time that joint image-video training enables control over temporal stability, and that training on varying levels of detail in the structure representation allows choosing the desired level of preservation during inference.

• We show that our approach is preferred over several other approaches in a user study. We further improve the accuracy of previously unseen content by finetuning on a small set of images of the desired subject.

2. Related Work

Controllable video editing and media synthesis is an active area of research. In this section, we review prior work in related areas and connect our method to these approaches.

Unconditional video generation  Generative adversarial networks (GANs) [12] can learn to synthesize videos based on specific training data [59, 45, 1, 56]. These methods often struggle with stability during optimization and produce fixed-length videos [59, 45] or longer videos in which artifacts accumulate over time [50]. [6] synthesize longer videos using a GAN with a better encoding of the time axis. Autoregressive transformers have also been proposed for unconditional video generation [11, 64]. Our focus is on providing user control over the synthesis process, whereas these approaches are limited to sampling random content resembling their training distribution.

Diffusion models for image synthesis  Diffusion models (DMs) [51, 53] can synthesize detailed media in many formats, such as images [34, 38], 3D shapes [66], and animations [54]. Many works improve diffusion-based image synthesis by changing the parameterization [14, 27, 46], introducing advanced sampling methods [52, 24, 22, 47, 20], designing more powerful architectures [3, 15, 57, 30], or conditioning on additional information [25]. Text-conditioning, based on embeddings from CLIP [32] or T5 [33], has become a particularly powerful approach for providing artistic control over model output [44, 28, 34, 3, 65, 10]. Latent diffusion models (LDMs) [38] perform diffusion in a compressed latent space, reducing memory requirements and runtime. Our video model is an LDM trained simultaneously on videos and images.

Diffusion models for video synthesis  Recently, diffusion models, masked generative models, and autoregressive models have been applied to text-conditioned video synthesis [17, 13, 58, 67, 18, 49]. Similar to [17] and [49], we extend image synthesis diffusion models to video generation by introducing temporal connections into a pre-existing image model (a minimal sketch of this idea follows at the end of this section). Our model edits videos rather than synthesizing them from scratch. We demonstrate through a user study that our model, with explicit conditioning on structure, is preferred over other related approaches.

Video translation and propagation  Image-to-image translation models, such as pix2pix [19, 62], can modify each frame in a video individually. This produces temporal inconsistencies as the time axis is ignored. Accounting for temporal or geometric information, such as flow, can increase consistency across frames when repurposing image synthesis models [42, 9]. We can extract such structural information to aid our spatio-temporal LDM in text- and image-guided video synthesis. Many generative adversarial methods, such as vid2vid [61, 60], leverage this type of input to guide synthesis.

Video style transfer takes a reference style image and statistically applies its style to an input video [40, 8, 55]. In contrast, our method edits both style and content while preserving the structure of a video, instead of matching feature statistics only. Text2Live [5] allows editing input videos using text prompts by decomposing a video into neural layers [21]. Once available, a layered video representation [37] provides consistent propagation across frames. SinFusion [29] and Tune-a-Video [63] use diffusion models to edit videos but require per-video training, which limits the practicality of these approaches in creative tools. We opt instead to train our model on a large-scale dataset, permitting inference on any video without individual training.
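As referenced above, both our first contribution and the video-diffusion discussion rely on adding temporal connections to a pretrained image model. The snippet below is a minimal, hypothetical PyTorch sketch of that idea rather than the paper's actual architecture: a residual temporal self-attention layer over the frame axis whose output projection is zero-initialized, so the pretrained image model's behavior is unchanged at the start of fine-tuning.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, inserted after a spatial block.

    Hypothetical sketch: the output projection is zero-initialized so that,
    before any video fine-tuning, the block is an identity mapping and the
    pretrained image model is undisturbed.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Attend across frames independently at every spatial location.
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y = self.norm(y)
        y, _ = self.attn(y, y, y, need_weights=False)
        y = self.proj(y).reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + y  # residual connection around the temporal layer

if __name__ == "__main__":
    layer = TemporalAttention(channels=64)
    video_features = torch.randn(2, 8, 64, 16, 16)  # batch of two 8-frame clips
    print(layer(video_features).shape)  # torch.Size([2, 8, 64, 16, 16])
```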
Figure 2. Overview: During training (left), input videos x are encoded to z0 with a fixed encoder E and diffused to zt. We extract a structure representation s by encoding depth maps obtained with MiDaS, and a content representation c by encoding one of the frames with CLIP. The model then learns to reverse the diffusion process in the latent space, with the help of s, which gets concatenated to zt, as well as c, which is provided via cross-attention blocks. During inference (right), the structure s of an input video is provided in the same manner. To specify content via text, we convert CLIP text embeddings to image embeddings via a prior.
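The training-time data flow in Figure 2 can be made concrete with the following sketch. It assumes hypothetical stand-ins for the components named in the figure (midas for the depth estimator, clip_image_encoder for CLIP's image tower, latent_encoder for the fixed encoder E, denoiser for the latent diffusion UNet, and a scheduler with an assumed add_noise method); none of these reflect the paper's actual interfaces.

```python
import torch
import torch.nn.functional as F

def training_step(video, midas, clip_image_encoder, latent_encoder, denoiser, scheduler):
    """One hypothetical training step following the Fig. 2 data flow.

    video: (batch, frames, 3, H, W). All callables are assumed stand-ins for
    the components named in the figure plus a noise scheduler.
    """
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)

    # z0: latents of the input frames produced by the fixed encoder E.
    z0 = latent_encoder(frames)

    # Structure s: encode per-frame depth maps estimated by MiDaS
    # (replicated to three channels so the same encoder can be reused).
    depth = midas(frames)                                  # (b*t, 1, H, W), assumed
    s = latent_encoder(depth.repeat(1, 3, 1, 1))

    # Content c: CLIP image embedding of one randomly chosen frame per clip,
    # repeated so every frame of the clip sees the same content embedding.
    idx = torch.randint(0, t, (b,))
    content = clip_image_encoder(video[torch.arange(b), idx])
    content = content.repeat_interleave(t, dim=0)

    # Diffuse z0 to zt and predict the noise, conditioning on s (concatenated
    # along the channel axis) and on c (passed to the cross-attention blocks).
    # For simplicity one timestep is drawn per frame; a video model would
    # typically share the timestep within a clip.
    noise = torch.randn_like(z0)
    timestep = torch.randint(0, 1000, (b * t,))
    zt = scheduler.add_noise(z0, noise, timestep)
    pred = denoiser(torch.cat([zt, s], dim=1), timestep, context=content)

    return F.mse_loss(pred, noise)
```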
In contrast, during inference, structure s and content c are derived from an input video y and from a text prompt t, respectively. An edited version x of y is obtained by sampling the generative model conditioned on s(y) and c(t), i.e., x ~ p(x | s(y), c(t)).
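A schematic of this inference procedure, under the same assumptions as the training sketch above: prior, clip_text_encoder, latent_decoder, and the scheduler's set_timesteps/step interface are hypothetical stand-ins rather than the paper's API.

```python
import torch

@torch.no_grad()
def edit_video(input_video, prompt, midas, latent_encoder, latent_decoder,
               clip_text_encoder, prior, denoiser, scheduler, num_steps=50):
    """Hypothetical sampling loop: keep structure s(y), impose content c(t)."""
    b, t, c, h, w = input_video.shape
    frames = input_video.reshape(b * t, c, h, w)

    # Structure from the input video y (same encoding as at training time).
    s = latent_encoder(midas(frames).repeat(1, 3, 1, 1))

    # Content: CLIP text embedding mapped to an image embedding by the prior.
    content = prior(clip_text_encoder(prompt)).repeat_interleave(b * t, dim=0)

    # Start from noise and denoise conditioned on s(y) and c(t),
    # i.e. sample x ~ p(x | s(y), c(t)).
    z = torch.randn_like(s)
    scheduler.set_timesteps(num_steps)
    for timestep in scheduler.timesteps:
        pred = denoiser(torch.cat([z, s], dim=1), timestep, context=content)
        z = scheduler.step(pred, timestep, z).prev_sample
    return latent_decoder(z).reshape(b, t, c, h, w)
```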
Figure 4. Temporal Control: By training image and video models jointly, we obtain explicit control over the temporal consistency of edited videos via a temporal guidance scale ωt. On the left, frame consistency measured via CLIP cosine similarity of consecutive frames increases monotonically with ωt, while the mean squared error between frames warped with optical flow decreases monotonically. On the right, lower scales (0.5 in the middle row) achieve edits with a "hand-drawn" look, whereas higher scales (1.5 in the bottom row) result in smoother results. The top row shows the original input video; the two edits use the prompt "pencil sketch of a man looking at the camera".
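The exact guidance formula is not reproduced in this section, but one plausible, classifier-free-guidance-style reading of the temporal scale ωt in Figure 4 is a blend between a per-frame (image-mode) prediction and the full video-mode prediction, sketched below; the temporal flag on the denoiser is an assumed interface, not the paper's.

```python
import torch

def temporal_guidance(denoiser, zt_and_s, timestep, content, omega_t: float):
    """Blend video-mode and per-frame predictions with a temporal scale omega_t.

    Assumption: the jointly trained denoiser can run with its temporal layers
    active (video mode) or bypassed (per-frame image mode); the `temporal`
    flag is hypothetical. omega_t > 1 pushes toward temporally smoother edits,
    omega_t < 1 lets frames vary more freely (cf. Fig. 4).
    """
    eps_image = denoiser(zt_and_s, timestep, context=content, temporal=False)
    eps_video = denoiser(zt_and_s, timestep, context=content, temporal=True)
    return eps_image + omega_t * (eps_video - eps_image)
```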
Figure 5. Our approach enables a wide range of video edits, including changes to animation styles such as anime or claymation, changes of environment such as time of day, and changing characters such as humans to aliens. Prompts shown in the figure: "a man using a laptop inside a train, anime style"; "a woman and man take selfies while walking down the street, claymation"; "kite-surfer in the ocean at sunset"; "alien explorer"; "hiking in the mountains".
Figure 6. Prompt-vs-frame consistency: Image models such as SD-Depth achieve good prompt consistency but fail to produce consistent edits across frames. Propagation-based approaches such as IVS and Text2Live increase frame consistency but fail to provide edits reflecting the prompt accurately. Our method achieves the best combination of frame and prompt consistency.

Figure 7. User Preferences: Based on our user study, the results from our model are preferred over the baseline models.

Our experiments demonstrate that this approach controls temporal consistency in the outputs, see Fig. 4.
Figure 8. Background Editing: Masking the denoising process allows us to restrict edits to backgrounds for more control over results. Rows in the figure: input, mask, and an edit with the prompt "A snowboarder in a snow park on the mountain".
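Masking the denoising process, as described in the caption above (and noted in Sec. 4.1 to resemble inpainting with diffusion models [43, 25]), can be sketched as follows. This is a hedged, RePaint-style illustration that reuses the hypothetical names from the earlier sketches: the region to be kept is overwritten at every step with a re-noised copy of the original latent, so only the unmasked region is actually edited.

```python
import torch

def masked_step(z, z0, mask, timestep, pred, scheduler, noise=None):
    """One masked denoising step: edit outside the mask, keep content inside it.

    z    : current latent being denoised
    z0   : latent of the original input video (from encoder E)
    mask : 1 where original content is retained, 0 where edits are allowed
    pred : denoiser output for this step (already conditioned on s and c)
    The scheduler is the same assumed stand-in as in the earlier sketches.
    """
    noise = torch.randn_like(z0) if noise is None else noise
    # Advance the edited region with an ordinary denoising step.
    z_edited = scheduler.step(pred, timestep, z).prev_sample
    # Re-noise the original latent to (approximately) the matching noise level
    # so the retained region stays consistent with the current step.
    z_kept = scheduler.add_noise(z0, noise, timestep)
    return mask * z_kept + (1.0 - mask) * z_edited
```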
ties. For example, we can produce various animation styles, changes in time of day, and more complex changes of subject, such as turning a hiker into an alien (see Fig. 5). Please see the supplementary material for more results.

Using CLIP image embeddings as the content representation allows users to specify content through images. As an example application, we demonstrate character replacement in Fig. 9. For every video in a set of six videos, we re-synthesize it five times, each time providing a single content image taken from another video in the set. We can retain content characteristics with ts = 3 despite large differences in their pose and shape.

Lastly, we illustrate the use of masked video editing in Fig. 8, where the model predicts everything outside the masked area(s) while retaining the original content inside the masked area. Notably, this technique resembles inpainting with diffusion models [43, 25].

4.2. User Study

We benchmark against Text2Live [5], a recent approach for text-guided video editing that employs layered neural atlases [21]. As a baseline, we compare against SDEdit [26] in two ways: per-frame generated results and a first-frame result propagated by a few-shot video stylization method [55] (IVS). We also include two depth-based versions of Stable Diffusion: one trained with depth conditioning [2] and one that retains past results based on depth estimates [9]. We also include an ablation: applying SDEdit to our video model trained without conditioning on a structure representation (ours, ∼s).

We judge the success of our method qualitatively based on a user study. We run the user study using Amazon Mechanical Turk (AMT) on an evaluation set of 35 representative video editing prompts. For each example, we ask 5 annotators to compare faithfulness to the video editing prompt ("Which video better represents the provided edited caption?") between a baseline and our method, presented in random order, and use a majority vote for the final result.

The results can be found in Fig. 7. Across all compared methods, results from our approach are preferred roughly 3 out of 4 times. A visual comparison among the methods can be found in the supplementary. We observe that SDEdit is sensitive to the editing strength: low values fail to achieve the desired editing effect, whereas high values change the structure of the input. Even with a fixed seed, both style and structure can change in unnatural ways between frames, as their relationship is ignored by image-based approaches. Propagation of SDEdit outputs (IVS) leads to more consistent results but often introduces propagation artifacts, especially with large motion. Depth-SD produces accurate, structure-preserving edits for individual frames, but frames are inconsistent across time. The outputs of Text2Live tend to be temporally smooth due to its reliance on Layered Neural Atlases [21], but it often produces edits that represent the edit prompt inaccurately. A direct comparison with Text2Live is difficult as it requires input masks and separate edit prompts for foreground and background. In addition, computing a neural atlas takes about 10 hours, whereas our approach requires approximately a minute.

4.3. Quantitative Evaluation

We quantify trade-offs between frame consistency and prompt consistency with the following two metrics.

Frame consistency  We compute CLIP image embeddings on all frames of output videos and report the average cosine similarity between all pairs of consecutive frames.

Prompt consistency  We compute CLIP image embeddings on all frames of output videos and the CLIP text embedding of the edit prompt. We report the average cosine similarity between text and image embeddings over all frames.

Fig. 6 shows the results of each model using our frame consistency and prompt consistency metrics. Our model tends to outperform the baseline models in both aspects (placed higher in the upper-right quadrant of the graph). We also notice a slight tradeoff with increasing the strength parameters in the baseline models: larger strength scales imply higher prompt consistency at the cost of lower frame consistency. Increasing the temporal scale (ωt) of our model
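Both metrics reduce to cosine similarities between CLIP embeddings. A small sketch of how they could be computed with the open-source clip package; the paper does not specify which CLIP variant or implementation it uses, so ViT-B/32 below is an assumption.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # variant is an assumption

@torch.no_grad()
def consistency_metrics(frame_paths, prompt):
    """Frame consistency: mean cosine similarity of consecutive frame embeddings.
    Prompt consistency: mean cosine similarity between each frame and the prompt."""
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    img_emb = model.encode_image(images).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    txt_emb = model.encode_text(clip.tokenize([prompt]).to(device)).float()
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    frame_consistency = (img_emb[:-1] * img_emb[1:]).sum(dim=-1).mean().item()
    prompt_consistency = (img_emb @ txt_emb.T).mean().item()
    return frame_consistency, prompt_consistency
```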
Figure 9. Image Prompting: We combine the structure of a driving video (first column) with content from other videos (first row).

Figure 10. Controlling Fidelity: We obtain control over structure and appearance fidelity. Each cell shows three frames produced with decreasing structure fidelity ts (left-to-right) and increasing number of customization training steps (top-to-bottom). The bottom shows examples of images used for customization (red border) and the input image (blue border). Same driving video as in Fig. 1.
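Figures 9 and 10 control structure fidelity through the parameter ts. One plausible realization, assuming ts counts how many times the depth maps are blurred and downsampled before encoding, is sketched below; this operator is an assumption and may differ from the one used in the paper.

```python
import torch
import torch.nn.functional as F

def reduce_structure_fidelity(depth: torch.Tensor, ts: int) -> torch.Tensor:
    """Assumed fidelity-reduction operator: blur/downsample the depth maps
    `ts` times, then resize back to the original resolution.

    depth: (batch, 1, H, W) monocular depth estimates.
    ts = 0 keeps full structural detail; larger ts keeps only coarse structure.
    """
    h, w = depth.shape[-2:]
    out = depth
    for _ in range(ts):
        out = F.avg_pool2d(out, kernel_size=2)  # smooth and halve the resolution
    return F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
```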
References

[1] Dinesh Acharya, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Towards high resolution video generation with progressive growing of sliced wasserstein gans. arXiv preprint arXiv:1810.02419, 2018. 2

[2] Stability AI. Stable diffusion depth. https://github.com/Stability-AI/stablediffusion, 2022. 7

[3] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022. 2, 4

[4] Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S. Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise, 2023.

[5] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723. Springer, 2022. 2, 7

[6] Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A Efros, and Tero Karras. Generating long videos of dynamic scenes. 2022. 2

[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. 6

[8] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 1105–1114, 2017. 2

[9] deforum. Deforum stable diffusion. https://github.com/deforum/stable-diffusion, 2022. 2, 7

[10] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19822–19835. Curran Associates, Inc., 2021. 2

[11] Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. arXiv preprint arXiv:2204.03638, 2022. 2

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. 2

[13] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 2, 3

[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020. 2, 3

[15] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022. 2

[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. 6

[17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022. 2, 4

[18] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022. 2

[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017. 2

[20] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022. 2

[21] Yoni Kasten, Dolev Ofri, Oliver Wang, and Tali Dekel. Layered neural atlases for consistent video editing. ACM Transactions on Graphics (TOG), 40(6):1–12, 2021. 2, 7

[22] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132, 2021. 2, 6

[23] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 6

[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. 2, 6

[25] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022. 2, 7

[26] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021. 7

[27] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021. 2

[28] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16784–16804. PMLR, 17–23 Jul 2022. 2

[29] Yaniv Nikankin, Niv Haim, and Michal Irani. Sinfusion: Training diffusion models on a single image or video. arXiv preprint arXiv:2211.11743, 2022. 2

[30] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2022. 2

[31] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017. 6

[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 4

[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jun 2022. 2

[34] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. 1, 2, 4

[35] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR, 18–24 Jul 2021. 4

[36] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:1623–1637, 2019. 6

[37] Alex Rav-Acha, Pushmeet Kohli, Carsten Rother, and Andrew William Fitzgibbon. Unwrap mosaics: a new representation for video editing. ACM SIGGRAPH 2008 papers, 2008. 2

[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 1, 2, 3, 6

[39] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 3

[40] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style transfer for videos. In Bodo Rosenhahn and Bjoern Andres, editors, Pattern Recognition, pages 26–36, Cham, 2016. Springer International Publishing. 2

[41] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022. 8

[42] Alexander S. Disco diffusion v5.2 - warp fusion. https://github.com/Sxela/DiscoDiffusion-Warp, 2022. 2

[43] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models, 2021. 7

[44] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2

[45] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE international conference on computer vision, pages 2830–2839, 2017. 2

[46] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. 2, 3

[47] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600, 2021. 2

[48] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 4

[49] Uriel Singer, Adam Polyak, Thomas Hayes, Xiaoyue Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022. 2, 4

[50] Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2, 2021. 2

[51] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France, 07–09 Jul 2015. PMLR. 2, 3

[52] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. 2, 6

[53] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 2

[54] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Amit H Bermano, and Daniel Cohen-Or. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022. 2

[55] Ondřej Texler, David Futschik, Michal Kučera, Ondřej Jamriška, Šárka Sochorová, Menglei Chai, Sergey Tulyakov, and Daniel Sýkora. Interactive video stylization using few-shot patch-based training. ACM Transactions on Graphics, 39(4):73, 2020. 2, 7

[56] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1526–1535, 2018. 2

[57] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021. 2

[58] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description, 2022. 2

[59] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Advances in neural information processing systems, 29, 2016. 2

[60] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. Few-shot video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2019. 2

[61] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In Conference on Neural Information Processing Systems (NeurIPS), 2018. 2

[62] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2

[63] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022. 2

[64] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021. 2

[65] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. 2

[66] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 2

[67] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models, 2022. 2