

Structure and Content-Guided Video Synthesis with Diffusion Models

Patrick Esser Johnathan Chiu Parmida Atighehchian


Jonathan Granskog Anastasis Germanidis
Runway
https://research.runwayml.com/gen1

Figure 1. Guided Video Synthesis: We present an approach based on latent video diffusion models that synthesizes videos (top and bottom) guided by content described through text (top) or images (bottom) while keeping the structure of an input video (middle).

Abstract

Text-guided generative diffusion models unlock powerful image creation and editing tools. Recent approaches that edit the content of footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames.

In this work, we present a structure and content-guided video diffusion model that edits videos based on descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. A novel guidance method, enabled by joint video and image training, exposes explicit control over temporal consistency. Our experiments demonstrate a wide variety of successes: fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.

1. Introduction

Demand for more intuitive and performant video editing tools has increased as video-centric platforms have been popularized. But editing in the format is still complex and time-consuming due to the temporal nature of video data. State-of-the-art machine learning models have shown great promise in improving editing workflows.

Generative approaches for image synthesis recently experienced a rapid surge in quality and popularity due to the introduction of powerful diffusion models trained on large-scale datasets. Text-conditioned models, such as DALL-E 2 [34] and Stable Diffusion [38], enable novice users to generate detailed imagery given only a text prompt as input. Latent diffusion models especially enable efficient methods for producing imagery via synthesis in a perceptually compressed space.

Motivated by this progress, we investigate generative models suited for interactive applications in video editing.

Current methods repurpose existing image models by either propagating edits with approaches that compute explicit correspondences [5] or by finetuning on each individual video [63]. We aim to circumvent expensive per-video training and correspondence calculation to achieve fast inference for arbitrary videos.

We propose a controllable structure and content-aware latent video diffusion model trained on a large-scale dataset of uncaptioned videos and images. We opt to represent structure with monocular depth estimates, and content with embeddings predicted by a pre-trained neural network. Our approach offers several powerful modes of control. First, we train our model such that the content of inferred videos, e.g. their appearance or style, match user-provided images or text prompts (Fig. 1). Second, we vary the fidelity of the structure representation during training to allow selecting the strength of the structure preservation at test-time. Finally, we also adjust the inference process via a custom guidance method, inspired by classifier-free guidance, to enable control over temporal consistency.

In summary, we present the following contributions:

• We extend latent diffusion models to video generation by introducing temporal layers into a pre-trained image model and by joint training on images and videos.

• We present a structure and content-aware model that edits videos given example images or text. Our method does not require per-video training or pre-processing.

• We demonstrate full control over temporal, content and structure consistency. We show for the first time that joint image-video training enables control over temporal stability. And, training on varying levels of detail in the structure representation allows choosing the desired level of preservation during inference.

• We show that our approach is preferred over several other approaches in a user study. We further improve the accuracy of previously unseen content by finetuning on a small set of images of the desired subject.

2. Related Work

Controllable video editing and media synthesis is an active area of research. In this section, we review prior work in related areas and connect our method to these approaches.

Unconditional video generation Generative adversarial networks (GANs) [12] can learn to synthesize videos based on specific training data [59, 45, 1, 56]. These methods often struggle with stability during optimization, and produce fixed-length videos [59, 45] or longer videos where artifacts accumulate over time [50]. [6] synthesize longer videos using a GAN with a better encoding of the time axis. Autoregressive transformers have also been proposed for unconditional video generation [11, 64]. Our focus is on providing user control over the synthesis process whereas these approaches are limited to sampling random content resembling their training distribution.

Diffusion models for image synthesis Diffusion models (DMs) [51, 53] can synthesize detailed media in many formats, such as images [34, 38], 3d shapes [66] and animations [54]. Many works improve diffusion-based image synthesis by changing the parameterization [14, 27, 46], introducing advanced sampling methods [52, 24, 22, 47, 20], designing more powerful architectures [3, 15, 57, 30], or conditioning on additional information [25]. Text-conditioning, based on embeddings from CLIP [32] or T5 [33], has become a particularly powerful approach for providing artistic control over model output [44, 28, 34, 3, 65, 10]. Latent diffusion models (LDMs) [38] perform diffusion in a compressed latent space, reducing memory requirements and runtime. Our video model is an LDM trained simultaneously on videos and images.

Diffusion models for video synthesis Recently, diffusion models, masked generative models and autoregressive models have been applied to text-conditioned video synthesis [17, 13, 58, 67, 18, 49]. Similar to [17] and [49], we extend image synthesis diffusion models to video generation by introducing temporal connections into a pre-existing image model. Our model edits videos rather than synthesizing them from scratch. We demonstrate through a user study that our model with explicit conditioning over structure is preferred over other related approaches.

Video translation and propagation Image-to-image translation models, such as pix2pix [19, 62], can modify each frame in a video individually. This produces temporal inconsistencies as the time axis is ignored. Accounting for temporal or geometric information, such as flow, can increase consistency across frames when repurposing image synthesis models [42, 9]. We can extract such structural information to aid our spatio-temporal LDM in text- and image-guided video synthesis. Many generative adversarial methods, such as vid2vid [61, 60], leverage this type of input to guide synthesis.

Video style transfer takes a reference style image and statistically applies its style to an input video [40, 8, 55]. In contrast, our method edits both style and content while preserving the structure of a video instead of matching feature statistics only. Text2Live [5] allows editing input videos using text prompts by decomposing a video into neural layers [21]. Once available, a layered video representation [37] provides consistent propagation across frames. SinFusion [29] and Tune-a-Video [63] use diffusion models to edit videos but require per-video training. This limits the practicality of the approaches in creative tools. We opt to instead train our model on a large-scale dataset permitting inference on any video without individual training.

Figure 2. Overview: During training (left), input videos x are encoded to z0 with a fixed encoder E and diffused to zt . We extract a
structure representation s by encoding depth maps obtained with MiDaS, and a content representation c by encoding one of the frames
with CLIP. The model then learns to reverse the diffusion process in the latent space, with the help of s, which gets concatenated to zt , as
well as c, which is provided via cross-attention blocks. During inference (right), the structure s of an input video is provided in the same
manner. To specify content via text, we convert CLIP text embeddings to image embeddings via a prior.

3. Method

For our purposes, it will be helpful to think of a video in terms of its content and structure. By structure, we refer to characteristics describing geometry and dynamics, e.g. shapes and locations of subjects as well as their temporal changes. We define content as features describing the appearance and semantics of the video, such as the colors and styles of objects and the lighting. The goal of our model is to edit the content of a video while retaining its structure.

To achieve this, we learn a generative model p(x|s, c) of videos x conditioned on representations of structure s and content c. We infer the shape representation s from an input video, and modify it based on a text prompt c describing the edit. First, we describe our realization of the generative model as a conditional latent video diffusion model and, then, we describe our choices for shape and content representations. Finally, we discuss the optimization process of our model. See Fig. 2 for an overview.

3.1. Latent diffusion models

Diffusion models Diffusion models [51] learn to reverse a fixed forward diffusion process, which is defined as

q(xt | xt−1) := N(xt; √(1 − βt) xt−1, βt I).   (1)

Normally-distributed noise is slowly added to each sample xt−1 to obtain xt. The forward process models a fixed Markov chain and the noise is dependent on a variance schedule βt where t ∈ {1, . . . , T}, with T being the total number of steps in our diffusion chain, and x0 := x.

Learning to Denoise The reverse process is defined according to the following equations with parameters θ:

pθ(x0) := ∫ pθ(x0:T) dx1:T,   (2)

pθ(x0:T) = p(xT) ∏_{t=1}^{T} pθ(xt−1 | xt),   (3)

pθ(xt−1 | xt) := N(xt−1; µθ(xt, t), Σθ(xt, t)).   (4)

Using a fixed variance Σθ(xt, t), we are left learning the means of the reverse process µθ(xt, t). Training is typically performed via a reweighted variational bound on the maximum likelihood objective, resulting in a loss

L := Et,q [ λt ‖µt(xt, x0) − µθ(xt, t)‖² ],   (5)

where µt(xt, x0) is the mean of the forward process posterior q(xt−1 | xt, x0), which is available in closed form [14].

Parameterization The mean µθ(xt, t) is then predicted by a UNet architecture [39] that receives the noisy input xt and the diffusion timestep t as inputs. Other parameterizations and weightings, such as the ϵ- [14] and v-parameterizations [46], can significantly improve sample quality compared to directly predicting the mean. Similar to [13], we found that the v-parameterization improves color consistency, thus all our experiments use it (see supp. material for more details).

Latent diffusion Latent diffusion models (LDMs) [38] take the diffusion process into the latent space. This provides improved separation between the compressive and generative learning phases of the model. Specifically, LDMs use an autoencoder where an encoder E maps input data x to a lower dimensional latent code according to z = E(x), while a decoder D converts latent codes back to the input space such that perceptually x ≈ D(E(x)).

Our encoder downsamples RGB images x ∈ R^{3×H×W} by a factor of eight and outputs four channels, resulting in a latent code z ∈ R^{4×H/8×W/8}. Thus, the diffusion UNet operates on a much smaller representation, which significantly improves runtime and memory efficiency. The latter is crucial for video modeling, where the additional time axis increases memory costs.
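To make the formulation in Eqs. (1)–(5) concrete, the following is a minimal PyTorch-style sketch of the forward process and a denoising loss. It is an illustrative assumption, not the authors' code: it uses a linear schedule, sets λt = 1, and predicts noise (the ϵ-parameterization mentioned above), whereas the paper reports using the v-parameterization.

```python
import torch
import torch.nn.functional as F

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    # Variance schedule beta_t and cumulative products alpha_bar_t, which give
    # the closed-form marginal q(x_t | x_0) implied by the forward process in Eq. (1).
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    return betas, alpha_bars

def q_sample(z0, t, alpha_bars):
    # q(z_t | z_0) = N(sqrt(alpha_bar_t) z_0, (1 - alpha_bar_t) I); here the data
    # lives in the 4-channel latent space z = E(x) described in Sec. 3.1.
    noise = torch.randn_like(z0)
    a = alpha_bars.to(z0.device)[t].view(-1, 1, 1, 1)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise, noise

def denoising_loss(model, z0, alpha_bars, T=1000):
    # Simplified (lambda_t = 1) stand-in for Eq. (5), written for a noise-predicting
    # UNet; the paper instead parameterizes the prediction via v.
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    zt, noise = q_sample(z0, t, alpha_bars)
    return F.mse_loss(model(zt, t), noise)
```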

Figure 3. Temporal Extension: We extend an image-based UNet architecture to videos by adding temporal layers in its building blocks. We add a 1D temporal convolution after each 2D spatial convolution in its residual blocks (left), and we add a 1D temporal attention block after each of its 2D spatial attention blocks (right).

3.2. Spatio-temporal Latent Diffusion

To correctly model a distribution over video frames, the architecture must account for temporal relationships. We also want to jointly learn an image model with shared parameters to benefit from better generalization obtained by training on large-scale image datasets.

To achieve this, we extend an image architecture by introducing temporal layers, which are only active for video inputs. All other layers are shared between the image and video model. The autoencoder remains fixed and processes each frame in a video independently.

The UNet consists of two main building blocks: residual blocks and transformer blocks (see Fig. 3). Similar to [17, 49], we extend them to videos by adding both 1D convolutions across time and 1D self-attentions across time. In each residual block, we introduce one temporal convolution after each 2D convolution. Similarly, after each spatial 2D transformer block, we also include one temporal 1D transformer block, which mimics its spatial counterpart along the time axis. We also input learnable positional encodings of the frame index into temporal transformer blocks.
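A minimal sketch of this factorization, assuming a PyTorch implementation and a (batch, channels, time, height, width) layout: a 1D temporal convolution follows the shared 2D spatial convolution and is skipped for single images, so image and video inputs reuse the same spatial weights. The module and parameter names are illustrative assumptions, not the paper's architecture code.

```python
import torch.nn as nn

class SpatioTemporalConvBlock(nn.Module):
    # 2D spatial convolution shared by images and videos, followed by a
    # 1D temporal convolution that is only active for video inputs.
    def __init__(self, channels, kernel=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel, padding=kernel // 2)
        self.temporal = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)

    def forward(self, x):
        # x: (b, c, t, h, w); t == 1 corresponds to a single image.
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        x = x.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        if t > 1:  # temporal mixing only for videos
            y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
            y = self.temporal(y)
            x = x + y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x
```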
3.3. Representing Content and Structure

Conditional Diffusion Models Diffusion models are well-suited to modeling conditional distributions such as p(x|s, c). The forward process q remains unchanged while the conditioning variables s, c become additional inputs to the model.

Our goal is to edit a video based on a text prompt describing the desired output. We choose to train on uncaptioned video data due to the lack of large-scale paired video-text datasets of similar quality as image datasets like [48]. Therefore, during training, we must derive structure and content representations from the training video x itself, i.e. s = s(x) and c = c(x), resulting in a per-example loss of

λt ‖µt(E(x)t, E(x)0) − µθ(E(x)t, t, s(x), c(x))‖².   (6)

In contrast, during inference, structure s and content c are derived from an input video y and from a text prompt t respectively. An edited version x of y is obtained by sampling the generative model conditioned on s(y) and c(t):

z ∼ pθ(z | s(y), c(t)),  x = D(z).   (7)

Content Representation We utilize CLIP [32] to infer a content representation from both text inputs t and video inputs x, similar to previous works [35, 3]. CLIP embeddings are a promising content representation as they are more sensitive to semantic and stylistic properties while being more invariant towards geometric attributes [34]. During training, we encode a random frame in each input video with CLIP. To support text-based editing at inference, we train a prior model that allows sampling image embeddings from text embeddings [35, 49].

Structure Representation We need a representation that provides adequate separation between structure and content. We find that depth estimates extracted from input video frames provide the desired properties, as they encode significantly less content information compared to simpler structure representations such as edge filters, which also encode textural properties. Still, depth maps reveal the silhouettes of objects, which can prevent content edits involving changes in object shape.

To offer control over the amount of structure to preserve, we propose to train a model on structure representations with varying amounts of information. In particular, we blur depth estimates given a parameter ts. During training, ts is randomly sampled between 0 and Ts. The parameter can then be controlled at inference to achieve different editing effects (see Fig. 10).

While depth maps work well for our use case, our approach generalizes to other geometric guidance features or combinations of features that might be more helpful for other specific applications. For example, models focusing on human video synthesis might benefit from estimated poses or face landmarks.

Conditioning Mechanisms We account for the different characteristics of our content and structure representations with two different conditioning mechanisms. Since structure represents a significant portion of the spatial information of video frames, we use concatenation for conditioning to make effective use of this information. In contrast, attributes described by the content representation are not tied to particular locations. Hence, we leverage cross-attention, which can effectively transport this information to any position.

We use the spatial transformer blocks of the UNet architecture for cross-attention conditioning. Each contains two attention operations, where the first one performs a spatial self-attention and the second one a cross-attention with keys and values computed from the CLIP image embedding.
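The two conditioning paths can be summarized with the following sketch. It is an assumption about how the pieces fit together rather than the authors' implementation: the structure latent and the ts embedding are concatenated with zt along the channel axis, while the CLIP image embedding (its dimension of 768 is also an assumption) enters through cross-attention keys and values.

```python
import torch
import torch.nn as nn

class CrossAttnCondition(nn.Module):
    # Cross-attention where queries come from UNet features and keys/values
    # from the CLIP image embedding of the content frame.
    def __init__(self, dim, clip_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=clip_dim, vdim=clip_dim,
                                          batch_first=True)

    def forward(self, feats, clip_emb):
        # feats: (b, h*w, dim) flattened spatial tokens; clip_emb: (b, 1, clip_dim)
        out, _ = self.attn(query=feats, key=clip_emb, value=clip_emb)
        return feats + out

def prepare_unet_input(z_t, structure_latent, ts_embedding):
    # Structure conditioning by concatenation along channels:
    # z_t (b, 4, h, w), structure_latent (b, 4, h, w), ts_embedding (b, 4, h, w).
    return torch.cat([z_t, structure_latent, ts_embedding], dim=1)
```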

Figure 4. Temporal Control: By training image and video models jointly, we obtain explicit control over the temporal consistency of edited videos via a temporal guidance scale ωt. On the left, frame consistency measured via CLIP cosine similarity of consecutive frames increases monotonically with ωt, while the mean squared error between frames warped with optical flow decreases monotonically. On the right, lower scales (0.5 in the middle row) achieve edits with a "hand-drawn" look, whereas higher scales (1.5 in the bottom row) produce smoother results. The top row shows the original input video; the two edits use the prompt "pencil sketch of a man looking at the camera".

Prompts shown in Figure 5 (driving video on top, result below): "a man using a laptop inside a train, anime style"; "a woman and man take selfies while walking down the street, claymation"; "kite-surfer in the ocean at sunset"; "alien explorer hiking in the mountains".

Figure 5. Our approach enables a wide range of video edits, including changes to animation styles such as anime or claymation, changes
of environment such as time of day, and changing characters such as humans to aliens.

Figure 6. Prompt-vs-frame consistency: Image models such as SD-Depth achieve good prompt consistency but fail to produce consistent edits across frames. Propagation-based approaches such as IVS and Text2Live increase frame consistency but fail to provide edits reflecting the prompt accurately. Our method achieves the best combination of frame and prompt consistency.

Figure 7. User Preferences: Based on our user study, the results from our model are preferred over the baseline models.

To condition on structure, we first estimate depth maps for all input frames using the MiDaS DPT-Large model [36]. We then apply ts iterations of blurring and downsampling to the depth maps, where ts controls the amount of structure to preserve. We resample the perturbed depth map to the resolution of the RGB frames and encode it using E. This latent representation of structure is concatenated with the input zt given to the UNet. We also input four channels containing a sinusoidal embedding of ts.
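A rough sketch of this structure pathway, under stated assumptions: the depth map is assumed to come from MiDaS as described above, average pooling stands in for the paper's blur-and-downsample operator, and the four-channel sinusoidal embedding uses two arbitrary frequencies chosen here only for illustration.

```python
import torch
import torch.nn.functional as F

def perturb_depth(depth, ts):
    # depth: (b, 1, h, w) monocular depth estimate; ts rounds of blurring and
    # 2x downsampling, then resampling back to the original frame resolution.
    h, w = depth.shape[-2:]
    for _ in range(ts):
        depth = F.avg_pool2d(depth, kernel_size=2)
    return F.interpolate(depth, size=(h, w), mode="bilinear", align_corners=False)

def ts_sinusoidal_embedding(ts, h, w, channels=4):
    # Four constant channels encoding ts with sin/cos at two frequencies,
    # broadcast to the spatial resolution so they can be concatenated with z_t.
    freqs = torch.arange(channels // 2, dtype=torch.float32)
    angles = ts * (10000.0 ** (-freqs / (channels // 2)))
    emb = torch.cat([angles.sin(), angles.cos()])          # (channels,)
    return emb.view(1, channels, 1, 1).expand(1, channels, h, w)
```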
Sampling While Eq. (2) provides a direct way to sample from the trained model, many other sampling methods [52, 24, 22] require only a fraction of the number of diffusion timesteps to achieve good sample quality. We use DDIM [52] throughout our experiments. Furthermore, classifier-free diffusion guidance [16] significantly improves sample quality. For a conditional model µθ(xt, t, c), this is achieved by training the model to also perform unconditional predictions µθ(xt, t, ∅) and then adjusting predictions during sampling according to

µ̃θ(xt, t, c) = µθ(xt, t, ∅) + ω (µθ(xt, t, c) − µθ(xt, t, ∅)),

where ω is the guidance scale that controls the strength. Based on the intuition that ω extrapolates the direction between an unconditional and a conditional model, we apply this idea to control temporal consistency of our model. Specifically, since we are training both an image and a video model with shared parameters, we can consider predictions by both models for the same input. Let µθ(zt, t, c, s) denote the prediction of our video model, and let µπθ(zt, t, c, s) denote the prediction of the image model applied to each frame individually. Taking classifier-free guidance for c into account, we then adjust our prediction according to

µ̃θ(zt, t, c, s) = µπθ(zt, t, ∅, s) + ωt (µθ(zt, t, ∅, s) − µπθ(zt, t, ∅, s)) + ω (µθ(zt, t, c, s) − µθ(zt, t, ∅, s)).   (8)

Our experiments demonstrate that this approach controls temporal consistency in the outputs, see Fig. 4.
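A sketch of how the adjusted prediction in Eq. (8) could be assembled from three model evaluations. The names `video_model` and `image_model` are assumptions for illustration and stand for the shared-parameter network run with and without its temporal layers.

```python
def guided_prediction(video_model, image_model, z_t, t, c, s, w_content, w_temporal):
    # Eq. (8): extrapolate from the per-frame image model towards the video model
    # (temporal guidance, scale w_temporal) and from the unconditional towards the
    # content-conditional prediction (classifier-free guidance, scale w_content).
    mu_img_uncond = image_model(z_t, t, None, s)   # per-frame, no content
    mu_vid_uncond = video_model(z_t, t, None, s)   # temporal layers active, no content
    mu_vid_cond = video_model(z_t, t, c, s)        # temporal layers active, with content
    return (mu_img_uncond
            + w_temporal * (mu_vid_uncond - mu_img_uncond)
            + w_content * (mu_vid_cond - mu_vid_uncond))
```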
3.4. Optimization

We train on an internal dataset of 240M images and 6.4M video clips. We use image batches of size 9216 with resolutions of 320 × 320, 384 × 320 and 448 × 256, as well as the same resolutions with flipped aspect ratios. We sample image batches with a probability of 12.5%. For videos, we use batch size 1152 and 8 frames from each video sampled four frames apart with a resolution of 448 × 256.

We train our model in multiple stages. First, we initialize model weights based on a pretrained text-conditional latent diffusion model [38] (https://github.com/runwayml/stable-diffusion). We change the conditioning from CLIP text embeddings to CLIP image embeddings and fine-tune for 15k steps on images only. Afterwards, we introduce temporal connections as described in Sec. 3.2 and train jointly on images and videos for 75k steps. We then add conditioning on structure s with ts ≡ 0 fixed and train for 25k steps. Finally, we resume training with ts sampled uniformly between 0 and 7 for another 10k steps.

4. Results

To evaluate our approach, we use videos from DAVIS [31] and various stock footage. To automatically create edit prompts, we first run a captioning model [23] to obtain a description of the original video content. We then use GPT-3 [7] to generate edited prompts.

4.1. Qualitative Results

We demonstrate that our approach performs well on a number of diverse inputs (see Fig. 5). Our method handles a large variety of footage, such as landscapes and close-ups, and diverse camera motion without any explicit tracking of the input. Our depth-based structure representation combined with large-scale image-video joint training enables strong generalization and powerful editing capabilities.

Figure 8 rows: input frames, mask, and the result for the prompt "A snowboarder in a snow park on the mountain".

Figure 8. Background Editing: Masking the denoising process allows us to restrict edits to backgrounds for more control over results.

For example, we can produce various animation styles, changes in time of day, and more complex changes of subject, such as turning a hiker into an alien (see Fig. 5). Please see the supplementary material for more results.

Using CLIP image embeddings as the content representation allows users to specify content through images. As an example application, we demonstrate character replacement in Fig. 9. For every video in a set of six videos, we re-synthesize it five times, each time providing a single content image taken from another video in the set. We can retain content characteristics with ts = 3 despite large differences in their pose and shape.

Lastly, we illustrate the use of masked video editing in Fig. 8, where the model predicts everything outside the masked area(s) while retaining the original content inside the masked area. Notably, this technique resembles inpainting with diffusion models [43, 25].

4.2. User Study

We benchmark against Text2Live [5], a recent approach for text-guided video editing that employs layered neural atlases [21]. As a baseline, we compare against SDEdit [26] in two ways: per-frame generated results and a first-frame result propagated by a few-shot video stylization method [55] (IVS). We also include two depth-based versions of Stable Diffusion; one trained with depth-conditioning [2] and one that retains past results based on depth estimates [9]. We also include an ablation: applying SDEdit to our video model trained without conditioning on a structure representation (ours, ∼s).

We judge the success of our method qualitatively based on a user study. We run the user study using Amazon Mechanical Turk (AMT) on an evaluation set of 35 representative video editing prompts. For each example, we ask 5 annotators to compare faithfulness to the video editing prompt ("Which video better represents the provided edited caption?") between a baseline and our method, presented in random order, and use a majority vote for the final result.

The results can be found in Fig. 7. Across all compared methods, results from our approach are preferred roughly 3 out of 4 times. A visual comparison among the methods can be found in the supplementary. We observe that SDEdit is sensitive to the editing strength. Low values fail to achieve the desired editing effect whereas high values change the structure of the input. Even with a fixed seed, both style and structure can change in unnatural ways between frames as their relationship is ignored by image-based approaches. Propagation of SDEdit outputs (IVS) leads to more consistent results but often introduces propagation artifacts, especially with large motion. Depth-SD produces accurate, structure-preserving edits for individual frames but frames are inconsistent across time. The outputs of Text2Live tend to be temporally smooth due to its reliance on Layered Neural Atlases [21], but it often produces edits that represent the edit prompt inaccurately. A direct comparison with Text2Live is difficult as it requires input masks and separate edit prompts for foreground and background. In addition, computing a neural atlas takes about 10 hours whereas our approach requires approximately a minute.

4.3. Quantitative Evaluation

We quantify trade-offs between frame consistency and prompt consistency with the following two metrics.

Frame consistency We compute CLIP image embeddings on all frames of output videos and report the average cosine similarity between all pairs of consecutive frames.

Prompt consistency We compute CLIP image embeddings on all frames of output videos and the CLIP text embedding of the edit prompt. We report the average cosine similarity between text and image embedding over all frames.

Fig. 6 shows the results of each model using our frame consistency and prompt consistency metrics. Our model tends to outperform the baseline models in both aspects (placed higher in the upper-right quadrant of the graph). We also notice a slight tradeoff with increasing the strength parameters in the baseline models: larger strength scales imply higher prompt consistency at the cost of lower frame consistency. Increasing the temporal scale (ωt) of our model results in higher frame consistency but lower prompt consistency.
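The two metrics can be reproduced from off-the-shelf CLIP embeddings. The sketch below assumes pre-computed, L2-normalized embeddings rather than any specific CLIP package, so the function and variable names are illustrative only.

```python
import torch.nn.functional as F

def frame_consistency(frame_embs):
    # frame_embs: (n_frames, d) CLIP image embeddings of one output video.
    # Average cosine similarity between consecutive frames.
    a, b = frame_embs[:-1], frame_embs[1:]
    return F.cosine_similarity(a, b, dim=-1).mean()

def prompt_consistency(frame_embs, text_emb):
    # text_emb: (d,) CLIP text embedding of the edit prompt.
    # Average cosine similarity between the prompt and every frame.
    return F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1).mean()
```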

Figure 9. Image Prompting: We combine the structure of a driving video (first column) with content from other videos (first row).

Figure 10. Controlling Fidelity: We obtain control over structure and appearance-fidelity. Each cell shows three frames produced with decreasing structure-fidelity ts (left-to-right) and increasing number of customization training steps (top-to-bottom). The bottom shows examples of images used for customization (red border) and the input image (blue border). Same driving video as in Fig. 1.

We also observe that an increased structure scale (ts) results in higher prompt consistency as the content becomes less determined by the input structure.

4.4. Customization

Customization of image models enables generation of previously unseen content, such as specific people or styles, based on a small dataset used for finetuning [41]. We finetune our depth-conditioned latent video diffusion model on a set of 15-30 images and produce videos containing the desired subject. Half of the batch elements are of the subject and the other half belong to the original training dataset.

Fig. 10 shows an example with different numbers of customization steps as well as different levels of structure preservation ts. Customization improves fidelity to the style and appearance of the character. In combination with higher ts values, accurate animations are possible despite using a driving video of a person with different characteristics.
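A small sketch of the batch mixing just described, under the assumption that the 15-30 customization images are available as a list of preprocessed tensors and that a batch from the original training data is already drawn; the helper name is hypothetical.

```python
import random
import torch

def mixed_batch(subject_images, original_batch):
    # Build a finetuning batch where half of the elements show the new subject
    # (sampled with replacement from the small customization set) and the other
    # half come from the original training distribution.
    half = original_batch.shape[0]
    subject_half = torch.stack(random.choices(subject_images, k=half))
    return torch.cat([subject_half, original_batch], dim=0)
```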
5. Conclusion

Our latent video diffusion model synthesizes new videos given structure and content information. We ensure structural consistency by conditioning on depth estimates while content is controlled with images or natural language. Temporal layers and joint image-video training achieve stable results across frames. A novel guidance method, inspired by classifier-free guidance, allows for control over temporal consistency. By training on depth maps with varying degrees of detail, we can adjust the level of structure preservation. This, together with model customization, improves content fidelity. Our quantitative evaluation and user study show that our method is preferred over related approaches.

Future work should investigate other conditioning data, such as facial landmarks and pose estimates, and additional 3D priors to improve generated results. Our model is intended for creative applications in content creation, but we realize the risks of dual-use and hope that further work will be aimed at combating abuse of generative models.

References

[1] Dinesh Acharya, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. arXiv preprint arXiv:1810.02419, 2018.

[2] Stability AI. Stable diffusion depth. https://github.com/Stability-AI/stablediffusion, 2022.

[3] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.

[4] Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S. Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise, 2023.

[5] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707-723. Springer, 2022.

[6] Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A. Efros, and Tero Karras. Generating long videos of dynamic scenes. 2022.

[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877-1901. Curran Associates, Inc., 2020.

[8] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 1105-1114, 2017.

[9] deforum. Deforum stable diffusion. https://github.com/deforum/stable-diffusion, 2022.

[10] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering text-to-image generation via transformers. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19822-19835. Curran Associates, Inc., 2021.

[11] Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic VQGAN and time-sensitive transformer. arXiv preprint arXiv:2204.03638, 2022.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

[13] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840-6851. Curran Associates, Inc., 2020.

[15] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47-1, 2022.

[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.

[17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv:2204.03458, 2022.

[18] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.

[19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125-1134, 2017.

[20] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

[21] Yoni Kasten, Dolev Ofri, Oliver Wang, and Tali Dekel. Layered neural atlases for consistent video editing. ACM Transactions on Graphics (TOG), 40(6):1-12, 2021.

[22] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132, 2021.

[23] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.

[24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

[25] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461-11471, 2022.

[26] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021.

[27] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162-8171. PMLR, 2021.

[28] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16784-16804. PMLR, 17-23 Jul 2022.

[29] Yaniv Nikankin, Niv Haim, and Michal Irani. SinFusion: Training diffusion models on a single image or video. arXiv preprint arXiv:2211.11743, 2022.

[30] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2022.

[31] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.

[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.

[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), Jun 2022.

[34] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022.

[35] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821-8831. PMLR, 18-24 Jul 2021.

[36] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:1623-1637, 2019.

[37] Alex Rav-Acha, Pushmeet Kohli, Carsten Rother, and Andrew William Fitzgibbon. Unwrap mosaics: a new representation for video editing. ACM SIGGRAPH 2008 papers, 2008.

[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.

[39] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.

[40] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style transfer for videos. In Bodo Rosenhahn and Bjoern Andres, editors, Pattern Recognition, pages 26-36, Cham, 2016. Springer International Publishing.

[41] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.

[42] Alexander S. Disco Diffusion v5.2 - Warp Fusion. https://github.com/Sxela/DiscoDiffusion-Warp, 2022.

[43] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models, 2021.

[44] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

[45] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, pages 2830-2839, 2017.

[46] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.

[47] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600, 2021.

[48] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

[49] Uriel Singer, Adam Polyak, Thomas Hayes, Xiaoyue Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-video generation without text-video data. ArXiv, abs/2209.14792, 2022.

[50] Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2, 2021.

[51] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256-2265, Lille, France, 07-09 Jul 2015. PMLR.

[52] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.

[53] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[54] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Amit H. Bermano, and Daniel Cohen-Or. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.

[55] Ondřej Texler, David Futschik, Michal Kučera, Ondřej Jamriška, Šárka Sochorová, Menglei Chai, Sergey Tulyakov, and Daniel Sýkora. Interactive video stylization using few-shot patch-based training. ACM Transactions on Graphics, 39(4):73, 2020.

[56] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1526-1535, 2018.

[57] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287-11302, 2021.

[58] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description, 2022.

[59] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Advances in Neural Information Processing Systems, 29, 2016.

[60] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. Few-shot video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[61] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In Conference on Neural Information Processing Systems (NeurIPS), 2018.

[62] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[63] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.

[64] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021.

[65] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022.

[66] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. LION: Latent point diffusion models for 3D shape generation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[67] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. MagicVideo: Efficient video generation with latent diffusion models, 2022.
