ControlVideo: Adding Conditional Control for One-Shot Text-to-Video Editing
Min Zhao1,3, Rongzhen Wang2, Fan Bao1,3, Chongxuan Li2∗, Jun Zhu1,3,4∗
1 Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, THBI Lab, Tsinghua University
Abstract
In this paper, we present ControlVideo, a novel method for text-driven video editing.
Leveraging the capabilities of text-to-image diffusion models and ControlNet,
ControlVideo aims to enhance the fidelity and temporal consistency of videos that
align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps, and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. An in-depth exploration of ControlVideo’s design is conducted
to inform future research on one-shot tuning video diffusion models. Quantitatively,
ControlVideo outperforms a range of competitive baselines in terms of faithfulness
and consistency while still aligning with the textual prompt. Additionally, it delivers videos with high visual realism and fidelity to the source content, demonstrates flexibility in utilizing controls that contain varying degrees of source-video information, and supports combining multiple controls. The project
page is available at https://ml.cs.tsinghua.edu.cn/controlvideo/.
1 Introduction
The endeavor of text-driven video editing is to seamlessly generate novel videos derived from textual
prompts and existing video footage, thereby reducing manual labor. This technology stands to
significantly influence an array of fields such as advertising, marketing, and social media content.
Within this process, it is critical that the edited videos faithfully preserve the content of the source video, maintain temporal consistency between generated frames, and align with the target prompt. However, fulfilling all these requirements simultaneously presents considerable challenges.
Training a text-to-video model [1, 2] directly on extensive text-video data necessitates considerable
computational resources. Recent advancements in large-scale text-to-image diffusion models [3–5]
and controllable image editing [6–8] have been leveraged in zero-shot [9, 10] and one-shot [11, 12]
methodologies for text-driven video editing. These developments have shown promising capabilities
to edit videos in response to a variety of textual prompts, without requiring additional video data.
However, despite the significant strides made in aligning output with text prompts, empirical evidence
(see Figure 6 and Table 1) suggests that existing approaches still struggle to faithfully and adequately
control the output, while also preserving temporal consistency.
To this end, we present ControlVideo, a novel approach for faithful and consistent text-driven
video editing, based on a pretrained text-to-image diffusion model.

∗ Corresponding authors.
2 Background
Diffusion Models. Diffusion models [16–19] gradually perturb data x0 ∼ q(x0 ) by a forward
diffusion process:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, \beta_t I\big),$$
where $\beta_t$ is the noise schedule, $\alpha_t = 1 - \beta_t$, and the schedule is designed so that $x_T$ approximately follows $\mathcal{N}(0, I)$. The data can be generated starting from Gaussian noise through the reverse diffusion process, where the reverse transition kernel $q(x_{t-1} \mid x_t)$ is approximated by a Gaussian model $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t), \sigma_t^2 I)$. [18] shows that learning the mean $\mu_\theta(x_t)$ reduces to learning a noise prediction network $\epsilon_\theta(x_t, t)$ via a mean-squared error loss:
$$\min_\theta \; \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2. \tag{1}$$
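As a concrete illustration, the following is a minimal PyTorch sketch of this noise-prediction objective; the names `eps_model` and `alphas_cumprod` (the cumulative products of $\alpha_t$) are illustrative placeholders rather than any paper's implementation.

```python
import torch

def ddpm_loss(eps_model, x0, alphas_cumprod):
    """Minimal sketch of the noise-prediction objective in Eq. (1).

    `eps_model(x_t, t)` and `alphas_cumprod` (cumulative products of alpha_t)
    are illustrative placeholders, not the paper's API."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                         # Gaussian noise
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps                 # forward-process sample of x_t given x_0
    return ((eps - eps_model(x_t, t)) ** 2).mean()                     # MSE between true and predicted noise
```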
[Figure 1 panels: source videos “a car” and “a girl” with their controls; target prompts “+ Vivian Maier style, on a street beside a building” and “Sherlock Holmes is dancing, on the street of London, raining”.]
Figure 1: Main results of ControlVideo with different controls, including (a) Canny edge maps, (b) HED boundary, (c) depth maps, and (d) pose. ControlVideo generates faithful and consistent videos for attribute, style, and background editing as well as subject replacement. By choosing different control types, users can flexibly adjust the balance between faithfulness and editing capability, and multiple controls can easily be combined for video editing.
[Figure 2 diagram: training fine-tunes ControlVideo on the source video with the source prompt “A car”; inference performs DDIM inversion of the source video followed by DDIM sampling with the target prompt “A car, autumn”, using basic blocks with and without temporal attention.]
Figure 2: Flowchart of ControlVideo. ControlVideo incorporates visual conditions for all frames to amplify the source video’s guidance, key-frame attention that aligns all frames with a selected one, and temporal attention modules followed by a zero convolution layer, for temporal consistency and faithfulness. The three key components and the corresponding fine-tuned parameters are chosen through a systematic empirical study (Sec. 3.3). Built upon the trained ControlVideo, during inference we employ DDIM inversion to obtain the noise inversion $X_T$ and then generate the edited video $Y_0$ from the target prompt, starting from $Y_T = X_T$ via DDIM sampling.
DDIM inversion and DDIM sampling. Deterministic DDIM sampling [20] is an ODE-based sampling method [16, 21] that generates samples starting from $x_T \sim \mathcal{N}(0, I)$ via the following iteration rule:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t), \tag{2}$$
where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
DDIM inversion [20] converts a real image $x_0$ into a corresponding noise latent by running the above iteration in reverse; the image can then be reconstructed from this latent via DDIM sampling. It is therefore widely adopted in image editing [6, 7, 22].
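For clarity, here is a minimal sketch of one deterministic DDIM update implementing Eq. (2); calling it with an increasing noise level runs the same rule backwards and performs DDIM inversion. The names `eps_model` and `alphas_cumprod` are assumptions, not a specific library API.

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t, t, t_prev, alphas_cumprod):
    """One deterministic DDIM update from noise level t to t_prev (Eq. (2)).
    Calling it with t_prev > t runs the same rule towards higher noise,
    i.e. DDIM inversion. `eps_model` and `alphas_cumprod` (the cumulative
    products of alpha_t) are illustrative placeholders."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = eps_model(x_t, t)
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # predicted clean sample
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps   # move to noise level t_prev
```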
Latent Diffusion Models. To reduce computational cost, the latent diffusion model (LDM) [3] first leverages an encoder $\mathcal{E}$ to map the image $x_0$ into a lower-dimensional latent $z_0 = \mathcal{E}(x_0)$, which can be reconstructed by a decoder $x_0 \approx \mathcal{D}(z_0)$, and then trains the noise prediction network $\epsilon_\theta(z_t, t)$ in the latent space. For text-to-image generation, LDM learns a conditional noise prediction network $\epsilon_\theta(z_t, p, t)$ with the mean-squared error
$$\min_\theta \; \mathbb{E}_{t,\, z_0,\, p,\, \epsilon \sim \mathcal{N}(0, I)} \left\| \epsilon - \epsilon_\theta(z_t, p, t) \right\|^2,$$
where $p$ denotes the text prompt.
3 ControlVideo
In this section, we present ControlVideo, a framework designed to enhance faithfulness and temporal consistency in text-driven video editing, built upon Stable Diffusion [3] and ControlNet [13].
Specifically, in Section 3.1, we detail the training and sampling framework of ControlVideo. The
architectural design of ControlVideo is explained in Section 3.2. Further, we conduct a comprehensive
empirical examination of ControlVideo’s key components in Section 3.3, analyzing the impact of
each component individually and their collective influence when combined.
Built upon the fine-tuned model $\epsilon_{\phi^*}(X_t, p_s, c, t)$, during inference we employ DDIM inversion (see Eq. (2)) to obtain the noise inversion $X_T$, which encodes the information of the source video, and then generate the edited video $Y_0$ with the target prompt $p_t$, starting from $Y_T = X_T$ via DDIM sampling [20].
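This inference procedure can be summarized by the following sketch, which reuses the DDIM update sketched earlier; `model`, `controls`, and `alphas_cumprod` are hypothetical placeholders for the fine-tuned, control-conditioned noise predictor and its inputs, not the released implementation.

```python
import torch

@torch.no_grad()
def edit_video(model, source_latents, source_prompt, target_prompt, controls, timesteps):
    """High-level sketch: DDIM-invert the source latents with the source prompt,
    then DDIM-sample from the inverted noise with the target prompt.
    `model(x, t, prompt, controls)` stands in for the fine-tuned noise predictor;
    `timesteps` is an increasing list of integer noise levels."""
    # DDIM inversion with the source prompt: X_0 -> X_T
    eps_src = lambda z, s: model(z, s, source_prompt, controls)
    x = source_latents
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_step(eps_src, x, t, t_next, model.alphas_cumprod)
    # DDIM sampling with the target prompt, starting from Y_T = X_T
    eps_tgt = lambda z, s: model(z, s, target_prompt, controls)
    y = x
    for t, t_prev in zip(reversed(timesteps[1:]), reversed(timesteps[:-1])):
        y = ddim_step(eps_tgt, y, t, t_prev, model.alphas_cumprod)
    return y  # edited latents Y_0, decoded to frames by the VAE decoder
```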
In line with prior studies [11, 23], we first adopt pseudo 3D convolution layers by inflating the 2D convolution layers in ResNet to handle video inputs; specifically, we replace the 3 × 3 kernels with 1 × 3 × 3 kernels, as sketched below. The three key components of ControlVideo are then explained in turn.
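A minimal sketch of this kernel inflation using standard torch.nn layers (an illustrative helper, not the authors' implementation):

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Inflate a 2D ResNet convolution into a pseudo-3D one: a 3x3 kernel
    becomes a 1x3x3 kernel applied frame-by-frame, so the inflated layer
    reproduces the 2D output on each frame at initialization."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(1, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    conv3d.weight.data.copy_(conv2d.weight.data.unsqueeze(2))  # (out, in, 1, 3, 3)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d
```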
Adding Controls. Inspired by recent advances in image editing [6, 7], a natural approach is to generate the edited video starting from the noise inversion of the source video $X_0$ obtained via DDIM inversion. However, as depicted in Figure 5 (row 3), this approach alone leads to less faithful edited videos. To address this issue, we introduce additional visual conditions such as Canny edge maps, HED boundaries, and depth maps for all frames to amplify the source video’s guidance and enhance faithfulness. Specifically, we leverage ControlNet [13], which has been pretrained on top of Stable Diffusion, to process the visual conditions. Formally, let $h_u, h_c \in \mathbb{R}^{N \times H \times W \times C}$ denote the embeddings of the Stable Diffusion encoder and the ControlNet encoder at the same layer, respectively. These features are combined by summation, $h = h_u + \lambda h_c$, where $\lambda$ is the control scale, and fed into the decoder of Stable Diffusion via skip connections. As shown in Figure 5 (row 4), such guidance from the source video significantly improves faithfulness (e.g., preserving the trees in the background of the source video in the “a car” → “a red car” case).
Since different control types encompass varying degrees of information derived from the source video, ControlVideo allows users to flexibly adjust the balance between faithfully maintaining the content of the source video and enabling more extensive edits. For instance, the HED boundary provides detailed boundary information of the source video and is suitable for refined control such as facial video editing, which requires extremely fine control to preserve identity and emotional accuracy by maintaining the individual’s unique facial characteristics (see Figure 6). On the other hand, pose control provides greater flexibility in modifying the subject and background, making it suitable for applications like motion transfer. Moreover, we can flexibly combine multiple controls by a weighted sum of the different control features, $h = h_u + \sum_i \lambda_i h_c^i$, where $\lambda_i$ is the control scale of the $i$-th control, to exploit the advantages of different control types (see Figure 1 and the sketch below).
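A minimal sketch of this weighted fusion at a single layer follows; the variable names are illustrative, and in practice the fused feature is passed to the Stable Diffusion decoder through the corresponding skip connection.

```python
import torch

def fuse_control_features(h_unet, control_feats, scales):
    """Sketch of h = h_u + sum_i lambda_i * h_c^i at one encoder layer.
    `h_unet` is the Stable Diffusion encoder feature, `control_feats` the
    matching ControlNet features (one per control type), and `scales` the
    per-control weights lambda_i."""
    h = h_unet
    for h_c, lam in zip(control_feats, scales):
        h = h + lam * h_c
    return h

# e.g. fused = fuse_control_features(h_u, [h_canny, h_pose], [1.0, 1.0])
```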
Key-frame Attention. The spatial self-attention mechanism used in T2I diffusion models updates each frame individually, leading to temporally inconsistent video outputs. To address this issue, drawing inspiration from previous works [24, 25] that utilize key frames to propagate edits throughout videos, as well as from recent advances in video editing [11], we transform the original spatial self-attention in both Stable Diffusion and ControlNet into key-frame attention, which aligns all frames with a selected one.

2 We assume that raw videos have already been mapped to the latent space throughout the paper.
Figure 3: Comparison of different designs of the key and value in self-attention. Our choice is marked in green. See detailed analysis in Sec. 3.3.
Formally, let $v^i$ denote the embeddings of the $i$-th frame and let $k \in [1, N]$ be the index of the selected key frame. Key-frame attention is defined as
$$Q = W^Q v^i, \qquad K = W^K v^k, \qquad V = W^V v^k,$$
where $W^Q, W^K, W^V$ are the projection matrices. The results show no significant difference among different key-frame selections (see Appendix B.1); we therefore use the first frame as the key frame in this work. We employ the original spatial self-attention weights as initialization and fine-tune the output projection $W^O$, a choice informed by a systematic empirical study (see Sec. 3.3).
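The following single-head PyTorch sketch illustrates key-frame attention as defined above; the actual layers are multi-head and initialized from the pretrained spatial self-attention, and the module and attribute names here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyFrameAttention(nn.Module):
    """Sketch of key-frame attention: queries come from the current frame,
    keys/values from a selected key frame (the first frame here)."""
    def __init__(self, dim, key_index=0):
        super().__init__()
        self.key_index = key_index
        self.to_q = nn.Linear(dim, dim, bias=False)  # W_Q
        self.to_k = nn.Linear(dim, dim, bias=False)  # W_K
        self.to_v = nn.Linear(dim, dim, bias=False)  # W_V
        self.to_out = nn.Linear(dim, dim)            # W_O (the fine-tuned projection)

    def forward(self, v):                            # v: (frames N, tokens L, dim)
        v_key = v[self.key_index : self.key_index + 1].expand_as(v)
        q = self.to_q(v)                             # Q = W_Q v^i
        k = self.to_k(v_key)                         # K = W_K v^k
        val = self.to_v(v_key)                       # V = W_V v^k
        attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ val)
```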
Temporal Attention. To improve the faithfulness and temporal consistency of the edited video, we incorporate temporal attention modules as extra branches in the diffusion model. Based on the observation that different attention mechanisms consistently model relationships between image features, we employ the original spatial self-attention weights as initialization. We add a zero convolution layer [13] after each temporal attention module so that the output is preserved before fine-tuning. Guided by a systematic empirical study (see Sec. 3.3), we incorporate temporal attention along with key-frame attention in the main UNet, excluding the middle block.
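A minimal sketch of such a temporal-attention branch with the zero-initialized output layer (single-head, with illustrative names; a linear layer on token features stands in for the zero convolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionBranch(nn.Module):
    """Sketch of the extra temporal-attention branch: attention across frames at
    each spatial token, optionally initialized from spatial self-attention weights
    and followed by a zero-initialized projection so the branch is a no-op before
    fine-tuning (analogous to ControlNet's zero convolution)."""
    def __init__(self, dim, spatial_attn_state_dict=None):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        if spatial_attn_state_dict is not None:           # reuse pretrained W_Q, W_K, W_V (assumed key names)
            self.load_state_dict(spatial_attn_state_dict, strict=False)
        self.zero_out = nn.Linear(dim, dim)                # zero-initialized output projection
        nn.init.zeros_(self.zero_out.weight)
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, x):                                  # x: (frames N, tokens L, dim)
        t = x.transpose(0, 1)                              # (L, N, dim): attend over frames per token
        q, k, v = self.to_q(t), self.to_k(t), self.to_v(t)
        attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(0, 1)                   # back to (N, L, dim)
        return x + self.zero_out(out)                      # residual; identity at initialization
```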
3.3 Analysis
In this section, we conduct a systematic empirical study on 20 video-text pairs and evaluate CLIP-temp, CLIP-text, and SSIM (see Sec. 5); the quantitative results are reported in Appendix B.
The Design of Key and Value in Self-Attention and Fine-tuned Parameters. Let $[\,;\,]$ denote the concatenation operation. We consider the following embeddings as key and value: (1) $v^i$: the original spatial self-attention in T2I models. (2) $v^k$: our key-frame attention, for which we try four different key frames. (3) $[v^m; v^i]$ [9], where $m = \mathrm{Round}(N/2)$. (4) $[v^1; v^{i-1}]$ [11, 26]. (5) $[v^1; v^{i-1}; v^{i+1}]$, which includes bi-directional information. (6) $[v^1; v^i; v^{i-1}; v^{i+1}]$. As shown in Figure 3, key-frame attention yields the highest temporal consistency, implying that using a key frame to propagate information throughout the video is useful, and there is no significant difference among key-frame selections (see Appendix B.1). In addition, including the current-frame features $v^i$ reduces temporal consistency, because $v^i$ differs from frame to frame; for example, the color of the car turns red in $[v^m; v^i]$ and $[v^1; v^i; v^{i-1}; v^{i+1}]$, following $v^i$ (column 2). Further, we conduct the following experiments to investigate which parameters are most useful to fine-tune (see Appendix B.2): (1) $W^Q$. (2) $W^O$. (3) $W^K, W^V$. (4) $W^Q, W^K, W^V$. (5) $W^Q, W^K, W^V, W^O$. (6) adding LoRA [27] to $W^Q, W^K, W^V, W^O$. We find that fine-tuning $W^O$ gives good performance with fewer parameters.
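In an implementation built on diffusers-style attention modules, restricting fine-tuning to $W^O$ could look like the sketch below; the parameter-name filter ("attn", "to_out") is an assumption about the underlying module naming, not the authors' code.

```python
import torch

def select_trainable_params(unet):
    """Sketch: freeze everything except the attention output projections (W_O).
    The "attn"/"to_out" substrings assume diffusers-style parameter names."""
    params = []
    for name, p in unet.named_parameters():
        p.requires_grad = ("attn" in name) and ("to_out" in name)
        if p.requires_grad:
            params.append(p)
    return params

# Example: optimizer = torch.optim.AdamW(select_trainable_params(unet), lr=3e-5)
```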
The Way to Initialize and the Local and Global Positions for Introducing Temporal Attention. As shown in Figure 4(a), using the pretrained spatial self-attention weights as initialization achieves better performance than random initialization. Next, we explore the following potential locations for temporal attention within a transformer block: (1) before self-attention; (2) with self-attention; (3) after self-attention; (4) after cross-attention; (5) after the FFN. As shown in Figure 4(b), before self-attention and with self-attention yield the best temporal consistency. This is because the input at these two locations is the same as that of spatial self-attention, whose weights serve as the initialization of temporal attention. Notably, with self-attention shows higher text alignment, making it our final choice. Moreover, we find the after-FFN location yields the worst temporal consistency and should be avoided.
To investigate the optimal global location for adding temporal attention, we first conduct the following experiments: (1) ControlNet + UNet; (2) ControlNet; (3) UNet; (4) encoder of UNet; (5) decoder of UNet.
[Figure 4 panels: (a) random initialization vs. pretrained weights; (b) temporal attention placed before / with / after self-attention, after cross-attention, or after the FFN; (c) temporal attention added to ControlNet + UNet, ControlNet, UNet, the encoder or decoder of UNet, and Blocks 1,2,3 / 1,2 / 2,3 of UNet. Example edits: “the back view of a woman with beautiful scenery” → “···, starry sky” and “···, sunrising, early morning”.]
Figure 4: Ablation studies of (a) the initialization and the incorporation of temporal attention at different (b) local and (c) global positions. Our choice is marked in green. See detailed analysis in Sec. 3.3.
As shown in Figure 4(c), incorporating temporal attention only into ControlNet fails to preserve the background, and removing it from ControlNet does not decrease performance (ControlNet + UNet vs. UNet). This suggests that ControlNet extracts only condition-related features (e.g., pose) and discards the others (e.g., background), whereas the UNet, which is used for the generation task, preserves all image information. We therefore choose to add temporal attention to the UNet. Additionally, the decoder location achieves better performance than the encoder; this may be because, in the UNet, the decoder contains more information than the encoder, since skip connections feed it the encoder features. Next, we investigate the location within the UNet with the following experiments: (1) all; (2) Block 1,2; (3) Block 1,3; (4) Block 2,3; (5) Block 1,2,3, i.e., the UNet excluding the middle block. As shown in Figure 4(c), Block 1,2,3 performs similarly to all while using fewer parameters, so we choose it as the final design.
Analyzing the Role of Each Module and Their Combination. As shown in Figure 5, adding controls provides additional guidance from the source video and thus substantially improves faithfulness. Key-frame attention substantially improves temporal consistency, and temporal attention improves both faithfulness and temporal consistency. Combining all modules achieves the best performance. We also observe that when the control carries detailed information about the source video (e.g., HED boundary), adding the control and key-frame attention already achieves relatively good results; when the control carries less information (e.g., pose) or is absent, temporal attention greatly improves faithfulness.
4 Related Work
Diffusion Models for Text-driven Image Editing. Building upon the remarkable advances of text-to-image diffusion models [3, 4], numerous methods have shown promising results in text-driven real-image editing. In particular, works such as Prompt-to-Prompt [6], Plug-and-Play [7], and Pix2pix-Zero [8] explore attention control over the generated content and achieve state-of-the-art results. Such methods usually start from the noise inversion and replace attention maps in the generation process with those from the source prompt, which retains the spatial layout of the source image. Despite significant advances, directly applying these image editing methods to video frames leads to temporal flickering.
[Figure 5 examples: “a car” → “a red car”; “back view” → “back view, sunrising, early morning”; “a person is dancing” → “a panda is dancing”. Rows: source video, control, Stable Diffusion (baseline), + Control, + Key-frame At., + Temporal At., + Key-frame At. + Temporal At., + Control + Key-frame At., + Control + Temporal At., our complete version.]
Figure 5: Ablation studies of each module and their combination. At. stands for attention. See detailed analysis in Sec. 3.3.
Diffusion Models for Text-driven Video Editing. Gen-1 [2] trains a video diffusion model on
large-scale datasets and Dreamix [1] finetunes the pretrained Imagen Video [4] for text-driven video
editing, achieving impressive performance. However, they require expensive computational resources.
To overcome this, recent works build upon large-scale text-to-image diffusion models [3–5] and controllable image editing to address this task with a single text-video pair. In particular, Tune-A-Video [11] inflates a T2I diffusion model into a text-to-video diffusion model and fine-tunes it on the source video and source prompt. Inspired by this, several works [10, 12, 26] propose to first optimize the null-text
embedding for accurate video inversion and adopt attention map injection in the generation process,
achieving superior performance. Despite the significant advances, empirical evidence suggests that
they still struggle to faithfully and adequately control the output, while also preserving temporal
consistency.
5 Experiments
Implementation Details. We collect 40 video-text pairs, including videos from the DAVIS dataset [28] and other in-the-wild videos from the web (https://www.pexels.com). For fair comparison, following previous works [9–12], we sample 8 frames at 512 × 512 resolution from each video. Stable Diffusion 1.5 [3] and ControlNet 1.0 [13] are adopted in this work. We train ControlVideo for 80, 300, 500, and 1500 iterations for Canny edge maps, HED boundary, depth maps, and pose, respectively, with a learning rate of $3 \times 10^{-5}$. The DDIM sampler [20] with 50 steps and a classifier-free guidance scale of 12 is used for inference. The control scale $\lambda$ is set to 1.
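For reference, the stated hyperparameters can be collected in a single configuration sketch; the field names are illustrative, not the released code's arguments.

```python
# Hyperparameters stated above, gathered into one config sketch.
CONFIG = {
    "num_frames": 8,
    "resolution": 512,
    "base_models": {"stable_diffusion": "1.5", "controlnet": "1.0"},
    "train_iterations": {"canny": 80, "hed": 300, "depth": 500, "pose": 1500},
    "learning_rate": 3e-5,
    "ddim_steps": 50,
    "classifier_free_guidance": 12.0,
    "control_scale": 1.0,
}
```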
Evaluation Metrics. We evaluate the edited videos from three aspects: temporal consistency, faithfulness, and text alignment. Following previous works [9, 12], we report CLIP-temp for temporal consistency and CLIP-text for text alignment.
[Figure 6 columns: source video, ours, Stable Diffusion, Tune-A-Video, vid2vid-zero, Video-P2P, FateZero.]
Figure 6: Comparison with baselines. ControlVideo successfully preserves the identity and emotional
accuracy by maintaining the individual’s unique facial characteristics while others fail.
We also report SSIM [15] between each input-output pair for faithfulness. We further perform a user study quantifying text alignment, temporal consistency, faithfulness, and overall quality via pairwise comparisons between the baselines and our method. More details about the evaluation are provided in Appendix A.
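A sketch of the two CLIP-based metrics under the definitions commonly used by the cited works (an assumption on our part: CLIP-text as mean frame-prompt similarity and CLIP-temp as mean cosine similarity between consecutive frame embeddings); `clip_model` is assumed to expose an `encode_image` method as in the OpenAI CLIP package.

```python
import torch
import torch.nn.functional as F

def clip_metrics(clip_model, frames, prompt_embedding):
    """Sketch of CLIP-text and CLIP-temp under common definitions (assumed, not
    specified by this paper). `frames` are preprocessed frame tensors and
    `prompt_embedding` is the CLIP text embedding of the target prompt."""
    img = F.normalize(clip_model.encode_image(frames), dim=-1)      # (N, d) frame embeddings
    txt = F.normalize(prompt_embedding, dim=-1)                     # (d,) prompt embedding
    clip_text = (img @ txt).mean().item()                           # mean frame-prompt similarity
    clip_temp = F.cosine_similarity(img[:-1], img[1:], dim=-1).mean().item()  # consecutive-frame similarity
    return clip_text, clip_temp
```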
Applications. We show diverse video editing applications of ControlVideo, including attribute, style, and background editing and subject replacement, in Figure 1. For example, in Figure 1 (a), ControlVideo changes the color of the car to red while leaving everything else unchanged, demonstrating local attribute editing. In Figure 1 (d), the dancing man is successfully changed into Michael Jackson. In Figure 1 (e), guided by both the Canny edge map and the pose, the dancing person is changed into a dancing panda with a highly faithful background. More qualitative results are available in Appendix D.
Comparisons. We compare ControlVideo with Stable Diffusion and the following state-of-the-art text-driven video editing methods: Tune-A-Video [11], vid2vid-zero [10], and FateZero [9], all reproduced using public code. During inference, all baselines and ControlVideo use DDIM inversion and DDIM sampling with the same settings.
Table 1: Quantitative results. Text. and Temp. represent CLIP-text and CLIP-temp respectively. User
study shows the preference rate of ControlVideo against baselines via human evaluation. A., T., F.
and O. represent text alignment, temporal consistency, faithfulness and overall aspects.
The quantitative and qualitative comparisons are reported in Table 1 and Figure 6. We also compare with Video-P2P [12]; we find it reconstructs the video more easily than the other baselines (see Figure 6) and thus report its quantitative results in the Appendix.
Extensive results demonstrate that ControlVideo substantially outperforms all these baselines in terms
of faithfulness and temporal consistency, with comparable text alignment. Notably, in the faithfulness part of the user study, we outperform the baselines by a significant margin (a preference rate of over 80%). As depicted in Figure 6, our method successfully preserves the identity and emotional accuracy
by maintaining the individual’s unique facial characteristics while others fail.
6 Conclusion
In this paper, we present ControlVideo, which utilizes T2I diffusion models for one-shot text-to-video editing. It introduces additional controls to preserve the structure of the source video, key-frame attention that aligns all frames with a key frame, and temporal attention initialized with pretrained spatial self-attention weights, to improve faithfulness and temporal consistency. We demonstrate the effectiveness of ControlVideo by outperforming state-of-the-art text-driven video editing methods.
Table 2: Quantitative results about different choices of key and value in self-attention.
A Implementation Details
We evaluate human preference in terms of text alignment, faithfulness, temporal consistency, and all three aspects combined. A total of 10 subjects participated in this study. Taking faithfulness as an example, given a source video, participants are instructed to select which edited video is more faithful to the source video in pairwise comparisons between the baselines and ControlVideo.
B Ablation Studies
In this section, we present the quantitative results of our ablation studies. Recognizing that the
quantitative results may diverge from human evaluation, we ultimately prioritize human evaluation as
our primary measure, while utilizing the quantitative results as supplementary references.
As shown in Table 2, selecting a key frame as the key and value achieves high temporal consistency, and there is no significant difference among different key-frame selections.
Based on the results presented in Table 3, we observe that fine-tuning $W^O$ yields superior performance while using fewer parameters, making it our ultimate selection.
The quantitative results are shown in Table 4. We find that before self-attention and with self-attention yield the best temporal consistency, and with self-attention has a slightly higher text alignment score. In addition, the after-FFN location yields the worst temporal consistency and should be avoided. Moreover, using pretrained weights achieves better performance than random initialization.
Table 4: Quantitative results about different local locations for introducing temporal attention.
Table 5: Quantitative results about different global locations for introducing temporal attention. These quantitative results diverge from human evaluation; we ultimately prioritize human evaluation as our primary measure and list these numbers for reference. See details in Sec. B.4.
The quantitative results are shown in Table 5 and diverge from human evaluation; we ultimately prioritize human evaluation as our primary measure and list these numbers for reference. From the human evaluation (see Figure 4 in the main text), we find that adding temporal attention to Block 1,2,3 of the UNet achieves good performance.
Table 6: Quantitative results. Text. and Temp. represent CLIP-text and CLIP-temp respectively. User
study shows the preference rate of ControlVideo against baselines via human evaluation. A., T., F.
and O. represent text alignment, temporal consistency, faithfulness and overall aspects.
[Additional qualitative results (Appendix D): figures showing edits of the source videos “a girl” (e.g., “a black cat”), “the back view of a woman” (e.g., “a cake with romantic pure red candlestick, beautifully backlit, matte painting concept art”; “burning mountain, ready for rescue, star wars, end of the world, depressing atmosphere, sci-fi”), “a person is dancing”, and “a black swan is swimming in a river”, each shown with the corresponding control.]
References
[1] Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv
Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors.
arXiv preprint arXiv:2302.01329, 2023.
[2] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis
Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint
arXiv:2302.03011, 2023.
[3] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[4] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko,
Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High
definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[5] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical
text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[6] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or.
Prompt-to-prompt image editing with cross attention control. International Conference on
Learning Representations, 2023.
[7] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features
for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022.
[8] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu.
Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
[9] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and
Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint
arXiv:2303.09535, 2023.
[10] Wen Wang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, and Chunhua
Shen. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint
arXiv:2303.17599, 2023.
[11] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan,
Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models
for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
[12] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing
with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
[13] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion
models. arXiv preprint arXiv:2302.05543, 2023.
[14] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International conference on machine learning,
pages 8748–8763. PMLR, 2021.
[15] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment:
from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–
612, 2004.
[16] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2020.
[17] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the
optimal reverse variance in diffusion probabilistic models. In International Conference on
Learning Representations, 2021.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances
in Neural Information Processing Systems, 33:6840–6851, 2020.
[19] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.
Advances in Neural Information Processing Systems, 34, 2021.
[20] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv
preprint arXiv:2010.02502, 2020.
[21] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A
fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint
arXiv:2206.00927, 2022.
[22] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion
for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
[23] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J
Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[24] Ondřej Jamriška, Šárka Sochorová, Ondřej Texler, Michal Lukáč, Jakub Fišer, Jingwan Lu, Eli
Shechtman, and Daniel Sýkora. Stylizing video by example. ACM Transactions on Graphics
(TOG), 38(4):1–11, 2019.
[25] Ondřej Texler, David Futschik, Michal Kučera, Ondřej Jamriška, Šárka Sochorová, Menglei Chai, Sergey Tulyakov, and Daniel Sýkora. Interactive video stylization using few-shot patch-
based training. ACM Transactions on Graphics (TOG), 39(4):73–1, 2020.
[26] Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang-gil Lee, and Sungroh Yoon. Edit-a-video:
Single video editing with object-aware consistency. arXiv preprint arXiv:2303.07945, 2023.
[27] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv
preprint arXiv:2106.09685, 2021.
[28] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung,
and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint
arXiv:1704.00675, 2017.