CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang⋆ Jiayan Teng⋆ Wendi Zheng Ming Ding Shiyu Huang Jiazheng Xu
Yuanming Yang Wenyi Hong Xiaohan Zhang Guanyu Feng Da Yin
Xiaotao Gu Yuxuan Zhang Weihan Wang Yean Cheng Ting Liu Bin Xu
Yuxiao Dong Jie Tang
Abstract
We introduce CogVideoX, large-scale diffusion transformer models designed for
generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along
both spatial and temporal dimensions. To improve the text-video alignment, we pro-
pose an expert transformer with the expert adaptive LayerNorm to facilitate the deep
fusion between the two modalities. By employing a progressive training technique,
CogVideoX is adept at producing coherent, long-duration videos characterized by
significant motions. In addition, we develop an effective text-video data processing
pipeline that includes various data preprocessing strategies and a video captioning
method. It significantly helps enhance the performance of CogVideoX, improving
both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics
and human evaluations. The model weights of both the 3D Causal VAE and
CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
1 Introduction
The rapid development of text-to-video models has been phenomenal, driven by both the Transformer
architecture (Vaswani et al., 2017) and diffusion models (Ho et al., 2020). Early attempts to pretrain and
scale Transformers to generate videos from text have shown great promise, such as CogVideo (Hong
et al., 2022) and Phenaki (Villegas et al., 2022). Meanwhile, diffusion models have recently made
exciting advancements in multimodal generation, including video generation (Singer et al., 2022;
Ho et al., 2022). By using Transformers as the backbone of diffusion models, i.e., Diffusion
Transformers (DiT) (Peebles & Xie, 2023), text-to-video generation has reached groundbreaking
levels, as evidenced by the impressive Sora showcases (OpenAI, 2024b).
Despite these rapid advancements in DiTs, it remains technically unclear how to achieve long-term
consistent video generation. Challenges such as efficiently modeling video data, effectively aligning
videos with text semantics, and constructing high-quality text-video pairs for model training
have thus far been largely unaddressed.
In this work, we train and introduce CogVideoX, a set of large-scale diffusion transformer models
designed for generating long-term, temporally consistent videos. We address the challenges mentioned
above by developing a 3D Variational Autoencoder (VAE), an expert Transformer, and a video data
*Equal contributions. Core contributors: Zhuoyi, Jiayan, Wendi, Ming, and Shiyu.
{yangzy22,tengjy24}@mails.tsinghua.edu.cn, {yuxiaod,jietang}@tsinghua.edu.cn
Figure 1: The performance of openly-accessible text-to-video models in different aspects.
filtering and captioning pipeline, respectively. First, to efficiently consume video data, we design
and train a 3D causal VAE that compresses the video along both spatial and temporal dimensions.
Compared to unfolding a video into a one-dimensional sequence in the pixel space, this strategy helps
significantly reduce the sequence length and associated training compute. Unlike previous video
models (Blattmann et al., 2023) that often use a 2D VAE to encode each frame separately, the 3D
VAE helps prevent flicker in the generated videos by ensuring continuity among frames.
Second, to improve the alignment between videos and texts, we propose an expert Transformer with
expert adaptive LayerNorm to facilitate the fusion between the two modalities. To ensure the temporal
consistency in video generation and capture large-scale motions, we propose to use 3D full attention
to comprehensively model the video along both temporal and spatial dimensions.
Third, as most video data available online lacks accurate textual descriptions, we develop a video
captioning pipeline capable of accurately describing video content. This pipeline is used to generate
new textual descriptions for all video data, which significantly enhances CogVideoX's ability to grasp precise semantics.
In addition, we adopt and design progressive training techniques, including mixed-duration training
and resolution progressive training, to further enhance the generation performance and stability of
CogVideoX. Furthermore, we propose Explicit Uniform Sampling, which stabilizes the training loss
curve and accelerates convergence by setting different timestep sampling intervals on each data
parallel rank.
Both machine and human evaluations suggest that CogVideoX outperforms well-known public
models. Figure 1 shows the performance of CogVideoX in different aspects.
CogVideoX is an ongoing attempt to advance text-to-video generation. To facilitate further devel-
opments, we open-source the model weights of part of the CogVideoX models along with the 3D VAE, and we plan to release larger models in the future. The currently open-sourced CogVideoX can generate 720×480 videos of six seconds at eight frames per second. It can be publicly accessed from
https://github.com/THUDM/CogVideo.
2 The CogVideoX Architecture

Given a text-video pair, the text is encoded into embeddings z_text by a T5 encoder (Raffel et al., 2020), while the video is compressed by the 3D causal VAE and patchified into vision embeddings z_vision. z_text and z_vision are then concatenated along the sequence dimension. The concatenated embeddings are fed into a stack of expert transformer blocks. Finally, the model output is unpatchified to restore the original latent shape, which is then decoded by the 3D causal VAE decoder to reconstruct the video. Below, we describe the technical design of the 3D causal VAE and the expert transformer in detail.

2.1 3D Causal VAE
Videos encompass not only spatial information but also substantial temporal information, usually
resulting in orders of magnitude more data volumes than images. To tackle the computational
challenge of modeling video data, we propose to implement a video compression module based
on 3D Variational Autoencoders (3D VAEs) (Yu et al., 2023). The idea is to incorporate three-
dimensional convolutions to compress videos both spatially and temporally. This can help achieve a
higher compression ratio with largely improved quality and continuity of video reconstruction when
compared to previous image VAEs (Rombach et al., 2022; Esser et al., 2021).
Figure 3 (a) shows the structure of the proposed 3D VAE. It comprises an encoder, a decoder and
a latent space regularizer. The Gaussian latent space is constrained by a Kullback-Leibler (KL)
regularizer. The encoder and decoder each consist of four symmetrically arranged stages that perform 2× downsampling and upsampling, respectively, through interleaved stacks of ResNet blocks. The first two rounds of downsampling and, symmetrically, the last two rounds of upsampling operate on both the spatial and temporal dimensions, while the remaining round samples only spatially. This enables the 3D VAE
to achieve a 4× compression in the temporal dimension and an 8×8 compression in the spatial
dimension. In total, this achieves a 4×8×8 compression from pixels to the latents.
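For concreteness, the latent shape implied by this compression can be computed as below; treating the first frame causally (so that 4k + 1 input frames map to k + 1 latent frames) and using 16 latent channels are assumptions consistent with a causal 3D VAE, not a specification of the released model:

```python
def latent_shape(num_frames: int, height: int, width: int, latent_channels: int = 16):
    """Compute the latent shape under 4x temporal and 8x8 spatial compression.

    Assumes a temporally causal VAE that encodes the first frame alone and then
    compresses every subsequent group of 4 frames into one latent frame, i.e.
    T_latent = 1 + (T - 1) // 4 for inputs of the form T = 4k + 1.
    """
    t_latent = 1 + (num_frames - 1) // 4   # causal: first frame is encoded on its own
    h_latent = height // 8
    w_latent = width // 8
    return (t_latent, h_latent, w_latent, latent_channels)

# Example: a six-second, 8-fps, 720x480 clip stored as 49 frames (a 4k + 1 frame count).
print(latent_shape(49, 480, 720))  # -> (13, 60, 90, 16)
```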
We adopt the temporally causal convolution (Yu et al., 2023), which places all the padding at the beginning of the convolution space, as shown in Figure 3 (b). This ensures that future information does not influence present or past predictions. Given that processing videos with a large number of
frames introduces excessive GPU memory usage, we apply context parallelism along the temporal dimension for the 3D convolutions to distribute computation among multiple devices. As illustrated in Figure 3 (b), due to the causal nature of the convolution, each rank simply sends a segment of length k − 1 to the next rank, where k indicates the temporal kernel size. This results in relatively low communication overhead.

Figure 3: (a) The structure of the 3D VAE in CogVideoX. It comprises an encoder, a decoder, and a latent space regularizer, achieving a 4×8×8 compression from pixels to latents. (b) The context parallel implementation of the temporally causal convolution.
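A simplified, single-process sketch of these two ideas, causal temporal padding and passing only the trailing k − 1 frames between ranks, is given below; real context parallelism would use collective communication rather than the in-memory hand-off shown here:

```python
import torch
import torch.nn.functional as F

def causal_conv3d(x, weight, bias=None):
    """3D convolution with all temporal padding placed at the beginning,
    so outputs never depend on future frames. x: (B, C, T, H, W)."""
    k_t, k_h, k_w = weight.shape[-3:]
    # pad order: (W_left, W_right, H_top, H_bottom, T_front, T_back)
    x = F.pad(x, (k_w // 2, k_w // 2, k_h // 2, k_h // 2, k_t - 1, 0))
    return F.conv3d(x, weight, bias)

def context_parallel_causal_conv3d(chunks, weight, bias=None):
    """Simulate context parallelism along time: each 'rank' holds a temporal chunk
    and receives the last k_t - 1 frames from the previous rank (assumes k_t >= 2)."""
    k_t, k_h, k_w = weight.shape[-3:]
    outputs = []
    for rank, x in enumerate(chunks):
        if rank == 0:
            halo = x.new_zeros(x.shape[:2] + (k_t - 1,) + x.shape[3:])  # causal zero padding
        else:
            halo = chunks[rank - 1][:, :, -(k_t - 1):]                  # "send" k_t - 1 frames
        x = torch.cat([halo, x], dim=2)
        x = F.pad(x, (k_w // 2, k_w // 2, k_h // 2, k_h // 2, 0, 0))    # spatial padding only
        outputs.append(F.conv3d(x, weight, bias))
    return torch.cat(outputs, dim=2)

# The chunked result matches running the causal convolution on the full sequence.
video = torch.randn(1, 3, 16, 32, 32)
w = torch.randn(8, 3, 3, 3, 3)
full = causal_conv3d(video, w)
split = context_parallel_causal_conv3d(list(video.chunk(4, dim=2)), w)
assert torch.allclose(full, split, atol=1e-5)
```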
During actual implementation, we first train a 3D VAE on lower resolutions and fewer frames to save
computation. We observe that encoding at larger resolutions generalizes naturally, while extending the number of frames to be encoded does not work as seamlessly. Therefore, we conduct a two-stage training process: first training on short videos and then finetuning with context parallelism on long videos.
Both stages of training utilize a weighted combination of the L2 loss, LPIPS (Zhang et al., 2018)
perceptual loss, and GAN loss from a 3D discriminator.
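The combined objective can be sketched as follows; the loss weights and the perceptual/discriminator interfaces are placeholders rather than the values and modules used in training:

```python
import torch
import torch.nn.functional as F

def to_frames(video: torch.Tensor) -> torch.Tensor:
    """(B, C, T, H, W) -> (B*T, C, H, W), so a 2D perceptual metric can be applied per frame."""
    return video.transpose(1, 2).flatten(0, 1)

def vae_training_loss(x, x_rec, kl, perceptual_fn, discriminator,
                      w_lpips=1.0, w_gan=0.05, w_kl=1e-6):
    """Weighted combination of L2, LPIPS-style perceptual, and GAN losses, plus the KL term
    of the latent-space regularizer described earlier. `perceptual_fn` and `discriminator`
    are placeholder callables; the weights are illustrative, not the paper's values."""
    l2 = F.mse_loss(x_rec, x)
    perceptual = perceptual_fn(to_frames(x_rec), to_frames(x)).mean()
    gan = -discriminator(x_rec).mean()   # non-saturating generator loss from the 3D discriminator
    return l2 + w_lpips * perceptual + w_gan * gan + w_kl * kl
```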
2.2 Expert Transformer

We now introduce the design choices in the Transformer of CogVideoX, including the patching, positional embedding, and attention strategies for handling text-video data effectively and efficiently.
Patchify. The 3D causal VAE encodes a video into a latent of shape T × H × W × C, where T, H, W, and C represent the number of latent frames, the height, the width, and the number of channels, respectively. The video latents are then patchified along the spatial
dimensions, generating a sequence z_vision of length T · (H/p) · (W/p), where p denotes the patch size. Note that we do not patchify along the
temporal dimension in order to enable joint training of images and videos.
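A minimal sketch of spatial-only patchification is given below; the patch size, latent channel count, and hidden size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpatialPatchify(nn.Module):
    """Patchify video latents along H and W only, keeping the temporal axis intact
    so that single-frame images (T = 1) and videos share the same treatment."""
    def __init__(self, in_channels: int = 16, patch_size: int = 2, hidden_size: int = 1920):
        super().__init__()
        self.p = patch_size
        self.proj = nn.Linear(in_channels * patch_size * patch_size, hidden_size)

    def forward(self, z):                      # z: (B, T, H, W, C)
        B, T, H, W, C = z.shape
        p = self.p
        z = z.reshape(B, T, H // p, p, W // p, p, C)
        z = z.permute(0, 1, 2, 4, 3, 5, 6)     # (B, T, H/p, W/p, p, p, C)
        z = z.reshape(B, T * (H // p) * (W // p), p * p * C)
        return self.proj(z)                    # sequence length: T * (H/p) * (W/p)

tokens = SpatialPatchify()(torch.randn(1, 13, 60, 90, 16))
print(tokens.shape)  # torch.Size([1, 17550, 1920])
```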
3D-RoPE. Rotary Position Embedding (RoPE) (Su et al., 2024) is a relative positional encoding that
has been demonstrated to capture inter-token relationships effectively in LLMs, particularly excelling
in modeling long sequences. To adapt to video data, we extend the original RoPE to 3D-RoPE. Each
latent in the video tensor can be represented by a 3D coordinate (x, y, t). We independently apply
1D-RoPE to each of the three coordinates, occupying 3/8, 3/8, and 2/8 of the hidden states' channels, respectively. The resulting encoding is then concatenated along the channel dimension to obtain
the final 3D-RoPE encoding.
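The construction can be sketched as follows; the rotary base of 10000 and the pairwise rotation layout are assumptions carried over from standard RoPE implementations:

```python
import torch

def rope_1d(positions, dim, base=10000.0):
    """Standard 1D rotary angles for the given positions; returns (len, dim/2)."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.float(), freqs)

def rope_3d(t, h, w, head_dim):
    """Per-token rotary angles for a (t, h, w) latent grid.
    The channel budget is split 3/8 : 3/8 : 2/8 over x (width), y (height), and t."""
    dim_x = dim_y = head_dim // 8 * 3
    dim_t = head_dim - dim_x - dim_y
    fx = rope_1d(torch.arange(w), dim_x)           # (w, dim_x/2)
    fy = rope_1d(torch.arange(h), dim_y)           # (h, dim_y/2)
    ft = rope_1d(torch.arange(t), dim_t)           # (t, dim_t/2)
    # Broadcast each axis' angles over the full (t, h, w) grid, then concatenate channels.
    fx = fx[None, None, :, :].expand(t, h, w, -1)
    fy = fy[None, :, None, :].expand(t, h, w, -1)
    ft = ft[:, None, None, :].expand(t, h, w, -1)
    angles = torch.cat([fx, fy, ft], dim=-1)       # (t, h, w, head_dim/2)
    return angles.reshape(t * h * w, -1)           # one row per video token

def apply_rope(x, angles):
    """Rotate the last dimension of x (..., seq, head_dim) pairwise by `angles`."""
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return out.flatten(-2)

q = torch.randn(2, 13 * 30 * 45, 64)               # (batch, video tokens, head_dim)
q_rot = apply_rope(q, rope_3d(13, 30, 45, 64))
```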
We empirically examine the use of RoPE. Figure 4 (a) shows the comparison between 3D RoPE and
sinusoidal absolute position encoding. We can observe that the loss curve using 3D RoPE converges
significantly faster than that with sinusoidal encoding. We further compare the use of 3D RoPE
alone against the combination of 3D RoPE and learnable absolute position embedding. Figure 4 (b)
indicates that the loss curves of both methods converge almost identically. Therefore, we choose to
use 3D RoPE alone for simplicity.
Figure 4: Training loss curves of different ablations: (a) 3D RoPE vs. sinusoidal absolute position encoding; (b) 3D RoPE vs. 3D RoPE + learnable absolute position embedding; (c) Expert AdaLN vs. Expert AdaLN + MLP; (d) Explicit Uniform Sampling vs. no uniform sampling.
Expert Transformer Block. We concatenate the embeddings of both text and video at the input
stage to better align visual and semantic information. However, the feature spaces of these two
modalities differ significantly, and their embeddings may even have different numerical scales. To
better process them within the same sequence, we employ Expert Adaptive LayerNorm to handle each modality independently. As shown in Figure 2, following DiT (Peebles & Xie, 2023), we use the timestep t of the diffusion process as the input to the modulation module. Then, the Vision Expert Adaptive LayerNorm (Vision Expert AdaLN) and Text Expert Adaptive LayerNorm (Text Expert AdaLN) apply this modulation mechanism to the vision hidden states and text hidden states,
respectively. This strategy promotes the alignment of feature spaces across two modalities while
minimizing additional parameters.
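A minimal sketch of the expert adaptive LayerNorm is shown below; the shift/scale/gate triple and the toy tensor sizes are illustrative, with only the idea of separate text and vision modulation driven by the timestep embedding taken from the description above:

```python
import torch
import torch.nn as nn

class ExpertAdaLN(nn.Module):
    """Separate adaptive-LayerNorm experts for text and vision tokens that share one
    transformer block. The timestep embedding produces per-modality modulation
    parameters, in the style of DiT's adaLN."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        # shift, scale, gate per modality
        self.text_mod = nn.Linear(hidden_size, 3 * hidden_size)
        self.vision_mod = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x, t_emb, n_text):
        """x: (B, n_text + n_vision, D); t_emb: (B, D) diffusion-timestep embedding."""
        x_text, x_vis = x[:, :n_text], x[:, n_text:]
        s_t, sc_t, g_t = self.text_mod(t_emb).chunk(3, dim=-1)
        s_v, sc_v, g_v = self.vision_mod(t_emb).chunk(3, dim=-1)
        h_text = self.norm(x_text) * (1 + sc_t[:, None]) + s_t[:, None]
        h_vis = self.norm(x_vis) * (1 + sc_v[:, None]) + s_v[:, None]
        # The gates g_t / g_v would scale the attention or MLP residual outputs.
        return torch.cat([h_text, h_vis], dim=1), (g_t, g_v)

h, gates = ExpertAdaLN(128)(torch.randn(2, 16 + 64, 128), torch.randn(2, 128), n_text=16)
```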
To verify the adoption of Expert Adaptive LayerNorm, we experiment with different ways of incorporating experts: expert LayerNorm together with expert MLP, and expert LayerNorm only. Our experiments find that adding the expert MLP does not effectively accelerate the model's convergence (cf. Figure 4 (c)). To reduce the model parameters, we choose to use only the Expert Adaptive LayerNorm.
3D Full Attention. Previous works (Singer et al., 2022; Guo et al., 2023) often employ separated
spatial and temporal attention to reduce computational complexity and facilitate fine-tuning from
text-to-image models. However, as illustrated in Figure 5, this separated attention approach requires
extensive implicit transmission of visual information, significantly increasing the learning complexity
and making it challenging to maintain the consistency of large-movement objects. Considering the
great success of long-context training in LLMs (AI@Meta, 2024; Bai et al., 2024; Xiong et al., 2023)
and the efficiency of FlashAttention (Dao et al., 2022), we propose a 3D text-video hybrid attention
mechanism. This mechanism not only achieves better results but can also be easily adapted to various
parallel acceleration methods.
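Operationally, the hybrid attention is ordinary full self-attention over the concatenated text and video token sequence, with no spatial/temporal factorization, which is what makes it compatible with FlashAttention-style kernels and sequence parallelism. A sketch with illustrative sizes:

```python
import torch
import torch.nn.functional as F

def hybrid_3d_attention(text_tokens, video_tokens, num_heads: int = 8):
    """Full self-attention over [text; video], so every video patch can attend to every
    other patch across all frames. Q/K/V projections are omitted for brevity."""
    x = torch.cat([text_tokens, video_tokens], dim=1)      # (B, L_text + T*H'*W', D)
    B, L, D = x.shape
    q = k = v = x.view(B, L, num_heads, D // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)           # uses fused kernels when available
    return out.transpose(1, 2).reshape(B, L, D)

text = torch.randn(1, 16, 128)
video = torch.randn(1, 4 * 8 * 8, 128)                      # latent grid (T, H/p, W/p) = (4, 8, 8)
print(hybrid_3d_attention(text, video).shape)               # torch.Size([1, 272, 128])
```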
Figure 5: The separated spatial and temporal attention makes it challenging to handle the large motion
between adjacent frames. In the figure, the head of the person in frame i + 1 cannot directly attend
to the head in frame i. Instead, visual information can only be implicitly transmitted through other
background patches. This can lead to inconsistency issues in the generated videos.
Figure 6: The diagram of mixed-duration training and Frame Pack. To fully utilize the data and
enhance the model’s generalization capability, we train with videos of different durations within the
same batch.
3 Training CogVideoX
We mix images and videos during training, treating each image as a single-frame video. Additionally,
we employ progressive training from the resolution perspective. For the diffusion setting, we adopt
v-prediction (Salimans & Ho, 2022) and zero SNR (Lin et al., 2024), following the noise schedule
used in LDM (Rombach et al., 2022). For timestep sampling during diffusion training, we also employ an Explicit Uniform Sampling method (Section 3.3), which benefits training stability.
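For reference, the v-prediction target (Salimans & Ho, 2022) and the zero-terminal-SNR rescaling (Lin et al., 2024) can be sketched as follows; the schedule length and beta range are illustrative rather than the exact values used in training:

```python
import torch

def rescale_zero_terminal_snr(alphas_cumprod):
    """Rescale sqrt(alpha_bar) linearly so the last step has exactly zero SNR
    (Lin et al., 2024), while keeping the first step's value unchanged."""
    s = alphas_cumprod.sqrt()
    s_0, s_T = s[0].clone(), s[-1].clone()
    s = s - s_T                    # shift so the terminal value is zero
    s = s * s_0 / (s_0 - s_T)      # rescale so the first value is preserved
    return s ** 2

def v_prediction_target(x0, noise, alphas_cumprod, t):
    """v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0 (Salimans & Ho, 2022)."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1, 1)
    sigma = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1, 1)
    return a * noise - sigma * x0

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                      # illustrative linear schedule
alphas_cumprod = rescale_zero_terminal_snr(torch.cumprod(1 - betas, dim=0))
x0, eps = torch.randn(2, 16, 13, 60, 90), torch.randn(2, 16, 13, 60, 90)
t = torch.randint(0, T, (2,))
v = v_prediction_target(x0, eps, alphas_cumprod, t)
```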
3.1 Frame Pack

Previous video training methods often involve joint training of images and videos with a fixed number
of frames (Singer et al., 2022; Blattmann et al., 2023). However, this approach usually leads to two
issues: First, under bidirectional attention there is a significant gap between the two input types, with images having a single frame while videos have dozens of frames. We observe that models trained this way tend to diverge into two generative modes based on the token count and fail to generalize well. Second, to train with a fixed duration, we have to discard short videos and truncate long videos, which prevents full utilization of videos with varying numbers of frames.
To address these issues, we adopt mixed-duration training, which means training with videos of different lengths together. However, inconsistent data shapes within a batch make training difficult. Inspired
by Patch’n Pack (Dehghani et al., 2024), we place videos of different lengths into the same batch
to ensure consistent shapes within each batch, a method we refer to as Frame Pack. The process is
illustrated in Figure 6.
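One possible realization of Frame Pack, greedily packing variable-length videos into a fixed per-sample frame budget with per-frame video ids for attention masking, is sketched below; this packing scheme is our illustrative reading of the description, not the released implementation:

```python
from typing import List
import torch

def frame_pack(videos: List[torch.Tensor], max_frames: int):
    """Greedily pack variable-length videos (each (T_i, C, H, W)) into fixed-length sequences
    of `max_frames`, returning the packed tensor and per-frame video ids so attention can be
    masked to stay within each original video. Assumes every video fits in the budget."""
    packed, ids = [], []
    buf, buf_ids, next_id = [], [], 0
    for v in videos:
        if buf and sum(x.shape[0] for x in buf) + v.shape[0] > max_frames:
            packed.append(_pad(torch.cat(buf), max_frames))
            ids.append(_pad_ids(buf_ids, max_frames))
            buf, buf_ids = [], []
        buf.append(v)
        buf_ids.extend([next_id] * v.shape[0])
        next_id += 1
    if buf:
        packed.append(_pad(torch.cat(buf), max_frames))
        ids.append(_pad_ids(buf_ids, max_frames))
    return torch.stack(packed), torch.stack(ids)

def _pad(x, n):        # zero-pad along the frame axis to length n
    return torch.cat([x, x.new_zeros(n - x.shape[0], *x.shape[1:])])

def _pad_ids(ids, n):  # -1 marks padding frames
    return torch.tensor(ids + [-1] * (n - len(ids)))

videos = [torch.randn(t, 3, 8, 8) for t in (9, 17, 5, 33, 1)]   # mixed durations, incl. an image
batch, frame_ids = frame_pack(videos, max_frames=48)
print(batch.shape, frame_ids.shape)   # torch.Size([2, 48, 3, 8, 8]) torch.Size([2, 48])
```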
3.2 Resolution Progressive Training
The training pipeline of CogVideoX is divided into three stages: low-resolution training, high-
resolution training, and high-quality video fine-tuning. Similar to images, videos collected from the Internet usually include a significant proportion of low-resolution ones. Progressive training can effectively
utilize videos of various resolutions. Moreover, training at low resolution initially can equip the
model with coarse-grained modeling capabilities, followed by high-resolution training to enhance its
ability to capture fine details. Compared to direct high-resolution training, staged training can also
help reduce the overall training time.
Figure 7: The comparison between the initial generation states of extrapolation and interpolation
when increasing the resolution with RoPE encoding. Extrapolation tends to generate multiple small,
clear, and repetitive images, while interpolation generates a blurry large image.
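The two strategies compared in Figure 7 amount to a choice of position indices fed to RoPE when the latent grid grows, as sketched below:

```python
import torch

def positions_extrapolate(new_len: int) -> torch.Tensor:
    """Extrapolation: keep the original spacing and simply use larger indices."""
    return torch.arange(new_len, dtype=torch.float32)

def positions_interpolate(new_len: int, train_len: int) -> torch.Tensor:
    """Interpolation: squeeze the new grid into the index range seen at training time."""
    return torch.arange(new_len, dtype=torch.float32) * (train_len / new_len)

# Going from a 30-wide latent grid at low resolution to 60 at high resolution:
print(positions_extrapolate(60)[-3:])        # tensor([57., 58., 59.])
print(positions_interpolate(60, 30)[-3:])    # tensor([28.5000, 29.0000, 29.5000])
```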
High-Quality Fine-Tuning. Since the filtered pre-training data still contains a certain proportion of dirty data, such as subtitles, watermarks, and low-bitrate videos, we select a subset of higher-quality video data, accounting for 20% of the total dataset, for fine-tuning in the final stage. This step effectively removes generated subtitles and watermarks and slightly improves the visual quality, although we also observe a slight degradation in the model's semantic ability.
3.3 Explicit Uniform Sampling

Ho et al. (2020) define the training objective of diffusion models as

L_simple(θ) = E_{t, x_0, ε} [ ‖ ε − ε_θ( √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε, t ) ‖² ],    (1)

where t is uniformly distributed between 1 and T. The common practice is for each rank in the data parallel group to uniformly sample a value between 1 and T, which is in theory equivalent to
Equation 1. However, in practice, the results obtained from such random sampling are often not
sufficiently uniform, and since the magnitude of the diffusion loss is related to the timesteps, this
can lead to significant fluctuations in the loss. Thus, we propose to use Explicit Uniform Sampling
to divide the range from 1 to T into n intervals, where n is the number of ranks. Each rank then
uniformly samples within its respective interval. This method ensures a more uniform distribution of
timesteps. As shown in Figure 4 (d), the loss curve from training with Explicit Uniform Sampling is
noticeably more stable.
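A minimal sketch of Explicit Uniform Sampling, with each data-parallel rank drawing timesteps only from its own sub-interval of the timestep range (0-indexed here for convenience):

```python
import torch

def explicit_uniform_timesteps(batch_size: int, rank: int, world_size: int, T: int = 1000):
    """Each rank draws timesteps uniformly from its own interval of [0, T), so the union
    over all ranks covers the timestep range more evenly than independent sampling."""
    interval = T // world_size
    low = rank * interval
    high = T if rank == world_size - 1 else low + interval   # last rank absorbs the remainder
    return torch.randint(low, high, (batch_size,))

# With 8 data-parallel ranks and T = 1000, rank 3 samples from [375, 500).
print(explicit_uniform_timesteps(4, rank=3, world_size=8))
```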
In addition, for a more precise comparison, we compare the loss at each diffusion timestep between the two methods. We find that with Explicit Uniform Sampling, the loss at all timesteps decreases faster, indicating that this method also accelerates loss convergence.
3.4 Data
We construct a collection of relatively high-quality video clips with text descriptions through video
filters and recaption models. After filtering, approximately 35M single-shot clips remain, with each
clip averaging about 6 seconds.
Video Filtering. Video generation models need to learn the dynamic information of the world, but unfiltered video data follows a highly noisy distribution, primarily for two reasons: First, videos are human-created, and artificial editing may distort the authentic dynamic information; second, video quality can drop significantly due to issues during filming, such as camera shake and substandard equipment.
In addition to the intrinsic quality of the videos, we also consider how well the video data supports
model training. Videos with minimal dynamic information or lacking connectivity in dynamic aspects
are considered detrimental. Consequently, we have developed a set of negative labels, which include:
• Editing: Videos that have undergone obvious artificial processing, such as re-editing and
special effects, causing degradation of the visual integrity.
• Lack of Motion Connectivity: Video segments with image transitions lacking motion
connectivity, commonly seen in videos artificially spliced or edited from images.
• Low Quality: Poorly shot videos with unclear visuals or excessive camera shake.
• Lecture Type: Videos focusing primarily on a person continuously talking with minimal
effective motion, such as educational content, lectures, and live-streamed discussions.
• Text Dominated: Videos containing a substantial amount of visible text or primarily
focusing on textual content.
• Noisy Screenshots: Noisy videos recorded from phone or computer screens.
We sample 20,000 videos and label the presence of negative tags in each of them. Using these annotations, we train several filters based on Video-LLaMA (Zhang et al., 2023b) to screen out low-quality video data.
In addition, we calculate the optical flow scores and image aesthetic scores of all training videos and
dynamically adjust the threshold ranges during training to ensure the fluency and aesthetic quality of
the generated videos.
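The score-based part of this filtering can be sketched as follows; the field names and threshold values are placeholders for illustration:

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Clip:
    path: str
    optical_flow_score: float   # mean flow magnitude, a proxy for motion
    aesthetic_score: float      # image-aesthetic score on sampled frames
    negative_tags: List[str]    # predictions from the Video-LLaMA-based filters

def filter_clips(clips: Iterable[Clip], flow_range=(0.5, 20.0), min_aesthetic=4.5) -> List[Clip]:
    """Keep clips with no negative tags, enough (but not chaotic) motion, and a
    sufficient aesthetic score. Thresholds here are illustrative and would be
    adjusted dynamically during training."""
    kept = []
    for c in clips:
        if c.negative_tags:
            continue
        if not (flow_range[0] <= c.optical_flow_score <= flow_range[1]):
            continue
        if c.aesthetic_score < min_aesthetic:
            continue
        kept.append(c)
    return kept

clips = [Clip("a.mp4", 3.1, 5.2, []), Clip("b.mp4", 0.1, 6.0, []), Clip("c.mp4", 4.0, 5.8, ["Editing"])]
print([c.path for c in filter_clips(clips)])   # ['a.mp4']
```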
Video Caption. Typically, most video data does not come with corresponding descriptive text, so
it is necessary to convert the video data into textual descriptions to provide the essential training
data for text-to-video models. Currently, there are some video caption datasets available, such as
Panda70M (Chen et al., 2024b), COCO Caption (Lin et al., 2014), and WebVid (Bain et al., 2021).
However, the captions in these datasets are usually very short and fail to describe the video’s content
comprehensively.
To generate high-quality video caption data, we establish a Dense Video Caption Data Generation
pipeline, as detailed in Figure 8. The idea is to generate video captions from image captions.
First, we use the Panda70M video captioning model (Chen et al., 2024b) to generate short captions
for the videos. Then, we employ the image recaptioning model CogVLM (Wang et al., 2023a)
used in Stable Diffusion 3 (Esser et al., 2024) and CogView3 (Zheng et al., 2024a) to create dense
image captions for each of the frames within a video. Subsequently, we use GPT-4 to summarize
all the image captions to produce the final video caption. To accelerate the generation from image
captions to video captions, we fine-tune a Llama2 model (Touvron et al., 2023) using the summary
data generated by GPT-4 (Achiam et al., 2023), enabling large-scale video caption data generation.
Additional details regarding the video caption data generation process can be found in Appendix C.
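The orchestration of this pipeline can be sketched as below, where short_caption_model, image_recaption_model, and summarizer stand in for Panda70M, CogVLM, and the GPT-4 / fine-tuned Llama2 summarizer; all callables here are placeholders rather than real APIs:

```python
from typing import Callable, List, Sequence

def dense_video_caption(
    frames: Sequence,                                  # sampled video frames (e.g. decoded images)
    short_caption_model: Callable[[Sequence], str],    # Panda70M-style clip captioner
    image_recaption_model: Callable[[object], str],    # CogVLM-style dense image captioner
    summarizer: Callable[[str, List[str]], str],       # GPT-4 / fine-tuned Llama2 summarizer
) -> str:
    """Generate a dense video caption: a short clip caption plus per-frame dense captions,
    summarized into a single description."""
    short_caption = short_caption_model(frames)
    frame_captions = [image_recaption_model(f) for f in frames]
    return summarizer(short_caption, frame_captions)

# Toy stand-ins, just to show the data flow:
caption = dense_video_caption(
    frames=["frame0", "frame1"],
    short_caption_model=lambda fr: "a dog runs on a beach",
    image_recaption_model=lambda f: f"dense caption of {f}",
    summarizer=lambda short, dense: short + "; " + " ".join(dense),
)
print(caption)
```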
The pipeline above generates the caption data used to train the CogVideoX model introduced in this report. To further accelerate video recaptioning, we also fine-tune an end-to-end video understanding model, CogVLM2-Caption, based on CogVLM2-Video¹ and Llama3 (AI@Meta, 2024), using the dense caption data generated from the aforementioned pipeline.
¹ The CogVLM2-Video model weight is openly available at https://github.com/THUDM/CogVLM2.
Figure 8: The pipeline for dense video caption data generation. In this pipeline, we generate short video captions with the Panda70M model, extract frames to create dense image captions, and use GPT-4 to summarize these into final video captions. To accelerate this process, we fine-tune a Llama 2 model on the GPT-4 summaries.

The video caption data generated by CogVLM2-Caption is used to train the next generation of CogVideoX. Examples
of video captions generated by this end-to-end CogVLM2-Caption model are shown in Appendix D.
In Appendix E, we also present some examples of video generation where a video is first input into
CogVLM2-Caption to generate captions, which are then used as input for CogVideoX to generate
new videos, effectively achieving video-to-video generation.
4 Empirical Evaluation
In this section, we present the performance of CogVideoX through two primary methods: automated
metric evaluation and human assessment. We train CogVideoX models of different parameter sizes. We report results for the 2B and 5B models for now; larger models are still in training.
To facilitate the development of text-to-video generation, we open-source the model weight at
https://github.com/THUDM/CogVideo.
Evaluation Metrics. To evaluate text-to-video generation, we employ several metrics from VBench (Huang et al., 2024): Human Action, Scene, Dynamic Degree, Multiple Objects, and Appearance Style. VBench is a suite of tools designed to automatically assess the quality of generated videos. We select certain metrics from VBench and exclude others that do not align with our evaluation needs. For example, the Color metric, intended to measure the presence of objects of specific colors across frames, assesses a model by calculating the probability of such objects appearing. However, this metric may penalize video generation models that exhibit greater variation, so we do not include it in our evaluation.
For longer generated videos, some models might produce videos with minimal changes between frames to obtain higher scores, but such videos lack rich content. Therefore, a metric for evaluating the dynamism of videos becomes more important. To address this, we employ two further video evaluation tools: Dynamic Quality from Devil (Liao et al., 2024) and GPT4o-MTScore from ChronoMagic (Yuan et al., 2024), which focus more on the dynamic characteristics of videos. Dynamic Quality is defined by the integration of various quality metrics with dynamic scores, mitigating biases arising from negative correlations between video dynamics and video quality. GPT4o-MTScore, in turn, uses GPT-4o to rate the degree of metamorphic change in a generated video, which likewise rewards genuinely dynamic content.
Table 1: Evaluation results.
Results. Table 1 provides the performance comparison of CogVideoX and other models.
CogVideoX achieves the best performance in five out of the seven metrics and shows competi-
tive results in the remaining two metrics. These results demonstrate that the model not only excels in
video generation quality but also outperforms previous models in handling various complex dynamic
scenes. In addition, Figure 1 presents a radar chart that visually illustrates the performance advantages
of CogVideoX.
In addition to automated scoring mechanisms, a comparative analysis between Kling (Team, 2024) and CogVideoX is conducted through manual evaluation. One hundred meticulously crafted prompts are used for the human evaluators, characterized by their broad distribution, clear articulation, and well-defined conceptual scope. We randomize the videos for blind evaluation. A panel of evaluators
is instructed to assign scores for each detail on a scale from zero to one, with the overall total score
rated on a scale from 0 to 5, where higher scores reflect better video quality. To better complement
automated evaluation, human evaluation emphasizes the instruction-following capability: the total
score cannot exceed 2 if the generated video fails to follow the instruction.
The results shown in Table 2 indicate that CogVideoX wins the human preference over Kling across
all aspects. More details about human evaluation are shown in Appendix F.
5 Conclusion
In this paper, we present CogVideoX, a state-of-the-art text-to-video diffusion model. It leverages
a 3D VAE and an Expert Transformer architecture to generate coherent long-duration videos with
significant motion. By implementing a comprehensive data processing pipeline and a video re-
captioning method, we significantly improve the quality and semantic alignment of the generated
videos. Our progressive training techniques, including mixed-duration training and resolution
progressive training, further enhance the model’s performance and stability. Our ongoing efforts
focus on refining CogVideoX's ability to capture complex dynamics and ensure even higher quality in video generation. We are also exploring the scaling laws of video generation models and
aim to train larger and more powerful models to generate longer and higher-quality videos, pushing
the boundaries of what is achievable in text-to-video generation.
Acknowledgments
We would like to thank all the data annotators, infrastructure operators, collaborators, and partners.
We also extend our gratitude to everyone at Zhipu AI and Tsinghua University who has provided support, feedback, or contributions to CogVideoX, even if not explicitly mentioned in this report. We would also like to thank BiliBili for the technical discussions.
References
Pika beta. 2023. URL https://pika.art/home.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774, 2023.
AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/
main/MODEL_CARD.md.
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi
Li. Longalign: A recipe for long context alignment of large language models. arXiv preprint
arXiv:2401.18058, 2024.
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and
image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF international conference
on computer vision, pp. 1728–1738, 2021.
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang
Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik
Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling
latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan.
Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024a.
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao,
Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m:
Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 13320–13331, 2024b.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-
efficient exact attention with io-awareness. Advances in Neural Information Processing Systems,
35:16344–16359, 2022.
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde
Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch
n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural
Information Processing Systems, 36, 2024.
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image
synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pp. 12873–12883, 2021.
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam
Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for
high-resolution image synthesis. In Forty-first International Conference on Machine Learning,
2024.
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala,
Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models
without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in
neural information processing systems, 33:6840–6851, 2020.
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P
Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition
video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale
pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing
Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin,
Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang
Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward
feedback. arXiv preprint arXiv:2405.18750, 2024.
Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng
Zuo, Qixiang Ye, and Jingdong Wang. Evaluation of text-to-video generation models: A dynamics
perspective, 2024. URL https://arxiv.org/abs/2407.01094.
Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and
sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of
computer vision, pp. 5404–5411, 2024.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–
ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings,
Part V 13, pp. 740–755. Springer, 2014.
OpenAI. Gpt-4o. 2024a.
OpenAI. Sora. 2024b. URL https://openai.com/index/sora/.
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of
the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of machine learning research, 21(140):1–67, 2020.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF confer-
ence on computer vision and pattern recognition, pp. 10684–10695, 2022.
runway. Gen-2. 2023. URL https://runwayml.com/ai-tools/gen-2-text-to-video.
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv
preprint arXiv:2202.00512, 2022.
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry
Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video
data. arXiv preprint arXiv:2209.14792, 2022.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced
transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Kuaishou AI Team. Kling. 2024. URL https://kling.kuaishou.com/en.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation
and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing
systems, 30, 2017.
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang,
Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable
length video generation from open domain textual descriptions. In International Conference on
Learning Representations, 2022.
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang,
Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv
preprint arXiv:2311.03079, 2023a.
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan
He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent
diffusion models. arXiv preprint arXiv:2309.15103, 2023b.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin,
Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of
foundation models. arXiv preprint arXiv:2309.16039, 2023.
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong
Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–
tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu,
Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic
evaluation of text-to-time-lapse video generation. arXiv preprint arXiv:2406.18522, 2024.
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei
Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video
generation. arXiv preprint arXiv:2309.15818, 2023a.
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language
model for video understanding. arXiv preprint arXiv:2306.02858, 2023b.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 586–595, 2018.
Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong,
Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion.
arXiv preprint arXiv:2403.05121, 2024a.
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou,
Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, March
2024b. URL https://github.com/hpcaitech/Open-Sora.
Figure 9: Text to video showcases. The displayed prompt will be upsampled before being fed into
the model. The generated videos contain large motion and can produce various video styles.
Figure 10: Text to video showcases.
A Image To Video Model
We finetune an image-to-video model from the text-to-video model. Drawing from Blattmann et al. (2023), we add an image as an additional condition alongside the text. The image is passed through the 3D VAE and concatenated with the noised input along the channel dimension. Similar to super-resolution tasks, there is a significant distribution gap between training and inference (the first frame of videos vs. real-world images). To enhance the model's robustness, we add large noise to the image condition during training. Some examples are shown in Figures 11 and 12. CogVideoX can handle different styles of image input.
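The conditioning can be sketched as follows; the noise-augmentation strength and the zero-padding of non-first latent frames are illustrative assumptions:

```python
import torch

def build_i2v_input(noised_video_latent, image_latent, noise_aug_std: float = 0.1):
    """Concatenate an image condition with the noised video latent along channels.
    noised_video_latent: (B, C, T, H, W); image_latent: (B, C, 1, H, W) from the 3D VAE."""
    # Corrupt the image condition so the model is robust to the gap between
    # real-world images and the first frames of training videos.
    image_latent = image_latent + noise_aug_std * torch.randn_like(image_latent)
    # Place the image at the first latent frame and zeros elsewhere (an assumption).
    cond = torch.zeros_like(noised_video_latent)
    cond[:, :, :1] = image_latent
    return torch.cat([noised_video_latent, cond], dim=1)   # (B, 2C, T, H, W)

x = build_i2v_input(torch.randn(1, 16, 13, 60, 90), torch.randn(1, 16, 1, 60, 90))
print(x.shape)   # torch.Size([1, 32, 13, 60, 90])
```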
Figure 11: Image to video showcases. The displayed prompt will be upsampled before being fed into
the model.
Figure 12: Image to video showcases.
B Caption Upsampler
To ensure that the text input distribution during inference is as close as possible to that during training, similar to Betker et al. (2023), we use a large language model to upsample the user's input during inference, making it more detailed and precise. We find that a finetuned LLM generates better prompts than zero-/few-shot prompting. For image-to-video, we use a vision-language model such as GPT-4V or CogVLM (Wang et al., 2023a) to upsample the prompt. The prompt used for upsampling is as follows.
You are part of a team of bots that create videos. You work
with an assistant bot that will draw anything you say in
square brackets. For example, outputting \" a beautiful
morning in the woods with the sun peaking through the
trees \" will trigger your partner bot to output a video
of a forest morning, as described. You will be prompted
by people looking to create detailed, amazing videos.
The way to accomplish this is to take their short prompts
and make them extremely detailed and descriptive.
There are a few rules to follow :
You will only ever output a single video description
per user request.
When modifications are requested, you should not simply
make the description longer. You should refactor the
entire description to integrate the suggestions.
the content of the video and the changes that occur, in
chronological order.\n Please keep the description of this
video within 100 English words.
and tranquil sea backdrop. This surreal tableau blends natural
beauty with human ingenuity, creating a serene yet whimsical
atmosphere that emphasizes the crab’s unique adaptation and the
contrast between nature and technology in this quiet nocturnal
setting.
E Video to Video via CogVideoX and CogVLM2-Caption
In this section, we present several examples of video-to-video generation using CogVideoX and
CogVLM2-Caption. Specifically, we first input the original video into CogVLM2-Caption to obtain
the video’s caption, and then feed this caption into the CogVideoX model to generate a new video.
From the examples below, it can be seen that our pipeline achieves a high degree of fidelity to the
original video:
F Human Evaluation Details
Sensory Quality: This part focuses mainly on the perceptual quality of videos, including subject
consistency, frame continuity, and stability.
Instruction Following: This part focuses on whether the generated video aligns with the prompt,
including the accuracy of the subject, quantity, elements, and details.
Physics Simulation: This part focuses on whether the model adheres to the objective laws of the physical world, such as lighting effects, interactions between different objects, and the realism of fluid dynamics.
Cover Quality: This part mainly focuses on metrics that can be assessed from single-frame images,
including aesthetic quality, clarity, and fidelity.