CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang⋆ Jiayan Teng⋆ Wendi Zheng Ming Ding Shiyu Huang Jiazheng Xu
Yuanming Yang Wenyi Hong Xiaohan Zhang Guanyu Feng Da Yin
Xiaotao Gu Yuxuan Zhang Weihan Wang Yean Cheng Ting Liu Bin Xu
Yuxiao Dong Jie Tang
Abstract
We introduce CogVideoX, large-scale diffusion transformer models designed for
generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along
both spatial and temporal dimensions. To improve the text-video alignment, we pro-
pose an expert transformer with the expert adaptive LayerNorm to facilitate the deep
fusion between the two modalities. By employing a progressive training technique,
CogVideoX is adept at producing coherent, long-duration videos characterized by
significant motions. In addition, we develop an effective text-video data processing
pipeline that includes various data preprocessing strategies and a video captioning
method. It significantly helps enhance the performance of CogVideoX, improving
both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics
and human evaluations. The model weights of both the 3D Causal VAE and
CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
1 Introduction
The rapid development of text-to-video models has been phenomenal, driven by both the Transformer
architecture (Vaswani et al., 2017) and diffusion models (Ho et al., 2020). Early attempts to pretrain and
scale Transformers to generate videos from text have shown great promise, such as CogVideo (Hong
et al., 2022) and Phenaki (Villegas et al., 2022). Meanwhile, diffusion models have recently made
exciting advancements in multimodal generation, including video generation (Singer et al., 2022;
Ho et al., 2022). By using Transformers as the backbone of diffusion models, i.e., Diffusion
Transformers (DiT) (Peebles & Xie, 2023), text-to-video generation has reached groundbreaking
levels, as evidenced by the impressive Sora showcases (OpenAI, 2024b).
Despite these rapid advancements in DiTs, it remains technically unclear how to achieve long-term
consistent video generation. Challenges such as efficiently modeling video data, effectively aligning
videos with text semantics, and constructing high-quality text-video pairs for model training
have thus far been largely unaddressed.
In this work, we train and introduce CogVideoX, a set of large-scale diffusion transformer models
designed for generating long-term, temporally consistent videos. We address the challenges mentioned
above by developing a 3D Variational Autoencoder (VAE), an expert Transformer, and a video data
*Equal contributions. Core contributors: Zhuoyi, Jiayan, Wendi, Ming, and Shiyu.
{yangzy22,tengjy24}@mails.tsinghua.edu.cn, {yuxiaod,jietang}@tsinghua.edu.cn
Figure 1: The performance of openly-accessible text-to-video models in different aspects.
filtering and captioning pipeline, respectively. First, to efficiently consume video data, we design
and train a 3D causal VAE that compresses the video along both spatial and temporal dimensions.
Compared to unfolding a video into a one-dimensional sequence in the pixel space, this strategy helps
significantly reduce the sequence length and associated training compute. Unlike previous video
models (Blattmann et al., 2023) that often use a 2D VAE to encode each frame separately, the 3D
VAE helps prevent flicker in the generated videos by ensuring continuity among frames.
Second, to improve the alignment between videos and texts, we propose an expert Transformer with
expert adaptive LayerNorm to facilitate the fusion between the two modalities. To ensure the temporal
consistency in video generation and capture large-scale motions, we propose to use 3D full attention
to comprehensively model the video along both temporal and spatial dimensions.
Third, as most video data available online lacks accurate textual descriptions, we develop a video
captioning pipeline capable of accurately describing video content. This pipeline is used to generate
new textual descriptions for all video data, which significantly enhances CogVideoX's ability to grasp precise semantics.
In addition, we adopt and design progressive training techniques, including mixed-duration training
and resolution progressive training, to further enhance the generation performance and stability of
CogVideoX. Furthermore, we propose Explicit Uniform Sampling, which stabilizes the training loss
curve and accelerates convergence by setting different timestep sampling intervals on each data
parallel rank.
Both machine and human evaluations suggest that CogVideoX outperforms well-known public
models. Figure 1 shows the performance of CogVideoX in different aspects.
CogVideoX is an ongoing attempt to advance text-to-video generation. To facilitate further devel-
opments, we open-source the model weights of part of the CogVideoX models along with the 3D VAE, and we plan to release larger models in the future. The currently open-sourced CogVideoX can generate 720×480 videos of six seconds at eight frames per second. It can be publicly accessed from
https://github.com/THUDM/CogVideo.
2 The CogVideoX Architecture

Given a text-video pair, the text is encoded into embeddings z_text by a T5 encoder (Raffel et al., 2020), while the video is compressed by the 3D causal VAE and patchified into vision embeddings z_vision. z_text and z_vision are then concatenated along the sequence dimension. The concatenated embeddings are fed into a stack of expert transformer blocks. Finally, the model output is unpatchified to restore the original latent shape, which is then decoded by the 3D causal VAE decoder to reconstruct the video. Below, we describe the technical design of the 3D causal VAE and the expert transformer in detail.

2.1 3D Causal VAE
Videos encompass not only spatial information but also substantial temporal information, usually
resulting in orders of magnitude more data volumes than images. To tackle the computational
challenge of modeling video data, we propose to implement a video compression module based
on 3D Variational Autoencoders (3D VAEs) (Yu et al., 2023). The idea is to incorporate three-
dimensional convolutions to compress videos both spatially and temporally. This can help achieve a
higher compression ratio with largely improved quality and continuity of video reconstruction when
compared to previous image VAEs (Rombach et al., 2022; Esser et al., 2021).
Figure 3 (a) shows the structure of the proposed 3D VAE. It comprises an encoder, a decoder and
a latent space regularizer. The Gaussian latent space is constrained by a Kullback-Leibler (KL)
regularizer. The encoder and decoder each consist of four symmetrically arranged stages that perform 2× downsampling and upsampling, respectively, through interleaved stacks of ResNet blocks. The first two rounds of downsampling and, symmetrically, the last two rounds of upsampling operate on both the spatial and temporal dimensions, while the remaining round samples only spatially. This enables the 3D VAE
to achieve a 4× compression in the temporal dimension and an 8×8 compression in the spatial
dimension. In total, this achieves a 4×8×8 compression from pixels to the latents.
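For concreteness, the latent shape implied by this compression can be computed as below; treating the first frame causally (so that 4k + 1 input frames map to k + 1 latent frames) and using 16 latent channels are assumptions consistent with a causal 3D VAE, not a specification of the released model:

```python
def latent_shape(num_frames: int, height: int, width: int, latent_channels: int = 16):
    """Compute the latent shape under 4x temporal and 8x8 spatial compression.

    Assumes a temporally causal VAE that encodes the first frame alone and then
    compresses every subsequent group of 4 frames into one latent frame, i.e.
    T_latent = 1 + (T - 1) // 4 for inputs of the form T = 4k + 1.
    """
    t_latent = 1 + (num_frames - 1) // 4   # causal: first frame is encoded on its own
    h_latent = height // 8
    w_latent = width // 8
    return (t_latent, h_latent, w_latent, latent_channels)

# Example: a six-second, 8-fps, 720x480 clip stored as 49 frames (a 4k + 1 frame count).
print(latent_shape(49, 480, 720))  # -> (13, 60, 90, 16)
```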
We adopt the temporally causal convolution (Yu et al., 2023), which places all the padding at the beginning of the convolution space, as shown in Figure 3 (b). This ensures that future information does not influence present or past predictions. Given that processing videos with a large number of
frames introduces excessive GPU memory usage, we apply context parallelism along the temporal dimension for the 3D convolutions to distribute computation among multiple devices. As illustrated in Figure 3 (b), due to the causal nature of the convolution, each rank simply sends a segment of length k − 1 to the next rank, where k indicates the temporal kernel size. This results in relatively low communication overhead.

Figure 3: (a) The structure of the 3D VAE in CogVideoX. It comprises an encoder, a decoder, and a latent space regularizer, achieving a 4×8×8 compression from pixels to latents. (b) The context parallel implementation of the temporally causal convolution.
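A simplified, single-process sketch of these two ideas, causal temporal padding and passing only the trailing k − 1 frames between ranks, is given below; real context parallelism would use collective communication rather than the in-memory hand-off shown here:

```python
import torch
import torch.nn.functional as F

def causal_conv3d(x, weight, bias=None):
    """3D convolution with all temporal padding placed at the beginning,
    so outputs never depend on future frames. x: (B, C, T, H, W)."""
    k_t, k_h, k_w = weight.shape[-3:]
    # pad order: (W_left, W_right, H_top, H_bottom, T_front, T_back)
    x = F.pad(x, (k_w // 2, k_w // 2, k_h // 2, k_h // 2, k_t - 1, 0))
    return F.conv3d(x, weight, bias)

def context_parallel_causal_conv3d(chunks, weight, bias=None):
    """Simulate context parallelism along time: each 'rank' holds a temporal chunk
    and receives the last k_t - 1 frames from the previous rank (assumes k_t >= 2)."""
    k_t, k_h, k_w = weight.shape[-3:]
    outputs = []
    for rank, x in enumerate(chunks):
        if rank == 0:
            halo = x.new_zeros(x.shape[:2] + (k_t - 1,) + x.shape[3:])  # causal zero padding
        else:
            halo = chunks[rank - 1][:, :, -(k_t - 1):]                  # "send" k_t - 1 frames
        x = torch.cat([halo, x], dim=2)
        x = F.pad(x, (k_w // 2, k_w // 2, k_h // 2, k_h // 2, 0, 0))    # spatial padding only
        outputs.append(F.conv3d(x, weight, bias))
    return torch.cat(outputs, dim=2)

# The chunked result matches running the causal convolution on the full sequence.
video = torch.randn(1, 3, 16, 32, 32)
w = torch.randn(8, 3, 3, 3, 3)
full = causal_conv3d(video, w)
split = context_parallel_causal_conv3d(list(video.chunk(4, dim=2)), w)
assert torch.allclose(full, split, atol=1e-5)
```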
During actual implementation, we first train a 3D VAE on lower resolutions and fewer frames to save
computation. We observe that encoding at larger resolutions generalizes naturally, while extending the number of frames to be encoded does not work as seamlessly. Therefore, we conduct a two-stage training process: first training on short videos and then finetuning with context parallelism on long videos.
Both stages of training utilize a weighted combination of the L2 loss, LPIPS (Zhang et al., 2018)
perceptual loss, and GAN loss from a 3D discriminator.
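The combined objective can be sketched as follows; the loss weights and the perceptual/discriminator interfaces are placeholders rather than the values and modules used in training:

```python
import torch
import torch.nn.functional as F

def to_frames(video: torch.Tensor) -> torch.Tensor:
    """(B, C, T, H, W) -> (B*T, C, H, W), so a 2D perceptual metric can be applied per frame."""
    return video.transpose(1, 2).flatten(0, 1)

def vae_training_loss(x, x_rec, kl, perceptual_fn, discriminator,
                      w_lpips=1.0, w_gan=0.05, w_kl=1e-6):
    """Weighted combination of L2, LPIPS-style perceptual, and GAN losses, plus the KL term
    of the latent-space regularizer described earlier. `perceptual_fn` and `discriminator`
    are placeholder callables; the weights are illustrative, not the paper's values."""
    l2 = F.mse_loss(x_rec, x)
    perceptual = perceptual_fn(to_frames(x_rec), to_frames(x)).mean()
    gan = -discriminator(x_rec).mean()   # non-saturating generator loss from the 3D discriminator
    return l2 + w_lpips * perceptual + w_gan * gan + w_kl * kl
```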
2.2 Expert Transformer

We now introduce the design choices in the Transformer of CogVideoX, including the patching, positional embedding, and attention strategies for handling text-video data effectively and efficiently.
Patchify. The 3D causal VAE encodes a video into a latent of shape T × H × W × C, where T, H, W, and C represent the number of latent frames, the height, the width, and the number of channels, respectively. The video latents are then patchified along the spatial
dimensions, generating a sequence z_vision of length T · (H/p) · (W/p), where p denotes the patch size. Note that we do not patchify along the
temporal dimension in order to enable joint training of images and videos.
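A minimal sketch of spatial-only patchification is given below; the patch size, latent channel count, and hidden size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpatialPatchify(nn.Module):
    """Patchify video latents along H and W only, keeping the temporal axis intact
    so that single-frame images (T = 1) and videos share the same treatment."""
    def __init__(self, in_channels: int = 16, patch_size: int = 2, hidden_size: int = 1920):
        super().__init__()
        self.p = patch_size
        self.proj = nn.Linear(in_channels * patch_size * patch_size, hidden_size)

    def forward(self, z):                      # z: (B, T, H, W, C)
        B, T, H, W, C = z.shape
        p = self.p
        z = z.reshape(B, T, H // p, p, W // p, p, C)
        z = z.permute(0, 1, 2, 4, 3, 5, 6)     # (B, T, H/p, W/p, p, p, C)
        z = z.reshape(B, T * (H // p) * (W // p), p * p * C)
        return self.proj(z)                    # sequence length: T * (H/p) * (W/p)

tokens = SpatialPatchify()(torch.randn(1, 13, 60, 90, 16))
print(tokens.shape)  # torch.Size([1, 17550, 1920])
```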
3D-RoPE. Rotary Position Embedding (RoPE) (Su et al., 2024) is a relative positional encoding that
has been demonstrated to capture inter-token relationships effectively in LLMs, particularly excelling
in modeling long sequences. To adapt to video data, we extend the original RoPE to 3D-RoPE. Each
latent in the video tensor can be represented by a 3D coordinate (x, y, t). We independently apply
1D-RoPE to each of the three coordinates, occupying 3/8, 3/8, and 2/8 of the hidden states' channels, respectively. The resulting encoding is then concatenated along the channel dimension to obtain
the final 3D-RoPE encoding.
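The construction can be sketched as follows; the rotary base of 10000 and the pairwise rotation layout are assumptions carried over from standard RoPE implementations:

```python
import torch

def rope_1d(positions, dim, base=10000.0):
    """Standard 1D rotary angles for the given positions; returns (len, dim/2)."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.float(), freqs)

def rope_3d(t, h, w, head_dim):
    """Per-token rotary angles for a (t, h, w) latent grid.
    The channel budget is split 3/8 : 3/8 : 2/8 over x (width), y (height), and t."""
    dim_x = dim_y = head_dim // 8 * 3
    dim_t = head_dim - dim_x - dim_y
    fx = rope_1d(torch.arange(w), dim_x)           # (w, dim_x/2)
    fy = rope_1d(torch.arange(h), dim_y)           # (h, dim_y/2)
    ft = rope_1d(torch.arange(t), dim_t)           # (t, dim_t/2)
    # Broadcast each axis' angles over the full (t, h, w) grid, then concatenate channels.
    fx = fx[None, None, :, :].expand(t, h, w, -1)
    fy = fy[None, :, None, :].expand(t, h, w, -1)
    ft = ft[:, None, None, :].expand(t, h, w, -1)
    angles = torch.cat([fx, fy, ft], dim=-1)       # (t, h, w, head_dim/2)
    return angles.reshape(t * h * w, -1)           # one row per video token

def apply_rope(x, angles):
    """Rotate the last dimension of x (..., seq, head_dim) pairwise by `angles`."""
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return out.flatten(-2)

q = torch.randn(2, 13 * 30 * 45, 64)               # (batch, video tokens, head_dim)
q_rot = apply_rope(q, rope_3d(13, 30, 45, 64))
```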
We empirically examine the use of RoPE. Figure 4 (a) shows the comparison between 3D RoPE and
sinusoidal absolute position encoding. We can observe that the loss curve using 3D RoPE converges
significantly faster than that with sinusoidal encoding. We further compare the use of 3D RoPE
alone against the combination of 3D RoPE and learnable absolute position embedding. Figure 4 (b)
indicates that the loss curves of both methods converge almost identically. Therefore, we choose to
use 3D RoPE alone for simplicity.
Figure 4: Training loss curves of different ablations: (a) 3D RoPE vs. sinusoidal absolute position encoding; (b) 3D RoPE vs. 3D RoPE + learnable absolute position embedding; (c) Expert AdaLN vs. Expert AdaLN + MLP; (d) Explicit Uniform Sampling vs. no uniform sampling.
Expert Transformer Block. We concatenate the embeddings of both text and video at the input
stage to better align visual and semantic information. However, the feature spaces of these two
modalities differ significantly, and their embeddings may even have different numerical scales. To
better process them within the same sequence, we employ Expert Adaptive LayerNorm to handle each modality independently. As shown in Figure 2, following DiT (Peebles & Xie, 2023), we use the timestep t of the diffusion process as the input to the modulation module. Then, the Vision Expert Adaptive LayerNorm (Vision Expert AdaLN) and Text Expert Adaptive LayerNorm (Text Expert AdaLN) apply this modulation mechanism to the vision hidden states and text hidden states,
respectively. This strategy promotes the alignment of feature spaces across two modalities while
minimizing additional parameters.
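A minimal sketch of the expert adaptive LayerNorm is shown below; the shift/scale/gate triple and the toy tensor sizes are illustrative, with only the idea of separate text and vision modulation driven by the timestep embedding taken from the description above:

```python
import torch
import torch.nn as nn

class ExpertAdaLN(nn.Module):
    """Separate adaptive-LayerNorm experts for text and vision tokens that share one
    transformer block. The timestep embedding produces per-modality modulation
    parameters, in the style of DiT's adaLN."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        # shift, scale, gate per modality
        self.text_mod = nn.Linear(hidden_size, 3 * hidden_size)
        self.vision_mod = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x, t_emb, n_text):
        """x: (B, n_text + n_vision, D); t_emb: (B, D) diffusion-timestep embedding."""
        x_text, x_vis = x[:, :n_text], x[:, n_text:]
        s_t, sc_t, g_t = self.text_mod(t_emb).chunk(3, dim=-1)
        s_v, sc_v, g_v = self.vision_mod(t_emb).chunk(3, dim=-1)
        h_text = self.norm(x_text) * (1 + sc_t[:, None]) + s_t[:, None]
        h_vis = self.norm(x_vis) * (1 + sc_v[:, None]) + s_v[:, None]
        # The gates g_t / g_v would scale the attention or MLP residual outputs.
        return torch.cat([h_text, h_vis], dim=1), (g_t, g_v)

h, gates = ExpertAdaLN(128)(torch.randn(2, 16 + 64, 128), torch.randn(2, 128), n_text=16)
```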
To verify the adoption of Expert Adaptive LayerNorm, we experiment with different ways of incorporating experts: expert LayerNorm together with expert MLP, and expert LayerNorm only. Our experiments find that adding the expert MLP does not effectively accelerate the model's convergence (cf. Figure 4 (c)). To reduce the model parameters, we choose to use only the Expert Adaptive LayerNorm.
3D Full Attention. Previous works (Singer et al., 2022; Guo et al., 2023) often employ separated
spatial and temporal attention to reduce computational complexity and facilitate fine-tuning from
text-to-image models. However, as illustrated in Figure 5, this separated attention approach requires
extensive implicit transmission of visual information, significantly increasing the learning complexity
and making it challenging to maintain the consistency of large-movement objects. Considering the
great success of long-context training in LLMs (AI@Meta, 2024; Bai et al., 2024; Xiong et al., 2023)
and the efficiency of FlashAttention (Dao et al., 2022), we propose a 3D text-video hybrid attention
mechanism. This mechanism not only achieves better results but can also be easily adapted to various
parallel acceleration methods.
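Operationally, the hybrid attention is ordinary full self-attention over the concatenated text and video token sequence, with no spatial/temporal factorization, which is what makes it compatible with FlashAttention-style kernels and sequence parallelism. A sketch with illustrative sizes:

```python
import torch
import torch.nn.functional as F

def hybrid_3d_attention(text_tokens, video_tokens, num_heads: int = 8):
    """Full self-attention over [text; video], so every video patch can attend to every
    other patch across all frames. Q/K/V projections are omitted for brevity."""
    x = torch.cat([text_tokens, video_tokens], dim=1)      # (B, L_text + T*H'*W', D)
    B, L, D = x.shape
    q = k = v = x.view(B, L, num_heads, D // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)           # uses fused kernels when available
    return out.transpose(1, 2).reshape(B, L, D)

text = torch.randn(1, 16, 128)
video = torch.randn(1, 4 * 8 * 8, 128)                      # latent grid (T, H/p, W/p) = (4, 8, 8)
print(hybrid_3d_attention(text, video).shape)               # torch.Size([1, 272, 128])
```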
Figure 5: The separated spatial and temporal attention makes it challenging to handle the large motion
between adjacent frames. In the figure, the head of the person in frame i + 1 cannot directly attend
to the head in frame i. Instead, visual information can only be implicitly transmitted through other
background patches. This can lead to inconsistency issues in the generated videos.
Figure 6: The diagram of mixed-duration training and Frame Pack. To fully utilize the data and
enhance the model’s generalization capability, we train with videos of different durations within the
same batch.
3 Training CogVideoX
We mix images and videos during training, treating each image as a single-frame video. Additionally,
we employ progressive training from the resolution perspective. For the diffusion setting, we adopt
v-prediction (Salimans & Ho, 2022) and zero SNR (Lin et al., 2024), following the noise schedule
used in LDM (Rombach et al., 2022). For timestep sampling during diffusion training, we also employ an Explicit Uniform Sampling method (Section 3.3), which benefits training stability.
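For reference, the v-prediction target (Salimans & Ho, 2022) and the zero-terminal-SNR rescaling (Lin et al., 2024) can be sketched as follows; the schedule length and beta range are illustrative rather than the exact values used in training:

```python
import torch

def rescale_zero_terminal_snr(alphas_cumprod):
    """Rescale sqrt(alpha_bar) linearly so the last step has exactly zero SNR
    (Lin et al., 2024), while keeping the first step's value unchanged."""
    s = alphas_cumprod.sqrt()
    s_0, s_T = s[0].clone(), s[-1].clone()
    s = s - s_T                    # shift so the terminal value is zero
    s = s * s_0 / (s_0 - s_T)      # rescale so the first value is preserved
    return s ** 2

def v_prediction_target(x0, noise, alphas_cumprod, t):
    """v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0 (Salimans & Ho, 2022)."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1, 1)
    sigma = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1, 1)
    return a * noise - sigma * x0

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                      # illustrative linear schedule
alphas_cumprod = rescale_zero_terminal_snr(torch.cumprod(1 - betas, dim=0))
x0, eps = torch.randn(2, 16, 13, 60, 90), torch.randn(2, 16, 13, 60, 90)
t = torch.randint(0, T, (2,))
v = v_prediction_target(x0, eps, alphas_cumprod, t)
```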
3.1 Frame Pack

Previous video training methods often involve joint training of images and videos with a fixed number
of frames (Singer et al., 2022; Blattmann et al., 2023). However, this approach usually leads to two
issues: First, under bidirectional attention there is a significant gap between the two input types, with images having a single frame while videos have dozens of frames. We observe that models trained this way tend to diverge into two generative modes based on the token count and fail to generalize well. Second, to train with a fixed duration, we have to discard short videos and truncate long videos, which prevents full utilization of videos with varying numbers of frames.
To address these issues, we adopt mixed-duration training, which means training with videos of different lengths together. However, inconsistent data shapes within a batch make training difficult. Inspired
by Patch’n Pack (Dehghani et al., 2024), we place videos of different lengths into the same batch
to ensure consistent shapes within each batch, a method we refer to as Frame Pack. The process is
illustrated in Figure 6.
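One possible realization of Frame Pack, greedily packing variable-length videos into a fixed per-sample frame budget with per-frame video ids for attention masking, is sketched below; this packing scheme is our illustrative reading of the description, not the released implementation:

```python
from typing import List
import torch

def frame_pack(videos: List[torch.Tensor], max_frames: int):
    """Greedily pack variable-length videos (each (T_i, C, H, W)) into fixed-length sequences
    of `max_frames`, returning the packed tensor and per-frame video ids so attention can be
    masked to stay within each original video. Assumes every video fits in the budget."""
    packed, ids = [], []
    buf, buf_ids, next_id = [], [], 0
    for v in videos:
        if buf and sum(x.shape[0] for x in buf) + v.shape[0] > max_frames:
            packed.append(_pad(torch.cat(buf), max_frames))
            ids.append(_pad_ids(buf_ids, max_frames))
            buf, buf_ids = [], []
        buf.append(v)
        buf_ids.extend([next_id] * v.shape[0])
        next_id += 1
    if buf:
        packed.append(_pad(torch.cat(buf), max_frames))
        ids.append(_pad_ids(buf_ids, max_frames))
    return torch.stack(packed), torch.stack(ids)

def _pad(x, n):        # zero-pad along the frame axis to length n
    return torch.cat([x, x.new_zeros(n - x.shape[0], *x.shape[1:])])

def _pad_ids(ids, n):  # -1 marks padding frames
    return torch.tensor(ids + [-1] * (n - len(ids)))

videos = [torch.randn(t, 3, 8, 8) for t in (9, 17, 5, 33, 1)]   # mixed durations, incl. an image
batch, frame_ids = frame_pack(videos, max_frames=48)
print(batch.shape, frame_ids.shape)   # torch.Size([2, 48, 3, 8, 8]) torch.Size([2, 48])
```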
3.2 Resolution Progressive Training
The training pipeline of CogVideoX is divided into three stages: low-resolution training, high-
resolution training, and high-quality video fine-tuning. Similar to images, videos collected from the Internet usually include a significant proportion of low-resolution ones. Progressive training can effectively
utilize videos of various resolutions. Moreover, training at low resolution initially can equip the
model with coarse-grained modeling capabilities, followed by high-resolution training to enhance its
ability to capture fine details. Compared to direct high-resolution training, staged training can also
help reduce the overall training time.
Figure 7: The comparison between the initial generation states of extrapolation and interpolation
when increasing the resolution with RoPE encoding. Extrapolation tends to generate multiple small,
clear, and repetitive images, while interpolation generates a blurry large image.
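The two strategies compared in Figure 7 amount to a choice of position indices fed to RoPE when the latent grid grows, as sketched below:

```python
import torch

def positions_extrapolate(new_len: int) -> torch.Tensor:
    """Extrapolation: keep the original spacing and simply use larger indices."""
    return torch.arange(new_len, dtype=torch.float32)

def positions_interpolate(new_len: int, train_len: int) -> torch.Tensor:
    """Interpolation: squeeze the new grid into the index range seen at training time."""
    return torch.arange(new_len, dtype=torch.float32) * (train_len / new_len)

# Going from a 30-wide latent grid at low resolution to 60 at high resolution:
print(positions_extrapolate(60)[-3:])        # tensor([57., 58., 59.])
print(positions_interpolate(60, 30)[-3:])    # tensor([28.5000, 29.0000, 29.5000])
```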
High-Quality Fine-Tuning. Since the filtered pre-training data still contains a certain proportion of dirty data, such as subtitles, watermarks, and low-bitrate videos, we select a subset of higher-quality video data, accounting for 20% of the total dataset, for fine-tuning in the final stage. This step effectively removes generated subtitles and watermarks and slightly improves the visual quality, although we also observe a slight degradation in the model's semantic ability.
3.3 Explicit Uniform Sampling

Ho et al. (2020) define the training objective of diffusion models as

L_simple(θ) = E_{t, x_0, ε} [ ‖ ε − ε_θ( √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε, t ) ‖² ],    (1)

where t is uniformly distributed between 1 and T. The common practice is for each rank in the data parallel group to uniformly sample a value between 1 and T, which is in theory equivalent to
Equation 1. However, in practice, the results obtained from such random sampling are often not
sufficiently uniform, and since the magnitude of the diffusion loss is related to the timesteps, this
can lead to significant fluctuations in the loss. Thus, we propose to use Explicit Uniform Sampling
to divide the range from 1 to T into n intervals, where n is the number of ranks. Each rank then
uniformly samples within its respective interval. This method ensures a more uniform distribution of
timesteps. As shown in Figure 4 (d), the loss curve from training with Explicit Uniform Sampling is
noticeably more stable.
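A minimal sketch of Explicit Uniform Sampling, with each data-parallel rank drawing timesteps only from its own sub-interval of the timestep range (0-indexed here for convenience):

```python
import torch

def explicit_uniform_timesteps(batch_size: int, rank: int, world_size: int, T: int = 1000):
    """Each rank draws timesteps uniformly from its own interval of [0, T), so the union
    over all ranks covers the timestep range more evenly than independent sampling."""
    interval = T // world_size
    low = rank * interval
    high = T if rank == world_size - 1 else low + interval   # last rank absorbs the remainder
    return torch.randint(low, high, (batch_size,))

# With 8 data-parallel ranks and T = 1000, rank 3 samples from [375, 500).
print(explicit_uniform_timesteps(4, rank=3, world_size=8))
```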
In addition, for a more precise comparison, we compare the loss at each diffusion timestep between the two methods. We find that with Explicit Uniform Sampling, the loss at all timesteps decreases faster, indicating that this method also accelerates loss convergence.
3.4 Data
We construct a collection of relatively high-quality video clips with text descriptions through video
filters and recaption models. After filtering, approximately 35M single-shot clips remain, with each
clip averaging about 6 seconds.
Video Filtering. Video generation models need to learn the dynamic information of the world, but unfiltered video data follows a highly noisy distribution, primarily for two reasons: First, videos are human-created, and artificial editing may distort the authentic dynamic information; second, video quality can drop significantly due to issues during filming, such as camera shake and substandard equipment.
In addition to the intrinsic quality of the videos, we also consider how well the video data supports
model training. Videos with minimal dynamic information or lacking connectivity in dynamic aspects
are considered detrimental. Consequently, we have developed a set of negative labels, which include:
• Editing: Videos that have undergone obvious artificial processing, such as re-editing and
special effects, causing degradation of the visual integrity.
• Lack of Motion Connectivity: Video segments with image transitions lacking motion
connectivity, commonly seen in videos artificially spliced or edited from images.
• Low Quality: Poorly shot videos with unclear visuals or excessive camera shake.
• Lecture Type: Videos focusing primarily on a person continuously talking with minimal
effective motion, such as educational content, lectures, and live-streamed discussions.
• Text Dominated: Videos containing a substantial amount of visible text or primarily
focusing on textual content.
• Noisy Screenshots: Noisy videos recorded from phone or computer screens.
We sample 20,000 videos and label the presence of negative tags in each of them. Using these annotations, we train several filters based on Video-LLaMA (Zhang et al., 2023b) to screen out low-quality video data.
In addition, we calculate the optical flow scores and image aesthetic scores of all training videos and
dynamically adjust the threshold ranges during training to ensure the fluency and aesthetic quality of
the generated videos.
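The score-based part of this filtering can be sketched as follows; the field names and threshold values are placeholders for illustration:

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Clip:
    path: str
    optical_flow_score: float   # mean flow magnitude, a proxy for motion
    aesthetic_score: float      # image-aesthetic score on sampled frames
    negative_tags: List[str]    # predictions from the Video-LLaMA-based filters

def filter_clips(clips: Iterable[Clip], flow_range=(0.5, 20.0), min_aesthetic=4.5) -> List[Clip]:
    """Keep clips with no negative tags, enough (but not chaotic) motion, and a
    sufficient aesthetic score. Thresholds here are illustrative and would be
    adjusted dynamically during training."""
    kept = []
    for c in clips:
        if c.negative_tags:
            continue
        if not (flow_range[0] <= c.optical_flow_score <= flow_range[1]):
            continue
        if c.aesthetic_score < min_aesthetic:
            continue
        kept.append(c)
    return kept

clips = [Clip("a.mp4", 3.1, 5.2, []), Clip("b.mp4", 0.1, 6.0, []), Clip("c.mp4", 4.0, 5.8, ["Editing"])]
print([c.path for c in filter_clips(clips)])   # ['a.mp4']
```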
Video Caption. Typically, most video data does not come with corresponding descriptive text, so
it is necessary to convert the video data into textual descriptions to provide the essential training
data for text-to-video models. Currently, there are some video caption datasets available, such as
Panda70M (Chen et al., 2024b), COCO Caption (Lin et al., 2014), and WebVid (Bain et al., 2021).
However, the captions in these datasets are usually very short and fail to describe the video’s content
comprehensively.
To generate high-quality video caption data, we establish a Dense Video Caption Data Generation
pipeline, as detailed in Figure 8. The idea is to generate video captions from image captions.
First, we use the Panda70M video captioning model (Chen et al., 2024b) to generate short captions
for the videos. Then, we employ the image recaptioning model CogVLM (Wang et al., 2023a)
used in Stable Diffusion 3 (Esser et al., 2024) and CogView3 (Zheng et al., 2024a) to create dense
image captions for each of the frames within a video. Subsequently, we use GPT-4 to summarize
all the image captions to produce the final video caption. To accelerate the generation from image
captions to video captions, we fine-tune a Llama2 model (Touvron et al., 2023) using the summary
data generated by GPT-4 (Achiam et al., 2023), enabling large-scale video caption data generation.
Additional details regarding the video caption data generation process can be found in Appendix C.
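The orchestration of this pipeline can be sketched as below, where short_caption_model, image_recaption_model, and summarizer stand in for Panda70M, CogVLM, and the GPT-4 / fine-tuned Llama2 summarizer; all callables here are placeholders rather than real APIs:

```python
from typing import Callable, List, Sequence

def dense_video_caption(
    frames: Sequence,                                  # sampled video frames (e.g. decoded images)
    short_caption_model: Callable[[Sequence], str],    # Panda70M-style clip captioner
    image_recaption_model: Callable[[object], str],    # CogVLM-style dense image captioner
    summarizer: Callable[[str, List[str]], str],       # GPT-4 / fine-tuned Llama2 summarizer
) -> str:
    """Generate a dense video caption: a short clip caption plus per-frame dense captions,
    summarized into a single description."""
    short_caption = short_caption_model(frames)
    frame_captions = [image_recaption_model(f) for f in frames]
    return summarizer(short_caption, frame_captions)

# Toy stand-ins, just to show the data flow:
caption = dense_video_caption(
    frames=["frame0", "frame1"],
    short_caption_model=lambda fr: "a dog runs on a beach",
    image_recaption_model=lambda f: f"dense caption of {f}",
    summarizer=lambda short, dense: short + "; " + " ".join(dense),
)
print(caption)
```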
The pipeline above generates the caption data used to train the CogVideoX model introduced in this report. To further accelerate video recaptioning, we also fine-tune an end-to-end video understanding model, CogVLM2-Caption, based on CogVLM2-Video¹ and Llama3 (AI@Meta, 2024), using the dense caption data generated from the aforementioned pipeline.
¹ The CogVLM2-Video model weight is openly available at https://github.com/THUDM/CogVLM2.
Figure 8: The pipeline for dense video caption data generation. In this pipeline, we generate short video captions with the Panda70M model, extract frames to create dense image captions, and use GPT-4 to summarize these into final video captions. To accelerate this process, we fine-tune a Llama 2 model on the GPT-4 summaries.

The video caption data generated by CogVLM2-Caption is used to train the next generation of CogVideoX. Examples
of video captions generated by this end-to-end CogVLM2-Caption model are shown in Appendix D.
In Appendix E, we also present some examples of video generation where a video is first input into
CogVLM2-Caption to generate captions, which are then used as input for CogVideoX to generate
new videos, effectively achieving video-to-video generation.
4 Empirical Evaluation
In this section, we present the performance of CogVideoX through two primary methods: automated
metric evaluation and human assessment. We train CogVideoX models of different parameter sizes. We report results for the 2B and 5B models for now; larger models are still in training.
To facilitate the development of text-to-video generation, we open-source the model weight at
https://github.com/THUDM/CogVideo.
Evaluation Metrics. To evaluate text-to-video generation, we employ several metrics from VBench (Huang et al., 2024): Human Action, Scene, Dynamic Degree, Multiple Objects, and Appearance Style. VBench is a suite of tools designed to automatically assess the quality of generated videos. We select certain metrics from VBench and exclude others that do not align with our evaluation needs. For example, the Color metric, intended to measure the presence of objects of specific colors across frames, assesses a model by calculating the probability of such objects appearing. However, this metric may penalize video generation models that exhibit greater variation, so we do not include it in our evaluation.
For longer generated videos, some models might produce videos with minimal changes between frames to obtain higher scores, but such videos lack rich content. Therefore, a metric for evaluating the dynamism of videos becomes more important. To address this, we employ two further video evaluation tools: Dynamic Quality from Devil (Liao et al., 2024) and GPT4o-MTScore from ChronoMagic (Yuan et al., 2024), which focus more on the dynamic characteristics of videos. Dynamic Quality is defined by the integration of various quality metrics with dynamic scores, mitigating biases arising from negative correlations between video dynamics and video quality. GPT4o-MTScore, in turn, uses GPT-4o to rate the degree of metamorphic change in a generated video, which likewise rewards genuinely dynamic content.
Table 1: Evaluation results.
Results. Table 1 provides the performance comparison of CogVideoX and other models.
CogVideoX achieves the best performance in five out of the seven metrics and shows competi-
tive results in the remaining two metrics. These results demonstrate that the model not only excels in
video generation quality but also outperforms previous models in handling various complex dynamic
scenes. In addition, Figure 1 presents a radar chart that visually illustrates the performance advantages
of CogVideoX.
In addition to automated scoring mechanisms, a comparative analysis between Kling (Team, 2024) and CogVideoX is conducted through manual evaluation. One hundred meticulously crafted prompts are used for the human evaluators, characterized by their broad distribution, clear articulation, and well-defined conceptual scope. We randomize the videos for blind evaluation. A panel of evaluators
is instructed to assign scores for each detail on a scale from zero to one, with the overall total score
rated on a scale from 0 to 5, where higher scores reflect better video quality. To better complement
automated evaluation, human evaluation emphasizes the instruction-following capability: the total
score cannot exceed 2 if the generated video fails to follow the instruction.
The results shown in Table 2 indicate that CogVideoX wins the human preference over Kling across
all aspects. More details about human evaluation are shown in Appendix F.
5 Conclusion
In this paper, we present CogVideoX, a state-of-the-art text-to-video diffusion model. It leverages
a 3D VAE and an Expert Transformer architecture to generate coherent long-duration videos with
significant motion. By implementing a comprehensive data processing pipeline and a video re-
captioning method, we significantly improve the quality and semantic alignment of the generated
videos. Our progressive training techniques, including mixed-duration training and resolution
progressive training, further enhance the model’s performance and stability. Our ongoing efforts
focus on refining CogVideoX's ability to capture complex dynamics and ensure even higher quality in video generation. We are also exploring the scaling laws of video generation models and
aim to train larger and more powerful models to generate longer and higher-quality videos, pushing
the boundaries of what is achievable in text-to-video generation.
Acknowledgments
We would like to thank all the data annotators, infrastructure operators, collaborators, and partners.
We also extend our gratitude to everyone at Zhipu AI and Tsinghua University who has provided support, feedback, or contributions to CogVideoX, even if not explicitly mentioned in this report. We would also like to thank BiliBili for the technical discussions.
References
Pika beta. 2023. URL https://pika.art/home.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman,
Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774, 2023.
AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/
main/MODEL_CARD.md.
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi
Li. Longalign: A recipe for long context alignment of large language models. arXiv preprint
arXiv:2401.18058, 2024.
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and
image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF international conference
on computer vision, pp. 1728–1738, 2021.
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang
Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik
Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling
latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan.
Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024a.
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao,
Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m:
Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 13320–13331, 2024b.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-
efficient exact attention with io-awareness. Advances in Neural Information Processing Systems,
35:16344–16359, 2022.
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde
Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch
n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural
Information Processing Systems, 36, 2024.
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image
synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pp. 12873–12883, 2021.
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam
Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for
high-resolution image synthesis. In Forty-first International Conference on Machine Learning,
2024.
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala,
Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models
without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in
neural information processing systems, 33:6840–6851, 2020.
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P
Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition
video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale
pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing
Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin,
Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang
Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward
feedback. arXiv preprint arXiv:2405.18750, 2024.
Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng
Zuo, Qixiang Ye, and Jingdong Wang. Evaluation of text-to-video generation models: A dynamics
perspective, 2024. URL https://arxiv.org/abs/2407.01094.
Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and
sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of
computer vision, pp. 5404–5411, 2024.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr
Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–
ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings,
Part V 13, pp. 740–755. Springer, 2014.
OpenAI. Gpt-4o. 2024a.
OpenAI. Sora. 2024b. URL https://openai.com/index/sora/.
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of
the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. Journal of machine learning research, 21(140):1–67, 2020.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF confer-
ence on computer vision and pattern recognition, pp. 10684–10695, 2022.
runway. Gen-2. 2023. URL https://runwayml.com/ai-tools/gen-2-text-to-video.
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv
preprint arXiv:2202.00512, 2022.
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry
Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video
data. arXiv preprint arXiv:2209.14792, 2022.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced
transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Kuaishou AI Team. Kling. 2024. URL https://kling.kuaishou.com/en.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation
and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing
systems, 30, 2017.
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang,
Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable
length video generation from open domain textual descriptions. In International Conference on
Learning Representations, 2022.
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang,
Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv
preprint arXiv:2311.03079, 2023a.
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan
He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent
diffusion models. arXiv preprint arXiv:2309.15103, 2023b.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin,
Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of
foundation models. arXiv preprint arXiv:2309.16039, 2023.
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong
Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–
tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu,
Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic
evaluation of text-to-time-lapse video generation. arXiv preprint arXiv:2406.18522, 2024.
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei
Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video
generation. arXiv preprint arXiv:2309.15818, 2023a.
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language
model for video understanding. arXiv preprint arXiv:2306.02858, 2023b.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable
effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 586–595, 2018.
Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong,
Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion.
arXiv preprint arXiv:2403.05121, 2024a.
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou,
Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, March
2024b. URL https://github.com/hpcaitech/Open-Sora.
Figure 9: Text to video showcases. The displayed prompt will be upsampled before being fed into
the model. The generated videos contain large motion and can produce various video styles.
Figure 10: Text to video showcases.
A Image To Video Model
We finetune an image-to-video model from the text-to-video model. Drawing from Blattmann et al. (2023), we add an image as an additional condition alongside the text. The image is passed through the 3D VAE and concatenated with the noised input along the channel dimension. Similar to super-resolution tasks, there is a significant distribution gap between training and inference (the first frame of videos vs. real-world images). To enhance the model's robustness, we add large noise to the image condition during training. Some examples are shown in Figures 11 and 12. CogVideoX can handle different styles of image input.
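The conditioning can be sketched as follows; the noise-augmentation strength and the zero-padding of non-first latent frames are illustrative assumptions:

```python
import torch

def build_i2v_input(noised_video_latent, image_latent, noise_aug_std: float = 0.1):
    """Concatenate an image condition with the noised video latent along channels.
    noised_video_latent: (B, C, T, H, W); image_latent: (B, C, 1, H, W) from the 3D VAE."""
    # Corrupt the image condition so the model is robust to the gap between
    # real-world images and the first frames of training videos.
    image_latent = image_latent + noise_aug_std * torch.randn_like(image_latent)
    # Place the image at the first latent frame and zeros elsewhere (an assumption).
    cond = torch.zeros_like(noised_video_latent)
    cond[:, :, :1] = image_latent
    return torch.cat([noised_video_latent, cond], dim=1)   # (B, 2C, T, H, W)

x = build_i2v_input(torch.randn(1, 16, 13, 60, 90), torch.randn(1, 16, 1, 60, 90))
print(x.shape)   # torch.Size([1, 32, 13, 60, 90])
```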
Figure 11: Image to video showcases. The displayed prompt will be upsampled before being fed into
the model.
Figure 12: Image to video showcases.
B Caption Upsampler
To ensure that the text input distribution during inference is as close as possible to that during training, similar to Betker et al. (2023), we use a large language model to upsample the user's input during inference, making it more detailed and precise. We find that a finetuned LLM generates better prompts than zero-/few-shot prompting. For image-to-video, we use a vision-language model such as GPT-4V or CogVLM (Wang et al., 2023a) to upsample the prompt. The prompt used for upsampling is as follows.
You are part of a team of bots that create videos. You work
with an assistant bot that will draw anything you say in
square brackets. For example, outputting \" a beautiful
morning in the woods with the sun peaking through the
trees \" will trigger your partner bot to output a video
of a forest morning, as described. You will be prompted
by people looking to create detailed, amazing videos.
The way to accomplish this is to take their short prompts
and make them extremely detailed and descriptive.
There are a few rules to follow :
You will only ever output a single video description
per user request.
When modifications are requested, you should not simply
make the description longer. You should refactor the
entire description to integrate the suggestions.
the content of the video and the changes that occur, in
chronological order.\n Please keep the description of this
video within 100 English words.
and tranquil sea backdrop. This surreal tableau blends natural
beauty with human ingenuity, creating a serene yet whimsical
atmosphere that emphasizes the crab’s unique adaptation and the
contrast between nature and technology in this quiet nocturnal
setting.
E Video to Video via CogVideoX and CogVLM2-Caption
In this section, we present several examples of video-to-video generation using CogVideoX and
CogVLM2-Caption. Specifically, we first input the original video into CogVLM2-Caption to obtain
the video’s caption, and then feed this caption into the CogVideoX model to generate a new video.
From the examples below, it can be seen that our pipeline achieves a high degree of fidelity to the
original video:
F Human Evaluation Details
Sensory Quality: This part focuses mainly on the perceptual quality of videos, including subject
consistency, frame continuity, and stability.
Instruction Following: This part focuses on whether the generated video aligns with the prompt,
including the accuracy of the subject, quantity, elements, and details.
Physics Simulation: This part focuses on whether the model adheres to the objective laws of the physical world, such as lighting effects, interactions between different objects, and the realism of fluid dynamics.
Cover Quality: This part mainly focuses on metrics that can be assessed from single-frame images,
including aesthetic quality, clarity, and fidelity.