【视频生成大模型】视频生成大模型 THUDM/CogVideoX-2b

szZack

已于 2024-10-22 13:57:50 修改

阅读量904

点赞数 3

CC 4.0 BY-SA版权

分类专栏：视频生成大模型文章标签：视频生成大模型

于 2024-10-19 14:15:30 首次发布

本文链接：https://blog.csdn.net/zengNLP/article/details/143077575

视频生成大模型专栏收录该内容

1 篇文章

订阅专栏

【视频生成大模型】视频生成大模型 THUDM/CogVideoX-2b

CogVideoX-2b 模型介绍
运行环境安装
运行模型
下载
开源协议
参考

CogVideoX-2b 模型介绍

CogVideoX是清影同源的开源版本视频生成模型。

基础信息：

在这里插入图片描述

发布时间

2024年8月份

模型测试生成的demo视频

https://github.com/THUDM/CogVideo/raw/main/resources/videos/1.mp4

视频生成1

https://github.com/THUDM/CogVideo/raw/main/resources/videos/2.mp4

视频生成2

生成视频限制

提示词语言 English*
提示词长度上限 226 Tokens
视频长度 6 秒
帧率 8 帧 / 秒
视频分辨率 720 * 480，不支持其他分辨率(含微调)

运行环境安装

# diffusers>=0.30.1
# transformers>=0.44.0
# accelerate>=0.33.0 (suggest install from source)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg

运行模型

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

Quantized Inference

PytorchAO 和 Optimum-quanto 可以用于对文本编码器、Transformer 和 VAE 模块进行量化，从而降低 CogVideoX 的内存需求。这使得在免费的 T4 Colab 或较小 VRAM 的 GPU 上运行该模型成为可能！值得注意的是，TorchAO 量化与 torch.compile 完全兼容，这可以显著加快推理速度。

# To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
# Source and nightly installation is only required until next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

quantization = int8_weight_only

text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
   text_encoder=text_encoder,
   transformer=transformer,
   vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# prompt 只能输入英文
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)