Volume 9, Issue 12, December – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Automatic Video Generator


K Tresha; Kavya; Medhaa PB; Pragathi T
Department of Information Science, RNS Institute of Technology, Bengaluru

Abstract:- Text-to-video (T2V) generation is an emerging field in artificial intelligence, gaining traction with advances in deep learning models such as generative adversarial networks (GANs), diffusion models, and hybrid architectures. This paper provides a comprehensive survey of recent T2V methodologies, exploring models such as GAN-based frameworks, VQGAN-CLIP, IRC-GAN, Sora OpenAI, and CogVideoX, which aim to transform textual descriptions into coherent video content. These models face challenges in maintaining semantic coherence, temporal consistency, and realistic motion across generated frames. We examine the architectural designs, methodologies, and applications of key models, highlighting the advantages and limitations of their approaches to video synthesis. Additionally, we discuss benchmark advancements, such as T2VBench, which plays a crucial role in evaluating temporal consistency and content alignment. This review sheds light on the strengths and limitations of existing approaches and outlines ethical considerations and future directions for T2V generation in the realm of generative AI.

Keywords:- Text-to-Video (T2V) Generation, Deep Learning, Generative Adversarial Networks (GANs), Diffusion Models, Hybrid Architectures, VQGAN-CLIP, IRC-GAN, Sora OpenAI, CogVideoX, Semantic Coherence, Temporal Consistency, Realistic Motion, Video Synthesis, Benchmark Advancements, T2VBench, Content Alignment, Ethical Considerations, Generative AI.

I. INTRODUCTION

Text-to-video (T2V) generation represents a cutting-edge area in multimedia content creation, where generative models aim to translate textual descriptions into dynamic, visually coherent videos. Unlike static image generation, video synthesis involves not only creating realistic visuals but also ensuring temporal coherence across frames, which adds considerable complexity to the task. Recent advances in generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, have enabled significant progress in generating high-quality, realistic videos from textual prompts.

The rise of deep learning has transformed content creation across media, positioning generative AI as a powerful tool for video synthesis. The task is challenging because it requires models to capture both spatial and temporal features, translating a single textual prompt into a sequence of visually consistent and contextually accurate frames. Models such as VQGAN-CLIP, Temporal GANs Conditioning on Captions (TGANs-C), and hybrid VAE-GAN architectures have been designed to address these unique challenges, focusing on semantic coherence, smooth transitions, and realistic motion.

Despite these advancements, T2V generation faces several limitations. Models often struggle to maintain high resolution, stable frame quality, and semantic alignment with input captions across sequences. High computational costs further constrain the scalability and accessibility of these methods. Nonetheless, newer models such as IRC-GAN, Sora OpenAI, and CogVideoX have made strides in improving video coherence and alignment with input prompts. Benchmarks like T2VBench have emerged to provide a standard for evaluating temporal consistency and content alignment, both crucial for producing realistic video content.

This paper surveys recent methodologies in T2V generation, analyzing their architectures, strengths, and limitations. By exploring advancements in T2V models and emerging trends in the field, we aim to highlight the current capabilities of generative AI in video synthesis and identify areas for future improvement.

II. LITERATURE SURVEY

• Introduction to Text-to-Video Generation:
Text-to-video (T2V) generation is an evolving field that translates textual descriptions into video outputs, bridging the gap between linguistic and visual information. Advances in generative AI, especially in generative adversarial networks (GANs) and diffusion models, have been instrumental in addressing the challenges of maintaining temporal coherence and scene consistency across frames.

• Generative Models for T2V Synthesis:
The main generative models facilitating T2V synthesis include GAN-based architectures, vector-quantized GANs combined with CLIP (VQGAN-CLIP), and diffusion models. Each addresses the unique demands of video generation, such as maintaining spatial features across frames and ensuring temporal continuity. Notable approaches include:

• TiVGAN: Generates frames sequentially from a single image, focusing on evolving visual coherence through an iterative process.
• TGANs-C: Introduces multi-discriminator GANs for text-aligned temporal coherence.
• Tune-A-Video: Leverages pre-trained text-to-image models fine-tuned with single text-video pairs for efficiency.
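The GAN-based approaches above share a common shape flow: a caption embedding and a noise vector are mapped to per-frame latents, which a shared convolutional decoder renders into a short stack of frames. The following PyTorch sketch illustrates only that flow; the layer sizes, frame count, and the external caption encoder are illustrative assumptions, not details taken from any of the surveyed models.

```python
import torch
import torch.nn as nn

class TextConditionedVideoGenerator(nn.Module):
    """Toy GAN-style generator: caption embedding + noise -> a short stack of frames."""

    def __init__(self, text_dim=512, noise_dim=64, num_frames=8):
        super().__init__()
        self.num_frames = num_frames
        # One 256-d latent per frame, derived jointly from the prompt and shared noise.
        self.to_frame_latents = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_frames * 256),
        )
        # A single convolutional decoder is shared across frames (spatial features),
        # while the per-frame latents carry the temporal variation.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 256, 4, 1, 0), nn.ReLU(),  # 1x1  -> 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),  # 4x4  -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 8x8  -> 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),    # 16x16 -> 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),     # 32x32 -> 64x64
        )

    def forward(self, text_emb, noise):
        b = text_emb.size(0)
        latents = self.to_frame_latents(torch.cat([text_emb, noise], dim=-1))
        latents = latents.view(b * self.num_frames, 256, 1, 1)
        frames = self.decoder(latents)                          # (B*T, 3, 64, 64)
        return frames.view(b, self.num_frames, 3, 64, 64)

if __name__ == "__main__":
    g = TextConditionedVideoGenerator()
    text_emb = torch.randn(2, 512)   # stand-in for a real caption encoder's output
    noise = torch.randn(2, 64)
    print(g(text_emb, noise).shape)  # torch.Size([2, 8, 3, 64, 64])
```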

• Evaluation and Benchmarking:
To ensure quality in T2V generation, benchmark systems like T2VBench evaluate models on temporal consistency, alignment accuracy, and narrative flow. T2VBench assesses models such as ZeroScope and Pika on 16 performance dimensions, helping highlight each model's specific strengths, such as sequencing and movement dynamics.
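As a concrete illustration of the kind of signal such benchmarks formalize, the snippet below computes a crude temporal-consistency proxy: the mean cosine similarity between consecutive frames. This is not T2VBench itself; the benchmark's 16 dimensions rely on far richer, learned criteria, whereas this sketch only flags obvious flicker or scene jumps.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_score(video: torch.Tensor) -> float:
    """Mean cosine similarity between consecutive frames of a (T, C, H, W) clip.

    A crude, pixel-level proxy for the temporal-consistency dimension that
    benchmarks such as T2VBench score with learned criteria. Values near 1.0
    mean smoothly varying frames; lower values indicate flicker or abrupt cuts.
    """
    assert video.dim() == 4, "expected a (T, C, H, W) tensor"
    flat = video.flatten(start_dim=1)                       # one vector per frame
    sims = F.cosine_similarity(flat[:-1], flat[1:], dim=1)  # T-1 pairwise scores
    return sims.mean().item()

if __name__ == "__main__":
    base = torch.rand(1, 3, 64, 64)
    smooth = (base + 0.02 * torch.randn(16, 3, 64, 64)).clamp(0, 1)  # slowly varying clip
    jumpy = torch.rand(16, 3, 64, 64)                                # unrelated frames
    print(temporal_consistency_score(smooth))  # close to 1.0
    print(temporal_consistency_score(jumpy))   # clearly lower
```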
• Key Challenges and Future Directions:
Despite promising advancements, challenges such as maintaining semantic coherence and producing high-resolution outputs persist. Future research directions include incorporating multimodal data, such as audio, and enhancing spatio-temporal consistency to improve narrative flow and fidelity in generated videos.

This overview integrates methodologies, benchmarks, and future insights based on recent research developments in text-driven video generation.

• Objectives

• Develop Advanced Generative Models:
To create robust generative models, particularly leveraging GANs, VAEs, and diffusion models, capable of translating textual descriptions into realistic and coherent video sequences. The focus is on achieving high-quality synthesis that respects temporal coherence across frames and aligns accurately with the input text.

• Enhance Temporal and Spatial Coherence:
To improve the temporal consistency and spatial fidelity of generated videos, ensuring smooth transitions between frames and realistic motion. Techniques such as multi-discriminator frameworks, spatio-temporal attention, and structure-guided sampling are explored to maintain object and motion continuity across sequences.

• Benchmark and Evaluate T2V Models:
To establish comprehensive evaluation benchmarks, like T2VBench, for assessing T2V models across multiple dimensions including event sequencing, narrative flow, and alignment accuracy. This enables a structured comparison of models and identification of areas needing improvement.

• Reduce Computational Requirements:
To optimize the efficiency of T2V models by developing methods that reduce computational costs, such as one-shot tuning and efficient fine-tuning of pre-trained text-to-image models. This is crucial to make T2V generation accessible and practical for broader applications.

• Broaden Application Scope and Real-World Relevance:
To explore applications beyond synthetic datasets, such as real-world scenarios that demand high interactivity, semantic understanding, and multimodal data integration. This includes adapting T2V models for use in entertainment, education, and personalized content creation.

• Proposed System
The systems surveyed cover various advanced generative models for text-to-video synthesis. In summary:

• TiVGAN (Text-to-Image-to-Video Generative Adversarial Network) - TiVGAN generates videos through an evolutionary process, creating an initial high-quality frame from a text input and sequentially evolving it to ensure coherence across frames. Its staged training stabilizes frame quality but can struggle with complex, dynamic scenes.

• TGANs-C (Temporal GANs Conditioning on Captions) - TGANs-C uses a multi-discriminator GAN framework to ensure temporal and semantic coherence with text captions. It leverages video, frame, and motion discriminators to maintain smooth transitions and realistic content but has high computational demands.

• Tune-A-Video - This model extends pretrained text-to-image models to video generation by fine-tuning them on a single text-video pair, making it computationally efficient. However, its one-shot tuning approach limits generalization without additional data.

• IRC-GAN (Introspective Recurrent Convolutional GAN) - This model combines a recurrent generator with LSTM and convolutional layers for better alignment with text and is suited to high-resolution tasks, but it lacks scalability for real-time applications.

• Sora OpenAI - Designed for democratized video creation, Sora uses a transformer-based diffusion model, enabling high-resolution, complex scene generation. Ethical concerns around misuse highlight the need for safeguards.

• CogVideoX - Featuring a 3D VAE and a diffusion transformer, CogVideoX generates long, coherent videos with a high degree of dynamic-object realism but is limited by computational demands.

These models demonstrate progress in aligning video content with text inputs while balancing quality and efficiency. Challenges remain, particularly with temporal coherence, resolution, and ethical use.

• Advantages of Proposed System
The proposed systems for text-to-video generation offer several advantages, each designed to enhance the quality, coherence, and efficiency of generated videos:

• TiVGAN (Text-to-Image-to-Video Generative Adversarial Network)

• Enhanced Temporal Coherence: By generating each frame sequentially, TiVGAN improves coherence across frames, maintaining smooth transitions.

• Stability in Training: The stepwise frame-generation approach helps stabilize training, reducing issues like mode collapse that are common in GANs.

• TGANs-C (Temporal GANs Conditioning on Captions)

• Robust Semantic and Temporal Alignment: TGANs-C uses a multi-discriminator setup to ensure that each frame aligns with the text input while maintaining video continuity.
• Smooth Transitions: Its use of motion and video discriminators allows for more realistic transitions, improving the fluidity of generated motion.

• Tune-A-Video

• High Efficiency: Tune-A-Video is highly efficient, leveraging pretrained text-to-image models and requiring only one-shot tuning, which reduces computational costs significantly.
• Object Consistency Across Frames: By focusing on spatio-temporal attention, it maintains object consistency, resulting in more stable and coherent videos.

• IRC-GAN (Introspective Recurrent Convolutional GAN)

• High-Resolution Output: IRC-GAN's architecture supports high-resolution output, making it suitable for applications demanding fine detail.
• Improved Frame Alignment with Text: The introspective mechanism ensures that frames are closely aligned with textual inputs, which is crucial for applications needing precise semantic fidelity.

• Sora OpenAI

• Complex Scene Handling: Sora's transformer-based diffusion model enables it to handle complex, minute-long videos with consistent frame quality, suitable for industries like entertainment and education.
• Democratized Access to Video Creation: Designed with accessibility in mind, it allows more users to create high-quality videos from text prompts, making it useful for diverse fields.

• CogVideoX

• Long-Form, Coherent Video Sequences: CogVideoX excels at generating long sequences with dynamic object tracking and scene realism, ideal for creating extended content.
• Advanced Resolution and Detail: With multi-resolution frame packing and progressive training, it produces detailed, high-quality video output.

These models push the boundaries of text-to-video generation by enhancing the realism, coherence, and efficiency of video synthesis, making them suitable for applications in entertainment, marketing, education, and beyond.

III. PROPOSED METHODOLOGY

The proposed methodology for these advanced generative models combines several novel approaches to tackle the complex challenges of text-to-video generation. The main methodologies used by each model are outlined below:

• TiVGAN (Text-to-Image-to-Video Generative Adversarial Network)

• Sequential Frame Generation: TiVGAN initiates the video by generating a high-quality single frame based on the text input, then incrementally adds frames. This evolutionary approach allows the model to stabilize its output progressively, focusing first on visual accuracy and later on achieving temporal consistency across frames.

• TGANs-C (Temporal GANs Conditioning on Captions)

• Multi-Discriminator Architecture: TGANs-C employs a video, frame, and motion discriminator to enforce both spatial and temporal coherence in the generated video. These discriminators work together to assess frame-by-frame realism and overall sequence continuity, ensuring the video aligns with the text input in a seamless, continuous flow.
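The sketch below illustrates this multi-discriminator idea in PyTorch: a frame discriminator judges single frames, a video discriminator judges the whole clip with 3D convolutions, and a motion discriminator applies the same backbone to frame differences. It is a schematic reading of the description above rather than the published TGANs-C implementation; layer sizes, the caption-embedding dimension, and the way scores are combined are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """Scores individual frames for realism and agreement with the caption embedding."""
    def __init__(self, text_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),    # 64x64 -> 32x32
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 32x32 -> 16x16
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128 + text_dim, 1)

    def forward(self, frames, text_emb):
        # frames: (N, 3, H, W), text_emb: (N, text_dim)
        h = self.conv(frames).flatten(1)
        return self.head(torch.cat([h, text_emb], dim=1))

class VideoDiscriminator(nn.Module):
    """Scores the whole clip with 3D convolutions to judge temporal coherence."""
    def __init__(self, text_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, (3, 4, 4), (1, 2, 2), (1, 1, 1)), nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, (3, 4, 4), (1, 2, 2), (1, 1, 1)), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(128 + text_dim, 1)

    def forward(self, video, text_emb):
        # video: (B, 3, T, H, W) in the channels-first layout expected by Conv3d
        h = self.conv(video).flatten(1)
        return self.head(torch.cat([h, text_emb], dim=1))

class MotionDiscriminator(VideoDiscriminator):
    """Same backbone as the video discriminator, applied to frame differences."""
    def forward(self, video, text_emb):
        motion = video[:, :, 1:] - video[:, :, :-1]  # temporal differences emphasize motion
        return super().forward(motion, text_emb)

if __name__ == "__main__":
    B, T = 2, 8
    video = torch.randn(B, 3, T, 64, 64)  # stand-in for generated frames
    text = torch.randn(B, 256)            # stand-in for a caption embedding
    d_frame, d_video, d_motion = FrameDiscriminator(), VideoDiscriminator(), MotionDiscriminator()

    frames = video.permute(0, 2, 1, 3, 4).reshape(B * T, 3, 64, 64)
    per_frame_text = text.repeat_interleave(T, dim=0)
    total_score = (d_frame(frames, per_frame_text).mean()
                   + d_video(video, text).mean()
                   + d_motion(video, text).mean())
    print(total_score.item())
```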
• Tune-A-Video

• One-Shot Tuning with Pretrained Text-to-Image Models: Tune-A-Video reuses pretrained text-to-image models, tuning them with a single text-video pair. It employs a diffusion-based sampling method with spatio-temporal attention, which is a lightweight way to extend text-to-image capabilities to video, maintaining consistency while using minimal data and computation.
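A generic sketch of the cross-frame (spatio-temporal) attention idea follows: every spatial location attends across the time axis so that content stays consistent between frames. The actual Tune-A-Video mechanism, which restricts keys and values to the first and previous frames, differs in detail; the module name and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Every spatial location attends over the T frames of a clip.

    Generic sketch of cross-frame attention for keeping objects consistent
    between frames; not the exact Tune-A-Video formulation.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) per-frame feature maps from a text-to-image backbone
        b, t, c, h, w = x.shape
        # Treat each spatial position independently and attend over the T frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)             # residual connection + norm
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

if __name__ == "__main__":
    feats = torch.randn(1, 8, 64, 16, 16)    # 8 frames of 64-channel features
    out = CrossFrameAttention(64)(feats)
    print(out.shape)                         # torch.Size([1, 8, 64, 16, 16])
```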
• IRC-GAN (Introspective Recurrent Convolutional GAN)

• Recurrent Convolutional and LSTM Layers: By integrating recurrent (LSTM) layers with 2D convolutions, IRC-GAN enhances frame quality and enforces temporal coherence. Its introspective mechanism maximizes mutual information between frames, aligning them effectively with textual prompts. This makes it particularly effective for high-resolution tasks where detail and frame alignment are crucial.
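A schematic of the recurrent-generator idea is shown below: an LSTM unrolled over time produces one latent per frame, and a shared 2D convolutional decoder renders each latent into a frame. The introspective mutual-information objective that gives IRC-GAN its name is not modeled here, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class RecurrentVideoGenerator(nn.Module):
    """LSTM over time produces a latent per frame; a shared 2D decoder renders it.

    Schematic of the recurrent-convolutional generator idea; IRC-GAN's
    introspective objective and full architecture are not reproduced.
    """
    def __init__(self, text_dim=256, hidden=256, num_frames=8):
        super().__init__()
        self.num_frames = num_frames
        self.rnn = nn.LSTM(input_size=text_dim, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, 128, 4, 1, 0), nn.ReLU(),   # 1x1 -> 4x4
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),       # 4x4 -> 8x8
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),        # 8x8 -> 16x16
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),         # 16x16 -> 32x32
        )

    def forward(self, text_emb):
        # Feed the same caption embedding at every step; the LSTM state carries motion.
        b = text_emb.size(0)
        steps = text_emb.unsqueeze(1).repeat(1, self.num_frames, 1)
        latents, _ = self.rnn(steps)                            # (B, T, hidden)
        latents = latents.reshape(b * self.num_frames, -1, 1, 1)
        frames = self.decoder(latents)                          # (B*T, 3, 32, 32)
        return frames.view(b, self.num_frames, 3, 32, 32)

if __name__ == "__main__":
    video = RecurrentVideoGenerator()(torch.randn(2, 256))
    print(video.shape)   # torch.Size([2, 8, 3, 32, 32])
```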
• Sora OpenAI

• Transformer-Based Diffusion Model: Sora uses a transformer model to structure high-quality, complex, minute-long videos that retain frame consistency and detail. By incorporating feedback loops for continuous refinement based on user input, it adapts to different video types, making it versatile for various applications, like marketing and education.
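Since the details of Sora's model are not public at this level, the sketch below shows only the generic denoising-diffusion sampling loop that transformer-based video diffusion models share: starting from noise, a learned noise predictor is applied repeatedly under a fixed schedule. The tiny stand-in denoiser and the linear beta schedule are illustrative assumptions, not a description of Sora itself.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ddpm_sample(denoiser: nn.Module, cond, shape, steps: int = 50):
    """Generic DDPM-style ancestral sampling over a video latent.

    Illustrative only: the schedule, conditioning, and denoiser here are toy
    stand-ins for a real spatio-temporal diffusion transformer.
    """
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                                   # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

if __name__ == "__main__":
    class TinyDenoiser(nn.Module):
        """Stand-in noise predictor; a real model would condition on text and time."""
        def __init__(self):
            super().__init__()
            self.net = nn.Conv3d(4, 4, 3, padding=1)
        def forward(self, x, t, cond):
            return self.net(x)

    latent = ddpm_sample(TinyDenoiser(), cond=None, shape=(1, 4, 8, 16, 16))
    print(latent.shape)   # torch.Size([1, 4, 8, 16, 16])
```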

• CogVideoX

• Diffusion Transformer with 3D VAE: CogVideoX uses a 3D Variational Autoencoder (VAE) in combination with a diffusion transformer model. This method allows for long-form generation with high realism, using a multi-resolution frame-packing approach to produce detailed, coherent scenes while progressively training the video sequence for quality enhancement.
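The sketch below shows the general shape of such a video VAE: 3D convolutions compress a clip in both space and time into a small latent on which a diffusion model would operate. It is illustrative only; CogVideoX's actual 3D causal VAE differs in capacity, padding, and training objective.

```python
import torch
import torch.nn as nn

class Video3DVAE(nn.Module):
    """Compresses a clip in both space and time with 3D convolutions.

    Illustrative sketch of a video VAE, not the CogVideoX implementation.
    """
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(64, 2 * latent_channels, kernel_size=3, padding=1),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, 4, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(64, 32, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, video):
        # video: (B, 3, T, H, W); the latent is roughly (B, latent_channels, T/2, H/4, W/4)
        mean, logvar = self.encoder(video).chunk(2, dim=1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)  # reparameterization trick
        return self.decoder(z), mean, logvar

if __name__ == "__main__":
    recon, mean, logvar = Video3DVAE()(torch.randn(1, 3, 8, 64, 64))
    print(mean.shape, recon.shape)   # (1, 8, 4, 16, 16) and (1, 3, 8, 64, 64)
```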
Together, these methodologies showcase creative uses of GANs, transformers, VAEs, and one-shot tuning to generate videos that are semantically aligned, temporally consistent, and computationally efficient. The blend of stepwise generation, multi-discriminator setups, and progressive training strategies marks a breakthrough in AI-driven video generation, opening up vast new applications and possibilities.
• System Architecture

Fig 1: System Architecture
IV. CONCLUSION

Models like TiVGAN, TGANs-C, Tune-A-Video, IRC-GAN, Sora OpenAI, and CogVideoX have addressed challenges in generating realistic, coherent videos from text, each contributing unique solutions to improve temporal coherence, semantic alignment, and resolution.

While these models exhibit significant strengths, such as efficient training, semantic consistency, and the handling of complex scenes, limitations still exist. High computational costs and challenges in maintaining long-term coherence and quality, especially in dynamic scenes and at high resolutions, remain hurdles to broader adoption.

REFERENCES

[1]. TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator. Doyeon Kim, Donggyu Joo, and Junmo Kim. School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea.
[2]. Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable Models. Enis Karaarslan and Ömer Aydın.
[3]. Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis. Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. University of Maryland, College Park; NEC Labs America, Princeton.
[4]. Transforming Text into Video: A Proposed Methodology for Video Production Using the VQGAN-CLIP Image Generative AI Model. SukChang Lee. Dept. of Digital Contents, Konyang University, Korea.
[5]. To Create What You Tell: Generating Videos from Captions. Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. University of Science and Technology of China, Hefei, China; Microsoft Research, Beijing, China.
[6]. Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, and Lawrence Carin. Duke University, Durham, NC, United States; NEC Laboratories America, Princeton, NJ, United States.
[7]. AutoLV: Automatic Lecture Video Generator. Wenbin Wang, Yang Song, and Sanjay Jha. School of Computer Science and Engineering, University of New South Wales, Australia.
[8]. Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation. Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, and Jing Liu.
[9]. IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-Video Generation. Kangle Deng, Tianyi Fei, Xin Huang, and Yuxin Peng. Institute of Computer Science and Technology, Peking University, Beijing.
[10]. Sora OpenAI's Prelude: Social Media Perspectives on Sora OpenAI and the Future of AI Video Generation. Reza Hadi Mogavi, Derrick Wang, Joseph Tu, Hilda Hadan, and Sabrina A. Sgandurra (Stratford School of Interaction Design and Business, University of Waterloo, Canada); Pan Hui (Hong Kong University of Science and Technology (Guangzhou), China); Lennart E. Nacke (Stratford School of Interaction Design and Business, University of Waterloo, Canada).

[11]. CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang.
[12]. StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text. Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Picsart AI Research (PAIR); UT Austin; SHI Labs @ Georgia Tech, Oregon & UIUC.
[13]. TAVGBench: Benchmarking Text to Audible-Video Generation. Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, and Yuchao Dai. Northwestern Polytechnical University; OpenNLPLab, Shanghai AI Lab; Australian National University; TapTap; Hefei University of Technology.
[14]. ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models. Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, and Zhiwei Xiong. University of Science and Technology of China; Microsoft Research Asia.
[15]. Rescribe: Authoring and Automatically Editing Audio Descriptions. Amy Pavel, Gabriel Reyes, and Jeffrey P. Bigham.
[16]. T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation. Pengliang Ji, Chuyang Xiao, Huilin Tai, and Mingxiao Huo. Carnegie Mellon University; ShanghaiTech University; McGill University.
[17]. Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Show Lab, National University of Singapore; ARC Lab, Tencent PCG.
[18]. LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models. Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu.
[19]. CogVideo: Large-Scale Pretraining for Text-to-Video Generation via Transformers. Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Tsinghua University; BAAI.
[20]. To Create What You Tell: Generating Videos from Captions. Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. University of Science and Technology of China, Hefei, China; Microsoft Research, Beijing, China.
