Automatic Video Generator
Abstract:- Text-to-video (T2V) generation is an emerging field in artificial intelligence, gaining traction with advances in deep learning models like generative adversarial networks (GANs), diffusion models, and hybrid architectures. This paper provides a comprehensive survey of recent T2V methodologies, exploring models such as GAN-based frameworks, VQGAN-CLIP, IRC-GAN, Sora OpenAI, and CogVideoX, which aim to transform textual descriptions into coherent video content. These models face challenges in maintaining semantic coherence, temporal consistency, and realistic motion across generated frames. We examine the architectural designs, methodologies, and applications of key models, highlighting the advantages and limitations of their approaches to video synthesis. Additionally, we discuss benchmark advancements, such as T2VBench, which plays a crucial role in evaluating temporal consistency and content alignment. This review sheds light on the strengths and limitations of existing approaches and outlines ethical considerations and future directions for T2V generation in the realm of generative AI.

Keywords:- Text-to-Video (T2V) Generation, Deep Learning, Generative Adversarial Networks (GANs), Diffusion Models, Hybrid Architectures, VQGAN-CLIP, IRC-GAN, Sora OpenAI, CogVideoX, Semantic Coherence, Temporal Consistency, Realistic Motion, Video Synthesis, Benchmark Advancements, T2VBench, Content Alignment, Ethical Considerations, Generative AI.
I. INTRODUCTION

Text-to-video (T2V) generation represents a cutting-edge area in multimedia content creation, where generative models aim to translate textual descriptions into dynamic, visually coherent videos. Unlike static image generation, video synthesis involves not only creating realistic visuals but also ensuring temporal coherence across frames, which adds considerable complexity to the task. Recent advances in generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, have enabled significant progress in generating high-quality, realistic videos from textual prompts.
The rise of deep learning has transformed content creation across media, positioning generative AI as a powerful tool for video synthesis. The task is challenging because it requires models to capture both spatial and temporal features, translating a single textual prompt into a sequence of visually consistent and contextually accurate frames. Models like VQGAN-CLIP, Temporal GANs Conditioning on Captions (TGANs-C), and hybrid VAE-GAN architectures have been designed to address these unique challenges, focusing on semantic coherence, smooth transitions, and realistic motion.

Despite these advancements, T2V generation faces several limitations. Models often struggle to maintain high resolution, stable frame quality, and semantic alignment with input captions across sequences. High computational costs further constrain the scalability and accessibility of these methods. Nonetheless, newer models such as IRC-GAN, Sora OpenAI, and CogVideoX have made strides in improving video coherence and alignment with input prompts. Benchmarks like T2VBench have emerged to provide a standard for evaluating temporal consistency and content alignment, both crucial for producing realistic video content.
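To make the evaluation idea concrete, the sketch below computes a simple temporal-consistency score as the mean cosine similarity between embeddings of consecutive frames. It is an illustrative stand-in only: the small untrained encoder and the scoring rule are assumptions for demonstration, not the actual metric used by T2VBench.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in frame encoder; a real evaluation would use a pretrained image or
# video backbone (for example, a CLIP image encoder) instead.
class FrameEncoder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)

def temporal_consistency(video: torch.Tensor, encoder: nn.Module) -> float:
    """Mean cosine similarity between embeddings of consecutive frames.

    video: (T, 3, H, W) tensor of frames in [0, 1]. Higher scores indicate
    smoother transitions between neighbouring frames.
    """
    with torch.no_grad():
        feats = encoder(video)                          # (T, dim)
    sims = F.cosine_similarity(feats[:-1], feats[1:], dim=-1)
    return sims.mean().item()

frames = torch.rand(16, 3, 64, 64)                      # dummy 16-frame clip
print(temporal_consistency(frames, FrameEncoder()))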
This paper surveys recent methodologies in T2V generation, analyzing their architectures, strengths, and limitations. By exploring advancements in T2V models and the emerging trends in this field, we aim to highlight the current capabilities of generative AI in video synthesis and identify areas for future improvement.

II. LITERATURE SURVEY

Introduction to Text-to-Video Generation:
Text-to-video (T2V) generation is an evolving field that translates textual descriptions into video outputs, bridging the gap between linguistic and visual information. Advances in generative AI, especially in generative adversarial networks (GANs) and diffusion models, have been instrumental in addressing the challenges of maintaining temporal coherence and scene consistency across frames.

Generative Models for T2V Synthesis:
The main generative models facilitating T2V synthesis include GAN-based architectures, vector quantized GANs combined with CLIP (VQGAN-CLIP), and diffusion models. Each model addresses the unique demands of video generation, such as maintaining spatial features across frames and ensuring temporal continuity. Notable approaches include:

TiVGAN: Generates frames sequentially from a single image, focusing on evolving visual coherence through an iterative process.
TGANs-C: Introduces multi-discriminator GANs for text-aligned temporal coherence.
Tune-A-Video: Leverages pre-trained text-to-image models fine-tuned with single text-video pairs for efficiency.
Diffusion Transformer with 3D VAE: CogVideoX uses a 3D Variational Autoencoder (VAE) in combination with a diffusion transformer model. This method allows for long-form generation with high realism, using a multi-resolution frame-packing approach to produce detailed, coherent scenes while training the video sequence progressively to enhance quality (a simplified sketch of this latent-diffusion pattern follows this list).
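The following is a heavily simplified sketch of that pattern: a toy 3D VAE compresses a clip into a spatio-temporal latent, and a small transformer is trained to predict the noise added to that latent, conditioned on a text embedding. The module sizes, the schedule-free noising, and the 512-dimensional text-conditioning interface are illustrative assumptions, not CogVideoX's actual implementation.

import torch
import torch.nn as nn

class Tiny3DVAE(nn.Module):
    """Toy 3D VAE: compresses (B, 3, T, H, W) video into a smaller latent."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.enc = nn.Conv3d(3, latent_channels, kernel_size=4, stride=4)
        self.dec = nn.ConvTranspose3d(latent_channels, 3, kernel_size=4, stride=4)

    def encode(self, video: torch.Tensor) -> torch.Tensor:
        return self.enc(video)

    def decode(self, latent: torch.Tensor) -> torch.Tensor:
        return self.dec(latent)

class TinyDiffusionTransformer(nn.Module):
    """Toy denoiser: predicts the noise added to latent video tokens,
    conditioned on a text embedding prepended as an extra token."""
    def __init__(self, latent_channels: int = 8, dim: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(latent_channels, dim)
        self.text_proj = nn.Linear(512, dim)        # assumes a 512-d text embedding
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=2)
        self.proj_out = nn.Linear(dim, latent_channels)

    def forward(self, noisy_latent: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = noisy_latent.shape
        tokens = noisy_latent.flatten(2).transpose(1, 2)          # (B, T*H*W, C)
        tokens = self.proj_in(tokens)
        cond = self.text_proj(text_emb).unsqueeze(1)              # (B, 1, dim)
        out = self.blocks(torch.cat([cond, tokens], dim=1))[:, 1:]
        return self.proj_out(out).transpose(1, 2).reshape(b, c, t, h, w)

# One illustrative training step: add noise to the VAE latent and ask the
# transformer to predict that noise (the standard denoising objective).
vae, denoiser = Tiny3DVAE(), TinyDiffusionTransformer()
video = torch.rand(2, 3, 8, 32, 32)                 # (B, 3, T, H, W) dummy clips
text_emb = torch.randn(2, 512)                      # stand-in text encoder output
latent = vae.encode(video)
noise = torch.randn_like(latent)
noisy = latent + noise                              # simplified, schedule-free noising
loss = nn.functional.mse_loss(denoiser(noisy, text_emb), noise)
loss.backward()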
Together, these methodologies showcase creative uses of GANs, transformers, VAEs, and one-shot tuning to generate videos that are semantically aligned, temporally consistent, and computationally efficient. The blend of stepwise generation, multi-discriminator setups, and progressive training strategies marks a breakthrough in AI-driven video generation, opening up vast new applications and possibilities.
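As an illustration of the multi-discriminator setups mentioned above, in the spirit of TGANs-C, the sketch below pairs a frame-level discriminator with a video-level discriminator and trains a generator output against both. The tiny architectures and the omission of text conditioning are simplifying assumptions, not the published model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameDiscriminator(nn.Module):
    """Scores individual frames (real/fake) with 2D convolutions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:   # (B*T, 3, H, W)
        return self.net(frames).mean(dim=(1, 2, 3))

class VideoDiscriminator(nn.Module):
    """Scores whole clips (real/fake); 3D convolutions capture motion."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:    # (B, 3, T, H, W)
        return self.net(video).mean(dim=(1, 2, 3, 4))

def generator_loss(fake_video, d_frame, d_video):
    """The generator is pushed to fool both discriminators at once."""
    b, c, t, h, w = fake_video.shape
    frame_scores = d_frame(fake_video.transpose(1, 2).reshape(b * t, c, h, w))
    video_scores = d_video(fake_video)
    return (F.binary_cross_entropy_with_logits(frame_scores, torch.ones_like(frame_scores))
            + F.binary_cross_entropy_with_logits(video_scores, torch.ones_like(video_scores)))

fake = torch.rand(2, 3, 8, 32, 32, requires_grad=True)         # stand-in generator output
loss = generator_loss(fake, FrameDiscriminator(), VideoDiscriminator())
loss.backward()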
III. SYSTEM ARCHITECTURE

Fig 1: System Architecture
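Fig 1 itself is not reproduced in this text, so the sketch below is only an assumed, generic reading of a text-to-video pipeline of the kind surveyed here: a text encoder produces a prompt embedding, a conditional video generator produces a frame sequence, and the frames are returned for encoding into a playable clip. Every component and shape here is an illustrative stub, not the system shown in Fig 1.

import torch
import torch.nn as nn

class StubTextEncoder(nn.Module):
    def forward(self, prompt: str) -> torch.Tensor:
        # A real system would tokenize the prompt and run a language model.
        return torch.randn(1, 512)

class StubVideoGenerator(nn.Module):
    def forward(self, text_emb: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
        # A real system would run a GAN or diffusion model conditioned on text_emb.
        return torch.rand(num_frames, 3, 64, 64)

def generate_video(prompt: str) -> torch.Tensor:
    """Prompt -> text embedding -> frame sequence, ready for encoding to a file."""
    text_emb = StubTextEncoder()(prompt)
    frames = StubVideoGenerator()(text_emb)
    return frames                                   # (T, 3, H, W), values in [0, 1]

frames = generate_video("a dog running on a beach")
print(frames.shape)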
IV. CONCLUSION

Models like TiVGAN, TGANs-C, Tune-A-Video, IRC-GAN, Sora OpenAI, and CogVideoX have addressed challenges in generating realistic, coherent videos from text, each contributing unique solutions to improve temporal coherence, semantic alignment, and resolution.

While these models exhibit significant strengths, such as efficient training, semantic consistency, and the ability to handle complex scenes, limitations still exist. High computational costs, together with the difficulty of maintaining long-term coherence and quality, especially in dynamic scenes and at high resolutions, remain hurdles to broader adoption.

REFERENCES

[1]. TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator. Doyeon Kim, Donggyu Joo, and Junmo Kim. School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea.
[2]. Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable Models. Enis Karaarslan and Ömer Aydın.
[3]. Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis. Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. University of Maryland, College Park; NEC Labs America, Princeton.
[4]. Transforming Text into Video: A Proposed Methodology for Video Production Using the VQGAN-CLIP Image Generative AI Model. SukChang Lee, Dept. of Digital Contents, Konyang Univ., Korea.
[5]. To Create What You Tell: Generating Videos from Captions. Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. University of Science and Technology of China, Hefei, China; Microsoft Research, Beijing, China.
[6]. Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, and Lawrence Carin. Duke University, Durham, NC, United States; NEC Laboratories America, Princeton, NJ, United States.
[7]. AutoLV: Automatic Lecture Video Generator. Wenbin Wang, Yang Song, and Sanjay Jha. School of Computer Science and Engineering, University of New South Wales, Australia.
[8]. Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation. Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, and Jing Liu.
[9]. IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-Video Generation. Kangle Deng, Tianyi Fei, Xin Huang, and Yuxin Peng. Institute of Computer Science and Technology, Peking University, Beijing.
[10]. Sora OpenAI's Prelude: Social Media Perspectives on Sora OpenAI and the Future of AI Video Generation. Reza Hadi Mogavi, Derrick Wang, Joseph Tu, Hilda Hadan, and Sabrina A.