A SEMINAR REPORT

On

New Age AI: Creating Video from Text / Sora-OpenAI

Submitted to

MALLA REDDY ENGINEERING COLLEGE


In partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
In
Computer Science and Engineering (AIML)

By

Madhav Sai Tirukovela


Regd. No: 20J41A6657

Under the Guidance of


Mr. K. Dileep Reddy
(Assistant Professor, CSE AIML)

Computer Science and Engineering (AIML)

MALLA REDDY ENGINEERING COLLEGE


(An UGC Autonomous Institution, Approved by AICTE, New Delhi & Affiliated to JNTUH,
Hyderabad).
Maisammaguda(H), Medchal - Malkajgiri District, Secunderabad, Telangana State – 500100,
www.mrec.ac.in

MARCH - 2024


MALLA REDDY ENGINEERING COLLEGE


(An UGC Autonomous Institution, Approved by AICTE, New Delhi & Affiliated to JNTUH, Hyderabad).
Maisammaguda(H), Medchal - Malkajgiri District, Secunderabad, Telangana State – 500100,
www.mrec.ac.in

COMPUTER SCIENCE AND ENGINEERING (AIML)

CERTIFICATE

Certified that the seminar work entitled “New Age AI: Creating Video from Text / Sora-OpenAI” is a bonafide work carried out in the 8th semester by Madhav
Sai Tirukovela in partial fulfilment for the award of Bachelor of Technology in
Computer Science and Engineering (Artificial Intelligence & Machine
Learning) from Malla Reddy Engineering College, during the academic year
2023 – 2024. I wish him success in all future endeavors.

Seminar Coordinator Internal Examiner Head of Department


Mr. K Dileep Reddy Dr. U. Mohan Srinivas
Assistant Professor Professor & HOD

Place:
Date:


ACKNOWLEDGEMENT

We express our sincere thanks to our Principal, Dr. A. Ramaswami Reddy, who took keen interest and encouraged us in every effort during the research work.

We express our heartfelt thanks to Dr. U. Mohan Srinivas, Professor and HOD, Department of Computer Science and Engineering (AIML), for his kind attention and valuable guidance throughout the research work.

We are thankful to our Seminar Coordinator, Mr. K. Dileep Reddy, Assistant Professor, Department of Computer Science and Engineering (AIML), for his cooperation during the research work.

We also thank all the teaching and non-teaching staff of the Department for their cooperation during the project work.

Madhav Sai Tirukovela


CSE AIML
Regd.No.- 20J41A6657


ABSTRACT

The field of artificial intelligence (AI) has witnessed remarkable advancements, particularly in the realm of natural language processing (NLP) and computer vision.
One of the latest innovations in this domain is the ability of AI systems to create video
content directly from textual input. This groundbreaking capability not only
streamlines the video production process but also opens up new avenues for creativity
and storytelling.

In this paper, we delve into the exciting world of AI-driven video creation from text.
We explore the underlying technologies that make this feat possible, including deep
learning algorithms, neural networks, and multimodal architectures. By analyzing
recent developments and state-of-the-art approaches, we provide insights into the
challenges and opportunities associated with this emerging technology.

Furthermore, we discuss the potential applications and implications of AI-generated video content across various industries. From personalized marketing videos to
educational tutorials and entertainment media, the ability to translate text into engaging
visuals has transformative potential. We also examine the ethical considerations and
societal impact of AI-driven video creation, highlighting the importance of responsible
AI deployment and algorithmic transparency.

In conclusion, the convergence of AI, NLP, and computer vision is revolutionizing the
way we produce and consume video content. As we navigate this new age of AI-
powered creativity, understanding the capabilities and limitations of text-to-video
technologies is crucial for harnessing their full potential while ensuring ethical and
inclusive practices.

Signature of the
Student

Name : Madhav Sai Tirukovela


Regn. No : 20J41A6657
Semester : 8th

Branch : CSE AIML


Date :


TABLE OF CONTENTS

Chapter No.   Description                                                  Page No.

              ABSTRACT                                                     iv
              TABLE OF CONTENTS                                            v
              LIST OF FIGURES                                              vii
              LIST OF TABLES                                               viii

1             INTRODUCTION                                                 1
              1.1 Sora (Text-to-Video Model)                               2
              1.2 History                                                  2

2             OBJECTIVES                                                   3
              2.1 Safety                                                   6
              2.2 Research Techniques                                      7
              2.3 Applications                                             8

3             METHODOLOGY                                                  11
              3.1 Inspiration from Large Language Models                   11
              3.2 Training                                                 11
                  3.2.1 Video Compression Network                          11
                  3.2.2 Space Time Latent Patches                          12
                  3.2.3 Scaling Transformers for Video Generation          12
              3.3 Training Approach                                        13
                  3.3.1 Variable Durations, Resolutions, Aspect Ratios     13
                  3.3.2 Sampling Flexibility                               13
                  3.3.3 Improved Framing & Composition                     13
                  3.3.4 Language Understanding                             14

4             RESULTS AND DISCUSSIONS                                      15
              4.1 Prompting with Images and Videos                         15
              4.2 Image Generation Capabilities                            16
              4.3 Emerging Simulation Capabilities                         17
                  4.3.1 Three Dimensional Consistency                      17
                  4.3.2 Long Range Coherence and Object Permanence         17
                  4.3.3 Interacting Worlds                                 17
                  4.3.4 Simulating Digital Worlds                          17

5             CONCLUSION                                                   18

6             FUTURE SCOPE                                                 19

              REFERENCES                                                   21


LIST OF FIGURES

FIGURE NO.   TITLE                                                         PAGE NO.

1.1          An image generated by Sora given the user prompt              3
3.1          Dimensionality Reduction                                      11
3.2          Arranging randomly initialized patches in an appropriately
             sized grid                                                    12
4.1          Input video provided for Video Editing                        15
4.2          Transformation of styles and environments                     16
4.3          Image generated by Sora of resolution 2048x2048               16


LIST OF TABLES

TABLE NO.    TITLE                                                         PAGE NO.

2.1          Challenges and Ethical Considerations                         7
2.2          Applications of AI-Generated Video Content                    10


CHAPTER I

INTRODUCTION

In recent years, artificial intelligence (AI) has undergone a profound transformation, ushering in a new era of innovation and automation across diverse
industries. One of the most exciting developments in this field is the ability of AI
systems to create video content directly from textual input. This groundbreaking
capability represents a convergence of natural language processing (NLP) and
computer vision, promising to revolutionize the way we produce and consume
visual media.
Traditionally, the creation of video content has been a labor-intensive and time-
consuming process, requiring skilled professionals in videography, editing, and
production. However, advancements in deep learning algorithms and neural
networks have enabled AI systems to understand and interpret textual descriptions,
converting them into rich and dynamic visual sequences.
The concept of generating video from text opens up a multitude of possibilities
across various domains, including marketing, education, entertainment, and
communication. Imagine being able to generate personalized video advertisements
based on customer preferences, or transforming written scripts into immersive
educational videos with interactive visuals.
Moreover, AI-driven video creation has the potential to democratize content
production, allowing individuals and organizations with limited resources to
access professional-quality video content generation tools. This democratization of
visual storytelling not only fosters creativity but also promotes inclusivity by
amplifying diverse voices and narratives.
In this paper, we delve into the fascinating world of AI-powered video creation
from text. We explore the underlying technologies, applications, challenges, and
ethical considerations associated with this emerging paradigm. By examining
recent advancements and real-world use cases, we aim to provide a comprehensive
overview of the transformative potential of "New Age AI: Creating Video from
Text."


1.1 SORA (TEXT - TO - VIDEO MODEL)


Sora is a generative artificial intelligence model developed by OpenAI that specializes in text-to-video generation. The model accepts textual descriptions, known as prompts, from users and generates short video clips corresponding to those descriptions. Prompts can specify artistic styles, fantastical imagery, or real-world scenarios. When creating real-world scenarios, user input may be required to ensure factual accuracy; otherwise, features can be added erroneously. Sora is praised for its
ability to produce videos with high levels of visual detail, including intricate camera
movements and characters that exhibit a range of emotions. Furthermore, the model
possesses the functionality to extend existing short videos by generating new content
that seamlessly precedes or follows the original clip. As of March 2024, it is
unreleased and not yet available to the public.

1.2 HISTORY OF SORA / OpenAI

Several other text-to-video generating models had been created prior to Sora,
including Meta's Make-A-Video, Runway's Gen-2, and Google's Lumiere, the last of
which, as of February 2024, is also still in its research phase. OpenAI, the company
behind Sora, had released DALL·E 3, the third of its DALL-E text-to-image models,
in September 2023.

The team that developed Sora named it after the Japanese word for sky to signify its
"limitless creative potential". On February 15, 2024, OpenAI first previewed Sora by
releasing multiple clips of high-definition videos that it created, including an SUV
driving down a mountain road, an animation of a "short fluffy monster" next to a
candle, two people walking through Tokyo in the snow, and fake historical footage of
the California gold rush, and stated that it was able to generate videos up to one
minute long. The company then shared a technical report, which highlighted the
methods used to train the model. OpenAI CEO Sam Altman also posted a series of
tweets, responding to Twitter users' prompts with Sora-generated videos of the
prompts.


OpenAI has stated that it plans to make Sora available to the public but that it would
not be soon; it has not specified when. The company provided limited access to a
small "red team", including experts in misinformation and bias, to perform adversarial
testing on the model. The company also shared Sora with a small group of creative
professionals, including video makers and artists, to seek feedback on its usefulness in
creative fields.

Fig 1.1 An image generated by Sora given the user prompt.


CHAPTER 2
OBJECTIVES

OpenAI’s large model, Sora, can generate a full minute of high-quality video. The results of this work indicate that scaling up video generation models is a promising path toward building versatile simulators of the real world. Sora is a flexible model for visual data: it can create videos and images of different durations, resolutions, and aspect ratios, up to a full minute of high-definition video.

Today, Sora is becoming available to red teamers to assess critical areas for harms
or risks. We are also granting access to a number of visual artists, designers, and
filmmakers to gain feedback on how to advance the model to be most helpful for
creative professionals.

Sora is able to generate complex scenes with multiple characters, specific types of
motion, and accurate details of the subject and background. The model
understands not only what the user has asked for in the prompt, but also how those
things exist in the physical world.

The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant
emotions. Sora can also create multiple shots within a single generated video that
accurately persist characters and visual style.

The current model has weaknesses. It may struggle with accurately simulating the
physics of a complex scene, and may not understand specific instances of cause
and effect. For example, a person might take a bite out of a cookie, but afterward,
the cookie may not have a bite mark.

The model may also confuse spatial details of a prompt, for example, mixing up
left and right, and may struggle with precise descriptions of events that take place
over time, like following a specific camera trajectory.


Understand the Technology: Explore the underlying technologies such as deep learning algorithms, neural networks, and multimodal architectures that enable AI systems to create video content from textual input.

Examine Real-World Applications: Analyze and showcase the diverse applications of AI-generated video content across industries such as marketing, education, entertainment, and communication, highlighting specific use cases and success stories.

Assess Advantages and Limitations: Evaluate the advantages and limitations of AI-driven video creation compared to traditional methods, including aspects such as efficiency, cost-effectiveness, scalability, and quality of output.

Discuss Ethical and Societal Implications: Discuss the ethical considerations and societal impact of AI-generated video content, addressing issues such as algorithmic bias, data privacy, intellectual property rights, and the role of responsible AI deployment.

Explore Future Trends and Developments: Predict and discuss potential future trends, advancements, and innovations in the field of text-to-video AI technologies, including potential improvements in accuracy, realism, and user customization options.

Promote Awareness and Education: Raise awareness and educate stakeholders, including industry professionals, researchers, policymakers, and the general public, about the capabilities, opportunities, and challenges associated with AI-powered video creation.

Encourage Collaboration and Innovation: Foster collaboration between AI researchers, content creators, technology developers, and end-users to drive innovation, co-create new solutions, and unlock the full creative potential of AI-driven video content creation.

Provide Practical Guidance: Offer practical guidance, best practices, and recommendations for organizations and individuals looking to integrate AI-powered video creation tools into their workflows, ensuring effective implementation, user engagement, and ethical standards compliance.


2.1 SAFETY

We’ll be taking several important safety steps ahead of making Sora available in
OpenAI’s products. We are working with red teamers — domain experts in areas
like misinformation, hateful content, and bias — who will be adversarially testing
the model.

We’re also building tools to help detect misleading content such as a detection
classifier that can tell when a video was generated by Sora. We plan to include
C2PA metadata in the future if we deploy the model in an OpenAI product.
In addition to us developing new techniques to prepare for deployment, we’re
leveraging the existing safety methods that we built for our products that use
DALL·E 3, which are applicable to Sora as well.

For example, once in an OpenAI product, our text classifier will check and reject
text input prompts that are in violation of our usage policies, like those that request
extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of
others. We’ve also developed robust image classifiers that are used to review the
frames of every video generated to help ensure that it adheres to our usage policies,
before it’s shown to the user.

We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new
beneficial ways people will use our technology, nor all the ways people will abuse
it. That’s why we believe that learning from real-world use is a critical component
of creating and releasing increasingly safe AI systems over time.


CHALLENGE / ETHICAL ISSUE    DESCRIPTION

Algorithmic Bias             Risk of biases in AI models affecting content generation
Privacy Concerns             Protection of user data and sensitive information
Intellectual Property        Copyright and ownership issues related to AI-generated content
Transparency                 Ensuring transparency in AI decision-making processes
Fairness                     Addressing fairness and inclusivity in content generation
Accountability               Establishing accountability for AI-generated content

Table 2.1 Challenges and Ethical Considerations

2.2 RESEARCH TECHNIQUES

Sora is a diffusion model, which generates a video by starting off with one that
looks like static noise and gradually transforms it by removing the noise over
many steps.
Sora is capable of generating entire videos all at once or extending generated
videos to make them longer. By giving the model foresight of many frames at a
time, we’ve solved a challenging problem of making sure a subject stays the same
even when it goes out of view temporarily.
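
To illustrate the idea, the denoising process can be sketched as a standard reverse-diffusion sampling loop, shown below. This is an illustrative sketch only, not OpenAI's published code: the `model` callable, the noise schedules (`alphas`, `alpha_bars`, `sigmas`), and the output shape are placeholders.

```python
import numpy as np

def denoise_step(x_t, predicted_noise, alpha_t, alpha_bar_t, sigma_t):
    # One reverse step: remove a fraction of the predicted noise, then add a little fresh noise
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_t)
    return mean + sigma_t * np.random.randn(*x_t.shape)

def sample_video(model, shape, alphas, alpha_bars, sigmas, prompt_embedding):
    """Start from pure Gaussian noise (static) and denoise it over many steps."""
    x = np.random.randn(*shape)                          # initially looks like static noise
    for t in reversed(range(len(alphas))):               # gradually remove the noise
        predicted_noise = model(x, t, prompt_embedding)  # hypothetical text-conditioned denoiser
        x = denoise_step(x, predicted_noise, alphas[t], alpha_bars[t], sigmas[t])
    return x                                             # the generated video (latent)
```

The key point is that the text prompt conditions every denoising step, so the video emerges gradually from noise rather than being drawn frame by frame.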

Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.

We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.

Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.

In addition to being able to generate a video solely from text instructions, the
model is able to take an existing still image and generate a video from it,
animating the image’s contents with accuracy and attention to small detail. The
model can also take an existing video and extend it or fill in missing frames. Learn
more in our technical report.

Sora serves as a foundation for models that can understand and simulate the real
world, a capability we believe will be an important milestone for achieving AGI.

2.3 APPLICATIONS

Marketing and Advertising:
Personalized Video Ads: AI can generate personalized video advertisements based on user preferences, browsing history, and demographic data, enhancing engagement and conversion rates.
Product Demonstrations: Textual descriptions of products or services can be transformed into dynamic video demonstrations, showcasing features, benefits, and usage scenarios.

Education and Training:
Interactive Tutorials: Text-based tutorials can be converted into interactive video lessons with simulations, quizzes, and feedback mechanisms, fostering active learning and knowledge retention.
Virtual Classrooms: AI-generated video content can simulate classroom environments, lectures, and educational modules, enabling distance learning and remote education initiatives.

Entertainment Industry:
AI-Generated Movies and Shows: Entire movies or series can be conceptualized and visualized based on textual scripts, characters, settings, and plotlines, offering new avenues for creative storytelling.
Virtual Reality Experiences: Textual descriptions can be transformed into immersive VR experiences, interactive narratives, and virtual worlds, enhancing user immersion and entertainment value.

Healthcare and Medical Education:
Medical Simulations: Text-to-video AI can create medical simulations, surgical procedures, patient case studies, and anatomy tutorials for healthcare professionals and students.
Patient Education Videos: Complex medical information can be simplified and visualized through AI-generated video content, aiding patient education, compliance, and understanding.

Journalism and Media:
Automated News Reports: AI can generate news reports, summaries, and data visualizations from textual news articles, enhancing newsroom efficiency and multimedia storytelling.
Data Visualization: Textual data can be transformed into informative and engaging video infographics, charts, and graphs, aiding in data-driven storytelling.

Gaming and Interactive Content:
Dynamic Storytelling: Text-based narratives in games can be dynamically converted into video sequences, enhancing storytelling, character development, and player immersion.
Interactive Game Elements: AI-generated video content can create interactive game elements, cutscenes, and visual effects, enriching gameplay experiences.

Virtual Events and Conferences:
Virtual Conferences: AI can create virtual conference environments, keynote presentations, and interactive sessions based on textual agendas and event descriptions.
Digital Exhibitions: Textual descriptions of products, services, or artworks can be transformed into virtual exhibitions, tours, and showcases, enabling online participation and engagement.


INDUSTRY APPLICATION

Marketing Personalized video ads, product demonstrations

Education Interactive tutorials, virtual classrooms

Entertainment AI-generated movies, virtual reality experiences

Healthcare Medical simulations, patient education videos

Journalism Automated news reports, data visualization

Gaming Dynamic storytelling, interactive game elements

Virtual Events Virtual conferences, digital exhibitions

Table 2.2 Applications of AI-Generated Video Content

These applications demonstrate the diverse and transformative impact of AI-driven video creation from text across industries, highlighting opportunities for innovation, personalization, and enhanced user experiences.


CHAPTER 3

METHODOLOGY

3.1 INSPIRATION FROM LARGE LANGUAGE MODELS (LLM’s)

Source of Inspiration: The approach is inspired by large language models that achieve generalist capabilities through training on vast amounts of internet-scale data.

LLM Paradigm: Large language models owe part of their success to the use of tokens, which serve as a unified representation for diverse modalities of text, including code, math, and various natural languages.

Fig 3.1 Dimensionality Reduction

3.2 TRAINING

The training of Sora involves video compression, extraction of spacetime latent patches, and scaling transformers for video generation. Let’s break down each part:

3.2.1 Video Compression Network

Input: Raw video footage.
Objective: The goal of this network is to reduce the dimensionality of visual data in videos.
Output: A latent representation that is compressed both temporally (across time) and spatially (across space).


Training: This network is trained on raw videos to generate a compressed latent space. This latent space retains essential visual information while reducing overall complexity.
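
To make the idea concrete, the following PyTorch sketch shows one way such a compression network could be structured. It is an illustrative assumption only, not OpenAI's published architecture; the layer choices, channel counts, and compression factors are placeholders.

```python
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    """Hypothetical sketch: compress raw video (B, C, T, H, W) into a smaller
    spacetime latent, reducing both temporal and spatial resolution."""
    def __init__(self, in_channels=3, latent_channels=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=(1, 2, 2), padding=1),   # halve H, W
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),           # halve T, H, W
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, video):
        return self.encoder(video)  # e.g. (1, 3, 16, 256, 256) -> (1, 16, 4, 32, 32)

# Example: a 16-frame 256x256 clip compressed 4x in time and 8x along each spatial axis
latent = VideoCompressor()(torch.randn(1, 3, 16, 256, 256))
```

A matching decoder (not shown) would map latents back to pixel space after generation.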

3.2.2 Space Time Latent Patches

Objective: Extracting meaningful patches from a compressed input video to act as transformer tokens.
Process: From the compressed video, spacetime patches (considering both
spatial and temporal dimensions) are extracted.
Applicability: This scheme is mentioned to work not only for videos but also for
images, as images are considered videos with a single frame.
Benefits: The patch-based representation allows Sora to be trained on videos and
images with varying resolutions, durations, and aspect ratios.
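
Continuing the sketch above, the patch-extraction step might look like the following; the shapes and patch sizes are illustrative assumptions, since Sora's actual tokenization is not public. Each flattened patch plays the role of a token, and an image is simply the single-frame case.

```python
import torch

def extract_spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Cut a compressed spacetime latent (C, T, H, W) into flattened patches.
    Each patch acts like a transformer token; an image is the T == 1, pt == 1 case."""
    C, T, H, W = latent.shape
    patches = (
        latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
              .permute(1, 3, 5, 0, 2, 4, 6)   # group dimensions by patch index (t, h, w)
              .reshape(-1, C * pt * ph * pw)  # one row per spacetime patch (token)
    )
    return patches

# Example: a (16, 4, 32, 32) latent becomes 2 * 8 * 8 = 128 tokens of length 512
tokens = extract_spacetime_patches(torch.randn(16, 4, 32, 32))
```

Because the token sequence length simply follows from the latent's size, the same model can consume clips of varying duration, resolution, and aspect ratio.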

3.2.3 Scaling Transformers for Video Generation

Model Type: Sora is described as a diffusion model and a diffusion transformer.
Training Objective: Sora is trained to predict the original “clean” patches given input noisy patches and conditioning information (such as text prompts).
Scaling Properties: Transformers, including diffusion transformers, have
demonstrated effective scaling across various domains, such as language
modeling, computer vision, and image generation. This scalability is crucial for
handling diverse data types and complexities.
Inference Control: During inference (generation), the size of generated videos
can be controlled by arranging randomly-initialized patches in an appropriately-
sized grid.

Fig 3.2 Arranging randomly initialized patches in appropriately sized grid.
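
The training objective described above can be summarized in a short sketch: corrupt the clean patch sequence with noise, then ask the diffusion transformer to recover it while conditioning on the text prompt. The model interface, noise schedule, and clean-patch regression loss below are illustrative assumptions, not the exact recipe used for Sora.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, clean_patches, text_embedding, noise_schedule):
    """Hypothetical sketch: noise clean spacetime patches, then regress back to them."""
    batch = clean_patches.shape[0]
    t = torch.randint(0, len(noise_schedule), (batch,))        # random timestep per sample
    alpha_bar = noise_schedule[t].view(-1, 1, 1)                # cumulative signal level at t
    noise = torch.randn_like(clean_patches)
    noisy = alpha_bar.sqrt() * clean_patches + (1 - alpha_bar).sqrt() * noise
    predicted_clean = model(noisy, t, text_embedding)           # transformer conditioned on the prompt
    return F.mse_loss(predicted_clean, clean_patches)           # predict the original "clean" patches
```

At inference time, the same network is applied repeatedly to a grid of randomly initialized patches whose size determines the resolution and duration of the output.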



In summary, Sora integrates a video compression network to create a compressed latent space, utilizes spacetime latent patches as transformer tokens for both videos and images, and employs a diffusion transformer for video generation with scalability across different domains. The model is trained to handle noisy input patches and predict the original “clean” patches, and it allows control over the size of generated videos during inference.

3.3 TRAINING APPROACH


There are several aspects of the Sora model’s training approach for image and
video generation, emphasizing the advantages of training on data at its native
size. Here’s an explanation:

3.3.1 Variable Durations, Resolutions, Aspect Ratios

Past Approaches: Traditional methods for image and video generation often
involve resizing, cropping, or trimming videos to a standard size (e.g., 4-second
videos at 256x256 resolution).
Native Size Training Benefits: The Sora model opts to train on data at its native
size, avoiding the standardization of duration, resolution, or aspect ratio.

3.3.2 Sampling Flexibility


Wide Range of Sizes: Sora is designed to sample videos with various sizes,
including widescreen 1920x1080p and vertical 1080x1920, offering flexibility
for creating content for different devices directly at their native aspect ratios.
Prototyping at Lower Sizes: This flexibility allows for quick content
prototyping at lower sizes before generating at full resolution, all using the same
model.

3.3.3 Improved Framing & Composition

Empirical Observation: Training on videos at their native aspect ratios is empirically found to improve composition and framing.
Comparison to Common Practice: Comparisons with a model that crops all training videos to be square (common practice in generative model training) show that the Sora model tends to have improved framing, avoiding issues where the subject is only partially in view.

3.3.4 Language Understanding

Text-to-Video Generation Training: Training text-to-video generation systems requires a large dataset of videos with corresponding text captions.

Re-Captioning Technique: The re-captioning technique from DALL·E is applied, involving training a highly descriptive captioner model and using it to produce text captions for all videos in the training set.

Improvements in Fidelity: Training on highly descriptive video captions is found to improve text fidelity and overall video quality.

GPT for User Prompts: Leveraging GPT, short user prompts are turned into
longer detailed captions, which are then sent to the video model. This enables
Sora to generate high-quality videos that accurately follow user prompts.
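
As an illustration of this prompt-expansion step, the sketch below uses the OpenAI chat API to turn a short user prompt into a longer, highly descriptive caption. The model name, system instruction, and overall flow are assumptions made for illustration; Sora's internal pipeline has not been released.

```python
from openai import OpenAI

client = OpenAI()

def expand_prompt(short_prompt: str) -> str:
    """Hypothetical sketch: turn a short user prompt into a detailed video caption."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Rewrite the user's idea as a single, highly "
                                          "descriptive video caption covering subjects, "
                                          "setting, lighting, camera motion, and style."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

# e.g. expand_prompt("a corgi surfing at sunset") -> a detailed caption passed to the video model
```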

In summary, the Sora model’s approach involves training on data at its native
size, providing flexibility in sampling videos of varying sizes, improving
framing and composition, and incorporating language understanding techniques
for generating videos based on descriptive captions and user prompts.


CHAPTER 4

RESULTS AND DISCUSSIONS

4.1 PROMPTING WITH IMAGES & VIDEOS

Sora can also be prompted with other inputs, such as pre-existing images or video. This
capability enables Sora to perform a wide range of image and video editing tasks—creating
perfectly looping video, animating static images, extending videos forwards or backwards in
time, etc.

Animating DALL-E Images


Sora is capable of generating videos provided an image and prompt as input.

Extending Generated Videos


Sora is also capable of extending videos, either forward or backward in time. Below are three
videos that were all extended backward in time starting from a segment of a generated video.
We can use this method to extend a video both forward and backward to produce a seamless
infinite loop.

Video-to-Video Editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit [32], to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.

Fig 4.1 Input video provided for Video Editing


Fig 4.2 Transformation of styles and environments
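
The SDEdit procedure applied above can be sketched as follows: partially noise the latent of the input video, then resume ordinary denoising under the new prompt, so the overall structure is preserved while the style and environment change. The denoiser interface, noise schedule, and strength value are illustrative placeholders, not Sora's actual editing code.

```python
import torch

def sdedit_video(model, input_latent, new_prompt_embedding, noise_schedule, strength=0.6):
    """Hypothetical SDEdit sketch: lower strength keeps more of the original video,
    higher strength follows the new prompt more freely."""
    t_start = int(strength * (len(noise_schedule) - 1))       # how far into the noise process to jump
    alpha_bar = noise_schedule[t_start]
    x = alpha_bar.sqrt() * input_latent + (1 - alpha_bar).sqrt() * torch.randn_like(input_latent)
    for t in reversed(range(t_start + 1)):                    # resume ordinary denoising from t_start
        x = model.denoise_step(x, t, new_prompt_embedding)    # placeholder denoiser interface
    return x
```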

Connecting Videos

We can also use Sora to gradually interpolate between two input videos, creating
seamless transitions between videos with entirely different subjects and scene
compositions.

4.2 IMAGE GENERATION CAPABILITIES

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048 x 2048 resolution.
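
In sketch form, choosing the output size amounts to choosing how large a grid of Gaussian-noise patches to start denoising from; a temporal extent of one frame turns the video sampler into an image sampler. The dimensions, channel count, and patch sizes below are illustrative placeholders, not Sora's real configuration.

```python
import torch

def initial_patch_grid(height, width, frames=1, channels=16, pt=1, ph=4, pw=4):
    """Hypothetical sketch: lay out randomly initialized Gaussian-noise patches in
    an appropriately sized grid; frames=1 makes the video model generate an image."""
    grid = torch.randn(channels, frames, height, width)           # pure noise to be denoised
    num_tokens = (frames // pt) * (height // ph) * (width // pw)  # patch tokens the transformer sees
    return grid, num_tokens

# A single-frame 2048x2048 image request vs. a 60-frame widescreen video request
image_grid, image_tokens = initial_patch_grid(2048, 2048, frames=1)
video_grid, video_tokens = initial_patch_grid(1080, 1920, frames=60)
```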


Fig 4.3 Image generated by Sora of resolution 2048 x 2048

4.3 EMERGING SIMULATION CAPABILITIES


We find that video models exhibit a number of interesting emergent capabilities
when trained at scale. These capabilities enable Sora to simulate some aspects of
people, animals and environments from the physical world. These properties
emerge without any explicit inductive biases for 3D, objects, etc.—they are purely
phenomena of scale.
4.3.1 Three-Dimensional consistency.
Sora can generate videos with dynamic camera motion. As the camera shifts and
rotates, people and scene elements move consistently through three-dimensional
space.

4.3.2 Long-range coherence and object permanence.


A significant challenge for video generation systems has been maintaining
temporal consistency when sampling long videos. We find that Sora is often,
though not always, able to effectively model both short- and long-range
dependencies. For example, our model can persist people, animals and objects
even when they are occluded or leave the frame. Likewise, it can generate multiple
shots of the same character in a single sample, maintaining their appearance
throughout the video.

4.3.3 Interacting with the world.

Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

4.3.4 Simulating digital worlds.


Sora is also able to simulate artificial processes–one example is video games. Sora
can simultaneously control the player in Minecraft with a basic policy while also
rendering the world and its dynamics in high fidelity. These capabilities can be
elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.

CHAPTER 5

CONCLUSION

The evolution of artificial intelligence (AI) has brought forth a transformative era
in video content creation, where text can be seamlessly translated into captivating
visual narratives. The exploration of "New Age AI: Creating Video from Text"
has illuminated the tremendous potential, challenges, and ethical considerations
inherent in this cutting-edge technology.

As we reflect on the discussed objectives, it is evident that AI-driven video creation holds immense promise across diverse sectors. From personalized marketing campaigns that resonate with individual preferences to immersive educational experiences that transcend traditional boundaries, the applications of text-to-video AI are vast and impactful.

Sora marks a significant stride forward in the realm of AI-generated video content.
Its unique capabilities and user-friendly features open doors for content creators,
educators, and businesses to explore new dimensions in visual storytelling. As
Sora continues to evolve, it holds the promise of transforming the way we
perceive and create videos in the digital landscape.

Looking ahead, the future of AI in video creation promises continued innovation and refinement. Anticipated advancements in accuracy, realism, and customization options will further enhance the quality and user experience of AI-generated video content. Collaboration between researchers, developers, content creators, and stakeholders will play a pivotal role in driving these advancements and ensuring ethical AI deployment.
We believe the capabilities Sora has today demonstrate that continued scaling of
video models is a promising path towards the development of capable simulators
of the physical and digital world, and the objects, animals and people that live
within them.

In conclusion, "New Age AI: Creating Video from Text" represents a significant
milestone in the evolution of AI and visual storytelling. By embracing the
potential of AI-driven video creation while upholding ethical standards and
fostering collaboration, we can unlock new realms of creativity, engagement, and
inclusivity in the digital landscape.


CHAPTER 6

FUTURE SCOPE

The future scope for "New Age AI: Creating Video from Text" is promising and
expansive, with several potential avenues for growth, innovation, and impact.
Here are some key areas of future scope:

Enhanced Realism and Immersion: AI algorithms will continue to evolve, leading to improvements in generating video content that is highly realistic and immersive. Advances in natural language understanding, computer vision, and graphics rendering will contribute to more lifelike visuals and seamless integration of text-based narratives into video formats.

Interactive and Personalized Experiences: Future developments may enable AI systems to create interactive video content that responds dynamically to viewer inputs or preferences. Personalization algorithms could tailor video narratives based on individual user profiles, enhancing engagement and relevance.

Multimodal Fusion: The integration of multiple modalities such as text, audio, images, and video will enable AI systems to create rich, multimodal storytelling experiences. This could lead to innovative multimedia presentations, virtual reality (VR) experiences, and mixed-reality content that blurs the line between physical and digital worlds.

Cross-Domain Applications: AI-powered video creation from text will find applications beyond traditional sectors such as marketing and education. Industries like healthcare, journalism, gaming, and virtual events may leverage this technology for purposes such as medical education simulations, news reporting, interactive storytelling in games, and virtual conferences.

Ethical AI and Bias Mitigation: Continued efforts will be made to address ethical concerns related to AI-generated content, including bias mitigation, fairness, transparency, and accountability. Development of ethical AI frameworks, bias detection algorithms, and responsible AI practices will be crucial for fostering trust and societal acceptance.

Collaborative Creation Platforms: Future platforms may emerge that facilitate collaborative creation of AI-generated video content, enabling teams of creators, designers, and AI experts to collaborate seamlessly. These platforms could integrate version control, real-time editing, and feedback mechanisms to streamline the content creation workflow.

Education and Skill Development: As AI-driven video creation becomes more accessible, education and skill development programs will play a vital role in equipping individuals with the knowledge and expertise to harness this technology effectively. Training initiatives, online courses, and workshops focused on AI content creation will empower a new generation of creators and storytellers.

Regulatory Frameworks and Standards: Governments and regulatory bodies may develop frameworks and standards to govern the use of AI in video content creation, ensuring compliance with legal requirements, privacy norms, and ethical guidelines. Industry collaborations and self-regulatory initiatives will also contribute to shaping responsible AI practices.

In summary, the future scope for "New Age AI: Creating Video from Text" is
characterized by continuous innovation, interdisciplinary collaborations, ethical
considerations, and a wide range of potential applications across various domains.
Embracing these opportunities while addressing challenges will pave the way for
a dynamic and inclusive AI-powered content creation landscape.


REFERENCES

1. Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhudinov. "Unsupervised learning of video representations using lstms." International conference on machine learning. PMLR, 2015.
2. Chiappa, Silvia, et al. "Recurrent environment simulators." arXiv preprint arXiv:1704.02254 (2017).
3. Ha, David, and Jürgen Schmidhuber. "World models." arXiv preprint arXiv:1803.10122 (2018).
4. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." Advances in neural information processing systems 29 (2016).
5. Tulyakov, Sergey, et al. "Mocogan: Decomposing motion and content for video generation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
6. Clark, Aidan, Jeff Donahue, and Karen Simonyan. "Adversarial video generation on complex datasets." arXiv preprint arXiv:1907.06571 (2019).
7. Brooks, Tim, et al. "Generating long videos of dynamic scenes." Advances in Neural Information Processing Systems 35 (2022): 31769-31781.
8. Yan, Wilson, et al. "Videogpt: Video generation using vq-vae and transformers." arXiv preprint arXiv:2104.10157 (2021).
9. Wu, Chenfei, et al. "Nüwa: Visual synthesis pre-training for neural visual world creation." European conference on computer vision. Cham: Springer Nature Switzerland, 2022.
10. Ho, Jonathan, et al. "Imagen video: High definition video generation with diffusion models." arXiv preprint arXiv:2210.02303 (2022).
11. Blattmann, Andreas, et al. "Align your latents: High-resolution video synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
12. Gupta, Agrim, et al. "Photorealistic video generation with diffusion models." arXiv preprint arXiv:2312.06662 (2023).
13. Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
14. Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
15. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
16. Arnab, Anurag, et al. "Vivit: A video vision transformer." Proceedings of the IEEE/CVF international conference on computer vision. 2021.
17. He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
18. Dehghani, Mostafa, et al. "Patch n'Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution." arXiv preprint arXiv:2307.06304 (2023).
19. Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
20. Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
21. Sohl-Dickstein, Jascha, et al. "Deep unsupervised learning using nonequilibrium thermodynamics." International conference on machine learning. PMLR, 2015.
22. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.
23. Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved denoising diffusion probabilistic models." International Conference on Machine Learning. PMLR, 2021.
24. Dhariwal, Prafulla, and Alexander Quinn Nichol. "Diffusion Models Beat GANs on Image Synthesis." Advances in Neural Information Processing Systems. 2021.
25. Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
26. Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
27. Chen, Mark, et al. "Generative pretraining from pixels." International conference on machine learning. PMLR, 2020.
28. Ramesh, Aditya, et al. "Zero-shot text-to-image generation." International Conference on Machine Learning. PMLR, 2021.
29. Yu, Jiahui, et al. "Scaling autoregressive models for content-rich text-to-image generation." arXiv preprint arXiv:2206.10789 (2022).
30. Betker, James, et al. "Improving image generation with better captions." Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf (2023).
31. Ramesh, Aditya, et al. "Hierarchical text-conditional image generation with clip latents." arXiv preprint arXiv:2204.06125 (2022).
32. Meng, Chenlin, et al. "Sdedit: Guided image synthesis and editing with stochastic differential equations." arXiv preprint arXiv:2108.01073 (2021).
