A SEMINAR REPORT
On
New Age AI: Creating Video from Text / Sora - OpenAI
Submitted to
Malla Reddy Engineering College
in partial fulfilment for the award of the degree of
BACHELOR OF TECHNOLOGY
In
Computer Science and Engineering (AIML)
By
Madhav Sai Tirukovela
MARCH - 2024
CERTIFICATE
Certified that seminar work entitled “New Age AI: Creating Video from Text /
Sora - OpenAI” is a bonafide work carried out in the 8th semester by Madhav
Sai Tirukovela in partial fulfilment for the award of Bachelor of Technology in
Computer Science and Engineering (Artificial Intelligence & Machine
Learning) from Malla Reddy Engineering College, during the academic year
2023 – 2024. I wish him success in all future endeavors.
Place:
Date:
ACKNOWLEDGEMENT
We express our sincere thanks to our Principal, Dr. A. Ramaswami Reddy, who took
keen interest and encouraged us in every effort during the research work.
We express our heartfelt thanks to Dr. U. Mohan Srinivas, Professor and HOD,
Department of Computer Science and Engineering (AIML), for his kind attention and
valuable guidance throughout the course of this work.
We also thank all the teaching and non-teaching staff of the Department for their
cooperation and support.
ABSTRACT
In this report, we delve into the world of AI-driven video creation from text.
We explore the underlying technologies that make this feat possible, including deep
learning algorithms, neural networks, and multimodal architectures. By analyzing
recent developments and state-of-the-art approaches, we provide insights into the
challenges and opportunities associated with this emerging technology.
In conclusion, the convergence of AI, NLP, and computer vision is revolutionizing the
way we produce and consume video content. As we navigate this new age of AI-
powered creativity, understanding the capabilities and limitations of text-to-video
technologies is crucial for harnessing their full potential while ensuring ethical and
inclusive practices.
Signature of the Student
TABLE OF CONTENTS
1 INTRODUCTION
    1.1 Sora (Text-to-Video Model)
    1.2 History
2 OBJECTIVES
    2.1 Safety
3 METHODOLOGY
    3.1 Inspiration from Large Language Models
    3.2 Training
        3.2.1 Video Compression Network
        3.2.2 Space Time Latent Patches
        3.2.3 Scaling Transformers for Video Generation
4 RESULTS AND DISCUSSIONS
5 CONCLUSION
6 FUTURE SCOPE
REFERENCES
CHAPTER 1
INTRODUCTION
Several other text-to-video generation models had been created prior to Sora,
including Meta's Make-A-Video, Runway's Gen-2, and Google's Lumiere, the last of
which, as of February 2024, was still in its research phase. OpenAI, the company
behind Sora, had released DALL·E 3, the third of its DALL·E text-to-image models,
in September 2023.
The team that developed Sora named it after the Japanese word for sky to signify its
"limitless creative potential". On February 15, 2024, OpenAI first previewed Sora by
releasing multiple clips of high-definition videos that it created, including an SUV
driving down a mountain road, an animation of a "short fluffy monster" next to a
candle, two people walking through Tokyo in the snow, and fake historical footage of
the California gold rush, and stated that it was able to generate videos up to one
minute long. The company then shared a technical report, which highlighted the
methods used to train the model. OpenAI CEO Sam Altman also posted a series of
tweets, responding to Twitter users' prompts with Sora-generated videos of the
prompts.
OpenAI has stated that it plans to make Sora available to the public but that it would
not be soon; it has not specified when. The company provided limited access to a
small "red team", including experts in misinformation and bias, to perform adversarial
testing on the model. The company also shared Sora with a small group of creative
professionals, including video makers and artists, to seek feedback on its usefulness in
creative fields.
CHAPTER 2
OBJECTIVES
OpenAI’s large-scale model, Sora, can generate a full minute of high-quality video.
The outcomes of their work indicate that scaling up video generation models is a
promising path towards building versatile simulators of the real world. Sora is a
flexible model for visual data: it can create videos and images of different
durations, aspect ratios, and resolutions, up to a full minute of high-definition video.
Today, Sora is becoming available to red teamers to assess critical areas for harms
or risks. We are also granting access to a number of visual artists, designers, and
filmmakers to gain feedback on how to advance the model to be most helpful for
creative professionals.
Sora is able to generate complex scenes with multiple characters, specific types of
motion, and accurate details of the subject and background. The model
understands not only what the user has asked for in the prompt, but also how those
things exist in the physical world.
The current model has weaknesses. It may struggle with accurately simulating the
physics of a complex scene, and may not understand specific instances of cause
and effect. For example, a person might take a bite out of a cookie, but afterward,
the cookie may not have a bite mark.
The model may also confuse spatial details of a prompt, for example, mixing up
left and right, and may struggle with precise descriptions of events that take place
over time, like following a specific camera trajectory.
Explore Future Trends and Developments: Predict and discuss potential future
trends, advancements, and innovations in the field of text-to-video AI
technologies, including potential improvements in accuracy, realism, and user
customization options.
2.1 SAFETY
We’ll be taking several important safety steps ahead of making Sora available in
OpenAI’s products. We are working with red teamers — domain experts in areas
like misinformation, hateful content, and bias — who will be adversarially testing
the model.
We’re also building tools to help detect misleading content such as a detection
classifier that can tell when a video was generated by Sora. We plan to include
C2PA metadata in the future if we deploy the model in an OpenAI product.
In addition to us developing new techniques to prepare for deployment, we’re
leveraging the existing safety methods that we built for our products that use
DALL·E 3, which are applicable to Sora as well.
For example, once in an OpenAI product, our text classifier will check and reject
text input prompts that are in violation of our usage policies, like those that request
extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of
others. We’ve also developed robust image classifiers that are used to review the
frames of every video generated to help ensure that it adheres to our usage policies,
before it’s shown to the user.
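As a rough illustration of how such prompt and frame checks could be combined into a single review step, consider the sketch below; the classifier functions and policy labels are hypothetical placeholders, not OpenAI's production classifiers.

```python
# Minimal sketch of a prompt/frame moderation pipeline (hypothetical classifiers).
from typing import List

BLOCKED_CATEGORIES = {"extreme_violence", "sexual_content", "hateful_imagery",
                      "celebrity_likeness", "third_party_ip"}  # illustrative policy labels

def classify_prompt(prompt: str) -> set:
    """Placeholder text classifier: returns the policy categories a prompt violates."""
    # A real system would call a trained text classifier here.
    flagged = set()
    if "gore" in prompt.lower():
        flagged.add("extreme_violence")
    return flagged

def classify_frame(frame) -> set:
    """Placeholder image classifier applied to a single generated frame."""
    return set()  # a real system would run a vision model on the frame

def review_generation(prompt: str, frames: List) -> bool:
    """Reject the request before generation, or the finished video before display."""
    if classify_prompt(prompt) & BLOCKED_CATEGORIES:
        return False                      # reject the prompt outright
    for frame in frames:
        if classify_frame(frame) & BLOCKED_CATEGORIES:
            return False                  # reject the video before showing it
    return True
```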
Sora is a diffusion model, which generates a video by starting off with one that
looks like static noise and gradually transforms it by removing the noise over
many steps.
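As a concrete illustration of this denoising loop, the sketch below implements a generic DDPM-style sampler over a video-shaped latent tensor. The noise schedule, tensor shape, and `denoiser` network are assumptions made for illustration; this is not Sora's actual sampler.

```python
# Generic DDPM-style denoising loop over a video latent (illustrative only).
import torch

def sample_video_latent(denoiser, shape=(1, 16, 4, 32, 32), steps=50, device="cpu"):
    """Start from pure Gaussian noise and iteratively denoise it.

    shape: (batch, frames, channels, height, width) of the latent video.
    denoiser(x, t) is assumed to predict the noise present in x at step t.
    """
    betas = torch.linspace(1e-4, 0.02, steps, device=device)   # simple linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)                      # "static noise" starting point
    for t in reversed(range(steps)):
        eps = denoiser(x, t)                                    # predicted noise at this step
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject a little noise
    return x                                                    # final denoised latent video
```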
Sora is capable of generating entire videos all at once or extending generated
videos to make them longer. By giving the model foresight of many frames at a
time, we’ve solved a challenging problem of making sure a subject stays the same
even when it goes out of view temporarily.
Sora builds on past research in DALL·E and GPT models. It uses the recaptioning
technique from DALL·E 3, which involves generating highly descriptive captions
for the visual training data. As a result, the model is able to follow the user’s text
instructions in the generated video more faithfully.
In addition to being able to generate a video solely from text instructions, the
model is able to take an existing still image and generate a video from it,
animating the image’s contents with accuracy and attention to small detail. The
model can also take an existing video and extend it or fill in missing frames. Learn
more in our technical report.
Sora serves as a foundation for models that can understand and simulate the real
world, a capability we believe will be an important milestone for achieving AGI.
2.3 APPLICATIONS
Marketing and Advertising:
Personalized Video Ads: AI can generate personalized video advertisements
based on user preferences, browsing history, and demographic data, enhancing
engagement and conversion rates.
Product Demonstrations: Textual descriptions of products or services can be
transformed into dynamic video demonstrations, showcasing features, benefits,
and usage scenarios.
Journalism and Media:
Automated News Reports: AI can generate news reports, summaries, and data
visualizations from textual news articles, enhancing newsroom efficiency and
multimedia storytelling.
Data Visualization: Textual data can be transformed into informative and
engaging video infographics, charts, and graphs, aiding in data-driven
storytelling.
CHAPTER 3
METHODOLOGY
3.2 TRAINING
3.2.1 Video Compression Network
Input: Raw video footage.
Objective: The goal of this network is to reduce the dimensionality of visual data in videos.
Output: A latent representation that is compressed both temporally (across time) and spatially (across space).
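Such a network can be pictured as a spatiotemporal autoencoder. The toy PyTorch model below compresses a clip across both time and space; its layer sizes and strides are arbitrary assumptions, since Sora's actual compression network has not been published.

```python
# Toy spatiotemporal autoencoder: compresses a video in both time and space (illustrative).
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Each Conv3d halves the temporal and spatial resolution (stride 2 in T, H, W).
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(32, latent_channels, kernel_size=3, stride=2, padding=1),
        )
        # The decoder mirrors the encoder to map latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 32, kernel_size=4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels=3, frames, height, width)
        latent = self.encoder(video)        # compressed across time and space
        return self.decoder(latent)         # reconstruction used for training

# Example: a 16-frame 64x64 clip becomes a 4-frame 16x16 latent.
z = VideoCompressor().encoder(torch.randn(1, 3, 16, 64, 64))
print(z.shape)  # torch.Size([1, 8, 4, 16, 16])
```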
Past Approaches: Traditional methods for image and video generation often
involve resizing, cropping, or trimming videos to a standard size (e.g., 4-second
videos at 256x256 resolution).
Native Size Training Benefits: The Sora model opts to train on data at its native
size, avoiding the standardization of duration, resolution, or aspect ratio.
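According to the Sora technical report, this flexibility comes from decomposing the compressed latent into spacetime patches that act like transformer tokens, so a clip of any duration or resolution simply yields a longer or shorter token sequence. The sketch below illustrates that decomposition; the patch sizes and tensor layout are illustrative assumptions.

```python
# Sketch: turn a latent video of any duration/resolution into spacetime patch tokens.
import torch

def to_spacetime_patches(latent: torch.Tensor, pt: int = 2, ph: int = 4, pw: int = 4):
    """latent: (channels, frames, height, width); pt/ph/pw are assumed patch sizes.

    Returns a (num_patches, patch_dim) token sequence whose length grows or shrinks
    with the video's duration and resolution -- no resizing or cropping required.
    """
    c, t, h, w = latent.shape
    t, h, w = t - t % pt, h - h % ph, w - w % pw        # drop remainders for simplicity
    latent = latent[:, :t, :h, :w]
    patches = (latent
               .unfold(1, pt, pt)     # split time into chunks of pt frames
               .unfold(2, ph, ph)     # split height
               .unfold(3, pw, pw))    # split width -> (c, T', H', W', pt, ph, pw)
    patches = patches.permute(1, 2, 3, 0, 4, 5, 6).reshape(-1, c * pt * ph * pw)
    return patches

tokens = to_spacetime_patches(torch.randn(8, 4, 16, 16))
print(tokens.shape)  # torch.Size([32, 256]) -> 2*4*4 patches, each of dimension 8*2*4*4
```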
GPT for User Prompts: Leveraging GPT, short user prompts are turned into
longer detailed captions, which are then sent to the video model. This enables
Sora to generate high-quality videos that accurately follow user prompts.
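As a rough sketch of how a short user prompt could be expanded into a detailed caption with a general-purpose chat model, consider the example below; the system prompt and model name are illustrative assumptions, not Sora's actual recaptioning pipeline.

```python
# Sketch: expand a short user prompt into a detailed caption before video generation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Rewrite the user's request as a single, highly detailed video "
                        "caption describing subjects, setting, motion, lighting, and camera."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

detailed_caption = expand_prompt("a corgi surfing at sunset")
# detailed_caption would then be passed to the video model in place of the short prompt.
```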
In summary, the Sora model’s approach involves training on data at its native
size, providing flexibility in sampling videos of varying sizes, improving
framing and composition, and incorporating language understanding techniques
for generating videos based on descriptive captions and user prompts.
CHAPTER 4
RESULTS AND DISCUSSIONS
Sora can also be prompted with other inputs, such as pre-existing images or video. This
capability enables Sora to perform a wide range of image and video editing tasks—creating
perfectly looping video, animating static images, extending videos forwards or backwards in
time, etc.
Connecting Videos
We can also use Sora to gradually interpolate between two input videos, creating
seamless transitions between videos with entirely different subjects and scene
compositions.
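One common way to picture such interpolation is blending the two clips' latent (or noise) representations with a weight that ramps from one video to the other. The spherical interpolation below is a generic illustration of that idea, not Sora's published method.

```python
# Sketch: spherical interpolation (slerp) between the latents of two videos (illustrative).
import torch

def slerp(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Interpolate between two latent tensors along the unit sphere."""
    a, b = z_a.flatten(), z_b.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(a, b) / (a.norm() * b.norm()), -1.0, 1.0))
    if omega.abs() < 1e-6:                       # nearly identical latents: fall back to lerp
        return (1 - alpha) * z_a + alpha * z_b
    so = torch.sin(omega)
    mixed = (torch.sin((1 - alpha) * omega) / so) * a + (torch.sin(alpha * omega) / so) * b
    return mixed.view_as(z_a)

# Ramping alpha across the transition region blends video A into video B step by step.
z_a, z_b = torch.randn(8, 4, 16, 16), torch.randn(8, 4, 16, 16)
transition = [slerp(z_a, z_b, t / 10) for t in range(11)]  # 11 intermediate latents
```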
Interacting with the World
Sora can sometimes simulate actions that
affect the state of the world in simple ways. For example, a painter can leave new
strokes along a canvas that persist over time, or a man can eat a burger and leave
bite marks.
CHAPTER 5
CONCLUSION
The evolution of artificial intelligence (AI) has brought forth a transformative era
in video content creation, where text can be seamlessly translated into captivating
visual narratives. The exploration of "New Age AI: Creating Video from Text"
has illuminated the tremendous potential, challenges, and ethical considerations
inherent in this cutting-edge technology.
Sora marks a significant stride forward in the realm of AI-generated video content.
Its unique capabilities and user-friendly features open doors for content creators,
educators, and businesses to explore new dimensions in visual storytelling. As
Sora continues to evolve, it holds the promise of transforming the way we
perceive and create videos in the digital landscape.
In conclusion, "New Age AI: Creating Video from Text" represents a significant
milestone in the evolution of AI and visual storytelling. By embracing the
potential of AI-driven video creation while upholding ethical standards and
fostering collaboration, we can unlock new realms of creativity, engagement, and
innovation.
CHAPTER 6
FUTURE SCOPE
The future scope for "New Age AI: Creating Video from Text" is promising and
expansive, with several potential avenues for growth, innovation, and impact. Key
areas include continued technical innovation, interdisciplinary collaboration,
ethical and regulatory safeguards, and a widening range of applications across domains.
In summary, the future scope for "New Age AI: Creating Video from Text" is
characterized by continuous innovation, interdisciplinary collaborations, ethical
considerations, and a wide range of potential applications across various domains.
Embracing these opportunities while addressing challenges will pave the way for
a dynamic and inclusive AI-powered content creation landscape.
REFERENCES
16. Arnab, Anurag, et al. "ViViT: A Video Vision Transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
17. He, Kaiming, et al. "Masked Autoencoders Are Scalable Vision Learners." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
18. Dehghani, Mostafa, et al. "Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution." arXiv preprint arXiv:2307.06304 (2023).
19. Rombach, Robin, et al. "High-Resolution Image Synthesis with Latent Diffusion Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
20. Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
21. Sohl-Dickstein, Jascha, et al. "Deep Unsupervised Learning Using Nonequilibrium Thermodynamics." International Conference on Machine Learning. PMLR, 2015.
22. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising Diffusion Probabilistic Models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
23. Nichol, Alexander Quinn, and Prafulla Dhariwal. "Improved Denoising Diffusion Probabilistic Models." International Conference on Machine Learning. PMLR, 2021.
24. Dhariwal, Prafulla, and Alexander Quinn Nichol. "Diffusion Models Beat GANs on Image Synthesis." Advances in Neural Information Processing Systems. 2021.
25. Karras, Tero, et al. "Elucidating the Design Space of Diffusion-Based Generative Models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.
26. Peebles, William, and Saining Xie. "Scalable Diffusion Models with Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
27. Chen, Mark, et al. "Generative Pretraining from Pixels." International Conference on Machine Learning. PMLR, 2020.
28. Ramesh, Aditya, et al. "Zero-Shot Text-to-Image Generation." International Conference on Machine Learning. PMLR, 2021.
29. Yu, Jiahui, et al. "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation." arXiv preprint arXiv:2206.10789 (2022).
30. Betker, James, et al. "Improving Image Generation with Better Captions." OpenAI technical report (2023). https://cdn.openai.com/papers/dall-e-3.pdf
31. Ramesh, Aditya, et al. "Hierarchical Text-Conditional Image Generation with CLIP Latents." arXiv preprint arXiv:2204.06125 (2022).
32. Meng, Chenlin, et al. "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations." arXiv preprint arXiv:2108.01073 (2021).