
Deepfake Video Detection using Machine Learning

Aditya Dumbre1, Omsai Alladwar2, Venktesh Kale3, Prof. N. B. Mulange4
1,2,3Student, 4Professor
1,2,3,4Department of Computer Engineering, Smt. Kashibai Navale College of Engineering, Pune, Maharashtra, India

Abstract— In recent months, freely available deep learning-based software tools have facilitated the creation of credible face swaps in videos that leave few traces of manipulation, commonly known as "DeepFake" (DF) videos. Manipulations of digital video have been demonstrated for several decades through the use of visual effects, but recent advances in deep learning have drastically increased the realism of fake content and made its creation more accessible. These so-called AI-synthesized media (popularly referred to as DF) can be generated with relative ease using AI tools. However, detecting DF videos presents a significant challenge, as training algorithms to spot such fakes is complex and requires substantial computational resources.

To address this challenge, we propose a system that leverages both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) for DF detection. The system uses a CNN to extract frame-level features and an RNN to learn temporal inconsistencies between frames caused by manipulation tools, helping to identify whether a video has been altered. By training this architecture on a large dataset of fake videos, we demonstrate that our approach can achieve competitive results, offering a promising solution for combating the growing threat of DF videos. Furthermore, the combined architecture allows for efficient detection without compromising accuracy.

Keywords: Deepfake Video Detection, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN)

I. INTRODUCTION
The increasing sophistication of smartphone cameras, coupled with the widespread availability of high-speed internet across the globe, has significantly expanded the reach of social media platforms and media-sharing portals. This technological advancement has made the creation, sharing, and transmission of digital videos easier and faster than ever before. The surge in computational power has further bolstered the capabilities of deep learning technologies, enabling advancements that would have been unimaginable just a few years ago. Deep learning, now more accessible and powerful, has opened new avenues for innovation but has also introduced significant challenges. One of the most prominent of these is the rise of "DeepFake" (DF) technology.

DeepFakes are produced using deep generative adversarial networks (GANs), which can manipulate both video and audio content to create highly realistic and convincing fake media. What was once the domain of highly skilled visual effects professionals can now be done relatively easily using publicly available tools. The accessibility of DF creation has led to their proliferation across social media platforms, where they are often used maliciously to spread false or misleading information. This has led to a rise in online spamming, disinformation campaigns, and the manipulation of public opinion.

The danger posed by DeepFakes goes beyond simple pranks or entertainment. In the wrong hands, they can be used to create damaging content that tarnishes reputations, spreads propaganda, or incites panic among the general public. For instance, DeepFakes have been used to impersonate public figures in ways that can influence elections, spread fake news, or even lead to financial fraud. These videos are so convincing that they can be difficult to distinguish from genuine content, leading to a growing threat of misinformation that is difficult to counter.

As these technologies continue to evolve, the challenge of detecting and mitigating the harmful effects of DeepFakes becomes more urgent. The manipulation of audio and video content not only undermines trust in online media but also poses serious ethical, social, and political risks. This makes it critical for researchers, technologists, and policymakers to work together to develop tools and strategies to combat the misuse of DeepFake technology, safeguard the public, and maintain the integrity of digital media. To counter this situation, DF detection is very important, so we describe a new deep learning-based method that can effectively distinguish AI-generated fake videos (DF videos) from real videos. Developing technology that can spot fakes is essential so that DF content can be identified and prevented from spreading over the internet.

To detect DF, it is important to understand how a Generative Adversarial Network (GAN) creates it. A GAN takes as input a video and an image of a specific individual (the 'target') and outputs another video with the target's face replaced by that of another individual (the 'source'). The backbone of DF creation is a deep adversarial neural network trained on face images and target videos to automatically map the faces and facial expressions of the source onto the target. With proper post-processing, the resulting videos can achieve a high level of realism. The GAN splits the video into frames, replaces the input face in every frame, and then reconstructs the video; this process is usually achieved using autoencoders. We describe a new deep learning-based method that can effectively distinguish DF videos from real ones. Our method is based on the same process that is used to create DF with a GAN. It exploits a property of DF videos: due to limited computational resources and production time, the DF algorithm can only synthesize face images of a fixed size, which must then undergo an affine warp to match the configuration of the source's face. This warping leaves distinguishable artifacts in the output deepfake video because of the resolution inconsistency between the warped face area and its surrounding context. Our method detects such artifacts by comparing the generated face areas with their surrounding regions: it splits the video into frames, extracts features with a ResNext Convolutional Neural Network (CNN), and uses a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) to capture the temporal inconsistencies between frames introduced by the GAN during the reconstruction of the DF. To train the ResNext CNN model, we simplify the process by directly simulating the resolution inconsistency in affine face warps.
II. LITERATURE SURVEY
The explosive growth of deepfake video and its illegal use is a major threat to democracy, justice, and public trust. This has increased the demand for fake video analysis, detection, and intervention. Some of the related work in deepfake detection is summarized below.

Exposing DF Videos by Detecting Face Warping Artifacts [1] detects artifacts by comparing the generated face areas and their surrounding regions with a dedicated Convolutional Neural Network model. The work considers two types of face artifacts. Their method is based on the observation that current DF algorithms can only generate images of limited resolution, which then need to be further transformed to match the faces to be replaced in the source video.

Exposing AI Created Fake Videos by Detecting Eye Blinking [2] describes a method to expose fake face videos generated with deep neural network models. The method is based on detecting eye blinking in videos, a physiological signal that is not well represented in synthesized fake videos. The method is evaluated on benchmark eye-blinking detection datasets and shows promising performance in detecting videos generated with DNN-based deepfake software. However, their method uses only the lack of blinking as a clue for detection, whereas other parameters, such as teeth enhancement and facial wrinkles, must also be considered. Our method is proposed to consider all these parameters.

Using Capsule Networks to Detect Forged Images and Videos [3] uses a capsule network to detect forged, manipulated images and videos in different scenarios, such as replay attack detection and computer-generated video detection. In their method, random noise is used in the training phase, which is not a good option: although the model performed well on their dataset, it may fail on real-time data due to the noise introduced during training. Our method is proposed to be trained on noiseless, real-time datasets.

Detection of Synthetic Portrait Videos using Biological Signals [5] extracts biological signals from facial regions in pairs of authentic and fake portrait videos. The approach applies transformations to compute spatial coherence and temporal consistency, captures the signal characteristics in feature sets and PPG maps, and trains a probabilistic SVM and a CNN; the aggregated authenticity probabilities then decide whether the video is fake or authentic. Their FakeCatcher detects fake content with high accuracy, independent of the generator, content, resolution, and quality of the video. However, because there is no discriminator that preserves the biological signals, formulating a differentiable loss function that follows the proposed signal-processing steps is not a straightforward process.

III. PROPOSED SYSTEM
Our project focuses on developing a web-based platform for detecting DeepFake (DF) videos using deep learning techniques such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The platform will allow users to upload videos, which will be analyzed and classified as either real or fake based on temporal inconsistencies and manipulations introduced by DF creation tools. The platform could also evolve into a browser plugin or be integrated into popular applications like WhatsApp, Facebook, or Instagram, enabling real-time detection of fake media before content is shared with others.

The system is designed to detect various types of DeepFakes, including face swapping (replacement DF), partial manipulations such as altered facial expressions (retrenchment DF), and identity blending (interpersonal DF). Our approach aims to enhance digital content verification by offering a solution that is secure, easy to use, and highly accurate. The project will focus on performance, ensuring it meets standards of reliability, accuracy, and user-friendliness while handling large datasets of fake and real videos.

By providing this tool, we aim to significantly reduce the spread of manipulated content across social media platforms and improve the trustworthiness of digital media. This system has the potential to become an essential part of the global effort to combat misinformation and the growing misuse of DeepFake technology. Figure 1 represents the simple system architecture of the proposed system.

Fig. 1: System Architecture
A. Dataset:
We use a mixed dataset consisting of an equal number of videos from different sources, such as YouTube, FaceForensics++ [14], and the DeepFake Detection Challenge dataset [13]. Our newly prepared dataset contains 50% original videos and 50% manipulated deepfake videos. The dataset is split into a 70% train set and a 30% test set, as sketched below.
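As a rough illustration of this preparation, the following sketch balances real and fake clips and performs the 70/30 split. The directory names, file pattern, and label convention (0 = real, 1 = fake) are assumptions for illustration, not details fixed by the paper.

```python
import random
from pathlib import Path

# Gather an equal number of real and fake clips (assumed directory layout),
# then make a 70/30 train/test split.
real = sorted(Path("data/real").glob("*.mp4"))
fake = sorted(Path("data/fake").glob("*.mp4"))
n = min(len(real), len(fake))            # enforce the 50/50 balance
samples = [(p, 0) for p in real[:n]] + [(p, 1) for p in fake[:n]]

random.seed(42)
random.shuffle(samples)
split = int(0.7 * len(samples))          # 70% train, 30% test
train_set, test_set = samples[:split], samples[split:]
```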

B. Preprocessing:
Dataset preprocessing includes the splitting the video into frames.
Followed by the face detection and cropping the frame with
detected face. To maintain the uniformity in the number of
frames the mean of the dataset video is calculated and the new
processed face cropped dataset is created containing the frames
equal to the mean. The frames that doesn’t have faces in it are
ignored during preprocessing. As processing the 10 second
video at 30 frames per second i.e total 300 frames will require
a lot of computationalpower. So for experimental purpose we
are proposing to usedonly first 100 frames for training the model.
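The preprocessing step could be sketched as follows. This is not the paper's exact pipeline: the Haar-cascade face detector and the crop size are illustrative choices (the paper does not fix a specific detector), while the 100-frame cap and the skipping of faceless frames follow the text above.

```python
import cv2

# One possible face detector; any detector with similar output would do.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_frames(video_path, max_frames=100, size=(112, 112)):
    """Split a video into frames, keep only frames with a detectable face,
    and return resized face crops (at most `max_frames` of them)."""
    cap = cv2.VideoCapture(str(video_path))
    crops = []
    while len(crops) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break                                  # end of video
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.3, 5)
        if len(faces) == 0:
            continue                               # frames without faces are ignored
        x, y, w, h = faces[0]                      # use the first detected face
        crops.append(cv2.resize(frame[y:y + h, x:x + w], size))
    cap.release()
    return crops
```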

C. Model:
The model consists of a resnext50_32x4d backbone followed by one LSTM layer. The data loader loads the preprocessed face-cropped videos and splits them into train and test sets. The frames of the processed videos are then passed to the model for training and testing in mini-batches.

D. ResNext CNN for Feature Extraction

Instead of writing a classifier from scratch, we propose to use the pre-trained ResNext CNN to extract frame-level features accurately. We then fine-tune the network by adding the required extra layers and selecting a proper learning rate so that the gradient descent of the model converges properly. The 2048-dimensional feature vectors obtained after the last pooling layer are used as the sequential input to the LSTM.

Fig. 2: Training Flow
E. LSTM for Sequence Processing
Let us assume a sequence of ResNext CNN feature vectors of Fig. 2: Training Flow
input frames as input and a 2-node neural network with the
probabilities of the sequence being part of a deep fake video
or an untampered video. The key challenge that we need to
address is the de- sign of a model to recursively process a
sequence in a meaningful manner. For this problem, we are
proposing to the use of a 2048 LSTM unit with 0.4 chance of
dropout, which is capable to do achieve our objective. LSTM
is used to process the frames in a sequential manner so that the
temporal analysis of the video can be made, by comparing the
frame at ‘t’ second with the frame of ‘t-n’ seconds. Where n
can be any number of frames before t.
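A minimal PyTorch sketch of the architecture described in sections C–E: a ResNeXt backbone producing 2048-dimensional per-frame features, an LSTM with 2048 hidden units and 0.4 dropout, and a 2-node output head. Wiring details beyond what the paper states (e.g., classifying from the last time step) are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class DFDetector(nn.Module):
    """Sketch: ResNeXt frame features -> LSTM over the sequence -> 2 logits."""
    def __init__(self, hidden_dim=2048, num_classes=2):
        super().__init__()
        backbone = models.resnext50_32x4d(pretrained=True)
        # Drop the final fc layer; keep everything up to global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(2048, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                          # x: (batch, frames, C, H, W)
        b, t, c, h, w = x.shape
        feats = self.features(x.view(b * t, c, h, w))  # (b*t, 2048, 1, 1)
        feats = feats.view(b, t, 2048)                 # per-frame feature vectors
        out, _ = self.lstm(feats)                      # (b, t, hidden_dim)
        return self.fc(self.dropout(out[:, -1]))      # logits from last time step
```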
F. Predict:
A new video is passed to the trained model for prediction. It is first preprocessed to bring it into the format expected by the trained model: the video is split into frames, followed by face cropping, and instead of storing the video in local storage, the cropped frames are passed directly to the trained model for detection.
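The prediction step, including the confidence score discussed in the Result section, might look like the following sketch. The predict helper and the label convention (1 = fake, matching the dataset sketch) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def predict(model, frames):
    """Classify one preprocessed clip and report a confidence score.
    `frames` is a (T, C, H, W) tensor of face crops."""
    model.eval()
    with torch.no_grad():
        logits = model(frames.unsqueeze(0))        # add batch dimension
        probs = F.softmax(logits, dim=1).squeeze(0)
    conf, idx = torch.max(probs, dim=0)
    label = "FAKE" if idx.item() == 1 else "REAL"
    return label, round(conf.item() * 100, 2)      # e.g. ("FAKE", 94.0)
```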

Fig. 3: Expected Results

Fig. 4: Prediction Flow

IV. RESULT
The output of our model indicates whether a video is classified as a DeepFake or a real video, accompanied by a confidence score that reflects the model's certainty in its prediction. For instance, if the model classifies a video as a DeepFake with a confidence score of 94%, the model is highly confident in its prediction, suggesting a strong likelihood that the video has been manipulated. Conversely, if the model identifies a video as real with a lower confidence score, users are encouraged to approach that classification with caution, as the model may not be entirely certain of its assessment.

An example of this output is illustrated in Figure 3, which shows a sample video frame alongside the model's classification result. The output clearly states whether the video is classified as DeepFake or real, accompanied by the confidence score of 94%. Such visual representations not only help users understand the model's decision-making process but also provide a reference point for interpreting the model's outputs effectively.

By offering both classification and confidence levels, our model aims to empower users with valuable insights into the authenticity of video content, thereby enhancing trust in digital media. This functionality is particularly crucial in today's landscape, where the prevalence of DeepFake videos poses significant risks to information integrity and public trust.

V. CONCLUSION
We presented a neural network-based approach designed to effectively classify videos as either DeepFake or real, while also providing confidence scores for our proposed model. This method draws inspiration from the process of DeepFake creation, which typically employs Generative Adversarial Networks (GANs) in conjunction with autoencoders. By understanding the mechanics behind how DeepFakes are generated, we have developed a model that aims to reverse-engineer this process for detection purposes.

Our approach utilizes a multi-layered architecture that includes frame-level detection through a ResNeXt Convolutional Neural Network (CNN) and video classification via a Recurrent Neural Network (RNN) that incorporates Long Short-Term Memory (LSTM) units. The ResNeXt architecture is particularly effective for image classification tasks due to its ability to learn intricate patterns and features in visual data, which is crucial for identifying the subtle artifacts typically present in manipulated videos.

The frame-level detection component analyzes individual frames of the video to extract features indicative of potential manipulation. These features are then fed into the RNN, which leverages LSTM cells to capture temporal dependencies across frames. This is important because DeepFakes often exhibit inconsistencies over time, such as unnatural facial movements or mismatched lip-syncing. By combining frame analysis with temporal modeling, our method is designed to effectively discern whether a video has been subjected to manipulation.

The proposed method evaluates the video based on a range of parameters discussed in this paper, allowing it to make informed classifications. These parameters may include visual artifacts, frame inconsistencies, and temporal anomalies, among others. We believe that this comprehensive approach will enable our model to achieve high accuracy on real-time data, making it a valuable tool for detecting DeepFakes in various applications, from social media to news broadcasting.

Ultimately, our goal is to enhance the reliability of digital media by providing a robust mechanism for identifying manipulated content. The confidence scores generated by our model will further aid users in assessing the credibility of video content, thus contributing to the broader effort to combat misinformation and promote media integrity in an increasingly digital world.
VI. LIMITATIONS
Our current method focuses exclusively on video analysis and does not take audio into account. As a result, our approach is not equipped to detect audio DeepFakes, which can be equally deceptive and misleading. Audio manipulation, often involving voice cloning or the synthesis of speech that mimics a person's voice, poses a significant challenge in the realm of digital media integrity.

Recognizing this limitation, we plan to expand our system to include audio analysis in future iterations. By integrating audio detection capabilities, we aim to develop a comprehensive solution that can effectively identify both video and audio DeepFakes. This enhancement will involve leveraging advanced audio processing techniques and machine learning algorithms specifically designed to analyze speech patterns, intonation, and other acoustic features that may indicate manipulation.

The goal is to create a robust detection system that not only assesses the visual integrity of video content but also evaluates the authenticity of the accompanying audio. By addressing this critical aspect, we hope to provide users with a more complete tool for identifying manipulated media, thereby enhancing trust in digital communications and protecting against misinformation. This future direction aligns with our commitment to improving media verification methods and staying ahead of emerging challenges in the rapidly evolving landscape of digital content creation and manipulation.

REFERENCES

[1] Yuezun Li and Siwei Lyu, "Exposing DF Videos by Detecting Face Warping Artifacts," in arXiv:1811.00656v3.

[2] Yuezun Li, Ming-Ching Chang, and Siwei Lyu, "Exposing AI Created Fake Videos by Detecting Eye Blinking," in arXiv.

[3] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen, "Using capsule networks to detect forged images and videos."

[4] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, and Weipeng Xu, "Deep Video Portraits," in arXiv:1901.02212v2.

[5] Umur Aybars Ciftci, İlke Demir, and Lijun Yin, "Detection of Synthetic Portrait Videos using Biological Signals," in arXiv:1901.02212v2.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in NIPS, 2014.

[7] David Güera and Edward J. Delp, "Deepfake video detection using recurrent neural networks," in AVSS, 2018.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016.

[9] An Overview of ResNet and its Variants: https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035

[10] Long Short-Term Memory: From Zero to Hero with PyTorch: https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/

[11] Sequence Models and LSTM Networks: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

[12] https://discuss.pytorch.org/t/confused-about-the-image-preprocessing-in-classification/3965

[13] https://www.kaggle.com/c/deepfake-detection-challenge/data

[14] https://github.com/ondyari/FaceForensics

[15] Y. Qian et al., "Recurrent color constancy," in Proceedings of the IEEE International Conference on Computer Vision, pp. 5459–5467, Oct. 2017, Venice, Italy.

[16] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5967–5976, July 2017, Honolulu, HI.

[17] R. Raghavendra, Kiran B. Raja, Sushma Venkatesh, and Christoph Busch, "Transferable deep-CNN features for detecting digital and print-scanned morphed face images," in CVPRW, IEEE, 2017.

[18] Tiago de Freitas Pereira, André Anjos, José Mario De Martino, and Sébastien Marcel, "Can face anti-spoofing countermeasures work in a real world scenario?," in ICB, IEEE, 2013.

[19] Nicolas Rahmouni, Vincent Nozick, Junichi Yamagishi, and Isao Echizen, "Distinguishing computer graphics from natural images using convolution neural networks," in WIFS, IEEE, 2017.

[20] F. Song, X. Tan, X. Liu, and S. Chen, "Eyes closeness detection from still images with multi-scale histograms of principal oriented gradients," Pattern Recognition, vol. 47, no. 9, pp. 2825–2838, 2014.

[21] D. E. King, "Dlib-ml: A machine learning toolkit," JMLR, vol. 10, pp. 1755–1758, 2009.
