
Received: 15 October 2020 | Revised: 20 December 2020 | Accepted: 8 February 2021
DOI: 10.1049/bme2.12031

REVIEW

IET Biometrics

A Survey on Deepfake Video Detection

Peipeng Yu1 | Zhihua Xia2,3 | Jianwei Fei1 | Yujiang Lu1


1 Engineering Research Centre of Digital Forensics, Ministry of Education, School of Computer and Software, Jiangsu Engineering Centre of Network Monitoring, Jiangsu Collaborative Innovation Centre on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science & Technology, Nanjing, Jiangsu Province, China
2 College of Cyber Security, Jinan University, Guangzhou, China
3 Engineering Research Center of Digital Forensics, Nanjing University of Information Science & Technology, Nanjing, China

Abstract
Recently, deepfake videos, generated by deep learning algorithms, have attracted widespread attention. Deepfake technology can be used to perform face manipulation with high realism. So far, a large number of deepfake videos have been circulating on the Internet, most of which target celebrities or politicians. These videos are often used to damage the reputation of celebrities and to guide public opinion, greatly threatening social stability. Although the deepfake algorithm itself is neither good nor evil, this technology has been widely used for negative purposes. To prevent it from threatening human society, a series of research efforts have been launched, including developing detection methods and building large-scale benchmarks. This review aims to present the current research status of deepfake video detection, in particular the generation process, several detection methods and existing benchmarks. It reveals that current detection methods are still insufficient for application in real scenes, and that further research should pay more attention to generalization and robustness.

Correspondence
Zhihua Xia, College of Cyber Security, Jinan University, Guangzhou, Guangdong Province, 510632, China.
Email: [email protected]

Funding information
Collaborative Innovation Centre of Atmospheric Environment and Equipment Technology (CICAEET) fund, China; Priority Academic Programme Development of Jiangsu Higher Education Institutions; ‘333’ project of Jiangsu Province; Qinglan Project of Jiangsu Province; National Natural Science Foundation of China, Grant/Award Numbers: 61702276, 61772283, U1936118, 61601236, 61602253, 61672294, U1836208; National Key R&D Programme of China, Grant/Award Number: 2018YFB1003205; BK21+ programme from the Ministry of Education of Korea; Jiangsu Basic Research Programs-Natural Science Foundation, Grant/Award Number: BK20181407; Six peak talent project of Jiangsu Province, Grant/Award Number: R2016L13

1 | INTRODUCTION

The problem of face-manipulated videos has received widespread attention in the past two years, especially after the advent of deepfake technology, which manipulates images and videos with deep learning tools. A deepfake algorithm can replace faces in a target video with faces from a source video using autoencoders or generative adversarial networks. With this technology, face-manipulated videos are exceedingly simple to generate, provided one can access large amounts of data.

While deepfake technology could be used for positive purposes, such as film-making and virtual reality, it is still heavily applied for malicious uses [1–4]. As shown in Figure 1, a huge number of fake videos have been distributed on the Internet, most of which target politicians and celebrities. The first deepfake content was a celebrity pornographic video created by a Reddit user named deepfakes in 2017, which suggests that deepfake technology was bound to be used maliciously from its creation. Soon after, FakeApp, FaceSwap and other deepfake-based applications appeared in quick succession. In June 2019, a smart undressing app named DeepNude even appeared, causing panic around the world. Beyond damaging personal privacy, videos generated by these apps are increasingly used to interfere in political campaigns and public opinion. The detection of deepfake content has become a pressing issue for individuals, businesses and governments around the world.

With the increasing interest in deepfake technology, more and more related research has been under way. The past two years have witnessed significant progress in developing new detection methods. To begin with, the number of video datasets built for deepfake detection tasks is growing: from small datasets (such as DeepFake-TIMIT [5] and UADFV [6]) in the early stage to large-scale datasets (such as FaceForensics++ [7], Celeb-DF [8], DFDC [9] and DeeperForensics [10]), the number of datasets that can be used for training has increased. Furthermore, several research institutions are becoming aware of the dangers of deepfake videos and are trying to promote related research. Recently, Amazon, Facebook and Microsoft joined forces to host the Deepfake Detection Challenge (DFDC) to build innovative technologies for detecting deepfake videos. Also, SenseTime held the DeeperForensics Challenge 2020 to solicit new ideas to advance the state of the art in real-world face forgery detection. Owing to these efforts, several effective detection approaches have been proposed, which have demonstrated excellent performance in forgery detection tasks.

Though several advances have been achieved, many critical issues for existing deepfake detection methods still need to be solved. With the continuous evolution of deepfake methods, generated videos become more and more realistic. In this case, traditional methods are probably not suitable for detecting manipulated videos generated by new deepfake algorithms [11]. It is therefore important to analyse and forecast the development of deepfake-related research and to improve the corresponding detection approaches. In this review, we focus on existing detection schemes designed for deepfake videos, attempting to promote the development of deepfake video detection.

This article is organized as follows. In Section 2, we first introduce deepfake video generation algorithms proposed in recent years. Then, different types of detection approaches are described in Section 3. A list of datasets used in recent studies is presented in Section 4. After that, a discussion of the state of deepfake video detection and its perspectives is carried out in Section 5. Finally, we conclude in Section 6.

F I G U R E 1 Video frames generated by deepfake algorithms. The first line shows the original video frames and the second line shows the corresponding video frames generated by deepfake methods

This is an open access article under the terms of the Creative Commons Attribution-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited and no modifications or adaptations are made.
© 2021 The Authors. IET Biometrics published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
IET Biometrics. 2021;10:607–624. wileyonlinelibrary.com/journal/bme2

2 | DEEPFAKE VIDEO GENERATION

Since the first release of deepfake videos, new manipulation algorithms have been proposed in quick succession, most of which are based on generative networks. With these methods, deepfake algorithms can be used to create fake content that infringes on personal privacy, with a hugely destructive effect on society. This section will review the development of deepfake algorithms and then describe two types of deepfake algorithms.

2.1 | Development of deepfake technologies

Face manipulation is not a new technology. The earliest attempt at face manipulation in the literature can be found in the iconic 1865 portrait of US President Abraham Lincoln. With the development of computer graphics technology, face manipulation in digital images has become easily achievable [12–14]. Recent progress in the field of deep learning has fundamentally advanced the development of face manipulation technology. According to the different goals of face manipulation algorithms, existing deepfake algorithms can be divided into two categories: face swapping and face reenactment.

2.1.1 | Face swapping

Face-swapping videos, which swap person identities between two videos, have attracted attention in recent years. Related research has been conducted since 2017. In the study of Korshunova et al. [15], convolutional neural networks (CNNs) were trained to capture the appearance of a target identity from an unstructured photo collection, enabling the generation of high-quality face-swapped images. However, time continuity is not considered, so this approach cannot be applied to high-quality video generation. In the same year, Olszewski et al. [16] proposed a novel approach to generate videos from a single RGB image and a source video sequence. A deep generative network was used to infer per-frame texture deformations of the target identity using the source textures and the single target texture. Based on this method, the newly rendered face could be composited onto the source video, replacing the original face using the schema of [17]. In December 2017, the first face-swapping video generated by a deepfake approach was posted by a Reddit user, shocking the world. It is generally acknowledged that the inspiration for deepfake algorithms comes from [15], where CNNs were used to generate face-swapped images. After that, a wave of face-swapping video creation swept the world, for both positive and negative purposes. Faceswap-GAN, an improved version of the original deepfake algorithm, was proposed in [18]. To generate more realistic faces, adversarial loss and perceptual loss were added to improve the performance of the autoencoder implemented with VGGFace [19]. Similarly, DeepFaceLab [20], an open-source deepfake generation framework, was designed to provide an imperative and easy-to-use pipeline for people without professional knowledge. More recently, FaceShifter was proposed for occlusion-aware face swapping with high fidelity [21]. Unlike previous face-swapping studies that use only limited information from target images to synthesise faces, FaceShifter generates high-fidelity swapped faces by performing a comprehensive integration of face attributes. Specifically, AEI-Net and HEAR-Net were leveraged to integrate face information and recover an

anomaly region, respectively. Experiments show its superior performance compared with existing face-swapping algorithms. Videos generated by recent deepfake approaches have become extremely realistic, hardly distinguishable by human eyes.

2.1.2 | Face reenactment

Different from face-swapping technologies, face reenactment algorithms attempt to control people's expressions in videos, which means that attackers can generate videos manipulating someone to do something that never happened. The first face reenactment algorithm dates back to 2006, when Vlasic et al. [22] proposed to perform facial reenactment based on a face template modified under different expression parameters. Most of the subsequent work is based on such schemes, where a parametric model is leveraged to adjust facial images. These methods can generate face images with high realism, but the obtained results often lack temporal coherence. In recent years, research on face reenactment has been further developed as computing power has increased. To perform monocular facial reenactment in real time, Thies et al. [23] proposed Face2Face. In this study, a new global non-rigid model-based bundling approach was applied to reconstruct the facial features of the target and source actors. At the same time, a subspace deformation transfer technique was designed to perform expression transfer between the source and target actors. In addition, this study proposed a novel method for the synthesis of mouth regions, where the best matching image is retrieved from the target sequence. Compared to previous studies, Face2Face achieved quite remarkable performance. However, it cannot guarantee consistent head movements, as only the migration of expressions is taken into account. Also, the synthesis of the mouth region is not satisfying, with coarse details of the mouth that are easily noticed by human eyes. With the development of deep learning techniques, these issues are gradually being noticed and addressed. Face videos synthesised by early face reenactment algorithms have the defect of being inconsistent with voices. The work of Suwajanakorn et al. [24] remedied this defect to a certain extent. They aimed to learn a sequence mapping from audio to video, in order to manipulate actors to speak the same sentences as the voice content. Features were extracted from the voice sequence as the input of a recurrent neural network (RNN), which outputs a sparse mouth shape corresponding to each frame of the video output. The textures of the mouth are further synthesized and merged into the original video. A further improvement was achieved by Fried et al. [25], who performed talking-head video editing and changed speech words using a designed neural face rendering method. To perform face reenactment with better performance, Kim et al. [26] proposed a new method for the photorealistic reanimation of portrait videos. The proposed generative neural network with a novel space-time architecture is used to transform coarse face model renderings into full photorealistic portrait video output. The major contribution of this study is a new spatiotemporal encoding used as conditional input for video synthesis, resulting in synthesised videos with a high degree of spatiotemporal continuity. Compared to Face2Face, this work can migrate not only facial expressions but also head pose, gaze direction and blinking movements, compensating for the inaccurate head pose of the Face2Face algorithm. Besides this study, Thies et al. [27] also made further optimisations to address problems in Face2Face: NeuralTextures incorporates neural networks for texture extraction on top of Face2Face, compensating for Face2Face's blurred texture in the mouth region.

2.2 | General process of deepfake video generation

In this part, we briefly describe the generation process of the two types of deepfake videos.

2.2.1 | Face swapping

To generate a face-swapping video, all frames of the target video have to be processed using a generative method. Figure 2 shows the general generation process of face-swapping videos. Obviously, the deepfake algorithm, which implements face swapping while preserving the source expressions, is the core part of video generation. The deepfake algorithms used in face swapping are mostly based on the autoencoder, which is widely used for data reconstruction tasks. An autoencoder is composed of two components: an encoder and a decoder. Latent features are first extracted from the image by the encoder, and then input to the decoder to reconstruct the original image. In the deepfake algorithm, two autoencoders are trained to swap faces between source video frames and target video frames. As shown in Figure 3, during the training process, two encoders with the same weights are trained to extract common features in the source and target faces. Then, the extracted features are input to two decoders to reconstruct the respective faces. It is worth noting that decoder A is trained only with faces of A while decoder B is trained only with faces of B. When the training process is complete, a latent face generated from face A is passed to decoder B, and decoder B tries to reconstruct face B from the features of face A. If the autoencoder is trained well, the latent space will represent facial expressions. In other words, the face generated by decoder B will have the same expression as face A.

2.2.2 | Face reenactment

The face reenactment task aims to perform the migration of facial expressions. To better demonstrate this kind of scheme, we directly use the scheme in [26] as an example. Figure 4 shows the general process of performing face reenactment. First, a low-dimensional parameter representation of the source and target videos is obtained using a monocular face reconstruction method. Then, head pose and expression could be transferred to the parameter

F I G U R E 2 The generating process of face-swapping video frames. The face area is first detected in each video frame. Then, facial landmarks are extracted to perform face alignment. After that, the deepfake algorithm (autoencoder or GAN) is applied to generate a synthetic face from the face-aligned image. To reduce artefacts caused by blending, the landmarks of the left and right eyebrows and the bottom of the mouth are used to generate a specific mask, so that after blending the synthetic face into the original image, only the content inside the mask is retained. Finally, to make the generated image more realistic, a postprocessing operation is applied: Gaussian blur is applied to the boundary of the mask, while a colour correction algorithm ensures the consistency of the synthetic face and the background image.
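The core generator in this pipeline, the two-autoencoder scheme of Section 2.2.1 (one shared encoder plus one decoder per identity), can be sketched in a few lines. The following toy model is a linear autoencoder trained on random vectors standing in for aligned face crops; the dimensions, learning rate and step count are illustrative assumptions, not the configuration of any real deepfake tool.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 16  # toy "face" vector size and latent size (illustrative)

# One encoder shared by both identities, one decoder per identity.
W_enc = rng.normal(0.0, 0.1, (H, D))
W_dec_a = rng.normal(0.0, 0.1, (D, H))
W_dec_b = rng.normal(0.0, 0.1, (D, H))

faces_a = rng.normal(size=(200, D))  # stand-ins for aligned crops of identity A
faces_b = rng.normal(size=(200, D))  # stand-ins for aligned crops of identity B

def recon_loss(X, W_dec):
    """Mean squared reconstruction error through the shared encoder."""
    Z = X @ W_enc.T
    return float(np.mean((Z @ W_dec.T - X) ** 2))

loss_before = recon_loss(faces_a, W_dec_a)

lr = 0.01
for _ in range(300):
    for X, W_dec in ((faces_a, W_dec_a), (faces_b, W_dec_b)):
        Z = X @ W_enc.T            # shared encoder
        err = Z @ W_dec.T - X      # reconstruction error for this identity
        W_dec -= lr * (err.T @ Z) / len(X)            # this identity's decoder
        W_enc -= lr * ((err @ W_dec).T @ X) / len(X)  # the shared encoder

loss_after = recon_loss(faces_a, W_dec_a)

# The swap itself: encode a frame of A, decode it with B's decoder.
swapped = faces_a[:1] @ W_enc.T @ W_dec_b.T
```

Because the encoder is shared while each decoder sees only one identity, routing A's latent code through decoder B is exactly the swap step described in the text.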

F I G U R E 4 Generating process of face reenactment videos [26]. First, monocular face reconstruction is performed on the source face and the target face to obtain their respective face parameters. After that, the parameters are modified by preserving the parameters of illumination and identity while changing the parameters of pose, expression and eye gaze. Synthetic images are then generated using the modified parameters. Finally, a rendering-to-video translation network is applied to generate face reenactment videos.
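The parameter-modification step in this caption amounts to a selective merge of two parameter sets. A minimal sketch follows; the group names and vector sizes are hypothetical placeholders, not the actual parameterization used in [26].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-frame parameters from monocular face reconstruction;
# the names and sizes below are illustrative assumptions.
def reconstruct_params(rng):
    return {
        "identity": rng.normal(size=80),      # preserved from the target
        "illumination": rng.normal(size=27),  # preserved from the target
        "pose": rng.normal(size=6),           # transferred from the source
        "expression": rng.normal(size=64),    # transferred from the source
        "eye_gaze": rng.normal(size=2),       # transferred from the source
    }

source = reconstruct_params(rng)
target = reconstruct_params(rng)

TRANSFERRED = ("pose", "expression", "eye_gaze")

def transfer(source, target):
    """Keep the target's identity and illumination; take motion from the source."""
    modified = dict(target)
    for key in TRANSFERRED:
        modified[key] = source[key]
    return modified

modified = transfer(source, target)
```

The modified parameter set is then what drives the synthetic rendering of the target actor.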

space. To perform face reenactment, the scene illumination and identity parameters are preserved while the head pose, expression and eye gaze parameters are changed. After that, synthetic images of the target actor are regenerated based on the modified parameters. These images then serve as the conditional input of the rendering-to-video conversion network, which is trained to convert the synthesized input into realistic output. To obtain a complete video with better time consistency, the conditioning space-time volumes are fed into the network in a sliding-window fashion. In this way, the face reenactment video can be obtained.

3 | DEEPFAKE VIDEO DETECTION

Deepfake videos are increasingly harmful to personal privacy and social security. Various methods have been proposed to detect manipulated videos. Early attempts mainly focused on inconsistent features caused by the face synthesis process, while current detection methods mostly target more fundamental features. As shown in Table 1, these methods fall into five categories based on the features they use. To begin with, detection based on general neural networks is commonly used in the literature, where the deepfake detection task is treated as a regular classification task. Temporal consistency features are also exploited to detect discontinuities between adjacent frames of fake videos. To find distinguishable features, visual artefacts generated in the blending process are exploited in detection tasks. Recently proposed approaches focus on more fundamental features, where camera-fingerprint and biological-signal-based schemes show great potential. In the following sections, we review the detection methods mentioned above.

3.1 | General-network-based methods

Recent advances in image classification have been applied to improve the detection of deepfake videos. In this type of method, face
20474946, 2021, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/bme2.12031, Wiley Online Library on [21/07/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
YU ET AL.
- 611

images extracted from the detected video are used to train the detection network. Then, the trained network is applied to make predictions for all frames of the video, and the final prediction is obtained by an averaging or voting strategy. Consequently, the detection accuracy is highly dependent on the neural network, without the need to exploit specific distinguishable features. In this section, we divide existing network-based methods into two types: transfer learning-based methods and detection approaches based on specially designed networks.

3.1.1 | Transfer learning

Network-based detection methods were the earliest methods introduced for detection tasks. Shortly after the appearance of the first deepfake video, some early detection algorithms were proposed, mainly based on existing networks that performed well in image classification tasks. A transfer learning strategy can easily be found in these early studies. Combining steganalysis features and deep learning features, Zhou et al. [28] put forward a two-stream network for face tampering detection. Likewise, in [7], Rossler et al. evaluated XceptionNet [29] on the FaceForensics++ dataset, outperforming all other networks in detecting fakes. During DFDC, similar detection methods were used. In [30], two existing models were tested to provide a performance baseline: a small DNN (composed of six convolutional layers and a fully connected layer) and an existing XceptionNet. Early results showed that the best method (XceptionNet) provides 93.0% precision. Bonettini et al. [31] studied ensembles of different trained CNN models, showing that an ensemble of CNNs can achieve promising results in deepfake detection. However, such network-based algorithms are prone to overfitting [32], so researchers attempted to exploit intrinsic differences between real and fake videos through preprocessing. Some preprocessing methods, such as optical flow calculation [33], have been proved useful for exploiting interframe dissimilarities in network-based methods.
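The frame-to-video aggregation step mentioned at the start of Section 3.1 (averaging or voting over per-frame scores) can be sketched directly. The frame scores below are hypothetical classifier outputs, not measurements from any cited system.

```python
import numpy as np

def video_score_mean(frame_scores):
    """Video-level score as the average of per-frame fake probabilities."""
    return float(np.mean(frame_scores))

def video_label_vote(frame_scores, threshold=0.5):
    """Video-level label by majority vote over thresholded frame predictions."""
    votes = np.asarray(frame_scores) > threshold
    return bool(votes.sum() * 2 > votes.size)

# Hypothetical per-frame outputs of a frame-level classifier.
frame_scores = [0.9, 0.8, 0.4, 0.7, 0.6]
```

Averaging retains the classifier's confidence, while voting is more robust to a few badly scored frames; which behaves better depends on the frame-level model.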

F I G U R E 3 Autoencoders used for face swapping. The top of the figure shows the training process of the two autoencoders. When generating deepfake faces, the decoders are swapped, as the bottom of the figure shows.

3.1.2 | Specially designed networks

With the advent of large-scale datasets and the development of detection algorithms, more attention has been drawn to improving the generalization of detection algorithms. Nguyen et al. [34] introduced a capsule network to improve the performance of detection networks. As illustrated in Figure 5, face images are first fed into the pretrained VGG-19 network [35]. The extracted features are then input into the proposed capsule network, which includes several primary capsules and two output capsules. Agreement between the features

TABLE 1 Classification of existing detection methods

General-network-based methods: Detection is regarded as a frame-level classification task carried out by CNNs.

Temporal-consistency-based methods: Deepfake videos exhibit inconsistencies between adjacent frames due to defects of the forgery algorithm; RNNs are applied to detect such inconsistencies.

Visual-artefacts-based methods: The blending operation in the generation process causes intrinsic image discrepancies at the blending boundaries; CNN-based methods are used to identify these artefacts.

Camera-fingerprints-based methods: Owing to their specific imaging processes, devices leave distinct traces in the captured images, and the face and background of a manipulated frame come from different devices; detection can be completed by using these traces.

Biological-signals-based methods: GANs struggle to model the hidden biological signals of faces, making it difficult to synthesize human faces with plausible behaviour; based on this observation, biological signals are extracted to detect deepfake videos.
20474946, 2021, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/bme2.12031, Wiley Online Library on [21/07/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
612
- YU ET AL.

F I G U R E 5 Capsule-forensics architecture. A pretrained VGG-19 is first used to extract features from face images. The features are further input into the proposed capsule network, which includes several primary capsules and two output capsules. Agreement between the primary capsules and output capsules is calculated by a dynamic routing algorithm. Finally, the output of the capsules is mapped to probabilistic values.
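The "agreement" computed by dynamic routing can be illustrated with a heavily simplified numpy sketch in the style of routing-by-agreement; the shapes, iteration count and two-capsule output (real/fake) are illustrative assumptions, and the actual Capsule-Forensics network [34] differs in detail.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-9):
    """Capsule non-linearity: shrink each vector's length into [0, 1)."""
    sq = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + eps)

def route(u_hat, iterations=3):
    """Routing by agreement between primary and output capsules.

    u_hat: (n_primary, n_output, dim) prediction vectors.
    Returns output capsule vectors of shape (n_output, dim).
    """
    n_primary, n_output, _ = u_hat.shape
    b = np.zeros((n_primary, n_output))  # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per output capsule
        v = squash(s)                           # (n_output, dim)
        b += (u_hat * v[None]).sum(axis=-1)     # agreement updates the logits
    return v

rng = np.random.default_rng(2)
u_hat = rng.normal(size=(8, 2, 4))  # 8 primary capsules, 2 outputs (real/fake)
v = route(u_hat)
lengths = np.linalg.norm(v, axis=-1)  # capsule length acts like a probability
```

Primary capsules whose predictions agree with an output capsule get larger coupling coefficients, so consistent evidence is routed to the same class.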

extracted by the primary capsules is dynamically calculated by a dynamic routing algorithm and the results are finally routed to the appropriate output capsule. Visualization of the extracted latent features indicated that the combination of capsule networks and the dynamic routing algorithm is effective for detecting manipulations. However, the capsule network performed poorly when encountering unknown deepfake videos [8], proving that capsule networks still need further improvement to detect high-fidelity videos. To explore the mesoscopic properties of images, Afchar et al. [36] also proposed a CNN, namely MesoInception-4, consisting of a variant of the inception modules introduced in [37]. Their proposed approach achieved 98.4% accuracy on a private database. Moreover, this approach has also been tested on unseen datasets in recent studies [7, 8, 30, 38], proving to be a robust approach for deepfake detection tasks. Although these methods achieved excellent results on various datasets, the reasons behind their good performance are still unknown. In fact, deeper networks tend to achieve better results than shallower networks in various areas, and the reason for the good performance may simply be that the designed networks are deep enough. Compared with traditional learning-based methods, Wang et al. [39] pay more attention to neuron coverage and interactions rather than the design of specific network structures. The FakeSpotter they proposed uses hierarchical neuron behaviour as a feature, showing high robustness against four common perturbation attacks. This research provided a new insight for detecting fakes.

3.1.3 | Summary

The disadvantage of network-based methods is that they tend to overfit on specific datasets. In this type of method, although the adjustment and optimization of the model structure often affect the abstraction degree of features, this still lacks sufficient relevance for the task of deepfake detection. Therefore, the direction of such work is gradually changing. On the one hand, by adding additional components to the model, the model can be constrained to learn heuristic features [40]. In this case, the importance of the model architecture is greatly reduced while the additional components play a greater role. This is exactly the difference between deepfake detection tasks and general computer vision tasks. On the other hand, more and more network-based methods have begun to introduce multitask learning, that is, not only to classify real and fake faces, but also to generate pixel-level tampering masks. In [41], using a semi-supervised learning strategy, Nguyen et al. designed a multitask learning framework to simultaneously detect manipulated content and locate the manipulated regions. In such schemes, however, supervised multitask learning is only a complementary implementation, which does not necessarily improve the final detection performance. Further improvement was achieved by using attention mechanisms. Dang et al. [42] utilized an attention mechanism to process feature maps for the classification. The proposed approach showed excellent performance in both deepfake detection and forgery localization, achieving state-of-the-art performance compared to previous solutions and demonstrating the importance of attention mechanisms. Likewise, in [43], Tarasiou et al. designed a lightweight architecture for extracting local image features and a multitask training scheme for forgery localization. In this way, the forgery localization process provides evidence for the judgement while ensuring detection accuracy, promoting the practical use of detection algorithms. It is worth mentioning that some basic directions in computer vision, such as anomaly detection, semantic segmentation and metric learning, are making more and more important contributions to this field.
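The multitask idea of classifying real/fake while also predicting a pixel-level tampering mask reduces, structurally, to a shared trunk with two output heads. A minimal forward-pass sketch (all layer sizes, the 16x16 mask resolution and the pooled-feature input are illustrative assumptions, not the architecture of [41] or [42]):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: pooled CNN features -> shared trunk -> two heads.
x = rng.normal(size=(4, 256))              # a batch of hypothetical face features
W_shared = rng.normal(0.0, 0.1, (256, 128))
W_cls = rng.normal(0.0, 0.1, (128, 1))     # head 1: real/fake classification
W_mask = rng.normal(0.0, 0.1, (128, 256))  # head 2: 16x16 tampering mask

h = np.maximum(x @ W_shared, 0.0)          # shared representation (ReLU)
p_fake = sigmoid(h @ W_cls)                # (4, 1) probability of being fake
mask = sigmoid(h @ W_mask).reshape(4, 16, 16)  # per-pixel tampering map
```

Training would sum a classification loss on `p_fake` and a segmentation loss on `mask`; sharing the trunk is what lets localization evidence support the detection decision.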

3.2 | Temporal-consistency-based methods

Time continuity is a unique feature of videos. Unlike images, a video is a sequence composed of multiple frames, where adjacent frames have a strong correlation and continuity. When video frames are manipulated, the correlation between adjacent frames will be destroyed due to defects of deepfake algorithms, specifically expressed in the shift of face position and video flickering. According to this phenomenon, researchers have proposed several detection approaches. We will first introduce the original CNN-RNN architecture and then demonstrate its improvement over the years.

3.2.1 | CNN-RNN

Considering the time continuity in videos, Guera et al. [44] first proposed to use an RNN to detect deepfake videos. In their work, the autoencoder was found to be completely unaware of previously generated faces because faces were generated frame-by-frame. This lack of temporal awareness results in multiple anomalies, which are crucial evidence for deepfake detection. To check the continuity between adjacent frames, an end-to-end trainable recurrent deepfake video detection system was proposed. As Figure 6 shows, the proposed system is mainly composed of a convolutional long short-term memory (LSTM) structure for processing frame sequences. Two essential components are used in a convolutional LSTM structure, where the CNN is used for frame feature extraction and the LSTM is used for temporal sequence analysis. Specifically, a pretrained InceptionV3 [45] is adapted to output a deep representation for each frame. The 2048-dimensional feature vectors extracted by the last pooling layers are applied as the sequential LSTM input, characterizing the continuity between image sequences. Finally, a fully connected layer and a softmax layer are added to compute forgery probabilities of the tested frame sequence. The experiments on a self-made dataset showed that the algorithm can accurately make predictions even when the length of a video is less than 2 s. Although this research did not show its superiority since there were no large-scale datasets at the time, several articles after were inspired by this article, which promoted the development of detection methods based on temporal consistency.

3.2.2 | Improvement

After the time-based detection method showed its effectiveness, many related studies were proposed. In [46], Sabir et al. utilized the temporal information present in the video stream to detect deepfake videos. Similar to [44], an end-to-end model is built, where the CNN is also involved in the follow-up training. Meanwhile, face alignment based on facial landmarks and a spatial transformer network is applied to further improve the performance of the algorithm. Even though such solutions guarantee high accuracy on high-quality videos, they do not perform well on low-quality videos when the continuity between adjacent frames is disrupted by video compression operations. To solve this problem, a CNN-RNN framework based on automatic weighting mechanisms was proposed by Montserrat et al. [47]. Considering that the face quality of some frames is not high, an automatic weighting mechanism was proposed to emphasize the most reliable regions when making a video-level prediction. Experiments showed that combining CNN and RNN achieves high detection accuracies on the DFDC dataset. Except for the robustness of algorithms, generalization ability is also essential for forgery detection tasks. Zhao et al. [48] used optical flow to capture the obvious differences of facial expressions between adjacent frames. However, these studies did not show strong generalization or robustness. To solve this problem, Wu et al. [49] proposed a novel manipulation detection framework, named SSTNet, exploiting both low-level artefacts and temporal discrepancies. Another study proposed by Masi et al. [50] obtained good generalization on multiple datasets. In their research, a two-branch recurrent network is applied to propagate the original information while suppressing the face content. Multiband frequencies are amplified using a Laplacian of Gaussian as a bottleneck layer. Inspired by [51], a new loss function is designed for better isolating manipulated faces. The experimental results on several datasets show the excellent generalization performance of the detection algorithm. Nevertheless, time-based detection schemes still have much

F I G U R E 6 Overview of detection method based on CNN-LSTM. The backbone CNN model is first used to extract features of each face image in successive frames. Output features are then merged and used as input to the LSTM network, which processes the time-series features to obtain the probability value of whether the video clip is true or false
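To make the pipeline in Figure 6 concrete, here is a minimal numpy sketch of the CNN-LSTM idea. It is an illustration only: a fixed random projection stands in for the pretrained CNN backbone, and the toy dimensions replace the 2048-dimensional features used in [44].

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, FEAT_DIM, HIDDEN = 256, 128, 32  # toy sizes; [44] uses 2048-d features

# Stand-in for the pretrained CNN backbone: one frame -> one feature vector.
W_cnn = rng.standard_normal((FEAT_DIM, FRAME_DIM)) / np.sqrt(FRAME_DIM)

def cnn_features(frame):
    return np.tanh(W_cnn @ frame)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# LSTM parameters, stacked for the four gates (input, forget, output, candidate).
Wx = 0.1 * rng.standard_normal((4 * HIDDEN, FEAT_DIM))
Wh = 0.1 * rng.standard_normal((4 * HIDDEN, HIDDEN))
b = np.zeros(4 * HIDDEN)

def lstm_last_hidden(feature_seq):
    # Standard LSTM recurrence over the per-frame feature vectors.
    h, c = np.zeros(HIDDEN), np.zeros(HIDDEN)
    for x in feature_seq:
        z = Wx @ x + Wh @ h + b
        i, f = sigmoid(z[:HIDDEN]), sigmoid(z[HIDDEN:2 * HIDDEN])
        o, g = sigmoid(z[2 * HIDDEN:3 * HIDDEN]), np.tanh(z[3 * HIDDEN:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

W_fc = 0.1 * rng.standard_normal((2, HIDDEN))  # final {real, fake} logits

def video_forgery_probs(frames):
    feats = [cnn_features(f) for f in frames]
    logits = W_fc @ lstm_last_hidden(feats)
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over {real, fake}

probs = video_forgery_probs(rng.standard_normal((20, FRAME_DIM)))  # a 20-frame clip
```

In a real system the random projection would be replaced by the pretrained backbone, and the LSTM and classifier would be trained end-to-end on labelled clips.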
room for improvement in generalization performance [47]. Screen switching and unknown data are still problems that need to be solved for time-based detection approaches.

3.2.3 | Summary

Compared with general-network-based approaches, temporal-consistency-based detection methods consider the continuity between adjacent frames, thereby improving the detection performance. However, many models tend to destroy the spatial structure of the original frames when extracting temporal features, while the motivation for designing such methods is precisely to extract the inconsistency of spatial features in the temporal domain. CNN-RNN architectures pool the intraframe features into vectors [44, 46] and thus cannot capture spatial features while detecting temporal consistency. Although structures such as 3DCNNs can avoid destroying spatial features, the excessive parameters make it easier to overfit on a specific dataset.

3.3 | Visual-artefacts-based methods

In most existing deepfake methods, the generated face has to be blended into an existing background image, causing intrinsic image discrepancies on the blending boundaries. As shown in Figure 7, faces and background images come from different source images, giving rise to abnormal behaviour of the synthetic image, such as boundary anomalies and inconsistent brightness. These visual artefacts make deepfake videos fundamentally detectable. In this section, three main visual artefacts will be introduced.

3.3.1 | Face warping artefacts

Based on the observation of inconsistency between faces and background, a new deep learning-based method was proposed by Li and Lyu [38]. Face warping artefacts generated by the blending process were used to detect fake videos. As shown in Figure 2, synthetic faces have undergone an affine transform to match the poses of the target faces. In this case, there would be an obvious colour difference and resolution inconsistency between the internal face and background areas. Since the purpose here is to detect inconsistency between the face region and background area, the negative samples are generated by a simplified process, where the face undergoes an affine warp back to the source image directly after being smoothed. To generate more realistic negative examples, a convex polygon shape is used based on the face landmarks of the eyebrows and the bottom of the mouth. Also, colour information is randomly changed to enlarge the training diversity. After that, four CNN models (VGG16, ResNet50, ResNet101 and ResNet152) were trained in this study. Evaluated on several datasets of available deepfake videos, this method demonstrated effectiveness in practice. Compared with previous methods, this study focuses on the visual artefacts caused by the affine transformation. At the same time, since no additional negative samples participate, this algorithm does not need to fit the sample distribution of deepfake videos, greatly increasing the generalization of the algorithm [8].

F I G U R E 7 Video frames with visual artefacts. The deepfake-generated image shows colour difference and resolution inconsistency because of the lack of postprocessing

3.3.2 | Blending boundary

Further improvements were achieved in [32]. Li et al. proposed a novel image representation, namely the face X-ray, which was exploited to observe whether the input image can be decomposed into a foreground face and a background. Specifically, the blending boundary between the foreground manipulated face and the background was defined as the face X-ray. Compared with Li and Lyu [38], this study targeted the blending boundary that is universally introduced in image blending, thus showing great performance when tested on various datasets. Besides proposing the face X-ray, this research particularly designs the generation process of negative samples using positive samples. Thus, the algorithm does not need to consider face manipulation in the deepfake video, but only focuses on the difference between background and foreground faces, thereby enhancing the generalization of the proposed algorithm. However, due to excessive focus on the blending boundary, this scheme is not resistant to fully synthesized images.

3.3.3 | Head pose inconsistency

Another interesting study comes from [52]. Observing that deepfake videos are created by splicing a synthesized face into the original image, Yang et al. proposed a new detection method based on 3D head poses. They argued that current generative neural networks cannot guarantee landmark matching, causing the 3D landmarks estimated from the face-manipulated area to differ from the 3D landmarks estimated from the whole face area. In this method, the rotation matrix estimated using facial landmarks from the whole face and the one estimated using only landmarks in the central region are compared to analyse the similarity between the two pose vectors. Although the experiment confirmed the difference between real and fake pose vectors, this study was built on specific features existing in a self-made dataset which was generated by a relatively basic version of the deepfake algorithm.
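The pose comparison just described can be sketched as follows. This is a toy illustration: the pose-vector convention (third column of the rotation matrix) and the synthetic rotations are assumptions for demonstration, not the exact procedure of [52].

```python
import numpy as np

def pose_vector(R):
    # Facing direction implied by a head rotation matrix; using the third
    # column as the face-plane normal is an assumed convention here.
    return R[:, 2]

def pose_inconsistency(R_whole, R_central):
    # Cosine distance between the pose from all landmarks and the pose
    # from central-region landmarks only: ~0 for real, larger for spliced faces.
    a, b = pose_vector(R_whole), pose_vector(R_central)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos

def rot_y(angle):
    # Rotation about the vertical axis by `angle` radians.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

consistent = pose_inconsistency(rot_y(0.3), rot_y(0.3))  # same pose -> ~0
spliced = pose_inconsistency(rot_y(0.3), rot_y(0.6))     # mismatch -> > 0
```

A detector would threshold this inconsistency score (or feed it, together with other pose statistics, to a classifier) to flag spliced faces.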
Thus, this method is not effective for detecting new versions of deepfake videos as deepfake algorithms evolve [8].

3.3.4 | Summary

Visual-artefacts-based methods often obtain better generalization performance because they target more general artefacts existing in most deepfake content. However, these algorithms can only detect specific forgery traces due to paying more attention to specific artefacts. With the progress of deepfake algorithms, these artefacts are gradually disappearing. Nevertheless, visual-artefacts-based approaches obtain better performance on the latest versions of deepfake video datasets. Such schemes still have high potential in deepfake detection tasks. Research should be established to exploit more intrinsic features.

3.4 | Camera-fingerprints-based methods

Camera fingerprints are a kind of noise with very weak energy, which plays an important role in forensic fields, especially source identification tasks. In general, camera-fingerprints-based approaches have gone through three stages: photo response nonuniformity (PRNU) patterns, noiseprint and the recent video noise pattern. We will introduce this development in the following content.

3.4.1 | PRNU patterns

Detection based on camera fingerprints originated from image forensics. Observing that devices leave different traces in the captured images, Lukas et al. [53] proposed PRNU noise, which can be used in camera identification tasks. PRNU arises due to the different sensitivities of the pixels to light, caused by the inhomogeneity of the silicon wafer and imperfections in the sensor manufacturing process. Because of its uniqueness and stability, the PRNU pattern is regarded as a device fingerprint, which can be used to carry out many forensic tasks [54-56]. Based on these findings, Koopman et al. [57] first proposed to use PRNU to detect deepfake videos. PRNU patterns were verified to be effective on a small dataset. However, in [58], the PRNU-based classifier achieves much lower accuracy when tested on GAN-generated datasets. More research should be performed to verify the effectiveness of the PRNU pattern in deepfake detection tasks.

3.4.2 | Noiseprint

In fact, PRNU-based methods can only extract device-related features while suppressing other camera artefacts introduced in the image generation process. Traces generated during the digital image acquisition process are composed of several noises. Inside the camera, the image undergoes operations such as interpolation and gamma correction. Outside the camera, the image could also be compressed or enhanced, which will leave many traces in the final image. Thus, each image has its unique traces, namely noise residuals, which can be used to identify its source camera. Following this direction, Cozzolino et al. introduced a CNN-based camera fingerprint named noiseprint in [59]. To remove scene content and enhance camera-model-related artefacts, a siamese network was trained using images coming from different camera models. In this siamese network, a fully convolutional network, proposed in [60], was first introduced to extract the noise pattern of images. Pairs of images from the same or different camera models were used to train the siamese network. At the end of the training process, the CNN used in the siamese network could be used to extract the corresponding noiseprint from the input image, displaying enhanced camera model artefacts. This work provides new ideas for fingerprint noise extraction tasks, further promoting the development of the image forensics area.

3.4.3 | Video noise pattern

After introducing the concept of noiseprint, Cozzolino extended these findings to the video forensics area [59]. Except for source identification, noiseprint was also adopted for forgery detection and localization. Considering that in a manipulated video the manipulated region is generated differently from the background region and therefore carries different noise, they argued that forgery detection can be performed using the video noiseprint. As shown in Figure 8, noiseprints are extracted frame-by-frame and then averaged to indicate the noise contained in the video. Face and background regions are then split to calculate the similarity. Similar to [59], the spatial co-occurrence matrix of the extracted noiseprint is used to further calculate the Mahalanobis distance between the face region and the reference, which is then used as the manipulation score. The algorithm showed good detection performance on the FaceForensic++ dataset, even though the noise extraction network had not been trained on it. However, since the noiseprints extracted from frames are averaged to represent the video noiseprint, the calculation of the video noiseprint will be interfered with if the video has large motion. In this way, though noiseprint has shown its effectiveness in image manipulation detection, its usage strategy in the video forensics area still needs further improvement.

3.4.4 | Summary

Camera fingerprints have been proved to be effective in deepfake detection tasks. However, accurate estimation of camera fingerprints requires a large number of images captured by different types of cameras. Thus, there would be a decrease in accuracy when detecting images captured by unknown cameras. On the other hand, camera-fingerprint-based methods are not robust to simple image postprocessing such as compression, noise and blur. Since GAN images are generated
F I G U R E 8 Scheme used for deepfake detection. Noiseprints are first extracted from a sufficient number of video frames. Then, the extracted noiseprints are averaged to represent the video noiseprint. Divided by the face detector, the video noiseprint is then split into a face region and a background region. After that, the algorithm extracts features of the background region and calculates the statistical information. Finally, the Mahalanobis distance between the features of the face and background area is calculated to obtain the final heat map
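The final scoring step in Figure 8 can be sketched as below. This is a toy two-dimensional stand-in for the co-occurrence features, not the authors' implementation: the point being scored and the background statistics are synthetic.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    # Distance of a face-region feature from the background reference
    # distribution N(mean, cov); larger values suggest manipulation.
    d = x - mean
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

rng = np.random.default_rng(1)
background = rng.standard_normal((500, 2))  # toy background-patch features
mu = background.mean(axis=0)
cov = np.cov(background, rowvar=False)

score_match = mahalanobis(np.array([0.0, 0.0]), mu, cov)  # close to background
score_face = mahalanobis(np.array([8.0, 8.0]), mu, cov)   # far -> likely fake
```

The Mahalanobis distance, unlike the Euclidean one, accounts for the covariance of the background features, so correlated feature dimensions do not inflate the score.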

without any image capture process, there is no camera fingerprint in the output image, so that camera-fingerprint-based methods are very suitable for detecting images generated by GANs. However, recent work shows that images can also be generated by simulating camera fingerprints [61], thus deceiving detection methods that rely on camera fingerprints. Recent research also proved that noise patterns can be erased by neural networks [62]. In this way, existing camera-fingerprint-based methods should increase robustness to resist such attacks.

3.5 | Biological-signals-based methods

Detection based on biological signals is an interesting scheme that emerged in recent years. The core observation is that even though GANs are able to generate faces with high realism, the naturally hidden biological signals are still not easily replicated, making it difficult to synthesize human faces with reasonable behaviour [63]. Taking advantage of this abnormal behaviour, several studies have been proposed. In this section, we will introduce two approaches based on biological signals: blinking-frequency-based and heart-rate-based detection approaches.

3.5.1 | Eye blinking

Abnormalities in blink frequency were identified early as discriminable features in deepfake detection tasks [6]. This could be attributed to the fact that deepfake algorithms train models using a large number of face images obtained online. Most of the images show people with their eyes open, so a closed-eye view is difficult to generate in a manipulated video. Based on this finding, a deep neural network model, known as long-term recurrent CNN (LRCN) [64], was introduced to distinguish open and closed eye states. To calculate blink frequency, surrounding rectangular regions of the eyes are cropped into a new sequence of input frames after face alignment. Then, the cropped sequences are passed into the LRCN model to capture temporal dependencies. As shown in Figure 9, the feature extraction module is first used to extract discriminative features from the input eye region by a CNN based on the VGG16 framework. The output of feature extraction is then fed into sequence learning, implemented with an RNN model. In the final state prediction stage, a fully connected layer is added to calculate the probability of eye open and closed states, which is then used to calculate blink frequency. This method is evaluated over self-made datasets, showing promising performance in detecting videos generated with deepfake methods. However, forgery algorithms can easily generate videos with a reasonable blinking frequency as long as enough closed-eye images are added to the training set. Due to excessive attention to abnormal blinking frequency, this method is no longer applicable for current deepfake detection tasks after the problem of blink frequency is solved.

3.5.2 | Heart rate

Except for blink frequency, heart rate was also found to differ between real and manipulated videos. Previous literature had proved that colour changes of the skin in a video can be applied to infer heart rate [65-67]. Based on these findings, a detector based on biological signals named FakeCatcher was designed to detect deepfake videos [63]. Specifically, remote photoplethysmography (rPPG or iPPG) was used to extract heart rate signals according to subtle changes of colour and motion in RGB videos [55, 68]. Experiments validated that the spatial coherence and temporal consistency of such signals are not well preserved in deepfake videos. Following statistical analysis, a robust synthetic video classifier was developed based on physiological changes. Results verified that FakeCatcher has a high detection accuracy for deepfake videos, even for low-resolution or low-quality videos. Similarly, Fernandes et al. [69] proposed to use neural ordinary differential equations [70] to predict the heart rate of deepfake videos. A large difference was shown between original videos and deepfake videos when heart rate prediction was performed separately. However, this work only performed heart rate prediction of deepfake videos while lacking further experiments on deepfake detection. A large number of works have been carried out based on biological signals. A recently proposed approach, named DeepRhythm [71], utilized a dual-spatial-temporal attention mechanism to monitor heartbeat rhythms, proving to generalize well over different datasets. Likewise, DeepFakesON-Phys [72] predicts the heart rate through changes in skin colour, thereby considering the detection of deepfake videos.
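A minimal sketch of the rPPG idea follows: recovering a pulse frequency from the mean green-channel trace of the face region. This is an illustration only, far simpler than FakeCatcher's pipeline; the synthetic trace and the plausibility band are assumptions.

```python
import numpy as np

def estimate_heart_rate_bpm(green_means, fps):
    # green_means: mean green-channel intensity of the face region per frame.
    # The dominant frequency of the detrended trace approximates the pulse.
    x = np.asarray(green_means, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)  # plausible pulse band: 42-240 bpm
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_hz

fps, seconds = 30, 10
t = np.arange(fps * seconds) / fps
trace = 100.0 + 0.5 * np.sin(2 * np.pi * 1.2 * t)  # synthetic 1.2 Hz pulse
bpm = estimate_heart_rate_bpm(trace, fps)          # about 72 bpm
```

A detector in the spirit of [63] would then compare the spatial and temporal statistics of such signals across face regions, rather than the raw rate alone.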
F I G U R E 9 Overview of the LRCN method. Eye sequences are first extracted by the preprocessing module, then input to the feature extraction module to generate feature sequences. The sequence learning module is then applied to analyse time-related sequences. Finally, a FC layer is added to make the state prediction, calculating the blinking rate
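Once per-frame closed-eye probabilities are available, the blink rate itself is straightforward to compute. A small sketch (the threshold and the synthetic sequence are illustrative assumptions, not values from [6]):

```python
import numpy as np

def blink_rate_per_minute(p_closed, fps, threshold=0.5):
    # p_closed: per-frame probability that the eye is closed (e.g. LRCN output).
    # A blink is counted at each rising edge of the thresholded signal.
    closed = np.asarray(p_closed) > threshold
    blinks = int(np.count_nonzero(closed[1:] & ~closed[:-1])) + int(closed[0])
    duration_s = len(closed) / fps
    return blinks / duration_s * 60.0

# 10 s at 30 fps with three short closed-eye bursts.
p = np.zeros(300)
p[[50, 51, 52, 140, 141, 230, 231, 232]] = 0.9
rate = blink_rate_per_minute(p, fps=30)  # three blinks in 10 s -> 18 per minute
```

Comparing this rate against the normal human range (roughly 15-20 blinks per minute at rest) is what makes abnormally low blinking in early deepfakes detectable.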

3.5.3 | Summary

Although biological-signals-based detection approaches have shown good performance on various datasets, the natural flaw of this kind of method is that the detection process cannot be performed in an end-to-end way. Also, the information reflected by biological signals is seriously affected by video quality, so there are natural flaws and a limited application range for biological-signals-based approaches.

4 | DATASETS AND PERFORMANCE EVALUATION

Since current deep learning-based methods are highly dependent on large-scale data, building high-quality datasets is important. As deepfake algorithms evolve, new datasets should be built to develop advanced algorithms to counter new manipulation methods. In this section, we will describe the most commonly used datasets, shown in Table 3, and briefly introduce their characteristics. Detection performance on these datasets will also be introduced.
Due to the lack of publicly available datasets, several self-made datasets were built to verify the effectiveness of proposed algorithms in the early literature. In [73], Deepfake-TIMIT, composed of 620 deepfake videos for 16 pairs of subjects, was proposed to evaluate several baseline face swap detection algorithms. Also, in [6], the UADFV dataset was collected to detect the eye blinking rate in videos. The dataset consists of 49 original videos from YouTube and 49 deepfake videos generated by FakeApp, with a typical resolution of 294 × 500 pixels and an average time of 11.14 s. These self-made datasets greatly promoted the development of deepfake detection algorithms in the early stage. As shown in Table 2, detection algorithms perform well on these datasets. However, fake videos generated in these datasets are often targeted at specific detection algorithms and the quality of these videos is not enough for current detection tasks. The first large-scale dataset used for deepfake detection is FaceForensic++, introduced by Rossler et al. [7]. The dataset contains 1000 original videos and 4000 manipulated videos generated by four different forgery methods. Specifically, these methods contain DeepFake, FaceSwap, Face2Face and NeuralTexture, where the first two are used to swap faces and the latter two are used for expression manipulation. Videos of three different compression levels are provided to develop robust detection methods. Several studies have verified the effectiveness of proposed methods using this dataset. However, the forgery method used to generate negative samples in the dataset is relatively backward compared with current deepfake algorithms, causing a multitude of visual artefacts in the generated forged videos. As shown in Table 4, the detection accuracy of various detection schemes has reached even more than 99% on the FaceForensic++ dataset. Although FaceForensic++ has made great contributions to the development of deepfake detection, it is inconsistent with the current development status of deepfake research, and thus cannot be used to verify the performance of current detection algorithms.
To further simulate realistic scenes, datasets generated by novel deepfake algorithms have been proposed. Google and Jigsaw proposed the deepfake detection dataset [75], a large-scale dataset built for deepfake detection. In this dataset, 3000 deepfake videos are created by 28 actors in various scenes, which are more realistic than FaceForensic++. After commercial companies took part in deepfake detection research, DFDC was held to promote the development of deepfake detection. During the challenge, two datasets were introduced: the DFDC-preview dataset [30] and the DFDC dataset [9]. The DFDC-preview dataset is built by two different deepfake approaches, composed of 1131 original videos and 4113 corresponding deepfake videos. The DFDC dataset is a much larger dataset used for the competition on Kaggle, consisting of over 470 GB of videos (pristine and manipulated). It is worth noticing that in order to promote the practical application of deepfake detection algorithms, DFDC is more random in the data collection, bringing more visual variability. Related research on the DFDC dataset and the top-3 detection
TABLE 2 Detection performance on self‐made datasets

Study Method Dataset Performance FLOPs


Zhou et al. [28] A two‐stream network SwapMe and FaceSwap dataset 0.927 (AUC) >5.73

Guera et al. [44] CNN + LSTM Self‐made dataset 97.1% (ACC) >5.73

Yang et al. [52] 3D head poses UADFV 0.974 (AUC) ‐

Li et al. [6] Eyeblink + LRCN Self‐made dataset 0.99 (AUC) 15.5

Ciftci et al. [63] Biological signals Self‐made deep fakes dataset 91.07% (ACC) ‐

Afchar et al. [36] MesoInception‐4 Meso‐data(frame‐level) 91.70% (ACC) 0.5


Meso‐data(video‐level) 98.4% (ACC)

Nguyen et al. [34] A capsule network Meso‐data(frame‐level) 95.93% (ACC) >7.72


Meso‐data(video‐level) 99.23% (ACC)

Li and Lyu [38] Face warping artefacts + CNN UADFV 0.974 (AUC) 4.12
Deepfake‐TIMIT(LQ) 0.999 (AUC)
Deepfake‐TIMIT(HQ) 0.932 (AUC)

Li et al. [74] Patch & pair CNN Mesonet‐data 0.979 (AUC) 1.82


Deepfake‐TIMIT 1.0 (AUC)

TABLE 3 List of datasets including video manipulations

Dataset Release date Real/fake Source
UADFV [6] 2018.11 49/49 YouTube
Deepfake‐TIMIT [5] 2018.12 ‐/620 YouTube
FaceForensics++ [7] 2019.01 1000/4000 YouTube
Google DFD [75] 2019.09 363/3068 Actors
DFDC‐preview [30] 2019.10 1131/4119 Actors
DFDC [9] 2019.10 23,654/104,500 Actors
Celeb‐DF [8] 2019.11 890/5639 YouTube
DeeperForensics [10] 2020.1 10,000/50,000 Actors

scheme of DFDC are shown in Table 5. It is believed that the DFDC dataset will bring more contributions to the development of deepfake detection tasks.
Although the scale of current deepfake video datasets has been able to meet the needs of detection algorithms, videos in these datasets have obvious visual artefacts, which are not in line with the current status of existing deepfake approaches. To solve this problem, Li et al. [8] introduced the Celeb-DF dataset, generated by an improved deepfake approach. Problems existing in the early version of deepfake videos, such as temporal flickering and low resolution of synthesized faces, are improved in this dataset. The dataset is comprised of 590 real videos and 5639 deepfake videos, satisfying the need for model training. Experimental results reported in the literature (shown in Table 6) prove that Celeb-DF is currently the most challenging dataset, where the detection accuracy of various methods on Celeb-DF is lower than that on other datasets.
Another large-scale benchmark, composed of 50,000 original videos and 10,000 manipulated videos, has been built in [10]. DF-VAE, a new conditional autoencoder, is applied to generate deepfake faces with a higher realism rating. Studies using DeeperForensics demonstrate that the quality of the generated video is significantly better than that of the existing datasets.

5 | DISCUSSION

Deepfake videos came into people's attention in the past two years, posing a serious threat to social security. To this end, researchers have carried out a large amount of research and achieved remarkable advances. Recent detection algorithms achieve almost 100% detection accuracy on the earlier deepfake datasets. However, the accuracy of existing detection algorithms is not ideal on recently built datasets. In the recent DFDC competition, the average accuracy of detection approaches proposed in the entire competition is only 65.18%, proving that current detection approaches are still far from meeting the needs of practical scenes. At the same time, current research tends to use complex network structures to extract abstract features. Although achieving superior detection performance, the increase in network complexity means an increment of calculation costs. We have summarized the floating point operations (FLOPs) of schemes in previous literature to show the relationship between accuracy and network complexity. As shown in Table 4, schemes with higher FLOPs tend to have better detection performance, while better detection performance does not necessarily mean higher FLOPs. This is a trade-off between network complexity and detection effect. We believe that a wise solution should achieve higher-precision detection with lower network complexity. Under such
TABLE 4 Detection performance on FaceForensics++ datasets

Study | Method | Dataset | Performance | FLOPs
Bonettini et al. [31] | Ensemble of CNNs | FaceForensics++ (c23) | 0.9444 (AUC) | 0.24
Nguyen et al. [34] | A capsule network | FaceForensics++ - Face2Face | 93.11% (ACC) | >7.72
Zhao et al. [48] | Optical flow | FaceForensics++ - DeepFake | 98.10% (ACC) | 0.24
Cozzolino et al. [76] | Noiseprint + siamese network | FaceForensics++ | 92.14% (ACC) | -
Rossler et al. [7] | XceptionNet | FaceForensics++ (raw) | 99.26% (ACC) | 8.42
 | | FaceForensics++ (c23) | 95.73% (ACC) |
 | | FaceForensics++ (c40) | 81.00% (ACC) |
Afchar et al. [36] | MesoInception-4 | FaceForensics++ (raw) | 95.23% (ACC) | 0.5
 | | FaceForensics++ (c23) | 83.10% (ACC) |
 | | FaceForensics++ (c40) | 70.47% (ACC) |
Sabir et al. [77] | CNN + GRU + STN | FaceForensics++ - DeepFake | 96.9% (ACC) | 14.4
 | | FaceForensics++ - Face2Face | 94.35% (ACC) |
 | | FaceForensics++ - FaceSwap | 96.3% (ACC) |
Li et al. [32] | Face X-ray + multitask learning | FaceForensics++ - DeepFake | 0.9912 (AUC) | >3.99
 | | FaceForensics++ - FaceSwap | 0.9909 (AUC) |
 | | FaceForensics++ - Face2Face | 0.9931 (AUC) |
 | | FaceForensics++ - NeuralTexture | 0.9927 (AUC) |
Ciftci et al. [63] | Biological signals | FaceForensics++ - DeepFake | 93.75% (ACC) | -
 | | FaceForensics++ - FaceSwap | 96.25% (ACC) |
 | | FaceForensics++ - Face2Face | 95.25% (ACC) |
 | | FaceForensics++ - NeuralTexture | 81.25% (ACC) |
Tarasiou et al. [43] | A lightweight architecture | FaceForensics - DeepFake (c23) | 97.90% (ACC) | -
 | | FaceForensics - Face2Face (c23) | 98.58% (ACC) |
 | | FaceForensics - FaceSwap (c23) | 98.32% (ACC) |
 | | FaceForensics - DeepFake (c40) | 92.40% (ACC) |
 | | FaceForensics - Face2Face (c40) | 87.11% (ACC) |
 | | FaceForensics - FaceSwap (c40) | 91.26% (ACC) |
Wu et al. [49] | SSTNet | FaceForensics++ (c23) | 98.57% (ACC) | >8.42
 | | FaceForensics++ (c40) | 90.11% (ACC) |
Li et al. [74] | Patch & pair CNN | FaceForensics (raw) | 0.996 (AUC) | 1.82
 | | FaceForensics (c23) | 0.983 (AUC) |
 | | FaceForensics (c40) | 0.931 (AUC) |
Masi et al. [50] | Two-branch recurrent network | FaceForensics++ (frames, c23) | 0.987 (AUC) | -
 | | FaceForensics++ (videos, c23) | 0.9912 (AUC) |
 | | FaceForensics++ (frames, c40) | 0.8659 (AUC) |
 | | FaceForensics++ (videos, c40) | 0.911 (AUC) |
circumstances, summarizing previous algorithms and exploring new research directions is required to promote more effective detection algorithms. In this section, we discuss some concerns about current detection methods and envision important directions that should receive more attention.

5.1 | Concerns

In view of current research on face-manipulated video detection, we have summarized the following concerns, which need significant attention in future research.

TABLE 5 Detection performance on DFDC datasets

Study | Method | Performance | FLOPs
Bonettini et al. [31] | Ensemble of CNNs | 0.8813 (AUC) | >0.04
Montserrat et al. [47] | An automatic weighting mechanism | 91.88% (ACC) | >9.9
Tarasiou et al. [43] | A lightweight architecture | 88.76% (ACC) | -
Li et al. [32] | Face X-ray + multitask learning | 0.892 (AUC) | >3.99
Mittal et al. [78] | Emotions behind audio and visual content | 0.892 (AUC) | -
Selim Seferbekov | EfficientNet + task-specific data augmentations | 0.42798 (LogLoss) | 72.35 × 8
\WM/ | Ensemble of WSDAN-based networks | 0.42842 (LogLoss) | 18.83
NtechLab | Mixup + EfficientNet + 3D conv | 0.43452 (LogLoss) | 72.35 × 3
TABLE 6 Detection performance on Celeb-DF datasets

Study | Method | Performance | FLOPs
Dang et al. [42] | Multitask learning + attention mechanism | 0.712 (AUC) | 4.59
Li et al. [32] | Face X-ray + multitask learning | 0.8058 (AUC) | >3.99
Ciftci et al. [63] | Biological signals detection | 91.50% (ACC) | -
Tarasiou et al. [43] | A lightweight architecture | 92.62% (ACC) | -
Hernandez-Ortega et al. [72] | DeepFakesON-Phys (convolutional attention network) | 91.50% (ACC) | 0.48
Wang et al. [39] | Monitoring neuron behaviours | 0.668 (AUC) | -
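Many rows in Tables 4-6 report AUC instead of accuracy. AUC equals the probability that a randomly chosen fake video is scored higher than a randomly chosen real one, which makes it threshold-free; a minimal pure-Python sketch with illustrative scores:

```python
def auc(real_scores, fake_scores):
    """Fraction of (real, fake) pairs ranked correctly; ties count half."""
    wins = 0.0
    for fake in fake_scores:
        for real in real_scores:
            if fake > real:
                wins += 1.0
            elif fake == real:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))

# 7 of the 9 (real, fake) pairs are ordered correctly.
print(auc([0.1, 0.4, 0.35], [0.8, 0.9, 0.3]))  # → 0.7777777777777778
```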

5.1.1 | Generalization

Generalization is an important indicator to measure the performance of algorithms, and it is often adopted to evaluate the performance of an algorithm on unknown datasets. The detection algorithms proposed so far are mostly based on supervised learning, which is prone to overfitting on its own dataset. Experiments performed in [32] have proved that the generalization performance of existing detection algorithms is still insufficient for cross-dataset detection tasks. Each subdataset of FaceForensics++ is used as a training set to train the Xception network, which is then evaluated on the other subdatasets. The experimental data shown in Table 7 demonstrate that the Xception network is fragile when encountering unknown data, even reaching a detection accuracy of only 49.13%. In practice, there are great differences in the selection of source videos and in the postprocessing of generated videos, so different datasets often imply distinct distributions. To our knowledge, some work has focused on improving the generalization of algorithms [32, 38, 79, 80]. However, due to their special designs, these algorithms have their own inherent flaws. For example, face X-ray [32] heavily relies on blending steps, so it cannot detect artefacts in entirely synthetic images. Therefore, generalization is still an urgent problem to be solved.

5.1.2 | Interpretability

Interpretability has been an inherent problem for algorithms based on neural networks. As a black-box model, a neural network cannot provide human-understandable justifications for its output. However, a detection algorithm must be interpretable in practical forensic scenarios, otherwise convincing results cannot be obtained. There has been some related work on interpretability in other fields [81-83], while research on interpretability has made little progress in the deepfake detection field. The interpretability of deepfake detection approaches is still an important issue that needs to be solved in the future.

5.1.3 | Time consumption

When applied in a practical scene, time consumption becomes a significantly important point. In the foreseeable future, deepfake detection algorithms will be widely used on streaming media platforms to reduce the negative impact of deepfake videos on social security. However, current detection algorithms are far from wide implementation in practical scenarios due to their high time consumption. In this survey, we performed a brief evaluation of the time consumption of existing neural networks. Specifically, we randomly selected 10 videos from the corresponding dataset for each trained model. Each selected video has a length of about 300 frames. In this evaluation, we detect 64 frames per video so as to achieve more accurate results. The final time consumption is calculated by considering only the inference time of the models. As shown in Table 8, the average detection time of 10 videos is about 70-80 s, which means that each video takes 7-8 s to detect. Considering videos are much longer than 300

frames in the practical scenarios, such time consumption is far from meeting the needs of massive video detection. In the current literature related to deepfake detection, detection accuracy is regarded as the only standard, while little research pays attention to the time consumption of deepfake detection. In the future, more attention should be devoted to studying how to design an efficient and high-accuracy detection method.

5.1.4 | Robustness

Robustness is often applied to evaluate the performance of detection algorithms when they encounter various degradations. Compared with original videos, compressed videos are more difficult to detect because compression discards a lot of image information for a higher compression rate. As shown in Table 4, detection algorithms often show a decrease in performance on low-quality videos compared with high-quality videos. In addition to compression, videos may also undergo operations such as image reshaping and rotation. Under such circumstances, robustness becomes an important property that must be considered when designing detection algorithms. An effective way to improve robustness would be to add a noise layer to the detection network, so that multiple data-degradation scenarios are considered. Improving the robustness of existing detection methods will play a significant role in the future.

5.2 | Future works

To address the problems in current detection algorithms, we also envision some research directions, which will advance future research on face-manipulated video detection.

5.2.1 | Triplet training

The toughest problem for deepfake detection tasks is that generalization performance is not sufficient to support the needs of practical scenarios, owing to the different distributions of datasets. Under such circumstances, it is difficult for detection models to learn the intrinsic difference between real and fake videos. To address this problem, the triplet training strategy would be a possible solution for such tasks [28, 31]. Triplet training aims to minimize the distance between samples of the same category and maximize the distance between samples of different categories in the feature space. In particular, the triplet training strategy ensures that the distance between samples of different categories is larger than the distance between samples of the same category. Therefore, the optimization goal of triplet training attempts to exploit the intrinsic difference between real and fake videos, providing assistance in subsequent classification tasks. In the field of face liveness detection, triplet training has been applied to domain adaptation tasks [84], demonstrating the potential of the triplet training strategy in finding intrinsic differences between real and fake videos, even if the datasets have different distributions.

5.2.2 | Multitask learning

Multitask learning, that is, performing multiple tasks simultaneously, has been proved to improve prediction performance compared with single-task learning. Performing both forgery localization and deepfake detection at the same time has been found to be effective for improving accuracy in deepfake detection tasks. Multitask learning allows the model to perform two tasks at the same time, considering the losses caused by both tasks, and thus further improves the performance of the model. The works in [32, 43, 85] also prove that forgery localization plays a vital role in the deepfake detection task. Therefore, multitask learning has great potential for the further improvement of deepfake detection.
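The triplet objective described in Section 5.2.1 can be written down compactly. Below is a framework-agnostic toy version: the embeddings and the margin value are illustrative, and in a real detector the inputs would be network embeddings of anchor/positive/negative videos:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the anchor-negative distance exceeds the
    anchor-positive distance by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings: a real anchor, a real positive, a fake negative.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))  # 0.1 - 1.0 + 0.2 < 0 → 0.0
```

The margin is what enforces the gap between same-category and different-category distances mentioned above.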
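For the multitask setting of Section 5.2.2, one common formulation is a weighted sum of the detection (classification) loss and the forgery-localization loss; the function name and weights here are illustrative and would be tuned per model:

```python
def multitask_loss(cls_loss, loc_loss, cls_weight=1.0, loc_weight=0.5):
    """Joint objective: deepfake classification plus forgery localization.
    The weights trade off the two tasks."""
    return cls_weight * cls_loss + loc_weight * loc_loss

print(multitask_loss(0.30, 0.40))  # 0.30 + 0.5 * 0.40 = 0.5
```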
TABLE 7 Cross-dataset evaluation on the FaceForensics++ dataset (AUC; columns are test sets)

Training set | DF | F2F | FS | NT | FF++
DF | 99.38 | 75.05 | 49.13 | 80.39 | 76.34
F2F | 87.56 | 99.53 | 65.23 | 65.9 | 79.55
FS | 70.12 | 61.7 | 99.36 | 68.71 | 74.91
NT | 93.09 | 84.82 | 47.98 | 99.5 | 83.42

5.2.3 | Antiforensics

Antiforensic technology is developed due to defects existing in current forensic technology. In the field of deepfake detection, neural networks are widely used to distinguish forged videos. However, due to inherent defects, neural networks cannot resist attacks by adversarial samples [86-88]. To this end, researchers need to design more robust algorithms that can

TABLE 8 Time consumption evaluation on FaceForensics++ subdatasets (seconds)

Model | RAW: DF / F2F / FS / NT | C23: DF / F2F / FS / NT | C40: DF / F2F / FS / NT | Average
EfficientNetB0 | 96.71 / 68.26 / 95.57 / 85.78 | 94.18 / 81.06 / 73.65 / 77.11 | 80.46 / 86.94 / 60.97 / 77.96 | 81.56
ResNet50 | 80.62 / 66.54 / 68.43 / 59.84 | 105.40 / 83.23 / 75.46 / 57.48 | 84.12 / 90.87 / 65.24 / 57.84 | 74.59
ResNet101 | 78.72 / 65.38 / 86.19 / 79.33 | 104.27 / 81.28 / 73.68 / 77.38 | 82.39 / 87.84 / 63.55 / 77.63 | 79.80
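The figures in Table 8 count only model inference, excluding video decoding and I/O. A minimal timing harness in the same spirit is sketched below; `dummy_model` is a stand-in for a trained detector, and the 64 frames per video match the evaluation protocol described above:

```python
import time

def time_inference(model, frames):
    """Seconds spent on inference only (decoding and I/O excluded)."""
    start = time.perf_counter()
    for frame in frames:
        model(frame)
    return time.perf_counter() - start

# Stand-in "model": any callable that scores one frame.
dummy_model = lambda frame: sum(frame) > 0.5
frames = [[0.0] * 16 for _ in range(64)]  # 64 frames per video, as in Table 8
elapsed = time_inference(dummy_model, frames)
print(f"inference time: {elapsed:.4f} s")
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic clock with the highest available resolution, which matters for short measurements.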

withstand possible attacks found in the laboratory to prevent such attacks in real-world scenarios. The development of antiforensics technology can predict possible attacks in advance and discover the weaknesses of existing algorithms, thereby improving existing algorithms.

6 | CONCLUSION

In recent years, deepfake technologies, which rely on deep learning, have been developing at an unprecedented rate. Malicious face-manipulated videos generated by deepfake algorithms can be rapidly disseminated through the global pervasiveness of the Internet, threatening social stability and personal privacy. To this end, commercial companies and research groups worldwide are conducting relevant research to reduce the negative impact of deepfake videos on people. In this article, we first introduce the generation technology of deepfake videos, then analyse the existing detection technology, and finally discuss future research directions. Existing problems of current detection algorithms and promising research directions are emphasized in this review, with particular attention to generalization and robustness. We hope this article will be useful for researchers engaged in deepfake detection research and will help restrain the negative impact of deepfake videos.

ACKNOWLEDGEMENTS
This work is supported in part by the Jiangsu Basic Research Programs-Natural Science Foundation under grant number BK20181407; in part by the National Natural Science Foundation of China under grant numbers U1936118 and 61672294; in part by the Six Peak Talent project of Jiangsu Province (R2016L13), the Qinglan Project of Jiangsu Province, and the '333' project of Jiangsu Province; in part by the National Natural Science Foundation of China under grant numbers U1836208, 61702276, 61772283, 61602253, and 61601236; in part by the National Key R&D Programme of China under grant 2018YFB1003205; in part by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund; and in part by the Collaborative Innovation Centre of Atmospheric Environment and Equipment Technology (CICAEET) fund, China. Zhihua Xia is supported by the BK21+ programme of the Ministry of Education of Korea.

CONFLICT OF INTEREST
None.

PERMISSION TO REPRODUCE MATERIALS FROM OTHER SOURCES
Figures 3 and 4 come from Reference [26]; the relevant references have been added.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available at https://github.com/ondyari/FaceForensics, reference number [1].

ORCID
Peipeng Yu https://orcid.org/0000-0003-0056-4300

REFERENCES
1. Chesney, B., Citron, D.: Deep fakes: a looming challenge for privacy, democracy, and national security. Calif. L. Rev. 107, 1753 (2019)
2. Delfino, R.: Pornographic deepfakes: revenge porn's next tragic act - the case for federal criminalization. Fordham L. Rev. 88, 887 (2019). SSRN 3341593
3. Dixon, H.B., Jr.: Deepfakes: more frightening than Photoshop on steroids. Judges J. 58(3), 35-37 (2019)
4. Feldstein, S.: How artificial intelligence systems could threaten democracy. The Conversation (2019)
5. Korshunov, P., Marcel, S.: Deepfakes: a new threat to face recognition? Assessment and detection. arXiv preprint arXiv:1812.08685 (2018)
6. Li, Y., Chang, M.-C., Lyu, S.: In ictu oculi: exposing AI created fake videos by detecting eye blinking. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1-7. IEEE (2018)
7. Rossler, A., et al.: FaceForensics++: learning to detect manipulated facial images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1-11 (2019)
8. Li, Y., et al.: Celeb-DF: a new dataset for deepfake forensics. arXiv preprint arXiv:1909.12962 (2019)
9. Dolhansky, B., et al.: The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397 (2020)
10. Jiang, L., et al.: DeeperForensics-1.0: a large-scale dataset for real-world face forgery detection. arXiv preprint arXiv:2001.03024 (2020)
11. Chesney, R., Citron, D.: Deepfakes and the new disinformation war: the coming age of post-truth geopolitics. Foreign Aff. 98, 147 (2019)
12. Bitouk, D., et al.: Face swapping: automatically replacing faces in photographs. In: ACM SIGGRAPH 2008 Papers, pp. 1-8 (2008)
13. Yuan, L., et al.: Face replacement with large-pose differences. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1249-1250 (2012)
14. Zhang, X., Song, J., Park, J.I.: The image blending method for face swapping. In: 2014 4th IEEE International Conference on Network Infrastructure and Digital Content, pp. 95-98. IEEE (2014)
15. Korshunova, I., et al.: Fast face-swap using convolutional neural networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3697-3705 (2017)
16. Olszewski, K., et al.: Realistic dynamic facial textures from a single image using GANs. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
17. Dale, K., et al.: Video face replacement. ACM (2011)
18. Faceswap-GAN (2018). https://github.com/shaoanlu/faceswap-GAN
19. Keras-vggface: VGGFace implementation with the Keras framework (2019). https://github.com/rcmalli/keras-vggface
20. Petrov, I., et al.: DeepFaceLab: a simple, flexible and extensible face swapping framework. arXiv preprint arXiv:2005.05535 (2020)
21. Li, L., et al.: FaceShifter: towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457 (2019)
22. Vlasic, D., et al.: Face transfer with multilinear models. In: ACM SIGGRAPH 2006 Courses, p. 24 (2006)
23. Thies, J., et al.: Face2Face: real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387-2395 (2016)
24. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Trans. Graph. 36(4), 1-13 (2017)
25. Fried, O., et al.: Text-based editing of talking-head video. ACM Trans. Graph. 38(4), 1-14 (2019)
26. Kim, H., et al.: Deep video portraits. ACM Trans. Graph. 37(4), 1-14 (2018)
27. Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph. 38(4), 1-12 (2019)
28. Zhou, P., et al.: Two-stream neural networks for tampered face detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017)
29. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251-1258 (2017)
30. Dolhansky, B., et al.: The deepfake detection challenge (DFDC) preview dataset. arXiv preprint arXiv:1910.08854 (2019)
31. Bonettini, E.D.C., et al.: Video face manipulation detection through ensemble of CNNs. arXiv preprint arXiv:2004.07676 (2020)
32. Li, L., et al.: Face X-ray for more general face forgery detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 5000-5009 (2020)
33. Amerini, I., et al.: Deepfake video detection through optical flow based CNN. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
34. Nguyen, H.H., Yamagishi, J., Echizen, I.: Capsule-forensics: using capsule networks to detect forged images and videos. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2307-2311. IEEE (2019)
35. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
36. Afchar, D., et al.: MesoNet: a compact facial video forgery detection network. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1-7. IEEE (2018)
37. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9 (2015)
38. Li, Y., Lyu, S.: Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656 (2018)
39. Wang, R., et al.: FakeSpotter: a simple yet robust baseline for spotting AI-synthesized fake faces. In: International Joint Conference on Artificial Intelligence (IJCAI) (2020)
40. Liu, Z., et al.: Global texture enhancement for fake face detection in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
41. Nguyen, H.H., et al.: Multi-task learning for detecting and segmenting manipulated facial images and videos. arXiv preprint arXiv:1906.06876 (2019)
42. Dang, H., et al.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5781-5790 (2020)
43. Tarasiou, M., Zafeiriou, S.: Extracting deep local features to detect manipulated images of human faces. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 1821-1825 (2020)
44. Güera, D., Delp, E.J.: Deepfake video detection using recurrent neural networks. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-6. IEEE (2018)
45. Szegedy, C., et al.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826 (2016)
46. Sabir, E., et al.: Recurrent-convolution approach to deepfake detection - state-of-art results on FaceForensics++. arXiv preprint arXiv:1905.00582 (2019)
47. Montserrat, D.M., et al.: Deepfakes detection with automatic face weighting. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2851-2859 (2020)
48. Zhao, Y., et al.: Capturing the persistence of facial expression features for deepfake video detection. In: International Conference on Information and Communications Security, pp. 630-645. Springer (2019)
49. Wu, X., et al.: SSTNet: detecting manipulated faces through spatial, steganalysis and temporal features. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2952-2956. IEEE (2020)
50. Masi, I., et al.: Two-branch recurrent network for isolating deepfakes in videos. In: 16th European Conference on Computer Vision (ECCV 2020), pp. 667-684. Springer, Cham (2020)
51. Ruff, L., et al.: Deep one-class classification. In: Proceedings of Machine Learning Research, vol. 80, pp. 4393-4402. PMLR, Stockholm (2018)
52. Yang, X., Li, Y., Lyu, S.: Exposing deep fakes using inconsistent head poses. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8261-8265. IEEE (2019)
53. Lukas, J., Fridrich, J., Goljan, M.: Digital camera identification from sensor pattern noise. IEEE Trans. Inf. Forensics Secur. 1(2), 205-214 (2006)
54. Chen, M., et al.: Determining image origin and integrity using sensor noise. IEEE Trans. Inf. Forensics Secur. 3(1), 74-90 (2008)
55. Chierchia, G., et al.: A Bayesian-MRF approach for PRNU-based image forgery detection. IEEE Trans. Inf. Forensics Secur. 9(4), 554-567 (2014)
56. Korus, P., Huang, J.: Multi-scale analysis strategies in PRNU-based tampering localization. IEEE Trans. Inf. Forensics Secur. 12(4), 809-824 (2016)
57. Koopman, M., Rodriguez, A.M., Geradts, Z.: Detection of deepfake video manipulation. In: The 20th Irish Machine Vision and Image Processing Conference (IMVIP), pp. 133-136 (2018)
58. Frank, J., et al.: Leveraging frequency analysis for deep fake image recognition. arXiv preprint arXiv:2003.08685 (2020)
59. Cozzolino, D., Verdoliva, L.: Noiseprint: a CNN-based camera model fingerprint. IEEE Trans. Inf. Forensics Secur. 15, 144-159 (2019)
60. Zhang, K., et al.: Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142-3155 (2017)
61. Huang, Y., et al.: FakeRetouch: evading deepfakes detection via the guidance of deliberate noise. arXiv preprint arXiv:2009.09213 (2020)
62. Chen, C., et al.: Camera trace erasing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2950-2959 (2020)
63. Ciftci, U.A., Demir, I., Yin, L.: FakeCatcher: detection of synthetic portrait videos using biological signals. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
64. Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634 (2015)
65. Feng, L., et al.: Motion-resistant remote imaging photoplethysmography based on the optical properties of skin. IEEE Trans. Circ. Syst. Video Technol. 25(5), 879-891 (2014)
66. Kumar, S., Prakash, A., Tucker, C.S.: Bounded Kalman filter method for motion-robust, non-contact heart rate estimation. Biomed. Optic. Express 9(2), 873-897 (2018)
67. Zhao, C., et al.: A novel framework for remote photoplethysmography pulse extraction on compressed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1299-1308 (2018)
68. Chen, W., McDuff, D.: DeepPhys: video-based physiological measurement using convolutional attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 349-365 (2018)
69. Fernandes, S., et al.: Predicting heart rate variations of deepfake videos using neural ODE. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
70. Chen, R.T.Q., et al.: Neural ordinary differential equations. In: Advances in Neural Information Processing Systems, pp. 6571-6583 (2018)
71. Qi, H., et al.: DeepRhythm: exposing deepfakes with attentional visual heartbeat rhythms. arXiv preprint arXiv:2006.07634 (2020)
72. Hernandez-Ortega, J., et al.: DeepFakesON-Phys: deepfakes detection based on heart rate estimation. arXiv preprint arXiv:2010.00400 (2020)
73. Korshunov, P., Marcel, S.: Vulnerability assessment and detection of deepfake videos. In: The 12th IAPR International Conference on Biometrics (ICB), pp. 1-6 (2019)
74. Li, X., et al.: Fighting against deepfake: patch & pair convolutional neural networks (PPCNN). In: Companion Proceedings of the Web Conference 2020, pp. 88-89 (2020)
75. Dufour, N., Gully, A.: DeepFake Detection Dataset (2019)
76. Cozzolino, D., Poggi, G., Verdoliva, L.: Extracting camera-based fingerprints for video forensics. In: Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition Workshops, pp. 130–137 84. Jia, Y., et al.: Single‐side domain generalization for face anti‐spoofing. In:
(2019) Proceedings of the IEEE/CVF Conference on Computer Vision and
77. Sabir, E., et al.: Recurrent convolutional strategies for face manipulation Pattern Recognition, pp. 8484–8493 (2020)
detection in videos. Interfaces. 3 (2019) 85. Stehouwer, J., et al.: On the detection of digital face manipulation. arXiv
78. Mittal, T., et al.: Emotions dont lie: an audio‐visual deepfake detection preprint arXiv:1910.01717. (2019)
method using affective cues. In: Proceedings of the 28th ACM Inter- 86. Carlini, N., Farid, H.: Evading deepfake‐image detectors with white‐and
national Conference on Multimedia, MM 20, pp. 2823–2832. Association black‐box attacks. arXiv preprint arXiv:2004.00622. (2020)
for Computing Machinery, New York (2020) 87. Gandhi, A., Jain, S.: Adversarial perturbations fool deepfake detectors.
79. Cozzolino, D., et al.: Forensictransfer: Weakly‐supervised domain adap- arXiv preprint arXiv:2003.10596. (2020)
tation for forgery detection. arXiv preprint arXiv:1812.02510. (2018) 88. Neekhara, P., et al.: Adversarial deepfakes: Evaluating vulnerability of
80. Du, M., et al.: Towards generalizable forgery detection with locality‐aware deepfake detectors to adversarial examples. arXiv preprint arXiv:
autoencoder. arXiv preprint arXiv:1909.05999. (2019) 2002.12749. (2020)
81. Scott, L., Lee, S.: A Unified Approach to Interpreting Model Predictions,
pp. 4768–4777 In: Proceedings of the 31st International Conference on
Neural Information Processing Systems (NIPS'17), (2017)
82. Samek, W., Wiegand, T., Muller, K.: Explainable artificial intelligence: How to cite this article: Yu, P., et al.: A Survey on
understanding, visualizing and interpreting deep learning models. arXi-
Deepfake Video Detection. IET Biom. 10(6), 607–624
vArtificial Intelligence (2017)
83. Kermany, D.S., et al.: Identifying medical diagnoses and treatable diseases (2021). https://doi.org/10.1049/bme2.12031
by image‐based deep learning. Cell. 172(5), 1122–1131 (2018)