IET Biometrics - 2021 - Yu - A Survey on Deepfake Video Detection
DOI: 10.1049/bme2.12031
REVIEW
Revised: 20 December 2020 Accepted: 8 February 2021
IET Biometrics
Correspondence
Zhihua Xia, College of Cyber Security, Jinan University, Guangzhou, Guangdong Province, 510632, China.
Email: [email protected]

Funding information
Collaborative Innovation Centre of Atmospheric Environment and Equipment Technology (CICAEET) fund, China; Priority Academic Programme Development of Jiangsu Higher Education Institutions; '333' project of Jiangsu Province; Qinglan Project of Jiangsu Province; National Natural Science Foundation of China, Grant/Award Numbers: 61702276, 61772283, U1936118, 61601236, 61602253, 61672294, U1836208; National Key R&D Programme of China, Grant/Award Number: 2018YFB1003205; BK21+ programme from the Ministry of Education of Korea; Jiangsu Basic Research Programs‐Natural Science Foundation, Grant/Award Number: BK20181407; Six peak talent project of Jiangsu Province, Grant/Award Number: R2016L13

Abstract
Recently, deepfake videos, generated by deep learning algorithms, have attracted widespread attention. Deepfake technology can be used to perform face manipulation with high realism. So far, a large number of deepfake videos have been circulating on the Internet, most of which target celebrities or politicians. These videos are often used to damage the reputation of celebrities and to steer public opinion, greatly threatening social stability. Although the deepfake algorithm itself is neither good nor evil, this technology has been widely used for negative purposes. To prevent it from threatening human society, a series of research efforts has been launched, including the development of detection methods and the building of large‐scale benchmarks. This review aims to present the current research status of deepfake video detection, in particular the generation process, several detection methods and existing benchmarks. It reveals that current detection methods are still insufficient for application in real scenes, and that further research should pay more attention to generalization and robustness.
This is an open access article under the terms of the Creative Commons Attribution‐NoDerivs License, which permits use and distribution in any medium, provided the original work is
properly cited and no modifications or adaptations are made.
© 2021 The Authors. IET Biometrics published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
anomaly region, respectively. Experiments show its superior performance compared with existing face‐swapping algorithms. Videos generated by recent deepfake approaches are extremely realistic and can hardly be distinguished by human eyes.

2.1.2 | Face reenactment

Different from face‐swapping technologies, face reenactment algorithms attempt to control people's expressions in videos, which means that attackers can generate videos manipulating someone to do something that never happened. The first face reenactment algorithm dates back to 2006, when Vlasic et al. [22] proposed to perform facial reenactment based on a face template, which was modified under different expression parameters. Most of the subsequent work is based on such schemes, where a parametric model is leveraged to adjust facial images. These methods could generate face images with high realism, but the obtained results often lack temporal coherence. In recent years, research on face reenactment has been further developed as computing ability has increased. To perform monocular facial reenactment in real time, Thies et al. [23] proposed Face2Face. In this study, a new global nonrigid model‐based bundling approach was applied to reconstruct the facial features of target and source actors. At the same time, a subspace deformation transfer technique was designed to perform expression transfer between source and target actors. In addition to these contributions, this study also proposed a novel method for the synthesis of mouth regions, where the best matching image is retrieved from the target sequence. Compared to previous studies, Face2Face achieved quite remarkable performance. However, it cannot guarantee consistent head movements, as only the migration of expressions is taken into account. Also, the synthesis of the mouth region is not satisfying, with coarse details of the mouth that are easily noticed by human eyes. With the development of deep learning techniques, these issues are gradually being noticed and addressed. It can be noticed that face videos synthesised by previous face reenactment algorithms have the defect of being inconsistent with voices. The work of Suwajanakorn et al. [24] remedied this defect to a certain extent. They aimed to learn a sequence mapping from audio to video in order to manipulate actors to speak the same sentences as the voice content. Features extracted from the voice sequence were fed into a recurrent neural network (RNN), which outputs a sparse mouth shape corresponding to each frame of the video output. The textures of the mouth are further synthesized and merged into the original video. A better improvement was achieved by Fried et al. [25], who performed talking‐head video editing and changed speech words by using a designed neural face rendering method. To perform face reenactment with better performance, Kim et al. [26] proposed a new method for the photorealistic reanimation of portrait videos. The proposed generative neural network with a novel space‐time architecture is used to transform coarse face model renderings into fully photorealistic portrait video output. The major contribution of this study is the design of a new spatiotemporal encoding as conditional input for video synthesis, resulting in synthesised videos with a high degree of spatiotemporal continuity. Compared to Face2Face, this work can migrate not only facial expressions but also head pose, gaze direction and blinking movements, compensating for the inaccurate head pose in the Face2Face algorithm. Beyond this study, Thies et al. [27] also made further optimisations to address problems existing in Face2Face: NeuralTextures incorporates neural networks for texture extraction on top of Face2Face, compensating for Face2Face's blurred texture in the mouth region.

2.2 | General process of deepfake video generation

In this part, we briefly describe the generation process of the two types of deepfake videos.

2.2.1 | Face swapping

To generate a face‐swapping video, all frames of the target video have to be processed using a generative method. Figure 2 shows the general generation process of face‐swapping videos. Obviously, the deepfake algorithm, which implements face swapping while preserving the source expressions, is the core part of video generation. The deepfake algorithms used in face swapping are mostly developed based on the autoencoder, which is widely used for data reconstruction tasks. An autoencoder is composed of two components: an encoder and a decoder. Latent features are first extracted from the image by the encoder and then input to the decoder to reconstruct the original image. In the deepfake algorithm, two autoencoders are trained to swap faces between source video frames and target video frames. As shown in Figure 3, during the training process, two encoders with the same weights are trained to extract common features in source and target faces. The extracted features are then input to two decoders to reconstruct the respective faces. It is worth noting that decoder A is only trained with faces of A, while decoder B is only trained with faces of B. When the training process is complete, a latent face generated from face A is passed to decoder B, which tries to reconstruct face B from the feature relative to face A. If the autoencoder is trained well, the latent space will represent facial expressions. In other words, the face generated by decoder B will have the same expression as face A.

2.2.2 | Face reenactment

The face reenactment task aims to perform the migration of facial expressions. To better demonstrate this kind of scheme, we directly use the scheme in [26] as an example. Figure 4 shows the general process of performing face reenactment. First, the low‐dimensional parameter representation of the source and target videos is obtained using a monocular face reconstruction method. Furthermore, head pose and expression could be transferred to the parameter
20474946, 2021, 6, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/bme2.12031, Wiley Online Library on [21/07/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
FIGURE 2 The generation process of face‐swapping video frames. The face area is first detected in each video frame. Then, facial landmarks are extracted to perform face alignment. After that, the deepfake algorithm (autoencoder or GAN) is applied to generate a synthetic face from the face‐aligned image. To reduce artefacts caused by blending, the landmarks of the left and right eyebrows and the bottom of the mouth are used to generate a specific mask, so that after blending the synthetic face into the original image, only the content inside the mask is retained. Finally, to make the generated image more realistic, a postprocessing operation is applied. Specifically, Gaussian blur is applied to the boundary of the mask, while a colour correction algorithm is applied to ensure the consistency of the synthetic face and the background image.
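The shared‐encoder, two‐decoder scheme described in Section 2.2.1 can be sketched with plain linear maps. This is only an illustration of the data flow: the layer sizes and random weights below are arbitrary stand‐ins, not those of any real deepfake implementation, which would use deep convolutional networks trained with a reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(42)

DIM, LATENT = 64 * 64, 128  # flattened face size and latent size (arbitrary)

# One shared encoder and two identity-specific decoders, as plain linear maps.
W_enc = rng.standard_normal((LATENT, DIM)) * 0.01
W_dec_a = rng.standard_normal((DIM, LATENT)) * 0.01
W_dec_b = rng.standard_normal((DIM, LATENT)) * 0.01

def encode(face):
    """Map a face to the shared latent ("expression") code."""
    return W_enc @ face

def decode(latent, w_dec):
    """Map a latent code back to a face of one specific identity."""
    return w_dec @ latent

# Training pairs each decoder with its own identity:
#   face_a -> encode -> decode(W_dec_a);  face_b -> encode -> decode(W_dec_b)
# Swapping at inference routes A's latent code through B's decoder:
face_a = rng.standard_normal(DIM)
swapped = decode(encode(face_a), W_dec_b)  # face B with A's expression
print(swapped.shape)
```

With trained weights, `swapped` would be identity B rendered with the expression encoded from face A; here the random weights only demonstrate the routing.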
FIGURE 4 Generating process of face reenactment videos [26]. First, monocular face reconstruction is performed on the source face and the target face to obtain their respective face parameters. After that, the parameters are modified by preserving the parameters of illumination and identity while changing the parameters of pose, expression and eye gaze. Synthetic images are then generated using the modified parameters. Finally, a rendering‐to‐video translation network is applied to generate face reenactment videos.
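The parameter‐modification step described in the caption above can be sketched as a simple selective swap. The parameter names and values below are purely illustrative, not those of an actual 3D morphable face model:

```python
# Hypothetical parameter vectors produced by monocular face reconstruction.
source = {"identity": [0.3, 0.1], "illumination": [0.9],
          "pose": [0.12, -0.40], "expression": [0.7, 0.2], "gaze": [0.05]}
target = {"identity": [0.8, 0.5], "illumination": [0.4],
          "pose": [0.00, 0.10], "expression": [0.1, 0.9], "gaze": [-0.10]}

TRANSFERRED = {"pose", "expression", "gaze"}  # driven by the source actor
# Everything else (identity, illumination) is preserved from the target video.

# Build the modified parameter set that conditions the rendering network.
modified = {k: (source[k] if k in TRANSFERRED else target[k]) for k in target}
print(modified)
```

The rendering‐to‐video network then consumes images synthesized from `modified`, so the output keeps the target's identity and lighting while following the source's motion.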
space. To perform face reenactment, scene illumination and identity parameters are preserved while head pose, expression and eye gaze parameters are changed. After that, synthetic images of the target actor are regenerated based on the modified parameters. These images then serve as the conditional input of the new rendering‐to‐video conversion network, which is trained to convert the synthesized input into a realistic output. To obtain a complete video with better time consistency, the conditioning space‐time volumes are fed into the network in a sliding window fashion. In this way, the face reenactment video can be obtained.

3 | DEEPFAKE VIDEO DETECTION

Deepfake videos are increasingly harmful to personal privacy and social security, and various methods have been proposed to detect manipulated videos. Early attempts mainly focused on inconsistent features caused by the face synthesis process, while current detection methods mostly target fundamental features. As shown in Table 1, these methods fall into five categories based on the features they use. To begin with, detection based on general neural networks is commonly used in the literature, where the deepfake detection task is treated as a regular classification task. Temporal consistency features are also exploited to detect discontinuities between adjacent frames of fake videos. To find distinguishable features, visual artefacts generated in the blending process are exploited in detection tasks. Recently proposed approaches focus on more fundamental features, where camera fingerprint‐ and biological signal‐based schemes show great potential in detection tasks. In the following sections, we review the detection methods mentioned above.

3.1 | General‐network‐based methods

Recent advances in image classification have been applied to improve the detection of deepfake videos. In this method, face
images extracted from the detected video are used to train the detection network. Then, the trained network is applied to make predictions for all frames of this video. The final prediction is calculated by an averaging or voting strategy. Consequently, the detection accuracy is highly dependent on the neural networks, without the need to exploit specific distinguishable features. In this section, we divide existing network‐based methods into two types: transfer learning‐based methods and detection approaches based on specially designed networks.

3.1.1 | Transfer learning

Network‐based detection methods were among the earliest introduced for detection tasks. Shortly after the appearance of the first deepfake video, some early detection algorithms were proposed, mainly based on existing networks that performed well in image classification tasks. A transfer learning strategy can easily be found in the early studies. Combining steganalysis features and deep learning features, Zhou et al. [28] put forward a two‐stream network for face tampering detection. Likewise, in [7], Rossler et al. evaluated XceptionNet [29] on the FaceForensics++ dataset, outperforming all other networks in detecting fakes. During DFDC, similar detection methods were used. In [30], two existing models were tested to provide a performance baseline: a small DNN (composed of six convolutional layers and a fully connected layer) and an existing XceptionNet. Early results showed that the best method (XceptionNet) provides 93.0% precision. Bonettini et al. [31] studied the ensemble of differently trained CNN models, showing that an ensemble of CNNs can achieve promising results in deepfake detection. However, such network‐based algorithms are prone to overfitting [32], so researchers attempted to exploit intrinsic differences between real and fake videos through preprocessing. Some preprocessing methods, such as optical flow calculation [33], have been proved useful to exploit interframe dissimilarities in network‐based methods.
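The frame‐level aggregation described at the start of this section (averaging or voting over per‐frame predictions) can be sketched as follows; the threshold and probability values are illustrative:

```python
from collections import Counter

def video_prediction_average(frame_probs, threshold=0.5):
    """Video-level label from per-frame fake probabilities, by averaging."""
    mean = sum(frame_probs) / len(frame_probs)
    return "fake" if mean >= threshold else "real"

def video_prediction_vote(frame_probs, threshold=0.5):
    """Video-level label by majority vote over per-frame decisions."""
    votes = Counter("fake" if p >= threshold else "real" for p in frame_probs)
    return votes.most_common(1)[0][0]

# Toy per-frame outputs from a frame classifier (values are made up).
probs = [0.9, 0.8, 0.3, 0.6, 0.2]
print(video_prediction_average(probs))  # mean 0.56 -> "fake"
print(video_prediction_vote(probs))     # 3 of 5 frames fake -> "fake"
```

The two strategies can disagree on borderline videos, e.g. when a few frames carry very confident scores; which one works better depends on the frame classifier's calibration.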
TABLE 1 The five categories of deepfake video detection methods

General‐network‐based methods: Detection is regarded as a frame‐level classification task, which is performed by CNNs.

Temporal‐consistency‐based methods: Deepfake videos exhibit inconsistencies between adjacent frames due to defects of the forgery algorithm; thus, RNNs are applied to detect such inconsistencies.

Visual‐artefacts‐based methods: The blending operation in the generation process causes intrinsic image discrepancies at the blending boundaries; CNN‐based methods are used to identify these artefacts.

Camera‐fingerprint‐based methods: Owing to their specific processing pipelines, devices leave distinct traces in the captured images, and the face and background of a manipulated image are acknowledged to come from different sources; the detection task can therefore be completed by using these traces.

Biological‐signal‐based methods: GANs can hardly capture the hidden biological signals of faces, making it difficult to synthesize human faces with plausible behaviour; based on this observation, biological signals are extracted to detect deepfake videos.
FIGURE 5 Capsule‐forensics architecture. A pretrained VGG‐19 is first used to extract features from face images. The features are further input into the proposed capsules, which include several primary capsules and two output capsules. Agreement between primary capsules and output capsules is calculated by a dynamic routing algorithm. Finally, the output of the capsules is mapped to probabilistic values.
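The dynamic routing step mentioned in the caption can be sketched in simplified form, following the routing‐by‐agreement algorithm of capsule networks; the capsule counts and dimensions below are arbitrary, not those of the capsule‐forensics model:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Capsule nonlinearity: shrink vector norm into [0, 1), keep direction."""
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """Route predictions u_hat (n_primary, n_output, dim) by agreement."""
    n_primary, n_output, _ = u_hat.shape
    b = np.zeros((n_primary, n_output))          # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # couplings
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum per output
        v = squash(s)                            # output capsule vectors
        b = b + (u_hat * v[None]).sum(axis=-1)   # agreement updates logits
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.standard_normal((10, 2, 4)))  # two output capsules
print(np.linalg.norm(v, axis=-1))  # capsule lengths act like probabilities
```

Primary capsules whose predictions agree with an output capsule's vector get larger coupling coefficients on the next iteration, which is the "agreement" the caption refers to.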
extracted by the primary capsules is dynamically calculated by a dynamic routing algorithm, and the results are finally routed to the appropriate output capsule. Visualization of the extracted latent features indicated that the combination of capsule networks and the dynamic routing algorithm is effective for detecting manipulations. However, the capsule network performed poorly when encountering unknown deepfake videos [8], proving that capsule networks still need further improvement to detect high‐fidelity videos. To explore the mesoscopic properties of images, Afchar et al. [36] also proposed a CNN, namely MesoInception‐4, consisting of a variant of the inception modules introduced in [37]. Their proposed approach achieved 98.4% accuracy on a private database. Moreover, this approach has also been tested on unseen datasets in recent studies [7, 8, 30, 38], proving to be a robust approach in deepfake detection tasks. Although these methods achieved excellent results on various datasets, the reasons behind the good performance are still unknown. In fact, deeper networks tend to achieve better results than shallower networks in various areas, and the reason for the good performance may simply be that the designed networks are deep enough. Compared with traditional learning‐based methods, Wang et al. [39] paid more attention to neuron coverage and interactions rather than the design of specific network structures. The FakeSpotter they proposed uses hierarchical neuron behaviour as a feature, showing high robustness against four common perturbation attacks. This research provided a new insight for detecting fakes.

3.1.3 | Summary

The disadvantage of network‐based methods is that they tend to overfit on specific datasets. In this type of method, although adjustment and optimization of the model structure often affect the abstraction degree of features, it still lacks sufficient relevance for the task of deepfake detection. Therefore, the current direction of such work is gradually changing. On the one hand, by adding additional components to the model, the model can be constrained to learn heuristic features [40]. In this case, the importance of the model architecture is greatly reduced while the additional components play a greater role. This is exactly the difference between deepfake detection tasks and general computer vision tasks. On the other hand, more and more network‐based methods have begun to introduce multitask learning, that is, not only to classify real and fake faces but also to generate pixel‐level tampering masks. In [41], using a semisupervised learning strategy, Nguyen et al. designed a multitask learning framework to simultaneously detect manipulated content and locate the manipulated regions. In such schemes, however, supervised multitask learning is only a complementary implementation, which does not necessarily improve the final detection performance. Further improvement was achieved by using attention mechanisms. Dang et al. [42] utilized an attention mechanism to process feature maps for classification. The proposed approach showed excellent performance both in deepfake detection and forgery localization, achieving state‐of‐the‐art performance compared to previous solutions. Their approach demonstrates the importance of attention mechanisms. Likewise, in [43], Tarasiou et al. designed a lightweight architecture for extracting local image features and a multitask training scheme for forgery localization. In this way, the forgery localization process provides evidence for the judgement while ensuring detection accuracy, promoting the practical use of detection algorithms. It is worth mentioning that some basic directions in computer vision, such as anomaly detection, semantic segmentation and metric learning, are making increasingly important contributions in this field.
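As a toy illustration of the attention‐based reweighting idea mentioned above (e.g. [42]), a spatial attention map can rescale CNN feature maps before classification. Everything below, including the feature maps, attention logits and classifier weights, is a random stand‐in rather than any published detector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from a CNN backbone: (channels, height, width).
features = rng.standard_normal((8, 4, 4))

# A learned spatial attention map would come from a small trained head;
# here we fake its logits with random numbers, purely for illustration.
logits = rng.standard_normal((4, 4))
attention = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: values in (0, 1)

# Reweight every channel by the shared spatial attention map.
attended = features * attention            # broadcasts over channels

# Global average pooling, then a linear classifier head (random weights).
pooled = attended.mean(axis=(1, 2))        # shape (8,)
w = rng.standard_normal(8)
score = float(pooled @ w)                  # real/fake logit
print(attended.shape, round(score, 3))
```

In a trained detector the attention map tends to concentrate on manipulated regions, which is also why such maps double as coarse forgery‐localization evidence.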
3.2 | Temporal‐consistency‐based methods

Time continuity is a unique feature of videos. Unlike images, a video is a sequence composed of multiple frames, where adjacent frames have a strong correlation and continuity. When video frames are manipulated, the correlation between adjacent frames will be destroyed due to defects of deepfake algorithms, specifically expressed in the shift of face position and video flickering. Based on this phenomenon, researchers have proposed several detection approaches. We first introduce the original CNN‐RNN architecture and then demonstrate its improvement over recent years.

3.2.1 | CNN‐RNN

Considering the time continuity in videos, Guera et al. [44] first proposed to use an RNN to detect deepfake videos. In their work, the autoencoder was found to be completely unaware of previously generated faces because faces were generated frame by frame. This lack of temporal awareness results in multiple anomalies, which are crucial evidence for deepfake detection. To check the continuity between adjacent frames, an end‐to‐end trainable recurrent deepfake video detection system was proposed. As Figure 6 shows, the proposed system is mainly composed of a convolutional long short‐term memory (LSTM) structure for processing frame sequences. Two essential components are used in the convolutional LSTM structure: a CNN for frame feature extraction and an LSTM for temporal sequence analysis. Specifically, a pretrained InceptionV3 [45] is adapted to output a deep representation for each frame. The 2048‐dimensional feature vectors extracted by the last pooling layers are used as the sequential LSTM input, characterizing the continuity between image sequences. Finally, a fully connected layer and a softmax layer are added to compute the forgery probability of the tested frame sequence. Experiments on a self‐made dataset showed that the algorithm can accurately make predictions even when the length of a video is less than 2 s. Although this research could not show its superiority, since there were no large‐scale datasets at the time, several later articles were inspired by this article, which promoted the development of detection methods based on temporal consistency.

3.2.2 | Improvement

After the time‐based detection method showed its effectiveness, many related studies were proposed. In [46], Sabir et al. utilized the temporal information present in the video stream to detect deepfake videos. Similar to [44], an end‐to‐end model is built, where the CNN is also involved in the follow‐up training. Meanwhile, face alignment based on facial landmarks and a spatial transformer network is applied to further improve the performance of the algorithm. Even though such solutions guarantee high accuracy on high‐quality videos, they do not perform well on low‐quality videos where the continuity between adjacent frames is disrupted by video compression operations. To solve this problem, a CNN‐RNN framework based on an automatic weighting mechanism was proposed by Montserrat et al. [47]. Considering that the face quality of some frames is low, the automatic weighting mechanism emphasizes the most reliable regions when making a video‐level prediction. Experiments showed that combining a CNN and an RNN achieves high detection accuracy on the DFDC dataset. Besides robustness, generalization ability is also essential for forgery detection tasks. Zhao et al. [48] used optical flow to capture the obvious differences in facial expressions between adjacent frames. However, these studies did not show strong generalization or robustness. To solve this problem, Wu et al. [49] proposed a novel manipulation detection framework, named SSTNet, exploiting both low‐level artefacts and temporal discrepancies. Another study, proposed by Masi et al. [50], obtained good generalization on multiple datasets. In their research, a two‐branch recurrent network is applied to propagate the original information while suppressing the face content. Multiband frequencies are amplified using a Laplacian of Gaussian as a bottleneck layer. Inspired by [51], a new loss function is designed to better isolate manipulated faces. The experimental results on several datasets show the excellent generalization performance of the detection algorithm. Nevertheless, time‐based detection schemes still have much
room for improvement in generalization performance [47]. Screen switching and unknown data are still problems that need to be solved for time‐based detection approaches.

3.2.3 | Summary

Compared with general‐network‐based approaches, temporal‐consistency‐based detection methods consider the continuity between adjacent frames, thereby improving detection performance. However, many models tend to destroy the spatial structure of the original frames when extracting temporal features, while the motivation for designing such methods is precisely to extract the inconsistency of spatial features in the temporal domain. CNN‐RNN architectures pool the intraframe features into vectors [44, 46] and thus cannot capture spatial features while detecting temporal consistency. Although structures such as 3D CNNs can avoid destroying spatial features, their excessive number of parameters makes it easier to overfit on a specific dataset.

3.3 | Visual‐artefacts‐based methods

In most existing deepfake methods, the generated face has to be blended into an existing background image, causing intrinsic image discrepancies at the blending boundaries. As shown in Figure 7, the face and the background come from different source images, giving rise to abnormal behaviour in the synthetic image, such as boundary anomalies and inconsistent brightness. These visual artefacts make deepfake videos fundamentally detectable. In this section, three main visual artefacts are introduced.

FIGURE 7 Video frames with visual artefacts. The deepfake‐generated image shows colour difference and resolution inconsistency because of the lack of postprocessing.

3.3.1 | Face warping artefacts

Based on the observed inconsistency between faces and background, a new deep learning‐based method was proposed by Li and Lyu [38], in which face warping artefacts generated by the blending process were used to detect fake videos. As shown in Figure 2, synthetic faces undergo an affine transform to match the poses of the target faces. In this case, there will be an obvious colour difference and resolution inconsistency between the internal face and background areas. Since the purpose here is to detect inconsistency between the face region and the background area, the negative samples are generated by a simplified process, where the face undergoes an affine warp directly back to the source image after being smoothed. To generate more realistic negative examples, a convex polygon shape is used based on the face landmarks of the eyebrows and the bottom of the mouth. Colour information is also randomly changed to enlarge the training diversity. After that, four CNN models (VGG16, ResNet50, ResNet101 and ResNet152) were trained in this study. Evaluated on several datasets of available deepfake videos, this method demonstrated its effectiveness in practice. Compared with previous methods, this study focuses on the visual artefacts caused by the affine transformation. At the same time, since no additional negative samples are involved, the algorithm does not need to fit the sample distribution of deepfake videos, greatly increasing its generalization [8].

3.3.2 | Blending boundary

Further improvements were achieved in [32], where Li et al. proposed a novel image representation, namely the face X‐ray, which was exploited to observe whether the input image can be decomposed into a foreground face and a background. Specifically, the blending boundary between the foreground manipulated face and the background was defined as the face X‐ray. Compared with Li and Lyu [38], this study targeted the blending boundary that is universally introduced in image blending, thus showing great performance when tested on various datasets. Besides proposing the face X‐ray, this research specifically designs a generation process that produces negative samples from positive samples. Thus, the algorithm does not need to consider the face manipulation in the deepfake video but only focuses on the difference between the background and the foreground face, thereby enhancing the generalization of the proposed algorithm. However, due to its excessive focus on the blending boundary, this scheme is not resistant to fully synthesized images.

3.3.3 | Head pose inconsistency

Another interesting study comes from [52]. Observing that deepfake videos are created by splicing a synthesised face into the original image, Yang et al. proposed a new detection method based on 3D head poses. They argued that current generative neural networks cannot guarantee landmark matching, so the 3D landmarks estimated on the face‐manipulated area differ from the 3D landmarks estimated from the whole face area. In this method, the rotation matrix estimated using facial landmarks from the whole face and the one estimated using only landmarks in the central region are compared to analyse the similarity between the two pose vectors. Although the experiments confirmed the difference between real and fake pose vectors, this study was built on specific features of a self‐made dataset generated by a relatively basic version of the deepfake algorithm.
Thus, this method is not effective for detecting new versions of deepfake videos as deepfake algorithms evolve [8].

3.3.4 | Summary

Visual‐artefacts‐based methods often obtain better generalization performance because they target more general artefacts existing in most deepfake content. However, these algorithms can only detect specific forgery traces because they focus on specific artefacts. With the progress of deepfake algorithms, these artefacts are gradually disappearing. Nevertheless, visual‐artefacts‐based approaches obtain better performance on the latest deepfake video datasets, so such schemes still have high potential in deepfake detection tasks. Further research should be conducted to exploit more intrinsic features.

as interpolation and gamma correction. Outside the camera, the image can also be compressed or enhanced, which leaves many traces in the final image. Thus, each image carries unique traces, namely noise residuals, which can be used to identify its source camera. Following this direction, Cozzolino et al. introduced a CNN‐based camera fingerprint named noiseprint in [59]. To remove scene content and enhance camera‐model‐related artefacts, a siamese network was trained using images coming from different camera models. In this siamese network, a fully convolutional network proposed in [60] was first introduced to extract the noise pattern of images. Pairs of images from the same or different camera models were used to train the siamese network. At the end of training, the CNN used in the siamese network could extract the corresponding noiseprint from an input image, displaying enhanced camera‐model artefacts. This work provides new ideas for fingerprint noise extraction tasks, further promoting the development of the image forensics area.
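The pairwise objective behind such siamese training can be illustrated with a small sketch. This is not the implementation of [59]: the residual extractor below is a crude local-mean filter standing in for the learned fully convolutional network, and the contrastive margin is an arbitrary assumption.

```python
import numpy as np

def noise_residual(img, k=3):
    """Crude stand-in for a learned extractor: residual = image - local mean.
    A real noiseprint extractor is a deep fully convolutional network."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    smoothed = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            smoothed[i, j] = padded[i:i + k, j:j + k].mean()
    return img - smoothed

def pairwise_loss(res_a, res_b, same_camera, margin=1.0):
    """Contrastive-style objective: pull residuals from the same camera model
    together, push residuals from different models at least `margin` apart."""
    d = np.linalg.norm(res_a - res_b)
    return d ** 2 if same_camera else max(0.0, margin - d) ** 2
```

Minimising this loss over many image pairs drives the extractor to suppress scene content (which differs even for same-camera pairs) and amplify the camera-model artefacts shared within each model.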
F I G U R E 8 Scheme used for deepfake detection. Noiseprints are first extracted from a sufficient number of video frames. Then, the extracted noiseprints are averaged to represent the video noiseprint. Divided by the face detector, the video noiseprint is then split into a face region and a background region. After that, the algorithm extracts features of the background region and calculates their statistical information. Finally, the Mahalanobis distance between the features of the face and background areas is calculated to obtain the final heat map
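The final distance computation of this scheme can be sketched with NumPy, assuming per-region noiseprint feature vectors have already been extracted; the function name and placeholder features are illustrative, not from the original work.

```python
import numpy as np

def mahalanobis(x, background_feats, eps=1e-6):
    """Distance of a face-region feature vector `x` from the statistics
    (mean and covariance) of background-region feature vectors."""
    mu = background_feats.mean(axis=0)
    cov = np.cov(background_feats, rowvar=False)
    # Small diagonal term keeps the covariance invertible
    cov_inv = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))
```

A large distance suggests the face-region noiseprint is statistically inconsistent with the rest of the frame, that is, the face was possibly inserted by a generator that carries no camera fingerprint.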
without any image capture process, there is no camera‐model fingerprint in the output image, so camera‐fingerprint‐based methods are very suitable for detecting images generated by GANs. However, recent work shows that images can also be generated by simulating camera fingerprints [61], thus deceiving detection methods that rely on camera fingerprints. Recent research also proved that noise patterns can be erased by neural networks [62]. Accordingly, existing camera‐fingerprint‐based methods should improve their robustness to resist such attacks.

In the final state prediction stage, a fully connected layer is added to calculate the probability of the eye‐open and eye‐closed states, which is then used to calculate the blink frequency. This method is evaluated on self‐made datasets, showing promising performance in detecting videos generated with deepfake methods. However, forgery algorithms can easily generate videos with a reasonable blinking frequency as long as enough closed‐eye images are added to the training set. Because it focuses excessively on abnormal blinking frequency, this method is no longer applicable to current deepfake detection tasks once the blink‐frequency problem is solved.
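The blink-frequency computation described above can be sketched as follows; the 0.5 decision threshold and the simple transition counting are illustrative assumptions rather than the cited method's exact procedure.

```python
def blink_frequency(open_probs, fps, threshold=0.5):
    """Count blinks as open-to-closed transitions in the per-frame
    eye-open probabilities, then convert to blinks per minute."""
    blinks = 0
    prev_open = True
    for p in open_probs:
        is_open = p >= threshold
        if prev_open and not is_open:  # an open -> closed transition starts a blink
            blinks += 1
        prev_open = is_open
    duration_min = len(open_probs) / fps / 60.0
    return blinks / duration_min
```

A detector would then flag videos whose estimated rate falls far outside the typical human range (roughly 15 to 20 blinks per minute), which is exactly the cue that becomes useless once generators are trained with enough closed-eye samples.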
Study Method Dataset Performance FLOPs
Güera et al. [44] CNN + LSTM Self‐made dataset 97.1% (ACC) >5.73
Ciftci et al. [63] Biological signals Self‐made deepfakes dataset 91.07% (ACC) ‐
Li and Lyu [38] Face warping artefacts + CNN UADFV 0.974 (AUC) 4.12
  Deepfake‐TIMIT (LQ) 0.999 (AUC)
  Deepfake‐TIMIT (HQ) 0.932 (AUC)
TABLE 3 List of datasets including video manipulations
Dataset Release date Real/fake Source
UADFV [6] 2018.11 49/49 YouTube
Deepfake‐TIMIT [5] 2018.12 ‐/620 YouTube
FaceForensics++ [7] 2019.01 1000/4000 YouTube
Google DFD [75] 2019.09 363/3068 Actors

Another large‐scale benchmark, composed of 50,000 original videos and 10,000 manipulated videos, has been built in [10]. DF‐VAE, a new conditional autoencoder, is applied to generate deepfake faces with a higher realism rating. Studies using DeeperForensics demonstrate that the quality of the generated videos is significantly better than that of existing datasets.
Study Method Dataset Performance FLOPs
Nguyen et al. [34] A capsule network FaceForensics++ ‐ Face2Face 93.11% (ACC) >7.72
Zhao et al. [48] Optical flow FaceForensics++ ‐ DeepFake 98.10% (ACC) 0.24
Sabir et al. [77] CNN + GRU + STN FaceForensics++ ‐ DeepFake 96.9% (ACC) 14.4
  FaceForensics++ ‐ Face2Face 94.35% (ACC)
  FaceForensics++ ‐ FaceSwap 96.3% (ACC)
Li et al. [32] Face X‐ray + multitask learning FaceForensics++ ‐ DeepFake 0.9912 (AUC) >3.99
  FaceForensics++ ‐ FaceSwap 0.9909 (AUC)
  FaceForensics++ ‐ Face2Face 0.9931 (AUC)
  FaceForensics++ ‐ NeuralTexture 0.9927 (AUC)
Tarasiou et al. [43] A lightweight architecture FaceForensics ‐ DeepFake (c23) 97.90% (ACC) ‐
  FaceForensics ‐ Face2Face (c23) 98.58% (ACC)
  FaceForensics ‐ FaceSwap (c23) 98.32% (ACC)
  FaceForensics ‐ DeepFake (c40) 92.40% (ACC)
  FaceForensics ‐ Face2Face (c40) 87.11% (ACC)
  FaceForensics ‐ FaceSwap (c40) 91.26% (ACC)
Masi et al. [50] Two‐branch recurrent network FaceForensics++ (frames, c23) 0.987 (AUC) ‐
  FaceForensics++ (videos, c23) 0.9912 (AUC)
  FaceForensics++ (frames, c40) 0.8659 (AUC)
  FaceForensics++ (videos, c40) 0.911 (AUC)
T A B L E 5 Detection performance on DFDC datasets
Study Method Performance FLOPs
Bonettini et al. [31] Ensemble of CNNs 0.8813 (AUC) >0.04
Mittal et al. [78] Emotions behind audio and visual content 0.892 (AUC) ‐
frames in the practical scenarios, such time consumption is far from meeting the needs of massive video detection. In the current literature related to deepfake detection, detection accuracy is regarded as the only standard, while few studies pay attention to the time consumption of deepfake detection. In the future, more attention should be devoted to designing detection methods that are both efficient and accurate.

5.1.4 | Robustness

Robustness is often applied to evaluate the performance of detection algorithms when they encounter various degradations. Compared with original videos, compressed videos are more difficult to detect because compression discards much image information in exchange for a higher compression rate. As shown in Table 4, detection algorithms often show a decrease in performance on low‐quality videos compared with high‐quality videos. In addition to compression, videos may also undergo operations such as reshaping and rotation. Under such circumstances, robustness becomes an important property that must be considered when designing detection algorithms. An effective way to improve robustness would be to add a noise layer to the detection network so that multiple data degradation scenarios are considered during training. Improving the robustness of existing detection methods will play a significant role in the future.

5.2 | Future works

To address problems existing in current detection algorithms, we also envision some research directions, which will advance future research on face‐manipulated video detection.

5.2.1 | Triplet training

The toughest problem for deepfake detection tasks is that generalization performance is not sufficient to support the needs of practical scenarios because of the different distributions of datasets. Under such circumstances, it is difficult for detection models to learn the intrinsic difference between real and fake videos. To address this problem, the triplet training strategy would be a possible solution for such tasks [28, 31]. Triplet training aims to minimize the distance between samples of the same category and maximize the distance between samples of different categories in the feature space. Specifically, the triplet training strategy ensures that the distance between samples of different categories is larger than the distance between samples of the same category. Therefore, the optimization goal of triplet training attempts to exploit the intrinsic difference between real and fake videos, providing assistance in subsequent classification tasks. In the field of face liveness detection, triplet training has been applied to domain adaptation tasks [84], demonstrating the potential of the triplet training strategy in finding intrinsic differences between real and fake videos, even if the datasets have different distributions.

5.2.2 | Multitask learning

Multitask learning, that is, performing multiple tasks simultaneously, has been proven to improve prediction performance compared with single‐task learning. Performing forgery localisation and deepfake detection at the same time has been found effective in improving accuracy in deepfake detection tasks. Multitask learning allows the model to perform two tasks at the same time, considering the losses of both tasks and further improving the performance of the model. The studies in [32, 43, 85] also prove that forgery localisation plays a vital role in the deepfake detection task. Therefore, multitask learning has great potential for further improvement of deepfake detection.
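The triplet objective described in Section 5.2.1 can be written in a few lines; this is a generic sketch with a hypothetical margin value, not a specific formulation from the cited works.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: force the anchor-negative distance to be
    at least `margin` larger than the anchor-positive distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

Here the anchor and positive are embeddings of samples from the same category (for example two real videos), and the negative comes from the other category (a fake video); a detector trained with this objective is pushed to separate real from fake in feature space rather than to memorise dataset-specific artefacts, which is why it is attractive for cross-dataset generalization.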
TABLE 7 Cross‐dataset evaluation on the FaceForensics++ dataset
ResNet50 80.62 66.54 68.43 59.84 105.40 83.23 75.46 57.48 84.12 90.87 65.24 57.84 74.59
ResNet101 78.72 65.38 86.19 79.33 104.27 81.28 73.68 77.38 82.39 87.84 63.55 77.63 79.80
29. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
30. Dolhansky, B., et al.: The deepfake detection challenge (DFDC) preview dataset. arXiv preprint arXiv:1910.08854 (2019)
31. Bonettini, N., et al.: Video face manipulation detection through ensemble of CNNs. arXiv preprint arXiv:2004.07676 (2020)
32. Li, L., et al.: Face X‐ray for more general face forgery detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 5000–5009 (2020)
33. Amerini, I., et al.: Deepfake video detection through optical flow based CNN. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
34. Nguyen, H.H., Yamagishi, J., Echizen, I.: Capsule‐forensics: Using capsule networks to detect forged images and videos. In: ICASSP 2019‐2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2307–2311. IEEE (2019)
35. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large‐scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
36. Afchar, D., et al.: MesoNet: A compact facial video forgery detection network. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7. IEEE (2018)
37. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
38. Li, Y., Lyu, S.: Exposing deepfake videos by detecting face warping artifacts. arXiv preprint arXiv:1811.00656 (2018)
39. Wang, R., et al.: FakeSpotter: A simple yet robust baseline for spotting AI‐synthesized fake faces. In: International Joint Conference on Artificial Intelligence (IJCAI) (2020)
40. Liu, Z., et al.: Global texture enhancement for fake face detection in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
41. Nguyen, H.H., et al.: Multi‐task learning for detecting and segmenting manipulated facial images and videos. arXiv preprint arXiv:1906.06876 (2019)
42. Dang, H., et al.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5781–5790 (2020)
43. Tarasiou, M., Zafeiriou, S.: Extracting deep local features to detect manipulated images of human faces. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 1821–1825 (2020)
44. Güera, D., Delp, E.J.: Deepfake video detection using recurrent neural networks. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. IEEE (2018)
45. Szegedy, C., et al.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
46. Sabir, E., et al.: Recurrent‐convolution approach to deepfake detection: State‐of‐art results on FaceForensics++. arXiv preprint arXiv:1905.00582 (2019)
47. Montserrat, D.M., et al.: Deepfakes detection with automatic face weighting. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2851–2859 (2020)
48. Zhao, Y., et al.: Capturing the persistence of facial expression features for deepfake video detection. In: International Conference on Information and Communications Security, pp. 630–645. Springer (2019)
49. Wu, X., et al.: SSTNet: Detecting manipulated faces through spatial, steganalysis and temporal features. In: ICASSP 2020‐2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2952–2956. IEEE (2020)
50. Masi, I., et al.: Two‐branch recurrent network for isolating deepfakes in videos. In: 16th European Conference on Computer Vision (ECCV 2020), pp. 667–684. Springer, Cham (2020)
51. Ruff, L., et al.: Deep one‐class classification. In: Proceedings of Machine Learning Research, vol. 80, pp. 4393–4402. PMLR, Stockholm (2018)
52. Yang, X., Li, Y., Lyu, S.: Exposing deep fakes using inconsistent head poses. In: ICASSP 2019‐2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8261–8265. IEEE (2019)
53. Lukas, J., Fridrich, J., Goljan, M.: Digital camera identification from sensor pattern noise. IEEE Trans. Inf. Forensics Secur. 1(2), 205–214 (2006)
54. Chen, M., et al.: Determining image origin and integrity using sensor noise. IEEE Trans. Inf. Forensics Secur. 3(1), 74–90 (2008)
55. Chierchia, G., et al.: A Bayesian‐MRF approach for PRNU‐based image forgery detection. IEEE Trans. Inf. Forensics Secur. 9(4), 554–567 (2014)
56. Korus, P., Huang, J.: Multi‐scale analysis strategies in PRNU‐based tampering localization. IEEE Trans. Inf. Forensics Secur. 12(4), 809–824 (2016)
57. Koopman, M., Rodriguez, A.M., Geradts, Z.: Detection of deepfake video manipulation. In: The 20th Irish Machine Vision and Image Processing Conference (IMVIP), pp. 133–136 (2018)
58. Frank, J., et al.: Leveraging frequency analysis for deep fake image recognition. arXiv preprint arXiv:2003.08685 (2020)
59. Cozzolino, D., Verdoliva, L.: Noiseprint: A CNN‐based camera model fingerprint. IEEE Trans. Inf. Forensics Secur. 15, 144–159 (2019)
60. Zhang, K., et al.: Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
61. Huang, Y., et al.: FakeRetouch: Evading deepfakes detection via the guidance of deliberate noise. arXiv preprint arXiv:2009.09213 (2020)
62. Chen, C., et al.: Camera trace erasing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2950–2959 (2020)
63. Ciftci, U.A., Demir, I., Yin, L.: FakeCatcher: Detection of synthetic portrait videos using biological signals. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
64. Donahue, J., et al.: Long‐term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015)
65. Feng, L., et al.: Motion‐resistant remote imaging photoplethysmography based on the optical properties of skin. IEEE Trans. Circ. Syst. Video Technol. 25(5), 879–891 (2014)
66. Kumar, S., Prakash, A., Tucker, C.S.: Bounded Kalman filter method for motion‐robust, non‐contact heart rate estimation. Biomed. Optic. Express 9(2), 873–897 (2018)
67. Zhao, C., et al.: A novel framework for remote photoplethysmography pulse extraction on compressed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1299–1308 (2018)
68. Chen, W., McDuff, D.: DeepPhys: Video‐based physiological measurement using convolutional attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 349–365 (2018)
69. Fernandes, S., et al.: Predicting heart rate variations of deepfake videos using neural ODE. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
70. Chen, R.T.Q., et al.: Neural ordinary differential equations. In: Advances in Neural Information Processing Systems, pp. 6571–6583 (2018)
71. Qi, H., et al.: DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms. arXiv preprint arXiv:2006.07634 (2020)
72. Hernandez‐Ortega, J., et al.: DeepFakesON‐Phys: Deepfakes detection based on heart rate estimation. arXiv preprint arXiv:2010.00400 (2020)
73. Korshunov, P., Marcel, S.: Vulnerability assessment and detection of deepfake videos. In: The 12th IAPR International Conference on Biometrics (ICB), pp. 1–6 (2019)
74. Li, X., et al.: Fighting against deepfake: Patch&pair convolutional neural networks (PPCNN). In: Companion Proceedings of the Web Conference 2020, pp. 88–89 (2020)
75. Dufour, N., Gully, A.: DeepFakes Detection Dataset (2019)
76. Cozzolino, D., Poggi, G., Verdoliva, L.: Extracting camera‐based fingerprints for video forensics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 130–137 (2019)
77. Sabir, E., et al.: Recurrent convolutional strategies for face manipulation detection in videos. Interfaces 3 (2019)
78. Mittal, T., et al.: Emotions don't lie: An audio‐visual deepfake detection method using affective cues. In: Proceedings of the 28th ACM International Conference on Multimedia (MM '20), pp. 2823–2832. Association for Computing Machinery, New York (2020)
79. Cozzolino, D., et al.: ForensicTransfer: Weakly‐supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510 (2018)
80. Du, M., et al.: Towards generalizable forgery detection with locality‐aware autoencoder. arXiv preprint arXiv:1909.05999 (2019)
81. Lundberg, S.M., Lee, S.‐I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 4768–4777 (2017)
82. Samek, W., Wiegand, T., Müller, K.‐R.: Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv: Artificial Intelligence (2017)
83. Kermany, D.S., et al.: Identifying medical diagnoses and treatable diseases by image‐based deep learning. Cell 172(5), 1122–1131 (2018)
84. Jia, Y., et al.: Single‐side domain generalization for face anti‐spoofing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8484–8493 (2020)
85. Stehouwer, J., et al.: On the detection of digital face manipulation. arXiv preprint arXiv:1910.01717 (2019)
86. Carlini, N., Farid, H.: Evading deepfake‐image detectors with white‐ and black‐box attacks. arXiv preprint arXiv:2004.00622 (2020)
87. Gandhi, A., Jain, S.: Adversarial perturbations fool deepfake detectors. arXiv preprint arXiv:2003.10596 (2020)
88. Neekhara, P., et al.: Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples. arXiv preprint arXiv:2002.12749 (2020)

How to cite this article: Yu, P., et al.: A survey on deepfake video detection. IET Biom. 10(6), 607–624 (2021). https://doi.org/10.1049/bme2.12031