
GTA-Net: A robust method for deepfake face image detection

Qinhua Yu, Xiaofeng Wang, Mao Jia, Ningning Bai, Jianpeng Hou, Dong Liu
Xi'an University of Technology, Xi'an, China

Abstract—The rapid advancement of artificial intelligence technology has resulted in the emergence of deepfake, which has had a significant impact on various fields due to its realistic effects. Addressing the challenges posed by deepfake has become a crucial area of research. In this study, we propose a two-stream network framework, GTA-Net, for the detection of deepfake face images. The framework comprises a Global Residual Attention module (GRA), a Texture Feature Saliency module (TFS), and an Attention Feature Fusion module (AFF). We incorporate Local Binary Patterns (LBP) features into the network input to guide the decision-making of the model towards prominent texture features, thereby enhancing detection accuracy. Additionally, we employ a residual attention mechanism to focus on the specific features generated by deepfake and improve robustness by avoiding interference from content-preserving manipulations. Experimental results show that the proposed method achieves high detection accuracy, strong robustness against degraded datasets, and good generalization in cross-dataset detection.

Index Terms—deepfake detection, texture feature saliency, robustness, generalization

This work was supported in part by the National Natural Science Foundation of China under Grant 61772416, in part by the Shaanxi Science Foundation of China under Grant 2022GY-087, and in part by the Postgraduate Research Foundation of Xi'an University of Technology under Grant 310-252082209.

I. INTRODUCTION

As a result of the advancement of artificial intelligence technology, deepfake has attracted significant attention in recent years. Deepfake refers to the use of deep learning technology to create fabricated scenes, particularly facial images such as full-face synthesis, attribute manipulation, expression transfer and identity exchange [1]. This technology has diverse applications, including special effects production in the film industry, digital art creation, social entertainment, and even the generation of synthetic videos with intricate details such as micro-expressions and mouth movements. However, the malicious use of deepfake technology poses unpredictable risks and hidden dangers to individuals and society [2], [3]. For instance, malevolent actors may exploit deepfakes for identity fraud or create fabricated videos or images of public figures to disseminate false information, thereby providing potent tools for cybercrime [4]. In recent years, with the continuous advancement of deepfake technology, deepfake face images have become so realistic that traditional image and video forensics technologies are rendered powerless. As a result, accurately detecting deepfake faces has become an urgent technical requirement.

To tackle the challenges posed by deepfake, deepfake detection technology has emerged. Because deepfake outputs exhibit a distribution similar to that of real images, the deepfake detection task differs from the visual classification, detection, and recognition tasks of pattern recognition, and conventional network models used for classification are not directly applicable to deepfake detection. In recent years, numerous deep learning-based methods for detecting deepfakes have been proposed. In the early stages of research, exploiting the internal structure of the forgery model and the inherent traces left by the forgery process was an effective detection strategy: for instance, affine transformation leaves traces [5], and inconsistent illumination and geometric position cause artifacts [6]. With advancements in generation technology, high-resolution deepfake images with more detailed processing continue to emerge, compensating for earlier limitations of generation technology and significantly weakening the detection ability of existing methods. To improve detection performance, attention-based network frameworks have been proposed for detecting high-resolution deepfake images; these frameworks selectively attend to image features so as to ignore irrelevant information and focus on critical cues. Representative methods include generating attention maps [7], using multiple attention maps for fine-grained classification [8], and utilizing the attention mechanism to guide data augmentation [9], [10]. With the ongoing advancements in Generative Adversarial Network (GAN) [11] technology, there has been a proliferation of high-quality deepfake datasets, leading to the emergence of numerous data-driven models for detecting deepfakes. Data-driven deepfake detection methods focus on extracting features from the spatial and frequency domains to capture image details and texture changes, using architectures such as capsule networks [12], [13], autoencoders [14], and Fine-Tune Transformers [15].
These approaches can achieve optimal detection accuracy on experimental datasets. However, their performance is significantly degraded, or they are even rendered ineffective, when confronted with degraded datasets affected by various interferences such as compression, blur and noise. This necessitates the development of robust detection methods to mitigate the severe impact of data degradation. To address this issue, Liu et al. [16] proposed a deep learning model for face anti-spoofing based on convolutional neural networks, which is capable of classifying faces in videos as real or fake. To mitigate data degradation, the authors introduced an auxiliary supervision task that enables the model to learn specific features that enhance its robustness. He et al. [17] proposed ForgeryNet, a deep learning model capable of detecting not only deepfake but also other types of manipulated images. To tackle data degradation, the authors employed a "deep fusion" technique that integrates information from multiple data sources to enhance robustness.

However, as deepfake technology continues to advance, the generated fake faces are becoming increasingly realistic, which places higher demands on deepfake detection methods. Moreover, with more extensive application requirements emerging, it is imperative to develop more robust methods capable of handling various complex scenarios, such as detecting degraded data. This presents numerous difficulties and challenges for deepfake detection.

In this study, we present a novel deep learning-based approach for detecting deepfake face images with high accuracy and robustness. The proposed method employs a dual-stream network architecture. The first stream extracts low-level features from the RGB color space; it contains a Coordinate Attention (CA) [18] module that guides the network towards high-level semantic features at both the channel and spatial levels and emphasizes the common features behind various content-preserving manipulations. The other stream focuses on the texture details of the image and highlights the distinctions introduced by deepfake through Local Binary Patterns (LBP) [19] feature extraction. Finally, we use an attention fusion module to fuse the two feature streams and distinguish deepfake images from authentic images. The effectiveness of this approach has been thoroughly validated through experiments on numerous datasets.

II. THE PROPOSED METHOD

A. Motivation

Thanks to the powerful generative ability of GAN, deepfake technology can generate outputs with the same distribution as the input. Therefore, the inter-class difference between deepfake face images and real face images is much smaller than that encountered in classification tasks in the pattern recognition domain. Traditional classification networks struggle to accurately differentiate deepfake face images from real face images, while many deep learning-based detection methods lack robustness.

To address this issue, we design the solution at two levels. Firstly, we use the LBP feature as part of the network input. This has two benefits: on the one hand, it increases the diversity of samples, and on the other hand, the LBP feature guides the network model to pay more attention to the specific features generated by deepfake, improving the detection accuracy of the algorithm. Secondly, the CA mechanism is introduced to guide the network to focus on high-level semantic features at the channel and spatial levels, so as to avoid the interference of various content-preserving manipulations and extract the common features behind them. To this end, we employ a two-stream network architecture. The first stream is an attention network based on the residual structure, which extracts the common features behind content-preserving manipulations from the original images. The other is a feature extraction network based on depthwise separable convolution, which extracts texture details from the LBP space to enhance the differences in texture features, improving the accuracy and robustness of deepfake detection.

B. Network Framework

In this section, we describe the proposed deepfake face detection model. The proposed model mainly includes three modules: a Global Residual Attention module (GRA), a Texture Feature Saliency module (TFS), and an Attention Feature Fusion module (AFF). We name it GTA-Net. The architecture of GTA-Net is illustrated in Fig.1.

Fig. 1: Network Framework of GTA-Net
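To make the two-stream layout concrete, the following is a minimal PyTorch sketch of how such an architecture could be assembled. The backbone choices, channel sizes and names (GTANetSketch, rgb_branch, lbp_branch) are illustrative assumptions for exposition, not the authors' released implementation; the gate mirrors the attention-based fusion described below.

import torch
import torch.nn as nn

class GTANetSketch(nn.Module):
    """Illustrative two-stream skeleton: an RGB branch and an LBP texture branch
    whose features are blended by a learned sigmoid gate before a real/fake classifier."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Stream 1 (GRA branch, simplified): residual-style convolutional encoder on the RGB input.
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.BatchNorm2d(feat_dim), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Stream 2 (TFS branch, simplified): depthwise-separable convolutions on the LBP map.
        self.lbp_branch = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1, groups=64),   # depthwise
            nn.Conv2d(64, feat_dim, 1), nn.ReLU(),                  # pointwise
            nn.AdaptiveAvgPool2d(1),
        )
        # Learned gate weighting the two streams, in the spirit of the AFF module below.
        self.gate = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim, 2)   # real vs. fake

    def forward(self, rgb, lbp):
        f_o = self.rgb_branch(rgb).flatten(1)
        f_l = self.lbp_branch(lbp).flatten(1)
        m = self.gate(f_l + f_o)
        fused = m * f_l + (1 - m) * f_o
        return self.classifier(fused)

logits = GTANetSketch()(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))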
1) Global Residual Attention Module: Deepfake detection requires the extraction of the specific features introduced by GAN, while convolutional neural networks typically process the global image uniformly and focus on semantic features; they are therefore incapable of accurately capturing crucial forged regions and specific features. The attention mechanism, however, can dynamically select and weight different feature representations based on the input data, augmenting the capacity of the model to express and differentiate significant features. To this end, we incorporate a CA mechanism into the residual structure to design the GRA module.

GRA is a residual structure that incorporates CA, as shown in Fig.1. Here, CA is intended to solve the problem of spatial information being ignored during feature extraction. By exploiting the relationships between pixels, CA not only attends to channel information but also performs a weighted average over pixels at different positions in the image, thereby capturing spatial information and improving the performance of the model.

CA encodes channel relationships and long-range dependencies through precise positional information, which is achieved in two steps: coordinate information embedding and coordinate attention generation. The structure of CA is illustrated in Fig.2.

Coordinate Information Embedding: To enable the attention module to capture spatial long-range dependencies with greater precision, we decompose global average pooling into a pair of one-dimensional feature encoding operations.
Specifically, each channel of the input X = [x_1, x_2, \dots, x_C] \in \mathbb{R}^{C \times H \times W} is first encoded along the horizontal and vertical coordinate directions using pooling kernels of dimensions (H, 1) and (1, W), respectively. As a result, the output of the c-th channel at height h can be expressed as eq.(1):

z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i),  (1)

Similarly, the output of the c-th channel at width w can be expressed as eq.(2):

z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w),  (2)

The above two transformations perform feature aggregation along the two spatial directions, enabling the attention module to capture long-range dependencies along one spatial direction while preserving precise location information along the other. This enhances the capacity of the network to accurately locate the target of interest. Note that the coordinate information embedding operation corresponds to the W Avg Pool and H Avg Pool blocks depicted in Fig.2.

Fig. 2: The diagram of the Coordinate Attention mechanism

Coordinate Attention Generation: To leverage the representation with a global receptive field and accurate positional information generated by the coordinate information embedding step, we introduce the coordinate attention generation operation. Firstly, the two feature maps z^h and z^w generated by the previous step are concatenated and sent to a shared 1 × 1 convolution transform F_1, as expressed in eq.(3):

f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right),  (3)

Here, [\cdot, \cdot] represents the concatenation operation along the spatial dimension, \delta denotes a non-linear activation function, and f \in \mathbb{R}^{C/r \times (H+W)} characterizes an intermediate feature map with spatial information in both the horizontal and vertical directions, where r denotes the down-sampling ratio that controls the module size.

Secondly, we split the tensor f into two separate tensors f^h \in \mathbb{R}^{C/r \times H} and f^w \in \mathbb{R}^{C/r \times W} along the spatial dimension. Then, two 1 × 1 convolutional filters F_h and F_w are applied to transform the feature maps f^h and f^w to match the number of channels of the input X, as expressed in eq.(4) and eq.(5):

g^h = \sigma\left(F_h\left(f^h\right)\right),  (4)

g^w = \sigma\left(F_w\left(f^w\right)\right),  (5)

Here, \sigma is the sigmoid function. Expanding g^h and g^w as attention weights, the final output of the CA module can be expressed as eq.(6); the visualization is shown in Fig.3.

y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j),  (6)

Fig. 3: Visual effect generated by GRA module
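As a concrete reading of eqs.(1)-(6), the following is a minimal PyTorch sketch of a coordinate attention block and of a residual block that wraps it (the GRA idea). The reduction ratio r, the BatchNorm/ReLU placement and the surrounding residual body are assumptions made for this example; the attention computation follows the CA formulation of Hou et al. [18] rather than the authors' exact implementation.

import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention: directional average pooling (eqs. (1)-(2)), a shared
    1x1 transform (eq. (3)), per-direction 1x1 convs with sigmoid (eqs. (4)-(5)),
    and reweighting of the input (eq. (6))."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, mid, 1)            # shared F_1
        self.bn = nn.BatchNorm2d(mid)                       # "BatchNorm + non-linear" block of Fig. 2
        self.act = nn.ReLU()
        self.conv_h = nn.Conv2d(mid, channels, 1)           # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)           # F_w

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1): eq. (1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1): eq. (2)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # eq. (3)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                            # eq. (4)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))        # eq. (5)
        return x * g_h * g_w                                 # eq. (6), broadcast over H and W

class GRABlock(nn.Module):
    """Residual block with coordinate attention on the residual path (the GRA idea)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.ca = CoordAttention(channels)

    def forward(self, x):
        return torch.relu(x + self.ca(self.body(x)))

y = GRABlock(64)(torch.randn(1, 64, 56, 56))   # -> (1, 64, 56, 56)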
2) Texture Feature Saliency Module: As deepfake technology often introduces subtle traces or unnatural textures into synthetic facial images, we utilize local, detail-specific texture features as the basis for discrimination when detecting and distinguishing deepfake faces. To achieve this, we employ LBP as the input of the texture feature extraction stream.

Fig. 4: Original image and LBP texture feature image

LBP is an effective texture descriptor that offers significant advantages such as rotation, gray-scale and illumination invariance. It can measure and extract local texture information from images without considering their overall structure, as illustrated in Fig.4.
In this study, we utilize the center-pixel LBP (as shown in Fig.5) to describe image texture information.

Fig. 5: LBP schematic

\mathrm{LBP}(x_c, y_c) = \sum_{p=1}^{8} s\left(I(p) - I(c)\right) \times 2^p,  (7)

In eq.(7), (x_c, y_c) represents the coordinates of the central pixel, while p denotes the p-th pixel point in a 3 × 3 window other than the central one. I(c) refers to the grayscale value of the central pixel and I(p) to that of the p-th pixel within its neighborhood. The sign function s(x) can be expressed as eq.(8):

s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & \text{otherwise} \end{cases}  (8)

We extract the LBP features of the input face image X on its R, G, and B channels, respectively. Considering the inter-channel correlations, we integrate the acquired feature maps to generate more comprehensive texture features. The specific process is shown in Fig.6.

Fig. 6: Pre-processing process: (1) input X, (2) RGB grayscale maps, (3) LBP feature maps, (4) fused LBP feature map Y
purposes.
TABLE I: The splitting of datasets
1ȽInput X 2ȽRGB grayscale map 3ȽLBP features map 4ȽFused LBP feature map Y

Fig. 6: Pre-processing process Datasets Training set Validation set Test set
Yellow race face(StyleGAN2) 1620 540 540
3) Attention Feature Fusion Module: To enhance the per- Star face(StyleGAN2) 1620 540 540
ception for key features and improve the sensitivity and Fake Internet celebrity face(StyleGAN2) 1620 540 540
Entire face synthesis(StyleGAN) 1620 540 540
discrimination ability of the model, we propose incorporating Entire face synthesis(PGGAN) 1620 540 540
an attention mechanism into the feature fusion process to cap- Real Yellow race face(FFQH) 2700 900 900
ture more comprehensive information and semantics, thereby Real Real Star face(FFQH) 2700 900 900
Real celebrity face(FFQH) 2700 900 900
enhancing the accuracy of our model. To achieve this goal,
we employ the AFF module to enhance the nonlinearity of B. Evaluation metric
the model by processing the output from both branches of the To assess the effectiveness of the proposed method, we
network. Compared with other feature fusion methods, AFF employed True Positive, True Negative, False Positive, False
can dynamically determine the significance of each feature Negative, Precision, Accuracy, Area Under The Curve, Recall
during fusion, thereby leveraging diverse information more and F 1 − score as evaluation metrics. To quantify the evalua-
effectively. tion, we utilize T P , T N , F P , and F N to compute precision,
Specifically, for two features FL and FO , we use AFF to accuracy, recallrate, F 1−score as well as false positive rate
compute fused feature. It can be expressed in eq.(9). and true positive rate required for calculating the area under
F = M (FL ⊕ FO ) ⊗ FL + (1 − M (FL ⊕ FO )) ⊗ FO , (9) the curve.
TP
Here, ⊕ denotes element-wise addition. The symbol ⊗ denotes P recison = , (10)
the operation of element-wise multiplication. M stands for the TP + FP
weights that are adapted by training the network. The detailed TP + TN
Accuracy = , (11)
calculation procedure is illustrated in Fig.7. TP + FP + FN + TN
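As a concrete reading of eq.(9), the following is a minimal PyTorch sketch of an attention feature fusion block in which the gate M is produced from the element-wise sum of the two features by a local and a global point-wise-convolution branch followed by a sigmoid, loosely following the point-wise convolution and global average pooling blocks indicated in Fig.7. The channel reduction ratio and the two-branch structure of the gate are assumptions made for this example.

import torch
import torch.nn as nn

class AFF(nn.Module):
    """Attentional feature fusion (eq. (9)): a gate M(F_L + F_O) in [0, 1] is computed
    from the element-wise sum of the two features and used to blend F_L and F_O."""
    def __init__(self, channels, r=4):
        super().__init__()
        mid = max(4, channels // r)
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(),
            nn.Conv2d(mid, channels, 1),
        )
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.ReLU(),
            nn.Conv2d(mid, channels, 1),
        )

    def forward(self, f_l, f_o):
        s = f_l + f_o                                               # element-wise addition (⊕)
        m = torch.sigmoid(self.local_att(s) + self.global_att(s))   # gate M
        return m * f_l + (1 - m) * f_o                              # eq. (9), ⊗ is element-wise product

fused = AFF(256)(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))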
III. EXPERIMENT RESULTS AND PERFORMANCE ANALYSIS

In this section, we evaluate and analyze the performance of the proposed method through experiments. The experiments are performed on an NVIDIA GeForce RTX3090 computer using the PyTorch learning framework. Firstly, we introduce the datasets used in the experiments. Secondly, we introduce the performance evaluation metrics and present the experimental results of the proposed method. Thirdly, we assess the robustness of the proposed method against content-preserving manipulations such as Gaussian noise, JPEG compression, random distortion blocks, etc. Finally, we conduct ablation experiments to validate the efficacy of the proposed modules.

A. Datasets and data preprocessing

To assess the efficacy of the proposed approach, we collected a dataset named the Mixed Face Image dataset (MFI). MFI contains face images from the FFHQ [20], StyleGAN [20], StyleGAN2 [21], and PGGAN [22] datasets, as shown in Tab.I. Prior to model training, input images underwent uniform preprocessing, involving normalization to a size of 224 × 224 followed by random flipping for data augmentation.

TABLE I: The splitting of datasets

Class   Datasets                              Training set   Validation set   Test set
Fake    Yellow race face (StyleGAN2)          1620           540              540
Fake    Star face (StyleGAN2)                 1620           540              540
Fake    Internet celebrity face (StyleGAN2)   1620           540              540
Fake    Entire face synthesis (StyleGAN)      1620           540              540
Fake    Entire face synthesis (PGGAN)         1620           540              540
Real    Yellow race face (FFHQ)               2700           900              900
Real    Star face (FFHQ)                      2700           900              900
Real    Celebrity face (FFHQ)                 2700           900              900
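For reference, a minimal torchvision-style sketch of the pre-processing described above (resizing to 224 × 224, random flipping and tensor conversion) is given below. The directory layout ("mfi/train/{real,fake}") and the use of ImageFolder are assumptions made for illustration only.

import torch
from torchvision import transforms, datasets

# Uniform pre-processing: resize to 224 x 224, random horizontal flip for
# augmentation, and conversion to a tensor in [0, 1].
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Assumes a hypothetical ImageFolder layout such as mfi/train/{real,fake}/*.png.
train_set = datasets.ImageFolder("mfi/train", transform=train_transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)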

B. Evaluation metrics

To assess the effectiveness of the proposed method, we employ True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Precision, Accuracy, Recall, F1-score and the Area Under the Curve (AUC) as evaluation metrics. To quantify the evaluation, we use TP, TN, FP, and FN to compute the precision, accuracy, recall and F1-score, as well as the false positive rate and true positive rate required for calculating the area under the curve.

\mathrm{Precision} = \frac{TP}{TP + FP},  (10)

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN},  (11)

\mathrm{Recall} = \frac{TP}{TP + FN},  (12)

F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.  (13)
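The metrics in eqs.(10)-(13) follow directly from the confusion-matrix counts; a small self-contained Python sketch is given below. The example counts are illustrative only, and the AUC additionally requires sweeping a decision threshold to obtain the true-positive/false-positive rate curve, which is omitted here.

def detection_metrics(tp, tn, fp, fn):
    """Precision, accuracy, recall and F1-score as defined in eqs. (10)-(13)."""
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "accuracy": accuracy, "recall": recall, "f1": f1}

# Hypothetical counts: 990 fakes caught, 985 reals accepted, 15 false alarms, 10 misses.
print(detection_metrics(tp=990, tn=985, fp=15, fn=10))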
C. Detection accuracy and comparative analysis

To comprehensively evaluate the proposed approach, we investigate the detection accuracy and compare it with Inception [23], ResNet18 [24], Meso4 [25], ResNet50 [24] and SE-ResNet56 [26]; the results are shown in Tab.II. As can be seen from Tab.II, the proposed method has superior performance in terms of Accuracy, Recall, F1-score and AUC.

TABLE II: The detection accuracy and compared results

Methods       Accuracy   Recall   F1-score   AUC
Inception     0.9324     0.9203   0.9202     0.9901
ResNet18      0.9670     0.9895   0.9625     0.9924
Meso4         0.7137     0.4383   0.5967     0.9635
ResNet50      0.9120     0.9725   0.9150     0.9832
SE-ResNet56   0.9851     0.9925   0.9853     0.9987
Ours          0.9990     0.9987   0.9990     0.9991

D. Robustness evaluation

Robustness refers to the capacity of the algorithm to withstand interference. To assess the robustness of the proposed method, we investigate six types of disturbances: contrast alteration, JPEG compression, Gaussian blurring, additive Gaussian noise, saturation modification, and random insertion of distortion blocks. Each disturbance is further divided into three levels ranging from mild to severe. We apply these interference operations to the test datasets and then evaluate the Accuracy and AUC; the detection results are presented in Tab.III. Fig.8 illustrates an instance of each interference operation.

Fig. 8: Examples of content-preserving manipulations

TABLE III: Robustness evaluation of content-preserving manipulations

Type of operation   Experimental factor   Accuracy   AUC
Compression         0.3                   0.9961     0.9989
Compression         0.5                   0.9961     0.9989
Compression         0.7                   0.9766     0.9980
Saturation          0.5                   0.9766     0.9940
Saturation          1.0                   0.9987     0.9988
Saturation          1.5                   0.9102     0.9759
Blur                1                     0.9961     0.9973
Blur                2                     0.9840     0.9837
Blur                3                     0.9787     0.9791
Contrast ratio      0.3                   0.9922     0.9954
Contrast ratio      0.5                   0.9805     0.9946
Contrast ratio      0.7                   0.9648     0.9824
Distortion block    0.3                   0.9922     0.9940
Distortion block    0.5                   0.9883     0.9936
Distortion block    0.7                   0.9883     0.9936
Gaussian noise      0.1                   0.9922     0.9927
Gaussian noise      0.3                   0.9570     0.9910
Gaussian noise      0.5                   0.8828     0.9906

Experimental results demonstrate that the proposed model achieves high accuracy and AUC values at lower levels of Gaussian noise. However, as the level of Gaussian noise increases, there is a significant decrease in both accuracy and AUC; this indicates that the proposed method possesses some degree of robustness to Gaussian noise. For blur interference and JPEG compression, the accuracy and AUC metrics remain relatively stable, demonstrating that the proposed method adapts well to blur manipulation and JPEG compression. The model also demonstrates consistent performance across the various contrast levels, indicating robust anti-jamming properties that enable efficient adaptation to diverse scenarios. The proposed method exhibits slightly varying performance for saturation changes: optimal results are achieved for low saturation, but accuracy and AUC decrease significantly for high saturation. All considered, the proposed method shows satisfactory results for the various interference operations, which demonstrates that it can be applied in a variety of practical scenarios.
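To make such degradation settings reproducible, the following is a minimal Pillow/NumPy sketch of three of the six disturbances (JPEG compression, Gaussian blurring and additive Gaussian noise). The paper does not spell out how the experimental factors in Tab.III map to concrete parameters for each operation, so the parameter values and units here are illustrative assumptions.

import io
import numpy as np
from PIL import Image, ImageFilter

def jpeg_compress(img, quality=50):
    """Re-encode the image as JPEG at the given quality to simulate compression artifacts."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_blur(img, radius=2):
    """Gaussian blurring with the given kernel radius."""
    return img.filter(ImageFilter.GaussianBlur(radius=radius))

def gaussian_noise(img, sigma=0.1):
    """Additive Gaussian noise with standard deviation sigma (in [0, 1] intensity units)."""
    arr = np.asarray(img, dtype=np.float32) / 255.0
    noisy = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0.0, 1.0)
    return Image.fromarray((noisy * 255).astype(np.uint8))

img = Image.new("RGB", (224, 224), color=(128, 128, 128))   # placeholder face image
degraded = [jpeg_compress(img, 50), gaussian_blur(img, 2), gaussian_noise(img, 0.1)]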
E. Generalization assessment

The generalization ability of a detection algorithm is extremely important, as it determines whether the algorithm can be reliable and effective in practical applications. To evaluate the generalization of the proposed method, we perform cross-dataset detection: we first train the proposed model on the StyleGAN2 dataset, and then test it on the PGGAN, StyleGAN, AttGAN [27], FaceForensics++ (FF++) [28], GDWCT and CelebA datasets, respectively. The detection results are shown in Tab.IV.

TABLE IV: Cross-dataset evaluation

Dataset    AUC      Recall
StyleGAN   0.9748   0.9975
PGGAN      0.9612   0.9922
AttGAN     0.9428   0.9844
FF++       0.8171   0.9766
CelebA     0.8721   0.9798
GDWCT      0.9017   0.9812

As can be seen from Tab.IV, our method shows good performance on the StyleGAN, PGGAN, AttGAN and FF++ datasets, which demonstrates that the proposed method has strong generalization. It can be applied in important practical scenarios such as the security field, face recognition access control systems, identity verification, etc.

F. Ablation analysis

To assess the efficacy of the model components, we conducted ablation experiments on the MFI dataset. Initially, we removed the TFS module from the proposed GTA-Net and performed testing with identical experimental parameters. The results indicate that the Accuracy and AUC decreased by 0.0467 and 0.0372, respectively. This highlights the indispensability of the TFS module for enhancing detection accuracy. Secondly, in the same experiment, the attention residual module was replaced by a ResNet50 residual block.
The results show that the Accuracy and AUC decreased by 0.0273 and 0.0105, respectively, which demonstrates that the GRA module is effective for improving the detection performance of the model. Finally, the same experiment was carried out after removing the AFF module and replacing it with a feature sum. The results show that the Accuracy and AUC decreased by 0.0110 and 0.0093, respectively, which proves that the AFF module plays an effective role in accurately capturing the key features of the face. All in all, as shown in Tab.V, the proposed method achieves satisfactory results.

TABLE V: Ablation study

TFS   GRA   AFF   Accuracy            AUC
×     √     √     0.9516 (↓0.0467)    0.9628 (↓0.0372)
√     ×     √     0.9710 (↓0.0273)    0.9895 (↓0.0105)
√     √     ×     0.9873 (↓0.0110)    0.9907 (↓0.0093)
√     √     √     0.9990              0.9991

IV. CONCLUSION

In this study, we propose a two-stream network model that can accurately distinguish deepfake face images from authentic face images while maintaining strong robustness against degraded datasets and good generalization in cross-dataset detection. The proposed approach comprises the GRA, TFS, and AFF modules. The GRA module enhances the perception capacity of the model for key facial features by capturing spatial information while simultaneously focusing on channel information. TFS extracts the specific features introduced by deepfake from texture information to enhance the accuracy of deepfake face detection. AFF adaptively assigns network weights, enabling the network to accurately capture the key features required for the task and improving the discrimination ability of the model toward deepfake face images. Experimental results demonstrate that the proposed method exhibits high detection accuracy and strong robustness, and can be effectively applied in various significant scenarios.
REFERENCES

[1] E. Gonzalez-Sosa, J. Fierrez, R. Vera-Rodriguez, and F. Alonso-Fernandez, "Facial soft biometrics for recognition in the wild: Recent works, annotation, and cots evaluation," IEEE Transactions on Information Forensics and Security, vol. 13, no. 8, pp. 2001–2014, 2018.
[2] P. Korshunov and S. Marcel, "Deepfakes: a new threat to face recognition? assessment and detection," arXiv preprint arXiv:1812.08685, 2018.
[3] M. Pawelec, "Deepfakes and democracy (theory): how synthetic audio-visual media for disinformation and hate speech threaten core democratic functions," Digital Society, vol. 1, no. 2, p. 19, 2022.
[4] L. Guarnera, O. Giudice, and S. Battiato, "Fighting deepfake by exposing the convolutional traces on images," IEEE Access, vol. 8, pp. 165085–165098, 2020.
[5] Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping artifacts," arXiv preprint arXiv:1811.00656, 2018.
[6] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face manipulations," in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). IEEE, 2019, pp. 83–92.
[7] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain, "On the detection of digital face manipulation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5781–5790.
[8] H. Zhao, W. Zhou, D. Chen, T. Wei, W. Zhang, and N. Yu, "Multi-attentional deepfake detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2185–2194.
[9] T. Hu, H. Qi, Q. Huang, and Y. Lu, "See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification," arXiv preprint arXiv:1901.09891, 2019.
[10] C. Wang and W. Deng, "Representative forgery mining for fake face detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14923–14932.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[12] H. H. Nguyen, J. Yamagishi, and I. Echizen, "Capsule-forensics: Using capsule networks to detect forged images and videos," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2307–2311.
[13] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," Advances in Neural Information Processing Systems, vol. 30, 2017.
[14] D. Cozzolino, J. Thies, A. Rössler, C. Riess, M. Nießner, and L. Verdoliva, "Forensictransfer: Weakly-supervised domain adaptation for forgery detection," arXiv preprint arXiv:1812.02510, 2018.
[15] H. Jeon, Y. Bang, and S. S. Woo, "Fdftnet: Facing off fake images using fake detection fine-tuning network," in IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, 2020, pp. 416–430.
[16] Y. Liu, A. Jourabloo, and X. Liu, "Learning deep models for face anti-spoofing: Binary or auxiliary supervision," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[17] Y. He, B. Gan, S. Chen, Y. Zhou, G. Yin, L. Song, L. Sheng, J. Shao, and Z. Liu, "Forgerynet: A versatile benchmark for comprehensive forgery analysis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4360–4369.
[18] Q. Hou, D. Zhou, and J. Feng, "Coordinate attention for efficient mobile network design," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13713–13722.
[19] T. Ojala, M. Pietikainen, and D. Harwood, "Performance evaluation of texture measures with classification based on kullback discrimination of distributions," in Proceedings of 12th International Conference on Pattern Recognition, vol. 1. IEEE, 1994, pp. 582–585.
[20] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
[21] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of stylegan," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8110–8119.
[22] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of gans for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "Mesonet: a compact facial video forgery detection network," in 2018 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2018, pp. 1–7.
[26] X. Wang, Z. Zhao, C. Zhang, N. Bai, and X. Hu, "Se-resnet56: Robust network model for deepfake detection," in International Workshop on Digital Watermarking. Springer, 2022, pp. 37–52.
[27] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, "Attgan: Facial attribute editing by only changing what you want," IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5464–5478, 2019.
[28] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "Faceforensics++: Learning to detect manipulated facial images," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1–11.
