License: arXiv.org perpetual non-exclusive license
arXiv:2604.21478v1 [cs.CV] 23 Apr 2026

Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

Yuhan Luo*, Tao Chen*, Decheng Liu

Yuhan Luo, Tao Chen and Decheng Liu are with the School of Cyber Engineering, Xidian University, Xi’an 710071, Shaanxi, P. R. China (e-mail: dchliu@xidian.edu.cn).
Abstract

With the rapid development of generative models, visual data forgery detection plays an increasingly important role in social and economic security. Existing face forgery detectors still cannot achieve satisfactory performance because they generalize poorly across datasets. A key factor behind this is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal that detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that computes AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). Evaluating representative detectors under Cross-AUC reveals substantial performance drops, exposing an overlooked robustness problem. We also propose a novel framework, Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP’s sensitivity to manipulation artifacts and a facial region mixture-of-experts module that routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on public datasets show that the proposed method achieves superior performance compared with state-of-the-art methods under various metrics. The source code is available at GitHub.

I Introduction

In recent years, face forgery detection has drawn increasing attention due to its societal and economic implications, and numerous deepfake detection methods[31; 16; 30; 2] have been developed. Despite steady progress, reliable deployment in open-world scenarios remains challenging because models often fail to generalize across datasets. A common practice evaluates cross-domain generalization by training on FF++[19] and reporting AUC on other benchmarks. However, AUC is invariant to monotonic score transformations and therefore cannot reveal whether detector scores are comparable across domains. In practice, dataset-specific biases (e.g., compression pipelines, camera characteristics, post-processing) may cause systematic shifts in the score distribution, leading to inconsistent operating thresholds and unreliable decisions when the test domain is unknown. As illustrated in Fig. 1, samples can be well separated within each dataset while exhibiting markedly different score ranges across datasets.

Figure 1: Prediction score distributions of authentic and forged samples across FF++, WDF, and Celeb-DF datasets, showing intra-dataset separability but severe cross-domain distribution shifts.

To explicitly measure cross-domain score comparability, we propose Cross-AUC, an evaluation protocol that computes AUC not only within each dataset, but also across datasets by contrasting real samples from one dataset against fake samples from another (and vice versa), and averaging over dataset pairs. Benchmarking representative detectors under Cross-AUC reveals substantial drops compared to standard within-dataset AUC, indicating an overlooked vulnerability of current evaluation and training practices. To improve robustness under Cross-AUC, we further introduce the Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM) framework, which discourages reliance on dataset-specific shortcuts. Specifically, we design a patch-level image-text alignment module (PaITA) that enhances CLIP’s sensitivity to subtle manipulation artifacts, and introduce the facial region mixture-of-experts (FaRMoE), an MoE-style routing mechanism that assigns features from different facial regions to specialized experts for targeted mining of region-specific forgery cues.

The main contributions of our paper are summarized as follows:

  • We propose a novel metric, Cross-AUC, which evaluates cross-domain score comparability by computing AUC across dataset pairs. Evaluating existing SOTA detectors under Cross-AUC reveals substantial performance drops for most of them.

  • We further design the Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM) framework to improve generalization under the Cross-AUC metric. The patch-level image-text alignment module enhances CLIP’s sensitivity to subtle manipulation artifacts, and the facial region mixture-of-experts module assigns features from different facial regions to specialized experts for targeted mining of region-specific forgery cues.

  • Experimental results on representative public datasets illustrate the superior performance of the proposed SFAM compared with state-of-the-art face forgery detection methods.

II Related Work

II-A ViT-based Forgery Detection

Face forgery detection has evolved from handcrafted feature methods to deep learning-driven paradigms. With the development of convolutional neural networks (CNNs), CNN-based methods became the mainstream of deepfake detection research. Chollet et al.[4] proposed Xception, which became the most widely used backbone for early forgery detection; Tan et al.[22] further put forward the EfficientNet series for a better accuracy-efficiency balance, and Lin et al.[14] applied dynamic data augmentation strategies. However, limited by their inherently local receptive field, CNNs struggle to detect high-quality forgeries with subtle artifacts.

Figure 2: The workflow of the proposed SFAM framework.

To address this limitation, the Vision Transformer (ViT)[9] has gradually become the core backbone paradigm for face forgery detection. Subsequent studies optimized ViT for forgery detection tasks: Chen et al.[3] proposed a local relation learning framework to fuse RGB and frequency domain features; Luo et al.[15] devised a generalizable model to enhance the capture of high-frequency forgery features. In recent years, cross-modal learning represented by CLIP[18] has become a key research direction, bringing breakthroughs to the interpretability and generalization of forgery detection. Khan et al. first introduced adaptive prompt tuning into CLIP for universal deepfake detection; Zhang et al.[28] designed fine-grained text prompts to help CLIP focus on subtle forgery artifacts; Zhang et al.[29] constructed the DD-VQA dataset to train a BLIP-based model for both authenticity judgment and interpretable explanation. Meanwhile, several works[21; 27; 10] have sought to improve CLIP itself. These works verified the potential of ViT models, but existing methods still lack targeted optimization for heterogeneous feature extraction across facial regions and for semantically fine-grained detection.

II-B Evaluation Metrics in Deepfake Detection

Evaluation metrics are the core guidance for the development of face forgery detection technologies. Early research directly adopted general metrics including Accuracy, Precision, Recall and F1-score, which remain auxiliary indicators in most studies. However, due to the widespread class imbalance in forgery detection datasets, these threshold-sensitive metrics cannot comprehensively reflect the model’s overall detection ability.

To solve this problem, the Area Under the Receiver Operating Characteristic Curve (AUC) has gradually become the core evaluation metric in this field. AUC comprehensively quantifies the model’s ability to distinguish fake and real samples and is robust to imbalanced datasets. Rössler et al.[19] took AUC as the core metric in the FaceForensics++ benchmark, and it has become the most widely accepted standard. Subsequently, mainstream benchmarks including Celeb-DF[13], DFDC[7] and DFDCP[8] all adopted AUC as the primary evaluation indicator. Although researchers have supplemented AUC with a series of targeted metrics, the current evaluation system still relies heavily on intra-dataset AUC and lacks a unified, systematic metric that quantifies the model’s cross-domain generalization ability in real open scenarios.

III Methodology

Built upon the pre-trained CLIP model as the backbone, our approach aims to improve the robustness of face forgery detection by encouraging the model to capture semantically meaningful manipulation cues rather than relying on dataset-specific artifacts. Fig. 2 illustrates the overall architecture of the proposed framework. To achieve this goal, we introduce two key components. First, we propose PaITA, a patch-level image–text alignment module that enhances CLIP’s sensitivity to subtle manipulation artifacts through fine-grained visual–semantic alignment. Second, we design FaRMoE (Facial Region Mixture-of-Experts), which introduces region-aware expert routing to better capture organ-specific forgery patterns by leveraging specialized feature experts for different facial regions. In the following sections, we first briefly introduce the CLIP backbone used in our framework. We then present the mask-guided augmentation and the designs of FaRMoE and PaITA in detail, followed by the joint loss formulation.

III-A Preliminaries: Model Backbone

Our framework is built upon the pre-trained CLIP model, which learns aligned visual–text representations from large-scale image–text data and provides strong semantic priors for visual understanding. However, directly applying CLIP to forgery detection has two limitations. First, CLIP performs image–text alignment mainly at the global level, which may overlook subtle local manipulation artifacts. Second, the vision encoder applies a shared transformation to all patches, lacking region-aware specialization for different facial components. To address these issues, we introduce two complementary modules. PaITA extends CLIP with patch-level image–text alignment to capture fine-grained manipulation cues, while FaRMoE introduces region-aware expert routing to model organ-specific forgery patterns.

III-B Mask-Guided Hybrid Data Augmentation

To encourage the model to focus on manipulated facial regions during training, we introduce a mask-guided hybrid augmentation strategy that synthesizes diverse forged samples while providing explicit spatial supervision.

Given a real image $I_{r}$ and a fake image $I_{f}$, we generate a binary authenticity mask $M\in\{0,1\}^{H\times W}$ based on facial landmarks. Several facial regions (e.g., eyes, nose, mouth, or half-face) are randomly selected as manipulated regions. Using the mask, a synthetic forged image $\tilde{I}$ is constructed as

$$\tilde{I}=M\odot I_{f}+(1-M)\odot I_{r}, \tag{1}$$

where $\odot$ denotes element-wise multiplication.
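As a minimal sketch of Eq. (1), the blending can be written with plain Python lists. The function name `blend_with_mask` and the toy pixel values are illustrative assumptions rather than the authors' implementation; a real pipeline would operate on image tensors with a landmark-derived mask.

```python
def blend_with_mask(real, fake, mask):
    """Eq. (1): take fake pixels where mask == 1 and real pixels elsewhere."""
    return [
        [m * f + (1 - m) * r for r, f, m in zip(real_row, fake_row, mask_row)]
        for real_row, fake_row, mask_row in zip(real, fake, mask)
    ]


# Toy 2x3 single-channel "images" (values are illustrative only).
real = [[10, 10, 10], [10, 10, 10]]
fake = [[99, 99, 99], [99, 99, 99]]
mask = [[0, 1, 1], [0, 0, 1]]  # 1 marks a manipulated pixel

blended = blend_with_mask(real, fake, mask)
print(blended)
```

The same function also covers a soft mask, since a mask value between 0 and 1 mixes the two pixels proportionally.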

To further increase appearance diversity, we incorporate Self-Blended Images (SBI)[20]. A transformed version of the real image $T(I_{r})$ is blended with the original image using a smoothed mask $\tilde{M}$:

$$\tilde{I}=\tilde{M}\odot T(I_{r})+(1-\tilde{M})\odot I_{r}. \tag{2}$$

During training, the two augmentation strategies are randomly sampled to generate diverse forged samples. The generated mask $M$ explicitly indicates manipulated regions and will later be downsampled to the patch resolution to supervise patch-level alignment.
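The text does not specify how the mask is downsampled, so the following is only one plausible sketch: it assumes average pooling over non-overlapping square patches followed by a 0.5 threshold. The function name `mask_to_patch_labels`, the `threshold` parameter, and the toy mask are all hypothetical.

```python
def mask_to_patch_labels(mask, patch, threshold=0.5):
    """Average-pool a binary pixel mask over patch x patch blocks.

    Assumes the mask height and width are multiples of `patch`.
    A patch is labelled forged (1) when at least `threshold` of its
    pixels are marked as manipulated. Labels are in row-major order.
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            cells = [
                mask[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ]
            labels.append(1 if sum(cells) / len(cells) >= threshold else 0)
    return labels


# Toy 4x4 mask split into four 2x2 patches.
toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
patch_labels = mask_to_patch_labels(toy_mask, patch=2)
print(patch_labels)
```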

III-C Facial Region Mixture-of-Experts

To capture region-specific forgery patterns, we introduce FaRMoE, a facial region Mixture-of-Experts module integrated into the CLIP vision encoder. Fig. 3 illustrates the structure of the vision encoder with the FaRMoE module.

Figure 3: The structure of vision encoder with our proposed FaRMoE module.

Given an input image, the vision encoder divides it into $N$ patches $\{x_{1},x_{2},\dots,x_{N}\}$, $x_{i}\in\mathbb{R}^{d}$. Using facial landmarks, each patch is assigned a region label $r_{i}\in\{1,\dots,K\}$, where $K$ denotes the number of facial regions. FaRMoE consists of a set of region experts $\{E_{1},\dots,E_{K}\}$ that specialize in modeling different facial components. Each patch feature is routed to the corresponding expert according to its region label:

$$z_{i}=E_{r_{i}}(x_{i}). \tag{3}$$

We integrate FaRMoE into the CLIP vision encoder by replacing the key projection in selected self-attention layers:

$$k_{i}=W_{k}x_{i}\quad\rightarrow\quad k_{i}=E_{r_{i}}(x_{i}). \tag{4}$$

This design enables the Transformer to capture organ-specific forgery artifacts while preserving global contextual modeling.
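The routing in Eqs. (3) and (4) can be sketched as a lookup from region label to expert. The expert architecture is not described here, so this sketch assumes each expert is a plain linear map; `matvec`, `route_keys`, the 0-based region indices, and the toy weights are illustrative assumptions.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def route_keys(patches, region_labels, experts):
    """Compute k_i = E_{r_i}(x_i): each patch uses the expert of its region.

    `experts` is indexed by region label (0-based here, 1..K in the text),
    and each expert is assumed to be a single linear projection.
    """
    return [matvec(experts[r], x) for x, r in zip(patches, region_labels)]


# Two toy experts for K = 2 regions and feature dimension d = 2.
experts = [
    [[1, 0], [0, 1]],  # expert for region 0: identity
    [[2, 0], [0, 2]],  # expert for region 1: scale by 2
]
patches = [[1, 2], [3, 4], [5, 6]]
region_labels = [0, 1, 0]

keys = route_keys(patches, region_labels, experts)
print(keys)
```

Because the routing is determined by landmark-based region labels rather than a learned gate, every patch is processed by exactly one expert.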

III-D Patch-Level Image-Text Alignment

To further capture subtle manipulation cues, we extend CLIP’s global image–text alignment to a patch-level visual–semantic alignment framework. Given an input image, the vision encoder outputs a global feature $f_{cls}$ and patch features $\{f_{1},f_{2},\dots,f_{P}\}$. On the text side, we construct two prompts describing authentic and forged content. Let $t_{real}$ and $t_{fake}$ denote their embeddings. The visual–text similarity is computed using cosine similarity

$$s_{real}=\frac{f^{\top}t_{real}}{\|f\|\|t_{real}\|},\quad s_{fake}=\frac{f^{\top}t_{fake}}{\|f\|\|t_{fake}\|}. \tag{5}$$

The forgery probability is then obtained as

$$p=\frac{\exp(s_{fake})}{\exp(s_{fake})+\exp(s_{real})}. \tag{6}$$

Global prediction is computed from the class token $f_{cls}$, while patch-level probabilities $\{p_{i}\}$ are computed for each patch feature $f_{i}$. The augmentation mask is downsampled to the patch resolution to obtain patch labels $M\in\{0,1\}^{B\times P}$.
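A small sketch of Eqs. (5) and (6) follows, written exactly as the equations state them (no temperature scaling is applied, since none appears in the text). The function names and the two-dimensional toy embeddings are assumptions for illustration; the same function applies to the class token or to any single patch feature.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, as in Eq. (5)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def forgery_probability(feature, t_real, t_fake):
    """Two-way softmax over the real/fake similarities, as in Eq. (6)."""
    s_real = cosine(feature, t_real)
    s_fake = cosine(feature, t_fake)
    return math.exp(s_fake) / (math.exp(s_fake) + math.exp(s_real))


# Toy text embeddings (illustrative only).
t_real = [1.0, 0.0]
t_fake = [0.0, 1.0]

p_ambiguous = forgery_probability([1.0, 1.0], t_real, t_fake)  # equally similar
p_fake_like = forgery_probability([0.0, 1.0], t_real, t_fake)  # aligned with t_fake
print(p_ambiguous, p_fake_like)
```

Since cosine similarities lie in [-1, 1], the probability in Eq. (6) as written is bounded between roughly 0.12 and 0.88.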

Intra-image local ranking loss

To encourage higher forgery scores for manipulated patches within a forged image, we impose a ranking constraint $s_{fg}-s_{bg}\geq m$ between forged patches $S_{fg}$ and authentic patches $S_{bg}$.

The loss is defined as

$$L_{rank\_intra}=\frac{1}{|B_{fake}|}\sum_{b\in B_{fake}}\frac{1}{|S_{fg}||S_{bg}|}\sum_{s_{fg}\in S_{fg}}\sum_{s_{bg}\in S_{bg}}\max\left(0,m-\left(s_{fg}-s_{bg}\right)\right). \tag{7}$$
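The hinge structure of Eq. (7) can be sketched as follows. This assumes the scores are per-patch forgery scores and that images lacking either forged or authentic patches are simply excluded from the batch average; both choices, along with the function names and toy numbers, are assumptions and not details confirmed by the text.

```python
def intra_rank_loss(scores, labels, margin):
    """Mean hinge over all (forged, authentic) patch pairs of one image."""
    forged = [s for s, lab in zip(scores, labels) if lab == 1]
    authentic = [s for s, lab in zip(scores, labels) if lab == 0]
    if not forged or not authentic:
        return None  # no valid pair in this image
    hinge = sum(
        max(0.0, margin - (s_fg - s_bg)) for s_fg in forged for s_bg in authentic
    )
    return hinge / (len(forged) * len(authentic))


def batch_intra_rank_loss(batch_scores, batch_labels, margin):
    """Average the per-image loss over images that contain both patch types."""
    per_image = [
        intra_rank_loss(s, lab, margin) for s, lab in zip(batch_scores, batch_labels)
    ]
    valid = [v for v in per_image if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


# One forged image (two forged, two authentic patches) and one real image.
batch_scores = [[0.9, 0.7, 0.2, 0.6], [0.1, 0.2, 0.1, 0.3]]
batch_labels = [[1, 1, 0, 0], [0, 0, 0, 0]]

loss = batch_intra_rank_loss(batch_scores, batch_labels, margin=0.5)
print(loss)
```

In the toy forged image, only the pairs whose score gap is below the margin (0.9 vs. 0.6 and 0.7 vs. 0.6) contribute noticeably to the loss.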

To improve generalization, we further introduce a cross-sample ranking constraint between paired real and fake samples. For a pair $(r,f)$, let $R_{fake}$ denote forged patches in the fake image and $R_{real}$ denote the corresponding patches in the real image.

$$L_{rank\_real\_fake}=\frac{1}{|B_{pair}|}\sum_{(r,f)\in B_{pair}}\frac{1}{|R_{fake}|}\sum_{i}\max\left(0,m-\left(s_{fake}^{i}-s_{real}^{i}\right)\right). \tag{8}$$
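For a single real/fake pair, Eq. (8) compares scores at the same patch positions. The sketch below assumes the sum runs over the forged patch positions indicated by the patch labels; the function name `pair_rank_loss` and the toy scores are hypothetical.

```python
def pair_rank_loss(real_scores, fake_scores, patch_labels, margin):
    """Hinge between aligned patches of a (real, fake) pair.

    Only positions labelled as forged in the fake image are compared,
    and the sum is normalised by the number of such positions.
    """
    forged_positions = [i for i, lab in enumerate(patch_labels) if lab == 1]
    if not forged_positions:
        return 0.0
    hinge = sum(
        max(0.0, margin - (fake_scores[i] - real_scores[i]))
        for i in forged_positions
    )
    return hinge / len(forged_positions)


# Toy pair with three patches; patches 0 and 2 are forged in the fake image.
real_scores = [0.1, 0.3, 0.6]
fake_scores = [0.9, 0.4, 0.8]
patch_labels = [1, 0, 1]

pair_loss = pair_rank_loss(real_scores, fake_scores, patch_labels, margin=0.5)
print(pair_loss)
```

Averaging this quantity over all pairs in a batch gives the batch-level loss of Eq. (8).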

The overall training objective combines global classification and patch-level alignment constraints:

$$L_{total}=L_{cls}+\lambda_{1}L_{rank\_intra}+\lambda_{2}L_{rank\_real\_fake}. \tag{9}$$

Here the global classification loss is defined as

$$L_{cls}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log p_{i}+(1-y_{i})\log(1-p_{i})\right]. \tag{10}$$
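Eqs. (9) and (10) can be combined in a few lines. In this sketch the default weights 0.3 and 0.2 are the best-performing values reported in Table V, while the clamping constant `eps`, the function names, and the toy inputs are assumptions added for illustration and numerical safety.

```python
import math


def bce_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy of Eq. (10); eps clamping is an added safeguard."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def total_loss(l_cls, l_rank_intra, l_rank_real_fake, lam1=0.3, lam2=0.2):
    """Weighted sum of Eq. (9)."""
    return l_cls + lam1 * l_rank_intra + lam2 * l_rank_real_fake


# Toy batch: one fake (y = 1) and one real (y = 0) image.
l_cls = bce_loss([1, 0], [0.9, 0.2])
l_total = total_loss(l_cls, l_rank_intra=0.15, l_rank_real_fake=0.15)
print(l_cls, l_total)
```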

IV Cross-AUC Metric For Forgery Detection

Area Under the Receiver Operating Characteristic Curve (AUC) is widely regarded as the dominant evaluation metric in forgery detection tasks. It quantifies a model’s discriminative ability by comparing positive and negative samples within a single dataset. However, this over-reliance on AUC overlooks a critical flaw: it only reflects model performance on pre-collected datasets, not its effectiveness in real-world detection scenarios. We argue that AUC’s dataset-bound nature leads to misleading evaluations, as it fails to account for the dynamic, complex, and diverse conditions of practical forgery detection. To evaluate cross-domain discrimination ability, we introduce Cross-AUC, which measures whether a detector can distinguish fake samples from one dataset against real samples from another dataset. Let $K$ datasets be denoted as $\{D_{1},\dots,D_{K}\}$, and let $R_{i}$ denote the set of real samples in dataset $D_{i}$ and $F_{j}$ denote the set of fake samples in dataset $D_{j}$. For each dataset pair $(i,j)$ where $i\neq j$, we compute a cross-domain AUC:

$$\text{AUC}_{i,j}=\text{AUC}(R_{i},F_{j}), \tag{11}$$

which measures the model’s ability to distinguish real samples from domain $i$ and fake samples from domain $j$.

The final Cross-AUC score is obtained by averaging over all dataset pairs:

$$\text{Cross-AUC}=\frac{1}{K(K-1)}\sum_{i\neq j}\text{AUC}(R_{i},F_{j}). \tag{12}$$

Compared with conventional AUC computed within a single dataset, the key contribution of the proposed Cross-AUC is to evaluate discrimination across heterogeneous domains.
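Eqs. (11) and (12) can be illustrated with a dependency-free sketch. It computes AUC through the pairwise (Mann-Whitney) formulation, counting ties as one half, which is a standard equivalent of the area under the ROC curve; the function names, the dictionary layout, and the toy scores are assumptions and not the authors' evaluation code.

```python
from itertools import permutations
from statistics import mean


def auc(real_scores, fake_scores):
    """Probability that a fake sample scores higher than a real one."""
    wins = 0.0
    for fake in fake_scores:
        for real in real_scores:
            if fake > real:
                wins += 1.0
            elif fake == real:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


def cross_auc(datasets):
    """Eq. (12): average AUC(R_i, F_j) over all ordered pairs with i != j.

    `datasets` maps a name to a (real_scores, fake_scores) tuple.
    Returns the average and the per-pair values.
    """
    pair_aucs = {
        (i, j): auc(datasets[i][0], datasets[j][1])
        for i, j in permutations(datasets, 2)
    }
    return mean(list(pair_aucs.values())), pair_aucs


# Two toy domains: perfectly separable internally, but B's scores are shifted up.
datasets = {
    "A": ([0.1, 0.2], [0.4, 0.5]),
    "B": ([0.6, 0.7], [0.9, 1.0]),
}

intra = {name: auc(real, fake) for name, (real, fake) in datasets.items()}
cross_avg, pair_aucs = cross_auc(datasets)
cross_min = min(pair_aucs.values())
print(intra, cross_avg, cross_min)
```

In this toy case both within-dataset AUCs are perfect, yet real samples from B outscore fake samples from A, so one cross pair collapses and the average falls to chance level. This is the score-shift effect that within-dataset AUC cannot expose.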

TABLE I: Intra-datasets comparison results with the metric of frame-level AUC. CDv1 and CDv2 denote Celeb-DF-v1 and Celeb-DF-v2 datasets, respectively.
Model CDv1 CDv2 DFDCP DFDC UADFV Avg.
Xception[4] 0.754 0.740 0.730 0.713 0.936 0.775
Efficientnet-B4[22] 0.767 0.750 0.641 0.709 0.924 0.758
F3Net[17] 0.750 0.729 0.704 0.700 0.891 0.755
FFD[6] 0.709 0.687 0.705 0.714 0.950 0.753
RECCE[1] 0.734 0.741 0.698 0.680 0.887 0.748
UCF[26] 0.811 0.772 0.693 0.731 0.920 0.785
CLIP[18] 0.716 0.754 0.661 0.713 0.850 0.739
Forensics Adapter[5] 0.914 0.900 0.890 0.843 0.980 0.905
Effort[25] 0.892 0.833 0.799 0.784 0.955 0.853
Ours 0.916 0.885 0.862 0.796 0.972 0.886
TABLE II: Intra-datasets comparison results with the metric of video-level AUC. CDv1 and CDv2 denote Celeb-DF-v1 and Celeb-DF-v2 datasets, respectively.
Model CDv1 CDv2 DFDCP DFDC UADFV Avg.
Xception[4] 0.810 0.816 0.761 0.740 0.963 0.818
Efficientnet-B4[22] 0.815 0.808 0.675 0.724 0.960 0.796
F3Net[17] 0.811 0.789 0.735 0.718 0.930 0.797
FFD[6] 0.761 0.742 0.741 0.739 0.973 0.791
RECCE[1] 0.815 0.823 0.715 0.696 0.943 0.798
UCF[26] 0.861 0.837 0.706 0.751 0.955 0.822
CLIP[18] 0.770 0.814 0.679 0.736 0.895 0.779
Forensics Adapter[5] 0.969 0.957 0.929 0.872 0.994 0.944
Effort[25] 0.935 0.885 0.816 0.807 0.965 0.882
Ours 0.966 0.950 0.900 0.898 0.984 0.940

V Experiments

We conduct comprehensive and systematic experiments to validate the effectiveness of our proposed SFAM deepfake detection framework and the novel Cross-AUC evaluation metric. We first detail the complete experimental setup, including datasets, evaluation metrics, and implementation details. Then, we present the cross-domain performance comparison with state-of-the-art (SOTA) methods, followed by ablation studies to quantify the independent contribution of each core component in our framework.

V-A Experimental Settings

V-A1 Datasets

To evaluate the performance of our proposed forgery detection method, we select a representative training dataset and five public test datasets that cover various forgery techniques and data distributions, so that generalization ability can be assessed. FaceForensics++[19], one of the most widely used benchmark datasets for face forgery detection, is adopted as the training dataset. Five public test datasets, Celeb-DF-v1, Celeb-DF-v2[13], DFDCP[8], DFDC[7] and UADFV[12], are used to evaluate the generalization performance of the tested models. Together, the datasets cover different forgery methods, including NeuralTextures[23], Face2Face[24], FaceSwap, FaceShifter and Deepfakes.

V-A2 Evaluation Metrics

We adopt four core metrics to comprehensively evaluate all models, reflecting detection accuracy, cross-domain generalization ability, and practical deployment stability: (1) AUC: Traditional intra-dataset AUC, calculated by pairing real and fake samples within a single test dataset. This metric is used for direct comparison with existing methods and to verify the basic detection capability of the model in closed-set scenarios. (2) Standard Deviation (Std): The standard deviation of the 20 cross-domain AUC values (one per ordered pair of the five test datasets), which measures the fluctuation of the model’s performance across different cross-domain scenarios. A smaller Std indicates more stable and reliable performance in diverse real-world environments. (3) Cross-AUC: The average of the cross-domain AUC values over all dataset pairs, which is the core metric for evaluating the overall cross-domain generalization ability of the model in practical scenarios. (4) Cross-AUC Minimum: The minimum among all cross-domain AUC values, which reflects the model’s detection performance in the most challenging cross-domain scenario and is a key indicator of the lower bound of practical deployment performance.

V-A3 Implementation Details

We adopt a two-stage training strategy consistent with the framework design. First, we pre-train the PaITA module based on the original CLIP model for 10 epochs. Then, we replace 6 key (k) projection networks in the multi-head self-attention layers of the pre-trained CLIP ViT encoder with our proposed FaRMoE module. With the pre-trained model kept frozen, we fine-tune the new structure for 1 additional epoch until convergence. All experiments are conducted on two NVIDIA 3090 GPUs.

V-B Comparison Results

V-B1 Intra-dataset Comparison Experiment

Table I and Table II present the frame-level and video-level AUC results of all compared models on five widely-used public benchmark datasets. From the overall results, the Forensics Adapter model achieves the highest average frame-level AUC of 0.905. Our proposed SFAM framework maintains a highly competitive average frame-level AUC of 0.886, which is on par with top-tier state-of-the-art (SOTA) methods and exhibits strong fundamental detection capability in intra-dataset scenarios.

Notably, our model delivers outstanding performance in video-level detection, with an average video-level AUC of 0.940. This result is only 0.4 percentage points lower than Forensics Adapter (0.944), and significantly outperforms other mainstream baselines. Specifically, we reach the highest frame-level AUC of 0.916 on the Celeb-DF-v1 dataset, surpassing Forensics Adapter (0.914) and all other compared methods. On the most challenging DFDC dataset, which features diverse real-world forgery patterns, complex background noise, and heterogeneous data distributions, our model achieves the highest video-level AUC of 0.898, which is 2.6 percentage points higher than Forensics Adapter (0.872). This demonstrates that our framework maintains stable and reliable detection performance across consecutive video frames, which is a critical capability for practical real-world video deepfake detection tasks.

V-B2 Cross-dataset Comparison Experiment

We conduct cross-domain detection experiments on the five datasets for the SFAM framework, as well as all baseline models. The experimental results are summarized in Table III.

TABLE III: Cross-datasets comparison results with several metrics. CDv1 and CDv2 denote Celeb-DF-v1 and Celeb-DF-v2 datasets, respectively.
Dataset pair Xception Efficientnet-B4 F3Net FFD RECCE UCF CLIP Forensics Adapter Effort Ours
CDv1 / CDv2 0.787 0.794 0.789 0.792 0.766 0.845 0.766 0.945 0.902 0.929
CDv1 / DFDC 0.803 0.747 0.721 0.684 0.700 0.805 0.627 0.986 0.923 0.938
CDv1 / DFDCP 0.818 0.812 0.768 0.723 0.746 0.800 0.714 0.980 0.893 0.932
CDv1 / UADFV 0.804 0.788 0.761 0.775 0.722 0.819 0.739 0.960 0.908 0.946
CDv2 / CDv1 0.697 0.713 0.677 0.571 0.704 0.720 0.697 0.857 0.815 0.865
CDv2 / DFDC 0.764 0.692 0.643 0.547 0.670 0.719 0.601 0.969 0.871 0.900
CDv2 / DFDCP 0.779 0.771 0.705 0.588 0.719 0.709 0.696 0.959 0.821 0.892
CDv2 / UADFV 0.763 0.745 0.696 0.666 0.691 0.738 0.722 0.923 0.845 0.912
DFDC / CDv1 0.660 0.659 0.737 0.735 0.730 0.691 0.760 0.688 0.731 0.814
DFDC / CDv2 0.703 0.700 0.781 0.828 0.763 0.743 0.810 0.750 0.749 0.837
DFDC / DFDCP 0.746 0.726 0.754 0.752 0.744 0.681 0.749 0.872 0.738 0.854
DFDC / UADFV 0.731 0.699 0.748 0.810 0.719 0.712 0.775 0.786 0.770 0.875
DFDCP / CDv1 0.623 0.645 0.673 0.699 0.656 0.742 0.712 0.665 0.778 0.747
DFDCP / CDv2 0.664 0.680 0.715 0.791 0.691 0.788 0.757 0.724 0.794 0.775
DFDCP / DFDC 0.695 0.629 0.643 0.674 0.631 0.740 0.636 0.859 0.834 0.805
DFDCP / UADFV 0.698 0.683 0.691 0.775 0.650 0.757 0.735 0.754 0.810 0.824
UADFV / CDv1 0.922 0.923 0.891 0.926 0.903 0.923 0.842 0.960 0.951 0.961
UADFV / CDv2 0.935 0.934 0.911 0.959 0.917 0.940 0.875 0.975 0.952 0.965
UADFV / DFDC 0.936 0.907 0.871 0.908 0.864 0.912 0.764 0.992 0.959 0.968
UADFV / DFDCP 0.944 0.937 0.891 0.934 0.894 0.912 0.827 0.989 0.949 0.966
AUC Avg. 0.775 0.758 0.753 0.748 0.734 0.785 0.739 0.905 0.853 0.886
Cross-AUC Avg. 0.774 0.759 0.753 0.757 0.744 0.785 0.740 0.881 0.850 0.885
Cross-AUC Min 0.623 0.629 0.643 0.547 0.631 0.681 0.601 0.665 0.731 0.747
Std 0.094 0.097 0.080 0.115 0.083 0.081 0.069 0.102 0.074 0.066

Verification of the Limitations of Traditional AUC. The experimental results support our core argument about the limitations of traditional AUC in Section I. The Forensics Adapter model achieves the highest Original AUC Average of 0.905, indicating its excellent closed-set intra-dataset detection performance. However, this high intra-dataset AUC does not translate to reliable cross-domain generalization: its Cross-AUC Minimum drops to only 0.665, which is 8.2 percentage points (about 12.3% in relative terms) lower than our SFAM model (0.747), and its performance fluctuation Std reaches 0.102, 54.5% higher than ours (0.066). This contrast supports the view that traditional AUC is inherently dataset-bound, and that models optimized solely for intra-dataset AUC tend to overfit to dataset-specific forgery patterns rather than learning universal forgery features. In contrast, our SFAM model maintains a negligible gap of only 0.1 percentage points between Original AUC Average (0.886) and Cross-AUC Average (0.885), suggesting that it learns generalizable facial forgery details rather than dataset-specific biases. These results indicate that the proposed Cross-AUC metric reflects the real-world detection performance of models more accurately, and serves as a more practical and comprehensive evaluation standard for deepfake detection research.

Superiority in Cross-Domain Generalization. Our proposed model achieves the highest Cross-AUC Average of 0.885 among all evaluated models, outperforming all mainstream and state-of-the-art baselines. This performance demonstrates that our SFAM framework achieves remarkable cross-domain generalization ability for deepfake detection. It effectively addresses the core challenge of existing methods, which suffer from severe performance degradation when facing unseen forgery patterns and heterogeneous data in practical scenarios.

Deployment Stability Across Real-World Scenarios. Our SFAM framework achieves the highest Cross-AUC Minimum of 0.747 among all evaluated models. This result proves that our model maintains reliable and effective detection performance while effectively avoiding the catastrophic performance degradation that widely exists in baseline methods.

Meanwhile, our model reaches a Cross-AUC Std of 0.066, which is the lowest among all methods. This demonstrates that our model has consistent and stable detection performance across diverse data distributions and forgery patterns, without drastic performance fluctuations. The strong performance on both core stability indicators supports the practical deployment value of our SFAM framework for real-world deepfake detection.

TABLE IV: Ablation study of different components in the proposed SFAM framework. MGHDA denotes the Mask-Guided Hybrid Data Augmentation module, PaITA denotes the Patch-level Image-Text Alignment module, and FaRMoE denotes the Facial Region Mixture-of-Experts module.
MGHDA PaITA FaRMoE AUC Avg. Cross-AUC Avg. Cross-AUC Min
× × × 0.739 0.740 0.601
✓ × × 0.858 0.861 0.725
✓ ✓ × 0.880 0.881 0.749
✓ ✓ ✓ 0.886 0.885 0.747

V-C Ablation Study

We first conduct ablation experiments on the three core components of our SFAM framework to explore the effect of each component. The experimental results are shown in Table IV. Key observations from the ablation results are summarized as follows: First, the CLIP baseline achieves the worst performance across all metrics, with a Cross-AUC Average of only 0.740 and a Cross-AUC Minimum of 0.601. This demonstrates that the CLIP model lacks optimization and cannot be directly applied to deepfake detection without targeted structural and training strategy modifications. Second, mask-guided hybrid data augmentation serves as the foundation for performance improvement. By adding this module, the Cross-AUC Average increases significantly from 0.740 to 0.861, and the Cross-AUC Minimum rises from 0.601 to 0.725. This improvement validates that the mask-guided hybrid data augmentation enriches the diversity of fake samples, and explicitly guides the model to focus on forged regions, thus fundamentally enhancing the model’s cross-domain generalization ability. Third, patch-level alignment is the core module for further cross-domain performance enhancement. On the basis of mask-based augmentation, adding patch-level alignment further lifts the Cross-AUC Average to 0.881 and Cross-AUC Minimum to 0.749. This result proves that the fine-grained cross-modal semantic matching between local image patches and regional text prompts enables the model to capture subtle local forgery cues that are overlooked by global-only alignment.

TABLE V: Experiment results of different hyperparameters in the loss function.
$\lambda_{1}$ $\lambda_{2}$ AUC Avg. Cross-AUC Avg.
0.2 0.3 0.831 0.833
0.25 0.25 0.868 0.865
0.3 0.2 0.880 0.881
0.35 0.15 0.865 0.866

Finally, the full SFAM model with all three modules achieves the highest Original AUC Average of 0.886 and the highest Cross-AUC Average of 0.885, while its Cross-AUC Minimum of 0.747 is nearly identical to that of the variant without FaRMoE (0.749). This indicates that the FaRMoE module further enhances the model’s organ-specific fine-grained feature extraction and overall detection ability, without sacrificing cross-domain generalization performance or the lower bound of detection stability. In summary, each of the three core modules has a clear effect, and adding them step by step improves the model’s detection accuracy and cross-domain generalization ability while preserving practical deployment stability. The ablation results support the rationality and necessity of each module’s design in our SFAM framework.

V-D Parameter Analysis

We also conduct experiments on the hyperparameters $\lambda_{1}$ and $\lambda_{2}$ of the loss function in Section III-D to explore the best configuration. The experiment is conducted without the FaRMoE module and the results are shown in Table V. The combination ($\lambda_{1}=0.3$, $\lambda_{2}=0.2$) achieves the highest Original AUC Average (0.880) and Cross-AUC Average (0.881) among all tested configurations. The intra-image ranking loss $L_{rank\_intra}$ (weighted by $\lambda_{1}$) guides the model to distinguish forged and authentic regions within a single image, which is the foundation of detection ability. We assign it a slightly higher weight of 0.3 to ensure the model can capture subtle local forgery cues. The cross-sample ranking loss $L_{rank\_real\_fake}$ (weighted by $\lambda_{2}$) helps the model learn universal forgery features and improve cross-domain generalization, so we set its weight to 0.2 to provide effective supervision without affecting basic feature learning. This setting achieves the best balance between the two ranking loss terms, and all other tested configurations perform worse. For ($\lambda_{1}=0.2$, $\lambda_{2}=0.3$), the reversed weight allocation appears to make the model neglect fine-grained local forgery details, resulting in the sharpest drop in overall performance (0.831 AUC Average, 0.833 Cross-AUC Average). The equal-weight setting ($\lambda_{1}=\lambda_{2}=0.25$) fails to reflect the different priorities of the two loss terms and does not reach the optimal performance. For ($\lambda_{1}=0.35$, $\lambda_{2}=0.15$), excessive weight on the intra-image loss may make the model overfit to local details of the training set, which damages cross-domain generalization.

In addition, the optimal configuration shows a negligible gap between Original AUC Average and Cross-AUC Average (0.880 vs. 0.881), which suggests that the model balances closed-set detection performance and real-world cross-domain generalization. This is consistent with the core design goal of our SFAM framework.

Figure 4: t-SNE visualization results of four methods.

V-E Visualization Analysis

We perform t-distributed Stochastic Neighbor Embedding (t-SNE)[11] visualization on the feature representations learned by four methods: the CLIP baseline, Effort, Forensics Adapter, and our proposed SFAM framework. The results are presented in Figure 4. The features of real and forged images extracted by the CLIP baseline are heavily intertwined, with no clear boundary between the two categories, which indicates that the baseline fails to distinguish authentic from fake samples. In contrast, Effort and Forensics Adapter form relatively well-defined class-wise distributions in the feature space. However, their features exhibit obvious inter-dataset distribution gaps: samples from different data sources form isolated clusters. This reveals that these models rely on dataset-specific biases, which may impair their generalization in real-world open scenarios.

Meanwhile, our proposed method exhibits superior cross-dataset generalization capability. Samples from different data sources are distributed more compactly in the feature space, while the clear separation between authentic and fake categories is well preserved. This demonstrates that our framework effectively mitigates the adverse effects of dataset bias and learns more generalizable forgery-related feature representations, thereby enhancing the model’s robustness in complex real-world deployment scenarios.

VI Conclusion

In this paper, we have proposed Cross-AUC, a novel evaluation metric for generalizable face forgery detection in real-world scenarios. Cross-AUC provides a more practical evaluation of real-world generalization performance. We have further proposed the Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM) framework, consisting of a patch-level image-text alignment module that enhances CLIP’s sensitivity to manipulation artifacts, and a facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive experiments validate that our method achieves state-of-the-art cross-domain detection performance on mainstream benchmarks. In the future, we will evaluate the proposed method against different forgery models in more complex real-world scenarios.

References

  • [1] Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4113–4122 (June 2022)
  • [2] Chen, L., Zhang, Y., Song, Y., Liu, L., Wang, J.: Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18689–18698 (2022). https://doi.org/10.1109/CVPR52688.2022.01815
  • [3] Chen, S., Yao, T., Chen, Y., Ding, S., Li, J., Ji, R.: Local relation learning for face forgery detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1081–1088 (2021). https://doi.org/10.1609/aaai.v35i2.16193
  • [4] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [5] Cui, X., Li, Y., Luo, A., Zhou, J., Dong, J.: Forensics adapter: Adapting clip for generalizable face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19207–19217 (June 2025)
  • [6] Dang, H., Liu, F., Stehouwer, J., Liu, X., Jain, A.K.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
  • [7] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The deepfake detection challenge (dfdc) dataset (2020), https://arxiv.org/abs/2006.07397
  • [8] Dolhansky, B., Howes, R., Pflaum, B., Baram, N., Ferrer, C.C.: The deepfake detection challenge (dfdc) preview dataset (2019), https://arxiv.org/abs/1910.08854
  • [9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929 (2020), https://arxiv.org/abs/2010.11929
  • [10] Fan, L., Krishnan, D., Isola, P., Katabi, D., Tian, Y.: Improving clip training with language rewrites. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 35544–35575. Curran Associates, Inc. (2023)
  • [11] Hinton, G., Maaten, L.v.d.: Visualizing Data using t-SNE. Journal of Machine Learning Research 9(86), 2579–2605 (Jan 2008)
  • [12] Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–7 (2018). https://doi.org/10.1109/WIFS.2018.8630787
  • [13] Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-df: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
  • [14] Lin, Y., Song, W., Li, B., Li, Y., Ni, J., Chen, H., Li, Q.: Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 104–122. Springer Nature Switzerland, Cham (2025)
  • [15] Luo, Y., Zhang, Y., Yan, J., Liu, W.: Generalizing face forgery detection with high-frequency features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16317–16326 (June 2021)
  • [16] Lyu, S., Li, Y.: Exposing DeepFake Videos By Detecting Face Warping Artifacts. arXiv: Computer Vision and Pattern Recognition (Nov 2018)
  • [17] Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: Face forgery detection by mining frequency-aware clues. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 86–103. Springer International Publishing, Cham (2020)
  • [18] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (18–24 Jul 2021), https://proceedings.mlr.press/v139/radford21a.html
  • [19] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Niessner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
  • [20] Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18720–18729 (June 2022)
  • [21] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale (2023), https://arxiv.org/abs/2303.15389
  • [22] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR (09–15 Jun 2019), https://proceedings.mlr.press/v97/tan19a.html
  • [23] Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph. 38(4) (Jul 2019). https://doi.org/10.1145/3306346.3323035
  • [24] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Niessner, M.: Face2face: Real-time face capture and reenactment of rgb videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
  • [25] Yan, Z., Wang, J., Jin, P., Zhang, K.Y., Liu, C., Chen, S., Yao, T., Ding, S., Wu, B., Yuan, L.: Orthogonal subspace decomposition for generalizable ai-generated image detection (2025), https://arxiv.org/abs/2411.15633
  • [26] Yan, Z., Zhang, Y., Fan, Y., Wu, B.: Ucf: Uncovering common features for generalizable deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22412–22423 (October 2023)
  • [27] Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-clip: Unlocking the long-text capability of clip. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 310–325. Springer Nature Switzerland, Cham (2025)
  • [28] Zhang, Y., Wang, T., Yu, Z., Gao, Z., Shen, L., Chen, S.: Mfclip: Multi-modal fine-grained clip for generalizable diffusion face forgery detection. IEEE Transactions on Information Forensics and Security 20, 5888–5903 (2025). https://doi.org/10.1109/TIFS.2025.3576577
  • [29] Zhang, Y., Colman, B., Guo, X., Shahriyari, A., Bharaj, G.: Common sense reasoning for deepfake detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 399–415. Springer Nature Switzerland, Cham (2025)
  • [30] Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N.: Multi-attentional Deepfake Detection. arXiv: Computer Vision and Pattern Recognition (Mar 2021)
  • [31] Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Two-stream neural networks for tampered face detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 1831–1839 (2017). https://doi.org/10.1109/CVPRW.2017.229