Abstract
The proliferation of DeepFake technology is a rising challenge in today’s society,
owing to more powerful and accessible generation methods. To counter this, the research
community has developed detectors of ever-increasing accuracy. However, the ability
to explain the decisions of such models to users is lagging behind and is considered an
accessory in large-scale benchmarks, despite being a crucial requirement for the correct
deployment of automated tools for content moderation. We attribute the issue to the
reliance on qualitative comparisons and the lack of established metrics. We describe a
simple set of metrics to evaluate the visual quality and informativeness of explanations
of video DeepFake classifiers from a human-centric perspective. With these metrics, we
compare common approaches to improve explanation quality and discuss their effect on
both classification and explanation performance on the recent DFDC and DFD datasets.
1 Introduction
“DeepFake” refers to the realistic alteration or generation of multimedia content, in visual,
audio, or textual form. The most striking application of DeepFakes is generative deep
learning models that can alter a person’s appearance in videos. From early attempts [75,
76], the quality of these face-swapping techniques has increased consistently to the point
that both casual and attentive observers can be fooled. While some applications can be
positively innovative [85], DeepFakes can be designed with malicious intent, such as online
disinformation or public defamation. In response, the research community has introduced
datasets [21, 57, 63, 65] and methods [12, 15, 36] for the automatic monitoring and detection
of DeepFakes. However, benchmark performance has become the de-facto goal, shadowing
other aspects that are crucial for the correct deployment of such models.
In practice, as automated DeepFake detectors acquire a significant role for moderation
and censorship of online communities, it becomes necessary to inspect and explain their
Figure 1: Explanation heatmap for one sample from DeepFake Dataset obtained by apply-
ing SmoothGrad to a classifier regularized with Total Variation. The resulting heatmap is
visually smooth (TV = 0.25), localized (σ = 731), and concentrates around visible manipu-
lation artefacts (Min = 18.1%), i.e. the eyes and one corner of the mouth.
decision process. From the users’ perspective, it is not acceptable that “black-box” models
manage their freedom of expression and online safety. Instead, users require intuitive expla-
nations to validate DeepFake forgeries, prevent unjustified censorship, and trust automated
moderation systems. From the perspective of companies and regulators, interpretability is
necessary to justify the enforcement of DeepFake detectors, in accordance with the right to
explanation of legal frameworks such as the GDPR [23]. Also, developers of such tools can
benefit from explanations to verify the learned representation, mitigate unwanted bias, and
defend against adversarial attacks.
Several methods for explaining visual classifiers exist, e.g. [69, 71], which can be com-
pared in terms of faithfulness to the model and correctness to the data. However, researchers
lack quantitative tools to evaluate human-centric properties of explanations and claims of im-
proved informativeness are often based on subjective comparisons. This work introduces a
quantitative framework to evaluate DeepFake explanations w.r.t. human perception, which
can be applied in practical deployments of DeepFake classifiers. In particular, we contextu-
alize existing metrics, i.e. manipulation detection, and propose new ones as needed, namely
for smoothness, sparsity, and locality. We apply these metrics to state-of-the-art video recog-
nition models and compare several techniques intended for improving explanations, form-
ing a quantitative baseline on the DeepFake Detection Challenge dataset and the DeepFake
Dataset [21, 65]. Last, we empirically evaluate how to best communicate heatmap-based
explanations to users, and discuss limitations and future directions for DeepFake explainability.
2 Related work
DeepFake generation. Since their inception, generative models have been applied to ma-
nipulate faces, bodies and voices in online media. Today’s availability of online content and
ease of access to open-source frameworks allow anyone with consumer-grade hardware to
generate DeepFakes. While legitimate applications of this technology exist, e.g. dubbing,
DeepFakes have been infamously used for disinformation, fraud, hatred, sexual abuse, and
other crimes [17, 82]. This work focuses on visual forgery of faces in videos, which can
be categorized as face swapping, in which the appearance of a face is replaced with an-
other [19, 40, 41, 58]; or facial reenactment, in which expressions are edited [76, 77]. Such
manipulations can be produced via purely learning-based generative models [19, 59, 77] or
hybrid computer graphics approaches [40, 76]. For a survey of methods and applications, we
refer the reader to the works of Tolosana et al. [78] and Masood et al. [47].
Explainable AI. Although powerful, deep learning models are often deemed “black-boxes”
to illustrate the opacity of their decision process. The field of study of Explainable AI
(XAI) tries to address these shortcomings to allow users, researchers, and regulators to gain
insights into such models (model interpretability) and their outputs (decision explainabil-
ity) [28, 50]. In the visual domain, in particular for classification, it is common to ex-
plain the decision of a model using heatmaps which highlight important areas of the in-
put [25, 61, 91]. Backpropagation-based approaches generate heatmaps by computing gra-
dients [33, 69, 71, 72, 74, 88, 93] or gradient surrogates [53, 68, 73]. Alternative approaches
construct proxy models that are locally faithful and easier to interpret, e.g. LIME [64]. Re-
cently, transformer models [7, 22, 83] have popularized using attention maps as explana-
tions [1, 14, 87], although these might not be representative [32].
DeepFake explainability. As social platforms integrate automated tools for DeepFake de-
tection and moderation in their pipelines [47], it becomes crucial to offer proper justification
when some content is blocked. Prototype-based explanations, as in Trinh et al. [80], could
teach users to identify manipulation artefacts on their own. Similarly, SHAP-based methods
can be adapted to videos by defining 3D super-pixels [46, 62]. Focusing on input fea-
tures, Wang et al. [84] suggest pre-processing steps that result in more human-interpretable
heatmaps, according to a qualitative evaluation. Finally, human-annotated explanations, e.g.
Mathew et al. [49], provide direct insight on manipulation techniques.
3 Method
3.1 Explanation methods
Our goal is to establish quantitative metrics to evaluate explanations of visual DeepFake
classifiers. In particular, we focus on heatmap-based methods [8, 71, 91] that associate each
pixel to a scalar proportionally to its importance w.r.t. the classifier decision. Formally, we
define a video v ∈ V as a mapping from a discrete grid G = T ×H×W to the RGB color space.
A DeepFake classifier is then a function f : V → [0, 1] that maps a video to the probability
distribution p(FAKE|v). An explanation method is a function Φ : V × F → H that maps a
pair (v, f) to a relevance heatmap h : G → R+, where F and H = {h | ∫_G h dλ = 1} denote the
set of classifiers and heatmaps respectively. With this notation, popular gradient-based ex-
planation methods are expressed as: Sensitivity ∇f(v) [71]; Gradient×Input ∇f(v) · v [37];
SmoothGrad E_{ε∼N(0,δI)}[∇f(v + v_ε)], where v_ε adds random color perturbations [72]; and
Integrated Gradients (v − v_b) · ∫_0^1 ∇f(v_b + α(v − v_b)) dα, where the baseline v_b is a uniform
black video [74]. Note that ∇ and ∫ are discretized operators over G (see Appendix D).
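To make the computation concrete, the following is a minimal PyTorch sketch of the Sensitivity and SmoothGrad heatmaps (not the authors' released code): the classifier is assumed to map a batched video of shape (1, T, C, H, W) to the scalar p(FAKE|v), Gaussian pixel noise stands in for the color perturbation v_ε, and the hyperparameters mirror those reported in Appendix D.4 (25 samples, noise equal to 0.15 of the RGB range).

```python
import torch

def gradient(model, video):
    """Sensitivity: gradient of p(FAKE|v) w.r.t. an input video of shape (T, C, H, W).
    Assumes `model` maps a batched video (1, T, C, H, W) to a scalar probability."""
    v = video.clone().requires_grad_(True)
    model(v.unsqueeze(0)).squeeze().backward()
    return v.grad.detach()

def smoothgrad(model, video, n_samples=25, noise_std=0.15):
    """SmoothGrad: expected gradient over randomly perturbed copies of the input;
    noise_std is relative to an RGB range of [0, 1] (see Appendix D.4)."""
    grads = torch.zeros_like(video)
    for _ in range(n_samples):
        grads += gradient(model, video + noise_std * torch.randn_like(video))
    heatmap = (grads / n_samples).abs().sum(dim=1)   # aggregate color channels -> (T, H, W)
    return heatmap / heatmap.sum()                   # normalize so the heatmap sums to 1
```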
Explanation methods are commonly compared according to their faithfulness, i.e. the
ability to correctly explain a decision [4, 8, 52, 64, 69]. Faithfulness is quantified by the dele-
tion score [61, 67], defined as E_v[∫ f(v ⊙ (1 − h_α)) dα], where h_α is a binary mask obtained
by selecting the most important pixels from the explanation Φ( f , v) such that their cumu-
lative relevance is α ∈ [0, 1]. A low deletion score indicates a faithful explanation method:
if relevant pixels are masked out first, the prediction confidence should drop sharply. For
our baseline model, SmoothGrad achieves the lowest deletion score (paired one-sided t-test
p < 10−5 ) and is therefore selected for all evaluations of visual quality. We report per-
method hyperparameters and per-dataset scores in Appendix D.4. Clearly, faithfulness is a
necessary property of explanation methods, however, their heatmaps can still appear noisy
and uninformative for humans, hence the need for quantitative metrics of visual quality.
The first set of metrics considers general properties of explanation heatmaps that facilitate
their understanding and communicability. Complex models can take decisions based on
features that are not easily accessible to users, e.g. texture details or high-frequency pat-
terns [26, 84]. Instead, we expect models that focus on human-interpretable cues [38] such
as small manipulation artefacts, teeth misalignment, non-circular pupils, or irregular skin
complexion, to produce smoother, sparser and more localized heatmaps.
Smoothness. Explanations that vary excessively between neighboring pixels or frames are
not meaningful to humans [84]. The smoothness of a heatmap h : G → R+ is measured as its
Total Variation (TV), where low values indicate higher local consistency:
TV(h) = ∫_G ‖∇h‖_1 dλ .    (1)
Locality. A good explanation should also concentrate its relevance around a single region of the video. Interpreting the heatmap h as a probability distribution over pixel coordinates ρ, we measure locality via the generalized variance, i.e. the determinant of the covariance matrix of ρ under h:

σ = |det(Σ)| = det( E_h[ρ ρ^T] − E_h[ρ] E_h[ρ]^T ) .    (2)
A low σ will favor sharp unimodal distributions, e.g. a Gaussian with low dispersion, as
opposed to scattered multimodal heatmaps. In the context of DeepFakes, this means high-
lighting single manipulation artefacts instead of allocating mass to distinct parts of the face.
For other tasks, spatial locality can be extended to account for domain-specific requirements.
Sparsity. While TV and σ capture spatial properties, the individual values shall also be
sparse, since few highly important regions are more indicative of a good explanation than
several mildly relevant ones. Both L0 norm and Entropy [70] are popular measures of spar-
sity, but the Gini Index [29] is preferred according to Hurley and Rickard [31]. For a heatmap
h : G → R+ and sorting indices i = 1, . . . , THW such that h(ρ_i) ≤ h(ρ_{i+1}):

G = [2 ∑_i i · h(ρ_i)] / [THW ∑_i h(ρ_i)] − (THW + 1) / THW .    (3)
4 Experiments
The previous section establishes a set of desirable qualities of explanations and proposes
evaluation metrics built on sound mathematical foundations. We now consider several tech-
niques from previous works and quantify their effect on explanations using these metrics.
Section 4.1 analyses the effects of: i) data preparation [84]; ii) loss-based regularization [66];
iii) augmentation-based regularization [20]; and iv) model architecture [26, 81]. Both
classification performance (Tab. 1) and explanation quality (Fig. 2) are reported for each
experiment. Furthermore, Section 4.2 discusses post-processing techniques for heatmap vi-
sualization, which are important for communicating explanations to users in practice.
Training dataset. All models are trained on videos from the DeepFake Detection Chal-
lenge [21] in “high-quality” compression (constant rate quantization 23). Specifically, we
train on 19k real and 100k fake videos, and use the official validation split of 2k real and 2k
fakes for hyper-parameter tuning. Each video is preprocessed using the MTCNN face de-
tector [92], the main face is heuristically determined among all detections, then cropped and
resized to 224×224 pixels. Part segmentation is obtained with the BiSeNet face parser [90]
and aggregated into background, face, nose, mouth, eyes, ears. Additional details on data
preparation and dataset statistics are provided in Appendix A.
Table 1: Classification metrics: LCE is categorical cross-entropy (↓), AROC is the area under
the receiver operating characteristic curve (↑). Average values over 3 runs, full results in
Tables 3 and 4. Reported values account for class imbalance as detailed in Table 2.
                 DFDC test            DFD
                 LCE      AROC        LCE      AROC
S3D Baseline     .447     89.0        .694     80.2
S3D Bilateral    .696     54.2        .746     45.8
S3D Gaussian     .542     81.8        .760     66.4
S3D TV Loss      .460     87.4        .698     75.8
S3D Cutout       .481     87.2        .655     79.6
MViT             .430     96.4        .513     90.0
Figure 2: Quantitative explanation metrics: visual quality (top) and manipulation de-
tection (bottom) for the evaluation subsets of DeepFake Detection Challenge (DFDC) and
DeepFake Detection Dataset (DFD). Higher values indicate better explanation quality, except
for TV and locality σ . Mean and standard deviation of 3 runs, full results in the appendix.
TV regularization. Following [66], we penalize the 1D total variation of the activations A_ℓ of each convolutional block ℓ:

L^TV_ℓ = (1/THW) ∑_d Ω_1D(A^d_ℓ) ,    (5)
where the summation considers all 1D slices of A_ℓ orthogonal to its axes and Ω_1D indicates the
1D total variation. Averaging over all convolutional blocks in our architecture, the optimization
objective becomes L = L_task + α E_ℓ[L^TV_ℓ], where α ∈ R+ is a hyperparameter. The additional
term places a smoothness constraint on the activations of intermediate layers, which we hope
will result in localized peaks in the heatmap corresponding to visible artefacts in the video,
though TV does not control the location of such peaks.
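A minimal PyTorch sketch of this regularizer is given below; the normalization by the number of activation elements and the way per-block activations are gathered are assumptions made for illustration, not the authors' implementation.

```python
import torch

def tv_1d(x, dim):
    """1D total variation along one axis: sum of absolute differences between neighbors."""
    a = x.narrow(dim, 0, x.size(dim) - 1)
    b = x.narrow(dim, 1, x.size(dim) - 1)
    return (a - b).abs().sum()

def activation_tv(act):
    """Omega_1D summed over all temporal and spatial 1D slices of a (B, C, T, H, W)
    activation tensor, normalized by its size (Eq. 5)."""
    return sum(tv_1d(act, d) for d in (2, 3, 4)) / act.numel()

def total_loss(task_loss, block_activations, alpha=1.0):
    """L = L_task + alpha * E_l[L_TV_l], averaging the regularizer over convolutional blocks."""
    tv = torch.stack([activation_tv(a) for a in block_activations]).mean()
    return task_loss + alpha * tv
```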
The first effect of TV regularization is noticeable during the initial phases of training.
For unconstrained models, we observe that E_ℓ[L^TV_ℓ] tends to increase during the initial
phase of training and stabilizes at around 0.5 after one epoch. On the other hand, when
α = 1, the optimization process is dominated by L^TV for the first epochs and classification
loss starts decreasing only after this term drops below 0.1. From the results in Table 1, a
strong TV regularization affects classification performance negatively. However, we also
observe a significant improvement over the baseline for locality, sparsity, and manipulation
detection in Figure 2. In fact, the average σ for DFDC decreases from 814 to 726, indicating more
spatially-focused explanations. Also, the Gini Index increases from 75% to 77%, meaning
that fewer pixels are responsible for the bulk of the heatmaps. With respect to manipulation
detection, the heatmaps produced by TV-regularized models match more closely the ground-
truth, resulting in higher P100 for both DFDC and DFD.
Video cutout. Cutout is a data augmentation that can greatly improve classification perfor-
mance by masking input patches at random during training [20]. We adapt Cutout to video
data by replacing masking with heavy spatio-temporal blur: since motion blur occurs natu-
rally, the augmented samples are maintained closer to the data manifold. We expect Facial
Cutout to guide the network towards more meaningful representations, where the relationships
between semantic parts of the face are better understood, hence improving part-based
manipulation detection. On the other hand, removing parts of the input might yield more
spread out heatmaps, as the network learns to capture information from more diverse loca-
tions. In our experiments, we observe slightly better generalization to DFD for regularized
models (Table 1), which confirms the regularization properties of Facial Cutout. However,
the effects on explanations are limited, resulting in slightly higher Total Variation and ma-
nipulation detection scores (Figure 2).
Model architecture. Compared to the S3D variants, the transformer-based MViT [24] produces
heatmaps with lower Total Variation and a higher Gini Index, i.e. smoother and sparser explanations,
while spatial locality remains similar (σ). Furthermore, the bottom row of Fig. 2
indicates that MViT heatmaps are stronger detectors of manipulated areas, focusing most
of the heatmap inside the ground-truth mask (Min ). We attribute these promising results to:
i) a more robust classifier which can better distinguish fake videos and is thus likely to have
learned a good representation of manipulation artefacts; and ii) the underlying inductive bias
of attention and its effect on gradient propagation used for heatmap generation.
5 Conclusion
The field of Explainable AI has developed a plethora of explanation methods of varying degrees
of faithfulness. However, to the best of our knowledge, quantitative metrics to compare
the quality of such explanations are lacking. This work attempts to lay out an objective
evaluation framework for DeepFake explanations, which we hope will drive the development
of detectors that are better aligned with human cognition. The main contribution of this
paper is the introduction of a family of such metrics, novel or adapted from existing works,
to measure visual quality and manipulation detection.
In our experiments we consider several techniques for training DeepFake detectors and
study their impact on explainability metrics in a quantitative way, whereas previous work
was limited to qualitative comparisons. We observe that TV regularization has the largest
impact across most metrics. On the other hand, controlling high-frequency components of
the input is of little utility, at least when realistic video compression settings are considered.
Finally, we observe that recent architectures such as MViT significantly outperform any of
the S3D variations in both detection and explanation quality. We recommend further study
of transformer-based DeepFake classifiers and how to employ attention as an explanation.
Limitations and future work. This project leads to many natural avenues for future re-
search in Explainable AI. First, although the proposed metrics are drawn from existing liter-
ature and are based on sound mathematical foundations, an extensive study of the correlation
between these metrics and human preference would increase their reliability. Second, it is
surely possible to conceive more refined metrics for DeepFake detection to address the short-
comings discussed in Section 3.2. For instance, we have already mentioned that locality (σ )
favors unimodal over multimodal heatmaps, whereas more faceted metrics of localization
are desirable. Third, as made evident by the experiments on the DeepFake Dataset, when
classification performance is not perfect, explanations can be meaningless. Thus, combin-
ing explanations and uncertainty estimation would provide a more complete picture of any
DeepFake detector. Finally, we remark that the proposed metrics are not meant to supplant
human judgment, e.g. user studies, but rather to provide a non-interactive and repeatable
benchmark that is more suitable for guiding the development and facilitating the deployment
of better DeepFake detectors.
References
[1] Samira Abnar and Willem Zuidema. Quantifying Attention Flow in Transformers. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguis-
tics, pages 4190–4197, Online, July 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.acl-main.385.
[2] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. Mesonet: A
compact facial video forgery detection network. In 2018 IEEE International Workshop
on Information Forensics and Security (WIFS), pages 1–7. IEEE, 2018.
[3] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li.
Protecting World Leaders Against Deep Fakes. In IEEE Conference on Computer Vi-
sion and Pattern Recognition Workshops (CVPRW), volume 1, 2019.
[4] David Alvarez Melis and Tommi Jaakkola. Towards Robust Interpretability with Self-
Explaining Neural Networks. Advances in Neural Information Processing Systems, 31,
2018.
[5] Irene Amerini, Leonardo Galteri, Roberto Caldelli, and Alberto Del Bimbo. Deepfake
Video Detection through Optical Flow Based CNN. In Proceedings of the IEEE/CVF
International Conference on Computer Vision Workshops, pages 0–0, 2019.
[6] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better
understanding of gradient-based attribution methods for Deep Neural Networks. In
International Conference on Learning Representations, February 2018.
[7] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and
Cordelia Schmid. ViViT: A Video Vision Transformer. arXiv:2103.15691 [cs], March
2021.
[8] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-
Robert Müller, and Wojciech Samek. On Pixel-Wise Explanations for Non-Linear Clas-
sifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE, 10(7):e0130140,
July 2015. ISSN 1932-6203. doi: 10.1371/journal.pone.0130140.
[9] Federico Baldassarre, Kevin Smith, Josephine Sullivan, and Hossein Azizpour.
Explanation-Based Weakly-Supervised Learning of Visual Relations with Graph
Networks. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael
Frahm, editors, ECCV 2020, Lecture Notes in Computer Science, pages 612–630.
Springer International Publishing, 2020. ISBN 978-3-030-58604-1. doi: 10.1007/
978-3-030-58604-1_37.
[10] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot,
Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina,
Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable Artificial In-
telligence (XAI): Concepts, taxonomies, opportunities and challenges toward respon-
sible AI. Information Fusion, 58:82–115, June 2020. ISSN 1566-2535. doi:
10.1016/j.inffus.2019.12.012.
[11] Luca Bondi, Edoardo Daniele Cannas, Paolo Bestagini, and Stefano Tubaro. Training
Strategies and Data Augmentations in CNN-based DeepFake Video Detection. In 2020
IEEE International Workshop on Information Forensics and Security (WIFS), pages
1–6, December 2020. doi: 10.1109/WIFS49906.2020.9360901.
[12] Nicolò Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo
Bestagini, and Stefano Tubaro. Video Face Manipulation Detection Through Ensemble
of CNNs. In 2020 25th International Conference on Pattern Recognition (ICPR), pages
5012–5019, January 2021. doi: 10.1109/ICPR48806.2021.9412711.
[13] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen
Huang, Liang Wang, Chang Huang, Wei Xu, Deva Ramanan, and Thomas S. Huang.
Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convo-
lutional Neural Networks. In 2015 IEEE International Conference on Computer Vision
(ICCV), pages 2956–2964, December 2015. doi: 10.1109/ICCV.2015.338.
[14] Hila Chefer, Shir Gur, and Lior Wolf. Generic Attention-Model Explainability for
Interpreting Bi-Modal and Encoder-Decoder Transformers. arXiv:2103.15679 [cs],
March 2021.
[15] Davide Coccomini, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi.
Combining EfficientNet and Vision Transformers for Video Deepfake Detection.
arXiv:2107.02612 [cs], July 2021.
[16] Robert T. Collins. Mean-shift blob tracking through scale space. In 2003 IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition, 2003. Proceed-
ings., volume 2, pages II–234. IEEE, 2003.
[17] Jesse Damiani. A Voice Deepfake Was Used To Scam A CEO Out Of
$243,000. https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice-deepfake-
was-used-to-scam-a-ceo-out-of-243000/, September 2019.
[18] Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K. Jain. On the detection
of digital face manipulation. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 5781–5790, 2020.
[21] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang,
and Cristian Canton Ferrer. The DeepFake Detection Challenge (DFDC) Dataset.
arXiv:2006.07397 [cs], October 2020.
[22] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs], September
2020.
[23] European Union. General Data Protection Regulation (GDPR). https://gdpr.eu/, 2018.
[24] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Ma-
lik, and Christoph Feichtenhofer. Multiscale Vision Transformers. arXiv:2104.11227
[cs], April 2021.
[25] Ruth C. Fong and Andrea Vedaldi. Interpretable Explanations of Black Boxes by
Meaningful Perturbation. In 2017 IEEE International Conference on Computer Vision
(ICCV), pages 3449–3457, Venice, October 2017. IEEE. ISBN 978-1-5386-1032-9.
doi: 10.1109/ICCV.2017.371.
[26] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wich-
mann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; in-
creasing shape bias improves accuracy and robustness. In International Conference on
Learning Representations, September 2018.
[27] Generated Photos Team. Generated Photos Dataset. https://generated.photos/, 2018.
[28] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana
Kagal. Explaining Explanations: An Overview of Interpretability of Machine Learn-
ing. The 5th IEEE International Conference on Data Science and Advanced Analytics
(DSAA 2018)., June 2018.
[29] Corrado Gini. Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle
relazioni statistiche. Tipogr. di P. Cuppini, 1912.
[30] David Güera and Edward J. Delp. Deepfake Video Detection Using Recurrent Neural
Networks. In 2018 15th IEEE International Conference on Advanced Video and Signal
Based Surveillance (AVSS), pages 1–6, November 2018. doi: 10.1109/AVSS.2018.
8639163.
[31] Niall Hurley and Scott Rickard. Comparing Measures of Sparsity. IEEE Transactions
on Information Theory, 55(10):4723–4741, October 2009. ISSN 1557-9654. doi: 10.
1109/TIT.2009.2027527.
[32] Sarthak Jain and Byron C. Wallace. Attention is not Explanation. In Proceedings of
the 2019 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pages 3543–3556, Minneapolis, Minnesota, June 2019. Association for Computational
Linguistics. doi: 10.18653/v1/N19-1357.
[33] Andrei Kapishnikov, Tolga Bolukbasi, Fernanda Viégas, and Michael Terry. XRAI:
Better Attributions Through Regions. arXiv:1906.02825 [cs, stat], August 2019.
[34] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 4401–4410, 2019.
[35] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheen-
dra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa
Suleyman, and Andrew Zisserman. The Kinetics Human Action Video Dataset.
arXiv:1705.06950 [cs], May 2017.
[36] Minha Kim, Shahroz Tariq, and Simon S. Woo. FReTAL: Generalizing Deepfake De-
tection Using Knowledge Distillation and Representation Learning. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1001–
1012, 2021.
[37] Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. Investi-
gating the influence of noise and distractors on the interpretation of neural networks. In
NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems, Novem-
ber 2016.
[39] Alexander Kolesnikov and Christoph H. Lampert. Seed, Expand and Constrain: Three
Principles for Weakly-Supervised Image Segmentation. In ECCV (4), January 2016.
[41] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Advancing High
Fidelity Identity Swapping for Forgery Detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 5074–5083, 2020.
[42] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detecting face warping arti-
facts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2018.
[43] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In ictu oculi: Exposing ai created fake
videos by detecting eye blinking. In 2018 IEEE International Workshop on Information
Forensics and Security (WIFS), pages 1–7. IEEE, 2018.
[44] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A Large-
Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020.
[45] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR
2019, 2019.
[46] Scott M Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Pre-
dictions. In Advances in Neural Information Processing Systems, volume 30. Curran
Associates, Inc., 2017.
[47] M. Masood, Marriam Nawaz, K. M. Malik, A. Javed, and Aun Irtaza. Deepfakes
Generation and Detection: State-of-the-art, open challenges, countermeasures, and way
forward. ArXiv, 2021.
[48] Falko Matern, Christian Riess, and Marc Stamminger. Exploiting visual artifacts to ex-
pose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer
Vision Workshops (WACVW), pages 83–92. IEEE, 2019.
[49] Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and
Animesh Mukherjee. HateXplain: A Benchmark Dataset for Explainable Hate Speech
Detection. arXiv:2012.10289 [cs], December 2020.
[50] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences.
Artificial Intelligence, 267:1–38, February 2019. ISSN 0004-3702. doi: 10.1016/j.
artint.2018.07.007.
[51] Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. A Multidisciplinary Survey and
Framework for Design and Evaluation of Explainable AI Systems. ACM Transactions
on Interactive Intelligent Systems, 11(3-4):24:1–24:45, August 2021. ISSN 2160-6455.
doi: 10.1145/3387166.
[52] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and
Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor
decomposition. Pattern Recognition, 65:211–222, May 2017. ISSN 0031-3203. doi:
10.1016/j.patcog.2016.11.008.
[53] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for inter-
preting and understanding deep neural networks. Digital Signal Processing, 73:1–15,
February 2018. ISSN 1051-2004. doi: 10.1016/j.dsp.2017.10.011.
[54] Joao C. Neves, Ruben Tolosana, Ruben Vera-Rodriguez, Vasco Lopes, Hugo Proença,
and Julian Fierrez. GANprintR: Improved fakes and evaluation of the state of the art in
face manipulation detection. IEEE Journal of Selected Topics in Signal Processing, 14
(5):1038–1048, 2020.
[55] Huy H. Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen. Multi-task Learn-
ing for Detecting and Segmenting Manipulated Facial Images and Videos. In 2019
IEEE 10th International Conference on Biometrics Theory, Applications and Systems
(BTAS), pages 1–8. IEEE, 2019.
[56] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using Cap-
sule Networks to Detect Forged Images and Videos. In ICASSP 2019 - 2019 IEEE In-
ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
2307–2311, May 2019. doi: 10.1109/ICASSP.2019.8682602.
[57] Nick Dufour and Andrew Gully. Deep Fake Detection Dataset by Google and JigSaw,
2019.
[58] Y. Nirkin, I. Masi, A. Tran, Tal Hassner, and G. Medioni. On Face Segmentation, Face
Swapping, and Face Perception. 2018 13th IEEE International Conference on Auto-
matic Face & Gesture Recognition (FG 2018), 2018. doi: 10.1109/FG.2018.00024.
[59] Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject Agnostic Face Swap-
ping and Reenactment. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 7184–7193, 2019.
[60] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand
Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent
Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher,
Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research, 12(85):2825–2830, 2011. ISSN 1533-7928.
[61] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for
Explanation of Black-box Models. In British Machine Vision Conference (BMVC),
September 2018.
[62] Samuele Pino, Mark James Carman, and Paolo Bestagini. What’s wrong with this
video? Comparing Explainers for Deepfake Detection. arXiv:2105.05902 [cs], May
2021.
[63] Jiameng Pu, Neal Mangaokar, Lauren Kelly, Parantapa Bhattacharya, Kavya Sun-
daram, Mobin Javed, Bolun Wang, and Bimal Viswanath. Deepfake Videos in the
Wild: Analysis and Detection. arXiv:2103.04263 [cs], March 2021.
[64] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?":
Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs, stat], August
2016.
[65] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and
Matthias Niessner. FaceForensics++: Learning to Detect Manipulated Facial Images.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages
1–11, 2019.
[66] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based
noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
[67] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and
Klaus-Robert Müller. Evaluating the Visualization of What a Deep Neural Network
Has Learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):
2660–2673, November 2017. ISSN 2162-2388. doi: 10.1109/TNNLS.2016.2599820.
[71] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside Convo-
lutional Networks: Visualising Image Classification Models and Saliency Maps.
arXiv:1312.6034 [cs], April 2014.
[72] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg.
SmoothGrad: Removing noise by adding noise. arXiv:1706.03825 [cs, stat], June
2017.
[73] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Ried-
miller. Striving for Simplicity: The All Convolutional Net. In Yoshua Bengio and
Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR
2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015.
[74] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep
Networks. arXiv:1703.01365 [cs], June 2017.
[76] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias
Niessner. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 2387–2395, 2016.
[77] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering:
Image synthesis using neural textures. ACM Transactions on Graphics, 38(4):66:1–
66:12, July 2019. ISSN 0730-0301. doi: 10.1145/3306346.3323035.
[78] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier
Ortega-Garcia. Deepfakes and beyond: A Survey of face manipulation and fake de-
tection. Information Fusion, 64:131–148, December 2020. ISSN 1566-2535. doi:
10.1016/j.inffus.2020.06.014.
[79] Carlo Tomasi and Roberto Manduchi. Bilateral filtering for gray and color images.
In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271),
pages 839–846. IEEE, 1998.
[80] Loc Trinh, Michael Tsang, Sirisha Rambhatla, and Yan Liu. Interpretable and Trust-
worthy Deepfake Detection via Dynamic Prototypes. In Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision, pages 1973–1983, 2021.
[81] Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L. Griffiths. Are Convolutional
Neural Networks or Transformers more like human vision? In Cognitive Science Soci-
ety, July 2021.
[82] Cristian Vaccari and Andrew Chadwick. Deepfakes and Disinformation: Exploring
the Impact of Synthetic Political Video on Deception, Uncertainty, and Trust in News.
Social Media + Society, 6(1):2056305120903408, January 2020. ISSN 2056-3051.
doi: 10.1177/2056305120903408.
[83] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In NIPS,
December 2017.
[84] Kaili Wang, Jose Oramas, and Tinne Tuytelaars. Towards Human-Understandable
Visual Explanations: Imperceptible High-frequency Cues Can Better Be Removed.
arXiv:2104.07954 [cs], April 2021.
[85] Mika Westerlund. The emergence of deepfake technology: A review. Technology Inno-
vation Management Review, 9(11), 2019. ISSN 1927-0321. doi: 10.22215/timreview/
1282.
[86] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking
spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In
Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321,
2018.
[87] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image
Caption Generation with Visual Attention. In International Conference on Machine
Learning, pages 2048–2057. PMLR, June 2015.
[88] Shawn Xu, Subhashini Venugopalan, and Mukund Sundararajan. Attribution in Scale
and Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 9680–9689, 2020.
[89] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head
poses. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 8261–8265. IEEE, 2019.
[90] Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, and Nong Sang.
Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmen-
tation. International Journal of Computer Vision, pages 1–18, 2021.
[91] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional
Networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, edi-
tors, Computer Vision – ECCV 2014, Lecture Notes in Computer Science, pages 818–
833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1. doi:
10.1007/978-3-319-10590-1_53.
[92] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint Face Detec-
tion and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Sig-
nal Processing Letters, 23(10):1499–1503, October 2016. ISSN 1558-2361. doi:
10.1109/LSP.2016.2603342.
[93] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba.
Learning Deep Features for Discriminative Localization. arXiv:1512.04150 [cs], De-
cember 2015.
[94] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation
by adversarially disentangled audio-visual representation. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 33, pages 9299–9306, 2019.
[95] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeep-
fake: A Challenging Real-World Dataset for Deepfake Detection. arXiv:2101.01456
[cs], January 2021.
Supplementary Material
A Datasets
A.1 DeepFake Detection Challenge
The DFDC dataset was released as part of the homonymous Kaggle challenge [21]. It con-
tains approximately 120k videos, the majority of which are DeepFakes created with different
manipulation methods. The dataset comes in different variants of compression quality, which
are mostly relevant for training robust classifiers and assessing detection performance. For
an analysis of explanation quality, we choose to work on high-quality videos where we ex-
pect manipulation artifacts to be more prominent. Table 2 reports how many videos are used
for training, validation, testing and explanation evaluation.
Preprocessing. The videos from the dataset might contain one or more faces. Of these,
only one is manipulated in the case of “fake” videos. For the purpose of training the Deep-
Fakes classifier, each video is preprocessed as follows:
1. Videos are spatially resized with padding so that each frame is 640 × 640 pixels;
2. MTCNN is applied every 5 frames, which outputs rectangular bounding boxes tightly
cropped around all faces in a frame
3. Face detections are linked across frames using a greedy overlap-based heuristic; namely,
if two bounding boxes overlap with IoU > 0.5 they are considered the same face (a sketch
of steps 3–5 is given after this list);
4. The longest consecutive sequence of linked boxes is considered the main face and is
assumed to be the target for DeepFake manipulation, all other boxes are discarded and
the video is clipped to the frames containing the main face;
5. For intermediate frames where MTCNN was not applied, a bounding box for the main
face is created by linearly interpolating the corners of the two closest boxes;
6. Boxes are expanded by 1.5× to capture more of the hair, neck and background
7. All frames belonging to the main face sequence are cropped according to their box;
rectangular crops are resized to 224 × 224 and used for training;
8. BiSeNet is applied to every 5th frame of a 512×512 version of the aforementioned video;
its probabilistic output is resized with bilinear interpolation to match the original size,
and the most likely face part is selected for each pixel.
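As an illustration of steps 3–5 above (the greedy linking and box interpolation), the sketch below implements the core logic in plain Python; the face detector is treated as a black box and the helper names are ours, not from the original pipeline.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_main_face(detections, stride=5):
    """detections: {frame_index: list_of_boxes} from a face detector run every `stride` frames.
    Greedily links boxes across detection frames (IoU > 0.5), keeps the longest track,
    and linearly interpolates box corners for intermediate frames."""
    tracks = []                                       # each track: list of (frame, box)
    for t in sorted(detections):
        for box in detections[t]:
            for track in tracks:
                last_t, last_box = track[-1]
                if last_t == t - stride and iou(last_box, box) > 0.5:
                    track.append((t, box))
                    break
            else:
                tracks.append([(t, box)])
    main = max(tracks, key=len)                       # longest consecutive sequence = main face
    boxes = {}
    for (t0, b0), (t1, b1) in zip(main, main[1:]):    # interpolate between detection frames
        for t in range(t0, t1):
            w = (t - t0) / (t1 - t0)
            boxes[t] = tuple((1 - w) * np.array(b0) + w * np.array(b1))
    boxes[main[-1][0]] = main[-1][1]
    return boxes
```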
Classification. For training, validation, and testing, each video is processed separately.
This means that fake videos will not be perfectly aligned with the corresponding real video,
neither in space nor in time. Also, it means that detection and parsing might fail on some
fake videos, due to the low quality of the manipulation. While this can hinder training, it
also represents a realistic scenario where unseen videos are submitted to a trained classifier.
Table 2: Dataset sizes: number of real and fake videos contained in each dataset and split.
DFDC is used to train all classifiers in this work, to report classification metrics, and to
evaluate explanation metrics. DFD is only used for testing and explanation evaluation.
              DFDC                DFD
              Real      Fake      Real    Fake
Train         19143     99953     -       -
Validation    1975      1968      -       -
Test          2479      2486      37      107
Explanation   100       230       37      107
B Classification performance
We report relevant classification metrics for the test split of DFDC in Table 3 and for a
subset of DFD in Table 4. In addition to cross-entropy loss (LCE ) and area under the receiver
operating characteristic curve (AROC ), we also report precision, recall, and F1 scores obtained
when the model output is binarized with a threshold of 0.5. Furthermore, we report average
precision (AP), i.e. the area under the precision-recall curve as the classification threshold
changes. For each metric and configuration described in Section 3.2 and Section 4, we report
mean and standard deviation of 3 runs.
Table 3: Classification metrics for DFDC, test split. For each configuration and metric, mean
and standard deviation of 3 runs are reported. For all metrics except the LCE loss, values are
given in percentage and a higher value indicates a better result.
Table 4: Classification metrics for DFD. For each configuration and metric, mean and stan-
dard deviation of 3 runs are reported. For all metrics except the LCE loss, values are given in
percentage and a higher value indicates a better result.
C Training details
C.1 High-frequencies smoothing
For models trained with smoothing preprocessing, either Gaussian blur or bilateral filtering
are applied. Gaussian blur is applied at the video level using a spatial standard deviation of
0.8 and a temporal standard deviation of 0.5. Bilateral filtering is applied per-frame using a
spatial standard deviation of 2 and a color range standard deviation of 0.1. These values are
empirically chosen so that the filtered videos remain qualitatively similar to the original ones
and that common DeepFake artefacts are still visible.
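A minimal sketch of the Gaussian variant is given below, assuming a (T, H, W, C) float video in [0, 1]; the bilateral filter would come from an image-processing library and is omitted here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_video(video):
    """Spatio-temporal Gaussian blur: temporal std 0.5, spatial std 0.8,
    no smoothing across color channels (Appendix C.1)."""
    return gaussian_filter(video, sigma=(0.5, 0.8, 0.8, 0.0))

# Example: smoothed = smooth_video(np.random.rand(16, 224, 224, 3))
```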
C.3 Cutout
When cutout is enabled, each video is augmented with cutout with probability 0.5. Cutout
acts on a 64 × 64 region of the video selected with uniform probability. Once a mask is
selected, its contents are blurred with a strong Gaussian filter (standard deviation 4).
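The augmentation can be sketched as follows; the (T, H, W, C) layout, the uniform sampling of the patch corner, and the purely spatial blur are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def facial_cutout(video, rng, p=0.5, size=64, blur_std=4.0):
    """With probability p, strongly blur a random (size x size) spatial patch of a
    (T, H, W, C) video, instead of zeroing it as in standard Cutout (Appendix C.3)."""
    if rng.random() >= p:
        return video
    _, h, w, _ = video.shape
    y = rng.integers(0, h - size)
    x = rng.integers(0, w - size)
    out = video.copy()
    patch = out[:, y:y + size, x:x + size, :]
    out[:, y:y + size, x:x + size, :] = gaussian_filter(patch, sigma=(0, blur_std, blur_std, 0))
    return out

# Example: rng = np.random.default_rng(0); augmented = facial_cutout(video, rng)
```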
Compute resources. All models are trained on a single machine equipped with 4 NVIDIA
V100 GPUs with 32GB of RAM each, which allow for large batch sizes. Once trained, the
model can be run for both inference and explanations on more modest hardware, e.g. a single
GPU environment with 12GB of RAM.
D Explanation metrics
This section details how the explanation metrics introduced in Section 3.2 are computed in
practice using discretized videos, masks and heatmaps. Furthermore, Table 5 and Table 6
report the metrics for all variations considered in this work as the average and standard
deviation of 3 runs each.
The main text uses a compact notation where videos are defined as a mapping from the
discretized grid G = {1, . . . , T } × {1, . . . , H} × {1, . . . ,W } to RGB pixel values. For this
appendix, we choose a more explicit notation based on tensors. We remark the equivalence
between the two notations since any function f : G → R+ can be uniquely represented as a
T×H×W tensor whose element at (t, h, w) is f(t, h, w). In this context, it is useful to define
the derivative and integral operators ∇ and ∫ as:

(∇f(ρ))_i = f(ρ + e_i) − f(ρ) ,   i = 1, 2, 3 ,    (6)

∫_G f dλ = (1/THW) ∑_{ρ∈G} f(ρ) ,    (7)

where the vector ρ = (t, u, v) denotes the pixel coordinates, and the vectors e_i are the usual
orthonormal basis, i.e. (e_i)_j = δ_ij.
With this notation, the Total Variation of Equation (1) is computed as

TV(h) = (1/THW) ∑_{ρ∈G} ‖∇h(ρ)‖_1 ,    (8)

where the ℓ1 norm of the discrete gradient expands to

‖∇h(ρ)‖_1 = |h(t, u, v) − h(t+1, u, v)| + |h(t, u, v) − h(t, u+1, v)| + |h(t, u, v) − h(t, u, v+1)| .    (9)
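A direct numpy transcription of Equations (8–9) is given here as an illustrative sketch rather than the original code; the heatmap is assumed to be a non-negative (T, H, W) array.

```python
import numpy as np

def total_variation(h):
    """Mean L1 norm of the forward-difference gradient of a (T, H, W) heatmap (Eqs. 8-9)."""
    dt = np.abs(h[:-1, :, :] - h[1:, :, :])   # temporal differences
    du = np.abs(h[:, :-1, :] - h[:, 1:, :])   # vertical differences
    dv = np.abs(h[:, :, :-1] - h[:, :, 1:])   # horizontal differences
    return (dt.sum() + du.sum() + dv.sum()) / h.size
```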
For locality, the heatmap h is interpreted as a probability distribution over pixel coordinates, whose mean is

µ = ∑_{ρ∈G} ρ h(ρ) ,    (10)

where the vector ρ = (t, u, v)^T represents pixel coordinates, and whose covariance is

Σ = ∑_{ρ∈G} (ρ − µ)(ρ − µ)^T h(ρ) .    (11)

Then, to summarize the 3 × 3 variance matrix as a scalar, we consider its volume given by the determinant |det(Σ)|. Larger volumes correspond to more spread out heatmaps, while smaller values indicate more localized explanations. Importantly, this metric is particularly indicated for unimodal heatmaps that focus around a single location of the video.
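The locality score can be sketched as below (again an illustration under the same assumptions as above); the σ column of Tables 5 and 6 reports the cube root of this determinant, which has the more interpretable scale of a length in pixels.

```python
import numpy as np

def locality(h):
    """Generalized variance sigma = |det(Sigma)| of pixel coordinates under a (T, H, W)
    heatmap, following Eqs. (10-11)."""
    h = h / h.sum()
    coords = np.stack(np.meshgrid(*[np.arange(s) for s in h.shape], indexing="ij"))
    coords = coords.reshape(3, -1).astype(float)      # (3, T*H*W) pixel coordinates
    w = h.reshape(-1)
    mu = coords @ w                                    # Eq. (10): weighted mean coordinate
    centered = coords - mu[:, None]
    cov = (centered * w) @ centered.T                  # Eq. (11): weighted covariance matrix
    return abs(np.linalg.det(cov))
```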
The Gini Index of Equation (3) is computed on the sorted heatmap values as

G = [2 ∑_i i · h(ρ_i)] / [THW ∑_i h(ρ_i)] − (THW + 1) / THW ,    (12)

with the indices i = 1, . . . , THW that select pixel coordinates such that h(ρ_i) ≤ h(ρ_{i+1}).
Notably, the “sparsity” measured by the Gini Index refers to the scalar importance values of
each pixel and not their location in the heatmap. The heatmap will have a high Gini Index
if most of the explanation mass is concentrated in few highly-relevant pixels while all other
pixels have low relevance.
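Equation (12) maps directly to a few lines of numpy, sketched here for illustration:

```python
import numpy as np

def gini_index(h):
    """Gini Index of a non-negative (T, H, W) heatmap (Eq. 12): values near 1 indicate that
    most relevance is concentrated in few pixels, values near 0 a uniform heatmap."""
    x = np.sort(h.reshape(-1))          # ascending heatmap values h(rho_1) <= ... <= h(rho_N)
    n = x.size
    i = np.arange(1, n + 1)
    return 2 * np.sum(i * x) / (n * np.sum(x)) - (n + 1) / n
```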
D.4 Faithfulness
Faithfulness is generally used to compare explanation methods on the basis of how closely
they identify portions of the input that are meaningful for the classifier and a particular
decision. Faithfulness is measured using the deletion score, which represents the area under
the curve traced by the confidence in p(FAKE|v) as pixels are removed from the video in
decreasing order of relevance. Considering the large amount of pixels in a video, the curve
is approximated by removing several pixels in one step. Specifically, we consider the sorted
values of a heatmap h and group them into 25 bins. These bins do not necessarily contain the
same number of pixels, but the total relevance in each bin is approximately the same.
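A binned approximation of the deletion score might be sketched as follows; the masking-by-zeroing and the exact bin construction are assumptions, since the text only specifies 25 bins of approximately equal total relevance.

```python
import numpy as np

def deletion_score(predict, video, heatmap, n_bins=25):
    """Approximate area under the deletion curve (Appendix D.4).
    `predict` maps a (T, H, W, 3) video to p(FAKE|v); `heatmap` is (T, H, W) and sums to 1.
    Pixels are removed most-relevant-first, grouped into bins of roughly equal total relevance."""
    flat = heatmap.reshape(-1)
    order = np.argsort(flat)[::-1]                     # most relevant pixels first
    cum = np.cumsum(flat[order]) / flat.sum()          # cumulative relevance in (0, 1]
    bin_of = np.minimum((cum * n_bins).astype(int), n_bins - 1)
    keep = np.ones(flat.size, dtype=bool)
    scores = []
    for b in range(n_bins):
        keep[order[bin_of == b]] = False               # delete the next group of pixels
        masked = video * keep.reshape(heatmap.shape)[..., None]
        scores.append(predict(masked))
    return float(np.mean(scores))
```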
In this work, we are interested in the properties of explanations rather than of explanation
methods. However, a faithful explanation method is a prerequisite for further evaluation
of explanation quality. As a preliminary step, we compare the four explanation methods
discussed in Sec. 3.1 and choose the most faithful one based on its deletion scores on the
baseline model. The following hyperparameters are used: in SmoothGrad, gradients are
averaged over 25 randomly perturbed videos with a noise parameter equal to 0.15 of the
RGB range; in Integrated Gradients the path integral is calculated w.r.t. a black video baseline
using 25 interpolation steps.
With respect to fake videos in DFDC and DFD, the four methods achieve the following
average deletion scores: Sensitivity 42.54%, Gradient×Input 43.68%, SmoothGrad 41.25%,
Integrated Gradients 43.77%. Therefore, SmoothGrad has the lowest deletion score of the
four (p value of paired one-sided t-test < 10−5 ) and is then employed throughout all experi-
ments.
For the main manipulation-detection metrics, we measure the precision of the 100 most-relevant
pixels in the heatmap (P100) and the percentage of heatmap mass contained in the ground-truth
manipulation mask (Min). This
approach has the advantage of being threshold independent, as opposed to binarizing the
heatmap using an arbitrary threshold and computing metrics such as Intersection over Union.
Notably, both metrics penalize explanations that focus outside the ground-truth mask, but
can not distinguish whether the heatmap clusters around an actual manipulation artefact or
is uniformly scattered inside the mask (Figure 6).
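For a single video, the two scores can be sketched as below, assuming a binary (T, H, W) ground-truth manipulation mask aligned with the heatmap:

```python
import numpy as np

def precision_at_100(heatmap, gt_mask):
    """Fraction of the 100 most-relevant pixels that fall inside the ground-truth mask (P100)."""
    top = np.argsort(heatmap.reshape(-1))[::-1][:100]
    return gt_mask.reshape(-1)[top].mean()

def mass_inside(heatmap, gt_mask):
    """Fraction of total heatmap mass contained in the ground-truth manipulation mask (Min)."""
    return (heatmap * gt_mask).sum() / heatmap.sum()
```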
Table 5: Explanation metrics for DeepFake Detection Challenge (DFDC). Mean and stan-
dard deviation of 3 independent classifiers whose decisions are explained using SmoothGrad.
                 TV ↓            ∛σ ↓             Gini ↑          Min ↑           P100 ↑
                 avg     std     avg      std     avg     std     avg     std     avg     std
S3D Baseline     0.285   0.006   814.4    31.0    74.93   1.00    17.43   0.38    29.43   1.34
S3D Bilateral    0.428   0.044   1372.6   268.8   67.89   3.50    12.15   3.32    13.47   8.77
S3D Gaussian     0.292   0.004   841.4    15.1    75.55   0.44    17.38   0.51    26.31   1.19
S3D TV Loss      0.256   0.003   726.3    54.8    77.40   1.97    17.68   1.00    34.46   1.85
S3D Cutout       0.296   0.005   839.0    40.2    75.26   1.40    17.76   0.63    30.48   1.10
MViT             0.246   0.013   808.3    17.1    80.42   0.36    22.11   0.64    36.03   3.43
Table 6: Explanation metrics for DeepFake Detection Dataset (DFD). Mean and standard
deviation of 3 independent classifiers whose decisions are explained using SmoothGrad.
                 TV ↓            ∛σ ↓             Gini ↑          Min ↑           P100 ↑
                 avg     std     avg      std     avg     std     avg     std     avg     std
S3D Baseline     0.261   0.009   827.6    34.4    74.84   0.74    17.09   0.15    28.62   0.80
S3D Bilateral    0.413   0.061   1426.3   346.3   68.92   4.95    13.47   4.25    16.10   16.48
S3D Gaussian     0.273   0.011   799.2    11.0    76.54   0.17    16.83   0.82    26.87   1.16
S3D TV Loss      0.244   0.005   742.1    45.3    77.08   1.70    17.20   0.92    30.09   2.72
S3D Cutout       0.274   0.004   838.7    46.5    75.33   1.47    16.87   0.31    29.22   0.96
MViT             0.250   0.016   834.5    27.3    80.02   0.10    20.35   0.50    29.81   3.06
E Additional examples
Additional examples of semantic parsing, manipulation detection (Section 3.2.2), and expla-
nation post-processing for the user study (Section 4.2) are shown below.
Figure 4: User study visualization: real video, fake video, enhanced heat-map, Gaussian
matching, blob detection, semantic aggregation.
Figure 6: Additional examples for manipulation detection. Random frames from random
videos. From left to right: original, part-based manipulated video, heatmap.