
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

ADVANCED REVIEW | OPEN ACCESS

Adversarial Attacks in Explainable Machine Learning: A Survey of Threats Against Models and Humans

Jon Vadillo¹ | Roberto Santana¹ | Jose A. Lozano¹,²

¹Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, San Sebastian, Spain | ²Basque Center for Applied Mathematics (BCAM), Bilbao, Spain

Correspondence: Jon Vadillo ([Link]@[Link])

Received: 24 November 2021 | Revised: 23 September 2024 | Accepted: 2 October 2024

Funding: This work was supported by the Eusko Jaurlaritza (IT1504-22, BERC 2022-2025 and KK-2020/00049, KK-2023/00012, KK-2024/00030), by the Spanish Ministry of Economy and Competitiveness MINECO (projects PID2019-104966GB-I00 and PID2022-137442NB-I00) and by the Spanish Ministry of Science, Innovation and Universities (FPU19/03231 predoctoral grant). Jose A. Lozano acknowledges support by the Spanish Ministry of Science, Innovation and Universities through BCAM Severo Ochoa accreditation (SEV-2017-0718 and CEX2021-001142-S).

Keywords: adversarial examples | deep neural networks | explainable machine learning

ABSTRACT
Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors, such as the type of problem, the user's expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.

1   |   Introduction

Machine learning models, such as deep neural networks, still face several weaknesses that hamper the development and deployment of these technologies, despite their outstanding and ever-increasing capacity to solve complex artificial intelligence problems. One of the main shortcomings is their black-box nature, which prevents analyzing and understanding their reasoning process, while such a requirement is ever more in demand in order to guarantee a reliable and transparent use of artificial intelligence. To overcome this limitation, different strategies have been proposed in the literature (Gilpin et al. 2018; Samek et al. 2021; Zhang et al. 2021), ranging from post hoc explanation methods, which try to identify the parts, elements or concepts in the inputs that most affect the decisions of trained models (Ghorbani et al. 2019; Kim et al. 2018; Yosinski et al. 2015; Zeiler and Fergus 2014), to more proactive approaches which pursue transparent reasoning by training inherently interpretable models (Alvarez-Melis and Jaakkola 2018a; Chen, Li, et al. 2019; Hase et al. 2019; Li et al. 2018; Saralajew et al. 2019; Zhang, Wu, and Zhu 2018).

Edited by: Mehmed Kantardzic, Associate Editor and Witold Pedrycz, Editor-in-Chief

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is
properly cited.

© 2024 The Author(s). WIREs Data Mining and Knowledge Discovery published by Wiley Periodicals LLC.

Another issue that threatens the reliability of deep neural networks is their low robustness to adversarial examples (Szegedy et al. 2014; Yuan et al. 2019), that is, to inputs manipulated in order to maliciously change the output of a model while the changes are imperceptible to humans. Indeed, this can be seen as a direct implication of their lack of human-like reasoning. Therefore, improving the explainability of the models is also a promising direction to achieve adversarial robustness, a hypothesis which is supported by recent works which show that interpretability and robustness are connected (Etmann et al. 2019; Noack et al. 2021; Ros and Doshi-Velez 2018; Tsipras et al. 2019; Zhang and Zhu 2019).

Furthermore, the study of adversarial attacks against explainable models has gained interest in recent years, as will be fully reviewed in Sections 2.3 and 2.4. In contrast to common adversarial attacks, which focus solely on changing the classification of the model (Yuan et al. 2019), attacks on explainable models need to consider both changes in the classification and in the explanation supporting that classification. Another key difference when considering attacks against explainable models is related to their stealthiness. Generally, the only constraint assumed in order to produce a stealthy attack is that the changes added to the inputs must be imperceptible to humans. However, the use of explainable models implies a different scenario, where it is assumed that a human will observe and analyze not only the input, but also the model classification and explanation. Therefore, uncontrolled changes in both factors may cause inconsistencies, alerting the human. For this reason, the assumption of explainable classification models introduces a new question regarding the definition of adversarial examples: can adversarial examples be deployed if humans observe not only the input but also the output classification and/or the corresponding explanation?

1.1   |   Objectives and Contributions

The objective of this exploratory review is to shed light on this question by extending the notion of adversarial examples for explainable machine learning scenarios, in which humans can not only assess the input sample, but also compare it to the output of the model and to the explanation. These extended notions of adversarial examples allow us to analyze the possible attacks that can be produced by means of adversarially changing the model's classification and explanation, either jointly or independently (that is, changing the explanation without altering the output class, or vice versa). Our analysis leads to a framework that establishes whether (and how) adversarial attacks can be generated for explainable models under human supervision. Moreover, we describe the requirements that adversarial examples should satisfy in order to be able to mislead an explainable model (and even a human) depending on multiple scenarios or factors which, despite their relevance, are often overlooked in the literature on adversarial examples for explainable models, such as the expertise of the user or the objective of the explanation. Finally, the proposed attack paradigms are also illustrated by adversarial examples generated for two representative image classification tasks, as well as for two different explanation methods. The outline of our work is summarized in Figure 1.

The aim of all these contributions is to establish a basis for a more rigorous study of the vulnerabilities of explainable machine learning in adversarial scenarios. We believe that these fields will benefit from our work in the following ways.

• Heretofore, studies on adversarial attacks against explainable models have considered very particular or fragmented scenarios and attack paradigms. Thus, there is a lack of a unifying perspective in this field that connects all these works within a general analytical framework and taxonomy, which is a gap that we aim to fill with this review.

• The framework we propose encompasses not only attack paradigms which have already been investigated in the literature, but also paradigms that, to the best of our knowledge, have not yet been studied, paving the way for new research avenues.

• In addition, the role of the human is often overlooked in the study of attacks against explainable models, despite being a key factor in these scenarios. In this work, we address this limitation by analyzing the requirements that adversarial examples should satisfy in order to be able to mislead an explainable model, and even a human, depending on the attack scenario. This analysis provides a road map for the design of realistic attacks against explainable models.

• Furthermore, the fact that our framework considers a wide range of scenarios that an adversary may face allows us to summarize which paradigms are realistic or unrealistic in each of them, which is fundamental to ensure that attack methods are evaluated with an appropriate setting and methodology in future works.

• On another note, our work also contributes to raising awareness about the possible attack types that both models and humans may face in realistic adversarial scenarios, which is important to promote a more aware and secure use of machine learning based technologies, or even the development of more robust models or explanation methods.

For the above reasons, the aim of this work is to contribute to more methodical research in this area, delimiting the differences between the possible attack paradigms, identifying limitations in the current approaches and establishing more fine-grained and rigorous standards for the development and evaluation of new attacks or defenses.

2   |   Related Work

Our work focuses on adversarial attacks against explainable machine learning models. Therefore, this section provides an introduction to both research topics. This introduction will first summarize each research field independently and, afterward, the intersection between both, as follows.

To begin with, the fields of explainable and adversarial machine learning are presented in Sections 2.1 and 2.2, respectively. Subsequently, the reliability of the explanation methods in adversarial scenarios is discussed in Section 2.3. Finally, further connections between explanation methods and adversarial examples are discussed in Section 2.4.

FIGURE 1    |    Outline of our exploratory review.

2.1   |   Overview of Explanation Methods in Machine Learning

In this section, we summarize the explanation methods proposed in the literature in order to present the terminology and taxonomy that will be used in the subsequent sections to develop our analytical framework on adversarial examples in explainable models.

2.1.1   |   Scope, Objective, and Impact of the Explanations

The objective of an explanation is to justify the behavior of a model in a way that is easily understandable to humans. However, different users might be interested in different aspects of the model, and, therefore, the explanations can be generated for different scopes or objectives.

Overall, the scope of an explanation can be categorized as local or global (Zhang et al. 2021). On the one hand, local explanations aim to characterize or explain the model's prediction for each particular input individually, for example, by identifying the most relevant parts or features of the input. On the other hand, global explanations attempt to expose the general reasoning process of the model, for instance, by summarizing (e.g., using a simpler but interpretable model) when a certain class will be predicted, or by describing to what extent a particular input feature is related to one class. Since this paper addresses the vulnerability of explainable models to adversarial examples, and the interest is therefore placed on specific inputs and their corresponding outputs, we focus on local explanations.

In addition, explanations can be used, even for the same model, for different purposes. For instance, users querying the model for a credit loan might be interested in explaining the output obtained for their particular cases only, whereas a developer might be interested in discovering why that model misclassifies certain input samples. At the same time, an analyst can be interested in whether that model is biased against a social group for unethical reasons. Moreover, explanations can be leveraged to enable users to interactively refine the model, thus enhancing their understanding and trust in the system (Guo et al. 2022; Ross, Hughes, and Doshi-Velez 2017; Schramowski et al. 2020; Teso et al. 2023; Teso and Kersting 2019). At a higher level, all these purposes are based on necessities involving ethics, safety or knowledge acquisition, among others (Doshi-Velez and Kim 2018). Based on the purpose of the explanations and the particular problem, domain or scenario in which they are required, another relevant factor should be taken into consideration: the impact of the explanations, which can be defined as the consequence of the decisions made based on the analysis of the explanation. Healthcare domains are clear examples in which the consequences of the decisions can be severe.

Despite the relevance of these factors, they are often overlooked when local explanation methods are designed or evaluated (Doshi-Velez and Kim 2018; Nauta et al. 2023; Zhang et al. 2021). The same happens for adversarial attacks on explainable models. We argue that the scope, the objective and the impact of explanations should be key factors when designing adversarial attacks against explainable models, since a different attack strategy needs to be adopted in each context to successfully deceive the model (and the human). This will be discussed in detail in Section 3.

2.1.2   |   Types of Explanations

Different types of explanations exist depending on how the explanation is conveyed:

• Feature-based explanations: these assign an importance score to each feature in the input, based on its relevance for the output classification (Baehrens et al. 2010; Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2016; Robnik-Šikonja and Kononenko 2008; Štrumbelj and Kononenko 2010). Common feature-based explanations (especially in the image domain) are activation or saliency maps, which highlight the most relevant parts of the input (Bach et al. 2015; Morch et al. 1995; Selvaraju et al. 2017; Simonyan, Vedaldi, and Zisserman 2014; Springenberg et al. 2015; Sundararajan, Taly, and Yan 2017; Zeiler and Fergus 2014); a minimal sketch of such a map is given after Figure 2. Despite their extensive use, acceptance and scientific relevance, several works have put forward opposing views, identifying that such explanations can be unreliable and misleading (Chen, Bei, and Rudin 2020; Hase et al. 2019; Kim et al. 2018; Kindermans et al. 2019; Lipton 2018; Rudin 2019).

• Example-based explanations: the explanation is based on comparing the similarity between the input at hand and a set of prototypical inputs that are representative of the predicted class. Thus, the classification of a given input sample is justified by the similarity between it and the prototypes of the predicted class. We will also refer to these types of explanations as prototype-based explanations in this paper, although different forms of example-based explanation exist, such as the strategies proposed in Koh and Liang (2017) or Yeh et al. (2018), based on estimating the training images most responsible for a prediction. Recent works have integrated prototype-based explanations directly in the learning process of neural networks, so that the classification is based on the similarities between the input and a set of prototypes (Alvarez-Melis and Jaakkola 2018a; Chen, Li, et al. 2019; Gautam et al. 2022; Hase et al. 2019; Li et al. 2018; Nauta, Van Bree, and Seifert 2021), achieving a more interpretable reasoning. The prototypes can represent an entire input describing one class (e.g., a prototypical handwritten digit "1" in digit classification) (Gautam et al. 2022; Li et al. 2018), or represent image parts or semantic concepts (Alvarez-Melis and Jaakkola 2018a; Chen, Li, et al. 2019; Hase et al. 2019; Nauta, Van Bree, and Seifert 2021).

• Rule-based explanations: these explanation methods aim to expose the reasoning of a model as a simplified or human-understandable set of rules, such as logic rules or if-then-else rules, which represent a natural form of explanation for humans (Guidotti et al. 2019; Lakkaraju et al. 2019; Ribeiro, Singh, and Guestrin 2018; van der Waa et al. 2021). Rule-based explanations are particularly well-suited when the input contains features which are easily interpretable.

• Counterfactual explanations: although counterfactual explanations (Guidotti et al. 2019; Wachter, Mittelstadt, and Russell 2017) can be considered, in their form, as rule-based explanations, the main difference of these explanations is their conditional or hypothetical reasoning nature, as the aim is to suggest the possible changes that should happen in the input to receive a different (and frequently more positive) output classification (e.g., "a rejected loan request would be accepted if the subject had a higher income").

Some illustrative examples of these four types of explanations are presented in Figure 2. Overall, the most suitable type of explanation depends on the domain, the scope and the purpose of the explanation, as well as on the expertise level of the users querying the model. We refer the reader to Samek et al. (2021), Zhang et al. (2021) and Gilpin et al. (2018) for a more fine-grained overview of explanation methods. These surveys also provide an exhaustive enumeration of relevant methods in the literature focused on computing such explanations. Furthermore, for technical papers on the quantitative and qualitative evaluation of explanation methods, we refer the reader to Nauta et al. (2023), Kim et al. (2022), and Lu et al. (2021).
FIGURE 2    |    Illustrative examples of the four main types of explanations in machine learning.
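To make the feature-based category concrete, the following is a minimal sketch of a gradient-based saliency map in the spirit of Simonyan, Vedaldi, and Zisserman (2014): each input feature is scored by the gradient of the predicted class score with respect to it. The sketch assumes a differentiable PyTorch classifier `model` and a single image tensor `x` (channels first); these names are illustrative, not part of any particular method cited above.

```python
import torch

def saliency_map(model, x):
    """Score each pixel by the gradient of the top class score w.r.t. it."""
    x = x.clone().detach().requires_grad_(True)
    scores = model(x.unsqueeze(0)).squeeze(0)  # class scores for one input
    scores[scores.argmax()].backward()         # gradient of the predicted class
    return x.grad.abs().max(dim=0).values      # max over channels -> H x W map
```

Perturbing the input to manipulate this map, rather than the class score, is precisely the attack surface discussed later in Sections 2.3 and 3.2.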

2.2   |   Overview of Adversarial Attacks Against Machine Learning Models

Adversarial examples (Szegedy et al. 2014) are inputs deliberately manipulated by a malicious actor with the purpose of (i) fooling the model into providing incorrect predictions and (ii) ensuring that the perturbations are imperceptible to humans. An illustration of an adversarial example is shown in Figure 3. The imperceptibility constraint ensures that the adversarial perturbation introduced to the inputs does not legitimate the change in the output classification. At the same time, the fact that imperceptible perturbations can fool machine learning models raised alarms regarding the vulnerability of these systems. Indeed, adversarial examples have been shown to be applicable to a wide range of high-stakes and human-centered applications, such as healthcare systems (Asgari Taghanaki, Das, and Hamarneh 2018; Bortsova et al. 2021; Finlayson et al. 2019; Hirano, Minagi, and Takemoto 2021; Joel et al. 2022; Li and Zhu 2020; Li et al. 2020; Ma et al. 2021; Paschali et al. 2018; Paul et al. 2020; Rahman et al. 2021; Yoo and Choi 2020), surveillance systems (Bai et al. 2021; Sharif et al. 2016; Thys, Ranst, and Goedemé 2019; Wang, Wang, et al. 2020; Xu et al. 2020; Zheng, Lu, and Velipasalar 2020), machine translation (Belinkov and Bisk 2018; Cheng et al. 2019, 2020; Ebrahimi, Lowd, and Dou 2018; Michel, Li, and Neubig 2019; Zhao, Dua, and Singh 2018; Zou et al. 2020), dialogue or question answering systems (Deng et al. 2021; Xue et al. 2020), social network based scenarios such as recommendation systems, spam detection or sentiment analysis (Guo, Li, and Mu 2021), and financial applications such as credit loan approval or fraud detection systems (Ballet et al. 2019; Cartella et al. 2021; Fursov et al. 2021; Kumar et al. 2021; Mathov et al. 2022; Renard et al. 2019; Sarkar et al. 2018). The vulnerability to adversarial attacks has also been exposed in a wide range of popular machine learning as a service APIs (Borkar and Chen 2021; Ilyas et al. 2018; Papernot et al. 2017).

FIGURE 3    |    Illustration of an adversarial example generated for a chest x-ray (CXR) classification task, in which the objective is to categorize the status of the patient as one of three classes: "normal," Covid-19, or (non-Covid) pneumonia (more details in Section 4.1). (a) Original input sample in which the patient is diagnosed with Covid-19. (b) Adversarially manipulated input, which is misclassified by the model as "normal" (i.e., no disease found in the patient) despite being perceptually identical to the original image.

2.2.1   |   Taxonomy of Adversarial Attacks

Different categories of attacks are considered in the literature depending on factors such as the specific type of error to be produced, the scope of the perturbation and the resources available to the adversary (Yuan et al. 2019). In this section, we present the main categories in order to describe the most common attack paradigms studied in the literature.

• Type of misclassification: First, two main types of attacks can be differentiated depending on whether the adversary aims to produce one particular incorrect class (targeted attacks), or simply a misclassification without any specific target class of preference (untargeted attacks).

• Scope of the perturbation: In addition, different types of perturbations can be considered depending on whether they are generated for one specific input at hand (individual perturbations) (Carlini and Wagner 2017; Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018; Moosavi-Dezfooli, Fawzi, and Frossard 2016; Szegedy et al. 2014) or whether they are designed to be input-agnostic and therefore effective independently of the input to which they are applied (universal perturbations) (Khrulkov and Oseledets 2018; Moosavi-Dezfooli et al. 2017; Mopuri, Garg, and Babu 2017; Mopuri et al. 2018).

• Resources available to the adversary: The information required by the adversary in order to effectively generate attacks leads to two main scenarios. On the one hand, in the white-box scenario, the adversary has full knowledge about the model internals (e.g., its architecture, weights, or hyperparameters) and its training details. This allows highly efficient attacks to be generated, most of them relying on gradient-based strategies (Carlini and Wagner 2017; Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018; Moosavi-Dezfooli et al. 2017; Szegedy et al. 2014). On the other hand, in black-box scenarios the adversary has no knowledge about the model (Alzantot et al. 2019; Brendel, Rauber, and Bethge 2018; Ilyas et al. 2018; Papernot et al. 2017). More intermediate scenarios (sometimes referred to as gray-box scenarios) can be assumed when the adversary has limited access to the models, such as the output confidences assigned to every possible class or the logit values (Ilyas et al. 2018). The opacity in terms of model details requires more costly strategies than those used in the white-box case, such as evolutionary algorithms (Alzantot et al. 2019; Qiu, Custode, and Iacca 2021), gradient estimations (Chen et al. 2017; Ilyas et al. 2018), or the use of surrogate models to generate the attack, which is afterward transferred to the initial model (Liu et al. 2017; Papernot et al. 2017).

• Type of deployment: Finally, a key aspect of adversarial examples is how those inputs are fed to the model. Generally, the scenarios assumed in research works allow the input to be modified "digitally," after which it is fed to the model. In other cases, physical adversarial examples are crafted (e.g., printed traffic signals or malicious speech commands reproduced by a speaker) that are effective even when the signal is captured "over-the-air" and fed to the model (Eykholt et al. 2018; Sharif et al. 2016; Xu et al. 2020). This allows circumventing the possible limitation in real-world scenarios in which the adversary might not have access to the digital files. A minimal attack sketch covering the simplest of these settings is given after this list.
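The following is a minimal sketch of an attack in the simplest of the above settings (white-box, individual, untargeted, digital-world), in the style of the fast gradient sign method (FGSM) of Goodfellow, Shlens, and Szegedy (2015). It assumes a differentiable PyTorch classifier `model`, an input batch `x` with labels `y`, and a perturbation budget `epsilon`; all names are illustrative assumptions rather than a definitive implementation.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One signed-gradient step that increases the loss on the true labels y."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each feature in the direction that increases the loss, with the
    # perturbation bounded by epsilon to keep it (near-)imperceptible.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep a valid input range
```

Iterating this step under a projection onto the epsilon-ball yields the stronger PGD attack of Madry et al. (2018).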

A summary of the taxonomy described can be consulted in Table 1. We also refer the reader to the work of Yuan et al. (2019) for a more comprehensive and fine-grained survey on adversarial examples.

Finally, whereas the research on adversarial examples in the last few years has led to a torrent of attack methods proposed for multiple scenarios, tasks and even types of models, this research has focused almost exclusively on classification problems (Yuan et al. 2019). Nevertheless, it has been shown that adversarial examples can be generated for machine learning models trained to perform very different types of problems, such as regression (Balda, Behboodi, and Mathar 2019; Gupta et al. 2021; Kos, Fischer, and Song 2018; Li et al. 2020; Mode and Hoque 2020; Tabacof, Tavares, and Valle 2016), reinforcement learning (Hussenot, Geist, and Pietquin 2020; Lin et al. 2017) or image segmentation (Cisse et al. 2017; Fischer et al. 2017; Metzen et al. 2017; Mopuri, Ganeshan, and Babu 2019; Poursaeed et al. 2018; Xie et al. 2017) problems. All these advances open up a wide range of opportunities for adversaries to maliciously take control of the outcomes of machine learning models, threatening countless systems. At the same time, research has focused on models that only provide a classification as an answer. Only recently has the vulnerability of explainable models begun to be studied, as we will discuss in detail in the following section.

2.3   |   Reliability of Explanations Under Adversarial Attacks

Some explanation methods in the literature have been proven to be unreliable in adversarial settings. In Ghorbani, Abid, and Zou (2019), Dombrowski et al. (2019), Alvarez-Melis and Jaakkola (2018b), Zhang et al. (2020) and Kuppa and Le-Khac (2020), it is shown that small changes in input samples can produce drastic changes in feature-importance explanations, while maintaining the output classification. In Ghorbani, Abid, and Zou (2019), the proposed attacks are also evaluated on the example-based explanations proposed in Koh and Liang (2017), based on estimating the relevance of each training image for a given prediction by using influence functions. In Zheng, Fernandes, and Prakash (2019), adversarial attacks capable of changing the explanations while maintaining the outputs are created for self-explainable (prototype-based) classifiers. In Zhang et al. (2020) and Kuppa and Le-Khac (2020), it is shown that adversarial examples can also produce wrong outputs and (feature-importance) explanations at the same time, or change the output while maintaining the explanations (Zhang et al. 2020).

Aivodji et al. (2019), Aïvodji et al. (2021), and Lakkaraju and Bastani (2020) show that trustworthy explanations can be produced for a biased or an untrustworthy model, thus manipulating user trust. These approaches are, however, not based on adversarial attacks, as they focus on producing a global explanation model that closely approximates the original (black-box) model but which employs trustworthy features instead of sensitive or discriminatory features (which are actually being used by the original model to predict). Similarly, in Anders et al. (2020), Dimanov et al. (2020), Heo, Joo, and Moon (2019), Le Merrer and Trédan (2020), and Slack et al. (2020), adversarial models are generated, capable of producing incorrect or misleading explanations without harming their predictive performance.

On the other hand, recent works have proposed defensive approaches in order to increase the robustness of different explanation methods. These works have focused primarily on feature-based explanation methods, relying on regularization strategies (Boopathy et al. 2020; Chen, Wu, et al. 2019; Joo et al. 2023; Tang et al. 2022; Wang et al. 2020e) and explanation-averaging strategies (Rieger and Hansen 2020) for gradient-based explanations, or tailored approaches for model-agnostic explanations (Blesch, Wright, and Watson 2023; Carmichael and Scheirer 2023; Ghalebikesabi et al. 2021; Vreš and Robnik-Šikonja 2022) such as LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017). Defensive approaches for counterfactual explanations have also been recently investigated in Virgolin and Fracaros (2023) and Pawelczyk et al. (2023). While a comprehensive evaluation of the adversarial robustness of the multiple explanation types and methods remains an open challenge, we refer the reader to Huang et al. (2023) for a technical work on the evaluation of a wide range of feature-based explanations.
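As an illustration of the explanation-averaging idea, the following minimal sketch smooths an arbitrary explanation function over noisy copies of the input, in the spirit of the aggregation strategy of Rieger and Hansen (2020): a small adversarial perturbation then has to control the explanation of many perturbed inputs at once. The function signatures and parameters are illustrative assumptions, not the exact method of the cited work.

```python
import torch

def smoothed_explanation(explain, model, x, n_samples=32, sigma=0.1):
    """Average an explanation map over Gaussian-perturbed copies of x."""
    maps = [explain(model, (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0))
            for _ in range(n_samples)]
    return torch.stack(maps).mean(dim=0)
```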

TABLE 1    |    Summary of the main taxonomy used to describe and categorize adversarial attacks.

• Type of misclassification
  - Targeted: produce one particular incorrect output class.
  - Untargeted: produce an incorrect class without any preference over the available classes.

• Scope of the perturbation
  - Individual: optimized for one particular input.
  - Universal: optimized to be applicable to multiple inputs.

• Resources available to the adversary
  - White-box scenario: the adversary has full knowledge of the model (e.g., weights, hyperparameters, training details, etc.).
  - Black-box scenario: the adversary has no (or very limited) information about the details of the model, which is considered a "black-box."

• Type of deployment
  - "Digital-world": the adversarial example is crafted digitally (i.e., by manipulating the "digital file") and is sent to the model "as-is."
  - "Physical-world": the adversarial example is generated "physically" in order to fool the model when the input is captured "over-the-air."
Some works have also tried to justify the vulnerability of explanation methods to adversarial attacks, or the links between them. In Ghorbani, Abid, and Zou (2019) and Dombrowski et al. (2019), the non-smooth geometry of the decision boundaries (of complex models) is blamed, arguing that, due to these properties, small changes in the inputs imply that the direction of the gradients (i.e., normal to the decision boundary) can drastically change. As most explanation methods rely on gradient information, the change in the gradient direction implies a different explanation. In Zhang et al. (2020) and Kuppa and Le-Khac (2020), the vulnerability is attributed to a gap between predictions and explanations. It is an open question whether this hypothesis holds for self-explainable models, which have been trained jointly to classify accurately and to provide explanations (Alvarez-Melis and Jaakkola 2018a; Chen, Li, et al. 2019; Hase et al. 2019; Li et al. 2018; Saralajew et al. 2019). Finally, theoretical connections between explanations and adversarial examples are established in Etmann et al. (2019) and Ignatiev, Narodytska, and Marques-Silva (2019).

2.4   |   Further Connections Between Adversarial Examples and Interpretability

Paradoxically, using explanations to support or justify the prediction of a model can imply security breaches, as they might reveal sensitive information (Sokol and Flach 2019; Viganò and Magazzeni 2020). For instance, an adversary can use explanations of how a black-box model works (e.g., which features are the most relevant in a prediction) in order to design more effective attacks. Similarly, in this paper, we will show that justifying the classification of the model with an explanation makes it possible to generate types of deception using adversarial examples that, without explanations, would not be possible to generate (e.g., to convince an expert that a misclassification of the model is correct).

On another note, recent works have shown that robust (e.g., adversarially trained) models are more interpretable (Etmann et al. 2019; Ros and Doshi-Velez 2018; Tsipras et al. 2019; Zhang and Zhu 2019). In Etmann et al. (2019), this is justified by showing that the farther the inputs are from the decision boundaries, the more aligned the inputs are with their saliency maps, thus being more interpretable. Furthermore, Noack et al. (2021) show that enhancing the interpretability of a model during the training phase increases its adversarial robustness. Moreover, explanation methods have inspired particular defensive strategies against adversarial attacks (Hossam et al. 2021; Jiang et al. 2021; Kao et al. 2022; Liu, Yang, and Hu 2018; Renard et al. 2019; Tao et al. 2018; Wang, Wu, et al. 2020; Yang et al. 2020; Zhang et al. 2018) and, inversely, adversarial attack methods have been proposed as a tool to generate or analyze explanations (Elliott, Law, and Russell 2021; Haffar et al. 2021; Moore, Hammerla, and Watkins 2019; Praher et al. 2021).

Finally, the similarities between interpretation methods and adversarial attacks and defenses are analyzed in Liu et al. (2021), showing how adversarial methods can be redefined from an interpretation perspective, and discussing how techniques from one field can bring advances into the other. Our paper, however, addresses a different objective. In contrast to Liu et al. (2021), which focuses on highlighting the similarities between particular methods from both fields, in this paper we propose a framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment.

3   |   Extending Adversarial Examples for Explainable Machine Learning Scenarios

In this section, we extend the notion of adversarial examples to fit in explainable machine learning contexts. For this purpose, in Section 3.1, we start from a basic definition of adversarial examples, and discuss more comprehensive scenarios in which the human subjects judge not only the input sample, but also the decisions of the model. In Section 3.2, the human assessment of the explanations is also taken into account. To the best of our knowledge, no prior work has comprehensively addressed this type of generalization of adversarial examples.

This extended definition allows us to provide a general framework that identifies the way in which an adversary should design an adversarial example to deploy critical attacks even when a human is assessing the prediction process. The framework introduced also identifies several realistic ways of deploying attacks depending on factors such as the way in which the explanation is conveyed (Section 3.2.1) or the type of scenario, user and task (Section 3.2.2). From an adversary's perspective, this framework provides a road map for the design of malicious attacks in realistic scenarios involving explainable models and a human assessment of the predictions. From the perspective of a developer or a defender, this road map helps to identify the most critical requirements that their explainable model should satisfy in order to be reliable.

3.1   |   Scenarios in Which Human Subjects Are Aware of the Model Predictions

Regular adversarial examples (Yuan et al. 2019) are based on the assumption that an adversary can introduce a perturbation into an input sample, so that:

1. The perturbation is not noticeable to humans, and, therefore, the human's perception of which class the input belongs to does not change.

2. The class predicted by a machine learning model changes.

Note that, according to this definition of adversarial examples, the human criterion is only considered regarding the input sample, without any human assessment of the model's output. However, this definition does not guarantee the stealthiness of the attack in scenarios in which the user observes the output classification, since the change in the output can be inconsistent, alerting the human.
For these reasons, the following question arises: are regular adversarial examples useful in practice when the user is aware of the output?

To address this question, we start by discussing four different scenarios, based on the agreement of the following factors: f(x), the model's prediction for the input; h(x), the classification performed by a human subject; and yx, the ground-truth class of an input x (which will be unknown to both the model and the human subject in the prediction phase of the model). It is worth clarifying that a human misclassification (h(x) ≠ yx) can occur in scenarios in which the addressed task is of high complexity, such as medical diagnosis (Pillai, Oza, and Sharma 2019), or in which the label of an input is ambiguous, such as sentiment analysis (Agirre and Edmonds 2006; Beck et al. 2020). Although a human misclassification might be uncommon in simple problems such as object recognition, even in such cases ambiguous or challenging inputs can be found (Stock and Cisse 2018; Tsipras et al. 2020). Finally, unless specified, we will assume expert subjects, that is, subjects with knowledge of the task who are capable of providing well-founded classifications.¹ According to this framework, the four possible scenarios are those described in Figure 4.

FIGURE 4    |    Attack casuistry when the human observes not only the input but also the output classification of the model.

According to the described casuistry, regular adversarial attacks aim to produce the second scenario (A.0.2, i.e., f(x) ≠ h(x) = yx), by imperceptibly perturbing an input x0 that satisfies f(x0) = h(x0) = yx0 (i.e., the first scenario) so that the model's output is changed, but without altering the human perception of the input (which, therefore, implies h(x) = yx = yx0). However, assuming that the user is aware of the output, the fulfillment of the attack is subject to whether human subjects can correct the detected misclassification, or have control over the implications of that prediction. For example, an adversarial traffic signal will only produce a dramatic consequence in autonomous cars if the drivers do not take control with sufficient promptness.

Regarding the remaining cases, they do not fit the definition of a regular adversarial attack, since either the input is misclassified by the human subject (h(x) ≠ yx) or the model is not fooled (f(x) = h(x) = yx). Nevertheless, assuming a more general definition, scenarios involving human misclassifications could be potentially interesting for an adversary. Similarly to regular adversarial attacks, which force the second scenario departing from the first one, an adversary might be interested in forcing the fourth scenario departing from the third one.

Let us take as an example a complex computer-aided diagnosis task through medical images, in which an expert subject fails in their diagnosis while the model is correct. In such cases, we can induce a human error confirmation attack by forcing the model to confirm the (wrong) medical diagnosis produced by the expert, that is, forcing f(x) = h(x) ≠ yx (Bortsova et al. 2021; Finlayson et al. 2019; Goddard, Roudsari, and Wyatt 2012; Johnson 2019).

Based on the above discussion, we can determine that some types of adversarial attacks can still be effective even when the user is aware of the output. Nonetheless, paradoxically, it is possible to introduce new types of adversarial attacks when the output classification is supported by explanations, as we show in the following section.
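For illustration, the casuistry above can be expressed as a simple decision rule over f(x), h(x) and yx. The snippet below is our own minimal sketch: the identifiers A.0.2 and A.0.4 follow the paper's numbering, while the descriptions of the first and third scenarios are only informal labels.

```python
def classification_scenario(f_x, h_x, y_x):
    """Map the (model, human, ground-truth) agreement pattern of an input
    onto the four scenarios of Figure 4."""
    model_correct, human_correct = (f_x == y_x), (h_x == y_x)
    if model_correct and human_correct:
        return "first scenario: model and human both correct"
    if not model_correct and human_correct:
        return "A.0.2: model fooled, human correct (regular adversarial example)"
    if model_correct and not human_correct:
        return "third scenario: model correct, human wrong"
    # Both are wrong; A.0.4 is the subcase in which the model confirms the
    # human error, that is, f_x == h_x != y_x.
    return "A.0.4 (when f_x == h_x): human error confirmation"

# Example from the computer-aided diagnosis discussion above:
print(classification_scenario(f_x="pneumonia", h_x="pneumonia", y_x="covid-19"))
```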
3.2   |   Scenarios in Which Human Subjects Are Aware of the Explanations

The scenarios described in the previous section can be further extended for the case of explainable machine learning models, as the explanations for the predictions come into play. As a consequence, each of the cases defined above can be subdivided into new subcases depending on whether the explanations match the output class or whether humans agree with the explanations of the models. To avoid an exhaustive enumeration of all the possible scenarios, we focus only on those identified in the literature as interesting from an adversary's perspective.

From this standpoint, given an explainable model, adversarial examples can be generated by perturbing a well-classified input (for which the corresponding explanation is also correct and coherent) with the aim of changing (i) the output class, (ii) the provided explanation, or (iii) both at the same time (Noppel, Peter, and Wressnegger 2023; Schneider, Meske, and Vlachos 2022).

To formalize these scenarios, let us denote by Af(x) the explanation provided to characterize the decision f(x) of a machine learning model, and by Ah(x) the explanation provided by a human according to their knowledge or criteria. Since a total agreement or disagreement between such explanations is generally unlikely and challenging to characterize in a formal way, the disagreement between Af(x) and Ah(x) will be denoted as Af(x) ≉ Ah(x), while the agreement will be denoted as Af(x) ≈ Ah(x). Similarly, we will denote A(x) ∼ y if an explanation A(x) for the input x is consistent with the reasons that characterize the class y (that is, if the explanation correctly characterizes or supports the classification of x as the class y), and A(x) ≁ y otherwise. For simplification, unless specified, we assume that, given an input x belonging to the class yx, h(x) = yx and Ah(x) ∼ yx, that is, the human classification of an input into one class is correct and is based on reasons consistent with that class. Similarly, we will also assume that, for a clean (unperturbed) input x, f(x) = yx and Af(x) ∼ yx.


The identified scenarios (summarized in Table 2) are as follows:

A.1 f(x) = yx ∧ Af(x) ≉ Ah(x). In this case, the model is right but the explanations are incorrect or differ from those that would be provided by a human. Adversarial attacks capable of producing such scenarios have been studied in recent works for post hoc feature-importance explanations (Alvarez-Melis and Jaakkola 2018b; Anders et al. 2020; Dombrowski et al. 2019; Ghorbani, Abid, and Zou 2019; Huang et al. 2023; Kuppa and Le-Khac 2020; Noppel, Peter, and Wressnegger 2023; Sinha et al. 2021; Tamam, Lapid, and Sipper 2023; Zhang et al. 2020) and for self-explainable prototype-based classifiers (Zheng, Fernandes, and Prakash 2019), showing that small perturbations in the input can produce a drastic change in the explanations without changing the output. A minimal sketch of this type of attack is given after this list.

A.1.1 More particularly, we can imagine a scenario in which Af(x) ∼ yx despite Af(x) ≉ Ah(x), for instance, if Af(x) points to relevant and coherent properties to classify the input as yx, but which do not compose a correct or relevant explanation (with respect to the given input) according to a human criterion. From an adversary's perspective, changing the explanations without forcing a wrong classification allows confusing recommendations to be introduced. For illustration, a model can correctly reject a loan request, but the decision can be accompanied by a wrong yet coherent explanation (e.g., "the applicant is too young"), preventing the applicant from correcting the actually relevant deficiencies of the request (e.g., "the applicant's salary is too low") (Ustun, Spangher, and Liu 2019). Similarly, a wrong explanation of a medical diagnosis system might lead to a wrong treatment or prescription (Bortsova et al. 2021; Bussone, Stumpf, and O'Sullivan 2015; Ghassemi, Oakden-Rayner, and Beam 2021; Stiglic et al. 2020). In addition, biased or discriminative explanations could be produced with this attack scheme (Slack et al. 2021), for instance, attributing a loan rejection to sensitive features (e.g., gender, race or religion). Such an explanation could make the models look unreliable or untrustworthy for users. Oppositely, biases could be hidden by producing trustworthy explanations to manipulate the trust of the users (Aivodji et al. 2019; Dimanov et al. 2020; Lakkaraju and Bastani 2020; Le Merrer and Trédan 2020; Slack et al. 2020; Wang, Tuyls, et al. 2020).

A.2 f(x) ≠ yx ∧ Af(x) ≉ Ah(x). In this case, both the classification and the explanation provided by the model are incorrect. Adversarial attacks capable of producing such scenarios have been investigated in recent works (Kuppa and Le-Khac 2020; Noppel, Peter, and Wressnegger 2023; Sun et al. 2022; Zhang et al. 2020). More particularly, we identify two specific sub-cases as relevant when a human assesses the entire classification process:

A.2.1 Af(x) ∼ f(x). In this case, the fact that the provided explanation is coherent with the (incorrectly) predicted class can increase the confidence of the human in the prediction, being therefore interesting from an adversary's perspective. We identify this case as the most direct extension of adversarial examples for explainable models, as the model is not only fooled but also supports its own misclassification with the explanation.

A.2.2 Af(x) ≁ f(x) ∧ Af(x) ≁ yx. This case is similar to the previous one (A.2.1), with the important difference that the model's explanation is now coherent with a class y′ different from f(x) and yx. Thus, we are in a scenario in which a total mismatch is produced between all the considered factors. Whereas these attacks are an interesting and open case of study, deploying them in practice without the inconsistencies being noticed poses greater challenges.

A.3 f(x) ≠ yx ∧ Af(x) ≈ Ah(x) ∧ Af(x) ∼ yx. In this case, the model's classification is wrong but the provided explanations are coherent from a human perspective with respect to the ground-truth class yx (Huang et al. 2023; Noppel, Peter, and Wressnegger 2023; Subramanya, Pillai, and Pirsiavash 2019; Zhan et al. 2022). The agreement in the explanations can increase the confidence in the model, but, at the same time, the output is not consistent with the explanation. However, the consistency issue might be solved by finding an input for which the explanation not only satisfies Af(x) ∼ yx but also Af(x) ∼ f(x), for instance, by finding an ambiguous explanation that is applicable to both classes. Such attacks could be employed to convince the user to consider an incorrect class as correct or justified, or to bias the user's decision toward a preferred class (e.g., when there is more than one reasonable output class for an image) (Table 2).
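The works cited in cases A.1 and A.2 typically cast these attacks as an optimization problem over the input perturbation. The snippet below is a minimal sketch of that generic recipe (in the spirit of, e.g., Dombrowski et al. 2019, rather than the exact method of any cited paper): gradient descent pulls the explanation toward an adversarial target map while a penalty keeps the output approximately unchanged, as in case A.1. It assumes a PyTorch model and an explanation function `explain` that is differentiable with respect to its input; all names and parameters are illustrative.

```python
import torch
import torch.nn.functional as F

def explanation_attack(model, explain, x, target_map,
                       steps=100, lr=1e-2, gamma=1.0):
    """Perturb x so that explain(model, x) approaches target_map while the
    model's output stays approximately unchanged (case A.1)."""
    x0, out0 = x.detach(), model(x).detach()
    delta = torch.zeros_like(x0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x0 + delta).clamp(0.0, 1.0)
        # Move the explanation toward the adversarial target...
        loss_expl = F.mse_loss(explain(model, x_adv), target_map)
        # ...while penalizing any drift in the model's output.
        loss_out = F.mse_loss(model(x_adv), out0)
        loss = loss_expl + gamma * loss_out
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x0 + delta).clamp(0.0, 1.0).detach()
```

Note that, for gradient-based explanations, differentiating `explain` with respect to the input requires second-order derivatives (e.g., building the saliency map with `create_graph=True` in PyTorch).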
low”) (Ustun, Spangher, and Liu 2019). Similarly, a 3.2.1   |   Attack Design Based on the Type of Explanation
wrong explanation of a medical diagnosis system might
lead to a wrong treatment or prescription (Bortsova Whereas our framework considers the explanations of the mod-
et al. 2021; Bussone, Stumpf, and O'Sullivan 2015; els in their most general form, the way in which an explanation
Ghassemi, Oakden-­ R ayner, and Beam 2021; Stiglic is conveyed determines how humans process and interpret the
et al. 2020). In addition, biased or discriminative ex- information (Doshi-­Velez and Kim 2018; Zhang et al. 2021). This
planations could be produced with this attack scheme implies that some attack strategies might be more suitable for
(Slack et al. 2021), for instance, attributing a loan rejec- some types of explanations than for others. Moreover, the way
tion to sensitive features (e.g., gender, race or religion). in which an adversarial example is generated for an explainable
Such an explanation could make the models look un- machine learning model will also depend on the type of expla-
reliable or untrustworthy for users. Oppositely, biases nation. For these reasons, in this section we briefly discuss in
could be hidden by producing trustworthy explanations which way an adversarial example should be designed depend-
to manipulate the trust of the users (Aivodji et al. 2019; ing on the type of explanation or the particular type of attack to
Dimanov et al. 2020; Lakkaraju and Bastani 2020; Le be produced.
Merrer and Trédan 2020; Slack et al. 2020; Wang, Tuyls,
et al. 2020). • Feature-­based explanations: the highlighted parts or fea-
A.2 f (x) ≠ yx ∧ Af (x) ≈ Ah (x). In this case, both the clas- tures need to be coherent with the classification, and
sification and the explanation provided by the model are correspond to (I) human-­ perceivable, (II) semantically
incorrect. Adversarial attacks capable of producing such meaningful and (III) relevant parts. A common criticism
scenarios have been investigated in recent works (Kuppa to feature-­based explanations such as saliency maps is
and Le-­K hac 2020; Noppel, Peter, and Wressnegger 2023; that they identify the relevant parts of the inputs, but not
Sun et al. 2022; Zhang et al. 2020). More particularly, we how the models are processing such parts (Rudin 2019).
identify two specific sub-­cases as relevant when a human Moreover, the acceptance of the explanation would be
assesses the entire classification process: subject to the human comprehensibility of how the model

TABLE 2    |    Overview of the attack casuistry described in Sections 3.1 and 3.2. For each case, the two groups of columns of the original table are given as classification (model correct / human correct / model-human agreement) and explanation (model coherent with the ground truth / model coherent with its output / model-human agreement).

Factors observed by the user: input + output (Section 3.1)

• A.0.2. Classification: ✘ / ✓ / ✘. Explanation: — / — / —. Regular attack. Representative use-cases: forcing misclassifications in critical tasks (e.g., traffic-sign recognition, surveillance or finance fraud detection).
• A.0.4. Classification: ✘ / ✘ / ✓. Explanation: — / — / —. Human error confirmation. Representative use-cases: confirm a wrong diagnosis produced by an expert in health-care domains.

Factors observed by the user: input + output + explanation (Section 3.2)

• A.1. Classification: ✓ / ✓ / ✓. Explanation: * / * / ✘. Incorrect explanation (while keeping the correct output). Representative use-cases: reduce human trust in the model.
• A.1.1. Classification: ✓ / ✓ / ✓. Explanation: ✓ / ✓ / ✘. Incorrect yet coherent explanation (while keeping the correct output). Representative use-cases: confusing recommendations in credit-loan request or medical-diagnosis tasks; biased or discriminative explanations; hide inappropriate behaviors of the model.
• A.2. Classification: ✘ / ✓ / ✘. Explanation: * / * / ✘. Incorrect output and explanation. Representative use-cases: reduce human trust in the model.
• A.2.1. Classification: ✘ / ✓ / ✘. Explanation: ✘ / ✓ / ✘. Model is wrong but supports its own misclassification. Representative use-cases: increase the confidence of the human in the incorrect prediction; bias the human in favor of a wrong class.
• A.2.2. Classification: ✘ / ✓ / ✘. Explanation: ✘ / ✘ / ✘. Total mismatch between the input, the classification and the explanation. Representative use-cases: reduce human trust in the model.
• A.3. Classification: ✘ / ✓ / ✘. Explanation: ✓ / ✓ / ✓. Incorrect output while keeping a correct explanation. Representative use-cases: ambiguous explanations applicable to more than one class; misdirect the attention of the user toward another reasonable class.

Note: For the sake of simplicity, we use the following symbols: ✓ (yes), ✘ (no), — (not applicable). In those paradigms in which subcases are considered, the symbol "*" is used to represent the term "not specified" (i.e., the choice made for those factors determines the attack subtype); the classification entries of the subcases (A.1.1, A.2.1, A.2.2) follow from their parent cases (A.1 and A.2).
• Feature-based explanations: the highlighted parts or features need to be coherent with the classification, and correspond to (I) human-perceivable, (II) semantically meaningful and (III) relevant parts. A common criticism of feature-based explanations such as saliency maps is that they identify the relevant parts of the inputs, but not how the models are processing such parts (Rudin 2019). Moreover, the acceptance of the explanation would be subject to the human comprehensibility of how the model processes the identified features. Thus, an adversarial attack could take advantage of these limitations. First, a particular region of the input can be highlighted to support a misclassification of the model and to convince the user (assuming that the region contributes to predicting an incorrect class) (Noppel, Peter, and Wressnegger 2023), which is particularly interesting for targeted adversarial attacks. An attack could also highlight irrelevant parts to mislead the observer, or generate ambiguous explanations by highlighting multiple regions or providing a uniform map (Noppel, Peter, and Wressnegger 2023), which are strategies well-suited for untargeted attacks.

• Prototype-based explanations: in this case, for the human to accept the given explanation, the key features of the closest prototypes should (I) be perceptually identifiable in the given input and, ideally, (II) contain features correlated with the output class. The contrary should happen for the farthest prototypes, that is, their key features should not be present in the input nor be correlated with the output class (or, ideally, be opposite). In order to achieve these objectives, the more general the prototypes (e.g., if they represent semantic concepts or parts of inputs rather than completely describing an output class), the higher the chances of producing explanations that could lead to a wrong classification while being coherent with a human perception, such as ambiguous explanations. To the best of our knowledge, no customized attack strategies that account for the assessment of the user have been proposed for prototype-based explanations, although some recent approaches (Recaido and Kovalerchuk 2023) represent promising attempts in this direction; a sketch of the underlying idea is given after this list.

• Rule-based explanations: these can be fooled by targeting explanations which are aligned with the output of the model (e.g., the explanation justifies the prediction or at least mimics the behavior of the model), but which employ reliable, trustworthy, or neutral features (Aivodji et al. 2019; Lakkaraju and Bastani 2020; Le Merrer and Trédan 2020). For instance, a model for criminal-recidivism prediction could provide a negative assessment based on unethical reasons, whereas the explanation is taken as ethical (Aivodji et al. 2019; Lakkaraju and Bastani 2020).

• Counterfactual explanations: in this case, the objective of an adversarial attack could be forcing a particular counterfactual explanation, suggesting changes to irrelevant features (thus preventing the correction of the deficiencies which are actually relevant), or producing biased or discriminatory explanations (Slack et al. 2021) in detriment to the fairness of the model.
• Counterfactual explanations: in this case, the objective of an oped for the S1 scenario) in order to debug or validate them
adversarial attack could be forcing a particular counterfac- (Adadi and Berrada 2018; Anders et al. 2022; Doshi-­Velez and
tual explanation, suggesting changes on irrelevant features Kim 2018; Ras, van Gerven, and Haselager 2018; Sotgiu, Pintor,
(thus preventing correcting the deficiencies which are actu- and Biggio 2022). Furthermore, recent works demonstrate that
ally relevant), or producing biased or discriminatory expla- human explanations and feedback can be used to correct the
nations (Slack et al. 2021) in detriment to the fairness of the system in an interactive way, in which the active role of the
model. human takes an even higher relevance (Kulesza et al. 2015;

FIGURE 5    |    Critical scenarios to be considered in the study of adversarial attacks against explainable machine learning models.

TABLE 3    |    Possible scenarios in which explainable machine learning models can be deployed, and a guideline on how adversarial attacks should be designed in each case in order to pose a realistic threat.

S1: Impossibility of correcting the output or controlling the implications of the decision in time.
  Representative examples: fast decision-making scenarios (e.g., autonomous cars) or automatized processes (e.g., massive online content filtering).
  Applicable attacks: any adversarial attack capable of producing a change in the output class (as the explanations are not of practical use in these cases).

S2: Model debugging, development, validation, etc.
  Representative examples: applicable to any task.
  Applicable attacks (shared with S3): A.2.1, A.3 (justify misclassifications of the model); A.1.1 (mask inappropriate behaviors, e.g., hiding biases by producing trustworthy outputs or explanations); A.2 (produce wrong outputs and explanations jointly).

S3: Decisions of the models are more imperative than experts' judgments.
  Representative examples: risk of criminal recidivism or credit risk management.
  Applicable attacks: the same as for S2.

S4: User with no expertise.
  Representative examples: scenarios in which the decision criteria are secret, hidden, or unknown (e.g., banking or financial scenarios, malware classification problems, etc.).
  Applicable attacks: any adversarial attack scheme (able to change the classification, the explanation or both at the same time), taking advantage of the user's inexperience.

S5: User with medium expertise (the model is expected to clarify or support the user's decisions).
  Representative examples: challenging scenarios (e.g., complex medical diagnosis) or unforeseeable scenarios (e.g., macroeconomic predictions, risk of criminal recidivism, etc.).
  Applicable attacks: A.1.1, A.2.1, A.3; the explanation needs to be consistent with the input patterns and/or consistent with the output class.

S6: User with partial expertise (i.e., expert in some factors but clueless in others).
  Representative examples: hierarchical classification (e.g., large scale visual recognition).
  Applicable attacks: A.1.1, A.2.1, A.3; the output and the explanation should be consistent with the factors which are familiar to the user (either regarding input features or the output class).

S7: User with high expertise.
  Representative examples: tasks in which the inputs can be ambiguous (e.g., NLP tasks such as sentiment analysis, or multiple object detection in the image domain).
  Applicable attacks: A.1.1, A.3 (attacks involving generating ambiguous explanations).

S8: Explanations even more relevant than the classification itself.
  Representative examples: predictive maintenance, medical diagnosis or credit/loan approval (e.g., with a wrong explanation users cannot modify or correct the deficiencies).
  Applicable attacks: A.1, A.2.1 (e.g., maintain the output but produce totally or partially wrong explanations, or produce unethical explanations).
S2 Scenario: Interpretability or explainability can be desirable properties for machine learning models (including those developed for the S1 scenario) in order to debug or validate them (Adadi and Berrada 2018; Anders et al. 2022; Doshi-Velez and Kim 2018; Ras, van Gerven, and Haselager 2018; Sotgiu, Pintor, and Biggio 2022). Furthermore, recent works demonstrate that human explanations and feedback can be used to correct the system in an interactive way, in which the active role of the human takes on an even higher relevance (Kulesza et al. 2015; Teso et al. 2023; Teso and Kersting 2019). For instance, a model developer might want to explain the decisions of a self-driving car (even if the end-user will not receive explanations when the model is put into practice) to assess why it has provided an incorrect output, to validate its reasoning process, to improve it, or to gain knowledge about what the model has learned (Fujiyoshi, Hirakawa, and Yamashita 2019; Mori et al. 2019; Ras, van Gerven, and Haselager 2018). In such cases, an adversary could: justify a misclassification of the model (A.2.1, A.3), hide an inappropriate behavior when the model predicts correctly but for the wrong reasons (A.1.1), or produce wrong outputs and explanations at the same time (A.2).

S3 Scenario: The same attack strategies applicable to the S2 scenario can be applied in scenarios in which the models' decisions are taken as more relevant or imperative than the experts' judgments. Although this scenario resembles S1, the main difference is that, in this case, explanations can be useful or relevant even when the model is deployed or employed by the end-user, and, therefore, the attack should also take the explanations into consideration instead of considering only the output class.

S4 Scenario: Regarding the expertise level of the user querying the model, the case of no expertise is the simplest one from the perspective of the adversary, as any attack scheme can be produced without arousing suspicion, taking advantage of the user's inexperience. For the same reason, models deployed in such scenarios should also be the ones with more security measures against adversarial attacks.

S5 Scenario: If the user's expertise is medium, the model might be expected to clarify or support the user's decisions.
Thus, the explanation should be sufficiently consistent with the main semantic features in the input (e.g., the user might not be able to diagnose a medical image, but can identify the relevant spots depending on what is being diagnosed, such as darker spots in skin-cancer diagnosis (Al-masni et al. 2018)), and/or be sufficiently consistent with the output class (A.1.1, A.2.1, A.3).

S6 Scenario: Similarly to the S5 scenario, if the user has partial expertise, that is, if the user is an expert in some factors but clueless in others,² then the adversary needs to ensure that the output and the explanations are coherent only with the factors or features that are familiar to the user (A.1.1, A.2.1, A.3).

S7 Scenario: A user with high expertise, by definition, will realize that a model is producing a wrong output or explanation. However, it can be possible to mislead the model and convince the human of a wrong prediction by means of ambiguity (A.1.1, A.3). For instance, in an image classification task, two objects can appear at the same time, making it possible to produce a wrong class with a reasonable explanation, for example, by selectively focusing the attention of the explanation on one of the objects or by highlighting the secondary object as the most relevant one (Stock and Cisse 2018). In addition, in problems in which the inputs are inherently ambiguous, such as natural language processing tasks, different but reasonable explanations can be produced for the same input (Agirre and Edmonds 2006; Beck et al. 2020).

S8 Scenario: Finally, in some cases the explanations might be more critical, necessary or challenging than the output itself. Some representative tasks are predictive maintenance (Serradilla et al. 2022) (e.g., it might be more interesting to know why a certain system will fail than just knowing that it will fail) or medical diagnosis (Stiglic et al. 2020) (e.g., discovering why a model has diagnosed a patient as being at high risk for a particular disease might be the main priority to prevent the disease or provide a better treatment). For these reasons, a change in the explanation is critical for such models, which makes them particularly sensitive to the attacks described in A.1, or those described in A.2.1 if the misclassification of the model is difficult for the user to notice.

4   |   Illustration of Context-Aware Adversarial Attacks

In this section, we generate different types of adversarial examples to illustrate the main attack paradigms described in Section 3, in terms of both the type of misclassification to be produced (as described in Section 3.2) and the "scenario" in which the attack is created (as described in Section 3.2.2). To this end, we will consider two representative image classification tasks, assuming an explainable machine learning scenario. In addition, we will consider two explanation methods, namely feature-based explanations and prototype-based explanations, to illustrate the effect of the attacks in both cases. Our code is publicly available at: [Link] github.com/vadel/AE4XAI.

We remark that the aim of this section is to provide illustrative examples of the attack paradigms described in the proposed framework, and that the focus will be on exemplifying the design of the attacks (i.e., the requirements that they should satisfy in order to pose a legitimate threat against explainable models) rather than on the methods that could be used to implement them or on their performance. A summary of the illustrated scenarios and the corresponding details is provided in Table 4. As can be seen in the table, our illustrations cover all the main attack paradigms and scenarios considered in the framework developed in the previous section.

4.1   |   Selected Tasks, Datasets, and Models

We will focus on two image classification tasks to generate the adversarial examples:

• Medical image classification: the selected task consists of chest x-ray (CXR) classification, in which the aim is to identify, given an x-ray image, one of the following diagnoses: Covid-19, (non-Covid) pneumonia or none ("normal"). We used a pretrained Covid-Net model (Wang, Lin, and Wong 2020), trained on the COVIDx dataset (Wang, Lin, and Wong 2020), which achieves an accuracy of 92.6%.³

• Large scale visual recognition: the aim of this task is to classify real or natural images across a wide range of classes. We selected the ImageNet dataset, which contains images from 1000 different classes such as animals or ordinary objects, and a pretrained ResNet-50 deep neural network as a classifier, which achieves a Top-1 accuracy of 74.9%.⁴ Both ImageNet and the ResNet-50 architecture have been widely employed for the study of image classification, as well as in the more particular field of interpretable machine learning (Nguyen, Kim, and Nguyen 2021; Selvaraju et al. 2017; Zhang et al. 2020).

These two use-cases allow us to illustrate the different scenarios described in Section 3.2.2. First, medical image classification represents a challenging task that requires high expertise in order to correctly classify inputs or to provide well-founded explanations of the decisions. As discussed, in such a scenario, an adversary has more room to generate adversarial examples that produce incorrect model responses (both in terms of classification and explanation) which, at the same time, may be coherent or acceptable according to a human criterion (particularly for nonexpert users). Moreover, the explanations can be critical in this task, as the reason for determining a diagnosis is of high relevance, being therefore representative of the S8 scenario described in Section 3.2.2.

Secondly, users with high expertise can be assumed in the large-scale visual recognition task, as the ImageNet dataset contains images of familiar objects or animals which will be easily recognizable for humans. Thus, a human observing the input as well as the output of the classification should easily detect inconsistencies in the prediction of the model (i.e., whether or not it is correct). However, at the same time, some images might be ambiguous or challenging to classify even for humans (e.g., fine-grained dog breed classification (Khosla et al. 2011; Nguyen, Kim, and Nguyen 2021)), which can therefore be representative of medium or partial expertise, as the user might be able to effectively discriminate certain classes (e.g., differentiating dogs from other animal species) but not others (e.g., two similar dog breeds). In such cases, the user might expect the prediction of the model or the corresponding explanation to clarify the correct class of the input.
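As a minimal illustration of this experimental setup, the sketch below loads the large-scale visual recognition classifier in PyTorch. The use of torchvision and the exact preprocessing values (the standard ImageNet statistics) are assumptions for illustration; the Covid-Net model, in contrast, is distributed as a TensorFlow checkpoint in its own repository, so we do not reproduce it here.

```python
# Minimal sketch (PyTorch, assumed setup): ImageNet-pretrained ResNet-50
# classifier with the standard ImageNet preprocessing pipeline.
import torch
from torchvision import models, transforms

model = models.resnet50(pretrained=True)
model.eval()  # inference mode: freezes dropout and batch-norm statistics

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # maps pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```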

TABLE 4    |    Summary of the illustrative attacks shown in Sections 4.4 and 4.5.

X-ray classification (Section 4.4); feature-based explanation (saliency map); possible scenarios S2, S3, S4/S5/S6, S8:
  - Figure 6a: correct class, correct explanation (no attack, i.e., original input).
  - Figure 6b: wrong class, wrong (conflicting) explanation (regular attack, i.e., without controlling the explanation).
  - Figure 6c: wrong class, correct explanation (A.3).
  - Figure 6d: correct class, wrong explanation (A.1.1, confusing recommendation).
  - Figure 6e: wrong class, wrong explanation with A_f(x) ∼ f(x) (non-informative but consistent, i.e., it supports the prediction) (A.2.1).
  - Figure 6f: wrong class, wrong explanation with A_f(x) ≁ f(x) and A_f(x) ≁ y_x (non-informative and inconsistent) (A.2.2).

Large-scale visual recognition (Section 4.5); feature-based explanation (saliency map):
  Possible scenarios S2, S5/S6:
  - Figure 7a: correct class, correct explanation (no attack, i.e., original input).
  - Figure 7b–d: wrong class, correct explanation (A.3).
  Possible scenarios S2, S7:
  - Figure 8a: correct class, correct explanation (no attack, i.e., original input).
  - Figure 8b: correct class, correct explanation (no attack; the output is further biased in favor of the correct class, avoiding ambiguities).
  - Figure 8c: wrong class, wrong explanation (A.2.1).
  - Figure 8d: wrong class, correct explanation (A.3).

Large-scale visual recognition (Section 4.5); prototype-based explanation (three nearest training inputs); possible scenarios S2, S5/S6:
  - Figure 9a,b: wrong class, ambiguous explanation with A_f(x) ∼ y_x and A_f(x) ∼ f(x) (A.3).

Note: Notice that each attack paradigm and scenario is exemplified at least once. Note also that for the large-scale visual recognition task different scenarios can be considered depending on the characteristics or the difficulty of the input.



19424795, 2025, 1, Downloaded from [Link] Wiley Online Library on [04/10/2025]. See the Terms and Conditions ([Link] on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
FIGURE 6    |    Different types of adversarial attacks for the x-ray medical image diagnosis task. The left part of each image shows the input image as well as the class assigned by the model (jointly with the confidence score in the [0, 1] range), whereas the right part shows the explanation provided by the Grad-CAM method. (a) Original image. (b) Regular adversarial attack (PGD) targeting the class "normal" (i.e., the possible changes that the adversarial perturbation may produce in the explanation are not controlled by the attack). (c) Attack producing the wrong classification "normal" while maintaining the original explanation. (d) Attack maintaining the correct classification while changing the explanation in order to selectively highlight some parts (the right part) but omitting others (in this case, the left part). (e) Attack producing the wrong class "normal" and a wrong explanation which uniformly highlights the relevant parts of the image. (f) Attack producing the wrong class "normal" and a uniform explanation outside the main parts of the image (i.e., highlighting only irrelevant and incorrect parts).

4.2   |   Explanation Methods

We will consider two representative explanation methods in order to illustrate an explainable machine-learning scenario.

4.2.1   |   Feature-Based Explanation

The Grad-CAM method (Selvaraju et al. 2017) will be used to generate saliency-map explanations. The rationale of this method is to employ the feature maps learned by the model in the last convolutional layer to produce the explanations. Given a convolutional neural network f and an input x, the Grad-CAM saliency map S is defined as:

$$S = \mathrm{ReLU}\left(\sum_{m=1}^{M} \alpha_{m,c} \cdot C_m\right) \qquad (1)$$

where C_m, m = 1, …, M, represents the (two-dimensional) m-th activation map (for the input x) at the last convolutional layer of f, and α_{m,c} ∈ ℝ represents the importance of the m-th map in the prediction of the class of interest y_c (typically f(x), i.e., the class predicted by the model). The importance α_{m,c} of each activation map is estimated as the global pooling of the gradient of the output score (corresponding to the class y_c) with respect to C_m, which will be denoted as G_{m,c} = ∇_{C_m} f_c(x):

$$\alpha_{m,c} = \sum_i \sum_j G_{m,c}^{i,j} \qquad (2)$$

where G_{m,c}^{i,j} denotes the value at the i-th row and j-th column. The ReLU nonlinearity in (1) is applied to remove negative values, maintaining only the features with a positive influence on y_c.
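The sketch below shows how Equations (1) and (2) can be computed in PyTorch via forward and backward hooks. It assumes a torchvision ResNet-50, whose last convolutional block is named `layer4`; the hook-based structure and the layer name are illustrative assumptions, not the exact implementation used in our experiments.

```python
# Minimal Grad-CAM sketch (PyTorch, assumed ResNet-50 architecture).
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class):
    """Grad-CAM map S for an input batch x of shape (1, 3, H, W)."""
    acts, grads = [], []
    h1 = model.layer4.register_forward_hook(
        lambda mod, inp, out: acts.append(out))        # C_m
    h2 = model.layer4.register_full_backward_hook(
        lambda mod, gin, gout: grads.append(gout[0]))  # G_{m,c}

    score = model(x)[0, target_class]                  # f_c(x)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    A, G = acts[0], grads[0]                           # shape (1, M, h, w)
    # Eq. (2): pooled gradients (a mean differs from the sum only by a
    # positive constant, which does not change the map after rescaling).
    alpha = G.mean(dim=(2, 3), keepdim=True)
    # Eq. (1): weighted combination of the activation maps, then ReLU.
    return F.relu((alpha * A).sum(dim=1, keepdim=True))
```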

FIGURE 7    |    Adversarial examples generated for the ImageNet dataset classification task taking advantage of class ambiguity. The adversarial examples are generated from the input in (a)-left, which belongs to the source class "Great Pyrenees," targeting different classes that are characterized by features similar to those of the source class: (b) "Kuvasz," (c) "White wolf," and (d) "Labrador Retriever." Each adversarial example is created ensuring that the saliency-map explanation of the original input, shown in (a)-right, is maintained. (e)–(h) show, for each of the four classes considered (source class + 3 target classes), the k = 3 prototypes closest to the original input, in order to assess their similarity.

4.2.2   |   Example-Based Explanation

We will also consider an example-based explanation in which the k training images (which can be considered prototypes representing classes) that are closest to the classified input are provided (Nguyen, Kim, and Nguyen 2021). The proximity between the inputs will be measured as the Euclidean distance of the l-dimensional latent representation r_f(x): ℝ^d → ℝ^l learned by the model f in the last layer, that is, the (flattened) activations of the last convolutional layer of the model. This representation captures complex semantic features of the inputs, thus providing a more appropriate representation space for meaningfully comparing input samples according to the features learned by the model.

FIGURE 8    |    Adversarial examples generated for the ImageNet dataset classification task, taking advantage of the class ambiguity introduced by the appearance of multiple concepts in the image. (a) Original input. (b) Input perturbed in order to maximize confidence in the original class without altering the enhanced region in the explanation. (c) Adversarial example targeting the class "Suit" and a target saliency-map highlighting the region in which this class appears. (d) Adversarial example targeting the class "Irish water spaniel" and a target saliency-map highlighting the region in which the ground-truth class ("Curly-coated retriever") appears. (e) and (f) The three training images belonging to the classes "Curly-coated retriever" and "Irish water spaniel," respectively, which are closest to the original input.

Let X_train^c represent the set of training inputs belonging to the class of interest y_c (e.g., the class predicted by the model). Given a model f and an input x, the explanation will be a set of k input samples P = {x_1^p, …, x_k^p | x_i^p ∈ X_train^c} that satisfies:

$$\left\lVert r_f(\tilde{x}) - r_f(x) \right\rVert_2 > \left\lVert r_f(x_i^p) - r_f(x) \right\rVert_2, \quad \forall \tilde{x} \in X_{\mathrm{train}}^c - P, \ \forall x_i^p \in P \qquad (3)$$

Note that the two selected methods allow, by definition, explanations to be computed for any class of interest y_c. However, we will consider as the main explanation the one corresponding to the predicted class f(x). Finally, we assume that the explanation methods and their parameters are fixed and known to the adversary. Since the focus of our experimentation is illustrative and not performance-based, analyzing the sensitivity of the explanation methods to hyperparameters (Bansal, Agarwal, and Nguyen 2020; Dombrowski et al. 2019) is out of the scope of this section.
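A minimal sketch of this retrieval step is shown below, again assuming a torchvision ResNet-50 (dropping its final pooling and classification layers yields the last convolutional activations); the helper names are ours, for illustration only.

```python
# Minimal sketch: prototype retrieval in latent space (Equation (3)).
import torch

def latent(model, x):
    """r_f(x): flattened activations of the last convolutional layer."""
    conv_stack = torch.nn.Sequential(*list(model.children())[:-2])
    return conv_stack(x).flatten(start_dim=1)

def nearest_prototypes(model, x, class_inputs, k=3):
    """Return the k inputs of X_train^c closest to x in latent space."""
    with torch.no_grad():
        rx = latent(model, x)                    # shape (1, l)
        rc = latent(model, class_inputs)         # shape (n, l)
        dists = torch.cdist(rx, rc).squeeze(0)   # Euclidean distances
        idx = torch.topk(dists, k, largest=False).indices
    return class_inputs[idx]
```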

FIGURE 9    |    Adversarial example for the large scale visual recognition task, assuming a prototype-based explanation. (a) Original input belonging to the class "Labrador Retriever" (left) and adversarial example targeting the class "Doberman" (right). (b) Prototype-based explanation of the adversarial example and the class "Doberman" (i.e., the three training images belonging to the class "Doberman" that are closest to the adversarial example). (c) Prototype-based explanation of the original input and the ground-truth class. (d) Prototype-based explanation of the original input and the target class "Doberman."

4.3   |   Attack Method

We will assume a targeted attack for our experiments, in which the aim will be to create, given an input x, an adversarial example x′ such that:

$$f(x') = y_t \qquad (4)$$

$$A_f(x') = m_t \qquad (5)$$

$$\lVert x - x' \rVert \leq \epsilon \qquad (6)$$

where y_t represents a target class, m_t a target explanation and ϵ the maximum distortion norm. For the case of saliency-map explanations, m_t will be a predefined saliency map S_t. For the case of prototype-based classification, m_t will be the set P_t = {x_1^p, x_2^p, …, x_k^p} of k training inputs (with the value of k fixed beforehand by the explanation method) selected by the adversary to be produced as explanations (i.e., the training inputs of class y_t that are closer to x should be those in the set P_t). We do not specify any particular order for the k target-prototypes in P_t, that is, we assume that the relevance of each of the k prototypes in the explanation is the same.

We will use a targeted Projected Gradient Descent (PGD) attack (Madry et al. 2018) to generate the adversarial examples. This attack iteratively perturbs the input sample in the direction of the gradient of a loss function L (e.g., the cross-entropy) with respect to the input, sign(∇_{x'_i} L(x'_i, y_t)), with a step size α. At each step, the adversarial example is projected by a projection operator ℬ_ϵ to ensure that the norm of the adversarial perturbation v = x′ − x is restricted to ‖v‖_∞ ≤ ϵ:

$$x'_{i+1} = \mathcal{B}_{\epsilon}\left(x'_i - \alpha \cdot \mathrm{sign}\left(\nabla_{x'_i} L\left(x'_i, y_t\right)\right)\right) \qquad (7)$$

In order to produce attacks capable of changing both the classification and the explanation, we will consider a generalized loss function

$$L\left(x_i, y_t, m_t, \lambda\right) = (1 - \lambda) \cdot L_{\mathrm{pred}}\left(x_i, y_t\right) + \lambda \cdot L_{\mathrm{expl}}\left(x_i, m_t\right) \qquad (8)$$

where L_pred(x, y_t) represents the classification error with respect to the target class y_t, L_expl(x, m_t) represents the explanation error with respect to the target explanation m_t, and λ ∈ [0, 1] balances the trade-off between both functions. A close approach can be consulted in Zhang et al. (2020). In our experiments, we used the cross-entropy loss as L_pred. For the case of saliency-map explanations, we instantiated L_expl as the Euclidean distance between the model's explanation g(x) = S and the target saliency map S_t (specified by the adversary):

$$L_{\mathrm{expl}}\left(x, S_t\right) = \left\lVert g(x) - S_t \right\rVert_2 \qquad (9)$$

For the case of prototype-based explanations, L_expl will be the average Euclidean distance between the latent representation of the (adversarial) input and the latent representations of the k prototypes selected by the adversary as the target explanation P_t:

$$L_{\mathrm{expl}}\left(x, P_t\right) = \frac{1}{k} \sum_{x^p \in P_t} \left\lVert r(x) - r(x^p) \right\rVert_2 \qquad (10)$$
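The following sketch puts Equations (7)–(10) together as a PyTorch loop. It is a minimal illustration under assumed parameter values, not the exact implementation of our experiments; `explanation_loss` stands in for Equation (9) or (10), and must be differentiable with respect to the input (for Grad-CAM targets this requires building the saliency map with a differentiable graph, e.g., passing create_graph=True when computing its gradients).

```python
# Minimal sketch: targeted PGD with the combined loss of Equation (8).
import torch
import torch.nn.functional as F

def joint_pgd(model, x, y_target, explanation_loss, m_target,
              eps=8 / 255, alpha=1 / 255, steps=100, lam=0.5):
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        logits = model(x_adv)
        l_pred = F.cross_entropy(logits, y_target)         # L_pred
        l_expl = explanation_loss(model, x_adv, m_target)  # L_expl
        loss = (1 - lam) * l_pred + lam * l_expl           # Eq. (8)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv - alpha * grad.sign()                # Eq. (7): descent step
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)      # B_eps projection
        x_adv = torch.clamp(x_adv, 0, 1)                   # assumes inputs in [0, 1]
    return x_adv.detach()
```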

4.4   |   Illustrative Attacks in the X-Ray Classification Task

Figure 6 illustrates the results obtained for different adversarial examples generated against the COVID-Net model. The left part of each sub-figure shows the input sample, the model's classification of the input and the confidence score of the prediction, whereas the right part shows the saliency maps generated with the Grad-CAM explanation (darker red parts represent a higher relevance). Figure 6a shows the original (i.e., unperturbed) input sample, which is correctly classified as its ground-truth class "COVID-19".⁵

Figure 6b shows an adversarial example generated using a regular PGD attack (considering only changing the output class, i.e., L = L_pred), targeting the class "normal." Notice that the main regions of the explanation have shifted toward the rightmost part of the image, highlighting mainly irrelevant zones. Therefore, such an explanation might not be taken as consistent. Contrarily, Figure 6c shows an adversarial example created with the attack described in Equation (7) and the loss function described in (8), targeting the class "normal" and the original saliency map (i.e., the one obtained for the original input image), illustrating the attack paradigm A.3 described in Section 3.2. We clarify that using a regular adversarial attack might not necessarily imply a change in the explanation (or imply that the explanation, even if changed, will necessarily highlight irrelevant zones). Nevertheless, this example allows us to illustrate the need to control the explanation in order to create adversarial examples that are capable of convincing the human that the model's (mis)classifications and the corresponding explanations are coherent or consistent, as discussed in Section 3.

Figure 6d illustrates the attack paradigm A.1.1 described in Section 3.2, in which the original class is maintained whereas a change in the explanation is produced, in this case selectively highlighting some regions of the image. Here, we generated the attack setting the target map as the right half of the original saliency map, and setting the left-part values to zero. Such an attack strategy can be extremely concerning in those scenarios in which the explanation is of high relevance, as a misleading adversarial explanation might lead to an incorrect diagnosis, prescription or treatment.

Figure 6e illustrates the attack paradigm A.2.1. Notice that both the output class and the explanation are changed. Moreover, the target map set in this case represents a roughly uniform map over the most relevant parts of the image (in this case the two lungs). Therefore, the provided explanation can be taken as coherent, as the main parts of the image are taken into consideration for the prediction. The fact that the predicted class is "normal" also increases the coherence of the explanation, since it can be interpreted from it that the most critical areas are correct (i.e., that there is no evidence in those areas of a possible disease). Indeed, the same explanation would have a different effect if the prediction had represented a disease (e.g., if the original class had been maintained in this case). This is because a uniform explanation would not provide a precise justification of why the disease is predicted, thus hampering a proper diagnosis, but, at the same time, it is coherent with the input features (because the most relevant parts are highlighted), contributing to the user's acceptance that the model prediction is correct. Contrarily, in Figure 6f the target map is the opposite: roughly all the relevant parts are considered as not relevant, whereas the remaining regions are considered as relevant, illustrating a case in which the explanation is completely wrong. Since the prediction is also incorrect, and is not supported by the explanation, a total mismatch is produced between all the considered factors, exemplifying the attack paradigm A.2.2.
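For concreteness, the target maps used in the Figure 6 attacks can be derived from the original Grad-CAM map. The sketch below shows plausible constructions for the Figure 6d and Figure 6e/6f style targets; the exact masks used in our experiments may differ.

```python
# Minimal sketch: building target saliency maps from the original map
# S_orig of shape (1, 1, h, w); the masks are illustrative assumptions.
import torch

def right_half_target(S_orig):
    """Figure 6d style: keep the right half of the map, zero the left."""
    S_t = S_orig.clone()
    S_t[..., : S_t.shape[-1] // 2] = 0.0
    return S_t

def uniform_target(S_orig, region_mask, invert=False):
    """Figure 6e/6f style: roughly uniform map over (or outside) a region."""
    mask = ~region_mask if invert else region_mask  # boolean (h, w) mask
    return S_orig.max() * mask.float()
```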

4.5   |   Illustrative Attacks in the Large Scale Visual Recognition Task

In this section, we illustrate different types of adversarial examples generated taking advantage of class ambiguity. First, in Figure 7, the similarity between different classes is used to generate adversarial examples capable of producing a misclassification that could be considered as coherent or reasonable even for humans. These examples illustrate the attack paradigm A.3 described in Section 3. In particular, each attack is generated by setting a target class for which the inputs belonging to that class contain features very similar to those of the inputs belonging to the source class. Figure 7a shows the original input sample used to create the adversarial examples, the top-3 predictions of the model and the corresponding Grad-CAM explanation. Figure 7b–d shows the adversarial examples targeting the classes "Kuvasz," "White wolf," and "Labrador Retriever," respectively. These classes represent different dog breeds with very similar features, as is shown in Figure 7e–h, in which the prototypes (for each of the classes) that are closest to the original input image are shown. In all the cases, the saliency map targeted in the attack is the one obtained for the original input (i.e., we maintain the original explanation while changing the classification).

In Figure 8, a different type of ambiguity will be considered to generate the adversarial examples: the appearance of multiple concepts or classes in the image. In such cases, adversarial examples can be employed to change the focus of the classification to one of the objects of interest. The explanation is, therefore, a key factor in order to further support the model decision in classifying the input as the class of interest selected by the adversary. The input in Figure 8a contains two classes that could be equally relevant: the "Curly-coated retriever" dog breed (ground-truth class) and "suit." In Figure 8b,c, adversarial examples are generated in order to "untie" this ambiguity, maximizing the confidence of one of the classes ("Curly-coated retriever" and "suit," respectively) and changing the explanation to highlight the selected parts (paradigm A.2.1).⁶ As can be seen, the adversarial examples effectively focus the prediction on one of the classes, which can therefore bias the human interpretation of the result, accepting the prioritized output class as the dominant one. Whereas this type of attack is limited to the objects appearing in the image, different types of ambiguity can be considered at the same time to produce misclassifications that may be taken as "correct" by humans, as shown in Figure 8d, in which the focus is not only placed on the dog, but an incorrect class is also produced ("Irish water spaniel"), taking advantage of the ambiguity of the class similarity (paradigm A.3). This ambiguity is, indeed, also reflected in the output confidence scores provided by the model when the original input is classified, shown above the left part of Figure 8a, as both classes achieved a similar score. In order to assess the similarity between these two classes, Figure 8e,f shows the three prototypes belonging to each of the two classes that are the closest to the original image, in which it can be seen that both breeds contain very similar features.

Finally, in Figure 9, we provide an illustrative example of an attack designed to fool a model whose decisions are explained using a prototype-based explanation. As discussed in Section 3.2.1, an adversary can take advantage of prototype-based explanations to support certain misclassifications, for instance, producing an incorrect output class and minimizing the distance with prototypes which, apart from containing features representative of the source class, are representative of the target class as well. Figure 9a shows a well-classified image (left) and an adversarial example targeting the class "Doberman." The adversarial perturbation has been optimized in order to reduce the distance (in the latent representation) between the input and the three training images (belonging to the target class) shown in Figure 9b. As can be seen, these training images not only contain features representative of the target class ("Doberman"), but also additional features that resemble those in the original input sample (indeed, a similar dog is present in the selected training images), exemplifying the attack paradigm A.3. Figure 9c shows the prototypes belonging to the source class that are closest to the original image, and Figure 9d those prototypes closest to the original image yet belonging to the target class. Note that both Figure 9b,d contain prototypes belonging to the target class; however, those which are adversarially produced appear considerably more coherent due to their ambiguity (in the sense that they contain prototypical features of both the source and target class).

5   |   Conclusions

In this paper, we have introduced a framework to rigorously study and review the possibilities and limitations of adversarial examples in explainable machine learning scenarios, in which the input, the predictions of the models and the explanations are assessed by humans. First, we have extended the notion of adversarial examples in order to fit such scenarios, which has allowed us to examine the different adversarial attack paradigms existing in the literature from a unifying perspective, encompassing the research in the field. Furthermore, we analyze how adversarial attacks should be designed in order to mislead explainable machine learning models (and humans) depending on a wide range of factors, such as the type of task addressed, the expertise of the users querying the model, as well as the type, scope or impact of the explanation methods used to justify the decisions of the models. The main attack paradigms identified have been illustrated using two representative image classification tasks and two different explanation methods based on feature-attribution explanations and example-based explanations. Overall, the proposed framework provides a road map for the design of malicious attacks in realistic scenarios involving explainable models and human supervision, contributing to a more rigorous and structured research of adversarial examples in the field of explainable machine learning.

6   |   Future Work

In this last section, we identify different promising research directions that could be derived from the contributions of our paper. First, additional factors could be considered in the proposed framework (jointly with the input, the output class and the explanation) in order to consider even more fine-grained scenarios, such as the confidence of the prediction, which can condition the human acceptance of a model prediction, as recently studied in Nguyen, Kim, and Nguyen (2021). Other scenarios deserving further research are explanation-driven interactive machine learning scenarios, involving human-in-the-loop processes in which human feedback is actively used to influence or improve the model (Ghai et al. 2021; Guo et al. 2022; Schramowski et al. 2020; Teso et al. 2023; Teso and Kersting 2019).

Moreover, an interesting research line could be developing a general and unifying attack algorithm capable of addressing all the attack paradigms described in our framework, that is, an approach capable of automatically generating adversarial examples which satisfy the most important requirements depending on the scenario, explanation method or attack paradigm that wants to be produced. We plan to study the generation of such attacks in future works.

More generally, conceiving strategies to improve the reliability and robustness of explanation methods continues to be an urgent line of research, as still limited research has been conducted on the adversarial robustness of different explanation methods such as prototype-based approaches. Thus, a deeper analysis of the vulnerability of current explanation methods is an important step in order to increase the reliability and trustworthiness of explainable machine learning models. We hope our work serves as a foundation for future studies on defensive approaches to build upon, and to address these critical challenges in a structured and organized way.

Author Contributions

Jon Vadillo: conceptualization (equal), data curation (lead), formal analysis (lead), investigation (lead), methodology (equal), software (lead), validation (lead), visualization (lead), writing – original draft (lead). Roberto Santana: conceptualization (equal), formal analysis (supporting), funding acquisition (supporting), investigation (supporting), methodology (equal), project administration (equal), resources (equal), supervision (equal), writing – original draft (supporting). Jose A. Lozano: conceptualization (equal), formal analysis (supporting), funding acquisition (lead), investigation (supporting), methodology (equal), project administration (equal), resources (equal), supervision (equal), writing – original draft (supporting).

Acknowledgments

The authors would like to express their gratitude to Dr. Ainhoa Astiazarán Rodríguez and Dr. Paul López Sala for assessing the adversarial attacks in the medical image classification scenario and validating the corresponding claims.

Conflicts of Interest

The authors declare no conflicts of interest.

Data Availability Statement

The data that support the findings of this study are available at [Link] github.com/vadel/AE4XAI. These data were derived from the following resources: [Link] www.image-net.org/ and [Link] github.com/lindawangg/COVID-Net.

Related Wires Articles

Adversarial machine learning for cybersecurity and computer vision: Current developments and challenges

Explainable artificial intelligence and machine learning: A reality rooted perspective

A historical perspective of explainable Artificial Intelligence

Explainable artificial intelligence: An analytical review

Endnotes

1 Different degrees of expertise can be considered for a more comprehensive scenario, such as unskilled subjects, or partially skilled subjects capable of providing basic judgments about the input (for instance, a subject might not be able to visually discriminate between different species of reptiles, yet be able to visually classify an animal as a reptile and not as another animal class).

2 This could happen in hierarchical classification tasks or large-scale visual recognition tasks, as a fine-grained distinction of certain classes might be challenging, whereas the remaining classes are easily classified (Deng et al. 2009; Hase et al. 2019; Nguyen, Kim, and Nguyen 2021; Russakovsky et al. 2015; Silla and Freitas 2011).

3 The selected pretrained model (COVIDNet-CXR Small) is accessible at [Link] github.com/lindawangg/COVID-Net/blob/master/docs/models.md.

4 More information about the pretrained model used can be found at [Link]

5 Disclaimer: the authors acknowledge no expertise in CXR classification, and, as the dataset does not contain a ground-truth saliency-map explanation, it will be assumed for illustration purposes that the explanation achieved for the original input is coherent and reasonable. We acknowledge, nevertheless, that the explanation may be incomplete or vary from the interpretation an expert might provide.

6 In this case, the target saliency maps have been generated using an image-segmentation model (Mask R-CNN with Inception Resnet v2), which has been used to segment the two desired parts. The pretrained model is accessible at [Link] flow/mask_rcnn/inception_resnet_v2_1024x1024/1. Note that, in Figure 8b, both the classification and the explanation are preserved, thus no attack is carried out.

References

Adadi, A., and M. Berrada. 2018. "Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)." IEEE Access 6: 52138–52160.

Agirre, E., and P. Edmonds. 2006. "Word Sense Disambiguation: Algorithms and Applications." In vol. 33 of Text, Speech and Language Technology. Springer.

Aivodji, U., H. Arai, O. Fortineau, S. Gambs, S. Hara, and A. Tapp. 2019. "Fairwashing: The Risk of Rationalization." In Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97 of Proceedings of Machine Learning Research, 161–170.

Aïvodji, U., H. Arai, S. Gambs, and S. Hara. 2021. "Characterizing the Risk of Fairwashing." In Advances in Neural Information Processing Systems, vol. 34, 14822–14834.

Al-masni, M. A., M. A. Al-antari, M.-T. Choi, S.-M. Han, and T.-S. Kim. 2018. "Skin Lesion Segmentation in Dermoscopy Images via Deep Full Resolution Convolutional Networks." Computer Methods and Programs in Biomedicine 162: 221–231.

Alvarez-Melis, D., and T. Jaakkola. 2018a. "Towards Robust Interpretability With Self-Explaining Neural Networks." In Advances in Neural Information Processing Systems, vol. 31, 7775–7784. Red Hook, NY: Curran Associates Inc.

Alvarez-Melis, D., and T. S. Jaakkola. 2018b. "On the Robustness of Interpretability Methods." In Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), 66–71.

Alzantot, M., Y. Sharma, S. Chakraborty, H. Zhang, C.-J. Hsieh, and M. B. Srivastava. 2019. "GenAttack: Practical Black-Box Attacks With Gradient-Free Optimization." In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), GECCO'19 (Association for Computing Machinery), 1111–1119.

Anders, C., P. Pasliev, A.-K. Dombrowski, K.-R. Müller, and P. Kessel. 2020. "Fairwashing Explanations With Off-Manifold Detergent." In Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119, 314–323.

Anders, C. J., L. Weber, D. Neumann, W. Samek, K.-R. Müller, and S. Lapuschkin. 2022. "Finding and Removing Clever Hans: Using Explanation Methods to Debug and Improve Deep Models." Information Fusion 77: 261–295.

Asgari Taghanaki, S., A. Das, and G. Hamarneh. 2018. "Vulnerability Analysis of Chest X-Ray Image Classification Against Adversarial Attacks." In Understanding and Interpreting Machine Learning in Medical Image Computing Applications, Lecture Notes in Computer Science, 87–94. Cham, Switzerland: Springer International Publishing.

Bach, S., A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. 2015. "On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation." PLoS One 10, no. 7: e0130140.

Baehrens, D., T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller. 2010. "How to Explain Individual Classification Decisions." Journal of Machine Learning Research 11, no. 61: 1803–1831.

Bai, S., Y. Li, Y. Zhou, Q. Li, and P. H. Torr. 2021. "Adversarial Metric Attack and Defense for Person Re-Identification." IEEE Transactions on Pattern Analysis and Machine Intelligence 43, no. 6: 2119–2126.

Balda, E. R., A. Behboodi, and R. Mathar. 2019. "Perturbation Analysis of Learning Algorithms: Generation of Adversarial Examples From Classification to Regression." IEEE Transactions on Signal Processing 67, no. 23: 6078–6091.

Ballet, V., X. Renard, J. Aigrain, T. Laugel, P. Frossard, and M. Detyniecki. 2019. "Imperceptible Adversarial Attacks on Tabular Data." In NeurIPS 2019 Workshop on Robust AI in Financial Services: Data, Fairness, Explainability, Trustworthiness, and Privacy (Robust AI in FS).

Bansal, N., C. Agarwal, and A. Nguyen. 2020. "SAM: The Sensitivity of Attribution Methods to Hyperparameters." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE Computer Society), 8670–8680.

Beck, C., H. Booth, M. El-Assady, and M. Butt. 2020. "Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias." In Proceedings of the 14th Linguistic Annotation Workshop, 60–73.

Belinkov, Y., and Y. Bisk. 2018. "Synthetic and Natural Noise Both Break Neural Machine Translation." In International Conference on Learning Representations (ICLR).

Blesch, K., M. N. Wright, and D. Watson. 2023. "Unfooling SHAP and SAGE: Knockoff Imputation for Shapley Values." In Explainable Artificial Intelligence, 131–146. Cham, Switzerland: Springer Nature Switzerland.

Boopathy, A., S. Liu, G. Zhang, et al. 2020. "Proper Network Interpretability Helps Adversarial Robustness in Classification." In Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119 of Proceedings of Machine Learning Research (PMLR), 1014–1023.

Borkar, J., and P.-Y. Chen. 2021. "Simple Transparent Adversarial Examples." In ICLR 2021 Workshop on Security and Safety in Machine Learning Systems.

Bortsova, G., C. González-Gonzalo, S. C. Wetstein, et al. 2021. "Adversarial Attack Vulnerability of Medical Image Analysis Systems: Unexplored Factors." Medical Image Analysis 73: 102141.

Brendel, W., J. Rauber, and M. Bethge. 2018. "Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models." In International Conference on Learning Representations (ICLR).

Bussone, A., S. Stumpf, and D. O'Sullivan. 2015. "The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems." In Proceedings of the 2015 International Conference on Healthcare Informatics (ICHI), 160–169.

Carlini, N., and D. Wagner. 2017. "Towards Evaluating the Robustness of Neural Networks." In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), 39–57.

Carmichael, Z., and W. J. Scheirer. 2023. "Unfooling Perturbation-Based Post Hoc Explainers." In Proceedings of the 37th AAAI Conference on Artificial Intelligence, vol. 37 of AAAI'23/IAAI'23/EAAI'23 (AAAI Press), 6925–6934.

Cartella, F., O. Anunciação, Y. Funabiki, D. Yamaguchi, T. Akishita, and O. Elshocht. 2021. "Adversarial Attacks for Tabular Data: Application to Fraud Detection and Imbalanced Data." In Proceedings of the 2021 AAAI Workshop on Artificial Intelligence Safety (SafeAI).

Chen, C., O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su. 2019. "This Looks Like That: Deep Learning for Interpretable Image Recognition." In Advances in Neural Information Processing Systems, vol. 32, 8930–8941. Red Hook, NY: Curran Associates Inc.

Chen, J., X. Wu, V. Rastogi, Y. Liang, and S. Jha. 2019. "Robust Attribution Regularization." In Advances in Neural Information Processing Systems, vol. 32, 14300–14310. Red Hook, NY: Curran Associates Inc.

Chen, P.-Y., H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. 2017. "ZOO: Zeroth Order Optimization Based Black-Box Attacks to Deep Neural Networks Without Training Substitute Models." In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) (Association for Computing Machinery), 15–26.

Chen, Z., Y. Bei, and C. Rudin. 2020. "Concept Whitening for Interpretable Image Recognition." Nature Machine Intelligence 2, no. 12: 772–782.

Cheng, M., J. Yi, P.-Y. Chen, H. Zhang, and C.-J. Hsieh. 2020. "Seq2Sick: Evaluating the Robustness of Sequence-To-Sequence Models With Adversarial Examples." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4: 3601–3608.

Cheng, Y., L. Jiang, and W. Macherey. 2019. "Robust Neural Machine Translation With Doubly Adversarial Inputs." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics), 4324–4333.

Cisse, M. M., Y. Adi, N. Neverova, and J. Keshet. 2017. "Houdini: Fooling Deep Structured Visual and Speech Recognition Models With Adversarial Examples." In Advances in Neural Information Processing Systems, vol. 30, 6977–6987. Red Hook, NY: Curran Associates Inc.

Deng, E., Z. Qin, M. Li, Y. Ding, and Z. Qin. 2021. "Attacking the Dialogue System at Smart Home." In Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (Springer International Publishing), 148–158.

Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. "ImageNet: A Large-Scale Hierarchical Image Database." In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.

Dimanov, B., U. Bhatt, M. Jamnik, and A. Weller. 2020. "You Shouldn't Trust Me: Learning Models Which Conceal Unfairness From Multiple Explanation Methods." In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI), vol. 97, 2473–2480.

Dombrowski, A.-K., M. Alber, C. Anders, M. Ackermann, K.-R. Müller, and P. Kessel. 2019. "Explanations Can Be Manipulated and Geometry Is to Blame." In Advances in Neural Information Processing Systems, vol. 32, 13589–13600. Red Hook, NY: Curran Associates Inc.

Doshi-Velez, F., and B. Kim. 2018. "Considerations for Evaluation and Generalization in Interpretable Machine Learning." In Explainable and Interpretable Models in Computer Vision and Machine Learning, the Springer Series on Challenges in Machine Learning, 3–17.

Ebrahimi, J., D. Lowd, and D. Dou. 2018. "On Adversarial Examples for Character-Level Neural Machine Translation." In Proceedings of the 27th International Conference on Computational Linguistics (COLING) (Association for Computational Linguistics), 653–663.

Elliott, A., S. Law, and C. Russell. 2021. "Explaining Classifiers Using Adversarial Perturbations on the Perceptual Ball." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10693–10702.

Etmann, C., S. Lunz, P. Maass, and C. Schoenlieb. 2019. "On the Connection Between Adversarial Robustness and Saliency Map Interpretability." In Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97 of Proceedings of Machine Learning Research, 1823–1832.

Eykholt, K., I. Evtimov, E. Fernandes, et al. 2018. "Robust Physical-World Attacks on Deep Learning Visual Classification." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1625–1634.

Finlayson, S. G., J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, and I. S. Kohane. 2019. "Adversarial Attacks on Medical Machine Learning." Science 363, no. 6433: 1287–1289.

Fischer, V., M. C. Kumar, J. H. Metzen, and T. Brox. 2017. "Adversarial Examples for Semantic Image Segmentation." In Workshop of the 2017 International Conference on Learning Representations (ICLR).

Fujiyoshi, H., T. Hirakawa, and T. Yamashita. 2019. "Deep Learning-Based Image Recognition for Autonomous Driving." IATSS Research 43, no. 4: 244–252.

Fursov, I., M. Morozov, N. Kaploukhaya, et al. 2021. "Adversarial Attacks on Deep Models for Financial Transaction Records." In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD'21 (Association for Computing Machinery), 2868–2878.

Gautam, S., A. Boubekki, S. Hansen, et al. 2022. "ProtoVAE: A Trustworthy Self-Explainable Prototypical Variational Model." In Advances in Neural Information Processing Systems, vol. 35, 17940–17952. Red Hook, NY: Curran Associates Inc.

Ghai, B., Q. V. Liao, Y. Zhang, R. Bellamy, and K. Mueller. 2021. "Explainable Active Learning (XAL): Toward AI Explanations as Interfaces for Machine Teachers." Proceedings of the ACM on Human-Computer Interaction 4, no. CSCW3: 235:1–235:28.

Ghalebikesabi, S., L. Ter-Minassian, K. DiazOrdaz, and C. C. Holmes. 2021. "On Locality of Local Explanation Models." In Advances in Neural Information Processing Systems, vol. 34, 18395–18407. Red Hook, NY: Curran Associates Inc.

Ghassemi, M., L. Oakden-Rayner, and A. L. Beam. 2021. "The False Hope of Current Approaches to Explainable Artificial Intelligence in Health Care." Lancet Digital Health 3, no. 11: e745–e750.

Ghorbani, A., A. Abid, and J. Zou. 2019. "Interpretation of Neural Networks Is Fragile." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 3681–3688.

Ghorbani, A., J. Wexler, J. Y. Zou, and B. Kim. 2019. "Towards Automatic Concept-Based Explanations." In Advances in Neural Information Processing Systems, vol. 32, 9277–9286. Red Hook, NY: Curran Associates Inc.

Gilpin, L. H., D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. 2018. "Explaining Explanations: An Overview of Interpretability of Machine Learning." In Proceedings of the IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 80–89.

Goddard, K., A. Roudsari, and J. C. Wyatt. 2012. "Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators." Journal of the American Medical Informatics Association 19, no. 1: 121–127.

Goodfellow, I., J. Shlens, and C. Szegedy. 2015. "Explaining and Harnessing Adversarial Examples." In International Conference on Learning Representations (ICLR).
Guidotti, R., A. Monreale, F. Giannotti, D. Pedreschi, S. Ruggieri, and F. Turini. 2019. “Factual and Counterfactual Explanations for Black Box Decision Making.” IEEE Intelligent Systems 34, no. 6: 14–23.
Guo, L., E. M. Daly, O. Alkan, M. Mattetti, O. Cornec, and B. Knijnenburg. 2022. “Building Trust in Interactive Machine Learning via User Contributed Interpretable Rules.” In Proceedings of the 27th International Conference on Intelligent User Interfaces (IUI), 537–548.
Guo, S., X. Li, and Z. Mu. 2021. “Adversarial Machine Learning on Social Network: A Survey.” Frontiers in Physics 9: 766540. Accessed May 6, 2022. [Link] fphy.2021.766540/full.
Gupta, K., B. Pesquet-Popescu, F. Kaakai, J.-C. Pesquet, F. D. Malliaros, and U. Paris-Saclay. 2021. “An Adversarial Attacker for Neural Networks in Regression Problems.” In IJCAI Workshop on Artificial Intelligence Safety (AI Safety).
Haffar, R., N. M. Jebreel, J. Domingo-Ferrer, and D. Sánchez. 2021. “Explaining Image Misclassification in Deep Learning via Adversarial Examples.” In Proceedings of the International Conference on Modeling Decisions for Artificial Intelligence (MDAI), Lecture Notes in Computer Science (Springer International Publishing), 323–334.
Hase, P., C. Chen, O. Li, and C. Rudin. 2019. “Interpretable Image Recognition With Hierarchical Prototypes.” In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, vol. 7, 32–40.
Heo, J., S. Joo, and T. Moon. 2019. “Fooling Neural Network Interpretations via Adversarial Model Manipulation.” In Advances in Neural Information Processing Systems, vol. 32, 2925–2936. Red Hook, NY: Curran Associates Inc.
Hirano, H., A. Minagi, and K. Takemoto. 2021. “Universal Adversarial Attacks on Deep Neural Networks for Medical Image Classification.” BMC Medical Imaging 21, no. 1: 1–13.
Hossam, M., T. Le, H. Zhao, and D. Phung. 2021. “Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability.” In Proceedings of the 25th International Conference on Pattern Recognition (ICPR) (IEEE), 8922–8928.
Huang, W., X. Zhao, G. Jin, and X. Huang. 2023. “SAFARI: Versatile and Efficient Evaluations for Robustness of Interpretability.” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1988–1998.
Hussenot, L., M. Geist, and O. Pietquin. 2020. “CopyCAT: Taking Control of Neural Policies With Constant Attacks.” In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS'20 (International Foundation for Autonomous Agents and Multiagent Systems), 548–556.
Ignatiev, A., N. Narodytska, and J. Marques-Silva. 2019. “On Relating Explanations and Adversarial Examples.” In Advances in Neural Information Processing Systems, vol. 32, 15883–15893. Red Hook, NY: Curran Associates Inc.
Ilyas, A., L. Engstrom, A. Athalye, and J. Lin. 2018. “Black-Box Adversarial Attacks With Limited Queries and Information.” In Proceedings of the 35th International Conference on Machine Learning (ICML), vol. 80, 2137–2146.
Jiang, W., X. Wen, J. Zhan, X. Wang, and Z. Song. 2021. “Interpretability-Guided Defense Against Backdoor Attacks to Deep Neural Networks.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
Joel, M. Z., S. Umrao, E. Chang, et al. 2022. “Using Adversarial Images to Assess the Robustness of Deep Learning Models Trained on Diagnostic Images in Oncology.” JCO Clinical Cancer Informatics 6: e2100170.
Johnson, S. L. J. 2019. “AI, Machine Learning, and Ethics in Health Care.” Journal of Legal Medicine 39, no. 4: 427–441.
Joo, S., S. Jeong, J. Heo, A. Weller, and T. Moon. 2023. “Towards More Robust Interpretation via Local Gradient Alignment.” In Proceedings of the 37th AAAI Conference on Artificial Intelligence, vol. 37 of AAAI'23/IAAI'23/EAAI'23 (AAAI Press), 8168–8176.
Kao, C.-Y., J. Chen, K. Markert, and K. Böttinger. 2022. “Rectifying Adversarial Inputs Using XAI Techniques.” In Proceedings of the 30th European Signal Processing Conference (EUSIPCO), 573–577.
Khosla, A., N. Jayadevaprakash, B. Yao, and F.-F. Li. 2011. “Novel Dataset for Fine-Grained Image Categorization: Stanford Dogs.” In CVPR Workshop on Fine-Grained Visual Categorization (FGVC).
Khrulkov, V., and I. Oseledets. 2018. “Art of Singular Vectors and Universal Adversarial Perturbations.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8562–8570.
Kim, B., M. Wattenberg, J. Gilmer, et al. 2018. “Interpretability Beyond Feature Attribution: Quantitative Testing With Concept Activation Vectors (TCAV).” In Proceedings of the 35th International Conference on Machine Learning (ICML), vol. 80 of Proceedings of Machine Learning Research, 2668–2677.
Kim, S. S. Y., N. Meister, V. V. Ramaswamy, R. Fong, and O. Russakovsky. 2022. “HIVE: Evaluating the Human Interpretability of Visual Explanations.” In Computer Vision – ECCV 2022, 280–298. Cham, Switzerland: Springer Nature Switzerland.
Kindermans, P.-J., S. Hooker, J. Adebayo, et al. 2019. “The (Un)reliability of Saliency Methods.” In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, vol. 11700 of Lecture Notes in Computer Science, 267–280.
Koh, P. W., and P. Liang. 2017. “Understanding Black-Box Predictions via Influence Functions.” In Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70 of Proceedings of Machine Learning Research, 1885–1894.
Kos, J., I. Fischer, and D. Song. 2018. “Adversarial Examples for Generative Models.” In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), 36–42.
Kuchipudi, B., R. T. Nannapaneni, and Q. Liao. 2020. “Adversarial Machine Learning for Spam Filters.” In Proceedings of the 15th International Conference on Availability, Reliability and Security, ARES'20 (Association for Computing Machinery), 1–6.
Kulesza, T., M. Burnett, W.-K. Wong, and S. Stumpf. 2015. “Principles of Explanatory Debugging to Personalize Interactive Machine Learning.” In Proceedings of the 20th International Conference on Intelligent User Interfaces (IUI), 126–137.
Kumar, N., S. Vimal, K. Kayathwal, and G. Dhama. 2021. “Evolutionary Adversarial Attacks on Payment Systems.” In Proceedings of the 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 813–818.
Kuppa, A., and N.-A. Le-Khac. 2020. “Black Box Attacks on Explainable Artificial Intelligence (XAI) Methods in Cyber Security.” In 2020 International Joint Conference on Neural Networks (IJCNN), 1–8.
Lakkaraju, H., and O. Bastani. 2020. “‘How do I fool you?’: Manipulating User Trust via Misleading Black Box Explanations.” In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES'20, 79–85.
Lakkaraju, H., E. Kamar, R. Caruana, and J. Leskovec. 2019. “Faithful and Customizable Explanations of Black Box Models.” In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES'19 (Association for Computing Machinery), 131–138.
Le Merrer, E., and G. Trédan. 2020. “Remote Explainability Faces the Bouncer Problem.” Nature Machine Intelligence 2, no. 9: 529–539.
Li, O., H. Liu, C. Chen, and C. Rudin. 2018. “Deep Learning for Case-Based Reasoning Through Prototypes: A Neural Network That Explains Its Predictions.” Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1: 3530–3537.

Li, X., and D. Zhu. 2020. “Robust Detection of Adversarial Attacks on Medical Images.” In Proceedings of the 17th IEEE International Symposium on Biomedical Imaging (ISBI), 1154–1158.
Li, Y., H. Zhang, C. Bermudez, Y. Chen, B. A. Landman, and Y. Vorobeychik. 2020. “Anatomical Context Protects Deep Learning From Adversarial Perturbations in Medical Imaging.” Neurocomputing 379: 370–378.
Lin, Y.-C., Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun. 2017. “Tactics of Adversarial Attack on Deep Reinforcement Learning Agents.” In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI) (AAAI Press), 3756–3762.
Lipton, Z. C. 2018. “The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability Is Both Important and Slippery.” Queue 16, no. 3: 31–57.
Liu, N., M. Du, R. Guo, H. Liu, and X. Hu. 2021. “Adversarial Attacks and Defenses: An Interpretation Perspective.” ACM SIGKDD Explorations Newsletter 23, no. 1: 86–99.
Liu, N., H. Yang, and X. Hu. 2018. “Adversarial Detection With Model Interpretation.” In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD'18 (Association for Computing Machinery), 1803–1811.
Liu, Y., X. Chen, C. Liu, and D. Song. 2017. “Delving Into Transferable Adversarial Examples and Black-Box Attacks.” In International Conference on Learning Representations (ICLR).
Lu, X., A. Tolmachev, T. Yamamoto, et al. 2021. “Crowdsourcing Evaluation of Saliency-Based XAI Methods.” In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track (Springer International Publishing), 431–446.
Lundberg, S. M., and S.-I. Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems, vol. 30, 4765–4774.
Ma, X., Y. Niu, L. Gu, et al. 2021. “Understanding Adversarial Attacks on Deep Learning Based Medical Image Analysis Systems.” Pattern Recognition 110: 107332.
Madry, A., A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. 2018. “Towards Deep Learning Models Resistant to Adversarial Attacks.” In International Conference on Learning Representations (ICLR).
Mahdavifar, S., and A. A. Ghorbani. 2019. “Application of Deep Learning to Cybersecurity: A Survey.” Neurocomputing 347: 149–176.
Mathov, Y., E. Levy, Z. Katzir, A. Shabtai, and Y. Elovici. 2022. “Not All Datasets Are Born Equal: On Heterogeneous Tabular Data and Adversarial Examples.” Knowledge-Based Systems 242: 108377.
Metzen, J. H., M. C. Kumar, T. Brox, and V. Fischer. 2017. “Universal Adversarial Perturbations Against Semantic Image Segmentation.” In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV) (IEEE), 2774–2783.
Michel, P., X. Li, G. Neubig, and J. Pino. 2019. “On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (Association for Computational Linguistics), vol. 1, 3103–3114.
Mode, G. R., and K. A. Hoque. 2020. “Adversarial Examples in Deep Learning for Multivariate Time Series Regression.” In 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), 1–10.
Moore, J., N. Hammerla, and C. Watkins. 2019. “Explaining Deep Learning Models With Constrained Adversarial Examples.” In PRICAI 2019: Trends in Artificial Intelligence, Lecture Notes in Computer Science, 43–56. Cham, Switzerland: Springer International Publishing.
Moosavi-Dezfooli, S.-M., A. Fawzi, O. Fawzi, and P. Frossard. 2017. “Universal Adversarial Perturbations.” In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 86–94.
Moosavi-Dezfooli, S.-M., A. Fawzi, and P. Frossard. 2016. “DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks.” In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2574–2582.
Mopuri, K. R., A. Ganeshan, and R. V. Babu. 2019. “Generalizable Data-Free Objective for Crafting Universal Adversarial Perturbations.” IEEE Transactions on Pattern Analysis and Machine Intelligence 41, no. 10: 2452–2465.
Mopuri, K. R., U. Garg, and R. V. Babu. 2017. “Fast Feature Fool: A Data Independent Approach to Universal Adversarial Perturbations.” In Proceedings of the British Machine Vision Conference 2017 (BMVC) (BMVA Press), 30.1–30.12.
Mopuri, K. R., P. K. Uppala, and R. V. Babu. 2018. “Ask, Acquire, and Attack: Data-Free UAP Generation Using Class Impressions.” In Proceedings of the European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science (Springer International Publishing), 20–35.
Morch, N., U. Kjems, L. Hansen, et al. 1995. “Visualization of Neural Networks Using Saliency Maps.” In Proceedings of the International Conference on Neural Networks (ICNN), vol. 4, 2085–2090.
Mori, K., H. Fukui, T. Murase, T. Hirakawa, T. Yamashita, and H. Fujiyoshi. 2019. “Visual Explanation by Attention Branch Network for End-to-End Learning-Based Self-Driving.” In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), 1577–1582.
Nauta, M., J. Trienes, S. Pathak, et al. 2023. “From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI.” ACM Computing Surveys 55, no. 13s: 295:1–42.
Nauta, M., R. Van Bree, and C. Seifert. 2021. “Neural Prototype Trees for Interpretable Fine-Grained Image Recognition.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE), 14933–14943.
Nguyen, G., D. Kim, and A. Nguyen. 2021. “The Effectiveness of Feature Attribution Methods and Its Correlation With Automatic Evaluation Scores.” In Advances in Neural Information Processing Systems, vol. 34, 26422–26436. Red Hook, NY: Curran Associates Inc.
Noack, A., I. Ahern, D. Dou, and B. Li. 2021. “An Empirical Study on the Relation Between Network Interpretability and Adversarial Robustness.” SN Computer Science 2, no. 1. Accessed May 1, 2022. [Link]
Noppel, M., L. Peter, and C. Wressnegger. 2023. “Disguising Attacks With Explanation-Aware Backdoors.” In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), 664–681.
Papernot, N., P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. 2017. “Practical Black-Box Attacks Against Machine Learning.” In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS'17 (Association for Computing Machinery), 506–519.
Paschali, M., S. Conjeti, F. Navarro, and N. Navab. 2018. “Generalizability vs. Robustness: Investigating Medical Imaging Networks Using Adversarial Examples.” In Proceedings of the 2018 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), vol. 11070 of Lecture Notes in Computer Science (Springer International Publishing), 493–501.
Paul, R., M. Schabath, R. Gillies, L. Hall, and D. Goldgof. 2020. “Mitigating Adversarial Attacks on Medical Image Understanding Systems.” In Proceedings of the 17th IEEE International Symposium on Biomedical Imaging (ISBI), 1517–1521.
Pawelczyk, M., T. Datta, J. van-den-Heuvel, G. Kasneci, and H. Lakkaraju. 2023. “Probabilistically Robust Recourse: Navigating the Trade-Offs Between Costs and Robustness in Algorithmic Recourse.” In International Conference on Learning Representations (ICLR).

Pillai, R., P. Oza, and P. Sharma. 2019. “Review of Machine Learning Techniques in Health Care.” In Proceedings of the 2019 International Conference on Recent Innovations in Computing (ICRIC), Lecture Notes in Electrical Engineering (Springer International Publishing), 103–111.
Poursaeed, O., I. Katsman, B. Gao, and S. Belongie. 2018. “Generative Adversarial Perturbations.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE), 4422–4431.
Praher, V., K. Prinz, A. Flexer, and G. Widmer. 2021. “On the Veracity of Local, Model-Agnostic Explanations in Audio Classification: Targeted Investigations With Adversarial Examples.” In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 531–538.
Qiu, H., L. L. Custode, and G. Iacca. 2021. “Black-Box Adversarial Attacks Using Evolution Strategies.” In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO'21 (Association for Computing Machinery), 1827–1833.
Rahman, A., M. S. Hossain, N. A. Alrajeh, and F. Alsolami. 2021. “Adversarial Examples—Security Threats to COVID-19 Deep Learning Systems in Medical IoT Devices.” IEEE Internet of Things Journal 8, no. 12: 9603–9610.
Ras, G., M. van Gerven, and P. Haselager. 2018. “Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges.” In Explainable and Interpretable Models in Computer Vision and Machine Learning, the Springer Series on Challenges in Machine Learning, 19–36.
Recaido, C., and B. Kovalerchuk. 2023. “Visual Explainable Machine Learning for High-Stakes Decision-Making With Worst Case Estimates.” Data Analysis and Optimization 202: 291–329.
Renard, X., T. Laugel, M.-J. Lesot, C. Marsala, and M. Detyniecki. 2019. “Detecting Potential Local Adversarial Examples for Human-Interpretable Defense.” In Proceedings of the 2018 ECML PKDD Workshop on Recent Advances in Adversarial Machine Learning, Lecture Notes in Computer Science (Springer International Publishing), 41–47.
Ribeiro, M. T., S. Singh, and C. Guestrin. 2016. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16 (Association for Computing Machinery), 1135–1144.
Ribeiro, M. T., S. Singh, and C. Guestrin. 2018. “Anchors: High-Precision Model-Agnostic Explanations.” In Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'18/IAAI'18/EAAI'18 (AAAI Press), 1527–1535.
Rieger, L., and L. K. Hansen. 2020. “A Simple Defense Against Adversarial Attacks on Heatmap Explanations.” In ICML Workshop on Human Interpretability in Machine Learning (WHI).
Robnik-Šikonja, M., and I. Kononenko. 2008. “Explaining Classifications for Individual Instances.” IEEE Transactions on Knowledge and Data Engineering 20, no. 5: 589–600.
Ros, A. S., and F. Doshi-Velez. 2018. “Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing Their Input Gradients.” Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1: 1660–1669.
Ross, A. S., M. C. Hughes, and F. Doshi-Velez. 2017. “Right for the Right Reasons: Training Differentiable Models by Constraining Their Explanations.” In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2662–2670.
Rudin, C. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1, no. 5: 206–215.
Russakovsky, O., J. Deng, H. Su, et al. 2015. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115, no. 3: 211–252.
Samek, W., G. Montavon, S. Lapuschkin, C. J. Anders, and K.-R. Müller. 2021. “Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications.” Proceedings of the IEEE 109, no. 3: 247–278.
Saralajew, S., L. Holdijk, M. Rees, E. Asan, and T. Villmann. 2019. “Classification-By-Components: Probabilistic Modeling of Reasoning Over a Set of Components.” In Advances in Neural Information Processing Systems, vol. 32, 2792–2803. Red Hook, NY: Curran Associates Inc.
Sarkar, S. K., K. Oshiba, D. Giebisch, and Y. Singer. 2018. “Robust Classification of Financial Risk.” In NIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: The Impact of Fairness, Explainability, Accuracy, and Privacy.
Schneider, J., C. Meske, and M. Vlachos. 2022. “Deceptive AI Explanations: Creation and Detection.” In Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART), vol. 2, 44–55.
Schramowski, P., W. Stammer, S. Teso, et al. 2020. “Making Deep Neural Networks Right for the Right Scientific Reasons by Interacting With Their Explanations.” Nature Machine Intelligence 2, no. 8: 476–486.
Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. “Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization.” In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 618–626.
Serradilla, O., E. Zugasti, J. Rodriguez, and U. Zurutuza. 2022. “Deep Learning Models for Predictive Maintenance: A Survey, Comparison, Challenges and Prospects.” Applied Intelligence 52: 10934–10964.
Sharif, M., S. Bhagavatula, L. Bauer, and M. K. Reiter. 2016. “Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition.” In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS'16 (Association for Computing Machinery), 1528–1540.
Silla, C. N., and A. A. Freitas. 2011. “A Survey of Hierarchical Classification Across Different Application Domains.” Data Mining and Knowledge Discovery 22, no. 1: 31–72.
Simonyan, K., A. Vedaldi, and A. Zisserman. 2014. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.” In Workshop of the 2014 International Conference on Learning Representations (ICLR).
Sinha, S., H. Chen, A. Sekhon, Y. Ji, and Y. Qi. 2021. “Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing.” In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (Association for Computational Linguistics), 420–434.
Slack, D., A. Hilgard, H. Lakkaraju, and S. Singh. 2021. “Counterfactual Explanations Can Be Manipulated.” In Advances in Neural Information Processing Systems, vol. 34, 62–75.
Slack, D., S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju. 2020. “Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods.” In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 180–186.
Sokol, K., and P. Flach. 2019. “Counterfactual Explanations of Machine Learning Predictions: Opportunities and Challenges for AI Safety.” In Proceedings of the 2019 AAAI Workshop on Artificial Intelligence Safety (SafeAI), 95–99.
Sotgiu, A., M. Pintor, and B. Biggio. 2022. “Explainability-Based Debugging of Machine Learning for Vulnerability Discovery.” In Proceedings of the 17th International Conference on Availability, Reliability and Security, ARES'22 (Association for Computing Machinery), 1–8.

Springenberg, J. T., A. Dosovitskiy, T. Brox, and M. Riedmiller. 2015. “Striving for Simplicity: The All Convolutional Net.” In Workshop of the 2015 International Conference on Learning Representations.
Stiglic, G., P. Kocbek, N. Fijacko, M. Zitnik, K. Verbert, and L. Cilar. 2020. “Interpretability of Machine Learning-Based Prediction Models in Healthcare.” WIREs Data Mining and Knowledge Discovery 10, no. 5: e1379.
Stock, P., and M. Cisse. 2018. “ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases.” In Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol. 11210, 504–519. Cham, Switzerland: Springer International Publishing.
Štrumbelj, E., and I. Kononenko. 2010. “An Efficient Explanation of Individual Classifications Using Game Theory.” Journal of Machine Learning Research 11, no. 1: 1–18.
Subramanya, A., V. Pillai, and H. Pirsiavash. 2019. “Fooling Network Interpretation in Image Classification.” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2020–2029.
Sun, S., B. Song, X. Cai, X. Du, and M. Guizani. 2022. “CAMA: Class Activation Mapping Disruptive Attack for Deep Neural Networks.” Neurocomputing 500: 989–1002.
Sundararajan, M., A. Taly, and Q. Yan. 2017. “Axiomatic Attribution for Deep Networks.” In Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70, 3319–3328.
Szegedy, C., W. Zaremba, I. Sutskever, et al. 2014. “Intriguing Properties of Neural Networks.” In International Conference on Learning Representations (ICLR).
Tabacof, P., J. Tavares, and E. Valle. 2016. “Adversarial Images for Variational Autoencoders.” In NIPS 2016 Workshop on Adversarial Training.
Tamam, S. V., R. Lapid, and M. Sipper. 2023. “Foiling Explanations in Deep Neural Networks.” Transactions on Machine Learning Research: 1–32. Accessed May 5, 2024. [Link] forum?id=wvLQMHtyLk.
Tang, R., N. Liu, F. Yang, N. Zou, and X. Hu. 2022. “Defense Against Explanation Manipulation.” Frontiers in Big Data 5: 704203.
Tao, G., S. Ma, Y. Liu, and X. Zhang. 2018. “Attacks Meet Interpretability: Attribute-Steered Detection of Adversarial Samples.” In Advances in Neural Information Processing Systems, vol. 31, 7717–7728. Red Hook, NY: Curran Associates Inc.
Teso, S., Ö. Alkan, W. Stammer, and E. Daly. 2023. “Leveraging Explanations in Interactive Machine Learning: An Overview.” Frontiers in Artificial Intelligence 6: 1066049.
Teso, S., and K. Kersting. 2019. “Explanatory Interactive Machine Learning.” In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES), 239–245.
Thys, S., W. V. Ranst, and T. Goedemé. 2019. “Fooling Automated Surveillance Cameras: Adversarial Patches to Attack Person Detection.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 49–55.
Tsipras, D., S. Santurkar, L. Engstrom, A. Ilyas, and A. Madry. 2020. “From ImageNet to Image Classification: Contextualizing Progress on Benchmarks.” In Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119 of Proceedings of Machine Learning Research, 9625–9635.
Tsipras, D., S. Santurkar, L. Engstrom, A. Turner, and A. Madry. 2019. “Robustness May Be at Odds With Accuracy.” In International Conference on Learning Representations (ICLR).
Ustun, B., A. Spangher, and Y. Liu. 2019. “Actionable Recourse in Linear Classification.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT*'19 (Association for Computing Machinery), 10–19.
van der Waa, J., E. Nieuwburg, A. Cremers, and M. Neerincx. 2021. “Evaluating XAI: A Comparison of Rule-Based and Example-Based Explanations.” Artificial Intelligence 291: 103404.
Viganò, L., and D. Magazzeni. 2020. “Explainable Security.” In Proceedings of the 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), 293–300.
Virgolin, M., and S. Fracaros. 2023. “On the Robustness of Sparse Counterfactual Explanations to Adverse Perturbations.” Artificial Intelligence 316: 103840.
Vreš, D., and M. Robnik-Šikonja. 2022. “Preventing Deception With Explanation Methods Using Focused Sampling.” Data Mining and Knowledge Discovery.
Wachter, S., B. Mittelstadt, and C. Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” Harvard Journal of Law & Technology 31, no. 2: 842–887.
Wang, H., G. Wang, Y. Li, D. Zhang, and L. Lin. 2020. “Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-Identification With Deep Mis-Ranking.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 342–351.
Wang, J., J. Tuyls, E. Wallace, and S. Singh. 2020. “Gradient-Based Analysis of NLP Models Is Manipulable.” In Findings of the Association for Computational Linguistics: EMNLP 2020 (Association for Computational Linguistics), 247–258.
Wang, J., Y. Wu, M. Li, X. Lin, J. Wu, and C. Li. 2020. “Interpretability Is a Kind of Safety: An Interpreter-Based Ensemble for Adversary Defense.” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD'20 (Association for Computing Machinery), 15–24.
Wang, L., Z. Q. Lin, and A. Wong. 2020. “COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases From Chest X-Ray Images.” Scientific Reports 10, no. 1: 19549.
Wang, Z., H. Wang, S. Ramkumar, P. Mardziel, M. Fredrikson, and A. Datta. 2020e. “Smoothed Geometry for Robust Attribution.” In Advances in Neural Information Processing Systems, vol. 33, 13623–13634. Red Hook, NY: Curran Associates, Inc.
Xie, C., J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. 2017. “Adversarial Examples for Semantic Segmentation and Object Detection.” In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 1378–1387.
Xu, K., G. Zhang, S. Liu, et al. 2020. “Adversarial T-Shirt! Evading Person Detectors in a Physical World.” In Proceedings of the 2020 European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science (Springer International Publishing), 665–681.
Xue, M., C. Yuan, J. Wang, W. Liu, and P. Nicopolitidis. 2020. “DPAEG: A Dependency Parse-Based Adversarial Examples Generation Method for Intelligent Q&A Robots.” Security and Communication Networks, 2020.
Yang, P., J. Chen, C.-J. Hsieh, J.-L. Wang, and M. Jordan. 2020. “ML-LOO: Detecting Adversarial Examples With Feature Attribution.” Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4: 6639–6647.
Yeh, C.-K., J. Kim, I. E.-H. Yen, and P. K. Ravikumar. 2018. “Representer Point Selection for Explaining Deep Neural Networks.” In Advances in Neural Information Processing Systems, vol. 31, 9291–9301. Red Hook, NY: Curran Associates Inc.
Yoo, T. K., and J. Y. Choi. 2020. “Outcomes of Adversarial Attacks on Deep Learning Models for Ophthalmology Imaging Domains.” JAMA Ophthalmology 138, no. 11: 1213–1215.
Yosinski, J., J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. 2015. “Understanding Neural Networks Through Deep Visualization.” In 2015 ICML Workshop on Deep Learning.

Yuan, X., P. He, Q. Zhu, and X. Li. 2019. “Adversarial Examples:
Attacks and Defenses for Deep Learning.” IEEE Transactions on Neural
Networks and Learning Systems 30, no. 9: 2805–2824.
Zeiler, M. D., and R. Fergus. 2014. “Visualizing and Understanding
Convolutional Networks.” In Computer Vision – ECCV 2014, vol. 8689 of
Lecture Notes in Computer Science, 818–833.
Zhan, Y., B. Zheng, Q. Wang, et al. 2022. “Towards Black-­Box Adversarial
Attacks on Interpretable Deep Learning Systems.” In Proceedings of the
2022 IEEE International Conference on Multimedia and Expo (ICME)
(IEEE Computer Society), 1–6.
Zhang, C., Z. Ye, Y. Wang, and Z. Yang. 2018. “Detecting Adversarial
Perturbations With Saliency.” In Proceedings of the IEEE 3rd
International Conference on Signal and Image Processing (ICSIP),
271–275.
Zhang, Q., Y. N. Wu, and S.-­C. Zhu. 2018. “Interpretable Convolutional
Neural Networks.” In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 8827–8836.
Zhang, T., and Z. Zhu. 2019. “Interpreting Adversarially Trained
Convolutional Neural Networks.” In Proceedings of the 36th International
Conference on Machine Learning (ICML), vol. 97 of Proceedings of
Machine Learning Research, 7502–7511.
Zhang, X., N. Wang, H. Shen, S. Ji, X. Luo, and T. Wang. 2020.
“Interpretable Deep Learning under Fire.” In Proceedings of the 29th
USENIX Security Symposium (USENIX Security 20), 1659–1676.
Zhang, Y., P. Tiňo, A. Leonardis, and K. Tang. 2021. “A Survey on
Neural Network Interpretability.” IEEE Transactions on Emerging
Topics in Computational Intelligence 5, no. 5: 726–742.
Zhao, Z., D. Dua, and S. Singh. 2018. “Generating Natural Adversarial
Examples.” In International Conference on Learning Representations
(ICLR).
Zheng, H., E. Fernandes, and A. Prakash. 2019. “Analyzing the
Interpretability Robustness of Self-­E xplaining Models.” In ICML 2019
Security and Privacy of Machine Learning Workshop.
Zheng, Y., Y. Lu, and S. Velipasalar. 2020. “An Effective Adversarial
Attack on Person Re-­Identification in Video Surveillance via Dispersion
Reduction.” IEEE Access 8: 183891–183902.
Zou, W., S. Huang, J. Xie, X. Dai, and J. Chen. 2020. “A Reinforced
Generation of Adversarial Examples for Neural Machine Translation.”
In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics (Association for Computational Linguistics),
3486–3497.
