Adversarial Attacks in Explainable ML
1Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, San Sebastian, Spain | 2Basque Center for
Applied Mathematics (BCAM), Bilbao, Spain
Funding: This work was supported by the Eusko Jaurlaritza (IT1504-22, BERC 2022-2025 and KK-2020/00049, KK-2023/00012, KK-2024/00030), by the Spanish Ministry of Economy and Competitiveness MINECO (projects PID2019-104966GB-I00 and PID2022-137442NB-I00) and by the Spanish Ministry of Science, Innovation and Universities (FPU19/03231 predoctoral grant). Jose A. Lozano acknowledges support by the Spanish Ministry of Science, Innovation and Universities through BCAM Severo Ochoa accreditation (SEV-2017-0718 and CEX2021-001142-S).
ABSTRACT
Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations.
Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-
distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable
machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios
where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next,
we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable
models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms
existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed.
Furthermore, our framework considers a wide range of relevant yet often ignored factors such as the type of problem, the user
expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to
successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and
realistic study of adversarial examples in the field of explainable machine learning.
1 | Introduction

Machine learning models, such as deep neural networks, still face several weaknesses that hamper the development and deployment of these technologies, despite their outstanding and ever-increasing capacity to solve complex artificial intelligence problems. One of the main shortcomings is their black-box nature, which prevents analyzing and understanding their reasoning process, while such a requirement is increasingly demanded in order to guarantee a reliable and transparent use of artificial intelligence. To overcome this limitation, different strategies have been proposed in the literature (Gilpin et al. 2018; Samek et al. 2021; Zhang et al. 2021), ranging from post hoc explanation methods, which try to identify the parts, elements or concepts in the inputs that most affect the decisions of trained models (Ghorbani et al. 2019; Kim et al. 2018; Yosinski et al. 2015; Zeiler and Fergus 2014), to more proactive approaches which pursue a transparent reasoning by training inherently interpretable models (Alvarez-Melis and Jaakkola 2018a; Chen, Li, et al. 2019; Hase et al. 2019; Li et al. 2018; Saralajew et al. 2019; Zhang, Wu, and Zhu 2018).
Edited by: Mehmed Kantardzic, Associate Editor and Witold Pedrycz, Editor-in-Chief
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is
properly cited.
© 2024 The Author(s). WIREs Data Mining and Knowledge Discovery published by Wiley Periodicals LLC.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2025; 15:e1567
Another issue that threatens the reliability of deep neural networks is their low robustness to adversarial examples (Szegedy et al. 2014; Yuan et al. 2019), that is, to inputs manipulated in order to maliciously change the output of a model while the changes are imperceptible to humans. Indeed, this can be seen as a direct implication of their lack of human-like reasoning. Therefore, improving the explainability of the models is also a promising direction to achieve adversarial robustness, a hypothesis which is supported by recent works which show that interpretability and robustness are connected (Etmann et al. 2019; Noack et al. 2021; Ros and Doshi-Velez 2018; Tsipras et al. 2019; Zhang and Zhu 2019).

Furthermore, the study of adversarial attacks against explainable models has gained interest in recent years, as will be fully reviewed in Sections 2.3 and 2.4. In contrast to common adversarial attacks, which focus solely on changing the classification of the model (Yuan et al. 2019), attacks on explainable models need to consider both changes in the classification and in the explanation supporting that classification. Another key difference when considering attacks against explainable models is related to their stealthiness. Generally, the only constraint assumed in order to produce a stealthy attack is that the changes added to the inputs must be imperceptible to humans. However, the use of explainable models implies a different scenario, where it is assumed that a human will observe and analyze not only the input, but also the model classification and explanation. Therefore, uncontrolled changes in both factors may cause inconsistencies, alerting the human. For this reason, the assumption of explainable classification models introduces a new question regarding the definition of adversarial examples: can adversarial examples be deployed if humans observe not only the input but also the output classification and/or the corresponding explanation?

1.1 | Objectives and Contributions

The objective of this exploratory review is to shed light on this question by extending the notion of adversarial examples for explainable machine learning scenarios, in which humans can not only assess the input sample, but also compare it to the output of the model and to the explanation. These extended notions of adversarial examples allow us to analyze the possible attacks that can be produced by means of adversarially changing the model's classification and explanation, either jointly or independently (that is, changing the explanation without altering the output class, or vice versa). Our analysis leads to a framework that establishes whether (and how) adversarial attacks can be generated for explainable models under human supervision. Moreover, we describe the requirements that adversarial examples should satisfy in order to be able to mislead an explainable model (and even a human) depending on multiple scenarios or factors which, despite their relevance, are often overlooked in the literature of adversarial examples for explainable models, such as the expertise of the user or the objective of the explanation. Finally, the proposed attack paradigms are also illustrated by adversarial examples generated for two representative image classification tasks, as well as for two different explanation methods. The outline of our work is summarized in Figure 1.

The aim of all these contributions is to establish a basis for a more rigorous study of the vulnerabilities of explainable machine learning in adversarial scenarios. We believe that these fields will benefit from our work in the following ways.

• Heretofore, studies on adversarial attacks against explainable models have considered very particular or fragmented scenarios and attack paradigms. Thus, there is a lack of a unifying perspective in this field that connects all these works within a general analytical framework and taxonomy, which is a gap that we aim to fill with this review.

• The framework we propose encompasses not only attack paradigms which have already been investigated in the literature, but also paradigms that, to the best of our knowledge, have not yet been studied, paving the way for new research venues.

• In addition, the role of the human is often overlooked in the study of attacks against explainable models, despite being a key factor in these scenarios. In this work, we address this limitation by analyzing the requirements that adversarial examples should satisfy in order to be able to mislead an explainable model, and even a human, depending on the attack scenario. This analysis provides a road map for the design of realistic attacks against explainable models.

• Furthermore, the fact that our framework considers a wide range of scenarios that an adversary may face allows us to summarize which paradigms are realistic or unrealistic in each of them, which is fundamental to ensure that attack methods are evaluated with an appropriate setting and methodology in future works.

• On another note, our work also contributes to raising awareness about the possible attack types that both models and humans may face in realistic adversarial scenarios, which is important to promote a more aware and secure use of machine learning-based technologies, or even the development of more robust models or explanation methods.

For the above reasons, the aim of this work is to contribute to more methodical research in this area, delimiting the differences between the possible attack paradigms, identifying limitations in the current approaches and establishing more fine-grained and rigorous standards for the development and evaluation of new attacks or defenses.

2 | Related Work

Our work focuses on adversarial attacks against explainable machine learning models. Therefore, this section provides an introduction to both research topics. This introduction will first summarize each research field independently, and, afterward, the intersection between both, as described below.

To begin with, the fields of explainable and adversarial machine learning are presented in Sections 2.1 and 2.2, respectively. Subsequently, the reliability of the explanation methods in adversarial scenarios is discussed in Section 2.3. Finally, further connections between explanation methods and adversarial examples are discussed in Section 2.4.
2.1 | Overview of Explanation Methods in Machine Learning

In this section, we summarize the explanation methods proposed in the literature in order to present the terminology and taxonomy that will be used in the subsequent sections to develop our analytical framework on adversarial examples in explainable models.

2.1.1 | Scope, Objective, and Impact of the Explanations

The objective of an explanation is to justify the behavior of a model in a way that is easily understandable to humans. However, different users might be interested in different aspects of the model, and, therefore, the explanations can be generated for different scopes or objectives.

Overall, the scope of an explanation can be categorized as local or global (Zhang et al. 2021). On the one hand, local explanations aim to characterize or explain the model's prediction for each particular input individually, for example, by identifying the most relevant parts or features of the input. On the other hand, global explanations attempt to expose the general reasoning process of the model, for instance, summarizing (e.g., using a simpler but interpretable model) when a certain class will be predicted, or describing to what extent a particular input feature is related to one class. Since in this paper we address the vulnerability of explainable models to adversarial examples, and, therefore, the interest is placed on specific inputs and the corresponding outputs, we focus on local explanations.

In addition, explanations can be used, even for the same model, for different purposes. For instance, users querying the model for a credit loan might be interested in explaining the output obtained for their particular cases only, whereas a developer might be interested in discovering why that model misclassifies certain input samples. At the same time, an analyst can be interested in whether that model is biased against a social group for unethical reasons. Moreover, explanations can be leveraged to enable users to interactively refine the model, thus enhancing their understanding and trust in the system (Guo et al. 2022; Ross, Hughes, and Doshi-Velez 2017; Schramowski et al. 2020; Teso et al. 2023; Teso and Kersting 2019). At a higher level, all these purposes are based on necessities involving ethics, safety or knowledge acquisition, among others (Doshi-Velez and Kim 2018). Based on the purpose of the explanations and the particular problem, domain or scenario in which they are required, another relevant factor should be taken into consideration: the impact of the explanations, which can be defined as the consequence of the decisions made based on the analysis of the explanation. Healthcare domains are clear examples in which the consequences of the decisions can be severe.

Despite the relevance of these factors, they are often overlooked when local explanation methods are designed or evaluated (Doshi-Velez and Kim 2018; Nauta et al. 2023; Zhang et al. 2021). The same happens for adversarial attacks in explainable models. We argue that the scope, the objective and the impact of explanations should be key factors when designing adversarial attacks against explainable models, since a different attack strategy needs to be adopted in each context to successfully deceive the model (and the human). This will be discussed in detail in Section 3.
2.1.2 | Types of Explanations

Different types of explanations exist depending on how the explanation is conveyed:

• Feature-based explanations: assign an importance score to each feature in the input, based on their relevance for the output classification (Baehrens et al. 2010; Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2016; Robnik-Šikonja and Kononenko 2008; Štrumbelj and Kononenko 2010). Common feature-based explanations (especially in the image domain) are activation or saliency maps, which highlight the most relevant parts of the input (Bach et al. 2015; Morch et al. 1995; Selvaraju et al. 2017; Simonyan, Vedaldi, and Zisserman 2014; Springenberg et al. 2015; Sundararajan, Taly, and Yan 2017; Zeiler and Fergus 2014). Despite their extensive use, acceptance and scientific relevance, several works have put forward opposing views, identifying that such explanations can be unreliable and misleading (Chen, Bei, and Rudin 2020; Hase et al. 2019; Kim et al. 2018; Kindermans et al. 2019; Lipton 2018; Rudin 2019).

• Example-based explanations: the explanation is based on comparing the similarity between the input at hand and a set of prototypical inputs that are representative of the predicted class. Thus, the classification of a given input sample is justified by the similarity between it and the prototypes of the predicted class. We will also refer to these types of explanations as prototype-based explanations in the paper, although different forms of example-based explanation exist, such as the strategies proposed in Koh and Liang (2017) or Yeh et al. (2018), based on estimating the training images most responsible for a prediction. Recent works have integrated prototype-based explanations directly in the learning process of neural networks, so that the classification is based on the similarities between the input and a set of prototypes (Alvarez-Melis and Jaakkola 2018a; Chen, Li, et al. 2019; Gautam et al. 2022; Hase et al. 2019; Li et al. 2018; Nauta, Van Bree, and Seifert 2021), achieving a more interpretable reasoning. The prototypes can represent an entire input describing one class (e.g., a prototypical handwritten digit "1" in digit classification) (Gautam et al. 2022; Li et al. 2018), or represent image parts or semantic concepts (Alvarez-Melis and Jaakkola 2018a; Chen, Li, et al. 2019; Hase et al. 2019; Nauta, Van Bree, and Seifert 2021).

• Rule-based explanations: these explanation methods aim to expose the reasoning of a model in a simplified or human-understandable set of rules, such as logic rules or if-then-else rules, which represent a natural form of explanations for humans (Guidotti et al. 2019; Lakkaraju et al. 2019; Ribeiro, Singh, and Guestrin 2018; van der Waa et al. 2021). Rule-based explanations are particularly well suited when the input contains features which are easily interpretable.

• Counterfactual explanations: although counterfactual explanations (Guidotti et al. 2019; Wachter, Mittelstadt, and Russell 2017) can be considered, in their form, as rule-based explanations, the main difference of these explanations is their conditional or hypothetical reasoning nature, as the aim is suggesting the possible changes that should happen in the input to receive a different (and frequently more positive) output classification (e.g., "a rejected loan request would be accepted if the subject had a higher income").

Some illustrative examples of these four types of explanations are presented in Figure 2. Overall, the most suitable type of explanation depends on the domain, the scope and the purpose of the explanation, as well as on the expertise level of the users querying the model. We refer the reader to Samek et al. (2021), Zhang et al. (2021) and Gilpin et al. (2018) for a more fine-grained overview of explanation methods. These surveys also provide an exhaustive enumeration of relevant methods in the literature focused on computing such explanations. Furthermore, for technical papers on the quantitative…
FIGURE 2 | Illustrative examples of the four main types of explanations in machine learning.
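To make these four forms more concrete, the following sketch shows how each type of explanation might be represented for the credit-loan example mentioned above. This is purely illustrative: the field names, rule and values are hypothetical and are not taken from any cited method.

```python
# Hypothetical representations of the four explanation types for a loan-approval model.
feature_based = {"income": 0.45, "debt_ratio": -0.30, "age": 0.05}      # importance score per input feature

example_based = ["applicant_1042", "applicant_0873", "applicant_0311"]  # prototypes of the predicted class

rule_based = "IF income > 30000 AND debt_ratio < 0.4 THEN grant_loan"   # simplified decision rule

counterfactual = "the rejected loan request would be accepted if income were 5000 higher"
```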
…scenarios in which the adversary might not have access to the digital files.

A summary of the taxonomy described can be consulted in Table 1. We also refer the reader to the work of Yuan et al. (2019) for a more comprehensive and fine-grained survey on adversarial examples.

Finally, whereas the research on adversarial examples in the last few years has led to a torrent of attack methods proposed for multiple scenarios, tasks and even types of models, this research has focused almost exclusively on classification problems (Yuan et al. 2019). Nevertheless, it has been shown that adversarial examples can be generated for machine learning models trained to perform very different types of problems, such as regression (Balda, Behboodi, and Mathar 2019; Gupta et al. 2021; Kos, Fischer, and Song 2018; Li et al. 2020; Mode and Hoque 2020; Tabacof, Tavares, and Valle 2016), reinforcement learning (Hussenot, Geist, and Pietquin 2020; Lin et al. 2017) or image segmentation (Cisse et al. 2017; Fischer et al. 2017; Metzen et al. 2017; Mopuri, Ganeshan, and Babu 2019; Poursaeed et al. 2018; Xie et al. 2017) problems. All these advances allow a wide range of opportunities for adversaries to maliciously take control of the outcomes of machine learning models, threatening countless systems. At the same time, research has focused on models that only provide a classification as an answer. Only recently has the vulnerability of explainable models begun to be studied, as we will discuss in detail in the following section.

2.3 | Reliability of Explanations Under Adversarial Attacks

Some explanation methods in the literature have been proven to be unreliable in adversarial settings. In Ghorbani, Abid, and Zou (2019), Dombrowski et al. (2019), Alvarez-Melis and Jaakkola (2018b), Zhang et al. (2020) and Kuppa and Le-Khac (2020), it is shown that small changes in input samples can produce drastic changes in feature-importance explanations, while maintaining the output classification. In Ghorbani, Abid, and Zou (2019), the proposed attacks are also evaluated in the example-based explanations proposed in Koh and Liang (2017), based on estimating the relevance of each training image for a given prediction by using influence functions. In Zheng, Fernandes, and Prakash (2019), adversarial attacks capable of changing the explanations while maintaining the outputs are created for self-explainable (prototype-based) classifiers. In Zhang et al. (2020) and Kuppa and Le-Khac (2020), it is shown that adversarial examples can also produce wrong outputs and (feature-importance) explanations at the same time, or change the output while maintaining the explanations (Zhang et al. 2020).

Aivodji et al. (2019), Aïvodji et al. (2021), and Lakkaraju and Bastani (2020) show that trustworthy explanations can be produced for a biased or an untrustworthy model, thus manipulating user trust. These approaches are, however, not based on adversarial attacks, as they focus on producing a global explanation model that closely approximates the original (black-box) model but which employs trustworthy features instead of sensitive or discriminatory features (which are actually being used by the original model to predict). Similarly, in Anders et al. (2020), Dimanov et al. (2020), Heo, Joo, and Moon (2019), Le Merrer and Trédan (2020), and Slack et al. (2020) adversarial models are generated, capable of producing incorrect or misleading explanations without harming their predictive performance.

On the other hand, recent works have proposed defensive approaches in order to increase the robustness of different explanation methods. These works have focused primarily on feature-based explanation methods, relying on regularization strategies (Boopathy et al. 2020; Chen, Wu, et al. 2019; Joo et al. 2023; Tang et al. 2022; Wang et al. 2020e) and explanation-averaging strategies (Rieger and Hansen 2020) for gradient-based explanations, or tailored…
TABLE 1 | Summary of the main taxonomy used to describe and categorize adversarial attacks.
…the attack in scenarios in which the user observes the output classification, since the change in the output can be inconsistent, alerting the human. For these reasons, the following question arises: are regular adversarial examples useful in practice when the user is aware of the output?

To address this question, we start by discussing four different scenarios, based on the agreement of the following factors: f(x), the model's prediction of the input; h(x), the classification performed by a human subject; and y_x, the ground-truth class of an input x (which will be unknown for both the model and the human subject in the prediction phase of the model). It is worth clarifying that a human misclassification (h(x) ≠ y_x) can occur in scenarios in which the addressed task is of high complexity, such as medical diagnosis (Pillai, Oza, and Sharma 2019), or in which the label of an input is ambiguous, such as sentiment analysis (Agirre and Edmonds 2006; Beck et al. 2020). Although a human misclassification might be uncommon in simple problems such as object recognition, even in such cases ambiguous or challenging inputs can be found (Stock and Cisse 2018; Tsipras et al. 2020). Finally, unless specified, we will assume expert subjects, that is, subjects with knowledge in the task and capable of providing well-founded classifications.¹ According to this framework, the four possible scenarios are those described in Figure 4.

According to the described casuistry, regular adversarial attacks aim to produce the second scenario (A.0.2, i.e., f(x) ≠ h(x) = y_x), by imperceptibly perturbing an input x_0 that satisfies f(x_0) = h(x_0) = y_{x_0} (i.e., the first scenario) so that the model's output is changed, but without altering the human perception of the input (which, therefore, implies h(x) = y_x = y_{x_0}). However, assuming that the user is aware of the output, the fulfillment of the attack is subject to whether human subjects can correct the detected misclassification, or have control over the implications of that prediction. For example, an adversarial traffic signal will only produce a dramatic consequence in autonomous cars if the drivers do not take control with sufficient promptness.

Regarding the remaining cases, they do not fit in the definition of a regular adversarial attack since either the input is misclassified by the human subject (h(x) ≠ y_x) or the model is not fooled (f(x) = h(x) = y_x). Nevertheless, assuming a more general definition, scenarios involving human misclassifications could be potentially interesting for an adversary. Similarly to regular adversarial attacks, which force the second scenario departing from the first one, an adversary might be interested in forcing the fourth scenario departing from the third one. Let us take as an example a complex computer-aided diagnosis task through medical images, in which an expert subject fails in their diagnosis while the model is correct. In such cases, we can induce a human error confirmation attack by forcing the model to confirm the (wrong) medical diagnosis produced by the expert, that is, forcing f(x) = h(x) ≠ y_x (Bortsova et al. 2021; Finlayson et al. 2019; Goddard, Roudsari, and Wyatt 2012; Johnson 2019).

Based on the above discussion, we can determine that some types of adversarial attacks can still be effective even when the user is aware of the output. Nonetheless, paradoxically, it is possible to introduce new types of adversarial attacks when the output classification is supported by explanations, as we show in the following section.

3.2 | Scenarios in Which Human Subjects Are Aware of the Explanations

The scenarios described in the previous section can be further extended for the case of explainable machine learning models, as the explanations for the predictions come into play. As a consequence, each of the cases defined above can be subdivided into new subcases depending on whether the explanations match the output class or whether humans agree with the explanations of the models. To avoid an exhaustive enumeration of all the possible scenarios, we focus only on those identified in the literature as interesting from an adversary perspective.

From this standpoint, given an explainable model, adversarial examples can be generated by perturbing a well classified input (for which the corresponding explanation is also correct and coherent) with the aim of changing (i) the output class, (ii) the provided explanation, or (iii) both at the same time (Noppel, Peter, and Wressnegger 2023; Schneider, Meske, and Vlachos 2022).

To formalize these scenarios, let us denote A_f(x) as the explanation provided to characterize the decision f(x) of a machine learning model, and A_h(x) as the explanation provided by a human according to their knowledge or criteria. Since a total agreement or disagreement between such explanations is generally unlikely and challenging to characterize in a formal way, the disagreement between A_f(x) and A_h(x) will be denoted as A_f(x) ≉ A_h(x), while the agreement will be denoted as A_f(x) ≈ A_h(x). Similarly, we will denote A(x) ∼ y if an explanation A(x) for the input x is consistent with the reasons that characterize the class y (that is, if the explanation correctly characterizes or supports the classification of x as the class y). For…
FIGURE 4 | Attack casuistry when the human observes not only the input but also the output classification of the model.
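The casuistry of Figure 4 can also be summarized programmatically. The sketch below is purely illustrative (it is not part of the paper's methodology); it labels an outcome from the model prediction f(x), the human classification h(x) and the ground-truth class y_x, using the category names of Section 3.1 and Table 2.

```python
def attack_scenario(f_x, h_x, y_x):
    """Label an outcome according to the casuistry of Section 3.1 / Table 2.
    f_x: model prediction, h_x: human classification, y_x: ground-truth class."""
    if f_x == y_x and h_x == y_x:
        return "scenario 1: model and human both correct (no attack)"
    if f_x != y_x and h_x == y_x:
        return "A.0.2: regular attack (model fooled, human correct)"
    if f_x == y_x and h_x != y_x:
        return "scenario 3: human error, model correct"
    if f_x == h_x:  # both wrong and agreeing
        return "A.0.4: human error confirmation (model confirms the human's mistake)"
    return "model and human both wrong, but disagreeing"

# Example: an expert misreads an X-ray as 'pneumonia' (true label 'healthy') and the
# attack forces the model to agree with the expert.
print(attack_scenario("pneumonia", "pneumonia", "healthy"))   # -> A.0.4
```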
TABLE 2 | Overview of the attack casuistry described in Sections 3.1 and 3.2.
| Factors observed by the user | ID | Classification (model correct / human correct / model–human agreement) | Explanation (coherent with ground truth / coherent with its output / model–human agreement) | Attack category/description | Representative examples, tasks, or use-cases |
|---|---|---|---|---|---|
| Input + Output (Section 3.1) | A.0.2 | ✘ / ✓ / ✘ | — / — / — | Regular attack. | Forcing misclassifications in critical tasks (e.g., traffic-sign recognition, surveillance or finance fraud detection). |
| | A.0.4 | ✘ / ✘ / ✓ | — / — / — | Human error confirmation. | Confirm a wrong diagnosis produced by an expert in health-care domains. |
| Input + Output + Explanation (Section 3.2) | A.1 | ✓ / ✓ / ✓ | * / * / ✘ | Incorrect explanation (while keeping the correct output). | Reduce human trust in the model. |
| | A.1.1 | | ✓ / ✓ / ✘ | Incorrect and coherent explanations (while keeping the correct output). | Confusing recommendations in credit-loan request or medical-diagnosis tasks. Biased or discriminative explanations. Hide inappropriate behaviors of the model. |
| | A.2 | ✘ / ✓ / ✘ | * / * / ✘ | Incorrect output and explanation. | Reduce human trust in the model. |
| | A.2.1 | | ✘ / ✓ / ✘ | Model is wrong but supports its own misclassification. | Increase confidence of the human in the incorrect prediction. Bias the human in favor of a wrong class. |
| | A.2.2 | | ✘ / ✘ / ✘ | Total mismatch between the input, the classification and the explanation. | Reduce human trust in the model. |
| | A.3 | ✘ / ✓ / ✘ | ✓ / ✓ / ✓ | Incorrect output while keeping a correct explanation. | Ambiguous explanations applicable to more than one class. Misdirect the attention of the user toward another reasonable class. |

Note: For the sake of simplicity, we use the following symbols to represent the following terms: ✓ (yes), ✘ (no), — (not applicable). In those paradigms in which subcases are considered, the symbol "*" is used to represent the term "not specified" (i.e., the choice made for those factors determines the attack subtype).
FIGURE 5 | Critical scenarios to be considered in the study of adversarial attacks against explainable machine learning models.
TABLE 3 | Possible scenarios in which explainable machine learning models can be deployed, and a guideline on how adversarial attacks should
be designed in each case in order to pose a realistic threat.
…Teso et al. 2023; Teso and Kersting 2019). For instance, a model developer might want to explain the decisions of a self-driving car (even if the end-user will not receive explanations when the model is put into practice) to assess why it has provided an incorrect output, to validate its reasoning process, to improve it, or to gain knowledge about what the model has learned (Fujiyoshi, Hirakawa, and Yamashita 2019; Mori et al. 2019; Ras, van Gerven, and Haselager 2018). In such cases, an adversary could: justify a misclassification of the model (A.2.1, A.3), hide an inappropriate behavior when the model predicts correctly but for the wrong reasons (A.1.1), or produce wrong outputs and explanations at the same time (A.2).

S3 Scenario: The same attack strategies applicable to the S2 scenario can be applied in scenarios in which the models' decisions are taken as more relevant or imperative than the experts' judgments. Although this scenario resembles S1, the main difference is that, in this case, explanations can be useful or relevant even when the model is deployed or employed by the end-user, and, therefore, the attack should also take the explanations into consideration instead of considering only the output class.

S4 Scenario: Regarding the expertise level of the user querying the model, the case of no expertise is the simplest one from the perspective of the adversary, as any attack scheme can be produced without arousing suspicions, taking advantage of the user inexperience. For the same reason, models deployed in such scenarios should also be the ones with more security measures against adversarial attacks.

S5 Scenario: If the user's expertise is medium, the model might be expected to clarify or support the user's decisions. Thus, the…
TABLE 4 | Summary of the illustrative attacks shown in Sections 4.4 and 4.5.
| Task | Type of explanation | Possible scenario | Wrong class | Wrong explanation | Attack description | Figures |
|---|---|---|---|---|---|---|
| X-ray (Section 4.4) | Feature-based (saliency map) | S2, S3, S4/S5/S6, S8 | ✘ | ✘ | No attack (i.e., original input) | Figure 6a |
| | | | ✓ | ✓ (conflicting) | Regular attack (i.e., without controlling the explanation) | Figure 6b |
| | | | ✓ | ✘ | A.3 | Figure 6c |
| | | | ✘ | ✓ | A.1.1 (confusing recommendation) | Figure 6d |
| | | | ✓ | ✓, A_f(x) ∼ f(x) (non-informative but consistent, i.e., supports prediction) | A.2.1 | Figure 6e |
| | | | ✓ | ✓, A_f(x) ≁ f(x), A_f(x) ≁ y_x (non-informative and inconsistent) | A.2.2 | Figure 6f |
| Large-scale visual recog. (Section 4.5) | Feature-based (saliency map) | S2, S5/S6 | ✘ | ✘ | No attack (i.e., original input) | Figure 7a |
| | | | ✓ | ✘ | A.3 | Figure 7b–d |
| | | S2, S7 | ✘ | ✘ | No attack (i.e., original input) | Figure 8a |
| | | | ✘ | ✘ | No attack (the output is further biased in favor of the correct class, avoiding ambiguities) | Figure 8b |
| | | | ✓ | ✓ | A.2.1 | Figure 8c |
| | | | ✓ | ✘ | A.3 | Figure 8d |
| | Prototype-based explanation (three nearest training inputs) | S2, S5/S6 | ✓ | ✘, A_f(x) ∼ y_x, A_f(x) ∼ f(x) (ambiguous) | A.3 | Figure 9a,b |

Note: Notice that each attack paradigm and scenario is exemplified at least once. Note also that for the large-scale visual recognition task, different scenarios can be considered depending on the characteristics or the challengingness of the input.
…differentiating dogs from other animal species) but not others (e.g., two similar dog breeds). In such cases, the user might expect the prediction of the model or the corresponding explanation to clarify the correct class of the input.

4.2 | Explanation Methods

We will consider two representative explanation methods in order to illustrate an explainable machine-learning scenario.

4.2.1 | Feature-Based Explanation

The Grad-CAM method (Selvaraju et al. 2017) will be used to generate saliency-map explanations. The rationale of this method is to employ the feature maps learned by the model in the last convolutional layer to produce the explanations. Given a convolutional neural network f and an input x, the Grad-CAM saliency map S is defined as:

S = ReLU( ∑_{m=1}^{M} α_{m,c} · C_m )    (1)

where C_m, m = 1, …, M, represents the (two-dimensional) m-th activation map (for the input x) at the last convolutional layer of f, and α_{m,c} ∈ ℝ represents the importance of the m-th map in the prediction of the class of interest y_c (typically f(x), i.e., the class predicted by the model). The importance α_{m,c} of each activation map is estimated as the average global pooling of the gradient of the output score (corresponding to the class y_c) with respect to C_m, which will be denoted as G_{m,c} = ∇_{C_m} f_c(x):
FIGURE 7 | Adversarial examples generated for the ImageNet dataset classification task taking advantage of class ambiguity. The adversarial
examples are generated from the input in (a)-left, which belongs to the source class “Great Pyrenees,” targeting different classes that are characterized
by features similar to those of the source class: (b) “Kuvasz,” (c) “White wolf,” and (d) “Labrador Retriever.” Each adversarial example is created
ensuring that the saliency-map explanation of the original input, shown in (a)-right, is maintained. (e)–(h) show, for each of the four classes considered
(source class + 3 target classes), the k = 3 prototypes closest to the original input, in order to assess their similarity.
α_{m,c} = ∑_i ∑_j G^{i,j}_{m,c}    (2)

where G^{i,j}_{m,c} denotes the value at the i-th row and j-th column. The ReLU nonlinearity in (1) is applied to remove negative values, maintaining only the features with a positive influence on y_c.

4.2.2 | Example-Based Explanation

We will also consider an example-based explanation in which the k training images (which can be considered prototypes representing classes) that are closest to the input which has been classified are provided (Nguyen, Kim, and Nguyen 2021). The proximity between the inputs will be measured as the Euclidean distance of the l-dimensional latent representation r_f(x): ℝ^d → ℝ^l learned by the model f in the last layer, that is, the (flattened) activations of the last convolutional layer of the model. This representation captures complex semantic features of the inputs, thus providing a more appropriate representation space for meaningfully comparing input samples according to the features learned by the model. Let X^c_train represent the set of training inputs belonging to the class of interest y_c (e.g., the class predicted by the model). Given a model f and an input x, the explanation will be a set of k input samples P = {x^p_1, …, x^p_k | x^p_i ∈ X^c_train} that satisfies:

‖r_f(x̃) − r_f(x)‖₂ > ‖r_f(x^p_i) − r_f(x)‖₂ ,  ∀ x̃ ∈ X^c_train − P, ∀ x^p_i ∈ P    (3)

Note that the two selected methods allow, by definition, explanations to be computed for any class of interest y_c. However, we will consider as the main explanation the one corresponding to the predicted class f(x). Finally, we assume that the explanation methods and their parameters are fixed and known to the adversary. Since the focus of our experimentation is illustrative and not performance-based, analyzing the sensitivity of the explanation methods to hyperparameters (Bansal, Agarwal, and Nguyen 2020; Dombrowski et al. 2019) will be out of the scope of this section.

4.3 | Attack Method

We will assume a targeted attack for our experiments, in which the aim will be to create, given an input x, an adversarial example x′ such that:

f(x′) = y_t    (4)

A_f(x′) = m_t    (5)

‖x − x′‖ ≤ ε    (6)
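As a rough illustration of the two explanation methods of Section 4.2, the sketch below computes a Grad-CAM map as in (1)–(2) and retrieves the nearest prototypes as in (3) for a PyTorch-style convolutional classifier. The attribute name `model.features`, the hook-based implementation and the tensor shapes are assumptions made for this sketch, not the authors' code.

```python
# Hedged sketch of the explanation functions of Section 4.2 (PyTorch assumed).
import torch
import torch.nn.functional as F

def grad_cam(model, x, class_idx):
    """Saliency map S = ReLU(sum_m alpha_{m,c} * C_m), Eqs. (1)-(2)."""
    acts = []
    handle = model.features.register_forward_hook(lambda m, i, o: acts.append(o))
    scores = model(x)                        # x: (1, 3, H, W); scores: (1, n_classes)
    handle.remove()
    C = acts[0]                              # activation maps C_m, shape (1, M, h, w)
    G = torch.autograd.grad(scores[0, class_idx], C)[0]   # gradients G_{m,c} of the class score
    alpha = G.sum(dim=(2, 3), keepdim=True)                # per-map importance, Eq. (2)
    return F.relu((alpha * C).sum(dim=1))                  # Eq. (1), shape (1, h, w)

def prototype_explanation(r_f, x, train_inputs, k=3):
    """Indices of the k training inputs of the class of interest that are closest
    to x in the latent space r_f, i.e., the set P of Eq. (3)."""
    d = torch.cdist(r_f(x), r_f(train_inputs))   # Euclidean distances in the latent space
    return d.argsort(dim=1)[0, :k]
```

Here r_f would return the flattened activations of the last convolutional layer, as described in Section 4.2.2.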
FIGURE 9 | Adversarial example for the large scale visual recognition task, assuming a prototype-based explanation. (a) Original input belonging
to the class “Labrador Retriever” (left) and adversarial example targeting the class “Doberman” (right). (b) Prototype-based explanation of the
adversarial example and the class Doberman (i.e., the three training images belonging to the class “Doberman” that are closest to the adversarial
example). (c) Prototype-based explanation of the original input and the ground-truth class. (d) Prototype-based explanation of the original input and
the target class “Doberman.”
P_t = {x^p_1, x^p_2, …, x^p_k} of k training inputs (with the value of k fixed beforehand by the explanation method) selected by the adversary to be produced as explanations (i.e., the training inputs of class y_t that are closer to x should be those in the set P_t). We do not specify any particular order for the k target prototypes in P_t, that is, we assume that the relevance of each of the k prototypes in the explanation is the same.

We will use a targeted Projected Gradient Descent (PGD) attack (Madry et al. 2018) to generate the adversarial examples. This attack iteratively perturbs the input sample in the direction of the gradient of a loss function L (e.g., the cross-entropy) with respect to the input, sign(∇_{x′_i} L(x′_i, y_t)), with a step size α. At each step, the adversarial example is projected by a projection operator…

For the case of saliency-map explanations, we instantiated L_expl as the Euclidean distance between the model's explanation g(x) = S and the target saliency map S_t (specified by the adversary):

L_expl(x, S_t) = ‖g(x) − S_t‖₂    (9)

For the case of prototype-based explanations, L_expl will be the average Euclidean distance between the latent representation of the (adversarial) input and the latent representation of the k prototypes selected by the adversary as the target explanation P_t:

L_expl(x, P_t) = (1/k) ∑_{x^p ∈ P_t} ‖r(x) − r(x^p)‖₂    (10)
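A minimal sketch of a targeted PGD attack with an additional explanation term is shown below. How the classification loss and L_expl are combined (here a fixed weight lam) and the exact projection step are assumptions for illustration, since the authors' precise formulation is not reproduced in this excerpt; Eq. (9) is used as the explanation loss.

```python
# Illustrative targeted PGD with a joint classification + explanation objective.
import torch
import torch.nn.functional as F

def pgd_attack(model, explain, x, y_target, S_target,
               eps=8 / 255, step=1 / 255, n_steps=50, lam=1.0):
    """explain(x) must return a differentiable saliency map g(x) (e.g., Grad-CAM built
    with create_graph=True); y_target is a class-index tensor of shape (1,)."""
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss_cls = F.cross_entropy(model(x_adv), y_target)       # pushes f(x') toward y_t, Eq. (4)
        loss_expl = torch.norm(explain(x_adv) - S_target, p=2)   # Eq. (9): ||g(x') - S_t||_2
        grad = torch.autograd.grad(loss_cls + lam * loss_expl, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - step * grad.sign()                   # targeted attack: descend the loss
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)        # project onto the eps-ball, Eq. (6)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)                 # keep pixel values valid
    return x_adv.detach()
```

For prototype-based explanations, loss_expl would instead be the average latent distance to the target prototypes, as in Eq. (10).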
…has been optimized in order to reduce the distance (in the latent representation) between the input and the three training images (belonging to the target class) shown in Figure 9b. As can be seen, these training images not only contain features representative of the target class ("Doberman"), but also additional features that resemble those in the original input sample (indeed, a similar dog is present in the selected training images), exemplifying the attack paradigm A.3. Figure 9c shows the prototypes belonging to the source class that are the closest to the original image, and Figure 9d those prototypes closest to the original image yet belonging to the target class. Note that both Figure 9b,d contain prototypes belonging to the target class; however, those which are adversarially produced appear considerably more coherent due to their ambiguity (in the sense that they contain prototypical features of both the source and target class).

5 | Conclusions

…general and unifying attack algorithm capable of addressing all the attack paradigms described in our framework, that is, an approach capable of automatically generating adversarial examples which satisfy the most important requirements depending on the scenario, explanation method or attack paradigm to be produced. We plan to study the generation of such attacks in future works.

More generally, conceiving strategies to improve the reliability and robustness of explanation methods continues to be an urgent line of research, as limited research has so far been conducted on the adversarial robustness of different explanation methods, such as prototype-based approaches. Thus, a deeper analysis of the vulnerability of current explanation methods is an important step in order to increase the reliability and trustworthiness of explainable machine learning models. We hope our work serves as a foundation for future studies on defensive approaches to build upon, and to address these critical challenges in a structured and organized way.

…sive scenario, such as unskilled subjects, or partially skilled subjects…
References

Adadi, A., and M. Berrada. 2018. "Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)." IEEE Access 6: 52138–52160.
Agirre, E., and P. Edmonds. 2006. "Word Sense Disambiguation: Algorithms and Applications." In vol. 33 of Text, Speech and Language Technology. Springer.
Aivodji, U., H. Arai, O. Fortineau, S. Gambs, S. Hara, and A. Tapp. 2019. "Fairwashing: The Risk of Rationalization." In Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97 of Proceedings of Machine Learning Research, 161–170.
Aïvodji, U., H. Arai, S. Gambs, and S. Hara. 2021. "Characterizing the Risk of Fairwashing." In Advances in Neural Information Processing Systems, vol. 34, 14822–14834.
Al-masni, M. A., M. A. Al-antari, M.-T. Choi, S.-M. Han, and T.-S. Kim. 2018. "Skin Lesion Segmentation in Dermoscopy Images via Deep Full Resolution Convolutional Networks." Computer Methods and Programs in Biomedicine 162: 221–231.
Alvarez-Melis, D., and T. Jaakkola. 2018a. "Towards Robust Interpretability With Self-Explaining Neural Networks." In Advances in Neural Information Processing Systems, vol. 31, 7775–7784. Red Hook, NY: Curran Associates Inc.
Alvarez-Melis, D., and T. S. Jaakkola. 2018b. "On the Robustness of Interpretability Methods." In Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), 66–71.
Alzantot, M., Y. Sharma, S. Chakraborty, H. Zhang, C.-J. Hsieh, and M. B. Srivastava. 2019. "GenAttack: Practical Black-Box Attacks With Gradient-Free Optimization." In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), GECCO'19 (Association for Computing Machinery), 1111–1119.
Anders, C., P. Pasliev, A.-K. Dombrowski, K.-R. Müller, and P. Kessel. 2020. "Fairwashing Explanations With off-Manifold Detergent." In Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119, 314–323.
Bansal, N., C. Agarwal, and A. Nguyen. 2020. "SAM: The Sensitivity of Attribution Methods to Hyperparameters." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE Computer Society), 8670–8680.
Beck, C., H. Booth, M. El-Assady, and M. Butt. 2020. "Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias." In Proceedings of the 14th Linguistic Annotation Workshop, 60–73.
Belinkov, Y., and Y. Bisk. 2018. "Synthetic and Natural Noise Both Break Neural Machine Translation." In International Conference on Learning Representations (ICLR).
Blesch, K., M. N. Wright, and D. Watson. 2023. "Unfooling SHAP and SAGE: Knockoff Imputation for Shapley Values." In Explainable Artificial Intelligence, 131–146. Cham, Switzerland: Springer Nature Switzerland.
Boopathy, A., S. Liu, G. Zhang, et al. 2020. "Proper Network Interpretability Helps Adversarial Robustness in Classification." In Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119 of Proceedings of Machine Learning Research (PMLR), 1014–1023.
Borkar, J., and P.-Y. Chen. 2021. "Simple Transparent Adversarial Examples." In ICLR 2021 Workshop on Security and Safety in Machine Learning Systems.
Bortsova, G., C. González-Gonzalo, S. C. Wetstein, et al. 2021. "Adversarial Attack Vulnerability of Medical Image Analysis Systems: Unexplored Factors." Medical Image Analysis 73: 102141.
Brendel, W., J. Rauber, and M. Bethge. 2018. "Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models." In International Conference on Learning Representations (ICLR).
Bussone, A., S. Stumpf, and D. O'Sullivan. 2015. "The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems." In Proceedings of the 2015 International Conference on Healthcare Informatics (ICHI), 160–169.
Carlini, N., and D. Wagner. 2017. "Towards Evaluating the Robustness of Neural Networks." In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), 39–57.
Carmichael, Z., and W. J. Scheirer. 2023. "Unfooling Perturbation-Based Post Hoc Explainers." In Proceedings of the 37th AAAI Conference on Artificial Intelligence, vol. 37 of AAAI'23/IAAI'23/EAAI'23 (AAAI Press), 6925–6934.
Cartella, F., O. Anunciação, Y. Funabiki, D. Yamaguchi, T. Akishita, and O. Elshocht. 2021. "Adversarial Attacks for Tabular Data: Application to Fraud Detection and Imbalanced Data." In Proceedings of the 2021 AAAI Workshop on Artificial Intelligence Safety (SafeAI).
Chen, C., O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su. 2019. "This Looks Like That: Deep Learning for Interpretable Image Recognition." In Advances in Neural Information Processing Systems, vol. 32, 8930–8941. Red Hook, NY: Curran Associates Inc.
Chen, J., X. Wu, V. Rastogi, Y. Liang, and S. Jha. 2019. "Robust Attribution Regularization." In Advances in Neural Information Processing Systems, vol. 32, 14300–14310. Red Hook, NY: Curran Associates Inc.
Chen, P.-Y., H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. 2017. "ZOO: Zeroth Order Optimization Based Black-Box Attacks to Deep Neural Networks Without Training Substitute Models." In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec) (Association for Computing Machinery), 15–26.
Chen, Z., Y. Bei, and C. Rudin. 2020. "Concept Whitening for Interpretable Image Recognition." Nature Machine Intelligence 2, no. 12: 772–782.
Cheng, M., J. Yi, P.-Y. Chen, H. Zhang, and C.-J. Hsieh. 2020. "Seq2Sick: Evaluating the Robustness of Sequence-To-Sequence Models With Adversarial Examples." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4: 3601–3608.
Cheng, Y., L. Jiang, and W. Macherey. 2019. "Robust Neural Machine Translation With Doubly Adversarial Inputs." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics), 4324–4333.
Cisse, M. M., Y. Adi, N. Neverova, and J. Keshet. 2017. "Houdini: Fooling Deep Structured Visual and Speech Recognition Models With Adversarial Examples." In Advances in Neural Information Processing Systems, vol. 30, 6977–6987. Red Hook, NY: Curran Associates Inc.
Deng, E., Z. Qin, M. Li, Y. Ding, and Z. Qin. 2021. "Attacking the Dialogue System at Smart Home." In Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (Springer International Publishing), 148–158.
Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. "ImageNet: A Large-Scale Hierarchical Image Database." In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.
Dimanov, B., U. Bhatt, M. Jamnik, and A. Weller. 2020. "You Shouldn't Trust Me: Learning Models Which Conceal Unfairness From Multiple Explanation Methods." In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI), vol. 97, 2473–2480.
Dombrowski, A.-K., M. Alber, C. Anders, M. Ackermann, K.-R. Müller, and P. Kessel. 2019. "Explanations Can Be Manipulated and Geometry Is to Blame." In Advances in Neural Information Processing Systems, vol. 32, 13589–13600. Red Hook, NY: Curran Associates Inc.
Doshi-Velez, F., and B. Kim. 2018. "Considerations for Evaluation and Generalization in Interpretable Machine Learning." In Explainable and Interpretable Models in Computer Vision and Machine Learning, the Springer Series on Challenges in Machine Learning, 3–17.
Ebrahimi, J., D. Lowd, and D. Dou. 2018. "On Adversarial Examples for Character-Level Neural Machine Translation." In Proceedings of the 27th International Conference on Computational Linguistics (COLING) (Association for Computational Linguistics), 653–663.
Elliott, A., S. Law, and C. Russell. 2021. "Explaining Classifiers Using Adversarial Perturbations on the Perceptual Ball." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10693–10702.
Etmann, C., S. Lunz, P. Maass, and C. Schoenlieb. 2019. "On the Connection Between Adversarial Robustness and Saliency Map Interpretability." In Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97 of Proceedings of Machine Learning Research, 1823–1832.
Eykholt, K., I. Evtimov, E. Fernandes, et al. 2018. "Robust Physical-World Attacks on Deep Learning Visual Classification." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1625–1634.
Finlayson, S. G., J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, and I. S. Kohane. 2019. "Adversarial Attacks on Medical Machine Learning." Science 363, no. 6433: 1287–1289.
Fischer, V., M. C. Kumar, J. H. Metzen, and T. Brox. 2017. "Adversarial Examples for Semantic Image Segmentation." In Workshop of the 2017 International Conference on Learning Representations (ICLR).
Fujiyoshi, H., T. Hirakawa, and T. Yamashita. 2019. "Deep Learning-Based Image Recognition for Autonomous Driving." IATSS Research 43, no. 4: 244–252.
Fursov, I., M. Morozov, N. Kaploukhaya, et al. 2021. "Adversarial Attacks on Deep Models for Financial Transaction Records." In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD'21 (Association for Computing Machinery), 2868–2878.
Gautam, S., A. Boubekki, S. Hansen, et al. 2022. "ProtoVAE: A Trustworthy Self-Explainable Prototypical Variational Model." In Advances in Neural Information Processing Systems, vol. 35, 17940–17952. Red Hook, NY: Curran Associates Inc.
Ghai, B., Q. V. Liao, Y. Zhang, R. Bellamy, and K. Mueller. 2021. "Explainable Active Learning (XAL): Toward AI Explanations as Interfaces for Machine Teachers." Proceedings of the ACM on Human-Computer Interaction 4, no. CSCW3: 235:1–235:28.
Ghalebikesabi, S., L. Ter-Minassian, K. DiazOrdaz, and C. C. Holmes. 2021. "On Locality of Local Explanation Models." In Advances in Neural Information Processing Systems, vol. 34, 18395–18407. Red Hook, NY: Curran Associates, Inc.
Ghassemi, M., L. Oakden-Rayner, and A. L. Beam. 2021. "The False Hope of Current Approaches to Explainable Artificial Intelligence in Health Care." Lancet Digital Health 3, no. 11: e745–e750.
Ghorbani, A., A. Abid, and J. Zou. 2019. "Interpretation of Neural Networks Is Fragile." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 3681–3688.
Ghorbani, A., J. Wexler, J. Y. Zou, and B. Kim. 2019. "Towards Automatic Concept-Based Explanations." In Advances in Neural Information Processing Systems, vol. 32, 9277–9286. Red Hook, NY: Curran Associates Inc.
Gilpin, L. H., D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. 2018. "Explaining Explanations: An Overview of Interpretability of Machine Learning." In Proceedings of the IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 80–89.
Goddard, K., A. Roudsari, and J. C. Wyatt. 2012. "Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators." Journal of the American Medical Informatics Association 19, no. 1: 121–127.
Goodfellow, I., J. Shlens, and C. Szegedy. 2015. "Explaining and Harnessing Adversarial Examples." In International Conference on Learning Representations (ICLR).
Li, O., H. Liu, C. Chen, and C. Rudin. 2018. “Deep Learning for Case-Based Reasoning Through Prototypes: A Neural Network That Explains Its Predictions.” Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1: 3530–3537.
Li, X., and D. Zhu. 2020. “Robust Detection of Adversarial Attacks on Medical Images.” In Proceedings of the 17th IEEE International Symposium on Biomedical Imaging (ISBI), 1154–1158.
Li, Y., H. Zhang, C. Bermudez, Y. Chen, B. A. Landman, and Y. Vorobeychik. 2020. “Anatomical Context Protects Deep Learning From Adversarial Perturbations in Medical Imaging.” Neurocomputing 379: 370–378.
Lin, Y.-C., Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun. 2017. “Tactics of Adversarial Attack on Deep Reinforcement Learning Agents.” In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI) (AAAI Press), 3756–3762.
Lipton, Z. C. 2018. “The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability Is Both Important and Slippery.” Queue 16, no. 3: 31–57.
Liu, N., M. Du, R. Guo, H. Liu, and X. Hu. 2021. “Adversarial Attacks and Defenses: An Interpretation Perspective.” ACM SIGKDD Explorations Newsletter 23, no. 1: 86–99.
Liu, N., H. Yang, and X. Hu. 2018. “Adversarial Detection With Model Interpretation.” In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD'18 (Association for Computing Machinery), 1803–1811.
Liu, Y., X. Chen, C. Liu, and D. Song. 2017. “Delving Into Transferable Adversarial Examples and Black-Box Attacks.” In International Conference on Learning Representations (ICLR).
Lu, X., A. Tolmachev, T. Yamamoto, et al. 2021. “Crowdsourcing Evaluation of Saliency-Based XAI Methods.” In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track (Springer International Publishing), 431–446.
Lundberg, S. M., and S.-I. Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems, vol. 30, 4765–4774.
Ma, X., Y. Niu, L. Gu, et al. 2021. “Understanding Adversarial Attacks on Deep Learning Based Medical Image Analysis Systems.” Pattern Recognition 110: 107332.
Madry, A., A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. 2018. “Towards Deep Learning Models Resistant to Adversarial Attacks.” In International Conference on Learning Representations (ICLR).
Mahdavifar, S., and A. A. Ghorbani. 2019. “Application of Deep Learning to Cybersecurity: A Survey.” Neurocomputing 347: 149–176.
Mathov, Y., E. Levy, Z. Katzir, A. Shabtai, and Y. Elovici. 2022. “Not All Datasets Are Born Equal: On Heterogeneous Tabular Data and Adversarial Examples.” Knowledge-Based Systems 242: 108377.
Metzen, J. H., M. C. Kumar, T. Brox, and V. Fischer. 2017. “Universal Adversarial Perturbations Against Semantic Image Segmentation.” In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV) (IEEE), 2774–2783.
Michel, P., X. Li, G. Neubig, and J. Pino. 2019. “On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (Association for Computational Linguistics), vol. 1, 3103–3114.
Mode, G. R., and K. A. Hoque. 2020. “Adversarial Examples in Deep Learning for Multivariate Time Series Regression.” In 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), 1–10.
Moore, J., N. Hammerla, and C. Watkins. 2019. “Explaining Deep Learning Models With Constrained Adversarial Examples.” In PRICAI 2019: Trends in Artificial Intelligence, Lecture Notes in Computer Science, 43–56. Cham, Switzerland: Springer International Publishing.
Moosavi-Dezfooli, S.-M., A. Fawzi, O. Fawzi, and P. Frossard. 2017. “Universal Adversarial Perturbations.” In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 86–94.
Moosavi-Dezfooli, S.-M., A. Fawzi, and P. Frossard. 2016. “DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks.” In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2574–2582.
Mopuri, K. R., A. Ganeshan, and R. V. Babu. 2019. “Generalizable Data-Free Objective for Crafting Universal Adversarial Perturbations.” IEEE Transactions on Pattern Analysis and Machine Intelligence 41, no. 10: 2452–2465.
Mopuri, K. R., U. Garg, and R. V. Babu. 2017. “Fast Feature Fool: A Data Independent Approach to Universal Adversarial Perturbations.” In Proceedings of the British Machine Vision Conference 2017 (BMVC) (BMVA Press), 30.1–30.12.
Mopuri, K. R., P. K. Uppala, and R. V. Babu. 2018. “Ask, Acquire, and Attack: Data-Free UAP Generation Using Class Impressions.” In Proceedings of the European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science (Springer International Publishing), 20–35.
Morch, N., U. Kjems, L. Hansen, et al. 1995. “Visualization of Neural Networks Using Saliency Maps.” In Proceedings of the International Conference on Neural Networks (ICNN), vol. 4, 2085–2090.
Mori, K., H. Fukui, T. Murase, T. Hirakawa, T. Yamashita, and H. Fujiyoshi. 2019. “Visual Explanation by Attention Branch Network for End-to-End Learning-Based Self-Driving.” In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), 1577–1582.
Nauta, M., J. Trienes, S. Pathak, et al. 2023. “From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI.” ACM Computing Surveys 55, no. 13s: 295:1–42.
Nauta, M., R. Van Bree, and C. Seifert. 2021. “Neural Prototype Trees for Interpretable Fine-Grained Image Recognition.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE), 14933–14943.
Nguyen, G., D. Kim, and A. Nguyen. 2021. “The Effectiveness of Feature Attribution Methods and Its Correlation With Automatic Evaluation Scores.” In Advances in Neural Information Processing Systems, vol. 34, 26422–26436. Red Hook, NY: Curran Associates Inc.
Noack, A., I. Ahern, D. Dou, and B. Li. 2021. “An Empirical Study on the Relation Between Network Interpretability and Adversarial Robustness.” SN Computer Science 2, no. 1. Accessed May 1, 2022. [Link]
Noppel, M., L. Peter, and C. Wressnegger. 2023. “Disguising Attacks With Explanation-Aware Backdoors.” In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), 664–681.
Papernot, N., P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. 2017. “Practical Black-Box Attacks Against Machine Learning.” In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS'17 (Association for Computing Machinery), 506–519.
Paschali, M., S. Conjeti, F. Navarro, and N. Navab. 2018. “Generalizability vs. Robustness: Investigating Medical Imaging Networks Using Adversarial Examples.” In Proceedings of the 2018 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), vol. 11070 of Lecture Notes in Computer Science (Springer International Publishing), 493–501.
Paul, R., M. Schabath, R. Gillies, L. Hall, and D. Goldgof. 2020. “Mitigating Adversarial Attacks on Medical Image Understanding Systems.” In Proceedings of the 17th IEEE International Symposium on Biomedical Imaging (ISBI), 1517–1521.
Pawelczyk, M., T. Datta, J. van-den-Heuvel, G. Kasneci, and H. Lakkaraju. 2023. “Probabilistically Robust Recourse: Navigating the Trade-Offs Between Costs and Robustness in Algorithmic Recourse.” In International Conference on Learning Representations (ICLR).
Ras, G., M. van Gerven, and P. Haselager. 2018. “Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges.” In Explainable and Interpretable Models in Computer Vision and Machine Learning, the Springer Series on Challenges in Machine Learning, 19–36.
Recaido, C., and B. Kovalerchuk. 2023. “Visual Explainable Machine Learning for High-Stakes Decision-Making With Worst Case Estimates.” Data Analysis and Optimization 202: 291–329.
Renard, X., T. Laugel, M.-J. Lesot, C. Marsala, and M. Detyniecki. 2019. “Detecting Potential Local Adversarial Examples for Human-Interpretable Defense.” In Proceedings of the 2018 ECML PKDD Workshop on Recent Advances in Adversarial Machine Learning, Lecture Notes in Computer Science (Springer International Publishing), 41–47.
Ribeiro, M. T., S. Singh, and C. Guestrin. 2016. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16 (Association for Computing Machinery), 1135–1144.
Ribeiro, M. T., S. Singh, and C. Guestrin. 2018. “Anchors: High-Precision Model-Agnostic Explanations.” In Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'18/IAAI'18/EAAI'18 (AAAI Press), 1527–1535.
Rieger, L., and L. K. Hansen. 2020. “A Simple Defense Against Adversarial Attacks on Heatmap Explanations.” In ICML Workshop on Human Interpretability in Machine Learning (WHI).
Robnik-Šikonja, M., and I. Kononenko. 2008. “Explaining Classifications for Individual Instances.” IEEE Transactions on Knowledge and Data Engineering 20, no. 5: 589–600.
Ros, A. S., and F. Doshi-Velez. 2018. “Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing Their Input Gradients.” Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1: 1660–1669.
Ross, A. S., M. C. Hughes, and F. Doshi-Velez. 2017. “Right for the Right Reasons: Training Differentiable Models by Constraining Their Explanations.” In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2662–2670.
Rudin, C. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1, no. 5: 206–215.
Selvaraju, R. R., M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. “Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization.” In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 618–626.
Serradilla, O., E. Zugasti, J. Rodriguez, and U. Zurutuza. 2022. “Deep Learning Models for Predictive Maintenance: A Survey, Comparison, Challenges and Prospects.” Applied Intelligence 52: 10934–10964.
Sharif, M., S. Bhagavatula, L. Bauer, and M. K. Reiter. 2016. “Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition.” In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS'16 (Association for Computing Machinery), 1528–1540.
Silla, C. N., and A. A. Freitas. 2011. “A Survey of Hierarchical Classification Across Different Application Domains.” Data Mining and Knowledge Discovery 22, no. 1: 31–72.
Simonyan, K., A. Vedaldi, and A. Zisserman. 2014. “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.” In Workshop of the 2014 International Conference on Learning Representations (ICLR).
Sinha, S., H. Chen, A. Sekhon, Y. Ji, and Y. Qi. 2021. “Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing.” In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (Association for Computational Linguistics), 420–434.
Slack, D., A. Hilgard, H. Lakkaraju, and S. Singh. 2021. “Counterfactual Explanations Can Be Manipulated.” In Advances in Neural Information Processing Systems, vol. 34, 62–75.
Slack, D., S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju. 2020. “Fooling LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods.” In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 180–186.
Sokol, K., and P. Flach. 2019. “Counterfactual Explanations of Machine Learning Predictions: Opportunities and Challenges for AI Safety.” In Proceedings of the 2019 AAAI Workshop on Artificial Intelligence Safety (SafeAI), 95–99.
Sotgiu, A., M. Pintor, and B. Biggio. 2022. “Explainability-Based Debugging of Machine Learning for Vulnerability Discovery.” In Proceedings of the 17th International Conference on Availability, Reliability and Security, ARES'22 (Association for Computing Machinery), 1–8.
Springenberg, J. T., A. Dosovitskiy, T. Brox, and M. Riedmiller. 2015. “Striving for Simplicity: The All Convolutional Net.” In Workshop of the 2015 International Conference on Learning Representations (ICLR).
Stiglic, G., P. Kocbek, N. Fijacko, M. Zitnik, K. Verbert, and L. Cilar. 2020. “Interpretability of Machine Learning-Based Prediction Models in Healthcare.” WIREs Data Mining and Knowledge Discovery 10, no. 5: e1379.
Stock, P., and M. Cisse. 2018. “ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases.” In Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol. 11210, 504–519. Cham, Switzerland: Springer International Publishing.
Štrumbelj, E., and I. Kononenko. 2010. “An Efficient Explanation of Individual Classifications Using Game Theory.” Journal of Machine Learning Research 11, no. 1: 1–18.
Subramanya, A., V. Pillai, and H. Pirsiavash. 2019. “Fooling Network Interpretation in Image Classification.” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2020–2029.
Sun, S., B. Song, X. Cai, X. Du, and M. Guizani. 2022. “CAMA: Class Activation Mapping Disruptive Attack for Deep Neural Networks.” Neurocomputing 500: 989–1002.
Sundararajan, M., A. Taly, and Q. Yan. 2017. “Axiomatic Attribution for Deep Networks.” In Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70, 3319–3328.
Szegedy, C., W. Zaremba, I. Sutskever, et al. 2014. “Intriguing Properties of Neural Networks.” In International Conference on Learning Representations (ICLR).
Tabacof, P., J. Tavares, and E. Valle. 2016. “Adversarial Images for Variational Autoencoders.” In NIPS 2016 Workshop on Adversarial Training.
Tamam, S. V., R. Lapid, and M. Sipper. 2023. “Foiling Explanations in Deep Neural Networks.” Transactions on Machine Learning Research: 1–32. Accessed May 5, 2024. https://openreview.net/forum?id=wvLQMHtyLk.
Tang, R., N. Liu, F. Yang, N. Zou, and X. Hu. 2022. “Defense Against Explanation Manipulation.” Frontiers in Big Data 5: 704203.
Tao, G., S. Ma, Y. Liu, and X. Zhang. 2018. “Attacks Meet Interpretability: Attribute-Steered Detection of Adversarial Samples.” In Advances in Neural Information Processing Systems, vol. 31, 7717–7728. Red Hook, NY: Curran Associates Inc.
Teso, S., Ö. Alkan, W. Stammer, and E. Daly. 2023. “Leveraging Explanations in Interactive Machine Learning: An Overview.” Frontiers in Artificial Intelligence 6: 1066049.
Teso, S., and K. Kersting. 2019. “Explanatory Interactive Machine Learning.” In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES), 239–245.
Thys, S., W. V. Ranst, and T. Goedemé. 2019. “Fooling Automated Surveillance Cameras: Adversarial Patches to Attack Person Detection.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 49–55.
Tsipras, D., S. Santurkar, L. Engstrom, A. Ilyas, and A. Madry. 2020. “From ImageNet to Image Classification: Contextualizing Progress on Benchmarks.” In Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119 of Proceedings of Machine Learning Research, 9625–9635.
Tsipras, D., S. Santurkar, L. Engstrom, A. Turner, and A. Madry. 2019. “Robustness May Be at Odds With Accuracy.” In International Conference on Learning Representations (ICLR).
Ustun, B., A. Spangher, and Y. Liu. 2019. “Actionable Recourse in Linear Classification.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT*'19 (Association for Computing Machinery), 10–19.
van der Waa, J., E. Nieuwburg, A. Cremers, and M. Neerincx. 2021. “Evaluating XAI: A Comparison of Rule-Based and Example-Based Explanations.” Artificial Intelligence 291: 103404.
Viganò, L., and D. Magazzeni. 2020. “Explainable Security.” In Proceedings of the 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), 293–300.
Virgolin, M., and S. Fracaros. 2023. “On the Robustness of Sparse Counterfactual Explanations to Adverse Perturbations.” Artificial Intelligence 316: 103840.
Vreš, D., and M. Robnik-Šikonja. 2022. “Preventing Deception With Explanation Methods Using Focused Sampling.” Data Mining and Knowledge Discovery.
Wachter, S., B. Mittelstadt, and C. Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” Harvard Journal of Law & Technology 31, no. 2: 842–887.
Wang, H., G. Wang, Y. Li, D. Zhang, and L. Lin. 2020a. “Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-Identification With Deep Mis-Ranking.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 342–351.
Wang, J., J. Tuyls, E. Wallace, and S. Singh. 2020b. “Gradient-Based Analysis of NLP Models Is Manipulable.” In Findings of the Association for Computational Linguistics: EMNLP 2020 (Association for Computational Linguistics), 247–258.
Wang, J., Y. Wu, M. Li, X. Lin, J. Wu, and C. Li. 2020c. “Interpretability Is a Kind of Safety: An Interpreter-Based Ensemble for Adversary Defense.” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD'20 (Association for Computing Machinery), 15–24.
Wang, L., Z. Q. Lin, and A. Wong. 2020d. “COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases From Chest X-Ray Images.” Scientific Reports 10, no. 1: 19549.
Wang, Z., H. Wang, S. Ramkumar, P. Mardziel, M. Fredrikson, and A. Datta. 2020e. “Smoothed Geometry for Robust Attribution.” In Advances in Neural Information Processing Systems, vol. 33, 13623–13634. Red Hook, NY: Curran Associates Inc.
Xie, C., J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. 2017. “Adversarial Examples for Semantic Segmentation and Object Detection.” In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 1378–1387.
Xu, K., G. Zhang, S. Liu, et al. 2020. “Adversarial T-Shirt! Evading Person Detectors in a Physical World.” In Proceedings of the 2020 European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science (Springer International Publishing), 665–681.
Xue, M., C. Yuan, J. Wang, W. Liu, and P. Nicopolitidis. 2020. “DPAEG: A Dependency Parse-Based Adversarial Examples Generation Method for Intelligent Q&A Robots.” Security and Communication Networks 2020.
Yang, P., J. Chen, C.-J. Hsieh, J.-L. Wang, and M. Jordan. 2020. “ML-LOO: Detecting Adversarial Examples With Feature Attribution.” Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4: 6639–6647.
Yeh, C.-K., J. Kim, I. E.-H. Yen, and P. K. Ravikumar. 2018. “Representer Point Selection for Explaining Deep Neural Networks.” In Advances in Neural Information Processing Systems, vol. 31, 9291–9301. Red Hook, NY: Curran Associates Inc.
Yoo, T. K., and J. Y. Choi. 2020. “Outcomes of Adversarial Attacks on Deep Learning Models for Ophthalmology Imaging Domains.” JAMA Ophthalmology 138, no. 11: 1213–1215.
Yosinski, J., J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. 2015. “Understanding Neural Networks Through Deep Visualization.” In 2015 ICML Workshop on Deep Learning.