
XEQ Scale for Evaluating XAI Experience Quality Grounded in Psychometric Theory


Anjana Wijekoon a, Nirmalie Wiratunga a, David Corsar a, Kyle Martin a,*, Ikechukwu Nkisi-Orji a, Belen Díaz-Agudo b and Derek Bridge c

a Robert Gordon University, Aberdeen, Scotland
b Universidad Complutense de Madrid, Spain
c University College Cork, Ireland

ORCID (Anjana Wijekoon): https://orcid.org/0000-0003-3848-3100, ORCID (Nirmalie Wiratunga): https://orcid.org/0000-0003-4040-2496, ORCID (David Corsar): https://orcid.org/0000-0001-7059-4594, ORCID (Kyle Martin): https://orcid.org/0000-0003-0941-3111, ORCID (Ikechukwu Nkisi-Orji): https://orcid.org/0000-0001-9734-9978, ORCID (Belen Díaz-Agudo): https://orcid.org/0000-0003-2818-027X, ORCID (Derek Bridge): https://orcid.org/0000-0002-8720-3876

* Corresponding Author. Email: [email protected]
arXiv:2407.10662v1 [cs.AI] 15 Jul 2024

Abstract. Explainable Artificial Intelligence (XAI) aims to improve the transparency of autonomous decision-making through explanations. Recent literature has emphasised users' need for holistic "multi-shot" explanations and the ability to personalise their engagement with XAI systems. We refer to this user-centred interaction as an XAI Experience. Despite advances in creating XAI experiences, evaluating them in a user-centred manner has remained challenging. To address this, we introduce the XAI Experience Quality (XEQ) Scale (pronounced "Seek" Scale) for evaluating the user-centred quality of XAI experiences. Furthermore, XEQ quantifies the quality of experiences across four evaluation dimensions: learning, utility, fulfilment and engagement. These contributions extend the state-of-the-art of XAI evaluation, moving beyond the one-dimensional metrics frequently developed to assess single-shot explanations. In this paper, we present the XEQ scale development and validation process, including content validation with XAI experts as well as discriminant and construct validation through a large-scale pilot study. Our pilot study results offer strong evidence that establishes the XEQ Scale as a comprehensive framework for evaluating user-centred XAI experiences.

1 Introduction

Explainable Artificial Intelligence (XAI) describes a range of techniques to elucidate autonomous decision-making and the data that informed that AI system [19, 12, 2]. Each technique typically provides explanations that focus on a specific aspect of the system and its decisions. Accordingly, the utility of employing multiple techniques for a holistic explanation of a system becomes increasingly clear [2, 27]. The collection of explanations, provided by different techniques and describing different components of the system, forms what we describe as "multi-shot" explanations. Previous work has demonstrated the effectiveness of delivering multi-shot explanations using graphical user interfaces [2] and conversation [18, 27].

While the utility of user-centred interactive explanations is evident in recent literature, the evaluation of these interactions remains a key research challenge. Current works primarily target the development of objective metrics for single-shot techniques [24, 28], emphasising the need for reproducible benchmarks on public datasets [16]. Such metrics are system-centred and model-agnostic, giving the advantage of generalisability. However, objective metrics fail to acknowledge the requirements of different stakeholder groups. A satisfactory explanation is reliant on the recipient's expertise within that domain and of AI in general [20]. Subjective metrics, such as those described in [10, 11], allow an evaluation which is personalised to the individual and domain. However, existing subjective evaluations lack the capacity to measure the interactive process that underpins multi-shot explanations and how they impact user experience.

We address this challenge by introducing the XAI Experience Quality (XEQ) Scale (pronounced: "seek" Scale). We define an XAI Experience as the user-centred process of a stakeholder interacting with an XAI system to gain knowledge and/or improve comprehension. XAI Experience Quality (XEQ) is defined as the extent to which a stakeholder's explanation needs are satisfied by their XAI Experience. A glossary of all related terminology used throughout this paper is included in Table 1. Specifically, we ask the research question: "How to evaluate an XAI experience, in contrast to assessing single-shot (non-interactive) explanations?". To address this, we follow a formal psychometric scale development process [3] and outline the following objectives:

1. conduct a literature review to compile a collection of XAI evaluation questionnaire items;
2. conduct a content validity study with XAI experts to develop the XEQ scale; and
3. perform a pilot study to refine and validate the XEQ scale for internal consistency, construct and discriminant validity.

The rest of this paper expands on each objective. We discuss related work in Section 2. Section 3 presents key previous publications and the creation of the initial items bank. The Content Validity study details and results are presented in Section 4, followed by Section 5 presenting pilot study details for the refinement and validation of the XEQ Scale. Key implications of the XEQ Scale are discussed in Section 6. Finally, we offer conclusions in Section 7.

Table 1. Glossary
XAI System: An automated decision-making system that is designed and developed to provide information about its reasoning.
Stakeholder: An individual or group with a vested interest in the XAI system. Stakeholders are a diverse group, encompassing system designers and developers, who hold an interest in the system's technical functionality, the end consumers relying on its decisions, and regulatory authorities responsible for ensuring fair and ethical use.
XAI Experience (XE): A user-centred process of a stakeholder interacting with an XAI system to gain knowledge and/or improve comprehension.
XE Quality (XEQ): The extent to which a stakeholder's explanation needs are satisfied by their XE.
2 Related Work

In the literature, there are several methodologies for developing evaluation metrics or instruments for user-centred XAI.

Hoffman et al. [9] employed Psychometric Theory to construct the Satisfaction Scale, evaluating both content validity and discriminant validity. A similar methodology was adopted in [17] to develop the Madsen-Gregor human-machine trust scale, relying on pre-existing item lists from previous scales in conjunction with expert insights [21]. Jian et al. [13] pursued a factor analysis approach involving non-expert users to formulate a human-machine trust scale. They compiled words and phrases associated with trust and its variants, organising them based on their relevance to trust and distrust, which were then clustered to identify underlying factors and formulate corresponding statements. This methodology is particularly suitable in cases where no prior items exist for initial compilation. While these methodologies are robust and produce reliable scales, they are resource- and knowledge-intensive processes.

A more frequent approach to scale development is to derive items from existing scales in psychology research. For instance, the System Causability Scale [11] draws inspiration from the widely used System Usability Scale [4], while the Hoffman Curiosity Checklist originates from scales designed to assess human curiosity [9]. Similarly, the Cahour-Forzy Trust Scale [5] selected questions from research on human trust, and the Hoffman Trust Scale incorporates items from previous trust scales [5, 13]. Notably, these derived scales were not evaluated for reliability or other desirable factors; they rely on the quality of the original scales for validity. In this paper, we opt for the psychometric theory approach to establish the content, construct and discriminant validity of the resulting scale. While this approach is resource-intensive, the complexity and the novelty of the evaluation task necessitate a rigorous approach to scale development.

3 Literature Review and Initial Items Bank Compilation

This section presents the literature review findings that led to the compilation of the initial items bank for the XEQ Scale.

3.1 Methodology

To scope the existing work and form the initial item bank, we conducted a targeted literature review in the domain of XAI evaluation metrics. The reasoning for a targeted review instead of a systematic review is twofold: 1) the purpose of the review is to form the initial item bank, which involves in-depth analysis of selected literature (depth over breadth); and 2) literature under this topic is significantly limited. The initial findings highlighted that while many publications discuss and emphasise the importance of evaluation dimensions (what should be or is evaluated), only a few actually propose and systematically develop metrics for XAI evaluation.

3.2 Findings: Evaluation Dimensions and Metrics

Hoffman et al. [9] are among the leading contributors, and their work has been widely utilised in user-centred XAI research. They conceptually modelled the "process of explaining in XAI", outlining dimensions and metrics for evaluating single-shot explanations from stakeholders' perspectives. They considered six evaluation dimensions: goodness, satisfaction, mental model, curiosity, trust and performance. For each dimension, they either systematically developed an evaluation metric or critiqued metrics available in the literature, offering a comprehensive evaluation methodology for XAI practitioners. The System Causability Scale [11] is the other most prominent work in XAI evaluation. We discuss each scale briefly below.

Hoffman's Goodness Checklist is utilised to objectively evaluate explanations with an independent XAI expert to improve their "goodness". It consists of 7 items answered by selecting either 'yes' or 'no'. It was developed by referring to literature that proposes "goodness" properties of explanations.

The Hoffman Satisfaction Scale was designed using psychometric theory to evaluate the subjective "goodness" of explanations with stakeholders. It consists of 8 items answered on a 5-step Likert scale. It is viewed as the user-centred variant of the Goodness Checklist, with many shared items. Unlike the checklist, however, it has been evaluated for content validity with XAI experts as well as construct and discriminant validity in pilot studies.

The Hoffman Curiosity Checklist is designed to elicit stakeholder explanation needs, i.e. which aspects of the system pique their curiosity. This metric consists of one question, "Why have you asked for an explanation? Check all that apply.", and the responses inform the design and implementation of the XAI system.

The Hoffman Trust Scale measures the development of trust when exploring a system's explainability. The authors derived this trust scale by considering the overlaps and cross-use of items across trust scales in the literature for measuring trust in autonomous systems (not in the presence of explainability, e.g. trust between a human and a robot) [13, 1, 25, 5].

The System Causability Scale measures the effectiveness, efficiency and satisfaction of the explainability process in systems involving multi-shot explanations [11]. Derived from the widely-used System Usability Scale [4], this scale comprises 10 items rated on a 5-step Likert scale. Notably, it includes items that measure stakeholder engagement, addressing a gap in previous scales designed for one-shot explainability settings. However, the validation of the scale is limited to one small-scale pilot study in the medical domain.

3.2.1 Other Dimensions

Many other publications have emphasised the need for user-centred XAI evaluations and explored evaluation dimensions. Two other dimensions considered in [9] are mental model and performance concerning task completion. Hoffman et al. recommended eliciting the mental model of stakeholders in think-aloud problem-solving and question-answering sessions. Performance is measured by observing the change in productivity and the change in system usage.
pose and systematically develop metrics for XAI evaluation. question-answering sessions. Performance is measured by observing
the change in productivity and change in system usage. The eval- Engagement: the quality of the interaction between the user and the
uation of these dimensions requires metrics beyond questionnaire- XAI system.
based techniques. Another domain-agnostic survey finds many over-
laps with Hoffman et al., defining 4 user-centred evaluation dimen- In the next sections, we describe the development, refinement and
sions: mental model, usefulness and satisfaction, trust and reliance validation of the XEQ Scale following Psychometric Theory [23].
and human-task performance [20]. Zhou et al., [29] summarise pre-
vious literature, emphasising three subjective dimensions - trust, con- 4 XEQ Scale Development
fidence and preference that overlap with dimensions identified in [9].
Conversely to Hoffman et al., they consider task completion to be an This section presents the details and results of an expert user study
objective dimension in user-centred XAI evaluation. performed to establish the content validity of the XEQ Scale.
Carvalho et al., delineate characteristics of a human-friendly ex-
planation in the medical domain, including some subjective or user- 4.1 Study Design
centred properties such as comprehensibility, novelty, and consis-
tency with stakeholders’ prior beliefs [6]. Notably, consistency with In this study, participants evaluated the initial set of 32 items using
stakeholders’ prior beliefs aligns with the mental model from Hoff- the Content Validity Ratio (CVR) method [15]. The CVR method
man et al. [9], while novelty can influence stakeholder engage- is recommended for quantifying the strength of psychometric scale
ment [11]. Nauta and Seifert [22] recognise 12 properties of explana- items with a small group of experts (5-10).
tion quality for image classification applications. They identify three At the start, participants are familiarised with the terminology and
user-centred properties: context - how relevant the explanation is to they explore 3 sample XAI experiences: 1) a student interacting with
the user; coherence - how accordant the explanation is with prior a chatbot that assists with course recommendations and support; 2)
knowledge and beliefs; and controllability - how interactive and con- a clinician interacting with the graphical interface of a radiograph
trollable the explanation is. In comparison to other literature, control- fracture detection system for clinical decision support; and 3) a reg-
lability aligns with engagement [11] and coherence aligns with the ulatory officer interacting with a local council welfare web page to
mental model [9, 6]. Context can be associated with several proper- explore the fairness and biases with the recommender system used
ties such as curiosity, satisfaction, and preference [9, 29]. for predicting application outcomes. A sample dialogue with the in-
These findings highlighted that there are many overlaps between teractive Course Assist chatbot is shown in Figure 1. This dialogue,
evaluation dimensions identified in recent years. However, we high- in its entirety, captures the explanation experience that we wish to
light two main gaps in this current work: 1) there is no consensus in evaluate. Next, participants are asked to rate the 32 items in terms
previous literature regarding the applicable metrics to measure these of their relevance for measuring XAI experience quality using a 5-
evaluation dimensions; and 2) the majority of the existing dimensions point Likert scale. The relevance scale ranges from Not Relevant at
and metrics focus on evaluating individual explanations, not the XAI All to Extremely Relevant. Additionally, we included ratings related
experiences. to clarity, which also use a Likert scale ranging from Not Clear at All
to Extremely Clear. This clarity rating was added to get feedback on
how easily the items are understood by the participants. Finally, we
3.3 Compiling the Initial Item Bank provided participants the opportunity to suggest rephrasing for items
The initial item bank of 40 items included 7 from the Goodness they found relevant but not clear.
Checklist [10]; 8 from the Satisfaction Scale [10]; 8 from the Trust
Scale [10]; and 10 from the System Causability Scale [11]. Seven 4.2 Recruitment Details
additional items were authored and included by the research team.
These were designed to capture stakeholder views on the interactive The participants of this study consisted of XAI experts both from
experience which is less explicitly addressed in previous literature. academia and the industry. 38 experts were contacted via email and
This initial list underwent a rigorous review and revision process, 13 participated in the study.The 13 participants represented a diverse
during which the research team eliminated duplicates, consolidated set of interests in the human-centred XAI research domain and are
similar items, and rephrased items to reflect the measurement of either actively conducting research or have published XAI research
XAI experiences instead of explanations. The resulting 32 statements outcomes since 2020. The study was hosted on the Jisc Online Sur-
formed the initial XEQ Scale (included in Supplementary Material). veys platform for 3 weeks between November and December 2023.
Response for each item is recorded on a 5-point Likert scale, ranging
from “I Strongly Agree” to “I Strongly Disagree”. 4.3 Metrics
4.3.1 Content Validity
3.3.1 Evaluation Dimensions
The Content Validity Index (CVI) assesses item validity based on re-
We reviewed evaluation dimensions from previous literature and con-
sponses to the relevance property. Lower scores indicate items that
solidated XEQ items into four evaluation dimensions representing
may need modification or removal. Given scale S with M items
XAI experience quality: learning, utility, fulfillment, and engage-
where i indicates an item, rji denotes the response of participant j
ment. These dimensions are relevant to capturing personalised ex-
to item i. For analysis, each response (rji ) is modified as follows.
periences for a given stakeholder. We define them as follows:
(
Learning: the extent to which the experience develops knowledge i 1, if rji ∈ [Extremely Relevant or Somewhat Relevant]
rj =
or competence; 0, otherwise
Utility: the contribution of the experience towards task completion;
Fulfilment: the degree to which the experience supports the We calculate the following two forms of the Content Validity In-
achievement of XAI goals; and dex (CVI) scores.
Item-Level CVI: measures the validity of each item independently; the number of responses is $N$ and the expected score is $\geq 0.78$.

$$ \text{I-CVI}_i = \frac{\sum_{j=1}^{N} r_j^i}{N} $$

Scale-Level CVI: measures the overall scale validity using a) the Average method, i.e. the mean Item-Level CVI score, where the expected score is $\geq 0.90$; and b) the Universal Agreement method, i.e. the percentage of items that experts always found relevant, with an expected value of $\geq 0.80$.

$$ \text{S-CVI}(a) = \frac{\sum_{i=1}^{M} \text{I-CVI}_i}{M} \qquad \text{S-CVI}(b) = \frac{\sum_{i=1}^{M} \mathbb{1}[\text{I-CVI}_i = 1]}{M} $$

Here, while S-CVI(a) averages the I-CVI scores over all items, S-CVI(b) counts the number of items with an I-CVI of 1 (indicating complete agreement among experts that the item is relevant) and divides this by the total number of items.
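The CVI computations above reduce to simple column means over binarised expert ratings. The sketch below is our illustration rather than code from the study; the 13 x 32 rating matrix is randomly generated, and the assumption that ratings of 4 or 5 correspond to "Somewhat Relevant" or "Extremely Relevant" is ours.

```python
import numpy as np

# Hypothetical expert relevance ratings: rows = experts (N), columns = items (M),
# values 1..5 on the relevance Likert scale (5 = Extremely Relevant).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(13, 32))

# Binarise: 1 if the expert rated the item Somewhat or Extremely Relevant, else 0.
relevant = (ratings >= 4).astype(int)

# Item-Level CVI: proportion of experts who found each item relevant.
i_cvi = relevant.mean(axis=0)                     # shape (M,)

# Scale-Level CVI, Average method: mean of the I-CVI scores.
s_cvi_a = i_cvi.mean()

# Scale-Level CVI, Universal Agreement method: share of items every expert found relevant.
s_cvi_b = (i_cvi == 1.0).mean()

# Items below the retention threshold would be flagged for revision or removal.
flagged = np.where(i_cvi < 0.78)[0]
print(f"S-CVI(a)={s_cvi_a:.4f}, S-CVI(b)={s_cvi_b:.4f}, flagged items: {flagged}")
```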
4.4 Results

We refer to the first two columns of Table 2 for the results of the Content Validity study. We first removed items with low validity ($\text{I-CVI}_i \leq 0.75$), and thereafter S-CVI scores were used to establish the content validity of the resulting scale. Here we marginally divert from the established baseline of 0.78 for I-CVI to further investigate items with $0.75 \leq \text{I-CVI} \leq 0.78$ during the pilot study. The Likert responses to the clarity property and the free-text feedback influenced the re-wording of 7 items to improve clarity (indicated by †). The item selection and rephrasing were done based on the suggestions from the XAI experts and the consensus of the research team. The resulting scale comprised 18 items, which we refer to as the XEQ Scale - pronounced: "Seek". In Table 2, items are ordered by their I-CVI scores.

S-CVI(a) and S-CVI(b) of the scale were 0.8846 and 0.2222. While S-CVI(a) is comparable to the baseline of 0.9, S-CVI(b) indicates that universal agreement is not achieved. However, existing literature suggests that meeting one of the baseline criteria is sufficient to proceed to pilot studies. Notably, the 14 items with I-CVI ≥ 0.78 also only achieve average agreement (S-CVI(a) = 0.9179) and not universal agreement (S-CVI(b) = 0.2667). Following the item selection and refinement, each item was assigned an evaluation dimension based on the consensus of the research team (see column "Factor" in Table 2). These will be used in further investigations using factor analysis to establish the construct validity of the scale.

5 XEQ Scale Refinement and Validation

In this section, we present the pilot study conducted to refine the XEQ Scale for internal consistency, construct validity and discriminant validity.

5.1 Study Design and Applications

The study involved two application domains: 1) the CourseAssist Chatbot for new students to guide their course selection processes; and 2) the AssistHub website for welfare applicants to assist with application outcomes. For each application, two sample XAI experiences were designed: one relatively positive and one relatively negative experience. The two samples differ in the quality of the explanations presented to the participant and the resulting impact on the overall interaction flow. Participants accessed all samples in video format. Figure 1 presents the static views of the two samples of the CourseAssist Chatbot. AssistHub samples are included in Supplementary Material. These sample experiences create a controlled experiment where the discriminant properties of the scale can be validated.

First, the participants explore the XAI experience and then proceed to respond to the XEQ Scale. In addition, they are also queried about the clarity of items within the scope of the sample experience. Lastly, participants can offer free-text feedback about the study.

Figure 1. The static previews of the relatively positive (left two columns) and negative (right column) XAI Experiences with the CourseAssist Chatbot designed for the Pilot Study.

5.2 Recruitment Details

This study enlisted 203 participants, comprising undergraduate students from the leading research institute and participants recruited from the Prolific.co platform. 33 students from the research institute and 70 Prolific participants were recruited for the CourseAssist application, where the inclusion criteria were: Current education level - Undergraduate degree; Degree subjects - Mathematics and statistics, Information and Communication Technologies, Natural sciences; and Year of study - 1st, 2nd, 3rd or 4th. 100 Prolific participants were recruited for the AssistHub application with the following inclusion criteria: Household size is 3 or larger; Property ownership is either social housing or affordable-rented accommodation; Employment status is either part-time, due to start a new job within the next month, unemployed, or not in paid work (e.g. homemaker or retired).

In the rest of this paper, we will refer to all responses to positive experiences as Group A and all responses to negative experiences as Group B. To represent application-specific groups we will use the application name as a prefix, e.g. CourseAssist-A. Each participant was randomly assigned to one of the sample experiences and, after review, we excluded 5, 1 and 1 responses from groups CourseAssist-A, AssistHub-A and AssistHub-B that failed the following attention checks: 1) spent less than half of the allocated time; and/or 2) responded to the questionnaire in a pattern. This resulted in 53, 50, 50, and 50 respondents for the CourseAssist-A, CourseAssist-B, AssistHub-A and AssistHub-B groups respectively.

5.3 Metrics

For analysis, we introduce the following notation. Given that $r_j^i$ is participant $j$'s response to item $i$, the participant's total is $r_j$ and the item total is $r^i$. We transform 5-step Likert responses to numbers as follows: Strongly Disagree-1, Somewhat Disagree-2, Neutral-3, Somewhat Agree-4, and Strongly Agree-5. Accordingly, for the 18-item XEQ Scale, $r_j \leq 90$ ($5 \times 18$).
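As a concrete illustration of this notation, the short sketch below maps Likert labels to numbers and computes the participant totals $r_j$ and item totals $r^i$. This is our own sketch; the response table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical pilot-study responses: one row per participant, one column per XEQ item.
likert_to_num = {
    "Strongly Disagree": 1, "Somewhat Disagree": 2, "Neutral": 3,
    "Somewhat Agree": 4, "Strongly Agree": 5,
}
responses = pd.DataFrame({
    "item_1": ["Somewhat Agree", "Neutral", "Strongly Agree"],
    "item_2": ["Strongly Agree", "Somewhat Disagree", "Somewhat Agree"],
})

scores = responses.apply(lambda col: col.map(likert_to_num))  # r_j^i as numbers 1..5
participant_totals = scores.sum(axis=1)                       # r_j, at most 5 * number of items
item_totals = scores.sum(axis=0)                              # r^i
print(participant_totals.tolist(), item_totals.tolist())
```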
5.3.1 Internal Consistency

Internal consistency refers to the degree of inter-relatedness among items within a scale. We employ the following metrics from psychometric theory to assess the XEQ Scale items.

Item-Total Correlation calculates the Pearson correlation coefficient between the item score and the total score; the expected value per item is $\geq 0.50$. The Item-Total Correlation of item $i$, $i_T$, is calculated as follows:

$$ i_T = \frac{\sum_{j=1}^{N} (r_j^i - \bar{r}^i)(r_j - \bar{r})}{\sqrt{\sum_{j=1}^{N} (r_j^i - \bar{r}^i)^2 \sum_{j=1}^{N} (r_j - \bar{r})^2}} $$

Here $\bar{r}^i$ is the average response score for the $i$th item, and $\bar{r}$ is the average participant total.

Inter-Item Correlation is a measure of the correlation between different items within a scale; values between 0.2 and 0.8 are considered expected, since $\geq 0.80$ indicates redundancy and $\leq 0.20$ indicates poor item homogeneity. The calculation is similar to the previous one but is between two items.

Cronbach's alpha measures the extent to which all items in a scale are measuring the same underlying construct [7]. High internal consistency is indicated by $\alpha \geq 0.7$. If $s^i$ is the standard deviation of responses to item $i$, and $s$ is the standard deviation of participant totals, $\alpha$ is calculated as follows:

$$ \alpha = \frac{M}{M-1} \left( 1 - \frac{\sum_{i=1}^{M} (s^i)^2}{s^2} \right) $$

As such, this helps to quantify how much of the total variance is due to the shared variance among items, which reflects their consistency in measuring the same underlying construct.
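A minimal sketch of these internal-consistency checks, assuming a participants x items matrix of numeric 1-5 scores. This is our illustration with made-up data, not the study's analysis code.

```python
import numpy as np
import pandas as pd

def item_total_correlations(scores: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each item with the participant total (r_j)."""
    totals = scores.sum(axis=1)
    return scores.apply(lambda col: col.corr(totals))

def cronbach_alpha(scores: pd.DataFrame) -> float:
    """Cronbach's alpha from item variances and the variance of participant totals."""
    m = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (m / (m - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point responses for 6 participants and 4 items.
scores = pd.DataFrame(
    [[4, 5, 4, 3], [2, 2, 3, 2], [5, 4, 5, 5],
     [3, 3, 2, 3], [4, 4, 4, 5], [1, 2, 2, 1]],
    columns=["q1", "q2", "q3", "q4"],
)
print(item_total_correlations(scores).round(2))
print("inter-item correlations:\n", scores.corr().round(2))
print("Cronbach's alpha:", round(cronbach_alpha(scores), 3))
```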

5.3.2 Discriminant Validity

Discriminant validity measures the ability of the scale to discern between positive and negative experiences; we used the following two methods.

Discriminant Analysis treats the pilot study responses as a labelled dataset to train a classification model with a linear decision boundary. The items are considered as features and the group (A or B) is considered as the label. A holdout set then evaluates the model's ability to distinguish between groups A and B.

Parametric Statistical Test uses a mixed-model ANOVA test to measure whether there is a statistically significant difference between the two groups A and B (agnostic of the domain). Our null hypothesis is "no significant difference is observed in the mean participant total between groups A and B". Our sample sizes meet the requirements for a parametric test, determined by an a priori power analysis using G*Power [8].
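A simplified sketch of the discriminant analysis and an accompanying statistical comparison, using synthetic Group A/B responses that we generate for illustration. The paper additionally reports a mixed-model ANOVA and a G*Power analysis, which are not reproduced here; a logistic regression stands in for the linear-boundary classifier.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Hypothetical item-level responses: 100 positive (A) and 100 negative (B) participants.
X_a = rng.integers(3, 6, size=(100, 18))   # Group A tends towards higher ratings
X_b = rng.integers(1, 4, size=(100, 18))   # Group B tends towards lower ratings
X = np.vstack([X_a, X_b])
y = np.array([0] * 100 + [1] * 100)        # 0 = Group A, 1 = Group B

# Linear decision boundary evaluated on a stratified 70/30 holdout split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))

# Difference in mean participant totals between the groups, with an effect size.
totals_a, totals_b = X_a.sum(axis=1), X_b.sum(axis=1)
t_stat, p_value = stats.ttest_ind(totals_a, totals_b)
pooled_sd = np.sqrt((totals_a.var(ddof=1) + totals_b.var(ddof=1)) / 2)
cohens_d = (totals_a.mean() - totals_b.mean()) / pooled_sd
print(f"t={t_stat:.2f}, p={p_value:.2e}, Cohen's d={cohens_d:.2f}")
```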

5.3.3 Construct Validity

Construct validity evaluates the degree to which the scale assesses the characteristic of interest [14]. We perform two forms of Factor Analysis (FA) to uncover underlying factors (i.e. dimensions) and validate them.

Exploratory FA finds the number of underlying factors in the scale by assessing the variance explained through the Principal Component Analysis (PCA) coefficients (i.e. eigenvalues).

Confirmatory FA hypothesises a factor model (e.g. the model proposed in the Exploratory FA or a factor model proposed by XAI experts) and calculates factor loadings, where the expected loading for each item is $\geq 0.5$.

5.4 Results

5.4.1 Internal Consistency

Table 2, column 3, reports the Item-Total Correlation. All items met the baseline criterion of $i_T \geq 0.5$ and the baseline criterion for Inter-Item Correlation. Cronbach's alpha is 0.9562, which also indicates strong internal consistency.

Table 2. Results
# | Item | I-CVI | Item-Total Correlation | One Factor Loading | Factor
1 | The explanations received throughout the experience were consistent†. | 1.0000 | 0.6274 | 0.6076 | Engagement
2 | The experience helped me understand the reliability of the AI system. | 1.0000 | 0.6416 | 0.6300 | Learning
3 | I am confident about using the AI system. | 1.0000 | 0.7960 | 0.7790 | Utility
4 | The information presented during the experience was clear. | 1.0000 | 0.7666 | 0.7605 | Learning
5 | The experience was consistent with my expectations†. | 0.9231 | 0.7959 | 0.7831 | Fulfilment
6 | The presentation of the experience was appropriate for my requirements†. | 0.9231 | 0.8192 | 0.8083 | Fulfilment
7 | The experience has improved my understanding of how the AI system works. | 0.9231 | 0.6169 | 0.5859 | Learning
8 | The experience helped me build trust in the AI system. | 0.9231 | 0.7160 | 0.7018 | Learning
9 | The experience helped me make more informed decisions. | 0.9231 | 0.7460 | 0.7279 | Utility
10 | I received the explanations in a timely and efficient manner. | 0.8462 | 0.7015 | 0.6841 | Engagement
11 | The information presented was personalised to the requirements of my role†. | 0.8462 | 0.7057 | 0.6801 | Utility
12 | The information presented was understandable within the req. of my role†. | 0.8462 | 0.7876 | 0.7803 | Utility
13 | The information presented showed me that the AI system performs well†. | 0.8462 | 0.8112 | 0.8016 | Fulfilment
14 | The experience helped to complete the intended task using the AI system. | 0.8462 | 0.8299 | 0.8241 | Utility
15 | The experience progressed sensibly†. | 0.7692 | 0.8004 | 0.7912 | Engagement
16 | The experience was satisfying. | 0.7692 | 0.7673 | 0.7529 | Fulfilment
17 | The information presented during the experience was sufficiently detailed. | 0.7692 | 0.8168 | 0.8035 | Utility
18 | The experience provided answers to all of my explanation needs. | 0.7692 | 0.8472 | 0.8444 | Fulfilment

5.4.2 Discriminant Validity

We performed discriminant analysis over 100 trials, where at each trial a different train-test split of the responses was used. Each trial used a stratified split, with 70% of the responses for training and 30% for testing. Over the 100 trials, we observed an accuracy of 0.63 ± 0.05 and a macro F1-score of 0.63 ± 0.05, which is significantly over the baseline accuracy of 0.50 for a binary classification task. A mixed-model ANOVA test showed a statistically significant difference between groups A and B with a p-value of 1.63e-12, where the mean participant totals for groups A and B were 70.96 ± 0.47 and 57.97 ± 1.84. It also revealed substantial variability within groups, indicated by the group variance of 104.86, which we attribute to including responses from two application domains. Furthermore, Cohen's d was 1.7639, which indicates a large effect size confirming a significant difference between groups A and B. A standard t-test also obtained a close-to-zero p-value associated with the t-statistic at 1.13e-09, further verifying the statistical difference. Based on this evidence we reject the null hypothesis and confirm the discriminant validity of the scale.

5.4.3 Construct Validity

We first explore the number of factors present in the XEQ Scale using Exploratory FA. Figure 2 plots the eigenvalues for the PCA coefficients derived from the scale responses. There is one significant factor evident throughout the scale responses, which we refer to as "XAI Experience Quality". This is evidenced by the sharp drop and plateau of eigenvalues from PCA coefficient 2 onwards.

Figure 2. Exploratory Factor Analysis

To perform the Confirmatory FA, we create a factor model as suggested by the exploratory FA. Confirmatory FA shows that all item factor loadings meet the $\geq 0.5$ baseline, meaning all items contribute to the over-arching factor measured by this scale. To validate the evaluation dimensions assigned to items by the research team, we create another factor model with 4 factors, each assigned the subset of items indicated in the last column of Table 2. Again we find that each item meets the $\geq 0.5$ factor loading for its assigned factor. This confirms that while there is an over-arching factor of "XAI Experience Quality", it is substantially underpinned by the four factors Learning, Utility, Fulfilment and Engagement. This concludes our multi-faceted refinement of the XEQ Scale based on pilot study results.

6 Discussion

6.1 Implications and Limitations

In psychometric theory, conducting a pilot study involves administering both the scale under development and existing scales to participants. The objective is to assess the correlation between the new scale and those found in existing literature, particularly in shared dimensions. However, our pilot studies did not incorporate this, since to the best of our knowledge there are no previous scales that measured XAI experience quality or multi-shot XAI experiences. While the System Causability Scale [11] is the closest match in the literature, it was not applicable as it featured in the initial items bank. Also, the current pilot studies had limited variability in application domains. To address this limitation, we are currently planning pilot studies with two medical applications: 1) fracture prediction in radiographs; and 2) liver disease prediction from CT scans. In the future, we will further validate and refine the scale as necessary.

6.2 XEQ Scale in Practice

Table 3 presents a representative scenario of how the XEQ Scale is used in practice. It presents the XAI Experience Quality analysis of a hypothetical XAI system based on the XEQ scale administered to 5 stakeholders. The items in Table 3 are organised according to their evaluation dimensions, and the stakeholder responses are randomly generated and quantified as follows: Strongly Agree-5; Somewhat Agree-4; Neutral-3; Somewhat Disagree-2; Strongly Disagree-1. Based on the responses, we calculate the following scores.

Stakeholder XEQ Score quantifies individual experiences and is calculated as the mean of the stakeholder's responses to all items.

Factor Score quantifies the quality along each evaluation dimension and is calculated as the mean of all responses to the respective subset of items. For XAI system designers, Factor Scores indicate the dimensions that need improvement. For instance, in the example XAI system, the Learning and Engagement dimensions need improvement, whereas stakeholders found the Utility and Fulfilment dimensions to be satisfactory with room for improvement.

System XEQ Score quantifies the XAI experience quality of the system as a whole and is calculated as the aggregate of the Factor Scores. The System XEQ Score helps the XAI system designers to iteratively develop a well-rounded system grounded in user experience. The designers can also choose to assign a higher weight to one or a subset of dimensions that they find important at any iteration. The System XEQ Score can also be utilised by external parties (e.g. regulators, government), either to evaluate or to benchmark XAI systems.

Table 3. XEQ Scale in Practice
Factor | Item # | Stakeholder #1 | #2 | #3 | #4 | #5 | Factor Mean
Learning | 2 | 3 | 2 | 4 | 3 | 3 |
Learning | 4 | 2 | 1 | 4 | 3 | 4 |
Learning | 7 | 4 | 3 | 3 | 4 | 2 |
Learning | 8 | 3 | 3 | 2 | 1 | 4 | 2.90
Utility | 3 | 2 | 4 | 5 | 3 | 4 |
Utility | 9 | 4 | 2 | 4 | 4 | 5 |
Utility | 11 | 4 | 2 | 3 | 5 | 3 |
Utility | 12 | 5 | 2 | 5 | 2 | 3 |
Utility | 14 | 5 | 3 | 5 | 3 | 4 |
Utility | 17 | 4 | 2 | 4 | 4 | 5 | 3.67
Fulfilment | 5 | 3 | 5 | 4 | 2 | 5 |
Fulfilment | 6 | 3 | 4 | 5 | 5 | 4 |
Fulfilment | 13 | 3 | 3 | 3 | 2 | 3 |
Fulfilment | 16 | 4 | 3 | 4 | 5 | 4 |
Fulfilment | 18 | 4 | 4 | 3 | 2 | 5 | 3.68
Engagement | 1 | 1 | 2 | 4 | 2 | 3 |
Engagement | 10 | 4 | 3 | 2 | 4 | 3 |
Engagement | 15 | 2 | 1 | 3 | 3 | 1 | 2.40
Stakeholder's XEQ Score | | 3.22 | 2.72 | 3.72 | 3.17 | 3.61 |
System XEQ Score | 3.16

6.3 XEQ Benchmark Development

The next phase for the XEQ scale entails developing a benchmark for XAI experience quality. This process includes administering the XEQ scale to over 100 real-world AI systems that provide explainability to stakeholders and establishing a classification system. We plan to follow the established benchmark maintenance policy of the User Experience Questionnaire [26], where we develop and release an XEQ Analysis tool with the benchmark updated regularly.

We envision that when the scale is administered to the stakeholders of a new XAI system, the benchmark will categorise the new system based on the mean participant total in each evaluation dimension as follows - Excellent: within the top 10% of XAI systems considered in the benchmark; Good: worse than the top 10% and better than the lower 75%; Above average: worse than the top 25% and better than the lower 50%; Below average: worse than the top 50% and better than the lower 25%; and Bad: within the 25% worst XAI systems. Accordingly, the XEQ benchmark will enable XAI system owners to iteratively enhance the XAI experience offered to their stakeholders.
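To make the scoring and benchmarking procedure concrete, the sketch below computes stakeholder, factor and system XEQ scores and maps a factor score onto the percentile bands described above. This is our own illustration: the response matrix, the item-to-factor assignment and the benchmark score distribution are all hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical responses: rows = stakeholders, columns = XEQ items (1..5 Likert scores).
responses = pd.DataFrame(
    [[3, 2, 4, 3, 1], [2, 4, 3, 5, 2], [4, 5, 3, 4, 4]],
    columns=["item_2", "item_4", "item_3", "item_9", "item_1"],
)
# Hypothetical assignment of items to evaluation dimensions.
factors = {"Learning": ["item_2", "item_4"], "Utility": ["item_3", "item_9"],
           "Engagement": ["item_1"]}

stakeholder_xeq = responses.mean(axis=1)                      # mean over all items
factor_scores = {f: responses[cols].values.mean() for f, cols in factors.items()}
system_xeq = np.mean(list(factor_scores.values()))            # aggregate of factor scores
print(stakeholder_xeq.round(2).tolist(), factor_scores, round(system_xeq, 2))

def benchmark_category(score: float, benchmark_scores: np.ndarray) -> str:
    """Map a score to the percentile bands of a (hypothetical) XEQ benchmark."""
    pct = (benchmark_scores < score).mean() * 100   # share of benchmarked systems beaten
    if pct >= 90:
        return "Excellent"
    if pct >= 75:
        return "Good"
    if pct >= 50:
        return "Above average"
    if pct >= 25:
        return "Below average"
    return "Bad"

benchmark = np.random.default_rng(2).uniform(1, 5, size=120)  # stand-in benchmark scores
print({f: benchmark_category(s, benchmark) for f, s in factor_scores.items()})
```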
7 Conclusion

In this paper, we presented the XEQ scale for evaluating XAI experiences. The XEQ scale provides a comprehensive evaluation for user-centred XAI experiences and fills a novel gap in the evaluation of multi-shot explanations, which is currently not adequately fulfilled by any other evaluation metric(s). Throughout this paper, we have described the development and validation of the scale following psychometric theory. We make this scale available as a public resource for evaluating the quality of XAI experiences. In future work, we plan to investigate the generalisability of the XEQ scale on additional domains, AI systems and stakeholder groups. Beyond this, we propose to establish a benchmark using the XEQ scale. Our goal is to facilitate the user-centred evaluation of XAI and support the emerging development of best practices in the explainability of autonomous decision-making.

Ethical Statement

Both the content validity study and the pilot study protocols passed the ethics review of the leading institution (references removed for review). Informed consent was obtained from all XAI experts and pilot study participants.

Acknowledgements

We thank all XAI experts and pilot study participants for their contributions.
This work is done as part of the iSee project. iSee is an EU CHIST-ERA project which received funding for the UK from EPSRC under grant number EP/V061755/1; for Ireland from the Irish Research Council under grant number CHIST-ERA-2019-iSee; and for Spain from the MCIN/AEI and European Union "NextGenerationEU/PRTR" under grant number PCI2020-120720-2.

References
[1] B. D. Adams, L. E. Bruyn, S. Houde, P. Angelopoulos, K. Iwasa-Madge, and C. McCann. Trust in automated systems. Ministry of National Defence, 2003.
[2] V. Arya, R. K. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, et al. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012, 2019.
[3] G. O. Boateng, T. B. Neilands, E. A. Frongillo, and S. L. Young. Best practices for developing and validating scales for health, social, and behavioral research: a primer. Frontiers in Public Health, 6:366616, 2018.
[4] J. Brooke et al. SUS: a quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4-7, 1996.
[5] B. Cahour and J.-F. Forzy. Does projection into use improve trust and exploration? An example with a cruise control system. Safety Science, 47(9):1260-1270, 2009.
[6] D. V. Carvalho, E. M. Pereira, and J. S. Cardoso. Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8):832, 2019.
[7] L. J. Cronbach. Essentials of Psychological Testing. 1949.
[8] F. Faul, E. Erdfelder, A.-G. Lang, and A. Buchner. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2):175-191, 2007.
[9] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman. Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608, 2018.
[10] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman. Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance. Frontiers in Computer Science, 5:1096257, 2023.
[11] A. Holzinger, A. Carrington, and H. Müller. Measuring the quality of explanations: the System Causability Scale (SCS) comparing human and machine explanations. KI-Künstliche Intelligenz, 34(2):193-198, 2020.
[12] B. Hu, P. Tunison, B. Vasu, N. Menon, R. Collins, and A. Hoogs. XAITK: The Explainable AI Toolkit. Applied AI Letters, 2(4):e40, 2021.
[13] J.-Y. Jian, A. M. Bisantz, and C. G. Drury. Foundations for an empirically determined scale of trust in automated systems. International Journal of Cognitive Ergonomics, 4(1):53-71, 2000.
[14] A. E. Kazdin. Research Design in Clinical Psychology. Cambridge University Press, 2021.
[15] C. H. Lawshe et al. A quantitative approach to content validity. Personnel Psychology, 28(4):563-575, 1975.
[16] P. Q. Le, M. Nauta, V. B. Nguyen, S. Pathak, J. Schlötterer, and C. Seifert. Benchmarking explainable AI: a survey on available toolkits and open challenges. In International Joint Conference on Artificial Intelligence, 2023.
[17] M. Madsen and S. Gregor. Measuring human-computer trust. In 11th Australasian Conference on Information Systems, volume 53, pages 6-8. Citeseer, 2000.
[18] L. Malandri, F. Mercorio, M. Mezzanzanica, and N. Nobani. ConvXAI: a system for multimodal interaction with any black-box explainer. Cognitive Computation, 15(2):613-644, 2023.
[19] T. Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1-38, 2019.
[20] S. Mohseni, N. Zarei, and E. D. Ragan. A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 11(3-4):1-45, 2021.
[21] G. C. Moore and I. Benbasat. Development of an instrument to measure the perceptions of adopting an information technology innovation. Information Systems Research, 2(3):192-222, 1991.
[22] M. Nauta and C. Seifert. The Co-12 recipe for evaluating interpretable part-prototype image classifiers. In World Conference on Explainable Artificial Intelligence, pages 397-420. Springer, 2023.
[23] J. C. Nunnally and I. H. Bernstein. Psychometric Theory. McGraw Hill, 3rd edition, 1993.
[24] A. Rosenfeld. Better metrics for evaluating explainable artificial intelligence. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems, pages 45-50, 2021.
[25] K. Schaefer. The perception and measurement of human-robot trust. 2013.
[26] M. Schrepp, J. Thomaschewski, and A. Hinderks. Construction of a benchmark for the User Experience Questionnaire (UEQ). 2017.
[27] A. Wijekoon, N. Wiratunga, K. Martin, D. Corsar, I. Nkisi-Orji, C. Palihawadana, D. Bridge, P. Pradeep, B. D. Agudo, and M. Caro-Martínez. CBR driven interactive explainable AI. In International Conference on Case-Based Reasoning, pages 169-184. Springer, 2023.
[28] C.-K. Yeh, C.-Y. Hsieh, A. Suggala, D. I. Inouye, and P. K. Ravikumar. On the (in)fidelity and sensitivity of explanations. Advances in Neural Information Processing Systems, 32, 2019.
[29] J. Zhou, A. H. Gandomi, F. Chen, and A. Holzinger. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5):593, 2021.

Appendix

7.1 Initial Items Bank
Table 4. Items Bank
I like using the system for decision-making.
The information presented during the experience was clear.
The explanations received throughout the experience did not contain inconsistencies.
I could adjust the level of detail on demand.
The experience helped me make more informed decisions.
The experience helped me establish the reliability of the system.
I received the explanations in a timely and efficient manner.
The experience was satisfying.
The experience was suitable for the intended purpose of the system.
I was able to express all my explanation needs.
The experience revealed whether the system is fair.
The experience helped me complete my task using the system.
The experience was consistent with my expectations within the context of my role.
The presentation of the experience was appropriate.
The experience has improved my understanding of how the system works.
The experience helped me understand how to use the system.
The experience was understandable in the context of my role.
The experience helped me build trust in the system.
The experience was personalised to the context of my role.
I could request more detail on demand if needed.
I did not need external support to understand the explanations.
The experience was helpful to achieve my goals.
The experience progressed logically.
The experience was consistent with my understanding of the system.
The duration of the experience was appropriate within the context of my role.
The experience improved my engagement with the system.
The experience was personalised to my explanation needs.
Throughout the experience, all of my explanation needs were resolved.
The experience showed me how accurate the system is.
All parts of the experience were suitable and necessary.
The information presented during the experience was sufficiently detailed for my understanding of the domain.
I am confident in the system.

7.2 Content Validity Study

This study aimed to establish the content validity of the XEQ scale with XAI experts. As XAI Experiences are a novel concept, we included three example XAI experiences that capture a variety of stakeholder types and application domains. In addition to the CourseAssist chatbot example included in the paper, participants were presented with the following experiences in video format.

• The AssistHub AI platform is a website for processing welfare applications and is used by a local council to accelerate the application process. A regulation officer explores the website and its XAI features to understand the fairness and bias of the AI system being used in the decision-making process. A non-interactive preview of the experience is presented in Figure 3.
• The RadioAssist AI platform is a desktop application, used by the local hospital to support clinicians in their clinical decision-making processes. An AI system predicts the presence of fractures in radiographs and explains its decisions to the clinicians. A non-interactive preview of the experience is presented in Figure 4.

Both Figures 3 and 4 are annotated with notes that describe the XAI features that were available to the stakeholder. Finally, Figure 5 presents a preview of the study page.

7.3 Pilot Study

A pilot study was conducted with 203 participants over two application domains, where they evaluated either a positive or a negative XAI experience. In addition to the CourseAssist chatbot examples provided in the paper, we included two XAI experiences of welfare applicants interacting with the AssistHub AI platform (see Figures 6 and 7). Notes refer to how different aspects of the explanations can lead to a positive or negative XAI experience. Similar to the previous study, all XAI experiences were available to participants in video format. Finally, Figure 8 presents a preview of the pilot study, where pages 1 and 2 were customised based on the application participants were assigned to.
Figure 3. Positive XAI Experience of a Regulation Officer exploring the AssistHub AI platform; A pop-up is shown when clicked on the AI INSIGHTS button,
with four navigation pages providing different types of explanations. The help description pop-up appears when clicking on the question mark button.
Figure 4. Positive XAI Experience of a Clinician using the RadioAssist AI platform; The clinician can click on the two minimised pages to expand and view
explanations about the AI system and the decision. The question mark button shows the help description.
Figure 5. Study Preview; Examples removed from Page 3 and list of items shortened in Page 4.
Figure 6. Relatively positive XAI Experience of a welfare applicant using the AssistHub AI platform
Figure 7. Relatively negative XAI Experience of a welfare applicant using the AssistHub AI platform
Figure 8. Pilot Study Preview; Example removed from Page 2 and list of items shortened in Page 3.
