XEQ Scale For Evaluating XAI Experience Quality Grounded in Psychometric Theory
Abstract. Explainable Artificial Intelligence (XAI) aims to improve the transparency of autonomous decision-making through explanations. Recent literature has emphasised users’ need for holistic “multi-shot” explanations and the ability to personalise their engagement with XAI systems. We refer to this user-centred interaction as an XAI Experience. Despite advances in creating XAI experiences, evaluating them in a user-centred manner has remained challenging. To address this, we introduce the XAI Experience Quality (XEQ) Scale (pronounced “Seek” Scale) for evaluating the user-centred quality of XAI experiences. Furthermore, the XEQ Scale quantifies the quality of experiences across four evaluation dimensions: learning, utility, fulfilment and engagement. These contributions extend the state-of-the-art of XAI evaluation, moving beyond the one-dimensional metrics frequently developed to assess single-shot explanations. In this paper, we present the XEQ Scale development and validation process, including content validation with XAI experts as well as discriminant and construct validation through a large-scale pilot study. Our pilot study results offer strong evidence that establishes the XEQ Scale as a comprehensive framework for evaluating user-centred XAI experiences.

∗ Corresponding Author. Email: [email protected]

1 Introduction

Explainable Artificial Intelligence (XAI) describes a range of techniques to elucidate autonomous decision-making and the data that informed the AI system [19, 12, 2]. Each technique typically provides explanations that focus on a specific aspect of the system and its decisions. Accordingly, the utility of employing multiple techniques for a holistic explanation of a system becomes increasingly clear [2, 27]. The collection of explanations, provided by different techniques and describing different components of the system, forms what we describe as “multi-shot” explanations. Previous work has demonstrated the effectiveness of delivering multi-shot explanations using graphical user interfaces [2] and conversation [18, 27].

While the utility of user-centred interactive explanations is evident in recent literature, the evaluation of these interactions remains a key research challenge. Current work primarily targets the development of objective metrics for single-shot techniques [24, 28], emphasising the need for reproducible benchmarks on public datasets [16]. Such metrics are system-centred and model-agnostic, which gives them the advantage of generalisability. However, objective metrics fail to acknowledge the requirements of different stakeholder groups: a satisfactory explanation is reliant on the recipient’s expertise within the domain and of AI in general [20]. Subjective metrics, such as those described in [10, 11], allow an evaluation that is personalised to the individual and the domain. However, existing subjective evaluations lack the capacity to measure the interactive process that underpins multi-shot explanations and how it impacts user experience.

We address this challenge by introducing the XAI Experience Quality (XEQ) Scale (pronounced “Seek” Scale). We define an XAI Experience as the user-centred process of a stakeholder interacting with an XAI system to gain knowledge and/or improve comprehension. XAI Experience Quality (XEQ) is defined as the extent to which a stakeholder’s explanation needs are satisfied by their XAI Experience. A glossary of all related terminology used throughout this paper is included in Table 1. Specifically, we ask the research question: “How to evaluate an XAI experience, in contrast to assessing single-shot (non-interactive) explanations?”. To address this, we follow a formal psychometric scale development process [3] and outline the following objectives:

1. conduct a literature review to compile a collection of XAI evaluation questionnaire items;
2. conduct a content validity study with XAI experts to develop the XEQ Scale; and
3. perform a pilot study to refine and validate the XEQ Scale for internal consistency, construct and discriminant validity.

The rest of this paper expands on each objective. We discuss related work in Section 2. Section 3 presents key previous publications and the creation of the initial item bank. The Content Validity study details and results are presented in Section 4, followed by Section 5 presenting pilot study details for the refinement and validation of the XEQ Scale. Key implications of the XEQ Scale are discussed in Section 6. Finally, we offer conclusions in Section 7.
Table 1. Glossary

XAI System: An automated decision-making system that is designed and developed to provide information about its reasoning.
Stakeholder: An individual or group with a vested interest in the XAI system. Stakeholders are a diverse group, encompassing system designers and developers, who hold an interest in the system’s technical functionality, the end consumers relying on its decisions, and regulatory authorities responsible for ensuring fair and ethical use.
XAI Experience (XE): A user-centred process of a stakeholder interacting with an XAI system to gain knowledge and/or improve comprehension.
XE Quality (XEQ): The extent to which a stakeholder’s explanation needs are satisfied by their XE.
2 Related Work

In the literature, there are several methodologies for developing evaluation metrics or instruments for user-centred XAI.

Hoffman et al. [9] employed psychometric theory to construct the Satisfaction Scale, evaluating both content validity and discriminant validity. A similar methodology was adopted in [17] to develop the Madsen-Gregor human-machine trust scale, relying on pre-existing item lists from previous scales in conjunction with expert insights [21]. Jian et al. [13] pursued a factor analysis approach involving non-expert users to formulate a human-machine trust scale. They compiled words and phrases associated with trust and its variants, organising them based on their relevance to trust and distrust; these were then clustered to identify underlying factors and formulate corresponding statements. This methodology is particularly suitable in cases where no prior items exist for initial compilation. While these methodologies are robust enough to produce reliable scales, they are resource- and knowledge-intensive processes.

A more frequent approach to scale development is to derive new scales from existing scales in psychology research. For instance, the System Causability Scale [11] draws inspiration from the widely used System Usability Scale [4], while the Hoffman Curiosity Checklist originates from scales designed to assess human curiosity [9]. Similarly, the Cahour-Forzy Trust Scale [5] selected questions from research on human trust, and the Hoffman Trust Scale incorporates items from previous trust scales [5, 13]. Notably, these derived scales were not evaluated for reliability or other desirable factors; they rely on the quality of the original scales for their validity. In this paper, we opt for the psychometric theory approach to establish the content, construct and discriminant validity of the resulting scale. While this approach is resource-intensive, the complexity and novelty of the evaluation task necessitate a rigorous approach to scale development.

3 Literature Review and Initial Items Bank Compilation

This section presents the literature review findings that led to the compilation of the initial items bank for the XEQ Scale.

3.1 Methodology

To scope the existing work and form the initial item bank, we conducted a targeted literature review in the domain of XAI evaluation metrics. The reasoning for a targeted review instead of a systematic review is twofold: 1) the purpose of the review is to form the initial item bank, which involves in-depth analysis of selected literature (depth over breadth); and 2) literature under this topic is significantly limited. The initial findings highlighted that, while many publications discuss and emphasise the importance of evaluation dimensions (what should be or is evaluated), only a few actually propose and systematically develop metrics for XAI evaluation.

3.2 Findings: Evaluation Dimensions and Metrics

Hoffman et al. [9] are among the leading contributors, and their work has been widely utilised in user-centred XAI research. They conceptually modelled the “process of explaining in XAI”, outlining dimensions and metrics for evaluating single-shot explanations from stakeholders’ perspectives. They considered six evaluation dimensions: goodness, satisfaction, mental model, curiosity, trust and performance. For each dimension, they either systematically developed an evaluation metric or critiqued metrics available in the literature, offering a comprehensive evaluation methodology for XAI practitioners. The System Causability Scale [11] is the other most prominent work in XAI evaluation. We discuss each scale briefly below.

Hoffman Goodness Checklist is utilised to objectively evaluate explanations with an independent XAI expert to improve the “goodness”. It consists of 7 items answered by selecting either ‘yes’ or ‘no’. It was developed by referring to literature that proposes “goodness” properties of explanations.

Hoffman Satisfaction Scale was designed using psychometric theory to evaluate the subjective “goodness” of explanations with stakeholders. It consists of 8 items responded to on a 5-step Likert scale. It is viewed as the user-centred variant of the Goodness Checklist, with many shared items. Unlike the checklist, however, it has been evaluated for content validity with XAI experts as well as for construct and discriminant validity in pilot studies.

Hoffman Curiosity Checklist is designed to elicit stakeholder explanation needs, i.e. which aspects of the system pique their curiosity. This metric consists of one question, “Why have you asked for an explanation? Check all that apply.”, and the responses inform the design and implementation of the XAI system.

Hoffman Trust Scale measures the development of trust when exploring a system’s explainability. The authors derived this scale by considering the overlaps and cross-use of trust scales in the literature that measure trust in autonomous systems (not in the presence of explainability, e.g. trust between a human and a robot) [13, 1, 25, 5].

System Causability Scale measures the effectiveness, efficiency and satisfaction of the explainability process in systems involving multi-shot explanations [11]. Derived from the widely used System Usability Scale [4], this scale comprises 10 items rated on a 5-step Likert scale. Notably, it includes items that measure stakeholder engagement, addressing a gap in previous scales designed for one-shot explainability settings. However, the validation of the scale is limited to one small-scale pilot study in the medical domain.

3.2.1 Other Dimensions

Many other publications emphasised the need for user-centred XAI evaluations and explored evaluation dimensions. Two other dimensions considered in [9] are the mental model and performance concerning task completion. Hoffman et al. recommended eliciting the mental model of stakeholders in think-aloud problem-solving and question-answering sessions. Performance is measured by observing the change in productivity and the change in system usage.
The evaluation of these dimensions requires metrics beyond questionnaire-based techniques. Another domain-agnostic survey finds many overlaps with Hoffman et al., defining 4 user-centred evaluation dimensions: mental model, usefulness and satisfaction, trust and reliance, and human-task performance [20]. Zhou et al. [29] summarise previous literature, emphasising three subjective dimensions - trust, confidence and preference - that overlap with dimensions identified in [9]. Conversely to Hoffman et al., they consider task completion to be an objective dimension in user-centred XAI evaluation.

Carvalho et al. delineate characteristics of a human-friendly explanation in the medical domain, including some subjective or user-centred properties such as comprehensibility, novelty, and consistency with stakeholders’ prior beliefs [6]. Notably, consistency with stakeholders’ prior beliefs aligns with the mental model from Hoffman et al. [9], while novelty can influence stakeholder engagement [11]. Nauta and Seifert [22] recognise 12 properties of explanation quality for image classification applications. They identify three user-centred properties: context - how relevant the explanation is to the user; coherence - how accordant the explanation is with prior knowledge and beliefs; and controllability - how interactive and controllable the explanation is. In comparison to other literature, controllability aligns with engagement [11] and coherence aligns with the mental model [9, 6]. Context can be associated with several properties such as curiosity, satisfaction, and preference [9, 29].

These findings highlight that there are many overlaps between the evaluation dimensions identified in recent years. However, we highlight two main gaps in this current work: 1) there is no consensus in previous literature regarding the applicable metrics to measure these evaluation dimensions; and 2) the majority of the existing dimensions and metrics focus on evaluating individual explanations, not XAI experiences.

3.3 Compiling the Initial Item Bank

The initial item bank of 40 items included 7 from the Goodness Checklist [10]; 8 from the Satisfaction Scale [10]; 8 from the Trust Scale [10]; and 10 from the System Causability Scale [11]. Seven additional items were authored and included by the research team. These were designed to capture stakeholder views on the interactive experience, which is less explicitly addressed in previous literature. This initial list underwent a rigorous review and revision process, during which the research team eliminated duplicates, consolidated similar items, and rephrased items to reflect the measurement of XAI experiences instead of explanations. The resulting 32 statements formed the initial XEQ Scale (included in the Supplementary Material). The response to each item is recorded on a 5-point Likert scale, ranging from “I Strongly Agree” to “I Strongly Disagree”.

3.3.1 Evaluation Dimensions

We reviewed evaluation dimensions from previous literature and consolidated the XEQ items into four evaluation dimensions representing XAI experience quality: learning, utility, fulfilment, and engagement. These dimensions are relevant to capturing personalised experiences for a given stakeholder. We define them as follows:

Learning: the extent to which the experience develops knowledge or competence;
Utility: the contribution of the experience towards task completion;
Fulfilment: the degree to which the experience supports the achievement of XAI goals; and
Engagement: the quality of the interaction between the user and the XAI system.

In the next sections, we describe the development, refinement and validation of the XEQ Scale following psychometric theory [23].

4 XEQ Scale Development

This section presents the details and results of an expert user study performed to establish the content validity of the XEQ Scale.

4.1 Study Design

In this study, participants evaluated the initial set of 32 items using the Content Validity Ratio (CVR) method [15]. The CVR method is recommended for quantifying the strength of psychometric scale items with a small group of experts (5-10).

At the start, participants are familiarised with the terminology and explore 3 sample XAI experiences: 1) a student interacting with a chatbot that assists with course recommendations and support; 2) a clinician interacting with the graphical interface of a radiograph fracture detection system for clinical decision support; and 3) a regulatory officer interacting with a local council welfare web page to explore the fairness and biases of the recommender system used for predicting application outcomes. A sample dialogue with the interactive CourseAssist chatbot is shown in Figure 1. This dialogue, in its entirety, captures the explanation experience that we wish to evaluate. Next, participants are asked to rate the 32 items in terms of their relevance for measuring XAI experience quality using a 5-point Likert scale. The relevance scale ranges from Not Relevant at All to Extremely Relevant. Additionally, we included ratings related to clarity, which also use a Likert scale, ranging from Not Clear at All to Extremely Clear. This clarity rating was added to get feedback on how easily the items are understood by the participants. Finally, we provided participants the opportunity to suggest rephrasing for items they found relevant but not clear.

4.2 Recruitment Details

The participants of this study consisted of XAI experts from both academia and industry. 38 experts were contacted via email and 13 participated in the study. The 13 participants represented a diverse set of interests in the human-centred XAI research domain and are either actively conducting research or have published XAI research outcomes since 2020. The study was hosted on the Jisc Online Surveys platform for 3 weeks between November and December 2023.

4.3 Metrics

4.3.1 Content Validity

The Content Validity Index (CVI) assesses item validity based on responses to the relevance property. Lower scores indicate items that may need modification or removal. Given scale S with M items, where i indicates an item, r_j^i denotes the response of participant j to item i. For analysis, each response r_j^i is modified as follows:

\[
r_j^i =
\begin{cases}
1, & \text{if } r_j^i \in \{\text{Extremely Relevant}, \text{Somewhat Relevant}\} \\
0, & \text{otherwise}
\end{cases}
\]

We calculate the following two forms of the Content Validity Index (CVI) scores.
Item-Level CVI: measures the validity of each item independently; the number of responses is N and the expected score is ≥ 0.78.

\[
\text{I-CVI}_i = \frac{\sum_{j=1}^{N} r_j^i}{N}
\]

Scale-Level CVI: measures the overall scale validity using a) the Average method, i.e. the mean Item-Level CVI score, where the expected score is ≥ 0.90; and b) the Universal Agreement method, i.e. the percentage of items that experts always found relevant, with an expected value of ≥ 0.80.

\[
\text{S-CVI(a)} = \frac{\sum_{i=1}^{M} \text{I-CVI}_i}{M}
\qquad
\text{S-CVI(b)} = \frac{\sum_{i=1}^{M} \mathbf{1}[\,\text{I-CVI}_i = 1\,]}{M}
\]

Here, once the average of the I-CVIs is calculated for all items with S-CVI(a), S-CVI(b) counts the number of items with an I-CVI of 1 (indicating complete agreement among experts that the item is relevant) and divides this by the total number of items.
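To make the computation concrete, the following minimal Python sketch derives I-CVI, S-CVI(a) and S-CVI(b) from a matrix of binarised expert ratings. The function name, variable names and toy data are ours for illustration; they are not artefacts of the study.

```python
from typing import List, Tuple

def cvi_scores(relevance: List[List[int]]) -> Tuple[List[float], float, float]:
    """Content-validity indices from binarised expert ratings.

    relevance[j][i] is 1 if expert j rated item i as Extremely or Somewhat
    Relevant, and 0 otherwise. Returns (I-CVI per item, S-CVI(a), S-CVI(b)).
    """
    n_experts = len(relevance)
    n_items = len(relevance[0])

    # I-CVI: proportion of experts who found each item relevant.
    i_cvi = [sum(expert[i] for expert in relevance) / n_experts
             for i in range(n_items)]

    # S-CVI(a): mean of the item-level indices.
    s_cvi_a = sum(i_cvi) / n_items

    # S-CVI(b): proportion of items with universal agreement (I-CVI == 1).
    s_cvi_b = sum(1 for v in i_cvi if v == 1.0) / n_items

    return i_cvi, s_cvi_a, s_cvi_b

# Toy example: 4 experts rating 3 items (illustrative data only).
print(cvi_scores([[1, 1, 0], [1, 1, 1], [1, 0, 1], [1, 1, 1]]))
```

With the 13 experts of this study, the I-CVI values reported in Table 2 correspond to 13/13 = 1.0000, 12/13 ≈ 0.9231, 11/13 ≈ 0.8462 and 10/13 ≈ 0.7692 experts rating an item as relevant.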
4.4 Results

We refer to the first two columns of Table 2 for the results of the Content Validity study. We first removed items with low validity (I-CVI_i ≤ 0.75), and thereafter the S-CVI scores were used to establish the content validity of the resulting scale. Here we marginally divert from the established baseline of 0.78 for I-CVI in order to further investigate items with 0.75 ≤ I-CVI ≤ 0.78 during the pilot study. The Likert responses to the clarity property and the free-text feedback influenced the re-wording of 7 items to improve clarity (indicated by †). The item selection and rephrasing were done based on the suggestions from the XAI experts and the consensus of the research team. The resulting scale comprised 18 items, which we refer to as the XEQ Scale, pronounced “Seek”. In Table 2, items are ordered by their I-CVI scores.

S-CVI(a) and S-CVI(b) of the scale were 0.8846 and 0.2222. While S-CVI(a) is comparable to the baseline of 0.9, S-CVI(b) indicates that universal agreement is not achieved. However, existing literature suggests that meeting one of the baseline criteria is sufficient to proceed to pilot studies. Notably, the 14 items with I-CVI ≥ 0.78 also only achieve average agreement (S-CVI(a) = 0.9179) and not universal agreement (S-CVI(b) = 0.2667). Following the item selection and refinement, each item was assigned an evaluation dimension based on the consensus of the research team (see column “Factor” in Table 2). These assignments will be used in further investigations using factor analysis to establish the construct validity of the scale.
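As a quick arithmetic check, the scale-level scores reported above follow directly from the I-CVI column of Table 2, which contains four items at 1.0000, five at 0.9231, five at 0.8462 and four at 0.7692:

\[
\text{S-CVI(a)} = \frac{4(1.0000) + 5(0.9231) + 5(0.8462) + 4(0.7692)}{18} \approx 0.8846,
\qquad
\text{S-CVI(b)} = \frac{4}{18} \approx 0.2222 .
\]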
For each application, two sample experiences were designed: one relatively positive and one relatively negative experience. The two samples differ in the quality of the explanations presented to the participant and the resulting impact on the overall interaction flow. Participants accessed all samples in video format. Figure 1 presents the static views of the two samples of the CourseAssist chatbot. AssistHub samples are included in the Supplementary Material. These sample experiences create a controlled experiment where the discriminant properties of the scale can be validated.

First, the participants explore the XAI experience and then proceed to respond to the XEQ Scale. In addition, they are also queried about the clarity of items within the scope of the sample experience. Lastly, participants can offer free-text feedback about the study.

5.2 Recruitment Details

This study enlisted 203 participants, comprising undergraduate students from the leading research institute and participants recruited from the Prolific.co platform. 33 students from the research institute and 70 Prolific participants were recruited for the CourseAssist application, where the inclusion criteria were: current education level - undergraduate degree; degree subject - Mathematics and Statistics, Information and Communication Technologies, or Natural Sciences; and year of study - 1st, 2nd, 3rd or 4th. 100 Prolific participants were recruited for the AssistHub application with the following inclusion criteria: household size of 3 or larger; property ownership being either social housing or affordable-rented accommodation; and employment status being either part-time, due to start a new job within the next month, unemployed, or not in paid work (e.g. homemaker or retired).

In the rest of this paper, we will refer to all responses to positive experiences as Group A and all responses to negative experiences as Group B. To represent application-specific groups, we will use the application name as a prefix, e.g. CourseAssist-A. Each participant was randomly assigned to one of the sample experiences, and after review we excluded 5, 1 and 1 responses from the CourseAssist-A, AssistHub-A and AssistHub-B groups, respectively, that failed the following attention checks: 1) spending less than half of the allocated time; and/or 2) responding to the questionnaire in a pattern. This resulted in 53, 50, 50, and 50 respondents for the CourseAssist-A, CourseAssist-B, AssistHub-A and AssistHub-B groups respectively.

5.3 Metrics

For analysis, we introduce the following notation. Given that r_j^i is participant j’s response to item i, the participant’s total is r_j and the item total is r^i. We transform 5-step Likert responses to numbers as follows: Strongly Disagree-1, Somewhat Disagree-2, Neutral-3, Somewhat Agree-4, and Strongly Agree-5. Accordingly, for the 18-item XEQ Scale, r_j ≤ 90 (5 × 18).
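As a concrete illustration of this transformation, the short Python sketch below maps Likert labels to numbers and totals a participant’s responses; the mapping follows the text above, while the helper name is ours.

```python
LIKERT = {
    "Strongly Disagree": 1,
    "Somewhat Disagree": 2,
    "Neutral": 3,
    "Somewhat Agree": 4,
    "Strongly Agree": 5,
}

def participant_total(responses):
    """Sum a participant's 18 Likert responses; bounded above by 90 (5 x 18)."""
    return sum(LIKERT[label] for label in responses)

# A participant answering "Somewhat Agree" to all 18 items scores 4 x 18 = 72.
print(participant_total(["Somewhat Agree"] * 18))
```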
5.4 Results

5.4.1 Internal Consistency

The Item-Total Correlation column of Table 2 reports the item-total correlation for each item. All items met the baseline criterion of iT ≥ 0.5 and the baseline criteria for inter-item correlation. Cronbach’s alpha is 0.9562, which also indicates strong internal consistency.
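These internal-consistency statistics can be computed from the participants-by-items response matrix with standard formulas. The sketch below (our own helper functions, using numpy) shows one common way to do so; note that the paper does not state whether the item-total correlation includes or excludes the item itself, so the corrected (item-excluded) variant shown here is an assumption.

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) matrix of numeric responses."""
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def item_total_correlations(X: np.ndarray) -> np.ndarray:
    """Correlation of each item with the total of the remaining items."""
    correlations = []
    for i in range(X.shape[1]):
        rest_total = np.delete(X, i, axis=1).sum(axis=1)
        correlations.append(np.corrcoef(X[:, i], rest_total)[0, 1])
    return np.array(correlations)

# Toy usage with random 5-point responses (not the study's data).
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(203, 18)).astype(float)
print(round(cronbach_alpha(X), 4), item_total_correlations(X)[:3])
```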
Table 2. Results

#  Item  I-CVI  Item-Total Correlation  One-Factor Loading  Factor
1 The explanations received throughout the experience were consistent† . 1.0000 0.6274 0.6076 Engagement
2 The experience helped me understand the reliability of the AI system. 1.0000 0.6416 0.6300 Learning
3 I am confident about using the AI system. 1.0000 0.7960 0.7790 Utility
4 The information presented during the experience was clear. 1.0000 0.7666 0.7605 Learning
5 The experience was consistent with my expectations† . 0.9231 0.7959 0.7831 Fulfilment
6 The presentation of the experience was appropriate for my requirements† . 0.9231 0.8192 0.8083 Fulfilment
7 The experience has improved my understanding of how the AI system works. 0.9231 0.6169 0.5859 Learning
8 The experience helped me build trust in the AI system. 0.9231 0.7160 0.7018 Learning
9 The experience helped me make more informed decisions. 0.9231 0.7460 0.7279 Utility
10 I received the explanations in a timely and efficient manner. 0.8462 0.7015 0.6841 Engagement
11 The information presented was personalised to the requirements of my role† . 0.8462 0.7057 0.6801 Utility
12 The information presented was understandable within the req. of my role† . 0.8462 0.7876 0.7803 Utility
13 The information presented showed me that the AI system performs well† . 0.8462 0.8112 0.8016 Fulfilment
14 The experience helped to complete the intended task using the AI system. 0.8462 0.8299 0.8241 Utility
15 The experience progressed sensibly† . 0.7692 0.8004 0.7912 Engagement
16 The experience was satisfying. 0.7692 0.7673 0.7529 Fulfilment
17 The information presented during the experience was sufficiently detailed. 0.7692 0.8168 0.8035 Utility
18 The experience provided answers to all of my explanation needs. 0.7692 0.8472 0.8444 Fulfilment
6 Discussion

6.1 Implications and Limitations

In psychometric theory, conducting a pilot study involves administering both the scale under development and existing scales to participants. The objective is to assess the correlation between the new scale and those found in existing literature, particularly in shared dimensions. However, our pilot studies did not incorporate this, since to the best of our knowledge there are no previous scales that measure XAI experience quality or multi-shot XAI experiences. While the System Causability Scale [11] is the closest match in the literature, it was not applicable as it featured in the initial item bank. Also, the current pilot studies had limited variability in application domains. To address this limitation, we are currently planning pilot studies with two medical applications: 1) fracture prediction in radiographs; and 2) liver disease prediction from CT scans. In the future, we will further validate and refine the scale as necessary.

6.2 XEQ Scale in Practice

Table 3 presents a representative scenario of how the XEQ Scale is used in practice. It presents the XAI Experience Quality analysis of a hypothetical XAI system based on the XEQ Scale administered to 5 stakeholders. The items in Table 3 are organised according to their evaluation dimensions, and the stakeholder responses are randomly generated and quantified as follows: Strongly Agree-5; Somewhat Agree-4; Neutral-3; Somewhat Disagree-2; Strongly Disagree-1. Based on the responses, we calculate the following scores.

Table 3. XEQ Scale in Practice

Factor        Item#   Stakeholder #1  #2  #3  #4  #5   Factor Mean
Learning        2                 3   2   4   3   3
                4                 2   1   4   3   4
                7                 4   3   3   4   2
                8                 3   3   2   1   4        2.90
Utility         3                 2   4   5   3   4
                9                 4   2   4   4   5
               11                 4   2   3   5   3
               12                 5   2   5   2   3
               14                 5   3   5   3   4
               17                 4   2   4   4   5        3.67
Fulfilment      5                 3   5   4   2   5
                6                 3   4   5   5   4
               13                 3   3   3   2   3
               16                 4   3   4   5   4
               18                 4   4   3   2   5        3.68
Engagement      1                 1   2   4   2   3
               10                 4   3   2   4   3
               15                 2   1   3   3   1        2.40
Stakeholder's XEQ Score        3.22 2.72 3.72 3.17 3.61
System XEQ Score               3.16
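The score definitions referenced above are not reproduced in this excerpt. The Python sketch below shows one plausible aggregation that is consistent with the structure of Table 3 - a mean per factor, a mean per stakeholder over all 18 responses, and a system-level score that summarises the factor means - but the exact formulas are our assumption, not quoted from the paper; the item-to-factor grouping follows Table 2.

```python
import statistics

# Item numbers per evaluation dimension, following the Factor column of Table 2.
FACTORS = {
    "Learning": [2, 4, 7, 8],
    "Utility": [3, 9, 11, 12, 14, 17],
    "Fulfilment": [5, 6, 13, 16, 18],
    "Engagement": [1, 10, 15],
}

def factor_means(responses):
    """responses: {stakeholder: {item_number: score in 1..5}}.
    Mean response per factor, pooled over all stakeholders."""
    return {
        factor: statistics.mean(r[i] for r in responses.values() for i in items)
        for factor, items in FACTORS.items()
    }

def stakeholder_xeq(responses):
    """Each stakeholder's mean response over all 18 items."""
    return {s: statistics.mean(r.values()) for s, r in responses.items()}

def system_xeq(responses):
    """Summarise the system as the mean of the four factor means."""
    return statistics.mean(factor_means(responses).values())
```

For reference, averaging the four factor means reported in Table 3 gives (2.90 + 3.67 + 3.68 + 2.40) / 4 ≈ 3.16, the reported System XEQ Score.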
6.3 XEQ Benchmark Development
Both Figures 4 and 3 are annotated with notes that describe the
XAI features that were available to the stakeholder. Finally, Figure 5
presents a preview of the Study page.