A User-Centered Explainable Artificial Intelligence Approach For Financial Fraud Detection
Keywords: Financial fraud detection; Explainable artificial intelligence; SHAP

Abstract

This paper aims to produce user-centered explanations for financial fraud detection models based on explainable artificial intelligence (XAI) methods. By combining an ensemble predictive model with an explainable framework based on Shapley values, we develop a financial fraud detection approach that is accurate and explainable at the same time. Our results show that the explainable framework can meet the requirements of different external stakeholders by producing local and global explanations. Local explanations can help understand why a specific prediction is identified as fraud, and global explanations reveal the overall logic of the whole ensemble model.
1. Introduction
Recent years have witnessed artificial intelligence (AI)—specifically, machine learning (ML)—being increasingly used in financial fraud detection to improve predictive performance (Abbasi et al., 2012; Bao et al., 2020; Bertomeu et al., 2021; Perols et al., 2017). The high predictive performance of ML models often comes at the expense of explainability, that is, the ability to explain why a model generates a certain output (Gunning et al., 2019; Meske et al., 2022). Both regulators and ethicists are concerned with the explainability of AI systems (EU, 2016; European Commission, 2019; FSB, 2017; Kim and Routledge, 2022). The lack of explainability has gradually become one of the major challenges in the practical application of ML-based fraud detection models for stakeholders.
Explainable artificial intelligence (XAI) comprises methods that can achieve both explainability and high predictive performance (Bauer et al., 2023). Providing explanations is the key element that enables human users to understand how ML models mine and leverage information. Implementing and using XAI methods is conducive to building justified confidence, enabling enhanced control, improving applied models, and discovering new facts (Adadi and Berrada, 2018). XAI methods assist stakeholders in making more informed decisions instead of blindly trusting the output of an ML model.
Producing user-centered explanations is critical for applications where complex ML models inform how stakeholders make decisions. However, most studies design XAI approaches without evaluating whether the explanations satisfy the needs of real users (Miller, 2019). In order to produce explanations that satisfy users, we seek to link the needs of intended users with state-of-the-art XAI techniques in financial fraud detection. Furthermore, previous XAI studies have focused more on developers, helping them improve the development process (Bhatt et al., 2020). In this study, we focus on stakeholders concerned about financial fraud who are external to the development process.
This paper proposes a systematic analytical framework in response to the challenge that ML-based fraud detection models have to
be human-understandable. Our contribution is based on linking the XAI methods with the decision-making requirements of external
stakeholders concerned about financial fraud. The objective is to provide a bridge between ML applications in fraud detection and
external stakeholders to meet their decision needs. We develop an accurate and explainable financial fraud detection approach for
tackling the task. By combining a predictive model with an explainable framework, the approach can provide external stakeholders
with accurate prediction results and user-centered explanations.
Specifically, we build an ensemble model based on raw financial statement data to predict financial fraud. Furthermore, we
describe external stakeholders concerned about financial fraud and analyze their explainability requirements for ML-based detection
models, and then develop an explainable framework based on Shapley values. We present four types of explanations in a meaningful
context, which can be grouped into local and global explanations. Local explanations provide information about why a specific prediction is identified as fraud and how stakeholders can plan an investigation according to the machine's logic. Global explanations convey the degree of confidence in the whole model and reveal the relationships between predictions and features from a global perspective.
2. Data

The dataset spans from 2007 to 2020 with 37,502 firm-year observations of Chinese non-financial companies, including 432 fraudulent observations sanctioned by regulators and 37,070 nonfraudulent observations.
Following Bao et al. (2020) and Achakzai and Juan (2022), we use the raw financial data in financial statements as our features. Machines may capture possibly unknown patterns from complex data by themselves (Agarwal and Dhar, 2014). In this study, we use a more comprehensive feature set of raw financial data, taken directly from the three key financial statements (balance sheet, income statement, and cash flow statement), to incorporate more information and unleash the power of the ML model (Bertomeu et al., 2021; Chen et al., 2022a). The feature list of raw financial data is shown in the Appendix. The data in this study are collected from the China Stock Market and Accounting Research (CSMAR) database. As XGBoost can deal with missing values automatically, we do not treat them separately.
We separate the dataset into a training dataset (2007–2017) and a testing dataset (2018–2020). The training dataset is further divided into a training data subset (2007–2015) and a validation dataset (2016–2017). For each candidate hyperparameter configuration, we train a model on the training data subset; the configuration that generates the highest out-of-sample AUC on the validation dataset is chosen via grid search. The whole training dataset is then used to train the final model with the selected hyperparameters.
3. Methodology
There are four external stakeholder groups concerned about financial fraud (Dechow et al., 2011). (i) Regulators. They need to
identify the companies suspected of being involved in fraud to strengthen investor protection and improve regulatory policies. (ii)
Auditors. They want to obtain more reliable audit evidence about financial fraud to support their audit opinions and reduce audit risks.
(iii) Analysts. They desire to improve the assessment accuracy of financial fraud to prevent reputational damage. (iv) Investors, e.g.
outside shareholders, banks, or other creditors. They hope to select companies with low fraud risks to avoid investment losses.
We further analyze the underlying explainability requirements of these external stakeholders. Table 1 outlines the external stakeholder groups and their requirements for explanations in financial fraud detection; for example, regulators may seek an explanation to assess confidence in the whole detection model as well as an explanation that reveals outliers in a specific prediction as further investigation clues. The approach proposed below is designed to address these explainability requirements.
Table 1
The explainability requirements for external stakeholders in financial fraud detection (columns: explainability requirements; regulators; auditors; analysts; investors).

Financial fraud detection faces a class imbalance problem. We therefore generate multiple undersampled training subsets, each comprised of all the fraudulent observations and an equally sized random subset of nonfraudulent observations.
For each training subset, an XGBoost classifier is trained. XGBoost is a variant of gradient boosting that has shown good performance in diverse domain applications.
We use majority voting, which has been shown to be robust compared to other combination methods (Genre et al., 2013; Sesmero et al., 2021; Wang et al., 2022), to combine the base classifiers. The final prediction is generated as follows:
$$\hat{y} = \arg\max_{c} \sum_{t=1}^{T} \chi_A\big(f_t(x) = c\big) \tag{1}$$

where T is the number of base classifiers, f_t(x) is the predicted output of base classifier f_t, and χ_A(f_t(x) = c) is the indicator function that equals 1 when f_t predicts class c, so the sum counts the base classifiers voting for class c.
Then, for each observation, the base classifiers that generate the same prediction as the final result are considered right models. The right-model set is expressed as R = {f_1, f_2, …, f_H}, where f_h(x) = ŷ for every f_h in R.
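A minimal sketch of the undersampling ensemble, the majority vote of equation (1), and the per-observation right-model sets, continuing the hypothetical variables (train_all, test, features, best_params) from the earlier sketch; the count of 10 subsets matches the experiments reported below:

```python
import numpy as np

N_SUBSETS = 10  # number of undersampled subsets / base classifiers (T in equation (1))
fraud = train_all[train_all["fraud"] == 1]
nonfraud = train_all[train_all["fraud"] == 0]

base_clfs = []
for seed in range(N_SUBSETS):
    # Each subset: all fraudulent observations plus an equally sized
    # random draw of nonfraudulent observations.
    subset = pd.concat([fraud, nonfraud.sample(n=len(fraud), random_state=seed)])
    clf = xgb.XGBClassifier(n_estimators=700, **best_params)
    clf.fit(subset[features], subset["fraud"])
    base_clfs.append(clf)

# Majority vote over base-classifier predictions (equation (1)); ties go to non-fraud.
votes = np.stack([clf.predict(test[features]) for clf in base_clfs])  # shape (T, n_obs)
y_hat = (votes.sum(axis=0) > N_SUBSETS / 2).astype(int)

# "Right models" for each observation: base classifiers that agree with the ensemble vote.
right_models = [[t for t in range(N_SUBSETS) if votes[t, i] == y_hat[i]]
                for i in range(votes.shape[1])]
```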
SHAP (SHapley Additive exPlanations) is a model-agnostic method that can provide explanations for complex models in various
fields, such as credit risk management (Bussmann et al., 2021; Wu et al., 2022) or crypto asset allocation (Babaei et al., 2022).
Compared to alternative model-agnostic XAI methods, SHAP is the only explanation model that guarantees three desirable properties: local accuracy, consistency, and missingness (Lundberg and Lee, 2017). Therefore, we use SHAP, which is based on game-theoretic Shapley values, to generate explanations:
$$f(x) = \phi_0(f) + \sum_{i=1}^{M} \phi_i(f, x) \tag{2}$$

where ϕ0(f) is the expectation of the model output values, and the sum of the ϕi(f, x) matches the original tree-based model output f(x). The term ϕi(f, x) is the contribution of feature x_i, calculated as follows:

$$\phi_i(f, x) = \frac{1}{M} \sum_{S \subseteq N \setminus \{i\}} \frac{v(S \cup \{i\}) - v(S)}{\binom{M-1}{|S|}} \tag{3}$$

where N is the set of all M features, S is a feature subset that excludes feature i, and v(·) is the value function giving the model output for a given feature coalition.
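As a worked illustration of equation (3), a didactic sketch with a toy value function rather than the paper's implementation (in practice, SHAP's TreeExplainer computes these values efficiently for tree ensembles):

```python
from itertools import combinations
from math import comb

def shapley_values(value_fn, n_features):
    """Exact Shapley values via equation (3) for a value function v(S)."""
    all_features = set(range(n_features))
    phis = []
    for i in range(n_features):
        others = all_features - {i}
        phi = 0.0
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                S = set(S)
                # Marginal contribution of feature i to coalition S,
                # weighted by 1 / (M * C(M-1, |S|)) as in equation (3).
                phi += (value_fn(S | {i}) - value_fn(S)) / (n_features * comb(n_features - 1, size))
        phis.append(phi)
    return phis

# Toy value function over 3 features (hypothetical payoffs for each coalition).
payoffs = {frozenset(): 0.0, frozenset({0}): 1.0, frozenset({1}): 2.0, frozenset({2}): 0.5,
           frozenset({0, 1}): 4.0, frozenset({0, 2}): 2.0, frozenset({1, 2}): 3.0,
           frozenset({0, 1, 2}): 6.0}
# The three values sum to v(N) - v(empty set) = 6.0, the additivity property used below.
print(shapley_values(lambda S: payoffs[frozenset(S)], 3))
```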
(1) Local explanations for a given observation. Shapley values satisfy the additivity property, so we calculate the mean of the Shapley values produced by the right models and take this average as the final local explanation. The explanations generated by the individual base classifiers are also compared for further analysis.
(2) Global explanation for the whole ensemble model. We average the absolute values of the final Shapley values over all observations of interest to generate a global explanation of the ensemble model. We then reveal the impact of each feature on the final results and the interaction between pairs of features.
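A minimal sketch of how these local and global explanations could be assembled with the shap library, continuing the earlier sketches (all variable names hypothetical):

```python
import shap

# One TreeExplainer per base classifier; SHAP values are in the model's margin (log-odds) space.
explainers = [shap.TreeExplainer(clf) for clf in base_clfs]
shap_values = [ex.shap_values(test[features]) for ex in explainers]  # each array is (n_obs, M)

# (1) Local explanation of observation i: mean SHAP values over its right models.
def local_explanation(i):
    return np.mean([shap_values[t][i] for t in right_models[i]], axis=0)

# (2) Global explanation: mean absolute final SHAP value per feature, sorted by importance.
final_shap = np.stack([local_explanation(i) for i in range(len(test))])
global_importance = (pd.Series(np.abs(final_shap).mean(axis=0), index=features)
                     .sort_values(ascending=False))
```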
4. Empirical analysis
In this section, we apply the proposed approach for financial fraud detection to actual Chinese data. The hyperparameters used for XGBoost are eta:0.5, gamma:0.2, n_estimators:700, max_depth:5, min_child_weight:1, subsample:0.6, nthread:3, colsample_bytree:0.9, reg_alpha:1e-5, scale_pos_weight:1.¹ In our experiments, the proposed ensemble model has 10 base classifiers and generates an AUC of 0.79, which improves prediction performance over XGBoost on its own (AUC = 0.58). We next examine the usefulness of the random undersampling, random oversampling, and SMOTE methods on our dataset.² The results show that random undersampling (AUC = 0.76), random oversampling (AUC = 0.63), and SMOTE (AUC = 0.69) do not perform better than our model. The explanation results for the financial fraud detection model are presented in the following sub-sections.
In this sub-section, we randomly pick observation A and observation B from the predicted fraudulent observations in the testing set. Fig. 2 shows which features contribute most to the model output for each observation and indicates the positive or negative contribution of each feature. The red color represents a positive contribution, implying that the feature increases fraud risk; the blue color indicates a negative impact on the probability of fraud. Fig. 2 clearly shows that the local explanations for individual observations are personalized. For example, “Prepayments” has the largest positive contribution for observation A, while the positive contribution of “Goodwill” is the largest for observation B.
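A sketch of how a per-observation contribution plot in the style of Fig. 2 could be drawn (the observation index is hypothetical; a SHAP force plot is one common rendering):

```python
i = 0  # hypothetical index of observation A within the testing set
phi = local_explanation(i)
base_value = np.mean([explainers[t].expected_value for t in right_models[i]])

# Red segments push the prediction toward fraud, blue segments toward non-fraud.
shap.force_plot(base_value, phi, test[features].iloc[i], matplotlib=True)
```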
¹ Detailed explanations of the hyperparameters can be found at: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier.
² Other methods to address class imbalance problems, such as extreme value models (Calabrese and Giudici, 2015), may be effective in the fraud setting. Future research can examine this and other alternative methods in a fraud context.
We use decision plots to further interpret the predictions of the fraud detection ensemble model. In this part, we compare predictions from the base classifiers of the ensemble model as clues for stakeholders' further investigations. Fig. 3 shows the decision plots of the local explanations, which depict how the ML models arrive at their outputs. A legend identifies the prediction of each base classifier, in which “1” represents the predicted label of fraud and “0” represents the predicted label of non-fraud. As shown in Fig. 3(b), different base classifiers generate inconsistent results. It might be worth investigating the difference between the decision paths of the different predicted labels. These decision plots may provide richer information about the local explanations of the ensemble model for supporting users’ decision-making.
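A sketch of a decision plot comparing the base classifiers' explanations for one observation, as in Fig. 3 (index hypothetical; each curve is labelled with that base classifier's predicted class):

```python
i = 1  # hypothetical index of an observation on which the base classifiers disagree
per_model_shap = np.stack([shap_values[t][i] for t in range(N_SUBSETS)])
labels = [str(int(votes[t, i])) for t in range(N_SUBSETS)]  # "1" = fraud, "0" = non-fraud

shap.decision_plot(float(np.mean([ex.expected_value for ex in explainers])),
                   per_model_shap, feature_names=list(features),
                   legend_labels=labels, legend_location="lower right")
```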
Stakeholders may want to understand the feature contributions to fraud for the whole model. Fig. 4 provides an overall view of the global explanation across all observations in the testing dataset. The red color means the value of the feature is high, while the blue
color indicates the feature has a low value. Fig. 4 shows the positive or negative impacts of the features sorted by feature importance
and depicts the distributions of SHAP values for these features. The explanation for important features is roughly consistent with
existing domain knowledge from experts. For example, features associated with related-party transactions, such as prepayment, are
also used as important features in previous fraud detection studies (Achakzai and Juan, 2022; Wei et al., 2017). Moreover, we show
that goodwill, rarely used in previous fraud detection studies, is an important feature for fraud detection. This conclusion is in line with
the findings of the accounting literature, which show that managers could manipulate goodwill impairment to inflate earnings (Han
et al., 2021; Li and Sloan, 2017).
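A sketch of the beeswarm-style summary plot behind Fig. 4, using the averaged final SHAP values from the earlier sketch:

```python
# Each dot is one test observation; colour encodes the feature value,
# horizontal position encodes the SHAP value; features are sorted by mean |SHAP|.
shap.summary_plot(final_shap, test[features], feature_names=features)
```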
(2) What are the relationships between outputs and features? Are there feature interactions?
We take the top four features in Fig. 4 as examples and present their dependence scatter plots based on SHAP values in Fig. 5. The distribution of data values is depicted by the grey area. Fig. 5 shows the intuitive relationships between predictions and the features of interest from a global perspective. For example, Fig. 5(a) reveals the impact of “Goodwill” on the predicted outcome: when “Goodwill” exceeds roughly 2e8, an increase in “Goodwill” implies an increased probability of fraud.
The vertical dispersion in Fig. 5(d) shows that the same value of “Operating profit” can make a different contribution to the model output for different observations. This may indicate interaction effects between “Operating profit” and other features. To further understand how feature interactions affect the model predictions, we present the interactions between “Operating profit” and two other features. Fig. 6 shows that different values of “Goodwill” or “Total assets” change the impact of “Operating profit” on the model results because of the interactions between features.
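A sketch of the dependence and interaction plots behind Figs. 5 and 6 (the quoted feature names are placeholders for the corresponding columns in the feature list):

```python
# Fig. 5 style: relationship between a feature's value and its SHAP contribution.
shap.dependence_plot("Goodwill", final_shap, test[features], interaction_index=None)

# Fig. 6 style: colouring by a second feature exposes interaction effects.
shap.dependence_plot("Operating profit", final_shap, test[features], interaction_index="Goodwill")
shap.dependence_plot("Operating profit", final_shap, test[features], interaction_index="Total assets")
```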
5. Conclusions
This paper proposes an accurate and explainable financial fraud detection approach to meet the needs of external stakeholders. We
demonstrate the adoption of SHAP to provide stakeholders with user-centered explanations including local explanations and global
explanations. The generated explanations in this study are not causal drivers of fraud; rather, they are the main drivers of the ML-based predictions, with desirable theoretical properties. These explanations can then be reviewed by accounting experts for further analysis.
Future research can apply other XAI techniques in our setting, such as Shapley Lorenz values (Giudici and Raffinetti, 2021), for a
detailed comparison. Future research can also extend our analysis by considering other potential informational requirements, such as
measuring fairness (see e.g. Chen et al., 2022b) in fraud detection models. Furthermore, an interesting extension would be to improve
our approach by developing a selection procedure based on predictive performance, which may need appropriate statistical testing.
Funding
This work was supported by the National Natural Science Foundation of China [grant numbers 72071021, 71671019].
CRediT authorship contribution statement

Ying Zhou: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing –
original draft. Haoran Li: Writing – review & editing. Zhi Xiao: Funding acquisition, Project administration, Resources, Supervision.
Jing Qiu: Validation.
Data availability
Appendix. Features
References
Abbasi, A., Albrecht, C., Vance, A., Hansen, J., 2012. MetaFraud: a meta-learning framework for detecting financial fraud. MIS Q. 36, 1293.
Achakzai, M.A.K., Juan, P., 2022. Using machine learning meta-classifiers to detect financial frauds. Finance Res. Lett. 48, 102915.
Adadi, A., Berrada, M., 2018. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160.
Agarwal, R., Dhar, V., 2014. Editorial—big data, data science, and analytics: the opportunity and challenge for IS research. Inf. Syst. Res. 25, 443–448.
Babaei, G., Giudici, P., Raffinetti, E., 2022. Explainable artificial intelligence for crypto asset allocation. Finance Res. Lett. 47, 102941.
Bao, Y., Ke, B., Li, B., Yu, Y.J., Zhang, J., 2020. Detecting accounting fraud in publicly traded U.S. firms using a machine learning approach. J. Account. Res. 58, 199–235.
Bauer, K., Zahn, M., Hinz, O., 2023. Expl(AI)ned: the impact of explainable artificial intelligence on users’ information processing. Inf. Syst. Res. Articles in Adv. 1–21.
Bertomeu, J., Cheynel, E., Floyd, E., Pan, W., 2021. Using machine learning to detect misstatements. Rev. Account. Stud. 26, 468–519.
Bhatt, U., Xiang, A., Sharma, S., Weller, A., Eckersley, P., 2020. Explainable machine learning in deployment. In: Proc. of the 2020 Conf. on Fairness, Accountability,
and Transparency, NY, USA, pp. 648–657.
Bussmann, N., Giudici, P., Marinelli, D., Papenbrock, J., 2021. Explainable machine learning in credit risk management. Comput. Econ. 57, 203–216.
Calabrese, R., Giudici, P., 2015. Estimating bank default with generalised extreme value regression models. J. Oper. Res. Soc. 66 (11), 1783–1792.
Chen, X., Cho, Y.H.(Tony), Dou, Y., Lev, B., 2022a. Predicting future earnings changes using machine learning and detailed financial data. J. Account. Res. 60,
467–515.
Chen, Y., Giudici, P., Liu, K., Raffinetti, E., 2022b. Measuring fairness in credit scoring. Available at SSRN 4123413.
Dechow, P.M., Ge, W., Larson, C.R., Sloan, R.G., 2011. Predicting material accounting misstatements. Contemp. Account. Res. 28, 17–82.
European Commission, 2019. Ethics guidelines for trustworthy AI.
EU, 2016. Regulation (EU) 2016/679—General data protection regulation (GDPR). Official J. European Union.
FSB, 2017. Technical Report. Financial Stability Board.
Genre, V., Kenny, G., Meyler, A., Timmermann, A., 2013. Combining expert forecasts: can anything beat the simple average? Int. J. Forecast. 29 (1), 108–121.
Giudici, P., Raffinetti, E., 2021. Shapley-Lorenz eXplainable artificial intelligence. Expert Syst. Appl. 167, 114104.
Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., Yang, G.-Z., 2019. XAI—explainable artificial intelligence. Sci. Robot. 4, eaay7120.
Han, H., Tang, J.J., Tang, Q., 2021. Goodwill impairment, securities analysts, and information transparency. Eur. Account. Rev. 30, 767–799.
Kim, T.W., Routledge, B.R., 2022. Why a right to an explanation of algorithmic decision-making should exist: a trust-based approach. Bus. Ethics Q. 32, 75–102.
Li, K.K., Sloan, R.G., 2017. Has goodwill accounting gone bad? Rev. Account. Stud. 22, 964–1003.
Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., Lee, S.-I., 2020. From local explanations to global
understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67.
Lundberg, S.M., Lee, S.-I., 2017. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information
Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, pp. 4768–4777.
Meske, C., Bunde, E., Schneider, J., Gersch, M., 2022. Explainable artificial intelligence: objectives, stakeholders, and future research opportunities. Inf. Syst. Manag.
39, 53–63.
Miller, T., 2019. Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. 267, 1–38.
Perols, J.L., Bowen, R.M., Zimmermann, C., Samba, B., 2017. Finding needles in a haystack: using data analytics to improve fraud prediction. Account. Rev. 92,
221–245.
Sesmero, M.P., Iglesias, J.A., Magán, E., Ledezma, A., Sanchis, A., 2021. Impact of the learners diversity and combination method on the generation of heterogeneous
classifier ensembles. Appl. Soft Comput. 111, 107689.
Wang, X., Hyndman, R.J., Li, F., Kang, Y., 2022. Forecast combinations: an over 50-year review. Int. J. Forecast.
Wei, Y., Chen, J., Wirth, C., 2017. Detecting fraud in Chinese listed company balance sheets. PAR 29, 356–379.
Wu, J., Zhang, Z., Zhou, S.X., 2022. Credit rating prediction through supply chains: a machine learning approach. Prod. Oper. Manag. 31 (4), 1613–1629.