OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms
Giorgio Visani (a,b), Enrico Bagli (b) and Federico Chesani (a)
a University of Bologna, School of Informatics & Engineering, viale Risorgimento 2, 40136 Bologna (BO), Italy
b CRIF S.p.A., via Mario Fantin 1-3, 40131 Bologna (BO), Italy
Abstract
Local Interpretable Model-Agnostic Explanations (LIME) is a popular method to perform interpretability of any kind of Machine Learning (ML) model. It explains one ML prediction at a time, by learning a simple linear model around the prediction. The linear model is trained on randomly generated data points, sampled from the training dataset distribution and weighted according to their distance from the reference point, namely the one being explained by LIME. Feature selection is applied to keep only the most important variables, and their coefficients are regarded as the explanation. LIME is widespread across different domains, although its instability (a single prediction may obtain different explanations) is one of its major shortcomings. The instability is due to the randomness in the sampling step and determines a lack of reliability in the retrieved explanations, making the adoption of LIME problematic. In Medicine especially, clinical professionals' trust is mandatory to determine the acceptance of
an explainable algorithm, considering the importance of the decisions at stake and the related legal issues. In this paper, we highlight a trade-off between an explanation's stability and its adherence, namely how closely it resembles the ML model. Exploiting our innovative discovery, we propose a framework to maximise stability while retaining a predefined level of adherence. OptiLIME provides the freedom to choose the best adherence-stability trade-off level and, more importantly, it clearly highlights the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether the explanation is reliable, according to the problem at hand. We extensively test OptiLIME on a toy dataset, to present the geometrical findings visually, and on a medical dataset. In the latter, we show how the method comes up with meaningful explanations both from a medical and a mathematical standpoint.
Keywords
Explainable AI (XAI), Interpretable Machine Learning, Explanation, Model Agnostic, LIME, Healthcare, Stability
The stability is measured through the CSI and VSI indices [26], while the adherence is assessed using the R² statistic, which measures the goodness of the linear approximation through a set of points [32]. All the figures of merit above span the range [0, 1], where higher values define respectively higher stability and adherence.
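As a concrete reading of the adherence measure, the short sketch below computes the weighted R² of a local linear surrogate with scikit-learn. It is only an illustrative helper (the function name and arguments are ours, not the authors' code); the CSI and VSI stability indices follow the definitions given in [26] and are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adherence_r2(X_local, y_blackbox, weights):
    """Weighted R^2 of a linear surrogate fitted on points sampled around
    the reference instance.

    X_local     -- sampled points, shape (n_samples, n_features)
    y_blackbox  -- ML-model predictions on those points
    weights     -- kernel weights of the points, e.g. from Equation (1)
    """
    surrogate = LinearRegression().fit(X_local, y_blackbox, sample_weight=weights)
    # score() with sample_weight returns the weighted R^2 statistic in [0, 1]
    return surrogate.score(X_local, y_blackbox, sample_weight=weights)
```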
To fully explain the rationale of the proposition, we first cover three important concepts about LIME. In this section we employ a Toy Dataset to show our theoretical findings.

Toy Dataset

The dataset is generated from the Data Generating Process:

Y = sin(X) * X + 10

100 distinct points have been generated uniformly in the X range [0, 10] and only 20 of them were kept, at random. In Figure 1, the blue line represents the True DGP function, whereas the green one is its best approximation using a Polynomial Regression of degree 5 on the generated dataset (blue points). In the following we will regard the Polynomial as our ML function; we will not make use of the True DGP function (blue line).

LIME weights the generated points with an RBF kernel, which assigns each point an individual weight. The formulation provides smooth weights in the range [0, 1] and flexibility through the kernel width parameter kw:

RBF(x^(i)) = exp( −||x^(i) − x^(ref)||² / kw )    (1)

The RBF flexibility makes it suitable to each situation, although it requires proper tuning: a high kw value results in a neighbourhood of large dimension, whereas shrinking kw shrinks the width of the neighbourhood.
In Figure 2, the LIME generated points are displayed as green dots and the corresponding LIME explanations (red lines) are shown. The points are scattered all over the ML function; however, their size is proportional to the weight assigned by the RBF kernel. Small kernel widths assign significant weights only to the closest points, making the further ones almost invisible. In this way, they do not contribute to the local linear model.
The concept of locality is crucial to LIME: a neighbourhood that is too large may cause the LIME model not to be adherent to the ML function in the considered neighbourhood.
Figure 2: LIME explanations for different kernel widths
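The toy setting and the role of the kernel width can be reproduced in a few lines. The following sketch is a simplified illustration rather than the original experimental code: it draws the DGP sample, fits the degree-5 Polynomial Regression used as the ML function, and fits RBF-weighted local linear surrogates around an arbitrary reference point for a small and a large kernel width (uniform sampling stands in for LIME's sampling from the training distribution).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Data Generating Process: Y = sin(X) * X + 10 on the range [0, 10];
# 100 uniform points are drawn and only 20 of them are kept at random.
X_all = rng.uniform(0, 10, size=100)
X = np.sort(rng.choice(X_all, size=20, replace=False))
y = np.sin(X) * X + 10

# Degree-5 Polynomial Regression: the stand-in "ML function" to be explained.
ml_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
ml_model.fit(X.reshape(-1, 1), y)

def rbf_weights(x_sampled, x_ref, kw):
    """Equation (1): RBF(x_i) = exp(-||x_i - x_ref||^2 / kw)."""
    return np.exp(-((x_sampled - x_ref) ** 2) / kw)

def local_explanation(x_ref, kw, n_samples=500):
    """LIME-style weighted linear surrogate of the ML function around x_ref.
    Uniform sampling is used here for simplicity; LIME samples from the
    training-data distribution."""
    x_sampled = rng.uniform(0, 10, size=n_samples)
    y_ml = ml_model.predict(x_sampled.reshape(-1, 1))
    w = rbf_weights(x_sampled, x_ref, kw)
    surrogate = LinearRegression().fit(x_sampled.reshape(-1, 1), y_ml, sample_weight=w)
    adherence = surrogate.score(x_sampled.reshape(-1, 1), y_ml, sample_weight=w)
    return surrogate.coef_[0], adherence

# A small kernel width hugs the ML curve locally (high adherence), while a
# large one averages over a wide neighbourhood (lower adherence).
for kw in (0.1, 10.0):
    slope, r2 = local_explanation(x_ref=5.0, kw=kw)
    print(f"kw = {kw:5.1f}   local slope = {slope:+.3f}   adherence R^2 = {r2:.3f}")
```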
4. Case Study
Dataset

To validate our methodology we use a well-known medical dataset: NHANES I. It has been employed for medical research [37], [38] as well as a benchmark to test explanation methods [39]. The original dataset is described in [40]. We use a reformatted version, released at http://github.com/suinleelab/treexplainer-study. It contains 79 features, based on clinical measurements of 14,407 individuals. The aim is to model the risk of death over twenty years of follow-up.

Diagnostic Algorithm

Following the prescriptions of Lundberg et al. [39], the dataset has been divided into a 64/16/20 split for train/validation/test. The features have been mean-imputed and standardized based on statistics computed on the training set. A Survival Gradient Boosting model has been trained, using the XGBoost framework [41]. Its hyper-parameters have been optimized by coordinate descent, using the C-statistic [42] on the validation set as the figure of merit.
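A diagnostic algorithm of this kind can be assembled roughly as follows. The sketch below uses synthetic stand-in data and placeholder hyper-parameter values (the actual study tunes them by coordinate descent on the validation C-statistic [42]); only the XGBoost Cox survival objective and the 64/16/20 split mirror the setup described above.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the NHANES I table: 79 standardized clinical features
# and follow-up times. With XGBoost's Cox objective, right-censored individuals
# are encoded by a negative label.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 79))
time = rng.exponential(scale=10.0, size=2000)
event = rng.integers(0, 2, size=2000)            # 1 = death observed, 0 = censored
y = np.where(event == 1, time, -time)

# 64/16/20 split for train / validation / test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.20, random_state=0)

params = {
    "objective": "survival:cox",   # Cox proportional hazards objective
    "eta": 0.01,                   # placeholder values: the original study tunes
    "max_depth": 4,                # the hyper-parameters by coordinate descent
    "subsample": 0.8,              # against the validation C-statistic [42]
}
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
booster = xgb.train(params, dtrain, num_boost_round=500,
                    evals=[(dval, "validation")], early_stopping_rounds=50)

# The prediction is the hazard ratio of each individual: higher values mean a
# shorter expected survival time, so positive LIME coefficients act as risk factors.
hazard_ratio = booster.predict(xgb.DMatrix(X_test))
```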
Explanations

We use the OptiLIME framework to achieve the optimal explanation of the XGBoost model on the dataset. We consider two randomly chosen individuals to visually show the results. In our simulation, we consider 0.9 as a reasonable level of adherence. OptiLIME is employed to find the proper kernel width to achieve an R² value close to 0.9 while maximizing the stability indices of the local explanation models.
The model prediction consists in the hazard ratio for each individual: a higher prediction means the individual is likely to survive a shorter time. Therefore, positive coefficients define risk factors, whereas protective factors have negative values.
The LIME model is interpreted in the same way as a Linear Regression model, but with the additional concept of locality. As an example, for the Age variable we distinguish a different impact based on the individual characteristics: one more year of age for Unit 100 (increasing from 65 to 66 years) raises the death risk by 3.56 base points, whereas one year of ageing for Unit 7207 (from 49 to 50) increases the risk by just 0.79. Another example is the impact of Sex: it is more pronounced in older people (being female is a protective factor worth 1.49 points at age 49, while at age 65 being male has a much worse impact, as a risk factor worth 3.04).

Figure 6: NHANES individual Explanations using OptiLIME. (b) Best LIME Explanation, Unit 7207.

For Unit 100 in Figure 6a, the optimal kernel width is a bit higher compared with Unit 7207 in Figure 6b. This is probably caused by the ML model having a higher degree of non-linearity for the latter unit: to achieve the same adherence, we are forced to consider a smaller portion of the ML model, hence a smaller neighbourhood. A smaller kernel width also implies reduced stability, testified by small values of the VSI and CSI indices. Whenever the practitioner desires more stable results, it is possible to re-run OptiLIME with a less strict requirement on the adherence. It is important to remark that low degrees of adherence will make the explanations increasingly more global: the linear surface retrieved by LIME will consist in an average of many local non-linearities of the ML model.
The computation time largely depends on the Bayesian Search, controlled by the parameters p and m. In our setting, p = 10 and m = 30 produce good results for both the units in Figure 6. On a laptop with 4 Intel i7 CPUs at 2.50 GHz, the OptiLIME evaluation for Unit 100 and Unit 7207 took respectively 123 and 147 seconds to compute. For faster, but less accurate, results the Bayesian Search parameters can be reduced.
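Conceptually, the kernel width search can be pictured as a one-dimensional problem: since stability grows with kw while adherence decreases, one can look for the largest kw whose explanation still reaches the requested adherence. The sketch below illustrates the idea with a plain bisection under a monotonicity assumption, reusing the toy local_explanation helper defined earlier; the actual framework performs a constrained Bayesian Search [36] on the black box at hand (here the NHANES XGBoost model rather than the toy function), driven by the parameters p and m mentioned above.

```python
def find_kernel_width(explain, x_ref, target_adherence=0.9,
                      kw_low=1e-3, kw_high=50.0, steps=30):
    """Largest kernel width whose local explanation still reaches the target
    adherence, assuming adherence decreases monotonically as kw grows.

    `explain(x_ref, kw)` must return (coefficients, adherence R^2), e.g. the
    toy `local_explanation` helper above. OptiLIME replaces this bisection
    with a constrained Bayesian Search [36]."""
    for _ in range(steps):
        kw_mid = 0.5 * (kw_low + kw_high)
        _, adherence = explain(x_ref, kw_mid)
        if adherence >= target_adherence:
            kw_low = kw_mid    # still adherent enough: a wider neighbourhood is allowed
        else:
            kw_high = kw_mid   # too global: shrink the neighbourhood
    return kw_low

# Example with the toy helper from the previous sketch:
# best_kw = find_kernel_width(local_explanation, x_ref=5.0, target_adherence=0.9)
```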
5. Conclusions

In Medicine, diagnostic computer algorithms providing accurate predictions have countless benefits; notably, they may help in saving lives as well as in reducing medical costs. However, precisely because of the importance of these matters, the rationale of the decisions must be clear and understandable. A plethora of techniques to explain ML decisions has grown in recent years, though there is no consensus on the best in class, since each method presents some drawbacks. Explainable models are required to be reliable, thus stability is regarded as a key desideratum.
We consider the LIME technique, whose major drawback lies in its lack of stability. Moreover, it is difficult to tune its main parameter properly: different values of the kernel width provide substantially different explanations.
The main contribution of this paper consists in the clear decomposition of the LIME framework into its relevant components and the exhaustive analysis of each one, starting from the geometrical meaning through the empirical experiments that validate our intuitions. We showed that the Ridge penalty is not needed and that LIME works best with simple Linear Regression as the explainable model. In addition, smaller kernel width values provide a LIME plane more adherent to the ML surface, and therefore a more realistic local explanation. Eventually, the trade-off between the adherence and stability properties is extremely valuable, since it empowers the practitioner to choose the best kernel width consciously.
We exploit these findings in order to tackle LIME's weak points. The result is the OptiLIME framework, which represents a new and innovative contribution to the scientific community. OptiLIME achieves stability of the explanations and automatically finds the proper kernel width value, according to the practitioner's needs.
The framework may serve as an extremely useful tool: using OptiLIME, the practitioner knows how much to trust the explanations, based on their stability and adherence values. Nonetheless, we acknowledge that the optimization framework may be improved to allow for a faster and more precise computation.

Acknowledgments

We acknowledge financial support by CRIF S.p.A. and Università degli Studi di Bologna.

References

[1] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, H. Müller, Causability and explainability of artificial intelligence in medicine, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019) e1312.
[2] I. Kononenko, Machine learning for medical diagnosis: History, state of the art and perspective, Artificial Intelligence in Medicine 23 (2001) 89–109.
[3] R. Miotto, L. Li, B. A. Kidd, J. T. Dudley, Deep patient: An unsupervised representation to predict the future of patients from the electronic health records, Scientific Reports 6 (2016) 1–10.
[4] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, J. Sun, Doctor AI: Predicting clinical events via recurrent neural networks, in: Machine Learning for Healthcare Conference, 2016, pp. 301–318.
[5] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, Scalable and accurate deep learning with electronic health records, NPJ Digital Medicine 1 (2018) 18.
[6] B. Shickel, P. J. Tighe, A. Bihorac, P. Rashidi, Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis, IEEE Journal of Biomedical and Health Informatics 22 (2017) 1589–1604.
[7] Z. Che, D. Kale, W. Li, M. T. Bahadori, Y. Liu, Deep computational phenotyping, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 507–516.
[8] T. A. Lasko, J. C. Denny, M. A. Levy, Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data, PloS One 8 (2013).
[9] E. J. Topol, High-performance medicine: The convergence of human and artificial intelligence, Nature Medicine 25 (2019) 44–56.
[10] A. Holzinger, From machine learning to explainable AI, in: 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), IEEE, 2018, pp. 55–66.
[11] C. Molnar, Interpretable Machine Learning, Lulu.com, 2020.
[12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys (CSUR) 51 (2018) 93.
[13] M. T. Ribeiro, S. Singh, C. Guestrin, Why should I trust you?: Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1135–1144.
[14] G. J. Katuwal, R. Chen, Machine learning model interpretability for precision medicine, arXiv preprint arXiv:1610.09045 (2016).
[15] A. Y. Zhang, S. S. W. Lam, N. Liu, Y. Pang, L. L. Chan, P. H. Tang, Development of a Radiology Decision Support System for the Classification of MRI Brain Scans, in: 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT), IEEE, 2018, pp. 107–115.
[16] C. Moreira, R. Sindhgatta, C. Ouyang, P. Bruza, A. Wichert, An Investigation of Interpretability Techniques for Deep Learning in Predictive Process Analytics, arXiv preprint arXiv:2002.09192 (2020).
[17] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[18] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive inference for regression, Journal of the American Statistical Association 113 (2018) 1094–1111.
[19] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics (2001) 1189–1232.
[20] A. Goldstein, A. Kapelner, J. Bleich, E. Pitkin, Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation, Journal of Computational and Graphical Statistics 24 (2015) 44–65.
[21] D. W. Apley, J. Zhu, Visualizing the effects of predictor variables in black box supervised learning models, arXiv preprint arXiv:1612.08468 (2016).
[22] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[23] M. Craven, J. W. Shavlik, Extracting tree-structured representations of trained networks, in: Advances in Neural Information Processing Systems, 1996, pp. 24–30.
[24] Y. Zhou, G. Hooker, Interpreting models via single tree approximation, arXiv preprint arXiv:1610.09036 (2016).
[25] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[26] G. Visani, E. Bagli, F. Chesani, A. Poluzzi, D. Capuzzo, Statistical stability indices for LIME: Obtaining reliable explanations for Machine Learning models, arXiv preprint arXiv:2001.11757 (2020).
[27] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods, arXiv preprint arXiv:1806.08049 (2018).
[28] A. Gosiewska, P. Biecek, iBreakDown: Uncertainty of model explanations for non-additive predictive models, arXiv preprint arXiv:1903.11420 (2019).
[29] M. R. Zafar, N. M. Khan, DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems, arXiv preprint arXiv:1906.10263 (2019).
[30] S. M. Shankaranarayana, D. Runje, ALIME: Autoencoder Based Approach for Local Interpretability, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2019, pp. 454–463.
[31] C. Molnar, Limitations of Interpretable Machine Learning Methods, 2020.
[32] W. H. Greene, Econometric Analysis, Pearson Education India, 2003.
[33] A. E. Hoerl, R. W. Kennard, Ridge Regression: Biased Estimation for Nonorthogonal Problems, Technometrics 12 (1970) 55–67. doi:10.1080/00401706.1970.10488634.
[34] W. N. van Wieringen, Lecture notes on ridge regression, arXiv preprint arXiv:1509.09169 (2019).
[35] P.-F. Verhulst, Correspondance mathématique et physique, Ghent and Brussels 10 (1838) 113.
[36] B. Letham, B. Karrer, G. Ottoni, E. Bakshy, Constrained Bayesian optimization with noisy experiments, Bayesian Analysis 14 (2019) 495–519.
[37] J. Fang, M. H. Alderman, Serum uric acid and cardiovascular mortality: The NHANES I epidemiologic follow-up study, 1971-1992, JAMA 283 (2000) 2404–2410.
[38] L. J. Launer, T. Harris, C. Rumpel, J. Madans, Body mass index, weight change, and risk of mobility disability in middle-aged and older women: The epidemiologic follow-up study of NHANES I, JAMA 271 (1994) 1093–1098.
[39] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, S.-I. Lee, From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence 2 (2020) 2522–5839.
[40] C. S. Cox, Plan and Operation of the NHANES I Epidemiologic Followup Study, 1987, 27, US Department of Health and Human Services, Public Health Service, Centers . . . , 1992.
[41] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[42] P. J. Heagerty, T. Lumley, M. S. Pepe, Time-dependent ROC curves for censored survival data and a diagnostic marker, Biometrics 56 (2000) 337–344.