
OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms¹

Giorgio Visani (a,b), Enrico Bagli (b) and Federico Chesani (a)
(a) University of Bologna, School of Informatics & Engineering, viale Risorgimento 2, 40136 Bologna (BO), Italy
(b) CRIF S.p.A., via Mario Fantin 1-3, 40131 Bologna (BO), Italy

arXiv:2006.05714v3 [cs.LG] 7 Feb 2022

Abstract
Local Interpretable Model-Agnostic Explanations (LIME) is a popular method to perform interpretability of any kind of Machine Learning (ML) model. It explains one ML prediction at a time, by learning a simple linear model around the prediction. The model is trained on randomly generated data points, sampled from the training dataset distribution and weighted according to the distance from the reference point - the one being explained by LIME. Feature selection is applied to keep only the most important variables, and their coefficients are regarded as the explanation. LIME is widespread across different domains, although its instability - a single prediction may obtain different explanations - is one of its major shortcomings. The instability is due to the randomness in the sampling step and determines a lack of reliability in the retrieved explanations, making LIME adoption problematic. In Medicine especially, clinical professionals' trust is mandatory to determine the acceptance of an explainable algorithm, considering the importance of the decisions at stake and the related legal issues. In this paper, we highlight a trade-off between the explanation's stability and adherence, namely how much it resembles the ML model. Exploiting our innovative discovery, we propose a framework to maximise stability, while retaining a predefined level of adherence. OptiLIME provides freedom to choose the best adherence-stability trade-off level and, more importantly, it clearly highlights the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether the explanation is reliable, according to the problem at hand. We extensively test OptiLIME on a toy dataset - to present visually the geometrical findings - and a medical dataset. In the latter, we show how the method comes up with meaningful explanations both from a medical and mathematical standpoint.

Keywords
Explainable AI (XAI), Interpretable Machine Learning, Explanation, Model Agnostic, LIME, Healthcare, Stability

1. Introduction

Nowadays Machine Learning (ML) is pervasive and widespread across multiple domains. Medicine is no exception; on the contrary, it is considered one of the greatest challenges of Artificial Intelligence [1]. The idea of exploiting computers to provide assistance to the medical personnel is not new: a historical overview on the topic, starting from the early '60s, is provided in [2]. More recently, computer algorithms have been proven useful for patients and medical concepts representation [3], outcome prediction [4],[5],[6] and new phenotype discovery [7],[8]. An accurate overview of ML successes in Health related environments is provided by Topol in [9].

Unfortunately, ML methods are hardly perfect and, especially in the medical field where human lives are at stake, Explainable Artificial Intelligence (XAI) is urgently needed [10]. Medical education, research and accountability ("who is accountable for wrong decisions?") are some of the main topics XAI tries to address. To achieve explainability, quite a few techniques have been proposed in recent literature. These approaches can be grouped based on different criteria [11],[12], such as i) model agnostic or model specific, ii) local, global or example based, iii) intrinsic or post-hoc, iv) perturbation or saliency based. Among them, model agnostic approaches are quite popular in practice, since the algorithm is designed to be effective on any type of ML model.

LIME [13] is a well-known instance-based, model agnostic algorithm. The method generates data points, sampled from the training dataset distribution and weighted according to the distance from the instance being explained. Feature selection is applied to keep only the most important variables and a linear model is trained on the weighted dataset. The model coefficients are regarded as the explanation. LIME has already been employed several times in medicine, such as on Intensive Care data [14] and cancer data [15],[16]. The technique is known to suffer from instability, mainly caused by the randomness introduced in the sampling step. Stability is a desirable property for an interpretable model, whereas the lack of it reduces the trust in the explanations retrieved, especially in the medical field.

¹ When citing this work, please refer to the peer-reviewed version of this paper, published in the Proceedings of the CIKM 2020 Conference, Vol 2699. Link: http://ceur-ws.org/Vol-2699/paper03.pdf
Contact: [email protected] (G. Visani)
ORCID: 0000-0001-6818-3526 (G. Visani); 0000-0003-3913-7701 (E. Bagli); 0000-0003-1664-9632 (F. Chesani)
© 2022. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
In our contribution, we review the geometrical idea LIME is based upon. Relying on statistical theory and simulations, we highlight a trade-off between the explanation's stability and adherence, namely how much LIME's simple model resembles the ML model. Exploiting our innovative discovery, we propose OptiLIME: a framework to maximise the stability, while retaining a predefined level of adherence. OptiLIME provides both i) freedom to choose the best adherence-stability trade-off level and ii) a clear view of the mathematical properties of the explanation retrieved. As a result, the practitioner is provided with tools to decide whether each explanation is reliable, according to the problem at hand.

We test the validity of the framework on a medical dataset, where the method comes up with meaningful explanations both from a medical and mathematical standpoint. In addition, a toy dataset is employed to present visually the geometrical findings.

The code used for the experiments is available at https://github.com/giorgiovisani/LIME_stability.

2. Related Work

For the sake of shortness, in the following review we consider only model agnostic techniques, which are effective on any kind of ML model by construction. A popular approach is to exclude a certain feature, or group of features, from the model and evaluate the loss incurred in terms of model goodness. Such value quantifies the importance of the excluded feature: a high loss value underlines an important variable for the prediction task. The idea has been first introduced by Breiman [17] for the Random Forest model and has been generalised to a model-agnostic framework, named LOCO [18]. Based on variable exclusion, the predictive power of the ML models has been decomposed into single variables' contributions in PDP [19], ICE [20] and ALE [21] plots, based on different assumptions about the ML model. The same idea is exploited also for local explanations in SHAP [22], where the decomposition is obtained through a game-based setting.

Another common approach is to train a surrogate model mimicking the behaviour of the ML model. In this vein, approximations on the entire input space are provided in [23] and [24] among others, while LIME [13] and its extension using decision rules [25] rely on this technique for providing local approximations.

2.1. LIME Framework

A thorough examination of LIME is provided from a geometrical perspective, while a detailed algorithmic description can be found in [13]. We may consider the ML model as a multivariate surface in the ℝ^(d+1) space spanned by the d independent variables X_1, ..., X_d and the Y dependent variable.

LIME's objective is to find the tangent plane to the ML surface, in the point we want to explain. This task is analytically unfeasible, since we don't have a parametric formulation of the function; besides, the ML surface may have a huge number of discontinuity points, preventing the existence of a proper derivative and tangent. To find an approximation of the tangent, LIME uses a Ridge Linear Model to fit points on the ML surface, in the neighbourhood of the reference individual. Points all over the ℝ^d space are generated, sampling the X values from a Normal distribution inferred from the training set. The Y coordinate values are obtained by ML predictions, so that the generated points are guaranteed to perfectly lie on the ML surface. The concept of neighbourhood is introduced using a kernel function (RBF Kernel), which smoothly assigns higher weights to points closer to the reference. The Ridge Model is trained on the generated dataset, each point weighted by the kernel function, to estimate the linear relationship E(Y) = α + Σ_{j=1..d} β_j X_j. The β coefficients are regarded as the LIME explanation.
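For concreteness, the procedure just described can be sketched in a few lines of Python. The snippet below is a minimal illustration using numpy and scikit-learn, not the authors' implementation (that lives in the repository linked above): it omits LIME's feature selection step, and the names rbf_weights, lime_explanation and predict_fn are ours.

```python
import numpy as np
from sklearn.linear_model import Ridge

def rbf_weights(samples, x_ref, kernel_width):
    """RBF kernel weights: close to 1 near the reference point, close to 0 far away."""
    sq_dist = np.sum((samples - x_ref) ** 2, axis=1)
    return np.exp(-sq_dist / kernel_width)

def lime_explanation(predict_fn, X_train, x_ref, kernel_width,
                     n_samples=5000, alpha=1.0, seed=None):
    """Fit a weighted linear surrogate around x_ref; its coefficients are the explanation."""
    rng = np.random.default_rng(seed)
    # sample new points from a Normal distribution inferred from the training set
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    samples = rng.normal(mu, sigma, size=(n_samples, X_train.shape[1]))
    # the Y coordinates are the ML predictions, so the points lie on the ML surface
    y = predict_fn(samples)
    # weight each point by its distance from the reference and fit the local model
    weights = rbf_weights(samples, x_ref, kernel_width)
    surrogate = Ridge(alpha=alpha)   # alpha=1 mirrors LIME's default Ridge model
    surrogate.fit(samples, y, sample_weight=weights)
    return surrogate.intercept_, surrogate.coef_
```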
2.2. LIME Instability

One of the main issues of LIME is the lack of stability. Explanations derived from repeated LIME calls, under the same conditions, are considered stable when statistically equal [26]. In [27] the authors provide insight about LIME's lack of robustness, a notion similar to the above-mentioned stability. Analogous findings are reported in [28]. Often, practitioners are either not aware of such a drawback or diffident about the method because of its unreliability. By all means, unambiguous explanations are a key desideratum for interpretable frameworks.

The major source of LIME instability is the sampling step, when new observations are randomly selected. Some approaches, grouped in two high level concepts, have been recently laid out in order to solve the stability issue.

Avoid the sampling step
In [29] the authors propose to bypass the sampling step using the training units only and a combination of Hierarchical Clustering and K-Nearest Neighbour techniques. Although this method achieves stability, it may find a bad approximation of the ML function in regions with only a few training points.

Evaluate the post-hoc stability

The shared idea is to repeat the LIME method under the same conditions, and test whether the results are equivalent. Among the various propositions on how to conduct the test, in [30] the authors compare the standard deviations of the Ridge coefficients, whereas [31] examines the stability of the feature selection step - whether the selected variables are the same. In [26] two complementary indices have been developed, based on statistical comparison of the Ridge models generated by repeated LIME calls. The Variables Stability Index (VSI) checks the stability of the feature selection step, whereas the Coefficients Stability Index (CSI) asserts the equality of coefficients attributed to the same feature.

3. Methodology

OptiLIME consists in a framework to guarantee the highest reachable level of stability, constrained to the finding of a relevant local explanation. From a geometrical perspective, the relevance of the explanation corresponds to the adherence of the linear plane to the ML surface. To evaluate the stability we rely on the CSI and VSI indices [26], while the adherence is assessed using the R² statistic, which measures the goodness of the linear approximation through a set of points [32]. All the figures of merit above span the range [0, 1], where higher values define respectively higher stability and adherence.

To fully explain the rationale of the proposition, we first cover three important concepts about LIME. In this section we employ a Toy Dataset to show our theoretical findings.

Toy Dataset

The dataset is generated from the Data Generating Process:

Y = sin(X) · X + 10

100 distinct points have been generated uniformly in the X range [0, 10] and only 20 of them were kept, at random. In Figure 1, the blue line represents the True DGP function, whereas the green one is its best approximation using a Polynomial Regression of degree 5 on the generated dataset (blue points). In the following we will regard the Polynomial as our ML function; we will not make use of the True DGP function (blue line), which is usually not available in practical data mining scenarios. The red dot is the reference point in which we will evaluate the local LIME explanation. The dataset is intentionally one dimensional, so that the geometrical ideas about LIME may be well represented in a 2d plot.

Figure 1: Toy Dataset
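The toy setting can be reproduced with a short script. Below is a sketch under stated assumptions: the random seed, the exact placement of the 100 points on [0, 10] and the subsampling of the 20 retained points are not reported in the text, so the choices here are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)          # assumed seed

# Data Generating Process: Y = sin(X) * X + 10 on the range [0, 10]
X = np.linspace(0, 10, 100)             # 100 points, here on an evenly spaced grid
Y = np.sin(X) * X + 10

# keep only 20 of the 100 points, at random
keep = rng.choice(100, size=20, replace=False)
X_kept, Y_kept = X[keep], Y[keep]

# a degree-5 polynomial regression plays the role of the black-box ML function
ml_model = Polynomial.fit(X_kept, Y_kept, deg=5)
predict_fn = lambda x: ml_model(np.asarray(x).ravel())
```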
3.1. Kernel Width defines locality

Locality is enforced through a kernel function; the default is the RBF Kernel (Formula 1). It is applied to each point x^(i) generated in the sampling step, obtaining an individual weight. The formulation provides smooth weights in the range [0, 1] and flexibility through the kernel width parameter kw.

    RBF(x^(i)) = exp( − ‖x^(i) − x^(ref)‖² / kw )        (1)

The RBF flexibility makes it suitable to each situation, although it requires a proper tuning: setting a high kw value will result in considering a neighbourhood of large dimension, whereas shrinking kw we shrink the width of the neighbourhood.

In Figure 2, LIME generated points are displayed as green dots and the corresponding LIME explanations (red lines) are shown. The points are scattered all over the ML function, however their size is proportional to the weight assigned by the RBF kernel. Small kernel widths assign significant weights only to the closest points, making the further ones almost invisible. In this way, they do not contribute to the local linear model.

The concept of locality is crucial to LIME: a neighbourhood too large may cause the LIME model not to be adherent to the ML function in the considered neighbourhood.
Figure 2: LIME explanations for different kernel widths
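A quick numerical check, of our own making, illustrates the locality effect: the share of total kernel weight carried by points near the reference grows as the kernel width shrinks. The reference value x_ref = 7 is an arbitrary assumption.

```python
import numpy as np

x_ref = 7.0                       # assumed reference point on the toy dataset
x = np.linspace(0, 10, 101)       # positions of generated points along the X axis
for kw in (0.1, 1.0, 10.0):
    w = np.exp(-((x - x_ref) ** 2) / kw)      # Formula 1 in one dimension
    near = np.abs(x - x_ref) <= 1
    print(f"kw={kw:>4}: share of weight within |x - x_ref| <= 1 is {w[near].sum() / w.sum():.3f}")
```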

3.2. Ridge penalty is harmful to LIME


In statistics, data are assumed to be generated from a Data Generating Process (DGP) combined with a source of white noise, so that the standard formulation of the problem is Y = f(X) + ε, where ε ∼ N(0, σ²). The aim of each statistical model is to retrieve the best specification of the DGP function f(X), given the noisy dataset.

Ridge Regression [33] assumes a linear DGP, namely f(X) = α + Σ_{j=1..d} β_j X_j, and applies a penalty proportional to the norm of the β coefficients, enforced during the estimation process through the penalty parameter λ. This technique is useful when dealing with very noisy datasets (where the stochastic component ε exhibits high variance σ²) [34]. In fact, the noise makes various sets of coefficients appear as viable solutions; tuning λ to its proper value allows Ridge to retrieve a unique solution.

In the LIME setting, the ML function acts as the DGP, while the sampled points are the dataset. Recalling that the Y coordinate of each point is given by the ML prediction, it is guaranteed they lie exactly on the ML surface by construction. Hence, no noise is present in our dataset. For this reason, we argue that the Ridge penalty is not needed; on the contrary, it can be harmful and distort the estimates of the parameters, as shown in Figure 3.

In the 3b panel, Ridge penalty λ = 1 (LIME default) is employed, whereas in 3a no penalty (λ = 0) is imposed. It is possible to see how the estimation gets severely distorted by the penalty, proven also by the R² values. This happens especially for small kernel width values, since each unit has very small weight and the weighted residuals are almost irrelevant in the Ridge loss, which is dominated by the penalty term. To minimize the penalty term, the coefficients are shrunk towards 0.

Figure 3: Effects of Ridge Penalty on LIME explanations. (a) Ridge Penalty = 0; (b) Ridge Penalty = 1
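The effect can be reproduced with a small experiment of our own construction, reusing predict_fn from the toy-dataset sketch above: with a very small kernel width the sample weights are tiny, the weighted residuals barely matter in the Ridge loss, and the default penalty shrinks the local slope, while an unpenalised fit recovers it. The reference point and kernel width below are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(1)
x_ref, kw = 7.0, 0.05                          # assumed reference point and small kernel width
x = rng.normal(5.0, 3.0, size=(500, 1))        # LIME-style samples around the data mean
y = predict_fn(x)                              # toy ML predictions: they lie exactly on its surface
w = np.exp(-((x.ravel() - x_ref) ** 2) / kw)   # RBF weights, mostly close to zero

ridge_slope = Ridge(alpha=1.0).fit(x, y, sample_weight=w).coef_[0]   # LIME default penalty
ols_slope = LinearRegression().fit(x, y, sample_weight=w).coef_[0]   # no penalty
print(f"penalised slope: {ridge_slope:.2f}   unpenalised slope: {ols_slope:.2f}")
```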
3.3. Relationship between Stability, Adherence and Kernel Width

Since the kernel width represents the main hyper-parameter of LIME, we wish to understand how Stability and Adherence vary with respect to it. From the theory, we have a few helpful results:

• Taylor's Theorem [32] gives a polynomial approximation for any differentiable function, calculated in a given point. If we truncate the formula to the first degree polynomial, we obtain a linear function, whose approximation error depends on the distance between the point in which the error is evaluated and the given point. Thus, if we assume the ML function to be differentiable in the neighbourhood of x^(ref), the adherence of the linear model is expected to be inversely proportional to the width of the neighbourhood, i.e. to the kernel width. This is true since the approximation error depends on the distance from the two points, namely the neighbourhood size.
• In Linear Regression, the standard deviation of the coefficients is inversely correlated to the standard deviation of the X variables [32]. The stability of the explanations depends on the spread of the X variables in our weighted dataset. We then expect the kernel width and Stability to be directly proportional.

To illustrate the conjectures above, we run LIME for different kernel width values and evaluate both the R² and CSI metrics (VSI is not considered in the Toy Dataset, since only one variable is present). In Figure 4 the results of such experiment, for the reference unit, are shown.

Figure 4: Relationship among kernel width, R² and CSI

Both the adherence and stability are noisy functions of the kernel width: they contain some stochasticity, due to the different datasets generated by each LIME call. Despite this, it is possible to detect a clear pattern: monotonically increasing for the CSI Index and monotonically decreasing for the R² statistic.

For numerical evidence of these properties, we fit the Logistic function [35], which retrieves the best monotonous approximation to a set of points. The goodness of the logistic approximation is confirmed by a low value of the Mean Absolute Error (MAE). To corroborate our assumption, the same process has been repeated on all the units of the Toy Dataset, obtaining an average MAE of 0.005 for the R² approximation and of 0.026 for the CSI. The logistic growth rate has also been inspected: the highest R² growth rate is -10.78 and the lowest CSI growth rate is 7.20. These results ensure the monotonous relationships of adherence and stability with the kernel width, respectively decreasing and increasing.
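The experiment can be sketched as follows, again reusing predict_fn from the toy-dataset sketch. The CSI and VSI of [26] are the proper stability indices; here the standard deviation of the local slope across repeated calls is used as a rough stability proxy, the weighted R² of the surrogate measures adherence, and the reference point x_ref = 7 is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adherence_and_spread(predict_fn, mean, std, x_ref, kw, n_repeats=10):
    """Mean weighted R² (adherence) and slope dispersion (instability proxy) over repeated calls."""
    r2s, slopes = [], []
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)
        samples = rng.normal(mean, std, size=(1000, 1))              # LIME sampling step
        y = predict_fn(samples)
        w = np.exp(-((samples - x_ref) ** 2).sum(axis=1) / kw)       # RBF weights (Formula 1)
        model = LinearRegression().fit(samples, y, sample_weight=w)  # no Ridge penalty, cf. Sec. 3.2
        r2s.append(model.score(samples, y, sample_weight=w))
        slopes.append(model.coef_[0])
    return np.mean(r2s), np.std(slopes)

for kw in (0.05, 0.5, 5.0):
    r2, spread = adherence_and_spread(predict_fn, 5.0, 3.0, 7.0, kw)
    print(f"kw={kw:>4}: mean weighted R² = {r2:.3f}, slope std across runs = {spread:.3f}")
```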
3.4. OptiLIME

Previously, we empirically showed that adherence and stability are monotonous noisy functions of the kernel width: for increasing kernel width we observe, on average, decreasing adherence and increasing stability. Our proposition consists in a framework which enables the best choice for the trade-off between stability and adherence of the explanations. OptiLIME sets a desired level of adherence and finds the largest kernel width matching the request. At the same time, the best kernel width provides the highest stability value, constrained to the chosen level of adherence. At the end of the day, OptiLIME consists in an automated way of finding the best kernel width. Moreover, it empowers the practitioner to be in control of the trade-off between the two most important properties of LIME Local Explanations.

To retrieve the best width, OptiLIME converts the decreasing R² function into l(kw, R̃²), by means of Formula 2:

    l(kw, R̃²) = R²(kw)              if R²(kw) ≤ R̃²
    l(kw, R̃²) = 2·R̃² − R²(kw)       if R²(kw) > R̃²        (2)

where R̃² is the requested adherence. For a fixed R̃², chosen by the practitioner, the function l(kw, R̃²) presents a global maximum. We are particularly interested in arg max_kw l(kw, R̃²), namely the best kernel width.

In order to solve the optimum problem, Bayesian Optimization is employed, since it is the most suitable technique to find the global optimum of noisy functions [36]. The technique relies on two parameters to be set beforehand: p, the number of preliminary calls with random kw values, and m, the number of iterations of the search refinement strategy. Increasing the parameters ensures finding a better kernel width value, at the cost of longer computation time.

In Figure 5, an application of OptiLIME to the reference unit of the Toy Dataset is presented. R̃² has been set to 0.9, p = 20 and m = 40. The points in the plot represent the distinct evaluations performed by the Bayesian Search in order to find the optimum. Comparing the plot with Figure 4, we observe the effect of Formula 2 on the left part of the R² and l(kw, R̃²) functions. In Figure 5 the search has converged to the maximum, evaluating various points close to the best kernel width. At the same time, the stochastic nature of the CSI function is evident: the several CSI measurements, performed in the proximity of the 0.3 value of the kernel width, show a certain variation. Nonetheless, it is possible to recognise the increasing CSI trend.

Figure 5: OptiLIME Search for the best kernel width
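A compact sketch of the OptiLIME search is given below, reusing adherence_and_spread and predict_fn from the previous sketches. The authors' implementation is available in the repository linked in Section 1; here scikit-optimize's gp_minimize is used as one possible Bayesian optimizer (an assumption on our part), and the search bounds on the kernel width are illustrative.

```python
import numpy as np
from skopt import gp_minimize

R2_TILDE = 0.9          # requested adherence level, chosen by the practitioner
p, m = 20, 40           # preliminary random calls and refinement iterations

def neg_l(params):
    """Negative of Formula 2 (gp_minimize minimises), evaluated at a candidate kernel width."""
    kw = params[0]
    r2, _ = adherence_and_spread(predict_fn, 5.0, 3.0, 7.0, kw)   # adherence at this width
    l = r2 if r2 <= R2_TILDE else 2 * R2_TILDE - r2               # Formula 2
    return -l

result = gp_minimize(neg_l, dimensions=[(1e-3, 10.0)],
                     n_initial_points=p, n_calls=p + m, random_state=0)
print(f"best kernel width: {result.x[0]:.3f}, maximised l value: {-result.fun:.3f}")
```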

4. Case Study

Dataset

To validate our methodology we use a well known medical dataset: NHANES I. It has been employed for medical research [37],[38] as well as a benchmark to test explanation methods [39]. The original dataset is described in [40]. We use a reformatted version, released at http://github.com/suinleelab/treexplainer-study. It contains 79 features, based on clinical measurements of 14,407 individuals. The aim is to model the risk of death over twenty years of follow-up.

Diagnostic Algorithm

Following Lundberg's [39] prescriptions, the dataset has been divided into a 64/16/20 split for train/validation/test. The features have been mean imputed and standardized based on statistics computed on the training set. A Survival Gradient Boosting model has been trained, using the XGBoost framework [41]. Its hyper-parameters have been optimized by coordinate descent, using the C-statistic [42] on the validation set as the figure of merit.
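A hedged sketch of such a diagnostic model is given below: an XGBoost booster with a Cox proportional-hazards objective, trained on an illustrative random matrix standing in for the preprocessed NHANES features. Labels follow the convention of xgboost's survival:cox objective (follow-up time, negative when censored); the authors' hyper-parameter search by coordinate descent is not reproduced.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 79))              # stand-in for the 79 NHANES features
time = rng.exponential(10.0, size=1000)      # follow-up time in years (synthetic)
censored = rng.random(1000) < 0.5            # roughly half the individuals are censored
label = np.where(censored, -time, time)      # survival:cox label convention

dtrain = xgb.DMatrix(X, label=label)
params = {"objective": "survival:cox", "eta": 0.1, "max_depth": 3}
booster = xgb.train(params, dtrain, num_boost_round=100)
risk = booster.predict(dtrain)               # higher score = higher estimated hazard
```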
Explanations

We use the OptiLIME framework to achieve the optimal explanation of the XGBoost model on the dataset. We consider two randomly chosen individuals to visually show the results. In our simulation, we consider 0.9 as a reasonable level of adherence. OptiLIME is employed to find the proper kernel width to achieve an R² value close to 0.9 while maximizing the stability indices of the local explanation models.
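To make the workflow concrete, the snippet below shows how a practitioner might produce the final local explanation once OptiLIME has selected a kernel width, through the public lime package. It is a sketch under assumptions: X and booster are the stand-ins defined in the previous sketch, feature_names are placeholders, and best_kw would come from the OptiLIME search rather than being hard-coded.

```python
import xgboost as xgb
from lime.lime_tabular import LimeTabularExplainer
from sklearn.linear_model import LinearRegression

best_kw = 0.5                                    # assumed output of the OptiLIME search
feature_names = [f"feat_{j}" for j in range(X.shape[1])]

explainer = LimeTabularExplainer(
    training_data=X,                             # the (stand-in) training matrix from above
    mode="regression",                           # the model outputs a hazard-style risk score
    feature_names=feature_names,
    kernel_width=best_kw,                        # kernel width selected by OptiLIME
    discretize_continuous=False,
)
explanation = explainer.explain_instance(
    data_row=X[0],                               # the individual being explained (e.g. Unit 100)
    predict_fn=lambda data: booster.predict(xgb.DMatrix(data)),
    num_features=10,
    model_regressor=LinearRegression(),          # unpenalised local model, cf. Section 3.2
)
print(explanation.as_list())                     # (feature, coefficient) pairs
```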
The model prediction consists in the hazard ratio for each individual: a higher prediction means the individual is likely to survive a shorter time. Therefore, positive coefficients define risk factors, whereas protective factors have negative values.

The LIME model interpretation is the same as for a Linear Regression model, but with the additional concept of locality. As an example, for the Age variable we distinguish a different impact based on the individual characteristics: having 1 year more for Unit 100 (increasing from 65 to 66 years) will raise the death risk by 3.56 base points, whereas for Unit 7207 1 year of ageing (from 49 to 50) will increase the risk by just 0.79. Another example is the impact of Sex: it is more pronounced in elder people (being female is a protective factor for 1.49 points at age 49; at age 65 being male has a much worse impact, as a risk factor for 3.04).

Figure 6: NHANES individual Explanations using OptiLIME. (a) Best LIME Explanation, Unit 100; (b) Best LIME Explanation, Unit 7207

For Unit 100 in Figure 6a, the optimal kernel width is a bit higher compared with Unit 7207 in Figure 6b. This is probably caused by the ML model having a higher degree of non linearity for the latter unit: to achieve the same adherence, we are forced to consider a smaller portion of the ML model, hence a smaller neighbourhood. A smaller kernel width implies also a reduced Stability, testified by small values of the VSI and CSI indices. Whenever the practitioner desires more stable results, it is possible to re-run OptiLIME with a less strict requirement for the adherence. It is important to remark that low degrees of adherence will make the explanations increasingly more global: the linear surface retrieved by LIME will consist in an average of many local non-linearities of the ML model.

The computation time largely depends on the Bayesian Search, controlled by the parameters p and m. In our setting, p = 10 and m = 30 produce good results for both the units in Figure 6. On a 4 Intel-i7 CPUs 2.50GHz laptop, the OptiLIME evaluation for Unit 100 and Unit 7207 took respectively 123 and 147 seconds to compute. For faster, but less accurate results, the Bayesian Search parameters can be reduced.

5. Conclusions

In Medicine, diagnostic computer algorithms providing accurate predictions have countless benefits: notably, they may help in saving lives as well as reducing medical costs. However, precisely because of the importance of these matters, the rationale of the decisions must be clear and understandable. A plethora of techniques to explain ML decisions has grown in recent years, though there is no consensus on the best in class, since each method presents some drawbacks. Explainable models are required to be reliable, thus stability is regarded as a key desideratum.

We consider the LIME technique, whose major drawback lies in the lack of stability. Moreover, it is difficult to tune properly its main parameter: different values of the kernel width provide substantially different explanations.

The main contribution of this paper consists in the clear decomposition of the LIME framework in its relevant components and the exhaustive analysis of each one, starting from the geometrical meaning through the empirical experiments to validate our intuitions. We showed that the Ridge penalty is not needed and LIME works best with simple Linear Regression as the explainable model. In addition, smaller kernel width values provide a LIME plane more adherent to the ML surface, therefore a more realistic local explanation. Eventually, the trade-off between the adherence and stability properties is extremely valuable since it empowers the practitioner to choose the best kernel width consciously.

We exploit these findings in order to tackle LIME weak points. The result is the OptiLIME framework, which represents a new and innovative contribution to the scientific community. OptiLIME achieves stability of the explanations and automatically finds the proper kernel width value, according to the practitioner's needs.

The framework may serve as an extremely useful tool: using OptiLIME, the practitioner knows how much to trust the explanations, based on their stability and adherence values.

Nonetheless, we acknowledge that the optimization framework may be improved to allow for a faster and more precise computation.

Acknowledgments

We acknowledge financial support by CRIF S.p.A. and Università degli Studi di Bologna.

References

[1] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, H. Müller, Causability and explainability of artificial intelligence in medicine, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019) e1312.
[2] I. Kononenko, Machine learning for medical diagnosis: History, state of the art and perspective, Artificial Intelligence in Medicine 23 (2001) 89–109.
[3] R. Miotto, L. Li, B. A. Kidd, J. T. Dudley, Deep patient: An unsupervised representation to predict the future of patients from the electronic health records, Scientific Reports 6 (2016) 1–10.
[4] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, J. Sun, Doctor AI: Predicting clinical events via recurrent neural networks, in: Machine Learning for Healthcare Conference, 2016, pp. 301–318.
[5] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, Scalable and accurate deep learning with electronic health records, NPJ Digital Medicine 1 (2018) 18.
[6] B. Shickel, P. J. Tighe, A. Bihorac, P. Rashidi, Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis, IEEE Journal of Biomedical and Health Informatics 22 (2017) 1589–1604.
[7] Z. Che, D. Kale, W. Li, M. T. Bahadori, Y. Liu, Deep computational phenotyping, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 507–516.
[8] T. A. Lasko, J. C. Denny, M. A. Levy, Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data, PloS One 8 (2013).
[9] E. J. Topol, High-performance medicine: The convergence of human and artificial intelligence, Nature Medicine 25 (2019) 44–56.
[10] A. Holzinger, From machine learning to explainable AI, in: 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), IEEE, 2018, pp. 55–66.
[11] C. Molnar, Interpretable Machine Learning, Lulu.com, 2020.
[12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys (CSUR) 51 (2018) 93.
[13] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1135–1144.
[14] G. J. Katuwal, R. Chen, Machine learning model interpretability for precision medicine, arXiv preprint arXiv:1610.09045 (2016).
[15] A. Y. Zhang, S. S. W. Lam, N. Liu, Y. Pang, L. L. Chan, P. H. Tang, Development of a Radiology Decision Support System for the Classification of MRI Brain Scans, in: 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT), IEEE, 2018, pp. 107–115.
[16] C. Moreira, R. Sindhgatta, C. Ouyang, P. Bruza, A. Wichert, An Investigation of Interpretability Techniques for Deep Learning in Predictive Process Analytics, arXiv preprint arXiv:2002.09192 (2020).
[17] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[18] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive inference for regression, Journal of the American Statistical Association 113 (2018) 1094–1111.
[19] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics (2001) 1189–1232.
[20] A. Goldstein, A. Kapelner, J. Bleich, E. Pitkin, Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation, Journal of Computational and Graphical Statistics 24 (2015) 44–65.
[21] D. W. Apley, J. Zhu, Visualizing the effects of predictor variables in black box supervised learning models, arXiv preprint arXiv:1612.08468 (2016).
[22] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
[23] M. Craven, J. W. Shavlik, Extracting tree-structured representations of trained networks, in: Advances in Neural Information Processing Systems, 1996, pp. 24–30.
[24] Y. Zhou, G. Hooker, Interpreting models via single tree approximation, arXiv preprint arXiv:1610.09036 (2016).
[25] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[26] G. Visani, E. Bagli, F. Chesani, A. Poluzzi, D. Capuzzo, Statistical stability indices for LIME: Obtaining reliable explanations for Machine Learning models, arXiv preprint arXiv:2001.11757 (2020).
[27] D. Alvarez-Melis, T. S. Jaakkola, On the robustness of interpretability methods, arXiv preprint arXiv:1806.08049 (2018).
[28] A. Gosiewska, P. Biecek, iBreakDown: Uncertainty of model explanations for non-additive predictive models, arXiv preprint arXiv:1903.11420 (2019).
[29] M. R. Zafar, N. M. Khan, DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems, arXiv preprint arXiv:1906.10263 (2019).
[30] S. M. Shankaranarayana, D. Runje, ALIME: Autoencoder Based Approach for Local Interpretability, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2019, pp. 454–463.
[31] C. Molnar, Limitations of Interpretable Machine Learning Methods, 2020.
[32] W. H. Greene, Econometric Analysis, Pearson Education India, 2003.
[33] A. E. Hoerl, R. W. Kennard, Ridge Regression: Biased Estimation for Nonorthogonal Problems, Technometrics 12 (1970) 55–67. doi:10.1080/00401706.1970.10488634.
[34] W. N. van Wieringen, Lecture notes on ridge regression, arXiv preprint arXiv:1509.09169 (2019).
[35] P.-F. Verhulst, Correspondance mathématique et physique, Ghent and Brussels 10 (1838) 113.
[36] B. Letham, B. Karrer, G. Ottoni, E. Bakshy, Constrained Bayesian optimization with noisy experiments, Bayesian Analysis 14 (2019) 495–519.
[37] J. Fang, M. H. Alderman, Serum uric acid and cardiovascular mortality: The NHANES I epidemiologic follow-up study, 1971-1992, JAMA 283 (2000) 2404–2410.
[38] L. J. Launer, T. Harris, C. Rumpel, J. Madans, Body mass index, weight change, and risk of mobility disability in middle-aged and older women: The epidemiologic follow-up study of NHANES I, JAMA 271 (1994) 1093–1098.
[39] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, S.-I. Lee, From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence 2 (2020) 2522–5839.
[40] C. S. Cox, Plan and Operation of the NHANES I Epidemiologic Followup Study, 1987, 27, US Department of Health and Human Services, Public Health Service, Centers . . . , 1992.
[41] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[42] P. J. Heagerty, T. Lumley, M. S. Pepe, Time-dependent ROC curves for censored survival data and a diagnostic marker, Biometrics 56 (2000) 337–344.
