Applied Soft Computing Journal 91 (2020) 106263

Contents lists available at ScienceDirect

Applied Soft Computing Journal


journal homepage: www.elsevier.com/locate/asoc

Statistical and machine learning models in credit scoring: A systematic literature survey

Xolani Dastile a,∗, Turgay Celik a,b, Moshe Potsane a
a School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa
b Wits Institute of Data Science, University of the Witwatersrand, Johannesburg, South Africa

article info

Article history:
Received 3 June 2019
Received in revised form 23 February 2020
Accepted 19 March 2020
Available online 25 March 2020

Keywords:
Credit scoring
Statistical learning
Machine learning
Deep learning
Systematic literature survey

abstract

In practice, as a well-known statistical method, the logistic regression model is used to evaluate the credit-worthiness of borrowers due to its simplicity and transparency in predictions. However, in literature, sophisticated machine learning models can be found that can replace the logistic regression model. Despite the advances and applications of machine learning models in credit scoring, there are still two major issues: the incapability of some of the machine learning models to explain predictions; and the issue of imbalanced datasets. As such, there is a need for a thorough survey of recent literature in credit scoring. This article employs a systematic literature survey approach to systematically review statistical and machine learning models in credit scoring, to identify limitations in literature, to propose a guiding machine learning framework, and to point to emerging directions. This literature survey is based on 74 primary studies, such as journal and conference articles, that were published between 2010 and 2018. According to the meta-analysis of this literature survey, we found that in general, an ensemble of classifiers performs better than single classifiers. Although deep learning models have not been applied extensively in credit scoring literature, they show promising results.
© 2020 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail addresses: [email protected] (X. Dastile), [email protected] (T. Celik), [email protected] (M. Potsane).
https://doi.org/10.1016/j.asoc.2020.106263
1568-4946/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

The history of credit scoring dates back to the 1950s [1]. During those early years, credit decisions were made using a judgmental approach commonly known as the 5C's approach [2]:

Character: do you know the person or their family?
Capital: how much is being asked for?
Collateral: what is the borrower willing to put up from their resources?
Capacity: what is their repayment ability?
Condition: what are the conditions in the market?

The drawback of the 5C's approach was its inability to process a huge number of applications daily. This resulted in the advent of scorecards, which make consistent nonjudgmental decisions that treat all borrowers fairly. The scorecards generate a score which quantifies the risk of lending money to borrowers.

When a borrower applies for a loan, the financial institution, without loss of generality hereafter referred to as the bank, collects information from the borrower. This information is called application data and it consists of demographic information, e.g., the number of dependents, time at current address, time at current employment, etc. The bureau information is also collected from the local bureaus, and it includes the number of inquiries, judgments, number of delinquencies, etc. Once the accepted population, i.e. borrowers who have been granted loans, is identified, their loan repayment history is tracked for a period of time, e.g. 24 months. A target flag (i.e. good/bad flag) is created based on the loan repayment history of the accepted population. If the number of days past due (or missed payments) is less than a certain number of days, e.g. 90 days, then the borrower is flagged as a good borrower, otherwise as a bad borrower. The known goods and bads are then used to develop a scorecard. The cut-off score is determined by the Kolmogorov–Smirnov statistic, which measures the distance between the cumulative distributions of goods and bads. The score which gives the maximum distance between the distributions of goods and bads is regarded as the cut-off score and is used to predict the goods and the bads. If the score of a borrower is larger than or equal to the cut-off score, then the borrower is predicted as a good borrower, otherwise as a bad borrower. The scorecard is then applied to the rejected population to predict goods and bads, and this is referred to as reject inference [3]. The final application scorecard is built on the accepted and rejected populations, i.e. the Through-The-Door (TTD) population. Please see Fig. 1 for a schematic view of the data flowchart for application scorecards. Traditionally, financial institutions use logistic regression to score borrowers. The choice of the logistic regression model is due to its simplicity and transparency. The scores from logistic regression are calculated using the following formula [3]

$$\text{Score} = \frac{pdo}{\ln(2)} \times \left( \beta_0 + \sum_{i=1}^{n} \sum_{j=1}^{k_i} \beta_i \, \mathrm{WOE}_{i,j} \right), \tag{1}$$

where

$$\mathrm{WOE}_{i,j} = \ln\left( \frac{FG_j}{FB_j} \right) \tag{2}$$

and $n$ is the number of features, $k_i$ is the number of groups or attributes in the ith feature, $pdo$ is points to double the odds and is used as a scaling factor, and $FG_j$ and $FB_j$ are the distributions of good borrowers and bad borrowers in the jth attribute of feature i, respectively. The $\beta_i$ coefficients are estimated using Maximum Likelihood Estimation [4].

Fig. 1. Schematic view of data flowchart for application scorecards.

In literature, sophisticated machine learning (ML) models can be found that can replace the logistic regression model. In spite of their high accuracy, ML models are generally unable to explain their predictions. Financial institutions are regulated entities and are required to be transparent in their decisions when using application scorecards. However, techniques that can deduce rules to mitigate the lack of transparency without compromising accuracy are suggested in the literature, e.g. [5].

In this paper, we aim to present a systematic literature survey of statistical and machine learning models which are employed in credit scoring between 2010 and 2018 and propose a guiding ML framework for credit scoring. The remainder of this paper is organized as follows. In Section 2 we cover the methodology followed for conducting this systematic literature survey. In Section 3 we highlight feature selection/engineering methods. The learning models are covered in Section 4. In Section 5 we present evaluation metrics. The data imbalance is covered in Section 6. In Section 7 we cover model transparency. In Section 8 we discuss limitations and assumptions of different models. In Section 9 we discuss emerging trends. Section 10 discusses limitations in literature. In Section 11 we discuss results. In Section 12 we propose a guiding framework for machine learning in credit scoring and finally Section 13 provides conclusion and future work.

2. Survey methodology

This study employs a systematic literature survey approach to

(1) systematically review the most commonly used statistical and machine learning techniques in credit scoring;
(2) identify limitations in literature;
(3) propose a guiding machine learning framework to perform credit scoring;
(4) point to emerging directions.

In this survey, the statistical techniques considered include Linear Discriminant Analysis (LDA), Logistic Regression (LR) and Naïve Bayes (NB), and the machine learning techniques include k-Nearest Neighbor (k-NN), Decision Trees (DTs), Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Random Forests (RFs), Boosting, Extreme Gradient Boosting (XGBoost), Bagging, Restricted Boltzmann Machines (RBMs), Deep Multi-Layer Perceptron (DMLP), Convolutional Neural Networks (CNNs) and Deep Belief Neural Networks (DBNs). Note that this list of statistical and machine learning techniques is not exhaustive; the literature provides a myriad of techniques that are used to model credit scoring. However, this study only reviews the most commonly used techniques as it would be almost impossible to look at all techniques applied in credit scoring.

A process flowchart of the systematic literature survey methodology is presented in Fig. 2. Note that this study follows the methodology of the recently published systematic literature survey on bankruptcy prediction models [6]. The search words of the current study are "Statistical Learning in Credit Scoring", "Machine Learning in Credit Scoring" and "Deep Learning in Credit Scoring". The inclusion criterion for articles is based on the recently published work in credit scoring and the period considered is the year 2010 to the year 2018. The study selects peer-reviewed journal and conference articles as these are considered to be of high quality [7]. The articles are selected by reading the abstract and the conclusion; in some instances the entire article is read. This review is based on articles that are written in English only. This overlooks one of the requirements in systematic literature review where language constraints are discouraged. The following databases are used in searching papers: Google Scholar, Science Direct, IEEEXplore, ACM and SpringerLink. These databases include studies which are undertaken all over the world, hence geographical bias is removed.

All unpublished work and dissertations are not included in this current study. In the end, 74 primary studies were selected for this systematic literature survey. The primary studies include models in their hybrid form (i.e. where feature selection/feature engineering is combined with a classifier or ensemble classifiers) and in their standalone form. A meta-analysis of the results from the selected articles is done by producing summary tables, pie-charts and histograms. The German and Australian credit datasets are the most frequently used datasets in credit scoring, hence we used these two datasets for model performance comparisons.
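Returning to the scorecard formula of the Introduction, the following is a minimal Python sketch of how Eqs. (1) and (2) could be evaluated. It is an illustration under stated assumptions, not the authors' implementation: the function names, the counts, the coefficients and the value pdo = 20 are all hypothetical, and the sketch assumes that for a single applicant each feature contributes the WOE of the one attribute the applicant falls into.

```python
import math


def weight_of_evidence(n_good, n_bad):
    """Eq. (2) for one feature: WOE_j = ln(FG_j / FB_j) for each attribute j.

    n_good[j] and n_bad[j] are counts of good and bad borrowers in attribute j;
    FG_j and FB_j are the corresponding shares of all goods and all bads.
    """
    total_good, total_bad = sum(n_good), sum(n_bad)
    return [
        math.log((g / total_good) / (b / total_bad))
        for g, b in zip(n_good, n_bad)
    ]


def scorecard_score(beta0, betas, applicant_woes, pdo=20.0):
    """Eq. (1): scale the linear predictor by pdo / ln(2).

    Assumption: applicant_woes[i] is the WOE of the single attribute of
    feature i that this applicant falls into.
    """
    linear = beta0 + sum(b * w for b, w in zip(betas, applicant_woes))
    return pdo / math.log(2) * linear


# Hypothetical counts for one feature with two attributes.
woes = weight_of_evidence(n_good=[80, 20], n_bad=[10, 10])
# Hypothetical coefficients; in practice they come from logistic regression.
example_score = scorecard_score(beta0=0.5, betas=[1.2], applicant_woes=[woes[0]])
```

With this scaling, raising the linear predictor by ln(2) raises the score by exactly pdo points, which is what makes pdo a convenient scaling factor.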

Fig. 2. Process flowchart of the systematic literature survey methodology.

2.1. Existing literature surveys on bankruptcy prediction or credit scoring

There are several literature surveys on bankruptcy prediction or credit scoring [6,8–13]. We briefly highlight the objectives of each literature survey and show where they differ from our study in Table 1. The literature surveys in Table 1 focus mostly on comparing models based on their performances. However, the listed literature surveys ignore key factors in credit scoring, such as model transparency and the nature of datasets. Furthermore, deep learning models are not considered in the literature surveys listed in Table 1.

Among the existing literature surveys shown in Table 1, [12] provides the most recent and the most comprehensive systematic literature survey in credit scoring. The main objective of [12] is "proposing a new method for rating, comparing traditional techniques, conceptual discussions, feature selection, literature review, performance measures studies and, at last, other issues". Their literature survey covers a period from January 1992 to December 2015 and focuses on 12 questions on the conceptual scenario over the techniques. Comparing [12] with our literature survey, there is an overlap from 2010 to 2015. However, our literature survey includes the most recent years. To the best of our knowledge, no systematic literature survey for credit scoring has been published that encompasses statistical learning, traditional machine learning and deep learning models.

3. Feature selection and feature engineering

This section provides a review of the most commonly used feature selection (FS) and feature engineering (FE) techniques in credit scoring. The distinction between feature selection and feature engineering is that feature selection selects a subset of features from the entire feature set, whereas feature engineering creates a new set of features from the existing features. Although Liang et al. [14] investigated the effect of feature selection and concluded that performing feature selection does not always improve the prediction performance, the studies in [15–19] showed that removing redundant features can improve model performance in credit scoring. There are also studies [20–22] which employed meta-heuristic approaches, such as the Genetic Algorithm, for feature selection. We divide feature selection into filter, wrapper and embedded methods.

3.1. Filter methods

The filter methods perform feature selection based on the univariate analysis of the features without using a predictor. They compute a score for each feature and select a subset of features based on their scores. In the following, we present the most commonly used filter methods in credit scoring.

3.1.1. F-score

F-score measures the discrimination of two sets of real numbers [17]. Given $m$ training vectors $x_k$, where $k = 1, 2, \ldots, m$, and if the numbers of positive and negative instances are $m^{(+1)}$ and $m^{(-1)}$, respectively, then the F-score of the ith feature, $F(i)$, is defined as follows

$$F(i) = \frac{\left(\bar{x}_i^{(+1)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-1)} - \bar{x}_i\right)^2}{\frac{1}{m^{(+1)}-1}\sum_{k=1}^{m^{(+1)}}\left(x_{k,i}^{(+1)} - \bar{x}_i^{(+1)}\right)^2 + \frac{1}{m^{(-1)}-1}\sum_{k=1}^{m^{(-1)}}\left(x_{k,i}^{(-1)} - \bar{x}_i^{(-1)}\right)^2} \tag{3}$$
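As an illustration of Eq. (3), the following is a minimal NumPy sketch that computes the F-score of every feature at once. It is a hedged example rather than a reference implementation: the function name `f_score` and the toy data are hypothetical, labels are assumed to be coded as +1 and −1, and the denominator uses the sample variance of each class (normalised by the class size minus one), as in the equation.

```python
import numpy as np


def f_score(X, y):
    """F-score of each feature following Eq. (3).

    X: (m, n) array of m samples and n features.
    y: (m,) array of labels, assumed to be +1 (positive) or -1 (negative).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == -1]

    mean_all = X.mean(axis=0)
    mean_pos, mean_neg = pos.mean(axis=0), neg.mean(axis=0)

    # Numerator: spread of the two class means around the overall mean.
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: within-class sample variances (ddof=1 divides by m - 1).
    denominator = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numerator / denominator


# Toy data: feature 0 separates the classes, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0],
                  [7.0, 4.0], [8.0, 5.0], [9.0, 3.0]])
y_toy = np.array([1, 1, 1, -1, -1, -1])
scores = f_score(X_toy, y_toy)
```

On this toy data the first feature receives a much larger F-score than the second, so a filter method would rank it first. The sketch does not guard against a zero denominator, which occurs when a feature is constant within both classes.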

Table 1
Existing literature surveys on bankruptcy prediction or credit scoring and their differences from this survey paper.

| Survey paper | Articles searched | Years | Objective | Difference from this survey |
|---|---|---|---|---|
| [8] | 165 | 1965–2010 | Investigation of model type by decade; analysis of number of features used by decade; compare model performance | Transparency is not reported; deep learning models are not covered; nature of datasets (balanced vs imbalanced) is not reported |
| [9] | 214 | Not specified | Comparing models based on performance | Transparency is not reported; deep learning models are not covered; nature of datasets (balanced vs imbalanced) is not reported |
| [10] | 130 | 1995–2010 | Models are compared based on design, datasets and baselines | Transparency is not covered; deep learning models are not covered; nature of datasets (balanced vs imbalanced) is not reported |
| [11] | Not specified | Not specified | Comparing models based on performance | Transparency is not reported; deep learning models are not covered; nature of datasets (balanced vs imbalanced) is not reported |
| [12] | 187 | 1992–2015 | Proposing a new method for scoring; compare traditional techniques; conceptual discussion | Transparency is not reported; deep learning models are not covered; nature of datasets (balanced vs imbalanced) is not reported |
| [13] | 6 | Not specified | Compare model performance | Transparency is not reported; deep learning models are not covered; nature of datasets (balanced vs imbalanced) is not reported |
| [6] | 49 | 2010–2015 | Compare models based on accuracy, transparency, data size capability, data dispersion, etc. | Deep learning models are not covered; nature of datasets (balanced vs imbalanced) is not reported |

where $x_{k,i}$ is the ith feature of the kth sample, and $\bar{x}_i$, $\bar{x}_i^{(+1)}$ and $\bar{x}_i^{(-1)}$ are the averages of the ith feature of the whole, positive, and negative datasets, respectively [17]. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the one within each of the two sets. The larger the F-score value is, the more discriminative power the feature has [17,23].

3.1.2. Rough set theory

In rough set theory, the sample dataset is called an information system, denoted by $I = (U, A)$, where $U$ is a non-empty finite set of observations called the universe and $A$ is a non-empty finite set of features. Firstly, a notion of indiscernibility relation on a universe set is defined. With every subset $B \subseteq A$, an indiscernibility relation on $U$ is defined as follows

$$I(B) = \{(x, y) \in U \times U : f_i(x) = f_i(y), \ \forall i \in B\}, \tag{4}$$

where $f_i : U \to V_i$ is the ith feature function and $V_i$ is a set of values associated with feature i. Feature i is redundant in $B$ if $I(B) = I(B - \{i\})$, otherwise feature i is important in $B$. If all of the features in $B$ are important, then $B$ is said to be a reduced set of features [24–26].

3.2. Wrapper methods

The wrapper methods select a subset of features based on an evaluation metric, such as accuracy, of a pre-determined predictor. Each subset of the features is scored according to its predictive power; thus, the wrapper methods are computationally expensive compared to the filter methods. However, since they select features based on the performance of the pre-determined predictor, the wrapper methods usually outperform filter methods in credit scoring. In the following, we present the most commonly used wrapper methods in credit scoring.

3.2.1. Stepwise selection

The stepwise feature selection has three components [3], namely, forward feature selection, backward feature elimination and a stepwise feature selection which is a combination of both forward feature selection and backward feature elimination. All three components use linear regression and a p-value for feature selection. The forward feature selection starts by regressing one feature, and if the feature is significant according to the p-value then that feature is retained. This process is repeated by adding one feature at a time until the list of features is exhausted; the significant features are retained and the insignificant features are discarded. The backward feature elimination works the opposite way: it starts with the entire set of features and keeps discarding features with p-values greater than the chosen level of significance [27]. The stepwise feature selection both adds and removes features.

3.2.2. Genetic algorithm

A genetic algorithm [28] uses genetically inspired operators to evolve an initial population into a new population. These operators are selection, crossover and mutation. Each population comprises chromosomes (strings of bits (0/1)) that represent genetically encoded individual solutions to a specific problem. The selection operator selects chromosomes in the population (a set of solutions) for reproduction. The crossover operator randomly chooses a position and exchanges the subsequent bits before and after that position between two chromosomes to create two offspring. The mutation operator randomly flips some of the bits in a chromosome. Each individual has a fitness score assigned to it, which represents its ability to be selected. For feature selection, the individuals are subsets of features that are encoded as bits, where the ith feature is selected if the corresponding bit is set to 1. The fitness value is some measure of model performance, such as classification accuracy. A new population is evolved by using the genetic operators, where selection is based on an individual's fitness function and its ability to reproduce the next

generation [29]. Hence, a genetic algorithm is an evolutionary algorithm that solves a problem by searching a space (in our case, a space of features). However, the key elements of problem solving by search are exploitation and exploration. Exploitation uses the information from previously visited points to determine the prospect of finding profitable regions to be visited next, and exploration is the process of visiting new regions of a search space to uncover promising offspring [30]. "How and when to control and balance exploration and exploitation in the search process to obtain even better fitness results and/or convergence faster are still on-going research" [31]. It is key that a diversified population is maintained during the whole search process. The measure of diversity is entropy, which represents the amount of population disorder; an increase in entropy results in an increase in diversity [30].

3.3. Embedded methods

The embedded methods optimize an objective function or learning classifier with a goodness-of-fit term and a penalty for a large number of features [32]. The most commonly used embedded method in credit scoring is the least absolute shrinkage and selection operator (LASSO) [33]. Given a linear regression model to predict the dependent variable $y_i$ using independent variables $x_{i,j}$ for the ith sample in the training dataset, i.e.

$$y_i = \sum_{j=1}^{n} x_{i,j}\beta_j + \beta_0, \tag{5}$$

where $j = 1, 2, \ldots, n$, the LASSO [33] solves the L1-penalized regression problem of finding the set of parameters $\beta = \{\beta_j\}_{j=1}^{n}$ to minimize

$$\sum_{i=1}^{m}\left( y_i - \sum_{j=1}^{n} x_{i,j}\beta_j \right)^2 + \lambda \sum_{j=1}^{n} |\beta_j|, \tag{6}$$

where $m$ is the number of instances in the training dataset and $\lambda \sum_{j=1}^{n} |\beta_j|$ represents the penalty term. The penalty term shrinks the coefficients $\beta_j$ toward zero and simplifies the linear regression model. Thus, the LASSO performs variable selection and model shrinkage based on the magnitude of the $\beta_j$.

3.4. Feature engineering

The original features may be dependent on each other, which may reduce the performance of a predictor. To tackle this problem, feature engineering is used to create a new set of engineered features from the original ones. The feature engineering methods can either decrease or increase the dimensionality of feature vectors [34]. In the following, we provide a brief review of the most commonly used feature engineering methods in credit scoring.

3.4.1. Principal Component Analysis

Principal Component Analysis (PCA) aims to transform/compress the data from a higher dimensional space $\mathbb{R}^n$ to a lower dimensional space $\mathbb{R}^k$ [35], where $k \ll n$. Note that most of the relevant feature information (variance) is retained during this transformation. The reduced feature space consists of principal components which are orthogonal to each other. Each principal component represents a direction of maximum remaining variance. The principal components are computed via eigenvalue decomposition of the covariance matrix of features. The resulting eigenvectors from the decomposition are used as principal components. The following steps are used when constructing principal components:

1. Standardize the n-dimensional dataset. Each feature $x_{i,j}$ of the feature vector $x_i$ is standardized according to

$$x_{i,j} \leftarrow \frac{x_{i,j} - \mu_j}{\sigma_j}, \tag{7}$$

   where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the jth feature.
2. Construct the covariance matrix $\Sigma$. The covariance matrix is computed as

$$\Sigma = \frac{1}{m-1}\sum_{i=1}^{m} x_i x_i^T, \tag{8}$$

   where $x_i$ is assumed to be a column vector.
3. Apply eigenvalue decomposition on the covariance matrix to compute eigenvalues ($\lambda_j$) and eigenvectors ($e_j$).
4. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace. Generally, the variance explained ratio is used to select the number of principal components (eigenvectors). For a set of all sorted eigenvalues

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n, \tag{9}$$

   the variance explained ratio is defined as follows

$$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}. \tag{10}$$

   The chosen k eigenvectors will have a high total explained variance.
5. Finally, use the k eigenvectors and project each feature vector $x$ to the k-subspace, i.e.,

$$\hat{x} = (x - \mu)^T [e_1, e_2, \ldots, e_k], \tag{11}$$

   where $\mu = [\mu_1, \mu_2, \ldots, \mu_n]^T$ is the mean vector and $\hat{x} \in \mathbb{R}^k$ is the projection of $x$ onto the k-subspace.

3.4.2. Autoencoders

Fig. 3 shows an autoencoder, which is a type of neural network with three layers, namely input, hidden and output layers. For a typical autoencoder, the input variables are the same as the target variables. The autoencoder learns representations of its inputs at the hidden layer and tries to reconstruct its inputs from the learned representations by optimizing a cost function on a training dataset, i.e.

$$E = \frac{1}{2}\sum_{i=1}^{m} \left\| a_2\left(a_1\left(\alpha^{(1)} x_i\right)\alpha^{(2)}\right) - x_i \right\|_2^2, \tag{12}$$

where $\alpha^{(1)}$, $\alpha^{(2)}$ are weights and $a_1$, $a_2$ are activation functions between the input and hidden layer, and the hidden layer and output layer, respectively. Once $\alpha^{(1)}$ and $\alpha^{(2)}$ are learned, a feature vector $x$ is mapped to $\hat{x}$ as follows: $\hat{x} = a_1(\alpha^{(1)} x)$, which is further used as the engineered feature vector. One can choose a different number of hidden layers for the autoencoder to learn complex representations in the data. Furthermore, the autoencoder can both decrease or increase the dimensionality of the projected vectors $\hat{x}$ with respect to the dimensionality of the input feature vector $x$.

3.4.3. Linear discriminant analysis

Linear discriminant analysis (LDA) was first introduced by Fisher [36]. The following outlines the steps taken when using LDA for feature engineering [37].

1. Calculate n-dimensional mean vectors $\mu^{(+1)}$ and $\mu^{(-1)}$ for "good" (or positive (+1)) and "bad" (or negative (−1))

   classes, respectively:

$$\mu^{(+1)} = \frac{1}{m^{(+1)}}\sum_{k=1}^{m^{(+1)}} x_k^{(+1)}, \qquad \mu^{(-1)} = \frac{1}{m^{(-1)}}\sum_{k=1}^{m^{(-1)}} x_k^{(-1)}. \tag{13}$$

2. Construct the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$:

$$S_w = \sum_{\forall c \in \{+1,-1\}} \frac{m^{(c)}}{m^{(+1)} + m^{(-1)}} \sum_{k=1}^{m^{(c)}} \left(x_k^{(c)} - \mu^{(c)}\right)\left(x_k^{(c)} - \mu^{(c)}\right)^T \tag{14}$$

$$S_b = \sum_{\forall c \in \{+1,-1\}} \frac{m^{(c)}}{m^{(+1)} + m^{(-1)}} \left(\mu^{(c)} - \mu\right)\left(\mu^{(c)} - \mu\right)^T \tag{15}$$

   where $\mu = \frac{m^{(+1)}}{m^{(+1)} + m^{(-1)}}\mu^{(+1)} + \frac{m^{(-1)}}{m^{(+1)} + m^{(-1)}}\mu^{(-1)}$ is the overall mean of the dataset.
3. Apply eigenvalue decomposition on the matrix $\operatorname{inv}(S_w)\,S_b$ to compute eigenvalues ($\lambda_j$) and eigenvectors ($e_j$), where $\operatorname{inv}(\cdot)$ is the matrix inverse.
4. Similar to PCA, select the k eigenvectors that correspond to the k largest eigenvalues.
5. Finally, use the k eigenvectors and project each feature vector $x$ to the k-subspace, i.e.,

$$\hat{x} = x^T [e_1, e_2, \ldots, e_k], \tag{16}$$

   where $\hat{x} \in \mathbb{R}^k$ is the projection of $x$ onto the k-subspace.

Fig. 3. An example of an autoencoder with one hidden layer. The green circles represent inputs, the blue circles represent engineered features, the red circles are reconstructed inputs.

4. Supervised learning

Credit scoring is a supervised learning problem, specifically a binary classification problem, where the aim is to distinguish good borrowers from bad borrowers. In this section we first define the supervised learning problem and then present the most commonly used statistical and machine learning methods in credit scoring.

4.1. Supervised learning problem

In the context of credit scoring, the supervised learning problem can be defined as searching for a model $\phi : \mathbb{R}^n \to \{+1, -1\}$ which maps a feature vector $x \in \mathbb{R}^n$ to a predicted class label $\hat{y} \in \{+1, -1\}$, i.e.

$$\phi_\alpha : x \to \hat{y}, \tag{17}$$

where $\alpha$ is the set of the model parameters. For ease of reading, $\phi_\alpha$ and $\phi$ are used interchangeably.

In order to learn the model parameters, information from the previously processed applicants is collected to form a labeled dataset

$$D = \{(x_k, y_k)\}_{k=1}^{m},$$

where $x_k \in \mathbb{R}^n$ denotes the kth applicant's feature vector and $y_k \in \{+1, -1\}$ is the corresponding label from the set of classes (good borrower = +1) and (bad borrower = −1). The dataset $D$ is split into a training set $D_{train} \subset D$ and a test dataset $D_{test} \subset D$, where $D_{train} \cap D_{test} = \emptyset$ and $D_{train} \cup D_{test} = D$.

Once the training and testing datasets are formed, the parameters $\alpha$ of the model $\phi$ are learned on the training dataset $D_{train}$ by minimizing a cost function $C$

$$C(D_{train}, \phi) = \sum_{\forall x_k \in D_{train}} d(y_k, \phi(x_k)), \tag{18}$$

where $d(y_k, \phi(x_k))$ is a distance measure, such as the Mean Squared Error (MSE) or Cross-Entropy, between $y_k$ and its prediction $\hat{y}_k = \phi(x_k)$ from the model. The optimal parameters $\alpha^*$ of the model $\phi$ are obtained according to

$$\alpha^* = \operatorname*{argmin}_{\alpha} C(D_{train}, \phi_\alpha). \tag{19}$$

The performance of the learned model $\phi_{\alpha^*}$ is assessed on the test dataset $D_{test}$.

4.2. Statistical learning models

This section discusses popular statistical learning techniques in credit scoring.

4.2.1. Linear discriminant analysis

The Linear Discriminant Analysis (LDA) develops a linear combination of independent features which yields the largest mean difference between classes. Under the assumption of equal class covariances [38], we have the following solution in Eq. (20) for a binary classification problem, such as credit scoring. For a given feature vector $x$ we decide for class $\hat{y} = +1$ if the expression on the left hand side is greater than the expression on the right hand side, otherwise we decide for class $\hat{y} = -1$

$$\left(\mu^{(+1)} - \mu^{(-1)}\right)^T \Sigma^{-1} (x - \mu) \gtrless \log(P(y = -1)) - \log(P(y = +1)), \tag{20}$$

where $P(\cdot)$ denotes probability.

4.2.2. Logistic regression

The Logistic Regression (LR) is the most commonly used statistical model in credit scoring due to the interpretability of its decisions. The LR model is defined as

$$P(y = +1 \mid x) = \frac{1}{1 + \exp(\alpha_0 + \alpha^T x)} \tag{21}$$

and

$$P(y = -1 \mid x) = 1 - P(y = +1 \mid x) = \frac{\exp(\alpha_0 + \alpha^T x)}{1 + \exp(\alpha_0 + \alpha^T x)}, \tag{22}$$

where $x \in \mathbb{R}^n$ is the feature vector, $P(y = +1 \mid x)$ is the probability of classifying $x$ as a good borrower, $P(y = -1 \mid x)$ is the probability of classifying $x$ as a bad borrower and $\{\alpha_0, \alpha\}$ are the model parameters estimated by using, e.g., maximum likelihood estimation on the training dataset [4]. Once the model parameters are estimated, the decision on an input feature vector $x$ is made in favor of $\hat{y} = +1$ if

$$P(y = +1 \mid x) \ge P(y = -1 \mid x), \tag{23}$$

which is equivalent to the following decision rule

for 1 ≥ exp(α0 + αT x);


{
+1
ŷ = (24)
−1 otherwise.

4.2.3. Naïve Bayes


The Naïve Bayes (NB) classifier is based on the Bayes decision rule [39]. The adjective "naive" comes from the assumption that the features in a dataset are mutually independent. The Naïve Bayes classifier decides for class $y = +1$ over $y = -1$ if

\[ P(y = +1 \mid x) \geq P(y = -1 \mid x), \tag{25} \]

where

\[ P(y = +1 \mid x) = \frac{p(x \mid y = +1)\,P(y = +1)}{p(x)} \tag{26} \]

and

\[ P(y = -1 \mid x) = \frac{p(x \mid y = -1)\,P(y = -1)}{p(x)}. \tag{27} \]

The probability density $p(x)$ of observing $x$ is defined according to

\[ p(x) = p(x \mid y = +1)\,P(y = +1) + p(x \mid y = -1)\,P(y = -1) \tag{28} \]

and the conditional density functions $p(x \mid y = -1)$ and $p(x \mid y = +1)$ are modeled according to

\[ p(x \mid y = -1) = \frac{1}{(2\pi)^{m/2}\,|\Sigma^{(-)}|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu^{(-)})^T \operatorname{inv}\!\left(\Sigma^{(-)}\right) (x - \mu^{(-)}) \right\} \tag{29} \]

and

\[ p(x \mid y = +1) = \frac{1}{(2\pi)^{m/2}\,|\Sigma^{(+)}|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu^{(+)})^T \operatorname{inv}\!\left(\Sigma^{(+)}\right) (x - \mu^{(+)}) \right\}, \tag{30} \]

where $\Sigma^{(-)}$ and $\Sigma^{(+)}$ are covariance matrices computed on negative and positive instances from the training dataset. For the Naïve Bayes classifier, $\Sigma^{(-)}$ and $\Sigma^{(+)}$ are diagonal matrices.
Note that $p(x \mid y = -1)$ and $p(x \mid y = +1)$ are uni-modal multivariate Gaussian density functions, which may not adequately model multi-modal data. In this case, one can employ a multivariate Gaussian Mixture Model [40] and Expectation Maximization (EM) [41] to learn the model parameters on the training dataset.

4.3. Machine learning models

This section discusses popular machine learning techniques in credit scoring.

4.3.1. k-Nearest neighbor
A k-Nearest Neighbor (k-NN) classifier [42] assigns to an input feature vector $x$ the class of the majority of its $k$ nearest neighbors in the training dataset. The nearest neighbors are determined by calculating the Euclidean distance or Mahalanobis distance between the input feature vector $x$ and the training dataset $\{x_k\}_{k=1}^{m}$. Thus the class for the new data point is

\[ y = \underset{y \in \{+1,-1\}}{\operatorname{majority}} \left[ \underset{k}{\operatorname{argmin}} \left\| \{x_k\}_{k=1}^{m} - x \right\| \right]. \tag{31} \]

Fig. 4. An example of a decision tree in credit scoring.

4.3.2. Decision trees
As shown in Fig. 4, a Decision Tree (DT) [39] asks a series of questions in order to arrive at an answer (which is a class label). The root of the Decision Tree is called the root node and it is the most discriminating feature. The leaf nodes denote the classes.

4.3.3. Support vector machines
A Support Vector Machine (SVM) [43] uses the idea of a hyperplane (which is a decision boundary) that separates classes in a high dimensional feature space. The linear SVM focuses on maximizing the margin $2/\|\alpha\|$, where $\|\alpha\| = \sqrt{\sum_{i=1}^{m} \alpha_i^2}$, between the negative and positive hyperplanes. The correct class is assigned by using the following equation:

\[
y =
\begin{cases}
+1, & \text{if } b + \alpha^T x \geq +1 \\
-1, & \text{if } b + \alpha^T x \leq -1,
\end{cases}
\tag{32}
\]

where $b$ is the bias. For non-linear cases, a kernel trick [44] is used to project features into a high dimensional space.

4.3.4. Artificial neural networks
An Artificial Neural Network (ANN) [45] is a system which is motivated by a biological neural network system. An ANN emulates the way in which a biological neural network of the brain processes information by means of interconnected neurons [2,46,47]. Typically, a neural network consists of three layers, namely an input layer, a hidden layer and an output layer [48]. Essentially, training a neural network involves the process of finding optimal weights that map the input layer to the output layer by means of back-propagation.
For a given input feature vector $x$, a three-layer ANN computes the output $\hat{y}$ according to

\[ \hat{y} = a_2\left( a_1\left( \alpha^{(1)} x + \alpha_0^{(1)} \right) \alpha^{(2)} + \alpha_0^{(2)} \right), \tag{33} \]

where $(\alpha_0^{(1)}, \alpha^{(1)})$ and $(\alpha_0^{(2)}, \alpha^{(2)})$ are weights and $a_1$, $a_2$ are activation functions between the input and hidden layer, and the hidden and output layer, respectively. The parameters are learned on a training set. The ANN makes its final decision by applying a decision function, such as soft-max, on $\hat{y}$. The ANN was first applied in credit scoring by Odom and Sharda [49].

4.3.5. Random forests
A Random Forest (RF) is an ensemble of decision trees [50], i.e. $K$ decision trees are built on bootstrapped samples with $m$ observations. Each decision tree is developed using a subset of $k$ randomly chosen features. Each decision tree gives a class for a new feature vector. Thereafter, for overall classification, an RF assigns the class of the new feature vector by using a majority vote based on the outputs from the decision trees.
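The random forest procedure described above (bootstrapped samples, a random feature subset per tree, and a majority vote) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: depth-one trees (decision stumps) stand in for full decision trees, and the function names, the toy data and the convention that +1 marks a default are all hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature_ids):
    """Fit a depth-one tree: the (feature, threshold, polarity) with fewest errors."""
    best = None
    for j in feature_ids:
        for threshold in sorted({row[j] for row in X}):
            for polarity in (+1, -1):
                preds = [polarity if row[j] >= threshold else -polarity for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, polarity)
    _, j, threshold, polarity = best
    return j, threshold, polarity


def predict_stump(stump, x):
    j, threshold, polarity = stump
    return polarity if x[j] >= threshold else -polarity


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Build n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(m) for _ in range(m)]   # bootstrap: m draws with replacement
        feats = rng.sample(range(n), n_features)     # random subset of the features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, x):
    """Majority vote over the individual tree outputs."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): [income, debt ratio]; +1 is assumed to mark a default.
X = [[20, 0.9], [25, 0.8], [30, 0.7], [60, 0.2], [70, 0.3], [80, 0.1]]
y = [+1, +1, +1, -1, -1, -1]
forest = fit_forest(X, y)
prediction = predict_forest(forest, [75, 0.15])
```

An odd number of trees is used so that a two-class vote cannot end in a tie.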

Fig. 5. A restricted Boltzmann machine architecture with a visible layer v and a hidden layer h.

4.3.6. Boosting
Boosting works by estimating multiple models iteratively and assigning weights to data instances [51]. Boosting starts by developing a weak model (e.g. a shallow decision tree). Thereafter, a better model that addresses the errors of the previous model is developed. The instances which were incorrectly classified by the previous model are assigned higher weights. A popular boosting technique is Adaptive Boosting (AdaBoost). AdaBoost assigns a class to an input feature vector $x$ in the following way

\[ \hat{y} = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t \phi_t(x) \right), \tag{34} \]

where $\alpha_t$ is the weight of classifier $\phi_t(x)$ and $T$ is the total number of classifiers. The parameters $\alpha_t$ are learned on the training dataset.
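The weighted vote of Eq. (34), together with the re-weighting of misclassified instances, can be sketched as below. This is an illustrative implementation of the standard discrete AdaBoost update with decision stumps as the weak classifiers $\phi_t$; the function names and the one-dimensional toy data are assumptions made for the example and do not come from the surveyed papers.

```python
import math


def stump_predict(stump, x):
    j, threshold, polarity = stump
    return polarity if x[j] >= threshold else -polarity


def fit_weighted_stump(X, y, w):
    """Return (weighted error, stump) for the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (+1, -1):
                stump = (j, threshold, polarity)
                err = sum(wi for row, t, wi in zip(X, y, w)
                          if stump_predict(stump, row) != t)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best


def adaboost_fit(X, y, T=3):
    m = len(X)
    w = [1.0 / m] * m                       # start with uniform instance weights
    ensemble = []
    for _ in range(T):
        err, stump = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # classifier weight alpha_t
        ensemble.append((alpha, stump))
        # Misclassified instances (t * prediction = -1) receive higher weights.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for row, t, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    """Eq. (34): sign of the alpha-weighted sum of weak classifier outputs."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost_fit(X, y, T=3)
train_predictions = [adaboost_predict(ensemble, row) for row in X]
```

On this toy set, three rounds are enough for the combined vote to classify every training point correctly, although each individual stump misclassifies at least two points.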

4.3.7. Extreme gradient boost
XGBoost [52] is short for extreme gradient boosting. XGBoost is well known for its processing speed and performance. It works similarly to gradient boosting but parallelizes the construction of the decision trees [53]; the trees themselves are still added one round at a time, since each round depends on the previous prediction $\hat{y}_k^{(t-1)}$. The optimization function to minimize is

\[ L^{(t)} = \sum_{k=1}^{n} l\left( y_k, \hat{y}_k^{(t-1)} + \phi_t(x_k) \right) + \Omega(\phi_t) \tag{35} \]

where $l(\cdot)$ is a loss function and $\Omega(\phi_t)$ is a regularization term that penalizes the complexity of the model. The goal of XGBoost is to find the $\phi_t$ which minimizes the objective function $L^{(t)}$ [54].

4.3.8. Bagging
The Bagging classifier is also known as bootstrap aggregation [55]. The Bagging technique takes $K$ bootstraps from the underlying dataset and builds a classifier for every bootstrap. Then a class label is assigned by using a majority vote from the votes of the $K$ classifiers

\[ y = \underset{y \in \{+1,-1\}}{\operatorname{argmax}} \sum_{i=1}^{K} \mathbb{1}\left( y = \phi_i(x) \right), \tag{36} \]

where

\[
\mathbb{1}\left( y = \phi_i(x) \right) =
\begin{cases}
1, & \text{if } y = \phi_i(x); \\
0, & \text{if } y \neq \phi_i(x).
\end{cases}
\tag{37}
\]

4.4. Deep learning models

Deep learning algorithms have been successfully applied in literature since the 1980s in an attempt to improve classification accuracy. Moreover, deep learning models with optimal hidden layers have been developed to reveal information not easily detectable with traditional statistical and machine learning models. The following subsections highlight the deep learning models that are used in credit risk datasets.

4.4.1. Restricted Boltzmann machines
As shown in Fig. 5, a Restricted Boltzmann Machine (RBM) can be perceived as an undirected neural network with two layers. The two layers are called the hidden and the visible layers. The hidden layer is used as a feature detector while the visible layer receives the input data [56]. Given $n$ visible units $v$ and $m$ hidden units $h$, the energy function is

\[ E(v, h) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_{ij} v_i h_j, \tag{38} \]

where $a_i$ and $b_j$ are biases for binary variables $v_i$ and $h_j$ respectively, while $\alpha_{ij}$ are the weights between unit $i$ and unit $j$.

4.4.2. Deep belief neural networks
Deep Belief Networks (DBNs), introduced by Hinton et al. [57], are a class of deep neural networks. As shown in Fig. 6, a typical DBN is composed of several hidden layers of RBMs. Essentially, the output of a lower level RBM can be perceived as the input of the higher level RBM. Fig. 6 shows a graphical view of the DBN layers.

Fig. 6. A deep belief neural network architecture which is composed of several hidden layers of restricted Boltzmann machines.

4.4.3. Convolutional neural networks
Convolutional Neural Networks (CNNs) were first introduced by LeCun et al. [58] and have been mainly used in image processing [59–61], and on time series data [60,62,63]. The convolutional neural network consists of an input layer, convolutional layers, pooling layers and fully connected layers (see Fig. 7). The convolutional and pooling layers are responsible for extracting data representations [64].

Input. A convolutional neural network takes inputs as tensors of shape (height, width, channel). In image classification, the height and width values represent the height and width of an image. The channel represents the depth/color of an image (e.g. 1 represents a gray-scale image and 3 represents an image with color). A tensor is a multidimensional array that contains numerical values or in some instances non-numeric values. The input shape is normally changed into a shape that the convolutional neural network anticipates and the input is scaled so that all values are in the [0, 1] interval [64]. Since credit scoring data is not image data, a 1D convolutional neural network is used, as is normal for non-image data (with the exception of speech/voice data). Hence, for credit scoring, the CNN architecture consists of an input layer which is a tensor of shape (m, 1, n), where m is the number of instances and n is the number of features. Each instance is a feature vector stored as a three-dimensional tensor of shape (1, 1, n). All inputs are scaled into the [0, 1] interval.
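The input preparation described above (scaling every feature into [0, 1] and arranging the data as a tensor of shape (m, 1, n)) can be illustrated with a short sketch. Nested Python lists stand in for a tensor here, min-max scaling is assumed as the scaling method, and the function names and toy values are hypothetical.

```python
def min_max_scale(rows):
    """Scale each feature (column) into [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def to_cnn_input(rows):
    """Arrange m instances with n features as a nested list of shape (m, 1, n)."""
    return [[scaled_row] for scaled_row in min_max_scale(rows)]


# Toy data (hypothetical): three borrowers with features [age, balance, flag].
rows = [
    [20.0, 1000.0, 2.0],
    [40.0, 3000.0, 2.0],
    [60.0, 2000.0, 2.0],
]
tensor = to_cnn_input(rows)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
```

In practice the scaling bounds would be computed on the training set only and then reused for new borrowers, so that the test data does not influence the transformation.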

Fig. 7. A 1D convolutional neural network architecture which consists of two convolutional layers, two pooling layers and a fully connected layer.

Convolutional layer. The word "convolution" refers to a mathematical operation which is a specialized kind of linear operation [65]. Below is a convolution function

\[ (f * g)(t) = \int_{-\infty}^{\infty} f(x)\, g(t - x)\, dx. \tag{39} \]

A convolutional layer learns local patterns [64]; for example, in image classification, a convolutional layer breaks down an image into edges and textures. These local patterns are feature maps and are also known as activation maps. A convolution operation is applied on the input tensor and the feature detector. The feature detector, also known as the filter or kernel, is a tensor of shape (height, width). The feature detector slides along different locations of an input tensor to form a feature map. The feature map is a reduced and transformed input tensor. A non-linear activation function such as ReLU (Rectified Linear Unit) or a Sigmoid function is applied on activation maps to introduce non-linearity.

Pooling layer. A pooling layer is responsible for downsampling feature maps. At the pooling layer, either max pooling or average pooling is used. For max pooling, each local input location is transformed by taking the maximum value of each channel over the location, whereas for average pooling, an average value of each channel is used. A convolutional neural network has a property called spatial invariance, which occurs at a pooling layer. The spatial invariance assures that the network is not influenced by input distortions or variations. Thus, the pooling layer is capable of preserving important features.

Flattening. The pooled feature maps are then converted into a single vector by a process called flattening. This single vector becomes an input to a fully connected artificial neural network.

Fully connected layer. A Fully Connected Layer is a feed forward artificial neural network. It consists of an input layer, hidden layer/s and an output layer. At the output layer, a loss function or cost function is defined as follows

\[ C(y, \hat{y}) = \sum_{\forall i} d(y_i, \hat{y}_i), \tag{40} \]

where $y_i, \hat{y}_i \in \{+1, -1\}$ and $d(y_i, \hat{y}_i)$ is a measure between $y_i$ and $\hat{y}_i$ such as the Mean Squared Error (MSE) or Cross-Entropy. The mean squared error is

\[ \mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2, \tag{41} \]

and the cross-entropy is

\[ H(y, \hat{y}) = -\sum_{i=0}^{1} y_i \log \hat{y}_i. \tag{42} \]

A neural network tries to minimize the cost function by learning and updating the connection weights using back propagation [66].

4.4.4. Deep Multi-Layer Perceptron
A Multi-Layer Perceptron with multiple hidden layers is called a Deep Multi-Layer Perceptron. A Deep Multi-Layer Perceptron is a directed neural network and training is performed by back-propagation. The cost or loss function for the Deep Multi-Layer Perceptron uses Softmax and Cross-Entropy to update the weights. The Softmax function is

\[ f_j(z) = \frac{e^{z_j}}{\sum_{k} e^{z_k}}, \tag{43} \]

where $f_j(z) \in [0, 1]$ is a probability and $z_j$ is the $j$-th output of the network.

5. Evaluation metrics

There are many evaluation metrics which are used in literature, and the following are the most popular metrics for assessing the performance of the models in credit scoring: the Percentage Correctly Classified (PCC), Type I Error, Type II Error, Kolmogorov–Smirnov Statistic (K-S), Sensitivity/Recall, Specificity, Geometric-Mean (G-mean), F-measure and Area Under Receiver Operating Characteristics Curve (AUC). Each of these metrics is shown in the sequel.
A confusion matrix (see Table 2) consists of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) and is used for calculating the metrics which are discussed in this section. Based on credit scoring classification, TN is the number of borrowers who are correctly classified as non-defaults, FP is the number of non-defaulted borrowers who are incorrectly classified as defaults, FN is the number of defaulted borrowers who are incorrectly classified as non-defaults and TP is the number of borrowers who are correctly classified as defaults.

Table 2
A confusion matrix with correct predictions, True Positives (TP) and True Negatives (TN), and incorrect predictions, False Positives (FP) and False Negatives (FN).

                     Predicted positives   Predicted negatives
Actual positives     TP                    FN
Actual negatives     FP                    TN

From the confusion matrix the performance metrics can be derived, such as

\[ \mathrm{PCC} = \frac{TP + TN}{TP + FP + FN + TN}, \tag{44} \]

\[ \text{Type I} = \frac{FP}{FP + TN}, \tag{45} \]

\[ \text{Type II} = \frac{FN}{TP + FN}, \tag{46} \]

\[ \text{Sensitivity/Recall} = \frac{TP}{TP + FN}, \tag{47} \]

\[ \text{Specificity} = \frac{TN}{FP + TN}, \tag{48} \]

\[ \text{G-mean} = \sqrt{ \frac{TN}{FP + TN} \times \frac{TP}{TP + FN} }, \tag{49} \]

\[ \text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Precision} + \text{Recall}}, \tag{50} \]

where Precision $= TP/(TP + FP)$.
The Kolmogorov–Smirnov Statistic

\[ \text{K-S} = \max \left| P(s \mid G) - P(s \mid B) \right| \tag{51} \]

calculates the maximum distance between the cumulative score distributions of bads $P(s \mid B)$ and goods $P(s \mid G)$

\[ P(s \mid B) = \sum_{t \leq s} p(t \mid B) \tag{52} \]

and

\[ P(s \mid G) = \sum_{t \leq s} p(t \mid G) \tag{53} \]

respectively, where $s \in \mathbb{Z}^+$ denotes a score in $500 \leq s \leq 1000$.
The AUC [67] is

\[ \mathrm{AUC} = \frac{1}{2} \left( 1 + \frac{TP}{TP + FN} - \frac{FP}{FP + TN} \right) \tag{54} \]

and measures the discriminative power of a model between classes.

6. Data imbalance

Imbalanced datasets occur when the number of observations in one class (referred to as the minority class) in a dataset is much lower than the number of observations in the other class (referred to as the majority class). There are studies that have dealt with imbalanced datasets. For example, Brown and Mues [15] showed that the random forest and gradient boosting classifiers perform very well in a credit scoring context and are able to cope comparatively well with pronounced class imbalances in the datasets. On the other hand, when faced with a large class imbalance, the C4.5 decision tree algorithm, quadratic discriminant analysis and k-nearest neighbors perform significantly worse than the best performing classifiers. Douzas and Bacao [68] tackle the problem of imbalanced datasets by using a novel oversampling method, Self-Organizing Map-based Oversampling (SOMO).
There are a number of over-sampling techniques (applied on the minority class) and under-sampling techniques (applied on the majority class) that can be found in literature. For example, Chawla et al. [69] proposed the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE over-samples the minority class by taking each minority class sample and creating synthetic examples (along the line segments joining any/all of the k minority class nearest neighbors). Thereafter, neighbors from the k nearest neighbors are randomly chosen, depending on the amount of over-sampling required. For instance, if the amount of over-sampling needed is 300%, only three neighbors are chosen and one sample is generated in the direction of each. Synthetic samples are generated by taking the difference between the feature vector (sample) under consideration and its nearest neighbor. Thereafter this difference is multiplied by a random number between 0 and 1, and is added to the feature vector under consideration to form a synthetic feature vector.
There are other techniques that do neither over-sampling nor under-sampling to deal with class imbalance, such as wavelet data transformation and the linear dependence approach. Saia et al. [70] proposed a discrete wavelet transformation to deal with imbalanced data in credit scoring. Wavelets are small waves and the wavelet transform captures both the time and frequency domains [71]. The discrete wavelet transform disintegrates a signal wave into a set of wavelets that are mutually orthogonal [71]. The approach of Saia et al. [70] outperformed the random forest model regardless of data distributions. Saia and Carta [72] proposed a linear dependence approach for imbalanced data. The idea was to exploit one class (the majority class) to overcome the imbalanced class distribution issue. The linear dependence is determined by calculating the determinant of square and non-square matrices. The determinant is a real number that is calculated from a matrix. When vectors are dependent, their determinant is zero. Saia and Carta [72] used an average of sub-matrix determinants and a reliability band of the majority class to classify new instances. The proposed approach performed very similarly to the random forest model.

7. Model transparency

This section discusses different model transparency techniques. The aim of model transparency is to make non-transparent models explainable. In the case of credit scoring, this will help in explaining why a borrower was not granted a loan. In the following, we present the most commonly used model transparency techniques in credit scoring.

7.1. NeuroRule

The NeuroRule was first introduced by Setiono and Liu [73]. The NeuroRule uses a three-layer feed-forward neural network. The NeuroRule consists of six steps:

(Step-1) Build and train a neural network;
(Step-2) Prune the neural network to remove irrelevant connections;
(Step-3) Discretize the hidden unit activation values of the pruned neural network by clustering;
(Step-4) Extract rules that describe the network outputs in terms of the discretized hidden activation values;
(Step-5) Extract rules that describe the discretized hidden unit activation values in terms of the network inputs;
(Step-6) Combine the two sets of rules extracted in steps 4 and 5 to obtain a final set of rules that relate the inputs and outputs of the network.

7.2. Trepan

Trepan was first introduced by Craven and Shavlik [74]. Trepan is a hybrid intelligent system. It induces a tree to approximate any classifier's predictions. Hence, Trepan is not restricted to neural networks. Trepan learns from queries, as opposed to normal decision trees, which learn only from the data. At each node, Trepan stores (i) a subset of training observations, (ii) a set of query instances and (iii) a set of constraints (the conditions that observations must satisfy in order to reach the node).

7.3. LIME

LIME stands for Local Interpretable Model-Agnostic Explanations. LIME is capable of explaining a prediction of any classifier

by learning an interpretable model (e.g. a Decision Tree or a linear model) around a prediction [75]. The logic behind explaining predictions is for human subjects to have trust in the predictions if actions need to be taken. For example, if a doctor depends on a model to predict the presence or absence of a disease, he/she will need to trust the model predictions in order to prescribe medication for patients. Since the doctor has prior knowledge in the field of medicine, he/she will either accept or reject the explanation based on his/her expertise. Hence, human prior knowledge plays an important role in accepting or rejecting explanations for predictions. LIME is designed to provide trust to human subjects.

7.3.1. Interpretable data representation
Before acquiring explanations, a data set needs to be in a format that is understandable to humans. This is applicable, e.g., in image classification or text classification, where features in the input space need to be transformed to a vector of ones and zeros to indicate the presence or absence of feature components. An observation $x \in \mathbb{R}^n$ is transformed into $\hat{x} \in \{0, 1\}^n$ for interpretability.

7.3.2. Local fidelity
The behavior of a classifier in the vicinity of an individual prediction determines the local faithfulness of an explanation. Let an explanation be denoted by $e \in E$ where $E$ is a space of interpretable models, e.g., decision trees or linear models. The complexity of an explanation is given by $\Omega(e)$, e.g., for decision trees the complexity is the depth of a tree. Let $\phi : \mathbb{R}^n \to \mathbb{R}$ be a classifier which needs to be explained and let $\pi_x(y)$ be a proximity measure between observation $y$ and observation $x$ which defines the locality around $x$. The measure of how unfaithful $e$ is in explaining $\phi(x)$ in the locality given by $\pi_x(y)$ is

\[ \psi\{\phi, e, \pi_x\}, \tag{55} \]

where

\[ \pi_x(y) = \exp\left( -D(x, y)^2 / \sigma^2 \right) \tag{56} \]

is an exponential kernel defined on some distance $D$ (e.g. Mahalanobis distance).
To obtain interpretability and local fidelity, an optimization is performed to minimize Eq. (55) while $\Omega(e)$ is kept low. Hence, the prediction explanation for observation $x$ is

\[ \zeta(x) = \underset{e \in E}{\operatorname{argmin}} \; \psi\{\phi, e, \pi_x\} + \Omega(e). \tag{57} \]

7.3.3. Model agnostic
For explanations to be model-agnostic, no assumptions are made about $\phi$. This ensures that any prediction of any black-box model can be explained.

7.3.4. Sampling
Eq. (55) is approximated by drawing samples which are weighted by $\pi_x$ around an observation of interest $x$. The sampling is done by drawing non-zero elements of $x$ uniformly at random [75]. Then $\hat{x} \in \{0, 1\}^n$ is a perturbed sample which contains non-zero elements of $x$. The perturbed data set is denoted by $\Re$. The outcome/label prediction for $\hat{x}$ is given by $\phi(x)$.

7.3.5. Sparse linear explanations
This is the part where predictions are explained. In this stage the explainer is a linear model. Hence, $e(\hat{x})$ is a linear model

\[ e(\hat{x}) = \alpha^T \hat{x}. \tag{58} \]

Let $x$ be the original representation of $\hat{x}$. Then Eq. (55) becomes a locally weighted square loss

\[ \psi\{\phi, e, \pi_x\} = \sum_{x, \hat{x} \in \Re} \pi_x \left( \phi(x) - e(\hat{x}) \right)^2. \tag{59} \]

8. Model limitations and imposed assumptions

The Linear Discriminant Analysis (LDA) technique is one of the two popular statistical techniques applied in credit scoring, and is subject to a parametric assumption, namely that the features follow an underlying multivariate normal distribution. Eisenbeis [76] argues that categorical features violate this parametric assumption. The other model which uses two strong statistical assumptions (i. the features are conditionally independent; ii. the values of numeric features are normally distributed) is the Naïve Bayes classifier. The normality assumption in the Naïve Bayes classifier does not always hold in other domains, hence other continuous distributions may need to be estimated [77].
The most popular statistical technique in credit scoring is Logistic Regression (LR). Logistic Regression assumes a linear relationship between the inputs and the log odds. However, the linearity assumption does not always hold; there are cases where the relationship between the independent variables and the log odds is non-linear. In such cases kernel Support Vector Machines (SVM) serve as one of the non-linear classification techniques to be used. However, Yu et al. [78] argue that the performance of an SVM model is sensitive not only to the algorithm for solving the Quadratic Programming problem but also to the parameter settings in learning (i.e. the regularization parameter balancing the classification margin and tolerable misclassification errors, and the kernel parameter). Yu et al. [78] propose Least Squares SVM (LSSVM) to solve the quadratic programming problem and the design of the experiment for parameter selection in SVM modeling.
Henley and Hand [42] propose the use of a non-parametric technique, the k-nearest neighbor method, in credit scoring. However, the k-nearest neighbor method is costly because it requires more computations (i.e. it calculates a metric distance for each data record stored during classification). Jiang [79] proposes the use of a decision tree (C4.5) in conjunction with an approach called the Simulated Annealing Algorithm (SAA), which performs global optimization. In this study, Jiang highlights the shortcoming of the Decision Tree approach, which is the inability to perform global optimization because of its local search strategy, hence the need to use SAA. Although ANNs perform better than other techniques in terms of accuracy, they lack interpretability [5]. Baesens et al. [5] propose Decompositional and Pedagogical approaches to extract rules (without compromising the accuracy) from the ANN for interpretability.
Deep Learning models have not been applied extensively in credit scoring; a key obstacle is their lack of interpretability. Since banks are governed by regulators, banks are required to be transparent in their credit scoring process. A bank needs to tell a borrower why his/her loan application has been rejected.

9. Emerging trends

Most banks have developed an interest in applying machine learning models (including deep learning models) in their credit scoring systems. However, the regulators still maintain that the models which are used for credit scoring should be transparent. Despite their non-transparency, machine learning models are gaining traction in credit scoring. One way of mitigating the non-transparency issue of machine learning models is to rationalize

Table 3
Datasets that are used in literature for credit scoring. Each dataset has a number of good borrowers and a number of bad borrowers.
Dataset Sample size #Goods #Bads # of features
European Credit Bureau 186,574 179,544 7,030 324
UK 30,000 28,800 1,200 14
Barbados 21,117 20,614 503 20
Indonesia 14,700 14,290 4,410 31
Benelux 2 (Belgium, Netherlands, and Luxembourg) 7,190 5,033 2,157 28
Brazilian 4,504 4,144 360 5
Benelux 1 (Belgium, Netherlands, and Luxembourg) 3,123 1,040 2,083 27
University of California, San Diego 2,435 1,836 599 38
China 1,057 552 505 10
German 1,000 700 300 20
Iranian 1,000 950 50 27
Australian 690 307 383 14
Japanese 653 296 357 15
Polish 240 128 112 30
Texas Banks 162 81 81 19

(i.e. give justification as to why a certain decision has been made) the predictions.

9.1. Transparency

Methods that involve rule extraction for opening up non-transparent models have been suggested in the literature. Baesens et al. [5] evaluated and contrasted three neural network rule extraction techniques, namely Neurorule, Trepan, and Nefclass, for credit-risk evaluation. They concluded that neural network rule extraction is an effective and powerful management tool which allows the development of advanced and user-friendly decision-support systems for credit-risk evaluation. Setiono and Liu [80] proposed a NeuroLinear method for extracting oblique decision rules from neural networks. The experimental results showed that NeuroLinear is effective in extracting compact and comprehensible rules that have high predictive accuracy from neural networks.
The techniques highlighted above are based on rules. Recently, techniques that do not require rules, such as SHAP values, Partial Dependence and Explainable Boosting Machines, have been applied in domains other than credit scoring. In their empirical study, the authors of [81] proposed an intelligible model that can easily explain its predictions. The study applied generalized additive models with feature interactions (also referred to as explainable boosting machines) to real health care problems. Their results showed that the proposed generalized additive model can perform comparably with the best performing machine learning models such as random forest and logistic regression. Lundberg and Lee [82] used SHAP (SHapley Additive exPlanations) as a unified framework to interpret predictions. The SHAP method assesses the contribution of each feature towards a prediction. On the other hand, partial dependence checks how the prediction changes when different feature values are used. Partial dependence plots show the marginal impact one or two features have on the prediction of a machine learning model [83].

9.2. Deep learning in credit scoring

This section covers deep learning techniques in credit scoring. There is an emerging trend to replace statistical and classical machine learning techniques with deep learning techniques in credit scoring. For instance, Luo et al. [84] used Credit Default Swaps (CDS) data to compare the performance of deep belief networks with well-known credit scoring models such as logistic regression, multi-layer perceptron and support vector machine. Deep belief networks showed better performance. In their study, Ramasamy and Rajaraman [85] compared a meta-cognitive restricted Boltzmann machine with extreme learning machines, support vector machines and multi-layer perceptron using Australian, German and Kaggle credit datasets. The meta-cognitive restricted Boltzmann machine showed superior performance. Tomczak and Zieba [67] assessed and compared the performance of a classification restricted Boltzmann machine with several traditional statistical and machine learning models, such as logistic regression, decision trees, AdaBoost and random forest, using German, Australian, Kaggle and Short-Term Loans data. The classification restricted Boltzmann machine showed comparable results on the German, Kaggle and Short-Term Loans datasets, and better results on the Australian dataset. Tran et al. [86] proposed a hybrid genetic programming and stacked auto-encoder network model. The proposed hybrid model was compared to logistic regression, k-NN, support vector machines, artificial neural networks and decision trees using German and Australian credit datasets. The hybrid model showed a better accuracy rate. Yeh et al. [87] used daily stock returns to predict defaults and non-defaults using a deep belief network and support vector machines. The proposed deep belief network outperformed support vector machines. In their empirical study, Neagoe et al. [88] compared deep convolutional neural networks with multi-layer perceptron using German and Australian datasets. The results showed a superior overall accuracy rate for deep convolutional neural networks. These studies have shown the superiority of deep learning models in credit scoring. However, Hamori et al. [89] argue that the performance of deep learning models is dependent on the choice of activation function, the number of hidden layers and the dropout rate. The results in [89] showed a better performance for ensemble methods, such as boosting and bagging, when compared with deep neural networks using the Taiwan credit dataset. These studies highlight and reiterate the applicability of deep learning algorithms to credit scoring data.

9.2.1. Data augmentation
Deep learning models require more data for training to avoid overfitting [90]. Data augmentation is normally performed to increase the number of training data points. This is done by applying several distortions to the original training images, such as changing the brightness and rotation of images, and the distortions should not change the spatial pattern of target classes [91]. In [91], several data augmentations such as random shift, random zoom, random horizontal flip and random rotation were applied on image data. The random shift randomly shifts the images by a factor, the random zoom randomly zooms the images by a certain range, the random rotation randomly rotates the images by a certain angle and the random horizontal flip randomly flips the images horizontally to produce additional images. Their results [91] showed that data augmentation did not significantly improve the accuracy (i.e. accuracy increased by 1%). Kvamme

et al. [92] argue that data augmentation has been mostly done in image processing, and does not necessarily generalize well to other data sources.
On the other hand, Krizhevsky et al. [93] achieved significant improvements in error rates when data augmentation was applied on the ImageNet dataset. A huge deep neural network (60 million parameters and 650,000 neurons) was used for this task, and for this they needed to increase the dataset size. Perez and Wang [94] used data augmentation on goldfish vs. dogs and cats vs. dogs datasets. The results showed improvements in accuracy on both datasets. Salamon and Bello [95] combined a deep neural network with data augmentation to classify audio data. They used four different data augmentations and the proposed combination significantly outperformed the deep neural network without augmentation. Frid-Adar et al. [96] proposed a combination of classic data augmentation with Generative Adversarial Network (GAN) data augmentation. The GAN was used to synthesize new images of liver lesions. This combination resulted in significant accuracy improvement. Most of these studies used image datasets, whereas credit scoring data is not in the form of images. However, credit scoring data can be converted into images [97].

Fig. 8. A pie-chart that shows a proportion of literature papers that used data balancing techniques and a proportion of papers that did not balance their datasets.
The above authors have contrasting views on the effective-
ness of data augmentation on model accuracy. The conclusion classify a borrower as default but to know when will a bor-
from Krizhevsky et al. [93] is that when data augmentation is rower default. The univariate analysis can help in identifying
used in conjunction with large deep neural networks, the accu- non-predictive variables and also detect outliers. Model/time
racy increases significantly compared to using data augmenta- complexity stems from models with too many parameters. Hence,
tions with ‘‘shallow’’ neural networks. Also Salamon and Bello it is key to keep a few parameters during model development. Al-
[95] found that class accuracy is influenced differently by each though feature reduction/selection improves model accuracy, the
augmentation. literature needs to look at other feature engineering techniques
such as taking the sum/product of two features to create a new
feature. In cases where there are few independent features, a cor-
10. Limitations in credit scoring literature
relation between the target variable and the independent features
can be performed to determine the high predictive features. The
Below is the list of limitations which have been partially or literature needs to focus on other ensemble techniques where
completely ignored in credit scoring literature: base classifiers/learners are heterogeneous (i.e. models coming
✓ No inclusion of macro-economic variables;
from different model classes). The default cut-off for classifying
✓ The time it takes for borrowers to default is not deter-
borrowers is 0.5, e.g. in logistic regression, the literature needs to
mined; assess the impact of using different cut-offs. The error which is of
✓ Exploratory Data Analysis: detection of outliers and distri-
interest in credit scoring community is Type II error and the aim
bution of variables (checking zero-variance for variables) is is to minimize this error. The literature needs to report more on
not performed; this error when developing credit scoring models.
✓ Time/Model Complexity is not factored;
✓ Creation of new features instead of performing feature 11. Results
space reduction techniques/feature selection is not taken
into account; This section focuses on the meta-analysis of the results from
✓ Correlation between dependent variable (or target vari- primary studies. The frequency distributions of feature extraction
able) and independent variables is not assessed. techniques, the models and evaluation metrics from all primary
✓ Most studies in literature focus on homogeneous base clas- studies are analyzed. The pie-charts for distribution of data bal-
sifiers and ignore heterogeneous base classifiers for ensem- ancing techniques and transparency techniques are also shown.
ble methods; The German and Australian credit datasets are selected for model
✓ Using different cut-offs (instead of 0.5) to classify borrow- performance comparison, as they are the most frequently used
ers as either non-defaults or defaults is not covered in datasets in credit scoring.
literature;
✓ Few studies incorporate Type II error. The type II error is
11.1. Data balancing
more costly in credit scoring, since Type II (False Negative
ratio) predicts a borrower as a good borrower but in actual
Credit risk data is generally imbalanced (see Table 3). The non-
fact he or she is a bad borrower.
default borrowers are usually more than the defaulted borrowers.
The increases in macro-economic variables such as interest In cases where there is a high imbalance in data, the models
rate, inflation rate and unemployment rate may increase the risk turn to be biased towards the majority class. To mitigate this
of a borrower defaulting. Hence, it is key to incorporate macro- bias, techniques such as over-sampling or under-sampling are
economic variables in credit scoring. Since macro-economic vari- usually performed. However, this study shows that only 18% of
ables are time-varying, it is key to develop forward-looking credit the primary studies in this literature survey have balanced their
scoring techniques. The survival analysis can cater for forward- datasets (see Fig. 8). The most used technique is under-sampling
looking credit scoring techniques. This can help not only to the majority class (see Fig. 9).
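
To make the most frequently used balancing technique concrete, the sketch below performs random under-sampling of the majority class with NumPy. It is a minimal illustration rather than the procedure of any particular primary study: the synthetic data, the 80/20 split and the helper name `undersample_majority` are assumptions. The split is done before balancing so that, in line with the framework proposed later in this survey, sampling touches only the training set.

```python
import numpy as np

def undersample_majority(X, y, rng):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = int(counts.min())
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                 # synthetic borrower features
y = (rng.random(1000) < 0.1).astype(int)       # roughly 10% defaults (class 1)

# Split first, then balance the training part only; the test part keeps
# its original class distribution.
perm = rng.permutation(len(y))
train_idx, test_idx = perm[:800], perm[800:]
X_train_bal, y_train_bal = undersample_majority(X[train_idx], y[train_idx], rng)
```

Under-sampling discards majority-class rows, so it trades information for balance; over-sampling methods such as SMOTE [69] keep all rows and synthesize minority examples instead.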

Fig. 9. A frequency distribution of data balancing techniques that are used in primary studies of this literature survey.

Fig. 10. A frequency distribution of feature extraction and engineering techniques that are used in primary studies of this literature survey.

Fig. 11. A frequency distribution of evaluation metrics that are used in primary studies of this literature survey.

Fig. 12. A frequency distribution plot of most frequently used statistical and machine learning models in credit scoring.

11.2. Feature extraction

The distribution of feature extraction techniques, such as feature selection and feature engineering, for this literature survey is shown in Fig. 10. This shows that out of 17 studies (see Table 4 for details) which used feature extraction, the Rough Set technique was the most frequently used, followed by Stepwise, Genetic Algorithm and PCA.
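
Beyond the selection techniques counted here, the limitations section of this survey suggests creating new features (for example the sum or product of two features) and checking the correlation between each feature and the target. The following sketch shows one way this could look. All data are synthetic, and the feature names `income` and `debt`, as well as the helper `target_correlation`, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
income = rng.uniform(1_000, 10_000, size=n)    # hypothetical feature
debt = rng.uniform(0, 8_000, size=n)           # hypothetical feature
# Synthetic target: default is more likely when debt is high relative to income.
default = (debt / income + rng.normal(0, 0.2, size=n) > 0.8).astype(int)

features = {
    "income": income,
    "debt": debt,
    "income_plus_debt": income + debt,         # engineered: sum of two features
    "income_times_debt": income * debt,        # engineered: product of two features
}

def target_correlation(x, y):
    """Pearson correlation between one feature and the binary target."""
    return float(np.corrcoef(x, y)[0, 1])

# Rank features by the absolute value of their correlation with the target.
ranking = sorted(
    ((name, target_correlation(col, default)) for name, col in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Pearson correlation only captures linear association, so a low value does not by itself prove that a feature is non-predictive.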

11.3. Evaluation metrics

Based on this literature survey, Fig. 11 shows commonly used evaluation metrics in credit scoring. The most frequently used evaluation metrics are PCC and AUC (see Table 5 for details).

Fig. 13. A pie-chart that shows a proportion of studies that used transparency techniques and a proportion of studies that did not use transparency techniques.
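
Several of the evaluation metrics listed in Table 5 follow directly from the confusion matrix at a chosen cut-off. The sketch below computes PCC, Type I and Type II error rates, sensitivity, specificity, G-mean and F-measure, treating default as the positive class so that a Type II error is a bad borrower predicted as good, as described in the limitations section. The function name `scoring_metrics` and the toy scores are assumptions; AUC is left out because it is not tied to a single cut-off.

```python
import numpy as np

def scoring_metrics(y_true, p_default, cutoff=0.5):
    """Confusion-matrix metrics with default (1) as the positive class.

    Assumes both classes are present in y_true.
    """
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_default) >= cutoff).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    return {
        "pcc": (tp + tn) / len(y_true),        # percentage correctly classified
        "type_I": fp / (fp + tn),              # good borrower predicted as bad
        "type_II": fn / (fn + tp),             # bad borrower predicted as good
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": (sensitivity * specificity) ** 0.5,
        "f_measure": 2 * precision * sensitivity / denom if denom else 0.0,
    }

# Toy example: three defaulters and seven non-defaulters.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p_default = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.55, 0.05, 0.35, 0.15]

at_half = scoring_metrics(y_true, p_default, cutoff=0.5)
at_low = scoring_metrics(y_true, p_default, cutoff=0.3)
```

On this toy data, lowering the cut-off from 0.5 to 0.3 removes the Type II error while PCC falls from 0.8 to 0.7, which illustrates why reporting PCC alone can hide the error that matters most to lenders.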

11.4. Models

This literature survey focuses on statistical, traditional and state-of-the-art machine learning models. Based on the results of this literature survey, Fig. 12 shows that LR, SVM and ANN were the most frequently used single classifiers (see Table 6 for details). Among ensembles of classifiers, Boosting was the most frequently used. Deep learning classifiers are among the least frequently used classifiers, which reflects the fact that deep learning has not yet been applied extensively in credit scoring.

11.5. Transparency

Under the Basel II Accord [149], credit scoring models are required to be transparent (i.e. to explain their predictions). The majority of machine learning models are not transparent in nature. Hence, making the models transparent is paramount in credit scoring. However, as can be seen in Fig. 13, this survey shows that only 8% of the primary studies have looked into techniques which make models transparent.
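
One family of transparency techniques explains a single prediction by fitting a simple model around it, which is the idea behind LIME [75]. The sketch below is a simplified, LIME-style local surrogate written with NumPy only. It does not use or reproduce the `lime` library, and the function `local_linear_explanation`, the Gaussian perturbation, the kernel width and the `black_box` scorer are all assumptions chosen for illustration.

```python
import numpy as np

def local_linear_explanation(predict_proba, x, rng,
                             n_samples=2000, scale=0.5, width=1.0):
    """Fit a distance-weighted linear surrogate around x; return per-feature slopes."""
    x = np.asarray(x, dtype=float)
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))  # perturbed neighbours of x
    target = predict_proba(Z)                                  # black-box outputs
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / (width ** 2))                    # closer samples weigh more
    A = np.hstack([np.ones((n_samples, 1)), Z - x])            # intercept + centred features
    sw = np.sqrt(w)
    # Weighted least squares via rescaling rows by sqrt(weight).
    coef, *_ = np.linalg.lstsq(A * sw[:, None], target * sw, rcond=None)
    return coef[1:]                                            # drop the intercept

def black_box(Z):
    """Hypothetical opaque scorer: feature 0 raises the default probability,
    feature 1 lowers it, feature 2 is ignored."""
    return 1.0 / (1.0 + np.exp(-(2.0 * Z[:, 0] - 1.0 * Z[:, 1])))

rng = np.random.default_rng(0)
effects = local_linear_explanation(black_box, [0.0, 0.0, 0.0], rng)
```

The sign and size of each entry in `effects` indicate how the predicted default probability responds to a small change in that feature near the chosen borrower; the explanation is local and may differ for other borrowers.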

Table 4
Feature extraction methods used in primary studies. Legend: ST (Stepwise), FS (F-score), RS (Rough Set), GA (Genetic
Algorithm), PCA (Principal Component Analysis), AE (AutoEncoder), LDA (Linear Discriminant Analysis), and LASSO
(Least Absolute Shrinkage and Selection Operator).
Source ST FS RS GA PCA AE LDA LASSO
[17] ✓ ✓
[98] ✓ ✓
[25] ✓
[99] ✓
[100] ✓
[101] ✓ ✓
[20] ✓
[15] ✓
[102] ✓
[16] ✓
[103] ✓
[104] ✓
[18] ✓
[105] ✓
[22] ✓
[86] ✓
[106] ✓
[107]
[108] ✓ ✓
[109] ✓
[110] ✓

Fig. 14. Averages of PCC and AUC calculated from the results reported in the papers considered for this literature survey that used the German and Australian credit datasets.

11.6. Model performance on German and Australian datasets

Out of 74 primary studies, 39 used the German credit dataset, the Australian credit dataset or both. The results for both PCC and AUC on both datasets were collected from the primary studies. The results (see Fig. 14) show that, on average on both datasets, the Convolutional Neural Networks performed better in terms of PCC, followed by ensemble classifiers such as Random Forests, Bagging and Boosting. The high performance of CNN can be attributed to the fact that they can detect features that are more discriminative between borrowers. In terms of AUC, the ensemble of classifiers performed better than all single classifiers. We noted that different studies reported different PCC and AUC values for each of the classifiers. This could be attributed to the data pre-processing of each study.

12. Guiding machine learning framework for credit scoring

Based on the results of this literature survey, we propose a framework, shown in Fig. 15, which serves as a guideline for credit scoring analysts. The nature of the data in credit scoring is either balanced or imbalanced. For both types of datasets, this framework suggests exploratory data analysis, data pre-processing, feature extraction techniques, the sampling methodology, the models to use as benchmarks, the best performing

Table 5
Evaluation metrics from primary studies. Legend: PCC (Percentage Correctly Classified), AUC (Area Under ROC Curve), T1 (Type I
Error), T2 (Type II Error), F-M (F-measure), G-M (G-measure), K-S (Kolmogorov–Smirnov Statistic), SP (Specificity), SE (Sensitivity)
Source PCC AUC T1 T2 F-M G-M K-S SP SE
[111] ✓
[25] ✓ ✓ ✓ ✓
[112] ✓
[113] ✓
[98] ✓
[17] ✓
[25] ✓ ✓ ✓ ✓
[99] ✓
[114] ✓ ✓ ✓
[100] ✓
[115] ✓ ✓ ✓
[116] ✓ ✓ ✓
[117] ✓ ✓ ✓
[118] ✓ ✓ ✓ ✓
[114] ✓ ✓
[101] ✓
[119] ✓
[20] ✓ ✓
[120] ✓ ✓ ✓
[121] ✓ ✓ ✓
[15] ✓
[122] ✓
[102] ✓ ✓ ✓
[16] ✓ ✓
[123] ✓ ✓ ✓
[104] ✓
[18] ✓
[103] ✓ ✓ ✓
[47]
[124] ✓
[22] ✓
[105] ✓
[125] ✓
[126] ✓
[127] ✓
[67] ✓ ✓ ✓ ✓
[128] ✓
[87] ✓
[129]
[130] ✓ ✓ ✓ ✓
[131] ✓
[132] ✓ ✓ ✓
[133] ✓
[86] ✓ ✓ ✓
[134] ✓ ✓ ✓
[135] ✓
[46] ✓ ✓ ✓ ✓
[136] ✓ ✓
[137] ✓
[107] ✓ ✓ ✓ ✓
[84] ✓ ✓
[138] ✓ ✓
[136] ✓ ✓ ✓
[19] ✓ ✓
[139] ✓ ✓ ✓
[85] ✓ ✓
[140] ✓
[141] ✓ ✓ ✓
[106] ✓
[142] ✓ ✓
[110] ✓ ✓
[143] ✓
[92] ✓ ✓ ✓ ✓
[108] ✓ ✓
[97] ✓ ✓ ✓
[144] ✓
[88] ✓
[89] ✓ ✓ ✓
[109] ✓
[145] ✓ ✓
[56] ✓
[146] ✓ ✓
[147] ✓

Table 6
Models used in credit scoring. Legend: LR (Logistic Regression), NB (Naïve Bayes), LDA (Linear Discriminant Analysis), XGB (XGBoost), DT (Decision Tree), EML (Extreme Learning Machines), k-NN (k-Nearest Neighbor), SVM (Support Vector Machine), ANN (Artificial Neural Network), BA (Bagging), BO (Boosting), RF (Random Forest), RBM (Restricted Boltzmann Machine), DBN (Deep Belief Network), DMLP (Deep Multi-Layer Perceptron), and CNN (Convolutional Neural Network).
Source LR NB LDA XGB DT EML k-NN SVM ANN BA BO RF RBM DBN DMLP CNN
[111] ✓
[98] ✓ ✓
[99] ✓ ✓
[112] ✓ ✓ ✓ ✓ ✓ ✓ ✓
[113] ✓ ✓ ✓
[25] ✓ ✓ ✓
[17] ✓
[114] ✓
[100] ✓
[115] ✓ ✓ ✓ ✓ ✓ ✓
[116] ✓
[117] ✓ ✓ ✓
[119] ✓ ✓
[114] ✓ ✓ ✓ ✓ ✓ ✓
[101] ✓
[118] ✓ ✓
[20] ✓
[120] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
[15] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
[121] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
[16] ✓ ✓
[123] ✓
[122] ✓ ✓
[102] ✓ ✓ ✓
[103] ✓ ✓
[18] ✓
[104] ✓
[18] ✓
[47] ✓ ✓ ✓ ✓ ✓
[124] ✓
[105] ✓ ✓
[22] ✓
[125] ✓ ✓
[126] ✓
[127] ✓ ✓ ✓ ✓ ✓ ✓
[129] ✓
[131] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
[128] ✓
[87] ✓ ✓ ✓
[67] ✓ ✓ ✓ ✓ ✓ ✓ ✓
[132] ✓ ✓ ✓
[130] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
[135] ✓
[86] ✓ ✓ ✓ ✓ ✓ ✓ ✓
[134] ✓ ✓ ✓ ✓ ✓
[133] ✓ ✓ ✓ ✓ ✓
[142] ✓ ✓ ✓ ✓ ✓ ✓
[141] ✓ ✓
[46] ✓ ✓ ✓ ✓ ✓ ✓
[136] ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
[137] ✓ ✓
[140] ✓
[106] ✓
[107] ✓
[84] ✓ ✓ ✓ ✓
[138] ✓ ✓ ✓ ✓
[136] ✓ ✓ ✓ ✓ ✓ ✓
[19] ✓
[139] ✓ ✓
[85] ✓ ✓ ✓ ✓
[97] ✓ ✓ ✓
[144] ✓ ✓ ✓ ✓ ✓
[88] ✓ ✓
[109] ✓ ✓ ✓ ✓ ✓ ✓
[145] ✓ ✓ ✓ ✓ ✓
[143] ✓
[92] ✓
[108]
[146] ✓
[147] ✓
[148] ✓ ✓ ✓ ✓ ✓ ✓ ✓
[110] ✓ ✓ ✓
[56] ✓ ✓
[89] ✓ ✓ ✓ ✓ ✓

Fig. 15. The guiding machine learning credit scoring framework that is proposed in this literature survey.

models in credit scoring, the evaluation metrics and the technique which explains model predictions. This literature survey showed that the most frequently used techniques for feature extraction are Rough Set and Genetic Algorithm. The sampling methodology that we suggest for imbalanced datasets is SMOTE. The sampling is only applied to the training set. The models which can be used as benchmarks are Logistic Regression (LR) and Decision Trees (DT). LR performs similarly to most traditional machine learning models, and DT is selected for its capability of explaining predictions. This literature survey shows that ensemble classifiers and CNN performed better when compared to other models, hence we suggest these two models for use in credit scoring. For imbalanced datasets, evaluation metrics which can be applied are G-mean, Recall and F-measure. Since predictions from ensemble classifiers and CNN cannot be explained, we suggest LIME for explaining predictions.

13. Conclusion and future work

Many studies over the years have evaluated and contrasted the performances of different statistical and classical machine learning models in credit scoring. However, no consensus has been reached in identifying the best performing model. This literature survey systematically reviewed the most commonly used statistical, classical machine learning and deep learning models in credit scoring. The performances (on German and Australian credit datasets) of statistical, classical machine learning and deep learning models that were reported in literature were compared in this literature survey. The literature results showed that an ensemble of classifiers generally outperforms single classifiers. Despite the minimal application of deep learning models in credit scoring literature, deep learning models such as convolutional neural networks showed better results compared to statistical and classical machine learning models. This literature survey also highlighted limitations in credit scoring literature. The credit scoring literature often ignores exploratory data analysis, omits inclusion of macro-economic variables and does not determine correlation between the target variable and the independent features, to mention a few. Furthermore, in this survey we proposed a guiding machine learning framework for analysts to perform credit scoring. This framework includes feature selection methods, data balancing technique, models to

consider for bench-marking, best performing models in credit scoring, relevant evaluation metrics and a method for making models transparent. This survey pointed to emerging directions in credit scoring such as the use of techniques that make model predictions explainable and the applications of deep learning models in credit scoring.

Future research should focus more on balancing classes of datasets in credit scoring. Balancing techniques that perform neither over-sampling nor under-sampling, such as the linear dependence approach and wavelet data transformation, should be explored in credit scoring. Future research should incorporate macro-economic variables such as interest rates, unemployment rates and inflation rates. An increase in any of the mentioned macro-economic variables may increase the risk of a borrower defaulting. Future studies should consider the time it takes for borrowers to default, as this will allow commercial banks to be forward-looking and proactive. The literature does not take exploratory data analysis into consideration, hence future research should focus on this aspect to better understand, for example, the distributions of features. Future studies should focus more on time/model complexity to allow efficiency in model development in credit scoring. Future research should determine the correlation between the target variable and the independent variables/features in order to identify predictive variables/features. There are few studies that focused on using heterogeneous base classifiers for ensemble methods [113,145,147]. However, future studies should focus more on using heterogeneous instead of homogeneous base classifiers in ensemble methods to allow diversity of base classifiers.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the Bankseta, South Africa (The Sector Education and Training Authority (SETA) for the banking industry) for making funds available to Xolani Dastile’s PhD study. The authors also would like to thank the anonymous reviewers for providing valuable feedback on initial versions of this paper.

References

[1] L.C. Thomas, J. Crook, D. Edelman, Credit Scoring and Its Applications, Society for Industrial and Applied Mathematics, 2002.
[2] L.C. Thomas, A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers, Int. J. Forecast. 16 (2) (2000) 149–172.
[3] N. Siddiqi, Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, SAS Publishing, 2005.
[4] I.J. Myung, Tutorial on maximum likelihood estimation, J. Math. Psych. 47 (1) (2003) 90–100.
[5] B. Baesens, R. Setiono, C. Mues, J. Vanthienen, Using neural network rule extraction and decision tables for credit-risk evaluation, Manage. Sci. 49 (3) (2003) 312–329.
[6] H.A. Alaka, L.O. Oyedele, H.A. Owolabi, V. Kumar, S.O. Ajayi, O.O. Akinade, M. Bilal, Systematic review of bankruptcy prediction models: Towards a framework for tool selection, Expert Syst. Appl. 94 (2018) 164–184.
[7] R. Schlosser, Appraising the Quality of Systematic Reviews, Focus: Technical Briefs 17, 2007, pp. 1–8.
[8] J.L. Bellovary, D.E. Giacomino, M.D. Akers, A review of bankruptcy prediction studies: 1930 to present, J. Financial Educ. 33 (2007) 1–42.
[9] H.A. Abdou, J. Pointon, Credit scoring, statistical techniques and evaluation criteria: A review of the literature, Int. J. Intell. Syst. Account. Financ. Manage. 18 (2–3) (2011) 59–88.
[10] W. Lin, Y. Hu, C. Tsai, Machine learning in financial crisis prediction: A survey, IEEE Trans. Syst. Man Cybern. C 42 (4) (2012) 421–436.
[11] X. Wang, M. Xu, Ö.T. Pusatli, A survey of applying machine learning techniques for credit rating: existing models and open issues, in: S. Arik, T. Huang, W.K. Lai, Q. Liu (Eds.), Neural Information Processing, Springer International Publishing, 2015, pp. 122–132.
[12] F. Louzada, A. Ara, G.B. Fernandes, Classification methods applied to credit scoring: Systematic review and overall comparison, Surv. Oper. Res. Manag. Sci. 21 (2) (2016) 117–134.
[13] S.S. Devi, Y. Radhika, A Survey on Machine Learning and Statistical Techniques in Bankruptcy Prediction, 2018.
[14] D. Liang, C.-F. Tsai, H.-T. Wu, The effect of feature selection on financial distress prediction, Knowl.-Based Syst. 73 (2015) 289–297.
[15] I. Brown, C. Mues, An experimental comparison of classification algorithms for imbalanced credit scoring data sets, Expert Syst. Appl. 39 (3) (2012) 3446–3453.
[16] K. Bijak, L.C. Thomas, Does segmentation always improve model performance in credit scoring? Expert Syst. Appl. 39 (3) (2012) 2433–2442.
[17] F.-L. Chen, F.-C. Li, Combination of feature selection approaches with SVM in credit scoring, Expert Syst. Appl. 37 (7) (2010) 4902–4909.
[18] W. Chen, L. Shi, Credit scoring with F-score based on support vector machine, in: Proceedings 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer, MEC, 2013, pp. 1512–1516.
[19] H. Chen, Y. Xiang, The study of credit scoring model based on group lasso, Procedia Comput. Sci. 122 (2017) 677–684, 5th International Conference on Information Technology and Quantitative Management, ITQM 2017.
[20] B.-W. Chi, C.-C. Hsu, A hybrid approach to integrate genetic algorithm into dual scoring model in enhancing the performance of credit scoring model, Expert Syst. Appl. 39 (3) (2012) 2650–2661.
[21] B. Back, T. Laitinen, K. Sere, Neural networks and genetic algorithms for bankruptcy predictions, Expert Syst. Appl. 11 (4) (1996) 407–413.
[22] S. Oreski, G. Oreski, Genetic algorithm-based heuristic for feature selection in credit risk assessment, Expert Syst. Appl. 41 (4, Part 2) (2014) 2052–2064.
[23] Q. Song, H. Jiang, J. Liu, Feature selection based on FDA and F-score for multi-class classification, Expert Syst. Appl. 81 (2017) 22–27.
[24] Z. Pawlak, Rough set approach to knowledge-based decision support, European J. Oper. Res. 99 (1) (1997) 48–57.
[25] J. Wang, K. Guo, S. Wang, Rough set and tabu search based feature selection for credit scoring, Procedia Comput. Sci. 1 (1) (2010) 2425–2432, ICCS 2010.
[26] Q. Zhang, Q. Xie, G. Wang, A survey on rough set theory and its applications, CAAI Trans. Intell. Technol. 1 (4) (2016) 323–333.
[27] C.-F. Tsai, Feature selection in bankruptcy prediction, Knowl.-Based Syst. 22 (2) (2009) 120–127.
[28] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.
[29] V. Kozeny, Genetic algorithms for credit scoring: Alternative fitness function performance comparison, Expert Syst. Appl. 42 (6) (2015) 2998–3004.
[30] M. Crepinsek, S.-H. Liu, M. Mernik, Exploration and exploitation in evolutionary algorithms: A survey, ACM Comput. Surv. 45 (2013) 35:1–35:33.
[31] S.-h. Liu, M. Mernik, B. Bryant, To explore or to exploit: An entropy-driven approach for evolutionary algorithms, KES J. 13 (2009) 185–206.
[32] J.M. Cadenas, M.C. Garrido, R. Martínez, Feature subset selection filter–wrapper based on low quality data, Expert Syst. Appl. 40 (16) (2013) 6241–6252.
[33] R. Tibshirani, Regression shrinkage and selection via the lasso: a retrospective, J. R. Stat. Soc. Ser. B Stat. Methodol. 73 (3) (2011) 273–282.
[34] A. Zheng, A. Casari, Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, first ed., O’Reilly Media, Inc., 2018.
[35] S. Sehgal, H. Singh, M. Agarwal, V. Bhasker, Shantanu, Data analysis using principal component analysis, in: 2014 International Conference on Medical Imaging, M-Health and Emerging Communication Systems, MedCom, 2014, pp. 45–48.
[36] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (2) (1936) 179–188.
[37] A.M. Martinez, A.C. Kak, PCA versus LDA, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 228–233.
[38] C. Rao, The utilization of multiple measurements in problems of biological classification, J. R. Stat. Soc. (1948) 159–203.
[39] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, 2001.
[40] D. Reynolds, Gaussian mixture models, in: Encyclopedia of Biometrics, Springer US, Boston, MA, 2015, pp. 827–832.
[41] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Stat. Methodol. 39 (1) (1977) 1–38.
[42] W.E. Henley, D.J. Hand, A k-nearest-neighbour classifier for assessing consumer credit risk, J. R. Stat. Soc. 45 (1) (1996) 77–95.

[43] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[44] B. Schölkopf, The kernel trick for distances, in: Proceedings of the 13th International Conference on Neural Information Processing Systems, MIT Press, 2000, pp. 283–289.
[45] T.M. Mitchell, Machine Learning, first ed., McGraw-Hill, Inc., New York, NY, USA, 1997.
[46] F. Barboza, H. Kimura, E. Altman, Machine learning models and bankruptcy prediction, Expert Syst. Appl. 83 (2017) 405–417.
[47] C.-F. Tsai, Y.-F. Hsu, D.C. Yen, A comparative study of classifier ensembles for bankruptcy prediction, Appl. Soft Comput. 24 (2014) 977–984.
[48] C.-F. Tsai, J.-W. Wu, Using neural network ensembles for bankruptcy prediction and credit scoring, Expert Syst. Appl. 34 (4) (2008) 2639–2649.
[49] M.D. Odom, R. Sharda, A neural network model for bankruptcy prediction, in: 1990 IJCNN International Joint Conference on Neural Networks, vol. 2, 1990, pp. 163–168.
[50] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[51] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. System Sci. 55 (1) (1997) 119–139.
[52] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 785–794.
[53] J. Nobre, R. Neves, Combining principal component analysis, discrete wavelet transform and xgboost to trade in the financial markets, Expert Syst. Appl. 125 (2019).
[54] Y. Xia, C. Liu, Y. Li, N. Liu, A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring, Expert Syst. Appl. 78 (2017).
[55] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140.
[56] L. Yu, R. Zhou, L. Tang, R. Chen, A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data, Appl. Soft Comput. 69 (2018) 192–202.
[57] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[58] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1 (4) (1989) 541–551.
[59] Y. Seo, K. shik Shin, Hierarchical convolutional neural networks for fashion image classification, Expert Syst. Appl. 116 (2019) 328–339.
[60] Y. Lecun, Y. Bengio, The Handbook of Brain Theory and Neural Networks, MIT Press, 1995, chapter Convolutional networks for images, speech, and time-series.
[61] F.F. Ting, Y.J. Tan, K.S. Sim, Convolutional neural network improvement for breast cancer classification, Expert Syst. Appl. 120 (2019) 103–115.
[62] O.B. Sezer, A.M. Ozbayoglu, Algorithmic financial trading with deep convolutional neural networks: time series to image conversion approach, Appl. Soft Comput. 70 (2018) 525–538.
[63] B. Zhao, H. Lu, S. Chen, J. Liu, D. Wu, Convolutional neural networks for time series classification, J. Syst. Eng. Electron. 28 (2017) 162–169.
[64] F. Chollet, Deep Learning with Python, first ed., Manning Publications Co., Greenwich, CT, USA, 2017.
[65] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[66] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Inc., New York, NY, USA, 1995.
[67] J.M. Tomczak, M. Zieba, Classification restricted Boltzmann machine for comprehensible credit scoring model, Expert Syst. Appl. 42 (4) (2015) 1789–1796.
[68] G. Douzas, F. Bacao, Self-organizing map oversampling (SOMO) for imbalanced data set learning, Expert Syst. Appl. 82 (2017) 40–52.
[69] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. Artificial Intelligence Res. 16 (2002) 321–357.
[70] R. Saia, S. Carta, G. Fenu, A Wavelet-Based Data Analysis to Credit Scoring, 2018.
[71] J. Nobre, R. Neves, Combining principal component analysis, discrete wavelet transform and xgboost to trade in the financial markets, Expert Syst. Appl. 125 (2019).
[72] R. Saia, S. Carta, A Linear-Dependence-Based Approach to Design Proactive Credit Scoring Models, 2016.
[73] R. Setiono, H. Liu, Symbolic representation of neural networks, Computer 29 (3) (1996) 71–77.
[74] M.W. Craven, J.W. Shavlik, Extracting tree-structured representations of trained networks, in: Proceedings of the 8th International Conference on Neural Information Processing Systems, MIT Press, 1995, pp. 24–30.
[75] M.T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, CoRR abs/1602.04938 (2016).
[76] R.A. Eisenbeis, Problems in applying discriminant analysis in credit scoring models, J. Bank. Financ. 2 (3) (1978) 205–219.
[77] G.H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1995, pp. 338–345.
[78] L. Yu, X. Yao, S. Wang, K. Lai, Credit risk evaluation using a weighted least squares SVM classifier with design of experiment for parameter selection, Expert Syst. Appl. 38 (12) (2011) 15392–15399.
[79] Y. Jiang, Credit scoring model based on the decision tree and the simulated annealing algorithm, in: 2009 WRI World Congress on Computer Science and Information Engineering, Vol. 4, 2009, pp. 18–22.
[80] R. Setiono, H. Liu, Neurolinear: From neural networks to oblique decision rules, Neurocomputing 17 (1) (1997) 1–24.
[81] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, N. Elhadad, Intelligible models for healthCare: Predicting pneumonia risk and hospital 30-day readmission, in: KDD ’15, 2015.
[82] S. Lundberg, S. Lee, A unified approach to interpreting model predictions, CoRR abs/1705.07874 (2017).
[83] J.H. Friedman, Greedy function approximation: A gradient boosting machine, Ann. Statist. 29 (2000) 1189–1232.
[84] C. Luo, D. Wu, D. Wu, A deep learning approach for credit scoring using credit default swaps, Eng. Appl. Artif. Intell. 65 (2017) 465–470.
[85] S. Ramasamy, K. Rajaraman, A hybrid meta-cognitive restricted Boltzmann machine classifier for credit scoring, in: TENCON 2017 - 2017 IEEE Region 10 Conference, 2017, pp. 2313–2318.
[86] K. Tran, T. Duong, Q. Ho, Credit scoring model: A combination of genetic programming and deep learning, in: 2016 Future Technologies Conference, FTC, 2016, pp. 145–149.
[87] S.H. Yeh, C.J. Wang, M.F. Tsai, Deep belief networks for predicting corporate defaults, in: 2015 24th Wireless and Optical Communication Conference, WOCC, 2015, pp. 159–163.
[88] V. Neagoe, A. Ciotec, G. Cucu, Deep convolutional neural networks versus multilayer perceptron for financial prediction, in: 2018 International Conference on Communications, COMM, 2018, pp. 201–206.
[89] S. Hamori, M. Kawai, T. Kume, Y. Murakami, C. Watanabe, Ensemble learning or deep learning? Application to default risk analysis, J. Risk Financial Manag. 11 (1) (2018).
[90] C. Shorten, T.M. Khoshgoftaar, A survey on image data augmentation for deep learning, J. Big Data 6 (1) (2019) 60.
[91] A. Gómez-Ríos, S. Tabik, J. Luengo, A.S.M. Shihavuddin, B. Krawczyk, F. Herrera, Towards highly accurate coral texture images classification using deep convolutional neural networks and data augmentation, CoRR abs/1804.00516 (2018).
[92] H. Kvamme, N. Sellereite, K. Aas, S. Sjursen, Predicting mortgage default using convolutional neural networks, Expert Syst. Appl. 102 (2018).
[93] A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, Neural Inf. Process. Syst. 25 (2012).
[94] L. Perez, J. Wang, The effectiveness of data augmentation in image classification using deep learning, CoRR (2017).
[95] J. Salamon, J.P. Bello, Deep convolutional neural networks and data augmentation for environmental sound classification, CoRR (2016) http://arxiv.org/abs/1608.04363.
[96] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, Synthetic data augmentation using GAN for improved liver lesion classification, CoRR (2018).
[97] B. Zhu, W. Yang, H. Wang, Y. Yuan, A hybrid deep learning model for consumer credit scoring, in: 2018 International Conference on Artificial Intelligence and Big Data, ICAIBD, 2018, pp. 205–208.
[98] M.F. Kiani, F. Mahmoudi, A new hybrid method for credit scoring based on clustering and support vector machine (ClsSVM), in: 2010 2nd IEEE International Conference on Information and Financial Engineering, 2010, pp. 585–589.
[99] D. Zhang, X. Zhou, S.C. Leung, J. Zheng, Vertical bagging decision trees model for credit scoring, Expert Syst. Appl. 37 (12) (2010) 7838–7843.
[100] M.A.H. Farquad, V. Ravi, Sriramjee, G. Praveen, Credit scoring using PCA-SVM hybrid model, in: V.V. Das, J. Stephen, Y. Chaba (Eds.), Computer Networks and Information Technologies, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 249–253.
[101] Y. Ping, L. Yongheng, Neighborhood rough set and SVM based hybrid credit scoring classifier, Expert Syst. Appl. 38 (9) (2011) 11300–11304.
[102] J. Wang, A.-R. Hedar, S. Wang, J. Ma, Rough set and scatter search metaheuristic based feature selection for credit scoring, Expert Syst. Appl. 39 (6) (2012) 6123–6128.
[103] L. Han, L. Han, H. Zhao, Orthogonal support vector machine for credit scoring, Eng. Appl. Artif. Intell. 26 (2) (2013) 848–862.
[104] J. Shi, S.-y. Zhang, L.-m. Qiu, Credit scoring by feature-weighted support vector machines, J. Zhejiang Univ. Sci. C 14 (3) (2013) 197–204.
[105] Q. Li, J. Zhang, Y. Wang, K. Kang, Credit risk classification using discriminative restricted boltzmann machines, in: 2014 IEEE 17th International Conference on Computational Science and Engineering, 2014, pp. 1697–1700.
[106] S. Maldonado, J. Pérez, C. Bravo, Cost-based feature selection for support vector machines: An application in credit scoring, European J. Oper. Res. 261 (2) (2017) 656–665.
[107] H. Sutrisno, S. Halim, Credit scoring refinement using optimized logistic regression, in: 2017 International Conference on Soft Computing, Intelligent System and Information Technology, ICSIIT, 2017, pp. 26–31.
[108] R.A. Mancisidor, M. Kampffmeyer, K. Aas, R. Jenssen, Segment-based credit scoring using latent clusters in the variational autoencoder, 2018, arXiv:1806.02538.
[109] X. Zhang, Y. Yang, Z. Zhou, A novel credit scoring model based on optimized random forest, in: 2018 IEEE 8th Annual Computing and Communication Workshop and Conference, CCWC, 2018, pp. 60–65.
[110] S. Jadhav, H. He, K. Jenkins, Information gain directed genetic algorithm wrapper feature selection for credit rating, Appl. Soft Comput. 69 (2018) 541–553.
[111] G. Dong, K.K. Lai, J. Yen, Credit scorecard based on logistic regression with random coefficients, Procedia Comput. Sci. 1 (1) (2010) 2463–2468, ICCS 2010.
[112] B. Twala, Multiple classifier application to credit risk assessment, Expert Syst. Appl. 37 (4) (2010) 3326–3336.
[113] N.-C. Hsieh, L.-P. Hung, A data driven ensemble classifier for credit scoring analysis, Expert Syst. Appl. 37 (1) (2010) 534–545.
[114] L. Yu, X. Yao, S. Wang, K. Lai, Credit risk evaluation using a weighted least squares SVM classifier with design of experiment for parameter selection, Expert Syst. Appl. 38 (12) (2011) 15392–15399.
[115] G. Wang, J. Hao, J. Ma, H. Jiang, A comparative assessment of ensemble learning for credit scoring, Expert Syst. Appl. 38 (1) (2011) 223–230.
[116] Q. Wang, K.K. Lai, D. Niu, Green credit scoring system and its risk assessment model with support vector machine, in: 2011 Fourth International Joint Conference on Computational Sciences and Optimization, 2011, pp. 284–287.
[117] B. Ribeiro, N. Lopes, Deep belief networks for financial prediction, in: B.-L. Lu, L. Zhang, J. Kwok (Eds.), Neural Information Processing, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 766–773.
[118] B.W. Yap, S.H. Ong, N.H.M. Husain, Using data mining to improve assessment of credit worthiness via credit scoring models, Expert Syst. Appl. 38 (10) (2011) 13274–13283.
[119] F. Louzada, O. Anacleto-Junior, C. Candolo, J. Mazucheli, Poly-bagging predictors for classification modelling for credit scoring, Expert Syst. Appl. 38 (10) (2011) 12717–12720.
[120] A. Marqués, V. García, J. Sánchez, Exploring the behaviour of base classifiers in credit scoring ensembles, Expert Syst. Appl. 39 (11) (2012) 10244–10250.
[121] A. Marqués, V. García, J. Sánchez, Two-level classifier ensembles for credit risk assessment, Expert Syst. Appl. 39 (12) (2012) 10916–10922.
[122] B. Tang, S. Qiu, A new credit scoring method based on improved fuzzy support vector machine, in: 2012 IEEE International Conference on Computer Science and Automation Engineering, CSAE, Vol. 3, 2012, pp. 73–75.
[123] F. Louzada, P.H. Ferreira-Silva, C.A. Diniz, On the impact of disproportional samples in credit scoring models: An application to a Brazilian bank data, Expert Syst. Appl. 39 (9) (2012) 8071–8078.
[124] J. Abellán, C.J. Mantas, Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring, Expert Syst. Appl. 41 (8) (2014) 3825–3830.
[125] T. Harris, Credit scoring using the clustered support vector machine, Expert Syst. Appl. 42 (2) (2015) 741–750.
[126] B. Yi, J. Zhu, Credit scoring with an improved fuzzy support vector machine based on grey incidence analysis, in: 2015 IEEE International Conference on Grey Systems and Intelligent Services, GSIS, 2015, pp. 173–178.
[127] S. Jones, D. Johnstone, R. Wilson, An empirical evaluation of the performance of binary classifiers in the prediction of credit ratings changes, J. Bank. Finance 56 (2015) 72–85.
[128] J. Chen, L. Xu, A method of improving credit evaluation with support vector machines, in: 2015 11th International Conference on Natural Computation, ICNC, 2015, pp. 615–619.
[129] Z. Zhao, S. Xu, B.H. Kang, M.M.J. Kabir, Y. Liu, R. Wasinger, Investigation and improvement of multi-layer perceptron neural networks for credit scoring, Expert Syst. Appl. 42 (7) (2015) 3508–3516.
[130] R. Florez-Lopez, J.M. Ramon-Jeronimo, Enhancing accuracy and interpretability of ensemble strategies in credit risk assessment. A correlated-adjusted decision forest proposal, Expert Syst. Appl. 42 (13) (2015) 5737–5753.
[131] R. Florez-Lopez, J.M. Ramon-Jeronimo, Enhancing accuracy and interpretability of ensemble strategies in credit risk assessment. A correlated-adjusted decision forest proposal, Expert Syst. Appl. 42 (13) (2015) 5737–5753.
[132] M. Aláraj, M. Abbod, A systematic credit scoring model based on heterogeneous classifier ensembles, in: 2015 International Symposium on Innovations in Intelligent SysTems and Applications, INISTA, 2015, pp. 1–7.
[133] M. Aláraj, M.F. Abbod, Classifiers consensus system approach for credit scoring, Knowl.-Based Syst. 104 (2016) 89–105.
[134] L. Yu, Z. Yang, L. Tang, A novel multistage deep belief network based extreme learning machine ensemble learning paradigm for credit risk assessment, Flex. Serv. Manuf. J. 28 (4) (2016) 576–592.
[135] H. Xiao, Z. Xiao, Y. Wang, Ensemble classification based on supervised clustering for credit scoring, Appl. Soft Comput. 43 (2016) 73–86.
[136] A. Bequé, S. Lessmann, Extreme learning machines for credit scoring: An empirical evaluation, Expert Syst. Appl. 86 (2017) 42–53.
[137] A. Lawi, F. Aziz, S. Syarif, Ensemble gradientboost for increasing classification accuracy of credit scoring, in: 2017 4th International Conference on Computer Applications and Information Processing Technology, CAIPT, 2017, pp. 1–4.
[138] Y. Li, X. Lin, X. Wang, F. Shen, Z. Gong, Credit risk assessment algorithm using deep neural networks with clustering and merging, in: 2017 13th International Conference on Computational Intelligence and Security, CIS, 2017, pp. 173–176.
[139] Z. Li, Y. Tian, K. Li, F. Zhou, W. Yang, Reject inference in credit scoring using semi-supervised support vector machines, Expert Syst. Appl. 74 (2017) 105–114.
[140] O.J. Okesola, K.O. Okokpujie, A.A. Adewale, S.N. John, O. Omoruyi, An improved bank credit scoring model: A Naïve Bayesian approach, in: 2017 International Conference on Computational Science and Computational Intelligence, CSCI, 2017, pp. 228–233.
[141] H. Chen, M. Jiang, X. Wang, Bayesian ensemble assessment for credit scoring, in: 2017 4th International Conference on Industrial Economics System and Industrial Security Engineering, IEIS, 2017, pp. 1–5.
[142] J. Abellán, J.G. Castellano, A comparative study on base classifiers in ensemble methods for credit scoring, Expert Syst. Appl. 73 (2017) 1–10.
[143] B. Vanderheyden, J. Priestley, Logistic ensemble models, 2018.
[144] P. Martey Addo, D. Guegan, B. Hassani, Credit risk analysis using machine and deep learning models, Risks 6 (2018) 38.
[145] Y. Xia, C. Liu, B. Da, F. Xie, A novel heterogeneous ensemble credit scoring model based on bstacking approach, Expert Syst. Appl. 93 (C) (2018) 182–199.
[146] Y.-C. Chang, K.-H. Chang, G.-J. Wu, Application of extreme gradient boosting trees in the construction of credit risk assessment models for financial institutions, Appl. Soft Comput. 73 (2018).
[147] W. Li, S. Ding, Y. Chen, S. Yang, Heterogeneous ensemble for default prediction of peer-to-peer lending in China, IEEE Access 6 (2018) 54396–54406.
[148] A. Cao, H. He, Z. Chen, W. Zhang, Performance evaluation of machine learning approaches for credit scoring, Int. J. Econ. Finance Manag. Sci. 6 (2018) 255–260.
[149] Basel Committee on Banking Supervision, Basel II: International convergence of capital measurement and capital standards: A revised framework - comprehensive version, Bank for International Settlements, BIS (2006).
