


Cross-Validation for Imbalanced Datasets: Avoiding
Overoptimistic and Overfitting Approaches
Miriam Seoane Santos, Jastin Pompeu Soares, Pedro Henriques Abreu, Hélder Araújo and João Santos
CISUC, Department of Informatics Engineering, University of Coimbra, Portugal
Department of Electrical and Computer Engineering, University of Coimbra, Portugal
IPO-Porto Research Centre, Porto, Portugal

Abstract—Although cross-validation is a standard procedure for performance evaluation, its joint application with oversampling remains an open question for researchers farther from the imbalanced data topic. A frequent experimental flaw is the application of oversampling algorithms to the entire dataset, resulting in biased models and overly-optimistic estimates. We emphasize and distinguish overoptimism from overfitting, showing that the former is associated with the cross-validation procedure, while the latter is influenced by the chosen oversampling algorithm. Furthermore, we perform a thorough empirical comparison of well-established oversampling algorithms, supported by a data complexity analysis. The best oversampling techniques seem to possess three key characteristics: use of cleaning procedures, cluster-based example synthetization and adaptive weighting of minority examples, where Synthetic Minority Oversampling Technique coupled with Tomek Links and Majority Weighted Minority Oversampling Technique stand out, being capable of increasing the discriminative power of data.

I. INTRODUCTION

Imbalanced Data (ID) occurs when there is a considerable difference between the class priors of a given problem. Considering a binary classification problem, a dataset is said to be imbalanced if there exists an under-represented concept (a minority class) when compared to the other (a majority class) [1]. Prediction models built from imbalanced datasets are most often biased towards the majority concept, which is especially critical when there is a higher cost of misclassifying the minority examples, such as diagnosing rare diseases [2]. Approaches to handle imbalanced scenarios can be mainly divided into data-level approaches, where the data is preprocessed in order to achieve a balanced dataset for classification, and algorithmic-level approaches, where the classifiers are adapted to deal with the characteristic issues of imbalanced data [3]–[6]. By far, data-level approaches are the most commonly used, as they have proven to be efficient, are simple to implement and completely classifier-independent [2], [7]. Data-level strategies fall into two main categories, undersampling and oversampling: the former consists in removing majority examples while the latter replicates the minority examples. Researchers often invest in oversampling procedures since they are capable of balancing class distributions without ruling out potentially critical majority examples [8].

Cross-validation (CV) is a standard procedure to evaluate classification performance; yet, its joint application with oversampling raises some questions for researchers farther from the imbalanced data community. Some researchers not familiarised with the topic tend to misunderstand some aspects of a standard experimental setup in imbalanced domains. One of their frequent misconceptions relates to the joint-use of CV and oversampling algorithms: oversampling seems to be applied to the entire original data, and only then the cross-validation and model evaluation is performed [9]–[12]. This misconception naturally leads to building biased models and producing overoptimistic error estimates (examples of these situations will be illustrated in Section III).

In traditional CV, the entire dataset is initially partitioned into k folds, where k-1 folds are used to train the prediction model and the left-out fold is used for testing. The folds then rotate so that all folds are used for training and testing the model, and the final performance metrics are averaged across the k estimates of each test fold. This process assures that k independent sets are used to test the model, simulating unseen data: the test set is never seen during the training of the model, to avoid overfitting the data. Incorrectly applying oversampling while performing CV may derive into two main issues: overoptimism and overfitting, as we proceed to explain.

Regarding the issue of overoptimism, consider Approach 1 (CV after Oversampling) and Approach 2 (CV during Oversampling) as depicted in Fig. 1. In the first approach (Approach 1) we design a cross-validation setup prone to overoptimism: the entire dataset is first oversampled to achieve a 50-50 distribution between classes and the cross-validation is applied afterwards. In this scenario, it is possible that copies of the same patterns appear in both the training and test sets, making this design subjected to overoptimism (Fig. 1 - CV after Oversampling). In the second approach (Approach 2), the oversampling procedure is performed during cross-validation: the dataset is first divided into k stratified partitions and only the training set (corresponding to k − 1 partitions) is oversampled (Fig. 1 - CV during Oversampling). In this scenario, the patterns included in the test set are never oversampled or seen by the model in the training stage, thus allowing a proper evaluation of the model's capability to generalize from the training data.

Regarding the issue of overfitting, some researchers directly associate it to all oversampling procedures, while others refer to the overoptimistic results of a CV approach as "overfitting", which confuses both concepts and hinders their identification. For this reason, we here distinguish both ideas and explain how they relate to CV and oversampling approaches, providing some examples:
[Fig. 1 schematic - graphic omitted. The diagram uses a toy dataset with minority examples A-E and majority examples 1-10. Under "CV after Oversampling", the whole dataset is oversampled first, so oversampled copies of the same minority pattern (marked with an asterisk) end up in both the training and test partitions of a fold. Under "CV during Oversampling", only the k-1 training partitions of each fold are oversampled, while the test partition keeps only original examples.]
Fig. 1. Different cross-validation approaches: CV after oversampling (left) and CV during oversampling (right). When the cross-validation is implemented
after oversampling is applied, similar patterns may appear in both training and test partitions (marked in the schema with an asterisk), leading to overoptimistic
error estimates. When the cross-validation is applied during oversampling, only the training patterns are considered both for generating new patterns and
training the model, avoiding overoptimism. In both approaches, similar or exact copies may appear in the training partitions, leading to overfitting, which is
surpassed by an appropriate choice of oversampling technique.
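
To make the contrast of Fig. 1 concrete, the following Python sketch reproduces the two designs on a synthetic dataset. It is an illustration only, not the authors' code: it assumes scikit-learn and imbalanced-learn, and the toy dataset and decision-tree classifier are placeholders.

```python
# Minimal sketch (not the authors' code) contrasting the two CV designs of Fig. 1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Approach 1 (flawed): oversample the entire dataset, then cross-validate.
# Copies of the same minority patterns can land in both training and test folds.
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)
auc_after = cross_val_score(clf, X_os, y_os, cv=cv, scoring="roc_auc")

# Approach 2 (correct): oversampling happens inside each fold, on the training split only.
pipe = make_pipeline(SMOTE(random_state=0), DecisionTreeClassifier(random_state=0))
auc_during = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")

print(f"CV after oversampling : {auc_after.mean():.3f}")   # typically overoptimistic
print(f"CV during oversampling: {auc_during.mean():.3f}")  # honest estimate
```

Because the imblearn pipeline re-applies the oversampler inside every training split, the test folds contain only original, untouched examples.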

• Overfitting occurs when the classifier is "tightly fitted" to the training data points, and therefore loses its generalization ability for the test data. Because of this, the classification performance is lower in the test set when compared to the training set. In this context, overfitting is usually associated to oversampling techniques that generate exact replicas of training data patterns (e.g. Random Oversampling - ROS), causing an overfit of the model in its learning stage.
• Overoptimism occurs when exact or similar replicas of a given pattern exist in both the training and test sets (well represented in Fig. 1 - CV after Oversampling). In this case, the classification performance in the test sets will be similar to the one obtained in the training sets, not because the model is able to correctly generalize to the test data, but rather because there are similar patterns in both training and test partitions. In this context, overoptimism is associated to incorrect implementations of cross-validation approaches, when oversampling is used.

As an example, consider that we divided a dataset into five equal folds. If we considered four partitions for training and applied the ROS algorithm, exact replicas of existing minority patterns would be generated: the classifier could be so exaggeratedly fitted to the training data that it would misclassify the test patterns (overfitting occurs). On the other hand, imagine that we considered all five partitions to perform oversampling, creating similar patterns rather than exact replicas. Although we are not using a technique prone to overfitting, we are considering all the data points in the oversampling procedure and therefore the probability that similar patterns are both in the training and test partitions increases (Fig. 1). In this case, we are in the presence of an overoptimistic approach.

The importance of a proper cross-validation approach in imbalanced domains was first emphasized by Blagus and Lusa [13]. They have evaluated the bias introduced in Classification and Regression Trees (CART) when cross-validation and sampling techniques – random undersampling, random oversampling and Synthetic Minority Oversampling Technique (SMOTE) – are jointly used. The results showed that incorrect CV achieved overly-optimistic estimates for random oversampling and SMOTE, while random undersampling produced accurate predictions, resilient to the change of CV procedure. Although this work provides an interesting take on the problem, some questions remained unanswered from the experimental setup. The number of real-world datasets used was rather small (10 datasets) and there was not much variability in terms of sample size. Therefore, although the authors claimed that a higher bias (overoptimistic effects) was observed for smaller datasets, the lack of variability does not allow a complete analysis: in this work, we use a larger number of real-world datasets
(86 datasets) to provide a thorough evaluation of this topic. Blagus and Lusa [13] also refer that the bias is marginal when the prediction task is "easy", without supporting this claim with any type of complexity measures: we therefore explore well-established data complexity measures to characterize the difficulty of each dataset. Furthermore, the following novel analyses are included:
• Determine whether the Imbalance Ratio (IR) influences the classification bias (overoptimistic effects);
• Evaluate incorrect versus correct CV approaches from a complexity perspective, by analysing the data complexity in training and test partitions;
• Analyse a higher number of oversampling algorithms, in order to compare their inner procedure, determine how they handle data complexity and assess which are more subjected to overfitting and which provide the highest classification improvement.

Motivated by the topics presented above, the purpose of this work is as follows:
(i) To fully characterize the risk of overoptimism when CV and oversampling algorithms are used, extending the work of Blagus and Lusa [13], as previously described;
(ii) To distinguish the problem of overoptimism from the overfitting problem, including a novel analysis on the risk of overfitting and on the influence of data complexity on classification results;
(iii) To study the behavior of 15 well-established oversampling algorithms and their influence on classification performance, providing a thorough analysis of their inner procedure.

In this way, the contribution of this research is two-fold. First, it details important aspects on how to properly address imbalanced data problems, so that researchers farther from the imbalance topic or new researchers in the field truly understand the nature of the problem and acknowledge the most correct validation procedures and promising resampling techniques. Secondly, for researchers familiarised with the imbalanced data field, it provides a thorough empirical analysis of a comprehensive set of oversampling techniques, focusing on their behavior/inner procedure and strengths/faults, supported by a data complexity analysis.

The structure of the manuscript is as follows: Section II presents some background knowledge on oversampling techniques, complexity measures and performance measures. Then, Section III presents recent works that make use of overoptimistic cross-validation procedures. The experimental setup is described in Section IV, while the experimental results are discussed in Section V. Finally, Section VI summarises the conclusions of the work and refers to some directions for future work.

II. BACKGROUND KNOWLEDGE

This section reviews some background information that supports the different stages of this work, regarding oversampling algorithms, complexity metrics and performance metrics.

A. Oversampling Algorithms

1) ROS: Random Oversampling (ROS) is the simplest of oversampling techniques, where the existing minority examples are replicated until the class distribution is balanced. This approach is often criticised since it does not introduce any new information to the data (the oversampled examples are mere copies of the original data points) and may lead to overfitting (even if CV is performed properly) [14].

2) SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) works by generating synthetic minority examples along the line segments joining randomly chosen G minority examples and their k-nearest minority class neighbors [15]. G is the number of minority examples to oversample in order to obtain the desired balancing ratio between the classes, and along with the value of k, it can be specified by the user. SMOTE will then generate a new synthetic sample s according to s = x + ϕ(v − x), where x is the minority sample to oversample, v is one of its chosen nearest neighbors and ϕ is called a gap, in this case, a random number between 0 and 1. By generating similar examples to the existing minority points, SMOTE creates larger and less specific decision boundaries that increase the generalization capabilities of classifiers, therefore increasing their performance.

3) ADASYN: Instead of producing an equal number of synthetic minority instances for each minority example, the Adaptive Synthetic Sampling Approach (ADASYN) algorithm, proposed by He et al. [16], specifies that minority examples harder to learn are given a greater importance, being oversampled more often. ADASYN determines a weight wi for each minority example, defined as the normalized ratio of majority examples Ni among its k nearest neighbors: wi = Ni/(k × z), where z is a normalization constant. Then, the number of synthetic data points to generate for each minority example is specified as gi = wi × G, being G the total necessary number of synthetic minority samples to produce according to the required amount of oversampling. The oversampling procedure is the same as SMOTE; the only difference is that harder minority examples are replicated more often.

4) Borderline-SMOTE: Based on the same idea of providing a clearer decision boundary, Han et al. [17] suggested two new variations of SMOTE – Borderline-SMOTE1 and Borderline-SMOTE2 – in which only the minority examples near the borderline are considered for oversampling. Borderline-SMOTE first considers the division of the minority examples into three mutually exclusive sets: noise, safe and danger. This division is made by considering the number of majority examples m′ found among each minority example's k nearest neighbors. Thus, if m′ = k, all the nearest neighbors of a minority data point pi are majority examples, and pi is considered noise; conversely, if k/2 > m′ ≥ 0, pi is considered safe, while if k > m′ ≥ k/2, pi is surrounded by more majority examples than minority ones (or surrounded by exactly the same number), and therefore is considered danger. The "danger" data points are considered the minority borderline examples, and only them are oversampled, following a SMOTE-like procedure. For Borderline-SMOTE1 new synthetic examples are created along the line between
the danger examples and their minority nearest neighbors; Borderline-SMOTE2 uses the same procedure as Borderline-SMOTE1, but further considers the nearest majority example of each danger data point to produce one more synthetic example: the distance between each danger point and its nearest majority neighbour is multiplied by a gap between 0 and 0.5 so that the new point falls closer to the minority class, thus strengthening the minority borderline examples.

5) Safe-Level-SMOTE: Contrary to Borderline-SMOTE, the technique proposed by Bunkhumpornpat et al. [18], called Safe-Level-SMOTE, only synthesizes minority examples around safe regions. To specify a safe region, a coefficient named safe level ratio (slratio) is defined, which is the ratio between the number of minority examples found among each minority example's (p) k nearest neighbors, slp, and the number of minority examples found among a randomly chosen neighbor's (n) k-neighborhood, sln. Depending on the slratio of a given minority example, five different scenarios may be applied to the SMOTE-based generation: if both slp and sln are 0, no oversampling occurs; if slp > 0 and sln = 0, then the SMOTE's gap is set to 0 (the minority example is duplicated); if slratio = 1, the gap is as in the original formulation of SMOTE (rand(0, 1)); if slratio > 1, the gap is set to rand(0, 1/slratio) so that the new example is generated closer to the minority example p; and finally, if slratio < 1, the gap is set to rand(1 − slratio, 1) so that, conversely, the new example is generated closer to the nearest neighbor n.

6) SMOTE+TL: SMOTE + Tomek Links (SMOTE+TL) also works on the basis of creating clear safe regions, by applying Tomek links after the data is oversampled with SMOTE [14]. A Tomek link is defined as a pair of examples from different classes, one from the minority class and the other from the majority class, (xi, xj), that are each other's closest neighbors [19]. In this technique, SMOTE is first applied to oversample the minority examples; then, the Tomek links are identified and both data points of each pair are removed.

7) SMOTE+ENN: Similar to SMOTE+TL, SMOTE+ENN first generates synthetic examples from the minority class (through SMOTE), from which a process of data cleaning follows, using the Wilson's Edited Nearest Neighbour Rule (ENN). ENN removes any example (either minority or majority examples) whose class differs from at least two of its three nearest neighbors [20]. By removing the examples that are misclassified by their three nearest neighbors, SMOTE+ENN provides a deeper data cleaning than SMOTE+TL [14].

8) ADOMS: The Adjusting the Direction Of the synthetic Minority clasS examples (ADOMS) algorithm combines SMOTE with Principal Component Analysis (PCA) to produce new synthetic minority examples along the first principal component of the data surrounding each minority example [21]. For each minority example to replicate, ADOMS searches for its k-nearest minority class neighbors and performs PCA to determine the first principal component axis of the local data. The generation of the new example is done in a SMOTE-like fashion, but instead of being placed along the line that joins a minority example and one of its k nearest neighbors, it is placed along the first principal component axis of its k-neighborhood.

9) CBO: Jo and Japkowicz [22] propose an oversampling approach that simultaneously handles the between-class imbalance (imbalance between different classes) and the within-class imbalance, where a single class may comprise sub-clusters that hinder the learning process of algorithms. Their approach is called Cluster-Based Oversampling (CBO) and uses k-means clustering to guide the oversampling procedure. First, k-means is applied to each class to find the existing sub-clusters; then, the majority class is oversampled - each sub-cluster of the majority class is inflated until it reaches the size of the largest majority sub-cluster. Finally, the minority class is oversampled: each sub-cluster is oversampled until it reaches the size Nmaj/Ncmin, where Nmaj is the total number of majority examples after oversampling and Ncmin is the number of minority class clusters. Different oversampling approaches may be coupled with the CBO algorithm: this work makes use of Random Oversampling (CBO + ROS), as proposed by Jo and Japkowicz in the original paper [22], and SMOTE (CBO + SMOTE), as discussed by He and Garcia [1].

10) AHC: Cohen et al. propose an oversampling approach based on Agglomerative Hierarchical Clustering (AHC) [23]. In this approach, the minority examples are clustered using AHC with both the single and complete linkage rules in succession, so that the produced clusters may vary. Then, fine-grained clusters are retrieved from all levels of the generated dendrograms and their centroids (prototypes) are determined. The process of synthetic data generation is based on introducing the computed cluster prototypes as new samples from the minority class, until a complete balance is achieved.

11) MWMOTE: Similarly to ADASYN and Borderline-SMOTE, the Majority Weighted Minority Oversampling Technique (MWMOTE) also works on the basis of generating synthetic samples in specific regions, where the minority examples are harder to learn [24]. MWMOTE starts by identifying the harder-to-learn minority examples (Simin), so that each is given a selection weight (Sw), according to their distance to the nearest examples belonging to the majority class. These weights are then converted into selection probabilities, Sp, that will be used in the oversampling stage. To generate the new synthetic samples, the complete set of minority class examples Smin is clustered into M groups. Then, a minority example x from Simin is selected according to the probability Sp, and another random minority example in Smin that belongs to the same cluster of x is used to generate a new synthetic sample in the same way as SMOTE. This approach is performed as many times as required, according to the necessary number N of synthetic samples to be generated for complete balance.

12) SPIDER: Stefanowski and Wilk propose an algorithm that uses the characteristics of examples to drive their oversampling: Selective Pre-Processing of Imbalanced Data (SPIDER) [25]. SPIDER comprises two stages: first, each example is categorized into "safe" or "noisy", according to the correct or incorrect classification result returned by its k-neighborhood, respectively (k = 3 in the original formulation). Then, an amplification strategy must be specified by the
user: either "weak amplification", "weak amplification with relabeling" or "strong amplification". If weak amplification is chosen, the noisy minority examples are amplified (copied) as many times as there are safe majority examples in their k-neighborhood (k = 3). "Weak amplification with relabeling" allies the amplification of noisy minority examples described before with a relabeling procedure: noisy majority examples surrounded by noisy minority examples (considering k = 3) are relabeled to the minority class. The "strong amplification" technique processes both the noisy and safe minority examples. It starts by amplifying the safe examples by producing as many copies as there are safe majority examples in their 3-nearest neighborhood and then considers the noisy minority examples and reclassifies them according to a larger neighborhood (k = 5). If an example is correctly classified, it suffers a standard weak amplification; otherwise, it is more strongly amplified, by considering a 5-nearest neighborhood. Finally, for any type of amplification chosen, the noisy examples of the majority class are removed (in the case of "weak amplification with relabeling", only the un-relabelled noisy majority examples are removed).

SPIDER2 is a modification of SPIDER that performs the pre-processing of minority and majority examples in two separate stages [26]. It maintains the choice to perform a weak or strong amplification for the minority examples; while for the majority examples it is possible to decide whether relabeling is required or not. SPIDER2 starts by categorizing the majority examples into "safe" or "noisy" and if the relabeling option is chosen, the noisy majority examples are relabeled; otherwise, they are removed. Then, the minority examples are also divided into "safe" or "noisy" and the amplification proceeds according to the chosen technique (weak or strong), which are the same as above.

B. Data Complexity Measures

Ho and Basu [27] proposed several complexity measures that regard essentially three properties of datasets: geometry/topology, class overlapping and boundary separability (Table I).

TABLE I
COMPLEXITY MEASURES DESCRIPTION.

Measure   Description                                                           Higher Data Complexity
F1        Highest value of Fisher's Discriminative Ratio (among all features)   −−
F2        Highest volume of overlap between classes (among all features)        ++
F3        Maximum feature efficiency (among all features)                       −−
L1        Minimised error of a linear classifier (linear SVM)                   ++
L2        Error rate (training set) of a linear classifier (linear SVM)         ++
N1        Fraction of points on boundary by MST                                 ++
N2        Ratio of average intra-class and inter-class scatter                  ++
N3        Error rate of nearest neighbour classifier (KNN, k=1)                 ++
L3        Nonlinearity of linear classifier (linear SVM)                        ++
N4        Nonlinearity of a nearest neighbour classifier (KNN, k=1)             ++

1) Geometry and Topology: L3 and N4 measure the nonlinearity of a linear classifier and a nearest-neighbour classifier, respectively. L3 returns the error of a Support Vector Machine (SVM) [28] with linear kernel in a test set created by linear interpolation of randomly selected pairs of examples from the same class. N4 constructs a test set in the same way as for L3 and returns the test error for a nearest-neighbor classifier. Higher values of these measures indicate more complex classification problems.

2) Overlapping of Individual Feature Values: F1, F2 and F3 focus on the ability of a single feature to distinguish between classes [27]. F1 measures the highest discriminative power of all features in data (higher discriminative power indicates lower complexity), F2 measures the highest volume of overlap between the classes' conditional distributions (if there is no overlap in at least one feature, F2 will be zero), and F3 measures feature efficiency, the fraction of points where the values spanned by each class do not overlap (higher fractions indicate easier classification problems).

3) Class Separability: L1, L2, N1, N2 and N3 focus on the characteristics of the boundary between classes [29]. L1 and L2 measure to what extent the training data is linearly separable using an SVM with linear kernel [30]: if a classification problem is linearly separable, then L1 is zero and L2 is the training set error rate. N1 measures the fraction of points connected to the opposite class by an edge in a Minimum Spanning Tree (MST) and it can achieve high values when the classes are interspersed (higher complexity) or when the class boundary has a narrower margin than the intra-class distances (lower complexity). However, for the datasets used in this research, we observed that the first scenario is often the case, and for that reason we have associated higher values of N1 to a higher complexity in Table I. N2 measures the trade-off between the within-class spread and the between-class spread. In an easy classification problem, the within-class scatter should be low and the between-class scatter should be high; nevertheless, the denominator (between-class scatter) greatly influences N2 values: we, therefore, consider that higher values of N2 (smaller between-class scatter) traduce more complex scenarios. Finally, N3 is the error rate of a 1-nearest neighbor classifier (higher N3 values are associated to a higher complexity).

Additional information on the presented complexity measures is available in the extended version of the paper (https://eden.dei.uc.pt/∼pha/Long-version-CIM.pdf).

C. Performance Metrics for Imbalanced Scenarios

Accuracy (ACC) measures the percentage of correctly classified examples and is computed as ACC = (TP + TN)/(TP + FN + FP + TN), where TP and TN are the true positives and true negatives and FP and FN are the false positives and false negatives. Given that ACC is biased towards the majority class [28], alternative metrics should be considered, such as Sensitivity, Specificity, Precision, F-Measure, G-mean and the Area Under the Receiver Operating Characteristics
(ROC) Curve (AUC) [1]. Sensitivity (SENS) is calculated as SENS = TP/(TP + FN) and measures the percentage of positive examples correctly classified, while Specificity (SPEC) refers to the percentage of negative examples correctly identified and can be computed as SPEC = TN/(TN + FP). Precision (PREC) corresponds to the percentage of positive examples correctly classified, considering the set of all the examples classified as positive, PREC = TP/(TP + FP). F-measure, G-mean and AUC represent the trade-off between some of the metrics described above. F-measure (F-1) shows the compromise between sensitivity and precision, obtained through their harmonic mean, F-1 = (2 × PREC × SENS)/(PREC + SENS), while G-mean represents the geometric mean of both classes' accuracies, G-mean = √(SENS × SPEC). At last, AUC makes use of the ROC curve to exhibit the trade-off between the classifier's TP and FP rates [31].

III. RELATED WORKS

This section presents a series of related works aiming to show that the less the work is related to learning from imbalanced data, the more likely the cross-validation (CV) procedure is poorly designed. Thus, related works were divided into three main categories: "Learning from imbalanced data", "Comparing approaches in a specific context" and "Solving a classification problem". The "Learning" category includes research works focused on performing extensive experiments to evaluate diverse sampling techniques [32]–[38]. Typically, these works include a large number of publicly available datasets and a comprehensive set of learners and sampling algorithms. "Comparison" category works perform a comparison of oversampling approaches in a specific context: these works normally include a lower number of datasets and sampling strategies. The "Classification" category comprises works where the main objective is to solve a particular classification problem and the imbalanced nature of data is not the focus.

The extended version of this paper (https://eden.dei.uc.pt/∼pha/Long-version-CIM.pdf) provides additional information on related works, including a table that summarises their main characteristics. All works included in the "Learning" category, except one, perform a well-designed CV procedure, where the training and test partitions are determined before any oversampling technique is applied. As we move towards research works whose objective is not to provide a general review on ways to deal with the data imbalance problem, we find a larger number of works where the CV procedure is not appropriate: the complete dataset is oversampled and the partition into training and test is performed afterwards. This is more evident if we consider the research works where the main objective is to ease a classification task, rather than studying different approaches to surpass the issues of imbalanced datasets. It is possible that these researchers were not completely familiarised with imbalanced data domains and respective approaches; thus, when faced with a specific imbalanced context, they resorted to the state-of-the-art oversampling approach (namely, SMOTE) to solve the issue, but they understood it as a form of preprocessing, which created a greater propensity for misconception during its application. We therefore conclude that the less the work is related to learning from imbalanced data, the more likely the CV procedure is poorly designed.

IV. EXPERIMENTAL SETUP

The experimental setup used in this work comprises three main approaches: Baseline, Approach 1 and Approach 2 (Fig. 2).

[Fig. 2 diagram - graphic omitted. It shows the three pipelines: Baseline (original data, 5-fold CV, performance metrics and data complexity of the original partitions); Approach 1, CV after oversampling (the complete dataset is oversampled, then cross-validated); and Approach 2, CV during oversampling (only the training partitions of each fold are oversampled).]
Fig. 2. Experimental setup architecture.

For the results presented as "Baseline", the collected datasets are first divided into five stratified folds (k = 5 folds is the maximum that allowed a proper stratification) and the classifiers are applied afterwards, without any type of oversampling. The data complexity measures and performance metrics for the original training and test sets are then retrieved. In Approach 1, the original datasets are oversampled and the CV and performance evaluation are performed afterwards. The data complexity measures (for oversampled training and test sets) are then retrieved. In Approach 2, oversampling is performed during CV: the original datasets are first divided into five folds (same folds as for the Baseline), and only the training partitions are oversampled. The classifiers are then trained with the oversampled training folds and tested in the respective original test folds. In this case, the data complexity measures are only determined for the oversampled training sets, since the data complexity of the test sets is the same as obtained from the Baseline method.

With this setup we aim to perform three main analyses: (i) compare the differences in classification performance between Approaches 1 and 2, in order to explain the risk of overoptimistic error estimates, (ii) distinguish between overoptimistic and overfitting approaches in imbalanced scenarios that consider oversampling and (iii) determine which oversampling approach is the most appropriate to solve the imbalance problem, obtaining the best average results for all the different contexts (datasets) considered in this study.

Regarding the process of data collection, the 86 datasets used in this work were collected from two online repositories, the UCI Machine Learning Repository (http://archive.ics.uci.edu/
ml) and KEEL – Knowledge Extraction based on Evolutionary Learning (http://www.keel.es). The choice criteria included the following parameters: complete datasets, regarding binary classification problems, with a variable sample size, number of features and IR. Their main characteristics are summarised in Table II. As a note worth mentioning, although the variability of datasets considered in this work is considerable, especially in comparison to the precursor work of Blagus and Lusa [13], we would like to point out that the conclusions drawn in this research refer to datasets with low dimensionality (4-34 features). The interested readers may find additional information on the data collection stage in the extended version of this paper (https://eden.dei.uc.pt/∼pha/Long-version-CIM.pdf).

V. RESULTS AND DISCUSSION

In this section, we start by comparing Approaches 1 and 2 regarding their risk of overoptimism/overfitting. Then, we move to an analysis on data complexity and the inner characteristics of each oversampling method.

A. Approach 1 versus Approach 2: Evaluating the risk of overoptimism and overfitting

To evaluate the issues of overoptimism and overfitting regarding the joint-use of CV and oversampling approaches, we started by comparing the performance results of Approach 1 and Approach 2 (please refer to the extended version of this paper: https://eden.dei.uc.pt/∼pha/Long-version-CIM.pdf). The results confirmed that the performance outputted by Approach 1 is more optimistic: the mean test values of the various performance metrics (AUC, G-mean, F-1 and SENS) were always higher in Approach 1 (see extended version). Since the behavior observed for both Approaches is consistent for all performance metrics, we will refer only to the AUC values in the following analyses, in order to provide a base of comparison with previous works, which largely use AUC. Fig. 3 shows the AUC values (training and test partitions) obtained for the original datasets (Baseline) and for the oversampled datasets, considering both Approaches 1 and 2. Furthermore, Table III presents the absolute differences between AUC values of training and test partitions, for both Approaches (listed in descending order of differences in Approach 2). The p-values derived from a Mann-Whitney test are also included and confirm that the train-test differences between Approaches 1 and 2 are significantly different. In the extended version of this paper, the interested reader may find additional statistical tests for training and test partitions individually, and a table of 10 representative datasets for which the classification results are discussed in detail.

As shown in Fig. 3, the training results are similar, which suggests that the major difference between both approaches relies on the characteristics of the test sets. In Approach 1, it is the overoptimism problem (rather than the overfitting) that is identified, given that the difference between training and test results is not considerable (Table III). In this scenario, the test sets have similar characteristics to the training sets (they are balanced and may contain exact replicas or similar patterns to the training points). From Table III, it can be observed that for Approach 1, the best methods often include CBO and ROS. CBO+Random and ROS create exact replicas of existing data points, and since the division (i.e., CV) is performed after the oversampling procedure in the entire dataset, the probability that exact replicas exist in both the training and test sets increases, thus producing better results. In the case of CBO+SMOTE, although SMOTE creates synthetic examples, it does so by inflating the clusters defined by the k-means algorithm, which may reduce the data variability introduced in the dataset. Therefore, patterns in the test sets may also be similar to the ones comprised in the training sets.

In Approach 2, the difference between the results of the training and test sets is more accentuated: in this scenario, the test sets follow the same distribution as the original dataset and its patterns are never considered in the oversampling or training phases. As a result, overoptimism does not appear in this scenario. However, some overfitting effects may occur. Considering the presence of overfitting as a difference around 0.1 between the training and test AUCs [39], it can be observed that the great majority of oversampling methods cannot be responsible for overfitting effects. However, some methods seem to be introducing overfitting (Table III). CBO+Random, which obtains the worst results, seems to be the method responsible for the highest amount of overfitting, followed by Borderline-SMOTE and CBO+SMOTE. ROS, SMOTE+ENN and Safe-Level-SMOTE, although on a lighter scale, also seem to have some generalization issues, where the difference between training and test AUCs also comes close to 0.1. The same cannot be observed for Approach 1 given that the overoptimism problem highly dominates the results, preventing the identification of these overfitting effects. The tendency of CBO+Random, CBO+SMOTE and ROS to overfit the data is somewhat intuitive: as they create exact replicas (CBO+Random and ROS) or very similar replicas (CBO+SMOTE, by creating synthetic examples in defined clusters) of the existing training patterns, the models tend to overfit these training patterns and fail to generalize to different ones. Further on, Section V-C performs a detailed analysis of Approach 2, reviewing the advantages and disadvantages of each oversampling algorithm, allowing the understanding of why Borderline-SMOTE and Safe-Level-SMOTE may present generalization issues (which also explains their poor performance). The major issue of these methods is that the definition of danger/borderline examples (Borderline-SMOTE) and safe examples (Safe-Level-SMOTE) may fail in certain scenarios and prejudice the classification task (inability to generalize). Finally, SMOTE+ENN shows a training/test difference of 0.096, which is considerable when compared to its analogous SMOTE+TL (0.091) and precursor SMOTE (0.087). Both methods (SMOTE+ENN and SMOTE+TL) were developed to surpass the issues of overgeneralization of SMOTE. However, as will be discussed further on (Section V-C), the ability of SMOTE to create larger decision boundaries seems to be a major strength, whereas its successor procedures seem to create a higher risk of overfitting the training data. This may be due to excessive cleaning applied after SMOTE. In the case of SMOTE+TL, the issue is not critical (0.091), as only the Tomek Links are removed. For SMOTE+ENN, the
TABLE II
CHARACTERISTICS OF IMBALANCED DATASETS.

Dataset Size Features IR Dataset Size Features IR


bupa 345 6 1.38 vowel0 988 13 9.98
pageblocks-1-3vs4 472 10 1.57 ecoli-0-6-7vs5 220 7 10.00
glass1 214 9 1.82 glass-0-1-6vs2 192 9 10.29
ecoli-0vs1 220 7 1.86 ecoli-0-1-4-7vs2-3-5-6 336 7 10.59
wisconsin 683 9 1.86 led7digit-0-2-4-5-6-7-8-9vs1 443 7 10.97
pima 768 8 1.87 ecoli-0-1vs5 240 7 11.00
cmc1vs2 961 9 1.89 glass-0-6vs5 108 9 11.00
iris0 150 4 2.00 glass-0-1-4-6vs2 205 9 11.06
glass0 214 9 2.06 glass2 214 9 11.59
german 1000 20 2.33 ecoli-0-1-4-7vs5-6 332 7 12.28
yeast1 1484 8 2.46 cleveland-0vs4 173 14 12.31
haberman 306 3 2.78 ecoli-0-1-4-6vs5 280 7 13.00
vehicle2 846 18 2.88 shuttle-c0-vs-c4 1829 9 13.87
vehicle1 846 18 2.90 yeast-1vs7 459 8 14.30
vehicle3 846 18 2.99 glass4 214 9 15.46
glass-0-1-2-3vs4-5-6 214 9 3.20 ecoli4 336 7 15.8
transfusion 748 4 3.20 abalone9-18 731 8 16.4
vehicle0 846 18 3.25 dermatology-6 358 34 16.9
ecoli1 336 7 3.36 thyroid-3vs2 703 21 18.00
newthyroid1 215 5 5.14 glass-0-1-6vs5 184 9 19.44
ecoli2 336 7 5.46 pageblocks-1vs3-4-5 5144 10 21.27
balance scaleBvsR 337 4 5.88 shuttle-6vs2-3 230 9 22.00
balance scaleBvsL 337 4 5.88 yeast-1-4-5-8vs7 693 8 22.10
segment0 2308 19 6.02 pageblocks-1-2vs3-4-5 5473 10 22.69
glass6 214 9 6.38 glass5 214 9 22.78
yeast3 1484 8 8.10 yeast-2vs8 482 8 23.10
ecoli3 336 7 8.60 letter-U 20000 16 23.60
pageblocks0 5472 10 8.79 flare-F 1066 11 23.79
ecoli-0-3-4vs5 200 7 9.00 car-good 1728 6 24.04
yeast-2vs4 514 8 9.08 pageblocks-1vs4-5 5116 10 24.20
ecoli-0-6-7vs3-5 222 7 9.09 car-vgood 1728 6 25.58
ecoli-0-2-3-4vs5 202 7 9.10 letter-Z 20000 16 26.25
glass-0-1-5vs2 172 9 9.12 kr-vs-k-zero-onevsdraw 2901 6 26.63
yeast-0-3-5-9vs7-8 506 8 9.12 yeast4 1484 8 28.10
yeast-0-2-5-6vs3-7-8-9 1004 8 9.14 winequality-red-4 1599 11 29.17
yeast-0-2-5-7-9vs3-6-8 1004 8 9.14 poker-9vs7 244 10 29.50
ecoli-0-4-6vs5 203 7 9.15 yeast-1-2-8-9vs7 947 8 30.57
ecoli-0-1vs2-3-5 244 7 9.17 abalone-3vs11 502 8 32.47
ecoli-0-2-6-7vs3-5 224 7 9.18 yeast5 1484 8 32.73
glass-0-4vs5 92 9 9.22 kr-vs-k-threevseleven 2935 6 35.23
ecoli-0-3-4-6vs5 205 7 9.25 winequality-red-8vs6 656 11 35.44
ecoli-0-3-4-7vs5-6 257 7 9.28 abalone-17vs7-8-9-10 2338 8 39.31
yeast-0-5-6-7-9vs4 528 8 9.35 abalone-21vs8 581 8 40.50
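
For any one of the datasets in Table II, the evaluation protocol of Section IV can be sketched as below. This is a simplified reconstruction under stated assumptions (X and y as NumPy arrays, an imblearn-style oversampler exposing fit_resample, a placeholder decision-tree classifier, AUC as the metric), not the authors' original experimental code.

```python
# Sketch of the Baseline / Approach 1 / Approach 2 protocol for a single dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

def evaluate(X, y, oversampler=None, before_cv=False, k=5, seed=0):
    """Mean test AUC. oversampler=None: Baseline; before_cv=True: Approach 1;
    before_cv=False: Approach 2 (oversampling applied inside each fold)."""
    if oversampler is not None and before_cv:           # Approach 1: resample everything first
        X, y = oversampler.fit_resample(X, y)
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in skf.split(X, y):
        X_tr, y_tr = X[tr], y[tr]
        if oversampler is not None and not before_cv:   # Approach 2: resample training folds only
            X_tr, y_tr = oversampler.fit_resample(X_tr, y_tr)
        clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
    return float(np.mean(aucs))

# Example usage (SMOTE may need a smaller k_neighbors on very small minority folds):
# baseline  = evaluate(X, y)
# approach1 = evaluate(X, y, SMOTE(random_state=0), before_cv=True)
# approach2 = evaluate(X, y, SMOTE(random_state=0), before_cv=False)
```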

[Fig. 3 - graphic omitted: grouped bars of the mean training and test AUC obtained by the Baseline and by each oversampling method, under Approaches 1 and 2.]

Fig. 3. Performance metrics (average) achieved for the original datasets (Baseline) and for the oversampled datasets, considering both Approaches 1 and 2.
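
The gap analysis summarised in Table III can be reproduced, in outline, from per-dataset training and test AUCs. The arrays below are hypothetical inputs (one value per dataset for a given method), and the 0.1 threshold follows the rule of thumb cited in the text [39]; this is an illustration, not the authors' analysis script.

```python
# Sketch of the train-test AUC gap analysis behind Table III.
import numpy as np
from scipy.stats import mannwhitneyu

def train_test_gaps(train_auc, test_auc):
    """Absolute train-test AUC difference per dataset (1-D arrays of equal length)."""
    return np.abs(np.asarray(train_auc) - np.asarray(test_auc))

# gaps_a1 = train_test_gaps(train_auc_approach1, test_auc_approach1)  # hypothetical inputs
# gaps_a2 = train_test_gaps(train_auc_approach2, test_auc_approach2)
# stat, p = mannwhitneyu(gaps_a1, gaps_a2, alternative="two-sided")   # column 4 of Table III
# overfitting_suspected = gaps_a2.mean() >= 0.1                        # rule of thumb [39]
```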

issue is aggravated (0.096) due to its deeper data-cleaning procedure. Such cleaning aims to simplify the training data and ease the definition of less complex class boundaries, although the results suggest that this may not be advantageous for all scenarios: such simplification may jeopardize generalization. Focusing on the test AUC results, MWMOTE and SMOTE+TL seem to be the best oversampling methods (Fig. 3).

TABLE III
AUC DIFFERENCES BETWEEN TRAINING AND TEST PARTITIONS.

                          Train-Test Difference
Algorithm              Approach 1    Approach 2    Mann-Whitney p-value
CBO+Random                0.011         0.112          9.26E-23
Borderline-SMOTE2         0.019         0.104          8.34E-18
Borderline-SMOTE1         0.020         0.104          1.31E-17
CBO+SMOTE                 0.016         0.099          1.18E-17
ROS                       0.018         0.097          2.17E-17
SMOTE+ENN                 0.019         0.096          6.07E-16
Safe-Level-SMOTE          0.019         0.095          2.54E-16
SMOTE+TL                  0.020         0.091          1.32E-14
AHC                       0.023         0.089          1.36E-12
SPIDER                    0.022         0.088          3.18E-17
SMOTE                     0.023         0.087          2.83E-41
ADASYN                    0.024         0.086          4.53E-12
ADOMS                     0.024         0.085          7.92E-12
SPIDER2                   0.025         0.084          6.93E-08
MWMOTE                    0.025         0.084          2.69E-12
Baseline                  0.069         0.069          9.94E-01

B. Data Complexity Analysis

In order to better support the existence of overoptimism in Approach 1 (CV after oversampling), we have investigated the complexity of the training and test partitions for all datasets, in both Approaches 1 and 2. We hypothesize that the overoptimism is related to the difference between training and test partitions, as explained in what follows. When oversampling is applied before CV, the test and training partitions will have a similar structure, and therefore their complexity is similar - the classification is more straightforward, given that the algorithm learns from similar contexts. When oversampling is performed during CV, the test and training partitions, as previously explained, have a different structure, which hinders the classification task.

Fig. 4 shows the difference (in module) between the complexity of the training and test partitions, on average, for each approach. This is performed for all oversampling algorithms, and the differences in complexity are also related to the mean test AUCs for each algorithm. For the original (Baseline) partitions, the AUC values and differences in complexity are the same for both approaches.

From Fig. 4, it can be observed that the results are consistent with our reasoning: the difference in complexity in Approach 2 is higher than for Approach 1. In some cases, algorithms SPIDER and SPIDER2 show an antagonistic behavior to the other methods, which may be due to their process of generating new data (that differs from the remaining algorithms). In the implementation used in this work, SPIDER uses a weak amplification strategy, where the minority class examples are replicated according to the existence of majority data points marked as "safe" among their k nearest neighbors. Given a complex dataset, where there are only a few "safe" examples, the minority examples are never oversampled. For SPIDER2, we have used a strong amplification strategy with relabeling, where the neighborhood to be considered is extended to k + 2, and the class of the original majority examples marked as "noisy" is directly changed. Additionally, SPIDER and SPIDER2 are the only methods that do not guarantee an equal class distribution, i.e., it is not guaranteed that the resulting dataset, after oversampling, is balanced. These differences from the other methods could be the origin of their erratic behavior regarding both the results of the classification and complexity measures. The intrinsic characteristics of each oversampling algorithm will be further discussed in Section V-C.

We continue this section by addressing the questions raised by Blagus and Lusa [13] that were not fully answered in their experimental setup (please check Section I). Thus, we analysed the mean test AUC results for the ROS and SMOTE methods (the two oversampling methods used by Blagus and Lusa [13]), for all datasets ordered by their sample size and IR: from the simulation results, no relation was found with either one. For this reason, and due to space restrictions, this analysis is not included herein, but it is fully detailed in the extended version of the paper (https://eden.dei.uc.pt/∼pha/Long-version-CIM.pdf). In terms of complexity, we have chosen to present the F1 metric (Fig. 5). The results using other complexity measures followed the same tendency, yet F1 seems the most straightforward to understand: it measures the highest discriminative power considering all the features in the dataset – if at least one feature has a high discriminative capability (its values allow to distinguish between classes), then the classification task is "easy". Fig. 5 shows that the complexity of the classification task is what most influences the overoptimistic behavior of poorly designed CV procedures: the less complex the classification task is, the smaller is the difference between the CV setups (Approach 1 and Approach 2). Indeed, when the classification task is easier, the decision boundary is clearer and Approach 2 achieves higher classification results, so the difference between both approaches is not so discrepant.

We have further conducted a regression and clustering analysis based on all the complexity measures obtained from the training data. For the regression analysis, we obtained a regression model that could accurately predict the test AUC based solely on the complexity measures of the corresponding training partitions (R2 of 0.72), where the highest values of the coefficient of determination were obtained for SMOTE+TL (0.807), MWMOTE (0.798) and SMOTE+ENN (0.795). The clustering analysis (using k-means clustering) produced a division where the top 70 datasets with the best test AUC results are grouped; the majority are produced with MWMOTE, SMOTE+TL and SMOTE+ENN. These results are detailed in the extended version of the paper (https://eden.dei.uc.pt/∼pha/Long-version-CIM.pdf).

C. Analysis of oversampling algorithms: Approach 2

After determining the most suitable CV scheme in imbalanced scenarios (CV during oversampling - Approach 2), we focus on analyzing the most appropriate oversampling methods for imbalanced contexts. To that end, three different strategies were considered. In the first strategy, we analyze the average test AUC values including all classifiers (Strategy 1). In the second strategy, we rank the AUC values by the oversampling technique, for each classifier. Then, the average rank is computed for each oversampling technique (Strategy 2). Finally, the third strategy considers the ranking of AUC values by oversampling technique, for each classifier and dataset. Then,
[Fig. 4 - graphic omitted. One panel per complexity measure (F1, F2, F3, L3, N4, N1, N2, N3, L1, L2) showing, for the Baseline and each oversampling method, the absolute difference between the complexity of the training and test partitions under Approaches 1 and 2; bars are shaded according to each method's mean test AUC (roughly 0.848 to 0.952).]

Fig. 4. Differences (in module) between the complexity measures for all oversampling techniques, considering both Approaches 1 and 2.
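
As a rough illustration of the measures compared in Fig. 4 (and defined in Table I), the following sketch implements two of them, F1 and N3, for binary problems with numeric features. These are simplified versions for illustration only, not the reference implementations used in the paper.

```python
# Simplified sketches of two complexity measures from Table I.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def f1_measure(X, y):
    """F1: maximum Fisher's discriminant ratio over features (higher -> lower complexity)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return float(np.max(num / den))

def n3_measure(X, y):
    """N3: leave-one-out error of a 1-NN classifier (higher -> higher complexity)."""
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
    return float(1.0 - acc.mean())
```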

the average rank is computed for each oversampling technique (Strategy 3). The results of each strategy are summarised in Table IV.
Table IV shows that all the implemented techniques are



[Fig. 5 - graphic omitted. Per-dataset difference between the mean test AUC of Approach 1 and Approach 2, with datasets ordered by their original F1 value (ascending).]

Fig. 5. Differences between test AUCs of Approach 1 and Approach 2: datasets are ordered by their original F1 complexity measure (highest discriminative
power among all features). Due to space restrictions, only the datasets with highest F1 values are represented, although a complete analysis is performed in
the extended version of this paper.
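
The regression analysis mentioned in Section V-B (predicting the test AUC from the complexity of the corresponding training partitions) can be outlined as follows. The input matrices are hypothetical, and a plain linear model is used as a stand-in for whatever regressor the authors actually chose; the sketch only shows how a cross-validated R² of that kind can be obtained.

```python
# Sketch: predict test AUC from training-partition complexity measures and report R^2.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def complexity_to_auc_r2(complexity_train, test_auc):
    """complexity_train: (n_cases, n_measures) matrix, e.g. columns F1, F2, F3, L1-L3, N1-N4;
    test_auc: (n_cases,) vector of the corresponding test AUCs (hypothetical inputs)."""
    model = LinearRegression()
    return cross_val_score(model, complexity_train, test_auc, cv=5, scoring="r2").mean()
```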

TABLE IV
OVERSAMPLING METHODS (PLUS BASELINE) ORDERED BY PERFORMANCE, ACCORDING TO EACH TESTED STRATEGY.

Rank    Strategy 1                  Strategy 2                  Strategy 3
1st SMOTE+TL (0.871±0.052) SMOTE (3.000±1.265) SMOTE+TL (6.535±4.094)
2nd MWMOTE (0.871±0.053) SMOTE+TL (3.167±1.941) MWMOTE (7.199±4.332)
3rd SMOTE (0.868±0.054) MWMOTE (4.333±4.844) SMOTE+ENN (7.201±4.072)
4th SMOTE+ENN (0.867±0.054) SMOTE+ENN (5.000±2.828) SMOTE (7.222±3.460)
5th AHC (0.865±0.055) AHC (6.000±2.280) ADOMS (7.606±4.195)
6th ADOMS (0.864±0.057) ADOMS (7.000±2.000) AHC (8.088±3.817)
7th ADASYN (0.862±0.059) ADASYN (8.000±2.098) ADASYN (8.215±4.223)
8th CBO+SMOTE (0.860±0.058) SL-SMOTE (8.000±6.753) SL-SMOTE (8.411±4.075)
9th B-SMOTE1 (0.858±0.060) CBO+SMOTE (9.333±5.317) B-SMOTE1 (8.743±4.119)
10th B-SMOTE2 (0.858±0.060) ROS (9.833±5.076) B-SMOTE2 (8.743±4.119)
11th SL-SMOTE (0.857±0.061) B-SMOTE1 (10.667±1.966) CBO+SMOTE (8.745±4.475)
12th SPIDER (0.856±0.059) B-SMOTE2 (10.667±1.966) ROS (9.019±4.034)
13th ROS (0.855±0.063) SPIDER (11.000±1.673) SPIDER (9.412±4.665)
14th SPIDER2 (0.855±0.059) SPIDER2 (11.833±2.137) SPIDER2 (9.569±4.764)
15th CBO+Random (0.849±0.063) CBO+Random (13.500±2.811) CBO+Random (9.821±4.419)
16th Baseline (0.848±0.066) Baseline (13.667±3.830) Baseline (11.471±4.981)
B and SL are equivalent to Borderline and Safe-Level, respectively.

Table IV shows that all the implemented techniques are better than using the original dataset without any type of processing (Baseline). Also, all considered strategies output the same set of winners, SMOTE+TL, SMOTE+ENN, MWMOTE and SMOTE, although their ranks may vary. SMOTE+TL, followed by MWMOTE, are considered the best oversampling methods. The same is true for the worst oversampling techniques, where CBO+Random, SPIDER, SPIDER2 and ROS are found in the bottom positions. In light of these results, we herein provide a detailed discussion of the intrinsic characteristics and behavior of the different oversampling methods used. We compare each method with respect to its inner procedure, how it is able to address the datasets' complexity and improve the classification results, and its main advantages and disadvantages.

1) ROS and SMOTE: ROS is the simplest of the oversampling techniques: a random subset of minority examples is replicated until the desired balance is reached. Nevertheless, this technique is subject to overfitting due to the replication (creation of exact copies) of minority examples. The fact that ROS creates exact copies of existing examples leads to the generation of very similar partitions in Approach 1 (and consequent overoptimism), while in Approach 2, as explained
in Fig. 3, ROS is mostly subject to overfitting. This is also supported by Fig. 4, where ROS is among the best methods in Approach 1 (between 0.938 and 0.952), while for Approach 2 it provides the worst AUC results (between 0.848 and 0.857).

SMOTE smooths the problem of creating exact copies of existing minority examples by creating synthetic minority instances using the k-neighborhood of minority examples. However, the minority class is augmented without considering the structure of the data: all minority examples have the same probability of being oversampled, regardless of their neighborhood, which leads to the following issues [7], [24], [40]:
• By considering a neighborhood composed only of minority examples, the new synthetic examples may be generated in overlapping areas (problem of overgeneralization);
• Since no distinction between minority examples is performed (e.g. by evaluating their majority neighborhood), SMOTE-like methods can also augment noise regions, by oversampling noisy examples (minority examples surrounded by majority examples, which are most likely noise).
Nevertheless, it seems that the ability of SMOTE to generate larger decision boundaries is still a major strength, even with its susceptibilities. In fact, SMOTE is found among the best oversampling methods, as shown in Fig. 4, which justifies why it is a renowned oversampling method, widely used in several research areas [40], [41].
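To make the interpolation mechanism explicit, the following sketch generates SMOTE-like synthetic instances by interpolating between a randomly chosen minority seed and one of its k nearest minority neighbors. It is a simplified illustration of the general idea, not the reference SMOTE implementation; the function name and the toy minority matrix are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between minority examples."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    # k + 1 neighbors because the nearest neighbor of a point is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        seed = rng.integers(len(X_min))                 # random minority seed
        neigh = X_min[rng.choice(idx[seed][1:])]        # one of its k minority neighbors
        gap = rng.random()                              # uniform gap in [0, 1)
        synthetic[i] = X_min[seed] + gap * (neigh - X_min[seed])
    return synthetic

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8], [1.3, 1.1]])
print(smote_like(X_minority, n_new=4, k=3, rng=0))
```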
2) SMOTE+TL and SMOTE+ENN: SMOTE+TL and SMOTE+ENN combine oversampling with a cleaning procedure that alleviates SMOTE's problem of overgeneralization: they are able to remove examples that lie on overlapping regions (as detailed in Section II-A). However, since SMOTE is applied prior to the cleaning procedure, some of the same issues from SMOTE remain:
• All minority examples have the same probability of being oversampled, causing some unnecessary (“safe”) examples to be oversampled;
• Noise minority regions could be augmented and remain after the cleaning procedure: after oversampling, they may not be identified by Tomek Links or ENN as examples to remove, since their neighborhood is changed.
Nevertheless, what is true for noise regions is also true for small disjuncts. SMOTE+TL and SMOTE+ENN, by applying SMOTE as a first step, may be inflating unnecessary noise regions, but may also be inflating important, underrepresented, minority points. Overall, our results show that combining SMOTE with these cleaning methods turns out to be superior to most approaches (Table IV): SMOTE creates larger and less specific decision boundaries, which are afterwards simplified by Tomek Links and ENN through the removal of several borderline examples, while also alleviating the issue of small disjuncts. However, some caution must be taken regarding the cleaning procedure: as discussed in Table III, for some datasets, an excessive cleaning may be the cause of overfitting.
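The cleaning step can be illustrated with a small Tomek-link filter: a Tomek link is a pair of opposite-class examples that are each other's nearest neighbor. The sketch below removes both members of every link, as in the cleaning variant of SMOTE+TL; it is a didactic simplification under that assumption, not the implementation used in this work, and the toy data are invented for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y):
    """Drop every Tomek link: a pair of opposite-class points that are mutual
    nearest neighbors. Some variants keep the minority member; here, used as a
    cleaning step, both members of the link are removed."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)          # idx[i, 1] is the nearest neighbor of i
    nearest = idx[:, 1]
    to_drop = set()
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i:   # mutual nearest neighbors, opposite classes
            to_drop.update((i, j))
    keep = np.array([i for i in range(len(X)) if i not in to_drop])
    return X[keep], y[keep]

# Toy data: the overlapping pair (indices 2 and 3) forms a Tomek link and is removed.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0], [2.0, 2.0], [2.1, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
X_clean, y_clean = remove_tomek_links(X, y)
print(X_clean, y_clean)
```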
3) CBO+Random and CBO+SMOTE: CBO was first conceived as a way of handling both the between-class imbalance and the within-class imbalance (small disjuncts). CBO is able to attend to the structure of the data by performing clustering on both classes individually (both minority and majority examples are oversampled). Nevertheless, CBO needs a procedure for the generation of new examples, and each has its hitches:
• CBO+Random is more prone to overfitting: since random oversampling is performed within clusters, the probability that similar instances are oversampled more often is even greater than for ROS alone, as discussed in Fig. 3 and Table III;
• CBO+SMOTE eases the problem of overgeneralization given that SMOTE is performed within clusters; however, it no longer takes advantage of SMOTE's ability to create larger decision regions, which explains why its performance is considerably lower than SMOTE's (Table IV): applying SMOTE within clusters increases the probability that similar instances are generated, which can also result in overfitting, as discussed in Table III.
Finally, for both techniques, the definition of the most appropriate number of clusters is a problem. In this work, to find the optimal number of clusters k for each class, we have used three evaluation criteria: Calinski-Harabasz [42], Davies-Bouldin [43] and Silhouette [44], and a range of k = 2, ..., 20. Our CBO implementation evaluates each criterion five times and extracts the mode of these five runs. Finally, after determining the optimal k according to each criterion, the mode is computed again to obtain the final optimal k for a given class.
4) Borderline-SMOTE and ADASYN: Defining a taxonomy of minority examples (noise, safe and danger) allows Borderline-SMOTE to operate only on the examples of interest: the synthetic minority examples will be created in a SMOTE-like fashion, along the line that joins each danger example to its k nearest minority neighbors, thus strengthening the borderline examples. Nevertheless, as Borderline-SMOTE uses the same procedure as SMOTE to oversample minority examples, it may suffer from the same issues mentioned above. Additionally, another problem with the Borderline-SMOTE technique lies in the way danger/borderline examples are identified (see Section II-A): in some contexts, the k > m′ ≥ k/2 criterion may fail, and in those cases there is no oversampling in important regions near the decision boundary, which will prejudice the classification task [24], as discussed in Table III. We assume that this issue may affect some of the datasets in our study, since, although Borderline-SMOTE aims to provide a clearer decision boundary, it does not figure among the best approaches (Table IV).

ADASYN considers the majority neighborhood of the minority examples to guide the oversampling procedure: the minority examples are assigned different weights according to the number of majority examples in their neighborhood. Adaptively assigning weights to the minority examples is a way to smooth the above-mentioned issues of Borderline-SMOTE; however, the definition of parameters for weight assignment may be inappropriate to correctly distinguish the importance of minority examples for classification. As mentioned in Section II-A, the weight of each minority example is proportional to the number of majority examples in its k-neighborhood, which causes two main issues [24]:
• ADASYN may oversample unnecessary noisy examples: noisy examples are typically surrounded by the majority
class, and therefore their weight will be high;
• ADASYN may fail to oversample important minority examples close to the decision boundary, which is the most important concept to learn, if all their k nearest neighbors are from the minority class.
Considering different weights for different minority examples is a way of defining the structure of the minority data (although ignoring the structure of the majority data). If, additionally, the criterion used to define those weights fails for some datasets, ADASYN loses its main advantage. This is consistent with the results provided in Table IV, where ADASYN is found in the 7th position, slightly above the middle of the table, although far from the top winners.
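The weighting scheme just described can be sketched in a few lines: each minority example is weighted by the fraction of majority examples among its k nearest neighbors, and the normalised weights would then dictate how many synthetic samples each seed receives. This is a simplified illustration of the idea, not the reference ADASYN code; the function name and the toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label, k=5):
    """Weight each minority example by the fraction of majority examples
    among its k nearest neighbors (computed over the whole dataset)."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # Skip the first neighbor of each query point (the point itself).
    majority_frac = np.array([
        np.mean(y[neigh[1:]] != minority_label) for neigh in idx
    ])
    if majority_frac.sum() == 0:
        return np.full(len(X_min), 1.0 / len(X_min))   # fall back to uniform weights
    return majority_frac / majority_frac.sum()

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5.0, 5.0], [5.2, 5.1], [0.5, 0.5]])
y = np.array([1, 1, 1, 1, 0, 0, 0])   # label 0 is the minority class
print(adasyn_weights(X, y, minority_label=0, k=3))
# The minority point buried in the majority cluster receives the largest weight.
```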
5) Safe-Level-SMOTE and ADOMS: Safe-Level-SMOTE also considers a weighted scheme to oversample the minority examples in safe regions. The weight assignment is more sophisticated than ADASYN's, since, rather than looking only at the majority neighborhood of each minority example, Safe-Level-SMOTE also considers the structure of the minority data points: the weights defined by slratio allow Safe-Level-SMOTE to place new instances near those considered “safer”, easing the problem of small disjuncts while avoiding the augmentation of noise regions. However, for specific scenarios, Safe-Level-SMOTE may generate inconsistent examples [7]: if a minority example is an outlier, inside a well-defined majority cluster, then its slratio will be 0, causing the gap for SMOTE synthetization to be 1, thus creating a new minority instance in the exact location of a majority point. This may explain its susceptibility to overfitting (Table III) and its poor performance (Table IV).

Rather than placing synthetic examples along the line between a minority example and one of its k minority neighbors (as SMOTE does), ADOMS considers the local minority distribution around the example to oversample, through the computation of the first principal component of the defined k-neighborhood (Section II-A). Therefore, ADOMS takes advantage of SMOTE's ability to define larger decision regions, while considering the local structure. However, ADOMS seems to fall behind SMOTE in the three considered strategies from Table IV: we hypothesize that some of the instances placed by ADOMS create more overlapping than SMOTE's, since SMOTE generates instances along a line joining two minority instances, whereas ADOMS may place its instances in sparser projections [21]. Since, as in SMOTE, the distribution of majority examples is not considered, the generation procedure might not be appropriate for all scenarios.
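A rough sketch of this generation mechanism is given below: the synthetic instance is placed along the first principal component of the seed's local minority neighborhood, at a random distance bounded by the distance to a randomly selected neighbor. Several details (for example, how the original method orients and scales the step) are simplified assumptions here rather than the exact ADOMS formulation, and the toy data are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def adoms_like(X_min, n_new, k=5, rng=None):
    """Place each synthetic point along the first principal component of the
    seed's local minority neighborhood instead of on the seed-neighbor line."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        seed = rng.integers(len(X_min))
        local = X_min[idx[seed]]                       # the seed plus its k neighbors
        axis = PCA(n_components=1).fit(local).components_[0]
        neigh = X_min[rng.choice(idx[seed][1:])]       # a random neighbor sets the scale
        dist = np.linalg.norm(neigh - X_min[seed])
        # Orient the axis towards the chosen neighbor, then step a random fraction.
        if np.dot(axis, neigh - X_min[seed]) < 0:
            axis = -axis
        out[i] = X_min[seed] + rng.random() * dist * axis
    return out

X_minority = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8], [1.4, 1.3], [1.1, 0.9], [0.8, 1.0]])
print(adoms_like(X_minority, n_new=3, k=3, rng=1))
```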
6) SPIDER and SPIDER2: SPIDER combines the local oversampling of noisy, difficult, minority examples with a cleaning procedure that removes (or relabels) noisy majority examples. The original SPIDER algorithm processed both minority and majority examples at the same time, sometimes severely modifying the majority class. To address this issue, a new version was proposed, SPIDER2, which alleviates the degradation of the minority class by processing minority and majority examples separately. The major issues of these methods are as follows:
• The process that leads to the amplification of minority examples does not distinguish between borderline and noisy examples (if they are “not-safe”, they are all considered “noise”). Therefore, these “unsafe” minority examples are all given the same importance for classification: SPIDER/SPIDER2 may be oversampling difficult examples so that they are not misclassified but, at the same time, augmenting undesired noise regions;
• Both methods perform replication of examples rather than synthetization, which adds no new information;
• When “relabeling” is chosen, SPIDER/SPIDER2 perform an oversampling procedure similar to SMOTE, except that instead of generating new instances in the neighborhood of minority examples, they relabel their majority neighbors: however, relabeling examples might not be an appropriate approach in some domains.
Although SPIDER and SPIDER2 aim to define a taxonomy of minority examples, they do not distinguish between two important minority concepts, “borderline” and “noisy”, addressing them as equals. Also, the fact that these methods consider the replication of existing examples rather than the synthetization of new ones is surely responsible for their lower positions in Table IV, along with similar methods with the same inner procedure (CBO+Random and ROS). Finally, they are the only methods for which it is not possible to establish the amount of oversampling, which, in this work, has been established to accomplish a perfect balance in the training sets (50%-50% distribution). Since the remaining methods were optimised to achieve perfect balance, it was expected that SPIDER/SPIDER2 might provide somewhat erratic results, as discussed in Section V-B.

7) AHC and MWMOTE: Through clustering, AHC is able to consider the structure of both minority and majority classes, which is a great advantage over most oversampling algorithms that focus mostly on local properties rather than the whole data structure. Also, specifying the number of clusters is not an issue, since all levels of the resulting dendrograms are considered. However, this is also the origin of its major disadvantage: the process becomes very computationally expensive. AHC's ability to take into account the structure of the data seems to be one of the reasons why it figures among the best approaches (Table IV), which is also confirmed in similar approaches (e.g. ADASYN, MWMOTE).

As discussed in Section II-A, MWMOTE is the most complete method, and its inner procedure is able to surpass most issues explained above. MWMOTE aims to provide i) an improved way of selecting the minority examples for oversampling, by being more meticulous in the way the importance of minority examples for classification is defined, and ii) an improved way of generating new synthetic examples, avoiding the issues of SMOTE-based synthesization. To that end, MWMOTE considers filtering, a weighted scheme based on a taxonomy of minority examples, and a SMOTE-like cluster-based synthesization of examples:
• MWMOTE starts by filtering the initial minority set to find the examples that are surrounded by the majority class, thus avoiding that noisy points are oversampled;
• Then, MWMOTE defines the importance of each minority example for classification, taking into account three
main factors:
(i) minority examples closer to the decision boundary should have a higher weight than those farther from it;
(ii) minority examples within sparse minority clusters should have a higher weight than those in dense minority clusters (which alleviates the problem of small disjuncts);
(iii) minority examples closer to a dense majority cluster should have a higher weight than those closer to a sparse majority cluster.
• Finally, MWMOTE reduces the issues of SMOTE-like synthesization by considering a cluster-based oversampling approach: the generation of new minority examples is performed using only minority neighbors of the same clusters.
By combining strong features of other algorithms (filtering, clustering, adaptive weighting), MWMOTE performs a more guided oversampling procedure, which considers not only the distribution of majority examples around minority examples to define their importance, but also the structure of minority and majority examples (through clustering). This behavior is what makes MWMOTE one of the top approaches, and outstanding in dealing with several difficulty factors that arise in real-world datasets [45], namely overlapping (through filtering), noisy data (through a weighting scheme) and small disjuncts (through clustering). In the extended version of this paper (https://eden.dei.uc.pt/~pha/Long-version-CIM.pdf), readers may find a comprehensive table that summarises the main characteristics of the oversampling algorithms implemented in this work. The key factors that distinguish the algorithms from each other are presented in a synthesised way and their greatest advantages and disadvantages are highlighted.
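Of these stages, the initial filtering is the simplest to illustrate. The sketch below discards minority examples whose k nearest neighbors are all majority examples, mimicking only that first step; the adaptive weights corresponding to factors (i)-(iii) above and the cluster-based generation are not reproduced here, and the function name and toy data are assumptions rather than MWMOTE's actual implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_noisy_minority(X, y, minority_label, k=5):
    """Mimic the first MWMOTE stage: discard minority examples whose k nearest
    neighbors are all from the majority class, since they are most likely noise."""
    min_idx = np.flatnonzero(y == minority_label)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[min_idx])
    keep = []
    for row, i in zip(idx, min_idx):
        neigh_labels = y[row[1:]]                      # drop the point itself
        if np.any(neigh_labels == minority_label):     # at least one minority neighbor
            keep.append(i)
    return np.array(keep)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.9, 1.1], [3.0, 3.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1])                    # label 1 is the minority class
print(filter_noisy_minority(X, y, minority_label=1, k=3))
# The minority point buried in the majority square (index 4) is filtered out.
```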
Taking into account the characteristics of the inner procedure of each method, and in light of the performance results discussed in the previous sections, it seems that the best oversampling methods are those that combine three main characteristics:
1) Cluster-based oversampling, so that the structure/distribution of both the minority and majority examples is considered: this approach seems to be superior to considering only the majority neighborhood of individual minority examples or filtering out some minority/majority examples;
2) Adaptive weighting of minority examples: defining a proper taxonomy of minority examples (borderline, safe, noise, and rare/small disjuncts) is crucial so that the importance of each example for classification is properly addressed: more important examples should be oversampled more often;
3) Cleaning procedures, to overcome some issues that arise naturally during oversampling, namely the generation of synthetic examples in overlapping areas.

VI. CONCLUSIONS AND FUTURE WORK

The goal of this work was essentially threefold:
1) To emphasize the risk of overoptimism related to the joint use of CV and oversampling, extending the work of Blagus and Lusa [13];
2) To distinguish the problem of overoptimism from the overfitting problem and study the influence of the datasets' complexity generated by oversampling algorithms on the classification task;
3) To determine the performance of the state-of-the-art oversampling strategies, in order to provide some insight on the ones that reveal the best behavior.
Attending to these sub-objectives, there are three main conclusions to be derived:
(i) Cross-validation performed after the oversampling (Approach 1) leads to overoptimistic results and makes this approach inappropriate for imbalanced domains. Approach 2 – performing oversampling on the training sets at each iteration of a cross-validation procedure – is the correct way of validating results in imbalanced scenarios. The overoptimism is not related to the data's sample size or imbalance ratio, but rather to the complexity of the prediction task, where the maximum discriminative power of all features (complexity measure F1) seems to be a good predictor of this effect;
(ii) While Overoptimism is greatly associated with inappropriate validation setups, Overfitting (significant differences in performance between training and test sets) is mostly related to the oversampling algorithm used, where algorithms that create exact replicas of existing patterns are the most prejudicial (e.g. CBO+Random). The difference in complexity between the training and test sets is lower in Approach 1 and is the rationale behind its overoptimistic behavior: the training and test sets have a similar structure, that is, they are balanced and might contain exact replicas of, or data points very similar to, the training data;
(iii) Among the implemented oversampling methods, SMOTE+TL and MWMOTE achieve the best results, with average test AUC values of 0.871 (considering all classifiers). These techniques change the overlapping areas in the data and increase their discriminative power. Overall, the best oversampling techniques possess three key characteristics: use of cleaning procedures, cluster-based synthetization of examples and adaptive weighting of minority examples.
Furthermore, we have performed a regression and clustering analysis which confirmed that the complexity produced by the oversampling algorithms is related to the classification results, in a quasi-linear way. As concluding remarks, we would like to emphasize some lessons learned which could be beneficial to new researchers in the field:
• Oversampling algorithms have distinctive inner procedures that are better suited to particular characteristics of data (e.g. CBO inflates small disjuncts, SMOTE-TL and SMOTE-ENN deal with class overlapping, Safe-Level-SMOTE and Borderline-SMOTE prioritize safe and borderline concepts in data). Therefore, analyzing data
complexity measures may provide useful insights to guide the choice of appropriate oversampling methods;
• Stratified CV is the state-of-the-art validation approach for performance evaluation and should be carefully designed in imbalanced domains (see the sketch below). Nevertheless, even a correct CV may cause partition-induced covariate shift during the learning stage [40], which can lead to loss in performance or under-estimation of results. A promising approach to surpass the issues of dataset shift is the Distribution Optimally Balanced stratified cross-validation (DOB-SCV) [46], which is worthy of investigation in future works in the field.
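This lesson can be made concrete with a few lines of scikit-learn: the oversampler is applied only to the training portion of each fold (Approach 2), never to the full dataset before splitting. The sketch below uses a hand-written random oversampler and a logistic regression purely for illustration; it is not the experimental code of this study, and the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def random_oversample(X, y, rng):
    """Replicate minority examples until both classes have the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Approach 2: oversample ONLY the training portion of the fold;
    # the test fold keeps its original, imbalanced distribution.
    X_tr, y_tr = random_oversample(X[train_idx], y[train_idx], rng)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))
print(f"mean test AUC: {np.mean(aucs):.3f}")
```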
As future work, several undersampling and other new oversampling techniques could be included in the analysis, in order to determine the complexity changes they make in the original datasets. In this context, the novel R package “imbalance” could be of interest, given that it includes the implementation of recent resampling algorithms in the literature [47]. Also, one could focus on specific sub-problems of imbalanced data (e.g. small disjuncts, overlapping, lack of data) and study their identification in multidimensional data and/or ways to surpass them using preprocessing techniques. Finally, future work could also consider an extension of this research to datasets with higher dimensionality.

REFERENCES

[1] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, June 2009.
[2] V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera, “Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics,” Expert Systems with Applications, vol. 39, no. 7, pp. 6585–6608, June 2012.
[3] N. V. Chawla, N. Japkowicz, and A. Kotcz, “Special issue on learning from imbalanced data sets,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1–6, June 2004.
[4] R. Mollineda, R. Alejo, and J. Sotoca, “The class imbalance problem in pattern classification and learning,” in II Congreso Español de Informática (CEDI 2007). ISBN, Sep. 2007, pp. 978–84.
[5] V. Ganganwar, “An overview of classification algorithms for imbalanced datasets,” International Journal of Emerging Technology and Advanced Engineering, vol. 2, no. 4, pp. 42–47, Apr. 2012.
[6] U. Bhowan, M. Johnston, M. Zhang, and X. Yao, “Evolving diverse ensembles using genetic programming for classification with unbalanced data,” IEEE Transactions on Evolutionary Computation, vol. 17, no. 3, pp. 368–386, May 2013.
[7] T. Maciejewski and J. Stefanowski, “Local neighbourhood extension of SMOTE for mining imbalanced data,” in 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE, Apr. 2011, pp. 104–111.
[8] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Systems with Applications, vol. 73, pp. 220–239, May 2017.
[9] P. Fergus, P. Cheung, A. Hussain, D. Al-Jumeily, C. Dobbins, and S. Iram, “Prediction of preterm deliveries from EHG signals using machine learning,” PloS One, vol. 8, no. 10, p. e77154, Oct. 2013.
[10] K. U. Rani, G. N. Ramadevi, and D. Lavanya, “Performance of synthetic minority oversampling technique on imbalanced breast cancer data,” in 3rd International Conference on Computing for Sustainable Global Development. IEEE, Mar. 2016, pp. 1623–1627.
[11] U. R. Acharya, V. K. Sudarshan, S. Q. Rong, Z. Tan, C. M. Lim, J. E. Koh, S. Nayak, and S. V. Bhandary, “Automated detection of premature delivery using empirical mode and wavelet packet decomposition techniques with uterine electromyogram signals,” Computers in Biology and Medicine, vol. 85, pp. 33–42, May 2017.
[12] K. Oppedal, K. Engan, T. Eftestol, M. Beyer, and D. Aarsland, “Classifying Alzheimer's disease, Lewy body dementia, and normal controls using 3D texture analysis in magnetic resonance images,” Biomedical Signal Processing and Control, vol. 33, pp. 19–29, Mar. 2017.
[13] R. Blagus and L. Lusa, “Joint use of over-and under-sampling techniques and cross-validation for the development and assessment of prediction models,” BMC Bioinformatics, vol. 16, no. 1, pp. 1–10, Nov. 2015.
[14] G. E. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, June 2004.
[15] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, June 2002.
[16] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in IEEE International Joint Conference on Neural Networks. IEEE, June 2008, pp. 1322–1328.
[17] H. Han, W. Wang, and B. Mao, “Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,” Advances in Intelligent Computing, pp. 878–887, Aug. 2005.
[18] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, “Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem,” Advances in Knowledge Discovery and Data Mining, pp. 475–482, Apr. 2009.
[19] I. Tomek, “Two modifications of CNN,” IEEE Trans. Systems, Man and Cybernetics, vol. 6, pp. 769–772, Nov. 1976.
[20] D. L. Wilson, “Asymptotic properties of nearest neighbour rules using edited data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, no. 3, pp. 408–421, July 1972.
[21] S. Tang and S. Chen, “The generation mechanism of synthetic minority class examples,” in IEEE International Conference on Information Technology and Applications in Biomedicine. IEEE, May 2008, pp. 444–447.
[22] T. Jo and N. Japkowicz, “Class imbalances versus small disjuncts,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 40–49, June 2004.
[23] G. Cohen, M. Hilario, H. Sax, S. Hugonnet, and A. Geissbuhler, “Learning from imbalanced data in surveillance of nosocomial infection,” Artificial Intelligence in Medicine, vol. 37, no. 1, pp. 7–18, May 2006.
[24] S. Barua, M. M. Islam, X. Yao, and K. Murase, “MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 405–425, Nov. 2014.
[25] J. Stefanowski and S. Wilk, “Selective pre-processing of imbalanced data for improving classification performance,” Lecture Notes in Computer Science, vol. 5182, pp. 283–292, Sep. 2008.
[26] K. Napierała, J. Stefanowski, and S. Wilk, “Learning from imbalanced data in presence of noisy and borderline examples,” in Rough Sets and Current Trends in Computing. Springer, June 2010, pp. 158–167.
[27] T. K. Ho and M. Basu, “Complexity measures of supervised classification problems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 289–300, Mar. 2002.
[28] P. H. Abreu, M. S. Santos, M. H. Abreu, B. Andrade, and D. C. Silva, “Predicting breast cancer recurrence using machine learning techniques: A systematic review,” ACM Computing Surveys (CSUR), vol. 49, no. 3, pp. 1–40, Dec. 2016.
[29] T. K. Ho, “Geometrical complexity of classification problems,” Proceedings of the 7th Course on Ensemble Methods for Learning Machines at the International School on Neural Nets “E.R. Caianiello”, pp. 1–15, Feb. 2004.
[30] A. Orriols-Puig, N. Macia, and T. K. Ho, “Documentation for the data complexity library in C++,” Universitat Ramon Llull, La Salle, vol. 196, pp. 1–40, Dec. 2010.
[31] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, Apr. 1982.
[32] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and M. García-Borroto, “Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases,” Neurocomputing, vol. 175, pp. 935–947, Jan. 2016.
[33] R. Alejo, J. Monroy-de Jesús, J. H. Pacheco-Sánchez, E. López-González, and J. A. Antonio-Velázquez, “A selective dynamic sampling back-propagation approach for handling the two-class imbalance problem,” Applied Sciences, vol. 6, no. 7, pp. 1–17, July 2016.
[34] W. A. Rivera and P. Xanthopoulos, “A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets,” Expert Systems with Applications, vol. 66, pp. 124–135, Dec. 2016.
[35] J. A. Sáez, B. Krawczyk, and M. Woźniak, “Analyzing the oversampling
of different classes and types of examples in multi-class imbalanced
datasets,” Pattern Recognition, vol. 57, pp. 164–178, Sep. 2016.
[36] G. Douzas and F. Bacao, “Self-organizing map oversampling (SOMO)
for imbalanced data set learning,” Expert Systems with Applications,
vol. 82, pp. 40–52, Oct. 2017.
[37] S. Shilaskar, A. Ghatol, and P. Chatur, “Medical decision support system
for extremely imbalanced datasets,” Information Sciences, vol. 384, pp.
205–219, Apr. 2017.
[38] J. Liu, Y. Li, and E. Zio, “A SVM framework for fault detection of the
braking system in a high speed train,” Mechanical Systems and Signal
Processing, vol. 87, pp. 401–409, Mar. 2017.
[39] J. Luengo, A. Fernández, S. Garcı́a, and F. Herrera, “Addressing
data complexity for imbalanced data sets: Analysis of SMOTE-based
oversampling and evolutionary undersampling,” Soft Computing, vol. 15,
no. 10, pp. 1909–1936, Oct. 2011.
[40] V. López, A. Fernández, S. Garcı́a, V. Palade, and F. Herrera, “An insight
into classification with imbalanced data: Empirical results and current
trends on using data intrinsic characteristics,” Information Sciences, vol.
250, pp. 113–141, Nov. 2013.
[41] M. S. Santos, P. H. Abreu, P. J. Garcı́a-Laencina, A. Simão, and
A. Carvalho, “A new cluster-based oversampling method for improving
survival prediction of hepatocellular carcinoma patients,” Journal of
Biomedical Informatics, vol. 58, pp. 49–59, Dec. 2015.
[42] T. Caliński and J. Harabasz, “A dendrite method for cluster analysis,”
Communications in Statistics-theory and Methods, vol. 3, no. 1, pp. 1–
27, June 1974.
[43] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp.
224–227, Apr. 1979.
[44] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Intro-
duction to Cluster Analysis. John Wiley & Sons, 2009, vol. 344.
[45] J. Stefanowski, Dealing with Data Difficulty Factors While Learning
from Imbalanced Data. Springer International Publishing, June 2016,
pp. 333–363.
[46] J. G. Moreno-Torres, J. A. Sáez, and F. Herrera, “Study on the impact
of partition-induced dataset shift on k-fold cross-validation,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 23, no. 8,
pp. 1304–1312, June 2012.
[47] I. Cordón, S. Garcı́a, A. Fernández, and F. Herrera, Imbalance:
Preprocessing Algorithms for Imbalanced Datasets, Feb. 2018, R
package version 1.0.0. [Online]. Available: https://cran.r-project.org/
web/packages/imbalance
