Learning From Imbalanced Data
Abstract—With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as
surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and
analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques
have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning
problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning
problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class
distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new
understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge
representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data.
Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment
metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future
research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for
learning from imbalanced data.
Index Terms—Imbalanced learning, classification, sampling methods, cost-sensitive learning, kernel-based learning, active learning,
assessment metrics.
1 INTRODUCTION
generalize inductive rules over the sample space when presented with this form of imbalance. In this case, the combination of small sample size and high dimensionality hinders learning because of the difficulty involved in forming conjunctions over the high degree of features with limited samples. If the sample space is sufficiently large, a set of general (albeit complex) inductive rules can be defined for the dataspace. However, when samples are limited, the rules formed can become too specific, leading to overfitting. Learning from such data sets is a relatively new research topic that requires much-needed attention in the community. As a result, we will touch upon this topic again later in our discussions.

3 THE STATE-OF-THE-ART SOLUTIONS FOR IMBALANCED LEARNING

The topics discussed in Section 2 provide the foundation for most of the current research activities on imbalanced learning. In particular, the immense hindering effects that these problems have on standard learning algorithms are the focus of most of the existing solutions. When standard learning algorithms are applied to imbalanced data, the induction rules that describe the minority concepts are often fewer and weaker than those of majority concepts, since the minority class is often both outnumbered and underrepresented. To provide a concrete understanding of the direct effects of the imbalanced learning problem on standard learning algorithms, we observe a case study of the popular decision tree learning algorithm.

In this case, imbalanced data sets exploit inadequacies in the splitting criterion at each node of the decision tree [23], [24], [33]. Decision trees use a recursive, top-down greedy search algorithm that uses a feature selection scheme (e.g., information gain) to select the best feature as the split criterion at each node of the tree; a successor (leaf) is then created for each of the possible values corresponding to the split feature [26], [34]. As a result, the training set is successively partitioned into smaller subsets that are ultimately used to form disjoint rules pertaining to class concepts. These rules are finally combined so that the final hypothesis minimizes the total error rate across each class. The problem with this procedure in the presence of imbalanced data is twofold. First, successive partitioning of the dataspace results in fewer and fewer observations of minority class examples, resulting in fewer leaves describing minority concepts and successively weaker confidence estimates. Second, concepts that have dependencies on different feature space conjunctions can go unlearned by the sparseness introduced through partitioning. Here, the first issue correlates with the problems of relative and absolute imbalances, while the second issue best correlates with the between-class imbalance and the problem of high dimensionality. In both cases, the effects of imbalanced data on decision tree classification performance are detrimental. In the following sections, we evaluate the solutions proposed to overcome the effects of imbalanced data.

For clear presentation, we establish here some of the notations used in this section. Considering a given training data set $S$ with $m$ examples (i.e., $|S| = m$), we define $S = \{(x_i, y_i)\}$, $i = 1, \ldots, m$, where $x_i \in X$ is an instance in the $n$-dimensional feature space $X = \{f_1, f_2, \ldots, f_n\}$, and $y_i \in Y = \{1, \ldots, C\}$ is a class identity label associated with instance $x_i$. In particular, $C = 2$ represents the two-class classification problem. Furthermore, we define subsets $S_{min} \subset S$ and $S_{maj} \subset S$, where $S_{min}$ is the set of minority class examples in $S$ and $S_{maj}$ is the set of majority class examples in $S$, so that $S_{min} \cap S_{maj} = \emptyset$ and $S_{min} \cup S_{maj} = S$. Lastly, any sets generated from sampling procedures on $S$ are labeled $E$, with disjoint subsets $E_{min}$ and $E_{maj}$ representing the minority and majority samples of $E$, respectively, whenever they apply.

3.1 Sampling Methods for Imbalanced Learning

Typically, the use of sampling methods in imbalanced learning applications consists of the modification of an imbalanced data set by some mechanism in order to provide a balanced distribution. Studies have shown that for several base classifiers, a balanced data set provides improved overall classification performance compared to an imbalanced data set [35], [36], [37]. These results justify the use of sampling methods for imbalanced learning. However, they do not imply that classifiers cannot learn from imbalanced data sets; on the contrary, studies have also shown that classifiers induced from certain imbalanced data sets are comparable to classifiers induced from the same data set balanced by sampling techniques [22], [23]. This phenomenon has been directly linked to the problem of rare cases and its corresponding consequences, as described in Section 2. Nevertheless, for most imbalanced data sets, the application of sampling techniques does indeed aid in improved classifier accuracy.

3.1.1 Random Oversampling and Undersampling

The mechanics of random oversampling follow naturally from its description by adding a set $E$ sampled from the minority class: for a set of randomly selected minority examples in $S_{min}$, augment the original set $S$ by replicating the selected examples and adding them to $S$. In this way, the number of total examples in $S_{min}$ is increased by $|E|$ and the class distribution balance of $S$ is adjusted accordingly. This provides a mechanism for varying the degree of class distribution balance to any desired level. The oversampling method is simple to both understand and visualize, so we refrain from providing any specific examples of its functionality.

While oversampling appends data to the original data set, random undersampling removes data from the original data set. In particular, we randomly select a set of majority class examples in $S_{maj}$ and remove these samples from $S$ so that $|S| = |S_{min}| + |S_{maj}| - |E|$. Consequently, undersampling readily gives us a simple method for adjusting the balance of the original data set $S$.
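For illustration, a minimal Python sketch of the two procedures is given below. This is our own illustrative code, not part of the original formulation: the function names and the 1:1 target balance are assumptions, since the desired balance level is a free parameter.

    import numpy as np

    def random_oversample(X, y, minority_label, seed=0):
        # Replicate randomly selected minority examples (the set E) until both classes match.
        rng = np.random.default_rng(seed)
        min_idx = np.where(y == minority_label)[0]
        maj_idx = np.where(y != minority_label)[0]
        extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
        return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

    def random_undersample(X, y, minority_label, seed=0):
        # Remove randomly selected majority examples so that |S| = |S_min| + |S_maj| - |E|.
        rng = np.random.default_rng(seed)
        min_idx = np.where(y == minority_label)[0]
        maj_idx = np.where(y != minority_label)[0]
        keep = np.concatenate([min_idx, rng.choice(maj_idx, size=len(min_idx), replace=False)])
        return X[keep], y[keep]

    # toy usage: 90 majority examples (label 0) and 10 minority examples (label 1)
    X = np.random.randn(100, 2)
    y = np.array([0] * 90 + [1] * 10)
    X_over, y_over = random_oversample(X, y, minority_label=1)
    X_under, y_under = random_undersample(X, y, minority_label=1)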
At first glance, the oversampling and undersampling methods appear to be functionally equivalent since they both alter the size of the original data set and can actually provide the same proportion of balance. However, this commonality is only superficial; each method introduces its own set of problematic consequences that can potentially hinder learning [25], [38], [39]. In the case of undersampling, the problem is relatively obvious: removing examples from the majority class may cause the classifier to miss important concepts pertaining to the majority class.
Fig. 4. Data creation based on Borderline instance.

the SMOTE algorithm also has its drawbacks, including overgeneralization and variance [43]. We will further analyze these limitations in the following discussion.

3.1.4 Adaptive Synthetic Sampling

In the SMOTE algorithm, the problem of overgeneralization is largely attributed to the way in which it creates synthetic samples. Specifically, SMOTE generates the same number of synthetic data samples for each original minority example and does so without consideration of neighboring examples, which increases the occurrence of overlapping between classes [43]. To this end, various adaptive sampling methods have been proposed to overcome this limitation; some representative work includes the Borderline-SMOTE [44] and Adaptive Synthetic Sampling (ADASYN) [45] algorithms.

Of particular interest with these adaptive algorithms are the techniques used to identify minority seed samples. In the case of Borderline-SMOTE, this is achieved as follows: First, determine the set of $m$ nearest neighbors for each $x_i \in S_{min}$; call this set $S_{i:m\text{-}NN}$, $S_{i:m\text{-}NN} \subset S$. Next, for each $x_i$, identify the number of nearest neighbors that belong to the majority class, i.e., $|S_{i:m\text{-}NN} \cap S_{maj}|$. Finally, select those $x_i$ that satisfy

$$\frac{m}{2} \le |S_{i:m\text{-}NN} \cap S_{maj}| < m. \qquad (2)$$

Equation (2) suggests that only those $x_i$ that have more majority class neighbors than minority class neighbors are selected to form the set "DANGER" [44]. Therefore, the examples in DANGER represent the borderline minority class examples (the examples that are most likely to be misclassified). The DANGER set is then fed to the SMOTE algorithm to generate synthetic minority samples in the neighborhood of the borders. One should note that if $|S_{i:m\text{-}NN} \cap S_{maj}| = m$, i.e., if all of the $m$ nearest neighbors of $x_i$ are majority examples, such as instance C in Fig. 4, then this $x_i$ is considered as noise and no synthetic examples are generated for it. Fig. 4 illustrates an example of the Borderline-SMOTE procedure. Comparing Fig. 4 and Fig. 3, we see that the major difference between Borderline-SMOTE and SMOTE is that SMOTE generates synthetic instances for each minority instance, while Borderline-SMOTE only generates synthetic instances for those minority examples "closer" to the border.
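The DANGER-set selection in (2) can be sketched as follows; this is our own simplified code, not the reference implementation of [44], and the scikit-learn neighbor search is merely an implementation convenience.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def danger_set(X, y, minority_label=1, m=5):
        # Indices of borderline minority examples: m/2 <= (# majority neighbors) < m.
        nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
        _, neigh = nn.kneighbors(X[y == minority_label])
        neigh = neigh[:, 1:]                                   # drop each query point itself
        maj_counts = (y[neigh] != minority_label).sum(axis=1)
        min_idx = np.where(y == minority_label)[0]
        return min_idx[(maj_counts >= m / 2) & (maj_counts < m)]

The returned indices are the seeds that would then be handed to the SMOTE generation step.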
ADASYN, on the other hand, uses a systematic method to adaptively create different amounts of synthetic data according to their distributions [45]. This is achieved as follows: First, calculate the number of synthetic data examples that need to be generated for the entire minority class,

$$G = (|S_{maj}| - |S_{min}|) \times \beta, \qquad (3)$$

where $\beta \in [0, 1]$ specifies the desired balance level after the synthetic data generation process. Next, for each $x_i \in S_{min}$, find the $K$-nearest neighbors and calculate the ratio

$$\Gamma_i = \frac{\Delta_i / K}{Z}, \quad i = 1, \ldots, |S_{min}|, \qquad (4)$$

where $\Delta_i$ is the number of examples in the $K$-nearest neighbors of $x_i$ that belong to $S_{maj}$, and $Z$ is a normalization constant so that $\Gamma_i$ is a distribution function ($\sum_i \Gamma_i = 1$). Then, determine the number of synthetic data samples that need to be generated for each $x_i \in S_{min}$:

$$g_i = \Gamma_i \times G. \qquad (5)$$

Finally, for each $x_i \in S_{min}$, generate $g_i$ synthetic data samples according to (1). The key idea of the ADASYN algorithm is to use a density distribution as a criterion to automatically decide the number of synthetic samples that need to be generated for each minority example by adaptively changing the weights of different minority examples to compensate for the skewed distributions.
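A compact sketch of the weighting and generation steps in (4) and (5), with SMOTE-style interpolation standing in for (1), might look as follows. This is our own illustrative code under the stated definitions ($K$ neighbors, $\Gamma_i$, $G$), not the authors' implementation of [45].

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def adasyn(X, y, minority_label=1, K=5, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        X_min = X[y == minority_label]
        G = int((np.sum(y != minority_label) - len(X_min)) * beta)   # eq. (3)
        nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
        _, neigh = nn.kneighbors(X_min)
        delta = (y[neigh[:, 1:]] != minority_label).sum(axis=1)      # majority neighbors of x_i
        gamma = (delta / K) / max(delta.sum() / K, 1e-12)            # eq. (4), normalized
        g = np.rint(gamma * G).astype(int)                           # eq. (5), per-example quota
        nn_min = NearestNeighbors(n_neighbors=min(K + 1, len(X_min))).fit(X_min)
        _, neigh_min = nn_min.kneighbors(X_min)
        synthetic = []
        for i, gi in enumerate(g):
            for _ in range(gi):
                j = rng.choice(neigh_min[i, 1:])                     # random minority neighbor
                synthetic.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
        X_syn = np.array(synthetic) if synthetic else np.empty((0, X.shape[1]))
        return np.vstack([X, X_syn]), np.concatenate([y, np.full(len(X_syn), minority_label)])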
3.1.5 Sampling with Data Cleaning Techniques

Data cleaning techniques, such as Tomek links, have been effectively applied to remove the overlapping that is introduced by sampling methods. Generally speaking, Tomek links [46] can be defined as a pair of minimally distanced nearest neighbors of opposite classes. Given an instance pair $(x_i, x_j)$, where $x_i \in S_{min}$, $x_j \in S_{maj}$, and $d(x_i, x_j)$ is the distance between $x_i$ and $x_j$, the pair $(x_i, x_j)$ is called a Tomek link if there is no instance $x_k$ such that $d(x_i, x_k) < d(x_i, x_j)$ or $d(x_j, x_k) < d(x_i, x_j)$. In this way, if two instances form a Tomek link, then either one of these instances is noise or both are near a border. Therefore, one can use Tomek links to "clean up" unwanted overlapping between classes after synthetic sampling, where all Tomek links are removed until all minimally distanced nearest neighbor pairs are of the same class. By removing overlapping examples, one can establish well-defined class clusters in the training set, which can, in turn, lead to well-defined classification rules for improved classification performance. Some representative work in this area includes the OSS method [42], the condensed nearest neighbor rule and Tomek links (CNN+Tomek links) integration method [22], the neighborhood cleaning rule (NCL) [36] based on the edited nearest neighbor (ENN) rule (which removes examples that differ from two of their three nearest neighbors), and the integrations of SMOTE with ENN (SMOTE+ENN) and SMOTE with Tomek links (SMOTE+Tomek) [22].

Fig. 5 shows a typical procedure of using SMOTE and Tomek links to clean the overlapping data points. Fig. 5a shows the original data set distribution for an artificial imbalanced data set; note the inherent overlapping that exists between the minority and majority examples.
Fig. 5. (a) Original data set distribution. (b) Post-SMOTE data set. (c) The identified Tomek links. (d) The data set after removing Tomek links.
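A minimal sketch of Tomek link identification and removal is shown below. This is our own code, not a reference implementation; whether both members of a link or only the majority member is discarded varies across methods, and here both members are dropped, as in the cleaning step described above.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def tomek_links(X, y):
        # A pair (i, j) of opposite-class examples forms a Tomek link when each is the
        # other's nearest neighbor, i.e., no closer instance x_k exists for either one.
        nn = NearestNeighbors(n_neighbors=2).fit(X)
        _, neigh = nn.kneighbors(X)
        nearest = neigh[:, 1]
        return [(i, j) for i, j in enumerate(nearest)
                if y[i] != y[j] and nearest[j] == i and i < j]

    def remove_tomek_links(X, y):
        # Drop both members of every Tomek link, as in the SMOTE+Tomek cleaning step.
        drop = {idx for pair in tomek_links(X, y) for idx in pair}
        keep = np.array([i for i in range(len(y)) if i not in drop])
        return X[keep], y[keep]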
where $P(j|x)$ represents the probability of each class $j$ for a given example $x$ [49], [54].

There are many different ways of implementing cost-sensitive learning, but, in general, the majority of techniques fall under three categories. The first class of techniques applies misclassification costs to the data set as a form of dataspace weighting; these techniques are essentially cost-sensitive bootstrap sampling approaches where misclassification costs are used to select the best training distribution for induction. The second class applies cost-minimizing techniques to the combination schemes of ensemble methods; this class consists of various metatechniques where standard learning algorithms are integrated with ensemble methods to develop cost-sensitive classifiers. Both of these classes have rich theoretical foundations that justify their approaches, with cost-sensitive dataspace weighting methods building on the translation theorem [55], and cost-sensitive metatechniques building on the MetaCost framework [54]. In fact, many of the existing research works often integrate the MetaCost framework with dataspace weighting and adaptive boosting to achieve stronger classification results. To this end, we consider both of these classes of algorithms as one in the following section. The last class of techniques incorporates cost-sensitive functions or features directly into classification paradigms to essentially "fit" the cost-sensitive framework into these classifiers. Because many of these techniques are specific to a particular paradigm, there is no unifying framework for this class of cost-sensitive learning, but in many cases, solutions that work for one paradigm can often be abstracted to work for others. As such, in our discussion of these types of techniques, we consider a few methods on a case-specific basis.

3.2.2 Cost-Sensitive Dataspace Weighting with Adaptive Boosting

Motivated by the pioneering work of the AdaBoost algorithms [56], [57], several cost-sensitive boosting methods for imbalanced learning have been proposed. Three cost-sensitive boosting methods, AdaC1, AdaC2, and AdaC3, were proposed in [58], which introduce cost items into the weight updating strategy of AdaBoost. The key idea of the AdaBoost.M1 method is to iteratively update the distribution function over the training data. In this way, on each iteration $t = 1, \ldots, T$, where $T$ is a preset number of total iterations, the distribution function $D_t$ is updated sequentially and used to train a new hypothesis:

$$D_{t+1}(i) = D_t(i)\exp(-\alpha_t h_t(x_i) y_i)/Z_t, \qquad (6)$$

where $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)$ is the weight updating parameter, $h_t(x_i)$ is the prediction output of hypothesis $h_t$ on the instance $x_i$, $\varepsilon_t$ is the error of hypothesis $h_t$ over the training data, $\varepsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i)$, and $Z_t$ is a normalization factor so that $D_{t+1}$ is a distribution function, i.e., $\sum_{i=1}^{m} D_{t+1}(i) = 1$.

With this description in mind, a cost factor can be applied in three ways: inside of the exponential, outside of the exponential, and both inside and outside the exponential. Analytically, this translates to

$$D_{t+1}(i) = D_t(i)\exp(-\alpha_t C_i h_t(x_i) y_i)/Z_t, \qquad (7)$$

$$D_{t+1}(i) = C_i D_t(i)\exp(-\alpha_t h_t(x_i) y_i)/Z_t, \qquad (8)$$

$$D_{t+1}(i) = C_i D_t(i)\exp(-\alpha_t C_i h_t(x_i) y_i)/Z_t. \qquad (9)$$

Equations (7), (8), and (9) correspond to the AdaC1, AdaC2, and AdaC3 methods, respectively. Here, the cost item $C_i$ is the associated cost for each $x_i$, and $C_i$'s of higher value correspond to examples with higher misclassification costs. In essence, these algorithms increase the probability of sampling a costly example at each iteration, giving the classifier more instances of costly examples for a more targeted approach of induction. In general, it was observed that the inclusion of cost factors into the weighting scheme of AdaBoost imposes a bias toward the minority concepts and also increases the use of more relevant data samples in each hypothesis, providing for a more robust form of classification.
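To make the three update rules concrete, the following sketch (our own illustration; in [58] the computation of $\alpha_t$ itself also incorporates the cost items, a detail omitted here) applies the cost item $C_i$ inside the exponent, outside it, or in both places.

    import numpy as np

    def cost_boost_update(D, alpha, h, y, C, variant="AdaC2"):
        # D: current weights; h, y in {-1, +1}; C: per-example cost items C_i >= 0.
        margin = alpha * h * y
        if variant == "AdaC1":     # cost inside the exponential, eq. (7)
            D_new = D * np.exp(-C * margin)
        elif variant == "AdaC2":   # cost outside the exponential, eq. (8)
            D_new = C * D * np.exp(-margin)
        else:                      # AdaC3: both inside and outside, eq. (9)
            D_new = C * D * np.exp(-C * margin)
        return D_new / D_new.sum()  # Z_t normalization so D_{t+1} is a distribution

    # toy usage: costly (minority) examples retain more weight after one boosting round
    D = np.full(6, 1 / 6)
    y = np.array([1, 1, -1, -1, -1, -1])
    h = np.array([-1, 1, -1, -1, -1, -1])          # hypothesis output
    C = np.array([2.0, 2.0, 1.0, 1.0, 1.0, 1.0])   # higher cost for the minority class
    print(cost_boost_update(D, alpha=0.5, h=h, y=y, C=C))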
Another cost-sensitive boosting algorithm that follows a similar methodology is AdaCost [59]. AdaCost, like AdaC1, introduces cost sensitivity inside the exponent of the weight updating formula of AdaBoost. However, instead of applying the cost items directly, AdaCost uses a cost-adjustment function that aggressively increases the weights of costly misclassifications and conservatively decreases the weights of high-cost examples that are correctly classified. This modification becomes

$$D_{t+1}(i) = D_t(i)\exp(-\alpha_t h_t(x_i) y_i \beta_i)/Z_t, \qquad (10)$$

with the cost-adjustment function $\beta_i$ defined as $\beta_i = \beta(\mathrm{sign}(y_i, h_t(x_i)), C_i)$, where $\mathrm{sign}(y_i, h_t(x_i))$ is positive for correct classification and negative for misclassification. For clear presentation, one can use $\beta_+$ when $\mathrm{sign}(y_i, h_t(x_i)) = 1$ and $\beta_-$ when $\mathrm{sign}(y_i, h_t(x_i)) = -1$. This method also allows some flexibility in the amount of emphasis given to the importance of an example. For instance, Fan et al. [59] suggest $\beta_+ = -0.5\,C_i + 0.5$ and $\beta_- = 0.5\,C_i + 0.5$ for good results in most applications, but these coefficients can be adjusted according to specific needs.
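A literal rendering of this cost-adjustment function, together with the weight update of (10), might look as follows; this is our own sketch (labels are assumed to be in $\{-1, +1\}$, the sign convention follows the coefficients quoted above, and the surrounding boosting loop is omitted).

    import numpy as np

    def adacost_beta(correct, C):
        # beta_+ = -0.5*C_i + 0.5 for correctly classified examples,
        # beta_- =  0.5*C_i + 0.5 for misclassified examples (coefficients of Fan et al. [59]).
        return np.where(correct, -0.5 * C + 0.5, 0.5 * C + 0.5)

    def adacost_update(D, alpha, h, y, C):
        beta = adacost_beta(h == y, C)              # sign(y_i, h_t(x_i)) via label agreement
        D_new = D * np.exp(-alpha * h * y * beta)   # eq. (10)
        return D_new / D_new.sum()                  # Z_t normalization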
An empirical comparison over four imbalanced data sets of AdaC1, AdaC2, AdaC3, and AdaCost and two other similar algorithms, CSB1 and CSB2 [60], was performed in [58] using decision trees and a rule association system as the base classifiers. It was noted that in all cases, a boosted ensemble performed better than the stand-alone base classifiers using the F-measure (see Section 4.1) as the evaluation metric, and in nearly all cases, the cost-sensitive boosted ensembles performed better than plain boosting.

Though these cost-sensitive algorithms can significantly improve classification performance, they take for granted the availability of a cost matrix and its associated cost items. In many situations, an explicit description of misclassification costs is unknown, i.e., only an informal assertion is known, such as "misclassifications on the positive class are more expensive than those on the negative class" [51]. Moreover, determining a cost representation of a given domain can be particularly challenging and in some cases impossible [61]. As a result, the techniques discussed in this section are not applicable in these situations and other solutions must be established. This is the prime motivation for the cost-sensitive fitting techniques mentioned earlier. In the following sections, we provide an overview of these methods for two popular learning paradigms, namely, decision trees and neural networks.
The learning rate can also influence the weight adjustment (see (12)). As a result, cost-sensitive factors can be applied to the learning rate to change the impact that the modification procedure has on the weights, where costly examples will have a greater impact on weight changes. The key idea of this approach is to put more attention on costly examples during learning by effectively decreasing the learning rate for each corresponding costly example. This also suggests that low-cost examples will train at a faster rate than costly examples, so this method also strikes a balance in training time. Experiments on this technique have shown it to be very effective for training neural networks, with significant improvements over the base classifier [65].

The final adaptation of cost-sensitive neural networks replaces the error-minimizing function shown in (11) by an expected cost minimization function. This form of cost-sensitive fitting was shown to be the most dominant of the methods discussed in this section [65]. It is also in line with the backpropagation methodology and the theoretic foundations established on the transitivity between error-minimizing and cost-minimizing classifiers.

Though we only provide a treatment for decision trees and neural networks, many cost-sensitive fitting techniques exist for other types of learning paradigms as well. For instance, a great deal of work has focused on cost-sensitive Bayesian classifiers [66], [67], [68], [69], and some works exist which integrate cost functions with support vector machines [70], [71], [72]. Interested readers can refer to these works for a broader overview.

3.3 Kernel-Based Methods and Active Learning Methods for Imbalanced Learning

Although sampling methods and cost-sensitive learning methods seem to dominate the current research efforts in imbalanced learning, numerous other approaches have also been pursued in the community. In this section, we briefly review kernel-based learning methods and active learning methods for imbalanced learning. Since kernel-based learning methods provide state-of-the-art techniques for many of today's data engineering applications, the use of kernel-based methods to understand imbalanced learning has naturally attracted growing attention recently.

3.3.1 Kernel-Based Learning Framework

The principles of kernel-based learning are centered on the theories of statistical learning and Vapnik-Chervonenkis (VC) dimensions [73]. The representative kernel-based learning paradigm, support vector machines (SVMs), can provide relatively robust classification results when applied to imbalanced data sets [23]. SVMs facilitate learning by using specific examples near concept boundaries (support vectors) to maximize the separation margin (soft-margin maximization) between the support vectors and the hypothesized concept boundary (hyperplane), while minimizing the total classification error [73].

The effects of imbalanced data on SVMs exploit inadequacies of the soft-margin maximization paradigm [74], [75]. Since SVMs try to minimize total error, they are inherently biased toward the majority concept. In the simplest case, a two-class space is linearly separated by an "ideal" separation line in the neighborhood of the majority concept. In this case, it might occur that the support vectors representing the minority concept are "far away" from this "ideal" line and, as a result, will contribute less to the final hypothesis [74], [75], [76]. Moreover, if there is a lack of data representing the minority concept, there could be an imbalance of representative support vectors that can also degrade performance. These same characteristics are also readily evident in linearly nonseparable spaces. In this case, a kernel function is used to map the linearly nonseparable space into a higher dimensional space where separation is achievable. However, in this case, the optimal hyperplane separating the classes will be biased toward the majority class in order to minimize the high error rates of misclassifying the more prevalent majority class. In the worst case, SVMs will learn to classify all examples as pertaining to the majority class, a tactic that, if the imbalance is severe, can provide the minimal error rate across the dataspace.

3.3.2 Integration of Kernel Methods with Sampling Methods

There have been many works in the community that apply general sampling and ensemble techniques to the SVM framework. Some examples include the SMOTE with Different Costs (SDC) method [75] and the ensembles of over/undersampled SVMs [77], [78], [79], [80]. For example, the SDC algorithm uses different error costs [75] for different classes to bias the SVM in order to shift the decision boundary away from positive instances and make positive instances more densely distributed, in an attempt to guarantee a more well-defined boundary. Meanwhile, the methods proposed in [78], [79] develop ensemble systems by modifying the data distributions without modifying the underlying SVM classifier. Lastly, Wang and Japkowicz [80] proposed to modify the SVMs with asymmetric misclassification costs in order to boost performance. This idea is similar to the AdaBoost.M1 [56], [57] algorithm in that it uses an iterative procedure to effectively modify the weights of the training observations. In this way, one can build a modified version of the training data based on such sequential learning procedures to improve classification performance.
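The core idea of assigning different error costs to the two classes can be illustrated with a standard soft-margin SVM whose per-class penalty is asymmetric. The sketch below uses scikit-learn's class_weight option as a stand-in; it is not the SDC algorithm of [75] itself, which additionally applies SMOTE-style oversampling to the minority class, and the specific weights are arbitrary choices.

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # imbalanced toy problem: roughly 5% positive (minority) class
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

    # a larger penalty on minority errors pushes the boundary away from the positive class
    svm_plain = SVC(kernel="rbf", C=1.0).fit(X, y)
    svm_costed = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 10.0}).fit(X, y)

    print("plain    minority recall:", svm_plain.score(X[y == 1], y[y == 1]))
    print("weighted minority recall:", svm_costed.score(X[y == 1], y[y == 1]))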
The Granular Support Vector Machines - Repetitive Undersampling algorithm (GSVM-RU) was proposed in [81] to integrate SVM learning with undersampling methods. This method is based on granular support vector machines (GSVMs), which were developed in a series of papers according to the principles of statistical learning theory and granular computing theory [82], [83], [84]. The major characteristics of GSVMs are twofold. First, GSVMs can effectively analyze the inherent data distribution by observing the trade-offs between the local significance of a subset of data and its global correlation. Second, GSVMs improve the computational efficiency of SVMs through the use of parallel computation. In the context of imbalanced learning, the GSVM-RU method takes advantage of the GSVM by using an iterative learning procedure that uses the SVM itself for undersampling [81]. Concretely, since all minority (positive) examples are considered to be informative, a positive information granule is formed from these examples. Then, a linear SVM is developed using the positive granule and the
remaining examples in the data set (i.e., $S_{maj}$); the negative examples that are identified as support vectors by this SVM, the so-called "negative local support vectors" (NLSVs), are formed into a negative information granule and are removed from the original training data to obtain a smaller training data set. Based on this reduced training data set, a new linear SVM is developed, and again, the new set of NLSVs is formed into a negative granule and removed from the data set. This procedure is repeated multiple times to obtain multiple negative information granules. Finally, an aggregation operation that considers global correlation is used to select specific sample sets from those iteratively developed negative information granules, which are then combined with all positive samples to develop a final SVM model. In this way, the GSVM-RU method uses the SVM itself as a mechanism for undersampling to sequentially develop multiple information granules with different informative samples, which are later combined to develop a final SVM for classification.
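The iterative granule extraction just described can be sketched as follows. This is our own simplified rendering under several assumptions (a linear SVM, negatives falling inside the margin used as a proxy for the "negative local support vectors", a fixed number of granules, and a plain union of the retained granules as the aggregation step); it is not the implementation of [81].

    import numpy as np
    from sklearn.svm import LinearSVC

    def gsvm_ru(X, y, n_granules=3, C=1.0):
        # y in {0, 1}; the positive (minority) class 1 forms the positive information granule.
        pos = np.where(y == 1)[0]
        remaining_neg = np.where(y == 0)[0]
        granules = []                          # negative information granules (index sets)
        for _ in range(n_granules):
            if len(remaining_neg) == 0:
                break
            idx = np.concatenate([pos, remaining_neg])
            svm = LinearSVC(C=C, max_iter=5000).fit(X[idx], y[idx])
            margins = svm.decision_function(X[remaining_neg])
            nlsv = remaining_neg[np.abs(margins) <= 1.0]   # approximate NLSVs via the margin
            if len(nlsv) == 0:
                break
            granules.append(nlsv)
            remaining_neg = np.setdiff1d(remaining_neg, nlsv)   # remove NLSVs and retrain
        # aggregation: combine the retained granules with all positive samples
        keep = np.concatenate([pos] + granules) if granules else np.concatenate([pos, remaining_neg])
        return LinearSVC(C=C, max_iter=5000).fit(X[keep], y[keep])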
3.3.3 Kernel Modification Methods for Imbalanced Learning

In addition to the aforementioned sampling and ensemble kernel-based learning methods, another major category of kernel-based learning research efforts focuses more concretely on the mechanics of the SVM itself; this group of methods is often referred to as kernel modification methods.

One example of kernel modification is the kernel classifier construction algorithm proposed in [85] based on orthogonal forward selection (OFS) and the regularized orthogonal weighted least squares (ROWLS) estimator. This algorithm optimizes generalization in the kernel-based learning model by introducing two major components that deal with imbalanced data distributions for two-class data sets. The first component integrates the concepts of leave-one-out (LOO) cross validation and the area under curve (AUC) evaluation metric (see Section 4.2) to develop an LOO-AUC objective function as a selection mechanism for the most optimal kernel model. The second component takes advantage of the cost sensitivity of the parameter estimation cost function in the ROWLS algorithm to assign greater weight to erroneous data examples in the minority class than to those in the majority class.

Other examples of kernel modification are the various techniques for adjusting the SVM class boundary. These methods apply boundary alignment techniques to improve SVM classification [76], [86], [87]. For instance, in [76], three algorithmic approaches for adjusting boundary skews were presented: the boundary movement (BM) approach, the biased penalties (BP) approach, and the class-boundary alignment (CBA) approach. Additionally, in [86] and [87], the kernel-boundary alignment (KBA) algorithm was proposed, which is based on the idea of modifying the kernel matrix generated by a kernel function according to the imbalanced data distribution. The underlying theoretical foundation of the KBA method builds on the adaptive conformal transformation (ACT) methodology, where the conformal transformation on a kernel function is based on the consideration of the feature-space distance and the class-imbalance ratio [88]. By generalizing the foundation of ACT, the KBA method tackles the imbalanced learning problem by modifying the kernel matrix in the feature space. Theoretical analyses and empirical studies showed that this method not only provides competitive accuracy, but can also be applied to both vector data and sequence data by modifying the kernel matrix.

In a more integrated approach of kernel-based learning, Liu and Chen [89], [90] propose the total margin-based adaptive fuzzy SVM kernel method (TAF-SVM) to improve SVM robustness. The major beneficial characteristics of TAF-SVM are threefold. First, TAF-SVM can handle overfitting by "fuzzifying" the training data, where certain training examples are treated differently according to their relative importance. Second, different cost algorithms are embedded into TAF-SVM, which allows this algorithm to self-adapt to different data distribution skews. Last, the conventional soft-margin maximization paradigm is replaced by the total margin paradigm, which considers both the misclassified and correctly classified data examples in the construction of the optimal separating hyperplane.

A particularly interesting kernel modification method for imbalanced learning is the k-category proximal support vector machine (PSVM) with Newton refinement [91]. This method essentially transforms the soft-margin maximization paradigm into a simple system of k linear equations for either linear or nonlinear classifiers, where k is the number of classes. One of the major advantages of this method is that it can perform the learning procedure very fast because it requires nothing more sophisticated than solving this simple system of linear equations. Lastly, in the presence of extremely imbalanced data sets, Raskutti and Kowalczyk [74] consider both sampling and dataspace weighting compensation techniques in cases where SVMs completely ignore one of the classes. In this procedure, two balancing modes are used in order to balance the data: a similarity detector is used to learn a discriminator based predominantly on positive examples, and a novelty detector is used to learn a discriminator using primarily negative examples.

Several other kernel modification methods exist in the community, including the support cluster machines (SCMs) for large-scale imbalanced data sets [92], the kernel neural gas (KNG) algorithm for imbalanced clustering [93], the P2PKNNC algorithm based on the k-nearest neighbors classifier and the P2P communication paradigm [94], the hybrid kernel machine ensemble (HKME) algorithm, which includes a binary support vector classifier (BSVC) and a one-class support vector classifier (νSVC) with a Gaussian radial basis kernel function [95], and the AdaBoost relevance vector machine (RVM) [96], among others. Furthermore, we would like to note that for many kernel-based learning methods, there is no strict distinction between the two major categories of Sections 3.3.2 and 3.3.3. In many situations, learning methods take a hybrid approach where sampling and ensemble techniques are integrated with kernel modification methods for improved performance. For instance, [75] and [76] are good examples of hybrid solutions for imbalanced learning. In this section, we categorize kernel-based learning in two sections for better presentation and organization.
Fig. 10 and their lines in cost space. For instance, the bottom axis represents perfect classification, while the top axis represents the contrary case; these lines correspond to ROC points A and B, respectively.

With a collection of cost lines at hand, a cost curve is then created by selecting a classification line for each possible operation point. For example, a cost curve can be created that minimizes the normalized expected cost across all possible operation points. In particular, this technique allows for a clearer visual representation of classification performance compared to ROC curves, as well as more direct assessments between classifiers as they range over operation points.

4.5 Assessment Metrics for Multiclass Imbalanced Learning

While all of the assessment metrics discussed so far in this section are appropriate for two-class imbalanced learning problems, some of them can be modified to accommodate multiclass imbalanced learning problems. For instance, Fawcett [119], [120] discussed multiclass ROC graphs. For an $n$-class problem, the confusion matrix presented in Fig. 9 becomes an $n \times n$ matrix, with $n$ correct classifications (the major diagonal elements) and $n^2 - n$ errors (the off-diagonal elements). Therefore, instead of representing the trade-offs between a single benefit (TP) and cost (FP), we have to manage $n$ benefits and $n^2 - n$ costs. A straightforward way of doing this is to generate $n$ different ROC graphs, one for each class [119], [120]. For instance, considering a problem with a total of $W$ classes, ROC graph $i$, $ROC_i$, plots classification performance using class $w_i$ as the positive class and all other classes as the negative class. However, this approach compromises one of the major advantages of using ROC analysis for imbalanced learning problems: it becomes sensitive to the class skew because the negative class in this situation is the combination of $n - 1$ classes (see Sections 4.1 and 4.2).

Similarly, under the multiclass imbalanced learning scenario, the AUC values for two-class problems become multiple pairwise discriminability values [131]. To calculate such multiclass AUCs, Provost and Domingos [121] proposed a probability estimation-based approach: First, the ROC curve for each reference class $w_i$ is generated and its respective AUC is measured. Second, all of the AUCs are combined by a weight coefficient according to the reference class's prevalence in the data. Although this approach is quite simple in calculation, it is sensitive to class skews for the same reason as mentioned before. To eliminate this constraint, Hand and Till [131] proposed the M measure, a generalization approach that aggregates all pairs of classes based on the inherent characteristics of the AUC. The major advantage of this method is that it is insensitive to class distribution and error costs. Interested readers can refer to [131] for a more detailed overview of this technique.
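A minimal sketch of the prevalence-weighted combination of one-versus-all AUCs described above is given below; this is our own rendering of the approach attributed to [121], using scikit-learn's AUC routine, and the column layout of the score matrix is an assumption.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def weighted_multiclass_auc(y_true, y_score, classes):
        # y_score[:, k] holds the score for class k; each class in turn is the positive class.
        total = 0.0
        for k, c in enumerate(classes):
            prevalence = np.mean(y_true == c)                         # weight by class prevalence
            auc_k = roc_auc_score((y_true == c).astype(int), y_score[:, k])
            total += prevalence * auc_k                               # combine the one-vs-rest AUCs
        return total

Because rarer reference classes receive smaller weights, this combination inherits the class-skew sensitivity noted in the text, which is exactly what the M measure of [131] is designed to avoid.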
In addition to multiclass ROC analysis, the community has also adopted other assessment metrics for multiclass imbalanced learning problems. For instance, in cost-sensitive learning, it is natural to use misclassification costs for performance evaluation of multiclass imbalanced problems [8], [10], [11]. Also, Sun et al. [7] extend the G-mean definition (see (17)) to the geometric mean of the recall values of every class for multiclass imbalanced learning.
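That multiclass G-mean extension reduces to the geometric mean of the per-class recalls; a compact sketch (our own code, using scikit-learn's per-class recall) is:

    import numpy as np
    from sklearn.metrics import recall_score

    def multiclass_gmean(y_true, y_pred):
        # geometric mean of the recall obtained on every class
        recalls = recall_score(y_true, y_pred, average=None)
        return float(np.prod(recalls) ** (1.0 / len(recalls)))

    print(multiclass_gmean([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 2, 2]))   # (1 * 0.5 * 1)^(1/3)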
5 OPPORTUNITIES AND CHALLENGES

The availability of vast amounts of raw data in many of today's real-world applications enriches the opportunities for learning from imbalanced data to play a critical role across different domains. However, new challenges arise at the same time. Here, we briefly discuss several aspects of future research directions in this domain.

5.1 Understanding the Fundamental Problems

Currently, most of the research efforts in imbalanced learning focus on specific algorithms and/or case studies; only a very limited amount of theoretical understanding of the principles and consequences of this problem has been addressed. For example, although almost every algorithm presented in the literature claims to be able to improve classification accuracy over certain benchmarks, there exist certain situations in which learning from the original data sets may provide better performance. This raises an important question: to what extent do imbalanced learning methods help with learning capabilities? This is a fundamental and critical question in this field for the following reasons. First, suppose there are specific (existing or future) techniques or methodologies that significantly outperform others across most (or, ideally, all) applicable domains; then rigorous studies of the underlying effects of such methods would yield fundamental understandings of the problem at hand. Second, as data engineering research methodologies materialize into real-world solutions, questions such as "how will this solution help" or "can this solution efficiently handle various types of data" become the basis on which economic and administrative decisions are made. Thus, the consequences of this critical question have wide-ranging effects on the advancement of this field and data engineering at large. This important question follows directly from a previous proposition addressed by Provost in the invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets [100]:

"[In regards to imbalanced learning,] . . . isn't the best research strategy to concentrate on how machine learning algorithms can deal most effectively with whatever data they are given?"

We believe that this fundamental question should be investigated with greater intensity both theoretically and empirically in order to thoroughly understand the essence of imbalanced learning problems. More specifically, we believe that the following questions require careful and thorough investigation:

1. What kind of assumptions will make imbalanced learning algorithms work better compared to learning from the original distributions?
2. To what degree should one balance the original data set?
3. How do imbalanced data distributions affect the computational complexity of learning algorithms?
4. What is the general error bound given an imbalanced data distribution?
5. Is there a general theoretical methodology that can alleviate the impediment of learning from imbalanced data sets for specific algorithms and application domains?

Fortunately, we have noticed that these critical fundamental problems have attracted growing attention in the community. For instance, important works are presented in [37] and [24] that directly relate to the aforementioned question 2 regarding the "level of the desired degree of balance." In [37], the rate of oversampling and undersampling was discussed as a possible aid for imbalanced learning. Generally speaking, though the resampling paradigm has had successful cases in the community, tuning these algorithms effectively is a challenging task. To alleviate this challenge, Estabrooks et al. [37] suggested that a combination of different expressions of resampling methods may be an effective solution to the tuning problem. Weiss and Provost [24] have analyzed, for a fixed training set size, the relationship between the class distribution of training data (expressed as the percentage of minority class examples) and classifier performance in terms of accuracy and AUC. This work provided important suggestions regarding "how do different training data class distributions affect classification performance" and "which class distribution provides the best classifier" [24]. Based on a thorough analysis of 26 data sets, it was suggested that if accuracy is selected as the performance criterion, the best class distribution tends to be near the naturally occurring class distribution. However, if the AUC is selected as the assessment metric, then the best class distribution tends to be near the balanced class distribution. Based on these observations, a "budget-sensitive" progressive sampling strategy was proposed to efficiently sample the minority and majority class examples such that the resulting training class distribution can provide the best performance.

In summary, the understanding of all of these questions will not only provide fundamental insights into the imbalanced learning issue, but also provide an added level of comparative assessment between existing and future methodologies. It is essential for the community to investigate all of these questions in order for research developments to focus on the fundamental issues regarding imbalanced learning.

5.2 Need of a Uniform Benchmark Platform

Data resources are critical for research development in the knowledge discovery and data engineering field. Although there are currently many publicly available benchmarks for assessing the effectiveness of different data engineering algorithms/tools, such as the UCI Repository [132] and the NIST Scientific and Technical Databases [133], there are very few benchmarks, if any, that are solely dedicated to imbalanced learning problems. For instance, many of the existing benchmarks do not clearly identify imbalanced data sets and their suggested evaluation use in an organized manner. Therefore, many data sets require additional manipulation before they can be applied to imbalanced learning scenarios. This limitation can create a bottleneck for the long-term development of research in imbalanced learning in the following aspects:

1. lack of a uniform benchmark for standardized performance assessments;
2. lack of data sharing and data interoperability across different disciplinary domains;
3. increased procurement costs, such as time and labor, for the research community as a whole, since each research group is required to collect and prepare its own data sets.

With these factors in mind, we believe that a well-organized, publicly available benchmark specifically dedicated to imbalanced learning would significantly benefit the long-term research development of this field. Furthermore, as a required component, an effective mechanism to promote interoperability and communication across various disciplines should be incorporated into such a benchmark to ultimately uphold a healthy, diversified community.

5.3 Need of Standardized Evaluation Practices

As discussed in Section 4, the traditional technique of using a singular evaluation metric is not sufficient when handling imbalanced learning problems. Although most publications use a broad assortment of singular assessment metrics to evaluate the performance and potential trade-offs of their algorithms, without an accompanying curve-based analysis it becomes very difficult to provide any concrete relative evaluations between different algorithms, or to answer the more rigorous questions of functionality. Therefore, it is necessary for the community to establish, as a standard, the practice of using the curve-based evaluation techniques described in Sections 4.2, 4.3, and 4.4 in their analysis: not only because each technique provides its own set of answers to different fundamental questions, but also because an analysis in the evaluation space of one technique can be correlated to the evaluation space of another, leading to increased transitivity and a broader understanding of the functional abilities of existing and future works. We hope that a standardized set of evaluation practices for proper comparisons in the community will provide useful guides for the development and evaluation of future algorithms and tools.

5.4 Incremental Learning from Imbalanced Data Streams

Traditional static learning methods require representative data sets to be available at training time in order to develop decision boundaries. However, in many realistic application environments, such as Web mining, sensor networks, multimedia systems, and others, raw data become available continuously over an indefinite (possibly infinite) learning lifetime [134]. Therefore, new understandings, principles, methodologies, algorithms, and tools are needed for such stream data learning scenarios to efficiently transform raw data into useful information and knowledge representation to support the decision-making processes. Although the importance of stream data mining has attracted increasing attention recently, the attention given to imbalanced data streams has been rather limited. Moreover, in regards to incremental learning from imbalanced data streams, many important questions need to be addressed, such as:
1. How can we autonomously adjust the learning algorithm if an imbalance is introduced in the middle of the learning period?
2. Should we consider rebalancing the data set during the incremental learning period? If so, how can we accomplish this?
3. How can we accumulate previous experience and use this knowledge to adaptively improve learning from new data?
4. How do we handle the situation when newly introduced concepts are also imbalanced (i.e., the imbalanced concept drifting issue)?

A concrete understanding and active exploration in these areas can significantly advance the development of technology for real-world incremental learning scenarios.

5.5 Semisupervised Learning from Imbalanced Data

The semisupervised learning problem concerns itself with learning when data sets are a combination of labeled and unlabeled data, as opposed to fully supervised learning where all training data are labeled. The key idea of semisupervised learning is to exploit the unlabeled examples by using the labeled examples to modify, refine, or reprioritize the hypothesis obtained from the labeled data alone [135]. For instance, cotraining works under the assumption of two-viewed sets of feature spaces. Initially, two separate classifiers are trained with the labeled examples on two sufficient and conditionally independent feature subsets. Then, each classifier is used to predict the unlabeled data and recover their labels according to their respective confidence levels [136], [137]. Other representative works for semisupervised learning include the self-training methods [138], [139], semisupervised support vector machines [140], [141], graph-based methods [142], [143], and the Expectation-Maximization (EM) algorithm with generative mixture models [144], [145]. Although all of these methods have illustrated great success in many machine learning and data engineering applications, the issue of semisupervised learning under the condition of imbalanced data sets has received very limited attention in the community. Some important questions include:

1. How can we identify whether an unlabeled data example came from a balanced or imbalanced underlying distribution?
2. Given imbalanced training data with labels, what are the effective and efficient methods for recovering the unlabeled data examples?
3. What kind of biases may be introduced in the recovery process (through the conventional semisupervised learning techniques) given imbalanced, labeled data?

We believe that all of these questions are important not only for theoretical research development, but also for many practical application scenarios.

6 CONCLUSIONS

In this paper, we discussed a challenging and critical problem in the knowledge discovery and data engineering field, the imbalanced learning problem. We hope that our discussions of the fundamental nature of the imbalanced learning problem, the state-of-the-art solutions used to address this problem, and the several major assessment techniques used to evaluate this problem will serve as a comprehensive resource for existing and future knowledge discovery and data engineering researchers and practitioners. Additionally, we hope that our insights on the many opportunities and challenges available in this relatively new research area will help guide the potential research directions for the future development of this field.

REFERENCES

[1] "Learning from Imbalanced Data Sets," Proc. Am. Assoc. for Artificial Intelligence (AAAI) Workshop, N. Japkowicz, ed., 2000 (Technical Report WS-00-05).
[2] "Workshop Learning from Imbalanced Data Sets II," Proc. Int'l Conf. Machine Learning, N.V. Chawla, N. Japkowicz, and A. Kolcz, eds., 2003.
[3] N.V. Chawla, N. Japkowicz, and A. Kolcz, "Editorial: Special Issue on Learning from Imbalanced Data Sets," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 1-6, 2004.
[4] H. He and X. Shen, "A Ranked Subspace Learning Method for Gene Expression Data Classification," Proc. Int'l Conf. Artificial Intelligence, pp. 358-364, 2007.
[5] M. Kubat, R.C. Holte, and S. Matwin, "Machine Learning for the Detection of Oil Spills in Satellite Radar Images," Machine Learning, vol. 30, nos. 2/3, pp. 195-215, 1998.
[6] R. Pearson, G. Goney, and J. Shwaber, "Imbalanced Clustering for Microarray Time-Series," Proc. Int'l Conf. Machine Learning, Workshop Learning from Imbalanced Data Sets II, 2003.
[7] Y. Sun, M.S. Kamel, and Y. Wang, "Boosting for Learning Multiple Classes with Imbalanced Class Distribution," Proc. Int'l Conf. Data Mining, pp. 592-602, 2006.
[8] N. Abe, B. Zadrozny, and J. Langford, "An Iterative Method for Multi-Class Cost-Sensitive Learning," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 3-11, 2004.
[9] K. Chen, B.L. Lu, and J. Kwok, "Efficient Classification of Multi-Label and Imbalanced Data Using Min-Max Modular Classifiers," Proc. World Congress on Computational Intelligence - Int'l Joint Conf. Neural Networks, pp. 1770-1775, 2006.
[10] Z.H. Zhou and X.Y. Liu, "On Multi-Class Cost-Sensitive Learning," Proc. Nat'l Conf. Artificial Intelligence, pp. 567-572, 2006.
[11] X.Y. Liu and Z.H. Zhou, "Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 1, pp. 63-77, Jan. 2006.
[12] C. Tan, D. Gilbert, and Y. Deville, "Multi-Class Protein Fold Classification Using a New Ensemble Machine Learning Approach," Genome Informatics, vol. 14, pp. 206-217, 2003.
[13] N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer, "SMOTE: Synthetic Minority Over-Sampling Technique," J. Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[14] H. Guo and H.L. Viktor, "Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 30-39, 2004.
[15] K. Woods, C. Doss, K. Bowyer, J. Solka, C. Priebe, and W. Kegelmeyer, "Comparative Evaluation of Pattern Recognition Techniques for Detection of Microcalcifications in Mammography," Int'l J. Pattern Recognition and Artificial Intelligence, vol. 7, no. 6, pp. 1417-1436, 1993.
[16] R.B. Rao, S. Krishnan, and R.S. Niculescu, "Data Mining for Improved Cardiac Care," ACM SIGKDD Explorations Newsletter, vol. 8, no. 1, pp. 3-10, 2006.
[17] P.K. Chan, W. Fan, A.L. Prodromidis, and S.J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection," IEEE Intelligent Systems, vol. 14, no. 6, pp. 67-74, Nov./Dec. 1999.
[18] P. Clifton, A. Damminda, and L. Vincent, "Minority Report in Fraud Detection: Classification of Skewed Data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 50-59, 2004.
[19] P. Chan and S. Stolfo, "Toward Scalable Learning with Non-Uniform Class and Cost Distributions," Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 164-168, 1998.
[20] G.M. Weiss, "Mining with Rarity: A Unifying Framework," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 7-19, 2004.
[21] G.M. Weiss, "Mining Rare Cases," Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, pp. 765-776, Springer, 2005.
[22] G.E.A.P.A. Batista, R.C. Prati, and M.C. Monard, "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20-29, 2004.
[23] N. Japkowicz and S. Stephen, "The Class Imbalance Problem: A Systematic Study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429-449, 2002.
[24] G.M. Weiss and F. Provost, "Learning When Training Data Are Costly: The Effect of Class Distribution on Tree Induction," J. Artificial Intelligence Research, vol. 19, pp. 315-354, 2003.
[25] R.C. Holte, L. Acker, and B.W. Porter, "Concept Learning and the Problem of Small Disjuncts," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 813-818, 1989.
[26] J.R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[27] T. Jo and N. Japkowicz, "Class Imbalances versus Small Disjuncts," ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 40-49, 2004.
[28] N. Japkowicz, "Class Imbalances: Are We Focusing on the Right Issue?" Proc. Int'l Conf. Machine Learning, Workshop Learning from Imbalanced Data Sets II, 2003.
[29] R.C. Prati, G.E.A.P.A. Batista, and M.C. Monard, "Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior," Proc. Mexican Int'l Conf. Artificial Intelligence, pp. 312-321, 2004.
[30] S.J. Raudys and A.K. Jain, "Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, Mar. 1991.
[31] R. Caruana, "Learning from Imbalanced Data: Rank Metrics and Extra Tasks," Proc. Am. Assoc. for Artificial Intelligence (AAAI) Conf., pp. 51-57, 2000 (AAAI Technical Report WS-00-05).
[32] W.H. Yang, D.Q. Dai, and H. Yan, "Feature Extraction Uncorrelated Discriminant Analysis for High-Dimensional Data," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 5, pp. 601-614, May 2008.
[33] N.V. Chawla, "C4.5 and Imbalanced Data Sets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure," Proc. Int'l Conf. Machine Learning, Workshop Learning from Imbalanced Data Sets II, 2003.
[34] T.M. Mitchell, Machine Learning. McGraw Hill, 1997.
[35] G.M. Weiss and F. Provost, "The Effect of Class Distribution on Classifier Learning: An Empirical Study," Technical Report ML-TR-43, Dept. of Computer Science, Rutgers Univ., 2001.
[36] J. Laurikkala, "Improving Identification of Difficult Small Classes by Balancing Class Distribution," Proc. Conf. AI in Medicine in Europe: Artificial Intelligence Medicine, pp. 63-66, 2001.
[37] A. Estabrooks, T. Jo, and N. Japkowicz, "A Multiple Resampling Method for Learning from Imbalanced Data Sets," Computational Intelligence, vol. 20, pp. 18-36, 2004.
[38] D. Mease, A.J. Wyner, and A. Buja, "Boosted Classification Trees and Class Probability/Quantile Estimation," J. Machine Learning Research, vol. 8, pp. 409-439, 2007.
[39] C. Drummond and R.C. Holte, "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling," Proc. Int'l Conf. Machine Learning, Workshop Learning from Imbalanced Data Sets II, 2003.
[40] X.Y. Liu, J. Wu, and Z.H. Zhou, "Exploratory Under-Sampling for Class Imbalance Learning," Proc. Int'l Conf. Data Mining, pp. 965-969, 2006.
[41] J. Zhang and I. Mani, "KNN Approach to Unbalanced Data
[45] H. He, Y. Bai, E.A. Garcia, and S. Li, "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning," Proc. Int'l Joint Conf. Neural Networks, pp. 1322-1328, 2008.
[46] I. Tomek, "Two Modifications of CNN," IEEE Trans. Systems, Man, and Cybernetics, vol. 6, no. 11, pp. 769-772, Nov. 1976.
[47] N.V. Chawla, A. Lazarevic, L.O. Hall, and K.W. Bowyer, "SMOTEBoost: Improving Prediction of the Minority Class in Boosting," Proc. Seventh European Conf. Principles and Practice of Knowledge Discovery in Databases, pp. 107-119, 2003.
[48] H. Guo and H.L. Viktor, "Boosting with Data Generation: Improving the Classification of Hard to Learn Examples," Proc. Int'l Conf. Innovations in Applied Artificial Intelligence, pp. 1082-1091, 2004.
[49] C. Elkan, "The Foundations of Cost-Sensitive Learning," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 973-978, 2001.
[50] K.M. Ting, "An Instance-Weighting Method to Induce Cost-Sensitive Trees," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 3, pp. 659-665, May/June 2002.
[51] M.A. Maloof, "Learning When Data Sets Are Imbalanced and When Costs Are Unequal and Unknown," Proc. Int'l Conf. Machine Learning, Workshop Learning from Imbalanced Data Sets II, 2003.
[52] K. McCarthy, B. Zabar, and G.M. Weiss, "Does Cost-Sensitive Learning Beat Sampling for Classifying Rare Classes?" Proc. Int'l Workshop Utility-Based Data Mining, pp. 69-77, 2005.
[53] X.Y. Liu and Z.H. Zhou, "The Influence of Class Imbalance on Cost-Sensitive Learning: An Empirical Study," Proc. Int'l Conf. Data Mining, pp. 970-974, 2006.
[54] P. Domingos, "MetaCost: A General Method for Making Classifiers Cost-Sensitive," Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 155-164, 1999.
[55] B. Zadrozny, J. Langford, and N. Abe, "Cost-Sensitive Learning by Cost-Proportionate Example Weighting," Proc. Int'l Conf. Data Mining, pp. 435-442, 2003.
[56] Y. Freund and R.E. Schapire, "Experiments with a New Boosting Algorithm," Proc. Int'l Conf. Machine Learning, pp. 148-156, 1996.
[57] Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[58] Y. Sun, M.S. Kamel, A.K.C. Wong, and Y. Wang, "Cost-Sensitive Boosting for Classification of Imbalanced Data," Pattern Recognition, vol. 40, no. 12, pp. 3358-3378, 2007.
[59] W. Fan, S.J. Stolfo, J. Zhang, and P.K. Chan, "AdaCost: Misclassification Cost-Sensitive Boosting," Proc. Int'l Conf. Machine Learning, pp. 97-105, 1999.
[60] K.M. Ting, "A Comparative Study of Cost-Sensitive Boosting Algorithms," Proc. Int'l Conf. Machine Learning, pp. 983-990, 2000.
[61] M. Maloof, P. Langley, S. Sage, and T. Binford, "Learning to Detect Rooftops in Aerial Images," Proc. Image Understanding Workshop, pp. 835-845, 1997.
[62] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Chapman & Hall/CRC Press, 1984.
[63] C. Drummond and R.C. Holte, "Exploiting the Cost (In)Sensitivity of Decision Tree Splitting Criteria," Proc. Int'l Conf. Machine Learning, pp. 239-246, 2000.
[64] S. Haykin, Neural Networks: A Comprehensive Foundation, second ed. Prentice-Hall, 1999.
[65] M.Z. Kukar and I. Kononenko, "Cost-Sensitive Learning with Neural Networks," Proc. European Conf. Artificial Intelligence, pp. 445-449, 1998.
[66] P. Domingos and M. Pazzani, "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier," Proc. Int'l Conf. Machine Learning, pp. 105-112, 1996.
[67] G.R.I. Webb and M.J. Pazzani, "Adjusted Probability Naive Bayesian Induction," Proc. Australian Joint Conf. Artificial Intelli-
Distributions: A Case Study Involving Information Extraction,” gence, pp. 285-295, 1998.
Proc. Int’l Conf. Machine Learning (ICML ’2003), Workshop Learning [68] R. Kohavi and D. Wolpert, “Bias Plus Variance Decomposition for
from Imbalanced Data Sets, 2003. Zero-One Loss Functions,” Proc. Int’l Conf. Machine Learning, 1996.
[42] M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced [69] J. Gama, “Iterative Bayes,” Theoretical Computer Science, vol. 292,
Training Sets: One-Sided Selection,” Proc. Int’l Conf. Machine no. 2, pp. 417-430, 2003.
Learning, pp. 179-186, 1997. [70] G. Fumera and F. Roli, “Support Vector Machines with Embedded
[43] B.X. Wang and N. Japkowicz, “Imbalanced Data Set Learning with Reject Option,” Proc. Int’l Workshop Pattern Recognition with Support
Synthetic Samples,” Proc. IRIS Machine Learning Workshop, 2004. Vector Machines, pp. 68-82, 2002.
[44] H. Han, W.Y. Wang, and B.H. Mao, “Borderline-SMOTE: A New [71] J.C. Platt, “Fast Training of Support Vector Machines Using
Over-Sampling Method in Imbalanced Data Sets Learning,” Proc. Sequential Minimal Optimization,” Advances in Kernel Methods:
Int’l Conf. Intelligent Computing, pp. 878-887, 2005. Support Vector Learning, pp. 185-208, MIT Press, 1999.
HE AND GARCIA: LEARNING FROM IMBALANCED DATA 1283
[72] J.T. Kwok, “Moderating the Outputs of Support Vector Machine [96] A. Tashk, R. Bayesteh, and K. Faez, “Boosted Bayesian Kernel
Classifiers,” IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1018- Classifier Method for Face Detection,” Proc. Int’l Conf. Natural
1031, Sept. 1999. Computation, pp. 533-537, 2007.
[73] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer, [97] N. Abe, “Invited Talk: Sampling Approaches to Learning from
1995. Imbalanced Data Sets: Active Learning, Cost Sensitive Learning
[74] B. Raskutti and A. Kowalczyk, “Extreme Re-Balancing for SVMs: and Deyond,” Proc. Int’l Conf. Machine Learning, Workshop Learning
A Case Study,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, from Imbalanced Data Sets II, 2003.
pp. 60-69, 2004. [98] S. Ertekin, J. Huang, L. Bottou, and L. Giles, “Learning on the
[75] R. Akbani, S. Kwek, and N. Japkowicz, “Applying Support Vector Border: Active Learning in Imbalanced Data Classification,” Proc.
Machines to Imbalanced Data Sets,” Lecture Notes in Computer ACM Conf. Information and Knowledge Management, pp. 127-136,
Science, vol. 3201, pp. 39-50, 2004. 2007.
[76] G. Wu and E. Chang, “Class-Boundary Alignment for Imbalanced [99] S. Ertekin, J. Huang, and C.L. Giles, “Active Learning for Class
Data Set Learning,” Proc. Int’l Conf. Data Mining (ICDM ’03), Imbalance Problem,” Proc. Int’l SIGIR Conf. Research and Develop-
Workshop Learning from Imbalanced Data Sets II, 2003. ment in Information Retrieval, pp. 823-824, 2007.
[77] F. Vilarino, P. Spyridonos, P. Radeva, and J. Vitria, “Experiments [100] F. Provost, “Machine Learning from Imbalanced Data Sets 101,”
with SVM and Stratified Sampling with an Imbalanced Problem: Proc. Learning from Imbalanced Data Sets: Papers from the Am.
Detection of Intestinal Contractions,” Lecture Notes in Computer Assoc. for Artificial Intelligence Workshop, 2000 (Technical Report
Science, vol. 3687, pp. 783-791, 2005. WS-00-05).
[78] P. Kang and S. Cho, “EUS SVMs: Ensemble of Under sampled [101] Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast Kernel
SVMs for Data Imbalance Problems,” Lecture Notes in Computer Classifiers with Online and Active Learning,” J. Machine Learning
Science, vol. 4232, pp. 837-846, 2006. Research, vol. 6, pp. 1579-1619, 2005.
[79] Y. Liu, A. An, and X. Huang, “Boosting Prediction Accuracy on
[102] J. Zhu and E. Hovy, “Active Learning for Word Sense Disambi-
Imbalanced Data Sets with SVM Ensembles,” Lecture Notes in
guation with Methods for Addressing the Class Imbalance
Artificial Intelligence, vol. 3918, pp. 107-118, 2006.
Problem,” Proc. Joint Conf. Empirical Methods in Natural Language
[80] B.X. Wang and N. Japkowicz, “Boosting Support Vector Machines
Processing and Computational Natural Language Learning, pp. 783-
for Imbalanced Data Sets,” Lecture Notes in Artificial Intelligence,
790, 2007.
vol. 4994, pp. 38-47, 2008.
[81] Y. Tang and Y.Q. Zhang, “Granular SVM with Repetitive [103] J. Doucette and M.I. Heywood, “GP Classification under
Undersampling for Highly Imbalanced Protein Homology Pre- Imbalanced Data Sets: Active Sub-Sampling AUC Approxima-
diction,” Proc. Int’l Conf. Granular Computing, pp. 457- 460, 2006. tion,” Lecture Notes in Computer Science, vol. 4971, pp. 266-277,
2008.
[82] Y.C. Tang, B. Jin, and Y.-Q. Zhang, “Granular Support Vector
Machines with Association Rules Mining for Protein Homology [104] B. Scholkopt, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C.
Prediction,” Artificial Intelligence in Medicine, special issue on Williamson, “Estimating the Support of a High-Dimensional
computational intelligence techniques in bioinformatics, vol. 35, Distribution,” Neural Computation, vol. 13, pp. 1443-1471, 2001.
nos. 1/2, pp. 121-134, 2005. [105] L.M. Manevitz and M. Yousef, “One-Class SVMs for Document
[83] Y.C. Tang, B. Jin, Y.-Q. Zhang, H. Fang, and B. Wang, “Granular Classification,” J. Machine Learning Research, vol. 2, pp. 139-154,
Support Vector Machines Using Linear Decision Hyperplanes for 2001.
Fast Medical Binary Classification,” Proc. Int’l Conf. Fuzzy Systems, [106] L. Zhuang and H. Dai, “Parameter Estimation of One-Class SVM
pp. 138-142, 2005. on Imbalance Text Classification,” Lecture Notes in Artificial
[84] Y.C. Tang, Y.Q. Zhang, Z. Huang, X.T. Hu, and Y. Zhao, Intelligence, vol. 4013, pp. 538-549, 2006.
“Granular SVM-RFE Feature Selection Algorithm for Reliable [107] H.J. Lee and S. Cho, “The Novelty Detection Approach for
Cancer-Related Gene Subsets Extraction on Microarray Gene Difference Degrees of Class Imbalance,” Lecture Notes in Computer
Expression Data,” Proc. IEEE Symp. Bioinformatics and Bioeng., Science, vol. 4233, pp. 21-30, 2006.
pp. 290-293, 2005. [108] L. Zhuang and H. Dai, “Parameter Optimization of Kernel-Based
[85] X. Hong, S. Chen, and C.J. Harris, “A Kernel-Based Two-Class One-Class Classifier on Imbalance Text Learning,” Lecture Notes in
Classifier for Imbalanced Data Sets,” IEEE Trans. Neural Networks, Artificial Intelligence, vol. 4099, pp. 434-443, 2006.
vol. 18, no. 1, pp. 28-41, Jan. 2007. [109] N. Japkowicz, “Supervised versus Unsupervised Binary-Learning
[86] G. Wu and E.Y. Chang, “Aligning Boundary in Kernel Space for by Feedforward Neural Networks,” Machine Learning, vol. 42,
Learning Imbalanced Data Set,” Proc. Int’l Conf. Data Mining, pp. 97-122, 2001.
pp. 265-272, 2004. [110] L. Manevitz and M. Yousef, “One-Class Document Classification
[87] G. Wu and E.Y. Chang, “KBA: Kernel Boundary Alignment via Neural Networks,” Neurocomputing, vol. 70, pp. 1466-1481,
Considering Imbalanced Data Distribution,” IEEE Trans. Knowl- 2007.
edge and Data Eng., vol. 17, no. 6, pp. 786-795, June 2005. [111] N. Japkowicz, “Learning from Imbalanced Data Sets: A Compar-
[88] G. Wu and E.Y. Chang, “Adaptive Feature-Space Conformal ison of Various Strategies,” Proc. Am. Assoc. for Artificial Intelligence
Transformation for Imbalanced-Data Learning,” Proc. Int’l Conf. (AAAI) Workshop Learning from Imbalanced Data Sets, pp. 10-15,
Machine Learning, pp. 816-823, 2003. 2000 (Technical Report WS-00-05).
[89] Y.H. Liu and Y.T. Chen, “Face Recognition Using Total Margin- [112] N. Japkowicz, C. Myers, and M. Gluck, “A Novelty Detection
Based Adaptive Fuzzy Support Vector Machines,” IEEE Trans. Approach to Classification,” Proc. Joint Conf. Artificial Intelligence,
Neural Networks, vol. 18, no. 1, pp. 178-192, Jan. 2007. pp. 518-523, 1995.
[90] Y.H. Liu and Y.T. Chen, “Total Margin Based Adaptive Fuzzy
[113] C.T. Su and Y.H. Hsiao, “An Evaluation of the Robustness of MTS
Support Vector Machines for Multiview Face Recognition,” Proc.
for Imbalanced Data,” IEEE Trans. Knowledge and Data Eng.,
Int’l Conf. Systems, Man and Cybernetics, pp. 1704-1711, 2005.
vol. 19, no. 10, pp. 1321-1332, Oct. 2007.
[91] G. Fung and O.L. Mangasarian, “Multicategory Proximal Support
Vector Machine Classifiers,” Machine Learning, vol. 59, nos. 1/2, [114] G. Taguchi, S. Chowdhury, and Y. Wu, The Mahalanobis-Taguchi
pp. 77-97, 2005. System. McGraw-Hill, 2001.
[92] J. Yuan, J. Li, and B. Zhang, “Learning Concepts from Large Scale [115] G. Taguchi and R. Jugulum, The Mahalanobis-Taguchi Strategy. John
Imbalanced Data Sets Using Support Cluster Machines,” Proc. Int’l Wiley & Sons, 2002.
Conf. Multimedia, pp. 441-450, 2006. [116] M.V. Joshi, V. Kumar, and R.C. Agarwal, “Evaluating Boosting
[93] A.K. Qin and P.N. Suganthan, “Kernel Neural Gas Algorithms Algorithms to Classify Rare Classes: Comparison and Improve-
with Application to Cluster Analysis,” Proc. Int’l Conf. Pattern ments,” Proc. Int’l Conf. Data Mining, pp. 257-264, 2001.
Recognition, 2004. [117] F.J. Provost and T. Fawcett, “Analysis and Visualization of
[94] X.P. Yu and X.G. Yu, “Novel Text Classification Based on K- Classifier Performance: Comparison under Imprecise Class and
Nearest Neighbor,” Proc. Int’l Conf. Machine Learning Cybernetics, Cost Distributions,” Proc. Int’l Conf. Knowledge Discovery and Data
pp. 3425-3430, 2007. Mining, pp. 43-48, 1997.
[95] P. Li, K.L. Chan, and W. Fang, “Hybrid Kernel Machine Ensemble [118] F.J. Provost, T. Fawcett, and R. Kohavi, “The Case against
for Imbalanced Data Sets,” Proc. Int’l Conf. Pattern Recognition, Accuracy Estimation for Comparing Induction Algorithms,” Proc.
pp. 1108-1111, 2006. Int’l Conf. Machine Learning, pp. 445-453, 1998.
1284 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 9, SEPTEMBER 2009
Haibo He received the BS and MS degrees in electrical engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1999 and 2002, respectively, and the PhD degree in electrical engineering from Ohio University, Athens, in 2006. He is currently an assistant professor in the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, New Jersey. His research interests include machine learning, data mining, computational intelligence, VLSI and FPGA design, and embedded intelligent systems design. He has served regularly on the organization committees and the program committees of many international conferences and has also been a reviewer for the leading academic journals in his fields, including the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Neural Networks, the IEEE Transactions on Systems, Man and Cybernetics (part A and part B), and others. He has also served as a guest editor for several international journals, such as Soft Computing (Springer) and Applied Mathematics and Computation (Elsevier), among others. He has delivered several invited talks, including the IEEE North Jersey Section Systems, Man & Cybernetics invited talk on “Self-Adaptive Learning for Machine Intelligence.” He was the recipient of the Outstanding Master Thesis Award of Hubei Province, China, in 2002. Currently, he is the editor of the IEEE Computational Intelligence Society (CIS) Electronic Letter (E-letter) and a committee member of the IEEE Systems, Man, and Cybernetics (SMC) Technical Committee on Computational Intelligence. He is a member of the IEEE, the ACM, and the AAAI.

Edwardo A. Garcia received the BS degree in mathematics from New York University, New York, and the BE degree in computer engineering from Stevens Institute of Technology, Hoboken, New Jersey, both in 2008. He currently holds research appointments with the Department of Electrical and Computer Engineering at Stevens Institute of Technology and with the Department of Anesthesiology at New York University School of Medicine. His research interests include machine learning, biologically inspired intelligence, cognitive neuroscience, data mining for medical diagnostics, and mathematical methods for f-MRI.