
Received November 27, 2020, accepted December 17, 2020, date of publication December 28, 2020, date of current version January 6, 2021.

Digital Object Identifier 10.1109/ACCESS.2020.3047838

Learning With Imbalanced Data in Smart Manufacturing: A Comparative Analysis

YASMIN FATHY¹, MONA JABER² (Member, IEEE), AND ALEXANDRA BRINTRUP¹
¹Department of Engineering, University of Cambridge, Cambridge CB3 0FS, U.K.
²School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4FZ, U.K.

Corresponding author: Yasmin Fathy ([email protected])


This work was supported by Research England through the Pitch-In Project (Promoting the Internet of Things via Collaborations between HEIs and Industry).

ABSTRACT The Internet of Things (IoT) paradigm is revolutionising the world of manufacturing into what is known as Smart Manufacturing or Industry 4.0. A main pillar of smart manufacturing is harnessing IoT data and leveraging machine learning (ML) to automate the prediction of faults, thus cutting maintenance time and cost and improving product quality. However, faults in real industries are overwhelmingly outweighed by instances of good performance (faultless samples); this bias is reflected in the data captured by IoT devices. Imbalanced data limits the success of ML in predicting faults and thus presents a significant hindrance to the progress of smart manufacturing. Although various techniques have been proposed to tackle this challenge in general, this work is the first to present a framework for evaluating the effectiveness of these remedies in the context of manufacturing. We present a comprehensive comparative analysis in which we apply our proposed framework to benchmark the performance of different combinations of algorithm components using a real-world manufacturing dataset. We draw key insights into the effectiveness of each component and the inter-relatedness between the dataset, the application context, and the design of the ML algorithm.

INDEX TERMS Manufacturing analytics, generative modeling, smart manufacturing, imbalanced data,
limited failure data, generating synthetic data.

I. INTRODUCTION
Manufacturing process is a broad terminology that encompasses the production of final products either handmade, machine-made, or hybrid. This often entails the transformation of raw material into the final product through various mechanical, chemical, or other industrial processes. Smart Manufacturing (SM), also referred to as Industry 4.0, has transformed the traditional linear manufacturing into a dynamic and digital ecosystem to improve product quality, operations efficiency, and production yield. The Internet of Things (IoT) is a network of connected devices, such as sensors and actuators, that gives operators access to data from the real world in real time whilst working from the comfort of their desks. In the context of SM, IoT offers infinite possibilities in terms of remote monitoring, maintenance, and control of operations. A manufacturing process is typically a chain of complex tasks where the quality of the final product depends on the entire chain of production. By embracing the IoT paradigm, SM leverages IoT data to drive and automate intelligent decisions, thus improving the efficiency and quality of each task in the chain, not least the final product.

In traditional manufacturing, intermediate quality tests are performed during the production process to detect quality issues and identify defective batches. However, these tests are time-consuming and not conclusive, as they do not allow for full physical inspection of a production line. On the other hand, the early detection of defects at early product processing stages is one of the most effective ways of reducing costs, saving time, and boosting operational efficiency [1]. SM empowered with IoT and machine learning (ML) techniques offers the ability of early defect detection and the opportunity to capitalise on its benefits. IoT sensor data has been used for predictive maintenance in industrial machines [2] and automotive manufacturing [3]. These works, however, are not designed to deal with data imbalance. For instance, the authors in [3] use cluster-based


methods for detecting and disregarding data outliers in order to improve fault detection during the manufacturing process. Other existing works such as [4] show that classification performance for imbalanced data can be improved when data outliers are eliminated.

Modern prediction systems rely on IoT data and ML techniques to predict the expected quality level of forthcoming products. Predictive analytics is one application of ML that analyses current and historical data to make predictions about future events. With predictive analytics, SM manufacturers can discover intricate patterns in collected IoT data, identify processing batches that drop below a defined quality level, and select the best course of action among multiple options. Consequently, quality engineers can use this information to either immediately adjust the process parameters or stop the processing of a particular defective batch.

In today's multimode manufacturing processes, not all anomalies or faults affect the product's quality. Thus, in order to make quality monitoring more purposeful and accurate, a tailored performance indicator method has been proposed by Song et al. [5]. The proposed method considers the influence of a fault on product quality and process safety by constructing sub-spaces that enable distinguishing between various types of quality-related and safety-related faults in complex industrial systems. Predicting quality in real time is essential for process monitoring and control, but measuring quality variables often requires offline analysis. However, quality variables can be measured by exploring the correlation between the changes in process variables and the collected quality information [6]. To this end, a multi-subspace elastic network method is employed in [6] to construct the correlation relationship between process variables and quality variables for the detection and diagnosis of faults. Recently, a novel data-driven method has been proposed in [7] using multi-subspace orthogonal canonical correlation analysis for identifying quality-affecting faults in a real-time fashion.

Despite the recent success of collecting IoT data for process monitoring [8], this data mirrors the actual manufacturing process and its intrinsic bias towards good performance. It is, of course, fortunate that instances of faultless samples largely outnumber the defects and faults in the manufacturing process. This characteristic is a major factor in today's economy, as it renders the manufacturing of complex products cost-effective and the end product accessible to the masses. For instance, two-thirds of the world population today can afford a mobile phone.¹ On the other hand, data imbalance is a major challenge for ML-based predictive analytics, which relies on data for learning.

¹ https://datareportal.com/global-digital-overview

In a binary classification ML task (e.g., faulty/faultless), imbalanced data is where the number of negative samples (faultless, or samples that conform to the quality control process) outweighs the number of positive samples (faulty samples). Canonical ML algorithms often assume that each class has roughly the same number of objects [7], [9]. Manufacturing datasets, however, often have a dramatically skewed distribution. Thus, their application to canonical ML algorithms fails to deliver reliable results. Indeed, when positive samples are limited, predictive models tend to be biased towards the majority (negative) class. This leads to a high probability of misclassifying samples from the minority class. In the context of manufacturing, this bias in predictive models results in the majority of faults going unnoticed, significantly impacting the quality and efficiency of production. In fact, the impact of not predicting a fault in manufacturing is much more detrimental to the quality and process than misclassifying a faultless sample as faulty.

To this end, there is a dominant incentive in SM to improve the performance of quality predictive models that deal with imbalanced datasets. There are various efforts to address this issue: some propose to remove the bias by manipulating the dataset, referred to as data-based, and others propose to implement a positive bias in the ML algorithm, referred to as algorithm-based. Data-based methods look at generating new synthetic data samples from the minority class to reach similar numbers as the majority class. These methods are often referred to as data augmentation tools, as they increase the number of samples by adding the synthetic data. For instance, the authors in [10] use such a method (Synthetic Minority Over-sampling TEchnique (SMOTE) [11], [12]) to mitigate data imbalance while predicting product quality.

The second group relates to algorithm-based methods, also referred to as cost-sensitive learning. An artificial bias is implemented in the existing classification process through a cost matrix that amplifies the penalty value for misclassifying minority samples. An algorithm-based method was recently used with an imbalanced dataset to predict failure in air pressure systems in [13]. A promising direction looks at combining both approaches to form a hybrid technique that alters the dataset as well as the algorithm bias to circumvent the imbalance obstacle.

However, these methods have not been evaluated in the context of predictive analytics for manufacturing problems. In fact, there is no one-solution-fits-all, and it is critical to have a framework in which various methods can be assessed in a data-centric and contextual fashion. In this work, we present the first such framework, covering the process of predictive analytics starting at the original dataset and finishing at the context-aware performance evaluation, as shown in Figure 1. In this study, we present a comprehensive comparative analysis of data-based, algorithm-based and hybrid methods for improving the prediction accuracy of manufacturing faults. To this end, we propose the first statistical analysis framework for measuring the effectiveness of each method by quantifying four key aspects. The first looks at measuring the bias embedded in data by adopting entropy concepts. It is evident that the optimum predictive analytics method majorly depends on the dataset and its

intricate features. It is, therefore, of pivotal importance to capture the level of imbalance in the dataset and the resulting behaviour with different re-balancing methods. The second statistical method gauges the goodness of the associated labels (where each represents a cluster or class) using the Silhouette coefficient. Indeed, this metric is essential for understanding the dataset and the source of bias in order to identify a suitable remedy. The third aspect measures the effectiveness of the proposed method (be it data-based, algorithm-based, or hybrid) in predicting defects and identifies relevant metrics such as Precision and Recall, which are used to calculate the F1-score. The last aspect relates only to data augmentation methods and aims at measuring the goodness of the generated synthetic data in comparison with the real data samples.

FIGURE 1. Proposed predictive analytics framework for dealing with imbalanced datasets.

Data augmentation is an optional step in the pre-processing phase and may include methods such as SMOTE and Generative Adversarial Networks (GANs) and related variants (see Figure 1). Algorithm-based methods are implemented in the ML classifier, which is selected from a pool of potential classifiers. In addition, data bias can be addressed in the post-processing phase, as shown in Figure 1, by tuning the classification threshold to counter-balance the imbalance.

A. CONTRIBUTIONS
This article aims to provide a better understanding of predictive analytics methods that employ imbalanced datasets, with particular focus on manufacturing applications. Our research study aims at assessing the performance of some of the widely used techniques at three levels:
• Pre-processing: data-augmentation techniques (e.g., SMOTE, GAN) and feature reduction techniques (e.g., Principal Component Analysis).
• Classifier algorithm: classifier type (e.g., Linear Regression, Random Forest) that might incorporate class weights while being trained and tuned, or be optimised for a given cost-sensitive function (i.e., misclassification cost).
• Post-processing: a moving threshold for counter-balancing the data bias.
The performance evaluation phase employs context-aware metrics from a wide range of indicators for measuring the data imbalance, the goodness of class labels, the context-aware success of the classification, and the usability of synthetic data. We have conducted multiple experiments that combine hybrid options within the framework in Figure 1 using a real-world SM dataset.

Although the literature review reveals various works that address the challenge of handling imbalanced datasets in the context of smart manufacturing, none offers a comprehensive study on how intrinsic characteristics of datasets can be exploited in the selection of ML-based predictive analytics. To this end, the overarching contributions of this work can be summarised as follows:


• Comprehensive review of the challenge of classification using imbalanced datasets and corresponding widely used techniques.
• The first complete evaluation framework to enable the assessment of various combinations of techniques.
• A context-aware set of performance indicators to gauge the effectiveness of predictive models and their impact on the application at hand.
• The first comparative analysis of multiple experiments in which we apply the designed framework using a real manufacturing dataset and draw novel insights.

B. OUTLINE
The paper is structured as follows. The problem formulation is explained in Section II. Section III provides the required background and related work. The adopted methodology in the implementation of various techniques for data augmentation and cost-sensitive functions is detailed in Section IV. The performance metrics for context-aware evaluation of predictive models are described in Section V. In Section VI, we present the manufacturing dataset that is used in the comparative analysis, and we devise four cases that encompass hybrid combinations of data-based, algorithm-based, and post-processing techniques. In Section VII, we present the results from each of the cases and draw context-aware insights in the discussion. We finally conclude the paper and explain the future directions of our research in Section VIII.

II. PROBLEM FORMULATION
Consider a network of N sensor nodes that are deployed across the manufacturing process to predict quality issues such as defective batches and component failures, among others. Each sensor node sn, where n = 1, 2, · · · , N, collects data streams of a particular type, e.g. temperature, flow, and motor rotation. Data streams are a sequence of numerical measurements in consecutive order. At each time instance t ≥ 0, each sensor node sn gathers a single data stream. Let x(t) ∈ R^N be a data sample or instance that is collected from the N sensors at a time t. Let X ∈ R^{M×N} be a sequence of M data samples (i.e., features/variables) that are collected by the N sensors, where each sample has a label or class yj to indicate whether a failure has occurred (yj = 1) or not (yj = 0) at the machine or component being observed, such that yj ∈ {0, 1}, where j = {1, 2, · · · , M}. Let Y ∈ R^M be the discrete labels whose values are to be modelled and predicted from the input data X. X and Y can be formulated as follows:

$$X = \underbrace{\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M1} & a_{M2} & \cdots & a_{MN} \end{bmatrix}}_{M \text{ data samples from } N \text{ sensors}} \qquad Y = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}}_{\text{labels}} \tag{1}$$

where the data samples are X = {x_1, · · · , x_M}, such that the first data sample x_1 = {a_11, a_12, · · · , a_1N} is collected by the N sensors; a_ij is a single measurement value for a particular sensor at a certain collected sample, with i and j the sample id and sensor id, respectively. Moreover, {a_11, a_21, · · · , a_M1} is a data stream: a sequence of numerical observations collected by the sensor s_1.

The problem can be formulated as imbalanced classification predictive modelling: a process of predicting quality issues that are categorised into classes/labels (e.g. failure/non-failure) by approximating a mapping function f from input data samples X = {x_1, · · · , x_M} to discrete output labels Y. We assume that there is a high imbalance ratio between the minority and majority classes in Y. The minority class refers to the class that has few samples in the data X, while the majority class refers to the class that has many samples in the same data. The ratio between these two types of samples is referred to as the imbalance ratio: the Imbalance Ratio (IR) is defined as the proportion of the number of samples in the majority class (yj = 0) to the number of samples in the minority class (yj = 1).

In this article, our dataset consists of data samples X and their corresponding discrete labels Y. To this end, we refer to the training samples in our dataset as D = {(x_1, y_1), · · · , (x_M, y_M)} in a binary classification task, where the i-th example pair in D denotes a data sample (i.e., feature vector) taken by N sensors at a time t and is labelled (through y_i) as either a normal sample (when y_i = 0) or a faulty one (when y_i = 1). To this end, in our binary classification case, we denote the data subset containing all faulty samples (i.e., the minority class) as Df = {(x_i, y_i) ∈ D | y_i = 1} ⊂ D. Similarly, the data subset containing all normal or faultless samples (i.e., the majority class) is Ds = {(x_i, y_i) ∈ D | y_i = 0} ⊂ D.
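To make the formulation concrete, the following minimal Python sketch builds a toy sample matrix X and label vector y, forms the subsets Df and Ds, and computes the imbalance ratio. The shapes, seed and ~2% fault rate are illustrative assumptions, not properties of the paper's dataset.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    M, N = 1000, 8                           # M data samples from N sensors (toy values)
    X = rng.normal(size=(M, N))              # one row per data sample, as in Eq. (1)
    y = (rng.random(M) < 0.02).astype(int)   # ~2% faulty samples (y_j = 1), illustrative

    D_f = X[y == 1]                          # minority class: faulty samples
    D_s = X[y == 0]                          # majority class: faultless samples

    # Imbalance Ratio (IR): |majority| / |minority|
    ir = len(D_s) / max(len(D_f), 1)
    print(f"M_min={len(D_f)}, M_maj={len(D_s)}, IR={ir:.1f}")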
III. LEARNING WITH IMBALANCED DATA
Recent advancements in Artificial Intelligence (AI) stemming from deep neural networks (DNN) have helped to establish manufacturing predictive quality as an essential category in Smart Manufacturing (SM). Despite their success in many data-driven real-world applications, the problem of learning from imbalanced data is still a challenging task [14]. Most manufacturing applications suffer from highly skewed class distributions, where there are usually few collected samples about defective batches or component failures (i.e., the minority class), and it is costly to collect more failure samples [1], [14]. As a result, rare instances and events in manufacturing applications are not easily identified, and it is difficult to apply standard classification techniques while attaining high accuracy in predicting the minority class [15]. Approaches that have been developed to tackle the imbalance problem can be grouped into three main categories: data-driven methods, algorithmic-based methods and hybrid methods.

A. DATA-DRIVEN METHODS
Data-driven approaches are data preprocessing methods that enhance learning from imbalanced data by modifying the training data-set to balance the class distributions (see Figure 1). This modification is based on generating


new samples for the minority class or removing samples of the majority class. The former is referred to as over-sampling, while the latter is referred to as under-sampling. The basic sampling approaches, Random Over-sampling and Under-sampling (ROU), aim to balance the class distribution by the random elimination of majority class samples or the duplication of minority class samples in the training data-set [16]. This results in discarding useful information about the data and in performance degradation, due to removing potentially useful samples or duplicating samples that might cause over-fitting of the learned models [9], [17]. Over-fitting occurs when the learned model fits too closely to the training data, such that it becomes unable to generalise to new data samples [18]. To address this problem, advanced sampling techniques have been proposed to maintain the underlying distribution of the natural grouping in the data while adding new data samples.

Advanced under-sampling strategies aim to preserve important information while learning from imbalanced data. Lin et al. [19] have proposed an under-sampling cluster-based technique to maintain a good representation of the majority class in the dataset. The approach is based on clustering the data samples from the majority class such that the number of clusters in the majority class is set equal to the number of data samples in the minority class. Moreover, Mani and Zhang [20] have used a K-Nearest Neighbours (KNN) classifier as an under-sampling preprocessing step. KNN tends to reduce the overlap between minority and majority samples: majority samples are classified, and samples are selected for removal based on their distances from minority samples. Density-based under-sampling methods, such as the Density-Based Under-sampling (DBU) technique [21], [22], have been developed to retain useful information while reducing the number of majority-class samples. DBU assumes that similar examples are relatively close to each other, while noisy examples tend to be far from other examples associated with the same class in the feature space.

Several informed over-sampling techniques have also been developed to reduce over-fitting and strengthen class boundaries. Over-sampling methods tend to be more efficient than under-sampling techniques when handling extremely imbalanced big data problems with a large imbalance ratio [12], [23]–[25]. Chawla et al. [11] have introduced the Synthetic Minority Over-sampling TEchnique (SMOTE), a method for creating synthetic data of minority samples by identifying the feature space of the minority samples and considering their k nearest neighbours. In principle, SMOTE creates artificial data samples for the minority class by linearly interpolating new samples between already existing minority samples and their nearest minority neighbours.

SMOTE has shown great success in addressing imbalanced data in different industrial applications, including the manufacturing process [24] and predictive maintenance and failure prediction [26]. Several extensions have been developed to improve upon the original SMOTE algorithm, such as Safe-Level-SMOTE [27] and Borderline-SMOTE [28], among others. Safe-Level-SMOTE aims to define safe regions to prevent overlapping between classes and generate less noisy minority samples, while the primary goal of Borderline-SMOTE is to limit the number of generated samples near minority class borders. Similar to SMOTE, ADASYN [29] is an over-sampling method that creates synthesised data samples of the minority classes. However, ADASYN improves the over-sampling process by further reducing the bias introduced by the class imbalance and forcing the learning algorithm to learn the minority class boundaries adaptively, enhancing the quality of the generated minority samples.

Yet powerful generative models, including Generative Adversarial Networks (GAN) [30], have been successful in generating new samples that are similar to real samples, improving the performance of imbalanced classification tasks. In [31], a Decision Tree (DT) classifier using the training data-set generated by GAN achieves results comparable to a DT trained on the original data-set. Conditional GAN (CGAN) has been developed in [32] for creating synthetic data for prognostics under the conditions of limited failure data availability.
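The over-sampling variants surveyed above have off-the-shelf implementations. The hedged sketch below shows how SMOTE, Borderline-SMOTE and ADASYN could be applied to re-balance a labelled dataset, assuming the third-party imbalanced-learn package and its fit_resample API together with toy arrays; it is an illustrative usage, not the paper's experimental code.

    import numpy as np
    from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

    rng = np.random.default_rng(seed=0)
    X = rng.normal(size=(1000, 8))
    y = (rng.random(1000) < 0.05).astype(int)   # ~5% minority class (illustrative)

    # Each sampler synthesises minority samples until the classes are balanced.
    for sampler in (SMOTE(k_neighbors=5), BorderlineSMOTE(), ADASYN()):
        X_res, y_res = sampler.fit_resample(X, y)
        print(type(sampler).__name__, np.bincount(y_res))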
TABLE 1. A cost matrix for binary classification [36]. C: cost, TP: True Positive, FP: False Positive (i.e., false alarm), TN: True Negative, and FN: False Negative.

                              Actual: faulty (1)     Actual: faultless (0)
    Predicted: faulty (1)     C(1,1) = 0 (TP)        C(1,0) = C_FP (FP)
    Predicted: faultless (0)  C(0,1) = C_FN (FN)     C(0,0) = 0 (TN)

B. ALGORITHMIC-BASED METHODS
Algorithmic-based approaches are often discussed under cost-sensitive methods in the literature. Unlike data sampling methods, cost-sensitive methods do not alter the training data-set. Instead, they modify the existing learning algorithms or decision process (e.g. for classification tasks) through a cost matrix such that each class is assigned a misclassification penalty value [9], [33], [34] (see Figure 1). In principle, this family of algorithms alleviates its bias towards majority classes by increasing the cost value of minority groups. This results in increasing the importance of these groups and decreasing the likelihood that the learning algorithm will misclassify them [18]. An example of a cost matrix for a binary classification problem such as failure prediction is shown in Table 1, where Ci,j is the misclassification cost, i.e. the penalty for predicting samples as class i when their true class is class j. Intuitively, there is no penalty for classifying the samples correctly (i.e., True Positive (TP) and True Negative (TN)). To this end, the diagonal of the cost matrix, where i = j (i.e., C1,1 and C0,0), is 0. The optimisation process of predictive models for imbalanced data shifts from maximising the overall accuracy, or minimising the error


rate, to minimising the total cost such that:

$$T_{Cost} = m\, C_{FP} + n\, C_{FN} \tag{2}$$

where TCost is the total cost that should be minimised, and m and n are the numbers of false positive (FP) and false negative (FN) errors, respectively. More precisely, m is the number of faultless samples that are predicted as faulty samples, and vice versa for n. In real-world manufacturing applications, FN errors cost more than FP errors, i.e., CFN ≫ CFP [18]. For instance, CFP = 10 and CFN = 500 for predicting air pressure system failures in Scania trucks [13]. In such a case, CFP refers to the cost of an unnecessary check of a truck by a mechanic. In contrast, CFN refers to the cost of missing a faulty truck, which may cause a breakdown and put drivers and their fellow road users at high risk.

The total cost TCost in Eq. 2 is a non-normalised cost. Without loss of generality, we can express the normalised cost T̂Cost with respect to n as follows:

$$\hat{T}_{Cost} = \lambda\, C_{FP} + C_{FN} \tag{3}$$

where the coefficient λ indicates the relative importance of the various misclassification costs, such that λ = m/n [35]. If the cost-sensitive classifier produces posterior probability estimates P for test samples instead of discrete labels, the cost matrix relies on defining a classification threshold p∗ [36] such that:

$$p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}} \tag{4}$$

In such a case, the learning algorithm classifies test samples as faulty if their posterior probability estimates satisfy P ≥ p∗. This type of threshold-moving method is sometimes called relabelling and is used as a post-processing step for relabelling the output class of test samples [18], [36].
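A minimal sketch of this threshold-moving step, assuming the Scania-style costs quoted above and hypothetical posterior probabilities (any probabilistic classifier could supply them; the arrays below are invented for illustration):

    import numpy as np

    C_FP, C_FN = 10, 500                       # example costs from the Scania case [13]
    p_star = C_FP / (C_FP + C_FN)              # Eq. 4: classification threshold ~0.0196

    # Hypothetical posterior probabilities P(faulty | x) for five test samples.
    proba = np.array([0.01, 0.03, 0.20, 0.005, 0.60])

    # Threshold moving (relabelling): flag as faulty whenever P >= p*.
    y_pred = (proba >= p_star).astype(int)

    # Total cost of a prediction vector against ground truth (Eq. 2).
    y_true = np.array([0, 1, 1, 0, 1])
    m = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    n = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    print(p_star, y_pred, m * C_FP + n * C_FN)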
The performance of cost-sensitive algorithms mainly relies on incorporating an effective cost matrix that modifies the learning process. However, the actual cost matrix is often unknown in most real-world applications, and it may be defined empirically or by domain experts [9], [18].

Ensemble-based learning algorithms are another type of cost-sensitive method for tackling imbalanced class distributions. Krawczyk et al. [37] introduce a one-class ensemble learning algorithm for improving the predictive classifier of a multi-class imbalanced problem with a complex and imbalanced class distribution. The proposed ensemble learning algorithm aims to create individual descriptions of each class and then combine them into a classifier that outperforms each of them. Such a classifier reduces the bias towards one of the classes introduced by standard classifiers [38]. Bagging [39], AdaBoost [40] and Gradient Boosting [41] are the most common ensemble classifier algorithms. Boosting is an iterative algorithm that associates different weights with the data distribution. The learned algorithm is forced to focus more on the misclassified samples at each iteration. This is achieved by increasing the weights associated with misclassified samples and decreasing the weights associated with correctly classified samples. Other ensemble approaches are also discussed in [42].

C. HYBRID METHODS
Hybrid methods take advantage of both data-driven and algorithmic-based methods. These two categories have been integrated in various ways. For instance, data-driven solutions are combined with classifier ensembles to mitigate the effect of the imbalanced data [43]. Other approaches such as [44], [45] combine cost-sensitive and over-sampling approaches based on data density to generate better samples around each minority group and eliminate the noise effect for imbalance learning. It is worth noting that over-sampling approaches have shown little sensitivity when misclassification costs change [46].

Yang et al. [25] introduce a hybrid optimal ensemble classifier (HOEC) framework that outperforms other conventional and ensemble classifier methods for learning from imbalanced real-world data sets. HOEC combines density-based under-sampling and cost-effective methods through a multi-objective optimisation process to overcome the limitations of traditional under-sampling and cost-sensitive algorithms. Several cost-sensitive methods have been developed based on traditional decision trees to improve the imbalanced classification performance of the minority class. Li et al. [47] developed a hybrid decision tree that incorporates both a misclassification cost and a set of selected attributes. The attribute selection criterion is based on a linear combination of the Gini index and information gain.

A hybrid framework that incorporates data clustering, data sampling and ensembles is proposed in [48] for improving the performance of a binary classifier. The proposed hybrid framework outperforms traditional over-sampling techniques, including SMOTE.

Overall, this article studies the improvement of the performance of predictive quality analytics under the condition of limited faulty data availability. We present a comparison of state-of-the-art over-sampling approaches for generating samples for minority groups (e.g., faulty data). We also study a hybrid approach between sampling and a cost-sensitive model. With the use of statistical analysis, we measure the quality of over-sampling techniques and their effect on alleviating the bias towards majority groups (e.g., faultless or normal samples) by increasing the importance and the cost values (i.e., penalties) of minority groups. To the best of our knowledge, such a comparison spanning various over-sampling techniques for predictive quality analytics in manufacturing has not been carried out so far.

IV. METHODOLOGY FOR PREDICTIVE QUALITY ANALYTICS
In this section, we present the methodology adopted in the evaluation of various combinations of data-based and algorithm-based methods for dealing with imbalanced datasets. As discussed in Section II, our work focuses on a


manufacturing problem that aims to predict defects in the end product. To this end, the dataset considered in this evaluation contains samples that are labelled as either positive (minority class) or negative (majority class).

In the following paragraphs, we summarise the standard Synthetic Minority Over-sampling TEchnique (SMOTE) and generative models, including the Generative Adversarial Network (GAN) and its variants, in terms of their objective functions and their architectures. We evaluate these techniques for creating high-quality minority data samples and measure their effectiveness, in conjunction with cost-sensitive learning algorithms, in improving classification performance when presented with an imbalanced dataset. There are very few works that use some of these techniques for data augmentation, such as [26], [49], while learning from an imbalanced dataset, especially in industrial settings. To the best of our knowledge, this is the first comprehensive evaluation of different data augmentation and cost-sensitive methods for predictive analytics in manufacturing applications.

A. SYNTHETIC OVER-SAMPLING TECHNIQUES
As mentioned previously in Section III-A, SMOTE [11] is one of the main over-sampling methods for creating synthetic data samples of the minority class (e.g., faulty samples). SMOTE first selects a random data sample x1 from the minority samples and then finds its k nearest minority neighbours. An example demonstrating the process followed in SMOTE is shown in Figure 2, in which we assume that k = 6. In this example, the neighbouring samples of the faulty sample x1 are indicated as {x2, x3, · · · , x7}. Then, for each of the pairs {(x1, x2), (x1, x3), · · · , (x1, x7)}, a synthetic faulty data sample is created along the line segment joining them, shown as a red triangle in Figure 2. Thus, the newly generated faulty samples are at random positions on the linear vector space from x1. The number of generated samples relies on β = Mmin/Mmaj, where Mmin is the number of faulty samples in the minority class after over-sampling and Mmaj is the number of faultless (i.e., normal) samples in the majority class. In a binary classification task, the total number of samples M is the sum of the numbers of samples of the minority and majority classes, M = Mmin + Mmaj.

FIGURE 2. An example of SMOTE for k = 6.
- Before over-sampling: M = 24 with Mmin = 7 (black circles) and Mmaj = 17 (blue diamonds).
- After over-sampling: M = 30 with Mmin = 13 (black circles and red triangles) and Mmaj = 17 (blue diamonds).
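The interpolation just described fits in a few lines. Below is a simplified, SMOTE-style sketch under stated assumptions (Euclidean neighbours, one synthetic sample per neighbour, toy 2-D data mirroring Figure 2); it is not the reference SMOTE implementation.

    import numpy as np

    def smote_like(minority: np.ndarray, k: int = 6, rng=None) -> np.ndarray:
        """Generate synthetic minority samples by linear interpolation (SMOTE-style)."""
        rng = rng or np.random.default_rng(0)
        x1 = minority[rng.integers(len(minority))]       # random minority sample
        d = np.linalg.norm(minority - x1, axis=1)        # distances to x1
        neighbours = minority[np.argsort(d)[1:k + 1]]    # k nearest minority neighbours
        u = rng.random((k, 1))                           # random positions in [0, 1]
        return x1 + u * (neighbours - x1)                # points on the joining segments

    minority = np.random.default_rng(1).normal(size=(7, 2))  # M_min = 7, as in Figure 2
    synthetic = smote_like(minority, k=6)
    print(synthetic.shape)                                   # (6, 2) new faulty samples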

B. GENERATIVE ADVERSARIAL NETWORKS (GAN)
SMOTE synthetically generates new, non-replicated minority samples to alleviate the over-fitting caused by random over-sampling. However, SMOTE tends to neglect the characteristics of the local distribution of data samples, as it considers the neighbourhood parameter k globally [50]. This results in generating overlapped and noisy samples [51]. Consequently, SMOTE is not guaranteed to create realistic faulty data samples for manufacturing applications [32].

Goodfellow et al. [30] propose Generative Adversarial Networks (GAN) as an alternative over-sampling method for creating synthetic data samples. GAN is a minimax two-player game with an objective function V(G, D) between a generator G and a discriminator D. The generator and discriminator are two neural networks defined by Multilayer Perceptrons (MLP) with weight vectors θg and θd, respectively. These two models compete against each other during the training process, and they are trained simultaneously.

GAN-based methods emerged originally as an over-sampling technique to create realistic images to improve the performance of learning algorithms in different applications [52]. In manufacturing applications, GAN-based methods were recently used to create faulty synthetic samples under an imbalanced dataset for improving the prediction of faults [32]. The main objective of GAN-based methods is to augment the original training data such that the number of faulty samples available for training models (after generating new synthetic faulty samples) is increased.

In the standard GAN, the generator G is trained to fool the discriminator D by capturing the underlying distribution of the real faulty data Df = {x^(i)} of variables x^(i) ∼ pdata(x^(i)), so that it can create synthetic faulty samples that are intended to come from the same distribution pdata(x^(i)) as the real faulty data. The discriminator D is trained to distinguish fake from real faulty data by estimating the probability that a given data sample originates from the real faulty samples. This zero-sum game between the generator and discriminator motivates both of them to improve their functionalities. The basic architecture of GAN is depicted in Figure 3 (left).

Formally, given faulty data Df = {x^(i)} of a variable x^(i) ∼ pdata(x^(i)), we wish to estimate pdata(x). To this end, we transform a prior white noise variable z ∼ p(z) through a generator G(z; θg), parametrised by MLP parameters θg, to produce a new synthetic (i.e., fake) faulty data sample. To this end, G implicitly defines a probability distribution


pg as the distribution of the faulty samples G(z) obtained when z ∼ pz. The discriminator D(x; θd), parametrised by MLP parameters θd, outputs the probability estimate that any x comes from the data distribution pdata(x). In principle, the discriminator D is a scalar function that is trained to maximise the probability of assigning the correct labels to faulty samples in the training data and to samples generated from G(z). In such a case, the discriminator is typically a traditional supervised learning method that is optimised to identify whether any given sample x is a real faulty data sample (i.e., x ∼ pdata) or a fake sample (i.e., sampled from the generator distribution, x ∼ pg).

FIGURE 3. Basic GAN (Left) and WGAN (Right) architectures.

Overall, the main goal of GAN is to learn a distribution pg over the faulty data samples such that pg is as close as possible to the original faulty data distribution pdata. In such a case, the generator represents a distribution over the distributions of the original data. The training procedure for the generator G is to maximise the probability that the discriminator D makes a mistake in identifying fake and real faulty samples. This is done by increasing the chances that D produces a high probability for fake examples, thus minimising E_{z∼pz(z)}[log(1 − D(G(z)))]. At the same time, the training process for the discriminator D aims to teach it to identify real faulty data samples accurately by maximising E_{x∼pdata(x)}[log D(x)]. Meanwhile, given a fake sample drawn from the generator, z ∼ pg(z), the discriminator is expected to output a probability D(G(z)) close to zero, by maximising E_{z∼pz(z)}[log(1 − D(G(z)))]. More specifically, the loss functions of the generator G and discriminator D are defined in Eq. 5 and Eq. 6, respectively:

$$\min_{G} V(G, D) = \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]}_{\text{optimise } G \text{ to generate better fake faulty samples to fool } D} \tag{5}$$

$$\max_{D} V(G, D) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]}_{D \text{ better identifies real faulty samples}} + \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]}_{D \text{ better identifies generated fake faulty samples}} \tag{6}$$

Formally, the discriminator D and the generator G play a two-player minimax game with the following main objective function V(G, D), which the generator tries to minimise while the discriminator tries to maximise. More specifically, V(G, D) (shown in Eq. 7) incorporates the loss functions of Eq. 5 and Eq. 6 as follows:

$$\min_{G}\max_{D} V(D, G) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]}_{D \text{ output for real faulty data}} + \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]}_{D \text{ output for generated fake faulty data } G(z)} \tag{7}$$

In Eq. 7, the generator does not have a direct effect on the first term log D(x) in the objective function. For the generator, minimising the loss is equivalent to minimising log(1 − D(G(z))).
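One practical aside, taken from the original GAN formulation in [30] rather than from this paper's method: when the discriminator confidently rejects early fake samples, log(1 − D(G(z))) saturates and yields vanishing gradients for G. A commonly used non-saturating alternative trains the generator to maximise log D(G(z)) instead,

$$\max_{G}\ \mathbb{E}_{z\sim p_z(z)}[\log D(G(z))],$$

which has the same fixed point as Eq. 5 but provides much stronger gradients early in training.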
1) GLOBAL OPTIMALITY IN GAN
The global optimality of V(D, G) is only achieved when both D and G are at their optimal values. In such a case, pg becomes very close to pdata. The training objective for the discriminator D can be described as maximising the log-likelihood for estimating the conditional probability P(Y = y|x), where Y = {0, 1} indicates whether x comes from the real failure data pdata (with y = 1) or is a synthetic or fake failure sampled from pg (with y = 0) [30]. The optimal discriminator should be able to identify the real failure data and generated

fake failures. This can be achieved when the real failure data distribution pdata and the generated failure data distribution pg are known. In such a case, the optimal discriminator D∗G(x) for any fixed generator G is expressed in Eq. 8. Indeed, when pg = pdata, the optimal discriminator is D∗G(x) = 1/2.

$$D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \tag{8}$$

For the optimal discriminator to maximise the quantity V(G, D), Eq. 6 can now be reformulated as in Eq. 9:

$$\max_{D} V(D^{*}, G) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D^{*}_{G}(x)]}_{D \text{ better identifies real faulty samples}} + \underbrace{\mathbb{E}_{x\sim p_g(x)}[\log(1 - D^{*}_{G}(x))]}_{D \text{ better identifies generated fake faulty samples}} \tag{9}$$

In such a case, V(D, G) in Eq. 9 has a value of − log 4 (the proof is included in [30]). Using Eq. 8, Eq. 9 can now be reformulated as:

$$\max_{D} V(D^{*}, G) = -\log 4 + KL\Big(p_{data}\,\Big\|\,\frac{p_{data} + p_g}{2}\Big) + KL\Big(p_g\,\Big\|\,\frac{p_{data} + p_g}{2}\Big) \tag{10}$$

where KL is the Kullback–Leibler divergence,² which measures how the probability distribution pdata diverges from a second expected probability distribution pg. Intuitively, KL has a minimum value of zero when pdata = pg. When pdata(x) gets close to zero and pg(x) is non-zero, the effect of pg(x) is ignored and disregarded [53]. However, both distributions are equally important. Moreover, it is very clear that the KL divergence is an asymmetric measure. The Jensen–Shannon Divergence (JSD)³ is another measure of similarity between two probability distributions. It is a symmetric measure and is bounded by [0, 1]. The advantage of using the symmetric JS divergence instead of the asymmetric KL divergence while training GAN has been discussed in [53], [54]. It turns out that the KL divergence is hard to optimise, and that the minimax converges to its equilibrium between the policies of the generator and discriminator when the policies can be updated during the training process while minimising the JSD [55]. To this end, Eq. 10 can be reformulated as follows:

$$V(D^{*}, G) = -\log 4 + 2\cdot JSD(p_{data}\,\|\,p_g) \tag{11}$$

As explained previously, the global optimality of V(D, G) is achieved when pg = pdata; in such a case, V(D∗, G) is obtained as in Eq. 11.

² KL(p||q) = E_p[log p(x)/q(x)] quantifies the KL divergence between two distributions p and q; it measures how one probability distribution p diverges from a second expected probability distribution q.
³ JSD(p||q) = ½ KL(p || (p+q)/2) + ½ KL(q || (p+q)/2) quantifies the JS divergence between two distributions p and q.

Algorithm 1: Minibatch Stochastic Gradient Descent Training of GAN for Generating Synthetic Faulty Samples
for number of training steps on faulty samples do
    for k steps to train the discriminator do
        • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
        • Sample a minibatch of m samples {x1, · · · , xm} from the data distribution pdata(x)
        • Update the discriminator model θd by ascending its stochastic gradient:
            ∇θd (1/m) Σ_{i=1}^{m} [log D(x^(i)) + log(1 − D(G(z^(i))))]
    end
    • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
    • Update the generator model parameters θg by descending its stochastic gradient:
            ∇θg (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i))))
end
Generate N synthetic faulty data samples from the learned generator model

2) GAN TRAINING PROCESS
The value functions of both players are typically defined in terms of their model parameters θg and θd. The discriminator aims to maximise V(θd, θg) while having control only over θd, whereas the generator aims to minimise V(θd, θg) while controlling only θg. The GAN training process (shown in Alg. 1) for the cost function V(G, D) (in Eq. 7) includes two gradient steps performed simultaneously: one updating θd to maximise V(D) and one updating θg to minimise V(G). Adam [56] is often used as a gradient-based optimisation algorithm for learning both models' parameters. To this end, the discriminator model parameters θd are learned for a minibatch of m failure examples by ascending the stochastic gradient in Eq. 12, while the generator model parameters θg are learned by descending the stochastic gradient in Eq. 13 (see Figure 3 (Left)).

$$\nabla_{\theta_d}\, \frac{1}{m}\sum_{i=1}^{m}\big[\log D(x^{(i)}) + \log(1 - D(G(z^{(i)})))\big] \tag{12}$$

$$\nabla_{\theta_g}\, \frac{1}{m}\sum_{i=1}^{m}\log(1 - D(G(z^{(i)}))) \tag{13}$$
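To make Alg. 1 and Eqs. 12-13 concrete, here is a hedged PyTorch sketch of a single GAN training iteration on tabular samples. The MLP sizes, learning rates and the binary cross-entropy form of the discriminator update are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    n_feat, n_z, m = 8, 16, 64                  # sensors, noise dim, minibatch (toy values)
    G = nn.Sequential(nn.Linear(n_z, 32), nn.ReLU(), nn.Linear(32, n_feat))
    D = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    bce = nn.BCELoss()                          # BCE on {real=1, fake=0} realises Eq. 6

    real = torch.randn(m, n_feat)               # stand-in for a minibatch of faulty samples

    # Discriminator step: ascend Eq. 12 (here: descend the equivalent BCE loss).
    z = torch.randn(m, n_z)
    fake = G(z).detach()                        # freeze G while updating D
    loss_d = bce(D(real), torch.ones(m, 1)) + bce(D(fake), torch.zeros(m, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: descend Eq. 13, i.e. minimise log(1 - D(G(z))).
    z = torch.randn(m, n_z)
    loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()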
C. CONDITIONAL GAN (CGAN)
In the standard GAN (shown in Alg. 1 and from [30]), the generative model G(z) is trained without any control over the type of faulty data being generated. In Conditional GAN (CGAN) [57], the generator and discriminator are conditioned on auxiliary information y, where y is the failure

class labels. The conditioning on the class failure label is formed by feeding the label y into both the generator and the discriminator. This allows the generator G(z|y) to learn to generate synthetic faulty data samples for a particular type of failure labelled through y. In such a case, the objective function of the minimax game of the standard GAN (in Eq. 7) can be formulated to incorporate the class failure label y as follows:

$$\min_{G}\max_{D} V(D, G) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D(x|y)]}_{D \text{ output for real faulty data of type } y} + \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z|y)))]}_{D \text{ output for generated fake faulty data of type } y} \tag{14}$$

Similar to the standard GAN, the CGAN training process for the cost function V(G, D) includes two simultaneous gradient steps. In principle, Eq. 12 and Eq. 13 are reformulated for CGAN as follows:

$$\nabla_{\theta_d}\, \frac{1}{m}\sum_{i=1}^{m}\big[\log D(x^{(i)}|y) + \log(1 - D(G(z^{(i)}|y)))\big] \tag{15}$$

$$\nabla_{\theta_g}\, \frac{1}{m}\sum_{i=1}^{m}\log(1 - D(G(z^{(i)}|y))) \tag{16}$$

D. WASSERSTEIN GAN (WGAN)
The traditional GAN may suffer from mode collapse, whereby the generator reaches a state in which it always produces the same synthetic output (e.g., the same faulty samples or image). Wasserstein GAN (WGAN) is an alternative to the traditional GAN. It employs the Wasserstein distance measure (W) between two probability distributions in the training process, which allows a smoother gradient. The main goal of WGAN is to provide an efficient approximation of the W distance that yields high synthetic sample quality. As such, WGAN averts mode collapse, as it follows a more stable training process and offers better learning for hyperparameter search [58]–[60]. In general, the Wasserstein distance W(pdata, pg) between the two distributions is defined as follows:

$$W(p_{data}, p_g) = \inf_{\gamma\,\in\,\Pi(p_{data},\, p_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x - y\rVert\big] \tag{17}$$

where Π(pdata, pg) is the set of all joint probability distributions γ(x, y) whose marginals are pdata and pg, respectively. In principle, γ(x, y) describes the percentage of the faulty data distribution that should be transported from x to y such that the distribution of real faulty samples pdata is transformed into pg, which will be used to generate synthetic faulty samples. More precisely, Eq. 17 indicates the cost of such an optimal transport plan. However, the infimum in Eq. 17 makes it intractable. Using the Kantorovich-Rubinstein duality [61], Eq. 17 can be simplified to Eq. 18:

$$W(p_{data}, p_g) = \sup_{\lVert f\rVert_L \le 1} \mathbb{E}_{x\sim p_{data}}[f(x)] - \mathbb{E}_{x\sim p_g}[f(x)] \tag{18}$$

where the supremum is over 1-Lipschitz functions f; the general form of a K-Lipschitz function is ||f||_L ≤ K for some Lipschitz constant K. A K-Lipschitz function satisfies the constraint |f(x1) − f(x2)| ≤ K|x1 − x2|, where K ≥ 0, ∀ x1, x2 ∈ R, and K is independent of x1 and x2 [58]. In such a case, ||f||_L ≤ 1 can be replaced (in Eq. 18) by ||f||_L ≤ K with K = 1. In order to solve Eq. 18, we suppose that the function f comes from a parameterised family of K-Lipschitz continuous functions {f_w}_{w∈W}. An assumption is made in [58] that by attaining the supremum in Eq. 18 for some w ∈ W, W(pdata, pg) can then be calculated. In principle, we can calculate the Wasserstein distance by finding a 1-Lipschitz function that can be learned by a DNN parameterised on weights w in a compact space W. By back-propagating via Eq. 18, W(pdata, pg) can be differentiated by estimating E_{z∼p(z)}[∇θ f_w(gθ(z))], where pg is the distribution of gθ(Z), with Z a random variable with density p, and f_w in the set of 1-Lipschitz functions.

In view of the Kantorovich-Rubinstein duality for calculating the Wasserstein distance, the value function V(G, D) of the minimax game for WGAN can be written as follows:

$$\min_{G}\max_{D\in\mathcal{F}} V(D, g_\theta) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[D(x)]}_{D \text{ output for real faulty data}} - \underbrace{\mathbb{E}_{z\sim p_z(z)}[D(g_\theta(z))]}_{D \text{ output for generated fake faulty data}} \tag{19}$$

where F is the set of 1-Lipschitz functions. Similar to GAN, WGAN has a discriminator D; however, it is not a classifier but rather a critic that scores the realness or fakeness of a given sample, as shown in Figure 3 (right). More precisely, it is a real-valued function that aims to learn w in a compact space W to find a good f_w [58]. To train the discriminator, Arjovsky et al. [58] specify a family of functions f_w by a DNN, and weight clipping is then applied to enforce the Lipschitz continuity. More precisely, to maintain the Lipschitz continuity of f_w during the training, upon every gradient update, the weights w are clipped to lie within a small window, e.g., [−0.01, 0.01]. This results in having a compact parameter space W and obtaining a bound on f_w that preserves the Lipschitz continuity.

Analogous to GAN, where the Jensen-Shannon (JS) divergence is implicitly employed, the learning in WGAN is based on minimising the Wasserstein distance between the real faulty sample distribution pdata and the learned generator distribution pg, which can be represented as in [58] as follows:

$$W(p_{data}, p_g) = \max_{w\in\mathcal{W}} \mathbb{E}_{x\sim p_{data}}[f_w(x)] - \mathbb{E}_{z\sim p(z)}[f_w(g_\theta(z))] \tag{20}$$

Given an optimal discriminator, which is also known as a critic (because it is not trained to classify), minimising the value function in Eq. 19 with respect to the generator's parameters minimises W(pdata, pg) in Eq. 20 [62]. The WGAN training procedure to generate faulty synthetic samples is summarised in Alg. 2, and the proof of optimality of WGAN is detailed in [58].
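As a companion to Eq. 20 and the weight-clipping rule (and to Alg. 2 below), the following hedged PyTorch sketch performs one critic update; the clip window c = 0.01 and RMSprop step follow the defaults reported in [58], while the toy network sizes and stand-in data are assumptions.

    import torch
    import torch.nn as nn

    n_feat, n_z, m, c = 8, 16, 64, 0.01
    critic = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU(), nn.Linear(32, 1))  # f_w: no sigmoid
    gen = nn.Sequential(nn.Linear(n_z, 32), nn.ReLU(), nn.Linear(32, n_feat))   # g_theta
    opt_w = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

    real = torch.randn(m, n_feat)               # stand-in minibatch of real faulty samples
    fake = gen(torch.randn(m, n_z)).detach()    # generated samples, G frozen for this step

    # One critic step: ascend E[f_w(x)] - E[f_w(g_theta(z))] (Eq. 20) via its negative.
    loss_w = -(critic(real).mean() - critic(fake).mean())
    opt_w.zero_grad(); loss_w.backward(); opt_w.step()

    # Weight clipping keeps f_w within a (crude) K-Lipschitz family, as in Alg. 2.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)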


Algorithm 2: WGAN for Generating Synthetic Faulty Samples
Require: α: learning rate, c: clipping parameter, m: batch size, ncritic: number of iterations of the critic per generator iteration, w0: initial critic parameters, and θ0: initial generator parameters
while θ has not converged do
    for t = 0, · · · , ncritic do
        • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
        • Sample a minibatch of m samples {x1, · · · , xm} from the data distribution pdata(x)
        g_w ← ∇_w [ (1/m) Σ_{i=1}^{m} f_w(x_i) − (1/m) Σ_{i=1}^{m} f_w(g_θ(z_i)) ]
        w ← w + α · RMSProp(w, g_w)
        w ← clip(w, −c, c)
    end
    • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
    g_θ ← −∇_θ (1/m) Σ_{i=1}^{m} f_w(g_θ(z_i))
    θ ← θ − α · RMSProp(θ, g_θ)
end
Generate N synthetic faulty data samples from the learned generator model

Algorithm 3: WGAN-GP: WGAN With Gradient Penalty λ for Generating Synthetic Faulty Samples
Require: λ: gradient penalty coefficient, ncritic: number of iterations of the critic per generator iteration, m: batch size, w0: initial critic parameters, α: Adam learning rate (i.e., step size), β1, β2: Adam exponential decay rates, and θ0: initial generator parameters
while θ has not converged do
    for t = 1, · · · , ncritic do
        for i = 1, · · · , m do
            • Sample a real sample x ∼ pdata
            • Sample a latent variable z ∼ pg(z)
            • Select a random number ε ∼ U[0, 1]
            x̃ ← G_θ(z)
            x̂ ← εx + (1 − ε)x̃
            L_i ← D(x̃) − D(x) + λ (||∇_x̂ D(x̂)||₂ − 1)²
        end
        w ← Adam(∇_w (1/m) Σ_{i=1}^{m} L_i, w, α, β1, β2)
    end
    Sample a minibatch of m noise/latent samples {z1, · · · , zm} from the noise prior pg(z)
    θ ← Adam(∇_θ (1/m) Σ_{i=1}^{m} −D_w(G_θ(z_i)), θ, α, β1, β2)
end
Generate N synthetic faulty data samples from the learned generator model
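The inner loop of Alg. 3 above hinges on the gradient-penalty term, formalised in Eq. 21 below. This hedged, self-contained PyTorch fragment sketches how that penalty could be computed for one minibatch; λ = 10 and the toy networks/data are assumptions, not the authors' settings.

    import torch
    import torch.nn as nn

    m, n_feat, lam = 64, 8, 10.0                 # batch, features, penalty coeff (assumed)
    critic = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU(), nn.Linear(32, 1))
    real = torch.randn(m, n_feat)                # stand-in real faulty minibatch
    fake = torch.randn(m, n_feat)                # stand-in generated minibatch x~

    eps = torch.rand(m, 1)                       # epsilon ~ U[0, 1], one per sample (Alg. 3)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

    # Critic loss of Alg. 3 / Eq. 21: E[D(fake)] - E[D(real)] + gradient penalty.
    loss_d = critic(fake).mean() - critic(real).mean() + penalty
    loss_d.backward()                            # gradients ready for an Adam step on w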
To enforce the Lipschitz constraint without clipping the discriminator's weights, Gulrajani et al. [62] introduced the WGAN gradient penalty (WGAN-GP), which incorporates a penalised gradient such that the norm of the gradient of the discriminator's output with respect to its input is constrained. This results in reformulating Eq. 19 as follows:

$$L(D) = \underbrace{\mathbb{E}_{x\sim p_g(x)}[D(x)] - \mathbb{E}_{x\sim p_{data}(x)}[D(x)]}_{\text{original critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{x}\sim p_{\hat{x}}}\big[(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1)^2\big]}_{\text{gradient penalty}} \tag{21}$$

where the last term is a soft version of the constraint, with a penalty on the gradient norm for random samples x̂ ∼ p(x̂). The training procedure for WGAN-GP, incorporating the gradient penalty, is summarised in Alg. 3, and a detailed discussion of how WGAN can be improved using Eq. 21 is given in [62].

E. WASSERSTEIN CGAN (WCGAN)
Wasserstein CGAN is often discussed under the name WCGAN as in [63] or CWGAN as in [64], [65]. Similar to the CGAN discussed in Section IV-C, the generator and discriminator in WCGAN can be conditioned on the failure class labels y as auxiliary information. WCGAN incorporates a penalised gradient to constrain the discriminator's output, analogous to the WGAN-GP shown in Alg. 3. However, differently from WGAN-GP in Alg. 3, the objective functions in WCGAN are conditioned on the type of failure y. As such, the objective function of the minimax game of WGAN-GP (in Eq. 21) can be formulated to incorporate the class failure label y as follows:

$$L(D) = \underbrace{\mathbb{E}_{x\sim p_g(x)}[D(x|y)] - \mathbb{E}_{x\sim p_{data}(x)}[D(x|y)]}_{\text{original critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{x}\sim p_{\hat{x}}}\big[(\lVert\nabla_{\hat{x}} D(\hat{x}|y)\rVert_2 - 1)^2\big]}_{\text{with gradient penalty}} \tag{22}$$

where the last term is a penalty on the gradient norm for random samples x̂ ∼ p(x̂). The main objective function of the generator, L(G), is expressed conditionally on a label y as follows:

$$L(G) = -\mathbb{E}_{x\sim p_g(x)}[D(x|y)] \tag{23}$$

Zheng et al. [64] have used WCGAN-GP as an over-sampling approach to generate realistic minority samples based on learning the real distribution of the available minority samples. In comparison to WGAN-GP, the authors show that incorporating the class label in WCGAN-GP increases the quality of the synthetic generated data. The same observation
differently from WGAN-GP in Alg. 3, the objective functions quality of synthetic generative data. The same observation


is drawn in [66], where the authors constrained their WCGAN-based model conditionally on some features (e.g. colours) to generate better cartoon images from sketch images. Furthermore, Qin and Jiang [63] have discussed the main shortcoming of the WGAN model in the speech enhancement task, which aims to improve the performance of speech systems in a noisy environment. Although WGAN learns the characteristics of the speech data well, the model tends to overfit when trained in low-data environments. To this end, the authors introduced an objective function improved upon WCGAN-GP to improve the performance of the speech enhancement task. The improved WCGAN-GP, conditioned on a variety of features in voice data, makes it possible to improve the speech quality. However, there was no comparison with other existing GAN-based models.

Liu et al. [67] have tackled the limited availability of training data for predicting illegal accesses in Internet of Vehicles (IoV) applications. In particular, WCGAN has been used to generate synthetic illegal-access data such that there is a balance between the legal and illegal access datasets. The simulation results show that WCGAN converges faster than the traditional GAN and thus improves the prediction accuracy and reduces the false-negative rate. However, there was no comparison between WCGAN and WGAN.

Recently, a WCGAN-based model has also been developed to generate faulty samples conditioned on fault categories to improve the performance of fault diagnosis in industrial applications [65]. In comparison to the traditional GAN, the proposed WCGAN-based model generates good-quality faulty synthetic samples. Hence, it improves the prediction accuracy for faults (by 3% after over-sampling faulty examples) and avoids over-fitting. However, there is no comparison to other GAN-based approaches.

Based on these studies, WCGAN-based methods tend to avoid over-fitting the training data and to converge faster in different applications, including industrial scenarios. In this work, we present the first comprehensive evaluation of different GAN-based methods for predictive analytics in manufacturing applications, including WCGAN.

analysis framework to answer these questions, where we propose four performance metrics. The first two extract two essential dataset features: the level of skewness or imbalance ratio, and the goodness of the allocated labels using the Silhouette coefficient. The third measures the effectiveness of the predictive analytics by quantifying revealing indicators such as the precision, recall and F1-scores. The last applies to data-based methods only, as it measures the fit of the synthetic samples in comparison with the real data samples.

A. IMBALANCE RATIO
In a dataset with two classes, the Imbalance Ratio (IR) is defined as the proportion of the number of samples in the majority class to the number of samples in the minority class. Referring to the problem formulation in Section II, the majority class in a manufacturing problem comprises the good samples labelled yj = 0, while the minority represents the defects labelled yj = 1, where j = {1, 2, · · · , M} and M is the total number of data samples. This ratio can be calculated using information entropy.

Entropy measures quantify the information about the outcome class, given the class distribution. Shannon's entropy is one of the most widely used entropy measures [68]. In general, given a dataset with k classes, let Y be a class variable or label with k modalities (i.e., the number of classes), Y = {y1, · · · , yk}, with frequentist probabilities p = (p1, · · · , pk), where Σ_{i=1}^{k} p_i = 1 and p_i ≥ 0 ∀ i = 1, . . . , k. The Shannon entropy H of the probability distribution can be computed as follows:

$$H = -\sum_{i=1}^{k} p_i \log p_i \tag{24}$$

where p_i = |c_i|/M is the frequentist probability of a class labelled y_i, with c_i the cluster of all |c_i| samples with label y_i out of a total of M data samples. With this metric, H → 0 if the dataset is very unbalanced, and H → log k if the data is balanced.
analytics in manufacturing applications, including WCGAN. is balanced. Normalising the entropy H in Eq. 24 by log k
Furthermore, we extend the evaluation to encompass data- gives Ĥ ∈ [0, 1]. Therefore, the imbalance ratio (IR) can be
based, algorithm-based and hybrid methods for improving defined as follows:
the prediction accuracy of manufacturing faults. To the
k
best of our knowledge, this is the first statistical analysis P
− pi logb pi
framework for measuring the effectiveness of dealing with i=1
data imbalance when used in predictive analytics. IR = (25)
logb k

V. PERFORMANCE METRICS
In the recent advances made to circumvent the challenge of an imbalanced dataset, it has become apparent that while some methods are successful for a given classification problem, they may ultimately fail in the imbalanced classification task. What are the governing factors that render a method successful? What are the dataset aspects that dictate the correct method to apply? What is an adequate metric to gauge the effectiveness of a method, particularly in the context of manufacturing? This is the first work that offers a statistical analysis framework to answer these questions, where we propose four performance metrics. The first two extract two essential dataset features: the level of skewness or imbalance ratio, and the goodness of the allocated label using the Silhouette coefficient. The third measures the effectiveness of the predictive analytics by quantifying revealing indicators such as the precision, recall and F1-scores. The last applies to data-based methods only, as it measures the fit of the synthetic samples in comparison with the real data samples.

A. IMBALANCE RATIO
In a dataset with two classes, the Imbalance Ratio (IR) is defined as the proportion of the number of samples in the majority class to the number of samples in the minority class. Referring to the problem formulation in Section II, the majority class in a manufacturing problem comprises the good samples labelled y_j = 0, while the minority class represents the defects labelled y_j = 1, where j ∈ {1, 2, · · · , M} and M is the total number of data samples. This ratio can be calculated using information entropy.

Entropy measures quantify the information about the outcome class, given the class distribution. Shannon's entropy is one of the most widely used entropy measures [68]. In general, given a dataset with k classes, let Y be a class variable or label with k modalities (i.e., the number of classes), Y = {y_1, · · · , y_k}, with frequentist probabilities p = (p_1, · · · , p_k), where Σ_{i=1}^{k} p_i = 1 and p_i ≥ 0 ∀ i = 1, . . . , k. The Shannon entropy H of the probability distribution can be computed as follows:

H = − Σ_{i=1}^{k} p_i log p_i    (24)

where p_i = |c_i| / M is the frequentist probability of a class labelled y_i, with c_i the cluster of all |c_i| samples with label y_i out of a total of M data samples. With this metric, H → 0 if the dataset is very unbalanced and H → log k if the data is balanced. Normalising the entropy H in Eq. 24 by log k gives Ĥ ∈ [0, 1]. Therefore, the imbalance ratio (IR) can be defined as follows:

IR = (− Σ_{i=1}^{k} p_i log_b p_i) / (log_b k)    (25)

where IR tends to 0 when the data is highly imbalanced and 1 when the data is balanced.
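For illustration, the IR of Eq. 25 can be computed from the class labels in a few lines of Python; the following is a minimal sketch using NumPy and is not tied to any particular dataset.

    import numpy as np

    def imbalance_ratio(labels):
        # Frequentist class probabilities p_i = |c_i| / M
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        # Shannon entropy (Eq. 24) normalised by log(k) (Eq. 25)
        return -np.sum(p * np.log(p)) / np.log(len(p))

    # Example: 1,000 minority vs 59,000 majority samples
    labels = np.array([1] * 1000 + [0] * 59000)
    print(imbalance_ratio(labels))  # ~0.12, i.e. highly imbalanced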


B. SILHOUETTE COEFFICIENT
In Section III, we have discussed some of the conditional generative methods that create synthetic data samples of the minority class dependent on its labels (e.g. type of quality issues or faults). In order to group the minority data samples into k clusters, clustering approaches are used to identify natural groupings of the minority class. The Silhouette coefficient is a single score that is widely used for measuring the quality of clustering results [69].

The Silhouette coefficient measures how well the clusters are separated, independently of the number of clusters. In principle, it measures how close (i.e., similar) each sample in a cluster is to the other samples in the same cluster when compared to the samples in other clusters [70]. The coefficient has a value ∈ [−1, 1]. A high value of the coefficient means a better structure for the clusters. The Silhouette coefficient can be obtained as follows:

s(i) = (b(i) − d(i)) / max{d(i), b(i)},  ∀ i = 1, · · · , M    (26)

where M is the total number of samples, d(i) is the average dissimilarity of the sample x_i = {a_{i,1}, · · · , a_{i,M}} to other samples with the same label y_i, and b(i) is the average dissimilarity of the same sample with respect to the closest cluster with a different label. The average value of s(i) over all samples measures the quality of how well the M samples in the input data are clustered.
data are clustered. Precision × Recall
F1-score = 2 × (30)
Precision + Recall
C. FAULT AND QUALITY PREDICTION Another informative measure is the receiver operating charac-
In a highly imbalanced dataset, predictive models tend teristic curve (ROC), which is a graphical plot that illustrates
to be biased towards the majority class; having a high the diagnostic ability of a binary classifier. The ROC curve
misclassification rate for the minority class. Such a bias is created by plotting the recall (in Eq. 28) or TPR against
in manufacturing predictive analytics is detrimental to the the FPR (in Eq. 29) at various discrimination threshold
quality and cost of the end product. Therefore, there is a settings. The discriminating threshold value controls the
need for the classifier to provide high accuracy in predicting classification decision based on the probabilistic output of the
the minority class (e.g. defects) without deteriorating the classifier (Please refer to Section VI-D for more details about
classification performance of the majority class [12], [71]. the discrimination threshold). Similarly, the precision-recall
Using a single criterion, such as the overall accuracy or error curve (PRC) is another interpretation of the output of a binary
rate, fails to discern the failed predictions related to defects classifier where the precision (in Eq. 27) is plotted against the
when the availability of faulty data samples is limited. To this recall (in Eq. 28) at various potential discrimination threshold
end, more informative evaluation metrics, such as precision settings.
and recall, are proposed to capture the effectiveness of the ROC curves are appropriate when the data is relatively
classification method in the presence of imbalanced data. balanced between classes – having an equal number of data
Precision quantifies the ratio of the correctly predicted samples for each class in the dataset. However, with severely
faulty samples among all predicted samples in a classification imbalanced class distributions, PRC curves tend to be a more
model. Recall is another metric that measures the ratio informative measure that offers to find an optimal threshold
of the correctly predicted faulty samples among all actual that achieves a right balance between precision and recall,
faulty samples in the dataset (regardless of the classification). hence considering class distributions. Thus, in the context of
Precision can be calculated by the following: manufacturing predictive analytics, we propose to calculate
TP the Precision, Recall, F1-score, and PRC curve in addition to
Precision = (27) general accuracy and success rate for an in-depth evaluation
TP + FP
of the effectiveness of the method adopted.
where TP represents the number of correctly classified faults,
and FP represents the number of faultless samples that have
D. DATA GENERATION
been incorrectly classified as faulty. Recall, on the other
hand, is expressed as follows, where FN is the number of Data-based methods for curtailing the challenges of imbal-
faulty samples that are classified as faultless and TP + FN anced dataset employ different over-sampling generative
represents the total number of actual faulty samples: approaches for the creation of new samples from the minority
class. These approaches aim to minimise the distance
TP between the real and generated distributions. Nonetheless,
Recall = (28)
TP + FN it is crucial to measure the quality of generated data by
Precision tends to measure how well the fault prediction quantifying how similar generated data and the original data
algorithm can reduce the number of faultless samples that are are [57], [72], [73]. To this end, we need to develop a
misclassified – reducing false alarms. The probability of false single score to compare the quality of generated samples
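For completeness, all the metrics above are available in scikit-learn. The snippet below is a small illustrative sketch (the label and probability vectors are placeholders), computing precision, recall, F1-score and the points of the PRC.

    from sklearn.metrics import (precision_score, recall_score,
                                 f1_score, precision_recall_curve)

    # Placeholder ground-truth labels and classifier outputs
    y_true = [0, 0, 0, 1, 1, 0, 1, 0]
    y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
    y_prob = [0.1, 0.2, 0.7, 0.9, 0.4, 0.3, 0.8, 0.2]  # p(y=1|x)

    print(precision_score(y_true, y_pred))  # TP / (TP + FP), Eq. 27
    print(recall_score(y_true, y_pred))     # TP / (TP + FN), Eq. 28
    print(f1_score(y_true, y_pred))         # Eq. 30

    # PRC: precision/recall at every candidate discrimination threshold
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)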


D. DATA GENERATION
Data-based methods for curtailing the challenges of an imbalanced dataset employ different over-sampling generative approaches for the creation of new samples from the minority class. These approaches aim to minimise the distance between the real and generated distributions. Nonetheless, it is crucial to measure the quality of generated data by quantifying how similar the generated data and the original data are [57], [72], [73]. To this end, we need to develop a single score to compare the quality of generated samples of different over-sampling methods. Several metrics and statistical tests for measuring the similarity between two distributions exist, such as f-divergence (e.g., Hellinger distance, Jensen–Shannon divergence) [73], the Wasserstein distance [74], which is also known as the Kantorovich–Rubinstein metric, and the Kolmogorov–Smirnov test (KS-test) [75], [76], among others.

The KS-test is widely used in hypothesis testing for the comparison of the cumulative distribution functions (cdf) of given distributions [77], [78]. Suppose the real and generated data have a size of R and G samples, respectively. Let F_R(x) and F'_G(x) be the cdf of the real and generated data distributions, respectively. The KS-statistic is defined as the maximum distance between the two cdfs:

D_RG = max_x |F_R(x) − F'_G(x)|    (31)

where D_RG is the maximum absolute difference between the cdfs of the distributions of the real and generated data samples. In principle, the KS-statistic has a value ∈ [0, 1] that defines the overlap between the two distributions; 0 for perfect overlap and 1 for no overlap. Therefore, the KS-dissimilarity can be thought of as the fractional difference between the two distributions [79]. A null hypothesis H_0: F = F' is defined to check their overlap, i.e. whether the distributions of the generated and real faulty data samples are statistically similar. To this end, a q-value is obtained representing the hypothesis probability, taking into account the comparison between D_RG and a critical value C(α), such that the null hypothesis is rejected at a level α if:

D_RG > C(α) √((R + G) / (R · G))    (32)

where C(α) is a size-independent function with α as the chosen significance level for statistical significance. For q < α, the null hypothesis is rejected. This means the distribution of the generated data does not converge to the real faulty data samples, and they are not statistically similar. In such a case, the generative method generates poor-quality data samples. In contrast, for q > α, the hypothesis is accepted, and this implies a high quality of generated data samples. In principle, the selected significance value of α impacts recognising the statistical significance between the two distributions. For a small α value, a substantial difference between the two distributions is required for rejecting the null hypothesis, indicating a higher D_RG value. On the other hand, a significantly large α means that small differences between the two distributions are magnified – leading the null hypothesis to be rejected regardless of small D_RG values.
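As an illustration, the two-sample KS-test of Eqs. 31–32 is available in SciPy. The snippet below is a minimal sketch with synthetic placeholder distributions; rejecting the null hypothesis at α = 0.05 indicates poor-quality generated samples.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=1000)       # placeholder real faulty samples
    generated = rng.normal(0.1, 1.1, size=1000)  # placeholder synthetic samples

    stat, q_value = ks_2samp(real, generated)    # stat corresponds to D_RG (Eq. 31)
    alpha = 0.05
    if q_value < alpha:
        print('Reject H0: generated and real distributions differ')
    else:
        print('Accept H0: generated samples match the real distribution')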
VI. FRAMEWORK EVALUATION
As introduced in Section I, the proposed framework is not tailored for a particular dataset but is designed to address the challenge of data imbalance that results from an IoT-rich (smart) manufacturing environment. This is purposely a large domain and encompasses smart environments such as industrial or mechanical systems. Indeed, our main contribution is to present such a framework that is not tailored to a particular dataset but is able to extract the pertinent characteristics of any given dataset to guide the user in selecting the appropriate ML approach. This section explains the real-world dataset used in our evaluation, which is based on heavy trucks' air pressure system. The details of the pre-processing phase, the classifiers' parameterisation, and the post-processing steps are also presented. The evaluation framework discussed in Section V is then used to assess the effectiveness of each method.

A. AIR PRESSURE FAILURE DATA
The Air Pressure System (APS) plays a vital role in heavy Scania trucks. In principle, APS is a system that generates pressurised air to be used in various functions in a truck, such as gear changes and braking. The APS Scania dataset was collected from heavy Scania trucks and was made available by the Industrial Challenge for IDA 2016 (https://ida2016.blogs.dsv.su.se/?page_id=1387). Each instance in the dataset is classified as positive or negative. The APS is, thus, an example of smart mechanical systems and fits the purpose of framework validation since it has a high degree of bias. Although the data is collected from trucks in operation, the insights drawn from analysing this data reflect on the quality of the manufacturing process and highlight manufacturing issues when linked to other datasets such as factory (e.g. factory identification) and batch numbers.

The positive class indicates that reported failures are related to the APS system, while the negative class includes all other instances that have other types of faults. Differently from the common binary classification in manufacturing datasets (faulty and faultless), in this case, both classes reflect faults. Nonetheless, the positive class remains the less frequent type of fault (APS-related), hence represents the minority class, whereas non-APS-related faults are many and form the majority class, i.e. the negative class.

The main goal is to develop a binary predictive model that correctly identifies failures related to APS. An APS failure that is not predicted prior to its occurrence would incur a drastic cost on Scania. Thus, the predictive model is expected to perform well in terms of accuracy in general, but also in avoiding false alarms (misclassifying other failures as APS) and missed alarms (i.e., misclassifying an APS failure as that of another component).

Scania has provided training and testing datasets for APS. The training set contains 60,000 rows, of which only 1,000 instances belong to the positive class (i.e., faulty samples that are related to APS) and 59,000 to the negative class (i.e., faulty samples that are not associated with APS). Each row includes 171 anonymised features, one of which is the label (i.e., target class) column that indicates APS-related faults (i.e., positive class) or other faults (i.e., negative class). The testing set consists of 16,000 instances, of which only 375 instances belong to the positive class.


The APS dataset is therefore biased, with a high imbalance ratio between positive and negative classes [13], [80]. Indeed, 1.7% of the entire training set represents APS-related faulty samples (i.e., positive samples), while the remaining 98.3% belongs to other faulty samples (i.e., negative samples). It is expected that, with such imbalanced data, the predictive models would tend to be biased towards the majority class (i.e., non-APS faulty samples) and less sensitive towards the minority class (i.e., APS-related faulty samples) [18].

In addition to the dataset, Scania has provided a misclassification cost metric for the APS dataset. As discussed previously in Section III-B, the cost of predicting false negatives (C_FN) is much higher than that of false positives, i.e. false alarms (C_FP). In the Scania APS scenario, the cost of the former is C_FN = €500, while the latter is C_FP = €10 [80]. In principle, C_FP refers to the cost of unnecessary checks carried out by a mechanic as a result of false alarms. On the other hand, C_FN refers to undetected APS-related faults which may cause the truck to break down; hence, it is the incurred cost of downtime and repair. By substituting these cost values in Eq. 2, the total cost that should be minimised for predicting APS-related faulty samples is expressed as follows:

TCost = m C_FP + n C_FN = m × 10 + n × 500    (33)

where n is the number of undetected APS-related faulty samples (missed alarms), and m is the number of false alarms.
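Eq. 33 translates directly into a scoring function that can be used during evaluation and hyperparameter tuning. The sketch below assumes the convention that the positive (APS) class is labelled 1; it is an illustration, not our exact implementation.

    def total_cost(y_true, y_pred, c_fp=10, c_fn=500):
        # m: false alarms (non-APS faults flagged as APS)
        m = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        # n: missed alarms (APS faults classified as non-APS)
        n = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        return m * c_fp + n * c_fn

    # Example: one false alarm and two missed alarms cost 10 + 1000 = 1010
    print(total_cost([1, 1, 0, 0, 1], [0, 0, 1, 0, 1]))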
The Scania APS dataset is suitable for the evaluation of the data-driven methods introduced in Section III-A for dealing with the imbalanced data problem in manufacturing. Furthermore, the manufacturer's cost-sensitive function gives a domain expert's perception of the impact of different misclassification errors on the manufacturing process. It can, therefore, be implemented as an effective algorithm-based approach (as discussed in Section III-B) as well as in hybrid approaches that combine both data-driven and algorithm-based methods (as discussed in Section III-C).
B. DATA PRE-PROCESSING
As shown in Figure 1, data pre-processing includes essential and optional procedures. Data cleaning is essential to any classification exercise and aims to compensate for missing values, among other issues. The APS dataset contains up to 82% missing values per feature. We have replaced the missing values by the mean imputation method as in [80]. In such a method, all missing values in a particular column in the dataset are substituted with the mean value of the available values in that column.
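A minimal sketch of this imputation step, together with the PCA projection applied next (described in the following paragraphs), is given below using pandas and scikit-learn. The file name, the 'na' missing-value marker and the 'pos'/'neg' label encoding are assumptions about the IDA 2016 challenge files, and the exact component count returned by PCA depends on the preceding steps.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.decomposition import PCA

    # Assumed file name and missing marker of the IDA 2016 APS data
    df = pd.read_csv('aps_failure_training_set.csv', na_values='na')
    y = (df['class'] == 'pos').astype(int)   # 1: APS-related fault (minority)
    X = df.drop(columns=['class'])

    # Mean imputation: each missing entry replaced by its column mean
    X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

    # Keep the smallest number of components explaining 95% of the
    # variance (11 on our data, see Appendix A)
    X_pca = PCA(n_components=0.95).fit_transform(X_imputed)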
The pre-processing phase also includes optional procedures for reducing the number of features of a multi-dimensional dataset to facilitate its interpretation. The APS dataset contains 171 features, one of which is the label class (i.e., to indicate if the faulty sample is APS-related or other). To reduce the number of features (i.e., dimensions), we use Principal Component Analysis (PCA). PCA simplifies the complexity of high-dimensional data by geometrically projecting it into fewer dimensions that maximise the variance, called Principal Components (PCs). The primary goal of PCA is to obtain the best summary of the data using a limited number of PCs [81]. In this process, we found that 11 principal components are sufficient to capture 95% of the information in the dataset – having a cumulative explained variance percentage of 95%. This number of principal components was found to be a good indication of the point of diminishing returns (i.e., little variance is gained by retaining additional principal components). We have included more details about the steps that yielded the number 11 in Appendix A.

Data augmentation is another optional pre-processing procedure, as discussed in Section III-A. We describe the methodology that we adopted with the APS dataset for different data augmentation techniques in Section IV. In the following paragraph, we elaborate on the selection of data augmentation methods used in the hybrid combinations of this study.

C. PREDICTIVE MODELS WITH APS IMBALANCED DATA
We have developed three different machine learning predictive models to detect APS-related faults (i.e., positive or minority class) or non-APS-related faults (i.e., negative or majority class). The three models are Logistic Regression (LR) [82], Random Forest (RF) [83] and XGBoost [84]. These models are represented as ML Classifiers in Figure 1. We have used these binary classification models to make a fair empirical comparison between the different counter-balancing techniques explained in Section IV. Given that the original APS dataset is highly imbalanced, it calls for procedures to generate synthetic APS-related failure data samples for the Ready-to-use training set, as shown in Figure 1. We have conducted a set of experiments that can be grouped into the following four cases. Each of these cases represents a group of experiments with common pre-processing methods but different classifiers. The performance metrics for these experiments are reported based on their total cost in addition to their achieved accuracy, precision, recall, and F1-score (discussed in Section V-C).
• Case I: evaluating the classifier algorithms using the original training data without data augmentation. Principal Component Analysis (PCA) is employed in the pre-processing phase to simplify the complexity of our high-dimensional data.
• Case II: using SMOTE to generate synthetic samples of APS-related faulty samples such that the original training dataset is augmented with new synthetic minority samples (a sketch of this augmentation step is given after this list). PCA is also applied in the pre-processing phase, and each experiment is repeated for different imbalance ratios (explained in Section V-A).
• Case III: using GAN and WGAN for data augmentation. PCA is also applied in the pre-processing phase, and each experiment is repeated for different imbalance ratios. Moreover, the quality of generated samples is


evaluated using the performance metric discussed in Section V-D.
• Case IV: clustering the APS-faulty samples; clusters are reported based on the metric discussed in Section V-B to measure the separability between clusters. More precisely, we measure the distance between each data sample, the centroid of its assigned cluster and the closest centroid belonging to another cluster. We then use CGAN and WCGAN to generate synthetic samples conditioned on the cluster assigned to minority data samples. For this set of experiments, we also apply PCA, and we calculate the same performance metrics as in Case III for various imbalance ratios.
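The Case II augmentation step referenced above can be sketched with imbalanced-learn as follows. The target of 10,000 minority samples corresponds to the IR = 0.6 setting; the training arrays are random placeholders standing in for the pre-processed APS data.

    import numpy as np
    from collections import Counter
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(60000, 11))            # placeholder PCA features
    y_train = np.array([1] * 1000 + [0] * 59000)      # 1: APS-related fault

    # Over-sample only the minority class to a chosen size,
    # leaving the 59,000 majority samples untouched
    smote = SMOTE(sampling_strategy={1: 10000}, random_state=42)
    X_res, y_res = smote.fit_resample(X_train, y_train)

    print(Counter(y_res))  # e.g. {0: 59000, 1: 10000}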
It should be noted that the four cases mentioned here aim to minimise the total cost of misclassification. To this end, we tune the hyperparameters of the classification algorithms using 5-fold cross-validation in such a way that the total cost of misclassification of APS-related faulty samples (in Eq. 33) is minimised. In other words, each of the described four cases implements algorithm-based methods to curtail the pitfalls of imbalanced datasets. Case I falls in the category of cost-sensitive approaches (discussed in Section III-B). Indeed, all three classifier algorithms (LR, RF, and XGBoost) in Case I are trained to minimise the cost matrix for binary misclassification without modifying the original training dataset. On the other hand, all other cases are considered hybrid methods (discussed in Section III-C). In each of these cases, the original training dataset is augmented by generating synthetic APS-related faulty samples (the number of minority class samples is increased relative to the majority class). At the same time, the three classifiers are trained, and their parameters are tuned to minimise the misclassification cost matrix. Thus, Cases II, III, and IV combine data-driven and algorithm-based approaches, hence fall under the hybrid category.

FIGURE 4. Decision boundary to find the optimal threshold th. y = 1 for APS faulty samples; y = 0 for non-APS samples. TP: True Positive, FP: False Positive (i.e., false alarm), TN: True Negative, and FN: False Negative.
D. THRESHOLD FOR IMBALANCED CLASSIFICATION
The RF, LR and XGBoost classifiers output a probability that estimates the likelihood of associating a class label with each of the data samples in the testing dataset. This probability gives some confidence in the label prediction. In principle, the output probability is then converted to a discrete class label in the post-processing phase (see Figure 1). In a binary classifier, this is achieved by using a threshold that is referred to as a decision threshold, a discrimination threshold or a cut-off. The default threshold value is often set to 50% or 0.5 when the dataset is balanced, as shown in Figure 4. However, a decision threshold of 0.5 may not provide an optimal interpretation of predicted probabilities in the case of imbalanced datasets. To this end, the decision threshold is moved along the x-axis in Figure 4 to compensate for the data imbalance and improve the prediction results. This is referred to as the threshold-moving method in the post-processing phase of classifiers [85].

The decision threshold is selected with the aim of minimising the probabilities of FP and FN, as shown in Figure 4. Referring to our problem formulation in Section II, label y = 1 represents the minority class whereas y = 0 refers to the majority class. Then, for a sample x, when a classifier outputs a probability p(y = 1|x) ≥ th, it is classified as a positive sample; otherwise as a negative sample. The threshold-moving method aims to adjust the value of the decision threshold in order to improve the predictions. Given a threshold centred at the value th = 0.5, suppose that a data sample x1 has a probability p(y = 1|x1) = 0.6 such that p(y = 1|x1) > p(y = 0|x1); it is thus classified as belonging to the minority class. Another data sample x2 with p(y = 0|x2) = 0.8 and p(y = 0|x2) > p(y = 1|x2) is classified as belonging to the majority class. If, however, the threshold is moved to th = 0.85 or 85%, then x2 will be assigned to the minority class instead of the majority because p(y = 0|x2) < th. Likewise, with th = 0.85, x1 would remain in the minority class since p(y = 0|x1) < th.

According to the threshold-moving method, a sample is only classified as belonging to the majority class if the classifier's confidence in this classification is higher than the set threshold. In our dataset, this implies that a sample is considered to be an APS-related fault unless the classifier is highly confident that the fault does not relate to APS. As discussed in Section V-C, the optimum value of the decision threshold for an imbalanced dataset is derived using the precision-recall curve (PRC).

In [13], the moving-threshold approach is associated with the confidence level of a given sample belonging to the majority class. The best results of predicting APS faulty samples were obtained using a decision threshold of th = 95%. In this moving-threshold application, the selected threshold determines whether an instance belongs to the majority class (i.e., non-APS-related faulty samples or negative samples). In principle, a data sample is only classified as a non-APS fault if the related confidence level exceeds the particular threshold (i.e., ≥ 95%); otherwise, it is classified as an APS-related fault.
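A minimal sketch of this selection procedure follows: sweep the candidate thresholds returned by the precision-recall curve and keep the one minimising the total cost of Eq. 33. The validation arrays are placeholders, and classifying a sample as an APS fault whenever p(y = 1|x) ≥ th is one of the two equivalent conventions described above.

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def best_threshold(y_val, p_val, c_fp=10, c_fn=500):
        # Candidate thresholds from the PRC on p(y=1|x)
        _, _, thresholds = precision_recall_curve(y_val, p_val)
        costs = []
        for th in thresholds:
            y_hat = (p_val >= th).astype(int)
            m = np.sum((y_val == 0) & (y_hat == 1))  # false alarms
            n = np.sum((y_val == 1) & (y_hat == 0))  # missed alarms
            costs.append(m * c_fp + n * c_fn)
        return thresholds[int(np.argmin(costs))]

    # Example with placeholder validation outputs
    y_val = np.array([0, 0, 1, 1, 0, 1])
    p_val = np.array([0.2, 0.4, 0.35, 0.8, 0.1, 0.6])
    print(best_threshold(y_val, p_val))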


TABLE 2. Case I: performance of classifiers on actual data (PCs = 11) without augmenting synthetic faulty data samples.

Following the same work on the APS dataset in [13], [80], [86], we incorporate a threshold-moving method in the post-processing phase of classifiers in our conducted experiments. In such a case, the precision-recall curve (PRC, explained previously in Section V-C) is utilised to obtain the list of potential threshold values. The value of this threshold is optimised whilst minimising the total misclassification cost: the best threshold is the one that achieves the minimum cost.
E. PARAMETER SETTINGS AND REPRODUCIBILITY
We have developed and evaluated the classification algorithms and the experiments explained above in Python. The scikit-learn (sklearn) library provides the main implementations of these algorithms [87]. In sklearn, we are able to incorporate class-specific weights in the loss function of the LR and RF algorithms. Similar to [13], the weight of each class is automatically adjusted to be inversely proportional to the class frequencies. SMOTE is also implemented and available in the imbalanced-learn API (https://imbalanced-learn.readthedocs.io/en/stable/api.html). In each experiment, we only mention the parameter settings if they are different from the default values in sklearn and the imbalanced-learn API. On the other hand, our GAN-based approaches are adapted from the available open-source GAN-Sandbox (https://github.com/mjdietzx/GAN-Sandbox). To ensure the reproducibility of our results, we have made the code and dataset of our implementations available and have also provided details of a configurable experimental set-up at: https://github.com/YasminFathy/HandleImbalancedDatasets
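For reference, the class-specific weighting mentioned above corresponds to the following scikit-learn configuration; this is an illustrative sketch, with the remaining hyperparameters left at their defaults.

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # 'balanced' re-weights each class inversely proportional to its
    # frequency, so errors on the rare APS class weigh more in the loss
    lr = LogisticRegression(class_weight='balanced', max_iter=1000)
    rf = RandomForestClassifier(class_weight='balanced', random_state=42)

    # lr.fit(X_train, y_train); rf.fit(X_train, y_train)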
VII. RESULTS AND DISCUSSION
In this section, we discuss and analyse the results of our four sets of experiments discussed in Section VI-C.

A. CASE I
Table 2 shows the results of Case I where LR, RF and XGBoost are trained on the original data and tested on the testing set. As mentioned previously, we apply PCA in the pre-processing phase and the moving-threshold method during the post-processing step, as detailed in Section VI-D. The reported results in Table 2 are derived after finding the optimal decision threshold empirically such that a minimum total cost is achieved. The threshold values for LR, RF and XGBoost are 65.5%, 70% and 98.2%, respectively. Each threshold value was obtained based on optimising the minimum cost while aiming to achieve a trade-off between precision and recall through using the precision-recall curve (PRC) (as discussed in Section V-C). In the Case I experiments, XGBoost achieves the best results with a total cost of €12,230, as highlighted in Table 2. Our results are aligned with the work in [49], where a boosting-based method achieves less total cost than RF.

The results obtained in Case I show that the accuracy is not an informative metric to measure the overall performance of an imbalanced binary classification. As shown in Table 2, LR achieves 94% accuracy, which is equivalent to only 1% deterioration compared to the XGBoost result of 95%. Furthermore, the results of both classifiers with respect to Recall and Precision are exactly the same. However, examining the total cost achieved by each classifier reveals that XGBoost indeed outperforms LR by a significant margin of 26%. On the other hand, in comparison with RF, LR has better accuracy by 1% and exactly the same F1-score. However, the cost reduction achieved by RF in comparison with LR is also a significant margin of 17%. These results further demonstrate the discussion around context-aware performance measurement presented in Section V. In fact, using a single criterion such as precision, recall, F1-score or accuracy fails to discern the critical performance of a classifier in the presence of imbalanced datasets.

B. CASE II
Table 3 shows the results of Case II where LR, RF and XGBoost are trained on the Ready-to-use training data to which we have applied PCA and data augmentation (for the minority class only) using SMOTE. The original training dataset has an imbalance ratio of IR = 0.1 (Eq. 25), which corresponds to 1,000 minority samples over 59,000 majority samples. Using SMOTE for data augmentation, we have a total number of 2,000, 5,000 and 10,000 minority samples for IR equal to 0.2, 0.4, and 0.6, respectively. We do not alter the number of majority samples and do not alter the testing dataset, as shown in Figure 1.

The results shown in Table 3 include those obtained in Case I (reported in Table 2) and are aligned in terms of classifier ranking. The best performance of each classifier for different IR settings is highlighted in bold. As can be seen, XGBoost outperforms RF, which is followed by LR. However, it is worth noting that despite the three data augmentation levels (IR = {0.2, 0.4, 0.6}), the performance gain of the classifiers remains limited. LR and RF achieve minimal cost reductions of 1.6% and 2.8%, respectively, whereas XGBoost does not benefit from data augmentation. XGBoost fails to reduce the number of false positives (i.e., false alarms) as more synthetic faulty data samples are


TABLE 3. Case II: performance of classifiers with augmenting artificial faulty data samples obtained using SMOTE to original data (with PCs = 11).

generated by SMOTE. Unlike LR and RF, which incorporate class-specific weights in the loss function, XGBoost weights misclassified samples equally; its performance often degrades with imbalanced data, and it has to be combined with other ensembling methods to improve imbalanced classification [88].

C. CASE III
Table 4 shows the results of Case III where LR, RF and XGBoost are trained on the Ready-to-use training data to which we have applied PCA and data augmentation (for the minority class only) using GAN. The same approach to data augmentation described in Section VII-B with respect to IR = {0.2, 0.4, 0.6} and the testing data is adopted, except that GAN is used instead of SMOTE. The real and generated distributions tend to be similar; this is measured using the KS-test (discussed in Section V-D). The same ranking between the three classifiers is maintained as in Cases I and II. Moreover, the same trend as in Case II is seen, whereby LR performs best with IR = 0.2 and RF performs best for IR = 0.6.

Although XGBoost still does not benefit from the data augmentation, RF and LR achieve better cost reduction than with SMOTE, with 8.9% and 8.7% improvement compared to Case I. Similar experiments have been conducted with WGAN for data augmentation.

GAN is a generative model that produces new content based on the presented training data; however, the generated data might be noisy [31]. For example, in [31], GAN improved the classification when compared with the original imbalanced dataset; however, it did not perform better than SMOTE. Our experiments show different behaviour when comparing the results obtained in Case I, Case II, and Case III with both the LR and RF classifiers. In both classifiers, SMOTE presents a gain compared to Case I, and GAN brings a larger gain compared to SMOTE. Indeed, we argue that the benefit of data augmentation and the superiority of GANs over SMOTE essentially depend on the underlying characteristics of the minority class and the data complexity.

Table 5 shows the results of Case III using WGAN with IR = {0.2, 0.4, 0.6}. The same ranking between the three classifiers is maintained as in all previous cases. However, we see a degradation in performance compared to the results achieved by GAN (Table 4). With WGAN, the cost reduction achieved by LR and RF shows a limited improvement of 4.0% and 5.1% compared to Case I.

Analogous to GAN, WGAN tends to converge faster and to be stable during the training. However, WGAN has no substantial effect on reducing the total classification cost in our experiments.

D. CASE IV
Table 6 and Table 7 show the results obtained by CGAN and WCGAN, respectively, for each of the classifiers and IR ratios. It is clear that generating synthetic data conditionally on the class label did not improve the results in our experiment. This outcome is also related to the underlying characteristics of the dataset. When calculating the Silhouette coefficient (SC) (discussed in Section V-B), which measures the quality of clusters in the APS dataset, we find SC = 0.4. This is obtained by clustering the APS faulty data samples


TABLE 4. Case III: performance of classifiers with augmenting artificial faulty data samples obtained using GAN to original data (with PCs = 11).

TABLE 5. Case III: performance of classifiers with augmenting artificial faulty data samples obtained using WGAN to original data (with PCs = 11).

using hierarchical clustering (i.e., agglomerative clustering). Having SC = 0.4 means that it is quite hard to separate the APS-related fault samples into clusters. For that reason, conditioning GAN on the cluster label does not bring any advantage. This outcome further proves our claim in Section V that there is no one-solution-fits-all when it comes to binary classification. Moreover, it is essential to examine the dataset at hand and understand its context to allow a pertinent choice of classification methods that would prove effective.


TABLE 6. Case IV: performance of classifiers with augmenting artificial faulty data samples obtained using CGAN to original data (with PCs = 11).

TABLE 7. Case IV: performance of classifiers with augmenting artificial faulty data samples obtained using WCGAN to original data (with PCs = 11).

Although WCGAN (in Table 7) shows a slight improvement over WGAN (in Table 5), with LR and RF achieving 6.5% and 6.8% gains compared to Case I, the benefit of conditional training remains limited due to the low Silhouette coefficient, as explained in Section VII-C. This is another demonstration of the necessity to understand the intrinsic characteristics of the data before selecting the data-augmentation method. With the APS dataset, using the Wasserstein distance between two probability distributions in the training process instead of the Kullback–Leibler


divergence does not improve the quality of the generated synthetic samples. This is clear from the comparison of the results in Table 4 and those in Table 5, and relates to the Silhouette coefficient of the dataset. Further investigation is left for future work.

VIII. LESSONS LEARNT AND CONCLUSIVE REMARKS
Industry 4.0 offers manufacturing the potential of leveraging IoT data with machine learning to reap the benefits of automation and excellence in quality control. A major obstacle currently facing the progress in this field is the nature of the data generated by manufacturing processes. Indeed, manufacturing data is overwhelmed with instances of good performance and contains few examples of malfunctioning; this is referred to as imbalanced data. Since machine learning is a data-driven learning process, imbalanced data results in biased learning. In the context of manufacturing, this learning bias causes most manufacturing faults to go unnoticed, compromising the quality of the end product. The results of our study are expected to improve decision-making in smart manufacturing by detecting unexpected faults that affect products' quality.

In this work, we have presented the first comprehensive comparative analysis of various methods in the literature that aim to curtail the curse of data imbalance in the ML process. We present an evaluation framework that considers all steps of the process, including data preparation and pre-processing, classifier design, and post-processing. More importantly, we set up a set of key performance indicators that jointly reveal the effectiveness of each method in a context-aware fashion.

We have applied our framework on an industry-based dataset, which enabled us to conduct the comparative analysis and draw key insights with particular application to smart manufacturing. We summarise key lessons learnt from this study that we hope will be a useful guide to future research in this field:
• All the experiments conducted in each of the cases in this study have been assessed using the proposed evaluation framework. The overarching insight that can be drawn from these results is that any of the key metrics, such as accuracy or F1-score, can be misleading when examined in isolation. As such, it is essential to inspect the full spectrum of performance metrics to have a complete evaluation of the ML techniques used in manufacturing.
• In binary classification, there is no one-solution-fits-all, as the optimum solution depends on the intrinsic characteristics of the dataset and the context in which the data is collected and the classification results are employed. Here are a few insights into the role of data and context in the effectiveness of ML methods:
  – Our experiments demonstrate that XGBoost, empowered with a cost-sensitive function and context-aware moving threshold, outperforms the Logistic Regression and Random Forest classifiers in every setting, even without data augmentation. This shows that a domain expert's input into the interpretation of the data is critical to the success of ML algorithms.
  – The similarity between the real and generated distributions is quantified using the KS-test to measure the quality of generated samples. The number of generated synthetic samples does not have a linear relation with the performance of the ML algorithm. In fact, some classifiers perform better with a larger synthetic dataset, and others do the opposite, as there is a risk of an unsuitable augmentation method generating noise instead of useful data samples. This is an effect of the interplay between the underlying data features, the data augmentation method, and the intrinsic method implemented in the classifier.
  – It is generally believed that conditional GANs (e.g., CGAN and WCGAN) tend to improve the classification performance in comparison with ordinary GANs (e.g., GAN and WGAN). However, our experiments have demonstrated that this is only true if applied to a compatible dataset. The APS dataset used in this work has a low Silhouette coefficient (i.e., a poor distinction between both classes); hence, conditioning the training process on the class label adds little benefit to the end results. In other words, the APS dataset is not compatible with conditional GAN derivatives when conditioned on the class label.
  – Augmenting artificial faulty data samples obtained by GAN achieves a larger gain compared to SMOTE (using logistic regression). More precisely, GAN achieves up to 9% cost reduction compared to the original data (with no augmentation) in Case III and 7% compared to SMOTE in Case II. On the other hand, SMOTE improves the predictive analytics (using random forest) and offers a higher gain compared with the original data (with no augmentation), with a cost reduction of up to 3% in Case II.
• Overall, it is not always practical to compare our results with existing similar studies. Most of the studies, including [49], [86], do not report the parameter settings of each chosen classifier, which makes it a challenging task to reproduce their experiments. For instance, the authors in [49] achieve a total cost of €11,090 using RF as a binary classifier on the original data; however, they do not report parameter settings, the imputation technique, whether a decision threshold is applied and its value, or other evaluation metrics including precision and recall.

APPENDIX
PRINCIPAL COMPONENTS
PCA finds a projection of high-dimensional data into a lower-dimensional subspace such that the maximum variance of the data is retained and the least-square reconstruction error is


minimised. Figure 5 shows the cumulative variance retained by different numbers of principal components. We found that 11 principal components are sufficient to capture 95% of the variance in the dataset, where little variance is gained by retaining additional principal components (i.e., 11 components seem to be a good indication of the point of diminishing returns).

FIGURE 5. The cumulative variance explained by different numbers of principal components.

REFERENCES
[1] F. Tao, Q. Qi, A. Liu, and A. Kusiak, ''Data-driven smart manufacturing,'' J. Manuf. Syst., vol. 48, pp. 157–169, Jul. 2018.
[2] A. Kanawaday and A. Sane, ''Machine learning for predictive maintenance of industrial machines using IoT sensor data,'' in Proc. 8th IEEE Int. Conf. Softw. Eng. Service Sci. (ICSESS), Nov. 2017, pp. 87–90.
[3] M. Syafrudin, G. Alfian, N. Fitriyani, and J. Rhee, ''Performance analysis of IoT-based sensor, big data processing, and machine learning model for real-time monitoring system in automotive manufacturing,'' Sensors, vol. 18, no. 9, p. 2946, Sep. 2018, doi: 10.3390/s18092946.
[4] J.-H. Oh, J. Y. Hong, and J.-G. Baek, ''Oversampling method using outlier detectable generative adversarial network,'' Expert Syst. Appl., vol. 133, pp. 1–8, Nov. 2019.
[5] B. Song, X. Zhou, H. Shi, and Y. Tao, ''Performance-indicator-oriented concurrent subspace process monitoring method,'' IEEE Trans. Ind. Electron., vol. 66, no. 7, pp. 5535–5545, Jul. 2019.
[6] B. Song, H. Yan, H. Shi, and S. Tan, ''Multisubspace elastic network for multimode quality-related process monitoring,'' IEEE Trans. Ind. Informat., vol. 16, no. 9, pp. 5874–5883, Sep. 2020.
[7] B. Song, H. Shi, S. Tan, and Y. Tao, ''Multi-subspace orthogonal canonical correlation analysis for quality related plant wide process monitoring,'' IEEE Trans. Ind. Informat., early access, Aug. 7, 2020, doi: 10.1109/TII.2020.3015034.
[8] Y. Fathy, P. Barnaghi, and R. Tafazolli, ''Large-scale indexing, discovery, and ranking for the Internet of Things (IoT),'' ACM Comput. Surv., vol. 51, no. 2, pp. 1–53, Jun. 2018.
[9] B. Krawczyk, ''Learning from imbalanced data: Open challenges and future directions,'' Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, Nov. 2016.
[10] G. Wang, A. Ledwoch, R. M. Hasani, R. Grosu, and A. Brintrup, ''A generative neural network model for the quality prediction of work in progress products,'' Appl. Soft Comput., vol. 85, Dec. 2019, Art. no. 105683.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ''SMOTE: Synthetic minority over-sampling technique,'' J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002.
[12] H. He and E. A. Garcia, ''Learning from imbalanced data,'' IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[13] C. F. Costa and M. A. Nascimento, ''IDA 2016 industrial challenge: Using machine learning for predicting failures,'' in Proc. Int. Symp. Intell. Data Anal. Cham, Switzerland: Springer, 2016, pp. 381–386.
[14] J. Wang, Y. Ma, L. Zhang, R. X. Gao, and D. Wu, ''Deep learning for smart manufacturing: Methods and applications,'' J. Manuf. Syst., vol. 48, pp. 144–156, Jul. 2018.
[15] G. M. Weiss, ''Mining with rarity: A unifying framework,'' ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 7–19, Jun. 2004.
[16] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano, ''Experimental perspectives on learning from imbalanced data,'' in Proc. 24th Int. Conf. Mach. Learn. (ICML), 2007, pp. 935–942.
[17] C. K. Aridas, S. Karlos, V. G. Kanas, N. Fazakis, and S. B. Kotsiantis, ''Uncertainty based under-sampling for learning naive Bayes classifiers under imbalanced data sets,'' IEEE Access, vol. 8, pp. 2122–2133, 2020.
[18] J. M. Johnson and T. M. Khoshgoftaar, ''Survey on deep learning with class imbalance,'' J. Big Data, vol. 6, no. 1, p. 27, Dec. 2019.
[19] W.-C. Lin, C.-F. Tsai, Y.-H. Hu, and J.-S. Jhang, ''Clustering-based undersampling in class-imbalanced data,'' Inf. Sci., vols. 409–410, pp. 17–26, Oct. 2017.
[20] I. Mani and I. Zhang, ''kNN approach to unbalanced data distributions: A case study involving information extraction,'' in Proc. Workshop Learn. Imbalanced Datasets, vol. 126, Aug. 2003, pp. 1–7.
[21] Y. Hou, B. Li, L. Li, and J. Liu, ''A density-based under-sampling algorithm for imbalance classification,'' J. Phys., Conf. Ser., vol. 1302, Aug. 2019, Art. no. 022064.
[22] F. Kamalov, ''Kernel density estimation based sampling for imbalanced class distribution,'' Inf. Sci., vol. 512, pp. 1192–1201, Feb. 2020.
[23] A. Fernández, S. del Río, N. V. Chawla, and F. Herrera, ''An insight into imbalanced big data classification: Outcomes and challenges,'' Complex Intell. Syst., vol. 3, no. 2, pp. 105–120, Jun. 2017.
[24] D.-H. Lee, J.-K. Yang, C.-H. Lee, and K.-J. Kim, ''A data-driven approach to selection of critical process steps in the semiconductor manufacturing process considering missing and imbalanced data,'' J. Manuf. Syst., vol. 52, pp. 146–156, Jul. 2019.
[25] K. Yang, Z. Yu, X. Wen, W. Cao, C. L. Philip Chen, H.-S. Wong, and J. You, ''Hybrid classifier ensemble for imbalanced data,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 4, pp. 1387–1400, Apr. 2020.
[26] Y. O. Lee, J. Jo, and J. Hwang, ''Application of deep neural network and generative adversarial network to industrial maintenance: A case study of induction motor fault detection,'' in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2017, pp. 3248–3253.
[27] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, ''Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem,'' in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining. Berlin, Germany: Springer, 2009, pp. 475–482.
[28] H. Han, W.-Y. Wang, and B.-H. Mao, ''Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,'' in Proc. Int. Conf. Intell. Comput. Berlin, Germany: Springer, 2005, pp. 878–887.
[29] H. He, Y. Bai, E. A. Garcia, and S. Li, ''ADASYN: Adaptive synthetic sampling approach for imbalanced learning,'' in Proc. IEEE Int. Joint Conf. Neural Netw. (IEEE World Congr. Comput. Intell.), Jun. 2008, pp. 1322–1328.
[30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ''Generative adversarial nets,'' in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[31] F. H. K. dos Santos Tanaka and C. Aranha, ''Data augmentation using GANs,'' 2019, arXiv:1904.09135. [Online]. Available: http://arxiv.org/abs/1904.09135
[32] G. D. Ranasinghe and A. Kumar Parlikad, ''Generating real-valued failure data for prognostics under the conditions of limited data availability,'' in Proc. IEEE Int. Conf. Prognostics Health Manage. (ICPHM), Jun. 2019, pp. 1–8.
[33] C. Elkan, ''The foundations of cost-sensitive learning,'' in Proc. Int. Joint Conf. Artif. Intell., vol. 17, no. 1, 2001, pp. 973–978.
[34] A. Ali, S. M. Shamsuddin, and A. L. Ralescu, ''Classification with class imbalance problem: A review,'' Int. J. Advance Soft Compu. Appl., vol. 7, no. 3, pp. 176–204, 2015.
[35] Y. Hu, C. Guo, E. W. T. Ngai, M. Liu, and S. Chen, ''A scalable intelligent non-content-based spam-filtering framework,'' Expert Syst. Appl., vol. 37, no. 12, pp. 8557–8565, Dec. 2010.
[36] C. X. Ling and V. S. Sheng, ''Cost-sensitive learning and the class imbalance problem,'' Encyclopedia Mach. Learn., vol. 2011, pp. 231–235, Jan. 2008.
[37] B. Krawczyk, M. Woźniak, and F. Herrera, ''On the usefulness of one-class classifier ensembles for decomposition of multi-class problems,'' Pattern Recognit., vol. 48, no. 12, pp. 3969–3982, Dec. 2015.
[38] L. Rokach, ''Ensemble-based classifiers,'' Artif. Intell. Rev., vol. 33, nos. 1–2, pp. 1–39, 2010.
pp. 144–156, Jul. 2018. nos. 1–2, pp. 1–39, 2010.


[39] L. Breiman, ‘‘Bagging predictors,’’ Mach. Learn., vol. 24, no. 2, [65] Y. Yu, B. Tang, R. Lin, S. Han, T. Tang, and M. Chen, ‘‘CWGAN:
pp. 123–140, Aug. 1996. Conditional wasserstein generative adversarial nets for fault data gener-
[40] Y. Freund and R. E. Schapire, ‘‘A decision-theoretic generalization of on- ation,’’ in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019,
line learning and an application to boosting,’’ J. Comput. Syst. Sci., vol. 55, pp. 2713–2718.
no. 1, pp. 119–139, Aug. 1997. [66] Y. Liu, Z. Qin, T. Wan, and Z. Luo, ‘‘Auto-painter: Cartoon image
[41] J. H. Friedman, ‘‘Greedy function approximation: A gradient boosting generation from sketch by using conditional wasserstein generative
machine,’’ Ann. Statist., vol. 29, pp. 1189–1232, Oct. 2001. adversarial networks,’’ Neurocomputing, vol. 311, pp. 78–87, Oct. 2018.
[42] L. I. Kuncheva, Combining Pattern Classifiers: Methods Algorithms. [67] Y. Liu, M. Xiao, Y. Zhou, D. Zhang, J. Zhang, H. Gacanin, and J. Pan,
Hoboken, NJ, USA: Wiley, 2014. ‘‘An access control mechanism based on risk prediction for the IoV,’’ in
[43] M. Woźniak, M. Graña, and E. Corchado, ‘‘A survey of multiple classifier Proc. IEEE 91st Veh. Technol. Conf. (VTC-Spring), May 2020, pp. 1–5.
systems as hybrid systems,’’ Inf. Fusion, vol. 16, pp. 3–17, Mar. 2014. [68] C. E. Shannon, ‘‘A mathematical theory of communication,’’ ACM
[44] Q. Cao and S. Wang, ‘‘Applying over-sampling technique based on data SIGMOBILE Mobile Comput. Commun. Rev., vol. 5, no. 1, pp. 3–55, 2001.
density and cost-sensitive SVM to imbalanced learning,’’ in Proc. Int. Conf. [69] P. J. Rousseeuw, ‘‘Silhouettes: A graphical aid to the interpretation and
Inf. Manage., Innov. Manage. Ind. Eng., Nov. 2011, pp. 543–548. validation of cluster analysis,’’ J. Comput. Appl. Math., vol. 20, pp. 53–65,
Nov. 1987.
[45] S. Wang, Z. Li, W. Chao, and Q. Cao, ‘‘Applying adaptive over-sampling
[70] Y. Fathy, P. Barnaghi, S. Enshaeifar, and R. Tafazolli, ‘‘A distributed in-
technique based on data density and cost-sensitive SVM to imbalanced
network indexing mechanism for the Internet of Things,’’ in Proc. IEEE
learning,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jun. 2012,
3rd World Forum Internet Things (WF-IoT), Dec. 2016, pp. 585–590.
pp. 1–8.
[71] J. Stefanowski, ‘‘Dealing with data difficulty factors while learning from
[46] C. Drummond and R. C. Holte, ‘‘C4. 5, class imbalance, and cost imbalanced data,’’ in Challenges in Computational Statistics and Data
sensitivity: Why under-sampling beats over-sampling,’’ in Proc. ICML Mining. Cham, Switzerland: Springer, 2016, pp. 333–363.
Workshop Learn. Imbalanced Datasets, vol. 11, 2003, pp. 1–8. [72] T. Chavdarova and F. Fleuret, ‘‘SGAN: An alternative training of
[47] F. Li, X. Zhang, X. Zhang, C. Du, Y. Xu, and Y.-C. Tian, ‘‘Cost-sensitive generative adversarial networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
and hybrid-attribute measure multi-decision tree over imbalanced data Pattern Recognit., Jun. 2018, pp. 9407–9415.
sets,’’ Inf. Sci., vol. 422, pp. 242–256, Jan. 2018. [73] W. Li, W. Ding, R. Sadasivam, X. Cui, and P. Chen, ‘‘His-GAN: A
[48] N.-N. Zhang, S.-Z. Ye, and T.-Y. Chien, ‘‘Imbalanced data classification histogram-based GAN model to improve data generation quality,’’ Neural
based on hybrid methods,’’ in Proc. 2nd Int. Conf. Big Data Res. ICBDR, Netw., vol. 119, pp. 31–45, Nov. 2019.
2018, pp. 16–20. [74] A. Ramdas, N. Trillos, and M. Cuturi, ‘‘On wasserstein two-sample testing
[49] G. D. Ranasinghe, T. Lindgren, M. Girolami, and A. K. Parlikad, and related families of nonparametric tests,’’ Entropy, vol. 19, no. 2, p. 47,
‘‘A methodology for prognostics under the conditions of limited failure Jan. 2017.
data availability,’’ IEEE Access, vol. 7, pp. 183996–184007, 2019. [75] H. W. Lilliefors, ‘‘On the kolmogorov-smirnov test for normality with
[50] K. Cheng, C. Zhang, H. Yu, X. Yang, H. Zou, and S. Gao, ‘‘Grouped mean and variance unknown,’’ J. Amer. Stat. Assoc., vol. 62, no. 318,
SMOTE with noise filtering mechanism for classifying imbalanced data,’’ pp. 399–402, Jun. 1967.
IEEE Access, vol. 7, pp. 170668–170681, 2019. [76] J. L. Hodges, ‘‘The significance probability of the smirnov two-sample
[51] A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, ‘‘SMOTE for test,’’ Arkiv För Matematik, vol. 3, no. 5, pp. 469–486, Jan. 1958.
learning from imbalanced data: Progress and challenges, marking the 15- [77] F. Luo and S. Mehrotra, ‘‘Distributionally robust optimization with
year anniversary,’’ J. Artif. Intell. Res., vol. 61, pp. 863–905, Apr. 2018. decision dependent ambiguity sets,’’ Optim. Lett., vol. 14, pp. 1–30,
[52] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, ‘‘CVAE-GAN: Fine-grained Jan. 2020.
image generation through asymmetric training,’’ in Proc. IEEE Int. Conf. [78] C. Saez, M. Robles, and J. M. Garcia-Gomez, ‘‘Comparative study of
Comput. Vis. (ICCV), Oct. 2017, pp. 2745–2754. probability distribution distances to define a metric for the stability of
[53] L. Weng, ‘‘From GAN to WGAN,’’ 2019, arXiv:1904.08994. [Online]. multi-source biomedical research data,’’ in Proc. 35th Annu. Int. Conf.
Available: http://arxiv.org/abs/1904.08994 IEEE Eng. Med. Biol. Soc. (EMBC), Jul. 2013, pp. 3226–3229.
[54] F. Huszár, ‘‘How (not) to train your generative model: Scheduled sampling, [79] P. Vermeesch, ‘‘Dissimilarity measures in detrital geochronology,’’ Earth-
likelihood, adversary?’’ 2015, arXiv:1511.05101. [Online]. Available: Sci. Rev., vol. 178, pp. 310–321, Mar. 2018.
http://arxiv.org/abs/1511.05101 [80] J. Biteus and T. Lindgren, ‘‘Planning flexible maintenance for heavy
[55] I. Goodfellow, ‘‘NIPS 2016 tutorial: Generative adversarial networks,’’ trucks using machine learning models, constraint programming, and route
2017, arXiv:1701.00160. [Online]. Available: http://arxiv.org/abs/1701. optimization,’’ SAE Int. J. Mater. Manuf., vol. 10, no. 3, pp. 306–315,
00160 Mar. 2017.
[81] J. Lever, M. Krzywinski, and N. Altman, ‘‘Principal component anal-
[56] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimiza-
ysis,’’ Nature Methods, vol. 14, pp. 641–642, 2017. [Online]. Avail-
tion,’’ 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/
able: https://www.nature.com/articles/nmeth.4346#citeas, doi: 10.1038/
1412.6980
nmeth.4346.
[57] M. Mirza and S. Osindero, ‘‘Conditional generative adversarial nets,’’
[82] R. E. Wright, ‘‘Logistic regression,’’ in Reading and Understanding
2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/1411.
Multivariate Statistics, L. G. Grimm and P. R. Yarnold, Eds. Washington,
1784
DC, USA: American Psychological Association, 1995, pp. 217–244.
[58] M. Arjovsky, S. Chintala, and L. Bottou, ‘‘Wasserstein GAN,’’ 2017,
[Online]. Available: https://psycnet.apa.org/record/1995-97110-007
arXiv:1701.07875. [Online]. Available: http://arxiv.org/abs/1701.07875 [83] L. Breiman, ‘‘Random forests,’’ Mach. Learn., vol. 45, no. 1, pp. 5–32,
[59] S. Bhatia and R. Dahyot, ‘‘Using WGAN for improving imbalanced 2001.
classification performance,’’ in Proc. AICS, 2019, pp. 365–375. [84] T. Chen and C. Guestrin, ‘‘XGBoost: A scalable tree boosting system,’’
[60] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang, ‘‘Improving the improved in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
training of wasserstein GANs: A consistency term and its dual effect,’’ in Aug. 2016, pp. 785–794.
Proc. Int. Conf. Learn. Represent., Feb. 2018, pp. 1–17. [85] H. He and Y. Ma, Imbalanced Learning: Foundations, Algorithms, and
[61] C. Villani, Optimal Transport: Old New, vol. 338. Berlin, Germany: Applications. Hoboken, NJ, USA: Wiley, 2013.
Springer, 2008. [86] C. Gondek, D. Hafner, and O. R. Sampson, ‘‘Prediction of failures in the
[62] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, air pressure system of scania trucks using a random forest and feature
‘‘Improved training of wasserstein gans,’’ in Proc. Adv. Neural Inf. Process. engineering,’’ in Proc. Int. Symp. Intell. Data Anal. Cham, Switzerland:
Syst., 2017, pp. 5767–5777. Springer, 2016, pp. 398–402.
[63] S. Qin and T. Jiang, ‘‘Improved wasserstein conditional generative [87] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
adversarial network speech enhancement,’’ EURASIP J. Wireless Commun. M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas,
Netw., vol. 2018, no. 1, p. 181, Dec. 2018. ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach. Learn. Res., vol. 12,
[64] M. Zheng, T. Li, R. Zhu, Y. Tang, M. Tang, L. Lin, and Z. Ma, pp. 2825–2830, Oct. 2011.
‘‘Conditional wasserstein generative adversarial network-gradient penalty- [88] C. Wang, C. Deng, and S. Wang, ‘‘Imbalance-XGBoost: Leveraging
based approach to alleviating imbalanced data classification,’’ Inf. Sci., weighted and focal losses for binary label-imbalanced classification with
vol. 512, pp. 1009–1023, Feb. 2020. XGBoost,’’ Pattern Recognit. Lett., vol. 136, pp. 190–197, Aug. 2020.
YASMIN FATHY received the M.Sc. degree in artificial intelligence (AI) from the AI Laboratory, Vrije Universiteit Brussel (VUB), Belgium, and the Ph.D. degree from the Institute of Communication Systems (ICS), University of Surrey. She was a Research Associate with the Computer Science Department, University College London (UCL). She is currently a Research Associate with the Department of Engineering, University of Cambridge, and a Fellow of the Higher Education Academy. Her research interests include the Internet of Things, machine learning, and data analytics.

MONA JABER (Member, IEEE) received the B.E. degree in computer and communications engineering and the M.E. degree in electrical and computer engineering from the American University of Beirut, Beirut, Lebanon, in 1996 and 2014, respectively, and the Ph.D. degree from the 5G Innovation Centre, University of Surrey, in 2017. Her Ph.D. research was on 5G backhaul innovations. She was a Telecommunications Consultant with various international firms, focusing on the radio design of cellular networks, including GSM, GPRS, UMTS, and HSPA. She led the IoT Research Group, Fujitsu Laboratories of Europe, from 2017 to 2019, where she focused in particular on automotive applications. She is currently a Lecturer in Internet of Things with the School of Electronic Engineering and Computer Science, Queen Mary University of London. Her research interests include cyber-physical systems, data-driven digital twins, and AI/ML applications in the automotive industry.

ALEXANDRA BRINTRUP received the Ph.D. degree from Cranfield University, Cranfield, U.K. She is currently a Lecturer in digital manufacturing with the University of Cambridge, Cambridge, U.K. She develops intelligent systems to help organizations navigate through complexity. Her main work in this area includes system development for digitized product lifecycle management. She uses artificial intelligence paradigms, particularly for data analytics and automated decision making. She held postdoctoral and fellowship appointments with the University of Cambridge and the University of Oxford. She teaches operations management and decision engineering. Her research interests include the modeling, analysis, and control of dynamical and functional properties of emergent manufacturing networks.