
Received November 27, 2020, accepted December 17, 2020, date of publication December 28, 2020, date of current version January 6, 2021.

Digital Object Identifier 10.1109/ACCESS.2020.3047838

Learning With Imbalanced Data in Smart Manufacturing: A Comparative Analysis

YASMIN FATHY¹, MONA JABER² (Member, IEEE), AND ALEXANDRA BRINTRUP¹
¹Department of Engineering, University of Cambridge, Cambridge CB3 0FS, U.K.
²School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4FZ, U.K.

Corresponding author: Yasmin Fathy ([email protected])


This work was supported by Research England through the Pitch-In Project (Promoting the Internet of Things via Collaborations between HEIs and Industry).

ABSTRACT The Internet of Things (IoT) paradigm is revolutionising the world of manufacturing into what is known as Smart Manufacturing or Industry 4.0. A main pillar of smart manufacturing is harnessing IoT data and leveraging machine learning (ML) to automate the prediction of faults, thus cutting maintenance time and cost and improving product quality. However, faults in real industries are overwhelmingly outweighed by instances of good performance (faultless samples); this bias is reflected in the data captured by IoT devices. Imbalanced data limits the success of ML in predicting faults and thus presents a significant hindrance to the progress of smart manufacturing. Although various techniques have been proposed to tackle this challenge in general, this work is the first to present a framework for evaluating the effectiveness of these remedies in the context of manufacturing. We present a comprehensive comparative analysis in which we apply our proposed framework to benchmark the performance of different combinations of algorithm components using a real-world manufacturing dataset. We draw key insights into the effectiveness of each component and the inter-relatedness between the dataset, the application context, and the design of the ML algorithm.

INDEX TERMS Manufacturing analytics, generative modeling, smart manufacturing, imbalanced data,
limited failure data, generating synthetic data.

I. INTRODUCTION
Manufacturing process is a broad terminology that encompasses the production of final products either handmade, machine-made, or hybrid. This often entails the transformation of raw material into the final product through various mechanical, chemical, or other industrial processes. Smart Manufacturing (SM), also referred to as Industry 4.0, has transformed the traditional linear manufacturing into a dynamic and digital ecosystem to improve product quality, operations efficiency, and production yield. The Internet of Things (IoT) is a network of connected devices, such as sensors and actuators, that gives operators access to data from the real world in real time whilst working from the comfort of their desks. In the context of SM, IoT offers infinite possibilities in terms of remote monitoring, maintenance, and control of operations. A manufacturing process is typically a chain of complex tasks where the quality of the final product depends on the entire chain of production. By embracing the IoT paradigm, SM leverages IoT data to drive and automate intelligent decisions, thus improving the efficiency and quality of each task in the chain, not least the final product.

In traditional manufacturing, intermediate quality tests are performed during the production process to detect quality issues and identify defective batches. However, these tests are time-consuming and not conclusive, as they do not allow for full physical inspection of a production line. On the other hand, the early detection of defects at early product processing stages is one of the most effective ways of reducing costs, saving time, and boosting operational efficiency [1]. SM empowered with IoT and machine learning (ML) techniques offers the ability of early defect detection and the opportunity to capitalise on its benefits. IoT sensor data has been used for predictive maintenance in industrial machines [2] and automotive manufacturing [3]. These works, however, are not designed to deal with data imbalance. For instance, the authors in [3] use cluster-based


methods for detecting and disregarding data outliers in order to improve fault detection during the manufacturing process. Other existing works such as [4] show that classification performance for imbalanced data can be improved when data outliers are eliminated.

Modern prediction systems rely on IoT data and ML techniques to predict the expected quality level of forthcoming products. Predictive analytics is one application of ML that analyses current and historical data to make predictions about future events. With predictive analytics, SM manufacturers can discover intricate patterns in collected IoT data, identify processing batches that drop below a defined quality level, and select the best course of action among multiple options. Consequently, quality engineers can use this information to either immediately adjust the process parameters or stop the processing of a particular defective batch.

In today's multimode manufacturing processes, not all anomalies or faults affect the product's quality. Thus, in order to make quality monitoring more purposeful and accurate, a tailored performance indicator method has been proposed by Song et al. [5]. The proposed method considers the influence of a fault on product quality and process safety by constructing sub-spaces that enable distinguishing between various types of quality-related and safety-related faults in complex industrial systems. Predicting quality in real time is essential for process monitoring and control, but measuring quality variables often requires offline analysis. However, quality variables can be measured by exploring the correlation between the changes in process variables and the collected quality information [6]. To this end, a multi-subspace elastic network method is employed in [6] to construct the correlation relationship between process variables and quality variables for the detection and diagnosis of faults. Recently, a novel data-driven method has been proposed in [7] using multi-subspace orthogonal canonical correlation analysis for identifying quality-affecting faults in a real-time fashion.

Despite the recent success of collecting IoT data for process monitoring [8], this data mirrors the actual manufacturing process and its intrinsic bias towards good performance. It is, of course, fortunate that instances of faultless samples largely outnumber the defects and faults in the manufacturing process. This characteristic is a major factor in today's economy, as it renders the manufacturing of complex products cost-effective and the end product accessible to the masses. For instance, two-thirds of the world population today can afford a mobile phone.¹ On the other hand, data imbalance is a major challenge for ML-based predictive analytics, which relies on data for learning.

¹ https://datareportal.com/global-digital-overview

In a binary classification ML task (e.g., faulty/faultless), imbalanced data is where the number of negative samples (faultless, or samples that conform to the quality control process) outweighs the number of positive samples (faulty samples). Canonical ML algorithms often assume that each class has roughly the same number of objects [7], [9]. Manufacturing datasets, however, often have a dramatically skewed distribution. Thus, their application to canonical ML algorithms fails to deliver reliable results. Indeed, when positive samples are limited, predictive models tend to be biased towards the majority (negative) class. This leads to a high probability of misclassifying samples from the minority class. In the context of manufacturing, this bias in predictive models results in the majority of faults going unnoticed, significantly impacting the quality and efficiency of production. In fact, the impact of not predicting a fault in manufacturing is much more detrimental to the quality and process than misclassifying a faultless sample as faulty.

To this end, there is a dominant incentive in SM to improve the performance of quality predictive models that deal with imbalanced datasets. There are various efforts to address this issue: some propose to remove the bias by manipulating the dataset, referred to as data-based, and others propose to implement a positive bias in the ML algorithm, referred to as algorithm-based. Data-based methods look at generating new synthetic data samples from the minority class to reach similar numbers as the majority class. These methods are often referred to as data augmentation tools, as they increase the number of samples by adding the synthetic data. For instance, the authors in [10] use such a method (Synthetic Minority Over-sampling TEchnique (SMOTE) [11], [12]) to mitigate data imbalance while predicting product quality.

The second group relates to algorithm-based methods, also referred to as cost-sensitive learning. An artificial bias is implemented in the existing classification process through a cost matrix that amplifies the penalty value for misclassifying minority samples. An algorithm-based method was recently used with an imbalanced dataset to predict failure in air pressure systems in [13]. A promising direction looks at combining both approaches to form a hybrid technique that alters the dataset as well as the algorithm bias to circumvent the imbalance obstacle.

However, these methods have not been evaluated in the context of predictive analytics for manufacturing problems. In fact, there is no one-solution-fits-all, and it is critical to have a framework in which various methods can be assessed in a data-centric and contextual fashion. In this work, we present the first such framework, covering the process of predictive analytics starting at the original dataset and finishing at the context-aware performance evaluation, as shown in Figure 1. In this study, we present a comprehensive comparative analysis of data-based, algorithm-based and hybrid methods for improving the prediction accuracy of manufacturing faults. To this end, we propose the first statistical analysis framework for measuring the effectiveness of each method by quantifying four key aspects. The first looks at measuring the bias embedded in data by adopting entropy concepts. It is evident that the optimum predictive analytics method majorly depends on the dataset and its

intricate features. It is, therefore, of pivotal importance to capture the level of imbalance in the dataset and the resulting behaviour with different re-balancing methods. The second statistical method gauges the goodness of the associated labels (where each represents a cluster or class) using the Silhouette coefficient. Indeed, this metric is essential for understanding the dataset and the source of bias in order to identify a suitable remedy. The third aspect measures the effectiveness of the proposed method (be it data-based, algorithm-based, or hybrid) in predicting defects and identifies relevant metrics such as Precision and Recall, which are used to calculate the F1-score. The last aspect relates only to data augmentation methods and aims at measuring the goodness of the generated synthetic data in comparison with the real data samples.

FIGURE 1. Proposed predictive analytics framework for dealing with imbalanced datasets.

Data augmentation is an optional step in the pre-processing phase and may include methods such as SMOTE and Generative Adversarial Networks (GANs) and related variants (see Figure 1). Algorithm-based methods are implemented in the ML classifier, which is selected from a pool of potential classifiers. In addition, data bias can be addressed in the post-processing phase, as shown in Figure 1, by tuning the classification threshold to counter-balance the imbalance.

A. CONTRIBUTIONS
This article aims to provide a better understanding of predictive analytics methods that employ imbalanced datasets, with particular focus on manufacturing applications. Our research study aims at assessing the performance of some of the widely used techniques at three levels:
• Pre-processing: data-augmentation techniques (e.g., SMOTE, GAN) and feature reduction techniques (e.g., Principal Component Analysis).
• Classifier algorithm: classifier type (e.g., Linear Regression, Random Forest) that might incorporate class weights while being trained and tuned, or be optimised for a given cost-sensitive function (i.e., misclassification cost).
• Post-processing: a moving threshold for counter-balancing the data bias.
The performance evaluation phase employs context-aware metrics from a wide range of indicators for measuring the data imbalance, the goodness of class labels, the context-aware success of the classification, and the usability of synthetic data. We have conducted multiple experiments that combine hybrid options within the framework in Figure 1 using a real-world SM dataset.

Although the literature review reveals various works that address the challenge of handling imbalanced datasets in the context of smart manufacturing, none offers a comprehensive study on how intrinsic characteristics of datasets can be exploited in the selection of ML-based predictive analytics. To this end, the overarching contributions of this work can be summarised as follows:


• Comprehensive review of the challenge of classification using imbalanced datasets and corresponding widely used techniques.
• The first complete evaluation framework to enable the assessment of various combinations of techniques.
• A context-aware set of performance indicators to gauge the effectiveness of predictive models and their impact on the application at hand.
• The first comparative analysis of multiple experiments in which we apply the designed framework using a real manufacturing dataset and draw novel insights.

B. OUTLINE
The paper is structured as follows. The problem formulation is explained in Section II. Section III provides the required background and related work. The adopted methodology in the implementation of various techniques for data augmentation and cost-sensitive functions is detailed in Section IV. The performance metrics for context-aware evaluation of predictive models are described in Section V. In Section VI, we present the manufacturing dataset that is used in the comparative analysis, and we devise four cases that encompass hybrid combinations of data-based, algorithm-based, and post-processing techniques. In Section VII, we present the results from each of the cases and draw context-aware insights in the discussion. We finally conclude the paper and explain the future directions of our research in Section VIII.

II. PROBLEM FORMULATION
Consider a network of N sensor nodes that are deployed across the manufacturing process to predict quality issues such as defective batches and component failures, among others. Each sensor node sn, where n = 1, 2, · · · , N, collects data streams of a particular type, e.g. temperature, flow, and motor rotation. Data streams are a sequence of numerical measurements in consecutive order. At each time instance t ≥ 0, each sensor node sn gathers a single data stream. Let x(t) ∈ R^N be a data sample or instance that is collected from the N sensors at a time t. Let X ∈ R^{M×N} be a sequence of M data samples (i.e., features/variables) that are collected by the N sensors, where each sample has a label or class yj to indicate whether a failure has occurred (yj = 1) or not (yj = 0) at the machine or component being observed, such that yj ∈ {0, 1}, where j = {1, 2, · · · , M}. Let Y ∈ R^M be the discrete labels whose values are to be modelled and predicted from the input data X. X and Y can be formulated as follows:

$$X = \underbrace{\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M1} & a_{M2} & \cdots & a_{MN} \end{bmatrix}}_{M \text{ data samples from } N \text{ sensors}} \qquad Y = \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}}_{\text{labels}} \tag{1}$$

where the data samples are X = {x_1, · · · , x_M}, such that the first data sample x_1 = {a_11, a_12, · · · , a_1N} is collected by the N sensors; a_ij is a single measurement value for a particular sensor at a certain collected sample, with i and j the sample id and sensor id, respectively. Moreover, {a_11, a_21, · · · , a_M1} is a data stream: a sequence of numerical observations collected by the sensor s_1.

The problem can be formulated as imbalanced classification predictive modelling: a process of predicting quality issues that are categorised into classes/labels (e.g. failure/non-failure) by approximating a mapping function f from input data samples X = {x_1, · · · , x_M} to discrete output labels Y. We assume that there is a high imbalance ratio between the minority and majority classes in Y. The minority class refers to the class that has few samples in the data X, while the majority class refers to the class that has many samples in the same data. The ratio between these two types of samples is referred to as the imbalance ratio: the Imbalance Ratio (IR) is defined as the proportion of the number of samples in the majority class (yj = 0) to the number of samples in the minority class (yj = 1).

In this article, our dataset consists of data samples X and their corresponding discrete labels Y. To this end, we refer to the training samples in our dataset as D = {(x_1, y_1), · · · , (x_M, y_M)} in a binary classification task, where the i-th example pair in D denotes a data sample (i.e., feature vector) taken by N sensors at a time t and is labelled (through y_i) as either a normal sample (when y_i = 0) or a faulty one (when y_i = 1). To this end, in our binary classification case, we denote the data subset containing all faulty samples (i.e., the minority class) as Df = {(x_i, y_i) ∈ D | y_i = 1} ⊂ D. Similarly, the data subset containing all normal or faultless samples (i.e., the majority class) is Ds = {(x_i, y_i) ∈ D | y_i = 0} ⊂ D.
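To make the formulation concrete, the following minimal Python sketch builds a toy sample matrix X and label vector y, forms the subsets Df and Ds, and computes the imbalance ratio. The shapes, seed and ~2% fault rate are illustrative assumptions, not properties of the paper's dataset.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    M, N = 1000, 8                           # M data samples from N sensors (toy values)
    X = rng.normal(size=(M, N))              # one row per data sample, as in Eq. (1)
    y = (rng.random(M) < 0.02).astype(int)   # ~2% faulty samples (y_j = 1), illustrative

    D_f = X[y == 1]                          # minority class: faulty samples
    D_s = X[y == 0]                          # majority class: faultless samples

    # Imbalance Ratio (IR): |majority| / |minority|
    ir = len(D_s) / max(len(D_f), 1)
    print(f"M_min={len(D_f)}, M_maj={len(D_s)}, IR={ir:.1f}")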
III. LEARNING WITH IMBALANCED DATA
Recent advancements in Artificial Intelligence (AI) stemming from deep neural networks (DNN) have helped to establish manufacturing predictive quality as an essential category in Smart Manufacturing (SM). Despite their success in many data-driven real-world applications, the problem of learning from imbalanced data is still a challenging task [14]. Most manufacturing applications suffer from highly skewed class distributions, where there are usually few collected samples about defective batches or component failures (i.e., the minority class), and it is costly to collect more failure samples [1], [14]. As a result, rare instances and events in manufacturing applications are not easily identified, and it is difficult to apply standard classification techniques while attaining high accuracy in predicting the minority class [15]. Approaches that have been developed to tackle the imbalance problem can be grouped into three main categories: data-driven methods, algorithmic-based methods and hybrid methods.

A. DATA-DRIVEN METHODS
Data-driven approaches are data preprocessing methods that enhance learning from imbalanced data by modifying the training data-set to balance the class distributions (see Figure 1). This modification is based on generating


new samples for the minority class or removing samples of the majority class. The former is referred to as over-sampling, while the latter is referred to as under-sampling. The basic sampling approaches, Random Over-sampling and Under-sampling (ROU), aim to balance the class distribution by the random elimination of majority class samples or the duplication of minority class samples in the training data-set [16]. This results in discarding useful information about the data and in performance degradation, due to removing potentially useful samples or duplicating samples that might cause over-fitting of the learned models [9], [17]. Over-fitting occurs when the learned model fits too closely to the training data, such that it becomes unable to generalise to new data samples [18]. To address this problem, advanced sampling techniques have been proposed to maintain the underlying distribution of the natural grouping in the data while adding new data samples.

Advanced under-sampling strategies aim to preserve important information while learning from imbalanced data. Lin et al. [19] have proposed an under-sampling cluster-based technique to maintain a good representation of the majority class in the dataset. The approach is based on clustering the data samples from the majority class such that the number of clusters in the majority class is set equal to the number of data samples in the minority class. Moreover, Mani and Zhang [20] have used a K-Nearest Neighbours (KNN) classifier as an under-sampling preprocessing step. KNN tends to reduce the overlap between minority and majority samples: majority samples are classified, and samples are selected for removal based on their distances from minority samples. Density-based under-sampling methods, such as the Density-Based Under-sampling (DBU) technique [21], [22], have been developed to retain useful information while reducing the number of majority-class samples. DBU assumes that similar examples are relatively close to each other, while noisy examples tend to be far from other examples associated with the same class in the feature space.

Several informed over-sampling techniques have also been developed to reduce over-fitting and strengthen class boundaries. Over-sampling methods tend to be more efficient than under-sampling techniques when handling extremely imbalanced big data problems with a large imbalance ratio [12], [23]–[25]. Chawla et al. [11] have introduced the Synthetic Minority Over-sampling TEchnique (SMOTE), a method for creating synthetic data of minority samples by identifying the feature space of the minority samples and considering their k nearest neighbours. In principle, SMOTE creates artificial data samples for the minority class by linearly interpolating new samples between already existing minority samples and their nearest minority neighbours.

SMOTE has shown great success in addressing imbalanced data in different industrial applications, including the manufacturing process [24] and predictive maintenance and failure prediction [26]. Several extensions have been developed to improve upon the original SMOTE algorithm, such as Safe-Level-SMOTE [27] and Borderline-SMOTE [28], among others. Safe-Level-SMOTE aims to define safe regions to prevent overlapping between classes and generate less noisy minority samples, while the primary goal of Borderline-SMOTE is to limit the number of generated samples near minority class borders. Similar to SMOTE, ADASYN [29] is an over-sampling method that creates synthesised data samples of the minority classes. However, ADASYN improves the over-sampling process by further reducing the bias introduced by the class imbalance and forcing the learning algorithm to learn the minority class boundaries adaptively, enhancing the quality of the generated minority samples.

Yet powerful generative models, including Generative Adversarial Networks (GAN) [30], have been successful in generating new samples that are similar to real samples, improving the performance of imbalanced classification tasks. In [31], a Decision Tree (DT) classifier using the training data-set generated by GAN achieves results comparable to a DT trained on the original data-set. Conditional GAN (CGAN) has been developed in [32] for creating synthetic data for prognostics under the conditions of limited failure data availability.
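The over-sampling variants surveyed above have off-the-shelf implementations. The hedged sketch below shows how SMOTE, Borderline-SMOTE and ADASYN could be applied to re-balance a labelled dataset, assuming the third-party imbalanced-learn package and its fit_resample API together with toy arrays; it is an illustrative usage, not the paper's experimental code.

    import numpy as np
    from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

    rng = np.random.default_rng(seed=0)
    X = rng.normal(size=(1000, 8))
    y = (rng.random(1000) < 0.05).astype(int)   # ~5% minority class (illustrative)

    # Each sampler synthesises minority samples until the classes are balanced.
    for sampler in (SMOTE(k_neighbors=5), BorderlineSMOTE(), ADASYN()):
        X_res, y_res = sampler.fit_resample(X, y)
        print(type(sampler).__name__, np.bincount(y_res))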
TABLE 1. A cost matrix for binary classification [36]. C: cost, TP: True Positive, FP: False Positive (i.e., false alarm), TN: True Negative, and FN: False Negative.

                              Actual: faulty (1)     Actual: faultless (0)
    Predicted: faulty (1)     C(1,1) = 0 (TP)        C(1,0) = C_FP (FP)
    Predicted: faultless (0)  C(0,1) = C_FN (FN)     C(0,0) = 0 (TN)

B. ALGORITHMIC-BASED METHODS
Algorithmic-based approaches are often discussed under cost-sensitive methods in the literature. Unlike data sampling methods, cost-sensitive methods do not alter the training data-set. Instead, they modify the existing learning algorithms or decision process (e.g. for classification tasks) through a cost matrix such that each class is assigned a misclassification penalty value [9], [33], [34] (see Figure 1). In principle, this family of algorithms alleviates its bias towards majority classes by increasing the cost value of minority groups. This results in increasing the importance of these groups and decreasing the likelihood that the learning algorithm will misclassify them [18]. An example of a cost matrix for a binary classification problem such as failure prediction is shown in Table 1, where Ci,j is the misclassification cost, i.e. the penalty for predicting samples as class i when their true class is class j. Intuitively, there is no penalty for classifying the samples correctly (i.e., True Positive (TP) and True Negative (TN)). To this end, the diagonal of the cost matrix, where i = j (i.e., C1,1 and C0,0), is 0. The optimisation process of predictive models for imbalanced data shifts from maximising the overall accuracy, or minimising the error


rate, to minimising the total cost such that:

$$T_{Cost} = m\, C_{FP} + n\, C_{FN} \tag{2}$$

where TCost is the total cost that should be minimised, and m and n are the numbers of false positive (FP) and false negative (FN) errors, respectively. More precisely, m is the number of faultless samples that are predicted as faulty samples, and vice versa for n. In real-world manufacturing applications, FN errors cost more than FP errors, i.e., CFN ≫ CFP [18]. For instance, CFP = 10 and CFN = 500 for predicting air pressure system failures in Scania trucks [13]. In such a case, CFP refers to the cost of an unnecessary check of a truck by a mechanic. In contrast, CFN refers to the cost of missing a faulty truck, which may cause a breakdown and put drivers and their fellow road users at high risk.

The total cost TCost in Eq. 2 is a non-normalised cost. Without loss of generality, we can express the normalised cost T̂Cost with respect to n as follows:

$$\hat{T}_{Cost} = \lambda\, C_{FP} + C_{FN} \tag{3}$$

where the coefficient λ indicates the relative importance of the various misclassification costs, such that λ = m/n [35]. If the cost-sensitive classifier produces posterior probability estimates P for test samples instead of discrete labels, the cost matrix relies on defining a classification threshold p∗ [36] such that:

$$p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}} \tag{4}$$

In such a case, the learning algorithm classifies test samples as faulty if their posterior probability estimates satisfy P ≥ p∗. This type of threshold-moving method is sometimes called relabelling and is used as a post-processing step for relabelling the output class of test samples [18], [36].
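A minimal sketch of this threshold-moving step, assuming the Scania-style costs quoted above and hypothetical posterior probabilities (any probabilistic classifier could supply them; the arrays below are invented for illustration):

    import numpy as np

    C_FP, C_FN = 10, 500                       # example costs from the Scania case [13]
    p_star = C_FP / (C_FP + C_FN)              # Eq. 4: classification threshold ~0.0196

    # Hypothetical posterior probabilities P(faulty | x) for five test samples.
    proba = np.array([0.01, 0.03, 0.20, 0.005, 0.60])

    # Threshold moving (relabelling): flag as faulty whenever P >= p*.
    y_pred = (proba >= p_star).astype(int)

    # Total cost of a prediction vector against ground truth (Eq. 2).
    y_true = np.array([0, 1, 1, 0, 1])
    m = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    n = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    print(p_star, y_pred, m * C_FP + n * C_FN)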
The performance of cost-sensitive algorithms mainly relies on incorporating an effective cost matrix that modifies the learning process. However, the actual cost matrix is often unknown in most real-world applications, and it may be defined empirically or by domain experts [9], [18].

Ensemble-based learning algorithms are another type of cost-sensitive method for tackling imbalanced class distributions. Krawczyk et al. [37] introduce a one-class ensemble learning algorithm for improving the predictive classifier of a multi-class imbalanced problem with a complex and imbalanced class distribution. The proposed ensemble learning algorithm aims to create individual descriptions of each class and then combine them into a classifier that outperforms each of them. Such a classifier reduces the bias towards one of the classes introduced by standard classifiers [38]. Bagging [39], AdaBoost [40] and Gradient Boosting [41] are the most common ensemble classifier algorithms. Boosting is an iterative algorithm that associates different weights with the data distribution. The learned algorithm is forced to focus more on the misclassified samples at each iteration. This is achieved by increasing the weights associated with misclassified samples and decreasing the weights associated with correctly classified samples. Other ensemble approaches are also discussed in [42].

C. HYBRID METHODS
Hybrid methods take advantage of both data-driven and algorithmic-based methods. These two categories have been integrated in various ways. For instance, data-driven solutions are combined with classifier ensembles to mitigate the effect of the imbalanced data [43]. Other approaches such as [44], [45] combine cost-sensitive and over-sampling approaches based on data density to generate better samples around each minority group and eliminate the noise effect for imbalance learning. It is worth noting that over-sampling approaches have shown little sensitivity when misclassification costs change [46].

Yang et al. [25] introduce a hybrid optimal ensemble classifier (HOEC) framework that outperforms other conventional and ensemble classifier methods for learning from imbalanced real-world data sets. HOEC combines density-based under-sampling and cost-effective methods through a multi-objective optimisation process to overcome the limitations of traditional under-sampling and cost-sensitive algorithms. Several cost-sensitive methods have been developed based on traditional decision trees to improve the imbalanced classification performance of the minority class. Li et al. [47] developed a hybrid decision tree that incorporates both a misclassification cost and a set of selected attributes. The attribute selection criterion is based on a linear combination of the Gini index and information gain.

A hybrid framework that incorporates data clustering, data sampling and ensembles is proposed in [48] for improving the performance of a binary classifier. The proposed hybrid framework outperforms traditional over-sampling techniques, including SMOTE.

Overall, this article studies the improvement of the performance of predictive quality analytics under the condition of limited faulty data availability. We present a comparison of state-of-the-art over-sampling approaches for generating samples for minority groups (e.g., faulty data). We also study a hybrid approach between sampling and a cost-sensitive model. With the use of statistical analysis, we measure the quality of over-sampling techniques and their effect on alleviating the bias towards majority groups (e.g., faultless or normal samples) by increasing the importance and the cost values (i.e., penalties) of minority groups. To the best of our knowledge, such a comparison spanning various over-sampling techniques for predictive quality analytics in manufacturing has not been carried out so far.

IV. METHODOLOGY FOR PREDICTIVE QUALITY ANALYTICS
In this section, we present the methodology adopted in the evaluation of various combinations of data-based and algorithm-based methods for dealing with imbalanced datasets. As discussed in Section II, our work focuses on a


manufacturing problem that aims to predict defects in the end product. To this end, the dataset considered in this evaluation contains samples that are labelled as either positive (minority class) or negative (majority class).

In the following paragraphs, we summarise the standard Synthetic Minority Over-sampling TEchnique (SMOTE) and generative models, including the Generative Adversarial Network (GAN) and its variants, in terms of their objective functions and their architectures. We evaluate these techniques for creating high-quality minority data samples and measure their effectiveness, in conjunction with cost-sensitive learning algorithms, in improving classification performance when presented with an imbalanced dataset. There are very few works that use some of these techniques for data augmentation, such as [26], [49], while learning from an imbalanced dataset, especially in industrial settings. To the best of our knowledge, this is the first comprehensive evaluation of different data augmentation and cost-sensitive methods for predictive analytics in manufacturing applications.

A. SYNTHETIC OVER-SAMPLING TECHNIQUES
As mentioned previously in Section III-A, SMOTE [11] is one of the main over-sampling methods for creating synthetic data samples of the minority class (e.g., faulty samples). SMOTE first selects a random data sample x1 from the minority samples and then finds its k nearest minority neighbours. An example demonstrating the process followed in SMOTE is shown in Figure 2, in which we assume that k = 6. In this example, the neighbouring samples of the faulty sample x1 are indicated as {x2, x3, · · · , x7}. Then, for each of the pairs {(x1, x2), (x1, x3), · · · , (x1, x7)}, a synthetic faulty data sample is created along the line segment joining them, shown as a red triangle in Figure 2. Thus, the newly generated faulty samples are at random positions on the linear vector space from x1. The number of generated samples relies on β = Mmin/Mmaj, where Mmin is the number of faulty samples in the minority class after over-sampling and Mmaj is the number of faultless (i.e., normal) samples in the majority class. In a binary classification task, the total number of samples M is the sum of the numbers of samples of the minority and majority classes, M = Mmin + Mmaj.

FIGURE 2. An example of SMOTE for k = 6.
- Before over-sampling: M = 24 with Mmin = 7 (black circles) and Mmaj = 17 (blue diamonds).
- After over-sampling: M = 30 with Mmin = 13 (black circles and red triangles) and Mmaj = 17 (blue diamonds).
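The interpolation just described fits in a few lines. Below is a simplified, SMOTE-style sketch under stated assumptions (Euclidean neighbours, one synthetic sample per neighbour, toy 2-D data mirroring Figure 2); it is not the reference SMOTE implementation.

    import numpy as np

    def smote_like(minority: np.ndarray, k: int = 6, rng=None) -> np.ndarray:
        """Generate synthetic minority samples by linear interpolation (SMOTE-style)."""
        rng = rng or np.random.default_rng(0)
        x1 = minority[rng.integers(len(minority))]       # random minority sample
        d = np.linalg.norm(minority - x1, axis=1)        # distances to x1
        neighbours = minority[np.argsort(d)[1:k + 1]]    # k nearest minority neighbours
        u = rng.random((k, 1))                           # random positions in [0, 1]
        return x1 + u * (neighbours - x1)                # points on the joining segments

    minority = np.random.default_rng(1).normal(size=(7, 2))  # M_min = 7, as in Figure 2
    synthetic = smote_like(minority, k=6)
    print(synthetic.shape)                                   # (6, 2) new faulty samples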

B. GENERATIVE ADVERSARIAL NETWORKS (GAN)
SMOTE synthetically generates new, non-replicated minority samples to alleviate the over-fitting caused by random over-sampling. However, SMOTE tends to neglect the characteristics of the local distribution of data samples, as it considers the neighbourhood parameter k globally [50]. This results in generating overlapped and noisy samples [51]. Consequently, SMOTE is not guaranteed to create realistic faulty data samples for manufacturing applications [32].

Goodfellow et al. [30] propose Generative Adversarial Networks (GAN) as an alternative over-sampling method for creating synthetic data samples. GAN is a minimax two-player game with an objective function V(G, D) between a generator G and a discriminator D. The generator and discriminator are two neural networks defined by Multilayer Perceptrons (MLP) with weight vectors θg and θd, respectively. These two models compete against each other during the training process, and they are trained simultaneously.

GAN-based methods emerged originally as an over-sampling technique to create realistic images to improve the performance of learning algorithms in different applications [52]. In manufacturing applications, GAN-based methods were recently used to create faulty synthetic samples under an imbalanced dataset for improving the prediction of faults [32]. The main objective of GAN-based methods is to augment the original training data such that the number of faulty samples available for training models (after generating new synthetic faulty samples) is increased.

In the standard GAN, the generator G is trained to fool the discriminator D by capturing the underlying distribution of the real faulty data Df = {x^(i)} of variables x^(i) ∼ pdata(x^(i)), so that it can create synthetic faulty samples that are intended to come from the same distribution pdata(x^(i)) as the real faulty data. The discriminator D is trained to distinguish fake from real faulty data by estimating the probability that a given data sample originates from the real faulty samples. This zero-sum game between the generator and discriminator motivates both of them to improve their functionalities. The basic architecture of GAN is depicted in Figure 3 (left).

Formally, given faulty data Df = {x^(i)} of a variable x^(i) ∼ pdata(x^(i)), we wish to estimate pdata(x). To this end, we transform a prior white noise variable z ∼ p(z) through a generator G(z; θg), parametrised by MLP parameters θg, to produce a new synthetic (i.e., fake) faulty data sample. To this end, G implicitly defines a probability distribution


pg as the distribution of the faulty samples G(z) obtained when z ∼ pz. The discriminator D(x; θd), parametrised by MLP parameters θd, outputs the probability estimate that any x comes from the data distribution pdata(x). In principle, the discriminator D is a scalar function that is trained to maximise the probability of assigning the correct labels to faulty samples in the training data and to samples generated from G(z). In such a case, the discriminator is typically a traditional supervised learning method that is optimised to identify whether any given sample x is a real faulty data sample (i.e., x ∼ pdata) or a fake sample (i.e., sampled from the generator distribution, x ∼ pg).

FIGURE 3. Basic GAN (Left) and WGAN (Right) architectures.

Overall, the main goal of GAN is to learn a distribution pg over the faulty data samples such that pg is as close as possible to the original faulty data distribution pdata. In such a case, the generator represents a distribution over the distributions of the original data. The training procedure for the generator G is to maximise the probability that the discriminator D makes a mistake in identifying fake and real faulty samples. This is done by increasing the chances that D produces a high probability for fake examples, thus minimising E_{z∼pz(z)}[log(1 − D(G(z)))]. At the same time, the training process for the discriminator D aims to teach it to identify real faulty data samples accurately by maximising E_{x∼pdata(x)}[log D(x)]. Meanwhile, given a fake sample drawn from the generator, z ∼ pg(z), the discriminator is expected to output a probability D(G(z)) close to zero, by maximising E_{z∼pz(z)}[log(1 − D(G(z)))]. More specifically, the loss functions of the generator G and discriminator D are defined in Eq. 5 and Eq. 6, respectively:

$$\min_{G} V(G, D) = \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]}_{\text{optimise } G \text{ to generate better fake faulty samples to fool } D} \tag{5}$$

$$\max_{D} V(G, D) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]}_{D \text{ better identifies real faulty samples}} + \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]}_{D \text{ better identifies generated fake faulty samples}} \tag{6}$$

Formally, the discriminator D and the generator G play a two-player minimax game with the following main objective function V(G, D), which the generator tries to minimise while the discriminator tries to maximise. More specifically, V(G, D) (shown in Eq. 7) incorporates the loss functions of Eq. 5 and Eq. 6 as follows:

$$\min_{G}\max_{D} V(D, G) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]}_{D \text{ output for real faulty data}} + \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]}_{D \text{ output for generated fake faulty data } G(z)} \tag{7}$$

In Eq. 7, the generator does not have a direct effect on the first term log D(x) in the objective function. For the generator, minimising the loss is equivalent to minimising log(1 − D(G(z))).
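One practical aside, taken from the original GAN formulation in [30] rather than from this paper's method: when the discriminator confidently rejects early fake samples, log(1 − D(G(z))) saturates and yields vanishing gradients for G. A commonly used non-saturating alternative trains the generator to maximise log D(G(z)) instead,

$$\max_{G}\ \mathbb{E}_{z\sim p_z(z)}[\log D(G(z))],$$

which has the same fixed point as Eq. 5 but provides much stronger gradients early in training.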
1) GLOBAL OPTIMALITY IN GAN
The global optimality of V(D, G) is only achieved when both D and G are at their optimal values. In such a case, pg becomes very close to pdata. The training objective for the discriminator D can be described as maximising the log-likelihood for estimating the conditional probability P(Y = y|x), where Y = {0, 1} indicates whether x comes from the real failure data pdata (with y = 1) or is a synthetic or fake failure sampled from pg (with y = 0) [30]. The optimal discriminator should be able to identify the real failure data and generated

fake failures. This can be achieved when the real failure data distribution pdata and the generated failure data distribution pg are known. In such a case, the optimal discriminator D∗G(x) for any fixed generator G is expressed in Eq. 8. Indeed, when pg = pdata, the optimal discriminator is D∗G(x) = 1/2.

$$D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \tag{8}$$

For the optimal discriminator to maximise the quantity V(G, D), Eq. 6 can now be reformulated as in Eq. 9:

$$\max_{D} V(D^{*}, G) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D^{*}_{G}(x)]}_{D \text{ better identifies real faulty samples}} + \underbrace{\mathbb{E}_{x\sim p_g(x)}[\log(1 - D^{*}_{G}(x))]}_{D \text{ better identifies generated fake faulty samples}} \tag{9}$$

In such a case, V(D, G) in Eq. 9 has a value of − log 4 (the proof is included in [30]). Using Eq. 8, Eq. 9 can now be reformulated as:

$$\max_{D} V(D^{*}, G) = -\log 4 + KL\Big(p_{data}\,\Big\|\,\frac{p_{data} + p_g}{2}\Big) + KL\Big(p_g\,\Big\|\,\frac{p_{data} + p_g}{2}\Big) \tag{10}$$

where KL is the Kullback–Leibler divergence,² which measures how the probability distribution pdata diverges from a second expected probability distribution pg. Intuitively, KL has a minimum value of zero when pdata = pg. When pdata(x) gets close to zero and pg(x) is non-zero, the effect of pg(x) is ignored and disregarded [53]. However, both distributions are equally important. Moreover, it is very clear that the KL divergence is an asymmetric measure. The Jensen–Shannon Divergence (JSD)³ is another measure of similarity between two probability distributions. It is a symmetric measure and is bounded by [0, 1]. The advantage of using the symmetric JS divergence instead of the asymmetric KL divergence while training GAN has been discussed in [53], [54]. It turns out that the KL divergence is hard to optimise, and that the minimax converges to its equilibrium between the policies of the generator and discriminator when the policies can be updated during the training process while minimising the JSD [55]. To this end, Eq. 10 can be reformulated as follows:

$$V(D^{*}, G) = -\log 4 + 2\cdot JSD(p_{data}\,\|\,p_g) \tag{11}$$

As explained previously, the global optimality of V(D, G) is achieved when pg = pdata; in such a case, V(D∗, G) is obtained as in Eq. 11.

² KL(p||q) = E_p[log p(x)/q(x)] quantifies the KL divergence between two distributions p and q; it measures how one probability distribution p diverges from a second expected probability distribution q.
³ JSD(p||q) = ½ KL(p || (p+q)/2) + ½ KL(q || (p+q)/2) quantifies the JS divergence between two distributions p and q.

Algorithm 1: Minibatch Stochastic Gradient Descent Training of GAN for Generating Synthetic Faulty Samples
for number of training steps on faulty samples do
    for k steps to train the discriminator do
        • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
        • Sample a minibatch of m samples {x1, · · · , xm} from the data distribution pdata(x)
        • Update the discriminator model θd by ascending its stochastic gradient:
            ∇θd (1/m) Σ_{i=1}^{m} [log D(x^(i)) + log(1 − D(G(z^(i))))]
    end
    • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
    • Update the generator model parameters θg by descending its stochastic gradient:
            ∇θg (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i))))
end
Generate N synthetic faulty data samples from the learned generator model

2) GAN TRAINING PROCESS
The value functions of both players are typically defined in terms of their model parameters θg and θd. The discriminator aims to maximise V(θd, θg) while having control only over θd, whereas the generator aims to minimise V(θd, θg) while controlling only θg. The GAN training process (shown in Alg. 1) for the cost function V(G, D) (in Eq. 7) includes two gradient steps performed simultaneously: one updating θd to maximise V(D) and one updating θg to minimise V(G). Adam [56] is often used as a gradient-based optimisation algorithm for learning both models' parameters. To this end, the discriminator model parameters θd are learned for a minibatch of m failure examples by ascending the stochastic gradient in Eq. 12, while the generator model parameters θg are learned by descending the stochastic gradient in Eq. 13 (see Figure 3 (Left)).

$$\nabla_{\theta_d}\, \frac{1}{m}\sum_{i=1}^{m}\big[\log D(x^{(i)}) + \log(1 - D(G(z^{(i)})))\big] \tag{12}$$

$$\nabla_{\theta_g}\, \frac{1}{m}\sum_{i=1}^{m}\log(1 - D(G(z^{(i)}))) \tag{13}$$
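To make Alg. 1 and Eqs. 12-13 concrete, here is a hedged PyTorch sketch of a single GAN training iteration on tabular samples. The MLP sizes, learning rates and the binary cross-entropy form of the discriminator update are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    n_feat, n_z, m = 8, 16, 64                  # sensors, noise dim, minibatch (toy values)
    G = nn.Sequential(nn.Linear(n_z, 32), nn.ReLU(), nn.Linear(32, n_feat))
    D = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    bce = nn.BCELoss()                          # BCE on {real=1, fake=0} realises Eq. 6

    real = torch.randn(m, n_feat)               # stand-in for a minibatch of faulty samples

    # Discriminator step: ascend Eq. 12 (here: descend the equivalent BCE loss).
    z = torch.randn(m, n_z)
    fake = G(z).detach()                        # freeze G while updating D
    loss_d = bce(D(real), torch.ones(m, 1)) + bce(D(fake), torch.zeros(m, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: descend Eq. 13, i.e. minimise log(1 - D(G(z))).
    z = torch.randn(m, n_z)
    loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()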
C. CONDITIONAL GAN (CGAN)
In the standard GAN (shown in Alg. 1 and from [30]), the generative model G(z) is trained without any control over the type of faulty data being generated. In Conditional GAN (CGAN) [57], the generator and discriminator are conditioned on auxiliary information y, where y is the failure

class labels. The conditioning on the class failure label is formed by feeding the label y into both the generator and the discriminator. This allows the generator G(z|y) to learn to generate synthetic faulty data samples for a particular type of failure labelled through y. In such a case, the objective function of the minimax game of the standard GAN (in Eq. 7) can be formulated to incorporate the class failure label y as follows:

$$\min_{G}\max_{D} V(D, G) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[\log D(x|y)]}_{D \text{ output for real faulty data of type } y} + \underbrace{\mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z|y)))]}_{D \text{ output for generated fake faulty data of type } y} \tag{14}$$

Similar to the standard GAN, the CGAN training process for the cost function V(G, D) includes two simultaneous gradient steps. In principle, Eq. 12 and Eq. 13 are reformulated for CGAN as follows:

$$\nabla_{\theta_d}\, \frac{1}{m}\sum_{i=1}^{m}\big[\log D(x^{(i)}|y) + \log(1 - D(G(z^{(i)}|y)))\big] \tag{15}$$

$$\nabla_{\theta_g}\, \frac{1}{m}\sum_{i=1}^{m}\log(1 - D(G(z^{(i)}|y))) \tag{16}$$

D. WASSERSTEIN GAN (WGAN)
The traditional GAN may suffer from mode collapse, whereby the generator reaches a state in which it always produces the same synthetic output (e.g., the same faulty samples or image). Wasserstein GAN (WGAN) is an alternative to the traditional GAN. It employs the Wasserstein distance measure (W) between two probability distributions in the training process, which allows a smoother gradient. The main goal of WGAN is to provide an efficient approximation of the W distance that yields high synthetic sample quality. As such, WGAN averts mode collapse, as it follows a more stable training process and offers better learning for hyperparameter search [58]–[60]. In general, the Wasserstein distance W(pdata, pg) between the two distributions is defined as follows:

$$W(p_{data}, p_g) = \inf_{\gamma\,\in\,\Pi(p_{data},\, p_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x - y\rVert\big] \tag{17}$$

where Π(pdata, pg) is the set of all joint probability distributions γ(x, y) whose marginals are pdata and pg, respectively. In principle, γ(x, y) describes the percentage of the faulty data distribution that should be transported from x to y such that the distribution of real faulty samples pdata is transformed into pg, which will be used to generate synthetic faulty samples. More precisely, Eq. 17 indicates the cost of such an optimal transport plan. However, the infimum in Eq. 17 makes it intractable. Using the Kantorovich-Rubinstein duality [61], Eq. 17 can be simplified to Eq. 18:

$$W(p_{data}, p_g) = \sup_{\lVert f\rVert_L \le 1} \mathbb{E}_{x\sim p_{data}}[f(x)] - \mathbb{E}_{x\sim p_g}[f(x)] \tag{18}$$

where the supremum is over 1-Lipschitz functions f; the general form of a K-Lipschitz function is ||f||_L ≤ K for some Lipschitz constant K. A K-Lipschitz function satisfies the constraint |f(x1) − f(x2)| ≤ K|x1 − x2|, where K ≥ 0, ∀ x1, x2 ∈ R, and K is independent of x1 and x2 [58]. In such a case, ||f||_L ≤ 1 can be replaced (in Eq. 18) by ||f||_L ≤ K with K = 1. In order to solve Eq. 18, we suppose that the function f comes from a parameterised family of K-Lipschitz continuous functions {f_w}_{w∈W}. An assumption is made in [58] that by attaining the supremum in Eq. 18 for some w ∈ W, W(pdata, pg) can then be calculated. In principle, we can calculate the Wasserstein distance by finding a 1-Lipschitz function that can be learned by a DNN parameterised on weights w in a compact space W. By back-propagating via Eq. 18, W(pdata, pg) can be differentiated by estimating E_{z∼p(z)}[∇θ f_w(gθ(z))], where pg is the distribution of gθ(Z), with Z a random variable with density p, and f_w in the set of 1-Lipschitz functions.

In view of the Kantorovich-Rubinstein duality for calculating the Wasserstein distance, the value function V(G, D) of the minimax game for WGAN can be written as follows:

$$\min_{G}\max_{D\in\mathcal{F}} V(D, g_\theta) = \underbrace{\mathbb{E}_{x\sim p_{data}(x)}[D(x)]}_{D \text{ output for real faulty data}} - \underbrace{\mathbb{E}_{z\sim p_z(z)}[D(g_\theta(z))]}_{D \text{ output for generated fake faulty data}} \tag{19}$$

where F is the set of 1-Lipschitz functions. Similar to GAN, WGAN has a discriminator D; however, it is not a classifier but rather a critic that scores the realness or fakeness of a given sample, as shown in Figure 3 (right). More precisely, it is a real-valued function that aims to learn w in a compact space W to find a good f_w [58]. To train the discriminator, Arjovsky et al. [58] specify a family of functions f_w by a DNN, and weight clipping is then applied to enforce the Lipschitz continuity. More precisely, to maintain the Lipschitz continuity of f_w during the training, upon every gradient update, the weights w are clipped to lie within a small window, e.g., [−0.01, 0.01]. This results in having a compact parameter space W and obtaining a bound on f_w that preserves the Lipschitz continuity.

Analogous to GAN, where the Jensen-Shannon (JS) divergence is implicitly employed, the learning in WGAN is based on minimising the Wasserstein distance between the real faulty sample distribution pdata and the learned generator distribution pg, which can be represented as in [58] as follows:

$$W(p_{data}, p_g) = \max_{w\in\mathcal{W}} \mathbb{E}_{x\sim p_{data}}[f_w(x)] - \mathbb{E}_{z\sim p(z)}[f_w(g_\theta(z))] \tag{20}$$

Given an optimal discriminator, which is also known as a critic (because it is not trained to classify), minimising the value function in Eq. 19 with respect to the generator's parameters minimises W(pdata, pg) in Eq. 20 [62]. The WGAN training procedure to generate faulty synthetic samples is summarised in Alg. 2, and the proof of optimality of WGAN is detailed in [58].
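As a companion to Eq. 20 and the weight-clipping rule (and to Alg. 2 below), the following hedged PyTorch sketch performs one critic update; the clip window c = 0.01 and RMSprop step follow the defaults reported in [58], while the toy network sizes and stand-in data are assumptions.

    import torch
    import torch.nn as nn

    n_feat, n_z, m, c = 8, 16, 64, 0.01
    critic = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU(), nn.Linear(32, 1))  # f_w: no sigmoid
    gen = nn.Sequential(nn.Linear(n_z, 32), nn.ReLU(), nn.Linear(32, n_feat))   # g_theta
    opt_w = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

    real = torch.randn(m, n_feat)               # stand-in minibatch of real faulty samples
    fake = gen(torch.randn(m, n_z)).detach()    # generated samples, G frozen for this step

    # One critic step: ascend E[f_w(x)] - E[f_w(g_theta(z))] (Eq. 20) via its negative.
    loss_w = -(critic(real).mean() - critic(fake).mean())
    opt_w.zero_grad(); loss_w.backward(); opt_w.step()

    # Weight clipping keeps f_w within a (crude) K-Lipschitz family, as in Alg. 2.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)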


Algorithm 2: WGAN for Generating Synthetic Faulty Samples
Require: α: learning rate, c: clipping parameter, m: batch size, ncritic: number of iterations of the critic per generator iteration, w0: initial critic parameters, and θ0: initial generator parameters
while θ has not converged do
    for t = 0, · · · , ncritic do
        • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
        • Sample a minibatch of m samples {x1, · · · , xm} from the data distribution pdata(x)
        g_w ← ∇_w [ (1/m) Σ_{i=1}^{m} f_w(x_i) − (1/m) Σ_{i=1}^{m} f_w(g_θ(z_i)) ]
        w ← w + α · RMSProp(w, g_w)
        w ← clip(w, −c, c)
    end
    • Sample a minibatch of m noise samples {z1, · · · , zm} from the noise prior pg(z)
    g_θ ← −∇_θ (1/m) Σ_{i=1}^{m} f_w(g_θ(z_i))
    θ ← θ − α · RMSProp(θ, g_θ)
end
Generate N synthetic faulty data samples from the learned generator model

Algorithm 3: WGAN-GP: WGAN With Gradient Penalty λ for Generating Synthetic Faulty Samples
Require: λ: gradient penalty coefficient, ncritic: number of iterations of the critic per generator iteration, m: batch size, w0: initial critic parameters, α: Adam learning rate (i.e., step size), β1, β2: Adam exponential decay rates, and θ0: initial generator parameters
while θ has not converged do
    for t = 1, · · · , ncritic do
        for i = 1, · · · , m do
            • Sample a real sample x ∼ pdata
            • Sample a latent variable z ∼ pg(z)
            • Select a random number ε ∼ U[0, 1]
            x̃ ← G_θ(z)
            x̂ ← εx + (1 − ε)x̃
            L_i ← D(x̃) − D(x) + λ (||∇_x̂ D(x̂)||₂ − 1)²
        end
        w ← Adam(∇_w (1/m) Σ_{i=1}^{m} L_i, w, α, β1, β2)
    end
    Sample a minibatch of m noise/latent samples {z1, · · · , zm} from the noise prior pg(z)
    θ ← Adam(∇_θ (1/m) Σ_{i=1}^{m} −D_w(G_θ(z_i)), θ, α, β1, β2)
end
Generate N synthetic faulty data samples from the learned generator model
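The inner loop of Alg. 3 above hinges on the gradient-penalty term, formalised in Eq. 21 below. This hedged, self-contained PyTorch fragment sketches how that penalty could be computed for one minibatch; λ = 10 and the toy networks/data are assumptions, not the authors' settings.

    import torch
    import torch.nn as nn

    m, n_feat, lam = 64, 8, 10.0                 # batch, features, penalty coeff (assumed)
    critic = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU(), nn.Linear(32, 1))
    real = torch.randn(m, n_feat)                # stand-in real faulty minibatch
    fake = torch.randn(m, n_feat)                # stand-in generated minibatch x~

    eps = torch.rand(m, 1)                       # epsilon ~ U[0, 1], one per sample (Alg. 3)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

    # Critic loss of Alg. 3 / Eq. 21: E[D(fake)] - E[D(real)] + gradient penalty.
    loss_d = critic(fake).mean() - critic(real).mean() + penalty
    loss_d.backward()                            # gradients ready for an Adam step on w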
To enforce the Lipschitz constraint without clipping the discriminator's weights, Gulrajani et al. [62] introduced the WGAN gradient penalty (WGAN-GP), which incorporates a penalised gradient such that the norm of the gradient of the discriminator's output with respect to its input is constrained. This results in reformulating Eq. 19 as follows:

$$L(D) = \underbrace{\mathbb{E}_{x\sim p_g(x)}[D(x)] - \mathbb{E}_{x\sim p_{data}(x)}[D(x)]}_{\text{original critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{x}\sim p_{\hat{x}}}\big[(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1)^2\big]}_{\text{gradient penalty}} \tag{21}$$

where the last term is a soft version of the constraint, with a penalty on the gradient norm for random samples x̂ ∼ p(x̂). The training procedure for WGAN-GP, incorporating the gradient penalty, is summarised in Alg. 3, and a detailed discussion of how WGAN can be improved using Eq. 21 is given in [62].

E. WASSERSTEIN CGAN (WCGAN)
Wasserstein CGAN is often discussed under the name WCGAN as in [63] or CWGAN as in [64], [65]. Similar to the CGAN discussed in Section IV-C, the generator and discriminator in WCGAN can be conditioned on the failure class labels y as auxiliary information. WCGAN incorporates a penalised gradient to constrain the discriminator's output, analogous to the WGAN-GP shown in Alg. 3. However, differently from WGAN-GP in Alg. 3, the objective functions in WCGAN are conditioned on the type of failure y. As such, the objective function of the minimax game of WGAN-GP (in Eq. 21) can be formulated to incorporate the class failure label y as follows:

$$L(D) = \underbrace{\mathbb{E}_{x\sim p_g(x)}[D(x|y)] - \mathbb{E}_{x\sim p_{data}(x)}[D(x|y)]}_{\text{original critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{x}\sim p_{\hat{x}}}\big[(\lVert\nabla_{\hat{x}} D(\hat{x}|y)\rVert_2 - 1)^2\big]}_{\text{with gradient penalty}} \tag{22}$$

where the last term is a penalty on the gradient norm for random samples x̂ ∼ p(x̂). The main objective function of the generator, L(G), is expressed conditionally on a label y as follows:

$$L(G) = -\mathbb{E}_{x\sim p_g(x)}[D(x|y)] \tag{23}$$

Zheng et al. [64] have used WCGAN-GP as an over-sampling approach to generate realistic minority samples based on learning the real distribution of the available minority samples. In comparison to WGAN-GP, the authors show that incorporating the class label in WCGAN-GP increases the quality of the synthetic generated data. The same observation
differently from WGAN-GP in Alg. 3, the objective functions quality of synthetic generative data. The same observation


is drawn in [66], where the authors constrained their WCGAN-based model conditionally on some features (e.g. colours) to generate better cartoon images from sketch images. Furthermore, Qin and Jiang [63] have discussed the main shortcoming of the WGAN model in the speech enhancement task, which aims to improve the performance of speech systems in a noisy environment. Although WGAN learns the characteristics of the speech data well, the model tends to overfit when trained in low-data environments. To this end, the authors introduced an objective function improved upon WCGAN-GP to improve the performance of the speech enhancement task. The improved WCGAN-GP, conditioned on a variety of features in voice data, makes it possible to improve the speech quality. However, there was no comparison with other existing GAN-based models.

Liu et al. [67] have tackled the limited availability of training data for predicting illegal accesses in Internet of Vehicles (IoV) applications. In particular, WCGAN has been used to generate synthetic illegal-access data such that there is a balance between the legal and illegal access datasets. The simulation results show that WCGAN converges faster than the traditional GAN and thus improves the prediction accuracy and reduces the false-negative rate. However, there was no comparison between WCGAN and WGAN.

Recently, a WCGAN-based model has also been developed to generate faulty samples conditioned on fault categories to improve the performance of fault diagnosis in industrial applications [65]. In comparison to the traditional GAN, the proposed WCGAN-based model generates good-quality faulty synthetic samples. Hence, it improves the prediction accuracy for faults (by 3% after over-sampling faulty examples) and avoids over-fitting. However, there is no comparison to other GAN-based approaches.

Based on these studies, WCGAN-based methods tend to avoid over-fitting the training data and to converge faster in different applications, including industrial scenarios. In this work, we present the first comprehensive evaluation of different GAN-based methods for predictive analytics in manufacturing applications, including WCGAN.

analysis framework to answer these questions, where we propose four performance metrics. The first two extract two essential dataset features: the level of skewness or imbalance ratio, and the goodness of the allocated labels using the Silhouette coefficient. The third measures the effectiveness of the predictive analytics by quantifying revealing indicators such as the precision, recall and F1-scores. The last applies to data-based methods only, as it measures the fit of the synthetic samples in comparison with the real data samples.

A. IMBALANCE RATIO
In a dataset with two classes, the Imbalance Ratio (IR) is defined as the proportion of the number of samples in the majority class to the number of samples in the minority class. Referring to the problem formulation in Section II, the majority class in a manufacturing problem comprises the good samples labelled yj = 0, while the minority represents the defects labelled yj = 1, where j = {1, 2, · · · , M} and M is the total number of data samples. This ratio can be calculated using information entropy.

Entropy measures quantify the information about the outcome class, given the class distribution. Shannon's entropy is one of the most widely used entropy measures [68]. In general, given a dataset with k classes, let Y be a class variable or label with k modalities (i.e., the number of classes), Y = {y1, · · · , yk}, with frequentist probabilities p = (p1, · · · , pk), where Σ_{i=1}^{k} p_i = 1 and p_i ≥ 0 ∀ i = 1, . . . , k. The Shannon entropy H of the probability distribution can be computed as follows:

$$H = -\sum_{i=1}^{k} p_i \log p_i \tag{24}$$

where p_i = |c_i|/M is the frequentist probability of a class labelled y_i, with c_i the cluster of all |c_i| samples with label y_i out of a total of M data samples. With this metric, H → 0 if the dataset is very unbalanced, and H → log k if the data is balanced.
analytics in manufacturing applications, including WCGAN. is balanced. Normalising the entropy H in Eq. 24 by log k
Furthermore, we extend the evaluation to encompass data- gives Ĥ ∈ [0, 1]. Therefore, the imbalance ratio (IR) can be
based, algorithm-based and hybrid methods for improving defined as follows:
the prediction accuracy of manufacturing faults. To the
k
best of our knowledge, this is the first statistical analysis P
− pi logb pi
framework for measuring the effectiveness of dealing with i=1
data imbalance when used in predictive analytics. IR = (25)
logb k

V. PERFORMANCE METRICS
In the recent advances made to circumvent the challenge of an imbalanced dataset, it has become apparent that while some methods are successful for a given classification problem, they may ultimately fail in the imbalanced classification task. What are the governing factors that render a method successful? What are the dataset aspects that dictate the correct method to apply? What is an adequate metric to gauge the effectiveness of a method, particularly in the context of manufacturing? This is the first work that offers a statistical analysis framework to answer these questions, where we propose four performance metrics. The first two extract two essential dataset features: the level of skewness or imbalance ratio, and the goodness of the allocated label using the Silhouette coefficient. The third measures the effectiveness of the predictive analytics by quantifying revealing indicators such as the precision, recall and F1-scores. The last applies to data-based methods only, as it measures the fit of the synthetic samples in comparison with the real data samples.

A. IMBALANCE RATIO
In a dataset with two classes, the Imbalance Ratio (IR) is defined as the proportion of the number of samples in the majority class to the number of samples in the minority class. Referring to the problem formulation in Section II, the majority class in a manufacturing problem comprises the good samples labelled y_j = 0, while the minority class represents the defects labelled y_j = 1, where j ∈ {1, 2, · · · , M} and M is the total number of data samples. This ratio can be calculated using information entropy.

Entropy measures quantify the information about the outcome class, given the class distribution. Shannon's entropy is one of the most widely used entropy measures [68]. In general, given a dataset with k classes, let Y be a class variable or label with k modalities (i.e., the number of classes), Y = {y_1, · · · , y_k}, with frequentist probabilities p = (p_1, · · · , p_k), where Σ_{i=1}^{k} p_i = 1 and p_i ≥ 0 ∀ i = 1, . . . , k. The Shannon entropy H of the probability distribution can be computed as follows:

H = − Σ_{i=1}^{k} p_i log p_i    (24)

where p_i = |c_i| / M is the frequentist probability of a class labelled y_i, with c_i the cluster of all |c_i| samples with label y_i out of a total of M data samples. With this metric, H → 0 if the dataset is very unbalanced and H → log k if the data is balanced. Normalising the entropy H in Eq. 24 by log k gives Ĥ ∈ [0, 1]. Therefore, the imbalance ratio (IR) can be defined as follows:

IR = (− Σ_{i=1}^{k} p_i log_b p_i) / (log_b k)    (25)

where IR tends to 0 when the data is highly imbalanced and 1 when the data is balanced.
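For illustration, the IR of Eq. 25 can be computed from the class labels in a few lines of Python; the following is a minimal sketch using NumPy and is not tied to any particular dataset.

    import numpy as np

    def imbalance_ratio(labels):
        # Frequentist class probabilities p_i = |c_i| / M
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        # Shannon entropy (Eq. 24) normalised by log(k) (Eq. 25)
        return -np.sum(p * np.log(p)) / np.log(len(p))

    # Example: 1,000 minority vs 59,000 majority samples
    labels = np.array([1] * 1000 + [0] * 59000)
    print(imbalance_ratio(labels))  # ~0.12, i.e. highly imbalanced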


B. SILHOUETTE COEFFICIENT
In Section III, we have discussed some of the conditional generative methods that create synthetic data samples of the minority class dependent on its labels (e.g. type of quality issues or faults). In order to group the minority data samples into k clusters, clustering approaches are used to identify natural groupings of the minority class. The Silhouette coefficient is a single score that is widely used for measuring the quality of clustering results [69].

The Silhouette coefficient measures how well the clusters are separated, independently of the number of clusters. In principle, it measures how close (i.e., similar) each sample in a cluster is to the other samples in the same cluster when compared to the samples in other clusters [70]. The coefficient has a value ∈ [−1, 1]. A high value of the coefficient means a better structure for the clusters. The Silhouette coefficient can be obtained as follows:

s(i) = (b(i) − d(i)) / max{d(i), b(i)},  ∀ i = 1, · · · , M    (26)

where M is the total number of samples, d(i) is the average dissimilarity of the sample x_i = {a_{i,1}, · · · , a_{i,M}} to other samples with the same label y_i, and b(i) is the average dissimilarity of the same sample with respect to the closest cluster with a different label. The average value of s(i) over all samples measures the quality of how well the M samples in the input data are clustered.
data are clustered. Precision × Recall
F1-score = 2 × (30)
Precision + Recall
C. FAULT AND QUALITY PREDICTION Another informative measure is the receiver operating charac-
In a highly imbalanced dataset, predictive models tend teristic curve (ROC), which is a graphical plot that illustrates
to be biased towards the majority class; having a high the diagnostic ability of a binary classifier. The ROC curve
misclassification rate for the minority class. Such a bias is created by plotting the recall (in Eq. 28) or TPR against
in manufacturing predictive analytics is detrimental to the the FPR (in Eq. 29) at various discrimination threshold
quality and cost of the end product. Therefore, there is a settings. The discriminating threshold value controls the
need for the classifier to provide high accuracy in predicting classification decision based on the probabilistic output of the
the minority class (e.g. defects) without deteriorating the classifier (Please refer to Section VI-D for more details about
classification performance of the majority class [12], [71]. the discrimination threshold). Similarly, the precision-recall
Using a single criterion, such as the overall accuracy or error curve (PRC) is another interpretation of the output of a binary
rate, fails to discern the failed predictions related to defects classifier where the precision (in Eq. 27) is plotted against the
when the availability of faulty data samples is limited. To this recall (in Eq. 28) at various potential discrimination threshold
end, more informative evaluation metrics, such as precision settings.
and recall, are proposed to capture the effectiveness of the ROC curves are appropriate when the data is relatively
classification method in the presence of imbalanced data. balanced between classes – having an equal number of data
Precision quantifies the ratio of the correctly predicted samples for each class in the dataset. However, with severely
faulty samples among all predicted samples in a classification imbalanced class distributions, PRC curves tend to be a more
model. Recall is another metric that measures the ratio informative measure that offers to find an optimal threshold
of the correctly predicted faulty samples among all actual that achieves a right balance between precision and recall,
faulty samples in the dataset (regardless of the classification). hence considering class distributions. Thus, in the context of
Precision can be calculated by the following: manufacturing predictive analytics, we propose to calculate
TP the Precision, Recall, F1-score, and PRC curve in addition to
Precision = (27) general accuracy and success rate for an in-depth evaluation
TP + FP
of the effectiveness of the method adopted.
where TP represents the number of correctly classified faults,
and FP represents the number of faultless samples that have
D. DATA GENERATION
been incorrectly classified as faulty. Recall, on the other
hand, is expressed as follows, where FN is the number of Data-based methods for curtailing the challenges of imbal-
faulty samples that are classified as faultless and TP + FN anced dataset employ different over-sampling generative
represents the total number of actual faulty samples: approaches for the creation of new samples from the minority
class. These approaches aim to minimise the distance
TP between the real and generated distributions. Nonetheless,
Recall = (28)
TP + FN it is crucial to measure the quality of generated data by
Precision tends to measure how well the fault prediction quantifying how similar generated data and the original data
algorithm can reduce the number of faultless samples that are are [57], [72], [73]. To this end, we need to develop a
misclassified – reducing false alarms. The probability of false single score to compare the quality of generated samples
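For completeness, all the metrics above are available in scikit-learn. The snippet below is a small illustrative sketch (the label and probability vectors are placeholders), computing precision, recall, F1-score and the points of the PRC.

    from sklearn.metrics import (precision_score, recall_score,
                                 f1_score, precision_recall_curve)

    # Placeholder ground-truth labels and classifier outputs
    y_true = [0, 0, 0, 1, 1, 0, 1, 0]
    y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
    y_prob = [0.1, 0.2, 0.7, 0.9, 0.4, 0.3, 0.8, 0.2]  # p(y=1|x)

    print(precision_score(y_true, y_pred))  # TP / (TP + FP), Eq. 27
    print(recall_score(y_true, y_pred))     # TP / (TP + FN), Eq. 28
    print(f1_score(y_true, y_pred))         # Eq. 30

    # PRC: precision/recall at every candidate discrimination threshold
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)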


D. DATA GENERATION
Data-based methods for curtailing the challenges of an imbalanced dataset employ different over-sampling generative approaches for the creation of new samples from the minority class. These approaches aim to minimise the distance between the real and generated distributions. Nonetheless, it is crucial to measure the quality of generated data by quantifying how similar the generated data and the original data are [57], [72], [73]. To this end, we need to develop a single score to compare the quality of generated samples of different over-sampling methods. Several metrics and statistical tests for measuring the similarity between two distributions exist, such as f-divergence (e.g., Hellinger distance, Jensen–Shannon divergence) [73], the Wasserstein distance [74], which is also known as the Kantorovich–Rubinstein metric, and the Kolmogorov–Smirnov test (KS-test) [75], [76], among others.

The KS-test is widely used in hypothesis testing for the comparison of the cumulative distribution functions (cdf) of given distributions [77], [78]. Suppose the real and generated data have a size of R and G samples, respectively. Let F_R(x) and F'_G(x) be the cdf of the real and generated data distributions, respectively. The KS-statistic is defined as the maximum distance between the two cdfs:

D_RG = max_x |F_R(x) − F'_G(x)|    (31)

where D_RG is the maximum absolute difference between the cdfs of the distributions of the real and generated data samples. In principle, the KS-statistic has a value ∈ [0, 1] that defines the overlap between the two distributions; 0 for perfect overlap and 1 for no overlap. Therefore, the KS-dissimilarity can be thought of as the fractional difference between the two distributions [79]. A null hypothesis H_0: F = F' is defined to check their overlap, i.e. whether the distributions of the generated and real faulty data samples are statistically similar. To this end, a q-value is obtained representing the hypothesis probability, taking into account the comparison between D_RG and a critical value C(α), such that the null hypothesis is rejected at a level α if:

D_RG > C(α) √((R + G) / (R · G))    (32)

where C(α) is a size-independent function with α as the chosen significance level for statistical significance. For q < α, the null hypothesis is rejected. This means the distribution of the generated data does not converge to the real faulty data samples, and they are not statistically similar. In such a case, the generative method generates poor-quality data samples. In contrast, for q > α, the hypothesis is accepted, and this implies a high quality of generated data samples. In principle, the selected significance value of α impacts recognising the statistical significance between the two distributions. For a small α value, a substantial difference between the two distributions is required for rejecting the null hypothesis, indicating a higher D_RG value. On the other hand, a significantly large α means that small differences between the two distributions are magnified – leading the null hypothesis to be rejected regardless of small D_RG values.
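As an illustration, the two-sample KS-test of Eqs. 31–32 is available in SciPy. The snippet below is a minimal sketch with synthetic placeholder distributions; rejecting the null hypothesis at α = 0.05 indicates poor-quality generated samples.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=1000)       # placeholder real faulty samples
    generated = rng.normal(0.1, 1.1, size=1000)  # placeholder synthetic samples

    stat, q_value = ks_2samp(real, generated)    # stat corresponds to D_RG (Eq. 31)
    alpha = 0.05
    if q_value < alpha:
        print('Reject H0: generated and real distributions differ')
    else:
        print('Accept H0: generated samples match the real distribution')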
VI. FRAMEWORK EVALUATION
As introduced in Section I, the proposed framework is not tailored for a particular dataset but is designed to address the challenge of data imbalance that results from an IoT-rich (smart) manufacturing environment. This is purposely a large domain and encompasses smart environments such as industrial or mechanical systems. Indeed, our main contribution is to present such a framework that is not tailored to a particular dataset but is able to extract the pertinent characteristics of any given dataset to guide the user in selecting the appropriate ML approach. This section explains the real-world dataset used in our evaluation, which is based on heavy trucks' air pressure system. The details of the pre-processing phase, the classifiers' parameterisation, and the post-processing steps are also presented. The evaluation framework discussed in Section V is then used to assess the effectiveness of each method.

A. AIR PRESSURE FAILURE DATA
The Air Pressure System (APS) plays a vital role in heavy Scania trucks. In principle, APS is a system that generates pressurised air to be used in various functions in a truck, such as gear changes and braking. The APS Scania dataset was collected from heavy Scania trucks and was made available by the Industrial Challenge for IDA 2016 (https://ida2016.blogs.dsv.su.se/?page_id=1387). Each instance in the dataset is classified as positive or negative. The APS is, thus, an example of smart mechanical systems and fits the purpose of framework validation since it has a high degree of bias. Although the data is collected from trucks in operation, the insights drawn from analysing this data reflect on the quality of the manufacturing process and highlight manufacturing issues when linked to other datasets such as factory (e.g. factory identification) and batch numbers.

The positive class indicates that reported failures are related to the APS system, while the negative class includes all other instances that have other types of faults. Differently from the common binary classification in manufacturing datasets (faulty and faultless), in this case, both classes reflect faults. Nonetheless, the positive class remains the less frequent type of fault (APS-related), hence represents the minority class, whereas non-APS-related faults are many and form the majority class, i.e. the negative class.

The main goal is to develop a binary predictive model that correctly identifies failures related to APS. An APS failure that is not predicted prior to its occurrence would incur a drastic cost on Scania. Thus, the predictive model is expected to perform well in terms of accuracy in general, but also in avoiding false alarms (misclassifying other failures as APS) and missed alarms (i.e., misclassifying an APS failure as that of another component).

Scania has provided training and testing datasets for APS. The training set contains 60,000 rows, of which only 1,000 instances belong to the positive class (i.e., faulty samples that are related to APS) and 59,000 to the negative class (i.e., faulty samples that are not associated with APS). Each row includes 171 anonymised features, one of which is the label (i.e., target class) column that indicates APS-related faults (i.e., positive class) or other faults (i.e., negative class). The testing set consists of 16,000 instances, of which only 375 instances belong to the positive class.


The APS dataset is therefore biased, with a high imbalance ratio between positive and negative classes [13], [80]. Indeed, 1.7% of the entire training set represents APS-related faulty samples (i.e., positive samples), while the remaining 98.3% belongs to other faulty samples (i.e., negative samples). It is expected that, with such imbalanced data, the predictive models would tend to be biased towards the majority class (i.e., non-APS faulty samples) and less sensitive towards the minority class (i.e., APS-related faulty samples) [18].

In addition to the dataset, Scania has provided a misclassification cost metric for the APS dataset. As discussed previously in Section III-B, the cost of predicting false negatives (C_FN) is much higher than that of false positives, i.e. false alarms (C_FP). In the Scania APS scenario, the cost of the former is C_FN = €500, while the latter is C_FP = €10 [80]. In principle, C_FP refers to the cost of unnecessary checks carried out by a mechanic as a result of false alarms. On the other hand, C_FN refers to undetected APS-related faults which may cause the truck to break down; hence, it is the incurred cost of downtime and repair. By substituting these cost values in Eq. 2, the total cost that should be minimised for predicting APS-related faulty samples is expressed as follows:

TCost = m C_FP + n C_FN = m × 10 + n × 500    (33)

where n is the number of undetected APS-related faulty samples (missed alarms), and m is the number of false alarms.
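Eq. 33 translates directly into a scoring function that can be used during evaluation and hyperparameter tuning. The sketch below assumes the convention that the positive (APS) class is labelled 1; it is an illustration, not our exact implementation.

    def total_cost(y_true, y_pred, c_fp=10, c_fn=500):
        # m: false alarms (non-APS faults flagged as APS)
        m = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        # n: missed alarms (APS faults classified as non-APS)
        n = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        return m * c_fp + n * c_fn

    # Example: one false alarm and two missed alarms cost 10 + 1000 = 1010
    print(total_cost([1, 1, 0, 0, 1], [0, 0, 1, 0, 1]))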
The Scania APS dataset is suitable for the evaluation of the data-driven methods introduced in Section III-A for dealing with the imbalanced data problem in manufacturing. Furthermore, the manufacturer's cost-sensitive function gives a domain expert's perception of the impact of different misclassification errors on the manufacturing process. It can, therefore, be implemented as an effective algorithm-based approach (as discussed in Section III-B) as well as in hybrid approaches that combine both data-driven and algorithm-based methods (as discussed in Section III-C).
B. DATA PRE-PROCESSING
As shown in Figure 1, data pre-processing includes essential and optional procedures. Data cleaning is essential to any classification exercise and aims to compensate for missing values, among other issues. The APS dataset contains up to 82% missing values per feature. We have replaced the missing values by the mean imputation method as in [80]. In such a method, all missing values in a particular column in the dataset are substituted with the mean value of the available values in that column.
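A minimal sketch of this imputation step, together with the PCA projection applied next (described in the following paragraphs), is given below using pandas and scikit-learn. The file name, the 'na' missing-value marker and the 'pos'/'neg' label encoding are assumptions about the IDA 2016 challenge files, and the exact component count returned by PCA depends on the preceding steps.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.decomposition import PCA

    # Assumed file name and missing marker of the IDA 2016 APS data
    df = pd.read_csv('aps_failure_training_set.csv', na_values='na')
    y = (df['class'] == 'pos').astype(int)   # 1: APS-related fault (minority)
    X = df.drop(columns=['class'])

    # Mean imputation: each missing entry replaced by its column mean
    X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

    # Keep the smallest number of components explaining 95% of the
    # variance (11 on our data, see Appendix A)
    X_pca = PCA(n_components=0.95).fit_transform(X_imputed)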
The pre-processing phase also includes optional procedures for reducing the number of features of a multi-dimensional dataset to facilitate its interpretation. The APS dataset contains 171 features, one of which is the label class (i.e., to indicate if the faulty sample is APS-related or other). To reduce the number of features (i.e., dimensions), we use Principal Component Analysis (PCA). PCA simplifies the complexity of high-dimensional data by geometrically projecting it into fewer dimensions that maximise the variance, called Principal Components (PCs). The primary goal of PCA is to obtain the best summary of the data using a limited number of PCs [81]. In this process, we found that 11 principal components are sufficient to capture 95% of the information in the dataset – having a cumulative explained variance percentage of 95%. This number of principal components was found to be a good indication of the point of diminishing returns (i.e., little variance is gained by retaining additional principal components). We have included more details about the steps that yielded the number 11 in Appendix A.

Data augmentation is another optional pre-processing procedure, as discussed in Section III-A. We describe the methodology that we adopted with the APS dataset for different data augmentation techniques in Section IV. In the following paragraph, we elaborate on the selection of data augmentation methods used in the hybrid combinations of this study.

C. PREDICTIVE MODELS WITH APS IMBALANCED DATA
We have developed three different machine learning predictive models to detect APS-related faults (i.e., positive or minority class) or non-APS-related faults (i.e., negative or majority class). The three models are Logistic Regression (LR) [82], Random Forest (RF) [83] and XGBoost [84]. These models are represented as ML Classifiers in Figure 1. We have used these binary classification models to make a fair empirical comparison between the different counter-balancing techniques explained in Section IV. Given that the original APS dataset is highly imbalanced, it calls for procedures to generate synthetic APS-related failure data samples for the Ready-to-use training set, as shown in Figure 1. We have conducted a set of experiments that can be grouped into the following four cases. Each of these cases represents a group of experiments with common pre-processing methods but different classifiers. The performance metrics for these experiments are reported based on their total cost in addition to their achieved accuracy, precision, recall, and F1-score (discussed in Section V-C).
• Case I: evaluating the classifier algorithms using the original training data without data augmentation. Principal Component Analysis (PCA) is employed in the pre-processing phase to simplify the complexity of our high-dimensional data.
• Case II: using SMOTE to generate synthetic samples of APS-related faulty samples such that the original training dataset is augmented with new synthetic minority samples (a sketch of this augmentation step is given after this list). PCA is also applied in the pre-processing phase, and each experiment is repeated for different imbalance ratios (explained in Section V-A).
• Case III: using GAN and WGAN for data augmentation. PCA is also applied in the pre-processing phase, and each experiment is repeated for different imbalance ratios. Moreover, the quality of generated samples is


evaluated using the performance metric discussed in Section V-D.
• Case IV: clustering the APS-faulty samples; clusters are reported based on the metric discussed in Section V-B to measure the separability between clusters. More precisely, we measure the distance between each data sample, the centroid of its assigned cluster and the closest centroid belonging to another cluster. We then use CGAN and WCGAN to generate synthetic samples conditioned on the cluster assigned to minority data samples. For this set of experiments, we also apply PCA, and we calculate the same performance metrics as in Case III for various imbalance ratios.
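The Case II augmentation step referenced above can be sketched with imbalanced-learn as follows. The target of 10,000 minority samples corresponds to the IR = 0.6 setting; the training arrays are random placeholders standing in for the pre-processed APS data.

    import numpy as np
    from collections import Counter
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(60000, 11))            # placeholder PCA features
    y_train = np.array([1] * 1000 + [0] * 59000)      # 1: APS-related fault

    # Over-sample only the minority class to a chosen size,
    # leaving the 59,000 majority samples untouched
    smote = SMOTE(sampling_strategy={1: 10000}, random_state=42)
    X_res, y_res = smote.fit_resample(X_train, y_train)

    print(Counter(y_res))  # e.g. {0: 59000, 1: 10000}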
It should be noted that the four cases mentioned here aim to minimise the total cost of misclassification. To this end, we tune the hyperparameters of the classification algorithms using 5-fold cross-validation in such a way that the total cost of misclassification of APS-related faulty samples (in Eq. 33) is minimised. In other words, each of the described four cases implements algorithm-based methods to curtail the pitfalls of imbalanced datasets. Case I falls in the category of cost-sensitive approaches (discussed in Section III-B). Indeed, all three classifier algorithms (LR, RF, and XGBoost) in Case I are trained to minimise the cost matrix for binary misclassification without modifying the original training dataset. On the other hand, all other cases are considered hybrid methods (discussed in Section III-C). In each of these cases, the original training dataset is augmented by generating synthetic APS-related faulty samples (the number of minority class samples is increased relative to the majority class). At the same time, the three classifiers are trained, and their parameters are tuned to minimise the misclassification cost matrix. Thus, Cases II, III, and IV combine data-driven and algorithm-based approaches, hence fall under the hybrid category.

FIGURE 4. Decision boundary to find the optimal threshold th. y = 1 for APS faulty samples; y = 0 for non-APS samples. TP: True Positive, FP: False Positive (i.e., false alarm), TN: True Negative, and FN: False Negative.
D. THRESHOLD FOR IMBALANCED CLASSIFICATION
The RF, LR and XGBoost classifiers output a probability that estimates the likelihood of associating a class label with each of the data samples in the testing dataset. This probability gives some confidence in the label prediction. In principle, the output probability is then converted to a discrete class label in the post-processing phase (see Figure 1). In a binary classifier, this is achieved by using a threshold that is referred to as a decision threshold, a discrimination threshold or a cut-off. The default threshold value is often set to 50% or 0.5 when the dataset is balanced, as shown in Figure 4. However, a decision threshold of 0.5 may not provide an optimal interpretation of predicted probabilities in the case of imbalanced datasets. To this end, the decision threshold is moved along the x-axis in Figure 4 to compensate for the data imbalance and improve the prediction results. This is referred to as the threshold-moving method in the post-processing phase of classifiers [85].

The decision threshold is selected with the aim of minimising the probabilities of FP and FN, as shown in Figure 4. Referring to our problem formulation in Section II, label y = 1 represents the minority class whereas y = 0 refers to the majority class. Then, for a sample x, when a classifier outputs a probability p(y = 1|x) ≥ th, it is classified as a positive sample; otherwise as a negative sample. The threshold-moving method aims to adjust the value of the decision threshold in order to improve the predictions. Given a threshold centred at the value th = 0.5, suppose that a data sample x1 has a probability p(y = 1|x1) = 0.6 such that p(y = 1|x1) > p(y = 0|x1); it is thus classified as belonging to the minority class. Another data sample x2 with p(y = 0|x2) = 0.8 and p(y = 0|x2) > p(y = 1|x2) is classified as belonging to the majority class. If, however, the threshold is moved to th = 0.85 or 85%, then x2 will be assigned to the minority class instead of the majority because p(y = 0|x2) < th. Likewise, with th = 0.85, x1 would remain in the minority class since p(y = 0|x1) < th.

According to the threshold-moving method, a sample is only classified as belonging to the majority class if the classifier's confidence in this classification is higher than the set threshold. In our dataset, this implies that a sample is considered to be an APS-related fault unless the classifier is highly confident that the fault does not relate to APS. As discussed in Section V-C, the optimum value of the decision threshold for an imbalanced dataset is derived using the precision-recall curve (PRC).

In [13], the moving-threshold approach is associated with the confidence level of a given sample belonging to the majority class. The best results of predicting APS faulty samples were obtained using a decision threshold of th = 95%. In this moving-threshold application, the selected threshold determines whether an instance belongs to the majority class (i.e., non-APS-related faulty samples or negative samples). In principle, a data sample is only classified as a non-APS fault if the related confidence level exceeds the particular threshold (i.e., ≥ 95%); otherwise, it is classified as an APS-related fault.
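A minimal sketch of this selection procedure follows: sweep the candidate thresholds returned by the precision-recall curve and keep the one minimising the total cost of Eq. 33. The validation arrays are placeholders, and classifying a sample as an APS fault whenever p(y = 1|x) ≥ th is one of the two equivalent conventions described above.

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def best_threshold(y_val, p_val, c_fp=10, c_fn=500):
        # Candidate thresholds from the PRC on p(y=1|x)
        _, _, thresholds = precision_recall_curve(y_val, p_val)
        costs = []
        for th in thresholds:
            y_hat = (p_val >= th).astype(int)
            m = np.sum((y_val == 0) & (y_hat == 1))  # false alarms
            n = np.sum((y_val == 1) & (y_hat == 0))  # missed alarms
            costs.append(m * c_fp + n * c_fn)
        return thresholds[int(np.argmin(costs))]

    # Example with placeholder validation outputs
    y_val = np.array([0, 0, 1, 1, 0, 1])
    p_val = np.array([0.2, 0.4, 0.35, 0.8, 0.1, 0.6])
    print(best_threshold(y_val, p_val))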


TABLE 2. Case I: performance of classifiers on actual data (PCs = 11) without augmenting synthetic faulty data samples.

Following the same work on the APS dataset in [13], [80], [86], we incorporate a threshold-moving method in the post-processing phase of classifiers in our conducted experiments. In such a case, the precision-recall curve (PRC, explained previously in Section V-C) is utilised to obtain the list of potential threshold values. The value of this threshold is optimised whilst minimising the total misclassification cost: the best threshold is the one that achieves the minimum cost.
E. PARAMETER SETTINGS AND REPRODUCIBILITY
We have developed and evaluated the classification algorithms and the experiments explained above in Python. The scikit-learn (sklearn) library provides the main implementations of these algorithms [87]. In sklearn, we are able to incorporate class-specific weights in the loss function of the LR and RF algorithms. Similar to [13], the weight of each class is automatically adjusted to be inversely proportional to the class frequencies. SMOTE is also implemented and available in the imbalanced-learn API (https://imbalanced-learn.readthedocs.io/en/stable/api.html). In each experiment, we only mention the parameter settings if they are different from the default values in sklearn and the imbalanced-learn API. On the other hand, our GAN-based approaches are adapted from the available open-source GAN-Sandbox (https://github.com/mjdietzx/GAN-Sandbox). To ensure the reproducibility of our results, we have made the code and dataset of our implementations available and have also provided details of a configurable experimental set-up at: https://github.com/YasminFathy/HandleImbalancedDatasets
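For reference, the class-specific weighting mentioned above corresponds to the following scikit-learn configuration; this is an illustrative sketch, with the remaining hyperparameters left at their defaults.

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # 'balanced' re-weights each class inversely proportional to its
    # frequency, so errors on the rare APS class weigh more in the loss
    lr = LogisticRegression(class_weight='balanced', max_iter=1000)
    rf = RandomForestClassifier(class_weight='balanced', random_state=42)

    # lr.fit(X_train, y_train); rf.fit(X_train, y_train)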
VII. RESULTS AND DISCUSSION
In this section, we discuss and analyse the results of our four sets of experiments discussed in Section VI-C.

A. CASE I
Table 2 shows the results of Case I where LR, RF and XGBoost are trained on the original data and tested on the testing set. As mentioned previously, we apply PCA in the pre-processing phase and the moving-threshold method during the post-processing step, as detailed in Section VI-D. The reported results in Table 2 are derived after finding the optimal decision threshold empirically such that a minimum total cost is achieved. The threshold values for LR, RF and XGBoost are 65.5%, 70% and 98.2%, respectively. Each threshold value was obtained based on optimising the minimum cost while aiming to achieve a trade-off between precision and recall through using the precision-recall curve (PRC) (as discussed in Section V-C). In the Case I experiments, XGBoost achieves the best results with a total cost of €12,230, as highlighted in Table 2. Our results are aligned with the work in [49], where a boosting-based method achieves less total cost than RF.

The results obtained in Case I show that the accuracy is not an informative metric to measure the overall performance of an imbalanced binary classification. As shown in Table 2, LR achieves 94% accuracy, which is equivalent to only 1% deterioration compared to the XGBoost result of 95%. Furthermore, the results of both classifiers with respect to Recall and Precision are exactly the same. However, examining the total cost achieved by each classifier reveals that XGBoost indeed outperforms LR by a significant margin of 26%. On the other hand, in comparison with RF, LR has better accuracy by 1% and exactly the same F1-score. However, the cost reduction achieved by RF in comparison with LR is also a significant margin of 17%. These results further demonstrate the discussion around context-aware performance measurement presented in Section V. In fact, using a single criterion such as precision, recall, F1-score or accuracy fails to discern the critical performance of a classifier in the presence of imbalanced datasets.

B. CASE II
Table 3 shows the results of Case II where LR, RF and XGBoost are trained on the Ready-to-use training data to which we have applied PCA and data augmentation (for the minority class only) using SMOTE. The original training dataset has an imbalance ratio of IR = 0.1 (Eq. 25), which corresponds to 1,000 minority samples over 59,000 majority samples. Using SMOTE for data augmentation, we have a total number of 2,000, 5,000 and 10,000 minority samples for IR equal to 0.2, 0.4, and 0.6, respectively. We do not alter the number of majority samples and do not alter the testing dataset, as shown in Figure 1.

The results shown in Table 3 include those obtained in Case I (reported in Table 2) and are aligned in terms of classifier ranking. The best performance of each classifier for different IR settings is highlighted in bold. As can be seen, XGBoost outperforms RF, which is followed by LR. However, it is worth noting that despite the three data augmentation levels (IR = {0.2, 0.4, 0.6}), the performance gain of the classifiers remains limited. LR and RF achieve minimal cost reductions of 1.6% and 2.8%, respectively, whereas XGBoost does not benefit from data augmentation. XGBoost fails to reduce the number of false positives (i.e., false alarms) as more synthetic faulty data samples are


TABLE 3. Case II: performance of classifiers with augmenting artificial faulty data samples obtained using SMOTE to original data (with PCs = 11).

generated by SMOTE. Unlike LR and RF, which incorporate class-specific weights in the loss function, XGBoost weights misclassified samples equally; its performance often degrades with imbalanced data, and it has to be combined with other ensembling methods to improve imbalanced classification [88].

C. CASE III
Table 4 shows the results of Case III where LR, RF and XGBoost are trained on the Ready-to-use training data to which we have applied PCA and data augmentation (for the minority class only) using GAN. The same approach to data augmentation described in Section VII-B with respect to IR = {0.2, 0.4, 0.6} and the testing data is adopted, except that GAN is used instead of SMOTE. The real and generated distributions tend to be similar; this is measured using the KS-test (discussed in Section V-D). The same ranking between the three classifiers is maintained as in Cases I and II. Moreover, the same trend as in Case II is seen, whereby LR performs best with IR = 0.2 and RF performs best for IR = 0.6.

Although XGBoost still does not benefit from the data augmentation, RF and LR achieve better cost reduction than with SMOTE, with 8.9% and 8.7% improvement compared to Case I. Similar experiments have been conducted with WGAN for data augmentation.

GAN is a generative model that produces new content based on the presented training data; however, the generated data might be noisy [31]. For example, in [31], GAN improved the classification when compared with the original imbalanced dataset; however, it did not perform better than SMOTE. Our experiments show different behaviour when comparing the results obtained in Case I, Case II, and Case III with both the LR and RF classifiers. In both classifiers, SMOTE presents a gain compared to Case I, and GAN brings a larger gain compared to SMOTE. Indeed, we argue that the benefit of data augmentation and the superiority of GANs over SMOTE essentially depend on the underlying characteristics of the minority class and the data complexity.

Table 5 shows the results of Case III using WGAN with IR = {0.2, 0.4, 0.6}. The same ranking between the three classifiers is maintained as in all previous cases. However, we see a degradation in performance compared to the results achieved by GAN (Table 4). With WGAN, the cost reduction achieved by LR and RF shows a limited improvement of 4.0% and 5.1% compared to Case I.

Analogous to GAN, WGAN tends to converge faster and to be stable during the training. However, WGAN has no substantial effect on reducing the total classification cost in our experiments.

D. CASE IV
Table 6 and Table 7 show the results obtained by CGAN and WCGAN, respectively, for each of the classifiers and IR ratios. It is clear that generating synthetic data conditionally on the class label did not improve the results in our experiment. This outcome is also related to the underlying characteristics of the dataset. When calculating the Silhouette coefficient (SC) (discussed in Section V-B), which measures the quality of clusters in the APS dataset, we find SC = 0.4. This is obtained by clustering the APS faulty data samples


TABLE 4. Case III: performance of classifiers with augmenting artificial faulty data samples obtained using GAN to original data (with PCs = 11).

TABLE 5. Case III: performance of classifiers with augmenting artificial faulty data samples obtained using WGAN to original data (with PCs = 11).

using hierarchical clustering (i.e., agglomerative clustering). Having SC = 0.4 means that it is quite hard to separate the APS-related fault samples into clusters. For that reason, conditioning GAN on the cluster label does not bring any advantage. This outcome further proves our claim in Section V that there is no one-solution-fits-all when it comes to binary classification. Moreover, it is essential to examine the dataset at hand and understand its context to allow a pertinent choice of classification methods that would prove effective.


TABLE 6. Case IV: performance of classifiers with augmenting artificial faulty data samples obtained using CGAN to original data (with PCs = 11).

TABLE 7. Case IV: performance of classifiers with augmenting artificial faulty data samples obtained using WCGAN to original data (with PCs = 11).

Although WCGAN (in Table 7) shows a slight improvement over WGAN (in Table 5), with LR and RF achieving 6.5% and 6.8% gains compared to Case I, the benefit of conditional training remains limited due to the low Silhouette coefficient, as explained in Section VII-C. This is another demonstration of the necessity to understand the intrinsic characteristics of the data before selecting the data-augmentation method. With the APS dataset, using the Wasserstein distance between two probability distributions in the training process instead of the Kullback–Leibler


divergence does not improve the quality of the generated synthetic samples. This is clear from the comparison of the results in Table 4 and those in Table 5, and relates to the Silhouette coefficient of the dataset. Further investigation is left for future work.

VIII. LESSONS LEARNT AND CONCLUSIVE REMARKS
Industry 4.0 offers manufacturing the potential of leveraging IoT data with machine learning to reap the benefits of automation and excellence in quality control. A major obstacle currently facing the progress in this field is the nature of the data generated by manufacturing processes. Indeed, manufacturing data is overwhelmed with instances of good performance and contains few examples of malfunctioning; this is referred to as imbalanced data. Since machine learning is a data-driven learning process, imbalanced data results in biased learning. In the context of manufacturing, this learning bias causes most manufacturing faults to go unnoticed, compromising the quality of the end product. The results of our study are expected to improve decision-making in smart manufacturing by detecting unexpected faults that affect products' quality.

In this work, we have presented the first comprehensive comparative analysis of various methods in the literature that aim to curtail the curse of data imbalance in the ML process. We present an evaluation framework that considers all steps of the process, including data preparation and pre-processing, classifier design, and post-processing. More importantly, we set up a set of key performance indicators that jointly reveal the effectiveness of each method in a context-aware fashion.

We have applied our framework on an industry-based dataset, which enabled us to conduct the comparative analysis and draw key insights with particular application to smart manufacturing. We summarise key lessons learnt from this study that we hope will be a useful guide to future research in this field:
• All the experiments conducted in each of the cases in this study have been assessed using the proposed evaluation framework. The overarching insight that can be drawn from these results is that any of the key metrics, such as accuracy or F1-score, can be misleading when examined in isolation. As such, it is essential to inspect the full spectrum of performance metrics to have a complete evaluation of the ML techniques used in manufacturing.
• In binary classification, there is no one-solution-fits-all, as the optimum solution depends on the intrinsic characteristics of the dataset and the context in which the data is collected and the classification results are employed. Here are a few insights into the role of data and context in the effectiveness of ML methods:
  – Our experiments demonstrate that XGBoost, empowered with a cost-sensitive function and context-aware moving threshold, outperforms the Logistic Regression and Random Forest classifiers in every setting, even without data augmentation. This shows that a domain expert's input into the interpretation of the data is critical to the success of ML algorithms.
  – The similarity between the real and generated distributions is quantified using the KS-test to measure the quality of generated samples. The number of generated synthetic samples does not have a linear relation with the performance of the ML algorithm. In fact, some classifiers perform better with a larger synthetic dataset, and others do the opposite, as there is a risk of an unsuitable augmentation method generating noise instead of useful data samples. This is an effect of the interplay between the underlying data features, the data augmentation method, and the intrinsic method implemented in the classifier.
  – It is generally believed that conditional GANs (e.g., CGAN and WCGAN) tend to improve the classification performance in comparison with ordinary GANs (e.g., GAN and WGAN). However, our experiments have demonstrated that this is only true if applied to a compatible dataset. The APS dataset used in this work has a low Silhouette coefficient (i.e., a poor distinction between both classes); hence, conditioning the training process on the class label adds little benefit to the end results. In other words, the APS dataset is not compatible with conditional GAN derivatives when conditioned on the class label.
  – Augmenting artificial faulty data samples obtained by GAN achieves a larger gain compared to SMOTE (using logistic regression). More precisely, GAN achieves up to 9% cost reduction compared to the original data (with no augmentation) in Case III and 7% compared to SMOTE in Case II. On the other hand, SMOTE improves the predictive analytics (using random forest) and offers a higher gain compared with the original data (with no augmentation), with a cost reduction of up to 3% in Case II.
• Overall, it is not always practical to compare our results with existing similar studies. Most of the studies, including [49], [86], do not report the parameter settings of each chosen classifier, which makes it a challenging task to reproduce their experiments. For instance, the authors in [49] achieve a total cost of €11,090 using RF as a binary classifier on the original data; however, they do not report parameter settings, the imputation technique, whether a decision threshold is applied and its value, or other evaluation metrics including precision and recall.

APPENDIX
PRINCIPAL COMPONENTS
PCA finds a projection of high-dimensional data into a lower-dimensional subspace such that the maximum variance of the data is retained and the least-square reconstruction error is


minimised. Figure 5 shows the cumulative variance retained by different numbers of principal components. We found that 11 principal components are sufficient to capture 95% of the variance in the dataset, where little variance is gained by retaining additional principal components (i.e., 11 components seem to be a good indication of the point of diminishing returns).

FIGURE 5. The cumulative variance explained by different numbers of principal components.

REFERENCES
[1] F. Tao, Q. Qi, A. Liu, and A. Kusiak, ''Data-driven smart manufacturing,'' J. Manuf. Syst., vol. 48, pp. 157–169, Jul. 2018.
[2] A. Kanawaday and A. Sane, ''Machine learning for predictive maintenance of industrial machines using IoT sensor data,'' in Proc. 8th IEEE Int. Conf. Softw. Eng. Service Sci. (ICSESS), Nov. 2017, pp. 87–90.
[3] M. Syafrudin, G. Alfian, N. Fitriyani, and J. Rhee, ''Performance analysis of IoT-based sensor, big data processing, and machine learning model for real-time monitoring system in automotive manufacturing,'' Sensors, vol. 18, no. 9, p. 2946, Sep. 2018, doi: 10.3390/s18092946.
[4] J.-H. Oh, J. Y. Hong, and J.-G. Baek, ''Oversampling method using outlier detectable generative adversarial network,'' Expert Syst. Appl., vol. 133, pp. 1–8, Nov. 2019.
[5] B. Song, X. Zhou, H. Shi, and Y. Tao, ''Performance-indicator-oriented concurrent subspace process monitoring method,'' IEEE Trans. Ind. Electron., vol. 66, no. 7, pp. 5535–5545, Jul. 2019.
[6] B. Song, H. Yan, H. Shi, and S. Tan, ''Multisubspace elastic network for multimode quality-related process monitoring,'' IEEE Trans. Ind. Informat., vol. 16, no. 9, pp. 5874–5883, Sep. 2020.
[7] B. Song, H. Shi, S. Tan, and Y. Tao, ''Multi-subspace orthogonal canonical correlation analysis for quality related plant wide process monitoring,'' IEEE Trans. Ind. Informat., early access, Aug. 7, 2020, doi: 10.1109/TII.2020.3015034.
[8] Y. Fathy, P. Barnaghi, and R. Tafazolli, ''Large-scale indexing, discovery, and ranking for the Internet of Things (IoT),'' ACM Comput. Surv., vol. 51, no. 2, pp. 1–53, Jun. 2018.
[9] B. Krawczyk, ''Learning from imbalanced data: Open challenges and future directions,'' Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, Nov. 2016.
[10] G. Wang, A. Ledwoch, R. M. Hasani, R. Grosu, and A. Brintrup, ''A generative neural network model for the quality prediction of work in progress products,'' Appl. Soft Comput., vol. 85, Dec. 2019, Art. no. 105683.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ''SMOTE: Synthetic minority over-sampling technique,'' J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002.
[12] H. He and E. A. Garcia, ''Learning from imbalanced data,'' IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[13] C. F. Costa and M. A. Nascimento, ''IDA 2016 industrial challenge: Using machine learning for predicting failures,'' in Proc. Int. Symp. Intell. Data Anal. Cham, Switzerland: Springer, 2016, pp. 381–386.
[14] J. Wang, Y. Ma, L. Zhang, R. X. Gao, and D. Wu, ''Deep learning for smart manufacturing: Methods and applications,'' J. Manuf. Syst., vol. 48, pp. 144–156, Jul. 2018.
[15] G. M. Weiss, ''Mining with rarity: A unifying framework,'' ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 7–19, Jun. 2004.
[16] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano, ''Experimental perspectives on learning from imbalanced data,'' in Proc. 24th Int. Conf. Mach. Learn. (ICML), 2007, pp. 935–942.
[17] C. K. Aridas, S. Karlos, V. G. Kanas, N. Fazakis, and S. B. Kotsiantis, ''Uncertainty based under-sampling for learning naive Bayes classifiers under imbalanced data sets,'' IEEE Access, vol. 8, pp. 2122–2133, 2020.
[18] J. M. Johnson and T. M. Khoshgoftaar, ''Survey on deep learning with class imbalance,'' J. Big Data, vol. 6, no. 1, p. 27, Dec. 2019.
[19] W.-C. Lin, C.-F. Tsai, Y.-H. Hu, and J.-S. Jhang, ''Clustering-based undersampling in class-imbalanced data,'' Inf. Sci., vols. 409–410, pp. 17–26, Oct. 2017.
[20] I. Mani and I. Zhang, ''kNN approach to unbalanced data distributions: A case study involving information extraction,'' in Proc. Workshop Learn. Imbalanced Datasets, vol. 126, Aug. 2003, pp. 1–7.
[21] Y. Hou, B. Li, L. Li, and J. Liu, ''A density-based under-sampling algorithm for imbalance classification,'' J. Phys., Conf. Ser., vol. 1302, Aug. 2019, Art. no. 022064.
[22] F. Kamalov, ''Kernel density estimation based sampling for imbalanced class distribution,'' Inf. Sci., vol. 512, pp. 1192–1201, Feb. 2020.
[23] A. Fernández, S. del Río, N. V. Chawla, and F. Herrera, ''An insight into imbalanced big data classification: Outcomes and challenges,'' Complex Intell. Syst., vol. 3, no. 2, pp. 105–120, Jun. 2017.
[24] D.-H. Lee, J.-K. Yang, C.-H. Lee, and K.-J. Kim, ''A data-driven approach to selection of critical process steps in the semiconductor manufacturing process considering missing and imbalanced data,'' J. Manuf. Syst., vol. 52, pp. 146–156, Jul. 2019.
[25] K. Yang, Z. Yu, X. Wen, W. Cao, C. L. Philip Chen, H.-S. Wong, and J. You, ''Hybrid classifier ensemble for imbalanced data,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 4, pp. 1387–1400, Apr. 2020.
[26] Y. O. Lee, J. Jo, and J. Hwang, ''Application of deep neural network and generative adversarial network to industrial maintenance: A case study of induction motor fault detection,'' in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2017, pp. 3248–3253.
[27] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, ''Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem,'' in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining. Berlin, Germany: Springer, 2009, pp. 475–482.
[28] H. Han, W.-Y. Wang, and B.-H. Mao, ''Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,'' in Proc. Int. Conf. Intell. Comput. Berlin, Germany: Springer, 2005, pp. 878–887.
[29] H. He, Y. Bai, E. A. Garcia, and S. Li, ''ADASYN: Adaptive synthetic sampling approach for imbalanced learning,'' in Proc. IEEE Int. Joint Conf. Neural Netw. (IEEE World Congr. Comput. Intell.), Jun. 2008, pp. 1322–1328.
[30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ''Generative adversarial nets,'' in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[31] F. H. K. dos Santos Tanaka and C. Aranha, ''Data augmentation using GANs,'' 2019, arXiv:1904.09135. [Online]. Available: http://arxiv.org/abs/1904.09135
[32] G. D. Ranasinghe and A. Kumar Parlikad, ''Generating real-valued failure data for prognostics under the conditions of limited data availability,'' in Proc. IEEE Int. Conf. Prognostics Health Manage. (ICPHM), Jun. 2019, pp. 1–8.
[33] C. Elkan, ''The foundations of cost-sensitive learning,'' in Proc. Int. Joint Conf. Artif. Intell., vol. 17, no. 1, 2001, pp. 973–978.
[34] A. Ali, S. M. Shamsuddin, and A. L. Ralescu, ''Classification with class imbalance problem: A review,'' Int. J. Advance Soft Compu. Appl., vol. 7, no. 3, pp. 176–204, 2015.
[35] Y. Hu, C. Guo, E. W. T. Ngai, M. Liu, and S. Chen, ''A scalable intelligent non-content-based spam-filtering framework,'' Expert Syst. Appl., vol. 37, no. 12, pp. 8557–8565, Dec. 2010.
[36] C. X. Ling and V. S. Sheng, ''Cost-sensitive learning and the class imbalance problem,'' Encyclopedia Mach. Learn., vol. 2011, pp. 231–235, Jan. 2008.
[37] B. Krawczyk, M. Woźniak, and F. Herrera, ''On the usefulness of one-class classifier ensembles for decomposition of multi-class problems,'' Pattern Recognit., vol. 48, no. 12, pp. 3969–3982, Dec. 2015.
[38] L. Rokach, ''Ensemble-based classifiers,'' Artif. Intell. Rev., vol. 33, nos. 1–2, pp. 1–39, 2010.
pp. 144–156, Jul. 2018. nos. 1–2, pp. 1–39, 2010.


[39] L. Breiman, ‘‘Bagging predictors,’’ Mach. Learn., vol. 24, no. 2, [65] Y. Yu, B. Tang, R. Lin, S. Han, T. Tang, and M. Chen, ‘‘CWGAN:
pp. 123–140, Aug. 1996. Conditional wasserstein generative adversarial nets for fault data gener-
[40] Y. Freund and R. E. Schapire, ‘‘A decision-theoretic generalization of on- ation,’’ in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019,
line learning and an application to boosting,’’ J. Comput. Syst. Sci., vol. 55, pp. 2713–2718.
no. 1, pp. 119–139, Aug. 1997. [66] Y. Liu, Z. Qin, T. Wan, and Z. Luo, ‘‘Auto-painter: Cartoon image
[41] J. H. Friedman, ‘‘Greedy function approximation: A gradient boosting generation from sketch by using conditional wasserstein generative
machine,’’ Ann. Statist., vol. 29, pp. 1189–1232, Oct. 2001. adversarial networks,’’ Neurocomputing, vol. 311, pp. 78–87, Oct. 2018.
[42] L. I. Kuncheva, Combining Pattern Classifiers: Methods Algorithms. [67] Y. Liu, M. Xiao, Y. Zhou, D. Zhang, J. Zhang, H. Gacanin, and J. Pan,
Hoboken, NJ, USA: Wiley, 2014. ‘‘An access control mechanism based on risk prediction for the IoV,’’ in
[43] M. Woźniak, M. Graña, and E. Corchado, ‘‘A survey of multiple classifier Proc. IEEE 91st Veh. Technol. Conf. (VTC-Spring), May 2020, pp. 1–5.
systems as hybrid systems,’’ Inf. Fusion, vol. 16, pp. 3–17, Mar. 2014. [68] C. E. Shannon, ‘‘A mathematical theory of communication,’’ ACM
[44] Q. Cao and S. Wang, ‘‘Applying over-sampling technique based on data SIGMOBILE Mobile Comput. Commun. Rev., vol. 5, no. 1, pp. 3–55, 2001.
density and cost-sensitive SVM to imbalanced learning,’’ in Proc. Int. Conf. [69] P. J. Rousseeuw, ‘‘Silhouettes: A graphical aid to the interpretation and
Inf. Manage., Innov. Manage. Ind. Eng., Nov. 2011, pp. 543–548. validation of cluster analysis,’’ J. Comput. Appl. Math., vol. 20, pp. 53–65,
Nov. 1987.
[45] S. Wang, Z. Li, W. Chao, and Q. Cao, ‘‘Applying adaptive over-sampling
[70] Y. Fathy, P. Barnaghi, S. Enshaeifar, and R. Tafazolli, ‘‘A distributed in-
technique based on data density and cost-sensitive SVM to imbalanced
network indexing mechanism for the Internet of Things,’’ in Proc. IEEE
learning,’’ in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jun. 2012,
3rd World Forum Internet Things (WF-IoT), Dec. 2016, pp. 585–590.
pp. 1–8.
[71] J. Stefanowski, ‘‘Dealing with data difficulty factors while learning from
[46] C. Drummond and R. C. Holte, ‘‘C4. 5, class imbalance, and cost imbalanced data,’’ in Challenges in Computational Statistics and Data
sensitivity: Why under-sampling beats over-sampling,’’ in Proc. ICML Mining. Cham, Switzerland: Springer, 2016, pp. 333–363.
Workshop Learn. Imbalanced Datasets, vol. 11, 2003, pp. 1–8. [72] T. Chavdarova and F. Fleuret, ‘‘SGAN: An alternative training of
[47] F. Li, X. Zhang, X. Zhang, C. Du, Y. Xu, and Y.-C. Tian, ‘‘Cost-sensitive generative adversarial networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
and hybrid-attribute measure multi-decision tree over imbalanced data Pattern Recognit., Jun. 2018, pp. 9407–9415.
sets,’’ Inf. Sci., vol. 422, pp. 242–256, Jan. 2018. [73] W. Li, W. Ding, R. Sadasivam, X. Cui, and P. Chen, ‘‘His-GAN: A
[48] N.-N. Zhang, S.-Z. Ye, and T.-Y. Chien, ‘‘Imbalanced data classification histogram-based GAN model to improve data generation quality,’’ Neural
based on hybrid methods,’’ in Proc. 2nd Int. Conf. Big Data Res. ICBDR, Netw., vol. 119, pp. 31–45, Nov. 2019.
2018, pp. 16–20. [74] A. Ramdas, N. Trillos, and M. Cuturi, ‘‘On wasserstein two-sample testing
[49] G. D. Ranasinghe, T. Lindgren, M. Girolami, and A. K. Parlikad, and related families of nonparametric tests,’’ Entropy, vol. 19, no. 2, p. 47,
‘‘A methodology for prognostics under the conditions of limited failure Jan. 2017.
data availability,’’ IEEE Access, vol. 7, pp. 183996–184007, 2019. [75] H. W. Lilliefors, ‘‘On the kolmogorov-smirnov test for normality with
[50] K. Cheng, C. Zhang, H. Yu, X. Yang, H. Zou, and S. Gao, ‘‘Grouped mean and variance unknown,’’ J. Amer. Stat. Assoc., vol. 62, no. 318,
SMOTE with noise filtering mechanism for classifying imbalanced data,’’ pp. 399–402, Jun. 1967.
IEEE Access, vol. 7, pp. 170668–170681, 2019. [76] J. L. Hodges, ‘‘The significance probability of the smirnov two-sample
[51] A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, ‘‘SMOTE for test,’’ Arkiv För Matematik, vol. 3, no. 5, pp. 469–486, Jan. 1958.
learning from imbalanced data: Progress and challenges, marking the 15- [77] F. Luo and S. Mehrotra, ‘‘Distributionally robust optimization with
year anniversary,’’ J. Artif. Intell. Res., vol. 61, pp. 863–905, Apr. 2018. decision dependent ambiguity sets,’’ Optim. Lett., vol. 14, pp. 1–30,
[52] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, ‘‘CVAE-GAN: Fine-grained Jan. 2020.
image generation through asymmetric training,’’ in Proc. IEEE Int. Conf. [78] C. Saez, M. Robles, and J. M. Garcia-Gomez, ‘‘Comparative study of
Comput. Vis. (ICCV), Oct. 2017, pp. 2745–2754. probability distribution distances to define a metric for the stability of
[53] L. Weng, ‘‘From GAN to WGAN,’’ 2019, arXiv:1904.08994. [Online]. multi-source biomedical research data,’’ in Proc. 35th Annu. Int. Conf.
Available: http://arxiv.org/abs/1904.08994 IEEE Eng. Med. Biol. Soc. (EMBC), Jul. 2013, pp. 3226–3229.
[54] F. Huszár, ‘‘How (not) to train your generative model: Scheduled sampling, [79] P. Vermeesch, ‘‘Dissimilarity measures in detrital geochronology,’’ Earth-
likelihood, adversary?’’ 2015, arXiv:1511.05101. [Online]. Available: Sci. Rev., vol. 178, pp. 310–321, Mar. 2018.
http://arxiv.org/abs/1511.05101 [80] J. Biteus and T. Lindgren, ‘‘Planning flexible maintenance for heavy
[55] I. Goodfellow, ‘‘NIPS 2016 tutorial: Generative adversarial networks,’’ trucks using machine learning models, constraint programming, and route
2017, arXiv:1701.00160. [Online]. Available: http://arxiv.org/abs/1701. optimization,’’ SAE Int. J. Mater. Manuf., vol. 10, no. 3, pp. 306–315,
00160 Mar. 2017.
[81] J. Lever, M. Krzywinski, and N. Altman, ‘‘Principal component anal-
[56] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimiza-
ysis,’’ Nature Methods, vol. 14, pp. 641–642, 2017. [Online]. Avail-
tion,’’ 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/
able: https://www.nature.com/articles/nmeth.4346#citeas, doi: 10.1038/
1412.6980
nmeth.4346.
[57] M. Mirza and S. Osindero, ‘‘Conditional generative adversarial nets,’’
[82] R. E. Wright, ‘‘Logistic regression,’’ in Reading and Understanding
2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/1411.
Multivariate Statistics, L. G. Grimm and P. R. Yarnold, Eds. Washington,
1784
DC, USA: American Psychological Association, 1995, pp. 217–244.
[58] M. Arjovsky, S. Chintala, and L. Bottou, ‘‘Wasserstein GAN,’’ 2017,
[Online]. Available: https://psycnet.apa.org/record/1995-97110-007
arXiv:1701.07875. [Online]. Available: http://arxiv.org/abs/1701.07875 [83] L. Breiman, ‘‘Random forests,’’ Mach. Learn., vol. 45, no. 1, pp. 5–32,
[59] S. Bhatia and R. Dahyot, ‘‘Using WGAN for improving imbalanced 2001.
classification performance,’’ in Proc. AICS, 2019, pp. 365–375. [84] T. Chen and C. Guestrin, ‘‘XGBoost: A scalable tree boosting system,’’
[60] X. Wei, B. Gong, Z. Liu, W. Lu, and L. Wang, ‘‘Improving the improved in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
training of wasserstein GANs: A consistency term and its dual effect,’’ in Aug. 2016, pp. 785–794.
Proc. Int. Conf. Learn. Represent., Feb. 2018, pp. 1–17. [85] H. He and Y. Ma, Imbalanced Learning: Foundations, Algorithms, and
[61] C. Villani, Optimal Transport: Old New, vol. 338. Berlin, Germany: Applications. Hoboken, NJ, USA: Wiley, 2013.
Springer, 2008. [86] C. Gondek, D. Hafner, and O. R. Sampson, ‘‘Prediction of failures in the
[62] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, air pressure system of scania trucks using a random forest and feature
‘‘Improved training of wasserstein gans,’’ in Proc. Adv. Neural Inf. Process. engineering,’’ in Proc. Int. Symp. Intell. Data Anal. Cham, Switzerland:
Syst., 2017, pp. 5767–5777. Springer, 2016, pp. 398–402.
[63] S. Qin and T. Jiang, ‘‘Improved wasserstein conditional generative [87] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
adversarial network speech enhancement,’’ EURASIP J. Wireless Commun. M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas,
Netw., vol. 2018, no. 1, p. 181, Dec. 2018. ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach. Learn. Res., vol. 12,
[64] M. Zheng, T. Li, R. Zhu, Y. Tang, M. Tang, L. Lin, and Z. Ma, pp. 2825–2830, Oct. 2011.
‘‘Conditional wasserstein generative adversarial network-gradient penalty- [88] C. Wang, C. Deng, and S. Wang, ‘‘Imbalance-XGBoost: Leveraging
based approach to alleviating imbalanced data classification,’’ Inf. Sci., weighted and focal losses for binary label-imbalanced classification with
vol. 512, pp. 1009–1023, Feb. 2020. XGBoost,’’ Pattern Recognit. Lett., vol. 136, pp. 190–197, Aug. 2020.
YASMIN FATHY received the M.Sc. degree in artificial intelligence (AI) from the AI Laboratory, Vrije Universiteit Brussel (VUB), Belgium, and the Ph.D. degree from the Institute of Communication Systems (ICS), University of Surrey. She was a Research Associate with the Computer Science Department, University College London (UCL). She is currently a Research Associate with the Department of Engineering, University of Cambridge, and a Fellow of the Higher Education Academy. Her research interests include the Internet of Things, machine learning, and data analytics.

MONA JABER (Member, IEEE) received the B.E. degree in computer and communications engineering and the M.E. degree in electrical and computer engineering from the American University of Beirut, Beirut, Lebanon, in 1996 and 2014, respectively, and the Ph.D. degree from the 5G Innovation Centre, University of Surrey, in 2017. Her Ph.D. research was on 5G backhaul innovations. She was a Telecommunications Consultant with various international firms, focusing on the radio design of cellular networks, including GSM, GPRS, UMTS, and HSPA. She led the IoT Research Group, Fujitsu Laboratories of Europe, from 2017 to 2019, where she focused in particular on automotive applications. She is currently a Lecturer in Internet of Things with the School of Electronic Engineering and Computer Science, Queen Mary University of London. Her research interests include cyber-physical systems, data-driven digital twins, and AI/ML applications in the automotive industry.

ALEXANDRA BRINTRUP received the Ph.D. degree from Cranfield University, Cranfield, U.K. She is currently a Lecturer in digital manufacturing with the University of Cambridge, Cambridge, U.K. She develops intelligent systems to help organizations navigate through complexity. Her main work in this area includes system development for digitized product lifecycle management. She uses artificial intelligence paradigms, particularly for data analytics and automated decision making. She held postdoctoral and fellowship appointments with the University of Cambridge and the University of Oxford. She teaches operations management and decision engineering. Her research interests include the modeling, analysis, and control of dynamical and functional properties of emergent manufacturing networks.