
Received 28 May 2024, accepted 5 July 2024, date of publication 8 July 2024, date of current version 17 July 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3425166

A Comprehensive Review of Convolutional Neural Networks for Defect Detection in Industrial Applications

RAHIMA KHANAM, MUHAMMAD HUSSAIN, RICHARD HILL, AND PAUL ALLEN
School of Computing and Engineering, Department of Computer Science, University of Huddersfield, HD1 3DH Huddersfield, U.K.
Corresponding author: Rahima Khanam ([email protected])

ABSTRACT Quality inspection and defect detection remain critical challenges across diverse industrial
applications. Driven by advancements in Deep Learning, Convolutional Neural Networks (CNNs) have
revolutionized Computer Vision, enabling breakthroughs in image analysis tasks like classification and
object detection. CNNs’ feature learning and classification capabilities have made industrial defect detection
through Machine Vision one of their most impactful applications. This article aims to showcase practical
applications of CNN models for surface defect detection across various industrial scenarios, from pallet racks
to display screens. The review explores object detection methodologies and suitable hardware platforms
for deploying CNN-based architectures. The growing Industry 4.0 adoption necessitates enhancing quality
inspection processes. The main results demonstrate CNNs’ efficacy in automating defect detection, achieving
high accuracy and real-time performance across different surfaces. However, challenges like limited datasets,
computational complexity, and domain-specific nuances require further research. Overall, this review
acknowledges CNNs’ potential as a transformative technology for industrial vision applications, with
practical implications ranging from quality control enhancement to cost reductions and process optimization.

INDEX TERMS Computer vision, convolutional neural network, deep learning, industrial defect detection,
object detection, quality inspection, manufacturing.

(The associate editor coordinating the review of this manuscript and approving it for publication was Zhongyi Guo.)

I. INTRODUCTION
Artificial Intelligence (AI) emerged in the 1950s as an academic discipline and remained in the scientific field with ambiguity and minimal beneficial interest for roughly half a century [1]. Since then, AI has revolutionized the modern world with a tremendous volume of data, improved computer technologies, and theoretical understanding. Many industries, including manufacturing, marketing, healthcare, cyber security, military, and finance, have adopted AI as a cornerstone of modern technology [2], [3]. AI uses algorithms, which comprise a set of explicit instructions that a computer can run to execute a problem or situation [2]. Machine Learning (ML) is a subset of AI that uses large datasets to train systems to recognize patterns and predict outcomes [4]. The underlying relationships in data can be discovered by ML algorithms, resulting in decisions made without any explicit instructions [5].

Deep learning (DL) is a powerful subclass of ML that permits numerous processing layers of computational models to learn and define data at multiple levels of abstraction, using a technique that mimics how the human brain perceives and comprehends multimodal information, implicitly apprehending complex large-scale data structures [6]. The urge to build a model that replicates the human brain led to the origination of Neural Networks (NNs) in 1943 by McCulloch and Pitts, who sought to explain the human brain's remarkable ability to process and interpret highly complex patterns using interconnected cells known as neurons [7]. They created a model of neurons known as the MCP model, which made a significant contribution to the emergence of Artificial Neural Networks (ANNs). The growing demand for DL methods based on training NNs [8], [9], [10] is attributable to their remarkable success in numerous applications, especially due to the abundance of complex data from various sources like medical, industrial, sensor, social, visual, text, audio, graph, etc.

A widely known DL technology is the Convolutional Neural Network (CNN) [11], [12], [13], which is hugely used in various domains like Natural Language Processing (NLP) [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], Image [26], [27], [28], [29] and Speech recognition [30], [31], [32], [33], [34]. Several models have demonstrated that CNNs are particularly effective in image recognition tasks, achieving state-of-the-art (SOTA) results [3]. A crucial part of their success is that the hierarchical architecture enables them to retrieve features at varying degrees of abstraction and extract spatial features and patterns in images [35], [36]. Figure 1 shows the general relationship between AI, ML, DL, and CNN.

FIGURE 1. A diagram showing the relationship between AI, ML, DL, and CNN.

A. SURVEY OBJECTIVE
Advances in Computer Vision (CV) are capable of enhancing operations in a variety of industries. Despite this, researchers are increasingly discovering that CV architectures are incompatible with existing application requirements when transferred from academic laboratories to manufacturing and industrial sectors. Several factors contribute to this problem, including a high cost of computation, increased energy consumption, and greater processing power demands on Central Processing Units (CPUs) and Graphics Processing Units (GPUs) [37]. Due to these incompatibilities, researchers have shifted their focus to multidimensional parameters, like computational complexity, architectural footprint, and energy consumption, rather than the one-dimensional criterion of high accuracy.

The field of Image Recognition has undergone a revolutionary transformation over the years, driven by the evolution of DL. Within this domain, CNNs [38] have emerged as a powerful model, propelling accuracy to levels approaching or even surpassing human capabilities. This remarkable ascent can be attributed to several key advantages of CNNs over traditional artificial feature extraction and shallow learning methods. Unlike conventional methods that rely on hand-crafted features, CNNs leverage a hierarchical architecture composed of multiple convolutional and pooling layers. This layered structure empowers the network to progressively extract increasingly complex and abstract features from the raw image data. Furthermore, CNNs follow a data-driven approach. Unlike traditional methods that depend heavily on expert-defined features, CNNs learn directly from large image datasets. This automates feature extraction, significantly lowering the barrier to entry for CV research and product development.

Intelligent manufacturing is recognised as a cornerstone of Industry 4.0, with quality inspection serving as a critical and indispensable link in both production and maintenance processes. Machine vision-based surface defect detection plays a pivotal role in ensuring product quality across diverse industries. However, industrial image processing poses unique challenges compared to natural image analysis, with a crucial emphasis on non-standard customization. Consequently, directly applying generic CNN models to industrial tasks often proves ineffective. This necessitates the development of customized CNN transformations tailored to specific industrial scenarios.

This paper aims to comprehensively analyze the applications of CNNs, particularly within the domain of Industrial Surface Defect Detection. To broaden the scope of the review, domains closely related to Structural Health Monitoring (SHM), such as inspection methodologies for identifying defects in various surfaces, such as pallet racks, exudates [39], steel, rail, magnetic tiles, photovoltaic cells [40], [41], fabrics, displays, etc., have been systematically investigated to furnish a thorough and efficient review. By meticulously surveying relevant literature, this review aims to serve as a valuable resource and guide for researchers and practitioners in the industrial sector. It sheds light on the effective utilization of advanced DL technologies, specifically customized CNNs, for robust and automated surface defect detection in diverse industrial settings.

To facilitate a comprehensive understanding of CNNs in industrial inspection systems, this paper delves into critical interconnected modules: their historical evolution, fundamental building blocks, practical Object Detection (OD) implementation technologies, and the essential hardware components necessary for efficient deployment. This exploration will equip readers with the knowledge and tools required to effectively leverage the power of CNNs for robust and automated defect detection in diverse industrial settings.


B. RESEARCH SCOPE AND METHODOLOGY
1) RESEARCH QUESTIONS
To establish a focused inquiry, we initiated this review with a preliminary investigation. This investigation aimed to identify a set of research questions (RQs) considered to be the most pertinent to the current understanding of CNNs and their impact on Industrial Surface Defect Detection. These RQs will subsequently guide the analysis and discussion presented throughout this article.
1) What are the essential components of CNNs?
2) How have CNNs evolved over time, and what are the recent advancements in their architecture?
3) What are the key components and techniques involved in implementing CNNs for Object Detection (OD)?
4) What are the essential hardware components and considerations for efficient deployment of CNN-based models?
5) What are the different applications of CNNs in the domain of Industrial Surface Defect Detection?
6) How have CNNs been employed for defect detection on various industrial surfaces, such as pallet racks, steel, rail, magnetic tiles, photovoltaic cells, fabrics, displays, etc.?
7) What techniques or strategies have been employed to enhance the performance of CNN models for Industrial Surface Defect Detection?
8) What types of datasets (real-world, synthetic, public, proprietary) have been commonly used for training and evaluating CNN models in this domain?

2) RESEARCH METHODS
This research employed a multifaceted approach to conduct a thorough review of the literature on CNNs for Industrial Surface Defect Detection. The methodology encompassed several key strategies:
1) Systematic Literature Review: A rigorous search and analysis were undertaken, encompassing relevant research articles, conference proceedings, and technical reports. Clear inclusion and exclusion criteria were established for selecting the reviewed literature.
2) Keyword-based Search: A comprehensive set of keywords pertaining to the field, including "Convolutional Neural Networks," "CNN Components," "Hardware Accelerators," "Surface Defect Detection," "Industrial Quality Inspection," "Structural Health Monitoring," and "Object Detection," was used to search prominent academic databases like IEEE Xplore, ACM Digital Library, ScienceDirect, Scopus, and Google Scholar.
3) Backward and Forward Reference Searching: Reference lists of identified relevant papers were meticulously examined to unearth additional pertinent literature sources (backward searching). Furthermore, papers citing the initially selected sources were explored to identify further relevant research (forward searching).
4) Manual Search and Expert Consultation: To ensure comprehensiveness, a targeted manual search for relevant literature was conducted. Additionally, consultations were held with the supervisory team and researchers in the field to glean recommendations for pertinent literature sources.
5) Data Extraction and Analysis: A systematic methodology was employed to extract and synthesize critical information from the selected literature. This process focused on gathering data on several key areas, including the categorization of surfaces within industrial settings, the various categories of defects, the employed CNN architectures and their modifications, techniques utilized for performance improvement, the characteristics of the datasets leveraged, and the principal findings or outcomes generated by the implemented models.
6) Qualitative and Quantitative Analysis: Qualitative analysis was conducted to discern common themes, challenges, and promising future research directions within the existing literature. Quantitative analysis was employed to compare the performance of various CNN architectures, techniques, and implementations applied to the domain of Industrial Surface Defect Detection.
7) Dataset Analysis: A critical examination was conducted of datasets employed in previous and current research on industrial surface defect detection. This analysis encompassed both publicly accessible datasets and domain-specific datasets curated by researchers or industrial collaborators.

C. RELATIONSHIP BETWEEN CNNS AND HARDWARE INFRASTRUCTURE
The field of CV has witnessed a transformative era of architectural advancements in CNNs, resulting in a paradigm shift towards automated visual applications. Despite their significant breakthroughs, the successful integration of CNN models into the real world extends beyond the intricacies of their design. Hardware configurations emerge as a crucial factor, dictating the feasibility, efficiency, and ultimately the tangible applicability of CNN-based solutions. Therefore, establishing a firm understanding of the symbiotic relationship between CNN development and underlying hardware constraints is paramount before embarking on a detailed exploration of individual architectures.

1) COMPUTATIONAL DEMANDS
CNNs exhibit significant computational requirements, necessitating considerable processing power to execute their intricate operations. As CNN model complexity increases, with enhanced depth and breadth for capturing intricate features, their computational demands escalate exponentially.


Therefore, a meticulous assessment of hardware considerations, particularly the selection of appropriate processors and accelerators, is imperative to ensure effective execution and real-time performance. To address the demands of CNNs, specialized hardware platforms such as GPUs, Field-Programmable Gate Arrays (FPGAs), and Application-Specific Integrated Circuits (ASICs) offer substantial computational power capable of accelerating both training and inference processes.

2) MEMORY DEMANDS
CNNs are known for their substantial memory demands, primarily due to the need to store intermediate feature maps and inference weights. The rapid evolution of deeper architectures has led to an exponential increase in model size, measured in terms of the number of parameters. This escalating memory burden necessitates the use of hardware accelerators with sufficient memory capacity and bandwidth to effectively address the memory demands of CNNs.

3) ENERGY EFFICIENCY
For resource-constrained edge devices and embedded systems deploying CNNs, energy efficiency becomes a paramount concern. To address this challenge, hardware accelerators have emerged to execute CNN computations. These accelerators strive to achieve real-time inference speeds while minimizing power consumption. To further minimize the energy footprint of CNN architectures, compression techniques such as quantization, pruning, and memory access optimization can be employed. Hardware platforms characterized by low power consumption, like dedicated NN accelerators or edge devices designed for power efficiency, facilitate the deployment of CNN architectures with enhanced energy efficiency.

4) SCALABILITY AND PARALLELISM
These terms are intricately linked concepts in the context of CNN architecture. Scalability pertains to the ability of a CNN to efficiently utilize multiple hardware resources for processing large datasets or conducting parallel computations. Hardware platforms equipped with parallel processing capabilities, like GPUs, contribute to accelerated training and inference speeds by leveraging the inherent parallelism within CNN architectures. Furthermore, advancements in hardware design, including systolic arrays and tensor processing units (TPUs), play a pivotal role in enabling efficient parallel execution of CNN computations, thereby enhancing overall scalability.

5) FLEXIBLE DEPLOYMENT
The deployment flexibility of CNNs is significantly influenced by hardware considerations. Diverse application domains often necessitate specific hardware platforms due to varying constraints such as weight, power consumption, physical size, and cost limitations. For example, while GPU-based systems excel in high-performance computing scenarios, their inherent resource requirements might render them impractical for resource-constrained environments. Conversely, FPGAs provide superior reconfigurability, enabling the development of custom hardware acceleration tailored to specific CNN architectures. However, ASICs, despite boasting the highest performance and energy efficiency, incur substantial development costs and limited flexibility.

The design of CNN architectures and hardware considerations exhibit a symbiotic relationship, where each domain critically influences and advances the other. Advancements in hardware capabilities, particularly in terms of computational power and memory resources, pave the way for the development of increasingly complex and sophisticated CNN architectures. Conversely, innovative CNN architectures serve as inspiration for developing specialized hardware tailored for DL tasks. This iterative cycle fosters continuous progress and innovation in both the algorithmic and hardware realms, ultimately propelling advancements in diverse CV applications.
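To make the computational and memory considerations above more tangible, the short sketch below gives an illustrative back-of-envelope calculation (not taken from this review; the image size and layer widths are assumptions) of the parameter count, multiply-accumulate operations, and activation memory of a few convolutional layers.

```python
# Back-of-envelope estimate of CNN resource demands (illustrative assumptions only).

def conv_layer_cost(h_in, w_in, c_in, c_out, f, stride=1, pad=0, bytes_per_value=4):
    """Return (parameters, multiply-accumulates, activation bytes, output size) for one conv layer."""
    h_out = (h_in - f + 2 * pad) // stride + 1
    w_out = (w_in - f + 2 * pad) // stride + 1
    params = (f * f * c_in + 1) * c_out           # weights plus one bias per filter
    macs = h_out * w_out * c_out * f * f * c_in   # one MAC per kernel element per output value
    act_bytes = h_out * w_out * c_out * bytes_per_value  # float32 feature maps kept in memory
    return params, macs, act_bytes, (h_out, w_out)

# A hypothetical three-layer feature extractor applied to a 224 x 224 RGB inspection image.
layers = [(3, 64, 3), (64, 128, 3), (128, 256, 3)]   # (c_in, c_out, filter size)
h, w = 224, 224
total_params = total_macs = total_act = 0
for c_in, c_out, f in layers:
    p, m, a, (h, w) = conv_layer_cost(h, w, c_in, c_out, f, stride=2, pad=1)
    total_params += p
    total_macs += m
    total_act += a

print(f"params: {total_params:,}  MACs: {total_macs:,}  activations: {total_act / 1e6:.1f} MB")
```

Doubling both the input and output channel counts of a layer roughly quadruples its parameters and MACs, which is the kind of rapid growth that motivates the accelerator and compression strategies discussed in this section.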


FIGURE 2. Visual structure of this review.

D. EXISTING SURVEYS
Several valuable surveys have explored the application of DL using CNNs in industrial surface defect detection. Qi et al. [42] offer a comprehensive overview, while Cumbajin et al. [43] provide a detailed classification of commonly encountered defects. Additionally, several recent scholarly reviews have comprehensively surveyed the applications of deep learning for surfaces and materials across diverse fields. These include steel surfaces [44], rail tracks [45], photovoltaic systems [46], [47], [48], fabrics [49], [50], [51], and displays [52]. While these works shed light on the potential of CNNs, they fail to address the crucial relationship between these architectures and the hardware accelerators necessary for successful real-world deployment. Addressing this gap, the authors in [53] investigate efficient CNN implementation across various hardware platforms, but their focus remains generic, excluding specific industrial scenarios. A broader survey [54] explores the integration of CV algorithms with hardware accelerators, including object detection, but lacks a specific connection to surface defect detection in industrial settings. This paper distinguishes itself by offering a comprehensive and up-to-date survey that specifically links object detection, hardware accelerators, and industrial defect detection within the overall CNN ecosystem, providing a valuable resource for researchers, practitioners, and industry professionals seeking to leverage their combined power for robust and efficient industrial defect detection systems.

E. ORGANIZATION OF PAPER
This article is subsequently organized into the following sections: Section II guides readers through the foundational components of CNNs along with the crucial mathematical computations that facilitate their functionality. Section III delves into the architectural evolution from ANNs to CNNs. Section IV dives into the intricate domain of OD, offering a comprehensive analysis of its historical trajectory and current SOTA approaches. This analysis encompasses not only the architectural advancements but also the essential ecosystem surrounding OD, including established frameworks, readily available Application Programming Interfaces (APIs), diverse datasets for model training and evaluation, and robust metrics for quantifying performance. The subsequent section, Section V, dives deep into the realm of hardware acceleration for Industrial IoT-powered visual inspection. This segment dissects the intricate functionalities and application potentials of three key technologies: GPUs, FPGAs, and ASICs. Section VI embarks on the primary aim of this review, exploring the considerable potential of CNNs for revolutionizing Industrial Defect Detection Systems. It provides a detailed exploration of how CNNs can be leveraged for robust inspection methodologies across a diverse spectrum of surfaces, encompassing pallet racks, steel, rail, magnetic tiles, photovoltaic cells, fabric, screens, and beyond. Section VII focuses on the critical challenges and promising future directions that require further investigation. Finally, Section VIII summarizes the key takeaways and concludes this review. Figure 2 presents the overall structure of this survey.

II. BUILDING BLOCKS OF CNN
CNNs represent a formidable class of NNs prominently employed in the domain of image recognition. Characterized by a hierarchical architecture, CNNs incorporate pooling and convolutional layers, which are configured to capture pertinent characteristics from input images. After feature extraction, multiple FC layers leverage these discerned features to formulate predictive outcomes. The deployment of a CNN for image recognition mandates an initial training phase using extensive datasets comprising labeled images highlighting the focal areas. Throughout this process, the CNN undergoes an iterative refinement, in which it acquires the ability to correlate the discerned features with the accurate corresponding labels through backpropagation and optimization techniques [55], [56], [57], [58]. Once the CNN attains proficiency through training, its applicability extends to new and previously unseen images. This prediction process involves passing the image through the network, where the CNN discerns and evaluates the extracted features, ultimately selecting the label corresponding to the highest predicted probability as the output. This systematic approach underscores the CNN's adaptability and effectiveness in image recognition tasks.
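As a minimal illustration of the training-then-prediction workflow just described, the sketch below defines a toy CNN classifier in PyTorch and reads off the class with the highest predicted probability. The layer sizes and the forklift/car/truck classes (borrowed from the Figure 3 example discussed next) are illustrative assumptions, not an architecture evaluated in this review.

```python
# Toy PyTorch classifier sketching the pipeline described above (illustrative sizes only).
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(                      # stacked convolution + pooling stages
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # FC layer over flattened feature maps

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))                # class scores (logits)

model = TinyCNN()
image = torch.rand(1, 3, 64, 64)                            # dummy 64 x 64 RGB input
probs = torch.softmax(model(image), dim=1)                  # probability per class
classes = ["forklift", "car", "truck"]
print(classes[int(probs.argmax())], probs.detach().numpy().round(3))
```

In practice the convolutional and FC weights would first be fitted with backpropagation on a labeled dataset; the forward pass shown here corresponds to the prediction stage, in which the label with the strongest probability is selected.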


Figure 3 provides a basic representation of a CNN architecture designed for image classification. The CNN receives an input image of a forklift as a matrix of pixel values. This image is subsequently introduced into the input layer, which serves as the initial stage of feature extraction. The initial layer of the NN receives the input image and passes it on to the subsequent hidden layer. Within this layer, a collection of filters, also termed kernels, is employed to process the image data. These filters, typically small matrices of weights, slide across the image, extracting low-level features like edges and patterns. The output of this layer is a feature map, where each pixel represents the activation of a particular filter at a specific location in the image. Subsequently, the feature map is transmitted to the succeeding hidden layer, where a new array of filters of varying complexity is applied. These filters extract higher-level features from the image, such as shapes, patterns, and objects. This process of feature extraction and abstraction is iterated across multiple hidden layers, each refining and enriching the representation of the image. The final hidden layer generates features that are transferred to the output layer, consisting of a FC NN that classifies the image into one of the predefined categories. The final layer of the NN generates a probability distribution over the potential classes: forklift, car, and truck, indicating the confidence level associated with each class. The prediction of the network corresponds to the class exhibiting the strongest probability. The filters' weights in every CNN layer undergo a learning mechanism known as backpropagation. This iterative process allows the CNN to learn the optimal weights for extracting relevant features and making accurate classifications.

FIGURE 3. A basic representation of a CNN architecture.

CNNs have become the dominant approach for image classification owing to their ability to effectively extract and acquire features from images, leading to optimised performance compared to traditional ML algorithms. Their success has revolutionized various applications, including object detection, image recognition, and medical image analysis. A comprehensive overview of the CNN components is essential for crafting innovative architectures and consequently attaining improved performance. Therefore, this section provides a concise exploration of the foundational components of CNNs, delving into the basic architecture to foster a nuanced understanding of the various architectural variants within the realm of CNNs.

A. INPUT IMAGE
Digital images are composed of pixels, the fundamental units of visual information. Each pixel is represented by a numeric value ranging from 0 to 255, corresponding to its brightness and hue. Upon viewing an image, the human brain processes a vast amount of information within the first second. This remarkable ability stems from the intricate network of neurons in the visual cortex, where each neuron possesses a receptive field, a specific region of the visual field to which it responds. Similar to the biological vision system, CNNs also employ receptive fields, allowing individual neurons to analyze data within their assigned areas. CNNs are designed to extract patterns hierarchically, first recognizing simpler features like lines and curves, and then gradually progressing to more complex patterns such as faces and objects. This hierarchical processing mirrors the human visual system, enabling CNNs to learn and recognize complex visual patterns, making them a powerful tool for CV tasks.

B. CNN LAYERS
A CNN generally comprises multiple layers, each assigned a distinct role within the network. Each of them is explained below.

1) CONVOLUTIONAL LAYER
The convolutional layer stands as the fundamental cornerstone in the architecture of a CNN [59], [60], comprising a collection of convolutional kernels, also known as filters, which play a pivotal role in feature extraction from input images. The convolutional kernel slides across the image, dividing it into smaller matrices known as receptive fields. This image segmentation facilitates the extraction of distinctive feature patterns. The convolution operation slides the kernel across the image and performs an element-wise multiplication between the kernel weights and the corresponding pixels in the receptive field [61], generating a feature map, where each pixel represents the activation of a particular filter at a specific location in the image. The convolution operation can be mathematically represented as in Equation 1:

f_k^l(p, q) = \sum_{c} \sum_{x,y} i_c(x, y) \cdot e_k^l(u, v)   (1)

where f_k^l(p, q) represents the activation of neuron (p, q) in the lth feature map produced by the kth filter, i_c(x, y) represents the pixel value at position (x, y) in the cth channel of the input image, and e_k^l(u, v) represents the value at position (u, v) in the kth filter for the lth feature map.

2) POOLING LAYER
Pooling layers serve a critical function in minimizing the spatial dimensions (width and height) of feature maps, facilitating efficient processing in subsequent convolutional layers [62], [63], [64]. The process executed by this layer, also known as downsampling or subsampling [65], results in a concurrent loss of information due to the size reduction. This layer operates independently on each feature map and achieves downsampling through renowned strategies such as average and max pooling. This loss is nevertheless advantageous for the network, since it mitigates computational complexity and enhances resilience to minor translations in the input image. Equation 2 mathematically represents the pooling operation, where Z_k^l denotes the pooled feature map at the lth layer for the kth input feature map. The function g_p determines the specific type of pooling operation employed, such as max pooling or average pooling.

Z_k^l = g_p(F_k^l)   (2)
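The following NumPy sketch makes Equations 1 and 2 concrete: a single filter slides over a single-channel image to produce a feature map, which is then max-pooled. The 6 x 6 input and the vertical-edge kernel are illustrative assumptions.

```python
# Sketch of a single-channel convolution (Equation 1) and max pooling (Equation 2).
import numpy as np

def conv2d_single(image, kernel):
    """Valid convolution (strictly, cross-correlation, as implemented in most CNN libraries)."""
    H, W = image.shape
    F, _ = kernel.shape
    out = np.zeros((H - F + 1, W - F + 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            receptive_field = image[p:p + F, q:q + F]
            out[p, q] = np.sum(receptive_field * kernel)    # element-wise multiply, then sum
    return out

def max_pool(feature_map, size=2):
    """Downsample a feature map by taking the maximum of each size x size block."""
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

image = np.random.rand(6, 6)
edge_kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])  # vertical-edge detector
fmap = conv2d_single(image, edge_kernel)    # 4 x 4 feature map
print(fmap.shape, max_pool(fmap).shape)     # (4, 4) -> (2, 2)
```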


3) ACTIVATION LAYER
The activation layer plays a critical role in CNNs by incorporating non-linearity into the network architecture. It ensures that the output of the activation layer is not simply a linear combination of its inputs but rather a non-linear transformation [66], [67], [68], which significantly enhances the network's capability to extract and interpret complex patterns residing in the input data and the desired output.

4) FULLY CONNECTED LAYER
The fully connected (FC) layer represents a conventional NN layer that interconnects each neuron in the preceding layer with those in the following layer [69], [70], [71], [72]. This intricate network of connections fosters the exchange of information across the entire feature space, allowing the FC layer to learn complex relationships between features and their corresponding classes. This layer is predominantly positioned at the network's end to generate the ultimate output. Distinguished from pooling and convolutional layers, it operates globally, receiving inputs from the feature extraction phases and conducting a comprehensive analysis on the outputs of all prior layers [73]. This global connectivity enables the FC layer to perform a non-linear combination of selected features, extracting high-level abstractions and patterns crucial for accurate classification tasks [28].

5) BATCH NORMALIZATION LAYER
The batch normalization [74] layer serves to normalize the result of the antecedent layer through the application of mean subtraction and division by the batch's standard deviation [75], [76]. This normalization process involves computing the mean \mu_B and variance \sigma_B^2 of the activations within a mini-batch and utilizing these statistics to transform the feature map, as derived in Equation 3:

N_k^l = \frac{F_k^l - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}   (3)

where N_k^l represents the normalized feature map, F_k^l is the input feature map, \mu_B and \sigma_B^2 depict the mean and variance of a feature map for a mini-batch, respectively, and \varepsilon is a small constant added to the variance to prevent division by zero.

This mechanism mitigates the challenges posed by internal covariate shifts embedded in feature maps. Internal covariate shift refers to the change in the distribution of activations across layers during training, which can hinder the network's convergence and require meticulous parameter initialization. The BN layer effectively tackles this issue by normalizing the output of the previous layer, ensuring that the activations maintain a zero mean and unit variance [77]. Additionally, it facilitates smoother gradient flow and operates as a regulatory mechanism, thereby enhancing the network's generalization capabilities.

6) DROPOUT LAYER
Dropout [78], a regularization technique introduced for NNs, effectively addresses the issue of overfitting [79] by randomly dropping neurons in each layer during each training iteration [80], [81]. Overfitting occurs in NNs when the network learns intricate non-linear relationships between the input data and the desired outputs [82]. During training, these connections become co-adapted, leading the network to memorize specific patterns in the training data rather than learning generalizable features. Dropout combats this by introducing stochasticity into the training process, promoting the development of more robust features that generalize better to new data.

The core principle of dropout lies in randomly disabling a subset of neurons during each training iteration. This random dropout creates an ensemble of different thinned network architectures, each with a reduced number of connections. At the end of training, the weights of the remaining neurons are averaged across all thinned architectures, effectively approximating the behavior of an ensemble of networks [83].

C. FEATURE MAPS
The feature map emerges as a fundamental element, encapsulating the result of a convolutional layer [84], [85], [86], [87], [88], [89]. It is a 2D array that intricately mirrors the extent to which localized areas within the input image align with the applied filters. The individual elements within the feature map encapsulate distinct features of the input image, corresponding to the activation of neurons in the convolutional layer. The calculation of the feature map includes the application of an array of filters to the input image through convolution. Every filter yields a single-channel feature map, selectively highlighting a definite pattern or feature within the input image. Aggregating them in a single layer results in a multi-channel feature map, adept at capturing a diverse range of features embedded within the input image. The mathematical representation of feature map computation is expressed in Equation 4:

y_{i,j,k} = \sum_{l=1}^{F} \sum_{m=1}^{F} \sum_{n=1}^{C_{in}} \omega_{l,m,n,k} \, x_{i+l-1, j+m-1, n} + b_k   (4)

where y_{i,j,k} represents the value of the kth feature map at position (i, j) in the output tensor, \omega_{l,m,n,k} denotes the weight of the kth filter at position (l, m, n), x_{i+l-1, j+m-1, n} signifies the value of the nth input feature map at position (i + l - 1, j + m - 1), b_k is the bias term for the kth feature map, C_{in} is the number of input channels, and F is the size of the filter.

Feature maps play a pivotal role in CNNs, serving as the foundation for subsequent layers to derive progressively higher-level features. As the network proceeds through deeper layers, the feature maps become increasingly refined, capturing more abstract and complex patterns crucial for tasks like image classification and object detection.
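A minimal NumPy sketch tying together the two computations above: one output feature map is produced per Equation 4 (a single filter with a bias, summed over all input channels), and a small batch of such maps is then normalized per Equation 3. Shapes and values are illustrative assumptions.

```python
# Sketch of multi-channel feature-map computation (Equation 4) and batch normalization (Equation 3).
import numpy as np

def feature_map(x, w, b):
    """Equation 4 for one filter: x is (H, W, C_in), w is (F, F, C_in), b is a scalar bias."""
    H, W, _ = x.shape
    F = w.shape[0]
    y = np.zeros((H - F + 1, W - F + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + F, j:j + F, :] * w) + b   # sum over the F x F x C_in window
    return y

def batch_norm(maps, eps=1e-5):
    """Equation 3: subtract the mini-batch mean and divide by the mini-batch standard deviation."""
    mu, var = maps.mean(), maps.var()
    return (maps - mu) / np.sqrt(var + eps)

w = np.random.randn(3, 3, 3)          # one 3 x 3 filter spanning 3 input channels
b = 0.1
batch = np.stack([feature_map(np.random.rand(8, 8, 3), w, b) for _ in range(4)])  # mini-batch of 4 maps
normed = batch_norm(batch)
print(round(float(normed.mean()), 6), round(float(normed.std()), 3))  # approximately 0 and 1
```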


D. ACTIVATION FUNCTION
Activation functions constitute a crucial element within CNNs, enabling the model to capture non-linear interactions [90], [91], [92], [93], [94], [95]. This is essential because complex real-world scenarios exhibit non-linear patterns, and a model reliant solely on linear operations would handle such intricate situations inadequately. Hence, activation functions play a pivotal role in enhancing the expressive capacity of CNNs, enabling them to aptly model a broader spectrum of functions.

As depicted in Equation 5, the activation function, denoted as g_a(.), is applied to the output of the convolution operation, represented by F_k^l. This application introduces non-linearity into the network's processing, transforming the input feature map F_k^l into the output feature map T_k^l for the lth layer.

T_k^l = g_a(F_k^l)   (5)

Within the literature, various activation functions, including the hyperbolic tangent (tanh) function, the sigmoid function, SWISH, maxout, the rectified linear unit (ReLU), and its derivatives like leaky ReLU, ELU, and PReLU, are employed to introduce non-linearity and enable the network to learn and represent intricate patterns in the input data [12], [96], [97], [98], [99]. Among these, ReLU, sigmoid, and tanh are frequently used, as explained below.

1) RECTIFIED LINEAR UNIT (RELU)
ReLU and its variants have garnered favor due to their effectiveness in mitigating the vanishing gradient issue, which is a critical challenge in training deep NNs [100], [101]. The ReLU function is expressed mathematically in Equation 6:

f(x) = \max(0, x)   (6)

Here, x represents the input to the neuron. This function introduces a simple yet effective non-linearity by retaining positive values and discarding negative ones.

2) SIGMOID FUNCTION
The sigmoid function, denoted in Equation 7, is another commonly used activation function in CNNs. It squashes its input into the range of 0 to 1, representing a probability.

f(x) = \frac{1}{1 + e^{-x}}   (7)

This property makes it especially suitable for binary classification tasks, as it can effectively model probabilities. However, the sigmoid function presents limitations, particularly in DNNs. Its saturating nature causes it to exhibit the vanishing gradient problem, wherein the gradient becomes extremely small for extreme input values (far from zero). During backpropagation, this phenomenon leads to a slowdown in training, as the updates to the weights in the initial layers become negligible, hampering effective learning.

3) HYPERBOLIC TANGENT (TANH) FUNCTION
The tanh function resembles the sigmoid but squashes input values between -1 and 1. It is often preferred in hidden layers as it helps mitigate the vanishing gradient problem better than the sigmoid. This function possesses several noteworthy properties due to its symmetric output range and smoothness, which make it a valuable asset for CNNs. It is mathematically expressed in Equation 8, which transforms the input variable x into a value between -1 and 1 using the difference and sum of exponential functions.

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}   (8)

E. BACK PROPAGATION
In the realm of NNs, backpropagation stands as a fundamental algorithm that facilitates the efficient calculation of gradients of the loss function with respect to the network's weights and biases [102], [103], [104], [105]. The mechanism involves minimizing network error and enhancing model performance. Backpropagation commences after the output layer, that is, once the model has made a prediction. The goal is to refine the network's weights by calculating the associated derivatives. The error at the output layer is expressed in Equation 9:

\frac{\partial E}{\partial Y_z} = Y_z - t_z   (9)

where Y_z is the predicted output and t_z is the actual output. The weight update is performed using Equation 10:

w_1 = w_1 - \text{learning rate} \cdot \frac{\partial L}{\partial w_1}   (10)

The above equation illustrates the computation for w_1. During each forward epoch, the partial derivatives of the loss function are calculated by computing the difference between the predicted and actual outputs. These partial derivatives drive the weight updates, enabling the network to learn from its mistakes. Additionally, the partial derivative of the loss function with respect to each weight is computed. This computation involves a backward pass through the network, accumulating the partial derivatives at each intermediate quantity until reaching the input. The chain rule of calculus is applied to multiply all the accumulated partial derivatives, yielding the gradient used to update the input weight, as shown in Equation 11:

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial e} \cdot \frac{\partial e}{\partial \hat{z}} \cdot \frac{\partial \hat{z}}{\partial w_1}   (11)

The backpropagation algorithm serves as a crucial component in the training process of NNs. It collaborates with an optimization algorithm during the training phase, such as gradient descent, facilitating the iterative adjustment of network weights and biases. These updates are guided by the computed gradients, gradually steering the network towards minimizing the loss function.
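To ground Equations 6 through 11, the toy NumPy sketch below evaluates the three activation functions and then trains a single sigmoid neuron with gradient descent, applying the chain rule exactly as in Equations 9 to 11. The one-weight network and the squared-error loss are illustrative assumptions.

```python
# Toy sketch of the activation functions (Equations 6-8) and a gradient-descent update (Equations 9-11).
import numpy as np

def relu(x):                       # Equation 6
    return np.maximum(0.0, x)

def sigmoid(x):                    # Equation 7
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                       # Equation 8
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

# One sigmoid neuron, one training example, squared-error loss L = 0.5 * (y - t)^2.
x, t = 2.0, 1.0                    # input and target output
w, lr = 0.5, 0.1                   # initial weight and learning rate
for _ in range(50):
    z = w * x                      # pre-activation
    y = sigmoid(z)                 # predicted output
    dL_dy = y - t                  # Equation 9: error signal at the output
    dy_dz = y * (1.0 - y)          # derivative of the sigmoid
    dz_dw = x
    dL_dw = dL_dy * dy_dz * dz_dw  # chain rule, as in Equation 11
    w = w - lr * dL_dw             # weight update, Equation 10

print(round(w, 3), round(float(sigmoid(w * x)), 3))   # the prediction moves toward the target
```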


F. PADDING, STRIDE, AND FILTERS
Padding, stride, and filters represent vital parameters within the architecture of CNNs, influencing the dimensions of output feature maps [106], [107], [108], [109], [110], [111], [112], [113]. These parameters wield significant control over the information retained in feature maps, thereby exerting a profound impact on the overall performance of the network.

Padding is a technique employed to address the inherent shrinking of feature maps during convolution operations by strategically adding rows and columns of zeros around the borders of the input feature map. This practice aids in preserving the spatial dimensions of the image as it traverses convolutional layers. By retaining spatial dimensions and controlling the size of feature maps, padding ensures that the network maintains essential information for successful task completion. The judicious selection of a padding strategy is crucial for optimizing CNN performance, balancing spatial information preservation, computational efficiency, and the potential for overfitting.

Stride determines the step size between consecutive convolution operations as the filters traverse the input image. This seemingly simple parameter exerts a profound influence on the resulting output feature maps, ultimately shaping the network's computational efficiency and feature extraction capabilities. A larger stride value leads to sparser sampling of the input image, resulting in a smaller output feature map with reduced resolution. Conversely, a smaller stride value ensures denser sampling of the input image, producing a larger output feature map with higher resolution. The selection of a stride strategy has a significant impact on the performance of CNNs. For tasks that require precise localization of features, such as OD, a smaller stride may be preferred to maintain high feature resolution. Conversely, for tasks that prioritize computational efficiency, such as image classification, a larger stride may be employed to reduce feature map size and computational demands.

Filters, also known as kernels, are small matrices that act as feature detectors, scanning the input image to identify and extract specific spatial patterns. As the filters slide across the input image, they perform element-wise multiplication with the corresponding image patches, resulting in a feature map. The values in the feature map represent the strength of the filter's response at each location in the image. A high value indicates that the filter has detected the corresponding pattern in that region of the image, while a low value suggests that the pattern is not present. The size of the filter determines the spatial extent of the pattern it seeks to detect. Larger filters can capture broader patterns, such as edges or elongated shapes, while smaller filters focus on finer details, such as corners or specific textures. The choice of filter size is crucial for tailoring the network to the specific task at hand.

The dimensions of the resulting feature map depend upon several factors, including the dimensions of the input image (denoted as W), the dimensions of the filter (denoted as F), the amount of padding (denoted as P), and the stride (denoted as S). The formula for calculating the output feature map size is expressed in Equation 12:

\frac{W - F + 2P}{S} + 1   (12)

Fine-tuning parameters such as padding, stride, and filter size plays a crucial role in shaping a CNN architecture capable of adeptly extracting features from input images and generating high-fidelity feature maps. These parameters furnish a level of adaptability in configuring convolutional layers, enabling customization based on the specific objectives of the task. For instance, when devising image classification networks, opting for larger filters and smaller strides may be advantageous for capturing intricate features. Conversely, in the context of OD, a design choice favoring smaller filters and larger strides could curtail computational demands and expedite the detection process.
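As a quick worked example of Equation 12 (the 224-pixel input width and the filter, padding, and stride values below are illustrative assumptions):

```python
# Output feature-map size from Equation 12: (W - F + 2P) / S + 1.
def output_size(W, F, P, S):
    return (W - F + 2 * P) // S + 1

print(output_size(224, F=3, P=1, S=1))   # 224: a 3x3 filter with "same" padding preserves the width
print(output_size(224, F=3, P=1, S=2))   # 112: stride 2 halves the resolution
print(output_size(224, F=7, P=0, S=2))   # 109: a larger unpadded filter shrinks the map further
```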


III. EVOLUTION OF CNNS
CNNs emerged long after ANNs appeared because early ANNs could not handle complex scenarios. CNNs could be trained on the large datasets of labeled data that traditional ANNs could not handle. The progression from ANNs to CNNs is explained as follows.

A. EARLY DEVELOPMENT (1940S-1960S)
In the early development of NNs, scientists were inspired by the contributions of Warren McCulloch, a neurophysiologist, and Walter Pitts, a young mathematician. They developed a simple model of a neuron in 1943, which was explained in a paper entitled "A Logical Calculus of the Ideas Immanent in Nervous Activity" [7]. In their model, neurons were depicted as binary devices with predetermined thresholds, enabling them to execute basic logic functions.

In 1958, psychologist Frank Rosenblatt further extended this model to create the perceptron [114], an electronic device that used biological principles to learn. He also wrote one of the first books on neurocomputing, Principles of Neurodynamics [115]. Rosenblatt also helped develop the multilayer perceptron (MLP) in the 1960s. MLPs are more complex models of neurons that can learn to perform simple tasks, such as classifying binary inputs. They quickly became one of the most popular NN architectures and have been used to solve a diverse array of challenges over the years. Since then, numerous researchers have worked to enhance the performance of ANNs, for example by developing the ADALINE (ADAptive LInear Element) system, introduced by Widrow and Hoff in 1962 [116].

Even then, the early NNs were hard to train and couldn't learn complex relationships between inputs and outputs because of the shortage of training algorithms, and the existing computers weren't powerful enough to handle problematic cases.

B. BACKPROPAGATION (1970S-1990S)
The backpropagation algorithm, developed in the 1980s, was a major breakthrough in NN research. Although backpropagation was first derived in 1974 by Werbos [117], its significance was not fully appreciated until 1986. This algorithm enables NNs to learn from errors by modifying the weights of neurons, adjusting them according to the error in the output layer. This made it possible to train NNs more efficiently and effectively, leading to renewed interest in the field and the development of a new NN architecture known as the CNN.

This traces back to the early 1960s, when Hubel and Wiesel [118] disclosed that certain neurons in the visual cortex of cats and primates are sensitive to specific patterns of light and dark. This led to the emergence of the Neocognitron, a model of visual perception introduced by Fukushima in 1980 [119]. It was the Neocognitron that introduced convolutional layers, which are a defining characteristic of CNNs. Fukushima arranged "S-cells" and "C-cells" in alternating layers, creating a hierarchy known as "sandwich layers" (SCSCS...). S and C cells exhibit characteristics close to the simple and complex cells found in the visual cortex. "S-cells" possess adjustable parameters, while "C-cells" perform pooling operations. Backpropagation was not used on the network at the time.

In 1998, LeNet-5 was developed at Bell Labs by Yann LeCun and his research team, which further enhanced the performance of CNNs [120]. LeCun used backpropagation to train Fukushima's ANN [121], which achieved an error rate of 1% and a reject rate of about 9% on zip code digits. LeCun further improved CNNs using an error gradient-based learning algorithm.

Due to dwindling public interest and funding, only a handful of researchers persevered in fields like pattern recognition. However, this period saw the emergence of several paradigms that continue to be refined in modern research.

C. RESURGENCE (2000S)
In the early 2000s, the limitations of computing power led to a decline in the popularity of CNNs. However, the introduction of new training algorithms, such as greedy layer-wise training [122], rekindled interest in CNNs. These algorithms allowed for more efficient training of CNNs on larger datasets, resulting in improved performance.

Later, in 2012, a CNN named AlexNet [123], [124] achieved victory in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a competition involving the classification of images into thousands of categories. AlexNet's victory marked a significant turning point in the development of CNNs, sparking a renewed surge of interest in the field. AlexNet's complex architecture, being one of the first Deep Convolutional Neural Networks (DCNNs), played a crucial role in its success. Utilizing numerous layers of convolutional and pooling operations, it effectively extracted and examined image features across different scales. This deep architecture, along with the effective utilization of GPUs, ReLU activation functions [125], regularization techniques like dropout [83], and data augmentation, contributed significantly to AlexNet's outstanding performance.

D. DOMINANCE (2010S)
Following AlexNet's triumph, CNNs have become one of the most prevalent architectures in DL. Despite being based on LeNet, current CNN models differ significantly from their predecessors. As CNNs evolved, advanced processing units and new network blocks have been designed that have contributed significantly to their improvement.

ZFNet, designed by Zeiler and Fergus [126] in 2013, represents an evolutionary iteration of AlexNet, yielding enhanced performance metrics within the framework of the ILSVRC. Noteworthy modifications include a reduction in the first layer's filter dimensions, downsized from 11 x 11 to 7 x 7. ZFNet achieved a top-5 validation error rate of 16.5% during its training, which spanned 12 days and leveraged the computational capabilities of a GTX 580 GPU.

GoogLeNet [127], [128], [129], created by Szegedy and his research team in 2014, stands as a CNN architecture distinguished by the introduction of the Inception module [127]. This innovative convolutional layer demonstrated superior efficiency in learning complex features compared to its predecessors. Notably, GoogLeNet boasted a remarkable reduction in the number of parameters, utilizing 12 times fewer parameters than the prominent AlexNet. The training regimen for this model, executed on a select few high-end GPUs, was accomplished within a week. It achieved a 6.67% top-5 error rate, underscoring its efficacy in image classification tasks. The Visual Geometry Group (VGG) network [130], a CNN model devised by Simonyan and Zisserman of the University of Oxford, extends the depth of NN architectures. This augmentation not only attains SOTA precision on the ILSVRC datasets [131] but also demonstrates applicability to various other image recognition databases.

The introduction of skip connections in 2015, pioneered by the ResNet architecture [132], [133] for training deep CNNs [134], [135], marked a significant turning point in the field of CV. Huang and his research team [134] initiated the development of DenseNet in 2016, a CNN that introduced an innovative dense connectivity pattern. This novel architecture enables DenseNets to discern intricate features even with limited data. Noteworthy is the incorporation of depthwise separable convolutions, reducing the number of connections and yielding a lightweight model. Subsequently, this concept was incorporated into numerous subsequent networks, including Inception-ResNet, Wide ResNet, and ResNeXt [136], [137], [138]. Various architectural configurations, including Wide ResNet, Pyramidal Net, PolyNet, ResNeXt, and Xception, investigate the influence of multilevel transformations on the learning capability of CNNs. This is achieved by incorporating aspects such as cardinality or enhancing network width [137], [138], [139], [140].

TABLE 1. A Timeline of notable CNN architectures.

Accordingly, research focused on improved network architecture rather than parameter optimization and connection readjustment. This shift has led to a surge of novel architectural ideas, including channel boosting, spatial and feature-map-wise exploitation, and attention-based information processing [141], [142], [143].

E. MODERN ERA
In recent years, numerous CNN architectures have been designed to enhance operational efficiency and produce models of reduced computational burden. MobileNets [144], [145] represent an exemplification of such endeavors, emphasizing lightweight design principles to achieve heightened efficacy by employing techniques like depthwise separable convolutions, inverted residuals, and squeeze-and-excitation layers. MobileNets stand as a testament to the pursuit of computational efficiency within CNN frameworks. Conversely, EfficientNet [146] embodies a paradigm wherein precision and efficiency converge harmoniously: by leveraging a compound scaling method that systematically adjusts both network width and depth, EfficientNet ensures a judicious balance in enhancing accuracy while maintaining computational efficiency.

The Vision Transformer (ViT) [147] marks a departure from traditional CNN paradigms by adopting the transformer architecture for image processing, demonstrating notable performance benchmarks across diverse image classification tasks. A refinement of ViT, the Swin Transformer [148], introduces heightened efficiency and scalability through a hierarchical transformer architecture and a shifted window attention mechanism. ConvNeXt [149] distinguishes itself by incorporating a pioneering attention mechanism known as cross-covariance attention (CCA), contributing to its efficacy as a CNN. Table 1 presents a chronological overview of key CNN architectures, showcasing notable developments from 1959 to 2022. Each entry includes the year of inception, the architecture name, the primary authors, and its description.

Today, CNNs stand as one of the most powerful tools in AI, with a wide range of applications across various industries. They have revolutionized CV, enabling machines to perform tasks like object detection, image classification, image segmentation, and natural language processing with remarkable accuracy. CNNs are also playing an increasingly important role in other fields, such as robotics, autonomous vehicles, and healthcare.

IV. OBJECT DETECTION
Since the commencement of the 2010s, there has been a notable acceleration in research focused on the application of DL methodologies to address CV challenges. These challenges encompass a diverse range of tasks, including image classification, object detection, edge detection, object tracking, image segmentation, and feature extraction. Among these, OD has emerged as a central area of interest for defect detection, primarily due to its inherent ability to localize and classify defects as objects. This capability aligns seamlessly with the fundamental objective of defect detection [150], which is to identify and pinpoint the location of defects within a given image or video sequence.

Figure 4 illustrates the layered architecture of a CNN for OD. The process begins with convolution of the input image followed by an activation function, resulting in feature maps that capture spatial dependencies and local features. To mitigate spatial complexity and retain key features, the network employs pooling layers. These layers downsample the feature maps while preserving relevant information. This process is iterated through multiple convolutional and pooling layers, each with an increasing number of filters, resulting in a hierarchy of increasingly abstract feature representations. Finally, the extracted features are fed into FC layers, which act as a classifier and generate the output probabilities for different object classes. The primary goal is to identify instances of real-world objects, such as humans, animals, bicycles, cars, etc., within real-time videos or still images [151].


FIGURE 4. Utilizing CNNs for object detection.
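To make the pipeline sketched in Figure 4 concrete, the following minimal PyTorch sketch stacks convolution, ReLU, and pooling stages with an increasing number of filters and ends in FC layers that output class scores. The layer widths, the two-class output, and the 64 x 64 input resolution are illustrative assumptions rather than values taken from the reviewed works:

import torch
import torch.nn as nn

class TinyDefectCNN(nn.Module):
    """Minimal conv -> pool -> FC pipeline mirroring Figure 4 (illustrative sizes)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # feature maps capturing local patterns
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # downsample while keeping salient responses
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # more filters -> more abstract features
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64),                   # assumes 64x64 RGB input
            nn.ReLU(inplace=True),
            nn.Linear(64, num_classes),                    # class scores; softmax applied in the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = TinyDefectCNN()(torch.randn(1, 3, 64, 64))        # -> tensor of shape (1, 2)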

of single or multiple objects within a video frame or image, A. TRADITIONAL DETECTORS


contributing to a more comprehensive interpretation of the The current landscape of OD is dominated by DL-based
entire image. techniques, marking a revolution in the field. However,
Generic object detection (GOD) [152] is an essential looking back to the 1990s, the landscape reflected the
concept in the domain of CV, aiming to identify and ingenuity and future-oriented CV systems. During this
localize instances of objects from pre-defined categories period, a majority of OD algorithms relied on meticulously
within an image. GOD typically employs a bounding box, handcrafted features. Owing to the limited effectiveness
a rectangular region surrounding the object, to represent the of image representation methodologies, researchers were
object’s location and extent [153], [154]. The bounding box compelled to devise intricate feature representations and
provides precise coordinates for the top-left and bottom-right employ diverse acceleration strategies.
corners of the object, allowing for accurate localization and
measurement of its size. 1) VIOLA JONES DETECTORS
Over the last two decades, consensus has emerged that Paul Viola and Michael Jones introduced a groundbreaking
the evolution of OD can be broadly categorized into two algorithm for real-time face detection in 2001, breaking
historical periods: the ‘‘traditional OD period (pre-2014)’’ free from the limitations of skin color segmentation [166],
and the ‘‘DL-based detection period (post-2014).’’ Figure 5 [167]. Operating on a 700-MHz Pentium III CPU, the VJ
illustrates the timeline of diverse detectors of OD. Traditional detector surpassed the performance of its contemporaries by
OD methods, such as hand-crafted feature extraction and achieving tens to hundreds of times faster detection while
support vector machines (SVMs) [155], have often encoun- maintaining comparable accuracy. The VJ detector’s core
tered limitations in terms of accuracy and robustness. DL- concept revolves around the sliding window approach, sys-
based OD approaches, on the other hand, have demonstrated tematically scanning an image across all possible locations
superior performance, particularly in dealing with complex and scales to identify potential face regions. While seemingly
and cluttered scenes and providing promising results in straightforward, the computational demands of this method
image classification [156], serving as a fundamental building were immense for the computing power available at the
block for a myriad of sophisticated vision-related tasks, time. To address this challenge, the VJ detector ingeniously
including image captioning [157], [158], [159], instance incorporated three key techniques: integral images, feature
segmentation [160], [161], [162], semantic segmentation, selection, and detection cascades.
object recognition, and tracking [163], [164], [165]. Due to
this, these techniques have been extensively applied across 2) HOG DETECTOR
diverse domains within the Internet of Things (IoT) and Dalal and Triggs [168] introduced the Histogram of Oriented
AI, including medical imaging, manufacturing industries, Gradients (HOG) feature descriptor, marking a signifi-
robot vision, consumer electronics, military, self-driving cars, cant advancement in OD. HOG emerged as a significant
human-computer interaction (HCI), facial recognition, and improvement over existing feature descriptors, particularly
surveillance. the scale-invariant feature transform (SIFT) [169], [170]
This section aims to provide a comprehensive under- and shape contexts [171]. HOG addresses the challenge of
standing of OD technology, offering insights from various capturing both invariance to image transformations (e.g.,
perspectives, with a particular focus on its evolutionary translation, scale, illumination) and the non-linear character-
development. The primary emphasis is directed towards the istics of objects. To achieve this balance, HOG is calculated
historical, current, and prospective aspects, augmented by across a dense grid of uniformly spaced cells, incorporating
essential components in OD, including datasets, evaluation overlapping local contrast normalization to enhance accuracy.
metrics, and acceleration techniques. While HOG was initially developed for pedestrian detection,


FIGURE 5. Timeline of object detection.

its versatility has extended to a wide range of object Additionally, Girshick introduced a novel acceleration tech-
classes. To effectively identify objects of varying resolutions, nique, achieving a 10x speedup compared to traditional
the HOG detector employs a technique called multi-scale methods without compromising accuracy [173], [178].
detection. The input image is rescaled numerous times,
keeping the dimensions of the detection window constant. B. DEEP LEARNING DETECTORS
This approach allows the detector to effectively capture The field of OD experienced a period of stagnation following
objects at different scales without significantly increasing the saturation of handcrafted features in the early 2010s. The
computational complexity. The HOG detector has become an resurgence of CNNs in 2012 [123] presented a promising
integral component of numerous OD algorithms [172], [173], avenue for revitalizing OD. In particular, the introduction
[174], and has found widespread applications in various of Region-based CNN (RCNN) by Girshick et al. in 2014
CV tasks. Its robustness to image transformations, efficient [177], [180] marked a significant breakthrough, propelling
representation of object shapes, and adaptability to various OD into an era of unprecedented advancement. The advent
object classes have made HOG an indispensable tool in OD of OD models using DL introduced a fundamental distinction
research and practice. between two main approaches: ‘‘two-stage detectors’’ and
‘‘one-stage detectors’’.
3) DEFORMABLE PART-BASED MODEL (DPM)
The Deformable Part-Based Model (DPM), introduced by 1) TWO-STAGE DETECTORS
Felzenszwalb et al. in 2008 [173], marked a significant Two-stage detectors adopt a coarse-to-fine strategy, employ-
advancement in traditional OD. DPM gained remarkable ing a series of stages to refine detection results. They
success, winning the prestigious PASCAL Visual Object follow a sequential approach wherein they initially propose
Classes (VOC) detection challenges in 2008 and 2009. DPM regions of interest or candidate object locations, subsequently
built upon the foundation of the HOG descriptor, introducing evaluating these proposed regions for object presence. This
a more flexible and adaptable representation for OD. The core section provides a detailed summary of different two-stage
concept of DPM lies in the principle of ‘‘divide and conquer.’’ detectors.
This approach involves decomposing the object into multiple
parts and treating the OD task as an ensemble of these a: RCNN
parts. This decomposition enables DPM to handle object The emergence of OD methods based on handcrafted features
variations effectively, accounting for non-rigid deformations marked a significant advancement in the field. However, these
and pose changes. A standard DPM detector consists of methods faced several limitations, including high computa-
two main filters: a root filter and multiple part filters. tional cost, sensitivity to noise, and difficulty in handling
The root filter represents the overall object shape, while object occlusion. To address these drawbacks, RCNNs uti-
the part filters represent individual object components. The lized a selective search method [181] to significantly reduce
part filters are learned automatically using latent variables, the number of region proposals. As depicted in Figure 6, the
eliminating the need for manual configuration. This process, RCNN architecture employs a selective search mechanism to
known as ‘‘multi-instance learning,’’ [175] was extensively extract a subset of approximately 2000 region proposals from
modified by Girshick et al. [176], [177], [178], [179], the input image. The subsequent step involves resizing these
enhancing the learning efficiency and accuracy. To further region proposals to a consistent image size and subjecting
improve detection accuracy, DPM incorporated methods like them to a pre-trained CNN model, such as AlexNet, trained
hard negative mining, bounding box regression, and context on the ImageNet dataset, to capture a diverse array of
priming. These techniques effectively filtered out irrelevant features. Finally, a SVM classifier is employed to determine
negative examples and refined bounding box predictions. the existence of an object within each region proposal


FIGURE 6. Architecture of RCNN.

FIGURE 7. Architecture of SPPNet.
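The core idea of the SPP layer shown in Figure 7, pooling an arbitrarily sized feature map at several fixed grid resolutions and concatenating the results into a fixed-length vector, can be sketched with adaptive pooling; the (4, 2, 1) pyramid levels below are one common choice and not necessarily those of the original paper:

import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Pool an (N, C, H, W) map at several grid sizes and concatenate (illustrative)."""
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(k) for k in levels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output length = C * sum(k * k), independent of the input H and W.
        return torch.cat([p(x).flatten(start_dim=1) for p in self.pools], dim=1)

spp = SpatialPyramidPooling()
print(spp(torch.randn(1, 256, 13, 17)).shape)   # torch.Size([1, 5376]) regardless of H and W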

and categorize the identified object. RCNNs demonstrated b: SPPNET


remarkable performance, achieving a mean average precision The Spatial Pyramid Pooling Network (SPPNet) was devel-
(mAP) of 58.5% on the VOC-2007 dataset, surpassing oped by He et al. in 2014 [183]. Former CNN models
the previously reported performance of 33.7% achieved by were limited by their requirement for a fixed-sized input
DPM-v5 [182]. This substantial improvement was attributed image. For example, the AlexNet model [123] required
to the ability of RCNNs to extract more discriminative 224 × 224 resolution for input images. As shown in
features using a deep CNN and to effectively decrease the Figure 7, SPPNet incorporated a ‘‘spatial pyramid pooling
count of region proposals, reducing computational overhead (SPP) layer’’ that enabling to extract features from region
and improving overall detection performance. Despite the proposals at multiple scales, eliminating the need for resizing
significant improvements in OD accuracy achieved by images. This approach deviates from RCNN by eliminating
the RCNN algorithm, several limitations remained. One the need for separate processing for each region proposal by
major drawback was the computationally expensive train- utilizing the SPP layer to derive features from the entire input
ing process, requiring the classification of 2000 object image and then applying a fixed-length sequence to arbitrary
region proposals per image, significantly increasing training region proposals. SPPNet’s ability to extract features from
time. Additionally, RCNN was not practical for real-time a single image and apply them to multiple region proposals
applications due to its high computational demand, taking led to substantial improvements in computational efficiency.
approximately 47 seconds per test image. This computational SPPNet achieved remarkable speedups over RCNN, with
inefficiency was further compounded by the fixed nature of a processing time of 6.9 seconds per image, compared to
the selective search method, which could lead to inaccurate 47 seconds for RCNN. This translated into real-time OD
object region proposals. To address these limitations, SPPNet capabilities, making SPPNet more practical for real-world
[183] was proposed in the same year. applications. Additionally, SPPNet maintained the accuracy


FIGURE 8. Architecture of fast RCNN.
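Fast RCNN's key operation, cropping a fixed-size feature for each region proposal from a shared feature map, is available off the shelf. The sketch below uses torchvision's RoI Align, a later refinement of RoI pooling; the feature stride of 16 and the proposal coordinates are chosen purely for illustration:

import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)               # shared backbone feature map
# Proposals in image coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0., 40., 40., 400., 300.],
                          [0., 10., 60., 200., 220.]])
# 7x7 crop per proposal; spatial_scale maps image coordinates onto the stride-16 feature map.
crops = roi_align(features, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(crops.shape)   # torch.Size([2, 256, 7, 7]) -> fed to the classification/regression heads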

gains of RCNN, achieving an mAP of 59.2% on the VOC-


2007 dataset. Despite its advancements, SPPNet still retained
some of the limitations of RCNN, including the multi-staged
training process and the reliance on fully convolutional
(FC) layers. To address these limitations, Faster RCNN was
introduced in 2015 [184].

c: FAST RCNN FIGURE 9. Architecture of faster RCNN.

In 2015, a novel OD algorithm called Fast RCNN was


introduced by Girshick [184], which represented a significant
advancement over its predecessors, RCNN and SPPNet Faster RCNN achieved a mAP of 70.4% on the VOC-2012
[177], [183]. As depicted in Figure 8, Fast RCNN integrated dataset and a processing speed of 17 frames per second
the strengths of both algorithms, training a detector and (fps) with ZFNet [126]. Despite its achievements, Faster
a bounding box regressor jointly within a single network RCNN exhibited computational redundancy in the final
architecture. This integration enabled Fast RCNN to achieve stage. To address this issue, Region-based fully convolutional
remarkable performance gains, achieving a mAP of 70.0% networks (RFCN) [185] and Light Head RCNN [186] were
on the VOC07 dataset, surpassing the performance of RCNN proposed as further advancements on Faster RCNN.
(58.5%) and SPPNet (59.2%). Additionally, Fast RCNN
demonstrated significant speed improvements, processing e: FPN
images over 200 times faster than RCNN. Despite its Building upon the foundation of Faster RCNN, Lin et al.
advancements, Fast RCNN still faced limitations in its [187] introduced the Feature Pyramid Network (FPN), that
detection speed, constrained by the separate region proposal marked a significant milestone in the field of OD. FPN,
generation stage. This limitation prompted researchers to as depicted in Figure 10, addressed a fundamental limitation
consider an alternative approach that led to the introduction of previous OD methods, which focused solely on object
of Faster RCNN [156]. localization in the final layers of the CNN. While these
features were effective for category recognition, they posed
d: FASTER RCNN challenges for object localization, particularly for small
In 2015, soon after the unveiling of Fast RCNN, objects. To tackle this challenge, FPN employs a top-down
Ren et al. [156] proposed a significant advancement called pathway architecture and lateral connections to establish
Faster RCNN, which addressed the limitations of Fast RCNN. high-level semantics at various scales, enabling the fusion
As depicted in Figure 9, Faster RCNN introduced a Region of features from different depths of the CNN. This approach
Proposal Network (RPN) to seamlessly integrate object significantly improved the detection of small objects, as the
proposal generation into the detection process, eliminating CNN effectively forms a feature pyramid during its forward
the need for a separate proposal generation step. The process propagation. The integration of FPN with Faster RCNN
commences with the initial Region of Interest (ROI) pooling, resulted in substantial performance gains, achieving a mAP of
which are then fed into a CNN for feature extraction. The 59.1% on the MS-COCO dataset lacking extra features. This
extracted features are then passed through two FC layers, remarkable achievement solidified the position of FPN as a
one for softmax classification and the other for bounding fundamental building block for many modern OD algorithms.
box regression. Compared to Fast RCNN, Faster RCNN
demonstrated remarkable performance gains. On the MS- f: MASK RCNN
COCO dataset, it achieved a mAP of 42.7%, surpassing Mask RCNN, proposed by He et al. [188], represents a
the performance of Fast RCNN (36.2%). Additionally, significant advancement in OD by seamlessly integrating


FIGURE 10. Architecture of FPN.
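FPN's top-down pathway with lateral connections can be expressed in a few lines: 1x1 convolutions project backbone stages to a common width, coarser levels are upsampled and added, and 3x3 convolutions smooth the merged maps. The channel counts and the use of nearest-neighbour upsampling are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Top-down feature pyramid over three backbone stages (illustrative widths)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p3, p4, p5))]

fpn = TinyFPN()
outs = fpn(torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16))
print([o.shape[-1] for o in outs])   # [64, 32, 16]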

FIGURE 11. Architecture of mask RCNN.

instance segmentation into the Faster RCNN framework. As depicted in Figure 11, Mask RCNN uses the two-stage pipeline of Faster RCNN, while incorporating an additional branch for predicting object masks. This enables Mask RCNN to effectively identify and segment diverse objects within an image or video, addressing a critical instance segmentation challenge in CV applications. The mask branch operates parallel to the class label and bounding box (BB) regression branches, enabling Mask RCNN to simultaneously detect objects, localize them precisely, and generate high-quality segmentation masks for every instance. This parallel processing enables the network to efficiently generate all three outputs: object class label, bounding box coordinates, and object mask.
The effectiveness of Mask RCNN stems from its ability to effectively utilize features extracted from the CNN using ResNet-FPN [132], [187], a combination of ResNet and FPN. This combination provides Mask RCNN with the ability to extract both fine-grained semantic information and accurate localization cues, enabling high-performance OD and instance segmentation. However, the introduction of the mask branch adds a small computational overhead to the network, resulting in a processing speed of approximately 5 FPS. Despite this minor drawback, Mask RCNN has established itself as a powerful and versatile tool for instance segmentation, offering a significant step forward in the field of OD.

2) ONE-STAGE DETECTORS
Most of the two-stage object detectors operate within a coarse-to-fine processing paradigm. This approach facilitates high precision without requiring intricate architectural


FIGURE 12. Architecture of YOLO.

TABLE 2. Evolution of YOLO models.
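In practice, most of the YOLO versions listed in Table 2 can be fine-tuned and run through high-level packages. The snippet below assumes the third-party ultralytics package, its YOLO class, the yolov8n.pt checkpoint name, and hypothetical dataset and image files, so it should be read as a usage sketch rather than a canonical interface:

# pip install ultralytics   (assumed third-party package)
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                          # small pretrained YOLOv8 checkpoint (assumed name)
# Fine-tune on a custom defect dataset described by a YAML file (hypothetical path).
model.train(data="defects.yaml", epochs=50, imgsz=640)
results = model("rack_image.jpg")                   # inference on a single image (hypothetical file)
for r in results:
    print(r.boxes.xyxy, r.boxes.conf, r.boxes.cls)  # boxes, confidences, class indices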

embellishments. However, its inadequate speed and inherent complexity limit its practical applicability in engineering applications. In contrast, one-stage detectors offer the advantage of one-step object inference, making them attractive for mobile devices due to their real-time capabilities and ease of deployment. These detectors directly predict object bounding boxes and class labels from the input image, eliminating the need for separate region proposals.

a: YOLO
In 2016, the field of OD witnessed a significant paradigm shift with the introduction of You Only Look Once (YOLO) by Redmon et al. [189] which is depicted in Figure 12. From the single-shot approach of YOLOv1 to the anchor-free elegance of YOLOv8, each iteration has brought groundbreaking features and performance leaps. YOLOv1 distinguished itself by its remarkable speed, with a "fast" version capable of processing frames at a staggering 155 fps while maintaining moderate accuracy (VOC07 mAP = 52.7%), and an "enhanced" version achieving higher accuracy (VOC07 mAP = 63.4%) at 45 fps. This represented a radical departure from the dominant two-stage paradigm by employing a single NN to analyze the entire image in one pass. While offering dramatic speed improvements, YOLO exhibited a trade-off with localization accuracy, particularly for smaller objects, when compared to its two-stage counterparts. Subsequent iterations of YOLO [191], [192], [194], along with concurrently developed detectors like SSD [205], focused on mitigating this accuracy-speed trade-off. YOLOv5 [197] took a different turn with its PyTorch implementation and modular architecture, prioritizing speed and adaptability. YOLOv6 [200] explored reparametrization and attention modules. Then, the YOLOv4 [194] team unveiled YOLOv7 [202], a further refinement that leveraged optimized architectural elements like dynamic

FIGURE 13. Architecture of SSD.

FIGURE 14. Architecture of RetinaNet.
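The focal loss used by RetinaNet can be written directly from its definition, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The sketch below is a binary, per-anchor variant with the commonly quoted defaults alpha = 0.25 and gamma = 2, which may differ from any particular implementation:

import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss over raw logits; targets are 0/1 tensors of the same shape."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)           # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()     # down-weights easy, well-classified examples

loss = focal_loss(torch.randn(8, 100), torch.randint(0, 2, (8, 100)).float())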

label allocation and model design reparameterization. This advantages in speed and simplicity. This performance
iteration delivered impressive performance, surpassing most gap was investigated by Lin et al. in 2017 [206] by
existing object detectors in terms of speed and accuracy introducing RetinaNet, where he identified the exceptional
across a spectrum spanning 5-160 fps. Finally, YOLOv8 foreground-background class imbalance in the training phase,
[203] marked a paradigm shift with its anchor-free detection which hinders the learning process for dense one-stage
and transformer-based backbone, promising even greater detectors. RetinaNet introduces a novel loss function called
precision and efficiency. This ongoing evolution reflects the focal loss which modifies the conventional cross-entropy
remarkable dedication of YOLO’s developers to constantly loss by assigning greater emphasis to hard, misclassified
elevate the state of the art in OD, offering a diverse toolbox for examples. By focusing on these challenging instances, focal
tackling real-world challenges across various applications. loss enables RetinaNet to achieve accuracy comparable to
Table 2 charts the evolution of YOLO models from YOLOv1 two-stage detectors while maintaining significantly faster
to YOLOv8, highlighting key features, performance metrics, inference speeds (COCO [email protected] = 59.1%). This is
and architectural choices for each version. illustrated in Figure 14.

b: SSD d: SQUEEZEDET
In 2016, Liu et al. [205] introduced the Single-Shot Multibox SqueezeDet was introduced by Wu et al. [207], which
Detector (SSD), depicted in Figure 13. Their key innovation is a lightweight, single-stage, and highly efficient FCNN
lies in the multireference and multiresolution detection for detecting objects in systems related to autonomous
techniques. These techniques enable SSD to surpass the driving. The successful deployment of deep CNNs for OD in
precision of previous one-stage detectors, particularly for real-time necessitates the resolution of critical issues such as
small objects. SSD demonstrates strong performance in speed, model resolution, accuracy, and power consumption.
relation to speed and accuracy, achieving a 46.5% [email protected] Notably, the SqueezeDet model adeptly addresses these
on the COCO benchmark while a dedicated fast version challenges, as illustrated in Figure 15. It achieves real-time
operates at 59 fps. A distinctive feature of SSD compared to OD through a three-step process. First, it extracts high-
prior detectors is its ability to detect objects at different scales dimensional, low-resolution features via a single forward
across various network layers, unlike previous methods that pass using stacked convolution filters. Next, the innovative
restricted detection to the uppermost layers. ConvDet layer leverages these features to simultaneously
generate numerous bounding box proposals and predict
c: RETINANET their object categories. Finally, post-processing refines these
One-stage object detectors have long suffered from lower detections, yielding accurate and efficient object identifica-
accuracy compared to two-stage counterparts, despite their tion. SqueezeNet [208] serves as the backbone architecture of


FIGURE 15. Architecture of SqueezeDet.

FIGURE 16. Architecture of CornerNet.
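CornerNet's corner pooling for a top-left corner can be realized by scanning each feature-map location for the maximum response to its right and below it and summing the two; a compact cumulative-max formulation is sketched below (the bottom-right variant simply flips the scan directions):

import torch

def top_left_corner_pool(x: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W). Max over all positions to the right plus max over all positions below."""
    right_max = x.flip(-1).cummax(dim=-1).values.flip(-1)    # horizontal scan from the right edge
    bottom_max = x.flip(-2).cummax(dim=-2).values.flip(-2)   # vertical scan from the bottom edge
    return right_max + bottom_max

pooled = top_left_corner_pool(torch.randn(1, 128, 32, 32))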

SqueezeDet, where the model size remains remarkably small eters, and prolonged training times. Recognizing these
at less than 8 MB, significantly exceeding AlexNet [123] in limitations, Law and Deng [210] proposed a paradigm shift,
compactness while maintaining comparable accuracy. With reframing OD as a keypoint (bounding box corner) prediction
approximately two million trainable parameters, SqueezeDet problem. CornerNet leverages a CNN to directly predict
outperforms VGG19 [130] and ResNet-50 [132] in terms the locations of objects as paired keypoints, specifically the
of accuracy despite boasting significantly fewer parameters top-left and bottom-right corners. To facilitate accurate corner
(143 million and 25 million, respectively). On the Kitti localization, CornerNet introduces ‘‘corner pooling’’ layer,
dataset [209], SqueezeDet achieves an impressive 57.2 fps specifically designed to extract relevant features for corner
for input images of size 1242 × 375, while consuming only detection. The CNN generates a heatmap corresponding to
1.4 J of energy per image. These results demonstrate the all top-left and bottom-right corners, accompanied by an
model’s remarkable efficiency, rendering it highly suitable for incorporated vector map for identified corner. This novel
real-time OD in applications related to autonomous driving. approach surpassed the performance of most contemporary
one-stage detectors, achieving a COCO [email protected] of 57.8%.
e: CORNERNET The architectural configuration of CornerNet is illustrated
Prior OD frameworks predominantly relied on anchor in Figure 16. However, a notable limitation pertains to
boxes as a means for both classification and regression its propensity for generating inaccurate paired key points
reference points. This approach inherently assumes a degree associated with the detected object.
of uniformity in object characteristics, such as number,
location, scale, and aspect ratio. To achieve high performance, f: CENTERNET
these methods rely on establishing an extensive quantity of Zhou et al. [211] introduced CenterNet, a keypoint-based
pre-defined anchor boxes to better encompass the ground OD framework operating on a fully end-to-end mecha-
truth object configurations. However, this approach suffered nisms, as depicted in Figure 17. Unlike prior models like
from several drawbacks, including exacerbated category CornerNet [210] and ExtremeNet [212] which rely on costly
imbalance, reliance on numerous hand-tuned hyperparam- computational post-processing procedures like group-based


keypoint assignment and non-maximum suppression (NMS), D. DATASETS AND METRICS


CenterNet eliminates these stages by regressing all object 1) DATASETS
attributes (size, orientation, location, pose) directly from a The growing need for robust and generalizable OD algo-
single reference center point. The model’s simplicity and rithms necessitates the construction of large-scale and diverse
elegance simplifies the detection process and enables the datasets with minimized biases. Over the past decade, the
integration of diverse tasks within a unified framework, research community has witnessed the emergence of several
including detection of 3D objects, estimation of human poses, pivotal OD datasets that have significantly facilitated algo-
learning optical flow, and deriving depth estimations. Despite rithm development and evaluation. Prominent contributions
its minimalistic approach, CenterNet achieves competitive include the PASCAL VOC Challenges (e.g., VOC2007,
detection performance, reaching a COCO [email protected] of 61.1%. VOC2012) [153], [233], which established early benchmarks
for OD and classification, and the ILSVRC datasets (e.g.,
g: DETR ILSVRC2014) [154], which shifted the focus towards broader
The surge of Transformers in recent years has profoundly object categories and image understanding. Additionally,
impacted the landscape of DL, specially within CV. Recog- the MS-COCO Detection Challenge [193] introduced a
nizing the limitations of CNNs in capturing global context, richer and more diverse set of object classes and scene
Transformers leverage attention mechanisms to achieve contexts, while the Open Images Dataset [234], [235] further
a wider receptive field without relying on convolutions. expanded the scope to encompass a vast array of visual
In 2020, Carion et al. [213] introduced DETR, an end-to-end concepts and annotations. More recent contributions such as
OD network built upon Transformers, which reformulated Objects365 [236] have further emphasized the importance of
OD as a problem of predicting sets, as illustrated in Figure 18. temporal dynamics and object interactions in OD tasks. These
This marked a pivotal shift in the field, paving the way for commonly known datasets are further explained below:
OD devoid of anchor boxes or points. Subsequently, Zhu
et al. [214] addressed DETR’s limitations in convergence a: PASCAL VOC
speed and performance for small objects by proposing The PASCAL Visual Object Classes (VOC) Challenge (2005-
Deformable DETR. This innovative architecture employs 2012) [153], [233] played a pivotal role in shaping the tra-
deformable attention modules, significantly enhancing con- jectory of early CV research, serving as a crucial benchmark
vergence speed and achieving SOTA performance on the for OD algorithms. Two specific versions, VOC2007 and
MSCOCO dataset (COCO [email protected] = 71.9%). VOC2012, have become particularly influential within the
field. VOC2007 provides a challenging dataset consisting
C. DEEP LEARNING FRAMEWORKS AND API SERVICES of 5,000 training images annotated with 12,000 individ-
1) FRAMEWORKS ual object instances, while VOC2012 offers an expanded
The proliferation of DL frameworks in recent years has been dataset with 11,000 training images and 27,000 annotated
remarkable. Table 3 provides a comprehensive comparison objects. Both datasets feature annotations for 20 commonly
of various DL frameworks. The frameworks included in encountered object categories, including ‘‘person,’’ ‘‘cat,’’
the comparison are Theano, Caffe, Deeplearning4j, Ten- ‘‘bicycle,’’ and ‘‘sofa,’’ ensuring broad applicability for
sorFlow, Keras, Chainer, Apache Singa, Apache MXnet, training and evaluating OD models.
CNTK, PyTorch, Neon, and BigDL, which are delineated
based on the identity of the framework developer, year b: ILSVRC
of origination, distinctive features, supported platforms and The ILSVRC [154] played a vital role in advancing the SOTA
interfaces, compatible models (e.g., CNN, RCNN, and in generic OD. From its inception in 2010 to its refinement
DBN/RBMs), capabilities for parallel execution, and the till 2017, ILSVRC served as a yearly benchmark for progress
associated licensing agreements. This comprehensive table in this domain. Notably, it featured a dedicated detection
aims to assist researchers, developers, and practitioners challenge utilizing images sourced from the ImageNet dataset
in selecting an appropriate DL framework based on their [131]. This challenge dataset boasted a remarkable breadth,
specific requirements and preferences. Each framework’s encompassing 200 distinct object classes. Moreover, the
unique features and characteristics are highlighted to provide sheer volume of images and object instances surpassed the
a clear understanding of their capabilities. PASCAL Visual Object Classes (VOC) dataset by two orders
of magnitude, offering researchers a significantly richer
2) API SERVICES training ground for their algorithms.
Table 4 provides a comprehensive overview of various
API services designed for OD, enabling researchers and c: MS-COCO
developers to select the most suitable tool for their image The Microsoft COCO (MS-COCO) dataset, specifically
analysis tasks. Data is illustrated based on their names, MS-COCO3 [193], stands as one of the most challeng-
founding years, types of services, accessibility methods, ing publicly available benchmarks for OD research. Its
developer organisation, and key features. annual challenge, initiated in 2015, has served as a crucial


FIGURE 17. Architecture of CenterNet.
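A common way CenterNet-style detectors avoid explicit NMS is to keep only the local maxima of the centre heatmap, which a 3x3 max-pooling comparison achieves in a few lines; the top-k value of 100 is an illustrative choice:

import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, k: int = 100):
    """heatmap: (N, num_classes, H, W) raw logits. Returns top-k peak scores and flat indices."""
    heat = torch.sigmoid(heatmap)
    pooled = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    heat = heat * (pooled == heat).float()            # suppress non-maximum neighbours
    scores, idx = heat.flatten(1).topk(k, dim=1)      # keep the k strongest peaks over class and space
    return scores, idx

scores, idx = decode_centers(torch.randn(1, 3, 128, 128))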

FIGURE 18. Architecture of DETR.
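DETR's set prediction relies on one-to-one bipartite matching between predicted and ground-truth objects; the Hungarian step itself is a standard assignment problem, sketched here with SciPy on a toy cost matrix mixing a classification term and an L1 box term (the cost weights and sizes are illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy example: 4 predicted queries, 2 ground-truth objects.
cls_prob = np.random.rand(4, 2)      # probability each query assigns to each GT object's class
box_l1 = np.random.rand(4, 2)        # L1 distance between predicted and GT boxes
cost = -cls_prob + 5.0 * box_l1      # lower cost = better match (weighting assumed)

pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))   # optimal one-to-one assignment; unmatched queries map to "no object"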

platform for evaluating and advancing OD algorithms. While has emerged as the de facto standard for the OD community.
featuring fewer object categories compared to ILSVRC, Its comprehensive annotations, diverse object instances,
MS-COCO compensates with a significantly larger number and challenging scenarios provide a robust platform for
of object instances per image. For instance, MS-COCO- algorithm development and evaluation, driving significant
17 boasts 164,000 images and 897,000 annotated objects advancements in the field.
across 80 categories. A key differentiator of MS-COCO
compared to datasets like Pascal VOC and ILSVRC lies in its d: OPEN IMAGES
inclusion of per-instance segmentation annotations alongside Building upon the success of the MS-COCO dataset,
bounding boxes. This detailed labeling facilitates precise 2018 witnessed the launch of the Open Images Detection
object localization and understanding of complex inter- (OID) Challenge [237], a landmark initiative that signifi-
object relationships. Furthermore, MS-COCO challenges cantly expanded the scale and complexity of OD tasks. Unlike
algorithms with its abundance of small objects (area less than MS-COCO, OID tackles two distinct tasks:
1% of the image) and densely packed scenes, pushing the 1. Standard Object Detection: This task aligns with
boundaries of detection accuracy and scalability. Similar to traditional OD frameworks, requiring the identification and
the impact of ImageNet on image classification, MS-COCO localization of individual objects within images. The OID


TABLE 3. An analysis of different DL frameworks.

dataset for this challenge boasts an impressive 1.91 million tasks that encourage the development of more robust and
images, each annotated with 15.44 million bounding boxes nuanced algorithms capable of understanding the rich visual
encompassing 600 distinct object categories. This substantial interactions within images.
scale surpasses prior datasets by a significant margin, offering
researchers a rich and diverse ground for benchmarking and e: ROBOFLOW 100
advancing OD algorithms. Roboflow 100 (RF100) [238] was introduced in 2022 to
2. Visual Relationship Detection: Stepping beyond single- shift beyond the limitations of single-domain datasets like
object identification, OID introduces the challenging task of MS-COCO. RF100 emerged as a diverse and challenging
detecting relationships between pairs of objects within an benchmark for OD research by providing the following
image. This novel task delves into the intricate semantic con- features:
nections present in complex scenes, pushing the boundaries 1. Domain Diversity: Spanning 7 distinct domains (Aerial,
of CV beyond mere object localization. Videogames, Microscopic, Underwater, Documents, Electro-
Overall, OID represents a significant advancement in the magnetic, Real World) with over 224,714 images, RF100
field of OD, offering a comprehensive dataset and diverse challenges models to adapt to varied visual characteristics


TABLE 4. API services for object detection.

and object types beyond the typical focus on common objects [239], [240], which replaced FPPW with the more holistic
in everyday settings. This diversity fosters a more realistic ‘‘false positives per-image (FPPI)’’ metric.
assessment of model generalizability across different tasks In recent years, the field of OD has primarily relied on
and environments. average precision (AP) as the primary evaluation metric.
2. Rich Annotations: Over 829 class labels with 11,170+ Originally introduced within the Pascal VOC2007 challenge,
hours of manual labeling ensure accurate and detailed annota- AP quantifies the AP achieved across different recall levels,
tions for diverse objects, enabling precise object localization typically evaluated for individual object categories. This
and understanding of complex inter-object relationships. This comprehensive metric balances both the true positive rate
level of annotation granularity is crucial for training and (OD accuracy) and the false positive rate (incorrect detec-
evaluating robust models capable of handling intricate real- tions) across the entire range of recall values. Additionally,
world scenarios. the mAP, calculated by averaging AP across all object
3. Accessibility: Open-sourced and publicly available, categories, is often employed as a single, overarching
RF100 promotes widespread research and development performance indicator. To assess the accuracy of object local-
activities by providing researchers with a readily accessible ization, the Intersection over Union (IoU) between predicted
platform for benchmarking and improving their OD models. bounding boxes and ground-truth annotations is employed.
4. Benchmarking Tools and Visualization: Code is avail- This metric measures the overlap between the predicted and
able for replicating benchmark results and performing actual object locations, with a threshold (typically 0.5) used
fine-tuning and evaluation on YOLOv5 and YOLOv7 mod- to determine whether a detection is considered successful.
els, allowing researchers to easily compare and analyze their If the IoU exceeds the threshold, the object is deemed
models’ performance on RF100. Additionally, the RF100 ‘‘detected,’’ otherwise it is classified as ‘‘missed.’’ This
website provides a platform for visualizing images and binary classification based on IoU has become the practiced
trained model results, facilitating data exploration and model standard for evaluating OD performance, with the 0.5-IoU
performance analysis. mAP serving as the primary benchmark for comparing and
ranking detectors.
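The IoU criterion underlying these AP computations reduces to a few lines of arithmetic. The sketch below scores one predicted box against one ground-truth box in (x1, y1, x2, y2) image coordinates, with 0.5 as the conventional acceptance threshold:

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

pred, gt = (48, 40, 210, 180), (50, 45, 200, 170)
print(iou(pred, gt), iou(pred, gt) >= 0.5)   # roughly 0.83 -> counts as a true positive at 0.5 IoU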
2) METRICS Following the introduction of the influential MS-COCO
The question of how to accurately evaluate object detectors dataset in 2014, the focus within OD research demonstrably
is not static but rather adapts to the evolving landscape of shifted towards accurate object localization. Prior to this,
detection research. In the early days, a lack of consensus evaluation metrics often relied on a fixed IoU threshold for
existed on suitable metrics. For instance, pedestrian detection classifying detections as true positives. While this offered
research primarily employed the ‘‘miss rate versus false a simple binary assessment, it failed to capture the nuances
positives per window (FPPW)’’ metric [168]. This per- of precise bounding box placement. The MS-COCO dataset
window approach, however, proved inherently flawed, failing challenged this paradigm by employing an AP metric
to accurately predict detector performance for the entire calculated across a range of IoU thresholds (typically from
image [239]. This prompted a paradigm shift in 2009 with the 0.5 to 0.95). This innovative approach encouraged models
introduction of the Caltech pedestrian detection benchmark to not only identify objects but also to delineate their


spatial extent with greater accuracy. This is arguably of and remarkable acceleration for highly parallel workloads,
paramount importance for real-world applications, where such as CNNs [246]. Consequently, in the modern era, GPUs
imprecise localization could have significant consequences. have transformed from solely compelling graphics engines
Consider, for example, a robotic arm tasked with grasping to versatile, highly parallelized processing units, boasting
a specific tool. Even a marginally misplaced bounding box impressive throughput and memory bandwidth, ideally suited
could lead to a missed grasp and potentially hinder the robot’s for parallel computing paradigms.
functionality. By fostering research into improved localiza- Modern computing landscapes encompass two distinct
tion capabilities, the MS-COCO dataset has significantly processing paradigms: multiple CPUs and GPUs. CPUs
impacted the trajectory of OD research and paved the way for typically exhibit multi-instructional, out-of-order execution,
more robust and nuanced applications in diverse real-world leveraging large caches to mitigate single-thread latency
settings. while operating at high frequencies. In contrast, GPUs
possess thousands of in-order cores, relying on smaller
V. HARDWARE CATALOGUE caches and lower frequencies for parallel processing effi-
While the architectural advancements in OD as discussed in ciency [247]. Recognizing the challenges associated with
the previous sections have demonstrably fueled the success GPU-based application development and integration, various
of CNNs, it is crucial to acknowledge that architectural platforms have emerged to bridge this gap. Notable examples
breakthroughs are not the sole engine driving CV’s progress. include Open Computing Language (OpenCL) [248] and
The evolution of hardware over the preceding decades stands NVIDIA’s widely adopted Compute Unified Device Archi-
as an equally consequential contributing factor, particularly in tecture (CUDA) [249].
the context of deploying CNNs [241]. Significantly impactful The symbiosis between DL and GPUs has profoundly
advancements in hardware acceleration have yielded robust impacted various scientific domains. Examining the intricate
parallel computing architectures, propelling the efficient architecture of CNNs reveals a remarkable alignment with the
training and inference of increasingly complex, multi- inherent parallelism of GPUs. This synergy manifests in the
layered, and DCNN architectures. efficient execution of convolutional operations, diverse sub-
Hardware acceleration employs targeted interventions in sampling strategies, and neuron activations within FC layers
computer hardware to achieve a demonstrably reduced facilitated by binary-tree multipliers [250]. Recognizing the
latency and enhanced throughput for computational tasks, immense potential of GPUs in accelerating CNNs, a plethora
in contrast to traditional software execution on general- of libraries have emerged to facilitate seamless integration.
purpose CPUs. Notably, Princeton architectures have his- Prominent examples include cuDNN [251], Cuda-convert
torically prioritized serial computation models, coupled [252], and libraries embedded within popular DL frameworks
with intricate task scheduling algorithms [242]. CNNs pose like Caffe [253], TensorFlow [218], and Torch [254].
significant computational challenges due to their inherent The evaluation of GPU efficiency for DL applications
reliance on dense parallel computation. This reliance neces- hinges on three primary performance metrics: memory
sitates high memory bandwidth and often leads to excessive efficacy, computational throughput, and power consumption.
power consumption, particularly when dealing with complex These metrics collectively paint a picture of a GPU’s ability
network architectures [243]. to handle the intensive computational demands of DL tasks
Recognizing these hurdles, researchers and hardware while optimizing resource utilization. Among GPU vendors,
vendors have embarked on a concerted effort to develop inno- NVIDIA has cemented its position as the dominant force
vative strategies for boosting processing capabilities. This in the DL realm. Recognizing the diverse landscape of
endeavor strives to achieve enhanced parallelism, optimized DL applications, coupled with the constraints of demand-
inferencing, and efficient power utilization. This section ing deployment environments and budgetary limitations,
delves into a critical evaluation of prominent hardware NVIDIA has consistently expanded its GPU portfolio over
acceleration implementations, meticulously examining their the past two decades.
contributions, potential drawbacks, and broader implications Acknowledging the limitations of diverse GPU variants
for applications within the realm of CV. within resource-constrained environments demanding edge
deployment, compact form factors, and cost efficiency,
A. GPU NVIDIA developed the Jetson platform. Characterized by
The emergence of the Graphical Processing Unit (GPU) as a a heterogeneous architecture, Jetson leverages the CPU for
versatile computational force has transformed the landscape core operating system (OS) management while offloading
of modern computing. Initially conceived as a dedicated DL workloads onto the CUDA-powered GPU. This strategy
accelerator for real-time 3D graphics applications, rendering, facilitates the delivery of server-grade compute performance
and gaming [244], [245], the GPU’s inherent potential for at an attractive price point, evidenced by the proliferation
broader scientific and engineering applications was quickly of various Jetson variants specifically tailored for low-power
recognized as the 21st century unfolded. This realization embedded applications. Consequently, NVIDIA accelerator
stemmed from the GPU’s unique architecture, offering sig- kits have become a ubiquitous tool for diverse ML and DL
nificant performance gains for intensive computational tasks research endeavors and practical applications. Research by

[255] investigated the performance of CNNs across various


CNN architectures, demonstrating that the Jetson TX2 variant
exhibited superior efficiency in comparison to other Jetson
variants. In a separate study, Jin and Niu [256] introduced a
novel teacher/student architecture for real-time fabric defect
detection within the production process. This architecture
leveraged a YOLOv5 backbone network and was deployed
on the NVIDIA Jetson TX2 platform for edge inference,
opting for this platform over the Raspberry Pi due to its
superior processing capabilities [257]. The student network
demonstrated a noteworthy AUC of 96.5%, coupled with
an inference time of 16 ms, thereby ensuring real-time
operational efficacy.

FIGURE 19. FPGA architecture for CNN implementation.
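When comparing such GPU and edge deployments, latency is usually measured with explicit synchronisation, since CUDA kernels launch asynchronously. The sketch below times an arbitrary torchvision classifier at FP16 on an NVIDIA device and is meant only as a measurement template; the model choice, input size, and iteration counts are assumptions:

import time
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval().cuda().half()
x = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.half)

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                 # wait for all queued kernels to finish

print((time.perf_counter() - start) / 100 * 1e3, "ms per image")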

B. FPGA The execution of a CNN on an FPGA entails a multi-


The efficacy of GPUs in delivering high levels of parallelism stage process. Initially, convolutional weights and input
and throughput has solidified their position as a formidable feature maps are transferred from the main data memory
force for hardware acceleration. However, the advancement (MDM) to a dedicated on-chip buffer, ODM. Then, the
of IoT [258] landscape, with its expanding reach across General Matrix Multiplication (GEMM) unit conducts matrix
diverse enterprises, necessitates a paradigm shift in deploy- operations and subsequently transmits the outcomes to
ment strategies. Specifically, the Industry 4.0 [259] vision the Memory Logic Unit (MLU) for batch normalization,
emphasizes edge-device solutions that offer close-to-source ReLU application, and pooling operations. The MLU-derived
processing. Alongside high accuracy and inference speeds, results are subsequently relayed to another ODM unit to
the shift in emphasis necessitates a critical focus on power facilitate accessibility by successive convolutional or FC
efficiency as well, which is an area where FPGAs [260] layers. In instances where on-chip buffer capacity proves
hold a distinct advantage over GPUs. In light of recent insufficient, intermediate results find temporary location in
architectural advances that have yielded increasingly sparse either on-chip or off-chip memory resources.
yet compact architectures, FPGAs offer unique capabilities Traditionally, the specification of FPGAs has been accom-
for facilitating anomalous parallelism, user-defined data plished at the register-transfer level (RTL) using hardware
types, and the ability to implement custom hardware designs. description languages (HDLs) like VHDL [264] and Verilog
Moreover, it permits reconfigurability to post-production for [265]. While effective, this approach can be challenging due
dynamic adaptation to evolving application and environmen- to the low level of abstraction. This necessitates considerable
tal requirements. These features, coupled with the inherent hardware design expertise, significant time and effort for
flexibility and customized design capabilities of FPGAs, both theoretical and practical implementation, and the ability
have driven their adoption in the domain of embedded to manage the complexities of high concurrency across
systems [261]. diverse hardware modules. To mitigate these issues, high-
The fundamental architecture of an FPGA is characterized level synthesis (HLS) methodologies have emerged as a
by a group of programmable logic blocks interconnected promising solution. HLS facilitates FPGA hardware design
through a hierarchical network. A representative FPGA by leveraging high-level languages like C [266], thereby
architecture incorporates diverse subcomponents, including enabling automated compilation of high-level descriptions
dedicated digital signal processing (DSP) units optimized for into low-level specifications. This advancement significantly
multiply-add-accumulate (MAC) operations, lookup tables broadens the accessibility of FPGA design to a broader group
(LUTs) tailored for combinatorial logic functions, and of researchers [267].
block RAMs enabling efficient on-chip data storage [262]. Despite the inherent high computational demands of
Figure 19 depicts a representative FPGA architecture in the CNNs stemming from their intricate architectural parameters
context of CNN implementation. The internal composition (FLOPs) and substantial memory storage requirements,
of this architecture comprises several dedicated sub-modules, FPGAs often offer significantly lower memory bandwidth
including a memory-data-management unit (MDM) for compared to GPUs, typically capped at 10% or less.
efficient data transfer, an on-chip-data-management unit This significant disparity in memory bandwidth has been
(ODM) for minimizing external memory accesses, a general- identified as a critical obstacle to the efficient implementation
purpose matrix-multiply unit (GEMM) developed through of CNNs on FPGAs. To address this challenge, researchers
a set of processing elements (PEs) dedicated to calculate have actively pursued algorithmic optimizations that aim to
MAC operations. Additionally, a Misc-layers unit (MLU) is on hardware with limited computational capacity.
incorporated to handle batch normalization, ReLU activation, lightweight architectures more suitable for deployment on
and pooling operations [263]. resource-constrained FPGA platforms.


One optimization strategy, termed algorithmic opera- Model pruning [283], [284], [285] tackles the challenge
tion, entails the incorporation of computational transforms, of network complexity by selectively removing redundant
including the Fast Fourier Transform (FFT) [268], GEMM, parameters or weights. CNNs, in particular, possess a
and Winograd [269], applied to convolutional kernels and substantial weight count, yet not all of these weights offer
feature maps. The primary objective is to mitigate arithmetic significant, or in some cases, any measurable contribution to
operations post-deployment; which is the inference phase. performance. By eliminating these redundant elements, the
The employment of FFT contributes to a reduction in network sheds unnecessary complexity, resulting in a more
arithmetic complexity by transforming a 2-D convolution into lightweight and energy-efficient architecture. This facilitates
an element-wise matrix multiplication [270]. This transfor- deployment on resource-constrained devices, such as FPGAs,
mation offers substantial computational gains, particularly where computational limitations would otherwise hinder
for large kernel sizes, where the number of operations performance.
between kernels and feature maps escalates rapidly. Low-rank approximation [286] presents another avenue
GEMM stands as an extensively used method for the for CNN compression. This technique decomposes the
processing of DNNs in both CPUs and GPUs, demonstrating notable efficacy through the vectorization of computations in both convolutional and FC layers [271]. In scenarios involving small kernels, Winograd emerges as a more efficient strategy for arithmetic reduction than FFT, by leveraging the reuse of intermediate results [272]. The efficacy of Winograd is exemplified by a considerable 7.28x enhancement in runtime speed when applied to a VGG-Net, in contrast to GEMM, particularly observable on a Titan-X GPU [270]. Furthermore, Winograd exhibits a commendable throughput of 46 Giga Operations Per Second (GOPs) for AlexNet on an FPGA [273].

Data-path optimization represents another strategic initiative directed towards achieving enhanced computational efficiency in architectures. Historically, FPGAs have conventionally structured and implemented processing elements in the form of 2-D systolic arrays [274], [275], [276]. However, these implementations, while conceptually appealing, suffer from an inherent limitation: the inability to implement data caching mechanisms due to the imposed constraints on kernel size within the CNN architecture. This, unfortunately, restricts the overall efficacy of such designs.

The loop optimization technique endeavors to address the aforementioned challenge through the incorporation of various sub-components. First, loop reordering minimizes redundant memory access by exploiting spatial locality, thereby enhancing cache utilization [277]. Second, loop unrolling and pipelining contribute to improved FPGA resource utilization, as demonstrated in [278] and [279], respectively. Finally, loop tiling involves the partitioning of weights and feature maps for each layer emanating from the memory into 'tiles', thereby facilitating efficient hosting within on-chip buffers [280].

CNNs exhibit remarkable versatility across diverse application domains. This inherent adaptability is particularly advantageous in scenarios where a degree of error tolerance is acceptable, such as quality inspection tasks within the manufacturing sector. Recognizing this potential, researchers have actively pursued model compression strategies aimed at mitigating the architectural and hardware complexities of CNNs. Notably, three distinct approaches have emerged in model compression: pruning [281], low-rank approximation [261], and quantization [282].

Low-rank approximation decomposes the convolutional weight matrix or FC layers into a set of low-rank filters. Evaluating these filters requires significantly less computational effort compared to the original weight matrix, making it particularly advantageous for deployment on hardware with limited computational capacity.

Quantization of CNNs has emerged as a promising technique for optimizing computational efficiency, particularly in resource-constrained deployment environments. Leveraging the inherently lower resource demands of fixed-point arithmetic compared to floating-point operations, quantization involves representing CNN feature maps and weight matrices using fixed-point formats. This can lead to significant reductions in computational cost while maintaining acceptable accuracy [287], [288]. For extremely constrained scenarios, further compression can be achieved by quantizing weights to binary values, effectively creating Binary Neural Networks (BNNs). However, this aggressive quantization approach can introduce significant accuracy degradation, necessitating careful trade-offs between efficiency and performance [289].

Rui and Qiang [290] investigated the efficacy of pruning through its application in a CNN architecture designed for textile defect detection in a production environment. Their study employed TensorRT on an NVIDIA Jetson TX2 platform to implement pruning prior to deployment. The authors evaluated the impact of pruning on processing time, reporting a reduction from 80 milliseconds to 36 milliseconds for defect processing after pruning, signifying a significant performance improvement.

C. ASIC
Within the realm of DL, ASICs stand out as custom-designed hardware accelerators, prioritizing performance optimization for specific applications over general-purpose functionality [291]. This tailored approach allows ASICs to achieve superior performance in terms of accuracy and inference speed compared to GPUs and FPGAs when evaluated on the target application. However, the inherent advantage of custom design comes at the cost of significantly longer development cycles due to the specialized nature of the design process.

In the past decade, the field of AI has witnessed the advent of diverse ASIC accelerators designed to address its unique computational demands. One notable example is the HiSilicon Kirin-970, developed by Huawei [292].
This heterogeneous architecture features a dedicated neural processing unit (NPU) alongside a Cortex-A73 (quad-core) CPU cluster. This configuration demonstrably enhances throughput by 25x and energy efficiency by 50x compared to traditional CPU-only approaches. Similarly, Google has spearheaded the development of its custom-designed Tensor Processing Unit (TPU) [293]. Optimized for DNNs and seamlessly integrated with the TensorFlow platform [254], the TPU offers a compelling alternative for high-performance AI inference. Apple has also entered the fray with its neural engine [294], a specialized set of processor cores targeting specific DL network operations, particularly in applications like facial recognition. These advancements highlight the growing trend of customized ASIC accelerators fostering significant performance and efficiency gains within the AI landscape.

Table 5 provides a comprehensive comparison across GPU, FPGA, and ASIC platforms, considering a broader range of metrics. Furthermore, it is imperative to note that once developed, the design footprint of ASICs remains immutable. This lack of reconfigurability represents a notable constraint for ASICs, as the dynamic nature of diverse deployment environments necessitates the capacity for relevant adjustments to accommodate evolving requirements.

VI. INDUSTRIAL DEFECT DETECTION APPLICATION AREAS
Quality control sits at the heart of a robust and efficient manufacturing ecosystem. Any deviations from desired specifications directly impact product functionality, marketability, and ultimately, brand reputation. Therefore, enhancing quality inspection mechanisms is paramount. In the realm of industrial automation, MV has emerged as a powerful tool for revolutionizing quality control and process optimization. However, its widespread adoption within the manufacturing domain has been a gradual process, closely intertwined with the advancements in both hardware capabilities and underlying computational architectures. Prior to the past decade, limitations in computational power and sensor technology often rendered MV impractical for integration into existing industrial workflows. Consequently, quality inspection primarily relied on human-based visual inferencing, a method prone to inconsistencies and subjective bias. The intricacies of identifying and classifying diverse defect types often surpass human capabilities.

However, the landscape of the manufacturing industry is undergoing a significant transformation, driven by the integration of automated processes through MV-based inspection systems, particularly within the domain of Surface Defect Detection. MV leverages the capabilities of CV algorithms to analyze digital images of products, automatically identifying and classifying imperfections with high accuracy and efficiency. This yields multifaceted benefits like reduction in labor costs, mitigation or elimination of human bias, decreased inference time, and alleviation of human fatigue, among others.

Motivated by the ever-growing need for efficient and reliable warehouse operations and the limitations of manual inspection, this section aims to investigate the promising potential of CNNs for industrial defect detection systems. To broaden the scope of this review, we systematically delve into closely related domains within the purview of Structural Health Monitoring (SHM). This includes inspection methodologies for identifying defects in various surfaces, such as pallet racks, steel, rail, magnetic tiles, photovoltaic cells, fabric, screens, etc. By delving into the latest advancements and available options in deployable CV development frameworks, this survey equips researchers with the necessary knowledge to stay updated and make informed decisions in their CV-related research within the context of SHM and beyond. This comprehensive approach not only fills a critical knowledge gap in industrial defect inspection but also offers valuable insights for other closely related domains, fostering cross-pollination of ideas and accelerating advancements in the field.

A. PALLET RACKS
Pallet-rack inspection is a potentially novel application for automated defect detection using MV within warehousing and manufacturing environments. Pallet racks constitute the backbone of industrial logistics, enabling streamlined storage and transport of goods. However, their susceptibility to structural defects, such as cracks, dents, and corrosion, poses significant safety hazards. Such unobserved damage to these vital structures can trigger a cascade of negative consequences if the racking were to collapse. These potential repercussions include substantial financial losses due to ruined stock, operational downtime, employee injuries, and in extreme cases, loss of life. While various mechanical solutions, such as rackguards [300], exist to mitigate the impact of collisions, they lack the intelligent capabilities necessary for proactive damage detection and subsequent intervention.

Hussain et al. [302] spearheaded research in this field by investigating the application of DL architectures for automated defect detection in pallet racking systems. Given the limited computational resources available in such industrial settings, the authors proposed a MobileNetV2-based model trained on the first publicly available pallet-racking dataset, acquired through collaboration with multiple industry partners. To address potential data imbalances and enhance the model's generalizability, the authors implemented a novel representative sample scaling technique, which ultimately led to an mAP of 92.7% on the aforementioned dataset. Furthermore, their work distinguished itself from previous mechanical and sensor-based solutions by proposing a unique hardware placement strategy. Instead of attaching any equipment directly to the racking structure itself, the authors suggested mounting the inference hardware device on the forklift's adjustable brackets, a solution that balances performance with operational flexibility. This strategic placement significantly reduced hardware requirements, with
TABLE 5. Hardware metric comparison.

TABLE 6. Comparison of existing pallet racking research work.

some installations achieving a 95% decrease in hardware while maintaining a 50% IoU metric. This reduction was accompanied by an expanded coverage area relative to the operating forklift.

Hussain et al. [303] further improved performance and real-time operational feasibility by proposing a domain variance modeling (DVM) approach for training the YoloV7 architecture. Additionally, they broadened the scope of defect detection to encompass not only vertical flaws, but also horizontal cracks and rack support damage. The results were noteworthy, with the system achieving an IoU of 50% at an impressive 91.1% accuracy, operating at 19 fps.

Farahnakian et al. [301] further contributed to the domain of automated racking inspection by focusing on semantic segmentation, employing Mask-RCNN as the inference architecture. While their reported performance slightly surpassed that of [302], a closer examination of their dataset revealed limitations in its representativeness for real-world deployment. The captured images depicted isolated racking structures devoid of contextual information, such as the surrounding warehouse environment or loaded stock, potentially hindering the generalizability of their proposed architecture.

Expanding upon prior research, Hussain and Hill [304] introduce a development pipeline called CNN-Block Development Mechanism (CNN-BDM) that enables researchers in the warehousing domain to develop custom lightweight CNN architectures. The proposed CNN architecture has only 6.5 million learnable parameters, making it the first custom-designed CNN architecture for the pallet racking domain. The system achieved a baseline accuracy of greater than 90% and an overall F1 score of 96% on the test data, demonstrating its effectiveness in detecting damaged pallet racking in warehousing and distribution centers. Various regularization strategies, including dropout, were applied to further enhance the performance and generalizability of the network; a drop rate of 50% provided the highest performance during training, achieving 99% precision, recall, and F1 score. The performance of the proposed architecture was evaluated on the test dataset, and although there was a slight drop in the overall F1 score, the performance was still impressive at 96%.

Hussain [305] builds further on this domain, proposing YOLO-v5n as the optimal architecture for automated pallet rack inspection, achieving an impressive mAP@0.5 accuracy of 96.8%, surpassing previous efforts in this domain. This achievement hinges on a novel methodology that delivers a robust architecture characterized by high accuracy, strong generalization capabilities, and a lightweight footprint. The key to this success lies in the proposed augmentation strategy. By incorporating domain-specific augmentations, the model learns robust features that generalize well to real-world scenarios. This, in turn, leads to high accuracy without sacrificing generalizability. Furthermore, a variant
selection algorithm plays a crucial role in balancing accuracy and computational efficiency. Recognizing the need for lightweight models suitable for edge deployment, the algorithm prioritizes YOLO-v5n over its higher-accuracy counterpart, YOLO-v5x, due to the latter's significant computational burden.

Alif [306] introduces Pallet-Net, a novel DL technique for automated pallet rack inspection. Leveraging an attention-based CNN, Pallet-Net achieves a remarkable total accuracy of 97.63%, surpassing existing methods (ViT and CCT) in terms of precision (98%), recall (98%), and F1 score (98%). This exceptional performance highlights the model's ability to effectively identify faulty pallet racks, contributing significantly to industrial safety and maintenance practices.

Hu [307] presents a novel approach for automated pallet racking assessment utilizing the MobileNetV2-YOLOv5 framework. This framework enables the detection of various damage types in pallet racking systems directly on edge platforms during pallet movement. Following a comprehensive analysis, the study identifies MobileNetV2-YOLOv5 (n) as the optimal architecture due to its superior balance between high accuracy and computational efficiency for deployment on resource-constrained edge devices. The proposed methodology demonstrates impressive performance, achieving a precision of 90.6%, a recall of 95.7%, and an mAP@0.5 of 97.8%. These metrics highlight the framework's effectiveness in accurately identifying damage while maintaining computational feasibility for real-time applications.

Reviewing the current literature, it becomes evident that a significant gap exists in research concerning automated racking inspection utilizing DL techniques. Table 6 summarizes the key findings of existing research studies in this field.

B. STEEL SURFACES
Steel remains a critical element across various planar industries, including architecture, aerospace, machinery, and automobile manufacturing. However, producing certain steel strips can be technically complex, with stringent quality requirements. These complexities expose the material to various defect formation risks arising from human, mechanical, or environmental factors. Figure 20 illustrates common steel defect types.

With the growing demands for intelligent manufacturing and enhanced surface quality assurance, steel surface defect detection has garnered significant attention in recent years. In response to this critical need, Yi et al. [309] proposed an end-to-end system for surface defect recognition in steel strips. Their system leverages a novel symmetric surround saliency map for initial defect detection, followed by a DCNN that directly maps defect images to their respective categories. This CNN, trained on raw defect images, forms the core of an efficient end-to-end defect recognition pipeline.

Cha et al. [310] introduced a novel DCNN architecture capable of directly identifying cracks on concrete and steel surfaces, eliminating the need for manual feature engineering. This innovative framework demonstrates exceptional resilience against the inherent variability of real-world environments, often posing significant challenges for traditional CV algorithms. Furthermore, the same research team developed a structural visual inspection method leveraging Faster R-CNN [311]. This method enables the quasi-real-time simultaneous detection of multiple types of defects, further enhancing the efficiency and accuracy of infrastructure inspection.

Building upon existing work, He et al. [312] proposed a novel end-to-end DL system for steel plate defect inspection. Their approach used a CNN as the primary feature extraction mechanism, generating feature maps at each processing stage. These feature maps were then integrated through a Multilevel-feature fusion network (MFN), culminating in a single feature representation containing enhanced spatial details of potential defects. Leveraging these enriched features, an RPN identified areas of interest, followed by a detector comprising a classifier and a bounding box regressor to produce the final defect detection results. The proposed architecture demonstrated impressive performance, achieving an accuracy of 99.67% for the defect classification task and an mAP of 82.3% for the defect detection task on a publicly available dataset. Furthermore, the system achieves a detection speed of 20 fps while maintaining an mAP of 70%.

Addressing the challenge of imbalanced class distributions due to the inherent sparsity of abnormal samples, Lian et al. [313] proposed a novel Generative Adversarial Network (GAN) architecture for identifying tiny flaws in steel plates. Their approach leverages defect exaggeration, generating both the clean image and an exaggerated version of the defect. This augmented dataset is then fed into a GAN-CNN hybrid model, demonstrably improving the accuracy of tiny surface defect detection by effectively augmenting the minority class.

Luo et al. [314] address the challenges of roll mark detection on steel strips, characterized by large intra-class variations, low contrast, and harsh industrial environments. They propose a novel network, SCFPN, featuring a pyramid structure for enhanced defect feature extraction and a CIoU loss function for improved training stability. To address data limitations, they introduce CSU_STEEL, the first publicly available database of hot-rolled steel strip surface defects. SCFPN demonstrates impressive performance, achieving 75.9% AP for fine-grained roll mark characterization and reaching 99.2% and 82.8% mAP on the DeepPCB and NEU datasets, respectively.

Liu et al. [315] propose a surface defect detection framework for steel strips that incorporates a self-attention mechanism to capture spatial-wise semantic relationships and model global contextual inter-dependencies. The framework, known as Feature Refinement Faster R-CNN (FR-FRCNN), automatically identifies the specific class and location of six types of typical surface defects on steel strips. Compared to the baseline framework (Faster R-CNN with FPN), the proposed framework achieves higher detection accuracy and better localization ability.
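To make the kind of spatial self-attention referenced in the FR-FRCNN framework above more tangible, the following Python sketch (written with PyTorch) shows how pairwise affinities between all spatial positions of a CNN feature map can re-weight features before a detection head. This is an illustrative sketch only, not the module used in [315]; the class name SpatialSelfAttention, the reduction factor, and the tensor sizes are assumptions chosen for demonstration.

import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    # Minimal spatial self-attention over a CNN feature map (non-local-style sketch).
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/r)
        k = self.key(x).flatten(2)                      # (B, C/r, HW)
        attn = torch.softmax(q @ k, dim=-1)             # (B, HW, HW) affinities between positions
        v = self.value(x).flatten(2).transpose(1, 2)    # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return self.gamma * out + x                     # residual keeps the original features

# Hypothetical usage: refine a backbone feature map before region proposal.
feat = torch.randn(1, 256, 32, 32)
refined = SpatialSelfAttention(256)(feat)
print(refined.shape)  # torch.Size([1, 256, 32, 32])

Because the attention matrix relates every position to every other position, such a block captures the global contextual inter-dependencies described above, at the cost of memory that grows quadratically with the number of spatial locations.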
FIGURE 20. Common steel surface defects [308].

Feng et al. [316] address the critical challenge of automating hot-rolled steel strip defect detection, highlighting its crucial role in various manufacturing industries. They propose a novel approach leveraging a RepVGG architecture augmented with a spatial attention mechanism. While the overall test accuracy reaches a promising 95.10%, the authors acknowledge significant variations in performance across different defect categories, with some as low as 78.95%. Notably, they commendably present a detailed analysis of their architecture's computational complexity, acknowledging its relatively large size in terms of learnable parameters (83.825 million) and computational demand (17.892 GMACs). This transparency allows for informed comparison and future refinement of the model.

Yang et al. [317] explore the potential of YOLOv5 for production-based weld steel defect detection using X-ray images of weld pipes. Their work challenges the traditional dominance of two-stage detectors like Faster-RCNN in this domain, advocating for the effectiveness of single-stage detectors when equipped with appropriate training strategies. While conventional wisdom often assigns superior accuracy to two-stage detectors, Yang et al. effectively dispel this notion by demonstrating the potential of YOLOv5. Through meticulously crafted data augmentation strategies, they achieve superior performance in both speed and accuracy compared to Faster R-CNN. Their trained YOLOv5 model attains an mAP of 98.7% (IoU = 0.5) while satisfying real-time detection requirements for steel pipe production, with a single image processing time of 0.12 seconds. However, the current implementation has limitations. The X-ray data is processed on a PC equipped with a GPU, indicating a centralized processing architecture that may not be readily scalable in large production environments. Future work could explore edge computing solutions or more distributed processing architectures to address this limitation.

In the study by [318], the authors proposed a custom CNN architecture for automatic metal casting defect detection and compared their performance with SOTA architectures like ResNet, MobileNet, and Inception. Their approach leverages the depth-wise separable convolution introduced by MobileNet within their custom architectures, alongside several optimization strategies such as Blurpool, stochastic weight averaging, MixUp, label smoothing, and squeeze-excitation. While the authors claim superior performance
based on the reduced number of parameters, their reported accuracy is 81.87%, whereas Inception attains 91.48%.

C. RAIL TRACKS
Rail surface defect detection presents a unique but critical challenge within the broader realm of steel surface defect analysis. Exceptionally high contact pressures between wheel and rail, coupled with high traction and braking forces, result in a multitude of potential rail surface damage defects. Such defects can propagate from the rail surface into the rail head and, if left undetected, can result in catastrophic rail failures. As the maintenance demands of an ever-growing global rail network continue, it becomes essential to develop high-speed, reliable, and cost-effective detection systems to ensure the safety of railway operations. Figure 21 illustrates common defects encountered in rails. Image-based MV has the potential to provide a cost-effective solution to rail defect detection; however, it is complicated by the high rail reflectivity and complex ambient lighting of railway systems (night/day, dew point, tunnels, cuttings, and open sections). Wide adoption therefore necessitates the development of robust, higher-speed, and more sophisticated image processing algorithms to effectively handle the complexities inherent in continuous monitoring of rail surface defect images.

FIGURE 21. Common rail surface defects [42].

Soukup and Huber-Mörk [319] explored the application of CNNs for rail surface defect detection using photometric stereo images. This approach leverages the inherent depth information encoded within such images, potentially leading to more robust and accurate defect identification compared to conventional methods. Their work focused on investigating the impact of various regularization techniques on improving the performance of the CNNs. Firstly, they employed unsupervised layer-wise pre-training, where each layer of the network is trained independently on unlabeled data before fine-tuning with labeled defect images. This approach aimed to learn general image features and improve the overall network's generalization capabilities. Secondly, they explored the effectiveness of training data augmentation, a technique that artificially expands the training dataset by generating variations of existing images. This serves to prevent overfitting and enhance the network's ability to recognize defects under diverse lighting conditions and image noise. By exploring these regularization methods, the authors demonstrated the potential of DL for rail surface defect detection using photometric stereo images. Their work contributes to the development of reliable and automated inspection systems for ensuring the safety and integrity of railway infrastructure.

To further automate manual inspection, Liang et al. [320] introduced a novel approach utilizing a DCNN to automate rail surface defect detection. Their work leverages the SegNet architecture [321], a well-established network known for its efficient encoder-decoder structure. This deep, 59-layer network extracts relevant features from rail surface images while simultaneously performing spatial localization, enabling precise defect identification.

Shang et al. [323] proposed a novel two-stage pipeline for automated rail defect detection, emphasizing a novel localization and classification strategy. In the first stage, they employed traditional image processing techniques to accurately localize defects on cropped rail images rather than the original image. Subsequently, the cropped images were fed into a fine-tuned CNN for defect classification. This approach leveraged the powerful feature extraction capabilities of CNNs while reducing computational complexity by focusing only on the relevant regions. This targeted feature extraction yielded superior discriminative power for subsequent classification, ultimately achieving impressive precision and recall rates of 92.08% and 92.54%, respectively.

Leveraging the YOLOv3 DL model [192], Yanan et al. [324] proposed an approach for rail surface defect inspection. This framework divides the input image into a grid of S × S cells. Within each cell, logistic regression was employed to predict the bounding box object score, indicating the confidence of a defect being present. Additionally, a binary cross-entropy loss function was utilized to predict the specific categories of defects that the bounding box might encompass. The research outcomes achieved an impressive recognition rate exceeding 97%, with an identification time of approximately 0.15 seconds.

Yuan et al. [325] addressed the challenge of real-time defect detection by proposing an end-to-end framework utilizing a lightweight CNN architecture. Their approach leverages MobileNetV2 [145] as the core network for efficient feature extraction, enabling multi-scale defect detection for optimal performance. This methodology yielded an impressive mAP of 87.4% while maintaining a real-time processing speed of 60 fps.
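Lightweight backbones such as MobileNetV2, referenced in the final study above, derive much of their efficiency from depthwise separable convolutions. The snippet below is a minimal illustration of that building block in PyTorch; it is a generic sketch rather than the exact block used by Yuan et al. [325], and the layer sizes, ReLU6 activations, and helper name depthwise_separable_block are assumptions made for the example.

import torch
import torch.nn as nn

def depthwise_separable_block(in_ch, out_ch, stride=1):
    # Depthwise 3x3 filtering per channel followed by a 1x1 pointwise projection.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),                  # depthwise: one filter per input channel
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # pointwise: mixes information across channels
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )

block = depthwise_separable_block(32, 64, stride=2)
x = torch.randn(1, 32, 128, 128)
print(block(x).shape)                              # torch.Size([1, 64, 64, 64])
print(sum(p.numel() for p in block.parameters()))  # roughly 2.5k weights, versus 32*64*9 = 18432 for a full 3x3 conv

This factorization is what allows such backbones to deliver multi-scale features at frame rates suitable for continuous rail monitoring on embedded hardware.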
FIGURE 22. Common magnetic tile surface defects [322].

D. MAGNETIC TILES
Magnetic tiles, a vital component of permanent magnet motors, are tile-shaped magnets offering consistent magnetic potential. As their quality directly impacts the overall motor performance, accurate surface defect detection is crucial. As depicted in Figure 22, a variety of surface defects can afflict magnetic tiles, hindering their optimal functioning. Historically, this task relied primarily on traditional MV techniques. However, the recent emergence of CNN algorithms has introduced a promising alternative.

Addressing the need for real-time saliency detection in magnetic tile surface defect inspection, Huang et al. [326] developed the MCuePush U-Net model. This innovative architecture employs three key components: a multi-channel cue (MCue) module, a U-Net architecture, and a push network. The MCue module generates a three-channel input consisting of an MCue saliency image and two raw images, enriching the information accessible to the subsequent processing stages. The U-Net is a DCNN which possesses a hierarchical structure, extracting salient features from this enriched input. This hierarchical approach allows the network to identify both subtle and prominent defect characteristics. Finally, the push network, composed of two FC layers and an output layer, refines the defect localization further. It predicts the specific location of surface defects by generating bounding boxes, providing precise spatial information for subsequent analysis. Remarkably, this model can process each image within a mere 0.07 seconds, demonstrating its suitability for real-time applications.

Dong et al. [322] introduced PGA-Net, a novel approach for pixel-wise detection of surface defects, incorporating a pyramid feature fusion mechanism and a global context attention network. The framework involves the extraction of multi-scale features, employing efficient dense skip connections through a pyramid feature fusion module. Subsequently, a global context attention module is applied to facilitate effective information propagation from low-resolution fusion feature maps to high-resolution fusion counterparts. Experimental evaluations demonstrated the efficacy of the proposed method, revealing an IoU of 71.31%. The incorporation of pyramid feature fusion and global context attention enhances the model's ability to discern surface defects pixel-wise, contributing to improved defect detection performance.

Furthermore, Lian et al. [313] evaluated their small surface defect detection model on magnetic tile images using an exaggerated local variation-based GAN, achieving an impressive accuracy of 99.2%.
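Since pixel-wise results such as the 71.31% IoU reported for PGA-Net depend on how IoU is computed over segmentation masks, a small worked example may be helpful. The function below is a generic sketch of pixel-wise IoU between binary defect masks, not the evaluation code of [322]; the mask sizes and values are purely illustrative.

import numpy as np

def pixel_iou(pred_mask, gt_mask):
    # Intersection-over-Union between two binary masks, evaluated pixel-wise.
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 1.0

pred = np.zeros((8, 8), dtype=np.uint8)
gt = np.zeros((8, 8), dtype=np.uint8)
pred[2:6, 2:6] = 1          # 16 predicted defect pixels
gt[3:7, 3:7] = 1            # 16 ground-truth defect pixels
print(pixel_iou(pred, gt))  # 9 overlapping pixels / 23 pixels in the union = ~0.391

Averaging this quantity over a test set, or thresholding it (for example at 0.5, as in mAP@0.5 for detection), underlies many of the figures quoted throughout this section.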
E. PHOTOVOLTAICS
PV systems have been recognized as a crucial contributor to the transition towards renewable energy sources, primarily due to their role in mitigating global emissions and reducing dependence on conventional energy generation paradigms [327]. Solar cells, the fundamental building blocks of PV modules, are susceptible to the formation of micro-cracks during manufacturing due to exposure to high-temperature variations and external pressure sources [328].

Limitations inherent in human-based quality inspection, as experienced in the fabric manufacturing industry, also affect the PV panel sector. A recent case study (2018) involving the post-deployment inspection of 180,000 PV panels revealed that over 4,000 faulty modules were shipped, resulting in substantial financial losses for the client [329]. These losses highlight the challenges of defect detection in PV panels, primarily due to the non-uniformity of cell surfaces. Consequently, researchers have shifted their focus toward the development, deployment, and optimization of CNN architectures for automating this critical task.

Hussain et al. [330] addressed the challenge of micro-crack detection in PV manufacturing by proposing a tailored CNN architecture. Recognizing the difficulties associated with acquiring sufficient industrial data, the authors introduced a novel approach of extracting internal feature maps as representative augmented samples. This technique effectively enriched the dataset's variance without compromising its integrity. Subsequently, the proposed architecture was rigorously benchmarked against SOTA models across a comprehensive set of metrics encompassing architectural and computational complexities, accuracy, FPS performance, and latency. Notably, the proposed methodology achieved an impressive F1 score of 98.8% while maintaining a remarkably low parameter count of 6.42 million.

Luo et al. [331] tackle the challenge of limited data availability in industrial PV manufacturing, within the domain of Electroluminescence (EL) imaging for PV cell inspection. Their approach leverages a GAN to augment the existing data by generating representative synthetic samples. To assess the effectiveness of their method, they train three SOTA DL architectures: ResNet, AlexNet, and SqueezeNet [208]. While ResNet emerged as the optimal performer, the authors critically acknowledge the method's limitations. They highlight the high instability of the training process and increased computational demands, which could hinder the scalability of their proposed solution.

Su et al. [332] highlight the challenges associated with automated EL-based PV cell surface defect detection, particularly the difficulty in distinguishing between defective and normal pixel regions due to blurred boundaries. To address this, the authors propose a Complementary Attention Network (CAN) [291], depicted in Figure 23, which leverages a unique coupling of channel-wise and spatial-wise attention modules. This mechanism, integrated into the Faster-RCNN architecture, effectively suppresses background pixels and enhances the saliency of defective regions, leading to improved detection accuracy. VGG-16 serves as the chosen feature extractor within the Faster-RCNN framework, pretrained on the ImageNet dataset. While the overall architecture encompasses 261.26 million parameters, the CAN mechanism itself contributes minimally to this complexity, with Faster-RCNN accounting for 260.50 million parameters. This suggests that the CAN's impact on computational cost is negligible compared to the underlying Faster-RCNN architecture.

Ahmad et al. [333] identified significant challenges in PV surface defect detection, including inhomogeneous surface intensities and complex background variations. To address these challenges, they proposed a custom CNN architecture tailored for accurate defect detection on PV surfaces. The network comprised four convolutional blocks, each containing 32 filters, followed by two dual convolutional blocks with 64 filters each, and finalizing with two additional convolutional blocks housing 128 filters each. These layers fed into a single FC layer for final output. Notably, the authors implemented various augmentation techniques to enhance the network's generalization capabilities. Their approach achieved a respectable accuracy of 91.58%, demonstrating its potential for efficient PV surface defect detection.

F. TEXTILES
Fabric, a ubiquitous material in our daily lives, is woven from textile fibers. These fibers can be derived from natural sources such as cotton or wool, or be synthetic composites like wool-nylon or polyester blends. However, imperfections can arise during their production, leading to fabric defects, blemishes on the fabric surface. In the realm of automation, fabric defect detection plays a pivotal role in quality control, aiming to identify and locate these flaws. The textile industry itself recognizes and categorizes over 70 distinct types of fabric defects [334]. These imperfections can be attributed to diverse factors, including machine malfunctions, yarn irregularities, inadequate finishing processes, excessive stretching, and more. Figure 24 highlights six commonly encountered defects.

Jin and Niu [256] highlight the potential for automated quality inspection in fabric manufacturing, emphasizing reduced labor costs and increased detection speed. To address the challenge of identifying small defects, they propose a customized YoloV5 architecture enhanced with a spatial attention mechanism. Their methodology involves training a "teacher network" on a dedicated fabric dataset, followed by knowledge distillation to transfer the acquired knowledge to a "student network" with a lighter computational footprint. This student network is subsequently deployed to a Jetson TX2 platform for real-time inference, leveraging TensorRT for optimized performance. The authors report that, while the teacher network achieved higher performance (98.8% AUC) compared to the student network (96.5% AUC), the latter offered significantly faster inference speeds (16 ms vs. 35 ms). They conclude that the combined effect of their proposed architecture and hardware selection ensures
FIGURE 23. Complementary attention network.

FIGURE 24. Common fabric surface defects [334], [335].

real-time performance, meeting the necessary identification timeframes for industrial applications.

Zhang et al. [336] present an updated MobileNetV2-SSD-Lite architecture specifically designed for automated fabric surface defect detection. Their modification introduces a channel-based attention mechanism to emphasize defective regions against the background surface. Furthermore, they replace the default loss function with a combination of focal loss and K-means clustering for optimizing candidate box parameters. This approach demonstrates respectable performance, with the proposed architecture achieving an mAP exceeding 90% while maintaining an inference speed of 14.19 fps on the camouflage dataset. Notably, the authors report negligible computational overhead from the channel-based attention mechanism, with parameter count increasing by only 0.001 million.

Li and Li [337] introduced a novel approach for fabric defect detection utilizing a cascaded-RCNN architecture. This approach focuses on enhancing accuracy through a multi-headed strategy, employing multi-scale training followed by dimensional clustering to characterize prior anchors. Notably, they leverage ResNet-50 as the feature extractor and incorporate FPN with a bottom-up-top-down path facilitated by lateral connections. The authors report an 8.9% improvement in overall precision compared to baseline methods. However, they acknowledge the limitations of their testing dataset, lacking patterned fabric images that could pose greater classification challenges.

Song et al. [338] present an efficient architecture for fabric defect detection specifically tailored to industrial deployment environments. Recognizing the limitations of computational resources and power constraints in such settings, the authors prioritize the development of a lightweight model offering minimal latency and power consumption while maintaining high accuracy. In addition to employing data augmentation techniques, the authors optimize the architecture by leveraging TensorRT. This framework facilitates layer and tensor fusion, alongside weight and activation precision calibration. Notably, they justify the implementation of a combined convolution, bias, and ReLU (CBR) layer to achieve a computationally lighter structure. The results demonstrate the effectiveness of their proposed solution. Edge deployment exhibits a 2.5x speedup compared to a cloud-based configuration, while maintaining a detection accuracy of 98%. Post-TensorRT optimization, the model attains an even faster frame rate of 22.78 frames per second and an inference time of 43.9 milliseconds, compared to 13.74 frames per second and 72.8 milliseconds before optimization. These findings highlight the efficacy of the proposed architecture for resource-constrained industrial settings.

G. DISPLAYS
Liquid crystal displays (LCDs) and touch displays have permeated the electronics landscape, becoming ubiquitous components in various devices. Consequently, the quality of these displays directly impacts the overall user experience and perception of electronic products. In the context of LCD screens, process and environmental factors can introduce various display defects, such as bright spots, light leakage, white spots, foreign objects, streaks, BLOBs, Mura, black spots, uneven colors, scratches, bubbles, and wrinkles [42]. These imperfections significantly impact user experience and necessitate robust quality control measures. The rapidly growing application of DL and MV within intelligent manufacturing offers immense potential for addressing this challenge. In particular, CNN-based screen defect detection technology presents a promising avenue for enhancing both
TABLE 7. Summary of CNN-based defect detection applications in various industrial domains.

the efficiency and accuracy of quality control processes.

Luo et al. [339] presented an automated scratch detection method employing a two-module cascading architecture. The first module leveraged a series of low-level processing stages to identify large scratches and localize potential small scratch candidates. Subsequently, the second module employed a lightweight ScratchNet model for classifying each identified small scratch candidate as a genuine scratch or a non-scratch anomaly. This approach demonstrated remarkable accuracy, achieving 96.35% for small scratch classification.

In the domain of quality control for flat-panel displays, Mura defects present a significant challenge due to their subtle visual nature and variability. To address this issue, Yang et al. [340] proposed a novel approach combining online sequential classification and transfer learning for real-time detection and classification of Mura defects on production lines. The key innovation lies in the integration of a DCNN for feature extraction with a sequential extreme learning machines (SELMs) classifier. This innovative combination enables real-time training and classification of Mura defects directly within the production line, eliminating the need for pre-training on large datasets or offline processing.

Building upon the challenges of diverse defect sizes and shapes in screen defect detection, Lei et al. [341] developed a novel end-to-end framework. Their approach integrates merging and splitting strategies to effectively process image patches of varying scales. Subsequently, a trained Recurrent Neural Network (RNN) analyzes the processed patches and identifies the ones most likely to contain defects. This system achieved a remarkable precision of 90.36%, significantly exceeding the 76.02% benchmark established by AlexNet [123].

Lv et al. [342] proposed an automated defect detection system for mobile phone cover glass. Their approach leverages backlight imaging and a modified segmentation technique powered by DNNs to effectively extract and identify defects. Furthermore, the authors proposed a GAN-based approach coupled with a Faster R-CNN model [156] to address the
challenge of limited defect data in specific applications, such as identifying imperfections on mobile phone cover glass. Their experiments convincingly demonstrated the efficacy of this method in enhancing defect detection performance.

H. OTHER SURFACES
Beyond the aforementioned applications, a review of the relevant literature reveals that DL has also been successfully employed in a diverse range of industrial quality detection tasks, including metal surface defect detection [343], [344], [345], electronic component defect detection [346], [347], [348], optical fiber defect detection [349], [350], wheel hub surface defect detection [351], [352], [353], diode chip defect detection [354], [355], [356], bottle mouth defect detection [357], [358], [359], precision parts defect detection [360], varistor defect detection [361], [362], ceramic defect detection [363], [364], [365], and wood defect detection [366], [367], [368]. This broad spectrum of applications showcases the substantial potential of DL-based MV technology for industrial quality inspection. However, it is crucial to acknowledge that DL applications in the industrial domain are predominantly customized, necessitating a high degree of coupling between the technical models and specific inspection scenarios. Consequently, refined development tailored to each unique task is paramount for successful implementation.

I. COMPARATIVE ANALYSIS
As evident from Table 7, CNN-based approaches have been widely explored for defect detection across various industrial surfaces, including pallet racks, steel, rail tracks, magnetic tiles, photovoltaic cells, textiles, and displays. The reported performance metrics, such as accuracy, mAP, precision, and recall, demonstrate the efficacy of these techniques in automating defect detection processes. However, challenges like limited dataset availability, computational complexity, and domain-specific nuances still exist, necessitating further research and adaptation. Nonetheless, the practical implications of these CNN-based solutions are significant, ranging from enhancing quality control and improving product functionality to reducing maintenance costs and optimizing manufacturing processes.

VII. CHALLENGES AND FUTURE SCOPE
This study was dedicated to providing a comprehensive examination of the historic and current landscape of CNNs, considering both algorithmic intricacies and deployment in various industrial defect detection systems. Through rigorous review, this work synthesizes research trends, identifies potential focal points, and outlines prospective directions for future investigations.

A. WIDESPREAD ADOPTION OF OBJECT DETECTION ARCHITECTURES
The surge in development of CNN-based object detection (OD) architectures stems from their inherent advantages over image classification, notably providing precise spatial localization of objects. This capability unlocks diverse applications, particularly in manufacturing, where OD enables integration with actuation mechanisms for enhanced efficiency. While CNNs offer robust feature learning, high accuracy, end-to-end training, and transfer learning capabilities for OD, they face challenges [369], [370], [371]. These include large labeled dataset requirements, computational complexity, real-time performance constraints, domain shift and generalization issues, and lack of interpretability. Future research should focus on developing efficient and lightweight architectures, exploring domain adaptation and few-shot learning techniques, improving interpretability, and seamlessly integrating CNN-based OD into industrial workflows.

B. PROGRESSION OF YOLO MODELS
Since the introduction of YOLO in 2015, this sector of OD has established itself as a dominant force within the CV landscape, continuing its contribution through the release of YOLOv8 in January 2023. This phenomenal success can be attributed to the unwavering focus of its creators on meticulously optimizing two crucial metrics for real-world applications: accuracy and computational efficiency, leading to superior inference speed. While the original author's decision to halt further development due to privacy concerns initially seemed like a setback [189], it has ironically fostered a vibrant research landscape. Driven by the vast potential applications of lightweight, real-time OD, numerous researchers and renowned research groups have actively engaged in rigorous architectural optimizations [372], [373], [374]. This competitive environment fuels continuous innovation, as research organizations vie for superiority in the OD arena, driven by the vast potential applications that hinge upon fast, lightweight detection capabilities.

C. NECESSITIES OF ADOPTING HARDWARE-BASED BENCHMARKS
For over a decade, the design of OD architectures has been primarily driven by the pursuit of high accuracy on benchmark datasets. These datasets often encompass vast amounts of data, featuring countless classes and millions of images, leading to the creation of computationally expensive and complex architectures. However, recent years have witnessed a paradigm shift within the field, with the focus transitioning from purely theoretical performance optimizations to practical, real-world implementations. This has led to the emergence of hardware-based benchmarks, which necessitate expanding the competition metrics beyond mere accuracy to encompass a broader spectrum of performance indicators like computational efficiency, resource allocation, and inference speed, particularly on resource-constrained devices such as FPGAs [53], [375], [376].
D. PRIVACY CONSIDERATIONS IN CV
As CV research transcends academic boundaries and ventures into the realm of real-world applications, a crucial shift is necessary in the architectural development stage. While the traditional focus on accuracy and lightweight computation remains vital, ensuring privacy-centric design [377] and fostering explainable models are equally paramount to ensure seamless integration into complex applications across diverse industries like manufacturing, healthcare, security, and renewable energy. Therefore, developers and researchers must conscientiously factor in security considerations [378], [379], [380] during the formulation of their design frameworks, ensuring a level of confidence for potential clients, particularly in sensitive domains such as healthcare [381].

E. DATA SCARCITY IN INDUSTRIAL RESEARCH
The industrial research landscape is often characterized by data scarcity, where acquiring large volumes of high-quality data for training machine learning models can be prohibitively expensive and time-consuming [382], [383]. This is particularly true in domains like product defect detection, where generating ample defect samples necessitates the production of potentially flawed products, incurring significant financial and ethical concerns. In such scenarios, traditional machine learning approaches, heavily reliant on abundant data, often struggle to achieve satisfactory performance. This is where few-shot learning and transfer learning [384] emerge as potential game-changers, offering powerful tools to navigate the challenges of data scarcity in industrial research.

F. UTILIZING LIGHTWEIGHT ARCHITECTURES FOR INDUSTRIAL APPLICATIONS
The integration of DL into industrial applications promises significant advancements, particularly in areas like quality inspection and industrial defect detection [43], [385], [386], [387]. However, these environments often face constraints on processing resources, making the deployment of computationally expensive models a challenge. This is where lightweight networks emerge as a game-changer, offering a compelling solution for effective and efficient DL implementation in industry. These models will offer a promising avenue for advancing the application of AI in industrial settings. By mitigating resource constraints and empowering cost-effective deployments, they pave the way for a future where AI seamlessly permeates the industrial landscape, driving automation, efficiency, and overall productivity.

G. DEVELOPMENT OF GENERALIZED DEFECT DETECTION MODEL
The implementation of a generalized defect detection model holds the potential to expedite its application across a broader spectrum of industrial environments, resulting in notable reductions in development costs and shorter development cycles. Notably, this approach involves maintaining the fundamental network structure of the DL model while only substituting the training samples. Such a straightforward and efficient approach will gain broader acceptance among the majority of industrial enterprises.

VIII. CONCLUSION
This comprehensive review has examined the vast potential and applications of CNNs for industrial defect detection systems. We commenced by establishing the foundational building blocks of CNNs, meticulously dissecting their architectural design, mathematical formulations, and tracing their evolutionary lineage stemming from ANNs. Pivotal advancements such as convolutional layers, pooling operations, backpropagation algorithms, and specialized activation functions were explored, elucidating how these innovations enabled CNNs to excel at intricate visual tasks. However, the review also highlighted the significant challenges and dependencies associated with CNNs, namely their reliance on massive labeled datasets and computationally powerful hardware infrastructure. While pre-trained models offer a viable solution to mitigate data scarcity concerns, they simultaneously introduce ethical quandaries surrounding bias, transparency, and accountability.

The review then embarked on a comprehensive exploration of the evolution of object detection techniques, a domain intimately intertwined with CNNs. This analysis encompassed not only the architectural innovations but also the burgeoning ecosystem of supporting tools, frameworks, datasets, and performance evaluation metrics. Furthermore, the pivotal role of hardware accelerators, including GPUs, FPGAs, and ASICs, in boosting computational performance was elucidated. Notably, recent trends in object detection research indicate a paradigm shift towards resource-efficient architectural designs, with deployment requirements becoming an increasingly crucial metric integrated within the design phase.

Across a diverse array of industrial domains, ranging from steel manufacturing and photovoltaic cell inspection to rail track maintenance and textile quality control, the review demonstrated the remarkable effectiveness of CNNs in robustly inspecting surfaces for defects and anomalies. Their ability to discern intricate patterns and subtle flaws positions CNNs as a powerful solution for automated quality inspection. Nevertheless, the review also underscored significant limitations that impede widespread adoption, including data scarcity challenges, interpretability issues, lack of adaptability, security vulnerabilities, and real-time performance constraints. While CNNs exhibit remarkable promise for industrial defect detection applications, the review emphasizes that they are not a universal panacea. Addressing key limitations surrounding data efficiency, model interpretability, security considerations, and inference speed will be crucial in realizing their full potential. Furthermore, the review highlights the merits of considering alternative approaches, such as geometric computer vision, classical machine learning techniques, and hybrid
methodologies, based on specific use case requirements and constraints.

In conclusion, this comprehensive review has achieved its primary objective by thoroughly exploring and answering all the outlined research questions pertaining to the applications of CNNs for industrial defect detection systems. While CNNs hold immense promise, they have not yet attained maturity. Continued research efforts to enhance their efficiency, robustness, interpretability, and real-time capabilities are imperative. Ultimately, a balanced and objective evaluation of CNNs' strengths, limitations, and viable alternatives will pave the way for optimal solutions in industrial quality inspection.

REFERENCES
[1] M. Haenlein and A. Kaplan, "A brief history of artificial intelligence: On the past, present, and future of artificial intelligence," California Manage. Rev., vol. 61, no. 4, pp. 5–14, Aug. 2019.
[2] R. R. Nadikattu, "The emerging role of artificial intelligence in modern society," Int. J. Creative Res. Thoughts, vol. 1, no. 1, pp. 1–20, 2016.
[3] M. Krichen, "Convolutional neural networks: A survey," Computers, vol. 12, no. 8, p. 151, Jul. 2023.
[4] V. C. Müller and N. Bostrom, "Future progress in artificial intelligence: A survey of expert opinion," Fundam. Issues Artif. Intell., vol. 1, pp. 555–572, Aug. 2016.
[5] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, "A survey of the recent architectures of deep convolutional neural networks," Artif. Intell. Rev., vol. 53, no. 8, pp. 5455–5516, Dec. 2020.
[6] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, "Deep learning for computer vision: A brief review," Comput. Intell. Neurosci., vol. 2018, pp. 1–13, Aug. 2018.
[7] W. Mcculloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bull. Math. Biol., vol. 52, nos. 1–2, pp. 99–115, 1990.
[8] E. Akleman, "Deep learning," Computer, vol. 53, no. 9, pp. 1–17, Sep. 2020.
[9] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[10] Y. Bengio, I. Goodfellow, and A. Courville, Deep Learning, vol. 1. Cambridge, MA, USA: MIT Press, 2017.
[11] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, "A survey of convolutional neural networks: Analysis, applications, and prospects," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 6999–7019, Dec. 2022.
[12] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, "Recent advances in convolutional neural networks," Pattern Recognit., vol. 77, pp. 354–377, May 2018.
[13] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Understanding of a convolutional neural network," in Proc. Int. Conf. Eng. Technol. (ICET), Aug. 2017, pp. 1–6.
[14] A. Fesseha, S. Xiong, E. D. Emiru, M. Diallo, and A. Dahou, "Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya," Information, vol. 12, no. 2, p. 52, Jan. 2021.
[15] A. Alani, "Arabic handwritten digit recognition based on restricted Boltzmann machine and convolutional neural networks," Information, vol. 8, no. 4, p. 142, Nov. 2017.
[16] W. Wang and J. Gang, "Application of convolutional neural network in natural language processing," in Proc. Int. Conf. Inf. Syst. Comput. Aided Educ. (ICISCAE), Jul. 2018, pp. 64–70.
[17] P. Li, J. Li, and G. Wang, "Application of convolutional neural network in natural language processing," in Proc. 15th Int. Comput. Conf. Wavelet Act. Media Technol. Inf. Process., Dec. 2018, pp. 120–122.
[18] M. Giménez, J. Palanca, and V. Botti, "Semantic-based padding in convolutional neural networks for improving the performance in natural language processing. A case of study in sentiment analysis," Neurocomputing, vol. 378, pp. 315–323, Feb. 2020.
[19] E. Grefenstette, P. Blunsom, N. de Freitas, and K. Moritz Hermann, "A deep architecture for semantic parsing," 2014, arXiv:1404.7296.
[20] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, "Learning semantic representations using convolutional neural networks for web search," in Proc. 23rd Int. Conf. World Wide Web, Apr. 2014, pp. 373–374.
[21] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," 2014, arXiv:1404.2188.
[22] Y. Kim, "Convolutional neural networks for sentence classification," 2014, arXiv:1408.5882.
[23] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 160–167.
[24] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013, arXiv:1301.3781.
[25] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011.
[26] T. Guo, J. Dong, H. Li, and Y. Gao, "Simple convolutional neural network on image classification," in Proc. IEEE 2nd Int. Conf. Big Data Anal. (ICBDA), Mar. 2017, pp. 721–724.
[27] N. Sharma, V. Jain, and A. Mishra, "An analysis of convolutional neural networks for image classification," Proc. Comput. Sci., vol. 132, pp. 377–384, Jul. 2018.
[28] W. Rawat and Z. Wang, "Deep convolutional neural networks for image classification: A comprehensive review," Neural Comput., vol. 29, no. 9, pp. 2352–2449, Sep. 2017.
[29] J. Naranjo-Torres, M. Mora, R. Hernández-García, R. J. Barrientos, C. Fredes, and A. Valenzuela, "A review of convolutional neural network applied to fruit image processing," Appl. Sci., vol. 10, no. 10, p. 3443, May 2020.
[30] J.-T. Huang, J. Li, and Y. Gong, "An analysis of convolutional neural networks for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 4989–4993.
[31] S. Dua, S. S. Kumar, Y. Albagory, R. Ramalingam, A. Dumka, R. Singh, M. Rashid, A. Gehlot, S. S. Alshamrani, and A. S. AlGhamdi, "Developing a speech recognition system for recognizing tonal speech signals using a convolutional neural network," Appl. Sci., vol. 12, no. 12, p. 6223, Jun. 2022.
[32] L. Trinh Van, T. Dao Thi Le, T. Le Xuan, and E. Castelli, "Emotional speech recognition using deep neural networks," Sensors, vol. 22, no. 4, p. 1414, Feb. 2022.
[33] M. Kubanek, J. Bobulski, and J. Kulawik, "A method of speech coding for speech recognition using a convolutional neural network," Symmetry, vol. 11, no. 9, p. 1185, Sep. 2019.
[34] S. Yin, C. Liu, Z. Zhang, Y. Lin, D. Wang, J. Tejedor, T. F. Zheng, and Y. Li, "Noisy training for deep neural networks in speech recognition," EURASIP J. Audio, Speech, Music Process., vol. 2015, no. 1, pp. 1–14, Dec. 2015.
[35] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, "Deep feature extraction and classification of hyperspectral images based on convolutional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 6232–6251, Oct. 2016.
[36] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2220–2233, Aug. 2017.
[37] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional neural networks," in Proc. 18th Euromicro Conf. Parallel, Distrib. Netw.-Based Process., Feb. 2010, pp. 317–324.
[38] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: A review," Neurocomputing, vol. 187, pp. 27–48, Apr. 2016.
[39] B. A. Aydin, M. Hussain, R. Hill, and H. Al-Aqrabi, "Domain modelling for a lightweight convolutional network focused on automated exudate detection in retinal fundus images," in Proc. 9th Int. Conf. Inf. Technol. Trends (ITT), May 2023, pp. 145–150.
[40] A. Zahid, M. Hussain, R. Hill, and H. Al-Aqrabi, "Lightweight convolutional network for automated photovoltaic defect detection," in Proc. 9th Int. Conf. Inf. Technol. Trends (ITT), May 2023, pp. 133–138.
[41] D. Animashaun and M. Hussain, "Automated micro-crack detection within photovoltaic manufacturing facility via ground modelling for a regularized convolutional network," Sensors, vol. 23, no. 13, p. 6235, Jul. 2023.

VOLUME 12, 2024 94287


R. Khanam et al.: Comprehensive Review of Convolutional Neural Networks for Defect Detection

[42] S. Qi, J. Yang, and Z. Zhong, ‘‘A review on industrial surface defect [63] H. J. Jie and P. Wanda, ‘‘RunPool: A dynamic pooling layer for
detection based on deep learning technology,’’ in Proc. 3rd Int. Conf. convolution neural network,’’ Int. J. Comput. Intell. Syst., vol. 13, no. 1,
Mach. Learn. Mach. Intell., Sep. 2020, pp. 24–30. p. 66, 2020.
[43] E. Cumbajin, N. Rodrigues, P. Costa, R. Miragaia, L. Frazão, N. Costa, [64] F. Yan, L. Liu, X. Ding, Q. Zhang, and Y. Liu, ‘‘Monocular catadioptric
A. Fernández-Caballero, J. Carneiro, L. H. Buruberri, and A. Pereira, ‘‘A panoramic depth estimation via improved end-to-end neural network
systematic review on deep learning with CNNs applied to surface defect model,’’ Frontiers Neurorobotics, vol. 17, Sep. 2023, Art. no. 1278986.
detection,’’ J. Imag., vol. 9, no. 10, p. 193, Sep. 2023. [65] A. T. Kabakus, ‘‘DroidMalwareDetector: A novel Android malware
[44] X. Wen, J. Shan, Y. He, and K. Song, ‘‘Steel surface defect recognition: detection framework based on convolutional neural network,’’ Expert
A survey,’’ Coatings, vol. 13, no. 1, p. 17, Dec. 2022. Syst. Appl., vol. 206, Nov. 2022, Art. no. 117833.
[45] L. Kou, ‘‘A review of research on detection and evaluation of the [66] W. Ouyang, B. Xu, J. Hou, and X. Yuan, ‘‘Fabric defect detection using
rail surface defects,’’ Acta Polytechnica Hungarica, vol. 19, no. 3, activation layer embedded convolutional neural network,’’ IEEE Access,
pp. 167–186, 2022. vol. 7, pp. 70130–70140, 2019.
[46] B. Li, C. Delpha, D. Diallo, and A. Migan-Dubois, ‘‘Application of [67] I. D. Khan, O. Farooq, and Y. U. Khan, ‘‘Automatic seizure detection
artificial neural networks to photovoltaic fault detection and diagno- using modified CNN architecture and activation layer,’’ J. Phys., Conf.
sis: A review,’’ Renew. Sustain. Energy Rev., vol. 138, Mar. 2021, Ser., vol. 2318, no. 1, Aug. 2022, Art. no. 012013.
Art. no. 110512. [68] J.-Y. Gan, Y.-K. Zhai, Y. Huang, J.-Y. Zeng, and K.-Y. Jiang, ‘‘Research
[47] G. M. El-Banby, N. M. Moawad, B. A. Abouzalm, W. F. Abouzaid, of facial beauty prediction based on deep convolutional features using
and E. A. Ramadan, ‘‘Photovoltaic system fault detection techniques: double activation layer,’’ Acta Electonica Sinica, vol. 47, no. 3, p. 636,
A review,’’ Neural Comput. Appl., vol. 35, no. 35, pp. 24829–24842, 2019.
Dec. 2023. [69] H. Nakahara, T. Fujii, and S. Sato, ‘‘A fully connected layer elimination
[48] U. Hijjawi, S. Lakshminarayana, T. Xu, G. P. M. Fierro, and M. Rahman, for a binarizec convolutional neural network on an FPGA,’’ in Proc. 27th
‘‘A review of automated solar photovoltaic defect detection systems: Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2017, pp. 1–4.
Approaches, challenges, and future orientations,’’ Sol. Energy, vol. 266, [70] C. Yang, Z. Yang, J. Hou, and Y. Su, ‘‘A lightweight full homomorphic
Dec. 2023, Art. no. 112186. encryption scheme on fully-connected layer for CNN hardware accelera-
[49] A. Rasheed, B. Zafar, A. Rasheed, N. Ali, M. Sajid, S. H. Dar, U. Habib, tor achieving security inference,’’ in Proc. 28th IEEE Int. Conf. Electron.,
T. Shehryar, and M. T. Mahmood, ‘‘Fabric defect detection using Circuits, Syst. (ICECS), Nov. 2021, pp. 1–4.
computer vision techniques: A comprehensive review,’’ Math. Problems [71] D. Ramachandran, R. S. Kumar, A. Alkhayyat, R. Q. Malik, P. Srinivasan,
Eng., vol. 2020, pp. 1–24, Nov. 2020. G. G. Priya, and A. Gosu Adigo, ‘‘Classification of electrocardiography
[50] C. Li, J. Li, Y. Li, L. He, X. Fu, and J. Chen, ‘‘Fabric defect detection in hybrid convolutional neural network-long short term memory with
textile manufacturing: A survey of the state of the art,’’ Secur. Commun. fully connected layer,’’ Comput. Intell. Neurosci., vol. 2022, pp. 1–10,
Netw., vol. 2021, pp. 1–13, May 2021. Jul. 2022.
[51] Y. Kahraman and A. Durmugoglu, ‘‘Deep learning-based fabric defect [72] T. Zheng, Q. Wang, Y. Shen, and X. Lin, ‘‘Gradient rectified parameter
detection: A review,’’ Textile Res. J., vol. 93, nos. 5–6, pp. 1485–1503, unit of the fully connected layer in convolutional neural networks,’’
Mar. 2023. Knowl.-Based Syst., vol. 248, Jul. 2022, Art. no. 108797.
[52] W. Ming, C. Cao, G. Zhang, H. Zhang, F. Zhang, Z. Jiang, and J. Yuan, [73] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto,
‘‘Review: Application of convolutional neural network in defect detection C. Yu, W. J. Fabian, M. Espi, T. Higuchi, S. Araki, and T. Nakatani,
of 3C products,’’ IEEE Access, vol. 9, pp. 135657–135674, 2021. ‘‘The NTTT Chime-3 system: Advances in speech enhancement and
[53] D. Ghimire, D. Kil, and S.-H. Kim, ‘‘A survey on efficient convolutional recognition for mobile multi-microphone devices,’’ in Proc. IEEE
neural networks and hardware acceleration,’’ Electronics, vol. 11, no. 6, Workshop Autom. Speech Recognit. Understand., Jun. 2015, pp. 436–443.
p. 945, Mar. 2022. [74] Z. Liao and G. Carneiro, ‘‘On the importance of normalisation layers
[54] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and in deep learning with piecewise linear activation units,’’ in Proc. IEEE
M. Martina, ‘‘An updated survey of efficient hardware architectures Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2016, pp. 1–8.
for accelerating deep convolutional neural networks,’’ Future Internet, [75] S.-H. Wang, J. Hong, and M. Yang, ‘‘Sensorineural hearing loss
vol. 12, no. 7, p. 113, Jul. 2020. identification via nine-layer convolutional neural network with
[55] L. Shen, Z. Lin, and Q. Huang, ‘‘Relay backpropagation for effective batch normalization and dropout,’’ Multimedia Tools Appl., vol. 79,
learning of deep convolutional neural networks,’’ in Proc. Eur. Conf. nos. 21–22, pp. 15135–15150, Jun. 2020.
Comput. Vis. (ECCV), 2016, pp. 467–482. [76] T. Sledevic, ‘‘Adaptation of convolution and batch normalization layer for
[56] E. M. Dogo, O. J. Afolabi, N. I. Nwulu, B. Twala, and C. O. Aigbavboa, CNN implementation on FPGA,’’ in Proc. Open Conf. Electr., Electron.
‘‘A comparative analysis of gradient descent-based optimization algo- Inf. Sci., Apr. 2019, pp. 1–4.
rithms on convolutional neural networks,’’ in Proc. Int. Conf. Comput. [77] S. Ioffe and C. Szegedy, ‘‘Batch normalization: Accelerating deep
Techn., Electron. Mech. Syst., Dec. 2018, pp. 92–99. network training by reducing internal covariate shift,’’ in Proc. Int. Conf.
[57] Y. Ren and X. Cheng, ‘‘Review of convolutional neural network Mach. Learn., 2015, pp. 448–456.
optimization and training in image processing,’’ in Proc. 10th Int. Symp. [78] C. Garbin, X. Zhu, and O. Marques, ‘‘Dropout vs. Batch normalization:
Precis. Eng. Meas. Instrum., Mar. 2019, pp. 788–797. An empirical study of their impact to deep learning,’’ Multimedia Tools
[58] G. Habib and S. Qureshi, ‘‘Optimization and acceleration of convolu- Appl., vol. 79, nos. 19–20, pp. 12777–12815, May 2020.
tional neural networks: A survey,’’ J. King Saud Univ. Comput. Inf. Sci., [79] G. Li, X. Jian, Z. Wen, and J. AlSultan, ‘‘Algorithm of overfitting
vol. 34, no. 7, pp. 4244–4268, Jul. 2022. avoidance in CNN based on maximum pooled and weight decay,’’ Appl.
[59] J. A. Pandian, K. Kanchanadevi, V. D. Kumar, E. Jasinska, R. Gono, Math. Nonlinear Sci., vol. 7, no. 2, pp. 965–974, Jul. 2022.
Z. Leonowicz, and M. Jasinski, ‘‘A five convolutional layer deep con- [80] P. Dileep, D. Das, and P. K. Bora, ‘‘Dense layer dropout based CNN
volutional neural network for plant leaf disease detection,’’ Electronics, architecture for automatic modulation classification,’’ in Proc. Nat. Conf.
vol. 11, no. 8, p. 1266, Apr. 2022. Commun. (NCC), Feb. 2020, pp. 1–5.
[60] T.-C. Lu, ‘‘CNN convolutional layer optimisation based on quantum [81] W. Setiawan, ‘‘Character recognition using adjustment convolutional
evolutionary algorithm,’’ Connection Sci., vol. 33, no. 3, pp. 482–494, network with dropout layer,’’ IOP Conf. Ser., Mater. Sci. Eng., vol. 1125,
Jul. 2021. May 2021, Art. no. 012049.
[61] J. Gunther, P. M. Pilarski, G. Helfrich, H. Shen, and K. Diepold, [82] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and
‘‘First steps towards an intelligent laser welding architecture using deep R. R. Salakhutdinov, ‘‘Improving neural networks by preventing co-
neural networks and reinforcement learning,’’ Proc. Technol., vol. 15, adaptation of feature detectors,’’ 2012, arXiv:1207.0580.
pp. 474–483, Jul. 2014. [83] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
[62] N.-I. Galanis, P. Vafiadis, K.-G. Mirzaev, and G. A. Papakostas, R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks
‘‘Convolutional neural networks: A roundup and benchmark of their from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958,
pooling layer variants,’’ Algorithms, vol. 15, no. 11, p. 391, Oct. 2022. 2014.

94288 VOLUME 12, 2024


R. Khanam et al.: Comprehensive Review of Convolutional Neural Networks for Defect Detection

[84] P. Jiang, Y. Xue, and F. Neri, ‘‘Convolutional neural network pruning [106] A. Wiranata, S. A. Wibowo, R. Patmasari, R. Rahmania, and
based on multi-objective feature map selection for image classification,’’ R. Mayasari, ‘‘Investigation of padding schemes for faster R-CNN on
Appl. Soft Comput., vol. 139, May 2023, Art. no. 110229. vehicle detection,’’ in Proc. Int. Conf. Control, Electron., Renew. Energy
[85] J. Kim and J. Cho, ‘‘Low-cost embedded system using convolutional Commun., Dec. 2018, pp. 208–212.
neural networks-based spatiotemporal feature map for real-time human [107] C. Yang, Y. Wang, X. Wang, and L. Geng, ‘‘A stride-based convolution
action recognition,’’ Appl. Sci., vol. 11, no. 11, p. 4940, May 2021. decomposition method to stretch CNN acceleration algorithms for
[86] D. U. Jeong and K. M. Lim, ‘‘Convolutional neural network for efficient and flexible hardware implementation,’’ IEEE Trans. Circuits
classification of eight types of arrhythmia using 2D time–frequency Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3007–3020, Sep. 2020.
feature map from standard 12-lead electrocardiogram,’’ Sci. Rep., vol. 11, [108] H. Naseri and V. Mehrdad, ‘‘Novel CNN with investigation on accuracy
no. 1, Aug. 2021, Art. no. 20396. by modifying stride, padding, kernel size and filter numbers,’’ Multimedia
[87] W. Lu, H. Sun, J. Chu, X. Huang, and J. Yu, ‘‘A novel approach for video Tools Appl., vol. 82, no. 15, pp. 23673–23691, Jun. 2023.
text detection and recognition based on a corner response feature map [109] S. Tummalapalli, L. Kumar, and N. L. B. Murthy, ‘‘Web service anti-
and transferred deep convolutional neural network,’’ IEEE Access, vol. 6, patterns detection using CNN with varying sequence padding size,’’ in
pp. 40198–40211, 2018. Proc. 12th Ind. Symp. Conjunct, 2022, pp. 153–165.
[88] F. Wang, C. Yang, S. Huang, and H. Wang, ‘‘Automatic modulation [110] C. Guo, Y.-L. Liu, and X. Jiao, ‘‘Study on the influence of variable stride
classification based on joint feature map and convolutional neural scale change on image recognition in CNN,’’ Multimedia Tools Appl.,
network,’’ IET Radar, Sonar Navigat., vol. 13, no. 6, pp. 998–1003, vol. 78, no. 21, pp. 30027–30037, Nov. 2019.
Jun. 2019. [111] J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, and W. Lin, ‘‘ThiNet:
[89] J. Zou, T. Rui, Y. Zhou, C. Yang, and S. Zhang, ‘‘Convolutional neural Pruning CNN filters for a thinner net,’’ IEEE Trans. Pattern Anal. Mach.
network simplification via feature map pruning,’’ Comput. Electr. Eng., Intell., vol. 41, no. 10, pp. 2525–2538, Oct. 2019.
vol. 70, pp. 950–958, Aug. 2018. [112] A. Barroso-Laguna and K. Mikolajczyk, ‘‘Key.Net: Keypoint detection
[90] C. K. Dewa, ‘‘Suitable CNN weight initialization and activation function by handcrafted and learned CNN filters revisited,’’ IEEE Trans. Pattern
for Javanese vowels classification,’’ Proc. Comput. Sci., vol. 144, Anal. Mach. Intell., vol. 45, no. 1, pp. 698–711, Jan. 2023.
pp. 124–132, Jun. 2018. [113] W. S. Ahmed and A. A. A. Karim, ‘‘The impact of filter size and number
[91] W. Hao, W. Yizhou, L. Yaqin, and S. Zhili, ‘‘The role of activation of filters on classification accuracy in CNN,’’ in Proc. Int. Conf. Comput.
function in CNN,’’ in Proc. 2nd Int. Conf. Inf. Technol. Comput. Appl. Sci. Softw. Eng. (CSASE), Apr. 2020, pp. 88–93.
(ITCA), Dec. 2020, pp. 429–432. [114] F. Rosenblatt, ‘‘The perceptron: A probabilistic model for information
[92] A. Mondal and V. K. Shrivastava, ‘‘A novel parametric Flatten-p storage and organization in the brain,’’ Psychol. Rev., vol. 65, no. 6,
mish activation function based deep CNN model for brain tumor pp. 386–408, 1958.
classification,’’ Comput. Biol. Med., vol. 150, Nov. 2022, Art. no. 106183. [115] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory
of Brain Mechanisms. Washington, DC, USA: Spartan books, 1962.
[93] B. Khagi and G.-R. Kwon, ‘‘A novel scaled-gamma-tanh (SGT) activation
function in 3D CNN applied for MRI classification,’’ Sci. Rep., vol. 12, [116] B. Widrow and M. E. Hoff, ‘‘Adaptive switching circuits,’’ in IRE
no. 1, p. 14978, Sep. 2022. WESCON Convention Record, vol. 4. New York, NY, USA, 1960,
pp. 96–104.
[94] T. Jannat, Md. A. Hossain, and A. Sayeed, ‘‘An effective approach
[117] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives
for hyperspectral image classification based on 3D CNN with mish
to Neural Networks and Political Forecasting. Hoboken, NJ, USA: Wiley,
activation function,’’ in Proc. 25th Int. Conf. Comput. Inf. Technol.
1994.
(ICCIT), Dec. 2022, pp. 1074–1079.
[118] D. H. Hubel and T. N. Wiesel, ‘‘Receptive fields, binocular interaction and
[95] Y. Jiang, J. Xie, and D. Zhang, ‘‘An adaptive offset activation function for
functional architecture in the cat’s visual cortex,’’ J. Physiol., vol. 160,
CNN image classification tasks,’’ Electronics, vol. 11, no. 22, p. 3799,
no. 1, pp. 106–154, Jan. 1962.
Nov. 2022.
[119] K. Fukushima, ‘‘Neocognitron: A self-organizing neural network model
[96] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. M´’uller, ‘‘Efficient BackProp,’’
for a mechanism of pattern recognition unaffected by shift in position,’’
in Neural Networks: Tricks of the Trade. Cham, Switzerland: Springer,
Biol. Cybern., vol. 36, no. 4, pp. 193–202, Apr. 1980.
1998, pp. 9–50.
[120] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based
[97] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, ‘‘End-to-end text recognition learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11,
with convolutional neural networks,’’ in Proc. 21st Int. Conf. Pattern pp. 2278–2324, 1998.
Recognit. (ICPR), Nov. 2012, pp. 3304–3308.
[121] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard,
[98] B. Xu, N. Wang, T. Chen, and M. Li, ‘‘Empirical evaluation of rectified and L. Jackel, ‘‘Handwritten digit recognition with a back-propagation
activations in convolutional network,’’ 2015, arXiv:1505.00853. network,’’ in Proc. Adv. Neural Inf. Process. Syst., 1989, pp. 1–18.
[99] P. Ramachandran, B. Zoph, and Q. V. Le, ‘‘Searching for activation [122] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, ‘‘Greedy layer-
functions,’’ 2017, arXiv:1710.05941. wise training of deep networks,’’ in Proc. Adv. Neural Inf. Process. Syst.,
[100] S. Hochreiter, ‘‘The vanishing gradient problem during learning recurrent 2006, pp. 1–16.
neural nets and problem solutions,’’ Int. J. Uncertainty, Fuzziness Knowl.- [123] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘Imagenet classification
Based Syst., vol. 6, no. 2, pp. 107–116, Apr. 1998. with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf.
[101] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, ‘‘Activation Process. Syst., 2012, pp. 1–14.
functions: Comparison of trends in practice and research for deep [124] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification
learning,’’ 2018, arXiv:1811.03378. with deep convolutional neural networks,’’ Commun. ACM, vol. 60, no. 6,
[102] D. Dang, S. V. R. Chittamuru, S. Pasricha, R. Mahapatra, and D. Sahoo, pp. 84–90, May 2017.
‘‘BPLight-CNN: A photonics-based backpropagation accelerator for [125] V. Nair and G. E. Hinton, ‘‘Rectified linear units improve restricted
deep learning,’’ ACM J. Emerg. Technol. Comput. Syst., vol. 17, no. 4, Boltzmann machines,’’ in Proc. 27th Int. Conf. Mach. Learn., 2010,
pp. 1–26, Oct. 2021. pp. 807–814.
[103] A. Mazouz and C. P. Bridges, ‘‘Automated CNN back-propagation [126] M. D. Zeiler and R. Fergus, ‘‘Visualizing and understanding convolu-
pipeline generation for FPGA online training,’’ J. Real-Time Image tional networks,’’ in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
Process., vol. 18, no. 6, pp. 2583–2599, Dec. 2021. [127] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
[104] V. Raj, R. Kumar, and N. S. Kumar, ‘‘An scrupulous framework to V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’
forecast the weather using CNN with back propagation method,’’ in in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015,
Proc. 4th Int. Conf. Adv. Comput., Commun. Control Netw., Dec. 2022, pp. 1–9.
pp. 177–181. [128] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van
[105] Y. Jaramillo-Munera, Lina M. Sepulveda-Cano, A. E. Castro-Ospina, den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
L. Duque-Muñoz, and J. D. Martinez-Vargas, ‘‘Classification of epileptic M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,
seizures based on CNN and guided back-propagation for interpretation I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and
analysis,’’ in Proc. Int. Conf. Smart Technol., Syst. Appl., 2022, D. Hassabis, ‘‘Mastering the game of go with deep neural networks and
pp. 212–226. tree search,’’ Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.

VOLUME 12, 2024 94289


R. Khanam et al.: Comprehensive Review of Convolutional Neural Networks for Defect Detection

[129] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, ‘‘A survey on deep learning for [152] T. Liu, L. Zhang, Y. Wang, J. Guan, Y. Fu, J. Zhao, and S. Zhou, ‘‘Recent
big data,’’ Inf. Fusion, vol. 42, pp. 146–157, Jul. 2018. few-shot object detection algorithms: A survey with performance
[130] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for comparison,’’ ACM Trans. Intell. Syst. Technol., vol. 14, no. 4, pp. 1–36,
large-scale image recognition,’’ 2014, arXiv:1409.1556. Aug. 2023.
[131] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet: [153] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and
A large-scale hierarchical image database,’’ in Proc. IEEE Conf. Comput. A. Zisserman, ‘‘The Pascal visual object classes (VOC) challenge,’’ Int.
Vis. Pattern Recognit., Jun. 2009, pp. 248–255. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[132] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for [154] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei,
(CVPR), Jun. 2016, pp. 770–778. ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis.,
[133] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, vol. 115, no. 3, pp. 211–252, Dec. 2015.
K. Modi, and H. Ghayvat, ‘‘CNN variants for computer vision: History, [155] L. Nanni, S. Ghidoni, and S. Brahnam, ‘‘Handcrafted vs. non-handcrafted
architecture, application, challenges and future scope,’’ Electronics, features for computer vision classification,’’ Pattern Recognit., vol. 71,
vol. 10, no. 20, p. 2470, Oct. 2021. pp. 158–172, Nov. 2017.
[134] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ‘‘Densely [156] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time
connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis. object detection with region proposal networks,’’ in Proc. Adv. Neural Inf.
Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269. Process. Syst., 2015, pp. 1–26.
[135] C. Zhang, P. Benz, D. M. Argaw, S. Lee, J. Kim, F. Rameau, J.-C. Bazin, [157] A. Karpathy and L. Fei-Fei, ‘‘Deep visual-semantic alignments for
and I. S. Kweon, ‘‘ResNet or DenseNet? Introducing dense shortcuts generating image descriptions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
to ResNet,’’ in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Recognit. (CVPR), Jun. 2015, pp. 3128–3137.
Jan. 2021, pp. 3549–3558. [158] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel,
[136] C. Szegedy, S. Loffe, V. Vanhoucke, and A. Alemi, ‘‘Inception-v4, and Y. Bengio, ‘‘Show, attend and tell: Neural image caption generation
inception-resnet and the impact of residual connections on learning,’’ in with visual attention,’’ in Proc. 32nd Int. Conf. Mach. Learn. (ICML),
Proc. AAAI Conf. Artif. Intell., vol. 31, 2017, pp. 1–50. vol. 37, 2015, pp. 2048–2057.
[137] S. Zagoruyko and N. Komodakis, ‘‘Wide residual networks,’’ 2016, [159] Q. Wu, C. Shen, P. Wang, A. Dick, and A. v. d. Hengel, ‘‘Image
arXiv:1605.07146. captioning and visual question answering based on attributes and external
[138] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, ‘‘Aggregated residual knowledge,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6,
transformations for deep neural networks,’’ in Proc. IEEE Conf. Comput. pp. 1367–1381, Jun. 2018.
Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5987–5995. [160] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, ‘‘Simultaneous
[139] X. Zhang, Z. Li, C. C. Loy, and D. Lin, ‘‘PolyNet: A pursuit of structural detection and segmentation,’’ in Proc. Eur. Conf. Comput. Vis., 2014,
diversity in very deep networks,’’ in Proc. IEEE Conf. Comput. Vis. pp. 297–312.
Pattern Recognit. (CVPR), Jul. 2017, pp. 3900–3908. [161] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, ‘‘Hypercolumns for
[140] D. Han, J. Kim, and J. Kim, ‘‘Deep pyramidal residual networks,’’ in object segmentation and fine-grained localization,’’ in Proc. IEEE Conf.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 447–456.
pp. 6307–6315. [162] J. Dai, K. He, and J. Sun, ‘‘Instance-aware semantic segmentation via
[141] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, multi-task network cascades,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
and X. Tang, ‘‘Residual attention network for image classification,’’ in Recognit. (CVPR), Jun. 2016, pp. 3150–3158.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, [163] J. Fourie, S. Mills, and R. Green, ‘‘Harmony filter: A robust visual
pp. 6450–6458. tracking system using the improved harmony search algorithm,’’ Image
[142] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, ‘‘CBAM: Convolutional block Vis. Comput., vol. 28, no. 12, pp. 1702–1716, Dec. 2010.
attention module,’’ in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 3–19. [164] E. Cuevas, N. Ortega-Sánchez, D. Zaldivar, and M. Pérez-Cisneros,
[143] A. Khan, A. Sohail, and A. Ali, ‘‘A new channel boosted convolutional ‘‘Circle detection by harmony search optimization,’’ J. Intell. Robotic
neural network using transfer learning,’’ 2018, arXiv:1804.08528. Syst., vol. 66, no. 3, pp. 359–376, May 2012.
[144] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, [165] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang,
M. Andreetto, and H. Adam, ‘‘MobileNets: Efficient convolutional neural Z. Wang, R. Wang, X. Wang, and W. Ouyang, ‘‘T-CNN: Tubelets with
networks for mobile vision applications,’’ 2017, arXiv:1704.04861. convolutional neural networks for object detection from videos,’’ IEEE
[145] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, Trans. Circuits Syst. Video Technol., vol. 28, no. 10, pp. 2896–2907,
‘‘MobileNetV2: Inverted residuals and linear bottlenecks,’’ in Oct. 2018.
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, [166] P. Viola and M. Jones, ‘‘Rapid object detection using a boosted cascade of
pp. 4510–4520. simple features,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
[146] M. Tan and Q. Le, ‘‘EfficientNet: Rethinking model scaling for Recognit. CVPR, Jul. 2001, pp. 1–68.
convolutional neural networks,’’ in Proc. Int. Conf. Mach. Learn., 2019, [167] P. Viola and M. J. Jones, ‘‘Robust real-time face detection,’’ Int.
pp. 6105–6114. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, May 2004.
[147] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, [168] N. Dalal and B. Triggs, ‘‘Histograms of oriented gradients for human
C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, ‘‘A survey on vision detection,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
transformer,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, Recognit., Jul. 2005, pp. 886–893.
pp. 87–110, Jan. 2023. [169] D. G. Lowe, ‘‘Object recognition from local scale-invariant features,’’ in
[148] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and Proc. 7th IEEE Int. Conf. Comput. Vis., Jan. 1999, pp. 1150–1157.
B. Guo, ‘‘Swin transformer: Hierarchical vision transformer using shifted [170] D. G. Lowe, ‘‘Distinctive image features from scale-invariant keypoints,’’
windows,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
pp. 9992–10002. [171] S. Belongie, J. Malik, and J. Puzicha, ‘‘Shape matching and object
[149] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and recognition using shape contexts,’’ IEEE Trans. Pattern Anal. Mach.
S. Xie, ‘‘ConvNeXt v2: Co-designing and scaling ConvNets with masked Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
autoencoders,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. [172] P. Felzenszwalb, D. McAllester, and D. Ramanan, ‘‘A discriminatively
(CVPR), Jun. 2023, pp. 16133–16142. trained, multiscale, deformable part model,’’ in Proc. IEEE Conf. Comput.
[150] M. Hussain, H. Al-Aqrabi, M. Munawar, and R. Hill, ‘‘Feature Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
mapping for Rice leaf defect detection based on a custom convolutional [173] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, ‘‘Cascade object
architecture,’’ Foods, vol. 11, no. 23, p. 3914, Dec. 2022. detection with deformable part models,’’ in Proc. IEEE Comput. Soc.
[151] M. Hussain and H. Al-Aqrabi, ‘‘Child emotion recognition via custom Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2241–2248.
lightweight CNN architecture,’’ in Kids Cybersecurity Using Compu- [174] T. Malisiewicz, A. Gupta, and A. A. Efros, ‘‘Ensemble of exemplar-
tational Intelligence Techniques. Cham, Switzerland: Springer, 2023, SVMs for object detection and beyond,’’ in Proc. Int. Conf. Comput. Vis.,
pp. 165–174. Nov. 2011, pp. 89–96.

94290 VOLUME 12, 2024


R. Khanam et al.: Comprehensive Review of Convolutional Neural Networks for Defect Detection

[175] Y.-F. Li, J. T. Kwok, I. W. Tsang, and Z.-H. Zhou, ‘‘A convex method [202] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, ‘‘YOLOv7: Trainable
for locating regions of interest with multi-instance learning,’’ in Proc. bag-of-freebies sets new state-of-the-art for real-time object detectors,’’
Mach. Learn. Knowl. Discovery Databases, Eur. Conf., 2009, pp. 15–30. in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
[176] D. Forsyth, ‘‘Object detection with discriminatively trained part-based Jun. 2023, pp. 7464–7475.
models,’’ Computer, vol. 47, no. 2, pp. 6–7, Feb. 2014. [203] (2023). Ultralytics Github Repository. Accessed: Jan. 16, 2024. [Online].
[177] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature Available: https://github.com/ultralytics/ultralytics
hierarchies for accurate object detection and semantic segmentation,’’ in [204] F. J. Solawetz. (2023). What is YOLOv8? The Ultimate Guide. Accessed:
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587. Jan. 16, 2024. [Online]. Available: https://blog.roboflow.com/whats-new-
[178] R. Girshick, P. Felzenszwalb, and D. McAllester, ‘‘Object detection in-yolov8/
with grammar models,’’ in Proc. Adv. Neural Inf. Process. Syst., 2011, [205] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and
pp. 1–47. A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Proc. 14th Eur.
[179] R. B. Girshick, From Rigid Templates to Grammars: Object Detection Conf., Oct. 2016, pp. 21–37.
With Structured Models. Chicago, IL, USA: University of Chicago, 2012. [206] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, ‘‘Focal loss for
[180] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Region-based dense object detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
convolutional networks for accurate object detection and segmentation,’’ Oct. 2017, pp. 2999–3007.
IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142–158, [207] B. Wu, A. Wan, F. Iandola, P. H. Jin, and K. Keutzer, ‘‘SqueezeDet:
Jan. 2016. Unified, small, low power fully convolutional neural networks for
[181] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and real-time object detection for autonomous driving,’’ in Proc. IEEE
A. W. M. Smeulders, ‘‘Selective search for object recognition,’’ Int. Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017,
J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Sep. 2013. pp. 446–454.
[182] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. (2012). Discrimi- [208] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally,
natively Trained Deformable Part Models. Accessed: Jan. 25, 2024. and K. Keutzer, ‘‘SqueezeNet: AlexNet-level accuracy with 50x fewer
[183] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Spatial pyramid pooling in parameters and <0.5MB model size,’’ 2016, arXiv:1602.07360.
deep convolutional networks for visual recognition,’’ IEEE Trans. Pattern [209] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu,
Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015. M. Pelillo, and L. Zhang, ‘‘DOTA: A large-scale dataset for object
[184] R. Girshick, ‘‘Fast R-CNN,’’ 2015, arXiv:1504.08083. detection in aerial images,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
[185] J. Dai, Yi Li, K. He, and J. Sun, ‘‘R-FCN: Object detection via region- Pattern Recognit., Jun. 2018, pp. 3974–3983.
based fully convolutional networks,’’ in Proc. Adv. Neural Inf. Process. [210] H. Law and J. Deng, ‘‘CornerNet: Detecting objects as paired keypoints,’’
Syst., 2016, pp. 1–48. in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 734–750.
[186] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, ‘‘Light-head R- [211] X. Zhou, D. Wang, and P. Krähenbuhl, ‘‘Objects as points,’’ 2019,
CNN: In defense of two-stage object detector,’’ 2017, arXiv:1711.07264. arXiv:1904.07850.
[187] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. [212] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou,
Belongie, ‘‘Feature pyramid networks for object detection,’’ in Proc. B. Yang, Z. Wang, H. Zhou, and X. Wang, ‘‘Crafting GBD-net for
IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, object detection,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 9,
pp. 936–944. pp. 2109–2123, Sep. 2018.
[188] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ‘‘Mask R-CNN,’’ in Proc. [213] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988. S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ in Proc.
[189] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213–229.
Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis. [214] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘‘Deformable
Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788. DETR: Deformable transformers for end-to-end object detection,’’ 2020,
[190] J. Redmon. (2013). Darknet: Open Source Neural Networks in arXiv:2010.04159.
C. Accessed: Jan. 16, 2024. [Online]. Available: https://pjreddie. [215] (2024). Theano. [Online]. Available: http://deeplearning.net/software/
com/darknet theano/
[191] J. Redmon and A. Farhadi, ‘‘YOLO9000: Better, faster, stronger,’’ [216] (2024). Berkeley Vision and Learning Center. [Online]. Available:
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, http://caffe.berkeleyvision.org/
pp. 6517–6525. [217] (2024). Deeplearning4j. [Online]. Available: https://deeplearning4j.org
[192] J. Redmon and A. Farhadi, ‘‘YOLOv3: An incremental improvement,’’ [218] (2024). TensorFlow. [Online]. Available: https://www.tensorflow.org/
2018, arXiv:1804.02767. [219] (2024). Keras. [Online]. Available: https://keras.io/
[193] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, [220] (2024). Chainer. [Online]. Available: https://chainer.org
P. Dollar, and C. L. Zitnick, ‘‘Microsoft COCO: Common [221] (2024). Apache. [Online]. Available: http://singa.apache.org
objects in context,’’ in Proc. Eur. Conf. Comput. Vis., 2014, [222] (2024). MXnet. [Online]. Available: http://mxnet.io/
pp. 740–755. [223] (2024). Microsoft Cognitive Toolkit CNTK. [Online]. Available:
[194] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, ‘‘YOLOv4: Optimal https://www.microsoft.com/en-us/research/product/cognitive-toolkit/
speed and accuracy of object detection,’’ 2020, arXiv:2004.10934. [224] (2024). PyTorch. [Online]. Available: https://pytorch.org
[195] Z. Yao, Y. Cao, S. Zheng, G. Huang, and S. Lin, ‘‘Cross-iteration batch [225] (2024). Neon. [Online]. Available: http://neon.nerva-nasys.com/docs/
normalization,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. latest
(CVPR), Jun. 2021, pp. 12326–12335. [226] (2024). BigDL. [Online]. Available: https://github.com/intel-analytics/
[196] D. Misra, ‘‘Mish: A self regularized non-monotonic activation function,’’ BigDL
2019, arXiv:1908.08681. [227] (2024). Clarifai. [Online]. Available: https://clarifai.com/
[197] G. Jocher et al., ‘‘Ultralytics/YOLOv5: V3.0,’’ Aug. 2020. [228] (2024). Cloud Sight. [Online]. Available: https://cloudsight.readme.
[198] A. S. G. Jocher. (2021). Ultralytics/YOLOv5: V4.0—Activations, Weights io/v1.0/docs
& Biases Logging, PyTorch Hub Integration, 2021. [Online]. Available: [229] (2024). Microsoft Cognitive Service. [Online]. Available:
https://zenodo.org/records/4418161 https://www.microsoft.com/cognitive-services/en-us/computer-vision-
[199] M. Hussain, ‘‘YOLO-v1 to YOLO-v8, the rise of YOLO and its api
complementary nature toward digital manufacturing and industrial defect [230] (2024). Google Cloud Vision API. [Online]. Available:
detection,’’ Machines, vol. 11, no. 7, p. 677, Jun. 2023. https://cloud.google.com/vision/
[200] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, [231] (2024). IBM Watson Vision Recognition Service. [Online]. Available:
W. Nie, Y. Li, B. Zhang, Y. Liang, L. Zhou, X. Xu, X. Chu, X. Wei, http://www.ibm.com/watson/developercloud/visual-recognition.html
and X. Wei, ‘‘YOLOv6: A single-stage object detection framework for [232] (2024). Amazon Recognition. [Online]. Available:
industrial applications,’’ 2022, arXiv:2209.02976. https://aws.amazon.com/rekognition/
[201] S. Rath. (2022). YOLOv6 Object Detection Tutorial. Accessed: [233] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams,
Jan. 16, 2024. [Online]. Available: https://learnopencv.com/yolov6- J. Winn, and A. Zisserman, ‘‘The Pascal visual object classes challenge: A
object-detection/ retrospective,’’ Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, Jan. 2015.

VOLUME 12, 2024 94291


R. Khanam et al.: Comprehensive Review of Convolutional Neural Networks for Defect Detection

[234] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont- [256] R. Jin and Q. Niu, ‘‘Automatic fabric defect detection based on
Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and an improved YOLOv5,’’ Math. Problems Eng., vol. 2021, pp. 1–13,
V. Ferrari, ‘‘The open images dataset v4: Unified image classification, Sep. 2021.
object detection, and visual relationship detection at scale,’’ Int. J. [257] Raspberry Pi 4 Model B. Accessed: Jan. 10, 2024. [Online]. Available:
Comput. Vis., vol. 128, no. 7, pp. 1956–1981, Jul. 2020. https://thepihut.com/collections/raspberry-pi/products/raspberry-pi-4-
[235] R. Benenson, S. Popov, and V. Ferrari, ‘‘Large-scale interactive object model-b
segmentation with human annotators,’’ in Proc. IEEE/CVF Conf. Comput. [258] M. B. Mohamad Noor and W. H. Hassan, ‘‘Current research on Internet of
Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 11692–11701. Things (IoT) security: A survey,’’ Comput. Netw., vol. 148, pp. 283–294,
[236] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and Jan. 2019.
J. Sun, ‘‘Objects365: A large-scale, high-quality dataset for object [259] A. G. Frank, L. S. Dalenogare, and N. F. Ayala, ‘‘Industry 4.0
detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, technologies: Implementation patterns in manufacturing companies,’’ Int.
pp. 8429–8438. J. Prod. Econ., vol. 210, pp. 15–26, Apr. 2019.
[237] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, [260] U. Farooq, Z. Marrakchi, and H. Mehrez, ‘‘FPGA architectures: An
A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, and A. Veit, ‘‘OpenImages: overview,’’ in Tree-Based Heterogeneous FPGA Architectures: Appli-
A public dataset for large-scale multi-label and multi-class image cation Specific Exploration and Optimization. New York, NY, USA:
classification,’’ Dataset, vol. 2, no. 3, p. 18, 2017. Springer, 2012, pp. 7–48.
[261] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu,
[238] F. Ciaglia, F. Saverio Zuppichini, P. Guerrie, M. McQuade, and
S. Song, Y. Wang, and H. Yang, ‘‘Going deeper with embedded FPGA
J. Solawetz, ‘‘Roboflow 100: A rich, multi-domain object detection
platform for convolutional neural network,’’ in Proc. ACM/SIGDA Int.
benchmark,’’ 2022, arXiv:2211.13523.
Symp. Field-Programmable Gate Arrays, Feb. 2016, pp. 26–35.
[239] P. Dollar, C. Wojek, B. Schiele, and P. Perona, ‘‘Pedestrian detection:
[262] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. O. G. Hock,
A benchmark,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh,
Jun. 2009, pp. 304–311.
‘‘Can FPGAs beat GPUs in accelerating next-generation deep neural
[240] P. Dollar, C. Wojek, B. Schiele, and P. Perona, ‘‘Pedestrian detection: An networks?’’ in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate
evaluation of the state of the art,’’ IEEE Trans. Pattern Anal. Mach. Intell., Arrays, Feb. 2017, pp. 5–14.
vol. 34, no. 4, pp. 743–761, Apr. 2012. [263] Y. Liu, P. Liu, Y. Jiang, M. Yang, K. Wu, W. Wang, and Q. Yao, ‘‘Building
[241] K. Neshatpour, M. Malik, M. A. Ghodrat, A. Sasan, and H. Homayoun, a multi-FPGA-based emulation framework to support networks-on-chip
‘‘Energy-efficient acceleration of big data analytics applications using design and verification,’’ Int. J. Electron., vol. 97, no. 10, pp. 1241–1262,
FPGAs,’’ in Proc. IEEE Int. Conf. Big Data, Oct. 2015, pp. 115–123. Oct. 2010.
[242] V. Kontorinis, L. E. Zhang, B. Aksanli, J. Sampson, H. Homayoun, [264] P. Dondon, J. Carvalho, R. Gardere, P. Lahalle, G. Tsenov, and
E. Pettis, D. M. Tullsen, and T. S. Rosing, ‘‘Managing distributed ups V. Mladenov, ‘‘Implementation of a feed-forward artificial neural net-
energy for effective power capping in data centers,’’ ACM SIGARCH work in VHDL on FPGA,’’ in Proc. 12th Symp. Neural Netw. Appl. Electr.
Comput. Archit. News, vol. 40, no. 3, pp. 488–499, Sep. 2012. Eng. (NEUREL), Nov. 2014, pp. 37–40.
[243] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, ‘‘Toward dark [265] C. Ünsalan and B. Tar, Digital System Design With FPGA: Implementa-
silicon in servers,’’ IEEE Micro, vol. 31, no. 4, pp. 6–15, Jul. 2011. tion Using Verilog and VHDL. New York, NY, USA: McGraw-Hill, 2017.
[244] C. Yan and T. Yue, ‘‘A novel method for dynamic modelling and real- [266] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta,
time rendering based on GPU,’’ Geo-Inf. Sci., vol. 14, no. 2, pp. 149–157, and Z. Zhang, ‘‘Accelerating binarized convolutional neural networks
Aug. 2012. with software-programmable FPGAs,’’ in Proc. ACM/SIGDA Int. Symp.
[245] A. R. Brodtkorb, T. R. Hagen, and M. L. Sætra, ‘‘Graphics processing unit Field-Programmable Gate Arrays, Feb. 2017, pp. 15–24.
(GPU) programming strategies and trends in GPU computing,’’ J. Parallel [267] X. Wei, Y. Liang, and J. Cong, ‘‘Overcoming data transfer bottlenecks
Distrib. Comput., vol. 73, no. 1, pp. 4–13, Jan. 2013. in FPGA-based DNN accelerators via layer conscious memory manage-
[246] R. Barrett, M. Chakraborty, D. Amirkulova, H. Gandhi, G. Wellawatte, ment,’’ in Proc. 56th ACM/IEEE Design Autom. Conf. (DAC), Jun. 2019,
and A. White, ‘‘HOOMD-TF: GPU-accelerated, online machine learning pp. 1–6.
in the HOOMD-blue molecular dynamics engine,’’ J. Open Source Softw., [268] T. Abtahi, C. Shea, A. Kulkarni, and T. Mohsenin, ‘‘Accelerating
vol. 5, no. 51, p. 2367, Jul. 2020. convolutional neural network with FFT on embedded hardware,’’ IEEE
[247] H. Ma, ‘‘Development of a CPU-GPU heterogeneous platform based Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 9, pp. 1737–1749,
on a nonlinear parallel algorithm,’’ Nonlinear Eng., vol. 11, no. 1, Sep. 2018.
pp. 215–222, Jun. 2022. [269] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, ‘‘High-performance
CNN accelerator on FPGA using unified winograd-GEMM architecture,’’
[248] J. E. Stone, D. Gohara, and G. Shi, ‘‘OpenCL: A parallel programming
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 12,
standard for heterogeneous computing systems,’’ Comput. Sci. Eng.,
pp. 2816–2828, Dec. 2019.
vol. 12, no. 3, pp. 66–73, May 2010.
[270] A. Lavin and S. Gray, ‘‘Fast algorithms for convolutional neural
[249] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick,
networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
S. Morton, E. Phillips, Y. Zhang, and V. Volkov, ‘‘Parallel computing
Jun. 2016, pp. 4013–4021.
experiences with CUDA,’’ IEEE Micro, vol. 28, no. 4, pp. 13–27,
[271] J. Bottleson, S. Kim, J. Andrews, P. Bindu, D. N. Murthy, and J. Jin,
Jul. 2008.
‘‘ClCaffe: OpenCL accelerated caffe for convolutional neural networks,’’
[250] M. Halvorsen, ‘‘Hardware acceleration of convolutional neural net- in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW),
works,’’ M.S. thesis, Dept. Comput. Inf. Sci., Norwegian Univ. Sci. May 2016, pp. 50–57.
Technol. (NTNU), Trondheim, Norway, 2015. [272] S. Winograd, Arithmetic Complex. Computations. Philadelphia, PA,
[251] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, USA: SIAM, 1980.
and E. Shelhamer, ‘‘CuDNN: Efficient primitives for deep learning,’’ [273] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and
2014, arXiv:1410.0759. S. Areibi, ‘‘Caffeinated FPGAs: FPGA framework for convolutional
[252] CUDA Convnet2. Accessed: Jan. 10, 2024. [Online]. Available: neural networks,’’ in Proc. Int. Conf. Field-Programmable Technol.
https://code.google.com/archive/p/cuda-convnet2/ (FPT), Dec. 2016, pp. 265–268.
[253] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, [274] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic,
S. Guadarrama, and T. Darrell, ‘‘Caffe: Convolutional architecture for E. Cosatto, and H. P. Graf, ‘‘A massively parallel coprocessor for
fast feature embedding,’’ in Proc. 22nd ACM Int. Conf. Multimedia, convolutional neural networks,’’ in Proc. 20th IEEE Int. Conf. Appl.-
Nov. 2014, pp. 675–678. Specific Syst., Archit. Processors, Jul. 2009, pp. 53–60.
[254] R. Collobert, K. Kavukcuoglu, and C. Farabet, ‘‘Torch7: A MATLAB-like [275] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, ‘‘A dynam-
environment for machine learning,’’ in Proc. BigLearn, NIPS Workshop, ically configurable coprocessor for convolutional neural networks,’’ in
2011, pp. 1–11. Proc. 37th Annu. Int. Symp. Comput. Archit., Jun. 2010, pp. 247–257.
[255] S. Mittal, ‘‘A survey on optimized implementation of deep learning [276] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
models on the NVIDIA Jetson platform,’’ J. Syst. Archit., vol. 97, Y. LeCun, ‘‘NeuFlow: A runtime reconfigurable dataflow processor for
pp. 428–442s, Aug. 2019. vision,’’ in Proc. CVPR Workshops, Jun. 2011, pp. 109–116.

94292 VOLUME 12, 2024


R. Khanam et al.: Comprehensive Review of Convolutional Neural Networks for Defect Detection

[277] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, ‘‘Optimizing [300] J. J. Coyle, R. A. Novack, B. J. Gibson, and C. J. Langley, Supply
FPGA-based accelerator design for deep convolutional neural networks,’’ Chain Management: A Logistics Perspective. Chennai, India: Cengage
in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Learning, 2021.
Feb. 2015, pp. 161–170. [301] F. Farahnakian, L. Koivunen, T. Mäkilä, and J. Heikkonen, ‘‘Towards
[278] A. Rahman, S. Oh, J. Lee, and K. Choi, ‘‘Design space exploration of autonomous industrial warehouse inspection,’’ in Proc. 26th Int. Conf.
FPGA accelerators for convolutional neural networks,’’ in Proc. Design, Autom. Comput., Sep. 2021, pp. 1–6.
Autom. Test Eur. Conf. Exhibition, Mar. 2017, pp. 1147–1152. [302] M. Hussain, T. Chen, and R. Hill, ‘‘Moving toward smart manufac-
[279] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, ‘‘A GPU-outperforming FPGA turing with an autonomous pallet racking inspection system based
accelerator architecture for binary convolutional neural networks,’’ ACM on MobileNetV2,’’ J. Manuf. Mater. Process., vol. 6, no. 4, p. 75,
J. Emerg. Technol. Comput. Syst., vol. 14, no. 2, pp. 1–16, Apr. 2018. Jul. 2022.
[280] S. Derrien and S. Rajopadhye, ‘‘Loop tiling for reconfigurable accelera- [303] M. Hussain, H. Al-Aqrabi, M. Munawar, R. Hill, and T. Alsboui,
tors,’’ in Proc. Int. Conf. Field Program. Log. Appl., 2001, pp. 398–408. ‘‘Domain feature mapping with YOLOv7 for automated edge-based
[281] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Penksy, ‘‘Sparse pallet racking inspections,’’ Sensors, vol. 22, no. 18, p. 6927, Sep. 2022.
convolutional neural networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [304] M. Hussain and R. Hill, ‘‘Custom lightweight convolutional neural
Recognit. (CVPR), Jun. 2015, pp. 806–814. network architecture for automated detection of damaged pallet rack-
[282] M. Courbariaux, Y. Bengio, and J.-P. David, ‘‘Training deep neural ing in warehousing & distribution centers,’’ IEEE Access, vol. 11,
networks with low precision multiplications,’’ 2014, arXiv:1412.7024. pp. 58879–58889, 2023.
[283] X. Zhang, X. Liu, A. Ramachandran, C. Zhuge, S. Tang, P. Ouyang, [305] M. Hussain, ‘‘YOLO-v5 variant selection algorithm coupled with
Z. Cheng, K. Rupnow, and D. Chen, ‘‘High-performance video content representative augmentations for modelling production-based variance
recognition with long-term recurrent convolutional network for FPGA,’’ in automated lightweight pallet racking inspection,’’ Big Data Cognit.
in Proc. 27th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2017, Comput., vol. 7, no. 2, p. 120, Jun. 2023.
pp. 1–4. [306] M. A. R. Alif, ‘‘Attention-based automated pallet racking damage
[284] T.-J. Yang, Y.-H. Chen, and V. Sze, ‘‘Designing energy-efficient convolu- detection,’’ Int. J. Innov. Sci. Res. Technol., vol. 1, no. 1, p. 169, Jun. 2024.
tional neural networks using energy-aware pruning,’’ in Proc. IEEE Conf. [307] D. Hu, ‘‘Automated pallet racking examination in edge platform based on
Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6071–6079. MobileNetV2: Towards smart manufacturing,’’ J. Grid Comput., vol. 22,
[285] A. Page, A. Jafari, C. Shea, and T. Mohsenin, ‘‘SPARCNet: A hardware no. 1, pp. 1–12, Mar. 2024.
accelerator for efficient deployment of sparse convolutional networks,’’ [308] Q. Luo, X. Fang, L. Liu, C. Yang, and Y. Sun, ‘‘Automated visual defect
ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 3, pp. 1–32, Jul. 2017. detection for flat steel surface: A survey,’’ IEEE Trans. Instrum. Meas.,
[286] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, ‘‘Learning separable vol. 69, no. 3, pp. 626–644, Mar. 2020.
RAHIMA KHANAM received the B.Sc. degree in computer science from Binary University, Malaysia, in 2018, and the Master of Computer Applications (MCA) degree from Jamia Millia Islamia University, India, in 2022. She is currently pursuing the Ph.D. degree in computer science and informatics with the University of Huddersfield, U.K., with a focus on developing an industrial pallet racking detection system utilizing deep learning. Her research interests include the intersection of AI, ML, the IoT, and deep learning technologies, with a specific focus on applications in industrial defect detection.

MUHAMMAD HUSSAIN received the B.Eng. degree in electrical and electronic engineering and the M.S. degree in the Internet of Things from the University of Huddersfield, West Yorkshire, U.K., in 2019, and the Ph.D. degree in artificial intelligence for defect identification, in 2022. He is a researcher based in Dewsbury, West Yorkshire, U.K. His research interests include fault detection, particularly microcracks on photovoltaic (PV) cells caused by mechanical and thermal stress; this work contributes to optimizing the efficiency and reliability of PV systems. He is equally passionate about machine vision, focusing on lightweight architectures for edge-device deployment in real-world production settings. Beyond fault detection, he explores AI interpretability, concentrating on developing explainable AI for medical and healthcare applications. With expertise spanning AI, fault detection, machine vision, and interpretability, he is committed to ethical and impactful AI solutions that positively influence society.

RICHARD HILL is currently the Head of the Department of Computer Science and the Director of the Centre for Sustainable Computing, University of Huddersfield, U.K. He has published over 200 peer-reviewed articles. His particular interests include digital manufacturing, encompassing digital threads and digital twinning. He has received several best paper awards and has been recognized by the IEEE for outstanding research leadership in the areas of big data, predictive analytics, the Internet of Things, cyber-physical systems security, and Industry 4.0.

PAUL ALLEN received the Ph.D. degree in vehicle dynamics from Manchester Metropolitan University, U.K., in 1998. He is currently a Professor of railway engineering and technology and the Director of the Institute of Railway Engineering, University of Huddersfield, Huddersfield, U.K. His research interests include vehicle dynamics, energy, traction and braking technologies, and remote monitoring solutions.