Deep Learning and Parallel Computing Environment for Bioengineering Systems

Edited by
Arun Kumar Sangaiah
Elsevier
3251 Riverport Lane
St. Louis, Missouri 63043

Deep Learning and Parallel Computing Environment for Bioengineering Systems ISBN: 978-0-12-816718-2
Copyright © 2019 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information,
methods, compounds or experiments described herein. Because of rapid advances in the medical sciences, in particular, independent
verification of diagnoses and drug dosages should be made. To the fullest extent of the law, no responsibility is assumed by Elsevier,
authors, editors or contributors for any injury and/or damage to persons or property as a matter of products liability, negligence or
otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Publisher: Mara Conner


Acquisition Editor: Sonnini R. Yura
Editorial Project Manager: Thomas Van Der Ploeg
Production Project Manager: Nirmala Arumugam
Designer: Mark Rogers
List of Contributors

S.P. Abirami, ME
Department of Computer Science and Engineering
Coimbatore Institute of Technology
Coimbatore, Tamil Nadu, India

Sandhya Armoogum, PhD
Department of Industrial Systems Engineering
School of Innovative Technologies & Engineering
University of Technology Mauritius
Pointe-Aux-Sables, Mauritius

R. Arun, ME
Department of Computer Science
Builders Engineering College
Kangayam, Tamil Nadu, India

Erfan Babaee Tirkolaee, ME
Department of Industrial Engineering
Mazandaran University of Science & Technology
Babol, Iran

C. Bagavathi, MTech
Department of ECE
Government College of Technology
Coimbatore, Tamil Nadu, India

Mani Bakhshi, ME
Department of Industrial Engineering
Isfahan University of Technology
Isfahan, Iran

K. Balaji, ME
School of Computer Science and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Balakrishnan, ME, PhD
Department of Computer Science and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

J. Saira Banu, PhD
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Shaik Abdul Khalandar Basha, MTech
School of Information Technology and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Syed Muzamil Basha, MTech
School of Computer Science and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Pralhad Gavali, ME
RIT
Islampur, Maharashtra, India

Alireza Goli, ME
Department of Industrial Engineering
Yazd University
Yazd, Iran

R. Karthick, BOT
Steps Rehabilitation Center (I)
Steps Groups
Coimbatore, Tamil Nadu, India

Ramgopal Kashyap, PhD

G. Kousalya, ME, PhD
Department of Computer Science and Engineering
Coimbatore Institute of Technology
Coimbatore, Tamil Nadu, India

K. Lavanya, PhD
School of Computer Science and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

XiaoMing Li, PhD
School of Computer Science & Technology
Tianjin University
Tianjin, China

M. Madiajagan, MS, PhD
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Ankit Sandeep Malhotra, BE
School of Electrical Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Geethu Mohan, ME
School of Electronics Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

R. Mohanasundaram, PhD
School of Computing Science and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

T.M. Navamani, ME, PhD
School of Computer Science and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

P.S. Periasamy, PhD
Department of ECE
K.S.R College of Engineering
Thiruchengode, Tamil Nadu, India

S. Sridhar Raj, BTech, MTech
Pondicherry University
Kalapet, Puducherry, India

Dharmendra Singh Rajput, PhD
School of Information Technology and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Arun Kumar Sangaiah, PhD
School of Computing Science and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

O. Saraniya, ME, PhD
Department of ECE
Government College of Technology
Coimbatore, Tamil Nadu, India

G. Sreeja, ME
Department of ECE
Government College of Technology
Coimbatore, Tamil Nadu, India

M. Monica Subashini, PhD
School of Electrical Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India

Durai Raj Vincent, PhD
School of Information Technology and Engineering
Vellore Institute of Technology (VIT)
Vellore, Tamil Nadu, India
Preface

Deep machine learning is an emergent area in the field of computational intelligence (CI) research that is concerned with the analysis and design of learning algorithms and representations of data at multiple levels of abstraction. Deep learning is a technique for implementing machine learning that provides an effective solution for parallel computing environments in bio-engineering problems. It encompasses artificial intelligence (AI), artificial neural networks, reasoning, and natural language processing, all of which support human intelligence and the decision making process. Heterogeneous parallel computing architectures have become significant for real-time bio-engineering applications, which require the design of a high-level operating system for matching processing tasks to the appropriate machine learning paradigm in a mixed-machine parallel system. This effort investigates the feasibility of a deep machine learning technique for implementing a high-level operating system for heterogeneous parallel computers.

The new frontier of research converging deep machine learning and parallel computing with reference to bio-engineering has three main streams that need to be addressed in the current scenario: bioinformatics, medical imaging, and sustainable engineering. This book integrates machine learning, cognitive neural computing, parallel computing paradigms, advanced data analytics and optimization opportunities to bring more compute to bio-engineering problems and challenges. Further, it is important to note that the convergence of parallel computing architectures, deep machine learning and its intelligence techniques has not been adequately investigated from the perspective of the bioengineering research streams (bioinformatics, medical imaging, and sustainable engineering) and their related research issues. Obviously, these challenges also create immense opportunities for researchers.

The book presents novel, in-depth and fundamental research contributions, from either a methodological or an application perspective, in understanding the fusion of deep machine learning paradigms and their capabilities in solving a diverse range of problems in bio-engineering and its real-world applications. The overall objective of the book is to illustrate the state of the art and recent developments in the new theories and applications of deep learning approaches applied to parallel computing environments in bioengineering systems. Presently, there are many noteworthy issues (health informatics, bio-image informatics, energy efficiency, etc.) that need to be addressed in the context of deep machine learning, parallel computing and bioengineering. For these reasons, this book addresses the comprehensive nature of cognitive neural computing and parallel computing, emphasizing their character in human intelligence and learning systems, complex analysis tasks mimicking human cognition and learning behavior, and the prediction and control of bio-engineering systems. The book intends to give an overview of the state of the art of issues and solution guidelines in the new era of the deep machine learning paradigm and its recent trends of techniques for bioengineering.

ORGANIZATION OF THE BOOK

The volume is organized into 15 chapters. A brief description of each chapter is given as follows.

Chapter 1 illustrates the basic concepts of parallel processing with examples in order to highlight the significance of parallel deep learning. The types of parallelization techniques are addressed, and the relation between computational intelligence and parallel deep learning, the challenges in combining them (parallel computing, graphics processing units and new hardware for deep learning in computational intelligence research) and the benefits are discussed in this chapter.

Chapter 2 presents big data analytics with regard to the Hadoop big data framework for storing and processing big data, described in the context of bioinformatics. The authors highlight the importance of the machine learning approach for performing predictive and prescriptive analytics. Machine and deep learning approaches currently being used for big data analytics in the Hadoop framework are presented, and the current uses of such techniques and tools in bioinformatics are illustrated in this chapter.

Chapter 3 surveys image fusion algorithms based on deep convolutional neural networks, and the results obtained by these methods are interpreted and discussed. The chapter authors address the significance of combining the outcomes of different modalities, using the complementary information from each modality to form a better image. With image fusion, multi-sensor data with complementary information about a particular region are comparatively analyzed in this chapter.

Chapter 4 illustrates the necessity of integrating machine and deep learning methodology with the diagnosis of brain tumors, and recent segmentation and classification techniques on magnetic resonance images (MRI) are reviewed. This chapter addresses current trends in the grading of brain tumors, with a focus on gliomas, which include astrocytoma. The current state of the art, software packages, evaluation and validation metrics used in different approaches are discussed, along with integration into the clinical environment.

Chapter 5 provides the essentials of deep learning methods with convolutional neural networks and analyzes their achievements in medical image analysis, such as deep feature representation, detection, segmentation, classification, and prediction. This chapter reviews the different deep convolutional neural network methods; their features, benefits, and applications are also discussed.

Chapter 6 investigates how deep learning can be applied to the classification of images in the CIFAR-10 database. The chapter authors note that deep learning technologies are becoming more accessible for corporations and individuals and give better results than the convolutional neural network alone. In this chapter, deep convolutional neural networks are used for classification, and GPU technology is used for parallel processing.

Chapter 7 discusses the basic deep learning network models and outlines some of their applications in health informatics. Biomedical data can be efficiently processed by deep learning networks, which in turn increases the predictive power for many specific applications in the health informatics domain. The chapter highlights that deep learning algorithms can provide better outcomes and predictions in health informatics when integrated with advanced parallel processors.

Chapter 8 illustrates the role of deep learning and semi-supervised and transfer learning algorithms for medical imaging. The chapter authors apply supervised, semi-supervised and unsupervised classification learning algorithms to real-time medical imaging data sets and justify their profound impact on the medical field.

Chapter 9 describes the role of machine learning algorithms on both linear and nonlinear data in addressing regression and classification problems. The results obtained in this chapter are applicable to real-time problems such as classification and regression. The chapter results state that the support vector machine (SVM) performs better than all other classification algorithms, and the neural network (NN) approach gives the lowest mean squared error (MSE) in the regression problem.

The main objective of Chapter 10 is to consolidate the benefits of classification using singular value decomposition (SVD-QR) and limited memory subspace optimization SVD (LMSVD-QR) calculations for the preprocessing of deep learning in multilayer neural systems. The chapter indicates why the SVD-QR calculation is required for the preprocessing of deep learning on large-scale input data.

Chapter 11 presents the challenges in storing and processing big data using Hadoop and Spark. The authors highlight new analytical platforms such as Hadoop and Spark, along with MapReduce programming. The objective of this chapter is to make readers understand the challenges in storing and processing big data, and how to use different big data frameworks effectively to store and process big data.

Chapter 12 presents a novel mixed-integer linear programming (MILP) model that considers a location routing problem (LRP) for multiple perishable products with vehicles having multiple trips, intermediate depots, and soft time windows. To cope with the solution complexity of the problem, an efficient biography-based optimization (BBO) algorithm is investigated in this chapter.

Chapter 13 gives a brief overview of evolutionary procedures, systolic arrays, and methods to transform an iterative algorithm into an architecture. The significance of the parameters derived from the grey tone distribution matrix (GTDM) is mentioned, and the parameters involved in selecting the best of the addressed algorithms are clearly justified. The chapter authors reveal that ant colony optimization performed best among the selected evolutionary algorithms in addressing a systolic array mapping of GTDM computation.

The ultimate aim of Chapter 14 is to design a complete combinatorial model for the results from various screening experiments involving a multimodal deep learning technique, leading to a better solution for autism identification. This chapter mainly focuses on the emotional sequence identification of children who are autism spectrum disorder (ASD) positive and ASD negative (i.e., normal TD).

Chapter 15 gives parallel machine learning and deep learning approaches for bioinformatics. The authors outline deep learning and other deep representative learning algorithms that have been applied successfully in image understanding, speech recognition, text classification, etc.

AUDIENCE

The overall objective of the book is to illustrate the state of the art and recent developments in the new theories and applications of deep learning approaches applied to parallel computing environments in bioengineering systems. The book aims to present concepts and technologies that are successfully used in the implementation of today's intelligent data-centric critical systems and multimedia cloud big data, and that have a good chance of being used in future computing systems. The book will also constitute teaching material for a course titled Computational Intelligence for New Computing Environments, and is hence suitable for university level courses as well as research scholars.

Arun Kumar Sangaiah
School of Computing Science and Engineering
Vellore Institute of Technology
Vellore, Tamil Nadu, India
Foreword

This book delivers a significant forum for the technical advancement of deep learning in parallel computing environments across diversified bio-engineering domains and their applications. Pursuing an interdisciplinary approach, it focuses on methods used to identify and acquire valid, potentially useful knowledge sources. Managing the gathered knowledge and applying it to multiple domains, including health care, social networks, mining, recommendation systems, image processing, pattern recognition and prediction using deep learning paradigms, is the major strength of this book. Effective data and knowledge management has become key to the success of engineering applications and business organizations, and can offer a substantial competitive edge.

The book "Deep Learning and Parallel Computing Environment for Bioengineering Systems" focuses on domain experts and developers who want to understand and explore the application of deep learning and computational intelligence aspects (opportunities and challenges) in the design and development of parallel computing environments in the context of bio-engineering systems and their related applications, such as smarter health care, homeland security, computational biology, robotics, and intelligent assistance. This book is a significant collection of 15 chapters exploring the significance of deep learning systems and bio-engineering in the next paradigm of computing.

This book gives an intensive and in-depth coverage of the use of deep learning in the field of bio-engineering systems and various interesting findings. It is a significant step in this field's maturation and will serve to unify, advance, and challenge the scientific community in many important ways. In addition, this book is well suited for researchers exploring the significance of deep-learning systems and bio-engineering. It integrates the core ideas of deep learning and its applications in bio-engineering application domains in a way that is accessible to all scholars and academicians. The techniques and concepts proposed in this book can be extended in the future to accommodate changing business organizations' needs, as well as practitioners' innovative ideas.

I am pleased to congratulate the editors and authors on their accomplishment, and hope that the readers will find the book worthwhile and a source of inspiration in their research and professional activity.

Prof. Vincenzo Piuri, PhD
IEEE Fellow
Department of Computer Science
University of Milan
Milan, Italy
Acknowledgment

The editors would like to recognize the help of all the people involved in this project, and especially the authors and reviewers that took part in the peer review process. Without their support, this book would not have become a reality.

First, the editors would like to thank each of the authors for their contributions. Our sincere gratitude goes to the chapters' authors who contributed their time and expertise to this book.

Second, the editors wish to acknowledge the valuable contributions of the reviewers regarding the improvement of quality, coherence, and content presentation of chapters. We deeply appreciate the comments of the reviewers who helped us to refine the context of this book. Most of the authors also served as referees; we highly appreciate their double task.

Finally, our gratitude goes to all of our friends and colleagues, who were so generous with their encouragement, advice and support.

Arun Kumar Sangaiah
School of Computing Science and Engineering
Vellore Institute of Technology
Vellore, Tamil Nadu, India
Contents

LIST OF CONTRIBUTORS, v
PREFACE, vii
FOREWORD, xi
ACKNOWLEDGMENT, xiii

1 Parallel Computing, Graphics Processing Unit (GPU) and New Hardware for Deep Learning in Computational Intelligence Research, 1
M. Madiajagan, MS, PhD, S. Sridhar Raj, BTech, MTech
  1.1 Introduction, 1
    1.1.1 Machine and Deep Learning, 1
    1.1.2 Graphics Processing Unit (GPU), 1
    1.1.3 Computational Intelligence, 2
    1.1.4 GPU, Deep Learning and Computational Intelligence, 2
  1.2 Deep Learning and Parallelization, 2
    1.2.1 Parallel Processing Concepts, 2
    1.2.2 Deep Learning Using Parallel Algorithms, 3
    1.2.3 Parallelization Methods to Distribute Computation Across Multiple Machines, 3
    1.2.4 Methods to Train Deep Neural Networks Using Parallelization, 4
    1.2.5 Parallelization Over Data, Function, Parameter and Prediction Scale, 5
    1.2.6 Types of Speed-Up and Scaling, 6
  1.3 Role of Graphics Processing Unit in Parallel Deep Learning, 7
    1.3.1 Hardware Architecture of CPU and GPU, 7
    1.3.2 Suitability of GPU to Parallel Deep Learning, 7
    1.3.3 CPU vs. GPU, 8
    1.3.4 Advantages of Using GPU in Parallel Deep Learning, 8
    1.3.5 Disadvantages of Using GPU in Parallel Deep Learning, 8
    1.3.6 Famous GPUs on the Market, 9
  1.4 GPU Based Parallel Deep Learning on Computational Intelligence Applications With Case Study, 9
    1.4.1 Dataflow of the Deep Parallelized Training and Testing of Computational Intelligence Applications, 9
    1.4.2 Numerical Example for a Generic Computational Intelligence Application, 10
    1.4.3 Dealing With Limited GPU Memory, 11
    1.4.4 Computational Intelligence Applications, 11
  1.5 Implementation Screenshots to Visualize the Training Process, 14
  1.6 Summary, 14
  References, 14

2 Big Data Analytics and Deep Learning in Bioinformatics With Hadoop, 17
Sandhya Armoogum, PhD, XiaoMing Li, PhD
  2.1 Introduction, 17
  2.2 From Big Data to Knowledge Discovery With Hadoop, 18
    2.2.1 Hadoop Big Data Framework, 19
    2.2.2 Big Data Collection and Ingestion, 20
    2.2.3 Data Staging and Storage on Hadoop, 22
    2.2.4 Data Processing and Analysis Frameworks, 22
    2.2.5 Big Data Analysis and Visualization, 25
  2.3 Machine Learning for Big Data Analysis, 25
    2.3.1 Machine Learning Methods, 26
    2.3.2 Deep Learning and Neural Networks, 29
    2.3.3 Machine Learning and Hadoop, 30
    2.3.4 Distributed Deep Learning and Hadoop, 31
  2.4 Conclusions, 31
  References, 32

3 Image Fusion Through Deep Convolutional Neural Network, 37
G. Sreeja, ME, O. Saraniya, ME, PhD
  3.1 Introduction, 37
  3.2 Image Fusion, 37
  3.3 Registration, 38
    3.3.1 Image Registration Stages, 38
    3.3.2 Need for Image Registration in Medical Imaging, 39
  3.4 Existing Image Fusion Methods – Overview, 42
  3.5 Deep Learning, 42
  3.6 Convolutional Neural Network (CNN), 43
  3.7 CNN Based Image Fusion Algorithms, 43
    3.7.1 Deep Stacked CNN (DSCNN), 43
    3.7.2 CNN Based Similarity Learning, 44
    3.7.3 CNN Based Fusion Using Pyramidal Decomposition, 45
  3.8 Evaluation Metrics, 45
    3.8.1 Entropy (S), 46
    3.8.2 Standard Deviation (σ), 46
    3.8.3 Spatial Frequency (SF), 46
    3.8.4 Mutual Information (MI), 46
    3.8.5 Image Quality Index (IQI), 46
    3.8.6 QRS/F, 46
    3.8.7 QG, 46
    3.8.8 Structural Similarity Index Metric (SSIM), 47
    3.8.9 FSIM, 47
    3.8.10 Contrast (C), 47
    3.8.11 QE, 47
    3.8.12 Average Gradient (AG), 47
    3.8.13 Human Perception-Based Metric (QCB), 47
    3.8.14 Processing Time (T), 47
  3.9 Results Interpretation and Discussion, 47
  3.10 Issues in Existing CNN Based Image Fusion Methods, 50
  3.11 Conclusions, 50
  References, 50

4 Medical Imaging With Intelligent Systems: A Review, 53
Geethu Mohan, ME, M. Monica Subashini, PhD
  4.1 Introduction, 53
  4.2 Tumor Types and Grading, 53
    4.2.1 Why Are Studies Concentrated Mostly on Gliomas?, 54
    4.2.2 Grading, 56
    4.2.3 Symptoms of a Brain Tumor, 58
    4.2.4 Diagnosis and Treatment of a Brain Tumor, 58
  4.3 Imaging Techniques, 59
    4.3.1 Reading an MR Image, 60
    4.3.2 Advanced Magnetic Resonance Imaging, 60
  4.4 Machine Learning (ML) – Supervised and Unsupervised Methods, 62
    4.4.1 ML Software Packages/Toolboxes, 63
  4.5 Deep Learning (DL), 64
    4.5.1 DL Tools/Libraries, 65
  4.6 Evaluation and Validation Metrics, 65
  4.7 Embedding Into Clinics, 66
  4.8 Current State-of-the-Art, 66
    4.8.1 Deep Learning Concepts for Brain Tumor Grading, 66
  4.9 Discussion, 67
  4.10 Conclusions, 67
  Short Authors Biographies, 70
  Acknowledgments, 70
  Funding, 70
  References, 70

5 Medical Image Analysis With Deep Neural Networks, 75
K. Balaji, ME, K. Lavanya, PhD
  5.1 Introduction, 75
  5.2 Convolutional Neural Networks, 75
  5.3 Convolutional Neural Network Methods, 78
  5.4 Convolutional Layer, 86
    5.4.1 Tiled Convolution, 86
    5.4.2 Transposed Convolution, 86
    5.4.3 Dilated Convolution, 86
    5.4.4 Network-in-Network, 86
    5.4.5 Inception Module, 90
  5.5 Pooling Layer, 90
    5.5.1 Lp Pooling, 90
    5.5.2 Mixed Pooling, 90
    5.5.3 Stochastic Pooling, 90
    5.5.4 Spectral Pooling, 90
    5.5.5 Spatial Pyramid Pooling, 90
    5.5.6 Multi-Scale Orderless Pooling, 90
  5.6 Activation Function, 91
    5.6.1 Rectified Linear Unit (ReLU), 91
    5.6.2 Leaky ReLU, 91
    5.6.3 Parametric ReLU, 91
    5.6.4 Randomized ReLU, 91
    5.6.5 Exponential Linear Unit (ELU), 91
    5.6.6 Maxout, 91
    5.6.7 Probout, 92
  5.7 Applications of CNN in Medical Image Analysis, 92
    5.7.1 Image Classification, 92
    5.7.2 Object Classification, 92
    5.7.3 Region, Organ, and Landmark Localization, 92
    5.7.4 Object or Lesion Detection, 93
    5.7.5 Organ and Substructure Segmentation, 93
    5.7.6 Lesion Segmentation, 93
  5.8 Discussion, 93
  5.9 Conclusions, 94
  References, 94

6 Deep Convolutional Neural Network for Image Classification on CUDA Platform, 99
Pralhad Gavali, ME, J. Saira Banu, PhD
  6.1 Introduction, 99
  6.2 Image Classification, 100
    6.2.1 Image Classification Approach, 101
    6.2.2 Image Classification Techniques, 102
    6.2.3 Research Gaps, 103
    6.2.4 Research Challenge, 103
    6.2.5 Problem Definition, 104
    6.2.6 Objective, 104
  6.3 Deep Convolutional Neuron Network, 104
  6.4 Compute Unified Device Architecture (CUDA), 105
  6.5 TensorFlow, 107
  6.6 Implementation, 107
    6.6.1 Deep Convolutional Neural Networks, 107
    6.6.2 Dataset, 109
    6.6.3 Implementing an Image Classifier, 109
    6.6.4 Installation and System Requirements, 110
    6.6.5 Algorithms, 112
  6.7 Result Analysis, 112
    6.7.1 Neural Networks in TensorFlow, 112
    6.7.2 Understanding the Original Image Dataset, 112
    6.7.3 Understanding the Original Labels, 113
    6.7.4 Implementing Pre-process Functions, 114
    6.7.5 Output of the Model, 116
    6.7.6 Training a Model Using Multiple GPU Cards/CUDA, 118
  6.8 Conclusions, 120
  References, 121

7 Efficient Deep Learning Approaches for Health Informatics, 123
T.M. Navamani, ME, PhD
  7.1 Introduction, 123
    Machine Learning Vs. Deep Learning, 123
  7.2 Deep Learning Approaches, 125
    Deep Autoencoders, 125
    Recurrent Neural Networks (RNNs), 126
    Restricted Boltzmann Machine (RBM), 127
    Deep Belief Network, 127
    Deep Boltzmann Machine (DBM), 127
    Convolutional Neural Network, 128
  7.3 Applications, 130
    Translational Bioinformatics, 130
    Clinical Imaging, 130
    Electronic Health Records, 130
    Genomics, 131
    Mobile Devices, 131
  7.4 Challenges and Limitations, 131
  7.5 Conclusions, 134
  References, 134

8 Deep Learning and Semi-Supervised and Transfer Learning Algorithms for Medical Imaging, 139
R. Mohanasundaram, PhD, Ankit Sandeep Malhotra, BE, R. Arun, ME, P.S. Periasamy, PhD
  8.1 Introduction, 139
  8.2 Image Acquisition in the Medical Field, 139
  8.3 Deep Learning Over Machine Learning, 140
  8.4 Neural Network Architecture, 140
  8.5 Defining Deep Learning, 140
  8.6 Deep Learning Architecture, 141
    8.6.1 Convolution Neural Network (CNN), 141
    8.6.2 Recurrent Neural Network, 141
    8.6.3 Deep Neural Network, 142
  8.7 Deep Learning in Medical Imaging [5], 142
    8.7.1 Diabetic Retinopathy, 142
    8.7.2 Cardiac Imaging, 143
    8.7.3 Tumor Classification in Homogeneous Breast Tissue, 143
  8.8 Developments in Deep Learning Methods, 144
    8.8.1 Black Box and Deep Learning, 144
    8.8.2 Semi-Supervised and Transfer Learning Algorithms, 144
    8.8.3 Applications of Semi-Supervised Learning in Medical Imaging, 145
    8.8.4 Method, 145
  8.9 Transfer Learning, 146
    8.9.1 Transfer Learning in Image Data, 146
    8.9.2 Transfer Learning Technique for the Detection of Breast Cancer, 149
  References, 151

9 Survey on Evaluating the Performance of Machine Learning Algorithms: Past Contributions and Future Roadmap, 153
Syed Muzamil Basha, MTech, Dharmendra Singh Rajput, PhD
  9.1 Introduction, 153
  9.2 Methodology, 154
  9.3 Linear Regression, 156
  9.4 Nonlinear Regression, 156
    9.4.1 Support Vector Machine, 156
    9.4.2 K-Nearest Neighbors, 157
    9.4.3 Neural Network, 158
  9.5 Nonlinear Decision Tree Regression, 158
    9.5.1 Regression With Decision Trees, 158
    9.5.2 Random Forest, 158
  9.6 Linear Classification in R, 159
  9.7 Results and Discussion, 161
  9.8 Conclusions, 162
  References, 163

10 Miracle of Deep Learning Using IoT, 165
Ramgopal Kashyap, PhD
  10.1 Introduction, 165
  10.2 Inspiration, 166
  10.3 Decisions in an Area of Deep Learning and IoT, 167
    10.3.1 IoT Reference Model, 168
    10.3.2 Rudiments of Machine Learning, 169
    10.3.3 Algorithms for Efficient Training, 170
    10.3.4 Secure Deep Learning, 170
    10.3.5 Robust and Resolution-Invariant Image Classification, 171
    10.3.6 Planning With Flawless and Noisy Images, 171
    10.3.7 Smart and Fast Data Processing, 172
  10.4 Simulation Results and Performance Analysis of Handwritten Digits Recognition in IoT, 172
  10.5 An Intelligent Traffic Load Prediction, 173
  10.6 Performance of Deep Learning Based Channel Assignment, 173
    10.6.1 A Deep Learning System for the Individual Pursuit, 174
    10.6.2 Network Architecture, 175
    10.6.3 Dataset, 175
    10.6.4 Statistics, 175
    10.6.5 Effectiveness of Online Instance Matching, 176
  10.7 Discussion, 176
  10.8 Conclusion, 177
  References, 177

11 Challenges in Storing and Processing Big Data Using Hadoop and Spark, 179
Shaik Abdul Khalandar Basha, MTech, Syed Muzamil Basha, MTech, Durai Raj Vincent, PhD, Dharmendra Singh Rajput, PhD
  11.1 Introduction, 179
  11.2 Background and Main Focus, 179
    11.2.1 Challenges of Big Data Technologies, 180
    11.2.2 Real Time Applications of Big Data Frameworks, 181
  11.3 Hadoop Architecture, 181
  11.4 MapReduce Architecture, 181
  11.5 Joins in MapReduce, 182
  11.6 Apache Storm, 184
  11.7 Apache Spark Environment, 184
    11.7.1 Use Cases of Big Data Technologies, 184
  11.8 Graph Analysis With GraphX, 185
  11.9 Streaming Data Analytics, 185
  11.10 Future Research Directions, 186
  11.11 Conclusion, 187
  References, 187

12 An Efficient Biography-Based Optimization Algorithm to Solve the Location Routing Problem With Intermediate Depots for Multiple Perishable Products, 189
Erfan Babaee Tirkolaee, ME, Alireza Goli, ME, Mani Bakhshi, ME, Arun Kumar Sangaiah, PhD
  12.1 Introduction, 189
  12.2 Model Development, 191
    12.2.1 Mathematical Model, 193
    12.2.2 An Illustration, 195
  12.3 Solution Method, 196
    12.3.1 Introduction to Biography Based Optimization Algorithm, 196
    12.3.2 Solution Representation, 198
    12.3.3 Initial Solution Generation, 198
    12.3.4 Immigration Phase, 198
    12.3.5 Mutation Phase, 199
    12.3.6 Optimal Parameter Design, 199
  12.4 Computational Results, 199
    12.4.1 Optimal Solutions for Instance Problems, 201
    12.4.2 Sensitivity Analysis, 202
  12.5 Discussion, Concluding Remarks and Future Research Directions, 203
  References, 203

13 Evolutionary Mapping Techniques for Systolic Computing System, 207
C. Bagavathi, MTech, O. Saraniya, ME, PhD
  13.1 Introduction, 207
  13.2 Systolic Arrays, 208
  13.3 Evolutionary Algorithms, 209
  13.4 Swarm Intelligence (SI), 211
  13.5 Mapping Techniques, 213
  13.6 Systolic Implementation of Texture Analysis, 215
  13.7 Results and Discussion, 216
    13.7.1 Performance of EA for F8 Optimization, 216
    13.7.2 Texture Analysis, 217
  13.8 Conclusions, 220
  List of Acronyms and Abbreviations, 220
  References, 220

14 Varied Expression Analysis of Children With ASD Using Multimodal Deep Learning Technique, 225
S.P. Abirami, ME, G. Kousalya, ME, PhD, Balakrishnan, ME, PhD, R. Karthick, BOT
  14.1 Introduction, 225
  14.2 State-of-the-Art, 226
  14.3 Methodology, 227
    14.3.1 Detection of Human Faces, 227
    14.3.2 Extraction of Features, 230
    14.3.3 Expression Classifier, 231
    14.3.4 Expression Identification Through a Convolution Neural Network (CNN), 232
  14.4 Results and Analysis, 235
    14.4.1 Accuracy of the Face Expression Analysis, 240
  14.5 Conclusions, 240
  14.6 Future Work, 242
  References, 242

15 Parallel Machine Learning and Deep Learning Approaches for Bioinformatics, 245
M. Madiajagan, MS, PhD, S. Sridhar Raj, BTech, MTech
  15.1 Introduction, 245
    15.1.1 Machine Learning and Deep Learning, 245
    15.1.2 Role of Parallelization in Deep Learning, 245
    15.1.3 Deep Learning Applications in Bioinformatics, 246
  15.2 Deep Learning and Parallel Processing, 246
    15.2.1 Parallel Processing, 246
    15.2.2 Scalability of Parallelization Methods, 247
    15.2.3 Deep Learning Using Parallel Algorithms, 247
  15.3 Deep Learning and Bioinformatics, 248
    15.3.1 Bioinformatics Applications, 248
    15.3.2 Advantages of Using Parallel Deep Learning in Bioinformatics Applications, 250
    15.3.3 Challenges in Using Parallel Deep Learning for Bioinformatics Applications, 250
  15.4 Parallel Deep Learning in Bioinformatics Applications With Implementation and Real Time Numerical Example, 250
  15.5 Sample Implementation Screenshots to Visualize the Training Process, 252
  15.6 Summary, 254
  References, 254

INDEX, 257
CHAPTER 1

Parallel Computing, Graphics Processing Unit (GPU) and New Hardware for Deep Learning in Computational Intelligence Research
M. MADIAJAGAN, MS, PHD • S. SRIDHAR RAJ, BTECH, MTECH

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00008-7
Copyright © 2019 Elsevier Inc. All rights reserved.

1.1 INTRODUCTION
Machine learning is the competency of software to perform a single task or a series of tasks intelligently without being programmed for those activities; it is a part of artificial intelligence (AI). Normally, software behaves according to the programmer's coded instructions, while machine learning goes one step further by making the software capable of accomplishing intended tasks by using statistical analysis and predictive analytics techniques. In simple words, machine learning helps software learn by itself and act accordingly.

Let us consider an example: when we like or comment on a friend's picture or video on a social media site, related images and videos posted earlier stay displayed. The same goes for the "people you may know" suggestions on Facebook: the system suggests another user's profile, somehow related to our existing friends list, to add as a friend. And you wonder how the system knows that? This is called machine learning. The software uses statistical analysis to identify the patterns that you, as a user, are following and, using predictive analytics, it populates the related news feed on your social media site.

1.1.1 Machine and Deep Learning
Machine learning algorithms are used to automatically understand and realize the day-to-day problems that people are facing. The number of hidden layers in an artificial neural network reflects the type of learning. The intent is to gain knowledge by learning through datasets using customized methods. But in the case of big data, where the data is huge and complicated, it is difficult to learn and analyze [1].

Deep learning plays a very vital role in resolving this issue of learning and analyzing big data. It learns complex data structures and representations acquired from raw datasets to derive meaningful information. Big data also supports the nature of deep learning algorithms, which require a large amount of training data. Training many parameters in deep learning networks increases the testing accuracy [2].

Some deep learning applications are in natural language processing, video processing, recommendation systems, disease prediction, drug discovery, speech recognition, web content filtering, etc. As the scope for learning algorithms evolves, the applications of deep learning grow drastically.

1.1.2 Graphics Processing Unit (GPU)
A graphics processing unit (GPU) is a specialized circuit which can perform high-level manipulation and alteration of memory. It can perform rendering of 2D and 3D graphics to acquire the final display. In the beginning, the need for a GPU was driven by the world of computer games, and slowly researchers realized that it has many other applications, like movement planning of a robot, image processing, video processing, etc. The general task of GPUs was just expressing the algorithms in terms of pixels, graphical representations and vectors. NVIDIA and AMD, two giants in GPU manufacturing, changed the perspective of GPUs by introducing a dedicated pipeline for rendering the graphics using multi-core systems. A CPU uses vector registers to execute the instruction stream, whereas GPUs use hardware threads which execute a single instruction on different datasets [1].

GPUs, and now TPUs (tensor processing units), reduce the time required for training a machine learning (ML) model. For example, while a CPU approach may take a week to train, a GPU approach on the same problem would take a day, and a TPU approach takes only a few hours. Also, multiple GPUs and TPUs can be used. Multiple CPUs can be used as well, but network latency and other factors make this approach untenable. As others have noted, GPUs are designed to handle high-dimensional matrices, which is a feature of many ML models. TPUs are designed specifically for ML models and don't include the technology required for image display.

1.1.3 Computational Intelligence
Computational intelligence deals with automatic adaptation and organization with respect to the implementation environment. By possessing attributes such as knowledge discovery, data abstraction, association and generalization, the system can learn and deal with new situations in changing environments. Silicon-based computational intelligence comprises hybrids of paradigms such as artificial neural networks, fuzzy systems and evolutionary algorithms, augmented with knowledge elements, which are often designed to mimic one or more aspects of carbon-based biological intelligence [3].

1.1.4 GPU, Deep Learning and Computational Intelligence
A GPU is parallel in nature, which helps in improving the execution time of deep learning algorithms. By performing parallel deep learning on a GPU, all the computational intelligence research applications which involve images, videos, etc., can be trained at a very fast rate, and the entire execution time is reduced drastically.

The rest of the chapter is organized as follows. In Section 1.2, we discuss the role and types of parallelization in deep learning. Section 1.3 describes the role of the GPU in parallel deep learning. Section 1.4 presents the data flow of parallelization and a numerical example of how parallelization works in deep learning with a real time application. Section 1.5 shows the implementation details and screenshots, while Section 1.6 summarizes the contents discussed in the above sections.

1.2 DEEP LEARNING AND PARALLELIZATION
In this section, we will discuss what parallel processing is and which algorithms are suitable for deep learning, through analysis. The analysis is based on the time and throughput of the algorithms.

1.2.1 Parallel Processing Concepts
The concept of parallel processing arises to facilitate the analysis of huge data and to acquire meaningful information from it. Speech processing, medical imaging, bioinformatics and many similar fields are facing the difficulty of analyzing huge amounts of complex data. There are some problems in which the run-time complexity cannot be improved even with many processors.

Parallel algorithms are called efficient when their run-time complexity divided by the number of processors is equal to the best run-time complexity in sequential processing. Not everything should be parallelized. User experience, for example, is a serial task: if one thread redraws the screen while some other thread is trying to register a click, the two cannot be run in parallel; they have to be sequential. Sometimes sequential processing is faster than parallel, since the latter requires gathering all the data in one place, while the former does not have to gather data [4].

In single processor systems, a set of inputs is given to the processor, and it returns the output after processing. The performance of the processor can be made faster by increasing the frequency. But the heat emitted by the electrons moving through the processor is very high, so there is a certain frequency limit beyond which the processor melts down.

FIG. 1.1 Single processor execution.

To rectify the issue shown in Fig. 1.1, we move to parallel processing, where more than one processor is used to process the data. This way the workload is divided between multiple processors. See Fig. 1.2.

Parallel computing has its own disadvantages, such as dependency between processors, i.e., one processor might wait for the results of the process running on another processor. In modern computing, we address the number of processors by using the term core. Dual-core, multi-core, i3, i5, i7, etc., all denote the number of processors.

FIG. 1.2 Parallel processing execution.

1.2.2 Deep Learning Using Parallel Algorithms
Machine learning provides solutions on a very small scale even for sophisticated problems. If it is expanded to large scale problems, many new and surprising results can be explored. Unfortunately, the limited capacity of sequential algorithms in terms of time and complexity has restrained this evolution. Parallel processing offered by modern data processing platforms like Hadoop and Spark can be leveraged for machine learning in multiple ways, especially for hyperparameter and ensemble learning [5].

1.2.2.1 Understanding the Needs and Benefits of Parallel Algorithms in Deep Learning
• Neural networks take a huge number of parameters from the datasets and learn to define the model. This learning of many parameters amounts to a very long computation time, considered to be on the order of days; here "q" denotes the number of cores in the processor. The VGGNet application takes about 10 hours for training even on an 8q machine. This is a computationally intensive process which takes a lot of time [6].
• In some cases, the datasets are too large for a single machine to store and process. Therefore we need parallel and distributed processing methodologies which reduce the training time.
• The very nature of deep learning is distributed across processing units or nodes. Using simulated parallelism is slow, but implementing deep learning in its "natural form" would mean improvements in training time from months to weeks, or even days. Of importance here is the acceleration, nothing else; one can run deep learning solutions on a single processor or machine provided one can tolerate the sluggishness [5]. Hence, the sure way of speeding things up is to use hardware acceleration, just like in computer graphics, since both graphics and deep learning are inherently parallel in nature [7].

1.2.2.2 Challenges in Implementing Parallel Algorithms in Deep Learning
Applying machine learning algorithms to large scale datasets like web mining, social networks and other distributed environment data is challenging. Research on making the normal sequential algorithms scalable has still not achieved its purpose, and sequential methods continue to take very long training times. The above mentioned problems remain a tough challenge even with MapReduce frameworks, which hide much of the complexity.

To overcome this challenge, a framework which implements parallel machine learning algorithms on large distributed data, such as social networks, is based on functional programming abstractions [8]. The algorithms can be implemented very easily by using functional combinators, which yield the best composition of aggregation, distributed and sequential processes. This system also avoids inversion of control in a synchronous parallel model. Limited graphics processing unit memory is yet another challenge which has to be rectified.

1.2.3 Parallelization Methods to Distribute Computation Across Multiple Machines
The methodologies to perform parallelized or distributed computation on multi-core machines are given below.

1.2.3.1 Local Training
Local training means that the data is being trained on a single machine which has multi-core processors. The entire datasets are loaded onto the same machine, and the cores take care of the processing task. Multi-core machines can be used in two ways:
• By loading multiple data in a single layer and processing them using the multi-core processor, which is a lengthy parallelization process;
• By using a batching system to separate the datasets into many small batches and sending each batch to a core for processing.

1.2.3.2 Distributed Training
When the datasets are so huge that they cannot be stored on a single system, distributed training resolves this problem. The data is stored across many machines in a distributed manner. Here, either the model or the data can be distributed, as discussed below.
• In data parallelism, data is distributed across multiple machines. When the data set is large or its faster processing is required, data parallelism can be used.

• In model parallelism, the model is typically too big 1.2.4.1 Inter-Model Parallelism
to fit on a single system. When a model is placed into Generally, when inter-model parallelism is used, there
a single machine, one model demands the output of are different models, and each model can have differ-
another model. This forward and backward propaga- ent parameters such as equation function, layer types,
tion establishes communication between the models number of neurons per layer, etc. All three different
from different machines in a serial fashion [9]. model cases are trained with the same dataset. See
Fig. 1.5.
1.2.4 Methods to Train Deep Neural
Networks Using Parallelization
Deep neural networks or deep artificial neural networks
follow the structure of the actual brain and its functions.
They use multiple layers of artificial neurons for classifi-
cation and pattern recognition.
Fig. 1.3 shows the structure of a non-deep neural net-
work, having only one hidden layer, whereas Fig. 1.4
depicts a deep neural network with three hidden layers.
Networks having between 3 and 10 hidden layers are
called very deep neural networks. There are four ways to
parallelize neural network training. They are discussed
in what follows.

FIG. 1.5 Inter-model parallelism.

1.2.4.2 Data Parallelism


The idea of data parallelism was brought up by Jeff
Dean style as parameter averaging. We have three copies
of the same model. We deploy the same model A over
three different nodes, and a subset of the data is fed
over the three identical models. The values of the pa-
rameters are sent to the parameter server and, after col-
lecting all the parameters, they are averaged. Using the
parameter server, the omega is synchronized. The neural
networks can be trained in parallel in two ways, i.e., syn-
FIG. 1.3 Structure of non-deep neural networks. chronously (by waiting for one complete iteration and

FIG. 1.4 Structure of deep neural networks.



updating the value for omega) and asynchronously (by sending outdated parameters out of the network). But the amount of time taken is the same for both methods, so the choice of method is not a big issue here. See Fig. 1.6.
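The synchronous variant of this parameter-averaging scheme can be sketched in a few lines of NumPy. This is an illustrative toy, not the chapter's implementation: three replicas of a one-parameter linear model each take a gradient step on their own data shard, and a parameter-server variable `w` is replaced by the average of the replica weights after every iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, y ≈ 3x, split into three shards (one shard per model replica).
X = rng.normal(size=(300, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)
shards = np.array_split(np.arange(300), 3)

def local_step(w, idx, lr=0.1):
    """One SGD step of a replica on its own shard (mean-squared-error gradient)."""
    Xi, yi = X[idx], y[idx]
    grad = 2.0 * Xi.T @ (Xi @ w - yi) / len(idx)
    return w - lr * grad

w = np.zeros(1)                                # parameter server state
for _ in range(50):                            # synchronous iterations
    replicas = [local_step(w.copy(), idx) for idx in shards]
    w = np.mean(replicas, axis=0)              # average the collected parameters

print(w)  # close to the true slope, 3.0
```

In the asynchronous variant, each replica would push its update to the server as soon as it finishes, at the cost of sometimes averaging in stale parameters.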

FIG. 1.6 Data parallelism.

To overcome the problems in data parallelism, task level parallelism has been introduced. Independent computation tasks are processed in parallel by using conditional statements in GPUs. Task level parallelism can act without the help of data parallelism only to a certain extent, beyond which the GPU needs data parallelism for better efficiency. But task level parallelism gives more flexibility and computation acceleration to the GPUs.

1.2.4.3 Intra-Model Parallelism
Machine learning can leverage modern parallel data processing platforms like Hadoop and Spark in several ways. In this section, we will discuss how to scale machine learning with Hadoop or Spark. Three different ways of parallel processing can benefit machine learning. When thinking about parallel processing in the context of machine learning, what immediately jumps to mind is data partitioning along with divide-and-conquer learning algorithms. However, as we will find out, data partitioning is not necessarily the best way to exploit parallel processing. There are other, more fruitful areas [10]. See Fig. 1.7.

FIG. 1.7 Intra-model parallelism.

1.2.5 Parallelization Over Data, Function, Parameter and Prediction Scale
The focus is on classical machine learning algorithms; moreover, we are only considering Hadoop and Spark as parallel processing platforms.

1.2.5.1 Data Partitioning
With data partitioning and parallelism, the learning algorithm operates on each partition in parallel, and finally the results for each partition are stitched together. This is essentially the divide-and-conquer pattern. The critical assumption is that learning algorithms can be recast to execute in parallel. However, not all learning algorithms are amenable to parallel processing.

At first, this may sound very attractive. However, careful consideration of the following observations will lead us to a different conclusion:
• The amount of training data required has a complex relationship with model complexity, expressed as a VC dimension. Higher model complexity demands more training data.
• Most machine learning algorithms are not capable of handling very complex models because of high generalization error. Deep learning is an exception to this.
• Complexity is chosen based on a tradeoff between error due to bias and error due to variance.
Based on these observations, we can conclude that machine learning algorithms, when applied to real world problems, do not require a very large training data set and hence do not require the parallel processing capabilities of Hadoop or Spark. More on the relationship

between model complexity and training data size can be found in our earlier work. We could still use Hadoop or Spark: we can use a sequential learning algorithm that will operate on the whole data set without any partitioning [11].

1.2.5.2 Function Partitioning
This is the flip side of data partitioning. A function is decomposed into several independent functions. Each function operates in parallel on the whole data set. Results are consolidated when all the functions have been computed. There is no learning algorithm that is amenable to functional decomposition. Moreover, Hadoop and Spark provide parallel processing capabilities only through data partitioning.

1.2.5.3 Hyperparameter Learning
This is an area where we can exploit parallel processing in Hadoop and Spark very effectively. Generally, any learning algorithm has many parameters that influence the final result, i.e., the test or generalization error. Our goal is to select a set of parameter values that will give us the best performance, i.e., minimum error.

This is essentially an optimization problem where we want to minimize the error on a multi-dimensional parameter space. However, the error cannot be expressed as a function of the parameters in closed form, and hence many classical optimization techniques cannot be used.

Several optimization techniques can be used for finding the optimal set of parameter values; the techniques available are by no means limited to those mentioned here. For some parameter value sets we build a predictive model and test it to find the test or generalization error [12]. In ensemble learning, multiple predictive models are built. Random forest is a good example, where an ensemble of decision trees is used. The ensemble of models is used for prediction, e.g., by taking a majority vote. With ensemble learning, error due to variance can be reduced.

The models in the ensemble are generally built by using a subset of training data and a subset of features. There are other generic ways to create ensembles, e.g., by bagging, boosting and stacking, and there are also ensemble techniques specific to particular learning algorithms. Since the models in the ensemble are trained independently, they can be trained in parallel.

1.2.5.4 Prediction at Scale
Having built a predictive model, sometimes the model needs to be deployed to predict on a massive amount of data and with low latency. Additionally, the prediction may have to be made in close to real time. Here is an example where predictions are to be made in near real time and a large amount of data is involved. Consider a model that predicts the probability of a customer buying something during the current visit to an e-commerce site, based on real time click stream data [5]. This could be done with Spark Streaming, with click stream data arriving through a Kafka topic. To maximize throughput, the data could be processed with multiple Spark partitions. Each Spark task processing a partition will load the predictive model [13]. The output of the prediction could be written back to another Kafka topic. The website could personalize content based on the prediction from the model [6].

1.2.6 Types of Speed-Up and Scaling
Speeding up and scaling the capacity of the processors leads to a reduction of the execution time. The types of scaling models are discussed below.

The number of resources added for scaling the processor is nearly proportional to the performance of the processor. The resources denote the processors, memory size and bandwidth offered in the case of a distributed environment. Adding "y" times more resources yields "y" times speed-up [3]. The idea is to scale the number of processors and check the efficiency of the machine. There are two scalability models:
• Problem constrained;
• Time constrained.

1.2.6.1 Problem Constrained (PC) Scaling
The size of the problem here is fixed, and the reduction of the execution time is the aim. Therefore, without increasing the size of the problem, the number of processors and memory size are increased. The speed-up is computed by the equation below:

SPC = Time (1 processor) / Time ("p" processors).

The ratio of the time taken by one processor to the time taken by the total number of processors used yields the speed-up value.

1.2.6.2 Time Constrained (TC) Scaling
Unlike in the problem constrained case, here, in the time constrained situation, the execution time is fixed at its maximum limit, and increasing the problem size is the objective. Speed-up is defined in terms of the work, and the time is kept constant. "Speed-up" is then defined as

STC = Work ("p" processors) / Work (1 processor),

the work or problem size executed by all processors over that accomplished by a single processor [7].
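The two speed-up definitions can be captured as one-liners; the numbers below are hypothetical measurements, used only to show the ratios at work.

```python
def speedup_pc(time_1, time_p):
    """Problem-constrained speed-up: fixed problem size, compare run times."""
    return time_1 / time_p

def speedup_tc(work_p, work_1):
    """Time-constrained speed-up: fixed wall-clock budget, compare work done."""
    return work_p / work_1

# Hypothetical measurements: one processor solves the problem in 64 s, while
# 8 processors solve it in 10 s; within a fixed 64 s budget, the 8 processors
# complete 6.4 units of work against 1 unit for a single processor.
print(speedup_pc(64.0, 10.0))  # 6.4
print(speedup_tc(6.4, 1.0))    # 6.4
```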

1.3 ROLE OF GRAPHICS PROCESSING UNIT IN PARALLEL DEEP LEARNING
1.3.1 Hardware Architecture of CPU and GPU
A discussion of the architectures of the CPU and GPU is given below, with diagrammatic representations.

1.3.1.1 Conventional CPU Architecture
The hardware architecture of a conventional CPU is given in Fig. 1.8.

FIG. 1.8 Conventional CPU architecture.

The architecture consists of control logic, cache memory, system memory and the arithmetic logic unit (ALU). In CPUs, more space is allocated to control logic than to ALUs. The latency of a single thread is optimized and hidden by multi-level caches. CPUs can effectively regulate the flow control even under heavy workloads. A typical multi-core CPU has between 1 and 32 ALUs, and the cache memory is shared across all the cores.

1.3.1.2 Modern GPU Architecture
In the modern GPU architecture (Fig. 1.9), very little space is allocated to the control logic and cache memory. Since it has multiple threads across the cores, large register files are required to accommodate them. Unlike in a CPU, many ALUs are placed in each core. Each core has a small cache memory, which is user manageable. A bandwidth of about 150 GB/s manages to service many ALUs at the same time.

The main specialization of the GPU is in-depth and data parallel computation. Therefore, more transistors are allocated to processing the data instead of flow control and the caching mechanism [14].

FIG. 1.9 Modern GPU architecture.

1.3.2 Suitability of GPU to Parallel Deep Learning
The suitability of the GPU for deep learning is discussed by an example. Consider a PC game, where either a low-end machine with just a CPU or a high-end machine with a CPU and GPU is chosen. It is possible to play some games on low-end machines, but the frame rate is quite low compared to the frame rate obtained on a high-end machine. The GPU speeds up or accelerates graphical computations very well; both the CPU and GPU can handle graphical operations, but the latter performs faster because of the distributed setup. The parallel architecture of a GPU can perform matrix and vector operations effectively. In 3D computer graphics there are a lot of such operations, like the computation of lighting effects from normal maps and 3D effects. GPUs were designed to handle such vector and matrix operations in parallel, unlike a single core CPU that would handle matrix operations in serial form, processing one element at a time. This makes it possible to play games at 60 fps with impressive real-time visuals. Now, coming back to deep learning, there are a lot of vector and matrix operations [10].

A lot of data can be stored in L1 caches and register files on GPUs to reuse convolutional and matrix multiplication tiles. For example, the best matrix multiplication algorithms use 2 tiles of 64 × 32 to 96 × 64 numbers for the 2 input matrices in L1 cache, and a 16 × 16 to 32 × 32 number register tile for the output sums per thread block (1 thread block = up to 1024 threads; one has 8 thread blocks per stream processor, and there are 60 stream processors in total for the entire GPU). If you have a 100 MB matrix, it can be split into smaller matrices that fit into your cache and registers, and then you can do matrix multiplication with three matrix tiles at speeds of 10–80 TB/s. This is one more reason why GPUs are so much faster than CPUs, and why they are so well suited for deep learning [15].
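The tiling idea described above does not depend on the GPU itself and can be illustrated on the CPU. In the blocked multiplication below (an illustrative sketch; the tile size is arbitrary, not a hardware figure), each output tile is accumulated from small sub-blocks of the operands, which is exactly the decomposition that lets a GPU keep the active tiles in L1 cache and registers.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiplication: C is accumulated tile by tile so that
    only small sub-blocks of A and B are "hot" at any moment."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, p:p + tile] @ B[p:p + tile, j:j + tile]
                )
    return C

rng = np.random.default_rng(1)
A = rng.normal(size=(96, 64))
B = rng.normal(size=(64, 128))
assert np.allclose(tiled_matmul(A, B), A @ B)  # same result as the direct product
```

On real hardware the tile size is chosen to match the cache or register file, and the per-tile products run on many cores at once rather than in nested loops.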

1.3.3 CPU vs. GPU architectures like Xeon Phis where this utilization is
• CPUs are designed for more general computing difficult to achieve and difficult to debug, which in
workloads. GPUs in contrast are less flexible; how- the end makes it difficult to maximize performance
ever, GPUs are designed to compute the same in- on a Xeon Phis. [5].
structions in parallel. See Fig. 1.10.
1.3.4 Advantages of Using GPU in Parallel
Deep Learning
• The advantage of the GPU here is that it can have
a small pack of registers for every processing unit
(steam processor, or SM), of which it has many. Thus
we can have a lot of register memory in total, which
is very small and thus very fast. This leads to the ag-
gregate GPU registers’ size being more than 30 times
larger compared to CPUs and still twice as fast, which
FIG. 1.10 Architecture difference between CPU and GPU.
translates into up to 14 MB register memory that op-
• In image processing applications, GPU’s graphics- erates at a whooping 80 TB/s.
specific capabilities can be exploited to speed up the • A neural network involves lots of matrix manipula-
calculations further. tions, such as multiplication, addition and element-
• The primary weakness of GPUs as compared to CPUs wise calculations. These manipulations can be sig-
is memory capacity on GPUs which is lower than nificantly sped up because they are highly paralleliz-
on CPUs. The best known GPU contains 24 GB of able.
RAM; in contrast, CPUs can reach 1 TB of RAM. A sec- • GPUs are massively parallel calculators that al-
ondary weakness is that a CPU is required to transfer low performing many mathematical operations very
data into the GPU card. This takes place through the quickly and at once. Using GPUs cuts down the train-
PCI-E connector which is much slower than CPU or ing time.
GPU memory. The final weakness is that GPUs’ clock • GPU programming must be vectorized to be effec-
speeds are one-third that of high-end CPUs, so on tive. This is because GPU processors are built to do
sequential tasks a GPU is not expected to perform computations on images which come in a form of
comparatively well.
• GPUs are so fast because they are so efficient at matrix multiplication and convolution, and the reason for this is memory bandwidth and not necessarily parallelism. In short, and in order of importance, high-bandwidth main memory, hiding memory access latency under thread parallelism, and large and fast register and L1 memory, which is easily programmable, are the components which make GPUs so well suited for deep learning.
• CPUs are latency optimized while GPUs are bandwidth optimized.
• The CPU L1 cache only operates at about 5 TB/s, which is quite slow, and has a size of roughly 1 MB; CPU registers usually have sizes of around 64–128 KB and operate at 10–20 TB/s. Of course, this comparison of numbers is a bit flawed because CPU registers operate a bit differently than GPU registers (a bit like comparing apples and oranges), but the difference in size here is more crucial than the difference in speed, and it does make a difference.
• It is easy to tweak the GPU code to make use of the right amount of registers and L1 cache for fast performance. This gives GPUs an advantage over other [...] matrices, so vectorized operations are natural in this domain.
• Deep neural networks and most AI workloads in machine learning (ML) can thus be cast as parallel problems, which means parallel computing solutions like GPUs can speed up 90% or so of the algorithms in AI; only a few algorithms, such as tree traversal or recursive algorithms, are not parallelizable, so those can be handled on a CPU more efficiently.
• GPUs are best for speeding up distributed algorithms whereby each unit in the distributed system works independently of the other units. Thus, an ensemble of processing nodes in a neural network, like most AI algorithms, falls into this category.

1.3.5 Disadvantages of Using GPU in Parallel Deep Learning
• Full register utilization in GPUs seems to be difficult to achieve at first because it is the smallest unit of computation which needs to be fine-tuned by hand for good performance. But NVIDIA has developed good compiler tools here which indicate exactly when you are using too many or too few registers per stream processor.
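The claim that memory bandwidth, rather than raw parallelism, is what makes GPUs fast on matrix work can be made concrete with a FLOPs-per-byte (arithmetic intensity) estimate. A minimal cost model in Python; the idealized assumption that each matrix is read or written exactly once is mine, not part of the text:

```python
def matmul_cost(n, dtype_bytes=4):
    """Idealized cost of C = A @ B for n x n single-precision matrices."""
    flops = 2 * n ** 3                      # one multiply and one add per inner step
    bytes_moved = 3 * n ** 2 * dtype_bytes  # read A and B, write C, perfect caching
    return flops, bytes_moved

def arithmetic_intensity(n):
    flops, bytes_moved = matmul_cost(n)
    return flops / bytes_moved  # FLOPs per byte; grows linearly with n

# Larger matrices do more arithmetic per byte fetched, so they can keep a
# bandwidth-optimized device busy; tiny matrices stay memory-bound.
for n in (64, 1024, 4096):
    print(n, arithmetic_intensity(n))
```

This is why deep learning workloads, which are dominated by large matrix multiplications, map so well onto high-bandwidth GPU memory systems.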
CHAPTER 1 Parallel Computing, GPU and New Hardware for Deep Learning 9

• One example algorithm that is hard to speed up on GPUs is the Fibonacci sequence calculation, which is sequential. By speeding up the calculations, neural networks can be optimized using more data and more parameters, thanks to progress in deep learning.
• CPUs are better suited to perform a wider range of operations, at the cost of slower performance for some of the rendering-specific operations.
• A CPU usually has fewer cores; current CPUs commonly have between 4 and 16, while newer high-end GPUs have more than a thousand. Each of these cores is essentially a computer in itself. So, why isn't a GPU always better than a CPU if it has way more cores? One reason is that the clock speed is typically much higher on a CPU, meaning that each of the individual cores can perform more calculations per second than the individual cores of the GPU. This makes CPUs faster on sequential tasks.
• There are other things to consider when calculating the benefit of using a GPU instead of a CPU, such as memory transfer. If you want to multiply 1000 numbers on the GPU, you need to tell it first what those 1000 numbers are, so the GPU tends to be more useful in cases where you need to do a lot of work with little change in input and a small volume of output.

1.3.6 Famous GPUs on the Market
NVIDIA and AMD are the two leading manufacturers, whose GPUs are widely used in the technology world. Both are discussed below.

1.3.6.1 NVIDIA
All the NVIDIA products are under the standard compute unified device architecture (CUDA). CUDA is defined as a register architecture, where the binary files required for execution on one CUDA GPU do not necessarily work on another CUDA based GPU. A CUDA GPU consists of multiprocessors which execute multiple threads in blocks. Multiple blocks can be executed by a multiprocessor simultaneously. Each multiprocessor contains 8 CUDA cores at compute capability 1.x. As the capability increases to 2, 3, etc., the number of CUDA cores also increases.
To hide the memory access and arithmetic latencies, multiple threads have to be executed concurrently. NVIDIA runs 192 to 256 threads per multiprocessor for GPUs having compute capability 1.x. It is better to run more threads in the case of data parallelism to free up the registers. High performance can be achieved only by knowing the optimal number of threads that can be concurrently executed in a multiprocessor. The unified virtual address is useful in establishing the connection between two GPUs.

1.3.6.2 AMD
AMD accelerated parallel processing (APP), or ATI Stream, is the technology which is used to execute general computations. Each APP device consists of multiple compute units, each compute unit contains multiple stream cores, and each core contains multiple processing elements. The instances of a GPU program are executed concurrently, each instance being called a work item. In lockstep, multiple work items are executed in parallel by all the cores of a compute unit. The total number of work items is decided based on the hardware and the requirements of the programmer for a particular work group [3].

1.4 GPU BASED PARALLEL DEEP LEARNING ON COMPUTATIONAL INTELLIGENCE APPLICATIONS WITH CASE STUDY
In this section, we discuss computational intelligence applications and how GPU based parallel deep learning is applied over those applications by considering examples.

1.4.1 Dataflow of the Deep Parallelized Training and Testing of Computational Intelligence Applications
Applying the parallel deep learning methodology over computational intelligence research applications brings a greater challenge in implementation. The overheads, applicability problems and related issues are addressed in this section.
A general model has been designed in order to execute the computational intelligence application data in parallel with deep learning algorithms. Fig. 1.8 depicts the data flow of the parallel execution using deep learning. The data flow comprises the following steps:
1. The required data is collected by means of a sensor or similar devices from the subject.
2. Once the data is ready for training, the entire data is separated into training and test data.
3. The training data is fed into the model.
4. In order to perform parallel processing, the dataset has to be separated into halves for parallel processing. Certain scheduling algorithms can be used to schedule the processes based on the number of cores available in the processor.
5. The dataset gets trained in the separate cores.
10 Deep Learning and Parallel Computing Environment for Bioengineering Systems

6. After the training is over, the results of the trained data have to be combined into a single module.
7. Having the trained results in a single module makes the testing process smoother. See Figs. 1.11 and 1.12.

FIG. 1.11 Dataflow of parallel dataset training.

FIG. 1.12 Dataflow of parallel testing process.

1.4.2 Numerical Example for a Generic Computational Intelligence Application
Let us consider some real-time numerical values regarding the time overhead incurred by imparting parallel processing in deep learning applications. Let us assume that
• The total number of records in the dataset is 1100;
• A quadcore processor is given for parallel execution.
Training part parameters:
• The total number of processors given equals 4;
• The number of training records in the dataset is 1000;
• The number of records given to each processor for parallel execution is 250;
• The time taken for one data allocation to a processor is 1 s (4 ∗ 1 = 4 s);
• The training time of one record is 1 s;
• The table merging time is 2 s per merge;
• The overall computation of the quadcore processor takes 10 s.
Testing part parameters:
• The total number of processors given is 4;
• The number of testing records in the dataset is 100;
• The number of records given to each processor for parallel execution is 25;
• The time taken for one data allocation (separation) to a processor equals 1 s (4 ∗ 1 = 4 s);
• The testing time of one record is 1 s;
• The table merging time is 2 s per merge;
• The overall computation of the quadcore processor takes 10 s;
• The prediction time is 2 s;
• Writing the predicted result to the dataset takes 1 s per record;
• Recording the datasets is done as required by the application;
• Once the dataset is ready, the data schema has to be framed in such a way that it is lossless and dependency preserving;
• The first level of dataset separation involves splitting the data into training and test data. This level of data separation need not be checked for the lossless and dependency preserving properties, since we are not going to combine it again;
• The second level involves dividing or scheduling the dataset based on the availability and the number of processors given for execution.
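These parameters can be folded into a small timing model that contrasts the quadcore run with a serial one. A sketch under the listed per-step costs; treating the combination of the four partial result tables as three pairwise merges is my assumption, not something the example specifies:

```python
ALLOC_PER_PROC = 1    # s for one data allocation to a processor
TIME_PER_RECORD = 1   # s to process (train or test) one record
MERGE_TIME = 2        # s per table merge

def parallel_time(records, procs):
    """Serial allocation, concurrent per-partition work, then result merging."""
    allocation = procs * ALLOC_PER_PROC            # e.g., 4 * 1 = 4 s
    work = (records // procs) * TIME_PER_RECORD    # partitions run concurrently
    merging = (procs - 1) * MERGE_TIME             # assumed pairwise merges
    return allocation + work + merging

def serial_time(records):
    return records * TIME_PER_RECORD

print(parallel_time(1000, 4), serial_time(1000))  # training: 260 s vs 1000 s
print(parallel_time(100, 4), serial_time(100))    # testing: 35 s vs 100 s
```

Even with the allocation and merging overheads, the four-way split cuts the 1000-record training phase from 1000 s to roughly 260 s in this model.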

FIG. 1.13 Training image.

FIG. 1.14 Inferred correspondences.

FIG. 1.15 Generative model.

FIG. 1.16 Detection of objects in an image.

1.4.3 Dealing With Limited GPU Memory
A major problem with parallelized deep learning is limited memory. All the intermediate states and the input mini-batches have to fit into the limited GPU memory. The GeePS architecture is introduced to rectify this problem by buffering data into the GPU memory from the large CPU memory.
The efficiency of GeePS is demonstrated using only a fraction of the GPU memory for the largest case and by experimenting with a much larger synthetic model. GeePS's memory management support allows us to do video classification on longer videos. If we further consider the memory used for testing, the supported maximum video length will be shorter. The parallel approach of splitting the model across four machines incurs extra network communication overhead. By contrast, with the memory management support of GeePS, we are able to train on videos with up to 192 frames, using solely data parallelism.

FIG. 1.17 Neural network training of the segments of the image.

1.4.4 Computational Intelligence Applications
The different application areas of computational intelligence possess a hybrid of neural networks, fuzzy systems and evolutionary algorithms. The categorization of applications is as follows:
1. Neural networks are used on problems of category clustering, classification, prediction, composition and control systems.
2. Evolutionary computation involves route or path optimization, scheduling problems and medical diagnosis of diseases.
3. Fuzzy logic deals with vehicle monitoring, sensor data in home appliances and control systems.

FIG. 1.18 Type of pre-loaded data selection.

FIG. 1.19 Preprocessing phase.
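The preprocessing phase of Fig. 1.19 amounts to cleaning the records and rescaling the features before the first-level training/test split. A pure-Python sketch; the drop-incomplete-records rule, the min-max scaling, and the 50/50 split ratio below are illustrative assumptions rather than steps prescribed by the chapter:

```python
def preprocess(records):
    """Drop incomplete records, then min-max scale each feature column to [0, 1]."""
    clean = [r for r in records if all(v is not None for v in r)]
    cols = list(zip(*clean))                          # column-wise view
    lo = [min(c) for c in cols]
    scale = [(max(c) - min(c)) or 1.0 for c in cols]  # guard constant columns
    return [[(v - l) / s for v, l, s in zip(r, lo, scale)] for r in clean]

def train_test_split(records, train_ratio=0.9):
    """First-level separation into training and test data (never recombined)."""
    cut = int(len(records) * train_ratio)
    return records[:cut], records[cut:]

data = [[2, 10], [4, None], [6, 30]]   # second record is incomplete
train, test = train_test_split(preprocess(data), train_ratio=0.5)
```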



FIG. 1.20 Setting the cross-validation for one instance of a processor.
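Setting up cross-validation for one processor's partition, as in Fig. 1.20, can be sketched by generating k train/validation index folds over that partition; the choice of k = 5 below is an arbitrary illustrative value:

```python
def kfold_indices(n_records, k=5):
    """Yield (train, validation) index lists for k folds over one data partition."""
    indices = list(range(n_records))
    size = n_records // k
    for i in range(k):
        # the last fold absorbs any remainder records
        val = indices[i * size:(i + 1) * size] if i < k - 1 else indices[(k - 1) * size:]
        held_out = set(val)
        train = [j for j in indices if j not in held_out]
        yield train, val

# Each record lands in exactly one validation fold:
folds = list(kfold_indices(10, k=5))
```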

FIG. 1.21 Classification outcome for a general application.
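A classification outcome like the one in Fig. 1.21 is usually summarized by overall accuracy together with a confusion count per (true, predicted) label pair; a small helper, where the animal labels are made-up placeholder data, not taken from the case study:

```python
from collections import Counter

def outcome_summary(y_true, y_pred):
    """Return overall accuracy and a Counter over (true, predicted) label pairs."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    confusion = Counter(zip(y_true, y_pred))
    return correct / len(y_true), confusion

acc, confusion = outcome_summary(["cat", "dog", "cat", "dog"],
                                 ["cat", "dog", "dog", "dog"])
# acc == 0.75; one "cat" was misclassified as "dog"
```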



4. Expert systems have financial applications, robot production, diagnostics and various industry based operations.

FIG. 1.22 Snippet while the dataset is getting trained in the Python shell.

1.5 IMPLEMENTATION SCREENSHOTS TO VISUALIZE THE TRAINING PROCESS
Parallel execution on a GPU involves data parallelism, where the training images are divided into separate batches. After the GPU computes the batches separately, the average over the entire batch is calculated. Since the batching results are synchronous, the results will not vary from a single GPU execution. This batching process has been observed to produce 3.75 times better speed than a single GPU.
Fig. 1.13 presents a training image, which shows a tabby cat leaning on a wooden table, with one paw on a laser mouse and the other on a black laptop.
Fig. 1.14 shows the objects or the information inferred from the training. Fig. 1.15 is the test image, which is different from the original image. The deep learning algorithms perform identification of the information by learning from the trained data.
Fig. 1.16 gives another example of a sample image (some items placed on a table) to illustrate the detection of the pixels and corresponding objects in the image.
Fig. 1.17 shows the training mechanism of the image using deep convolutional neural networks. Also see Figs. 1.18–1.22.

1.6 SUMMARY
GPUs work well on parallel deep neural network computations because:
• GPUs have many more resources and faster bandwidth to memory;
• Deep neural networks' computations fit well with the GPU architecture. Computational speed is extremely important because training of deep neural networks can take from days to weeks. In fact, many of the successes of deep learning may not have been discovered if it were not for the availability of GPUs;
• Deep learning involves huge amounts of matrix multiplications and other operations, which can be massively parallelized and thus sped up on GPUs.
In this chapter, the basic concepts of parallel processing are explained with examples in order to pave a clear way to parallel deep learning. Various parallelization techniques are discussed with diagrammatic explanations, and the ways in which they can be internally classified are focused on. The relation between computational intelligence and parallel deep learning, the challenges in combining them together and the benefits are discussed. The applicability of parallel deep learning algorithms to real-time datasets is explained with simple numerical examples.

REFERENCES
1. Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
2. J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks 61 (2015) 85–117.
3. Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng, On optimization methods for deep learning, in: Proceedings of the 28th International Conference on Machine Learning, Omnipress, 2011, pp. 265–272.
4. A. Karpathy, L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
5. S. Salza, M. Renzetti, Performance modeling of parallel database systems, Informatica-Ljubljana 22 (1998) 127–140.
6. G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Processing Magazine 29 (6) (2012) 82–97.
7. V. Hegde, S. Usmani, Parallel and Distributed Deep Learning, Tech. report, Stanford University, June 2016, https://stanford.edu/~rezab/dao/projects_reports/hedge_usmani.pdf.

8. K.R. Foster, R. Koprowski, J.D. Skufca, Machine learning, medical diagnosis, and biomedical engineering research – commentary, Biomedical Engineering Online 13 (1) (2014) 94.
9. Y.B. Kim, N. Park, Q. Zhang, J.G. Kim, S.J. Kang, C.H. Kim, Predicting virtual world user population fluctuations with deep learning, PLoS ONE 11 (12) (2016) e0167153.
10. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, Cambridge, 2016.
11. A. Ike, T. Ishihara, Y. Tomita, T. Tabaru, Technologies for practical application of deep learning, Fujitsu Scientific and Technical Journal 53 (5) (2017) 14–19.
12. N. Friedman, M. Linial, I. Nachman, D. Pe'er, Using Bayesian networks to analyze expression data, Journal of Computational Biology 7 (3–4) (2000) 601–620.
13. S.S. Raj, M. Nandhini, Ensemble human movement sequence prediction model with a priori based probability tree classifier (APTC) and bagged J48 on machine learning, Journal of King Saud University: Computer and Information Sciences (2018).
14. H. Greenspan, B. Van Ginneken, R.M. Summers, Guest editorial: deep learning in medical imaging: overview and future promise of an exciting new technique, IEEE Transactions on Medical Imaging 35 (5) (2016) 1153–1159.
15. I.-H. Chung, T.N. Sainath, B. Ramabhadran, M. Picheny, J. Gunnels, V. Austel, U. Chauhari, B. Kingsbury, Parallel deep neural network training for big data on Blue Gene/Q, IEEE Transactions on Parallel and Distributed Systems 28 (6) (2017) 1703–1714.
CHAPTER 2

Big Data Analytics and Deep Learning


in Bioinformatics With Hadoop
SANDHYA ARMOOGUM, PHD • XIAOMING LI, PHD

2.1 INTRODUCTION
Big data is large and complex data, which is challenging to process using traditional data processing and storing methods. Gartner analyst Doug Laney [1] introduced the three defining properties or dimensions of big data in 2001, which are the 3 Vs (Volume, Velocity and Variety). Over time, additional Vs have been proposed, such as veracity, i.e., the trustworthiness and authenticity of data, value, visibility and variability. However, the 3 Vs are most popularly used to define big data.
Big data involves large volumes of data. In today's digital world, the amount of data generated daily is massive. For instance, some 500 million tweets are recorded daily, around 56 million photos are uploaded on Instagram, and some 200 billion e-mails are sent daily [2]. Likewise, around 300 hours of video are uploaded to YouTube every minute [3]. Today's data sets are measured in petabytes, while exabyte datasets are expected in the near future. The sheer amount of data that has to be analyzed and stored is a major issue with big data, but the expansion of the other two properties, that is, velocity and variety of data, also poses challenges.
Traditional datasets such as inventory data, sales and customer data are quite static and bounded. Processing such data is not delay sensitive, as the incoming data flow rate is slower than the processing time, and the processing results are usually still useful in spite of any processing delay. Today, with the advent of the Internet-of-Things (IoT), thousands of sensors are capturing data at a fast rate. Data from social, mobile and other applications such as CCTV and news feeds are now streaming into servers continuously. Such data often has to be processed in real-time, and the result is only useful if the processing delay is very short. For example, numerous sensors are being used to monitor critical infrastructure and other physical environments. Such sensor data constitute a dynamic and unbounded dataset that has to be continuously processed in real-time to take decisions in control systems.
A variety of data has always existed, but usually a specified data structure is imposed on data that is to be processed and analyzed. Structured data are usually in the form of Excel tables and relational databases. In the context of big data, different types of data are now being factored in, processed and analyzed. Such data formats include text, messages (SMS), e-mails, tweets, posts, web data, blog data, photos, audio, video, GPS data, sensor data and documents, as well as structured data from relational databases. Thus, big data often consists of a combination of structured, semi-structured and unstructured data of different formats which have to be processed and analyzed.
Despite the challenges, big data analytics has the potential to examine large amounts of data to reveal hidden patterns and correlations and provide useful insights in different areas such as customer analytics, predictive marketing, recommendation systems, social media analysis and response, fraud and crime detection and prevention, predicting natural and man-made disasters, and improving healthcare services and disease prevention. For example, big data analytics is being used in the area of agriculture to maximize crop yields to address the problem of food security. Big data analytics is also being used to research treatments and cures for diseases such as cancer.
Bioinformatics research is regarded as an area which encompasses voluminous, expanding and complex datasets. Bioinformatics is an interdisciplinary research area mainly comprising molecular biology, computer science, mathematics, and statistics. It mainly deals with modeling biological processes at the molecular level to understand and organize the information associated with biological molecules, as well as to make inferences from the observations. Bioinformatics predominantly focuses on the computational analysis of datasets such as genomics, proteomics, transcriptomics, metabolomics and glycomics. Nowadays, with the use of high-throughput next-generation sequencing technologies, there is significant expansion of biological data, which presents storage and processing challenges.
A genome refers to the entire set of genes or genetic material (DNA) present in a cell of an organism, while
Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00009-9 17
Copyright © 2019 Elsevier Inc. All rights reserved.

genomics is the study of the structure, function, evolution, mapping, and editing of an organism's genome. DNA sequences constitute the most abundant data in bioinformatics. DNA is made up of molecules called nucleotides. The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The order of these bases is what determines the genetic code. DNA sequencing is the process of determining the precise order of the A, G, C and T bases in a strand of DNA. A typical bacterial genome can have several million bases. The human genome consists of about 3.2 billion bases, and the size of a single sequenced human genome is approximately 200 gigabytes [4]. The first human genome was completely sequenced in June 2000, and as of 2014, about 228,000 human genomes had been sequenced [5]. Recently, Illumina, the largest maker of DNA sequencers, has sequenced more than 500,000 human genomes [6]. Eventually, biological data will be sequenced at an ever-faster pace. Examples of two large datasets are the Cancer Genome Atlas [7] and the Encyclopaedia of DNA Elements [8]. The European Bioinformatics Institute (EBI) has biology-data repositories with a size of 40 petabytes [9]. Given that biological data from different sources are often used, such data are heterogeneous, as they are stored in different formats. Biological and medical data are also generated in real-time and fast (e.g., medical imaging in healthcare). Another characteristic of biological big data is that it is geographically distributed [10].
Performing data analysis to harvest the wealth of data from biological and biomedical data, such as genetic mapping on the DNA sequence, will only help to advance our understanding of the human condition, health and disease, which will consequently allow curing diseases and improving human health and lives by supporting the development of precision methods for healthcare. This is a typical big data problem, and organizations such as the National Institutes of Health (NIH) recognize the need to address the big data challenges related to the processing and data analysis of biological data. In 2012, the NIH launched the Big Data to Knowledge initiative to enable biomedical research and the development of innovative approaches and tools in the area of big data science for enhancing the utility of biomedical big data. Some $200 million was disbursed in research grants in its first phase (from 2014 till 2017) to address some major data science challenges and to stimulate data-driven discovery [11].
This chapter aims at presenting big data technologies and big data analysis which integrate deep learning for addressing complex computational needs using open source solutions, namely the Hadoop framework and ecosystem. In Section 2.2, the big data workflow is described and the Hadoop big data framework for processing, storing and analyzing big data is discussed, as well as its application in the field of bioinformatics. Section 2.3 describes the machine and deep learning algorithms and open source tools which can be integrated in the Hadoop ecosystem for more intelligent analysis of big data and their applications in bioinformatics. Section 2.4 concludes the chapter by discussing future directions.

2.2 FROM BIG DATA TO KNOWLEDGE DISCOVERY WITH HADOOP
There is tremendous potential and highly useful value hidden in the huge volume of biological data which is now available and which is growing exponentially. Big data analytics is inevitable for handling biological data and for making evolutionary breakthroughs. The big data knowledge discovery process is shown in Fig. 2.1.
Typically, big data has to be collected and ingested into the system. Such data is often structured, semi-structured and mostly unstructured; thus different tools and techniques are used for collecting data. Collected data often goes through a staging phase where the data is cleaned, i.e., inconsistent, incomplete or noisy data are discarded. Some data may require pre-processing so as to improve the quality of the data before analysis. During the staging and pre-processing phase, data is kept in temporary storage. Such pre-processing may include techniques such as data extraction, data annotation, data integration, data transformation, and data reduction. Data is then stored in a suitable storage, from where it is accessed for analytics, after which the results of the analysis of data can be visualized. Such results can then be interpreted accordingly. Performing data analytics of big data usually requires high server processing capability, often involving massively parallel processing (MPP) technologies. Big data processing and analysis
FIG. 2.1 Big data workflow for knowledge discovery.


CHAPTER 2 Big Data Analytics and Deep Learning in Bioinformatics With Hadoop 19

thus involves a shift in computing architecture to han- lytics. Hadoop is thus often integrated with other soft-
dle the challenges of storing, analyzing and extracting ware solutions. The Apache Hadoop ecosystem consists
meaningful and valuable data from the large volume, of dozens of projects such as Apache Hive, Apache Ma-
variety and high velocity data in the area of bioinfor- hout, and Apache Spark, providing various functionali-
matics. ties such that a number of these projects can be stream-
It is unlikely that big data can be stored on a single lined to deliver the required big data services. Hadoop is
server as the amount of storage required is prohibitive. a flexible platform, as it can also be integrated with non-
Similarly, it is unfeasible to process big data on a single Apache software solutions. The Gartner Magic Quadrant
server node unless multi-core high performance com- for Analytics and Business Intelligence Platforms 2018
puting (HPC) servers, which are quite costly, are used. [16] identifies Microsoft, Tableau and Qlik as the lead-
Thus, to collect, process and analyze big data, a cluster ers for analytics and business intelligence. All three big
of computing nodes may be more suitable than a single data analytics solutions support the Hadoop platform,
compute node. Cloud computing, which can provide a e.g., Azure HDInsight can be integrated with Hadoop
scalable and cost-effective solution for big data storage [17], Qlik solutions can be used with Cloudera, which
and computation, is becoming more and more popu- is a distribution of Hadoop, packaged with other tools
lar and has an important role in the development of [18], and Tableau can be very easily made to work
bioinformatics tools. According to [12], the cloud com- on Hadoop data [19]. With the growing popularity of
puting model is the only storage model that can provide Hadoop, more and more software solutions are con-
the elastic scale needed for DNA sequencing, whose rate stantly being developed to work with Hadoop. Research
of technology advancement could now exceed Moore’s work is also being carried out to make Hadoop become
Law.
more efficient and faster [20,21]. Besides, the Hadoop
The National Institute for Standards and Technol-
platform, especially the MapReduce module is com-
ogy (NIST) defines cloud computing as a model for
monly used in the field of bioinformatics for processing
enabling convenient, on-demand network access to a
data [22–26]. The following subsections describe the
shared pool of configurable computing resources (e.g.,
Hadoop platform to demonstrate why it is an impor-
networks, servers, storage, applications, and services)
tant platform for big data storage and processing. The
that can be rapidly provisioned and released with min-
various other tools that can be integrated with Hadoop
imal management effort or service provider interaction
to achieve the big data workflow for knowledge discov-
[13]. Cloud service providers like Amazon, Microsoft,
ery are also discussed, as well as their applications in the
Oracle and IBM have several geographically distributed
data centers which houses a massive array of compute area of bioinformatics.
server nodes as well as storage. By means of virtu-
2.2.1 Hadoop Big Data Framework
alization technology, the required hardware resources
and computational power can be provisioned instan- The Hadoop platform has several benefits, which makes
taneously. The cloud, thus, provides the storage and it the platform of choice for big data analytics. Hadoop
computing infrastructure for storing, processing and an- is flexible and cost-effective, as it has the ability to store
alyzing big data in a shared pool of resources, i.e., a and process huge amount of any kind of data (struc-
cluster of compute nodes. However, for big data ana- tured, unstructured) quickly and efficiently by using a
lytics, apart from the infrastructure, there is a need for cluster of commodity hardware. By means of resource
a middleware to enable distributed processing across pooling, more processing power is available in the clus-
the cluster of compute nodes. It should be possible to ter in a cost-effective manner than on a single server.
develop custom applications that can be executed in Moreover, Hadoop is massively scalable as more com-
parallel on distributed biological datasets. pute nodes can be easily added in the cluster if more
Hadoop is one of the most popular and significant processing power is required. Likewise, Hadoop has
open source platforms for big data storage, and process- a very high degree of fault tolerance; if one node in
ing [14]. It enables distributed processing across clusters the cluster fails, the processing tasks are redistributed
of commodity servers, scaling up from a single server to among the other nodes in the cluster, and multiple
thousands of servers in the cluster. According to [15], copies of the data is stored in the Hadoop cluster.
the Hadoop big data analytics market is expected to Hadoop is made up of 4 core modules: the Hadoop
grow at a compound annual growth rate (CAGR) of Distributed File System (HDFS), Yet Another Resource
26.5% from 2016 to 2022. The Hadoop platform by Negotiator (YARN), Hadoop Common and MapReduce
itself cannot perform all types of processing and ana- as shown in Fig. 2.2. The Hadoop common is simply a

set of libraries and utilities used by the other Hadoop modules.

FIG. 2.2 Hadoop architecture.

In the Hadoop architecture, data is stored and processed across many distributed nodes in the cluster. HDFS is the module responsible for reliably storing data across multiple nodes in the cluster and for replicating the data to provide fault tolerance. Raw data, intermediate results of processing, processed data and results are all stored in the Hadoop cluster. HDFS is composed of a master node, also known as the NameNode, a secondary NameNode for high availability, and slave nodes called DataNodes, which are the data processing and storage units. The NameNode is a fundamental component of HDFS, as it is responsible for the upkeep of the file system namespace, and it maintains an updated directory tree of all files stored on the cluster, metadata about the files, as well as the locations of data files in the cluster. Data is stored in blocks on the Hadoop cluster, i.e., Hadoop is a block storage system (block size 64 MB for Apache Hadoop and 128 MB for Cloudera). However, Hadoop can also integrate with object stores (object storage systems) on the cloud, such as Openstack's Swift, Amazon AWS's S3A, and Azure blob storage via Wasb.
To manage cluster membership of DataNodes, coordinate resource sharing and schedule processing work on individual nodes in the cluster, Hadoop uses the YARN resource manager. Alternatively, another resource management and scheduling software for managing a cluster of compute nodes is Mesos, which can be used to manage an entire data center [27]. YARN was created to scale Hadoop. It is optimized for scheduling Hadoop jobs, but is not designed for managing an entire data center. Another open source Apache project of interest [...] non-Hadoop clusters are managed using Mesos. Cluster resources can be dynamically shared, i.e., a YARN cluster can be resized as required. MapReduce is a programming model for the parallel processing of large data sets on the distributed computing nodes in the cluster. MapReduce is the default processing framework for Hadoop, but Hadoop can also be used with other processing frameworks. MapReduce is further discussed in Section 2.2.4.1.

2.2.2 Big Data Collection and Ingestion
Data is often available from different sources, e.g., from databases, log files, online web applications, and social media networks. Similarly, data in the area of bioinformatics are generated from numerous sources, including laboratory experiments, genomics datasets, medical records and medical insurance/claims data, which are accessible online [29]. Examples include the large genomic datasets from the Encode consortium [30,31]; the combined DNA biorepositories with electronic medical record (EMR) systems for large scale, high-throughput genetic research from the Electronic Medical Records and Genomics (eMERGE) Network [32,33]; the archive of functional genomics data from ArrayExpress, which stores data from high-throughput functional genomics experiments [34,35]; various datasets from the National Center for Biotechnology Information (NCBI), such as the database of Genotypes and Phenotypes (dbGaP) [36], the Gene Expression Omnibus (GEO) repository [37], and the Reference Sequence (RefSeq) database [38,39]; and the SEER-Medicare database [40,41].
Such datasets can often be searched online, for example, by using the Basic Local Alignment Search Tool (BLAST) search algorithm [42]. The database or the underlying FASTA files used to create the database can also be downloaded using the file transfer protocol (FTP), the hypertext transfer protocol (HTTP), or the Globus Striped GridFTP [43]. The FASTA format is a ubiquitous text-based format in bioinformatics for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. FASTA sequence files are widely supported by bioinformatics tools. Such databases or FASTA files are typically large in size; for instance, the RefSeq database is 1 TB [44]. Storing such data locally and working with various sources and formats of bioin-
est is Apache Myriad, which supports YARN applica- formatics big data can be challenging.
tions running on Mesos [28]. Myriad can be used to Data ingestion is the process of collecting raw data
manage an entire data center. However, Myriad breaks from various silo databases or files and integrating it
the data center into Hadoop and non-Hadoop clus- into a data lake on the data processing platform, e.g.,
ters. Hadoop clusters are managed by YARN whereas the Hadoop data lake. A data lake is a storage repository
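The single-letter FASTA records described above are simple enough to parse directly. The sketch below is illustrative only (the record headers and sequences are invented); real pipelines would rely on established bioinformatics libraries such as Biopython rather than a hand-written reader:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record id: sequence}.

    Each record starts with a '>' header line; the following lines hold
    the sequence in single-letter nucleotide or amino-acid codes.
    """
    records, header, parts = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(parts)
            header, parts = line[1:].split()[0], []   # id = first header token
        else:
            parts.append(line)
    if header is not None:
        records[header] = "".join(parts)
    return records

# Hypothetical two-record FASTA snippet
fasta = """>seq1 example nucleotide record
ATGCGTAC
GGTA
>seq2 example record
TTGACA
"""
result = parse_fasta(fasta)
print(result)
```

Multi-line sequences are joined per record, so downstream tools can treat each entry as one contiguous string.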
CHAPTER 2 Big Data Analytics and Deep Learning in Bioinformatics With Hadoop 21

that holds a huge amount of raw data in its native format, whereby the data structure and requirements are not defined until the data is to be used. Thus, data lakes have the schema-on-read characteristic and typically store data using a flat architecture, unlike data warehouses, which store data in a highly structured repository and adopt a relational or dimensional data model. Data warehouses have the schema-on-write characteristic, which means that the data structure is defined before the data is stored. Data lakes are thus more agile, as data can be easily configured and reconfigured following different models during the analysis. Today, several data ingestion tools are available for ingesting a variety of data onto Hadoop. The following three types of input data can be distinguished:
• Structured data has a strong schema, e.g., from relational databases and FASTA sequence files.
• Unstructured data does not have any structure and can be of any form, e.g., medical imaging, electronic health records, clinical trial results, medical sensors.
• Semi-structured data has some structure but is not strictly organized, e.g., XML files for electronic patient records.
The complexity of ingestion tools thus depends on the format and the quality of the data sources. These ingestion tools are capable of some pre-processing and staging. Some of these tools are described as follows.

2.2.2.1 Apache Sqoop (SQL-to-Hadoop)
If the data to be processed comes from structured datastores, Apache Sqoop can be used for transferring bulk data in both directions between relational databases or data warehouses and HDFS or Hadoop data stores such as HBase or Hive. Sqoop reads the relational database management system (RDBMS) schema description to gather the metadata for the data to be imported, and then it transfers the required table data. The data is captured as a set of serialized files or SequenceFiles containing a copy of the imported table data or datasets. Such files are then saved as comma-separated files, named after the source table, to a directory on HDFS [45]. SequenceFiles are flat files consisting of binary key/value pairs, extensively used in MapReduce as input and output formats. Intermediate or processed data can also be exported from Hadoop to an RDBMS datastore using Sqoop.

2.2.2.2 Apache Flume
Bioinformatics also involves high-throughput streaming real-time data, such as the output of DNA sequencing or data captured from health sensors. Such data consists of a continuous stream arriving at a specific rate. Flume is a distributed and reliable ingestion tool that can be used to collect and aggregate streaming data from many different sources and to push out the serialized data, using mechanisms called data sinks, to a centralized data store such as HDFS or HBase on Hadoop, or Cassandra. Flume is tightly integrated with the Hadoop ecosystem, i.e., Flume has HDFS sinks to integrate data onto HDFS. The Flume topology consists of the source, channel and sink. Flume clients send data to the source, which keeps the data in a temporary buffer called the channel. Data flows from the channel to a sink. A typical Flume architecture is shown in Fig. 2.3.

FIG. 2.3 Apache Flume architecture.

2.2.2.3 Apache Kafka
Apache Kafka is an open source system for ingesting data from several sources in real time. While it was not specifically designed for Hadoop, it can be used to collect high-throughput parallel data for loading into Hadoop. Kafka uses a publish–subscribe system similar to a messaging system. The Kafka system is made up of publishers, the Kafka cluster, and subscribers (consumers of data). Data (messages) emitted by publishers are stored as logs in the Kafka cluster. A typical architecture of Kafka is shown in Fig. 2.4. Kafka forwards data to the subscriber as and when required. Messages are organized into topics, topics are further split into partitions, and partitions are replicated across the nodes – called brokers – in the cluster. Subscribers can be publishers and vice versa. Kafka is more easily scalable and more fault-tolerant than Flume.

FIG. 2.4 Apache Kafka architecture.

Apache Kafka has been used in several works related to bioinformatics. In [46], Kafka was used to ingest data from the RefSeq dataset from the NCBI's datastores. Using Kafka, the data was lightly structured and stored such that it is more amenable to parallel access and streamed processing. In [47], the authors proposed the adoption of Kafka stream processing to simplify the genomic processing pipeline, improve its performance and improve fault tolerance. The European Bioinformatics Institute (EMBL-EBI), which maintains a comprehensive range of molecular data resources, supports and encourages the use of the Kafka Streams API. A prototype using the Kafka Streams API to ingest data, aggregate logs and display results online in a dashboard has been detailed [48].

2.2.3 Data Staging and Storage on Hadoop
Big data typically consists of data that is semi-structured or unstructured and which cannot always be represented in rows and columns as in traditional databases. With the Hadoop big data framework, data is stored as files in the HDFS distributed file system, which allows storing data across multiple nodes in a Hadoop cluster. However, Apache Hadoop does not offer real-time random-access capabilities. For applications or processing that require reading and writing data in real time, i.e., with very low latency, the Apache HBase NoSQL Hadoop database is preferred. HBase provides low latency, is highly scalable, and can be integrated with MapReduce for processing. HBase is a column-oriented big data store and, being built on top of HDFS, the data stored in HBase is eventually stored in HDFS. In [49], the authors experimented with using HBase for storing 9 billion patient records.

For structured data, there is Apache Hive, which is a data warehouse infrastructure built on top of Hadoop. Data from HDFS can be moved into Hive by using extract, transform and load (ETL) tools. Hive can also be used to query data stored in HDFS, HBase or other file systems or databases such as Cassandra. Hive consists of an SQL engine on top of Hadoop for processing big data using MapReduce jobs through SQL-like queries, and it is thus very convenient for data analysts who are more familiar with SQL than with MapReduce programming [50]. Apache Hive is also often used with Apache Pig to process, transform and analyze data in Hive. Apache Pig provides a high-level language (Pig Latin) for processing data in Hadoop; the processing is eventually done using MapReduce jobs [51]. Pig is specifically designed for ETL data pipelines and iterative data processing, and it supports user-defined functions (UDFs). In [52], BioPig – MapReduce and Pig – was used to analyze large-scale sequence bioinformatics data.

2.2.4 Data Processing and Analysis Frameworks
After data has been collected and ingested, big data is available for processing and analysis. Data processing for analysis varies depending on the type of insights desired and the flow of data along the data analysis pipeline. Often, data is processed a number of times using a single tool or different tools to get useful insights. Big data processing frameworks are often classified into: (i) batch-only frameworks; (ii) stream-only frameworks; and (iii) hybrid frameworks.

2.2.4.1 Batch Processing Only Framework – MapReduce
Batch processing frameworks are ideal for processing extremely large datasets that require significant computation. Such datasets are typically bounded (a finite collection of data) and persistent, i.e., stored on some permanent storage. Batch processing is suitable for processing which is not time sensitive, as processing a large dataset takes time. The most popular batch processing framework is Apache Hadoop's MapReduce. MapReduce is a Java-based system for processing large datasets in parallel. It reads data from HDFS and divides the dataset into smaller pieces. Each piece is then scheduled and distributed for processing among the nodes available in the Hadoop cluster. Each node performs the required computation on its chunk of data, and the intermediate results obtained are written back to HDFS. These intermediate outputs may then be assembled, split and redistributed for further processing, until the final results are written back to HDFS. The MapReduce programming model for processing data consists of two distinct tasks performed by programs: a Map job and a Reduce job. Typically, the Map job starts by taking a set of data and converting it into another set of data in which individual elements are broken into tuples consisting of key/value pairs. These key/value pairs may then be shuffled, sorted, and processed by one or more Map jobs. The Reduce job usually takes the outputs of a Map job as its input and combines those data tuples into a smaller set of tuples.

Many bioinformatics applications and tools use the MapReduce framework. In [53], the following MapReduce-based tools and programming environments for the development of bioinformatics applications are listed: BioPig [52], Cloudgene, FASTdoop, GATK, Hadoop-BAM, SeqPig and SparkSeq. MapReduce has also been adopted in (a) algorithms for single nucleotide polymorphism identification, e.g., BlueSNP and Crossbow; (b) gene expression analysis, e.g., Eoulsan, FX, MyRNA, YunBe; (c) sequence comparison, e.g., CloudBLAST [54], bCloudBLAST [55], HAFS, K-mulus, Nephele, and Strand; (d) genome assembly, e.g., CloudBrush and Contrail; (e) sequencing read mapping, e.g., BlastReduce, CloudAligner, CloudBurst, and SEAL. Other MapReduce-based applications in bioinformatics include Big-Bio [56], an implementation of the BLAST (basic local alignment search tool) algorithm in Hadoop MapReduce [57].

Today, it is possible to access a Hadoop cluster in the cloud. However, when using MapReduce-based bioinformatics tools in the cloud, if the Hadoop parameters are not set appropriately, there can be resource underutilization while having to pay considerable cloud computing costs. Several recent research works have been conducted on the deployment and use of MapReduce in the cloud for bioinformatics computations. A cloud framework has been proposed in [58] to easily deploy bioinformatics tools (several MapReduce-based tools) on a cloud virtualization platform based on Hadoop for Bioinformatics-as-a-Service. In [59], the authors worked on defining the Hadoop parameters for fine-tuning MapReduce so as to obtain better performance in the cloud. In [60], to address the difficulty of composing complex workflows from multiple bioinformatics MapReduce tools, the authors proposed that two existing systems, namely Cloudgene and CloudMan, be integrated to enable the delivery of MapReduce applications in the cloud. In [61], a novel implementation of the partial order alignment (POA) algorithm on a multi-node Hadoop cluster running the MapReduce framework, implemented in the Amazon AWS cloud, was proposed. In [62], a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework in the cloud was proposed.

2.2.4.2 Stream Processing Only Framework
For datasets which require real-time processing, e.g., sensor data being captured in real time, the MapReduce framework is not suitable. Often in such cases, the available data has to be processed immediately, as soon as it is collected, so as to be able to take reactive measures based on the results of the output, e.g., in the control system of a manufacturing plant. Such a dataset is an unbounded set, and processing is done on the data which is available, i.e., the working dataset, which is the amount of data that has been ingested by the system so far. Stream processing frameworks usually process data continuously and do not "end" unless explicitly stopped. The results of processing are available in near-real time and are continually updated in a dashboard as new data is ingested and processed. A characteristic of stream processing frameworks is in-memory computing, whereby most processing is done in the cluster's memory and only the final output is stored on a storage disk. Popular stream processing frameworks, which can be integrated with Hadoop, are Apache Storm and Apache Samza. Bioinformatics workloads usually do
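The Map/Reduce flow described above can be sketched in plain Python as a single-process illustration of the programming model. The k-mer counting task, the function names and the reads are invented for illustration; this is not the distributed Hadoop implementation:

```python
from itertools import groupby

# Map job: convert each input record into (key, value) tuples --
# here, every 3-mer of a read is emitted with a count of 1
def map_job(read):
    return [(read[i:i + 3], 1) for i in range(len(read) - 2)]

# Shuffle/sort: sort the intermediate tuples and group values by key
def shuffle(pairs):
    pairs = sorted(pairs)
    return [(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda p: p[0])]

# Reduce job: combine each key's values into a smaller set of tuples
def reduce_job(key, values):
    return (key, sum(values))

reads = ["ATGATG", "TGATG"]          # hypothetical sequencing reads
intermediate = [kv for r in reads for kv in map_job(r)]
result = dict(reduce_job(k, vs) for k, vs in shuffle(intermediate))
print(result)
```

In Hadoop, the map calls, the shuffle and the reduce calls would each run in parallel across the cluster nodes, with the intermediate tuples written to and read from HDFS.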

not require real-time processing, and thus such frameworks would not be used by themselves for processing biological data. However, such a stream processing framework may be adopted along the data processing pipeline to improve processing efficiency.

2.2.4.3 Hybrid Processing Framework
Hybrid processing frameworks are capable of handling both batch and stream processing. Two examples of hybrid frameworks are Apache Spark and Apache Flink, both of which are adopted in bioinformatics, with Apache Spark being the more popular of the two. Both frameworks offer lower data processing latencies compared to MapReduce and use in-memory processing. Both can be plugged into Hadoop and used instead of MapReduce, though they can also work on other underlying frameworks such as Mesos.

Apache Spark is mainly a batch processing framework with stream processing capabilities, which operates using a master/slave architecture. The master coordinator, called the driver, takes streaming data and converts it into small microbatches. These microbatch datasets are stored in memory as a resilient distributed dataset (RDD) and are dynamically distributed for processing across slave nodes (known as executors) in the cluster for load balancing and faster processing. All processing is done in memory, unlike with MapReduce, where intermediate results are written to HDFS and have to be fetched for the next stage of the computation. Spark is up to 100 times faster than MapReduce when data is processed in memory, and 10 times faster than Hadoop in terms of disk access. Only the end results are persisted on storage, which reduces the processing latency in Spark. Processing is further optimized by the use of a directed acyclic graph (DAG) for defining a graph of tasks, which allows implementing complex data processing algorithms more efficiently. Moreover, Spark is also highly fault-tolerant; if one node fails, the failed tasks are redistributed across the other nodes. Fig. 2.5 depicts the Spark architecture. Apart from data processing, Apache Spark also includes other components, such as an SQL engine, a machine learning library and a graph processing engine, built atop the Spark Core, as shown in Fig. 2.6.

FIG. 2.5 Apache Spark architecture.

FIG. 2.6 Apache Spark ecosystem.

Several bioinformatics applications exist on Apache Spark. In a recent survey [63], the authors identified the following Spark-based applications: (a) for sequence alignment and mapping: SparkSW, DSA, CloudSW, SparkBWA, StreamBWA, PASTASpark, PPCAS, SparkBLAST, and MetaSpark; (b) for the assembly phase in
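Spark's lazy, DAG-driven evaluation can be mimicked with Python generators: transformations merely describe work, and nothing executes until an action such as collect() materializes the result. The MiniRDD class below is an invented teaching stand-in, not the real PySpark API:

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: chained transformations stay lazy."""
    def __init__(self, data):
        self._data = data                       # iterable, possibly a generator
    def map(self, fn):                          # transformation: describes work
        return MiniRDD(fn(x) for x in self._data)
    def filter(self, pred):                     # transformation: describes work
        return MiniRDD(x for x in self._data if pred(x))
    def collect(self):                          # action: triggers evaluation
        return list(self._data)

reads = MiniRDD(["ATGC", "GGCC", "ATAT"])       # hypothetical short reads
pipeline = (reads
            .map(lambda s: s.count("G") + s.count("C"))   # GC count per read
            .filter(lambda n: n >= 2))
result = pipeline.collect()                     # nothing ran until this call
print(result)
```

In real Spark, each transformation adds a node to the DAG, and the scheduler only ships work to executors when an action forces evaluation, which is what makes whole-pipeline optimization possible.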



the sequence analysis workflow: Spaler, SA-BR-Spark; (c) for sequence analysis: HiGene, GATK-Spark, and SparkSeq. Spark is also used in other biological applications, such as in (a) epigenetics: for example, in [64] the authors proposed a novel CpG box model and a Markov model to investigate CpG islands so as to make the analytic process faster; (b) phylogeny, e.g., CloudPhylo; (c) drug discovery, e.g., S-CHEMO; (d) single-cell RNA sequencing (scRNA-seq), e.g., Falco; (e) variant association and population genetics studies, e.g., VariantSpark, SEQSpark. Moreover, the Biospark framework [65], which uses Hadoop and Spark, allows storing and analyzing large numerical datasets generated from biological simulations and experiments.

Apache Flink is still a new technology, unlike Spark, which is more mature. Flink is independent of Hadoop but can be integrated with it. Just like Spark, Flink supports in-memory computation, which makes it as fast as Spark, but Flink is more powerful, as it can perform batch, true stream, as well as graph processing.

A few works in the field of bioinformatics have started using Flink. In [66], Apache Flink and MapReduce were used to constitute a sequence alignment pipeline for processing raw data produced by Illumina sequencers. The authors demonstrated that the proposed pipeline has very good scalability and is fault tolerant. In [67], the authors further exploited the use of Apache Kafka together with Apache Flink to implement the first processing phases for Illumina sequencing data, with positive results and improvements.

2.2.5 Big Data Analysis and Visualization
According to [68], data analytics can be categorized into three levels of analysis: descriptive, predictive and prescriptive analytics. Descriptive data analysis is used to provide summaries about the data, identify basic features of the data, and identify patterns and relationships that describe the data properties. It is perhaps the easiest type of analysis that can be done on bioinformatics data. Predictive data analytics aims to observe and determine patterns in the dataset so as to be able to predict future outcomes, such as viral evolution. Prescriptive data analytics is usually the final stage of data analytics; it allows taking a course of action to bring improvements based on findings from the descriptive and predictive analysis of the data. Descriptive data analysis can be easily handled by the tools presented in Section 2.2.4. Predictive and prescriptive data analytics are still in their early stages in the area of bioinformatics; machine learning techniques are mostly best suited for such analytics of data, as described in Section 2.3. In [69], the authors summarize the different data mining algorithms for data analytics for solving bioinformatics problems, such as clustering, association rule mining, logistic regression, support vector machines (SVM), and decision trees.

Several tools exist for business intelligence that can capture data from databases, analyze it and display the results on a dashboard. Typical examples include Tableau, Qlik, Pentaho, and Datameer. However, for bioinformatics, given that the processing is more diverse and complex, the available almost "plug-and-play" tools for business intelligence are not suitable. For the Hadoop environment, a few open-source tools such as Elastic Stack, Apache Zoomdata, and Apache Zeppelin can be used for data analysis and visualization. In [70], it was reported that researchers at the Scripps Research Institute are using Elasticsearch and Kibana to analyze and track data from DNA. Elastic Stack is a collection of open-source tools for analyzing big data. It is composed of the following tools: Beats and Logstash for collecting machine data, Elasticsearch, which is based on Apache Lucene, for searching and analyzing data, and Kibana for visualization. The Elasticsearch–Hadoop (ES–Hadoop) connector allows using the Elastic Stack on data processed and stored on Hadoop.

2.3 MACHINE LEARNING FOR BIG DATA ANALYSIS
Today, machine learning is commonly used for predictive analytics of big data. Machine learning is a subfield of artificial intelligence (AI) and is based on the idea that systems can learn from examples and experience by training on data inputs, without relying on explicit programming. Machine learning can thus facilitate data analysis, as it automates analytical model building. Today, machine learning is used extensively in various industries, such as the automobile industry (e.g., self-driving cars), genetics (to immensely improve the understanding of the human genome), healthcare, financial services, environment and climate change, retail, energy, entertainment media, and social media [71]. According to Gartner [72], data science and machine learning are becoming critical for differentiation, and sometimes survival, in business.

Machine learning plays an important role in solving numerous bioinformatics problems, such as gene finding algorithms, gene expression, genome alignment, GWAS and genomic selection. There is widespread application of machine learning in bioinformatics, as immense amounts of molecular biology data are now available [73], and due to the highly complex nature of many problems in bioinformatics, whereby manually
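At the descriptive level of analysis discussed in Section 2.2.5 above, even simple summaries carry information. A minimal sketch using only Python's standard library; the expression values are invented for illustration:

```python
import statistics

# Hypothetical expression values for one gene across six samples;
# the outlier 8.0 pulls the mean well away from the median
expression = [2.1, 2.5, 1.9, 2.4, 2.2, 8.0]

summary = {
    "mean": round(statistics.mean(expression), 2),
    "median": round(statistics.median(expression), 2),
    "stdev": round(statistics.stdev(expression), 2),   # sample std deviation
}
print(summary)
```

The gap between mean and median is itself a basic feature of the data, of the kind descriptive analysis is meant to surface before predictive modeling begins.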

developing specialized algorithms that will solve them perfectly is impractical, if not impossible. Machine learning algorithms have proven to be quite effective at detecting patterns and are being applied in bioinformatics applications with great success. In [74], four machine learning models, namely neural networks (NNs), cellular automata (CAs), random forests (RFs) and multifactor dimensionality reduction (MDR), which have been used to successfully detect and characterize gene–gene interactions, are discussed. In [75], a convolutional neural network (CNN) architecture was proposed and evaluated for detecting cancer metastases in gigapixel microscopy images. In [76], machine learning techniques, namely the SVM, the artificial neural network (ANN), and a hybrid of these techniques, are reviewed for DNA sequence classification. Many machine learning algorithms are readily available, and in [77] the authors conducted a thorough analysis of 13 state-of-the-art machine learning algorithms (e.g., the support vector classifier (SVC), K-nearest neighbors (KNN), and decision trees (DT)) out of a set of 165 algorithms for solving the problem of data classification, to help researchers identify the best algorithm for tackling similar bioinformatics problems. In the following subsections, different machine learning approaches and deep learning techniques are introduced. Open source tools for solving bioinformatics problems using machine learning on Hadoop are also described.

2.3.1 Machine Learning Methods
The four main machine learning methods are supervised machine learning, unsupervised machine learning, semi-supervised machine learning and reinforcement learning.

2.3.1.1 Supervised Machine Learning
Supervised machine learning consists of training a program using a training dataset comprising inputs and outputs (labeled with the correct output), such that when new data is input, the system can reach an accurate conclusion. The machine learning task is thus to infer a function that maps an input to an output based on the input–output pairs from the example training dataset. The algorithm learns by comparing its actual output with the correct outputs to find errors and consequently improves the model iteratively until an acceptable level of performance is reached.

Assuming an input variable x and an output variable Y, the machine learning algorithm is used to learn the mapping function f, where Y = f(X), such that when new input data x is used, it can accurately predict the output variable Y for that input. In practice, the input x often represents multiple data points, such as x1, x2, x3, ..., in which case the predictor function f(x) has the following form, assuming three input components, where a0, a1, a2 and a3 are constants:

f(x) = a0 + a1x1 + a2x2 + a3x3   (2.1)

By iteratively finding the best possible values for a0, a1, a2 and a3, the predictor function f(x) is perfected. Fig. 2.7 depicts the two-step supervised machine learning process.

FIG. 2.7 Supervised machine learning process.

Most practical machine learning uses the supervised learning method. The supervised learning task can be either a regression or a classification problem. A classification problem is when the output variable is a category, i.e., a prediction can take only a finite number of values. For example, given the set of input features, the predictor function should predict whether a tumor is benign or malignant. The classification problem can be of two
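The iterative perfecting of the predictor in Eq. (2.1) can be demonstrated with gradient descent on a one-component version, f(x) = a0 + a1x1. This is a minimal sketch; the learning rate, iteration count and training pairs are invented for illustration (the true relationship behind the data is y = 1 + 2x):

```python
# Training pairs (input, correct output) for supervised learning
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

a0, a1, lr = 0.0, 0.0, 0.05      # start from arbitrary constants
for _ in range(2000):
    # Gradient of the mean squared error with respect to a0 and a1:
    # each step compares actual outputs f(x) with correct outputs y
    g0 = sum((a0 + a1 * x) - y for x, y in zip(xs, ys)) / len(xs)
    g1 = sum(((a0 + a1 * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    a0, a1 = a0 - lr * g0, a1 - lr * g1

print(round(a0, 3), round(a1, 3))
```

The loop is exactly the compare-and-correct cycle described in the text: the error between predicted and labeled outputs drives the update of the constants until the predictor stabilizes near a0 = 1 and a1 = 2.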



types: binary classification or multi-class classification. Binary classification is where the output can be one of two possible values or classes, usually the values 1 or 0, whereas multi-class classification is where the output can be one out of three or more classes, e.g., when predicting the type of cancer. Machine learning algorithms for classification problems include decision trees, logistic regression, naive Bayes, K-nearest neighbors, random forests, and the linear SVC (support vector classifier). In [78], the authors describe the use of decision tree-based methods in computational and systems biology. An important classification task in bioinformatics is the classification of microarray data. In [79], a random forest was proposed for gene selection and classification of microarray data. In [80], the support vector machine (SVM) was used as an effective method for gene classification.

A regression problem is when the output variable to be predicted takes a real or continuous value, such as temperature or weight. Typical regression algorithms include linear regression, regression trees (e.g., random forests), and support vector regression (SVR). The simplest model is simple linear regression, which tries to find a statistical relationship between two continuous variables by drawing the line that best fits the data. In [81], several regression approaches for microarray data analysis were presented, including the support vector machine (SVM).

2.3.1.2 Unsupervised Machine Learning
In supervised machine learning, training datasets with labeled data are used, whereas in unsupervised machine learning no labeled datasets are used. With unsupervised machine learning, the system is required to analyze the actual data to find similarities, patterns and correlations, in order to explore and learn about relationships within the data. Unsupervised machine learning is suitable for data that may have little knowledge associated with it, for instance, to address questions such as "What patterns exist in the gene expression of cancers?". The two popular unsupervised machine learning tasks are the clustering of data and the dimensionality reduction of data.

Clustering is the process of finding similarities in unlabeled data so as to group similar data items together into a cluster. Different types of clustering methods are available, whereby every methodology follows a different notion or set of rules for defining the degree of similarity among data points. Fig. 2.8 depicts the different clustering techniques. According to [82], the most typical use of clustering in bioinformatics is the clustering of genes in expression data. Typically, a few samples of DNA microarrays allow measuring the expression levels of a large number of genes. Clustering can be used to group genes with a similar expression level in all the samples into a cluster.

FIG. 2.8 Clustering techniques.

The two most widely used clustering algorithms in machine learning are K-means clustering and hierarchical clustering. K-means clustering is a type of partitional clustering algorithm; more specifically, it follows the centroid model. It is an iterative clustering algorithm whereby the notion of similarity is based on the closeness of a data point to a centroid of the clusters. The K-means algorithm partitions the given data into K clusters (K is defined by the user, which implies some prior knowledge of the dataset), where each cluster has a cluster center known as the centroid. Initially, the K cluster centers are randomly set and the data items are assigned to the nearest cluster center. The K cluster centers are then reevaluated based on the initial membership of data items to the clusters. The closeness of the data points to the new cluster centers is evaluated, and the process is iterated until the data items do not change cluster membership. Expectation maximization (EM), also called soft clustering, is another popular clustering technique which is of the partitional type but follows a model-based
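The K-means iteration just described (assign each point to its nearest centroid, recompute the centroids, repeat until membership is stable) can be written out directly. A minimal one-dimensional sketch with invented expression values and initial centers; real analyses would use a library implementation such as scikit-learn or Spark MLlib:

```python
def kmeans_1d(points, centers, max_iter=100):
    """K-means on 1D data: alternate assignment and centroid update
    until cluster membership stops changing."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[idx].append(p)
        # Update step: recompute each centroid from its members
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:       # membership no longer changes
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 1D expression levels, K = 2 with user-chosen initial centers
centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.8, 10.0, 10.2], [0.0, 5.0])
print(centers, clusters)
```

With these values the centroids settle near 1.0 and 10.0 after two passes, mirroring the convergence criterion in the text: iteration stops once no data item changes cluster.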



algorithm. In [83], the authors propose a novel clustering algorithm which is based on the K-means algorithm and incorporates gene information from the Gene Ontology into the clustering process to obtain more biologically meaningful clusters. Similarly, in [84], the K-means clustering has been enhanced to obtain better performance related to cancer subtype prediction from gene expression data. In [85], both the K-means and Markov clustering algorithms were used to identify key genes of interest to the study.

Unlike the partitional clustering techniques, which attempt to place each data item in exactly one cluster, i.e., non-overlapping clusters, hierarchical clustering is an approach that allows for subclusters, i.e., a set of nested clusters that are organized as a tree. There are two types of hierarchical clustering: divisive and agglomerative. With agglomerative hierarchical clustering, the algorithm initially assigns each data point to a cluster of its own. Then the two nearest clusters are merged into one to form a subcluster. The algorithm iterates until finally there is a single cluster. The result of the clustering can be visualized as a dendrogram. An example of agglomerative hierarchical clustering is single-linkage clustering (SLC). Divisive hierarchical clustering starts with all data points as one cluster and then splits the cluster recursively into subclusters until, finally, subclusters consisting of only one data point remain. In [86], a hierarchical clustering approach has been adopted whereby many hierarchical organizations of gene clusters corresponding to subhierarchies in gene ontology were successfully captured. Another example of a method for cluster analysis is the self-organizing map (SOM), which uses neural networks.

Another unsupervised learning task is dimension reduction. Extremely large datasets of unlabeled data are available in bioinformatics. These datasets may contain thousands of records with numerous attributes or features. Working with such large numbers of high-dimensional records is quite computing intensive, and often a lot of the data is redundant [87]. Dimensionality reduction refers to methods used to reduce or combine the complexity of the data by using fewer features while keeping as much relevant structure as possible, i.e., with minimal loss of information. Often dimension reduction is done along the data analytics pipeline before applying a supervised machine learning algorithm. Two popular algorithms used to reduce dimensionality are principal component analysis (PCA), which aims to find the linear combination that best preserves the variance in the data, and singular-value decomposition (SVD), which factorizes the data matrix into three smaller matrices.

A self-organizing map (SOM) can also be used for dimensionality reduction. It uses an ANN that is trained using unsupervised learning to produce a low-dimensional representation of the input space of the training samples, called a map. The authors of [88] describe PCA based methods in bioinformatics studies. In [89], SVD is used for pathway level analysis of gene expression. In [90], an improved prediction of protein functional associations through singular value decomposition of phylogenetic profiles is presented.

2.3.1.3 Semi-Supervised Machine Learning
In many real-world machine learning problems, typically a small amount of labeled data and a large dataset of unlabeled data are available. Often a large amount of unlabeled data is easily acquired, but the cost of labeling data to generate a labeled training dataset for supervised learning is high. It is therefore desirable to combine the explicit classification information of labeled data with the information in the unlabeled data to construct an accurate learning model. Semi-supervised learning is a combination of supervised and unsupervised learning which aims to address such cases. Several algorithms have been proposed for semi-supervised learning, such as EM based algorithms [91], self-training [92], co-training [93], semi-supervised SVM (S3VM) [94], graph-based methods [95], and boosting based semi-supervised learning methods [96]. Deep generative models are also being used for semi-supervised learning [97,98].

Semi-supervised learning is widely adopted in bioinformatics. In [99], to make the most of the vast amount of microarray data that lacks sufficient follow-up information, the low-density separation (LDS) semi-supervised learning technique has been applied to predict the recurrence risk in cancer patients. Similarly, in [100], the authors use the harmonic Gaussian, graph-based semi-supervised learning algorithm to predict disease genes; it accounts for the imbalance between known disease genes and unknown disease genes. Given the small amount of annotation of functional and structural attributes of protein sequence data, in [101], it is shown that classification based semi-supervised learning can increase the overall accuracy of classifying partly labeled data and so improve predictive performance. In [102], the authors investigated the use of semi-supervised learning and successfully improved recall in the BioNLP Gene Regulation Network task. Likewise, to address the task of gene regulatory network reconstruction from
CHAPTER 2 Big Data Analytics and Deep Learning in Bioinformatics With Hadoop 29

high-throughput data, in [103], the authors exploited an iterative, semi-supervised ensemble-based algorithm for making inference on a prokaryotic model organism (E. coli) and on a eukaryotic model organism (S. cerevisiae). Their approach shows improved performance as compared to other techniques.

2.3.1.4 Reinforcement Learning
Reinforcement learning is a computational approach of learning from action in the absence of a training dataset, i.e., learning from experience by trial and error to determine which actions yield the greatest reward. Reinforcement learning consists of three primary components: (i) the agent (the learning agent); (ii) the environment (the agent interacts with the environment); and (iii) the actions (the agent can take actions). An agent learns from the environment by interacting with it and receiving rewards for performing actions. Such learning is goal or task oriented; the agent learns how to attain its goal by taking the best actions so as to maximize the reward over a given time period. A task can be either episodic or continuous. Episodic tasks have a starting point and an ending point (terminal state), whereas continuous tasks are those that have no terminal state, i.e., the agent will continuously run until explicitly stopped. Reinforcement learning is often used for robotics and gaming. Two popular methods of reinforcement learning are the Monte Carlo method and the temporal difference learning method. In bioinformatics, reinforcement learning has been used for solving the fragment assembly problem [104], the bidimensional protein folding problem [105], RNA reverse folding [106], and the 2D-HP protein folding problem [107], amongst others.

2.3.2 Deep Learning and Neural Networks
Deep learning is a subfield of machine learning focusing on algorithms which attempt to imitate the function of the human brain for learning. A deep learning architecture is thus inspired by the brain's structure of neural networks. With the huge computing power available today (e.g., graphics processing units, GPUs), deep learning is powered by artificial neural networks (ANNs) for analyzing big data and solving complex problems. Neural networks have been around for a while, but modern ANNs are "deep", i.e., a traditional neural network typically consists of two or three hidden layers, whereas ANNs nowadays can have as many as 150 layers. It is possible to use ANNs to build and train models in a time efficient manner. With a deep learning model, the algorithms not only learn but can also determine on their own whether a prediction is accurate or not. Applications of deep learning include automatic driving, automatic hearing and speech translation, and automatic detection of cancer cells.

2.3.2.1 Artificial Neural Networks (ANNs)
The way humans learn and process information is controlled by the nervous system, which is made up of neurons and the different connections between the neurons. ANNs consist of neurons connected to each other. However, ANNs have discrete layers (cascades of nonlinear processing unit layers), connections and directions of data propagation. The output of one layer is used as input of the next layer. The simplest ANN consists of three layers: an input layer, a hidden layer and an output layer, as shown in Fig. 2.9. The circles represent the neurons, and the arrows represent the connections between the different neurons. For more complex tasks, the ANN will be composed of additional hidden layers. Fig. 2.10 depicts an ANN with two hidden layers. Multiple hidden layers result in a larger or deeper neural network, which usually results in enhanced learning capability.

FIG. 2.9 Simple artificial neural network.

The ANN has certain parameters: each of the connections has a number associated with it called the connection weight, and the neurons have a threshold value and an activation function associated with them. Initially, the ANN is trained with labeled data, i.e., inputs and their expected correct outputs. The ANN runs the inputs with certain values of the parameters, and the results obtained are then compared with the expected correct results. If the computed results are far apart from the expected correct results, the ANN adjusts its parameters iteratively by means of a special training algorithm, such as gradient descent with back propagation, until the computed outputs are as close as possible to the expected correct outputs. This is the learning process of the ANN. After the training phase, when new inputs are run through the ANN, there is high confidence that the predicted outputs will be close to the actual outputs.

FIG. 2.10 ANN with two hidden layers.
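The training loop described above — run the inputs through the network, compare the computed outputs with the expected correct outputs, and adjust the connection weights by gradient descent until they agree — can be illustrated with a deliberately minimal single-neuron network in plain Python. This is our own sketch, not code from the chapter, and it omits hidden layers for brevity:

```python
import math
import random

def sigmoid(x):
    """Activation function squashing the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Labeled training data: inputs with their expected correct outputs
# (here the logical AND function).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

rng = random.Random(0)
w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]  # connection weights
b = 0.0                                       # bias (threshold value)
lr = 0.5                                      # learning rate

for epoch in range(5000):
    for (x1, x2), target in samples:
        y = sigmoid(w[0] * x1 + w[1] * x2 + b)  # forward pass
        err = y - target                         # compare with expected output
        grad = err * y * (1 - y)                 # gradient of the squared error
        # Gradient descent step: nudge each weight against its error gradient.
        w[0] -= lr * grad * x1
        w[1] -= lr * grad * x2
        b -= lr * grad

# After training, the computed outputs are close to the expected ones.
predictions = [round(sigmoid(w[0] * x1 + w[1] * x2 + b))
               for (x1, x2), _ in samples]
```

A deep network applies the same idea, with back propagation distributing the error gradient through the hidden layers.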

The ANN structure is independent of the task; that is, the neural network designed for one task can be used for another task, except that the parameters may have to be reconfigured as they may be different, and the ANN will have to be retrained for the other task.

Typically, neural networks can be classified into one of the following four types: classification neural networks, prediction neural networks, clustering neural networks and association neural networks. Several neural network architectures are suitable for deep learning, namely the feed-forward neural network (FNN), recurrent neural network (RNN), recursive neural network, convolutional neural network (CNN), deep belief networks, and convolutional deep belief networks [108]. An FNN has no feedback connections whereby outputs of the model can be fed back into the system. RNNs are architectures whereby the outputs are looped back into the system. A CNN comprises one or more convolutional layers followed by one or more fully connected layers. The neurons in one convolutional layer do not connect to all the neurons in the next layer but only to a small region of it. The inputs of a CNN are images, and thus the layers are organized in 3 dimensions: width, height and depth. The final output is reduced to a single vector of probability scores, organized along the depth dimension. CNNs are very effective in areas such as image recognition and classification. In [109], applications of neural networks in protein bioinformatics are discussed and summarized. Protein bioinformatics applications include the prediction of protein structure, binding sites and ligands, and protein properties. The popular NN architectures identified for protein bioinformatics are the FNN and RNN.

2.3.3 Machine Learning and Hadoop
Big data analytics requires algorithms based on machine learning techniques to process data in real-time for better insights and intelligence. In the Hadoop framework, data is stored on a cluster of nodes. Machine learning tools can be easily integrated in the Hadoop ecosystem to exploit the scalability, reliability and resource pooling characteristics of the distributed storage and processing solution offered by Hadoop. The most popular machine learning tools that can be integrated with Hadoop are Apache Mahout, Spark MLlib, and H2O.

2.3.3.1 Spark, Mahout and MLlib
Spark supports iterative computation and has improved processing speed compared to MapReduce, as it utilizes in-memory computation using resilient distributed datasets (RDDs). The RDDs store data in memory for fast access during computation and provide fault tolerance [110]. An RDD is an immutable distributed collection of key–value pairs of data, stored across nodes in the cluster, and it can be operated on in parallel. Moreover, recent versions of Spark also support DataFrames and Datasets, which are built on top of RDDs. A DataFrame is also an immutable distributed collection of data, but one where data is organized into named columns, similar to a table in a relational database. Datasets are an extension of DataFrames which provide a type-safe, object-oriented programming interface. The RDD, DataFrame and Dataset APIs make data processing easy and provide developers with more flexibility. Using programming languages such as Python, R, Java, or Scala, machine learning algorithms and applications can easily be implemented with Spark.

Spark is shipped with the MLlib library for machine learning. Prior to MLlib, Apache Mahout was typically used for machine learning on Hadoop. However, Mahout is built atop MapReduce, which is slower than MLlib, which runs on Spark and is thus faster. Both MLlib and Mahout provide numerous algorithms for machine learning, such as classification (logistic regression, linear support vector machine (SVM), naive Bayes), regression (linear regression), collaborative filtering, clustering (k-means), decomposition (singular value decomposition (SVD)) and principal component analysis (PCA), though Mahout, being more mature, has a more extensive library of algorithms [111].

Other machine learning libraries include Scikit-Learn, which is a Python library, and H2O, which is an open-source library. H2O is a fast, scalable, machine and deep learning library. Both can be used on Spark for analyzing data on a Hadoop cluster. The Sparkling Water connector allows the integration of H2O algorithms with the capabilities of the Spark platform [112]. Spark-sklearn is the integration package for Scikit-Learn with Apache Spark.

2.3.4 Distributed Deep Learning and Hadoop
A Spark cluster manager is a platform where Spark can be run. Spark supports three cluster managers: the standalone cluster manager, Hadoop YARN and Apache Mesos. Thus, Spark can run stand-alone on one machine, on a Hadoop cluster, or on a Mesos datacenter. Machine learning may be possible on stand-alone Spark. However, as datasets increase in size and deep neural networks grow in complexity, the computational intensity and memory demands of deep learning increase, and a cluster of high-performance machines is required [113]. Two approaches may be adopted: data parallelism and model parallelism. Data parallelism involves partitioning data equally among several processing nodes; each node processes the data independently in parallel. Data parallelism is more suitable when there is a large amount of data, and it is supported by MapReduce and Spark running on a cluster. Model parallelism attempts to partition the machine learning model itself. It is more complex and challenging, and is more suitable for large learning models such as deep neural networks. In [114], BigDL, a distributed deep learning framework for big data based on data parallelism, operating on Spark, has been proposed. Most of the deep learning frameworks available today can be run on a cluster of computers.

TensorFlow [115] is the most popular framework for deep learning. Its TensorFlowOnSpark framework supports distributed deep learning on Spark clusters. According to [116], TensorFlow is a powerful and flexible gateway to deep learning in biology. In [117], TensorFlow has been used to implement a neural network for predicting the severity of Parkinson's disease. TensorFlow has also been used to implement a deep neural network to predict asthma severity level or the imminence of an asthma attack [118].

Theano is another Python library for developing deep learning models. To ease the use of TensorFlow and Theano, which may be difficult, Keras can be used to quickly develop a deep learning model. Keras is a minimalist Python library for deep learning that can run on top of Theano or TensorFlow. Keras leverages the Dist-Keras framework for achieving data parallelism on Apache Spark.

Caffe is a machine learning framework that was designed with better expression, speed, and modularity as the focus points [119]. It was developed for computer vision/image classification by leveraging convolutional neural networks (CNNs). CaffeOnSpark can be used to bring deep learning onto Hadoop and Spark clusters. H2O's deep learning is based only on a multi-layer feedforward deep neural network that is trained with stochastic gradient descent using back-propagation. In [120], using the Extended-Caffe framework, a 3D convolutional neural network was constructed to generate lung nodule proposals.

Other deep learning frameworks include Torch, Apache MXNet, the Microsoft open-source Cognitive Toolkit (CNTK) and Apache Singa. The latter is primarily focused on distributed deep learning using model parallelism on a cluster of nodes.

2.4 CONCLUSIONS
Previously, biological data computations were mostly done using HPC based multi-core processing architectures. Such computing infrastructure can be quite expensive and is not easily available. Moreover, with next generation sequencing technologies, massive amounts of bioinformatics data are available. Big data analytics and distributed computing, i.e., cloud computing, are increasingly adopted in bioinformatics applications, whereby a cluster of compute nodes is used for processing and analyzing data. The Hadoop big data framework is one of the most popular frameworks for processing big data, as it provides fault tolerance, scalability, and reliability, as well as being cost effective. In this chapter, we take a holistic approach to big data analytics and present the big data analytics workflow with regard to the Hadoop framework. The emergence of such an approach has changed the context of bioinformatics computation. We discussed the background of Hadoop technology, its core components, as well as the other components which form the Hadoop ecosystem. The study shows that bioinformatics is fully embracing the Hadoop big data framework.

Another significant technology which can revolutionize bioinformatics applications is machine learning. Machine learning is widely proposed in the literature for solving bioinformatics problems, and the different approaches of machine learning algorithms have been presented in this chapter. To address more complex problems in bioinformatics, deep learning is also being used. Moreover, machine learning can be easily plugged into the data processing and analysis pipeline of the Hadoop framework. It is expected that in the future the use of deep learning in the area of bioinformatics will greatly improve the understanding of the human genome and help find cures to numerous diseases.

Finally, bioinformatics being a complex field, no single computational method will be optimal for every dataset and every task. A successful data analysis in this area most certainly requires a combination of multiple data analysis methods. The Hadoop big data framework, which can be easily coupled with other processing, analytic or machine learning engines, is found to be most suitable. In the future, it is expected that a combination and orchestration of various solutions on the Hadoop big data framework will support enhanced bioinformatics computations. Thus, researchers and practitioners should collaborate to work towards this goal.

REFERENCES
1. D. Laney, 3D Data Management: Controlling Data Volume, Velocity, and Variety, Technical report, META Group, available at https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf, February 6, 2001.
2. Internet Live Stats, [online] Internetlivestats.com, available at http://www.internetlivestats.com/. (Accessed 24 April 2015).
3. Dany, 37 Mind Blowing YouTube Facts, Figures and Statistics – 2018, available at https://merchdope.com/youtube-statistics/, April 2018.
4. R.J. Robison, How big is the human genome?, in: Precision Medicine, Jan. 2014.
5. Eugene Rosenberg, The human genome, Ch. 11, in: It's in Your DNA. From Discovery to Structure, Function and Role in Evolution, Cancer and Aging, 2017, pp. 97–98.
6. Matthew Herper, Illumina promises to sequence human genome for $100 but not quite yet, Forbes (January 2017), available at https://www.forbes.com/sites/matthewherper/2017/01/09/illumina-promises-to-sequence-human-genome-for-100-but-not-quite-yet/#2ce9b178386d.
7. Jun Li, Yiling Lu, Rehan Akbani, Zhenlin Ju, Paul L. Roebuck, Wenbin Liu, Ji-Yeon Yang, Bradley M. Broom, Roeland G.W. Verhaak, David W. Kane, Chris Wakefield, John N. Weinstein, Gordon B. Mills, Han Liang, TCPA: a resource for cancer functional proteomics data, Nature Methods 10 (2013) 1046–1047.
8. The ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome, Nature 489 (September 2012) 57–74.
9. EMBL-European Bioinformatics Institute, EMBL-EBI Annual Scientific Report 2013, 2014.
10. H. Kashyap, H.A. Ahmed, N. Hoque, S. Roy, D.K. Bhattacharyya, Big data analytics in bioinformatics: a machine learning perspective, CoRR, arXiv:1506.05101, 2015.
11. National Institutes of Health, Big data to knowledge phase I & II, available at https://commonfund.nih.gov/bd2k, June 2018. (Accessed 27 June 2018).
12. Fabricio F. Costa, Big data in biomedicine, Drug Discovery Today 19 (4) (Apr. 2014), Reviews.
13. Peter Mell, Tim Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, Special Publication 800-145, Sept. 2011.
14. The Apache Hadoop project, http://www.hadoop.org.
15. Market Research Future, Hadoop Big Data Analytics Market Research Report – Global Forecast to 2022, Report, July 2018.
16. Sisense, Gartner magic quadrant for analytics and business intelligence platforms, available at https://www.sisense.com/gartner-magic-quadrant-business-intelligence/, Feb. 2018.
17. C. Arindam, Microsoft deepens its commitment to Apache Hadoop and open source analytics, Microsoft Azure, https://azure.microsoft.com/en-us/blog/microsoft-reaffirms-its-commitment-to-apache-hadoop-open-source-analytics/, June 2018.
18. Qlik, Why Cloudera + Qlik?, Cloudera, https://www.cloudera.com/partners/solutions/qlik.html.
19. Tableau, Just point at your Hadoop cluster to analyze your data, https://www.tableau.com/solutions/workbook/hadoop_flavors.
20. WenTai Wu, WeiWei Lin, Ching-Hsien Hsu, LiGang He, Energy-efficient Hadoop for big data analytics and computing: a systematic review and research insights, Future Generation Computer Systems 86 (Sept. 2018) 1351–1367, https://doi.org/10.1016/j.future.2017.11.010.
21. M. Malik, K. Neshatpour, T. Mohsenin, A. Sasan, H. Homayoun, Big vs little core for energy-efficient Hadoop computing, in: 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2017, pp. 1480–1485.
22. Giuseppe Cattaneo, Raffaele Giancarlo, Umberto Ferraro Petrillo, Gianluca Roscigno, MapReduce in computational biology via Hadoop and Spark, in: Reference Module in Life Sciences, Elsevier, 2018.
23. Aisling O'Driscoll, Jurate Daugelaite, Roy D. Sleator, 'Big data', Hadoop and cloud computing in genomics, Journal of Biomedical Informatics 46 (5) (Oct. 2013) 774–781.

24. Q. Zou, X.B. Li, W.R. Jiang, Z.Y. Lin, G.L. Li, K. Chen, Survey of MapReduce frame operation in bioinformatics, Briefings in Bioinformatics 15 (4) (Feb. 2013) 637–647, https://doi.org/10.1093/bib/bbs088.
25. P. Singh, Big genomic data in bioinformatics cloud, Applied Microbiology, Open Access 2 (2016) 113, https://doi.org/10.4172/2471-9315.1000113.
26. Lizhen Shi, Zhong Wang, Weikuan Yu, Xiandong Meng, A case study of tuning MapReduce for efficient bioinformatics in the cloud, Parallel Computing 61 (Jan. 2017) 83–95.
27. Apache Mesos, What is Mesos? A distributed systems kernel, http://mesos.apache.org/.
28. Apache Myriad, Deploy Apache YARN applications using Apache Mesos, https://myriad.apache.org/.
29. N. Peek, J.H. Holmes, J. Sun, Technical challenges for big data in biomedicine and health: data sources, infrastructure, and analytics, Yearbook of Medical Informatics 9 (2014) 42–47.
30. The ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome, Nature 489 (September 2012) 57–74.
31. Encode, ENCODE: Encyclopedia of DNA elements, https://www.encodeproject.org/.
32. O. Gottesman, H. Kuivaniemi, G. Tromp, et al., The electronic medical records and genomics (eMERGE) network: past, present, and future, Genetics in Medicine 15 (2013) 761–771.
33. National Human Genome Research Institute (NHGRI), Electronic medical records and genomics (eMERGE) network, https://www.genome.gov/27540473/electronic-medical-records-and-genomics-emerge-network/.
34. G. Rustici, N. Kolesnikov, M. Brandizi, et al., ArrayExpress update – trends in database growth and links to data analysis tools, Nucleic Acids Research 41 (2013) D987–D990.
35. ArrayExpress, Functional genomics data, https://www.ebi.ac.uk/arrayexpress/.
36. dbGaP, https://www.ncbi.nlm.nih.gov/gap.
37. GEO DataSets, https://www.ncbi.nlm.nih.gov/gds.
38. K.D. Pruitt, T. Tatusova, D.R. Maglott, NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Research 33 (2005) D501–D504.
39. RefSeq: NCBI reference sequence database, https://www.ncbi.nlm.nih.gov/refseq/.
40. SEER-Medicare linked database, https://healthcaredelivery.cancer.gov/seermedicare/.
41. J.L. Warren, C.N. Klabunde, D. Schrag, et al., Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population, Medical Care 40 (2002), IV–3–18.
42. I. Lobo, Basic local alignment search tool (BLAST), Nature Education 1 (1) (2008) 215.
43. W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, The Globus striped GridFTP framework and server, in: SC'05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, 12–18 Nov. 2005, Seattle, WA, USA, ISBN 1-59593-061-2, 2005.
44. Brendan Lawlor, Richard Lynch, Micheál Mac Aogáin, Paul Walsh, Field of genes: using Apache Kafka as a bioinformatic data repository, GigaScience 7 (4) (April 2018), https://doi.org/10.1093/gigascience/giy036.
45. Benjamin Bengfort, Jenny Kim, Data ingestion, Ch. 7, in: Data Analytics With Hadoop: An Introduction for Data Scientists, O'Reilly, 2016, pp. 157–173.
46. Brendan Lawlor, Richard Lynch, Micheál Mac Aogáin, Paul Walsh, Field of genes: using Apache Kafka as a bioinformatic data repository, GigaScience 7 (4) (April 2018).
47. Francesco Versaci, Luca Pireddu, Gianluigi Zanetti, Kafka interfaces for composable streaming genomics pipelines, in: 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Las Vegas, NV, USA, Mar. 2018.
48. Szymon Chojnacki, Web Production Team EBI, Genome campus software community, Apache Kafka streams API, EMBL-EBI, available at https://github.com/ebi-wp/kafka-streams-api-websockets/blob/jdisp/docs/kafka-streams-10-10-2017.pdf, Oct. 2017.
49. Dillon Chrimes, Hamid Zamani, Using distributed data over HBase in big data analytics platform for clinical services, Computational and Mathematical Methods in Medicine 2017 (2017), article 6120820, 16 pages, https://doi.org/10.1155/2017/6120820.
50. A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, R. Murthy, Hive — a petabyte scale data warehouse using Hadoop, in: Proceedings of the International Conference on Data Engineering, 2010, pp. 996–1005.
51. Hortonworks, Apache Pig, https://hortonworks.com/apache/pig/#section_1.
52. Henrik Nordberg, Karan Bhatia, Kai Wang, Zhong Wang, BioPig: a Hadoop-based analytic toolkit for large-scale sequence data, Bioinformatics 29 (23) (December 2013) 3014–3019, https://doi.org/10.1093/bioinformatics/btt528.
53. G. Cattaneo, R. Giancarlo, S. Piotto, U. Ferraro Petrillo, G. Roscigno, L. Di Biasi, MapReduce in computational biology – a synopsis, in: F. Rossi, S. Piotto, S. Concilio (Eds.), Advances in Artificial Life, Evolutionary Computation, and Systems Chemistry, WIVACE 2016, in: Communications in Computer and Information Science, vol. 708, Springer, Cham, 2017.
54. Andréa Matsunaga, Maurício Atsugewi, José Fortes, CloudBLAST: combining MapReduce and virtualization on distributed resources for bioinformatics applications, in: 2008 IEEE Fourth International Conference on eScience, Indianapolis, IN, USA, Dec. 2008, pp. 7–12.
55. Zhen Meng, Jianhui Li, Yunchun Zhou, Qi Liu, Yong Liu, Wei Cao, bCloudBLAST: an efficient MapReduce program for bioinformatics applications, in: 2011 4th International Conference on Biomedical Engineering and Informatics (BMEI), vol. 4, Shanghai, China, 2011, pp. 2072–2076.
56. Rania Ahmed Abdel Azeem Abul Seoud, Mahmoud Ahmed Mahmoud, Amr Essam Eldin, BIG-BIO: big data

Hadoop-based analytic cluster framework for bioinformatics, in: 2017 International Conference on Informatics, Health & Technology (ICIHT), Riyadh, Saudi Arabia, Feb. 2017, INSPEC Accession Number: 16836504.
57. Tadist Khawla, Mrabti Fatiha, Zahi Azeddine, Najah Said, A Blast implementation in Hadoop MapReduce using low cost commodity hardware, Procedia Computer Science 127 (2018) 69–75.
58. Guan-Jie Hua, Chuan Yi Tang, Che-Lun Hung, Yaw-Ling Lin, Cloud computing service framework for bioinformatics tools, in: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, IEEE, Nov. 2015, pp. 9–12.
59. Lizhen Shi, Zhong Wang, Weikuan Yu, Xiandong Meng, A case study of tuning MapReduce for efficient bioinformatics in the cloud, Parallel Computing (ISSN 0167-8191) 61 (2017) 83–95, https://doi.org/10.1016/j.parco.2016.10.002.
60. L. Forer, T. Lipic, S. Schonherr, H. Weisensteiner, D. Davidovic, F. Kronenberg, E. Afgan, Delivering bioinformatics MapReduce applications in the cloud, in: Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on, 2014, pp. 373–377.
61. Nafis Neehal, Dewan Ziaul Karim, Ashraful Islam, Cloud-POA: a cloud-based map only implementation of PO-MSA on Amazon multi-node EC2 Hadoop Cluster, in: 2017 20th International Conference of Computer and Information Technology (ICCIT), 22–24 Dec. 2017, Dhaka, Bangladesh, 2017.
62. Matti Niemenmaa, Aleksi Kallio, André Schumacher, Petri Klemelä, Eija Korpelainen, Keijo Heljanko, Hadoop-BAM: directly manipulating next generation sequenc-
Data (Big Data), December 5–8, 2016, Washington, DC, USA, IEEE, 2016.
67. F. Versaci, L. Pireddu, G. Zanetti, Distributed stream processing for genomics pipelines, PeerJ Preprints 5 (e3338v1) (2017), https://doi.org/10.7287/peerj.preprints.3338v1.
68. B. Gavin, Analytics network – O.R & analytics, available at https://www.theorsociety.com/Pages/SpecialInterest/AnalyticsNetwork_analytics.aspx, 2013.
69. Kalyan Nagaraj, G.S. Sharvani, Amulyashree Sridhar, Emerging trend of big data analytics in bioinformatics: a literature review, International Journal of Bioinformatics Research and Applications (ISSN 1744-5485) 14 (Jan. 2018) 144–205, EISSN: 1744-5493.
70. Mark Harwood, Uncoiling the data in DNA with Elasticsearch, Big Data Zone, https://dzone.com/articles/uncoiling-the-data-in-dna-with-elasticsearch, June 2016.
71. Bernard Marr, 27 incredible examples of AI and machine learning in practice, Forbes (April 2018), https://www.forbes.com/sites/bernardmarr/2018/04/30/27-incredible-examples-of-ai-and-machine-learning-in-practice/#7168f1f17502.
72. Rob van der Meulen, 5 ways data science and machine learning impact business, in: Smarter with Gartner, February 2018, https://www.gartner.com/smarterwithgartner/5-ways-data-science-and-machine-learning-impact-business/.
73. H. Bhaskar, D.C. Hoyle, S. Singh, Intelligent technologies in medicine and bioinformatics, Computers in Biology and Medicine 36 (2006) 1104.
74. B.A. McKinney, D.M. Reif, M.D. Ritchie, J.H. Moore, Machine learning for detecting gene–gene interactions: a review, Applied Bioinformatics 5 (2006) 77.
75. Y. Liu, K. Gadepalli, M. Norouzi, G.E. Dahl, T. Kohlberger, A. Boyko, S. Venugopalan, A. Timofeev, P.Q. Nelson, G.S.
ing data in the cloud, Bioinformatics Applications Corrado, J.D. Hipp, L. Peng, M.C. Stumpe, Detecting Can-
Note 28 (6) (2012) 876–877, https://doi.org/10.1093/ cer Metastases on Gigapixel Pathology Images, 2017.
bioinformatics/bts054. 76. Pooja Dixit, Ghanshyam I. Prajapati, Machine learning
63. R. Guo, Y. Zhao, Q. Zou, X. Fang, S. Peng, Bioinformatics in bioinformatics: a novel approach for DNA sequenc-
applications on Apache Spark, GigaScience 7 (8) (2018), ing, in: 2015 Fifth International Conference on Advanced
https://doi.org/10.1093/gigascience/giy098. Computing & Communication Technologies, February
64. N. Yu, B. Li, Y. Pan, A cloud-assisted application 21–22, 2015, Haryana, India, 2015.
over Apache Spark for investigating epigenetic mark- 77. R.S. Olson, W. La Cava, Z. Mustahsan, A. Varik, J.H.
ers on DNA genome sequences, in: Big Data and Moore, Data-driven advice for applying machine learn-
Cloud Computing (BDCloud), Social Computing and ing to bioinformatics problems, in: Pacific Symposium
Networking (SocialCom), Sustainable Computing and on Biocomputing, vol. 23, 2018, pp. 192–203.
Communications, (SustainCom) (BDCloud-SocialCom- 78. Pierre Geurts, Alexandre Irrthum, Louis Wehenkel, Super-
SustainCom), 2016 IEEE International Conferences on, vised learning with decision tree-based methods in com-
IEEE, 2016, pp. 67–74. putational and systems biology, Molecular BioSystems
65. Max Klein, Rati Sharma, Chris H. Bohrer, Cameron 5 (12) (2009) 1593, https://doi.org/10.1039/b907946g.
M. Avelis, Elijah Roberts, Biospark: scalable analysis of 79. Ramón Díaz-Uriarte, Sara Alvarez de Andrés, Gene selec-
large numerical datasets from biological simulations and tion and classification of microarray data using random
experiments using Hadoop and Spark, Bioinformatics forest, BMC Bioinformatics 7 (1) (2006) 1.
33 (2) (January 2017) 303–305, https://doi.org/10.1093/ 80. C. Devi Arockia Vanitha, D. Devaraj, M. Venkatesulu,
bioinformatics/btw614. Gene expression data classification using support vector
66. Francesco Versaci, Luca Pireddu, Gianluigi Zanetti, Scal- machine and mutual information-based gene selection,
able genomics: from raw data to aligned reads on Apache Procedia Computer Science (ISSN 1877-0509) 47 (C)
YARN, in: 2016 IEEE International Conference on Big (2015) 13–21.
CHAPTER 2 Big Data Analytics and Deep Learning in Bioinformatics With Hadoop 35

81. M.R. Segal, K.D. Dahlquist, B.R. Conklin, Regression ap- 93. J. Tanha, M. van Someren, H. Afsarmanesh, Dis-
proaches for microarray data analysis, Journal of Compu- agreement-based co-training, in: Tools with Artificial In-
tational Biology 10 (6) (2003) 961–980. telligence (ICTAI), 2011, pp. 803–810.
82. Pedro Larrañaga, Borja Calvo, Roberto Santana, Concha 94. K. Bennett, A. Demiriz, Semi-supervised support vector
Bielza, Josu Galdiano, Iñaki Inza, José A. Lozano, Rubén machines, in: Proceedings of the 1998 Conference on Ad-
Armañanzas, Guzmán Santafé, Aritz Pérez, Victor Robles, vances in Neural Information Processing Systems (NIPS),
Machine learning in bioinformatics, Briefings in Bioin- vol. 11, MIT Press, Cambridge, 1999, pp. 368–374.
formatics 7 (1) (March 2006) 86–112, https://doi.org/10. 95. M. Belkin, P. Niyogi, V. Sindhwani, Manifold regulariza-
1093/bib/bbk007. tion: a geometric framework for learning from labeled
83. G. Macintyre, J. Bailey, D. Gustafsson, A. Boussioutas, I. and unlabeled examples, Journal of Machine Learning Re-
Haviv, A. Kowalczyk, Gene ontology assisted exploratory search 7 (2006) 2399–2434.
microarray clustering and its application to cancer, in: 96. J. Tanha, M. Van Someren, H. Afsarmanesh, Boosting for
M. Chetty, A. Ngom, S. Ahmad (Eds.), Pattern Recogni- multiclass semi-supervised learning, Pattern Recognition
tion in Bioinformatics, PRIB 2008, in: Lecture Notes in Letters 37 (2014) 63–77.
Computer Science, vol. 5265, Springer, Berlin, Heidel- 97. Andre S. Yoon, Taehoon Lee, Yongsub Lim, Deokwoo
berg, 2008. Jung, Philgyun Kang, Dongwon Kim, Keuntae Park,
84. N. Nidheesh, K.A. Abdul Nazeer, P.M. Ameer, An en- Yongjin Choi, Semi-supervised learning with deep gen-
hanced deterministic K-means clustering algorithm for erative models for asset failure prediction, in: KDD17
cancer subtype prediction from gene expression data, Workshop on Machine Learning for Prognostics and
Computers in Biology and Medicine 91 (December Health Management, August 13–17, 2017, Halifax, Nova
2017) 213–221, https://doi.org/10.1016/j.compbiomed. Scotia, Canada, Sept. 2017.
2017.10.014. 98. Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed,
85. B.A. Rosa, S. Oh, B.L. Montgomery, J. Chen, W. Qin, Com- Max Welling, Semi-supervised learning with deep gener-
puting gene expression data with a knowledge-based gene ative models, in: Proceedings of Neural Information Pro-
clustering approach, International Journal of Biochem- cessing Systems (NIPS), Oct. 2014.
istry and Molecular Biology 1 (1) (2010) 51–68. 99. Mingguang Shi, Bing Zhang, Semi-supervised learning
86. Jinze Liu, Wei Wang, Jiong Yang, A framework for improves gene expression-based prediction of cancer
ontology-driven subspace clustering, in: Proceedings of recurrence, Bioinformatics 27 (21) (November 2011)
the Tenth ACM SIGKDD International Conference on 3017–3023, https://doi.org/10.1093/bioinformatics/
Knowledge Discovery and Data Mining (KDD’04), ACM, btr502.
New York, NY, USA, 2004, pp. 623–628. 100. Thanh Phuong Nguyen, Tu Bao Ho, A semi-supervised
87. M. Verleysen, D. François, The curse of dimensionality in learning approach to disease gene prediction, in: 2007
data mining and time series prediction, in: J. Cabestany, IEEE International Conference on Bioinformatics and
A. Prieto, F. Sandoval (Eds.), Computational Intelligence Biomedicine (BIBM 2007), IEEE, November 2007.
and Bioinspired Systems, IWANN 2005, in: Lecture Notes 101. Brian R. King, Chittibabu Guda, Semi-supervised learn-
in Computer Science, vol. 3512, Springer, Berlin, Heidel- ing for classification of protein sequence data, Scientific
berg, 2005. Programming 16 (1) (January 2008), https://doi.org/10.
88. S. Ma, Y. Dai, Principal component analysis-based meth- 1155/2008/795010 5–29.
ods in bioinformatics studies, Briefings in Bioinformatics 102. Thomas Provoost, Marie-Francine Moens, Semi-
12 (6) (Nov. 2011) 714–722, https://doi.org/10.1093/bib/ supervised learning for the BioNLP gene regulation
bbq090, Epub January 17, 2011. network, BMC Bioinformatics 16 (Suppl 10) (2015),
89. John Tomfohr, Jun Lu, Thomas B. Kepler, Pathway level https://doi.org/10.1186/1471-2105-16-S10-S4 S4.
analysis of gene expression using singular value decom- 103. M. Ceci, G. Pio, V. Kuzmanovski, S. Džeroski, Semi-
position, BMC Bioinformatics 6 (2005), https://doi.org/ supervised multi-view learning for gene network re-
10.1186/1471-2105-6-225 225. construction, in: Thomas Wennekers (Ed.), PLoS ONE
90. Andrea Franceschini, Jianyi Lin, Christian von Mer- 10 (12) (December 2015), https://doi.org/10.1371/
ing, Lars Juhl Jensen, SVD-PHY: improved prediction of journal.pone.0144031 e0144031.
protein functional associations through singular value 104. Maria-Iuliana Bocicor, Gabriela Czibula, Istvan-Gergely
decomposition of phylogenetic profiles, Bioinformatics Czibula, A reinforcement learning approach for solving
32 (7) (April 2016) 1085–1087, https://doi.org/10.1093/ the fragment assembly problem, in: 2011 13th Inter-
bioinformatics/btv696. national Symposium on Symbolic and Numeric Algo-
91. K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Text clas- rithms for Scientific Computing, September 26–29, 2011,
sification from labeled and unlabeled documents using Timisoara, Romania, 2011.
EM, Machine Learning 39 (2–3) (May 2000) 103–134. 105. Gabriela Czibula, Maria-Iuliana Bocicor, Istvan-Gergely
92. Y. Li, C. Guan, H. Li, Z. Chin, A self-training semi- Czibula, A reinforcement learning model for solving
supervised SVM algorithm and its application in an EEG- the folding problem, International Journal of Computer
based brain computer interface speller system, Pattern Applications in Technology (ISSN 2229-6093) (2017)
Recognition Letters 29 (2008) 1285–1294. 171–182.
36 Deep Learning and Parallel Computing Environment for Bioengineering Systems

106. Parastou Kohvaei, Reinforcement Learning Techniques 114. Jason Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang,
in RNA Inverse Folding, Master’s Thesis, Albert-Ludwigs Yanzhang Wang, Xianyan Jia, Cherry Zhang, Yan Wan,
Universität Freiburg, Aug. 2015. Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan
107. Berat Doğan, Tamer Ölmez, A novel state space represen- Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi
tation for the solution of 2D-HP protein folding prob- Lu, Kai Huang, Guoqiong Song, BigDL: a distributed deep
lem using reinforcement learning methods, Applied Soft learning framework for big data, arXiv:1804.05839, 2018.
Computing (ISSN 1568-4946) 26 (Jan. 2015) 213–223. 115. TensorFlow, An open source machine learning framework
108. Qingchen Zhang, Laurence T. Yang, Zhikui Chen, Peng for everyone, https://www.tensorflow.org/.
Li, A survey on deep learning for big data, Information
116. Ladislav Rampasek, Anna Goldenber, TensorFlow: biol-
Fusion (ISSN 1566-2535) 42 (2018) 146–157.
ogy’s gateway to deep learning? Cell Systems 2 (January
109. K. Chen, L.A. Kurgan, Neural networks in bioinformatics,
2016), https://doi.org/10.1016/j.cels.2016.01.009.
in: G. Rozenberg, T. Bäck, J.N. Kok (Eds.), Handbook of
Natural Computing, Springer, Berlin, Heidelberg, 2012, 117. Srishti Grover, Saloni Bhartia, Akshama, Abhilasha Ya-
pp. 566–583. dav, K.R. Seeja, Predicting severity of Parkinson’s disease
110. M. Zaharia, M. Chowdhury, T. Das, A. Dave, Fast and in- using deep learning, Part of special issue International
teractive analytics over Hadoop data with Spark, USENIX Conference on Computational Intelligence and Data Sci-
Login 37 (4) (2012) 45–51. ence, Procedia Computer Science 132 (2018) 1788–1794,
111. Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, https://doi.org/10.1016/j.procs.2018.05.154.
Tawfiq Hasanin, A survey of open source tools for ma- 118. Quan Do, Tran Cao Son, Jamil Chaudri, Classification
chine learning with big data in the Hadoop ecosystem, of asthma severity and medication using TensorFlow
Journal of Big Data 2 (Dec. 2015) 24, https://doi.org/10. and multilevel databases, Procedia Computer Science
1186/s40537-015-0032-1. (ISSN 1877-0509) 113 (2017) 344–351, https://doi.org/
112. Michal Malohlava, Nidhi Mehta, Machine learn- 10.1016/j.procs.2017.08.343.
ing with sparkling water: H2O + spark, in: Michal 119. Caffe, Deep learning framework by Blair, http://caffe.
Malohlava, Nidhi Mehta, Brandon Hill, Vinod Iyen- berkeleyvision.org/.
gar (Eds.), Machine Learning with Sparkling Water: 120. Hui Wu, Matrix Yao, Albert Hu, Gaofeng Sun, Xiao Kun,
H2O + Spark, H2O.ai, Inc., Feb. 2016, https://
Yu Jian Tang, A systematic analysis for state-of-the-art 3D
h2o-release.s3.amazonaws.com/h2o/rel-tukey/2/docs-
lung nodule proposals generation, Part of special issue:
website/h2o-docs/booklets/SparklingWaterVignette.pdf.
Recent Advancement in Information and Communica-
113. Bilal Jan, Haleem Farman, Murad Khan, Muhammad
tion Technology: Proceedings of the International Con-
Imran, Ihtesham Ul Islam, Awais Ahmad, Shaukat Ali,
ference of Information and Communication Technology
Gwanggil Jeon, Deep learning in big data analytics: a
comparative study, Computers & Electrical Engineering – 2018, Procedia Computer Science 131 (2018) 302–310.
(ISSN 0045-7906) (2017).
CHAPTER 3

Image Fusion Through Deep Convolutional Neural Network

G. SREEJA, ME • O. SARANIYA, ME, PHD
3.1 INTRODUCTION

To impart exhaustive medical information to clinicians for effective diagnosis and treatment, the image fusion of different modalities has been of concern in medical image analysis. Multi-modal medical images can be categorized into functional and anatomical imaging. Positron emission tomography (PET) and single photon emission computed tomography (SPECT) are the two types of functional imaging modalities that render metabolic information without any anatomical context. Anatomical imaging modalities, such as sonography, computed tomography (CT), and magnetic resonance imaging (MRI), represent the morphologic details of the human body. In multi-modal medical image fusion, a new fused image is formed by combining complementary features of the functional and anatomical modalities. The possibility of delivering bone information together with the normal and pathological soft tissue information can be extended by fusing MRI and CT images [1]. The fusion of an anatomical image with a functional image, such as PET with MRI or PET with CT, is preferred for localization in radiation therapy treatment planning and for tumor segmentation in oncology [2]. Before consolidating the information obtained from different modalities, proper alignment of the input images is paramount; this process is referred to as image registration. In the registration process, each pixel location in a reference image can be mapped to a new location in the image that is to be registered. The optimality criterion of the mapping depends on the anatomy of the two input images that need to be matched. A concise summary of different medical imaging modalities is given in Table 3.1 [3].

Further, to serve the vital objective of image fusion, a few necessary conditions should be considered: essential information present in any of the source images should not be discarded, any artifacts or incompatibilities have to be eliminated, and the final fused image must be robust and contain authentic information. However, the hindrances for research in image fusion are generally image artifacts, rendering the essential features of each modality, and similarity between modalities, since data formations are not similar and statistically not correlated. In order to enhance the performance in real-time applications, image fusion for guiding diagnosis and disease prognosis [4] has been addressed to assist doctors in making decisions, because human interpretation of medical images is limited.

To overcome the above difficulties in image fusion, deep learning (DL) based methods, and in recent years convolutional neural network (CNN) based techniques, have been widely employed in the study of natural super-resolution image and video processing [5]. The great success of deep learning approaches is due to their strong capability in describing complex relationships among various signals. In [6], a CNN-based method has been presented to measure the local similarity between two given source image patches, and the results revealed the advantages of the CNN-based approaches over conventional image fusion techniques. Henceforth, the direction of CNN for designing effective fusion methods seemed propitious.

3.2 IMAGE FUSION

The motivation of medical image fusion from different modalities is to obtain a high quality image by intelligently combining the essential information collected from multi-modal input images [4]. In general, a classification of image fusion, as sketched in Fig. 3.1, describes three basic levels: feature, pixel, and decision levels [7,8]. Fusion based on pixels is the simplest image fusion method, where the fusion is performed at the pixel level to merge the physical parameters. The limitation of pixel based methods is the effect of large pixel intensity variations on the resultant image. At the decision level, each input is processed separately and the information is extracted; then, based on decision rules, the extracted features are combined. Thus, in a high-level fusion scheme, considerable precision in decision making is achieved with the support of feature based decision analysis [9]. Henceforth, the fusion at the highest level renders the ground work of controls, where the extracted features are the inputs and, in order to meet the

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00010-5
Copyright © 2019 Elsevier Inc. All rights reserved.
TABLE 3.1 Multi-modal imaging modalities.

1. CT SCAN
   Principle: Image data are obtained using X-ray equipment from different angles around the human body.
   Outcomes: Cross-sections of all types of body tissue and organs.
   Usage: Best suited for studying the chest and abdomen.

2. MRI SCAN
   Principle: Uses magnetic and radio waves to get a detailed picture of the inside of a body, and hence is free from X-rays or damaging forms of radiation.
   Outcomes: T1 weighted images render excellent anatomic detail, and T2 weighted images provide excellent contrast between normal and abnormal tissues.
   Usage: Best suited for finding brain tumors and for examining the spinal cord.

3. SPECT SCAN
   Principle: Works on the principle of nuclear medicine, used in both diagnostic and therapeutic procedures.
   Outcomes: Delivers details about the blood flow, temperature of the body, etc.
   Usage: Diagnosis of the kidneys, heart, brain, etc., and also detection of tumors.
FIG. 3.1 Types of fusion.
specific decisions, the extracted features are classified using a cluster of classifiers.

3.3 REGISTRATION

The first and foremost procedure to start with image fusion methods is image registration. Registration in medical image fusion follows the method of spatial alignment in order to compare the respective features of medical images. The registration process requires two inputs: a reference image and an image to be registered. It delivers a registered image, as shown in Fig. 3.2. Image registration involves four processes, as depicted in Fig. 3.3.

FIG. 3.2 Image registration.

FIG. 3.3 Image registration process.

3.3.1 Image Registration Stages

3.3.1.1 Feature Detection
Features, like edges, lines, intersections, corners or points, are detected in the acquired image manually or automatically. Such methods have limited performance in medical imaging, since medical images are not substantially distinct and rarely contain easily detectable objects. There are plenty of feature detection techniques available in the literature. The common feature detectors considered for image alignment are binary robust invariant scalable keypoints (BRISK) [13], scale invariant feature transform (SIFT) [10], Harris corner [12], binary robust independent elementary features (BRIEF), speeded up robust features (SURF) [11], oriented FAST and rotated BRIEF (ORB), etc. [14]. Table 3.2 describes the above mentioned feature detectors and their characteristics.

Once the extraction of interest points is done, the features are described by a vector called a feature descriptor. A descriptor describes the image patch around the detected key points. The essential attributes of feature descriptors are that they should be invariant to geometric and photometric transformations and also robust to clutter and occlusion. Feature descriptors are categorized into real-valued descriptors like SIFT, SURF, GLOH (gradient location and orientation histogram), LIOP (local intensity order pattern), etc., and binary descriptors like FREAK (fast retina key point), BRIEF, BRISK, ORB, etc.

3.3.1.2 Feature Matching
Feature matching maps the features extracted from the captured image to those of the given reference image. Spatial relations and feature descriptions are used based on the application requirements [3]. The drawbacks of feature matching methods are that mutual features may be hard to observe and the features may become unstable over time.

3.3.1.3 Transform Model Estimation
With the matched key points, the type and elements of the transformation matrix/mapping function are estimated for geometric alignment of the floating image in accordance with the reference image.

3.3.1.4 Resampling
The transformation of the floating image can be done in a forward or backward manner. By the use of the estimated mapping functions, every pixel from the sensed image can be transformed directly.

3.3.2 Need for Image Registration in Medical Imaging
Better medical image registration techniques could give doctors new and more accurate ways of monitoring and diagnosing patients suffering from various diseases. Image registration would be advantageous for more exact focusing of radiation therapy against tumors. Effective image registration could also bring improvements to surgical procedures where indirect visualization is employed by the operating surgeon. If proper registration is not done, it leads to misalignment of images. This challenge is exacerbated in correlative images taken with multiple modalities over multiple time periods. Of the rigid and non-rigid registration structures, procedures for registering non-rigid images are imprecise and yield poorly reproducible results. The inability to precisely register such images hinders the growth of medical imaging as a diagnostic tool, especially in cases involving a larger number of clinicians from different places working toward a collaborative diagnostic tool [3].
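The transform model estimation stage of Sect. 3.3.1.3 can be made concrete with a small sketch: given matched key point pairs, the six parameters of a 2D affine mapping can be recovered by least squares over the normal equations. The following is only an illustrative pure-Python sketch (the function names are ours, not from the chapter or any particular toolbox), assuming at least three non-collinear matched points:

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented copy; A is untouched
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0] * 3
    for r in range(2, -1, -1):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def estimate_affine(src, dst):
    """Least-squares 2D affine transform mapping src points onto dst points.

    Returns (a, b, tx, c, d, ty) such that
    x' = a*x + b*y + tx and y' = c*x + d*y + ty.
    Assumes src contains at least three non-collinear points.
    """
    # Build the normal equations A^T A p = A^T b once; reuse for x' and y'.
    ata = [[0.0] * 3 for _ in range(3)]
    atx = [0.0] * 3
    aty = [0.0] * 3
    for (x, y), (xp, yp) in zip(src, dst):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atx[i] += row[i] * xp
            aty[i] += row[i] * yp
    a, b, tx = solve3(ata, atx)
    c, d, ty = solve3(ata, aty)
    return a, b, tx, c, d, ty
```

For instance, four corners of a unit square scaled by 2 and shifted by (3, 4) recover the parameters (2, 0, 3, 0, 2, 4). In the resampling stage (Sect. 3.3.1.4) this estimated mapping would then be evaluated, typically backward, at every output pixel.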
TABLE 3.2 Comparison of feature detectors [15].

S.NO  Detectors       Year  Performance  Computation time
1     Moravec Corner  1980  Poor         Poor
2     Harris Corner   1988  Good         Fair
3     SUSAN           1997  Better       Better
4     Harris-Affine   2004  Best         Better
5     SIFT            2004  Best         Better
6     SURF            2006  Good         Excellent
7     FAST            2006  Good         Good
8     BRISK           2011  Good         Good
9     ORB             2011  Good         Excellent
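The segment test behind the FAST detector in Table 3.2 (and behind the AGAST variant used by BRISK) can be sketched in a few lines: a pixel is a corner candidate if a long enough contiguous arc of pixels on a surrounding circle is uniformly brighter or darker than the center by a threshold. For brevity, this toy version tests an 8-pixel ring instead of the standard 16-pixel Bresenham circle of radius 3, so it illustrates the idea only and is not a faithful FAST implementation:

```python
# Offsets of an 8-pixel ring around the candidate (simplified; real FAST
# walks a 16-pixel Bresenham circle of radius 3).
RING = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def is_corner(img, r, c, t=20, arc=6):
    """Simplified segment test: True if `arc` contiguous ring pixels are all
    brighter than center + t or all darker than center - t."""
    center = img[r][c]
    # Mark each ring pixel: +1 brighter, -1 darker, 0 similar to the center.
    marks = []
    for dr, dc in RING:
        p = img[r + dr][c + dc]
        marks.append(1 if p > center + t else (-1 if p < center - t else 0))
    # Check contiguous runs on the circular sequence by doubling it.
    doubled = marks + marks
    for sign in (1, -1):
        run = 0
        for m in doubled:
            run = run + 1 if m == sign else 0
            if run >= arc:
                return True
    return False
```

A dark pixel surrounded by a bright ring passes the test, while a flat patch or an arc of only four bright pixels does not. The BRISK pipeline additionally scores such candidates and keeps only local maxima across scale space.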
3.3.2.1 SURF Based Registration
The SURF feature descriptor was proposed by H. Bay [11] and works on the principle of Gaussian scale space analysis. SURF is considered over SIFT because of its fast computation time, which enables various real-time applications such as object tracking, image fusion, object detection and image mosaicing [14]. The SURF descriptor exploits integral images [16] to speed up the feature detection time [17]. The implementation of the SURF algorithm includes the following fundamental steps:
• Selecting salient features such as blobs, edges, intersections at particular regions, and corners in the integral image. SURF uses the fast-Hessian detector for detecting feature points.
• Using a descriptor to portray the surrounding neighborhood of each feature point. This feature vector must be unique. Simultaneously, the selected feature vector ought to be resilient to errors, geometric deformations and noise.
• Assigning an orientation to the key point descriptor by calculating the Haar wavelet responses along the coordinates of the image.
• Finally, performing SURF matching by adopting the nearest neighbor approach.

3.3.2.2 BRISK Based Registration
One of the binary feature descriptors, binary robust invariant scalable keypoints (BRISK), was developed by S. Leutenegger [13]; it is rotation and scale invariant but limited under affine variations. It exploits adaptive and generic corner detection based on the accelerated segment test (AGAST) algorithm to detect corners, and their local maxima are obtained by adapting the features from accelerated segment test (FAST) corner score [18]. BRISK feature descriptors produce more accurate feature matching results, and their computation time is less compared to SURF.

3.3.2.3 Implementation
As an example, image registration based on SURF and BRISK was experimented with in MATLAB 2017, and the results are sketched in Figs. 3.6 and 3.8. Here MRI T1 and T2 are taken as input images, as shown in Fig. 3.4, where T1 is considered as the reference image and T2 is the floating image. The transformation is applied to the T2 image with reference to T1. In both cases, the algorithm employed to remove the outliers is random sample consensus (RANSAC). The matched inliers and outliers of both algorithms are given in Figs. 3.5 and 3.7. A detailed study of the SURF and BRISK feature detectors can be found in [19], and SURF based medical image registration can be studied in [20].

FIG. 3.4 Input images T1 and T2.

FIG. 3.5 Matched points using SURF.

FIG. 3.6 Registered image.

FIG. 3.7 Matched points using BRISK.

FIG. 3.8 BRISK based registration.
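The RANSAC outlier rejection used above can be sketched for the simplest possible motion model, a pure translation: repeatedly hypothesize a translation from one randomly chosen match and keep the hypothesis that the largest number of matches agrees with. Real registration pipelines fit affine or projective models the same way; the code below is only a toy pure-Python illustration with invented function names:

```python
import random

def ransac_translation(matches, iters=200, tol=2.0, seed=0):
    """Estimate a 2D translation from noisy point matches, returning the
    translation and the indices of the inlier matches.

    matches: list of ((x, y), (x', y')) correspondences, some of them wrong.
    """
    rng = random.Random(seed)
    best_t, best_inliers = (0.0, 0.0), []
    for _ in range(iters):
        (x, y), (xp, yp) = rng.choice(matches)
        tx, ty = xp - x, yp - y  # hypothesis from a single match
        # Consensus set: matches explained by this translation within tol.
        inliers = [
            i for i, ((ax, ay), (bx, by)) in enumerate(matches)
            if abs(ax + tx - bx) <= tol and abs(ay + ty - by) <= tol
        ]
        if len(inliers) > len(best_inliers):
            best_t, best_inliers = (tx, ty), inliers
    return best_t, best_inliers
```

With, say, eight matches translated by (5, -3) plus two gross outliers, the returned consensus set contains exactly the eight correct matches; the outliers are discarded and the final transform would then be re-estimated from the inliers only.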
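Returning to Sect. 3.3.2.1, the integral image that gives SURF its speed is itself a compact idea: after one pass over the image, the sum over any axis-aligned box (the box filters that approximate Gaussian derivatives in the fast-Hessian detector) costs just four lookups. A minimal sketch, with our own helper names:

```python
def integral_image(img):
    """Summed-area table with an extra zero row/column, so that
    ii[r][c] holds the sum of img[0:r][0:c]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1][c0:c1] in O(1) via four table lookups."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]
```

Because every box-filter response costs the same regardless of filter size, SURF can evaluate its detector at multiple scales without rescaling the image.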
3.4 EXISTING IMAGE FUSION METHODS – methods are supported by machine learning, neural
OVERVIEW networks and fuzzy logic [23]. On the decision level, the
The dimensionality reduction methods frequently em- weight map is constructed by measuring activity levels
ployed without any restraint comprise principal compo- of each wavelet coefficients and, based on the resultant
nent analysis (PCA), independent component analysis weight map, the fusion rule is framed. Finally, based on
(ICA), intensity–hue–saturation (HIS), and methods of the constructed fusion rule, the images are fused.
multi-resolution based analysis. When any of hybrid Multi-scale transforms (MST), like contourlets,
methods are implemented by combining two among wavelets and curvelets, have been considered to be un-
the above mentioned methods, it will result in good fitting in case of locating directional features. Hence, the
spatial resolution without any color distortion. In [21], use of shearlets in image fusion enabled researchers to
the pixel level fusion activities are reorganized in four observe anisotropic features and capture sharp transi-
families: model-based algorithms, component substitu- tions at different scales and orientation [24]. The edge
tion (CS), multi-resolution analysis (MRA), and hybrid representations are more evident in shearlet based ap-
methods combining CS and MRA. Indeed, the prob- proaches than in traditional MST methods. But the
lem of image registration can be solved by using meth- problem of shift invariance is not imparted by shear-
ods of MRA including contourlet, ridgelet, curvelet, lets because of the subsampling process. The chal-
shearlet learning-based approaches. The spatial distor- lenges involved in shearlets were surmounted by non-
tion in the fused image due to spatial correlation be- subsampling shearlet transform (NSST) [25]. With ref-
tween image pixels is the drawback of pixel level meth- erence to the above discussions, the challenges that
ods. In our opinion, the low-level methods like pixel endured in traditional image fusion research can be out-
level fusion are efficient in computation and simple lined based on different aspects:
to execute with high rendition of authentic informa- • Obstruction in formulating image transforms and fu-
tion, yet they are prone to misalignment of images and sion strategies to prosecute state-of-the-art results.
noise. • Inadequacy of image fusion methods for successful
The various algorithms based on pixel and feature levels are compared in [22]. The enhancement of a fused image with high spatial information is achieved by performing fusion at the feature level. The first process encountered at the feature level is the extraction of objects; features with similar attributes are then fused together in order to improve the classification performance. Finally, at the interpretation level, the features are converged after a preliminary classification of each data source. Other prominent approaches for feature level medical image fusion are summarized in [8], including neural networks, fuzzy logic, transformations in multi-scale images, and classifiers such as support vector machines (SVM). Image fusion based on fuzzy logic can be implemented either as a decision level function or as a feature level function. Most of the feature processing methods, like neural networks and SVM, exploit wavelets as a fusion function. Such approaches include neuro-fuzzy wavelet, wavelet-SVM, wavelet-edge feature, etc.

Conventional multi-resolution analysis (MRA) based methods typically perform the following procedures: they begin with a decomposition, then apply a fusion rule, and finally proceed to reconstruction. Activity level measurement, grouping of coefficients, coefficient combination and consistency verification are the major components exploited in a fusion rule. The major application of decision based image fusion ... image representation.
• Lack of standard fusion metrics for result evaluation.
The above mentioned difficulties of traditional medical image fusion techniques can be avoided by implementing a fusion mechanism in deep learning.

3.5 DEEP LEARNING

Machine learning techniques, a subfield of artificial intelligence, have revolutionized the computer vision research field as a determinant factor in upgrading performance. They have supported image processing for decades, and several specialized areas in imaging, like content-based image retrieval, image segmentation, face recognition, and multimodality image fusion, have been studied. Progressively, these applications of image processing found a method in deep learning (DL), a machine learning topic that builds insight into data by separating multiple stages of representation. By extracting high-level features from low-level features, DL forms a hierarchical description [26]. Architectures of DL can be shaped by the following four networks: convolutional neural networks (CNNs), sparse coding, restricted Boltzmann machines (RBMs), and auto-encoders [27]. Among these architectures, CNN achieved good results in image fusion.
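The hierarchical feature extraction that such DL architectures perform is built from stacked convolutions. As a minimal, purely illustrative sketch (not code from the cited works; the function name is hypothetical), a single "valid" 2D convolution in NumPy responds strongly where an image contains an edge:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution: slide the flipped kernel over the image."""
    kh, kw = kernel.shape
    k = np.flipud(np.fliplr(kernel))  # true convolution flips the kernel
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * k)
    return out

# A vertical step edge and a horizontal-gradient (Sobel-like) kernel:
img = np.zeros((5, 5))
img[:, 3:] = 1.0
kx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
resp = conv2d_valid(img, kx)  # non-zero responses mark the edge location
```

Stacking such filtered responses, with learned kernels and non-linearities between layers, is what lets a CNN pull out progressively higher-level features.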
CHAPTER 3 Image Fusion Through Deep Convolutional Neural Network 43

FIG. 3.9 Deep convolutional neural network architecture.

3.6 CONVOLUTIONAL NEURAL NETWORK (CNN)

In convolutional neural networks, the input features are taken in batches, similar to a filter. This helps the network recollect the images in parts and compute the operations. This computation includes conversion of the image from the RGB scale to gray-scale. Once computation is completed, the changes in the pixel values help observe the edges, and images can be classified into various categories. A convolutional network can be applied in techniques like signal processing and image classification [28].

FIG. 3.10 Framework of CNN for image fusion techniques.

A major potential of CNNs lies in their deep architecture, given in Fig. 3.9. In grid-like topologies of images and videos, DL allows pulling out perceptive features at various hierarchies of abstraction with adequate performance. The overall process of image fusion based on CNN is illustrated in Fig. 3.10. A powerful weight sharing strategy in a CNN reduces the complexity challenges and exploits the idea that localization can be done for attributes with similar characteristics in different locations of an image that are to be fused with possible correlations. Accordingly, the number of weights in the architecture has been reduced by performing convolution instead of multiplication. Moreover, this increases the computation speed when performed on highly parallel graphics processing units (GPUs) [27]. Nevertheless, training a CNN with input images from the medical domain is a cumbersome task due to three factors [29,30]:
• Uniqueness of diseases, since it is very difficult to acquire a large amount of labeled training data, and expert annotation is expensive.
• The process is extremely time-consuming.
• Convergence problems and data overfitting, which demand repetitive adjustment.
The key intuition to cope with the mentioned drawbacks is to incorporate transfer learning. It can be achieved by successfully transferring the source domain knowledge into the target domain without counting on extremely large datasets [31,32].

3.7 CNN BASED IMAGE FUSION ALGORITHMS

3.7.1 Deep Stacked CNN (DSCNN)
An improved DCNN in [33] is stacked from a number of basic units, as in Fig. 3.11, and is composed of low and high frequency subnets. These respectively consist of three convolution layers: the first layer, where the input information is limited; the second layer, for combining the information; and the third layer, with a function to merge this information into an image of high and low frequency.

The general idea of a fusion algorithm based on the improved deep stacked convolutional neural network (DSCNN) can be described as follows:
• The back propagation algorithm is procured as a basic training unit.
FIG. 3.11 DSCNN architecture for image fusion.

• The construction of a DSCNN is done by stacking multiple well-trained basic units, and the parameters of the complete network are fine-tuned end-to-end.
• Images S and R that are ready for fusion are decomposed into their corresponding high and low frequency images, respectively, through the same DSCNN.
• The fusion of the high and low frequency images of S and R is done by using appropriate fusion rules to obtain the fused high and low frequency images.
• The results obtained in the above steps are put back into a DSCNN in order to reconstruct and get the final fused image.

3.7.2 CNN Based Similarity Learning

FIG. 3.12 CNN similarity learning based image fusion.

In [4], CNN based image fusion through transfer learning has been performed in the shearlet domain. The perception of measuring similarity between representations is one of the fundamental visual understanding methods in either low- or high-level vision. Since a CNN is fully convolutional, the convolution is taken as the similarity metric between the detected feature maps F_{P_S} of the given patch P_S and the feature maps F_{P_R} related to the second patch P_R, as represented by

X_{corr}(F_{P_S}, F_{P_R}) = F_{P_S} \ast F_{P_R}.   (3.1)

The advantage of using knowledge transfer from the spatial to the frequency domain is that it builds on fully converged pre-trained models and does not require random initialization of weights. This paves the way to a reduction of computation time in transfer learning and fine tuning.

Let S and R represent the unregistered CT and MRI source images of size m × n, capturing the same anatomical structure. The fusion framework in Fig. 3.12 starts by decomposing the source images into approximation and detailed coefficients with the non-subsampled shearlet transform (NSST). In the shearlet domain, fusion rules are framed in such a way that the algorithm decides which extracted coefficients should be considered for fusion. By fine-tuning the extraction section of the CNN, the high-frequency subnets correspond to the fusion of the extracted feature maps, which are consolidated. To detect which of the correlation coefficients has a high impact on the fused subbands, normalized cross-correlations are performed between the resultant shearlet coefficient maps. Low-frequency coefficients are fused based on the calculation of local energy. To obtain the final fused image, the inverse NSST is applied in accordance with the fused coefficients.

3.7.3 CNN Based Fusion Using Pyramidal Decomposition

FIG. 3.13 CNN based image fusion using Laplacian pyramid method.

The CNN architecture used in [34] is a Siamese network, in which the weights of the two branches are constrained to be the same. Each branch of the CNN contains three layers for performing convolution and one for max-pooling, which resembles the architecture used in [35]. For the reduction of memory consumption, as well as an increase in computational efficiency, the fully connected layer has been eliminated in the defined architecture. The 512 feature maps upon progression are directly connected to a 2D vector. This 2D vector is fed as an input to a 2-class softmax layer that provides a probability distribution over two defined classes. The two classes represent the different normalized weight assignment results, namely, "first patch 1 and second patch 0" and "first patch 0 and second patch 1", respectively. The probability distribution of each class indicates the possibility of each weight assignment. Since the sum of the two output probabilities is 1, the probability of each class just shows the weight assigned to its corresponding input patch.

The high-quality image patches and their blurred versions have been taken as training data for the Siamese network in Fig. 3.13, which is trained using the method specified in [35]. During the training process, the spatial size of the input patch is set to 16 × 16 according to the analysis. The training examples were created based on multi-scale Gaussian filtering and random sampling. The softmax loss function is taken as the optimization objective and, to minimize it, the stochastic gradient descent (SGD) algorithm is adopted. The training process is operated on the deep learning framework Caffe [36]. Since the Siamese network has a fully-connected layer with pre-defined dimensions of input and output data, the input of the network must have a fixed size in order to ensure that the input data of the fully-connected layer is fixed.

In medical image fusion, inconsistency in the size of input images can be managed by separating the images into overlapping patches and feeding each patch pair as an input to the DL architecture, but this causes a large number of repeated calculations. To rectify this problem, the fully-connected layer has to be converted to an equivalent convolutional layer containing two kernels of size 8 × 8 × 512 [37]. Once the conversion is done, the network can process the input images as a whole to generate a dense decision map, in which each prediction (a 2D vector) contains the relative clarity information of the original patch pair at its respective position. The result can be simplified as the weight of the first (or second) source, because there are only two dimensions in each prediction and their sum is normalized to 1. Finally, a weight map of the same size as the input images is acquired by assigning the value as the weight of all the pixels within the patch location and by averaging the overlapped pixels.

3.8 EVALUATION METRICS

Fusion quality metrics are employed in various CNN based image fusion works in order to evaluate their efficiency [4].
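As a concrete reference for the metric definitions that follow in this section, here is a minimal NumPy sketch of two of the simpler ones, entropy (Eq. (3.2)) and spatial frequency (Eqs. (3.4)–(3.6)). It assumes 8-bit gray-scale inputs and is illustrative, not the exact code used in the cited works:

```python
import numpy as np

def entropy(img, levels=256):
    """Entropy S = -sum_j p(j) log2 p(j) over gray levels, as in Eq. (3.2)."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                      # 0 * log(0) terms are dropped
    return float(-np.sum(p * np.log2(p)))

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2) from row/column differences, Eqs. (3.4)-(3.6)."""
    m = img.astype(float)
    rf = np.sqrt(np.mean((m[:, 1:] - m[:, :-1]) ** 2))   # row frequency
    cf = np.sqrt(np.mean((m[1:, :] - m[:-1, :]) ** 2))   # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))

# Toy image: half 0, half 255 -> two equally likely gray levels, so S = 1 bit
img = np.zeros((4, 4), dtype=np.uint8)
img[:, 2:] = 255
```

Note that the means over the difference terms match the 1/(a(b−1)) and 1/(b(a−1)) normalizations of Eqs. (3.5)–(3.6), since those are exactly the numbers of difference terms.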
3.8.1 Entropy (S)
Entropy in Eq. (3.2) calculates the level of authentic information available in the input and fused images. Content richness is indicated by the level of entropy:

S = -\sum_{j=0}^{l-1} p(j) \log_2 p(j),   (3.2)

where p(j) denotes the pixel gray level probability, with gray levels ranging in {0, ..., l − 1}.

3.8.2 Standard Deviation (σ)
In general, the contrast of an image is represented by the standard deviation [38]. An image with maximum σ will directly relate to high contrast. σ can be characterized by the degree of deviation of the pixels' gray levels in an image m(r, c) of size M × N, and is computed using

\sigma = \sqrt{ \frac{1}{M \times N} \sum_{r=1}^{M} \sum_{c=1}^{N} \left[ m(r,c) - \frac{1}{M \times N} \sum_{r=1}^{M} \sum_{c=1}^{N} m(r,c) \right]^2 }.   (3.3)

3.8.3 Spatial Frequency (SF)
Spatial frequency (SF) [39], given in Eq. (3.4), delivers the level of clarity and provides the image activity level. Therefore, the larger the SF, the more clarity the image will have:

SF = \sqrt{RF^2 + CF^2},   (3.4)

RF = \sqrt{ \frac{1}{a(b-1)} \sum_{r=1}^{a} \sum_{n=2}^{b} \left( m(r,n-1) - m(r,n) \right)^2 },   (3.5)

CF = \sqrt{ \frac{1}{b(a-1)} \sum_{r=2}^{a} \sum_{n=1}^{b} \left( m(r,n) - m(r-1,n) \right)^2 },   (3.6)

where RF in Eq. (3.5) is the row frequency and CF in Eq. (3.6) gives the column frequency. The image size is a × b, and m(r, n) represents the gray level of the resulting image.

3.8.4 Mutual Information (MI)
The amount of information that a source image provides to the final image is indicated by MI. Its value increases when the texture information in the fused image increases. Provided two source images S_A, S_B and a fused image S_F, it can be defined as [39]:

MI = I(S_A; S_F) + I(S_B; S_F),   (3.7)

I(S_R; S_F) = \sum_{i=1}^{p} \sum_{j=1}^{p} h_{R,F}(i,j) \log_2 \frac{h_{R,F}(i,j)}{h_R(i)\, h_F(j)}.   (3.8)

Here R denotes a reference image and F a fused image, where h_{R,F}(u,v) is the joint gray level histogram of X_R and X_F, and h_R(u) and h_F(v) are the normalized gray level histograms of X_R and X_F, respectively.

3.8.5 Image Quality Index (IQI)
The image quality index IQI indicates the quality of the fused image. Its dynamic range is the interval [−1, 1]. A larger IQI, closer to unity, indicates a better quality of the fused result. The IQI is given by:

IQI = \frac{\sigma_{FS}}{\sigma_S \sigma_F} \cdot \frac{2\mu_F \mu_S}{\mu_S^2 + \mu_F^2} \cdot \frac{2\sigma_F \sigma_S}{\sigma_S^2 + \sigma_F^2}.   (3.9)

3.8.6 Q^{RS/F}
In the Q^{RS/F} framework [40], essential visual information is related to gradient information, and the amount of gradient information transferred from the inputs to the fused image is measured using

Q^{RS/F} = \frac{ \sum_{\forall i,j} \left( Q^{RF}_{i,j} W^R_{i,j} + Q^{SF}_{i,j} W^S_{i,j} \right) }{ \sum_{\forall i,j} \left( W^R_{i,j} + W^S_{i,j} \right) }.   (3.10)

Fusion methods that pass a high amount of input gradient information into the fused image are said to perform better. The total fusion performance Q^{RS/F} is evaluated as a weighted sum of the edge information preservation values for both input images, Q^{R/F} and Q^{S/F}, where the weight factors W^R and W^S reflect the perceptual importance of each source image pixel. The range is 0 ≤ Q^{RS/F} ≤ 1, where 0 means that the input information has been completely lost, and 1 indicates "ideal fusion" without any loss of input information. In their simplest form, the perceptual weights W^R and W^S take the values of the corresponding gradient strength parameters g_R and g_S.

3.8.7 Q_G
The gradient based index Q_G described in Eq. (3.11) measures the level of edge information from the source image that is successfully transferred to the final fused image [41]. The index Q_G is calculated as follows:

Q_G = \frac{ \sum_{r=1}^{M} \sum_{c=1}^{N} \left( Q^{RF}(r,c)\, \tau^R(r,c) + Q^{SF}(r,c)\, \tau^S(r,c) \right) }{ \sum_{r=1}^{M} \sum_{c=1}^{N} \left( \tau^R(r,c) + \tau^S(r,c) \right) },   (3.11)

where Q^{RF} = Q^{RF}_g Q^{RF}_o; Q^{RF}_g(r,c) and Q^{RF}_o(r,c) are the edge strength and orientation preservation values at location (r,c), respectively; the width and height of the images are given by N and M; Q^{SF}(r,c) is similar to Q^{RF}(r,c); and τ^R(r,c) and τ^S(r,c) reflect the importance of Q^{RF}(r,c) and Q^{SF}(r,c), respectively.

3.8.8 Structural Similarity Index Metric (SSIM)
The similarity between two input images a and b is measured as

SSIM = \frac{(2\mu_a \mu_b + c_1)(2\sigma_{ab} + c_2)}{(\mu_a^2 + \mu_b^2 + c_1)(\sigma_a^2 + \sigma_b^2 + c_2)},   (3.12)

where c_1 and c_2 are stabilization constants; μ_a, μ_b are the average values of a and b; σ_a², σ_b² are the variances of a and b, while σ_{ab} is the covariance between the two regions.

3.8.9 FSIM
Edge mutuality between the source images and the fused image is represented by FSIM, and it can be obtained using the following equation [42]:

FSIM = \frac{ \sum_{x \in \Omega} S_L(x)\, PC_m(x) }{ \sum_{x \in \Omega} PC_m(x) },   (3.13)

where Ω gives the image spatial domain, S_L(x) describes the total similarity between the images, and PC_m(x) measures the phase congruency value.

3.8.10 Contrast (C)
The local contrast represents the purity of an image view. It is calculated using the formula:

C = \frac{ | \mu_{target} - \mu_{background} | }{ \mu_{target} + \mu_{background} },   (3.14)

where μ_{target} indicates the local mean gray-level of the target image in the region of interest, and μ_{background} measures the mean of the background in the same region. Higher purity of the image is obtained when C is larger.

3.8.11 Q_E
The formula in Eq. (3.15) gives the quality index for edge-dependent fusion:

Q_E(a, b, f) = Q_w(a, b, f) \cdot Q_w(a', b', f')^{\alpha},   (3.15)

where α is a parameter that expresses the contribution of the edge information compared to the source images [43].

3.8.12 Average Gradient (AG)
The average gradient in Eq. (3.16) evaluates the clarity of the image. The higher the average gradient, the larger the resolution of the image. The average gradient is given by

AG = \frac{ \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} g(i,j) }{ M \times N },   (3.16)

where g(i,j) is the gradient magnitude at location (i,j).

3.8.13 Human Perception-Based Metric (Q_{CB})
The metric Q_{CB} in [44] is a human visual system inspired measure; it tries to calculate the maximum preserved local saliency and an activity level measure like contrast by using Eq. (3.17). It is calculated by averaging the global quality map with the saliency maps λ_R and λ_S of sources R and S, respectively; Q_{RF} and Q_{SF} denote the information preservation contrast values of R and S:

Q_{CB}(m,n) = \lambda_R(m,n)\, Q_{RF} + \lambda_S(m,n)\, Q_{SF}.   (3.17)

3.8.14 Processing Time (T)
We let T denote the computation time of the complete process in seconds, according to the system specifications.

3.9 RESULTS INTERPRETATION AND DISCUSSION

The results obtained in [4,33,34,42] are studied and tabulated below in Tables 3.3, 3.4 and 3.5, respectively. The software employed in all these methods is MATLAB. Since there is no ground truth image for multimodal medical images, the robustness of each algorithm is evaluated on the basis of the highest possible value obtained by calculating each metric. Table 3.3
reveals the performance attributes of similarity based image fusion.

TABLE 3.3
Result analysis of similarity based image fusion [4].
Metrics | Values
S       | 5.5572
σ       | 80.4659
MI      | 3.4492
SF      | 8.8984
IQI     | 0.7104
SSIM    | 0.8123
Q_CB    | 0.4505

FIG. 3.14 Analysis of similarity based image fusion [4].

The architecture produces good values for SSIM, IQI and SF, but entropy is lacking and the time cost is high. A graphical representation is also given in Fig. 3.14. The objective assessment of the method proposed in [34] is listed in Table 3.4, and the corresponding graphical representation is in Fig. 3.15.

TABLE 3.4
Result analysis of CNN architecture using pyramidal decomposition [34].
Metrics | MRI and CT
S       | 6.1741
FSIM    | 0.8872
Q_G     | 0.6309
Q_E     | 0.6547
T       | 12.1 s

FIG. 3.15 Comparative analysis of DCNN using Laplacian pyramidal decomposition on different datasets [34].

Table 3.5 shows the performance metrics of the DSCNN based image fusion method. The datasets from different modalities, such as CT+PET and MRI+CT, are investigated in [33]. The computation time for the fusion of CT and PET is longer compared to the fusion time of MRI and CT. See Fig. 3.16.

TABLE 3.5
Result analysis of DSCNN architecture [33].
Metrics | CT and MRI (abdomen) | CT and PET | CT and MRI (brain)
S       | 7.622   | 6.188   | 7.16
σ       | 75.422  | 21.386  | 45.907
C       | 33.973  | 64.426  | 11.503
MI      | 6.344   | 3.464   | 5.147
AG      | 6.545   | 3.395   | 7.112
SF      | 11.354  | 8.031   | 13.640
IQI     | 0.632   | 0.832   | 0.611
T       | 8.763 s | 11.046 s | 3.115 s

FIG. 3.16 Comparative analysis of DSCNN on different datasets [33].
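The MI rows in Tables 3.3 and 3.5 follow Eqs. (3.7)–(3.8). A minimal joint-histogram sketch of the per-source term I(A; F) (illustrative, not the authors' code; 8-bit inputs assumed) is:

```python
import numpy as np

def mutual_information(a, f, bins=256):
    """I(A; F) of Eq. (3.8) from the joint gray-level histogram; the
    tabulated MI is I(S_A; S_F) + I(S_B; S_F), per Eq. (3.7)."""
    joint, _, _ = np.histogram2d(a.ravel(), f.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    p_af = joint / joint.sum()
    p_a = p_af.sum(axis=1, keepdims=True)   # marginal of the source image
    p_f = p_af.sum(axis=0, keepdims=True)   # marginal of the fused image
    mask = p_af > 0                          # skip empty histogram cells
    return float(np.sum(p_af[mask] * np.log2(p_af[mask] / (p_a @ p_f)[mask])))
```

When the fused image reproduces a source exactly, I(A; F) equals the entropy of that source, which is why MI grows as more source texture is preserved in the fused result.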
TABLE 3.6
Result analysis of CNN architecture using pyramidal decomposition [42].
Metrics  | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5
AG       | 0.0844  | 0.0822  | 0.886   | 0.0892  | 0.0835
σ        | 0.7782  | 0.8031  | 0.954   | 0.8023  | 0.7541
Q^{RS/F} | 0.8662  | 0.6524  | 0.7172  | 0.6012  | 0.7327
MI       | 0.8753  | 1.7621  | 1.516   | 1.5721  | 1.374
T        | 2.094 s | 2.014 s | 2.051 s | 3.021 s | 2.016 s

TABLE 3.7
Result analysis of CNN architecture using pyramidal decomposition [42].
Metrics  | Dataset 6 | Dataset 7 | Dataset 8 | Dataset 9
AG       | 0.0837  | 0.089   | 0.0874  | 0.0893
σ        | 0.7951  | 0.8023  | 0.7231  | 0.998
Q^{RS/F} | 0.6251  | 0.5872  | 0.7517  | 0.7832
MI       | 1.6821  | 1.2451  | 0.8026  | 1.4351
T        | 1.962 s | 2.127 s | 2.172 s | 2.274 s

FIG. 3.17 Comparative analysis of DCNN using pyramidal decomposition on different datasets [42].

The evaluation metrics of Laplacian pyramidal decomposition based CNN for different datasets are given in Tables 3.6 and 3.7. The experimented datasets are from CT, MRI and PET. The computation time is very small, which is evident from Fig. 3.17.

The results indicate that the method of similarity based image fusion has a higher σ value than that obtained by DSCNN, but a comparatively lower one than for DCNN based on pyramidal decomposition. In terms of entropy, similarity learning yields a smaller value than the other two methods. The AG of DSCNN shows better results, and the mutual information between the source images is also well defined. The value of IQI is high in similarity based learning but lower for DSCNN. When comparing the computation time of each network, the method proposed in [42] requires the least processing time. Regarding spatial frequency, the DSCNN method shows improvement. Overall, the DSCNN based image fusion and the similarity based method seemed to have better performance on medical image fusion compared to CNN using pyramidal decomposition.
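The IQI comparison above follows Eq. (3.9). A sketch using global statistics over the whole image (the cited works may instead compute it over local windows and average) is:

```python
import numpy as np

def iqi(s, f):
    """Global image quality index per Eq. (3.9): correlation, luminance and
    contrast terms multiplied together (range [-1, 1], 1 is best)."""
    s = s.astype(float).ravel()
    f = f.astype(float).ravel()
    mu_s, mu_f = s.mean(), f.mean()
    var_s, var_f = s.var(), f.var()
    cov_fs = np.mean((s - mu_s) * (f - mu_f))   # sigma_FS
    corr = cov_fs / np.sqrt(var_s * var_f)      # sigma_FS / (sigma_S sigma_F)
    lum = 2 * mu_f * mu_s / (mu_s ** 2 + mu_f ** 2)
    con = 2 * np.sqrt(var_f) * np.sqrt(var_s) / (var_s + var_f)
    return float(corr * lum * con)
```

A fused image identical to its source scores exactly 1; deviations in mean brightness or contrast pull the respective factors, and hence the product, below 1.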
3.10 ISSUES IN EXISTING CNN BASED IMAGE FUSION METHODS

The results from all the tables indicate that the advantages of CNNs in the field of medical image fusion have been widely verified. Specifically, the supervised learning of a CNN based classification-type network has great potential for different image fusion challenges. Manual design of accurate fusion rules and complicated activity level measurements can be avoided by direct generation of a weight map from the input images in a CNN. In particular, a normalized weight assignment result can be represented by each output neuron of the DCNN, in which the sum of the weights equals 1, and it represents the corresponding probability. Correspondingly, the probability distribution of the weight assignment is defined by the output vector of the network, and the weight assignment to be measured is based on its mathematical expectation. Therefore, image fusion can also be viewed as a two-class classification problem. The approach can also be applied to various multimodal image fusion issues, because the generalized multi-class output provides greater flexibility. Moreover, besides the output weight map, the intermediate feature maps formed by hidden layers also include the clarity or authenticity information of the source images, which is pertinent to a variety of image fusion issues [5].

To achieve high fusion performance, the traditional approaches existing in conventional image fusion techniques, including MRA, consistency verification, activity level measurement, etc., must be taken into consideration. On the other hand, for a certain specific image fusion issue, the conventional approaches should be included to frame the whole fusion scheme together with the CNN-based approach. Meanwhile, mutuality between the fused data is still cumbersome, which can be rectified by selecting similarity based learning in CNN. The input manner of the DCNN architecture is also not restrained to the Siamese architecture mentioned in [45]; the advantages of other models, like pseudo-Siamese and 2-channel, also deserve to be made compatible with DCNN [5]. The criticality of such methods relies on creating an effective large training dataset. One of the possible solutions is to use the method mentioned in [45], based on multi-scale Gaussian filtering, but further investigation is essential to develop more compatible algorithms for complex CNN models.

Moreover, there is no ground-truth fused image in most of the image fusion schemes. A major issue in such a category is the creation of training datasets. Fortunately, for a few image fusion problems, such as multi-exposure and multi-focus images, in which the source images are captured by the same imaging modality, there exists a way to generate the ground truth images for DCNN training in an artificial way. The CNN-based methods can resolve the challenges of manually designing complicated algorithms for ghost-removal in conventional methods by preventing motion/ghosting artifacts via a deep learning network, and are more likely to achieve better performance.

3.11 CONCLUSIONS

Despite the aforementioned superiority, the study of DCNN based medical image fusion is now at an initial stage, and there is great space for further improvement of DCNN in the field of image fusion. In this chapter, some prospects have been put forward for the study of deep learning models for image fusion. On the other hand, designing objective fusion metrics based on a DL framework still requires more attention.

REFERENCES
1. R.C. Krempien, S. Daeuber, F.W. Hensley, M. Wannenmacher, W. Harms, Image fusion of CT and MRI data enables improved target volume definition in 3D-brachytherapy treatment planning, Brachytherapy 2 (3) (2003) 164–171.
2. A.C. Paulino, W.L. Thorstad, T. Fox, Role of fusion in radiotherapy treatment planning, in: Seminars in Nuclear Medicine, vol. 33, Elsevier, 2003, pp. 238–243.
3. P. Shajan, N. Muniraj, J.T. Abraham, 3D/4D image registration and fusion techniques: a survey.
4. H. Hermessi, O. Mourali, E. Zagrouba, Convolutional neural network-based multimodal image fusion via similarity learning in the shearlet domain, Neural Computing & Applications 30 (7) (2018) 2029–2045.
5. Y. Liu, X. Chen, Z. Wang, Z.J. Wang, R.K. Ward, X. Wang, Deep learning for pixel-level image fusion: recent advances and future prospects, Information Fusion 42 (2018) 158–173.
6. S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE, 2015, pp. 4353–4361.
7. S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: a survey of the state-of-the-art, Information Fusion 33 (2017) 100–112.
8. A.P. James, B.V. Dasarathy, Medical image fusion: a survey of the state-of-the-art, Information Fusion 19 (2014) 4–19.
9. D. Wu, A. Yang, L. Zhu, C. Zhang, Survey of multi-sensor image fusion, in: International Conference on Life System Modeling and Simulation and International Conference on Intelligent Computing for Sustainable Energy and Environment, Springer, 2014, pp. 358–367.
10. D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
11. H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, in: European Conference on Computer Vision, Springer, 2006, pp. 404–417.
12. C. Harris, M. Stephens, A combined corner and edge detector, in: Alvey Vision Conference, vol. 15, Citeseer, 1988, pp. 147–151.
13. S. Leutenegger, M. Chli, R.Y. Siegwart, Brisk: binary robust invariant scalable keypoints, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 2548–2555.
14. P. Ghosh, A. Pandey, U.C. Pati, Comparison of different feature detection techniques for image mosaicing, ACCENTS Transactions on Image Processing and Computer Vision 1 (1) (2015) 1–7.
15. R.M. Kumar, K. Sreekumar, A survey on image feature descriptors, International Journal of Computer Science & Information Technologies 5 (2014) 7668–7673.
16. P. Viola, M.J. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, International Journal of Computer Vision 63 (2) (2005) 153–161.
17. S.A.K. Tareen, Z. Saleem, A comparative analysis of Sift, Surf, Kaze, Akaze, Orb, and Brisk, in: Computing, Mathematics and Engineering Technologies (iCoMET), 2018 International Conference on, IEEE, 2018, pp. 1–10.
18. E. Mair, G.D. Hager, D. Burschka, M. Suppa, G. Hirzinger, Adaptive and generic corner detection based on the accelerated segment test, in: European Conference on Computer Vision, Springer, 2010, pp. 183–196.
19. K. Sharma, A. Goyal, Classification based survey of image registration methods, in: 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), IEEE, 2013, pp. 1–7.
20. S. Sergeev, Y. Zhao, M.G. Linguraru, K. Okada, Medical image registration using machine learning-based interest point detector, in: Medical Imaging 2012: Image Processing, vol. 8314, International Society for Optics and Photonics, 2012, p. 831424.
21. H. Ghassemian, A review of remote sensing image fusion methods, Information Fusion 32 (2016) 75–89.
22. D.E. Nirmala, V. Vaidehi, Comparison of pixel-level and feature level image fusion methods, in: Computing for Sustainable Global Development (INDIACom), 2015 2nd International Conference on, IEEE, 2015, pp. 743–748.
23. J. Du, W. Li, K. Lu, B. Xiao, An overview of multi-modal medical image fusion, Neurocomputing 215 (2016) 3–20.
24. G. Easley, D. Labate, W.-Q. Lim, Sparse directional image representations using the discrete shearlet transform, Applied and Computational Harmonic Analysis 25 (1) (2008) 25–46.
25. H. Hermessi, O. Mourali, E. Zagrouba, Multimodal image fusion based on non-subsampled shearlet transform and neuro-fuzzy, in: International Workshop on Representations, Analysis and Recognition of Shape and Motion From Imaging Data, Springer, 2016, pp. 161–175.
26. Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
27. W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
28. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
29. G. Kutyniok, D. Labate, Shearlets: Multiscale Analysis for Multivariate Data, Springer Science & Business Media, 2012.
30. H. Greenspan, B. van Ginneken, R.M. Summers, Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique, IEEE Transactions on Medical Imaging 35 (5) (2016) 1153–1159.
31. N. Tajbakhsh, J.Y. Shin, S.R. Gurudu, R.T. Hurst, C.B. Kendall, M.B. Gotway, J. Liang, Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Transactions on Medical Imaging 35 (5) (2016) 1299–1312.
32. H.-C. Shin, H.R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, R.M. Summers, Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning, IEEE Transactions on Medical Imaging 35 (5) (2016) 1285–1298.
33. Kai-jian Xia, Hong-sheng Yin, Jiang-qiang Wang, A novel improved deep convolutional neural network model for medical image fusion, Cluster Computing (2018) 1–13.
34. Y. Liu, X. Chen, J. Cheng, H. Peng, A medical image fusion method based on convolutional neural networks, in: Information Fusion (Fusion), 2017 20th International Conference on, IEEE, 2017, pp. 1–7.
35. Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network, Information Fusion 36 (2017) 191–207.
36. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.
37. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and detection using convolutional networks, arXiv preprint, arXiv:1312.6229.
38. P. Jagalingam, A.V. Hegde, A review of quality metrics for fused image, Aquatic Procedia 4 (2015) 133–142.
39. S. Singh, D. Gupta, R. Anand, V. Kumar, Nonsubsampled shearlet based CT and MR medical image fusion using biologically inspired spiking neural network, Biomedical Signal Processing and Control 18 (2015) 91–101.
40. C. Xydeas, V. Petrovic, Objective image fusion performance measure, Electronics Letters 36 (4) (2000) 308–309.
41. S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Transactions on Image Processing 22 (7) (2013) 2864–2875.
42. B. Rajalingam, D.R. Priya, A novel approach for multimodal medical image fusion using hybrid fusion algorithms for disease analysis, International Journal of Pure and Applied Mathematics 117 (15) (2017) 599–619.
43. G. Piella, H. Heijmans, A new quality metric for image fusion, in: Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, vol. 3, IEEE, 2003, pp. III–173.
44. Y. Chen, R.S. Blum, A new automated quality assessment algorithm for image fusion, Image and Vision Computing 27 (10) (2009) 1421–1432.
45. G. Bhatnagar, Q.J. Wu, Z. Liu, Directive contrast based multimodal medical image fusion in NSCT domain, IEEE Transactions on Multimedia 15 (5) (2013) 1014–1024.
CHAPTER 4

Medical Imaging With Intelligent Systems: A Review
GEETHU MOHAN, ME • M. MONICA SUBASHINI, PHD
KEY TERMS

Magnetic Resonance Imaging (MRI), Magnetic Resonance (MR), World Health Organization (WHO), Central Nervous Sys-
tem (CNS), Hospital-Based Brain Tumor Registry (HBBTR), Central Brain Tumor Registry of the United States (CBTRUS),
American Brain Tumor Association (ABTA), Confidence Interval (CI), Computed Tomography (CT), Proton Density (PD),
Fluid Attenuation Inversion Recovery (FLAIR), Cerebro Spinal Fluid (CSF), Magnetic Resonance Spectroscopy (MRS),
N-acetyl aspartate (NAA), Choline (Cho), Creatine (Cr), myo-Inositol (mI), Lipid (Lip), Lactate (Lac), Dynamic Susceptibil-
ity Contrast (DSC), relative Apparent Diffusion Coefficient (rADC), Advanced Normalization Tools (ANTs), Virtual Skeleton
Database (VSD), Brain Tumor Segmentation (BRATS), Statistical Parametric Mapping (SPM), Convolutional Neural Net-
work (CNN), Random Forest (RF), Conditional Random Field (CRF), Region of Interest (ROI), Fully Convolutional Neural
Network (FCNN), Conditional Random Fields (CRF)

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00011-7
Copyright © 2019 Elsevier Inc. All rights reserved.

4.1 INTRODUCTION

The word "tumor" is of Latin origin and means swelling; it is intermittently associated with a neoplasm, i.e., a growth caused by uncontrolled cell proliferation. The classification of brain tumors is based on location, tissue type, malignancy level, and various other factors. With the aid of a microscope, the tumor cells are examined to determine their malignancy level. This analysis helps in grading the tumor based on its level of malignancy, from most to least malignant. The grade of a tumor is determined using factors like the rate of cell growth, the blood supply to the cells, the presence of dead cells in the center of the tumor (necrosis), the confinement of cells to a specific area, and the similarity of the tumorous cells to normal cells. All these assessments proceed invasively, so image examination has become essential, as it allows repeated, non-invasive information to be recovered over time.

Raymond Damadian revealed in 1971 that there is a difference between the nuclear magnetic relaxation times of tumors and normal tissues, which motivated scientists to consider MR for the detection of disease. The growth in the field of MRI triggered more opportunities. Clinical imaging, or radiology, takes its name from "invisible light" medical imaging. The medical practitioner engaged in acquiring and interpreting medical images of diagnostic quality is a radiologist, and there has always been a requirement for radiologists trained to read MR images. For extraction or better visualization of information, various post-processing algorithms are applied to MR images by an MRI post-processing technologist. The intention has been to develop techniques that will automatically classify images into the various tumor grades and hence reduce the burden on radiologists classifying the huge number of images generated by the various MRI scanners.

The review is further structured as follows: tumor types and grading (Sect. 4.2), where we brief on the different types of tumor and summarize gliomas, tumor grading, symptoms, diagnosis, and treatment of brain tumors; imaging techniques (Sect. 4.3); machine learning, supervised and unsupervised methods, along with software packages (Sect. 4.4); deep learning and its libraries (Sect. 4.5); evaluation and validation metrics (Sect. 4.6); embedding into the clinic (Sect. 4.7); and the current state-of-the-art, including deep learning concepts (Sect. 4.8). The related subsections are grouped within each section.

4.2 TUMOR TYPES AND GRADING

The prevalence of brain tumors has advanced over time, varying based on age, gender, race, ethnicity, and geography. With the advent of improved diagnostic imaging methodologies, the demand for neurosurgeons, medical care, and variant treatment approaches also escalated. Furthermore, the advances in classifications of
specific brain tumor histology have all together made a great contribution towards the therapeutics of a brain tumor.

FIG. 4.1 All primary brain tumors with other CNS tumors distribution [52].

The World Health Organization (WHO) scheme classifies and grades the central nervous system (CNS) tumors for clinical and research purposes. Based on the predicted biological behavior and histological type, the tumors are segregated. Over 100 types of primary CNS tumors that are histologically distinct are marked by their own band of clinical presentations, treatments, and outcomes. In 1979, the first edition of the WHO Classification of Tumors of the CNS was issued. It has been revised four times, most recently in 2016, and is considered the international standard for the classification of CNS tumors. This classification system acts as a benchmark for conversation between basic science investigators and clinicians worldwide [35]. The latest update moved beyond diagnostic methods based entirely on microscopy by including molecular parameters for the classification of CNS tumor entities [37].

Other types of cancer are staged, while brain tumors are classified according to the WHO classification of tumors of the CNS, which, based on predicted clinical behavior, assigns a grade from I through IV [52]. On the basis of histopathology, brain tumors are classified into tumors of neuroepithelial tissue (hereafter referred to as glioma, which includes the grade II astrocytoma, the grade III anaplastic astrocytoma, the grade IV glioblastoma, oligodendroglioma, and ependymoma), germ cell tumors, and tumors of the meninges and of the sellar region [12].

4.2.1 Why Are Studies Concentrated Mostly on Gliomas?

A type of brain tumor that grows from glial cells is called glioma. The various types of glial cells are astrocytes, oligodendrocytes, microglia and ependymal cells, each having different functions. Collectively, the term glioma is used to describe the different types of glial tumors: oligodendroglioma, astrocytoma and glioblastoma. (Microglia are part of the immune system and not truly glial cells.) Pilocytic astrocytoma is of grade I; grade II includes astrocytoma, oligodendroglioma and oligoastrocytoma; grade III includes anaplastic astrocytoma, anaplastic oligodendroglioma and anaplastic oligoastrocytoma; and grade IV is called glioblastoma multiforme [40].

The primary malignant brain tumors are gliomas [66], so it is crucial to identify the conditions that can be modified/enhanced to avert this disease. These tumors have different shapes, sizes, and contrasts and can develop anywhere in the brain, but mostly affect the cerebral hemispheres [17]. Broadly categorized, glioma forms approximately 24.7% of all primary brain and other CNS tumors (Fig. 4.1) and accounts for 74.6% of malignant tumors (Fig. 4.2).

Among all malignant tumors, the number of cases is estimated to be the highest for glioblastoma, as shown
in Table 4.1, with 12,150 cases in 2016 and 12,390 in 2017 [52]. Glioblastoma forms 46.6% of primary malignant brain tumors (Fig. 4.2) and 14.9% of all primary brain and other CNS tumors (Fig. 4.1). Being more commonly seen in older adults and rarely in children, the incidence of glioblastoma increases with age, showing maximum rates for the age group from 75 to 84 years.

FIG. 4.2 Malignant primary tumors with other CNS tumors distribution [52].

TABLE 4.1
Overall estimation of astrocytoma cases [52].

Histology                Estimated new cases by 2016   Estimated new cases by 2017
Pilocytic astrocytoma    1100                          1120
Diffuse astrocytoma      1180                          1110
Anaplastic astrocytoma   1330                          1340
Glioblastoma             12,150                        12,390

The increased incidence of brain tumors is at least partially traceable to improved radiographical diagnosis rather than histological analysis of tumors [52]. To contribute to large-scale national studies, the Hospital-Based Brain Tumor Registry (HBBTR) data was generated, examining the brain tumor distribution within the Indian population [24]. This provided a clinical benchmark for comparison of previous and future studies of CNS tumor data, and also a comparison with the international registries. The registry generated from the years 2010 to 2014 reported a total of 4295 cases. Over this five-year period, tumors in the registry were 18.9% grade IV, 20% grade III, 11.4% grade II, and 36.3% grade I. The most common tumors among adults were reported to be glioblastomas (38%), followed by anaplastic oligodendrogliomas (24.5%). In the pediatric group, pilocytic astrocytomas (44%) contributed almost half of all gliomas seen in children, followed by ependymomas (31%). On making an overall comparison between CBTRUS and HBBTR, the most frequently reported histology is meningioma with 36% vs. 20%, followed by glioblastoma with 16% vs. 14%.

Childhood CNS tumors are the most severe group of tumors due to their high incidence and mortality rate [23]. Data was collected from seven tertiary care hospitals in India, covering 3936 pediatric patients, regarding the rate of occurrence of various primary brain tumors. The results revealed astrocytic tumors to be the most frequent primary pediatric brain tumors (34.7%), in which pilocytic astrocytoma was 23.1%, followed by grade II with 5.1%, grade III with 2.1%, and grade IV with 4.4%. When comparing the incidence of CNS tumors in different countries with India, it has been reported that the highest occurrence is for astrocytomas in other countries also.
4.2.2 Grading

For patient management, the grading of CNS neoplasms has profound importance [37]. Based on morphological features, the grading scheme predicts the clinical behavior of the tumor. The concept of CNS tumor grading is the progressiveness of neoplasias, from localized and benign tumors to infiltrating and malignant tumors. Histological grading is used to predict the biological behavior of a neoplasm. Clinically, a tumor grade has a major influence on the selection of therapies (radiotherapy and chemotherapy). The WHO grading scheme is a "malignancy scale" defining a broad variety of neoplasms [36]. The treatment and prognosis of a brain tumor are linked with the need for an accurate pathological grading.

To provide clarity for diagnoses, the WHO scheme provides a four-tiered histological grading for astrocytomas. They are designated by a grade ranging from I to IV, with IV being the most aggressive and I the least aggressive. This system is established on the basis of the appearance of attributes like atypia (structural abnormality of the cell), mitosis (division of the nucleus), endothelial proliferation (an apparent multilayering of endothelium), and necrosis (cell injury resulting in premature cell death). These features show the tumor malignancy level in terms of growth rate and invasion, as shown in Table 4.2. WHO describes grade I tumors as having these features absent, grade II (diffuse astrocytoma) as having only cytological atypia, grade III (anaplastic astrocytoma) as also having anaplasia (morphological changes in a cell) and mitotic activity, and grade IV as additionally exhibiting microvascular proliferation and/or necrosis.

TABLE 4.2
Grading of astrocytoma.

Tumor                    Atypia   Mitosis   Endothelial proliferation   Necrosis   WHO grade   General grade
Pilocytic astrocytoma    –        –         –                           –          I           Low-grade gliomas
Diffuse astrocytoma      ✓        –         –                           –          II          Low-grade gliomas
Anaplastic astrocytoma   ✓        ✓         –                           –          III         High-grade/malignant gliomas
Glioblastoma             ✓        ✓         ✓                           ✓          IV          High-grade/malignant gliomas

The pilocytic astrocytomas (grade I), being the most benign of astrocytomas, are frequently encountered in the first and second decades of life. On MRI they appear as large cysts with low proliferative potential, and they can be cured by resection. The survival rate at 5 years from diagnosis is 87%. Diffuse astrocytomas (grade II), often referred to as low-grade gliomas, show less proliferation and often recur. On MRI they give the appearance of an area of low density with little mass effect. The survival rate at 5 years from diagnosis is 35%. These infiltrative lesions are localized in the white matter of the cerebral hemispheres in adults between 30 and 40 years. The infiltrating tumors, anaplastic astrocytomas (grade III), have an average age at diagnosis between those of low-grade astrocytomas (grades I and II) and glioblastoma multiforme. Despite originating as primary tumors, most grade III tumors formed from grade I and II tumors and can possibly advance to grade IV. At 5 years from diagnosis, the survival rate for this tumor is 31%. Grade III lesions show nuclear atypia and brisk mitotic activity. Therapy for grade III tumor patients usually includes radiation and/or chemotherapy. The highly malignant and fast-growing brain tumor glioblastoma multiforme (grade IV) has a 3% survival rate at 5 years from diagnosis. On MRI, it displays heterogeneous enhancement with centrally non-enhancing regions and mass effect. These mitotically active, cytologically malignant, necrosis-prone neoplasms have a fatal outcome, as they develop within the main mass of the brain, invading nearby tissues. Grade IV neoplasms are usually characterized by widespread infiltration into neighboring tissue [7,82].

Grade I–II tumors are primarily located in midline locations, such as the diencephalic region and cerebellum, including the hypothalamus and visual pathway. Grade III–IV tumors are usually located in the pontine areas of the brain stem or the cerebral hemispheres. Most low-grade astrocytomas are curable by surgical resection alone while, despite the addition of radiotherapy and chemotherapy, the prognosis remains poor for high-grade astrocytomas.

According to the ABTA, a patient's treatment response depends on the patient's age, the tumor malignancy grading, the amount of tumor removed, and his/her general health. This shows the requirement of an efficient grading system [65]. See Tables 4.3–4.6.

TABLE 4.3
Astrocytoma distribution, 2009–2013.

Histology                5-year total   Annual average   % of all tumors   Median age   Proportion of all gliomas   Median survival (years)
Pilocytic astrocytoma    5106           1021             1.4%              12.0         <5%                         >10
Diffuse astrocytoma      8081           1616             2.2%              48.0         25–30%                      >5
Anaplastic astrocytoma   6245           1249             1.7%              53.0         25–30%                      3
Glioblastoma             54,980         10,996           14.9%             64.0         40–50%                      1

TABLE 4.4
Distribution of histologically confirmed astrocytomas, 2011–2013 [52].

Histology                Number of newly     Histologically   Grade 1   Grade 2   Grade 3   Grade 4
                         diagnosed tumors    confirmed        (assigned grade)
Pilocytic astrocytoma    3078                79.1%            92.6%     6.2%      0.8%      0.5%
Diffuse astrocytoma      4523                79.2%            4.2%      58.1%     22.7%     15.0%
Anaplastic astrocytoma   3867                92.8%            0.1%      0.9%      90.2%     8.7%
Glioblastoma             33,631              79.1%            0.2%      0.2%      1.0%      98.7%

TABLE 4.5
Age-specific incidence rates for astrocytoma, 2009–2013 [52]. Entries are rate (95% CI) by age at diagnosis.

Histology   0–19 years         20–34 years        35–44 years        45–54 years        55–64 years        65–74 years           75–84 years           85+ years
Grade I     0.88 (0.85–0.91)   0.24 (0.22–0.25)   0.12 (0.11–0.14)   0.09 (0.08–0.10)   0.08 (0.07–0.10)   0.06 (0.04–0.07)      0.07 (0.05–0.09)      –
Grade II    0.27 (0.25–0.29)   0.50 (0.48–0.53)   0.56 (0.53–0.60)   0.58 (0.55–0.61)   0.77 (0.73–0.81)   0.97 (0.91–1.03)      1.08 (1.00–1.16)      0.60 (0.52–0.70)
Grade III   0.09 (0.08–0.10)   0.30 (0.28–0.31)   0.41 (0.38–0.44)   0.46 (0.44–0.49)   0.65 (0.61–0.68)   0.92 (0.86–0.98)      0.91 (0.84–0.99)      0.42 (0.34–0.50)
Grade IV    0.16 (0.15–0.17)   0.42 (0.40–0.45)   1.21 (1.16–1.26)   3.55 (3.47–3.63)   8.11 (7.98–8.24)   13.09 (12.87–13.31)   15.27 (14.97–15.57)   9.16 (8.81–9.52)

TABLE 4.6
Average annual age-adjusted incidence rates of astrocytomas, 2009–2013 [52]. Entries are rate (95% CI) by age at diagnosis.

Histology                0–14 years         15–39 years        40+ years
Pilocytic astrocytoma    0.98 (0.95–1.02)   0.28 (0.27–0.30)   0.08 (0.08–0.09)
Diffuse astrocytoma      0.26 (0.24–0.28)   0.45 (0.43–0.47)   0.68 (0.66–0.70)
Anaplastic astrocytoma   0.09 (0.08–0.10)   0.29 (0.27–0.30)   0.62 (0.60–0.64)
Glioblastoma             0.15 (0.14–0.17)   0.48 (0.46–0.50)   6.95 (6.89–7.01)

4.2.3 Symptoms of a Brain Tumor

The most common brain tumor symptoms are difficulty in thinking, finding words or speaking; seizures or convulsions; headaches; weakness or paralysis in one part or one side of the body; personality or behavior changes; changes in vision; hearing loss; dizziness or loss of balance; disorientation or confusion; and memory loss. See Table 4.7.

TABLE 4.7
Symptoms of brain tumors [7]. Entries are the percentage of patients with the symptom, by tumor type.

Symptoms                      Low-grade glioma   Malignant glioma
Headache                      40                 50
Seizure                       65–95              15–25
Hemiparesis                   5–15               30–50
Mental status abnormalities   10                 40–60

4.2.4 Diagnosis and Treatment of a Brain Tumor

An invasive method for brain tumor diagnosis is the spinal tap. A biopsy is another method, where tissue is taken out to check for tumor cells. The only assured way for brain tumor diagnosis, grade determination and treatment planning is a biopsy. Being an invasive technique, a needle biopsy, although currently the only reliable diagnosis for a brain tumor, is not generally recommended in the initial stage of diagnosis [69].

Treatment of a brain tumor in early stages is a challenging task due to size, shape and location variation, and can only be performed by a trained professional neuroradiologist [60]. A system proposed in [55], based on qualitative information, determines the degree of tumor abnormality using stained hematoxylin–eosin tissue biopsies (the gold standard in biopsy). These stains are examined by a histopathologist. Although the identification of tumor grade is accurately done by the WHO grading scheme, there is significant intra- and inter-observer variability that significantly influences diagnosis quality [55].

The prognosis for a brain tumor depends on its location, type, grade, the spreading of the tumor inside the brain, how long the symptoms existed prior to diagnosis, and how much the patient's functionality is affected by the tumor. Similarly, the treatment of brain tumors will rely specifically on different factors like the tumor location, type and size, patient symptoms and general health, whether the tumor is malignant or benign, and treatment preferences. Surgery, chemotherapy and radiation therapy are the preeminent treatments for a brain tumor. Depending on the severity and various other factors, patients may undergo only one treatment method or a combination of treatments.
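The feature-to-grade mapping described in Sect. 4.2.2 (and summarized in Table 4.2) is simple enough to sketch as a toy rule-based function. The function name and boolean encoding below are illustrative only, not from the chapter; actual grading is performed by a histopathologist on biopsy material.

```python
def who_astrocytoma_grade(atypia: bool, mitosis: bool,
                          endothelial_proliferation: bool,
                          necrosis: bool) -> int:
    """Map the four WHO histological features to a grade (1=I .. 4=IV).

    Grade I: all features absent; grade II: cytological atypia only;
    grade III: atypia plus mitotic activity; grade IV: additionally
    microvascular (endothelial) proliferation and/or necrosis.
    """
    if endothelial_proliferation or necrosis:
        return 4  # glioblastoma
    if mitosis:
        return 3  # anaplastic astrocytoma
    if atypia:
        return 2  # diffuse astrocytoma
    return 1      # pilocytic astrocytoma


def is_high_grade(grade: int) -> bool:
    """Low-grade gliomas are grades I-II; high-grade/malignant are III-IV."""
    return grade >= 3
```

For example, `who_astrocytoma_grade(atypia=True, mitosis=True, endothelial_proliferation=False, necrosis=False)` returns 3, matching the anaplastic astrocytoma row of Table 4.2.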
4.3 IMAGING TECHNIQUES

A cranial MRI is the only non-invasive test required for a brain tumor diagnosis [7]. On the contrary, computed tomography (CT) may fail to show structural lesions, particularly non-enhancing tumors like the low-grade gliomas. Diagnostically, the best choice to rule out the possibility of a brain tumor is an MRI with gadolinium enhancement. As MRI can be sensitized to various contrast parameters, it enables a comprehensive assessment of normal and abnormal brain physiology. MRI, being in vivo, longitudinal, and multiparametric, provides unique opportunities for characterizing and delineating experimental models of neurological diseases like stroke and brain tumors.

The plethora of available MR contrast mechanisms, in conjunction with its superior dynamic functional range, bestows on MRI the potential to be a formidable tool in the noninvasive, in vivo, multilevel assessment of tumor physiology [20]. The main advantages of MRI are the absence of harmful ionizing radiation; being a non-invasive, painless technique that can also be performed without contrast; great soft tissue contrast along with high spatial resolution; and direct multiplanar imaging – sagittal, coronal and axial planes, displaying many images and oblique cuts [50].

There are various formats used for storing an MR image, which we can broadly classify into two groups. The output of the machine which captures the MR images is known as the scanner format. The other type is known as the image processing format; it is obtained by a conversion from the original MRI scanner format. The work in [14] uses the Neuroimaging Informatics Technology Initiative (NIfTI-1.1) format as the image processing format of MRI. See Fig. 4.3.

FIG. 4.3 Neuroimaging modalities and its applications [50].

3D volumes made up of a set of slices are the general output of MRI systems. The latest MRI systems provide 3D images of 16-bit depth [51]. An MRI contains an enormous amount of information, but our eyes are unable to discriminate beyond several tens of gray levels. This inability can be overcome by taking the aid of computers to attain the entire information contained in an MRI.

Grading a glioma histopathologically, with its inaccurate and limited tissue samples, subjective grading criteria and tumor inaccessibility, paves the path towards
the immense requirement of an automatic segmentation procedure [11]. The high contrast of soft tissues, the high spatial resolution, the absence of harmful radiation and the non-invasiveness of MR imaging methodology all aid in the development of automated diagnostic tools [9].

4.3.1 Reading an MR Image

The various attributes of a tumor region can be obtained from the different MRI sequences; hence, between sequences, the intensity profiles of tumor tissues change. Observing, analyzing image features and interpreting multispectral MR images turns out to be a time-consuming and challenging task for radiologists. Furthermore, the heterogeneous intensity profiles, tumor orientation, shape, and overlapping intensities add to the difficulty in diagnosis. This results in a differential diagnosis. Simultaneously differentiating between distinctive tumor types, each having identical features, is a demanding task [33,83].

T1-weighted, T2-weighted and proton density MRIs are sensitized to the longitudinal MR relaxation time (T1) of tissue water, the transverse MR relaxation time (T2) of tissue water, and the proton concentration, respectively. T2*-weighted MRI is sensitized to transverse MR relaxation that is not corrected for phase shifts caused by local field inhomogeneities. These MR techniques are commonly applied to detect brain lesions, because pathologies like tumors, edema, and hemorrhage are associated with changes in water content and relaxation rates.

The infiltrative tumors, glioblastomas, have borders that are often fuzzy and hard to distinguish from healthy tissues. Hence, more than one MRI modality is often employed, e.g., T1 (spin-lattice relaxation), T2 (spin-spin relaxation), T1-contrasted (T1C), proton density (PD) contrast imaging, diffusion MRI (dMRI), and fluid attenuation inversion recovery (FLAIR) pulse sequences. The contrast between these modalities gives each tissue type a distinct appearance.

The frequently used sequence for structural analysis is the T1-w sequence, which gives an easy depiction of healthy tissues. The tumor border appears brighter in T1-weighted contrast-enhanced images because of the accumulation of the contrast agent due to blood–brain barrier disruption in the proliferative tumor region. In a T2-weighted MRI, the brightly appearing region is the edema that encircles the tumor. A special sequence is the T2 FLAIR, which aids in separating the edema region from the CSF [1]. See Fig. 4.4.

FIG. 4.4 An axial slice of high-grade glioma. From left to right: T1-w, T1-wc, T2-w, and T2 FLAIR.

T1-w images show gray matter as gray colored and white matter as white colored, while in T2-w images white matter is gray colored and gray matter is white colored. FLAIR images guide in viewing the tissues by suppressing the cerebrospinal fluid and water content in the brain [62]. Typically, astrocytomas are iso-intense on T1-w and hyperintense on T2-w images. While low-grade astrocytomas rarely enhance on MRI, most anaplastic astrocytomas enhance when using contrast agents. T2-w MRI has been the most preferred method for delineating lesions in clinical and experimental diagnostic studies because of its histologically validated superior sensitivity in detecting tissue damage [20]. To perceive the extent of the tumor and delineate its presence, contrast-enhanced T1-w MRI is used. A single ROI in a single MR slice should not be used to identify tumor volume and growth, because a tumor is a 3D object; hence it should be assessed from 3D data, i.e., a stack of 2D slices [11]. So multiple MRI modalities are to be used.

4.3.2 Advanced Magnetic Resonance Imaging

4.3.2.1 MRS

While MRI is the most sensitive modality available for the detection of brain tumors, it has low specificity, and many types of tumor share a similarity in appearance on an MRI; thus it is difficult to find the grade and type of a
tumor using MRI [38]. These demerits can be overcome by the use of MRS.

Magnetic resonance spectroscopy is another noninvasive technique that allows quantitative and qualitative assessment of specific metabolites in the brain parenchyma or intracranial extra-axial spaces. MRS analysis of brain tumors can be done using 1H (proton) MRS or, less frequently, with 31P (phosphorus) or 13C (carbon) MRS techniques. For 1H MRS, the most common metabolites evaluated in routine clinical practice are N-acetyl aspartate (NAA), choline-containing compounds (Cho), creatine (Cr), myo-inositol (mI), lipid (Lip), and lactate (Lac) [59]. MRS does not utilize high-energy radiation, contrast agents or labeled markers. The metabolite spectra from MRS imaging add a new dimension towards the discrimination of lesions. In [48], three metabolite markers of neuronal integrity were evaluated (Cho, NAA, and Cr).

Metabolic inhomogeneity is a notable feature of brain tumors: the spectrum from the necrotic core of a high-grade tumor varies from that of an actively growing rim [21], and the spectra of brain tumors differ from the spectra of normal brain tissue. NAA is a neuronal metabolite which is decreased in processes with neuronal destruction or dysfunction. Most brain tumors have reduced NAA signals and heightened levels of Cho, resulting in elevated Cho/NAA ratios. Cho is increased in brain tumors due to increased membrane turnover; it correlates well with the degree of tumor infiltration into neighboring tissue and with tumor cellular density. Cr is a metabolite related to cellular energy metabolism and is relatively stable in the different pathologic processes affecting the CNS; hence it is useful as a reference metabolite. Lip peaks are an indication of areas of necrosis, and Lac peaks originate from processes resulting in anaerobic metabolism. mI is a marker of astrocytic metabolism and is elevated in certain pathologic processes [59]; specifically, mI has been reported to be high in grade II gliomas. Neoplastic processes have metabolic byproducts related to their mitotic activity (Cho) and neuronal dysfunction (NAA) that can be detected by MRS and hence improve the accuracy of the clinical diagnosis. Rapalino and Ratai [59] give an elaborate idea about the metabolites evaluated with MRS in brain tumor imaging.

Various studies show that MRS values can be used for predicting the histological grade of gliomas. A higher Cho/NAA ratio represents higher WHO grades among glial tumors, and the Cho/Cr ratio is more accurate for differentiating high-grade from low-grade gliomas. MRS can potentially identify areas with abnormal metabolite ratios that extend beyond the tumor margins defined by conventional MR sequences. Many studies show a good correlation of WHO grade and metabolite ratios (Cho/NAA, Cho/Cr, and Lip-Lac/Cr). Many gliomas display high levels of citrate (not present in the normal brain), particularly in the pediatric population. Pilocytic astrocytoma is found to have decreased Cr levels and variable degrees of Cho/Cr ratios. Rapalino and Ratai [59] state that MRS complements the information provided by the conventional MR imaging sequences and should always be used in conjunction with the other imaging studies.

In astrocytomas, Cho concentrations are higher for the higher grades of tumor, yet it is noticed that high-grade tumors like glioblastoma multiforme (grade IV) show lower levels of Cho compared to astrocytoma grade II or III. This is because high-grade tumors have necrotic cores, and necrosis is related to diminished levels of all metabolites. In non-necrotic, high-grade brain tumors, Cho levels are typically seen to be high.

MRS can precisely provide quantitative metabolite maps of the brain, enabling a view of the tumor's heterogeneous spatial extent outside and inside the MRI-detectable lesion. The studies in [38] show that the accuracy of brain tumor classifiers can be improved by the use of image intensities together with spectroscopic information. In [11], a combination of conventional MR imaging and dynamic susceptibility contrast MR imaging is a positive pre-surgical glioma grade indicator. Generally, the grading of glioma using dynamic susceptibility contrast is done on the basis of cerebral blood volume (CBV) value analysis inside the tumor area, using either a histogram analysis method or a hot-spot method. An experienced operator with sound anatomical knowledge does the precise identification of glioma tissue; this causes the current grading approaches to be inherently time-consuming and operator-dependent. Conventional MR imaging protocols for brain tumors consist of 2D MR images, which are suboptimal for tumor segmentation in comparison to 3D MR images. As the appearance of a tumor on anatomical MR images (signal heterogeneity, edema, T1-w contrast enhancement, and necrosis) correlates well with grade, low- and high-grade gliomas can be evaluated separately; in [11], the appearance of high- and low-grade gliomas was indeed evaluated separately.

The major challenges in combining MRI and MRS signals are (i) the different spatial resolutions of MRI (high) and MRS (low) and (ii) achieving low computational complexity with the best discrimination accuracy. So the combination of the features of metabolite distribution from MRS with the 3D volumetric texture features from MRI is significant [48]. MRI and MRS can be equally used for the purpose
62 Deep Learning and Parallel Computing Environment for Bioengineering Systems

of classification and provide comparable results, but the collection of MRS data is laborious and needs expertise in signal conditioning [63].

Magnetic resonance spectroscopy (MRS), perfusion-weighted imaging (PWI) and diffusion-weighted imaging (DWI) are advanced MR techniques which have added value over conventional MRI in predicting neoplastic histology. They provide additional information about tumor histological features such as neovascularization, grade of cellularity and mitotic index [15]. The imaging features recognized as independent predictors of tumor grade were enhancement and necrosis, with a specificity of 76% and a sensitivity of 97.6% when the variables of conventional MRI, PWI and DWI were combined [15]. Although MRI is highly accurate in tumor grade assessment, a combination of traditional and advanced MR imaging features resulted in enhanced grading of the tumor by the addition of rADC to the variables of conventional MRI.

4.3.2.2 Ultra-High-Field 7 T MRI
MRI became a standard diagnostic tool in the late 1970s, evolving from 0.6 T systems to 1.5 T scanners by the mid-1980s. Clinical 3 T systems emerged in 2000. In 1998, for research purposes, the first human 8 T scanner was introduced. By early 2014, approximately 45 UHF scanners at or above 7 T, with around 10% working at 9.4 T, have been operational [73]. Siemens developed its first actively shielded 7 T whole-body magnet scanner, MAGNETOM Terra, which was installed at the Erlangen University Clinic, Germany, in April 2015. The units were scheduled for delivery in early 2016, and serial production was scheduled to begin by 2017. An accelerating feature of MAGNETOM Terra is its eight parallel transmitter channels, while clinical MRI scanners worked with only one transmitter channel. Multiple channels excite a scanned anatomical structure more uniformly so as to get an improved image contrast. The multi-channel transmitter feature is only available in the research mode on the MAGNETOM Terra, providing very high spatial resolution of up to 0.2 millimeters in all directions.

Various studies have already been conducted using ultra-high-field (7 T) MRI [43,73,77]. The proposed algorithm in [26] was tested using both 3 and 7 T MRIs; all the remaining works in this review used conventional 1.5/3 T MRI scanners. Imaging at 7 T provided advantages in signal-to-noise ratio, image contrast, resolution, improved sensitivity and spectral dispersion [43].

The gray matter tissue of the brain was precisely segmented from a 3D MRI obtained from a high-field (7 T) MR scanner [70]. Powerful 7 T scanners provide images having a high signal-to-noise ratio and high inhomogeneity at the same time. The inhomogeneity of voxel intensities within similar tissue types causes improper parameter initialization and placement of seed points, resulting in bad segmentation. A statistical modeling approach and the level set segmentation method are combined to overcome this problem. The image voxels are multiplied with the bias field in order to correct the inhomogeneity. Then the bias-corrected image is segmented using the level set method.

At the same instance, [77] gives a wide review of the clinical applications of 7 T brain MRI. Contrast-rich images with high resolution of diverse pathologies can be procured. Compared to low-field-strength methods, additional pathophysiological information can be obtained for these diseases. The most relevant imaging marker to differentiate between high- and low-grade tumors is the absence or presence of MRI enhancement between pre- and post-gadolinium-based contrast agent administration. In comparison to high-grade tumors, low-grade brain tumors do not show contrast enhancement, except for pilocytic astrocytomas, which nearly always enhance. The comparison between lower field strengths and 7 T MRI with respect to tumor enhancement after contrast administration has shown no variations in the presence and size of the enhancing region. Using 7 T along with alternate MRI methods like T2*-weighted imaging may give supplementary information relative to 1.5 and 3 T MRI. Currently, only very few studies have been conducted on brain tumor patients using 7 T MRI, and it is not yet evident whether it can overcome the limitations of conventional MRI.

4.4 MACHINE LEARNING (ML) – SUPERVISED AND UNSUPERVISED METHODS
Basically, machine learning focuses on bringing out information from an image. After extraction of the features, this valuable information is further processed at a higher level to make cognitive decisions [49]. Machine learning algorithms become relevant when a clear learning problem requires an unambiguous task, performance metric, and experience. While generative methods involve modeling, discriminative methods solve classification directly. The generative and discriminative method pair of naïve Bayes and logistic regression is an analogous pair for classification, while hidden Markov model (HMM) and conditional random field (CRF) form the corresponding pair for sequential data. In pattern classification, neural networks are commonly
CHAPTER 4 Medical Imaging With Intelligent Systems: A Review 63

employed because they require no information regarding the probability distribution and the a priori probabilities of the various classes [84].

The two phases of classification are the training and testing phases. The training phase involves building the classification data. A mathematical model such as a neural network or decision tree is trained so that each set of inputs correctly gives the resultant outputs [60]. This training is to be done accurately in order to produce the most appropriate result during the testing phase. The training can be supervised or unsupervised: the former trains the model knowing the output class label for a particular tuple, while in the latter the output class label is not known. In the testing phase, tuples with an unknown class label are taken as input to get an appropriate class label as output. The model is first trained using a training dataset. Then it is used to divide the testing dataset into suitable classes. After obtaining the predicted output, it is compared with the actual values to acquire the performance measurement of the model.

Usually, classification problems can be solved either by supervised classification (it needs data accompanied by its outputs for training) or by unsupervised classification (it needs no data with outputs, and a training phase is not present) [39]. By employing a huge dataset, the classification accuracy can be improved to a small extent using the former method, but combining both methods makes segmentation results closer to reality. The segmentation method employed in [11] is unsupervised. If the first steps of the knowledge-based operations were violated, it would result in an erroneous final glioma mask, which is a flaw of unsupervised tumor segmentation. Additionally, being independent of a huge training dataset, a benefit of unsupervised methods is the lack of subjective user interactions.

The literature survey reveals a major drawback of classification: if the techniques are accurate, the time requirement is high, and vice versa, due to two reasons: (1) radiology experts find it difficult to delineate tumor borders with healthy tissues from images due to the various types of tumor that greatly vary in shape, size, location, tissue homogeneity and composition; (2) the consequence of the partial volume effect (one voxel belongs to multiple types of tissue) and the inclusion of MRI noise.

The possibility to integrate various features is a merit of machine learning classification, despite redundancy. The demerit is that it can be vulnerable to overfitting due to generalization over different brain images [18]. The overall dual pipeline for the classification of medical images, both MRI and MRS, follows the steps as in Fig. 4.5.

FIG. 4.5 An overall dual pipeline for brain tumor classification of both MRI and MRS.

4.4.1 ML Software Packages/Toolboxes
The most common commercial computing and visualization environment is MATLAB (MATrix LABoratory). The neuroimaging package SPM is an extension of MATLAB [76], while an open-source neuroimaging package is NIPY (neuroimaging in Python); for numerical/statistical analysis, it relies on various other open-source packages. Initially, ANTs (Advanced Normalization Tools) was built to provide high-performance image registration for medical image analysis. It is based on the Insight Toolkit (ITK) provided by the National Institutes of Health. Further, ANTs provided bias correction, template construction, n-tissue multivariate segmentation and cortical thickness estimation; hence it served as an integrated set of tools for multivariate neuroimage analysis. ANTsR was developed specifically to interface between ANTs and the R project for statistical computing and visualization. A modern framework for medical analytics was provided by ANTsR, which focused on statistical power and imaging-assisted prediction. An effortless interchange between medical imaging formats was possible with the use of ANTsR.

At the same time, for the comparison and evaluation of brain tumor segmentation, an online application called
VSD was developed [32]. It is a system with a data-centric concept and a search option for anatomical structures. The medical image community utilizes it for scientific collaboration, and it is a useful tool for enhancing segmentation algorithms. The Multimodal Brain Tumor Segmentation (BRATS) challenge organizers selected VSD to host their data and challenge.

To analyze brain imaging data sequences, the SPM software package was designed. These sequences may be series of images from various time-series or cohorts obtained from the same subject. SPM5 is used in [11]; it is outdated now, and the latest is SPM12. The latest release is created for the analysis of PET, fMRI, EEG, SPECT and MEG.

The FSL (FMRIB's Software Library, which contains statistical and image analysis tools for structural, diffusion and functional MRI brain imaging) Linear Image Registration Tool, FLIRT, is an automated, accurate and robust tool for linear inter- and intra-modal brain image registration. It was embedded to automatically calculate the transformation between T2-w and T1-w images for each patient [85]. FMRIB's Automated Segmentation Tool, called FAST, segments the 3D brain image into various tissue types, i.e., CSF, gray and white matter, etc. It simultaneously corrects spatial intensity variations (also known as RF or bias field inhomogeneities). It is based on an associated expectation-maximization algorithm and a hidden Markov random field model [85].

Various other software packages applied in this area of research are TumorSim, which is used for the simulation of synthetic brain tumor images, the DoctorNo suite of tools as a plugin for the GUI-based "DoctorEye" software platform, the GLISTR toolkit, etc.

4.5 DEEP LEARNING (DL)
The drawbacks of ML algorithms when compared with DL algorithms are that human knowledge is required for feature extraction, the curse of dimensionality, and poor generalization capability. Mrudang D. Pandya et al. [47] explain the need for DL over ML, the tools and technology available for DL, the applicability of DL in medical image analysis, hybrid architectures of DL in the field of medical image processing, and the challenges of DL. A comparison of ML vs. DL is given in Table 4.8. Also, Morgan P. McBee et al. [45] explain the intrusion of ML, followed by DL, into radiology.

DL uses multiple layers and multiple units within layers to represent highly complex functions. These hierarchical layers extract low- and high-level features using the nonlinear processing units in the layers; hence they do supervised learning of the features. Highlighting the merits of DL: the time-consuming process of creating hand-crafted features can be skipped, DL generalizes well to a new problem, and the performance improvement is considerably higher than in traditional methods. DL networks include the convolutional neural network (CNN), which does supervised learning on variable and fixed length data, and the recurrent neural network (RNN), which works on variable length sequential data only; hence the CNN dominates when the challenge is image classification. This multi-layered network uses spatial relationships for feature generation from two- or three-dimensional image data. Where ML uses matrix multiplication operations, DL uses convolution, detection and pooling operations. The common form of CNN architecture is shown in Fig. 4.6. The fundamental deep learning architectures are deep belief networks (DBN), the stacked auto-encoder (SAE) and the multilayer perceptron (MLP). See Table 4.8.

FIG. 4.6 CNN architecture.

An adaptation of the traditional ANN architecture is the so-called CNN. In a CNN, a multidimensional input image is transformed into the desired output using nonlinear activation functions and stacks of convolutional filter kernels. The number of layers, the units in each layer and the between-layer connections together form the structure of the neural network, resulting in the CNN architecture. The depth of the network is determined by the number of layers. Based on the problem at hand, this generalized network architecture can be modified. For example, if the input is an image, it is a matrix multiplied with the weighted convolution kernel. Using a small kernel, fewer parameters are extracted from the input. The layer outputs are calculated using kernels. The convolution operation is followed by a nonlinear activation operation: commonly, the rectified linear unit (ReLU) is used in the detection stage, followed by the pooling stage that modifies the output. Finally, a softmax function, i.e., a normalized exponential function, is used at the output layer.
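The stage sequence just described (convolution, ReLU detection, pooling, and a softmax output layer) can be sketched in a few lines of plain Python. This is a minimal toy illustration on an invented 5 × 5 image, not code from any of the cited toolkits; note that, as in most DL frameworks, the "convolution" below is implemented as cross-correlation.

```python
import math

def conv2d_valid(image, kernel):
    """2D convolution with valid padding: slide the kernel and take weighted sums
    (cross-correlation, as most DL frameworks implement it)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

def relu(fmap):
    """Detection stage: rectified linear unit, max(0, x) elementwise."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Pooling stage: keep the maximum of each non-overlapping 2x2 block."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def softmax(scores):
    """Output layer: normalized exponential, yielding class probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 5x5 "image" and a 2x2 kernel (all values invented for illustration).
image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 2, 1],
         [3, 0, 1, 0, 0],
         [1, 1, 0, 2, 1],
         [2, 0, 1, 1, 0]]
kernel = [[1, -1],
          [-1, 1]]

fmap = max_pool2x2(relu(conv2d_valid(image, kernel)))
probs = softmax([fmap[0][0], fmap[0][1]])
print(probs)  # two pseudo-class probabilities summing to 1
```

Running the sketch end to end turns the 5 × 5 input into a 4 × 4 feature map, then a 2 × 2 pooled map, and finally a probability vector that sums to 1, mirroring the data flow of Fig. 4.6.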
TABLE 4.8 A comparison of ML vs. DL.

- ML: Works even on low-end machines. DL: Depends on high-end machines, hence uses GPUs.
- ML: Algorithms work well with a large amount of data. DL: Performance increases only as the scale of data increases.
- ML: Uses features handpicked by experts. DL: Learns high-level features from the input data.
- ML: Comparatively takes less training time (a few seconds to a few hours). DL: Takes more training time (at times weeks) as it uses multiple parameters.
- ML: Solves the problem part by part. DL: Does end-to-end problem solving.
- ML: Easy to interpret the solution. DL: Hard to defend the solution.
- ML: The data variables are analyzed by an analyst, and the algorithm is directed by them. DL: The DL algorithms are self-directed once they are implemented.
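The two-phase (training, then testing) workflow described in Sect. 4.4 can be made concrete with a deliberately small supervised example. A nearest-centroid rule stands in here for the "mathematical model such as a neural network or decision tree"; the feature vectors and class names below are invented purely for illustration.

```python
def train_centroids(features, labels):
    """Training phase: build per-class mean feature vectors from labeled tuples."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(centroids, x):
    """Testing phase: assign the class whose centroid is nearest (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

def accuracy(centroids, features, labels):
    """Compare predictions with the known labels to measure performance."""
    hits = sum(classify(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)

# Hypothetical 2D feature vectors (e.g., two texture features per region).
train_x = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8)]
train_y = ["healthy", "healthy", "tumor", "tumor"]
model = train_centroids(train_x, train_y)

test_x = [(0.15, 0.15), (0.85, 0.95)]
test_y = ["healthy", "tumor"]
print(accuracy(model, test_x, test_y))  # 1.0 on this toy split
```

The nearest-centroid rule is deliberately primitive; the point is the workflow itself: fit on labeled tuples, then predict labels for unseen tuples and compare them against the known answers to obtain the performance measurement.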

4.5.1 DL Tools/Libraries
While considering deep learning, [22] explains Pylearn2, a machine learning research library which mostly contains deep learning models, including autoencoders, CNNs, MLPs, etc. The most popular deep learning tools include Theano [72], Pylearn [56], Caffe [4], Cuda-convnet [5], Torch [75], Deeplearning4j [8], TensorFlow [71], Keras [31], PyTorch [57], etc. The Python library Theano can be considered a mathematical expression compiler: it optimizes and evaluates mathematical expressions involving multi-dimensional arrays. Caffe is a readable and fast implementation of ConvNets in C++; it can be used for image classification with over 60 million images and has command line, MATLAB and Python interfaces. Torch, written in C, is a computing framework supporting machine learning and provides an environment similar to MATLAB. For enabling fast computation and experimentation, Keras was developed; written in Python, it runs on top of TensorFlow or Theano. When it comes to computer vision applications, MatConvNet is a suitable MATLAB toolbox. Deeplearning4j, with a Scala API, is written in Java; it has multi-GPU support but is not popular in medical imaging. TensorFlow has multiple-CPU and GPU support. PyTorch is a result of integrating Python with the Torch engine; it is flexible and better performing than Torch, with GPU integration. The DL library ranking as per [47], in descending sequence, is TensorFlow, Keras, Caffe, Theano, PyTorch, Torch, Deeplearning4j, and MatConvNet.

Usually, the image analysis pipeline involving a CNN has the following steps: (a) pre-processing (noise filtering, skull stripping, intensity normalization, image registration), (b) data preparation (data augmentation, patch based extraction), (c) classification, and (d) post-processing. Eli Gibson et al. [10] presented an open source platform for DL in medical imaging named NiftyNet. Built on the TensorFlow framework, its pipeline includes data loading and augmentation, network architecture, loss functions and evaluation metrics. Developing and distributing DL solutions for segmentation was done using NiftyNet in this work.

4.6 EVALUATION AND VALIDATION METRICS
Manual segmentation by an expert radiologist is the widely accepted ground truth; hence the comparison is done with this gold standard for the evaluation of brain tumor segmentation [1]. To make a quantitative evaluation of registration and segmentation results, the frequently used procedure is to determine the overlap with the gold standard. Generally, the Jaccard coefficient (JC) or the Dice similarity coefficient (DSC) is used; each ranges from 0 to 1, with 1 showing perfect overlap and 0 indicating no overlap. For probabilistic brain tumor segmentation, the various validation metrics are mutual information (MI), the area under the receiver operating characteristic (ROC) curve, and the Dice similarity coefficient (DSC). MI is used when sensitivity to tumor size changes is the factor of interest, ROC for overall classification accuracy, and DSC for spatial alignment evaluation. Other validation metrics include [78] the peak signal-to-noise ratio (PSNR), mean square error (MSE), Jaccard Tanimoto coefficient index (TC/JC), similarity index (SI) [53], Dice overlap index (DOI/DSC/F1 score), overlap fraction (OF, also called sensitivity), extra fraction (EF), specificity, accuracy, Euclidean distance, structural similarity (SS), normalized cross-correlation
(NCC), normalized absolute error (NAE) [6], Williams' index [29], etc.

4.7 EMBEDDING INTO CLINICS
Despite enormous research in this field over decades, applications of these methodologies to clinics are limited, as clinicians still trust manual tumor delineations. This could be because of the communication gap between clinicians and researchers. The research tools developed till now are not familiar to clinicians; hence efforts must be concentrated on making them user-friendly in the future. Another reason may be the challenge that the transfer of technology from bench to bedside is valid only when the efficient outputs obtained in a controlled research environment are reproducible in clinical routine. Robustness is a crucial factor for the daily use of these protocols: there must be robustness towards slight changes in acquisition protocols, and flexibility for upgrades.

4.8 CURRENT STATE-OF-THE-ART
The advanced technologies in automated brain tumor segmentation have been compared in the Multimodal Brain Tumor Image Segmentation (BRATS) MICCAI challenges since 2012. An annotated data set with about 60 high- and low-grade cases is publicly available from the VSD and the MIDAS webpages, two online platforms for hosting and evaluating image segmentation benchmarks. MICCAI 2016 emphasized longitudinal segmentation tasks, estimated the size of relevant tumor structures, and predicted whether the tumor was progressing, shrinking, or remained stable for a set of two scans of a given patient [41].

In [41], Mustafa Arikan used an anisotropic diffusion filter for noise reduction, a bounding box containing the tumor was extracted, and a certain number of seeds were randomly selected from the dataset. Finally, segmentation was done using SVM on FLAIR, T1, T1c and T2 modalities. Peter D. Chang proposed a fully convolutional neural network with hyper-local features for brain tumor segmentation; the network was composed of only 130,400 parameters and could complete segmentation for an entire brain volume in less than one second. Dimah Dera proposed a non-negative matrix factorization level set segmentation technique, i.e., a decomposition technique that reduces the dimensionality of an image; segmentation of 465 images took 240 minutes. Abdelrahman Ellwaa introduced an iterative random forest approach which tried to improve its accuracy by iteratively choosing the best patients. Konstantinos Kamnitsas employed DeepMedic, a 3D CNN architecture for lesion segmentation, extended with residual connections. Varghese Alex proposed 5-layer deep stacked denoising autoencoders (SDAE) for the segmentation of gliomas, where the training was done using patches of size 21 × 21 extracted from various MRI sequences like T1, T2, FLAIR, and T1 post-contrast images. Tseng Kuan Lun presented a fully-automatic segmentation method by utilizing a CNN. Laszlo Lefkovits used a discriminative model based on RF to accomplish brain tumor segmentation in multimodal MR images, wherein a feature vector with 960 elements was obtained from 240 image features extracted from each modality. Loic Le Folgoc proposed cascades of lifted decision forests for segmentation, which used an SMM-MRF (Student mixture-MRF) layer to locate the ROI for the whole tumor. Richard McKinley proposed Nabla-net, a deep DAG-like convolutional architecture applied to high- and low-grade glioma segmentation; a nabla net is a deep encoder/decoder network. Raphael Meier proposed a dense CRF which can overcome the shrinking bias inherent in many grid-structured CRFs; the focus was on the segmentation of glioblastoma. Xiaomei Zhao integrated an FCNN and CRF for segmentation, rather than adopting the CRF as a post-processing step of the FCNN. Balaji Pandian proposed a fully automated approach based on a 3D CNN with subvolume training procedures for brain tumor segmentation in multi-modal MRIs, while Adria Casamitjana proposed two fully convolutional 3D CNN architectures which are a variant of the two-pathway DeepMedic net. Bi Song proposed anatomy-guided brain tumor segmentation and classification to delineate tumor structures into the active tumorous core, necrosis, and edema.

The proposals in [41] provide a broad spectrum of segmentation methodologies. Apparently, it is clear that in the last two years there was an increase in the use of deep learning methods, specifically CNNs, in several computer vision tasks. The pre-processing stage commonly used N4ITK for bias correction, and the focus was on gliomas. A detailed survey of the current state-of-the-art techniques in this field is provided in [44].

4.8.1 Deep Learning Concepts for Brain Tumor Grading
In the last few years, there has been an increase in the use of deep learning methods, especially CNNs. Rather than using hand-crafted features, DL models learn complex, task-adaptive and high-level features from the data directly. Due to these benefits, DL models are used for brain tumor detection, segmentation and classification.
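Several of the BRATS entries above (for example, the SDAE approach trained on 21 × 21 patches) operate on fixed-size patches cut around individual voxels rather than on whole slices. A minimal, hypothetical sketch of such patch extraction, not taken from any particular author's code, looks like this:

```python
def extract_patch(image, row, col, size=21):
    """Cut a size x size patch centered on (row, col); None if it leaves the image."""
    half = size // 2
    r0, c0 = row - half, col - half
    if r0 < 0 or c0 < 0 or r0 + size > len(image) or c0 + size > len(image[0]):
        return None  # center too close to the border for a full patch
    return [image[r][c0:c0 + size] for r in range(r0, r0 + size)]

# Hypothetical 64x64 slice filled with a gradient; patch around voxel (32, 32).
slice2d = [[r + c for c in range(64)] for r in range(64)]
patch = extract_patch(slice2d, 32, 32, size=21)
print(len(patch), len(patch[0]))  # 21 21
```

Patch-based training trades global context for many more training samples per scan. Centers too close to the border are simply skipped in this sketch; real pipelines often pad the volume instead.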
Convolutional neural networks (CNN), stacked denoising auto-encoders (SDAE) and recurrent neural networks (RNN) are the common deep learning models used [74]. Schmidhuber [67] and Weibo Liu et al. [79] provide a good review of deep learning networks. Usually, the training sample size is of great consideration in the case of deep learning methodologies. Various methods using deep learning were proposed for brain tumor classification. See Table 4.9.

The relevant literature has a vast collection of uses of DL models for tissue, tumor, lesion, subcortical structure and whole brain segmentation. Bradley J. Erickson et al. [3], Weibo Liu et al. [79], Geert Litjens et al. [13], Rupal R. Agravat and Mehul S. Raval [61] and Jose Bernal et al. [27] give detailed surveys of MRI brain tumor segmentation and MR image analysis using deep learning. Bradley J. Erickson et al. [3] explain how much data is really required when we use DL methods for medical image analysis. The work concludes that the variability in the process being studied actually decides the amount of data. Two ways to work well with decreased training data are transfer learning and data augmentation. In transfer learning, we train the first layers using images having similar features, while in data augmentation variants of the original data are created, i.e., rotated images or images with added noise.

4.9 DISCUSSION
Gliomas being the most commonly reported type of brain tumor, this review sheds light on the grading of gliomas, hence facilitating easy diagnosis. The WHO classification of CNS tumors [35], its update and grading [12,52] were elaborated initially. Histological and clinical grading based on malignancy level [36], the biological characteristics that aided grading [7,65,82] and the need for grading were explained as well. With the statistical studies in [17,23,24,40,66], it is apparent why gliomas are still an area of great interest for research. The increasing need for the non-invasive grade determination of a brain tumor leads to in-depth multidisciplinary studies.

As the prevalence of brain tumors has increased over time, various approaches incorporating a combination of MRI and MRS [11,15,38,48,63], supervised and unsupervised learning methods [11,39,48,60,84], as well as hybrid systems using more than one classifier, are a growing requirement to build an efficient and accurate system. These approaches help to clarify the ambiguity prevailing in current methods. The features that can be studied by MR imaging [7,14,20,33,50,51,62,78,83] and those by MRS [21,38,48,59] can be combined to design a system providing better diagnosis and higher accuracy than when individually considered.

Computer-aided tumor detection along with pattern recognition software [32,76,85] helps radiologists review results more easily. According to the above review, fusing several good algorithms gave results that steadily ranked above most of the individual algorithms [42]. For subsequent fusion, user interaction proved to be useful in selecting the best segmentation maps. Subsequently, driving beyond the limits of individual tumor segmentation algorithms, future advance is possible by inspecting how to design and combine algorithms by fusion strategies. Furthermore, we should look into a dataset consisting of a class like glioma and its subclasses to be collected for classification. The MICCAI-BRATS challenge suggests that methods based on random forests are among the most accurate [41]. At the same time, research shows a good performance of CNN-based algorithms, especially in the field of 2D data classification. This is because of the merit that each kernel in the different layers is learned spontaneously, so that no prior feature setting is needed, because of which the number of training examples becomes critical [81].

DL models have had a huge impact in the domains of natural language processing, speech recognition and computer vision for problem solving. Researchers now work on small image patches rather than the whole volume/slice, using the computationally efficient CNNs to obtain accurate segmentation results. These DL networks have been accepted more, and the architectures are growing more sophisticated by including more layers and better optimization ideas.

For good generalization, use an architecture with optimized layers, select the best hyperparameters, use advanced training methods, and overcome class imbalance problems. When considering the drawbacks of DL models, the computational requirements are to be addressed by using quicker matrix multiplication methods and FFT algorithms. Above all, there is more room to consider distributed and parallelized implementations.

4.10 CONCLUSIONS
The necessity of integrating machine learning and deep learning methodology with the diagnosis of brain tumors, and the recent segmentation and classification techniques on brain MR images, were reviewed. The current trends in the grading of brain tumors, with a focus on gliomas, which include astrocytoma, were elucidated. The current state-of-the-art, software packages, and evaluation and validation metrics used in different approaches were discussed, along with integration into the clinical
TABLE 4.9 Summary of deep learning methodologies for brain tumor classification. Each entry lists: paper; pre-processing; feature extraction/classification; classes; dataset; performance.

[54]; intensity normalization; CNN; HGG, LGG; BRATS 2014 (170 high grade, 25 low grade); specificity and sensitivity with an intersected value of 0.6667.

[68]; intensity normalization, bias field correction by the N4ITK method; CNN; normal tissue, necrosis, edema, non-enhancing and enhancing tumor; BRATS 2013 (65 scans), BRATS 2015 (327 scans); Dice for the complete, core and enhancing regions: 0.88, 0.83, 0.77 (BRATS 2013) and 0.78, 0.65, 0.75 (BRATS 2015); speed 8 min.

[17]; N4ITK for bias correction; CNN; non-tumor, necrosis, edema, enhancing tumor and non-enhancing tumor; BRATS 2013 (20 HGG, 10 LGG); Dice: complete 0.88, core 0.79, enhancing 0.73; specificity: complete 0.89, core 0.79, enhancing 0.68; sensitivity: complete 0.87, core 0.79, enhancing 0.80; speed 25 s to 3 min.

Peter D. Chang [2]; intensity normalization algorithm; CNN; enhancing tumor, core tumor and complete tumor; BRATS 2016 (144 HGG patients); Dice: enhancing tumor 0.72, core tumor 0.81, complete tumor 0.87; speed 0.52 s for 64 images.

Varghese Alex [2]; normalization; SDAE; whole tumor, tumor core, active tumor; BRATS 2015 (9 image volumes); Dice: whole tumor 0.84, tumor core 0.71, active tumor 0.81.

Tseng Kuan Lun [2]; nil; SegNet; complete, core, enhancing; BRATS 2015; Dice: complete 0.75, core 0.77, enhancing 0.76.

Richard McKinley [2]; nil; Nabla-net; high and low grade glioma; BRATS 2012 (30 images); Dice: whole tumor 0.87, tumor core 0.69, enhancing 0.56.

Balaji Pandian [2]; N4ITK bias correction, normalization; 3D CNN; whole tumor, tumor core, active tumor; BRATS 2016 (40 images); Dice: whole tumor 0.725, tumor core 0.611, active tumor 0.572.

Ramandeep Randhawa [2]; nil; DNN; HGG, LGG; BRATS 2015 (275 patients); Dice: complete tumor 0.87, tumor core 0.75, enhanced tumor 0.71.

Adria Casamitjana [2]; normalization; 3D CNN; HGG, LGG; BRATS 2015; Dice: whole 0.89, core 0.76, active 0.37.

Xiaomei Zhao [2]; N4ITK bias correction, normalization; FCNN + CRF; HGG, LGG; BRATS 2013, BRATS 2015; Dice: complete 0.87, core 0.82, enhancing 0.76 (BRATS 2013) and complete 0.8, core 0.68, enhancing 0.65 (BRATS 2015); average run time 8 min.
TABLE 4.9 (continued)


Each entry below lists the surveyed paper, its pre-processing, feature extraction/classification method, target classes, dataset, and reported performance.

[25] Pre-processing: nil. Feature extraction/classification: DNN. Classes: HGG, LGG. Dataset: BRATS 2012 (35 LGG, 45 HGG), BRATS 2013 (15 LGG, 25 HGG), BRATS 2014 (200 training, 100 testing), BRATS 2015 (274 training, 110 testing). Performance: Dice – BRATS 2012 – 0.98, BRATS 2013 – 0.998, BRATS 2014 – 0.929, BRATS 2015 – 0.95; average time – 5.5 s.

[16] Pre-processing: Contrast-Limited Adaptive Histogram Equalization (CLAHE). Feature extraction/classification: VoxResNet. Classes: gray matter, white matter, cerebrospinal fluid. Dataset: 5 training subjects, 15 test subjects. Performance: Dice – GM – 0.885, WM – 0.91, CSF – 0.827.

[28] Pre-processing: nil. Feature extraction/classification: SDAE. Classes: brain stem. Dataset: 9 patients. Performance: Dice – 0.91; average speed – 0.36 s.

[64] Pre-processing: normalization, bias field correction. Feature extraction/classification: deep CNN. Classes: HGG, LGG. Dataset: BRATS 2013 (20 HGG, 10 LGG), BRATS 2015 (220 HGG, 54 LGG). Performance: Dice – BRATS 2013: complete – 0.87, core – 0.89, enhancing – 0.92; BRATS 2015: complete – 0.86, core – 0.87, enhancing – 0.9.

[34] Pre-processing: skull stripping, normalization, registration. Feature extraction/classification: DeepMedic + CRF. Classes: HGG, LGG. Dataset: BRATS 2015 (274 training, 110 testing). Performance: whole – 0.89, core – 0.75, enhancing – 0.72.

[19] Pre-processing: nil. Feature extraction/classification: DNN. Classes: normal, glioblastoma, sarcoma, metastatic bronchogenic carcinoma tumors. Dataset: 66 MRI (22 normal, 44 abnormal). Performance: accuracy – 0.96, AUC – 0.98.

[46] Pre-processing: nil. Feature extraction/classification: incremental DCNN. Classes: HGG, LGG. Dataset: BRATS 2017 (210 HGG, 75 LGG). Performance: average Dice score – whole tumor – 0.88, tumor core – 0.79, enhancing tumor core – 0.82; speed – 20.87 s.

[30] Pre-processing: nil. Feature extraction/classification: ImageNet pre-trained classifiers. Classes: no tumor, LGG, GBM. Dataset: 43 no tumor, 155 LGG, 125 GBM. Performance: accuracy – 0.82.

[80] Pre-processing: skull stripping, intensity normalization, N4ITK. Feature extraction/classification: FCNN + CRF. Classes: HGG, LGG. Dataset: BRATS 2013 (14 LGG, 51 HGG), BRATS 2015 (274 training, 110 testing). Performance: Dice – BRATS 2013: complete – 0.88, core – 0.84, enhancing – 0.77; BRATS 2015: complete – 0.84, core – 0.73, enhancing – 0.62.

List of used abbreviations: convolutional neural networks (CNN), conditional random fields (CRFs), deep convolutional neural networks (DCNN), deep neural networks (DNN), fully convolutional neural networks (FCNNs), high grade glioma (HGG), low grade glioma (LGG), stacking denoising auto-encoders (SDAE), voxelwise residual network (VoxResNet).
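Most of the segmentation results in the survey above are reported as Dice scores. As a quick reminder of what that metric measures, the sketch below (plain Python on invented toy masks — not code from any of the surveyed papers) computes Dice = 2·|A ∩ B| / (|A| + |B|) for two binary masks:

```python
def dice(pred, truth):
    """Dice similarity coefficient between two binary masks given as
    flat 0/1 sequences: Dice = 2*|A intersect B| / (|A| + |B|)."""
    overlap = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention: two empty masks agree perfectly.
    return 2.0 * overlap / total if total else 1.0

# Toy 1-D "masks": 4 voxels overlap; each mask contains 5 voxels.
pred  = [1, 1, 1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 1, 1, 0, 0]
score = dice(pred, truth)  # 2*4 / (5+5) = 0.8
```

A score of 1.0 means the predicted mask overlaps the reference exactly; the values reported in the table are of this form, computed over 3-D voxel masks.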
70 Deep Learning and Parallel Computing Environment for Bioengineering Systems
environment. The comparative analysis of various studies above revealed some challenges that might further improve the methodologies: (1) large database acquisition from different institutions with various image qualities; (2) extracting more efficient features and increasing the training data set to improve the classification accuracy; (3) there is still more scope to integrate multiple machine learning techniques into a hybrid system; (4) it is desirable to establish further experiments and perform evaluation only when the proposed approaches have generic applications; (5) histopathological information must be integrated while implementing algorithms with machine learning and image processing approaches. There is a great scope for DL and ML in medical imaging although the above mentioned challenges are yet to be tackled. Most of the DL and ML tumor classification methods outperform medical experts. There is future scope to improve existing methods by overcoming the above challenges.

SHORT AUTHORS BIOGRAPHIES

Geethu Mohan is a Research Scholar in the area of Medical Image Processing at Vellore Institute of Technology, Tamil Nadu, India, since 2016. She received her Master's degree in Applied Electronics from Sathyabama University, Tamil Nadu, India, in 2012 and Bachelor's degree in Electronics and Communication Engineering from T K M Institute of Technology, Kerala, India, in 2009. Her research interests include machine learning, pattern recognition and medical image analysis.

Dr. Monica Subashini is an expert in the area of medical image processing at Vellore Institute of Technology, Tamil Nadu, India and working as an Associate Professor. Dr. Monica completed her PhD on Artificial Intelligence in the year 2014 from VIT. She received her Master's degree in Applied Electronics from Anna University, Tamil Nadu, India, in 2007 and Bachelor's degree in Electronics and Instrumentation Engineering from Karunya Institute of Technology, Coimbatore, India, in 2001.

ACKNOWLEDGMENTS

We are grateful to School of Electronics Engineering and School of Electrical Engineering, VIT for the immense support in piloting this review.

FUNDING

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

REFERENCES

1. S. Bauer, R. Wiest, L.-P. Nolte, M. Reyes, A survey of MRI-based medical image analysis for brain tumor studies, Physics in Medicine and Biology 58 (2013) R97–R129, https://doi.org/10.1088/0031-9155/58/13/R97.
2. Bjoern Menze, NCI-MICCAI challenge on multimodal brain tumor segmentation, in: NCI-MICCAI BRATS 2013, Nagoya, Japan, 2013.
3. Bradley J. Erickson, Panagiotis Korfiatis, Timothy L. Kline, Zeynettin Akkus, Kenneth Philbrick, et al., Deep learning in radiology: does one size fit all? Journal of the American College of Radiology (2018) 1–6, https://doi.org/10.1016/j.jacr.2017.12.027.
4. Caffe, n.d., http://caffe.berkeleyvision.org/.
5. Cuda-convnet, n.d., http://deeplearning.net/software/pylearn2/library/alex.
6. D. Aju, R. Rajkumar, T1-T2 weighted MR image composition and cataloguing of brain tumor using regularized logistic regression, Jurnal Teknologi 9 (2016) 149–159.
7. L.M. DeAngelis, Brain tumors, Medical Progress from The New England Journal of Medicine (2001), https://doi.org/10.1227/01.NEU.0000311254.63848.72.
8. Deeplearning4j, n.d., https://deeplearning4j.org/.
9. E.-S.A. El-Dahshan, H.M. Mohsen, K. Revett, A.-B.M. Salem, Computer-aided diagnosis of human brain tumor through MRI: a survey and a new algorithm, Expert Systems with Applications 41 (2014) 5526–5545, https://doi.org/10.1016/j.eswa.2014.01.021.
10. Eli Gibson, Wenqi Li, Carole Sudre, Lucas Fidon, Dzhoshkun I. Shakir, Guotai Wang, Zach Eaton-Rosen, Robert Gray, Tom Doel, Yipeng Hu, Tom Whyntie, Parashkev Nachev, Marc Modat, Dean C. Barratt, Sebastien Ourselin, M. Jorge Cardoso, Tom Vercauteren, NiftyNet: a deep-learning platform for medical imaging, Computer Methods and Programs in Biomedicine (2018), https://doi.org/10.1016/j.cmpb.2018.01.025.
11. K.E. Emblem, B. Nedregaard, J.K. Hald, T. Nome, P. Due-Tonnessen, A. Bjornerud, Automatic glioma characterization from dynamic susceptibility contrast imaging: brain tumor segmentation using knowledge-based fuzzy clustering, Journal of Magnetic Resonance Imaging 30 (2009) 1–10, https://doi.org/10.1002/jmri.21815.
12. J.L. Fisher, J.A. Schwartzbaum, M. Wrensch, J.L. Wiemels, Epidemiology of brain tumors, Neurologic Clinics 34 (2007) 867–890, https://doi.org/10.1016/j.ncl.2007.07.002.
13. Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, et al., A survey on deep learning in medical image analysis, Medical Image Analysis Journal 42 (December 2017) 60–88, https://doi.org/10.1016/j.media.2017.07.005.
CHAPTER 4 Medical Imaging With Intelligent Systems: A Review 71
14. M. Gupta, B.V.V.S.N. Prabhakar Rao, V. Rajagopalan, A. Das, C. Kesavadas, Volumetric segmentation of brain tumor based on intensity features of multimodality magnetic resonance imaging, in: IEEE International Conference on Computer Communication and Control, IC4 2015, 2015.
15. J.A. Guzmán-De-Villoria, J.M. Mateos-Pérez, P. Fernández-García, E. Castro, M. Desco, Added value of advanced over conventional magnetic resonance imaging in grading gliomas and other primary brain tumors, Cancer Imaging 14 (2014) 1–10, https://doi.org/10.1186/s40644-014-0035-8.
16. Hao Chen, Qi Dou, Lequan Yu, Jing Qin, Pheng-Ann Heng, VoxResNet: deep voxelwise residual networks for brain segmentation from 3D MR images, NeuroImage 170 (April 2018) 446–455, https://doi.org/10.1016/j.neuroimage.2017.04.041.
17. M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.M. Jodoin, H. Larochelle, Brain tumor segmentation with deep neural networks, Medical Image Analysis 35 (2017) 18–31, https://doi.org/10.1016/j.media.2016.05.004.
18. M. Havaei, P.M. Jodoin, H. Larochelle, Efficient interactive brain tumor segmentation as within-brain kNN classification, in: Proceedings – International Conference on Pattern Recognition, Institute of Electrical and Electronics Engineers Inc., 2014, pp. 556–561.
19. Heba Mohsen, El-Sayed A. El-Dahshan, El-Sayed M. El-Horbaty, Abdel-Badeeh M. Salem, Classification using deep learning neural networks for brain tumors, Future Computing and Informatics Journal (2018), https://doi.org/10.1016/j.fcij.2017.12.001.
20. M. Horn, Magnetic Resonance Imaging Methods and Biologic Applications, Methods in Molecular Medicine, Humana Press, New Jersey, 2006.
21. A. Horska, P.B. Barker, Imaging of brain tumors: MR spectroscopy and metabolic imaging, Neuroimaging Clinics of North America 20 (2011) 293–310, https://doi.org/10.1016/j.nic.2010.04.003.
22. Ian J. Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frederic Bastien, Yoshua Bengio, Pylearn2: a machine learning research library, http://deeplearning.net/software/pylearn2, August 2013.
23. A. Jain, M.C. Sharma, V. Suri, S.S. Kale, A.K. Mahapatra, M. Tatke, G. Chacko, A. Pathak, V. Santosh, P. Nair, N. Husain, C. Sarkar, Spectrum of pediatric brain tumors in India: a multi-institutional study, Neurol India 59 (2011) 208–211, https://doi.org/10.4103/0028-3886.79142.
24. J. Jaiswal, A.H. Shastry, A. Ramesh, Y.T. Chickabasaviah, A. Arimappamagan, V. Santosh, Spectrum of primary intracranial tumors at a tertiary care neurological institute: a hospital-based brain tumor registry, Neurol India 64 (2016) 494–501.
25. Javeria Amin, Muhammad Sharif, Mussarat Yasmin, Steven Lawrence Fernandes, Big data analysis for brain tumor detection: deep convolutional neural networks, Future Generation Computer Systems (2018), https://doi.org/10.1016/j.future.2018.04.065.
26. Z. Ji, Q. Sun, Y. Xia, Q. Chen, D. Xia, D. Feng, Generalized rough fuzzy c-means algorithm for brain MR image segmentation, Computer Methods and Programs in Biomedicine 108 (2012) 644–655, https://doi.org/10.1016/j.cmpb.2011.10.010.
27. Jose Bernal, Kaisar Kushibar, et al., Deep convolutional neural networks for brain image analysis on magnetic resonance imaging: a review, Artificial Intelligence in Medicine (April 2018) 1–18, https://doi.org/10.1016/j.artmed.2018.08.008.
28. Jose Dolz, Nacim Betrouni, Mathilde Quidet, Dris Kharroubi, Henri A. Leroy, Nicolas Reyns, Laurent Massoptier, Maximilien Vermandel, Stacking denoising auto-encoders in a deep network to segment the brainstem on MRI in brain cancer patients: a clinical study, Computerized Medical Imaging and Graphics (2016), https://doi.org/10.1016/j.compmedimag.2016.03.003.
29. T. Kalaiselvi, K. Somasundaram, S. Vijayalakshmi, A novel self initiating brain tumor boundary detection for MRI, Communications in Computer and Information Science 283 (2012) 54–61, https://doi.org/10.1007/978-3-642-28926-2.
30. Ken C.L. Wong, Tanveer Syeda-Mahmood, Mehdi Moradi, Building medical image classifiers with very limited data using segmentation networks, Medical Image Analysis (2018), https://doi.org/10.1016/j.media.2018.07.010.
31. Keras, n.d., https://keras.io/.
32. M. Kistler, S. Bonaretti, M. Pfahrer, R. Niklaus, P. Büchler, The virtual skeleton database: an open access repository for biomedical research and collaboration, Journal of Medical Internet Research 15 (2013) 1–14, https://doi.org/10.2196/jmir.2930.
33. S. Koley, A.K. Sadhu, P. Mitra, B. Chakraborty, C. Chakraborty, Delineation and diagnosis of brain tumors from post contrast T1-weighted MR images using rough granular computing and random forest, Applied Soft Computing 41 (2016) 453–465, https://doi.org/10.1016/j.asoc.2016.01.022.
34. Konstantinos Kamnitsas, Christian Ledig, Virginia F.J. Newcombe, Joanna P. Simpson, et al., Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation, Medical Image Analysis (2016), https://doi.org/10.1016/j.media.2016.10.004.
35. L.R. Lym, Q.T. Ostrom, C. Kruchko, M. Couce, D.J. Brat, D.N. Louis, J.S. Barnholtz-Sloan, Completeness and concordancy of WHO grade assignment for brain and central nervous system tumors in the United States, 2004–2011, Journal of Neuro-Oncology 123 (1) (2015) 43–51, https://doi.org/10.1007/s11060-015-1775-4.
36. D.N. Louis, H. Ohgaki, O.D. Wiestler, W.K. Cavenee, P.C. Burger, A. Jouvet, B.W. Scheithauer, P. Kleihues, The 2007 WHO classification of tumours of the central nervous system, Acta Neuropathologica 114 (2007) 97–109, https://doi.org/10.1007/s00401-007-0243-4.
37. D.N. Louis, A. Perry, G. Reifenberger, A. von Deimling, D. Figarella-Branger, W.K. Cavenee, H. Ohgaki, O.D. Wiestler, P. Kleihues, D.W. Ellison, The 2016 world health organization classification of tumors of the central nervous system:
a summary, Acta Neuropathologica 131 (2016) 803–820, https://doi.org/10.1007/s00401-016-1545-1.
38. J. Luts, A. Heerschap, J.A.K. Suykens, S. Van Huffel, A combined MRI and MRSI based multiclass system for brain tumour recognition using LS-SVMs with class probabilities and feature selection, Artificial Intelligence in Medicine 40 (2007) 87–102, https://doi.org/10.1016/j.artmed.2007.02.002.
39. U. Maya, K. Meenakshy, Unified model based classification with FCM for brain tumor segmentation, in: 2015 IEEE International Conference on Power, Instrumentation, Control and Computing (PICC), 2015, pp. 7–10.
40. C. McPherson, Glioma Brain Tumors, Mayfield Clinic & Spine Institute, Ohio, 2016.
41. B. Menze, M. Reyes, Multimodal brain tumor image segmentation benchmark: "change detection", in: Proceedings of MICCAI-BRATS 2016 Multimodal, Munich, 2016.
42. B.H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, L. Lanczi, E. Gerstner, M.A. Weber, T. Arbel, B.B. Avants, N. Ayache, P. Buendia, D.L. Collins, N. Cordier, J.J. Corso, A. Criminisi, T. Das, H. Delingette, C. Demiralp, C.R. Durst, M. Dojat, S. Doyle, J. Festa, F. Forbes, E. Geremia, B. Glocker, P. Golland, X. Guo, A. Hamamci, K.M. Iftekharuddin, R. Jena, N.M. John, E. Konukoglu, D. Lashkari, J.A. Mariz, R. Meier, S. Pereira, D. Precup, S.J. Price, T.R. Raviv, S.M.S. Reza, M. Ryan, D. Sarikaya, L. Schwartz, H.C. Shin, J. Shotton, C.A. Silva, N. Sousa, N.K. Subbanna, G. Szekely, T.J. Taylor, O.M. Thomas, N.J. Tustison, G. Unal, F. Vasseur, M. Wintermark, D.H. Ye, L. Zhao, B. Zhao, D. Zikic, M. Prastawa, M. Reyes, K. Van Leemput, The multimodal brain tumor image segmentation benchmark (BRATS), IEEE Transactions on Medical Imaging 34 (2015) 1993–2024, https://doi.org/10.1109/TMI.2014.2377694.
43. M. Metcalf, D. Xu, D.T. Okuda, L. Carvajal, D.A.C. Kelley, P. Mukherjee, S.J. Nelson, D.R. Nat, D.B. Vigneron, D. Pelletier, High-resolution phased-array MRI of the human brain at 7 Tesla: initial experience in multiple sclerosis patients, Journal of Neuroimaging 20 (2010) 141–147, https://doi.org/10.1111/j.1552-6569.2008.00338.x.
44. G. Mohan, M. Monica Subashini, MRI based medical image analysis: survey on brain tumor grade classification, Biomedical Signal Processing and Control 39 (2018) 139–161, https://doi.org/10.1016/j.bspc.2017.07.007.
45. Morgan P. McBee, Omer A. Awan, et al., Deep learning in radiology, Academic Radiology (2018) 1–9, https://doi.org/10.1016/j.acra.2018.02.018.
46. Mostefa Ben Naceur, Rachida Saouli, Mohamed Akil, Rostom Kachouri, Fully automatic brain tumor segmentation using end-to-end incremental deep neural networks in MRI images, Computer Methods and Programs in Biomedicine (2018), https://doi.org/10.1016/j.cmpb.2018.09.007.
47. Mrudang D. Pandya, Parth D. Shah, Sunil Jardosh, Medical image diagnosis for disease detection: a deep learning approach, Chapter 3, in: U-Healthcare Monitoring Systems, 2019, pp. 37–60.
48. D.S. Nachimuthu, A. Baladhandapani, Multidimensional texture characterization: on analysis for brain tumor tissues using MRS and MRI, Journal of Digital Imaging 27 (2014) 496–506, https://doi.org/10.1007/s10278-013-9669-5.
49. A. Nandi, Detection of human brain tumour using MRI image segmentation and morphological operators, in: 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS), 2015, pp. 55–60.
50. M.S. Norhashimah, S.A.R. Syed Abu Bakar, A. Sobri Muda, M. Mohd Mokji, Review of brain lesion detection and classification using neuroimaging analysis techniques, Jurnal Teknologi 6 (2015) 73–85.
51. A. Ortiz, J.M. Gorriz, J. Ramirez, D. Salas-Gonzalez, Improving MRI segmentation with probabilistic GHSOM and multiobjective optimization, Neurocomputing 114 (2013) 118–131, https://doi.org/10.1016/j.neucom.2012.08.047.
52. Q.T. Ostrom, H. Gittleman, J. Xu, C. Kromer, Y. Wolinsky, C. Kruchko, J.S. Barnholtz-Sloan, CBTRUS statistical report: primary brain and other central nervous system tumors diagnosed in the United States in 2009–2013, Neuro-Oncology 18 (2016) v1–v75, https://doi.org/10.1093/neuonc/now207.
53. P. Shanthakumar, P. Ganeshkumar, Performance analysis of classifier for brain tumor detection and diagnosis, Computers & Electrical Engineering 45 (2015) 302–311, https://doi.org/10.1016/j.compeleceng.2015.05.011.
54. Y. Pan, W. Huang, Z. Lin, W. Zhu, J. Zhou, J. Wong, Z. Ding, Brain tumor grading based on neural networks and convolutional neural networks, in: Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, 2015, pp. 699–702.
55. E.I. Papageorgiou, P.P. Spyridonos, D.T. Glotsos, C.D. Stylios, P. Ravazoula, G.N. Nikiforidis, P.P. Groumpos, Brain tumor characterization using the soft computing technique of fuzzy cognitive maps, Applied Soft Computing 8 (2008) 820–828, https://doi.org/10.1016/j.asoc.2007.06.006.
56. Pylearn2, n.d., http://deeplearning.net/software/pylearn2/.
57. Pytorch, n.d., http://torch.ch/.
58. Pytorch, n.d., https://pytorch.org/.
59. O. Rapalino, E.M. Ratai, Multiparametric imaging analysis: magnetic resonance spectroscopy, Magnetic Resonance Imaging Clinics of North America 24 (2016) 671–686, https://doi.org/10.1016/j.mric.2016.06.001.
60. S. Roy, S. Sadhu, S.K. Bandyopadhyay, D. Bhattacharyya, T.H. Kim, Brain tumor classification using adaptive neuro-fuzzy inference system from MRI, International Journal of Bio-Science and Bio-Technology 8 (2016) 203–218, https://doi.org/10.14257/ijbsbt.2016.8.3.21.
61. Rupal R. Agravat, Mehul S. Raval, Deep learning for automated brain tumor segmentation in MRI images, Chapter 11, in: Soft Computing Based Medical Image Analysis, 2018, pp. 183–201.
62. J. Sachdeva, V. Kumar, I. Gupta, N. Khandelwal, C.K. Ahuja, A package-SFERCB-"segmentation, feature extraction, reduction and classification analysis by both SVM and ANN for brain tumors", Applied Soft Computing 47 (2016) 151–167, https://doi.org/10.1016/j.asoc.2016.05.020.
63. J. Sachdeva, V. Kumar, I. Gupta, N. Khandelwal, C.K. Ahuja, Multiclass brain tumor classification using GA-SVM, in: 2011 Developments in E-systems Engineering, vol. 97, 2011, pp. 182–187.
64. Saddam Hussain, Syed Muhammad Anwar, Muhammad Majid, Segmentation of glioma tumors in brain using deep convolutional neural network, Neurocomputing 282 (2018) 248–261, https://doi.org/10.1016/j.neucom.2017.12.032.
65. A. Sarkar, E.A. Chiocca, Glioblastoma and malignant astrocytoma, in: Brain Tumors, American Brain Tumor Association, Chicago, 2016, pp. 1–22.
66. J.A. Schwartzbaum, J.L. Fisher, K.D. Aldape, M. Wrensch, Epidemiology and molecular pathology of glioma, Nature Clinical Practice Neurology 2 (2006) 494–503, https://doi.org/10.1038/ncpneuro0289.
67. Juergen Schmidhuber, Deep learning in neural networks: an overview, Neural and Evolutionary Computing 61 (2015) 85–117, https://doi.org/10.1016/j.neunet.2014.09.003.
68. Sérgio Pereira, Adriano Pinto, Victor Alves, Carlos A. Silva, Brain tumor segmentation using convolutional neural networks in MRI images, IEEE Transactions on Medical Imaging 35 (5) (2016) 1240–1251, https://doi.org/10.1109/TMI.2016.2538465.
69. K.A. Smitha, A.K. Gupta, R.S. Jayasree, Relative percentage signal intensity recovery of perfusion metrics – an efficient tool for differentiating grades of glioma, British Journal of Radiology 88 (2015), https://doi.org/10.1259/bjr.20140784.
70. M. Strumia, D. Feltell, N. Evangelou, P. Gowland, C. Tench, L. Bai, Grey matter segmentation of 7T MR images, in: IEEE Nuclear Science Symposium Conference Record, 2011, pp. 3710–3714.
71. Tensorflow, n.d., https://www.tensorflow.org/.
72. Theano, n.d., http://deeplearning.net/software/theano/.
73. J.M. Theysohn, O. Kraff, K. Eilers, D. Andrade, M. Gerwig, D. Timmann, F. Schmitt, M.E. Ladd, S.C. Ladd, A.K. Bitz, Vestibular effects of a 7 Tesla MRI examination compared to 1.5 T and 0 T in healthy volunteers, PLoS ONE 9 (2014) 3–10, https://doi.org/10.1371/journal.pone.0092104.
74. Thierry Bouwmans, Caroline Silva, Cristina Marghes, Mohammed Sami Zitouni, Harish Bhaskar, Carl Frelicot, On the role and the importance of features for background modeling and foreground detection, Computer Science Review 28 (2018) 26–91, https://doi.org/10.1016/j.cosrev.2018.01.004.
75. Torch, n.d., http://torch.ch/.
76. N.J. Tustison, K.L. Shrinidhi, M. Wintermark, C.R. Durst, B.M. Kandel, J.C. Gee, M.C. Grossman, B.B. Avants, Optimal symmetric multimodal templates and concatenated random forests for supervised brain tumor segmentation (simplified) with ANTsR, Neuroinformatics 13 (2015) 209–225, https://doi.org/10.1007/s12021-014-9245-2.
77. A.G. Van Der Kolk, J. Hendrikse, J.J.M. Zwanenburg, F. Visser, P.R. Luijten, Clinical applications of 7 T MRI in the brain, European Journal of Radiology 82 (2013) 708–718, https://doi.org/10.1016/j.ejrad.2011.07.007.
78. G. Vishnuvarthanan, M.P. Rajasekaran, P. Subbaraj, A. Vishnuvarthanan, An unsupervised learning method with a clustering approach for tumor identification and tissue segmentation in magnetic resonance brain images, Applied Soft Computing 38 (2016) 190–212, https://doi.org/10.1016/j.asoc.2015.09.016.
79. Weibo Liu, Zidong Wang, Xiaohui Lui, Nianyin Zeng, Yurong Liu, et al., A survey of deep neural network architectures and their applications, Neurocomputing 11 (26) (2017), https://doi.org/10.1016/j.neucom.2016.12.038.
80. Xiaomei Zhao, Yihong Wu, Guidong Song, Zhenye Li, et al., A deep learning model integrating FCNNs and CRFs for brain tumor segmentation, Medical Image Analysis (2017), https://doi.org/10.1016/j.media.2017.10.002.
81. Yuehao Pan, Weimin Huang, Zhiping Lin, Wanzheng Zhu, Jiayin Zhou, Jocelyn Wong, Zhongxiang Ding, Brain tumor grading based on neural networks and convolutional neural networks, in: Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, 2015, pp. 699–702.
82. M. Zarinbal, M.H. Fazel Zarandi, I.B. Turksen, M. Izadi, A type-2 fuzzy image processing expert system for diagnosing brain tumors, Journal of Medical Systems 39 (2015) 110, https://doi.org/10.1007/s10916-015-0311-6.
83. N. Zhang, S. Ruan, S. Lebonvallet, Q. Liao, Y. Zhu, Kernel feature selection to fuse multi-spectral MRI images for brain tumor segmentation, Computer Vision and Image Understanding 115 (2011) 256–269, https://doi.org/10.1016/j.cviu.2010.09.007.
84. Y. Zhang, Z. Dong, L. Wu, S. Wang, A hybrid method for MRI brain image classification, Expert Systems with Applications 38 (2011) 10049–10053, https://doi.org/10.1016/j.eswa.2011.02.012.
85. Y. Zhu, G.S. Young, Z. Xue, R.Y. Huang, H. You, K. Setayesh, H. Hatabu, F. Cao, S.T. Wong, Semi-automatic segmentation software for quantitative clinical brain glioblastoma evaluation, Academic Radiology 19 (2012) 977–985, https://doi.org/10.1016/j.acra.2012.03.026.
CHAPTER 5
Medical Image Analysis With Deep Neural Networks
K. BALAJI, ME • K. LAVANYA, PHD
5.1 INTRODUCTION

Deep learning neural networks are among the most significant methods of machine learning [1]. In the previous five years, we saw a significant growth in the area of deep learning. Nowadays, deep learning is the foundation of several cutting-edge scientific applications. Experts' objective is to benefit computers to not only recognize medical images. Deep learning facilitates the computer to construct complex perceptions from modest and less significant perceptions. For example, a deep learning structure identifies a person's image by merging corners and boundaries of the lower level label and combines them into fragments of the body in an ordered way. Deep learning systems are beneficial in the exploration of brain images such as disorder classification; tissue, anatomy, lesion and tumor segmentation; lesion and tumor detection; lesion and tumor classification; survival and disease activity prediction; as well as image construction and enhancement.

Fig. 5.1 shows the common organization of deep learning for medical image analysis. The architecture comprises three fundamental segments:
(1) Input raw data;
(2) Deep network;
(3) Output data.

The general process of deep learning is shown in Fig. 5.2. The key objectives of this exploration can be organized as follows:
(1) To present principal literature of various CNN techniques;
(2) To categorize significance of this field and;
(3) To provide the modern developments of study in this field.

The rest of this chapter is organized as follows. The fundamentals of convolutional neural networks are described in Sect. 5.2, convolutional neural network methods are described in Sect. 5.3, convolution layer is presented in Sect. 5.4, pooling layer in Sect. 5.5, activation function in Sect. 5.6, applications of CNN methods in medical image analysis in Sect. 5.7, and a discussion is given in Sect. 5.8. Finally, Sect. 5.9 concludes the chapter.

5.2 CONVOLUTIONAL NEURAL NETWORKS

An appropriate form of multi-layer neural network is a convolutional neural network (CNN) [2]. A CNN is intended to identify visual forms from images with least computation [3]. Convolutional neural networks (CNNs) originate ubiquitously. In the last few years, we saw a vivid progress in visual image processing systems due to the preliminary part of deep neural networks for learning and classifying the patterns. CNNs have accomplished noble performance in many areas such as brain, retinal, chest X-ray, chest CT, breast, cardiac, abdominal and musculoskeletal image analysis. Thus, convolutional neural networks have almost limitless applications. Fig. 5.3 shows the taxonomy of convolutional neural networks.

Convolutional neural networks are also referred to as ConvNets and are related to systematic neural networks. ConvNets consist of neurons with weights that are trained from data. Each neuron accepts inputs and carries out a product with its weights. The last fully connected layer has a loss function. The systematic neural network accepts input information as a single vector which is forwarded to a sequence of hidden layers. In a hidden layer, every neuron is entirely associated with all other neurons in the preceding layer. Each neuron is entirely independent inside a single layer, and they do not share any information between them. The final fully connected layer is also referred to as the output layer, which has class information for recognizing an image in a classification application. ConvNets consist of three core layers in their architecture: convolutional, pooling, and fully connected layers. ConvNets usually accept input as image and encode features into the network which decreases the number of parameters. In real-world applications, the performance of the convolutional neural network is much better than that of multi-layer perceptrons.

There are two issues found in multi-layer perceptrons:
(1) A multi-layer perceptron converts a given input matrix into a numeric vector without any spatial information. So, ConvNets are mainly designed for
Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00012-9 75
Copyright © 2019 Elsevier Inc. All rights reserved.
FIG. 5.1 Common organization of deep learning for medical image analysis.

FIG. 5.2 General process of deep learning.

this actual reason, that is, to interpret the features in a multidimensional space.
(2) Unlike multi-layer perceptrons, convolutional neural networks recognize the information that closer image pixels are deeply correlated compared to pixels which are distant.

ConvNets are different from multi-layer perceptrons when comparing the types of hidden layers present in the network. In ConvNets, the neurons are organized in three dimensions: depth, width, and height. Each layer converts three-dimensional inputs into three-dimensional output features using an activation function. In a multi-layer perceptron, the number of parameters is increased, due to the fact that it accepts only a vector as input. ConvNets address this problem in order to process images having input information. In addition, ConvNets accept input in the form of a matrix. Convolutional layers' realm is spatial information.

In general, neural networks can be useful for a search problem. Each neuron in the network is used for searching the association between the input data and their output feature. Dropout dynamically turns neurons off while forward transmitting and also keeps their weights from converging to matching points. After that, it turns on every neuron present in the network and propagates towards the back. During forward propagation, layer values are set to zero to perform dropout. In ConvNets, multiple layers are calculated in a well-organized manner. Each layer has an essential part in the network. The input layer receives the data image.
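A minimal numeric sketch can tie these pieces together. The toy Python below is illustrative only — the 5 × 5 input, the hand-picked vertical-edge filter and the 0.5 dropout rate are invented for this example, not taken from the chapter. It convolves the input with one filter (the dot product of each local patch with the weights, plus a bias), applies an activation function, max-pools the resulting feature map, and then performs training-time dropout by setting activations to zero at random during forward propagation:

```python
import random

def conv2d(image, kernel, bias=0.0):
    """Valid 2D convolution (cross-correlation, as in most CNN code):
    slide the kernel over the image and take the dot product of each
    patch with the kernel weights, then add the bias."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            s = bias
            for i in range(kh):
                for j in range(kw):
                    s += image[r + i][c + j] * kernel[i][j]
            row.append(s)
        out.append(row)
    return out

def relu(fmap):
    """ReLU activation: negative responses are clipped to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Keep only the strongest response in each 2x2 window."""
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def dropout(vector, rate, rng):
    """Training-time dropout: zero each activation with probability `rate`."""
    return [0.0 if rng.random() < rate else v for v in vector]

image = [[0, 0, 1, 1, 0],          # invented 5x5 "image"
         [0, 1, 1, 1, 0],
         [1, 1, 1, 0, 0],
         [0, 1, 0, 0, 1],
         [0, 0, 1, 1, 0]]
edge = [[1, 0, -1],                # hand-picked vertical-edge filter
        [1, 0, -1],
        [1, 0, -1]]

fmap = relu(conv2d(image, edge))   # 3x3 feature map after activation
pooled = max_pool2x2(fmap)         # 1x1 map after 2x2 max pooling
flat = [v for row in pooled for v in row]
rng = random.Random(0)             # fixed seed so the sketch is repeatable
dropped = dropout(flat, 0.5, rng)  # some activations randomly zeroed
```

At test time the dropout step is skipped, so every neuron in the network contributes to the prediction.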
FIG. 5.3 Taxonomy of convolutional neural networks.

FIG. 5.4 ConvNets organization for image classification.

Each neuron in the fully connected layer is connected to two neighboring layers and does not share its information within a layer. The neurons in the present layer have full activations to the preceding layer. The activations are processed with matrix multiplication, and a bias is added to that term. In ConvNets, each neuron is related to a local area and shares its parameters. In the convolutional layer, the main aim of ConvNets is to mine the information from the given image; the convolutional layer performs most of the processing in ConvNets. Their organization for image classification is shown in Fig. 5.4, the process of the convolutional layer is shown in Fig. 5.5, and a ConvNet in 3D is shown in Fig. 5.6. The learning of ConvNets comprises two stages, forward and backward. The forward stage computes the dot-product of the input with the weights and adds a bias term in each layer. The actual output from the network is compared with the stored objective output [4]. In the backward stage, the weight modifications are performed in every level of the layer to produce the target output.
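The fully connected computation described above, matrix multiplication of the activations plus a bias term, can be sketched in a few lines (an illustrative function, not from any library):

```python
def dense_forward(x, weights, bias):
    """Fully connected layer: each output neuron takes the dot-product
    of its weight row with every input activation, then adds a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]
```

For example, three output neurons over a two-element input need a 3 x 2 weight matrix and 3 biases, which is why fully connected layers dominate the parameter count.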
78 Deep Learning and Parallel Computing Environment for Bioengineering Systems

The training of the network converges after a sufficient number of iterations of the forward and backward stages. In ConvNets, an input data image is the first parameter, and a filter is the second parameter. The feature map can be obtained from the output vector of the fully connected layer. Each input and filter is stored in the network independently and presented in the form of multi-dimensional arrays. The convolutional layer is used to identify local information in an image. The time complexity of the convolutional layer is O(nm) for n inputs and m outputs. A pooling layer is helpful for reducing the dimensionality of input vectors from the feature map of the convolutional layer. In the pooling layer, the time complexity is reduced to O(km), when n inputs are reduced to k inputs, that is, only meaningful features are considered. Two types of method are available in pooling layers for the dimensional reduction of a feature map: max and average pooling [5].

The convolutional layer feature map O^l_{p,q,r} can be computed as

O^l_{p,q,r} = W^l_r · I^l_{p,q} + b^l_r,   (5.1)

where W^l_r is the weight vector and b^l_r is the bias of the rth kernel of the lth layer, and I^l_{p,q} is the input vector at location (p, q); O^l_{p,q,r} creates the feature map. The activation function value f^l_{p,q,r} of the convolutional feature O^l_{p,q,r} can be calculated as

f^l_{p,q,r} = f(O^l_{p,q,r}).   (5.2)

5.3 CONVOLUTIONAL NEURAL NETWORK METHODS
The LeNet 5 [2] architecture accepts a 32 × 32 data image as input. The CNN architecture of LeNet 5 is shown in Fig. 5.7. The input is forwarded through a convolutional layer via a subsampling layer. After that, input from the subsampling layer is forwarded through a pooling layer to reduce the number of parameters. Then, a series of convolutional and pooling layers receive the inputs. Finally, the output is produced from three fully connected layers. Layer C1 consists of a 28 × 28 feature map connected to a 5 × 5 neighborhood and has 156 parameters with 122,304 connections between neurons. Layer S2 consists of a 14 × 14 feature map connected to a 2 × 2 neighborhood and has 12 parameters with 5880 connections between neurons. Layer C3 consists of a 10 × 10 feature map connected to a 5 × 5 neighborhood and has 1516 parameters with 156,000 connections between neurons. Layer S4 consists of a 5 × 5 feature map connected to a 2 × 2 neighborhood and has 32 parameters with 2000 connections between neurons. Layer C5 consists of a 120 feature map connected to a 5 × 5 neighborhood and has 48,120 connections between neurons. Layer F6 consists of 84 units and has 10,164 parameters. The LeNet 5 architecture is useful for handwriting, face, and online handwriting recognition, as well as machine-printed character recognition.

FIG. 5.5 The process of the convolutional layer.

FIG. 5.6 ConvNets in 3D.
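The per-position computation of Eqs. (5.1)–(5.2), followed by max pooling, can be sketched in plain Python for a single 2-D channel (an illustrative sketch; real frameworks vectorize this):

```python
def conv2d(image, kernel, bias):
    """One feature map per Eq. (5.1): O[p][q] = sum(W * patch) + b,
    'valid' sliding window without kernel flipping, as in CNNs."""
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for p in range(H - kH + 1):
        row = []
        for q in range(W - kW + 1):
            s = bias
            for i in range(kH):
                for j in range(kW):
                    s += kernel[i][j] * image[p + i][q + j]
            row.append(s)
        out.append(row)
    return out

def relu(fmap):
    """Elementwise activation f(O) = max(O, 0), cf. Eq. (5.2) with a ReLU."""
    return [[max(v, 0.0) for v in row] for row in fmap]

def max_pool(fmap, k=2):
    """Non-overlapping k x k max pooling: keep the strongest response
    per window, reducing the spatial dimensions by a factor of k."""
    return [[max(fmap[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, len(fmap[0]) - k + 1, k)]
            for i in range(0, len(fmap) - k + 1, k)]
```

Pooling a 3 × 3 map with k = 2 keeps only one complete window, illustrating how pooling discards all but the most meaningful features.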



FIG. 5.7 CNN architecture of LeNet 5.
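Several of the LeNet 5 figures quoted above follow directly from the weight-sharing rule (one k × k kernel per input channel per map, plus one bias per map). A quick check, assuming the standard LeNet-5 layout of 6 maps in C1 and a 16-channel input to C5:

```python
def conv_params(n_maps, k, in_channels):
    """Trainable parameters of a shared-weight convolutional layer:
    n_maps kernels of size k x k x in_channels, plus one bias per map."""
    return n_maps * (k * k * in_channels + 1)

# C1: six 5x5 maps over a single-channel 32x32 input -> 156 parameters,
# reused at every position of the 28x28 output -> 122,304 connections.
assert conv_params(6, 5, 1) == 156
assert conv_params(6, 5, 1) * 28 * 28 == 122304

# C5: 120 maps fully covering the 16-channel 5x5 output of S4.
assert conv_params(120, 5, 16) == 48120

# F6: a dense layer of 84 units over 120 inputs plus biases.
assert 84 * (120 + 1) == 10164
```

The same arithmetic explains why convolutional layers have few parameters but many connections, while fully connected layers have the reverse.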

FIG. 5.8 CNN architecture of AlexNet.

One of the most effective innovations in the architecture of convolutional neural networks, and also award-winning, is the AlexNet [3] architecture. The CNN architecture of AlexNet is shown in Fig. 5.8. A large dataset consisting of 1.2 million high-resolution images is trained and recognized for 1000 different categories. The convolutional neural network has 60 million trainable parameters and 650,000 neurons. AlexNet uses non-saturating neurons and an efficient GPU implementation for faster processing of the network, and uses the dropout method to reduce the overfitting problem. The architecture contains eight weight layers: five convolutional and three fully connected, ending in a softmax classifier. The output layer provides a 1000-way softmax, which produces 1000 different class scores. The AlexNet architecture is proficient in attaining the best computational results on large datasets, but its performance is reduced if one convolutional layer is detached.

ZFNet [6] is an innovative technique for visualizing intermediary layers and for their enhancement. The CNN architecture of ZFNet is shown in Fig. 5.9. ZFNet has eight layers, including five convolutional layers that are associated with three fully connected layers. The operation of the convolutional layer is executed on a GPU. ZFNet uses a multi-layered deconvolutional network (DeconvNet). A DeconvNet is an opponent model of a ConvNet that maps features to pixels instead of mapping pixels to features. The DeconvNet performs filtering and pooling in the reverse order of the ConvNet, and employing a DeconvNet is a method of performing unsupervised learning. In the ZFNet architecture, a DeconvNet is attached to every layer of the ConvNet. Initially, an input data image is presented to the ConvNet and features are calculated through the convolutional and pooling layers. To examine one activation of the ConvNet, the value zero is assigned to all other activations, and the output feature map of the ConvNet is passed through the DeconvNet. In the DeconvNet, unpooling is applied, and rectification and filtering are used to reconstruct the input data image. The process is repeated until the input space is reached. ZFNet is mainly used for image classification and is able to handle large training data sets; the dropout technique is used to reduce overfitting.

Network-In-Network (NIN) [7] is an innovative deep neural network used for improving the discriminability of local data image patches within their local regions. The CNN architecture of NIN is shown in Fig. 5.10. In general, a conventional convolutional layer scans the input with kernels, filtering the image through a nonlinear activation function. As an alternative, NIN forms micro-neural networks with more composite architectures to abstract the image patches within their local regions.
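The 1000-way softmax that AlexNet ends with can be sketched in a few lines of Python (an illustrative function; the max-subtraction is a standard numerical-stability trick, not something the text discusses):

```python
import math

def softmax(scores):
    """n-way softmax: exponentiate the class scores and normalize so
    they form a probability distribution summing to 1."""
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The ordering of the raw scores is preserved, so the highest-scoring class also gets the highest probability.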

FIG. 5.9 CNN architecture of ZFNet.

A powerful function approximator, such as a multilayer perceptron, provides an instance of the micro-neural network. The output feature maps obtained from the micro-neural networks are passed to the next layer of the network. Instead of traditional fully connected layers, a contemporary and effective global average pooling method is used for reducing the overfitting problem in the network. The output vector of the global average pooling layer is fed into the final softmax classification layer. The NIN architecture is not only useful in image classification but also performs well in object detection.

FIG. 5.10 CNN architecture of NIN.

The best-performing approaches to object detection are usually complex cooperative structures that combine various low-level image features with a high-level framework. Regions with convolutional neural network features (R-CNN) [8] is a modest and scalable algorithm for object detection that improves the result in terms of mean average precision. The CNN architecture of R-CNN is shown in Fig. 5.11. The R-CNN architecture is divided into three phases. The first phase generates class-independent proposal regions; these proposals describe the candidate detections. The second phase extracts fixed-size information from each region using the convolutional neural network. The third phase is an output classifier, such as a linear SVM.
The two main issues in image classification and object detection are:
(1) Localization of objects with the help of a deep neural network architecture, and
(2) Training a high-performance network with a minimal number of annotated data images.
The localization of an object in a convolutional neural network is solved by the recognition-using-regions method. R-CNN generates about 2000 class-independent proposal regions from the given input data image, extracts a fixed-size feature from each proposal using convolutional neural networks, and then classifies the output using an SVM classifier. Affine image warping is a simple method used for producing a fixed-size input image from each proposal region irrespective of its shape. The main issue of object detection is that labeled data for training the convolutional neural network is scarce. A conservative solution for this issue is to use unsupervised learning followed by supervised training.
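NIN's global average pooling, mentioned above as a replacement for fully connected layers, simply collapses each feature map to its mean (a minimal sketch; the function name is illustrative):

```python
def global_average_pooling(feature_maps):
    """Collapse each 2-D feature map to the mean of its activations,
    giving one value per map with no trainable weights, which is why
    it reduces overfitting relative to a fully connected layer."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]
```

The resulting vector has one entry per feature map and can be fed directly into the softmax classification layer.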

FIG. 5.11 CNN architecture of R-CNN.

FIG. 5.12 CNN architecture of VGG Net.

When data is scarce, R-CNN is an efficient method: supervised training on a large dataset, followed by fine-tuning on a small dataset.

Fully convolutional networks (FCNs) [9] are a deep and influential architecture for semantic segmentation. An FCN consists of 22 layers, including 19 convolutional layers associated with 3 fully connected layers. An FCN takes input of any size and produces a correspondingly-sized output with effective training and inference. In semantic segmentation, an FCN processes the input data image pixels-to-pixels, which yields state-of-the-art results without any need for supplementary processing. An FCN is trained end-to-end (i) after supervised learning and (ii) for pixel-wise prediction of the input data image. The fully convolutional layers of the network calculate dense outputs from arbitrary-sized inputs. Both training and inference are accomplished for the entire image at once by computation of the feedforward and backpropagation algorithms. Patchwise learning is common to these methods, but it lacks the efficiency of training fully convolutional layers. The complications of pre- and post-processing are not included in the FCN, which adapts the network from trained classification networks and transfers their learned representations to prediction as fully convolutional networks. Each layer in an FCN consists of a three-dimensional array of size height × width × depth, where height and width are spatial dimensions and depth is the feature dimension. The image size in the first layer is height × width, with depth equal to the number of color channels. Locations in higher layers correspond to image locations through their receptive fields. The basic structure of an FCN includes a convolutional layer, a pooling layer and activation functions that operate on a local region of the image and depend only on their associated coordinates. In general, an FCN works on any arbitrary-sized input and produces output with corresponding spatial dimensions. When the receptive fields overlap considerably, an FCN performs layer-by-layer computation using the feedforward and backpropagation algorithms instead of processing images patch-by-patch.

The Visual Geometry Group (VGG) [10] architecture is used for evaluating the effect of growing network depth using large-scale images. Initially, the architecture takes 3 × 3 convolution filters and increases the depth up to 16–19 weight layers. The CNN architecture of VGG Net is shown in Fig. 5.12. The input data image given to the network is a 224 × 224 fixed-size RGB image. The pre-processing of the image is done by subtracting the mean value from each image pixel. The data image is forwarded through convolutional layers with 3 × 3 filters for further processing. The VGG network consists of a series of five convolutional stacks, which are associated with three fully connected layers. The first and second fully connected layers have 4096 channels. The third fully connected layer, the soft-max layer, contains 1000 channels for producing 1000-way classifications. Every hidden layer is processed with the rectification (ReLU) nonlinear activation function.
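The claim that convolutional layers handle arbitrary input sizes comes down to the standard output-size formula, which is easy to state as code (an illustrative helper, not from any library):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one
    axis: floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1
```

For example, a 3 × 3 convolution with stride 1 and padding 1 (the VGG setting) preserves the spatial size, while a 2 × 2 pooling with stride 2 halves it; only fully connected layers pin the network to one fixed input size.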

Initially, the VGG network contains 11 weight layers: 8 convolutional layers associated with 3 fully connected layers. After that, the network increases the depth to 19 weight layers: 16 convolutional layers associated with 3 fully connected layers. The convolutional layer width starts at 64 and increases iteratively by a factor of 2, stopping when it reaches 512.

In general, a deep convolutional neural network accepts fixed-size input data images. This constraint is artificial and may decrease the accuracy of recognizing images of arbitrary size. The constraint is removed by an innovative pooling approach called the spatial pyramid pooling network (SPP-Net) [11]. The CNN architecture of SPP-Net is shown in Fig. 5.13. SPP-Net can produce a fixed-length representation irrespective of the image size. SPP-Net calculates the feature maps only once from the given input data image and applies them to the pooling layer for producing fixed-size outputs, avoiding repetitive computation in the convolutional layers. It is useful for image classification and also important in object detection. The output classification of SPP-Net achieves better performance and requires no fine-tuning for a given input image representation. SPP-Net is one of the most effective techniques in computer vision. The network structure partitions the given input image into divisions and combines their local regions. Multi-level pooling makes SPP-Net robust to object deformations. The network structure not only accepts variable-size images for testing but also during training, and it reduces over-fitting with the dropout method.

FIG. 5.13 CNN architecture of SPP-Net.

A convolutional neural network structure called the inception module performs better image classification and object detection; the network built from this module is referred to as GoogLeNet [12]. The CNN architecture of GoogLeNet is shown in Fig. 5.14. The GoogLeNet architecture optimizes the use of computational resources, increasing the width and depth of the convolutional neural network at the least cost. The optimization quality of the architecture is based on the Hebbian principle and multi-scale processing. Every convolutional layer uses the rectified linear activation function. GoogLeNet consists of 22 layers, including 21 convolutional layers that are associated with one fully connected layer. The network is made from building blocks of convolutional layers and is used to calculate the optimal local construction repeatedly, with spatial features. The lower layers concentrate the input on local regions, and 1 × 1 convolutions are enclosed by the next layer. In the subsequent convolutions, the filter size varies over 1 × 1, 3 × 3, and 5 × 5 for alignment of image patches. Finally, a softmax classifier produces the output classification of the given input data image.

Fast regions with convolutional neural network features (Fast R-CNN) [13] is a technique for object detection that is efficient compared to R-CNN and reaches a higher mean average precision. The CNN architecture of Fast R-CNN is shown in Fig. 5.15. Fast R-CNN introduces numerous advances in training the network, improves the time complexity of testing, and also increases the accuracy of object detection. Fast R-CNN trains a VGG16 deep network 9 times faster than R-CNN and is 213 times faster at test time, while reaching a higher mean average precision. Compared to SPP-Net, Fast R-CNN trains the VGG16 deep network 3 times faster, is 10 times faster in testing, and is more precise. The notable drawbacks of R-CNN are: (i) R-CNN requires a multi-stage pipeline for training the network; (ii) the time and space complexity of training the network is high; and (iii) the detection of an object is slow. The notable drawbacks of SPP-nets are: (i) SPP-nets require a multi-stage pipeline for training the network; (ii) the time and space complexity of training the network is high; and (iii) the extracted features from the object proposals are written to disk.
The Fast R-CNN technique has numerous benefits:
(1) Training of the network in a single stage by means of the multi-task loss function;
(2) Every network layer is updated during training;
(3) Better object detection quality, via higher mean average precision than R-CNN and SPP-nets;
(4) Disk space is not required for storing the object proposal features.
A Fast R-CNN network accepts the input data image and a group of object proposals. The training network forwards the entire image to convolutional and pooling layers for producing the convolutional feature map. The region of interest (RoI) of each object proposal is selected and passed to the pooling layer of the network. Finally, every feature vector is passed to a fully connected layer that ends with a division into two output layers: a softmax classifier and bounding-box regression.
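The reason SPP-Net's output length is fixed regardless of input size can be made concrete: each pyramid level adaptively divides the feature map into a fixed number of bins. A minimal sketch (helper names and the (1, 2, 4) pyramid are illustrative assumptions):

```python
import math

def spp_bins(n, level):
    """Split an axis of length n into `level` adaptive windows
    (start, end), so any input size yields exactly `level` pooled
    values per axis."""
    return [(math.floor(i * n / level), math.ceil((i + 1) * n / level))
            for i in range(level)]

def spp_vector_length(levels=(1, 2, 4), n_filters=256):
    """Length of the concatenated SPP vector: level*level bins per
    pyramid level, times the number of filters -- independent of the
    input feature map's height and width."""
    return sum(level * level for level in levels) * n_filters
```

A 13-wide and a 31-wide feature map both pool down to the same number of bins, so the fully connected layers that follow always see the same vector length.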

FIG. 5.14 CNN architecture of GoogLeNet.

FIG. 5.15 CNN architecture of Fast R-CNN.

Region proposal network (RPN) [14] is a deep convolutional neural network architecture that detects objects using regions. The CNN architecture of RPN is shown in Fig. 5.16. RPN calculates object boundaries and an objectness score for each object, and enables nearly cost-free region proposals. RPN performs end-to-end training to make high-quality region proposals. The network accepts an input data image and produces a set of rectangular object proposals as output, each with an objectness score; this objectness score measures membership to a group of object categories. The RPN slides a small network over the spatial positions of the convolutional feature map of the given input data image. Each sliding window is mapped to a low-dimensional feature representation. This feature is passed to two fully connected layers: a box-classification layer and a box-regression layer. At each sliding window, object proposals from multiple regions are predicted: for k anchor boxes, the classification layer has 2k scores that estimate the probability of an object, and the regression layer has 4k output coordinates. An anchor is centered at the sliding window and is characterized by an aspect ratio and a scale, so the RPN method handles objects over a range of scales and aspect ratios. The training of the network is achieved by the backpropagation algorithm and stochastic gradient descent. The RPN produces better results on the PASCAL VOC dataset.

Faster R-CNN [14] is a deep neural network that is composed of two different modules. The first module identifies the object proposals, and the second uses the object proposals for detection. The CNN architecture of Faster R-CNN is shown in Fig. 5.17. Faster R-CNN combines RPN and Fast R-CNN into a single network. The RPN performs end-to-end training of the network to predict the object boundaries and objectness scores from the given input data image. RPN produces region proposals from the input image; after that, the region proposals are used by Fast R-CNN for the detection of objects. Thus RPN and Fast R-CNN share their convolutional features and use the popular attention mechanism, in which RPN tells the detector where to look in the input image. An innovative anchor-box method is introduced to avoid enumerating filters over scales: the method is based on a pyramid of anchors. Box classification and box regression are performed with the help of anchor boxes of different scales and aspect ratios, while the feature map belongs to a single scale and the filters are of a single size. Faster R-CNN thus uses multi-scale anchors for sharing information without additional cost. Faster R-CNN produces better results on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO datasets than other networks.

A novel residual learning network structure called ResNet [15] was invented for training networks that are significantly deeper than all other networks used before.
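The k anchors per sliding-window position can be generated as every combination of scale and aspect ratio; with the 3 scales and 3 ratios used in the Faster R-CNN paper, k = 9, giving 2k classification scores and 4k regression coordinates per position. A minimal sketch (the function name is illustrative):

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchor boxes
    (x1, y1, x2, y2) centered on one sliding-window position; width and
    height are chosen so each box keeps an area of roughly scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)
            h = s / (r ** 0.5)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Each position on the feature map shares this same anchor set, which is what lets a single-scale feature map cover multi-scale objects.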

FIG. 5.16 CNN architecture of RPN.

FIG. 5.17 CNN architecture of Faster R-CNN.

Deep neural networks have led to a sequence of developments in image classification. They naturally combine low-level, middle-level, and high-level features in a multi-stage pipeline, enriched by the depth of the layers. In recent developments of deep neural networks, the depth of the network is of essential importance, and good outcomes exploit very deep models with a depth of 16 to 30 layers. A degradation issue occurs as deeper networks converge: accuracy becomes saturated and then degrades rapidly. The training accuracy shows that it is not easy to optimize a deeper network. The novel deep residual learning methodology solves this degradation issue. Each stack of layers fits a residual mapping, instead of the desired underlying mapping; the residual mapping is easier to optimize than the desired underlying mapping. The formulation of a residual mapping is realized by feedforward networks with shortcut connections. Shortcut connections in a network refer to skipping one or more layers. The shortcut connections carry out identity mapping, and their output is added to the output of the stacked layers. ResNet can be trained using the stochastic gradient descent method with the backpropagation algorithm. The results of ResNet show that (i) optimization of the deep residual network is easy, and (ii) the deep residual network provides better accuracy from increased depth than previous networks. ResNet is 8 times deeper than VGG nets, with depth of up to 152 layers, on the ImageNet 2012 classification dataset.

Region-based fully convolutional neural networks (R-FCN) [16] is an architecture used for object detection. R-FCN performs region-based object detection and is one of the most efficient and accurate methods. Fast/Faster R-CNN applies a costly per-region subnetwork hundreds of times, whereas R-FCN shares computation on the whole input image using fully convolutional layers. This is achieved using position-sensitive score maps for object detection and image classification. The method implements two levels of object detection, involving region proposal and region classification. R-FCN accepts the given input data image and is able to classify RoIs into object categories and background. In R-FCN, all convolutional layers are trained with weights that are computed on the whole input image. The final layer, referred to as the position-sensitive RoI pooling layer, aggregates the outputs and creates position-sensitive scores for each class. Unlike other methods, position-sensitive RoI layers perform selective pooling, and each response comes from one out of all score maps. With the help of end-to-end training in R-FCN, the RoI layer guides the final convolutional layer to learn position-sensitive score maps.
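The residual formulation described above is y = F(x) + x: the stacked layers learn only the residual F, and the shortcut adds the identity. A minimal sketch over plain vectors (the function name is illustrative):

```python
def residual_block(x, transform):
    """Residual learning: the stacked layers compute F(x), the shortcut
    carries x unchanged, and the block outputs y = F(x) + x."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]
```

If the desired mapping is close to the identity, the layers only need to push F toward zero, which is the intuition for why deep residual networks remain easy to optimize.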

A novel modularized architecture, the residual learning network with next dimension (ResNeXt) [17], outperforms 101-layer, 152-layer, and 200-layer ResNet on image classification. The ResNeXt network is built by repeating a building block that aggregates a group of transformations with the same topology. ResNeXt results in a regular, multi-branch network that has only a small number of hyper-parameters, such as width, filter sizes, and strides, to initialize. The method uses a new dimension called cardinality that defines the size of the group of transformations, in addition to width and depth. Increasing the cardinality of the network enhances the accuracy of image classification and is more efficient than going deeper. The network design becomes complex when more layers get involved, through the growth of hyper-parameters such as width, filter sizes, strides, etc. The VGG network contains an efficient method of building a deep architecture, which stacks blocks of the same shape; ResNeXt inherits this feature from VGG and stacks modules of identical topology. This optimized constraint reduces the selection of hyper-parameters. The inception-like module in ResNeXt splits the input into low-dimensional embeddings with 1 × 1 convolutions, transforms them with specific filters such as 3 × 3 and 5 × 5, and merges the branches by concatenation.

Feature pyramids are an essential part of recognition systems for object detection at multiple scales, but recent research in object detection has avoided feature pyramids due to their memory and computation cost. The main objective of feature pyramid networks (FPN) [18] is to build feature pyramids with minimum cost. The FPN structure is merged with adjacent connections and enhanced for constructing high-level feature maps at different scales. The network accepts any arbitrary-size input of a single-scale image and produces output feature maps at different levels. The FPN network incorporates two different approaches, bottom-up and top-down. In the bottom-up approach, the feedforward computation of the ConvNet computes feature maps at multiple scales with a scaling step of 2. The feature pyramid defines one pyramid stage for each level; the output feature map of the last layer of each stage is used to construct the pyramid, since the deepest layer of every stage has the strongest features. In the top-down approach, stronger feature maps are created from higher pyramid levels, and these features are enhanced with the features of the bottom-up pathway through the adjacent connections of the network. Every adjacent connection of the network combines feature maps of identical spatial size from both the top-down and bottom-up pathways. In the top-down architecture, predictions are computed at the optimum stage with skip network connections; in the bottom-up architecture, a prediction is made independently at every level of the feature pyramid.

Instance segmentation is a challenging task which needs accurate detection of the object image and also segmentation of each instance; therefore, it merges object detection with semantic segmentation. Object detection classifies and localizes objects using bounding box regression, while semantic segmentation classifies each pixel into a set of classes. Mask R-CNN [19] is a simple and general method for object instance segmentation. It effectively detects the objects and produces a superior segmentation mask for each instance. The network extends Faster R-CNN by adding a step for predicting the object mask alongside the existing step for bounding box classification. Mask R-CNN trains the network in a simpler manner and improves on Faster R-CNN with a small modification. The mask branch is a small, fully convolutional network applied to every RoI that determines a segmentation mask. Faster R-CNN does not perform pixel-to-pixel alignment in the network; an RoIAlign layer is proposed for alignment between the network's inputs and output.

RetinaNet [20] is a distinct, integrated network made up of a backbone network along with two subnetworks. In RetinaNet, a feature pyramid network (FPN) is used as the backbone network, responsible for calculating the convolutional feature map of a given input data image. The FPN structure is merged with adjacent connections and enhanced for constructing high-level feature maps at different scales. The network accepts any arbitrary-size input of a single-scale image and produces output feature maps at different levels. The two subnets are used to perform bounding box classification and regression, making RetinaNet a one-stage dense detection method for object detection. The classification subnet calculates the likelihood of an object being present at each spatial location, for each of the anchors and object classes. The classification subnet is attached to every FPN level, and its parameters are shared among all levels of the pyramid. The classification subnet design is simple: it accepts an input feature map with a number of channels, applies 3 × 3 convolutional layers with filters, and uses the ReLU activation function. The classification subnet is deepened by using 3 × 3 convolutional layers, and it does not share parameter information with the box regression subnet, which is attached to the network parallel to the classification subnet and terminates in 4 sequential outputs per anchor at each spatial location.
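Detection and instance segmentation methods like those above are scored by matching predicted boxes (or masks) to ground truth via intersection over union (IoU); a standard computation, sketched here for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes; detection
    benchmarks accept a predicted box as a match when its IoU with a
    ground-truth box exceeds a threshold such as 0.5."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

IoU is 1 for identical boxes, 0 for disjoint ones, and in between for partial overlap, which is what makes it a scale-invariant matching criterion.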

output per spatial location. RetinaNet uses a focal loss all other activations. The output feature map of the con-
function to address the class imbalance issue in the one- volution is passed through deconvolution. In deconvo-
stage detector. A comparison of CNN methods is shown lution, unpooling is applied; rectification and filtering
in Table 5.1. are used to restructure the input data image. The process
is repeated until input space is reached. In recent times,
deconvolution is used for super-resolution [26], visu-
5.4 CONVOLUTIONAL LAYER alization [6], semantic segmentation [27], recognition
The convolutional layer is made up of numerous con- [28], visual question answering [29], and localization
volution filters for calculating different feature maps. [30].
The feature map of every neuron is connected to the
adjacent neurons in the preceding layer. Such a local 5.4.3 Dilated Convolution
region is denoted as a receptive field of the neuron Dilated convolution [31] is a modern improvement of
in the preceding layer. Initially, the input is convolved the convolutional neural network that acquaints with
with a trained filter to obtain the feature map and computes the convolved outcome with a nonlinear activation function. The filter is shared by the spatial information of the input for constructing each feature map. The entire set of feature maps is generated by using numerous filters.

5.4.1 Tiled Convolution
In convolutional neural networks, weight sharing methods can significantly reduce the number of parameters. However, this sharing also constrains what kinds of invariance the network can learn. Tiled CNN [21] is a method that learns scale- and rotation-invariant information using tiles and feature maps. Individual filters can be trained within the layer, and the difficult invariances can be learned from the pooling layers. The convolution operation is applied to every kth unit, where k is the size of the tile that controls how the network shares its weights. If the size of the tile is 1, then the feature map has identical weights and represents the identical features of a traditional CNN. If the size of the tile is 2, the network provides better results than a traditional CNN [22].

5.4.2 Transposed Convolution
Transposed convolution [6], [9], [23], [24] is a competing model of the convolutional network that converts features to pixels, instead of mapping pixels to features. Transposed convolution is the backward approach of a traditional convolution network; it is also referred to as deconvolution and, marginally, as strided convolution [25]. The deconvolution performs filtering and pooling in reverse order of convolution, and is a method of performing unsupervised learning. In the ZFNet architecture, a deconvolution is attached to every layer of the convolution network. Initially, an input data image is presented to the convolution, and the features are calculated through the convolutional and pooling layers. To evaluate a given convolutional activation, the value of zero is assigned to all other activations in the layer.

5.4.3 Dilated Convolution
Dilated convolution adds one new parameter to the convolutional layer. This new parameter inserts the value of zero between kernel elements. Dilated convolution increases the size of the local receptive field, so the network includes more information about the input image. This characteristic is essential for computations which require a local receptive field of large size. In general, 1-D dilated convolution is extended to 2-D dilated convolution, 2-D dilated convolution is extended to 3-D dilated convolution, and so on. In dilated convolutional layers, the factor of dilation increases rapidly at each layer. The middle feature map FM2 is created from the bottom-level feature map FM1 using 1-D dilated convolution; FM3 is created from the feature map FM2 using 2-D dilated convolution; FM4 is created from the feature map FM3 using 3-D dilated convolution, and so on. Dilated convolutions have achieved impressive performance in applications such as speech recognition [32], scene segmentation [31], speech synthesis [33], and machine translation [34].

5.4.4 Network-in-Network
Network-in-network (NIN) [7] is an essential deep neural network. In general, a convolutional layer uses a linear filter for producing the feature map and a nonlinear activation function for scanning the input data image. In NIN, a micro-neural network of function approximators, a multilayer perceptron, is introduced instead. The feature maps are found by sliding the micro-networks of the multilayer perceptron over the input and are then forwarded to the next layer of the network. The output feature map of the final multilayer convolution layer is forwarded to the global average pooling layer, and the resultant vector is passed to the softmax classifier.

The nonlinear activation function for the feature map of a linear convolution layer is calculated as:

    f^l_{p,q,r} = max(W_r^T I_{p,q} + b_r, 0),    (5.3)
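Equation (5.3) can be made concrete with a small sketch; the function name, the loop-based implementation, and the toy sizes below are illustrative choices, not code from the chapter:

```python
import numpy as np

# Minimal sketch of Eq. (5.3): slide one linear filter W over the input
# patch by patch and rectify the response with max(., 0).
def conv_relu_feature_map(image, W, b):
    k = W.shape[0]
    H, V = image.shape
    out = np.zeros((H - k + 1, V - k + 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            patch = image[p:p + k, q:q + k]                      # I_{p,q}
            out[p, q] = max(float(np.sum(W * patch)) + b, 0.0)   # Eq. (5.3)
    return out
```

A real convolutional layer would vectorize this computation and apply many such filters, one per feature map, as described above.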
TABLE 5.1
Comparison of CNN methods.

LeNet 5 [2]
  Organization: 5 CONV layers with 3 FC layers
  Features: Structured in six planes
  Benefits: 1. Six different types of feature map are extracted. 2. Implements weight sharing techniques.
  Applications: Handwritten recognition; face recognition; online handwritten recognition; machine-printed character recognition

AlexNet [3]
  Organization: 5 CONV layers with 3 FC layers
  Features: 60 million trained parameters and 650,000 connections
  Benefits: 1. Non-saturating neurons are used for faster training. 2. Effective GPU implementation. 3. Dropout is used for reducing the overfitting problem.
  Applications: Image classification; object detection

ZFNet [6]
  Organization: 5 CONV layers with 3 FC layers
  Features: An innovative technique for understanding intermediate layers and their enhancement
  Benefits: 1. The convolution layer operation is executed on a GPU. 2. ZFNet is a multi-layered deconvolutional network. 3. ZFNet is able to handle large training data sets. 4. The dropout technique is used for reducing the parameters.
  Applications: Image classification

NIN [7]
  Organization: 3 mlpconv layers with 1 global average pooling layer
  Features: An innovative method for improving the discriminability of local image patches within their local regions
  Benefits: 1. NIN forms micro neural networks to abstract the image patch. 2. The global average pooling method is used for reducing the overfitting problem in the network.
  Applications: Image classification; object detection

R-CNN [8]
  Organization: 5 CONV layers with 1 FC layer
  Features: Object recognition using regions
  Benefits: 1. Localization of objects with the help of the R-CNN architecture. 2. Training a high-performance network with a minimum number of annotated data images. 3. Enhances mean average precision.
  Applications: Image classification; object detection

FCN [9]
  Organization: 19 convolutional layers with 3 fully connected layers
  Features: FCN takes any size of input and produces a fixed-size output with effective training and inference
  Benefits: 1. The complications of pre- and post-processing are not included in FCN. 2. FCN performs layer-by-layer computation using the feedforward and backpropagation algorithms instead of processing images patch-by-patch.
  Applications: Semantic segmentation

VGG [10]
  Organization: 11 weight layers having 8 CONV layers with 3 FC layers, increasing the depth to 19 weight layers having 16 CONV layers with 3 FC layers
  Features: Evaluating the effect of network depth using large-scale images
  Benefits: 1. Enhancement of the number of weight layers from 16 to 19.
  Applications: Image classification

SPP-Net [11]
  Organization: 5 CONV layers with 3 FC layers, including spatial pyramid pooling
  Features: SPP-Net can produce a fixed-size representation irrespective of the image size
  Benefits: 1. SPP-Net is one of the most effective techniques in computer vision. 2. SPP-Net calculates the feature maps only once from the given input data image and applies them to the pooling layer for producing fixed-size representations. 3. Multi-level pooling in SPP-Net performs faster under object deformations.
  Applications: Image classification; object detection

GoogLeNet [12]
  Organization: 21 CONV layers with 1 FC layer
  Features: Inception module network for better classification and detection of images
  Benefits: 1. The optimization quality of the network is based on the Hebbian principle. 2. The network increases the width and depth of the convolutional neural network at lower cost.
  Applications: Image classification; object detection

Fast R-CNN [13]
  Organization: 13 CONV layers, 4 max pooling layers with 1 RoI pooling layer and several FC layers
  Features: Efficient technique compared to R-CNN for object detection that reaches a higher mean average precision
  Benefits: 1. Training of the network is single-stage by means of a multi-task loss function. 2. Every network layer is updated during training. 3. Better object detection quality via higher mean average precision than R-CNN and SPP-Net. 4. Disk space is not required for storing the object proposal features.
  Applications: Object detection

RPN [14]
  Organization: Classification layer has 2000 scores and regression layer has 4000 output coordinates
  Features: Object recognition using regions
  Benefits: 1. The RPN method performs object detection at different scales and for different aspect ratios. 2. The training of the network is achieved by the backpropagation algorithm and the stochastic gradient descent method.
  Applications: Object detection

Faster R-CNN [14]
  Organization: Merges RPN with Fast R-CNN
  Features: Object recognition using regions
  Benefits: 1. Both RPN and Fast R-CNN share their convolutional features. 2. RPN produces region proposals, and Fast R-CNN detects the object.
  Applications: Object detection

ResNet [15]
  Organization: 34-layer, 50-layer, 101-layer and 152-layer ResNet
  Features: Learning of networks that are significantly deeper than all other networks used before
  Benefits: 1. Optimization of the deep residual network is easy. 2. The deep residual network provides better accuracy from the increased depth of the network than previous networks.
  Applications: Image classification; object detection

R-FCN [16]
  Organization: Convolutional layers with RoI pooling layer
  Features: Region-based object detection
  Benefits: 1. The RoI pooling layer aggregates the output and creates position-sensitive scores for each class. 2. The position-sensitive RoI layer performs selective pooling and combines responses from one out of all the score maps.
  Applications: Image classification; object detection

ResNeXt [17]
  Organization: VGG/ResNet method of repeating layers with cardinality 32
  Features: The ResNeXt network is built by repeating a building block that aggregates a group of transformations of similar topology
  Benefits: 1. Increasing the cardinality of the network enhances the accuracy of image classification and is more efficient than going with a deeper network. 2. ResNeXt inherits the features of VGG.
  Applications: Image classification; object detection

FPN [18]
  Organization: Feature pyramid recognition system
  Features: FPN structure merged with adjacent connections and enhanced for constructing high-level feature maps at different scales
  Benefits: 1. Builds the feature pyramids with minimum cost. 2. The network accepts any arbitrary-size input and produces feature maps at different levels.
  Applications: Object detection

Mask R-CNN [19]
  Organization: Inherits Faster R-CNN, with a RoIAlign layer
  Features: Effectively detects objects and also produces a superior segmentation mask for each instance
  Benefits: 1. Simple and general method for object instance segmentation. 2. The RoIAlign layer is used for alignment of the network between inputs and outputs.
  Applications: Object detection; semantic segmentation

RetinaNet [20]
  Organization: Fully convolutional network built on a ResNet-FPN backbone
  Features: RetinaNet is a one-stage dense detection method for object detection
  Benefits: 1. RetinaNet uses the focal loss function to address the class imbalance issue in one-stage detectors. 2. RetinaNet uses the ReLU activation function.
  Applications: Object detection
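The zero-insertion mechanism of dilated convolution discussed above can also be sketched directly; the function name and dilation factors below are illustrative, not from the chapter:

```python
import numpy as np

# Illustrative sketch: dilate a 1-D kernel by inserting d-1 zeros between
# its taps, which grows the receptive field to d*(k-1)+1 weights wide
# without adding any new trainable weights.
def dilate_kernel(kernel, d):
    kernel = np.asarray(kernel, dtype=float)
    k = kernel.size
    out = np.zeros(d * (k - 1) + 1)
    out[::d] = kernel            # original taps, zeros in between
    return out
```

For a 3-tap kernel and d = 2 the result is [w0, 0, w1, 0, w2]: the receptive field grows from 3 to 2·(3 − 1) + 1 = 5 while the number of trainable weights stays at 3.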
90 Deep Learning and Parallel Computing Environment for Bioengineering Systems

where f^l_{p,q,r} is the nonlinear activation score of the rth feature map at layer l, I_{p,q} is the input patch, W_r^T is the weight vector, and b_r is the bias term.

The nonlinear activation function for the feature map of the multilayer perceptron convolution layer is calculated as:

    f^n_{p,q,r_n} = max(W_{r_n}^T f^{n-1}_{p,q} + b_{r_n}, 0),    (5.4)

where n is the number of the layer in the multilayer perceptron. The global average pooling layer has few hyper-parameters, reducing the problem of overfitting and decreasing the computational cost.

5.4.5 Inception Module
The inception module was introduced in the network GoogLeNet [12]. The network uses variable-size filters for constructing the feature maps and estimates the optimal sparse construction via the inception module. The inception module consists of three different convolution operations and only one pooling layer operation. The convolution operation places a 1 × 1 convolution before the 3 × 3 and 5 × 5 convolutions, which grows the depth and width of the CNN without any additional cost. The number of network hyper-parameters is decreased to 5 million, which is less when compared to ZFNet (75 million) and AlexNet (60 million).

5.5 POOLING LAYER
Pooling is an essential concept of CNNs. It reduces the computational complexity by decreasing the number of network connections between convolutional layers. In this section, we bring together the details about the latest pooling approaches used in CNNs.

5.5.1 Lp Pooling
Lp pooling is a biologically inspired pooling approach modeled on complex cells [35]. The analysis of the Lp pooling approach [36] shows improved generalization compared to max pooling.

The Lp pooling can be computed as:

    O_{p,q,r} = ( Σ_{(u,v) ∈ R_{pq}} f_{u,v,r}^l )^{1/l},    (5.5)

where O_{p,q,r} is the output of a pooling operation at the location (p, q), with r denoting the feature map; f_{u,v,r} is the value of the feature at location (u, v); and R_{pq} is the pooling region. Average pooling (up to a constant factor) is obtained when l = 1, and max pooling when l → ∞.

5.5.2 Mixed Pooling
Mixed pooling [37] is an efficient methodology constructed as a combination of average and max pooling. The mixed pooling can be computed as:

    O_{p,q,r} = θ · max_{(u,v) ∈ R_{pq}} f_{u,v,r} + (1 − θ) · (1/|R_{pq}|) Σ_{(u,v) ∈ R_{pq}} f_{u,v,r},    (5.6)

where θ is a random value selecting either max pooling (θ = 1) or average pooling (θ = 0). The chosen value is stored during forward propagation and utilized for backward propagation. The final results [37] show that mixed pooling performs better than both of the other methods.

5.5.3 Stochastic Pooling
Stochastic pooling [38] is an essential method which selects the activations based on a multinomial distribution and makes sure that non-maximal activations of the feature maps are also used. Stochastic pooling calculates the activations within the region and selects a location within the region to set the pooling activation. The problem of overfitting can be reduced more in stochastic pooling than in other methods.

5.5.4 Spectral Pooling
Spectral pooling [39] is useful for reducing the dimensionality of the input image. The given input data image is passed to the network. Spectral pooling computes the input feature map by using the discrete Fourier transform (DFT) method and retains only the required part of the representation. Spectral pooling finally uses the inverse DFT for mapping the feature map back to the spatial domain. The operation of spectral pooling uses sequential low-pass filtering, which performs better than max pooling. The matrix truncation of the spectral pooling process costs little in CNNs.

5.5.5 Spatial Pyramid Pooling
K. He et al. proposed spatial pyramid pooling (SPP) [11] to produce a fixed-size representation irrespective of the input size. The SPP layer pools the input feature map in local spatial bins, resulting in a fixed number of bins. The spatial pyramid pooling computation is better than sliding window pooling. If the final pooling layer is replaced with spatial pyramid pooling, then the network is able to handle variable-size input images.

5.5.6 Multi-Scale Orderless Pooling
The multi-scale orderless pooling method [40] is used to improve the performance of various CNN methods.
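The pooling rules of Eqs. (5.5) and (5.6) can be sketched for a single pooling region R_pq; the function names are illustrative, and a real pooling layer would apply them to every region of the feature map:

```python
import numpy as np

# Sketch of Lp pooling (Eq. (5.5)) and mixed pooling (Eq. (5.6)) over one
# pooling region; names are illustrative, not a library API.
def lp_pool(region, l):
    region = np.asarray(region, dtype=float)
    return float(np.sum(region ** l) ** (1.0 / l))      # (sum f^l)^(1/l)

def mixed_pool(region, theta):
    region = np.asarray(region, dtype=float)
    # theta = 1 -> max pooling, theta = 0 -> average pooling
    return theta * region.max() + (1.0 - theta) * region.mean()
```

With l = 1 the Lp pool reduces to summation (average pooling up to the 1/|R_pq| factor), while a large l approaches max pooling; mixed pooling interpolates between the maximum and the mean according to the drawn θ.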
CHAPTER 5 Medical Image Analysis With Deep Neural Networks 91

The method computes activation features for the entire image and also for image patches at multiple scales. The multi-scale orderless pooling method acquires the global spatial information. The activation features of local image patches are combined by an encoding method called VLAD [41], which captures local and fine-grained features of the image. The novel image representation is attained by combining the features of the global information with the VLAD characteristics of the local patch information.

5.6 ACTIVATION FUNCTION
An appropriate activation function considerably improves the performance of a CNN on a given task. In this section, we introduce the activation functions recently used in CNNs.

5.6.1 Rectified Linear Unit (ReLU)
One of the most efficient activation functions is the rectified linear unit (ReLU) [42]. The activation function is determined as:

    f_{p,q,r} = max(I_{p,q,r}, 0),    (5.7)

where I_{p,q,r} is the input present at location (p, q) on the rth kernel. The ReLU activation function keeps the positive section and reduces the negative section to zero. The max(·) operation of ReLU is more robust than the tanh or sigmoid functions, and ReLU leads the network to sparse representations.

5.6.2 Leaky ReLU
A potential drawback of the ReLU function is that the negative section is set to zero, which makes the unit inactive. This can result in the unit never becoming active and never updating its weights.

The leaky rectified linear unit (LReLU) [43] function is determined as:

    f_{p,q,r} = max(I_{p,q,r}, 0) + θ min(I_{p,q,r}, 0),    (5.8)

where θ is a predetermined parameter in the range (0, 1). LReLU compresses the negative section instead of setting it to zero.

5.6.3 Parametric ReLU
The main disadvantage of LReLU is the predetermined parameter θ. A parametric rectified linear unit (PReLU) was proposed in [44], which automatically trains the arguments of the rectifiers to enhance the accuracy of the network. The PReLU function is determined as:

    f_{p,q,r} = max(I_{p,q,r}, 0) + θ_r min(I_{p,q,r}, 0),    (5.9)

where θ_r is the trained argument of the rth channel. PReLU introduces no additional risk of overfitting and no additional cost, and its arguments can be learned in parallel with the other parameters by the backpropagation algorithm.

5.6.4 Randomized ReLU
The randomized leaky rectified linear unit (RReLU) [45] is an extension of Leaky ReLU. In this activation function, the arguments of the negative section are randomly selected during training and then kept constant during testing. The RReLU function is determined as:

    f^{(i)}_{p,q,r} = max(I^{(i)}_{p,q,r}, 0) + θ^{(i)}_r min(I^{(i)}_{p,q,r}, 0),    (5.10)

where I^{(i)}_{p,q,r} is the input at the location (p, q) on the rth kernel of the ith sample, θ^{(i)}_r is the sampled parameter, and f^{(i)}_{p,q,r} is the output. The RReLU function reduces the problem of overfitting because of the random selection of the parameter.

5.6.5 Exponential Linear Unit (ELU)
An efficient activation function called the exponential linear unit (ELU) was proposed by D.A. Clevert et al. [46]; it performs robust training of deep networks and leads to greater classification precision. Compared to other activation functions, ELU introduces a saturation function for handling the negative section. If the unit is deactivated, the activation saturates, which makes ELU more robust when noise is present.

The ELU function is determined as:

    f_{p,q,r} = max(I_{p,q,r}, 0) + min(θ (e^{I_{p,q,r}} − 1), 0),    (5.11)

where θ is a predetermined parameter controlling the saturation of ELU for the negative section.

5.6.6 Maxout
One of the effective activation functions, which works on a spatial location across multiple channels and returns the maximum response, is the Maxout [47] function. It is an extension of ReLU. The maxout function is appropriate for training the network with dropout.
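The pointwise activations of Eqs. (5.7), (5.8) and (5.11) can be sketched element-wise in NumPy; the default θ values below are illustrative choices, not values prescribed by the cited papers:

```python
import numpy as np

# Element-wise sketches of ReLU (Eq. (5.7)), Leaky ReLU (Eq. (5.8)) and
# ELU (Eq. (5.11)); the theta arguments are illustrative defaults.
def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, theta=0.1):
    return np.maximum(x, 0.0) + theta * np.minimum(x, 0.0)

def elu(x, theta=1.0):
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.minimum(theta * (np.exp(x) - 1.0), 0.0)
```

All three functions agree on the positive section and differ only in how the negative section is treated: clipped to zero, scaled linearly, or saturated exponentially.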
The maxout function is determined as:

    f_{p,q} = max_{r ∈ [1,R]} I_{p,q,r},    (5.12)

where f_{p,q} is the output of the activation function and R is the number of channels at location (p, q).

5.6.7 Probout
J.T. Springenberg et al. [48] proposed a variant of maxout called probout, in which the max operation of maxout is replaced by a probabilistic sampling procedure. It determines the likelihood of each of the z linear units as:

    P_x = e^{θ o_x} / Σ_{y=1}^{z} e^{θ o_y},    (5.13)

where θ is a parameter controlling the distribution. Considering a multinomial distribution over P_1, ..., P_z, one of the z units is sampled to set the value of the activation.

The probabilities are redefined as:

    P̂_0 = 0.5,    P̂_x = e^{θ o_x} / (2 Σ_{y=1}^{z} e^{θ o_y}).    (5.14)

The activation function is determined as:

    f_x = f(x) = 0 if x = 0, and o_x otherwise,    (5.15)

where x is a draw from the multinomial distribution (P̂_0, ..., P̂_z). Probout achieves a balance between preserving the desirable properties of maxout units and enhancing their invariance properties. The computational cost of probout is much greater than for maxout, since it performs extra probability computations.

5.7 APPLICATIONS OF CNN IN MEDICAL IMAGE ANALYSIS
5.7.1 Image Classification
Image classification is the primary domain in which deep neural networks play the most important role in medical image analysis. Image classification accepts the given input images and produces an output classification identifying whether the disease is present or not. E. Kim et al. [49] proposed a CNN method which achieves excellent image classification accuracy in cytopathology. The Inception v3 architecture [50] is one of the best methods for medical data analysis and has achieved near-human performance. The CNN architecture proposed by E. Hosseini-Asl et al. [51] uses three-dimensional convolutions to classify Alzheimer's disease. J. Kawahara et al. [52] proposed a CNN-like architecture used for predicting the development of the brain. In image classification, CNNs are the current state-of-the-art methods. CNNs trained on natural images have shown strong performance, approaching the accuracy of human expert systems. These results suggest that CNNs can be adapted to exploit the intrinsic structure of medical images [53].

5.7.2 Object Classification
In general, object classification takes a small portion of the medical image as input and produces two or more categories for classification. Local and global information is required for better classification. W. Shen et al. [54] proposed a method that uses CNNs which accept multi-scale inputs for classification. J. Kawahara et al. [55] proposed a multi-stream CNN method for detecting skin lesions, where each stream works on a different resolution of the image. X. Gao et al. [56] proposed a method for merging CNNs and RNNs for the classification of images, using pre-trained CNN filters. This combination performs processing of related information irrespective of the image size. A. Setio et al. [57] proposed an efficient multi-stream CNN for classification of chest CT candidates into nodules or non-nodules. Nine different patches are extracted for each candidate, processed in the individual streams, and finally forwarded to the classification layer. D. Nie et al. [58] proposed a network that accepts three-dimensional MRI images and trains three-dimensional convolutional neural networks to assess whether a patient has high-grade gliomas.

5.7.3 Region, Organ, and Landmark Localization
Organ and landmark localization is an essential step in segmentation and therapy planning. In medical image analysis, localization requires parsing of three-dimensional volumes. The deep neural network treats the three-dimensional space as a composition of two-dimensional orthogonal planes. D. Yang et al. [59] proposed a regular CNN for processing individual sets of two-dimensional MRI slices. The landmark of the three-dimensional position was determined as a combination of three two-dimensional slices, which produced better classification results. B. De Vos et al. [60] proposed a method that selects the region of interest in anatomical regions such as the descending aorta, heart, and aortic arch by recognizing a bounding box classification. C. Payer et
al. [61] proposed a method for predicting the locations directly: the CNN directly regresses the locations of the landmarks. Every landmark is represented by a Gaussian function, and the network is learned directly from the landmark map. Only a few CNNs are able to address the problem of landmark localization in regions of three-dimensional image space. Y. Zheng et al. [62] proposed a method that reduces the time complexity by dividing three-dimensional convolutions into three one-dimensional convolutions for detecting the carotid artery bifurcation in CT data. B. Kong et al. [63] proposed a method for detection of end-systole and end-diastole frames of the heart using a combination of techniques, namely an LSTM-RNN with a CNN. The CNN is one of the essential techniques for localization of regions, landmarks, and organs using two-dimensional classification of images.

5.7.4 Object or Lesion Detection
The detection of objects or lesions in a given input image is an essential part of diagnosis, and one that is time consuming for clinicians. It is the process of identifying and localizing small lesions in a given input image. Research in this area aims to detect lesions automatically, enhancing the accuracy of detection or reducing the reading time of human experts. The first object detection method using convolutional neural networks was introduced in 1995; the CNN used different layers for detecting nodules in X-ray images [64]. Most of the research work in deep learning for object detection is performed with CNNs. The CNNs perform pixel classification followed by post-processing to obtain object candidates. The three-dimensional information in object detection is processed using multi-stream CNNs [65]. A. Teramoto et al. [66] proposed a multi-stream CNN methodology to combine CT data with positron emission tomography data. Q. Dou et al. [67] proposed a novel three-dimensional CNN technique, which was used for the discovery of micro-bleeds in brain MRI.

5.7.5 Organ and Substructure Segmentation
In medical image analysis, for instance of the brain or heart, organ and substructure segmentation permits investigation of quantifiable parameters associated with shape and volume. It is an essential part of computer-aided detection of objects. The processing of segmentation allows recognizing the group of voxels which constitutes either the interior or the contour of the objects. The most distinguished CNN architecture in medical image analysis is U-net [68], which consists of two different paths, namely downsampling and upsampling. The network merges these two paths between the deconvolution and convolution operations. The given input image is processed by the network in the forward pass and results in a segmentation. O. Cicek et al. [69] proposed an extension of U-net for performing full three-dimensional segmentation. F. Milletari et al. [70] proposed a different three-dimensional U-net structure referred to as V-net; the network computes image segmentation using three-dimensional convolutional layers with a Dice coefficient loss function. R. Korez et al. [71] proposed a three-dimensional fCNN architecture used for segmentation of MR images. R. Moeskops et al. [72] proposed an fCNN method to perform segmentation of brain MRI, coronary blood vessels in cardiac CT angiography, and the pectoral muscle in breast MRI.

5.7.6 Lesion Segmentation
In the application of deep learning methods, lesion segmentation merges the tasks of substructure segmentation, organ segmentation, and object detection. K. Kamnitsas et al. [73] and M. Ghafoorian et al. [74] proposed methods to perform precise segmentation with the help of local and global information using multi-stream networks. The U-net architecture also uses local and global information for lesion segmentation. Brosch et al. [75] proposed a method that uses three-dimensional convolutions and a skip connection between the first and final layers for segmentation of lesions in brain MRI.

5.8 DISCUSSION
CNNs have accomplished notable performance in a collection of areas such as brain, retinal, chest X-ray, chest CT, breast, cardiac, abdominal and musculoskeletal image analysis. Thus, convolutional neural networks have almost limitless applications.

The deep neural networks applied to medical image analysis still face many challenges:
(1) Complexity in training large datasets;
(2) The picture archiving and communication systems (PACSs) are not directly usable for deep learning in medicine;
(3) Attaining meaningful classification labels for the images;
(4) The time complexity of labeling large datasets is high;
(5) Class imbalance is one of the essential challenges in image classification;
(6) Providing the entire image to the network is not a feasible solution due to memory restrictions.
5.9 CONCLUSIONS
This chapter reviewed the different deep learning convolutional neural network methods. The features, benefits, and applications of convolutional neural network methods were also discussed. Deep learning systems are beneficial in the exploration of disorder classification; tissue, anatomy, lesion and tumor segmentation; lesion and tumor detection and classification; survival and disease activity prediction; as well as image construction and enhancement. We have also presented the future research challenges of deep neural networks.

Author contributions: Both authors contributed equally.

REFERENCES
1. L. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Transactions on Signal and Information Processing 3 (2014) 1–29, https://doi.org/10.1017/ATSIP.2014.4.
2. Y. LeCun, L. Bottou, Y. Bengio, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324, https://doi.org/10.1109/5.726791.
3. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, vol. 1, 2012, pp. 1097–1105.
4. K. Balaji, K. Lavanya, Recent trends in deep learning with applications, in: Cognitive Computing for Big Data Systems Over IoT, vol. 14, Springer International Publishing AG, 2018, pp. 201–222.
5. Y-Lan Boureau, Jean Ponce, Yann LeCun, A theoretical analysis of feature pooling in visual recognition, in: ICML 2010 – Proceedings, 27th International Conference on Machine Learning, vol. 19(1–8), 2010, pp. 111–118.
6. Matthew D. Zeiler, Rob Fergus, Visualizing and understanding convolutional networks, CoRR abs/1311.2901 (2013) 1–11, arXiv:1311.2901.
7. Min Lin, Qiang Chen, Shuicheng Yan, Network in network, in: ICLR – 2014, vol. 3, 2014, pp. 1–10, arXiv:1312.4400v3.
8. Ross B. Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), U.C. Berkeley, Berkeley, 2014, pp. 1–21, arXiv:1311.2524v1.
9. Jonathan Long, Evan Shelhamer, Trevor Darrell, Fully convolutional networks for semantic segmentation, in: Computer Vision and Pattern Recognition – 2015, abs/1411.4038, 2015, pp. 1–10, arXiv:1411.4038v2.
10. Karen Simonyan, Andrew Zisserman, Very deep convolutional networks for large-scale image recognition, in: Computer Vision and Pattern Recognition – 2015, 2015, pp. 1–15, arXiv:1409.1556v6.
11. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: Computer Vision – ECCV 2014, vol. 8691, 2014, pp. 1–14.
12. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2015, vol. 1, 2015, pp. 1–9.
13. Ross B. Girshick, Fast R-CNN, in: Proceedings of the International Conference on Computer Vision – ICCV, vol. 1, 2015, pp. 1–9.
14. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Computer Vision and Pattern Recognition, vol. 1, 2016, pp. 1–14, arXiv:1506.01497v3.
15. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition – CVPR, vol. 1, 2016, pp. 770–778.
16. Jifeng Dai, Yi Li, Kaiming He, Jian Sun, R-FCN: object detection via region-based fully convolutional networks, Advances in Neural Information Processing Systems 29 (2016) 379–387, arXiv:1605.06409v2.
17. Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, Kaiming He, Aggregated residual transformations for deep neural networks, in: IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2017, vol. 1, 2017, pp. 5987–5995, arXiv:1611.05431v2.
18. Tsung-Yi Lin, Piotr Dollar, Ross B. Girshick, Kaiming He, Bharath Hariharan, Serge J. Belongie, Feature pyramid networks for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2017, vol. 1, 2017, pp. 936–944.
19. Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick, Mask R-CNN, in: IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2018, vol. 1, 2018, pp. 1–12, arXiv:1703.06870v3.
20. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar, Focal loss for dense object detection, in: IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2018, vol. 1, 2018, pp. 1–10, arXiv:1708.02002v2.
21. Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, Andrew Y. Ng, Tiled convolutional neural networks, Advances in Neural Information Processing Systems 1 (2010) 1279–1287.
22. Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, J. Leon Zhao, Time series classification using multi-channels deep convolutional neural networks, in: Web-Age Information Management – WAIM 2014, vol. 8485, 2014, pp. 298–310.
23. Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, Rob Fergus, Deconvolutional networks, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition – CVPR 2010, vol. 1, 2010, pp. 2528–2535.
24. Matthew D. Zeiler, Graham W. Taylor, Rob Fergus, Adaptive deconvolutional networks for mid and high level feature learning, in: International Conference on Computer Vision, vol. 1, 2011, pp. 2018–2025.
25. Francesco Visin, Marco Ciccone, Adriana Romero, Kyle Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, Aaron Courville, ReSeg: a recurrent neural network-based model for semantic segmentation, in: Computer Vision and Pattern Recognition – 2016, vol. 1, 2016, pp. 1–8, arXiv:1511.07053v3.
26. Chao Dong, Chen Change Loy, Kaiming He, Xiaoou Tang, Image super-resolution using deep convolutional networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2) (2016) 295–307, https://doi.org/10.1109/TPAMI.2015.2439281.
27. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han, Learning deconvolution network for semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition – CVPR – 2015, vol. 1, 2015, pp. 1520–1528, arXiv:1505.04366v1.
28. Yuting Zhang, Kibok Lee, Honglak Lee, Augmenting supervised neural networks with unsupervised objectives for large-scale image classification, in: International Conference on Machine Learning – ICML 2016, vol. 48, 2016, pp. 612–621, arXiv:1606.06582v1.
29. Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra, Human attention in visual question
37. Dingjun Yu, Hanli Wang, Peiqiu Chen, Zhihua Wei, Mixed pooling for convolutional neural networks, in: International Conference on Rough Sets and Knowledge Technology – RSKT 2014, vol. 8818, 2014, pp. 364–375.
38. Matthew D. Zeiler, Rob Fergus, Stochastic pooling for regularization of deep convolutional neural networks, in: Proceedings of the International Conference on Learning Representations – ICLR 2013, vol. 1, 2013, pp. 1–9, arXiv:1301.3557v1.
39. Oren Rippel, Jasper Snoek, Ryan P. Adams, Spectral representation for convolutional neural networks, Advances in Neural Information Processing Systems 2 (2015) 2449–2457, arXiv:1506.03767.
40. Yunchao Gong, Liwei Wang, Ruiqi Guo, Svetlana Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: Computer Vision and Pattern Recognition – CVPR 2014, vol. 1, 2014, pp. 392–407, arXiv:1403.1840v3.
41. H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, C. Schmid, Aggregating local image descriptors into compact codes, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9) (2012) 1704–1716, https://doi.org/10.1109/TPAMI.2011.235.
42. V. Nair, G.E. Hinton, Rectified linear units improve re-
answering: Do humans and deep networks look at the stricted Boltzmann machines, in: International Conference
same regions?, in: Proceedings on the Conference of Em- on Machine Learning – ICML’10, vol. 1, 2010, pp. 807–814.
pirical Methods in Natural Language Processing – EMNLP 43. Andrew L. Mass, Awni Y. Hannum, Andrew Y. Ng, Rectifier
2016, vol. 1, 2016, pp. 932–937, arXiv:1606.03556v2. nonlinearities improve neural network acoustic models, in:
Proceedings of the International Conference on Machine
30. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva,
Learning, vol. 28, 2013, pp. 1–6.
Antonio Torralba, Learning deep features for discrimina-
44. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun,
tive localization, in: IEEE Conference on Computer Vi- Delving deep into rectifiers: surpassing human-level per-
sion and Pattern Recognition – CVPR 2015, vol. 1, 2015, formance on ImageNet classification, in: Proceedings of
pp. 2921–2929, arXiv:1512.04150v1. the International Conference on Computer Vision – ICCV
31. Fisher Yu, Vladlen Koltun, Multi-scale context aggregation 2015, vol. 1, 2015, pp. 1026–1034, arXiv:1502.01852v1.
by dilated convolutions, Proceedings of the IEEE 1 (2016) 45. Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li, Empirical eval-
1–13, arXiv:1511.07122v3. uation of rectified activations in convolutional network,
32. Tom Sercu, Vaibhava Goel, Dense prediction on sequences in: Proceedings of the International Conference on Ma-
with time-dilated convolutions for speech recognition, in: chine Learning – ICML 2015, vol. 1, 2015, pp. 1–5, arXiv:
Proceedings of the Advances in Neural Information Pro- 1505.00853v2.
cessing System, vol. 1, 2016, pp. 1–5, arXiv:1611.09288v2. 46. Djork-Arne Clevert, Thomas Unterthiner, Sepp Hochreiter,
33. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Fast and accurate deep network learning by exponential
Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, linear units (ELUs), in: Proceedings of the International
Andrew Senior, Koray Kavukcuoglu, Wavenet: a generative Conference on Learning Representations – ICLR 2016,
model for raw audio, Proceedings of the IEEE 1 (2016) vol. 1, 2016, pp. 1–14, arXiv:1511.07289v5.
1–5, arXiv:1609.03499. 47. Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza,
Aaron Courville, Yoshua Bengio, Maxout networks, in:
34. Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron
Proceedings of the International Conference on Machine
van den Oord, Alex Graves, Koray Kavukcuoglu, Neural
Learning – ICML 2013, vol. 28(3), 2013, pp. 1319–1327,
machine translation in linear time, Computation and Lan-
arXiv:1302.4389v4.
guage 1 (2017) 1–9, arXiv:1610.10099v2. 48. Jost Tobias Springenberg, Martin Riedmiller, Improving
35. Aapo Hyvarinen, Urs Koster, Complex cell pooling and the deep neural networks with probabilistic maxout units, in:
statistics of natural images, Network Computation in Neu- Proceedings of the International Conference on Machine
ral Systems 18 (2) (2009) 81–100, https://doi.org/10.1080/ Learning – ICML 2014, vol. 1, 2014, pp. 1–10, arXiv:1312.
09548980701418942. 6116v2.
36. Joan Bruna, Arthur Szlam, Yann LeCun, Signal recovery 49. Edward Kim, Miquel Corte-Real, Zubair Baloch, A deep se-
from pooling representations, in: Proceedings of the Inter- mantic mobile application for thyroid cytopathology, Pro-
national Conference on Machine Learning – ICML 2014, ceedings of the SPIE 9789 (2016) 1–10, https://doi.org/10.
vol. 1, 2014, pp. 307–315, arXiv:1311.4025v3. 1117/12.2216468.
96 Deep Learning and Parallel Computing Environment for Bioengineering Systems

50. A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. 61. C. Payer, D. Stern, H. Bischof, M. Urschler, Regressing
Blau, S. Thrun, Dermatologist-level classification of skin heatmaps for multiple landmark localization using CNNs,
cancer with deep neural networks, Nature 7639 (2017) in: Medical Image Computing and Computer-Assisted In-
115–118, https://doi.org/10.1038/nature21056. tervention – MICCAI 2016, vol. 9901, 2016, pp. 230–238.
51. Ehsan Hosseini-Asl, Georgy Gimel’farb, Ayman El-Baz, 62. Yefeng Zheng, David Liu, Bogdan Georgescu, Hien
Alzheimer’s disease diagnostics by a deeply supervised Nguyen, Dorin Comaniciu, 3D deep learning for effi-
adaptable 3D convolutional network, in: Proceedings of cient and robust landmark detection in volumetric data,
the International Conference on Machine Learning – ICML in: International Conference on Medical Image Comput-
2016, vol. 1, 2016, pp. 1–12, arXiv:1607.00556v1. ing and Computer-Assisted Intervention – MICCAI 2015,
52. J. Kawahara, C.J. Brown, S.P. Miller, B.G. Booth, V. Chau, vol. 9349, 2015, pp. 565–572.
R.E. Grunau, J.G. Zwicker, Hamarneh G. Brainnetcnn, 63. B. Kong, Y. Zhan, M. Shin, T. Denny, S. Zhang, Recognizing
end-diastole and end-systole frames via deep temporal re-
Convolutional neural networks for brain networks; to-
gression network, in: International Conference on Medical
wards predicting neurodevelopment, NeuroImage 146
Image Computing and Computer-Assisted Intervention –
(2017) 1038–1049, https://doi.org/10.1016/j.neuroimage.
MICCAI 2016, vol. 9902, 2016, pp. 264–272.
2016.09.046.
64. S.-C.B. Lo, S.-L.A. Lou, Jyh-Shyan Lin, M.T. Freedman,
53. Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin
M.V. Chien, S.K. Mun, Artificial convolution neural net-
Ghavami, Ester Bonmati, Guotai Wang, Steven Bandula,
work techniques and applications for lung nodule detec-
Caroline M. Moore, Mark Emberton, Sébastien Ourselin, J. tion, IEEE Transactions on Medical Imaging 14 (4) (1995)
Alison Noble, Dean C. Barratt, Tom Vercauteren, Weakly- 711–718, https://doi.org/10.1109/42.476112.
supervised convolutional neural networks for multimodal 65. Adrian Barbu, Le Lu, Holger Roth, Ari Seff, Ronald M. Sum-
image registration, in: Computer Vision and Pattern Recog- mers, An analysis of robust cost functions for CNN in
nition – CVPR 2018, vol. 1, 2018, pp. 1–19. computer-aided diagnosis, Computer Methods in Biome-
54. W. Shen, M. Zhou, F. Yang, C. Yang, J. Tian, Multi-scale chanics and Biomedical Engineering: Imaging & Visual-
convolutional neural networks for lung nodule classifica- ization 6 (11) (2018) 253–258, https://doi.org/10.1080/
tion, in: Proceedings of the Medical Imaging, vol. 24, 2015, 21681163.2016.1138240.
pp. 588–599. 66. A. Teramoto, H. Fujita, O. Yamamuro, T. Tamaki, Auto-
55. Jeremy Kawahara, Ghassan Hamarneh, Multi-resolution- mated detection of pulmonary nodules in PET/CT images:
tract CNN with hybrid pretrained and skin-lesion trained ensemble false-positive reduction using a convolutional
layers, in: International Workshop on Machine Learning in neural network technique, Medical Physics 43 (6) (2016)
Medical Imaging, vol. 1, 2016, pp. 164–171. 2821–2827, https://doi.org/10.1118/1.4948498.
56. X. Gao, S. Lin, T.Y. Wong, Automatic feature learning 67. Qi Dou, Hao Chen, Lequan Yu, Lei Zhao, Jing Qin, Defeng
to grade nuclear cataracts based on deep learning, IEEE Wang, Vincent C.T. Mok, Lin Shi, Pheng-Ann Heng, Auto-
Transactions on Biomedical Engineering 62 (11) (2015) matic detection of cerebral microbleeds from MR images
2693–2701, https://doi.org/10.1109/TBME.2015.2444389. via 3D convolutional neural networks, IEEE Transactions
57. A.A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S.J. on Medical Imaging 35 (5) (2016) 1182–1195, https://
Van Riel, M.M. Wille, M. Nagibullah, C.I. Sanchez, B. doi.org/10.1109/TMI.2016.2528129.
Van Ginneken, Pulmonary nodule detection in CT im- 68. Olaf Ronneberger, Philipp Fischer, Thomas Brox, U-net:
ages: false positive reduction using multi-view convo- Convolutional networks for biomedical image segmenta-
lutional networks, IEEE Transactions on Medical Imag- tion, in: International Conference on Medical Image Com-
puting and Computer-Assisted Intervention – MICCAI
ing 35 (5) (2016) 1160–1169, https://doi.org/10.1109/TMI.
2015, vol. 9351, 2015, pp. 234–241, arXiv:1505.04597v1,
2016.2536809.
2015.
58. D. Nie, H. Zhang, E. Adeli, L. Liu, D. Shen, 3D deep learn-
69. O. Cicek, A. Abdulkadir, S.S. Lienkamp, T. Brox, O. Ron-
ing for multi-modal imaging-guided survival time predic-
neberger, 3D U-net: learning dense volumetric segmenta-
tion of brain tumor patients, in: Medical Image Comput-
tion from sparse annotation, in: International Conference
ing and Computer-Assisted Intervention, vol. 9901, 2016,
on Medical Image Computing and Computer-Assisted In-
pp. 212–220. tervention – MICCAI 2016, vol. 9901, 2016, pp. 424–432.
59. Dong Yang, Shaoting Zhang, Zhennan Yan, Chaowei Tan, 70. Fausto Milletari, Nassir Navab, Seyed-Ahmad Ahmadi V-
Kang Li, Dimitris Metaxas, Automated anatomical land- net, Fully convolutional neural networks for volumetric
mark detection on distal femur surface using convolu- medical image segmentation, in: Computer Vision and
tional neural network, in: International Symposium on Pattern Recognition – CVPR 2016, vol. 1, 2016, pp. 1–11,
Biomedical Imaging – 2015, vol. 1, 2015, pp. 17–21. arXiv:1606.04797v1.
60. Bob D. de Vos, Jelmer M. Wolterink, Pim A. de Jong, Max 71. R. Korez, B. Likar, F. Pernus, T. Vrtovec, Model-based seg-
A. Viergever, Ivana Isgum, 2D image classification for 3D mentation of vertebral bodies from mr images with 3D
anatomy localization: employing deep convolutional neu- CNNs, in: International Conference on Medical Image
ral networks, in: Proceedings of the SPIE Medical Imaging, Computing and Computer-Assisted Intervention – MIC-
vol. 9784, 2016, pp. 1–10. CAI 2016, vol. 9901, 2016, pp. 433–441.
CHAPTER 5 Medical Image Analysis With Deep Neural Networks 97

72. Pim Moeskops, Jelmer M. Wolterink, Bas H.M. van 74. M. Ghafoorian, N. Karssemeijer, T. Heskes, I.W.M. Van
der Velden, Kenneth G.A. Gilhuijs, Tim Leiner, Max A. Uder, F.E. de Leeuw, E. Marchiori, B. van Ginneken, B. Pla-
Viergever, Deep learning for multi-task medical image seg- tel, Non-uniform patch sampling with deep convolutional
mentation in multiple modalities, in: International Con- neural networks for white matter hyperintensity segmenta-
ference on Medical Image Computing and Computer- tion, in: International Symposium on Biomedical Imaging,
Assisted Intervention – MICCAI 2016, vol. 1, 2016, vol. 1, 2016, pp. 1–10.
pp. 1–9. 75. Tom Brosch, Lisa Y.W. Tang, Youngjin Yoo, David K.B. Li,
73. Konstantinos Kamnitsas, Christian Ledig, Virginia F.J. New- Anthony Traboulsee, Roger Tam, Deep 3D convolutional
combe, Joanna P. Simpson, Andrew D. Kane, David K. encoder networks with shortcuts for multiscale feature in-
Menon, Daniel Rueckert, Ben Glocker, Efficient multi-scale tegration applied to multiple sclerosis lesion segmenta-
3D CNN with fully connected CRF for accurate brain tion, IEEE Transactions on Medical Imaging 35 (5) (2016)
lesion segmentation, in: Computer Vision and Pattern 1229–1239, https://doi.org/10.1109/TMI.2016.2528821.
Recognition – CVPR 2017, vol. 36, 2017, pp. 61–78.
CHAPTER 6
Deep Convolutional Neural Network for Image Classification on CUDA Platform
PRALHAD GAVALI, ME • J. SAIRA BANU, PHD

6.1 INTRODUCTION

Image classification is a complex process that may be affected by many factors. Because classification results are the basis for many environmental and socioeconomic applications, scientists and practitioners have made great efforts in developing advanced classification approaches and techniques for improving classification accuracy. Image classification is widely used in basic fields like medicine, education and security. Correct classification has vital importance, especially in medicine. Therefore, improved methods are needed in this field. The proposed deep CNNs are an often-used architecture for deep learning and have been widely used in computer vision and audio recognition. In the literature, different values of the factors used for CNNs are considered. From the results of the experiments on the CIFAR dataset, we argue that the network depth is the first priority for improving the accuracy: increasing depth can not only improve the accuracy, but also achieve the same high accuracy with less complexity compared to increasing the network width.

In order to classify a set of data into different classes or categories, the relationship between the data and the classes into which they are classified must be well understood. Generally, classification is done by a computer, so, to achieve classification by a computer, the computer must be trained. Sufficient accuracy is sometimes never reached with the results obtained, so training is a key to the success of classification. To improve the classification accuracy, inspired by the ImageNet challenge, the proposed work considers classification of multiple images into different categories (classes) with more accuracy, at reduced cost and in a shorter time by applying parallelism using a deep neural network model.

The image classification problem requires determining the category (class) that an image belongs to. The problem is considerably complicated by the growth of the number of categories, if several objects of different classes are present in the image, and if the semantic class hierarchy is of interest, because an image can belong to several categories simultaneously. Fuzzy classes present another difficulty for probabilistic category assignment. Moreover, a combination of different classification approaches has been shown to be helpful for the improvement of classification accuracy [1].

Deep convolutional neural networks provide better results than existing methods in the literature due to advantages such as processing by extracting hidden features, allowing parallel processing and real-time operation. The concept of convolutions in the context of neural networks begins with the idea of layers consisting of neurons with a local receptive field, i.e., neurons which connect to a limited region of the input data and not the whole [4].

In this study, the image classification process is performed using TensorFlow, an open-source Python library, to build our DCNN. We have considered the CIFAR-10 dataset, which contains 60,000 pictures [30]. In the experiments, 6000 and 3000 images, exclusively the cat and dog pictures taken from the CIFAR-10 dataset, were used for training and testing, respectively; they were resized, and histogram equalization operations were performed.

The experimental strategy was executed in Python on Ubuntu. We aimed for the best result in the image processing field. For implementing the operational parallelism, we had some limitations related to the CPU, which is useful only for sequential operations; when it comes to parallel programming, GPUs are more prominent with CUDA [11]. In the past, GPUs worked as graphics accelerators. However, GPU parallel programming has since developed into a powerful, general-purpose and fully programmable parallel data processing approach for operations that require it.

CUDA is NVIDIA's [26] parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit). We perform the proposed method on the Ubuntu 16.04 operating system using an NVIDIA GeForce GTX 680 with 2 GB of memory. In the developed model, GPU technology has been utilized while training.

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00013-0
Copyright © 2019 Elsevier Inc. All rights reserved.

6.2 IMAGE CLASSIFICATION

Image processing is a method for digitizing a scene and performing some operations on it, or for extracting useful information from it. Image classification is a very wide area of image processing. Classification is the process of ensuring that unclassified images are assigned to their class within certain categories [1]. Image classification is a computer vision problem that deals with a lot of basic information from fields such as healthcare, agriculture, meteorology and safety. The human brain can easily classify images, but for a computer this is not easy if the image contains noise. Different methods have been developed to perform the classification operation. Based on the method used, general classification procedures can be divided into two broad categories: supervised classification and unsupervised classification [2].

In supervised classification, the analyst defines homogeneous representations of the information classes in the image. These examples are called training areas. The choice of appropriate training areas is based on the analyst's knowledge of the classes. Thus, the analyst is able to control the classification of certain classes [3]. Unsupervised classification reverses the supervised classification process: programs using clustering algorithms are employed to determine statistical groupings or structures in the data. Generally, the analyst specifies how many groups or clusters are to be searched for in the data. In addition to specifying the required number of classes, the analyst can also determine the separation distance between the clusters and the parameters for the variation within each cluster. Unsupervised classification does not start with a predefined class set. Supervised learning has been extremely successful in learning good visual representations that not only give good results on the task they are trained on, but also transfer to other tasks and data sets [3]. Scientists have developed many methods for solving the image classification problem, and these methods compete to achieve perfection in image classification. ImageNet [4] is an image classification competition; the data to be processed and the number of categories to be classified are increased every year. The competition organized in 2012 has been a milestone in image classification.

A. Krizhevsky et al. [5] achieved the best result of recent times with the approach they developed using convolutional networks. In the years that followed, the vast majority of those who participated in the competition used CNNs; all of the winning entries have been developed using CNNs. This contest proved the success of the CNN in image classification and made its name widely known. CNNs have become powerful models in feature learning and image classification, and researchers who have always aimed at better results have developed many methods using the CNN approach.

Michael et al. [6] point out that how to encode data and invariance properties in deep CNN architectures is still an open issue. In their study, it is proposed to modify the standard convolution block of the CNN so that some additional information is transmitted from layer to layer, while preserving some stability in the network. The fundamental idea is to exploit both the positive and negative highs in the convolution maps. This is accomplished by adjusting the conventional activation step before pooling. Extensive experiments on two established datasets (MNIST and CIFAR-10) demonstrate that the proposed approach performs better than a standard CNN. H. Dou et al. [7] proposed a multi-scale CNN with a depth-decreasing multi-column structure to address the problem of scale variation in an image.

Image processing is now routinely used by a wide range of individuals who have access to digital cameras and computers. With a minimum investment, one can readily enhance contrast, detect edges, quantify intensity, and apply a variety of mathematical operations to images. Although these techniques can be extremely powerful, the average user often digitally manipulates images with abandon, seldom understanding the most basic principles behind the simplest image-processing routines [8]. Although this may be acceptable to some individuals, it often leads to an image that is significantly degraded and does not achieve the results that would be possible with some knowledge of the basic operations of an image-processing system.

One of the main problems in computer vision is the image classification problem, which is concerned with determining the presence of visual structures in an input image. As we know, image classification is a complex process that may be affected by many factors. Because classification results are the basis for many environmental and socioeconomic applications, scientists and practitioners have made great efforts in developing advanced classification approaches and techniques for improving classification accuracy [9].

Classification between objects is an easy task for humans, but it has proved to be a complex problem for machines. The rise of high-capacity computers,
the availability of high-quality and low-priced video cameras, and the increasing need for automatic video analysis have generated an interest in object classification algorithms. A simple classification system consists of a camera fixed high above the zone of interest, where images are captured and subsequently processed. Classification includes image sensors, image preprocessing, object detection, object segmentation, feature extraction and object classification. A classification system contains a database of predefined patterns which are compared with a detected object in order to classify it into a proper category. Image classification is an important and challenging task in various application domains, including biomedical imaging, biometry, video surveillance, vehicle navigation, industrial visual inspection, robot navigation, and remote sensing.

The classification process consists of the following steps:
A. Pre-processing – atmospheric correction, noise removal, image transformation, principal component analysis, etc.;
B. Detection and extraction of an object – detection includes the position and other characteristics of a moving object image obtained by the camera, while extraction estimates the trajectory of the detected object in the image plane;
C. Training – selection of the particular attribute which best describes the pattern;
D. Classification of the object – this step categorizes detected objects into predefined classes by using a suitable method that compares the image patterns with the target patterns.

6.2.1 Image Classification Approach
I. Digital data. An image is captured by using a digital camera or any mobile phone camera.
II. Pre-processing. Improvement of the image data, which includes obtaining a normalized image, enhancing contrast, obtaining a gray-scale image, obtaining a binary image, resizing the image, complementing the binary image, removing noise, and getting the image boundary.
III. Feature extraction. The process of measuring, calculating or detecting the features from the image samples. The two most common types of feature extraction are (i) geometric feature extraction and (ii) color feature extraction.
IV. Selection of training data. Selection of the specific property which best describes the given example, e.g., input image, output image, name of the training dataset after preprocessing.
V. Decision and classification. Categorizes recognized items into predefined classes by using a suitable strategy that compares the image patterns with the target patterns.
VI. Accuracy evaluation. Accuracy assessment is carried out to identify possible sources of error and serves as an indicator in comparisons. See Fig. 6.1.

FIG. 6.1 Steps for image classification.

Categories of image classification. The most common methodologies for image classification can be grouped as supervised and unsupervised, parametric and nonparametric, object-oriented, sub-pixel, per-pixel and per-field or spectral classifiers, contextual and spectral–contextual classifiers, or hard and soft classification [13]:
• Supervised classification – The process of using samples of known informational classes (training sets) to classify pixels of an unknown identity.
• Unsupervised classification – A method which examines many unknown pixels and divides them into a number of classes based on natural groupings present in the image values. A computer determines spectrally separable classes and then defines their information value.
• Parametric classifier – The performance of a parametric classifier depends largely on how well the data match the pre-defined models and on the accuracy of the estimation of the model parameters. Parameters like the mean vector and covariance matrix are used; an example is the maximum likelihood classifier.
• Nonparametric classifier – No assumption is made about the data. Nonparametric classifiers do not make use of statistical parameters to calculate class separation; examples are ANN, SVM and decision tree classifiers.
• Hard classification – Each pixel is required to show membership to a single class.
• Soft classification – Each pixel may exhibit multiple and partial class memberships.

TABLE 6.1
Different image classification techniques.

Neural network
• Benefits: can be used for classification or regression; able to represent Boolean functions (AND, OR, NOT); tolerant of noisy inputs; instances can be classified by more than one output.
• Assumptions and/or limitations: difficult to understand the structure of the algorithm; too many attributes can result in overfitting; the optimal network structure can only be determined by experimentation.

Support vector machine
• Benefits: models nonlinear class boundaries; overfitting is unlikely to occur; computational complexity is reduced to a quadratic optimization problem; easy to control the complexity of the decision rule and the frequency of error.
• Assumptions and/or limitations: training is slow compared to Bayes and decision trees; difficult to determine optimal parameters when the training data is not linearly separable; difficult to understand the structure of the algorithm.

Fuzzy logic
• Benefits: different stochastic relationships can be identified to describe properties.
• Assumptions and/or limitations: prior knowledge is very important to get good results; precise solutions are not obtained if the direction of decision is not clear.

Genetic algorithm
• Benefits: can be used in feature classification and feature selection; primarily used in optimization; always finds a "good" solution (not always the best solution); can handle large, complex, nondifferentiable and multimodal spaces; efficient search method for a complex problem space; good at refining irrelevant and noisy features selected for classification.
• Assumptions and/or limitations: computation or development of the scoring function is nontrivial; not the most efficient method to find some optima, rather than global; complications involved in the representation of training/output data.

6.2.2 Image Classification Techniques
Image classification is a complex procedure which relies on different components. Here, some of the available strategies, issues and additional prospects of image classification are addressed. The primary focus will be on state-of-the-art classification methods which are used for improving classification accuracy. Moreover, some important issues relating to classification performance are also discussed [2]. See Tables 6.1 and 6.2.

Fig. 6.2 shows the performance comparison with recent studies on image classification considering the accuracy of the fuzzy measure, decision tree, as well as support vector machine and artificial neural network methods, based on the results obtained from the literature survey. It is observed that the accuracy rate of the fuzzy measure is lower and that of the artificial neural network is higher, but it does not come close to
the ImageNet challenge. To improve the classification accuracy and achieve competitive ImageNet challenge accuracy, the proposed work considers classification of multiple images into different categories (classes) with more accuracy in classification, reduction in cost and in a shorter time by applying parallelism using a deep neural network model. (http://www.jatit.org/volumes/research-papers/Vol4No11/5Vol4No11.pdf)

TABLE 6.2
Comparative analysis of different image classification techniques.
• Type of approach – artificial neural networks: nonparametric; support vector machines: nonparametric with binary classifier; fuzzy logic: stochastic; genetic algorithm: large time series data.
• Nonlinear decision boundaries – ANN: efficient when the data have only few input variables; SVM: efficient when the data have more input variables; fuzzy logic: depends on prior knowledge for decision boundaries; genetic algorithm: depends on the direction of decision.
• Training speed – ANN: network structure, momentum rate, learning rate, convergence criteria; SVM: training data size, kernel parameter, class separability; fuzzy logic: iterative application of the fuzzy integral; genetic algorithm: refining irrelevant and noise genes.
• Accuracy – ANN: depends on the number of input classes; SVM: depends on selection of the optimal hyperplane; fuzzy logic: selection of the cutting threshold; genetic algorithm: selection of genes.
• General performance – ANN: network structure; SVM: kernel parameter; fuzzy logic: fused fuzzy integral; genetic algorithm: feature selection.

FIG. 6.2 Accuracy comparison of different image classification techniques.

6.2.3 Research Gaps
After surveying the literature on the different classification techniques, we can observe that:
1. The accuracy of a current algorithm for the most part relies upon adequate labeled data.
2. Graph-based learning shows good accuracy, but high computational complexity.
3. The accuracy of an algorithm depends on the sample selection (selection of the most informative unlabeled samples) and consideration of spectral data.

6.2.4 Research Challenge
1. Getting high classification accuracy while considering the computational complexity of designing a semi-supervised classification algorithm is a challenging task.
2. Designing an effective and robust algorithm by exploiting spectral, as well as spatial, information.

3. The sample selection method needs to be carefully chosen (use of active learning) while designing algorithms.

6.2.5 Problem Definition
We strive to classify multiple images into the correct categories (classes) with more accuracy in classification, reduction in cost and in shorter time by applying parallelism.

6.2.6 Objective
1. Training of the system followed by testing. The training process implies taking the characteristic properties of the images (from a class) and framing a unique description for that specific class.
2. The testing step means categorizing the test images into the various classes for which the system was trained. This assignment of a class is done based on the partitioning between classes derived from the training features.

6.3 DEEP CONVOLUTIONAL NEURAL NETWORK
An artificial neural network was derived by mimicking the working mechanism of the human brain: its layered network structure was made by simulating the way nerve cells work. This structure has been used for a long time in many artificial intelligence problems and has achieved great results [1]. The foundations of deep learning include artificial neural networks, which were produced to provide architects with a better model of the brain. In the 1980s, hardware constraints did not permit intensive matrix manipulation, so deep networks could not be trained. In the late 1980s, Hinton and LeCun's backpropagation algorithm [2] attracted the attention of the Canadian Institute for Advanced Research, despite the fact that it did not resonate in the scientific community at the time, and research groups at the institute's affiliated universities were perhaps the reference groups for deep learning on this issue [3]. Deep learning has since been used in many machine learning problems and has achieved successful results. Recent developments in deep learning enable efficient processing of large numbers of images. Many famous architectures were developed, such as Boltzmann machines, restricted Boltzmann machines, autoencoders and convolutional neural networks. Convolutional networks, in particular, achieve great results in image classification. Along with the progress of technology, almost every field has inevitably benefited from it. Image classification is valuable for many key areas such as medicine, safety and education.

Deep learning is a subfield of machine learning which endeavors to learn high-level abstractions in data by utilizing hierarchical architectures. It is a developing methodology and has been broadly applied in traditional artificial intelligence domains, such as semantic parsing [1], transfer learning [2,3], natural language processing [4], computer vision [5,6], and many more. There are mainly three important reasons for the booming of deep learning today: dramatically increased chip processing abilities (e.g., GPU units), significantly lowered cost of computing hardware, and considerable advances in machine learning algorithms [9]. Various deep learning approaches have been broadly reviewed and discussed in recent years [8–12]. Among those, Schmidhuber et al. emphasized the important inspirations and technical contributions in a historical timeline format, while Bengio examined the challenges of deep learning research and proposed a few forward-looking research directions. Deep networks have been shown to be effective for computer vision tasks because they can extract appropriate features while jointly performing discrimination [9,13]. In recent ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions [11], deep learning methods have been widely adopted by different researchers and achieved top accuracy scores [7]. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics. Deep learning discovers intricate structures in large data sets by using the backpropagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shed light on sequential data such as text and speech. The rise of deep learning [10] from its roots to becoming the state-of-the-art of AI has been fueled by three recent trends: the explosion in the amount of training data, the use of accelerators such as graphics processing units (GPUs), and advancements in the design of the models used for training. These three trends have made the task of training deep-layer neural networks with large amounts of data both tractable and useful. Using any of the deep learning frameworks (e.g., Caffe [11], TensorFlow [12], MXNet [7]), users can develop and train their models. Neural network models range in

size from small (5 MB) to very large (500 MB). Training neural networks can take a significant amount of time, and the goal is to find suitable weights for the different variables in the neural network. Once the model training is complete, it can be used for inference: serving and applying the trained model on new data in domains such as natural language processing, speech recognition, or image classification.

6.4 COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. During the past six years, there has been a marked increase in the performance and capabilities of GPUs. A modern GPU is not only a powerful graphics engine, but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU [15]. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. Due to the rapid growth of GPU processing capability, using the GPU as a coprocessor to assist the central processing unit (CPU) in computing massive data becomes essential. Computational scientists have long been interested in GPUs due to their relatively low cost per unit of floating-point (FP) performance. Unlike conventional multiprocessors, GPU processor cores are specialized for program behaviors common to graphics shaders—thousands of independent threads, each comprising only dozens or hundreds of instructions, performing few memory accesses and producing a small number of output values. Recent advances in hardware and programmability have opened GPUs to a broader community of developers. GPUs' throughput-optimized architectural features can outstrip CPU performance on numerical computational workloads, depending on how well the workload matches the computational behavior for which the GPU is designed. An important question for many developers is whether they can map particular applications to these new GPUs to achieve significant performance increases over contemporary multicore processors.

A CUDA program includes at least one stage that is executed on either the host (CPU) or a device, for instance a GPU. The stages that exhibit no data parallelism are executed in host code, while the stages that exhibit a rich amount of data parallelism are executed in device code [4]. A CUDA program is thus a unified source code combining both host and device code. The NVIDIA C compiler (nvcc) separates the two during the compilation process. The host code is straight ANSI C code; it is compiled with the host's standard C compiler and runs as an ordinary CPU process. The device code is written in ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures. The device code is further compiled by nvcc and executed on a GPU device. Under conditions where no device is available, or where a kernel is more appropriately executed on a CPU, one can also execute kernels on a CPU using the emulation features in the CUDA software development kit (SDK) or the MCUDA tool [Stratton 2008]. In the matrix multiplication example, the entire matrix multiplication computation can be implemented as a kernel where each thread is used to compute one element of the output matrix P. The kernel functions (or, simply, kernels) typically generate a large number of threads to exploit data parallelism. In this example, the number of threads used by the kernel is a function of the matrix dimension. For a 1000 × 1000 matrix multiplication, the kernel that uses one thread to compute one element of P would generate 1,000,000 threads when it is invoked. It is worth noting that CUDA threads are of much lighter weight than CPU threads [5]. CUDA programmers can assume that these threads take very few cycles to generate and schedule due to efficient hardware support. This is in contrast with CPU threads, which typically require thousands of clock cycles to generate and schedule. CUDA is an extension to the C language that permits GPU code to be written in ordinary C. The code is either targeted at the host processor (the CPU) or at the device processor (the GPU). The host processor launches multithreaded tasks (or kernels, as they are known in CUDA) onto the GPU device. The GPU has its own internal scheduler that then assigns the kernels to whatever GPU hardware is available. We will look at the relevant details later. Given there is adequate parallelism in the task, as the number of SMs in the GPU grows, so should the speed of the program [8]. This, however, hides a noteworthy issue: you have to ask how much of the code can be run in parallel. The best possible speedup is limited by the proportion of sequential code. Even if the GPU contained an endless amount of processing power and could finish the parallel work in zero time, we would at

present be left to deal with the sequential code part. Thus, we have to consider at the outset whether we can parallelize a substantial portion of the overall workload.

NVIDIA is committed to supporting CUDA. Extensive information, examples, and tools to help with development are available from its site at http://www.nvidia.com under CUDA Zone. CUDA, unlike its predecessors, has now really started to gain momentum and, it seems, a programming language will at last rise as the one of choice for GPU programming. Given that the number of CUDA-enabled GPUs now runs into the millions, there is a gigantic market out there waiting for CUDA-enabled applications [21]. Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA created the CUDA parallel computing architecture that is now shipping in GeForce, ION, Quadro, and Tesla GPUs, representing an enormous installed base for application development [28]. See Fig. 6.3.

Apart from the device DRAM, CUDA supports several additional types of memory that can be used to increase the CGMA (compute to global memory access) ratio for a kernel. Accessing DRAM is slow and expensive. To overcome this problem, several low-capacity, high-bandwidth memories, both on- and off-chip, are present on a CUDA GPU. If some data is used frequently, CUDA caches it in one of these low-latency memories, so the processor does not need to access DRAM every time. Fig. 6.3 illustrates the memory architecture supported by CUDA and typically found on NVIDIA cards. The following are the different types of memory used in CUDA [24]:
A. Local memory. Each SP (streaming processor) uses local memory. All variables declared in a kernel (a function to be executed on the GPU) are saved in local memory.
B. Registers. A kernel may consist of several expressions. During execution of an expression, values are saved into the registers of the SP.
C. Global memory. It is the main memory of the GPU. Whenever GPU memory is allocated for variables using the cudaMalloc() function, it resides in global memory by default.
D. Shared memory. Shared memory is shared by every thread in a block, and is used to reduce latency (memory access delay). When shared

FIG. 6.3 CUDA architecture.
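The thread-per-element decomposition described in Sect. 6.4 for the matrix-multiplication example can be sketched in plain Python (illustrative only; the actual kernel would be written in CUDA C, and the function names here are hypothetical):

```python
# Python sketch of the CUDA work decomposition for matrix multiplication:
# one "thread" computes one element of the output matrix P.

def thread_work(M, N, row, col):
    """The work a single CUDA thread would perform: one dot product."""
    inner = len(N)  # inner dimension: rows of N == columns of M
    return sum(M[row][k] * N[k][col] for k in range(inner))

def matmul(M, N):
    """Conceptually launch one 'thread' per element of P."""
    return [[thread_work(M, N, r, c) for c in range(len(N[0]))]
            for r in range(len(M))]

P = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# For a 1000 x 1000 problem, this loop structure corresponds to the
# 1,000,000 threads the kernel would generate when invoked.
```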



memory is used for a variable, the variable should be prefixed with the keyword __shared__ in its declaration, e.g., __shared__ int x;.
E. Constant memory. Constant memory is also used to reduce latency, but only in those situations when multiple threads have to access the same value. This is how constant memory reduces latency.
F. Texture memory. Texture memory is again used to reduce latency, in one special case. Consider an image: when a kernel accesses a particular pixel, there is a high chance that it will also access the surrounding pixels. Such groups of values which are accessed together are saved in texture memory.

6.5 TENSORFLOW
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as tensor processing units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms [25]. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, so it was released as an open-source project and has become widely used for machine learning research. TensorFlow uses a single dataflow graph to represent all computation and state in a machine learning algorithm, including the individual mathematical operations, the parameters and their update rules, and the input preprocessing [26]. The dataflow graph expresses the communication between subcomputations explicitly, thus making it easy to execute independent computations in parallel and to partition computations across multiple devices. TensorFlow differs from batch dataflow systems in two respects: the model supports multiple concurrent executions on overlapping subgraphs of the overall graph, and individual vertices may have mutable state that can be shared between different executions of the graph. TensorFlow [28] is an interface for expressing machine learning computations, and an implementation for executing such computations. A computation expressed using TensorFlow can be executed with almost no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of several machines and a large number of computational devices such as GPU cards [14]. The framework is flexible and can be used to express a wide variety of computations, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This chapter describes the TensorFlow interface and an implementation of that interface that has been developed at Google [29].

6.6 IMPLEMENTATION
6.6.1 Deep Convolutional Neural Networks
Deep learning is a new area of machine learning research, which has been introduced with the goal of moving machine learning nearer to one of its original objectives.

Deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning for images simply uses more attributes extracted from the image rather than only its signature. However, this is done automatically in the hidden layers and not as input (as is the case in a plain NN) [7], as can be seen in Fig. 6.4. The neurons in the first layer pass input data to the network.

FIG. 6.4 Neural network representation.

Similarly, the last layer is called the output layer. The layers in-between the input and output layers are called

FIG. 6.5 Sample image dataset.
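Max pooling, discussed in this section, can be sketched in a few lines of numpy (an illustrative sketch, assuming a single-channel input whose height and width are divisible by the window size; the function name is hypothetical):

```python
import numpy as np

def max_pool(x, size=2):
    """2D max pooling with non-overlapping size x size windows."""
    h, w = x.shape
    # Split into (h//size, size, w//size, size) blocks, reduce each block.
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [4, 3, 1, 0],
              [2, 2, 9, 5],
              [1, 0, 3, 7]])
y = max_pool(x)  # each 2x2 window collapses to its maximum
```

Replacing `max` with `mean` in the reduction gives average pooling, the other sampling method mentioned later in this chapter.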

hidden layers. In this example, the network has only one hidden layer, shown in blue [13]. Networks which have many hidden layers tend to be more accurate and are called deep networks; hence, machine learning algorithms which use these deep networks are called deep learning.

A typical convolutional network [9] is a sequence of convolution and pooling pairs, followed by a few fully connected layers. A convolution is like a small neural network that is applied repeatedly, once at each location of its input. As a result, the network layers become much smaller but increase in depth. Pooling is the operation that usually decreases the size of the input image. Max pooling is the most common pooling algorithm, and has proven to be effective in many computer vision tasks.

Suppose we try to teach a computer to recognize images and classify them into one of these 10 categories [22]; see Fig. 6.5. To do so, we first need to teach the computer what a cat, a dog, a bird, etc., look like before it has the capacity to recognize a new picture. The more cats the computer sees, the better it gets at recognizing cats [30]. This is known as supervised learning: we carry out this task by labeling the images, so that the computer will start recognizing patterns present in cat images that are absent from other images and will start building its own perception. We will use Python and TensorFlow [28] to write the program. TensorFlow is an open source deep learning framework created by Google that gives engineers granular control over every neuron (known as a "node" in TensorFlow), so they can adjust the weights and achieve optimal execu-

tion. TensorFlow has numerous developed libraries (a few of which shall be utilized for image classification here) and an amazing community, so one only needs to find an open source implementation of basically any deep learning topic.

6.6.2 Dataset
In this work we have chosen to use the CIFAR-10 dataset, which comprises 60,000 images of size 32 × 32 pixels. The dataset contains 10 classes that are mutually exclusive (do not overlap), with each class containing 6000 images [22]. The images are small, clearly labeled and noise-free, which makes the dataset perfect for this task with considerably less preprocessing. A few images taken from the dataset are presented in Fig. 6.6.

I. CIFAR-100 dataset. This dataset is much the same as CIFAR-10 [30], with the exception that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are gathered into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). In the following we describe the class and superclass structure of the CIFAR-100 dataset. Every superclass contains five classes. Where the name of a class is plural, the labelers were told not to dismiss images in which numerous instances of the object show up. See Table 6.3.

6.6.3 Implementing an Image Classifier
Since the project specification did not present a concrete problem to solve, a problem first needs to be found and decided on. The problem needs to be not too complex, while still being not too trivial. It requires implementing some kind of image classifier; what kind of images, however, and how specific the image classification, e.g., classifying different objects or, more specifically, classifying different types of the same object, remains part of the implementation [15]. A sufficient amount of data of sufficient quality is also needed to be able to train and test the image classifier; what "sufficient" means in numerical terms and in practice may fall on the implementation chapter as well. The first task in this part of the project is therefore to find and look at different sources of data sets, then decide on a problem of reasonable complexity where a sufficient amount of data can be acquired for training and testing the image classifier. After that, the data, of course, needs to be downloaded [16].

The second task is to begin implementing the image classifier. The image classifier will be implemented in both Microsoft CNTK and Google TensorFlow, using TensorFlow as back end, with Keras, a third party API for

FIG. 6.6 Sample images from CIFAR dataset.
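The CIFAR batches are distributed as Python pickle files; a minimal loader can be sketched as follows (the b'data'/b'labels' keys follow the dataset's published format, and the path shown in the comment is illustrative):

```python
import pickle
import numpy as np

def load_batch(path):
    """Load one CIFAR batch file: N rows of 3072 pixel values, N labels."""
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    data = np.asarray(batch[b"data"], dtype=np.uint8)  # shape (N, 3072)
    labels = np.asarray(batch[b"labels"])              # shape (N,)
    return data, labels

# Usage (illustrative path):
#   data, labels = load_batch("cifar-10-batches-py/data_batch_1")
```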



deep learning, as front end. Keras is usable as a front end to TensorFlow [29] today; the process of adding Keras to the TensorFlow core is ongoing as of January 2019 [25], and Keras will probably be able to use CNTK as back end at some point in the future as well [24]. All the different models of the image classifier in the different frameworks will be implemented and developed in the same programming language and development environment, to make the models more comparable. The programming language that will be used for the implementation is Python 3 [32], and the development environment that will be used is Microsoft Visual Studio 2015 Enterprise with the Python Tools for Visual Studio plug-in [31] installed, in addition to using the same programming language and IDE [27].

TABLE 6.3
List of classes in CIFAR-100.

Superclass: Classes
aquatic mammals: beaver, dolphin, otter, seal, whale
fish: aquarium fish, flatfish, ray, shark, trout
flowers: orchids, poppies, roses, sunflowers, tulips
food containers: bottles, bowls, cans, cups, plates
fruit and vegetables: apples, mushrooms, oranges, pears, sweet peppers
household electrical devices: clock, computer keyboard, lamp, telephone, television
household furniture: bed, chair, couch, table, wardrobe
insects: bee, beetle, butterfly, caterpillar, cockroach
large carnivores: bear, leopard, lion, tiger, wolf
large man-made outdoor things: bridge, castle, house, road, skyscraper
large natural outdoor scenes: cloud, forest, mountain, plain, sea
large omnivores and herbivores: camel, cattle, chimpanzee, elephant, kangaroo
medium-sized mammals: fox, porcupine, possum, raccoon, skunk
non-insect invertebrates: crab, lobster, snail, spider, worm
people: baby, boy, girl, man, woman
reptiles: crocodile, dinosaur, lizard, snake, turtle
small mammals: hamster, mouse, rabbit, shrew, squirrel
trees: maple, oak, palm, pine, willow
vehicles 1: bicycle, bus, motorcycle, pickup truck, train
vehicles 2: lawn-mower, rocket, streetcar, tank, tractor

6.6.4 Installation and System Requirements
In order to gather the necessary information to evaluate the system requirements, software and hardware support, and programming language support, TensorFlow's, Keras' and CNTK's documentation was studied. In order to evaluate the ease and speed of installation, the frameworks were downloaded and installed. This is the more subjective part of the evaluation. The aspects that the conclusions are based on are the number of steps required to be able to use the framework, and the perceived ease of following the aforementioned steps. Written below are the steps the authors used to install each respective framework. First, the development environment needed to be set up. Since the development is to be done in Python, Python 3.5.3 was downloaded and installed from Python's homepage https://www.python.org/. The IDE used was Microsoft Visual Studio 2015 Enterprise [34], which was already installed on the computers used in this study. To be able to use Python in Visual Studio, the Python Tools for Visual Studio (PTVS) extension [33] needed to be installed. To install PTVS, the Visual Studio installation was modified through the Windows Control Panel with the added PTVS extension. Google TensorFlow [29] was downloaded and installed through Visual Studio, with PTVS, using the built-in tool pip. To be able to use the GPU, TensorFlow's GPU version 0.12.1 was installed; pip handled and installed all Python related dependencies [21]. When using TensorFlow's GPU version, two additional downloads were required: the NVIDIA CUDA Toolkit 8.0 and the NVIDIA cuDNN v5.1 (CUDA Deep Neural Network) library, which were downloaded from https://developer.nvidia.com/cuda-downloads and https://developer.nvidia.com/cudnn, respectively. The cuDNN dll-file was placed in the CUDA folder created during installation. Keras was downloaded and installed through Visual Studio, with PTVS, using the built-in tool pip. The version installed was 1.2.2. Pip handled and installed all Python related dependencies; however, the scipy and numpy versions installed through pip were wrong,

and needed to be downloaded and installed manually. The correct versions of scipy and numpy needed by Keras were downloaded from http://www.lfd.uci.edu/~gohlke/pythonlibs/. The downloaded whl-files of the correct versions of scipy and numpy were installed using pip through the Windows Command Prompt [26].

Now we discuss the mapping mechanism between the CUDA programming model and the deep CNN on the GPU in detail.

A. Mapping implementation
The training procedure and the mapping relationship between deep CNN layers and CUDA kernels are shown in Fig. 6.7. Two CUDA kernel functions are proposed to be designed for the forward and backward propagation of each of the layers, respectively. At the start of each cycle, the forward convolution kernel loads a sample from global memory according to its index, which is the record provided by the CPU; this way training data are fed in continuously. On the CPU side, network outputs need to be copied back to calculate the current value of the loss function after a forward pass, while the learning rate needs to be adjusted after a backward pass. By comparing the present loss with the predefined minimum loss value, a decision whether to jump out of the cycle is made by the CPU at the finish of each cycle.

Convolutional layer. Instead of directly using the 2D convolution routine proposed by Caffe, here we decompose the convolutional operations on the data into fine-grained tasks and then map them to threads on the GPU.

Pooling layer. In this layer, according to a predefined group size, components in a convolutional result array are divided into different groups. The main objective is finding the max value or calculating the mean value of each group, depending on the chosen sampling method.

Full connection layer. The forward propagation and backward propagation on a single neuron in a full connection layer are shown in Fig. 6.7. In a forward pass, a dot product z between an input vector x and a weight vector ω needs to be computed, and the final output of the neuron is calculated as a = f(z). In the backward pass, the gradient δ_z^l is calculated as δ_z^l = δ_a^l · da/dz, where δ_a^l is the dot product between the outgoing weight vector ω and the gradient vector δ^(l+1) of the next layer.

Output layer. The mapping mechanism of output layers to CUDA kernels is similar to that of a full connection layer, in general. It likewise maps the operations on the same neuron to one thread when applying the

FIG. 6.7 DCNN training on GPU. Green parts are CUDA kernel functions executed on GPU, blue parts are
operations executed on CPU.
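The per-neuron forward and backward arithmetic described above (forward: z = x · ω, a = f(z); backward: δ_z = δ_a · da/dz) can be sketched in numpy. The sigmoid activation used here is an assumption made for illustration; the chapter does not fix f:

```python
import numpy as np

def f(z):
    """Assumed activation for illustration: sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w):
    """One neuron's forward pass: a dot product, then the activation."""
    z = np.dot(x, w)
    return z, f(z)

def backward(a, w_out, delta_next):
    """One neuron's backward pass.

    delta_a is the dot product of the outgoing weights with the next
    layer's deltas; delta_z follows by the chain rule, using the sigmoid
    derivative da/dz = a * (1 - a).
    """
    delta_a = np.dot(w_out, delta_next)
    return delta_a * a * (1.0 - a)

z, a = forward(np.array([0.5, -0.2]), np.array([0.8, 0.4]))  # z = 0.32
delta_z = backward(a, np.array([1.0, -1.0]), np.array([0.1, 0.3]))
```

On the GPU, each such neuron's computation would be assigned to one CUDA thread, as described in the mapping above.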

softmax function as the activation function, considering that the communications among threads become more frequent under this condition.

6.6.5 Algorithms
Gradient descent is the dominant method used to train deep learning models. The proposed work contains the following two algorithms, which are used for training:
1. Stochastic gradient descent
2. Mini-batch gradient descent
All classifiers are trained using stochastic gradient descent with one of three loss functions: perceptron, hinge, or logistic. For each label, a binary classifier is trained, and an image is classified with the label corresponding to the largest "score". The parameters of gradient descent include the number of training iterations and the learning step. Finally, the mini-batch gradient descent (MBGD) algorithm is improved with the compute unified device architecture (CUDA) multi-streaming technique, which further speeds up network training in the GCN framework on the CIFAR-10 dataset compared to the state-of-the-art framework TensorFlow [17].

6.7 RESULT ANALYSIS
6.7.1 Neural Networks in TensorFlow
The graph containing a neural network (see Fig. 6.8) should contain the following elements:
1. The input datasets: the training dataset and labels, the test dataset and labels (and the validation dataset and labels).
2. The test and validation datasets can be placed inside a tf.constant(), and the training dataset is placed in a tf.placeholder() so that it can be fed in batches during training (stochastic gradient descent).
3. The neural network model with all of its layers. This can be a simple fully connected neural network consisting of only 1 layer, or a more complicated neural network consisting of 5, 9, 16, etc., layers.
4. The weight matrices and bias vectors of proper size, initialized to their initial values. (One weight matrix and bias vector per layer.)
5. The loss value: the model has as output the logit vector (estimated training labels), and by comparing the logits with the actual labels we can calculate the loss value (with the softmax with cross-entropy function). The loss value is an indication of how close the estimated training labels are to the actual training labels, and it will be used to update the weight values.
6. An optimizer, which will use the calculated loss value to update the weights and biases with backpropagation.

6.7.2 Understanding the Original Image Dataset
The original one-batch data is a 10000 × 3072 matrix expressed as a numpy array. The number of rows, 10000, indicates the number of sample images. As stated in the CIFAR-10/CIFAR-100 dataset description, each row vector (of size 3072) represents a color image of 32 × 32 pixels. Since this project is going to use a CNN for the classification tasks, the original row vector is not appropriate. In order to feed image data into a CNN model, the dimension of the input tensor should be either (width × height × num_channel) or (num_channel × width × height). It depends on the choice; we select the first one because it is the default choice in TensorFlow's CNN operations [20]. How do we reshape into such a form? The row vector for an image has the exact same number of elements, since 32 · 32 · 3 = 3072. In order to reshape the row vector into (width × height × num_channel) form, two steps are required. The first step is to use the reshape function, and the second step is to use the transpose function in numpy. By definition

FIG. 6.8 Working of a neural network in TensorFlow.


CHAPTER 6 Deep Convolutional Neural Network for Image Classification on CUDA Platform 113

FIG. 6.9 Reshape and transpose.

By definition from the numpy official website, reshape transforms an array to a new shape without changing its data. Here, the phrase "without changing its data" is important, since we don't want to damage the data [21]. The reshape operation should be delivered in three more detailed steps. The following describes the logic behind it:
1. Divide the row vector into 3 pieces, where each piece means a color channel. The resulting array is a 3 × 1024 matrix, which makes 10,000 × 3 × 1024 tensors in total.
2. Divide each of the 3 pieces further by 32, which is the width and height of an image. This results in 3 × 32 × 32, making 10,000 × 3 × 32 × 32 tensors in total. In order to realize this logical concept in numpy, reshape should be called with the following arguments: (10,000, 3, 32, 32). As we have noticed, the reshape function doesn't automatically divide further when the third value (32, width) is provided. We need to explicitly specify the value for the last dimension (32, height). See Fig. 6.9.
This is not the end of the story yet. Now, one image datum is represented in (num_channel, width, height) form. However, this is not the shape TensorFlow and matplotlib are expecting. They expect a different shape, namely (width, height, num_channel), instead. So it is required to swap the order of the axes, and this is where transpose comes in. The transpose function can take a list of axes, where each value specifies the index of a dimension we want to move. For example, calling transpose with argument (1, 2, 0) on a numpy array of shape (num_channel, width, height) will return a new numpy array of shape (width, height, num_channel). See Fig. 6.10.

6.7.3 Understanding the Original Labels
The label data is just a list of 10,000 numbers ranging from 0 to 9, which correspond to each of the 10 classes in CIFAR-10:
airplane : 0
automobile : 1
bird : 2
cat : 3
deer : 4
dog : 5
frog : 6
horse : 7
ship : 8
truck : 9
Code 1 defines a function that returns a handy list of image categories. This function will be used in the production phase. Because the predicted output is a number, it should be converted to a string so that humans can read it. See Fig. 6.11.
The display_stats function defined below answers some questions about a given batch of data [23].


FIG. 6.10 Code 1: reshape and transpose.
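The reshape-and-transpose operation of Code 1 (Fig. 6.10) can be sketched as follows; the two-row batch here is synthetic stand-in data rather than a real CIFAR-10 file:

```python
import numpy as np

# A synthetic stand-in "batch": 2 rows, each a 3072-element row vector laid
# out as 1024 red, then 1024 green, then 1024 blue values (CIFAR-10 order).
batch = np.arange(2 * 3072).reshape(2, 3072)

# Step 1: reshape into (num_samples, num_channel, width, height).
images = batch.reshape(2, 3, 32, 32)

# Step 2: swap axes to (num_samples, width, height, num_channel),
# the layout TensorFlow and matplotlib expect.
images = images.transpose(0, 2, 3, 1)
print(images.shape)  # (2, 32, 32, 3)
```

After the transpose, the three channel values of the first pixel of the first image sit together along the last axis.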

FIG. 6.11 Code 2: label names.
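Code 2 (Fig. 6.11) returns the label-name list; a minimal version might look like the following (the function name load_label_names is illustrative):

```python
def load_label_names():
    """Return the ten CIFAR-10 category names, indexed by label number."""
    return ['airplane', 'automobile', 'bird', 'cat', 'deer',
            'dog', 'frog', 'horse', 'ship', 'truck']

# Convert a predicted label (a number) into a human-readable string.
print(load_label_names()[6])  # frog
```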

FIG. 6.12 Code 3: showing a sample image in a batch.

“What are all possible labels?” “What is the range of machine learning algorithms the best chance on the
values in the image data?” “Are the labels in order or problem.
random?” See Fig. 6.12. Normalize. The normalize function takes data, x,
We have tried the third batch and its 7000 images. and returns it as a normalized numpy array. Data x
As the result, Fig. 6.13 shows that the numbers of image can be anything, and it can be an N -dimensional array.
data for each class are about the same. In this chapter, it will be 3D array for an image. The
min–max normalization (y = (x − min)/(max − min))
6.7.4 Implementing Pre-process Functions
technique is used, but there are other options, too. By
One can probably notice that some frameworks/li- applying min–max normalization, the original image
braries like TensorFlow, numpy, or Scikit-learn provide
data is going to be transformed in the range from 0
similar functions, which we are going to build. For get-
to 1 (inclusive). A simple answer to why normalization
ting the best accuracy of machine learning algorithms
on the datasets, some machine learning algorithms re- should be performed is somewhat related to activation
quire the information to be in an explicit form, whereas functions. See Fig. 6.14.
other algorithms can perform better if the information For example, a sigmoid activation function takes an
is set up with a certain goal, but not always. At last, the input value and outputs a new value ranging from 0 to
raw data may not be in the required format to best ex- 1. When the input value is somewhat large, the output
pose the underlying structure and relationships to the value easily reaches the max value of 1. Similarly, when
predicted variables. It is important to prepare the avail- the input value is somewhat small, the output value eas-
able data in such a way that it gives various different ily reaches the max value of 0. See Fig. 6.15.

For another example, the ReLU activation function takes an input value and outputs a new value ranging from 0 to infinity. When the input value is somewhat large, the output value increases linearly. However, when the input value is small (negative), the output is clipped to the minimum value of 0. See Fig. 6.16.
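The saturation behavior described above is easy to check numerically; this small sketch uses hand-rolled sigmoid and ReLU functions:

```python
import math

def sigmoid(z):
    # Maps any real input into (0, 1); saturates for large |z|.
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Clips negative inputs to 0, passes positive inputs unchanged.
    return max(0.0, z)

# Raw pixel values (0-255) saturate the sigmoid and produce large ReLU
# outputs, while values normalized to [0, 1] stay well behaved.
for z in (255.0, 1.0, 0.5, -3.0):
    print(z, sigmoid(z), relu(z))
```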

FIG. 6.15 Sigmoid function.

FIG. 6.13 Showing sample image in batch.

FIG. 6.16 ReLU function.

FIG. 6.14 Code 4: min–max normalization function.
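The min–max normalization function of Code 4 (Fig. 6.14) can be sketched as follows (a numpy-based version; the sample 2 × 2 image is made up):

```python
import numpy as np

def normalize(x):
    """Scale values into [0, 1] with min-max normalization."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.min()) / (x.max() - x.min())

# CIFAR-10 pixel values originally range from 0 to 255.
image = [[0, 64], [128, 255]]
print(normalize(image))
```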



FIG. 6.17 One hot encoding process.

Now, when thinking about the image data, all values originally range from 0 to 255. It appears that when this data is passed to the sigmoid function, the output is almost always 1, and when it is passed into the ReLU function, the output can be very large. When the backpropagation process is performed to optimize the network, this can lead to exploding/vanishing gradient problems. In order to avoid the issue, it is better to keep all values around 0 and 1.

6.7.5 Output of the Model
For now, what we need to know is the output of the model. It is a set of probabilities for each image class, based on the model's prediction result. In order to express those probabilities in code, a vector with the same number of elements as the number of image classes is needed. For instance, CIFAR-10 provides 10 different image classes, so we need a vector of size 10 as well. See Fig. 6.17.
Also, our model should be able to compare the prediction with the ground truth label. This means the shape of the label data should also be transformed into a vector of size 10. Because the label is the ground truth, we set the value 1 for the corresponding element [21]. The one_hot_encode function takes the input, x, which is a list of labels (ground truth). The total number of elements in the list is the total number of samples in a batch. The one_hot_encode function returns a two-dimensional tensor, where the number of rows is the size of the batch, and the number of columns is the number of image classes. See Fig. 6.18.
Process all the data and save it.
Code 6 below uses the previously implemented functions, normalize and one_hot_encode, to preprocess the given dataset. As depicted in Fig. 6.19, 10% of the data from every batch will be combined to form the validation dataset. The remaining 90% of the data is used as the training dataset. Lastly, there is a testing dataset that is already provided. The code cell below will preprocess all the CIFAR-10 data and save it to external files [22].

import pickle
import numpy as np

def _preprocess_and_save(normalize, one_hot_encode, features, labels, filename):
    features = normalize(features)
    labels = one_hot_encode(labels)
    pickle.dump((features, labels), open(filename, 'wb'))

def preprocess_and_save_data(cifar10_dataset_folder_path, normalize, one_hot_encode):
    n_batches = 5
    valid_features = []
    valid_labels = []
    for batch_i in range(1, n_batches + 1):
        features, labels = load_cfar10_batch(cifar10_dataset_folder_path, batch_i)

FIG. 6.18 Code 5: one hot encoding process.
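The one-hot encoding of Code 5 (Fig. 6.18) might be implemented as follows (a numpy sketch assuming integer labels 0–9):

```python
import numpy as np

def one_hot_encode(x, n_classes=10):
    """Return a (batch_size, n_classes) matrix with a single 1 per row,
    placed at each sample's label index."""
    x = np.asarray(x)
    encoded = np.zeros((len(x), n_classes))
    encoded[np.arange(len(x)), x] = 1
    return encoded

print(one_hot_encode([6, 0, 9]).shape)  # (3, 10)
```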

FIG. 6.19 Train/validate/test data.

        # find the index to use as the split point for validation data
        # within the whole batch (10%)
        index_of_validation = int(len(features) * 0.1)

        # preprocess 90% of the whole batch:
        # - normalize the features
        # - one_hot_encode the labels
        # - save in a new file named "preprocess_batch_" + batch_number,
        #   one file for each batch
        _preprocess_and_save(normalize, one_hot_encode,
                             features[:-index_of_validation],
                             labels[:-index_of_validation],
                             'preprocess_batch_' + str(batch_i) + '.p')

        # unlike the training dataset, the validation dataset is accumulated
        # across all batches:
        # - take 10% of the whole batch
        # - add it to the lists valid_features and valid_labels
        valid_features.extend(features[-index_of_validation:])
        valid_labels.extend(labels[-index_of_validation:])

    # preprocess the stacked validation dataset
    _preprocess_and_save(normalize, one_hot_encode,
                         np.array(valid_features), np.array(valid_labels),
                         'preprocess_validation.p')

    # load the test dataset
    with open(cifar10_dataset_folder_path + '/test_batch', mode='rb') as file:
        batch = pickle.load(file, encoding='latin1')

    # preprocess the testing data
    test_features = batch['data'].reshape((len(batch['data']), 3, 32, 32)).transpose(0, 2, 3, 1)
    test_labels = batch['labels']

    # preprocess and save all testing data

    _preprocess_and_save(normalize, one_hot_encode,
                         np.array(test_features), np.array(test_labels),
                         'preprocess_training.p')

Code 6: Overall process code

6.7.6 Training a Model Using Multiple GPU Cards/CUDA
Modern workstations may contain multiple GPUs for scientific computation. TensorFlow can use this environment to run the training task concurrently across several cards. Training a model in a parallel, distributed fashion requires coordinating the training processes. In what follows, the term model replica denotes one copy of the model being trained on a subset of the data. Naively employing asynchronous updates of model parameters leads to suboptimal training performance, because an individual model replica may be trained on a stale copy of the model parameters. On the other hand, employing fully synchronous updates will be as slow as the slowest model replica [18].
On a workstation with multiple GPU cards, each GPU has comparable speed and contains enough memory to run an entire CIFAR-10 model. Therefore, we design our training system in the following way:
1. Place an individual model replica on each GPU.
2. Update model parameters synchronously by waiting for all GPUs to finish processing a batch of data.
A diagram of this model is presented in Fig. 6.20.

FIG. 6.20 CUDA based working model.

Note that each GPU computes inference as well as the gradients for a unique batch of data. This setup effectively permits dividing a larger batch of data across the GPUs. It requires that all GPUs share the model parameters. A well-known fact is that transferring data to and from GPUs is quite slow [19]. Therefore, we choose to store and update all model parameters on the CPU (see the green box). A fresh set of model parameters is transferred to each GPU when a new batch of data is to be processed by all GPUs. The GPUs are synchronized in operation: all gradients are accumulated from the GPUs and averaged (see the green box), and the model parameters are updated with the gradients averaged across every model replica. The variation in training time is relatively low in both frameworks, although the variation is slightly higher using Keras with TensorFlow as back end; the last run on CIFAR-10 using Keras with TensorFlow as back end especially stands out, being 30 seconds apart from its nearest neighbor, see Table 6.4. Interestingly, the first epoch was consistently the epoch that took the most time to finish, see the Maximum epoch time column in the table. After some testing, and after the results presented in the tables were compiled, we came to the conclusion that the first epochs took more time because we ran the scripts with debugging on in Visual Studio. When we ran the scripts without debugging, the first epochs took approximately the same time as the rest of the epochs. The train_neural_network function runs the optimization task on a given batch. Because the CIFAR-10 dataset comes with 5 separate batches, and each batch contains different image data, train_neural_network should be run over every batch. This can be done with simple code, as shown in Code 7. It runs the training over 10 epochs for every batch, see Fig. 6.21, and Fig. 6.22 shows the training results.

TABLE 6.4
Results of Keras/TensorFlow CIFAR-10.
Run (no)   Total time (s)   Mean epoch time (s)   Maximum epoch time (s)
1          3733             93.4                  98
2          3746             93.6                  97
3          3757             94.1                  99
4          3759             94.0                  101
5          3701             92.4                  97
1-5        18692            93.5                  100
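The synchronous scheme described above rests on a simple identity: averaging the gradients computed on equal-sized slices of a batch equals the gradient of the full batch. A toy numpy check, with two simulated "GPUs" and a made-up linear model (not the chapter's network):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # one large batch of data
y = rng.normal(size=8)
w = rng.normal(size=4)        # shared parameters (kept on the CPU)

def grad(xs, ys):
    # Mean-squared-error gradient of the linear model xs @ w.
    return 2 * xs.T @ (xs @ w - ys) / len(xs)

# Each simulated GPU gets half the batch; the "all-reduce" step averages
# the per-replica gradients before the shared parameters are updated.
per_gpu = [grad(xs, ys) for xs, ys in zip(np.split(X, 2), np.split(y, 2))]
avg = np.mean(per_gpu, axis=0)

print(np.allclose(avg, grad(X, y)))  # True
```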

FIG. 6.21 Training over 10 epochs.
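Code 7's loop over epochs and the five CIFAR-10 batches can be sketched schematically (train_neural_network is stubbed out here; in the chapter it runs one TensorFlow optimization pass per call):

```python
# Schematic of Code 7: loop over epochs and the five CIFAR-10 batches,
# running the optimizer on each batch in turn.
def train_neural_network(batch_i, epoch):
    # Stub standing in for the real optimization step.
    return f"optimized batch {batch_i} in epoch {epoch}"

epochs, n_batches, log = 10, 5, []
for epoch in range(epochs):
    for batch_i in range(1, n_batches + 1):
        log.append(train_neural_network(batch_i, epoch))

print(len(log))  # 50
```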

The lower the loss, the better a model (unless the model has over-fitted the training data). The loss is computed on the training and validation sets, and its interpretation is how well the model is doing on these two sets. Unlike accuracy, the loss is not a percentage. It is a sum of the errors made for each example in the training or validation sets. In the case of neural networks, the loss is typically the negative log-likelihood for classification and the residual sum of squares for regression, respectively. Naturally, then, the main objective in a learning model is to reduce (minimize) the value of the loss function with respect to the model's parameters, by changing the weight vector values through various optimization techniques, for example, backpropagation in neural networks. Cross-validation is a method used to assess machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is picked, it may be used in place of k in references to the model; for example, k = 10 implies 10-fold cross-validation.
In the k-fold cross-validation method, the dataset is divided into k subsets, and the method is repeated k times. Each time, one of the k subsets is used as the test set and the remaining k − 1 subsets are put together to form a training set. The average error across all k trials is then calculated. The advantage of this method is that every data point appears in a test set exactly once, and appears in a training set k − 1 times. The variance of the resulting estimate is reduced as k becomes larger. See Fig. 6.22.
The accuracy of a model is usually determined after the model parameters are learned and fixed, and no more learning takes place. Then the test samples are fed to the model, the number of mistakes (zero–one losses) the model makes is recorded after comparison with the true targets, and the percentage of misclassifications is computed. See Fig. 6.23.
After getting the trained model, it is very easy to predict whether images from the test dataset are cats or dogs, etc. (with a predicted probability for each). The intuition behind this is that even if the test image is not easy to predict, the transformations change it so that the model has a higher chance of capturing the dog/cat shape and predicting accordingly. A few of the images have been misclassified due to poor contrast, rectangular rather than square images, or because the dog/cat occupies a very small portion of the image. Take a look, for example, at the rectangular image of a dog. When the model tries to predict for this image, it sees just the center of the image (cropping by default is central). Thus, it cannot tell whether the image is of a dog or a cat. From the above we observed that our machine has predicted the image of the puppy and classified it as a dog, and similarly for the car, classifying it as an automobile. A learning curve is a plot of the training and test losses as a function of the number of iterations. These plots are very useful to visualize
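The k-fold procedure described above can be sketched with a hand-rolled splitter (an illustrative numpy version, not the chapter's code):

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold serves as test set once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 10 samples, k = 5: every sample lands in a test set exactly once and
# in a training set k - 1 = 4 times.
splits = list(kfold_indices(10, 5))
print(len(splits))  # 5
```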
FIG. 6.22 Training result with loss and accuracy.

the train/validation losses and validation accuracy. The learning curve of the model is shown in Fig. 6.24. As seen in the figure, the accuracy rate remains constant after 3000 iterations. After training, the technique finished with 90% accuracy in the test phase using untagged data from the database. It is to be expected that the classification accuracy achieved on our modest amount of data will increase when the quantity of data is increased. For instance, if the number of test samples is 1000 and the model classifies 952 of those correctly, then the model's accuracy is 95.2%. See Fig. 6.25.

6.8 CONCLUSIONS
The present study investigated whether deep learning could be applied to the classification of images in the CIFAR-10 database. Deep learning technologies are becoming more accessible for corporations and individuals and demonstrate better results than conventional neural networks. In this study, the image classification process was performed using TensorFlow with Keras, an open source programming library in Python, to construct our DCNN. For executing operational parallelism there were a few restrictions regarding the CPU: the CPU is helpful only for sequential operations. When it comes to parallel programming, however, GPUs are more useful together with CUDA. Previously, GPUs served merely as graphics accelerators. Experimental results have demonstrated that the proposed technique substantially outperforms the other state-of-the-art supervised and semi-supervised methods on the CIFAR datasets. By making use of the abundant unlabeled pixels of hyperspectral images, a future semi-supervised learning technique will be able to train deep neural networks adequately.

FIG. 6.23 Prediction result.

FIG. 6.24 Learning curve of the model.

FIG. 6.25 Model accuracy and loss.

REFERENCES
1. S.J. Lee, T. Chen, L. Yu, C.H. Lai, Image classification based on the boost convolutional neural network, IEEE Access 6 (2018) 12755–12768.
2. G. Wang, W. Li, M.A. Zuluaga, R. Pratt, P.A. Patel, M. Aertsen, T. Doel, A.L. David, J. Deprest, S. Ourselin, T. Vercauteren, Interactive medical image segmentation using deep learning with image-specific fine-tuning, IEEE Transactions on Medical Imaging (2018).
3. J. Ker, L. Wang, J. Rao, T. Lim, Deep learning applications in medical image analysis, IEEE Access 6 (2018) 9375–9389.
4. H. Wu, S. Prasad, Semi-supervised deep learning using pseudo labels for hyperspectral image classification, IEEE Transactions on Image Processing 27 (3) (2018) 1259–1270.
5. E. Engil, A. Çinar, Z. Güler, A GPU-based convolutional neural network approach for image classification, in: Artificial Intelligence and Data Processing Symposium (IDAP), 2017 International, IEEE, 2017, pp. 1–6.
6. W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a comprehensive review, Neural Computation 29 (9) (2017) 2352–2449.
7. X. Jia, Image recognition method based on deep learning, in: Control and Decision Conference (CCDC), 2017 29th Chinese, IEEE, May 2017, pp. 4730–4735.
8. A. Isin, S. Ozdalili, Cardiac arrhythmia detection using deep learning, Procedia Computer Science 120 (2017) 268–275.
9. H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, E.P. Xing, Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters, arXiv preprint, 2017.
10. A. Awan, K. Hamidouche, J. Hashmi, D.K. Panda, S-Caffe: co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters, in: 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2017.
11. F. Tschopp, J.N. Martel, S.C. Turaga, M. Cook, J. Funke, Efficient convolutional neural networks for pixelwise classification on heterogeneous hardware systems, in: Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on, IEEE, April 2016, pp. 1225–1228.
12. Y. Demir, A. Uçar, C. Güzeliş, Moving towards in object recognition with deep learning for autonomous driving applications, 2016.
13. H.R. Roth, L. Lu, J. Liu, J. Yao, A. Seff, K. Cherry, L. Kim, R.M. Summers, Improving computer-aided detection using convolutional neural networks and random view aggregation, IEEE Transactions on Medical Imaging 35 (5) (2016) 1170–1181.
14. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, TensorFlow: a system for large-scale machine learning, in: OSDI, vol. 16, 2016, pp. 265–283.
15. B. Alipanahi, A. Delong, M.T. Weirauch, B.J. Frey, Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning, Nature Biotechnology 33 (8) (2015) 831.
16. K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
17. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, arXiv preprint, arXiv:1502.03167, 2015.
18. T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint, arXiv:1512.01274, 2015.
19. T. Li, Y. Dou, J. Jiang, Y. Wang, Q. Lv, Optimized deep belief networks on CUDA GPUs, in: Neural Networks (IJCNN), 2015 International Joint Conference on, IEEE, July 2015, pp. 1–8.
20. S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cuDNN: efficient primitives for deep learning, 2014.
21. H. Lee, R. Grosse, R. Ranganath, A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: Proceedings of the International Conference on Machine Learning, 2009, pp. 609–616.
22. A. Awan, K. Hamidouche, J. Hashmi, D.K. Panda, S-Caffe: co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters, in: 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2017.
23. S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cuDNN: efficient primitives for deep learning, 2014.
24. NVIDIA Corp., NVIDIA DGX-1, [Online], available: http://www.nvidia.com/object/deep-learning-system.html, 2016.
25. TensorFlow, https://tensorflow.org/.
26. TensorFrames, https://github.com/databricks/tensorframes.
27. TensorFlowOnSpark, https://github.com/yahoo/TensorFlowOnSpark.
28. The CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html.
29. The MNIST database, http://yann.lecun.com/exdb/mnist/.
30. Yahoo Flickr Creative Commons 100M.
31. CNTK, [Online], available: http://www.cntk.ai/, Feb. 2017.
32. http://junyelee.blogspot.com/2018/01/deep-learning-with-python.html.
33. http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/215900921.
34. https://www.tutorialspoint.com/cuda/cuda_memories.htm.
CHAPTER 7

Efficient Deep Learning Approaches for Health Informatics
T.M. NAVAMANI, ME, PHD

7.1 INTRODUCTION
Recently, deep learning has taken a vital role in the domain of health informatics by presenting benefits with respect to feature extraction and data classification. In the health care domain, rich sources of biomedical data are available; they are challenging inputs for researchers and provide many opportunities for them to design new data driven models in health informatics using deep learning techniques. Exploring the associations among the collected data sets is the primary problem in developing health care tools based on data driven models and machine learning. The underlying root of deep learning models is neural network architecture. Compared to a plain neural network structure, more hidden neurons and layers are used in the design of deep learning models; the additional neurons let the model cover a large amount of raw data during the training phase. Deep learning approaches are based on representation learning, with multiple levels of representation attained by designing nonlinear modules layer by layer so that each layer transforms the representation from one form to the next, higher form, and finally to a more abstract form, which leads to the generation of an automatic feature set [1], [2]. Automatic generation of a feature set without human intervention has many benefits in the health informatics domain. In applications such as medical imaging, automatically generated features are well refined and also very useful for better diagnosis than features found by humans. These generated features may also be useful for finding nucleotide structures which could bind to a DNA or RNA element of a protein. There are several variations of deep learning models which are popular these days. Among them, convolutional neural networks (CNNs) have the most benefits and a good impact on health informatics. The design of CNNs consists of a sequence of feedforward layers, which implement convolutional filtering, reduction and pooling. Each layer represents a high level abstract feature. The entire architectural design of a CNN is similar to how the visual cortex collects and associates visual information in terms of receptive fields. Further deep architectures such as recurrent neural networks (RNNs), deep belief networks (DBNs), restricted Boltzmann machines (RBMs) and autoencoders are also used in the health care domain. Likewise, the advancement of graphical processing units (GPUs) has had a good effect on the usage of deep learning. Parallelization in CNNs can be achieved by offloading algebraic operations, such as matrix operations and convolution operations, to the GPU.

Machine Learning Vs. Deep Learning
Machine learning is a popular learning methodology based upon artificial intelligence which allows computers to learn from data without human intervention. Most of the time, the aim in applying machine learning techniques is to derive predictive models without needing to consider underlying mechanisms that are unknown or not completely specified. A machine learning process can be carried out in four steps: data organization, representation learning, model fitting and data evaluation [2]. Initially, developing a machine learning model requires feature engineering and domain expertise to convert the raw data into an appropriate internal representation form, from which learning subsystems like classifiers are derived, which can detect patterns in the data set. In recent years, these techniques have come to comprise linear transformation of the input data space, but they still have restrictions in processing raw data as such.
Deep learning is an emerging machine learning technique which differs in that it performs representation learning from raw data. It plays an important role in processing big data, since it can extract valuable knowledge from complex systems. Traditionally, the training process of multi-layer neural networks always leads to a locally optimal solution or cannot guarantee convergence. Deep learning models are developed to overcome these problems with a two-level strategy, pre-training and fine tuning, for training the network effectively. Fig. 7.1 shows a comparison of a multi-layer perceptron and a deep learning model. Recently, the increase in computational power of computers and data

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00014-2 123
Copyright © 2019 Elsevier Inc. All rights reserved.

FIG. 7.1 Multi-layer perceptron vs. deep learning.

size also made deep learning techniques more popular. When big data emerged, deep learning models also became popular in providing solutions by processing and analyzing big data. To train a large scale deep learning model, high performance computing systems are required. By utilizing a GPU based deep learning framework, the training time is reduced from several weeks to one day, or to several hours. Hence, a typical deep learning model initially undergoes unsupervised training, after which a supervised training methodology is applied for better fine-tuning and to learn features and representations of big data for classification and pattern recognition tasks [2]. Beyond the health care domain, deep learning approaches have shown better performance in natural language processing, speech recognition, computer vision, etc.

The multi-layer perceptron (MLP) network was designed to simulate biological neural networks, which can alter themselves, generate new neural connections and also carry out a learning process according to the stimulations raised by neurons. An MLP network comprises an input layer, one or more hidden layers, and an output layer [1]. Training the MLP network is carried out over many epochs; a new input sample is presented in each epoch, and the weights between neurons are adjusted on the basis of the learning algorithm. During the initial training cycle, random weights are assigned between neurons, and in subsequent training cycles the weights are fine-tuned to reduce the difference between target outputs and network outputs. The gradient descent method is the most widely used training method to reach a minimal error surface.

Adding multiple hidden layers to the MLP network forms a deep architecture which can solve more complex problems, since the hidden layers capture nonlinear features, as shown in Fig. 7.2. Such deep architectures are called deep neural networks (DNNs). To train these DNNs, more sophisticated approaches were proposed. In short, DNNs can be trained by supervised and unsupervised learning techniques. In supervised learning, target outputs are specified and weights are fine-tuned to reduce the error, which in turn predicts a desired value for the process of classification or regression. In the case of unsupervised learning, the training process is carried out without any target data. Hence, unsupervised learning is most widely used for feature extraction, dimensionality reduction and clustering. Also, for applications in health informatics, it is better to combine an unsupervised learning method with the initial training process of a DNN to extract more abstract features, and then use these features for classification by incorporating a supervised learning algorithm.

In the past, DNNs received less attention due to the high computational capacity they require both for training and processing, specifically for some real time applications. Recently, advancements in hardware technology and the possibility of parallelization through GPU acceleration, multicore processing and cloud computing have overcome these limitations, enabling DNNs to become a popular learning methodology based on artificial intelligence.
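The epoch-by-epoch MLP training loop just described (random initial weights, one sample presented per step, gradient descent shrinking the difference between target and network outputs) can be sketched in plain Python. Everything below is a hypothetical toy, not code from the chapter: the 2-2-1 network size, the learning rate of 0.5 and the AND-gate samples are invented purely for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyMLP:
    """Input layer, one hidden layer, one sigmoid output unit (toy example)."""
    def __init__(self, n_in, n_hidden):
        # initial training cycle: random weights between neurons (+1 for bias)
        self.w1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]                       # append bias input
        self.h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = self.h + [1.0]
        self.y = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return self.y

    def train_step(self, x, target, lr=0.5):
        y = self.forward(x)
        # delta at the output: derivative of squared error through the sigmoid
        d_out = (y - target) * y * (1.0 - y)
        # deltas at the hidden layer, backpropagated through w2
        d_hid = [d_out * self.w2[j] * self.h[j] * (1.0 - self.h[j])
                 for j in range(len(self.h))]
        # fine-tune weights to reduce the target/output difference
        for j in range(len(self.h)):
            self.w2[j] -= lr * d_out * self.h[j]
        self.w2[-1] -= lr * d_out
        xb = x + [1.0]
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(xb):
                self.w1[j][i] -= lr * dj * xi

def mean_squared_error(net, samples):
    return sum((net.forward(x) - t) ** 2 for x, t in samples) / len(samples)

# toy AND-gate data set (invented for illustration)
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0),
           ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
net = TinyMLP(n_in=2, n_hidden=2)
loss_before = mean_squared_error(net, samples)
for epoch in range(500):              # many epochs; one sample per step
    for x, t in samples:
        net.train_step(x, t)
loss_after = mean_squared_error(net, samples)
```

Running the loop should drive the mean squared error on the four samples below its starting value, which is exactly the reduction of the difference between target outputs and network outputs described above.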
CHAPTER 7 Efficient Deep Learning Approaches for Health Informatics 125

FIG. 7.2 From ANN to a deep learning model.
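The combined recipe mentioned above, unsupervised training first and supervised fine-tuning afterwards, can be illustrated with a deliberately tiny sketch. All data, dimensions and learning rates below are invented for illustration: a one-unit linear autoencoder first learns a 1-D code for 2-D inputs using only the reconstruction error (no labels), and a logistic classifier head is then fine-tuned on that code using the labels.

```python
import math
import random

random.seed(1)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy 2-D inputs lying on a line, plus labels (labels are used only in stage 2).
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [-1.0, -1.0]]
y = [1, 1, 1, 0]

# --- Stage 1: unsupervised pre-training --------------------------------------
# One-unit linear autoencoder: encode 2-D input to a 1-D code, decode back,
# and do gradient descent on the reconstruction error. No labels touched here.
enc = [random.uniform(-0.1, 0.1) for _ in range(2)]
dec = [random.uniform(-0.1, 0.1) for _ in range(2)]

def recon_error():
    total = 0.0
    for x in X:
        h = dot(enc, x)
        total += sum((dec[i] * h - x[i]) ** 2 for i in range(2))
    return total / len(X)

recon_before = recon_error()
for _ in range(300):
    for x in X:
        h = dot(enc, x)
        err = [dec[i] * h - x[i] for i in range(2)]
        g_enc = [sum(err[j] * dec[j] for j in range(2)) * x[i] for i in range(2)]
        for i in range(2):
            dec[i] -= 0.01 * err[i] * h      # descend on decoder weights
            enc[i] -= 0.01 * g_enc[i]        # descend on encoder weights
recon_after = recon_error()

# --- Stage 2: supervised fine-tuning ------------------------------------------
# A logistic head is trained on the learned 1-D code, now using the labels.
w, b = 0.0, 0.0

def sup_loss():
    total = 0.0
    for x, t in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-(w * dot(enc, x) + b)))
        total += -(t * math.log(p + 1e-9) + (1 - t) * math.log(1 - p + 1e-9))
    return total / len(X)

sup_before = sup_loss()
for _ in range(200):
    for x, t in zip(X, y):
        h = dot(enc, x)
        p = 1.0 / (1.0 + math.exp(-(w * h + b)))
        w -= 0.1 * (p - t) * h               # gradient descent on cross-entropy
        b -= 0.1 * (p - t)
sup_after = sup_loss()
```

The same two-stage pattern scales up to the stacked architectures discussed later in this chapter, where each layer is pre-trained on the output of the previous one before a final supervised pass.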

Although different deep learning approaches are discussed in the literature, a detailed analysis and comparison of the deep learning models is missing, and their construction and training requirements, specifically in the health care domain, are not discussed sufficiently. Thus this chapter deals with an in-depth exploration of various deep learning approaches and their applications in health informatics. The subsequent sections of this chapter discuss the structural design, description and applications of the deep learning approaches, specifically in the health informatics domain. Also the required input features, data format and outcomes of each model in health care applications are discussed. A comparative analysis of various deep learning models is also presented with respect to the health informatics domain.

7.2 DEEP LEARNING APPROACHES
Deep learning architectures are capable of extracting hierarchical representation features automatically and also use the remaining layers for learning intricate features from simpler ones. Hence, these approaches can be incorporated to design an end-to-end network model that learns features automatically from raw inputs and processes them accordingly. In the literature, several DNN architectures such as the deep autoencoder, recurrent neural network, deep belief network (DBN), deep Boltzmann machine (DBM), restricted Boltzmann machine (RBM), convolutional neural network, etc., are known; these are introduced and discussed below.

Deep Autoencoders
An autoencoder is a kind of neural network used to extract features by data driven learning. It has equally many nodes in the input and output layers, and training is carried out to recreate the input vector instead of assigning a target label to it. In general, the number of neurons in a hidden layer is smaller than the number of input layer neurons, so that the data is encoded in a lower-dimensional space and abstract features are extracted. For some applications, high dimensional input data is fed to the input layer, and in that case maintaining one hidden layer is not sufficient to encode all the data. Hence, deep autoencoder architectures are designed by stacking multiple autoencoders on top of each other. Fig. 7.3 shows a simple model of a deep autoencoder with input, output, encoder and decoder layers [3]. These networks also experience the problem of vanishing gradients during the training process. To solve this problem, the network is commonly assigned random weight values in the pre-training phase, after which the standard backpropagation algorithm is incorporated to fine-tune the parameters. In the literature, various autoencoders are designed to make the representation learning stronger and more efficient against deviations occurring in the input space.

FIG. 7.3 Deep autoencoder.

Vincent et al. [4] have designed a denoising autoencoder, in which input values are recreated by introducing a kind of noise into the patterns to recover the structure of the input. Rifai et al. [5] proposed a contractive autoencoder, where an analytic contractive penalty is added to the error function instead of adding noise to the input pattern. Later, a convolutional autoencoder model was designed, in which weights are shared between the nodes to preserve spatial locality and for the effective processing of two-dimensional (2-D) patterns. Ahmed et al. [6] have proposed a feature learning algorithm on the basis of an unsupervised training method using stacked autoencoders for learning feature representations from compressed measurements. A detailed study was done by Lu et al. [7] on stacked denoising autoencoders with three hidden layers for fault diagnosis. Various design models of a two-layer network built by changing the size of the hidden layer are studied and analyzed by Tao et al. [8] for fault diagnosis. Learning of time series data representations from various sensors using autoencoders is discussed and analyzed in [9] and [10]. Lu et al. [7] and Thirukovalluru et al. [11] have applied autoencoders to the application of fault diagnosis. They have also suggested that representation learning can be an appropriate tool for health informatics applications, based on a comparative analysis of manually engineered and deep learned features.

Recurrent Neural Networks (RNNs)
An RNN is a neural network designed for analyzing streams of data by means of hidden units. In some applications like text processing, speech recognition and DNA sequence analysis, the output depends on the previous computations. Since RNNs deal with sequential data, they are well suited for the health informatics domain, where enormous amounts of sequential data are available to process [3]. Fig. 7.4 shows a model of an RNN.

FIG. 7.4 Recurrent neural network.

In general, RNNs are provided with input samples which contain many interdependencies. They also have a significant representation for keeping information about past time steps: the output produced at time step t affects the parameters available at time step t + 1. In this manner, RNNs keep two kinds of input, the present one and the recent past, to produce the output for new data. Like deep autoencoders, RNNs also face the vanishing gradient problem. To solve this problem, several variations of RNNs were proposed. Long short term memory units (LSTMs) and gated recurrent units (GRUs), which are variations of the RNN, solve the vanishing gradient problem [1]. These networks have more benefits than traditional RNNs because they are capable of maintaining long term interrelations and also nonlinear dynamics in the case of a time series input data set. Specifically, in an RNN or LSTM, the same weights are maintained across all time steps, which controls the number of parameters the network requires

to learn. Zheng et al. [12] discussed LSTMs for RUL estimation. Comparisons of three kinds of recurrent architecture, namely the simple RNN, LSTM and GRU, are done in [13] for the purpose of prognostics of auto engines and fault diagnosis. A health monitoring system based on empirical evaluation using LSTMs was carried out by Zhao et al. [14] to predict tool wear. An integrated architecture, which is a combination of a CNN and an LSTM, was proposed in [15] and outperformed several baseline approaches.

Restricted Boltzmann Machine (RBM)
Hinton et al. [16] have designed the restricted Boltzmann machine model, which is a variation of the Boltzmann machine and a kind of neural network. It is a stochastic model with normal input, output and hidden units, restricted to form a bipartite graph [1], as shown in Fig. 7.5. A pair of nodes from each of these units can form a symmetric connection between them. However, the nodes within a unit have no direct connection. To extract more abstract features, multiple RBM models are stacked and the upper layers are fully connected, as in conventional learning models, to differentiate feature vectors [4]. The network learns a probabilistic distribution over its input space. It has the ability to detect patterns even if some data is missing [2]. However, challenges like inactive hidden nodes, class variation and high sensitivity to large scale data sets make the training process more difficult, and tracking of the cost or loss function is also difficult [3]. These issues are overcome by the RBM based methods proposed by Nair et al. [17] and Li et al. [18]. RBMs provide better solutions for feature extraction, dimensionality reduction and collaborative filtering. Two popular RBM based models are the deep belief network and the deep Boltzmann machine.

FIG. 7.5 Restricted Boltzmann machine.

Deep Belief Network
A deep belief network is a kind of deep learning network formed by stacking several RBMs. Fig. 7.6 shows a model of a deep belief network (DBN) [1]. The training process is carried out in a greedy layer-wise manner with weight fine-tuning to abstract hierarchical features derived from the raw input data. The DBN was designed to model a perceived distribution among the input and hidden layers' space such that direct connections exist among the lower layer nodes and indirect connections exist at the upper layer nodes [4]. The training process is carried out layer-wise by adjusting the weight parameters using contrastive divergence (CD) to establish a balanced estimate of the learning probability. Besides, the conditional probability distribution of the input samples is determined to learn abstract features which are robust and invariant to transformation, noise, etc. [19].

FIG. 7.6 Deep belief network (DBN).

Deep Boltzmann Machine (DBM)
A deep Boltzmann machine is a model with multiple hidden layers and undirected connections between the nodes, as shown in Fig. 7.7. A DBM learns features hierarchically from the raw data, and the features extracted in one layer are applied as hidden variables serving as input to the subsequent layer. As in the DBN, the DBM incorporates a Markov random field for layer-wise pre-training on the large unlabeled data and then provides feedback from the upper layer to the backward layers. By applying the backpropagation method, the training algorithm is fine-tuned [20]. The training process in a DBM needs to be adapted to define the training information, weight initialization and adjustment parameters. It is observed that in a DBM time complexity constraints occur when setting the parameters to optimal values [4]. A centering optimization method was proposed by Montavon et al. [21] to make the learning mechanism more stable, also for midsized DBMs, for the purpose of designing a generative, faster and discriminative model.

FIG. 7.7 Deep Boltzmann machine.

Convolutional Neural Network
A neural network designed to process multi-dimensional data like images and time series data is called a convolutional neural network (CNN). It includes feature extraction and weight computation during the training process. The name of such networks comes from the convolution operator, which is useful for solving complex operations. An important fact is that CNNs provide automatic feature extraction, which is their primary advantage [2]. The specified input data is initially forwarded to a feature extraction network, and then the resultant extracted features are forwarded to a classifier network, as shown in Fig. 7.8. The feature extraction network comprises stacks of convolutional and pooling layer pairs. The convolutional layer consists of a collection of digital filters to perform the convolution operation on the input data. The pooling layer is used as a dimensionality reduction layer and decides the threshold. During backpropagation, a number of parameters are required to be adjusted, which in turn minimizes the connections within the neural network architecture.

The design of CNNs is inspired by a biological model of the visual cortex. According to the functional process of the visual cortex, a typical CNN design comprises a sequence of convolution and subsampling layers. The purpose of keeping fully connected layers after the last subsampling layer is to perform dimensionality reduction. These fully connected layers are the same as in traditional neural networks. In most cases, CNNs are used to analyze image data, and hence the operations performed at these layers are within a two-dimensional plane. CNNs are most widely used in health care applications since they have the capability of automatically generating features from time series data and frequency representation images. These features are then forwarded to a classifier network for classification and regression. CNNs are also used in other applications like speech recognition, time series prediction, etc. [3]. Janssens et al. [22] have incorporated CNNs to monitor rotating machinery conditions; the discrete Fourier transform of two accelerometers is applied as the network input. In [23], the authors have designed a CNN for prediction purposes. Here, data from sensors at periodic intervals is considered as input, and a regression layer is added since the input data is fed as a continuous value. The authors clearly demonstrated how well the designed CNN based regression model outperforms traditional regression methods such as the multilayer perceptron algorithm, support vector regression, and also relevance vector regression. A deep CNN architecture was designed by Ding and He [24] and by Guo et al. [26] to give solutions for fault

FIG. 7.8 Architecture of CNN.

diagnosis applications. In Abdeljaber et al. [25], the authors have designed CNNs for vibration analysis; the benefit of this approach is the automatic detection of robust features from the input without the need for additional processing. In Lee et al. [27], the authors have examined the application of CNNs in analyzing noisy auditory signals. Generally, the training process of a deep learning network takes more time. However, during testing, a deep learning network can be fast when run on GPUs.

Table 7.1 shows the various deep learning methods which were developed over the decade and their com-

TABLE 7.1 Comparison of deep learning methods.

Denoising autoencoders
  Description: Designed to correct corrupted input data values.
  Strengths: Good for feature extraction and compression.
  Weaknesses: More computational time; addition of random noise; less scalability to high dimensional data.

Sparse autoencoder
  Description: A sparsity parameter can be applied to the loss function to construct robust features independent of the application.
  Strengths: Linearly separable features can be produced.
  Weaknesses: More computational time, since more forward passes are required for every input sample.

Restricted Boltzmann machine (RBM)
  Description: Designed as a generative model with layer by layer feature learning.
  Strengths: Ability to create patterns even when some data are missing.
  Weaknesses: Training process will be difficult; tracking of the loss or cost function takes more time.

Deep Boltzmann machine (DBM)
  Description: Architecture is designed with undirected connections between the layers of the network.
  Strengths: Robust feature extraction through unsupervised training is possible by allowing a feedback mechanism.
  Weaknesses: More training time is required; joint optimization of parameters will be difficult when a large data set is applied to the network.

Deep belief network (DBN)
  Description: Designed with undirected connections between the topmost two layers and directed connections between the lower layers.
  Strengths: Able to extract global features from data; shows better performance for one-dimensional data; good for data dimensionality reduction problems.
  Weaknesses: Slowest training process.

Convolutional neural network (CNN)
  Description: Deep neural network structure whose interconnections reflect the biological visual cortex.
  Strengths: Most widely used in deep learning applications with different variations of training strategies; provides good performance for multi-dimensional data; representational abstract features can be extracted from raw data.
  Weaknesses: A large volume of data is required, with more hyperparameter tuning, to extract optimal features.

Recurrent neural network (RNN)
  Description: Neural network structure to model sequential time series data; a temporal layer is added to learn complex variations in the data.
  Strengths: Most widely used to model time series data.
  Weaknesses: Training process is difficult and sometimes affected by vanishing gradients; more parameters have to be updated, which in turn makes real time prediction more difficult.

parative analysis. Some techniques are more popular and most widely used for health care applications and for big data analysis.

7.3 APPLICATIONS
Translational Bioinformatics
The role of deep learning models in bioinformatics can be concentrated in three parts: prediction of genetic processes, prevention of diseases, and providing complete treatment. These are widely established in genomics, pharmacogenomics and epigenomics. The aim of genomics is the identification of gene and environmental factors which contribute specifically to cancer type diseases. Pharmacogenomics analyzes the variations in drug response of individuals to the treatments given, using differences in genes. Epigenomics deals with the investigation of protein collaboration and understanding of processes like the transcriptome, proteome, etc. [1]. In general, health informatics datasets are high-dimensional, diverse and sometimes unbalanced. Traditional machine learning approaches process the data with the normal flow of data preprocessing, feature extraction, etc. However, these approaches cannot process sequence data in a direct manner, since they need domain knowledge to do so. Extracting biomarkers of genes specific to a particular disorder is a challenging issue, since it requires large amounts of data for processing. The markers exist at different concentration levels during the period of the disorder and also during the treatment period. Deep learning approaches are useful in solving these issues with better outcomes.

Deep learning approaches are capable of processing large, complex and unstructured data by analyzing heterogeneous data like protein occurrences, gene factors and various environmental issues [1]. Incorporation of these approaches in different areas of bioinformatics is studied in [28], [29], [30] and [31]. Deep learning approaches have shown better outcomes in the case of cancer cell recognition and classification processes. They outperform traditional machine learning approaches and also open the door to designing effective methods. Fakoor et al. [32] have designed an autoencoder model on the basis of a gene expression dataset from various categories of cancer, with a similar microarray dataset used for classification and detection. A DBN was designed by Ibrahim et al. [33] to determine features in genes and microRNA which shows better performance in the classification of different types of cancer. Kearnes et al. [34] discussed a deep learning approach related to graph convolutions. In [35], the authors have incorporated a DNN for predicting DNA methylation states derived from DNA sequences. Compared to deep learning approaches, conventional machine learning approaches provide more benefits specifically for processing small data sets.

Clinical Imaging
The foremost application of deep learning with respect to clinical datasets was in medical image processing. Often, brain magnetic resonance imaging (MRI) scan analysis was done using a deep learning network to predict Alzheimer disease and its variations [2], [36], [37]. Among various deep learning networks, the convolutional neural network (CNN) is most widely used for health informatics and medical image processing because it performs well in the field of computer vision and also maps well onto GPUs. CNNs were also used to produce representations in a hierarchical manner in the case of low field MRI scans for the automatic segmentation of cartilage and also for predicting the risk of osteoarthritis [38]. Deep learning architectures are also used for the segmentation of multi-channel three-dimensional MRI scans and for the diagnosis of benign and malignant nodules from ultrasound images [39]. In Gulshan et al. [40], the authors incorporated CNNs to recognize diabetic retinopathy in retinal fundus images and obtained better performance results over more than 10000 test images. A CNN also shows better performance in classifying biopsy-proven clinical images with respect to different types of skin cancer [41].

Conventional machine learning approaches cannot handle the challenges existing in medical imaging like non-isotropic resolution, Rician noise, and bias field effects in MRI. Hence, to solve these challenges, switching to deep learning approaches will give better results. Deep learning techniques provide the option of automating and merging the relevant extracted features with the classification procedure [39], [42]. Moreover, the benefits of CNNs specifically in clinical imaging comprise the classification of interstitial lung diseases on the basis of computed tomography (CT) images [43], the successful detection of haemorrhages in color fundus images [44], and also the classification of tuberculosis manifestations based on X-ray images [35].

Electronic Health Records
In recent years, deep learning networks have been used to process structured (diagnoses, medications data, laboratory tests) and unstructured data (free text of medical prescriptions and notes, etc.) of electronic health records (EHRs). Compared to traditional machine learning models, deep learning produces better

performance results in processing the health records [45]. In the literature, some works designed supervised deep learning networks, while others presented unsupervised deep learning models to process electronic health records in terms of learning representations of patient data, which are then evaluated using shallow classifiers. Liu et al. [46] have proposed a multiple layer CNN for predicting various kinds of heart diseases and also specified its significant benefits. In [47], the authors designed RNNs with hidden units, named DeepCare, to infer current illness states and predict future medical outcomes. They also modified the network with a decay effect to process irregularly timed events. The DeepCare model is also evaluated in various applications like disease progression modeling, future risk prediction of diabetes, and mental health patient cohorts. Miotto et al. [48] have developed a three layer stacked denoising autoencoder (SDA) to learn deep patient representations from the EHRs. They also applied this representation to disease risk prediction using random forests as classifiers. This deep representation provides better prediction results than conventional machine learning algorithms (like PCA and k-means). Liang et al. [49] have applied RBMs to learn representations from EHRs and proved better prediction accuracy for various diseases.

Deep learning models are also used to model time series data, such as laboratory test results, with respect to the identification of specific phenotypes. Lipton et al. [54] have designed RNNs with LSTM units to identify patterns from clinical measurements, which are a kind of time series data source. This model was trained to classify various diagnoses from irregularly sampled clinical measurements, and it provided better results compared to traditional machine learning algorithms in terms of processing time series data. In [59], deep architecture based RNNs were used by the authors for processing free-text patient summary information and for obtaining improved results in removing protected health information from clinical notes. Table 7.2 describes the deep learning models and some applications with respect to health informatics.

Genomics
Deep learning models are widely used for extracting high-level abstract features, providing improved performance over the traditional models, increasing interpretability, and also for understanding and processing biological data. To predict the splicing action of exons, a fully connected feedforward neural network was designed by Xiong et al. [60]. In recent years, CNNs were applied on DNA datasets directly, without the requirement of defining features a priori [2], [44]. Compared to a fully connected network, CNNs use fewer parameters by applying a convolution operation on the input data space, and parameters are also shared between the regions. Hence, large DNA sequence data can be trained with these models and improved pattern detection accuracy can be obtained. DeepBind, a deep architecture based on CNNs, was proposed by Alipanahi et al. [57]; it predicts the specificities of DNA and RNA binding proteins. CNNs were also used for predicting chromatin marks from a DNA sequence [44]. Angermueller et al. [35] have incorporated CNNs for predicting DNA methylation states. Like CNNs, other deep architectures were also applied for extracting features from raw DNA sequence data and for processing the data.

Mobile Devices
Smartphones and wearable devices which are embedded with sensors play a significant role in health monitoring. By using these devices, direct access to personal analytics of the patients is possible, which can contribute to monitoring their health, facilitating preventive care and also helping in managing ongoing illness [2]. Deep learning plays a crucial role in analyzing these new kinds of data. There exist some challenging issues, like the efficient implementation of a deep neural architecture on a mobile device for processing data from sensors, etc. To overcome these challenges, several suggestions were proposed in the literature. Lane and Georgiev [61] have proposed a low power deep neural network, which exploits both the CPU and the digital signal processor of mobile devices without burdening the hardware. They also proposed another deep architecture, DeepX, a software accelerator that can minimize resource usage, which is the major need for mobile adoption. It also enabled large scale deep learning to execute on mobile devices and outperformed cloud based off-loading solutions [62]. In [63], the authors have incorporated CNNs and RNNs with LSTM units to predict frozen gait problems in Parkinson disease patients; these patients struggle to initiate movements such as walking. A deep learning technique was also used to predict poor or good sleep of persons using actigraphy measurements of physical activity during their awake time.

7.4 CHALLENGES AND LIMITATIONS
From the review of deep learning approaches and applications in health informatics, it can be inferred that DBNs and autoencoders are better suited for fault diagnosis purposes. CNN and RNN architectures are most

TABLE 7.2 Deep learning models and applications with respect to specific area and input data set.

Model | Data | Area | Applications | References
Deep neural network | Genomics | Bioinformatics | Representing gene variants from microarray data | [50]
Deep neural network | Genomics | Bioinformatics | Drug design from the given molecule compounds | [1]
Deep neural network | Genomics | Bioinformatics | Interaction of compound-protein, RNA binding protein from protein structures, genes/RNA/DNA sequences and molecule compounds | [35]
Stacked AE | Electronic health records | Medical informatics | Detection of distinguished patterns of physiology in clinical time series data | [51]
Stacked AE | Electronic health records | Medical informatics | Designing a structure for measuring sequences of serum uric acid to represent the signatures of gout and acute leukemia | [2]
Stacked sparse AE | Clinical imaging | Medical imaging | Early diagnosis of Alzheimer disease from brain MRIs | [36]
Stacked sparse AE | Genomics | Bioinformatics | Diagnosis of cancer from gene expression profiles | [32]
Stacked sparse AE | Genomics | Bioinformatics | Prediction of protein backbones from protein sequences | [2]
Stacked denoising AE | Electronic health records | Medical informatics | Producing an unsupervised representation of patients which can be used for the prediction of future clinical events | [48]
Stacked denoising AE | Electronic health records | Medical informatics | Prediction of future diseases from the available patient clinical data status | [52]
Stacked denoising AE | Clinical imaging | Medical imaging | Diagnosis of breast nodules and lesions from ultrasound images | [39]
Stacked denoising AE | Clinical imaging | Medical imaging | Detection of different modes of variation in Alzheimer disease from brain MRIs | [37]
RBM | Electronic health records | Medical informatics | Segmentation of multiple sclerosis lesions in multi-channel 3D MRIs | [2]
RBM | Electronic health records | Medical informatics | Automatic diagnosis and disease classification from the patient clinical data status | [49]
RBM | Electronic health records | Medical informatics | Predicting suicide risk of mental health patients through representations of the medical concepts embedded in the EHRs | [2]
RBM | Mobile data | Pervasive sensing | Identification of photoplethysmography signals for effective health monitoring | [53]
LSTM RNN | Electronic health records | Medical informatics | Effective diagnosis and classification from the clinical measurements of patients in the pediatric intensive care unit | [54]
LSTM RNN | Electronic health records | Medical informatics | Designing a memory model with dynamic nature for predictive treatment based on patient history | [47]
LSTM RNN | Electronic health records | Medical informatics | Prediction of disease levels from longitudinal lab tests | [2]
LSTM RNN | Electronic health records | Medical informatics | Effective collection of details from patient clinical notes | [2]

TABLE 7.2 (continued)

Model | Data | Area | Applications | References
CNN | Clinical imaging | Medical imaging | Prediction of the risk of osteoarthritis by segmenting knee cartilage MRIs automatically | [55]
CNN | Clinical imaging | Medical imaging | Detection of diabetic retinopathy from retinal fundus images | [2]
CNN | Electronic health records | Medical informatics | Classification of skin cancer at various levels | [56]
CNN | Electronic health records | Medical informatics | Prediction of different kinds of heart diseases, like congestive heart failure and pulmonary disease, from longitudinal EHRs | [46]
CNN | Electronic health records | Medical informatics | Designing an end-to-end system for the prediction of unplanned readmission of patients after discharge | [2]
CNN | Genomics | Bioinformatics | Designing a model for predicting chromatin marks from data sources like DNA sequences | [2]
CNN | Genomics | Bioinformatics | Prediction of DNA and RNA binding proteins in an explicit manner | [57]
CNN | Genomics | Bioinformatics | Prediction of methylation states in single-cell bisulfite sequencing | [35]
CNN | Mobile sensing | Pervasive sensing | Estimation of various kinds of chromatin marks | [2]
CNN | Mobile sensing | Pervasive sensing | Analysis of electroencephalogram and local field potential signals | [2]
CNN | Mobile sensing | Pervasive sensing | Prediction of sleep quality from the physical activity of patients through wearable sensor data during awake time | [58]

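Every entry in Table 7.2 presupposes the data preparation step discussed in the text: clinical measurements have to be normalized before being fed to a network. A minimal sketch of column-wise z-score normalization in plain Python (the "patient" feature values below are toy numbers, not real clinical data):

```python
import math

def zscore_normalize(rows):
    """Column-wise z-score normalization: (x - mean) / std.

    `rows` is a list of equal-length numeric feature vectors, e.g. one
    vector of clinical measurements per patient. A zero standard
    deviation (constant column) falls back to 1.0 to avoid division
    by zero.
    """
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]

# Toy "patient" feature vectors: [age, systolic BP, glucose]
patients = [[40.0, 120.0, 90.0],
            [60.0, 140.0, 110.0],
            [50.0, 130.0, 100.0]]
normalized = zscore_normalize(patients)
```

After normalization every column has zero mean and unit variance, so no single raw feature scale (e.g. blood pressure vs. age) dominates the network's early training.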
widely designed to learn representations from health data. Even though all these models produce better outcomes in health informatics, an expert's domain knowledge is an essential requirement for successful realization. Deep learning architectures depend heavily on representation data of features. However, obtaining them for any application is an expensive and time consuming task. Even so, deep learning approaches are desirable for health informatics applications since they can establish better outcomes compared to traditional machine learning approaches. The key challenges observed from the literature are as follows.

There exist a few questions, like why the authors have chosen CNN or RNN for some applications, how deep these architectures have to be designed, and so on. Hence, it will be difficult to understand the reasons for certain results produced by the network. Thus, it can be concluded that developments in deep learning approaches are oriented towards using modern GPUs for solving computational problems in parallel. Even though some of the deep learning models are being incorporated to give solutions for particular diagnostic problems in health informatics, the authors have not given enough proof and clarification of why they have chosen these architectures. From Table 7.2, we can see that deep autoencoders, DBN, RNN and CNN models have played a crucial role in producing solutions for fault diagnosis applications, because the researchers chose these models to solve specific fault diagnostic problems. Hence, analyzing the models from various perspectives also gives rise to many open challenges for researchers. Thus, in existing systems, deep learning approaches are applied as ordinary machine learning techniques without proper evidence of why the network produces good results.

In the previous sections, it was discussed that deep learning approaches require large amounts of training data for better predictive results. Even though many medical organizations have taken steps to convert medical documents from paper to electronic records, the dataset related to a particular disease is always in demand [1]. Hence, deep learning approaches are not suited for all types of applications, particularly in the case of diagnosing diseases which occur rarely. Another concern is that, while training deep learning networks, the overfitting problem arises when the number of network parameters is comparable to the size of the input data set. The DNN model can then memorize the training samples, but cannot generalize to new input samples which it has not already seen. Hence, solutions have to be explored to prevent overfitting and to improve the derivation of solutions for all input data sets.

When using DNN models in health care applications, the collected data set cannot be used as-is as input to the network. It has to be pre-processed and normalized before being fed to the processing layers. Besides, the initial assumption of parameter values which affect the design of a DNN, such as the size of the input data set and the minimum number of filters to be used for a CNN or its depth, still has to be explored well and validated for standard input. Hence, the process of implementing successful pre-processing and determining the optimal set of hyperparameters can be a challenging problem to solve. These issues directly affect the training time of the network and also play a crucial role in designing an effective classification model for biomedical applications.

Another important aspect is that most DNNs can be easily misled. For example, it was observed in [64] that applying small changes to the input samples will cause misclassification of the samples. However, it is noted that most machine learning algorithms are prone to these issues. The feature values can be intentionally set too high or too low to bring about misclassification. Similarly, the decision tree classification process can also be misled by adding changes to the feature set. Thus, all kinds of machine learning models are prone to such manipulations. In [65], the authors have proven that there is a possibility of obtaining a meaningless artificial dataset which is divided into a finite set of classes even though they are not classified by the algorithm. This is one of the drawbacks of DNN algorithms and also a limitation of all other traditional learning algorithms as well.

Thus, when large amounts of biomedical data are available, deep learning techniques can develop better and produce good results, particularly in applications where human interpretation is difficult. Hence, these techniques lead to faster and smarter diagnosis of diseases and also improve the decision making process. Thus, designing efficient deep learning models to produce good predictive results in the health informatics domain is always a challenging problem for researchers.

7.5 CONCLUSIONS

In recent years, deep learning approaches have dominated pattern recognition and classification through learning. In this work, we have discussed the basic deep learning network models and outlined some of their applications in health informatics. Biomedical data can be efficiently processed by deep learning networks, which in turn increases the predictive power for many specific applications in the health informatics domain. Moreover, several applications of health informatics involve processing medical data as an unstructured source. The sources of unstructured data arise from clinical imaging, medical informatics, bioinformatics, etc. However, electronic health records represent data like the patient's information, pathology, treatment given, diagnosis details, etc., in a structured format. Deep learning approaches handle both representations efficiently to produce better outcomes. Deep learning provides more opportunities for designing predictive data models, especially in health informatics. However, there exist some technical challenges to be resolved. For example, it is too expensive to obtain patient and clinical data, and a large fraction of health data sets represents healthy control individuals. Deep learning algorithms mostly depend on large amounts of training data, and the algorithms have been incorporated in applications where the input data set is balanced. Sometimes the network is given fabricated biological data samples. All these challenges act as a barrier to satisfying basic requirements such as data availability and privacy. Advancements made in the development of health care monitoring equipment and diagnosis instruments will have a vital role in future deep learning research. With respect to computational power, in the future more ad hoc hardware platforms with excellent computation power and storage for network models will be available. Thus, we conclude that deep learning algorithms have established better outcomes and prediction in health informatics with the integration of advanced parallel processors.

REFERENCES
1. D. Ravì, C. Wong, F. Deligianni, M. Berthelot, J. Andreu-Perez, B. Lo, G.Z. Yang, Deep learning for health informatics, IEEE Journal of Biomedical and Health Informatics 21 (1) (2017) 4–21.
2. R. Miotto, F. Wang, S. Wang, X. Jiang, J.T. Dudley, Deep learning for healthcare: review, opportunities and challenges, Briefings in Bioinformatics (2017).
3. S. Khan, T. Yairi, A review on the application of deep learning in system health management, Mechanical Systems and Signal Processing 107 (2018) 241–265.
4. P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, July, pp. 1096–1103.
5. S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio, Contractive auto-encoders: explicit invariance during feature extraction, in: Proceedings of the 28th International Conference on International Conference on Machine Learning, Omnipress, 2011, June, pp. 833–840.
6. H.O.A. Ahmed, M.L.D. Wong, A.K. Nandi, Intelligent condition monitoring method for bearing faults from highly compressed measurements using sparse over-complete features, Mechanical Systems and Signal Processing 99 (2018) 459–477.
7. C. Lu, Z.Y. Wang, W.L. Qin, J. Ma, Fault diagnosis of rotary machinery components using a stacked denoising autoencoder-based health state identification, Signal Processing 130 (2017) 377–388.
8. S. Tao, T. Zhang, J. Yang, X. Wang, W. Lu, Bearing fault diagnosis method based on stacked autoencoder and softmax regression, in: Control Conference (CCC), 2015 34th Chinese, IEEE, 2015, July, pp. 6331–6335.
9. R. Kishore, K. Reddy, S. Sarkar, M. Giering, Anomaly detection and fault disambiguation in large flight data: a multi-modal deep auto-encoder approach, in: Annual Conference of the Prognostics and Health Management Society, Denver, Colorado, 2016.
10. Z. Chen, W. Li, Multisensor feature fusion for bearing fault diagnosis using sparse autoencoder and deep belief network, IEEE Transactions on Instrumentation and Measurement 66 (7) (2017) 1693–1702.
11. R. Thirukovalluru, S. Dixit, R.K. Sevakula, N.K. Verma, A. Salour, Generating feature sets for fault diagnosis using denoising stacked auto-encoder, in: Prognostics and Health Management (ICPHM), 2016 IEEE International Conference on, IEEE, 2016, June, pp. 1–7.
12. S. Zheng, K. Ristovski, A. Farahat, C. Gupta, Long short-term memory network for remaining useful life estimation, in: Prognostics and Health Management (ICPHM), 2017 IEEE International Conference on, 2017, June, pp. 88–95.
13. M. Yuan, Y. Wu, L. Lin, Fault diagnosis and remaining useful life estimation of aero engine using LSTM neural network, in: Aircraft Utility Systems (AUS), IEEE International Conference on, IEEE, 2016, October, pp. 135–140.
14. R. Zhao, J. Wang, R. Yan, K. Mao, Machine health monitoring with LSTM networks, in: Sensing Technology (ICST), 2016 10th International Conference on, IEEE, 2016, November, pp. 1–6.
15. P. Malhotra, V. TV, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, G. Shroff, Multi-sensor prognostics using an unsupervised health index based on LSTM encoder-decoder, arXiv preprint, arXiv:1608.06154, 2016.
16. G.E. Hinton, T.J. Sejnowski, Learning and relearning in Boltzmann machines, in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, 1986, pp. 282–317.
17. V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
18. G. Li, L. Deng, Y. Xu, C. Wen, W. Wang, J. Pei, L. Shi, Temperature based restricted Boltzmann machines, Scientific Reports 6 (2016) 19133.
19. G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.
20. R. Salakhutdinov, H. Larochelle, Efficient learning of deep Boltzmann machines, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, March, pp. 693–700.
21. G. Montavon, K.R. Müller, Deep Boltzmann machines and the centering trick, in: Neural Networks: Tricks of the Trade, Springer, Berlin, Heidelberg, 2012, pp. 621–637.
22. O. Janssens, V. Slavkovikj, B. Vervisch, K. Stockman, M. Loccufier, S. Verstockt, et al., S. Van Hoecke, Convolutional neural network based fault detection for rotating machinery, Journal of Sound and Vibration 377 (2016) 331–345.
23. G.S. Babu, P. Zhao, X.L. Li, Deep convolutional neural network based regression approach for estimation of remaining useful life, in: International Conference on Database Systems for Advanced Applications, Springer, Cham, 2016, April, pp. 214–228.
24. X. Ding, Q. He, Energy-fluctuated multiscale feature learning with deep Convnet for intelligent spindle bearing fault diagnosis, IEEE Transactions on Instrumentation and Measurement 66 (8) (2017) 1926–1935.
25. O. Abdeljaber, O. Avci, S. Kiranyaz, M. Gabbouj, D.J. Inman, Real-time vibration-based structural damage detection using one-dimensional convolutional neural networks, Journal of Sound and Vibration 388 (2017) 154–170.
26. X. Guo, L. Chen, C. Shen, Hierarchical adaptive deep convolution neural network and its application to bearing fault diagnosis, Measurement 93 (2016) 490–502.
27. D. Lee, V. Siu, R. Cruz, C. Yetman, Convolutional neural net and bearing fault analysis, in: Proceedings of the International Conference on Data Mining (DMIN), The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2016, January, p. 194.
28. L.A. Pastur-Romay, F. Cedrón, A. Pazos, A.B. Porto-Pazos, Deep artificial neural networks and neuromorphic chips for big data analysis: pharmaceutical and bioinformatics applications, International Journal of Molecular Sciences 17 (8) (2016) 1313.
29. M.K. Leung, A. Delong, B. Alipanahi, B.J. Frey, Machine learning in genomic medicine: a review of computational problems and data sets, Proceedings of the IEEE 104 (1) (2016) 176–197.
30. C. Angermueller, T. Pärnamaa, L. Parts, O. Stegle, Deep learning for computational biology, Molecular Systems Biology 12 (7) (2016) 878.
31. E. Gawehn, J.A. Hiss, G. Schneider, Deep learning in drug discovery, Molecular Informatics 35 (1) (2016) 3–14.
32. R. Fakoor, F. Ladhak, A. Nazi, M. Huber, Using deep learning to enhance cancer diagnosis and classification, in: Proceedings of the International Conference on Machine Learning, vol. 28, 2013, June.
33. R. Ibrahim, N.A. Yousri, M.A. Ismail, N.M. El-Makky, Multi-level gene/MiRNA feature selection using deep belief nets and active learning, in: Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, IEEE, 2014, August, pp. 3957–3960.
34. S. Kearnes, K. McCloskey, M. Berndl, V. Pande, P. Riley, Molecular graph convolutions: moving beyond fingerprints, Journal of Computer-Aided Molecular Design 30 (8) (2016) 595–608.
35. C. Angermueller, H. Lee, W. Reik, O. Stegle, Accurate prediction of single-cell DNA methylation states using deep learning, BioRxiv, 055715, 2017.
36. S. Liu, S. Liu, W. Cai, S. Pujol, R. Kikinis, D. Feng, Early diagnosis of Alzheimer's disease with deep learning, in: Biomedical Imaging (ISBI), 2014 IEEE 11th International Symposium on, IEEE, 2014, April, pp. 1015–1018.
37. T. Brosch, R. Tam, Alzheimer's Disease Neuroimaging Initiative, Manifold learning of brain MRIs by deep learning, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Berlin, Heidelberg, 2013, September, pp. 633–640.
38. A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, M. Nielsen, Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Berlin, Heidelberg, 2013, September, pp. 246–253.
39. J.Z. Cheng, D. Ni, Y.H. Chou, J. Qin, C.M. Tiu, Y.C. Chang, C.M. Chen, Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans, Scientific Reports 6 (2016) 24454.
40. V. Gulshan, L. Peng, M. Coram, M.C. Stumpe, D. Wu, A. Narayanaswamy, et al., R. Kim, Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs, JAMA 316 (22) (2016) 2402–2410.
41. A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. Blau, S. Thrun, Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (7639) (2017) 115.
42. E. Choi, M.T. Bahadori, A. Schuetz, W.F. Stewart, J. Sun, Doctor AI: predicting clinical events via recurrent neural networks, in: Machine Learning for Healthcare Conference, 2016, December, pp. 301–318.
43. D.R. Kelley, J. Snoek, J.L. Rinn, Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks, Genome Research 26 (7) (2016) 990–999.
44. J. Zhou, O.G. Troyanskaya, Predicting effects of noncoding variants with deep learning-based sequence model, Nature Methods 12 (10) (2015) 931.
45. H. Schütze, C.D. Manning, P. Raghavan, Introduction to Information Retrieval, vol. 39, Cambridge University Press, 2008.
46. C. Liu, F. Wang, J. Hu, H. Xiong, Risk prediction with electronic health records: a deep learning approach, in: ACM International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 2015, pp. 705–714.
47. T. Pham, T. Tran, D. Phung, S. Venkatesh, Deepcare: a deep dynamic memory model for predictive medicine, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Cham, 2016, April, pp. 30–41.
48. R. Miotto, L. Li, B.A. Kidd, J.T. Dudley, Deep patient: an unsupervised representation to predict the future of patients from the electronic health records, Scientific Reports 6 (2016) 26094.
49. Z. Liang, G. Zhang, J.X. Huang, Q.V. Hu, Deep learning for healthcare decision making with EMRs, in: Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on, IEEE, 2014, November.
50. D. Quang, Y. Chen, X. Xie, DANN: a deep learning approach for annotating the pathogenicity of genetic variants, Bioinformatics 31 (5) (2014) 761–763.
51. Z. Che, D. Kale, W. Li, M.T. Bahadori, Y. Liu, Deep computational phenotyping, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, August, pp. 507–516.
52. R. Miotto, L. Li, J.T. Dudley, Deep learning to predict patient future diseases from the electronic health records, in: European Conference on Information Retrieval, Springer, Cham, 2016, March, pp. 768–774.
53. V. Jindal, J. Birjandtalab, M.B. Pouyan, M. Nourani, An adaptive deep learning approach for PPG-based identification, in: Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the IEEE, 2016, August, pp. 6401–6404.
54. Z.C. Lipton, D.C. Kale, C. Elkan, R. Wetzel, Learning to diagnose with LSTM recurrent neural networks, arXiv preprint, arXiv:1511.03677, 2015.
55. A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, M. Nielsen, Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Berlin, Heidelberg, 2013, September, pp. 246–253.
56. A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. Blau, S. Thrun, Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (7639) (2017) 115.
57. B. Alipanahi, A. Delong, M.T. Weirauch, B.J. Frey, Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning, Nature Biotechnology 33 (8) (2015) 831.
58. A. Sathyanarayana, S. Joty, L. Fernandez-Luque, F. Ofli, J. Srivastava, A. Elmagarmid, S. Taheri, Correction of: sleep quality prediction from wearable data using deep learning, JMIR mHealth and uHealth 4 (4) (2016).
59. F. Dernoncourt, J.Y. Lee, O. Uzuner, P. Szolovits, De-identification of patient notes with recurrent neural networks, Journal of the American Medical Informatics Association 24 (3) (2017) 596–606.
60. H.Y. Xiong, B. Alipanahi, L.J. Lee, H. Bretschneider, D. Merico, R.K. Yuen, et al., Q. Morris, The human splicing code reveals new insights into the genetic determinants of disease, Science 347 (6218) (2015) 1254806.
61. N.D. Lane, P. Georgiev, Can deep learning revolutionize mobile sensing?, in: Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, ACM, 2015, February, pp. 117–122.
62. N.D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, F. Kawsar, DeepX: a software accelerator for low-power deep learning inference on mobile devices, in: Information Processing in Sensor Networks (IPSN), 2016 15th ACM/IEEE International Conference on, IEEE, 2016, April, pp. 1–12.
63. N.Y. Hammerla, S. Halloran, T. Ploetz, Deep, convolutional, and recurrent models for human activity recognition using wearables, arXiv preprint, arXiv:1604.08880, 2016.
64. A. Nguyen, J. Yosinski, J. Clune, Deep neural networks are easily fooled: high confidence predictions for unrecognizable images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
65. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint, arXiv:1312.6199, 2013.
CHAPTER 8

Deep Learning and Semi-Supervised and Transfer Learning Algorithms for Medical Imaging

R. MOHANASUNDARAM, PHD • ANKIT SANDEEP MALHOTRA, BE • R. ARUN, ME • P.S. PERIASAMY, PHD

8.1 INTRODUCTION

All the outcomes of data acquired by medical experiments are interpreted by specialists in the industry. With reference to image analysis by professionals of the industry, the interpretations are fairly inadequate because of subjectivity, the intricacy of the image, and fatigue. Deep learning and machine learning have had a deep impact on the medical field. They have made the job even faster, and far more accurate than any human brain. People are willing to spend as much money as needed on healthcare, but at the same time expect the best and most comfortable treatment. Deep learning and ML play a major role here.

Deep learning goes far beyond what we think. It is not just about selecting and extracting features, but it can also build new structures and features specifically. Coming to the medical field, it doesn't just detect disease, but also gives plausible prediction models to assist the doctor. This is the future, not just in medical imaging, but in healthcare overall. AI is not meant to make the job easier, but to make it accurate, safe and, most importantly, human free.

Even though automated analysis of diseases and healthcare issues using the orthodox methodologies has shown extensive precision for a long time, recent developments in ML and AI techniques have reignited interest in deep learning. Algorithms involving deep learning have revealed extremely favorable performance, and they have also shown favorable processing speeds. The domains wherein these were tested mainly involved drug discovery, speech recognition, text recognition, lip reading, computer aided diagnosis and face recognition [8].

Medical imaging is regarded as one of the most vigorous fields of research in machine learning, mainly because the figures are moderately labeled and structured. It is highly possible that this might come to be the region where patients interact with functioning, practical artificial intelligence systems. With developments in computers in general, it has become fairly probable to train more and more compound models on ever more data, and especially, if we review the last few years, the application of supervised learning with respect to image segmentation has been enhanced. Keeping this in mind, the models are then trained on data from longitudinal studies in which the disease status, years after the acquisition of the baseline image, is known. This advancement can prove to be one of the most advantageous things in the history of humankind, because healthcare is something which every human will require.

The inspiration of this chapter is to provide a very widespread evaluation of deep learning as well as semi-supervised and transfer learning algorithms for medical imaging, relating to current scenarios and future scopes [16]. In this chapter, there is complete provision of fundamental knowledge and "state-of-the-art" approaches with respect to deep learning in the field of medical imaging.

8.2 IMAGE ACQUISITION IN THE MEDICAL FIELD

When it comes to medical applications of imaging, accuracy is a prerequisite. Any little discrepancy can cause major issues, and troubleshooting those is not that easy [15]. Image acquisition devices have improved drastically over the decades. We now have radiological imaging like CT, MRI and X-ray, which are, frankly, very accurate and of much higher resolution. Positron emission tomography (PET) scans, retinal photography and dermoscopy images are the recent modalities of medical imaging.

In the early times (1970s), one of the first implementations in the field of medicine was the MYCIN system.

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00015-4
Copyright © 2019 Elsevier Inc. All rights reserved.
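The MYCIN system just mentioned was an early rule-based expert system. A toy sketch of that rule-matching style in Python may make the idea concrete; the findings, rules and certainty factors below are invented for illustration and are not MYCIN's actual knowledge base:

```python
# Toy rule-based advisor in the spirit of an expert system: each rule
# fires when all of its premise findings are present, yielding a
# conclusion with a certainty factor. All rules here are hypothetical.
RULES = [
    # (required findings, conclusion, certainty factor)
    ({"gram_positive", "cocci", "chains"}, "streptococcus", 0.7),
    ({"gram_positive", "cocci", "clusters"}, "staphylococcus", 0.7),
    ({"gram_negative", "rod"}, "e_coli", 0.6),
]

def conclude(findings):
    """Return (conclusion, certainty) pairs for every rule whose
    premises are all contained in the observed findings."""
    return [(conclusion, cf) for premises, conclusion, cf in RULES
            if premises <= set(findings)]

hypotheses = conclude({"gram_positive", "cocci", "chains"})
# hypotheses == [("streptococcus", 0.7)]
```

Real systems of this era chained hundreds of such rules and combined their certainty factors; the point of the sketch is only the separation of a declarative knowledge base from a generic inference procedure.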
MYCIN, in simple words, advised diverse regimens of antibiotic treatment for patients. At this time, AI algorithms were more manual feature extraction techniques, which then became supervised learning techniques. Later on, in 2015, symbolic research was done in the field of unsupervised learning and its applications in the medical industry.

8.3 DEEP LEARNING OVER MACHINE LEARNING

Before going into the types of neural nets, their usage and applications, let us see why deep learning should be preferred over machine learning. To be noted, we aren't trying to say that machine learning is bad. We are only implying that deep learning gives a broader view.

We all know that the precise diagnosis of any disease primarily depends on image acquisition as well as interpretation [15]. Devices performing image acquisition have extensively evolved in the last couple of years. Machine learning had major roles to play in this field when it recently came into the industry. Although machine learning purely relies on crafted features when it comes to image interpretation, it is well known that computer vision is what makes machine learning what it is. Let us take an example in which a brain tumor is the main target. For a brain tumor, the main features to be extracted are, first, the structural diagnosis, then the location, the boundaries covering it, etc. Due to the massive dissimilarity between patients, these orthodox learning methods are not rugged and consistent enough to come out with precise, as well as suitable, results.

Deep learning is the new technology. Let's read the last sentence again: it is not a future technology. This technology is already here, and many people are already contributing a vast amount of effort to it. Deep learning has attracted immense interest in all spheres, including medical image analysis, with the medical imaging market expected to reach $300 million by the end of 2021. This implies that the medical imaging arena is bound to receive more funds by 2021 than all other analysis industries were paid in 2016. It is a highly selective, as well as scrutinized, ML approach. The method utilizes DNNs, a variant of neural networks that differs from a simple neural network in its extensive specification of the human brain using unconventional methodologies. Deep learning signifies the utilization of a deep neural network model. The foundational computing unit in a neural network is a neuron, an idea encouraged by examining the human brain, which collects multiple signals as input, aggregates them using weights, and subsequently transfers them through nonlinear methods to create output signals.

Amazing fact: it is said that AI will capture around 40 to 50% of jobs in the coming years.

8.4 NEURAL NETWORK ARCHITECTURE

Artificial neural networks came into consideration in the mid-20th century, and surprisingly enough, they were inspired by the human nervous system.

An experiment was conducted in the late 20th century. It is said that pigeons are art experts. A pigeon was kept in a Skinner box. Then, paintings by two different artists (e.g., Van Gogh and Chagall) were placed in front of the pigeon. If the pigeon pecked a particular painting (Van Gogh), a reward was given. The astonishing part was that the pigeons differentiated between the paintings of Van Gogh and Chagall with 95% accuracy for the pictures presented to them which they had already been trained on. For the unseen pictures, i.e., the pictures they hadn't been trained on, the accuracy of the differentiation was 85%.

What can we take from this experiment? It's simple. Pigeons don't memorize pictures. What they do is extract and recognize the pattern or the style of the painting. They generalize from the previously perceived pictures to make predictions. This is what neural networks in general do. Unlike orthodox computers, they work on pattern recognition and interpretation. Conventional computers, as we all know, work mainly on storage and restoration.

So, what are neural networks? Neural networks are models of the brain and the nervous system. They work in a similar manner to our brain and process information in the same way. Their principles are very simple but have complex behaviors.

Let's go to the fundamentals of neural networks. See Figs. 8.1 and 8.2. All the types of neural networks have 4 basic attributes. They are:
• A set of processing units;
• A set of connections;
• A computing procedure;
• A training procedure.

8.5 DEFINING DEEP LEARNING

The perceptron was one of the first neural networks that was developed taking into consideration the natural human brain. This system had an input layer which was directly connected to the output layer and was good at classifying linearly separable patterns. Later on, to deal with more compound patterns, different neural networks came into consideration with layered architectures [10].

FIG. 8.1 Neuron vs. node.

FIG. 8.2 Basic depiction of an artificial neural network.

So then, what are deep neural networks? Fundamentally, every neuron in the network sums up the input data and applies the activation function, finally providing the output that has the possibility to be propagated to the upcoming layer. Therefore, what adding more

8.6 DEEP LEARNING ARCHITECTURE

When it comes to research, various types of deep learning algorithms have been developed, such as the convolutional neural network (CNN), deep neural network (DNN) and deep belief network (DBN), among many more, which are explained in brief below.

8.6.1 Convolution Neural Network (CNN)

CNN is grabbing a lot of attention in the digital image processing and vision arena. Because of its various different architectures, it provides a lot of benefits and applications in various different processes [3]. CNNs are perfectly apt for 2D data. They consist of convolutional filters whose main job is to convert the 2D data into 3D. CNNs are mainly fit for tasks which perform image recognition, like image classification, localization, detection, segmentation, etc.

Convolution makes use of three ideas to perform computationally efficient machine learning:
1. Weight sharing;
2. Sparse connections;
3. Invariant representation.

In simple words, CNNs are the most popular algorithms when it comes to image recognition and visual learning tasks [4]. This is mainly because CNNs are very good at preserving local image relations while performing dimensionality reduction. The only drawback is that they need a lot of labeled data for classification.

8.6.2 Recurrent Neural Network

RNNs are well known for applications which involve analyzing sequential data, for example, analyzing words in a sentence, numbers in a series, etc. [10]. They have the ability to learn any sort of sequence, wherein the weights are shared across all steps and neurons. One of the major advantages is that they can generate; because of this, they have been employed mostly in the analysis part. Let us consider a simple RNN. Here, the output of a layer is considered as the input of the next part of the net; this is then fed back into the layer, which results in a capacity for related memory. RNNs have multiple variations, like MDLSTM, etc., which provide state-of-the-art accuracies. Hence, they can be used in character recognition, NLP related tasks, etc.

RNNs have primarily been utilized in segmentation,
hidden layers does is that it permits dealing with com- in the vast sphere of medical image analysis. Chen et
plex hidden layers, capturing nonlinear relationships. al. collectively confined CNNs and RNNs to differenti-
These neural networks are primarily known as deep ate three-dimensional electron microscope images from
neural networks. neuronal and fungal structures.
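The weight sharing just described can be sketched with a minimal recurrent cell. The scalar weights, bias and input sequence below are made-up illustrative values, not taken from any work cited in this chapter.

```python
import math

def rnn_forward(xs, w_x, w_h, b):
    """Minimal recurrent cell: the SAME parameters (w_x, w_h, b) are
    reused at every time step, which is the weight sharing that lets
    an RNN process sequences of any length."""
    h = 0.0  # initial hidden state
    states = []
    for x in xs:  # one update per sequence element
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Illustrative 3-step sequence with hand-picked weights
states = rnn_forward([1.0, 0.5, -0.5], w_x=0.8, w_h=0.3, b=0.0)
print(states)  # one hidden state per input step
```

Because the same three parameters are applied at every step, the cell accepts sequences of any length without adding parameters.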
142 Deep Learning and Parallel Computing Environment for Bioengineering Systems

FIG. 8.3 Sample DNN.

8.6.3 Deep Neural Network
Deep is a word we have all been reading for a long time. Let us understand why such a network is called a deep neural network. The past rules developed for networks such as the perceptron, HNN, etc., worked on the concept of one input layer and one output layer. But if there are more than 3 layers, including the input and output layers, then it is called a deep neural network. Hence, deep in its rawest form means more than just one hidden layer. In DNNs, every layer of nodes [9] trains on a different set of features based on the previous layer's output. So, the deeper you go into the net, the more complex the features the nodes can recognize [12]. This is because each layer recombines and learns from the features of the previous layer. This is what is called a feature hierarchy. These nets allow modeling complex nonlinear relationships. The major advantage which DNNs hold is the ability to deal with unstructured and unlabeled data, which is actually most of the world's data. Especially in the medical field, most of the data is unstructured and unlabeled, so DNNs are apt for this large amount of data and its processing. See Fig. 8.3.

So, basically, deep learning networks can take all the raw, unlabeled, unstructured data and cluster it, as well as process it, into similar forms. A simple example is that deep learning can take a billion images and cluster them according to their similarities: dogs in one corner, chips in another, and in the ultimate one, all the pictures of a car. This is the basis of the so-called smart photo albums. Have you ever encountered a situation while browsing the internet wherein a website asks you to prove that you're not a bot by choosing a typical type of photos, such as street signs? Yes, that's what the DNN can do on its own. And much more with extremely complicated data, analyzing which the human brain can take as long as a decade.

8.7 DEEP LEARNING IN MEDICAL IMAGING [5]
8.7.1 Diabetic Retinopathy
Diabetic retinopathy (DR) is a disease caused in the eye as a result of diabetes, causing blindness as time passes by. Among people suffering from diabetes, at least 15% (according to sources) have a risk of vision impairment. The manual process of DR screening is very tedious and complicated at the same time. In the early stages, this disease hardly shows any symptoms, so due to lack of expertise and equipment, even clinicians sometimes make mistakes [1]. But thanks to deep learning, automated detection of DR is possible, and with high accuracy, too. Below shown is a research work of the same [2].

The classification and detection of moderate and worse referable DR on the Messidor-2 dataset, and the implementation of a deep convolutional neural network (DCNN) on the EyePACS-1 dataset, was conducted by Gulshan et al., where 874 patients provided 1700 images to the Messidor-2 dataset and approximately 10,000 retinal images to EyePACS-1. A 96.1% sensitivity and 93.9% specificity on Messidor-2, and 97.5% sensitivity and 93.4% specificity on EyePACS-1, respectively, was claimed by the authors. The testing on publicly available datasets for classification of fundus images was conducted by Kathirvel, who trained a DCNN with

TABLE 8.1
Details of diabetic retinopathy studies.
Authors        | Model                             | Dataset                        | Accuracy (acc) / sensitivity (sensi) / specificity (spec) (%)
Gulshan et al. | Deep convolutional neural network | EyePACS-1; Messidor-2          | 97.5% sensi & 93.4% spec; 96.1% sensi & 93.9% spec
Kathirvel      | CNN with dropout layer            | Kaggle-fundus, DRIVE and STARE | 94–96% acc
Pratt et al.   | Cu-DCNN library                   | Kaggle                         | 75% acc
Haloi et al.   | Five-layer CNN                    | Messidor; ROC                  | 98% AUC; 97% AUC
Alban et al.   | DCNN                              | EyePACS                        | 45% acc
Lim et al.     | DCNN                              | DIARETDB1, SiDRP               | –
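The sensitivity and specificity figures reported in Table 8.1 come from the standard confusion-matrix definitions; the counts below are invented purely for illustration.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = recall on the diseased class; specificity =
    recall on the healthy class. Both are fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)  # fraction of diseased eyes flagged
    specificity = tn / (tn + fp)  # fraction of healthy eyes cleared
    return sensitivity, specificity

# Illustrative confusion-matrix counts (not from any study above)
sens, spec = sensitivity_specificity(tp=95, fn=5, tn=90, fp=10)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```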

dropout layer techniques as well. The accuracy reported was up to 94–96%.

Haloi claimed sensitivity, specificity, accuracy and area under the curve (AUC) of up to 97%, 96%, 96% and 0.988, respectively, on the Messidor dataset, and an AUC of up to 0.98 on the ROC dataset, by implementing a five-layer CNN [3] with a dropout mechanism for discovery of early stage DR on the Retinopathy Online Challenge (ROC) [1] and Messidor datasets. Alban used CNNs for detection of DR and also de-noised the angiograph images of EyePACS. A diagnosis of five severity classes was conducted, which provided 79% AUC and 45% accuracy. Lim et al. drew features from identified regions utilizing the method proposed; then the feature vector was passed to a DCNN for classification. The model was realized on the DIARETDB1 and SiDRP datasets. All the above works are summarized in Table 8.1.

8.7.2 Cardiac Imaging
In the arena of cardiac imaging, deep learning has unequivocally showed very satisfying results with great accuracy. Especially considering calcium score quantification, MRI and CT scans were the most utilized imaging modalities. Physical detection of coronary artery calcium (CAC) in CT scans needs extensive expert interaction, which makes it time-consuming and not feasible when it comes to large-scale or epidemiological studies.

8.7.3 Tumor Classification in Homogeneous Breast Tissue
Due to the dielectric contrast between malignant tumors and adipose breast tissue, breast cancer detection has seen remarkable differences and improvements, and this is purely due to deep learning. In this category of tumor classification, the research work done is presented below.

The research here involves a finite difference time domain (FDTD) data set obtained via numerical simulation of tumor models [6] in adipose tissue. In order to reduce the dimensionality of the FDTD data, the backscatter signals are preprocessed and the features are mined. The mined features are then used as the input for the trained deep learning based classifier. Feature mining here was done [6] with the help of principal component analysis (PCA). This evades the necessity of handcrafting features for the input signals, and hence it can generally be applied to any sort of signal. The short-time Fourier transform was also used. Two different deep learning architectures were used here:
1. Regular, or vanilla, feedforward deep neural networks (DNNs),
2. Convolutional neural networks (CNNs).

Below is an extract from the research work by Branislav Gerazov and Raquel C. Conceicao:
"The experiments for optimizing the classifier hyper parameters and assessing their performance were accomplished using 10-fold cross-validation on randomized subsets of the whole dataset, without keeping together data from the same model. To take into account the unbalanced count of the two classes in the dataset, we used stratified folds which preserve the percentage of both classes. The hyper parameters were changed in the following ranges. The length of the feature vector as extracted using PCA was varied in the range: 10–100. For the CNN architecture, we generated spectrograms

using a 213 ps frame length, Hamming window, and 256 Fast Fourier Transform bins."

Table 8.2 shows the original results obtained. They were obtained by first down-sampling and trimming the signals. In the first approach, around 30 PCA components were extracted and used as the input of an RBF kernel SVM classifier.

TABLE 8.2
Classification accuracy obtained for the binary task of detecting tumor malignancy.
Approach                              | Accuracy
PCA30 + SVM (CT)                      | 89.20%
DWT + SVM (CSCT)                      | 91.19%
PCA50 + DNN (2 × 300 + 1)             | 92.81%
Spectrograms + CNN (4 × 20 + 9 × 300) | 89.58%
DNN (2 × 300) + SVM                   | 93.44%

In the final step, they used around 300 outputs of the penultimate layer of the finest performing DNN and used them as the input to an SVM classifier. The SVM classifier showed 93.44% accuracy.

8.8 DEVELOPMENTS IN DEEP LEARNING METHODS
Figuring out medical or other specific image data is not always viable. What this means here is that there might be a rare disease or a lack of experts, although most deep learning methods stress supervised learning. To overcome the issue of lack of big data, a change is required from supervised to unsupervised or semi-supervised learning. In spite of significant efforts, deep learning theories have been unable to provide exhaustive solutions, rendering many questions unanswered, which extends the unlimited opportunity for improvisation and growth.

8.8.1 Black Box and Deep Learning
Around 10 years ago, when medical imaging came into existence, it broke out to the world and gave rugged solutions to a lot of unsolved mysteries in the medical industry. We can't deny that medical imaging has genuinely solved multiple problems which were presumed to be impossible to solve. There is still a major issue which deep learning has to come through, and it's called the black box problem. Even though the math which was used to make a neural network is fairly direct, comprehending how the output was finally obtained is not as easy. What it means is that the machine learning model gets its input, processes it and identifies patterns, but how the model works, how the processing takes place, is utterly complicated. Even the researchers using it do not know the way the model works and processes, or why it provides better results.

8.8.2 Semi-Supervised and Transfer Learning Algorithms
We have had a look at deep learning models, their implementation, and their contribution to the medical image field, etc. Now let us see the developments made by semi-supervised and transfer learning techniques in the field of medical imaging [13].

8.8.2.1 Semi-Supervised Learning
Let's go through the concepts of supervised and unsupervised learning again briefly.

8.8.2.2 Supervised Learning
This kind of learning involves the algorithm learning to assign labels to types of data inputs based on the labels that were provided by a human during the process of training.

8.8.2.3 Unsupervised Learning
This algorithm doesn't involve any guidance from the user. It analyzes the data and then sorts out inherent similarities between the input samples.

So, it can be quite obviously presumed that semi-supervised learning is a hybrid version. Whatever challenges are faced in each type, semi-supervised learning provides a win–win situation. Let's go ahead and define semi-supervised learning then.

Semi-supervised learning makes use of both labeled and unlabeled data. So with the use of some labeled and unlabeled data, the accuracy of the decision boundary becomes much higher. See Fig. 8.4. The advantages of using semi-supervised learning are:
1. Labeled data is often expensive and difficult to find;
2. The model becomes more robust by using a more precise decision boundary.
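A minimal self-training (pseudo-labeling) loop, one common way to realize semi-supervised learning, can be sketched as follows: the model labels the unlabeled points it is most confident about, adds them to the labeled set, and refits. The 1-D threshold "classifier" and the data points are toy stand-ins, not a method from any work cited here.

```python
def fit_threshold(points):
    """Toy 1-D classifier: threshold at the midpoint of class means."""
    m0 = sum(x for x, y in points if y == 0) / max(1, sum(1 for _, y in points if y == 0))
    m1 = sum(x for x, y in points if y == 1) / max(1, sum(1 for _, y in points if y == 1))
    return (m0 + m1) / 2.0

def self_train(labeled, unlabeled, k=2):
    """Repeatedly pseudo-label the k unlabeled points farthest from
    the current decision boundary (the most confident ones), add them
    to the labeled set, and refit."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        t = fit_threshold(labeled)
        unlabeled.sort(key=lambda x: abs(x - t), reverse=True)  # most confident first
        for x in unlabeled[:k]:
            labeled.append((x, 1 if x > t else 0))
        unlabeled = unlabeled[k:]
    return fit_threshold(labeled)

# Two labeled points plus four unlabeled ones (all values invented)
boundary = self_train([(0.0, 0), (1.0, 1)], [0.1, 0.2, 0.9, 0.8])
print(boundary)
```

The unlabeled points sharpen the estimate of the boundary without any extra human labeling, which is exactly the advantage listed above.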

FIG. 8.4 Classification of supervised, semi-supervised and unsupervised learning algorithms.

8.8.3 Applications of Semi-Supervised Learning in Medical Imaging
Pelvic MR Image Segmentation Based on Multi-Task Residual Fully Convolutional Networks
Below are extracts from the research work performed with respect to pelvic MR image segmentation [14].

"In this section, we first train an initial multi-task residual fully convolutional network (FCN) with a small number of labeled MRI data. Now, with the initially trained FCN, those unlabeled new data can be automatically segmented and some reasonable segmentations can be included into the training data to fine-tune the network. This step can be repeated to progressively improve the training of our network, until no reasonable segmentations of new data can be included [11]. What this does is increase the effectiveness of this method, and it has massive advantages in terms of accuracy."

"Exact segmentation of the prostate and neighboring organs from MRI scans is a critical step for image-guided radiotherapy. As of now, manual segmentation of these organs is adopted in clinics, which, in all honesty, is biased, time-consuming and tedious. Thus, it is desired to develop an accurate and automatic segmentation method."

"The fully convolutional network (FCN) achieves inordinate success when it comes to semantic segmentation of natural images, training a neural network in an end-to-end fashion and learning features from multi-resolution feature maps through convolutions and pooling operations. However, deep neural networks contain millions of parameters, and thus require a large amount of labeled data (which is difficult in practice) to learn these parameters. These are the challenges which are faced. To solve this and come up with a fixed and robust solution, a semi-supervised learning method based on a multi-task residual FCN is used. First, convolution layers in the FCN are substituted by residual learning blocks, to learn the residual functions. Secondly, in order to better use low-level feature maps in the FCN, we use concatenation based skip connections to combine low-level feature maps with high-level feature maps. Third, considering multi-task learning because of its added advantages, three regression tasks are employed to provide more information for helping optimize the segmentation task. Finally, a semi-supervised learning framework is designed by progressively including reasonable segmentation results of unlabeled data to refine the trained networks."

8.8.4 Method
Architecture. Below shown is the residual learning block. Here, the intensity to be measured is that of the bladder, rectum and prostate.
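A residual learning block of the kind referred to above computes y = F(x, W) + x, adding the input back onto a learned transformation F. The tiny two-layer F below, with hand-picked weights, is only an illustrative sketch, not the architecture from the cited paper.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def residual_block(x, w1, w2):
    """y = F(x, W) + x : the input is added back onto the output of a
    small two-layer transformation, so the block only has to learn the
    residual F rather than the full mapping."""
    f = matvec(w2, relu(matvec(w1, x)))
    return [fi + xi for fi, xi in zip(f, x)]

# Illustrative 2-D example with made-up weights
y = residual_block([1.0, 2.0],
                   w1=[[0.5, 0.0], [0.0, 0.5]],
                   w2=[[1.0, 0.0], [0.0, 1.0]])
print(y)  # → [1.5, 3.0]
```

Note that when F outputs zero the block reduces to the identity, which is what makes very deep stacks of such blocks easy to train.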

TABLE 8.3
The performance (mean (std)) of the learning techniques.
Learning technique | Bladder       | Prostate      | Rectum
Supervised         | 0.952 (0.007) | 0.891 (0.019) | 0.884 (0.027)
Conventional       | 0.960 (0.006) | 0.895 (0.024) | 0.884 (0.031)
Semi-supervised    | 0.963 (0.007) | 0.903 (0.022) | 0.890 (0.029)

The function can be shown in formulated form as

y = F(x, W) + x,

where x and y denote the input and output, respectively, and W signifies the parameters present in the block. Once feature learning is over, the feature maps are directed into 4 tasks: 3 regression tasks and one segmentation task.

Semi-Supervised Stage. Now, in the semi-supervised learning stage, the network θ is trained on a set L = {I, S}, which comprises N MR images I = {I_1, I_2, ..., I_N} and their corresponding segmentation maps S = {S_1, S_2, ..., S_N}. Coming to the semi-supervised stage, there is also an unlabeled data set U = {I_U}. With this, a semi-supervised algorithm can be made, comprising the inputs θ, L, U, along with the output of an updated θ, in order to gradually train the network with unlabeled data. The algorithm for the same is shown below.

Input: θ, L, U
Output: updated θ
1: while len(U) > 0 do
2:     Estimate S_U by θ
3:     Move the k best pairs (I_U, S_U) from U to L
4:     Optimize θ by training on L
5: return θ

The training method can be chosen freely. In the work performed in this particular experiment, the network parameters were initialized by the Xavier algorithm and updated by the backpropagation algorithm using the Adam optimizer. The network parameters were initialized with a similar kind of training set, including 30 labeled MR images, and after that refreshed with another 30 unlabeled MR images using various semi-supervised learning techniques. The parameter k in the algorithm given was set to 5, and the alternate updating in the traditional semi-supervised learning algorithm was performed for 6 cycles. The performance of the three algorithms is presented in Table 8.3. The outcomes demonstrate that, by utilizing semi-supervised learning to figure out how to expand the training data, the segmentation performance can be better enhanced. Likewise, continuously including the unlabeled data into the training set for updating the network parameters can perform superior to the regular learning techniques. See Table 8.3.

8.9 TRANSFER LEARNING
Another very important machine learning method is called the transfer learning method. In this learning approach, a model established for a particular task is reused as the initial point for a model on a second task. The basic idea behind it is to use the information obtained from tasks for which labeled data is accessible in settings where only little labeled data is accessible. Creation of labeled data can be expensive, hectic and a very tedious task, so through this approach, creation of much labeled data is not required. See Fig. 8.5. Multi-task learning and the featurizer are the two main requirements in transfer learning and its architecture.

8.9.1 Transfer Learning in Image Data
One of the most common applications of transfer learning is with image data. Whether it is predictive modeling or some imaging application, transfer learning gives rugged solutions with high precision. This can include any prediction task which takes photographs or videos as input. These problem statements use a deep learning model pre-trained on a large and challenging image classification task. This approach is quite useful, as such models were trained on a large dataset of photographs or videos and required to make predictions on a large number of classes, demanding that the model learn to extract features from photographs to solve the problem.
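The reuse idea can be sketched numerically: keep a "pre-trained" feature extractor frozen and train only a small new head on the target task. The fixed extractor below is a toy stand-in for a real pre-trained network; everything in this sketch (features, data, learning rate) is invented for illustration.

```python
import math

def features(x):
    """Stand-in for a frozen pre-trained extractor: it is never
    updated while the new head is trained."""
    return [x, x * x]

def train_head(data, lr=0.5, epochs=200):
    """Fit a tiny logistic-regression head on top of the frozen
    features; only these weights are learned (transfer learning)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            f = features(x)
            p = 1.0 / (1.0 + math.exp(-(w[0] * f[0] + w[1] * f[1] + b)))
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(x, w, b):
    f = features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

# Tiny made-up target task: negative inputs are class 0, positive are class 1
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_head(data)
preds = [predict(x, w, b) for x, _ in data]
print(preds)
```

Only the head's three parameters are trained; a real pipeline would do the same with a network such as Inception v3 as the frozen extractor.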

FIG. 8.5 Traditional learning vs. transfer learning.

8.9.1.1 Mechanisms of Deep Transfer Learning for Medical Imaging
Transfer learning has been successfully used and appreciated in situations where data is not in abundance. Below shown are the results of research done in the field of medical imaging using transfer learning.

In this research, the authors considered a set of images to be trained on, and built classifiers to segregate between kidney and non-kidney regions [17]. For a single test image, detecting the best kidney region of interest (ROI) S* from a set of candidate ROIs {S} is done in two steps. The whole set {S} is passed through the classifier models and the candidates with positive class labels Y are retained (Eq. (8.1)). The ROI with the largest probability L from the set {S+} is chosen as the identified kidney region (Eq. (8.2)):

{Y, L} = MLClassifier({S}), {S+} = {S ∈ {S} | Y = 1},   (8.1)
S* = argmax(L+), where L+ = L({S+}).   (8.2)

CNNs are employed as feature extractors to enable relationships with traditional texture features.

8.9.1.2 Dataset and Training
In total, 90 long-axis kidney images were acquired on a GE Healthcare LOGIQ E9 scanner and split into two equal and distinct sets, for training and validation. The images were attained at variable depths of ultrasound acquisition, fluctuating between 9 and 16 cm. Precise rectangular ground truth kidney ROIs were manually marked by a clinical expert. See Fig. 8.6. These ROIs were down-sampled to a common size and were binned into 2 classes based on their intersection with ground truth annotations. The DSC (dice similarity coefficient) is used as the metric.

8.9.1.3 Transferred Learned Features
1. Full network adaptation (CaffeNet_FA).
2. Partial network adaptation (CaffeNet_PA).
3. Zero network adaptation (CaffeNet_NA).

8.9.1.4 Traditional Texture Feature Results
Haar features are reported to have the best performance for kidney detection. Hence, this study in particular used Haar features.

To quantitatively evaluate the performance on the 45 validation images, two metrics were used: (i) the number of localization failures, i.e., the number of images for which the dice similarity coefficient between the detected kidney ROI and the ground truth annotation was < 0.80, and (ii) detection accuracy, i.e., the average dice overlap across the 45 images between detection results and ground truth.

Table 8.4 shows the results in their most accurate form.

Fig. 8.7 shows the results. In (A) and (B), the baseline method was affected by the presence of the diaphragm, kidney and liver boundaries, creating a texture similar to the renal-sinus portion, while CaffeNet had excellent localization. In (C) and (D), CaffeNet resulted in over-segmentation, illustrating that limited-data problems require careful feature engineering; incorporating domain knowledge still carries a lot of relevance.
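The two-step selection of Eqs. (8.1)–(8.2), keeping the candidates the classifier labels positive and then taking the one with the highest score, can be sketched as follows; the candidate boxes, labels and scores are hypothetical.

```python
def select_kidney_roi(candidates):
    """candidates: list of (roi, label, score) triples, where label is
    the classifier's class (1 = kidney) and score its confidence.
    Step 1 (Eq. 8.1): keep the positively labeled candidates.
    Step 2 (Eq. 8.2): return the positive candidate with max score."""
    positives = [(roi, score) for roi, label, score in candidates if label == 1]
    if not positives:
        return None
    return max(positives, key=lambda p: p[1])[0]

# Hypothetical ROIs given as (x, y, w, h) boxes with labels and scores
rois = [((10, 10, 50, 40), 0, 0.20),
        ((12, 14, 48, 42), 1, 0.85),
        ((30, 22, 60, 45), 1, 0.91)]
print(select_kidney_roi(rois))  # the highest-scoring positive ROI
```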

FIG. 8.6 Ground truth kidney ROI.

TABLE 8.4
Results of traditional texture features.
Method               | Haar features | CaffeNet_NA | CaffeNet_PA | CaffeNet_FA | Haar + CaffeNet_FA
Average dice overlap | 0.793         | 0.825       | 0.831       | 0.842       | 0.857
No. of failures      | 12/45         | 12/45       | 11/45       | 10/45       | 3/45
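The dice overlap reported in Table 8.4 measures the agreement 2|A ∩ B| / (|A| + |B|) between a detected region and the ground truth. A sketch for axis-aligned boxes, with made-up coordinates:

```python
def dice_overlap(a, b):
    """Dice similarity coefficient between two axis-aligned boxes
    given as (x1, y1, x2, y2): 2 * |A ∩ B| / (|A| + |B|)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return 2.0 * inter / (area_a + area_b)

# A detection shifted 2 units relative to the ground truth box
print(dice_overlap((0, 0, 10, 10), (2, 0, 12, 10)))  # → 0.8
```

A value of 1.0 means perfect overlap; the failure threshold quoted in the text corresponds to a dice score below 0.80.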

FIG. 8.7 Caffenet with localization and oversegmentation.



A rugged performance of 86% average detection accuracy was achieved using the hybrid methodology. The number of failures, 3/45, was much better than for each of the methods individually.

8.9.2 Transfer Learning Technique for the Detection of Breast Cancer
The model given below shows a transfer learning model for the detection of breast cancer [7] with an AUC of 0.93. This methodology uses the technique of histopathology for the detection of breast cancer. The model used here is Google's Inception v3 model. The research depicted below mainly wanted to decipher the problem of an inadequate amount of data by the use of transfer learning and data augmentation.

8.9.2.1 Dataset
The dataset used here is the BreaKHis database, which has almost 8000 microscopic pictures of benign and malignant breast tumors attained from 82 patients. See Table 8.5.

TABLE 8.5
The distribution in the dataset.
Magnification | Benign | Malignant | Total
40x           | 625    | 1370      | 1995
100x          | 644    | 1437      | 2081
200x          | 623    | 1390      | 2013
400x          | 588    | 1232      | 1820
Total         | 2480   | 5429      | 7909
# Patients    | 24     | 58        | 82
use of transfer learning and data augmentation. # Patients 24 58 82

FIG. 8.8 Architecture of traditional CNNs.

FIG. 8.9 The architecture of Google’s Inception v3.



FIG. 8.10 Graphical result of cross-entropy.

FIG. 8.11 Graphical result of training model accuracy.
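The cross-entropy tracked in Fig. 8.10 can be computed directly from the cost formula quoted later in this section; for a true distribution p and a predicted distribution q, the cost is H(p, q) = −Σ p(x_i) log q(x_i), which reduces to the quoted entropy form when q = p. The probabilities below are made up.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p(x_i) * log(q(x_i)); with q = p this reduces
    to the entropy formula quoted in the text."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# One-hot true label vs. a made-up predicted distribution
p = [0.0, 1.0]   # the true class is the second one
q = [0.1, 0.9]   # model's predicted probabilities
print(round(cross_entropy(p, q), 4))  # → -log(0.9) ≈ 0.1054
```

The cost falls toward zero as the predicted probability of the true class approaches 1, which is why the curve in Fig. 8.10 decreases as training proceeds.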

Images of lower magnifying factors were trained on in this research in order to identify the region of interest (ROI) in the whole image: 625 images of benign and 1397 images of malignant tumors were taken into consideration here. The training set comprises 438 benign and 960 malignant images, and the validation set comprises 187 benign and 410 malignant images.

8.9.2.2 Preprocessing
To compensate for the lack of data in the training images, data augmentation techniques were used. This was

done by rotating the images by certain angles, mirroring them, and even adding randomly distorted images to the original data set. Now, the total was 11,184 images.

8.9.2.3 Transfer Learning Part
Deep CNN and ConvNet models were built to classify breast cancer histopathological images into malignant and benign classes. Transfer learning was applied here to overcome the issues of insufficient data and training time. See Fig. 8.8. A pre-trained Google Inception v3 was utilized using the Python API. See Fig. 8.9.

8.9.2.4 Results
• Training accuracy and cross-entropy
• Cross-entropy. See Fig. 8.10.
• Accuracy of the training model. See Fig. 8.11.

From the results shown above, it's clear that accuracy increases as the training proceeds. See Table 8.6. Cross-entropy was used as a cost function, which is calculated as

H(x) = H(p) = −Σ_i p(x_i) log(p(x_i))

TABLE 8.6
Classification accuracy for different cut-off values.
Cut-off | Benign | Malignant
0.3     | 0.74   | 0.93
0.4     | 0.83   | 0.89
0.5     | 0.89   | 0.82
0.6     | 0.91   | 0.76

Finally, it can be concluded that Google's Inception v3 model with breast cancer microscopic biopsy images and the trained model performed classification with an accuracy of 0.83 for the benign class and 0.89 for the malignant class.

REFERENCES
1. T. Chandrakumar, R. Kathirvel, Classifying diabetic retinopathy using deep learning architecture, International Journal of Research in Engineering and Technology 5 (6) (2016) 19–24.
2. G. Lim, M.L. Lee, W. Hsu, T.Y. Wong, Transformed representations for convolutional neural networks in diabetic retinopathy screening, in: AAAI Workshop: Modern Artificial Intelligence for Health Analytics, 2014, June.
3. H. Pratt, F. Coenen, D.M. Broadbent, S.P. Harding, Y. Zheng, Convolutional neural networks for diabetic retinopathy, Procedia Computer Science 90 (2016) 200–205.
4. R. Zhu, R. Zhang, D. Xue, Lesion detection of endoscopy images based on convolutional neural network features, in: Image and Signal Processing (CISP), 2015 8th International Congress on, IEEE, 2015, October, pp. 372–376.
5. M.I. Razzak, S. Naz, A. Zaib, Deep learning for medical image processing: overview, challenges and the future, in: Classification in BioApps, Springer, Cham, 2018, pp. 323–350.
6. C. DeSantis, J. Ma, L. Bryan, A. Jemal, Breast cancer statistics, 2013, CA: A Cancer Journal for Clinicians 64 (1) (2014) 52–62.
7. Risk of breast cancer, http://www.breastcancer.org/symptoms/understand_bc/risk/understanding, 2016.
8. Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
9. L.V. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms and Applications, vol. 3, Prentice-Hall, Englewood Cliffs, 1994.
10. N. Tajbakhsh, J.Y. Shin, S.R. Gurudu, R.T. Hurst, C.B. Kendall, M.B. Gotway, J. Liang, Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Transactions on Medical Imaging 35 (5) (2016) 1299–1312.
11. J. Ker, L. Wang, J. Rao, T. Lim, Deep learning applications in medical image analysis, IEEE Access 6 (2018) 9375–9389.
12. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
13. M. Borga, T. Andersson, O.D. Leinhard, Semi-supervised learning of anatomical manifolds for atlas-based segmentation of medical images, in: Pattern Recognition (ICPR), 2016 23rd International Conference on, IEEE, 2016, December, pp. 3146–3149.
14. Z. Feng, D. Nie, L. Wang, D. Shen, Semi-supervised learning for pelvic MR image segmentation based on multi-task residual fully convolutional networks, in: Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, IEEE, 2018, April, pp. 885–888.
15. https://machinelearningmastery.com/transfer-learning-for-deep-learning.
16. https://www.datacamp.com/community/tutorials/transfer-learning.
17. C.K. Shie, C.H. Chuang, C.N. Chou, M.H. Wu, E.Y. Chang, Transfer representation learning for medical image analysis, in: Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, IEEE, 2015, August, pp. 711–714.
CHAPTER 9

Survey on Evaluating the Performance of Machine Learning Algorithms: Past Contributions and Future Roadmap
SYED MUZAMIL BASHA, MTECH • DHARMENDRA SINGH RAJPUT, PHD

9.1 INTRODUCTION
Machine learning algorithms are now involved in more and more aspects of everyday life, from what one can read and watch, to how one can shop, to who one can meet and how one can travel. For example, consider fraud detection. Every time someone buys something using a credit card, machine learning algorithms immediately check the purchase to verify whether or not it might be a fraudulent transaction. They predict whether it's fraudulent or not based on whether that purchase is consistent with the features of the person's previous purchases. Search and recommendation systems are also a vast area of application for machine learning. These machine learning algorithms are at the heart of how commercial search engines work, starting from the moment you begin typing in a query. Besides, search engines typically use data about how you interact with the search site, such as which pages you click and how long you read them, to improve their future effectiveness. Similarly, movie recommendation sites use machine learning algorithms to model what you liked based on your past reviews. There is a fascinating trend happening where ready-to-use machine learning algorithms for speech recognition, language translation, text classification, and many other tasks are now being offered as web-based services on cloud computing platforms, significantly increasing the audience of developers that can use them and making it easier than ever to put together solutions that apply machine learning at a high level.

In general, machine learning algorithms are categorized into two main types. The first type is known as supervised learning, in which our goal is to predict some output variable that's associated with each input item. The output may be a predicted category from a finite number of possibilities, such as fraudulent or not, for a credit card transaction; this is a classification problem within supervised learning, and the function used to perform the classification task is called the classifier. If the output variable we want to predict is not a category, but a real-valued number, like the amount of time in seconds it would likely take a car to accelerate from 0 to 100 kilometers per hour, then it is a regression problem, where we use a regression function. Supervised learning needs to have a training set with labeled objects to make its predictions. Obtaining labels for some problems can be easy or difficult, depending on how much labeled data is needed, on the level of human expertise or expert knowledge required to provide an accurate label, and on the complexity of the labeling task, among other factors. Mostly used platforms include crowdsourcing, Amazon's Mechanical Turk, or CrowdFlower. Any regression problem can be mapped to a continuous member function. Similarly, any classification problem can be mapped to distinct categories. The second major class of machine learning algorithms is called unsupervised learning, in which the input data don't have any labels to go with them. Such problems are solved by finding some useful structure in the input data, in a procedure called clustering. Once we can discover this structure in the form of clusters, groups or other interesting subsets, this composition can be used for tasks like producing a useful summary of the input data, or maybe visualizing the structure. Unsupervised learning allows us to approach problems with little or no idea about the final result.

The first step in solving a problem with machine learning is to find how to represent the learning problem in an algorithm for the computer to understand. The second step is to decide on an evaluation method that provides some quality or accuracy score for the predictions of a machine learning algorithm, typically a classifier. A good classifier will have high accuracy: it will make a prediction that matches the correct, true label a high percentage of the time. The third step is applying machine learning to solve the problem. So, let us discuss these three steps in detail. Converting the problem into a representation that a computer can deal

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00016-6 153
Copyright © 2019 Elsevier Inc. All rights reserved.

FIG. 9.1 Evaluating the performance of different MLA.

with involves two things. First, we need to turn each input object, often called the sample, into a set of features that describe it. Second, we need to pick a learning model, typically the type of classifier that learns from the data. Often, addressing a machine learning task is an iterative process. The present work is focused on finding out the performance of machine learning algorithms and rating them using evaluation parameters, as well as making the reader understand how a model can be fitted on real-time data and how to perform the analysis, as plotted in Fig. 9.1. In the past, authors demonstrated the implementation of two or three algorithms each when addressing problems in classification and regression [1], whereas in the present research work we aim to demonstrate linear and nonlinear versions of regression and classification algorithms for each type of problem using a dataset; the details are explained in the next section. This research work aims to identify the problems faced by beginners who want to learn the concepts of machine learning algorithms (regression and classification), apply them on a real-time dataset (IRIS), fit a model, evaluate the performance of the model, and rate the performance of an algorithm using an evaluation metric like mean squared error, confusion matrix, ROC curve, etc. The experiment carried out in the present research work helps learners in identifying past research contributions and the current research gap, as well as in conducting an experiment when analyzing the performance of an algorithm in the R language. This paper is organized as follows. In the introduction, we highlight the importance of applying MLAs in real time. In the methodology section, we give an introduction to the MLAs with past and future trends, and evaluate the performance of each MLA using evaluation metrics. In the results section, a predictive model is built, and we measure its performance on the Iris dataset in detail.

9.2 METHODOLOGY
The experiments carried out throughout this research use two datasets: Longley's Economic Regression Data, consisting of seven variables observed from 1947 to 1962 and used in predicting the number of people employed yearly [2], and the Iris Dataset available in the R Dataset Package. Longley's dataset is used in evaluating linear and nonlinear regression algorithms, whereas
CHAPTER 9 Survey on Eval. the Performance of MLAs: Past Contributions and Future Roadmap 155

the Iris dataset is used in evaluating linear and nonlinear classification algorithms (data(iris), data(longley)). The R packages and libraries used in the experiment conducted are:
1. library(pls), library(ipred)
2. library(earth), library(randomForest)
3. library(kernlab), library(gbm)
4. library(caret), library(Cubist)
5. library(nnet), library(MASS)
6. library(rpart), library(GGally)
7. library(party), library(mda)
8. library(RWeka), library(klaR)
The evaluation parameter for comparing the performance of linear and nonlinear regression algorithms is the mean squared error. Similarly, in evaluating the performance of linear and nonlinear classification algorithms, we computed a confusion matrix and used precision, recall, and F-score as the evaluation metrics. Weighted fuzzy logic is used in assigning weights to the data while training [3,4], instead of assigning weights manually, towards extracting sentiments from text, by constructing 250 rules using "AND", "OR", "NOT" connectors for each emotion in the FuzzySet application available in MATLAB. The drawback of this research work is that only text data is considered, whereas in [5] a detailed comparison is made among predictive models (moving average with period 3) on a time series dataset. The drawback of this research work is that only a single instance of time series data is considered. In [6], an analysis of the PIMA diabetes dataset is conducted and a prediction is made based on insulin. The drawback of this work is that a generalized linear model is used in deciding the important features, whereas in [7] a gradient ascent algorithm (incremental model) is used to find the exact weights of the terms used in determining the sentiment of a tweet, and a boosting approach is utilized to improve the accuracy of a linear classifier. See Table 9.1.

TABLE 9.1
Comparative study of recent trends in machine learning technology.
Reference | Focused on | Approach | Future scope
[8] | Random forest | Combination with cross-validation | To relieve pedestrian safety issues
[9] | Predictive approach to knowledge and investigation | Reasonable suspicion and collective interests | To update the existing data protection legal framework
[10] | Newton–Raphson maximum-likelihood optimizations as a new large-scale machine learning classifier | Proposed functional networks based on propensity score and Newton–Raphson maximum-likelihood | Mixture models simulation and studies
[11] | ICT-producing firms in the UK | Developed a novel sector-product approach and used text mining to provide further detail on key sector-product cells | Finds ICT employment shares over double the conventional estimates
[12] | To understand the relationships among traditional marketing analytics (TMA) | The knowledge fusion taxonomy | Improving NPS is not automatic and requires strategic choices to obtain its benefits
[13] | A priori-based hierarchical clustering | Presenting SUBSCALE, a novel clustering algorithm to find nontrivial subspace clusters with minimal cost | This algorithm scales very well with the dimensionality of the dataset and is highly parallelizable
[14] | Data preprocessing algorithms | Illustrative study on two different datasets, banana and sonar | To connect and accommodate several data pre-processing algorithms
[15] | Machine learning techniques | Literature survey | To analyze machine learning with modern signal processing technologies
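The evaluation pipeline described in this section — MSE for the regression algorithms, and precision, recall, and F-score derived from a confusion matrix for the classifiers — can be sketched in a few lines. The chapter's experiments are run in R; the following is an illustrative plain-Python version with hypothetical counts:

```python
def mse(actual, predicted):
    # Mean squared error: average of squared residuals.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def classification_metrics(tp, fp, fn, tn):
    # Precision, recall, F-score and accuracy from the four cells
    # of a 2x2 confusion matrix.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))          # small residuals -> small MSE
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```

The counts (40, 10, 5, 45) are made up; in the chapter's experiments the corresponding matrices come from the R classifiers listed above.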

9.3 LINEAR REGRESSION
Linear regression can be applied in two ways: (i) to check whether there exists a statistical relationship between two variables, and (ii) to predict the behavior of unobserved values. We can also represent linear regression as a "line of best fit",

\[ y = \beta_0 + \beta_1 x + \varepsilon, \tag{9.1} \]

where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) denotes the residual. These values can be calculated as

\[
\beta_0 = \frac{\sum y \sum x^2 - \sum x \sum xy}{n\sum x^2 - \left(\sum x\right)^2},
\qquad
\beta_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2},
\tag{9.2}
\]

and the corresponding correlation coefficient is

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}.
\]

In [16], the author attempted to apply the fuzzy linear function to linear regression in representing a vague system. In [17], the authors studied whether a correlation exists among latent curve analysis, multiple linear regression and hierarchical linear modeling. In [18], hierarchical linear regression modeling was used in finding the correlation between mental and physical health in diagnosing cancer among family members. An experiment was carried out using two datasets. The first of them was Longley's Economic Regression Data, consisting of seven variables observed from 1947 to 1962 and used in predicting the number of people employed yearly [2]. A comparison was made between principal component regression (PCR) and partial least squares regression (PLSR), and the measure of performance for each approach was the mean squared error (MSE). We recorded an MSE of 0.0522765 for PCR and 0.05366753 for PLSR, computed as in Eq. (9.3). The same is plotted in Fig. 9.2. Here,

\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(A_i - p_i\right)^2, \tag{9.3} \]

where \(A_i\) represents the actual value of the \(i\)-th point, \(p_i\) the predicted value, and \(n\) the number of instances considered in our experiment.

9.4 NONLINEAR REGRESSION
In a nonlinear regression model, the derivatives depend on one or more parameters, as in the following equation:

\[ y = \beta_0 + \beta_1^2 x, \qquad \frac{\partial y}{\partial \beta_1} = 2\beta_1 x. \tag{9.4} \]

We can determine that the above regression model is nonlinear. From this, it is clear that the model is nonlinear in the parameter and not in the independent variable. A logistic regression model is represented by

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x, \tag{9.5} \]

in which \(dp/d\beta_1\) depends on \(\beta_1\); hence, logistic regression is also called nonlinear regression. Model estimation in nonlinear regression is complex and requires a lot of computation, as it is iterative, with the starting values provided by the user. The software then makes stepwise increments or decrements to improve the model performance. The iterations stop once the model reaches maximum improvement, where further improvement is not possible and the model fits. In most applications, when the user is not sure about the functional form, a nonlinear regression model is applied to fit the exact problem. In [19], a nonlinear regression model was used in representing a fuzzy-in, fuzzy-out system, in which both the inputs and outputs were uncertain, and achieved better prediction accuracy. In [20], the same model was used in removing heavy metals in the field of chemistry. In the experiment, the authors fitted multivariate adaptive regression splines and achieved an MSE of 0.04574327, which is better than for the linear regression models.

9.4.1 Support Vector Machine
Support vector machines (SVMs) are rooted in statistical learning theory, and were developed by Vladimir Vapnik in the 1990s. An SVM looks at the extreme boundaries and draws the edges, often termed hyperplanes, which segregate two classes. Suboptimal decision boundaries can result in misclassification of new data points. The extreme data points help in deciding the limits, called support vectors, and the remaining training data points can largely be ignored. The sum of the shortest positive and shortest negative distances then defines the margin around the hyperplane. For a nonlinear SVM, the cost function can be computed as

\[
\min_{\theta}\; C \sum_{i=1}^{m} \left[ y_i\, c_1\!\left(\theta^{T} x_i\right) + \left(1 - y_i\right) c_0\!\left(\theta^{T} x_i\right) \right],
\tag{9.6}
\]

with the margin constraints \(\theta^{T} x_i \ge 1\) when \(y_i = 1\) and \(\theta^{T} x_i \le -1\) when \(y_i = 0\).

FIG. 9.2 Linear regression models and their performance.
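The closed-form estimates of Eq. (9.2) and the MSE of Eq. (9.3) can be checked numerically. The chapter's experiments are run in R on the Longley data; the sketch below is plain Python on made-up points, so all values are illustrative:

```python
def fit_line(xs, ys):
    # Least-squares intercept and slope per Eq. (9.2).
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    beta0 = (sy * sxx - sx * sxy) / (n * sxx - sx ** 2)
    return beta0, beta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]                       # roughly y = 2x
b0, b1 = fit_line(xs, ys)
preds = [b0 + b1 * x for x in xs]
mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)   # Eq. (9.3)
print(b0, b1, mse)
```

On data that lies exactly on a line, the recovered intercept and slope are exact and the MSE is zero, which is a quick sanity check of the formulas.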

In [21], the authors formulated a new version of SVM for a classification problem with two classes, whereas in [9] a new version of SVM was given that can effectively perform active learning when determining the topic of a particular document. In [10], an SVM was used in developing a computer-aided diagnosis (CAD) system which can perform analysis on MR images, helping in medical interpretation. In [22], a combined SVM and fuzzy inference system was suggested for making the decision of placing vehicles in a lane. In our experiment, the SVM achieved an MSE of 0.1448464, which is far better than the two linear regression models and one nonlinear regression model.

9.4.2 K-Nearest Neighbors
The K-nearest neighbors algorithm can be used for classification and regression. The k-NN classifiers are often called instance- or memory-based supervised learning. The k in k-NN refers to the number of nearest neighbors the classifier will retrieve and use to make its prediction. In particular, the k-NN algorithm has three steps that can be specified. First, when given a new, previously unseen instance to classify, a k-NN classifier will look into its set of memorized training examples to find the k examples that have the closest features. Second, the classifier will look up the class labels for those k nearest neighbor examples. Generally, to use the nearest neighbor algorithm, one should specify four things. First, one needs to define what distance means in the feature space, to properly select the nearby neighbors; e.g., one can use the simple straight-line, or Euclidean, distance to measure the distance between points. Second, one needs to tell the algorithm how many of these nearest neighbors to use in making a prediction. Third, one

has to determine which neighbors have more influence on the outcome. Finally, given the labels of the k nearby points, one has to specify how to combine them to produce a final prediction. The most common distance metric is the Euclidean straight-line distance; the Euclidean metric is a particular case of the more general Minkowski metric. A straightforward way to assess whether the classifier is likely to be good at predicting the label of future, previously unseen data instances is to compute the classifier's accuracy on the test set items. The accuracy is defined as the fraction of test set items whose true label was correctly predicted by the classifier; it gives a more reliable estimate of likely future accuracy for a particular value of k. The best choice of the value of k, that is, the one that leads to the highest accuracy, can vary greatly depending on the data set. In general, with k-nearest neighbors, using a larger k suppresses the effects of noisy individual labels. Consider that there are points \((x_n, y_n)\) in the given dataset and that \(x_1\) is a new point to be classified based on the k value, using

\[
d(x_i, x_j) = \left\| x_i - x_j \right\| = \sqrt{\sum_{k=1}^{d} \left(x_{ik} - x_{jk}\right)^2}, \qquad x_i, x_j \in \mathbb{R}^d.
\tag{9.7}
\]

In [23], Mahalanobis distance was used as a metric in addressing a multi-classification problem and achieved 1.3% test error. In [24], a neighbor-weighted K-nearest neighbor algorithm was proposed, in which big weights were assigned to small-class neighbors, and low weights were assigned to large-class neighbors. In [25], a graphical processing unit (GPU) was used in evaluating the distance between points with the aim of optimizing the time taken to perform the necessary operations. In [26], KNN was used in a Hadoop environment to address a classification problem. In the experiment conducted, KNN achieved an MSE of 0.9259962. From the results obtained, it is clear that in evaluating the performance of KNN, MSE is not the only parameter, and there is a need to learn other evaluation metrics.

9.4.3 Neural Network
The behavior of a neuron with only two states is as in the following equation:

\[
y = \sum_{i=1}^{n} x_i w_i, \qquad \text{output} = \begin{cases} 1 & \text{if } y \ge T, \\ 0 & \text{otherwise}, \end{cases}
\tag{9.8}
\]

where \(x_i\) denotes the inputs, \(w_i\) the associated weights, \(y\) the outcome, and \(T\) the threshold value provided by the programmer while constructing the neural network. In [27], an artificial intelligence system is proposed which can measure heart rate variability with good accuracy. In [28], the authors aimed to study the understanding of the emotions of others, which helps in critical decision making; an NN was used in training the proposed system. In [18], the NN approach was used in forecasting photovoltaic energy in terms of sensitivity, achieving a good accuracy of 99.1% compared to the other MLAs used in the experiment. In [29], a fuzzy inference system was combined with a conventional neural network to control the state of a thermostat. In the experiment conducted, the NN achieved an MSE of 0.0002765618, which is better than for all the nonlinear regression algorithms discussed so far.

9.5 NONLINEAR DECISION TREE REGRESSION
9.5.1 Regression With Decision Trees
Decision trees are used to solve both classification and regression problems in the form of trees that can be incrementally updated by splitting the dataset into smaller datasets (numerical and categorical), where the results are represented in the leaf nodes. The comparative results are presented in Table 9.2, and a complete summary is provided in Fig. 9.3.

Algorithm 1 Greedy decision tree learning algorithm.
1: procedure RegressionWithDecisionTrees(empty tree, data)
2:   Start
3:   Consider an empty tree
4:   Select a feature to split the data. This selection is based on the importance of an attribute, which is evaluated using a GLM model.
5:   For each split of the tree, check whether there is enough information to make the prediction;
6:   else go to step 3 and continue the split.
7:   end for
8:   Stop

9.5.2 Random Forest
Random forest is a versatile machine learning method capable of performing both regression and classification tasks. A summary of random forest performance is presented in Fig. 9.4, and the MSE obtained using this algorithm is 0.2940453, which is better than for all other decision tree algorithms.

FIG. 9.3 Linear regression models and their performance.
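The two-state threshold neuron of Eq. (9.8) amounts to a weighted sum compared against the threshold T; a minimal sketch with arbitrary, illustrative weights:

```python
def threshold_neuron(inputs, weights, T):
    # Eq. (9.8): y = sum_i x_i * w_i; the neuron fires (1) iff y >= T.
    y = sum(x * w for x, w in zip(inputs, weights))
    return 1 if y >= T else 0

# With weights (0.5, 0.5) and T = 1, this neuron behaves as a logical AND.
print(threshold_neuron([1, 1], [0.5, 0.5], T=1.0))  # 1
print(threshold_neuron([1, 0], [0.5, 0.5], T=1.0))  # 0
```

The trained networks in the chapter's experiments learn real-valued weights automatically; this fixed-weight unit only illustrates the thresholding behavior of Eq. (9.8).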

TABLE 9.2
Performance of regression algorithms.
Name of the algorithm | MSE
Decision Tree | 0.4776895
Conditional decision tree | 0.4776895
Model trees | 0.663211
Rule system | 0.663211
Bagging CART | 0.663211
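The split-selection step of Algorithm 1 can be illustrated with a single greedy split on one feature: try each candidate threshold and keep the one that minimizes the children's squared error around their means. The one-dimensional data below is made up, and the GLM-based attribute importance mentioned in the algorithm is omitted:

```python
def best_split(xs, ys):
    # Sum of squared errors of values around their own mean.
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    # Try each midpoint between consecutive sorted x values; keep the
    # threshold whose two children have the lowest total squared error.
    pairs = sorted(zip(xs, ys))
    best_err, best_thr = float("inf"), None
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        err = sse(left) + sse(right)
        if err < best_err:
            best_err, best_thr = err, thr
    return best_thr

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
print(best_split(xs, ys))   # splits between the two groups, at 6.5
```

A full regression tree applies this split recursively to each child until a stopping rule fires, which is the loop in steps 5–7 of Algorithm 1.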

FIG. 9.5 Summary of random forest performance.

FIG. 9.4 Summary of decision tree performance.

9.6 LINEAR CLASSIFICATION IN R
To address the classification problem and understand the performance of MLAs, we experimented with the IRIS dataset collected from the UCI Machine Learning repositories. A visual representation of the complete dataset considered is plotted in Fig. 9.5. The performance of logistic regression is plotted in Fig. 9.6. In all classification problems, the variable which is to be predicted is a variable y that can be either zero or one: either spam or not spam, fraudulent or not fraudulent, malignant or benign. On the other hand, in multi-class problems y may, e.g., take on four values: zero, one, two, and three. This is called a multi-class classification problem. To develop a classification algorithm, consider an example of a training set for classifying a tumor as malignant or benign, and notice that malignancy takes on only two values, zero (no) and one (yes). Applying linear regression to this data set means fitting a straight line to the data. In the experiment conducted, the confusion matrices generated using linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLSDA) are

FIG. 9.6 Visualization of Iris dataset.

FIG. 9.7 Prediction using logistic regression.
FIG. 9.8 Interpreting the confusion matrix.

plotted in Fig. 9.7, whereas for the nonlinear classification algorithms the respective confusion matrices are plotted in Fig. 9.8. The way to interpret the confusion matrix is shown in Table 9.2. The total number of instances in the dataset representing positive (1) and negative (0) is 2014; these are represented in the top-left and bottom-right corners, respectively. The accuracy and other evaluation metrics of the model can be calculated as shown in Fig. 9.9. In order to evaluate the efficiency of any predictive model, one can make use of any of the following methods; the area where the analysis is done and the necessity of applying a particular machine learning algorithm determine the choice of an evaluation technique. In general, a confusion matrix is used in finding out the accuracy of a classification algorithm. A good model will always have high precision and recall. The kappa statistic is defined as

Kappa = (ObservedAccuracy − ExpectedAccuracy) / (1 − ExpectedAccuracy). (9.9)

The best model is chosen based on accurately predicting ones (sensitivity, the true positive rate) and zeros (specificity, the true negative rate). The next step is to combine both measures
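The kappa statistic of Eq. (9.9), along with sensitivity and specificity, follows directly from the four cells of a 2×2 confusion matrix. The counts below are made up for illustration; in the chapter the matrices come from the R classifiers:

```python
def kappa_and_rates(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Expected accuracy of a chance classifier with the same marginals.
    exp_pos = ((tp + fn) / total) * ((tp + fp) / total)
    exp_neg = ((tn + fp) / total) * ((tn + fn) / total)
    expected = exp_pos + exp_neg
    kappa = (observed - expected) / (1 - expected)   # Eq. (9.9)
    sensitivity = tp / (tp + fn)                     # true positive rate
    specificity = tn / (tn + fp)                     # true negative rate
    return kappa, sensitivity, specificity

print(kappa_and_rates(tp=45, fp=5, fn=5, tn=45))
```

Kappa corrects raw accuracy for agreement expected by chance, which is why it is reported alongside accuracy in the results section.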

in evaluating the performance of a model using a receiver operating characteristics (ROC) curve. The area under the ROC curve can be used as an evaluation metric to measure the efficiency of the predictive model. On the other hand, the question arises of how to handle data which is nonlinear in nature; here, applying classification can be done by constructing the confusion matrix. In Fig. 9.8, the performance of the nonlinear classification algorithms can be seen. Similarly, for a decision tree and nonlinear classification, the generated confusion matrix can be analyzed in Fig. 9.10, illustrating the efficiency of the predictive model.

FIG. 9.9 Evaluation metric used in rating the performance of the classification problem.

9.7 RESULTS AND DISCUSSION
The MSE is a measure to rate the performance of a regression algorithm, as in Table 9.2, and classification accuracy is a measure to rate the performance of a classification algorithm, as in Fig. 9.12. The dataset (iris flowers) was used in our experiment with the objective of making the readers understand concepts like providing a summary of the dataset and its feature plot as in Fig. 9.10, evaluating machine learning algorithms using the accuracy and kappa metrics at a 0.95 confidence level, as plotted in Fig. 9.11, and making predictions using a machine learning algorithm, as discussed in the methodology section. We observe that

FIG. 9.10 Analysis of linear classifiers.



FIG. 9.11 Feature plot of Iris dataset.

• The performance of the LDA algorithm in terms of accuracy is near to one.
• The performance of the KNN algorithm in terms of accuracy is next to LDA, as it requires an additional step in finding the centroid value of each cluster.
• The performance of the random forest algorithm in terms of accuracy is next to LDA, as it has a problem with boundary selection while constructing a tree.
• The performance of SVM is 0.9 and it is stable, as the data points are separated based on vector boundaries.

FIG. 9.12 Performance of MLA.

9.8 CONCLUSIONS
In the present research, we aim to make the reader understand the procedure of applying MLAs on both linear and nonlinear data in addressing regression and classification problems. The impact of the present research is to find the different ways to evaluate a machine learning algorithm. The results obtained in this research are applicable to addressing real-time problems like classification and regression. The findings in

our research are that the support vector machine (SVM) performs better than all other classification algorithms, and the neural network (NN) approach gives the lowest mean squared error (MSE) for the regression problem, MSE being a metric used in evaluating the performance of both classification and regression algorithms. We found that ROC is best for regression, and constructing a confusion matrix works best for classification. In the future, we would like to work with data having N dimensions on a distributed platform, with the performance of each of the MLAs evaluated using metrics like throughput, response time, and overload on each machine in a cluster of machines.

REFERENCES
1. L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review 60 (2) (2018) 223–311.
2. R. Durrett, Lecture Notes on Particle Systems and Percolation, Brooks/Cole Pub Co, 1988.
3. S.M. Basha, Y. Zhenning, D.S. Rajput, N. Iyengar, D. Caytiles, Weighted fuzzy rule based sentiment prediction analysis on tweets, International Journal of Grid and Distributed Computing 10 (6) (2017) 41–54.
4. S.M. Basha, D.S. Rajput, Sentiment Analysis: Using Artificial Neural Fuzzy Inference System, IGI Global, 2018.
5. S.M. Basha, Y. Zhenning, D.S. Rajput, R.D. Caytiles, N.C.S. Iyengar, Comparative study on performance analysis of time series predictive models, International Journal of Grid and Distributed Computing 10 (8) (2017) 37–48.
6. S.M. Basha, H. Balaji, N.C.S. Iyengar, R.D. Caytiles, A soft computing approach to provide recommendation on PIMA diabetes, International Journal of Advanced Science and Technology 106 (2017) 19–32.
7. S.M. Basha, D.S. Rajput, K. Vishu Vandana, Impact of gradient ascent and boosting algorithm in classification, International Journal of Intelligent Engineering and Systems (IJIES) 11 (1) (2018) 41–49.
8. X. Jiang, M. Abdel-Aty, J. Hu, J. Lee, Investigating macro-level hotzone identification and variable importance using big data: a random forest models approach, Neurocomputing 181 (2016) 53–63.
9. X. Wu, X. Zhu, G.-Q. Wu, W. Ding, Data mining with big data, IEEE Transactions on Knowledge and Data Engineering 26 (1) (2014) 97–107.
10. E. Elsebakhi, F. Lee, E. Schendel, A. Haque, N. Kathireason, T. Pathare, N. Syed, R. Al-Ali, Large-scale machine learning based on functional networks for biomedical big data with high performance computing platforms, Journal of Computational Science 11 (2015) 69–81.
11. M. Nathan, A. Rosso, Mapping digital businesses with big data: some early findings from the UK, Research Policy 44 (9) (2015) 1714–1733.
12. Z. Xu, G.L. Frankwick, E. Ramirez, Effects of big data analytics and traditional marketing analytics on new product success: a knowledge fusion perspective, Journal of Business Research 69 (5) (2016) 1562–1566.
13. A. Kaur, A. Datta, A novel algorithm for fast and scalable subspace clustering of high-dimensional data, Journal of Big Data 2 (1) (2015) 17.
14. S. García, J. Luengo, F. Herrera, Tutorial on practical tips of the most influential data preprocessing algorithms in data mining, Knowledge-Based Systems 98 (2016) 1–29.
15. J. Qiu, Q. Wu, G. Ding, Y. Xu, S. Feng, A survey of machine learning for big data processing, EURASIP Journal on Advances in Signal Processing 2016 (1) (2016) 67.
16. K.J. Preacher, P.J. Curran, D.J. Bauer, Computational tools for probing interactions in multiple linear regression, multilevel modeling, and latent curve analysis, Journal of Educational and Behavioral Statistics 31 (4) (2006) 437–448.
17. H. Tanaka, K. Asai, S. Uejima, Linear regression analysis with fuzzy model, IEEE Transactions on Systems, Man and Cybernetics 12 (6) (1982) 903–907.
18. K.M. Shaffer, J.M. Jacobs, R.D. Nipp, A. Carr, V.A. Jackson, E.R. Park, W.F. Pirl, A. El-Jawahri, E.R. Gallagher, J.A. Greer, et al., Mental and physical health correlates among family caregivers of patients with newly-diagnosed incurable cancer: a hierarchical linear regression analysis, Supportive Care in Cancer 25 (3) (2017) 965–971.
19. Y.-L. He, X.-Z. Wang, J.Z. Huang, Fuzzy nonlinear regression analysis using a random weight network, Information Sciences 364 (2016) 222–240.
20. B. Nagy, C. Mânzatu, A. Măicăneanu, C. Indolean, L. Barbu-Tudoran, C. Majdik, Linear and nonlinear regression analysis for heavy metals removal using Agaricus bisporus macrofungus, Arabian Journal of Chemistry 10 (2017) S3569–S3579.
21. J.A. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (3) (1999) 293–300.
22. E. Balal, R.L. Cheu, Modeling of Lane Changing Decisions: Comparative Evaluation of Fuzzy Inference System, Support Vector Machine and Multilayer Feed-Forward Neural Network, Tech. rep., 2017.
23. K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10 (Feb) (2009) 207–244.
24. S. Tan, Neighbor-weighted k-nearest neighbor for unbalanced text corpus, Expert Systems with Applications 28 (4) (2005) 667–671.
25. V. Garcia, E. Debreuve, M. Barlaud, Fast k nearest neighbor search using GPU, in: Computer Vision and Pattern Recognition Workshops, CVPRW'08, IEEE Computer Society Conference on, IEEE, 2008, pp. 1–6.
26. J. Maillo, S. Ramírez, I. Triguero, F. Herrera, kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data, Knowledge-Based Systems 117 (2017) 3–15.
27. M. Patel, S.K. Lal, D. Kavanagh, P. Rossiter, Applying neural network analysis on heart rate variability data to assess driver fatigue, Expert Systems with Applications 38 (6) (2011) 7235–7242.

28. Y. Fan, N.W. Duncan, M. de Greck, G. Northoff, Is there a core neural network in empathy? An fMRI based quantitative meta-analysis, Neuroscience and Biobehavioral Reviews 35 (3) (2011) 903–911.
29. S. Leva, A. Dolara, F. Grimaccia, M. Mussetta, E. Ogliari, Analysis and validation of 24 hours ahead neural network forecasting of photovoltaic output power, Mathematics and Computers in Simulation 131 (2017) 88–100.
CHAPTER 10

Miracle of Deep Learning Using IoT


RAMGOPAL KASHYAP, PHD

KEY TERMS & DEFINITIONS

DM. Data mining is the path toward discovering outlines in broad educational lists including procedures at the intersection
purpose of machine learning, bits of knowledge, and database frameworks. It is a fundamental method where watchful
systems are associated with removing data patterns. It is an interdisciplinary subfield of PC science. The general goal
of the data mining process is to expel information from an instructive gathering and change it into a legitimate structure
for help utilization. Beside the rough examination step, it incorporates database and data organization perspectives, data
pre-getting ready, model and determination thoughts, interesting quality estimations, versatile quality considerations,
post-treatment of discovered structures, portrayal, and web based refreshing. Data mining is the examination adventure
of the “learning exposure in databases” process, or KDD.
Deep Learning. Deep learning is a machine learning technique, machine learning itself being a subset of the broader field of artificial intelligence. Deep learning is a class of machine learning algorithms that uses several layers of nonlinear processing units for feature extraction and transformation, with each successive layer using the output of the previous layer as input. Deep neural networks, deep belief networks and recurrent neural networks have been applied to fields such as computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, and bioinformatics, where they have produced results comparable to, and in some cases better than, what human experts have achieved.
Internet-of-Things. The Internet-of-Things (IoT) refers to a network of physical components capable of gathering and sharing electronic data. The Internet-of-Things encompasses a wide variety of "smart" devices, from industrial machines that transmit data about the production process to sensors that track information about the human body. Frequently, these devices use the internet protocol (IP), the same protocol that identifies computers on the internet and enables them to communicate with each other. The goal behind the Internet-of-Things is to have devices that report in real time, improving efficiency and bringing important information to the surface more quickly than a system depending on human intervention.

10.1 INTRODUCTION

IoT is pervasive and used in all parts of our lives. Governments around the globe use IoT to gather data from different sectors and to improve health, transportation, security, and development. Organizations use IoT to provide better service to their customers or to enhance health and safety in the workplace. Individuals also use IoT to organize and manage their lives; Amazon, for example, offers smart IoT devices capable of interacting with people by voice: Echo can be asked to give urgent weather updates, schedule alerts, play music, or fetch news feeds from various sources on the web. IoT is a technology by which billions of smart objects or devices, known as "things", use several kinds of sensors to gather data themselves and also collect information from other smart "things". They can then pass this information to authorized parties to fulfill various needs, for example controlling and monitoring modern workplaces or improving business operations. There are 3 billion devices connected to the Internet right now, and this number is expected to grow to 20 billion by 2020. While these devices make our lives safer and more convenient, they also increase the number of possible attack targets available to hackers. It therefore becomes essential to protect these devices from adversaries and unauthorized access. This chapter aims to develop an intrusion detection system for recognizing application layer attacks against IoT devices using a combination of clustering and decision tree (DT) algorithms. Typically, an IoT device gathers sensor data and sends it through gateways to a central system for further processing. Anomalies in the data sent by IoT devices can then be detected at the application layer in the central systems. The objectives of this study are:
Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00017-8 165
Copyright © 2019 Elsevier Inc. All rights reserved.

1. To thoroughly understand IoT and how it works;
2. To thoroughly understand anomalies in its processes;
3. To develop a model for intrusion detection based on clustering and DT and apply it to IoT;
4. To evaluate the model;
5. To offer proposals for improving the model.

IoT is a promising technology, but it is still in its infancy and faces many hurdles [4]. First, there is no standard framework for IoT systems. Since IoT is still being developed, vendors rush to make objects that are incompatible with other vendors' things in order to gain financial benefit and lock customers into their own products. Moreover, IoT is heterogeneous, which makes cooperating on, managing and testing IoT concepts a hard task for the partners involved [1]. IoT objects use different communication protocols over different networks (e.g., GSM, WAN, WSN, Bluetooth). Given the number and diverse types of IoT objects, as well as their constrained hardware capabilities, it is relatively hard to deploy and maintain adequate security among the devices. This pushes towards centralized and decentralized network-based authentication, anomaly detection and protection mechanisms.

10.2 INSPIRATION

Decision Support Systems is the term used for data-based assistance to automated recommendations meant to drive decisions and support a manager. At a very basic level, the data streams from several sources are assessed by models inside a computer program. The incoming data stream usually originates from databases and results, e.g., in the form of a report which is visualized in a straightforward way. The analysis can be described as a mix of statistical models, learned practices, model checking and quantitative analysis that essentially uses computation to support decisions through purpose-built computer programs, which is called continuous analytics. This analytics typically passes through two main periods. The first period is mostly focused on consolidating internal data and transforming it into an organized form. The second period is driven by the now familiar phenomenon of big data and is characterized by the development of new systems and data types and the integration of data streaming from external sources. In the reference section, a sample visual layout of a descriptive program can be found. The way to improve business processes and, consequently, decisions of any kind is to process the data to be analyzed; this can save essential resources, valuable time and money. Computationally, the core of a descriptive program should be able to distinguish important data from unhelpful data.

Analysis is a part of every process, and this chapter also shows how the computations may work. Within an IoT based ecosystem, analytics surveys and reviews streamed data continuously, including historical data from databases, to achieve the best results [8]. Since different kinds of data streams need to be handled in an informative system, in-streams are inspected in packets or batches; the result is an out-stream with meaningful information. Although analytics appears to be a useful tool, there may be areas where it cannot be applied. The clearest example is innovation, because analytics is simply not able to access data from events that have not yet happened, so decisions must rely, for instance, on similar endeavors executed previously. Also, certain quantitative models cannot cover specific innovative fields. This places decision making in a vague framework. In order to handle this issue and form a complete picture of decision making based on analytics, intuition and personal experience are required, and thus qualitative data [5].

Besides, the interpretation capabilities provided by the supporting semantic matchmaking make it possible to justify expected behavior, thereby increasing trust in a system's response. If pervasive micro-devices are capable of onboard processing of the locally collected data, they can describe themselves and the context in which they operate to external devices and applications. This would improve interoperability and versatility, enabling pervasive knowledge-based systems with a degree of automation not yet allowed by conventional IoT infrastructures. Machine learning [22], the Internet-of-Things and big data depend on each other while performing their impressive work, as shown in Fig. 10.1.

Two important results follow. First, the human–computer interaction could be improved by reducing the user effort required to benefit from managing the systems. In conventional IoT settings, a user explicitly interacts with one device at a time to perform a task [10]. By contrast, user agents running on mobile computing devices should be able to orchestrate simultaneously the various embedded micro-components, providing users with context-aware task and decision support. Moreover, machine learning frameworks, algorithms and tools have attracted novel classes of analytics, which are strikingly relevant for the big data Internet-of-Things perspective.
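The clustering-plus-decision-tree idea stated in the objectives above can be sketched in a few lines. The traffic features, data points and thresholds below are hypothetical illustrations, not the chapter's actual dataset: normal and flood-like traffic samples are clustered, and a decision-stump rule derived from the cluster centroids flags high-rate requests as attacks.

```python
# Sketch of the clustering + decision-tree IDS idea. All numbers are
# made-up example data: each sample is
# (requests per minute, average payload size in bytes).

def kmeans_2(points, iters=20):
    """Minimal 2-means clustering on 2-D points."""
    c0, c1 = points[0], points[-1]          # crude init: first and last sample
    g0, g1 = [points[0]], [points[-1]]
    for _ in range(iters):
        g0, g1 = [], []
        for p in points:
            d0 = (p[0] - c0[0])**2 + (p[1] - c0[1])**2
            d1 = (p[0] - c1[0])**2 + (p[1] - c1[1])**2
            (g0 if d0 <= d1 else g1).append(p)
        c0 = tuple(sum(x) / len(g0) for x in zip(*g0))
        c1 = tuple(sum(x) / len(g1) for x in zip(*g1))
    return c0, c1, g0, g1

normal = [(10, 200), (12, 210), (9, 190), (11, 205)]    # benign traffic
attack = [(95, 20), (100, 25), (90, 18)]                # flood-like traffic
c0, c1, g0, g1 = kmeans_2(normal + attack)

# Decision-stump "tree": split on request rate halfway between centroids.
threshold = (c0[0] + c1[0]) / 2

def is_attack(sample):
    return sample[0] > threshold            # assumes attacks are high-rate

print(is_attack((97, 22)), is_attack((10, 200)))
```

A real IDS would use many more features and a full decision tree learned from labeled traffic; this sketch only shows how the clustering and decision-rule pieces fit together.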

FIG. 10.1 Machine learning interoperability.

The exploitation of rule-based and anomaly detection mechanisms using non-standard reasoning compensates for possible faults in data acquisition, device unreliability and the limited quality of wireless communications, making adaptive IoT infrastructures highly flexible whatever the application [11].

Big data analytics is a rapidly developing research area, spanning the fields of computer science and information management, and it has become a ubiquitous term for understanding and solving complex problems in different disciplinary fields such as engineering, applied mathematics, medicine, computational biology, healthcare, social networks, finance, business, government, education, transportation and telecommunications. The utility of big data is found, above all, in the area of the IoT. Big data is used to build IoT architectures which include things-driven, data-driven and service-driven designs, as well as cloud-based IoT. Advanced IoT topics include sensors and radio, pattern recognition, low power and energy harvesting, and sensor networks, while IoT services essentially involve semantic interoperability, security and privacy, protocols, and design patterns of smart services [9]. To effectively combine big data and communicate among devices using IoT, machine learning techniques are used. Machine learning extracts value from massive data using well-known approaches, which include classification, clustering, Bayesian networks, decision trees and random forests, support vector machines, reinforcement learning, ensemble learning, and deep learning.

10.3 DECISIONS IN AN AREA OF DEEP LEARNING AND IOT

In this section, the subject of data-driven decision making is reviewed in more detail. The central figure in the process still remains a human, but the decision, its details and the choice among alternatives rely on factual data and evidence rather than on habit, long experience or intuition. Telecommunication companies and the financial industry successfully applied such systems during the 1990s to evaluate the enormous amounts of data they had aggregated; these systems supported trading, targeted marketing, fraud detection and credit scoring. Matching the pace of networking growth among organizations, data-driven decision making advanced as a result of the continuous progress in information technology. The past couple of years have seen a rapid growth of artificial intelligence (AI) technologies combined with machine learning, which uses algorithms to satisfy needs in a more accurate and largely automated way. With this in mind, the potential of advancing decision making is apparently endless. Large companies deploying data-driven decision making include Amazon and Google: while Amazon profits from data by offering well-targeted product recommendations, Google aims at making decisions based entirely on aggregated data.

Data is gathered, stored, analyzed and transformed; the final information, which is the outcome of the data gathering process, can be described as the foundation of effective decision making. In terms of an evidence-based management approach, this information is a significant part of decision making.

10.3.1 IoT Reference Model

At present, there is no standard IoT model; Cisco's proposal is one of the most sensible ones, laying out the IoT levels in a visually appealing way. Cisco's IoT reference model is based on information flow and may be seen as a foundation for a better understanding of IoT architecture levels and their potential. Level 1 includes devices which are equipped with particular sensors and are neither location bound nor significantly size constrained; they produce data and can be controlled. Level 2 comprises the networks which enable the communication between devices, different systems and, additionally, Level 3; this is done by encoding, switching and routing. Level 3 is where data is converted into storable information by distributed analytics; the analysis here includes, for instance, evaluation, sorting and reduction. Level 4 is where data is stored, remains dormant and is prepared for further use. In Level 5, data is moved and merged into various formats so applications can use and read it; this means the data has now become information. In Level 6, information is interpreted by a particular application; applications range from control applications to BI and analytics. Level 7 involves processes and people, triggered to take action and execute according to requirements based on the delivered information. With this model it becomes easier to appreciate the scope and the stages within an IoT ecosystem, and this step is fundamental since the following parts of this chapter build on these concepts.

FIG. 10.2 Machine learning method case-wise result comparison.

Fig. 10.2 demonstrates how different machine learning algorithms give differing results, for which a correct hypothesis is required. Statistical approaches to anomaly detection include Gaussian distributions and examining how far away a data point is from the overall data mean. The Z-score is a commonly used metric for parametric distributions of this kind, relying on the mean and standard deviation of the data.

When running a deep learning analysis on smartphones, must designers adapt to a more resource-constrained setup? Image analysis on resource-constrained platforms can drain critical processing and memory resources. For example, the Spot Garbage application uses convolutional neural networks to detect garbage in images, yet it uses 83% of the CPU and takes five seconds to respond. Fortunately, recent advances in network compression, approximate computing, and hardware accelerators are enabling deep learning on resource-constrained IoT devices. Deep learning demands substantial computing power, which can be a limited resource on many IoT devices. Researchers have identified three distinct approaches to accommodate resource-constrained devices while reducing deep learning's appetite for CPU, memory, and power [2].

Network compression is the process of converting a densely interconnected neural network into a sparsely connected one. This approach does not work with all deep learning frameworks but, when properly applied, it can reduce both memory and computation loads. By carefully pruning redundant connections, the computational load can be reduced by a factor of 10 without excessively impacting accuracy. One study found another significant advantage of small network models: these compressed models can run on widely used IoT platforms, including those from Qualcomm, Intel, and NVidia. Another approach to reducing the computational load is to compute approximately, rather than exactly, the values of neurons with negligible effect on model accuracy. For example, in one study on low power design, researchers obtained good results by reducing the number of bits per neuron by 33% to 75% in low-impact neurons; this change did not reduce accuracy. While sensible for some uses, approximate computing is not suitable for applications requiring high precision.

A third way to bring deep learning to IoT devices is to deploy special accelerator hardware.
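The Z-score test described earlier — flagging points far from the data mean in units of standard deviation — can be sketched as follows; the sensor readings and the cut-off value are made-up examples.

```python
# Minimal Z-score anomaly test on a univariate stream of sensor
# readings; the data and the 2-sigma cut-off are illustrative only.
import statistics

def z_scores(xs):
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

readings = [20.1, 19.8, 20.3, 20.0, 19.9, 35.0]   # last value is anomalous
flags = [abs(z) > 2.0 for z in z_scores(readings)]
print(flags)
```

Gaussian-based scores like this assume the data is roughly normally distributed; heavy-tailed traffic would call for a robust variant (e.g., median-based deviations).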

FIG. 10.3 Machine learning applications.

Hardware accelerators typically support both convolutional and deep neural networks, as well as optimized memory utilization. The downside of this approach is that it requires a specific device. Hardware-based accelerators are most suitable for high-value applications that require low power use and fast computation. Deep learning is thus not out of reach for IoT devices: network compression, approximate computing, and hardware accelerators all enable running large neural networks on devices with limited CPU, memory, and power. As with any specialized technique, there are obstacles and trade-offs. Network compression and approximate computing are suitable for applications that can tolerate some loss of accuracy; both methods can be combined with hardware accelerators to further improve performance, at the cost of requiring special hardware. Network compression and approximate computing both reduce the amount of data used in computations and should be combined with caution. Our knowledge of deep learning on IoT devices is still limited, and one should proceed carefully in several areas.

10.3.2 Rudiments of Machine Learning

Machine learning is a branch of artificial intelligence which aims to build systems capable of learning from past experience. In particular, unlike expert systems, ML algorithms and methods are typically data-driven, inductive and general in nature; they are designed and applied with a focus on predictions or decisions in some broad class of tasks, e.g., spam filtering, handwriting recognition or activity detection. Three basic classes of ML problems exist: classification, involving the assignment of a new test sample to one of a set of possible categories, e.g., deciding whether an email message is spam or not, as illustrated in Fig. 10.3, and thus having a discrete n-ary output — this is the main problem considered in this chapter; regression, defined as the estimation of the relationship between a dependent variable and one or more independent variables, e.g., predicting the purchase cost of a house considering its size, age, location and other features [12] — in general, both the inputs and the target can vary in continuous ranges; and clustering, i.e., partitioning a set of observations into groups, which maximizes the similarity of samples within each group and the dissimilarity between groups. Several pattern recognition problems rely on clustering.

The execution of an ML system typically involves two essential stages: training and testing. In the training stage, the underlying ML algorithm inductively builds a model of the particular problem from labeled data [13]. Each available dataset to be classified is separated into a training set for model building and a test set for validation. Several procedures exist for properly picking the training and test partitions. Among others, in k-fold cross-validation the dataset is split into k subsets of comparable size; one of them is used for testing and the remaining k − 1 for training. The procedure is repeated k times, each time using a different subset for testing. The less elaborate holdout method, instead, splits the dataset randomly, usually assigning a larger share of the samples to the training set and a smaller one to the test set [17].
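The k-fold procedure just described can be sketched directly; here with a toy dataset of 10 samples and k = 5, where a simple striped split forms near-equal folds.

```python
# Sketch of k-fold cross-validation splitting: the dataset is divided
# into k near-equal parts; each part serves once as the test set while
# the remaining k-1 parts form the training set.

def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]        # k near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(10))                            # toy dataset of 10 samples
for train, test in k_fold_splits(data, 5):
    assert len(test) == 2 and len(train) == 8
    assert sorted(train + test) == data           # no sample lost or reused
```

The holdout method mentioned above corresponds instead to a single random split, e.g., 80% of the samples for training and 20% for testing.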

FIG. 10.4 Big data and Internet-of-Things processing model.

Fig. 10.4 demonstrates how information is organized, in particular for the Attributive Language with unqualified Number restrictions (ALN) description logic. It provides sufficient expressiveness to support the modeling process, while giving polynomial complexity for both standard and non-standard inference services [19].

10.3.3 Algorithms for Efficient Training

Deeper models with more parameters seem to achieve better accuracy. The drawback of using a deeper model, however, is a longer training time. With batch normalization, the mean and variance of the internal nodes can be adjusted to be favorable for training [14]. This permits a higher learning rate, which enables faster training. Batch normalization also regularizes the model, reducing the need for other regularization techniques such as dropout [16]. A training technique referred to as dense–sparse–dense (DSD) proceeds in three phases: it first trains the network with many parameters, then with few parameters, which amounts to pruning, and then again with many parameters. The authors showed better accuracy for the network trained with DSD than with standard training. The essence of this approach is to learn the important parameters first, and then learn the remaining parameters together with the already learned important ones. Another technique for efficient training uses low precision arithmetic, which is computationally cheaper than its floating point counterpart; dynamic fixed-point arithmetic can help lower the precision used during training.

10.3.4 Secure Deep Learning

Various hardware supported strategies have been used to accelerate training, for instance distributed processing and GPU accelerated computing. Large neural networks and other machine learning classifiers have turned out to be vulnerable to small perturbations of their inputs. Such perturbations can be imperceptible to the human eye; hence, it is possible to make the system fail without any noticeable changes [20]. Image classification using deep learning aims at comprehensively understanding and distinguishing objects. For power-hungry edge devices, managing the trade-off between power use and the quality of the received images is of particular concern. Fine controllability of the camera parameters enables smart control of sensor data generation to maximize the information obtained under a power budget. For example, region-of-interest (RoI) based coding at the sensor can be enabled by controlling the spatial resolution of the camera, trading a loss of information for constrained energy use. Furthermore, inherent image sensor noise must be considered for effective image classification on low-end devices [21]. Several prior works have examined the impact of low-quality images on image classification. Conventional denoising techniques can be used to enhance the quality of the photos as a preprocessing step, and super-resolution can be applied to low-resolution images. However, applying such preprocessing to every input causes unnecessary processing of the clean images and results in degraded accuracy on them. Data augmentation approaches can also be used for both noisy and low-resolution images.

The proportion of clean and perturbed images and the loss definition used during training determine the performance for each data distribution. The trade-off between the accuracy on clean and on noisy images must be considered. Also, studying an object detection network that is robust to such degradation is crucial, as there are almost no works on object detection for noisy and low-resolution images.

10.3.5 Robust and Resolution-Invariant Image Classification

Image classification using deep learning aims at comprehensively distinguishing objects. For power-hungry edge devices, it is essential to manage the trade-off between the energy and the quality of the received picture. RoI based coding is becoming a standard for controlling the energy–quality trade-off in resource-poor edge devices. Moreover, unavoidable image sensor noise must be considered for effective image classification on low-end devices. The objective of this study is to improve the robustness of a classifier against such disturbances [18]. Several prior works have inspected the impact of low-quality images on image classification. There are two different approaches to improving classification accuracy. One is to remove the disturbance itself before performing classification; a two-phase training of a partially coupled network for LR image classification analyzed the effect of noise and image quality degradation on the accuracy of a network. The other is to re-train/fine-tune the network with data augmentation.

10.3.6 Planning With Flawless and Noisy Images

Training with clean plus noisy images and a rotation loss demonstrates a good trade-off between training only with clean images and training with noisy images. The rotation loss alone increases accuracy for both clean and noisy images compared with the pure data augmentation approach. This shows that an additional loss with combined embedding provides better regularization than training only with augmented data when dealing with pixel-level noise. Low-resolution (LR) images are produced with sub-sampling factors of either 2 or 4. After down-sampling, the images are up-sampled with the "nearest" or "bicubic" method. A randomly picked sub-sampling factor and up-sampling method are used per image during training. The resulting LR images, together with the original high resolution (HR) images, are mixed during training with/without the rotation loss. Again, networks trained with clean images and with LR images are compared on the test accuracy for clean and LR images. In this case too, training with LR images clearly improves accuracy for LR images compared with training only with clean images, on both MNIST and CIFAR10. When training only with LR images, for MNIST, the result is good accuracy for clean images as well as LR images; this is because the LR images retain adequate features that generalize well to clean HR images. For complex images like CIFAR10, training only with LR images hurts accuracy for clean HR images, since LR images lose features. In all cases, the accuracy for LR images does not reach that for clean images, owing to the loss of features in LR images. When training with both clean and LR images, interestingly, the networks show better accuracy improvements for both clean and LR images compared with the network trained with clean images only. This suggests that data augmentation with LR images also serves as a good regularizer for the clean images. However, it shows reduced accuracy for LR images compared with the network trained only with LR images: because that network has seen more LR images during training, it performs better on them. Again, the rotation loss improves accuracy for clean and LR images compared with the network trained without it.

A conventional neural network (CNN) contains three sorts of layers: an input layer, hidden layers, and an output layer. The input and output layers are single layers, while the hidden part can have a multilayer structure. The number of hidden layers depends on the size of the data set. Commonly, such a network contains 2–4 hidden layers, while an intricate system may have more than 10. Each layer includes a set of nodes, regarded as the basic units of the neural network, called neurons. Their input comes from the output of the previous layer, with the exception of the input layer, whose input is the raw data. Likewise, the output of one neuron is fed to the next layer as input. As a basic unit, a neuron executes a sequence of simple linear and nonlinear computations, as follows. After building the model structure, the next step is to train the NN's parameters [6]. Raw data passes through the whole network and makes the output units deliver the final results, which are compared with the desired outputs by a loss function. Here we pick the Softmax function to determine the labels of the input sample. The idea of "Softmax loss" is straightforward: given an input sample x(i) with the true label k, the loss function focuses only on the predicted probability of the kth class.
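The "Softmax loss" just described can be written out in a few lines; the logits below are made-up numbers for a 3-class problem, and the loss for a sample with true label k is simply −log of the k-th softmax probability.

```python
# Minimal softmax cross-entropy ("Softmax loss"): for a sample with
# true label k, the loss is -log of the predicted probability of
# class k. The logits are hypothetical network outputs for 3 classes.
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_loss(logits, k):
    return -math.log(softmax(logits)[k])

logits = [2.0, 1.0, 0.1]
print(softmax_loss(logits, 0))           # smallest loss: class 0 has the largest logit
```

Note that the other classes enter only through the normalizing sum in the denominator, so the loss term itself involves only the true class's probability.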

of the kth class. For the anticipated probabilities of dif- of a PCA are frequently related to segment scores, and it
ferent classes, since the consequences of capacity I(•) was contended that each institutionalized unique vari-
are 0, their misfortune is equivalent to 0. Basically, the able could increase the segment score. A PCA-based
NN’s preparation stage is to finely tune model’s param- edge-safeguarding highlights’ technique was connected
eters, to such an extent that its misfortune would reduce to hyper spectral picture grouping. In-arched detailing
to a global minimum. for kernel PCA was examined and was connected to
semi-supervised learning.
10.3.7 Smart and Fast Data Processing
In this section, to are occupied with joining the benefits
of the above two classifications, in particular, to keep the 10.4 SIMULATION RESULTS AND
physical highlights of the first information, yet utilize PERFORMANCE ANALYSIS OF
a straight change in the information subset determina- HANDWRITTEN DIGITS RECOGNITION
tion process. We attempt to target two indices: substan- IN IOT
tial scale informational index and mass informational Handwritten digits fill in as an interface among human
index. For expansive scale informational index, we pro- and IoT gadgets, which can be represented in the ac-
pose to utilize SVD-QR for the information subset de- companying models. (1) Smart telephones are most
termination, where SVD is utilized to sort the solitary prominent IoT gadgets, and many advanced mobile
qualities and comparing particular vectors, and the span phones utilize written by hand digits/letters in the order
of the information subset could be resolved depending in their touch screens. It’s alluring that written by hand
on solitary qualities; and QR is utilized to choose which digits/letters in order could be 100% perceived contin-
information tests ought to be chosen as a contribution uously. (2) In numerous independent frameworks, for
to profound learning [3]. The SVD is a straightforward example, self-ruling mail scanner and wholesalers, man-
change; anyway, the QR decides on the list of informa- ually written digits’ acknowledgment is basic to perceive
tion subsets to be chosen, which makes the information postal division and beneficiary telephone number so
subset chose the same highlights as the first informa- the postal courses could be resolved and beneficiaries
tional collection. For profound learning with huge in- can get instant messages on the conveyance status of
formation input, say a framework with the size of thou- sends. Self-governing frameworks are vital parts of IoT.
sands times thousand, how to stretch out the SVD-QR (3) In an observation framework, transcribed/printed
strategy to huge information frameworks is not clear. digits’ acknowledgment is essential to recognize cer-
A noteworthy test in monstrous information preparing tain elements, for example, vehicles, and reconnaissance
is to broaden the current practice of taking a shot at a frameworks are mission-basic IoT. (4) In instructive
single machine and medium or vast size information IoT, for example, machine-based evaluating framework,
preprocessing, particularly thinking about real-world written by hand digits/letter sets’ acknowledgment will
frameworks and compositional requirements. Numer- help to rapidly review undergraduates exams and dis-
ous methodologies on gigantic information preparing charge instructors’ outstanding burden. This is excep-
center around dimensionality reduction that process the tionally constrained to numerous decisions issues and
dimensionality-reducing mapping. A prime case of this it’s tedious for undergraduates. (5) In money related
methodology is irregular projection strategies, which IoT, for example, self-ruling investor, written by hand
select the mapping indiscriminately. Several different digits in bank checks must be related to 100% exact-
methodologies regularly tailor the mapping to a given ness to guarantee trustable exchanges, which makes 24
smaller size informational collection, for example, in- hour management of an account conceivable. In this
formation mindful dimensionality decrease strategies. chapter, manually written digits’ acknowledgment is
The mapping isn’t pre-decided, yet information sub- utilized in our reproduction [15]. We apply SVD-QR
ordinate. The PCA calculation utilizes the information pre-handling and LMSVDQR pre-preparing for pro-
to register the mapping, and the mapping is genuinely found learning neural systems manually written digits’
time-shifting since the information is changing, so PCA (from 0 to 9) acknowledgment. Transcribed letters in
can distinguish the hidden structure of the information. order acknowledgment will be considered in our fu-
A PCA calculation is dependent on a streamlined neu- ture works. Reproduction results for SVD-QR approach
ron model, which was a solitary neuron with Hebbian- contain 5000 preparing precedents of transcribed dig-
type learning for the association weights. Regularly, PCA its. Each preparation/testing model contains a 20 × 20
should be possible by SVD or eigenvalue decomposi- pixels grayscale picture of the digit from 0 to 9, and ev-
tion of an information covariance grid. The aftereffects ery pixel is addressed by a floating point number (from
CHAPTER 10 Miracle of Deep Learning Using IoT 173
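The SVD-QR subset selection used for preprocessing (Sect. 10.3.7) can be sketched in a few lines of NumPy. This is a minimal illustration under assumed conventions (samples stored as columns, an energy threshold fixing the subset size, and a greedy column-pivoted QR pass), not the authors' implementation:

```python
import numpy as np

def svd_qr_select(X, energy=0.5):
    """Pick a representative subset of the columns (data samples) of X.

    The SVD sorts the singular values; the subset size k is set by an
    energy threshold on the singular value spectrum. A greedy
    column-pivoted QR pass over the top-k right singular vectors then
    decides WHICH samples to keep, so the selected subset retains the
    physical features of the original data.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy)) + 1
    W = Vt[:k, :].copy()              # k x n; columns correspond to samples
    picked = []
    for _ in range(k):                # greedy column-pivoted QR
        j = int(np.argmax(np.sum(W**2, axis=0)))  # pivot: largest column
        picked.append(j)
        q = W[:, j] / np.linalg.norm(W[:, j])
        W -= np.outer(q, q @ W)       # deflate: orthogonalize remaining cols
    return np.sort(np.array(picked))

# Example: 400 features x 1000 samples; keep only the selected samples.
X = np.random.randn(400, 1000)
idx = svd_qr_select(X, energy=0.5)
X_subset = X[:, idx]
```

The QR pivoting returns indices into the original sample set, so `X_subset` consists of actual data samples rather than transformed coordinates.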

The 20 × 20 grid of pixels can be vectorized into a 400-dimensional vector, so a matrix can be constructed where each of these training examples becomes a single row. This gives us a 5000 × 400 matrix where each row is a training example of a handwritten digit (0 to 9) image. The second part of the training set is a 5000-dimensional vector that contains the labels (the true digit from 0 to 9) for the training set. In total we have 5000 examples of handwritten digits in the database, and every digit (from 0 to 9) has 500 examples.

10.5 AN INTELLIGENT TRAFFIC LOAD PREDICTION
A real Internet-of-Things deployment is fundamentally heterogeneous, and software defined networking (SDN) is a celebrated technique used in the IoT to manage heterogeneous resources and architectures. In such an SDN-IoT, heterogeneous devices sense and gather data in the sensing plane, and then send the data to the gateway, after aggregation, through switches in the data plane. With the increasing number of devices, the load of aggregated traffic in the switches may become significantly heavy, and different channels need to be evenly allocated to each link to balance the load. Since high interference exists between non-orthogonal channels and the number of orthogonal channels is limited, the partially overlapping channel (POC) can be a good solution to decrease interference and improve network throughput [7]. The existing POC algorithms mostly focus on the optimization of network performance after channel assignment; however, they do not consider the throughput wasted due to suspended transmissions during the channel assignment process. With the high dynamics of the current IoT, the allocated channels need to be changed frequently to adapt to the dynamically changing network traffic. This dynamic change poses a basic requirement of quick processing on the channel assignment [25]. To deal with this issue, an anti-coordination based POC assignment (ACPOCA) can efficiently reduce the number of iterations of the channel assignment process and improve the network throughput. However, without a central controller, both the signaling and the suspension time of the network are limited by the distributed setting. Accordingly, we address such challenges, in the first part of this chapter, using a deep learning based intelligent POC assignment algorithm with the centralized SDN. The contributions of the deep learning based proposal can be explained from two perspectives. First, with the central control paradigm of SDN, switches no longer have to exchange their channel states; all channel assignment procedures can be done in the local controller. In this way, the signaling overhead of the network is significantly reduced. Second, since the deep learning approach can learn from previous channel assignment processes through training with the data gathered from existing channel assignment algorithms (e.g., ACPOCA), the channel assignment can be done in a single iteration.
First, we use powerful deep learning to predict the complex traffic, which achieves over 90% accuracy with a quick response time (5 ms in the three different prediction techniques), and we further compare the prediction accuracy in three different control frameworks. The result shows that the prediction accuracy of the centralized SDN based prediction is always superior to that of the two other frameworks [23]. Finally, with the centralized SDN control, we combine the deep learning based traffic prediction and the channel assignment, using the predicted traffic load as the guideline to perform intelligent channel assignment. Such intelligent channel assignment, which we refer to as TP-DLCA, can effectively improve the accuracy and the processing speed of channel assignment. The simulation results demonstrate that both the throughput and the delay in the SDN-IoT with our proposal are superior to those of conventional algorithms.

10.6 PERFORMANCE OF DEEP LEARNING BASED CHANNEL ASSIGNMENT
This section compares the learning performance of POC under different learning structures and different learning parameters. Then we consider the POC accuracy of our proposal. Finally, we compare the throughput of deep learning based POCA (DLPOCA) against conventional channel assignment algorithms (i.e., orthogonal channel assignment, POC, AC-POCA). To contrast the training accuracy of different learning structures, we first use a deep belief network (DBN) with 2 and 3 hidden layers, where the number of nodes in each layer is set to 20 and 100. Then, we change the DBN structure into a deep CNN with 1 and 2 convolution layers and 2 fully connected layers, respectively. In the CNN, we set the convolution kernel size to 3 × 3, the number of nodes in the fully connected layer to 100, the number of channels in the convolution layer to 20, and the padding and stride to 1. Then, we compare the different training structures in various network settings.
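The CNN configuration described in Sect. 10.6 (3 × 3 convolutions with 20 channels, padding and stride set to 1, and a 100-node fully connected layer) can be sketched in PyTorch. The input size (one channel, 20 × 20) and the number of output classes are illustrative assumptions, not values from the chapter:

```python
import torch
import torch.nn as nn

class ChannelAssignCNN(nn.Module):
    """Minimal sketch of the described deep CNN: 1 or 2 conv layers
    (3x3 kernel, 20 channels, stride 1, padding 1) followed by two
    fully connected layers with 100 hidden nodes."""

    def __init__(self, in_side=20, n_classes=10, conv_layers=2):
        super().__init__()
        convs, ch = [], 1
        for _ in range(conv_layers):
            convs += [nn.Conv2d(ch, 20, kernel_size=3, stride=1, padding=1),
                      nn.ReLU()]
            ch = 20
        self.features = nn.Sequential(*convs)
        # padding=1 with stride=1 keeps the spatial size at in_side x in_side
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(20 * in_side * in_side, 100), nn.ReLU(),
            nn.Linear(100, n_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

model = ChannelAssignCNN()
out = model(torch.randn(8, 1, 20, 20))  # mini-batch of 8 inputs
```

The same skeleton covers the 1-conv-layer variant by passing `conv_layers=1`.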
174 Deep Learning and Parallel Computing Environment for Bioengineering Systems

We run each of those training processes with a mini-batch size of 20 and 500 batches to obtain the accuracy results. From the outcome, we can see that the accuracy is highly related to the training structure, and the deep CNN is greatly superior to the DBN in our scenario. The number of iterations is also consistent, which significantly beats conventional algorithms. In traditional algorithms, a switch picking the channel for its links depends on the decisions of other switches. This implies that the switches must wait until the earlier switches have completed their channel assignment. The larger the iteration count, the longer each switch needs to spend on channel assignment. This causes redundant convergence time, and due to the redundant convergence time, excess signaling increases correspondingly. During the convergence time, all links are down because of the channel reassignment, and the throughput diminishes with such redundant convergence.

10.6.1 A Deep Learning System for Person Search
Different from conventional approaches that separate the problem into two distinct tasks, pedestrian detection and person re-identification, we handle the two aspects jointly in a single CNN. The CNN consists of two parts: given a whole input gallery image, a pedestrian proposal net is used to produce bounding boxes of candidate people, which are fed into an identification net to extract features for comparison with the target person. The pedestrian proposal net and the identification net adapt to each other during joint optimization. For instance, the proposal net can concentrate more on recall rather than precision, as false alarms can be eliminated through the final feature matching procedure. Meanwhile, misalignments of the proposals are also acceptable, as they can be further adjusted by the identification net. To enhance the scalability of the whole framework, inspired by recent advances in object detection, we encourage the two parts to share the underlying convolution feature maps, which significantly accelerates the inference procedure. Conventional re-id feature learning mainly uses pairwise or triplet distance loss functions. However, they are not very efficient, as only a few data samples are compared at each step, and there are O(N²) potential data combinations, where N is the number of images. Different sampling strategies can significantly affect the convergence rate and quality, but finding efficient sampling strategies becomes much more difficult as N increases. Another approach is learning to classify identities with the Softmax loss function, which effectively compares all the samples at the same time. As the number of classes increases, however, training the huge Softmax classifier matrix becomes much slower or even infeasible. In this section, we suggest a novel Online Instance Matching (OIM) loss function to cope with these issues, as shown in Fig. 10.5. We keep a lookup table of features from all the labeled identities, and compare the distances between mini-batch samples and all the registered entries.
FIG. 10.5 Deep learning system.
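A minimal NumPy sketch of the OIM idea just described: L2-normalized features of labeled identities live in a lookup table, unlabeled identities in a circular queue, and a mini-batch sample is compared against all entries by inner products (cosine similarities). The temperature scaling is an assumed hyperparameter, and this is not the paper's exact formulation:

```python
import numpy as np

def normalize(v):
    """L2-normalize feature vectors along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def oim_scores(x, lut, queue, temperature=0.1):
    """Matching probabilities of one sample against labeled identities.

    lut   : L x D lookup table, one L2-normalized feature per labeled identity.
    queue : Q x D circular queue of features of unlabeled identities,
            which act as extra negatives in the softmax denominator.
    x     : D-dim L2-normalized feature of a mini-batch sample.
    """
    # Inner products of L2-normalized vectors are cosine similarities.
    sims = np.concatenate([lut @ x, queue @ x]) / temperature
    p = np.exp(sims - sims.max())
    p /= p.sum()
    return p[: lut.shape[0]]   # probabilities over the labeled identities

lut = normalize(np.random.randn(5, 256))    # 5 labeled identities
queue = normalize(np.random.randn(8, 256))  # 8 unlabeled identities
x = normalize(np.random.randn(256))
probs = oim_scores(x, lut, queue)
```

Because the comparison is parameter-free (no classifier weight matrix), training and test time both reduce to cosine similarity against stored features.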


On the other hand, many unlabeled identities may show up in scene images, which can serve as negatives for the labeled identities; we thus exploit a circular queue to store their features for comparison as well. This is another advantage brought by the person search problem setting. The parameter-free OIM loss proves considerably faster and superior to the Softmax loss in our experiments.

10.6.2 Network Architecture
A deep learning framework that jointly handles pedestrian detection and person re-identification in a single convolutional neural network (CNN) is shown in Fig. 10.5. Given a whole scene image as input, we first use a stem CNN to transform the raw pixels into convolution feature maps. A pedestrian proposal net is built upon these feature maps to predict bounding boxes of candidate people, which are then fed into an identification net with RoI pooling to extract L2-normalized 256-D features for each of them. At the inference stage, we rank the gallery people according to their feature distances to the target person. At the training stage, we propose an OIM loss function [24] over the feature vectors to supervise the ID net, together with several other loss functions for training the proposal net, in a multi-task manner. Below we first detail the CNN model structure, and then elaborate on the OIM loss function. We adopt ResNet-50 as the CNN model. It has a 7 × 7 convolution layer in front (named conv1), followed by four blocks (named conv2_x to conv5_x) containing 3, 4, 6, and 3 residual units, respectively. Given an input image, the stem produces 1024 channels of feature maps, which have 1/16 the resolution of the original image. A 512 × 3 × 3 convolution layer is first added to transform the features specifically for pedestrians. Then, at each feature map location we associate 9 anchors, and use a Softmax classifier to predict whether each anchor is a pedestrian or not, as well as a linear regression to adjust its location. We keep the top 128 adjusted bounding boxes after non-maximum suppression as our final proposals. To find the target person among all these proposals, we build an identification net to extract the features of each proposal and compare them against the target's. We first exploit an RoI-pooling layer to pool a 1024 × 14 × 14 region from the stem feature maps for each proposal. These regions are then passed through the rest (conv4_4 to conv5_3) of the ResNet-50, followed by a global average pooling layer that summarizes each into a 2048-dimensional feature vector. On the one hand, as the pedestrian proposals inevitably contain some false alarms and misalignments, we again use a Softmax classifier and a linear regression to reject non-persons and refine the locations. On the other hand, we project the features into an L2-normalized 256-dimensional subspace (id-feat), and use them to compute cosine similarities with the target person at inference time. During training, we supervise the id-feat with the OIM loss function. Together with the other loss functions for detection, the whole net is jointly trained in a multi-task learning manner, rather than using alternating optimizations.

10.6.3 Dataset
We collect and annotate a large-scale CUHK-SYSU Person Search Dataset to evaluate our method, exploiting two data sources to diversify the scenes. On the one hand, we use hand-held cameras to shoot street snaps around cities. On the other hand, we collect movie snapshots that contain pedestrians, as they can increase the variety of viewpoints, lighting, and backgrounds. In this subsection we present the basic statistics of our dataset and also define the evaluation protocols and metrics.

10.6.4 Statistics
After collecting all 18,184 images, we first densely annotate all 96,143 pedestrian bounding boxes in these scenes, and then associate the persons that appear across different images, resulting in 8432 labeled identities. The statistics of the two data sources are recorded. We do annotate those people who appear with half-bodies or abnormal poses, such as sitting or squatting. However, people who change clothes and decorations in different video frames are not associated in our dataset, since the person search problem requires recognizing identities mainly according to their clothes and body shapes rather than their faces.
We guarantee that the background pedestrians do not contain labeled identities, so they can safely serve as negative samples for identification. Note that we also ignore the background pedestrians whose heights are smaller than 50 pixels, as they would be hard to recognize even for human labelers. The height distributions of the labeled and unlabeled identities are exhibited in Fig. 10.6.
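The annotation bookkeeping described above can be illustrated with a toy sketch (the field names are hypothetical): background pedestrians shorter than 50 pixels are ignored, and the remaining unlabeled ones serve as safe negatives for identification:

```python
# Minimal sketch of the dataset filtering rules, with made-up records.
MIN_HEIGHT = 50  # boxes below this height are too small even for labelers

boxes = [
    {"h": 180, "identity": 12},    # labeled person (identity id 12)
    {"h": 35,  "identity": None},  # too small: ignored entirely
    {"h": 120, "identity": None},  # background pedestrian -> negative sample
]

usable = [b for b in boxes if b["h"] >= MIN_HEIGHT]
negatives = [b for b in usable if b["identity"] is None]
```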
FIG. 10.6 Labeled and unlabeled identities representation.

It can be seen that our dataset has a rich variety of pedestrian scales.

10.6.5 Effectiveness of Online Instance Matching
The effectiveness of the OIM can be verified by comparing it against Softmax baselines, with or without retraining the classifier matrix. First, we can see that using the Softmax loss without retraining the classifier keeps a low accuracy during the whole process. This phenomenon confirms our analysis that learning a large classifier matrix is difficult. Even with proper retraining, the training accuracy still improves slowly, and the test mAP stays at around 60%. On the contrary, the OIM loss begins with a low training accuracy but converges substantially faster, and also consistently improves the test performance. The parameter-free OIM loss learns directly, without needing to learn a big classifier matrix. Moreover, the discrepancy between the training and test criterion no longer exists, as both are based on the inner product of L2-normalized feature vectors, which represents the cosine similarity. We additionally assess the effect of the OIM loss on the standard person re-ID task, by training two different base CNNs, Inception and ResNet-50, with either Softmax or OIM loss, on three large-scale datasets.

10.7 DISCUSSION
Deep learning is remarkable for managing apparently intractable problems in computer vision and natural language processing, but it customarily does so by using huge CPU and GPU resources. Standard deep learning techniques are not suited to the challenges of IoT applications, however, since they cannot count on a comparable level of computational resources. With constrained computing and network capacity, there will be points of hard failure: shrinking the network or node size may yield unacceptable drops in accuracy and precision. Deep learning is the best response for many analysis and identification problems. Precisely when used with an eye on conserving processing, memory, and power resources, deep learning can deliver intelligence to IoT devices. These systems comprise a large number of distributed devices continuously generating a high volume of data. Different kinds of devices generate specific kinds of data, leading to heterogeneous datasets, yet the data normally includes timestamps and location information. Finally, IoT applications must be able to tolerate noise in the data caused by transmission and sensing errors. Artificial intelligence is assuming a growing role in IoT applications and deployments. Funding investments in IoT startups that use AI are up steadily, and organizations have acquired various companies working at the intersection of AI and IoT over the last two years. Furthermore, major vendors of IoT software are now starting to offer integrated AI capabilities, for example, machine learning based analytics. Artificial intelligence is taking a leading role in IoT because of its capacity to quickly wring insights from data. Machine learning, an AI technology, brings the ability to automatically recognize patterns and detect anomalies in the data from smart sensors and devices reading, for example, temperature, pressure, humidity, air quality, vibration, and sound. Organizations are finding that machine learning can have significant advantages over traditional business intelligence tools for analyzing IoT data, including being able to make operational predictions up to 20 times earlier and with greater accuracy than threshold-based monitoring systems; technologies such as speech recognition and computer vision can further help extract insight from data that used to require human review.
For massive data input, it is advantageous to apply the limited memory subspace optimization SVD (LMSVD)-QR algorithm to increase the data processing speed. Simulation results on automated handwritten digit recognition demonstrate that SVD-QR and LMSVD-QR can enormously reduce the number of inputs to a deep learning neural network without losing performance, and can also hugely increase the data processing speed for deep learning in IoT.
Consequently, we propose a novel deep learning based traffic load prediction algorithm, first to estimate the future traffic load and network congestion. DLPOCA is then used to intelligently assign channels to each link in the SDN-IoT network. Combined with the deep learning based prediction, DLPOCA can smartly avoid potential congestion and quickly assign appropriate channels to the links in SDN-IoT. The simulation results show that our proposal significantly outperforms traditional channel assignment algorithms.

10.8 CONCLUSION
There are two fundamental cases in data preprocessing: keeping specific physical features, or losing them. Uniform or non-uniform sampling can keep the physical features of the data, but it neglects the investigation of the deeper relations among all the data beyond the mean and standard deviation; transform-based data preprocessing, on the other hand, further explores the data inter-correlations beyond these statistics, even though the physical features are lost. To combine the benefits of the two categories, the SVD-QR and LMSVD-QR algorithms are suggested for the preprocessing of deep learning in multilayer neural networks. The SVD-QR is fit to be applied to a large-scale data pool, while LMSVD-QR is for massive data input. The SVD and LMSVD perform linear transforms on the original data, and the QR step determines the data sample indices of the data subset to be kept, so that the physical features of the original data are preserved. Combining them automates handwritten digit recognition. Simulation results demonstrate that the two approaches can immensely reduce the number of inputs for deep learning without losing performance. In particular, the SVD-QR could accomplish a similar performance (99.7% accuracy) with just 103 inputs, compared to the original dataset with 400 inputs.
Developing a series of deep learning based methods is needed to make person recognition scalable towards real-world data and applications. From the data point of view, on the one hand, the goal is to address the challenge of insufficient supervised data. A deep learning system should use semi-supervised, noisily labeled data, which is effective to gather. On the other hand, the task is to handle the issue of utilizing numerous existing small datasets, each of which has its own data bias. A joint single-task learning algorithm and a domain guided dropout technique are developed to handle the domain assignment explicitly within a single model. From the application point of view, to address the more realistic problem setting, a novel deep learning framework for simultaneous person detection and identification is needed. A novel Online Instance Matching loss function makes learning identification features more effective. The explosive growth of sensing data and the quick response requirements of the IoT have recently made rapid transmission in the wireless IoT a critical issue. Assigning channels appropriately in the wireless IoT is a fundamental guarantee of rapid transmission. However, the conventional fixed channel assignment algorithms are not suitable in the IoT because of its heavy traffic loads. Recently, the SDN-IoT has been proposed to enhance the transmission quality. Additionally, the deep learning approach has been widely studied in high-computation SDN. Thus, a novel deep learning based traffic load prediction method emerged to predict the future traffic load and network congestion. DLPOCA was proposed to intelligently assign channels to each link in the SDN-IoT, which can astutely avoid traffic congestion and quickly assign appropriate channels to the wireless links of SDN-IoT. Extensive simulation results show that our proposal significantly outperforms the traditional channel assignment algorithms.

REFERENCES
1. Special issue of Big Data research journal on "Big Data and neural networks", Big Data Research 11 (2018) iii–iv, https://doi.org/10.1016/s2214-5796(18)30058-3.
2. R. Kashyap, P. Gautam, Fast medical image segmentation using energy-based method, Biometrics: Concepts, Methodologies, Tools, and Applications 3 (1) (2017) 1017–1042.
3. S. Liang, Smart and fast data processing for deep learning in the Internet-of-Things: less is more, IEEE Internet of Things Journal 20 (10) (2018) 1–9, https://doi.org/10.1109/jiot.2018.2864579.
4. M. Rathore, A. Paul, A. Ahmad, G. Jeon, IoT-based Big Data, International Journal on Semantic Web and Information Systems 13 (1) (2017) 28–47, https://doi.org/10.4018/ijswis.2017010103.
5. P. Yildirim, D. Birant, T. Alpyildiz, Data mining and machine learning in the textile industry, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (1) (2017) e1228, https://doi.org/10.1002/widm.1228.
6. E. Konovalov, Equivalence of conventional and modified network of generalized neural elements, Modeling and Analysis of Information Systems 23 (5) (2016) 657–666, https://doi.org/10.18255/1818-1015-2016-5-657-666.
7. F. Tang, B. Mao, Z. Fadlullah, N. Kato, On a novel deep-learning-based intelligent partially overlapping channel assignment in SDN-IoT, IEEE Communications Magazine 56 (9) (2018) 80–86, https://doi.org/10.1109/mcom.2018.1701227.
8. K. Noel, Application of machine learning to systematic allocation strategies, SSRN Electronic Journal (2016), https://doi.org/10.2139/ssrn.2837664.
9. A. Jara, A. Olivieri, Y. Bocchi, M. Jung, W. Kastner, A. Skarmeta, Semantic Web of Things: an analysis of the application semantics for the IoT moving towards the IoT convergence, International Journal of Web and Grid Services 10 (2–3) (2014) 244, https://doi.org/10.1504/ijwgs.2014.060260.
10. L. Urquhart, T. Rodden, A legal turn in human–computer interaction? Towards regulation by design for the Internet of things, SSRN Electronic Journal (2016), https://doi.org/10.2139/ssrn.2746467.
11. Internet of things & creation of the fifth V of Big Data, International Journal of Science and Research (IJSR) 6 (1) (2017) 1363–1366, https://doi.org/10.21275/art20164394.
12. Z. Chen, B. Liu, Lifelong machine learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 10 (3) (2016) 1–145, https://doi.org/10.2200/s00737ed1v01y201610aim033.
13. D. Wang, C. McMahan, C. Gallagher, A general regression framework for group testing data, which incorporates pool dilution effects, Statistics in Medicine 34 (27) (2015) 3606–3621, https://doi.org/10.1002/sim.6578.
14. A. Hussain, E. Cambria, Semi-supervised learning for big social data analysis, Neurocomputing 275 (2018) 1662–1673, https://doi.org/10.1016/j.neucom.2017.10.010.
15. S. Ouchtati, M. Redjimi, M. Bedda, A set of features extraction methods for the recognition of the isolated handwritten digits, International Journal of Computer and Communication Engineering 3 (5) (2014) 349–355, https://doi.org/10.7763/ijcce.2014.v3.348.
16. C. Tofan, Optimization techniques of decision making – decision tree, Advances in Social Sciences Research Journal 1 (5) (2014) 142–148, https://doi.org/10.14738/assrj.15.437.
17. R. Kashyap, A. Piersson, Impact of Big Data on security, in: Handbook of Research on Network Forensics and Analysis Techniques, IGI Global, 2018, pp. 283–299.
18. R. Kashyap, V. Tiwari, Energy-based active contour method for image segmentation, International Journal of Electronic Healthcare 9 (2–3) (2017) 210–225.
19. M. Dzbor, A. Stutt, E. Motta, T. Collins, Representations for semantic learning webs: Semantic Web technology in learning support, Journal of Computer Assisted Learning 23 (1) (2007) 69–82, https://doi.org/10.1111/j.1365-2729.2007.00202.x.
20. L. Kim, DeepX: deep learning accelerator for restricted Boltzmann machine artificial neural networks, IEEE Transactions on Neural Networks and Learning Systems 29 (5) (2018) 1441–1453, https://doi.org/10.1109/tnnls.2017.2665555.
21. A. Neath, J. Cavanaugh, The Bayesian information criterion: background, derivation, and applications, Wiley Interdisciplinary Reviews: Computational Statistics 4 (2) (2011) 199–203, https://doi.org/10.1002/wics.199.
22. D. Schwab, S. Ray, Offline reinforcement learning with task hierarchies, Machine Learning 106 (9–10) (2017) 1569–1598, https://doi.org/10.1007/s10994-017-5650-8.
23. F. Tang, Z. Fadlullah, B. Mao, N. Kato, An intelligent traffic load prediction based adaptive channel assignment algorithm in SDN-IoT: a deep learning approach, IEEE Internet of Things Journal (2018) 1–14, https://doi.org/10.1109/jiot.2018.2838574.
24. M. Abdechiri, K. Faez, H. Amindavar, Visual object tracking with online weighted chaotic multiple instance learning, Neurocomputing 247 (2017) 16–30, https://doi.org/10.1016/j.neucom.2017.03.032.
25. J. Liu, B. Krishnamachari, S. Zhou, Z. Niu, DeepNap: data-driven base station sleeping operations through deep reinforcement learning, IEEE Internet of Things Journal (2018) 1, https://doi.org/10.1109/jiot.2018.2846694.
CHAPTER 11

Challenges in Storing and Processing


Big Data Using Hadoop and Spark
SHAIK ABDUL KHALANDAR BASHA, MTECH •
SYED MUZAMIL BASHA, MTECH • DURAI RAJ VINCENT, PHD •
DHARMENDRA SINGH RAJPUT, PHD

11.1 INTRODUCTION
The main challenge in performing analysis on big data is the unavailability of labeled data, as labeling requires subject experts to prepare the data for classification. In contrast, getting unlabeled data is easy, but it requires deep training to prepare the data for analysis in applications like sentiment analysis and speech recognition. In this chapter we discuss the challenges in storing, processing and classifying unlabeled data, which is very easy to obtain through many review websites and platforms like Amazon and Twitter. There are two traditional models for training on unlabeled big data: the convolutional neural network (CNN) and the deep belief network (DBN). The drawbacks of CNN are getting stuck in local minima and over-fitting when the data is too large [1]. To overcome these limitations, the DBN is used, in which there are two types of layers, unsupervised and supervised, as shown in Fig. 11.1. The unsupervised layers are responsible for learning the weights of the unlabeled data, whereas the supervised layer is responsible for training on the data based on the weights assigned in the first two unsupervised layers. To improve the performance of training, we can use graphics processing units (GPUs). On a GPU, global memory is available with higher bandwidth, which saves a lot of time in moving the data between machines. Tensor T-DBN is a platform in which the training of data can be parallelized among multiple neurons, where each neuron is responsible for processing (1/n)th of a task. The basic entity of Spark is the RDD (resilient distributed dataset). An RDD is suitable for the fault-tolerant acquisition of elements which can be processed in parallel, and it can connect to HBase, Hadoop, HDFS or any input format. As technology and usage grow day by day, the combination of deep learning and big data technologies has emerged as a way to have an effective framework for distributed computing on clusters. Industry requirements have changed, and they are moving towards advanced approaches. As we all know, deep learning is computationally very heavy, becoming prohibitively heavy when we use it together with big data technologies. We can distribute the process to overcome the computational problem, and the studies which have examined the above-mentioned big data technologies suggest better solutions. Currently this combination has made a lot of changes to how we work. The company Databricks has developed a framework combining DL and big data technologies with Python APIs, which has played a vital role in developing the advantages of both. In big data technologies there is a module called MLlib, whose API can easily enable a deep learning module with fewer lines of code (LOC). The main focus of both is integration without giving away performance. Some big data technologies can be integrated with other platforms with the help of Python APIs; Python integrates almost all the libraries in use today, including the most powerful DL libraries like TensorFlow and Keras.
Some of the limitations and research gaps of Apache big data technologies with respect to deep learning are given in later sections.

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00018-X 179
Copyright © 2019 Elsevier Inc. All rights reserved.

FIG. 11.1 Architecture of deep belief networks (DBNs).

11.2 BACKGROUND AND MAIN FOCUS
The focus is on making a new analytical architecture that meets all the requirements of today's industries. An analytic architecture consists of three layers [2]: first, analytic platform capabilities, that is, a set of building blocks needed to create enterprise analytics solutions that meet the functional requirements of an organization's collection of business use cases; second, analytic platform software, that is, a set of software component preferences for implementing analytic platform capabilities; and third, analytic platform hardware, that is, network, server, storage, and edge device specifications needed for the analytic platform software to meet the availability, scalability, performance, and other non-functional requirements of business use cases. The functional domains of an analytic platform architecture are categorized into [3]: the provision data domain, responsible for ingestion of streaming data, moving data between RDBMS and HDFS, and connectors to unstructured sources; the data storage, distribution, and processing domain, responsible for stream processing, the analytics database, and in-memory distributed compute; the data analysis domain, responsible for social media and text analytics, statistical methods, machine learning, and media and speech analytics; the embedded analytics domain, responsible for distributed flow control, event-driven business rules, and complex event processing; the information delivery domain, responsible for visual analytics and real-time reporting; the enterprise information management domain, responsible for distributed compute data management; the development, deployment, and solution maintenance domain, responsible for design tools and analytic model management; the infrastructure delivery domain, responsible for distributed compute security, administration and monitoring; and, finally, the service delivery domain, responsible for the analytics platform/solutions. In [4,5], the authors used weighted fuzzy logic to assign weights when training the data to extract sentiments from labeled tweets and achieved a good F-score, whereas in [6] the authors made a detailed comparison of predictive models and performed analysis on a time series dataset. In [7], the authors performed analysis on the PIMA diabetes dataset and predicted the levels of diabetes based on the insulin feature, whereas in [8] the authors used a gradient ascent algorithm to find the exact weights of the terms used in determining the sentiment of a tweet and used a boosting approach to improve the accuracy of a linear classifier. In [9], the authors provided a novel way of performing prediction on a breast cancer dataset, compared the performance of three different feature selection algorithms and showed that a genetic algorithm gives the best result in selecting the best among all available features; the SVM algorithm gives the best result in predicting the level of certainty in breast cancer.

11.2.1 Challenges of Big Data Technologies
The application is built at the development stage; at deployment time, one can come across various challenges. Different methods are used for deploying the application in standalone mode. Big data technologies support Mesos and YARN. Constructing the dependency configuration also helps in deploying an application without any exception at a cluster node.

11.2.1.1 Memory Issues
Big data technologies are ready to assist in transferring huge amounts of data. Addressing issues related to memory is critical for monitoring and measuring simple usage.
• A lack of proper documentation can lead to failure – each and every step should be recorded for future purposes when recovering the system from a single point of failure.
• Frequent changes are dangerous – a change of version after the setup can be overruled.

11.2.1.2 Limitations of Apache Big Data Technologies
While processing huge amounts of data, big data technologies and deep learning struggle with small files and duplicate management. Now, the industry is trying to shift towards Apache Flink with respect to big data. A combination of deep learning

and its usability will be more frequent, since streaming of live data is divided into many batches at regular intervals, and this batching makes big data technologies resilient (it creates RDDs). Each RDD has operations like join, map and reduce, which help in building the data into batches. But in a real process a micro-batch pattern appears in streaming, and one of the drawbacks of big data technologies is that they are not well suited to small files. When big data technologies and Hadoop are combined, HDFS has limitations on small files rather than large files, and small files are stored gzipped in S3. Apache big data technologies do not have their own file management, and because of that they become dependent on other platforms, like Hadoop or cloud-based storage, which is one of the issues to consider. To support fast processing, big data technologies support memory-based transactions for processing huge amounts of data.
The main research gap of big data technologies and deep learning is that big data technologies support only a few MLlib algorithms while deep learning is quite advanced in its algorithms; moreover, big data technologies have very high latency and support only time-based rather than record-based processing. When data is generated very fast, deep learning is able to handle huge amounts of data, yet big data technologies are not capable of back-pressure handling. A buffer should be automatically released and used to store new data in the stream; to address the above gaps and limitations, one can instead use Apache Flink with deep learning.

11.2.2 Real Time Applications of Big Data Frameworks
• Hadoop
1. Analyze a patient's records, e.g., if he/she is suspected to have a heart attack; examine a series of observations or recorded test results, which are analyzed by big data methods;
2. Prevent hardware failure;
3. Understand the trends in selling a product.
• ZooKeeper
1. Uses the CXF implementation of DOSGi;
2. Used in real time for serving a registration repository;
3. Used to create resource locks in a real-time distributed system.
• Apache Storm
1. Supports real-time live streaming and processing;
2. Supports multiple programming languages;
3. Is fault tolerant and has low latency;
4. Use cases: NaviSite, Yahoo, Wego and Twitter.
• Apache Spark
1. Used for event detection (e.g., earthquakes) by analyzing Twitter streaming data;
2. Used in real time by the gaming industry for discovering patterns and processing;
3. Used in E-commerce for live streaming clustering algorithms;
4. Supports parallel processing.
• GraphX
1. A part of Spark used for distributed in-memory (RAM) graph processing;
2. Used to divide large processes on a single machine framework; supports Neo4j and Apache GraphX.

11.3 HADOOP ARCHITECTURE
Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework for a cluster of systems with storage capacity and local computing power, leveraging commodity hardware. Hadoop follows a master–slave architecture, as shown in Fig. 11.2, in which the Name node holds metadata about all the data blocks in HDFS for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm [10]. The three important components that play a vital role in the Hadoop architecture are the Hadoop Distributed File System (HDFS) [11], Hadoop MapReduce and Yet Another Resource Negotiator (YARN).

11.4 MAPREDUCE ARCHITECTURE
MapReduce can handle large-scale data and works well on Write Once, Read Many (WORM) data [12] without mutexes. MapReduce operations are performed by the same physical processor, and all operations need local data (data locality). The runtime takes care of splitting and moving data. MapReduce consists of the following phases: the Map phase reads an assigned input split from HDFS and parses it into key/value pairs; in the Partition phase, every mapper determines which reducer will receive each of its outputs; the Shuffle phase prepares the Reduce task buckets; the Merge phase sorts all map outputs into a single run; then the Reduce phase processes the corresponding list of values for each key and writes the result to a file in HDFS, as shown in Fig. 11.3. A sample application developed using MapReduce is shown in Fig. 11.4, which counts the frequency of words occurring in a given text.
The following list of datasets can be used for Hadoop practice:

FIG. 11.2 Master–slave architecture of Hadoop.
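As a rough sketch of the master–slave split shown in Fig. 11.2, the Name node's role can be modeled as a map from file blocks to the Data nodes holding their replicas. This is a toy in-memory model, not Hadoop's actual implementation; the class, node names and round-robin placement policy are all illustrative assumptions:

```python
# Toy model of the HDFS master-slave metadata (illustrative only, not Hadoop code).
# The "Name node" tracks which "Data nodes" hold each block of a file.

class NameNode:
    def __init__(self):
        self.block_map = {}  # file name -> list of (block id, [data node ids])

    def add_file(self, name, num_blocks, datanodes, replication=3):
        """Assign each block of a file to `replication` data nodes, round-robin."""
        blocks = []
        for b in range(num_blocks):
            holders = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            blocks.append((f"{name}_blk{b}", holders))
        self.block_map[name] = blocks

    def locate(self, name):
        """Return block locations, as a client would ask before reading."""
        return self.block_map[name]

nn = NameNode()
nn.add_file("log.txt", num_blocks=2, datanodes=["dn1", "dn2", "dn3", "dn4"])
print(nn.locate("log.txt"))
```

In real HDFS the Data nodes report their blocks to the Name node via heartbeats; here the placement is assigned directly for brevity.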

FIG. 11.3 Phases of MapReduce architecture.

FIG. 11.4 MapReduce model.
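The word-count application of Fig. 11.4 can be simulated in plain Python as a single-process sketch of the Map, Shuffle and Reduce phases (illustrative only — a real Hadoop job distributes these phases across a cluster):

```python
from collections import defaultdict

def map_phase(text):
    # Map: parse the input split into (key, value) pairs.
    return [(word, 1) for word in text.split()]

def shuffle_phase(pairs):
    # Shuffle/Merge: group all values for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's list of values into a final result.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase("to be or not to be")))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Hadoop additionally partitions keys across many reducers; here a single dictionary plays the role of all reducers.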

• Amazon. It's no secret that Amazon is among the market leaders when it comes to cloud. AWS is used on a large scale with Hadoop, and Amazon also provides a lot of datasets for Hadoop practice, which you can download.
• Wikipedia. Wikipedia also provides datasets for Hadoop practice, giving you fresh and real data to use.
• Air traffic. Airport, airline and route data (6977 airports, 5888 airlines and 59,036 routes spanning the globe).
• UCI Machine Learning Repository. A collection of databases, domain theories, and data generators.
• LinkedData. You may find almost all categories of datasets here.

11.5 JOINS IN MAPREDUCE
A mapper's job during the Map stage is to "read" the data from the join tables and to "return" the "join key" and "join value" pair into an intermediate file [13]. Further, in the Shuffle stage, this intermediate file is sorted and merged. The reducer's job during the Reduce stage is to take this sorted result as input and complete the join, as shown in Fig. 11.5.
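The read/tag/merge flow described above can be sketched in plain Python: each mapper emits a (join key, tagged join value) pair, the shuffle groups pairs by key, and the reducer combines the tagged records. The customer and transaction tables below are made-up sample data, not the chapter's dataset:

```python
from collections import defaultdict

customers = [("c1", "Alice"), ("c2", "Bob")]                  # (customer id, name)
transactions = [("c1", 250.0), ("c1", 100.0), ("c2", 75.0)]   # (customer id, amount)

# Map stage: emit (join key, tagged join value) from each table.
mapped = [(cid, ("cust", name)) for cid, name in customers]
mapped += [(cid, ("tnxn", amount)) for cid, amount in transactions]

# Shuffle stage: records with identical keys are routed to the same reducer.
grouped = defaultdict(list)
for key, tagged in mapped:
    grouped[key].append(tagged)

# Reduce stage: combine each customer record with its transactions.
joined = {}
for cid, records in grouped.items():
    name = next(v for tag, v in records if tag == "cust")
    amounts = [v for tag, v in records if tag == "tnxn"]
    joined[cid] = (name, len(amounts), sum(amounts))

print(joined)  # {'c1': ('Alice', 2, 350.0), 'c2': ('Bob', 1, 75.0)}
```

The "cust"/"tnxn" tags mirror the ones used in Algorithm 1 later in the chapter, so the reducer can tell which table each value came from.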

FIG. 11.5 Joins in MapReduce model.

FIG. 11.6 Schema of dataset.

As the name suggests, in the Reduce side join the reducer is responsible for performing the join operation. It is comparatively simple and easier to implement than the Map side join, as the Sorting and Shuffling phase sends the values having identical keys to the same reducer and therefore, by default, the data is organized for us. Now, let us understand the Reduce side join in detail when the dataset in Fig. 11.6 is given as input.
The architecture of ZooKeeper is based on a simple client–server model [14]. The clients are the nodes which request services from the server, and the server is the node which serves the requests. ZooKeeper provides a high-availability file system: there are no files and directories, but a unified concept of a node, called a Znode (ephemeral or persistent), which is a container of data (like a file) and a container of other Znodes (like a directory) [15]. The operations performed on Znodes are: connecting to a host by specifying the hostname; creating a Znode in persistent mode by specifying a group name as part of a path name; joining a new member by specifying the path and the group name; listing the children of a Znode using getChildren() with the path as argument; and deleting a member from an existing path.
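The Znode operations listed above can be mimicked with a small in-memory tree. This is a conceptual sketch only — a real ensemble would be driven through a ZooKeeper client library — and the paths and group names here are invented for illustration:

```python
class ZnodeTree:
    """In-memory stand-in for a ZooKeeper namespace of Znodes."""

    def __init__(self):
        # Every Znode holds data (like a file) and may contain children (like a dir).
        self.nodes = {"/": None}  # path -> data

    def create(self, path, data=None):
        # Creation requires the parent Znode to exist, as in ZooKeeper.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data

    def get_children(self, path):
        # List direct children only, mirroring getChildren().
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

    def delete(self, path):
        del self.nodes[path]

zk = ZnodeTree()
zk.create("/workers", b"group")        # persistent group Znode
zk.create("/workers/w1", b"host-a")    # a member joins the group
zk.create("/workers/w2", b"host-b")
print(zk.get_children("/workers"))     # ['w1', 'w2']
zk.delete("/workers/w2")               # a member leaves
print(zk.get_children("/workers"))     # ['w1']
```

Ephemeral Znodes, watches and session semantics are deliberately left out of this sketch.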

Algorithm 1 Algorithm to perform Reduce side join.
1. Start
2. Define the static class CustsMapper extending Mapper with generics <Object, Text, Text, Text>
3. Override the method map with arguments (Object key, Text value, Context context)
4. Get the value into a string named record using toString()
5. Create an array of strings named parts using split() by specifying (,) as the delimiter
6. Define the static class TxnsMapper extending Mapper with generics <Object, Text, Text, Text>
7. Override the method map with arguments (Object key, Text value, Context context)
8. Get the value into a string named record using toString()
9. Create an array of strings named parts using split() by specifying (,) as the delimiter
10. Update the context with instances Text(parts[2]), Text("tnxn" + parts[3]) using the write operation
11. Define the static class ReduceJoinReducer extending Reducer with generics <Text, Text, Text, Text>
12. Override the method reduce with arguments (Text key, Iterable<Text> values, Context context)
13. For each value, update the count and total if (parts[0].equals("tnxn"))
14. Else if (parts[0].equals("cust")), update the name as parts[1]
15. Create a string str holding both count and total using format()
16. Update the context with instances Text(name), Text(str) using the write operation
17. Configure the job using the Job and Configuration classes
18. Stop

11.6 APACHE STORM
Apache Storm is a distributed real-time big data processing system designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner with the highest ingestion rates [16]. It manages the distributed environment and cluster state via Apache ZooKeeper. It reads a raw stream of real-time data at one end, passes it through a sequence of small processing units and outputs useful information at the other end. The components in Fig. 11.7 represent the core concepts of Apache Storm.
One of the main highlights of Apache Storm is that it is a fault-tolerant, fast, no "Single Point of Failure" (SPOF) distributed application [17]. The important high-level components in each Supervisor node include: the topology, which runs distributed across multiple worker processes on multiple worker nodes; the spout, which reads tuples off a messaging framework and emits them as a stream of messages, or may connect to the Twitter API and emit a stream of tweets; and the bolt, which is the smallest processing logic within a topology. The output of a bolt can be fed into another bolt as input within a topology.

11.7 APACHE SPARK ENVIRONMENT
Apache Spark is very suitable for data engineering, able to handle large datasets without much concern about the infrastructure. It helps in data ingestion, processing, machine learning and accuracy, and it provides a framework for the construction of a distributed system. The best part of big data technologies is the speed of accessing and transferring data, which leads to an implementation of MapReduce that keeps data in memory rather than on disk. Big data technologies also provide many libraries for programming languages like Java, Scala and Python.

11.7.1 Use Cases of Big Data Technologies
• Analytics. Spark can be used for building real-time analytics on streaming data. It has the ability to transfer massive amounts of data from different sources. Big data technologies support Kafka, ZeroMQ, HDFS, Flume and Twitter.
• Trending data. For an online stream, big data technologies are well suited to processing the data, so that trending data can be easily stored and analytics can be done at runtime.
• IoT. The Internet-of-Things generates huge amounts of data from sensors installed at various places. The generated data are pushed for storage and processing. Big data technologies have been applied for data processing and transferring at regular periods (every second, minute, hour).
• Machine learning. Spark can be used in offline processing and may use machine learning algorithms. ML can be deployed easily on data sets, as MLlib contains different algorithms. We can apply it on data sets to achieve a real-time ML system, and we can also combine MLlib with Spark.
In the programming context, a Spark context is created, which provides the means for loading and saving data files of different types [18]; an SQL context can also be created from the Spark context to implicitly convert RDDs into DataFrames. Using the Spark context, it

FIG. 11.7 Application process of Apache Storm.
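The spout → bolt pipeline that a topology wires together (Fig. 11.7) can be illustrated with plain Python generators. This is a conceptual sketch, not Storm's API; the message data is made up:

```python
def spout(messages):
    # Spout: reads tuples off a source and emits them as a stream.
    for message in messages:
        yield message

def split_bolt(stream):
    # Bolt 1: smallest processing unit; splits each message into words.
    for message in stream:
        for word in message.split():
            yield word

def count_bolt(stream):
    # Bolt 2: consumes the previous bolt's output and aggregates counts.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt.
tweets = ["storm is fast", "storm is fault tolerant"]
result = count_bolt(split_bolt(spout(tweets)))
print(result)  # {'storm': 2, 'is': 2, 'fast': 1, 'fault': 1, 'tolerant': 1}
```

In a real topology each stage would run as parallel tasks across Supervisor nodes rather than as chained generators in one process.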

is possible to load a text file into an RDD using the textFile() method, resulting in the creation of a data frame. Next, a Configuration object is defined, which is used to create a Spark context. A schema string is then created, which contains the data column names, and the schema is built from it by splitting the string by its spacing and using the StructType() and StructField() methods to define each schema column as a string value. Later, each data row is created from the raw data by splitting it with a comma as a line divider, and then the elements are added to a Row() structure. A data frame is created from the schema and saved to HDFS using saveAsTextFile(). Apache Parquet is another column-based data format used to increase performance through efficient compression and encoding routines.

11.8 GRAPH ANALYSIS WITH GRAPHX
Graph processing is an important aspect of analysis that applies to a lot of use cases [19]. Fundamentally, graph theory and processing are about defining relationships between different nodes and edges. Nodes or vertices are the units, while edges define the relationships between nodes, for the scheme plotted in Fig. 11.8.

FIG. 11.8 Schema view of a dataset.

11.9 STREAMING DATA ANALYTICS
Streaming data is basically a continuous group of data records generated from sources like sensors, server traffic and online searches; a common source of streaming data is user activity on websites. Streaming data helps in creating live dashboards. When building applications to collect, process and analyze streaming data in real time, we need different design considerations compared to processing static batch data. Different streaming data processing frameworks [20] are available on the market, e.g., Apache Samza, Storm and Spark Streaming, as shown in Fig. 11.9. Spark Streaming

FIG. 11.9 Architecture of Spark Streaming.

FIG. 11.10 Architecture of Spark Streaming.
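The micro-batching idea behind Figs. 11.9 and 11.10 can be simulated in plain Python: a continuous stream is cut into fixed-size batches (stand-ins for the RDDs inside a DStream) and each batch is processed independently. No Spark is involved; the batch size and records are invented for illustration:

```python
def to_batches(stream, batch_size):
    # Cut the continuous stream into micro-batches (one "RDD" per interval).
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly partial, batch

def process_batch(batch):
    # Stand-in for an RDD operation applied to each micro-batch.
    return sum(batch)

readings = [3, 1, 4, 1, 5, 9, 2, 6]  # pretend sensor stream
results = [process_batch(b) for b in to_batches(readings, batch_size=3)]
print(results)  # [8, 15, 8] -- one result per processed micro-batch
```

Spark Streaming batches by time interval rather than by record count, but the pipeline shape — batch, process, emit batched results — is the same.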

Algorithm 2 Algorithm to build a graph.
1. Start.
2. Import the data.
3. Build a graph using the structure of the vertices (or nodes) and the structure of the edges.
4. Perform some joins to ensure that the data items from the datasets are associated with each other.
5. Create a set of vertices and attach the metadata to each of them.
6. Create the edges from all of the individual rows by adding a dummy value of 1 to each.
7. Consider a default station for edges that don't point to any of the vertices.
8. Now the instance of the graph is created with vertices, edges and default details.
9. The instance can be used to access properties like numVertices and numEdges.
10. Stop.

Algorithm 3 TCP based stream processing using Spark.
1. Start
2. Import the Spark Streaming context from the org.apache.spark package
3. Create a configuration instance using SparkConf(), specifying the local host as the master node
4. Create an instance of the streaming context with a batch size of 5 seconds
5. Create an instance of DStream that can connect to a hostname and port using socketTextStream()
6. Perform transformations on the DStream
7. Use the MapReduce logic to organize the context
8. Start capturing the streaming context using start()
9. Wait for the computation to terminate
10. Run nc -lk <port number>, which activates netcat as a data server
11. Stop

converts a data stream into batches of X seconds called DStreams, each of which internally is a sequence of RDDs. Later, Spark is used to process the RDDs, and the results of the RDD operations are returned in batches, as shown in Fig. 11.10. Spark Streaming batches are processed using the Spark engine, which returns a processed stream of batches that can be stored to a file system; 0.5 s as the minimum batch size results in a one second end-to-end latency.

11.10 FUTURE RESEARCH DIRECTIONS
In the future, we would like to address the industry needs by proposing a new analytic platform that can

fulfill all the real-time requirements with less manpower and infrastructure.

11.11 CONCLUSION
From this chapter, we would like to conclude that there are many and various challenges in the area of big data analytics. Researchers can make good use of their valuable time when deciding on a research topic from this area by understanding all the concepts discussed in this chapter. Big data frameworks are suitable for data engineering, able to handle large datasets without much concern about the infrastructure. They help in data ingestion, processing, machine learning and accuracy, providing the framework for the construction of a distributed system. The best part of big data technologies is the speed of accessing and transferring data, which leads to an implementation of MapReduce that keeps data in memory rather than on disk. Big data technologies also provide many libraries for programming languages like Java, Scala and Python.
The limitation of the present work is that the experiments were performed on relatively small datasets. In the future, our aim is to consider real-time big data datasets with respect to volume.

REFERENCES
1. X.-W. Chen, X. Lin, Big data deep learning: challenges and perspectives, IEEE Access 2 (2014) 514–525.
2. P.K. Davis, Analytic Architecture for Capabilities-Based Planning, Mission-System Analysis, and Transformation, Tech. rep., Rand National Defense Research Inst., Santa Monica, CA, 2002.
3. H.D. Hunt, J.R. West, M.A. Gibbs Jr, B.M. Griglione, G.D.N. Hudson, A. Basilico, A.C. Johnson, C.G. Bergeon, C.J. Chapa, A. Agostinelli, et al., Similarity matching of products based on multiple classification schemes, US Patent 9,262,503 (Feb. 16, 2016).
4. S.M. Basha, Y. Zhenning, D.S. Rajput, N. Iyengar, D. Caytiles, Weighted fuzzy rule based sentiment prediction analysis on tweets, International Journal of Grid and Distributed Computing 10 (6) (2017) 41–54.
5. S.M. Basha, D.S. Rajput, Sentiment Analysis: Using Artificial Neural Fuzzy Inference System, IGI Global, 2018.
6. S.M. Basha, Y. Zhenning, D.S. Rajput, R.D. Caytiles, N.C.S. Iyengar, Comparative study on performance analysis of time series predictive models, International Journal of Grid and Distributed Computing 10 (8) (2017) 37–48.
7. S.M. Basha, H. Balaji, N.C.S. Iyengar, R.D. Caytiles, A soft computing approach to provide recommendation on PIMA diabetes, International Journal of Advanced Science and Technology 106 (2017) 19–32.
8. S.M. Basha, D.S. Rajput, K. Vishu Vandana, Impact of gradient ascent and boosting algorithm in classification, International Journal of Intelligent Engineering and Systems (IJIES) 11 (1) (2018) 41–49.
9. S.M. Basha, D.S. Rajput, N.C.S. Iyengar, A novel approach to perform analysis and prediction on breast cancer dataset using R, International Journal of Grid and Distributed Computing 11 (2) (2018) 41–54.
10. A.F. Gates, O. Natkovich, S. Chopra, P. Kamath, S.M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, U. Srivastava, Building a high-level dataflow system on top of map–reduce: the pig experience, Proceedings of the VLDB Endowment 2 (2) (2009) 1414–1425.
11. K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop distributed file system, in: Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, IEEE, 2010, pp. 1–10.
12. J. Shafer, S. Rixner, A.L. Cox, The Hadoop distributed filesystem: balancing portability and performance, in: Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, IEEE, 2010, pp. 122–133.
13. H.-c. Yang, A. Dasdan, R.-L. Hsiao, D.S. Parker, Map–reduce–merge: simplified relational data processing on large clusters, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, ACM, 2007, pp. 1029–1040.
14. P. Hunt, M. Konar, F.P. Junqueira, B. Reed, ZooKeeper: wait-free coordination for internet-scale systems, in: USENIX Annual Technical Conference, vol. 8, Boston, MA, USA, 2010.
15. F. Junqueira, B. Reed, ZooKeeper: Distributed Process Coordination, O'Reilly Media, Inc., 2013.
16. R. Ranjan, Streaming big data processing in datacenter clouds, IEEE Cloud Computing 1 (1) (2014) 78–83.
17. A. Bahga, V. Madisetti, Internet of Things: A Hands-On Approach, VPT, 2014.
18. X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al., MLlib: machine learning in Apache Spark, Journal of Machine Learning Research 17 (1) (2016) 1235–1241.
19. J.E. Gonzalez, R.S. Xin, A. Dave, D. Crankshaw, M.J. Franklin, I. Stoica, GraphX: graph processing in a distributed dataflow framework, in: OSDI, vol. 14, 2014, pp. 599–613.
20. S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J.M. Patel, K. Ramasamy, S. Taneja, Twitter Heron: stream processing at scale, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM, 2015, pp. 239–250.
CHAPTER 12
An Efficient Biography-Based Optimization Algorithm to Solve the Location Routing Problem With Intermediate Depots for Multiple Perishable Products
ERFAN BABAEE TIRKOLAEE, ME • ALIREZA GOLI, ME • MANI BAKHSHI, ME • ARUN KUMAR SANGAIAH, PHD

12.1 INTRODUCTION VRP. However, the concurrent execution of the steps


In the current competitive conditions, companies need leads to the solutions with higher quality; it is impossi-
to focus on the optimal strategic and operational de- ble to solve the large size problems [41]. After the emer-
cisions to manage and maintain their logistic systems gence of multi-depot VRP, researchers examined various
efficiently. As far as the optimal routes can cause cost solution approaches. For instance, Chao et al. [10] of-
reduction and improvement of service quality, a vital fered a heuristic approach to solve multi-depot VRP and
operational decision would be the constructions of the validated the method with several random problems.
optimal routes of vehicles. A classical vehicle routing By defining time windows in multi-depot VRP, cus-
problem (VRP) considers the provision of services to tomers are served during a unique specified time. It
a set of given customers with certain demands using a is evident that different types of this problem have
fleet of vehicles. The aim is to determine a set of routes received scant attention. This is particularly appar-
(starting from a depot and finishing at the same depot) for vehicles with finite capacity, such that the total traveling cost (the total traveled distance and the total number of vehicles used) is minimized and customers' demands are fulfilled at the same time. Moreover, each customer is serviced only once. In fact, the goal is to specify optimal routes for the vehicles that satisfy the demand of a certain number of customers and reduce the costs associated with the total traveled distance to a minimum [14]. Since the emergence of this problem in the work of Dantzig and Ramser [11], much attention has been paid to more complicated variants of the VRP and to similar routing problems, for example with time windows, heterogeneous vehicles, multi-product deliveries, multiple depots, and so on [2,28,5,24,43–45]. As a result of its widespread applications, the VRP has been reviewed extensively [20,7].

In the VRP with multiple depots, customers are serviced from several depots. As in the VRP, vehicles return to the depot where they begin their trip. The basic solution process for the multi-depot VRP includes two steps: firstly, the customers are clustered and allocated to a depot; secondly, planning is done for the clusters, as well as

ent in multiple depot VRPs. In contrast, the capacitated VRP (CVRP) and its extended version, the vehicle routing problem with time windows (VRPTW), have been investigated extensively, whereas other types of the problem have been considered only sporadically [48]. Dondo and Cerdá [12] presented a novel model-based improvement methodology for the multi-depot VRPTW that improves initial solutions by solving a series of mixed-integer linear programming (MILP) models allowing exchanges of nodes among tours and node reordering on every route.

A variety of methods have been developed to solve the VRP and VRPTW. In these problems, the routes should be constructed so that the total distribution cost is minimized. A comprehensive treatment of these and other relevant problems can be found in Toth and Vigo [47].

Indeed, assigning delivery time windows is one of the basic features of perishable products distribution. Accordingly, perishability and loss of quality appear widely in food products such as dairy, meat and vegetables [37]; they also apply to organisms, ornamental flowers, etc. The time interval between pro-

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00019-1 189
Copyright © 2019 Elsevier Inc. All rights reserved.
190 Deep Learning and Parallel Computing Environment for Bioengineering Systems

duction and sales of such goods plays an important role for manufacturers and vendors. The reason is that, if the distribution of perishable products is not effective, then:
(1) products are not delivered to their destinations (grocery stores, restaurants, etc.) on time, and the quality or value (or both) of the products plunges,
(2) inventory costs (e.g., holding costs) of the products increase significantly, and
(3) market demand decreases and customers may prefer other suppliers.

Another realistic feature of the problem is allowing multiple trips for vehicles. In one of the first studies, Fleischmann [13] examined the implications of the VRP with multiple trips. Olivera and Viera [31] noted that the single-trip VRP is not suitable for vehicles with small capacity or long service times. Accordingly, studies on the VRP with multiple trips have attracted some attention [8,4,23]. The location routing problem (LRP) is an important logistics problem with many implications for full supply chains [27]. The LRP is a relatively novel problem that considers two main tasks of a logistic system simultaneously: the facility location problem (FLP) and the VRP [50]. The LRP is a kind of VRP in which the number and locations of depots must be determined simultaneously with the routes of vehicles from depots to customers. The goal is to minimize the cost of locating depots and the cost of distribution (satisfying customers' demands) [51]. The LRP has various applications, such as food distribution [49], locating blood banks [32], newspaper distribution [25], garbage collection [9], etc.

Various studies have applied different metaheuristic algorithms in this field, including simulated annealing (SA), biogeography-based optimization (BBO), the accelerated cuckoo optimization algorithm (ACOA), and so on [15,16]. The most important reason for using these methods is the computational complexity of vehicle routing problems. Moreover, most real cases involve a large number of customers and a large number of depots, so finding optimal solutions is very difficult and sometimes impossible. To this end, metaheuristic algorithms are commonly proposed as effective tools.

A brief review of research on the VRP and LRP for perishable products distribution follows. Adenso-Diaz et al. [1] were among the first to study the distribution of dairy products. The authors considered demand allocation, the visit frequency of each customer per week, and vehicle routes, and suggested a version of the VRPTW with the aim of distribution cost minimization. Hwang [22] studied a VRP to obtain an optimal scheme of food supplies and inventory allocation for relieving starvation zones; the goal of the problem is to reduce damage and famine rather than total traveled distance or total time. Prindezis et al. [34] suggested a service-provider application for food market centers to solve the VRP, employing a tabu search (TS) algorithm. They designed specialized software for the roads of Athens and implemented it in a unified delivery logistics problem for 690 companies in the central food market. They proposed a two-phase algorithm: in the first phase, a route generation algorithm was used, and in the second phase, a TS algorithm was utilized to obtain a solution of higher quality. Hsu et al. [21] considered the stochastic process of delivering perishable foods and studied a stochastic VRPTW model for perishable food distribution to determine the optimal delivery routes, the optimal number of vehicles, the optimal departure time of vehicles from the distribution center, etc. Zeng et al. [52] proposed two techniques to solve a real-life VRP for soft drink distribution, with the main aim of minimizing the total number of required vehicles. They developed hybrid methodologies for real-world problems and improved some best-found solutions in the literature. Osvald and Stirn [33] suggested a heuristic algorithm for a time-dependent VRPTW for fresh vegetable distribution, treating perishability as a critical factor in the total distribution cost. The problem was formulated as a VRPTW with time-dependent travel times, and the model considered the impact of perishability as part of the overall distribution cost.

Ramos et al. [36] developed a nonlinear mathematical model for the VRPTW considering the effect of road irregularities on perishing fresh fruits and vegetables. The main objective of their work was to obtain the optimal distribution routes for fresh fruits and vegetables over different road classes with the least logistics cost. They employed a genetic algorithm (GA) to solve the problem. Their results showed that the VRPTW, considering road irregularities and different classes of toll roads, could significantly influence total delivery costs compared to traditional VRP models.

Govindan et al. [18] developed a multi-objective LRP with time windows for perishable products distribution. The objective is to specify the number and location of facilities and the amount of products in the echelons, and to reduce the incurred costs of greenhouse gas (GHG) emissions within the network.
CHAPTER 12 An Efficient Biography-Based Optimization Algorithm 191

Mirzaei and Seifi [29] proposed inventory routing problems (IRPs) for perishable products considering lost sales. They presented a linear model and solved it using CPLEX; moreover, they implemented a hybrid algorithm for large-sized instances, whose efficiency was verified against CPLEX. Recently, Azadeh et al. [3] developed an IRP model with transshipment for the distribution of a single perishable product. In their problem, vehicle routing and inventory decisions are made concurrently over a planning horizon to cover the customers under a maximum-level policy. They used a GA-based technique to solve the problem and generated a numerical example to illustrate the validity of the model and the efficiency of the proposed algorithm. Hiassat et al. [19] applied a GA to solve their proposed location inventory routing problem (LIRP) for the distribution of perishable products; their GA achieved appropriate solutions within a short run time.

Tirkolaee et al. [42] offered a novel model for the robust multi-trip VRPTW with intermediate depots to determine the optimal routes in a single-echelon supply chain of perishable products. They solved the proposed MILP model using the CPLEX solver and demonstrated its validity by generating and illustrating different problems. Qiu et al. [35] presented a production routing problem (PRP) for products with perishable inventory. They analyzed the optimal integrated decisions on the timing and amount of delivering and selling products in different time periods, and applied an exact branch-and-cut algorithm to solve the problem. In another recent work, Soysal et al. [39] proposed a green IRP for perishable products with horizontal collaboration, considering carbon emissions, total routing and fuel consumption costs, inventory and waste costs, driving time, and uncertain demands.

In this research, a multi-trip LRP with time windows and intermediate depots is addressed, in which vehicles may end their trip at a different intermediate depot. The contributions are as follows:
• Considering intermediate depots as key centers for providing fresh perishable products within a given time window, taking into account their proximity to customers;
• Considering multiple trips and a maximum allowable traveling time and distance for vehicles, and allowing different departure and destination depots for vehicles on each trip;
• Integrating locational and routing decisions in perishable products distribution by proposing a novel mathematical model;
• Developing an efficient BBO algorithm to solve large-sized problems;
• Finding the optimal values of the BBO parameters using the Taguchi design method.

The remaining sections of this work are organized as follows. The assumptions and mathematical model of the problem are described in Sect. 12.2. The proposed solution method and the numerical results with sensitivity analysis are presented in Sects. 12.3 and 12.4, respectively. Finally, the conclusions and outlook of the research are provided in Sect. 12.5. Moreover, Fig. 12.1 depicts the proposed methodology of the research.

12.2 MODEL DEVELOPMENT
This section discusses the assumptions and the proposed mathematical formulation of the problem, developed from the model of Tirkolaee et al. [42].

The problem is defined on a complete undirected graph G = (NT, A), with NT = {1, 2, ..., nt} = NC ∪ ND and A = {(i, j) | i, j ∈ NT, i ≠ j}, where NT, NC, and ND respectively denote the set of all nodes, the set of customer nodes, and the set of intermediate depot nodes. NV denotes the set of all vehicles, and each vehicle has its set of trips NP. The set of perishable products that must be delivered to customers is denoted by NR.

Consider a supply chain network including a set of customers. We aim to find the optimal number and locations of intermediate depots, the optimal number of vehicles, and the optimal routes within the supply chain. The objective is to minimize the total cost, including vehicle usage costs, the total traveled distance converted to transportation cost, earliness and tardiness costs of services, and the establishment costs of intermediate depots. Since the supply chain network is specific to perishable products, time windows play an important role. The main assumptions are listed below:
• Multiple perishable products are considered.
• The demand of each customer for different products is fulfilled by exactly one intermediate depot and one vehicle.
• A fleet of heterogeneous vehicles is available at the intermediate depots, each vehicle having a given capacity for each type of product.
• The supply chain network is an asymmetric, complete graph (there is a route between any two nodes).

FIG. 12.1 The proposed methodology of the research.

• Vehicles have a maximum allowable traveling distance.
• Vehicles have a maximum allowable usage time, including traveling times, loading times at intermediate depots, and unloading times at customers' nodes.
• Vehicles may make multiple trips.
• The traveling time, tij, and cost, dij, are the same for all vehicles, but the vehicles have different capacities.
• No split delivery is allowed, and the capacity constraint of each vehicle must be satisfied during route construction.
• Each vehicle starts its trip from a given established intermediate depot and does not necessarily return to that depot, but it must end the trip at one of the established intermediate depots.
• For each customer, a soft time window is defined, consisting of an initial and a secondary interval. Delivery cannot be done outside the initial interval [ei, li] (hard time window), but it can be done outside the secondary interval [eei, lli] (soft time window) by incurring penalty costs.
• Intermediate depots have different capacities for different perishable products.
• According to its vehicle parking capacity, each intermediate depot can dispatch a given number of vehicles.
• The demand parameter is deterministic.

Tables 12.1–12.4 define the sets and indices, parameters, non-decision variables, and decision variables of the proposed model.

TABLE 12.1
Indices and sets of the model.
NC = {1, 2, ..., nc}: Set of customers
ND = {1, 2, ..., nd}: Set of intermediate depots
NT = {1, 2, ..., nt}: Set of customers and intermediate depots (all nodes of the graph)
NV = {1, 2, ..., nv}: Set of vehicles
NP = {1, 2, ..., np}: Set of trips
NR = {1, 2, ..., nr}: Set of perishable products
i, j, k, l: Indices of nodes
v: Index of vehicles
p: Index of trips
r: Index of perishable products
S: An arbitrary subset of customer nodes
M: An arbitrarily large number
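The soft/hard time window assumption above translates directly into the earliness and tardiness terms penalized in the objective. A minimal sketch of this accounting (our illustration, not the authors' code; the function names are invented):

```python
# Sketch (illustrative): earliness/tardiness under a soft window [eei, lli]
# nested inside a hard window [ei, li]. Service outside the hard window is
# infeasible; deviation from the soft window is charged at rates Pe and Pl.

def earliness_tardiness(tt, eei, lli):
    """Return (YE, YL): earliness and tardiness of a service at time tt
    relative to the soft window [eei, lli]."""
    ye = max(eei - tt, 0)  # earliness: served before the soft window opens
    yl = max(tt - lli, 0)  # tardiness: served after the soft window closes
    return ye, yl

def window_penalty(tt, ei, li, eei, lli, pe, pl):
    """Penalty cost of serving at time tt, given the hard window [ei, li]."""
    if not (ei <= tt <= li):
        raise ValueError("service time violates the hard time window")
    ye, yl = earliness_tardiness(tt, eei, lli)
    return pe * ye + pl * yl
```

Service strictly inside the soft window incurs no penalty; service inside the hard window but outside the soft one is penalized proportionally to the deviation.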

TABLE 12.2
Parameters of the model.
tij: Traveling time from node i to node j
existence(j, r): A 0–1 matrix indicating whether customer j has demand for product r
QGkr: Capacity of intermediate depot k to supply product r
VCvr: Capacity of vehicle v to deliver product r
DEjr: Demand of customer j for product r
dij: Distance between node i and node j
DIv: Maximum allowable traveling distance of vehicle v
Pe: Earliness penalty cost for serving customers
Pl: Tardiness penalty cost for serving customers
FFk: Establishment cost of intermediate depot k
CVv: Usage cost of vehicle v for one planning period
ei: Lower bound of the initial interval for serving customer i
li: Upper bound of the initial interval for serving customer i
eei: Lower bound of the secondary interval for serving customer i
lli: Upper bound of the secondary interval for serving customer i
Tmax: Maximum allowable usage time of vehicles
ul: Unit loading time of vehicles at intermediate depots' nodes
uu: Unit unloading time of vehicles at customers' nodes
α: Conversion factor of distance to cost

TABLE 12.3
Non-decision variables of the model.
tt_i^p: Presence time of a vehicle at intermediate depot/customer i in trip p
tt_i: Presence time of a vehicle at intermediate depot/customer i (over all trips)
dd_i^p: Distance traveled by a vehicle upon arriving at customer i in trip p
sd_i^p: Distance traveled by a vehicle upon arriving at customer i in trip p, or the initial traveled distance of a vehicle at intermediate depot i in trip p
ddd_j^v: Distance traveled by vehicle v upon arriving at intermediate depot j
LTv: Total loading time of vehicle v
UTv: Total unloading time of vehicle v

TABLE 12.4
Decision variables of the model.
x_ij^vp: Binary variable equal to 1 if vehicle v traverses from node i to node j in trip p; 0 otherwise
y_jrvk: Binary variable equal to 1 if the demand of node j for product r is fulfilled by intermediate depot k and vehicle v; 0 otherwise
YYk: Binary variable equal to 1 if intermediate depot k is established to fulfill customers' demands; 0 otherwise
Fvk: Binary variable equal to 1 if vehicle v is used at intermediate depot k; 0 otherwise
YEi: Amount of earliness in serving customer i
YLi: Amount of tardiness in serving customer i

12.2.1 Mathematical Model
The mathematical model for the problem is:

Minimize Z = α Σ_{i∈NT} Σ_{j∈NT} Σ_{v∈NV} Σ_{p∈NP} d_ij x_ij^vp + Σ_{v∈NV} Σ_{k∈ND} CVv Fvk + Σ_{i∈NC} (Pe YEi + Pl YLi) + Σ_{k∈ND} FFk YYk   (12.1)

subject to

Σ_{r∈NR} Σ_{k∈ND} Σ_{v∈NV} y_jrvk = 0 if Σ_{r∈NR} existence(j, r) = 0, and = 1 if Σ_{r∈NR} existence(j, r) ≥ 1,  ∀j ∈ NC,   (12.2)

y_jrvk ≤ Σ_{i∈NC} Σ_{p∈NP} x_ij^vp,  ∀j ∈ NC, ∀v ∈ NV, ∀r ∈ NR, ∀k ∈ ND,   (12.3)

Σ_{i∈NT} x_ij^vp = Σ_{i∈NT} x_ji^vp,  ∀j ∈ NC, ∀v ∈ NV, ∀p ∈ NP,   (12.4)

Σ_{i∈NC} x_ik^vp + Σ_{i∈NC} x_li^vp ≤ 2,  ∀k, l ∈ ND, ∀v ∈ NV, ∀p ∈ NP,   (12.5)

Σ_{j∈NC} Σ_{r∈NR} Σ_{v∈NV} y_jrvk ≤ M Σ_{j∈NC} Σ_{v∈NV} Σ_{p∈NP} x_kj^vp,  ∀k ∈ ND,   (12.6)

Σ_{i∈NC} x_ki^vp ≤ YYk,  ∀k ∈ ND, ∀v ∈ NV, ∀p ∈ NP,   (12.7)

Σ_{i∈NC} Σ_{v∈NV} Σ_{p∈NP} x_ki^vp ≥ 1 − M(1 − YYk),  ∀k ∈ ND,   (12.8)

Σ_{i∈NC} x_il^vp ≤ YYl,  ∀l ∈ ND, ∀v ∈ NV, ∀p ∈ NP,   (12.9)

Σ_{i∈NC} Σ_{v∈NV} Σ_{p∈NP} x_il^vp ≥ 1 − M(1 − YYl),  ∀l ∈ ND,   (12.10)

Σ_{i∈NT} x_ij^vp ≤ 1,  ∀j ∈ NC, ∀v ∈ NV, ∀r ∈ NR, ∀p ∈ NP,   (12.11)

o_i − o_j + M x_ij^vp ≤ M − 1,  ∀i, j ∈ NC, ∀v ∈ NV, ∀p ∈ NP,   (12.12)

Σ_{i∈S} Σ_{j∈S, i≠j} x_ij^vp ≤ |S| − 1,  ∀S ⊆ NC, |S| ≥ 2, ∀v ∈ NV, ∀p ∈ NP,   (12.13)

Σ_{j∈NC} Σ_{v∈NV} DEjr y_jrvk ≤ QGkr,  ∀k ∈ ND, ∀r ∈ NR,   (12.14)

Σ_{i∈NT} Σ_{j∈NC} DEjr x_ij^vp ≤ VCvr,  ∀v ∈ NV, ∀r ∈ NR, ∀p ∈ NP,   (12.15)

tt_j^p = Σ_{i∈NT} Σ_{v∈NV} (tt_i^p + tij) x_ij^vp,  ∀j ∈ NC, ∀p ∈ NP,   (12.16)

tt_j^p = 0,  ∀j ∈ ND, ∀p ∈ NP,   (12.17)

tt_j = Σ_{p∈NP} tt_j^p,  ∀j ∈ NC,   (12.18)

ei ≤ tt_i ≤ li,  ∀i ∈ NC,   (12.19)

YEi ≥ eei − tt_i,  ∀i ∈ NC,   (12.20)

YLi ≥ tt_i − lli,  ∀i ∈ NC,   (12.21)

dd_j^p = Σ_{i∈NT} Σ_{v∈NV} (sd_i^p + dij) x_ij^vp,  ∀j ∈ NC, ∀p ∈ NP,   (12.22)

sd_i^p = 0 if i ∈ ND and p = 1; sd_i^p = dd_i^p if i ∈ NC and p ∈ NP,   (12.23)

ddd_j^v = Σ_{i∈NT} Σ_{p∈NP} (dd_i^p + dij) x_ij^vp,  ∀j ∈ ND, ∀v ∈ NV,   (12.24)

ddd_j^v ≤ DIv,  ∀j ∈ ND, ∀v ∈ NV,   (12.25)

Σ_{j∈NC} x_kj^vp ≤ M Fvk,  ∀v ∈ NV, ∀k ∈ ND, ∀p ∈ NP,   (12.26)

LTv = ul Σ_{j∈NC} Σ_{r∈NR} Σ_{k∈ND} DEjr y_jrvk,  ∀v ∈ NV,   (12.27)

UTv = uu Σ_{j∈NC} Σ_{r∈NR} Σ_{k∈ND} DEjr y_jrvk,  ∀v ∈ NV,   (12.28)

LTv + UTv + Σ_{i∈NT} Σ_{j∈NT} Σ_{p∈NP} tij x_ij^vp ≤ Tmax,  ∀v ∈ NV,   (12.29)

Σ_{j∈NC} Σ_{k∈ND} x_kj^vp ≥ Σ_{j∈NC} Σ_{k∈ND} x_kj^{v,p+1},  ∀p = 1, 2, ..., P − 1, ∀v ∈ NV,   (12.30)

x_ij^vp, YYk, Fvk, y_jrvk ∈ {0, 1}, YEi ≥ 0, YLi ≥ 0, o_i ∈ Z+,  ∀k ∈ ND, ∀i, j ∈ NC, ∀v ∈ NV, ∀r ∈ NR, ∀p ∈ NP.   (12.31)

The objective function (12.1) minimizes the total cost: the transportation cost, the vehicle usage costs at intermediate depots, the earliness and tardiness penalty costs, and the establishment costs of intermediate depots. Eq. (12.2) indicates that each customer is covered by exactly one vehicle and one intermediate depot. Eq. (12.3) ensures that each customer receives service only when a vehicle arrives at its node. Eq. (12.4) ensures the continuity of the vehicles' routes. Eq. (12.5) governs the assignment of intermediate depots at which trips end (considering the least distance). Eq. (12.6) relates variable y to variable x with respect to the intermediate depot serving the customer. Eqs. (12.7) and (12.8) state that a vehicle can start its travel from intermediate depot k only when that depot is established. Eqs. (12.9) and (12.10) indicate that when an intermediate depot is established, a vehicle

can finish its trip there. Eq. (12.11) ensures that each customer is visited at most once. Eqs. (12.12) and (12.13) eliminate potential sub-tours (based on the Miller–Tucker–Zemlin formulation [26]). Eqs. (12.14) and (12.15) express that the customers' demand covered by intermediate depot k and vehicle v must not exceed the capacity of that depot and of that vehicle for product r, respectively. Eqs. (12.16)–(12.19) guarantee that customers are served within the initial interval. Eqs. (12.20) and (12.21) calculate the amount of earliness and tardiness for each customer. Eqs. (12.22)–(12.25) ensure that the maximum allowable traveling distance of the vehicles is not violated. Eq. (12.26) indicates that a vehicle can serve customers only when its usage cost is paid. Eqs. (12.27) and (12.28) compute the total loading and unloading times of the vehicles, respectively. Eq. (12.29) enforces the usage time limit of the vehicles. Eq. (12.30) guarantees the sequencing of the vehicle trips from p to p + 1 for each vehicle. Eq. (12.31) specifies the types of the decision variables.

12.2.1.1 Linearization of the Nonlinear Equations
Linearization of Eq. (12.16):

tt_j^p = Σ_{i∈NT} Σ_{v∈NV} (tt_i^p + tij) x_ij^vp,  ∀j ∈ NC, ∀p ∈ NP,   (12.16)

h_ij^vp ≤ tt_i^p,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.32)
h_ij^vp ≤ M x_ij^vp,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.33)
h_ij^vp ≥ tt_i^p − M(1 − x_ij^vp),  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.34)
tt_j^p = Σ_{i∈NT} Σ_{v∈NV} (h_ij^vp + tij x_ij^vp),  ∀j ∈ NC, ∀p ∈ NP,   (12.35)
h_ij^vp ≥ 0,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP.   (12.36)

Linearization of Eq. (12.22):

dd_j^p = Σ_{i∈NT} Σ_{v∈NV} (sd_i^p + dij) x_ij^vp,  ∀j ∈ NC, ∀p ∈ NP,   (12.22)

q_ij^vp ≤ sd_i^p,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.37)
q_ij^vp ≤ M x_ij^vp,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.38)
q_ij^vp ≥ sd_i^p − M(1 − x_ij^vp),  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.39)
dd_j^p = Σ_{i∈NT} Σ_{v∈NV} (q_ij^vp + dij x_ij^vp),  ∀j ∈ NC, ∀p ∈ NP,   (12.40)
q_ij^vp ≥ 0,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP.   (12.41)

Linearization of Eq. (12.24):

ddd_j^v = Σ_{i∈NT} Σ_{p∈NP} (dd_i^p + dij) x_ij^vp,  ∀j ∈ ND, ∀v ∈ NV,   (12.24)

w_ij^vp ≤ dd_i^p,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.42)
w_ij^vp ≤ M x_ij^vp,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.43)
w_ij^vp ≥ dd_i^p − M(1 − x_ij^vp),  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP,   (12.44)
ddd_j^v = Σ_{i∈NT} Σ_{p∈NP} (w_ij^vp + dij x_ij^vp),  ∀j ∈ ND, ∀v ∈ NV,   (12.45)
w_ij^vp ≥ 0,  ∀i, j ∈ NT, ∀v ∈ NV, ∀p ∈ NP.   (12.46)

12.2.2 An Illustration
Applying the above linearizations, the proposed model turns into an MILP. The CPLEX solver in the GAMS optimization package is used to validate the model. Fig. 12.2 depicts a schematic description of the problem: there are four potential locations for the establishment of intermediate depots and ten customers in the network. Two perishable products and one type of vehicle are considered, with each vehicle allowed to make at most three trips.

TABLE 12.5
Solution to the problem depicted in Fig. 12.2.
Vehicle 2 at intermediate depot 2: Intermediate depot 2 – 1 – 2 – Intermediate depot 3 – 3 – 4 – Intermediate depot 2 – 10 – 9 – 8 – Intermediate depot 2
Vehicle 1 at intermediate depot 3: Intermediate depot 3 – 5 – 6 – 7 – Intermediate depot 3

After solving the problem, two intermediate depots are established and the vehicles perform the trips specified in Table 12.5: vehicle 2 is utilized at intermediate depot 2 and makes three trips, while vehicle 1 is utilized at intermediate depot 3 and makes just one trip.
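The linearizations above all use the same big-M device: a bilinear term such as tt_i^p · x_ij^vp is replaced by an auxiliary variable (h, q, or w) that three inequalities pin to the product when x is binary. A small sketch (our illustration, not part of the model) of the feasible interval those inequalities leave for the auxiliary variable:

```python
# Sketch (illustrative): for binary x and M an upper bound on the continuous
# quantity tt, the constraints h <= tt, h <= M*x, h >= tt - M*(1 - x), h >= 0
# (cf. Eqs. (12.32)-(12.34) and (12.36)) leave exactly one feasible value,
# h = tt * x, so the bilinear term can be replaced by the linear variable h.

def h_feasible_interval(tt, x, M):
    """Return the (lower, upper) bounds on h implied by the big-M constraints."""
    upper = min(tt, M * x)             # from h <= tt and h <= M*x
    lower = max(tt - M * (1 - x), 0)   # from h >= tt - M*(1-x) and h >= 0
    return lower, upper
```

When x = 1 the interval collapses to [tt, tt]; when x = 0 it collapses to [0, 0], which is precisely the value of tt·x in both cases. The same reasoning applies to Eqs. (12.37)–(12.39) and (12.42)–(12.44).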

FIG. 12.2 A scheme for the supply chain of the problem.
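Before turning to the solution method, the cost accounting of objective (12.1) can be made concrete. The sketch below (our illustration, not the authors' implementation; all names are ours) sums the four cost terms for a candidate set of routes:

```python
# Sketch (illustrative) of objective (12.1): alpha * total distance
# + vehicle usage costs + earliness/tardiness penalties + intermediate
# depot establishment costs. Each trip is a plain node sequence that
# starts and ends at (possibly different) intermediate depots.

def total_cost(trips, d, alpha, CV, FF, open_depots, Pe, Pl, YE, YL):
    """trips: {vehicle: [trip, ...]}; d: distance matrix as dict-of-dicts;
    CV[v]: usage cost of vehicle v; FF[k]: establishment cost of depot k;
    YE/YL: earliness/tardiness amounts per customer."""
    dist = sum(d[a][b]
               for seqs in trips.values()
               for seq in seqs
               for a, b in zip(seq, seq[1:]))
    vehicle_cost = sum(CV[v] for v, seqs in trips.items() if seqs)
    window_cost = Pe * sum(YE.values()) + Pl * sum(YL.values())
    depot_cost = sum(FF[k] for k in open_depots)
    return alpha * dist + vehicle_cost + window_cost + depot_cost
```

For a single trip depot – c1 – c2 – depot, the distance term is just the sum of consecutive arc lengths, to which the vehicle, time-window, and establishment costs are added.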

12.3 SOLUTION METHOD
In recent years, many researchers have designed various metaheuristic algorithms to solve practical optimization problems [14,3,6,17,46]. Here, two optimization methods, a BBO algorithm and the CPLEX solver, are employed to solve the MILP problem. As no benchmark is available in the literature for the proposed model, the two methods are used for mutual validation. The flow graph of the proposed approach is depicted in Fig. 12.3.

12.3.1 Introduction to the Biogeography-Based Optimization Algorithm
Simon [38] analyzed biogeography, the science of the geographical distribution of species, and its mathematical development for solving optimization problems, which led to the introduction of a novel intelligent approach called biogeography-based optimization (BBO). This approach shares characteristics with other biologically inspired optimization techniques such as the GA and particle swarm optimization (PSO). BBO is a population-based evolutionary algorithm inspired by the migration of animals and birds between islands. Islands with appropriate conditions for species have high habitat suitability indices (HSIs). The features determining an HSI, such as rainfall, herbal diversity and temperature, are called suitability index variables (SIVs).

An island with a high HSI value has a low immigration rate because it is already filled with other species and cannot accommodate new ones. As islands with low HSI values contain small populations, their immigration rate is high. Because the suitability of a place is proportional to its biogeographical diversity, the immigration of

FIG. 12.3 The flow graph of the proposed solution method.

new species to a place with a high HSI value increases that HSI value. Similar to other evolutionary algorithms, such as the GA with its mutation and crossover operators, BBO uses migration and mutation operators to introduce appropriate changes in the population across generations. The BBO algorithm has been implemented to solve various optimization problems with high efficiency [53].

12.3.1.1 Migration Operator
A set of candidate solutions is represented by arrays of integers. We can consider each integer in a solution array as an SIV (similar to a gene in the GA). Moreover, assume that there are methods to rate the solutions. Ideal solutions have high HSI values (corresponding to habitats with many species) and weak solutions have low HSI values (corresponding to habitats with few species). The HSI value in BBO is analogous to the fitness value in other population-based optimization algorithms.

Each habitat (solution) has an immigration rate λ and an emigration rate μ, which are applied to share information between solutions stochastically. These rates are calculated as follows:

λi = I (1 − ki / n),  ∀i,   (12.47)
μi = E (ki / n),  ∀i,   (12.48)

where I and E are the maximum immigration rate and the maximum emigration rate, respectively, and ki is the number of species in habitat i. Here, ki takes values between 1 and n, where n is the size of the population (1 and n stand for the best and worst solution, respectively). A solution can be modified into other solutions with a given probability. When a solution is selected to be modified, the immigration rate λ is applied to stochastically determine whether each of its SIV values should be modified. If a given SIV in a given solution Si is selected for modification, the emigration rates μ of the other solutions are used to probabilistically decide which solution should migrate a randomly selected SIV to solution Si.

As with other population-based optimization algorithms, some form of elitism is typically incorporated in order to retain the best solutions in the population. This prevents the best solutions from being corrupted by immigration.

12.3.1.2 Mutation Operator
Cataclysmic events can drastically change the HSI of a natural habitat, so a habitat's HSI may change suddenly due to random events. This is modeled in BBO as mutation of SIVs. The mutation pattern tends to increase diversity in the population, since otherwise the most probable solutions come to dominate it. The mutation method makes low-HSI solutions likely to mutate, giving them a chance to improve, and it also makes high-HSI solutions likely to mutate, giving them a chance to improve even beyond their current quality.

The probability of the number of species in the habitat is used to specify the mutation rate as follows:

m(s) = mmax (1 − ps) / pmax,   (12.49)

where mmax is the maximum mutation rate (set by the user), ps is the probability that the habitat contains s species, and pmax is the maximum of these probabilities (for more details, see [38]). Note that an elitism approach is used to preserve the characteristics of the habitat holding the best solution during the BBO process, so that even if mutation lowers its HSI, it can be saved and recovered if required.

12.3.2 Solution Representation
To represent a solution, the following coding system is used. It includes two types of string in each solution. The first string is a one-dimensional zero–one string of length ND; a value of 1 in cell i means that intermediate depot i is established. For example, with 4 potential places for intermediate depots, the following string establishes depots in places 1 and 4 (reading from left to right):

1 0 0 1

The second string has length Lmax + 1 and width nd (the number of intermediate depots). This string determines the visited nodes, the order of the visits, and the vehicle used. In each row, the first box represents the starting intermediate depot and the following boxes represent demand nodes in order of visit; the last box of each row encodes the vehicle used. For example, with four potential intermediate depot nodes, depots established at nodes 1 and 4, demand nodes numbered 5 to 10, and vehicles numbered 1 and 2, the following string represents a sample solution of the problem (Lmax = 7):

1 5 8 4 9 4 0 2
4 6 7 10 1 0 0 1

In this solution, vehicle 2 starts from intermediate depot 1, visits nodes 5 and 8, and then goes to intermediate depot 4; there it is loaded again, visits node 9, and finally returns to intermediate depot 4. Vehicle 1 starts from intermediate depot 4, visits demand nodes 6, 7, and 10, and finally goes to intermediate depot 1.

Note that this representation does not allow a vehicle to visit more than Lmax nodes. Moreover, the capacity and distance constraints of the vehicles and the vehicles' usage times are checked for feasibility, so the search for the optimal solution is conducted within the feasible region.

12.3.3 Initial Solution Generation
In the initial solution generation phase, the first string, which is binary and represents the establishment of intermediate depots, is generated randomly; each facility is established or not with equal probability.

After the established intermediate depots are specified, the vehicles to be used are selected before generating the second string. The objective here is to select vehicles with the most capacity and the least usage cost. Hence, a CQR index is defined as follows:

CQRv = CVv / Σ_{r} VCvr,   (12.50)

where CVv is the usage cost of vehicle v in one planning period and VCvr is the capacity of vehicle v for product r.

The CQR index is calculated for all vehicles, which are then sorted in ascending order of the index; in other words, the vehicle with the least CQR value has the highest priority for selection. To determine a route, a vehicle is selected and demand nodes are assigned to it randomly. In each assignment, the capacity, distance and usage time constraints of the vehicle are tested. If the remaining capacity of the vehicle is enough to take another demand node, and its remaining distance capacity and usage time are enough to serve at least one more node, the assignment process continues, considering the possibility of multiple trips. Otherwise, one of the two following procedures is applied randomly:
• the vehicle returns to an intermediate depot, replenishes, and continues to receive demand nodes, or
• the vehicle returns to an intermediate depot, and another vehicle is selected according to the CQR index; the remaining nodes are assigned to it by the same procedure applied to the vehicle that has completed its tour.

Using this process, we can expect all generated solutions to be feasible. The only constraint that can be violated is the hard time window constraint. To discourage such violations, an arbitrary penalty cost of 10^4 per unit of violation is embedded in the objective function. For example, if a trip finishes at tt = 90 while li = 85, then (90 − 85) × 10^4 is added to the objective function.

12.3.4 Immigration Phase
First, the rates λ and μ are calculated for the different solutions; then, based on the main mechanism of the BBO
CHAPTER 12 An Efficient Biography-Based Optimization Algorithm 199

algorithm, the solutions of each area are combined using the one-point crossover and the immigration rate λ. Similarly, for inter-area emigrations, solutions are combined using the two-point crossover and the emigration rate μ. It can be concluded that elitism is implemented by setting λ = 0 for the Pn best habitats, where Pn is a user-selected elitism parameter specifying how many of the best solutions should be kept from one generation to the next.

12.3.5 Mutation Phase
The mutation operation is applied to every solution based on the mutation rate. To apply the mutation operator, a trip of a vehicle is swapped with a possible trip of another vehicle. As indicated before, the operators in the BBO algorithm resemble the ones in a GA.

12.3.6 Optimal Parameter Design
A BBO algorithm has some elements and parameters which contribute to the final result and the performance of the algorithm. Accordingly, proper specification of the parameters can enhance the performance of the algorithm significantly. Next, we explain our procedure for setting the parameters.

There are two approaches to setting the values of the parameters [40]: (i) standard analysis of variance (ANOVA), and (ii) the signal-to-noise ratio (S/N). The value of S/N denotes the dispersion around a certain value; in other words, it implies how our solutions have changed through several experiments. In order to minimize the dispersion of the objective functions, the S/N of the Taguchi method is used. The S/N ratios characterize noise factors accompanying controllable parameters [40].

The parameters defined for the BBO algorithm have a remarkable impact on the robustness of the algorithm; they are given in Table 12.6. We have determined three levels for each factor (parameter).

TABLE 12.6
Input values of BBO algorithm.
Symbol   Level 1   Level 2   Level 3
G        200       500       1000
Emax     0.8       0.9       0.95
Imax     0.7       0.75      0.8
mmax     0.5       0.6       0.7
Pn       1         2         3
CT       200       250       300

As is clear, the full factorial design needs 3^6 = 729 experiments for BBO, which is not economical in terms of cost and time. By searching among the different Taguchi tables using the Minitab statistical software, the L27 (3^6) table is chosen based on our goal. After testing the data given in Table 12.6 and computing the average S/N rate for the 27 Taguchi states, the optimal values of the BBO algorithm parameters are obtained; they are given in Table 12.7.

TABLE 12.7
Parameters of BBO algorithm.
Symbol   Description                                   Value
G        Population size                               1000
Emax     Maximum emigration rate                       0.9
Imax     Maximum immigration rate                      0.8
mmax     Maximum mutation rate                         0.5
Pn       Elitism parameter                             3
CT       Stopping criterion (generation count limit)   200

For a comprehensive overview, the pseudo-code of the proposed algorithm is depicted in Fig. 12.4.

12.4 COMPUTATIONAL RESULTS
The steps of the proposed methodology implementation are described as follows:
• Generating random instance problems of different sizes;
• Validating the proposed model by solving small-sized instance problems using the CPLEX solver [42];
• Evaluating the performance of BBO compared to the CPLEX solver [44];
• Performing a sensitivity analysis on the demand parameter to study the behavior of the objective function under real-world conditions.

In order to implement the BBO algorithm, the solution representation is assembled as the structure of the solutions. At first, several random solutions are generated; then, the objective function of each one is calculated. Then the structure of BBO is applied to optimize the randomly generated solutions, as presented in Sect. 12.3.1.

To assess the effectiveness of the suggested algorithm, 15 instances of small to large sizes are generated randomly. Moreover, n nodes are first considered in a two-dimensional space to generate instance problems, and then an n × n matrix is defined to generate the
200 Deep Learning and Parallel Computing Environment for Bioengineering Systems

TABLE 12.8
Information corresponding to randomly generated problems.
Problem   n     Depots   Customers   Products   Vehicle types   Tmax
P1        5     3        2           2          2               480
P2        7     3        4           2          2               480
P3        9     4        5           3          3               480
P4        10    4        6           3          3               480
P5        13    5        8           3          4               480
P6        14    5        9           3          4               480
P7        15    5        10          3          5               480
P8        18    6        12          3          5               480
P9        21    6        15          4          6               480
P10       32    8        25          4          7               480
P11       38    8        30          5          7               480
P12       50    10       40          5          8               480
P13       65    15       50          8          10              480
P14       80    20       60          9          12              480
P15       100   30       70          10         15              480
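The instance-generation scheme described above (n nodes scattered in a two-dimensional space, from which an n × n distance matrix is built) can be sketched as follows. The sketch is in Python rather than the authors' GAMS/MATLAB setup, and the coordinate range is an illustrative assumption, not the authors' exact setting.

```python
import math
import random

def generate_instance(n, seed=0, coord_range=100.0):
    """Place n nodes uniformly at random in a square region and build
    the symmetric n x n Euclidean distance matrix between them."""
    rng = random.Random(seed)
    pts = [(rng.uniform(0, coord_range), rng.uniform(0, coord_range))
           for _ in range(n)]
    dist = [[math.dist(pts[i], pts[j]) for j in range(n)]
            for i in range(n)]
    return pts, dist

# Example: a P3-sized instance (n = 9 nodes: 4 potential depots, 5 customers).
pts, dist = generate_instance(9)
```

Any of the Table 12.8 sizes can be generated the same way, e.g., `generate_instance(100)` for a P15-sized network.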

TABLE 12.9
Computational results.
Problem   CPLEX TC   CPLEX RT (s)   BBO TC    BBO RT (s)   GAP (%)
P1        6032.2     0.2            6035      61.2         0.21
P2        12046.2    6.79           12050     97.4         0.99
P3        24074.9    198.73         24094.1   117.9        1.72
P4        12055.3    902.3          12106.4   142.7        1.21
P5        12071.4    1449.01        12371.3   168.5        2.50
P6        12079.8    3600           12990.5   193.1        0.39
P7        12761.1    3600           12897.2   207.6        1.15
P8        14672.8    3600           14995.1   267.16       0.98
P9        15128.5    3600           15297.2   304.7        2.51
P10       19796.3    3600           20679.8   325.2        0.93
P11       23498.1    3600           23134.2   346.1        0**
P12       –*         3600           28476.6   349.8        0**
P13       –*         3600           39421.1   366.1        0**
P14       –*         3600           57108.5   382.8        0**
P15       –*         3600           94252.3   415.9        0**
AVE       –          2570.468       –         249.744      0.839
* No solution found.
** BBO performs better in large-sized problems.
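The GAP column of Table 12.9 measures the relative error of BBO against CPLEX. The chapter does not restate the exact formula, so the sketch below assumes one common definition: the relative objective error in percent, truncated at zero whenever the heuristic matches or beats the exact solver, which is consistent with the 0** entries for P11–P15.

```python
def gap_percent(tc_heuristic, tc_exact):
    """Assumed GAP definition: relative error (%) of the heuristic total
    cost against the exact total cost; non-positive differences (the
    heuristic matched or beat the exact solver) count as 0%."""
    return max(0.0, (tc_heuristic - tc_exact) / tc_exact * 100.0)

# P11 of Table 12.9: BBO (23134.2) beats CPLEX (23498.1), hence GAP = 0.
print(gap_percent(23134.2, 23498.1))  # 0.0
```

Note that the reported GAP values come from the authors' experiments; this function only illustrates the truncated-relative-error idea, not their exact computation.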
FIG. 12.4 Pseudo-code of the proposed BBO [30].

corresponding graph network, and finally, the network arcs. The maximum number of trips of each vehicle and the number of vehicle types are regarded to be 3 and 7, respectively.

It should be noted that the vehicle types have different capacities, maximum distance capacities, and usage costs. The input data for these randomly generated problems are shown in Table 12.8. In this table, columns 1–7 respectively represent the problem number, the number of nodes in the network, the number of potential depots, the number of customers in the network, the number of products, the number of potential vehicle types, and the maximum time available to the vehicles. A PC with a 2.6 GHz CPU and 8 GB of RAM is used to code the model and BBO in GAMS 24.1 and MATLAB 2009b, respectively. A summary of the results obtained by our algorithm and CPLEX is shown in Table 12.9. Furthermore, a run time limitation of 3600 seconds is applied as the stopping criterion, and the best-found solution is reported.

In Table 12.9, GAP represents the relative error of the proposed metaheuristic algorithm compared to CPLEX. It is a well-known metric to determine the algorithm's efficiency. In other words, it shows how far the proposed metaheuristic algorithm is from the optimal solution. Since there is no possibility of achieving the optimal solution in large-sized instances, the output of the metaheuristic algorithm is accepted if it has a low GAP in small and medium-scale instances. The average GAP of the BBO algorithm is about 0.8%, and it has at most 2.51% error. So it can be claimed that the BBO algorithm can find the best possible solutions for large-sized problems.

As is obvious in Table 12.9, BBO performs better than CPLEX on the largest problems. Also, CPLEX is unable to solve the problem with 50 nodes within 3600 seconds. This reveals the weakness of the exact method and demonstrates the necessity of heuristic/metaheuristic algorithms.

12.4.1 Optimal Solutions for Instance Problems
In order to verify the solution of the problem, the details of the optimal solution of P3, obtained using CPLEX, are described. The number of established depots is three, numbered 1, 3, and 4; the vehicle used in depot 1 is of type 1, and the vehicles used in depots 3 and 4 are of type 3. Hence, three vehicles are used to service the customers. The optimally constructed tours for the vehicles and the schematic solution are respectively shown in Table 12.10 and Fig. 12.5. On the other hand, the solution obtained by BBO is illustrated in Fig. 12.6, and the constructed tours of the used vehicles are shown in Table 12.11.

TABLE 12.10
Representation of obtained solution by CPLEX for P3.
(Origin depot, Vehicle type)   Constructed tour
(1, 1)                         1–5–3–7–3
(3, 3)                         3–6–4–8–3
(4, 3)                         4–9–4

TABLE 12.11
Representation of obtained result by BBO for P3.
(Origin depot, Vehicle type)   Constructed tour
(1, 1)                         1–5–3–7–3
(3, 2)                         3–6–4
(4, 2)                         4–9–8–4
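As a quick structural check on Tables 12.10 and 12.11: for P3, nodes 1–4 are the potential depots and 5–9 are the customers (per Table 12.8), and every tour must start and end at an intermediate depot, though not necessarily the same one; that is exactly the relaxation the model allows. A minimal sketch of such a check:

```python
# Tours from Tables 12.10/12.11 for P3; nodes 1..4 are potential depots,
# nodes 5..9 are customers (the established depots are 1, 3, and 4).
DEPOTS = {1, 2, 3, 4}

def tour_is_valid(tour):
    """A tour must start and end at a depot; multi-trip tours may also
    pass through depots in between to reload."""
    return len(tour) >= 2 and tour[0] in DEPOTS and tour[-1] in DEPOTS

cplex_tours = [[1, 5, 3, 7, 3], [3, 6, 4, 8, 3], [4, 9, 4]]
bbo_tours = [[1, 5, 3, 7, 3], [3, 6, 4], [4, 9, 8, 4]]

assert all(tour_is_valid(t) for t in cplex_tours + bbo_tours)

def customers_covered(tours):
    """Sorted list of customer nodes visited across all tours."""
    return sorted(v for t in tours for v in t if v not in DEPOTS)

print(customers_covered(cplex_tours))  # [5, 6, 7, 8, 9]
```

Note that the first tour in both tables starts at depot 1 but ends at depot 3, illustrating that vehicles need not return to their origin depot.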
FIG. 12.5 Obtained solution by CPLEX for P3 .

FIG. 12.6 Obtained solution by BBO for P3 .

The obtained results demonstrate that the output of the proposed metaheuristic algorithm has the least error against the optimal solution. The proposed algorithm provides the possibility to search the solution space better. Therefore, it can examine different scenarios and various routes and report the best possible solutions.

12.4.2 Sensitivity Analysis
In order to study the effect of the demand parameter on the objective function, a sensitivity analysis is performed for four different change intervals. The third instance problem (P3) is selected as an example. The obtained result is represented in Table 12.12 and Fig. 12.7.

TABLE 12.12
Sensitivity analysis of the demand parameter.
Objective value for different demands:
0.9 DEjr   1.0 DEjr   1.1 DEjr   1.2 DEjr
19074.9    24094.1    27160.7    33192.8

As is obvious in Fig. 12.7, the objective function has a direct relation with the demand parameter. The largest change occurred for a 20% increase in DEjr. This analysis may be used as an effective managerial tool to support the decision-making process in real-world conditions.
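The relative changes behind Table 12.12 can be computed directly from the reported objective values; the percentage calculation below is a straightforward illustration, not part of the authors' toolchain.

```python
# Objective values from Table 12.12 for P3 under scaled demand DEjr.
objective = {0.9: 19074.9, 1.0: 24094.1, 1.1: 27160.7, 1.2: 33192.8}
base = objective[1.0]

for factor, value in sorted(objective.items()):
    # Relative change of the objective with respect to the base demand.
    change = (value - base) / base * 100.0
    print(f"{factor:.1f} x DEjr -> objective {value:.1f} ({change:+.1f}%)")
```

The largest relative jump (about +37.8%) occurs for the 20% demand increase, in line with the observation above.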
FIG. 12.7 Objective function values for various demand parameter values.

They can determine the optimal policies by studying the sensitivity analysis of key parameters such as demand, and provide the required resources to cover all the customers in a timely manner.

12.5 DISCUSSION, CONCLUDING REMARKS AND FUTURE RESEARCH DIRECTIONS
This research proposes a novel MILP model for the LRP in the distribution of multiple perishable products with multi-trip heterogeneous vehicles, in order to establish intermediate depots in the best locations for covering the customers' demands with respect to their requested time windows. As a main difference, it is assumed that vehicles need not return to the same intermediate depot from which they started their trips. This possibility helps to keep the products fresh. The objective is to minimize the total costs due to the traveled distance, vehicle usage costs, intermediate depots' establishment costs, and earliness and tardiness penalty costs. After reviewing the recent studies in terms of mathematical modeling and solution methods, the advantages of this research have been made clear. We implemented a novel metaheuristic algorithm, namely the BBO algorithm, to solve the large-sized instance problems effectively, whereas many related works applied famous metaheuristics like GA, PSO and SA. Furthermore, the Taguchi design method is implemented to adjust the algorithm parameters optimally. Several numerical examples are generated randomly in small, medium, and large sizes in order to validate the proposed mathematical model and evaluate the effectiveness of BBO. The validation of the mathematical model and algorithm is done using the CPLEX solver as an exact method. On the other hand, the comparison of the proposed algorithm with CPLEX shows a high efficiency for the proposed algorithm. Finally, a sensitivity analysis is performed on the demand parameter to show the behavior of the objective function against parameter changes. The main limitation of this research is the lack of a real case study to demonstrate its application to real-life problems.

For future studies, other metaheuristic algorithms and new strategies to generate better initial solutions may be explored and integrated with the proposed algorithm. Moreover, inventory management policies in intermediate depots can be considered in the model. In addition, the uncertain nature of the demand parameter may be investigated using well-known tools such as grey systems, fuzzy theory and robust optimization.

REFERENCES
1. B. Adenso-Diaz, M. Gonzalez, E. Garcia, A hierarchical approach to managing dairy routing, Interfaces 28 (2) (1998) 21–31.
2. M. Alinaghian, H. Amanipour, E.B. Tirkolaee, Enhancement of inventory management approaches in vehicle routing-cross docking problems, Journal of Supply Chain Management Systems 3 (3) (2014).
3. A. Azadeh, S. Elahi, M.H. Farahani, B. Nasirian, A genetic algorithm-Taguchi based approach to inventory routing problem of a single perishable product with transshipment, Computers & Industrial Engineering 104 (2017) 124–133.
4. N. Azi, M. Gendreau, J.Y. Potvin, An exact algorithm for a vehicle routing problem with time windows and multiple use of vehicles, European Journal of Operational Research 202 (3) (2010) 756–763.
204 Deep Learning and Parallel Computing Environment for Bioengineering Systems

5. E. Babaee Tirkolaee, M. Alinaghian, M. Bakhshi Sasi, M.M. Seyyed Esfahani, Solving a robust capacitated arc routing problem using a hybrid simulated annealing algorithm: a waste collection application, Journal of Industrial Engineering and Management Studies 3 (1) (2016) 61–76.
6. E. Babaee Tirkolaee, P. Abbasian, M. Soltani, S.A. Ghaffarian, Developing an applied algorithm for multi-trip vehicle routing problem with time windows in urban waste collection: a case study, Waste Management & Research 37 (1_suppl) (2019) 4–13.
7. K. Braekers, K. Ramaekers, I. Van Nieuwenhuyse, The vehicle routing problem: state of the art classification and review, Computers & Industrial Engineering 99 (2016) 300–313.
8. J. Brandao, A. Mercer, A tabu search algorithm for the multi-trip vehicle routing and scheduling problem, European Journal of Operational Research 100 (1997) 180–191.
9. R. Caballero, M. Gonzalez, F.M. Guerrero, J. Molina, C. Paralera, Solving a multi-objective location routing problem with a metaheuristic based on tabu search: application to a real case in Andalusia, European Journal of Operational Research 177 (3) (2007) 1751–1763.
10. I.M. Chao, B.L. Golden, E. Wasil, A new heuristic for the multi-depot vehicle routing problem that improves upon best-known solutions, American Journal of Mathematical and Management Sciences 13 (3–4) (1993) 371–406.
11. G.B. Dantzig, J.H. Ramser, The truck dispatching problem, Management Science 6 (1) (1959) 80–91.
12. R. Dondo, J. Cerdá, A reactive MILP approach to the multidepot heterogeneous fleet vehicle routing problem with time windows, International Transactions in Operational Research 13 (5) (2006) 441–459.
13. B. Fleischmann, The Vehicle Routing Problem with Multiple Use of Vehicles, Working paper, Fachbereich Wirtschaftswissenschaften, Universität Hamburg, 1990.
14. F.P. Goksal, I. Karaoglan, F. Altiparmak, A hybrid discrete particle swarm optimization for vehicle routing problem with simultaneous pick-up and delivery, Computers & Industrial Engineering 65 (1) (2013) 39–53.
15. A. Goli, A. Aazami, A. Jabbarzadeh, Accelerated cuckoo optimization algorithm for capacitated vehicle routing problem in competitive conditions, International Journal of Artificial Intelligence 16 (1) (2018) 88–112.
16. A. Goli, S.M.R. Davoodi, Coordination policy for production and delivery scheduling in the closed loop supply chain, Production Engineering (2018) 1–11.
17. A. Goli, E.B. Tirkolaee, B. Malmir, G.B. Bian, A.K. Sangaiah, A multi-objective invasive weed optimization algorithm for robust aggregate production planning under uncertain seasonal demand, Computing (2019) 1–31.
18. K. Govindan, A. Jafarian, R. Khodaverdi, K. Devika, Two-echelon multiple-vehicle location-routing problem with time windows for optimization of sustainable supply chain network of perishable food, International Journal of Production Economics 152 (2014) 9–28.
19. A. Hiassat, A. Diabat, I. Rahwan, A genetic algorithm approach for location–inventory routing problem with perishable products, Journal of Manufacturing Systems 42 (2017) 93–103.
20. W. Ho, G.T. Ho, P. Ji, H.C. Lau, A hybrid genetic algorithm for the multi-depot vehicle routing problem, Engineering Applications of Artificial Intelligence 21 (4) (2008) 548–557.
21. C.I. Hsu, S.F. Hung, H.C. Li, Vehicle routing problem with time-windows for perishable food delivery, Journal of Food Engineering 80 (2) (2007) 465–475.
22. H.S. Hwang, A food distribution model for famine relief, Computers & Industrial Engineering 37 (1–2) (1999) 335–338.
23. C.K.Y. Lin, R.C.W. Kwok, Multi-objective metaheuristics for a location-routing problem with multiple use of vehicles on real data and simulated data, Journal of the Operational Research Society 175 (2006) 1833–1849.
24. O. Jabali, T. Woensel, A.G. De Kok, Analysis of travel times and CO2 emissions in time-dependent vehicle routing, Production and Operations Management 21 (6) (2012) 1060–1074.
25. S.K. Jacobsen, O.B.G. Madsen, A comparative study of heuristics for a two level routing-location problem, European Journal of Operational Research 5 (6) (1980) 378–387.
26. I. Kara, G. Laporte, T. Bektas, A note on the lifted Miller–Tucker–Zemlin subtour elimination constraints for the capacitated vehicle routing problem, European Journal of Operational Research 158 (3) (2004) 793–795.
27. R.B. Lopes, S. Barreto, C. Ferreira, B.S. Santos, A decision-support tool for a capacitated location-routing problem, Decision Support Systems 46 (1) (2008) 366–375.
28. S.H. Mirmohammadi, E. Babaee Tirkolaee, A. Goli, S. Dehnavi-Arani, The periodic green vehicle routing problem with considering of time-dependent urban traffic and time windows, Iran University of Science and Technology 7 (1) (2017) 143–156.
29. S. Mirzaei, A. Seifi, Considering lost sale in inventory routing problems for perishable goods, Computers & Industrial Engineering 87 (2015) 213–227.
30. S.N. Nemade, M.T. Kolte, S. Nemade, Multi-user detection in DS-CDMA system using biogeography based optimization, Procedia Computer Science 49 (2015) 289–297.
31. A. Olivera, O. Viera, Adaptive memory programming for the vehicle routing problem with multiple trips, Computers & Operations Research 34 (2007) 28–47.
32. I. Or, W.P. Pierskalla, A transportation location-allocation model for regional blood banking, AIIE Transactions 11 (2) (1979) 86–95.
33. A. Osvald, L.Z. Stirn, A vehicle routing algorithm for the distribution of fresh vegetables and similar perishable food, Journal of Food Engineering 85 (2) (2008) 285–295.
34. N. Prindezis, C.T. Kiranoudis, D. Marinos-Kouris, A business-to-business fleet management service provider for central food market enterprises, Journal of Food Engineering 60 (2) (2003) 203–210.
35. Y. Qiu, J. Qiao, P.M. Pardalos, Optimal production, replenishment, delivery, routing and inventory management policies for products with perishable inventory, Omega 82 (2019) 193–204.
36. T.R.P. Ramos, M.I. Gomes-Salema, A.P. Barbosa-Povoa, A multi-product, multi-depot vehicle routing problem in a reverse logistics system: comparative study of an exact formulation and a heuristic algorithm, in: Livro de actas da 14° Congresso da APDIO, IO 2009, 2009, pp. 195–202.
37. K. Sawaki, Optimal policies in continuous time inventory control models with limited supply, Computers & Mathematics with Applications 46 (7) (2003) 1139–1145.
38. D. Simon, Biogeography-based optimization, IEEE Transactions on Evolutionary Computation 12 (6) (2008) 702–713.
39. M. Soysal, J.M. Bloemhof-Ruwaard, R. Haijema, J.G. van der Vorst, Modeling a green inventory routing problem for perishable products with horizontal collaboration, Computers & Operations Research 89 (2018) 168–182.
40. G. Taguchi, S. Chowdhury, Y. Wu, Taguchi's Quality Engineering Handbook, Wiley and Sons, 2005.
41. L. Tansini, M.E. Urquhart, O. Viera, Comparing Assignment Algorithms for the Multi-depot VRP, Reportes Técnicos 01-08, UR, FI – INCO 2001.
42. E.B. Tirkolaee, A. Goli, M. Bakhsi, I. Mahdavi, A robust multi-trip vehicle routing problem of perishable products with intermediate depots and time windows, Numerical Algebra, Control and Optimization 7 (4) (2017) 417–433.
43. E.B. Tirkolaee, M. Alinaghian, A.A.R. Hosseinabadi, M.B. Sasi, A.K. Sangaiah, An improved ant colony optimization for the multi-trip Capacitated Arc Routing Problem, Computers & Electrical Engineering (2018), https://doi.org/10.1016/j.compeleceng.2018.01.040, in press.
44. E.B. Tirkolaee, I. Mahdavi, M.M.S. Esfahani, A robust periodic capacitated arc routing problem for urban waste collection considering drivers and crew's working time, Waste Management 76 (2018) 138–146.
45. E.B. Tirkolaee, A.A.R. Hosseinabadi, M. Soltani, A.K. Sangaiah, J. Wang, A hybrid genetic algorithm for multi-trip green capacitated arc routing problem in the scope of urban services, Sustainability 10 (5) (2018) 1366.
46. E.B. Tirkolaee, A. Goli, M. Hematian, A.K. Sangaiah, T. Han, Multi-objective multi-mode resource constrained project scheduling problem using Pareto-based algorithms, Computing (2019) 1–24.
47. P. Toth, D. Vigo, The Vehicle Routing Problem, SIAM Monographs on Discrete Mathematics and Applications, SIAM, Philadelphia, PA, 2002.
48. T. Vidal, T.G. Crainic, M. Gendreau, N. Lahrichi, W. Rei, A hybrid genetic algorithm for multidepot and periodic vehicle routing problems, Operations Research 60 (3) (2012) 611–624.
49. C. Watson-Gandy, P. Dohrn, Depot location with van salesmen: a practical approach, Omega 1 (3) (1973) 321–329.
50. V.F. Yu, S.W. Lin, W. Lee, C.J. Ting, A simulated annealing heuristic for the capacitated location routing problem, Computers & Industrial Engineering 58 (2) (2010) 288–299.
51. M.H.F. Zarandi, A. Hemmati, S. Davari, The multi-depot capacitated location-routing problem with fuzzy travel times, Expert Systems with Applications 38 (8) (2011) 10075–10084.
52. L. Zeng, H.L. Ong, K.M. Ng, S.B. Liu, Two composite methods for soft drink distribution problem, Advances in Engineering Software 39 (5) (2008) 438–443.
53. Y.J. Zheng, H.F. Ling, X.L. Xu, S.Y. Chen, Emergency scheduling of engineering rescue tasks in disaster relief operations and its application in China, International Transactions in Operational Research 22 (3) (2015) 503–518.
CHAPTER 13

Evolutionary Mapping Techniques for Systolic Computing System
C. BAGAVATHI, MTECH • O. SARANIYA, ME, PHD

13.1 INTRODUCTION
Computational requirements from the current technology are demanding advancements in consumer products for accurate and fast decisions from a hardware–software platform [2]. The requirements can be satisfied by building faster devices such as VLSI/ASIC chips for given specifications. The upper limit on the maximum speed at which a chip can be operated makes building an integrated circuit (IC) from scratch for a specific application unfeasible. Expensive cooling systems and the limitation on integration density with silicon are further reasons for not opting to develop a custom IC. An alternative method is to adopt pipelining and parallel processing techniques with multiple processors to perform the task. Based on the available choices, the second method is more conservative and can achieve maximum performance through transformation of the data processing architecture.

According to the classification of computers by Flynn [3], any architecture can be classified as one of four data processing architectures based on the number of instruction and data streams. Signal transmission strategies, array construction, methods of programming computational units and algorithmic adaptation are to be decided for a parallel architecture. When multiple instructions are carried out on a single data stream available in a processor, it is an MISD (multiple instruction stream, single data stream) processor. In systolic arrays, instead of following the process where a sequence of locally stored instructions is fetched and executed, raw and processed data move around the architecture, and the instructions of data processing are stagnant in the computational units. Systolic arrays are an example of MISD processors and were designed in 1978 [4].

Researchers currently work on developing an architecture that will be economically, logically and environmentally stable, so that the design can be designated as an optimum architecture. In fulfilling such a demand, designs are identified and tested using a trial-and-error method for satisfying parameters such as throughput, power, area, speed, cost, accuracy, and design time. Random methods are time consuming and produce suboptimal results. This calls for automated approaches which have the ability to generate several possible solutions efficiently and swiftly. This target can be achieved by using computer aided techniques for the exploration of the solution space. Systolic array mapping can be formulated mathematically as a problem of constraint optimization. The best solution of design vectors is selected through rigorous scrutiny of solution candidates by evaluating an individual fitness function. The main intention is to minimize the total delay associated with each edge of the processing element and also to maximize the hardware utilization efficiency. Automated learning–searching algorithms such as evolutionary algorithms (EA) are chosen for designing systolic arrays, as they try to learn the best way to reach the optimal solution through bio-inspired mechanisms. In designing evolvable hardware that adapts to different run time configurations, evolutionary algorithms are preferred for providing minimum evolve time of a configuration [5]. Bio-inspired computing (BIC) is considered a major domain in computational intelligence, where the evolution strategy of species has been mimicked to derive a mathematical model for reaching an optimum solution. The learning-from-nature perspective started in the early 1950s and picked up pace from the 1990s. The concept of a species learning a coping mechanism for difficult situations has been efficiently adapted for computationally intensive tasks. Of all the algorithms based on species evolution and group dynamics, a recent development of bio-inspired evolutionary computation mathematically formulated the behavior of adjustment of internal organ configuration in a human body when an imbalance condition is forced [6]. This algorithm, termed allostatic optimization (AO), deals with the inherent feedback mechanism that acts when an instability is detected.

Evolutionary programming has been employed in this chapter to identify several optimal design vectors for systolic architecture. Swarm intelligence methods like particle swarm and ant colony optimization have
Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00020-8 207
Copyright © 2019 Elsevier Inc. All rights reserved.
208 Deep Learning and Parallel Computing Environment for Bioengineering Systems

been applied in discrete form in order to identify the • Computationally, if a processor accepts a credit on
comparative performance analysis and to identify the an aid link at time t, then it accepts a credit at time
optimal solution. Genetic and memetic algorithms im- t +  on the interrelated yield link, where  is the
prove the genetic character of a population in the di- time length that is independent of the communi-
rection which improves the fitness. The characteristics cation size, adaptation of the link and place of the
and superiority of obtained systolic architecture are de- processor [7].
pendent on the fundamental design vectors for a sys- Pipelining works on the performance of a machine
tolic array, such as projection, processor and schedul- by dividing a process into a number of stages with
ing vectors, selected through iterative algorithms. Since added storage units in the process line [8]. A pipeline
parallelism and pipelining are the superior features of uses the available computation time justifiably. Parallel
systolic arrays, the designed architecture is expected to processing is utilizing many similar processors to deal
offer maximum hardware utilization efficiency. with different data simultaneously to improve through-
put. Implementation of arithmetic, logical or other type
of operation through pipelined architecture improves
13.2 SYSTOLIC ARRAYS the performance of computing machines and gives an
A systolic system is an interconnection of processing ele- efficient output with lesser time of execution [9]. For
ments that specifically compute and distribute data over applications with comparable computational duration
the system [4]. The term “systolic” has been selected and sampling rate, parallel architectures are chosen for
exclusively to represent the data traversal through the tasks with operations proportional to integral power
architecture resembling the function of the heart. In a of points per data unit [10]. Novel architectures with
systolic computing method, each processor frequently high efficiency can be designed when both paralleliza-
pumps the data inside and outside a cell for realizing tion and pipelining are implemented as a hybrid design.
some brief computing, and the data is made available Out of the commonly available parallel MISD designs,
at the local connection. systolic and wavefront arrays [11] are popular for var-
The basic structure of systolic architecture is demon- ious advantages. Wave front processors consist of an
strated in the figure given below. Fig. 13.1 explains the array of similar or dissimilar processors, each with its
own local memory and connected in a nearest-neighbor
topology [12]. Each processor usually performs a ded-
icated computation. Hybrids containing two or more
different cells are possible. The cells fire asynchronously
when all required inputs are available [13]. It is suitable
of computationally exploiting algorithms [14] that use
asynchronous data flow mechanism. Systolic arrays are
most appropriate for computationally intensive appli-
cations with the inherent massive parallelism because
they have capability of dealing with modular, concur-
rent, regular, synchronous, rhythmic processes that re-
quire repetitive and intensive computation. Compared
to wavefront arrays, the systolic model suits very well
for special purpose and high performance computer de-
vices [13]. It works under a global clock.
FIG. 13.1 Systolic architecture.

Systolic arrays have the ability to increase the computation power of processors when arranged in systolic fashion. A systolic array is an interconnection of processors in which the processors can be located at a network edge of a bound so that:
• Topologically, if there is a focused link from the processor at place I to the processor at place I + d for some d, then there is such a link for each I in the inner grid.

The main principle of operation is to fetch the data from memory and use it efficiently for computation. Functional modules are designed for each systolic cell, and data are allotted for unit computation [15]. Systolic arrays are suitable for designing high-level computation in hardware structures [16]. Simple, regular interconnections lead to better implementations and higher densities, similar to the design and implementation of gate arrays. A highly compact design requires both good performance and a low demand for supporting entities [17]. By replacing a single processing element with an array of processing elements or cells, a higher computation throughput can be accomplished without demanding more memory bandwidth. Systolic arrays have adaptability, a modular structure, local interconnections, lower bandwidth demands and compute-accelerated operations. Systolic arrays generally have a high rate of input/output, and they are well-matched for demanding parallel operations.

CHAPTER 13 Evolutionary Mapping Techniques for Systolic Computing System 209

FIG. 13.2 Comparison of architectures for a TPU [25].

H.T. Kung et al. invented the systolic array, and many more applications were designed, such as QR decomposition [18], matrix triangulation, convolution [19], multipliers [20], programmable chips [21], fault tolerance applications [22], linear time algorithms, polynomial computation, warp processors, etc. The best known example of a systolic array is the iWarp processor of Intel, which uses a linear array of processors connected by data buses [23]. The most recent systolic array design is Google's Tensor Processing Unit (TPU), known for its superior configuration and performance. TPUs have been deployed in Google's data centers for more than a year, and they have demonstrated the superior performance of machine learning through systolic arrays. Each computational cell is a matrix multiplication [24] unit based on systolic arrays [25]. Each element is a MAC unit, and a systolic array produces the highest density of MAC units compared to a full core design. Fig. 13.2 describes the various implementations available for a TPU, with a mention of the design with systolic arrays.

Systolic arrays are used in various applications, including language recognition [26], signal processing [27], relational database operations [28], matrix arithmetic [29–31], character string manipulation [32], neural networks [33] and image processing [34–39]. Systolic arrays have been used in the most advanced neurocomputer for machine learning [40]. Systolic arrays are designed based on the dependence graph (DG), which is transformed to a space–time representation. The transformation techniques rely on three vectors, namely the projection, scheduling, and processor vectors. From these three vectors, edge mapping is performed to construct the systolic architecture [41]. From a vast space of possible solutions, there is a strong need for an automated and efficient approach for designing the exact systolic array for a given application.

13.3 EVOLUTIONARY ALGORITHMS
Design environments have expanded with the need for more automated processes in real-world optimization problems. Applications such as computer vision [42], robotics, big data analytics, and bioinformatics [43] require algorithms to be designed for high efficiency and robustness. For designing such optimization processes, the current trend is machine learning and search methodologies [44].

An evolutionary algorithm (EA) is an optimization algorithm that mimics biological mechanisms such as mutation, recombination, and natural selection to find an optimal design within specific constraints [45]. Evolutionary algorithms are a rapidly developing associative analysis, in which a collection of techniques and systems are learned for managing a large complicated problem. Many techniques are available under the class
210 Deep Learning and Parallel Computing Environment for Bioengineering Systems
of evolutionary algorithms that differ in the representation of the solution, implementation details and how the particular problem is applied. The general algorithm of an evolutionary procedure is given below:
• Select an initial population x^0 = {x_1^0, x_2^0, ..., x_N^0}, x_i ∈ S, where S is the search space;
• Determine the value of the objective function f(x^0) for each member of the population;
• Repeat for every iteration j until the termination condition is met:
  - Perform selection;
  - Perform crossover with a probability;
  - Perform mutation with a probability;
  - Determine the new population x^j and fitness function f^j;
  - Replace the old members if the new members are better in fitness; else retain the same members and proceed with the iterations.

A general design of an evolutionary algorithm is explained in Fig. 13.3. The initial operand selection followed by fitness evaluation and population reproduction forms the basic process of an EA. The iteration continues until termination. Optionally, an EA can perform adaptation of the algorithm or a local search. By choosing various options, new evolutionary methods can be derived.

FIG. 13.3 General framework of evolutionary computation [46].

Evolvable hardware (EH) systems are configurable hardware systems which are able to adapt to different problems at run time. Unlike classical system design, where the designer decides or calculates the structure and configuration of the system based on the problem specifications, EH uses an evolutionary algorithm (EA) to tune its parameters or structure in order to find the optimal configuration for a certain problem according to a set of training samples. These training samples are representative examples of the problem that needs to be solved.

Evolutionary computing (EC) can be basically classified into four classes: evolutionary strategy (ES), evolutionary programming (EP), genetic algorithm (GA) and genetic programming (GP). Recent trends have included swarm intelligence (SI) with evolutionary computation due to similar methods of evaluation in the two classes [46]. Evolutionary strategies are specific techniques designed for solving optimization problems. Evolutionary programming attempts to develop artificial intelligence (AI) by predicting possible conditions of a defined situation from the experience learned from previous instances through machine learning (ML). Genetic algorithm is a well defined, evolving optimization method. Evolutionary computing utilizes a community oriented search with a disruptive mechanism, such as crossover, and a denotation procedure, such as reproduction. Evolutionary procedures are well known for their ability to integrate theoretical and computational models, to apply to a wide range of domains, to provide parallel convergence, to engage in self-development, and to provide true global optimum solutions.

Hybrid evolutionary algorithms (HEA) are successful methodologies due to their robustness in noisy environments, ability to handle huge data, and capability to produce reliable results [47].

An exhaustive search algorithm, or heuristic [48], is a non-evolutionary algorithm which gives an approximate solution in less than polynomial time. Heuristics move from one point to another in the index space using some transition rules. The value of the objective function is calculated for each point, and the transition takes place to optimize the function. The heuristic used in this chapter is a bounded search optimization heuristic. The vectors are chosen by using constraints on the search (solution) space. The vectors which give the minimum cost function are optimal. This method of heuristics gives the vectors in a single iteration if the search space is of low order.

Genetic algorithm (GA) [49] is the most widely used evolutionary procedure, standing on the concept of natural selection since its development in 1975 by John Holland [50,51]. A probable solution of a genetically designed optimization problem is coded as a genetic
strand. There exists a one-to-one mapping between the result points and the genetic representations. The possible solutions are available as a set of populations that are allowed to randomly combine and modify until some termination condition, like a maximum number of iterations or a satisfactory fitness function value, is reached. The three main operators are reproduction (selection), crossover, and mutation. A wide variety of genetic algorithms exists, tailored to the needs of various applications [52,53]. Steady state GA is the commonly used method, where the offspring from crossover replaces the worst fit candidate only if it is better than the candidates already in the population. The main parameters used in the GA procedure are the population size, number of generations, and crossover and mutation rates.

Memetic algorithm (MA) [54] is designed based on the inspiration from Dawkins' notion of a meme. It belongs to the evolutionary computation class with an optional local search process [46]. MAs are similar to GA, but perform on a cluster of elements (memes). The first step is to select a group of memes (candidate solutions) and allow them to evolve towards the optimal solution by crossover and mutation along with the personal experience of the memes. By adding the local improvement factor along with information variation, MA converges faster compared to GA. Memetic algorithm improves the population-based global search method by adding a local search technique. Any advantageous information available in a local area can be used to guide the search to a better solution. The stopping condition can be the total number of iterations before reaching a target, the number of iterations for which the target value has been stable, or a satisfactory target value [45].

Shuffled frog leaping algorithm (SFL) combines the essence of the group based MAs and the social behavior-based PSO algorithms [61]. In the SFL, the population consists of a set of frogs (solutions) that is partitioned into subsets referred to as memeplexes, similar to memes in MA. Local search is performed by different societies of frogs that are considered as different memeplexes. Each memeplex includes individual frogs with unique ideas to reach the target (food). The ideas of frogs in a memeplex can be influenced by other frogs in the same group. The ideas can be evolved and passed through other memeplexes through a shuffling process. The local search and shuffling processes continue until the defined convergence criteria are satisfied [62,63]. An initial population of P frogs is created randomly. For N-dimensional problems (N variables), frog i is represented as Xi = (xi1, xi2, ..., xiN). The frogs are arranged in an order based on their fitness. Similar to the roulette wheel selection of GA, the frogs are sorted into m groups, where each frog from the ordered list is allotted a group. The entire population is divided into m memeplexes, each containing q frogs [32]. Within each memeplex, the frogs with the best and the worst fitnesses are identified as Xb and Xw, respectively. Also, the frog with the global best fitness is identified as Xg. Then, a process similar to PSO is applied to improve only the frog with the worst fitness (not all frogs) in each cycle [45]. Accordingly, the position of the frog with the worst fitness is adjusted as follows:

Di = Rand() × (Xb − Xw), (13.1)
X′w = Xw + Di, (13.2)

where Rand() is a random number between 0 and 1, Dmax is the maximum allowed change in a frog's position, and Di varies within twice of Dmax. When the fitness is improved, the worst frog is replaced. Otherwise, the calculations are repeated with respect to the global frog. The main parameters of SFL are the number of frogs P, the number of memeplexes, the number of generations for each memeplex before shuffling, the number of shuffling iterations, and the maximum step size.

13.4 SWARM INTELLIGENCE (SI)
Swarm intelligence (SI) is the bio-inspired collective behavior of organisms interacting among themselves and with their environment to achieve a target. SI shows how a group of similar objects can work together and produce amazing results in terms of creativity and efficiency. Stigmergy is the change of the behavioral pattern of a group member due to the influence of other group members. Stigmergy forms the basic principle behind developing swarm intelligence and computational methods. The common algorithms that come under swarm intelligence are particle swarm optimization, ant colony optimization, bacterial foraging optimization, artificial bee colony optimization, pigeon inspired optimization, and many more.

Particle swarm optimization (PSO) [55] is a trajectory-evolving biomimetic algorithm which imitates a flock of birds (solutions) trying to reach the destination. The birds move by comparing their current position with that of the bird which is leading in the direction of the destination. When the position of a bird is trailing compared to the best positioned bird, it accelerates towards that bird with a velocity, hence updating its best position. The algorithm considers the personal and social experience of the birds for reaching the target. As the experience of individual birds and of birds as a flock
is utilized for optimizing the direction angle of reaching the target, the result is obtained swiftly. The particle (bird) is denoted by i. Three parameters of each particle i are monitored:
• Current position Xi = (Xi1, Xi2, ..., XiS);
• The best of the previous positions assumed by the particle, Pi = (Pi1, Pi2, ..., PiS);
• Velocity of flight Vi = (Vi1, Vi2, ..., ViS).

As the particles move, a new position Xi and a new velocity Vi are acquired by each particle with the goal of reaching the destination:

Vi = ω × Vi + c1 × Rand() × (Pi − Xi) + c2 × Rand() × (Pg − Xi), (13.3)
Xi = Xi + Vi, (13.4)

where Vmax ≥ Vi ≥ −Vmax.

The improvement in velocity as given by Eq. (13.3) is formulated using three terms: the first corresponds to the current velocity scaled by an inertia weight, which signifies the tendency of the particle to cling to the actual velocity; the second corresponds to the cognition or personal experience of the particle, comparing its attained position Xi to its own best position Pi, scaled by a constant c1; and the third corresponds to the social behavior of the particle, comparing its current position Xi to the global best position Pg, scaled by a constant c2. The constants c1 and c2 are the learning factors, usually set to 2 [45]. The velocity is allowed a maximum value Vmax in order to keep a hold on the range of results. The main parameters used in the PSO technique are the population size, the number of iterations (generations), the maximum change of a particle velocity Vmax and the inertia weight ω. Generally, PSO is used for unconstrained problems where the variables have no limits, similar to the sky limit reached by the birds. PSO is known for its simple implementation, derivative-free design, parallel processing, efficient global search and few parameters to be monitored.

The hybrid PSO–GA algorithm has been a popular enhancement to EA. Crossover on the global best value, mutation on a stagnant best position (the personal best position of a particle) and an initial population of PSO derived from GA are some hybridized solutions for HEA [56]. Hybrid algorithms are out of scope for this chapter and have been mentioned for the sake of completeness.

The biological behavior of ants [57] searching for food through the shortest path, and of ants selecting the path initially followed by a predecessor, forms the inspiration for ant colony optimization (ACO) [58]. Ant colony optimization [59] is a population based optimization method. The ants, when following a path to find food, leave a chemical named pheromone to be detected by the following ants. If more ants follow the same path, then as an effect of positive feedback, the concentration of pheromone increases, thereby indicating the most traveled path. In implementing ACO, each iteration consists of the same number of ants, but the representation, or the path chosen by the ants, varies among cycles. Each ant has S representations, and each representation includes a set of path options and pheromone concentrations. To ensure good solutions, the pheromone concentration associated with each path is altered in every iteration. The stagnation of results at local optima is avoided by using a pheromone evaporation rate constant that reduces the concentration with time, as there is a half-life of a few minutes for the chemical [60]. The pheromone concentration increases with fitness. For minimization problems, the fitness is inversely proportional to the pheromone concentration. After updating the pheromone concentration, the next iteration starts by selecting the path for the ants. The main parameters involved in ACO are the number of ants m; the number of iterations N; the exponents α and β, which control the importance of the pheromone concentration by a factor which indicates the goodness of a path for the ant to select; the pheromone evaporation rate ρ; and the pheromone reward factor R, indicating the tendency to retain the pheromone concentration. The pheromone concentration associated with each possible route is given by τi,j:

τi,j(t) = ρ × τi,j(t − 1) + δτi,j(t), t = 1, 2, 3, ..., T, (13.5)

where T is the number of iterations. The change in pheromone concentration δτi,j is determined by the path chosen by ant k. An ant chooses a path with a specific probability decided by parameters like α, β and τ. Ant colony optimization can be applied to discrete optimization and graph problems.

The comparative design of the evolutionary algorithms is pictorially depicted in Fig. 13.4. Each algorithm is unique in its representation, method of selection and evolutionary concept to reach the next generation. The following sections list the benefits of using evolutionary algorithms for systolic array design.
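The particle update of Eqs. (13.3)–(13.4) can be sketched for a single particle as follows. The inertia weight, the Vmax value and the helper name `pso_step` are illustrative assumptions, while c1 = c2 = 2 follows the text above.

```python
import random

def pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0,
             v_max=4.0, rng=random):
    """One velocity/position update for a particle, Eqs. (13.3)-(13.4)."""
    new_v = []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        vel = (w * vi
               + c1 * rng.random() * (pi - xi)   # cognitive (personal) term
               + c2 * rng.random() * (gi - xi))  # social (global) term
        vel = max(-v_max, min(v_max, vel))       # enforce -Vmax <= Vi <= Vmax
        new_v.append(vel)
    new_x = [xi + vi for xi, vi in zip(x, new_v)]  # Eq. (13.4)
    return new_x, new_v
```

Calling this once per particle per generation, and refreshing Pi and Pg whenever fitness improves, reproduces the basic PSO loop; the clamping line is where the Vmax bound of the text takes effect.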
FIG. 13.4 Comparison of EA.
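The skeleton that Fig. 13.4 compares is the same in every case: initialize, evaluate, select, recombine, mutate, and replace. A minimal sketch of that shared loop, minimizing a toy quadratic objective, is given below; all parameter values and the helper name `evolve` are illustrative assumptions, not the chapter's settings.

```python
import random

# Generic evolutionary loop: initialize a population, evaluate fitness,
# select the fitter members, apply crossover and mutation with given
# probabilities, and keep only the best-ranked individuals.
def evolve(f, dim, pop_size=30, generations=200, p_cross=0.8,
           p_mut=0.1, lo=-5.0, hi=5.0, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)                    # selection: fitter half survive
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = ([(x + y) / 2 for x, y in zip(a, b)]
                     if rng.random() < p_cross else a[:])   # crossover
            child = [x + rng.gauss(0, 0.1) if rng.random() < p_mut else x
                     for x in child]                        # mutation
            children.append(child)
        # replacement: new members enter only if they rank among the best
        pop = sorted(parents + children, key=f)[:pop_size]
    return min(pop, key=f)
```

Swapping out the selection, crossover, or mutation steps yields the different algorithm families of Fig. 13.4 while the outer loop stays unchanged.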
13.5 MAPPING TECHNIQUES
Systolic programming can be applied to any iterative algorithm through a systematic mapping procedure. Iterative algorithms are those which perform the same computation in every iteration on different sets of data. These iterative algorithms need a parallel architecture to reduce the latency and improve the computational efficiency at a reduced cost. Mapping is a procedure to convert the nested loops of the algorithm to a scheduled arrangement of processors. Data dependency and the significance of the projection, processor and scheduling vectors are used in designing the systolic architecture [41]. In mapping a design, the most essential steps are: dependence graph (DG) capturing, space–time mapping, signal flow graph (SFG) optimization, and intermediate representation (IR) in HDL.

A dependence graph (DG) is a directed graph that portrays the dependence of the computations in an algorithm. The nodes in a DG depict computations, and the edges depict precedence restraints between nodes. In a DG, a fresh node is formed whenever a fresh computation is called for in the algorithm (program), and no node in a DG is ever reutilized on an individual basis of computation. To attain the maximal parallelism in an algorithm, an analysis of the data dependencies in the computation is essential [41]. A dependence graph can be simply considered as an all-space no-time representation of the data dependency in an algorithm. In contrast, systolic architectures have a space–time representation. Systolic design methodology maps an N-dimensional dependence graph to a lower-dimensional systolic architecture using a transformation through space–time mapping. Systolic architecture is designed by using linear mapping techniques on a regular dependence graph. An edge in the dependence graph represents precedence constraints in the signal flow graph direction, and any node in the DG represents the presence of an edge in the same direction at all nodes in the DG [64]. See Fig. 13.5.

The index vector I represents any point in the search space S:

I = (i, j)^T.

FIG. 13.5 Mapping process of a regular algorithm onto systolic arrays.

Systolic mapping is a process of assigning each point in the iteration space a scheduled processing element for the operation at a discrete time to obtain an intermediate representation (IR) of processors. The search for the optimal matrices is generally through heuristics [65,66], where a heuristic is a method of general search which forces the direction of the search to be oriented towards the optimum point best suited for the immediate goals [67]. The overall operation of the array is determined by the processor allocation matrix p and the scheduling matrix s. The processor allocation matrix allocates different data to the available processors. The scheduling matrix determines the time at which the next data should reach a particular processor:

p = (p1, p2)^T,
s = (s1, s2)^T.

These matrices have certain constraints on the space based on the projection vector, or iteration vector, d. Two nodes that are displaced by d are processed by the same processor:

d = (d1, d2)^T.

The hardware utilization efficiency (HUE) is calculated from the scheduling matrix s and the iteration vector d. The most efficient architecture would result in a utilization efficiency near 1:

HUE = 1/|s^T d|. (13.6)

The processor space and projection vectors must be orthogonal, and no two nodes should be mapped onto the same processor at the same time. These conditions ensure the proper mapping of data to the architecture:

p^T d = 0, (13.7)
s^T d ≠ 0. (13.8)

This search technique is applied to find the optimal matrices that minimize a cost function. The cost function includes the number of processors and the number of cycles required to complete the algorithm per iteration [41]:

NSPE = PEmax − PEmin + 1, (13.9)

where

PEmax = max(p^T q, q ∈ S), (13.10)
PEmin = min(p^T q, q ∈ S). (13.11)

Eq. (13.9) defines the number of scheduled processing elements (NSPE) required for completing the task, accounting for efficient hardware utilization; PEmax and PEmin denote the maximum and minimum of the processing elements used for the algorithm, using the processor allocation matrix p and an indexing matrix q that belongs to the search space S. In the search (index) space, the processors are arranged using Cartesian coordinates.
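The constraints and metrics of Eqs. (13.6)–(13.11) are directly computable for small index spaces. The sketch below assumes 2-D integer vectors; the example values of p, s and d are illustrative choices, not a mapping derived in this chapter.

```python
# Mapping checks and metrics for a 2-D index space:
# Eq. (13.7) p'd = 0, Eq. (13.8) s'd != 0, Eq. (13.6) HUE = 1/|s'd|,
# Eqs. (13.9)-(13.11) NSPE from the processor indices p'q over the space.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mapping_metrics(p, s, d, space):
    if dot(p, d) != 0:                 # Eq. (13.7): p orthogonal to d
        raise ValueError("p'd must be 0")
    if dot(s, d) == 0:                 # Eq. (13.8): s'd must be nonzero
        raise ValueError("s'd must be nonzero")
    hue = 1.0 / abs(dot(s, d))         # Eq. (13.6)
    pe = [dot(p, q) for q in space]    # processor index of each point
    n_spe = max(pe) - min(pe) + 1      # Eq. (13.9)
    return hue, n_spe

# 3x3 index space projected along d = (1, 0): points sharing j share a PE
space = [(i, j) for i in range(3) for j in range(3)]
hue, n_spe = mapping_metrics(p=(0, 1), s=(1, 1), d=(1, 0), space=space)
# hue == 1.0 (one result per cycle); n_spe == 3 (a linear array of 3 PEs)
```

Enumerating candidate (p, s, d) triples through this check, and scoring the valid ones with a cost function, is the search that the heuristics and evolutionary methods of this chapter automate.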
The duration for completion of the algorithm (Ncycle) depends on the scheduling matrix s and also on the allocation of the processors in space:

Ncycle = max(s^T (p − q)) + 1, q ∈ S. (13.12)

The cost function is chosen by giving a higher weight to the number of processors to emphasize hardware efficiency:

Cost function = 0.75 × NSPE + 0.25 × Ncycle. (13.13)

13.6 SYSTOLIC IMPLEMENTATION OF TEXTURE ANALYSIS
Texture is the visual information about the shape and physical structure of objects perceived by humans. Texture can be described as a dimensional pattern of structures understood through various characteristic parameters such as brightness, color, shape, size, etc. Image texture is defined as a function of the spatial variation in pixel intensities (grey values) and has been used as one of the significant parameters of image processing since the 1970s [68]. Texture analysis is defined as the arrangement of the basic constituents of a material. Classification or segmentation of textural features with respect to the shape of a small element, density and direction of regularity can be considered as a characteristic of an image. The distribution of tone, or histogram, defines the intensity quality of an image. The grey tone difference matrix (GTDM) gives information about spatial intensity changes through the difference between the current pixel and its neighbors in terms of grey values. The GTDM is a column vector with each entry specifying the sum of the differences between the grey values of all pixels of a specific intensity and the mean value of the neighboring grey values. Considering the grey tone of any pixel at (k, l) as f(k, l), the average grey tone is given by the following equation:

f¯ = (1/(W − 1)) Σ (m = −k to k) Σ (n = −k to k) f(k + m, l + n), (13.14)

where W = (2k + 1)^2 and k denotes the window size. The ith entry of the GTDM is

s(i) = Σ_{Ni} |i − f¯|, if Ni ≠ 0 (and s(i) = 0 otherwise), (13.15)
pi = Ni/n^2. (13.16)

The term Ni denotes the number of pixels with intensity i. The parameters from the GTDM are mentioned below:
• Coarseness;
• Contrast;
• Busyness;
• Complexity;
• Texture strength.

Coarseness defines the basic pattern of a texture. For a coarse texture, the basic patterns are large and the texture possesses a local uniformity in intensity. Coarseness measures the spatial rate of change of intensity and produces a slight change of intensity between the current pixel and the mean of its neighbors:

pcos = 1/(ω + Σ (i = 0 to G) pi s(i)). (13.17)

The coarseness is derived using an inertia parameter ω designed to produce a bounded value; pi is the probability function derived from Ni and the histogram of the image. Large values of coarseness reveal a coarse texture.

Contrast in an image defines the capacity to clearly differentiate neighboring pixels. Contrast depends on the spatial changes in intensity and the probability of neighboring pixels:

pcon = [1/(Ng(Ng − 1)) Σ (i = 0 to G) Σ (j = 0 to G) pi pj (i − j)^2] × [(1/n^2) Σ (i = 0 to G) s(i)], (13.18)

where Ng represents the total number of grey levels in the image:

Ng = Σ (i = 0 to G) Qi, (13.19)

where Qi = 1 if pi ≠ 0, otherwise Qi = 0. The contrast measures the sharpness of the image.

A busy texture indicates rapid intensity changes from one pixel to another. A high spatial frequency with small changes in intensity causes a mildly noticeable change in texture. The magnitude of the changes depends on the contrast of the image:

pbus = [Σ (i = 0 to G) pi s(i)] / [Σ (i = 0 to G) Σ (j = 0 to G) |i pi − j pj|], pi ≠ 0, pj ≠ 0. (13.20)

Busyness is a ratio with its numerator indicating a measure of the spatial rate of intensity change and its denominator denoting the sum of the differences in magnitude between different grey tones.
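Eqs. (13.14)–(13.17) can be sketched for a small grey-level image as follows. The function below is an illustrative sequential reference (not the systolic implementation of this chapter); it uses a (2k + 1) × (2k + 1) window with k = 1 and, as a simplifying assumption, skips border pixels so that every pixel considered has a full neighbourhood.

```python
# Sequential reference for GTDM construction (Eqs. (13.14)-(13.16)) and
# the coarseness measure (Eq. (13.17)).
def gtdm_coarseness(img, levels, k=1, omega=1e-6):
    h, w = len(img), len(img[0])
    s = [0.0] * levels        # s(i): summed grey tone differences
    count = [0] * levels      # N_i: number of pixels with intensity i
    n_pix = 0
    for y in range(k, h - k):
        for x in range(k, w - k):
            i = img[y][x]
            # mean grey tone of the neighbours, excluding the centre pixel
            total = sum(img[y + m][x + n]
                        for m in range(-k, k + 1)
                        for n in range(-k, k + 1)) - i
            mean = total / ((2 * k + 1) ** 2 - 1)
            s[i] += abs(i - mean)     # contribution to s(i), Eq. (13.15)
            count[i] += 1
            n_pix += 1
    p = [c / n_pix for c in count]    # p_i, Eq. (13.16)
    denom = omega + sum(pi * si for pi, si in zip(p, s))
    return 1.0 / denom                # coarseness, Eq. (13.17)
```

A flat image yields s(i) = 0 everywhere and hence a very large coarseness, while a fine checkerboard drives the value down, matching the interpretation given in the text.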
Complexity relates to the amount of information available in an image. When the image has more basic patterns of different average intensities concentrated in a region, the image is defined as complex. Patterns with sharp edges are complex:

pcom = Σ (i = 0 to G) Σ (j = 0 to G) (|i − j|/(n^2 (pi + pj))) (pi s(i) + pj s(j)), pi ≠ 0, pj ≠ 0. (13.21)

Texture strength can be derived from the easiness of defining a texture from its region. Textures which provide a high degree of visual satisfaction are considered strong. Classification of a texture is possible when the basic patterns are of considerable size and there is a sufficient difference in the average intensities:

pstr = [Σ (i = 0 to G) Σ (j = 0 to G) (pi + pj)(i − j)^2] / [ω + Σ (i = 0 to G) s(i)], pi ≠ 0, pj ≠ 0. (13.22)

Texture strength can be related to contrast and coarseness; it gives the boldness of the texture patterns. The above mentioned five parameters of texture analysis are applied at different levels of image processing [69,70]. In many machine learning and image processing algorithms, assumptions are made so that the local regions have uniform intensities. The main purpose of such parameters is to sort image data into more readily interpretable information, which is used in a wide range of applications such as industrial inspection, image retrieval, medical imaging and remote sensing.

Texture analysis methods have been utilized in a variety of application domains such as materials inspection [71], biomedical signal analysis [72], tissue characterization [73], and texture segmentation.

13.7 RESULTS AND DISCUSSION
This section is organized as follows. The performance of the EAs is evaluated for an optimization problem and compared based on the mean solution and the percentage of success. As the proposed work, systolic array mapping of texture analysis is performed through evolutionary algorithms, and a detailed discussion of the results is presented in the later half of the section.

13.7.1 Performance of EA for F8 Optimization
The performance of the five EAs is compared using continuous optimization benchmark problems [45]. The objective function to be optimized is a scalable, nonlinear, and nonseparable function that may take any number of variables (xi's), i.e.,

f(xi | i = 1, N) = 1 + Σ (i = 1 to N) xi^2/4000 − Π (i = 1 to N) cos(xi/√i). (13.23)

The summation term of the F8 function (Eq. (13.23)) has a parabolic shape, while the cosine function in the product term creates waves over the parabolic surface. These waves create local optima over the solution space [74]. The F8 function can be scaled to any number of variables N. The values of each variable are constrained to the range from −512 to 511. The global optimum (minimum) solution for this function is known to be zero when all N variables equal zero. See Table 13.1.

Twenty trial runs were performed for each problem. The performance of the different algorithms was compared using three criteria: (i) the percentage of success, as represented by the number of trials in which the objective function reached its known target value; (ii) the average value of the solution obtained in all trials; (iii) the processing time to reach the optimum target value. The processing time, and not the number of generation cycles, was used to measure the speed of each algorithm, because the number of generations in each evolutionary cycle differs from one algorithm to another. See Table 13.2.

The GA performed poorly compared to the other algorithms, reaching the target in at most 50% of the trials, and its percentage of success decreased with an increase in the number of variables. With an increase in the number of variables, the processing time also increased. GA was able to get a solution accuracy closer to the optimum for the F8 problem. It is also evident that the mean solution is high for GA with the number of variables being more than 20, indicating the wandering nature of the algorithm. The memetic algorithm performed better than GA in terms of success rate and processing time. The variation from the optimum value was minimal among the trials. More variation in processing time has been noticed for PSO, as the social behavior influences the results, and the time for reaching the target is inappropriately high. The success rate of SFL was similar to those of GA and PSO; PSO and SFL have been found to outperform the other algorithms in terms of solution quality, mainly due to
TABLE 13.1
Parameters for evolutionary algorithms.
Genetic algorithm: crossover probability = 0.8; mutation probability = 0.08; population size 200 to 500; stopping condition: no improvement in the objective function for 10 generations, or the target value reached.
Memetic algorithm: population size of 100.
Particle swarm optimization: maximum velocity set to 20; 40 particles and 10,000 generations; inertia factor is a linear function, decreasing with the increasing number of generations.
Ant colony optimization: suited to discrete problems; 30 ants and 100 iterations; α = 0.5, β = 2.5; ρ = 0.4, R = 10.
Shuffled frog leaping algorithm: population of 200 frogs; 20 memeplexes; 10 iterations per memeplex.
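The F8 objective of Eq. (13.23), which the parameter settings above target, can be transcribed directly:

```python
import math

# The F8 (Griewank) benchmark of Eq. (13.23): a parabolic bowl from the
# summation term with cosine ripples from the product term, giving many
# local optima; the global minimum is 0 at the origin for any N.
def f8(x):
    quad = sum(v * v for v in x) / 4000.0
    ripple = 1.0
    for i, v in enumerate(x, start=1):
        ripple *= math.cos(v / math.sqrt(i))
    return 1.0 + quad - ripple
```

At the origin the product term equals 1 and cancels the constant, so `f8([0.0] * 10)` returns exactly `0.0`; away from the origin the quadratic term dominates.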
TABLE 13.2
Results of evolutionary algorithms applied to F8 optimization problem [45].
Comparison criteria Algorithm N = 10 N = 20 N = 50 N = 100
Percentage of success GA 50 30 10 0
Percentage of success MA 90 100 100 100
Percentage of success PSO 30 80 100 100
Percentage of success SFL 50 7 90 100
Mean solution GA 0.06 0.097 0.161 0.432
Mean solution MA 0.014 0.013 0.011 0.011
Mean solution PSO 0.093 0.081 0.011 0.011
Mean solution SFL 0.08 0.063 0.049 0.019
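The worst-frog move of Eqs. (13.1)–(13.2), which underlies the SFL rows above, can be sketched as follows for the minimization setting used in these experiments; the clamping to ±Dmax and the function name are illustrative assumptions.

```python
import random

# One SFL improvement attempt on the worst frog of a memeplex:
# Eq. (13.1) computes the step towards the best frog, Eq. (13.2)
# applies it, and the move is kept only if fitness improves.
def sfl_update_worst(x_best, x_worst, d_max, fitness, rng=random):
    r = rng.random()                                   # Rand() in [0, 1)
    step = [max(-d_max, min(d_max, r * (b - w)))       # Eq. (13.1), clamped
            for b, w in zip(x_best, x_worst)]
    candidate = [w + s for w, s in zip(x_worst, step)] # Eq. (13.2)
    # keep the move only if the worst frog improves; otherwise the
    # algorithm would retry the jump with respect to the global frog Xg
    return candidate if fitness(candidate) < fitness(x_worst) else x_worst
```

When no improvement is possible even towards Xg, SFL replaces the worst frog with a random one, which is the shuffling mechanism described in the text.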
the inertia factor. For MA, PSO and SFL, the difference between the mean solution values reduces with an increasing number of variables. A random variation in trend is noticed with SFL, where N = 20 results in a very low success rate. See Fig. 13.6.

13.7.2 Texture Analysis
Texture analysis through the grey tone difference matrix has been sampled out for implementation in systolic arrays. For a 300 × 300 image, the GTDM is an intensity based column vector of length 255, corresponding to the number of grey levels. GTDM for texture analysis is an iterative process of dimension 4 and requires extensive procedures for obtaining the processor and scheduling matrices.

The GTDM defines the grey tone differences of all pixels of a particular intensity value. Five parameters, namely coarseness, contrast, complexity, busyness and texture strength, can be derived from the matrix. Comparing the results in Table 13.3, coarseness is high in image (A) and lowest in image (C). Image (D) is high in contrast and can be clearly
218 Deep Learning and Parallel Computing Environment for Bioengineering Systems

FIG. 13.6 Comparison of evolutionary algorithms with different variables for F8 optimization.

TABLE 13.3
Parameters from GTDM of the sample images.
Parameters Image (A) Image (B) Image (C) Image (D)
Coarseness 3.35 × 10−6 3.5 × 10−6 8.4 × 10−8 1.42 × 10−7
Contrast 8.95 × 10−3 6.52 × 10−3 1.3 × 10−2 1.48 × 10−2
Busyness 21.07 24.84 240 93.52
Complexity 1.80 × 107 4.9 × 109 4.9 × 107 4.7 × 109
Texture strength 7.68 × 10−2 7.83 × 10−2 6.3 × 10−2 5.53 × 10−2
Computation time (in s) 0.65 0.53 0.49 0.56
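The derivation of the GTDM column vector and of the coarseness parameter above can be sketched in pure Python, following the neighbourhood grey tone difference definitions of Amadasun and King [70]. The 3 × 3 neighbourhood, the 256-level range and the small epsilon guard are assumptions of this sketch, and the other four parameters (contrast, busyness, complexity, texture strength) are omitted for brevity.

```python
# Sketch of the grey tone difference column vector and the coarseness
# parameter, following the neighbourhood grey tone difference
# definitions of Amadasun and King [70]. For every interior pixel of
# grey level i, the entry s[i] accumulates |i - A|, where A is the
# mean of the 8 surrounding pixels; coarseness is the inverse of the
# probability-weighted sum of those entries. Neighbourhood size,
# level range and the epsilon guard are assumptions of this sketch.

def gtdm(img, levels=256):
    h, w = len(img), len(img[0])
    s = [0.0] * levels          # the GTDM column vector
    n = [0] * levels            # occurrence count per grey level
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            i = img[r][c]
            total = sum(img[r + dr][c + dc]
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1)) - i
            a = total / 8.0     # mean of the 8 neighbours
            s[i] += abs(i - a)
            n[i] += 1
    count = sum(n)
    p = [k / count for k in n]  # probability of each grey level
    return s, p

def coarseness(s, p, eps=1e-12):
    return 1.0 / (eps + sum(pi * si for pi, si in zip(p, s)))

# A perfectly flat image has zero grey tone difference everywhere,
# so its coarseness is maximal.
flat = [[5] * 8 for _ in range(8)]
s, p = gtdm(flat)
print(sum(s))  # 0.0
```

A busy texture such as a checkerboard gives large grey tone differences and hence a low coarseness, matching the inverse trend between coarseness and busyness visible in Table 13.3.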

viewed from the sample image in Fig. 13.7. Busyness of texture is more in image (C) with very fine details in pixels between the structures of the building. The computational time for all the images was similar, close to half a second of run time. The complexity was more in images (B) and (D), resulting in a high value of the parameter. The texture strength had almost the same dynamic range for all the samples.
Mapping techniques such as exhaustive search, particle swarm optimization and ant colony optimization are employed for systolic array mapping [75]. The mapping procedure is done in the MATLAB tool. The processor used for all the simulations had a clock speed of 1.9 × 10^9 Hz. The systolic array used for mapping in this work was arranged in two dimensions to increase the pace of the mapping decision. The projection vector d and dependence matrix D designed for GTDM systolic mapping are given below, and they were maintained throughout the proposed design:

d = [ 1 0
      0 1
      0 0
      0 0 ],

D = [ 0  0  0 −1  0  0
      0 −1 −1  0  0 −1
      1  0  0  1  1  0
      0  0  1  0  0  0 ].

The exhaustive search approach, when applied to texture analysis for implementing systolic arrays, uses the predefined projection vector d, dependence matrix D, constraints as given in Eqs. (13.7)–(13.12), and cost function as mentioned in Eq. (13.13). The cost function is altered as it requires the number of processors in the x- and y-direction.
Particle swarm optimization implemented for mapping systolic arrays uses 5 particles (solutions) as the total set of solutions. The inertia weight ω is continuously updated using Eq. (13.24) as a means of including the reluctance of the particles to change the direction of their current movement:

ω = 0.4 + 0.8 × (Number of iterations − Current iteration)/(Number of iterations − 1).  (13.24)

The processing time can be defined as the time taken for the entire procedure to run and end with the resulting numbers of processors and cycles. The average value is

FIG. 13.7 Sample images for grey tone difference matrix generation.

a measure of how much the output gets varied between the extremes around the optimal value. Since the iteration count is fixed, 50 trials are run for PSO. The initial population, position and velocity are random, and subsequent iterations get varied to improve the solution. The results reveal that the processing time is approximately constant, as there is no information exchange operation as opposed to GA and MA. The parameters for assessing systolic array implementation are
• Processors in the x- and y-direction;
• Number of cycles for one iteration in the given arrangement of processors;
• Cost function as mentioned in Eq. (13.13);
• Number of iterations for termination of program;
• Total number of cycles required for the array to complete processing of a sample image;
• Time taken for simulation of the algorithm;
• Average value of the solution denoting the traversal of the algorithm in reaching the final solution. If the average value varies from the cost function, it signifies the diverse exploring nature of the algorithm. See Table 13.4.

The results can be interpreted for the minimum number of processors and cycles. It can be derived that ACO produces the minimum number of processors and cycles. The average value of the solution is 3.98 and the final values of processor space and scheduling matrices are given below:

px = [ 0 3 0 0       py = [ 0 3 0 0
       0 4 0 0              0 2 0 0
       3 0 0 0              3 0 0 0
       1 2 0 0 ],           3 2 0 0 ],

S = [ 4 1 4 3
      3 2 4 2
      1 1 1 1
      4 4 4 4 ].

The edge mapping is formed from the derived matrices, resulting in a dependence matrix D. From the obtained

TABLE 13.4
Comparative analysis for GTDM implementation in systolic array.
Parameters Exhaustive search PSO ACO
Processors in x-direction 5 10 2
Processors in y-direction 5 7 4
Number of cycles 9 17 9
Cost function 21 56.75 8.25
Number of iterations 2016 50 300
Total number of cycles 799,236 2,930,532 1,509,668
Time taken for simulation (in s) 2.518 120.7 88.13
Average value of solution 4.09 84.25 3.95

processor space and scheduling matrices, the edge mapping is given as

D = [ 0 0 0 0 4 3
      3 4 3 2 1 2
      0 0 0 0 4 4
      0 0 0 0 3 2 ].

The simulation has been performed on texture analysis, and the ant colony optimization has proved to achieve better mapping with a minimal number of processors and computation time. The above analysis leads to a conclusion that systolic arrays are better adaptable for regular image processing computations [76,77].

13.8 CONCLUSIONS
Bio-inspired computing [78,79] has grown its branches through many domains of science and technology. It has been found apt, in various instances, to solve many complex problems easily by modeling the characteristics of biological species. The implementation of evolutionary algorithms for mapping iterative algorithms to a systolic array has been tried out, and a successful implementation is possible by selecting efficient algorithms which create the mapping vectors. A large number of vectors is available in the solution space, and selecting an efficient vector set through simple heuristics does not guarantee an optimal result. In this chapter, a brief overview of evolutionary procedures, systolic arrays and methods to transform an iterative algorithm into architecture have been discussed. The significance of parameters derived from GTDM is mentioned, and parameters involved in selecting the best of the addressed algorithms are clearly justified. Ant colony optimization proved to be good among the selected EA in addressing systolic array mapping of grey tone distribution matrix computation.

List of Acronyms and Abbreviations
VLSI very large scale integration
ASIC application specific integrated circuit
IC integrated circuit
PE processing element
MISD multiple instruction stream single data stream
BIC bio-inspired computing
AO allostatic optimization
MOPS million operations per second
TPU tensor processing unit
MAC multiply and accumulate unit
GPU graphics processing unit
DG dependence graph
EA evolutionary algorithm
EH evolvable hardware
EC evolutionary computing
ES evolutionary strategy
EP evolutionary programming
GA genetic algorithm
GP genetic programming
SI swarm intelligence
AI artificial intelligence
ML machine learning
HEA hybrid evolutionary algorithm
MA memetic algorithm
PSO particle swarm optimization
ACO ant colony optimization
SFL shuffled frog leaping algorithm
SFG signal flow graph
IR intermediate representation
HUE hardware utilization efficiency
HDL hardware description language
NSPE number of scheduled processing elements

REFERENCES
1. L. Liu, J. Chen, P. Fieguth, G. Zhao, R. Chellappa, M. Pietikainen, A survey of recent advances in texture representation, arXiv preprint, arXiv:1801.10324.
2. B. Biswas, R. Mukherjee, I. Chakrabarti, P.K. Dutta, A.K. Ray, A high-speed VLSI architecture for motion estimation using modified adaptive rood pattern search algorithm, Circuits, Systems, and Signal Processing (2018) 1–20.
3. M.J. Flynn, Some computer organizations and their effectiveness, IEEE Transactions on Computers 100 (9) (1972) 948–960.
4. H. Kung, C.E. Leiserson, Systolic Arrays (for VLSI), in: Sparse Matrix Proceedings 1978, vol. 1, Society for Industrial and Applied Mathematics, 1979, pp. 256–282.
5. J. Mora, E. de la Torre, Accelerating the evolution of a systolic array-based evolvable hardware system, Microprocessors and Microsystems 56 (2018) 144–156.
6. V. Osuna-Enciso, E. Cuevas, D. Oliva, H. Sossa, M. Pérez-Cisneros, A bio-inspired evolutionary algorithm: allostatic optimisation, International Journal of Bio-Inspired Computation 8 (3) (2016) 154–169.
7. R. Davis, D. Thomas, Systolic array chip matches the pace of high-speed processing, Electronic Design 32 (22) (1984) 207.
8. A. Faraz, F.U.H. Zeya, M. Kaleem, A survey of paradigms for building and designing parallel computing machines, Computer Science & Engineering 5 (1) (2015) 1.
9. P. Kacsuk, M. Tudruj, Extending grade towards explicit process synchronization in parallel programs, Computers and Artificial Intelligence 17 (5) (1998) 507–516.
10. J. Speiser, H. Whitehouse, A review of signal processing with systolic arrays, in: Real-Time Signal Processing VI, vol. 431, International Society for Optics and Photonics, 1983, pp. 2–7.
11. S.-Y. Kung, B. Rao, et al., Wavefront array processor: language, architecture, and applications, IEEE Transactions on Computers 100 (11) (1982) 1054–1066.
12. P.A. Laplante, S.J. Ovaska, Real-Time Systems Design and Analysis: Tools for the Practitioner, John Wiley and Sons, 2011.
13. S.-Y. Kung, P.S. Lewis, S.-C. Lo, Performance analysis and optimization of VLSI dataflow arrays, Journal of Parallel and Distributed Computing 4 (6) (1987) 592–618.
14. P.S. Kumar, Z. David, Neural Networks and Systolic Array Design, World Scientific, 2002.
15. J. Fortes, K. Fu, B. Wah, Systematic approaches to the design of algorithmically specified systolic arrays, in: Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'85, vol. 10, IEEE, 1985, pp. 300–303.
16. R.P. Brent, H. Kung, F.T. Luk, Some Linear-Time Algorithms for Systolic Arrays, Tech. rep., Cornell University, 1983.
17. I.E. Sutherland, C.A. Mead, Microelectronics and computer science, Scientific American 237 (3) (1977) 210–229.
18. C. Thiripurasundari, V. Sumathy, C. Thiruvengadam, An FPGA implementation of novel smart antenna algorithm in tracking systems for smart cities, Computers & Electrical Engineering 65 (2018) 59–66.
19. H. Kung, S. Song, A Systolic 2-D Convolution Chip, Tech. rep., Carnegie-Mellon Univ., Pittsburgh, PA, Dept. of Computer Science, 1981.
20. B.K. Meher, P.K. Meher, Analysis of systolic penalties and design of efficient digit-level systolic-like multiplier for binary extension fields, Circuits, Systems, and Signal Processing (2018) 1–17.
21. A.L. Fisher, H. Kung, L.M. Monier, Y. Dohi, Architecture of the PSC: a programmable systolic chip, ACM SIGARCH Computer Architecture News 11 (3) (1983) 48–53.
22. H. Kung, M.S. Lam, Fault-Tolerance and Two-Level Pipelining in VLSI Systolic Arrays, Tech. rep., Carnegie-Mellon Univ., Pittsburgh, PA, Dept. of Computer Science, 1983.
23. T. Gross, D.R. O'Hallaron, iWarp: Anatomy of a Parallel Computing System, MIT Press, 1998.
24. H.H.S. Sidhu, Design and implementation modified Booth algorithm and systolic multiplier using FPGA, International Journal of Engineering Research & Technology (IJERT) 2.
25. C.-P. Lu, AI, native supercomputing and the revival of Moore's law, APSIPA Transactions on Signal and Information Processing 6.
26. K.T. Johnson, A.R. Hurson, B. Shirazi, General-purpose systolic arrays, Computer 26 (11) (1993) 20–31.
27. R. Urquhart, D. Wood, Systolic matrix and vector multiplication methods for signal processing, in: IEE Proceedings F (Communications, Radar and Signal Processing), vol. 131, IET, 1984, pp. 623–631.
28. H. Kung, P.L. Lehman, Systolic (VLSI) arrays for relational database operations, in: Proceedings of the 1980 ACM SIGMOD International Conference on Management of Data, ACM, 1980, pp. 105–116.
29. W.M. Gentleman, H. Kung, Matrix triangularization by systolic arrays, in: Real-Time Signal Processing IV, vol. 298, International Society for Optics and Photonics, 1982, pp. 19–27.
30. S. Subathradevi, C. Vennila, Systolic array multiplier for augmenting data center networks communication link, Cluster Computing (2018) 1–11.
31. D.I. Moldovan, On the design of algorithms for VLSI systolic arrays, Proceedings of the IEEE 71 (1) (1983) 113–120.
32. R.J. Lipton, D. Lopresti, A systolic array for rapid string comparison, in: Proceedings of the Chapel Hill Conference on VLSI, 1985, pp. 363–376.
33. H. Yang, Y. Zhu, J. Liu, End-to-end learning of energy-constrained deep neural networks, arXiv preprint, arXiv:1806.04321.
34. H.-T. Kung, Special-purpose devices for signal and image processing: an opportunity in very large scale integration (VLSI), in: Real-Time Signal Processing III, vol. 241, International Society for Optics and Photonics, 1980, pp. 76–85.
35. A.L. Fisher, Systolic algorithms for running order statistics in signal and image processing, in: VLSI Systems and Computations, Springer, 1981, pp. 265–272.
36. H. Kung, J.A. Webb, Mapping image processing operations onto a linear systolic machine, Distributed Computing 1 (4) (1986) 246–257.
37. R. Mukherjee, P. Saha, I. Chakrabarti, P.K. Dutta, A.K. Ray, Fast adaptive motion estimation algorithm and its efficient VLSI system for high definition videos, Expert Systems with Applications 101 (2018) 159–175.
38. T. Komarek, P. Pirsch, Array architectures for block matching algorithms, IEEE Transactions on Circuits and Systems 36 (10) (1989) 1301–1308.
39. S. Divakara, S. Patilkulkarni, C.P. Raj, High speed modular systolic array-based DTCWT with parallel processing architecture for 2D image transformation on FPGA, International Journal of Wavelets, Multiresolution and Information Processing 15 (05) (2017) 1750047.
40. P. Jawandhiya, Hardware design for machine learning, International Journal of Artificial Intelligence and Applications (IJAIA) 9 (1) (2018).
41. K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 2007.
42. S. Kumar, E.A. Chauhan, A survey on image feature selection techniques, International Journal of Computer Science and Information Technologies (IJCSIT) 5 (5) (2014) 6449–6452.
43. M. Yoshida, T. Hinkley, S. Tsuda, Y.M. Abul-Haija, R.T. McBurney, V. Kulikov, J.S. Mathieson, S.G. Reyes, M.D. Castro, L. Cronin, Using evolutionary algorithms and machine learning to explore sequence space for the discovery of antimicrobial peptides, Chem 4 (3) (2018) 533–543.
44. N. Pillay, R. Qu, D. Srinivasan, B. Hammer, K. Sorensen, Automated design of machine learning and search algorithms [guest editorial], IEEE Computational Intelligence Magazine 13 (2) (2018) 16–17.
45. E. Elbeltagi, T. Hegazy, D. Grierson, Comparison among five evolutionary-based optimization algorithms, Advanced Engineering Informatics 19 (1) (2005) 43–53.
46. J. Zhang, Z.-h. Zhan, Y. Lin, N. Chen, Y.-j. Gong, J.-h. Zhong, H.S. Chung, Y. Li, Y.-h. Shi, Evolutionary computation meets machine learning: a survey, IEEE Computational Intelligence Magazine 6 (4) (2011) 68–75.
47. M.M. Drugan, Reinforcement learning versus evolutionary computation: a survey on hybrid algorithms, Swarm and Evolutionary Computation 44 (2019) 228–246.
48. F. Glover, Heuristics for integer programming using surrogate constraints, Decision Sciences 8 (1) (1977) 156–166.
49. M. Pedemonte, F. Luna, E. Alba, Systolic genetic search, a systolic computing-based metaheuristic, Soft Computing 19 (7) (2015) 1779–1801.
50. L. Fogel, A. Owens, M. Walsh, Adaptation in Natural and Artificial Systems, 1975.
51. J.H. Holland, Genetic algorithms and adaptation, in: Adaptive Control of Ill-Defined Systems, Springer, 1984, pp. 317–333.
52. Q. Wang, Using genetic algorithms to optimise model parameters, Environmental Modelling & Software 12 (1) (1997) 27–34.
53. R. Tyagi, S.K. Gupta, A survey on scheduling algorithms for parallel and distributed systems, in: Silicon Photonics & High Performance Computing, Springer, 2018, pp. 51–64.
54. P. Garg, A comparison between memetic algorithm and genetic algorithm for the cryptanalysis of simplified data encryption standard algorithm, arXiv preprint, arXiv:1004.0574.
55. E. García-Gonzalo, J. Fernández-Martínez, A brief historical review of particle swarm optimization (PSO), Journal of Bioinformatics and Intelligent Control 1 (1) (2012) 3–16.
56. H. Garg, A hybrid PSO-GA algorithm for constrained optimization problems, Applied Mathematics and Computation 274 (2016) 292–305.
57. R. Alvarez, C. Rahmann, R. Palma-Behnke, P. Estévez, F. Valencia, Ant colony optimization algorithm for the multiyear transmission network expansion planning, in: 2018 IEEE Congress on Evolutionary Computation (CEC), IEEE, 2018, pp. 1–8.
58. J.M. Pasteels, J.-L. Deneubourg, S. Goss, Self-organization mechanisms in ant societies. I. Trail recruitment to newly discovered food sources, in: Jacques M. Pasteels, Jean-Louis Deneubourg (Eds.), From Individual to Collective Behavior in Social Insects: les Treilles Workshop, Birkhauser, 1987.
59. M. Dorigo, M. Birattari, T. Stutzle, Artificial ants as a computational intelligence technique, IEEE Computational Intelligence Magazine 1 (2006) 28–39.
60. M. Fera, F. Fruggiero, A. Lambiase, G. Martino, M.E. Nenni, Production scheduling approaches for operations management, in: Operations Management, InTech, 2013.
61. E. Afzalan, M. Taghikhani, M. Sedighizadeh, Optimal placement and sizing of DG in radial distribution networks using SFLA, International Journal of Energy Engineering 2 (3) (2012) 73–77.
62. M.M. Eusuff, K.E. Lansey, Optimization of water distribution network design using the shuffled frog leaping algorithm, Journal of Water Resources Planning and Management 129 (3) (2003) 210–225.
63. M. Eusuff, K. Lansey, F. Pasha, Shuffled frog-leaping algorithm: a memetic meta-heuristic for discrete optimization, Engineering Optimization 38 (2) (2006) 129–154.
64. B. Sundari, Design space exploration of deeply nested loop 2D filtering and 6 level FSBM algorithm mapped onto systolic array, VLSI Design 2012 (2012) 15.
65. L. Whitley, A. Howe, S. Rana, J. Watson, L. Barbulescu, Comparing heuristic search methods and genetic algorithms for warehouse scheduling, in: 1998 IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, IEEE, 1998, pp. 2430–2435.
66. N. Ling, M.A. Bayoumi, Systematic algorithm mapping for multidimensional systolic arrays, Journal of Parallel and Distributed Computing 7 (2) (1989) 368–382.
67. C.M. Fiduccia, R.M. Mattheyses, A linear-time heuristic for improving network partitions, in: Papers on Twenty-Five Years of Electronic Design Automation, ACM, 1988, pp. 241–247.
68. A. Rosenfeld, E.B. Troy, Visual Texture Analysis, Tech. rep., Maryland Univ., College Park (USA), Computer Science Center, 1970.
69. M. Tuceryan, A.K. Jain, Texture analysis, in: Handbook of Pattern Recognition and Computer Vision, World Scientific, 1993, pp. 235–276.
70. M. Amadasun, R. King, Textural features corresponding to textural properties, IEEE Transactions on Systems, Man and Cybernetics 19 (5) (1989) 1264–1274.
71. J.S. Weszka, A. Rosenfeld, An application of texture analysis to materials inspection, Pattern Recognition 8 (4) (1976) 195–200.
72. P.M. Szczypiński, A. Klepaczko, MaZda—a framework for biomedical image texture analysis and data exploration, in: Biomedical Texture Analysis, Elsevier, 2018, pp. 315–347.
73. R. Lerski, K. Straughan, L. Schad, D. Boyce, S. Blüml, I. Zuna, VIII. MR image texture analysis—an approach to tissue characterization, Magnetic Resonance Imaging 11 (6) (1993) 873–887.
74. D. Whitley, R. Beveridge, C. Graves, K. Mathias, Test driving three 1995 genetic algorithms: new test functions and geometric matching, Journal of Heuristics 1 (1) (1995) 77–104.
75. J.W. Haefner, Parallel computers and individual-based models: an overview, in: Individual-Based Models and Approaches in Ecology, Chapman and Hall/CRC, 2018, pp. 126–164.
76. M. Patel, P. McCabe, N. Ranganathan, SIBA: a VLSI systolic array chip for image processing, in: Proceedings of the 11th IAPR International Conference on Pattern Recognition, Vol. IV, Conference D: Architectures for Vision and Pattern Recognition, IEEE, 1992, pp. 15–18.
77. R.W. Means, H.J. Sklar, Systolic array image processing system, US Patent 5,138,695, Aug. 11, 1992.
78. F. Dressler, O.B. Akan, Bio-inspired networking: from theory to practice, IEEE Communications Magazine 48 (11) (2010) 176–183.
79. S. Thakoor, Bio-inspired engineering of exploration systems, Journal of Space Mission Architecture 2 (1) (2000) 49–79.
CHAPTER 14

Varied Expression Analysis of Children With ASD Using Multimodal Deep Learning Technique

S.P. ABIRAMI, ME • G. KOUSALYA, ME, PHD • BALAKRISHNAN, ME, PHD • R. KARTHICK, BOT

14.1 INTRODUCTION
Autism spectral disorder (ASD) stands to be a promising field for research today as there is no single standard diagnostic measure for ASD. The clinical trials to identify the autistic nature and the cognitive development are quite time consuming. This is because the analysis is based on social interaction, verbal and non-verbal communication and imitation of sameness, which involves a series of screenings over time. Neurodevelopment studies suggest that facial expressions and emotions stand as a key indicator in analyzing the state of a human's response. Thus facial expression is concentrated on for differentiating the neurodevelopmental disorders among ASD positive and normally developing children. Such ASD children will find it hard to identify the object influencing them, through their poor gaze factor, the emotion perceived and the emotion to respond.
Facial expression has a significant impact on the communication factor and is identified as a basic interest identification and involvement parameter that is to be notified. This form of communication through facial expression and emotional sequence has a faster convergence than any other non-visual communications. These emotions could be perceived in many different methods, and studies converge on the major emotions faced during communication, namely anger, disgust, sleepiness, happiness, neutrality, sadness and fear.
Psychological analysis and research suggest that typically developing (TD) individuals more rapidly detect emotional expressions than neutral expressions. However, children with high functioning autism are screened from such emotion detection, which leaves a clue as to the importance of facial expression and emotional identification in children with ASD.
With the advent of these facial expressions in real time applications, there arises a need for an automatic recognition mechanism. These mechanisms for automated recognition of facial expressions normally depend upon the movement of eyes and facial muscle movements, or depend upon the various means to create relationships among the different shapes of the face or on the variety of emotional characteristics. However, this information can be gained from the varied sequence of images which depicts the drive of emotions. Hence the system, which classifies the emotions from the images, includes a series of algorithms that combine the techniques of feature extraction and classification. One such algorithm for identifying a human face in the image space is the Viola–Jones algorithm, which possesses improved facial detection accuracy. The detected face can be fed as input to any N classifiers that classify the facial features and support the identification of a facial expression.
In recent decades, many artificial intelligence techniques such as deep neural networks (DNNs) could be involved to improve the learning factor of the machine in a very granular way. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the most common DNN techniques that are applied to analyze features in a very efficient way among image and video input formats.
In particular, CNNs are a type of artificial neural network technique with a feed-forward network that extracts more features than many ad hoc extractors in their order of evolution. This is because the architecture of CNNs works on the basis of the activity of the neurons in the brain and requires a wide range of factors that are learned from previously fine labeled image data. The CNN technique, when combined with GPU processing, gives an eminent and quick way of analyzing the features. This CNN can be applied for training the images, and thus the trained network could be later deployed in real-time video analysis for emotion detection. However, a major drawback with this CNN way of learning is that a large amount of data input needs to be fed into

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00021-X 225
Copyright © 2019 Elsevier Inc. All rights reserved.
the system so as to classify and label in an accurate classification strategy.
Although CNN shows improved facial expression identification, the algorithm works well on image frames separated from a video input. The recurrent neural network (RNN) is one such algorithm which overcomes the challenge faced in CNN. The recurrent neural network is a classical sequential learning model that learns the features of an object in a live stream. The features are learned from a series of data by integrating previous inputs that are influenced from the internal state of the neural network. The chapter focuses on the analysis of CNN for image data, which leads to more sophisticated results when compared to simple machine learning mechanisms.
The paper is organized as follows: Sect. 14.2 describes the state-of-the-art related to the work specified in this paper. Section 14.3 elaborates on the methodology adopted for the implementation, and Sect. 14.4 provides the acquired meaningful insights that were made on facial expressions in line with ASD children. The section also justifies the implementation results. Sections 14.5 and 14.6 present the conclusion and future work identified in this research work.

14.2 STATE-OF-THE-ART
The state-of-the-art in this paper deals with four major discussions, namely (i) the motivation and driving force for the research, (ii) characteristics of autism, (iii) current screening methods, and (iv) existing computer interventions that could be incorporated in the screening mechanism. The study and survey do not give a comparison of existing deep learning techniques to identify the autism based insights. This is because, within the extensive possible study, the expression of a human being identified through deep learning and the ability of identification of expression by autistic children were dealt with, but these works do not include the expression shown by an autistic child [16,20]. This seems to be a major research gap, and the chapter focuses on bridging the gap by identifying the facial expression of an autistic child through computer intervention.
Even though ASD is famed for major disturbances in social communication, there also exist other psychiatric conditions that are associated with impairments in social skills, communication, high restriction and even repetitive behaviors [30,28]. These impairments lead to intellectual disability, specific language impairment, attention-deficit/hyperactivity disorder, anxiety disorders and disruptive behavior disorders [1]. Such compensatory abilities are caused through the neurodevelopment of the child varied over the age scale. The sensory nerves react to the state of children and reflect the emotion expressed by them. Facial expressions are one of the primary signals used to detect the impaired communication abilities in a social environment among children with high functioning autism spectral disorder [2].
The behavioral characteristics and the diagnostic probability show differences among children with autism of various genders. But the sensory symptoms and the basic characteristics of an autistic child remain the same at the early identification level [3]. Hence the chapter deals with expression identification without considering the gender details and age. Also, it is important to analyze the expression and emotional behaviors in children past the clinical analysis period, as subclinical levels of autism identification, in the absence of compensatory ability identification over time, may result in autism in the future [3,13,14,19].
Such early identification could be intervened through basic facial emotions that the child processes. On a very granular observation, these emotions could be identified by a mother when breast-feeding the baby. The children who are prone to autism do not show proper interest in viewing a human, and that results in a non-engaged facial expression [21,25,29]. Such facial expression gaps could be identified and must be analyzed in an efficient manner so as to boost the clinical observations and reviews.
In an experiment [4], it was observed that the target object of the stimuli might influence the emotional behavior of the children, either ASD or TD. Such emotions are to be carefully notified to improve the efficiency of the screening results and impacts [5,15,17]. Thus the expression faced by the children should be analyzed with and without the object intervention. The chapter initially focuses on exploring such expressions of the children in a contactless environment, later pertaining to the assumption of facing a camera and human under courtesy. Such in-depth analysis through human–computer interaction could be established on application of deep learning techniques to result in more sophisticated results that support the screening technique.
The early screening method initiates the process by face detection and through feature extraction. Viola et al. proposed the Viola–Jones algorithm for face detection, which belongs to the class of Haar classifiers for facial detection and feature identification [10]. The algorithm undergoes Haar cascade classification, whereas Jing-Wein Wang et al. suggested an algorithm for facial feature specification that categorized the face into
CHAPTER 14 Multimodal Deep Learning Based Expression Analysis of Children With ASD 227

a T shaped structure by extracting eyes, nose and position as three feature dimensions [11]. The current research chapter employs the Viola–Jones algorithm for facial detection, which proves to be a better ethnic technique.
Lydia R. Whitaker et al. classified the facial expression of the children when the target object shows anger and happiness. The difference in the variance of the emotion boundary suggests that the target object might be an influencing factor for an ASD positive child [12]. Exploration of such emotions identified through facial detection using machine learning algorithms will support the identification of autism much earlier, when the clinical analysis process is initiated. To make some further improvements in the classification accuracy and to better rely on the screening mechanisms for early identification, the feature identification and analysis is made in a deeper sense using the deep learning algorithms involved in the process.
A facial feature tracker can gather a combination of displacement factors from feature motion obtained from images or a live motion video, which is subsequently used to train an SVM classifier [18]. Such classifications sort the expressions that are unseen by the humans. Such SVM based expression classifications are employed together in an ad hoc based, incrementally trained ar-

14.3 METHODOLOGY
14.3.1 Detection of Human Faces
A human can identify and distinguish between different faces easily, whereas a computer does this job through a series of instructions and training. The computer, when used for face detection, has to be trained for diverse elements of facial features such as shape, size, texture and varying intensities of colors on the face. This motivated face detection to become the most prominent study area for most real world applications, which led to the evolution of better techniques in a gradient manner.
The development of a face detection algorithm incurs several challenges such as poses of the face, their expression, obstructive elements, illumination, etc. However, the most typical, precise and proficient face detection algorithm in recent decades is identified as the Viola–Jones algorithm of face detection. This was the first framework of object detection, proposed by Paul Viola and Michael Jones in 2001. The Viola–Jones algorithm was primarily motivated towards the feature detection of frontal faces in full view.
The chapter work aims to analyze facial expression in faces that are detected using the Viola–Jones algorithm, which uses cascade classifiers that support face detection in a better aspect, in a way even if the person's face is tilted down, twisted towards right or left. This is
chitecture for person-independent expression identifi- done through the inclusion of four different Haar cas-
cation and analysis [10,18]. cade classifiers.
This aspect of the algorithm proves to be a successful
While the real impact of deep learning became ap-
modular technique where researchers append their in-
parent in recent decades [6,7], it has been applied in
depth analysis. The Viola–Jones algorithm is performed
a wide range of application domains, including nat-
in four stages which are listed below:
ural language processing, automatic speech recogni-
• Selection of Haar features;
tion, image recognition, natural language processing,
• Construction of an integral image;
bioinformatics, and with a major focus on the med-
• Training of image using AdaBoost technique;
ical diagnosis field [8,9,23,25]. Among the typically
• Classification of images using cascading classifiers.
available deep learning models, stacked auto-encoder
(SAE), deep belief network (DBN), convolution neural 14.3.1.1 Selection of Haar Features in Selected
network (CNN), and recurrent neural network (RNN) Faces
stand to be widely used deep learning techniques which The Viola–Jones algorithm employs the use of Haar-
could converge at a faster rate. like features, which is a scalar product of image and
It is observed that it is possible to apply more ad- Haar-like templates. A Haar-like feature distinguishes
vanced features in a practical face detection solution the object based on the values obtained from the pixel
as long as the false positive detections can be rejected calculation of two adjacent rectangular windows in an
quickly in the early stages upon the features classified image. The entire image is divided into several rectangu-
using simple linear SVM. In this regard, CNNs can auto- lar windows, which in turn get divided into subsections.
matically learn features to capture complex visual varia- The Haar-like features are calculated for each subsec-
tions by leveraging a large amount of training data. The tion of the image. A Haar-like feature takes into account
chapter focuses on major directions to implement the the adjacent rectangular regions at a particular position
CNN architecture that would progressively result in a in a window, sums up the intensities of pixels in each
better solution, giving maximum accuracy. region and computes the dissimilarity between the ob-
228 Deep Learning and Parallel Computing Environment for Bioengineering Systems

tained sums. This dissimilarity value is then used to The Haar-like features are easily computed using the
categorize subsections of an image. calculation of integral image values, which is described
For example, consider a database of images repre- in the next section.
senting the human faces. Suppose the major difference
among all the images is found to be color variation in 14.3.1.2 Construction of an Integral Image
the region of eyes and cheeks. Then the adjacent rectan- The input image from the dataset is transformed into an
gular regions considered for Haar-feature selection are integral image, implying the summation of pixel values
the regions that lie corresponding to the eye and cheek in a recognized rectangular piece of image. The summa-
region. tion of the pixel at a location (x, y) is computed as
The Viola–Jones algorithm uses three classes of fea- 
tures as shown in Fig. 14.1. The two-rectangle feature ii (x, y) = i(x  , y  ) (14.1)
is the variation amid the sum of the pixels contained x  ≤x,y  ≤y
by two rectangular regions. The three-rectangle feature
calculates the sum contained by two outside rectangles where (x, y) is the location and ii(x’, y’) is the integral
subtracted from the sum in a middle rectangle. Lastly, a transformation for original pixel i(x, y).
four-rectangle feature computes the difference between The integral image (xi, yi) corresponds to one sin-
diagonal pairs of rectangles [22,24]. gle location (say, (x1, y1)). The integral image of (x2,
y2) corresponds to the summation of pixels from both
(x1, y1) and (x2, y2). This implies that the summation
of pixels at location (x, y) is the sum of pixels above
and left of (x, y). The conversion of the integral image
continues until all individual rectangular block of input
image is processed. Thus for a particular input image,
the total transformation of integral image is computed
as

i (x, y) = ii (Z) + ii (W ) − ii (X) − ii(Y )
(x,y)W XY Z
(14.2)

14.3.1.3 AdaBoost Technique to Build Strong


Classifiers
Once the integral image value is obtained, the images
need to be classified based on the required criteria. In
FIG. 14.1 Three classes of feature.
the process of face detection, the object has to be clas-
sified as a face object or a no-face object. A single al-
The algorithm for Haar-like feature selection [27] is gorithm when used for this classification may not clas-
as follows in Table 14.1. sify the objects precisely. The classification will be much
more proficient if multiple classifiers are combined and
TABLE 14.1 incorporated. The AdaBoost classifier is one such classi-
Haar cascade feature extraction. fier, in which multiple weak classifiers are combined to
form a strong classifier.
Step 1: Let the feature be denoted as f which ranges
between the index values from i to m
The AdaBoost technique identifies the weak classi-
fiers while analyzing the facial features to eliminate
Step 2: For each Haar-feature f = i to m
the negative inputs. The term “boosted” implies that
Step 3: Compute the sum of the pixels in the adjacent the classifiers at all stages of the cascade are intricate
rectangular windows
themselves and they are made of essential classifiers by
Step 4: Record the parameters found in the correspond- means of different boosting techniques. The AdaBoost
ing Haar-like feature algorithm allocates weight to each training sample, and
Step 5: End for decides the probability with which a sample should be
projected in the training set. The Viola–Jones algorithm
CHAPTER 14 Multimodal Deep Learning Based Expression Analysis of Children With ASD 229

FIG. 14.2 Classification flow using a cascade classifier.

computes the weak classifier as In order to train the cascade classifier, we need a
 set of positive and negative samples. In our work, we
1 pf (x) < pθ have incorporated the utility called opencv_createsam-
q (x, f, p, θ) = (14.3)
0 otherwise ples to create the positive samples for opencv_traincas-
cade. The output file of this function serves as an input
where f represents the value of the feature, θ is the value to opencv_traincascade to train the detected face. The
of the threshold and p is the polarity, which indicates negative samples are collected from arbitrary images,
the inequality direction. which do not include the objects to be detected.
The weak classifiers are then further processed to Fig. 14.2 and Table 14.2 show the flow of the cas-
achieve a strong classifier with minimization of low cade classifier. Initially, the classifier was trained with a
false positive rate attainment. The strong classifier is few positive and negative samples, which are arbitrary
computed as images of the same size, of which both samples were
equally scaled in their size. The classifier generates “1” if
⎧ T
⎨ 1 t=1 αt qt (x) ≥ γ t,
the region possibly identifies the face and generates “0”
H (x) = (14.4) otherwise. The major goal of the cascade classifier is to
⎩ find the face objects of interest at diverse sizes, making
0 otherwise
the classifier more proficient without altering the size of
where αt = log βt1 and γ t is taken to ensure that all the the input images.
positive training samples are classified correctly.
TABLE 14.2
14.3.1.4 Cascade Classifier Cascade classification technique.
Haar feature-based cascade classifiers is an effectual ma-
P – Set of positive samples
chine learning based approach, in which a cascade func-
N – Set of negative samples
tion is trained using a sample that contains a lot of
positive and negative images. The outcome of AdaBoost For each feature f
classifier is that the strong classifiers are divided into In each stage, use P & N to train the classifier with
stages to form cascade classifiers. The term “cascade” the selected features
means that the classifier thus produced consists of a Step 1: Assign the weights for the features
set of simpler classifiers which are applied to the re- Step 2: Normalize the weights
gion of interest until the selected object is discarded or Step 3: Based on the output of Step 2, select the
passed. next best (weak) classifier
The cascade classifier splits the classification work Step 4: Update weights and evaluate the features
into two stages: training and detection. The training for the selected criteria
stage does the work of gathering the samples which can
Step 5: If it passes, apply the second stage of
be classified as positive and negative. The cascade clas- features and continue the process. Else normalize
sifier employs some supporting functions to generate a the weights and repeat the steps.
training dataset and to evaluate the prominence of clas- End for
sifiers.
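The integral-image and classifier computations of Eqs. (14.1)–(14.3) can be sketched in plain Python. The function and variable names below are illustrative choices of ours, not the chapter's implementation:

```python
# Sketch of Eqs. (14.1)-(14.3): integral image, rectangle sum via four
# lookups, a two-rectangle Haar-like feature, and a thresholded weak
# classifier. Illustrative names; not the chapter's actual code.

def integral_image(img):
    """ii(x, y) = sum of i(x', y') for all x' <= x, y' <= y (Eq. (14.1))."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in a rectangle using four corner lookups (Eq. (14.2))."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def two_rect_feature(ii, top, left, height, width):
    """Difference between two horizontally adjacent rectangles."""
    a = rect_sum(ii, top, left, top + height - 1, left + width - 1)
    b = rect_sum(ii, top, left + width, top + height - 1, left + 2 * width - 1)
    return a - b

def weak_classifier(f_value, theta, polarity):
    """q(x, f, p, theta) = 1 if p*f(x) < p*theta, else 0 (Eq. (14.3))."""
    return 1 if polarity * f_value < polarity * theta else 0

img = [[1, 2, 1, 0],
       [0, 1, 3, 1],
       [2, 1, 0, 2]]
ii = integral_image(img)
print(ii[2][3])        # bottom-right entry = sum of all pixels (14)
print(weak_classifier(5.0, 4.0, 1))
```

Any rectangle sum then costs only four array lookups regardless of rectangle size, which is what makes exhaustive Haar-feature evaluation tractable.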
FIG. 14.3 68 facial landmark coordinates.

14.3.2 Extraction of Features
The feature extraction of the facial images is performed through the identification of facial landmarks. The facial landmarks shown in Fig. 14.3 comprise 68 points on the frontal face sketching the facial alignment, which could be used for any application analysis. The meticulous dynamic facial landmark points depicting the facial alignment are shown in Fig. 14.3.

In this work, we detect the face landmarks using OpenCV Dlib. The 68 feature points of the Dlib model detection include the face jawline, the right and left eyes, the right and left eyebrows, the mouth, and the nose. The details of the facial landmark points and the features represented by them are listed in Table 14.3.

TABLE 14.3
Mapping of landmark points to facial features.
Facial component        | Landmark points on the frontal face
Face jawline            | Points 1, 2, 3, ..., 17
Left eyebrow            | Points 18, 19, ..., 22
Right eyebrow           | Points 23, 24, ..., 27
Angular nose            | Points 28, 29, ..., 36
Left eye                | Points 37, 38, ..., 42
Right eye               | Points 43, 44, ..., 48
Outline of the mouth    | Points 49, 50, ..., 60
Inner line of the mouth | Points 61, 62, ..., 68

Dlib automatically detects the face and identifies the facial landmarks. These points are then used as the input data to feed the classifier. Fig. 14.4 shows the block diagram of facial landmark identification that is incorporated in the current research.

The landmark classification method plots 68 points on the input image representing the face. Each point identified on the face has basically three characteristics to be examined. They are listed below:
• The central point positions of the angular nose, which determine the angle of view corresponding to the face;
• The distance from the center of the nose to the point (x, y), termed d;
• The angle at which the point lies from the center, termed θ.

The implemented steps for the analysis of distance calculation in landmark points to identify facial expression are mentioned in Table 14.4.

Fig. 14.5 shows sample images of children's faces with landmark identification from the dataset used in the project implementation. These 68 points have a unique frontal arrangement on the detected face.
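The point-to-component mapping of Table 14.3 can be encoded directly. The group names and helper below are our own illustrative choices, not part of the Dlib API:

```python
# Illustrative mapping of the 68 Dlib landmark indices (1-based, as in
# Table 14.3) to facial components. Group names are ours, not Dlib's.

LANDMARK_GROUPS = {
    "face_jawline":     range(1, 18),   # points 1-17
    "left_eyebrow":     range(18, 23),  # points 18-22
    "right_eyebrow":    range(23, 28),  # points 23-27
    "angular_nose":     range(28, 37),  # points 28-36
    "left_eye":         range(37, 43),  # points 37-42
    "right_eye":        range(43, 49),  # points 43-48
    "mouth_outline":    range(49, 61),  # points 49-60
    "mouth_inner_line": range(61, 69),  # points 61-68
}

def component_of(point):
    """Return the facial component a 1-based landmark index belongs to."""
    for name, points in LANDMARK_GROUPS.items():
        if point in points:
            return name
    raise ValueError("landmark index must be in 1..68")

print(component_of(30))   # angular_nose
print(component_of(50))   # mouth_outline
```

The eight groups partition the 68 points exactly, which is a quick consistency check on the table.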
FIG. 14.4 Block diagram of facial landmark identification.

TABLE 14.4
Landmark analysis technique.
For all detected face instances individually
  Sketch the facial landmarks with the predictor class
  The landmarks are positioned using the x and y axes in x[] and y[]
  Compute xmean and ymean
  For every value in x[] and y[]
    Compute the mean of both axes to determine the center of gravity using
      xcentral = [(x − xmean) for x in xlist]
      ycentral = [(y − ymean) for y in ylist]
    Compute the angle θ for each face detected as
      if xlist[26] == xlist[29]
        anglenose = 0
      else
        anglenose = arctan((y[26] − y[29]) / (x[26] − x[29])) * 180/π
      if anglenose < 0
        anglenose += 90
      else
        anglenose −= 90
    Compute the relative angle as
      arctan((z − ymean) / (w − xmean)) * 180/π − anglenose

In live analysis there is a chance of not detecting the face within some frame. Nevertheless, when this live detection is combined with appropriate GPU processing support, it avoids the process of former facial detection, and the capture rate and analysis will be relatively improved given high computing capability. This leads to a low failure rate in the process of object detection with live frames. Therefore, to enhance the analysis factor in the proposed work, we process the input image using the Viola–Jones Haar-cascade classifier to detect a face and then examine the detected face for landmarks using the OpenCV Dlib Python libraries.

14.3.3 Expression Classifier
The proposed work classifies the detected facial expression by means of an SVM linear kernel classifier. SVM is one of the most popular machine learning algorithms, falling into the category of supervised learning. SVM can be employed for solving problems related to classification and regression.

The working principle of SVM is the concept of decision planes, which separate different sets of objects based on specific criteria. The SVM model is widely used in the classification of images, which motivated the use of SVM in the proposed work for the classification of expressions and grouping them into their categories.

The facial landmark points are analyzed to classify expressions such as anger, disgust, fear, happiness, sadness, surprise and neutrality. These expressions are evaluated in children both with ASD and TD.

The linear classifier classifies the facial expression and is of the form

f(x) = wxᵀ + b,   (14.5)

where x corresponds to the feature, w corresponds to the weight, and b is the bias. Each image set from ASD_POSITIVE and ASD_NEGATIVE is divided using the ratio of 80:20 for training and testing.

FIG. 14.5 Resulting images of landmark identification with 68 points.

14.3.4 Expression Identification Through a Convolution Neural Network (CNN)
The face detected using a Haar-cascade classifier and the emotion detected using Dlib library functions result in a better performance of expression recognition. To further improve this performance and the accuracy ratio, a deep learning algorithm is employed to learn in detail about the frames identified for analysis. The two major challenges faced during face detection using the Haar-cascade classifier are the larger search space and larger visual variations. To overcome these challenges, we employed a CNN to detect features in a frame in a fine granular manner.

In order to achieve better performance in the screening methodology, the CNN based facial expression identification is preceded by facial detection using Haar-cascade classification to detect a face in any frame of input, as discussed in Sect. 14.3.1.

The CNN involves three basic layers: (a) convolution layer, (b) max pooling layer and (c) fully connected network layer, as shown in Fig. 14.6. The CNN technique reads the input frame after frame, and each frame undergoes a series of convolution layers and a subsampling layer. These two layers of operation are termed the hidden layer, which could be scaled up to attain maximum accuracy in feature extraction.
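The steps of Table 14.4 can be sketched as a small Python routine. This is an illustrative reconstruction under our own naming, assuming 0-based nose-bridge indices 26 and 29 as in the table, not the chapter's exact code:

```python
import math

# Sketch of the landmark analysis in Table 14.4: center the 68 points on
# their mean, estimate the nose angle from points with indices 26 and 29,
# and express each point as (x, y, distance d, relative angle) -- the
# four entries per landmark described in the chapter. Illustrative only.

def landmark_vectors(xlist, ylist):
    xmean = sum(xlist) / len(xlist)
    ymean = sum(ylist) / len(ylist)
    xcentral = [x - xmean for x in xlist]
    ycentral = [y - ymean for y in ylist]

    # Angle of the nose bridge, used to correct for head tilt.
    if xlist[26] == xlist[29]:
        anglenose = 0.0
    else:
        anglenose = math.atan((ylist[26] - ylist[29]) /
                              (xlist[26] - xlist[29])) * 180 / math.pi
    anglenose += 90 if anglenose < 0 else -90

    vectors = []
    for xc, yc, x, y in zip(xcentral, ycentral, xlist, ylist):
        dist = math.hypot(xc, yc)                      # distance d from center
        angle = math.atan2(yc, xc) * 180 / math.pi - anglenose
        vectors.append((x, y, dist, angle))
    return vectors

# Synthetic 68-point input just to exercise the routine.
xs = [float(i) for i in range(68)]
ys = [0.0] * 68
vecs = landmark_vectors(xs, ys)
print(len(vecs))   # 68
```

The resulting 68 four-entry tuples are the feature vector handed to the SVM (or CNN) classifier.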
FIG. 14.6 CNN technique for expression identification in children with ASD.

In the proposed technique, we resized the input images to 64 × 64 dimensions and first grouped them based on the various classes of expressions in their labeled order. These input images were then trained for each expression category. For the CNN deep learning technique to work perfectly, we included the entire dataset of images from both the ASD positive and ASD negative classes, thus maximizing the input size, which led to better learning of features in the face.

The CNN in our proposed work includes, first, a convolution layer that has 32 filters of size 3 × 3 with a stride of 1, including batch normalization and no max pooling. The second convolution layer has 64 filters of size 3 × 3 with a stride of 1, including batch normalization and max pooling with a filter of size 2 × 2. The fully connected layer has a hidden layer with 512 neurons and soft-max as the activation function.

The flow of expression identification in children with autism is described in Fig. 14.7.

14.3.4.1 Convolution Layer
Given a segment of a neuron layer with N × N square neurons and an m × m filter, the final convolution layer will be of size (N − m + 1) × (N − m + 1). In our CNN based expression identification in ASD positive and ASD negative children, we involve a basic convolution algorithm with two layers of hidden operations. Table 14.5 shows the convolution technique for emotion identification.

The project also aims at using default values of F, P and S, so as to ease the computation in the first run. The filter and bias weights could be altered to improve the accuracy in expression identification.

FIG. 14.7 Flow diagram for CNN based expression identification.
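The size arithmetic above can be made concrete with a minimal "valid" convolution: an N × N input convolved with an m × m filter at stride 1 and no padding yields an (N − m + 1) × (N − m + 1) output, and more generally the output width is W2 = (W1 − F + 2P)/S + 1. The helpers below are illustrative sketches under our own names:

```python
# Minimal 'valid' 2-D convolution for Sect. 14.3.4.1: a 4x4 input with a
# 2x2 filter gives a (4 - 2 + 1) = 3x3 output. conv_output_width encodes
# the general formula W2 = (W1 - F + 2P)/S + 1. Illustrative code only.

def conv2d_valid(image, kernel):
    """Stride-1, zero-padding-free convolution of a square input."""
    n, m = len(image), len(kernel)
    size = n - m + 1                     # (N - m + 1)
    out = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            out[i][j] = sum(image[i + u][j + v] * kernel[u][v]
                            for u in range(m) for v in range(m))
    return out

def conv_output_width(w1, f, p=0, s=1):
    """W2 = (W1 - F + 2P)/S + 1, as in Table 14.5."""
    return (w1 - f + 2 * p) // s + 1

image = [[1, 0, 2, 1],
         [0, 1, 0, 2],
         [2, 0, 1, 0],
         [1, 2, 0, 1]]
kernel = [[1, 0],
          [0, 1]]

out = conv2d_valid(image, kernel)
print(len(out), conv_output_width(4, 2))   # 3 3
print(conv_output_width(64, 3))            # 62
```

With the chapter's first layer (3 × 3 filters, stride 1, no padding on 64 × 64 inputs), each feature map is 62 × 62.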
The convolution neural network based landmark identification computes the points with x and y coordinates indicated as (lx, ly). In general, with 68 landmark points, the convolution layer will have neurons defined from (l1x, l1y) to (l68x, l68y), as shown in Fig. 14.8, where l is the landmark and 1 to 68 are the facial landmark points marked on the detected face. These landmarks are annotated and labeled for each emotion, varying in the order of anger, disgust, fear, happiness, sadness, neutrality, sleepiness in our implementation. Each point in the landmark takes four major vector entries: x-axis, y-axis, angular distance from the head central point, and angle of the landmark.

These detections are made from the combinations of neurons formed out of the convolution layer, as shown in Fig. 14.9.

TABLE 14.5
Basic convolution technique.
The convolutional layer accepts a volume of size W1 × H1 × D1
It computes through 4 hyperparameters:
  Number of filters = K
  Spatial extent = F
  Stride S = 1
  Amount of zero padding = P
It produces a volume of size W2 × H2 × D2, where
  W2 = (W1 − F + 2P)/S + 1
  H2 = (H1 − F + 2P)/S + 1 (i.e., width and height are computed equally by symmetry)
  D2 = K
Weights per filter: F × F × D1; with K filters this gives (F × F × D1) × K weights, plus K bias terms
The output will be of size W2 and H2 with depth D2 through the stride S

FIG. 14.8 Mapping of landmark points with neurons.

FIG. 14.9 Landmark vectorization through combinational neurons.
14.3.4.2 Max Pool Layer
The max pooling layer downsamples the spatial volume considering the depth of the input image [26]. In our proposed work, we employed the max pooling algorithm with 2 × 2 filters and a stride value of 1. The basic max pooling algorithm adopted for the project is discussed in Table 14.6.

TABLE 14.6
Max pooling technique.
Max pooling accepts a volume of size W1 × H1 × D1
The computation involves two hyperparameters:
  Stride S = 1
  Spatial extent F
It computes the volume W2 × H2 × D2 as
  W2 = (W1 − F)/S + 1
  H2 = (H1 − F)/S + 1
  D2 = D1

The operation performed in the max pooling layer is based on downsampling with a bearable loss in the network. The reason this loss is bearable is that decreasing the size minimizes computational complexity and helps to counter overfitting. Also, no new parameters are added to this layer in our computation, as a fixed size is taken at a particular instant of time. The simple downsampling based max pooling technique adopted in our work is described in Fig. 14.10.

FIG. 14.10 Downsampling of an image through max pooling.

14.3.4.3 Fully Connected Layer
The fully connected layer is built by connecting every neuron to every other neuron residing in the network. The project results in this fully connected layer through the two levels of convolution layer and max pooling layer included in it. The layer result is 256 combinations of neurons that are vectorized. These fully connected neurons are then examined to identify accurate expression classification through minute feature detection in children's faces. The fully connected layers of neurons undergo a soft-max activation function that approximates the functional value.

14.4 RESULTS AND ANALYSIS
The dataset employed in this work contains two categories of images corresponding to the children who were identified as ASD_POSITIVE and ASD_NEGATIVE. The dataset contains 304 and 390 images of ASD_POSITIVE children and ASD_NEGATIVE children, respectively. These images were preprocessed to eliminate replica images. In our work, we considered 193 images in ASD_POSITIVE and 359 images in ASD_NEGATIVE. In the entire dataset, 80% of the images were categorized as a training dataset and 20% of the images were categorized as a test dataset.

The probabilities of the various expressions in both the TD and ASD children were calculated and analyzed. Table 14.7 depicts the overall probabilistic deviations in all seven recognized facial expressions. All the images in the dataset were evaluated for the various expressions using the recognized landmark points, and the variation among the probabilities of expressions recommends the neighboring reaction that the children might express.

TABLE 14.7
Comparison of difference in probabilities across various facial expressions in ASD and TD.
Category | Facial expression | Anger | Disgust | Fear | Happiness | Sadness | Neutrality | Sleepiness
ASD | Anger      | 0 | 0.165923 | 0.105921 | 0.050293 | 0.117268 | 0.168714 | 0.068018
ASD | Disgust    | 0.16592 | 0 | 0.06 | 0.11563 | 0.04866 | 0.002791 | 0.09791
ASD | Fear       | 0.10592 | 0.060002 | 0 | 0.05563 | 0.011347 | 0.062793 | 0.0379
ASD | Happiness  | 0.05029 | 0.11563 | 0.055628 | 0 | 0.066975 | 0.118421 | 0.017725
ASD | Sadness    | 0.11727 | 0.048655 | 0.01135 | 0.06697 | 0 | 0.051446 | 0.04925
ASD | Neutrality | 0.16871 | 0.002791 | 0.06279 | 0.11842 | 0.05145 | 0 | 0.1007
ASD | Sleepiness | 0.06802 | 0.097905 | 0.037903 | 0.01772 | 0.04925 | 0.100696 | 0
TD  | Anger      | 0 | 0.199906 | 0.061775 | 0.166527 | 0.092745 | 0.146316 | 0.011966
TD  | Disgust    | 0.19991 | 0 | 0.13813 | 0.03338 | 0.10716 | 0.05359 | 0.18794
TD  | Fear       | 0.06177 | 0.138131 | 0 | 0.104753 | 0.030971 | 0.084541 | 0.04981
TD  | Happiness  | 0.16653 | 0.033378 | 0.10475 | 0 | 0.07378 | 0.02021 | 0.15456
TD  | Sadness    | 0.09275 | 0.10716 | 0.03097 | 0.073782 | 0 | 0.053571 | 0.08078
TD  | Neutrality | 0.14632 | 0.05359 | 0.08454 | 0.020211 | 0.05357 | 0 | 0.13435
TD  | Sleepiness | 0.01197 | 0.18794 | 0.049809 | 0.154561 | 0.080779 | 0.13435 | 0

FIG. 14.11 Distraction probability of sadness in ASD and TD.

Table 14.7 also depicts that, for each facial expression, there could be some other facial expression from the children that is only modestly differentiated from it. This inference aids in examining the performance of facial expression analysis and thus endows a suitable progression in detection. The insights made from these distraction probabilities are that (i) any child with a maximum probability emotion could toggle to the second maximum probability; (ii) the probabilities of the various emotions differed for ASD positive and ASD negative children; and (iii) imperfect prediction or a gap in the emotional analysis could also be inferred through the expression analysis, as pictorially represented in Fig. 14.11 to Fig. 14.17 for sadness, neutrality, happiness, fear, disgust, anger and sleepiness, respectively.

FIG. 14.12 Distraction probability of neutrality in ASD and TD.

FIG. 14.13 Distraction probability of happiness in ASD and TD.

The facial expressions identified among the children prone to autistic behavior and the normally developing children showed increased probabilities in the most possessed expressions when compared to the SVM classifier implemented through Dlib. The various probabilities of the two techniques are annotated in Table 14.8, which shows the probability variations among ASD positive and ASD negative children through a simple SVM linear classifier and then upon application of the CNN deep learning algorithm.

The probabilities show that the maximum probability obtained in the SVM classification of expression is influenced at a higher rate when the CNN is applied. The CNN based classification also identifies and quantifies the most probable expression that ASD positive and ASD negative children express.

FIG. 14.14 Distraction probability of fear in ASD and TD.

FIG. 14.15 Distraction probability of disgust in ASD and TD.

In Figs. 14.18 and 14.19, the bars indicate that the maximum probability of expression shown by an ASD child is predominant in the neutral and disgust expressions, whereas the happy and neutral expressions are predominant in typically developing children. This indicates that parents and caretakers have to be keen observers when a child is prone to a disgust expression with no reason and in a fast sequence of repetition. These insights provide an early biomarker for the screening of autism rather than admitting the child to be autistic in nature.

In Fig. 14.20, the left bar indicates the probabilistic value obtained in ASD positive children using the SVM classifier over the expressions of anger, disgust, fear, happiness, sadness, neutrality and sleepiness, while the right bar indicates the probabilistic value obtained when applying the CNN. Fig. 14.21 indicates the analogous probabilistic values computed through SVM and CNN for ASD negative children. Figs. 14.20 and 14.21 also indicate that the distraction probabilities of ASD +ve and ASD −ve children are minimized upon application of the CNN. This is because of the deeper analysis made in the facial landmark points through the CNN computation.
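One plausible way to read Table 14.7 programmatically (our interpretation, not code from the chapter) is that, for a given expression, the column with the smallest nonzero difference is the "neighboring" reaction the child might toggle to. The values below are copied from the table's "Anger" rows:

```python
# "Anger" rows of Table 14.7 for ASD and TD children; the expression
# with the smallest probability difference is the nearest neighboring
# reaction. Helper name is ours; an illustrative reading of the table.

asd_anger_row = {
    "disgust": 0.165923, "fear": 0.105921, "happiness": 0.050293,
    "sadness": 0.117268, "neutrality": 0.168714, "sleepiness": 0.068018,
}
td_anger_row = {
    "disgust": 0.199906, "fear": 0.061775, "happiness": 0.166527,
    "sadness": 0.092745, "neutrality": 0.146316, "sleepiness": 0.011966,
}

def neighboring_expression(row):
    """Expression whose probability differs least from the given one."""
    return min(row, key=row.get)

print(neighboring_expression(asd_anger_row))   # happiness
print(neighboring_expression(td_anger_row))    # sleepiness
```

Under this reading, an angry ASD child's expression is most easily confused with happiness, while for a TD child it is sleepiness, which illustrates insight (i) above.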
FIG. 14.16 Distraction probability of anger in ASD and TD.

FIG. 14.17 Distraction probability of sleepiness in ASD and TD.

TABLE 14.8
Probability comparison of facial expression using SVM and CNN.
Facial expression | SVM linear, ASD positive | SVM linear, ASD negative | CNN, ASD positive | CNN, ASD negative
Anger      | 0.046345 | 0.046512 | 0.03821 | 0.05545
Disgust    | 0.209656 | 0.245092 | 0.23595 | 0.22614
Fear       | 0.152989 | 0.105871 | 0.16954 | 0.12547
Happiness  | 0.095634 | 0.212718 | 0.12832 | 0.24249
Sadness    | 0.16421  | 0.139531 | 0.01432 | 0.11568
Neutrality | 0.214629 | 0.191544 | 0.28689 | 0.20115
Sleepiness | 0.116538 | 0.058732 | 0.12677 | 0.03362
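The dominant expressions reported in the text can be checked directly against the Table 14.8 values. The dictionaries below copy the ASD positive columns of the table; the helper name is our own:

```python
# ASD positive columns of Table 14.8: ranking the probabilities confirms
# that neutrality and disgust dominate under both classifiers.
# Illustrative helper; values copied from the table.

svm_asd_positive = {
    "anger": 0.046345, "disgust": 0.209656, "fear": 0.152989,
    "happiness": 0.095634, "sadness": 0.16421, "neutrality": 0.214629,
    "sleepiness": 0.116538,
}
cnn_asd_positive = {
    "anger": 0.03821, "disgust": 0.23595, "fear": 0.16954,
    "happiness": 0.12832, "sadness": 0.01432, "neutrality": 0.28689,
    "sleepiness": 0.12677,
}

def ranked(probs):
    """Expressions sorted from most to least probable."""
    return sorted(probs, key=probs.get, reverse=True)

print(ranked(svm_asd_positive)[:2])   # ['neutrality', 'disgust']
print(ranked(cnn_asd_positive)[:2])   # ['neutrality', 'disgust']
```

Both classifiers place neutrality first and disgust second for ASD positive children, matching the observation made from Figs. 14.18 and 14.19.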
240 Deep Learning and Parallel Computing Environment for Bioengineering Systems

FIG. 14.18 Comparison of expression identified through SVM.

FIG. 14.19 Comparison of expression identified through CNN.

14.4.1 Accuracy of the Face Expression 14.5 CONCLUSIONS


Analysis Autism spectrum disorder is a neurodevelopmental dis-
During the first run of a linear classifier (i), the ac- order that is to be notified at an early stage so as to im-
curacy factor is determined as 0.775362318841, and prove the training mechanism. To support such screen-
later in a linear classifier (ii), the accuracy factor is ing mechanisms, facial expression analysis and emotion
determined as 0.847826086957. So, on average, the identification play a vital role in identifying commu-
mean accuracy of the analysis for the SVM linear classi- nication deficits among children with high functioning
fier is 0.811594202899, i.e., approximately equals 81% autism. This project aims to identify such facial expres-
of accuracy, whereas the accuracy of emotion prediction through CNN is approximately 0.89754. This accuracy could still be improved by increasing the number of hidden layers in the neural networks, thus converging to a finer classification of emotion categories.

CHAPTER 14 Multimodal Deep Learning Based Expression Analysis of Children With ASD 241

FIG. 14.20 Probability of classification of emotion in ASD positive children.

FIG. 14.21 Probability of classification of emotion in ASD negative children.

242 Deep Learning and Parallel Computing Environment for Bioengineering Systems

expressions in ASD positive and ASD negative children. The facial features are identified using facial landmark vectorization, and expressions are classified using an SVM linear classifier. The resultant expression analysis suggests that ASD positive children show disgust as the major expression, with neutrality as the second most probable expression, whereas ASD negative children show happiness as the major expression, followed by disgust and neutrality, in that order.

The facial expression identification is then performed using a convolutional neural network that maximizes the classification accuracy. The CNN method aims to identify the expression in a still image, separated into frames to learn in-depth features. The distraction probabilities of varied expressions are minimized by improving the appropriate facial expression results. The accuracy of the expression-based screening methodology could be further improved by applying a recurrent neural network that identifies the expression in live video input. The deep learning techniques RNN and CNN together give better screening accuracy.
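The per-class probabilities reported in Figs. 14.20 and 14.21 are the kind typically produced by a CNN's final softmax layer. The following is a minimal, illustrative sketch only: the emotion label set and the logit values are our own assumptions, not the chapter's actual model output.

```python
import math

# Hypothetical emotion categories; the chapter's exact label set is not specified here.
EMOTIONS = ["happiness", "disgust", "neutrality"]

def softmax(logits):
    """Convert raw CNN output scores into class probabilities."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_emotion(logits, labels=EMOTIONS):
    """Return the most probable emotion label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

With such a helper, the "probability of classification" plotted per emotion is simply the softmax output for that class.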
14.6 FUTURE WORK
The major challenge faced in facial expression detection is that, when using live video analysis, the computing capacity needs to be increased to GPU based analysis. Furthermore, screening in a contactless environment is only a basic level of screening, as the analysis should also account for object interference with the stimuli. Such a differentiated screening technique might enhance the identification of neurodevelopmental disorders in autistic children. The expressions could be expanded to find emotions when applying time sequences. The entire process could be extended to work for real time live video analysis of the children using a recurrent neural network combined with the existing technique.
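Extending the screening to time sequences would require grouping consecutive video frames into fixed-length windows before feeding them to a recurrent network. A generic sketch of that preprocessing step follows; the window length and the use of raw frame objects are illustrative assumptions, not the chapter's implementation.

```python
from collections import deque

def frame_sequences(frames, seq_len=16):
    """Yield sliding windows of seq_len consecutive frames,
    suitable as input time sequences for a recurrent model."""
    window = deque(maxlen=seq_len)  # oldest frame is evicted automatically
    for frame in frames:
        window.append(frame)
        if len(window) == seq_len:
            yield list(window)      # one time sequence per step once the window is full
```

Each yielded window could then be passed through the existing per-frame CNN and the resulting feature sequence through an RNN.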
CHAPTER 15

Parallel Machine Learning and Deep Learning Approaches for Bioinformatics

M. MADIAJAGAN, MS, PHD • S. SRIDHAR RAJ, BTECH, MTECH

Deep Learning and Parallel Computing Environment for Bioengineering Systems. https://doi.org/10.1016/B978-0-12-816718-2.00022-1
Copyright © 2019 Elsevier Inc. All rights reserved.

15.1 INTRODUCTION
Machine learning is widely used to perform a single operation or a series of operations without being explicitly programmed, which showcases the potential of artificial intelligence [1]. Instead of following the programmer's instructions, the technology performs tasks through statistical analysis and analytics. Thus, it is a potent technique for learning from experience itself and reacting to it [2].

It provides various applications in social networks, online customer relationship development, prediction related applications, etc. [3]. By prediction related applications, we mean that the bioinformatics area has many applications for predictive analysis, which is the reason why machine learning gives outstanding results for problems in bioinformatics [4].

15.1.1 Machine Learning and Deep Learning
The basic idea of machine learning algorithms is to understand the real-world problems that humans are tackling. The technical distinctions between unsupervised, semisupervised and supervised learning denote the labeling factor of the bioinformatics datasets [5,6]. The objective is to gain knowledge from the previous history of datasets and act accordingly. To perform this, potent methods and algorithms are required, which is a complicated process [7].

To overcome this complexity, neural networks come into the picture. By using deep neural networks (DNNs), the complexity of analyzing large datasets can be resolved. The basic potential of a DNN can be explored only when we have large datasets: the larger the dataset used for training, the better the accuracy of testing [8,9].

Natural language processing, video processing, recommendation systems, disease prediction, drug discovery, speech recognition, web content filtering, etc. [10], are some of the applications of deep learning. As the scope of the learning algorithms evolves, the applications of deep learning grow drastically [11].

15.1.2 Role of Parallelization in Deep Learning
In order to train huge deep neural networks, high computational power, much time and many iterations are required; this is where the need for parallel processing arises. Deep learning has the capacity to identify complex multidimensional structures and large scale problems, which arise in image processing, speech recognition and video processing [12,13].

High level representations can be derived by training very large datasets on huge deep learning networks having numerous layers and parameters. Training large models consumes a lot of time before a well trained model is acquired. To resolve this issue, these large scale deep learning models are trained in parallel using distributed systems which have billions of cores and GPUs with huge numbers of computing threads [7,14].

Training large sets of data takes a significant number of hours. The recent big data boom has made everyone look into machine learning and apply these concepts in almost all fields today. Machine learning algorithms like logistic regression, support vector machines, kernel methods, principal component analysis, linear regression, naive Bayes, random forests, and neural networks are used depending on the type of problem and the data available [12,15]. There are various algorithms we can use to improve the accuracy of the predicted results, but this may come at the cost of extra computation, and thus longer training time. Parallel computing can be used as a powerful tool for processing large data sets [16].

Most deep learning algorithms work in parallel by themselves, while some do not. The mathematics behind a machine learning algorithm is mainly linear algebra: computing a kernel matrix or a gradient by summing over all the training points. Large matrix multiplication is the major cause of the computational bottleneck [17]. To optimize training, parallel computing can be deployed. MapReduce can only parallelize the statistical query model used by most of the

machine learning algorithms, where the expectation of the distribution is determined by summing over functions of the data points. GraphLab is another, broader framework used for asynchronous iterative algorithms with sparse dependencies [12].

The development of parallel machine learning algorithms to process large data sets in real world applications is inevitable. Recently, the IBM Haifa Lab and Watson Labs, which are machine learning based labs, produced tools for parallel machine learning [18]. The IBM Haifa Lab and Watson Labs work with IBM development and services arms, partners and clients to answer their needs, and collaborate with universities to promote industrial research. Continuously reinventing and refocusing themselves to stay at the forefront of the technology, most of their projects today fall under artificial intelligence, cloud data services, healthcare informatics, and image and video analytics, alongside mobile applications, security and quality [19]. The labs also focus on the healthcare domain. These tools are helpful in executing machine learning algorithms on multithreaded and multiprocessor machines [20].

15.1.3 Deep Learning Applications in Bioinformatics
Bioinformatics, or computational biology, is the science of interpreting biological data through computer science. Due to the vast development of protein sequencing, genomics, three-dimensional modeling of biomolecules and biological systems, etc. [21], large amounts of biological data are being generated. Complex analysis is needed to derive conclusions from this huge amount of biological data [22].

Good knowledge of molecular biology and computer science is required to approach the analysis of bioinformatics data. As the generation of data from genomic, proteomic and other bioinformatics applications has increased on a large scale, analyzing this data gains more attention [23]. Data mining techniques act as a base for analyzing these datasets. The outcome of the analyses of large data should be sensible in terms of the structure inferred by the data [24].

Some of the applications under classification tasks are cancer cell classification, gene classification and classification of microarray data. Protein structure prediction, statistical modeling of protein–protein interaction, clustering of gene expression data, gene finding, protein function domain detection, function motif detection, protein function inference, disease diagnosis, disease prognosis, disease treatment optimization, protein and gene interaction network reconstruction, data cleansing, and protein subcellular location prediction, etc., are some more applications [25]. Therefore, the relationship between deep learning and bioinformatics is on a path of great growth and potential.

For example, microarray data is used to predict a patient's outcome. On the basis of patients' genotypic microarray data, their survival time and risk of tumor metastasis or recurrence can be estimated. An efficient algorithm which considers the correlative information in a comprehensive manner is highly desirable.

The rest of this chapter is organized as follows. In Sect. 15.2, the relation between deep learning and parallelization is discussed. In Sect. 15.3, the role of deep learning in bioinformatics is discussed. Section 15.4 presents the application of parallel deep learning algorithms to bioinformatics applications. Section 15.5 presents the implementation screenshots of the training process. Section 15.6 summarizes all the sections and gives future directions.

15.2 DEEP LEARNING AND PARALLEL PROCESSING
In this section, basic concepts of parallel processing algorithms and deep learning based parallel algorithms are discussed. The discussion will give an insight into integrating deep models and parallel algorithms.

15.2.1 Parallel Processing
Parallel processing is basically used to minimize the computation time of a monotonous process by splitting huge datasets into small meaningful parts to acquire proper outcomes from them. Web services, social media, speech processing, medical imaging, bioinformatics and many similar fields face the difficulty of analyzing terabytes of data they collect daily. There are some problems in which the run-time complexity cannot be improved even with many processors [26]. In some cases, such as user experience with an interface, parallel thread handling creates frustration, when the user is trying to access something and a parallel thread hovers to some other location. Parallel algorithms are called efficient when their run-time complexity divided by the number of processors is equal to the best run-time complexity in sequential processing. Therefore, some activities must be performed in a sequential manner only, and judging whether a task should be made parallel or serial is a separate domain of problems [25].

To rectify this issue, we move to parallel processing, where more than one processor is used to process the instructions. This way the workload is divided between multiple processors. See Fig. 15.1.

FIG. 15.1 Parallel processing execution.
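The split–process–merge pattern of Fig. 15.1 can be sketched in a few lines. This is a generic illustration: the function names and the per-record work are our own, and a thread pool is used to sidestep platform-specific process-spawning details.

```python
from multiprocessing.pool import ThreadPool

def split_into_chunks(data, n_workers):
    """Split a dataset into n_workers roughly equal parts, one per processor."""
    k, m = divmod(len(data), n_workers)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_workers)]

def process_chunk(chunk):
    # Stand-in for the real per-record work (e.g., feature extraction).
    return [record * record for record in chunk]

def parallel_process(data, n_workers=4):
    chunks = split_into_chunks(data, n_workers)
    with ThreadPool(n_workers) as pool:
        partial = pool.map(process_chunk, chunks)        # each worker handles one chunk
    return [result for part in partial for result in part]  # merge partial results
```

In a real deployment the workers would be separate processes or machines, but the division and merging of the workload follows the same shape.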

The disadvantage of parallel computing is the dependency between processors, i.e., one processor waits for the result of another processor. Dual core, multicore, i3, i5, i7, etc., all denote the number of processors in the modern computing environment.

15.2.2 Scalability of Parallelization Methods
Parallel processing can be performed over different machines or on a single machine which has multiple cores; the latter is known as a local training process. Once the data is localized in the machine, the machine takes care of the parallel processing within it [27]. The usage of multiple cores can be of two types:
• The lengthy way of loading multiple data in a single layer and applying a multicore processor.
• An alternative is to separate the data into batches and send each core a batch for processing.

Storing very large datasets on a local machine is not possible unless the memory increases proportionally with usage. To resolve this, the data is stored across many machines in a distributed manner. Here, either the model or the data can be distributed, as discussed below. In data parallelism, data is distributed across multiple machines. When the data is large, or to attain faster processing, data parallelism can be used. If, on the other hand, the model is too big to fit in a single system, model parallelism can be used. When part of a model is placed on a single machine, that part demands the output of another part. This forward and backward propagation establishes communication between the model parts on different machines in a serial fashion [28].

15.2.3 Deep Learning Using Parallel Algorithms
The tendency of machine learning algorithms to solve complex problems with simple solutions has to be scaled to large-scale problems, too. By exploring machine learning on large-scale problems, the outcomes and patterns which we infer will definitely be beyond ordinary reasoning. But the space and time constraints of the sequential mechanism have hindered this development. Platforms like Spark and Hadoop can be used for parallel tasks in hyperparameter tuning and ensemble learning, where multiple models have to be analyzed [29].

15.2.3.1 Advantages of Using Parallelism in Deep Learning
• When it comes to deep learning, artificial neural networks (ANNs) are involved. An ANN takes a large set of data as input, learns the parameters and gets trained as a model. The training time taken is very high: learning this many parameters takes a very long computation time, on the order of days. Here, "q" denotes the number of cores in the processor; the VGGNet application takes about 10 hours for training even on an 8q machine. This is a computationally intensive process which takes a lot of time [30].
• The basic structure of deep neural networks is distributed in nature across the layers. Deep learning parallelization leads to improvements in training time from months to weeks, or even days.
• The importance here is acceleration, nothing else. You can run deep learning solutions on a single processor or machine, provided you can tolerate the sluggishness [29]. Hence, the sure way of speeding things

FIG. 15.2 Bioinformatics applications.

up is to use hardware acceleration, just like in computer graphics, since both graphics and deep learning are inherently parallel in nature [13].

15.2.3.2 Parallelism Challenges in Deep Learning
Directly applying parallel algorithms to these problems is difficult, and scalable sequential algorithms are not able to rectify the issues. To overcome this challenge, a framework which implements parallel machine learning algorithms on large distributed data, such as social networks, based on functional programming abstractions has been discussed [31]. The algorithms can be implemented very easily by using functional combinators, which yield the best composition of aggregation, distributed and sequential processes. This system also avoids inversion of control in a synchronous parallel model. The cost of the parallel processing units (i.e., GPUs (graphical processing units)) is yet another challenge to overcome.

15.3 DEEP LEARNING AND BIOINFORMATICS
Deep learning and bioinformatics go hand in hand, with base applications like image processing, computer vision, medical images, DNA sequencing, RNA detection, gene structure prediction, drug discovery, resistance to antibiotics, agriculture, weather forecasting, forensics, bio-weapons, nutrition science, etc. Bioinformatics problems are still in the beginning stage of their evolution after applying convolutional, deep belief and recurrent neural networks [24]. Fig. 15.2 illustrates various applications of bioinformatics. Molecular biology, genomics, etc., are no surprise as bioinformatics applications. But the evolution of bioinformatics using information technology, databases, and the design of drugs using computer aided software are to be noted.

15.3.1 Bioinformatics Applications
The performance of all the machine learning and deep learning algorithms in bioinformatics applications is noticeable. This suggests that deep learning is the most effective among the technologies applied in this field. Still, it requires the proper selection of the model for the problem, as well as of the parameters. See Fig. 15.3. Some of the areas in bioinformatics research are discussed next.

15.3.1.1 Sequence Analysis
Analysis of a sequence is a very basic operation in the field of computational biology. It is used to find similar biological sequences and varying parts during medical analysis and genome mapping. By analyzing a sequence, it can be aligned in a proper manner. Sequences which are searched often are stored in the database for regular access from the computer.

15.3.1.2 Genome Annotation
Dr. Owen White designed the first genome annotation software model in 1995. Genome annotation is nothing but the

FIG. 15.3 Classification of bioinformatics applications based on neural network types.

marking of genes and related biological features in a DNA sequence.

15.3.1.3 Analysis of Gene Expression
By using techniques such as microarrays, DNA sequencing, serial analysis of gene expression, parallel signature sequencing, etc., the expression of most genes can be determined. None of the mentioned techniques is completely noiseless, and all are subject to bias from the environment. In gene expression studies, therefore, the evolution is happening in developing tools which distinguish and separate signal and noise.

15.3.1.4 Analysis of Protein Expression
There are numerous ways to measure gene expression, but measuring protein expression is the best among them because of the catalytic role proteins play. A snapshot of the proteins available for analysis can be acquired using protein microarrays and high-throughput mass spectrometry.

15.3.1.5 Analysis of Mutations in Cancer
The genomes of cancer affected cells are rearranged randomly, in very confusing ways. Very high-level sequencing methodologies are required to identify the mutation points, which are unknown. The number of genome sequences is growing drastically as the human population increases; hence high level systems are required to identify the sequences properly.

15.3.1.6 Protein Structure Prediction
The main sequence required to predict the protein structure is the amino acid sequence. It can be derived from the gene sequence itself, which helps to identify the unique structure of the protein. Knowledge of this unique structure plays a very important role in understanding the protein functions. Protein structure prediction is used to design drugs and novel enzymes.

15.3.1.7 Modeling Biological Systems
Modeling of biological systems is in high demand in the field of computational biological sciences. Computer simulations are used for cellular systems, and gene regulatory networks are employed to detect the complex structures in cellular networks. The connections between cells can be well defined and easily surveyed through computer modeling. The domain of artificial intelligence itself is about understanding real life problems by developing similar kinds of systems.

15.3.1.8 High-Throughput Image Analysis
Biological images have the potential to deliver highly informative content. By analyzing these medical images, a researcher can make an active decision over

the pros and cons. The observational job of the researcher can be completely replaced by a computer modeled system. Some applications are the analysis of clinical images, detection of overlapping DNA clones, etc.

15.3.1.9 Microarrays
In order to automatically collect large amounts of data, microarrays are very useful. Machine learning can assist microarrays in the analysis process, which in turn helps to identify the patterns and networks between the genes. The expression of a gene in a genome is observed by the microarrays, which results in diagnosing cancer. Radial basis functions, deep learning, Bayesian networks, decision trees and random forests are the most often used methods for analyzing the observations.

15.3.1.10 Systems Biology
By observing the components of a biological system, the behaviors of the system can be inferred. DNA, RNA and metabolites are some of the components which need to be observed. Probabilistic models are developed for them and used in genetic algorithms, which comprise Markov models. Some applications are enzyme function prediction, high throughput microarray data analysis, analysis of genome-wide association studies to better understand markers of multiple sclerosis, protein function prediction, and identification of NCR-sensitivity of genes in yeast.

15.3.1.11 Text Mining
The need for text mining is very high for biological databases, articles, etc. Just observing a protein sequence does not reveal all the details about its structure; there have to be additional techniques to extract the potential content from the metadata, to determine the subcellular localization of a protein, to analyze DNA-expression arrays, and to handle large-scale protein and molecule interactions. Another application of text mining is the detection and visualization of distinct DNA regions given sufficient reference data [24].

15.3.2 Advantages of Using Parallel Deep Learning in Bioinformatics Applications
Bioinformatics combines biology with computer science for the following reasons:
• To identify the molecular cause of a disease;
• To explain the influencing factors of a disease at the gene level;
• To interpret and analyze data quicker by using machine learning, deep learning and mining techniques;
• To improve the result accuracy;
• To detect mutations with next-generation sequencing;
• To facilitate cancer sequencing.

15.3.3 Challenges in Using Parallel Deep Learning for Bioinformatics Applications
• Reducing the additional time taken for dividing the processes and combining the results;
• Making sure the performance gain of the parallelization process renders the additional time taken negligible;
• Identifying when the parallelization process takes more time than a sequential process for a particular task, how to overcome this, and the mechanism needed to reduce the time overhead;
• Developing efficient bioinformatics algorithms and approaches for target identification and validation, and for lead identification and optimization, to improve drug discovery [32].

15.4 PARALLEL DEEP LEARNING IN BIOINFORMATICS APPLICATIONS WITH IMPLEMENTATION AND REAL TIME NUMERICAL EXAMPLE
Applying the parallel deep learning methodology to bioinformatics applications brings a greater challenge in implementation. The overheads, applicability problems and related issues are addressed in this section.

A general model has been designed in order to process the bioinformatics application's data in parallel with deep learning algorithms. Fig. 15.4 and Fig. 15.5 depict the data flow of the parallel execution using deep learning. The data flow consists of the following steps:
1. The required bioinformatics data is collected by means of sensors or similar devices from the subject.
2. The collected data is preprocessed based on the type of application and the purpose of the user.
3. Once the preprocessed data is ready, the entire dataset is split into training and test data.
4. The traditional training and testing split of 70 : 30 is followed.
5. Now, the training data is fed to train the CNN model.
6. In order to perform parallel processing, the datasets have to be separated into equal halves for parallel processing. Certain scheduling algorithms can be used to schedule the processes based on the number of cores available in the process.

7. The deep learning algorithms which segment the CNN features are applied to divide the dataset.
8. Then the model gets trained on the separate cores.
9. After the training is over, the results of the trained data have to be combined into a single module without loss of data.
10. Having the trained results in a single module makes the testing process smoother.
11. The models are saved as checkpoints at each iteration of the training for further processing during testing.

FIG. 15.4 Dataflow of parallel dataset training.

Let us consider some real time numerical values regarding the time overhead introduced by imparting parallel processing to deep learning bioinformatics applications. Let us assume that:
• The total number of records in the dataset is 1000;
• A quadcore processor is given for parallel execution.
Training part parameters:
• The total number of processors given is 4;
• The number of training records in the dataset is 700;
• The number of records given to each processor for parallel execution is 700/4;
• The time taken for one data allocation to a processor is 1 s (4 × 1 = 4 s);
• The training time for one record is 1 s;
• The table merging time is 2 s per merge;
• The overall computation of the quadcore processor takes 8 s.
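The record-partitioning figures in the example above can be reproduced with a small helper. This is a sketch of the bookkeeping only; the function name and dictionary layout are our own, and the per-phase timing totals quoted in the chapter are not re-derived here.

```python
def partition(total_records=1000, train_fraction=0.7, n_processors=4):
    """Compute the 70:30 split and the per-processor workload used in the example."""
    n_train = int(total_records * train_fraction)        # 700 training records
    n_test = total_records - n_train                     # 300 test records
    return {
        "train_records": n_train,
        "test_records": n_test,
        "train_per_processor": n_train // n_processors,  # 700/4 = 175 records per core
        "test_per_processor": n_test // n_processors,    # 300/4 = 75 records per core
        "allocation_time_s": n_processors * 1,           # 1 s per allocation, 4 x 1 = 4 s
    }
```

Changing `n_processors` shows immediately how the per-core workload, and hence the parallel running time, scales with the hardware available.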
Testing part parameters:
• The total number of processors given is 4;
• The number of testing records in the dataset is 300;
• The number of records given to each processor for parallel execution is 75;
• The time taken for one data allocation (separation) to a processor is 1 s (4 × 1 = 4 s);
• The processing time for one record is 1 s;
• The table merging time is 2 s per merge;
• The overall computation of the quadcore processor takes 10 s;
• The prediction time is 2 s;
• Writing the predicted result to the dataset takes 1 s per record.

FIG. 15.5 Dataflow of parallel testing process.

The dataset split for the training is based on the following capacities of the processors:
• Number of processors;
• Load balancing capacity of each processor;
• Fault tolerance;
• Scalability;
• Interoperability;
• Recording of the data sets as per the requirements of the application;

FIG. 15.6 Model initialization and GPU leveraging.

• The fact that, once the dataset is ready, the data schema has to be framed in such a way that it is lossless and dependency preserving;
• The idea that the first level of dataset separation is the separation of data into training and test data. This level of data separation need not be checked for the lossless and dependency preserving properties, since we are not going to combine it again;
• Using the second level of dividing or scheduling the dataset as per the availability of the number of processors given for execution.
Splitting the deep learning algorithm across the processors can happen in two ways:
• The deep learning algorithm can be fixed as a baseline and the same pseudocode can be passed to all the processors;
• Each processor can have a different algorithm executing for the same purpose, provided there is no data dependency between the datasets.
Either the processor or the process has to be kept constant for the other to perform parallel processing; the former will act as the point of reference to the latter. By following this principle, any number of parameters can be manipulated and arranged to suit parallel processing. The main requirement, however, is that the datasets should not have any data dependency among them.
Tackling data dependency issues is very tedious in parallel processing. Even though structured query languages can manage the issue effectively, dynamic alterations to the parallel environment are difficult. Thus, separate, efficient parallel algorithms are needed to address these dynamic alteration issues among parallel processors.

15.5 SAMPLE IMPLEMENTATION SCREENSHOTS TO VISUALIZE THE TRAINING PROCESS
In this section, screenshots of the implementation are shown step by step. Fig. 15.6 shows the backend TensorFlow initialization and the release of the GPU device for the process computations. Fig. 15.7 illustrates the convolutional neural network (CNN) model initialization. The model is stored as a checkpoint file and reloaded into the same net model for the next iterations. The CNN performs a pooling operation in the pooling layer after each convolution layer operation; the frequency of pooling and the output format of the average pooling layer are presented.
The ultimate aim of the parallel algorithms is to reduce the execution time required for a particular task.
Fig. 15.8 shows the similarity matrix calculated for the classification task. The higher the similarity value, the closer a sample is to the corresponding class, and samples are mapped to their respective classes accordingly.
Fig. 15.9 shows the execution time for the training and testing of the samples. The execution time shows how effectively the parallel algorithms played their roles in the deep learning techniques.
Fig. 15.10 shows the final classified samples for each class. With its help, we can easily see which class each sample belongs to.
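The two-level separation described earlier in this section — a train/test split first, then equal chunks dealt to the available processors with the same routine applied to each — can be sketched with Python's standard library. The 80/20 ratio, the toy integer dataset, and the per-chunk averaging routine are illustrative assumptions standing in for the actual CNN training step:

```python
from concurrent.futures import ThreadPoolExecutor

def two_level_split(dataset, n_processors, train_ratio=0.8):
    """Level 1: train/test split (never recombined, so no lossless check needed).
    Level 2: deal the training part into one chunk per processor."""
    cut = int(len(dataset) * train_ratio)
    train, test = dataset[:cut], dataset[cut:]
    size = -(-len(train) // n_processors)            # ceiling division
    chunks = [train[i * size:(i + 1) * size] for i in range(n_processors)]
    return chunks, test

def train_on_chunk(chunk):
    """Stand-in per-processor routine: the same code runs on every chunk."""
    return sum(chunk) / len(chunk)

data = list(range(100))                              # toy dataset, no dependencies
chunks, test_set = two_level_split(data, n_processors=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(train_on_chunk, chunks))
merged = sum(partials) / len(partials)               # the "table merging" step
```

Because the chunks are disjoint, the workers share no data dependency, which is exactly the precondition stated above; for CPU-bound training a `ProcessPoolExecutor` would replace the thread pool to sidestep the interpreter lock.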
CHAPTER 15 Parallel Machine Learning and Deep Learning Approaches for Bioinformatics 253

FIG. 15.7 Visualizing GPU device enabling.

FIG. 15.8 Similarity matrix computation for classification.
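The mapping behind Fig. 15.8 — assign each sample to the class with the highest similarity value — can be sketched as follows. Cosine similarity against per-class reference vectors is an assumption; the chapter does not specify which similarity measure was used:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_classes(samples, class_refs):
    """Build the similarity matrix and map each sample to its closest class."""
    matrix = [[cosine(s, ref) for ref in class_refs] for s in samples]
    labels = [row.index(max(row)) for row in matrix]
    return matrix, labels

refs = [[1.0, 0.0], [0.0, 1.0]]        # two illustrative class reference vectors
samples = [[0.9, 0.1], [0.2, 0.8]]
matrix, labels = assign_classes(samples, refs)
print(labels)                           # higher similarity -> closer class
```

Each row of `matrix` is one sample's similarity to every class, and `labels` records the class each sample is mapped to, mirroring the matrix-then-labels output of Figs. 15.8 and 15.10.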



FIG. 15.9 Snippet while the dataset is getting trained in the Python shell.

FIG. 15.10 Class and its corresponding labels.

15.6 SUMMARY
In this interdisciplinary domain of deep learning and bioengineering, bioinformatics applications offer a very wide scope for learning techniques. The bioinformatics field generates huge amounts of data, which suits the deep learning process well, but the data lack organization at the molecular level. Since deep learning and bioinformatics are rapidly growing areas in today's world, it is very important to address the research issues in them.
In this chapter, the basic concepts of parallel processing were explained with examples in order to pave a clear way toward parallel deep learning. The types of parallelization techniques were discussed with diagrammatic explanations, and the ways in which they can be internally classified were examined. The relation between bioinformatics and deep learning, the challenges in combining them, and the benefits of doing so were discussed. The applicability of parallel deep learning algorithms to real-time datasets was explained with simple numerical examples.
All the above figures illustrate the step-by-step procedure from preprocessing to the classification process. Further, the misclassification errors depend on the number of samples classified under each class; an uneven number of samples per class leads to heavy misclassification errors. Generative methods can be used to stabilize these errors by normalizing the labels with pseudo-samples.
The credibility of generated pseudo-samples in bioinformatics is a major issue. Evaluating generated pseudo-samples is tedious due to the complexity of the structure of the data. Purpose-specific algorithms are required to complement these pseudo-samples with the original samples of the class.
The main task in machine learning research is prediction. Fields such as biomedicine and bioinformatics have a wide application range for this prediction task. Deep learning and other deep representation learning algorithms have been applied successfully in image understanding, speech recognition, text classification, etc. Besides these, semisupervised learning, learning from positive and unlabeled examples, multiview learning, transfer learning, probabilistic graphical models, etc., are also developing rapidly. The methodology, implementation and data from real-world applications have been covered in this chapter, which will hopefully provide novel guidance for machine learning researchers and broaden the perspectives of medicine and bioinformatics researchers.
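As a simple stand-in for the generative pseudo-sample methods mentioned above (not the authors' algorithm), the crudest way to even out per-class sample counts is random oversampling, duplicating minority-class samples until every class matches the largest one:

```python
import random

def oversample(samples_by_class, seed=0):
    """Pad each minority class with duplicated samples (naive pseudo-samples)
    until all classes reach the size of the largest class."""
    rng = random.Random(seed)
    target = max(len(v) for v in samples_by_class.values())
    balanced = {}
    for label, samples in samples_by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        balanced[label] = samples + extra
    return balanced

# Illustrative imbalanced data: 4 positive samples vs 1 negative sample.
data = {"positive": [[0.1], [0.2], [0.3], [0.4]], "negative": [[0.9]]}
balanced = oversample(data)
print({k: len(v) for k, v in balanced.items()})   # both classes now size 4
```

Duplication only equalizes label counts; the generative approaches discussed above aim instead to synthesize plausible new samples, which is where the credibility issues arise.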
