Fundamentals of Machine Learning and Deep Learning in Medicine

Reza Borhani
Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA

Soheila Borhani
Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA

Aggelos K. Katsaggelos
Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
To our families:
Maryam and Ali
Ειρηνη, Ζωη, Σοφια and Adam
Preface
Not long ago, machine learning and deep learning were esoteric subjects known
only to a select few at computer science and statistics departments. Today, however,
these technologies have made their way into every corner of the academic universe,
including medicine. From automatic segmentation of medical imaging data, to
diagnosing medical conditions and disorders, to predicting clinical outcomes, to
recruiting patients for clinical trials, machine learning and deep learning models
have produced results that rival, and in some cases exceed, human performance.
These groundbreaking successes have garnered the attention of healthcare stake-
holders in academia and industry, with many anticipating and advocating for an
overhaul of current educational curricula in order to prepare students for the
transition of medicine from the “information age” to the “age of AI.”
As medical and health-related programs begin to incorporate machine learning
and deep learning into their curricula, a salient question arises about the extent to
which these subjects should be taught, given that researchers and practitioners in
these fields can, and often do, use various forms of technology without full knowl-
edge of their inner-workings. For instance, a diagnostician need not necessarily be
familiar with how magnetic fields are generated inside a scanner machine in order
to interpret an MRI accurately. Similarly, surgeons can learn to operate robotic
surgical systems effectively without ever knowing how to build, fix, or maintain
one. We believe the same cannot be said about the use of artificial intelligence in
medicine. For example, oncologists cannot be the mere end-users of a machine
learning model which recommends the best course of treatment for a given cancer
patient. They need to understand how these models work and, ideally, play an
active role in developing them. Otherwise, one of two scenarios is bound to occur:
either physicians will uncritically accept the model recommendations (which is a
dangerous form of automation bias), or they will learn to distrust and ignore such
recommendations to the detriment of their patients who could benefit from the
“wisdom” of data-driven models trained on millions upon millions of examples.
Thanks to the immense popularity of machine learning and deep learning, the
market abounds with textbooks written on these subjects. However, having generally
been written by mathematicians and engineers for mathematicians and engineers,
these texts are not geared toward the specific educational needs of medical students,
researchers, and practitioners. Put differently, they are written in a “language” which
is not accessible to the average scholar in medicine who typically lacks a graduate-
level background in mathematics and computer science. Nearly six decades ago, the
pioneering British medical scientist Sir Harold Percival Himsworth addressed this
very challenge in his opening statement to the 1964 Conference on Mathematics and
Computer Science in Biology and Medicine: “Medical biologists, mathematicians,
physicists and computologists may have more of their outlook in common than we
suspect. But they do speak different dialects and they do have different points of
view. This is no new problem for a multidisciplinary subject like medical research.
If it is to be solved and the evident necessity for co-operation realized, one thing is
essential: we must learn each other’s language.”
The book before you is an attempt to realize this vision by providing an
accessible introduction to the fundamentals of machine learning and deep learning
in medicine. To serve an audience of medical researchers and professionals, we have
presented throughout the book a curated selection of machine learning applications
from medicine and adjacent fields. Additionally, we have prioritized intuitive
descriptions over abstract mathematical formalisms in order to remove the veil of
unnecessary complexity that often surrounds machine learning and deep learning
concepts. A reader who has taken at least one introductory mathematics course
at the undergraduate level (e.g., biostatistics or calculus) will be well-equipped
to use this book without needing any additional prerequisites. This makes our
introductory text appropriate for use by readers from a wide array of medical
backgrounds who are not necessarily initiated in advanced mathematics but yearn
for a better understanding of how these disruptive technologies can shape the future
of medicine.
Contents

1 Introduction
   The Machine Learning Pipeline
      Data Collection
      Feature Design
      Model Training
      Model Testing
   A Deeper Dive into the Machine Learning Pipeline
      Revisiting Data Collection
      Revisiting Feature Design
      Revisiting Model Training
      Revisiting Model Testing
   The Machine Learning Taxonomy
   Problems
   References
2 Mathematical Encoding of Medical Data
   Numerical Data
   Categorical Data
   Imaging Data
   Time-Series Data
   Text Data
   Genomics Data
   Problems
3 Elementary Functions and Operations
   Different Representations of Mathematical Functions
   Elementary Functions
      Polynomial Functions
      Reciprocal Functions
      Trigonometric and Hyperbolic Functions
      Exponential Functions
      Logarithmic Functions
      Step Functions
   Elementary Operations
      Basic Function Adjustments
      Addition and Multiplication of Functions
      Composition of Functions
      Min–Max Operations
      Constructing Complex Functions Using Elementary Functions and Operations
   Problems
4 Linear Regression
   Linear Regression with One-Dimensional Input
   The Least Squares Cost Function
   Linear Regression with Multi-Dimensional Input
   Input Normalization
   Regularization
   Problems
   Reference
5 Linear Classification
   Linear Classification with One-Dimensional Input
   The Logistic Function
   The Cross-Entropy Cost Function
   The Gradient Descent Algorithm
   Linear Classification with Multi-Dimensional Input
   Linear Classification with Multiple Classes
   Problems
   References
6 From Feature Engineering to Deep Learning
   Feature Engineering for Nonlinear Regression
   Feature Engineering for Nonlinear Classification
   Feature Learning
   Multi-Layer Neural Networks
   Optimization of Neural Networks
   Design of Neural Network Architectures
   Problems
   References
7 Convolutional and Recurrent Neural Networks
   The Convolution Operation
   Convolutional Neural Networks
   Recurrence Relations
   Recurrent Neural Networks
   Problems
   References
Index
Chapter 1
Introduction
Throughout history, humans have always sought to better understand the natural
phenomena that directly impacted their lives. Precipitation is one example. For
eons, the ability to predict rainfall was the holy grail for our ancestors whose
livelihoods were continuously under threat by prolonged droughts and major floods.
Oblivious to the principles of hydrology and out of desperation, some resorted
to human sacrifice¹ in the hope of pleasing the gods and saving their crops.
The enlightenment brought about a drastic change in the way we think about
the phenomena of interest to us, replacing religious and philosophical dogmas
with the tools of scientific reasoning and experimentation. For instance, it was
through careful and repeated experimentation that Galileo discovered the parabolic
nature of projectile motion, as described in his book: “Dialogues concerning two
new sciences” [1]. Galileo’s discovery refuted the long-lasting Aristotelian theory
of linear motion and paved the way for precise calculation of the trajectory of
cannonballs as the most advanced weaponry of his time (see Fig. 1.1). Decades
later, Isaac Newton formalized Galileo’s observations through a set of differential
equations that fully describe the behavior of virtually all moving objects around us,
cannonballs included.
At the dawn of the third decade of the third millennium, we no longer pray
to gods for rain, nor do we keep our fingers crossed during wartime for artillery
shells to hit their intended targets. As a civilization, we are now capable of creating
artificial rain (via cloud seeding) and launching intercontinental ballistic missiles
with pinpoint accuracy. The unsolved problems of today are much more complex
by comparison, examples of which are human maladies such as cancers and
autoimmune disorders that, despite our best efforts, continue to claim the lives of
millions every year. To compare the complexity of these modern problems with
¹ The Aztecs would sacrifice their young children before Tlaloc, the god of water and earthly
fertility. Mayans believed that the rain god Chaac would strike the clouds with his lightning axe to
cause thunder and rain.
those of the past, consider, as an example, the second law of motion in Newtonian
mechanics. This well-known law, expressed commonly as F = m a, states that
the acceleration a of any moving object is influenced by two factors only: the
object’s mass m and the net force F exerted on it. Additionally, the relationship
between acceleration and force happens to be linear, which is the easiest to model
mathematically. Furthermore, this simple linear relationship is universal meaning
that it applies similarly to all moving objects and at all times, regardless of their
location, speed, and other physical attributes.
In contrast, diseases are not single-factor or bi-factor phenomena. A mathemat-
ical model of cancer (if one is ever to be discovered) would likely include tens of
thousands of genetic and environmental variables. In addition, these variables may
not necessarily interact in a conveniently linear fashion, making their mathematical
modeling immensely more difficult. Finally, there is no universality with human
diseases, as they do not always manifest the same way across all afflicted individuals. As a
result of this inherent variance and complexity, traditional mathematical machinery
and deductive reasoning tools employed to solve many classical chemistry and
physics problems in the centuries past cannot adequately address the complex
biology problems of the twenty-first century.²
Luckily for us, we have at our disposal today an extremely valuable commodity
that we can utilize when modeling complex phenomena: data. Unlike our forefa-
thers who did not possess the technology to generate and store large quantities
of data, we live in a world awash in it. Currently, more than two zettabytes (i.e.,
$2 \times 10^{21}$ bytes) of medical data are generated annually across the globe in the form
of electronic health records, high-resolution medical images, bio-signals, genome
sequencing data, and more. These massive amounts of data are easily accessible
through large distributed networks of interconnected servers dubbed “the cloud.”
² The development of the atomic model and the periodic table of elements revolutionized chemistry
in the nineteenth century. In the twentieth century, physics underwent a paradigm shift with the
advent of quantum mechanics. Many believe that the complete mapping of the human genome
coupled with the ongoing information technology revolution promises similar leaps of progress for
biology in the twenty-first century.
The Machine Learning Pipeline
Data Collection
Since machine learning is built on the principle of learning from data, it makes
intuitive sense that collecting data constitutes the first step in the machine learning
pipeline. In our dermatology example, the data to be collected is in the form of
images of skin lesions, each labeled as either benign or malignant by a human
Fig. 1.2 A small classification dataset consisting of four benign lesions (top row) and four
malignant lesions (bottom row). The images shown in this figure were taken from the international
skin imaging collaboration (ISIC) dataset [6]
Feature Design
³ Other factors such as the patient’s lifestyle, over-exposure to UV light, and familial history of
the disease also play a role in guiding the physician toward a cancer diagnosis. Nonetheless, we
build our machine learning system using the lesion’s appearance alone.
Fig. 1.3 Feature space representation of the dataset shown previously in Fig. 1.2. Here the
horizontal and vertical axes represent the symmetry and border shape features, respectively. The
fact that the benign and malignant lesions lie in distinct regions of the feature space reflects a good
choice of features
Malignant lesions typically appear more asymmetric and have
more irregular borders compared to benign nevi, as reflected in the small dataset
shown in Fig. 1.2. While quantifying these qualitative features is not a trivial task,
for the sake of simplicity, suppose we can easily extract the following two features
from each image in our dataset: first, symmetry ranging from perfectly symmetric
to highly asymmetric, and second, border shape ranging from perfectly regular
to highly irregular. With this choice of features, each image can be represented
in a two-dimensional feature space, depicted in Fig. 1.3, by just two numbers: a
number measuring the lesion’s symmetry that determines the horizontal position of
the image and another number capturing the lesion’s border shape that determines
its vertical position in the feature space.
Designing proper features is crucial to the overall success of a classification
system. Quality features allow for the two classes of data to be well-separated in
the feature space, as is the case with our choice of features in Fig. 1.3.
Model Training
Fig. 1.4 Model training involves finding an appropriate line that separates the two classes of
data in the feature space. The linear classifier shown in black provides a computational rule for
distinguishing between benign and malignant lesions. A lesion is classified as benign if its feature
representation lies below the line (in the blue region) and malignant if the feature representation
lies above it (in the yellow region)
Model Testing
The classification model shown in Fig. 1.4 does an excellent job at separating the
feature representations of the benign and malignant lesions, with no data point being
classified incorrectly. This, however, should not give us too much confidence about
the classifier’s efficacy. The real test of a classifier is when it can generalize what
it has learned to new or previously unseen instances of the data. This evaluation is
done via the process of model testing. To test a classification model, we must collect
a new batch of data, called testing data, as shown in Fig. 1.5. Clearly, there should
be no overlap between the testing dataset and the set of data used previously during
training, which from now on we refer to as the training dataset.
The model testing process begins with obtaining the feature representation of
each image in the testing dataset using the previously designed set of features (i.e.,
symmetry and border shape). With both features extracted, we then find the position
of each testing image in the feature space relative to the trained linear classifier. As
illustrated in Fig. 1.6, all four benign lesions fall below the line in the blue region
and are thus classified correctly as benign by the classifier. Similarly, two of the
four malignant lesions fall above the line in the yellow region and, as a result, are
classified correctly as malignant. However, two malignant lesions (namely, the data
points M and N) end up on the wrong side of the line in the blue region and are
therefore misclassified as benign by the classifier.
Fig. 1.5 A testing dataset of benign (top row) and malignant (bottom row) skin lesions. The
training dataset illustrated in Fig. 1.2 and the testing dataset illustrated here must not have any
data point in common. The images shown in this figure were taken from the international skin
imaging collaboration (ISIC) dataset [6]
Fig. 1.6 The feature representations of two of the eight testing data points end up on the wrong
side of the linear classifier. As a result, these two malignant lesions (M and N) will be classified
incorrectly as benign by the model
A Deeper Dive into the Machine Learning Pipeline

Revisiting Data Collection

Data is the fuel that powers machine learning, and as such “the more is always the
merrier” when it comes to data. However, in practice, there are often cost, security,
and patient privacy concerns that can each severely limit data availability. In such
circumstances, proper rationing of the available data between the training and testing
sets becomes important.
Fig. 1.7 The schematic summary of the classification pipeline discussed in Sect. “The Machine
Learning Pipeline”
There is no precise rule for what portion of a given dataset should be set aside
for testing. On one hand, we want the training set to be as large as possible so that
the classifier can learn from a wide array of data samples. On the other hand, a large
and diverse testing set ensures that the trained model can reliably classify previously
unseen instances. As a rule of thumb, between 10% and 30% of the whole data is
typically assigned at random to the testing set. Generally speaking, the percentage
of the original data that may be assigned to the testing set increases as the size of the
data increases. The intuition for this is that when the data is plentiful, the training set
still accurately represents the underlying phenomenon of interest, even after removal
of a relatively large set of testing data. Conversely, with smaller datasets, we usually
take a smaller percentage for testing since the relatively larger training set needs to
retain what little information of the underlying phenomenon was captured by the
original data, and as a result, smaller amounts of data can be spared for testing.
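For readers who wish to experiment, the short Python sketch below shows one way to carry out such a random train/test split using the scikit-learn library. The feature matrix, the labels, and the 20% testing fraction are illustrative assumptions (chosen from within the 10–30% rule of thumb above), not values prescribed by the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 40 lesions, each described by two features
# (symmetry, border shape), labeled benign (0) or malignant (1).
rng = np.random.default_rng(0)
X = rng.random((40, 2))
y = rng.integers(0, 2, size=40)

# Hold out 20% of the data for testing, chosen at random. Stratifying
# on y keeps the benign/malignant proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_train), "training samples,", len(X_test), "testing samples")
```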
Revisiting Feature Design

Fig. 1.8 A color image is made up of three color bands or channels: red (R), green (G), and blue
(B). Every pixel in a color image can therefore be represented as a list of three integers (one for
each channel) with an intensity value ranging from 0 to 255
A 512 × 512 color image is composed of three color channels as illustrated in Fig. 1.8. Multiplying the total number of pixels (i.e., $512^2$)
by the number of color channels per pixel (i.e., 3), we arrive at a number close to
800,000. This would be the dimension of the feature space if we were to use raw
pixel values as features!
Ultra-high-dimensional spaces like this cause an undesired phenomenon called
the curse of dimensionality. To provide an intuitive description of this phenomenon,
we begin with a simple one-dimensional space and work our way up from there
to higher dimensions. Suppose we aim to understand what goes on inside a one-
dimensional space (i.e., a line). We place a series of sensors on this line, each at a
distance of d from its neighboring sensors. Clearly, the smaller the value of d the
larger the number of required sensors and the more fine-grained our understanding
of the space under study. Setting d to a fixed pre-determined value, as shown in the
left panel of Fig. 1.9, we need 3 sensors to cover a line segment of length 2d. Now let
us move up one dimension. As illustrated in the middle panel of Fig. 1.9, in a two-
dimensional space (i.e., a plane), we will need 9 sensors to cover a 2d × 2d area
with the same level of resolution (or granularity). Similarly, in a three-dimensional
space, we will need a total of 27 sensors, as illustrated in the right panel of Fig. 1.9.
Extrapolating this pattern into higher dimensions, $3^N$ sensors are needed in a general
N-dimensional space. In other words, the number of sensors grows exponentially
with the dimension of the space.
Data points are essentially like sensors since they relay to us useful information
about the space they lie in. The larger the number of data points (sensors) the
fuller our understanding of the feature space. The problem is that as the dimension N
of the feature space increases, we need exponentially more data points to perform
classification effectively—something that is not feasible when N is extremely large.
Fig. 1.9 The number of sensors (shown as red dots) that we must place so that each is at a distance
of d from its neighboring sensors grows exponentially with the dimension of the space. This
exponential growth behavior is commonly referred to as the curse of dimensionality
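The sensor-counting argument above is easy to reproduce in a few lines of Python; the dimensions printed below are arbitrary choices for illustration.

```python
# Covering a hypercube of side 2d at a fixed resolution d takes 3
# sensors per axis, hence 3**N sensors in N dimensions: exponential
# growth in N, i.e., the curse of dimensionality.
for N in (1, 2, 3, 10, 100):
    print(f"dimension {N:>3}: {3**N} sensors")
```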
We just saw how “the curse of dimensionality” practically prohibits the use of
raw pixel values as features.⁴ The good news is that, as we will see later in the book,
deep learning allows for the automatic learning of the features from the data. In
fact, in a typical deep learning classification system, the feature design and model
training steps are combined into one step so that both the features and the classifier
are learned jointly from the data.
Revisiting Model Training

Recall that a linear classifier in a two-dimensional feature space is a line defined by three parameters $w_0$, $w_1$, and $w_2$ via the equation

$$w_0 + w_1 x_1 + w_2 x_2 = 0. \tag{1.1}$$
⁴ Even if the extremely large dimension of the feature space were not an issue, by using raw
pixel values as features, we would disregard the valuable information that can be inferred from
the location of each pixel in the image. In Chap. 7, we will study a family of deep learning
models called convolutional neural networks that are specifically designed to leverage the spatial
correlations present in imaging data.
Fig. 1.10 (First panel) The feature space representation of a toy classification dataset consisting
of two classes of data: blue squares and yellow circles. (Second panel) The line defined by
the parameters (w0 , w1 , w2 ) = (16, 1, −8) classifies three yellow circles incorrectly, hence
g(16, 1, −8) = 3. (Third panel) The line defined by the parameters (w0 , w1 , w2 ) = (4, 5, −8)
misclassifies one yellow circle and one blue square, hence g(4, 5, −8) = 2. (Fourth panel) The
line defined by the parameters (w0 , w1 , w2 ) = (−8, 1, 4) classifies only a single blue square
incorrectly, hence g(−8, 1, 4) = 1
Among all possible settings of the parameters $(w_0, w_1, w_2)$, we look for those
resulting in a line that separates the two classes of data as best as
possible. More precisely, we want to set the line parameters so as to minimize the
number of errors or misclassifications made by the classifier. We can express this
idea mathematically by denoting by g(w0 , w1 , w2 ) a function that takes a particular
set of line parameters as input and returns as output the number of classification
errors made by the classifier w0 + w1 x1 + w2 x2 = 0. In Fig. 1.10, we show three
different settings of (w0 , w1 , w2 ) for a toy classification dataset, resulting in three
distinct classifiers and three different values of g.
The function g is commonly referred to as a cost function or error function in the
machine learning terminology. We aim to minimize this function by finding optimal
values for $w_0$, $w_1$, and $w_2$, denoted, respectively, by $w_0^\star$, $w_1^\star$, and $w_2^\star$, such that

$$g(w_0^\star, w_1^\star, w_2^\star) \leq g(w_0, w_1, w_2) \tag{1.2}$$

for all values of $w_0$, $w_1$, and $w_2$. For example, for the toy classification dataset
shown in Fig. 1.10, we have that $w_0^\star = -8$, $w_1^\star = 1$, and $w_2^\star = 4$. This corresponds
to a minimum cost value of $g(w_0^\star, w_1^\star, w_2^\star) = 1$, which is the smallest number of
errors attainable by any linear classifier on this particular set of data. The process of
determining the optimal parameter values for a given cost function is referred to as
mathematical optimization.
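A minimal Python sketch of such a cost function is given below. The toy data points and the sign convention (which side of the line is called which class) are our own assumptions for illustration; they are not the dataset of Fig. 1.10, and exhaustively trying candidate parameters is only a crude stand-in for proper mathematical optimization.

```python
import numpy as np

# Hypothetical toy dataset: rows of X are (x1, x2) feature pairs;
# labels in y are +1 for one class and -1 for the other.
X = np.array([[1.0, 2.0], [2.0, 3.5], [4.0, 1.0], [5.0, 2.5]])
y = np.array([-1, -1, +1, +1])

def g(w, X, y):
    """Cost function: the number of misclassifications made by the
    linear classifier w[0] + w[1]*x1 + w[2]*x2 = 0, where points with
    a positive score are assigned to class +1 (an assumed convention)."""
    scores = w[0] + X @ w[1:]
    predictions = np.where(scores > 0, +1, -1)
    return int(np.sum(predictions != y))

# Evaluate two candidate parameter settings and keep the better one.
candidates = [np.array([-3.0, 1.0, -0.5]), np.array([0.0, -1.0, 1.0])]
print("errors per candidate:", [g(w, X, y) for w in candidates])
print("best parameters:", min(candidates, key=lambda w: g(w, X, y)))
```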
Note that unlike the skin cancer classification dataset shown in Fig. 1.4, the toy
dataset in Fig. 1.10 is not linearly separable, meaning that no linear model can be
found to classify it without error. This type of data is commonplace in practice
and requires more sophisticated nonlinear classification models such as the ones
shown in Fig. 1.11. In the second, third, and fourth panels of this figure, we show
an instance of a polynomial, decision tree, and artificial neural network classifier,
respectively. These are the three most popular families of nonlinear classification
models, with the latter family being the focus of our study in Chaps. 6 and 7.
Fig. 1.11 (First panel) The toy classification dataset shown originally in Fig. 1.10. (Second panel)
A polynomial classifier. (Third panel) A decision tree classifier. (Fourth panel) A neural network
classifier. Each of the nonlinear classifiers shown here is capable of separating the two classes of
data perfectly
Revisiting Model Testing

By evaluating the performance of the linear classifier shown in Fig. 1.6, we can see
that it correctly classifies six of the eight samples in the testing dataset. Dividing
the first number by the second gives a widely used quality metric for classification
called accuracy, defined as

$$\text{accuracy} = \frac{\text{number of correctly classified data points}}{\text{total number of data points}}. \tag{1.3}$$

Based on the definition given above, this metric always ranges between 0 and 1,
with larger values being more desirable. In our example, accuracy $= \frac{6}{8} = 0.75$.
While accuracy does provide a useful metric for the overall performance of a
classifier, it does not distinguish between the misclassification of a benign lesion as
malignant (type I error) and the misclassification of a malignant lesion as benign
(type II error). Since this distinction is particularly important in the context of
medicine, two additional metrics are often used to report classification results.
Denoting the malignant class as positive (for cancer) and the benign class as
negative, the two metrics of sensitivity and specificity are defined, respectively, as

$$\text{sensitivity} = \frac{\text{number of correctly classified positives}}{\text{total number of positives}}, \qquad \text{specificity} = \frac{\text{number of correctly classified negatives}}{\text{total number of negatives}}. \tag{1.4}$$
Fig. 1.12 A confusion matrix illustrated. Here a is the number of positive data points classified
correctly as positive, b is the number of positive data points classified incorrectly as negative, c
is the number of negative data points classified incorrectly as positive, and d is the number of
negative data points classified correctly as negative
$$\text{accuracy} = \frac{a+d}{a+b+c+d}, \qquad \text{sensitivity} = \frac{a}{a+b}, \qquad \text{specificity} = \frac{d}{c+d}. \tag{1.5}$$
In addition to the metrics in Eq. (1.5), a number of other classification metrics can be
calculated using the confusion matrix, among which balanced accuracy, precision,
and F-score (as defined below) are more frequently used in the literature.
$$\text{balanced accuracy} = \frac{\frac{a}{a+b} + \frac{d}{c+d}}{2}, \qquad \text{precision} = \frac{a}{a+c}, \qquad \text{F-score} = \frac{2a}{2a+b+c}. \tag{1.6}$$
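Once the four confusion-matrix counts are known, all six metrics above follow from simple arithmetic, as the minimal Python sketch below illustrates. The counts recreate the testing outcome of Fig. 1.6: two of the four malignant lesions are caught (a = 2, b = 2) and all four benign lesions are classified correctly (c = 0, d = 4).

```python
def classification_metrics(a, b, c, d):
    """Metrics of Eqs. (1.5) and (1.6) from confusion-matrix counts:
    a = true positives, b = false negatives, c = false positives,
    d = true negatives (the notation of Fig. 1.12)."""
    sensitivity = a / (a + b)
    specificity = d / (c + d)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced accuracy": (sensitivity + specificity) / 2,
        "precision": a / (a + c),
        "F-score": 2 * a / (2 * a + b + c),
    }

# Testing outcome of Fig. 1.6: accuracy 0.75, sensitivity 0.5,
# specificity 1.0, balanced accuracy 0.75, precision 1.0, F-score 2/3.
print(classification_metrics(a=2, b=2, c=0, d=4))
```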
The Machine Learning Taxonomy

In Sects. “The Machine Learning Pipeline” and “A Deeper Dive into the Machine
Learning Pipeline”, we motivated the introduction of the machine learning pipeline
using image-based diagnosis of skin cancer (see, e.g., [5]). This is just one of
many diagnostic tasks where machine learning has achieved human-level accuracy.
Other examples include diagnosis of diabetic retinopathy using retinal fundus
photographs (see e.g., [7]), diagnosis of breast cancer using mammograms (see
e.g., [8]), diagnosis of lung cancer using chest computed tomography (CT) images
(see e.g., [9]), diagnosis of bladder cancer using cystoscopy images (see e.g., [10]),
and many more (Fig. 1.13).
Fig. 1.14 (Left panel) A sample training dataset for the task of brain tumor localization. (Right
panel) To determine if any tumors are present in a given brain MRI, a small window is scanned
across it from top to bottom. If the image content inside the window is deemed malignant by a
trained classifier, its location will be marked by a bounding box. The images used to create this
figure were taken from the brain tumor image segmentation (BRATS) dataset [11]
Fig. 1.15 (Middle panel) A whole-slide image of breast tissue. In addition to normal regions that
make up the vast majority of the slides, there are multiple benign, in situ carcinoma, and invasive
carcinoma regions in this image that are highlighted in green, yellow, and blue, respectively. (Side
panels) Four 800 × 800 patches, each representing one of the four classes in the data, are blown
up in the side panels for better visualization. The data used to create this image was taken from the
breast cancer histology (BACH) dataset [13]
Continuous variables such as blood pressure, body mass index (BMI), and age
are intrinsically different from their discrete counterparts such as blood
type or tumor stage that always take on a few pre-determined values: {O, A, B,
AB} for blood type, and {I, II, III, IV} for tumor stage. In the nomenclature
of machine learning, the task of predicting a continuous output from input data
is called regression. It is interesting, and somewhat surprising, to note that all the
continuous variables mentioned here (i.e., blood pressure, BMI, and age) can be
predicted to varying degrees of accuracy [12] using retinal scans (like the one shown
in Fig. 1.13).
In all problem instances we have seen so far, there is always a discrete-valued (in
the case of classification) or continuous-valued (in the case of regression) output that
we wish to predict using input data. A machine learning classifier or regressor tries
to learn this input/output relationship using data labeled by a human supervisor.
Because of this reliance on labeled data, both classification and regression are
considered supervised learning schemes. Another category of machine learning
problems called unsupervised learning deals with learning from the input data
alone. In what follows, we briefly introduce two fundamental problems in this
category: clustering and dimension reduction.
The objective of clustering is to identify groups or clusters of input data points
that are similar to each other. For example, in the left panel of Fig. 1.16, we show a
gene expression microarray that is a two-dimensional matrix with 33 rows (patient
samples) and 20 columns (genes). Each square on this 33 × 20 grid represents
the color-coded expression level of a particular gene in a tissue sample collected
from a patient with leukemia. In the right panel of Fig. 1.16, we show the results of
clustering the rows and columns of this microarray data. As can be seen, the genes
across the columns of the microarray form two equally sized clusters. Similarly, the
patients across the rows of the microarray are clustered into two groups of sizes 22
and 11, respectively. Automatic identification of such gene and patient clusters can
lead to the discovery of new gene targets for drug therapy (see e.g., [14, 15]).
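A minimal Python sketch of this kind of clustering is given below, using the k-means algorithm from scikit-learn. The random matrix is merely a stand-in for real expression values (it only copies the 33 × 20 shape of Fig. 1.16), and k-means is just one of many clustering algorithms that could be applied here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for a gene expression microarray:
# 33 patient rows by 20 gene columns.
rng = np.random.default_rng(1)
expression = rng.random((33, 20))

# Cluster patients (rows) into two groups, then genes (columns) into
# two groups by clustering the transposed matrix. No labels are used
# anywhere: this is unsupervised learning.
patients = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expression)
genes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expression.T)

print("patients per cluster:", np.bincount(patients))
print("genes per cluster:", np.bincount(genes))
```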
Modern-day medical datasets can be extremely high-dimensional. This is both
a blessing and a curse. High-resolution pathology and radiology scans allow
physicians to see minute details that could lead to early diagnosis of disease.
Similarly, large-scale RNA sequencing datasets give researchers the ability to detect
genetic variations at the level of the nucleotide. However, this level of resolution
comes at a price. Recall from our discussion of the curse of dimensionality in
Sect. “A Deeper Dive into the Machine Learning Pipeline” that as the dimension
of the data grows, we need exponentially larger datasets (in terms of the number of
data points). When acquiring such large amounts of data is not feasible, reducing
the dimension of data—if possible—will be crucial for training effective models.
Geometrically speaking, in order to reduce the dimension of a dataset, we must
find a lower-dimensional representation (sometimes called a manifold) for the data
points in their original high-dimensional space. This general idea is illustrated in
Fig. 1.17 using an overly simplistic two-dimensional dataset consisting of eight
data points. As depicted in the left panel of the figure, each data point in this two-
dimensional space is generally represented by two numbers—a and b—indicating
its horizontal and vertical coordinates. However, if the data points happen to lie on
Fig. 1.16 (Left panel) A gene expression microarray of 20 genes (across columns) and 33 patients
with leukemia (across rows). (Right panel) Clustering of this data reveals two groups of similar
patients and two groups of similar genes. The data used to create this figure was taken from [16]
a circular manifold and we are able to uncover it (as shown in the right panel of
Fig. 1.17), each data point can then be represented using one number only: the angle
θ between the horizontal axis and the line segment connecting the data point to the
origin. This brings the dimension of the original dataset down from two to one.
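The circular-manifold example can be made concrete with a few lines of NumPy; the eight points below are placed on a unit circle by construction, which is precisely the assumption that makes the one-number encoding possible.

```python
import numpy as np

# Eight hypothetical data points lying on a circle of radius 1, each
# initially represented by two coordinates (a, b).
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
points = np.column_stack((np.cos(angles), np.sin(angles)))

# Dimension reduction from two numbers to one: once the circular
# manifold is known, the angle theta encodes each point's location.
theta = np.arctan2(points[:, 1], points[:, 0])

# The original coordinates are recoverable from theta alone.
reconstructed = np.column_stack((np.cos(theta), np.sin(theta)))
print(np.allclose(points, reconstructed))  # True
```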
Dimension reduction techniques operate under the assumption that the original
high-dimensional data lies (approximately) in a lower-dimensional subspace and
that we can discover this subspace relatively easily. Sometimes this assumption does
not hold. In such cases, we can take an alternative approach to combat the curse of
dimensionality. Rather than reducing the dimension of data, we increase the size
of data by creating synthetic data points. Figure 1.18 shows a set of real images of
skin lesions along with a batch of synthetic but realistic-looking lesions generated
to augment the size of the original data. It is worth noting that the number of
Fig. 1.17 (Left panel) A toy two-dimensional dataset with eight data points shown in red. Each
data point is represented by a pair of numbers a and b indicating its location relative to the
horizontal and vertical axes. (Right panel) Because this particular set of data happens to lie on
a circular manifold, the location of each data point can be encoded using the angle θ alone. Note
that the discovery of this circular manifold was key in reducing the dimension of this dataset from
two to one
Fig. 1.18 (Left panel) A collection of real skin lesions taken from the international skin imaging
collaboration (ISIC) dataset [6]. (Right panel) A collection of fake lesions generated using machine
learning [18]
diagnosing COVID-19 using transfer learning (see, e.g., [19]). We discuss transfer
learning in more detail in Chap. 7.
In every machine learning problem we have seen so far, a model is trained to
make a single decision for which it receives an immediate reward. For example,
presented with an image of a skin lesion, a skin cancer classifier has only one
decision to make: is the lesion benign or malignant? If the classifier answers
this question correctly, its accuracy score will improve as a result. Reinforcement
learning extends this general framework to more complex scenarios where a
computer agent is trained to make a sequence of decisions in pursuit of a long-term
goal. To better understand this distinction, consider the game of chess: a computer
trained to play chess must make a series of decisions—in the form of chess-piece
moves—with the long-term goal of checkmating its opponent. Each decision is
called an action in the context of reinforcement learning. “Moving the queen up two
squares” is an example of an action. Note, however, that depending on the state of
the chessboard, this action may or may not be allowed. For example, if an enemy
piece is on the square right above the queen, she must eliminate it first before being
able to move to her desired location. In the parlance of reinforcement learning, a
state is a variable that communicates characteristic information about the problem
environment (e.g., the location of each piece on the board) to the computer agent.
Reinforcement learning problems are inherently dynamic because every action taken
by the agent changes the state of the environment.
In medicine, reinforcement learning has been applied to devising treatment
policies in diabetes [20], cancer [21], and sepsis [22]. For example, to achieve
the long-term goal of full recovery from sepsis in an intensive care unit, a computer
agent can learn to take appropriate actions depending on the patient’s state. The
action space of this problem includes administering antibiotics, administering
intravenous fluids, placing the patient on mechanical ventilation, etc. Taking each of
these actions leads to a change (for better or worse) in the patient’s state captured via
their vital measurements and lab tests (see, e.g., [23]). We will study reinforcement
learning in Chap. 8.
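To make the state/action vocabulary concrete, here is a deliberately toy Python sketch of an agent-environment loop. The severity score, action list, and transition dynamics are entirely made up for illustration and bear no relation to any real treatment policy; a real reinforcement learning agent would learn which action to take in each state rather than choosing at random.

```python
import random

# Toy action space loosely echoing the sepsis example above.
actions = ["antibiotics", "iv_fluids", "mechanical_ventilation"]

def transition(state, action):
    """Hypothetical environment: each action nudges the patient's
    state (a single made-up severity score) for better or worse."""
    return max(0.0, state + random.uniform(-1.0, 0.5))

state = 5.0  # initial severity (fabricated)
for step in range(3):
    action = random.choice(actions)  # a trained agent would choose wisely
    state = transition(state, action)
    print(f"step {step}: took {action!r}, new state {state:.2f}")
```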
Problems
Fig. 1.19 Figure associated with Exercises 1.2 and 1.3. See text for details
(b) Show that the F-score always lies in between sensitivity and precision, but never
exceeds their average.
1.7 Further Machine Learning and Deep Learning Applications
Based on the description provided below, determine what type of machine
learning problem is solved in each case.
(a) Spampinato et al. [24] developed a machine learning system to predict
skeletal bone age using hand X-ray images. Skeletal age assessment is a
radiological procedure for determining bone age in children with growth and
endocrine disorders. Ideally, the patient’s skeletal age should be identical to their
chronological age. Is this application an instance of binary classification, multi-
class classification, regression, clustering, or dimension reduction? Explain.
(b) Miotto et al. [25] developed a deep learning system to derive a compact rep-
resentation of patients’ electronic health records (EHRs). The results obtained
using this representation—dubbed “deep patient” by the authors—in a number
of disease prediction tasks were better than those obtained by using the raw
EHR data. Is this application an instance of binary classification, multi-class
classification, regression, clustering, or dimension reduction? Explain.
(c) Hannun et al. [26] developed a deep learning system to detect multiple types
of cardiac rhythms including the sinus rhythm and 10 different patterns of
arrhythmia in single-lead electrocardiograms (ECGs), with the average F-score
for their model exceeding that of the average cardiologist. Is this application an
instance of binary classification, multi-class classification, regression, cluster-
ing, or dimension reduction? Explain.
(d) Razavian et al. [27] developed a deep learning system for patient risk stratifica-
tion. Using lab results as input, their model was effective in predicting whether
a patient would be diagnosed with a specific condition 3–15 months into the
future from the time of prediction. Is this application an instance of binary
classification, multi-class classification, regression, clustering, or dimension
reduction? Explain.
(e) Tian et al. [28] developed a deep learning system to group individual cells
together on the basis of transcriptome similarity using single-cell RNA sequenc-
ing (scRNA-seq) data. This type of data allows for fine-grained comparison of
the transcriptomes at the level of the cell. Is this application an instance of binary
classification, multi-class classification, regression, clustering, or dimension
reduction? Explain.
References
1. Galilei G, Crew H, Salvio AD. Dialogues concerning two new sciences. New York: McGraw-
Hill; 1963
2. Weizenbaum J. ELIZA: a computer program for the study of natural language communication
between man and machine. Commun ACM. 1966;9(1):36–45
27. Razavian N, Marcus J, Sontag D. Multi-task prediction of disease onsets from longitudinal lab
tests; 2016. arXiv preprint arXiv:1608.00647v3
28. Tian T, Wan J, Song Q, et al. Clustering single-cell RNA-seq data with a model-based deep
learning approach. Nat Mach Intell. 2019;1:191–8
Chapter 2
Mathematical Encoding of Medical Data
Numerical Data
Consider the following toy dataset consisting of 5 patients’ systolic blood pressure
values measured (in millimeters of mercury or mmHg) at the time of admission to
the hospital
patient 1: 124,
patient 2: 227,
patient 3: 105,
patient 4: 160,
patient 5: 202. (2.1)
In mathematics, data like this is typically stored in, and represented by, an object
called a vector that is simply an ordered listing of numbers

$$\mathbf{x} = [124 \;\; 227 \;\; 105 \;\; 160 \;\; 202]. \tag{2.2}$$
Throughout the book, we represent vectors by a bold lowercase (often Roman) letter
such as x in order to distinguish them from scalar values that are typically denoted
by non-bold Roman or Greek letters such as x or α.
When the elements or entries inside a vector are listed out horizontally (or in a
row) as in (2.2), we call the resulting vector a row vector. Alternatively, the vector’s
entries can be listed vertically (or in a column), in which case we refer to the
resulting vector as a column vector. We can always swap back and forth between the
row and column versions of the vector via a vector operation called transposition.
Notationally, transposition is denoted by the letter T placed just to the right and
above a vector that turns a row vector into a column vector and vice versa, e.g.,
$$\mathbf{x}^T = [124 \;\; 227 \;\; 105 \;\; 160 \;\; 202]^T = \begin{bmatrix} 124 \\ 227 \\ 105 \\ 160 \\ 202 \end{bmatrix}. \tag{2.3}$$
Consider two generic N-dimensional column vectors

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}. \tag{2.4}$$

The addition of x and y is performed element-wise as

$$\mathbf{x} + \mathbf{y} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_N + y_N \end{bmatrix}, \tag{2.5}$$

and likewise for their subtraction

$$\mathbf{x} - \mathbf{y} = \begin{bmatrix} x_1 - y_1 \\ x_2 - y_2 \\ \vdots \\ x_N - y_N \end{bmatrix}. \tag{2.6}$$
Aside from the rudimentary operations of addition and subtraction, the two vectors
x and y in (2.4) can also be multiplied together in a number of ways, one of which
called the inner-product is of particular interest to us in this book. Also referred to
as the dot-product, the inner-product of x and y produces a scalar output that is the
sum of the pair-wise multiplication of the corresponding entries in x and y. Denoted
by xT y, the inner-product of x and y can be written as
$$\mathbf{x}^T \mathbf{y} = [x_1 \;\, x_2 \;\, \cdots \;\, x_N] \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = x_1 y_1 + x_2 y_2 + \cdots + x_N y_N = \sum_{n=1}^{N} x_n y_n. \tag{2.7}$$
When a vector x is multiplied by a scalar α, the resulting vector will have all its
entries scaled by α
$$\alpha\, \mathbf{x} = \begin{bmatrix} \alpha\, x_1 \\ \alpha\, x_2 \\ \vdots \\ \alpha\, x_N \end{bmatrix}. \tag{2.8}$$
Because our senses have evolved in a world with three physical dimensions, we
can understand one-, two-, and three-dimensional vectors intuitively. For instance,
as illustrated in the left panel of Fig. 2.1, we can visualize two-dimensional vectors
as arrows stemming from the origin in a two-dimensional plane. Addition of two
vectors as well as vector–scalar multiplication is also easy to interpret geometrically
as shown via examples in the middle and right panels of Fig. 2.1, respectively.
Thinking of vectors as arrows helps us define the norm (or magnitude) of a vector
as the length of the arrow representing it. For a general two-dimensional vector

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \tag{2.9}$$

the norm, denoted by $\|\mathbf{x}\|$, equals the length of its arrow, which by the Pythagorean theorem is

$$\|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2}. \tag{2.10}$$

This definition extends naturally to a general N-dimensional vector as¹

$$\|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_N^2}. \tag{2.11}$$
Fig. 2.1 (Left panel) Vectors x = [3 3] and y = [−2 1] drawn as arrows starting from the origin
and ending at points whose horizontal and vertical coordinates are stored in x and y, respectively.
(Middle panel) The addition of x and y is equal to the vector connecting the origin to the opposite
corner of the parallelogram that has x and y as its sides. (Right panel) When multiplied by a scalar α,
the resulting vector will remain parallel to the original vector, but its length will change depending
on the magnitude of α. Note that when α is negative, the resulting vector will point in the opposite
direction of the original vector
So far we have pictured vectors as arrows stemming from the origin. This is the
conventional way vectors are usually depicted in any standard mathematics or linear
algebra text. However in the context of deep learning and as illustrated in Fig. 2.2,
it is often more visually helpful to draw vectors not as a set of arrows but as a
scattering of dots encoding the location of each arrow’s endpoint or spike.
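For readers who wish to experiment, the vector operations introduced in this section map directly onto the NumPy library, as the short sketch below illustrates. The second vector y is made up for illustration.

```python
import numpy as np

# The blood pressure vector from (2.2) and a hypothetical second vector.
x = np.array([124, 227, 105, 160, 202])
y = np.array([120, 210, 110, 150, 190])

print(x + y)              # element-wise addition, as in (2.5)
print(x - y)              # element-wise subtraction, as in (2.6)
print(x @ y)              # inner-product, as in (2.7)
print(3 * x)              # vector-scalar multiplication, as in (2.8)
print(np.linalg.norm(x))  # norm, as in (2.11)
```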
Categorical Data
Mathematical functions used in the context of deep learning require inputs that
are strictly numerical or quantitative, as was the case with systolic blood pressure
mentioned in the previous section. However, medical data does not always come
prepackaged in this manner. Sometimes medical variables of interest are categorical
in nature. For instance, the type of COVID-19 vaccine an individual receives in the
United States does not take on a numerical value, but instead belongs to one of the
following categories: Moderna, Pfizer-BioNTech (or Pfizer for short), and Johnson
¹ It should be noted that (2.11) represents the most common but only one of many ways in which
the norm of a vector can be defined.
& Johnson’s Janssen (or J&J for short). Such categories need to be translated into
numerical values before they can be used by deep learning algorithms.
It is certainly possible to represent each category with a distinct number, for
example, by assigning 1 to Moderna, 2 to Pfizer, and 3 to J&J. However as illustrated
in the left panel of Fig. 2.3, by doing so, we have made the implicit assumption that
the Pfizer vaccine (encoded with a value of 2) is closer or more “similar” to the
J&J vaccine (encoded with a value of 3) than the Moderna vaccine (encoded with a
value of 1). This assumption may or may not be true in reality. In general, it is best to
avoid making such assumptions that could alter the problem’s geometry, especially
when we lack the intuition or knowledge necessary for ascertaining similarity or
dissimilarity between different categories in the data.
Fig. 2.3 Encoding of a categorical variable (i.e., the type of the COVID-19 vaccine administered
in the United States) via a single number (left panel) and via one-hot encoding (right panel). See
text for further details
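A minimal Python sketch of one-hot encoding for the vaccine example is given below: each category becomes a distinct standard basis vector, so no category is numerically “closer” to another.

```python
import numpy as np

categories = ["Moderna", "Pfizer", "J&J"]

def one_hot(category):
    """Return the standard basis vector encoding the given category."""
    vector = np.zeros(len(categories))
    vector[categories.index(category)] = 1.0
    return vector

for c in categories:
    print(f"{c:>8} -> {one_hot(c)}")
# Moderna -> [1. 0. 0.], Pfizer -> [0. 1. 0.], J&J -> [0. 0. 1.]
```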
Imaging Data
Digital images are ubiquitous today and come in a wide variety of sizes, colors,
and formats. We start with the most basic digital image, called a black and white or
binary image, which can be represented as a two-dimensional array of bits (i.e., 0s
and 1s) as illustrated in Fig. 2.4. Each cell in the array is called a pixel and is either
black (if the pixel value is 0) or white (if the pixel value is 1).
In general, binary images are extremely limited in the amount of information
they convey. This is because each pixel in a binary image can only hold one bit
of information. To remedy this limitation, grayscale images allow every pixel to
hold up to 8 bits (or 1 byte) of information. As a result, grayscale images can be
composed of $2^8 = 256$ different shades of gray as illustrated in Fig. 2.5.
In addition to the number of bits used per pixel, the image resolution (i.e.,
the total number of pixels in the image) is also determinative of the amount of
information held in an image. Intuitively, the higher the image resolution the more
visual information can be stored in the image (see Fig. 2.6).
Given their two-dimensional nature, it is clear that we need mathematical objects
other than vectors to store grayscale images. Matrices happen to be the ideal data
structure for this purpose. A matrix X with N rows and M columns takes the general form

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}. \tag{2.12}$$
Fig. 2.4 A black and white image of the letter “T.” Conventionally, black and white colors are
assigned the values of 0 and 1, respectively
Fig. 2.5 (Left panel) Grayscale images are composed of 256 shades of gray, wherein each pixel
takes an integer value in the range of 0 to 255, with 0 representing the smallest light intensity
(black) and 255 representing the largest intensity (white). (Right panel) (I) CT-scans, (II) MRIs,
(III) X-rays, (IV) echocardiograms, and many other imaging modalities used in modern medicine
are grayscale images
Fig. 2.6 Grayscale image of four alphabet letters visualized at different resolutions
As with vectors, a matrix can be transposed. Transposition flips the whole matrix around so that the ith row in X becomes the ith column in
$\mathbf{X}^T$, and the jth column in X becomes the jth row in $\mathbf{X}^T$.
As with vectors, addition and subtraction of matrices that have the same
dimensions can be done element-wise, mirroring Equations (2.5) and (2.6). While
the inner-product (or dot-product) is not defined for matrices, a common form of
matrix multiplication is built upon the inner-product concept. To be able to multiply
matrices X and Y, the number of columns in the first matrix must match the number
of rows in the second matrix. Assuming X and Y are M × N and N × P, respectively,
the product of X and Y (denoted as XY) will be an M × P matrix whose (i, j)th
entry is the inner-product of the ith row of X and the jth column of Y. A simple
example of matrix multiplication is shown in (2.14).
\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}
\begin{bmatrix} u & v & w \\ x & y & z \end{bmatrix}
=
\begin{bmatrix} au + bx & av + by & aw + bz \\ cu + dx & cv + dy & cw + dz \\ eu + fx & ev + fy & ew + fz \end{bmatrix}.  (2.14)
Just as with vectors, we can also define the norm of a matrix as a number
representing its overall size. Recall from (2.11) that the norm of a vector is defined
as the square root of the sum of the squares of its entries. The matrix norm is defined
similarly as the square root of the sum of the squares of all the matrix entries, which
can be written for the matrix X in (2.12) as
\|X\| = \sqrt{\sum_{n=1}^{N} \sum_{m=1}^{M} x_{nm}^2}.  (2.15)
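The sketch below (ours, with made-up matrices) demonstrates the matrix operations discussed above—transposition, multiplication, and the Frobenius norm—using NumPy:

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # a 3 x 2 matrix
Y = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])   # a 2 x 3 matrix

print(X.T.shape)          # transpose: a 2 x 3 matrix
print(X @ Y)              # matrix product, as in (2.14): a 3 x 3 matrix
print(np.linalg.norm(X))  # the Frobenius norm of (2.15)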
As illustrated in Fig. 2.7, several medical imaging modalities produce color images
that are different from grayscale images in terms of appearance and structure.
Examples of color images used in the clinic include dermoscopy, ophthalmoscopy,
cystoscopy, and colonoscopy images, to name just a few. A common way to create
the perception of color is through superimposing the primary colors of red, green,
and blue. In this color system, often referred to as the RGB system, the level or
intensity of each of the three primary colors determines the final color resulting from
their combination. Using the RGB system and assuming 256 levels of intensity for
each of the primary colors, we can define 256 × 256 × 256 = 16,777,216 unique
colors, each represented by a triple in the form of [r g b], where 0 ≤ r, g, b ≤ 255
are integers (see the left panel of Fig. 2.7).
Since pixel values in an RGB image are no longer scalars, the matrix data
structure shown in (2.12) is inadequate to support color images. To store such
images, we need a new data structure called a tensor, which is the natural extension
of the two-dimensional matrix to three dimensions, just as the matrix itself was the
natural extension of the one-dimensional vector to two dimensions.
Fig. 2.7 (Left panel) The RGB color space. Each color in this space is represented as a triple of
integers [r g b] where 0 ≤ r, g, b ≤ 255. (Right panel) (I) Colonoscopy images, (II) histology
slides, (III) dermoscopy images, and (IV) ophthalmoscopy images are all examples of color
imaging modalities
2 The matrix norm defined in (2.15), often referred to as the Frobenius norm, is the most common
but only one of many ways in which the norm of a matrix may be defined.
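A color image can thus be sketched in code as a three-dimensional array with one matrix per color channel. The toy example below is ours, not from the original text:

import numpy as np

height, width = 4, 6
image = np.zeros((height, width, 3), dtype=np.uint8)  # rows x cols x channels

image[0, 0] = [255, 0, 0]    # top-left pixel: pure red [r g b]
image[0, 1] = [255, 255, 0]  # next pixel: red + green, i.e., yellow

print(image.shape)  # (4, 6, 3)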
Time-Series Data
When its measurements are taken at equally spaced points in time, a time-series can
be stored compactly as an ordered sequence of values
x_1, x_2, \ldots, x_N.  (2.17)
Fig. 2.8 (Top-left panel) A toy time-series dataset with N = 5 time–value pairs. (Top-right panel)
Linear interpolation of the same dataset, done by connecting consecutive points with a line, makes
it easier for the human eye to follow the ebb and flow in the data. (Bottom panel) Daily stock prices
are a common instance of time-series data. Here, the daily stock prices of two movie distribution
companies are plotted over a seven-year period from 2003 through 2010
As long as the initial time-stamp (t1 ) and the temporal resolution (t2 −t1 ) are known,
we can always use the compact representation above to revert back to the fuller
representation in (2.16) if needed.
The most common time-series data encountered in medicine are electrical bio-
signals such as electrocardiograms and electroencephalograms, as well as data
generated from wearable technologies that can track different physical aspects of
an individual’s daily routine and movement over time.
An electrocardiogram (ECG) is inherently a time-series where the quantity
measured over time is the electrical activity of the heart that conveys valuable
information about the heart’s structure and function. In the most conventional form
of ECG, 10 electrodes are placed over different parts of the body as depicted in
the top panel of Fig. 2.9. These electrodes, six placed over the chest and four on
the limbs, measure small voltage changes induced throughout the body during each
heartbeat. In the bottom panel of Fig. 2.9, we show the prototypical output of one of
these electrodes in a healthy individual. Each unit in the horizontal direction (i.e.,
Fig. 2.9 (Top panel) The approximate position of the ECG electrodes on the body. (Bottom panel)
The graph of voltage versus time for one cardiac cycle (heartbeat)
time) represents 40 milliseconds (ms), whereas each unit in the vertical direction
(i.e., voltage) represents 0.1 millivolts (mV). As can be seen in the figure, the voltage
pattern associated with a normal heartbeat consists of three distinct components:
the P wave, the QRS complex, and the T wave, each representing the changes in
electrical activity of the heart muscle that happen as a result of contraction and
relaxation of its chambers. Deviations from the normal P-QRS-T patterns can be
used as a basis to diagnose a host of medical conditions including myocardial
ischemia/infarction, arrhythmias, myocardial hypertrophy, myocarditis, pericarditis,
pericardial effusion, valvular diseases, and certain electrolyte imbalances, to name
a few.
The readings from three of the limb electrodes shown in Fig. 2.9 (namely, RA,
LA, and LL) are linearly combined to produce six lead signals (namely, I, II, III,
aVR, aVL, and aVF) defined, respectively, as
\begin{aligned}
\text{I} &= \text{LA} - \text{RA}, \\
\text{II} &= \text{LL} - \text{RA}, \\
\text{III} &= \text{LL} - \text{LA}, \\
\text{aVR} &= \text{RA} - \frac{\text{LA} + \text{LL}}{2}, \\
\text{aVL} &= \text{LA} - \frac{\text{RA} + \text{LL}}{2}, \\
\text{aVF} &= \text{LL} - \frac{\text{RA} + \text{LA}}{2}.
\end{aligned}  (2.18)
Using vector and matrix notation, we can write the system of equations in (2.18)
more compactly as
\begin{bmatrix} \text{I} \\ \text{II} \\ \text{III} \\ \text{aVR} \\ \text{aVL} \\ \text{aVF} \end{bmatrix}
=
\begin{bmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \\ 1 & -0.5 & -0.5 \\ -0.5 & 1 & -0.5 \\ -0.5 & -0.5 & 1 \end{bmatrix}
\begin{bmatrix} \text{RA} \\ \text{LA} \\ \text{LL} \end{bmatrix}.  (2.19)
The six lead signals defined in (2.19) are usually stitched together along with an
additional set of lead signals read directly from the six chest electrodes to produce
a so-called 12 lead ECG, an instance of which is shown in Fig. 2.10.
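Because (2.19) is a plain matrix–vector product, the six limb leads can be computed in a single line of code. The sketch below is ours; the electrode voltages are made-up numbers for illustration only:

import numpy as np

A = np.array([[-1.0,  1.0,  0.0],    # I
              [-1.0,  0.0,  1.0],    # II
              [ 0.0, -1.0,  1.0],    # III
              [ 1.0, -0.5, -0.5],    # aVR
              [-0.5,  1.0, -0.5],    # aVL
              [-0.5, -0.5,  1.0]])   # aVF

electrodes = np.array([0.2, 0.5, 0.9])  # hypothetical [RA, LA, LL] in mV
print(A @ electrodes)  # the six limb-lead signals of (2.19)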
Text Data
Fig. 2.10 A 12 lead ECG is a collection of six lead signals from the chest and six lead signals
from the limbs as defined in (2.19)
Fig. 2.11 Generally speaking, human-generated data (here, written descriptions of a mole by two
dermatologists) have more intrinsic variability than machine-generated data (here, dermoscopy
images of the same mole). See text for further details
and under different lighting conditions), they are much more alike compared to
their corresponding written descriptions. Because humans can express the same
idea or sentiment in a multitude of ways, the processing of natural languages (e.g.,
written text) can be considerably more challenging than that of natural signals
(e.g., images). Hence, when it comes to text documents and files, the raw input
data usually requires a significant amount of preprocessing, normalization, and
transformation.
A bag of words (or BoW for short) is a simple, commonly used, vector-based
normalization and transformation scheme for text documents. In its most basic form,
a BoW vector representation consists of the normalized count of different words
used in a text document with respect to a single corpus or a collection of documents,
excluding those non-distinctive words that do not characterize the document in the
context of the application at hand.
To illustrate this idea, in what follows, we build a BoW representation for a
toy text dataset in (2.22) comprising two progress notes that describe the patients’
clinical status during their stay in the hospital.
Fig. 2.12 Bag of words (BoW) representation of the two text documents shown in (2.22). See text
for details
Notice that both documents contain a number of very common words
such as and, have, of, the, to, and was. These words, typically referred to as
stop words, are so commonly used in the English language that they carry very little
useful information and hence can be removed without severely compromising the
document’s syntax. Additionally, we can reduce each remaining word to its stem
or root form. For example, since the words improve, improved, improving,
improvement, and improvements all have the same common linguistic root,
we can represent them all using the word improve without too much information
loss. These preprocessing steps transform the original dataset displayed in (2.22)
into the one shown in (2.21).
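The short sketch below (ours, not the book's pipeline; the two documents, stop word list, and stem map are illustrative placeholders) assembles a basic BoW representation with stop-word removal, stemming, and count normalization:

import numpy as np

docs = ["the patient was improving and had no fever",
        "the patient had improved oxygen levels"]
stop_words = {"the", "was", "and", "had", "no"}
stems = {"improving": "improve", "improved": "improve"}  # toy stemmer

def tokenize(doc):
    # Drop stop words, then map each remaining word to its root form
    return [stems.get(w, w) for w in doc.split() if w not in stop_words]

vocab = sorted({w for d in docs for w in tokenize(d)})
bow = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in tokenize(d):
        bow[i, vocab.index(w)] += 1
bow /= bow.sum(axis=1, keepdims=True)  # normalize counts per document

print(vocab)
print(bow)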
Because BoW representations ignore word order, two documents composed of the
same words arranged differently would be considered identical even though they
may imply completely opposite meanings. To remedy this problem, another popular
text encoding scheme treats documents like a time-series.
ments like a time-series. Recall from Sect. “Time-Series Data” that time-series data
(when the time increments are all equal) can be represented as an ordered listing
or a sequence of numbers. Text data can similarly be thought of as a sequence of
characters made up of letters, numbers, spaces, and other special characters (e.g.,
“%,” “!,” “@,” etc.). In Table 2.1, we show a subset of alphanumeric characters
along with their ASCII codes. An abbreviation for the American Standard Code
for Information Interchange, ASCII, is a universal character-encoding standard for
electronic communications in which every character is assigned a numerical code
that can be stored in computer memory. For instance, a medication order that reads
GIVE 1 MG QID  (2.23)
which instructs that the patient be given one milligram of a certain drug four times a day,
can be stored in the computer using the following ASCII representation:
[71, 73, 86, 69, 32, 49, 32, 77, 71, 32, 81, 73, 68]. (2.24)
Note, however, that this representation still suffers from the same sort of problem
described in Sect. “Categorical Data” and visualized in Fig. 2.3. That is, since
alphanumeric characters are categorical in nature, representing them using numbers
(e.g., ASCII codes) is sub-optimal for deep learning purposes. Instead, it is best
to employ an encoding scheme such as one-hot encoding, replacing each ASCII
entry in (2.24) with its corresponding one-hot encoded vector from Table 2.1.
Finally, it should be noted that while the representation shown in (2.24) was based
on a character-level parsing of the text in (2.23), similar representations can be
constructed at the word level as well.
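The following sketch (ours) carries out this character-level encoding for the medication order in (2.23), first producing the ASCII codes of (2.24) and then one-hot vectors over the 37 characters of Table 2.1:

import numpy as np

text = "GIVE 1 MG QID"
print([ord(c) for c in text])  # the ASCII codes of (2.24)

# The 37-character alphabet of Table 2.1: space, digits, uppercase letters
alphabet = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
one_hot = np.zeros((len(text), len(alphabet)))
for i, c in enumerate(text):
    one_hot[i, alphabet.index(c)] = 1.0

print(one_hot.shape)  # (13, 37): one one-hot vector per character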
Genomics Data
The genetic information required for the biological functioning and reproduc-
tion of all living organisms is contained within an organic compound called
deoxyribonucleic acid or DNA for short. The DNA molecule is a long chain of
repeating chemical units or bases strung together in the shape of two twisting
strands, as illustrated in Fig. 2.13. Each strand is made of four bases: adenine,
cytosine, guanine, and thymine, which are commonly abbreviated as A, C, G, and
T, respectively.
Structurally, adenine bases in one strand always face thymine bases in the
opposite strand and vice versa. Similarly, cytosine bases in one strand always pair
with guanine bases in the other and vice versa. Because of this redundancy, the
entire DNA structure can be fully characterized using only one of its strands.
From a data structure perspective, the DNA molecule is a very long3 piece of text
written in a language whose alphabet consists of four letters only. We can therefore
treat DNA sequences as text data and apply the transformations discussed previously
in Sect. “Text Data”. For example, the length-9 sequence
AACTGTCAG (2.25)
3 The human DNA is estimated to be composed of more than three billion bases!
Table 2.1 ASCII and one-hot encoded representations of the space character, single-digit
numbers, and uppercase letters of the alphabet
Character ASCII Code One-hot encoded vector
32 [1000000000000000000000000000000000000]
0 48 [0100000000000000000000000000000000000]
1 49 [0010000000000000000000000000000000000]
2 50 [0001000000000000000000000000000000000]
3 51 [0000100000000000000000000000000000000]
4 52 [0000010000000000000000000000000000000]
5 53 [0000001000000000000000000000000000000]
6 54 [0000000100000000000000000000000000000]
7 55 [0000000010000000000000000000000000000]
8 56 [0000000001000000000000000000000000000]
9 57 [0000000000100000000000000000000000000]
A 65 [0000000000010000000000000000000000000]
B 66 [0000000000001000000000000000000000000]
C 67 [0000000000000100000000000000000000000]
D 68 [0000000000000010000000000000000000000]
E 69 [0000000000000001000000000000000000000]
F 70 [0000000000000000100000000000000000000]
G 71 [0000000000000000010000000000000000000]
H 72 [0000000000000000001000000000000000000]
I 73 [0000000000000000000100000000000000000]
J 74 [0000000000000000000010000000000000000]
K 75 [0000000000000000000001000000000000000]
L 76 [0000000000000000000000100000000000000]
M 77 [0000000000000000000000010000000000000]
N 78 [0000000000000000000000001000000000000]
O 79 [0000000000000000000000000100000000000]
P 80 [0000000000000000000000000010000000000]
Q 81 [0000000000000000000000000001000000000]
R 82 [0000000000000000000000000000100000000]
S 83 [0000000000000000000000000000010000000]
T 84 [0000000000000000000000000000001000000]
U 85 [0000000000000000000000000000000100000]
V 86 [0000000000000000000000000000000010000]
W 87 [0000000000000000000000000000000001000]
X 88 [0000000000000000000000000000000000100]
Y 89 [0000000000000000000000000000000000010]
Z 90 [0000000000000000000000000000000000001]
can be one-hot encoded as the matrix
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0
\end{bmatrix},  (2.26)
where each column is a one-hot encoded vector representing one of the four bases
in (2.25), and the rows correspond to the bases A, C, G, and T, respectively.
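In code, producing the one-hot matrix of (2.26) takes only a few lines. The sketch below is ours, not from the original text:

import numpy as np

sequence = "AACTGTCAG"  # the length-9 sequence of (2.25)
bases = "ACGT"          # row order: A, C, G, T

encoding = np.zeros((len(bases), len(sequence)), dtype=int)
for j, base in enumerate(sequence):
    encoding[bases.index(base), j] = 1

print(encoding)  # reproduces the matrix in (2.26)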
Another commonly used modality of genomics data are gene expression microar-
rays discussed briefly in Sect. “The Machine Learning Taxonomy”. Unlike tradi-
tional low-throughput lab techniques such as real-time polymerase chain reaction
(qPCR) and northern blot that can only handle a small number of genes per study,
the microarray technology allows for simultaneous measurement of the expression
levels of thousands of genes across an arbitrarily large population of patients. From
a data structure point of view, this type of data can be stored in a matrix whose rows
and columns represent patients and genes, respectively. The (i, j)th entry of this
matrix is a real number representing the expression level of gene j in patient i. This
technology has been successfully applied to discovering novel disease subtypes (see
Fig. 1.16) and identifying the underlying mechanisms of response to drugs.
Problems
2.1 For each of the following types of medical data, determine a suitable mathematical
object (e.g., vector, matrix, or tensor) in which the data can be encoded:
(a) Demographic information including age, sex, and race for a cohort of patients
(b) A black and white picture of an ECG printout
(c) A high-resolution pathology slide of breast tissue suspected of malignancy
(d) A volumetric CT of the lung
(e) A functional magnetic resonance image (fMRI) that measures the activity in every
volumetric pixel (or voxel) of the brain over a relatively short time window
2.2 Vector Calculations
Supposing
x = \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}, \quad
y = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix}, \quad \text{and} \quad
z = \begin{bmatrix} 0 \\ 0 \\ -3 \end{bmatrix},  (2.27)
Fig. 2.15 Figure associated with Exercise 2.7. See text for details
(b) What is the patient’s average QRS amplitude? The QRS amplitude is measured as
the voltage differential between the peak of the R wave and the peak of the S wave.
Express your answer in millivolts (mV). A QRS amplitude of greater than 4.5 mV
could indicate cardiac hypertrophy.
(c) What is the patient’s average QRS duration? The QRS duration is measured as the
amount of time elapsed between the beginning of the Q wave and the end of the S
wave. Express your answer in milliseconds (ms). A QRS duration of greater than
120 ms could indicate a problem in the electrical conduction system of the heart.
(d) What is the patient’s average T/QRS ratio? The T/QRS ratio is defined as the T
wave amplitude (measured from the baseline) divided by the QRS amplitude. The
T/QRS ratio is useful in differentiating between left ventricular aneurysm and acute
myocardial infarction.
Chapter 3
Elementary Functions and Operations
To get an intuitive sense for what a mathematical function is, it is best to introduce
the concept through a series of simple examples.
Example 3.1 (Historical Revenue of McDonald’s) The table in the left panel
of Fig. 3.1 shows a listing of the annual total revenue of the fast-food
restaurant chain McDonald’s over a period of 12 years from 2005 through
2016. This data consists of two columns: the year column and the revenue
column. When presented with a table like this, we naturally scan across each
row, pairing each year (the input) with its corresponding total revenue (the output).
Fig. 3.1 Figure associated with Example 3.1. The annual total revenue of the fast-food restaurant
chain McDonald’s from 2005 to 2016. See text for further details
Fig. 3.2 Figure associated with Example 3.3. See text for details
Example 3.2 (The McDonald’s Menu) A restaurant menu—like the one from
McDonald’s printed out in Table 3.1—provides a cornucopia of mathematical
functions. Here, we have a dataset of food items, along with a large number
of characteristics for each. Unlike the previous example, we no longer have
a unique and easily identifiable input–output pair. For example, we could
decide to look at the relationship between the food item and its allotment
of calories, or the relationship between the food item and its total fat content.
Notice, in either case (and unlike the previous example), the input is no longer
numerical.
Example 3.3 (Digital Images) It is not always the case that a mathematical
function comes in the form of a labeled table where input and output are
placed neatly in separate columns. Take, for example, a standard grayscale
image of the handwritten digit 0 shown in Fig. 3.2. The left panel displays the
raw image itself, which is small enough that we can actually see all of the
individual pixels that make it up.
Although we may not always think of it as one, a grayscale image is in
fact a mathematical function. Recall from our discussion in Sect. “Imaging
Data” that grayscale images are two-dimensional arrays of numbers (or
matrices). This view of our digit image is plotted in the middle panel of
Fig. 3.2 where we can see each pixel value printed in red on top of its
respective pixel.
As a mathematical function, the grayscale image relates a pair of indices
indicating a specific row and column of the array (the inputs) to a given pixel
intensity value (the output). Therefore, we can write out this function as a
table, as we have done in Table 3.2. Regardless of how we record them, each
input–output pair in a grayscale image is a three-dimensional point. As such,
we can plot any grayscale image as a surface in three-dimensional space. We
do this for the digit image in the right panel of Fig. 3.2, where we can visually
examine how each input relates to its output.
Table 3.1 A subset of food items on the McDonald’s menu along with each item’s dietary
information
Item Calories Fat Sodium Carbohydrates Fiber Sugars Protein
McChicken 360 16 800 40 2 5 14
McRib 500 26 980 44 3 11 22
Big Mac 530 27 960 47 3 9 24
Filet-O-Fish 390 19 590 39 2 5 15
Cinnamon Melts 460 19 370 66 3 32 6
McDouble 380 17 840 34 2 7 22
Hamburger 240 8 480 32 1 6 12
Chicken McNuggets [4] 190 12 360 12 1 0 9
Chicken McNuggets [6] 280 18 540 18 1 0 13
Chicken McNuggets [40] 1880 118 3600 118 6 1 87
Hash Brown 150 9 310 15 2 0 1
Side Salad 20 0 10 4 1 2 1
Bacon Clubhouse Burger 720 40 1470 51 4 14 39
Buffalo Ranch McChicken 360 16 990 40 2 5 14
Daily Double 430 22 760 34 2 7 22
Cheeseburger 290 11 680 33 2 7 15
Fruit & Maple Oatmeal 290 4 160 58 5 32 5
Oatmeal Raisin Cookie 150 6 135 22 1 13 2
Egg McMuffin 300 13 750 31 4 3 17
Apple Slices 15 0 0 4 0 3 0
Small Mocha 340 11 150 49 2 42 10
Large Mocha 500 17 240 72 2 63 16
Bacon McDouble 440 22 1110 35 2 7 27
Jalapeño Double 430 23 1030 35 2 6 22
Baked Apple Pie 250 13 170 32 4 13 2
Crispy Ranch Snack Wrap 360 20 810 32 1 3 15
Grilled Ranch Snack Wrap 280 13 720 25 1 2 16
Sausage Burrito 300 16 790 26 1 2 12
Sausage McMuffin 370 23 780 29 4 2 14
Hot Fudge Sundae 330 9 170 53 1 48 8
Strawberry Sundae 280 6 85 49 0 45 6
Large French Fries 510 24 290 67 5 0 6
Hotcakes 350 9 590 60 3 14 8
Hot Caramel Sundae 340 8 150 60 0 43 7
Sausage McGriddles 420 22 1030 44 2 15 11
Small French Fries 230 11 130 30 2 0 2
Small Latte 170 9 115 15 1 12 9
Large Latte 280 14 180 24 1 20 15
Double Cheeseburger 430 21 1040 35 2 7 24
Steak & Egg McMuffin 430 23 960 31 4 3 26
Hotcakes & Sausage 520 24 930 61 3 14 15
Egg White Delight 250 8 770 30 4 3 18
Chocolate Chip Cookie 160 8 90 21 1 15 2
Quarter Pounder Deluxe 540 27 960 45 3 9 29
Sausage McMuffin + Egg 450 28 860 30 4 2 21
Sausage Biscuit 480 31 1190 39 3 3 11
Big Breakfast 740 48 1560 51 3 3 28
Table 3.2 The grayscale image in Fig. 3.2 represented as a table of input–output pairs
Input Output Input Output Input Output Input Output
(0, 0) 255 (2, 0) 255 (4, 0) 255 (6, 0) 255
(0, 1) 255 (2, 1) 204 (4, 1) 170 (6, 1) 221
(0, 2) 170 (2, 2) 0 (4, 2) 119 (6, 2) 17
(0, 3) 34 (2, 3) 221 (4, 3) 255 (6, 3) 170
(0, 4) 102 (2, 4) 255 (4, 4) 255 (6, 4) 85
(0, 5) 238 (2, 5) 68 (4, 5) 102 (6, 5) 51
(0, 6) 255 (2, 6) 119 (4, 6) 119 (6, 6) 255
(0, 7) 255 (2, 7) 255 (4, 7) 255 (6, 7) 255
(1, 0) 255 (3, 0) 255 (5, 0) 255 (7, 0) 255
(1, 1) 255 (3, 1) 187 (5, 1) 187 (7, 1) 255
(1, 2) 34 (3, 2) 51 (5, 2) 68 (7, 2) 153
(1, 3) 0 (3, 3) 255 (5, 3) 255 (7, 3) 34
(1, 4) 85 (3, 4) 255 (5, 4) 238 (7, 4) 85
(1, 5) 0 (3, 5) 119 (5, 5) 51 (7, 5) 255
(1, 6) 170 (3, 6) 119 (5, 6) 136 (7, 6) 255
(1, 7) 255 (3, 7) 255 (5, 7) 255 (7, 7) 255
From what we have seen so far in the chapter, we can summarize a mathematical
function as a rule that relates inputs to outputs. In a dataset, like the ones
encountered in the previous examples, this rule is explicit: it literally is the data
itself. Sometimes (as with Example 3.2) we have to pluck out a mathematical
function from a sea of choices, and sometimes (as with Example 3.3) the input–
output relationship may not be clear at first sight. Nonetheless, mathematical
functions abound. This omnipresence motivates the use of mathematical notation
that allows us to more freely discuss functions at a higher level, categorize them by
certain shared attributes, and build multi-use tools based on very general principles.
To denote the mathematical function relating an input x to an output y, we use
the notation y(x). For instance, in Example 3.1, we saw that the total revenue of
the McDonald's corporation in year 2005 was 19.12 billion dollars, and we can
therefore write y(2005) = 19.12. Similarly, in Example 3.3, we saw that the pixel
intensity value at the top-left corner of the image was 255, and thus we can write
y(0, 0) = 255.
In the remainder of this section, we review the classical way in which mathemat-
ical functions are typically described: using an algebraic equation or formula. In the
process, we discuss how these functions implicitly produce datasets like the ones
we have seen previously.
Take the familiar equation of a line
y(x) = -1 + \frac{1}{2}x  (3.1)
Fig. 3.3 (Left panel) Tabular view of the mathematical function y(x) = −1 + (1/2)x. It is impossible
to list out every possible input–output pair in a table like this as there are infinitely many of them.
(Right panel) The same function plotted over a small input range from x = −6 to x = 6
for instance. This is an explicitly written rule for taking an input x and transforming
it into an associated output y. Writing down the equation of the line or any other
formula gives us its rule explicitly: its recipe for transforming inputs into outputs.
Note that with the algebraic formula for a mathematical function on hand, we can
easily create its tabular view, as shown in the left panel of Fig. 3.3 for the function
defined in (3.1). Here, the vertical dots in each column of the table indicate that we
could keep on listing off input and output pairs (in no particular order). If we list
out every possible input–output pair, this table would be equivalent to the equation
defining it in (3.1)—albeit the list would be infinitely long!
Sometimes it is possible to visualize a mathematical function (when the input is
only one- or two-dimensional). For example, in the right panel of Fig. 3.3, we show
the plot of the mathematical function defined in (3.1). Although this plot appears to
be continuous, it is not so in reality. If you could look closely enough, you would be
able to see finely sampled but disjointed input–output pairs that make up the line.
If we use a high enough sampling resolution, the plot will look continuous to the
human eye, the way we might draw it using pencil and paper.
Which of the two modes of expressing a mathematical function better describes
it: its algebraic equation or its equivalent tabular view consisting of all of its input–
output pairs written out explicitly? To answer this question, note that if we have
access to the former, we can always generate the latter—at least in theory by
listing out every input–output pair using the equation to generate the pairs. But the
reverse is not always true. If we only have access to a dataset/table describing a
Fig. 3.4 Plots of two mathematical functions. Can you guess the algebraic equation generating
each?
mathematical function (and not its algebraic expression), it is often not obvious how
to draw conclusions vis-a-vis the associated algebraic form of the original function.
We could attempt to plot the table of values, or some portion of it, as we did in
Fig. 3.3, and intuit the equation y = −1 + (1/2)x simply by looking at its plot. To see if
this strategy works in general, we have plotted two more examples in Fig. 3.4. Take
a moment to see if you can determine the algebraic equation of these plots using
your visual intuition.
If you are familiar with elementary functions, you may have been able to spot the
equation for the example on the left. How about the second example on the right?
Not many people—even if they are mathematicians—can correctly identify its
underlying equation.
The point here is that even when the input is only one-dimensional, identifying a
function’s equation by plotting some portion of its table of values is very difficult to
do “by eye” alone. And, it is worth emphasizing that we could only even attempt this
for functions of one or two inputs, since we cannot meaningfully visualize functions
that take in three or more inputs.
Elementary Functions
In this section, we review elementary functions that are used extensively throughout
not only the study of machine learning but many areas of science in general.
Polynomial Functions
Polynomial functions are perhaps the first set of elementary functions one learns
about as they arise in virtually all areas of science and technology. When we are
dealing with only one input, x, each polynomial function simply raises the input to
a given power. The first few polynomial functions are written as
f_1(x) = x^1, \quad f_2(x) = x^2, \quad f_3(x) = x^3,  (3.4)
and so on. The first element in (3.4)—often written simply as f_1(x) = x, ignoring the
superscript 1—is a simple line with a slope of 1 and a vertical intercept of 0, and the
second, f_2(x) = x^2, is a simple parabola. We can continue listing more polynomials,
one for each positive integer k, with the kth polynomial taking the form f_k(x) = x^k.
Because of this special indexing of powers, the polynomials naturally form a catalog
or a family of functions. Of special interest to us in this book is the first member of
this family that is the building block of linear machine learning models we study in
great detail in Chaps. 4 and 5.
It is customary to define a degree-d polynomial as a linear combination of the first
d polynomial functions (plus a constant term). For instance, f(x) = 1 + 2x − 3x^2 +
x^3 is a degree-3 polynomial. In general, when we have N inputs x_1, x_2, \ldots, x_N, a
polynomial function involves raising each input x_i to a nonnegative integer power k_i
and multiplying the results together to form
f(x_1, x_2, \ldots, x_N) = x_1^{k_1} x_2^{k_2} \cdots x_N^{k_N}.  (3.5)
Several polynomial functions with one and two input(s) are plotted in the top and
bottom panels of Fig. 3.5, respectively.
Reciprocal Functions
Reciprocal functions are created similarly to polynomial functions, with one
difference: instead of raising the input to a positive integer power, we raise it to a
negative one. The first few reciprocal functions are therefore written as
f_1(x) = x^{-1} = \frac{1}{x}, \quad f_2(x) = x^{-2} = \frac{1}{x^2}, \quad f_3(x) = x^{-3} = \frac{1}{x^3},  (3.6)
and so on. Several examples of reciprocal functions are plotted in Fig. 3.6.
Fig. 3.5 Several polynomial functions. (Top panel) From left to right, the plot of f(x) = x,
f(x) = x^2, f(x) = x^3, and f(x) = x^4. (Bottom panel) From left to right, the plot of f(x_1, x_2) =
x_2, f(x_1, x_2) = x_1 x_2^2, f(x_1, x_2) = x_1 x_2, and f(x_1, x_2) = x_1^2 x_2^3
Fig. 3.6 Several reciprocal functions. From left to right, the plot of f (x) = x −1 , f (x) = x −2 ,
f (x) = x −3 , and f (x) = x −4
Trigonometric Functions
The basic trigonometric functions are derived from the simple relations of a right
triangle and take on a repeating wave-like shape. The first of these are the sine
and cosine functions written, respectively, for a scalar input x as sin(x) and cos(x).
These two elementary functions originate in tracking the vertical and horizontal
coordinates of a single point on the unit circle
x2 + y2 = 1 (3.7)
Fig. 3.7 The sine (red) and cosine (blue) functions can be plotted by tracking the vertical and
horizontal position of the endpoint of an arrow stemming from the origin and ending on the unit
circle as the endpoint moves counterclockwise. Every time the endpoint completes one loop around
the circle each function naturally repeats itself, making sine and cosine periodic functions
Hyperbolic Functions
The hyperbolic sine and cosine functions, written sinh(x) and cosh(x), are defined
analogously to their trigonometric counterparts, this time by tracking the coordinates
of a point moving along the unit hyperbola
x^2 - y^2 = 1.  (3.8)
Other common hyperbolic functions are based on various ratios of these two
fundamental functions. For example, the hyperbolic tangent function is defined as
sinh(x)
the ratio of hyperbolic sine to hyperbolic cosine, that is, tanh(x) = cosh(x) .
Exponential Functions
A well-known Indian folktale tells the story of a king who enjoyed inviting people
to play chess against him. One day, the king offered to grant a traveling savant
whatever reward he wanted if he beat the king in the game. The savant agreed, but
demanded the king pay him in a rather strange way: if the savant won, the king
would put a single grain of rice on the first square of the chessboard and double it
on every subsequent square. The two played and the savant won.
The king ordered a large bag of rice to be brought in and started placing the grains
according to the mutually agreed upon arrangement: one grain on the first square,
two on the second, four on the third, and so on and so forth. By the time he reached
the 21st square, he had already emptied the entire bag. Soon the king realized all the
rice in his entire kingdom would not be enough to fulfill his pledge to the savant.
The king in this fable failed to appreciate the incredibly rapid growth of the
exponential function f (x) = 2x , plotted in Fig. 3.8. In general, an exponential
function can be defined for any base value. For example, f (x) = 10x defines an
exponential with base 10.
Another widely used choice for the base value is Euler's number, denoted
e = 2.71828…, whose decimal expansion is unending and non-repeating.
This number—credited to seventeenth century mathematician Jacob Bernoulli—
originally arose out of a thought experiment posed about compound interest
payments. Suppose we have a principal of $1.00 in the bank and receive 100%
interest from the bank per year, credited once at the end of the year. This means we
would double our amount of money after one year, i.e., we multiply our principal
by 2. Notice what changes if instead of receiving one interest payment of 100%
on our principal we received 2 payments of 50% interest during the year. At the
first crediting, we multiply the principal by 1.5 (to get 50% interest). However, at
the second crediting, we multiply this updated value by 1.5, or in other words, we
multiply our principal by (1.5)^2 = (1 + 1/2)^2. If we keep going in this way, supposing
we credit 33.33…% interest 3 times per year, we end up multiplying the principal
by (1 + 1/3)^3; cutting the interest into quarters, we end up multiplying the principal by
(1 + 1/4)^4, etc. In general, if we cut the interest payments into n equal pieces, we
end up multiplying the principal by (1 + 1/n)^n, a quantity that approaches Euler's
number e as n grows larger and larger.
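A quick numerical experiment (ours) makes this limiting behavior visible:

import math

for n in [1, 2, 4, 12, 365, 10**6]:
    print(n, (1 + 1 / n) ** n)  # approaches e as n grows

print("e =", math.e)  # 2.718281828...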
The exponential function with base e, written f(x) = e^x, is referred to as the natural
exponential function. The hyperbolic functions introduced earlier, for example, can
be expressed in terms of it, with the hyperbolic tangent written as
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}.  (3.9)
Logarithmic Functions
What is 810,456,018 × 6,388,100,279? Nowadays, it only takes a few seconds to
type these numbers into a trusty calculator and find the answer. But before the advent
of calculators, one had no choice but to multiply two large numbers like these by
hand: an obviously tedious and time-consuming task, requiring careful bookkeeping
to avoid clerical errors. Necessity being the mother of invention, however, people
invented all sorts of tricks to make this sort of computation easier.
The logarithm—first invented to cut big multiplication problems down to size by
turning multiplication into addition—is an elementary function with a wide range
of modern applications. Based on the exponential function with generic base b, the
logarithm of base b is defined as
f(x) = \log_b(x), \quad \text{where } \log_b(x) = y \text{ whenever } b^y = x.  (3.10)
Using this definition, one can quickly verify that this function (regardless of the
choice of base b) indeed turns multiplication into addition. The logarithm and
exponential functions are inverses of one another. This allowed one to take two large
numbers p and q and, instead of multiplying them directly, look up the values
log_b(p) and log_b(q) in a table, add the results, and then exponentiate to obtain the
product p · q.
The logarithm function with base e (Euler's number) is commonly referred to as
the natural logarithm and is plotted in Fig. 3.9. When dealing with the natural logarithm,
it is commonplace (for the sake of brevity) to drop the base e from the notation and
write the natural logarithm function simply as f(x) = log(x).
Step Functions
Compared with the previous functions that were defined by a single equation over
their entire input domain, step functions are defined piecewise over subregions of
their input. Over each subregion, the step function is constant, but it can take on a
different value on each subregion. For example, a step function with two steps has
the algebraic form
f(x) = \begin{cases} v_1 & \text{if } x < s \\ v_2 & \text{if } x > s \end{cases},  (3.11)
where s is referred to as a split point, and v_1 and v_2 are two constant values.
Typically, the value of the function at the split point x = s is not of great significance
and is often set as the average of v_1 and v_2, i.e., f(s) = (v_1 + v_2)/2. The sign function
is a prime and often used example of a step function where s = 0, v1 = −1,
and v2 = +1. In general, a step function with N steps breaks the input into N
subregions, taking on a different value over each subregion
f(x) = \begin{cases}
v_1 & \text{if } x < s_1 \\
v_2 & \text{if } s_1 < x < s_2 \\
\vdots & \vdots \\
v_{N-1} & \text{if } s_{N-2} < x < s_{N-1} \\
v_N & \text{if } s_{N-1} < x
\end{cases}  (3.12)
and hence has N − 1 split points s1 through sN −1 , with N constant levels denoted
v1 through vN .
Many analog (continuous) signals such as radio and television signals often
look like some sort of sine function when broadcast. At the receiver, however, an
electronic device will digitize (or quantize) such signals, which entails transforming
the original wavy analog signal into a step function that closely resembles it. By
doing so, far fewer values are required to store and process the signal (just the
Fig. 3.10 (Left panel) The original sine function in black along with its digitized version (a step
function with 9 levels/steps) in red. (Middle panel) A more common way of plotting the step
function on the left where all the discontinuities have been filled in so that the step function can
be visualized more easily. (Right panel) As the number of levels/steps increases, the step function
resembles the underlying continuous sine function more closely
values of the steps and splitting points, as opposed to the entire original function).
Figure 3.10 illustrates a facsimile of this idea for a simple sine function.
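A minimal sketch of this digitization idea (ours; the level-snapping rule is one simple choice among many) is given below:

import numpy as np

def quantize(signal, num_levels):
    # Snap each sample to the nearest of num_levels evenly spaced levels
    levels = np.linspace(signal.min(), signal.max(), num_levels)
    idx = np.argmin(np.abs(signal[:, None] - levels[None, :]), axis=1)
    return levels[idx]

x = np.linspace(0, 2 * np.pi, 100)
digital = quantize(np.sin(x), num_levels=9)  # a 9-step approximation
print(np.unique(digital))  # at most 9 distinct values remain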
Elementary Operations
Fig. 3.11 Illustration of various function adjustments including amplification, attenuation, squeezing,
stretching, as well as horizontal and vertical shifts. The original function f(x) = sin(x) is
drawn in dashed black for comparison. (Top-left panel) Plots of the functions 2f(x) and (1/2)f(x)
shown in red and blue, respectively, to exemplify function amplification and attenuation. (Top-right
panel) Plots of the functions f(2x) and f((1/2)x) shown in red and blue, respectively, to exemplify
squeezing and stretching. (Bottom-left panel) Plot of the function f(x) + 1 (in red) exemplifies a
vertical shift. (Bottom-right panel) Plot of the function f(x + 1) (in blue) exemplifies a horizontal
shift
Just like numbers, basic arithmetic operations including addition (or subtraction)
and multiplication (or division) can be used to combine two (or more) functions as
well. For instance, f1 (x) + f2 (x) is a new function formed by adding the values of
f1 (x) and f2 (x) at each point over their entire common domain.
Akin to addition, we can define multiplication of two functions f1 (x) and f2 (x)
denoted by f_1(x) × f_2(x), or simply f_1(x)f_2(x). Interestingly, amplitude
modulation (AM) radio broadcasting was invented in the early 1900s based on the
simple idea of multiplying a message signal by a sinusoidal function (called the
carrier signal) at the transmitter side. Amplitude modulation makes it possible to
broadcast multiple messages simultaneously over a shared medium (or channel).
Function addition and multiplication are illustrated in Fig. 3.12.
Composition of Functions
Fig. 3.12 Illustration of function addition and multiplication using the functions f_1(x) =
2\sin(x) + 3\cos(\frac{1}{10}x - 1) and f_2(x) = \sin(10x), plotted in the top-left panel and the top-right
panel, respectively. The addition of the two functions, f_1(x) + f_2(x), is plotted in the bottom-left
panel, and the multiplication of the two functions, f_1(x) × f_2(x), is plotted in the bottom-right
panel
For example, we can plug the cubic function x^3 into the sine function to
get sin(x^3), or alternatively, we can plug the sine function into the cubic one to get
(\sin(x))^3.
Importantly, the order in which we compose two functions matters. This
is different from what we saw with addition and multiplication, where we always have
x^3 + \sin(x) = \sin(x) + x^3, and similarly, x^3 × \sin(x) = \sin(x) × x^3. This gives
composition, as a way of combining functions, much more flexibility compared to
addition or multiplication, especially when dealing with more than two functions.
Let us verify this observation by adding a third function to the mix: the exponential
e^x. While there is again only one way to combine x^3, \sin(x), and e^x via addition
(i.e., x^3 + \sin(x) + e^x) or multiplication (i.e., x^3 × \sin(x) × e^x), we now have many
different ways to compose these three functions: we can select any of the three, plug
it into one of the two remaining functions, take the result, and plug it into the last
one. Figure 3.13 shows the functions resulting from all 3! = 3 × 2 × 1 ways in which
we can compose these three functions.
Notationally, the composition of f_1(x) with f_2(x) is written as f_1(f_2(x)), and
in general, we have that
f_1(f_2(x)) \neq f_2(f_1(x)).  (3.13)
Fig. 3.13 Six different ways of composing three elementary functions: f_1(x) = x^3, f_2(x) =
\sin(x), and f_3(x) = e^x. (Top-left panel) Plot of the function f_3(f_2(f_1(x))) = e^{\sin(x^3)}. (Top-middle
panel) Plot of the function f_3(f_1(f_2(x))) = e^{\sin^3(x)}. (Top-right panel) Plot of the function
f_2(f_3(f_1(x))) = \sin(e^{x^3}). (Bottom-left panel) Plot of the function f_2(f_1(f_3(x))) = \sin((e^x)^3).
(Bottom-middle panel) Plot of the function f_1(f_3(f_2(x))) = (e^{\sin(x)})^3. (Bottom-right panel) Plot
of the function f_1(f_2(f_3(x))) = (\sin(e^x))^3
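The order-dependence of composition is easy to verify numerically; the sketch below (ours) composes the same three functions in two different orders:

import math

f1 = lambda x: x ** 3   # cubic
f2 = math.sin           # sine
f3 = math.exp           # exponential

x = 1.0
print(f3(f2(f1(x))))    # e^(sin(x^3))
print(f1(f2(f3(x))))    # (sin(e^x))^3 -- a different value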
Min–Max Operations
The maximum of two functions f1 (x) and f2 (x), denoted by max (f1 (x) , f2 (x)),
is formed by setting the output to the larger value of f1 (x) and f2 (x) for every x in
the common domain of f1 (x) and f2 (x). The minimum of two functions is defined
in a similar manner, only this time by setting the output to the smaller value of the
two. The following is a practical use case of this function operation from the field
of electrical engineering.
Electricity is delivered to the point of consumption in AC (alternating current)
mode, meaning that the voltage you get at your outlet is a sinusoidal waveform with
both positive and negative cycles. Meanwhile, virtually every electronic device
(mobile phone, laptop, etc.) operates on DC (direct current) power and thus requires
a constant, steady supply of voltage. A conversion from AC to DC therefore has to
take place inside the power adapter: this is the function of a rectifier. In its simplest
form, the rectifier comprises a single diode that blocks negative cycles of the AC
waveform and only allows positive cycles to pass. The diode’s output voltage fout
can then be expressed in terms of the input voltage fin as fout (x) = max (0, fin (x)).
In the left panel of Fig. 3.14, we show the shape of fout (x) when the input is
a simple sine function fin (x) = sin(x). When fin (x) = x, the output fout (x) =
Fig. 3.14 (Left panel) The input (in dashed black) and output (in solid red) of a rectifier. (Right
panel) The rectified linear unit
max (0, x) is the so-called rectified linear unit—also known due to its shape as the
ramp function—plotted in the right panel of Fig. 3.14.
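Expressed in code (a sketch of ours, not from the original text), both the rectifier and the rectified linear unit are one-line max operations:

import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 9)
print(np.maximum(0, np.sin(x)))  # the rectifier output of Fig. 3.14
print(np.maximum(0, x))          # the rectified linear unit (ramp)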
Take the function
f(x) = \frac{\cos(16x)}{1 + x^2}  (3.14)
for instance, whose plot is shown in the left panel of Fig. 3.15. It is easy to verify that
this seemingly complex function was created using only the elementary functions
and operations in Tables 3.3 and 3.4, according to the graphical recipe shown in the
right panel of Fig. 3.15.
Problems
Fig. 3.15 (Left panel) The plot of the function f(x) = cos(16x)/(1 + x^2). (Right panel) A graphical
representation of how the elementary functions and operations in Tables 3.3 and 3.4 can be
combined to form f(x). Here, F_i denotes the elementary function in the ith row of Table 3.3,
and O_j denotes the elementary operation in the jth row of Table 3.4
Recall that a mathematical function must relate each input to one and only
one output. With this definition in mind, determine whether each of the following
input–output relationships defines a valid function:
(a) The relationship between a food item (input) and its sodium content (output),
as provided in Table 3.1
(b) The relationship between the amount of protein in a food item (input) and its
total calories (output), as provided in Table 3.1
(c) The relationship between x (input) and y (output), defined through the equation
2x + 3y + xy = 1
(d) The relationship between x (input) and y (output), defined through the equation
x 2 + y 2 = xy
(e) The relationship between x (input) and y (output), captured in the s-shaped plot
shown in Fig. 3.16
Show how each of the following functions can be constructed from the elementary
functions and operations in Tables 3.3 and 3.4 by producing a graphical representation
similar to the one shown in the right panel of Fig. 3.15.
(a) f(x) = 1 + x^2 + x^4
(b) f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
(c) f(x) = \log\left(\frac{1}{1 + e^{-x}}\right)
Chapter 4
Linear Regression
A large number of our day-to-day experiences and activities are governed by linear
phenomena. For instance, the distance traveled by a car at a certain speed is linearly
related to the duration of the trip. When an object is thrown, its acceleration is a
linear function of the amount of force exerted to throw the object. The sales tax
owed on a purchased item changes linearly with the item’s original price. The
recommended dosages for many medications are linear functions of the patient’s
weight. And, the list goes on.
Linear regression is the machine learning task of uncovering the hidden linear
relationship between the input and output data. In this chapter, we study linear
regression from the ground up, laying the foundation for discussion of more
complex nonlinear models in the chapters to come.
Consider the following toy regression dataset consisting of four input–output pairs
(0, -1), \quad (2, 0), \quad (4, 1), \quad (6, 2),  (4.1)
where the first element of each pair is a sample input and the second element is
the corresponding output. Before proceeding any further, take a moment and try
to solve this regression problem yourself by finding a mathematical relationship
(whether linear or nonlinear) that could explain this data. In other words, use your
mathematical intuition to find a function f (·) such that
\begin{aligned}
f(0) &= -1, \\
f(2) &= 0, \\
f(4) &= 1, \\
f(6) &= 2.
\end{aligned}  (4.2)
If you have not managed to find a solution already, you may find it helpful to plot
this data (as we have done in Fig. 4.1) and inspect it visually. Do you see a trend or
pattern emerge?
Most (if not all) people would immediately recognize a linear relationship
between the input and output in this case, even though countless other (nonlinear)
functions satisfying the equations in (4.2) also exist. The trigonometric function
f(x) = \sin^2(\frac{\pi}{4}x) - \sin(\frac{\pi}{4}x) - \cos(\frac{\pi}{4}x) is one example.
The reason why we are quicker to pick a linear solution over a nonlinear one can
be explained by the ubiquity and simplicity of linear functions. As discussed earlier
in the introduction to the chapter, we are surrounded by linear phenomena, and
as a result, our brains have evolved to recognize linearity with ease. Moreover, the
Occam's razor principle states that when presented with competing hypotheses of
identical predictive power, the simplest solution is preferable. In the words
of the second-century AD Greek mathematician Claudius Ptolemy, "in general, we
consider it a good principle to explain a phenomenon by the simplest hypothesis
possible."
Linear functions are the simplest of all mathematical functions, both alge-
braically and geometrically, making them easy for humans to understand, intuit,
interpret, and wield. A linear function with one-dimensional input takes the general
form of
f (x) = w0 + w1 x, (4.3)
where the parameter w0 represents the function’s bias (or vertical intercept) and
the parameter w1 represents its slope (or rise over run). Solving a linear regression
problem then becomes synonymous with finding the correct values for w0 and w1
in a way that all equations in (4.2) hold.
Substituting the parametric expression of f (·) in (4.3) into (4.2), we have
\begin{aligned}
w_0 &= -1, \\
w_0 + 2w_1 &= 0, \\
w_0 + 4w_1 &= 1, \\
w_0 + 6w_1 &= 2.
\end{aligned}  (4.4)
This linear system of equations has a unique solution given by (w_0, w_1) = (-1, \frac{1}{2}),
which yields f(x) = -1 + \frac{1}{2}x as the linear model underlying the regression dataset
in (4.1).
The regression datasets encountered in practice differ from the previously studied
toy dataset in (4.1) in two important ways. First, real-world datasets are typically
much larger in size, some having in excess of millions of data points. Therefore,
from this point on, we assume regression datasets consist, in general, of p input–
output pairs where p can be arbitrarily large. The general setup for linear regression
problems can then be cast as a set of p linear equations, written compactly as
w0 + w1 xi = yi , i = 1, 2, . . . , p. (4.5)
Second, and perhaps more importantly, there is no guarantee for all p data points in
a given regression dataset to be collinear, meaning that they all fall precisely on a
single straight line. In fact, it is highly unlikely to encounter fully collinear datasets
in real-life settings even if the underlying relationship between the input and output
is truly linear. This is because a certain amount of noise is always present in the data
as a result of various types of observational and measurement errors that cannot be
eliminated entirely. Mathematically speaking, this means that the linear system of
equations in (4.5) will almost never have any solutions if the presence of noise is
not taken into account.
We can model the existence of noise by adding a new variable \varepsilon_i to the output y_i
in (4.5) to form the following "noisy" system of equations:
w_0 + w_1 x_i = y_i + \varepsilon_i, \quad i = 1, 2, \ldots, p.  (4.6)
Note, however, that with this adjustment the linear system above now has more
unknown variables (p + 2) than equations (p), causing it to have infinitely many
solutions. Thus, a new strategy is needed to solve (4.6) and retrieve the optimal values
for w_0 and w_1.
Rearranging (4.6) by bringing the term y_i to the left-hand side and squaring both
sides yields an equivalent system of equations
\varepsilon_i^2 = (w_0 + w_1 x_i - y_i)^2, \quad i = 1, 2, \ldots, p,  (4.7)
in which the noise/error terms are now isolated from the rest of the variables. In
addition, by squaring both sides, we make sure that positive and negative error
values of the same magnitude contribute equally to the mean squared error (or
MSE) defined, over the whole dataset, as
\text{MSE} = \frac{1}{p} \sum_{i=1}^{p} \varepsilon_i^2.  (4.8)
Ideally, we would want the MSE to be 0. However, this implies that all \varepsilon_i's should
be zero, which, as stated previously, is not a practical possibility. If we cannot make
the mean squared error vanish entirely, the next best thing we can do is make it as small
as possible. This desire forms the basis of the least squares framework, depicted
visually in Fig. 4.2, in which we determine the optimal values for w_0 and w_1 by
minimizing the least squares cost function
g(w_0, w_1) = \frac{1}{p} \sum_{i=1}^{p} \varepsilon_i^2 = \frac{1}{p} \sum_{i=1}^{p} (w_0 + w_1 x_i - y_i)^2.  (4.9)
Minimizing the least squares cost function in (4.9) is no herculean task and can
be done by setting the partial derivatives of g(·) with respect to its inputs to 0. The
partial derivative of g(·) with respect to w_0 can be written, and simplified, as
\frac{\partial g}{\partial w_0} = \frac{1}{p} \sum_{i=1}^{p} 2\,(w_0 + w_1 x_i - y_i)\, \frac{\partial (w_0 + w_1 x_i - y_i)}{\partial w_0} = \frac{2}{p} \sum_{i=1}^{p} (w_0 + w_1 x_i - y_i).  (4.10)
Similarly, the partial derivative of g(·) with respect to w_1 can be written, and
simplified, as
Fig. 4.2 (Left panel) A noisy version of the toy regression dataset shown originally in Fig. 4.1.
Note that with the addition of noise the data points no longer lie on a straight line. (Middle panel)
A "bad" linear model for fitting this data produces relatively large squared error values. Here, the
ith square has an area equal to \varepsilon_i^2. (Right panel) A "good" linear model should produce relatively
small squared error values overall. Visually speaking, the least squares framework seeks out the
linear model producing the least average amount of the gray color
\frac{\partial g}{\partial w_1} = \frac{1}{p} \sum_{i=1}^{p} 2\,(w_0 + w_1 x_i - y_i)\, \frac{\partial (w_0 + w_1 x_i - y_i)}{\partial w_1} = \frac{2}{p} \sum_{i=1}^{p} (w_0 + w_1 x_i - y_i)\, x_i.  (4.11)
Setting (4.10) and (4.11) to zero and applying simple algebraic rearrangements lead
to the following linear system of equations:
\begin{aligned}
w_0\, p + w_1 \sum_{i=1}^{p} x_i &= \sum_{i=1}^{p} y_i, \\
w_0 \sum_{i=1}^{p} x_i + w_1 \sum_{i=1}^{p} x_i^2 &= \sum_{i=1}^{p} y_i x_i,
\end{aligned}  (4.12)
which can be written in matrix form as
\begin{bmatrix} p & \sum_{i=1}^{p} x_i \\ \sum_{i=1}^{p} x_i & \sum_{i=1}^{p} x_i^2 \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
=
\begin{bmatrix} \sum_{i=1}^{p} y_i \\ \sum_{i=1}^{p} y_i x_i \end{bmatrix}.  (4.13)
Finally, multiplying both sides by the inverse of the square matrix in (4.13) gives
the optimal values for w_0 and w_1 as
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
=
\begin{bmatrix} p & \sum_{i=1}^{p} x_i \\ \sum_{i=1}^{p} x_i & \sum_{i=1}^{p} x_i^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum_{i=1}^{p} y_i \\ \sum_{i=1}^{p} y_i x_i \end{bmatrix}.  (4.14)
Example 4.1 (Training a Linear Regressor) Here, we use the least squares
solution derived in (4.14) to find the best fitting line for the noisy dataset shown
in Fig. 4.2. Substituting
\sum_{i=1}^{4} x_i = 12, \quad \sum_{i=1}^{4} x_i^2 = 56, \quad \sum_{i=1}^{4} y_i = 3.5, \quad \sum_{i=1}^{4} y_i x_i = 28  (4.16)
into (4.14) yields
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
=
\begin{bmatrix} 4 & 12 \\ 12 & 56 \end{bmatrix}^{-1}
\begin{bmatrix} 3.5 \\ 28 \end{bmatrix}
=
\begin{bmatrix} -1.75 \\ 0.875 \end{bmatrix}.  (4.17)
Therefore, f (x) = −1.75 + 0.875 x is the best linear function to represent the
dataset in (4.15).
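The closed-form solution in (4.14) translates directly into code. In the sketch below (ours; the outputs y are hypothetical values chosen so that their sums match (4.16)), the function reproduces the parameters found in Example 4.1:

import numpy as np

def fit_line(x, y):
    # Solve the 2 x 2 linear system of (4.13)-(4.14)
    p = len(x)
    A = np.array([[p, x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    b = np.array([y.sum(), (y * x).sum()])
    return np.linalg.inv(A) @ b

x = np.array([0.0, 2.0, 4.0, 6.0])
y = np.array([-1.5, -1.0, 3.0, 3.0])  # hypothetical noisy outputs
print(fit_line(x, y))  # approximately [-1.75, 0.875]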
Up to this point in the chapter, the inputs we dealt with were all scalars or one-
dimensional, which allowed us to visualize them as we did in Figs. 4.1 and 4.2.
In general, however, the input to regression problems can be, and often is, multi-
dimensional. While we cannot visualize datasets wherein the input has more than
two components or dimensions, the least squares framework nonetheless carries
over with minimal adjustments to deal with multi-dimensional input.
First, we must expand the linear function definition to accommodate general
n-dimensional inputs. Analogously to (4.3), a linear function in n dimensions is
defined as
f (x1 , x2 , . . . , xn ) = w0 + w1 x1 + w2 x2 + · · · + wn xn , (4.18)
so that, analogously to (4.9), the least squares cost function for n-dimensional input
becomes
g(w_0, w_1, \ldots, w_n) = \frac{1}{p} \sum_{i=1}^{p} (w_0 + w_1 x_{i,1} + \cdots + w_n x_{i,n} - y_i)^2,  (4.19)
where x_{i,j} denotes the jth component of the ith input.
At this point, it is notationally convenient to throw all the inputs x1 through xn into
a single input vector denoted as
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix},  (4.20)
and all the slope parameters w1 through wn into a single weight vector denoted as
w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix},  (4.21)
so that the least squares cost function in (4.19) can be written more compactly as
g(w_0, w) = \frac{1}{p} \sum_{i=1}^{p} (w_0 + w^T x_i - y_i)^2.  (4.22)
Mirroring the derivation in (4.10), we take the partial derivative of g(·) with respect
to w_0,
\frac{\partial g}{\partial w_0} = \frac{1}{p} \sum_{i=1}^{p} 2\,(w_0 + w^T x_i - y_i)\, \frac{\partial (w_0 + w^T x_i - y_i)}{\partial w_0} = \frac{2}{p} \sum_{i=1}^{p} (w_0 + w^T x_i - y_i),  (4.23)
and set it to 0, which yields (after simple rearrangements)
p\, w_0 + \left( \sum_{i=1}^{p} x_i^T \right) w = \sum_{i=1}^{p} y_i.  (4.24)
Similarly, we take the gradient of g(·) with respect to the weight vector w,
\nabla_w g = \frac{1}{p} \sum_{i=1}^{p} 2\,(w_0 + w^T x_i - y_i)\, \nabla_w (w_0 + w^T x_i - y_i) = \frac{2}{p} \sum_{i=1}^{p} (w_0 + w^T x_i - y_i)\, x_i,  (4.25)
and set it to 0_{n \times 1}. Again, after a few simple rearrangements, we have
\left( \sum_{i=1}^{p} x_i \right) w_0 + \left( \sum_{i=1}^{p} x_i x_i^T \right) w = \sum_{i=1}^{p} y_i x_i.  (4.26)
Defining the matrix A and the vector b as
A = \begin{bmatrix} p & \sum_{i=1}^{p} x_i^T \\ \sum_{i=1}^{p} x_i & \sum_{i=1}^{p} x_i x_i^T \end{bmatrix},  (4.27)
and
b = \begin{bmatrix} \sum_{i=1}^{p} y_i \\ \sum_{i=1}^{p} y_i x_i \end{bmatrix},  (4.28)
we can combine (4.24) and (4.26) into the following linear system of equations:
A \begin{bmatrix} w_0 \\ w \end{bmatrix} = b,  (4.29)
whose solution reveals the optimal values for the parameters of the linear function
in (4.18), as
\begin{bmatrix} w_0 \\ w \end{bmatrix} = A^{-1} b.  (4.30)
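A minimal sketch (ours) of this multi-dimensional least squares solution, built around the system A[w0; w] = b of (4.29) and (4.30):

import numpy as np

def fit_linear(X, y):
    # X: p x n matrix of inputs; y: length-p vector of outputs
    p, n = X.shape
    X1 = np.hstack([np.ones((p, 1)), X])  # prepend 1s to absorb w0
    A = X1.T @ X1                         # the matrix A of (4.29)
    b = X1.T @ y                          # the vector b of (4.29)
    params = np.linalg.solve(A, b)        # solve A [w0; w] = b
    return params[0], params[1:]

# Hypothetical data: five points with two-dimensional inputs
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([1.0, 0.5, 3.0, 2.5, 5.0])
print(fit_linear(X, y))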
Table 4.1 The dataset associated with Example 4.2. See text for details
Inputs Output
Country Mortality rate Polio immunization (%) GDP Life expectancy
Afghanistan 269.06 48.38 340.02 58.19
Albania 45.06 98.13 2119.73 75.16
Algeria 102.82 93.18 3261.29 74.21
Angola 362.75 70.88 2935.76 50.68
Argentina 100.38 94.46 6932.55 75.24
Armenia 117.33 88.67 2108.68 73.31
Australia 62.43 91.86 35,391.20 81.91
Austria 65.80 85.53 33,171.58 81.48
Azerbaijan 119.85 74.08 4004.78 71.15
Bangladesh 135.67 87.67 573.58 69.97
Belarus 220.27 89.27 3669.02 69.75
Belgium 69.93 97.67 17,752.53 80.65
Belize 154.20 95.60 3871.88 69.15
Benin 269.31 65.38 572.45 57.71
Bhutan 231.53 88.80 1270.01 65.92
Bosnia 63.55 75.18 2216.64 76.18
Brazil 151.27 98.33 5968.89 73.27
Bulgaria 124.73 94.47 4802.02 72.74
With these retrieved parameters, we arrive at the final linear model of life expectancy
given in (4.34).
We can now use this model to predict life expectancy for countries in the
validation set, starting with Bangladesh, which according to Table 4.1 has a
mortality rate of 135.67 per 1000 population, an infant Polio immunization rate
of 87.67%, and a per capita GDP of 573.58 dollars. Plugging these values into the
linear regression model in (4.34) produces the predicted life expectancy for Bangladesh.
Input Normalization
Fig. 4.3 The visual comparison of predicted versus actual life expectancy values for countries
whose names start with the letter “B.” The actual life expectancy values (in blue) were taken from
the dataset in Table 4.1, whereas the predicted values (in yellow) were obtained using the linear
regression model in (4.34). It is important to emphasize that the data for the countries shown in
this figure were not used during training
mortality rate, Polio immunization, and GDP. The weight associated with mortality
rate in this model is negative. This means that decreasing mortality rate would
increase life expectancy. On the other hand, the weights associated with the other
two inputs are positive, meaning that increasing Polio immunization and GDP
would increase life expectancy. In other words, the mathematical sign of a particular
parameter in a linear regression model determines whether the input attached to that
parameter contributes negatively or positively to the output.
The magnitude of a parameter also informs us about how strongly the output is
correlated with the input associated with that parameter. Intuitively, the larger the
input’s weight the greater its influence on the output. This insight can be used to
create a ranking of the inputs’ importance based on their contribution to the output.
However, there is one caveat: when |w_i| > |w_j|, we may conclude that the ith input
has a greater effect than the jth input only if the two inputs are on a comparable
scale.
For example, from (4.34), it is not immediately clear that Polio immunization has
a larger influence on determining life expectancy than GDP just because the weight
associated with Polio immunization (i.e., 0.02152) is greater in magnitude than the
weight associated with GDP (i.e., 0.00018). This is because Polio immunization (as
a percentage) always ranges between 0 and 100, whereas the per capita GDP can
range in the tens of thousands of dollars. To place all inputs on a comparable scale,
we can normalize each input using a_i and b_i, denoting the smallest and largest
values of the ith input across the dataset, respectively, as
\tilde{x}_i = \frac{x_i - a_i}{b_i - a_i}, \quad i = 1, 2, \ldots, n.  (4.37)
This simple input normalization scheme is precisely what we want, as it ensures that $\tilde{x}_i$ always lies between 0 and 1 (see Exercise 4.3). Moreover, it does not invalidate any of our previous modeling assumptions since, from a mathematical standpoint, a linear function with tunable parameters involving the $\tilde{x}_i$'s as input is equivalent to one involving the $x_i$'s as input. Recall from (4.18) that a linear model having $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n$ as input can be written as
as input can be written as
n
f (x1 , x2 , . . . , xn ) = w0 + wi x i
i=1
n
x i − ai
= w0 + wi
bi − ai
i=1
n
wi a i n
wi (4.38)
= w0 − + xi
bi − ai bi − ai
i=1
!" #
i=1 !" #
v i
v0
n
= v0 + vi xi ,
i=1
Example 4.3 (Revisiting Prediction of Life Expectancy) Here, we repeat the steps
taken in Example 4.2 to create a linear regression model of life expectancy. This
time, we normalize the input data first so that we can use the resulting least squares
solution to compare the relative importance of each input in estimating the output.
For each input column in Table 4.1, we find the smallest and largest values
across all countries, denoting them as a and b, respectively. We then use the linear
transformation x → (x − a)/(b − a) to form the normalized dataset shown in Table 4.2. Note
that the normalization procedure must be performed over the whole dataset that
includes both the training and validation subsets of the data.
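To make the procedure concrete, here is a minimal sketch of the min-max normalization in (4.37) using NumPy; the array values are the first three rows of Table 4.1, copied purely for illustration. In practice, and as noted above, X should contain the whole dataset (training and validation rows alike) before the minima and maxima are computed.

```python
import numpy as np

# Rows: Afghanistan, Albania, Algeria (from Table 4.1).
# Columns: mortality rate, Polio immunization (%), GDP.
X = np.array([[269.06, 48.38,  340.02],
              [ 45.06, 98.13, 2119.73],
              [102.82, 93.18, 3261.29]])

# Column-wise minima (a_i) and maxima (b_i), as in (4.37).
a = X.min(axis=0)
b = X.max(axis=0)

# Min-max normalization: every entry of this array now lies in [0, 1].
X_normalized = (X - a) / (b - a)
print(X_normalized)
```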
Table 4.2 The dataset associated with Example 4.3. See text for details
Normalized inputs Output
Country Mortality rate Polio immunization GDP Life expectancy
Afghanistan 0.70510 0.0000 0.0000 58.19
Albania 0.0000 0.9958 0.05078 75.16
Algeria 0.1818 0.8969 0.0833 74.21
Angola 1.0000 0.4504 0.0741 50.68
Argentina 0.1741 0.9225 0.1881 75.24
Armenia 0.2275 0.8065 0.0505 73.31
Australia 0.0547 0.8704 1.0000 81.91
Austria 0.0653 0.7438 0.9367 81.48
Azerbaijan 0.2354 0.5145 0.1046 71.15
Bangladesh 0.2852 0.7865 0.0067 69.97
Belarus 0.5515 0.8185 0.0950 69.75
Belgium 0.0783 0.9867 0.4968 80.65
Belize 0.3435 0.9453 0.1008 69.15
Benin 0.7059 0.3405 0.0066 57.71
Bhutan 0.5870 0.8092 0.0265 65.92
Bosnia 0.0582 0.5366 0.0535 76.18
Brazil 0.3343 1.0000 0.1606 73.27
Bulgaria 0.2508 0.9226 0.1273 72.74
Applying the least squares solution to this normalized data yields a new linear model, shown in (4.40), where the magnitude of the weight of each input can now be used as a proxy for determining the extent of its contribution toward estimating the output. In this example, the weight attached to mortality rate has the largest magnitude, followed by GDP and Polio immunization.
Regularization
In all regression problems we have discussed so far, the number of data points p has always exceeded the number of inputs n. In this section, we discuss what happens when the reverse is true, i.e., n ≥ p. As done previously, we start with a simple toy dataset with n = p = 2:

$$\mathbf{x}_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \; y_1 = 1, \qquad \mathbf{x}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \; y_2 = 2. \qquad (4.41)$$

Fitting the linear model $f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2$ to these two points perfectly requires
$$w_0 = 1, \qquad w_0 + w_1 + w_2 = 2, \qquad (4.42)$$
to model the data in (4.41). It quickly becomes clear that this system has many
solutions. More precisely, any set of parameters of the form (w_0, w_1, w_2) = (1, t, 1 − t) is a solution to (4.42), where t can be any real number. For example, t = 1, t = 10, and t = 100 yield the linear functions 1 + x_1, 1 + 10x_1 − 9x_2, and 1 + 100x_1 − 99x_2, respectively, all of which explain the data in (4.41) perfectly and without error.
When n ≥ p, the resulting linear system of equations, like the one shown in
(4.42), has fewer equations than unknowns, which leads to the possibility of it
having infinitely many solutions. This causes the least squares cost function

$$g(w_0, w_1, w_2) = \frac{1}{2} \sum_{i=1}^{2} \left( w_0 + w_1 x_{i,1} + w_2 x_{i,2} - y_i \right)^2 \qquad (4.43)$$

to have infinitely many minima, which, practically speaking, is not desirable. One way to address this issue is to adjust the least squares cost function by adding a non-negative function r(w_1, w_2) to the original cost, as in

$$g(w_0, w_1, w_2) + r(w_1, w_2), \qquad (4.44)$$
so that the new cost function has a unique minimum. The function r(·) is called a
regularizer, and the adjustment process described above is referred to as regulariza-
tion. The most commonly used regularizer in deep learning is the quadratic or ridge
regularizer defined as
$$r(w_1, w_2) = \lambda \left( w_1^2 + w_2^2 \right), \qquad (4.45)$$

where λ > 0 controls the strength of regularization. More generally, for the regularized least squares cost

$$g(w_0, \mathbf{w}) = \frac{1}{p} \sum_{i=1}^{p} \left( w_0 + \mathbf{w}^T \mathbf{x}_i - y_i \right)^2 + \lambda\, \mathbf{w}^T \mathbf{w}, \qquad (4.46)$$
we can follow the steps laid down in (4.23) through (4.30) to find the least squares
solution as
$$\begin{bmatrix} w_0 \\ \mathbf{w} \end{bmatrix} = A^{-1} \mathbf{b}, \qquad (4.47)$$

where

$$A = \begin{bmatrix} p & \sum_{i=1}^{p} \mathbf{x}_i^T \\ \sum_{i=1}^{p} \mathbf{x}_i & \sum_{i=1}^{p} \mathbf{x}_i \mathbf{x}_i^T + \lambda I_{n \times n} \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} \sum_{i=1}^{p} y_i \\ \sum_{i=1}^{p} y_i \mathbf{x}_i \end{bmatrix}. \qquad (4.48)$$
More generally, adding the term $\lambda I_{n \times n}$ inside the matrix A in (4.48) guarantees A to be invertible (see Exercise 4.2).
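A minimal sketch of this closed-form solution, assuming NumPy; the function name and interface are our own, but the matrices follow the block structure of (4.48), with λI added to the lower-right block as in the text.

```python
import numpy as np

def ridge_least_squares(X, y, lam):
    """Solve the regularized least squares problem in (4.46) by
    forming the matrices A and b of (4.48) and returning the
    parameter vector [w0, w] of (4.47).

    X : p-by-n array of inputs (one row per data point)
    y : length-p array of outputs
    lam : regularization strength (lambda)
    """
    p, n = X.shape
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[0, 0] = p
    A[0, 1:] = X.sum(axis=0)          # sum of x_i^T (row)
    A[1:, 0] = X.sum(axis=0)          # sum of x_i (column)
    A[1:, 1:] = X.T @ X + lam * np.eye(n)   # sum of x_i x_i^T + lambda*I
    b[0] = y.sum()
    b[1:] = X.T @ y                   # sum of y_i x_i
    return np.linalg.solve(A, b)      # [w0, w1, ..., wn]
```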
Problems
where the vector y is formed by stacking all outputs into a single vector
$$\mathbf{y}_{p \times 1} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} \qquad (4.53)$$
and where the matrix X is formed by stacking all input vectors x1 through xp side-
by-side as columns of a new matrix, and then extending its row space by adding a
row vector consisting only of 1’s, as in
$$X_{(n+1) \times p} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{bmatrix}. \qquad (4.54)$$
(c) Follow a similar set of steps as described in (4.23) through (4.30) to derive (4.47) as
the optimal set of parameters that minimize the regularized least squares function in
(4.46).
(d) Show that when the least squares cost function is regularized via r(w) = λwT w, the
least squares solution in (4.52) can be adjusted and written as
$$\begin{bmatrix} w_0 \\ \mathbf{w} \end{bmatrix} = \left( X X^T + \lambda \mathring{I} \right)^{-1} X \mathbf{y}, \qquad (4.55)$$
(e) Show that, when λ > 0, the matrix $X X^T + \lambda \mathring{I}$ in (4.55) is always invertible regardless of the dimensions or entries of X.
4.3 Input Normalization
(a) Given two real numbers c and d (where c < d) and a set of p measurements
{x1 , x2 , . . . , xp }, find a linear function f (·) such that c ≤ f (xi ) ≤ d for all
i = 1, 2, . . . , p.
(b) Recall from our discussion of input normalization in Sect. “Input Normalization”
that the linear model of life expectancy that was originally derived in (4.34) could
not be used to infer relative input importance. To remedy this issue, in Example 4.3,
we linearly transformed all inputs as shown in Table 4.2 and applied the least squares
solution to this normalized data to derive a new linear model in (4.40), in which
86 4 Linear Regression
the input weights can be used to infer relative input importance. In this part of the
exercise, you will re-derive the linear model in (4.40) without using the normalized
data in Table 4.2, but instead by only leveraging (4.38) along with the original
(unnormalized) data in Table 4.1.
4.4 Prediction of Life Expectancy: Part I
In Example 4.2, we used a relatively small dataset consisting of p = 18 coun-
tries to predict life expectancy based on n = 3 input factors including mortality
rate, Polio immunization rate, and GDP. Here, we use an expanded version of this
dataset that contains p = 133 countries. This version of the data is stored in
life-expectancy-133-countries.csv that is included in the chapter's supplements. The goal of this exercise is to evaluate how increasing the size of data impacts
the prediction power of linear regression, as measured over a validation dataset using the
mean squared error (MSE) metric defined in (4.8):
(a) Normalize the input data as described in Sect. “Input Normalization”.
(b) Use the normalized data associated with all countries whose names start with the
letters “A-M” to train a linear regression model for life expectancy.
(c) Based on your model, which input happens to be the most important factor in
predicting the output? Which input happens to be the least important?
(d) Use your trained model to calculate the mean squared error (MSE) for the validation
portion of the data that includes all countries whose names start with the letters “N-
Z.” How does this MSE value compare to the MSE value calculated for the smaller
version of the dataset in Table 4.2?
4.5 Prediction of Life Expectancy: Part II
In this exercise, we use an expanded version of the dataset referenced
in Exercise 4.4 (with n = 18 input factors) to train a linear regression
model for predicting life expectancy. This version of the data is stored in
life-expectancy-18-factors.csv that is included in the chapter’s
supplements. The goal of this exercise is to evaluate how increasing the input dimension
impacts the prediction power of linear regression, as measured over a validation dataset
using the mean squared error (MSE) metric defined in (4.8):
4.6 Prediction of Medical Insurance Costs
In this exercise, we use linear regression to predict individual medical insurance costs. The input factors include age, body mass index (BMI), the number of children, and smoking status (1 for smokers and 0 for non-smokers). The dataset used here is an abridged version of the "insurance" dataset taken from [1], which is included in the chapter's supplements (under the name medical-insurance.csv):
(a) Normalize the input data as described in Sect. “Input Normalization”.
(b) Split the data randomly into two equal-sized training and validation datasets, and
use the former to train a linear regression model.
(c) Based on your model, which input happens to be the most important factor in
predicting the output? Which input happens to be the least important?
(d) Use your trained model to calculate the mean squared error (MSE) for both the
training and validation datasets. Which MSE value is larger in this case? Is that
what you expected? Explain.
Chapter 5
Linear Classification
In the previous chapter, we studied linear regression as the most fundamental model
for capturing the relationship between input and output data in situations where
output takes on values from a continuous range. Analogously, linear classification is
considered to be the foundational classification model for separating two (or more)
classes of data using linear boundaries.
Since both paradigms use linear models at their core, our overall treatment of
linear classification in this chapter will closely mirror our discussion of linear
regression in Chap. 4. However, as we will see shortly, the seemingly subtle
distinction between regression and classification (in terms of the nature of the
output) leads to significant differences in the cost functions used in each case, as
well as the optimization strategies employed to minimize those costs to retrieve
optimal model parameters.
We begin the chapter, like we did in Sect. “Linear Regression with One-Dimensional
Input”, by considering a simulated classification dataset consisting only of p = 6
input–output pairs of the form (xi , yi )
$$(-2.0,\ 0), \quad (-1.5,\ 0), \quad (-1.0,\ 0), \quad (-0.5,\ 1), \quad (0.5,\ 1), \quad (2.5,\ 1), \qquad (5.1)$$
where xi and yi represent the ith input and output, respectively. As with regression,
the goal with classification is to find a function f (·) such that f (xi ) = yi holds
true for i = 1, 2, . . . , 6. Note that because the output yi is always limited to take
on binary values (i.e., 0 or 1), classification can be thought of as a special case
of regression (where a special type of constraint is imposed on the values that the
output can attain). It is, therefore, not illogical to wonder whether we can reuse the
same mathematical framework we developed in the previous chapter to find f (·) in
this case as well. Let us give it a try!
Following the same set of steps as outlined in Example 4.1, one can derive the
function f (x) = 0.5875+0.2625 x as the best linear regressor to fit the classification
data in (5.1), both of which (the function and the data) are plotted in Fig. 5.1. A
quick glance at this figure shows that f (·), having an unconstrained and unbounded
output, represents the underlying dataset rather poorly.
This issue can be fixed by employing the so-called Heaviside step function h(·),
which is defined as
$$h(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases} \qquad (5.2)$$
and plotted in the left panel of Fig. 5.2. Since the output of h(·) is binary at all
times, it possesses the properties we expect to see in a proper classification function.
Therefore, we can pass the linear function f (x) = w0 + w1 x through h(·) and use
the compositional function h(f (x)) as our new classifier.
A corresponding least squares cost function can be formed, following the steps
described in Sect. “The Least Squares Cost Function”, as
$$g(w_0, w_1) = \frac{1}{p} \sum_{i=1}^{p} \left( h(f(x_i)) - y_i \right)^2 = \frac{1}{p} \sum_{i=1}^{p} \left( h(w_0 + w_1 x_i) - y_i \right)^2, \qquad (5.3)$$
which closely resembles the least squares cost function in (4.9). This time, however,
we cannot simply set the partial derivatives of g(·) to zero and solve for w0 and
Fig. 5.2 (Left panel) The Heaviside step function defined in (5.2). (Right panel) The logistic
function defined in (5.4). In practice, the logistic function can be used as a smooth and
differentiable approximation to the Heaviside step function
w1 , since the function h(·) contained within g(·) is discontinuous and hence non-
differentiable.1
A clever way to get around this issue is to replace h(·) with another function
approximating it that is smooth and differentiable. The logistic function defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (5.4)$$
and plotted in the right panel of Fig. 5.2 is one such function. In the next section, we
will take a closer look at this function, its historical origins, and its mathematical
properties.
1 Technically, it is feasible to use subderivatives and subgradients [1] to bypass the issue of non-differentiability of the Heaviside function. However, as we will see later in the chapter, there are additional reasons making the least squares cost function in (5.3) inappropriate for use in classification problems.

The Logistic Function

The first recorded use of the logistic function dates back to the mid-nineteenth century and the work of the Belgian mathematician Pierre François Verhulst, who used this function in his study of population growth. Prior to Verhulst, the Malthusian model was the only game in town when it came to modeling how biological populations grew over time. The Malthusian model assumes that the rate of growth in a population at each point in time is proportional to the size of the population itself, i.e.,

$$\frac{d}{dt} N(t) \propto N(t), \qquad (5.5)$$
where N (t) denotes the size of the population at time t and the ∝ symbol denotes
proportionality.2 Based on the Malthusian model, as the population gets larger
in size so does the rate of the growth of the population, causing N(t) to grow
exponentially in time. Indeed, one can easily verify that the exponential function
N(t) = e^t satisfies (5.5).
The Malthusian model is quite effective in explaining bacterial growth, among
many other biological processes. Starting with a single bacterium at time t = 0,
and assuming that binary fission (i.e., the division of one bacterium into two) takes
exactly one second to complete, at t = 1 there will be N = 2 bacteria, at t = 2 there
will be N = 4 bacteria, at t = 3 there will be N = 8 bacteria, etc. The question is:
can this exponential pattern continue forever?
When the resources needed for the growth of a population (e.g., food, space,
etc.) are limited, there comes a point at which the growth begins to slow down. To
incorporate this reality into his growth model, Verhulst used an adjusted growth rate
of N (t)(K − N (t)), wherein the constant K represents the capacity of the system
that hosts the population. This way, the growth rate is influenced by not only the
population at time t but also by the remaining capacity in the system at time t, via
the term K − N (t).
With this adjustment, Verhulst re-wrote the differential equation in (5.5) as

$$\frac{d}{dt} N(t) \propto N(t) \left( K - N(t) \right) \qquad (5.6)$$

and derived the logistic function in (5.4) as a solution (see Fig. 5.3 and Exercise 5.3).
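As an illustration, the sketch below integrates the logistic equation (5.6) with a simple forward-Euler scheme; the growth constant, capacity, step size, and initial population are arbitrary values chosen for demonstration.

```python
import numpy as np

# Forward-Euler simulation of the logistic equation (5.6),
# dN/dt = r * N * (K - N), with illustrative constants.
r, K = 0.001, 1000.0   # assumed growth constant and capacity
dt, steps = 0.01, 2000
N = np.zeros(steps)
N[0] = 1.0             # start from a single individual

for t in range(steps - 1):
    N[t + 1] = N[t] + dt * r * N[t] * (K - N[t])

# N traces the familiar S-shaped (logistic) curve: near-exponential
# growth at first, then leveling off as N approaches the capacity K.
print(N[::400])
```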
The differential equation in (5.6), commonly referred to as the logistic equation,
has found many applications outside its originating field of ecology. In medicine, the
logistic equation has been used to model tumor growth in mice and humans [3, 4],
where in this context N(t) represents the volume of tumor at time t. In another set of
medical applications, the logistic equation has been employed to model the spread
of infectious diseases, where in this context N(t) represents the number of cases
of the disease at time t. In certain circumstances, N(t) closely follows a logistic
pattern, e.g., the SARS outbreak at the beginning of the twenty-first century [5],
and more recently the Covid-19 pandemic [6]. A clear logistic trend is discernible
in Fig. 5.4 that shows the number of Covid-19 cases in China over a 3-month period
starting from January 3, 2020 and ending on April 3, 2020.
2 f (t) ∝ g(t) is another way of stating that there always exists a constant α such that f (t) =
α g(t).
Fig. 5.3 A hand-drawn depiction of the logistic (“Logistique”) function in an 1845 paper by
Verhulst [2], in which he compares his model of population growth with the exponential
(“Logarithmique”) model. In Verhulst’s sketch of the logistic function, the rate of growth peaks
around the point labeled as Oi , and the population curve starts to level off around the point labeled
as O
Fig. 5.4 The cumulative number of Covid-19 cases in China during the first quarter of 2020, as
reported by the World Health Organization [7]
A particularly useful property of the logistic function is that its derivative can be expressed in terms of the function itself:

$$\frac{d}{dt} \sigma(t) = \sigma(t)\, \sigma(-t) = \sigma(t) \left( 1 - \sigma(t) \right). \qquad (5.8)$$
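The identity in (5.8) is easy to verify numerically; the sketch below compares a finite-difference estimate of the derivative against σ(t)(1 − σ(t)) at an arbitrarily chosen point.

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

# Numerically verify the identity (5.8): the derivative of the
# logistic function equals sigma(t) * (1 - sigma(t)).
t, eps = 0.7, 1e-6
finite_difference = (sigma(t + eps) - sigma(t - eps)) / (2 * eps)
identity = sigma(t) * (1 - sigma(t))
print(finite_difference, identity)   # both approximately 0.2217
```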
Finally, the logistic function is closely related to the hyperbolic tangent function
$$\tanh(t) = \frac{\sinh(t)}{\cosh(t)} = \frac{\tfrac{1}{2}\left( e^{t} - e^{-t} \right)}{\tfrac{1}{2}\left( e^{t} + e^{-t} \right)} = \frac{e^{2t} - 1}{e^{2t} + 1}. \qquad (5.9)$$
Replacing the Heaviside step function h(·) with the logistic function σ(·) in (5.3) gives a new least squares cost of the form

$$g(w_0, w_1) = \frac{1}{p} \sum_{i=1}^{p} \left( \sigma(w_0 + w_1 x_i) - y_i \right)^2, \qquad (5.11)$$
which can now be differentiated with respect to its input variables. Specifically, we
can use the chain rule of calculus along with the formula in (5.8) to derive the partial
derivative of g(·) with respect to w0 , as
$$
\begin{aligned}
\frac{\partial g}{\partial w_0} &= \frac{2}{p} \sum_{i=1}^{p} \left( \sigma_i - y_i \right) \frac{\partial \left( \sigma_i - y_i \right)}{\partial w_0} \\
&= \frac{2}{p} \sum_{i=1}^{p} \left( \sigma_i - y_i \right) \sigma_i \left( 1 - \sigma_i \right) \frac{\partial \left( w_0 + w_1 x_i \right)}{\partial w_0} \qquad (5.12) \\
&= \frac{2}{p} \sum_{i=1}^{p} \left( \sigma_i - y_i \right) \sigma_i \left( 1 - \sigma_i \right),
\end{aligned}
$$

where $\sigma_i$ is shorthand for $\sigma(w_0 + w_1 x_i)$. An entirely similar calculation gives the partial derivative with respect to $w_1$ as

$$\frac{\partial g}{\partial w_1} = \frac{2}{p} \sum_{i=1}^{p} \left( \sigma_i - y_i \right) \sigma_i \left( 1 - \sigma_i \right) x_i. \qquad (5.13)$$
Setting both partial derivatives to zero gives the following system of equations:
$$
\begin{aligned}
\sum_{i=1}^{p} \left( \frac{1}{1 + e^{-(w_0 + w_1 x_i)}} - y_i \right) \frac{e^{-(w_0 + w_1 x_i)}}{\left( 1 + e^{-(w_0 + w_1 x_i)} \right)^2} &= 0, \\
\sum_{i=1}^{p} \left( \frac{1}{1 + e^{-(w_0 + w_1 x_i)}} - y_i \right) \frac{x_i \, e^{-(w_0 + w_1 x_i)}}{\left( 1 + e^{-(w_0 + w_1 x_i)} \right)^2} &= 0,
\end{aligned} \qquad (5.14)
$$
which, unlike its counterpart in linear regression, is nonlinear in the unknowns w_0 and w_1 and admits no closed-form algebraic solution.

The Cross-Entropy Cost Function

To develop a cost better suited to classification, first note that the least squares cost in (5.11) can be written compactly as

$$g(w_0, w_1) = \frac{1}{p} \sum_{i=1}^{p} \left( \sigma_i - y_i \right)^2. \qquad (5.15)$$

An alternative way to penalize the mismatch between σ_i and the binary label y_i is the per-point penalty

$$g_i = \left( \frac{1}{\sigma_i} \right)^{y_i} \left( \frac{1}{1 - \sigma_i} \right)^{1 - y_i}, \qquad (5.16)$$

which grows very large whenever σ_i disagrees with y_i. To tame these potentially enormous values, we pass each g_i through the logarithm

$$\log(g_i) = -y_i \log(\sigma_i) - (1 - y_i) \log(1 - \sigma_i), \qquad (5.17)$$
which, by design, converts very large numbers into considerably smaller ones.3
Finally, taking the average of all the terms in (5.17) across the entire dataset forms
the cross-entropy cost function
$$g(w_0, w_1) = -\frac{1}{p} \sum_{i=1}^{p} \left[ y_i \log(\sigma_i) + (1 - y_i) \log(1 - \sigma_i) \right]. \qquad (5.18)$$
To minimize this cost, we again set its partial derivatives (with respect to w_0 and w_1) to zero to arrive at the system of equations

$$
\begin{aligned}
\sum_{i=1}^{p} \frac{1}{1 + e^{-(w_0 + w_1 x_i)}} &= \sum_{i=1}^{p} y_i, \\
\sum_{i=1}^{p} \frac{x_i}{1 + e^{-(w_0 + w_1 x_i)}} &= \sum_{i=1}^{p} y_i x_i,
\end{aligned} \qquad (5.19)
$$
and solve it for optimal w0 and w1 . Although this new system of equations is less
complex-looking than the system in (5.14), it still possesses no known algebraic
solution that can be written in closed form. This is where optimization algorithms
(e.g., gradient descent) must be employed, as we discuss next.
3 Replacing g_i with its logarithm is permitted because log(·) is a monotonically increasing function over its domain. If g_i > g_j for some i and j, we will still have log(g_i) > log(g_j) after passing each term through the log(·) function.
The Gradient Descent Algorithm
Up to this point in the book, our method of minimizing a given cost function has
involved setting the partial derivatives of the function (with respect to its inputs) to
zero and solving the resulting system of equations for optimal input values. This
strategy was effective in minimizing the least squares cost functions associated with
linear regression in (4.9), (4.22), and (4.46). In each case, the resulting system was
linear in its unknown variables, making it easy to solve using basic linear algebra
manipulations. Despite the fact that these systems all had a unique solution, this
general strategy works even when the derivative system has multiple solutions.
Consider the single input function
$$g(w) = \frac{1}{6} w^6 - \frac{3}{5} w^5 + \frac{1}{4} w^4 + w^3 - w^2 + 2 \qquad (5.20)$$
for instance. The derivative of this polynomial function can be computed as
$$\frac{dg}{dw} = w^5 - 3w^4 + w^3 + 3w^2 - 2w = (w - 2)(w - 1)^2\, w\, (w + 1), \qquad (5.21)$$
which has multiple zeros at w = 2, w = 1, w = 0, and w = −1. The points at which
the derivative of a function becomes zero are often referred to as the function’s
stationary points. These may include local minima, local maxima, and saddle (or
inflection) points. The plot of g(·) in Fig. 5.5 shows two local minima at w = 2 and
w = −1, one local maximum at w = 0, and one saddle point at w = 1.
To identify which, if any, of these stationary points is the function’s global
minimum, we can evaluate g(·) at all of them and choose the one that returns the
smallest output value. Here, we have that g(2) = 1.47, g(1) = 1.82, g(0) = 2.00,
Fig. 5.5 (Left panel) The plot of the polynomial function g(·) defined in (5.20). (Right panel) The
plot of the derivative of g(·) computed in (5.21). The points at which the derivative function crosses
zero are its stationary points. See text for additional details
and g(−1) = 1.02. In this case, w = −1 returns the smallest output and is therefore
the function’s global minimum.4
The fact that we were able to factorize the derivative of g(·) in (5.21) allowed
us to determine its stationary points quickly and painlessly. This, however, is an
exception rather than the rule. In general, finding a function’s stationary points is
not a trivial task, as we saw with the least squares and cross-entropy cost functions
in (5.14) and (5.19), respectively. In such circumstances, a set of powerful numerical
optimization tools can come in handy to approximate the stationary points. In
what follows, we describe, via a simple example, one of the most commonly
used numerical optimization techniques in machine learning and deep learning: the
gradient descent algorithm.
Here, we introduce the gradient descent algorithm in a slow, step-by-step fashion
in pursuit of minimizing the function

$$g(w) = w^4 - w^2 - w - \sin(w), \qquad (5.22)$$

whose derivative

$$\frac{dg}{dw} = 4w^3 - 2w - 1 - \cos(w) \qquad (5.23)$$
has no easy-to-identify zeros. To estimate the stationary points of g(·) (or equiv-
alently the zeros of its derivative), the gradient descent algorithm is initialized
at a random point w [0] , which is then refined repeatedly through a series of
mathematically defined steps until a reasonable approximate solution is reached.
4 Note that this argument only works when the function g(·) is bounded from below. All of the cost
functions introduced in this book, including but not limited to the least squares and cross-entropy
cost functions we have seen so far, are specifically designed to be non-negative over their input
domain and are thus bounded from below.
The Gradient Descent Algorithm 99
Here, we start the algorithm at w[0] = 0 and use g′(·) to denote the derivative of g(·) for notational convenience. Since g′(0) = −2 does not happen to be zero, w[0] is not a minimum of g(·) and the algorithm will continue.
Next, we search for a new point denoted by w [1] that has to be a better
approximation of the function’s minimum than w[0] . In other words, we aim to
refine and replace w[0] with w [1] such that g(w [1] ) < g(w [0] ). The question then
becomes whether we should move to the left or right of w[0] to search for w [1] .
Luckily, the answer to this question is hidden in the mathematical definition of the derivative. Recall from basic calculus that the derivative of the function g(·) at w[0] can be approximated as

$$g'(w^{[0]}) \approx \frac{g(w^{[0]} + \varepsilon) - g(w^{[0]})}{\varepsilon}, \qquad (5.24)$$
where ε is a small positive number.5 When g′(w[0]) is negative (as is the case here), we have that g(w[0] + ε) < g(w[0]). Hence, stepping ε units to the right of w[0] would decrease the evaluation of g(·). On the other hand, when g′(w[0]) is positive,
we should move in the opposite direction (i.e., to the left) in order to reduce the
value of g(·). Using the mathematical sign function, we can combine these two
cases together and conclude that the point
$$w^{[1]} = w^{[0]} - \varepsilon \, \mathrm{sign}\!\left( g'(w^{[0]}) \right) \qquad (5.25)$$

approximates the minimum of g(·) more closely than w[0]. Noting that sign(t) = t/|t|, we can rewrite (5.25) as
$$w^{[1]} = w^{[0]} - \alpha^{[0]}\, g'(w^{[0]}), \qquad (5.26)$$

where we have denoted the term ε/|g′(w[0])| by α[0], which is typically referred to as the learning rate in the parlance of machine learning. The elegance of the formula for
updating w [0] in (5.26) is in the fact that it can be reused in a recursive manner to
update w [1] itself. At w [1] , if the derivative of g(·) remains negative, we continue
moving to the right in pursuit of an even better approximation to the function’s
minimum. Otherwise, if the derivative of g(·) suddenly becomes positive at w [1] , it
means that we have skipped the minimum that now lies to the left of w[1] . Again,
and in either case,
$$w^{[2]} = w^{[1]} - \alpha^{[1]}\, g'(w^{[1]}) \qquad (5.27)$$
5 In general, the smaller the value of ε the better the approximation. In the limit, as ε → 0, we have strict equality.
Table 5.1 The sequence of points created by the gradient descent algorithm to find the minimum
of the function g(w) = w^4 − w^2 − w − sin(w). The learning rate α[k] is set to 0.1 for all iterations
of the algorithm
k    w[k]    α[k]    g′(w[k])    w[k+1] = w[k] − α[k] g′(w[k])
0 0.0000 0.1 −2.0000 0.2000
1 0.2000 0.1 −2.3481 0.4348
2 0.4348 0.1 −2.4478 0.6796
3 0.6796 0.1 −1.8816 0.8677
4 0.8677 0.1 −0.7685 0.9446
5 0.9446 0.1 −0.1040 0.9550
6 0.9550 0.1 −0.0038 0.9554
7 0.9554 0.1 −8.8 × 10−5 0.9554
8 0.9554 0.1 −2.0 × 10−6 0.9554
9 0.9554 0.1 −4.7 × 10−8 0.9554
would get us closer to the true minimum of g(·). This process can be repeated to produce a sequence of points of the form

$$w^{[k+1]} = w^{[k]} - \alpha^{[k]}\, g'(w^{[k]}), \qquad (5.28)$$

each (in principle) a better approximation of the function's minimum than the one before it. The results of running this procedure on (5.22) with α[k] = 0.1 are summarized in Table 5.1.
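The update rule in (5.28) is only a few lines of code. The sketch below reproduces the iterations of Table 5.1 for the function in (5.22), with the same starting point and learning rate.

```python
import numpy as np

def g_prime(w):
    # Derivative of g(w) = w^4 - w^2 - w - sin(w), as in (5.23).
    return 4 * w**3 - 2 * w - 1 - np.cos(w)

w, alpha = 0.0, 0.1           # initial point and fixed learning rate
for k in range(10):
    print(f"k = {k}: w = {w:.4f}, g'(w) = {g_prime(w):.4f}")
    w = w - alpha * g_prime(w)   # the update rule in (5.28)
```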
Table 5.2 The sequence of points created by the gradient descent algorithm to find the minimum
of the function g(w) = w^4 − w^2 − w − sin(w). The learning rate α[k] is set to 1.0 for all iterations
of the algorithm. Here, ∞ represents numbers larger than the computational capacity of an average computer
k    w[k]    α[k]    g′(w[k])    w[k+1]
0 0.0 1.0 −2.0 2.0
1 2.0 1.0 27.4 −25.4
2 −25.4 1.0 −65624.5 65599.1
3 65599.1 1.0 −1.1 × 1015 1.1 × 1015
4 1.1 × 1015 1.0 −5.8 × 1045 5.8 × 1045
5 5.8 × 1045 1.0 7.6 × 10137 −7.6 × 10137
6 −7.6 × 10137 1.0 −∞ +∞
When the learning rate is set too large, as in Table 5.2, the iterates overshoot and the algorithm diverges, never reaching a stationary point. As a general rule of thumb, the smaller the learning rate is set the
lower the speed of convergence to the minimum. In Table 5.3, we summarize the
results of applying gradient descent to minimizing the function g(·) in (5.23) using
the relatively small learning rate of α [k] = 0.01. Because the learning rate is set too
small in this case, we will need over 100 iterations of the algorithm to get within the
same vicinity of the minimum as in Table 5.1.
Comparing the results reported in Tables 5.1, 5.2, and 5.3 indicates that choosing the learning rate for gradient descent must be handled with care. Otherwise, the
algorithm could either fail to converge to a minimum, or do so at a very slow pace. It
must be noted that a number of advanced variants of the gradient descent algorithm
exist wherein the learning rate is set automatically by the algorithm and adjusted
adaptively at each iteration by leveraging the local geometry of the cost function.
The inner-workings of these advanced algorithms are, for the most part, outside the
scope of this book. The interested reader is encouraged to consult [8] and references
therein.
Table 5.3 The sequence of points created by the gradient descent algorithm to find the minimum
of the function g(w) = w^4 − w^2 − w − sin(w). The learning rate α[k] is set to 0.01 for all iterations
of the algorithm
k    w[k]    α[k]    g′(w[k])    w[k+1]        k    w[k]    α[k]    g′(w[k])    w[k+1]
0 0 0.01 −2.0000 0.0200 50 0.9063 0.01 −0.4515 0.9108
1 0.0200 0.01 −2.0398 0.0404 51 0.9108 0.01 −0.4123 0.9149
2 0.0404 0.01 −2.0797 0.0612 52 0.9149 0.01 −0.3760 0.9187
3 0.0612 0.01 −2.1196 0.0824 53 0.9187 0.01 −0.3426 0.9221
4 0.0824 0.01 −2.1592 0.1040 54 0.9221 0.01 −0.3119 0.9253
5 0.1040 0.01 −2.1981 0.1260 55 0.9253 0.01 −0.2837 0.9281
6 0.1260 0.01 −2.2360 0.1483 56 0.9281 0.01 −0.2579 0.9307
7 0.1483 0.01 −2.2726 0.1710 57 0.9307 0.01 −0.2343 0.9330
8 0.1710 0.01 −2.3075 0.1941 58 0.9330 0.01 −0.2127 0.9351
9 0.1941 0.01 −2.3402 0.2175 59 0.9351 0.01 −0.1929 0.9371
10 0.2175 0.01 −2.3703 0.2412 60 0.9371 0.01 −0.1750 0.9388
11 0.2412 0.01 −2.3974 0.2652 61 0.9388 0.01 −0.1586 0.9404
12 0.2652 0.01 −2.4208 0.2894 62 0.9404 0.01 −0.1437 0.9418
13 0.2894 0.01 −2.4403 0.3138 63 0.9418 0.01 −0.1301 0.9431
14 0.3138 0.01 −2.4552 0.3384 64 0.9431 0.01 −0.1178 0.9443
15 0.3384 0.01 −2.4651 0.3630 65 0.9443 0.01 −0.1066 0.9454
16 0.3630 0.01 −2.4695 0.3877 66 0.9454 0.01 −0.0964 0.9463
17 0.3877 0.01 −2.4681 0.4124 67 0.9463 0.01 −0.0872 0.9472
18 0.4124 0.01 −2.4604 0.4370 68 0.9472 0.01 −0.0789 0.9480
19 0.4370 0.01 −2.4462 0.4615 69 0.9480 0.01 −0.0713 0.9487
20 0.4615 0.01 −2.4253 0.4857 70 0.9487 0.01 −0.0645 0.9494
21 0.4857 0.01 −2.3974 0.5097 71 0.9494 0.01 −0.0583 0.9500
22 0.5097 0.01 −2.3626 0.5333 72 0.9500 0.01 −0.0527 0.9505
23 0.5333 0.01 −2.3210 0.5565 73 0.9505 0.01 −0.0476 0.9510
24 0.5565 0.01 −2.2727 0.5792 74 0.9510 0.01 −0.0430 0.9514
25 0.5792 0.01 −2.2180 0.6014 75 0.9514 0.01 −0.0388 0.9518
26 0.6014 0.01 −2.1572 0.6230 76 0.9518 0.01 −0.0351 0.9521
27 0.6230 0.01 −2.0909 0.6439 77 0.9521 0.01 −0.0317 0.9524
28 0.6439 0.01 −2.0197 0.6641 78 0.9524 0.01 −0.0286 0.9527
29 0.6641 0.01 −1.9441 0.6835 79 0.9527 0.01 −0.0258 0.9530
30 0.6835 0.01 −1.8649 0.7022 80 0.9530 0.01 −0.0233 0.9532
31 0.7022 0.01 −1.7829 0.7200 81 0.9532 0.01 −0.0211 0.9534
32 0.7200 0.01 −1.6987 0.7370 82 0.9534 0.01 −0.0190 0.9536
33 0.7370 0.01 −1.6132 0.7531 83 0.9536 0.01 −0.0172 0.9538
34 0.7531 0.01 −1.5270 0.7684 84 0.9538 0.01 −0.0155 0.9539
35 0.7684 0.01 −1.4410 0.7828 85 0.9539 0.01 −0.0140 0.9541
36 0.7828 0.01 −1.3557 0.7964 86 0.9541 0.01 −0.0126 0.9542
37 0.7964 0.01 −1.2717 0.8091 87 0.9542 0.01 −0.0114 0.9543
38 0.8091 0.01 −1.1897 0.8210 88 0.9543 0.01 −0.0103 0.9544
39 0.8210 0.01 −1.1100 0.8321 89 0.9544 0.01 −0.0093 0.9545
40 0.8321 0.01 −1.0330 0.8424 90 0.9545 0.01 −0.0084 0.9546
Linear Classification with Multi-Dimensional Input

To handle classification datasets whose inputs are n-dimensional, we once again employ the linear model
$$f(\mathbf{x}) = w_0 + \mathbf{w}^T \mathbf{x} \qquad (5.29)$$
with the scalar w0 and the n×1 vector w as parameters. It is notationally convenient
to temporarily redefine the vectors x and w to include 1 and w0 as their first entry,
respectively, and write (5.29) even more compactly as
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}. \qquad (5.30)$$
The cross-entropy cost function in (5.18) then takes the form

$$g(\mathbf{w}) = -\frac{1}{p} \sum_{i=1}^{p} \left[ y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1 - y_i) \log\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}_i) \right) \right]. \qquad (5.31)$$

Minimizing this cost calls for the gradient descent update

$$\mathbf{w}^{[k+1]} = \mathbf{w}^{[k]} - \alpha^{[k]}\, \nabla g(\mathbf{w}^{[k]}), \qquad (5.32)$$
where the scalar w [k] is replaced by its vector analog w[k] , and the derivative
function g (·) is replaced by the gradient function ∇ g(·). As discussed previously,
the gradient descent algorithm can be initialized at any random point w[0] , and
refined sequentially using (5.32) until a “good enough” approximation of the
function’s minimum is reached. In practice, we halt the algorithm after a maximum
number of iterations are taken or when the norm of the gradient has fallen below
some small user-defined value, whichever comes first (Fig. 5.7).
Using the derivative formula in (5.8) along with the chain rule, the gradient of the cross-entropy cost in (5.31) can be computed as

$$\nabla g(\mathbf{w}) = \frac{1}{p} \sum_{i=1}^{p} \left( \sigma(\mathbf{w}^T \mathbf{x}_i) - y_i \right) \mathbf{x}_i. \qquad (5.34)$$
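Putting (5.31), (5.32), and (5.34) together yields a complete gradient descent classifier in a few lines. The sketch below trains on the toy dataset from (5.1), prepending a constant 1 to each input as per the convention in (5.30); the learning rate and iteration count are assumptions made for this illustration.

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient(w, X, y):
    # Gradient of the cross-entropy cost, as in (5.34); each row
    # of X is an input vector whose first entry is fixed to 1.
    p = len(y)
    return (X.T @ (sigma(X @ w) - y)) / p

# The toy dataset from (5.1).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 2.5])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
alpha = 1.0                   # assumed learning rate for this sketch
for k in range(2000):
    w = w - alpha * gradient(w, X, y)   # update rule (5.32)
print(w)                      # trained bias and slope
```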
To choose a proper learning rate for gradient descent, it is advisable to run the
algorithm for a limited number of iterations using a range of different values
for α [k] and plot the resulting cost function evaluations at each step. We have
done so in the left panel of Fig. 5.8 for three learning rate values: 10, 1, and
0.1. As can be seen in the figure, α [k] = 10 is evidently too large, causing the
algorithm to diverge. The learning rate of 0.1 is too small on the other hand,
Fig. 5.8 (Left panel) The cost function evaluations resulting from three runs of gradient descent for minimizing the cross-entropy cost associated with the dataset in Fig. 5.9. The runs were initiated
at the same starting point, but with different learning rates. The vertical axis is logarithmic in
scale. (Right panel) The linear boundary separating the two classes of data is characterized by the
equation in (5.36) and drawn as a dashed black line
is in close proximity to the true minimum of the cost function. The classification boundary is then given by the equation in (5.36), drawn as a dashed black line in the right panel of Fig. 5.8.

Linear Classification with Multiple Classes
So far in the chapter, we have focused our attention on binary classification where
the output takes on only one of two possible values or outcomes. In practice,
however, classification problems with more than two classes are just as common
as their binary counterparts. For instance, in many oncology applications, we are
interested in classifying certain tissue images into one of three categories: “normal,”
“benign,” or “malignant.” Once a cancer diagnosis is made, we may be interested
in evaluating its aggressiveness by assigning it one of four pathological grades:
“low-grade” (G1), “intermediate-grade” (G2), “high-grade” (G3), or “anaplastic”
(G4). The higher the tumor grade the more quickly it grows and spreads throughout
the body. Hence, accurate tumor grading is key to devising the optimal treatment
approach.
In this section, we discuss how the binary classification framework we developed
previously can be extended to handle multi-class problems such as the examples
mentioned above. In general, a multi-class classification problem involves m > 2
classes. Focusing on the j th class for the moment, we already know how to
differentiate it from the rest of the data using a binary classifier. This can be done by
lumping together every other class of data (except the j th one) into a new category
called the “not j ” class. Next, we temporarily assign the label “1” to the j th class
and the label “0” to the “not j ” class and train a linear classifier to separate the two
as discussed in Example 5.1. Denoting the bias and slope parameters of this linear
classifier by w0,j and wj , the equation of the separating boundary can be written as
$$w_{0,j} + \mathbf{w}_j^T \mathbf{x} = 0. \qquad (5.37)$$
It can be shown, using elementary linear algebra calculations, that the expression

$$D_j(\mathbf{x}) = \frac{w_{0,j} + \mathbf{w}_j^T \mathbf{x}}{\left\| \mathbf{w}_j \right\|_2} \qquad (5.38)$$

computes the distance from the input point x to the linear boundary in (5.37). The distance metric in (5.38) is positive if x lies on the positive side of the boundary (where class "1" resides) and negative when x falls on the negative side of the boundary (where class "0" resides).
Repeating the process outlined above m times, once for each class of data, we
end up with the m linear functions f1 (·) through fm (·). We can then use these
functions to compute the m corresponding distance values D1 through Dm . The
index associated with the largest distance is then declared as the predicted class for the input x.
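A minimal sketch of this one-versus-rest fusion rule, assuming the m bias terms and weight vectors have already been trained; the function name and interface are hypothetical.

```python
import numpy as np

def predict_class(x, biases, weights):
    """Evaluate the signed distance (5.38) to each of the m class
    boundaries and return the index of the largest one. `biases`
    is a length-m array of w_{0,j}; `weights` is an m-by-n array
    whose jth row is w_j."""
    distances = [(w0 + w @ x) / np.linalg.norm(w)
                 for w0, w in zip(biases, weights)]
    return int(np.argmax(distances))
```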
$$
w_{0,2} = -3.87, \quad \mathbf{w}_2 = \begin{bmatrix} 4.46 \\ -5.45 \end{bmatrix}, \qquad
w_{0,3} = -8.19, \quad \mathbf{w}_3 = \begin{bmatrix} 1.10 \\ 6.29 \end{bmatrix}. \qquad (5.42)
$$
Finally, the distance function associated with each binary classifier can be computed via (5.38), as shown in (5.43).
Problems
$$g(w) = w^6 - w^5 + w^4 - w \qquad (5.44)$$
References
1. Shor NZ. Minimization methods for non-differentiable functions. Berlin: Springer; 1985
2. Verhulst PF. Mathematical researches into the law of population growth increase. Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles. 1845;18:8
3. Benzekry S, Lamont C, Beheshti A, et al. Classical mathematical models for description and prediction of experimental tumor growth. PLoS Comput Biol. 2014;10(8):e1003800
4. Vaidya V, Alexandro F. Evaluation of some mathematical models for tumor growth. Int J Biomed
Comput. 1982;13(1):19–36
5. Hsieh Y, Lee J, Chang H. SARS epidemiology modeling. Emerg Infect Dis. 2004;10(6):1165–1168
6. Wang P, Zheng X, Li J, et al. Prediction of epidemic trends in COVID-19 with logistic model
and machine learning techniques. Chaos, Solitons Fractals. 2020;139:110058
7. The World Health Organization (WHO) COVID-19 global dataset. Accessed Apr 2022. https://
covid19.who.int/data
8. Watt J, Borhani R, Katsaggelos AK. Machine learning refined: foundations, algorithms, and
applications. Cambridge: Cambridge University Press; 2020
Chapter 6
From Feature Engineering to Deep
Learning
The models we have studied thus far in the book have all been linear. In this
chapter, we begin our foray into nonlinear models by formally introducing features
as mathematical functions that transform the input data. We discuss two main
approaches to defining features: feature engineering that is driven by the domain
knowledge of human experts and feature learning that is fully driven by the data
itself. A discussion of the latter approach naturally leads to the introduction of deep
neural networks as the main driver of recent advances in the field.
Feature Engineering for Nonlinear Regression

Recall from Chap. 4 that a linear regression model with n inputs takes the form

$$f(x_1, x_2, \ldots, x_n) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n, \qquad (6.1)$$

which can be written more compactly as

$$f(\mathbf{x}) = w_0 + \mathbf{w}^T \mathbf{x} \qquad (6.2)$$
if we arrange all the inputs x1 through xn into a single input vector denoted as
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad (6.3)$$
and all the parameters w1 through wn into a single parameter vector denoted as
$$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}. \qquad (6.4)$$
For many real-world regression datasets, however, the linear model in (6.2) is not
capable of capturing the complex nonlinear relationship that may exist between the
input and the output. One way to solve this issue is by injecting nonlinearity into this
model via what are called features in the parlance of machine learning. A feature
h(x) is a nonlinear mathematical function of the input x; the feature in (6.6), for instance, is a trigonometric one. Notice from (6.6) that a feature does not necessarily have to involve all the inputs x_1 through x_n, but in general it can. A nonlinear regression model, in general, can employ m such features h_1 through h_m:

$$f(\mathbf{x}) = v_0 + v_1 h_1(\mathbf{x}) + v_2 h_2(\mathbf{x}) + \cdots + v_m h_m(\mathbf{x}), \qquad (6.7)$$

which can be written compactly as

$$f(\mathbf{x}) = v_0 + \mathbf{v}^T \mathbf{h}(\mathbf{x}) \qquad (6.8)$$

by
denoting

$$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{bmatrix} \qquad (6.9)$$

and

$$\mathbf{h}(\mathbf{x}) = \begin{bmatrix} h_1(\mathbf{x}) \\ h_2(\mathbf{x}) \\ \vdots \\ h_m(\mathbf{x}) \end{bmatrix}. \qquad (6.10)$$
Regardless of how the features in (6.10) are chosen, the steps we take to formally
resolve the nonlinear regression model in (6.7) are entirely similar to what we saw
in Chap. 4. That is, we form the least squares cost function

$$g(v_0, \mathbf{v}) = \frac{1}{p} \sum_{i=1}^{p} \left( v_0 + \mathbf{v}^T \mathbf{h}(\mathbf{x}_i) - y_i \right)^2 \qquad (6.11)$$

over a regression dataset consisting of p input–output pairs $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_p, y_p)$.
We then minimize this cost function by setting the derivative of g with respect to v_0 and the gradient of g with respect to v equal to zero simultaneously. Solving the resulting linear system reveals the optimal values for v_0 and v. The only issue that remains is determining appropriate nonlinear functions to form the feature vector in (6.10). Let us explore this issue further through an example.
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad (6.12)$$
Fig. 6.1 Figure associated with Example 6.1. See text for details
Our proposed feature did its job: the input–output relationship that was
nonlinear in the original space has become linear in the feature space.
In Example 6.1, we relied on what we knew about the nature of bacterial growth
to determine an appropriate nonlinear feature transformation (the sigmoid function)
that could linearize the relationship between the transformed input and the output.
This is an instance of what is more broadly referred to as feature engineering,
wherein the functional form of nonlinearities is determined (or engineered) by
humans through their expertise, domain knowledge, intuition about the problem at
hand, etc. A properly engineered feature (or set of features) is one that provides
a good linear fit in the feature space, wherein the input has undergone nonlinear
feature transformation.
Feature Engineering for Nonlinear Classification

Recall that a linear classifier separates two classes of data1 using a boundary of the form

$$f(\mathbf{x}) = w_0 + \mathbf{w}^T \mathbf{x} = 0, \qquad (6.14)$$
where we have used the notation in (6.3) and (6.4) to write the equation of the
boundary more compactly. When the two classes of data are not linearly separable,
we can adjust this equation by injecting nonlinearity into it in an entirely similar
fashion as we did in the previous section. Specifically, we replace the linear model
in (6.14) with a nonlinear model of the form

$$v_0 + \mathbf{v}^T \mathbf{h}(\mathbf{x}) = 0, \qquad (6.15)$$
where the weight vector v and the feature vector h(x) are defined in (6.9) and (6.10),
respectively. Next, we need to define a proper cost function to minimize in order
to resolve this nonlinear model. As discussed in Sect. "The Cross-Entropy Cost Function", one appropriate cost function to use for classification is the cross-
entropy cost function that can be written in this case as
$$g(v_0, \mathbf{v}) = -\frac{1}{p} \sum_{i=1}^{p} \left[ y_i \log \sigma\!\left( v_0 + \mathbf{v}^T \mathbf{h}(\mathbf{x}_i) \right) + (1 - y_i) \log\left( 1 - \sigma\!\left( v_0 + \mathbf{v}^T \mathbf{h}(\mathbf{x}_i) \right) \right) \right] \qquad (6.16)$$
and minimized using gradient descent.
1 In this section, we only consider the case of two-class or binary classification. Multi-class problems can be handled analogously, as discussed in Sect. "Linear Classification with Multiple Classes".
$$h_1(x_1, x_2) = x_1^2, \qquad h_2(x_1, x_2) = x_2^2. \qquad (6.17)$$
As shown in Fig. 6.3, once we transform the inputs using the features defined
in (6.17), the two classes of data that were not linearly separable in the original
input space become linearly separable in the feature space.
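A small numerical sketch of this feature map: the synthetic data, circular boundary radius, and random seed below are all assumptions made for illustration, not the data of Example 6.2.

```python
import numpy as np

# Points separated by a circle in the original (x1, x2) space become
# linearly separable in the (x1^2, x2^2) feature space of (6.17).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))            # synthetic inputs
y = (X[:, 0]**2 + X[:, 1]**2 < 1.5).astype(int)  # circular boundary

H = X**2   # h1 = x1^2, h2 = x2^2 applied to every point

# In feature space the boundary h1 + h2 = 1.5 is a straight line,
# so this simple linear rule classifies every point correctly.
print(np.mean((H.sum(axis=1) < 1.5) == y))       # prints 1.0
```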
Feature Learning
Fig. 6.3 Figure associated with Example 6.2. A well-engineered set of features as defined in
(6.17) provide good nonlinear separation in the problem’s original input space (left panel) and,
simultaneously, good linear separation in the feature space (right panel). See text for further details
Unfortunately, most datasets encountered in medicine are too high-dimensional to visualize. Besides, more often than not we
have too little or no knowledge of the phenomenon that governs the problem of
interest. Even with prior knowledge of the phenomenon under study, the process of
engineering features is non-trivial and time-consuming as it will typically involve
multiple rounds of discussion and refinement between medical experts and machine
learning developers [2]. Motivated by these challenges, in this section, we introduce
an alternative approach to feature engineering, in which features are learned directly
from the data itself without the need for human involvement. This new approach,
commonly referred to as feature learning, allows us to automate the manual (and
somewhat tedious) task of feature engineering.
The key idea behind feature learning is to use parameterized features in (6.7)
whose parameters are tuned alongside other model parameters during training. In
other words, in a feature learning setup, the nonlinear regression model in (6.7) can
be adjusted and written as

$$f(\mathbf{x}) = v_0 + v_1\, h(\mathbf{x}; \boldsymbol{\theta}_1) + v_2\, h(\mathbf{x}; \boldsymbol{\theta}_2) + \cdots + v_m\, h(\mathbf{x}; \boldsymbol{\theta}_m), \qquad (6.18)$$

where each feature h(·; θ_j) now carries its own set of internal parameters θ_j that are tuned during training. A popular choice of parameterized feature is the artificial neuron, illustrated in Fig. 6.4. An artificial neuron is a parameterized function because each input x_1 through x_n is first multiplied by an adjustable weight
before all of these weighted inputs are aggregated inside a summation unit shown as
a small yellow circle in Fig. 6.4. Finally, an artificial neuron is a nonlinear function
because it consists of an “activation” unit (shown as a blue circle in the figure)
whose output is a nonlinear transformation of the linearly weighted combination
w1 x1 + · · · + wn xn .2 Stitching all the pieces together, an artificial neuron can be
modeled as

$$h(\mathbf{x}) = \phi\left( w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \right), \qquad (6.19)$$

where φ(·) is a nonlinear activation function. In early models of the artificial neuron, the Heaviside step function

$$\phi(\alpha) = \begin{cases} 0, & \alpha < 0 \\ 1, & \alpha \geq 0 \end{cases} \qquad (6.20)$$
was used as activation since biological neurons were thought to act like a digital
switch. As long as the input α falls below some activation threshold, the switch will
remain off and the output is 0. Once the input goes above the threshold, the switch
gets turned on, producing an output equal to 1. This modeling was compatible with
the belief that a biological neuron would not communicate with other downstream
neurons unless it got excited or activated by a large enough input coming through
its dendrites.
As we discussed in Chap. 5, the flat and discontinuous shape of the Heaviside step
function creates fatal problems when we try to optimize machine learning models
involving this function using gradient descent. Luckily, replacing the Heaviside step
function with its smooth approximation, i.e., the logistic sigmoid function
$$\phi(\alpha) = \frac{1}{1 + e^{-\alpha}}, \qquad (6.21)$$
ameliorates these optimization problems. For this reason, until the beginning of the
twenty first century, most neural network models used the logistic sigmoid function
or its close relative, the hyperbolic tangent function
$$\phi(\alpha) = \frac{e^{2\alpha} - 1}{e^{2\alpha} + 1} \qquad (6.22)$$
as activation. Still as can be seen in Fig. 6.5, both the logistic and hyperbolic tangent
functions are almost flat when the input is far away from the origin. This means that
the derivative is almost zero when the input happens to be somewhat large (in either
positive or negative direction). This issue, sometimes referred to as the vanishing
gradient problem, hinders proper parameter tuning and limits practical use of the
activation functions in (6.21) and (6.22).
More recently, a new breed of nonlinear activation functions based on the rectified linear unit (ReLU) function

$$\phi(\alpha) = \max(0, \alpha) \qquad (6.23)$$
has shown better optimization performance compared to the previous and more
biologically plausible options. Notice that with the ReLU function in (6.23) the
derivative never vanishes as long as the input remains positive (see the bottom-left
panel of Fig. 6.5). However, negative inputs can still create problems. To remedy
this issue, a variant of the original ReLU called the leaky ReLU
$$\phi(\alpha) = \begin{cases} \alpha, & \alpha \geq 0 \\ \tau \alpha, & \alpha < 0 \end{cases} \qquad (6.24)$$
was introduced, wherein the left “hinge” is no longer flat, but at a small incline (see
the bottom-middle panel of Fig. 6.5). Another popular variant of the ReLU is the
so-called maxout activation function defined as

$$\phi(\alpha) = \max\left( \tau_1 \alpha + \tau_2,\; \tau_3 \alpha + \tau_4 \right), \qquad (6.25)$$
which takes the maximum of two linear combinations of the input. Empirically,
artificial neural networks employing the maxout activation function have fewer tech-
nical issues during optimization and often converge faster to a solution. However, it
has more internal parameters to tune. An instance of the maxout activation function
is plotted in the last panel of Fig. 6.5.
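For reference, the activation functions in (6.21) through (6.25) can be sketched in a few lines of NumPy; the leaky ReLU slope and maxout parameters below follow the values quoted in the caption of Fig. 6.5, and the maxout form is the two-piece version named in (6.25).

```python
import numpy as np

def logistic(a):                      # the logistic sigmoid in (6.21)
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):                          # the hyperbolic tangent in (6.22)
    return (np.exp(2 * a) - 1) / (np.exp(2 * a) + 1)

def relu(a):                          # the rectified linear unit in (6.23)
    return np.maximum(0.0, a)

def leaky_relu(a, tau=0.1):           # the leaky ReLU in (6.24)
    return np.where(a >= 0, a, tau * a)

def maxout(a, t1=3, t2=1, t3=-2, t4=-1):
    # Maximum of two linear combinations of the input, as in (6.25).
    return np.maximum(t1 * a + t2, t3 * a + t4)
```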
Arranging several artificial neurons (like the one shown in Fig. 6.4) in a single column and connecting their respective outputs into another summation unit creates a single-layer neural network, as illustrated in Fig. 6.6. Note that this is precisely
the graphical representation of the model in (6.18), barring the bias parameter v0 .
To avoid clutter in the figure, the parameters associated with the line segments
connecting the inputs to the artificial neurons are stored in θ1 through θm . Different
settings of these parameters define distinct features. Hence, we can tune them
together with the external parameters v0 through vm during model training and by
minimizing an appropriate cost function depending on the problem at hand.
Fig. 6.5 An illustration of several historical and modern activation functions used in artificial
neural networks. (Top-left panel) The Heaviside step function defined in (6.20). (Top-middle panel)
The logistic sigmoid function defined in (6.21). (Top-right panel) The hyperbolic tangent function
defined in (6.22). (Bottom-left panel) The rectified linear unit (ReLU) function defined in (6.23).
(Bottom-middle panel) The leaky ReLU function defined in (6.24) with τ set to 0.1. (Bottom-right
panel) The maxout activation function defined in (6.25) with (τ1 , τ2 , τ3 , τ4 ) set to (3, 1, −2, −1)
Fig. 6.7 A descriptive recipe for creating a single-layer neural network (top row) and a two-layer
neural network (bottom row)
Fig. 6.8 A three-layer neural network illustrated. Note that the number of artificial neurons in
each layer (denoted by m1 , m2 , and m3 ) need not be the same
This is also illustrated in the top row of Fig. 6.7. Notice that nothing stops us from
continuing this process further as depicted in the bottom row of Fig. 6.7, where the
output of the single-layer neural network is passed through an extra pair of nonlinear
activation and linear combination modules, creating a two-layer neural network.
We can continue this process as many times as we wish to create a general multi-
layer neural network, also known as a multi-layer perceptron. A neural network with
several (typically more than three) layers is considered a deep network in the jargon
of machine learning.
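A minimal sketch of the forward pass through such a fully connected multi-layer network, using ReLU activations; the layer sizes and random parameters below are placeholders for illustration, not a trained model.

```python
import numpy as np

def forward(x, layers, v0, v):
    """Forward pass through a multi-layer perceptron: each layer is
    a (W, b) pair applying an affine map followed by the ReLU
    activation; a final linear combination produces the output."""
    a = x
    for W, b in layers:
        a = np.maximum(0.0, W @ a + b)   # one hidden layer's activation
    return v0 + v @ a                    # final linear combination

# A three-layer network with n = 4 inputs and m1 = m2 = m3 = 5 neurons.
rng = np.random.default_rng(1)
layers = [(rng.standard_normal((5, 4)), rng.standard_normal(5)),
          (rng.standard_normal((5, 5)), rng.standard_normal(5)),
          (rng.standard_normal((5, 5)), rng.standard_normal(5))]
v0, v = 0.0, rng.standard_normal(5)
print(forward(rng.standard_normal(4), layers, v0, v))
```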
In Fig. 6.8, we show the graphical representation of a three-layer neural network
that is analogous to the single-layer version shown in Fig. 6.6. Here, each of the three
layers of artificial neurons separating the input layer on the left from the output on
the right is called a hidden layer. The rationale behind this naming convention is that
an outside observer can only “see” what goes inside the network (input) and what
comes out of it (output), but not any intermediate layer in between.
Figure 6.8 also illustrates why multi-layer perceptrons are considered fully
connected network architectures: because every unit in one layer is connected to
every unit in the following layer. An important question to ask at this point is: how
does using “deeper” neural networks benefit us? Let us explore this through a simple
example.
Fig. 6.9 Figure associated with Example 6.3. See text for details
Example 6.3 illustrates intuitively why deeper neural networks are superior to
shallower ones: because they can represent much more complex nonlinear functions.
This comes at a cost, however. Deep networks have more internal parameters and,
generally speaking, are more difficult to optimize notwithstanding the fact that
computation has become drastically faster and cheaper over the last decade. Next,
we delve deeper into the optimization of deep neural networks.
Optimization of Neural Networks

While algorithms for minimizing neural network cost functions exist in a litany of
forms, the vast majority of them are built upon a few principal foundations. First
and foremost, these algorithms use the cost function’s gradient just like the vanilla
gradient descent algorithm introduced in Sect. “The Gradient Descent Algorithm” to
minimize the cross-entropy cost for linear classification. In (5.34), we computed the
gradient manually and in closed algebraic form. With neural networks, however, this
becomes an extremely tedious task due to a large number of parameters involved as
well as the compositional structure of these networks that requires the repeated use
of the chain rule.3
Fortunately, by using a so-called automatic differentiator, it is possible to
calculate gradients automatically and with ease, just as it is possible to multiply two
large numbers using a conventional calculator. The inner-workings of an automatic
differentiator are, for the most part, outside the scope of this book.4 However, it is
worthwhile to mention that a specific mode of automatic differentiation (the reverse
mode, to be precise) is typically referred to as backpropagation in the machine
learning literature. In other words, the backpropagation algorithm is the name given
to the automatic computation of gradients for cost functions involving artificial
neural networks.
Another common theme shared by most neural network optimizers is the
use of gradient descent (or other optimizers) in “stochastic mode.” In stochastic
optimization, we do not use the entire training data in order to compute the gradient
of the cost function. Notice, from (4.22) and (5.18), where the least squares cost for
regression and the cross-entropy cost for classification are defined, that in both cases
the cost function can be decomposed over individual data points. In other words, we
can write a generic regression or classification cost function g(w) as
3 According to the chain rule, if we have y = f_1(u) and u = f_2(x), then the derivative of the composition of f_1 and f_2, i.e., f_1(f_2(x)), can be found as dy/dx = (dy/du) × (du/dx).
$$g(\mathbf{w}) = \frac{1}{p} \sum_{i=1}^{p} g_i(\mathbf{w}), \qquad (6.27)$$
where g1 through gp are individual cost functions associated with each of the p data
points. To compute the gradient of g, we can write
$$\nabla g(\mathbf{w}) = \nabla \left( \frac{1}{p} \sum_{i=1}^{p} g_i(\mathbf{w}) \right) = \frac{1}{p} \sum_{i=1}^{p} \nabla g_i(\mathbf{w}). \qquad (6.28)$$
This means that the full (or batch) gradient is the summation of the gradients
associated with each individual data point. Based on this observation, it is fair to
ask what would happen if instead of taking one descent step in g using the full
gradient, we took a sequence of p descent steps in g1 , g2 , . . . , gp in a sequential
manner. That is, we first descend in g1 using ∇g1 (w), then in g2 using ∇g2 (w), and
so forth. It turns out that this approach provides faster convergence to the minima of
g in practice.
In general, we can group multiple data points into a mini-batch and take a descent step in the cost function associated with the entire mini-batch. Suppose we partition the full training set $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_p, y_p)\}$ into T non-overlapping subsets or mini-batches of roughly the same size, represented by $\mathcal{B}_1$ through $\mathcal{B}_T$. In this approach, we decompose the full gradient over each mini-batch as
$$\nabla g(\mathbf{w}) = \nabla \left( \frac{1}{p} \sum_{i=1}^{T} \sum_{j \in \mathcal{B}_i} g_j(\mathbf{w}) \right) = \frac{1}{p} \sum_{i=1}^{T} \nabla \left( \sum_{j \in \mathcal{B}_i} g_j(\mathbf{w}) \right) \qquad (6.29)$$
and take descent steps sequentially, first in $\sum_{j \in \mathcal{B}_1} g_j(\mathbf{w})$, then in $\sum_{j \in \mathcal{B}_2} g_j(\mathbf{w})$, and so on. The optimal mini-batch size varies from problem to problem but in most
cases is set relatively small compared to the full size of the training dataset. Note
that with stochastic gradient descent the mini-batch size equals 1.
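The mini-batch scheme in (6.29) can be sketched as follows; grad_i is a hypothetical callback that returns the summed gradient over a given set of data-point indices, and the learning rate, batch size, and epoch count are illustrative defaults. With batch_size set to 1, the loop reduces to stochastic gradient descent.

```python
import numpy as np

def minibatch_gradient_descent(grad_i, w, p, alpha=0.1,
                               batch_size=8, epochs=10, seed=0):
    """A minimal sketch of mini-batch descent following (6.29):
    grad_i(w, idx) returns the summed gradient of the per-point
    costs over the data points indexed by idx."""
    rng = np.random.default_rng(seed)
    indices = np.arange(p)
    for epoch in range(epochs):
        rng.shuffle(indices)                   # reshuffle each epoch
        for start in range(0, p, batch_size):
            batch = indices[start:start + batch_size]
            w = w - alpha * grad_i(w, batch) / len(batch)
    return w
```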
Fig. 6.11 Figure associated with Example 6.4. See text for further details
each fold taking a turn as the validation set, and averaging the resulting validation errors in the end. This validation scheme, commonly known as k-fold
cross-validation, allows the model to use a larger share of the data in exchange for
increased computation. In general, the smaller the data the larger k should be set. In
most extreme cases when the data is severely limited, it is recommended to set k to
its maximum possible value (i.e., the number of points in the training set). By doing
so, we will leave only one data point out at a time for validation. This particular
setting is called leave-one-out cross-validation.
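A sketch of the k-fold procedure just described; train_and_score is a hypothetical callback that trains a model on one split and returns its validation error on the other.

```python
import numpy as np

def k_fold_errors(X, y, k, train_and_score, seed=0):
    """Split the p data points into k folds, train on k-1 of them,
    validate on the held-out fold, and average the k validation
    errors. Setting k = len(y) gives leave-one-out cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_score(X[train], y[train], X[val], y[val]))
    return float(np.mean(errors))
```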
Cross-validation provides a way to answer the question we posed in the
beginning of this section: how should we choose the right number of hidden layers
to include in a neural network model? In short, to design a well-performing neural
network architecture for use with a given dataset, we cross-validate an array of
choices, selecting the model that results in the lowest cross-validation error. It is
important to remember that in building a neural network architecture, our design
choices, in addition to the network’s depth, include the number of artificial neurons
per layer as well as the nonlinear form of the activation function.
Problems
6.4 Explain why modern activation functions based on the rectified linear unit (ReLU) have largely replaced the older ones.
6.5 Counting the Parameters of a Multi-layer Neural Network
(a) Find the total number of adjustable parameters in the single-layer neural
network displayed in Fig. 6.6. Express your answer in terms of the dimension of
the input n and the number of artificial neurons m in the network’s only hidden
layer.
(b) Find the total number of adjustable parameters in the three-layer neural network
displayed in Fig. 6.8. Express your answer in terms of the dimension of the input
n and the number of artificial neurons in each hidden layer, i.e., m1 , m2 , and m3 .
(c) Using your answers to part (a) and part (b), find a general formula for computing
the total number of adjustable parameters in an ℓ-layer neural network. Once
again, express your answer in terms of the dimension of the input n as well as
the number of artificial neurons in each hidden layer.
6.6 Backpropagation for a Single-Layer Neural Network
Compute the gradient of a cross-entropy cost function associated with a single-
layer neural network. The cross-entropy cost function is defined in (5.31). For
simplicity, assume the hidden layer of the network consists only of two artificial
neurons.
6.7 Backpropagation for a Two-Layer Neural Network
Compute the gradient of a least squares cost function associated with a two-layer
neural network. The least squares cost function is defined in (4.22). For simplicity,
assume both hidden layers of the network consist only of two artificial neurons each.
References
1. Lin J, Lee SM, Lee HJ, Koo YM. Modeling of typical microbial cell growth in batch culture.
Biotechnol Bioprocess Eng. 2000;5(5):382–85
2. Borhani S, Borhani R, Kajdacsy-Balla A. Artificial intelligence: a promising frontier in bladder
cancer diagnosis and outcome prediction. Crit Rev Oncol Hematol. 2022;171:103601. https://
doi.org/10.1016/j.critrevonc.2022.103601
3. Watt J, Borhani R, Katsaggelos AK. Machine learning refined: foundations, algorithms, and
applications. Cambridge: Cambridge University Press; 2020
Chapter 7
Convolutional and Recurrent Neural
Networks
As we saw in the previous chapter, artificial neural networks (and more precisely
single- and multi-layer perceptrons) are powerful tools for modeling nonlinear
input–output relationships. For instance, we may seek to uncover the relationship
between a patient’s lab test results (input) and the likelihood of readmission to
hospital in near future (output)—if such relationship exists—using a single-layer
perceptron model like the one shown in Fig. 7.1. It is important to note that such a
model is not sensitive to the order in which the input is fed to it. Here, the first four
inputs are liver function test results, and the next four are urine test results. If we
switched this around and fed the urine test results first (as inputs 1 through 4) and the
liver function test results last (as inputs 5 through 8), nothing would fundamentally
change with respect to the underlying model that would be trained using this data.
In fact, with fully connected perceptrons, there is no such thing as the first input
versus the last input since the order here is completely arbitrary.
This, however, is not always the case. Sometimes, the input data will have
some sort of structure that can—and should—be leveraged when solving machine
learning and deep learning problems involving that data type. In other words, we
cannot simply switch the input around and expect the model to perform equally
well. Images and text are prime examples of such data.
As discussed in Sect. “Imaging Data”, the information in an image is stored
over a rectangular grid (matrix) at small square units (entries) called pixels. The
pixel values alone are, for the most part, of little to no use without knowledge
of their exact location on the grid. To see why this is the case, compare the two
images shown in Fig. 7.2, wherein the image on the right has the exact same pixel
values as the one on the left. By shuffling the pixels around, however, all the spatial
Fig. 7.1 A single-layer perceptron that takes as input four liver function test results (albumin, ALP,
ALT, and AST levels) and four urine test results (glucose, nitrite, pH, and urobilinogen levels), and
outputs the likelihood of the patient’s readmission to the hospital in the near future
information encoded in the original image is lost, and the resulting scrambled image looks nothing like the original, despite having the exact same pixel values.

Fig. 7.2 The image in the right panel was created by a random shuffling of the rows in the original X-ray on the left. While the two images share the exact same pixel values, virtually all the useful medical information in the original image is lost after altering the spatial placement of the pixels
Similarly, text data (as well as a host of other data types such as genome-
sequencing data, time-series data, etc.) possess a special sequential structure that
must not be ignored during modeling. For example, a physician instruction note
that reads “take 40 mg omeprazole before breakfast and 10 mg atorvastatin at night”
would completely lose its meaning if we were to shuffle the words in it at random
(see the top panel of Fig. 7.3), or worse, would take on an intelligible but different
meaning that could be detrimental to the patient’s health (see the bottom panel of
Fig. 7.3).
Fig. 7.3 Different word permutations of a physician note instructing the patient to take two medications: one over-the-counter drug for acid reflux (omeprazole) and one prescribed medication for high cholesterol (atorvastatin). While the scrambled version in the middle row is unintelligible, the permutation in the bottom row changes the meaning of the original order in a way that can be harmful to the patient
For the reasons detailed above, generic multi-layer perceptrons are not suitable
for modeling imaging and text data. In this chapter, we discuss two popular neural
network architectures that are designed respectively to handle imaging and text data
as input, namely, convolutional neural networks and recurrent neural networks.
The Convolution Operation

Fig. 7.4 The daily number of new cases of Covid-19 in Cook County, Illinois, as reported by the New York Times [1]

Consider a noisy time series like the one shown in Fig. 7.4. We can model each observed value $x_n$ as the sum of an underlying signal value $s_n$ and an additive noise term $\epsilon_n$, as

$$x_1 = s_1 + \epsilon_1, \quad x_2 = s_2 + \epsilon_2, \quad \ldots, \quad x_N = s_N + \epsilon_N. \tag{7.1}$$
To recover the signal s from the observation x, we make two reasonably realistic assumptions: first, that the noise ε has no bias (and thus has zero mean), and second, that the underlying signal s is relatively smooth. The latter is sometimes called the local smoothness assumption in the jargon of machine learning. This assumption is valid in the case of the Covid time-series data in Fig. 7.4 because, in the absence of noise, we should expect the number of new Covid cases in one region (Cook County) not to change drastically from one day to the next.
According to (7.1), the average of all observed data in an L-vicinity of $x_n$ (i.e., the $2L+1$ consecutive data points $x_{n-L}, x_{n-L+1}, \ldots, x_n, \ldots, x_{n+L-1}, x_{n+L}$) can be written as
$$\frac{1}{2L+1}\sum_{\ell=-L}^{L} x_{n+\ell} \;=\; \frac{1}{2L+1}\sum_{\ell=-L}^{L} s_{n+\ell} \;+\; \frac{1}{2L+1}\sum_{\ell=-L}^{L} \epsilon_{n+\ell}. \tag{7.2}$$
Now, leveraging our first assumption that ε has zero mean, we can write
$$\frac{1}{2L+1}\sum_{\ell=-L}^{L} \epsilon_{n+\ell} \approx 0. \tag{7.3}$$
Similarly, leveraging our second assumption that the signal s is locally smooth (and hence roughly constant over the L-vicinity of $x_n$), we have

$$\frac{1}{2L+1}\sum_{\ell=-L}^{L} s_{n+\ell} \approx s_n. \tag{7.4}$$
Finally, substituting (7.3) and (7.4) into (7.2), we arrive at the following estimation
for the value of sn
$$s_n \approx \frac{1}{2L+1}\sum_{\ell=-L}^{L} x_{n+\ell}. \tag{7.5}$$
Statistically speaking, the larger the value of L the more reliable the approximation
in (7.3) becomes, as more samples generally drive the “sample average” closer to
the “population average” (which is assumed to be zero). On the other hand, as L
gets larger, the approximation in (7.4) gets worse, as sn gets drowned out by its
neighboring values. Fortunately, we can ameliorate this issue by adjusting the way
we compute the average in (7.5). Specifically, rather than using a uniform average,
we take a weighted average of elements in the L-vicinity of xn such that larger
weights are assigned to the elements closer to xn and smaller weights to those farther
away from it.
Denoting by $w_\ell$ the weight given to $x_{n+\ell}$ in (7.5), we can write it more generally as

$$s_n = \sum_{\ell=-L}^{L} w_\ell\, x_{n+\ell}, \tag{7.6}$$
where we have also replaced the “approximately equal” sign with its strict version.
Notice that (7.6) reduces to (7.5) when the weights are chosen uniformly, as
$$w_\ell = \frac{1}{2L+1}, \qquad \ell = -L, \ldots, L. \tag{7.7}$$
Figure 7.5 shows a graphical illustration of the uniform weight sequence in (7.7)
as well as the non-uniform weight sequence defined entry-wise as
$$w_\ell = \frac{L + 1 - |\ell|}{(L+1)^2}, \qquad \ell = -L, \ldots, L. \tag{7.8}$$
Note that the non-uniform weight sequence in (7.8) attains its maximum value when $\ell = 0$ (this is the weight assigned to $x_n$). The weights then taper off gradually as we get farther away from the center point at $x_n$. Note, also, that the weights in both (7.7) and (7.8) always add up to 1.
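As a quick numerical check of the two weighting schemes, the following sketch builds the weights in (7.7) and (7.8) and applies the weighted average of (7.6) to a made-up noisy series (the series itself is purely illustrative):

```python
import numpy as np

L = 3
ell = np.arange(-L, L + 1)
w_uniform = np.full(2 * L + 1, 1.0 / (2 * L + 1))       # weights of (7.7)
w_triangular = (L + 1 - np.abs(ell)) / (L + 1) ** 2     # weights of (7.8)
assert np.isclose(w_uniform.sum(), 1.0)                 # both sum to 1
assert np.isclose(w_triangular.sum(), 1.0)

# Weighted average of (7.6) over a noisy toy series; since both kernels are
# symmetric, np.convolve (which flips the kernel) agrees with the
# cross-correlation form used in the text. 'same' mode keeps the length N.
x = np.sin(np.linspace(0, 3, 50)) + 0.1 * np.random.randn(50)
s_hat = np.convolve(x, w_triangular, mode="same")
```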
The term convolution refers to the weighted sum in (7.6). More precisely, the
convolution between w (a sequence of length 2L + 1 defined over the range
−L, . . . , L) and x (a sequence of length N defined over the range 1, . . . , N ) is a
new sequence s denoted by s = w ∗ x, and defined entry-wise as²

$$s_n = \sum_{\ell=-L}^{L} w_\ell\, x_{n+\ell}, \qquad n = 1, 2, \ldots, N. \tag{7.10}$$
Fig. 7.5 An illustration of two weighting schemes: uniform (in yellow) as defined in (7.7) and
non-uniform (in blue) as defined in (7.8). In both cases, L = 3
2 The operation defined in (7.10) is more accurately known as cross-correlation, which is closely
related to the convolution operation defined as
$$s_n = \sum_{\ell=-L}^{L} w_{-\ell}\, x_{n+\ell}, \tag{7.9}$$
where the weight sequence w is first flipped around its center before getting multiplied by x.
Flipping the weight sequence guarantees the convolution operation to have the commutative
property, which is not a matter of concern to us in this book. Therefore, with slight abuse of
terminology, we continue to refer to the operation defined in (7.10) as convolution throughout the
chapter.
Fig. 7.7 Figure associated with Example 7.1. See text for details
Fig. 7.8 Figure associated with Example 7.2. (Left panel) The convolution of the underlying time-
series data (in gray) with the uniform kernel in (7.7). (Right panel) The convolution of the time-
series data with the non-uniform kernel in (7.8). In both cases L = 10
The convolution operation extends naturally to two-dimensional data such as images, as

$$s_{n_1,n_2} = \sum_{\ell_1=-L_1}^{L_1}\,\sum_{\ell_2=-L_2}^{L_2} w_{\ell_1,\ell_2}\; x_{n_1+\ell_1,\,n_2+\ell_2}, \qquad n_1 = 1, 2, \ldots, N_1, \quad n_2 = 1, 2, \ldots, N_2, \tag{7.11}$$

where w is a $(2L_1+1) \times (2L_2+1)$ kernel matrix and x—which was an $N_1 \times N_2$ matrix originally—has been padded (e.g., with zeros) to become an $(N_1+2L_1) \times (N_2+2L_2)$ matrix.
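A direct, loop-based rendering of (7.11) can make the indexing concrete. The sketch below zero-pads the input so that every output location has a full window; the function name and interface are ours:

```python
import numpy as np

def conv2d(x, w):
    """2D convolution (cross-correlation) per (7.11): x is an N1-by-N2 matrix,
    w is a (2*L1+1)-by-(2*L2+1) kernel; zero padding keeps the output N1-by-N2."""
    N1, N2 = x.shape
    L1, L2 = w.shape[0] // 2, w.shape[1] // 2
    xp = np.pad(x, ((L1, L1), (L2, L2)))       # pad with zeros on all sides
    s = np.zeros((N1, N2))
    for n1 in range(N1):
        for n2 in range(N2):
            # entry-wise product of the kernel and the window centered
            # (in the original coordinates) at (n1, n2), then summed
            s[n1, n2] = np.sum(w * xp[n1:n1 + 2*L1 + 1, n2:n2 + 2*L2 + 1])
    return s
```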
The partial derivatives of an image x along its two coordinate directions can be approximated via finite differences, as

$$\frac{\partial x}{\partial n_1}\left(n_1, n_2\right) \approx \frac{x\left(n_1+\ell,\, n_2\right) - x\left(n_1-\ell,\, n_2\right)}{2\ell}, \qquad \frac{\partial x}{\partial n_2}\left(n_1, n_2\right) \approx \frac{x\left(n_1,\, n_2+\ell\right) - x\left(n_1,\, n_2-\ell\right)}{2\ell}. \tag{7.12}$$

Setting $\ell = 1$ gives the central-difference approximations

$$\frac{\partial x}{\partial n_1}\left(n_1, n_2\right) \approx \frac{1}{2}\left[x\left(n_1+1,\, n_2\right) - x\left(n_1-1,\, n_2\right)\right], \qquad \frac{\partial x}{\partial n_2}\left(n_1, n_2\right) \approx \frac{1}{2}\left[x\left(n_1,\, n_2+1\right) - x\left(n_1,\, n_2-1\right)\right], \tag{7.13}$$

both of which can be computed over the entire image via convolution with the horizontal and vertical kernels
$$\mathbf{w}_h = \begin{bmatrix} -0.5 & 0 & +0.5 \end{bmatrix} \tag{7.16}$$

and

$$\mathbf{w}_v = \begin{bmatrix} -0.5 \\ 0 \\ +0.5 \end{bmatrix}. \tag{7.17}$$
Fig. 7.10 Figure associated with Example 7.3. See text for details
Using individual raw pixel values as features has been shown experimentally to produce low-quality results in virtually all machine learning tasks involving images. Moreover, if we were to use pixel values directly as features, the high-resolution medical images of today would create ultra-high-dimensional feature spaces that are prone to a negative phenomenon in machine learning called the curse of dimensionality (see Sect. "Revisiting Feature Design" for a refresher).
An alternative, more efficient approach is to represent an image using its edge
content alone. This idea is illustrated in Fig. 7.11, which shows an input image in the
left panel along with a corresponding image in the right panel, comprised only of
the most prominent edges in the original image.
The edge-detected image in the right panel of Fig. 7.11 is an efficient representa-
tion of the original image in the left panel in the sense that we can still—for the most
part—tell what goes on inside the image while discarding a large amount of less-
useful information from the vast majority of pixels that do not belong to any edges.
Fig. 7.11 (Left panel) An X-ray, taken from [2], showing a right femur chalk stick fracture. (Right
panel) The edge-detected version of this image where the bright yellow pixels indicate large edge
content
This is true in general: the most relevant visual information in an image is largely
contained in the relatively small number of edges within the image [3]. Interestingly,
several studies performed on mammals have also determined that individual neurons
involved in early stages of visual processing operate as edge detectors [4, 5].
In computer vision, edge-based feature design has been the cornerstone of many
popular feature engineering schemes including the histogram of oriented gradients
(or HoG) [6] and the scale-invariant feature transform (or SIFT) [7].
The edges within an image can be extracted using the convolution operation.
As illustrated in Fig. 7.10, convolving an image with certain horizontal and vertical
kernels gives image gradients in those directions where large pixel values indicate
strong edge content. Additional convolutional kernels may be added to the mix to
detect edges that are not strictly horizontal or vertical but are at an incline. For
example, each of the eight convolutional kernels shown in Fig. 7.12 corresponds to one of eight equally (angularly) spaced edge orientations starting from 0°, with seven additional orientations at 45° (or π/4-radian) increments.
To capture the total edge content of an input image in any of the eight
directions shown in Fig. 7.12, we convolve the input image with the correspond-
ing convolutional kernel, pass the results through a rectified linear unit (ReLU)
function to remove any negative entries, and finally add up the remaining pixel
values into a single scalar. Denoting the input image by x, and the convolutional
kernels by w1 , w2 , . . . , w8 , this edge extraction process returns eight feature maps
f1 , f2 , . . . , f8 to represent x, which can be expressed algebraically as
$$f_i = \sum_{\text{all pixels}} \max\left(0,\; \mathbf{w}_i * \mathbf{x}\right), \qquad i = 1, 2, \ldots, 8. \tag{7.18}$$
We use the ReLU function in (7.18) so that negative values in the matrix $\mathbf{w}_i * \mathbf{x}$ do not cancel out positive values in it when performing the final summation.

Fig. 7.12 Eight 3 × 3 convolutional kernels designed to detect horizontal, vertical, and diagonal edges within an image
Stacking all fi ’s into a single vector f, we now have a (primitive) feature
representation for x in the form of a histogram which can be normalized to have
unit length.5
f
f← .
f
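The entire edge extraction pipeline of (7.18), including the final normalization, fits in a few lines. A minimal sketch follows; it assumes SciPy's correlate2d (which matches the cross-correlation convention used throughout the chapter), and the two example kernels shown are stand-ins for the eight kernels of Fig. 7.12, whose exact entries are not reproduced here.

```python
import numpy as np
from scipy.signal import correlate2d

def edge_histogram(image, kernels):
    """Edge-based feature representation per (7.18): convolve the image with
    each kernel, rectify with ReLU, sum over all pixels, then normalize the
    resulting histogram to unit length."""
    f = np.array([np.maximum(0.0, correlate2d(image, w, mode="same")).sum()
                  for w in kernels])
    norm = np.linalg.norm(f)
    return f / norm if norm > 0 else f

# Two stand-in 3x3 kernels (horizontal- and vertical-edge detectors); the six
# diagonal kernels of Fig. 7.12 would be constructed analogously.
w_horizontal = np.array([[-1., -1., -1.], [0., 0., 0.], [1., 1., 1.]])
w_vertical = w_horizontal.T
```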
Fig. 7.13 An illustration of a simple edge-based feature representation based on (7.18). See text
for further details
This feature extraction process is illustrated in Fig. 7.13 for three simple images:
a rectangle (top panel), a circle (middle panel), and an eight-angled star or octagram
(bottom panel). For each basic shape, we plot the convolutional maps of the input
image with each of the eight kernels in Fig. 7.12 (after passing each map through
ReLU), as well as the final histogram representation of the image in the last column.
The edge-based feature extractor we have designed so far works well in overly
simplistic cases. For example, we can distinguish an image of a circle from that
of a square by simply comparing their feature representations. As can be seen in
Fig. 7.13, the feature representation of a circle is much more uniform than (and
thus distinct from) that of a square. This strategy, however, fails when applied to distinguishing between a circle and a star, since their feature representations end up being identical due to the symmetrical nature of both shapes.

Fig. 7.14 An illustration of the summation pooling operation. The 6 × 6 matrix on the left is pooled over four non-overlapping 3 × 3 patches, producing the smaller 2 × 2 matrix on the right
In practice, real-world images are much more complicated than these simplistic
geometrical shapes, and summarizing them using just eight features would be
extremely ineffective. To fix this issue, instead of computing each feature over the
entire image as was done in (7.18), we break the image down into relatively small
patches (that may be overlapping) and compute the features over each patch as
$$f_{i,j} = \sum_{j\text{th patch}} \max\left(0,\; \mathbf{w}_i * \mathbf{x}\right), \qquad i = 1, 2, \ldots, 8. \tag{7.19}$$
This process, that is, breaking the image into small (possibly overlapping)
patches and representing each patch via the sum (or average) of its pixels, is referred
to as pooling in the parlance of machine learning and is depicted in Fig. 7.14 for a
sample 6 × 6 matrix.
Procedurally, the pooling operation is very similar to convolution but with two
differences: first, with pooling the sliding window can jump multiple pixels at a time
depending on how much overlap is required between adjacent windows or patches.
The number of pixels the sliding window is shifted each time is usually referred to
as the stride. With convolution, the stride is typically set to 1. The second difference
between convolution and pooling is how the content of the sliding window is
processed and then summarized as a single value. Recall that with convolution,
we must first compute the entry-wise product between the kernel matrix and the
matrix captured inside the sliding window. With pooling, however, there is no kernel
involved, and we simply add up all the pixels inside the sliding window.
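A minimal sketch of summation pooling, reproducing the setting of Fig. 7.14 (a 6 × 6 input, window size r = 3, stride s = 3); the function name and interface are ours:

```python
import numpy as np

def sum_pool(x, r, s):
    """Summation pooling: slide an r-by-r window over x with stride s and
    replace each window's contents with their sum (no kernel involved)."""
    n_rows = (x.shape[0] - r) // s + 1
    n_cols = (x.shape[1] - r) // s + 1
    out = np.empty((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            out[i, j] = x[i*s : i*s + r, j*s : j*s + r].sum()
    return out

x = np.arange(36.0).reshape(6, 6)
print(sum_pool(x, r=3, s=3))   # four non-overlapping 3x3 patches -> 2x2 output
```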
In Fig. 7.15, we show the end-to-end edge-based feature extraction pipeline after
the introduction of the pooling layer.
The feature extraction scheme shown in Fig. 7.15 has several adjustable hyper-
parameters including:
• The number of convolution kernels, k
• The dimension of the convolutional kernels, q × q
• The dimension of the pooling windows, r × r
• The pooling stride, s
These hyperparameters are all discrete in nature and are usually tuned by trial and
error. The choice of these hyperparameters directly impacts the total number of
features in the final feature representation vector, which can be computed as
$$k\left(\left\lfloor \frac{N_1 - r}{s} \right\rfloor + 1\right)\left(\left\lfloor \frac{N_2 - r}{s} \right\rfloor + 1\right), \tag{7.20}$$

where $N_1 \times N_2$ is the dimension of the input image.
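Evaluating (7.20) is a one-line computation; for instance, under the (hypothetical) setting of k = 8 kernels, a 28 × 28 input, 4 × 4 pooling windows, and stride 4:

```python
def num_features(k, N1, N2, r, s):
    """Total feature count per (7.20): k kernels times the number of r-by-r
    pooling windows that fit in an N1-by-N2 image with stride s."""
    return k * ((N1 - r) // s + 1) * ((N2 - r) // s + 1)

print(num_features(k=8, N1=28, N2=28, r=4, s=4))   # 8 * 7 * 7 = 392
```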
6 In the original LeNet, convolutions were performed without padding the input. As a result, the spatial dimension of the feature maps shrinks after each convolutional layer.
Fig. 7.16 Two high-level architectures for machine learning tasks involving image data. (Top
row) A fixed feature extractor layer (consisting of fixed convolutional kernels, ReLU, and
pooling modules) is inserted between the input image and the final multi-layer perceptron
regressor/classifier. (Bottom row) In a convolutional neural network, the convolutional kernels in
the feature extractor layer are tuned jointly with the multi-layer perceptron weights. The modules
involving fixed and adjustable weights are colored gray and green, respectively
The fully connected portion of the LeNet consists of three layers. The first, second, and third layers consist of 120, 84, and 10 units, respectively.
In the top panel of Fig. 7.18, we show a more compact visual representation of the
LeNet architecture, focusing on the number and size of convolutional kernels as well
as the pooling window size and stride in each convolutional layer. The convolutional
kernels and pooling windows are drawn exactly to size, and colored yellow and blue,
respectively, while the fully connected layers are drawn as red circles. The number
of convolutional kernels in each layer and the number of neuronal units in each
fully connected layer are printed underneath them in brackets. The compact visual
representation introduced here allows us to easily compare the classical LeNet with
more modern convolutional neural network architectures such as the AlexNet [9]
(middle panel of Fig. 7.18) and VGGNet [10] (bottom panel of Fig. 7.18).
Modern convolutional neural networks have tens of millions of tunable param-
eters distributed throughout both the convolutional and fully connected layers of
Fig. 7.18 The compact visual representations of three popular convolutional neural network
architectures. (Top panel) The LeNet architecture has 2 convolutional layers and 3 fully connected
layers. This model was originally trained on the MNIST dataset [11] consisting of 60,000 images
of handwritten digits. (Middle panel) The AlexNet has 5 convolutional layers and 3 fully connected
layers. This model was first trained on a subset of the ImageNet dataset [12] consisting of
1,200,000 natural images. (Bottom panel) The original VGGNet architecture consisted of 14
convolutional layers and 3 fully connected layers
the network. For example, the AlexNet (shown in the middle panel of Fig. 7.18)
has roughly 60 million tunable weights, while this number is around 140 million in
the case of the VGGNet (shown in the bottom panel of Fig. 7.18). In the absence
of very large datasets (with hundreds of thousands or millions of data points),
these architectures are extremely prone to overfitting. Additionally, training these
deep architectures requires extensive computational resources and training time. For
instance, the original AlexNet was trained on 2 GPUs for 6 days, while the VGGNet was trained on 4 GPUs over a period of 2 weeks.
When the size of data is smaller than ideal and/or we have limited computational
resources at our disposal to train a modern convolutional neural network from
scratch, we can still leverage pre-trained models such as AlexNet or VGGNet
by “transferring some of the knowledge” gained from these models to ours. This
strategy is typically called transfer learning.
For instance, we can choose to re-use these pre-trained models by keeping all their weights untouched—except for the weights of the final fully connected layer, which we tune using our own (smaller) dataset. Depending on the size of our
training data, we can take this idea one step further and also learn some of the
weights in the convolutional layers, typically those belonging to the layers that are
closer to the output. During the training phase of transfer learning, it is usually
beneficial not to randomly re-initialize the weights we look to re-tune, but instead
initialize them at their optimal values according to the pre-trained model (e.g.,
AlexNet or VGGNet).
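A minimal PyTorch sketch of this transfer learning recipe, assuming a recent torchvision and its ImageNet-pre-trained AlexNet; the two-class output size is an arbitrary stand-in for our own (smaller) dataset:

```python
import torch
import torchvision.models as models

# Load AlexNet pre-trained on ImageNet and freeze all of its weights.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False

# Replace the final fully connected layer with a fresh one sized for our task;
# only this layer's weights will be tuned on our own dataset.
in_features = model.classifier[6].in_features
model.classifier[6] = torch.nn.Linear(in_features, 2)

optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=1e-3)
```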
Recurrence Relations
A simple example is the moving average of an input time series $x_1, x_2, \ldots, x_N$, defined as

$$h_t = \begin{cases} x_t & t = 1, \ldots, L-1 \\[4pt] \dfrac{1}{L}\displaystyle\sum_{i=t-L+1}^{t} x_i & t = L, \ldots, N, \end{cases} \tag{7.21}$$

where the first L − 1 values of the moving average h are set to the values of the
input time-series x itself. After these initial values, we create those that follow by
averaging the preceding L elements of the input series. This simple moving average
process is a popular example of dynamic systems with fixed order. The dynamic
systems part of this phrase refers to the fact that the system h is defined in terms of
recent values of the input sequence x. The fixed order part refers to just how many
preceding elements of input x are used to calculate the values in h. In (7.21), this
value was set to L for each value of ht created (after the initial values).
The generic form of a dynamic system with fixed order is very similar to the moving average process expressed in (7.21); only it employs a general (and possibly nonlinear) function f of the L most recent input values, as

$$\begin{aligned} h_t &= \gamma_t, && t = 1, \ldots, L-1,\\ h_t &= f\left(x_t, x_{t-1}, \ldots, x_{t-L+1}\right), && t \geq L. \end{aligned} \tag{7.22}$$

Here, L is the fixed order of the dynamic system, and the first L − 1 values
γ1 , γ2 , . . . , γL−1 are called the initial conditions of the system. These initial
conditions are often dependent on the input sequence but, in general, can be set
to any values. Fixed order dynamic systems are used in a variety of scientific and
engineering disciplines. Convolutional operations, for instance, are prime examples
of a dynamic system with fixed order and are frequently used to filter and adjust
digital signals. A special case of a dynamic system with fixed order is when L is
set to 1, implying that each element of the output sequence ht is dependent only on
the current input point xt , that is, ht = f (xt ). These kinds of systems are called
memoryless since the dynamic system is constructed without any knowledge of the
past input values.
Another special class of fixed order dynamic systems are recurrence relations, where instead of constructing an output sequence based on a given input sequence, these systems define an input sequence in terms of itself, as

$$\begin{aligned} x_t &= \gamma_t, && t = 1, \ldots, L-1,\\ x_t &= f\left(x_{t-1}, x_{t-2}, \ldots, x_{t-L}\right), && t \geq L. \end{aligned} \tag{7.23}$$
In this case, we do not begin with an input sequence x and filter it to create an output sequence h. Instead, we generate the input itself by recursing on a formula of the form shown in (7.23). As such, these recurrence relations are sometimes referred
to as generative models. Notice that with recurrence relations the initial conditions
will still have to be set, which are simply the first L−1 entries of the input sequence.
Consider, for instance, the first order recurrence relation

$$\begin{aligned} x_1 &= \gamma\\ x_t &= w_0 + w_1 x_{t-1}, && t > 1, \end{aligned} \tag{7.24}$$

which
generates a sequence that exhibits exponential growth using the linear func-
tion f (x) = w0 + w1 x. Here, γ , w0 , and w1 are all adjustable scalars.
In Fig. 7.19, we show two example sequences of length N = 10 generated
using (7.24). In the first instance shown in the left panel of Fig. 7.19, we set the
Fig. 7.19 Figure associated with Example 7.4. See text for details
If we repeat this process, substituting in the recursive formula for $x_{t-2}$, then $x_{t-3}$, and so on, we can connect $x_t$ all the way back to the initial condition, and write

$$x_t = w_1^{t-1}\,\gamma + w_0\left(1 + w_1 + w_1^2 + \cdots + w_1^{t-2}\right), \tag{7.26}$$

which shows how the sequence behaves exponentially depending on the value of $w_1$. When $w_0 = 0$, this reduces to the purely exponential relationship $x_t = w_1^{t-1}\gamma$ (see Exercise 7.6). As we saw previously
in Sect. “The Logistic Function”, this sort of dynamic system arises in
Malthusian modeling of population growth.
Fig. 7.20 Figure associated with Example 7.5. See text for details
A noisy auto-regressive system of order L takes the form

$$\begin{aligned} x_1 &= \gamma_1,\\ x_2 &= \gamma_2,\\ &\;\;\vdots\\ x_L &= \gamma_L,\\ x_t &= w_0 + \sum_{i=1}^{L} w_i\, x_{t-i} + \epsilon_t, && t > L, \end{aligned} \tag{7.27}$$

where $\epsilon_t$ denotes the small amount of added noise introduced at each step. In Fig. 7.20, we show two sequences generated via the auto-regressive system in (7.27). In both cases, L = 4, and we have used the same initial conditions and linear function weights. The only difference between the two sequences is the value of $\epsilon_t$ in each case. In the left panel, no noise was added, i.e., $\epsilon_t = 0$, while a small random (Gaussian) noise was added to the sequence shown in the right panel.
Another classic example of an auto-regressive model is the Fibonacci sequence, defined recursively as

$$\begin{aligned} x_1 &= 0\\ x_2 &= 1\\ x_t &= x_{t-1} + x_{t-2}, && t > 2. \end{aligned} \tag{7.28}$$
Consider now the first order dynamic system

$$\begin{aligned} x_1 &= \gamma\\ x_t &= w\, x_{t-1}\left(1 - x_{t-1}\right), && t > 1, \end{aligned} \tag{7.29}$$

where the recursive update function is no longer linear, but quadratic. Take a moment to revisit Sect. "The Logistic Function", and notice the similarity between the dynamic system in (7.29) and the differential equation in (5.6). In fact, using the right settings of γ and w, we can generate the familiar s-shaped logistic curve, as illustrated in Fig. 7.21.
This dynamic system is often chaotic, meaning that slight adjustments
to the initial condition γ and weight w can produce drastically different
results. For instance, in Fig. 7.22, we show two sequences with the same initial
condition γ = 10−4 but different weight values: in the left panel w = 3, while
in the right panel w = 4. As can be seen from comparing the two panels, a
relatively small change in w can turn a nicely converging sequence (left) into
a chaotic pseudo-random one (right).
Fig. 7.22 Figure associated with Example 7.6. See text for details
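A few lines suffice to reproduce the qualitative behavior of Fig. 7.22: the same initial condition converges for w = 3 but turns chaotic for w = 4 (the sequence length below is arbitrary).

```python
def logistic_sequence(gamma, w, n):
    """Generate n points of the recurrence in (7.29): x_t = w*x_{t-1}*(1-x_{t-1})."""
    x = [gamma]
    for _ in range(n - 1):
        x.append(w * x[-1] * (1.0 - x[-1]))
    return x

for w in (3.0, 4.0):
    print(w, [round(v, 4) for v in logistic_sequence(1e-4, w, 50)[-5:]])
```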
Rolling a recurrence relation of order L = 1 all the way back to its initial condition gives

$$x_t = f\left(f\left(\cdots f\left(\gamma\right)\cdots\right)\right),$$

where we have composed f with itself t − 1 times. This confirms that every point in a sequence generated by a recurrence relation of order L = 1 is completely determined by its initial condition. The same is true in general for recurrence relations with order L > 1.
It is important to note that from the very definition of a recurrence relation with order L,

$$x_t = f\left(x_{t-1}, x_{t-2}, \ldots, x_{t-L}\right),$$

we can see that each $x_t$ is dependent on only the values of $x_{t-1}$ through $x_{t-L}$,
and no point coming before xt−L . Therefore, the range of values used to build
each subsequent point is—by definition—limited by the order of the system. Such
systems with “limited memory” have two major disadvantages. First, it is often not
easy to select a proper value for L. If set too small, the system may lack enough
memory to model a recursive phenomenon. On the other hand, large values of L
can result in needlessly complex models that are difficult to optimize and wield.
Second—and more importantly—many modalities of dynamic data (e.g., text) can
have variable length. Take patient notes for example. If you were to use patient notes
to predict the health status of admitted patients in a hospital (i.e., a binary label that
can be positive or negative), how would you choose a fixed value for L knowing that
some patient notes can be only a few words long, while others can be quite lengthy,
sometimes exceeding a few pages? Because a fixed order dynamic system is limited
by its order and cannot use any information from earlier in a sequence, this problem
can arise regardless of the order L that we choose. In the next section, we introduce
variable order dynamic systems to remedy this problem.
A classic example of a dynamic system with variable order is the exponential average. In this scheme, instead of taking a sliding window and averaging the input series inside of
it, we compute the average of the entire input sequence in an online fashion, adding
the contribution of each input one element at a time. Before discussing how the
exponential average is computed, it is helpful to first define a running average for
the input sequence x1 , x2 , . . . , xN , as follows:
$$\begin{aligned} h_1 &= x_1\\ h_2 &= \frac{x_1 + x_2}{2}\\ h_3 &= \frac{x_1 + x_2 + x_3}{3}\\ &\;\;\vdots\\ h_N &= \frac{x_1 + x_2 + \cdots + x_N}{N}. \end{aligned} \tag{7.32}$$
Here, each point ht in the running average sequence is the arithmetic average of
all points in the input sequence indexed from 1 to t. In other words, the running
average sequence ht summarizes the input sequence up to (and including) xt via a
simple summary statistic: their sample mean.
Notice that the running average in (7.32) is a dynamic system that can be written
recursively as
$$h_t = \frac{t-1}{t}\, h_{t-1} + \frac{1}{t}\, x_t \tag{7.33}$$
for all t = 1, 2, . . . , N . Once you have taken a moment to verify that (7.32)
and (7.33) are indeed equivalent, also notice that ht does not have a fixed order
as h1 depends only on one input point, h2 depends on two input points, h3 depends
on three input points, and so forth. If ht was a fixed order system, then its value
would depend on the same number of input points at all steps.
While (7.32) and (7.33) are two equivalent representations of the same dynamic
system, the latter is far more efficient from a computational perspective. To
see why this is the case, let us compute the entire running average sequence
h1 , h2 , . . . , hN using both representations, counting the number of mathematical
operations (addition, multiplication, division, etc.) that must be performed along the
way. Using (7.32), we need no additions or divisions to compute h1 , 1 addition and 1
division to compute h2 , 2 additions and 1 division to compute h3 , and so on, totaling
$$0 + 1 + 2 + \cdots + (N-1) = \frac{1}{2}(N-1)N \tag{7.34}$$
additions and N −1 divisions overall. Note that the number of additions is quadratic
in N, making it prohibitively large as N grows larger. On the other hand, using the recursive form in (7.33), each new value $h_t$ requires only a fixed number of operations, so the entire sequence can be computed with a number of operations that grows only linearly in N.

A popular variant of the running average replaces the time-varying weights $\frac{t-1}{t}$ and $\frac{1}{t}$ in (7.33) with a fixed weight $\alpha$ (where $0 < \alpha < 1$) and its complement $1-\alpha$, as

$$h_t = \alpha\, h_{t-1} + (1-\alpha)\, x_t. \tag{7.35}$$
This slightly adjusted version of the running average is called an exponential average because if we roll (7.35) back to its initial condition $h_1 = x_1$—as we did in (7.26)—the following exponentially weighted average emerges (see Exercise 7.7)

$$h_t = (1-\alpha)\sum_{i=0}^{t-2} \alpha^i\, x_{t-i} \;+\; \alpha^{t-1}\, x_1. \tag{7.36}$$
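A minimal sketch of the exponential average computed online via (7.35); note that each step costs a fixed number of operations, so the whole sequence is produced in time linear in N (the input series and α below are arbitrary):

```python
import numpy as np

def exponential_average(x, alpha=0.9):
    """Online exponential average per (7.35): h_t = alpha*h_{t-1} + (1-alpha)*x_t,
    initialized at h_1 = x_1."""
    h = np.empty(len(x))
    h[0] = x[0]
    for t in range(1, len(x)):
        h[t] = alpha * h[t - 1] + (1 - alpha) * x[t]
    return h

print(exponential_average([1.0, 2.0, 3.0, 4.0], alpha=0.5))
```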
The generic form of a dynamic system with variable order is very similar to the
exponential average shown in (7.36) and can be written as
$$\begin{aligned} h_1 &= \gamma\left(x_1\right)\\ h_t &= f\left(h_{t-1}, x_t\right), && t > 1, \end{aligned} \tag{7.37}$$
where γ(·) and f(·) can be any mathematical functions. While there are many variations on this generic theme, all dynamic systems with variable order share two universal properties: first, $h_t$ is defined recursively in terms of itself, and second, it provides a summary of all preceding input values $x_1$ through $x_t$, and as such is sometimes referred to as the state variable in the context of variable order dynamic systems. We can see why this is the case if we roll back $h_t$ in (7.37) all the way to $h_1$, via
$$h_t = f\left(f\left(\cdots f\left(f\left(\gamma\left(x_1\right),\, x_2\right),\, x_3\right)\cdots,\, x_{t-1}\right),\, x_t\right),$$
Fig. 7.23 Graphical model representations of dynamic systems with fixed and variable order. (Top
panel) The memory of a fixed order dynamic system is limited to the order of the system L,
meaning that the system is only aware of the most recent L elements of the input sequence. Here
L = 2, and the input points that play a role in the value of ht are colored in red. (Bottom panel)
The memory of a variable order dynamic system is complete in the sense that every preceding input
plays a role in determining the value of the output at time or step t
which exposes the fact that ht is dependent on all prior values of the input sequence
x1 through xt . In other words, at each step, ht provides a summary of the input
up to that point in the sequence, and therefore, it has a “full memory” of all input
preceding it. This is in direct contrast to the fixed order dynamic system described
in the previous section where every value was dependent on only a fixed and limited
number of inputs preceding it. This comparison is illustrated in Fig. 7.23.
Recurrent Neural Networks

The variable order dynamic system shown in the bottom panel of Fig. 7.23
provides a blueprint for building a recurrent neural network, a prototypical example
of which is illustrated in Fig. 7.24. Here, in addition to the input sequence (in red)
and the state sequence (in yellow), we have an output sequence (in blue) sitting atop
the state layer. Unlike the representation in the bottom panel of Fig. 7.23 wherein the
function f remains the same throughout the system, in recurrent neural networks,
f can change from state to state. In practice, however, fi ’s tend to have the same
functional form but use different parameters. For example,

$$f_i\left(h, x\right) = \tanh\left(w_i\, h + v_i\, x\right)$$

is a common choice for the functions $f_1$ through $f_{t-1}$, with $v_i$'s and $w_i$'s being
tunable parameters. The same is true for the functions g1 through gt that predict the
outputs y1 through yt from the states h1 through ht . As with any other machine
learning and deep learning model we have encountered so far, all the function
parameters must be learned during training.
Finally, it should be noted that sometimes—and depending on the application—
the output of a recurrent neural network is not a sequence but a single variable. For
instance, in classification tasks involving dynamic or sequential data, we can remove
all the output points y1 through yt−1 from the architecture in Fig. 7.24, keeping only
yt that will contain the predicted classification label.
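A scalar sketch of the recurrent state update discussed above, using the common tanh form and, for simplicity, weights that are tied across steps (the text notes that, in general, each step may carry its own parameters):

```python
import numpy as np

def rnn_states(x, w, v, h0=0.0):
    """Recurrent state update h_t = tanh(w*h_{t-1} + v*x_t): each state
    summarizes the entire input sequence seen so far."""
    h, states = h0, []
    for x_t in x:
        h = np.tanh(w * h + v * x_t)
        states.append(h)
    return states

# For a classification task we would keep only the final state and feed it to
# a classifier; here we simply print every state of a short toy sequence.
print(rnn_states([0.5, -1.0, 2.0], w=0.8, v=0.3))
```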
Problems
Fig. 7.25 Figure associated with Exercise 7.1. The three images shown here can be downloaded
from the chapter’s supplements
$$\mathbf{w} = \frac{1}{L}\begin{bmatrix}
1 & 2 & \cdots & \ell & \ell+1 & \ell & \cdots & 2 & 1\\
2 & 3 & \cdots & \ell+1 & \ell+2 & \ell+1 & \cdots & 3 & 2\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
\ell & \ell+1 & \cdots & 2\ell-1 & 2\ell & 2\ell-1 & \cdots & \ell+1 & \ell\\
\ell+1 & \ell+2 & \cdots & 2\ell & 2\ell+1 & 2\ell & \cdots & \ell+2 & \ell+1\\
\ell & \ell+1 & \cdots & 2\ell-1 & 2\ell & 2\ell-1 & \cdots & \ell+1 & \ell\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
2 & 3 & \cdots & \ell+1 & \ell+2 & \ell+1 & \cdots & 3 & 2\\
1 & 2 & \cdots & \ell & \ell+1 & \ell & \cdots & 2 & 1
\end{bmatrix},$$
Recall the Fibonacci sequence, defined recursively as

$$\begin{aligned} x_1 &= 0\\ x_2 &= 1\\ x_t &= x_{t-1} + x_{t-2}, && t > 2. \end{aligned}$$
For each of the following sequences, can you define a similar recursive formula that
generates the entire sequence starting from some initial condition? If not, why?
(a) 1, 1, 2, 4, 7, 13, 24, 44, 81, . . .
(b) 1, 2, 4, 8, 16, 32, 64, 128, . . .
(c) 1, 2, 6, 24, 120, 720, 5040, . . .
(d) 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, . . .
References
1. The New York Times. Coronavirus (Covid-19) Data in the United States. Accessed July 2022.
https://github.com/nytimes/covid-19-data
2. Keshavamurthy J. Case study: bisphosphonate induced femur fractures. Accessed Aug 2022.
https://doi.org/10.53347/rID-45453
3. Barlow H. Redundancy reduction revisited. Netw Comput Neural Syst. 2001;12(3):241–53
4. Marčelja S. Mathematical description of the responses of simple cortical cells. JOSA.
1980;70(11):1297–300
5. Jones JP, Palmer LA. An evaluation of the two-dimensional Gabor filter model of simple
receptive fields in cat striate cortex. J Neurophysiol. 1987;58(6):1233–58.
6. Dalal N, Triggs B. Histograms of oriented gradients for human detection. Proc IEEE Comput
Soc Conf Comput Vis Pattern Recognit. 2005;1:886–93
7. Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis.
2004;60(2):91–110
8. LeCun Y, Boser B, Denker JS, et al. Backpropagation applied to handwritten zip code
recognition. Neural Comput. 1989;1(4):541–51.
9. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional
neural networks. In: Proceedings of the 25th international conference on neural information
processing systems. Vol. 1. NIPS'12. Red Hook: Curran Associates Inc.; 2012. p. 1097–1105.
10. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recog-
nition. In: 3rd international conference on learning representations, ICLR 2015, San Diego,
May 7–9, 2015. Conference Track Proceedings; 2015. Available from: http://arxiv.org/abs/
1409.1556
11. Deng L. The MNIST database of handwritten digit images for machine learning research. IEEE
Signal Proces Magaz. 2012;29(6):141–2
12. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image
database. In: 2009 IEEE conference on computer vision and pattern recognition. Piscataway:
IEEE; 2009. p. 248–55
Chapter 8
Reinforcement Learning
Path-Finding AI
Fig. 8.1 (Left panel) An example of Gridworld shaped like a maze. The agent (in black) must
learn to navigate the Gridworld to reach the green target while avoiding hazardous squares (in
red). (Right panel) The agent cannot “see” the entire Gridworld at once. At each turn, it can only
see neighboring squares to its current location
Fig. 8.2 (Left panel) A small 5 × 5 Gridworld where the hazard squares are organized in a
particular way to divide the world into two halves, leaving only a narrow passage for the agent
on the left to reach the target on the right. (Right panel) A larger 20 × 20 Gridworld with randomly
placed hazards
A more challenging example is shown in the right panel of Fig. 8.2, where the world is considerably larger compared to the previous examples. Moreover, the hazards in this case are placed randomly and—as a result—do not seem to follow any specific pattern. Here, too, the robot must learn to navigate a hazard-free path to reach the green target efficiently.
Automatic Control
Fig. 8.3 An illustration of the cart–pole (left) and the lunar lander problem (right)
Game-Playing AI

Fig. 8.4 Reinforcement learning can be used to train AI agents to play board games such as Chess (left panel) and Go (right panel)
Fig. 8.5 The process of pattern-cutting performed by a pair of robotic arms trained using reinforcement learning. This figure was reproduced from [1]

One of the two robotic arms in Fig. 8.5 acts as a gripper (the arm coming into the frame from the left). The function of the gripper arm is to facilitate the procedure by grasping the soft tissue and applying forces of varying magnitude and direction to it as the other arm cuts through the tissue.
Fig. 8.6 The angle and dosage of radiation are important parameters in radiation therapy as they
determine the impact zone of the ionizing beams as well as the level of energy delivered to the cells
within that zone
Fundamental Concepts
For any given Gridworld as shown in Figs. 8.1 and 8.2, knowledge of the agent’s
current location is enough to fully describe the problem environment. Hence, a state
in this case consists of the horizontal and vertical coordinates of the black circle on
the map. Recall that the robot in Gridworld is only allowed to move one unit up,
down, left or right. These define the set of actions that the Gridworld agent can take.
Note that depending on the agent’s location (state), only a subset of actions may be
available to the agent. For instance, if the agent is at the top-left corner of the map,
it will only be allowed to go one unit right or down.
We can design a variety of reward structures to communicate our goal to the agent, that is, to reach the target (green square) in an efficient manner while avoiding the hazards (red squares). For example, we can assign a negative value of relatively small magnitude (e.g., −1) to all actions (one unit movements) that lead to a non-goal, non-hazard state, a negative value of larger magnitude (e.g., −100) to those actions leading to a hazard state, and a non-negative number (e.g., 0) to actions leading to the goal state itself. This way, the agent is incentivized not to step on hazard squares (as its reward will be reduced by 100 each time it does so), and to reach the goal state in as few steps as possible (since walking over each white square still reduces its reward by 1).
To summarize, beginning at a state (location) in Gridworld, an action is taken to
move the agent to a new state. For taking this action and moving to the new state,
the agent receives a reward.
In control problems like the cart–pole, too, at each step an action is taken (e.g., the cart is pushed or moved), and a new state of the system arises. For taking this action and moving to the new state, the agent receives a reward.
In playing the game of Chess, a state is any (legal) configuration of all (remaining) white and black pieces on the board, and an action is a legal move of any of the current pieces on the board according to the rules of Chess. In this case, one reward structure
to induce our agent to learn how to win could be as follows: any move made that
does not immediately lead to the goal state (checkmating the opponent) receives a
reward of −1, while a move that successfully checkmates the opponent receives a
large positive reward (e.g., 10,000).
Mathematical Notation
We denote the set of all possible states and the set of all permissible actions, respectively, as

$$S = \left\{\sigma_1, \sigma_2, \ldots, \sigma_n\right\} \tag{8.1}$$

$$A = \left\{\alpha_1, \alpha_2, \ldots, \alpha_m\right\}. \tag{8.2}$$
At the kth step in solving a reinforcement learning problem, the agent begins at a
state sk ∈ S and takes an action ak ∈ A that moves the system to a state sk+1 ∈ S.
It is important not to confuse the s notation with the σ notation in (8.1), and the a
notation with the α notation in (8.2). The notation sk is a variable denoting the state
at which the kth step of the procedure begins and thus can be any of the possible
realized states in S = {σ1 , σ2 , . . . , σn }. Similarly, the notation ak is a variable
denoting the action taken at the kth step, which is one of the permissible actions
from the set A = {α1 , α2 , . . . , αm }.
Recall that the mechanism by which an agent learns the best action to take in a
given state is the reward structure. We use the notation rk to denote the reward an
agent receives at the kth step. In general, rk is a function of the initial state at the kth
step of the process as well as the action taken by the reinforcement learning agent
at this step
$$r_k = f\left(s_k, a_k\right). \tag{8.3}$$
Fig. 8.7 An illustrative summary of the reinforcement learning nomenclature and notation
introduced thus far
sensor issues, friction, etc. Almost the same modeling discussed in this section captures the variability of such stochastic problems, with the main difference being that the reward function must also necessarily be a function of the state $s_{k+1}$ in addition to $s_k$ and $a_k$, i.e.,

$$r_k = f\left(s_k, a_k, s_{k+1}\right). \tag{8.4}$$
Bellman’s Equation
With notation out of the way, we are now ready to address perhaps the most
important question in reinforcement learning: how do we actually train the agent?
The answer is—like any other machine learning problem—through optimizing
an appropriate cost function. However, unlike other machine learning problems
such as linear or logistic regression, here we cannot directly work out an exact
parameterized form of the cost function. Instead, we formalize a certain attribute
that we want this function to ideally have and, working backward, we can arrive at
a method for computing it.
Let us define Q(s1 , a1 ) as the maximum total reward possible if we begin at the
state s1 and take the action a1 . Recall that taking the action a1 brings us to some state
s2 , and the agent receives some reward r1 . Therefore, Q(s1 , a1 ) can be calculated as
the sum of the realized reward $r_1$ plus the largest possible total reward from all the succeeding steps starting from the state $s_2$. Invoking the definition of the Q function, this latter quantity can be written as

$$\max_{j \in \Omega\left(s_2\right)} Q\left(s_2, \alpha_j\right), \tag{8.5}$$

where $\Omega(s_2)$ denotes the index set for all valid actions that can be taken when the agent is at the state $s_2$. Writing out the equality above algebraically, we then have

$$Q\left(s_1, a_1\right) = r_1 + \max_{j \in \Omega\left(s_2\right)} Q\left(s_2, \alpha_j\right). \tag{8.6}$$
Note that the expression in (8.6) holds generally regardless of what state and action we begin with. In other words, at the kth step of the process, we can write

$$Q\left(s_k, a_k\right) = r_k + \max_{j \in \Omega\left(s_{k+1}\right)} Q\left(s_{k+1}, \alpha_j\right). \tag{8.7}$$
The Basic Q-Learning Algorithm

When dealing with reinforcement learning problems with a finite number of states
and actions, the Q function can be represented as a two-dimensional matrix. Recall
from (8.1) and (8.2) that we denote the set of all states as S = {σ1 , σ2 , . . . , σn },
and the set of all possible actions as A = {α1 , α2 , . . . , αm }. Therefore, Q can be
represented as the n × m matrix
$$\begin{bmatrix} Q\left(\sigma_1, \alpha_1\right) & Q\left(\sigma_1, \alpha_2\right) & \cdots & Q\left(\sigma_1, \alpha_m\right)\\ Q\left(\sigma_2, \alpha_1\right) & Q\left(\sigma_2, \alpha_2\right) & \cdots & Q\left(\sigma_2, \alpha_m\right)\\ \vdots & \vdots & \ddots & \vdots\\ Q\left(\sigma_n, \alpha_1\right) & Q\left(\sigma_n, \alpha_2\right) & \cdots & Q\left(\sigma_n, \alpha_m\right) \end{bmatrix}, \tag{8.8}$$
which is indexed by all possible actions along its columns, and all possible states
along its rows. In the beginning, this matrix can be initialized at random (or at zero).
Next, by running through an episode of simulation,
we generate data that can be used to resolve the optimal Q function step-by-step via
the recursive definition in (8.7). With the matrix Q initialized, the agent takes its
first action at random for which it receives the reward r1 . Based on this reward, we
can update $Q(s_1, a_1)$ via (8.6), as

$$Q\left(s_1, a_1\right) \longleftarrow r_1 + \max_{j \in \Omega\left(s_2\right)} Q\left(s_2, \alpha_j\right).$$

The agent then takes its second action (once again at random) for which it receives the reward $r_2$, and we update $Q(s_2, a_2)$ via

$$Q\left(s_2, a_2\right) \longleftarrow r_2 + \max_{j \in \Omega\left(s_3\right)} Q\left(s_3, \alpha_j\right).$$
This sequential update process continues until a goal state is reached or a maximum
number of steps are taken. When the current episode ends, we begin a new episode
and continue updating Q.
After performing enough training episodes, our Q matrix/function eventually
becomes optimal, since (by construction) it will satisfy the desired recursive
definition for all state–action pairs. Notice that, in order for Q to be optimal for all state–action pairs, every such pair must be visited at least once. In practice, one
must typically cycle through each pair multiple times in order for Q to be trained
appropriately or employ function approximators to generalize from a small subset
of state–action pairs to the entire space.
In summary, by running through a large number of episodes (visiting as many state–action pairs, as many times, as possible), and updating Q at each step
using the recursive Bellman’s equation, we learn Q by trial-and-error interactions
with the environment. How well our computations converge to the true Q function
relies heavily on how well we sample the state–action spaces through our trial-and-
error interactions. A sketch of the basic version of the Q-learning algorithm is given below.
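The following is a minimal Python sketch of the basic Q-learning loop; the `env` object, with its reset, valid_actions, and step methods, is a hypothetical stand-in for the problem environment.

```python
import random
import numpy as np

def basic_q_learning(env, n_states, n_actions, episodes=100, max_steps=1000):
    """Basic Q-learning: random action selection plus the Bellman update
    Q(s,a) <- r + max_a' Q(s',a'), repeated over many episodes."""
    Q = np.zeros((n_states, n_actions))            # initialize Q (here at zero)
    for _ in range(episodes):
        s = env.reset()                            # start a new episode
        for _ in range(max_steps):
            a = random.choice(env.valid_actions(s))
            s_next, r, done = env.step(s, a)
            if done:                               # goal (or terminal) state
                Q[s, a] = r
                break
            Q[s, a] = r + Q[s_next, env.valid_actions(s_next)].max()
            s = s_next
    return Q
```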
Fig. 8.8 All the possible states (left panel) and actions (right panel) for the Gridworld shown
originally in the left panel of Fig. 8.2
We initialize the Q matrix at zero and run the Q-learning algorithm. The initial and final Q matrices (at the
beginning of episode 1 and at the end of episode 100) are displayed in
Table 8.1.
Once Q is trained, the agent at state $s_k$ selects its action as

$$a_k = \alpha_i, \tag{8.13}$$
Table 8.1 The initial (left) and final (right) Q matrices associated with Example 8.1. Here, each
row is a state and each column an action. The Q matrix on the left was initialized at zero. The Q
matrix on the right was resolved after running 100 episodes of the Q-learning algorithm
State   Down↓  Up↑  Left←  Right→  |  Down↓    Up↑      Left←    Right→
        (initial Q matrix)         |  (final Q matrix)
(1, 1) 0 0 0 0 −0.008 −0.007 −0.008 −0.007
(1, 2) 0 0 0 0 −0.007 −0.006 −0.008 −1.001
(1, 3) 0 0 0 0 −1.001 −1.002 −0.007 −0.001
(1, 4) 0 0 0 0 −0.001 −0.002 −1.001 0
(1, 5) 0 0 0 0 0 0 0 0
(2, 1) 0 0 0 0 −0.008 −0.006 −0.007 −0.006
(2, 2) 0 0 0 0 −0.007 −0.005 −0.007 −1.002
(2, 3) 0 0 0 0 −1.001 −0.004 −0.006 −0.002
(2, 4) 0 0 0 0 −0.001 −0.003 −1.002 −0.001
(2, 5) 0 0 0 0 0 −0.002 −0.002 −0.001
(3, 1) 0 0 0 0 −0.007 −0.007 −0.006 −0.005
(3, 2) 0 0 0 0 −0.006 −0.006 −0.006 −0.004
(3, 3) 0 0 0 0 −1.002 −1.004 −0.005 −0.003
(3, 4) 0 0 0 0 −0.002 −0.004 −0.004 −0.002
(3, 5) 0 0 0 0 −0.001 −0.003 −0.003 −0.002
(4, 1) 0 0 0 0 −0.006 −0.008 −0.007 −0.006
(4, 2) 0 0 0 0 −0.005 −0.007 −0.007 −1.004
(4, 3) 0 0 0 0 −0.004 −1.005 −0.006 −0.004
(4, 4) 0 0 0 0 −0.003 −0.005 −1.004 −0.003
(4, 5) 0 0 0 0 −0.002 −0.004 −0.004 −0.003
(5, 1) 0 0 0 0 −0.007 −0.008 −0.008 −0.007
(5, 2) 0 0 0 0 −0.006 −0.007 −0.008 −1.005
(5, 3) 0 0 0 0 −1.004 −1.005 −0.007 −0.005
(5, 4) 0 0 0 0 −0.004 −0.005 −1.005 −0.004
(5, 5) 0 0 0 0 −0.003 −0.004 −0.005 −0.004
where

$$i = \underset{j \in \Omega\left(s_k\right)}{\arg\max}\; Q\left(s_k, \alpha_j\right). \tag{8.14}$$
Equations (8.13) and (8.14) define a policy for the reinforcement learning agent
to utilize when it finds itself at any state sk . Once Q is resolved properly and
sufficiently, the agent can use this policy to take actions that allow it to travel in
a reward-maximizing path of states until it reaches the goal (or a maximum number
of steps are taken).
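In code, the policy of (8.13) and (8.14) is a single argmax over the row of Q associated with the current state, with ties (as in Fig. 8.10) broken at random; this sketch assumes the tabular Q from before:

```python
import numpy as np

def greedy_action(Q, s, valid_actions):
    """Policy of (8.13)-(8.14): at state s, take the valid action whose Q
    value is largest, breaking ties at random."""
    values = Q[s, valid_actions]
    best = [a for a, v in zip(valid_actions, values) if v == values.max()]
    return int(np.random.choice(best))
```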
Fig. 8.9 Starting at state s1 = (2, 2), the agent looks up Q in search of the largest value along the row associated with s1. In this case, −0.005 is the largest value, which happens to fall under the "Up/↑" column. Taking this recommended action will take the agent to s2 = (3, 2)
Fig. 8.10 The largest value along a given row in Q is not always unique. In such cases, the agent
can choose any of the available optimal actions at random. Here, starting at (3, 4), the agent can
move either down to (2, 4) or right to (3, 5)
Fig. 8.11 Starting at s1 = (2, 2) and following the policy defined in (8.13) and (8.14), the agent
has three paths to the target state shown in green
The basic Q-learning algorithm has a number of parameters to set. These include the
number of maximum steps per episode of training T , as well as the total number of
training episodes E. Each of these parameters can heavily influence the performance
of the trained agent. On one end of the spectrum, if T is not set high enough, the
agent may never reach the goal state. With a problem like Gridworld—where there
is only one such state—this would be disastrous as the system (and Q) would never
learn how to reach the goal. On the other hand, the training can take an extremely
long time if the number of steps is set too large.
A similar story can be said for the number of episodes E: too small, and Q will
not be learned properly, and too large results in much wasted time and computation.
As we will see later in the chapter, other variants of the basic Q-learning algorithm
have additional parameters that need to be set as well.
To tune the Q-learning parameters, we need a validation strategy to evaluate the
performance of our trained agent with different parameter settings. This validation
strategy includes running a set of validation episodes, where each episode begins
at a different starting position, and the agent transitions using the optimal policy.
Calculating the average reward on a set of validation episodes at the completion of
each training episode can then help us evaluate how a particular parameter setting
affects the efficiency and speed of training.
Because a problem like the Gridworld discussed in Examples 8.1 and 8.2 has a
small number of states, the number of steps T and episodes E can be kept relatively
low. Ideally, however, we set both to a large number—as large as possible—given
time and computational constraints.
Q-Learning Enhancements
Recall Bellman's equation

$$Q\left(s_k, a_k\right) = r_k + \max_{j \in \Omega\left(s_{k+1}\right)} Q\left(s_{k+1}, \alpha_j\right), \tag{8.15}$$

where the term on the left hand side, i.e., $Q(s_k, a_k)$, stands for the maximum
possible reward the agent receives if it starts at state sk and takes action ak . This
is equal to the sum of the two terms on the right hand side of the equation: the first
(rk ) stands for the immediate short-term reward the agent receives for taking action
ak at state sk , and the second term stands for the maximum long-term reward the
agent can potentially receive starting from state sk+1 .
Note that the recursive equation in (8.15) was originally derived assuming Q was optimal. This is clearly not true at first when we begin training, since we
do not have knowledge of the optimal Q (that is why we have to train in the first
place). Therefore, neither term on the left and right hand sides of (8.15) involving
Q gives us a maximal value initially in the process. However, we can make several
adjustments to the basic Q-learning algorithm to compensate for the fact that the
optimal Q—and hence the validity of the recursive update equation—takes several
episodes of simulation to resolve properly.
One glaring inefficiency in the basic Q-learning algorithm is the fact that the agent
takes random actions during training (see the action selection step of the basic Q-learning algorithm in Sect. "The Basic Q-Learning Algorithm"). This inefficiency becomes more palpable
if we simply look at the total rewards per episode of training. In Fig. 8.12, we plot
the total reward gained per episode of training for the small Gridworld in the left
panel of Fig. 8.2. The rapid fluctuation in total reward per episode seen in this plot is
a direct consequence of using random action selection for training the reinforcement
learning agent. The average reward over time does not improve even though Q is
getting more accurate as training proceeds. Relatedly, this means that the average
amount of computation time stays roughly the same no matter how well we have
resolved Q.

Fig. 8.12 The total reward per episode recorded during simulation after running the basic Q-learning algorithm for 400 episodes on the Gridworld shown originally in the left panel of Fig. 8.2
While training with random action selection does force the agent to explore
the problem environment well during training, we never exploit the resolving Q
matrix/function during training in order to take actions. It seems intuitive that
after a while the agent does not need to rely completely on random action-taking.
Instead, it can use the (partially) resolved Q to take proper actions while training.
As Q becomes more and more close to optimal, this would clearly lower training
time in later episodes, since the agent is now taking actions informed by the Q
function/matrix instead of merely random ones.
The important question is: when should the agent start exploiting Q during
training? We already have a sense of what happens if the agent never does this:
training will be highly inefficient. On the other hand, if the agent starts exploiting Q
too soon and too much, it might not explore enough of the state space of the problem
to create a robust learner as the learning of Q would be heavily biased in favor of
early successful episodes.
In practice, there are various ways of applying this exploration–exploitation
trade-off for choosing actions during training. Most of these schemes use a simple
stochastic switch: at each step of an episode of simulation, choose the next action
randomly with a certain probability p, or via the optimal policy with probability
1 − p. In the most naive approach, the probability p can be kept fixed at some
value between 0 and 1 and used for all steps/episodes of training. More thoughtful
implementations push p gradually toward zero as training proceeds, since the
approximation of Q gets more reliable over time.
To see how much exploitation of Q helps make training more efficient, we
repeat the process used to create Fig. 8.12, this time setting the exploration–
exploitation probability p to 0.5 for all steps/episodes. As can be seen in Fig. 8.13,
the exploration–exploitation method produces episodes with much greater stability
(i.e., less fluctuation) and with far greater total reward.
A second enhancement introduces a discount factor γ into the update of (8.15), as

$$Q\left(s_k, a_k\right) = r_k + \gamma\, \max_{j \in \Omega\left(s_{k+1}\right)} Q\left(s_{k+1}, \alpha_j\right).$$

We constrain γ to lie between 0 and 1, so that by scaling it up and down, we can tune
the influence that short-term and long-term rewards have on how Q is learned. In
particular, by setting γ to a smaller value, we assign more weight to the contribution
of the short-term reward rk . In this case, the agent learns to take a more greedy
approach to accomplishing the goal, at each state taking the next step that essentially
maximizes the short-term reward only.
On the other hand, by setting γ close to 1, we essentially have our original
update formula back, where we take into account equal contributions of both short-
term and long-term rewards. As with the exploration–exploitation trade-off, one
can either set γ to a fixed value for all steps/episodes during training or change
its value from episode to episode according to a predefined schedule. Sometimes
setting γ to some value smaller than 1 helps in proving the mathematical convergence of
Q-learning. In practice, however, γ is usually set close to 1 (if not 1), and we just
tinker with the exploration–exploitation probability because in the end both trade-
offs (exploration–exploitation and short-term long-term reward) address the same
issue: our initial distrust of Q. Integrating both trade-off modifications into the basic
Q-learning algorithm, we arrive at an enhanced version of Q-learning, sketched below.
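A minimal sketch of the two enhancements, again against a hypothetical tabular Q: an exploration–exploitation switch with probability p, and a discounted Bellman update with factor γ.

```python
import random
import numpy as np

def choose_action(Q, s, valid_actions, p):
    """Exploration-exploitation switch: with probability p explore (random
    action); otherwise exploit the partially resolved Q (greedy action)."""
    if random.random() < p:
        return random.choice(valid_actions)
    return valid_actions[int(np.argmax(Q[s, valid_actions]))]

def discounted_update(Q, s, a, r, s_next, valid_next, gamma=0.95):
    """Discounted Bellman update: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q[s, a] = r + gamma * Q[s_next, valid_next].max()
```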
2 The pioneering electrical engineer Claude Shannon, who is regarded as the father of information
theory, published a seminal paper in 1950 entitled Programming a Computer for Playing Chess [2]
in which he points out the intractability of the approach of defining a “dictionary” for all possible
positions in Chess.
Tackling Problems with Large State Spaces

When the state space is too large to enumerate, we can no longer store Q as a table. Instead, we approximate each column of the Q matrix with a parameterized function of the state; in the simplest, linear case

$$Q\left(s, \alpha_j\right) \approx q_j(s) = w_{0,j} + w_{1,j}\, s, \tag{8.18}$$
where w0,j and w1,j are tunable weights or parameters. At each step of Q-
learning, rather than updating some Q(σk , αj ), we update the parameters of the
corresponding function $q_j(s)$—typically via online learning—such that

$$q_j\left(s_k\right) \approx r_k + \max_{i \in \Omega\left(s_{k+1}\right)} q_i\left(s_{k+1}\right).$$
Note that this is very similar to the linear regression setup described in Chap. 4, where the input–output pairs associated with the linear function $q_j$ arise occasionally in a sequential manner, as the agent navigates the problem environment.
Sometimes a linear function is not flexible enough to model $q_j(s)$ accurately. In such cases, we may choose nonlinear function approximators and rewrite (8.18) as

$$Q\left(s, \alpha_j\right) \approx q_j(s) = f\left(s;\, \mathbf{w}_j\right),$$

where f can be, for example, a multi-layer neural network with tunable parameters $\mathbf{w}_j$.
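In the linear case of (8.18), each update becomes an online regression step nudging $q_j(s_k)$ toward its Bellman target; a minimal sketch (the learning rate and interfaces are ours):

```python
def q_linear(s, w):
    """Linear approximator q_j(s) = w0 + w1*s from (8.18); w = [w0, w1]."""
    return w[0] + w[1] * s

def online_update(w, s, target, lr=0.01):
    """One gradient step on the squared error (q_j(s) - target)^2, where the
    target is the Bellman quantity r + max_i q_i(s')."""
    err = q_linear(s, w) - target
    w[0] -= lr * err
    w[1] -= lr * err * s
    return w
```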
Problems
(a) Initialize the Q matrix at random and run the Q-learning algorithm to resolve Q.
(b) Use the resolved Q to test the reinforcement learning agent placed initially at
each of the four corners of the Gridworld. Does the agent navigate the map as
expected? If not, why?
Table 8.2 The number (per player) of pieces in Chess along with a description of how each piece is allowed to move on the board. This table is associated with Exercise 8.4

Name    Symbol   No. of pieces   Legal moves
Pawn    p        8               1 square up; 2 squares up (first move only); 1 square forward diagonally when capturing an enemy piece (including en passant capturing)
Rook    R        2               Any number of squares horizontally or vertically; castling with the King
Knight  N        2               2 squares vertically and 1 square horizontally; 2 squares horizontally and 1 square vertically
Bishop  B        2               Any number of squares diagonally
Queen   Q        1               Any number of squares vertically, horizontally, or diagonally
King    K        1               1 square vertically, horizontally, or diagonally; castling with a rook
References
1. Murali A, Sen S, Kehoe B, et al. Learning by observation for surgical subtasks: multilateral cutting of 3D viscoelastic and 2D orthotropic tissue phantoms. In: Proceedings of the 2015 IEEE international conference on robotics and automation; 2015. p. 1202–9.
2. Shannon CE. Programming a computer for playing chess. Philos Mag. 1950;41(314):256–75.