Machine Learning Simplified
Version 1.0.1 (first release, January 2022)
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1 Machine Learning 6
1.1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Machine Learning Pipeline 9
1.2.1 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.2 ML Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Artificial Intelligence 11
1.3.1 Information Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2 Types of AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Overview of this Book 13
3 Model Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Linear Regression 29
3.1.1 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.3 Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.4 Gradient Descent with More Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Gradient Descent in Other ML Models 43
3.2.1 Getting Stuck in a Local Minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Overshooting Global Minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.3 Non-differentiable Cost Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 Bias-Variance Decomposition 59
5.1.1 Mathematical Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1.2 Diagnosing Bias and Variance Error Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Validation Methods 64
5.2.1 Hold-out Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Unrepresentative Data 68
6 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Introduction 69
6.2 Filter Methods 71
6.2.1 Univariate Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.2 Multivariate Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Search Methods 74
6.4 Embedded Methods 74
6.5 Comparison 75
7 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.1 Data Cleaning 78
7.1.1 Dirty Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.1.2 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.2 Feature Transformation 81
7.2.1 Feature Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.2.2 Feature Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3 Feature Engineering 85
7.3.1 Feature Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.3.2 Ratio Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4 Handling Class Label Imbalance 88
7.4.1 Oversampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.4.2 Synthetic Minority Oversampling Technique (SMOTE) . . . . . . . . . . . . . . . . . . . . . 90
PREFACE
here). You will find answers to these questions and many more as
you read this book.
How to Use This Book
This book is divided into two parts. Part I discusses the fundamentals of (supervised) machine learning, and Part II discusses more
advanced machine learning algorithms. I divided the book in this
way for a very important reason. One mistake many students make
is to jump right into the algorithms (often after hearing one of their
names, like Support Vector Machines) without a proper foundation.
In doing so, they often fail to understand, or misunderstand, the
algorithms. Some of these students get frustrated and quit after
this experience. In writing this book, I assumed the chapters would
be read sequentially. The book has a specific story line and most
explanations appear in the text only once to avoid redundancy.
I have also supplemented this book with a GitHub repository that
contains Python implementations of the concepts explained in the book.
For more information, scan the QR code located in the ‘Try It Now’
box at the end of each chapter, or just go directly to
github.com/5x12/themlsbook.
Final Words
Hopefully this book persuades you that machine learning is not the
intimidating technology that it initially appears to be. Whatever
your background and aspirations, you will find this book a useful
introduction to this fascinating field.
Should you have any questions or suggestions, feel free to reach out
to me at awolf.io. I appreciate your feedback, and I hope that it will
make future editions of this book even more valuable.
Good luck in your machine learning journey,
Your author
Part I
FUNDAMENTALS OF SUPERVISED
LEARNING
1. Introduction
In this book you will learn about supervised machine learning, or supervised ML. Supervised ML is
one of the most powerful and exciting fields in science right now and has many practical business
applications. Broadly speaking, the goal of supervised ML is to make predictions about unknown
quantities given known quantities, such as predicting a house’s sale price based on its location and
square footage, or predicting a fruit category given the fruit’s width and height. Supervised ML does
this by learning from, or discovering patterns in, past data, such as past house sale prices.
This chapter has two goals. The first is to give you a taste for the type of problems supervised ML
can be used to solve. The second is to help you form a high-level mental map of the ML World. It
will help you understand the difference between supervised ML and many other fields and terms
that are sometimes (incorrectly!) used synonymously with it or are (correctly) used in similar or
subtly different contexts. Some of these terms include: deep learning, unsupervised learning, data
science, ML operations (or MLOps), and artificial intelligence. At the end of this chapter you should
understand what supervised ML deals with and what these related fields deal with. This will help
focus your attention on the right things as we explore supervised ML over the course of the book.
Figure 1.1: A high-level overview of supervised ML for the housing price prediction example. (a)
Each row in the table on the left represents measurements of a single house. The first four columns
are observed features, such as crime rate and ocean proximity, while the last column is the target
variable, the house price. (b) The ML model learns to predict house prices from the training dataset
which contains both the features and the price of houses (it does this by finding patterns in the data,
e.g., that high crime rate lowers price). The trained model is then used to predict the unknown price
of a new house given its features (crime rate, ocean proximity, etc.), as shown in the block diagram
on the right.
2. How many customers will come to our restaurant next Saturday? This can be useful for
making sure we have the right amount of food and workers on staff. We want to make this
prediction based on factors like the number of guests that came last weekend, the number of
guests each day this week, the expected weather this weekend, and if there is a holiday or
special event.
Both classification and regression models help us solve numerous business problems in everyday life.
However, how can we make a model that automatically makes the desired predictions (as accurately
as possible)? At first it may seem that there are some rules that we can encode into the model -
for example that houses by the ocean have higher value. But usually we will not know any simple
rules in advance nor know exactly how the rules interact with each other, either reinforcing one
another or cancelling each other out – exactly how much does proximity to the ocean increase the
house’s value? If the house is in a dangerous neighborhood does that decrease its value? What if it
is in a dangerous neighborhood but also close to the ocean? As we can start to see, the number of
interactions among variables grows very fast. In real problems where there are thousands, millions,
or even more variables, we can’t possibly understand how they all interact.
The goal of supervised machine learning, or supervised ML, is to automatically learn the correct
rules and how they interact with each other. How is this possible? It does so by learning from
the provided dataset: In supervised ML we assume that we are given a dataset (e.g., a table in
Figure 1.1a) consisting of both the measurements of an object (e.g., house features like floor location,
floor size, crime rate, etc.) and the desired label, or answer (the house price). (This final piece
of information, the answer, is precisely what is meant by “supervised” learning.) Supervised ML
learns the relationships between different measurements and the label. After the model is learned
(or trained), we can then use it to predict the label of a new (unseen) data point from its known
measurements (but totally unknown label). Coming back to the housing example, the model would
first learn from the dataset how different measurements affect the house price. Then, it will be able
to predict the unknown price of a new house based entirely on its measurements (shown in Figure
1.1b). This prediction may be very accurate or very inaccurate. The accuracy depends on many
factors, such as the difficulty of the problem, quality of the dataset, and quality of the ML algorithm.
We will discuss these factors in depth throughout this book.
Definition 1.1.1 — Supervised Machine Learning Algorithm. A supervised learning algo-
rithm is an algorithm that learns rules and how they interact with each other from the labelled
dataset in order to perform regression or classification tasks.
Supervised ML may still seem a bit abstract to you, but that’s okay. In the next chapter we will
dive into a more detailed introduction of supervised ML: we will precisely define terms used by
supervised ML, show you how to represent the learning problem as a math problem, and illustrate
these concepts on a simple dataset and prediction task. The rest of this book will expand upon this
basic framework to explore many different supervised ML algorithms.
Deep Learning
Deep learning is a term that you have probably heard of before. Deep learning models are very
complex supervised ML models that perform very complicated tasks in areas where more advanced
or faster analysis is required and traditional ML fails. They have produced impressive results on a
number of problems recently, such as image recognition tasks. We will not discuss deep learning in
this book, due to its complexity. But, at a high level, deep neural networks (what is usually meant
by the term ’deep learning’) are so powerful because they are built on neural network algorithms –
specific algorithms designed to model and imitate how the human brain thinks and senses information.
After mastering the basics of supervised learning algorithms presented in this book, you will have a
good foundation for your study of deep neural networks later. I recommend deeplearning.ai as a
good source for beginners on deep learning.
It should be noted, before we continue, that although deep learning models are very complex and
have produced great results on some problems, they are not universally better than other ML methods,
despite how they are sometimes portrayed. Deep learning models often require huge amounts of
training data, huge computational cost, a lot of tinkering, and have several other downsides that
make them inappropriate for many problems. In many cases, much simpler and more traditional ML
models, such as the ones you will learn about in this book, actually perform better.
Figure 1.2: Typical pipeline for a supervised machine learning problem. Steps I and II are responsible
for extracting and preparing the data, Step III is responsible for building the machine learning
model (most of the rest of this book will detail the steps in this process), and Step IV is responsible
for deploying the model. In a small project, one person can probably perform all tasks. However,
in a large project, Steps I and II may be performed by a data science specialist, Step III may be
performed by either a data scientist or a machine learning operations (MLOps) engineer, and Step
IV may be performed by an MLOps engineer.
market segments, and they are considering designing and positioning the shopping mall services to
target a few profitable market segments or to differentiate their services (e.g., invitations to events
or discounts) across market segments. This scenario uses unsupervised ML algorithms to cluster a
dataset of surveys that describes attitudes of people about shopping in the mall. In this example,
supervised learning would be useless since it requires a target variable that we want to predict.
Because we are trying to understand what people want instead of predicting their actions, we use
unsupervised learning to categorize their opinions.
After completing this book on supervised ML, you are encouraged to learn about unsupervised
ML. As you will find in your studies, many of the key concepts, ideas, and building blocks of
algorithms that we will learn about in this book in the supervised setting have similar analogues in
the unsupervised setting. In other words, mastering supervised ML will make it much easier for you
to master unsupervised ML as well. One last note: I’ve left some more information on unsupervised
learning in Appendix A - feel free to read it through.
Imagine an online store “E Corp” that sells hardware for computers. E Corp wants to understand
their customers’ shopping habits to figure out which customer is likely to buy which product. They
then want to build a system that predicts how best to market to each user. They have hired you to
build such a system. How should you proceed? You brainstorm the following plan, depicted in the
block-diagram in Figure 1.2:
I. First, you have to obtain or extract data that will be used for algorithm training. The initial step
is to collect historical data of E Corp on their past customers over a certain period of time.
II. Second, you have to prepare the dataset for your ML model. One of the preparation procedures
can be to check the quality of collected data. In most real-world problems, the data collected is
extremely “dirty”. For example, the data might contain values that are out of range (e.g., salary
data listing negative income) or have impossible data combinations (e.g., medical data listing
a male as pregnant). Sometimes this is a result of human error in entering data. Training the
model with data that hasn’t been carefully screened for these or many other problems will
produce an inaccurate and misleading model. Another preparation procedure might be to use
statistical and ML techniques to artificially increase the size of the dataset so that the model
would have more data to learn from. (We will cover all preparation procedures in Chapter 7.)
III. Next, you build your ML model - you feed the prepared dataset into your ML algorithm to train
it. At the end of training, you have a model that can predict the probability of a specific person
purchasing an E Corp product based on their individual parameters. (Don’t worry if you don’t
fully understand what the term “training” means just yet – this will be a huge subject of the
rest of the book!)
IV. Finally, you deploy your trained and (presumably) working model on a client’s computer or
cloud computing service as an easy-to-use application. The application may be a nice user
interface that interacts with the ML model.
If E Corp is a small company with a small dataset, you could perform all of these steps yourself.
However, for much bigger ML problems you will probably need to separate these steps into parts
which can be performed by different people with different expertise. Usually, a data scientist will
be responsible for steps I and II, and an ML operations engineer, or MLOps engineer, will be
responsible for step IV. Step III is an overlap and can be performed by either a data scientist or
an MLOps engineer. (In my experience, it is quite often the former.) In the next two
sections we discuss data science and MLOps in more detail. Our goal is to give you an overview of
these fields, and make sure you understand how they differ. Many times all of these terms are used
synonymously, even in job descriptions, and this tends to confuse students of machine learning.
1.2.1 Data Science
A data scientist typically deals with questions like the following:
• How can we ensure data quality? What do we need to do to ensure that we have a clean
and ready-to-use dataset?
• Are different datasets or features scaled differently? Do we need to normalize them?
• How could we generate more insightful features from the data we have?
• What are the top three features that influence the target variable?
• How should we visualize the data? How could we tell the story to our client in the best way?
• What is the purpose of a model? What business problem can it solve?
• How do we plan to evaluate a trained model? What would be an acceptable model accuracy?
• Based on the data we have and the business problem we deal with, what algorithm should be
selected? What cost function should be used for that algorithm?
1.2.2 ML Operations
The main goal of ML Operations, or MLOps, is getting models out of the lab and into production
(step IV of Figure 1.2). Although MLOps engineers also build models, their main focus lies in the
integration and deployment of those models on a client’s IT infrastructure. An MLOps engineer
typically deals with the following questions:
• How do we move from a Jupyter notebook (a Python programming environment) to production,
i.e., scale the prediction system to millions of users? Do we deploy on cloud (e.g., AWS,
GCP, or Azure), on-premises, or hybrid infrastructure?
• What front-end, back-end, or data services should the product integrate with? How can we
integrate our model with a web-based interface?
• What are data drift and vicious feedback loops? Will data in the real world behave just
like data we used in the lab to build the ML model? How can we detect when the new data is
behaving in an ’unexpected’ way? How frequently should the data be refreshed?
• How to automate model retraining? What tools should we use for continuous integration (CI)
or continuous deployment (CD)?
• How to version the code, model, data, and experiments?
• How do we monitor the application and infrastructure health? What unit tests do we have to
perform before running the model or application? Do we have logging and other means of
incident investigation?
• How do we meet security and privacy requirements? Are there any security controls to take
into account?
the up and down buttons shows that it is an example of AI, even if it is an incredibly basic
example.
2. The spam detector in your email needs to intelligently decide if a new email is spam email or
real email. It makes this decision using information about the message, such as its sender, the
presence of certain words or phrases, e.g., a message with the words ’subscribe’ or ’buy’ may
be more likely to be spam. (Remember that this is an example of supervised ML that we saw
earlier in this chapter.)
3. A much more complex intelligent system is a self-driving car. This requires many difficult AI
tasks, such as detecting red lights, identifying pedestrians, and deciding how to interact with
other cars and predict what other cars will do.
All these AI machines perform their tasks roughly based on a set of simple "if-then" rules:
• elevator B will get you to the top floor if it is closer to you than elevators A and C.
• a spam detector sends an email to the spam folder if that email contains certain words such as
"sale", "buy", or "subscribe".
• a self-driving car stops at an intersection if the stoplight is red.
However, these machines are different in complexity. So, what defines their complexity?
What makes a smart elevator a less complex form of AI than a self-driving car? The simple answer
is the level of information that the machine processes to make a decision. You can actually organize
the complexity of the three aforementioned examples of AI: A spam detector is more complex than a
smart elevator, and a self-driving car is more complicated than a spam detector.
• A smart elevator processes almost no information: to understand that it is the closest one to you
and needs to go your way, it takes into account (calculates) just two static variables, the direction
and the distance between itself and a passenger.
• An e-mail spam detector processes some information every time a new e-mail is received
– specifically, it parses/analyses the text of each email received – to detect trigger words to
decide if an email should be sent to the spam folder.
• A self-driving car processes a lot of information every second to understand the direction and
speed it needs to go, and if it needs to stop for a light, pedestrian, or other obstacle.
The level of information processing required for machine decision-making defines the complexity of
AI. With more information processing comes more patterns that the machine needs to extract and
understand. AI is categorized based on how complex it is. These categories are outlined in the next
section.
1.3.2 Types of AI
Each AI task requires its own set of techniques to solve. Some researchers have categorized AI into
four categories, roughly based on complexity:
Basic AI: Basic AI is the simplest form of AI. It does not form memories of past experience to
influence present or future decisions. In other words, it does not learn, it only reacts to
currently existing conditions. Basic AI often utilizes a predefined set of simple rules encoded
by a human. An example of Basic AI is the smart elevator discussed above.
Figure 1.3: A "mind-map" diagram illustrating the relationship between different areas of artificial
intelligence.
Limited AI: Limited AI systems require a substantial amount of information, or data, to make
decisions. The advantage is they can make decisions without being explicitly programmed
to do so. The email spam detector discussed above is an example of limited AI – the spam
detector learns rules for predicting if an email is spam based on a dataset of emails that are
known to be spam or not spam. (Usually a human being must provide these labels that the
machine learns from.) Machine Learning is the main and most well-known example of Limited
AI, and the email spam detector is an example of supervised ML, which we discuss in the next
section.
Advanced AI: Advanced AI systems will possess the intelligence of human beings. They will be
able to, for example, drive your car, recognize your face, have conversations with other human
beings, understand their emotions, and perform any task that an intelligent human could do.
These systems are often described as a computational “brain”.
Superintelligence: Superintelligent AI systems will possess intelligence that far surpasses the
abilities of any human being.
So far we haven’t explained just how AI systems perform intelligent tasks. You may be wondering:
Are these machines “actually” intelligent? As it turns out, this is a very deep philosophical question
and one that we will not go into here. But we will note that the term “artificial intelligence” as it
is used in society often gives an unrealistic, sci-fi, depiction of AI systems. However, many AI
systems actually come down to solving carefully crafted math problems – and that’s a large part of
their beauty. You will see this process – the reduction of a problem requiring intelligence to a math
problem – in the context of supervised ML throughout this book.
Supervised ML can actually be viewed as a subfield of AI. Figure 1.3 shows a nested set of circles
representing the subfields of AI and how supervised ML fits into it.
This book thoroughly covers data preparation and model building blocks - steps II and III of Figure
1.2. The book is also divided into two parts. In Part I you will learn about the fundamentals of
machine learning, and in Part II you will learn the details about more complex machine learning
algorithms.
I chose to split this book into these two parts for a very important reason. Many times people hear
the names of certain powerful machine learning algorithms, like support vector machines, neural
networks, or boosted decision trees, and become entranced by them. They try to jump into learning
complex algorithms too quickly. As you can imagine, it is very difficult for them to learn complex
models without learning simpler ones first. But also, as you may not yet be aware, most
ML models encounter common problems and difficulties (such as overfitting) which you will learn
about in the next chapter. Therefore, while it may be tempting to delve into specific algorithms right
away, you should take time to get the high-level picture of the ML world as well as all the details of
the fundamental principles it was built on (e.g. how an algorithm actually learns from data from a
mathematical point of view). It’s kind of like working on a puzzle with the picture on the cover; sure,
you can finish the puzzle without the image, but it will take a lot longer, and you could make a lot of
mistakes. Once you master the fundamentals of ML you will have a clear idea of these common
problems and you will be able to understand how to avoid common pitfalls, regardless of which
learning algorithm you use for a specific problem.
The next chapter gives an overview of the fundamentals of ML in the context of an example. At a
high level, we will see a detailed sequence of steps that represents the model building stage (Step III)
of the ML pipeline shown in Figure 1.2. After giving you a high level overview of the steps in ML
model building, that chapter will also provide more details about how the rest of the book is laid out,
chapter-by-chapter.
Key concepts
• Artificial Intelligence
• Supervised Learning, Unsupervised Learning, Deep Learning
• MLOps, Data Science
A reminder of your learning outcomes
Having completed this chapter, you should be able to:
• Understand what supervised ML deals with and how it fits into the wider world of AI.
• Form a typical pipeline for a supervised machine learning problem.
2. Overview of Supervised Learning
In this chapter we will present an overview of supervised machine learning (ML). Supervised ML
systems are fairly complicated but, for the most part, they all share the same basic building blocks.
We begin this chapter by presenting an overview of supervised ML in the context of a simple example
out of which these building blocks will emerge. We will then move from a specific to a general form,
clearly articulating the building blocks. This general structure will serve as a roadmap for the rest
of the book which will delve into the mathematical details and the various trade-offs and choices
associated with each building block.
goal will be to build an ML model that predicts a target variable – the type of a new fruit – based
only on its features – its height and width measurements. But first, we present some mathematical
notation and explore this dataset further.
Data Visualization
Since it is difficult for us humans to make sense of a raw data table, it is often useful to visualize the
data. For a classification problem with two input features the data is often visualized with a scatter
plot. A scatter plot for our example dataset is shown in Figure 2.1. In the scatter plot, the horizontal
axis corresponds to the first feature and the vertical axis corresponds to the second feature. The ith
data point corresponds to a dot with coordinates equal to its feature values, $(x_i^{(1)}, x_i^{(2)})$, and is colored
according to its class label, $y_i$, where the colors associated with each class are designated by the
legend in the top left corner of the figure. In our fruit classification example, the horizontal axis
represents the height feature, the vertical axis represents the width feature, each fruit is a dot whose
coordinates are its height and width, and the dot is colored red if the fruit is an apple, green if it is a
mandarin, and blue if it is a lemon. For example, the third fruit, i = 3, corresponds to a point with
coordinates (10.48, 7.32) and is colored blue (class Lemon) in the figure.
Figure 2.1: Scatter plot visualization of our example fruit dataset. The axes represent the two
measured properties (height on the horizontal x(1) -axis and width on the vertical x(2) -axis). Each
dot represents a fruit whose coordinates correspond to the two measurements and whose color
corresponds to the fruit’s type (color code specified by the legend at the top left of the figure).
models, K-Nearest Neighbor models, Bayesian models, maximum margin models, and many others,
correspond to choosing a different type of prediction function f . In this section we focus on the
K-Nearest Neighbor (KNN) classifier due to its simplicity.
KNN classifies a new point x by first finding the k points in the training set that are nearest to the
point x. It then predicts that the (unknown) label of point x is equal to the most popular class among
these nearest neighbors (remember that we know the class label of the neighbors since they are in
our training set). k is a parameter that we have freedom to choose. If we choose k = 1, our predictor
is especially simple: we find the closest point in the training set and predict that the new data point
has the same class label as that training point.
How would the 1-Nearest neighbor classifier predict the type of a new fruit that is 4 cm tall and 5
cm wide? First, we find the piece of fruit in our labeled training dataset whose height and width are
closest to the coordinates (4cm, 5cm), and then guess that the unknown fruit is the same type as this
nearest neighbor. From our dataset in Table 2.1, the nearest fruit to the (4cm, 5cm) measurement is
in row i = 1, with height and width of (3.91, 5.76), and has type mandarin. (You can check that
this is the nearest neighbor by brute force: measure the distance to each data point and then find the
smallest distance.) Thus, we would guess that the unknown fruit is a mandarin.
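To make this concrete, here is a minimal brute-force sketch of 1-Nearest Neighbor prediction in Python. The mandarin and lemon rows echo values quoted in the text; the apple row is a hypothetical measurement added only so the training set has all three classes.

```python
import math

# Points are (height, width) pairs; labels are fruit types.
training_data = [
    ((3.91, 5.76), "mandarin"),   # row i = 1 quoted in the text
    ((10.48, 7.32), "lemon"),     # row i = 3 quoted in the text
    ((7.30, 7.60), "apple"),      # hypothetical apple measurements
]

def predict_1nn(new_point):
    """Predict the label of new_point as the label of its nearest neighbor."""
    nearest = min(training_data, key=lambda p: math.dist(p[0], new_point))
    return nearest[1]

print(predict_1nn((4.0, 5.0)))  # -> 'mandarin', as in the worked example
```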
(a) Learned decision regions (and training data) (b) Classify new point
Figure 2.2: Visualization of learned decision regions for a 1-Nearest neighbor classifier. Figure
(a) shows the training data points along with the learned decision regions (colored areas). Figure
(b) shows how this classifier predicts the (unknown) type of a new fruit with height and width
measurements (4cm, 5cm). This point lands in the green region so the classifier predicts mandarin.
that the model performs exceptionally well on training data, and thus it’s an unrealistic estimate of
real-world error.
Loss Function
First, we need to summarize how well our predictions match the true label with a single number.
Let’s compute the misclassification rate by counting the total number of misclassifications and
dividing by the total number of data points:
$$L(f, X, Y) = \frac{1}{n} \sum_{i=1}^{n} [f(x_i) \neq y_i] \qquad (2.1)$$
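A direct translation of Equation 2.1 into code might look like the following sketch; the toy test set and the deliberately bad constant classifier are hypothetical stand-ins.

```python
def misclassification_rate(f, X, Y):
    """Equation 2.1: the fraction of points where f(x_i) != y_i."""
    return sum(f(x) != y for x, y in zip(X, Y)) / len(X)

# Hypothetical toy test set and a constant classifier for illustration.
X_test = [(4.0, 5.0), (9.8, 7.1), (7.3, 7.4)]
Y_test = ["mandarin", "lemon", "apple"]
always_apple = lambda x: "apple"

print(misclassification_rate(always_apple, X_test, Y_test))  # 2/3
```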
Table 2.2: Table (a) shows the test set consisting of measurements and known fruit label. Table (b)
illustrates a data set where we withhold the fruit type (from our learned model).
Table 2.3: Table showing the fruit’s actual type $y_i$ (first column), its predicted type $f(x_i)$ (second column), and its misclassification indicator $[y_i \neq f(x_i)]$ (third column).
Figure 2.3: Predictions on the test set of the model f learned with k = 1, shown as decision regions in Figure 2.2b. Each test point is colored with its true type; its predicted type is the colored region it falls in.
correct, and three were incorrect. Your misclassification rate on the new batch of fruit, called the
test set, was $L(f, X_{test}, Y_{test}) = \frac{3}{5}$.
Overfitting
What happened? Although you got a very low error on the training set (0% wrong) you got a higher
error on the test set (60% wrong). This problem, where the error is low on the training set, but much
higher on an unseen test set, is a central problem in ML known as overfitting. Essentially, what
happened is your classifier (nearly) memorized the training data. One potential indication of this is a
rough and jagged decision boundary, as is the case in Figure 2.2a, which is strongly influenced by
every single data point.
Overfitting is not just a problem in ML; it is a problem we humans suffer from while learning as
well. Consider the following example, which is probably similar to an experience you’ve had before.
Two students, Ben and Grace, have a difficult multiple-choice math exam tomorrow. To prepare his
students for the exam, the professor gave each of them a past exam and told them that tomorrow’s
exam would be similar. Ben spent most of his time memorizing the correct answers to the practice
exam, while Grace spent her time working through the problems to understand how to calculate the
correct answers. Before they went to bed that night, both Ben and Grace took the practice exam.
Ben got a 98% (or 2% error) while Grace only got an 83% (or 17% error). Ben was confident that he
would ace the exam tomorrow and that his friend Grace would not do so well. But what happened
the next day when the professor gave them a new 100-question exam? Ben got incredibly lucky
because 20 of the questions were nearly identical to the ones on the practice exam. He gave the
answer based on his memory and got them all right. But the other 80 questions did not resemble the
practice questions. On these questions, he was left pretty much making a wild guess and only got
20 of them correct. All-in-all his score was 40% (or 60% error), much worse than his score on the
practice exam. Since Grace, on the other hand, learned the concepts from the practice exams, she
was able to adapt them to the problems on the new exam and scored 80% (or 20% error), almost
the same as she scored on the practice exam the night before. Grace learned better than Ben and,
despite doing worse than him on the practice exam, did much better than him on the new exam.
Ben’s learning was overfit to the practice exam, just as our learned fruit classifier was overfit to the
training data.
Figure 2.4: Decision regions for decreasing values of k (the number of neighbors in the KNN
classifier). In figure (a) we underfit the data, learning a function that predicts Apple regardless of
height and width; figure (b) obtains a more precise fit; figure (c) an even more precise fit (obtaining
minimal test set error - see Figure 2.5); figure (d) is severely overfit.
rougher decision boundaries). When k is decreased all the way until k = 1 in the bottom right
subfigure, Figure 2.4d, we learn the most complex model (classifier pays attention to every single
data point).
The goal in machine learning is to find the model that optimally balances the overfitting and
underfitting of the data. Model complexity is controlled by the hyperparameters of the
model itself. Each model has its own set of hyperparameters. The complexity of a KNN classifier is
defined by the hyperparameter k. What is the right value of k to produce a good model?
Figure 2.5 plots the train and test error for each value of k ranging from k = n on the left down to
k = 1 on the right. On the left side of the plot (near k = n) we underfit the data: the train and test
error are very similar (almost touching) but they are both very high. On the other hand, on the right
side of the plot (near k = 1) we overfit the data: the training error is very low (0%, as we calculated),
but the test error is much higher (60%). The model with k = 5 achieves the optimal test error, and
hence the optimal trade off between overfitting and underfitting.
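The sweep over k behind Figure 2.5 is easy to reproduce. Below is a minimal sketch using scikit-learn on a synthetic stand-in dataset (the book’s actual fruit data lives in the companion GitHub repository), so the exact error values will differ from the figure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class, 2-feature stand-in for the fruit dataset.
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for k in (len(X_train), 15, 5, 1):  # large k underfits, k = 1 overfits
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  train error={1 - model.score(X_train, y_train):.2f}"
          f"  test error={1 - model.score(X_test, y_test):.2f}")
```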
Figure 2.5: Misclassification error rate for training and test sets as we vary k, the number of
neighbors in our KNN model. Models on the left side of the plot (near k = n = 20) use too large a
neighborhood and produce an underfit model. Models on the right side of the plot (near k = 1) use
too small a neighborhood and produce an overfit model. Models between k = 3 and k = 7 perform
optimally on the test set – in other words, they optimally balance overfitting and underfitting.
Figure 2.6: Detailed pipeline for a full ML system. The blue boxes represent the high-level ML
pipeline we saw in Figure 1.2 of Chapter 1. The white boxes represent more detailed steps (underneath
each high-level step) of the pipeline which we discuss in this chapter. This diagram can serve
as your “mind map” which guides you through the rest of this book in which we will present details
for (most of) the white boxes.
The plot above is typical of most ML algorithms: low complexity models (left part of the figure)
underfit the data; as model complexity increases (moving right on the figure), the training error
decreases, while the test error first decreases, hits a minimum, and then begins to increase as
model complexity grows further, signifying overfitting. (Note that this explanation holds
approximately, as there are several ‘bumps’ in the figure.)
We actually saw a very high-level view of this pipeline in Figure 1.2 of Chapter 1. We are now ready
to see a more detailed form of the pipeline, shown in Figure 2.6. The pipeline consists of four main
stages – Data Extraction, Data Preparation, Model Building, and Model Deployment – shown in the
blue boxes at the top. Each main stage consists of several intermediate stages which are shown in
the white boxes below the main stage’s box. Of these high level stages, the Model Building stage
contains most of the complex and difficult ML components. The majority of this book will focus on
this stage, hence it is highlighted in the figure. The following sections give a brief overview of each
stage and also tell you where in the book that stage will be discussed in detail. Thus, this section
will also serve as a roadmap for your study of the rest of the book.
Although the rest of the ML pipeline is clearly important, and is the part that you will spend nearly
all of your time learning to master, we must not underestimate the importance of high-quality data.
In fact, in recent years, many ML systems obtained huge increases in performance not by creating
better ML algorithms, but by a dedicated effort to collect and provide high quality labels for large
datasets.
Data Cleaning
Many ML algorithms cannot be applied directly to ’raw’ datasets that we obtain in practice. Practical
datasets often have missing values, improperly scaled measurements, erroneous or outlier data points,
or non-numeric structured data (like strings) that cannot be directly fed into the ML algorithm. The
goal of data cleaning is to pre-process the raw data into a structured numeric format (basically, a
data table like we have in our example in Table 2.1) that the ML algorithm can act upon.
Other techniques similar to data cleaning (which get the data into an acceptable format for ML)
are: converting numerical values into categories (called feature binning); converting categorical
values into numerical ones (called feature encoding); and scaling feature measurements to a
similar range. We discuss data cleaning, feature encoding, and related techniques, and give more
detailed examples of why and when we need them, in Chapter 7.
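As a small illustration, the following sketch applies a few typical cleaning steps to a hypothetical “dirty” table using pandas; all column names and values are invented.

```python
import pandas as pd

# Hypothetical raw data: a missing age, an impossible negative salary,
# and a non-numeric column the ML algorithm cannot use directly.
raw = pd.DataFrame({
    "age":    [34, 29, None, 41],
    "salary": [52000, -1000, 48000, 61000],
    "smoker": ["yes", "no", "no", "yes"],
})

clean = raw.dropna(subset=["age"])              # drop rows with missing age
clean = clean[clean["salary"] >= 0].copy()      # remove out-of-range values
clean["smoker"] = clean["smoker"].map({"yes": 1, "no": 0})  # encode strings
print(clean)
```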
Feature Design
Data cleaning is necessary to get the data into an acceptable format for ML. The goal of feature
design techniques is to improve the performance of the ML model by combining raw features into
new features or removing irrelevant features. The main feature design paradigms are the following
(a short code sketch follows the list):
Feature Transformation: Feature transformation is the process of transforming human-readable
data into machine-interpretable data. For instance, many algorithms cannot handle categorical
values, such as "yes" and "no", and we have to transform those into numerical form, such as 0
and 1, respectively.
Feature Engineering: Feature engineering is the process of combining (raw) features into new
features that do a better job of capturing the problem structure. For example, in our fruit
classification problem, it may be useful to build a model not just on the height and width of a
fruit, but also on the ratio of height to width of the fruit. This ratio feature is a simple example
of feature engineering. Feature engineering is usually more of an art than science, where
finding good features requires (perhaps a lot of) trial and error and can require consulting an
expert with domain knowledge specific to your problem (for example, consulting a doctor if
you are building a classification model for a medical problem).
Feature Selection: Feature selection is another idea whose goal is to increase the quality of the final
ML model. Instead of creating new features, the goal of feature selection is to identify and
remove features that are useless to the prediction problem (or that have low predictive power).
Removing useless features ensures the model does not learn to make predictions based on this
erroneous information. This can significantly improve our performance on unseen data (test
set).
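Here is the promised sketch, illustrating all three paradigms on a hypothetical fruit table using pandas; the organic and row_id columns are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "height":  [3.91, 10.48, 7.30],
    "width":   [5.76, 7.32, 7.60],
    "organic": ["yes", "no", "yes"],
    "row_id":  [101, 102, 103],
})

# Feature transformation: encode a categorical value numerically
# (following the 0/1 mapping mentioned in the text).
df["organic"] = df["organic"].map({"yes": 0, "no": 1})

# Feature engineering: the height-to-width ratio feature from the text.
df["height_to_width"] = df["height"] / df["width"]

# Feature selection: drop a column with no predictive power.
df = df.drop(columns=["row_id"])
print(df)
```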
Feature transformation, feature engineering and feature selection are crucial steps that can make or
break an ML project. We will discuss them in Chapters 6 and 7.
Algorithm Selection
The next step is to select the form of prediction function fˆ. The form of fˆ should be chosen such that
it accurately captures the “true” function f . Of course, the true f is unknown (that is why we are
trying to learn fˆ from data!). The vast array of ML algorithms that have been developed correspond
to choosing a different form of the function fˆ. We discuss most of these algorithms in Part II of this
book. These algorithms include:
• Linear and Polynomial models
• Logit models
• Bayesian models
• Maximum margin models
• Tree-based models
• Ensemble models
Many of these algorithms can be applied to both classification and regression problems (with slight
changes). As we learn about these algorithms in the second part of this book, it will be helpful
for you to keep in mind that all of them simply amount to changing this one part of the ML
pipeline (while keeping the rest of it fixed).
Model Learning
Once we have selected the learning algorithm and its loss function, we need to train the model. The
beauty of machine learning is that the actual learning is not something mystical or conceptually
difficult – it is simply a mathematical optimization problem. In Chapter 3 I will show clearly how
the math works behind the scenes during model learning, and what learning really is from a
mathematical point of view. But at a high level, learning amounts to finding the set of parameters that minimizes
the loss function on the training data.
Model Evaluation
The next key component of the ML algorithm is assessing how well the trained model will perform
on unseen test data. As we saw in our fruit classification example above, we cannot simply evaluate
the error on the training set (otherwise we could simply memorize the answers and get a perfect
score). In an ideal world, we would be supplied with a set of test data, separate from the training
data, on which we could evaluate the model. However, in practice, we are often presented with one
large dataset, and then it’s our job to split it into training and test datasets. How exactly we split
the dataset into two sets is a separate topic that deserves attention. In Chapter 5.2 we will discuss
more advanced methods, like cross validation, which creates many different splits of the data and
evaluates the model’s performance on each different split.
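As a concrete illustration, here is a minimal sketch of both validation styles using scikit-learn; the Iris dataset and the 5-neighbor KNN model are arbitrary stand-ins.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # a small stand-in dataset

# Hold-out validation: a single train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross validation: average performance over several different splits.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```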
Hyper-parameter Tuning
As we saw in our fruit classification example, it is often very easy to build an ML model that performs
perfectly (or very well) on training data, but very poorly on unseen test data. We encounter the same
difficulty in almost all ML problems, and a central challenge is to train a model that balances the
competing problems of overfitting and underfitting. This balance comes with the right values of
hyperparameters.
How can we select the best hyperparameters? Usually there is no way to know beforehand which
hyperparameters will work best. Often many different combinations of hyperparameters are
tried, and the one that produces the best results is chosen. For instance, in the fruit example with a KNN
model, we tried many values of the model’s hyperparameter k, the number of neighbors. For a large number of
neighbors the model tends to underfit, while for a small number of neighbors it tends to overfit. This
process is known as hyperparameter tuning, and will be discussed in the second part of the Machine
Learning Simplified book. Keep in mind that each algorithm has its own set of hyperparameters.
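In practice this trial-and-error search is usually automated. The sketch below shows one way to do it with scikit-learn's GridSearchCV; the candidate values of k and the stand-in dataset are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several candidate values of the hyperparameter k and keep the one
# that scores best under cross validation.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 15]},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
```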
Model Validation
After all of the preceding Data Preparation and Model Building steps, we must validate that our
model performs as we expect it will. The problem is that, usually, you will do a lot of experimentation
and tinkering with parts of the pipeline – with different feature sets, different learning algorithms,
different hyper-parameters, and many other choices. The pitfall is that this meta-level tuning is
susceptible to a type of overfitting itself. In order to validate the quality of the model, we need to do a
"final test" and see how it performs on another, totally separate portion of the data, called the validation set
(we usually set it aside at the very beginning of the project), to ensure it is performing as expected. If
it is not, we must go back to our pipeline and diagnose why it is not performing well. We will talk
about model validation together with the hyperparameter tuning chapter in the second part of the
book.
Key concepts
• Supervised ML Classifier
• Overfitting and underfitting
• Supervised ML Pipeline
A reminder of your learning outcomes
Having completed this chapter, you should be able to:
• Explain how a simple supervised ML classifier performs classification tasks.
• Explain how the level of model complexity affects overfitting and underfitting.
• Explain the basic building blocks of the ML pipeline.
3. Model Learning
An overview of the (supervised) ML pipeline was presented in the last chapter. The next few chapters
describe the parts of the ML pipeline in detail. We begin in this chapter by describing the part of
the ML pipeline that intrigues most people: how a supervised ML algorithm learns from the data
(Step 5 from Figure 2.6). First, we will see that the learning problem is in fact nothing more than
a mathematical optimization problem. After converting the learning problem to an optimization
problem, we will see how the actual learning can be performed by solving the optimization with a
simple algorithm known as the gradient descent algorithm. I know this conversion takes away a bit
of the mystery and excitement surrounding machine learning, but this is where technology really excels:
the boring stuff turns what seems like science fiction into reality.
Amsterdam Apartments

n    Area (m2)    Price (€10,000)
A    30           31
B    46           30
C    60           80
D    65           49
E    77           70
F    95           118

Table 3.1: Hypothetical dataset of apartment prices in Amsterdam.
Figure 3.1: A scatter plot of the Amsterdam housing prices dataset.
fˆ(x) = a · x + b (3.1)
where fˆ is a function representing the prediction, x is the input (explanatory) variable (the area in m2), and a and b are the
parameters of fˆ, where a is the slope coefficient and b is the intercept. The “hat” over the
f indicates that the function is estimated from data. (The parameters a and b are also learned from
data, but we do not put a “hat” over them to avoid cluttering notation.)
Finding the line that best fits the data is known as a linear regression and is one of the most popular
tools in statistics, econometrics, and many other fields. For our housing dataset, the line of best fit,
shown in Figure 3.2, is fˆ(x) = 1.3 · x − 18, which has slope coefficient a = 1.3 and intercept
(on the y-axis) b = −18 (we’ll discuss how to calculate the line of best fit a little later). This formula
allows us to predict the price of an Amsterdam apartment by substituting its floor area for the variable
x. For instance, let’s say you want to predict the price of an apartment with a floor area of 70 square
meters. Simply substitute 70 for x:
fˆ(x) = 1.3 · x − 18
= 1.3 · 70 − 18
= 73
and we see that its predicted price is €730,000. We illustrate this prediction graphically in Figure
3.2.
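In code, this model and its prediction take only a few lines; a minimal sketch:

```python
def f_hat(x, a=1.3, b=-18):
    """Predict apartment price (in units of EUR 10,000) from area in m^2."""
    return a * x + b

print(f_hat(70))  # -> 73.0, i.e. a predicted price of EUR 730,000
```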
Figure 3.2: Linear regression fit to the Amsterdam housing prices dataset. The learned linear model
can be used to predict the price of a new house given its area in square meters. Shown in blue is the
prediction that an apartment with floor area of 70 (in m2) has a price of 73 (in €10,000).
Now we know how to make a prediction. But let’s take a step back. How can we build an algorithm
– a specific sequence of steps – to determine the best parameters for the function fˆ(x)? For this
example, how can we discover that a = 1.3 and b = −18 is the best out of all possible choices of
parameters a and b (without being told, obviously)? Conceptually, this requires two parts:
• First, we need a way to numerically measure goodness-of-fit, or how well a model fits the data,
for one particular setting of parameters. For example, Figure 3.3 plots regression models for
different settings of the parameters a and b. How do we know which one is better than the
others? The label under each subfigure lists a goodness-of-fit metric known as the SSR score,
which will be presented in the next subsection.
• After we know how to measure how good some specific setting of parameters a and b is, we
need to figure out how to search over all possible values of a and b to find the best one. A
naive approach would be to try all possibilities but this would take forever. In Section 3.1.3
we discuss a much better algorithm known as gradient descent.
3.1.2 Goodness-of-Fit
To evaluate the goodness of fit we need to first learn about residuals.
Definition 3.1.1: Residual
A residual is the difference between the observed value and the value that the model predicts
for that observation. The residual is defined as:
ri = yi − fˆ(xi )
where yi is the actual target value of the ith data point, fˆ(xi ) is a predicted value for data point
xi , and ri is the ith residual.
(a) a = 1.3 and b = −18; SSR = 1,248.15. (b) a = 4 and b = −190; SSR = 20,326. (c) a = 10 and b = 780; SSR = 388,806.
Figure 3.3: Linear models with different settings of parameters a and b plotted along with the
Amsterdam housing prices dataset. Each label contains the model’s SSR, which measures its goodness-of-fit (lower is better).
Figure 3.4: Residuals associated with the linear model in Figure 3.3a are depicted with red dashed
lines. To keep the figure uncluttered, only the residual rC for point C is explicitly labeled.
To get a better understanding of residuals, let’s illustrate them graphically on our example. Figure 3.4
uses blue data points to represent observed prices of apartments taken from Table 3.1 (denoted as
yi ). The figure uses black data points to represent the predicted prices, given by our function fˆ, of
apartments with the same corresponding floor area.
In Figure 3.4, data point C shows that the actual price yC of a 60 m2 apartment in the market is
€800,000. However, the model’s predicted price fˆ(xC ) for an apartment with the same area is
€600,000, because fˆ(xC ) = 1.3 · 60 − 18 = 60 (in €10,000). The difference between the actual and the
predicted price is known as the residual. For this data point the residual is
rC = yC − fˆ(xC )
= 80 − (1.3 · xC − 18)
= 80 − (1.3 · 60 − 18)
= 20,
or €200,000, since the units are €10,000. The residuals for the remaining data points A, B, D, E,
and F in our example are
rA = yA − fˆ(xA ) = 31 − (1.3 · 30 − 18) = 10
rB = yB − fˆ(xB ) = 30 − (1.3 · 46 − 18) = −11.8
rD = yD − fˆ(xD ) = 49 − (1.3 · 65 − 18) = −17.5
rE = yE − fˆ(xE ) = 70 − (1.3 · 77 − 18) = −12.1
rF = yF − fˆ(xF ) = 118 − (1.3 · 95 − 18) = 12.5.
These residuals are represented with vertical red dashed lines in Figure 3.4. Now we know how
to compute the residual of each data point, but how do we calculate the error of the model on the
entire dataset? The most popular method is to compute the squared error or sum of squared residuals
defined as follows.
Definition 3.1.2: Sum of Squared Residuals (SSR)
The sum of squared residuals (SSR) is the sum of the squared differences between the
observed value and the value that the model predicts for that observation. For a prediction
function fˆ, we calculate the SSR as
$$SSR(\hat{f}) = \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2 = \sum_{i=1}^{n} r_i^2 \qquad (3.2)$$
SSR( fˆ) measures how well the specific model fˆ fits the data: the lower the SSR, the better
the fit. SSR is the most widely used loss function, but others are also possible, such as Sum
of Absolute Residuals (SAR).
In our running example, we compute the sum of the squared residuals as:
$$\begin{aligned}
SSR(\hat{f}) &= r_A^2 + r_B^2 + r_C^2 + r_D^2 + r_E^2 + r_F^2 \\
&= (10)^2 + (-11.8)^2 + (20)^2 + (-17.5)^2 + (-12.1)^2 + (12.5)^2 \\
&= 100 + 139.24 + 400 + 306.25 + 146.41 + 156.25 \\
&= 1248.15.
\end{aligned}$$
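These residuals and the SSR can be verified with a few lines of code; a minimal sketch using the values from Table 3.1 (expect small floating-point rounding):

```python
# (area m^2, price in EUR 10,000) pairs from Table 3.1
data = [(30, 31), (46, 30), (60, 80), (65, 49), (77, 70), (95, 118)]

def f_hat(x, a=1.3, b=-18):
    return a * x + b

residuals = [y - f_hat(x) for x, y in data]
ssr = sum(r ** 2 for r in residuals)
print(residuals)  # approximately [10.0, -11.8, 20.0, -17.5, -12.1, 12.5]
print(ssr)        # approximately 1248.15
```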
The SSR measures how well the specific model fˆ(x) = 1.3x − 18 fits the data – the lower the SSR,
the better the fit. For instance, Figure 3.3 shows the SSR value for the model we just calculated in
Figure 3.3a, and for the two other models shown in Figure 3.3b and Figure 3.3c. Based on these values, we
know that the model in Figure 3.3b is better than the model in Figure 3.3c, but worse than the model in
Figure 3.3a.
(a) a = 0.7; SSR = 10,624.95. (b) a = 1.3; SSR = 1,248.15. (c) a = 1.9; SSR = 10,443.75.
Figure 3.5: Regression models with different values of the parameter a, along with the sum of squared residuals (SSR) error listed in each label.
Parameterized SSR
For our learning (optimization) procedure in the next section, we will be interested in the SSR at
many potential settings of parameters, not just at one specific choice. In other words, we want
to write the SSR as a function of the parameters in our model. Let’s start with an example. For
simplicity, let’s pretend that the parameter b = −18 is known. This leaves us with fˆ(x) = a · x − 18,
a function of a single parameter a. The SSR as a function of the parameter a is

$$SSR(a) = \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2 = \sum_{i=1}^{n} \left( y_i - (a \cdot x_i - 18) \right)^2$$
Let’s evaluate the SSR for some values of the parameter a. When a = 0.7, we have
$$\begin{aligned}
SSR(0.7) &= \sum_{i=1}^{n} \left( y_i - (0.7 \cdot x_i - 18) \right)^2 \\
&= (31 - (0.7 \cdot 30 - 18))^2 + (30 - (0.7 \cdot 46 - 18))^2 + (80 - (0.7 \cdot 60 - 18))^2 \\
&\quad + (49 - (0.7 \cdot 65 - 18))^2 + (70 - (0.7 \cdot 77 - 18))^2 + (118 - (0.7 \cdot 95 - 18))^2 \\
&= 10624.95
\end{aligned}$$

where we substituted the values of xi and yi from Table 3.1. Following the same procedure, we find
that when a = 1.3, we get SSR(1.3) = 1248.15, and when a = 1.9, we get SSR(1.9) = 10443.75.
Now, let’s plot the calculated SSR values against the parameter a, where the x-axis is the value of a
and the y-axis is the value of the SSR, as shown in Figure 3.6a. In the language of
mathematical optimization which we use in the next section, we represent the errors with a cost
function denoted as J. In other words, J(a) = SSR(a) and we will be using only the notation J from
now on.
Figure 3.6: The parameterized SSR, the SSR as a function of parameter a, for our example dataset and linear model. (a) Cost function plotted against parameter a with a few values highlighted. (b) Minimum point of the cost function marked.
The goal of learning is to find the value of a that minimizes the cost, a* = arg min_a J(a), where the minimizing value of a is denoted with a "star" superscript as a*. This value is also known as a minimum point of the cost function J(a). In our example with the value of b fixed at b = −18, the optimal value of a is a* = 1.3 (which we stated earlier).
But how do we find the optimal value? A slow and painful way would be to “plug and chug”
numerous values of a until we find one that gives the lowest cost function J(a). But that’s going to
take an infuriating, mind-numbing amount of time.1 There are many algorithms in mathematical
optimization that have been developed to find the minimum faster (and there is still much research
and many improvements made in this area).
In this section, we will learn how to perform the minimization using the gradient descent algorithm.
At a high level, the gradient descent algorithm is fairly straightforward. It starts with a random
value, or a good guess, of the parameter a. It then finds the direction in which the function decreases
fastest and takes a “step” in that direction.2 It repeats this process of finding the direction of steepest
1 This statement should be taken qualitatively – technically, for a continuous value of a (with arbitrarily many decimal
places) we would not find the exact minimum this way.
² Remember from your calculus class that the direction of steepest ascent of a function is given by its gradient, so the direction of steepest descent is the negative gradient. This is where you should stop and refresh your knowledge about the basics of derivatives and gradients if you are uncertain how to use them; from this point, our discussion will assume you understand the basics.
descent and taking a step in that direction until it converges to the minimum value of the function.
The details of the algorithm are given in Definition 3.1.3. We now give a detailed walk-through of
the algorithm and explanation of the math.
Definition 3.1.3: Gradient Descent (with a Single Parameter)
1. Choose a fixed learning rate l.
2. Initialize the parameter a to an arbitrary value.
3. While the termination condition is not met:
   3.a. Compute ∂J/∂a, the derivative of the cost function J at the current value of a.
   3.b. Take a step in the direction opposite the gradient, scaled by the learning rate l:
$$a_i = a_{i-1} - l \cdot \frac{\partial J}{\partial a}$$
Gradient Derivation
We first compute the gradient of the cost function $J(a) = \sum_{i=1}^{n} \bigl(y_i - (a x_i - 18)\bigr)^2$ algebraically as
$$\frac{\partial J}{\partial a} = \frac{\partial}{\partial a}\sum_{i=1}^{n}\bigl(y_i - (a x_i - 18)\bigr)^2 = \sum_{i=1}^{n}\frac{\partial}{\partial a}\bigl(y_i - (a x_i - 18)\bigr)^2 = \sum_{i=1}^{n} -2\,x_i\,\bigl(y_i - (a x_i - 18)\bigr)$$
where the second equality follows since the derivative of a sum of terms is equal to the sum of the derivatives of each individual term (here, each individual data point). We now illustrate a step-by-step
derivation of the gradient of the cost function at the current value of parameter a. First, we expand
the cost function for the data points in our example:
$$\begin{aligned} J(a) &= \bigl(31-(a\cdot 30-18)\bigr)^2 + \bigl(30-(a\cdot 46-18)\bigr)^2 \\ &\quad + \bigl(80-(a\cdot 60-18)\bigr)^2 + \bigl(49-(a\cdot 65-18)\bigr)^2 \\ &\quad + \bigl(70-(a\cdot 77-18)\bigr)^2 + \bigl(118-(a\cdot 95-18)\bigr)^2 \end{aligned}$$
We then take the derivative of this function with respect to parameter a yielding
$$\begin{aligned} \frac{\partial J}{\partial a} &= \frac{\partial}{\partial a}\bigl(31-(a\cdot 30-18)\bigr)^2 + \frac{\partial}{\partial a}\bigl(30-(a\cdot 46-18)\bigr)^2 \\ &\quad + \frac{\partial}{\partial a}\bigl(80-(a\cdot 60-18)\bigr)^2 + \frac{\partial}{\partial a}\bigl(49-(a\cdot 65-18)\bigr)^2 \\ &\quad + \frac{\partial}{\partial a}\bigl(70-(a\cdot 77-18)\bigr)^2 + \frac{\partial}{\partial a}\bigl(118-(a\cdot 95-18)\bigr)^2 \end{aligned}$$
where we used a basic property of derivatives – that the derivative of a sum of terms is equal to the
sum of derivatives of each term. To calculate the derivative in each term, we apply the chain rule
$$\begin{aligned} \frac{\partial J}{\partial a} &= (-2\cdot 30)\bigl(31-(a\cdot 30-18)\bigr) + (-2\cdot 46)\bigl(30-(a\cdot 46-18)\bigr) \\ &\quad + (-2\cdot 60)\bigl(80-(a\cdot 60-18)\bigr) + (-2\cdot 65)\bigl(49-(a\cdot 65-18)\bigr) \\ &\quad + (-2\cdot 77)\bigl(70-(a\cdot 77-18)\bigr) + (-2\cdot 95)\bigl(118-(a\cdot 95-18)\bigr) \end{aligned}$$
First Iteration
Gradient descent starts from an initial value, here a = 0. Evaluating the derivative at a = 0 gives
$$\begin{aligned} \frac{\partial J}{\partial a} &= (-2\cdot 30)\bigl(31-(0\cdot 30-18)\bigr) + (-2\cdot 46)\bigl(30-(0\cdot 46-18)\bigr) \\ &\quad + (-2\cdot 60)\bigl(80-(0\cdot 60-18)\bigr) + (-2\cdot 65)\bigl(49-(0\cdot 65-18)\bigr) \tag{3.4} \\ &\quad + (-2\cdot 77)\bigl(70-(0\cdot 77-18)\bigr) + (-2\cdot 95)\bigl(118-(0\cdot 95-18)\bigr) \\ &= -67218 \end{aligned}$$
In other words, when a = 0, the slope of the curve is equal to −67218, as illustrated in Figure 3.8a.
We then use this gradient to update the parameter a:
$$a_{\text{new}} = a - \underbrace{l \cdot \frac{\partial J}{\partial a}}_{\text{Step Size}} \tag{3.5}$$
where the step size is obtained by multiplying the derivative with a predefined learning rate l. The
learning rate controls the size of the step we take and will be discussed more later, but for now just
keep in mind that the learning rate is usually set to a small value, such as l = 0.00001. Using the
value of the derivative calculated in Equation 3.4 along with a learning rate of l = 0.00001, the
updated parameter value is:
Figure 3.8: Illustration of the first two iterations of gradient descent. Plots for the first iteration are in the left column, and plots for the second iteration are in the right column. The figures in each column illustrate the operations at each iteration: first computing the gradient, second taking a gradient step to update the parameter a, and third displaying the fit and calculating the SSR with the new value of parameter a (panels (c) and (f) show the model after the first and second iteration, respectively).
$$a_{\text{new}} = 0 - \underbrace{(-67218)\cdot 0.00001}_{\text{Step Size}} = 0 - (-0.67218) = 0.67218$$
Figure 3.8b shows that with the new value for a, we move much closer to the optimal value. We can
also see in Figure 3.8c how much the residuals shrink when a = 0.67218, compared to the previous
function with a = 0.
Second Iteration
At the second iteration, we take another step toward the optimal value using the same routine. We
substitute the current value of a into the cost function gradient and use the result to take a step to
update a via Eq. 3.5. To take this step, we go back to the derivative and plug in the new value
a = 0.67218.
$$\begin{aligned} \frac{\partial J}{\partial a} &= (-2\cdot 30)\bigl(31-(0.67218\cdot 30-18)\bigr) + (-2\cdot 46)\bigl(30-(0.67218\cdot 46-18)\bigr) \\ &\quad + (-2\cdot 60)\bigl(80-(0.67218\cdot 60-18)\bigr) + (-2\cdot 65)\bigl(49-(0.67218\cdot 65-18)\bigr) \tag{3.6} \\ &\quad + (-2\cdot 77)\bigl(70-(0.67218\cdot 77-18)\bigr) + (-2\cdot 95)\bigl(118-(0.67218\cdot 95-18)\bigr) \\ &= -32540.2338 \end{aligned}$$
This tells us that when a = 0.67218, the slope of the curve is −32540.2338, as shown in Figure 3.8d. Let's update a by subtracting the step size – the gradient multiplied by the fixed learning rate l = 0.00001 – from the current value:
$$a_{\text{new}} = 0.67218 - \underbrace{(-32540.2338)\cdot 0.00001}_{\text{Step Size}} = 0.99758$$
We’ve completed another step towards obtaining a minimum optimal value, shown in Figure 3.8e.
We can also compare the residuals when a = 0.99758 to when a = 0.67218, shown in Figure 3.8f.
We see that the cost function J(a) is getting smaller.
Third Iteration
At the third iteration, we calculate the derivative of the loss function again, at the point a = 0.99758:
$$\begin{aligned} \frac{\partial J}{\partial a} &= (-2\cdot 30)\bigl(31-(0.99758\cdot 30-18)\bigr) + (-2\cdot 46)\bigl(30-(0.99758\cdot 46-18)\bigr) \\ &\quad + (-2\cdot 60)\bigl(80-(0.99758\cdot 60-18)\bigr) + (-2\cdot 65)\bigl(49-(0.99758\cdot 65-18)\bigr) \tag{3.8} \\ &\quad + (-2\cdot 77)\bigl(70-(0.99758\cdot 77-18)\bigr) + (-2\cdot 95)\bigl(118-(0.99758\cdot 95-18)\bigr) \\ &= -15752.7272 \end{aligned}$$
This tells us that when a = 0.99758, the slope of the curve is equal to −15752.7272. It’s time to
calculate the new value for a.
$$a_{\text{new}} = 0.99758 - \underbrace{(-15752.7272)\cdot 0.00001}_{\text{Step Size}} = 1.15511$$
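These three iterations can be reproduced with a short Python sketch. The gradient function follows the derivation above; the printed values can differ in the last decimals because the book rounds a between steps:

```python
areas  = [30, 46, 60, 65, 77, 95]
prices = [31, 30, 80, 49, 70, 118]

def gradient(a, b=-18):
    """dJ/da for J(a) = sum_i (y_i - (a*x_i + b))^2, as derived above."""
    return sum(-2 * x * (y - (a * x + b)) for x, y in zip(areas, prices))

a = 0.0       # initial value
l = 0.00001   # learning rate
for step in range(1, 4):
    g = gradient(a)
    a = a - l * g              # update rule, Equation (3.5)
    print(f"step {step}: gradient = {g:.4f}, a = {a:.5f}")
# step 1: gradient = -67218.0000, a = 0.67218
# step 2: gradient = -32540.2338, a = 0.99758
# step 3: gradient = -15752.6..., a = 1.15511
# (the third gradient differs slightly from Eq. 3.8, which uses the rounded a = 0.99758)
```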
Table 3.2: Iterations of gradient descent on our example. Each row of the table represents an iteration, with the model parameter a and the gradient of the loss function J at that value of a.
Figure 3.9: Red arrows indicate the change in parameter value a at each iteration of gradient descent (starting at a = 0).
Different Initialization
If you remember, we initialized our gradient descent with a = 0. The truth is, no matter what initialization we use, we will still (eventually) find the optimal value. Let's see what happens
Table 3.3: Each row contains the quantities for an iteration of the gradient descent algorithm on our example problem. The first column is the gradient, the second column is the step size l · ∂J/∂a, and the third column is the updated parameter value.
Figure 3.10: All steps of gradient descent when the initial parameter value is a = 2.23.
if instead of initializing a = 0, we initialize a = 2.23 (assume we use the same learning rate of
l = 0.00001). Once again I leave you to groan as you start the actual math work, and just show
my end-result calculations in Table 3.3. By looking at Figure 3.10, you can see that even when we
start with a different value of a, we still find the minimum point of the cost function. You can also observe that the derivative (and hence the step size) values are all positive, since J(a) is increasing on the interval [1.3029, +∞).
The Learning Rate
It’s important to choose an appropriate learning rate (denoted as l in Eq. 3.5) in the gradient descent
algorithm: a learning rate that is excessively small or excessively large will cause the algorithm to
converge slowly or fail to converge at all. If the learning rate is too small, we will take unnecessarily
small steps causing us to take an excessively long time to reach the optimum value. On the other
hand, if the learning rate is too large, we will take large steps that jump back and forth over the
optimal value many times, unnecessarily; this will be very slow and may cause us to not reach the
optimum at all. An illustration of the problem of choosing a learning rate that is too small or too
large is shown on our running example in Figure 3.11.
Stochastic Gradient Descent
The gradient descent algorithm we just learnt is called batch gradient descent, since computation of
the gradient requires touching each data point, or the entire batch. Our example dataset contains
only six data points, so the cost of touching each data point is not very large. However, real life data
Figure 3.11: Progression of the gradient descent algorithm with excessively small or large learning
rate plotted against the cost function. (a) With small learning rate the algorithm takes many steps to
slowly reach the optimum. (b) With large learning rate the algorithm jumps over the optimum many
times before reaching it.
sets often contain millions or billions of data points, making the cost of touching each data point
very large. The stochastic gradient descent (SGD) algorithm utilizes a low-cost approximation to the gradient – one that is correct on average – which touches only a single data point rather than the entire dataset. That is, the SGD update is:
$$\theta := \theta - l \cdot \nabla_{\theta} J(\theta;\, x_i, y_i) \tag{3.10}$$
at a data point index i, where i is chosen at random at each iteration of the SGD algorithm. Since the
data points are chosen at random, each data point will eventually participate in the learning process,
albeit not at every iteration.
Although SGD is much faster than the full-batch gradient descent, the gradient at a single point might
be an inaccurate representation of the full data-set, causing us, in fact, to step in a bad direction. In
practice, a good trade-off between SGD and full-batch gradient descent that maintains both speed
and accuracy is to compute the gradient on a mini-batch consisting of k > 1 samples, where k is a
hyper-parameter, typically between k = 10 and k = 1000. The mini-batch update using k data points is:
$$\theta := \theta - l \cdot \nabla_{\theta} J(\theta;\, x_{i:i+k-1}, y_{i:i+k-1})$$
where x_{i:i+k−1} are the k consecutive data points starting at a random index i.
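Below is a minimal Python sketch of the SGD and mini-batch updates on our running example. The function names and the batch size k = 3 are illustrative choices, not from the text:

```python
import random

data = list(zip([30, 46, 60, 65, 77, 95], [31, 30, 80, 49, 70, 118]))

def point_gradient(a, x, y, b=-18):
    """Gradient of the squared error at a single data point (x, y)."""
    return -2 * x * (y - (a * x + b))

def sgd_step(a, l=0.00001):
    """SGD update (Eq. 3.10): gradient from one randomly chosen point."""
    x, y = random.choice(data)
    return a - l * point_gradient(a, x, y)

def minibatch_step(a, k=3, l=0.00001):
    """Mini-batch update: gradient summed over k randomly chosen points."""
    batch = random.sample(data, k)
    return a - l * sum(point_gradient(a, x, y) for x, y in batch)

a = 0.0
for _ in range(200):
    a = minibatch_step(a)
print(round(a, 2))  # fluctuates near the optimum a* ~ 1.3 (the updates are stochastic)
```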
So far we have seen how the gradient descent algorithm works on a simple model with one parameter. But real problems have many more parameters that we need to estimate. Can gradient descent be applied to these problems? The answer is yes: it requires only a very simple modification, which we explain in this section. We also present the results using vector algebra, which you will see often in ML books.
Figure 3.12: Cost function with two unknown weights. Gradient descent finds the minimum of the
bowl.
Let’s start by looking at how gradient descent estimates not just one, but two parameters. In our
housing example we assumed that we knew the true value of the intercept parameter b, so the cost
function J(a) had a single parameter a. When both a and b are unknown, the cost function J(a, b)
has two parameters, a and b, to estimate. Gradient descent uses exactly the same logic to estimate
the parameters. First we initialize a and b at arbitrary values and calculate the derivatives of J(a, b), this time with respect to each parameter a and b, where the other parameter is held fixed.
First, what is the gradient with respect to the parameter a when b is held fixed? This is exactly what we saw in the last section (writing the general intercept b in place of −18):
$$\frac{\partial J}{\partial a} = \sum_{i=1}^{n} -2 \cdot x_i \cdot \bigl(y_i - (a x_i + b)\bigr)$$
Second, what is the gradient with respect to the parameter b when a is held fixed?
$$\frac{\partial J}{\partial b} = \sum_{i=1}^{n} -2 \cdot \bigl(y_i - (a x_i + b)\bigr)$$
But what happens if we need to find both a and b? If the parameter b is also unknown, the cost
function J(a, b) is a function of two parameters and can be visualized with a 3-D graph. Figure 3.12
shows the cost function for different values for the coefficient a and the intercept b.
In higher dimensional cases, the cost-function cannot be visualized easily. However, the gradient
descent algorithm can be applied in just the same way.
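A minimal sketch of two-parameter gradient descent on our housing data is shown below. We use a slightly larger learning rate than in the text (an illustrative choice) so that the intercept b converges in a reasonable number of steps:

```python
import numpy as np

x = np.array([30, 46, 60, 65, 77, 95], dtype=float)
y = np.array([31, 30, 80, 49, 70, 118], dtype=float)

a, b, l = 0.0, 0.0, 0.00003
for _ in range(500_000):
    r = y - (a * x + b)           # residuals at the current (a, b)
    a -= l * np.sum(-2 * x * r)   # step along dJ/da
    b -= l * np.sum(-2 * r)       # step along dJ/db
print(round(a, 3), round(b, 3))   # ~ 1.303 -17.986, the least-squares fit
                                  # (close to the a = 1.3, b = -18 used in the text)
```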
Figure 3.13: Example comparing convex cost functions (top row) and non-convex cost functions (bottom row). The subfigures on the left show a cost function with one parameter (hence a 2-D plot), while the subfigures on the right show a cost function with two parameters (hence a 3-D plot). Panel (c) shows a non-convex function of one parameter – note that there is no saddle point in a function of one parameter (a saddle point occurs when the gradient is zero but the point is a minimum in one direction and a maximum in another) – and panel (d) shows a non-convex function of two parameters.
In more complex (and real-life) models, gradient descent may not find the global minimum of the function due to a more complex cost function. In particular, cost functions that are not well-behaved may cause gradient descent to either (i) get stuck in a local minimum, or (ii) overshoot the global minimum. Additionally, cost functions can be (iii) non-differentiable, which makes it difficult to apply gradient descent. We discuss each of these problems in this section. Before we do that, however, I'd like to take a moment to explain two different types of cost functions – convex and non-convex. This will significantly help us in understanding the aforementioned problems that gradient descent faces.
(a) Gradient Descent on a Convex Cost Function. (b) Gradient Descent on a Non-convex Cost Function.
Figure 3.14: Comparison of the gradient descent algorithm applied to a convex cost function (left)
and a non-convex cost function (right). For a convex cost function we find the global optimum, while
for a non-convex cost function we can get stuck in a local optimum.
An example comparison of convex and non-convex cost functions is shown in Figure 3.13. This figure shows example cost functions with a single parameter (giving rise to 2-D plots), as well as cost functions with two parameters (giving rise to 3-D plots).
For non-convex cost functions with two parameters, there are three types of critical points that have gradient zero and are thus relevant to the optimization:
• Saddle points are the plateau-like regions.
• Local minima are the smallest values of the function within a specific range.
• The global minimum is the smallest value of the function on the entire domain.
A non-convex cost function can have multiple local minima, that is, points that are smallest within a specific range of the cost function. It can also have multiple global minima with equal value, although this rarely occurs. The objective of gradient descent is to find an optimal value, that is, any global minimum point.
Figure 3.15: Comparison of the gradient descent algorithm missing a global minimum on (a) a convex cost function and (b) a non-convex cost function, in both cases due to a very large learning rate. If the learning rate l (and the step size with it) is too large, on a convex cost function we can "jump over" the minimum we are trying to reach, while on a non-convex cost function we can skip a global optimum.
For instance, Figure 3.15 shows how we can overshoot the global minimum point with a large learning rate on (a) a convex cost function, and (b) a non-convex cost function.
Possible Solutions
In general, there is no silver-bullet solution to these problems, but several techniques have been shown to work well in practice, helping to avoid bad critical points and move closer to the global minimum (or to a better local minimum). These techniques include:
1. Use different variations of gradient descent (such as Stochastic Gradient Descent, Mini
Batch Gradient Descent, Momentum-based Gradient Descent, Nesterov Accelerated Gradient
Descent)
2. Use different step sizes by adjusting the learning rate.
These techniques are not covered by this book, but are something you can discover in your free time.
Key concepts
• Goodness-of-Fit and SSR
• Cost Function
• Gradient Descent
A reminder of your learning outcomes
Having completed this chapter, you should be able to:
• Explain how to quantitatively evaluate linear algorithms.
• Explain what gradient descent is and how to visualize it on a linear regression model.
4. Basis Expansion and Regularization
An overview of linear regression and gradient descent was presented in the last chapter. This
chapter focuses on how we can modify linear regression and its cost function as a way to change its
complexity.
You may be thinking that linear regression is too weak of a model for any practical purposes.
Sometimes, this is true: the data features have strong non-linear relationships that are not captured
by the linear model. But does this mean that everything we just learned about linear regression in
the previous chapter is useless in this case? Not at all! This section discusses a powerful technique
known as basis expansion that effectively adds non-linear features into the model. Then, linear
regression can be applied directly on the dataset to learn the coefficients of the non-linear terms.
This section will discuss basis expansion into polynomial features, which is a very general technique.
There are many other possibilities, some of which are more appropriate for different problems; we
discuss the choice in more detail in the chapter on Feature Engineering.
Table 4.1: Amsterdam housing dataset. (a) Training set identical to the one we used in Section 3.1,
(b) Test dataset used to evaluate our model.
Figure 4.1: Plot of the training and test data points in our Amsterdam housing example.
Recall from your introductory math classes that the degree of a polynomial f (x) is equal to the
largest power of x in the definition of f (x). For example:
• A first-degree polynomial function is a simple linear function and can be represented as: f(x) = w0 + w1x
• A second-degree polynomial function can be represented as: f(x) = w0 + w1x + w2x²
• A third-degree polynomial function can be represented as: f(x) = w0 + w1x + w2x² + w3x³
• In general, an nth-degree polynomial function can be represented as: f(x) = w0 + w1x + w2x² + w3x³ + ... + wnxⁿ
Second-degree polynomial
Let’s start with fitting a second-degree polynomial function fˆ(xi ) = w0 + w1 xi + w2 xi2 to the training
set. As in the last section we want to find the values of parameters w0 , w1 , and w2 that produce the
minimum SSR:
n
min
w0 ,w1 ,w2
∑ (yi − (w0 + w1 xi + w2 xi2 ))2
i=1
We can estimate the weights that give the minimum SSR using gradient descent as in the last section.
Because we covered that part, I will not show the exact calculations, but only the final weights
w0 = 31.9, w1 = −0.5, and w2 = 0.014.
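As a sanity check, a standard least-squares polynomial fit recovers (up to rounding) the same weights; here is a minimal sketch using NumPy's polyfit:

```python
import numpy as np

x = np.array([30, 46, 60, 65, 77, 95], dtype=float)
y = np.array([31, 30, 80, 49, 70, 118], dtype=float)

# Least-squares fit of a second-degree polynomial; np.polyfit returns the
# coefficients from the highest degree down: [w2, w1, w0].
w2, w1, w0 = np.polyfit(x, y, deg=2)
print(round(w0, 1), round(w1, 1), round(w2, 3))  # 31.9 -0.5 0.014
```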
Again, this is our model that predicts the price of an apartment located in Amsterdam, given its area.
For instance, we know that the real price of apartment A from Table 4.2a is €310,000. However, if we did not know that price, we could use our model to predict it:
$$\hat f(x_A) = w_0 + w_1 x_A + w_2 x_A^2 = 31.9 - 0.5 \cdot 30 + 0.014 \cdot 30^2 = 29.5$$
Hence, €295,000 is the predicted price for a 30-square-meter apartment. This is a pretty good prediction, because the actual price of that apartment is €310,000. So the model was wrong by a small
Training Set
n.  area (m²)  actual price (€10,000)  predicted price (€10,000)
A   30         31                      29.5
B   46         30                      38.5
C   60         80                      52.3
D   65         49                      58.6
E   77         70                      76.4
F   95         118                     110.8
(a) Training set predictions

Test Set
n.  area (m²)  actual price (€10,000)  predicted price (€10,000)
G   17         19                      27.4
H   40         50                      34.3
I   55         60                      46.8
J   57         32                      48.9
K   70         90                      65.5
L   85         110                     90.6
(b) Test set predictions
Table 4.2: Predictions on the Amsterdam housing prices dataset for a second-degree polynomial.
amount of €15,000.¹ Let's predict the prices of the remaining apartments in the training set (I will leave the actual calculations as homework for you).
Now that we have both predicted and actual apartment prices, we can measure how well this model performs overall on the training set by calculating the Sum of Squared Residuals (SSR) that you learned about in the previous chapter.
$$\begin{aligned} \mathrm{SSR}_{\text{training}} &= \sum_{i=1}^{n} \bigl(y_i - \hat f(x_i)\bigr)^2 \\ &= (y_A - \hat f(x_A))^2 + (y_B - \hat f(x_B))^2 + \ldots + (y_F - \hat f(x_F))^2 \\ &= (31 - 29.5)^2 + (30 - 38.5)^2 + \ldots + (118 - 110.8)^2 \\ &= 1027.0004 \end{aligned}$$
We’ve measured SSRtraining - the total error of a model. It shows how good the model performs on a
training set. Remember that SSRtraining alone does not tell us anything by now. It becomes relevant
only when we compare it with other models’ SSRtraining . Let’s now evaluate how good the model
performs on the dataset it has not seen before, i.e. on the test set. For that, we similarly need to
predict the prices of the apartments in the test set (again, I will leave actual math as homework to
you).
¹ The calculated weights that got us the answer were actually rounded to avoid writing very long numbers, but if we used the full-precision weights the predictions would change only slightly.
Figure 4.2: Fit of polynomials of different degrees to the Amsterdam housing prices dataset.
Now that we have both predicted and actual apartment prices, we can measure how well this model performs on the test set by calculating the Sum of Squared Residuals (SSR):
$$\begin{aligned} \mathrm{SSR}_{\text{test}} &= \sum_{i=1}^{n} \bigl(y_i - \hat f(x_i)\bigr)^2 \\ &= (y_G - \hat f(x_G))^2 + (y_H - \hat f(x_H))^2 + \ldots + (y_L - \hat f(x_L))^2 \\ &= (19 - 27.4)^2 + (50 - 34.3)^2 + \ldots + (110 - 90.6)^2 \\ &= 1757.08 \end{aligned}$$
By comparing SSR_training and SSR_test, we can see that the model performs noticeably worse on the test set than on the training set (1757.08 versus 1027.00, roughly 1.7 times larger). Let's now leave everything as it is and move on to higher-degree polynomial models.
Fourth-degree and Fifth-degree Polynomials
The procedure for fitting the fourth-degree and fifth-degree polynomials is similar. I will not show the detailed calculations for the sake of clarity. The learned functions of the fourth-degree and fifth-degree polynomials are shown in Figure 4.2c and Figure 4.2d, respectively (and the learned weights are
Model Comparisons
polynomial degree   SSR_training   SSR_test      Σ|w_i|
2                   1027.00        1757.08       32.41
4                   688.66         29379.05      945.20
5                   0.6            6718669.7     21849.22
Table 4.3: Training and test error on the Amsterdam housing prices dataset.
detailed below). The training and test error rates for all polynomial functions are collected in Table
4.3. We see that even though a fifth-degree polynomial model has the lowest SSRtraining compared
to the previous two models, it also has the largest SSRtest . The huge gap between SSRtraining and
SSRtest is a sign of overfitting.
The learned weights grow dramatically with the degree of the polynomial. For the fifth-degree polynomial, for example, the sum of the absolute values of the weights is
$$\sum_{i=0}^{5} |w_i| = |w_0| + |w_1| + |w_2| + |w_3| + |w_4| + |w_5| = 19915.1 + 1866.21 + 66.7535 + \ldots = 21849.22$$
4.2 Regularization
The basis expansion strategy presented in the last section produces a more complex model. We also saw that a more complex model tends to have a larger sum of absolute weight values, as shown in Section 4.1.2. As discussed in that section, such a model may overfit the data. One technique to decrease the complexity of a model is known as regularization. At a high level, regularization puts a constraint on the sum of the weights in order to keep the weights small. In other words, regularization constructs a penalized loss
function of the form
$$L_{\lambda}(w; X, Y) = \underbrace{L_D(w; X, Y)}_{\text{fit data well}} + \underbrace{\lambda}_{\text{strength}} \cdot \underbrace{R(w)}_{\text{penalize complex models}}$$
where L_D is the data loss function that measures the goodness-of-fit, R(w) is the penalty term that penalizes complex models, and λ ≥ 0 is a parameter that controls the strength of the penalty (relative to the goodness-of-fit). For regression problems, the data loss can be the SSR
$$L_D(w; X, Y) = \sum_{i=1}^{n} \bigl(y_i - \hat f_w(x_i)\bigr)^2$$
as we used above. (Note that when λ = 0, the penalty is zero, so we recover the ordinary least squares solution.) We discuss two popular choices of the penalty term R, known as ridge regression (or L2-regularized regression) and lasso regression (or L1-regularized regression), in the following sections. Ridge regression uses the penalty term
$$R(w) = \sum_{j=1}^{d} w_j^2$$
which computes the sum of squared values of the weights. This penalizes weight vectors whose components are large.
Let’s see how ridge regression works in the context of an example. Let’s assume we are learning a
third degree polynomial: fˆw (xi ) = w0 + w1 xi + w2 xi2 + w3 xi3 . The penalized objective is:
k
SSRL2 = arg min
λ ≥0
∑(yi − fˆ(xi ))2 + λ · ∑ w2j
i j=1
| {z }
L2 penalty term
2
= arg min ∑ yi − (w0 + w1 xi + w2 xi2 + w3 xi3 ) + λ · (w20 + w21 + w22 + w23 )
λ ≥0 i | {z }
| {z } keep weights small
fit the training data well
We can see how λ affects the learned model by looking at the model's behavior at its extremes:
• When λ = 0, the penalty has no effect and we obtain the ordinary least squares (OLS) solution. When λ is close to 0, we obtain a model that is close to the OLS solution. In other words, as λ → 0, w_i^regularized → w_i.
• As λ → ∞, the penalty term dominates the data term. This forces the weights toward exactly zero, leaving us with just the intercept f̂(x) = w0. When λ is large, the weights are encouraged to be close to 0. In other words, as λ → ∞, w_i^regularized → 0, and f̂(x) → w0.
In general, we can write model selection over hyper-parameters as choosing the best model from a restricted set S_λ, where λ stands for the set of hyper-parameters and S_λ is the set of allowable models restricted to these hyper-parameters. (In this chapter, we use this notation at a very high level; don't worry about the details.) How do we select the best hyper-parameters to use? Ideally, we'd evaluate the learned model on an independent validation set for each different setting of the hyper-parameter(s) λ. We'd then select the λ giving the best validation performance. Mathematically, we would solve
$$\lambda^{*} = \arg\min_{\lambda} \sum_{i=1}^{n'} L\bigl(f_{\theta^{*}_{\lambda}}(x'_i),\, y'_i\bigr)$$
where (x'_i, y'_i) are the n' validation data points and θ*_λ are the parameters learned under hyper-parameters λ.
In practice, we usually don’t have an independent test set and must resort to cross-validation or
similar methods to evaluate the quality of the candidate model (as discussed in the previous section.)
Figure 4.3: Degree-4 polynomial fit for different regularization methods: (a) no regularization, (b) L1 regularization with λ = 1, (c) L2 regularization with λ = 1.
Lasso regression instead uses the penalty term
$$R(w) = \sum_{j=1}^{d} |w_j|$$
the sum of the absolute values of the weights. For our third-degree polynomial example, the penalized objective is:
$$\mathrm{SSR}_{L1} = \min_{w} \; \underbrace{\sum_{i} \bigl(y_i - (w_0 + w_1 x_i + w_2 x_i^2 + w_3 x_i^3)\bigr)^2}_{\text{fit the training data well}} + \lambda \cdot \underbrace{(|w_0| + |w_1| + |w_2| + |w_3|)}_{\text{L1 penalty term: keep weights small}}$$
where λ ≥ 0 controls the strength of the penalty. L1-regularized linear regression is also known as lasso regression. The L1-regularized model behaves the same as the L2-regularized model at the extremes of λ, that is:
• As λ → 0, w_i^regularized → w_i.
• As λ → ∞, w_i^regularized → 0 and f̂(x) → w0.
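The sketch below illustrates ridge and lasso regression on our basis-expanded housing data using scikit-learn, whose alpha parameter plays the role of λ. The exact numbers depend on scikit-learn's solvers and are not from the text; the point is that the regularized weight sums come out much smaller than the unregularized one:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

x = np.array([30, 46, 60, 65, 77, 95], dtype=float).reshape(-1, 1)
y = np.array([31, 30, 80, 49, 70, 118], dtype=float)

# Degree-4 basis expansion: columns [x, x^2, x^3, x^4] (intercept handled by the models).
X = PolynomialFeatures(degree=4, include_bias=False).fit_transform(x)

models = [("OLS (no penalty)", LinearRegression()),
          ("Ridge / L2", Ridge(alpha=1.0)),
          ("Lasso / L1", Lasso(alpha=1.0, max_iter=100_000))]
for name, model in models:
    model.fit(X, y)  # features are left unscaled for simplicity; lasso may warn
    print(name, "sum |w| =", round(np.abs(model.coef_).sum(), 2))
```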
Figure 4.4: Degree 4 polynomial with different lambda values for L1 (left) and L2 (right) regulariza-
tion.
Key concepts
• Basis expansion
• Regularization, L1 and L2 penalty terms
A reminder of your learning outcomes
Having completed this chapter, you should be able to:
• Understand how to increase a model's complexity through basis expansion.
• Understand how to use regularization penalty terms to decrease overfitting.
• Understand the difference between L1 and L2 penalty terms.
5. Model Selection
In the previous chapter we showed how an ML algorithm learns a model from training data. In this
chapter we learn about another important part of the ML pipeline: selecting a model that will perform
well on unseen (test) data. The challenge is to select a model that is complex enough to capture the details of the training data but not so complex that it (nearly) memorizes the data – in other words,
we want to select a model that neither underfits nor overfits the training data. In this chapter, we
discuss how to select a model that balances these two competing desires from both a theoretical and
a practical point of view:
• From a theoretical perspective, we discuss the bias-variance decomposition which provides
insight into the problems of underfitting and overfitting.
• From a practical perspective, we discuss methods like cross-validation which provide estimates
of a model’s generalization performance which can be used to select a model that will perform
well on unseen data.
Figure 5.1: Model error as a function of model complexity. The solid black and green curves show
training and test error, respectively; as model complexity increases, training error decreases, while
the test error hits a minimal value in the middle then begins to increase again. The test error can be
decomposed into three theoretical error sources: bias error, which decreases as model complexity
increases; variance error, which increases as model complexity increases; and irreducible error
which is constant for all models. These error sources are unobserved and hence illustrated with
dashed lines.
Figure 5.2: (a) Underfitting: high bias, low variance. (b) Good balance: low bias, low variance. (c) Overfitting: high variance, low bias.
As model complexity increases, the training error (black curve) monotonically decreases, meaning that we continually fit the data better. On the other hand,
the test error (green curve) is initially large (resulting from a low complexity model that underfits the
data), decreases until it hits a good model (which adequately balances overfitting and underfitting),
then begins to increase monotonically (resulting from a high complexity model that overfits the data).
In this chapter, we analyze the overfitting and underfitting problems in more detail using a mathe-
matical decomposition of the error known as the bias-variance decomposition. At a high level, the
bias-variance decomposition decomposes the error as follows:
Test Error = Bias Error + Variance Error + Irreducible Error
That is, the error is first decomposed into an irreducible component, which represents error inherent
in the problem, like noise in the labels, that no model can reduce, and a reducible component, which
represents error resulting from our choice of model (including its hyperparameters). The reducible
error is further decomposed into a bias error that measures the average error over different training
sets, and a variance error that measures how sensitive the model is to randomness in the training set.
In summary:
Irreducible Error: The irreducible error (red curve) is flat, indicating that it is a constant source of
error inherent in the problem. In other words, it is an error source that no model can reduce.
Bias Error: The bias error (yellow curve) is large when the model has low complexity then mono-
tonically decreases as the model complexity increases. The bias error indicates the extent to
which the model underfits the data.
Variance Error: The variance error (blue curve) is small when the model has low complexity then
monotonically increases as the model complexity increases. The variance error indicates how
sensitive the learned model is to perturbations in the training data. The variance error indicates
the extent to which the model overfits the data.
These three error components are illustrated by the dashed lines in Figure 5.1. Note that the sum
of the three components of error (dashed curves) equals the test error (green curve) for each model
complexity (indexed on the x-axis).
Figure 5.2 contains example models that illustrate over-fitting and underfitting. In each subfigure,
the training data is shown as black dots and the learned model is shown as a blue line. The model in
the first figure underfits the data: it predicts the same value (the average value on the training set) for
each data point. On the other hand, the model in the last figure overfits the data: it interpolates the
function at the training data points, producing a perfect prediction at each training point (zero error);
however, we expect this rough curve to perform poorly on new data. The model in the middle figure
balances the two and seems to be a preferable model. The next section will dissect over-fitting and
underfitting in mathematical detail and will use these models as examples.
We define the expected test error at a point x as E[(y − f̂(x))²], where the expectation operator E averages over everything that is random in the fit: all possible training sets of size N and the noise in the response variable.¹ We then write
$$E\bigl[(y - \hat f(x))^2\bigr] = E\Bigl[\bigl((f(x) + \varepsilon) - \hat f(x)\bigr)^2\Bigr] = E\Bigl[\bigl((f(x) - \hat f(x)) + \varepsilon\bigr)^2\Bigr] \tag{5.2}$$
where we substituted the definition for y using the true function f and noise variable ε, and then
regrouped the terms. We then isolate reducible and irreducible components of the error by rewriting
¹ Recall from basic probability theory that the expectation operator E[·] averages the value of the operand with respect to an implicitly defined probability distribution. For example, the expectation of a function g with respect to a probability distribution p is defined as E[g(x)] = ∫ p(x) · g(x) dx.
it as:
$$\begin{aligned} E\Bigl[\bigl((f(x)-\hat f(x)) + \varepsilon\bigr)^2\Bigr] &= E\bigl[(f(x)-\hat f(x))^2\bigr] + 2\,E\bigl[(f(x)-\hat f(x))\,\varepsilon\bigr] + E[\varepsilon^2] \\ &= E\bigl[(f(x)-\hat f(x))^2\bigr] + 2\,E\bigl[f(x)-\hat f(x)\bigr]\underbrace{E[\varepsilon]}_{=0} + \underbrace{E[\varepsilon^2]}_{=\sigma^2} \tag{5.3} \\ &= \underbrace{E\bigl[(f(x)-\hat f(x))^2\bigr]}_{\text{reducible error}} + \underbrace{\sigma^2}_{\text{irreducible error}} \end{aligned}$$
The reducible error can be further decomposed into bias and variance components as follows. First,
we write
$$E\bigl[(f(x)-\hat f(x))^2\bigr] = E\Bigl[\bigl((f(x) - E[\hat f(x)]) - (\hat f(x) - E[\hat f(x)])\bigr)^2\Bigr] \tag{5.4}$$
where we subtracted E[f̂(x)] from one term and added it to the other inside the parentheses. Expanding the square, the cross term vanishes in expectation, leaving
$$E\bigl[(f(x)-\hat f(x))^2\bigr] = \underbrace{\bigl(f(x) - E[\hat f(x)]\bigr)^2}_{\text{bias}^2} + \underbrace{E\bigl[(\hat f(x) - E[\hat f(x)])^2\bigr]}_{\text{variance}}$$
We see that the bias-variance decomposition of the error has the same form that we specified in the
previous section, consisting of bias error, variance error, and an irreducible error.
Illustrations
We now illustrate the bias error and variance error using graphs. A model with high variance means that the fitted function f̂ varies a lot across different training sets: the randomness in which training points we observe translates into large changes in the learned fit. This is illustrated in Figure 5.3.
Figure 5.3: Relation of variance error to model complexity, for polynomial degrees (a) d = 0, (b) d = 1, (c) d = 3, (d) d = 9. Each subfigure illustrates a model with fixed complexity (controlled by the polynomial degree d) that is fit to two different training data sets, one to the orange data points and the other to the blue data points. Note how the fits on each training set, f̂_orange and f̂_blue, are similar for low-complexity models such as (a), (b), and (c) (indicating low variance), while they are very different for the high-complexity model in (d) (indicating high variance).
Figure 5.4: Relation of bias error, f(x) − E[f̂(x)], to model complexity, for polynomial degrees (a) d = 0, (b) d = 1, (c) d = 3, (d) d = 9. Each subfigure shows the true function f (which is known because this is a synthetic dataset; in real datasets f is not known) and the learned model averaged over random samples of the dataset, E_D[f̂(x)]. (The averaged curve is approximated by building random samples of the dataset D, fitting f̂ to each, and averaging the curves.) Note how the averaged fit matches the true function better as model complexity increases – that is, the bias decreases.
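The following sketch estimates the bias and variance errors empirically on a synthetic problem, mirroring Figures 5.3 and 5.4 (the true function, noise level, and dataset size are illustrative choices, not from the text):

```python
import numpy as np

def f(x):
    return np.sin(2 * np.pi * x)   # true function (synthetic, so it is known)

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 50)
n, sigma, trials = 20, 0.3, 500

for d in [0, 1, 3, 9]:
    preds = []
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, n)
        y = f(x) + rng.normal(0.0, sigma, n)   # a fresh noisy training set
        w = np.polyfit(x, y, d)                # fit a degree-d polynomial
        preds.append(np.polyval(w, x_grid))
    preds = np.array(preds)
    bias2 = np.mean((f(x_grid) - preds.mean(axis=0)) ** 2)  # squared bias
    var = np.mean(preds.var(axis=0))                        # variance
    print(f"d={d}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
# Bias falls and variance grows as d increases; d = 9 is wildly variable.
```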
5.2 Validation Methods
In Section 2.1.3, we estimated the performance of a model on unseen data using a method known as hold-out validation. In hold-out validation, we first split our data into a training set and a test, or hold-out, set (usually, 80% of the data is used for the train set and 20% for the test set). The model is then trained on the training set, and its performance is evaluated on the hold-out set.
Figure 5.5: Different ways to split a dataset of n = 20 data points into training and validation sets (each data point corresponds to a rectangle in the figures). (a) Hold-out validation with 80% train set size and 20% test set size. (b) 5-fold cross-validation, which trains five separate models on different splits of the data. (c) Leave-one-out cross validation (LOOCV), which trains n models, each on n − 1 data points, and tests each model on the remaining data point. (d) Leave-p-out cross validation (LpOCV), which trains $\binom{n}{p}$ models, each on n − p data points, and tests each model on the remaining p data points, for all subsets of data of size p. The cross-validation methods generally produce better estimates of generalization error at the expense of increased computational cost.
Figure 5.5a illustrates how a hypothetical dataset of twenty data points is split using hold-out validation.
The hold-out validation estimator of test performance has two important shortcomings, which are
especially prominent when working with small datasets:
• Losing data for model training: Setting aside a chunk of data as the hold-out set means you
won’t be able to use it to train your algorithm. For instance, splitting 20 observations into 16
for a training set and 4 for a test set means losing 4 observations for model training. This loss
of data often corresponds to a loss in model accuracy.
• Skewed training and test sets: Since the training and test sets are sampled at random, they
might be skewed, meaning that they don’t accurately represent the whole dataset. Suppose,
for example, you use a hold-out set of four out of the twenty data points chosen at random.
What if those four points happened to be the four highest values in your dataset? Your testing
set would not properly represent the range of values. Such an extreme split may be unlikely in
practice, but it is not too unlikely for the split to have an abnormally large concentration of
high-valued (or low-valued) data points.
Cross-validation, or CV, remedies the shortcomings associated with standard hold-out validation for
estimating the accuracy of a learned model. Unlike the hold-out method that puts a chunk of the data
aside for testing, CV allows the entire dataset to participate in both the training and testing process. I
know, you are thinking – wait a minute, we’ve just learnt that training and testing the model using
the same dataset is not methodologically correct! So, why are we doing just that now? Well, CV
uses a trick such that no data point is used to train and test the same model. The most popular CV
technique is known as K-Fold Cross Validation (KfCV) which we discuss in the next section.
Figure 5.5b illustrates how k-fold cross validation splits a dataset with twenty observations into five equal folds, each fold containing four observations. It then uses four folds (sixteen observations)
to train the model, and the remaining fold (four observations) to evaluate it. This process is iterated
five times, where a different fold serves as the test set at each iteration.
In the extreme case, we can perform leave one out cross validation (LOOCV) which is equivalent to
n-fold cross validation. That is, each data point is its own fold; we train n models, each on n − 1 data
points and test on the remaining data point. The data splitting for LOOCV is illustrated in Figure
5.5c.
Leave-p-Out Cross Validation Method (LpOCV) is similar to LOOCV but has a critical difference:
at each iteration of LpOCV, p observations, rather than 1 as for LOOCV, are used for validation. In
other words, for a dataset with n observations, for every iteration n − p observations will be used as a
training set, while p will be used as a test set. LpOCV is exhaustive, meaning that it trains and tests
on all possible test sets of size p; clearly, this is computationally expensive for larger values of p.
For instance, let’s say we have a dataset with 20 observations (i.e., n = 20), and we want to perform
LpOCV. If we set p = 3, we get n − p = 17 observations for training, and p = 3 observations for
validation on all possible combinations. Figure 5.5d illustrates all train-validation set splits for a
dataset of this size.
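All four splitting schemes of Figure 5.5 are available in scikit-learn; here is a minimal sketch on a hypothetical 20-point dataset:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, train_test_split

X = np.arange(20).reshape(-1, 1)   # a hypothetical dataset of n = 20 data points
y = np.arange(20)

# (a) Hold-out validation: 80% train / 20% test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_tr), len(X_te))                        # 16 4

# (b) 5-fold cross-validation: five 16/4 train/test splits.
print(sum(1 for _ in KFold(n_splits=5).split(X)))  # 5

# (c) LOOCV: n = 20 splits, each testing on a single point.
print(sum(1 for _ in LeaveOneOut().split(X)))      # 20

# (d) LpOCV with p = 3: C(20, 3) = 1140 splits -- exhaustive and expensive.
print(sum(1 for _ in LeavePOut(p=3).split(X)))     # 1140
```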
Cross-validation methods generally provide a more accurate estimate of performance than hold-out
validation. However, the obvious downside of cross-validation methods is the increased computa-
tional cost that arises from training multiple models on different folds. In the examples used here,
hold-out validation is the cheapest but least accurate estimate of generalization performance, 5-fold
validation balances cost and accuracy, and leave one out cross validation is the most expensive and
most accurate. As a general rule of thumb, 5-fold or 10-fold cross-validation is used. However, for
extremely large datasets, hold-out validation may be most appropriate due to the small computational
cost, while for extremely small datasets, LOOCV may be appropriate since each data point is crucial
for training an accurate model; LpOCV may also be used for small datasets, but we note that it is rarely used in practice due to the high computational cost, requiring $\binom{n}{p}$ models to be trained. Figure
5.6 summarizes the guidelines for choosing a cross-validation method based on the data set size and
the computational resources available.
Figure 5.6: Matrix summarizing guidelines for choosing a cross-validation method based on the size
of the data set and the computational resources available.
Key concepts
• Bias-variance decomposition
• Hold-out and cross validation methods
A reminder of your learning outcomes
Having completed this chapter, you should be able to:
• Mathematically and conceptually understand bias-variance decomposition
• Understand different validation methods and when they become relevant/applicable.
6. Feature Selection
In many ML problems, the dataset contains a large number of features but some of the features
are irrelevant to, or are only weakly relevant to, the prediction task. This chapter discusses feature
selection techniques that seek to identify and remove such features. Feature selection is useful for
two reasons: first, it can prevent overfitting and hence increase the performance of the model – i.e., it
can prevent the ML algorithm from learning correlations with irrelevant features that are present
in training data but not in the test data; second, it leads to models that are easier for a human to
interpret – it is difficult to understand a model that uses, say, one hundred features, but if the dataset
contains only, say, five relevant features, we can build a model on those five features that is easy
to understand. As is the case with data preprocessing, there is no “right” way to perform feature
selection – it often requires a lot of trial and error for each problem. This chapter presents some
feature selection techniques that are the most popular and useful in practice.
6.1 Introduction
Consider the dataset in Table 6.1 and the task of predicting if a car is a luxury car or not – based
on its other features. This dataset contains several features that are either irrelevant to, or are only
weakly relevant to, the prediction task. An example of an irrelevant feature is the type feature. This
Table 6.1: Example dataset of car features and car price (target variable).
feature has the same value, sedan, for each data point and clearly cannot help distinguish between a
luxury and non-luxury car. An example of a feature with weak predictive power is the wheel feature
(which represents the side of the car that the driving wheel is on). We see that many cars of both
classes, luxury and not luxury, have driver’s wheel on both the left and right side of the car. Hence,
such a feature likely has weak predictive power. Although we discovered this statistically (i.e., by
looking at the data), we might have suspected that this was the case a priori since we know each
manufacturer makes cars with the driver wheel on each side.1
The goal of feature selection is to systematically identify the features that are the most important, or
have the highest predictive power, and then train the model only on those features. Feature selection
is useful for two reasons:
Prevent Overfitting: Removing irrelevant or weakly relevant features can prevent our ML model
from overfitting to the training data set. In the above example, there may be some correlation
between driver wheel side and car class in the training data set, but if this correlation is not
present in the test set, then whatever we learned about this feature will cause errors at test
time.
Interpretability: Removing irrelevant or weakly relevant features can help us interpret, or under-
stand, our ML model better. If a model is built on all of the features, it would be difficult
to understand exactly how the predictions behave based on the interactions among features.
However, if we remove many features, such as the driver wheel side feature, and leave, let’s say,
just the manufacturer and horsepower, it is much easier for humans to understand. Removing
features improves interpretability even more in large datasets if we can, say, reduce a dataset
with millions of features to a dataset with hundreds of features.
So how do we perform feature selection? Sometimes we can identify irrelevant features a priori,
based on our knowledge of the problem. But many other times we will identify irrelevant features
using the properties of the dataset or prediction task. Roughly speaking, there are three different
groups of feature selection techniques: filter methods, search methods, and embedded methods. We
discuss each of these methods in the following sections, then discuss the similarities, differences,
strengths, and weaknesses of each approach in the final section.
¹ It is important to note that although it might not "make sense" a priori for a feature to possess predictive power, the feature may in fact have high predictive power. For example, if the relative wealth of people in countries with right-handed cars is higher than that of people in countries with left-handed cars, more luxury cars with right-handed wheels will probably be made.
6.2 Filter Methods
Variance score
The simplest method for univariate feature selection uses a simple variance score of the feature.
Simply put, you calculate the variance of all the values for a particular feature. Recall that variance,
or mean square, can be computed in general as the average of the squares of the deviations from the
mean:
$$\text{variance} = s^2 = \frac{\sum y^2}{n}$$
where each y is the difference between a value and the mean of all the values. The variance for feature j would thus be
$$\mathrm{Var}(x_j) = \frac{1}{n}\sum_{i=1}^{n} \bigl(x_{ij} - \mu_j\bigr)^2$$
where x_{ij} is the value of feature j for the i-th data point and μ_j is the mean of feature j.
You can see that if all the values for a particular feature were identical, the variance would be zero, and the feature would provide no information at all for the prediction task – a clear sign that it is not useful for model creation. For example, consider the features in Table 6.1. We
have already reasoned that the type feature has no value, because all its values are the same in the
dataset. Now let’s look at the horsepower feature. If we compute the variance for this feature, we
first compute the mean, which in this case is 299.2. Then we compute the average squared deviation
from the mean, which is the variance, and find that it is 25,899. This number can be used as the
variance score and compared to other scores for data that span similar scales.
Of course we can’t legitimately compare this variance with that obtained from the interior feature,
because that latter variance is about 0.27, several orders of magnitude smaller. But what if we had
another feature, say torque. Here we might find the values in our dataset range from 200 to 500, and
give a variance of 20,000. Then our top two features in terms of variance would be interior and
torque, and we might find that these are the most powerful in our model to predict luxury.
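A minimal sketch of the variance score in Python is shown below (the feature values are hypothetical; scikit-learn's VarianceThreshold performs the same filtering):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical numeric columns: a constant "type"-like feature, horsepower-like
# values, and a binary wheel-side feature.
X = np.array([[1.0, 310, 0],
              [1.0, 210, 1],
              [1.0, 450, 0],
              [1.0, 250, 1]])

print(X.var(axis=0))    # [0. 8275. 0.25] -- the constant column scores 0

# Drop features whose variance does not exceed the threshold (here: constants).
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X)
print(X_reduced.shape)  # (4, 2)
```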
Chi-squared score
A chi-square test is used in statistics to test the independence of two events. For our feature selection task, we are interested in testing the independence between a specific feature variable and the target variable. A feature that is completely independent of the target variable is irrelevant to the prediction task and can be dropped. In practice, the chi-square test measures the degree of dependence between the feature and the target variable, and we drop the features with the worst scores. Note that the chi-square score assumes that both the input variables and the target variable are categorical. For example, you might want to know if the location of the wheel affects the status of a car: if the chi-squared score shows that it doesn't affect the car's status in any way, we can drop this feature from the dataset.
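A minimal sketch of chi-squared feature scoring with scikit-learn (the encoded feature values and labels are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical categorical features encoded as non-negative integers:
# column 0 = wheel side (0 = left, 1 = right), column 1 = manufacturer id.
X = np.array([[0, 2], [1, 2], [0, 0], [1, 1], [0, 2], [1, 0]])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = luxury, 0 = not luxury

scores, p_values = chi2(X, y)      # dependence of each feature on the target
print(scores)

# Keep only the single best-scoring feature.
X_best = SelectKBest(chi2, k=1).fit_transform(X, y)
print(X_best.shape)                # (6, 1)
```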
Fisher Score
The key idea of Fisher score is to find a subset of features, such that in the data space spanned by the
selected features, the distances between data points in different classes are as large as possible, while
the distances between data points in the same class are as small as possible. In other words, the
between-class variance of the feature should be large, while the within-class variance of the feature
should be small.
In particular, we are given a dataset {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ R^M and y_i ∈ {1, 2, ..., c} represents the class to which x_i belongs; N and M are the number of samples and features, respectively, and f_1, f_2, ..., f_M denote the M features. We are interested in selecting a good subset of the M features. The Fisher score F of the i-th feature f_i is computed as
$$F(f_i) = \frac{\sum_{k=1}^{c} n_k\,(\mu_{ik} - \mu_i)^2}{\sum_{k=1}^{c} n_k\,\sigma_{ik}^2}$$
where n_k is the number of samples in class k, μ_ik and σ_ik are the mean and standard deviation of feature f_i within the k-th class, μ_i and σ_i denote the mean and standard deviation of feature f_i over the whole dataset, and f_{i,j} is the value of feature f_i in sample (or observation) x_j.
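A minimal sketch of the Fisher score for a single feature (the horsepower values and labels are hypothetical):

```python
import numpy as np

def fisher_score(x, y):
    """Fisher score of one feature x given class labels y: between-class
    variance divided by within-class variance, as defined above."""
    mu = x.mean()
    classes = np.unique(y)
    num = sum((y == k).sum() * (x[y == k].mean() - mu) ** 2 for k in classes)
    den = sum((y == k).sum() * x[y == k].std() ** 2 for k in classes)
    return num / den

hp = np.array([300.0, 320, 150, 180, 400, 160])  # hypothetical horsepower values
y = np.array([1, 1, 0, 0, 1, 0])                 # 1 = luxury, 0 = not luxury
print(round(fisher_score(hp, y), 2))             # 7.72 -- strongly discriminative
```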
6.3 Search Methods
Search methods treat feature selection as a search over subsets of features. Conceptually, an exact (exhaustive) search works as follows:
1. Enumerate every candidate subset s of features.
2. For each subset s, train a model using only the features in s and record its score.
3. Return the subset s with the best score from our array.
For most practical problems, performing the exact search is infeasible, and we must resort to approximate search methods. The two most popular approximate methods are the forward-selection and backward-selection algorithms. Each of these algorithms performs the approximate search in a greedy fashion as follows:
• Step forward feature selection starts with an empty set of features. At each iteration, it
identifies the most relevant feature and adds it to the set. It identifies this feature by brute
force: for each feature it fits a model with that feature added to the current features and records
the model’s score; the most relevant feature is the feature whose addition causes the largest
improvement to the score. Step forward feature selection is presented in Algorithm 6.2.1.
• Step backward feature selection starts with the set of all features. At each iteration, it identifies the least relevant feature and removes it from the set. It identifies this feature by brute force: for each feature it fits a model with that feature removed from the current features and records the model's score; the least relevant feature is the feature whose removal causes the smallest decline in the score. Similar to the algorithm for forward feature selection in Alg. 6.2.1, we can write down an algorithm for backward feature selection, but it is omitted for space. A sketch of the greedy forward search is shown below.
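A minimal sketch of step forward feature selection (the linear model and cross-validated score are illustrative choices; Algorithm 6.2.1 itself is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_features):
    """Greedy step-forward selection: repeatedly add the feature whose
    addition gives the best cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        def score(j):
            cols = selected + [j]
            return cross_val_score(LinearRegression(), X[:, cols], y, cv=3).mean()
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy check: y depends only on columns 0 and 2 of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=60)
print(forward_selection(X, y, 2))   # typically selects [0, 2]
```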
6.4 Embedded Methods
Many learning algorithms have their own built-in feature selection methods, which is why we call them embedded methods. Embedded feature selection is used by the algorithm itself and performed during model training.
6.5 Comparison
In this chapter we saw three different types of feature selection methods. We now provide a brief comparison between them. Broadly speaking, the methods are differentiated in three ways: their accuracy, their computational cost, and their generality.
Accuracy: How does each feature selection technique affect the final prediction accuracy of the
ML classifier? In general, filter methods are at a disadvantage because they perform feature
selection without looking at the model class.
Computational Cost: Filter methods are often very fast. Univariate filter methods operate in time O(n_features · n_data). Correlation-based methods must examine all pairs of features and therefore run in time quadratic in the number of features, O(n_features² · n_data). Search methods, on the other hand, are much slower.
Generality: What learning algorithms is each method compatible with? Both the filter and search
methods can be combined with any learning algorithm. However, the embedded methods only
apply to learning algorithms that can be modified with a penalized loss function.
In more detail, the three families of methods differ in the type of information they use, their computational cost, and their accuracy:
Filter methods: Filter methods identify irrelevant features based solely on the features, without
examining the classification task. (Some use class label, but they don’t build or assume any
particular model.) As an extreme example, a feature that has the same value for each data
point clearly has no predictive value. More realistically, a feature with many similar values
(e.g., with low variance) will probably have low predictive power. Filter methods are often
very fast – they allow you to eliminate a large number of features with a single computation
of feature score – however, since they don’t take into account the model that will use these
features, they are often inaccurate.
Search methods: Search methods identify features that are directly relevant to the prediction problem. The basic idea is to define a search for the K best features outside of the specific learning algorithm. For example: "out of all possible sets of K features, find the set of K features that gives the best performance using a decision tree classifier." Since there is a combinatorially large number of sets of K features, these methods are often computationally expensive.
Embedded Methods: Embedded methods also utilize prediction accuracy to select features, how-
ever they do so within the learning algorithm itself. Embedded methods define the learning
problem with a penalized learning objective that automatically does feature selection – in
other words, although the penalized objective does not explicitly enforce sparsity, it turns out
that the optimized solution is in fact sparse. The most popular example of this procedure is
L1-penalized linear regression.
Key concepts
• Filter methods, search methods, embedded methods
• Variance score, Chi-squared score, Correlation-based feature selection, Fisher score
• Step Forward Feature Selection, Step Backward Feature Selection, Recursive Feature
Elimination
A reminder of your learning outcomes
Having completed this chapter, you should be able to:
• Understand the different groups of feature selection methods, how they differ from each other, and where each of them fits best
• Understand a few methods from each group
7. Data Preparation
The previous chapters discussed the core elements of the ML pipeline, which assumed the data was
in an “ideal” form. Unfortunately, in practice, we are often confronted with datasets with incorrect
data, or with data that is correct but is not in a format that can be processed by ML algorithms.
Before applying ML algorithms we often need to preprocess the data. While there is no “correct”
way to preprocess the data, a number of methods are widely used in practice. This chapter discusses
the following methods:
Data Cleaning: Data cleaning seeks to correct data that appears to be incorrect. Incorrect data may arise
due to human input error, such as misspellings or improper formatting, or data may be missing,
duplicated, or irrelevant to the prediction task.
Encoding: ML algorithms require numeric data. However, many datasets contain unstructured data
(like strings) or categorical variables (like color) which must be encoded numerically.
Feature Engineering: The goal of feature engineering is to create new features by combining
several features that we expect to be important based on our human knowledge of the problem.
For example, if a dataset contains features for the total sales per day and the number of customers
per day, we expect a feature that measures the average sale per customer per day (obtained by
dividing the first feature by the second) to be useful.
Table 7.1: Hypothetical dataset of product orders showing (a) initial ‘dirty’ dataset, and (b) a
(potential) cleaned dataset. The initial dataset in Table (a) cannot be consumed by ML algorithms,
while the cleaned dataset in Table (b) can be.
The first, and perhaps most important, step in any ML project is to carefully examine and understand
your data. In practice, you will often find that your dataset is “dirty,” meaning that it contains
incorrect, missing, duplicated, irrelevant, or improperly formatted data. In addition to dirty data,
many datasets contain data points that are legitimate measurements but are outliers, meaning that
they differ substantially from the other data points (this will be defined more precisely later). The
data quality has an enormous influence on the quality of any ML model.1 Consequently, it is often
necessary to preprocess the data to correct or delete the dirty or outlier data. We can then run our
ML algorithms on the corrected dataset. The following subsection discusses ways to deal with dirty
data, and the subsequent subsection discusses ways to deal with outlier data. (We emphasize again
that data cleaning is often a subjective procedure – there are no hard-and-fast rules. It requires a lot
of empirical experience to understand datasets and how ML algorithms will be affected by various
cleaning procedures.)
1 You may have heard the phrase “garbage in, garbage out” that is sometimes used to describe this phenomenon.
Incorrect Data
Datasets may contain data that is clearly incorrect, such as spelling or syntax errors. In some cases,
however, it may be difficult to tell if the data is incorrect or if it is correct but simply unexpected (to
us, as humans). The data point in the second row of Table 7.1 has value “Californai” for its state
feature, which is clearly a misspelling of the state “California”. If this mistake were left uncorrected,
any ML algorithm built on this dataset would treat the two strings “Californai” and “California”
differently.
How can we identify incorrect data? Perhaps the most exhaustive way is to hire a human being to
go through the data manually and correct it, for example by identifying and fixing spelling errors. One
way to check whether a particular column has misspelled values is to look at its set of unique values,
which is often much smaller than the set of all values. You can see how this is done in Python
by following the "Try It Now" box at the end of this chapter; a small sketch of the idea follows.
Duplicated Data
Duplicated data is another common problem that arises in practice. For example, Table 7.1a has
duplicate observations in rows two and three, and in rows four and five. Duplicate data effectively
doubles the weight that an ML algorithm gives to the data point and has the effect of incorrectly
prioritizing some data points over others, which can lead to a poor model. In some cases, however,
the duplicate data is in fact genuine – for example, if two purchases for the exact same amount
were made on the same day from the exact same location. In most scenarios, genuine duplicates are
very unlikely, but there is no way to know for certain simply by looking at the data. To resolve the
issue for certain you would need to use external sources of information (for example, verifying with
another department that two identical purchases were in fact made).
There are different methods in Python to spot duplicated data; a minimal sketch is shown below,
and you can learn more in the "Try It Now" box.
Missing Data
Missing data arises for a variety of reasons. For example, if the data is entered by a human being,
they may have forgotten to input one or more values. Alternatively, data may be missing because
it is genuinely unknown or unmeasured, such as a set of survey questions that were
answered by some, but not all, customers. A missing value occurs in our running example for the
purchase column in row three of Table 7.1.
Some ML algorithms have built-in ways to handle missing data, but most do not. So how should
we deal with missing data? One approach is to simply delete all data points that have any missing
features. If there are not many such data points, this may be an acceptable solution. But if there are
many such data points, then a large part of the dataset will be removed and the ML algorithm will
suffer significantly. Instead, it is desirable to maintain all of the data points, but fill in the missing
values with a ‘good’ value. There are two different types of data filling procedures, discussed below.
Exact: Sometimes the missing value can be determined exactly. For example, if the US State of an
order is missing, but we have its zip code, we can determine its state exactly (assuming we
have another table which maps zip codes to states) and fill it into the table.
Imputed: Many times, the missing data cannot be determined exactly and we need to make an
educated guess at its value. For numeric data, one popular choice is the mean or
median of the non-missing values of the feature. For example, to impute a missing
product order, we take the median order total. For categorical data, we can impute the value
as the mode, i.e., the most frequent value. In cases where an imputed value of a feature is used, it
is sometimes useful to create a binary feature that indicates whether the input data contained the
missing feature or not (i.e., a new feature ‘order total was missing’) – this provides more
information to the learning algorithm, which may help it learn a good predictive model. A
sketch of this procedure follows.
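Here is a minimal sketch of median imputation with a missingness indicator, assuming a purchase column with a gap as in Table 7.1a:

import numpy as np
import pandas as pd

df = pd.DataFrame({"purchase": [190.0, 243.0, np.nan, 193.0, 298.0]})

# Record which rows were missing before filling them; this extra
# binary feature can carry useful signal for the learning algorithm.
df["purchase_was_missing"] = df["purchase"].isna().astype(int)
df["purchase"] = df["purchase"].fillna(df["purchase"].median())
print(df)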
Last but not least, another way to correct or fill in missing information is to go to the department or
people that created the data and understand their processes and standards for collecting it. The
people aspect is crucial and often missed. For example, go to logistics to understand ship dates, and go
to sales and marketing for customer information and purchase prices. You may find that the logistics
department and the sales and marketing group have completely different ship dates for the same order;
as the data scientist, you need to reconcile that information.
7.1.2 Outliers
An outlier is an observation that differs significantly from other observations. Outliers may be
problematic for one of two reasons: first, an outlier may simply not be representative of data that
we will see at test time (in a new dataset); second, many ML algorithms are sensitive to severe
outliers and often learn models that focus too heavily on the outliers and consequently make poor
predictions on the rest of the data points. On the other hand, outliers sometimes reveal insights into
important, though unexpected, properties of our dataset that we might not otherwise notice. There
are no hard and fast rules about how to classify a point as an outlier and whether or not to remove it
from the dataset. Usually, you will build ML models several times, both with and without outliers,
and with different methods of outlier categorization. This subsection discusses two ways that outlier
detection is commonly performed in practice: the first is to use common sense or domain knowledge;
the second is to use statistical tests that measure how far a point is from a ‘typical’ data point.
How can common sense or domain knowledge be used to identify outliers? Consider the purchase
value of $1 in the final row of Table 7.1a. If you know, for example, that the cheapest product in
your shop is a $24 cowboy hat, then clearly the data point is erroneous and does not represent a valid
purchase value. In Table 7.1b we fix this erroneous value by filling it with the mean purchase value
(i.e., it is treated as if it were a missing purchase value).
How can we use statistical metrics to determine if a data point is an outlier? The simplest way is
to check whether a data point is too far from the average value. For example, if $\mu^{(j)}$ is the mean
and $\sigma^{(j)}$ is the standard deviation of the $j$th feature in the dataset, we may want to classify values
that are further than $k$ standard deviations from the mean as outliers. That is, a feature value $x_i^{(j)}$ with
$x_i^{(j)} < \mu^{(j)} - k \cdot \sigma^{(j)}$ or $x_i^{(j)} > \mu^{(j)} + k \cdot \sigma^{(j)}$ is considered an outlier. Typically, a value of $k = 3$
standard deviations is chosen.
Let’s show how to use statistical metrics to identify outliers in the Purchase column of Table 7.1a. The
mean of the column’s observations is:
$$\mu = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{190 + 243 + 193 + 193 + 298 + 1}{6} = 186.3$$
and its standard deviation is
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} = \sqrt{\frac{(190 - 186.3)^2 + (243 - 186.3)^2 + (193 - 186.3)^2 + (193 - 186.3)^2 + (298 - 186.3)^2 + (1 - 186.3)^2}{6}} = 91.41$$
Suppose we set a range of acceptable values of $k = 3$ standard deviations. Then, any data point
below $\mu - 3\sigma = 186.3 - 3 \cdot 91.41 = -87.93$ or above $\mu + 3\sigma = 186.3 + 3 \cdot 91.41 = 460.53$ is
considered an outlier. Since we cannot have a purchase with a negative sum, an outlier here would
be any value above 460.53. In this dataset, there are no statistical outliers present.
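The same computation is easy to script; here is a sketch applying the k-standard-deviations rule to the Purchase column above:

import numpy as np

purchase = np.array([190, 243, 193, 193, 298, 1])
mu, sigma, k = purchase.mean(), purchase.std(), 3

# Flag values lying more than k standard deviations from the mean;
# np.std uses the population formula, matching the text.
outliers = purchase[np.abs(purchase - mu) > k * sigma]
print(round(mu, 1), round(sigma, 2))  # 186.3 91.41
print(outliers)                       # empty: no statistical outliers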
7.2 Feature Transformation
Table 7.2: Amsterdam Demographics
Age Income (€) Vehicle Kids Residence
32 95,000 none no downtown
46 210,000 car yes downtown
25 75,000 truck yes suburbs
36 30,000 car yes suburbs
29 55,000 none no suburbs
54 430,000 car yes downtown
Amsterdam Demographics
Age Income (€) Vehicle Kids Residence
32 95,000 0 0 1
46 210,000 1 1 1
25 75,000 2 1 0
36 30,000 1 1 0
29 55,000 0 0 0
54 430,000 1 1 1
(a) Substitute categorical with numeric variables
Amsterdam Demographics
Age Income (€) Vehicle_none Vehicle_car Vehicle_truck Kids Residence
32 95,000 1 0 0 0 1
46 210,000 0 1 0 1 1
25 75,000 0 0 1 0 0
36 30,000 0 1 0 1 0
29 55,000 1 0 0 0 0
54 430,000 0 1 0 1 1
(b) Vehicle categorical variable with one-hot encoding
Table 7.3: Numerical encoding of categorical features in the Amsterdam housing dataset. (a) uses a
direct encoding, while (b) uses a one-hot encoding of the ternary Vehicle feature.
How can we convert categorical variables to numeric variables? For binary categorical variables we
can simply substitute the values 0 and 1 for each category. For example, for the Kids feature we can
map value No to 0 and Yes to 1, and for the Residence feature we can map value Suburbs to 0 and
Downtown to 1. For categorical variables with more than two categories, we can perform a similar
numeric substitution. For example, for the Vehicle feature we can map value none to 0, car to 1,
and truck to 2. Substituting numeric values for each categorical variable in this way transforms the
original dataset in Table 7.2 into the dataset in Table 7.3a.
The direct substitution for multiple categories is valid, though potentially problematic: it implies
that there is an order among the values, and that this order is important for classification. In the
encoding above, it implies that a vehicle feature of none (with encoding 0) is somehow more similar
to a vehicle feature of car (with encoding 1) than it is to a vehicle feature of truck (with encoding
2). If the order of a feature’s values is not important, it is better to encode the categorical variable
using one-hot encoding.2 One-hot encoding transforms a categorical feature with K categories into
K features, one for each category, taking value 1 in the column corresponding to the data point’s
category and 0 in all the others. For example, one-hot encoding splits the categorical column Vehicle
into three columns: Vehicle_none, Vehicle_car, and Vehicle_truck, as shown in Table 7.3b. A sketch
of this transformation in pandas is shown after the next paragraph.
It is important to keep in mind that the more categories that a categorical variable has, the more
columns one-hot encoding will add to the dataset. This can cause your dataset to blow up in size
unexpectedly and cause severe computational problems if you are not careful.
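As a sketch, pandas can produce the one-hot encoding of Table 7.3b in a single call:

import pandas as pd

df = pd.DataFrame({"Vehicle": ["none", "car", "truck", "car", "none", "car"]})

# get_dummies creates one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["Vehicle"], dtype=int)
print(encoded.columns.tolist())
# ['Vehicle_car', 'Vehicle_none', 'Vehicle_truck']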
Many datasets contain numeric features with significantly different numeric scales. For example,
the Age feature ranges from 25 to 54 (years), the Income feature ranges from €30,000 to
€430,000, while the features Vehicle_none, Vehicle_car, Vehicle_truck, and Kids all range
from 0 to 1. Unscaled data will, technically, not prohibit the ML algorithm from running, but it can
often lead to problems in the learning algorithm. For example, since the Income feature has much
larger values than the other features, it will influence the target variable much more.3 But we don’t
necessarily want this to be the case. To ensure that the measurement scale doesn’t adversely affect
our learning algorithm, we scale, or normalize, each feature to a common range of values. The two
most popular approaches to data scaling are feature standardization (or z-score normalization) and
feature normalization, described below.
Feature Standardization: In feature standardization, the feature values are rescaled to have a mean
of µ = 0 and a standard deviation of σ = 1. That is, the standardized features are calculated
as:
$$\bar{x}_i^{(j)} = \frac{x_i^{(j)} - \mu^{(j)}}{\sigma^{(j)}} \qquad (7.1)$$
where $x_i^{(j)}$ is observation $i$ of feature $j$, $\bar{x}_i^{(j)}$ is the standardized value of observation $i$
of feature $j$, $\mu^{(j)}$ is the mean of feature $j$, and $\sigma^{(j)}$ is the standard deviation of
feature $j$.
2 When the ordering of a specific feature is important, we can substitute that order with numbers. For example, if we
had a feature that showed the credit score of inhabitants and the values were {bad, satisfactory, good, excellent}, we could
replace those categories with the numbers {1, 2, 3, 4}. Such features are called ordinal features.
3 This is the case for many ML models, such as the linear models we discussed earlier. However, some ML models,
such as decision trees, are insensitive to feature scales.
Table 7.4: Amsterdam demographics dataset from Table 7.2 where the features have been transformed
via (a) standardization and (b) normalization.
Feature Normalization: In feature normalization, the feature values are converted into a specific
range, typically in the interval [0, 1]. That is, the normalized features are calculated as:
$$\hat{x}_i^{(j)} = \frac{x_i^{(j)} - \min^{(j)}}{\max^{(j)} - \min^{(j)}} \qquad (7.2)$$
where $x_i^{(j)}$ is observation $i$ of feature $j$, $\hat{x}_i^{(j)}$ is the normalized value of observation $i$ of
feature $j$, $\min^{(j)}$ is the minimum value of feature $j$, and $\max^{(j)}$ is the maximum value of
feature $j$.
When should you use standardization and when is it better to use normalization? In general, there’s
no definitive answer. Usually the best thing to do is to try both and see which one performs better for
your task. A good default (and first attempt in many projects) is to use standardization, especially
when a feature has extremely high or low values (outliers) since this will cause normalization to
“squeeze” the typical data values into a very small range.
Example
We now illustrate the computation of standardized features and normalized features on the Amsterdam
demographics dataset in Table 7.3a. Let’s start by standardizing the Age feature. We first compute its
mean:
$$\mu^{\text{age}} = \frac{\sum_{i=1}^{n} x_i^{\text{age}}}{n} = \frac{32 + 46 + 25 + 36 + 29 + 54}{6} = 37$$
and its (population) standard deviation:
$$\sigma^{\text{age}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i^{\text{age}} - \mu^{\text{age}})^2}{n}} = \sqrt{\frac{604}{6}} = 10.03$$
Each Age measurement is then standardized by subtracting the feature’s mean and dividing by its
standard deviation, as in Equation 7.1:
$$\bar{x}_1^{\text{age}} = \frac{32 - 37}{10.03} = -0.50 \qquad \bar{x}_2^{\text{age}} = \frac{46 - 37}{10.03} = 0.90$$
$$\bar{x}_3^{\text{age}} = \frac{25 - 37}{10.03} = -1.20 \qquad \bar{x}_4^{\text{age}} = \frac{36 - 37}{10.03} = -0.10$$
$$\bar{x}_5^{\text{age}} = \frac{29 - 37}{10.03} = -0.80 \qquad \bar{x}_6^{\text{age}} = \frac{54 - 37}{10.03} = 1.70$$
The remaining features can be standardized using the same logic; Table 7.4a shows the result. (I’ll
leave the actual computation to you as homework.)
The normalized features for the same dataset are computed as follows. First, let’s go back to Table
7.2, where the natural range of the Age feature is 25 to 54. By subtracting 25 from every value and then
dividing the result by the range 54 − 25 = 29, we can normalize those values into the range [0, 1]. Let’s
scale Age using the normalization method:
$$\hat{x}_1^{\text{age}} = \frac{32 - 25}{54 - 25} = 0.24 \qquad \hat{x}_2^{\text{age}} = \frac{46 - 25}{54 - 25} = 0.72$$
$$\hat{x}_3^{\text{age}} = \frac{25 - 25}{54 - 25} = 0 \qquad \hat{x}_4^{\text{age}} = \frac{36 - 25}{54 - 25} = 0.38$$
$$\hat{x}_5^{\text{age}} = \frac{29 - 25}{54 - 25} = 0.14 \qquad \hat{x}_6^{\text{age}} = \frac{54 - 25}{54 - 25} = 1$$
Following the same logic, we should do the same for all the remaining features (I’ll let you do the
rest of the math to get a bit more experience). After normalization, we get the dataset in Table 7.4b.
Note that the feature columns with binary values {0, 1} do not change – according to the
normalization formula in Equation 7.2, the scaled value of 0 is 0 and that of 1 is 1. The sketch below
reproduces these computations in code.
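As a sketch, the hand computations above are easy to reproduce (population statistics, as in the text):

import numpy as np

age = np.array([32, 46, 25, 36, 29, 54])

standardized = (age - age.mean()) / age.std()             # Equation 7.1
normalized = (age - age.min()) / (age.max() - age.min())  # Equation 7.2

# Both match the hand-computed values up to rounding.
print(np.round(standardized, 2))
print(np.round(normalized, 2))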
Amsterdam Demographics
Age Income (€) Vehicle_none Vehicle_car Vehicle_truck Kids Residence
young 95,000 1 0 0 0 1
older 210,000 0 1 0 1 1
young 75,000 0 0 1 0 0
middle 30,000 0 1 0 1 0
young 55,000 1 0 0 0 0
older 430,000 0 1 0 1 1
This section discusses some of the most popular ‘generic’ feature engineering approaches – by
‘generic’ we mean that they are applicable in many domains. This is by no means an exhaustive list
(indeed it cannot be, as new applications and new features are constantly being designed). Instead, it
is meant to show you some popular techniques, illustrate why feature engineering is so powerful,
and give you insight into creating custom features for your problem.
where $\max^{(j)}$ and $\min^{(j)}$ are the $j$th feature’s maximum and minimum values, respectively. The
ranges of the K bins are then
$$[\min,\, \min + w),\; [\min + w,\, \min + 2w),\; \ldots,\; [\min + (K-1) \cdot w,\, \max] \qquad (7.4)$$
As an example of equal width binning, consider splitting the Age feature in the Amsterdam demo-
graphics dataset into K = 3 bins. The bin’s width is:
$$w = \left[\frac{\max - \min}{K}\right] = \left[\frac{54 - 25}{3}\right] = 9.7 \approx 10$$
which we rounded to the nearest integer because Age values are always integers (in this dataset). To
calculate each bin’s range, we plug the bin width into equation (7.4) and obtain the three bins [25, 35),
[35, 45), and [45, 55], which we can label young, middle, and older – exactly the binned Age column
shown in the table above. A code sketch follows.
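As a sketch, pandas performs this equal-width binning in one call:

import pandas as pd

age = pd.Series([32, 46, 25, 36, 29, 54])

# Split the Age range into K = 3 equal-width bins and label them.
binned = pd.cut(age, bins=3, labels=["young", "middle", "older"])
print(binned.tolist())
# ['young', 'older', 'young', 'middle', 'young', 'older']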
If we divide the Customers feature by the Visitors feature, we obtain a new feature that represents
the conversion ratio of a specific ad (Table 7.7). This new feature can improve the performance of
ML models. Since Conversion Ratio is derived from Visitors and Customers, you might think that we
could simply drop those two original features; in practice, however, keeping all three can achieve
higher accuracy.
For the sake of simplicity, I used a dataset with just a few features to show what feature engineering
is. In reality, expect datasets to have 10+, 20+, 100+ columns, or even more. A sketch of the ratio
feature is shown below.
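Here is a sketch of the ratio feature, using hypothetical Visitors and Customers columns in place of the full ad dataset:

import pandas as pd

df = pd.DataFrame({"Visitors": [1200, 800, 2500],
                   "Customers": [60, 64, 100]})

# Dividing the two columns yields the conversion ratio per ad.
df["ConversionRatio"] = df["Customers"] / df["Visitors"]
print(df["ConversionRatio"].tolist())  # [0.05, 0.08, 0.04]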
Table 7.8: Bank Transactions
# Date Time Location Status
1 21/08/2020 02:00 Amsterdam Legit
2 24/12/2020 05:19 Dusseldorf Fraud
3 10/04/2020 18:06 Berlin Legit
... ... ... ... ...
53 13/03/2020 19:01 Belgium Legit
54 08/10/2020 15:34 Paris Legit
55 02/04/2020 23:58 Amsterdam Fraud
The target column Status has two classes: Fraud for fraudulent transactions and Legit for legal
transactions. Imagine that out of 55 observations in the dataset, there are 50 legal transactions (class
Legit) and only 5 fraudulent transactions (class Fraud). These two classes are imbalanced.
When we have a disproportionate ratio of observations in each class, the class with the smaller
number of observations is called the minority class, while the class with the larger number of
observations is called the majority class. In our current example, the class Fraud is a minority class
and the class Legit is a majority class.
Imbalanced classes can create problems in ML classification if the difference between the minority
and majority classes is significant. When we have very few observations in one class and a
lot of observations in another, we try to close the gap. One of the ways to do so is by using
oversampling techniques.
7.4.1 Oversampling
Many ML classifiers produce predictions that are biased towards the class with the largest number of
samples in the training set. When the classes have wildly different numbers of observations, this
can cause the ML algorithm to learn a poor model. For instance, imagine we happen to collect a
dataset with 1,000 credit card transactions, where there is only one fraudulent transaction and 999
non-fraudulent transactions. We use that dataset to train the algorithm. It’s likely that the algorithm
will almost always predict a transaction to be non-fraudulent.
Figure 7.2: Illustration of the SMOTE algorithm used to create a synthetic dataset with balanced
classes: (a) dataset with class imbalance; (b) lines between each pair of points in the minority class;
(c) interpolation of data points; (d) synthetic dataset with balanced classes.
Oversampling techniques try to balance a dataset by artificially increasing the number of observations
in the minority class. For our dataset in Table 7.8, 5 out of 55 (or 9%) of the transactions are
fraudulent. We might want to increase those 5 fraudulent transactions to 25 or even
50 to avoid discarding the rare class, as shown in Figure 7.1. Another method for augmenting the
dataset is with the Synthetic Minority Oversampling Technique (SMOTE), which doesn’t simply
replicate data points but produces new synthetic data points. The SMOTE algorithm is discussed in
the next section.
SMOTE synthesizes new minority observations between existing minority observations. SMOTE
first draws lines between existing minority observations, similar to what is shown in Figure 7.2b, and
then randomly generates new, synthetic minority observations along those lines, as shown in Figure
7.2c. Figure 7.2d shows the results: we generated 10 new observations, increasing the number of
observations in the minority class from 4 to 14, so the difference between the classes is no longer so
significant.
That is a high-level overview of SMOTE. I am going to skip its mathematical definition because it is
somewhat complicated. If you are interested, I urge you to test SMOTE in Python; a minimal sketch
is given below, and the "How To Code" box will help you find some very helpful resources.
Key concepts
• Data Cleaning
• Feature Transformation
• Feature Engineering
• Data Augmentation
A reminder of your learning outcomes
Having completed this chapter, you should be able to:
• Understand what kind of procedures we should/can perform in the data preparation
phase.
• Understand how to transform categorical data into numerical, and why.
• Identify outliers in the dataset.
ACKNOWLEDGEMENTS
Without the support and help from a few key contributors, this book would not be possible. I am
deeply thankful to the following people in particular:
Alikber Alikberov, Elena Siamashvili, George Ionitsa, Victor Zhou, Anastasiia Tupitsyna
Much appreciation to many others who contributed directly and indirectly:
1. Joshua Starmer, Assistant Professor at UNC-Chapel Hill
2. Josh Tenenbaum, Professor at MIT, Department of Brain and Cognitive Sciences
3. Guy Bresler, Associate Professor at MIT, Department of Electrical Engineering and
Computer Science
END NOTES
For space considerations, I’m presenting copious (but not comprehensive) citations. I intend these
notes as both a trail of the sources used for this book and a detailed entry point into primary sources
for anyone interested in some Friday night (or Saturday morning) exploration.
Chapter 1
1-1. Theodoros Evgeniou; inseaddataanalytics.github.io/INSEADAnalytics
Chapter 5
5-1. Giorgos Papachristoudis; towardsdatascience.com/the-bias-variance-tradeoff-8818f41e39e9
A. Unsupervised Learning
Let’s go back to the example where someone removed your table’s last column so that you no longer
had the different types of fruit labelled (Table ??). Let’s say I’ve found your graph (Figure ??)
representing the unlabelled data of fruits you measured, and want to use it to "restore" the fruit class
label from the supervised learning section. In other words, I want to partition this dataset into a
chosen number of clusters.
My decision causes several interesting challenges. First, because I do not know how many types of
fruits were purchased, I have to consider how many clusters are needed. Second, since I do not know
what types of fruits were measured (even if I do manage to partition the dataset into the correct
number of clusters), I won’t be able to identify which cluster represents which fruit. We’ll need to
tackle these two problems separately.
Because I do not know how many fruit types were measured, I’ll start by running my clustering
algorithm with 2, 3, 4, 5, and 6 clusters (Figure A.2).
After careful observation of my output graphs, I noticed that partitioning the dataset into three
clusters appears to be the best option. As a result, I concluded that only three fruit types were
measured. My decision was based purely on the graphs’ appearance, which means I also accepted
the risk of being wrong – I can assume there were only three fruits, but I will never know for certain.
Despite having decided how many fruits I think were measured, I can’t say what types of fruits were
measured. They could have been watermelon, kiwi, apple, banana, orange, lemon, or something else
entirely. However, since the height (x-axis) and the width (y-axis) are known, if I go to the nearest
fruit market and show this graph to a farmer, I’d probably get my answers. Again, I have to accept
the risk that some of those answers might be wrong.
Clustering an unlabeled dataset with unsupervised learning therefore has two incredibly important
aspects to consider (a code sketch follows the list):
1. We need to be careful when determining the number of clusters for the algorithm to partition
the dataset into.
2. We have to know the market/business to successfully identify each cluster.
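As a sketch of this experiment, we can run k-means with several cluster counts on hypothetical fruit measurements and watch how the within-cluster spread (inertia) shrinks as k grows:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (height, width) measurements for three fruit types.
rng = np.random.default_rng(0)
fruits = np.vstack([rng.normal([7, 7], 0.4, size=(20, 2)),
                    rng.normal([12, 4], 0.4, size=(20, 2)),
                    rng.normal([4, 4], 0.4, size=(20, 2))])

# Fit one model per candidate number of clusters.
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(fruits)
    print(k, round(model.inertia_, 1))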
B. Non-differentiable Cost Functions
As we learned in Section 3.1.3, gradient descent requires taking the derivative of a cost function. In
other words, gradient descent is only applicable if the cost function is differentiable. Unfortunately,
not all functions are differentiable. Generally, the most common forms of non-differentiable
behavior involve:
1. Discontinuous functions
2. Continuous but non-differentiable functions
Consider the following function:
$$f(x) = x^3 - 2x + 1 \qquad \text{(B.1)}$$
Is this function differentiable? Yes: we can easily find its derivative, $f'(x) = 3x^2 - 2$.
Informally, a continuous function is one whose graph has no breaks or jumps: it is defined at every
point of its domain, and small changes in $x$ produce small changes in $f(x)$. Based on this, we
know that the function above is continuous. Now consider the following piecewise function:
$$f(x) = \begin{cases} 3x^2 - 2, & x < 1 \\ 2x - 1, & x > 1 \end{cases}$$
This function is not differentiable at x = 1 because it is not even defined there: both pieces exclude
x = 1, leaving a “hole” in the graph, so the function is not continuous at that point.
A function can be discontinuous even without any “jump” in its graph. Let’s have a look at the
following function.
The function in Figure B.3 is not defined at x = 0, so it makes no sense to ask whether it is differentiable
at that point.
Many students assume that a continuous function is also a differentiable function, but that is not
always the case: while a differentiable function is always continuous, a continuous function is not
always differentiable. In other words, there are continuous but non-differentiable functions. Let’s go
ahead and check out some examples of these surprising functions.
$$f(x) = x^{1/3} \qquad \text{(B.3)}$$
Its derivative is
$$f'(x) = \frac{1}{3} x^{-2/3} \qquad \text{(B.4)}$$
Now that we have the function’s derivative, let’s evaluate it at x = 0:
$$f'(0) = \frac{1}{3} \cdot 0^{-2/3} = \frac{1}{3} \cdot \frac{1}{0^{2/3}} = \frac{1}{0} \qquad \text{(B.5)}$$
which is undefined – the tangent to the curve at x = 0 is vertical.
Since vertical gradients are undefined, any curve with a vertical rise or drop is non-differentiable at
that point. So you can check a function for the presence of these vertical tangents to understand
whether it is differentiable; the sketch below demonstrates this numerically.
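As a sketch, the symmetric difference quotient of f(x) = x^(1/3) at x = 0 grows without bound as the step shrinks, revealing the vertical tangent:

import numpy as np

def f(x):
    return np.cbrt(x)  # cube root, defined for negative x as well

for h in [1e-1, 1e-3, 1e-5]:
    slope = (f(h) - f(-h)) / (2 * h)  # difference quotient at x = 0
    print(h, slope)  # the estimated slope blows up: a vertical tangent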
This was a very brief introduction to non-differentiable functions. Because this topic is slightly
more advanced, this book does not cover more than the basic fundamentals. I suggest you explore
non-differentiable functions in your free time. For now, it’s enough to know that such functions
exist, and you should keep that in mind when working with gradient descent.