Data Science From Scratch With Python
AI Publishing
How to contact us
Table of Contents
2.2.2. Bias-Variance Trade-off ........................................................................ 34
2.2.3. Feature Extraction and Selection......................................................... 37
3. Overview of Python and Data Processing ................................................. 38
3.1. Python Programming Language .............................................................. 38
3.1.1. What is Python? ..................................................................................... 38
3.1.2. Installing Python .................................................................................... 39
3.1.3. Python Syntax ......................................................................................... 40
3.1.4. Python Data Structures ......................................................................... 41
3.1.5. Why not R? ............................................................................................. 49
3.2. Python Data Science Tools ...................................................................... 50
3.2.1. Jupyter Notebook .................................................................................. 50
3.2.2. NumPy..................................................................................................... 51
3.2.3. Pandas ...................................................................................................... 53
3.2.4. Scientific Python (SciPy) ....................................................................... 58
3.2.5. Matplotlib ................................................................................................ 60
3.2.6. Scikit-Learn ............................................................................................. 73
3.3. Dealing with Real-World Data ................................................................. 77
3.3.1. Importing the Libraries ......................................................................... 77
3.3.2. Get the Dataset ...................................................................................... 77
3.3.3. Detecting Outliers and Missing Data .................................................. 78
3.3.4. Dummy Variables .................................................................................. 82
3.3.5. Normalize Numerical Variables........................................................... 83
4. Statistics and Probability ............................................................................... 87
4.1. Why Probability and Statistics? ................................................................ 87
4.2. Data Categories .......................................................................................... 87
4.3. Summary Statistics ..................................................................................... 88
4.3.1. Measures of Central Tendency ............................................................ 88
4.3.2. Measures of Asymmetry ....................................................................... 89
4.3.3. Measures of Spread ................................................................................ 90
4.3.4. Measures of Relationship ...................................................................... 91
4.4. Bayes Rule ................................................................................................... 92
4.4.1. Marginal Probability .............................................................................. 92
4.4.2. Joint Probability ..................................................................................... 92
4.4.3. Conditional Probability ......................................................................... 93
4.4.4. Bayes Rule ............................................................................................... 94
5. Supervised Learning Techniques ................................................................. 96
5.1. Linear Regression ....................................................................................... 96
5.1.1. Simple and Multiple Linear Regression Introduction....................... 96
5.1.2. Simple Linear Regression in Python ................................................. 101
5.1.3. Multiple Linear Regression in Python .............................................. 103
5.1.4. Linear Regression Coefficients .......................................................... 104
5.2. Logistic Regression .................................................................................. 109
5.2.1. Logistic Regression Intuition ............................................................. 109
5.2.2. Logistic Regression Regularization.................................................... 112
5.2.3. Logistic Regression Pros and Cons ................................................... 113
5.2.4. Logistic Regression in Python............................................................ 113
5.3. Support Vector Machines ....................................................................... 119
5.3.1. SVM Intuition ...................................................................................... 119
5.3.2. SVM Pros and Cons ............................................................................ 124
5.3.3. SVM in Python ..................................................................................... 124
5.4. Decision Trees and Random Forests .................................................... 127
5.4.1. Decision Trees Intuition ..................................................................... 127
5.4.2. Decision Trees Example ..................................................................... 132
5.4.3. Decision Trees Pros and Cons........................................................... 136
5.4.4. Decision Trees in Python ................................................................... 136
5.4.5. Random Forests Intuition .................................................................. 144
5.4.6. Random Forests Pros and Cons ........................................................ 144
5.4.7. Random Forests in Python................................................................. 145
5.5. K-Nearest Neighbor ................................................................................ 149
5.5.1. K-Nearest Neighbor Intuition ........................................................... 149
5.5.2. K-Nearest Neighbor Hyperparameters ............................................ 149
5.5.3. Dimensionality Problem ..................................................................... 151
5.5.4. Feature Normalization ........................................................................ 151
5.5.5. K-Nearest Neighbor Pros and Cons................................................. 152
5.5.6. K-Nearest Neighbor in Python ......................................................... 152
5.6. Naïve Bayes ............................................................................................... 161
5.6.1. Bayes Theory Revision ........................................................................ 161
5.6.2. Naïve Bayes Intuition .......................................................................... 162
5.6.3. Naïve Bayes Pros and Cons ............................................................... 167
5.6.4. Naïve Bayes in Python ........................................................................ 167
5.7. Model Evaluation and Selection ............................................................ 170
5.7.1. Splitting the Dataset ............................................................................ 170
5.7.2. Cross-Validation................................................................................... 170
5.7.3. Evaluation Metrics ............................................................................... 171
5.7.4. Hyperparameters Tuning .................................................................... 174
5.7.5. Grid Search in Python ......................................................................... 175
6. Unsupervised Learning Techniques .......................................................... 179
6.1. K-Means Clustering ................................................................................. 179
6.1.1. K-Means Intuition ............................................................................... 179
6.1.2. K-Means Initialization Trap ............................................................... 182
6.1.3. Selecting the Number of Centroids................................................... 182
6.1.4. K-Means Failure Cases........................................................................ 183
6.1.5. K-Means Pros and Cons ..................................................................... 184
6.1.6. K-Means in Python.............................................................................. 184
6.2. Hierarchical Clustering ............................................................................ 200
6.2.1. Hierarchical Clustering Intuition ....................................................... 200
6.2.2. Hierarchical Clustering Pros and Cons ............................................. 201
6.2.3. Hierarchical Clustering in Python ..................................................... 202
6.3. Principal Component Analysis ............................................................... 205
6.3.1. PCA Intuition ....................................................................................... 205
6.3.2. PCA Pros and Cons............................................................................. 206
6.3.3. PCA in Python ..................................................................................... 207
7. Neural Networks and Deep Learning ....................................................... 211
7.1. Neural Networks Introduction .............................................................. 212
7.1.1. Reasons for Neural Networks Success ............................................. 212
7.1.2. What is Deep Learning? ...................................................................... 212
7.2. Artificial Neural Networks ..................................................................... 214
7.2.1. How do Neural Networks Work? ..................................................... 214
7.2.2. The Activation Functions ................................................................... 216
7.2.3. Numerical Example ............................................................................. 219
7.2.4. ANN in Python .................................................................................... 222
7.3. Convolution Neural Networks .............................................................. 227
7.3.1. What is Convolution Neural Networks? .......................................... 227
7.3.2. What is the Convolution Operation? ................................................ 227
7.3.3. Padding Layer ....................................................................................... 231
7.3.4. Pooling Layer........................................................................................ 231
7.3.5. CNN Traditional Structure................................................................. 232
7.3.6. CNN in Python .................................................................................... 233
8. Reinforcement Learning Techniques ........................................................ 236
8.1. Reinforcement Learning Introduction .................................................. 236
8.1.1. Reinforcement Learning Definition .................................................. 236
8.1.2. Reinforcement Learning Elements ................................................... 236
8.1.3. Reinforcement Learning Example .................................................... 238
8.2. Upper Confidence Bound....................................................................... 239
8.2.1. The Multi-armed Bandit Problem ..................................................... 239
8.2.2. Upper Confidence Bound Intuition .................................................. 239
8.2.3. Upper Confidence Bound in Python ................................................ 241
8.3. Thompson Sampling ............................................................................... 244
8.3.1. Thompson Sampling Intuition........................................................... 244
8.3.2. Thompson Sampling in Python ......................................................... 245
Bonus: Free eBook in Neural Networks and Deep Learning with Python . 248
© Copyright 2019 by AI Publishing
All rights reserved.
First Printing, 2019
Edited by AI Publishing
Ebook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7330426-3-5
ISBN-10: 1-7330426-3-6
The contents of this book may not be reproduced, duplicated, or transmitted without
the direct written permission of the author.
Under no circumstances will any legal responsibility or blame be held against the
publisher for any reparation, damages, or monetary loss due to the information herein,
either directly or indirectly.
Legal Notice:
You cannot amend, distribute, sell, use, quote, or paraphrase any part of the content
within this book without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for educational and
entertainment purposes only. No warranties of any kind are expressed or implied.
Readers acknowledge that the author is not engaging in the rendering of legal,
financial, medical, or professional advice. Please consult a licensed professional before
attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author
responsible for any losses, direct or indirect, which are incurred as a result of the use
of information contained within this document, including, but not limited to, errors,
omissions, or inaccuracies.
About the Publisher
At AI Publishing Company, we have established an international learning platform specifically for young students, beginners, small enterprises, startups, and managers who are new to data science and artificial intelligence.
Through our interactive, coherent and practical books and courses, we help
beginners learn skills that are crucial to developing AI and Data science
projects.
Our courses and books range from basic introductory courses on programming languages and data science to advanced courses on machine learning, deep learning, computer vision, big data, and much more, using programming languages like Python and R along with various data science and AI software tools.
AI Publishing’s core focus is to enable our learners to create and try proactive
solutions for digital problems by leveraging the power of AI and Data Sciences
to the maximum extent possible.
Moreover, we offer specialized assistance in the form of our free online
content and ebooks, providing up to date and useful insight into AI practices
and Data sciences subjects, along with eliminating the doubts and
misconceptions about AI and programming.
Our experts have cautiously developed our online courses, kept them concise,
to the point, short, and comprehensive, so that you can understand everything
clearly and effectively and start practicing the applications right away.
We also offer consultancy and corporate training in AI and Data Sciences for enterprises so that their staff can navigate through the workflow efficiently, with absolutely no trouble at all.
With AI Publishing, you can always stay closer to the innovative world of AI
and Data Sciences.
If you are eager to learn the A to Z of AI and Data Sciences but have no clue where to start, AI Publishing is the place to go.
Please contact us by email at: [email protected].
Book Approach
This book assumes that you know nothing about Data Science. Its goal is to
give you the concepts, the intuitions, and the tools you need to actually
implement data science programs capable of learning from data.
We will cover many techniques, from the simplest and most commonly used to more advanced ones. We will be using popular Python libraries and packages such as NumPy, Pandas, Scikit-Learn, and Keras.
While you can read this book without picking up your laptop, we highly
recommend you experiment with the practical part available online as
Jupyter notebooks at:
https://github.com/aispublishing/dsfs-python
Preface
This book is written for beginners and novices who want to develop
fundamental data science skills and learn how to build models that learn useful
information from data. This book will prepare the learner for a career or
further learning that involves more advanced topics. It contains an introduction to the very basic concepts used in data science. The learner is not required to have any prior knowledge of the field, but some basic knowledge of mathematics is required.
Data science has been applied to a vast range of domains like finance,
education, business and healthcare. Data Science is a powerful tool in fighting
cancer, diabetes, and various heart diseases. Machine learning algorithms are
being employed to recognize specific patterns for symptoms of these
conditions. Some machine learning models can even predict the chance of
having a heart attack in a specific time frame. Cancer researchers are using
deep learning models to detect cancer cells. Research is being conducted at
UCLA to identify cancer cells using deep learning.
Deep learning models have been built which accurately detect and recognize
faces in real time. Through such models, social media applications like
Facebook and Twitter can quickly recognize the faces in the uploaded images
and can automatically tag them. Such applications are also being used for
security purposes.
Speech recognition is another success story and an active area of research. The machine learns to recognize a person's voice, convert the spoken words to text, and understand the meaning of those words to extract the command.
One of the hottest research areas is self-driving cars. Using data from cameras and various sensors, a self-driving car learns to drive as it interacts with the environment. Using deep learning, these cars learn to recognize and understand a stop sign, differentiate between a pedestrian and a lamppost, and avoid collisions with other vehicles.
1. Introduction
This eBook will give you a fundamental understanding of all data science,
machine learning, and deep learning concepts and algorithms. To achieve this,
the book has detailed theoretical and analytical explanations of all concepts
and also includes dozens of hands-on, real-life projects to help you understand
the concepts better.
In the first chapter, you will learn what is meant by data science, why it is
currently used everywhere, its areas of applications, and its history and future.
Finally, the concluding chapter discusses some notes, tips and tricks to get the
utmost benefit from this eBook.
[Figure: data science shown as the intersection of domain expertise, mathematics, and computer science.]
So, you might ask, what is the difference between data science, data analytics
and big data?
First, big data means the huge volumes of various types of data: structured
data, unstructured data and semi-structured data. We won’t get into the details
of what is meant by unstructured or semi-structured data because this isn’t the
scope of this eBook.
However, we can say that the data is semi-structured if it lacks a fixed, rigid
schema. So, it has a structure, but this structure is not fixed or rigid.
Spreadsheets are good examples of semi-structured data.
On the other hand, unstructured data doesn’t have any structure. Text
documents and images are good examples of unstructured data.
Data analytics, on the other hand, is more about extracting information from
the data by calculating statistical measures and visualizing the relationship
between the different variables and how they are used to solve a problem. This,
of course, requires data preprocessing to remove any outliers or unwanted
features and also requires data post processing to visualize the data and draw
conclusions from these visualizations.
Finally, data science came to take the best of the two worlds because it is, as
we said, an interdisciplinary field which aims to mine a large amount of all
types of data to identify patterns. To identify these patterns, data scientists
explore the data, visualize it and calculate important statistics from it. Then
depending on these steps and the nature of the problem itself, they develop a
machine learning model to identify the patterns.
So, in one sentence, if you want to know why data science is surging now, it is
because of the availability of more data, better algorithms, and better hardware.
However, we will talk about a few famous use cases of machine learning and
data science in our daily lives as a lead-in to the next chapters.
2. Transport: Tesla cars have Autopilot, which can take control of driving and thus dramatically decrease the number of car crashes. Machine learning is also used in air traffic control, where much of the process is now automated.
4. Social media: Nearly all social media platforms use machine learning for both spam filtering and sentiment analysis.
Finally, as we discussed, these are just a few broad and general applications of
data science and machine learning. You can develop your own application in
any field that you find interesting and have some experience in. You’ll easily
be able to achieve this by the end of this eBook.
However, the godfather of data science is considered to be C.F. Jeff Wu, who
gave a fundamental talk called “Statistics = Data Science” back in November
1997. He formalized the data science field as a trilogy of data analysis, data
collection, and decision making.
Since that talk, the use of the term data science has grown rapidly, along with an exponential increase in the number of people interested in the field.
Further Readings
https://www.dataversity.net/brief-history-data-science/
So, it will not be surprising to see many tasks that are currently considered science fiction, such as assistant robots and self-driving cars, become part of our daily lives.
Further Readings
https://www.dataversity.net/data-scientist-future-will/
Also, to get the utmost benefit from this eBook, finish every single project
provided on your own first, and then check the sample solution. Don’t read
the solution first and convince yourself that you understand everything. You
have to write code, develop your logical thinking skills and deal with
programming errors and problems. If you start by reading the solution, then
you won’t acquire any of these three very important skills.
Finally, we encourage you to go through any further reading material that you
will frequently find in the upcoming chapters. Although these materials may contain advanced topics, they will give you an overview of what you can learn next after finishing this eBook.
If you have any questions regarding the eBook or just want to connect, feel
free to reach out to us on GitHub or LinkedIn.
2. Preliminary to Understand Data Science
In this chapter, we’ll explore in detail the different data science elements in the
first section, including statistics and probability, data mining and machine
learning (ML), the different types of learning, what is meant by neural networks and deep learning (DL), and finally, the link between AI, ML, and DL.
But before we dive into probability and statistics theories in chapter 4, let’s
first define some important terms.
Data are stored in columns and rows. The convention is that each row
represents one observation, case or example. Also, each column represents
one feature or variable.
Because our ultimate goal is to find a function that predicts y values based on x values, y = f(x), it is important to know that the x variables need to be independent of each other; they are called the predictors. On the other hand, y is the dependent variable, and it is called the response.
Again, our ultimate goal is to find a global function to map x into y.
[Figure: a sample is drawn from the population by sampling, and conclusions about the population are drawn from the sample by inference.]
Therefore, using the whole population as the target for our mapping function would be no
different than traditional programming algorithms, which are designed to work
on the specified dataset only and are not guaranteed to be generalized to the
whole population. The problem is that we cannot have the whole population
in our dataset, so we work with a representative sample of the data population.
Machine learning algorithms are different from traditional programming
algorithms in that their goal is to find parameters that can do the mapping on
the entire population based on the given sample.
Outliers are also considered a critical issue that can alter the performance of
many machine learning algorithms as we will see in the upcoming chapters.
Outliers can be detected by visualizing the data or by calculating special
statistical measures that we’ll discuss in detail in chapter 4.
Outliers can be dealt with in four major ways: drop them completely, cap them with a threshold, assign new values (based on the mean of the dataset, for example), or transform the dataset itself.
[Figure: the four ways of handling outliers - drop, cap, assign a new value, or transform the dataset.]
Note that the topic of outliers will be revisited multiple times as we go through
the datasets, and then we will discuss what is the best way to handle them
based on the nature of the dataset itself.
The same issues and solutions will be covered about missing data, which are
also frequently found in datasets.
Data mining, on the other hand, is carried out by a person on a particular
dataset, in a specific situation, with a goal in mind. This person can use
machine learning algorithms to find patterns for the sake of finding them or
generating some preliminary insights from the dataset.
Also, we can say that machine learning uses data mining techniques, among
other techniques to build models that can be used to achieve machine learning
tasks.
Before we explain the difference in words, take a look at this image, which
visualizes the difference between them.
[Figure: nested circles showing deep learning as a subset of machine learning, which is itself a subset of artificial intelligence.]
So, we can say that there is an AI involved in our system if the computer is
able to mimic human behavior. AI involves many techniques such as rule-
based systems or expert systems. One category of techniques that was showing
promising results back in the 80s was machine learning.
Machine learning was promising because it did not use any heuristics or hard-
coded algorithms but instead was oriented to mimicking how humans learn
instead of mimicking human behavior. So, simply put, machine learning
algorithms were developed to find the function that maps the input to the
output by feeding the algorithm lots of data and letting it decide the best function.
However, machine learning faced the same issues as AI in some tasks, because
of the same reason, which is that these algorithms cannot find the complex
function that maps the input to the output. An example of this is image
classification.
The neural network consists of a collection of neurons (which are the major
elements in the brain) connected in a specific way. By using this algorithm, learning many complex functions became feasible.
However, the use of the neural network was still limited because of three
reasons that we talked about in the first chapter. These reasons are the lack of
computational power, the lack of data, and the lack of optimum optimization
algorithms for neural networks.
This is because to mimic the brain, we need around 86 billion neurons, and
that was not possible by any means.
1. Supervised Learning:
In this paradigm, we have our dataset containing the input features and the
output features. We try to predict the output from the input by training our
machine learning model on the input and by trying to get as many correct
predictions as possible.
2. Unsupervised Learning:
3. Reinforcement Learning:
Reinforcement learning is mainly used in skill acquisition tasks such as robot
navigation.
[Figure: machine learning paradigms - supervised learning (classification and regression), unsupervised learning (clustering and dimensionality reduction), and reinforcement learning (skill acquisition).]
If you remember, the main objective is to recognize the pattern of the data,
which can be measured by how well the algorithm performs on unseen data,
not just the ones that the model was trained on.
This is called generalization, which means performing well on previously
unseen input.
The problem in our discussion so far is that when we train our model, we
calculate the training error. However, we care more about the testing error
(generalization error).
Therefore, we need to split our dataset into two sub-datasets, one for training
and one for testing. For traditional machine learning algorithms with small
datasets (less than 50,000 instances), we usually split the dataset into 70% for
training and 30% for testing. If the dataset is large (more than 50,000
instances), we train on more than 70% and test on less than 30%. For deep
learning applications, the datasets are usually too large to the extent that the
testing can be done on less than 10%.
Note that your model should not be exposed to the testing set throughout the
training process.
You might now ask, are there any guarantees that this splitting operation will
give the two datasets the same distribution?
This is hard to answer, but data science pioneers made all their algorithms
based on the assumption that the data generation process is I.I.D., which
means that the data are independent of each other and identically distributed.
So, what are the factors that determine how well the machine learning
algorithm is performing?
We can think of two main factors which result in a small training error and
cause a small gap between the training error and testing error.
We say that the model is underfitting when the training error is large, as the
model cannot capture the true complexity of the data.
We say that the model is overfitting when the gap between the training and
testing errors is large, as the model is capturing even the noise among the data.
So, you might wonder, can we control this? The answer is yes. It can be
controlled by changing the model capacity. Capacity is a term that is used in
many fields, but in the context of machine learning, it is a measure of how
complex a relationship the model can describe. We say that a model that
represents quadratic function has more capacity than the model that can
represent a linear function.
Therefore, we can say that the model is performing well, if the capacity is
appropriate for the amount of training data it is provided with, and the true
complexity of the task it needs to perform. Given that knowledge, we can say
with confidence that the model on the left is underfitting because it has low
capacity. The model on the right is overfitting because it has a high capacity,
and the model in the middle is just right because it has the appropriate capacity.
The solution to underfitting is fairly straightforward, which is either increasing
the size of the dataset, increasing the complexity of the model, or training the
model for more time until it fits.
The overfitting solution is a bit trickier because it needs more carefulness. The
first solution is to gather more data, of course, but this is not always feasible.
The second solution is to use cross-validation. So, let’s stop here and learn
what cross-validation means.
So far, we’ve split our dataset into training and testing, and we said we train
our model on the training set for the specified number of iterations, and after
the training is finished, we test the model performance by using the test set.
But what if we need to test our model after each iteration to discover if it is
converging or diverging? This is where a validation set comes to the rescue.
The validation dataset is simply another part of the dataset that is used for
validating the performance of the model while it is still being trained. So, we
split our dataset now into three datasets instead of two.
But the problem is, if the validation set is the same each time, we are back to
square one, which prevented us from using the testing set while training our
model.
After understanding what is meant by cross-validation, we can now understand
why it is used for preventing overfitting. Now we can monitor our model and
stop the training whenever the gap between the training error and validation
error is increasing. In fact, this is called early stopping, and we will talk about
it in detail in chapter 7.
Other solutions for overfitting exist, but are designed to work on specific
algorithms. These solutions will be mentioned and explained when we get to
their respective algorithms.
[Figure: the main solutions to overfitting - getting more data, cross-validation, and regularization.]
Before talking about bias and variance, we will classify the various kinds of
errors.
First, we have the irreducible error, which comes from the nature of the data
itself. For example, when you talk through your mobile phone, your voice
signal will always be corrupted by noise, an irreducible error that we cannot fix.
While we cannot do anything about this kind of error, it is important to know that it exists so that we understand the maximum accuracy, for example, that we can reach when we train our model.
The second kind of error is, of course, the reducible error. This error can be
categorized further into bias error and variance error.
Bias error is the difference between the average prediction of our model and
the correct value which we are trying to predict. We say that the bias error is
high if the model is oversimplified. In this case, we have a huge error in both
training and testing sets. This is similar to underfitting.
Variance error is the variability of model prediction for the given data or a
value that tells us the spread of our data. We say the variance error is high if
the model is not generalizing well on the test set. This is similar to overfitting.
The blue points represent how far we are from the minimum error which is
represented by the small red circle. In case of low bias, the blue points—the
error—are not very far from the minimum error. In the case of low variance,
the blue points are near each other without taking into consideration the
minimum error location.
However, there is a tradeoff between bias and variance because as we decrease
the model bias, we make it more complex, and thus, we increase its variance.
Similarly, when we constrain the model to decrease its variance, there is a higher chance of increasing the bias.
Linking this to model capacity, increasing the model capacity tends to increase its variance and decrease its bias.
Looking back at the three curves of overfitting, underfitting and fitting we can
say that when the model is underfitting, it has low variance and high bias. We
can also say that when the model is overfitting, it has high variance and low
bias.
To solve the variance error, we try to get more training examples and a smaller
set of features.
By just looking at the solution, we can see again that solving one of the two
problems will negatively affect the other one. Therefore, you should first know
which problem, if any, your model is suffering from more, and focus on
solving it.
2.2.3. Feature Extraction and Selection
Moving to the final topic of this chapter, feature extraction and selection is an
extremely important step in any data science project. Why?
The problem with real-world datasets is that many of the recorded features are dependent on each other, and thus redundant. Even if there are no
completely-dependent variables, some features are more important and
effective than others, depending on the task at hand. Another issue is that
many datasets consist of hundreds or even thousands of features, making the
training process impractical and sometimes impossible.
To address these issues, we will perform feature extraction and selection as a preprocessing step for all the projects that we will work on together throughout this eBook.
3. Overview of Python and Data Processing
This chapter is divided into three main sections: Python programming
language, Python data science tools, and real-world data.
In the first section, we will learn the basics of Python programming, its syntax,
its data structures, and why not to use R.
In the second section, we will focus more on the tools and libraries that every
data scientist should be familiar with including Jupyter notebook, NumPy,
Pandas, SciPy, Matplotlib and Scikit-Learn.
In the third and final section, we will begin our journey on how to deal with
real-world data using the tools that we mentioned. This will include how to get
the dataset, how to import the needed libraries, what the different types of
variables are, how to split our dataset, how to preprocess our data, and finally,
how to perform k-fold cross-validation.
There are many different versions of Python, with two versions, 2.7 and 3.6
being the most commonly used. For beginner or intermediate level
programmers, the main difference is in some simple syntax. We will be using
3.6 in this eBook as it has wider library support than 2.7.
1. Official Python Website: This is very easy to follow, but it will install
Python only, with no external libraries. Thus, this method is not
recommended.
2. Miniconda: This will install the conda package manager along with
Python. This method has the same disadvantage as the first method as
all the external libraries have to be installed manually.
3. Anaconda Distribution: This will install all the packages that you will
need in many chapters of this eBook. Also, the installation of any
additional packages is very easy and straightforward, and we will
mention it when we need it. This is the recommended method.
Further Readings
If you want to know more about how to use Anaconda, check its
documentation here
https://docs.conda.io/projects/conda/en/latest/index.html
Every programming language has its own syntax. So, what do we mean by
syntax?
The syntax is the rules or the grammar of the programming language, like that
of any spoken language such as English or French.
The first thing you will need to know about any programming language is the
syntax because this differs very much from one language to another.
The first rule of Python code is the line structure. Any Python program is
divided into logical lines, and every one of these lines is ended by a token which
is NEWLINE. You do not write this word; it is embedded and hidden in the
language. A single logical line can consist of one or more physical lines. If a
line contains only comments or is left blank, it is called a blank line, which is
ignored by the interpreter.
The second rule is the comments. Comments in Python start with a hash
character (#). These comments are also ignored by the interpreter.
The third rule is joining two lines. This is needed when you are writing a long
code and need to go to the following line. To do so, we use the backslash
character (\).
The fourth rule is writing multiple statements on a single line. This can be done
by using a semicolon ( ; ) between the two statements. Then, they will be
executed as if they were on two different lines.
The final and most important rule is indentation. While many languages such as Java or C++ use braces ({}) to indicate blocks of code, Python uses whitespace for this purpose. All the statements within the same block should have the same indentation level.
To start writing code using any programming language, you need to know that all the data (variables) that you use in your code have to be stored in memory. You can do an operation like the following, for example, and the result will be saved in memory, but where? Can you locate the memory address that contains three? The answer is, of course, no.
Thus, we need to assign three to a variable that we can refer to afterward.
But as we can tell, the data has to be stored in the memory in a structure so we
can differentiate between different kinds of variables.
Before we start talking about the different data structures, note that in Python,
you don’t have to write the type of the variable before it as in other languages.
Python is smart enough to interpret the type of the variable from the
assignment. To understand more about this, let’s discuss the different data
structures.
We’ll talk first about the basic data types. The most basic data type category is
numbers. We can show our numbers in three different formats: integer, float
and complex. We won’t work with complex numbers as they’re not really
useful in data science. You only need to know that Python, as opposed to other
languages, has a dedicated data type for complex numbers.
Let us write some basic code and see how Python executes it.
As you can see, by using type built-in function, we can see which data type
Python used for every variable.
Also, if you do any basic operations between an integer and a float, Python
will store the result automatically in a float.
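For example, a minimal snippet along these lines (the variable names and values are only illustrative) shows how Python infers the types:

a = 3          # integer
b = 2.5        # float
c = 1 + 2j     # complex
print(type(a))       # <class 'int'>
print(type(b))       # <class 'float'>
print(type(c))       # <class 'complex'>
print(type(a + b))   # int + float gives <class 'float'>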
So, let’s now talk about strings, which are the second category of the basic
data types in Python. Strings are sequences of character data. We can use either
single or double quotes to indicate that this variable is a string.
As we said, strings are just a bunch of characters. Thus, we can access some of
these characters like this.
We can also multiply a number by a string. This will have the effect of repeating
this string a number of times equal to this number.
The error is pretty clear! So, what we do to add a number to a string is convert the number to a string first, as shown below. The same operation can be done with a float and a string, or the other way around.
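A small sketch of these string operations (the values are illustrative):

s = "data science"        # single or double quotes both work
print(s[0])               # 'd' -- indexing a single character
print(s[0:4])             # 'data' -- a slice of characters
print(3 * "ab")           # 'ababab' -- repeating a string
# print(3 + "ab")         # would raise a TypeError
print(str(3) + "ab")      # '3ab' -- convert the number to a string first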
We’ll now talk about the Boolean data type. This is a data type that was created
to be used in conditions and comparisons. This is because the only values that can be stored in the Boolean data type are True and False (which behave like 1 and 0).
Given that we now understand the basic data types, let us move to more
complex data types.
First, we have lists, which are basically a container of variables of any type,
stored together. We can write a simple list like this.
So, to write a list, we use square brackets []. Also, for all indexing, we start
from 0 and not from 1. Thus, if we want to do any operation on the second
number of this list, then we will write list[1]. So, what if we need more than
one index? Then we can do the following:
If we use a negative index, then it will count from the end of the list.
We can add two lists together, and we can append a value to or remove a value from the list.
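A minimal sketch of these list operations (the values are illustrative):

numbers = [10, 20, 30, 40]
print(numbers[1])          # 20 -- the second element (indexing starts at 0)
print(numbers[1:3])        # [20, 30] -- more than one index (a slice)
print(numbers[-1])         # 40 -- a negative index counts from the end
print(numbers + [50, 60])  # adding two lists together
numbers.append(50)         # append one value to the end
numbers.remove(20)         # remove one value
print(numbers)             # [10, 30, 40, 50]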
So, let’s now talk about another data structure, which is Tuple. The Tuple is a
special type of list whose elements cannot be changed.
We say that lists are mutable as we can change their contents at any time, while
we cannot do the same with tuples. Therefore, we say that tuples are
immutable.
By looking at this simple example, we can see that the only difference in syntax
is that we use circle brackets instead of square brackets. We can also see that
for indexing, tuples and lists are the same.
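For example (illustrative values):

point = (3, 7, 9)      # round brackets create a tuple
print(point[0])        # 3 -- indexing works exactly like lists
# point[0] = 5         # would raise a TypeError because tuples are immutable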
Moving to the next data structure, we now introduce the dictionary. A dictionary is like an address book, where you can find the address of a person by using his or her name. If we assume that you have the full name, then this name is unique. So, we say that every object in the dictionary has two attributes: the key and the value. While the key is unique, as we said, the value is not. For example, John and Mary (keys) can have the same height (values), but we cannot do the opposite. This means that we cannot say that John is 170 cm, for example, and then say that he is 180 cm. Also, if the height were the key, then we could not assign the same height to two different persons.
We use curly brackets to create a dictionary, and to connect a key to a value, we use a colon (:). Note that while tuples and lists are ordered, dictionaries are not
ordered, and can be indexed using the keys.
Notice that we get the value by using the key instead of the index, as there is no order here.
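A small sketch (the names and heights are made up):

heights = {"John": 170, "Mary": 170, "Alice": 165}   # key: value pairs
print(heights["John"])    # 170 -- we access a value by its key, not by position
heights["John"] = 172     # a key can hold only one value at a time
print(heights)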
The final data structure is called set, which can only have unique values. To
create and assign a set, we also use curly brackets but without the colons. It
also resembles the dictionary in that it has no order.
As we can see, the main idea behind sets is that we don't have any repeated values.
Also, sets do not support indexing.
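For example (illustrative values):

blood_types = {"A", "B", "A", "O"}   # the duplicate "A" is kept only once
print(blood_types)                   # {'A', 'B', 'O'} -- order is not guaranteed
print("A" in blood_types)            # membership test instead of indexing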
So, after talking about the syntax of all the data structures, let’s discuss the use
cases of each one of them.
First, we use lists when we don’t have any special cases that we want to take
care of, and we want our list to be ordered for indexing.
We use tuples only when we are sure that the values inside them should not
be changed no matter what, so this is the best way to ensure that.
Dictionaries are used when we want to have some sort of relation between
some unique variables and other non-unique variables. Also, they are very
useful in the sense that we do not need to know the index of the variable to
get it, as we are only concerned with the key.
Sets are rarely used in data science, but we use them when we know that any repeated data will be redundant. So, sets can be very efficient in ignoring
redundant data to increase the performance of any algorithm.
Now, it’s your turn to run the code and experiment with it.
Further Readings
If you want to know more about Python data structures, go to this tutorial
here
https://www.tutorialspoint.com/python/python_variable_types.htm
However, R is not as widely used as Python because Python has much more support from external libraries. Python can also be used for many other applications, so its use can result in more complete, end-to-end projects.
3.2. Python Data Science Tools
Note that all the code that we will develop throughout this book is embedded in Jupyter notebooks. Thus, you need to be familiar with the interface.
When you open the application, you will see something like this:
You can create a notebook by clicking New in the top right corner. After that,
you can create a notebook which looks like this.
By moving your mouse cursor to any button, you will understand exactly what
it does. It is very intuitive.
The main thing you need to know is that what you write is one of two things,
either a code cell or a markdown cell. The markdown cell is just for the
organization because you write things that will not be executed by Python,
such as comments or headlines.
At the end of this section, you will find a hands-on box containing a notebook
with even more details.
3.2.2. NumPy
NumPy is short for Numerical Python, which is a library consisting of
multidimensional array objects and a collection of routines for processing those
arrays. Its main use is for mathematical and logical operations on arrays.
To understand and practice the capabilities of NumPy, let’s start writing some
code using it.
We can import NumPy using "import", and we usually use a short name for
our libraries as we will be mentioning them many times.
Create an array using NumPy by doing the following.
Now, let us see how to get the shape of any array. This is crucially important
in data science, as we are always working with arrays and matrices.
Finally, we’ll see how to perform the basic operations using NumPy.
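A minimal sketch of these steps (array creation, checking the shape, and basic operations; the values are illustrative):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # create a 2-D array
print(a.shape)                          # (2, 3) -- two rows, three columns

b = np.array([[10, 20, 30], [40, 50, 60]])
print(a + b)            # element-wise addition
print(a * 2)            # multiply every element by 2
print(a.T)              # transpose
print(np.dot(a, b.T))   # matrix multiplication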
Further Readings
If you want to know more about NumPy, go to this tutorial here
https://www.numpy.org/devdocs/user/quickstart.html
3.2.3. Pandas
Pandas is another very critical library in data science. It provides high-
performance data manipulation and analysis tools with its powerful data
structures.
The main unit of Pandas is the DataFrame, which is like an Excel sheet with
dozens of built-in functions for any data preprocessing or manipulation
needed. There is also a data type called Series and another one called Panel.
These will be explained when needed.
With Pandas, dealing with missing data or outliers can be very easy. Not only that, but manipulating complete columns or rows of data is easy as well.
Let us look at the fundamentals of Pandas. Again, it is really important that
you execute the following code snippets yourself in order to understand better.
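A minimal sketch of the two main data structures (the values are illustrative):

import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])    # 20

# A DataFrame is a two-dimensional table, like an Excel sheet.
df = pd.DataFrame({"name": ["Anna", "Ben", "Cara"], "age": [23, 31, 27]})
print(df)
print(df.describe())   # quick summary statistics of the numeric columns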
Pandas Panels are not used widely. Thus, we will focus only on Series and DataFrames.
However, you can use Panels when your data are 3D.
Pandas provides several functions for reading data from different file formats, including:
● read_csv()
● read_excel()
● read_json()
● read_html()
● read_sql()
The first step is to change the directory to the one containing the dataset. This
can be done using the os library.
We will now use the reading function that we have just mentioned.
Pandas has a function called “head” that enables us to view the first few
elements of a specific DataFrame.
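A minimal sketch of these steps (the directory and file name are placeholders; adjust them to wherever you saved the dataset):

import os
import pandas as pd

os.chdir("path/to/datasets")        # change to the folder containing the dataset
cars = pd.read_csv("cars.csv")      # read the CSV file into a DataFrame
print(cars.head())                  # view the first few rows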
Now, we’ll work with the cars dataset and see how to select a column from it.
Finally, we can create a new column in the DataFrame in which the data is stored.
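Continuing with the cars DataFrame, a minimal sketch (the column names are assumptions about the dataset):

import pandas as pd

cars = pd.read_csv("cars.csv")
mpg = cars["MPG"]                        # selecting a single column returns a Series
print(mpg.head())
cars["KM_per_L"] = cars["MPG"] * 0.425   # create a new column from an existing one
print(cars.head())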
Further Readings
If you want to know more about Pandas, go to this tutorial here
https://pandas.pydata.org/pandas-docs/stable/
Note that the SciPy library depends on NumPy for all its operations.
We will see how to compute 10^x using SciPy.
Finally, for our discussion, let us calculate the inverse of any matrix using
SciPy.
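A minimal sketch of both operations (the matrix values are illustrative):

import numpy as np
from scipy import special, linalg

print(special.exp10(3))            # 10**3 = 1000.0

m = np.array([[1.0, 2.0], [3.0, 4.0]])
print(linalg.inv(m))               # the inverse of the matrix
print(linalg.det(m))               # its determinant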
SciPy will not be used that much in our discussions, as we will use more high-
level libraries to compute the determinant and other operations. However, it
is good to know.
Further Readings
If you want to know more about SciPy, go to this tutorial here
https://docs.scipy.org/doc/
3.2.5. Matplotlib
Matplotlib is the fundamental library in Python for plotting 2D and even some
3D data. You can use it for many different plots such as histograms, bar plots,
heatmaps, line plots, scatter plots and many others.
Let’s see how to work with it. We’ll start by importing it.
We can make the plot more beautiful.
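A minimal sketch of a basic plot and a slightly nicer version of it (the data and styling choices are illustrative):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)          # a bare-bones line plot
plt.show()

# The same plot with labels, a title, a grid, and some styling.
plt.plot(x, y, color="green", linestyle="--", linewidth=2, label="sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("A simple line plot")
plt.legend()
plt.grid(True)
plt.show()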
To understand the anatomy of the figure, see the following figure.
We can also have many sub-plots as follows:
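For example (illustrative data):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))   # one row, two columns of axes
axes[0].plot(x, np.sin(x))
axes[0].set_title("sin(x)")
axes[1].plot(x, np.cos(x))
axes[1].set_title("cos(x)")
plt.tight_layout()
plt.show()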
Now, let’s use the visualization on a real dataset to enhance our understanding.
We will be using the cars dataset once again.
We start by importing the libraries, fixing the path and loading the dataset.
Then, we simply call the scatter method and pass our dataset variables.
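A minimal sketch of such a scatter plot (the file name and column names are assumptions about the cars dataset):

import pandas as pd
import matplotlib.pyplot as plt

cars = pd.read_csv("cars.csv")
plt.scatter(cars["Weight"], cars["MPG"])
plt.xlabel("Weight")
plt.ylabel("MPG")
plt.title("Car weight vs. fuel efficiency")
plt.show()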
Now, we will experiment and see different kinds of plots: histograms,
boxplots, bar plots and line plots. We will start with the histogram.
Then, we try to make it look better.
After that, we repeat the same code but on our cars’ dataset.
We then use the same data using a boxplot.
From there, we can experiment with bar plots and see how they look and how they are used. Here, we combine them with error bars that are frequently used when
we have uncertainty about our data.
The last type of plot that we’ll mention is the line plot. We will artificially
generate the data with the following distribution so they can be interpreted
easily in the plots.
Now, we can create two plots in one using the sub-plots function.
Finally, we can combine the four different types of plots that we discussed in
a single plot.
One final thing before we move on. It's worth mentioning that there is another library called Seaborn, built on top of Matplotlib, which can help us produce some good-looking graphs.
Further Readings
If you want to know more about Matplotlib, go to this tutorial here
https://matplotlib.org/contents.html
3.2.6. Scikit-Learn
Let us now introduce one of the most important libraries for anyone starting
machine learning—Sklearn, or Scikit-Learn.
The library also provides many utilities for data preprocessing, data visualization, and model evaluation.
Following that, we choose X to be all the dataset variables except the Origin, Model, and MPG columns. Also, we choose y to be the output variable, which is MPG. Moreover, we drop any missing values.
We then fit the model and predict the output. We will understand all the details
in chapter 5.
Further Readings
If you want to know more about Sklearn, go to this tutorial here
https://scikit-learn.org/stable/documentation.html
3.3. Dealing with Real-World Data
We will practice how to import the libraries, and most importantly, how to
know what libraries you need in your projects throughout the upcoming
chapters.
We can also work with an SQL database format or even specific APIs that
some websites or servers provide. Moreover, we can work with files coming
from other software such as MATLAB. We will see this in practice in the
notebooks of this section.
However, we haven't yet mentioned the source of these datasets. Basically, you can collect your own dataset and store it in an Excel file, for example. However, this may be too much overhead when you are starting your machine learning journey. There are plenty of websites where dedicated data scientists
upload their datasets. Some of the most popular websites for this purpose are:
● Kaggle
● WorldBank
● UCI Machine Learning Repository
● Quandl
● Amazon Web Services (AWS) datasets
● Data.Gov
Another very cool service that Google has recently launched, and which is still in a beta version, is the Google Dataset Search engine. This is just the usual Google search engine but dedicated to searching for datasets. You can access it here.
A more advanced approach to creating your own dataset is via APIs and web scraping.
This will be explored in detail in chapter 9.
The first and the most important step in preprocessing is detecting outliers.
We’ve talked about outliers before. Now it’s time to learn how to deal with
them.
To detect outliers, you should first look at the general structure of your data
and print some statistics of them. Also, you should visualize your data if
possible. This is an easy task now that we know how to use Pandas and
Matplotlib, specifically.
As we can see, the data has 635 examples with seven features. We can also see
that there are missing data in some features such as Mileage and Price. Let's
visualize the data to see if there are any outliers.
The outliers are clear! They exist at nearly 2090. So, let's filter them out.
We can delete them using a smarter way as follows:
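One possible version of such a filter (a sketch; the file name, the column name, and the percentile limits are assumptions):

import pandas as pd

df = pd.read_csv("used_cars.csv")

# Keep only rows whose Mileage falls inside a percentile-based range,
# instead of hard-coding a cut-off value.
low, high = df["Mileage"].quantile([0.01, 0.99])
df = df[(df["Mileage"] >= low) & (df["Mileage"] <= high)]
print(df.shape)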
3.3.4. Dummy Variables
The second preprocessing step in all machine learning projects is to know if
we need dummy variables or not.
Dummy variables are variables that are used when we have a categorical
variable that we cannot do mathematical operations on.
For example, if one feature of a house is the presence of a garden, we can see
that the possible values for this variable are either YES or NO. So, we create
a dummy variable for this variable where YES is replaced by 1 and NO is
replaced by 0.
This can be further extended for other variables that we cannot operate on,
such as the blood type. In this case, what we do is convert this variable with
one-hot encoding.
If we have three blood types only, then the first blood type will be replaced
with 001, the second one will be replaced by 010 and the final one will be
replaced by 100. This is one-hot encoding, and we can extend it further based
on the number of possible values that this categorical variable can have.
So, we have to convert our categorical variables into dummy variables to make
all of our variables contain only numbers that the machine learning algorithms
can understand and work with.
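A minimal sketch using pandas' get_dummies (the toy DataFrame below is made up for illustration):

import pandas as pd

houses = pd.DataFrame({"area": [120, 90, 200],
                       "garden": ["YES", "NO", "YES"],
                       "heating": ["gas", "electric", "solar"]})

# get_dummies replaces each categorical column with one-hot encoded columns.
dummies = pd.get_dummies(houses, columns=["garden", "heating"])
print(dummies)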
3.3.5. Normalize Numerical Variables
Let's continue our discussion on the house prices dataset by examining the
number of rooms and the area of this house. We can say, for example, that any
practical house can have from one to ten or more rooms, while it could have
an area of 100 square feet to thousands of square feet.
The problem exists here because the different variables normally have different
scales. This will affect our algorithm as it would think that the area of the house
matters more than the number of rooms, which we do not want to happen.
So, in order to make all our variables have the same scale, we normalize all of
our numerical variables.
● Standard score: This is done by subtracting the mean of the variable from each example and dividing by the standard deviation of that variable:
X = (X − µ) / σ
This works well when the data are normally distributed.
● Min-Max feature scaling: This is basically subtracting the minimum value and dividing by the difference between the maximum and the minimum values:
X = (X − X_min) / (X_max − X_min)
There are different normalization methods, but these two are the most
commonly used in machine learning.
We start by importing the needed libraries, fixing the path, and loading the
cars’ dataset.
In case you want to remember why we split our dataset, this image can help.
Now, let us normalize our dataset.
Hint: You can also try MinMaxScaler and see which scaler works better for you with the following algorithm.
After finishing this part, let us see how we can utilize cross-validation.
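A minimal sketch of both steps, reusing the cars DataFrame from above (the feature and target column names, and the choice of model, are assumptions):

    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Separate features and target (column names are assumptions).
    X = df[['Mileage', 'Cylinder', 'Doors']]
    y = df['Price']

    # Normalize the numerical features with the standard score.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # 5-fold cross-validation: returns one score per fold.
    model = LinearRegression()
    scores = cross_val_score(model, X_scaled, y, cv=5)
    print(scores)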
These five numbers represent the cross-validation score on each fold; we used five folds in this example.
Further Readings
If you are curious about other normalization techniques you can check
here
https://www.studytonight.com/dbms/database-normalization.php
4. Statistics and Probability
In this chapter, we will talk in more depth about statistics and probability,
which we introduced in the previous chapter.
As you know, there are very few things in the world that we can be 100% sure about. Most things we are only sure about to some extent. Thus, we need probability and statistics to provide a rational and scientific way of dealing with this uncertainty.
Also, as we will see in the next chapters, all the machine learning algorithms
are heavily based on probability and statistics theorems. So, in order to
understand them correctly and know how to use them, we have to know the
basis for these algorithms.
An example of discrete data is the number of rooms in a house. Continuous data can have any value from negative infinity to infinity.
An example of that is the speed of a car. But of course, depending on the
nature of the variable in the data, even the continuous variables should be
restricted by a range.
After that, we take the middle value as our median. Since the number of examples is even, we take the average of the two middle values: (40 + 50) / 2 = 45.
Finally, we calculate the mode by observing the most frequently occurring value, which is 50 in our case.
Figure 2: Left-Skewed (Negatively Skewed)
The range of the correlation is from −1 to 1. A correlation of −1 indicates a perfect negative correlation, which means that as one variable increases, the other variable decreases proportionally. If the correlation value is 1, then there is a perfect positive correlation. Finally, if the correlation is zero, then the two variables are uncorrelated (although not necessarily independent). The equation to calculate the correlation is the following:

ρ = Cov(X, Y) / (σ_X σ_Y)
Figure: Venn diagram of two events, A and B, whose overlap is the joint probability.
As we can see, the intersection between the two events is the joint probability. For example, if we have a traditional card deck with fifty-two cards, then the probability of choosing a black eight is P(black ∩ 8) = 2/52, because there are only two cards that satisfy these two conditions.
4.4.4. Bayes Rule
Given that we now have some familiarity with different probability concepts,
we can introduce Bayes rule. Bayes rule has the advantage of providing us with
a method to update our beliefs based on new evidence.
Suppose for example that we want to estimate the probability that a given
person will be accepted for graduate studies or not. If you only have his or her
exam grades, you will provide a different probability than if you have additional
evidence such as the number of published papers.
Bayes rule can be written as follows:
P(A|B) = P(B|A) × P(A) / P(B)
This rule combines both conditional probability and marginal probability.
Also, it is derived from joint probability.
Here, we want to get the probability of event A given that B is the new
evidence that we have. We call this the posterior, which would be “the
probability of getting accepted given that this person has published papers”.
We call P(B|A) the likelihood, as it is the probability of observing the new evidence given our initial hypothesis. For our example, this translates to “the probability of having published papers given that the person gets accepted”.
The marginal probability P(A) is also called the prior, as it is the probability of our hypothesis without any additional information. Referring to our example, we can say that this maps to “the probability of getting accepted”.
Finally, 𝑃𝑃(𝐵𝐵) is the marginal likelihood which could be translated to “the
probability of having published papers”.
In order to understand Bayes rule, let us look at a numerical example. Assume
that the probability of getting accepted at this university is 10%. Assume also
that the probability of publishing papers is 30%, which means that out of every
ten people applying to this university, three have published papers. Also, assume that 20% of the people who got accepted have published papers, so P(published | accepted) = 0.2.
Without the new evidence, which is the published papers, we would have said that the probability of being accepted is the prior probability, P(accepted) = 0.1. But now, using Bayes rule, we can make a more precise calculation as follows:

P(accepted | published) = P(published | accepted) × P(accepted) / P(published) = (0.2 × 0.1) / 0.3 ≈ 0.066
We will see in the following chapter that there is a whole algorithm called Naïve
Bayes that is based entirely on Bayes rule.
5. Supervised Learning Techniques
In this chapter, we will discuss the most famous and used supervised learning
algorithms. This is a crucial chapter to go through in detail in order to get the
utmost benefit. We will first explore the most basic supervised learning
algorithm called linear regression. Then, we will go through more advanced
and complex algorithms which are logistic regression, support vector
machines, decision trees, K-nearest neighbors and naïve Bayes. Finally, we will
define the metrics that will help us evaluate any machine learning model.
Our discussion of each algorithm will be divided into two main parts: how the algorithm works intuitively and mathematically, and how to implement it in Python.
If you remember from school, we can fit a straight line to our data by using the following equation:

y = mx + b
In this equation, we can find the output y by multiplying the input x by the
slope m and by adding this to the y-intercept b. We have the output and the
input, but what about the slope and the intercept?
In fact, this is what we are trying to learn, because if we already have the slope
and the intercept, then there is no problem to solve.
So, our goal is to find m and b which we will call the weights and the bias from
now on.
If the input variables are more than one, then we call this a multiple linear
regression problem, and if there is only one input variable, like our example,
then we call it a simple linear regression problem.
Now we’ll plot the data using some arbitrary numbers that we can assume for
now are true.
It’s clear that we can fit our model using different lines by tweaking m and b.
To stick to the machine learning notation, let’s rename b to w₀ and m to w₁. So now, we can rewrite the equation this way:

y = w₀ + w₁ x

We can generalize this equation even further so that it also holds for multiple regression:

y = Σᵢ wᵢ xᵢ = wᵀ x, where the sum runs from i = 0 to n
The T superscript that we use for w is called the transpose, and this equation is the same as the sum equation, but it is mainly used when we convert our variables into vectors and matrices. By converting them, we can avoid using loops, which take too much time to finish if we have a large number of inputs. Using vectors is always preferable, as computers are optimized to perform matrix multiplication faster than they run loops. We call this paradigm vectorization.
As we can see, there are infinite values for the weights, and we cannot really
tell, until now, which set of weights gives the best performance.
There are two main methods to determine these weights. Both of them are
based on minimizing the error. However, they differ in their approaches to do
so, as the first method does this by getting a closed-form mathematical
solution, while the second one is an iterative solution that tries to converge to
the correct answer.
The first method is quite simple. We say that the error is εᵢ = yᵢ − ŷᵢ, where yᵢ is the true output for example i, and ŷᵢ is the estimated output for example i. So, the error, which is also called the residual, is the difference between them.
Our objective is to minimize the sum of the squared prediction errors. We use
the square because we want all our errors to be positive values and eliminate
any negative values. We could also minimize the sum of the absolute prediction
errors as this will also do the trick; however, using the squaring technique has
some mathematical advantages over the absolute technique. Therefore, we will
stick with the sum of the squared errors technique.
So, we can write this mathematically as follows:

J(w) = Σᵢ εᵢ² = Σᵢ (yᵢ − ŷᵢ)², where the sum runs over all n examples
We can find w using some mathematical manipulation that we will not really
be concerned about right now, but it has a closed-form solution that is applied.
The second method is an iterative method called gradient descent. In this method, we have our cost function, which is the same sum of squared errors as before. Our objective, again, is to find the weights that minimize this cost function.
If you studied pre-calculus in high school, you will know that by saying minimize
or maximize for a function, we mean getting the first derivative of this function
and making this derivative equal to zero. The symbol that we will use for the
derivative of the cost function is ∇J. The most common optimization algorithm
used in machine learning for minimization is called gradient descent.
The intuition behind the gradient descent is very simple. You start by choosing
random weights. Then you calculate the first derivative of the cost function.
After that, you move in the opposite direction of this value, multiplying this
number by a factor called the learning rate. Finally, we update the weights and
repeat until convergence.
w = w − α ∇J(w)
So, you might have two questions. The first one asks what the value of the
learning rate should be. The answer is that it depends on the convergence rate.
So, if we have an error that is far from the right answer, then we will want a
bigger learning rate. However, once we start converging, this big learning rate
will make it difficult for us to reach the minimum value as it may overshoot.
Also, choosing a very small learning rate will make the model take too much time to converge, and it may also get stuck in a local minimum and never reach the global minimum. Nonetheless, people tend to use learning rates in the range of 10⁻² to 10⁻⁵. So, a good method to choose your learning rate is to start from 10⁻⁵ and increase it sharply as long as it gives you good results, then increase it more carefully once you reach a critical value.
Note that the learning rate is not included in the trainable parameters of the model; thus, we call it a hyperparameter. As we will see in the next algorithms, there will be many hyperparameters over which we will have full control.
The second question is why we take the negative of the gradient. The answer is that the derivative is the slope of the cost function at the current point, and it points in the direction in which the cost increases. Since we want to decrease the cost, we move in the opposite direction; therefore, we use the negative sign in our calculation of the new weights.
The algorithm that we have just discussed is called gradient descent, and it is
used in many other machine learning algorithms as it is a very solid
optimization algorithm. Note also that there are two variations of this
algorithm which are stochastic gradient descent and mini-batch gradient
descent. We will discuss them in detail in chapter 7. However, it is enough to know for now that stochastic gradient descent updates the weights based on a single example, while mini-batch gradient descent updates them based on a number of examples equal to the batch size. There are pros and cons to each of the three versions of the algorithm. Using vanilla gradient descent is adequate for now.
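To make the idea concrete, here is a minimal NumPy sketch of vanilla gradient descent for simple linear regression (the synthetic data, the learning rate, and the number of iterations are assumptions chosen purely for illustration):

    import numpy as np

    # Synthetic data roughly following y = 2x + 1 (an assumption for illustration).
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = 2 * x + 1 + rng.normal(0, 1, 100)

    w0, w1 = 0.0, 0.0        # initial weights (bias and slope)
    alpha = 0.01             # learning rate
    n = len(x)

    for _ in range(1000):
        y_hat = w0 + w1 * x                      # current predictions
        error = y_hat - y                        # residuals
        grad_w0 = (2 / n) * error.sum()          # dJ/dw0
        grad_w1 = (2 / n) * (error * x).sum()    # dJ/dw1
        w0 -= alpha * grad_w0                    # move against the gradient
        w1 -= alpha * grad_w1

    print(w0, w1)   # should end up close to 1 and 2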
The first step is to import all the libraries that we will need.
We print some information about the dataset, and given that we preprocessed
it in chapter 3, we won’t need to do any preprocessing again.
We then split our dataset into a training dataset and testing dataset.
Then, we fit our dataset using sklearn linear regression function.
After that, we predict the output and measure the performance using the root
mean square error metric (RMSE).
Given that we have more than one input variable, we need to normalize our
input.
We finally fit and predict.
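A condensed sketch of this whole workflow, assuming the preprocessed cars DataFrame from chapter 3 with only numeric columns and a Price target (the column name is an assumption):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    X = df.drop('Price', axis=1)     # features (assumes the target column is 'Price')
    y = df['Price']

    # Split into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # With more than one input variable, normalize the inputs.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Fit, predict, and measure performance with RMSE.
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(rmse)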
R² = 1 − SS_res / SS_tot

Where:

SS_res = Σᵢ rᵢ²

SS_tot = Σᵢ (yᵢ − ȳ)²
We know that rᵢ is the difference between the predicted value and the true value, also known as the residual, and ȳ is the mean of the true labels. Therefore, we can say that R² measures how much of the total variation in the labels is explained by the model. If R² = 0, then our model is useless and does not reduce the error. On the other hand, if every rᵢ = 0, then R² = 1, which is our ultimate target.
Another variation of R² is the adjusted R², written R²_adj, which is the same except that the SS terms are replaced by the variances of the residuals and of the true labels.
We will now see how we can use SciPy to do the same tasks that we did for simple linear regression and multiple linear regression, with the addition of calculating R² and R²_adj.
We can see that R² and R²_adj are 62.5% and 62.6% for simple linear regression.
We can see that R² and R²_adj are 82.6% and 82.1% for multiple linear regression, which is much better than simple linear regression.
Now, let us plot the residual while keeping in mind that the error should have
a normal distribution.
Let us now plot the residual for the training data.
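One way to produce such a plot (a sketch, not the book's exact code) is a probability plot of the residuals against a normal distribution, where SciPy draws the reference line in red. This reuses reg, X_train, and y_train from the sketch above:

    import scipy.stats as stats
    import matplotlib.pyplot as plt

    # Residuals of the training predictions.
    residuals = y_train - reg.predict(X_train)

    # Q-Q plot: points close to the red line indicate roughly normal residuals.
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()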
As we can see, it nearly fits on the red reference line.
It didn’t do well on the test set as linear regression has many limitations; one
of them is that it cannot model non-linear functions.
Therefore, we will explore more complex algorithms that can handle nonlinear
functions.
However, the equation that we used for linear regression is not limited to this constraint: its output can be any real number. So, we use a logistic function to transform our output into the range [0, 1] so we can treat it as a probability. The most famous and most commonly used logistic function is the sigmoid, which has the following equation:
y(z) = 1 / (1 + e^(−z))

Where z is the linear equation that we used in linear regression:

z = Σᵢ wᵢ xᵢ = wᵀ x, with the sum running from i = 0 to n
To understand how the sigmoid function squashes our input into [0,1], we can
plot it using Python, and we would get the following curve.
We can have this plot by writing the sigmoid as a Python function and then
call this function with different values of input.
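A small sketch of that plotting code (the input range is an assumption):

    import numpy as np
    import matplotlib.pyplot as plt

    def sigmoid(z):
        # Squashes any real number into the range (0, 1).
        return 1 / (1 + np.exp(-z))

    z = np.linspace(-10, 10, 200)
    plt.plot(z, sigmoid(z))
    plt.xlabel('z')
    plt.ylabel('sigmoid(z)')
    plt.show()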
As you can see, the output, on the Y-axis, can only take values in the range [0, 1]; it approaches zero at negative infinity and approaches one at positive infinity. We can also see that the output is 0.5 when the input is zero. We can alter that by scaling the sigmoid function or changing the bias.
Moving to the loss function, we cannot use the same mean square error loss that we used for linear regression: because the predictions are all between 0 and 1, the squared differences become very small. Thus, we need a loss function that is sensitive to small changes. To do so, we use the negative log-likelihood loss function, which is defined as follows:

J(w) = − Σᵢ [ yᵢ log(h_w(xᵢ)) + (1 − yᵢ) log(1 − h_w(xᵢ)) ]
The mathematics behind the final output is complex, so the only thing that you need to know is that gradient descent and other iterative optimization algorithms are the only way to update the weights in logistic regression and hence classify the output correctly.

w = w − α ∂J(w)/∂w − λ w
We say that 𝛼𝛼 is the learning rate and 𝜆𝜆 is the penalization term. So, we see
that the regularization term is added as a second term in the loss function. The
purpose of this regularization term is to push the parameters towards smaller
numbers, and thus, the model does not become more complex, and hence it
doesn’t overfit.
The main difference between the two methods is that the Lasso method tries
to push all the parameters towards zero, while the Ridge method tries to push
all the parameters towards very small numbers but not equal to zero. Both
methods are used, and you have to experiment with both of them to know
which one works best for each specific case.
5.2.3. Logistic Regression Pros and Cons
We can see that the main advantages of using logistic regression are that it is
very easy to understand and interpret, very fast to train and predict and works
well with sparse data if regularization is used.
The main disadvantages of using logistic regression are that it requires the data
to be preprocessed and scaled, and it doesn’t work very well compared to more
complex algorithms if the data is complex by nature.
After loading the dataset, we print some information about the dataset.
As we see in the following figure, there are twenty-two columns, twenty-one
of them are features and the last one is our target. Also, only nine of them are
numerical, so we need to convert the other thirteen from categorical to
numerical using dummy variables.
We choose the column with the name bad credit to be our output.
As we can see, the data is unbalanced as 70% of the output is good and 30%
is bad. Right now, we cannot really do anything about it, but in the deep
learning chapter of this eBook, we will see how we can do data augmentation
to solve this crucial problem.
We convert our categorical features into numerical features using the get
dummies function in Pandas.
We copy and paste the names of these columns, so we can assign them to our
input matrix.
After that, we perform normalization and scaling on all features.
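A compact sketch of these steps, assuming a credit DataFrame with a bad_credit target column (all names are assumptions standing in for the book's screenshots):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Target and features (the column name 'bad_credit' is an assumption).
    y = credit['bad_credit']
    X = credit.drop('bad_credit', axis=1)

    # Convert categorical features into dummy variables.
    X = pd.get_dummies(X)

    # Split, then normalize and scale all features.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Train and evaluate the classifier.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))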
We test our model and get 77%, which is not that good. However, you can repeat the same steps without the normalization and see how this score drops dramatically.
Following that, we plot our features against their model weights using a bar
plot to see if the model is getting more complex than needed, and thus, can be
prone to overfitting.
Now let us use L1 regularization. As we see, the accuracy decreases by nearly
1.5% but the weights of the model are now much smaller, so we do not have
to worry about overfitting.
We do the same using L2 regularization, and we can see that the results in this specific case are much worse. Note that this is not the general case, and you must experiment with both techniques to decide which is better in each case.
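A hedged sketch of how the two penalties might be compared with sklearn, reusing the scaled split from above (the solver choice and C value are assumptions; liblinear is one solver that supports the L1 penalty):

    from sklearn.linear_model import LogisticRegression

    # L1 (Lasso-style) regularization: pushes many weights to exactly zero.
    clf_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
    clf_l1.fit(X_train, y_train)
    print('L1 accuracy:', clf_l1.score(X_test, y_test))

    # L2 (Ridge-style) regularization: pushes weights towards small values.
    clf_l2 = LogisticRegression(penalty='l2', solver='liblinear', C=1.0)
    clf_l2.fit(X_train, y_train)
    print('L2 accuracy:', clf_l2.score(X_test, y_test))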
Suppose for example, that we want to separate the following points for a
classification purpose.
As we can see, the three lines separate the dataset correctly. However, when
we test our models, each one of them will classify the test data differently.
So, our target is not only to find a line that separates the dataset correctly, but
also to maximize the margin between the different classes. By doing so, there
is a higher chance that the test dataset will be classified correctly.
We define the margin to be twice the distance between the hyperplane, which is just a line in our case, and the nearest points to the hyperplane. These points are called the support vectors.
Here, we have the four support vectors as two of them correspond to the red
class and two to the yellow class.
We can use the same linear regression model 𝑤𝑤 𝑇𝑇 𝑥𝑥 for the support vector, and
by doing so we can write 𝑤𝑤 𝑇𝑇 (𝑥𝑥+ − 𝑥𝑥− ) = 2, where x+ corresponds to the red
support vectors and x- corresponds to the yellow support vectors. Also, we got
2 because of subtracting the two equations 𝑤𝑤 𝑇𝑇 𝑥𝑥+ = 1 and 𝑤𝑤 𝑇𝑇 𝑥𝑥− = −1 from
each other.
After normalization, we have 2 / ||w|| as our margin, which we try to maximize. You can find some people minimizing the reciprocal ||w|| / 2 instead, to have a single minimization problem. Therefore, our cost function now has two terms: one for minimizing the classification error, and one for minimizing the reciprocal of the margin.
To solve this problem, SVM uses the kernel trick, which is nothing but a set of
functions that takes low-dimensional input space and transforms it into a
higher dimensional input space where the data can be separated. Some of the
most commonly used kernels are Radial Basis Functions, Sigmoid Kernel and
Polynomial Kernel. Explaining the math behind each of these kernels is
beyond the scope of this book and can be found in many academic statistics
references. However, you can experiment with all kernels using sklearn very
easily and compare the results to know which is better for this specific dataset.
So, overall, there are three main hyperparameters: the kernel, the C penalty, and the gamma hyperparameter. We discussed the first two, but we did not discuss the third one. Gamma decides to what extent the far-away points affect the overall decision of the separation line. So, we say that a large gamma value means that only the close points have a strong influence on the decision, while a small gamma value means that the far points also influence the decision.
However, SVM does not scale very well as the number of samples increases, since the training time grows rapidly with the dataset size. In addition to that, it needs extensive preprocessing before we can leverage its true power. Finally, it requires exhaustive hyperparameter tuning.
Then, we split our dataset.
Let’s see one more example of using a cancer dataset and observe how
changing C hyperparameter changes the accuracy.
This low accuracy on the test set is a result of not doing normalization before
training, so let us fix that.
Let us try to fit the model again now.
Using the SVC as it is with its default hyperparameters, which you can find on
sklearn official documentation, gives us great results.
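A sketch of this experiment using the breast cancer dataset that ships with sklearn (the use of MinMaxScaler here is an assumption; any scaler would illustrate the point):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit on the raw (unscaled) features first.
    svc = SVC()
    svc.fit(X_train, y_train)
    print('Unscaled test accuracy:', svc.score(X_test, y_test))

    # Rescale the features to [0, 1] and fit again; scaling typically helps SVMs.
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    svc = SVC()
    svc.fit(X_train_scaled, y_train)
    print('Scaled test accuracy:', svc.score(X_test_scaled, y_test))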
We will be using the same car dataset, so the first few steps are exactly the
same.
These are the default values for all the hyperparameters available for SVM.
The results are comparable to linear regression.
We can use grid search to find the combination of C, kernel, and gamma that results in the best score.
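A rough sketch of such a grid search with an SVR on the car data, assuming the numeric training arrays X_train and y_train from the earlier split (the parameter grid and scoring choice are assumptions):

    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'kernel': ['rbf', 'linear', 'poly'],
        'C': [0.1, 1, 10, 100],
        'gamma': [0.001, 0.01, 0.1, 1],
    }

    # Search for the combination with the lowest RMSE (negated by sklearn convention).
    grid = GridSearchCV(SVR(), param_grid, scoring='neg_root_mean_squared_error', cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_)
    print(-grid.best_score_)   # best cross-validated RMSE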
We got 2.6 which is much better than the best result—3.21—that we got from
linear regression.
As the name implies, the random forests algorithm is based on randomness and forests, which means that we build forests of decision trees randomly. Therefore, if you understand decision trees thoroughly, it will be very easy to understand random forests as well.
To make any decision, we ask ourselves some questions and based on their
answers, we choose what we want to do. For example, suppose that you want
to go to the cinema and watch a movie. Your decision to do this or not can be
visualized as follows:
So, if there is a seat, then you will ask about the position of this seat. Let us assume that you prefer to sit in the middle, so if a seat in the middle is available, then you will book it. Otherwise, you can settle for another place if it is cheaper. But if you did not find a seat, you will ask whether there will be a seat later on the same day. Based on that, you will wait only if it will take less than two hours.
We can see that this is how we think and rationalize about many of our
decisions. In our example, we can treat the booking problem as a classification
problem. We can also use decision trees for regression by asking some
questions and having paths that lead to different outputs.
We call the questions that we ask the nodes of the tree with each node
corresponding to a specific question which we call an attribute 𝐴𝐴𝑖𝑖 . The answers
to these questions, or attributes, are called the branches of the tree 𝑣𝑣𝑖𝑖𝑖𝑖 . Also,
we call the last nodes with the final answers the leaf nodes or simply the leaves.
These are the classes in case of classification and the predicted values in case
of regression.
Therefore, our objective is to find the best path to get the output.
Let us assume that we have a Boolean (only 0 or 1) set of functions of n attributes. Then, the maximum number of possible input combinations is 2ⁿ, and the maximum number of functions, or truth tables, that we can build out of them is 2^(2ⁿ). Of course, if we have more attributes, then our problem will be even more complex to solve. So, you can see that the problem, although theoretically solvable, requires a lot of computational power and is sometimes not even practical.
Thus, our objective is now to find the best path to get the output in an effective
and practical manner. So, we should construct a decision tree that is as small
as possible yet contains the maximum useful information.
To do so, we use a greedy algorithm called divide and conquer. Using this
algorithm has proven to get us a small enough tree but not guaranteed to get
us the smallest tree. The algorithm has three main steps which we will mention
here and study them in detail, shortly.
First, we start with an empty tree. After that, we divide the problem into many
sub-problems to test the most important and useful attributes in our decision.
Finally, we use recursion which is applying the second step again iteratively
from the root of the tree to the final leaves.
You might be thinking about how we decide the most important attributes.
The answer is that we can think of them as being the ones that make the most
difference in our decision while training. In other words, they are the ones that
reduce the uncertainty about the decision better than the other attributes.
However, you might also have another question in mind; how to measure this
uncertainty. This is done by calculating a famous quantity called Entropy. This
quantity was coined by Shannon, one of the most influential scientists in the
field of information theory, in the last century.
We use the logarithm to base 2 to represent the entropy because it was originally developed to work with bits. Entropy represents how much uncertainty we have about our problem. The entropy of an unfair coin that always comes up heads, or always comes up tails, is zero, as there is no surprise in the outcome. On the other hand, if we have a fair coin, then the entropy is at its maximum, because each outcome is equally probable. The formula that Shannon developed for the entropy is the following:

H(X) = − Σᵢ Pᵢ log₂(Pᵢ)
Let us calculate the entropy of an unfair coin that comes up tails 99% of the
time.
The following plot visualizes how the entropy changes with the probability. It is maximized when we have equal probabilities.
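A small sketch that performs both the calculation and the plot (purely illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    def entropy(p):
        # Entropy of a coin that comes up tails with probability p.
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    print(entropy(0.99))   # roughly 0.08 bits for the 99% unfair coin

    # Entropy as a function of p: maximal at p = 0.5.
    p = np.linspace(0.01, 0.99, 99)
    plt.plot(p, entropy(p))
    plt.xlabel('probability of tails')
    plt.ylabel('entropy (bits)')
    plt.show()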
Another quantity that can be used to calculate the uncertainty is the Gini index, which is very similar to the entropy. Its formula is as follows:

Gini(X) = 1 − Σᵢ Pᵢ²
Both of them are used extensively in practice. While using Gini impurity index
is computationally better because it does not require computing any
logarithmic functions, entropy is more commonly used so we will stick with it
for the rest of this section.
Our problem right now is to reduce the entropy due to a specific attribute, and to do this recursively. This is the idea behind the Information Gain, which measures the reduction in entropy obtained by splitting on an attribute A:

IG(A) = H(X) − Σᵥ P(v) × H(Xᵥ), where the sum runs over the possible values v of the attribute A
So, let us now write the algorithm in detail:
● Start with an empty tree and compute the entropy of the whole training set.
● For every remaining attribute, compute the information gain and choose the attribute with the highest gain as the current node.
● Split the data on that attribute and repeat the previous step recursively on every branch until the leaves are pure or no attributes are left.
We apply the algorithm by first calculating the entropy:

H(X) = − (9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.94

We got 9/14 because our friend played golf on nine of the fourteen days, while he did not play on the remaining five days; hence the second term.
Then, we calculate the attribute that gives us the highest information gain. Let us start with the wind attribute. We have eight days with weak wind and six days with strong wind. On only two of the eight weak-wind days did our friend decide not to play, while he played on the other six. On the other hand, when the wind was strong, he played on three days and did not play on the other three.
H(weak) = − (6/8) log₂(6/8) − (2/8) log₂(2/8) = 0.81

H(strong) = − (3/6) log₂(3/6) − (3/6) log₂(3/6) = 1

IG(wind) = H(X) − P(weak) × H(weak) − P(strong) × H(strong)

IG(wind) = 0.94 − (8/14)(0.81) − (6/14)(1) = 0.048
You can do the same with other attributes as practice, and you will get the
following results.
IG(outlook) = 0.247
IG(temperature) = 0.029
IG(humidity) = 0.151
Based on that, we choose outlook to be the root of our tree as it gives us the
maximum information gain. Our tree now looks as follows:
So, if the outlook is overcast, then we know for sure that our friend is going to play golf; if not, then we repeat the algorithm with the remaining attributes. Therefore, our table now looks like this:
If we go through the algorithm steps again, we will get the following entropy:

H(X) = − (3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.96

Then, the information gains obtained with the remaining attributes are as follows:

IG(humidity) = 0.96
IG(temperature) = 0.57
IG(wind) = 0.019
Therefore, our final tree will be like this.
5.4.3. Decision Trees Pros and Cons
Decision trees are extremely easy to interpret and understand from a human perspective. Also, they have the advantage of being easy to visualize. Moreover, they do not require any preprocessing of the dataset, unlike linear regression and many other algorithms. Finally, decision trees can be used for both regression and classification problems.
However, decision trees have a very high probability of overfitting very easily.
This is because they can grow exponentially to get the best results on the
training set, so the output is crafted for the training set only but fails
dramatically on any test set. One workaround is to use a technique called
pruning in which we stop the decision tree from growing further once the
validation error increases. After performing the pruning operation, we would
have different possible trees, so we compute the cost and the complexity of
each one to choose the best one.
As usual, we start by importing all the libraries we will use in the exercise.
Then we fix the path as usual to the one containing our datasets.
We import our dataset, which is similar to the one we used before in logistic regression; the goal is to classify whether or not to give someone a loan based on his or her credit score.
We see that the dataset also suffers from the imbalance problem that we will
ignore for now and focus on in chapter 7.
Then, we split our dataset.
After that, we convert all the categorical features into dummy variables.
We create our model and specify four hyperparameters. The first one is the criterion, for which we use the entropy, but feel free to choose Gini and observe the difference. The second one, the random seed, is needed to replicate the results afterward, because the same random choices will be made every time the code is executed with the same seed. The third one is the maximum depth of the tree, a very important hyperparameter that prevents the tree from growing too deep and thus becoming more prone to overfitting. The final one is the minimum number of samples in each leaf, which pushes the tree to be more balanced. Another hyperparameter worth mentioning, which we did not specify and left at its default value, is the maximum number of leaf nodes; if defined, it limits the number of leaf nodes that the tree can have, which can reduce the computational complexity at the expense of some accuracy.
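A sketch of that model creation (the exact hyperparameter values are assumptions mirroring the description above, and the training split is assumed to come from the credit data after converting it to dummy variables):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        criterion='entropy',     # try 'gini' as well and compare
        random_state=42,         # makes the results reproducible
        max_depth=4,             # limits how deep the tree can grow
        min_samples_leaf=5,      # pushes the tree to be more balanced
    )
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))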
By testing our model, we see it achieved an accuracy score of 75%.
Moreover, we can plot the features against their importance as a bar plot as we
did in logistic regression.
As we can see, there are four features that did not contribute at all to the final
output. This can be very insightful when performing feature selection.
As we see, the data can be modeled almost perfectly using a simple line.
Now, let us create a decision tree regressor, while also creating a linear regression model, so we can compare their performance on this nearly linear dataset.
We then test our two models and transform the predictions back using the exponential function, which is the inverse of the logarithm.
Finally, we plot our training data, our test data, our linear prediction and our
tree prediction using the same graph.
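A hedged sketch of this comparison, assuming a DataFrame named prices with a date column and a price column whose logarithm is nearly linear in the date (the DataFrame, column names, and the year-2000 split are assumptions):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import LinearRegression

    # Train on everything before the year 2000, test on the years after.
    train = prices[prices['date'] < 2000]
    test = prices[prices['date'] >= 2000]

    X_train = train[['date']].values
    y_train = np.log(train['price'])      # model the logarithm of the target

    tree = DecisionTreeRegressor().fit(X_train, y_train)
    linreg = LinearRegression().fit(X_train, y_train)

    X_all = prices[['date']].values
    pred_tree = np.exp(tree.predict(X_all))   # undo the log transform
    pred_lr = np.exp(linreg.predict(X_all))

    plt.semilogy(train['date'], train['price'], label='training data')
    plt.semilogy(test['date'], test['price'], label='test data')
    plt.semilogy(prices['date'], pred_tree, label='tree prediction')
    plt.semilogy(prices['date'], pred_lr, label='linear prediction')
    plt.legend()
    plt.show()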
The linear model approximates the data with a line, as we knew it would. This
line provides quite a good forecast for the test data (the years after 2000) while
glossing over some of the finer variations in both the training and the test data.
The tree model, on the other hand, makes perfect predictions on the training
data; we did not restrict the complexity of the tree, so it learned the whole
dataset by heart. However, once we leave the data range for which the model
has data, the model keeps predicting the last known point. The tree has no
ability to generate “new” responses, outside of what was seen in the training
data. This shortcoming applies to all models based on trees.
5.4.5. Random Forests Intuition
Given that we now have a solid understanding of decision trees, understanding
random forests is quite easy. Random forests are one of the ensemble methods
which operate by constructing many different decision trees while training.
They were proposed to tackle the problem of overfitting that decision trees
suffer from. Therefore, we can think of the random forest as a majority voting
algorithm where it creates different decision trees with a different set of
features in each tree, and then it takes the average of their output.
Creating different decision trees with a different set of features in each tree is
referred to as Bagging, which is a category of ensemble methods. Another
source of randomness in the random forest is feature selection at each node,
as now we have different features in each tree, so the splitting based on the
information gain calculation will differ in each tree.
There are many different hyperparameters for random forests, the most important of which are the number of trees in the forest, the maximum depth of each tree, the number of features considered at each split, and the number of CPU cores used for training.
However, random forests are slower to train than decision trees. Also, random forests have many hyperparameters, so a grid search or random search is a must. Finally, because random forests are random, we cannot be absolutely confident about their results, as they may change from run to run.
We will do the usual few first steps of importing the libraries, fixing the path,
and importing the dataset.
Then, we will split the credit dataset, which we also used in decision trees.
Following that, we create our random forest classifier with 500 decision trees and a maximum depth of 4. The number of jobs specifies how many CPU cores we want to train on; by choosing -1, we use all the CPU cores available.
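A sketch of that classifier, reusing the hypothetical credit split from the decision tree exercise:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=500,   # number of decision trees in the forest
        max_depth=4,        # maximum depth of each tree
        n_jobs=-1,          # use all available CPU cores
        random_state=42,
    )
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))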
As we see, it got us better results on the same dataset than decision trees.
Now, we can visualize the importance of each feature in making our decisions.
Let us now see how we can utilize random forests for a regression task.
Then, we create our random forest regressor, train it, and test it.
Finally, we can also visualize feature importance.
5.5. K-Nearest Neighbor
So, you are now wondering, how does KNN work? Basically, in the case of
classification, we classify the current example based on its proximity, or
distance, to other examples. By looking at the name of the model, we observe
that it has “K” in it. This “K” can be any number as we will see, and depending
on this number, we make our decision. So, suppose that K=3, and we want to
classify the current example where there are only two possible classes, then we
compute the distance from the current example to all examples and get the
nearest three examples. After that, we look at the class of these three examples,
and we classify our current example into the same class as the dominant class.
So, if two examples belong to the first class, and one example belongs to the
second class, then our example will belong to the first class. We call this simple
algorithm Majority Voting.
For the “K” value, if we set it very low, the model will be more sensitive to
noise. Also, it may lead to overfitting and non-smooth decision boundaries, as
we will see. On the other hand, if we set it very large, then we might include
examples from other classes, which will also yield incorrect results. Thus, we can use grid search or cross-validation with different values of "K", for example from 3 to 13, and find the one that gives us the best test accuracy.
Another good starting value for “K” is the square root of the number of
examples in the dataset. This was found by experimentation, so it is not
guaranteed to work every time.
Regarding the distance function, there is a general formula for numerical data to find the distance from a query point x_q to an example point x_j:

L_p(x_q, x_j) = ( Σᵢ |x_(j,i) − x_(q,i)|^p )^(1/p), where i runs over the features
If we set p=2, then we have our familiar Euclidean distance. We use it when
the features of the data measure similar properties.
If we set p=1, then we have the Manhattan distance. We use this mainly when
the features are not similar.
Further Readings
If you want to know more about the different distance measures, you can
take a look here
https://towardsdatascience.com/importance-of-distance-
metrics-in-machine-learning-modelling-e51395ffe60d
5.5.3. Dimensionality Problem
So far in all the algorithms that we discussed before KNN, we did not really
care about the number of features, because they were all parametric models
which have a specific number of weights which are not extremely big in most
cases. But right now, we are dealing with an algorithm which depends directly
on the number of features, or the dimensions in the dataset. Thus, we need to
restrict our dataset from containing too many features or the algorithm itself
will perform very badly. We can do so manually, using insights about the most influential features as we did before in decision trees and random forests, and working only on these features. Also, we can do this automatically using one of the unsupervised learning algorithms that are used for dimensionality reduction, such as PCA, which we will discuss in the next chapter.
Like other distance-based algorithms, KNN also needs the features to be on the same scale, using either min-max scaling or the standard score:

X = (X − X_min) / (X_max − X_min)

X = (X − µ) / σ
The first step is, as usual, importing the libraries that we will use.
We will use a helper library called mglearn, which can help us with visualizing
KNN in more depth.
Then, we will use mglearn library to plot some arbitrary data and perform
KNN classification with K=1.
Let’s do the same but using K=3.
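A sketch of those two plots using mglearn's built-in helper (this assumes the mglearn package is installed):

    import mglearn
    import matplotlib.pyplot as plt

    # KNN classification on mglearn's toy data with K = 1 ...
    mglearn.plots.plot_knn_classification(n_neighbors=1)
    plt.show()

    # ... and with K = 3, where the majority vote among three neighbors decides.
    mglearn.plots.plot_knn_classification(n_neighbors=3)
    plt.show()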
Our decision varies depending on the K value, as in the first case we
classified two of the test points as class zero, while in the second case we
classified only one of them as class zero.
Now, let us work with a real dataset called diabetes. In this dataset, we want
to predict if the person has diabetes or not based on different features.
There is a huge problem of missing data in our dataset. We talked about this
problem before in chapter 3, so let us now see how we can solve it
practically.
As we see, we looped through the different features that have this issue and replaced each missing instance with NaN, which is short for Not a Number. Then, we got the mean of the current feature while ignoring the missing instances. Finally, we replaced the missing instances with the mean of the feature.
We then split our dataset into training and testing sets.
We then use K as the square root of the number of examples in our dataset,
as we discussed earlier.
Then we use the Euclidean distance as the distance function for our model,
and we train it.
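A hedged sketch of this whole flow on the diabetes data (the file name, the affected column names, the target column name, and the zero-as-missing convention are assumptions):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    diabetes = pd.read_csv('diabetes.csv')        # assumed file name

    # Treat zeros in these columns as missing values and impute with the mean.
    for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
        diabetes[col] = diabetes[col].replace(0, np.nan)
        diabetes[col] = diabetes[col].fillna(diabetes[col].mean())

    X = diabetes.drop('Outcome', axis=1)          # assumed target column name
    y = diabetes['Outcome']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # K as the square root of the number of examples, Euclidean distance as the metric.
    k = int(np.sqrt(len(X_train)))
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))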
We then test our model and report the accuracy.
Then, we will use an artificially-made dataset from mglearn to train a KNN
regressor.
We then use the following loop to visualize the effect of changing K on both
the train and test scores.
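A sketch of that loop on mglearn's synthetic wave dataset (the set of K values is an assumption):

    import mglearn
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor

    X, y = mglearn.datasets.make_wave(n_samples=40)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for k in [1, 3, 9]:
        reg = KNeighborsRegressor(n_neighbors=k)
        reg.fit(X_train, y_train)
        print(f'K={k}: train score {reg.score(X_train, y_train):.2f}, '
              f'test score {reg.score(X_test, y_test):.2f}')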
As we see, using K=1 resulted in an overfitted model with a perfect score on the training data but a very bad score on the test data. Also, using K=9 did not result in a good test score, because the decision is based on distant points. On the other hand, using K=3 resulted in a good test score.
So, before we start seeing the algorithm in action and how it can be used in
Python, let us revise the theory quickly with an example.
P(A|B) = P(B|A) × P(A) / P(B)
If you remember, we said that the conditional probability P(B|A) is called the likelihood, which is the probability of observing the new evidence given our initial hypothesis. We also said that the marginal probability P(A), which is called the prior, is the probability of our hypothesis without any additional information. Finally, we said that P(B) is the marginal likelihood.
So, using Bayes Rule, we can update our beliefs when new information or
evidence is found. You can revisit the cancer example that we tackled in the
previous chapter.
5.6.2. Naïve Bayes Intuition
You might be wondering how we can use Bayes Rule in machine learning,
and why the algorithm is called “Naïve” Bayes. The answer to these
questions can be obtained by looking at a classification problem and
following the steps of the algorithm accordingly.
But before that, we should know that it is called “Naïve” mainly because it
assumes that the features are independent, which means that the presence of
one feature does not affect the others.
Knowing that, let us revisit the golf example that we discussed in the
decision tree section.
We will assume that all the features are independent, which means that if the
wind is weak, for example, then this does not imply anything about the
outlook of this day. Another assumption is that all the features contribute
equally to the prediction.
These assumptions are, of course, invalid in most cases. This is because the
features, by nature, have some dependency on each other, while some of the
features are more important in predicting the output than the others.
However, these two assumptions are crucial to derive the naïve Bayes
classifier as we will see.
P(Y|X) = P(X|Y) × P(Y) / P(X)

where X = (x₁, x₂, x₃, …, xₙ) represents the different features. If the features are independent, we can then rewrite Bayes rule as follows:

P(Y | x₁, …, xₙ) = P(Y) × Πᵢ P(xᵢ | Y) / P(x₁, …, xₙ)
We can then get the frequency table as follows:
Now, let us assume we want to know the probability that our friend will play
if the weather is sunny. We can convert this to:
P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny)

From our likelihood table, we get P(Sunny) = 0.36, and from the frequency table we get P(Yes) = 9/14. Also, we can get P(Sunny | Yes) = 3/9, because we have 9 Yeses and only 3 of them were Sunny. Therefore, we get P(Yes | Sunny) = (3/9 × 9/14) / 0.36 ≈ 0.6.
We observe that the denominator does not depend on the class, so we can remove it and use a proportionality instead:

P(y | x₁, …, xₙ) ∝ P(y) × Πᵢ P(xᵢ | y)
We can take this even further by saying that we want to find the class y which gives us the maximum probability. This was fairly easy in the case of a binary classification problem like the golf problem, because if we got P(Yes | Sunny) = 0.6, then P(No | Sunny) will be 0.4 and we really do not need to calculate it. However, if the classification problem has more than two classes, then we need a formula for that:

ŷ = argmaxᵧ P(y) × Πᵢ P(xᵢ | y)
By getting this formula we can classify the output, which is our goal.
As you can see, Naïve Bayes is also a data-driven algorithm like KNN and
does not require the calculation of any weights or defining any loss functions.
Finally, Naïve Bayes has only one main hyperparameter, which is called alpha. Increasing the value of this hyperparameter smooths the naïve Bayes model, which makes it even more naïve. Decreasing it makes the model apply less smoothing, which can result in higher accuracy. However, changing the value of this hyperparameter usually has little influence on the overall performance of the algorithm.
There are three main variations of naïve Bayes that are used in practice:
Multinomial Naïve Bayes, Complement Naïve Bayes and Bernoulli Naïve
Bayes.
Further Readings
If you want to know more about the different variations of Naïve Bayes,
you can take a look here
https://scikit-learn.org/stable/modules/naive_bayes.html
On the other hand, the assumptions that this algorithm is making are not
realistic in many cases, which harms the performance dramatically. Also, it
requires data preprocessing in contrast with decision trees and random
forests.
First, we import the needed libraries. We will use a real-world dataset from sklearn called 20newsgroups, which contains 18,846 examples in text form belonging to twenty different classes. You can check more about this interesting dataset here. We will also use a function called TfidfVectorizer, which is used to build something like the frequency table that we used in the golf example, but for text. We will also evaluate our model using something called the confusion matrix, which we will see shortly.
Then, we will load the dataset and split it into a training set and a test set.
We then create our multinomial Naïve Bayes classifier and train it. Then, we
test it and store the predicted outputs.
Finally, we can create a simple function that gives us the predicted class, and
we use it to predict a given text.
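A sketch of this pipeline (the example sentence passed to the helper is an assumption):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import confusion_matrix

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    # TF-IDF turns each document into a vector of word weights;
    # multinomial Naive Bayes then classifies those vectors.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train.data, train.target)
    pred = model.predict(test.data)
    print(confusion_matrix(test.target, pred))

    def predict_category(text):
        # Helper that returns the predicted class name for a piece of text.
        return train.target_names[model.predict([text])[0]]

    print(predict_category('the space shuttle launch was delayed'))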
5.7. Model Evaluation and Selection
If you have followed all the sections in all the chapters, then this section will
be mostly revision for you, with some additional insights and tips.
5.7.2. Cross-Validation
However, there was a huge drawback to using a static validation set: we can only experiment with one combination of hyperparameters. Also, if we split our dataset even further, then our training set might become too small, and then our model will not be representative. Thus, we introduced the k-fold cross-validation technique, which is based on using the same dataset but with the effect of having different validation sets.
We split our dataset into k separate parts (folds), and the training process is repeated k times. Each time, one fold is used as the validation set and the remaining k − 1 folds are used as the training set, so a different part of the data is used for validation in every iteration.
Finally, we calculate the overall accuracy of our model by taking the average of the accuracies over the k different iterations.
For regression problems, we have the main metrics that we covered in the linear regression section of this chapter. For classification problems, the first and the most intuitive metric is the accuracy, where we report the proportion of predicted outputs that match the true outputs.
The second metric, with which we can deduce many different metrics, is called
the confusion matrix, which we saw in action while working with Naïve Bayes.
We can see the confusion matrix in the following figure.
From the confusion matrix, we get the accuracy, which is:

Accuracy = (TP + TN) / (P + N)

Also, we can get another two metrics called the precision and the recall:

Precision = Positive Predictive Value = TP / (TP + FP)

Recall = Sensitivity = True Positive Rate = TP / (TP + FN)

And we can also get a combination of the precision and the recall, called the F-score, as follows:

F = 2 / (1/Recall + 1/Precision)
So, why don’t we just get the accuracy? Why do we need precision and recall?
Suppose that you have a spam classification problem where you classify
emails to be either spam or ham. So, we have two different kinds of errors:
false positives and false negatives. The false positives occur when we
misclassify a ham email as a spam email, and the false negatives occur when
we misclassify a spam email as a ham email. Which of these two errors is more critical? I think you'll agree with me that putting an important email into the spam folder is worse than being annoyed by a spam email in your main inbox. Of course, both are considered types of errors,
but in our problem, we care more about having the minimum number of
false positives. Thus, we use precision as our metric when evaluating the
model.
If we want a harmonic mean between precision and recall, then we use the F-score.
We can also deduce another metric that is used when evaluating machine
learning models. This is the ROC curve, where ROC is short for Receiver
Operating Characteristic. The ROC curve is a plot of the True Positive Rate against the False Positive Rate. It is used mainly to select the optimum model, which should have an area under the curve (AUC) equal to or near 1. This is because the True Positive Rate should be equal to or near 1, while the False Positive Rate should be equal to or near 0. Moreover, a random classifier has an AUC of 0.5.
5.7.4. Hyperparameters Tuning
To perform hyperparameters tuning, there are two main techniques that are
used in practice—the grid search and the random search.
For the grid search, we choose candidate values for each one of the
hyperparameters in our classifier or regressor. Then we train on every possible
combination of these hyperparameters, and then we use the combination that
gave us the best performance on the test set.
For the random search, we specify a range for each one of the different
hyperparameters along with the number of iterations. Then, our model is
trained for that specific number of iterations, while using a different random
combination of the hyperparameters for each iteration. Then, we also use the
combination that got us the best performance on the test set.
So, we say that grid search is a discrete, exhaustive search over all the candidate points, but it is computationally expensive. On the other hand, random search is a continuous, non-exhaustive search that is computationally efficient.
We then fix the path that contains the dataset and load the dataset.
We then split our dataset as usual.
Then, we train a support vector classifier with a radial basis function kernel.
We can now use the classifier on the test set to make predictions.
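A sketch of these evaluation steps (the split is assumed to already exist as X_train, X_test, y_train, y_test):

    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_score

    # Train a support vector classifier with an RBF kernel.
    clf = SVC(kernel='rbf')
    clf.fit(X_train, y_train)

    # Confusion matrix on the test set: rows are true classes, columns are predictions.
    y_pred = clf.predict(X_test)
    print(confusion_matrix(y_test, y_pred))

    # Average accuracy with 10-fold cross-validation.
    scores = cross_val_score(clf, X_train, y_train, cv=10)
    print(scores.mean())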
As we see, we got 60 TP, 45 TN, 2 FN and 5 FP. Also, we got an average of
89% accuracy using 10-fold cross-validation.
As we can see, we got the best gamma equal to 0.5, so we can experiment with different values of gamma near this value along with different values of C.
Finally, we get 90% accuracy. We can repeat this process as many times as we want, until the grid search no longer gives us different values for the hyperparameters.
6. Unsupervised Learning Techniques
In the previous chapter, we discussed the most used supervised learning
algorithms. In this chapter, we will discuss some of the most used
unsupervised ones. By the end of this chapter, if you followed it thoroughly,
you can confidently say that you understand how both the supervised learning
algorithms and the unsupervised learning algorithms work.
As in the previous chapter, our discussion will be divided into two main parts
of how the algorithm works intuitively and mathematically, and how to
implement it in Python.
So, how did you cluster the clouds together to form some shapes?
You did that by noticing the similarity within each group of clusters while also
noticing the dissimilarity between each group and the other ones. This is
equivalent to finding high intra-class similarity and low inter-class similarity.
To find the similarity either within each group or between the different groups,
you estimated the distance between the different clusters.
Along with these steps, the algorithm calculates two distances: the intra-class distance and the inter-class distance. The first one is also called the Within Group Sum of Squares, or SSW, while the second one is called the Between Groups Sum of Squares, or SSB. The Total Sum of Squares, or SST, is the result of adding the two together.
SSW = Σₖ Σ_(i∈Cₖ) Σⱼ (xᵢⱼ − cₖⱼ)²

SSB = Σₖ nₖ Σⱼ (cₖⱼ − x̄ⱼ)²

Here, i represents the data points, j represents the features, k represents the clusters, cₖ is the centroid of cluster k, nₖ is the number of points in cluster k, and x̄ is the overall mean.
Then, we assign the different data points to the three clusters according to
the SST and the SSB distances.
Finally, we repeat the second and the third steps until convergence.
The first issue is tricky, because the initialization can alter the overall output of the algorithm dramatically. Tackling this issue is not easy and is still an active area of research. One practical method, implemented in many frameworks such as sklearn, is to set the initial centroids as far as possible from each other while keeping them within the distribution of the dataset.
Another possible workaround is to use grid search as we did in KNN. This
can be slow sometimes and isn’t considered the best solution.
A third solution is to use a technique called the elbow method. In this method,
we plot the Sum of Squared Errors against the number of clusters and take the
elbow of this plot as follows.
Finally, we can use the Silhouette Method, in which we plot the Silhouette
coefficient against the number of clusters. This coefficient is calculated using
the mean intra-class distance and the mean nearest-cluster distance for each
example. The formula to calculate it is as follows:
S = (b − a) / max(a, b)
To solve this problem, we cannot use any techniques like we did with the
initialization and the choice of the number of clusters. This is because it is a
problem in the core of the algorithm itself. So, the only solution is to use
another more complex algorithm that does not have the K-means
assumptions. This is exactly what we will do in the next section.
On the other hand, we noticed three issues with K-means: the sensitivity to
the choice of the number of clusters, the initialization problem, and the poor
results on complex data.
Further Readings
If you want to play with and visualize K-means in dozens of scenarios,
you can check here
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
First, we will work with a real dataset and see how we can use the different
techniques that we discussed to choose the value of K.
Then, we will fix the path and import the dataset which we will use in this part.
This will be the daily weather dataset that contains different features regarding
the weather for more than 1000 consecutive days.
After that, we can calculate some summary statistics which can provide us with
insights if we want to do any further analysis.
Also, as usual, we normalize our features using the standard scaler.
Now, to select the best number of clusters, we use the elbow method which we've discussed. We can do so by calculating the within-cluster distances for a range of cluster numbers and taking the elbow of the curve.
The second method is to use the silhouette score from the sklearn library, and
we take the value with the highest score.
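A sketch of both methods, assuming the scaled feature matrix is stored in X_scaled (the range of cluster counts is an assumption):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Elbow method: plot the within-cluster sum of squares (inertia) against k.
    inertias = []
    for k in range(2, 11):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        km.fit(X_scaled)
        inertias.append(km.inertia_)
    plt.plot(range(2, 11), inertias, marker='o')
    plt.xlabel('number of clusters')
    plt.ylabel('inertia')
    plt.show()

    # Silhouette method: pick the k with the highest score.
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
        print(k, silhouette_score(X_scaled, labels))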
As we see, both methods provided us with 4 as the best number of clusters.
Now, let us work with a synthetic dataset which can help us to see the
algorithm step-by-step and the problems of K-means.
Then, we will define some helper functions for plotting and generating the
data itself.
Now, let us generate 300 examples which are clustered into three clusters.
Let us look at the algorithm step-by-step. The first step is to choose the
number of clusters and initialize them randomly.
The second step is to assign each example into one of the centroids.
The third step is to update the centroids’ positions to be the mean of the
assigned values.
The final step is to repeat the process until convergence. Let us assume that
this will happen after 100 iterations.
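A condensed sketch of the whole procedure on such synthetic data (the random seeds are assumptions):

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    # 300 examples grouped into three clusters.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Random initialization, then iterate assignment/update steps up to 100 times.
    km = KMeans(n_clusters=3, init='random', n_init=1, max_iter=100, random_state=42)
    labels = km.fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=labels)
    plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='red', marker='x')
    plt.show()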
What we have seen is an ideal case, so let us see what will happen if any of the
problems that we discussed occur.
The first problem is the initialization problem. So, we will use the same code
for generating the data with the same number of clusters, but we will change
the seed in order to initialize the centroids differently.
As we can see, two different centroids were initialized near each other which
will lead to a failure case.
The second problem is the choice of the number of clusters. In order to
simulate this, let’s create the same clusters but assign only two centroids.
As we see, this also resulted in a failure in the algorithm.
The third problem is the distribution of the data itself. Let us assume that the clusters are not isotropic, meaning that we cannot represent them as circles, as in the following simulation.
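A minimal sketch of such non-isotropic (stretched) clusters, in the spirit of scikit-learn's K-means assumption demos; the stretching matrix is an illustrative assumption:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])    # stretch the blobs into ellipses

labels = KMeans(n_clusters=3, random_state=42).fit_predict(X_aniso)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=labels)
plt.show()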
Again, the algorithm cannot cluster the data successfully.
Another problem with the data occurs when we do not have equal variances.
We can simulate this by the following code.
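A minimal sketch of such a simulation, with illustrative spreads for the three blobs:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Give each blob a different standard deviation (unequal variances)
X, _ = make_blobs(n_samples=300, centers=3,
                  cluster_std=[1.0, 2.5, 0.5], random_state=42)
labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()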
The algorithm did its best to cluster the data, but there is no metric that we
can use to evaluate if this is the optimum clustering or not.
Finally, if the data do not form convex clusters, then the algorithm will not work either, as the following simulation shows.
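A minimal sketch of the non-convex case, using two interleaving half-moons:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)    # K-means cuts across the moons instead of following them
plt.show()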
6.2. Hierarchical Clustering
To address these problems and solve them, we need another algorithm called
hierarchical clustering.
The idea behind this algorithm is very intuitive. It starts by treating every example in our dataset as a cluster by itself, and then repeatedly merges the closest clusters, based on their distances, until everything belongs to one cluster. This is called the agglomerative method.
While this is the most popular approach, some people use it in reverse: they treat the whole dataset as one cluster and split it into smaller clusters, also based on the distances. This, on the other hand, is called the divisive method.
Measuring the distance between two examples is straightforward. Computing the distance between two clusters, on the other hand, is trickier. While there are dozens of distance metrics for this task, only five of them are commonly used in real-world situations.
The first metric is called the single link, which is the smallest distance between any example in one cluster and any example in the other cluster.
The second metric is the complete link, which is the largest distance between any example in one cluster and any example in the other cluster.
The third metric is the average link, which is the average distance between the examples in one cluster and the examples in the other cluster.
The fourth metric is the centroid metric, which is the distance between the centroids of the two clusters.
The final metric is the medoid metric, which is the distance between the medoids of the two clusters, where a medoid is an example chosen as the most central point of its cluster.
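Writing $d(x, y)$ for the distance between two examples, $C_1$ and $C_2$ for the two clusters, $\mu_1, \mu_2$ for their centroids, and $m_1, m_2$ for their medoids, these five distances take their standard form:

$$d_{\text{single}}(C_1, C_2) = \min_{x \in C_1,\, y \in C_2} d(x, y)$$
$$d_{\text{complete}}(C_1, C_2) = \max_{x \in C_1,\, y \in C_2} d(x, y)$$
$$d_{\text{average}}(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$$
$$d_{\text{centroid}}(C_1, C_2) = d(\mu_1, \mu_2)$$
$$d_{\text{medoid}}(C_1, C_2) = d(m_1, m_2)$$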
First, we import the libraries we need. We will use the agglomerative approach, which treats each example as a different cluster and merges similar examples or clusters together.
Then, we will fix the path and load our dataset for this exercise. We will work with a dataset called stock movements, which cannot be clustered well with the K-means algorithm because its clusters are not equally distributed.
After that, we will normalize our features so we can use the clustering algorithm.
Suppose that we want to have three clusters; then our first cluster will include all the companies from Apple to Exxon, our second cluster will contain all the companies from Home Depot to Procter & Gamble, while the third one will contain all the companies from Walgreen to McDonald's. We did this clustering based on the cluster distance, as the first cluster is the one with the smallest cluster distance, and so on.
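A minimal sketch of this exercise with SciPy's hierarchical clustering tools, assuming the stock movements are in a CSV file with the company names in the first column (the filename and layout are assumptions):

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.preprocessing import normalize

movements = pd.read_csv('stock_movements.csv', index_col=0)   # hypothetical filename
normalized = normalize(movements)                              # normalize each company's movements

# Build the hierarchy using complete linkage and plot the dendrogram
mergings = linkage(normalized, method='complete')
dendrogram(mergings, labels=movements.index.tolist(), leaf_rotation=90, leaf_font_size=6)
plt.show()

# Cut the tree into three clusters
labels = fcluster(mergings, t=3, criterion='maxclust')
print(pd.DataFrame({'company': movements.index, 'cluster': labels}))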
6.3. Principal Component Analysis
If you remember, throughout our journey so far, we have come across many
datasets which contained dependent features. Sometimes, it was easy to
perform feature selection by hand after calculating some summary statistics.
However, on many occasions, this was a really hard task. We said back then
that we would see an unsupervised machine learning algorithm that was
developed just for this specific use case.
The time has come to discuss this algorithm, which is called principal component analysis (PCA). The goal of this algorithm is to find the directions along which the data varies the most, and thus, we can perform feature extraction instead of manual feature selection.
So, suppose that we have ten features in our dataset, and they are highly correlated. PCA transforms these ten features into, for example, two features, depending on your choice, where each new feature is a linear combination of the original ten features. Thus, our new feature space contains features which are not in the dataset itself, but rather combinations of the dataset's features.
Now, our feature space is only 2D instead of 10D. The first dimension, also
called the first principal component or the first basis vector, points in the
direction of the data with the maximum variance. The second dimension,
which is also called the second principal component or the second basis vector,
points in the direction of the data with the second maximum variance, and so
on.
Calling the new basis vectors Z, we can calculate the first one, Z1, as a weighted combination of the original n features:
$$Z_1 = w_{11} x_1 + w_{12} x_2 + \dots + w_{1n} x_n = \sum_{i=1}^{n} w_{1i} x_i$$
The first step is to subtract the mean of the data, and preferably to standardize it. Then, we calculate the covariance matrix, which collects the variances and covariances of the different variables into one matrix. After that, we calculate the eigenvalues and the eigenvectors of the covariance matrix, which is pure linear algebra. Following that, we construct the transformation matrix W whose rows are the eigenvectors corresponding to the k largest eigenvalues, where these eigenvectors represent our new basis vectors.
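For the curious, here is a minimal from-scratch sketch of exactly these steps, run on an illustrative random data matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # illustrative data: 100 examples, 10 features

# Step 1: center (and here also standardize) the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: keep the eigenvectors of the k largest eigenvalues as the new basis
k = 2
order = np.argsort(eigenvalues)[::-1][:k]
W = eigenvectors[:, order]               # transformation matrix (10 x k)

# Project the data onto the k principal components
Z = X_std @ W
print(Z.shape)                           # (100, 2)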
You do not really need to worry about all of this. Instead, you should only know that PCA is used for dimensionality reduction and can be combined with any unsupervised or supervised machine learning algorithm to speed up the training without sacrificing much accuracy. The other thing you have to worry about is how to implement PCA using Python, which we will tackle in the final section of this chapter.
We then import and plot our dataset, which contains only the width and the
length of the grains.
Following that, we calculate the correlation between these two features. By
doing so, we find that they are highly correlated.
Thus, we create our PCA model and fit it into our dataset.
We can now plot our transformed features and observe that they are not
correlated.
We can make sure of that by calculating the correlation again, but now using
the transformed features. We see that they are not correlated at all.
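A minimal sketch of these steps with scikit-learn, assuming the grain measurements are in a CSV file with 'width' and 'length' columns (the filename and column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

grains = pd.read_csv('grains.csv')                 # hypothetical filename
plt.scatter(grains['width'], grains['length'])
plt.show()
print(grains.corr())                               # the two features are highly correlated

pca = PCA()
transformed = pca.fit_transform(grains[['width', 'length']])

plt.scatter(transformed[:, 0], transformed[:, 1])
plt.show()
print(pd.DataFrame(transformed).corr())            # correlation is now essentially zero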
Then, we can do the same again but with the addition of plotting the basis
vectors that we explained earlier on in the original dataset plot.
Now, let’s do one more exercise using the fish dataset, which contains five
features and one output column corresponding to the fish species.
We make a pipeline which performs the normalization and the fitting in one
step.
Finally, we can plot the variances explained by each feature.
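A minimal sketch of this pipeline, assuming the fish data is in a CSV file with a 'species' column plus five numeric feature columns (the filename and column name are assumptions):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

fish = pd.read_csv('fish.csv')                     # hypothetical filename
fish_features = fish.drop('species', axis=1)       # keep the five numeric features

scaler = StandardScaler()
pca = PCA()
pipeline = make_pipeline(scaler, pca)              # normalization and fitting in one step
pipeline.fit(fish_features)

# Bar chart of the variance explained by each PCA feature
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()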
As we can see, we can keep only the first four PCA features and lose almost no information at all, or keep only the first two PCA features and lose just a little bit of the variance.
7. Neural Networks and Deep Learning
By now, you should have a solid understanding of the supervised and unsupervised learning algorithms we have covered. There is only one branch of machine learning left, reinforcement learning, which we will explore in the following chapter.
In this chapter, we will focus on neural networks and go into deep learning from there. A neural network is considered a supervised machine learning algorithm, like linear regression and SVM; nevertheless, we are dedicating a whole chapter to it.
So, you might be asking why we did not treat neural networks like all the other supervised learning algorithms and cover them in chapter 5. Simply because neural networks have become remarkably powerful in the last few years, there are dozens and even hundreds of use cases for this specific algorithm, and you will have the chance to write code for a few of them by the end of this chapter.
After that, we will have a whole section on Artificial Neural Networks (ANN), where we will dive into the details of this brilliant algorithm and see how we can implement it using Python and different frameworks such as Keras.
Finally, we will discuss one of the most successful variations of ANN, the Convolutional Neural Network (CNN), which is currently deployed and used all over the world in different fields, especially face detection and recognition. As always, we will dive into the details by working on hands-on projects.
7.1. Neural Networks Introduction
Neural networks have been around for decades, so why did they become so successful only recently? The first reason is the availability of data nowadays, which neural network algorithms depend on heavily. By having this massive amount of data, the capacity of the model can be increased safely without worrying as much about overfitting as before. Of course, overfitting is still a burden, but not as much as before.
The second reason for this massive success is the availability of better hardware, especially the graphics processing unit (GPU), which is used for performing all the training. This was really important for neural networks to succeed because, until a few years ago, the training process of neural networks could take days and even months to finish, which made people shift to other, faster algorithms. But currently, using blazing-fast GPUs, these days and months can be shortened to minutes and even seconds.
Finally, the third reason is the introduction of better and improved algorithms for training and preprocessing the data. We will explore most of these algorithms throughout the chapter, and you will see by then how valuable these modified algorithms are and to what extent they helped make neural networks and deep learning so massively successful.
If you have reached this point in the chapter and are interested in becoming a machine learning expert, then you have surely heard about deep learning before.
The following figure visualizes both the simple neural network and the deep
neural network.
Don’t be confused by the connections, the arrows, and the words under the
figures. Everything will be clear in the following sections.
However, the takeaway from these figures is that deep neural networks are
very complex. In fact, they were developed as a way to mimic the human brain
and thus reach the real intelligence.
7.2. Artificial Neural Networks
There are two main steps performed in the vanilla neural network algorithm, and they are the same for all the variations of ANN such as CNN or Recurrent Neural Networks (RNN). So, by understanding these two steps, you can confidently say that you understand how all neural networks work, no matter how complex they seem at first glance.
The first step is called forward propagation, and the second one is called
backward propagation.
But before we explain what these words mean, let us see why the algorithm was called a "neural" network. As we mentioned earlier, the people who came up with this algorithm were inspired by how the human brain works and thought about developing an algorithm that mimics the brain's behavior. To do so, they studied the brain's structure and designed the algorithm based on it.
In the following figure, we see the structure of the neuron, which is the
building block for brain functionality.
By Egm4313.s12 (Prof. Loc Vu-Quoc) - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=72816083
The inputs of the neuron arrive through the dendrites; the signal is then transformed and carried along the axon, which is covered by the myelin sheath, and the outputs are produced at the axon terminals.
The same idea applies to backpropagation, which means calculating the errors based on the predicted outputs compared to the true outputs. Again, you can think about having multiple linear regression gradient descent algorithms running concurrently and in sequence. Concurrently, because you might have more than one neuron stacked on top of another, so you can do the calculations of different neurons at the same time without any one affecting the others. We call these stacked neurons a layer of neurons, or a layer for short. In sequence, because you might have more than one layer, so the output of one layer's calculations feeds into the next layer's calculations.
Finally, we call anything between the input layer and the output layer, the
hidden layers. By increasing the number of hidden layers, we would have a
deep neural network as we will see shortly.
We talked back then about only one activation function, the sigmoid
function, as it was used in logistic regression. The mapping of the sigmoid
function is shown in the following figure.
Now, let us introduce other activation functions which are frequently used in
neural networks. We have the ReLU activation function which is short for
Rectified Linear Unit. Its mapping is shown in the following figure.
I know that you are now wondering why we might need another activation function if the sigmoid works just fine. The answer is that it does not work well in many scenarios.
Given that we use gradient descent for backpropagation, we need to calculate the derivative of the output after the activation function. Mathematically, the sigmoid performs terribly here because its derivative is nearly zero whenever the input is far from zero, whether strongly positive or strongly negative. We did not suffer from this or notice it while working with logistic regression because we scaled our input to the range of -1 to 1 before feeding it to the sigmoid. But now we cannot do this, because even if we manage to do so for the first layer, the outputs from it would need scaling again, and so on for the following layers.
However, the ReLU has two problems of its own. The first one is that its output cannot be interpreted as a probability at the final output. The solution is to use ReLU for all the hidden layers and then use a sigmoid function for the final output layer.
The second problem is that the inputs must all be positive, or they will be mapped to 0. There are two solutions for this problem. The first one is to take the absolute value of the numbers that are fed to the ReLU before each layer. This solution, of course, is not the best one, as it requires more preprocessing.
The second solution is to use a modified version of the ReLU which is the
leaky ReLU. The graph for this function is shown below.
We know that the sigmoid function is used for binary classification problems, but if we have a multi-class classification problem, then we need another activation function, the SoftMax function. This function is essentially a generalization of the sigmoid that normalizes the outputs so that they sum to one.
In summary, we usually use ReLU or Leaky ReLU for the hidden layers, and
a sigmoid/SoftMax/tanh for the output layer depending on whether we have
a binary classification problem or not. Of course, there are many other
different functions, but they are not used as frequently as the ones that we’ve
discussed. However, you should always experiment with different functions
while working on a project because it is an iterative and explorative process.
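To make these functions concrete, here is a minimal NumPy sketch of the activations discussed above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes any input into (0, 1)

def relu(x):
    return np.maximum(0, x)               # zero for negative inputs, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope instead of zero for negatives

def softmax(x):
    e = np.exp(x - np.max(x))             # subtract the max for numerical stability
    return e / e.sum()                    # outputs are positive and sum to one

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(sigmoid(z), relu(z), leaky_relu(z), softmax(z), sep="\n")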
Below, we have a simple neural network which contains only two inputs, i1 and i2, one hidden layer consisting of two neurons, h1 and h2, and an output layer containing one neuron, "out". Also, we split each neuron of the hidden layer and the output layer into two parts: the input to the neuron, denoted "i", and the output of the same neuron after activation, denoted "o".
So, the first step is to initialize the weights w1, w2, w3, w4, w5, w6 with random numbers. Let us assume that we did so and, in particular, that w5 = 0.4. Also, assume that we have the inputs and the target output as follows:
i1 = 0.05,  i2 = 0.1,  target = 0.7
In the forward propagation step, each neuron's input is the weighted sum of the values feeding into it, and its output is the sigmoid of that input:
$$h1_i = w_1 i_1 + w_2 i_2, \qquad h1_o = \text{sigmoid}(h1_i)$$
$$h2_i = w_3 i_1 + w_4 i_2, \qquad h2_o = \text{sigmoid}(h2_i)$$
$$out_i = w_5\, h1_o + w_6\, h2_o, \qquad out_o = \text{sigmoid}(out_i)$$
Now, for backpropagation, we will use the squared error function along with gradient descent for optimization:
$$E = \frac{1}{2}(\text{target} - out_o)^2$$
Assuming the forward pass produced out_o = 0.61, the error is
$$E = \frac{1}{2}(0.7 - 0.61)^2 \approx 0.004$$
Now, to perform the gradient descent step mathematically, you need some background in multivariable calculus and partial derivatives, because we are finding the derivative each time with respect to only one variable while we now have more than one, unlike in logistic regression. The key concept that you need to look up is the chain rule of calculus. Using it, we can write the derivative of the error with respect to w5 as
$$\frac{\partial E}{\partial w_5} = \frac{\partial E}{\partial out_o} \cdot \frac{\partial out_o}{\partial out_i} \cdot \frac{\partial out_i}{\partial w_5}$$
The first factor is the derivative of the error with respect to the output:
$$\frac{\partial E}{\partial out_o} = -(\text{target} - out_o) = -(0.7 - 0.61) = -0.09$$
The second factor is the derivative of the sigmoid with respect to the input of the output neuron:
$$\frac{\partial out_o}{\partial out_i} = out_o (1 - out_o) = 0.61(1 - 0.61) \approx 0.24$$
The third factor is the derivative of the output neuron's input with respect to w5, which is simply the output of the hidden neuron that w5 multiplies:
$$\frac{\partial out_i}{\partial w_5} = h1_o = 0.5$$
Putting it all together:
$$\frac{\partial E}{\partial w_5} = -0.09 \times 0.24 \times 0.5 \approx -0.011$$
Finally, we update the weight using the following equation, where α = 0.001 is the learning rate:
$$w_5 = w_5 - \alpha \frac{\partial E}{\partial w_5} = 0.4 - 0.001 \times (-0.011) \approx 0.40001$$
The same calculations can be done for all the other weights.
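To make the arithmetic concrete, here is a minimal NumPy sketch of one forward pass and one gradient update for w5 on this toy network; only w5 = 0.4, the inputs, and the target come from the example above, while the remaining weight values are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Inputs and target from the worked example
i1, i2, target = 0.05, 0.10, 0.7

# w5 = 0.4 as in the text; the other weights are assumed for illustration
w1, w2, w3, w4, w5, w6 = 0.15, 0.20, 0.25, 0.30, 0.40, 0.45
alpha = 0.001                    # learning rate

# Forward propagation
h1_o = sigmoid(w1 * i1 + w2 * i2)
h2_o = sigmoid(w3 * i1 + w4 * i2)
out_o = sigmoid(w5 * h1_o + w6 * h2_o)
error = 0.5 * (target - out_o) ** 2

# Backpropagation for w5 using the chain rule
dE_dout_o = -(target - out_o)
dout_o_dout_i = out_o * (1 - out_o)
dout_i_dw5 = h1_o
dE_dw5 = dE_dout_o * dout_o_dout_i * dout_i_dw5

# Gradient descent update
w5 = w5 - alpha * dE_dw5
print(error, dE_dw5, w5)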
And that is it! You now know how neural networks work, both conceptually and mathematically. The final thing that you need to know is how to use neural networks in a hands-on project, which is the topic of the following section.
We will implement our networks in two ways. The first way is by using Keras, a high-level framework that handles most of the details for us. The second way is by using another powerful but more low-level framework called TensorFlow, which will require us to define the forward propagation explicitly but takes care of the backward propagation.
So, let us start by importing all the libraries that we will use in this section.
Now, let us load the dataset that we will use for this project, a famous dataset consisting of 70,000 images of handwritten digits from 0 to 9. So, our task is a multi-class classification one.
Then, we print the shape of the train and test images to understand the
structure of the dataset better.
After that, we perform basic preprocessing: we reshape the images so that each one is represented by a vector of length 784 (28 × 28), and we normalize the pixel values.
Now, we define our neural network model, first using the Sequential method, as our neural network is sequential by nature. Then, we add one hidden layer consisting of 512 neurons with a ReLU activation function. Finally, we add an output layer consisting of ten neurons, corresponding to the ten classes that we have, with a SoftMax activation function, as we discussed earlier.
Then, we train our neural network in a single line of code by passing the input, the labels, the number of iterations (also called epochs), and the batch size. We need the batch size because we cannot fit the whole dataset in memory at once, as we did with other algorithms, since this dataset is much bigger. Thus, we use a modified version of gradient descent: we train on one batch, update the weights, and then move on to the following batch until we reach the end of the dataset.
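A minimal sketch of this model with Keras; the layer sizes follow the description above, while the optimizer, number of epochs, and batch size are illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

# Load the 70,000 MNIST digit images (60,000 train / 10,000 test)
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
print(train_images.shape, test_images.shape)

# Reshape each 28x28 image into a 784-dimensional vector and normalize to [0, 1]
train_images = train_images.reshape((60000, 28 * 28)).astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28)).astype("float32") / 255

# One hidden layer of 512 ReLU neurons, output layer of 10 SoftMax neurons
model = keras.Sequential([
    layers.Dense(512, activation="relu", input_shape=(28 * 28,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train with mini-batches, then evaluate on the held-out test set
model.fit(train_images, train_labels, epochs=5, batch_size=128)
test_loss, test_acc = model.evaluate(test_images, test_labels)
print("Test accuracy:", test_acc)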
After that, we evaluate our neural network performance.
We got 97.8% accuracy, which we could not reach using any other algorithm.
Now, let us build a similar network with TensorFlow directly. The first step is to load the dataset using the following method.
Then, we define all the hyperparameters that we will use in the project. Dropout is a trick used extensively in deep neural networks to avoid overfitting. The trick is to drop (shut down) a random subset of neurons at each iteration, so the model cannot rely on any particular neurons. It then has a higher chance of learning the correct function instead of relying only on a subset of neurons.
We then define the layers’ dimensions as we will use three hidden layers instead
of one in Keras.
To feed data into a calculation in this low-level style of TensorFlow, the inputs need to be passed through something called a placeholder, while the trainable weights are stored in variables.
Then, we train the neural network by initializing the TensorFlow session and
running it.
Finally, we train our model and print the accuracy after every 100 iterations.
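A minimal TensorFlow 1.x-style sketch of such a three-hidden-layer network with dropout, placeholders, and a session; the layer sizes, learning rate, keep probability, and iteration count are illustrative assumptions:

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_test = x_test.reshape(-1, 784).astype("float32") / 255
y_train, y_test = y_train.astype("int64"), y_test.astype("int64")

# Placeholders for the inputs, labels, and dropout keep-probability
X = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
keep_prob = tf.placeholder(tf.float32)

def dense(inputs, in_dim, out_dim, activation=None):
    # One fully connected layer with explicitly defined weights and biases
    W = tf.Variable(tf.random_normal([in_dim, out_dim], stddev=0.1))
    b = tf.Variable(tf.zeros([out_dim]))
    z = tf.matmul(inputs, W) + b
    return activation(z) if activation else z

# Three hidden layers with ReLU and dropout, then a ten-class output layer
h1 = tf.nn.dropout(dense(X, 784, 256, tf.nn.relu), keep_prob)
h2 = tf.nn.dropout(dense(h1, 256, 256, tf.nn.relu), keep_prob)
h3 = tf.nn.dropout(dense(h2, 256, 128, tf.nn.relu), keep_prob)
logits = dense(h3, 128, 10)

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), y), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        idx = np.random.choice(len(x_train), 128)    # one mini-batch per step
        sess.run(train_op, {X: x_train[idx], y: y_train[idx], keep_prob: 0.8})
        if step % 100 == 0:
            acc = sess.run(accuracy, {X: x_test, y: y_test, keep_prob: 1.0})
            print("step", step, "test accuracy", acc)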
As we see, the maximum accuracy that we reached is 93.75% with three hidden layers, compared to the 97.8% we got earlier with only one hidden layer.
You are highly advised to experiment with and tune the hyperparameters of this neural network yourself to obtain even better results.
7.3. Convolutional Neural Networks
Then, after the promising results that deep learning algorithms showed, researchers integrated the convolution operation with deep neural networks to get the best of both worlds.
Now, given that we know what an image is in scientific terms, let us explore the convolution operation.
Suppose that we want to detect the edges in a picture. The first thing that comes to mind is to multiply the image by some other, smaller image that would help us extract the edges from the first one.
That is roughly the right idea, but doing it naively is infeasible, because images consist of millions of pixels, and we would need a different detector image each time.
(1 × 3) + (0 × 0) + (1 × (−1)) + (1 × 1) + (0 × 5) + (8 × (−1)) + (1 × 2) + (0 × 7) + (2 × (−1)) = −5
Thus, the output image will have a value of −5 at its first entry.
Then, we repeat the same process after sliding the window over which we convolve the filter.
By doing so, we obtain the output image, which contains the detected features.
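A minimal sketch of this sliding-window operation; the 3 × 3 filter and the top-left 3 × 3 corner of the image below reproduce the −5 computation above, while the remaining image entries are illustrative assumptions:

import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the kernel over every position where it fully fits ("valid" mode)
    out_h = image.shape[0] - kernel.shape[0] + 1
    out_w = image.shape[1] - kernel.shape[1] + 1
    output = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kernel.shape[0], c:c + kernel.shape[1]]
            output[r, c] = np.sum(window * kernel)   # element-wise multiply, then sum
    return output

image = np.array([
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
])
vertical_edge_filter = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])
print(convolve2d_valid(image, vertical_edge_filter))  # 4x4 output; first entry is -5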
However, the filter that we convolved our input image with contained specific, hand-picked numbers, which was fine for detecting simple edges. But if we want to extract more complex features, such as eyes or faces in images, then we need much more complex filters, and we need many of them, not just one.
Therefore, the idea was to treat the numbers in the filters as weights which
are found using a neural network and can be stacked using multiple neurons
and multiple layers. By that, we have the convolution neural network.
The padding layer is always used before the filter layer, and it does not have
any parameters or trainable weights. Basically, it is used to preserve the
dimensions of the input image.
In the last example, we saw that the output image dimensions were four by four, while the input image dimensions were six by six. This shrinking can be a huge problem if we apply the convolution operation multiple times, which we do in deep convolutional neural networks. Thus, the use of the padding layer is crucial.
Max pooling and average pooling are two widely used ways to represent the
pooling layer.
On the other hand, the average pooling is done as follows:
Finally, we should have a fully connected layer at the end of the neural network in order to combine all the different features that the network has extracted so far. It is not always necessary or preferred, because it may make the model more vulnerable to overfitting. Before it, we have a flattening layer, which converts the output into a column vector, just as we did in the preprocessing step of the neural network project.
A typical CNN starts with a padding layer and a convolution layer, which consists of a small number of filters, such as sixteen or thirty-two. These two layers are usually followed by a pooling layer, either maximum pooling or average pooling.
People frequently consider these three layers as one layer, and this is true for Keras, which we will use in the following section.
However, note that these three layers, which can be considered a single layer, are repeated many times, with the number of filters increasing each time. This is done until the model is no longer underfitting and before it starts overfitting.
Finally, the CNN has a flattening layer and/or a fully connected layer at the end.
As we discussed earlier, the activation functions used in the middle (hidden) layers are ReLU, and the final activation function for the output is tanh, sigmoid, or SoftMax.
Then, we define our CNN model. Here, we use the Conv2D layer from Keras with thirty-two filters in the first layer and a filter size of three by three. Also, we notice that the first Conv2D layer has padding = 'same', while the second one has padding = 'valid'. The 'valid' padding is equivalent to no padding at all, while the 'same' padding means that the output size of this layer should equal the input size.
We can see all the layers' names, parameters, and shapes using the summary function.
Finally, we add a flatten layer along with two fully connected layers. The final fully connected layer should have a number of neurons equal to the number of classes that we want to classify, along with a SoftMax activation function.
Then, as before, we compile, fit, and evaluate the model.
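A minimal sketch of such a CNN in Keras, assuming MNIST-style 28 × 28 grayscale inputs and ten classes; the dataset choice, the second layer's filter count, the dense layer size, and the training settings are illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255

model = keras.Sequential([
    # First convolution layer: 32 filters of size 3x3, "same" padding keeps 28x28
    layers.Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=(28, 28, 1)),
    # Second convolution layer: "valid" padding, i.e., no padding at all
    layers.Conv2D(64, (3, 3), padding="valid", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten, then two fully connected layers; the last has one neuron per class
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.summary()                       # layer names, output shapes, and parameter counts

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128)
print(model.evaluate(x_test, y_test))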
8. Reinforcement Learning
In this chapter, we will first explain what reinforcement learning is and what its main components are. Then, we will explore upper confidence bound (UCB) and Thompson sampling, two widely used reinforcement learning algorithms. Of course, we will, as always, explain the algorithms theoretically, and then we will have a hands-on exercise on them.
While this might seem vague right now because of the new terminology, it will
be clear after the following section.
Imagine that you have a dog that you want to train. When you first get the dog, or if it is a newborn, it does not know anything. Thus, it explores the environment around it by interacting with it and taking actions. If it obeys you and listens to what you are saying, then you reward it with a small snack. If it does something wrong, then you punish it, for example by delaying its meal by an hour or so. By doing this, you train your dog by giving it positive or negative rewards in response to its actions in the environment.
8.1.3. Reinforcement Learning Example
To make sure that we understand everything clearly, let us mention an example of RL used in real life. Imagine that you want to control a robot's movement so that it can walk without falling or getting stuck. If you treat this as a supervised learning problem, it becomes so difficult that it cannot be solved practically. So, let us treat it as a reinforcement learning problem.
Here, the agent is the robot’s brain, while the environment can be the robot’s
body, the obstacles around it and the physics constraints. Also, the states
would be the joints’ angles of the robot, the distance from the next obstacle,
the type of the obstacle and so on. Moreover, the actions that the robot’s brain
can take are the controls and the commands for its joints and limbs. Finally, you can design the reward to be proportional to how far the robot walks without falling or getting stuck.
8.2. Upper Confidence Bound
Imagine that you are in a casino with twenty slot machines and a limited amount of time to play them. There are two extreme approaches that you might take. The first one is to choose one machine at random and play it until the time is over. This is not good, because if you are not lucky, the average payout of this machine might not be good, and you will not gain that much money. This is a pure exploitation approach, as you decided to take the safe route.
The other approach is to split your time equally and play each slot machine for 1/20 of the time. By doing so, you will not end up with a very bad profit, but you are also not maximizing it. This is a pure exploration approach.
The trick here is to use a combination of exploration and exploitation, and that
is what the following algorithms are addressing.
The idea behind the upper confidence bound (UCB) algorithm is that the less we are sure about or know about a specific state or "arm", the more important it becomes to explore it.
Suppose that we want to have personalized ads for our website users. In this case, the d ads that we can display to a user each time they open our website will be the arms, and we will represent each time the user opens the website as a round n.
Now, every time the user opens the website, we display only one ad to them. Thus, we can represent the reward for each ad i as follows:
$$r_i(n) \in \{0, 1\}: \quad r_i(n) = 1 \text{ if the user clicked ad } i \text{ and } 0 \text{ if they did not.}$$
The first step is to calculate, at each round n, the sum of rewards $R_i(n)$ of ad i (its value so far) and the number of times $N_i(n)$ the ad i was selected.
The second step is to calculate the average reward of ad i from the beginning up to the current round, together with its confidence interval:
$$\bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}, \qquad \Delta_i(n) = \sqrt{\frac{3 \log(n)}{2 N_i(n)}}$$
The third and final step is to select the ad i that has the maximum upper confidence bound $\bar{r}_i(n) + \Delta_i(n)$.
By performing these three steps, you can combine both exploration and
exploitation to find the best action for each state.
The first step is to import the needed libraries and fix the path as usual.
We then read the dataset that we will use in this project, which is an ads
dataset corresponding to the response of thousands of people to ten
different ads.
Now, we start with the pure exploration algorithm. We have 10,000 records
which are represented by the variable N. We implement the algorithm by
choosing a random ad each time without any considerations.
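A minimal sketch of this pure-exploration policy; the CSV filename is hypothetical, and the dataset is assumed to have one row per round and one column per ad, holding 1 when that user would click the ad:

import random
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('ads_dataset.csv')   # hypothetical filename
N, d = 10000, 10                           # number of rounds and number of ads
ads_selected = []
total_reward = 0
for n in range(N):
    ad = random.randrange(d)               # pick an ad uniformly at random
    ads_selected.append(ad)
    total_reward += dataset.values[n, ad]

plt.hist(ads_selected)                     # roughly uniform across the ten ads
plt.show()
print(total_reward)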
If we plot the histogram, which represents how many times each ad was
selected, we notice that nearly all of them were selected equally.
Finally, we record the total reward that we get from this pure exploration
policy, which is 1255.
Now, using the same dataset, let us implement the UCB steps and equations
which are straightforward as follows.
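A minimal sketch of these UCB steps, under the same assumptions about the dataset as in the previous sketch:

import math
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('ads_dataset.csv')   # hypothetical filename
N, d = 10000, 10
numbers_of_selections = [0] * d            # N_i(n)
sums_of_rewards = [0] * d                  # R_i(n)
ads_selected = []
total_reward = 0

for n in range(N):
    best_ad, max_upper_bound = 0, -1.0
    for i in range(d):
        if numbers_of_selections[i] > 0:
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            delta_i = math.sqrt(3 * math.log(n + 1) / (2 * numbers_of_selections[i]))
            upper_bound = average_reward + delta_i
        else:
            upper_bound = float('inf')     # force each ad to be tried at least once
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            best_ad = i
    ads_selected.append(best_ad)
    numbers_of_selections[best_ad] += 1
    reward = dataset.values[n, best_ad]
    sums_of_rewards[best_ad] += reward
    total_reward += reward

plt.hist(ads_selected)                     # the best ad dominates the histogram
plt.show()
print(total_reward)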
The total reward is much bigger, and the histogram shows clearly that the
fourth ad is the one that gives the highest reward.
8.3. Thompson Sampling
Like UCB, the algorithm consists of three main steps. The first step is to calculate, at each round n, the number of times ad i has received a reward of 1 up to the current round, $N_i^1(n)$, and the number of times it has received a reward of 0, $N_i^0(n)$.
The second step is that, for each ad, we take a random draw from the following distribution.
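In the standard Bernoulli (click / no-click) form of Thompson sampling, this random draw is taken from a Beta distribution whose parameters come from the two counts above:
$$\theta_i(n) \sim \text{Beta}\big(N_i^1(n) + 1,\; N_i^0(n) + 1\big)$$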
You do not really need to understand the previous equation deeply, as that requires a very solid mathematical background.
The third step is to select the ad with the highest sampled value $\theta_i(n)$.
These are the three main steps that this algorithm needs to run. However, there is one issue with this algorithm: as it is probabilistic, we cannot trust its results without monitoring. On the other hand, this algorithm can handle delayed feedback, unlike UCB.
We will import the libraries, fix the path, and load the same dataset.
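A minimal sketch of these steps, again under the same assumptions about the ads dataset:

import random
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('ads_dataset.csv')   # hypothetical filename
N, d = 10000, 10
numbers_of_rewards_1 = [0] * d             # N_i^1(n)
numbers_of_rewards_0 = [0] * d             # N_i^0(n)
ads_selected = []
total_reward = 0

for n in range(N):
    best_ad, max_draw = 0, 0.0
    for i in range(d):
        # Random draw from Beta(N_i^1 + 1, N_i^0 + 1)
        theta = random.betavariate(numbers_of_rewards_1[i] + 1,
                                   numbers_of_rewards_0[i] + 1)
        if theta > max_draw:
            max_draw = theta
            best_ad = i
    ads_selected.append(best_ad)
    reward = dataset.values[n, best_ad]
    if reward == 1:
        numbers_of_rewards_1[best_ad] += 1
    else:
        numbers_of_rewards_0[best_ad] += 1
    total_reward += reward

plt.hist(ads_selected)
plt.show()
print(total_reward)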
As we see below, the algorithm also chooses the fourth ad, as UCB did, but with much higher confidence.
Also, we see that the total reward is higher than with UCB.
Bonus: Free eBook on Neural Networks and Deep Learning with Python
If you want to help us produce more material like this, then please
leave an honest review. It really does make a difference.