SupportcoursesM DLearning
net/publication/383220216
1 author:
Ammouri Bilel
All content following this page was uploaded by Ammouri Bilel on 19 August 2024.
REPUBLIQUE TUNISIENNE
LECTURE NOTES
VERSION .2024. BETA
Bilel AMMOURI
: ESAM
ð : ammouri-bilel
§ : bilelammouri
D : Ammouri-Bilel
: 0000-0002-5491-5172
Preface
These are the class notes I took for ESA Mograne: Introduction to Machine Learning and
Deep Learning. These notes are part of a course designed for students in agronomy (rural
economics and vegetable production). The course, focused on deep learning, assumes that
students have little to no background in machine learning or advanced theoretical
mathematics. To make these concepts more accessible, I decided to introduce foundational
machine learning concepts first and then delve into deep learning, describing some
algorithms without detailed mathematical proofs. Additionally, I have provided
a GitHub repository [1] containing code, notebooks, and examples relevant to agronomy,
which will help students better understand and apply these concepts.
1. https://github.com/bilelammouri/Machine-Learning-and-Deep-Learning-in-Agronomy
Table of Contents
2 Supervised Learning
2.1 Understanding Supervised Learning
2.2 Regression Algorithms
2.2.1 Linear Regression
2.2.2 Lasso Regression
2.2.3 Ridge Regression
2.2.4 Polynomial Regression
2.3 Classification Algorithms
2.3.1 Logistic Regression
2.3.2 k-Nearest Neighbors (k-NN)
2.4 Regression-classification algorithms
2.4.1 Decision Trees (DT)
2.4.2 Random Forest (RF)
2.4.3 Support Vector Machines (SVM)
2.4.4 Ensemble learning
2.4.5 Naive Bayes
2.4.6 Gaussian Process Regression
2.4.7 Bayesian Linear Regression
3 Unsupervised Learning
3.1 K-Means Clustering
3.2 K-Nearest Neighbors (KNN)
3.3 Hierarchical Clustering
3.4 Principal Component Analysis (PCA)
3.5 Independent Component Analysis (ICA)
3.6 Apriori Algorithm
3.7 Singular Value Decomposition (SVD)
1.1 Introduction
CHAPTER 1. INTRODUCTION TO MACHINE LEARNING
A computer program that learns from experience is called a machine learning program or
simply a learning program. It is also sometimes referred to as a learner.
The learning process, for both humans and machines, can be divided into four main
components: data storage, abstraction, generalization, and evaluation. Figure 1.1 illustrates
these components and the steps involved in the learning process.
Storing and retrieving large amounts of data is crucial for the learning process, as both
humans and computers depend on data storage for advanced reasoning. Humans store data
in their brains and retrieve it through electrochemical signals, whereas computers use hard
drives, flash memory, and RAM, accessing data via cables and other technologies.
1.2. LEARNING MODELS 11
[Figure 1.1: Data → Concepts → Inferences]
Abstraction involves extracting knowledge from stored data, creating general concepts
from the overall data. This process includes applying known models and developing new
ones. Training a model to fit a dataset transforms the data into an abstract form that
summarizes the original information.
Generalization involves applying the knowledge obtained from stored data to new, similar
tasks, aiming to discover the most relevant properties of the data for future applications.
Evaluation provides feedback to measure the usefulness of the learned knowledge. This
feedback is used to improve the entire learning process.
Machine learning involves selecting suitable features to develop models that effectively
address specific tasks. These learning models can be categorized into three primary types :
Logical models, which utilize logical expressions ; Geometric models, which apply geome-
tric properties of the instance space ; and Probabilistic models, which use probability for
classifying instances within the space. Additionally, models may focus on grouping and
grading outcomes to enhance predictive accuracy (see Figure 1.2 and Table 1.1).
[Figure 1.2: Learning models: logical models, geometric models, probabilistic models]
Logical models partition the instance space into segments using logical expressions to
create grouping models. These expressions yield Boolean values (True or False), which
makes it possible to sort the data into homogeneous groups for the problem at hand. In
classification tasks, all instances within a group belong to the same class. Logical models are
primarily divided into two categories: tree models and rule models. Rule models consist of
a set of IF-THEN rules, where the ‘if-part’ defines a segment and the ‘then-part’ determines
the model’s behavior for that segment; tree models follow the same principle.
A deeper understanding of logical models requires an exploration of Concept Learning,
which involves deriving logical expressions from examples, aligning with the goal of ge-
neralizing a function from specific training examples. Concept Learning can be formally
defined as the inference of a Boolean-valued function from training examples of its input
and output, typically describing only the positive class and labeling everything else as ne-
gative. For instance, in a Concept Learning task called "Enjoy Sport," data from various
example days is described by six attributes, and the objective is to predict whether a day is
enjoyable for sports. This involves formulating hypotheses as conjunctions of constraints
on the attributes, with training data containing positive and negative examples of the tar-
get function. Each hypothesis might represent a vector of six constraints : Sky, AirTemp,
Humidity, Wind, Water, and Forecast, while the training phase focuses on learning the
conjunction of attributes for which Enjoy Sport equals "yes." The problem can be
articulated as identifying a function that predicts the target variable Enjoy Sport as either
yes (1) or no (0) given instances representing all possible days.
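As an illustration of hypotheses written as conjunctions of constraints, the sketch below represents a hypothesis as a tuple over the six attributes, with "?" meaning "any value is acceptable". The learner shown is the classic Find-S procedure, which the text does not name; it is used here only as one simple way to learn such a conjunction from positive examples. The attribute values in the training days are hypothetical.

```python
# Attribute order follows the text: Sky, AirTemp, Humidity, Wind, Water, Forecast.
# The attribute values below are hypothetical, chosen only for illustration.
ANY = "?"  # wildcard: any value is acceptable for this attribute


def matches(hypothesis, day):
    """Return True if the day satisfies every constraint of the hypothesis."""
    return all(h == ANY or h == v for h, v in zip(hypothesis, day))


def find_s(examples):
    """Find-S: start from the first positive example and generalize
    just enough to cover each further positive example."""
    hypothesis = None
    for day, enjoy in examples:
        if not enjoy:
            continue  # this simple learner ignores negative examples
        if hypothesis is None:
            hypothesis = list(day)
        else:
            hypothesis = [h if h == v else ANY for h, v in zip(hypothesis, day)]
    return tuple(hypothesis) if hypothesis is not None else None


training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

h = find_s(training)
print(h)  # the learned conjunction of constraints
```

The learned hypothesis keeps a constraint only where all positive days agree, and it can then be applied to unseen days with `matches`.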
Logical models, such as decision trees, utilize logical expressions to segment the instance
space, identifying similarity through logical segments. In contrast, geometric models de-
fine similarity based on the geometry of the instance space. Features can be represented
as points in two dimensions (x- and y-axis) or three dimensions (x, y, and z). Even if fea-
tures are not inherently geometric, they can be modeled geometrically ; for example, one
could model soil moisture and temperature over time on two axes. There are two primary
methods for establishing similarity in geometric models :
— Using geometric concepts like lines or planes to classify the instance space, known as
Linear models.
— Using the geometric notion of distance to represent similarity, where proximity im-
plies similar feature values, referred to as Distance-based models.
Linear Models
Linear models are relatively straightforward. They represent the function as a linear com-
bination of inputs. For instance, if x1 and x2 are scalars or vectors of the same dimension,
and α and β are arbitrary scalars, then αx1 + βx2 represents a linear combination of x1 and
x2 . In its simplest form, f (x) is a straight line, given by the equation f (x) = α + βx, where
α is the intercept and β is the slope (see Figure 1.3).
Distance-based Models
Distance-based models represent the second class of geometric models. Like linear models,
distance-based models rely on the geometry of data. As the name suggests, these models
operate on the concept of distance. In the context of machine learning, this distance is not
solely the physical separation between two points. For example, in agriculture, one might
consider the distance between two fields based on the mode of irrigation used—drip irri-
gation may cover less ground compared to traditional sprinklers. Additionally, the concept
of distance can vary depending on the context ; for instance, the distance between crop va-
rieties can influence yield outcomes differently. Commonly used distance metrics include
Euclidean, Minkowski, Manhattan, and Mahalanobis (see Figure 1.4). These metrics apply
through the concepts of neighbors and exemplars. Neighbors are points in proximity based
on the distance measure, expressed through exemplars. Exemplars are either centroids,
finding a center of mass according to a chosen distance metric, or medoids, identifying
the most centrally located data point. The arithmetic mean is a commonly used centroid,
minimizing squared Euclidean distance to all other points.
— A centroid is the geometric center of a plane figure, i.e., the mean position of all
points in the figure. This extends to any object in n-dimensional space: its centroid
is the mean position of all its points.
— Medoids are similar to means or centroids but are used when a mean or centroid
cannot be defined. They are preferred in contexts where the centroid is not represen-
tative of the dataset, such as in image data.
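The distance metrics and exemplars described above can be sketched in a few lines. The example below is a minimal illustration, assuming hypothetical (soil moisture, temperature) readings; it covers the Minkowski family (which includes Manhattan and Euclidean distance) and leaves out Mahalanobis distance, which additionally needs a covariance matrix.

```python
def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def centroid(points):
    """Arithmetic mean of the points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points, p=2):
    """The actual data point with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(minkowski(c, q, p) for q in points))


# Hypothetical (soil moisture %, temperature in degrees Celsius) readings
fields = [(10.0, 20.0), (12.0, 22.0), (11.0, 21.0), (30.0, 35.0)]

print(minkowski((0, 0), (3, 4), p=1))  # Manhattan distance
print(minkowski((0, 0), (3, 4), p=2))  # Euclidean distance
print(centroid(fields))                # need not be one of the data points
print(medoid(fields))                  # always one of the data points
```

Note how the outlying fourth reading pulls the centroid away from the other three points, while the medoid stays on an observed reading.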
The third category of machine learning algorithms is probabilistic models. Unlike k-nearest
neighbor algorithms, which use distance, or logical models, which use logical expressions,
1.3. LEARNING SYSTEM 15
2. Generative models estimate the joint distribution P (Y, X). Once the joint distribu-
tion is known, any conditional or marginal distribution involving the same variables
can be derived. Generative models can create new data points and their labels based
on the joint probability distribution. The joint distribution seeks a relationship bet-
ween two variables, allowing the inference of new data points once this relationship
is known.
Naïve Bayes is an example of a probabilistic classifier, utilizing Bayes’ rule, which relies
on conditional probability. Conditional probability determines the likelihood of an event
given that another event has occurred. The Naïve Bayes algorithm evaluates the evidence
to determine the likelihood of a specific class and assigns a label to each entity accordingly.
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{1.1}$$
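A short worked example may help make Bayes' rule (1.1) concrete. The sketch below assumes hypothetical probabilities for a plant-disease scenario, with A = "the plant is diseased" and B = "the leaf shows yellow spots"; P(B) is obtained from the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) from Bayes' rule, with P(B) expanded by the law of total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers, for illustration only
p_disease = 0.10              # P(A): prior probability of disease
p_spots_given_disease = 0.80  # P(B|A)
p_spots_given_healthy = 0.05  # P(B|not A)

p = posterior(p_disease, p_spots_given_disease, p_spots_given_healthy)
print(round(p, 2))  # probability of disease once yellow spots are observed
```

Here the numerator is 0.80 × 0.10 = 0.08 and the evidence is 0.08 + 0.05 × 0.90 = 0.125, so observing the spots raises the probability of disease from 0.10 to 0.64.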
— Grouping models divide the instance space into segments or groups and apply a
simple method within each segment (e.g., majority class). Examples include decision
trees and KNN.
— Grading models form a single global model over the instance space. Examples include
linear classifiers and neural networks.
training experience E, with the ultimate goal of discovering an unknown target function.
This target function embodies the knowledge to be acquired from the training experience
but remains unknown initially.
For instance, in the context of predicting crop yield, the learning system uses historical
crop data as its training experience, while the task is to classify whether a given crop
will have a high yield. Here, the training examples can be represented as
(x1, y1), (x2, y2), ..., (xn, yn), where each input xi in X collects factors such as soil
quality, weather conditions, and irrigation practices, and yi is the corresponding yield status.
The precise knowledge to be gleaned from the training experience pertains to learning the
target function, which can be expressed as a mapping function f : X → y. This function
encapsulates the relationship between the input variable X and the output variable y.
Through our exploration of the learning process, we understand that several design choices
are fundamental for creating an effective learning system. The critical considerations are :
To illustrate these design choices, we will consider the checkers learning problem. The
three elements for this scenario are :
The type of training experience plays a crucial role in determining the success of a learning
system. Training experiences can be categorized as follows :
— Direct training provides specific instances along with the correct actions for
each instance.
— Indirect training offers sequences of actions and their final outcomes (e.g.,
yield or no yield) without specifying the correct action for each instance. This
introduces the credit assignment problem.
2. Teacher Involvement :
When predicting crop yield, a farmer decides on the best agricultural practices among
available options, applying learned experience to enhance their chances of maximizing
yield. This learning process can be formalized through the target function.
Considerations include :
1. Direct Experience : The learning system must determine the optimal agricultural
practice from a vast search space. This is denoted as the function ChoosePractice :
C → P , where C represents different crop conditions and P denotes possible agri-
cultural practices.
4. If c is not a terminal state, then V (c) = V (c′ ), where c′ is the best achievable terminal
condition from c.
This recursive definition implies that determining V (c) necessitates searching for the opti-
mal agricultural strategy, making it computationally intensive.
The objective of learning is to derive an operational version of V , which the crop yield
prediction program can utilize to evaluate crop conditions and select practices efficiently.
Perfectly learning this operational form may be challenging ; thus, learning algorithms
typically approximate the target function as V̂ .
With the target function V defined, the subsequent step is to select a suitable representa-
tion for the function V̂ , which the learning algorithm will use. Potential representations
include :
— A comprehensive table that stores values for each distinct board state
This selection presents a trade-off : while a more expressive representation can approxi-
mate V more accurately, it also requires more training data to differentiate among the
various hypotheses.
To simplify, we can represent the function V̂ as a linear combination of specific features
of the condition c:
$$\hat{V}(c) = w_0 + w_1 x_1(c) + w_2 x_2(c) + \cdots + w_6 x_6(c)$$
Here, w0, ..., w6 are numerical coefficients determined by the learning algorithm, and the
weights w1 to w6 indicate the significance of the corresponding features. These features might
include :
— x3 (c) the quality of the soil, measured by nutrient content (e.g., pH level)
— x5 (c) the presence of pests or diseases affecting the crops (binary indicator)
To train the learning program, a set of training data is necessary, comprising specific
conditions c and their corresponding training values Vtrain(c). Each training example is
represented as an ordered pair ⟨c, Vtrain(c)⟩.
For instance, a training example might be ⟨(x1 = 500, x2 = 22, x3 = 7.5, x4 = 100, x5 =
0, x6 = 1), 400⟩, indicating a training value of 400 for that condition (here x2 = 22 is
the temperature in degrees Celsius, etc.).
While assigning Vtrain (c) for clear and well-understood conditions is straightforward, it
becomes complex for intermediate conditions. In such cases, we utilize temporal difference
(TD) learning—a key reinforcement learning concept where iterative corrections improve
estimated returns towards accurate targets.
Let Successor(c) denote the subsequent board state following c. The learner’s approxima-
tion V̂ is employed to assign training values for intermediate states as follows :
$$E = \frac{1}{2} \sum_{i} \left(V_{\text{train}}(c_i) - \hat{V}(c_i)\right)^2 \tag{1.6}$$
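The error in (1.6) can be computed directly once V̂ is a weighted sum of the six features with weights w0, ..., w6. The sketch below is a minimal illustration: it evaluates E on the single training example given earlier and then applies one gradient-descent step on E. The update rule and the learning rate are assumptions made for illustration, not something the text specifies.

```python
def v_hat(weights, features):
    """Linear approximation: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def squared_error(weights, examples):
    """E = 1/2 * sum over examples of (V_train - V_hat)^2, as in equation (1.6)."""
    return 0.5 * sum((v_train - v_hat(weights, x)) ** 2 for x, v_train in examples)


def gradient_step(weights, examples, learning_rate):
    """One gradient-descent step on E (an assumed update rule, for illustration)."""
    updated = list(weights)
    for x, v_train in examples:
        error = v_train - v_hat(weights, x)
        updated[0] += learning_rate * error           # w0 multiplies a constant 1
        for j, xj in enumerate(x, start=1):
            updated[j] += learning_rate * error * xj  # dE/dwj = -error * xj
    return updated


# Training example taken from the text: features (x1, ..., x6) and training value 400
examples = [((500, 22, 7.5, 100, 0, 1), 400)]
weights = [0.0] * 7  # w0, ..., w6, all starting at zero

e_before = squared_error(weights, examples)
weights = gradient_step(weights, examples, learning_rate=1e-6)
e_after = squared_error(weights, examples)
print(e_before, e_after)  # the error shrinks after the step
```

The learning rate has to be small here because the raw feature values are large; rescaling the features would allow a larger step.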
The final design of the checkers learning system can be encapsulated in four distinct pro-
gram modules that represent fundamental components of many learning systems :
1. The Performance System : This module takes a new board state as input and outputs
a trace of the game played against itself. It simulates the checkers game using the
current hypothesis.
1.4. TYPES OF LEARNING 21
2. The Critic : This component processes the trace of the game generated by the per-
formance system. It evaluates the game outcome and produces a set of training
examples for the target function. The critic is essential for assessing the effective-
ness of moves made during gameplay.
3. The Generalizer : This module receives the training examples from the critic and
outputs a hypothesis that estimates the target function. Effective generalization is
crucial for adapting learned strategies to new, unseen board configurations.
4. The Experiment Generator : This component takes the current hypothesis (the func-
tion that has been learned so far) as input and generates new problems (initial board
states) for the performance system to explore. This iterative process allows the sys-
tem to continually refine its understanding of the game dynamics.
Together, these modules create a comprehensive framework for the checkers learning sys-
tem, allowing it to learn from self-play, improve through evaluation, and generalize its
knowledge to play effectively against various opponents.
Machine learning algorithms can generally be classified into four main types (see Figures
1.6 and 1.7).
In supervised learning, a training set containing examples with corresponding correct res-
ponses (targets) is provided. The algorithm learns from this training data to generalize
and respond accurately to new inputs. This approach is also referred to as learning from
exemplars. Specifically, supervised learning involves learning a function that maps inputs
to outputs based on example input-output pairs.
Each example in the training dataset consists of an input object (often represented as a
vector) and an output value. The supervised learning algorithm analyzes this training data
to produce a function capable of mapping new examples. Ideally, this function will accu-
rately determine the class labels for unseen instances. Supervised learning encompasses
both classification and regression tasks, and numerous algorithms are available, each with
its advantages and disadvantages. There is no single algorithm that excels in all supervised
learning scenarios.
Consider a dataset from an agronomic study that includes various factors affec-
ting crop yield. Each data point in the dataset represents specific agricultural
conditions, such as rainfall, temperature, soil quality, fertilizer usage, presence
of pests or diseases, and type of crop. Each condition is labeled with the cor-
responding crop yield.
Here is a detailed example using the previously mentioned variables :
x1     x2     x3     x4     x5     x6       Yield (units)
500    22     7.5    100    0      Wheat    4000
450    20     6.8    80     1      Corn     3500
...    ...    ...    ...    ...    ...      ...
In this example, the system learns from historical data of crop yields under
various conditions, developing a model that can predict future yields based on
This type of machine learning algorithm draws inferences from datasets consisting solely of
input data without labeled responses. In unsupervised learning, there are no classifications
— x1 — soil pH level
x1     x2     x3     x4     x5     x6
6.5    20     15     50     22     500
5.8    18     10     45     20     450
...    ...    ...    ...    ...    ...
yield the highest rewards through experimentation. In many complex scenarios, the effects
of actions may not only influence immediate rewards but also impact future situations and
subsequent rewards.
Consider an agricultural robot that is tasked with optimizing the watering sche-
dule for a field. While we cannot explicitly instruct the robot on the optimal
watering times and amounts, we can provide feedback based on the crop yield
and health outcomes. The robot must learn which watering schedules lead to
better yields or healthier crops. A similar methodology can be applied to train
machines for various tasks, such as pest control, soil treatment, or autonomous
harvesting. Reinforcement learning differs from supervised learning, which re-
lies on examples provided by knowledgeable experts.
Here is a detailed description of the reinforcement learning process for the
agricultural robot :
— State (s) : The current condition of the field, which includes factors
such as soil moisture level, weather conditions, and crop growth stage.
— Action (a) : The watering decision made by the robot, such as the
amount of water to apply and the timing of the application.
— Reward (r) : The feedback provided to the robot based on the crop yield
and health outcomes. For example, a higher yield or healthier crops
result in a higher reward, while poor outcomes result in a lower reward
or penalty.
— Policy (π) : The strategy that the robot uses to determine its actions
based on the current state. The policy is continuously improved as the
robot learns from its experiences.
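The four ingredients listed above (state, action, reward, policy) can be tied together in a toy simulation. The sketch below uses tabular Q-learning, which is one common reinforcement learning algorithm and is an assumption here, since the text does not name a specific method. The three-level soil moisture environment and its rewards are hypothetical stand-ins for a real field.

```python
import random

STATES = ["dry", "moist", "wet"]   # soil moisture level (the state s)
ACTIONS = ["no_water", "water"]    # watering decision (the action a)


def step(state, action):
    """Toy environment: watering raises moisture, not watering lowers it."""
    nxt = min(state + 1, 2) if action == 1 else max(state - 1, 0)
    # Moist soil stands in for a healthy crop; dry or wet soil is penalized.
    reward = 1.0 if STATES[nxt] == "moist" else -1.0
    return nxt, reward


def q_learning(steps=5000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values Q[state][action] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in STATES]
    state = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(ACTIONS))  # explore
        else:
            action = max(range(len(ACTIONS)), key=lambda a: q[state][a])  # exploit
        nxt, reward = step(state, action)
        # Move the estimate toward reward + discounted value of the next state.
        q[state][action] += alpha * (reward + gamma * max(q[nxt]) - q[state][action])
        state = nxt
    return q


q = q_learning()
# The policy picks, in each state, the action with the highest learned value.
policy = {
    STATES[s]: ACTIONS[max(range(len(ACTIONS)), key=lambda a: q[s][a])]
    for s in range(len(STATES))
}
print(policy)
```

Nobody tells the agent when to water; it discovers from the rewards alone that dry soil should be watered and wet soil should not.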
The reinforcement learning process involves the robot interacting with the en-
vironment (the field) and receiving feedback to improve its watering policy.
The goal is to maximize the cumulative reward over time, leading to optimal
[Figure: target function f : x ↦ y; training data (x1, y1), (x2, y2), ..., (xn, yn); learning algorithms; hypothesis (model estimate)]
FIGURE 1.7 – Three broad categories of machine learning: unsupervised learning,
supervised learning and reinforcement learning
Source: ww2.mathworks.cn
CHAPTER 2
Supervised Learning
Introduction
Supervised learning is a machine learning approach where the algorithm iteratively learns
the dependencies between data points. The desired output is specified in advance, and the
learning process is supervised by comparing the algorithm’s predictions to actual results.
The goal is to optimize the algorithm so it can apply learned patterns to make predictions
on new, unseen data. Supervised learning methods can be used for both regression and
classification problems (see Figure 2.1).
In supervised classification, abstract classes are created to categorize and organize data
meaningfully. Objects are grouped based on similar characteristics and structured accor-
dingly. This approach helps in organizing data for better interpretation and analysis.
In contrast, supervised regression algorithms are used to predict continuous outcomes and
to model the relationships between independent and dependent variables. These algorithms
are essential for predictive modeling and for understanding the influence of different
factors on the outcomes.
2.2. REGRESSION ALGORITHMS 31
Figure 2.2 lists some of the most important supervised classification algorithms.
Graphically, regression can be represented as a line or curve that best fits the observed data
points related to agricultural outcomes. The objective is to minimize the vertical distances
between the actual data points and the regression line, which indicates the strength of
— Determining the relative importance of different factors and how they interact.
There are several types of regression methods employed in data science and machine lear-
ning, each suitable for different contexts. Common types include :
— Linear Regression
— Logistic Regression
— Polynomial Regression
— Ridge Regression
— Lasso Regression
— Neural Networks
When there is a single independent variable, it is termed simple linear regression (2.1) ;
when there are multiple independent variables, it is referred to as multiple linear
regression (2.2).
$$y_i = \omega_0 + \omega_1 x_i + \varepsilon_i \tag{2.1}$$
$$y_i = \omega_0 + \omega_1 x_{i1} + \cdots + \omega_p x_{ip} + \varepsilon_i \tag{2.2}$$
where $y_i$ is the dependent variable (target), $x_i$ (or $x_{i1}, \dots, x_{ip}$) are the
independent variables (predictors), the $\omega_j$ are the linear coefficients, and
$\varepsilon_i$ is the random error term.
The primary objective in linear regression is to identify the best fit line, which minimizes
the error between predicted and actual values. The accuracy of this line depends on opti-
mizing the coefficients (ω0 and ω1 ), which is achieved through a cost function.
The cost function measures how well the linear regression model performs and helps in
estimating the coefficients for the best fit line. For linear regression, the Mean Squared
Error (MSE) is commonly used, defined as :
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (\omega_1 x_i + \omega_0)\right)^2 \tag{2.3}$$
Where :
— yi : Actual value
Residuals represent the discrepancies between actual and predicted values ; larger resi-
duals indicate a poor fit, while smaller residuals suggest a better model.
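For simple linear regression, the coefficients that minimize the MSE in (2.3) have a well-known closed form. The sketch below is a minimal illustration using hypothetical fertilizer and yield figures; it fits ω0 and ω1, then computes the residuals and the MSE.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = w0 + w1 * x (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx               # slope
    w0 = mean_y - w1 * mean_x    # intercept
    return w0, w1


def mse(xs, ys, w0, w1):
    """Mean squared error of the line, as in equation (2.3)."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: fertilizer (kg/ha) versus yield (t/ha)
fertilizer = [0, 50, 100, 150, 200]
crop_yield = [2.0, 3.1, 3.9, 5.2, 5.8]

w0, w1 = fit_simple_linear(fertilizer, crop_yield)
residuals = [y - (w1 * x + w0) for x, y in zip(fertilizer, crop_yield)]
print(w0, w1)
print(mse(fertilizer, crop_yield, w0, w1))
```

With an intercept in the model, the residuals of the fitted line sum to zero, and any other line gives a larger MSE on the same data.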
Lasso regression, short for Least Absolute Shrinkage and Selection Operator, is a linear
model that includes a regularization term to prevent overfitting. This technique not only
helps in reducing model complexity but also performs feature selection by shrinking some
coefficients to zero. The addition of the L1 penalty term differentiates it from standard
linear regression (see Figure 2.3).
Lasso regression modifies the standard linear regression formula by adding a regularization
term $\lambda \sum_{j=1}^{p} |\omega_j|$, where λ controls the strength of the penalty (2.4).
This helps in addressing issues of multicollinearity and overfitting.
$$\hat{\omega} = \arg\min_{\omega} \sum_{i=1}^{n} \Big(y_i - \omega_0 - \sum_{j=1}^{p} \omega_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} |\omega_j| \tag{2.4}$$
Where y represents the dependent variable (target), x denotes the independent variables
(predictors), ω are the coefficients, and λ is the regularization parameter. A larger λ leads
to more shrinkage of the coefficients.
Ridge regression, also known as Tikhonov regularization, is a linear model that includes
a regularization term to prevent overfitting. This technique helps in reducing model com-
plexity and multicollinearity by adding a penalty term that shrinks the coefficients. The
addition of the L2 penalty term differentiates it from standard linear regression (see Fi-
gure 2.4).
Ridge regression modifies the standard linear regression formula by adding a regularization
term $\lambda \sum_{j=1}^{p} \omega_j^2$, where λ controls the strength of the penalty (2.5).
This helps in addressing issues of multicollinearity and overfitting.
$$\hat{\omega} = \arg\min_{\omega} \sum_{i=1}^{n} \Big(y_i - \omega_0 - \sum_{j=1}^{p} \omega_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \omega_j^2 \tag{2.5}$$
Where y represents the dependent variable (target), x denotes the independent variables
(predictors), ω are the coefficients, and λ is the regularization parameter. A larger λ leads
to more shrinkage of the coefficients.
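The different behavior of the L1 penalty in (2.4) and the L2 penalty in (2.5) is easiest to see with a single feature. The sketch below assumes one centered feature and a centered target, so the intercept can be dropped; in that special case both problems have closed-form solutions. The data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_1d(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_1d(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * |w| (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2) / sxx


# Hypothetical centered data (both lists have mean zero)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-1.9, -1.1, 0.1, 0.9, 2.0]

for lam in (0.0, 5.0, 25.0):
    print(lam, round(ridge_1d(xs, ys, lam), 3), round(lasso_1d(xs, ys, lam), 3))
```

With λ = 0 both estimates equal the ordinary least squares slope. As λ grows, the ridge coefficient shrinks but stays nonzero, while the lasso coefficient reaches exactly zero, which is why lasso can act as a feature selector.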
Polynomial regression expands the linear regression model by adding polynomial terms of
the independent variable x (2.6). This allows the model to fit curves instead of straight
lines, accommodating more complex data patterns.
$$y_i = \omega_0 + \omega_1 x_i + \omega_2 x_i^2 + \cdots + \omega_n x_i^n + \varepsilon_i \tag{2.6}$$
where $y_i$ is the dependent variable (target), $x_i$ is the independent variable
(predictor), the $\omega_j$ are the coefficients, and $\varepsilon_i$ is the error term. Increasing
the degree n makes the model more flexible but also increases the risk of overfitting.
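A useful way to read polynomial regression is that the model stays linear in its coefficients: only the input is expanded into powers of x. The sketch below illustrates this with hypothetical coefficients for a yield curve that rises and then falls; it shows prediction only, not fitting.

```python
def polynomial_features(x, degree):
    """Return [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]


def predict(coefficients, x):
    """Evaluate w0 + w1*x + ... + wn*x^n as a dot product with the expanded features."""
    features = polynomial_features(x, len(coefficients) - 1)
    return sum(w * f for w, f in zip(coefficients, features))


# Hypothetical coefficients (w0, w1, w2) for a degree-2 model
coefficients = [2.0, 0.8, -0.05]

print(polynomial_features(2, 3))
for x in (0, 4, 8, 12):
    print(x, round(predict(coefficients, x), 2))
```

Because the prediction is a weighted sum of the expanded features, the same least squares machinery used for multiple linear regression can estimate the coefficients.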
There are several types of classification methods employed in data science and machine
learning, each suitable for different contexts. Common types include :
— Logistic Regression
— Naive Bayes
2.3. CLASSIFICATION ALGORITHMS 39
Logistic regression is a statistical method used for binary classification tasks, where the
goal is to predict one of two possible outcomes. It models the probability that a given
input belongs to a particular class, making it an essential tool for various classification
problems.
Unlike linear regression, which predicts continuous values, logistic regression predicts the
probability of an outcome that can only take on two discrete values (e.g., yes/no,
true/false, success/failure). The model uses a logistic function (sigmoid function) to map
predicted values to probabilities (see Figure 2.6).
In logistic regression, the relationship between the independent variables (predictors) and
the dependent variable (binary target) is modeled using the logistic function :
$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}} \tag{2.7}$$
$$\operatorname{Logit}(P) = \ln\frac{P}{1 - P} = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n \tag{2.8}$$
Where :
— P (Y = 1|X) is the probability that the target variable Y equals 1 given the predictors
X.
— β0 is the intercept.
The logistic function outputs a value between 0 and 1, which can be interpreted as the
probability of the target variable being 1.
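Equations (2.7) and (2.8) are inverses of each other, which the sketch below makes explicit. The coefficients, the two predictors, and the 0.5 decision threshold are hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(intercept, coefs, x):
    """P(Y = 1 | X) as in equation (2.7)."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)


def logit(p):
    """Log-odds as in equation (2.8); the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients for two predictors (e.g., rainfall in mm, temperature in °C)
beta0, betas = -4.0, [0.01, 0.1]

p = predict_proba(beta0, betas, [300, 20])  # linear score: -4 + 3 + 2 = 1
label = 1 if p >= 0.5 else 0                # assumed decision threshold of 0.5
print(round(p, 3), label, round(logit(p), 3))
```

Applying `logit` to the predicted probability recovers the linear score, which is why each coefficient can be read as a change in log-odds per unit of its predictor.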
Logistic regression is widely used in various fields, including healthcare for disease pre-
diction, finance for credit scoring, marketing for customer segmentation, and many more,
due to its simplicity, interpretability, and effectiveness in binary classification tasks.
The k-NN algorithm operates on the principle of similarity, often measured using distance
metrics such as Euclidean distance, Manhattan distance, or Minkowski distance. The Eu-
clidean distance between two points x = (x1 , x2 , . . . , xn ) and y = (y1 , y2 , . . . , yn ) is given
by :
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{2.9}$$
1. Compute the distance between the query instance and all the training samples.
2. Select the k training samples closest to the query instance.
3. Assign the most common class among the k-nearest neighbors to the query instance.
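The k-NN procedure can be sketched directly from the Euclidean distance in (2.9): compute all distances, keep the k closest training samples, and take a majority vote. The training points below, (soil pH, temperature) pairs labeled with a yield class, are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points, as in equation (2.9)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Classify the query by majority vote among its k nearest training samples."""
    # Distance from the query to every training sample
    distances = [(euclidean(features, query), label) for features, label in train]
    # Keep the k closest samples
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Most common class among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (soil pH, temperature in °C) -> yield class
train = [
    ((6.5, 20.0), "high"), ((6.8, 22.0), "high"), ((6.6, 21.0), "high"),
    ((5.0, 30.0), "low"), ((5.2, 31.0), "low"), ((4.9, 29.0), "low"),
]

print(knn_predict(train, (6.4, 21.5), k=3))
print(knn_predict(train, (5.1, 29.5), k=3))
```

In practice the features are usually rescaled first, since a feature with a larger numeric range (here temperature) would otherwise dominate the distance.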
Regression-classification algorithms are versatile models that can be used for both regres-
sion (predicting continuous values) and classification (predicting discrete classes) tasks.
These algorithms are particularly powerful as they can handle a variety of data types and
structures. In this section, we will explore some of the most commonly used regression-
classification algorithms, including Decision Trees (DT), Random Forest (RF), Support Vec-
tor Machines (SVM), Boosting, and Bagging.
Decision Trees are versatile and interpretable models that can be used for both regression
and classification tasks. They split the data into subsets based on the value of input fea-
tures, creating a tree-like structure where each internal node represents a feature, each
branch represents a decision rule, and each leaf node represents an outcome.
Figure 2.8 illustrates the key terminology associated with decision trees.
A decision tree begins with a root node representing the entire population or sample, which
is then split into two or more homogeneous groups through a process called splitting. Sub-
nodes that further divide are known as decision nodes, while those that do not are called
— CART (Classification and Regression Trees) : CART is a versatile algorithm that can
construct both classification and regression trees. It uses binary splits and measures
like Gini index or Mean Squared Error to create robust decision trees.
We have established that decision trees can be used for both regression and classification
tasks. Let’s delve into the algorithms behind the various types of decision trees.
In classification tasks, Decision Trees aim to predict the class label of an instance based on
its features (see Figure 2.9). The algorithm recursively splits the dataset into subsets that
are as homogeneous as possible concerning the target class. The most common measures
for selecting the best split are Gini impurity and information gain.
The classification process can be summarized by the following equation for a leaf node :
$$\hat{y} = \arg\max_{c \in C} P(y = c \mid X) \tag{2.10}$$
where ŷ is the predicted class, C is the set of all classes, and P (y = c|X) is the probability
of class c given the features X.
In regression tasks, Decision Trees predict continuous values. The tree structure is built
in a similar way, but instead of predicting class labels, the algorithm predicts a numerical
value at each leaf node (see Figure 2.10). The predicted value for a new instance is the
average of the target values in the corresponding leaf node.
2.4. REGRESSION-CLASSIFICATION ALGORITHMS 45
\hat{y} = \frac{1}{N_t} \sum_{i \in T} y_i \qquad (2.11)
where ŷ is the predicted value, Nt is the number of training instances in the leaf node T ,
and yi are the target values of the instances in T .
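The two leaf rules can be sketched in a few lines of Python. This is a minimal illustration, assuming the training targets that fall into a leaf are available as a plain list and that the class probabilities are estimated by class frequencies in the leaf; the names `leaf_class` and `leaf_value` are hypothetical and do not come from the course repository.

```python
from collections import Counter

def leaf_class(labels):
    """Classification leaf: return the most frequent class among the leaf's samples."""
    return Counter(labels).most_common(1)[0][0]

def leaf_value(targets):
    """Regression leaf: return the mean of the leaf's target values (Equation 2.11)."""
    return sum(targets) / len(targets)

# Hypothetical leaf contents
predicted_class = leaf_class(["healthy", "diseased", "healthy"])
predicted_yield = leaf_value([2.0, 3.0, 4.0])
```

Here `predicted_class` is "healthy" (two votes against one) and `predicted_yield` is 3.0.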
ID3 Algorithm
5. For each attribute x, calculate the entropy with respect to the attribute.
7. Remove the attribute that offers the highest information gain from the set of
attributes.
8. Repeat until all attributes are exhausted or the decision tree consists entirely
of leaf nodes.
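The entropy and information-gain computations at the heart of ID3 can be sketched as follows. This is only an illustration of the split criterion, not a full tree builder; the toy irrigation data and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: does a plot need irrigation?
rows = [
    {"soil": "dry", "season": "summer"},
    {"soil": "dry", "season": "winter"},
    {"soil": "wet", "season": "summer"},
    {"soil": "wet", "season": "winter"},
]
labels = ["yes", "yes", "no", "no"]

# ID3 splits on the attribute with the highest information gain.
best = max(["soil", "season"], key=lambda a: information_gain(rows, labels, a))
```

In this toy dataset the attribute "soil" separates the classes perfectly (gain of 1 bit), while "season" carries no information (gain of 0), so `best` is "soil".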
Random Forest is an ensemble learning method that combines the predictions of multiple
decision trees to improve robustness and accuracy (see Figure 2.11). It can be used for both
classification and regression tasks. The algorithm creates a "forest" of decision trees, each
trained on a different random subset of the data, and then aggregates their predictions.
In classification tasks, Random Forest builds multiple decision trees and merges their re-
sults to obtain a more accurate and stable prediction. Each tree in the forest outputs a class
prediction, and the class with the most votes becomes the final prediction.
The classification prediction for Random Forest can be described by the following equa-
tion :
\hat{y} = \arg\max_{c \in C} \sum_{b=1}^{B} I(h_b(X) = c) \qquad (2.12)
where ŷ is the predicted class, C is the set of all classes, B is the number of trees in the
forest, hb (X) is the prediction of the b-th tree, and I is an indicator function that equals 1
if hb (X) = c and 0 otherwise.
In regression tasks, Random Forest predicts continuous values by averaging the predictions
of individual trees. Each tree provides a numerical prediction, and the final output is the
mean of all the tree predictions.
The regression prediction for Random Forest is given by :
\hat{y} = \frac{1}{B} \sum_{b=1}^{B} h_b(X) \qquad (2.13)
where ŷ is the predicted value, B is the number of trees, and hb (X) is the prediction of the
b-th tree.
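The two aggregation rules (Equations 2.12 and 2.13) can be sketched as below. The "trees" here are hand-written threshold functions standing in for trained decision trees, and the feature names are hypothetical; only the voting and averaging steps are the point of the example.

```python
from collections import Counter

def forest_classify(trees, x):
    """Equation 2.12: each tree h_b votes for a class; the most voted class wins."""
    votes = Counter(h(x) for h in trees)
    return votes.most_common(1)[0][0]

def forest_regress(trees, x):
    """Equation 2.13: the prediction is the mean of the individual tree outputs."""
    return sum(h(x) for h in trees) / len(trees)

# Stand-ins for B = 3 trained trees (assumed, for illustration only)
class_trees = [
    lambda x: "high" if x["rain_mm"] > 300 else "low",
    lambda x: "high" if x["nitrogen"] > 50 else "low",
    lambda x: "high" if x["rain_mm"] > 400 else "low",
]
value_trees = [lambda x: 3.0, lambda x: 4.0, lambda x: 5.0]

sample = {"rain_mm": 350, "nitrogen": 60}
yield_class = forest_classify(class_trees, sample)
yield_value = forest_regress(value_trees, sample)
```

For this sample two of the three trees vote "high", so `yield_class` is "high", and `yield_value` is the mean 4.0. When two classes tie, `Counter.most_common` returns the one counted first.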
Support Vector Machines (SVM) are powerful supervised learning models used for both
classification and regression tasks (see Figure 2.12). SVMs work by finding the hyperplane
that best separates the data into classes or fits the data in the case of regression. They are
particularly effective in high-dimensional spaces and are known for their robustness and
accuracy.
FIGURE 2.12 – Overview of SVM algorithm : (a) SVM for classification ; (b) SVM for regression
Source : liang2020machine
In classification tasks, SVM aims to find the optimal hyperplane that maximizes the margin
between the different classes. The support vectors are the data points that are closest to the
hyperplane and influence its position and orientation. The most common kernel functions
used in SVM include linear, polynomial, and radial basis function (RBF).
The classification decision function for SVM can be described by the following equation :
f(X) = \operatorname{sign}(w \cdot x + b) \qquad (2.14)
where w is the weight vector, x is the input feature vector, and b is the bias term.
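As a small sketch of how a trained linear SVM classifies a point, assume the usual linear decision rule: the predicted class is the sign of w · x + b. The weights below are made up for illustration; in practice they come from solving the margin-maximization problem.

```python
def svm_decide(w, x, b):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical trained parameters
w, b = [1.0, -1.0], -0.5

label_a = svm_decide(w, [2.0, 0.0], b)   # score = 1.5
label_b = svm_decide(w, [0.0, 2.0], b)   # score = -2.5
```

The first point has a positive score and is assigned +1; the second has a negative score and is assigned -1. Assigning points exactly on the hyperplane to +1 is a convention chosen here.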
In regression tasks, SVM, also known as Support Vector Regression (SVR), tries to fit the
best line within a threshold value, known as the epsilon (ε) margin. The objective is to
ensure that most of the data points lie within this margin.
The regression function for SVR is given by :
f (X) = w · x + b (2.15)
where w is the weight vector, x is the input feature vector, and b is the bias term. The goal
is to minimize the following loss function :
Ensemble learning enhances machine learning results by combining multiple models, lea-
ding to better predictive performance than a single model. The basic concept involves
training a set of classifiers (experts) and allowing them to vote. Two key types of ensemble
learning are Bagging and Boosting. Both techniques reduce the variance of individual es-
timates by combining multiple estimates from different models, resulting in a model with
greater stability (see Figure 2.13) :
— Bagging : This method involves training multiple homogeneous weak learners in-
dependently and in parallel, then averaging their predictions to determine the final
model output.
Next, we will examine Bagging and Boosting in more detail and highlight their differences.
Boosting
Boosting is an ensemble modeling technique that aims to create a strong classifier from
several weak classifiers by building models sequentially. Initially, a model is constructed
using the training data. The second model is then created to correct the errors present
in the first model. This process continues, with each subsequent model focusing on the
residuals of the previous one, until the entire training dataset is accurately predicted or
the maximum number of models is reached.
There are several boosting algorithms, among which the most notable include AdaBoost,
Gradient Boosting, and XGBoost. The original boosting algorithms, proposed by Robert
Schapire and Yoav Freund, were not adaptive and could not fully exploit the weak lear-
ners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that
won the prestigious Gödel Prize. AdaBoost, short for Adaptive Boosting, was the first suc-
cessful boosting algorithm developed for binary classification, combining multiple “weak
classifiers” into a single “strong classifier”.
Boosting Algorithm
FIGURE 2.14 – An illustration presenting the intuition behind the boosting algorithm,
consisting of the parallel learners and weighted dataset
Source : www.geeksforgeeks.org
Bagging
Imagine we have a dataset D consisting of d tuples. For each iteration i, we create a trai-
ning set Di by randomly sampling d tuples from D with replacement (this process is known
as bootstrapping, and it allows for duplicate entries in Di ). We then train a classifier model
Mi on this bootstrapped training set Di . Each classifier Mi provides a prediction. To deter-
mine the final prediction, the bagged model M ∗ aggregates these predictions, typically by
voting in the case of classification tasks, and assigns the most common prediction to the
unknown sample X.
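The bootstrapping and voting steps described above can be sketched as follows. The base learner is a deliberately simple one-nearest-neighbor rule on a single feature, chosen only to keep the example short; the data, the function names, and the number of models are assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) tuples from data with replacement, so duplicates are allowed."""
    return [rng.choice(data) for _ in data]

def nearest_neighbor_model(sample):
    """Base learner M_i (illustrative): predict the label of the closest training value."""
    return lambda x: min(sample, key=lambda pair: abs(pair[0] - x))[1]

def bagged_predict(data, x, n_models=11, seed=0):
    """Train n_models learners on bootstrap samples and return the majority vote for x."""
    rng = random.Random(seed)
    models = [nearest_neighbor_model(bootstrap_sample(data, rng)) for _ in range(n_models)]
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: (soil moisture in %, irrigation decision)
data = [(10, "irrigate"), (15, "irrigate"), (20, "irrigate"),
        (60, "skip"), (65, "skip"), (70, "skip")]
prediction = bagged_predict(data, x=12)
```

Each bootstrap sample has the same size as the original dataset but may omit some tuples and repeat others, which is what makes the individual models differ.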
Bagging Algorithm
The Random Forest model applies Bagging to decision trees, which are
high-variance models. In addition, it selects a random subset of
features when growing each tree. A collection of such randomized
trees makes up a Random Forest.
Naive Bayes is a family of simple and effective probabilistic classification algorithms based
on Bayes’ Theorem. It assumes that the features are conditionally independent given the
class label, which simplifies the computation and makes it a fast and scalable solution for
various classification problems.
Naive Bayes calculates the probability of each class given a set of features using Bayes’
Theorem :
P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)} \qquad (2.17)
Where :
Given the independence assumption, the likelihood P (X|C) is decomposed into the pro-
duct of the probabilities of individual features :
P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C) \qquad (2.18)
This simplifies the computation, as the algorithm only needs to estimate the probabilities
of individual features given the class.
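Equations 2.17 and 2.18 can be illustrated with a small categorical example. This sketch estimates all probabilities by relative frequencies without smoothing (an assumption; practical implementations usually add Laplace smoothing), and it drops P(X) because it is the same for every class. The weather and soil data are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and, per (feature index, class), feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Return the class maximizing P(C) * prod_i P(x_i | C), together with all scores."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                                  # prior P(C)
        for i, value in enumerate(row):
            score *= feature_counts[(i, c)][value] / n_c     # likelihood P(x_i | C)
        scores[c] = score
    return max(scores, key=scores.get), scores

rows = [("sunny", "dry"), ("sunny", "wet"), ("rainy", "wet"), ("rainy", "wet")]
labels = ["irrigate", "irrigate", "skip", "skip"]
model = train_naive_bayes(rows, labels)
label, scores = predict(("sunny", "wet"), *model)
```

For the query ("sunny", "wet") the score of "irrigate" is 0.5 × 1 × 0.5 = 0.25, while the score of "skip" is 0 because no "skip" example was sunny, so the prediction is "irrigate".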
thods, GPR does not assume a fixed form for the underlying function and instead defines
a distribution over possible functions, making it highly flexible (see Figure 2.17).
Gaussian Process Regression models the relationship between the input X and the output
y by defining a Gaussian process prior over functions (2.19). This prior is characterized
by a mean function m(X) and a covariance function k(X, X ′ ), which encodes assumptions
about the function’s smoothness and other properties.
Where GP denotes the Gaussian process, m(X) is the mean function, and k(X, X ′ ) is
the covariance function. The choice of the covariance function k significantly affects the
model’s predictions and flexibility.
Bayesian Linear Regression modifies the standard linear regression by incorporating prior
distributions over the model parameters (2.20). The posterior distributions are obtained
by combining these priors with the likelihood of the observed data, resulting in a more
robust model, especially with limited data.
Where p(ω|x, y) is the posterior distribution of the parameters given the data, p(y|x, ω)
is the likelihood of the data given the parameters, and p(ω) is the prior distribution of
Unsupervised Learning
Unsupervised learning is a machine learning approach where models are trained without
the need for labeled datasets. These models autonomously uncover hidden patterns and
insights within the provided data, much like the human brain processes new information.
It can be described as follows :
62 CHAPTER 3. UNSUPERVISED LEARNING
— It excels at deriving valuable insights from data, revealing hidden patterns and struc-
tures.
— The process mirrors human experiential learning, thereby aligning closely with the
core principles of artificial intelligence.
— It effectively handles unlabeled and uncategorized data, which enhances its practical
applicability.
— Clustering : This approach involves grouping objects into clusters where items wi-
thin each cluster exhibit high similarity, while significantly differing from items in
other clusters. Cluster analysis uncovers commonalities among data objects and or-
ganizes them based on these shared characteristics.
— K-means clustering
— Hierarchical clustering
— Anomaly detection
— Neural networks
3.1. K-MEANS CLUSTERING 63
— Apriori algorithm
1. Let X be the m × n data matrix, where m is the number of samples and n is the
number of features.
3. Assign Points to Clusters : For each data point xi , compute the Euclidean distance
to each centroid cj :
d(x_i, c_j) = \sqrt{\sum_{f=1}^{n} (x_{i,f} - c_{j,f})^2} \qquad (3.1)
4. Update Centroids : Recalculate the centroid of each cluster as the mean of all points
assigned to that cluster :
c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \qquad (3.2)
where Cj is the set of points in the j-th cluster and |Cj | is the number of points in
cluster j.
5. Repeat : Repeat steps 3 and 4 until the centroids do not change significantly :
c_j^{(t+1)} = c_j^{(t)} \quad \forall j \qquad (3.3)
K-Means Algorithm
3. Assign Points to Clusters : Assign each data point to the nearest centroid,
forming k clusters.
4. Update Centroids : Calculate the new centroids as the mean of all points in
each cluster.
5. Repeat : Repeat the assignment and update steps until the centroids stabilize
or a maximum number of iterations is reached.
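The assignment and update steps can be sketched in plain Python as follows. To keep the example reproducible, the initial centroids are passed in explicitly rather than chosen at random; this choice, the function name, and the toy points are assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Run K-means from the given initial centroids; return (centroids, assignments)."""
    assignments = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid (Euclidean distance, Eq. 3.1).
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points (Eq. 3.2).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its previous centroid
        if new_centroids == centroids:     # stopping rule (Eq. 3.3)
            break
        centroids = new_centroids
    return centroids, assignments

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
```

On these four points the procedure stabilizes after two passes, with centroids (1.0, 1.5) and (8.5, 8.0) and assignments [0, 0, 1, 1]. The function `math.dist` requires Python 3.8 or later.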
1. Let X be the m × n data matrix, where m is the number of samples and n is the
number of features. Let y be the m-dimensional vector of target values.
2. Distance Calculation : For a new data point xnew , compute the Euclidean distance
3.2. K-NEAREST NEIGHBORS (KNN) 65
where xnew,j is the j-th feature of the new data point and xi,j is the j-th feature of
the i-th data point.
3. Find the Nearest Neighbors : Identify the k data points with the smallest distances
to xnew .
4. Make a Prediction :
For classification : Assign the class label ŷnew based on majority voting among the k
nearest neighbors :
where yi1 , yi2 , . . . , yik are the target values of the k nearest neighbors.
For regression : Compute the average target value ŷnew of the k nearest neighbors :
\hat{y}_{new} = \frac{1}{k} \sum_{j=1}^{k} y_{i_j} \qquad (3.6)
KNN Algorithm
2. Compute the Distance : Calculate the distance between the new data point
and all existing data points using a suitable distance metric (e.g., Euclidean
distance).
3. Find the Nearest Neighbors : Identify the k nearest neighbors to the new data
point based on the computed distances.
4. Make a Prediction :
— For classification : Assign the class label that is most common among the
k nearest neighbors (majority voting).
— For regression : Compute the average (or weighted average) of the target
values of the k nearest neighbors.
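The steps above fit in a short function. This is a minimal sketch using Euclidean distance and an unweighted vote or average; the function name `knn_predict` and the toy data are assumptions.

```python
import math
from collections import Counter

def knn_predict(X, y, x_new, k=3, task="classification"):
    """Predict the target of x_new from its k nearest training points."""
    # Sort training indices by Euclidean distance to x_new and keep the k closest.
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_new))[:k]
    neighbor_targets = [y[i] for i in nearest]
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]  # majority vote
    return sum(neighbor_targets) / k                           # mean of targets (Eq. 3.6)

X = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2)]
y_class = ["A", "A", "B", "B", "A"]
y_value = [1.0, 2.0, 10.0, 12.0, 3.0]

predicted_class = knn_predict(X, y_class, (1.5, 1.5), k=3)
predicted_value = knn_predict(X, y_value, (1.5, 1.5), k=3, task="regression")
```

The three points closest to (1.5, 1.5) are (1, 1), (2, 1) and (1, 2), so the predicted class is "A" and the predicted value is (1 + 2 + 3) / 3 = 2.0.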
1. Let X be the m × n data matrix, where m is the number of samples and n is the
number of features.
3. Start with each data point as its own cluster. Let C = {C1 , C2 , . . . , Cm } be the initial
set of clusters.
4. Find the closest pair of clusters (Ci , Cj ) and merge them into a new cluster Cij .
Update the set of clusters :
5. Update the distance matrix to reflect the new distances between the merged cluster
and the remaining clusters. Common linkage methods include :
3.3. HIERARCHICAL CLUSTERING 67
Average Linkage :
where |Ci | and |Cj | are the sizes of clusters Ci and Cj respectively.
6. Repeat steps 4 and 5 until all data points are in a single cluster.
1. Calculate the Distance Matrix : Compute the pairwise distance between all
data points using a suitable distance metric (e.g., Euclidean distance).
2. Merge Clusters : Starting with each data point as its own cluster, iteratively
merge the two closest clusters until all points are in a single cluster.
3. Update the Distance Matrix : After each merge, update the distance matrix to
reflect the new distances between clusters.
4. Repeat : Continue merging clusters and updating the distance matrix until
only one cluster remains, creating a dendrogram in the process.
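The merging loop can be sketched as below, assuming average linkage (the mean pairwise distance between the points of two clusters). For brevity the sketch recomputes linkage distances at every step instead of maintaining a distance matrix, which is simpler but slower; names and data are illustrative.

```python
import math
from itertools import combinations

def average_linkage(a, b):
    """Mean pairwise Euclidean distance between the points of clusters a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]   # start with each point as its own cluster
    history = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = average_linkage(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical one-feature samples
history = agglomerate([(0.0,), (1.0,), (5.0,), (6.0,)])
merge_distances = [step[2] for step in history]
```

The merge history is exactly the information a dendrogram displays: here the two tight pairs are merged first at distance 1.0 each, and the two resulting clusters are joined last at distance 5.0.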
We apply hierarchical clustering to this dataset to group the soil samples based
on their properties.
Principal Component Analysis (PCA) is a statistical technique used for dimensionality re-
duction, data visualization, and feature extraction. It transforms the original features into
a new set of uncorrelated features called principal components, which capture the maxi-
mum variance in the data.
The mathematical formulation of PCA is as follows :
1. Let X be the m × n data matrix, where m is the number of samples and n is the
number of features.
X_c = X - \mu \qquad (3.12)
C = \frac{1}{m-1} X_c^{T} X_c \qquad (3.13)
3.4. PRINCIPAL COMPONENT ANALYSIS (PCA) 69
4. Perform eigen decomposition on the covariance matrix to find the eigenvalues and
eigenvectors :
Cv = λv (3.14)
where v are the eigenvectors (principal components) and λ are the eigenvalues (va-
riance explained by each principal component).
5. Sort the eigenvalues in descending order and select the top k eigenvectors corres-
ponding to the largest eigenvalues to form the projection matrix W.
Z = Xc W (3.15)
PCA Algorithm
1. Standardize the Data : Scale the data so that each feature has a mean of 0
and a standard deviation of 1.
4. Sort and Select Principal Components : Sort the principal components based
on their eigenvalues in descending order and select the top k components that
capture the most variance.
5. Transform the Data : Project the original data onto the selected principal
components to obtain the reduced-dimension dataset.
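As a dependency-free sketch of these steps, the code below centers the data, builds the covariance matrix, and extracts only the first principal component. Two simplifications are assumed: the data are centered but not scaled to unit standard deviation, and power iteration replaces a full eigendecomposition (it finds the dominant eigenvector as long as the starting vector is not orthogonal to it). In practice a linear algebra library would be used instead.

```python
import math

def first_principal_component(X, n_iter=100):
    """Return (direction, eigenvalue, scores) for the first principal component of X."""
    m, n = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / m for j in range(n)]
    Xc = [[row[j] - mean[j] for j in range(n)] for row in X]           # centering
    C = [[sum(r[a] * r[b] for r in Xc) / (m - 1) for b in range(n)]    # covariance matrix
         for a in range(n)]
    v = [1.0 / math.sqrt(n)] * n                                       # starting vector
    for _ in range(n_iter):
        w = [sum(C[a][b] * v[b] for b in range(n)) for a in range(n)]  # w = C v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(C[a][b] * v[b] for b in range(n)) for a in range(n))
    scores = [sum(r[j] * v[j] for j in range(n)) for r in Xc]          # projection, k = 1
    return v, eigenvalue, scores

# Points lying exactly on the line y = 2x
direction, variance, scores = first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

Because the three points lie on a line of slope 2, the recovered direction is proportional to (1, 2) and the corresponding eigenvalue (about 5) accounts for all of the variance.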
1. Center the Data : Subtract the mean of each variable to ensure the data has zero
mean.
Xc = X − E[X] (3.16)
2. Whiten the Data : Transform the centered data so that its covariance matrix is the
identity matrix, making the data uncorrelated and with unit variance.
X_{whitened} = V \Lambda^{-\frac{1}{2}} V^{T} X_c \qquad (3.17)
where V and Λ are the eigenvector and eigenvalue matrices of the covariance matrix
of Xc .
3.6. APRIORI ALGORITHM 71
3. Apply ICA : Decompose the whitened data Xwhitened into a product of mixing matrix
A and independent components S :
Xwhitened = AS (3.18)
The goal is to estimate A and S such that the components in S are as statistically
independent as possible.
ICA Algorithm
1. Center and Whiten the Data : Preprocess the data by centering (subtracting
the mean) and whitening (decorrelating and scaling) to make the variance
equal across dimensions.
2. Apply ICA : Use an ICA algorithm to separate the mixed signals into inde-
pendent components.
The Apriori algorithm is a popular method used in association rule mining to identify
frequent itemsets and generate association rules from a dataset. It is widely utilized in
market basket analysis and other fields to discover relationships among variables.
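The frequent-itemset part of Apriori can be sketched as follows, assuming transactions are given as Python sets and support is the fraction of transactions containing an itemset. Rule generation and confidence are left out; the function name and the basket data are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        # Candidates: unions of frequent k-itemsets that contain exactly k + 1 items.
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: keep a candidate only if all its (k-1)-item subsets are frequent
        # and its own support reaches the threshold.
        current = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(baskets, min_support=0.5)
```

With a minimum support of 0.5, all three single items and the pairs {bread, milk} and {bread, butter} are frequent. The pair {milk, butter} is not, so the three-item set is pruned without its support ever being counted.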
The mathematical formulation of the Apriori algorithm involves the following steps :
1. Construct the Data Matrix : Let X be the m × n data matrix, where m is the number
of samples and n is the number of features.
X = UΣVT (3.22)
where :
U is an m × m orthogonal matrix whose columns are the left singular vectors of X.
Σ is an m × n diagonal matrix whose diagonal elements are the singular values of X.
V is an n × n orthogonal matrix whose columns are the right singular vectors of X.
3. Truncate the Matrices : Retain only the top k singular values in Σ and the cor-
responding columns in U and V, resulting in the truncated matrices Uk , Σk , and
Vk :
X ≈ Uk Σk VkT (3.23)
Xk = Uk Σk VkT (3.24)
SVD Algorithm
1. Construct the Data Matrix : Form the m × n data matrix X, where m is the
number of samples and n is the number of features.
3. Truncate the Matrices : Retain only the top k singular values and correspon-
ding vectors to reduce dimensionality.
Method | Advantages | Limitations
 | Works well with spherical clusters | Not suitable for non-spherical clusters
Hierarchical Clustering | No need to specify number of clusters in advance | Computationally intensive for large datasets
Independent Component Analysis (ICA) | Finds statistically independent components ; Useful for blind source separation | Sensitive to noise ; Computationally expensive
K-nearest Neighbors (KNN) | Simple and intuitive | Computationally expensive for large datasets
Apriori Algorithm | Easy to implement ; Provides valuable insights into data patterns | Computationally intensive ; Requires large support and confidence thresholds
 | Useful for fraud detection and fault diagnosis | Requires well-defined normal behavior
Neural Networks | Can model complex patterns | Requires large datasets ; Prone to overfitting
Introduction
In recent years, neural networks and deep learning have revolutionized the field of ar-
tificial intelligence, enabling significant advancements in various domains such as image
recognition, natural language processing, and autonomous systems. This chapter provides
a comprehensive overview of these powerful computational frameworks, exploring both
foundational concepts and advanced architectures.
The concept of "Artificial Neural Network" (ANN) is inspired by biological neural networks
that constitute the human brain’s architecture. Just as neurons in the brain are intercon-
nected, artificial neural networks feature nodes (neurons) linked across various layers.
As illustrated in the accompanying figure, biological neural networks have dendrites that
represent inputs in ANNs, cell nuclei that correspond to nodes, synapses that signify weights,
and axons that represent outputs (see Figure 4.1 and table 4.1).
78 CHAPTER 4. NEURAL NETWORKS AND DEEP LEARNING
— Input Layer : This layer accepts diverse input formats as specified by the program-
mer.
— Hidden Layer : Situated between the input and output layers, the hidden layer per-
forms calculations to uncover latent features and patterns.
— Output Layer : After passing through the hidden layer, the transformed inputs yield
the final output.
The ANN computes the weighted sum of inputs plus a bias, represented as a transfer func-
tion. The weighted total is fed into an activation function to determine node activation,
allowing only activated nodes to contribute to the output layer. Various activation func-
tions are available depending on the task (see Figure 4.2).
Artificial Neural Networks (ANNs) can be conceptualized (see Figure 4.3) as weighted
directed graphs where neurons serve as nodes, and the connections (edges) between them
possess weights. Inputs, represented as patterns or vectors, are received from external
sources and mathematically denoted as x(n).
Each input is multiplied by its respective weight, signifying the strength of interconnections.
The weighted inputs are then aggregated in a computational unit, and a bias term is added to
the sum. The bias can be viewed as the weight of an extra input fixed at one ; it shifts the
activation threshold, so that the unit can still respond when the weighted sum is zero, and
helps keep responses within desired limits.
The weighted inputs are processed through an activation function, which can be linear
or non-linear. Common activation functions include binary, linear, and hyperbolic tangent
sigmoidal functions (see Figure 4.4).
A specific type of ANN is built around a unit known as a perceptron, as depicted in the
accompanying figure. A perceptron processes a vector of real-valued inputs, computes a
linear combination of these inputs, and produces an output of 1 if the result exceeds a
certain threshold, and -1 otherwise. Formally, given inputs x1 to xn , the output o(x1 , . . . , xn )
from the perceptron can be expressed as :
o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i > \theta \\ -1 & \text{otherwise} \end{cases} \qquad (4.1)
Here, each wi represents a real-valued weight that defines how much influence input xi
has on the output. The term θ serves as the threshold that the weighted sum must exceed
4.1. NEURAL NETWORKS 81
\sum_{i=0}^{n} w_i x_i > 0 \qquad (4.2)
In this context, x1 and x2 represent the perceptron inputs, with positive examples mar-
ked by "+" and negative by "-". The outputs from multiple units can be fed into a sub-
sequent layer, and Boolean functions can be expressed in disjunctive normal form as ORs
of conjunctions of inputs and their negations. Negating an input to an AND perceptron can
be accomplished by adjusting the sign of the corresponding weight.
To learn how to adjust weights for a perceptron, we start by selecting random initial
weights and iteratively applying the perceptron to the training examples, modifying weights
when misclassifications occur. This iterative process continues until all training examples
are classified correctly. Weights are updated according to the perceptron training rule :
w_i \leftarrow w_i + \Delta w_i \qquad (4.3)
\Delta w_i = \eta (t - o) x_i \qquad (4.4)
Here, t is the target output for the current example, o is the perceptron’s output, and η is
the learning rate, a small positive constant (often around 0.1), which may decrease over
time.
The intuition behind this update mechanism is straightforward. If the perceptron correctly
classifies a training example, no weight adjustments are needed. Conversely, if the percep-
tron outputs -1 when the target is +1, the weights must be adjusted to increase the output
towards the correct classification.
The learning process can be shown to converge within a finite number of iterations of the
training rule to a weight vector that accurately classifies all training examples, given that
the examples are linearly separable and a sufficiently small η is used. However, if the data
is not linearly separable, convergence is not guaranteed.
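The training procedure can be sketched as follows, assuming the standard form of the perceptron rule, in which each weight changes by η (t − o) x_i after a misclassified example. Weights start at zero here instead of at random values so that the run is reproducible, and the threshold is handled as a weight w_0 on a constant input x_0 = 1. The logical AND function serves as a linearly separable toy problem.

```python
def perceptron_output(weights, x):
    """Return +1 if w0 + sum_i w_i x_i > 0, else -1 (x is extended with x0 = 1)."""
    total = sum(w * xi for w, xi in zip(weights, [1.0] + list(x)))
    return 1 if total > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply the perceptron training rule until an epoch produces no misclassification."""
    weights = [0.0] * (len(examples[0][0]) + 1)   # w0 plus one weight per input
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = perceptron_output(weights, x)
            if o != t:
                errors += 1
                for i, xi in enumerate([1.0] + list(x)):
                    weights[i] += eta * (t - o) * xi   # update only after a mistake
        if errors == 0:
            break
    return weights

# Logical AND with targets in {-1, +1}
examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights = train_perceptron(examples)
```

Because AND is linearly separable, the loop stops after a few epochs with weights that classify all four examples correctly. On a problem that is not linearly separable, such as XOR, it would simply run until `max_epochs`.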
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad (4.5)
An Artificial Neural Network (ANN) can be designed and implemented in various ways.
The following characteristics define different variants of an ANN :
x = w_0 + w_1 x_1 + \cdots + w_n x_n \qquad (4.6)
The activation function, applied to this weighted sum, determines the neuron’s out-
put. Some commonly used activation functions include sigmoid, tanh, and ReLU.
2. Network Topology : Network topology refers to the patterns and structures within
a collection of interconnected nodes. It dictates the complexity of tasks that the net-
work can learn, with larger and more complex networks generally capable of identi-
fying more subtle patterns and intricate decision boundaries. However, a network’s
effectiveness depends not only on its size but also on how the nodes are arranged.
Key aspects of network architecture include :
— Input Layer : Nodes in this layer receive unprocessed signals from the input
data.
— Hidden Layers : These layers process signals received from previous layers
before passing them to the next layer. The network can have multiple hid-
den layers.
— Hidden Nodes : The number of hidden nodes is chosen by the user and de-
pends on factors such as the number of input nodes, the size of the training
data, the amount of noise in the data, and the complexity of the learning
task.
3. The Training Algorithm : Training an ANN involves adjusting the connection weights
to improve the network’s performance. The two primary algorithms for learning a
single perceptron are the perceptron rule and the delta rule, used depending on whe-
ther the training dataset is linearly separable. The most commonly used algorithm
for training ANNs today is backpropagation, which efficiently updates the weights to
minimize the error between the predicted and actual outputs.
4. The Cost Function : The cost function, also known as the loss function or error
function, quantifies the difference between the network’s predictions and the actual
target values. It measures the performance of the ANN during training.
The cost function is a critical component in the optimization process, guiding the
adjustment of weights to improve the accuracy of the network’s predictions. It may
also be referred to as the objective function or scoring function, depending on the
context.
L(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\hat{y}^{(i)} - y^{(i)})^2 \qquad (4.11)
L(ŷ (i) , y (i) ) = −(y (i) × ln (ŷ (i) ) + (1 − y (i) ) × ln (1 − ŷ (i) )) (4.12)
Cost function : computes the average error over all training samples.
J(\omega, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \times \ln(\hat{y}^{(i)}) + (1 - y^{(i)}) \times \ln(1 - \hat{y}^{(i)}) \right] \qquad (4.14)
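The per-sample losses (4.11) and (4.12) and their average can be written directly in Python. This is a minimal sketch; the function names are illustrative, and the cross-entropy loss assumes predictions strictly between 0 and 1.

```python
import math

def squared_error(y_hat, y):
    """Per-sample squared-error loss (Eq. 4.11)."""
    return 0.5 * (y_hat - y) ** 2

def cross_entropy(y_hat, y):
    """Per-sample binary cross-entropy loss (Eq. 4.12); requires 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(loss, predictions, targets):
    """Average of the per-sample losses over all training samples."""
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / len(targets)

mse_cost = cost(squared_error, [1.0, 3.0], [0.0, 1.0])   # (0.5 + 2.0) / 2
ce_cost = cost(cross_entropy, [0.5, 0.5], [1, 0])        # ln 2 for each sample
```

A prediction of 0.5 is maximally uncertain, so each sample contributes ln 2 ≈ 0.693 to the cross-entropy cost regardless of its label.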
Artificial Neural Networks (ANNs) come in various types, each designed to address specific
types of problems and inspired by different aspects of biological neural systems. The most
common types include :
— Feedback ANN : Outputs are cycled back into the network, optimizing internal re-
sults. This type of ANN is particularly effective for solving optimization problems and
is often used in recurrent neural networks (RNNs).
— Feed-Forward ANN : This type consists of an input layer, an output layer, and at
least one hidden layer. It evaluates input patterns to determine outputs and is the
simplest form of ANN, often used for tasks like image recognition and classification.
Forward propagation, also known as the forward pass, is the process of computing and
storing intermediate variables, including outputs, within a neural network. This calculation
unfolds sequentially from the input layer to the output layer, establishing the foundation
for subsequent stages in the neural network’s operation.
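X = \begin{bmatrix} \vdots & \vdots & & \vdots \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ \vdots & \vdots & & \vdots \end{bmatrix} ; \quad Z = \begin{bmatrix} \vdots & \vdots & & \vdots \\ z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \\ \vdots & \vdots & & \vdots \end{bmatrix} ; \quad A = \begin{bmatrix} \vdots & \vdots & & \vdots \\ a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \\ \vdots & \vdots & & \vdots \end{bmatrix} \qquad (4.15)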
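\varphi^{1} = \begin{cases} Z^{1} = W^{1} X + b^{1} & \text{hidden layer} \\ A^{1} = \phi(Z^{1}) & \text{hidden activation vector} \end{cases} \qquad (4.16)
\omega = W^{2} A^{1} \qquad (4.17)
\Sigma = l(\omega, y) \qquad (4.18)
J = \Sigma + \theta \qquad (4.19)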
1. Initialization :
— Set ai = xi .
2. Forward Pass :
(a) For ν = 2 to L do :
3. Output :
4.1.6 Backpropagation
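\frac{\partial J}{\partial \omega} = \frac{\partial J}{\partial \Sigma} \frac{\partial \Sigma}{\partial \omega} = \frac{\partial \Sigma}{\partial \omega} \qquad (4.23)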
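\frac{\partial J}{\partial W^{2}} = \frac{\partial J}{\partial \omega} \frac{\partial \omega}{\partial W^{2}} + \frac{\partial J}{\partial s} \frac{\partial s}{\partial W^{2}} = \frac{\partial J}{\partial \omega} {A^{1}}^{T} + \lambda W^{2} \qquad (4.25)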
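\frac{\partial J}{\partial A^{1}} = \frac{\partial J}{\partial \omega} \frac{\partial \omega}{\partial A^{1}} = {W^{2}}^{T} \frac{\partial J}{\partial \omega} \qquad (4.26)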
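6.
\frac{\partial J}{\partial Z} = \frac{\partial J}{\partial A^{1}} \frac{\partial A^{1}}{\partial Z} = \frac{\partial J}{\partial A^{1}} \phi'(Z) \qquad (4.27)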
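The chain-rule steps can be checked numerically on a network small enough to follow by hand. The sketch below uses one input, one hidden unit with a tanh activation, one linear output, and a squared-error loss; it leaves out the regularization term, and all parameter values are arbitrary. These choices are assumptions made for illustration.

```python
import math

def forward(x, y, w1, b1, w2):
    """Forward pass: return the intermediate values z, a, omega and the loss J."""
    z = w1 * x + b1                 # hidden pre-activation
    a = math.tanh(z)                # hidden activation
    omega = w2 * a                  # network output
    J = 0.5 * (omega - y) ** 2      # squared-error loss, no regularization
    return z, a, omega, J

def backward(x, y, w1, b1, w2):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    z, a, omega, J = forward(x, y, w1, b1, w2)
    dJ_domega = omega - y
    dJ_dw2 = dJ_domega * a              # output weight
    dJ_da = w2 * dJ_domega              # hidden activation
    dJ_dz = dJ_da * (1.0 - a ** 2)      # through tanh, whose derivative is 1 - tanh^2
    dJ_dw1 = dJ_dz * x                  # hidden weight
    dJ_db1 = dJ_dz                      # hidden bias
    return dJ_dw1, dJ_db1, dJ_dw2

# Compare the analytic gradient with a central finite difference.
x, y, w1, b1, w2 = 0.5, 1.0, 0.8, 0.1, -0.3
dJ_dw1, dJ_db1, dJ_dw2 = backward(x, y, w1, b1, w2)
eps = 1e-6
numeric_dw1 = (forward(x, y, w1 + eps, b1, w2)[3] - forward(x, y, w1 - eps, b1, w2)[3]) / (2 * eps)
```

The analytic value `dJ_dw1` and the finite-difference estimate `numeric_dw1` agree to several decimal places, which is a common sanity check when implementing backpropagation by hand.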
4.2. DEEP LEARNING 89
1. Initialization :
(a) Provide activations ai from the forward pass and the label y.
(b) For i = N − O + 1 to N do :
— Calculate \delta_i = \frac{\partial L(y, x, w)}{\partial a_i} f'(net_i).
2. Backward Pass :
(a) For ν = L − 1 to 2 do :
Deep Architectures (see Figure 4.7) refer to computational models that are composed of
multiple layers of interconnected processing units, commonly known as neurons or nodes.
These architectures are characterized by their depth, meaning they have a significant num-
ber of hidden layers between the input and output layers. Each layer in a deep architecture
performs a specific transformation on the data, extracting increasingly abstract and com-
plex features as the data propagates through the network (heaton2015artificial).
Deep architectures are a hallmark of deep learning, a subfield of machine learning. They
are particularly well-suited for handling high-dimensional data and complex patterns, ma-
king them ideal for tasks such as image and speech recognition, natural language proces-
sing, and game playing. The depth of these networks allows them to learn hierarchical
representations, where each successive layer captures more sophisticated features or re-
presentations of the input data.
The most common types of deep architectures include Convolutional Neural Networks
(CNNs), which are widely used in image and video processing ; Recurrent Neural Net-
works (RNNs) and their variants like Long Short-Term Memory (LSTM) networks, which
are effective in modeling sequential data ; and Deep Belief Networks (DBNs) and Autoen-
coders, which are used for unsupervised learning and feature extraction.
The training of deep architectures typically involves sophisticated optimization techniques
and regularization methods to address challenges such as overfitting, vanishing gradients,
and computational efficiency. Despite these challenges, the ability of deep architectures to
automatically discover relevant features from raw data has led to significant breakthroughs
in various AI applications.
1. Convolutional Layers : The convolutional layer is the core building block of a CNN.
It consists of a set of learnable filters (or kernels) that slide over the input data. The
operation can be mathematically described by :
(f * x)(i, j) = \sum_{m} \sum_{n} x(m, n) \, w(i - m, j - n) \qquad (4.29)
where x is the input, w is the filter (kernel), and (i, j) are the coordinates of the out-
put feature map. Each filter learns to detect specific features such as edges, textures,
or colors.
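Equation 4.29 can be evaluated literally with nested loops. The sketch below computes the full two-dimensional convolution, so the output is larger than the input, and treats kernel entries outside the filter as zero; the function name and the small matrices are illustrative. Note that most deep learning libraries implement cross-correlation (no kernel flip) under the name convolution, which makes no practical difference when the filter is learned.

```python
def conv2d_full(x, w):
    """Full 2D convolution: out(i, j) = sum_m sum_n x(m, n) * w(i - m, j - n)."""
    H, W = len(x), len(x[0])
    kh, kw = len(w), len(w[0])
    out = [[0.0] * (W + kw - 1) for _ in range(H + kh - 1)]
    for i in range(H + kh - 1):
        for j in range(W + kw - 1):
            for m in range(H):
                for n in range(W):
                    # w(i - m, j - n) contributes only when the index falls inside the kernel
                    if 0 <= i - m < kh and 0 <= j - n < kw:
                        out[i][j] += x[m][n] * w[i - m][j - n]
    return out

image = [[1, 2],
         [3, 4]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_full(image, kernel)
```

With this kernel each output entry equals x(i, j) − x(i − 1, j − 1), treating out-of-range inputs as zero, which gives [[1, 2, 0], [3, 3, -2], [0, -3, -4]].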
2. Activation Functions : After convolution, the feature map is passed through an acti-
vation function to introduce non-linearity into the model. The most commonly used
3. Pooling Layers : Pooling layers reduce the spatial dimensions (width and height) of
the feature maps while retaining the most important information. A common opera-
tion is max pooling, which is defined as :
where y(i, j) is the pooled output and x(i, j) is the input feature map. This operation
helps to achieve spatial invariance.
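Max pooling can be sketched as follows, assuming the common configuration of non-overlapping 2 × 2 windows with a stride of 2; the window size and stride are parameters of this illustrative function, not values fixed by the text.

```python
def max_pool2d(x, size=2, stride=2):
    """Replace each size x size window of the feature map by its maximum value."""
    rows = range(0, len(x) - size + 1, stride)
    cols = range(0, len(x[0]) - size + 1, stride)
    return [
        [max(x[i + di][j + dj] for di in range(size) for dj in range(size)) for j in cols]
        for i in rows
    ]

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
pooled = max_pool2d(feature_map)
```

The 4 × 4 feature map is reduced to the 2 × 2 map [[6, 4], [7, 9]]: each entry is the largest value in the corresponding window, so small shifts of a feature inside a window do not change the output.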
4. Fully Connected Layers : In the final stages of a CNN, fully connected layers are
used to combine the features learned by the convolutional layers across the entire
image. The output from the previous layers is flattened into a vector and fed into the
fully connected layers, leading to the final classification output. The output layer uses
a softmax activation function for multi-class classification, which can be represented
as :
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \qquad (4.32)
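The softmax function of Equation 4.32 takes only a few lines. Subtracting the largest score before exponentiating is a standard numerical precaution that leaves the result unchanged; the example scores are arbitrary.

```python
import math

def softmax(z):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(z)                              # avoids overflow in exp, result is unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
predicted_index = probabilities.index(max(probabilities))
```

The largest score receives the largest probability, so `predicted_index` is 2, and two equal scores such as `softmax([0.0, 0.0])` are mapped to 0.5 each.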
Operation of CNNs
The operation of a CNN involves a forward pass where an input image is passed through
the network layers, undergoing convolution, activation, and pooling operations, followed
by fully connected layers. The final layer produces class scores or probabilities, from which
the network’s prediction is derived.
During training, a loss function, such as cross-entropy loss, measures the discrepancy bet-
ween the predicted labels and the true labels. The weights of the network are then adjus-
ted using backpropagation and optimization algorithms like Stochastic Gradient Descent
(SGD) to minimize this loss function.
Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to
recognize patterns in sequences of data, such as time series, natural language, and more.
Unlike feedforward neural networks, RNNs have connections that form directed cycles,
allowing them to maintain a ’memory’ of previous inputs (see Figure 4.9). This makes
RNNs particularly well-suited for tasks where context and sequence order are important.
1. Hidden States and Recurrence : The fundamental feature of an RNN is its hidden
state, which captures information from the sequence of inputs. The hidden state at
time step t, denoted as $h_t$, is a function of the input at the current time step $x_t$ and
the hidden state from the previous time step $h_{t-1}$. This relationship can be expressed
as :
\[ h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{4.33} \]
where :
— $W_{hh}$ is the weight matrix from hidden state to hidden state,
2. Output Layer : The output at each time step t, denoted as $y_t$, is typically computed
using the hidden state $h_t$. The output can be expressed as :
\[ y_t = g(W_{hy} h_t + b_y) \tag{4.34} \]
where :
— $W_{hy}$ is the weight matrix from the hidden state to the output,
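The two equations above, (4.33) and (4.34), can be sketched as a forward pass over a short sequence. The sketch assumes f = tanh and g = identity, which the notes do not specify, and it uses plain Python lists; the helper names and all weight values are hypothetical.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of several equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    h = [0.0] * len(b_h)  # initial hidden state (assumed to be zero)
    outputs = []
    for x in xs:
        # Eq. (4.33) with f = tanh (assumption).
        h = [math.tanh(a) for a in add(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        # Eq. (4.34) with g = identity (assumption).
        outputs.append(add(matvec(W_hy, h), b_y))
    return outputs, h


# Hypothetical sizes: 1 input feature, 2 hidden units, 1 output.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
W_hy = [[1.0, -1.0]]
b_h = [0.0, 0.0]
b_y = [0.0]

outputs, final_h = rnn_forward([[1.0], [0.5], [-1.0]], W_xh, W_hh, W_hy, b_h, b_y)
print(outputs)
```

Because the hidden state is carried from one step to the next, feeding the same input at two different time steps generally yields two different outputs.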
Training RNNs
RNNs are trained using the backpropagation through time (BPTT) algorithm, a variant of
the backpropagation algorithm adapted for handling sequential data. The BPTT algorithm
involves unfolding the network through time and applying backpropagation to calculate
gradients. The gradients are then used to update the network’s weights to minimize the
loss function.
4.2. DEEP LEARNING 95
Applications of RNNs
RNNs have a wide range of applications, particularly in areas involving sequential data.
They are used in language modeling, speech recognition, machine translation, and time
series prediction, among others. However, standard RNNs can suffer from issues such as
vanishing and exploding gradients, which limit their ability to learn long-term dependen-
cies in sequences.
Long Short-Term Memory (LSTM) networks (see Figure 4.10) are a type of Recurrent
Neural Network (RNN) specifically designed to overcome the limitations of traditional
RNNs, such as the vanishing and exploding gradient problems. LSTMs achieve this by
introducing a more sophisticated memory cell structure, allowing them to maintain and
manipulate information over longer periods. This makes LSTMs particularly effective for
tasks that require learning long-term dependencies, such as language modeling and time
series forecasting.
LSTM networks consist of a series of cells, each containing four main components : a cell
state, an input gate, a forget gate, and an output gate. These components work together
to control the flow of information within the network.
1. Cell State : The cell state, denoted as $C_t$, serves as a memory that carries information
across different time steps. It can be modified by the gates to retain or forget
information as needed. The cell state can be updated using the following equation :
where :
2. Gates in LSTM : The three gates in an LSTM—input gate, forget gate, and output
gate—are crucial for controlling the flow of information.
(a) Forget Gate : The forget gate decides which information from the previous cell
state should be discarded. It is defined as :
where $\sigma$ is the sigmoid function, $W_f$ is the weight matrix, $h_{t-1}$ is the previous
hidden state, $x_t$ is the input at the current time step, and $b_f$ is the bias term.
(b) Input Gate : The input gate controls how much of the new information should
(c) Output Gate : The output gate determines the output of the LSTM cell based
on the cell state. It is defined as :
\[ h_t = o_t \odot \tanh(C_t) \tag{4.40} \]
LSTM networks are trained using backpropagation through time (BPTT), similar to stan-
dard RNNs. However, due to the gating mechanisms, LSTMs can learn to retain relevant
information and forget irrelevant information over longer sequences, making them more
robust in handling long-term dependencies.
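The gating mechanism can be sketched as a single LSTM time step. Several gate equations are not reproduced in this copy of the notes, so the sketch assumes the widely used formulation in which each gate applies its own weights to the concatenation of $h_{t-1}$ and $x_t$, and the cell state is updated as forget gate times old state plus input gate times candidate. Apart from `W_f` and `b_f`, the parameter names are assumptions, and all values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, v, b):
    """Compute W v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], concat, p["b_f"])]      # forget gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], concat, p["b_i"])]      # input gate
    c_hat = [math.tanh(a) for a in affine(p["W_c"], concat, p["b_c"])]  # candidate memory
    o_t = [sigmoid(a) for a in affine(p["W_o"], concat, p["b_o"])]      # output gate
    # Assumed cell-state update: keep part of the old memory, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Eq. (4.40): the hidden state is the gated, squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical 1-unit cell: forget gate almost open, input gate almost closed.
params = {
    "W_f": [[0.0, 0.0]], "b_f": [10.0],
    "W_i": [[0.0, 0.0]], "b_i": [-10.0],
    "W_c": [[0.0, 0.0]], "b_c": [0.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
}
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[0.8], p=params)
print(h_t, c_t)
```

With these settings the cell state stays close to its previous value of 0.8, which illustrates how an open forget gate lets information persist across time steps.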
LSTMs have been widely used in various applications, particularly where sequence pre-
diction and long-term context are crucial. Notable applications include natural language
processing (NLP) tasks like language translation and sentiment analysis, speech recogni-
tion, time series prediction, and anomaly detection in sequential data.
In machine learning, there are numerous algorithms available for both regression and
classification tasks. Given a specific problem, multiple algorithms may be applicable,
making it essential to evaluate their effectiveness. This chapter focuses on the evaluation of
machine learning algorithms, exploring methods to assess the performance of both regression
and classification models. We will also discuss how to compare the performance of
different algorithms to select the most suitable one for practical applications. These
evaluation techniques are crucial for ensuring that we choose the right model that meets the
desired criteria and performs well on the given data.
100 CHAPTER 5. EVALUATION OF MACHINE LEARNING ALGORITHMS
regressor, it produces a specific outcome based on the validation set. To account for varia-
tions due to randomness in training data, initialization, and other factors, multiple models
can be generated using the same algorithm. These models are then tested on various vali-
dation sets, producing a range of error measurements. The statistical distribution of these
errors provides valuable insights into the expected performance of the algorithm for the
given problem and allows for a more comprehensive comparison with other algorithms.
— Risks associated with errors : Generalized using loss functions, which may vary
significantly depending on the application.
— Training time and space complexity : The resources required during the training
phase, which can affect scalability.
— Testing time and space complexity : The efficiency of the model during deployment
and prediction.
— Interpretability : The ability of the model to provide insights that can be understood
and verified by experts.
5.2 Cross-Validation
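\[
\begin{aligned}
V_1 &= X_1, & T_1 &= X_2 \cup X_3 \cup \dots \cup X_K \\
V_2 &= X_2, & T_2 &= X_1 \cup X_3 \cup \dots \cup X_K \\
&\;\;\vdots & &\;\;\vdots \\
V_K &= X_K, & T_K &= X_1 \cup X_2 \cup \dots \cup X_{K-1}
\end{aligned}
\]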
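The K pairs of validation and training sets listed above can be generated with a short sketch. It works on sample indices rather than on the data itself, shuffles them once with a fixed seed, and deals them into K roughly equal parts; the function name `k_fold_splits` and the sizes in the example are hypothetical.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Return K (training, validation) pairs of sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into K roughly equal parts X_1, ..., X_K.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]  # V_i = X_i
        # T_i is the union of all other parts.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_splits(n_samples=10, k=5)
for training, validation in splits:
    print(sorted(validation), "held out;", len(training), "used for training")
```

Every sample is used for validation exactly once. Setting `k` equal to `n_samples` makes each validation set a single instance.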
forming the training set. This results in N separate evaluations, each with a unique va-
lidation instance. LOOCV is particularly useful in applications such as medical diagnosis,
where labeled data is scarce.
5.2.3 5 × 2 Cross-Validation
The 5 × 2 cross-validation method involves dividing the dataset X into two equal parts
$X_1^{(1)}$ and $X_1^{(2)}$. The model is first trained on $X_1^{(1)}$ and validated on $X_1^{(2)}$, and then the roles
are reversed, with $X_1^{(2)}$ as the training set and $X_1^{(1)}$ as the validation set. This procedure
is repeated five times with different random splits, resulting in ten pairs of training and
validation sets. The pairs are as follows :
\[
\begin{aligned}
T_1 &= X_1^{(1)}, & V_1 &= X_1^{(2)} \\
T_2 &= X_1^{(2)}, & V_2 &= X_1^{(1)} \\
&\;\;\vdots & &\;\;\vdots \\
T_9 &= X_5^{(1)}, & V_9 &= X_5^{(2)} \\
T_{10} &= X_5^{(2)}, & V_{10} &= X_5^{(1)}
\end{aligned}
\]
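The ten training and validation pairs can be produced with a sketch along these lines. It operates on sample indices, and the function name `five_by_two_splits` and the fixed seed are hypothetical choices.

```python
import random


def five_by_two_splits(n_samples, seed=0):
    """Return ten (training, validation) index pairs for 5 x 2 cross-validation."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(5):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random split for each repetition
        half = n_samples // 2
        part1, part2 = indices[:half], indices[half:]
        pairs.append((part1, part2))  # train on the first half, validate on the second
        pairs.append((part2, part1))  # then swap the roles
    return pairs


pairs = five_by_two_splits(n_samples=8)
print(len(pairs))  # 10
```

Consecutive pairs share the same split with the roles exchanged, matching the pattern of $T_1, V_1$ and $T_2, V_2$ above.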
5.2.4 Bootstrapping
Bootstrapping is a statistical resampling technique where datasets are sampled with
replacement. In the context of machine learning, bootstrapping involves creating multiple new
training datasets by randomly sampling from the original dataset, with some data points
potentially being sampled multiple times (see Figure 5.4). The corresponding test datasets
are formed from the instances not included in the training sets.
1. Draw two balls (e.g., A and E), record the labels, and return them to the urn.
2. Draw two more balls (e.g., C and E), record the labels, and return them to the urn.
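The resampling procedure described above can be sketched as follows. The sketch assumes that each bootstrap training set has the same size as the original dataset, a common convention the notes do not state; the function name `bootstrap_split` and the dataset size are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap training set (with replacement) and its test set."""
    # Sampling with replacement: the same index may be drawn several times.
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    # The test set holds every instance that was never drawn.
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test


rng = random.Random(42)
n_samples = 1000
train, test = bootstrap_split(n_samples, rng)
print(len(set(train)), "distinct training instances,", len(test), "test instances")
```

For a large dataset, the probability that a given instance is never drawn is (1 - 1/n)^n, which approaches e^(-1), so roughly 36.8% of the instances end up in the test set.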
5.3 Measuring Error
Accurately measuring the error of a machine learning model is crucial for evaluating its
performance. Different metrics are used depending on whether the model is a regression or
a classification model. This section covers various error measures for both types of models.
In regression, the goal is to predict a continuous output. The following metrics are com-
monly used to measure the accuracy of regression models :
The Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of
predictions, without considering their direction. It is calculated as :
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{5.1} \]
where yi are the actual values, ŷi are the predicted values, and n is the number of obser-
vations.
The Mean Squared Error (MSE) measures the average of the squares of the errors. It gives
more weight to larger errors, which can be useful for identifying large outliers. The formula
is :
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{5.2} \]
The Root Mean Squared Error (RMSE) is the square root of the MSE. It is in the same units
as the target variable, which can make interpretation easier. It is calculated as :
\[ \mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{5.3} \]
The Mean Absolute Percentage Error (MAPE) expresses the prediction error as a percentage
of the actual values. It is calculated as :
\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{5.4} \]
However, MAPE can be sensitive to very small actual values, potentially leading to large
percentage errors.
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{5.5} \]
where $\bar{y}$ is the mean of the actual values $y_i$. An $R^2$ value of 1 indicates perfect prediction,
while an $R^2$ value of 0 indicates that the model does not explain any of the variability in
the target variable.
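The regression metrics of Eqs. (5.1) to (5.5) can be computed together in a short sketch. MAPE here takes the absolute value of each relative error, which is the usual convention, and assumes that no actual value is zero. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n                    # Eq. (5.1)
    mse = sum(e * e for e in errors) / n                     # Eq. (5.2)
    rmse = math.sqrt(mse)                                    # Eq. (5.3)
    # Eq. (5.4); assumes every actual value is non-zero.
    mape = 100.0 / n * sum(abs(e / y) for e, y in zip(errors, y_true))
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot                               # Eq. (5.5)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Hypothetical crop yields (tonnes per hectare) and model predictions.
y_true = [2.0, 4.0, 5.0, 8.0]
y_pred = [2.5, 3.5, 5.0, 7.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

On this small example the single error of 1.0 contributes two thirds of the MSE but only half of the MAE, which shows how squaring emphasizes larger errors.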
Theil’s U Statistic
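Theil’s U Statistic is a relative measure of forecast accuracy. In the form given here
(often called U1), the root mean squared error is normalized by the sum of the root mean
squares of the actual and predicted values :
\[ U = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2} + \sqrt{\frac{1}{n} \sum_{i=1}^{n} \hat{y}_i^2}} \tag{5.6} \]
This form lies between 0 and 1 : values close to 0 indicate accurate predictions, and values
close to 1 indicate poor predictions. A related variant (often called U2) instead compares
the model with a naïve forecast, which simply uses the last observed value as the forecast
for the next period. For that variant, a value less than 1 indicates that the model has
better predictive accuracy than the naïve model, whereas a value greater than 1 indicates
worse predictive performance.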
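Equation (5.6) as written can be sketched directly. By the triangle inequality, the numerator can never exceed the denominator, so this particular ratio always falls between 0 and 1. The function name `theil_u1` and the sample values are hypothetical, and the sketch assumes the two series are not both identically zero.

```python
import math


def theil_u1(y_true, y_pred):
    """Theil's U in the form of Eq. (5.6)."""
    n = len(y_true)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)
    rms_true = math.sqrt(sum(y * y for y in y_true) / n)
    rms_pred = math.sqrt(sum(p * p for p in y_pred) / n)
    return rmse / (rms_true + rms_pred)


# Hypothetical actual values and forecasts.
y_true = [2.0, 4.0, 5.0, 8.0]
y_pred = [2.5, 3.5, 5.0, 7.0]
u = theil_u1(y_true, y_pred)
print(u)
```

A perfect forecast gives 0, while a forecast that is the exact negative of the actual series gives 1.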
Table 5.1 below compares several commonly used regression error metrics in terms of
their description, interpretability, handling of errors, and unit of measure.
Classification models predict categorical outcomes. The following metrics are commonly
used to assess the performance of these models :
Confusion Matrix
A Confusion Matrix provides a detailed breakdown of the classification results (see Figure
5.5). It consists of the following components :
Precision and Recall are key metrics derived from the confusion matrix :
— Precision : The ratio of correctly predicted positive observations to the total predicted
positive observations. It is calculated as :
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{5.7} \]
— Recall (Sensitivity or True Positive Rate) : The ratio of correctly predicted positive
observations to all observations in the actual class. It is calculated as :
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{5.8} \]
F1 Score
The F1 Score is the harmonic mean of Precision and Recall, providing a balance between
the two. It is calculated as :
\[ \mathrm{F1\ Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5.9} \]
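Precision, Recall, and the F1 Score can be computed from predicted and true labels with a sketch like the one below. Here TP, FP, and FN denote true positives, false positives, and false negatives. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, and the labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0                      # Eq. (5.7)
    recall = tp / (tp + fn) if tp + fn else 0.0                         # Eq. (5.8)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                               # Eq. (5.9)
    return precision, recall, f1


# Hypothetical labels: 1 = diseased plant, 0 = healthy plant.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this example TP = 3, FP = 2, and FN = 1, so precision is 0.6, recall is 0.75, and the F1 Score lies between the two, closer to the smaller value.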
The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate (FPR)
(see Figure 5.6), which is defined as :
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{5.10} \]
The area under the ROC curve (AUC) provides a single scalar value to compare models.
An AUC of 1 represents perfect classification, while an AUC of 0.5 indicates no
discriminative ability.
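A ROC curve and its AUC can be sketched by lowering the decision threshold one score at a time and recording the resulting (FPR, TPR) point, then integrating with the trapezoidal rule. The sketch assumes labels coded as 0 and 1 with both classes present; the function names and the scores are hypothetical.

```python
def roc_curve_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by sweeping the threshold downwards."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical true labels and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_curve_points(y_true, scores)
area = auc(points)
print(points, area)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, and the computed area equals 8/9, which reflects the interpretation of AUC as the probability that a random positive is scored above a random negative.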
— Accuracy : The ratio of correctly predicted instances (both positive and negative) to
the total instances.
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5.11} \]
— Specificity (True Negative Rate) : The ratio of correctly predicted negative obser-
vations to all actual negatives.
\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{5.12} \]
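Accuracy and Specificity follow directly from the four confusion-matrix counts, as in the sketch below. The counts are hypothetical; the second example is an illustrative imbalanced case added here to show why accuracy alone can be misleading.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)  # Eq. (5.11)


def specificity(tn, fp):
    return tn / (tn + fp)  # Eq. (5.12)


def false_positive_rate(fp, tn):
    return fp / (fp + tn)  # Eq. (5.10)


# Hypothetical counts from a small test set.
tp, tn, fp, fn = 3, 4, 2, 1
acc = accuracy(tp, tn, fp, fn)
spec = specificity(tn, fp)
fpr = false_positive_rate(fp, tn)
print(acc, spec, fpr)

# Imbalanced case: 95 negatives, 5 positives, and a model that always predicts "negative".
acc_trivial = accuracy(tp=0, tn=95, fp=0, fn=5)
print(acc_trivial)  # 0.95, although no positive instance is detected
```

Specificity and the False Positive Rate always sum to 1, since both are computed over the actual negatives.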
Appendix
No.  Organization/Dataset                                Description of dataset                                  Source
1    Image-Net Dataset                                   Images of various plants (trees, vegetables, flowers)   http://image-net.org/explore?wnid=n07707451
2    ImageNet Large Scale Visual Recognition Challenge   Images that allow object localization and detection     http://image-net.org/challenges/LSVRC/2017/#det