Chapter 5
Machine Learning
Khalid Ahmed AlAfandy
https://orcid.org/0000-0003-1465-4446
ENSA, Abdelmalek Essaadi University, Morocco
Hicham Omara
Abdelmalek Essaadi University, Morocco
Mohamed Lazaar
ENSIAS, Mohammed V University in Rabat, Morocco
Mohammed Al Achhab
NTT, ENSATE, Abdelmalek Essaadi University, Tetouan, Morocco
ABSTRACT
This chapter provides a comprehensive explanation of machine learning: an introduction, history, theory and types, problems, and how these problems can be solved. It then presents some of the machine learning algorithms most used in image classification, ending with the evaluation metrics used to assess the performance of learning models. The open-source libraries mentioned in this chapter ease the coding needed to build any learning model with machine learning.
INTRODUCTION
Advances in machine learning and deep learning are causing a paradigm shift in almost every field (Shinde & Shah, 2018; Bowling et al., 2006). The relationship between artificial intelligence, machine learning, and deep learning is depicted in Figure 1.
Machine learning is a branch of artificial intelligence that relies on using real data to train computers, qualifying them to make strong predictions for a particular data type, as a human expert would, without explicit programming. That is, it is a process of importing dataset features and exporting output classes or results, depending on the type of machine learning algorithm used. These data features can be linear data, images, videos, audio, or any other data type used in human life. So, machine learning is an attempt to make computers learn as humans do, using data commonly found in our lives (Alpaydin, 2020). Machine learning can be divided into two types: supervised and unsupervised. Supervised machine learning is based on training computers with given data that have known correct outputs, whereas unsupervised machine learning builds learning models from given data without outputs or clear results.
Supervised machine learning can be categorized into two branches: classification and regression. A regression problem is based on predictions within continuous outputs, where there is a relation between inputs and outputs through a continuous function. A classification problem is based on predictions with discrete outputs, where the outputs are limited to two or more known categories or classes (Alloghani et al., 2020). A learning model must rely on a mathematical model, called the hypothesis function, which differs according to the learning type. In supervised machine learning, a cost function or loss function, built according to the mathematical model used, must be employed throughout the training process; it must be minimized to achieve a high-accuracy prediction model. The lowest cost value is achieved by updating the model parameters. Thus, the main goal in building a high-performance learning model is selecting the model parameter values that result in the lowest cost function. There are two main problems that can occur in learning models, over-fitting and under-fitting, and there are several ways to solve them (Alpaydin, 2020; Alloghani et al., 2020).
This chapter outlines machine learning's history, types, and problems, and how these problems can be solved; it then presents some of the most used machine learning algorithms and the open-source libraries that ease the coding needed to build any learning model, ending with the evaluation metrics used in learning model performance assessment.
Figure 1. The relation between artificial intelligence, machine learning, and deep learning (Shinde & Shah, 2018; Bowling et al., 2006)
Machine learning started in 1943 with the first mathematical model of neural networks, presented in (Mcculloch & Pitts, 1990). In 1950, Arthur Samuel developed a computer program for playing checkers; he initiated alpha-beta pruning, which measures the chance of winning, to overcome the low memory available at the time, and then designed the scoring function called the minimax algorithm, which is used in game programming to this day (ElNaqa & Murphy, 2015). Arthur Samuel defined machine learning in 1959 as "the field of study that gives computers the ability to learn without being explicitly programmed" (ElNaqa & Murphy, 2015). In 1965, Aleksei Ivakhnenko and Valentin Lapa created a hierarchical representation of polynomial activation function neural networks trained with the GMDH. It is widely regarded as the first multi-layer perceptron, and Ivakhnenko is frequently referred to as the "Father of Deep Learning" (Ivakhnenko & Lapa, 1966). In 1979, Kunihiko Fukushima created a hierarchical multilayered network for pattern recognition that served as inspiration for convolutional neural networks (Fukushima et al., 1983). In 1998, the problem of learning was defined by Tom Mitchell: "a computer program is said to learn from experience E with respect to some task T and some performance measure P" (Mitchell, 2006). ImageNet, a vast database of labeled images, was founded by Fei-Fei Li in 2009 (J. Deng et al., 2009).
The machine learning process is the operation of importing dataset features and exporting output classes or results, depending on the type of machine learning algorithm used. Machine learning algorithms are divided into two categories: supervised and unsupervised (Khanum et al., 2015). Figure 2 shows the types of machine learning algorithms.
Supervised learning relies on a given dataset with known correct outputs, where there is relevance between the input and the output. Supervised learning models are sorted into regression and classification models. A regression model tries to predict results within a continuous output; that is, it tries to map input variables to some continuous function. A classification model tries to predict results in a discrete output; that is, it tries to map input variables into discrete categories or classes (Sen et al., 2020).
Unsupervised machine learning counts on given data that do not have any labels, so the learning algorithm attempts to find some structure in the given data. After the learning algorithm succeeds in structuring the unlabeled data, it groups these data into separate clusters. Hence, the main unsupervised machine learning algorithm is called the clustering algorithm (Smola & Vishwanathan, 2008). The k-means algorithm is the most commonly used clustering algorithm. It is based on dividing m objects into k clusters where each object is affiliated with the cluster with the nearest mean. This approach ideally results in k distinct clusters with the greatest possible variation. The best number of clusters k that leads to the greatest variation is not known a priori, so it must be reckoned from the data. Given the dataset input features X = {x_1, …, x_m}, the goal of the k-means algorithm is to structure the dataset into k clusters such that every point in a cluster is more similar to the points from its own cluster than to the points from other clusters. To achieve this goal, define the prototype vectors μ_1, …, μ_k and an indicator variable r_ij which is 1 if and only if x_i is assigned to cluster j. To cluster the dataset, we minimize the following objective function J(r, μ), which minimizes the distance of each point from its prototype vector (Smola & Vishwanathan, 2008).
J(r, μ) = (1/2) Σ_{i=1}^{m} Σ_{j=1}^{k} r_ij ‖x_i − μ_j‖²  (1)

where r = {r_ij}, μ = {μ_j}, and ‖·‖ denotes the usual Euclidean norm.
To achieve a high-performance learning model, the value of the objective function J(r, μ) in (1) must be minimized; to build this model, the r and μ values must be found. In practice, it is very difficult to minimize J(r, μ) with respect to both r and μ simultaneously, so a two-stage strategy must be adopted. The first stage is to determine r while fixing μ: the nearest prototype for x_i is found by setting r_ij = 1 if
j = argmin_{j′} ‖x_i − μ_j′‖²  (2)

and r_ij = 0 otherwise. The second stage is to determine μ while fixing r, by setting the derivative of J with respect to μ_j to zero:

Σ_i r_ij (x_i − μ_j) = 0  (3)

which gives

μ_j = Σ_i r_ij x_i / Σ_i r_ij  (4)
where Σ_i r_ij counts the number of points assigned to cluster j, so μ_j is simply set to the sample mean of the points assigned to cluster j. Figure 3 shows unsupervised learning (Smola & Vishwanathan, 2008).
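To make the two-stage strategy concrete, the following minimal Python/NumPy sketch alternates the assignment step (2) and the mean-update step (4); the iteration count, the random-point initialization, and the toy data are illustrative assumptions, not part of the source.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Minimal k-means sketch following Eqs. (1)-(4)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initial prototypes
    for _ in range(n_iters):
        # Stage 1: fix mu, assign each point to its nearest prototype (Eq. 2)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = dists.argmin(axis=1)
        # Stage 2: fix r, move each prototype to its cluster mean (Eq. 4)
        mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                       for j in range(k)])
    return r, mu

# Toy usage: two well-separated blobs should recover two clusters
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centers = kmeans(X, k=2)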
Supervised machine learning is based on given data with known correct results or outputs. It can be categorized into two types: regression and classification. In regression learning models, the given dataset inputs and outputs must have a mathematical relation where the outputs are predicted within a continuous function. In classification learning models, the inputs and outputs must also have a mathematical relation, but the outputs are predicted within a discrete function. In both types a mathematical model, called the hypothesis function h_θ(x), is necessary; it can differ according to the learning model type (regression or classification), the dataset used, and the learning algorithm. The model parameters {θ_0, …, θ_n} must be well selected and updated according to the dataset input features {x_1, …, x_n} to achieve the lowest cost or loss value, calculated using the cost function J (Verdhan, 2020). The main difference between regression and classification algorithms is the hypothesis function; the cost function and the updating of model parameters are almost the same. Figure 4 shows supervised learning (regression and classification).
In regression, the hypothesis function can be calculated for each input in the
dataset inputs by using the model parameters and the input features. This hypothesis
function can be a linear function or any other mathematical function according to the
nature of the dataset used (Gutenbrunner et al., 1993; Shanthamallu et al., 2017). It can be calculated by (Gutenbrunner et al., 1993; Shanthamallu et al., 2017):

h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n  (5)

where h_θ(x) is the hypothesis function, {θ_0, …, θ_n} are the model parameters, and {x_1, …, x_n} are the n input features of a dataset input.
It can be reformulated in matrix multiplication form as (Gutenbrunner et al., 1993; Shanthamallu et al., 2017):

h_θ(x) = [θ_0 θ_1 θ_2 … θ_n] [1 x_1 x_2 … x_n]^T = θ^T x  (6)
In classification, the hypothesis function can be calculated for each input in the dataset using the model parameters and the input features. The most used hypothesis function in classification is the sigmoid function (Shanthamallu et al., 2017). The hypothesis function can be calculated by (Shanthamallu et al., 2017):

h_θ(x) = g(z) = 1 / (1 + e^{−z})  (7)

z = θ^T x  (8)
The sigmoid function g(z) outputs a real number in the (0, 1) interval, as shown in Figure 5. If the output is less than 0.5, the classification output is 0; if the output is greater than or equal to 0.5, the classification output is 1. In the case of more than two classes, where y = {0, 1, 2, …, n}, the problem is divided into n+1 binary classification problems; among the class prediction outputs, the highest probability for a class means that y belongs to that class (Shanthamallu et al., 2017; Zhao et al., 2010).
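As a small illustration (not from the source), the sigmoid hypothesis of (7)-(8) and the 0.5 decision threshold can be written directly:

import numpy as np

def sigmoid(z):
    # Eq. (7): g(z) = 1 / (1 + e^(-z)), with outputs in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # Eqs. (7)-(8): class 1 if g(theta^T x) >= 0.5, else class 0
    return int(sigmoid(theta @ x) >= 0.5)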
Through the training stage in supervised machine learning, two outputs are utilized to build the learning model: the correct, predefined dataset outputs and the hypothesis outputs; the cost function is based on these two outputs. Thus, the cost function for regression learning can be built using the hypothesis function and the dataset's known correct outputs as (F. Lubis et al., 2014):
J_regression = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²  (9)

Adding the regularization term gives:

J_regression = (1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})² + (λ/2m) Σ_{j=1}^{n} θ_j²  (10)
where λ (lambda) is the regularization parameter; it determines how much the costs of the theta parameters θ are inflated. Note that the selection of the λ value is based on intuition.
The cost function for classification learning uses the log function because of the
sigmoid function. It can be built by (Zhao et al., 2010):
J_classification = −(1/m) Σ_{i=1}^{m} [y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)}))]  (11)
The regularization term can be added as done in regression in (10) (Zhao et al., 2010):
J_classification = −(1/m) Σ_{i=1}^{m} [y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)}))] + (λ/2m) Σ_{j=1}^{n} θ_j²  (12)
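A minimal sketch of both regularized costs follows, assuming X carries a leading column of ones (so θ_0 is the bias) and, as in (10) and (12), θ_0 is not regularized; the function names are illustrative.

import numpy as np

def j_regression(theta, X, y, lam=0.0):
    # Eq. (10); lam = 0 reduces it to Eq. (9)
    m = len(y)
    err = X @ theta - y
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # skip theta_0
    return (err @ err) / (2 * m) + reg

def j_classification(theta, X, y, lam=0.0):
    # Eq. (12); lam = 0 reduces it to Eq. (11)
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m + reg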
To achieve the lowest cost value, the model parameters that realize the minimum loss must be estimated. Training begins by initializing these parameters with random values, which cannot obtain the minimum loss for the learning model, so the parameters must be updated to reach the targeted minimum loss value. This updating is done using gradient descent, which iterates the process until the minimum loss is reached (Ruder, 2016). The parameters are updated by (Ruder, 2016):
θ_j = θ_j − α ∂J/∂θ_j  (13)
where α is the learning rate, J is the cost function value, and θ_j ∈ {θ_0, …, θ_n}. Note that the selection of the α value is based on intuition; it is preferred to select a small value less than 1 (Ruder, 2016). Using (9), the updated parameters for regression learning can be calculated by (Ruder, 2016):
θ_j = θ_j − α ∂/∂θ_j [(1/2m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)})²]  (14)
θ_j = θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) ∂h_θ(x^{(i)})/∂θ_j  (15)

θ_j = θ_j − (α/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}  (16)

Similarly, using (11), the updated parameters for classification learning can be calculated by (Ruder, 2016):
θ_j = θ_j − (α/m) ∂/∂θ_j Σ_{i=1}^{m} [−y^{(i)} log(h_θ(x^{(i)})) − (1 − y^{(i)}) log(1 − h_θ(x^{(i)}))]  (17)

Evaluating this derivative yields the same update form as in regression:

θ_j = θ_j − (α/m) Σ_{i=1}^{m} (h_θ(x^{(i)}) − y^{(i)}) x_j^{(i)}  (18)
In summary, a supervised learning model is built and assessed through the following steps (a minimal code sketch follows the list):

1. Divide the dataset into a training set (80% of the dataset records) and a test set (20% of the records).
2. Initialize the model parameters to random values.
3. Calculate the hypothesis function using the model parameters and the training dataset input features.
4. Calculate the cost function, whether using regularization or not.
5. Update the model parameter values using gradient descent.
6. Repeat steps 3 to 5 until the lowest possible cost value is achieved.
7. Calculate the hypothesis function using the final model parameter values and the test dataset input features.
8. Compare the hypothesis outputs with the test dataset's known correct results to assess the learning model performance.
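The sketch below walks through steps 1-8 for a logistic-regression classifier; the learning rate, iteration count, and split ratio are assumptions for illustration, not prescribed by the source.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_model(X, y, alpha=0.1, n_iters=2000):
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])          # add bias feature x0 = 1
    split = int(0.8 * m)                         # step 1: 80/20 split
    X_tr, y_tr = X[:split], y[:split]
    X_te, y_te = X[split:], y[split:]
    theta = np.random.randn(n + 1) * 0.01        # step 2: random init
    for _ in range(n_iters):                     # step 6: repeat steps 3-5
        h = sigmoid(X_tr @ theta)                # step 3: hypothesis, Eq. (7)
        eps = 1e-12                              # avoid log(0) in the cost
        cost = -np.mean(y_tr * np.log(h + eps)
                        + (1 - y_tr) * np.log(1 - h + eps))  # step 4, Eq. (11)
        theta -= alpha * (X_tr.T @ (h - y_tr)) / len(y_tr)   # step 5, Eq. (18)
    preds = sigmoid(X_te @ theta) >= 0.5         # step 7: test hypotheses
    return theta, np.mean(preds == y_te)         # step 8: test accuracy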
The over-fitting problem occurs when the learning model fits the training data almost perfectly after the training process but cannot predict the test data well. In this problem, the training set cost function value will be low while the validation set cost function value will be much greater than the training set cost function value. There are three main solutions to this problem. The first solution is reducing the training features; the features to be removed must be carefully selected so as not to affect the data required for training. This selection can be done manually or by a model selection algorithm. The second solution is increasing the training set data, either by getting more training data or by using data augmentation. Data augmentation selects some data and adds the selected data as new data after applying mathematical or spatial modifications to it, as explained in the Data Augmentation section below. The third solution is regularization, where the regularization term is added to the cost function calculations through the training process (Hawkins, 2004).
The under-fitting problem occurs when the learning model fails to predict even the training data well after the training process. In this problem, both the training set and the validation set cost function values will be high. There are three main solutions to this problem. The first solution is increasing the training features, which can be done by selecting some features and adding them as new features after a mathematical modification such as squaring or any other mathematical function. The second solution is increasing the training iterations, since stopping training too soon can also result in an under-fit learning model. The third solution is decreasing the regularization parameter λ value if this value is high (Jabbar & Khan, 2015).
DATA AUGMENTATION
One way of preventing over-fitting problems is increasing the training data, but sometimes no additional data is available for training, or providing additional data has a high cost; data augmentation is the solution in this situation. Data augmentation, which is commonly used in computer vision, is a technique for increasing the amount of data by adding significantly changed copies of either existing data or new synthetic data derived from existing data (Shorten & Khoshgoftaar, 2019). The most well-known sort of data augmentation is image data augmentation, which entails transforming images in the training dataset into altered copies that belong to the same class as the original image. Transforms include shifts, flips, zooms, color modification, random cropping, rotation, noise injection, and many other operations from the field of image editing. Typically, image data augmentation is applied only to the training dataset, not the validation or test datasets. This differs from data preparation tasks like image resizing and pixel scaling, which must be carried out uniformly across all datasets that interact with the model (Shorten & Khoshgoftaar, 2019).
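As a minimal sketch (library pipelines such as those in TF offer far richer transforms), random flips, rotations, and noise injection can be applied to a training image; the transform choices and noise level here are illustrative assumptions.

import numpy as np

def augment(image, rng=None):
    # Return a randomly transformed copy of an (H, W, C) uint8 image
    rng = rng or np.random.default_rng()
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                       # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree turn
    out = out + rng.normal(0.0, 5.0, out.shape)     # noise injection
    return np.clip(out, 0, 255).astype(image.dtype)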
The Naïve Bayes Algorithm

The Naïve Bayes algorithm is a machine learning algorithm which acts as a classifier. This classifier is based on the Bayes theorem, one of the fundamental probability theorems. The Bayes theorem can be represented with a simple mathematical formula as (19) (Taheri & Mammadov, 2013):

P(A|B) = P(B|A) P(A) / P(B)  (19)
where P(A|B) is the probability of event A occurring given that B is true, P(B|A)
is the probability of event B occurring given that A is true, and P(A) and P(B) are
the probabilities of observing A and B respectively without any given condition.
So, if we have a given dataset to build a learning model using the Naïve Bayes algorithm, with input X having input features (x_1, x_2, x_3, …, x_n), where X = (x_1, x_2, x_3, …, x_n), and output y, the mathematical model of the Naïve Bayes learning algorithm can be represented as (Taheri & Mammadov, 2013):
P(y|X) = P(X|y) P(y) / P(X)  (20)
Using the input features, the Naïve Bayes learning algorithm can be represented mathematically, for n features, as (Taheri & Mammadov, 2013):

P(y|x_1, x_2, x_3, …, x_n) = [P(x_1|y) P(x_2|y) P(x_3|y) … P(x_n|y) P(y)] / [P(x_1) P(x_2) P(x_3) … P(x_n)]  (21)
By looking at the dataset and substituting the values into the equation, the values for each term can be obtained. The denominator does not change for any of the entries in the dataset; it remains constant. As a result, the denominator can be eliminated and proportionality can be injected as (Taheri & Mammadov, 2013):
P(y|x_1, x_2, x_3, …, x_n) ∝ P(y) ∏_{i=1}^{n} P(x_i|y)  (22)
The output class (the value of the y variable) is the one with the maximum probability, as (Taheri & Mammadov, 2013):
y = argmax_y P(y) ∏_{i=1}^{n} P(x_i|y)  (23)
There are three main types of Naïve Bayes classifier: the Multinomial Naïve Bayes, the Bernoulli Naïve Bayes, and the Gaussian Naïve Bayes (Singh et al., 2019; T. Wang & W. Li, 2010).
The Multinomial Naïve Bayes classifier is usually used for document classifications.
It is used for discrete counts (Singh et al., 2019).
The Bernoulli Naïve Bayes classifier is useful in binary classification. One of its most used applications is text classification using a 'bag of words' paradigm, in which the 1s and 0s represent "word appears in the document" and "word does not appear in the document," respectively (Singh et al., 2019).
The Gaussian Naïve Bayes classifier works by using a Gaussian distribution for the continuous values associated with each feature. So, the feature likelihood is considered to be Gaussian, and the conditional probability can be calculated by (T. Wang & W. Li, 2010):

P(X) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}  (24)

where μ is the mean value and σ is the standard deviation value of the X features; they can be given by (T. Wang & W. Li, 2010):

μ = (1/n) Σ_{i=1}^{n} x_i  (25)

σ = [(1/(n−1)) Σ_{i=1}^{n} (x_i − μ)²]^{0.5}  (26)
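A compact sketch of the Gaussian classifier described by (23)-(26) follows, using per-class feature means and standard deviations and an argmax over posteriors; working in log-space is an implementation choice (to avoid numerical underflow), not part of the source.

import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        # Per-class priors P(y) and Gaussian parameters (Eqs. 25-26)
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.sigma = np.array([X[y == c].std(axis=0, ddof=1) for c in self.classes])
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log P(y) + sum_i log P(x_i|y) with the Gaussian likelihood (Eq. 24);
        # the argmax over classes implements Eq. (23)
        log_lik = (-0.5 * ((X[:, None, :] - self.mu) / self.sigma) ** 2
                   - np.log(self.sigma * np.sqrt(2.0 * np.pi)))
        log_post = np.log(self.prior) + log_lik.sum(axis=2)
        return self.classes[log_post.argmax(axis=1)]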
The KNN Algorithm

The KNN algorithm is a supervised machine learning algorithm that can be used for classification and regression (S. Zhang et al., 2017). It is based on calculating the distances between a query and all the examples in the data, selecting the specified number K of examples closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression) (S. Zhang et al., 2017; A. Lubis & M. Lubis, 2020). The K value is determined by iteration and testing, but it can depend on the neighbors: K can have a high value in the case of more neighbors and a small value in the case of fewer neighbors. Note that in the case of K = N, where N is the number of classes, the over-fitting problem can occur (S. Zhang et al., 2017).
The Euclidean distance D_e(x, y) is the square root of the sum of the squared differences between the coordinates of n points, as (A. Lubis & M. Lubis, 2020; Wu et al., 2002):

D_e(x, y) = √(Σ_{i=1}^{n} (x_i − y_i)²)  (27)
The Manhattan distance D_m(x, y) is the sum of the absolute values of the differences between the coordinates of n points, as (A. Lubis & M. Lubis, 2020):

D_m(x, y) = Σ_{i=1}^{n} |x_i − y_i|  (28)
The Hamming distance D_h(x, y) counts the coordinates at which two given points differ, as (A. Lubis & M. Lubis, 2020; Wu et al., 2002):

D_h(x, y) = Σ_{i=1}^{n} |x_i − y_i|  (29)

with |x_i − y_i| = 0 if x_i = y_i, and 1 if x_i ≠ y_i  (30)
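A short sketch of KNN classification with the Euclidean distance of (27) follows; the K value and the majority-vote tie-breaking are assumptions for illustration.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=5):
    # Distances from the query to every training example (Eq. 27)
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # the K closest examples
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]         # most frequent label wins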
The DT Algorithm
Accuracy depends on the tree design and the feature selection (Farid et al., 2014).
The SVM Algorithm

The SVM classifier's cost function can be derived from the regularized logistic regression cost in (12) as (Chauhan et al., 2019; Vapnik, 2000; Koda et al., 2018):
J_SVM = −Σ_{i=1}^{m} [y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)}))] + (λ/2) Σ_{j=1}^{n} θ_j²  (31)
Multiplying the two terms (cost and regularization) of (31) by C, where C = 1/λ is called the penalty parameter of the SVM classifier model, gives (Chauhan et al., 2019; Vapnik, 2000; Koda et al., 2018):
J_SVM = −C Σ_{i=1}^{m} [y^{(i)} log(h_θ(x^{(i)})) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)}))] + (1/2) Σ_{j=1}^{n} θ_j²  (32)
So the hypothesis function for the SVM classifier can be represented as (Chauhan et al., 2019; Vapnik, 2000; Koda et al., 2018):

h_θ(x) = 1 if θ^T x ≥ 0, and 0 if θ^T x < 0  (33)
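In practice the SVM classifier is rarely hand-coded; a minimal Scikit-learn sketch using the penalty parameter C discussed above might look as follows. The toy data, the C value, and the 80/20 split are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical toy data: 200 samples, 4 features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# C is the penalty parameter of Eq. (32); a linear kernel matches theta^T x
clf = SVC(C=1.0, kernel="linear").fit(X[:160], y[:160])
print(clf.score(X[160:], y[160:]))  # accuracy on the held-out 20%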
The ANNs
The ANNs algorithm is a machine learning approach which can act as a supervised or unsupervised machine learning algorithm; it can be utilized as a regressor or a classifier too. The use of ANNs as a supervised classifier relies on the form of biological neural networks; the ANNs approach can be defined as algorithms that attempt to imitate the human brain (Shanmuganathan, 2016; AlAfandy et al., 2019). The structure of an ANN depends on the information that flows through the network. ANNs are deemed nonlinear applied mathematical information modeling tools whenever complicated relationships between inputs and outputs are forged. ANNs are formed of a sequence of layers; every layer contains a set of neurons. The input layer is the first layer and the output layer is the last layer; the internal layers are treated as hidden layers. Neurons in the preceding and succeeding layers are connected by weighted connections known as the weights (Srivastava et al., 2012). The accuracy and performance of ANNs depend heavily on the network structure and the hyper-parameter values. The ANN processing rate is high, but the network takes an enormous time for training and also needs a huge memory with advanced hardware; on the other hand, there is some stiffness in setting the network structure. Figure 9 shows the ANNs approach (Shanmuganathan, 2016; Srivastava et al., 2012).
An ANN structure with multiple layers, where each layer contains multiple nodes, has an input layer X = A^{[0]}, hidden layers A^{[1]} to A^{[L−1]}, and an output layer ŷ = A^{[L]} (Shanmuganathan, 2016; Srivastava et al., 2012; Bengio et al., 2017):

X = A^{[0]} = [x_1, x_2, x_3, …, x_n]^T and A^{[l]} = [a_1^{[l]}, a_2^{[l]}, a_3^{[l]}, …, a_{k[l]}^{[l]}]^T, l = {1, 2, …, L}  (34)

where L is the number of ANN layers, n is the number of input features, and k^{[l]} is the number of nodes in the lth layer.
Then, the lth layer output can be calculated as (Bengio et al., 2017):

A^{[l]} = ∅^{[l]}(Z^{[l]})  (35)

Z^{[l]} = W^{[l]} A^{[l−1]} + b^{[l]}  (36)

So, the right vector dimensions are W^{[l]} = (k^{[l]}, k^{[l−1]}), b^{[l]} = (k^{[l]}, 1), and A^{[l]} = (k^{[l]}, 1). The ∅^{[l]} is the activation function of the lth layer (Bengio et al., 2017). Then, the output of the ANN structure is calculated as (Bengio et al., 2017):

ŷ = A^{[L]}  (38)
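The forward pass of (34)-(38) can be sketched as follows; the sigmoid activation and the toy 3-2-1 layer sizes are assumptions for illustration.

import numpy as np

def forward(X, weights, biases, phi=lambda z: 1.0 / (1.0 + np.exp(-z))):
    # A[0] = X; then A[l] = phi(Z[l]) with Z[l] = W[l] A[l-1] + b[l]
    # (Eqs. 35-36); the final activation is y_hat = A[L] (Eq. 38).
    A = X
    for W, b in zip(weights, biases):
        A = phi(W @ A + b)
    return A

# Toy usage: one input column vector through a 3-2-1 network
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]  # (k[l], k[l-1])
biases = [np.zeros((2, 1)), np.zeros((1, 1))]                         # (k[l], 1)
y_hat = forward(rng.standard_normal((3, 1)), weights, biases)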
Open source is an expression referring to open-source software: code that is designed to be publicly accessible for free, so anyone can see, modify, and distribute it. In machine learning, many researchers routinely open source their work on the Internet, such as on GitHub. There are also open-source libraries for machine learning such as TF and Scikit-learn. These libraries are available and easy to use with widely used programming languages such as MATLAB and Python.
TF is an end-to-end open-source platform for creating machine learning and deep learning applications, created by the Google Brain team. It is a symbolic math package that performs numerous tasks involving DNN training and inference using dataflow and differentiable programming (Gad, 2018).
Scikit-learn was created as a Google Summer of Code project in 2007 by David Cournapeau. It is a Python-based machine learning package that includes supervised and unsupervised machine learning approaches. It is distributed with several Linux distributions and is licensed under a liberal simplified BSD license, allowing academic and commercial use (Pedregosa et al., 2011).
EVALUATION METRICS

There are several evaluation metrics for assessing the performance of classification algorithms; some of them assess the performance of each class prediction, and others assess the predictive performance of the whole classifier.
This section illustrates the confusion matrix, precision, recall, and F1-score, which are used to assess the performance of each class prediction, and the OA and the kappa coefficient, which are used to assess the predictive performance of the whole classifier (X. Deng et al., 2016; AlBeladi & Muqaibel, 2018; Banko, 1998; W. Li et al., 2017; C. Liu et al., 2007; Cohen, 1960).
The confusion matrix summarizes the prediction results for a class in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
As a result of this confusion matrix, the precision, recall, F1-score, and OA can be calculated (X. Deng et al., 2016; AlBeladi & Muqaibel, 2018).
Precision depicts the proportion of predicted samples in a class that actually belong to that class, out of all predicted samples in that class; it can be expressed as (X. Deng et al., 2016; AlBeladi & Muqaibel, 2018):

Precision = TP / (TP + FP)  (39)
Recall depicts the proportion of predicted samples in a class that actually belong to that class, out of all actual samples in that class; it can be expressed as (X. Deng et al., 2016; AlBeladi & Muqaibel, 2018):

Recall = TP / (TP + FN)  (40)
The F1-score is the harmonic mean of precision and recall; it can be expressed as (AlBeladi & Muqaibel, 2018):

F1-score = 2 × (Precision × Recall) / (Precision + Recall)  (41)
The OA
The OA basically tells us what percentage of the reference sites were correctly mapped out of all of them. The OA is usually reported as a percentage, with 100% accuracy indicating that all reference sites were properly categorized. OA is the simplest metric to compute and comprehend, but it only provides basic accuracy information to map users and producers. The OA is the major classification accuracy measure. The OA is calculated as (Banko, 1998; W. Li et al., 2017; AlAfandy et al., 2020b):

OA = (TP + TN) / (TP + FP + FN + TN)  (42)
The Kappa Coefficient

A statistical test is used to calculate the kappa coefficient, which assesses the correctness of a categorization. Kappa is a metric that measures how well a categorization performed compared to assigning values at random. The kappa coefficient can be anything between -1 and 1. A classification with a value of 0 is no better than a random categorization; if the value is negative, the categorization is much poorer than random. A value close to 1 suggests that the classification is better than chance. The kappa coefficient (κ) is calculated as (C. Liu et al., 2007; Cohen, 1960):

κ = (p_o − p_e) / (1 − p_e)  (43)
where p_o is the observed agreement and p_e is the expected agreement by chance:

p_o = (TP + TN) / (TP + FP + FN + TN)  (44)

p_e = p_yes + p_no  (45)

p_yes = [(TP + FP) / (TP + FP + FN + TN)] × [(TP + FN) / (TP + FP + FN + TN)]  (46)

p_no = [(FN + TN) / (TP + FP + FN + TN)] × [(FP + TN) / (TP + FP + FN + TN)]  (47)
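All of the above metrics follow directly from the four confusion-matrix counts; the sketch below covers the binary case (the function name is illustrative).

def evaluation_metrics(tp, fp, fn, tn):
    # Per-class and whole-classifier metrics, Eqs. (39)-(47)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)                            # Eq. (39)
    recall = tp / (tp + fn)                               # Eq. (40)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (41)
    oa = (tp + tn) / total                                # Eq. (42), same as p_o in (44)
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)     # Eq. (46)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)      # Eq. (47)
    p_e = p_yes + p_no                                    # Eq. (45)
    kappa = (oa - p_e) / (1 - p_e)                        # Eq. (43)
    return {"precision": precision, "recall": recall, "f1": f1,
            "oa": oa, "kappa": kappa}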
REFERENCES
AlAfandy, K. A., Omara, H., Lazaar, M., & Al Achhab, M. (Eds.). (2019). Artificial
Neural Networks Optimization and Convolution Neural Networks to Classifying
Images in Remote Sensing: A Review. Proceeding of The 4th International Conference
on Big Data and Internet of Things (BDIoT’19). 10.1145/3372938.3372945
AlAfandy, K. A., Omara, H., Lazaar, M., & Al Achhab, M. (2020a). Investment of
Classic Deep CNNs and SVM for Classifying Remote Sensing Images. Advances in
Science Technology and Engineering Systems Journal, 5(5), 652–659. doi:10.25046/aj050580
AlAfandy, K. A., Omara, H., Lazaar, M., & Al Achhab, M. (2020b). Using Classic
Networks for Classifying Remote Sensing Images: Comparative Study. Advances in
Science Technology and Engineering Systems Journal, 5(5), 770–780. doi:10.25046/aj050594
AlBeladi, A. A., & Muqaibel, A. H. (2018). Evaluating Compressive Sensing
Algorithms in Through-the-wall Radar via F1-score. International Journal of Signal
and Imaging Systems Engineering, 11(3), 164–171. doi:10.1504/IJSISE.2018.093268
Alloghani, M., Al-Jumeily, D., Mustafina, J., Hussain, A., & Aljaaf, A. J. (2020). A
Systematic Review on Supervised and Unsupervised Machine Learning Algorithms
for Data Science. In M. Berry, A. Mohamed, & B. Yap (Eds.), Supervised and
Unsupervised Learning for Data Science. Unsupervised and Semi-Supervised
Learning. Springer. doi:10.1007/978-3-030-22475-2_1
Alpaydin, E. (2020). Introduction to Machine Learning. MIT Press.
Banko, G. (1998). A Review of Assessing the Accuracy of Classifications of Remotely
Sensed Data and of Methods Including Remote Sensing Data in Forest Inventory.
International Institution for Applied Systems Analysis (IIASA).
Bengio, Y., Goodfellow, I., & Courville, A. (2017). Deep Learning. MIT press.
Bowling, M., Furnkranz, J., Graepel, T., & Musick, R. (2006). Machine Learning and Games. Machine Learning, Springer, 63(3), 211–215. doi:10.1007/s10994-006-8919-x
Chauhan, V. K., Dahiya, K., & Sharma, A. (2019). Problem Formulations and Solvers in Linear SVM: A Review. Artificial Intelligence Review, Springer, 52(2), 803–855. doi:10.1007/s10462-018-9614-6
Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37–46. doi:10.1177/001316446002000104
J. Deng, W. Dong, R. Socher, L. Li, K. Li, & L. Fei-Fei (Eds.). (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In Proceeding of the 2009 IEEE Conference
on Computer Vision and Pattern Recognition. IEEE. 10.1109/CVPR.2009.5206848
Deng, X., Liu, Q., Deng, Y., & Mahadevan, S. (2016). An Improved Method
to Construct Basic Probability Assignment Based on the Confusion Matrix for
Classification Problem. Information Sciences, Elsevier, 340-341, 250–261.
doi:10.1016/j.ins.2016.01.033
ElNaqa, I., & Murphy, M. J. (2015). What is Machine Learning? In I. Issam ElNaqa
& M. J. Murphy (Eds.), Machine Learning in Radiation Oncology (pp. 3–11).
Springer. doi:10.1007/978-3-319-18305-3_1
Farid, D. M., Zhang, L., Rahman, C. M., Hossain, M. A., & Strachan, R. (2014). Hybrid Decision Tree and Naive Bayes Classifiers for Multi-class Classification Tasks. Expert Systems with Applications, Elsevier, 41(4), 1937–1946. doi:10.1016/j.eswa.2013.08.089
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A Neural Network Model
for a Mechanism of Visual Pattern Recognition. IEEE Transactions on Systems,
Man, and Cybernetics, SMC-13(5), 826–834. doi:10.1109/TSMC.1983.6313076
APPENDIX
Table 1.