Iai&ml Unit-5
In machine learning, each task or problem is divided into classification and regression. Not all metrics
can be used for all types of problems; hence, it is important to know and understand which metrics should
be used for which kind of task.
1. Performance Metrics for Classification
In a classification problem, the category or class of the data is identified based on the training data. The
model learns from the given dataset and then classifies new data into classes or groups based on that training.
It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the
performance of a classification model, different metrics are used, and some of them are as follows:
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC-ROC (Area Under the ROC Curve)
I. Accuracy
The accuracy metric is one of the simplest classification metrics to implement. It is the ratio of the number
of correct predictions to the total number of predictions:

Accuracy = Number of correct predictions / Total number of predictions
To implement an accuracy metric, we can compare ground truth and predicted values in a loop, or we can
also use the scikit-learn module for this.
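A minimal sketch of both approaches, using hypothetical labels (a real project would substitute its own ground truth and model output):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # hypothetical ground-truth labels
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]   # hypothetical model predictions

# Manual loop: count the matches and divide by the total.
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc)                              # 0.875

# The scikit-learn equivalent gives the same result.
print(accuracy_score(y_true, y_pred))  # 0.875
```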
Although it is simple to use and implement, it is suitable only when roughly equal numbers of samples
belong to each class. It is good to use the accuracy metric when the target variable classes in the data are
approximately balanced. For example, if 60% of the images in a fruit dataset are of Apple and 40% are of
Mango, the classes are reasonably balanced, so a high accuracy score (say, 97%) from an Apple-vs-Mango
classifier is a meaningful indicator of performance.
It is recommended not to use the accuracy measure when the target variable majorly belongs to one class.
For example, suppose there is a disease-prediction model in which, out of 100 people, only five have the
disease and 95 do not. If the model predicts that no one has the disease (a useless model), the accuracy
measure is still 95%, which is misleading.
II. Confusion Matrix
A confusion matrix is a tabular summary of a classifier's predictions against the actual values. It is simple
to implement, but the terminologies used in this matrix might be confusing for beginners. A typical
confusion matrix for a binary classifier is laid out as below (it can also be extended to classifiers with more
than two classes):

                Predicted: No    Predicted: Yes
Actual: No           TN               FP
Actual: Yes          FN               TP
o In the matrix, columns are for the predicted values, and rows specify the actual values. Here, both actual and
predicted values have two possible classes, Yes or No. So, if we are predicting the presence of a disease in a patient,
a prediction of Yes means the patient has the disease, and No means the patient doesn't have the disease.
o In this example, the total number of predictions is 165, out of which the model predicted Yes 110 times and No 55
times.
o In reality, however, there are 60 cases in which patients don't have the disease and 105 cases in which they do.
In general, the table is divided into four terminologies, which are as follows:
1. True Positive (TP): the model predicted Yes (positive), and the actual value is also Yes.
2. True Negative (TN): the model predicted No (negative), and the actual value is also No.
3. False Positive (FP): the model predicted Yes, but the actual value is No.
4. False Negative (FN): the model predicted No, but the actual value is Yes.
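A minimal sketch using scikit-learn's confusion_matrix; for binary 0/1 labels, ravel() returns the four cells in the order TN, FP, FN, TP (the labels here are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical predicted labels

# Rows are actual values, columns are predicted values.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                      # -> 4 1 1 4
```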
III. Precision
The precision metric is used to overcome the limitation of accuracy. Precision determines the proportion
of positive predictions that were actually correct. It is calculated as the ratio of true positives to the total
number of positive predictions (true positives plus false positives):

Precision = TP / (TP + FP)
IV. Recall
The recall metric determines the proportion of actual positive cases that the classifier identified correctly.
It is calculated as the ratio of true positives to all actual positives (true positives plus false negatives):

Recall = TP / (TP + FN)

From the above definitions of precision and recall, we can say that recall describes the performance of a
classifier with respect to false negatives, whereas precision describes its performance with respect to false
positives. So, if we want to minimize false negatives, recall should be as close to 100% as possible, and if
we want to minimize false positives, precision should be as close to 100% as possible. In simple words,
maximizing precision minimizes FP errors, and maximizing recall minimizes FN errors.
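A short sketch computing both metrics with scikit-learn, reusing the hypothetical labels from the confusion-matrix example above:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 4 / 5 = 0.8
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 4 / 5 = 0.8
```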
V. F-Score
The F-score or F1 score is a metric to evaluate a binary classification model on the basis of the predictions
made for the positive class. It is calculated with the help of precision and recall, and it is a single score
that represents both. The F1 score is the harmonic mean of precision and recall, assigning equal weight to
each of them:

F1 = 2 x (Precision x Recall) / (Precision + Recall)

Because the F-score makes use of both precision and recall, it should be used when both of them are
important for evaluation, for example when false negatives and false positives both carry a meaningful
cost but one is somewhat more important to consider than the other.
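A quick check with scikit-learn, reusing the hypothetical labels from above (precision and recall were both 0.8 there):

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = 0.8, 0.8                           # from the previous example
print(2 * precision * recall / (precision + recall))   # harmonic mean = 0.8
print(f1_score(y_true, y_pred))                        # same result
```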
VI. AUC-ROC
Sometimes we need to visualize the performance of a classification model on a chart; for this, we can use
the AUC-ROC curve. It is one of the most popular and important metrics for evaluating the performance of
a classification model.
Firstly, let's understand the ROC (Receiver Operating Characteristic) curve. ROC is a graph showing the
performance of a classification model at different threshold levels. The curve is plotted between two
parameters:
o TPR (True Positive Rate), a synonym for recall, calculated as TPR = TP / (TP + FN).
o FPR (False Positive Rate), calculated as FPR = FP / (FP + TN).
AUC calculates the performance across all possible thresholds and provides an aggregate measure. The
value of AUC ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, whereas
a model whose predictions are 100% correct has an AUC of 1.0.
AUC should be used to measure how well the predictions are ranked rather than their absolute values.
Moreover, it measures the quality of predictions of the model without considering the classification
threshold.
However, AUC is scale-invariant, which is not always desirable: when calibrated probability outputs are
needed, AUC is not a suitable metric. Further, AUC is not useful when there are wide disparities between
the costs of false negatives and false positives and we care about minimizing one particular type of
classification error.
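A hedged sketch with scikit-learn; the scores below are hypothetical predicted probabilities, since AUC operates on ranked scores rather than hard labels:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities

# Aggregate measure across all thresholds.
print(roc_auc_score(y_true, y_scores))        # 0.75

# Points of the ROC curve: FPR vs. TPR at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)
```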
2. Performance Metrics for Regression
Regression is a supervised learning technique that aims to find the relationship between the dependent and
independent variables. A predictive regression model predicts a numeric, continuous value. The metrics
used for regression are different from the classification metrics; this means we cannot use the accuracy
metric (explained above) to evaluate a regression model. Instead, the performance of a regression model is
reported as errors in the prediction. The following popular metrics are used to evaluate the performance of
regression models:
o Mean Absolute Error (MAE)
o Mean Squared Error (MSE)
o R Squared (R²)
o Adjusted R Squared
Mean Absolute Error (MAE)
To understand MAE, let's take the example of linear regression, where the model draws a best-fit line
between the dependent and independent variables. To measure the error in a prediction, we calculate the
difference between the actual value and the predicted value. To find the absolute error for the complete
dataset, we take the mean of the absolute differences:

MAE = (1/N) x Σ |Y − Y'|

Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.
MAE is more robust to outliers. One limitation of MAE is that it is not differentiable at zero, which
complicates gradient-based optimization. To overcome this limitation, another metric can be used: Mean
Squared Error, or MSE.
Mean Squared Error (MSE)
Mean Squared Error measures the average of the squared differences between the actual and the predicted
values:

MSE = (1/N) x Σ (Y − Y')²

Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.
Because the differences are squared, MSE penalizes even small errors, which can lead to an over-estimation
of how bad the model is. On the other hand, MSE is a much-preferred metric compared to other regression
metrics because it is differentiable and hence can be optimized better.
R Squared and Adjusted R Squared
The R squared score will always be less than or equal to 1, regardless of whether the values are large or
small. One issue with R squared is that its value increases (or at least stays the same) whenever a new
predictor is added to the model, even if that predictor adds no real value. To overcome this issue, adjusted
R squared is used, which will always show a value lower than or equal to R². It adjusts for the number of
predictors and shows an improvement only if there is a real improvement:

Adjusted R² = 1 − (1 − R²) x (N − 1) / (N − p − 1)

Here, N is the number of data points and p is the number of predictors.
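A minimal sketch of these metrics with scikit-learn; the values are hypothetical, and since scikit-learn has no built-in adjusted R², it is computed manually from the formula above with an assumed predictor count:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # hypothetical actual values
y_pred = [2.5,  0.0, 2.0, 8.0]   # hypothetical predicted values

print(mean_absolute_error(y_true, y_pred))   # 0.5
print(mean_squared_error(y_true, y_pred))    # 0.375
r2 = r2_score(y_true, y_pred)
print(r2)                                    # ~0.948

# Adjusted R²: N samples, p predictors (p = 1 is assumed here).
N, p = len(y_true), 1
print(1 - (1 - r2) * (N - 1) / (N - p - 1))
```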
Ensemble Learning: Bagging and Boosting
Ensemble learning combines multiple weak learners into one stronger model. Two widely used ensemble
techniques are:
1. Bagging: a homogeneous weak learners' model in which the learners are trained independently of each
other in parallel, and their outputs are combined to determine the model average.
2. Boosting: also a homogeneous weak learners' model, but it works differently from bagging. Here, the
learners are trained sequentially and adaptively, each one improving on the predictions made so far.
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed
to improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree
methods. Bagging is a special case of the model averaging approach.
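A minimal sketch with scikit-learn's BaggingClassifier: many decision trees are trained on bootstrap samples and their predictions are combined by majority vote (the toy dataset and all parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for real training data.
X, y = make_classification(n_samples=200, random_state=0)

# 20 trees, each fit on a bootstrap sample of the training set.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                        random_state=0)
bag.fit(X, y)
print(bag.score(X, y))   # training accuracy of the combined vote
```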
Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of
weak classifiers. It does this by building weak models in series. First, a model is built from the training
data. Then a second model is built that tries to correct the errors of the first model. This procedure
continues, and models are added, until either the complete training dataset is predicted correctly or the
maximum number of models has been added.
Boosting Algorithms
1. Initialise the dataset and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly
classified data points, then normalize the weights of all data points.
4. If the required results have been achieved, go to step 5; otherwise, go to step 2.
5. End.
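The steps above are essentially what AdaBoost does. A minimal sketch with scikit-learn's AdaBoostClassifier, which implements this weight-update loop internally over decision stumps (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy dataset standing in for real training data.
X, y = make_classification(n_samples=200, random_state=0)

# 50 weak learners trained sequentially, each reweighting the data
# toward the points the previous learners got wrong.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
print(boost.score(X, y))
```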
Similarities Between Bagging and Boosting
Bagging and boosting, both being commonly used methods, share the universal similarity of being
classified as ensemble methods. Here are the similarities between them.
1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or by taking the majority of them, i.e.,
majority voting).
4. Both are good at reducing variance and provide higher stability.
Differences Between Bagging and Boosting
1. Bagging is the simplest way of combining predictions that belong to the same type, whereas boosting
combines predictions that belong to different types.
2. Bagging aims to decrease variance, not bias; boosting aims to decrease bias, not variance.
3. In bagging, each model receives equal weight; in boosting, models are weighted according to their
performance.
4. In bagging, different training data subsets are selected using row sampling with replacement (random
sampling) from the entire training dataset; in boosting, every new subset contains the elements that were
misclassified by previous models.
5. Bagging tries to solve the over-fitting problem; boosting tries to reduce bias.
6. If the classifier is unstable (high variance), apply bagging; if the classifier is stable and simple (high
bias), apply boosting.
7. In bagging, base classifiers are trained in parallel; in boosting, base classifiers are trained sequentially.
8. Example: the Random Forest model uses bagging, whereas AdaBoost uses boosting.
Deep Generative Models
Deep generative models learn the underlying distribution of the training data so that new, similar samples
can be generated from it. They broadly fall into two categories:
1. Autoregressive models: Autoregressive models predict the probability distribution of the next
element in a sequence based on the previous elements. Examples of autoregressive models
include language models and PixelCNN.
2. Latent variable models: Latent variable models, on the other hand, model the underlying
structure of the data by learning a set of hidden variables that capture the important features of
the data. Examples of latent variable models include Variational Autoencoders (VAEs) and
Generative Adversarial Networks (GANs).
VAEs are a type of latent variable model that learns to encode input data into a lower-dimensional
representation called the "latent space." This latent space can then be used to generate new samples
that are similar to the training data.
GANs, on the other hand, consist of two neural networks: a generator network that generates
new data samples, and a discriminator network that distinguishes between real and fake samples.
The two networks are trained together in a "minimax" game, where the generator tries to create
samples that fool the discriminator, and the discriminator tries to correctly identify real and fake
samples. Over time, the generator learns to create increasingly realistic samples.
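To make the minimax game concrete, here is a minimal toy sketch, assuming PyTorch (the text names no framework); the toy data, network sizes, and hyperparameters are all illustrative:

```python
import torch
import torch.nn as nn

latent_dim = 8   # size of the noise vector fed to the generator
data_dim = 2     # dimensionality of the (toy) real data

# Generator: noise -> fake sample. Discriminator: sample -> P(real).
generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(
    nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 2.0   # toy "real" distribution
    fake = generator(torch.randn(64, latent_dim))

    # Discriminator step: push real samples toward 1, fakes toward 0.
    d_loss = (bce(discriminator(real), torch.ones(64, 1)) +
              bce(discriminator(fake.detach()), torch.zeros(64, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator output 1 for fakes.
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```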
Deep generative models have many applications in various fields such as art, fashion, and music.
They can also be used to generate synthetic data that can be used to augment small datasets, to
improve the performance of machine learning models, or to preserve privacy by generating data
that does not contain personal information.
Deep Boltzmann Machines:
A Deep Boltzmann Machine (DBM) is a generative model built from multiple layers of stochastic units.
Unlike traditional Boltzmann Machines, DBMs have more than one hidden layer, making them more
capable of representing complex distributions. The structure of a DBM allows for interactions between the
visible units and the hidden units in each layer, as well as between the hidden units in different layers.
DBMs are trained using a technique called contrastive divergence, which involves iteratively
updating the model parameters to maximize the likelihood of the training data. The training
process involves minimizing a cost function that is based on the difference between the model's
output and the true data distribution.
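Training a full multi-layer DBM is involved, but the contrastive-divergence update itself is easiest to see on the simpler single-layer building block, a Restricted Boltzmann Machine. Below is a NumPy sketch of one-step contrastive divergence (CD-1); all sizes and the toy binary data are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))  # weights
b_v = np.zeros(n_visible)                           # visible biases
b_h = np.zeros(n_hidden)                            # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.integers(0, 2, size=(16, n_visible)).astype(float)  # toy binary data

for step in range(100):
    # Positive phase: hidden activations given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one reconstruction step (CD-1).
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Update: difference between data-driven and model-driven statistics.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
```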
Once trained, a DBM can be used to generate new data samples by iteratively sampling from the
hidden units in each layer and propagating the samples through the network until the visible units
are reached. DBMs have been successfully applied to a range of tasks, including image and
speech recognition, natural language processing, and drug discovery. However, they are
computationally expensive and can be difficult to train, especially for large datasets.
Variations of DBMs, such as Convolutional Deep Boltzmann Machines (CDBMs) and Recurrent Deep
Boltzmann Machines (RDBMs), have been developed to address these issues.
Deep auto-encoders:
Deep autoencoders are a class of neural networks used for unsupervised learning, feature
learning, and data compression. They are based on the concept of autoencoders, which are neural
networks that are trained to reconstruct the input data from a compressed representation of the
data.
A deep autoencoder consists of an encoder network that maps the input data to a compressed
representation, and a decoder network that maps the compressed representation back to the
original input data. The encoder and decoder networks are typically symmetric, and the number
of hidden layers in the encoder network is the same as the number of hidden layers in the
decoder network.
The compressed representation of the input data is often referred to as the code, latent variable,
or bottleneck layer. This layer has a lower dimensionality than the input data, and it captures the
most important features of the input data. The process of compressing the input data into a
lower-dimensional code is called encoding, and the process of reconstructing the input data from
the code is called decoding.
Deep autoencoders are trained using a process called backpropagation, which involves
minimizing the difference between the input data and the reconstructed output data. This is
typically done by minimizing a loss function, such as the mean squared error between the input
and output data. The weights of the network are adjusted iteratively using an optimization
algorithm, such as stochastic gradient descent or Adam.
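A minimal sketch of this training setup, assuming PyTorch and a flattened 784-dimensional input (e.g., 28x28 images); all layer sizes and the stand-in data are illustrative:

```python
import torch
import torch.nn as nn

# Symmetric encoder/decoder with a low-dimensional bottleneck ("code").
encoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32))                 # 32-dim latent code
decoder = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784))

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()                  # reconstruction loss

x = torch.rand(64, 784)                 # stand-in batch of flattened inputs
for epoch in range(100):
    recon = decoder(encoder(x))         # encode, then decode
    loss = loss_fn(recon, x)            # compare reconstruction to the input
    opt.zero_grad()
    loss.backward()
    opt.step()
```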
Deep autoencoders can be used for a variety of applications, including data compression,
dimensionality reduction, anomaly detection, and image denoising. They can also be used for
transfer learning, where the pre-trained encoder network is used as a feature extractor for a
supervised learning task.
One limitation of deep autoencoders is that they are prone to overfitting, especially when the
network has a large number of parameters. Various regularization techniques, such as dropout,
early stopping, and weight decay, can be used to prevent overfitting. Variations of deep
autoencoders, such as denoising autoencoders and variational autoencoders, have been developed
to address some of these limitations.
Overall, deep neural networks (DNNs), including the generative models and autoencoders discussed
above, have shown great promise in a wide range of applications and are likely to play an increasingly
important role in many industries in the years to come.