
UNIT– V

Machine Learning Algorithm Analytics: Evaluating Machine Learning algorithms, Model Selection, Ensemble Methods (Boosting, Bagging, and Random Forests). Modeling Sequence/Time-Series Data and Deep Learning: Deep generative models, Deep Boltzmann Machines, Deep auto-encoders, Applications of Deep Networks.

Evaluating Machine Learning algorithms:


Evaluating the performance of a Machine learning model is one of the important steps while building an
effective ML model. To evaluate the performance or quality of the model, different metrics are used,
and these metrics are known as performance metrics or evaluation metrics. These performance metrics
help us understand how well our model has performed for the given data. In this way, we can improve the
model's performance by tuning the hyper-parameters. Each ML model aims to generalize well on
unseen/new data, and performance metrics help determine how well the model generalizes on the new
dataset.

In machine learning, tasks are broadly divided into classification and regression problems. Not all metrics can be used for all types of problems; hence, it is important to know and understand which metrics should be used for which.
1. Performance Metrics for Classification
In a classification problem, the category or classes of data is identified based on training data. The model
learns from the given dataset and then classifies the new data into classes or groups based on the training.
It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the
performance of a classification model, different metrics are used, and some of them are as follows:

o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC-ROC (Area Under the ROC Curve)

I. Accuracy
The accuracy metric is one of the simplest classification metrics to implement, and it is defined as the ratio of the number of correct predictions to the total number of predictions.

It can be formulated as:

Accuracy = (Number of correct predictions) / (Total number of predictions) = (TP + TN) / (TP + TN + FP + FN)

To implement an accuracy metric, we can compare ground truth and predicted values in a loop, or we can
also use the scikit-learn module for this.
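For example, here is a minimal sketch (assuming scikit-learn is installed; the label arrays are made up purely for illustration):

from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Manual computation: fraction of predictions that match the ground truth
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Equivalent computation using scikit-learn
sk_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sk_accuracy)  # both print 0.75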

Although it is simple to use and implement, it is suitable only for cases where the classes contain roughly equal numbers of samples.

When to Use Accuracy?

It is good to use the Accuracy metric when the target variable classes in the data are approximately balanced. For example, if 60% of the images in a fruit dataset are of Apple and 40% are of Mango, the classes are reasonably balanced, so accuracy gives a fair picture of how well a model that classifies images as Apple or Mango is performing.

When not to use Accuracy?

It is recommended not to use the Accuracy measure when the target variable majorly belongs to one class. For example, suppose there is a disease-prediction model in which, out of 100 people, only five people have the disease and 95 do not. If the model simply predicts that no person has the disease (a useless model), its Accuracy will still be 95%, which gives a misleading picture of its quality.

II. Confusion Matrix


A confusion matrix is a tabular representation of prediction outcomes of any binary classifier, which is
used to describe the performance of the classification model on a set of test data when true values are
known.

The confusion matrix is simple to implement, but the terminologies used in this matrix might be confusing
for beginners.

A typical confusion matrix for a binary classifier has the following layout (it can be extended to classifiers with more than two classes):

                Predicted: No     Predicted: Yes
Actual: No           TN                FP
Actual: Yes          FN                TP

We can determine the following from the above matrix:

o In the matrix, the columns are for the predicted values, and the rows specify the actual values. Both the actual and predicted labels take two possible classes, Yes or No. So, if we are predicting the presence of a disease in a patient, a prediction of Yes means the patient has the disease, and No means the patient does not have the disease.
o In this example, the total number of predictions is 165, out of which the model predicted Yes 110 times and No 55 times.
o In reality, there are 60 cases in which the patients do not have the disease and 105 cases in which they do.

In general, the table is divided into four terminologies, which are as follows:
1. True Positive (TP): The model predicted the positive class, and the actual class is also positive.
2. True Negative (TN): The model predicted the negative class, and the actual class is also negative.
3. False Positive (FP): The model predicted the positive class, but the actual class is negative.
4. False Negative (FN): The model predicted the negative class, but the actual class is positive.
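As a minimal sketch (assuming scikit-learn is available; the labels below are made up for illustration), the four counts can be read directly from scikit-learn's confusion matrix:

from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 1 = disease, 0 = no disease
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For labels [0, 1], scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")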

III. Precision
The precision metric is used to overcome the limitation of Accuracy. Precision is the proportion of positive predictions that were actually correct. It is calculated as the number of true positives divided by the total number of positive predictions (true positives plus false positives):

Precision = TP / (TP + FP)

IV. Recall or Sensitivity


It is similar to the Precision metric; however, it measures the proportion of actual positives that were identified correctly. It is calculated as the number of true positives divided by the total number of actual positives, i.e., those correctly predicted as positive plus those incorrectly predicted as negative (true positives plus false negatives).

The formula for calculating Recall is given below:

Recall = TP / (TP + FN)

When to use Precision and Recall?

From the above definitions of Precision and Recall, we can say that recall describes the performance of a classifier with respect to false negatives, whereas precision describes its performance with respect to false positives.

So, if we want to minimize false negatives, Recall should be as close to 100% as possible, and if we want to minimize false positives, Precision should be as close to 100% as possible.

In simple words, maximizing precision minimizes FP errors, and maximizing recall minimizes FN errors.
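A minimal sketch (assuming scikit-learn; the labels are the same made-up example as above):

from sklearn.metrics import precision_score, recall_score

# Hypothetical binary labels, 1 = positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP), Recall = TP / (TP + FN)
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print("Recall:", recall_score(y_true, y_pred))        # 3 / (3 + 1) = 0.75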
V. F-Scores
F-score or F1 Score is a metric used to evaluate a binary classification model on the basis of the predictions made for the positive class. It is calculated from Precision and Recall and condenses both into a single score: the F1 Score is the harmonic mean of precision and recall, assigning equal weight to each of them.

The formula for calculating the F1 score is given below:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

When to use F-Score?

As the F-score makes use of both precision and recall, it should be used when both of them are important for the evaluation and a single summary number is needed, for example when false negatives and false positives both carry significant costs.
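As a minimal sketch (assuming scikit-learn; same made-up labels as before), f1_score gives exactly the harmonic mean of precision and recall:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# Harmonic mean of precision and recall
print(2 * p * r / (p + r))        # 0.75
print(f1_score(y_true, y_pred))   # 0.75, the same value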

VI. AUC-ROC
Sometimes we need to visualize the performance of the classification model on charts; then, we can use
the AUC-ROC curve. It is one of the popular and important metrics for evaluating the performance of the
classification model.

First, let's understand the ROC (Receiver Operating Characteristic) curve. The ROC curve is a graph that shows the performance of a classification model at different threshold levels. The curve is plotted between two parameters, which are:

o True Positive Rate


o False Positive Rate

TPR or True Positive Rate is a synonym for Recall, and hence can be calculated as:

TPR = TP / (TP + FN)

FPR or False Positive Rate can be calculated as:

FPR = FP / (FP + TN)

To compute the points of a ROC curve, we could evaluate the classification model many times with different classification thresholds, but this would be inefficient. Instead, an efficient aggregate measure over all thresholds is used, known as AUC.

AUC: Area Under the ROC curve


AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional area under the entire ROC curve.

AUC aggregates the model's performance across all classification thresholds into a single measure. The value of AUC ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, whereas a model whose predictions are 100% correct has an AUC of 1.0.
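A minimal sketch (assuming scikit-learn; the scores below are hypothetical predicted probabilities for the positive class):

from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]

# Points of the ROC curve at each distinct threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Single aggregate measure across all thresholds
print("AUC:", roc_auc_score(y_true, y_scores))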

When to Use AUC

AUC should be used to measure how well the predictions are ranked rather than their absolute values.
Moreover, it measures the quality of predictions of the model without considering the classification
threshold.

When not to use AUC

AUC is scale-invariant, which is not always desirable; when we need well-calibrated probability outputs, AUC is not the right metric.

Further, AUC is not a useful metric when there are wide disparities in the cost of false negatives vs. false
positives, and it is difficult to minimize one type of classification error.
2. Performance Metrics for Regression
Regression is a supervised learning technique that aims to find the relationships between the dependent
and independent variables. A predictive regression model predicts a numeric (continuous) value. The
metrics used for regression are different from the classification metrics. It means we cannot use the
Accuracy metric (explained above) to evaluate a regression model; instead, the performance of a
Regression model is reported as errors in the prediction. Following are the popular metrics that are used to
evaluate the performance of Regression models.

o Mean Absolute Error


o Mean Squared Error
o R2 Score
o Adjusted R2

I. Mean Absolute Error (MAE)


Mean Absolute Error or MAE is one of the simplest metrics. It measures the absolute difference between the actual and predicted values, where absolute means that the sign of the difference is ignored and only its magnitude is kept.

To understand MAE, let's take the example of Linear Regression, where the model draws a best-fit line between the dependent and independent variables. To measure the error in prediction, we calculate the difference between the actual and predicted values. To obtain a single error value for the complete dataset, we take the mean of these absolute differences.

The below formula is used to calculate MAE:

MAE = (1/N) * Σ |Y − Y'|

Here,

Y is the Actual outcome, Y' is the predicted outcome, and N is the total number of data points.

MAE is more robust to outliers than squared-error metrics. One of its limitations is that the absolute value is not differentiable at zero, which makes it less convenient for gradient-based optimizers such as Gradient Descent. To overcome this limitation, another metric can be used, which is Mean Squared Error or MSE.
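A minimal sketch (assuming NumPy and scikit-learn; the values are made up):

import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual (Y) and predicted (Y') values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MAE = (1/N) * sum(|Y - Y'|)
manual_mae = np.mean(np.abs(y_true - y_pred))
print(manual_mae, mean_absolute_error(y_true, y_pred))  # both 0.5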

II. Mean Squared Error


Mean Squared Error or MSE is one of the most commonly used metrics for regression evaluation. It measures the average of the squared differences between the values predicted by the model and the actual values. Since the errors are squared, MSE only takes non-negative values, and it is usually positive and non-zero.

Moreover, because the differences are squared, larger errors are penalized disproportionately, so a few outliers can inflate MSE and lead to an over-estimation of how bad the model is.

MSE is a much-preferred metric compared to other regression metrics because it is differentiable everywhere and can therefore be optimized more easily.

The formula for calculating MSE is given below:

MSE = (1/N) * Σ (Y − Y')²

Here,

Y is the Actual outcome, Y' is the predicted outcome, and N is the total number of data points.
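A minimal sketch (assuming NumPy and scikit-learn; same made-up values as in the MAE example):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = (1/N) * sum((Y - Y')^2)
manual_mse = np.mean((y_true - y_pred) ** 2)
print(manual_mse, mean_squared_error(y_true, y_pred))  # both 0.375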

III. R Squared Score


R squared, also known as the Coefficient of Determination, is another popular metric used for regression model evaluation. The R-squared metric compares our model with a constant baseline: the baseline model simply predicts the mean of the target values for every data point (a horizontal line at the mean).

The R squared score is always less than or equal to 1, regardless of how large or small the values are (and it can even be negative if the model performs worse than the mean baseline).
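A minimal sketch (assuming NumPy and scikit-learn; made-up values) comparing r2_score with the explicit formula R² = 1 − SSres / SStot, where SStot is computed around the mean baseline:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))  # both approximately 0.949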

IV. Adjusted R Squared


Adjusted R squared, as the name suggests, is an improved version of R squared. R squared has the limitation that its value can only increase (or stay the same) as more predictor terms are added, even when those terms do not actually improve the model, which may mislead data scientists.

To overcome this issue, adjusted R squared is used. It always shows a value lower than (or equal to) R² because it penalizes the number of predictors and only increases when an added predictor brings a real improvement.

We can calculate the adjusted R squared as follows:

Ra² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]


Here,

n is the number of observations

k denotes the number of independent variables

and Ra2 denotes the adjusted R2
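Scikit-learn does not provide adjusted R² directly, so a minimal sketch (assuming NumPy and scikit-learn; the data, n, and k are made up for illustration) simply applies the formula above on top of r2_score:

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observations and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.5, 1.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.0, 1.5])

n = len(y_true)  # number of observations
k = 2            # number of independent variables (assumed for illustration)

r2 = r2_score(y_true, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adjusted_r2)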

Ensemble Methods (Boosting, Bagging, and Random Forests)


Ensemble learning helps improve machine learning results by combining several models. This approach generally produces better predictive performance than any single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote. Bagging and Boosting are two types of ensemble learning. Both decrease the variance of a single estimate by combining several estimates from different models, so the result is often a model with higher stability.

1. Bagging: A homogeneous ensemble of weak learners that are trained independently of each other in parallel, and whose predictions are combined (for example by averaging or voting) to determine the final output.
2. Boosting: Also a homogeneous ensemble of weak learners, but it works differently from Bagging: the learners are trained sequentially and adaptively, each one trying to improve on the predictions of the previous ones.

Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed
to improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree
methods. Bagging is a special case of the model averaging approach.

Description of the Technique


Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is selected from D by row sampling with replacement (i.e., a bootstrap sample, so the same tuple can appear more than once in Di). A classifier model Mi is then learned from each training set Di. Each classifier Mi returns its class prediction, and the bagged classifier M* counts the votes and assigns to the unknown sample X the class with the most votes.

Implementation Steps of Bagging


 Step 1: Multiple subsets are created from the original data set with equal tuples, selecting observations
with replacement.
 Step 2: A base model is created on each of these subsets.
 Step 3: Each model is learned in parallel with each training set and independent of each other.
 Step 4: The final predictions are determined by combining the predictions from all the models.
Example of Bagging
The Random Forest model uses Bagging: it builds an ensemble of decision trees (individually high-variance models) on bootstrap samples and additionally uses random feature selection when growing each tree. Several such random trees together make a Random Forest.
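A minimal sketch with scikit-learn (assuming scikit-learn and its bundled iris dataset; the hyperparameter values are arbitrary), showing both a plain bagging ensemble of decision trees and a Random Forest:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: decision trees (the default base estimator) trained in parallel on
# bootstrap samples, with the final prediction decided by majority vote
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Random Forest: bagging of decision trees plus random feature selection per split
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

print("Bagging accuracy:", bagging.score(X_test, y_test))
print("Random Forest accuracy:", forest.score(X_test, y_test))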

Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It does this by building a series of weak models. First, a model is built from the training data. Then a second model is built that tries to correct the errors of the first model. This procedure continues, and models are added until either the complete training data set is predicted correctly or the maximum number of models has been added.
Boosting Algorithms
1. Initialise the dataset and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points, decrease the weights of the correctly classified data points, and then normalize the weights of all data points.
4. If the required results have been obtained, go to step 5; otherwise, go to step 2.
5. End
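A minimal sketch of boosting with scikit-learn's AdaBoostClassifier (assuming scikit-learn and the same iris split as in the bagging sketch; hyperparameters are arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost trains weak learners sequentially, re-weighting misclassified
# points so that later learners focus on the harder examples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)
print("AdaBoost accuracy:", boosting.score(X_test, y_test))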
Similarities Between Bagging and Boosting
Bagging and Boosting are both commonly used methods and share the obvious similarity of being ensemble methods. The main similarities between them are as follows:
1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or taking the majority of them i.e Majority
Voting).
4. Both are good at reducing variance and provide higher stability.
Differences Between Bagging and Boosting

1. Bagging is the simplest way of combining predictions that belong to the same type, whereas Boosting combines predictions that belong to different types.
2. Bagging aims to decrease variance, not bias; Boosting aims to decrease bias, not variance.
3. In Bagging, each model receives equal weight; in Boosting, models are weighted according to their performance.
4. In Bagging, each model is built independently; in Boosting, new models are influenced by the performance of previously built models.
5. In Bagging, different training data subsets are selected from the entire training dataset using row sampling with replacement; in Boosting, every new subset contains the elements that were misclassified by previous models.
6. Bagging tries to solve the over-fitting problem; Boosting tries to reduce bias.
7. If the classifier is unstable (high variance), apply Bagging; if the classifier is stable and simple (high bias), apply Boosting.
8. In Bagging, base classifiers are trained in parallel; in Boosting, base classifiers are trained sequentially.
9. Example: the Random Forest model uses Bagging, whereas AdaBoost uses Boosting.

Deep generative models


Deep generative models are a type of machine learning model that can generate new data
samples that are similar to the training data they were trained on. These models can be used for a
variety of tasks, including image and speech synthesis, text generation, and even music
composition.

There are two main types of deep generative models:

1. Autoregressive models

2. Latent variable models.

1. Autoregressive models: Autoregressive models predict the probability distribution of the next
element in a sequence based on the previous elements. Examples of autoregressive models
include language models and PixelCNN.

2. Latent variable models: Latent variable models, on the other hand, model the underlying
structure of the data by learning a set of hidden variables that capture the important features of
the data. Examples of latent variable models include Variational Autoencoders (VAEs) and
Generative Adversarial Networks (GANs).
VAEs are a type of latent variable model that learns to encode input data into a lower-
dimensional representation called the "latent space." This latent space can then be used to
generate new samples that are similar to the training data.

GANs, on the other hand, consist of two neural networks: a generator network that generates
new data samples, and a discriminator network that distinguishes between real and fake samples.
The two networks are trained together in a "minimax" game, where the generator tries to create
samples that fool the discriminator, and the discriminator tries to correctly identify real and fake
samples. Over time, the generator learns to create increasingly realistic samples.
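As a hedged illustration of the minimax game described above (a toy sketch, not a production recipe), the following PyTorch code trains a tiny GAN on a made-up 1-D Gaussian; the network sizes, noise dimension, and all hyperparameters are assumptions chosen only for demonstration:

import torch
import torch.nn as nn

# Toy "real" data: samples from a 1-D Gaussian the generator must imitate
def real_batch(batch_size):
    return torch.randn(batch_size, 1) * 0.5 + 2.0

# Generator: maps random noise vectors to fake samples
generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

# Discriminator: outputs the probability that a sample is real
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()
batch = 64

for step in range(2000):
    # Train the discriminator: push real samples toward 1 and fake samples toward 0
    real = real_batch(batch)
    fake = generator(torch.randn(batch, 8)).detach()
    d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator: try to make the discriminator label its fakes as real
    fake = generator(torch.randn(batch, 8))
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, generated samples should cluster near the real mean of 2.0
print(generator(torch.randn(1000, 8)).mean().item())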

Deep generative models have many applications in various fields such as art, fashion, and music.
They can also be used to generate synthetic data that can be used to augment small datasets, to
improve the performance of machine learning models, or to preserve privacy by generating data
that does not contain personal information.

Deep Boltzmann Machines


Deep Boltzmann Machines (DBMs) are a class of generative neural networks that are used for
unsupervised learning tasks such as feature learning and density estimation. They are composed
of multiple layers of stochastic binary units that have undirected connections between them.

Unlike traditional Boltzmann Machines, DBMs have more than one hidden layer, making them
more capable of representing complex distributions. The structure of a DBM allows for
interactions between the visible units and the hidden units in each layer, as well as between the
hidden units in different layers.

DBMs are trained using a technique called contrastive divergence, which involves iteratively
updating the model parameters to maximize the likelihood of the training data. The training
process involves minimizing a cost function that is based on the difference between the model's
output and the true data distribution.
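Training a full DBM is involved, so as a hedged, simplified illustration of the contrastive-divergence idea, the sketch below trains a single Restricted Boltzmann Machine layer (the basic building block of deeper Boltzmann models) with one-step contrastive divergence (CD-1) in NumPy; the layer sizes, learning rate, and random toy data are all assumptions for demonstration:

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1

# Toy binary training data: each row is one visible vector
data = rng.integers(0, 2, size=(100, n_visible)).astype(float)

W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)  # visible biases
b_h = np.zeros(n_hidden)   # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for v0 in data:
        # Positive phase: hidden activations driven by the data
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(n_hidden) < p_h0).astype(float)

        # Negative phase (CD-1): one Gibbs step back to visible, then to hidden
        p_v1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.random(n_visible) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b_h)

        # Move parameters toward the data statistics and away from the model's
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_v += lr * (v0 - v1)
        b_h += lr * (p_h0 - p_h1)

# Reconstruction error as a rough indicator of training progress
recon = sigmoid(sigmoid(data @ W + b_h) @ W.T + b_v)
print("mean reconstruction error:", np.mean((data - recon) ** 2))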

Once trained, a DBM can be used to generate new data samples by iteratively sampling from the
hidden units in each layer and propagating the samples through the network until the visible units
are reached. DBMs have been successfully applied to a range of tasks, including image and
speech recognition, natural language processing, and drug discovery. However, they are
computationally expensive and can be difficult to train, especially for large datasets.

Variations of DBMs, such as Convolutional Deep Boltzmann Machines (CDBMs) and Recurrent Deep Boltzmann Machines (RDBMs), have been developed to address these issues.
Deep auto-encoders:
Deep autoencoders are a class of neural networks used for unsupervised learning, feature
learning, and data compression. They are based on the concept of autoencoders, which are neural
networks that are trained to reconstruct the input data from a compressed representation of the
data.

A deep autoencoder consists of an encoder network that maps the input data to a compressed
representation, and a decoder network that maps the compressed representation back to the
original input data. The encoder and decoder networks are typically symmetric, and the number
of hidden layers in the encoder network is the same as the number of hidden layers in the
decoder network.

The compressed representation of the input data is often referred to as the code, latent variable,
or bottleneck layer. This layer has a lower dimensionality than the input data, and it captures the
most important features of the input data. The process of compressing the input data into a
lower-dimensional code is called encoding, and the process of reconstructing the input data from
the code is called decoding.

Deep autoencoders are trained using a process called backpropagation, which involves
minimizing the difference between the input data and the reconstructed output data. This is
typically done by minimizing a loss function, such as the mean squared error between the input
and output data. The weights of the network are adjusted iteratively using an optimization
algorithm, such as stochastic gradient descent or Adam.
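A minimal deep autoencoder sketch in PyTorch (assuming PyTorch is installed; the 784-dimensional input, the layer sizes, and the random stand-in data are assumptions for illustration, e.g. flattened 28x28 images):

import torch
import torch.nn as nn

# Encoder: compresses a 784-dim input down to a 32-dim bottleneck code
encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 32),
)

# Decoder: mirrors the encoder and reconstructs the 784-dim input from the code
decoder = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

autoencoder = nn.Sequential(encoder, decoder)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Random stand-in data in [0, 1]; in practice this would be the real inputs
x = torch.rand(256, 784)

for epoch in range(20):
    x_hat = autoencoder(x)     # encode, then decode
    loss = loss_fn(x_hat, x)   # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The 32-dim code is the learned compressed representation
code = encoder(x)
print(code.shape)  # torch.Size([256, 32])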

Deep autoencoders can be used for a variety of applications, including data compression,
dimensionality reduction, anomaly detection, and image denoising. They can also be used for
transfer learning, where the pre-trained encoder network is used as a feature extractor for a
supervised learning task.

One limitation of deep autoencoders is that they are prone to overfitting, especially when the
network has a large number of parameters. Various regularization techniques, such as dropout,
early stopping, and weight decay, can be used to prevent overfitting. Variations of deep
autoencoders, such as denoising autoencoders and variational autoencoders, have been developed
to address some of these limitations.

Applications of Deep Networks


Deep neural networks (DNNs) have become increasingly popular in recent years due to their
ability to learn complex patterns from large amounts of data. Some of the applications of deep
networks include:
1. Image and speech recognition: Deep learning has been successfully used in image and
speech recognition tasks. DNNs are capable of detecting patterns in images and speech,
which can be used to identify objects or words.
2. Natural language processing: DNNs can be used to analyze natural language and
understand its meaning. This has led to applications such as chatbots and virtual
assistants.
3. Recommender systems: DNNs can be used to develop recommender systems that provide
personalized recommendations to users based on their past behavior.
4. Autonomous vehicles: DNNs are being used to develop autonomous vehicles that can
navigate roads and make decisions based on real-time data.
5. Healthcare: DNNs are being used in healthcare applications such as disease diagnosis and
drug discovery.
6. Gaming: DNNs are being used in the gaming industry to develop intelligent agents that
can learn to play games.
7. Finance: DNNs are being used in finance applications such as fraud detection and stock
market analysis.

Overall, DNNs have shown great promise in a wide range of applications and are likely to play
an increasingly important role in many industries in the years to come.
