
Unit IV Notes || Machine Learning || MC4301

Unit IV – Parametric Machine Learning.

Logistic Regression: Classification and representation

Introduction to Logistic Regression:

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target
variable. The nature of the target or dependent variable is dichotomous, which means there are only two
possible classes.

In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML
algorithms that can be used for various classification problems such as spam detection, Diabetes prediction,
cancer detection etc.

Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a
categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc.; however, instead of giving the
exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.

Logistic Regression is very similar to Linear Regression except in how it is used. Linear Regression
is used for solving regression problems, whereas Logistic Regression is used for solving classification
problems.

In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, whose
predictions are bounded by two limiting values (0 and 1).
The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.

Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.

Logistic Regression can be used to classify observations using different types of data and can easily
determine the most effective variables for the classification. The image below shows the logistic
function:

Logistic Function (Sigmoid Function):


1. The sigmoid function is a mathematical function used to map the predicted values to probabilities.
2. It maps any real value into another value within a range of 0 and 1.
3. The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic
function.
4. In logistic regression, we use the concept of a threshold value, which decides between the classes
0 and 1. Values above the threshold tend to 1, and values below the threshold tend to 0 (a minimal
code sketch of the sigmoid and threshold is shown below).
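To make the sigmoid and the threshold idea concrete, here is a minimal NumPy sketch (an illustrative addition, not part of the original notes); the cut-off value 0.5 is just an example choice.

# a minimal sigmoid + threshold sketch (illustrative)
import numpy as np

def sigmoid(z):
    # maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(z)                      # probabilities between 0 and 1
labels = (probs >= 0.5).astype(int)     # above the threshold -> 1, below -> 0
print(probs, labels)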

Assumptions for Logistic Regression:


1. The dependent variable must be categorical in nature.
2. The independent variables should not exhibit multicollinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given below:
1. We know the equation of a straight line can be written as:
   y = b0 + b1x1 + b2x2 + ... + bnxn

2. In Logistic Regression, y can only be between 0 and 1, so let's divide the above equation by (1 - y):
   y / (1 - y);   this is 0 for y = 0 and infinity for y = 1

3. But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:
   log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression: the log-odds (logit) is a linear function of the inputs.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

 Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
 Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of
the dependent variable, such as "cat", "dog", or "sheep".
 Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".

Steps in Logistic Regression:

To implement the Logistic Regression using Python, we will use the same steps as we have done in previous
topics of Regression. Below are the steps:

1. Data Pre-processing step


2. Fitting Logistic Regression to the Training set
3. Predicting the test result
4. Test accuracy of the result(Creation of Confusion matrix)
5. Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in our
code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this is given
below:

#Data Pre-processing Step


# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

2. Fitting Logistic Regression to the Training set:



We have prepared our dataset well, and now we will train the model using the training set. To fit the model
to the training set, we will import the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the logistic regression model to the
training data. Below is the code for it:

#Fitting Logistic Regression to the training set


from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
Output:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None,
    penalty='l2', random_state=0, solver='warn', tol=0.0001, verbose=0, warm_start=False)

3. Predicting the Test Result


Our model is well trained on the training set, so we will now predict the result by using test set data. Below is
the code for it:
# Predicting the test set result
y_pred= classifier.predict(x_test)

4. Test Accuracy of the result:

Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need
to import the confusion_matrix function of the sklearn library. After importing the function, we will call it
and store the result in a new variable cm. The function takes two parameters, mainly y_true (the actual values)
and y_pred (the predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix


from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

We can find the accuracy of the predicted result by interpreting the confusion matrix. From the output above, we
can interpret that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
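Accuracy can also be computed directly; the following is a minimal sketch (an illustrative addition, not part of the original notes), assuming y_test and y_pred exist as above.

#Computing accuracy from the predictions (illustrative)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)   # fraction of correct predictions
print(accuracy)                             # e.g. 0.89 for 89 correct predictions out of 100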

5. Visualizing the training set result:


Finally, we will visualize the training set result. To visualize the result, we will use ListedColormap class of
matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap
for visualizing the result. We have created two new variables x_set and y_set to replace x_train and y_train.
After that, we have used the nm.meshgrid command to create a rectangular grid, which ranges from each
feature's minimum value minus 1 to its maximum value plus 1, with a pixel resolution of 0.01.
To create a filled contour, we have used the mtp.contourf command; it creates regions of the provided colors
(purple and green). In this function, we have passed classifier.predict to show the data points
predicted by the classifier.

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:

 In the above graph, we can see that there are some Green points within the green region and Purple
points within the purple region.
 All these data points are the observation points from the training set, which shows the result for
purchased variables.
 This graph is made by using two independent variables i.e., Age on the x-axis and Estimated salary
on the y-axis.
 The purple point observations are for which purchased (dependent variable) is probably 0, i.e., users
who did not purchase the SUV car.
 The green point observations are for which purchased (dependent variable) is probably 1, i.e., users
who purchased the SUV car.
 We can also estimate from the graph that users who are younger with a low salary did not purchase
the car, whereas older users with a high estimated salary purchased the car.

 But there are some purple points in the green region (buying the car) and some green points in the
purple region (not buying the car). So we can say that some younger users with a high estimated salary
purchased the car, whereas some older users with a low estimated salary did not purchase the car.

Cost Function in Machine Learning

A Machine Learning model should have a very high level of accuracy in order to perform well in real-world
applications. But how do we measure how well or poorly our model will perform
in the real world? In such a case, the Cost function comes into play. It is an important machine learning
parameter used to correctly evaluate the model.

The cost function also plays a crucial role in understanding how well your model estimates the relationship
between the input and output parameters.

In this topic, we will explain the cost function in Machine Learning, Gradient descent, and types of cost
functions.

What is Cost Function?

A cost function is an important parameter that determines how well a machine learning model performs for a
given dataset. It calculates the difference between the expected value and predicted value and represents it as
a single real number.

In machine learning, once we train our model, we want to see how well it is performing.
Although there are various accuracy functions that tell you how your model is performing, they will not give
insights on how to improve it. So, we need a function that can find when the model is most accurate, by finding the
sweet spot between an undertrained and an overtrained model.

In simple, "Cost function is a measure of how wrong the model is in estimating the relationship between
X(input) and Y(output) Parameter." A cost function is sometimes also referred to as Loss function, and it can
be estimated by iteratively running the model to compare estimated predictions against the known values of
Y.

The main aim of each ML model is to determine parameters or weights that can minimize the cost function.

Types of Cost Function

Cost functions can be of various types depending on the problem. However, mainly it is of three types, which
are as follows:
1. Regression Cost Function
2. Binary Classification cost Functions
3. Multi-class Classification Cost Function.

1. Regression Cost Function


Regression models are used to make a prediction for the continuous variables such as the price of houses,
weather prediction, loan predictions, etc. When a cost function is used with Regression, it is known as the
"Regression Cost Function." In this, the cost function is calculated as the error based on the distance, such as:

Error= Actual Output-Predicted output

There are three commonly used Regression cost functions, which are as follows:

a. Mean Error
In this type of cost function, the error is calculated for each training example, and then the mean of all the error
values is taken.
It is one of the simplest approaches possible.

The errors from the training data can be either negative or positive. While taking the mean, they
can cancel each other out and result in a zero mean error for the model, so it is not a recommended cost
function for a model.

However, it provides a base for other cost functions of regression models.

b. Mean Squared Error (MSE)

Mean Squared Error is one of the most commonly used cost functions. It improves on the drawbacks of
the Mean Error cost function, as it calculates the square of the difference between the actual value and the predicted
value. Because the difference is squared, it avoids any possibility of negative error.
The formula for calculating MSE is given below:

MSE = (1/N) * Σ (Y - Y')²   where Y is the actual value, Y' the predicted value, and N the number of examples

Mean Squared Error is also known as L2 Loss.

In MSE, each error is squared, so small deviations (less than 1) shrink when squared, while large errors are amplified,
compared to MAE. If the dataset has outliers that generate large prediction errors, squaring those errors will
increase them many times over. Hence, we can say MSE is less robust to outliers.

c. Mean Absolute Error (MAE)

Mean Absolute Error also overcomes the issue of the Mean Error cost function by taking the absolute difference
between the actual value and the predicted value.
The formula for calculating Mean Absolute Error is given below:

MAE = (1/N) * Σ |Y - Y'|   where Y is the actual value, Y' the predicted value, and N the number of examples

The Mean Absolute Error cost function is also known as L1 Loss. It is much less affected by noise and outliers,
hence giving better results if the dataset has noise or outliers.

2. Binary Classification Cost Functions


Classification models are used to make predictions of categorical variables, such as predictions for 0 or 1, Cat
or dog, etc. The cost function used in the classification problem is known as the Classification cost function.
However, the classification cost function is different from the Regression cost function.
One of the commonly used loss functions for classification is cross-entropy loss.
The binary cost function is a special case of categorical cross-entropy, where there are only two classes.
For example, classification between red and blue.
To better understand it, let's suppose there is only a single output variable Y

Cross-entropy(D) = - y*log(p) when y = 1

Cross-entropy(D) = - (1-y)*log(1-p) when y = 0

The error in binary classification is calculated as the mean of the cross-entropy over all N training examples,
which means:
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
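To make the averaging over N examples concrete, here is a minimal NumPy sketch (an illustrative addition, not part of the original notes); the arrays y and p below are made-up example values.

#Binary cross-entropy (illustrative sketch)
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                                   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))       # mean of cross-entropy over N examples

y = np.array([1, 0, 1, 1])          # true labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities
print(binary_cross_entropy(y, p))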

3. Multi-class Classification Cost Function


A multi-class classification cost function is used in the classification problems for which instances are
allocated to one of more than two classes. Here also, similar to binary class classification cost function, cross-
entropy or categorical cross-entropy is commonly used cost function.
It is designed so that it can be used with multi-class classification, with target values ranging over 0,
1, 2, ..., n classes.

In a multi-class classification problem, cross-entropy generates a score that summarizes the mean
difference between the actual and the predicted probability distribution.
The score is minimized, and a perfect cross-entropy value is zero.

Gradient Descent in Machine Learning

Gradient Descent is known as one of the most commonly used optimization algorithms to train machine
learning models by means of minimizing errors between actual and expected results. Further, gradient descent
is also used to train Neural Networks.

In mathematical terminology, Optimization algorithm refers to the task of minimizing/maximizing an


objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the task of
minimizing the cost function parameterized by the model's parameters. The main objective of gradient descent
is to minimize the convex function using iteration of parameter updates. Once these machine learning models
are optimized, these models can be used as powerful tools for Artificial Intelligence and various computer
science applications.

In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about gradient descent, the
role of cost functions specifically as a barometer within Machine Learning, types of gradient descents, learning
rates, etc.

What is Gradient Descent or Steepest Descent?

Gradient descent was first proposed by Augustin-Louis Cauchy in the mid-19th century (1847). Gradient
Descent is defined as one of the most commonly used iterative optimization algorithms in machine learning,
used to train machine learning and deep learning models. It helps in finding the local minimum of a function.

The best way to define the local minimum or local maximum of a function using gradient descent is as follows:

If we move towards a negative gradient or away from the gradient of the function at the current point, it will
give the local minimum of that function.
Whenever we move towards a positive gradient or towards the gradient of the function at the current point,
we will get the local maximum of that function.

Moving against the gradient to find a minimum is known as Gradient Descent (also called steepest descent), while
moving along the gradient to find a maximum is known as Gradient Ascent. The main
objective of using a gradient descent algorithm is to minimize the cost function by iteration. To achieve
this goal, it performs two steps iteratively:

1. Calculate the first-order derivative of the function to compute the gradient or slope at the current point.
2. Move in the direction opposite to the gradient, i.e., step away from the current point by alpha
times the gradient, where alpha is defined as the Learning Rate. It is a tuning parameter in the optimization process which
helps to decide the length of the steps.
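Written as a worked update rule (the standard form consistent with the two steps above, where J is the cost function and θ the parameters):

θ_new = θ_old - α * ∇J(θ_old)

For the simple line y = mx + c used in the example program later, this becomes m = m - α * ∂J/∂m and c = c - α * ∂J/∂c.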

How does Gradient Descent work?

Before starting the working principle of gradient descent, we should know some basic concepts to find out the
slope of a line from linear regression. The equation for simple linear regression is given as:

Equation : Y=mX+c

Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.

The starting point (shown in above fig.) is used to evaluate the performance as it is considered just as an
arbitrary point. At this starting point, we will derive the first derivative or slope and then use a tangent line to
calculate the steepness of this slope. Further, this slope will inform the updates to the parameters (weights and
bias).
The slope is steeper at the starting or arbitrary point, but as new parameters are generated,
the steepness gradually reduces; at the lowest point, the slope approaches zero, and this is called the
point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected and
actual values. To minimize the cost function, two factors are required:

1. Direction & Learning Rate

These two factors are used to determine the partial derivative calculations of future iterations and move the model toward the
point of convergence, a local or global minimum. Let's discuss the learning rate factor in brief:

Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically a small value that is
evaluated and updated based on the behavior of the cost function. If the learning rate is high, it results in larger
steps but also risks overshooting the minimum. At the same time, a low learning rate gives
small step sizes, which compromises overall efficiency but gives the advantage of more precision.

Types of Gradient Descent

Based on the error in various training models, the Gradient Descent learning algorithm can be divided
into Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's
understand these different types of gradient descent:

1. Batch Gradient Descent:

Batch gradient descent (BGD) is used to find the error for each point in the training set and update the model
after evaluating all training examples. This procedure is known as the training epoch. In simple words, it is a
greedy approach where we have to sum over all examples for each update.

Advantages of Batch gradient descent:


o It produces less noise in comparison to the other gradient descent variants.
o It produces stable gradient descent convergence.
o It is computationally efficient, as all resources are used to process all training samples together.

2. Stochastic gradient descent:

Stochastic gradient descent (SGD) is a type of gradient descent that uses one training example per iteration.
In other words, it processes each example within the dataset in its own update, adjusting the parameters one
example at a time. As it requires only one training example at a time, it is easier to
fit in allocated memory. However, it loses some computational efficiency in comparison to batch
gradient descent, since its frequent updates require more processing. Further, due to these frequent
updates, the gradient is also treated as noisy. However, this noise can sometimes help in finding the global
minimum and escaping local minima.

Advantages of Stochastic gradient descent:


In Stochastic gradient descent (SGD), learning happens on every example, and it consists of a few advantages
over other gradient descent.
o It is easier to allocate in desired memory.
o It is relatively fast to compute than batch gradient descent.
o It is more efficient for large datasets.

3. MiniBatch Gradient Descent:

Mini-batch gradient descent is the combination of both batch gradient descent and stochastic gradient descent.
It divides the training dataset into small batches and then performs an update on each of those batches separately.
Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of
batch gradient descent and the speed of stochastic gradient descent. Hence, we can achieve a special type of
gradient descent with higher computational efficiency and a less noisy gradient (a small sketch follows after the advantages list below).

Advantages of Mini Batch gradient descent:


o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.
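The following is a minimal mini-batch gradient descent sketch for a simple line fit (an illustrative addition, not part of the original notes); the batch size, learning rate, and data are arbitrary example values.

#Mini-batch gradient descent sketch (illustrative)
import numpy as np

def minibatch_gd(X, Y, learning_rate=0.01, batch_size=2, epochs=200):
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)                 # shuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]      # indices of the current mini-batch
            xb, yb = X[batch], Y[batch]
            y_pred = m * xb + b
            md = -(2 / len(xb)) * np.sum(xb * (yb - y_pred))   # gradient w.r.t. the slope
            bd = -(2 / len(xb)) * np.sum(yb - y_pred)          # gradient w.r.t. the intercept
            m = m - learning_rate * md
            b = b - learning_rate * bd
    return m, b

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([5, 7, 9, 11, 13], dtype=float)
print(minibatch_gd(X, Y))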

Challenges with the Gradient Descent

Although Gradient Descent is one of the most popular methods for optimization problems, it still
has some challenges, as follows:

1. Local Minima and Saddle Point:


For convex problems, gradient descent can find the global minimum easily, while for non-convex problems,
it is sometimes difficult to find the global minimum, where the machine learning models achieve the best
results.

Whenever the slope of the cost function is at or very close to zero, the model stops learning further. Apart
from the global minimum, there are other scenarios where the slope can be zero: saddle points and local
minima. A local minimum has a shape similar to the global minimum, where the slope of the cost
function increases on both sides of the current point.

In contrast, at a saddle point the negative gradient occurs on only one side of the point: it is a
local maximum along one direction and a local minimum along another. The name comes from the
shape of a horse's saddle.
A local minimum is so named because the value of the loss function is minimal at that point within a local region.
In contrast, the global minimum is so named because the value of the loss function is minimal there,
globally, across the entire domain of the loss function.

2. Vanishing and Exploding Gradient


In a deep neural network, if the model is trained with gradient descent and backpropagation, there can occur
two more issues other than local minima and saddle point.

Vanishing Gradients:
Vanishing gradients occur when the gradient is smaller than expected. During backpropagation, the gradient
becomes progressively smaller, so the earlier layers of the network learn more slowly than the later layers.
Once this happens, the weight updates become so small that they are effectively insignificant.

Exploding Gradient:
The exploding gradient is just the opposite of the vanishing gradient: it occurs when the gradient is too large and
creates an unstable model. In this scenario, the model weights grow so large that they may eventually be represented as NaN.
This problem can be addressed using dimensionality reduction techniques, which help to minimize complexity
within the model.

Example program:
import numpy as np

def gradient_descent(x, y):
    m_curr = b_curr = 0
    iterations = 10000
    n = len(x)
    learning_rate = 0.08

    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        cost = (1/n) * sum([val**2 for val in (y - y_predicted)])
        md = -(2/n) * sum(x * (y - y_predicted))
        bd = -(2/n) * sum(y - y_predicted)
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print("m {}, b {}, cost {} iteration {}".format(m_curr, b_curr, cost, i))

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])

gradient_descent(x, y)

Out[]:
m 4.96, b 1.44, cost 89.0 iteration 0
m 0.4991999999999983, b 0.26879999999999993, cost 71.10560000000002 iteration 1
m 4.451584000000002, b 1.426176000000001, cost 56.8297702400001 iteration 2
.
.
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9997
m 2.000000000000001, b 2.9999999999999947, cost 1.0255191767873153e-29 iteration 9998
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9999

Optimization in a Machine Learning


Machine learning optimization is the process of adjusting hyperparameters in order to minimize the cost
function by using one of the optimization techniques. It is important to minimize the cost function because it
describes the discrepancy between the true value of the estimated parameter and what the model has predicted.
Optimization plays an important part in a machine learning project in addition to fitting the learning algorithm
on the training dataset.
The step of preparing the data prior to fitting the model and the step of tuning a chosen model also can be
framed as an optimization problem. In fact, an entire predictive modeling project can be thought of as one
large optimization problem.

Parameters and hyperparameters of the model


Before we go any further, we need to understand the difference between parameters and
hyperparameters of a model. These two notions are easy to confuse, but we should not.
You need to set hyperparameters before starting to train the model. They include the number of clusters, the
learning rate, etc. Hyperparameters describe the structure of the model.
On the other hand, the parameters of the model are obtained during the training. There is no way to
get them in advance. Examples are weights and biases for neural networks. This data is internal to the
model and changes based on the inputs.

To tune the model, we need hyperparameter optimization. By finding the optimal combination of their
values, we can decrease the error and build the most accurate model.

How hyperparameter tuning works:


As we said, the hyperparameters are set before training. But you can’t know in advance, for instance,
which learning rate (large or small) is best in this or that case. Therefore, to improve the model’s
performance, hyperparameters have to be optimized.

After each iteration, you compare the output with the expected results, assess the accuracy, and adjust the
hyperparameters if necessary. This is a repeated process. You can do it manually or use one of the
many optimization techniques, which come in handy when you work with large amounts of data.

Top optimization techniques in machine learning


Now let us talk about the techniques that you can use to optimize the hyperparameters of your model.

Exhaustive search
Exhaustive search, or brute-force search, is the process of looking for the most optimal
hyperparameters by checking whether each candidate is a good match. You perform the same thing
when you forget the code for your bike’s lock and try out all the possible options. In machine learning,
we do the same thing but the number of options is quite large, usually.

The exhaustive search method is simple. For example, if you are working with a k-means algorithm,
you will manually search for the right number of clusters. However, if there are hundreds and
thousands of options that you have to consider, it becomes unbearably heavy and slow. This makes
brute-force search inefficient in the majority of real-life cases.
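The following is a minimal exhaustive (grid) search sketch using scikit-learn (an illustrative addition, not part of the original notes); the parameter grid values are arbitrary examples.

#Exhaustive (grid) search over hyperparameters (illustrative)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}   # candidate hyperparameter values
grid = GridSearchCV(SVC(), param_grid, cv=5)                    # tries every combination with 5-fold cross-validation
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)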

Gradient descent
Gradient descent is the most common algorithm for model optimization for minimizing the error. In
order to perform gradient descent, you have to iterate over the training dataset while re-adjusting the
model.

Your goal is to minimize the cost function because it means you get the smallest possible error and
improve the accuracy of the model.

On the graph, you can see a graphical representation of how the gradient descent algorithm travels in
the variable space. To get started, you need to take a random point on the graph and arbitrarily choose
a direction. If you see that the error is getting larger, that means you chose the wrong direction.

When you are not able to improve (decrease the error) anymore, the optimization is over and you have
found a local minimum.

Looks fine so far. However, classical gradient descent will not work well when there are several
local minima. After finding your first minimum, you will simply stop searching, because the algorithm
only finds a local minimum; it is not designed to find the global one.

Note: In gradient descent, you proceed forward with steps of the same size. If you choose a learning rate that
is too large, the algorithm will jump around without getting closer to the right answer. If it is too small,
the computation will start mimicking an exhaustive search, which is, of course, inefficient.

So you have to choose the learning rate very carefully. If done right, gradient descent becomes a computationally
efficient and rather quick method to optimize models.

Genetic algorithms
Genetic algorithms represent another approach to ML optimization. The principle that lies behind the
logic of these algorithms is an attempt to apply the theory of evolution to machine learning.

In the evolution theory, only those specimens get to survive and reproduce that have the best adaptation
mechanisms. How do you know what specimens are and aren’t the best in the case of machine learning
models?

Imagine you have a bunch of random algorithms at hand. This will be your population. Among multiple
models with some predefined hyperparameters, some are better adjusted than the others. Let’s find
them! First, you calculate the accuracy of each model. Then, you keep only those that worked out best.
Now you can generate some descendants with similar hyperparameters to the best models to get a
second generation of models.

We repeat this process many times and only the best models will survive at the end of the process.
Genetic algorithms help to avoid being stuck at local minima/maxima. They are common in optimizing
neural network models.

Regularization in Machine Learning


What is Regularization?
Regularization is one of the most important concepts of machine learning. It is a technique to prevent the
model from overfitting by adding extra information to it.
Sometimes the machine learning model performs well with the training data but does not perform well with
the test data. This means the model is not able to predict the output when it deals with unseen data, because noise
has been introduced into the output; hence the model is called overfitted. This problem can be dealt with using a
regularization technique.
This technique can be used in such a way that it allows us to maintain all variables or features in the model
while reducing their magnitude. Hence, it maintains accuracy as well as the generalization of the
model.
It mainly regularizes or reduces the coefficients of features toward zero. In simple words, "in the regularization
technique, we reduce the magnitude of the features while keeping the same number of features."
How does Regularization Work?
Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple
linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
In the above equation, y represents the value to be predicted,
x1, x2, …, xn are the features for y, and
β1, β2, ….., βn are the weights or magnitudes attached to the features, respectively. Here β0 represents the bias of
the model, and b represents the intercept.
Linear regression models try to optimize the coefficients and intercept to minimize the cost function. The equation for the cost
function of the linear model is given below:

Cost = RSS = Σ ( yi - ŷi )²   summed over all training examples i, where ŷi is the predicted value

Now, we will add a loss function and optimize the parameters to make a model that can predict the accurate value
of Y. The loss function for linear regression is called RSS, or the Residual Sum of Squares.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.

o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It
is also called as L2 regularization.
o In this technique, the cost function is altered by adding the penalty term to it. The amount of bias added
to the model is called Ridge Regression penalty. We can calculate it by multiplying with the lambda
to the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:

  Cost = Σ ( yi - ŷi )² + λ * Σ βj²   (the second term is the Ridge, or L2, penalty)

o In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge
regression reduces the amplitudes of the coefficients that decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes the
cost function of the linear regression model. Hence, for the minimum value of λ, the model will
resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands
for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression can only
shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:

  Cost = Σ ( yi - ŷi )² + λ * Σ |βj|   (the second term is the Lasso, or L1, penalty)

o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well as the feature
selection.
Key Difference between Ridge Regression and Lasso Regression
o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.
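The following is a minimal scikit-learn sketch comparing Ridge (L2) and Lasso (L1) regression (an illustrative addition, not part of the original notes); alpha plays the role of λ and its value here is arbitrary.

#Ridge vs. Lasso (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks coefficients toward zero but keeps all features
lasso = Lasso(alpha=1.0).fit(X, y)   # can shrink some coefficients exactly to zero (feature selection)

print(ridge.coef_)
print(lasso.coef_)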

What is Overfitting?

o Overfitting & underfitting are the two main errors/problems in the machine learning model, which
cause poor performance in Machine Learning.
o Overfitting occurs when the model fits more data than required, and it tries to capture each and every
datapoint fed to it. Hence it starts capturing noise and inaccurate data from the dataset, which degrades
the performance of the model.
o An overfitted model doesn't perform accurately with the test/unseen dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
Example to Understand Overfitting
We can understand overfitting with a general example. Suppose there are three students, X, Y, and Z, and all
three are preparing for an exam. X has studied only three sections of the book and left all other sections. Y
has a good memory, hence memorized the whole book. And the third student, Z, has studied and practiced all
the questions. So, in the exam, X will only be able to solve the questions if the exam has questions related to
section 3. Student Y will only be able to solve questions if they appear exactly the same as given in the book.
Student Z will be able to solve all the exam questions in a proper way.
The same happens with machine learning: if the algorithm learns from only a small part of the data, it is unable to
capture the required data points and is hence underfitted, like student X.
Suppose the model memorizes the training dataset, like student Y. It performs very well on the seen dataset
but performs badly on unseen data or unknown instances. In such cases, the model is said to be overfitting.
And if the model performs well with the training dataset and also with the test/unseen dataset, similar to
student Z, it is said to be a good fit.

How to detect Overfitting?


Overfitting in the model can only be detected once you test the model on unseen data. To detect the issue, we can
perform a train/test split.
In the train-test split of the dataset, we can divide our dataset into random test and training datasets. We train
the model with a training dataset which is about 80% of the total dataset. After training the model, we test it
with the test dataset, which is 20 % of the total dataset.

Now, if the model performs well with the training dataset but not with the test dataset, then it is likely to have
an overfitting issue.
For example, if the model shows 85% accuracy with training data and 50% accuracy with the test dataset, it
means the model is not performing well.

Ways to prevent the Overfitting


Although overfitting is an error in machine learning which reduces the performance of the model,
we can prevent it in several ways. With the use of a linear model, we can avoid overfitting; however, many
real-world problems are non-linear. It is important to prevent overfitting in such models. Below are
several ways that can be used to prevent overfitting:

1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization
1.Early Stopping
In this technique, the training is paused before the model starts learning the noise within the data. In this
process, while training the model iteratively, we measure the performance of the model after each iteration.
Up to a certain number of iterations, each new iteration improves the performance of the model.
After that point, the model begins to overfit the training data; hence we need to stop the process before the
learner passes that point.
Stopping the training process before the model starts capturing noise from the data is known as early stopping.

However, this technique may lead to the underfitting problem if training is paused too early. So, it is very
important to find that "sweet spot" between underfitting and overfitting.
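The following is a minimal early-stopping sketch (an illustrative addition, not part of the original notes): training stops when the validation loss has not improved for a number of consecutive checks. train_one_epoch() and validation_loss() are hypothetical placeholder functions.

#Early stopping sketch (illustrative; train_one_epoch and validation_loss are hypothetical)
def train_with_early_stopping(model, patience=5, max_epochs=100):
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical: one pass over the training data
        val_loss = validation_loss(model)      # hypothetical: loss on a held-out validation set
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # stop before the model starts fitting noise
    return model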
2.Train with More data
Increasing the training set by including more data can enhance the accuracy of the model, as it provides more
chances to discover the relationship between input and output variables.
It may not always work to prevent overfitting, but this way helps the algorithm to detect the signal better to
minimize the errors.
When a model is fed with more training data, it becomes unable to overfit all the samples and is forced to
generalize well.
But in some cases, the additional data may add more noise to the model; hence we need to be sure that the data is
clean and free from inconsistencies before feeding it to the model.

3.Feature Selection
While building the ML model, we have a number of parameters or features that are used to predict the outcome.
However, sometimes some of these features are redundant or less important for the prediction, and for this the
feature selection process is applied. In the feature selection process, we identify the most important features
within training data, and other features are removed. Further, this process helps to simplify the model and
reduces noise from the data. Some algorithms have the auto-feature selection, and if not, then we can manually
perform this process.
4.Cross-Validation
Cross-validation is one of the powerful techniques to prevent overfitting.
In the general k-fold cross-validation technique, we divide the dataset into k equal-sized subsets of data;
these subsets are known as folds (a small sketch follows below).
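A minimal k-fold cross-validation sketch with scikit-learn (an illustrative addition, not part of the original notes); the model and the number of folds are example choices.

#k-fold cross-validation sketch (illustrative)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, iris.data, iris.target, cv=5)   # 5 folds
print(scores, scores.mean())                                    # per-fold accuracy and its average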
5.Data Augmentation
Data Augmentation is a data analysis technique, which is an alternative to adding more data to prevent
overfitting. In this technique, instead of adding more training data, slightly modified copies of already existing
data are added to the dataset.
The data augmentation technique makes a data sample appear slightly different every time it is
processed by the model. Hence each sample appears unique to the model, which helps prevent overfitting.
6.Regularization
If overfitting occurs when a model is complex, we can reduce the number of features. However, overfitting
may also occur with a simpler model, more specifically the Linear model, and for such cases, regularization
techniques are much helpful.
Regularization is the most popular technique to prevent overfitting. It is a group of methods that forces the
learning algorithms to make a model simpler. Applying the regularization technique may slightly increase the
bias but slightly reduces the variance. In this technique, we modify the objective function by adding the
penalizing term, which has a higher value with a more complex model.
The two commonly used regularization techniques are L1 Regularization and L2 Regularization.
Ensemble Methods
In ensemble methods, prediction from different machine learning models is combined to identify the most
popular result.
The most commonly used ensemble methods are Bagging and Boosting.
In bagging, individual data points can be selected more than once. After the collection of several sample
datasets, these models are trained independently, and depending on the type of task (i.e., regression or
classification), the average or majority vote of those predictions is used to produce a more accurate result. Moreover, bagging
reduces the chances of overfitting in complex models (a small sketch follows below).
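A minimal bagging sketch with scikit-learn (an illustrative addition, not part of the original notes); the number of estimators is an arbitrary example value, and the default base estimator is a decision tree.

#Bagging sketch (illustrative)
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

#each estimator is trained on a bootstrap sample (data points can be selected more than once)
bag = BaggingClassifier(n_estimators=10, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))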

Perceptron in Machine Learning

In Machine Learning and Artificial Intelligence, Perceptron is one of the most commonly used terms. It
is a primary step in learning Machine Learning and Deep Learning technologies, and it consists of a set of
weights, input values or scores, and a threshold. The Perceptron is a building block of an Artificial Neural
Network. Frank Rosenblatt invented the Perceptron in 1957 (the mid-20th century) for performing
certain calculations to detect input data capabilities or business intelligence. The Perceptron is a linear Machine
Learning algorithm used for supervised learning for various binary classifiers. This algorithm enables neurons
to learn elements and processes them one by one during preparation. In this tutorial, "Perceptron in Machine
Learning," we will discuss in-depth knowledge of Perceptron and its basic functions in brief. Let's start with
the basic introduction of Perceptron.

What is the Perceptron model in Machine Learning?


Perceptron is a Machine Learning algorithm for supervised learning of various binary classification tasks.
Further, Perceptron is also understood as an Artificial Neuron or neural network unit that helps to detect
certain input data computations in business intelligence.
Perceptron model is also treated as one of the best and simplest types of Artificial Neural networks. However,
it is a supervised learning algorithm of binary classifiers. Hence, we can consider it as a single-layer neural
network with four main parameters, i.e., input values, weights and Bias, net sum, and an activation
function.

What is Binary classifier in Machine Learning?


In Machine Learning, binary classifiers are defined as the function that helps in deciding whether input data
can be represented as vectors of numbers and belongs to some specific class.
Binary classifiers can be considered as linear classifiers. In simple words, we can understand it as
a classification algorithm that can predict linear predictor function in terms of weight and feature vectors.
Basic Components of Perceptron
Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains three main
components. These are as follows:

o Input Nodes or Input Layer:


This is the primary component of Perceptron which accepts the initial data into the system for further
processing. Each input node contains a real numerical value.
o Weight and Bias:

Weight parameter represents the strength of the connection between units. This is another most important
parameter of Perceptron components. Weight is directly proportional to the strength of the associated input
neuron in deciding the output. Further, Bias can be considered as the line of intercept in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire or not.
Activation Function can be considered primarily as a step function.

Types of Activation functions:


o Sign function
o Step function, and
o Sigmoid function

The data scientist uses the activation function to take a subjective decision based on various problem
statements and forms the desired outputs. Activation function may differ (e.g., Sign, Step, and Sigmoid) in
perceptron models by checking whether the learning process is slow or has vanishing or exploding gradients.

How does Perceptron work?


In Machine Learning, Perceptron is considered as a single-layer neural network that consists of four main
parameters named input values (Input nodes), weights and Bias, net sum, and an activation function. The
perceptron model begins with the multiplication of all input values and their weights, then adds these values
together to create the weighted sum. Then this weighted sum is applied to the activation function 'f' to obtain
the desired output. This activation function is also known as the step function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is mapped between required
values (0,1) or (-1,1). It is important to note that the weight of input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.

Perceptron model works in two important steps as follows:


Step-1
In the first step, multiply all input values by their corresponding weight values and then add them to
determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 +…wn*xn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b

Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum, which gives us
output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
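To make these two steps concrete, here is a minimal NumPy sketch of a perceptron forward pass (an illustrative addition, not part of the original notes); the inputs, weights, and bias are arbitrary example values.

#Perceptron forward pass sketch (illustrative)
import numpy as np

def step(z):
    return 1 if z >= 0 else 0        # step activation function

x = np.array([1.0, 0.5, -1.5])       # input values
w = np.array([0.4, 0.6, 0.2])        # weights
b = 0.1                              # bias

weighted_sum = np.dot(w, x) + b      # Step-1: ∑ wi*xi + b
y = step(weighted_sum)               # Step-2: Y = f(∑ wi*xi + b)
print(weighted_sum, y)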

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
Single Layer Perceptron Model:
This is one of the easiest types of Artificial Neural Networks (ANN). A single-layered perceptron model consists
of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of
the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm does not have prior recorded data, so it begins with randomly
allocated weight parameters. Further, it sums up all the weighted inputs. If the
total sum is more than a pre-determined threshold value, the model gets activated and shows the output
value as +1.
If the output matches the desired (pre-determined) value, the performance of the model is considered
satisfactory, and the weights are not changed. However, this model produces errors when multiple weighted
input values are fed into it. Hence, to find the desired output and minimize errors, some changes to the weights
are necessary.
"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:


Like a single-layer perceptron model, a multi-layer perceptron model also has the same model structure but
has a greater number of hidden layers.
The multi-layer perceptron model is trained with the Backpropagation algorithm, which executes in two
stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and terminate on
the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's
requirement. In this stage, the error between the actual and desired output is propagated backward, starting at the
output layer and ending at the input layer.
Hence, a multi-layered perceptron model is considered as an artificial neural network with multiple
layers in which the activation function does not remain linear, unlike in a single-layer perceptron model. Instead
of being linear, the activation function can be sigmoid, TanH, ReLU, etc.

A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns.
Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, and NOR (a small sketch follows below).
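A minimal multi-layer perceptron sketch learning the XOR gate with scikit-learn (an illustrative addition, not part of the original notes); the hidden layer size, activation, and solver are example choices.

#Multi-layer perceptron learning XOR (illustrative)
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                     # XOR truth table (not linearly separable)

mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(4,), activation='tanh', random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))                # typically [0, 1, 1, 0], depending on initialization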

Advantages of Multi-Layer Perceptron:


o A multi-layered perceptron model can be used to solve complex non-linear problems.
o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:


o In Multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to predict how much each
independent variable affects the dependent variable.
o The model functioning depends on the quality of the training.

What Is a Neural Network?


A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data
through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems
of neurons, either organic or artificial in nature.

Neural networks can adapt to changing input; so the network generates the best possible result without needing
to redesign the output criteria. The concept of neural networks, which has its roots in artificial intelligence, is
swiftly gaining popularity in the development of trading systems.

Pros
o Can often work more efficiently and for longer than humans
o Can be programmed to learn from prior outcomes to strive to make smarter future calculations
o Often leverage online services that reduce (but do not eliminate) systematic risk
o Are continually being expanded in new fields with more difficult problems

Cons
o Still rely on hardware that may require labor and expertise to maintain
o May take long periods of time to develop the code and algorithms
o May be difficult to assess errors or adaptions to the assumptions if the system is self-learning but lacks transparency
o Usually report an estimated range or estimated amount that may not actualize

Multi-Class Classification
Multi-class classification is perhaps the most popular machine learning job, aside from regression.

The science behind it is the same whether it's spelled multiclass or multi-class. An ML classification problem
with more than two outputs or classes is known as multi-class classification. Because each image may be
classed into many distinct animal categories, using a machine learning model to identify animal species in
photographs from an encyclopedia is an example of multi-class classification. Multi-class classification also
requires that a sample belong to only one class (i.e., an elephant is only an elephant; it is not also a lemur).

We are given a set of training samples separated into K distinct classes, and we create an ML model to forecast
which of those classes some previously unknown data belongs to. The model learns patterns specific to each
class from the training dataset and utilizes those patterns to forecast the classification of future data.
Approach –
1. Load dataset from the source.
2. Split the dataset into “training” and “test” data.
3. Train Decision tree, SVM, and KNN classifiers on the training data.
4. Use the above classifiers to predict labels for the test data.
5. Measure accuracy and visualize classification.
Decision tree classifier – A decision tree classifier is a systematic approach for multiclass classification. It
poses a set of questions to the dataset (related to its attributes/features). The decision tree classification
algorithm can be visualized on a binary tree. On the root and each of the internal nodes, a question is posed
and the data on that node is further split into separate records that have different characteristics. The leaves of
the tree refer to the classes in which the dataset is split. In the following code snippet, we train a decision tree
classifier in scikit-learn.
Example:
# importing necessary libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# loading the iris dataset
iris = datasets.load_iris()

# X -> features, y -> label
X = iris.data
y = iris.target

# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# training a DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)

# measuring accuracy and creating a confusion matrix
accuracy = dtree_model.score(X_test, y_test)
cm = confusion_matrix(y_test, dtree_predictions)
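
The approach above also lists SVM and KNN classifiers. A sketch of training those two on the same train/test
split is shown below; the choice of kernel and the number of neighbours are illustrative assumptions, not values
prescribed by these notes.

# training an SVM classifier and a KNN classifier on the same split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# SVM with a linear kernel (the kernel choice is an assumption)
svm_model = SVC(kernel = 'linear', C = 1).fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)
print("SVM accuracy:", svm_model.score(X_test, y_test))

# KNN with k = 7 neighbours (the value of k is an assumption)
knn_model = KNeighborsClassifier(n_neighbors = 7).fit(X_train, y_train)
knn_predictions = knn_model.predict(X_test)
print("KNN accuracy:", knn_model.score(X_test, y_test))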

What is a backpropagation algorithm?


Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for errors working
back from output nodes to input nodes. It is an important mathematical tool for improving the accuracy of
predictions in data mining and machine learning. Essentially, backpropagation is an algorithm used to
calculate derivatives quickly.

There are two leading types of backpropagation networks:

1. Static backpropagation. Static backpropagation is a network developed to map static inputs for static
outputs. Static backpropagation networks can solve static classification problems, such as optical
character recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is used for fixed-point learning.
Recurrent backpropagation activation feeds forward until it reaches a fixed value.

What is a backpropagation algorithm in a neural network?


Artificial neural networks use backpropagation as a learning algorithm: it computes the gradient of the error
with respect to the weight values for the various inputs, which gradient descent then uses to update the weights.
By comparing desired outputs to the outputs the system actually produces, the system is tuned by adjusting
connection weights to narrow the difference between the two as much as possible.

The algorithm gets its name because the weights are updated backward, from output to input.
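
To make the idea concrete, below is a minimal sketch of a single backpropagation step for a tiny fully connected
network with one hidden layer, sigmoid activations and squared-error loss. All sizes, initial values and the
learning rate are illustrative assumptions.

# minimal sketch of one backpropagation step for a 2-2-1 network with sigmoid units
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])            # one training input
t = np.array([1.0])                 # desired (target) output
W1 = np.random.randn(2, 2) * 0.1    # input -> hidden weights
W2 = np.random.randn(1, 2) * 0.1    # hidden -> output weights
lr = 0.5                            # learning rate

# forward pass
h = sigmoid(W1 @ x)                 # hidden activations
y = sigmoid(W2 @ h)                 # network output

# backward pass: error terms propagate from the output layer back towards the input
delta_out = (y - t) * y * (1 - y)               # output-layer error term
delta_hid = (W2.T @ delta_out) * h * (1 - h)    # hidden-layer error term

# gradient descent update of the connection weights
W2 -= lr * np.outer(delta_out, h)
W1 -= lr * np.outer(delta_hid, x)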

The advantages of using a backpropagation algorithm are as follows:

 It does not have any parameters to tune except for the number of inputs.
 It is highly adaptable and efficient and does not require any prior knowledge about the network.
 It is a standard process that usually works well.
 It is user-friendly, fast and easy to program.
 Users do not need to learn any special functions.

The disadvantages of using a backpropagation algorithm are as follows:

 It prefers a matrix-based approach over a mini-batch approach.


 It is sensitive to noisy data and other irregularities.
 Performance is highly dependent on input data.
 Training is time- and resource-intensive.

What is a backpropagation algorithm in machine learning?


Backpropagation requires a known, desired output for each input value in order to calculate the loss
function gradient -- how a prediction differs from actual results -- as a type of supervised machine
learning. Along with classifiers such as Naïve Bayesian filters and decision trees, the backpropagation
training algorithm has emerged as an important part of machine learning applications that involve
predictive analytics.

What is the time complexity of a backpropagation algorithm?


The time complexity of each iteration -- how long it takes to execute each statement in the algorithm --
depends on the network's structure. For a multilayer perceptron, matrix multiplications dominate the running time.
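
As a rough, illustrative estimate (the layer widths below are assumptions): in a fully connected network, each
layer contributes about (number of inputs x number of outputs) multiply-adds per example in the forward pass,
and the backward pass costs roughly the same again.

# rough per-example cost estimate for a fully connected network (illustrative layer widths)
layer_sizes = [784, 128, 64, 10]
forward_mults = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
print("approx. multiply-adds per example (forward):", forward_mults)
print("approx. multiply-adds per example (forward + backward):", 2 * forward_mults)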

Non-Linear Activation Functions


Examples of non-linear activation functions include:

1. Sigmoid function: A sigmoid function squashes any real value into a bounded range such as (0, 1) or (-1, 1);
it is typically used to convert a real value into a probability. In machine learning, "sigmoid" generally refers
to the logistic function, also called the logistic sigmoid function; it is the most widely used sigmoid
function (others are the hyperbolic tangent and the arctangent).

A sigmoid function is placed as the last layer of the model to convert the model’s output into a probability
score, which is easier to work with and interpret.

Another reason to use it mostly in the output layer is that, placed in earlier layers, it can cause a neural
network to get stuck during training (its gradient saturates, as discussed below).

2. TanH function: The hyperbolic tangent function has a range between -1 and 1, and is therefore also called
a zero-centred function. Because it is zero-centred, it is much easier to model inputs with strongly negative,
strongly positive or neutral values. The TanH function is used instead of the sigmoid function when the desired
output range is not 0 to 1. TanH functions usually find applications in RNNs for natural language processing
and speech recognition tasks.

On the downside, in the case of both Sigmoid and TanH, if the weighted sum input is very large or very small,
the function’s gradient becomes very small and closer to zero.

3. ReLU function: Rectified Linear Unit, also called ReLU, is a widely favoured activation function for deep
learning applications. Compared to Sigmoid and TanH activation functions, ReLU offers an upper hand in
terms of performance and generalisation. In terms of computation too, ReLU is faster as it does not compute
exponentials and divisions. The disadvantage is that ReLU overfits more, as compared with Sigmoid.

4. Parametric ReLU (PReLU): ReLU has been one of the keys to the recent successes in deep learning. Its
use has led to better solutions than those obtained with sigmoid, partly because it avoids the vanishing gradient
problem that sigmoid activations suffer from. Still, ReLU can be improved upon. LeakyReLU was introduced,
which does not zero out negative inputs as ReLU does; instead, it multiplies the negative input by a small fixed
value (such as 0.01 or 0.02) and keeps the positive input as is. In practice this has shown only a negligible
increase in accuracy. PReLU goes one step further and makes that small negative slope a learnable parameter
that is trained together with the network's weights.
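
The functions above can be written in a few lines of NumPy. The sketch below is illustrative (the LeakyReLU
slope of 0.01 is an assumed value) and also shows how the sigmoid gradient shrinks for large inputs, which is
the vanishing-gradient issue mentioned earlier.

# illustrative NumPy versions of the activation functions discussed above
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # output in (0, 1)

def tanh(z):
    return np.tanh(z)                         # output in (-1, 1), zero-centred

def relu(z):
    return np.maximum(0.0, z)                 # zero for negative inputs

def leaky_relu(z, alpha = 0.01):
    return np.where(z > 0, z, alpha * z)      # small slope for negative inputs

# vanishing-gradient illustration: the sigmoid derivative s(z) * (1 - s(z))
# becomes very small when |z| is large
for z in [0.0, 5.0, 10.0]:
    s = sigmoid(z)
    print("z =", z, " sigmoid =", round(s, 5), " gradient =", round(s * (1 - s), 6))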

Dropout Regularization

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-
adaptations on training data. It is a very efficient way of performing model averaging with neural networks.
The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.

Dropout is a simple and powerful regularization technique for neural networks and deep learning models. This
section describes the dropout regularization technique and how to apply it to deep learning models in Python
with Keras (a small sketch appears at the end of this section).

Dropout is a technique where randomly selected neurons are ignored during training: they are “dropped out”
at random. This means that their contribution to the activation of downstream neurons is temporarily removed
on the forward pass, and any weight updates are not applied to those neurons on the backward pass.

As a neural network learns, neuron weights settle into their context within the network. The weights of neurons
are tuned for specific features, providing some specialization. Neighbouring neurons come to rely on this
specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data.
This reliance on context for a neuron during training is referred to as complex co-adaptation.
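
A minimal sketch of applying dropout with Keras is shown below. The layer sizes, dropout rate, input shape and
the dummy data are illustrative assumptions, not values prescribed by these notes.

# minimal sketch: adding Dropout layers to a small Keras model
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation = 'relu', input_shape = (20,)),
    Dropout(0.5),                     # randomly drop 50% of these units during training
    Dense(32, activation = 'relu'),
    Dropout(0.5),
    Dense(1, activation = 'sigmoid')  # binary classification output
])
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# dummy data just to show the call; replace with a real dataset
X = np.random.rand(100, 20)
y = np.random.randint(0, 2, size = (100,))
model.fit(X, y, epochs = 5, batch_size = 16, verbose = 0)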
