ML-Unit 4
Logistic regression is a supervised learning classification algorithm used to predict the probability of a target
variable. The nature of target or dependent variable is dichotomous, which means there would be only two
possible classes.
In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML
algorithms that can be used for various classification problems such as spam detection, Diabetes prediction,
cancer detection etc.
Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, whose output is bounded by the two limiting values 0 and 1.
The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables for the classification. The logistic (sigmoid) function is:
f(x) = 1 / (1 + e^-x)
Logistic Regression Equation: The logistic regression equation can be obtained from the linear regression equation. The mathematical steps are given below:
1. We know the equation of a straight line can be written as: y = b0 + b1x1 + b2x2 + ... + bnxn
2. In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation by (1-y): y / (1-y), which is 0 for y = 0 and infinity for y = 1.
3. But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it becomes: log[y / (1-y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
To implement the Logistic Regression using Python, we will use the same steps as we have done in previous
topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in our
code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this is given
below:
#importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
We have now prepared our dataset well, and we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class from the sklearn library.
After importing the class, we will create a classifier object and use it to fit the logistic regression model to the training data. Below is the code for it:
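#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

#Predicting the test set result
y_pred= classifier.predict(x_test)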
Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function mainly takes two parameters: y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
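#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
print(cm)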
We can find the accuracy of the predicted result by interpreting the confusion matrix: 65 + 24 = 89 correct predictions and 8 + 3 = 11 incorrect predictions.
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables, x_set and y_set, as copies of x_train and y_train. After that, we used the nm.meshgrid command to create a rectangular grid that spans from one unit below the minimum to one unit above the maximum of each feature. The pixel points we have taken are of 0.01 resolution.
To create a filled contour, we have used the mtp.contourf command; it creates regions of the provided colors (purple and green). In this function, we have passed classifier.predict to show the class predicted by the classifier at each point of the grid.
Output: By executing the above code, we will get the below output:
In the above graph, we can see that there are some Green points within the green region and Purple
points within the purple region.
All these data points are the observation points from the training set, which shows the result for
purchased variables.
This graph is made by using two independent variables i.e., Age on the x-axis and Estimated salary
on the y-axis.
The purple point observations are those for which the purchased (dependent) variable is probably 0, i.e., users who did not purchase the SUV car.
The green point observations are those for which the purchased (dependent) variable is probably 1, i.e., users who purchased the SUV car.
We can also estimate from the graph that the users who are younger with low salary, did not purchase
the car, whereas older users with high estimated salary purchased the car.
But there are also some purple points in the green region (buying the car) and some green points in the purple region (not buying the car). These are the observations the classifier has predicted incorrectly: some younger users with a high estimated salary did purchase the car, whereas some older users with a low estimated salary did not.
A Machine Learning model should have a very high level of accuracy in order to perform well in real-world applications. But how do we measure the accuracy of the model, i.e., how well or poorly it will perform in the real world? This is where the cost function comes into the picture. It is an important machine learning parameter used to correctly estimate the model.
The cost function also plays a crucial role in understanding how well your model estimates the relationship between the input and output parameters.
In this topic, we will explain the cost function in Machine Learning, Gradient descent, and types of cost
functions.
A cost function is an important parameter that determines how well a machine learning model performs for a
given dataset. It calculates the difference between the expected value and predicted value and represents it as
a single real number.
In machine learning, once we train our model, we want to see how well it is performing. Although there are various accuracy functions that tell you how the model is performing, they do not give insights on how to improve it. So, we need a function that can find when the model is most accurate, locating the spot between the undertrained and the overtrained model.
In simple terms, "a cost function is a measure of how wrong the model is in estimating the relationship between X (input) and Y (output)." A cost function is sometimes also referred to as a loss function, and it can be estimated by iteratively running the model and comparing estimated predictions against the known values of Y.
The main aim of each ML model is to determine parameters or weights that can minimize the cost function.
Cost functions can be of various types depending on the problem. However, mainly it is of three types, which
are as follows:
1. Regression Cost Function
2. Binary Classification cost Functions
3. Multi-class Classification Cost Function.
There are three commonly used Regression cost functions, which are as follows:
a. Mean Error
In this type of cost function, the error is calculated for each training example, and then the mean of all the error values is taken.
It is one of the simplest cost functions possible.
However, the errors on the training data can be either negative or positive, so while finding the mean they can cancel each other out and give a zero mean error for the model; hence this is not a recommended cost function.
b. Mean Absolute Error
In this cost function, the absolute difference between the actual and predicted values is measured, so positive and negative errors cannot cancel out. The Mean Absolute Error cost function is also known as L1 Loss. It is not much affected by noise or outliers, hence giving better results if the dataset has noise or outliers.
The error in binary classification is calculated as the mean of the cross-entropy for all N training examples, which means:
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
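As a quick sketch, the formula above can be computed directly with NumPy (the function and variable names below are illustrative):

import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # clip the probabilities to avoid taking log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # cross-entropy of each example, then the mean over all N examples
    ce = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return ce.mean()

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.8, 0.6])
print(binary_cross_entropy(y_true, y_prob))   # roughly 0.266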
In a multi-class classification problem, cross-entropy will generate a score that summarizes the mean
difference between actual and anticipated probability distribution.
For a perfect model the cross-entropy is zero, so the score is minimized towards zero.
Gradient Descent is known as one of the most commonly used optimization algorithms to train machine
learning models by means of minimizing errors between actual and expected results. Further, gradient descent
is also used to train Neural Networks.
In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about gradient descent, the
role of cost functions specifically as a barometer within Machine Learning, types of gradient descents, learning
rates, etc.
Gradient descent was initially discovered by Augustin-Louis Cauchy in the mid-19th century. Gradient Descent is defined as one of the most commonly used iterative optimization algorithms of machine learning, used to train machine learning and deep learning models. It helps in finding the local minimum of a function.
The best way to define the local minimum or local maximum of a function using gradient descent is as follows:
If we move towards the negative gradient, i.e., away from the gradient of the function at the current point, we will reach the local minimum of that function. This procedure is gradient descent, also known as steepest descent.
Whenever we move towards the positive gradient, i.e., towards the gradient of the function at the current point, we will reach the local maximum of that function. This procedure is known as gradient ascent.
The main objective of using a gradient descent algorithm is to minimize the cost function by iteration. To achieve this goal, it performs two steps iteratively:
Calculate the first-order derivative of the function to compute the gradient or slope at the current point.
Move in the direction opposite to the gradient, i.e., away from the direction in which the slope increases, by alpha times the gradient from the current point, where alpha is the learning rate: a tuning parameter in the optimization process that helps to decide the length of the steps.
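In symbols, each iteration therefore performs the update θ := θ − α ∇J(θ), where J(θ) is the cost function, ∇J(θ) its gradient at the current point, and α the learning rate.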
Before starting the working principle of gradient descent, we should know some basic concepts to find out the
slope of a line from linear regression. The equation for simple linear regression is given as:
Equation : Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.
The starting point is just an arbitrary point used to evaluate the performance. From this starting point, we take the first derivative (slope) and use a tangent line to measure the steepness of the slope. This slope informs the updates to the parameters (weights and bias).
The slope is steeper at the starting (arbitrary) point, but as new parameters are generated, the steepness gradually reduces until it approaches the lowest point on the curve, which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected and actual values. To minimize the cost function, two pieces of information are required: the direction of steepest descent (given by the gradient) and the learning rate. These two factors determine the partial derivative calculations of future iterations, allowing the algorithm to arrive at the point of convergence, i.e., the local or global minimum. Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. It is typically a small value that is evaluated and updated based on the behaviour of the cost function. A high learning rate results in larger steps but risks overshooting the minimum; a low learning rate gives small step sizes, which compromises overall efficiency but has the advantage of more precision.
Based on the error in various training models, the Gradient Descent learning algorithm can be divided
into Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's
understand these different types of gradient descent:
Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples; one such full pass is known as a training epoch. In simple words, it is a greedy approach where we sum over all examples for each update.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration; in other words, within each epoch, it updates the parameters for each training example one at a time. As it requires only one training example at a time, it is easier to fit in memory. However, compared with batch gradient descent, its frequent updates are computationally more expensive and produce a noisy gradient. This noise, however, can sometimes be helpful for escaping local minima and finding the global minimum.
Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs an update on each batch. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve higher computational efficiency together with a less noisy gradient.
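As an illustrative sketch (not a library routine), the three variants differ only in how many samples feed each update: with batch_size equal to the dataset size, the loop below becomes batch gradient descent, and with batch_size = 1 it becomes stochastic gradient descent:

import numpy as np

def minibatch_gradient_descent(X, y, w, learning_rate=0.01, batch_size=32, epochs=10):
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)              # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # gradient of the mean squared error for a linear model y = Xw
            grad = -(2 / len(Xb)) * Xb.T @ (yb - Xb @ w)
            w = w - learning_rate * grad              # one parameter update per batch
    return w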
Although Gradient Descent is one of the most popular methods for optimization problems, it still has some challenges, which are as follows:
Whenever the slope of the cost function is at or close to zero, the model stops learning. Apart from the global minimum, there are other points that can exhibit this slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, with the slope of the cost function increasing on both sides of the current point.
In contrast, at a saddle point the negative gradient occurs on only one side of the point: the point is a local maximum in one direction and a local minimum in another. The name comes from the shape of a horse's saddle.
Local minima are so named because the value of the loss function is minimum at that point only within a local region. In contrast, the global minimum is so named because the value of the loss function is minimum there globally, across the entire domain of the loss function.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient becomes progressively smaller, causing the earlier layers of the network to learn more slowly than the later layers. When this happens, the weight-parameter updates become insignificant and learning effectively stalls.
Exploding Gradient:
The exploding gradient is just the opposite of the vanishing gradient: it occurs when the gradient is too large, which creates an unstable model. In this scenario, the model weights grow too large and may eventually be represented as NaN. This problem can be addressed using dimensionality reduction techniques, which help to minimize complexity within the model.
Example program:
import numpy as np

def gradient_descent(x, y):
    m_curr = b_curr = 0
    iterations = 10000
    n = len(x)
    learning_rate = 0.08
    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        # mean squared error cost for the current parameters
        cost = (1/n) * sum([val**2 for val in (y - y_predicted)])
        # partial derivatives of the cost with respect to m and b
        md = -(2/n) * sum(x * (y - y_predicted))
        bd = -(2/n) * sum(y - y_predicted)
        m_curr = m_curr - learning_rate * md
        b_curr = b_curr - learning_rate * bd
        print("m {}, b {}, cost {} iteration {}".format(m_curr, b_curr, cost, i))

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])
gradient_descent(x,y)
Out[]:
m 4.96, b 1.44, cost 89.0 iteration 0
m 0.4991999999999983, b 0.26879999999999993, cost 71.10560000000002 iteration 1
m 4.451584000000002, b 1.426176000000001, cost 56.8297702400001 iteration 2
.
.
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9997
m 2.000000000000001, b 2.9999999999999947, cost 1.0255191767873153e-29 iteration 9998
m 2.000000000000002, b 2.999999999999995, cost 1.0255191767873153e-29 iteration 9999
To tune the model, we need hyperparameter optimization. By finding the optimal combination of their
values, we can decrease the error and build the most accurate model.
After each iteration, you compare the output with the expected results, assess the accuracy, and adjust the hyperparameters if necessary. This is a repeated process. You can do it manually or use one of the many optimization techniques, which come in handy when you work with large amounts of data.
Exhaustive search
Exhaustive search, or brute-force search, is the process of looking for the most optimal hyperparameters by checking each candidate in turn. You do the same thing when you forget the code for your bike's lock and try out all the possible options. In machine learning we do the same thing, but the number of options is usually quite large.
The exhaustive search method is simple. For example, if you are working with a k-means algorithm, you can manually search for the right number of clusters. However, if there are hundreds or thousands of options to consider, the search becomes unbearably heavy and slow. This makes brute-force search inefficient in the majority of real-life cases.
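In scikit-learn, for instance, a small exhaustive search can be run with GridSearchCV (the grid values below are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# every combination of C and kernel in the grid is tried and cross-validated
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)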
Gradient descent
Gradient descent is the most common algorithm for model optimization, aimed at minimizing the error. In order to perform gradient descent, you have to iterate over the training dataset while re-adjusting the model.
Your goal is to minimize the cost function because it means you get the smallest possible error and
improve the accuracy of the model.
Graphically, gradient descent travels through the variable space. To get started, you take a random point on the graph and arbitrarily choose a direction. If you see that the error is getting larger, that means you chose the wrong direction.
When you are not able to improve (decrease the error) any further, the optimization is over and you have found a local minimum.
Looks fine so far. However, classical gradient descent will not work well when there are several local minima: having found the first minimum, the algorithm simply stops searching, because it only finds a local minimum and is not designed to find the global one.
Note: In gradient descent, you proceed forward with steps of the same size. If you choose a learning rate that is too large, the algorithm will jump around without getting closer to the right answer. If it's too small, the computation will start mimicking exhaustive search, which is, of course, inefficient.
So you have to choose the learning rate very carefully. If done right, gradient descent becomes a computationally efficient and rather quick method to optimize models.
Genetic algorithms
Genetic algorithms represent another approach to ML optimization. The principle behind these algorithms is an attempt to apply the theory of evolution to machine learning.
In the theory of evolution, only the specimens with the best adaptation mechanisms get to survive and reproduce. How do you know which specimens are and aren't the best in the case of machine learning models?
Imagine you have a bunch of random algorithms at hand. This will be your population. Among multiple
models with some predefined hyperparameters, some are better adjusted than the others. Let’s find
them! First, you calculate the accuracy of each model. Then, you keep only those that worked out best.
Now you can generate some descendants with similar hyperparameters to the best models to get a
second generation of models.
We repeat this process many times and only the best models will survive at the end of the process.
Genetic algorithms help to avoid being stuck at local minima/maxima. They are common in optimizing
neural network models.
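A minimal sketch of this idea in Python (the accuracy function here is a stand-in; in practice it would train and validate a model with the given hyperparameters):

import random

def accuracy(hp):
    # placeholder fitness: in practice, train a model with these
    # hyperparameters and return its validation accuracy
    learning_rate, depth = hp
    return -(learning_rate - 0.1) ** 2 - (depth - 6) ** 2

def mutate(hp):
    # descendants are slightly perturbed copies of a parent's hyperparameters
    learning_rate, depth = hp
    return (learning_rate + random.uniform(-0.02, 0.02),
            depth + random.choice([-1, 0, 1]))

# initial random population of hyperparameter candidates
population = [(random.uniform(0.01, 0.5), random.randint(2, 12)) for _ in range(20)]
for generation in range(10):
    population.sort(key=accuracy, reverse=True)   # rank candidates by fitness
    survivors = population[:10]                   # keep only the best half
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

print(max(population, key=accuracy))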
Now, we will add a loss function and an optimization parameter to build a model that can predict accurate values of Y. The loss function for linear regression is called RSS, the Residual Sum of Squares.
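For m training samples, RSS = ∑ (yi − ŷi)^2, the sum of the squared differences between the actual values yi and the predicted values ŷi.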
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. It is calculated by multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:
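Cost function = ∑ (yi − ŷi)^2 + λ ∑ wj^2, i.e., RSS plus the ridge penalty term.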
o In the above equation, the penalty term regularizes the coefficients of the model; hence ridge regression shrinks the amplitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes the
cost function of the linear regression model. Hence, for the minimum value of λ, the model will
resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, it can shrink the slope all the way to 0, whereas Ridge Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:
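Cost function = ∑ (yi − ŷi)^2 + λ ∑ |wj|, i.e., RSS plus the lasso penalty term.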
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well as the feature
selection.
Key Difference between Ridge Regression and Lasso Regression
o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.
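As a brief illustration in scikit-learn (the data below is synthetic and the alpha values are illustrative), Ridge shrinks the coefficients towards zero while Lasso can set some of them exactly to zero:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=1.0).fit(X, y)
print(ridge.coef_)   # all coefficients shrunk, typically none exactly zero
print(lasso.coef_)   # uninformative coefficients are typically driven exactly to zero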
What is Overfitting?
o Overfitting & underfitting are the two main errors/problems in the machine learning model, which
cause poor performance in Machine Learning.
o Overfitting occurs when the model fits more data than required, and it tries to capture each and every
datapoint fed to it. Hence it starts capturing noise and inaccurate data from the dataset, which degrades
the performance of the model.
o An overfitted model doesn't perform accurately with the test/unseen dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
Example to Understand Overfitting
We can understand overfitting with a general example. Suppose there are three students, X, Y, and Z, and all
three are preparing for an exam. X has studied only three sections of the book and left all other sections. Y
has a good memory, hence memorized the whole book. And the third student, Z, has studied and practiced all
the questions. So, in the exam, X will only be able to solve the questions related to the three sections he studied. Student Y will only be able to solve questions that appear exactly the same as given in the book. Student Z will be able to solve all the exam questions in a proper way.
The same happens with machine learning: if the algorithm learns from only a small part of the data, it is unable to capture the required patterns and is hence underfitted, like student X.
Suppose instead the model memorizes the training dataset, like student Y. It performs very well on the seen dataset but badly on unseen data or unknown instances. In such cases, the model is said to be overfitting.
And if the model performs well with the training dataset and also with the test/unseen dataset, similar to
student Z, it is said to be a good fit.
Now, if the model performs well with the training dataset but not with the test dataset, then it is likely to have
an overfitting issue.
For example, if the model shows 85% accuracy with training data and 50% accuracy with the test dataset, it
means the model is not performing well.
1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization
1. Early Stopping
In this technique, training is paused before the model starts learning the noise within the data. While training the model iteratively, we measure its performance after each iteration and continue as long as a new iteration improves the performance of the model. Beyond that point, the model begins to overfit the training data, so we need to stop the process before the learner passes it.
Stopping the training process before the model starts capturing noise from the data is known as early stopping.
However, this technique may lead to the underfitting problem if training is paused too early. So, it is very
important to find that "sweet spot" between underfitting and overfitting.
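In Keras, for example, early stopping is available as a callback; a minimal sketch, assuming an already compiled model named model and training arrays x_train and y_train:

from tensorflow.keras.callbacks import EarlyStopping

# stop when the validation loss has not improved for 5 consecutive epochs,
# and roll back to the weights from the best epoch seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])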
2.Train with More data
Increasing the training set by including more data can enhance the accuracy of the model, as it provides more
chances to discover the relationship between input and output variables.
It may not always work to prevent overfitting, but this way helps the algorithm to detect the signal better to
minimize the errors.
When a model is fed with more training data, it becomes unable to overfit all the samples and is forced to generalize.
However, in some cases the additional data may add more noise to the model, so we need to be sure that the data is clean and free from inconsistencies before feeding it to the model.
3.Feature Selection
While building an ML model, we have a number of parameters or features that are used to predict the outcome. However, sometimes some of these features are redundant or less important for the prediction, and for such cases the feature selection process is applied. In this process, we identify the most important features within the training data, and the other features are removed. Further, this helps to simplify the model and reduce noise from the data. Some algorithms have automatic feature selection; if not, we can perform the process manually.
4.Cross-Validation
Cross-validation is one of the powerful techniques to prevent overfitting.
In the general k-fold cross-validation technique, we divide the dataset into k equal-sized subsets of data; these subsets are known as folds.
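In scikit-learn, for example, k-fold cross-validation can be run in one line with cross_val_score; a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold cross-validation: each fold serves once as the held-out validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())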
5.Data Augmentation
Data Augmentation is a data analysis technique, which is an alternative to adding more data to prevent
overfitting. In this technique, instead of adding more training data, slightly modified copies of already existing
data are added to the dataset.
The data augmentation technique makes each data sample appear slightly different every time it is processed by the model. Hence each sample appears unique to the model, which prevents overfitting.
6.Regularization
If overfitting occurs when a model is complex, we can reduce the number of features. However, overfitting
may also occur with a simpler model, more specifically the Linear model, and for such cases, regularization
techniques are much helpful.
Regularization is the most popular technique to prevent overfitting. It is a group of methods that forces the
learning algorithms to make a model simpler. Applying the regularization technique may slightly increase the
bias but slightly reduces the variance. In this technique, we modify the objective function by adding the
penalizing term, which has a higher value with a more complex model.
The two commonly used regularization techniques are L1 Regularization and L2 Regularization.
Ensemble Methods
In ensemble methods, prediction from different machine learning models is combined to identify the most
popular result.
The most commonly used ensemble methods are Bagging and Boosting.
In bagging, individual data points can be selected more than once to form several sample datasets. A model is then trained on each sample dataset independently, and depending on the type of task, i.e., regression or classification, the average or majority vote of those predictions is used to produce a more accurate result. Moreover, bagging reduces the chances of overfitting in complex models, as illustrated below.
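A short scikit-learn sketch of bagging (the dataset and number of trees are illustrative), where each tree sees a bootstrap sample of the data and the ensemble votes on the final class:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# each of the 50 trees is trained on a bootstrap sample (points can repeat),
# and their predictions are combined by majority vote
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X, y)
print(bagging.score(X, y))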
In Machine Learning and Artificial Intelligence, the Perceptron is one of the most commonly used terms. It is the primary step in learning Machine Learning and Deep Learning technologies, and it consists of a set of weights, input values or scores, and a threshold. The Perceptron is a building block of an Artificial Neural Network. Initially, in 1957 (the mid-20th century), Frank Rosenblatt invented the Perceptron for performing certain calculations to detect input data capabilities or business intelligence. The Perceptron is a linear Machine Learning algorithm used in supervised learning for various binary classifiers. This algorithm enables neurons to learn elements and processes them one by one during preparation. In this tutorial, "Perceptron in Machine Learning," we will discuss in-depth knowledge of the Perceptron and its basic functions in brief. Let's start with the basic introduction of the Perceptron.
The weight parameter represents the strength of the connection between units and is another important parameter of the Perceptron's components. The weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, the bias can be considered as the intercept term in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire or not.
Activation Function can be considered primarily as a step function.
The data scientist uses the activation function to make a subjective decision based on the problem statement and to shape the desired outputs. The activation function may differ between perceptron models (e.g., Sign, Step, or Sigmoid), depending for instance on whether the learning process is slow or suffers from vanishing or exploding gradients.
This step function or Activation function plays a vital role in ensuring that output is mapped between required
values (0,1) or (-1,1). It is important to note that the weight of input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
In the first step, all input values are multiplied by their weights and summed up. Then a special term called the bias 'b' is added to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum, which gives us
output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
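As a small illustration (the numbers below are made up), these two steps can be written directly in NumPy with a step activation function:

import numpy as np

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x) + b        # step 1: ∑wi*xi + b
    return 1 if weighted_sum >= 0 else 0   # step 2: step activation function

x = np.array([1.0, 0.5])    # input values
w = np.array([0.6, -0.4])   # weights
b = -0.1                    # bias
print(perceptron_output(x, w, b))   # 0.6*1.0 - 0.4*0.5 - 0.1 = 0.3 >= 0, so output is 1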
A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns.
Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.
Neural networks can adapt to changing input; so the network generates the best possible result without needing
to redesign the output criteria. The concept of neural networks, which has its roots in artificial intelligence, is
swiftly gaining popularity in the development of trading systems.
Pros
Can often work more efficiently and for longer than humans
Can be programmed to learn from prior outcomes to strive to make smarter future calculations
Often leverage online services that reduce (but do not eliminate) systematic risk
Are continually being expanded in new fields with more difficult problems
Cons
Still rely on hardware that may require labor and expertise to maintain
May take long periods of time to develop the code and algorithms
May be difficult to assess errors or adaptions to the assumptions if the system is self-learning but lacks
transparency
Usually report an estimated range or estimated amount that may not actualize
Multi-Class Classification
Multi-class classification is perhaps the most popular machine learning job, aside from regression.
The science behind it is the same whether it's spelled multiclass or multi-class. An ML classification problem with more than two outputs or classes is known as multi-class classification. Using a machine learning model to identify animal species in photographs from an encyclopedia is an example of multi-class classification, because each image may be classed as one of many distinct animal categories. Multi-class classification also requires that each sample belongs to exactly one class (i.e., an elephant is only an elephant; it is not also a lemur).
We are given a set of training samples separated into K distinct classes, and we create an ML model to forecast
which of those classes some previously unknown data belongs to. The model learns patterns specific to each
class from the training dataset and utilizes those patterns to forecast the classification of future data.
Approach –
1. Load dataset from the source.
2. Split the dataset into “training” and “test” data.
3. Train Decision tree, SVM, and KNN classifiers on the training data.
4. Use the above classifiers to predict labels for the test data.
5. Measure accuracy and visualize classification.
Decision tree classifier – A decision tree classifier is a systematic approach for multiclass classification. It
poses a set of questions to the dataset (related to its attributes/features). The decision tree classification
algorithm can be visualized on a binary tree. On the root and each of the internal nodes, a question is posed
and the data on that node is further split into separate records that have different characteristics. The leaves of
the tree refer to the classes in which the dataset is split. In the following code snippet, we train a decision tree
classifier in scikit-learn.
Example:
# importing necessary libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# loading the iris dataset
iris = datasets.load_iris()
# X -> features, y -> label
X = iris.data
y = iris.target
# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, dtree_predictions)
There are two types of backpropagation networks:
1. Static backpropagation. Static backpropagation is a network developed to map static inputs to static
outputs. Static backpropagation networks can solve static classification problems, such as optical
character recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is used for fixed-point learning.
Recurrent backpropagation activation feeds forward until it reaches a fixed value.
The algorithm gets its name because the weights are updated backward, from output to input. Advantages of backpropagation include the following:
It does not have any parameters to tune except for the number of inputs.
It is highly adaptable and efficient and does not require any prior knowledge about the network.
It is a standard process that usually works well.
It is user-friendly, fast and easy to program.
Users do not need to learn any special functions.
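To make the backward flow of weight updates concrete, here is a minimal NumPy sketch (the architecture, learning rate, and iteration count are illustrative) that trains a two-layer network on the XOR problem:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: the error signal travels from the output layer back
    # towards the input layer, which is what gives backpropagation its name
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates of weights and biases
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)

print(out.round(2))   # the outputs should approach [0, 1, 1, 0]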
1. Sigmoid function: The Sigmoid function produces outputs between 0 and 1, and its use is to convert a real value into a probability. In machine learning, the sigmoid function generally refers to the logistic function, also called the logistic sigmoid function; it is also the most widely used sigmoid function (others are the hyperbolic tangent and the arctangent).
A sigmoid function is placed as the last layer of the model to convert the model’s output into a probability
score, which is easier to work with and interpret.
Another reason it is used mostly in the output layer is that, used elsewhere, it can cause a neural network to get stuck during training.
2. TanH function: It is the hyperbolic tangent function, whose range lies between -1 and 1; hence it is also called the zero-centred function. Because it is zero-centred, it is much easier to model inputs with strongly negative, positive or neutral values. The TanH function is used instead of the sigmoid function when the desired output range is other than 0 to 1.
TanH functions usually find applications in RNN for natural language processing and speech recognition
tasks.
On the downside, in the case of both Sigmoid and TanH, if the weighted sum input is very large or very small,
the function’s gradient becomes very small and closer to zero.
3. ReLU function: Rectified Linear Unit, also called ReLU, is a widely favoured activation function for deep
learning applications. Compared to Sigmoid and TanH activation functions, ReLU offers an upper hand in
terms of performance and generalisation. In terms of computation too, ReLU is faster as it does not compute
exponentials and divisions. The disadvantage is that ReLU overfits more, as compared with Sigmoid.
4. Parametric ReLU (PReLU): ReLU has been one of the keys to the recent successes in deep learning. Its use has led to better solutions than that of sigmoid, partially due to the vanishing gradient problem with sigmoid activations. But we can still improve upon ReLU. LeakyReLU was introduced, which doesn't zero out the negative inputs as ReLU does; instead, it multiplies the negative input by a small value (like 0.02) and keeps the positive input as is. This has shown only a negligible increase in accuracy; PReLU goes a step further by making that small negative slope a learnable parameter.
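For reference, the four activation functions discussed above can be sketched in NumPy as follows (the 0.02 slope for LeakyReLU mirrors the value mentioned above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))            # output in (0, 1)

def tanh(z):
    return np.tanh(z)                       # output in (-1, 1), zero-centred

def relu(z):
    return np.maximum(0, z)                 # zeroes out negative inputs

def leaky_relu(z, slope=0.02):
    return np.where(z > 0, z, slope * z)    # small slope instead of zero for negatives

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z))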
Dropout
Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-
adaptations on training data. It is a very efficient way of performing model averaging with neural networks.
The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
A simple and powerful regularization technique for neural networks and deep learning models is dropout. This section covers the dropout regularization technique and how to apply it to deep learning models in Python with Keras, with a sketch at the end.
Dropout is a technique where randomly selected neurons are ignored during training; they are "dropped out" at random. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.
As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.
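A minimal sketch of dropout in Keras (the layer sizes and input shape are illustrative):

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    Dropout(0.5),                    # randomly ignore 50% of these units during training
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),  # binary classification output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

At test time no units are dropped; Keras rescales activations during training so that no change is needed at inference.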