
Module-1

Dr.Bahubali Shiragapur, Director @SCSEA


D Y Patil International University Akrudi Pune
Can anyone identify this animal?
Exactly… it's a dog.

Why not a cat? Or a lion?

The brain is trained…

What kind of system is present inside the brain?

Yes… the nervous system.


How does the human nervous system work?

Input → Processing → Output

1. Dendrites = input
2. Nucleus / soma (the neuron's cell body) = processes the information
3. Axon = output
4. Axon terminals / synapses = point of connection to other neurons
McCulloch-Pitts Neuron Model
• Mankind's first mathematical model of a biological neuron
• It is binary activated: it allows only binary values 0 & 1
• Neurons are connected by directed weighted paths
• It has 2 types of paths
• Excitatory (positive weights, w)
• Inhibitory (negative weights, -p)
• Each neuron is associated with a threshold value θ
• A neuron fires only if net input > threshold
• The threshold is set so that inhibition is absolute: a non-zero inhibitory input prevents the neuron from firing

For n excitatory inputs (each with weight w) and m inhibitory inputs (each with weight -p), the net input is

y = x1*w + x2*w + ... + xn*w + x(n+1)*(-p) + x(n+2)*(-p) + ... + x(n+m)*(-p) = Σ(i=1..n) xi*w + Σ(i=1..m) x(n+i)*(-p)

and the output is

Y = f(y) = 1 if y ≥ θ, 0 if y < θ
AND Function using McCulloch-Pitts Neuron Model

Inputs X1 and X2, each with weight 1, feed the neuron Y = f(Z).

• Net input is given by
Z = Σi wi*xi = w1*x1 + w2*x2 = 1*x1 + 1*x2 = x1 + x2

Y = f(Z) = 1 if Z ≥ 2, 0 if Z < 2

X1  X2  Y
0   0   0
0   1   0
1   0   0
1   1   1
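A minimal Python sketch (not from the slides) of this AND neuron, using the weights (1, 1) and the threshold θ = 2 given above:

# McCulloch-Pitts neuron: fires (outputs 1) only when the net input reaches the threshold.
def mp_neuron(inputs, weights, theta):
    z = sum(w * x for w, x in zip(weights, inputs))  # net input Z = sum_i w_i * x_i
    return 1 if z >= theta else 0

# AND gate: both weights are 1 and the threshold is 2, as on the slide.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron((x1, x2), (1, 1), theta=2))
# Prints 1 only for (1, 1), reproducing the truth table above.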
AND Function using Bipolar Inputs

Inputs X1 and X2, each with weight 1, feed the neuron Y.

X1  X2  Y
 1   1   1
 1  -1  -1
-1   1  -1
-1  -1  -1

The decision boundary is x1 + x2 = z; taking z = 1 gives x1 + x2 = 1, i.e. x2 = 1 - x1.
[Figure: the four input patterns plotted in the (x1, x2) plane with the separating line x2 = 1 - x1.]

Linearly Separable!
X-OR Function using Bipolar Inputs

Inputs X1 and X2, each with weight 3, feed the neuron Y.

X1  X2  Y
 1   1  -1
 1  -1   1
-1   1   1
-1  -1  -1

Z = 3*x1 + 3*x2

Y = f(Z) = -1 if Z ≥ 6, 1 if Z < 6

No single threshold on Z classifies all four patterns correctly (the rule above, for example, misclassifies (-1, -1)); no straight line separates the two classes.

Non-Linearly Separable!
What is a Perceptron?

A perceptron is a binary classification algorithm modeled after the functioning of the human brain
What is Multilayer Perceptron?
• A multilayer perceptron (MLP) is a group of perceptrons, organized in multiple layers, that can
accurately answer complex questions.
• Each perceptron in the first layer (on the left) sends signals to all the perceptrons in the second
layer, and so on.
• An MLP contains an input layer, at least one hidden layer, and an output layer.
The Perceptron Learning Process

• Takes the inputs, which are fed into the perceptrons in the input layer, multiplies them by their weights, and computes the sum:
Z = Σi wi*xi
• Adds the number one, multiplied by a "bias weight" b:
Z = Σi wi*xi + b*1
• This is a technical step that lets the output of each perceptron move up, down, left and right on the graph; based on this value the activation function makes a decision.
• Feeds the sum through the activation function. In a simple perceptron system, the activation function is a step function:
Y = f(Z) = 1 if Z ≥ θ, 0 if Z < θ
• The result of the step function is the output.
Perceptron Learning Algorithm
• S1 : Initialize weights & bias (initially set to 0)
• S2 : Compute the response of the net
Z = Σi wi*xi
Y = b + Σi wi*xi   (the activation of the bias unit is always 1)
• S3 : Apply the activation function
• S4 : Check the stopping condition Y = t, where t = target:
• if false, update weights & bias
w_new = w_old + Δw, where Δw = xi*t*α
b_new = b_old + Δb, where Δb = t*α
• S5 : Repeat S2-S4
• Assume α = 1 (the learning rate)

The perceptron is the basic unit of the neural network: it accepts an input and generates a prediction. A sketch of this learning loop is given below.
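A minimal Python sketch of steps S1-S5, assuming bipolar targets t in {-1, +1} and the three-valued step activation used in the worked example later in these notes; the train_perceptron name and the OR data are illustrative choices, not from the slides:

# Three-valued step activation, as defined later in these notes.
def activation(z):
    if z > 0:
        return 1
    if z == 0:
        return 0
    return -1

def train_perceptron(samples, alpha=1.0, epochs=10):
    # S1: initialize weights & bias to 0
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        changed = False
        for x, t in samples:
            # S2 + S3: compute the response of the net and apply the activation
            y = activation(b + sum(wi * xi for wi, xi in zip(w, x)))
            # S4: if the output differs from the target, update weights & bias
            if y != t:
                w = [wi + xi * t * alpha for wi, xi in zip(w, x)]
                b = b + t * alpha
                changed = True
        if not changed:   # stopping condition: a full pass with no updates
            break
    return w, b

# Bipolar OR data (illustrative): target is +1 unless both inputs are -1.
or_data = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
print(train_perceptron(or_data))   # for this data the loop converges to weights [1.0, 1.0] and bias 1.0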
What is a (Neural Network) NN?

 Single neuron == linear regression without applying activation (perceptron)

 Basically a single neuron will calculate the weighted sum of the inputs (W.T*X), and then we can set a threshold to predict the
output of the perceptron. If the weighted sum of the inputs crosses the threshold, the perceptron fires; if not, it doesn't.
 A perceptron can take real-valued or Boolean inputs.

 Actually, when w•x+b=0 the perceptron outputs 0.

 A disadvantage of the perceptron is that it only outputs binary values, and a small change in weight and bias
can flip the output. We need a system which can modify the output slightly according to a small
change in weight and bias. Here the sigmoid function comes into the picture.
 If we replace the perceptron's step function with a sigmoid function, then a small change in the weights makes only a slight change in the output.

 e.g. a perceptron outputs 0; a slight change in weight and bias flips the output to 1, even though the desired output is 0.7. In
the case of a sigmoid, the same slight change in weight and bias can move the output smoothly towards 0.7.

 If we apply the sigmoid activation function then a single neuron will act as Logistic Regression.
 We can understand the difference between the perceptron and the sigmoid function by looking at the sigmoid function's graph.
Artificial Neural Network (ANN)
▪ ANN is an information processing system.
▪ Elements called Neurons process the information : Heart of ANN
▪ Neural Net can be single layer or multi-layer (MNN) net
▪ An ANN is modelled on the way the human brain processes information.
▪ It has 2 passes
1. Forward pass
2. Backward pass
▪ Backward pass updates weight
Artificial Neural Networks (ANN)
&
Deep Neural Network (DNN)

• An Artificial Neural Network (ANN) is a supervised learning system built of a large number of simple
elements, called neurons or perceptrons.
• Each neuron makes a simple decision and feeds that decision to other neurons, organized in
interconnected layers.
• Together, the neural network can approximate almost any function, and practically answer any question,
given enough training samples and computing power.
• A “shallow” neural network has only three layers of neurons:
1. An input layer that accepts the independent variables or inputs of the model
2. One hidden layer that connects the input & output layers
3. An output layer that generates predictions
• A Deep Neural Network (DNN) has a similar structure, but it has two or more “hidden layers” of neurons
that process inputs.
• As more layers of neurons are added, deep learning networks can represent more complex functions and
typically improve in accuracy.
Comparison
Difference between MLP and ANN

• In MLP the decision/ activation function is a step function and the output is
binary.
• Artificial Neural Networks are evolved from MLPs.
• Other activation functions can be used in ANN which result in outputs of real
values, usually between 0 and 1 or between -1 and 1.
• ANN allows for probability-based predictions (softmax) or classification of
items into multiple labels (tanH).
Basic Terminologies
1. Forward Pass
• The forward pass takes the inputs, passes them through the network and allows each neuron to react to
a fraction of the input.
• Neurons generate their outputs and pass them on to the next layer, until eventually the network
generates an output.
2. Error Function
• Defines how far the actual output of the current model is from the correct output.
• When training the model, the objective is to minimize the error function and bring output as close as
possible to the correct value.
• Error/Loss is evaluated as (ŷ - y)² rather than (ŷ - y), because squaring always yields a positive value.
3. Backpropagation
• In order to discover the optimal weights for the neurons, we perform a backward pass, moving back
from the network’s prediction to the neurons that generated the particular prediction. This is called
backpropagation.
• Backpropagation tracks the derivatives of the activation functions in each successive neuron, to find
the weights that bring the loss function to a minimum, which will generate the best prediction.
• This is a mathematical process called gradient descent.
Ctnd..
4. Bias and Variance
• When training neural networks, like in other machine learning techniques, we try to balance
between bias and variance.
• Bias measures how well the model fits the known outputs of the training examples, i.e. how well it is
able to predict on data it has already seen.
• Variance measures how well the model works with unknown inputs that were not available during
training.
• The bias neuron holds the number 1, and makes it possible to move the activation function up,
down, left and right on the number graph.
5. Hyper-parameters
• Tuning hyper-parameters is the primary way to build a network that provides accurate predictions
for a certain problem.
• Common hyper-parameters include
1. The number of hidden layers,
2. The activation function, and
3. How many times (epochs) training should be repeated & many more.
Ctnd..
6. Validation Set
• A validation set is another group of sample inputs which were not included during training and
preferably are different from the samples in the training set.
• This lets us know: Can the model generate correct predictions for an unknown set of inputs?
7. Validation Error
• The nice thing is that for the validation set, the correct outputs are already known.
• Validation error is the difference between the known correct outputs for the validation set & the
model's actual outputs.
• Practically, when training a neural network model, we will attempt to gather a training set that is as
large as possible.
• We will then break up the training set into at least two groups:
• Training set and
• Validation set.
Backpropagation

Backpropagation helps to adjust the weights of the neurons so that the result comes closer and closer to
the known output value.
For large neural networks, backpropagation is the algorithm which can discover the optimal weights
relatively quickly, even for a network with millions of weights.
Backpropagation and its Importance
1. Forward pass
2. Error function
3. Backpropagation with gradient descent:
• It calculates partial derivatives, going back from the error function to
a specific neuron and its weight.
• This provides complete traceability from total errors, back to a
specific weight which contributed to that error.
• The result of backpropagation is a set of weights that minimize the
error function.
4. Weight update
• weights can be updated after every sample in the training set, but
this is usually not practical.
• Typically, a batch of samples is run in one big forward pass, and
then backpropagation performed on the aggregate result.
• Running the entire training set through the backpropagation process
is called an epoch
How BP works
1. Let i1 = 0.1, w1 = 0.27, i2 = 0.2, w3 = 0.57, and b1 = 0.4.
   h1 = (i1 * w1) + (i2 * w3) + b1
      = (0.1 * 0.27) + (0.2 * 0.57) + (0.4 * 1) = 0.541
   f(h1) = f(0.541) = 1 / (1 + exp(-0.541)) = 0.632
2. Similarly h2, o1 & o2 are evaluated.
3. Let the final calculated outputs be o1 = 0.735 and o2 = 0.455.
4. Assume that the correct output values are o1 = 0.5 and o2 = 0.5.
5. For simplicity, we'll use the Mean Squared Error function.
6. Thus the errors and MSE in the output are:
   o1 error = 0.5 - 0.735 = -0.235,  MSE(o1) = ½ (-0.235)² = 0.0276
   o2 error = 0.5 - 0.455 = 0.045,   MSE(o2) = ½ (0.045)² = 0.001
   Total Error = 0.0276 + 0.001 = 0.0286
   This is the number we need to minimize with backpropagation.
   (For the complete working of BP, see the step-by-step example by Matt Mazur in the references.)
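A small Python sketch that reproduces the arithmetic of steps 1-6 above; the values for h2, o1 and o2 are taken as given on the slide rather than computed from a full network:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Step 1: forward pass into hidden neuron h1 (values from the slide)
i1, w1, i2, w3, b1 = 0.1, 0.27, 0.2, 0.57, 0.4
h1_net = i1 * w1 + i2 * w3 + b1 * 1          # 0.541
h1_out = sigmoid(h1_net)                     # about 0.632

# Steps 3-6: given outputs vs. correct values, compute the MSE-style error
o1, o2 = 0.735, 0.455                        # calculated outputs (given)
t1, t2 = 0.5, 0.5                            # correct outputs
mse_o1 = 0.5 * (t1 - o1) ** 2                # about 0.0276
mse_o2 = 0.5 * (t2 - o2) ** 2                # about 0.001
total_error = mse_o1 + mse_o2                # about 0.0286, the quantity backpropagation minimizes

print(round(h1_net, 3), round(h1_out, 3), round(total_error, 4))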
Ctnd..
• The backpropagation algorithm calculates how much the final output values, o1 and o2, are affected by each
of the weights.
• To do this, it calculates partial derivatives, going back from the error function to the neuron that carried a
specific weight.
• For example, weight w5, going from hidden neuron h1 to output neuron o1, affected our model as follows:
Neuron h1 with weight w5 → affects total input of neuron o1 → affects output o1 → affects total error
• Backpropagation goes in the opposite direction
Total errors → affected by output o1 → affected by total inputs of neuron o1 → affected by neuron h1
with weight w5 .
• The algorithm calculates 3 derivatives
• The derivative of the total error with respect to output o1
• The derivative of output o1 with respect to the total input of neuron o1
• The derivative of the total input of neuron o1 with respect to weight w5 (the connection from neuron h1)
• This gives us complete traceability from the total error, all the way back to the weight w5.
• Using the Leibniz chain rule, it is possible to calculate, based on the above three derivatives, the
optimal value of w5 that minimizes the error function. In other words, the "best" weight w5 that will
make the neural network most accurate.
Back-Propagation
Back Propagation for AND Function using Bipolar Inputs

Starting with w1 = 0, w2 = 0, b = 0 and α = 1, one epoch over the four training patterns gives:

x1   x2   b    Z    Y   t (o/p)   Δw1   Δw2   Δb   w1   w2   b
 1    1   1    0    0      1        1     1    1    1    1    1
-1    1   1    1    1     -1        1    -1   -1    2    0    0
 1   -1   1    2    1     -1       -1     1   -1    1    1   -1
-1   -1   1   -3   -1     -1        0     0    0    1    1   -1

Worked steps for (x1, x2) = (1, 1):
• S1 : b = 0, w1 = w2 = 0, α = 1
• S2 : Z = b + Σi wi*xi = 0 + (0*1) + (0*1) = 0
• S3 : Y = f(Z) = 1 if Z > 0, 0 if Z = 0, -1 if Z < 0; thus Y = 0
• S4 : Y ≠ t (0 ≠ 1)
• S5 : update weights & bias:
  Δw1 = x1*t*α = 1*1*1 = 1, so w1 = w1_old + Δw1 = 0 + 1 = 1
  Δw2 = x2*t*α = 1*1*1 = 1, so w2 = w2_old + Δw2 = 0 + 1 = 1
  Δb = t*α = 1*1 = 1, so b_new = b_old + Δb = 0 + 1 = 1
• Do the same for (x1, x2) = (-1, 1), now considering b = 1, w1 = 1, w2 = 1, α = 1.
Ctnd..

The final weights & bias after the 1st epoch are w1 = 1, w2 = 1, b = -1.
We know that Z = b + x1*w1 + x2*w2. Setting Z = 0 for the decision boundary and solving for x2:

x2 = -x1*(w1/w2) - b/w2

Substituting w1 = 1, w2 = 1, b = -1:

x2 = -x1 + 1, i.e. x2 = 1 - x1

(A short sketch reproducing this first epoch is given below.)

A single-layer perceptron is used for linearly separable data & a multi-layer perceptron is used for non-linear
data.
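A short Python sketch that traces the first epoch on the bipolar AND data and checks the resulting decision boundary; it assumes the same activation and update rules as the worked example above:

# Trace one epoch of perceptron learning on the bipolar AND data,
# reproducing the table above (final weights w1 = 1, w2 = 1, b = -1).
def f(z):
    return 1 if z > 0 else (0 if z == 0 else -1)

data = [((1, 1), 1), ((-1, 1), -1), ((1, -1), -1), ((-1, -1), -1)]
w1 = w2 = b = 0
alpha = 1
for (x1, x2), t in data:
    z = b + x1 * w1 + x2 * w2
    y = f(z)
    if y != t:                       # update only when the output misses the target
        w1 += x1 * t * alpha
        w2 += x2 * t * alpha
        b += t * alpha
    print(x1, x2, z, y, t, "->", w1, w2, b)

# Decision boundary Z = 0: x2 = -(w1/w2)*x1 - b/w2, which gives x2 = 1 - x1 here
print("boundary: x2 =", -w1 / w2, "* x1 +", -b / w2)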
Machine Learning Algorithms

Supervised

Unsupervised

Reinforcement

Semi-supervised
Supervised Learning

• Can you all recall the days of your kindergarten?
Or
• When you first started to learn to drive a car?
• No doubt, In both cases you have
been accompanied by a supervisor
(a teacher or a driving instructor).

• Same thing is done here.


• Complete prior knowledge of the
data
• Class labels (target) of the Data are
known in advance

Features : Shape, Texture, weight, size, number of seeds, colour


Unsupervised Learning

• Very useful in exploratory


analysis
• As it finds hidden structure
in unlabelled data set.
• Class labels(target) of the
data are unknown
• No prior knowledge of the
data is available
Reinforcement Learning

• Involves rewards and punishments


or we can say trial and error
method.
• While playing a game for the first
time, it’s just a learning
experience since the rules are
unfamiliar.
• Once you become more familiar
you begin to understand
strategies on how to win or gain
the greatest rewards.
• Thus, with experience, mistakes become fewer and fewer.
Semi-supervised Learning

• It’s a trade off between supervised


& unsupervised learning
• Uses small amount of labeled
data & large amount of unlabeled
data
• Thus avoids the problem of
finding large amount of labelled
data
• Used when it is either too
expensive or not feasible to
collect large labeled datasets but
a large volume of unlabeled data
is available.
Overview
Some Applications

Supervised
• Risk evaluation
• Fraud detection
• Recognition of image objects

Unsupervised
• Item categorization
• Clustering customers
• Similar item recommendation

and many more…

Semi-supervised
• Spam detection
• Speech analysis
• Web-content classification

Reinforcement
• Robotics
• Playing games : Chess
• Self-driving cars
Algorithms

Supervised

Classification Regression

1. Classification
Text : Sentiment classification
Image : Handwritten digit classification
2. Regression: Continuous variable
House Price Prediction
Algorithms

Un-Supervised

Clustering Dimensional Reduction

Clustering : Dividing the dataset into a number of clusters


Fruits, Vegetables, Objects
Dimensional Reduction : Reducing high-dimensional data into low-dimensional data, thereby decreasing the
computational complexity of the model & often improving its accuracy.
3-Dimension to 2-Dimension
Algorithms
Decision Tree
• The decision tree algorithm falls under the category of supervised learning.
• Decision trees can be used to solve both regression and classification problems.
• A decision tree is a tree-like graph with nodes representing the place where we pick an attribute
and ask a question; edges represent the answers to the question; and the leaves represent the
actual output or class label.
• Let's consider a very basic example.
• Our model uses, say, 3 features from the data set, namely sex, age and class.
[Figure: an example tree that splits first on Class (1st / 3rd), then on Sex (F / M), then on Age (A > 10 / A < 10), with leaves Survived / Not Survived.]
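A minimal sketch using scikit-learn's DecisionTreeClassifier on a small hand-made data set with the three features above (sex, age, class); the data values are invented for illustration:

# Hypothetical toy data: [sex (0 = F, 1 = M), age, class (1 or 3)] -> survived (1) / not survived (0)
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 30, 1], [0, 25, 1], [1, 8, 1], [1, 40, 3], [1, 35, 3], [0, 9, 3]]
y = [1, 1, 1, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sex", "age", "class"]))  # the learned questions at each node
print(tree.predict([[1, 12, 3]]))   # prediction for a hypothetical 12-year-old male in 3rd class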
Linear Regression
• Linear Regression is a supervised machine learning algorithm
where the predicted output is continuous and has a constant
slope.
• Regression is a parametric technique in nature because it makes
certain assumptions.
• It’s used to predict values within a continuous range, rather than
trying to classify them into categories
• There are two main types: simple & multivariable
• Simple linear regression uses traditional slope-intercept form.
𝒚 = 𝒎𝒙 + 𝒃
• m & b are variables which our model will try to learn
• x is an input data & y is an output (prediction)
• Our algorithm will try to learn the correct values for m & b. By
the end of our training, our equation will approximate the line of
best fit
• Multivariable regression uses more than one input variable.
f(x, y, z) = w1*x + w2*y + w3*z
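A minimal sketch of simple linear regression learning m and b by gradient descent; the data, learning rate and iteration count are illustrative assumptions:

import numpy as np

# Made-up data that roughly follows y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

m, b = 0.0, 0.0
lr = 0.02                        # learning rate
for _ in range(5000):
    y_hat = m * x + b            # current predictions
    # Gradients of the mean squared error with respect to m and b
    dm = (2.0 / len(x)) * np.sum((y_hat - y) * x)
    db = (2.0 / len(x)) * np.sum(y_hat - y)
    m -= lr * dm
    b -= lr * db

print(round(m, 2), round(b, 2))  # approaches the line of best fit for this data (about 1.96 and 1.1)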
Logistic Regression
• Logistic Regression is a supervised ML algorithm which is
used for the classification problems
• It’s a predictive analysis algorithm and is based on the
concept of probability.
• Linear : Continuous (marks 0-100)
• Logistic : Discrete (Pass/Fail)
• Types
• Binary (0/1)
• Multi-class (cat/dog/hen)
• Ordinal (low/medium/high)
• Can be applied for checking the survival of a person in
Titanic dataset
• The logistic function, also called the sigmoid function:
f(x) = 1 / (1 + e^(-x))
• The hypothesis of logistic regression is limited to values between 0 and 1:
0 ≤ hθ(x) ≤ 1
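A minimal sketch using scikit-learn's LogisticRegression on invented marks/pass-fail data (a discrete output, as described above):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: exam marks (0-100) -> pass (1) / fail (0)
marks = np.array([[20], [35], [45], [52], [60], [75], [88]])
passed = np.array([0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(marks, passed)
print(clf.predict([[40], [70]]))          # discrete class labels, e.g. fail / pass
print(clf.predict_proba([[40], [70]]))    # sigmoid-based probabilities between 0 and 1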
Types of Regression Analysis
In a regression model, the inputs are called independent values. The output is called the dependent value. There
are weights called coefficients, which determine how much each input value contributes to the result.

1. Linear regression : suitable for dependent values which can be fitted with a straight line (linear function).
2. Polynomial regression : suitable for dependent variables which can be fitted by a curve or series of curves.
3. Logistic regression : suitable for dependent variables which are binary, and therefore not normally distributed.
4. Stepwise regression : an automated technique that can deal with high dimensionality of independent
variables.
5. Ridge regression : A regression technique that helps with multicollinearity, independent variables that are
highly correlated. It adds a bias to the regression estimates, penalizing coefficients using a shrinkage
parameter.
6. Lasso regression : Like Ridge regression, shrinks coefficients to solve multicollinearity, however, it also shrinks
the absolute values, meaning some of the coefficients can become zero. This performs “feature selection”,
removing some variables from the equation.
7. ElasticNet regression : Combines Ridge and Lasso regression, and is trained with L1 and L2 regularization,
trade off between the two techniques.
Naive Bayes
• The Naive Bayes classifier is one of the simplest and most effective supervised classification
algorithms; it helps in building fast machine learning models that can make quick
predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
• The Naive Bayes algorithm is comprised of two words, Naive and Bayes:
• Naive: called naive because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
• If a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized
as an apple.
• Hence each feature individually contributes to identifying it as an apple, without depending on the others.

• Bayes: called Bayes because it depends on the principle of Bayes' theorem.


• Bayes theorem is used to determine the probability of a hypothesis with prior knowledge.

P(A|B) = P(B|A) * P(A) / P(B)
Consider the Iris data set.
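A minimal sketch applying scikit-learn's Gaussian Naive Bayes to the Iris data set mentioned above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_train, y_train)        # learns P(feature | class), assuming feature independence
print("accuracy:", nb.score(X_test, y_test))   # typically well above 0.9 on Iris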
Support Vector Machine (SVM)

• SVM is a supervised ML algorithm which


can be used for both classification or
regression challenges. However, it is
mostly used in classification problems.
• The objective of SVM is to find a
hyperplane in an N-dimensional space(N
is the number of features) that distinctly
classifies the data points.
• If the number of input features is 2, the hyperplane is a line.
• If the number of input features is 3, the hyperplane is a 2D
plane.
• It becomes difficult to imagine when the
number of features exceeds 3.
• Maximizing the margin distance provides
more confidence while classifying the new
data points
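A minimal sketch using scikit-learn's SVC with a linear kernel on invented 2-feature data, so the separating hyperplane is a line:

import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-feature data, two classes
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X, y)      # maximizes the margin between the two classes
print(svm.coef_, svm.intercept_)          # the separating line w . x + b = 0
print(svm.predict([[3, 3], [7, 5]]))      # classify two new points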
K Nearest Neighbor (KNN)
• KNN is one of the simplest Supervised Machine
Learning algorithms
• Used for Regression as well as for Classification but
mostly it is used for the Classification problems.
• KNN checks the similarity between the new data and the available data and puts the new data into the
category that is most similar to the available categories.
• S1: Select the number of neighbors, K.
• S2: Compute the Euclidean distance from the new point to all the data points.
• S3: Take the K nearest neighbors as per the calculated Euclidean distances.
• S4: Among these K neighbors, count the number of data points in each category.
• S5: Assign the new data point to the category for which the number of neighbors is maximum.
Example: K = 5 (a sketch of these steps follows below)
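A minimal from-scratch sketch of steps S1-S5 (Euclidean distance plus a majority vote); the data points and labels are invented for illustration:

import math
from collections import Counter

def knn_predict(new_point, data, k=5):
    # S2: Euclidean distance from the new point to every stored point
    dists = [(math.dist(new_point, x), label) for x, label in data]
    # S3: take the K nearest neighbours
    nearest = sorted(dists)[:k]
    # S4-S5: count the categories among them and pick the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature data set with two categories, "A" and "B"
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
        ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]
print(knn_predict((2, 3), data, k=5))   # "A": most of the 5 nearest neighbours belong to category A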
Principal Component Analysis (PCA)

PCA is an unsupervised, non-parametric statistical technique primarily used for dimensionality


reduction in machine learning.
More features means higher-dimensional data, which makes the model prone to overfitting.
Hence decreasing the number of dimensions decreases the complexity of the model and increases
computational efficiency.
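A minimal sketch reducing the 4-dimensional Iris features to 2 dimensions with scikit-learn's PCA:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 150 samples x 4 features
pca = PCA(n_components=2)                  # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)           # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)       # fraction of the variance kept by each component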
Feature Selection & Linear Discriminant Analysis
(LDA)

• Feature selection and Data cleaning should be the first and most important step of your
model designing.
• Feature selection in machine learning refers to the process of choosing the most relevant
features from our data to give to our model.
• By limiting the number of features we can often speed up training, improve accuracy, or
both.
• Irrelevant or partially relevant features can negatively impact model performance.
• LDA is a dimensionality reduction technique. It is used as a pre-processing step in ML
• The goal of LDA is to project the features in higher dimensional space onto a lower-
dimensional space in order to avoid the curse of dimensionality and also reduce resources
and dimensional costs
• Logistic regression is a classification algorithm traditionally limited to only two-class
classification problems.
• If you have more than two classes, then Linear Discriminant Analysis is the preferred technique.
K Means Clustering
• K-means clustering is one of the simplest
and popular unsupervised machine
learning algorithms.
• Clustering is the process of dividing the
entire data into groups (also known as
clusters) based on the patterns in the
data.
• K is the number of categories, or clusters
• S1: Choose the number of clusters K
• S2 : Select k random points from the data
as centroids
• S3: Assign all the points to the closest
cluster centroid
• S4 : Recompute the centroids of newly
formed clusters
• S5 : Repeat steps 3 and 4
• S6 : Stop the algorithm when the centroids of the newly formed clusters no longer change

Example, K = 2: [Figure: steps S1-S5 illustrated on a 2-D data set; the algorithm stops when there is no change in the centroids.]

A sketch of the algorithm is given below.
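A minimal from-scratch sketch of steps S1-S6 in numpy; the k_means helper and the sample points are illustrative, not from the slides:

import numpy as np

def k_means(X, k=2, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # S1-S2: choose K and pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # S3: assign every point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S4: recompute the centroid of each newly formed cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # S6: stop when the centroids no longer change (otherwise S5: repeat)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0], [8.0, 9.0]])
print(k_means(X, k=2))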
K Mean Shift
• It is another powerful unsupervised clustering algorithm.
• Unlike K-means clustering, it does not make any assumptions, hence it is a non-
parametric algorithm.
• Mean shift determines the number of clusters based on the data; one need not
specify it in advance as in the case of K-means clustering.
Activation Function
• Activation function is one of the important building block of Neural Network
• When our brain is fed with a lot of information simultaneously, it tries hard to understand and classify the information
into “useful” and “less-useful” information.
• We need a similar mechanism for classifying incoming information as “useful” or “less-useful” in case of Neural
Networks.
• Activation functions are mathematical equations that determine the output of a neural network.
• The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not,
based on whether each neuron’s input is relevant for the model’s prediction.
• Activation functions help generate output values within an acceptable range, e.g. between 0 and 1 or between -1 and 1.
Need of Activation Function
• Without activation function every neuron will only be performing a linear transformation on the inputs using the
weights and biases.
• In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each neuron
has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred
to the next layer.
• The activation function is a mathematical “gate” in between the input feeding the current neuron and its output
going to the next layer.
• It can be as simple as a step function that turns the neuron output on and off, depending on a threshold.
• Neural networks use non-linear activation functions, which can help the network learn complex data, compute and
learn almost any function representing a question, and provide accurate predictions.
1. Binary Step Function
• A binary step function is a threshold-based activation
function. If the input value is above or below a certain
threshold, the neuron is activated and sends exactly the
same signal to the next layer.
Y = f(x) = 1 if x ≥ 0, 0 if x < 0
• The problem with a step function is that it does not allow
multi-value outputs
• The gradient of the step function is zero which causes a
hindrance in the back propagation process.
𝑌′ = 0
• Gradients are calculated to update the weights and biases
during the backpropagation process. Since the gradient of
the function is zero, the weights and biases don’t update.
2 . Linear Activation Function
• Linear Function overcomes the gradient of the function becoming zero as in
case of step function.
• A linear activation function takes the form
Y = mx
• Linear function is better than a step function because it allows multiple
outputs, not just yes and no.
• A linear activation function has two major problems
1. Not possible to use Back-Propagation (gradient descent)
The derivative of the function is a constant. So it’s not possible to go back and
understand which weights in the input neurons can provide a better prediction.
Thus not able to train the model.
Y′ = m (a constant; the graph shows the case m = 4)
2. All layers of the neural network collapse into one
With linear activation functions, no matter how many layers there are in the neural
network, the last layer will be a linear function of the first layer (because a linear
combination of linear functions is still a linear function).
So a linear activation function turns the neural network into just one layer.
3. Non-Linear Activation Functions
• Modern neural network models use non-linear activation functions. They allow the model to
create complex mappings between the network’s inputs and outputs, which are essential for
learning and modeling complex data, such as images, video, audio, and data sets which are
non-linear.
• Non-linear functions address the problems of a linear activation function:
• They allow backpropagation because they have a derivative function which is related to
the inputs.
• They allow “stacking” of multiple layers of neurons to create a deep neural network.
Multiple hidden layers of neurons are needed to learn complex data sets with high levels
of accuracy.

• There are 7 Common Nonlinear Activation Functions and which one to


choose we will look in detail
1. Sigmoid / Logistic
• It is one of the most widely used non-linear activation functions.
• Sigmoid transforms the values into the range 0 to 1:
  σ(x) = 1 / (1 + e^(-x))
• For x above 2 or below -2, it tends to bring the Y value (the prediction) to the edge of the curve, very close to 1 or 0. This enables clear predictions.
• For very high or very low values of x, there is almost no change to the prediction, causing a vanishing gradient problem.
• This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
• Outputs are not zero centered.
• Computationally expensive.

Approximate values of σ(x) and its derivative σ′(x):

x    σ(x)   σ′(x)
-4   0      0
-3   0      0
-2   0.1    0.1
-1   0.3    0.2
 0   0.5    0.3
 1   0.7    0.2
 2   0.9    0.1
 3   1      0.05
 4   1      0.02
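A small numpy sketch that reproduces (approximately) the table above and shows the vanishing gradient for large |x|:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # σ′(x) = σ(x) * (1 - σ(x))

for xi in np.arange(-4, 5):
    print(xi, round(sigmoid(xi), 2), round(sigmoid_derivative(xi), 2))
# For |x| > 2 the derivative is already close to 0: the vanishing gradient problem.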



2. Tanh / Hyperbolic Tangent

• The tanh function is very similar to the sigmoid function.


• The only difference is that it is symmetric around the origin.
• The range of values in this case is from -1 to 1 (Negative,
neutral & strong)
• Thus the inputs to the next layers will not always be of the
same sign.
tanh(x) = 2 / (1 + e^(-2x)) - 1

tanh′(x) = 1 - tanh²(x)
• Output is zero centered because its range in between -1 to 1
• Hence optimization is easier in this method so it’s always
preferred over Sigmoid function .
• But still it suffers from Vanishing gradient problem.
Comparison of sigmoid & tanh

[Figure: sigmoid and tanh curves side by side; tanh is zero-centered, sigmoid is not.]
3. Relu (Rectified Linear Unit)
• The main advantage of using ReLU over other activation functions is
it does not activate all the neurons at the same time.
• Neurons will only be deactivated if the output of the linear
transformation is less than 0.
• Since only a certain number of neurons are activated, the ReLU
function is far more computationally efficient when compared to the
sigmoid and tanh function thus allows the network to converge very
quickly
f(x) = max(0,x)

f′(x) = 1 if x ≥ 0, 0 if x < 0
• Although it looks like a linear function, ReLU has a derivative
function and allows for backpropagation hence its non-linear
• The Dying ReLU problem when inputs approach zero, or are
negative, the gradient of the function becomes zero, the network
cannot perform backpropagation and cannot learn.
4. Leaky Relu
• Leaky ReLU function is nothing but an improved
version of the ReLU function
• In Relu the gradient is 0 for x<0, this would
deactivate the neurons in that region.
• Leaky ReLU is defined to address this problem.
• Instead of defining the Relu function as 0 for
negative values, here we define it as an extremely
small linear component of x.

f(x) = x if x ≥ 0, 0.01*x if x < 0

• As the gradient of the left side of the graph comes
out to be a non-zero value, we no longer
encounter dead neurons in that region.

f′(x) = 1 if x ≥ 0, 0.01 if x < 0
5. Parameterised ReLU

• As the name suggests, it introduces a new
parameter 'a' as the slope of the negative part of
the function.

f(x) = x if x ≥ 0, a*x if x < 0

• When the value of 'a' is fixed to 0.01, the
function acts as a Leaky ReLU function.

f′(x) = 1 if x ≥ 0, a if x < 0   (the plot considers a = 0.25)

• The parameterized ReLU function is used
when the leaky ReLU function still fails to
solve the problem of dead neurons and the
relevant information is not successfully
passed to the next layer.
Comparison between Relu, L-Relu & P-Relu
6. Exponential Linear Unit (ELU)
• Unlike the leaky ReLU and parametric ReLU functions,
instead of a straight line, ELU uses a log-style curve for
defining the negative values.

f(x) = x if x ≥ 0, a*(e^x - 1) if x < 0

• ELU has an extra constant 'a', which should be a positive
number.
• ELU is a function that tends to converge the cost function to
zero faster and produce more accurate results.
• ELU is a strong alternative to ReLU, as it produces
negative outputs.
• But for x > 0, it can blow up the activation, with an
output range of [0, ∞).

f′(x) = 1 if x ≥ 0, a*e^x if x < 0
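A small numpy sketch showing ReLU, Leaky ReLU, parameterised ReLU and ELU side by side, using the formulas above (a = 0.25 for PReLU and a = 1 for ELU are illustrative choices):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.where(x >= 0, x, 0.01 * x)                 # small fixed slope 0.01 for x < 0

def prelu(x, a=0.25):
    return np.where(x >= 0, x, a * x)                    # slope 'a' of the negative part is a parameter

def elu(x, a=1.0):
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))    # smooth curve for x < 0

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for fn in (relu, leaky_relu, prelu, elu):
    print(fn.__name__, fn(x))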
7. Softmax
1. Max Function : Returns the largest numeric value from a list of numeric values.
2. Argmax Function : Returns the index of the list that contains the largest value.
3. Softmax Function : Returns the probabilistic value.
• Softmax is used as the activation function for multi-class classification problems
• The softmax activation will output one value for each node in the output layer. The output values will represent the
probabilities
• Each output will be in the interval [0, 1].
• The exponential converts negative inputs into non-negative values.
• As the denominator of each softmax computation is the same, the values become proportional to each other, which
makes sure that together they sum to 1.

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
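A minimal numpy sketch of the softmax formula; subtracting the maximum before exponentiating is a numerical-stability detail not shown on the slide and does not change the result:

import numpy as np

def softmax(x):
    # Subtracting max(x) keeps exp() from overflowing; the ratio is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])   # the "logits layer": raw scores for 3 classes
probs = softmax(logits)
print(probs)                          # roughly [0.71, 0.26, 0.04]; values in [0, 1] that sum to 1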
Ctnd..
The final layer values of the neural network, without the activation function, is what we call the “logits layer”
8. Swish

• Swish is a lesser known activation function


which was discovered by researchers at
Google.
• Swish is as computationally efficient as
ReLU and shows better performance than
ReLU on deeper models.
• The values of swish range from negative
infinity to infinity.

f(x) = x / (1 + e^(-x)) = x * σ(x)
Overview of 7 Common Activation Functions
1. The sigmoid function : Outputs values between zero and one. For very high or low values of the input
parameters, the network can be very slow to reach a prediction, called the vanishing gradient problem.
2. The TanH function : Zero-centered making it easier to model inputs that are strongly negative strongly
positive or neutral.
3. The ReLu function : Highly computationally efficient, but it is not able to process inputs that approach zero or
are negative.
4. The Leaky ReLu : Has a small positive slope in its negative area, enabling it to process zero or negative
values.
5. The Parametric ReLu : Allows the negative slope to be learned.
6. Softmax : Special activation function used for output neurons. It normalizes outputs of each class
between 0 and 1, and returns the probability that the input belongs to a specific class.
7. Swish : New activation function discovered by Google researchers. It performs better than ReLu with a
similar level of computational efficiency.
Bias & Variance
• Bias is a special neuron added to each layer in the neural
network, which simply stores the value of 1.
• This makes it possible to move or “translate” the activation
function left or right on the graph.
• Although neural networks can work without bias neurons, in
reality, they are always added, and their weights are estimated
as part of the overall model.
• In our example, the bias neurons are b1 and b2 at the bottom.
They also have weights attached to them which are learned
during backpropagation.
• High bias means the model is not “fitting” well on the training
set. This means the training error will be large.
• Low bias means the model is fitting well, and training error will
be low.
• High variance means that the model is not able to make
accurate predictions on the validation set. The validation error
will be large.
• Low variance means the model generalizes well beyond its training data.
Ctnd..

• Bias reflects how well the model fits the training


set.
• Variance reflects how well the model fits unseen
examples in the validation set.
• Overfitting happens when the neural network is
good at learning its training set, but is not able
to generalize its predictions to additional,
unseen examples. This is characterized by low
bias and high variance.
• Underfitting happens when the neural network
is not able to accurately predict for the training
set, not to mention for the validation set. This is
characterized by high bias and high variance.
Symptoms

1. The symptoms of overfitting are:


• Low bias
o Accurate predictions for the training set
• High variance
o Poor ability to generate predictions for the validation set
2. The symptoms of underfitting are:
• High bias
o Poor predictions for the training set
• High variance
o Poor predictions for the validation set
Methods to Avoid Overfitting

1. Retraining neural networks:


• Running the same model on the same training set but with different initial weights, and
selecting the network with the best performance.
2. Multiple neural networks
• Training several neural network models in parallel, with the same structure but different
weights, and averaging their outputs.
3. Early stopping
• Training the network, monitoring the error on the validation set after each iteration, and
stopping training when the network starts to overfit the data.
4. Regularization
• Adding a term to the error function equation, intended to decrease the weights and biases,
smooth outputs and make the network less likely to overfit.
5. Tuning performance ratio
• Similar to regularization, but using a parameter that defines by how much the network
should be regularized.
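A minimal sketch, assuming TensorFlow/Keras is available, combining two of the methods above: L2 regularization added to the error function and early stopping based on the validation error. The model, data and parameter values are invented for illustration:

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=500)   # made-up regression data

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # regularization: penalize large weights
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping: monitor the error on the validation split and stop when it stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[early_stop], verbose=0)
print("final training loss:", model.evaluate(X, y, verbose=0))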
Methods to Avoid Underfitting

1. Adding neuron layers or inputs


• Adding neuron layers, or increasing the number of inputs and neurons in each layer, can
generate more complex predictions and improve the fit of the model.
2. Adding more training samples or improving quality
• The more training samples you feed into the network, and the better they represent the
variance in the real population, the better the network will perform.
3. Dropout
• Randomly “kill” a certain percentage of neurons in every training iteration. This ensures
some information learned is randomly removed, reducing the risk of overfitting.
4. Decreasing regularization parameter
• Regularization can be overdone. By using a regularization performance parameter, you
can learn the optimal degree of regularization, which can help the model better fit the
data.
Difference Between Model Parameter & Hyperparameter

1. A model parameter
• It’s an internal to learn your network and is used to make predictions in a production
deep learning model.
• The objective of training is to learn the values of the model parameters.
2. A hyperparameter
• It’s an external parameter set by the operator of the neural network.
• For example,
1. the number of iterations of training,
2. the number of hidden layers,
3. or the activation function.
3. Different values of hyperparameters can have a major impact on the performance of the
network.
Hyper-parameters

Hyperparameters related to neural network structure:
• Number of hidden layers
• Dropout
• Activation function
• Weights initialization

Hyperparameters related to the training algorithm:
• Learning rate
• Epochs, iterations and batch size
• Optimizer algorithm
• Momentum
Methods of Hyperparameter Tuning

1. Manual hyperparameter tuning


• An experienced operator can guess parameter values that will achieve very high
accuracy. This requires trial and error.
2. Grid search
• This involves systematically testing multiple values of each hyperparameter and
retraining the model for each combination.
3. Random search
• A research study by Bergstra and Bengio showed that using random hyperparameter
values is actually more effective than manual search or grid search.
4. Bayesian optimization
• A method proposed by Shahriari, et al, which trains the model with different
hyperparameter values over and over again, and tries to observe the shape of the
function generated by different parameter values. It then extends this function to predict
the best possible values. This method provides higher accuracy than random search.
References

• Complete Guide to Artificial Neural Network Concepts & Models (missinglink.ai)


• 7 Types of Activation Functions in Neural Networks: How to Choose? (missinglink.ai)
• Activation Functions — ML Glossary documentation (ml-cheatsheet.readthedocs.io)
• A Step by Step Backpropagation Example – Matt Mazur
