Machine Learning
Brain is trained…
1. Dendrites = input
2. Nucleus / Soma = processes the information
3. Axon = output
4. Axon terminals / Synapses = points of connection to other neurons
McCULLOCH-PITTS NEURON MODEL
• Mankind's first mathematical model of a biological neuron.
• It is binary activated: it allows only binary values 0 & 1.
• Neurons are connected by directed, weighted paths.
• It has 2 types of paths:
  • Excitatory (positive weight w)
  • Inhibitory (negative weight -p)
• Each neuron is associated with a threshold value θ.
• The neuron fires only if the net input exceeds the threshold. With n excitatory inputs of weight w and m inhibitory inputs of weight -p, the net input is
  y = x1*w + x2*w + … + xn*w + x(n+1)*(-p) + x(n+2)*(-p) + … + x(n+m)*(-p)
    = Σ(i=1..n) xi*w + Σ(i=1..m) x(n+i)*(-p)
• The threshold is set so that inhibition is absolute: a non-zero inhibitory input prevents the neuron from firing.
• The output is
  Y = f(y) = 1 if y ≥ θ; 0 if y < θ
AND Function using the McCulloch-Pitts Neuron Model
• Network: inputs X1 and X2 feed neuron Z, each with weight 1; Z feeds the output Y.
• Net input: Z = Σi wi*xi = w1*x1 + w2*x2 = 1*x1 + 1*x2 = x1 + x2
• Output: Y = f(Z) = 1 if Z ≥ 2; 0 if Z < 2

  X1  X2  Y
   0   0  0
   0   1  0
   1   0  0
   1   1  1

(a small code sketch of this neuron follows the table)
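A minimal Python sketch of this AND neuron, using the weights (w1 = w2 = 1) and threshold (θ = 2) from the slide; the function name is illustrative:

```python
# McCulloch-Pitts neuron for AND: fire (output 1) only when the net input
# z = w1*x1 + w2*x2 reaches the threshold theta = 2.
def mp_and(x1, x2, w1=1, w2=1, theta=2):
    z = w1 * x1 + w2 * x2
    return 1 if z >= theta else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, mp_and(x1, x2))   # reproduces the truth table above
```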
AND Function using Bipolar Inputs
• Same network: X1 and X2 feed Z, each with weight 1; Z feeds Y.

  X1  X2   Y
   1   1   1
   1  -1  -1
  -1   1  -1
  -1  -1  -1

• Net input: z = x1 + x2. Taking the decision boundary at z = 1 gives x1 + x2 = 1, i.e. x2 = 1 - x1.
• A single straight line separates the two classes: Linearly Separable!
XOR Function using Bipolar Inputs
• Network: X1 and X2 feed Z with weights 3 and 3; Z feeds Y.
• Net input: Z = 3*x1 + 3*x2
• Output: Y = f(Z) = -1 if Z ≥ 6; 1 if Z < 6

  X1  X2   Y
   1   1  -1
   1  -1   1
  -1   1   1
  -1  -1  -1

• No single straight line (and no single threshold neuron) can reproduce this table, e.g. the case x1 = x2 = -1 is misclassified: Non-Linearly Separable!
What is a Perceptron?
A perceptron is a binary classification algorithm modeled after the functioning of the human brain
What is Multilayer Perceptron?
• A multilayer perceptron (MLP) is a group of perceptrons, organized in multiple layers, that can
accurately answer complex questions.
• Each perceptron in the first layer (on the left) sends signals to all the perceptrons in the second
layer, and so on.
• An MLP contains an input layer, at least one hidden layer, and an output layer.
The Perceptron Learning Process
• The perceptron is the basic unit of the neural network: it accepts an input and generates a prediction.
• Weight update: w_new = w_old + Δw, where Δw = xi * t * α
• Bias update:   b_new = b_old + Δb, where Δb = t * α
• Assume α = 1 (the learning rate)
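A small sketch of this update rule under the stated assumption α = 1; the function and variable names are illustrative:

```python
# Perceptron learning rule from the slide:
#   w_new = w_old + delta_w, where delta_w = x_i * t * alpha
#   b_new = b_old + delta_b, where delta_b = t * alpha
def perceptron_update(w, b, x, t, alpha=1.0):
    w = [wi + xi * t * alpha for wi, xi in zip(w, x)]
    b = b + t * alpha
    return w, b

# One update for input (1, 1) with target t = 1, starting from zeros.
w, b = perceptron_update([0.0, 0.0], 0.0, [1, 1], 1)
print(w, b)   # [1.0, 1.0] 1.0
```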
What is a Neural Network (NN)?
Basically, a single neuron calculates the weighted sum of its inputs (W.T*X), and in a perceptron we set a threshold to
predict the output. If the weighted sum of the inputs crosses the threshold, the perceptron fires; if not, the perceptron
doesn't.
A perceptron can take real-valued or Boolean inputs.
The disadvantage of the perceptron is that it only outputs binary values, and a small change in a weight or bias
can flip the output. We need a system whose output changes only slightly for a small change in weight and bias.
This is where the sigmoid function comes into the picture.
If we replace the perceptron's step function with a sigmoid function, we can make slight changes to the output.
e.g. output of a perceptron = 0; you slightly change a weight and bias and the output becomes 1, even though the
desired output is 0.7. In the case of a sigmoid, output = 0; a slight change in weight and bias gives output = 0.7.
If we apply a sigmoid activation function, a single neuron acts as logistic regression.
We can understand the difference between the perceptron and the sigmoid function by looking at the sigmoid
function graph (a small sketch follows).
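To make the contrast concrete, a minimal sketch comparing a step (perceptron) output with a sigmoid output for the same weighted sum; the example value of z is invented:

```python
import math

def step(z, theta=0.0):
    return 1 if z >= theta else 0        # hard yes/no, as in a perceptron

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))    # graded output between 0 and 1

z = 0.8                                  # some weighted sum w.x + b
print(step(z))      # 1
print(sigmoid(z))   # ~0.69; small changes in z move this value only slightly
```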
Artificial Neural Network (ANN)
▪ ANN is an information processing system.
▪ Elements called Neurons process the information: the heart of the ANN.
▪ A neural net can be a single-layer or a multi-layer (MNN) net.
▪ ANN is loosely modeled on the way the human brain processes information.
▪ It has 2 passes:
1. Forward pass
2. Backward pass
▪ The backward pass updates the weights.
Artificial Neural Networks (ANN)
&
Deep Neural Network (DNN)
• Artificial Neural Networks (ANN) are supervised learning systems built of a large number of simple
elements, called neurons or perceptrons.
• Each neuron makes a simple decision and feeds it to other neurons, organized in
interconnected layers.
• Together, the neural network can approximate almost any function and practically answer any question,
given enough training samples and computing power.
• A “shallow” neural network has only three layers of neurons:
1. An input layer that accepts the independent variables or inputs of the model
2. One hidden layer that connects the input & output layers
3. An output layer that generates predictions
• A Deep Neural Network (DNN) has a similar structure, but it has two or more “hidden layers” of neurons
that process inputs.
• As more layers of neurons are added, deep learning networks can model more complex functions and
typically improve in accuracy.
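As an illustration of the shallow-vs-deep idea (not part of the slides), a sketch using scikit-learn's MLPClassifier, where hidden_layer_sizes controls the number of hidden layers; the toy data set is an assumption:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# "Shallow": one hidden layer; "deep": three hidden layers.
shallow = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
deep = MLPClassifier(hidden_layer_sizes=(16, 16, 16), max_iter=2000, random_state=0).fit(X, y)

print("shallow:", shallow.score(X, y))
print("deep:   ", deep.score(X, y))
```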
Comparison
Difference between MLP and ANN
• In MLP the decision/ activation function is a step function and the output is
binary.
• Artificial Neural Networks are evolved from MLPs.
• Other activation functions can be used in ANN which result in outputs of real
values, usually between 0 and 1 or between -1 and 1.
• ANN allows for probability-based predictions over multiple classes (softmax) or outputs in the range
-1 to 1 (tanh).
Basic Terminologies
1. Forward Pass
• The forward pass takes the inputs, passes them through the network and allows each neuron to react to
a fraction of the input.
• Neurons generate their outputs and pass them on to the next layer, until eventually the network
generates an output.
2. Error Function
• Defines how far the actual output of the current model is from the correct output.
• When training the model, the objective is to minimize the error function and bring output as close as
possible to the correct value.
• Error/Loss is evaluated as (ŷ - y)² rather than (ŷ - y), because squaring always yields a positive value and
is differentiable, which the training process needs.
3. Backpropagation
• In order to discover the optimal weights for the neurons, we perform a backward pass, moving back
from the network’s prediction to the neurons that generated the particular prediction. This is called
backpropagation.
• Backpropagation tracks the derivatives of the activation functions in each successive neuron, to find
weights that brings the loss function to a minimum, which will generate the best prediction.
• This is a mathematical process called gradient descent.
Contd..
4. Bias and Variance
• When training neural networks, like in other machine learning techniques, we try to balance
between bias and variance.
• Bias measures how well the model fits the known outputs of the training examples, i.e. how well it is
able to predict on the data it was trained on.
• Variance measures how well the model works with unknown inputs that were not available during
training.
• The bias neuron holds the number 1, and makes it possible to move the activation function up,
down, left and right on the number graph.
5. Hyper-parameters
• Tuning hyper-parameters is the primary way to build a network that provides accurate predictions
for a certain problem.
• Common hyper-parameters include
1. The number of hidden layers,
2. The activation function, and
3. How many times (epochs) training should be repeated & many more.
Contd..
6. Validation Set
• A validation set is another group of sample inputs which were not included during training and
preferably are different from the samples in the training set.
• This lets us know: Can the model generate correct predictions for an unknown set of inputs?
7. Validation Error
• The nice thing is that for the validation set, the correct outputs are already known.
• Validation error is the difference between the known correct outputs for the validation set and the
actual model outputs.
• Practically, when training a neural network model, we will attempt to gather a training set that is as
large as possible.
• We will then break up the training set into at least two groups:
• Training set and
• Validation set.
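A short sketch of breaking a data set into a training set and a validation set, assuming scikit-learn's train_test_split (the slides do not name a library); the toy data is invented:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 toy samples with 2 features
y = np.arange(50) % 2               # toy labels

# Hold out 20% of the samples as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_val))     # 40 10
```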
Backpropagation
Backpropagation helps to adjust the weights of the neurons so that the result comes closer and closer to
the known output value.
For large neural networks, backpropagation is the algorithm which can discover the optimal weights
relatively quickly, even for a network with millions of weights.
Backpropagation and its Importance
1. Forward pass
2. Error function
3. Backpropagation with gradient descent:
• It calculates partial derivatives, going back from the error function to
a specific neuron and its weight.
• This provides complete traceability from total errors, back to a
specific weight which contributed to that error.
• The result of backpropagation is a set of weights that minimize the
error function.
4. Weight update
• weights can be updated after every sample in the training set, but
this is usually not practical.
• Typically, a batch of samples is run in one big forward pass, and
then backpropagation performed on the aggregate result.
• Running the entire training set through the backpropagation process
is called an epoch
How BP Works
1. Let i1 = 0.1, w1 = 0.27, i2 = 0.2, w3 = 0.57, and b1 = 0.4.
   h1 = (i1 * w1) + (i2 * w3) + b1
      = (0.1 * 0.27) + (0.2 * 0.57) + (0.4 * 1) = 0.541
   f(h1) = f(0.541) = 1 / (1 + exp(-0.541)) = 0.632
2. Similarly h2, o1 & o2 are evaluated.
3. Let the final calculated outputs be o1 = 0.735 and o2 = 0.455.
4. Assume that the correct output values are o1 = 0.5 and o2 = 0.5.
5. For simplicity, we'll use the Mean Squared Error function.
6. Thus the error & MSE in the output are:
   error(o1) = 0.5 - 0.735 = -0.235     MSE(o1) = ½ (-0.235)² = 0.0276
   error(o2) = 0.5 - 0.455 = 0.045      MSE(o2) = ½ (0.045)² = 0.001
   Total Error = 0.0276 + 0.001 = 0.0286
   This is the number we need to minimize with backpropagation (a short sketch of this arithmetic follows).
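The forward-pass and error arithmetic above can be checked with a few lines of Python; only h1 and the published output values are reproduced, since the remaining weights are not given on the slide:

```python
import math

i1, w1, i2, w3, b1 = 0.1, 0.27, 0.2, 0.57, 0.4
h1 = (i1 * w1) + (i2 * w3) + (b1 * 1)
print(round(h1, 3))                        # 0.541
print(round(1 / (1 + math.exp(-h1)), 3))   # 0.632

outputs = {"o1": 0.735, "o2": 0.455}       # calculated outputs (given)
targets = {"o1": 0.5, "o2": 0.5}           # correct outputs (given)
mse = {k: 0.5 * (targets[k] - outputs[k]) ** 2 for k in outputs}
print({k: round(v, 4) for k, v in mse.items()})   # {'o1': 0.0276, 'o2': 0.001}
print(round(sum(mse.values()), 4))                # 0.0286 -> total error to minimize
```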
Contd..
• The backpropagation algorithm calculates how much the final output values, o1 and o2, are affected by each
of the weights.
• To do this, it calculates partial derivatives, going back from the error function to the neuron that carried a
specific weight.
• For example, weight w5, going from hidden neuron h1 to output neuron o1, affected our model as follows:
Neuron h1 with weight w5 → affects total input of neuron o1 → affects output o1 → affects total error
• Backpropagation goes in the opposite direction
Total errors → affected by output o1 → affected by total inputs of neuron o1 → affected by neuron h1
with weight w5 .
• The algorithm calculates 3 derivatives
• The derivative of total errors with respect to output o1
• The derivative of output o1 with respect to total input of neuron o1
• Total input of neuron o1 with respect to neuron h1 with weight w5
• This gives us complete traceability from the total errors, all the way back to the weight w5.
• Using the Leibniz chain rule, it is possible to calculate, based on the above three derivatives, the
optimal value of w5 that minimizes the error function. In other words, what is the "best" weight w5 that will
make the neural network most accurate?
Back-Propagation
Back Propagation (Perceptron Learning) for the AND Function using Bipolar Inputs
• S1: b = 0, w1 = w2 = 0, α = 1
• For (x1, x2) = (1, 1):  Δw1 = x1*t*α = 1*1*1 = 1,  so w1 = w1_old + Δw1 = 0 + 1 = 1

  x1   x2   b |   Z    Y    t |  Δw1  Δw2  Δb |  w1   w2    b     (start: w1 = 0, w2 = 0, b = 0)
   1    1   1 |   0    0    1 |   1    1    1 |   1    1    1
  -1    1   1 |   1    1   -1 |   1   -1   -1 |   2    0    0
   1   -1   1 |   2    1   -1 |  -1    1   -1 |   1    1   -1
  -1   -1   1 |  -3   -1   -1 |   0    0    0 |   1    1   -1

Final weights & bias after the 1st epoch: w1 = 1, w2 = 1, b = -1

We know that Z = b + x1*w1 + x2*w2. Setting Z = 0 gives the decision boundary
  x2 = -x1*(w1/w2) - b/w2
By substituting w1 = 1, w2 = 1, b = -1:
  x2 = -x1 + 1
  x2 = 1 - x1
(a small training-loop sketch follows this derivation)
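A short training-loop sketch that reproduces the table above (α = 1, weights and bias start at 0, and an update is applied only when the output differs from the target):

```python
def activate(z):                       # bipolar step: -1, 0 or +1
    return 1 if z > 0 else (-1 if z < 0 else 0)

samples = [((1, 1), 1), ((-1, 1), -1), ((1, -1), -1), ((-1, -1), -1)]
w1 = w2 = b = 0
for (x1, x2), t in samples:            # one epoch over the bipolar AND table
    z = b + x1 * w1 + x2 * w2
    if activate(z) != t:               # learning rule: w += x*t, b += t (alpha = 1)
        w1 += x1 * t
        w2 += x2 * t
        b += t
print(w1, w2, b)                       # 1 1 -1, as in the table
```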
A single-layer perceptron is used for linearly separable data; a multi-layer perceptron is needed for
non-linearly separable data.
Machine Learning Algorithms
Supervised
Unsupervised
Reinforcement
Semi-supervised
Supervised Learning
Supervised
• Risk evaluation
• Fraud detection
• Recognition of image objects
Unsupervised
• Item categorization
• Clustering customers
• Similar item recommendation
…and many more
Semi-supervised
• Spam detection
• Speech analysis
• Web-content classification
Reinforcement
• Robotics
• Playing games: Chess
• Self-driving cars
Algorithms: Supervised
Supervised algorithms split into Classification and Regression.
1. Classification
   Text: sentiment classification
   Image: handwritten digit classification
2. Regression: continuous variable
   House price prediction
Algorithms: Unsupervised
[Figure: example of clustering data points into "Survived" and "Not Survived" groups]
Linear Regression
• Linear Regression is a supervised machine learning algorithm
where the predicted output is continuous and has a constant
slope.
• Regression is parametric in nature because it makes certain assumptions about the data.
• It's used to predict values within a continuous range, rather than
trying to classify them into categories.
• There are two main types: simple & multivariable.
• Simple linear regression uses the traditional slope-intercept form:
  y = m*x + b
• m & b are the variables our model will try to learn.
• x is the input data & y is the output (prediction).
• Our algorithm will try to learn the correct values for m & b. By
the end of our training, our equation will approximate the line of
best fit.
• Multivariable regression uses more than one input variable:
  f(x, y, z) = w1*x + w2*y + w3*z
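A minimal sketch of simple linear regression with scikit-learn (the library choice and the toy data, which follow y = 2x + 1, are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # single input feature x
y = np.array([3, 5, 7, 9, 11])            # generated from y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # learned m ≈ 2.0 and b ≈ 1.0
print(model.predict([[6]]))               # ≈ 13.0, on the line of best fit
```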
Logistic Regression
• Logistic Regression is a supervised ML algorithm which is
used for classification problems.
• It’s a predictive analysis algorithm and is based on the
concept of probability.
• Linear : Continuous (marks 0-100)
• Logistic : Discrete (Pass/Fail)
• Types
• Binary (0/1)
• Multi-class (cat/dog/hen)
• Ordinal (low/medium/high)
• Can be applied for checking the survival of a person in
Titanic dataset
• The logistic function is also called the sigmoid function:
  f(x) = 1 / (1 + e^(-x))
• The hypothesis of logistic regression limits its output to between 0 and 1:
  0 ≤ h_θ(x) ≤ 1
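A small sketch of the sigmoid hypothesis and a scikit-learn logistic regression on an invented pass/fail example (marks out of 100):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0), sigmoid(4.0))          # 0.5 and ~0.98: always between 0 and 1

marks = np.array([[20], [35], [45], [55], [70], [90]])
passed = np.array([0, 0, 0, 1, 1, 1])      # 0 = fail, 1 = pass
clf = LogisticRegression().fit(marks, passed)
print(clf.predict([[50]]))                 # predicted class for 50 marks
print(clf.predict_proba([[85]]))           # probability of fail / pass for 85 marks
```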
Types of Regression Analysis
In a regression model, the inputs are called independent values. The output is called the dependent value. There
are weights called coefficients, which determine how much each input value contributes to the result.
1. Linear regression : suitable for dependent values which can be fitted with a straight line (linear function).
2. Polynomial regression : suitable for dependent variables which can be fitted by a curve or series of curves.
3. Logistic regression : suitable for dependent variables which are binary, and therefore not normally distributed.
4. Stepwise regression : an automated technique that can deal with high dimensionality of independent
variables.
5. Ridge regression : A regression technique that helps with multicollinearity, independent variables that are
highly correlated. It adds a bias to the regression estimates, penalizing coefficients using a shrinkage
parameter.
6. Lasso regression : Like Ridge regression, shrinks coefficients to solve multicollinearity, however, it also shrinks
the absolute values, meaning some of the coefficients can become zero. This performs “feature selection”,
removing some variables from the equation.
7. ElasticNet regression : Combines Ridge and Lasso regression; it is trained with both L1 and L2 regularization,
   trading off between the two techniques.
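A short sketch contrasting Ridge and Lasso on the same synthetic data (alphas and data are illustrative): Ridge only shrinks the coefficients, while Lasso can push irrelevant ones to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 5 features, only 2 of them useful
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # all coefficients shrunk, none exactly zero
print(Lasso(alpha=0.1).fit(X, y).coef_)   # irrelevant coefficients driven to ~0
```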
Naive Bayes
• The Naive Bayes classifier is one of the simplest and most effective supervised classification
algorithms; it helps in building fast machine learning models that can make quick
predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
• The Naive Bayes algorithm is comprised of two words, Naive and Bayes:
• Naive: called naive because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
• If a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized
as an apple.
• Hence each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: based on Bayes' theorem,
  P(A|B) = P(B|A) * P(A) / P(B)
Consider the Iris data set.
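A minimal sketch of a Naive Bayes classifier on the Iris data set mentioned above, assuming scikit-learn's GaussianNB:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)      # learns P(feature | class) per class
print(nb.score(X_test, y_test))              # accuracy on held-out samples
```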
Support Vector Machine (SVM)
• Feature selection and Data cleaning should be the first and most important step of your
model designing.
• Feature selection in machine learning refers to the process of choosing the most relevant
features from our data to give to our model.
• By limiting the number of features we can often speed up training, improve accuracy, or both.
• Irrelevant or partially relevant features can negatively impact model performance.
• LDA is a dimensionality reduction technique. It is used as a pre-processing step in ML
• The goal of LDA is to project the features in higher dimensional space onto a lower-
dimensional space in order to avoid the curse of dimensionality and also reduce resources
and dimensional costs
• Logistic regression is a classification algorithm traditionally limited to only two-class
classification problems.
• If you have more than two classes, then Linear Discriminant Analysis is the preferred technique.
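A small sketch of LDA as a dimensionality-reduction / pre-processing step, assuming scikit-learn and the Iris data for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)          # project 4 features onto 2 discriminants
print(X.shape, "->", X_reduced.shape)        # (150, 4) -> (150, 2)
```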
K Means Clustering
• K-means clustering is one of the simplest
and popular unsupervised machine
learning algorithms.
• Clustering is the process of dividing the
entire data into groups (also known as
clusters) based on the patterns in the
data.
• K is the number of categories, or clusters
• S1: Choose the number of clusters K
• S2 : Select k random points from the data
as centroids
• S3: Assign all the points to the closest
cluster centroid
• S4 : Recompute the centroids of newly
formed clusters
• S5 : Repeat steps 3 and 4
• S6 : Stop the algorithm when the centroids of newly formed clusters do not change.
Example with K = 2:
[Figure: iterations S1–S5 of K-means on 2-D points; once there is no change in the centroids, the algorithm stops.]
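A minimal sketch of the steps above using scikit-learn's KMeans with K = 2; the 2-D points are invented:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0.5],    # one group of points
              [8, 8], [8.5, 9], [9, 8]])     # another group of points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assigned to each point (steps S3-S5)
print(km.cluster_centers_)   # final centroids once they stop changing (S6)
```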
Mean Shift
• It is another powerful unsupervised clustering algorithm.
• Unlike K-means clustering, it does not make any assumptions; hence it is a non-
parametric algorithm.
• Mean shift determines the number of clusters based on the data; one need not specify it, as in the case of
K-means clustering.
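For comparison, a sketch of mean shift on synthetic data; note that no number of clusters is passed in (scikit-learn's MeanShift is an assumed choice):

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Two well-separated blobs; mean shift is not told how many clusters to find.
X, _ = make_blobs(n_samples=200, centers=[[1, 1], [8, 8]], cluster_std=0.7, random_state=0)

ms = MeanShift().fit(X)                 # bandwidth estimated from the data by default
print(len(ms.cluster_centers_))         # typically 2 clusters discovered here
print(ms.cluster_centers_)
```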
Activation Function
• The activation function is one of the important building blocks of a Neural Network.
• When our brain is fed with a lot of information simultaneously, it tries hard to understand and classify the information
into “useful” and “less-useful” information.
• We need a similar mechanism for classifying incoming information as “useful” or “less-useful” in case of Neural
Networks.
• Activation functions are mathematical equations that determine the output of a neural network.
• The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not,
based on whether each neuron’s input is relevant for the model’s prediction.
• Activation functions help generate output values within an acceptable range, such as between 0 and 1 or between -1 and 1.
Need of Activation Function
• Without an activation function, every neuron would only perform a linear transformation on the inputs using the
weights and biases.
• In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each neuron
has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred
to the next layer.
• The activation function is a mathematical “gate” in between the input feeding the current neuron and its output
going to the next layer.
• It can be as simple as a step function that turns the neuron output on and off, depending on a threshold.
• Neural networks use non-linear activation functions, which can help the network learn complex data, compute and
learn almost any function representing a question, and provide accurate predictions.
1. Binary Step Function
• A binary step function is a threshold-based activation
function. If the input value is above or below a certain
threshold, the neuron is activated and sends exactly the
same signal to the next layer.
  Y = f(x) = 1 if x ≥ 0; 0 if x < 0
• The problem with a step function is that it does not allow
multi-value outputs
• The gradient of the step function is zero which causes a
hindrance in the back propagation process.
𝑌′ = 0
• Gradients are calculated to update the weights and biases
during the backpropagation process. Since the gradient of
the function is zero, the weights and biases don’t update.
2 . Linear Activation Function
• Linear Function overcomes the gradient of the function becoming zero as in
case of step function.
• A linear activation function takes the form
Y = mx
• Linear function is better than a step function because it allows multiple
outputs, not just yes and no.
• A linear activation function has two major problems
1. Not possible to use Back-Propagation (gradient descent)
The derivative of the function is a constant. So it’s not possible to go back and
understand which weights in the input neurons can provide a better prediction.
Thus not able to train the model.
  Y′ = m (constant; the graph uses m = 4)
2. All layers of the neural network collapse into one
With linear activation functions, no matter how many layers there are in the neural
network, the last layer will be a linear function of the first layer (because a linear
combination of linear functions is still a linear function).
So a linear activation function turns the neural network into just one layer.
3. Non-Linear Activation Functions
• Modern neural network models use non-linear activation functions. They allow the model to
create complex mappings between the network’s inputs and outputs, which are essential for
learning and modeling complex data, such as images, video, audio, and data sets which are
non-linear.
• Non-linear functions address the problems of a linear activation function:
• They allow backpropagation because they have a derivative function which is related to
the inputs.
• They allow “stacking” of multiple layers of neurons to create a deep neural network.
Multiple hidden layers of neurons are needed to learn complex data sets with high levels
of accuracy.
1. Sigmoid / Logistic
• It is one of the most widely used non-linear activation functions.
• Sigmoid transforms the values into the range between 0 and 1:
  σ(x) = 1 / (1 + e^(-x))
• For x above 2 or below -2, sigmoid tends to bring the Y value (the prediction) to the edge of the curve,
  very close to 1 or 0. This enables clear predictions.
• For very high or very low values of x, there is almost no change to the prediction, causing a vanishing
  gradient problem.
• This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
• Outputs are not zero-centered.
• Computationally expensive.

   x    σ(x)   gradient (approx.)
  -4    0      0
  -3    0      0
  -2    0.1    0.1
  -1    0.3    0.2
   0    0.5    0.3
   1    0.7    0.2
   2    0.9    0.1
   3    1      0.05
   4    1      0.02
2. Tanh
• tanh'(x) = 1 – tanh²(x)
• The output is zero-centered because its range is between -1 and 1.
• Hence optimization is easier with this method, so it is always preferred over the sigmoid function.
• But it still suffers from the vanishing gradient problem.
Comparison of sigmoid & tanh: tanh is zero-centered, while sigmoid is not.
3. Relu (Rectified Linear Unit)
• The main advantage of using ReLU over other activation functions is
it does not activate all the neurons at the same time.
• Neurons will only be deactivated if the output of the linear
transformation is less than 0.
• Since only a certain number of neurons are activated, the ReLU
function is far more computationally efficient when compared to the
sigmoid and tanh function thus allows the network to converge very
quickly
  f(x) = max(0, x)
  f'(x) = 1 if x ≥ 0; 0 if x < 0
• Although it looks like a linear function, ReLU has a derivative function and allows for backpropagation;
hence it is non-linear.
• The Dying ReLU problem when inputs approach zero, or are
negative, the gradient of the function becomes zero, the network
cannot perform backpropagation and cannot learn.
4. Leaky Relu
• Leaky ReLU function is nothing but an improved
version of the ReLU function
• In Relu the gradient is 0 for x<0, this would
deactivate the neurons in that region.
• Leaky ReLU is defined to address this problem.
• Instead of defining the Relu function as 0 for
negative values, here we define it as an extremely
small linear component of x.
  f(x) = x if x ≥ 0; 0.01x if x < 0
• As the gradient on the left side of the graph comes out to be a non-zero value, we no longer
encounter dead neurons in that region.
  f'(x) = 1 if x ≥ 0; 0.01 if x < 0
5. Parameterised ReLU
  f'(x) = 1 if x ≥ 0; a if x < 0     (considering a = 0.25)
• The parameterized ReLU function is used
when the leaky ReLU function still fails to
solve the problem of dead neurons and the
relevant information is not successfully
passed to the next layer.
Comparison between Relu, L-Relu & P-Relu
6. Exponential Linear Unit (ELU)
• Unlike the leaky ReLU and parametric ReLU functions, instead of a straight line, ELU uses an
exponential curve for defining the negative values.
  f(x) = x if x ≥ 0; a(e^x − 1) if x < 0
• ELU has an extra constant 'a', which should be a positive number.
• ELU is a function that tends to converge the cost function to zero faster and produce more accurate results.
• ELU is a strong alternative to ReLU, as it produces negative outputs.
• But for x > 0, it can blow up the activation, with an output range of [0, ∞).
  f'(x) = 1 if x ≥ 0; a·e^x if x < 0
7. Softmax
1. Max Function : Returns the largest numeric value from a list of numeric values.
2. Argmax Function : Returns the index of the list that contains the largest value.
3. Softmax Function : Returns the probabilistic value.
• Softmax is used as the activation function for multi-class classification problems
• The softmax activation will output one value for each node in the output layer. The output values will represent the
probabilities
• Each output will be in the interval [0,1].
• The exponential converts negative inputs into non-negative values.
• As the denominator of each softmax computation is the same, the values become proportional to each other, which
makes sure that together they sum to 1.
  softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
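A small numeric sketch of max, argmax and softmax on the same list of logits (the values are invented):

```python
import numpy as np

logits = np.array([2.0, 1.0, -1.0])
print(np.max(logits), np.argmax(logits))    # 2.0 and index 0

def softmax(x):
    e = np.exp(x - np.max(x))               # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(logits)
print(probs, probs.sum())                   # ~[0.70 0.26 0.04], sums to 1.0
```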
Contd..
The final layer values of the neural network, before the activation function is applied, are what we call the "logits layer".
8. Swish
  f(x) = x / (1 + e^(-x))   (i.e. x · sigmoid(x))
Overview of 7 Common Activation Functions
1. The sigmoid function : Outputs values between zero and one. For very high or low values of the input
parameters, the network can be very slow to reach a prediction, called the vanishing gradient problem.
2. The TanH function : Zero-centered, making it easier to model inputs that are strongly negative, strongly
positive, or neutral.
3. The ReLu function : Highly computationally efficient, but not able to process inputs that approach zero or
are negative.
4. The Leaky ReLu : Has a small positive slope in its negative area, enabling it to process zero or negative
values.
5. The Parametric ReLu : Allows the negative slope to be learned.
6. Softmax : Special activation function used for output neurons. It normalizes outputs of each class
between 0 and 1, and returns the probability that the input belongs to a specific class.
7. Swish : New activation function discovered by Google researchers. It performs better than ReLu with a
similar level of computational efficiency.
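A sketch implementing the activation functions surveyed above with NumPy; the slopes/constants (0.01 for Leaky ReLU, a = 0.25 for PReLU) follow the earlier slides, while a = 1.0 for ELU is a common default rather than a value from the slides:

```python
import numpy as np

def step(x):           return np.where(x >= 0, 1, 0)
def sigmoid(x):        return 1 / (1 + np.exp(-x))
def tanh(x):           return np.tanh(x)
def relu(x):           return np.maximum(0, x)
def leaky_relu(x):     return np.where(x >= 0, x, 0.01 * x)
def prelu(x, a=0.25):  return np.where(x >= 0, x, a * x)
def elu(x, a=1.0):     return np.where(x >= 0, x, a * (np.exp(x) - 1))
def swish(x):          return x * sigmoid(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, tanh, relu, leaky_relu, prelu, elu, swish):
    print(f"{f.__name__:>10}:", np.round(f(x), 3))
```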
Bias & Variance
• Bias is a special neuron added to each layer in the neural
network, which simply stores the value of 1.
• This makes it possible to move or “translate” the activation
function left or right on the graph.
• Although neural networks can work without bias neurons, in
reality, they are always added, and their weights are estimated
as part of the overall model.
• In our example, the bias neurons are b1 and b2 at the bottom.
They also have weights attached to them which are learned
during backpropagation.
• High bias means the model is not “fitting” well on the training
set. This means the training error will be large.
• Low bias means the model is fitting well, and training error will
be low.
• High variance means that the model is not able to make
accurate predictions on the validation set. The validation error
will be large.
• Low variance means the model generalizes successfully to data beyond its training set.
Contd..
1. A model parameter
• It is internal to the network, learned from the training data, and used to make predictions in a production
deep learning model.
• The objective of training is to learn the values of the model parameters.
2. A hyperparameter
• It’s an external parameter set by the operator of the neural network.
• For example,
1. the number of iterations of training,
2. the number of hidden layers,
3. or the activation function.
3. Different values of hyperparameters can have a major impact on the performance of the
network.