DL Unit-2
The behavior of a neural network is shaped by its architecture. A network's architecture can be defined (in part) by the following:
• Number of neurons
• Number of layers
• Types of connections between layers
The most well-known and simplest-to-understand neural network is the feedforward multilayer
neural network. It has an input layer, one or many hidden layers, and a single output layer. Each
layer can have a different number of neurons and each layer is fully connected to the adjacent
layer. The connections between the neurons in the layers form an acyclic graph, as illustrated in
Figure 2-1.
The biological neuron (see Figure 2-2) is a nerve cell that provides the fundamental functional unit for the nervous systems of all animals. Neurons exist to communicate with one another, passing electro-chemical impulses across synapses from one cell to the next, as long as the impulse is strong enough to activate the release of chemicals across a synaptic cleft. The strength of the impulse must surpass a minimum threshold or the chemicals will not be released.
Figure 2-2. The biological neuron, showing the soma, dendrites, axons, and synapses
The neuron is a nerve cell consisting of a soma (cell body) that has many dendrites but only one axon; the single axon can, however, branch hundreds of times. Dendrites are thin structures that arise from the main cell body. Axons are nerve fibers with a special cellular extension that comes from the cell body.
Synapses are the connecting junctions between axons and dendrites. The majority of synapses send signals from the axon of one neuron to the dendrite of another neuron. The exceptions are neurons that lack dendrites, neurons that lack an axon, and synapses that connect one axon to another axon.
Dendrites
Dendrites have fibers branching out from the soma in a bushy network around the nerve cell. Dendrites allow the cell to receive signals from connected neighboring neurons, and each dendrite is able to perform multiplication by that dendrite's weight value. Here, multiplication means an increase or decrease in the ratio of synaptic neurotransmitters to signal chemicals introduced into the dendrite.
Axons
Axons are the single, long fibers extending from the main soma. They stretch out over longer distances than dendrites and generally measure about 1 centimeter in length (100 times the diameter of the soma). Eventually, the axon will branch and connect to other dendrites. Neurons are able to send electrochemical pulses through cross-membrane voltage changes, generating an action potential. This signal travels along the cell's axon and activates synaptic connections with other neurons.
The Perceptron
The Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron learning rule based on the original MCP (McCulloch-Pitts) neuron. A Perceptron is an algorithm for supervised learning of binary classifiers. The algorithm enables a neuron to learn by processing elements in the training set one at a time.
Basic Components of Perceptron
Perceptron is a type of artificial neural network, which is a fundamental concept in machine learning.
The basic components of a perceptron are:
1. Input Layer: The input layer consists of one or more input neurons, which receive input signals from the
external world or from other layers of the neural network.
2. Weights: Each input neuron is associated with a weight, which represents the strength of the connection
between the input neuron and the output neuron.
3. Bias: A bias term is added to the input layer to provide the perceptron with additional flexibility in modeling
complex patterns in the input data.
4. Activation Function: The activation function determines the output of the perceptron based on the weighted
sum of the inputs and the bias term. Common activation functions used in perceptrons include the step function,
sigmoid function, and ReLU function.
5. Output: The output of the perceptron is a single binary value, either 0 or 1, which indicates the class or category
to which the input data belongs.
6. Training Algorithm: The perceptron is typically trained using a supervised learning algorithm such as the
perceptron learning algorithm or backpropagation. During training, the weights and biases of the perceptron
are adjusted to minimize the error between the predicted output and the true output for a given set of training
examples.
Overall, the perceptron is a simple yet powerful algorithm that can be used to perform binary classification tasks, and it has paved the way for the more complex neural networks used in deep learning today.
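To tie these components together, below is a minimal Python sketch of a perceptron built from the parts just listed: inputs, weights, a bias, a step activation function, and a binary output. All numeric values are made up purely for illustration.

import numpy as np

def step(z):
    # Step activation: fire (output 1) if the weighted sum is positive, else 0
    return 1 if z > 0 else 0

def perceptron_output(x, w, b):
    # Weighted sum of inputs plus bias, passed through the step function
    z = np.dot(w, x) + b   # z = sum_i wi*xi + b
    return step(z)

# Illustrative values (assumed for the example)
x = np.array([1.0, 0.5])   # input features
w = np.array([0.4, -0.2])  # weights
b = -0.1                   # bias
print(perceptron_output(x, w, b))  # -> 1, since 0.4*1.0 - 0.2*0.5 - 0.1 = 0.2 > 0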
Types of Perceptron:
1. Single layer: A single-layer perceptron can learn only linearly separable patterns.
2. Multilayer: A multilayer perceptron has two or more layers and correspondingly greater processing power.
The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision
boundary.
History of Perceptron
The perceptron was introduced by Frank Rosenblatt in 1958, as a type of artificial neural network capable
of learning and performing binary classification tasks. Rosenblatt was a psychologist and computer
scientist who was interested in developing a machine that could learn and recognize patterns in data,
inspired by the workings of the human brain.
The perceptron was based on the concept of a simple computational unit, which takes one or more inputs
and produces a single output, modeled after the structure and function of a neuron in the brain. The
perceptron was designed to be able to learn from examples and adjust its parameters to improve its
accuracy in classifying new examples.
The perceptron algorithm was initially used to solve simple problems, such as recognizing handwritten
characters, but it soon faced criticism due to its limited capacity to learn complex patterns and its inability
to handle non-linearly separable data. These limitations led to the decline of research on perceptrons in
the 1960s and 1970s.
However, in the 1980s, the development of backpropagation, a powerful algorithm for training multi-layer
neural networks, renewed interest in artificial neural networks and sparked a new era of research and
innovation in machine learning. Today, perceptrons are regarded as the simplest form of artificial neural
networks and are still widely used in applications such as image recognition, natural language processing,
and speech recognition.
How Does Perceptron Work?
As discussed earlier, the Perceptron is considered a single-layer neural network with four main parameters. The perceptron model begins by multiplying all input values by their weights, then adding these products to create the weighted sum. This weighted sum is then applied to the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f.'
This step function, or activation function, is vital in ensuring that the output is mapped to (0,1) or (-1,1). Take note that the weight of an input indicates a node's strength. Similarly, the bias value gives the ability to shift the activation function curve up or down.
Step 1: Multiply all input values by their corresponding weight values and then add the products to calculate the weighted sum. The following is the mathematical expression of it:
∑wi*xi = w1*x1 + w2*x2 + ... + wn*xn
Add a term called bias 'b' to this weighted sum to improve the model's performance.
Step 2: An activation function is applied to the above-mentioned weighted sum, giving us an output that is either binary or a continuous value, as follows:
Y = f(∑wi*xi + b)
Types of Perceptron models
We have already discussed the types of Perceptron models in the Introduction. Here, we shall give a
more profound look at this:
1. Single-Layer Perceptron model: One of the simplest ANN (Artificial Neural Network) types, it consists of a feed-forward network and includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes. A single-layer perceptron can learn only linearly separable patterns.
2. Multi-Layered Perceptron model: It is similar to a single-layer perceptron model but has additional hidden layers, and it operates in two stages:
Forward Stage: Activations begin at the input layer and terminate at the output layer.
Backward Stage: In the backward stage, weight and bias values are modified per the model's requirements; the error between the actual output and the desired output is propagated backward, beginning at the output layer. A multilayer perceptron model has greater processing power and can learn both linear and non-linear patterns. It can also implement logic gates such as AND, OR, XOR, XNOR, and NOR.
Advantages: A multi-layered perceptron helps us obtain the same accuracy ratio with big as well as small data.
Characteristics of the Perceptron Model: The following are the characteristics of a Perceptron model:
1. Initially, weights are multiplied with input features, and then a decision is made about whether the neuron is fired or not.
2. The activation function applies a step rule to check whether the weighted sum is greater than zero.
3. A linear decision boundary is drawn, enabling the distinction between the two linearly separable classes, +1 and -1.
4. If the added sum of all input values is more than the threshold value, there is an output signal; otherwise, no output is shown.
Limitations of the Perceptron Model: The following are the limitations of a Perceptron model:
1. The output of a perceptron can only be a binary number (0 or 1) due to the hard-edge transfer function.
2. It can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it cannot classify them correctly.
Perceptron Learning Rule states that the algorithm would automatically learn the optimal weight
coefficients. The input features are then multiplied with these weights to determine if a neuron fires or
not.
The Perceptron receives multiple input signals, and if the sum of the input signals exceeds a certain
threshold, it either outputs a signal or does not return an output. In the context of supervised learning
and classification, this can then be used to predict the class of a sample.
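As a hedged illustration of this rule, the sketch below applies one common form of the perceptron update, w ← w + η(y − ŷ)x, processing one training example at a time; the tiny AND-gate dataset, epoch count, and learning rate are assumptions chosen for the example.

import numpy as np

def step(z):
    return 1 if z > 0 else 0

# Assumed toy dataset: the AND gate (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights start at zero
b = 0.0           # bias starts at zero
lr = 0.1          # learning rate (assumed)

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = step(np.dot(w, xi) + b)
        error = target - pred      # 0 when the prediction is already correct
        w += lr * error * xi       # perceptron weight update
        b += lr * error            # bias update

print(w, b)  # the learned weights draw a linear decision boundary for AND

Because the AND data is linearly separable, this loop converges within a few epochs; on non-linearly separable data (e.g., XOR) it never settles, which is exactly the limitation noted above.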
Perceptron Function
The perceptron is a function that maps its input "x," which is multiplied by the learned weight coefficients, to an output value "f(x)":
f(x) = 1 if w·x + b > 0; 0 otherwise
where "w" is the vector of learned weights, "x" is the vector of inputs, and "b" is the bias (an element that adjusts the boundary away from the origin without any dependence on the input value).
Inputs of a Perceptron
A Perceptron accepts inputs, moderates them with certain weight values, then applies the transformation function to output the final result. Consider, for example, a Perceptron with a Boolean output.
A Boolean output is based on inputs such as salaried, married, age, past credit profile, etc. It has only two values: Yes and No, or True and False. The summation function "∑" multiplies all inputs "x" by their weights "w" and then adds them up as follows:
∑ = w1*x1 + w2*x2 + ... + wn*xn
The multilayer feed-forward network is a neural network with an input layer, one or more hidden layers, and an output layer. Each layer has one or more artificial neurons. These artificial neurons are similar to their perceptron precursor yet have a different activation function depending on the layer's specific purpose in the network. The artificial neuron of the multilayer perceptron is similar to its predecessor, the perceptron, but it adds flexibility in the type of activation function we can use.
Figure 2-5. Artificial neuron for a multilayer perceptron
The net input to the activation function is still the dot product of the weights and input features, yet the flexible activation function allows us to create different types of output values.
We express the net input to a neuron as the weights on connections multiplied by the activations incoming on those connections, as shown in Figure 2-6. For the input layer, we're just taking the feature at that specific index, and the activation function is linear (it passes on the feature value). For hidden layers, the input is the activation from other neurons. Mathematically, we can express the net input (total weighted input) of the artificial neuron as
input_sum_i = W_i · A_i
where W_i is the vector of all weights leading into neuron i and A_i is the vector of activation values for the inputs to neuron i. Let's build on this equation by accounting for the bias term that is added per layer (explained further below):
input_sum_i = W_i · A_i + b
To produce output from the neuron, we then wrap this net input in an activation function g, as demonstrated in the following equation:
a_i = g(input_sum_i)
We can then expand this function with the definition of input_sum_i:
a_i = g(W_i · A_i + b)
This activation value for neuron i is the output value passed on to the next layer through connections to other artificial neurons (multiplied by weights on connections) as an input value.
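As a hedged sketch of these equations, here is how the net input and activation of a single neuron i could be computed in Python; the choice of sigmoid as g and all numeric values are assumptions for illustration.

import numpy as np

def g(z):
    # Assumed activation for the example: the sigmoid, 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Activations A_i arriving at neuron i from the previous layer (made-up values)
A_i = np.array([0.5, 0.9, 0.1])
# Weights W_i on the connections leading into neuron i (made-up values)
W_i = np.array([0.2, -0.4, 0.7])
b = 0.1   # the bias term added per layer

input_sum_i = np.dot(W_i, A_i) + b   # input_sum_i = W_i · A_i + b
a_i = g(input_sum_i)                 # a_i = g(W_i · A_i + b)
print(a_i)   # this activation is passed on as input to the next layer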
As Figure 2-7 depicts, the neurons in each layer (represented by the circles) are all fully connected
to all neurons in all adjacent layers.
The neurons in each layer all use the same type of activation function (most of the time). For the input layer, the input is the raw vector input. The input to neurons of the other layers is the output (activation) of the previous layer's neurons. As data moves through the network in a feed-forward fashion, it is influenced by the connection weights and the activation function type. Let's now take a look at the specifics of each layer type.
Figure 2-7. Fully connected multilayer feed-forward neural network topology
Input layer. This layer is how we get input data (vectors) fed into our network. The number of neurons in an input layer is typically the same as the number of input features to the network. Input layers are followed by one or more hidden layers (explained in the next section). Input layers in classical feed-forward neural networks are fully connected to the next hidden layer, yet in other network architectures, the input layer might not be fully connected.
Hidden layer. There are one or more hidden layers in a feed-forward neural network. The weight
values on the connections between the layers are how neural networks encode the learned
information extracted from the raw training data. Hidden layers are the key to allowing neural
networks to model nonlinear functions, as we saw from the limitations of the single-layer
perceptron networks.
Output layer. We get the answer or prediction from our model from the output layer. Given that we are mapping an input space to an output space with the neural network model, the output layer gives us an output based on the input from the input layer. Depending on the setup of the neural network, the final output may be a real-valued output (regression) or a set of probabilities (classification). This is controlled by the type of activation function we use on the neurons in the output layer. The output layer typically uses either a softmax or sigmoid activation function for classification.
Connections between layers. In a fully connected feed-forward network, the connections between layers are the outgoing connections from all neurons in the previous layer to all of the neurons in the next layer. We change these weights progressively as our algorithm finds the best solution it can with the backpropagation learning algorithm. We can understand the weights mathematically by thinking of them as the parameter vector in the earlier linear algebra section describing the machine learning process as optimizing the parameter vector (e.g., "weights" here) to minimize error.
The process of learning for any learning algorithm using weights is the process of readjusting the
weights and biases, making some smaller and others larger, thereby allocating significance to
certain bits of information and minimizing other bits. This helps our model learn which predictors
(or features) are tied to which outcomes, and adjusts the weights and biases accordingly.
In most datasets, certain features are strongly correlated with certain labels (e.g., square footage relates to the sale price of a house). Neural networks learn these relationships blindly by making a guess based on the inputs and weights and then measuring how accurate the results are. The loss functions in optimization algorithms, such as stochastic gradient descent (SGD), reward the network for good guesses and penalize it for bad ones. SGD moves the parameters of the network toward making good predictions and away from bad ones.
Another way to look at the learning process is to view labels as theories and the feature set as evidence. Then, we can make the analogy that the network seeks to establish the correlation between the theory and the evidence. The model attempts to answer the question "which theory does the evidence support?" With these ideas in mind, let's take a look at the learning algorithm most commonly associated with neural networks: backpropagation learning.
Backpropagation Learning
Backpropagation is an important part of reducing error in a neural network model.
Backpropagation learning is similar to the perceptron learning algorithm. We want to compute the input example's output with a forward pass through the network. If the output matches the label, we don't do anything. If the output does not match the label, we need to adjust the weights on the connections in the neural network.
To further illustrate general neural network learning, let's take a look at pseudocode for the algorithm, as sketched below.
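Since the referenced pseudocode is not reproduced here, the following is a hedged, runnable Python sketch of the same general scheme for one hidden layer: a forward pass to compute the output, a comparison against the label, and a backward pass that distributes the blame for the error across the contributing weights. The network size, XOR data, random seed, epoch count, and learning rate are all assumptions for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(a):
    # derivative of the sigmoid, written in terms of its output a = sigmoid(z)
    return a * (1.0 - a)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden -> output
lr = 0.5                                        # learning rate (assumed)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets (assumed data)

for epoch in range(5000):
    for x, t in zip(X, y):
        # forward pass: compute the example's output layer by layer
        h = sigmoid(W1 @ x + b1)
        o = sigmoid(W2 @ h + b2)
        # backward pass: if output == label the deltas are zero (do nothing);
        # otherwise, distribute blame for the error over the contributing weights
        delta_o = (o - t) * d_sigmoid(o)
        delta_h = (W2.T @ delta_o) * d_sigmoid(h)
        W2 -= lr * np.outer(delta_o, h);  b2 -= lr * delta_o
        W1 -= lr * np.outer(delta_h, x);  b1 -= lr * delta_h

# after training, the outputs should approach the XOR labels
for x in X:
    print(x, sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))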
The key is to distribute the blame for the error and divide it between the contributing weights. With the perceptron learning algorithm, it's easy because there is only one weight per input to influence the output value. With feed-forward multilayer networks, learning algorithms have a bigger challenge: there are many weights connecting each input to the output, so it becomes more difficult. Each weight contributes to more than one output, so our learning algorithm must be more clever.
Activation Functions
We use activation functions to propagate the output of one layer's nodes forward to the next layer (up to and including the output layer). An activation function is a scalar-to-scalar function, yielding the neuron's activation. We use activation functions for hidden neurons in a neural network to introduce nonlinearity into the network's modeling capabilities. Many activation functions belong to a logistic class of transforms that (when graphed) resemble an S. This class of function is called sigmoidal. The sigmoid family of functions contains several variations, one of which is known as the Sigmoid function. Let's now take a look at some useful activation functions in neural networks.
Linear
A linear transform (see Figure 2-11) is basically the identity function, f(x) = Wx, where the dependent variable has a direct, proportional relationship with the independent variable. In practical terms, it means the function passes the signal through unchanged.
Figure 2-11. Linear activation function
We see this activation function used in the input layer of neural networks.
Sigmoid
Like all logistic transforms, sigmoids can reduce extreme values or outliers in data without
removing them. The vertical line in Figure 2-12 is the decision boundary.
A sigmoid function is a machine that converts independent variables of near infinite range into
simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1.
Tanh
Pronounced “tanch,” tanh is a hyperbolic trigonometric function (see Figure 2-13). Just as the
tangent represents a ratio between the opposite and adjacent sides of a right triangle, tanh
represents the ratio of the hyperbolic sine to the hyperbolic cosine: tanh(x) = sinh(x) / cosh(x).
Unlike the Sigmoid function, the normalized range of tanh is –1 to 1. The advantage of tanh is that
it can deal more easily with negative numbers.
Hard Tanh
Similar to tanh, hard tanh simply applies hard caps to the normalized range: anything greater than 1 is made into 1, and anything less than -1 is made into -1. This gives a more robust activation function with a limited decision boundary.
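As a small illustration of the three sigmoidal-family transforms just described (sigmoid, tanh, and hard tanh), here is a sketch in Python; NumPy and the sample input values are assumptions for the example.

import numpy as np

def sigmoid(x):
    # squashes any real input into a probability-like value in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # ratio of hyperbolic sine to hyperbolic cosine; normalized range (-1, 1)
    return np.sinh(x) / np.cosh(x)   # equivalently, np.tanh(x)

def hard_tanh(x):
    # tanh-like, but hard-capped: anything above 1 -> 1, below -1 -> -1
    return np.clip(x, -1.0, 1.0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(x))
print(tanh(x))
print(hard_tanh(x))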
Softmax
Softmax is a generalization of logistic regression inasmuch as it can be applied to continuous data (rather than classifying binary outcomes) and can contain multiple decision boundaries. It handles multinomial labeling systems. Softmax is the function you will often find at the output layer of a classifier.
To further illustrate the idea of the softmax output layer and how to use it, let's consider two use cases. If we have a multiclass modeling problem yet we care only about the best score across these classes, we'd use a softmax output layer with an argmax() function to get the highest score of all the classes.
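A hedged sketch of this first use case: a softmax over raw class scores, followed by argmax() to pick the single best class. The scores are made-up values for illustration.

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw class scores (assumed)
probs = softmax(scores)              # multinomial probabilities
best = np.argmax(probs)              # index of the highest-scoring class
print(probs, best)                   # best -> 0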
For the case in which we have a large set of labels (e.g., thousands of labels), we'd use a variant of the softmax activation function called the hierarchical softmax activation function. This variant decomposes the labels into a tree structure, and the softmax classifier is trained at each node of the tree to direct the branching for classification.
Rectified Linear
Rectified linear is a more interesting transform that activates a node only if the input is above a certain quantity. While the input is below zero, the output is zero, but when the input rises above the threshold, it has a linear relationship with the dependent variable, f(x) = max(0, x), as demonstrated in Figure 2-14.
Rectified linear units (ReLUs) are the current state of the art because they have proven to work in many different situations. Because the gradient of a ReLU is either zero or a constant, it is possible to rein in the vanishing/exploding gradient issue. ReLU activation functions have been shown to train better in practice than sigmoid activation functions.
Leaky ReLU
Leaky ReLUs are a strategy to mitigate the "dying ReLU" issue. As opposed to the function being zero when x < 0, a leaky ReLU instead has a small negative slope (e.g., around 0.01). Some success has been seen in practice with this ReLU variation, but results are not always consistent. The equation is given here:
f(x) = x if x > 0; 0.01x otherwise
Softplus
This activation function is considered to be the "smooth version of the ReLU," as is illustrated in Figure 2-15. Compare this plot to the ReLU in Figure 2-14.
Figure 2-15 shows that the softplus activation function, f(x) = ln(1 + exp(x)), has a similar shape to the ReLU. We also notice the differentiability and nonzero derivative of the softplus everywhere on the graph, in contrast to the ReLU.
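To compare the three rectifier-style functions above, here is a short sketch; the 0.01 slope for the leaky variant comes from the text, while the sample inputs are assumed.

import numpy as np

def relu(x):
    # f(x) = max(0, x): zero below the threshold, linear above it
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # a small negative slope alpha instead of a hard zero for x < 0
    return np.where(x > 0, x, alpha * x)

def softplus(x):
    # smooth version of the ReLU: f(x) = ln(1 + exp(x)), differentiable everywhere
    return np.log1p(np.exp(x))

x = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(relu(x))
print(leaky_relu(x))
print(softplus(x))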
Loss Functions:
The loss function is very important in machine learning and deep learning. Let's say you are working on a problem: you have trained a machine learning model on a dataset and are ready to put it in front of your client. But how can you be sure that this model will give the optimum result? Is there a metric or a technique that will help you quickly evaluate your model on the dataset? Yes; this is where loss functions come into play in machine learning and deep learning.
In mathematical optimization and decision theory, a loss or cost function (sometimes also
called an error function) is a function that maps an event or values of one or more variables
onto a real number intuitively representing some “cost” associated with the event.
In simple terms, the Loss function is a method of evaluating how well your algorithm is
modeling your dataset. It is a mathematical function of the parameters of the machine learning
algorithm.
In simple linear regression, the prediction is calculated using the slope (m) and intercept (b). The loss function for this is (Yi - Yihat)^2, i.e., the loss function is a function of the slope and intercept.
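As a quick sketch of this idea, here is the squared-error loss written as a function of the slope m and intercept b; the toy data points are made up for the example.

import numpy as np

# Toy data (assumed): y is roughly 2x + 1
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])

def loss(m, b):
    # sum of squared residuals (Yi - Yihat)^2 for Yihat = m*x + b
    y_hat = m * x + b
    return np.sum((y - y_hat) ** 2)

print(loss(2.0, 1.0))   # near the true parameters -> small loss (0.06)
print(loss(0.0, 0.0))   # poor parameters -> large loss (85.46)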
Cost Function vs Loss Function in Deep Learning
Most people confuse the loss function and the cost function. Let's understand what each is. The two terms are often used interchangeably, but they differ as follows:
1. Loss function: Measures the error between predicted and actual values in a machine learning model. Cost function: Quantifies the overall cost or error of the model on the entire training set.
2. Loss function: Used to optimize the model during training. Cost function: Used to guide the optimization process by minimizing the cost or error.
3. Loss function: Can be specific to individual samples. Cost function: Aggregates the loss values over the entire training set.
4. Loss function: Examples include mean squared error (MSE), mean absolute error (MAE), and binary cross-entropy. Cost function: Often the average or sum of individual loss values in the training set.
5. Loss function: Used to evaluate model performance. Cost function: Used to determine the direction and magnitude of parameter updates during optimization.
6. Loss function: Different loss functions can be used for different tasks or problem domains. Cost function: Typically derived from the loss function, but can include additional regularization terms or other considerations.
Loss functions can be grouped by task:
1. Regression
o Mean Squared Error (MSE)
o Mean Absolute Error (MAE)
o Huber loss
2. Classification
o Binary cross-entropy
o Categorical cross-entropy
A. Regression Loss
1. Mean Squared Error (MSE)
The Mean Squared Error (MSE) is the simplest and most common loss function. To calculate the MSE, you take the difference between the actual value and the model prediction, square it, and average it across the whole dataset.
Advantage
1. Easy to interpret.
Disadvantage
1. The error is in squared units, and squared units are not as readily understood as the original units.
2. Mean Absolute Error (MAE)
The Mean Absolute Error (MAE) is another simple loss function. To calculate the MAE, you take the absolute difference between the actual value and the model prediction and average it across the whole dataset.
Advantage
1. Robust to outliers.
Disadvantage
1. The graph is not differentiable at zero, so we cannot use gradient descent directly; we must instead use subgradient calculations.
3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers than the squared error loss. It is quadratic for small errors and linear for large ones:
L(y, ŷ) = 0.5*(y - ŷ)^2 if |y - ŷ| ≤ δ; δ*|y - ŷ| - 0.5*δ^2 otherwise
where:
y – the actual value of the data point, also known as the true value.
ŷ – the predicted value of the data point; this value is returned by the model.
δ – defines the point where the Huber loss function transitions from quadratic to linear.
Advantage
Robust to outliers.
Disadvantage
Its main disadvantage is the associated complexity: in order to maximize model accuracy, the hyperparameter δ also needs to be optimized, which increases the training requirements.
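A hedged sketch of the Huber loss with its quadratic-to-linear transition at δ; the data, including the deliberate outlier, is made up for the example.

import numpy as np

def huber(y, y_hat, delta=1.0):
    # quadratic for |error| <= delta, linear beyond it (robust to outliers)
    err = y - y_hat
    small = np.abs(err) <= delta
    quadratic = 0.5 * err ** 2
    linear = delta * np.abs(err) - 0.5 * delta ** 2
    return np.mean(np.where(small, quadratic, linear))

y     = np.array([1.0, 2.0, 3.0, 50.0])   # last point is an outlier (assumed data)
y_hat = np.array([1.1, 1.9, 3.2, 5.0])
print(huber(y, y_hat, delta=1.0))  # the outlier is penalized linearly, not quadratically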
B. Classification Loss
1. Binary Cross-Entropy
It is used in binary classification problems with two classes, for example, whether a person has COVID or not. Binary cross-entropy compares each of the predicted probabilities to the actual class output, which can be either 0 or 1. It then calculates a score that penalizes the probabilities based on their distance from the expected value, that is, how close or far they are from the actual value:
L = -(1/N) * ∑ [yi*log(ŷi) + (1 - yi)*log(1 - ŷi)]
where
yi – actual values
ŷi – predicted probabilities
Disadvantage –
Not intuitive.
2. Categorical Cross-Entropy
Categorical cross-entropy is used for multiclass classification and softmax regression:
loss = -∑ (j = 1 to k) yj*log(ŷj)
where
k is the number of classes,
yj = actual value,
ŷj = predicted probability for class j.
Note – In multi-class classification, use the softmax activation function at the last neuron.
If the target column is one-hot encoded into classes like 0 0 1, 0 1 0, 1 0 0, then use categorical cross-entropy; if the target column has numerical encoding of classes like 1, 2, 3, 4, ..., n, then use sparse categorical cross-entropy.
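A minimal sketch of both cross-entropy forms; the labels and predicted probabilities are made-up values, and a small epsilon (an implementation detail, not from the text) guards against log(0).

import numpy as np

EPS = 1e-12  # guards against log(0)

def binary_cross_entropy(y, p):
    # -mean[ y*log(p) + (1-y)*log(1-p) ] over the samples
    p = np.clip(p, EPS, 1 - EPS)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p):
    # -sum_j yj*log(pj) for a one-hot target and softmax output
    p = np.clip(p, EPS, 1 - EPS)
    return -np.sum(y_onehot * np.log(p))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.array([0, 0, 1]), np.array([0.1, 0.2, 0.7])))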
Hyperparameters
Hyperparameter selection focuses on ensuring that the model neither underfits nor overfits the training dataset, while learning the structure of the data as quickly as possible.
Learning Rate
The learning rate affects the amount by which you adjust parameters during optimization in order to minimize the error of the neural network's guesses. It is a coefficient that scales the size of the steps (updates) a neural network takes to its parameter vector x as it crosses the loss function space.
During backpropagation, we multiply the error gradient by the learning rate and then update a connection weight's last iteration with the product to reach a new weight. The learning rate determines how much of the gradient we want to use for the algorithm's next step. A large error and steep gradient combine with the learning rate to produce a large step. As we approach minimal error and the gradient flattens out, the step size tends to shorten.
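The update the text describes, as a short sketch: multiply the error gradient by the learning rate, then subtract the product from the previous weights. All numeric values are assumed for illustration.

import numpy as np

w = np.array([0.5, -0.3])          # weights from the last iteration (assumed)
gradient = np.array([0.8, -0.1])   # error gradient at w (assumed)
learning_rate = 0.01               # step-size coefficient

w_new = w - learning_rate * gradient   # a steeper gradient -> a larger step
print(w_new)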
A large learning rate coefficient (e.g., 1) will make your parameters take leaps, and a small one (e.g., 0.00001) will make them inch along slowly. Large leaps will save time initially, but they can be disastrous if they lead us to overshoot our minimum. A learning rate that is too large oversteps the nadir, making the algorithm bounce back and forth on either side of the minimum without ever coming to rest.
In contrast, small learning rates should lead you eventually to an error minimum (it might be a
local minimum rather than a global one), but they can take a very long time and add to the burden
of an already computationally intensive process.
Regularization
Regularization helps with the effects of out-of-control parameters by using different methods to
minimize parameter size over time.
Regularization coefficients L1 and L2 help fight overfitting by making certain weights smaller. Smaller-valued weights lead to simpler hypotheses, and simpler hypotheses are the most generalizable. Unregularized weights with several higher-order polynomials in the feature set tend to overfit the training set.
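A hedged sketch of how an L2 penalty shrinks weights: the regularized loss adds lam * sum(w^2) to the original data loss, so the optimizer is pushed toward smaller weights. The coefficient lam and the values are assumptions for the example.

import numpy as np

def l2_regularized_loss(data_loss, w, lam=0.01):
    # original loss plus an L2 penalty that punishes large weights
    return data_loss + lam * np.sum(w ** 2)

w = np.array([3.0, -0.2, 1.5])
print(l2_regularized_loss(0.40, w, lam=0.01))  # penalty grows with weight magnitude

# For L1, the penalty would instead be lam * np.sum(np.abs(w)).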
As the input training set size grows, the effect of regularization decreases and the parameters tend
to increase in magnitude. This is appropriate, because an excess of features relative to training set
examples leads to overfitting in the first place. Bigger data is the ultimate regularizer.
Momentum
Momentum helps the learning algorithm get out of spots in the search space where it would otherwise
become stuck. In the errorscape, it helps the updater find the gulleys that lead toward the minima.
Momentum is to the learning rate what the learning rate is to weights, and it helps us produce better quality
models.
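A minimal sketch of one common momentum formulation (the coefficients and gradients are assumed values): a velocity term accumulates past gradients, carrying the updater through flat spots where plain gradient steps would stall.

import numpy as np

w = np.array([0.5, -0.3])     # current weights (assumed)
v = np.zeros_like(w)          # velocity, initially zero
lr, mu = 0.01, 0.9            # learning rate and momentum coefficient (assumed)

for gradient in [np.array([0.8, -0.1]), np.array([0.7, -0.2])]:  # made-up gradients
    v = mu * v - lr * gradient   # accumulate a running direction
    w = w + v                    # step using the velocity, not the raw gradient
print(w, v)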
Sparsity
The sparsity hyperparameter recognizes that for some inputs only a few features are relevant. For example, let's assume that a network can classify a million images. Any one of those images will be indicated by a limited number of features. But to effectively classify millions of images a network must be able to recognize considerably more features, many of which don't appear most of the time. An example of this would be how photos of sea urchins don't contain noses and hooves; in those images, the nose and hoof features will simply be 0.
The features that indicate sea urchins will be few and far between, in the vastness of the neural network's layers. That's a problem, because sparse features can limit the number of nodes that activate and impede a network's ability to learn. In response to sparsity, biases force neurons to activate, and the activations stay around a mean that keeps the network from becoming stuck.