
UNIT-II

Foundations of Neural Networks

The Biological Neuron, Perceptron, Multilayer Feed-forward Network, Training the Neural Network, Activation Functions: Linear, Sigmoid, Tanh, Hard Tanh, Softmax, Rectified Linear; Loss Functions: Notation, Regression, Classification, Reconstruction; Hyperparameters: Learning Rate, Regularization, Momentum, Sparsity.

Foundations of Neural Networks:


Neural networks are a computational model that shares some properties with the animal brain, in which many simple units work in parallel with no centralized control unit. The weights between the units are the primary means of long-term information storage in neural networks, and updating the weights is the primary way the neural network learns new information.

The behavior of a neural network is shaped by its architecture. A network's architecture can be defined (in part) by the following:

• Number of neurons
• Number of layers
• Types of connections between layers

The most well-known and simplest-to-understand neural network is the feedforward multilayer
neural network. It has an input layer, one or many hidden layers, and a single output layer. Each
layer can have a different number of neurons and each layer is fully connected to the adjacent
layer. The connections between the neurons in the layers form an acyclic graph, as illustrated in
Figure 2-1.

Figure 2-1. Multilayer neural network topology


A feed-forward multilayer neural network can represent any function, given enough artificial neuron units. It is generally trained by a learning algorithm called backpropagation learning. Backpropagation uses gradient descent (see Chapter 1) on the weights of the connections in a neural network to minimize the error on the output of the network.

The Biological Neuron

The biological neuron (see Figure 2-2) is a nerve cell that provides the fundamental functional unit for the nervous systems of all animals. Neurons exist to communicate with one another, passing electrochemical impulses across synapses from one cell to the next, as long as the impulse is strong enough to activate the release of chemicals across a synaptic cleft. The strength of the impulse must surpass a minimum threshold, or the chemicals will not be released.

Figure 2-2 presents the major parts of the nerve cell:

 Soma
 Dendrites
 Axons
 Synapses

The neuron is made up of a nerve cell consisting of a soma (cell body) that has many dendrites but only one axon. The single axon can branch hundreds of times, however. Dendrites are thin structures that arise from the main cell body. Axons are nerve fibers with a special cellular extension that comes from the cell body.

Figure 2-2. The biological neuron


Synapses

Synapses are the connecting junctions between axons and dendrites. The majority of synapses send signals from the axon of one neuron to the dendrite of another neuron. The exceptions are cases in which a neuron lacks dendrites, a neuron lacks an axon, or a synapse connects an axon directly to another axon.

Dendrites

Dendrites have fibers branching out from the soma in a bushy network around the nerve cell. Dendrites allow the cell to receive signals from connected neighboring neurons, and each dendrite is able to perform multiplication by that dendrite's weight value. Here, multiplication means an increase or decrease in the ratio of synaptic neurotransmitters to signal chemicals introduced into the dendrite.

Axons

Axons are the single, long fibers extending from the main soma. They stretch out over longer distances than dendrites and generally measure about 1 centimeter in length (100 times the diameter of the soma). Eventually, the axon branches and connects to other dendrites. Neurons are able to send electrochemical pulses through cross-membrane voltage changes that generate an action potential. This signal travels along the cell's axon and activates synaptic connections with other neurons.

The Perceptron

The perceptron was introduced by Frank Rosenblatt in 1957. He proposed a perceptron learning rule based on the original MCP (McCulloch-Pitts) neuron. A perceptron is an algorithm for supervised learning of binary classifiers. This algorithm enables neurons to learn and process elements in the training set one at a time.
Basic Components of Perceptron

Perceptron is a type of artificial neural network, which is a fundamental concept in machine learning.
The basic components of a perceptron are:

1. Input Layer: The input layer consists of one or more input neurons, which receive input signals from the
external world or from other layers of the neural network.

2. Weights: Each input neuron is associated with a weight, which represents the strength of the connection
between the input neuron and the output neuron.

3. Bias: A bias term is added to the input layer to provide the perceptron with additional flexibility in modeling
complex patterns in the input data.

4. Activation Function: The activation function determines the output of the perceptron based on the weighted
sum of the inputs and the bias term. Common activation functions used in perceptrons include the step function,
sigmoid function, and ReLU function.

5. Output: The output of the perceptron is a single binary value, either 0 or 1, which indicates the class or category
to which the input data belongs.

6. Training Algorithm: The perceptron is typically trained using a supervised learning algorithm such as the
perceptron learning algorithm or backpropagation. During training, the weights and biases of the perceptron
are adjusted to minimize the error between the predicted output and the true output for a given set of training
examples.

Overall, the perceptron is a simple yet powerful algorithm that can be used to perform binary classification tasks and has paved the way for the more complex neural networks used in deep learning today.
Types of Perceptron:

1. Single layer: Single layer perceptron can learn only linearly separable patterns.

2. Multilayer: Multilayer perceptrons, which have two or more layers, have greater processing power and can learn more complex patterns.

The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision
boundary.

History of Perceptron

The perceptron was introduced by Frank Rosenblatt in 1958, as a type of artificial neural network capable
of learning and performing binary classification tasks. Rosenblatt was a psychologist and computer
scientist who was interested in developing a machine that could learn and recognize patterns in data,
inspired by the workings of the human brain.

The perceptron was based on the concept of a simple computational unit, which takes one or more inputs
and produces a single output, modeled after the structure and function of a neuron in the brain. The
perceptron was designed to be able to learn from examples and adjust its parameters to improve its
accuracy in classifying new examples.

The perceptron algorithm was initially used to solve simple problems, such as recognizing handwritten
characters, but it soon faced criticism due to its limited capacity to learn complex patterns and its inability
to handle non-linearly separable data. These limitations led to the decline of research on perceptrons in
the 1960s and 1970s.

However, in the 1980s, the development of backpropagation, a powerful algorithm for training multi-layer
neural networks, renewed interest in artificial neural networks and sparked a new era of research and
innovation in machine learning. Today, perceptrons are regarded as the simplest form of artificial neural
networks and are still widely used in applications such as image recognition, natural language processing,
and speech recognition.
How Does Perceptron Work?

As discussed earlier, the perceptron is considered a single-layer neural network with four main parameters. The perceptron model begins by multiplying all input values by their weights and adding these products to create the weighted sum. This weighted sum is then applied to the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.

This step function or activation function is vital in ensuring that the output is mapped between (0,1) or (-1,1). Note that the weight of an input indicates the strength of a node. Similarly, the bias value gives the ability to shift the activation function curve up or down.

Step 1: Multiply all input values by their corresponding weights and then add them to calculate the weighted sum. The following is its mathematical expression:

∑ wi*xi = x1*w1 + x2*w2 + x3*w3 + ... + xn*wn

Add a term called bias ‘b’ to this weighted sum to improve the model’s performance.

Step 2: An activation function is applied to the above-mentioned weighted sum, giving us an output either in binary form or as a continuous value, as follows:

Y=f(∑wi*xi + b)
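
As a hedged illustration of Steps 1 and 2, here is a minimal NumPy sketch (the function and variable names are our own, not from the original text):

    import numpy as np

    def step(z):
        # Step activation: outputs 1 when z >= 0, else 0
        return np.where(z >= 0, 1, 0)

    def perceptron_output(x, w, b):
        # Step 1: weighted sum of inputs plus bias
        z = np.dot(w, x) + b
        # Step 2: apply the activation function
        return step(z)

    x = np.array([1.0, 0.5, -0.2])   # example inputs
    w = np.array([0.4, -0.6, 0.9])   # example weights
    b = 0.1                          # bias term
    print(perceptron_output(x, w, b))  # prints 1 or 0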
Types of Perceptron models

We have already discussed the types of Perceptron models in the Introduction. Here, we shall give a
more profound look at this:

1. Single Layer Perceptron model: One of the simplest ANN (Artificial Neural Network) types, it consists of a feed-forward network and includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes. A single-layer perceptron can learn only linearly separable patterns.

2. Multi-Layered Perceptron model: It is similar to a single-layer perceptron model but has one or more hidden layers and is trained in two stages:

Forward Stage: Activation begins at the input layer and propagates forward through the network, terminating at the output layer.

Backward Stage: In the backward stage, weight and bias values are modified according to the model's requirements. The error between the actual output and the desired output is propagated backward, starting at the output layer. A multilayer perceptron model has greater processing power and can process both linear and non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, XNOR, and NOR.

Advantages:

 A multi-layered perceptron model can solve complex non-linear problems.

 It works well with both small and large input data.

 Helps us obtain quick predictions after training.

 Helps us obtain the same accuracy with both big and small data.

Disadvantages:

 In a multi-layered perceptron model, computations are time-consuming and complex.


 It is tough to predict how much each independent variable affects the dependent variable.

 The functioning of the model depends on the quality of the training.

Characteristics of the Perceptron Model: The following are the characteristics of a Perceptron Model:

1. It is a machine learning algorithm that uses supervised learning of binary classifiers.

2. In Perceptron, the weight coefficient is automatically learned.

3. Initially, weights are multiplied with input features, and then the decision is made whether the neuron is fired or
not.

4. The activation function applies a step rule to check whether the weighted sum is greater than zero.

5. The linear decision boundary is drawn, enabling the distinction between the two linearly separable
classes +1 and -1.

6. If the weighted sum of all input values is more than the threshold value, the neuron produces an output signal; otherwise, no output is shown.

Limitations of the Perceptron Model: The following are the limitations of a perceptron model:

1. The output of a perceptron can only be a binary number (0 or 1) due to the hard-edge transfer function.

2. It can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it is not easy to classify them correctly.

Perceptron Learning Rule

Perceptron Learning Rule states that the algorithm would automatically learn the optimal weight
coefficients. The input features are then multiplied with these weights to determine if a neuron fires or
not.
The Perceptron receives multiple input signals, and if the sum of the input signals exceeds a certain
threshold, it either outputs a signal or does not return an output. In the context of supervised learning
and classification, this can then be used to predict the class of a sample.
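
A hedged NumPy sketch of the perceptron learning rule described above, assuming a 0/1 step output and a fixed learning rate (the names and values are illustrative, not from the original text):

    import numpy as np

    def train_perceptron(X, y, lr=0.1, epochs=10):
        # X: (n_samples, n_features), y: labels in {0, 1}
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for xi, target in zip(X, y):
                pred = 1 if np.dot(w, xi) + b >= 0 else 0
                # Perceptron rule: adjust weights only on misclassification
                update = lr * (target - pred)
                w += update * xi
                b += update
        return w, b

    # Learn the logical AND function (linearly separable)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])
    w, b = train_perceptron(X, y)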

Perceptron Function

The perceptron is a function that maps its input 'x', multiplied by the learned weight coefficient, to an output value 'f(x)':

f(x) = 1 if w · x + b > 0
f(x) = 0 otherwise

In the equation given above:

 “w” = vector of real-valued weights

 “b” = bias (an element that adjusts the boundary away from origin without any dependence on the input value)

 “x” = vector of input x values

 “m” = number of inputs to the Perceptron


The output can be represented as “1” or “0.” It can also be represented as “1” or “-1” depending on which
activation function is used.

Let us learn the inputs of a perceptron in the next section.

Inputs of a Perceptron

A Perceptron accepts inputs, moderates them with certain weight values, then applies the transformation
function to output the final result. The image below shows a Perceptron with a Boolean output.

A Boolean output is based on inputs such as salaried, married, age, past credit profile, etc. It has only two values: Yes and No, or True and False. The summation function '∑' multiplies all inputs 'x' by their weights 'w' and then adds them up.

Multilayer Feed-Forward Networks:

The multilayer feed-forward network is a neural network with an input layer, one or more hidden layers, and an output layer. Each layer has one or more artificial neurons. These artificial neurons are similar to their perceptron precursor yet have a different activation function depending on the layer's specific purpose in the network. The artificial neuron of the multilayer perceptron is similar to its predecessor, the perceptron, but it adds flexibility in the type of activation layer we can use.

Figure 2-5. Artificial neuron for a multilayer perceptron

The net input to the activation function is still the dot product of the weights and input features, yet the flexible activation function allows us to create different types of output values.

Artificial neuron input.


The artificial neuron (see Figure 2-6) takes input that, based on the weights on the connections, can be ignored (by a 0.0 weight on an input connection) or passed on to the activation function. The activation function also has the ability to filter out data if it does not provide a non-zero activation value as output.

Details of an artificial neuron in a multilayer perceptron neural network

We express the net input to a neuron as the weights on connections multiplied by the activations incoming on those connections, as shown in Figure 2-6. For the input layer, we're just taking the feature at that specific index, and the activation function is linear (it passes on the feature value). For hidden layers, the input is the activation from other neurons. Mathematically, we can express the net input (total weighted input) of the artificial neuron as

input_sum_i = W_i · A_i

where W_i is the vector of all weights leading into neuron i and A_i is the vector of activation values for the inputs to neuron i. Let's build on this equation by accounting for the bias term that is added per layer (explained further below):

input_sum_i = W_i · A_i + b

To produce output from the neuron, we then wrap this net input with an activation function g, as demonstrated in the following equation:

a_i = g(input_sum_i)

We can then expand this function with the definition of input_sum_i:

a_i = g(W_i · A_i + b)

This activation value for neuron i is the output value passed on to the next layer through connections to other artificial neurons (multiplied by weights on connections) as an input value.
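
A hedged NumPy sketch of these equations for a single layer (the names are illustrative):

    import numpy as np

    def layer_forward(W, A, b, g):
        # input_sum = W · A + b, then activation a = g(input_sum)
        input_sum = W @ A + b
        return g(input_sum)

    # 2 neurons, 3 inputs each
    W = np.array([[0.2, -0.5, 0.1],
                  [0.7,  0.3, -0.4]])
    A = np.array([1.0, 0.5, -1.0])   # activations from the previous layer
    b = 0.1                          # per-layer bias term
    a = layer_forward(W, A, b, np.tanh)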

Feed-forward neural network architecture


With multilayer feed-forward neural networks, we have artificial neurons arranged into groups
called layers. Building on the layer concept, we see that the multilayer neural network has the
following:
• A single input layer
• One or many hidden layers, fully connected
• A single output layer

As Figure 2-7 depicts, the neurons in each layer (represented by the circles) are all fully connected
to all neurons in all adjacent layers.

The neurons in each layer all use the same type of activation function (most of the time). For the input layer, the input is the raw vector input. The input to neurons of the other layers is the output (activation) of the previous layer's neurons. As data moves through the network in a feed-forward fashion, it is influenced by the connection weights and the activation function type. Let's now take a look at the specifics of each layer type.

Figure 2-7. Fully connected multilayer feed-forward neural network topology

Input layer. This layer is how we get input data (vectors) fed into our network. The number of neurons in an input layer is typically the same as the number of input features to the network. Input layers are followed by one or more hidden layers (explained in the next section). Input layers in classical feed-forward neural networks are fully connected to the next hidden layer, yet in other network architectures the input layer might not be fully connected.

Hidden layer. There are one or more hidden layers in a feed-forward neural network. The weight
values on the connections between the layers are how neural networks encode the learned
information extracted from the raw training data. Hidden layers are the key to allowing neural
networks to model nonlinear functions, as we saw from the limitations of the single-layer
perceptron networks.

Output layer. We get the answer or prediction from our model from the output layer. Given that we are mapping an input space to an output space with the neural network model, the output layer gives us an output based on the input from the input layer. Depending on the setup of the neural network, the final output may be a real-valued output (regression) or a set of probabilities (classification). This is controlled by the type of activation function we use on the neurons in the output layer. The output layer typically uses either a softmax or sigmoid activation function for classification.

Connections between layers. In a fully connected feed-forward network, the connections between layers are the outgoing connections from all neurons in the previous layer to all of the neurons in the next layer. We change these weights progressively as our algorithm finds the best solution it can with the backpropagation learning algorithm. We can understand the weights mathematically by thinking of them as the parameter vector in the earlier linear algebra section describing the machine learning process as optimizing the parameter vector (e.g., "weights" here) to minimize error.

Training Neural Networks


A well-trained artificial neural network has weights that amplify the signal and dampen the noise.
A bigger weight signifies a tighter correlation between a signal and the network’s outcome. Inputs
paired with large weights will affect the network’s interpretation of the data more than inputs
paired with smaller weights.

The process of learning for any learning algorithm using weights is the process of readjusting the
weights and biases, making some smaller and others larger, thereby allocating significance to
certain bits of information and minimizing other bits. This helps our model learn which predictors
(or features) are tied to which outcomes, and adjusts the weights and biases accordingly.

In most datasets, certain features are strongly correlated with certain labels (e.g., square footage relates to the sale price of a house). Neural networks learn these relationships blindly by making a guess based on the inputs and weights and then measuring how accurate the results are. The loss functions in optimization algorithms, such as stochastic gradient descent (SGD), reward the network for good guesses and penalize it for bad ones. SGD moves the parameters of the network toward making good predictions and away from bad ones.

Another way to look at the learning process is to view labels as theories and the feature set as evidence. Then, we can make the analogy that the network seeks to establish the correlation between the theory and the evidence. The model attempts to answer the question "which theory does the evidence support?" With these ideas in mind, let's take a look at the learning algorithm most commonly associated with neural networks: backpropagation learning.

Backpropagation Learning
Backpropagation is an important part of reducing error in a neural network model.
Backpropagation learning is similar to the perceptron learning algorithm. We want to compute the
input example’s output with a forward pass through the network. If the output matches the label,
we don’t do anything. If the output does not match the label, we need to adjust the weights on the
connections in the neural network.

To further illustrate general neural network learning, let's take a look at pseudocode for the algorithm, as sketched below.
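
As a concrete stand-in for that pseudocode, here is a minimal runnable NumPy sketch of backpropagation with one hidden layer, trained on XOR. This is a standard illustration under simple assumptions (sigmoid activations, squared-error loss), not the book's exact listing; the names are our own:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy dataset: XOR, which a single-layer perceptron cannot learn
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer: 4 neurons
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer: 1 neuron
    lr = 0.5

    for epoch in range(5000):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Backward pass: distribute the blame for the error across the weights
        d_out = (out - y) * out * (1 - out)     # output-layer delta
        d_h = (d_out @ W2.T) * h * (1 - h)      # hidden-layer delta

        # Gradient-descent weight updates
        W2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0)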

The key is to distribute the blame for the error and divide it between the contributing weights. With the perceptron learning algorithm, it's easy because there is only one weight per input to influence the output value. With feed-forward multilayer networks, learning algorithms have a bigger challenge: there are many weights connecting each input to the output, so it becomes more difficult. Each weight contributes to more than one output, so our learning algorithm must be more clever.
Activation Functions
We use activation functions to propagate the output of one layer's nodes forward to the next layer (up to and including the output layer). Activation functions are scalar-to-scalar functions, yielding the neuron's activation. We use activation functions for hidden neurons in a neural network to introduce nonlinearity into the network's modeling capabilities. Many activation functions belong to a logistic class of transforms that (when graphed) resemble an S. This class of function is called sigmoidal. The sigmoid family of functions contains several variations, one of which is known as the Sigmoid function. Let's now take a look at some useful activation functions in neural networks.

Linear
A linear transform (see Figure 2-11) is basically the identity function, f(x) = Wx, where the dependent variable has a direct, proportional relationship with the independent variable. In practical terms, it means the function passes the signal through unchanged.
Figure 2-11. Linear activation function

We see this activation function used in the input layer of neural networks.

Sigmoid
Like all logistic transforms, sigmoids can reduce extreme values or outliers in data without
removing them. The vertical line in Figure 2-12 is the decision boundary.

A sigmoid function is a machine that converts independent variables of near infinite range into
simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1.
Tanh
Pronounced “tanch,” tanh is a hyperbolic trigonometric function (see Figure 2-13). Just as the
tangent represents a ratio between the opposite and adjacent sides of a right triangle, tanh
represents the ratio of the hyperbolic sine to the hyperbolic cosine: tanh(x) = sinh(x) / cosh(x).
Unlike the Sigmoid function, the normalized range of tanh is –1 to 1. The advantage of tanh is that
it can deal more easily with negative numbers.
Hard Tanh
Similar to tanh, hard tanh simply applies hard caps to the normalized range: anything more than 1 is made into 1, and anything less than –1 is made into –1. This makes for a more robust activation function that allows for a limited decision boundary.

Softmax
Softmax is a generalization of logistic regression inasmuch as it can be applied to continuous data (rather than binary classification) and can contain multiple decision boundaries. It handles multinomial labeling systems. Softmax is the function you will often find at the output layer of a classifier.

To further illustrate the idea of the softmax output layer and how to use it, let's consider two use cases. If we have a multiclass modeling problem yet we care only about the best score across these classes, we'd use a softmax output layer with an argmax() function to get the highest score of all the classes.

For the case in which we have a large set of labels (e.g., thousands of labels), we'd use the variant of the softmax activation function called the hierarchical softmax activation function. This variant decomposes the labels into a tree structure, and the softmax classifier is trained at each node of the tree to direct the branching for classification.

Rectified Linear

Rectified linear is a more interesting transform that activates a node only if the input is above a certain quantity. While the input is below zero, the output is zero, but when the input rises above a certain threshold, it has a linear relationship with the dependent variable: f(x) = max(0, x), as demonstrated in Figure 2-14.

Rectified linear units (ReLU) are the current state of the art because they have proven to work in many different situations. Because the gradient of a ReLU is either zero or a constant, it is possible to rein in the vanishing/exploding gradient issue. ReLU activation functions have been shown to train better in practice than sigmoid activation functions.
Leaky ReLU
Leaky ReLUs are a strategy to mitigate the "dying ReLU" issue. As opposed to having the function be zero when x < 0, the leaky ReLU will instead have a small negative slope (e.g., around 0.01). Some success has been seen in practice with this ReLU variation, but results are not always consistent. The equation is given here:

f(x) = x      if x > 0
f(x) = 0.01x  otherwise

Softplus
This activation function is considered to be the “smooth version of the ReLU,” as is illustrated in
Figure 2-15. Compare this plot to the ReLU in Figure 2-14.

Figure 2-15 shows that the softplus activation function (f(x) = ln[ 1 + exp(x) ]) has a similar shape
to the ReLU. We also notice the differentiability and nonzero derivative of the softplus everywhere
on the graph, in contrast to the ReLU.
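
To make these definitions concrete, here is a hedged NumPy sketch of the activations discussed in this section (the formulas are standard; the helper names are our own):

    import numpy as np

    def linear(x):      return x                        # identity: passes signal through
    def sigmoid(x):     return 1 / (1 + np.exp(-x))     # squashes into (0, 1)
    def tanh(x):        return np.tanh(x)               # squashes into (-1, 1)
    def hard_tanh(x):   return np.clip(x, -1.0, 1.0)    # hard caps at -1 and 1
    def relu(x):        return np.maximum(0, x)         # f(x) = max(0, x)
    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)            # small slope for x < 0
    def softplus(x):    return np.log1p(np.exp(x))      # smooth version of the ReLU

    def softmax(z):
        e = np.exp(z - z.max())    # subtract the max for numerical stability
        return e / e.sum()         # outputs sum to 1 (a probability distribution)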
Loss Functions:
The loss function is very important in machine learning and deep learning. Let's say you are working on a problem, you have trained a machine learning model on the dataset, and you are ready to put it in front of your client. But how can you be sure that this model will give the optimum result? Is there a metric or a technique that will help you quickly evaluate your model on the dataset? Yes, this is where loss functions come into play in machine learning and deep learning.

In mathematical optimization and decision theory, a loss or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event.

In simple terms, the loss function is a method of evaluating how well your algorithm is modeling your dataset. It is a mathematical function of the parameters of the machine learning algorithm.

In simple linear regression, the prediction is calculated using the slope (m) and the intercept (b). The loss function for this is (Yi − Ŷi)², i.e., the loss function is a function of the slope and the intercept.
Cost Function vs Loss Function in Deep Learning

Most people confuse the loss function and the cost function. Let's understand what each one is: the terms are often used interchangeably, but they are different.

Loss function: Measures the error between predicted and actual values in a machine learning model.
Cost function: Quantifies the overall cost or error of the model on the entire training set.

Loss function: Used to optimize the model during training.
Cost function: Used to guide the optimization process by minimizing the cost or error.

Loss function: Can be specific to individual samples.
Cost function: Aggregates the loss values over the entire training set.

Loss function: Examples include mean squared error (MSE), mean absolute error (MAE), and binary cross-entropy.
Cost function: Often the average or sum of individual loss values in the training set.

Loss function: Used to evaluate model performance.
Cost function: Used to determine the direction and magnitude of parameter updates during optimization.

Loss function: Different loss functions can be used for different tasks or problem domains.
Cost function: Typically derived from the loss function, but can include additional regularization terms or other considerations.

Loss Function in Deep Learning

1. Regression

o MSE(Mean Squared Error)

o MAE(Mean Absolute Error)

o Huber loss

2. Classification

o Binary cross-entropy

o Categorical cross-entropy
A. Regression Loss

1. Mean Squared Error/Squared loss/ L2 loss

The Mean Squared Error (MSE) is the simplest and most common loss function. To calculate the MSE, you take the difference between the actual value and the model prediction, square it, and average it across the whole dataset.

Advantage

 1. Easy to interpret.

 2. Always differentiable because of the square.

 3. Only one local minimum.

Disadvantage

 1. The error is in squared units, which is harder to interpret than the original output units.

 2. Not robust to outliers.

Note: In regression, use a linear activation function at the last neuron.

2. Mean Absolute Error/ L1 loss

The Mean Absolute Error (MAE) is another simple loss function. To calculate the MAE, you take the absolute difference between the actual value and the model prediction and average it across the whole dataset.
Advantage

 1. Intuitive and easy.

 2. The error unit is the same as that of the output column.

 3. Robust to outliers.

Disadvantage

 1. The graph is not differentiable at zero, so we cannot use gradient descent directly; instead, we can use subgradient calculation.

Note: In regression, use a linear activation function at the last neuron.

3. Huber Loss

In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss. It is quadratic for small errors and linear for large ones:

L(y, ŷ) = 0.5 * (y − ŷ)²           if |y − ŷ| ≤ δ
L(y, ŷ) = δ * |y − ŷ| − 0.5 * δ²   otherwise

 n – the number of data points.

 y – the actual value of the data point. Also known as true value.

 ŷ – the predicted value of the data point. This value is returned by the model.

 δ – defines the point where the Huber loss function transitions from quadratic to linear.

Advantage

 Robust to outliers.

 It lies between MAE and MSE.


Disadvantage

 Its main disadvantage is the associated complexity: in order to maximize model accuracy, the hyperparameter δ also needs to be optimized, which increases the training requirements.
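
Before moving to classification losses, here is a hedged NumPy sketch of the three regression losses above (standard definitions; the function names are our own):

    import numpy as np

    def mse(y, y_hat):
        # Mean Squared Error: average of squared differences
        return np.mean((y - y_hat) ** 2)

    def mae(y, y_hat):
        # Mean Absolute Error: average of absolute differences
        return np.mean(np.abs(y - y_hat))

    def huber(y, y_hat, delta=1.0):
        # Quadratic for small errors, linear for large ones
        err = np.abs(y - y_hat)
        quadratic = 0.5 * err ** 2
        linear = delta * err - 0.5 * delta ** 2
        return np.mean(np.where(err <= delta, quadratic, linear))

    y     = np.array([3.0, -0.5, 2.0, 7.0])
    y_hat = np.array([2.5,  0.0, 2.0, 8.0])
    print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))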

B. Classification Loss

1. Binary Cross Entropy/log loss

It is used in binary classification problems, i.e., problems with two classes: for example, whether a person has COVID or not, or whether an article becomes popular or not.

Binary cross-entropy compares each of the predicted probabilities to the actual class output, which can be either 0 or 1. It then calculates a score that penalizes the probabilities based on their distance from the expected value, that is, how close or far they are from the actual value:

BCE = -(1/n) * Σ [ yi * log(ŷi) + (1 - yi) * log(1 - ŷi) ]

where

 yi – actual values

 ŷi – Neural Network prediction

Advantage –

 The cost function is differentiable.

Disadvantage –

 Multiple local minima

 Not intuitive

Note: In binary classification, use a sigmoid activation function at the last neuron.
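
A hedged NumPy sketch of binary cross-entropy (standard definition; the eps clipping constant is our own choice to avoid log(0)):

    import numpy as np

    def binary_cross_entropy(y, y_hat, eps=1e-12):
        # Clip predictions to avoid taking log(0)
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    y     = np.array([1, 0, 1, 1])            # actual classes
    y_hat = np.array([0.9, 0.2, 0.7, 0.6])    # predicted probabilities (sigmoid outputs)
    print(binary_cross_entropy(y, y_hat))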

2. Categorical Cross Entropy

Categorical cross-entropy is used for multiclass classification and softmax regression.
loss function = -Σ (from j = 1 to k) yj * log(ŷj)

cost function = -(1/n) * Σ (from i = 1 to n) Σ (from j = 1 to k) yij * log(ŷij)

where

 k – number of classes

 y – actual value

 ŷ – Neural Network prediction

Note: In multi-class classification, use the softmax activation function at the last neuron.

If the problem statement has 3 classes, the softmax activation for the first class is:

f(z)1 = e^z1 / (e^z1 + e^z2 + e^z3)

When should you use categorical cross-entropy versus sparse categorical cross-entropy?

If the target column is one-hot encoded into classes, like 0 0 1, 0 1 0, 1 0 0, then use categorical cross-entropy; if the target column is numerically encoded into classes, like 1, 2, 3, ..., n, then use sparse categorical cross-entropy.
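
A hedged NumPy illustration of the difference: both compute the same quantity, but categorical cross-entropy takes one-hot targets while sparse categorical cross-entropy takes integer class indices (the names are our own):

    import numpy as np

    y_hat = np.array([[0.7, 0.2, 0.1],       # predicted class probabilities
                      [0.1, 0.8, 0.1]])      # (softmax outputs, rows sum to 1)

    # Categorical cross-entropy: targets are one-hot encoded
    y_onehot = np.array([[1, 0, 0],
                         [0, 1, 0]])
    cce = -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

    # Sparse categorical cross-entropy: targets are integer class indices
    y_sparse = np.array([0, 1])
    scce = -np.mean(np.log(y_hat[np.arange(len(y_sparse)), y_sparse]))

    print(cce, scce)   # identical values, different target encodings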


Hyperparameters:
In machine learning, we have both model parameters and parameters we tune to make networks train better and faster. These tuning parameters are called hyperparameters, and they deal with controlling optimization functions and model selection during training with our learning algorithm. In DL4J, we also refer to the optimization algorithms as updaters, because updates are synonymous with the steps the algorithm takes across the weight space to minimize error.

Hyperparameter selection focuses on ensuring that the model neither underfits nor overfits the training dataset, while learning the structure of the data as quickly as possible.

Learning Rate
The learning rate affects the amount by which you adjust parameters during optimization in order to minimize the error of the neural network's guesses. It is a coefficient that scales the size of the steps (updates) a neural network takes through its parameter vector x as it crosses the loss function space.

During backpropagation, we multiply the error gradient by the learning rate and then update a connection weight's last iteration with the product to reach a new weight. The learning rate determines how much of the gradient we want to use for the algorithm's next step. A large error and a steep gradient combine with the learning rate to produce a large step. As we approach minimal error and the gradient flattens out, the step size tends to shorten.

A large learning rate coefficient (e.g., 1) will make your parameters take leaps, and small ones
(e.g., 0.00001) will make it inch along slowly. Large leaps will save time initially, but they can be
disastrous if they lead us to overshoot our minimum. A learning rate too large oversteps the nadir,
making the algorithm bounce back and forth on either side of the minimum without ever coming
to rest.

In contrast, small learning rates should lead you eventually to an error minimum (it might be a
local minimum rather than a global one), but they can take a very long time and add to the burden
of an already computationally intensive process.
Regularization
Regularization helps with the effects of out-of-control parameters by using different methods to
minimize parameter size over time.

In mathematical notation, we see regularization represented by the coefficient lambda, controlling the trade-off between finding a good fit and keeping the value of certain feature weights low as the exponents on features increase.

Regularization coefficients L1 and L2 help fight overfitting by making certain weights smaller. Smaller-valued weights lead to simpler hypotheses, and simpler hypotheses are the most generalizable. Unregularized weights with several higher-order polynomials in the feature set tend to overfit the training set.

As the input training set size grows, the effect of regularization decreases and the parameters tend
to increase in magnitude. This is appropriate, because an excess of features relative to training set
examples leads to overfitting in the first place. Bigger data is the ultimate regularizer.
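
A hedged sketch of how the lambda coefficient enters the loss (L2 shown here; an L1 penalty would use the sum of absolute weights; the names and the lambda value are illustrative):

    import numpy as np

    def regularized_loss(y, y_hat, weights, lam=0.01):
        # Data-fit term (MSE) plus an L2 penalty that keeps weights small
        mse = np.mean((y - y_hat) ** 2)
        l2_penalty = lam * np.sum(weights ** 2)
        return mse + l2_penalty

    y = np.array([1.0, 2.0])
    y_hat = np.array([1.1, 1.8])
    w = np.array([0.5, -1.2, 0.3])
    print(regularized_loss(y, y_hat, w))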

Momentum
Momentum helps the learning algorithm get out of spots in the search space where it would otherwise
become stuck. In the errorscape, it helps the updater find the gulleys that lead toward the minima.
Momentum is to the learning rate what the learning rate is to weights, and it helps us produce better quality
models.
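
A hedged sketch of the classical momentum update (a standard formulation; the decay factor 0.9 and the values are illustrative, not from the text):

    # Classical momentum: velocity accumulates a decaying history of past
    # gradients, helping the updater roll through flat spots and small bumps.
    momentum = 0.9
    learning_rate = 0.01
    velocity = 0.0
    weight = 0.5
    gradients = [0.8, 0.6, 0.7, 0.5]   # hypothetical gradients from successive steps
    for gradient in gradients:
        velocity = momentum * velocity - learning_rate * gradient
        weight += velocity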

Sparsity

The sparsity hyperparameter recognizes that for some inputs only a few features are relevant. For example, let's assume that a network can classify a million images. Any one of those images will be indicated by a limited number of features. But to effectively classify millions of images, a network must be able to recognize considerably more features, many of which don't appear most of the time. An example of this would be how photos of sea urchins don't contain noses and hooves. This contrasts with how, in submarine images, the nose and hoof features will be 0.

The features that indicate sea urchins will be few and far between in the vastness of the neural network's layers. That's a problem, because sparse features can limit the number of nodes that activate and impede a network's ability to learn. In response to sparsity, biases force neurons to activate, and the activations stay around a mean that keeps the network from becoming stuck.
