Neural networks
Neural networks are machine learning models that mimic the complex functions of the human
brain. These models consist of interconnected nodes or neurons that process data, learn
patterns, and enable tasks such as pattern recognition and decision-making. These networks are
built from several key components:
1. Neurons: The basic units that receive inputs. Each neuron is governed by a threshold
and an activation function.
2. Connections: Links between neurons that carry information, regulated by weights and
biases.
3. Weights and Biases: These parameters determine the strength and influence of
connections.
4. Propagation Functions: Mechanisms that help process and transfer data across layers
of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve
accuracy.
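To make these components concrete, here is a minimal sketch of a single artificial neuron in Python with NumPy. The function names and the input values are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through the activation.
    z = np.dot(weights, inputs) + bias
    return sigmoid(z)

# Example: a neuron with three inputs (values chosen arbitrarily).
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2
print(neuron(x, w, b))  # a single activation value in (0, 1)
```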
Learning in neural networks follows a structured, three-stage process:
Input Computation: Data is fed into the network.
Output Generation: Based on the current parameters, the network generates an output.
Iterative Refinement: The network refines its output by adjusting weights and biases,
gradually improving its performance on diverse tasks.
In an adaptive learning environment:
The neural network is exposed to a simulated scenario or dataset.
Parameters such as weights and biases are updated in response to new data or
conditions.
With each adjustment, the network’s response evolves, allowing it to adapt effectively
to different tasks or environments.
Layers in Neural Network Architecture
1. Input Layer: This is where the network receives its input data. Each input neuron in
the layer corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy lifting. A
neural network can have one or multiple hidden layers. Each layer consists of units
(neurons) that transform the inputs into something that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The format of these
outputs varies depending on the specific task (e.g., classification, regression).
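As a small illustration (not tied to any specific framework, with layer sizes assumed purely for the example), each pair of adjacent layers can be represented by a weight matrix and a bias vector whose shapes link one layer to the next:

```python
import numpy as np

n_inputs, n_hidden, n_outputs = 4, 8, 3  # assumed layer sizes

# One weight matrix and one bias vector per connection between layers.
W1 = np.random.randn(n_hidden, n_inputs)   # input layer  -> hidden layer
b1 = np.zeros(n_hidden)
W2 = np.random.randn(n_outputs, n_hidden)  # hidden layer -> output layer
b2 = np.zeros(n_outputs)
```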
Working of Neural Networks
Forward Propagation
When data is input into the network, it passes through the network in the forward direction,
from the input layer through the hidden layers to the output layer. This process is known as
forward propagation.
1. Linear Transformation: Each neuron in a layer receives inputs, which are multiplied
by the weights associated with the connections. These products are summed together,
and a bias is added to the sum. This can be represented mathematically as:
z = w₁x₁ + w₂x₂ + … + wₙxₙ + b
where wᵢ represents the weights, xᵢ represents the inputs, and b is the
bias.
2. Activation: The result of the linear transformation (denoted as z) is then passed through
an activation function. The activation function is crucial because it introduces non-
linearity into the system, enabling the network to learn more complex patterns. Popular
activation functions include ReLU, sigmoid, and tanh.
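A minimal sketch of forward propagation through one hidden layer, assuming NumPy; the weights, biases, and input values below are toy numbers chosen only for illustration:

```python
import numpy as np

def relu(z):
    # ReLU activation: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    # Linear transformation followed by activation, layer by layer.
    z1 = W1 @ x + b1   # hidden-layer pre-activation
    a1 = relu(z1)      # hidden-layer activation
    z2 = W2 @ a1 + b2  # output-layer pre-activation
    return z2          # raw output (e.g., logits or a regression value)

x = np.array([1.0, 2.0])
W1 = np.array([[0.1, -0.2], [0.4, 0.3]]); b1 = np.zeros(2)
W2 = np.array([[0.5, -0.5]]);             b2 = np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```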
Backpropagation
After forward propagation, the network evaluates its performance using a loss function, which
measures the difference between the actual output and the predicted output. The goal of training
is to minimize this loss. This is where backpropagation comes into play:
1. Loss Calculation: The network calculates the loss, which provides a measure of error
in the predictions. The loss function could vary; common choices are mean squared
error for regression tasks or cross-entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss function with
respect to each weight and bias in the network. This involves applying the chain rule of
calculus to find out how much each part of the output error can be attributed to each
weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases are updated
using an optimization algorithm like stochastic gradient descent (SGD). The weights
are adjusted in the opposite direction of the gradient to minimize the loss. The size of
the step taken in each update is determined by the learning rate.
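As a minimal worked example of steps 1-3, here is a single hand-computed gradient step for one weight of a linear model with a squared-error loss. The model and the numbers are assumptions made purely for illustration:

```python
# Model: y_hat = w * x, loss L = (y_hat - y)^2
x, y = 2.0, 10.0          # one training example (illustrative values)
w = 3.0                   # current weight
lr = 0.1                  # learning rate

y_hat = w * x             # forward pass: 6.0
loss = (y_hat - y) ** 2   # loss: 16.0

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = 2*(y_hat - y) * x
grad = 2 * (y_hat - y) * x  # -16.0

w = w - lr * grad           # step against the gradient: 3.0 + 1.6 = 4.6
print(w)                    # the updated weight gives a lower loss (0.64)
```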
Iteration
This process of forward propagation, loss calculation, backpropagation, and weight updates is
repeated for many iterations over the dataset. Over time, this iterative process reduces the loss,
and the network’s predictions become more accurate.
Through these steps, neural networks can adapt their parameters to better approximate the
relationships in the data, thereby improving their performance on tasks such as classification,
regression, or any other predictive modeling.
3.1.6 Multilayer Perceptron
A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform
input data from one dimension to another. It is called “multi-layer” because it contains an input
layer, one or more hidden layers, and an output layer. The purpose of an MLP is to model
complex relationships between inputs and outputs, making it a powerful tool for various
machine learning tasks.
The key components of Multi-Layer Perceptron include:
Input Layer: Each neuron (or node) in this layer corresponds to an input feature. For
instance, if you have three input features, the input layer will have three neurons.
Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received from
the input layer.
Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.
Working of Multi-Layer Perceptron
Let’s delve into the working of the multi-layer perceptron, covering the key mechanisms:
forward propagation, the loss function, backpropagation, and optimization.
Step 1: Forward Propagation
In forward propagation, the data flows from the input layer to the output layer, passing
through any hidden layers. Each neuron in the hidden layers processes the input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
z = ∑ wᵢxᵢ + b
where xᵢ is the input feature, wᵢ is the corresponding weight, and b is the bias term.
2. Activation Function: The weighted sum z is passed through an activation function to
introduce non-linearity. Common activation functions include:
Sigmoid
ReLU (Rectified Linear Unit)
Tanh (Hyperbolic Tangent)
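The three activation functions named above can each be written in a line or two of NumPy. This is a sketch for illustration, not a library implementation:

```python
import numpy as np

def sigmoid(z):
    # Maps z to (0, 1); historically popular, saturates for large |z|.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Maps negative values to 0, keeps positive values; cheap and widely used.
    return np.maximum(0.0, z)

def tanh(z):
    # Maps z to (-1, 1); a zero-centered alternative to the sigmoid.
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), tanh(z))
```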
Step 2: Loss Function
Once the network generates an output, the next step is to calculate the loss using a loss function.
In supervised learning, this compares the predicted output to the actual label.
For a classification problem, the commonly used binary cross-entropy loss function is:
L = −(1/N) ∑ᵢ₌₁ᴺ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]
where:
yᵢ is the actual label.
ŷᵢ is the predicted label.
N is the number of samples.
For regression problems, the mean squared error (MSE) is often used:
MSE = (1/N) ∑ᵢ₌₁ᴺ (yᵢ − ŷᵢ)²
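Both losses translate directly into NumPy. This sketch assumes the predictions ŷ are probabilities in (0, 1) for the cross-entropy case, and clips them to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Average negative log-likelihood for binary labels.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # Average squared difference between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

y = np.array([1.0, 0.0, 1.0])   # actual labels (illustrative)
p = np.array([0.9, 0.2, 0.7])   # predicted probabilities (illustrative)
print(binary_cross_entropy(y, p), mean_squared_error(y, p))
```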
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network’s weights
and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight and
bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss:
w = w − η·(∂L/∂w)
where w is the weight, η is the learning rate, and ∂L/∂w is the gradient of the loss function with respect
to the weight.
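The three steps above can be written out by hand for a one-hidden-layer MLP with sigmoid activations and a squared-error loss. The following is a sketch under exactly those assumptions, not a general-purpose implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward pass.
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)       # prediction

    # Backward pass (chain rule), for the loss L = 0.5 * (a2 - y)^2.
    delta2 = (a2 - y) * a2 * (1 - a2)         # dL/dz2 at the output layer
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # error propagated back to layer 1

    # Gradient-descent updates: step against the gradients.
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return W1, b1, W2, b2
```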
Gradient Descent (GD)
Concept:
In Gradient Descent, we calculate the gradient of the loss function with respect to the model
parameters (weights and biases) for the entire dataset and update the parameters in the direction
that reduces the loss. It does this by computing the mean gradient over all training examples.
Steps in Gradient Descent:
1. Initialization: Randomly initialize the model parameters (weights and biases).
2. Compute Gradient: Compute the gradient (partial derivatives) of the loss function
with respect to all parameters using the entire dataset.
3. Update Parameters: Update the parameters by subtracting the product of the learning
rate and the computed gradients.
4. Repeat: Repeat the process for a set number of iterations or until convergence.
The update rule is given by:
θ = θ − η·∇J(θ)
Where:
θ is the vector of parameters (weights and biases).
η is the learning rate (step size).
∇J(θ) is the gradient of the loss function with respect to the parameters.
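A minimal sketch of this rule for one-parameter linear regression with an MSE loss; the dataset and learning rate are assumed purely for illustration. Note that the gradient is averaged over the entire dataset before each update:

```python
import numpy as np

# Toy dataset following y = 2x (assumed for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X
theta, eta = 0.0, 0.05  # single parameter and learning rate

for _ in range(100):
    y_pred = theta * X
    # Mean MSE gradient over ALL examples: dJ/dtheta = 2*mean((pred - y)*x)
    grad = 2.0 * np.mean((y_pred - y) * X)
    theta = theta - eta * grad  # update rule: theta = theta - eta * grad
print(theta)  # approaches 2.0
```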
Pros:
Accurate Update: Since we use the gradient from the whole dataset, the update is more
accurate.
Convergence: Gradient Descent tends to converge steadily toward the minimum of the
loss function (provided the learning rate is set appropriately).
Cons:
Computationally Expensive: For large datasets, computing the gradient over the entire
dataset at each step can be very slow and requires a lot of memory.
Slow Convergence: Gradient Descent may take many iterations to converge, especially
if the dataset is large.
Stochastic Gradient Descent (SGD)
Concept:
In Stochastic Gradient Descent (SGD), instead of computing the gradient over the
entire dataset, we update the parameters after computing the gradient for each
individual training example. This makes the updates faster because we are not waiting
for the entire dataset to be processed.
SGD updates the parameters in a much more noisy way, as each training example
provides a noisy approximation of the true gradient.
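Under the same toy linear-regression setup as above, SGD replaces the full-dataset mean with the gradient of one randomly chosen example per update. A sketch with assumed values:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X
theta, eta = 0.0, 0.02
rng = np.random.default_rng(0)

for _ in range(200):
    i = rng.integers(len(X))                   # pick ONE example at random
    grad = 2.0 * (theta * X[i] - y[i]) * X[i]  # noisy single-sample gradient
    theta = theta - eta * grad
print(theta)  # hovers near 2.0, with noise from the random sampling
```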
Pros:
Faster Updates: Since it processes one training example at a time, SGD can make
updates much faster, especially for large datasets.
Can Escape Local Minima: The noisy updates in SGD (due to the random selection
of individual samples) can help the algorithm escape local minima or saddle points,
potentially leading to better solutions.
Less Memory Usage: As we only need to compute the gradient for one sample at a
time, SGD requires much less memory than standard Gradient Descent.
Cons:
Noisy Updates: Because we update based on a single sample, the gradient can be
quite noisy. This can cause the algorithm to oscillate or diverge if not managed
properly (e.g., with learning rate decay).
Convergence Issues: Although it can be faster in terms of computation, the noisy
updates can cause slow convergence, and it may not converge to the exact minimum
of the loss function. It may keep oscillating around the optimal point.
Sensitive to Learning Rate: The learning rate needs to be chosen carefully. A large
learning rate can lead to divergence, and a small one can slow down convergence.
Mini-Batch Gradient Descent
To address the weaknesses of both Gradient Descent and Stochastic Gradient Descent,
there's a middle-ground approach known as Mini-Batch Gradient Descent.
Mini-Batch Gradient Descent splits the dataset into small batches and performs
updates based on each batch. This combines the advantages of both methods:
o It provides faster convergence compared to full-batch GD.
o It smooths out the noisy updates of SGD, providing more stable convergence.
Mini-Batch GD Update Rule:
θ = θ − η·(1/m) ∑ᵢ₌₁ᵐ ∇J(θ; xᵢ, yᵢ)
Where:
m is the batch size.
The summation is over the batch of m samples.
Mini-Batch Gradient Descent is widely used in practice for training large models, especially
in deep learning, because it balances the speed of SGD and the stable convergence of GD.
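The same toy problem with mini-batches: the data is shuffled each epoch and split into batches of m samples, with one parameter update per batch. The dataset and hyperparameters below are illustrative assumptions:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = 2.0 * X
theta, eta, m = 0.0, 0.01, 4  # m is the batch size
rng = np.random.default_rng(0)

for _ in range(100):               # epochs
    idx = rng.permutation(len(X))  # shuffle the data each epoch
    for start in range(0, len(X), m):
        batch = idx[start:start + m]
        xb, yb = X[batch], y[batch]
        # Gradient averaged over the batch of m samples.
        grad = 2.0 * np.mean((theta * xb - yb) * xb)
        theta = theta - eta * grad
print(theta)  # close to 2.0, with less noise than pure SGD
```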
Step 4: Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases during
training. Popular optimization methods include:
Stochastic Gradient Descent (SGD): Updates the weights based on a single sample or a
small batch of data:
w = w − η·∇L(w; xᵢ, yᵢ)
Adam Optimizer: An extension of SGD that incorporates momentum and adaptive
learning rates for more efficient training:
mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1 − β₂)·gₜ²
m̂ₜ = mₜ / (1 − β₁ᵗ),  v̂ₜ = vₜ / (1 − β₂ᵗ)
θₜ = θₜ₋₁ − η·m̂ₜ / (√v̂ₜ + ε)
Here, gₜ represents the gradient at time t, m̂ₜ and v̂ₜ are bias-corrected moment estimates,
β₁ and β₂ are decay rates, and ε is a small constant for numerical stability.
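A minimal sketch of the Adam update written out from the equations above, applied to a single scalar parameter. The objective and the learning rate are assumptions for this toy run (η = 0.1 is larger than the common 0.001 default so the example converges in few steps):

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy objective (theta - 5)^2.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):         # t starts at 1 for the bias correction
    grad = 2.0 * (theta - 5.0)  # analytic gradient of the objective
    theta, m, v = adam_step(theta, m, v, t)
print(theta)  # approaches 5.0
```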
Advantages of Multi-Layer Perceptron
Versatility: MLPs can be applied to a variety of problems, both classification and
regression.
Non-linearity: MLPs can model complex, non-linear relationships in data.
Parallel Computation: With the help of GPUs, MLPs can be trained quickly by taking
advantage of parallel computing.
Disadvantages of Multi-Layer Perceptron
Computationally Expensive: MLPs can be slow to train, especially on large datasets
with many layers.
Prone to Overfitting: Without proper regularization techniques, MLPs can overfit the
training data, leading to poor generalization.
Sensitivity to Data Scaling: MLPs require properly normalized or scaled data for
optimal performance.