Department of Artificial Intelligence & Data Science
EXPERIMENT NO: 2
Aim: Apply the following learning algorithms to learn the parameters of the supervised single
layer feed forward neural network: Batch, Stochastic, and Mini-Batch Gradient Descent.
Software Required: Jupyter Notebook, Python, TensorFlow, Keras.
Theory:
Single Layer Feed Forward Neural Network
A Single Layer Feedforward Neural Network is one of the simplest types of artificial neural
networks. It consists of an input layer connected directly to a single layer of output neurons, so only one layer of weights is learned.
Figure 1: Single Layer Feedforward Neural Network
Working Principle:
Each neuron computes y = f(Σᵢ₌₁ⁿ wᵢxᵢ + b), where xᵢ are the inputs, wᵢ the weights, b the bias, and f the activation function.
The network learns by adjusting weights using a learning algorithm like gradient descent and a
loss function.
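As a concrete illustration of this computation, here is a minimal NumPy sketch of a single neuron; the input values, weights, bias, and sigmoid activation are chosen arbitrarily for the example.
import numpy as np

def sigmoid(z):
    # activation function f
    return 1.0 / (1.0 + np.exp(-z))

# example values (chosen arbitrarily for illustration)
x = np.array([0.5, -1.2, 3.0])   # inputs x_i
w = np.array([0.4, 0.1, -0.2])   # weights w_i
b = 0.05                         # bias

# y = f(sum_i w_i * x_i + b)
y = sigmoid(np.dot(w, x) + b)
print(y)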
Feed Forward Neural Network
Feedforward neural networks are also known as Multi-layered Networks of Neurons (MLN). In these networks, information travels only forward: from the input nodes, through the hidden layers (one or more), and finally to the output nodes. An MLN has no feedback connections, i.e., the output of the network is never fed back into itself.
Gradient
A gradient gives the direction and magnitude of change computed during the training of a neural network; it is used to adjust the network weights in the right direction by the right amount. The higher the gradient, the steeper the slope and the faster a model can learn; if the slope is zero, the model stops learning. Mathematically, the gradient is the vector of partial derivatives of the loss with respect to the parameters (weights and biases).
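To make the definition concrete, the short sketch below (an illustration, not part of the experiment code) computes the partial derivative of a squared-error loss with respect to a single weight, both analytically and numerically.
import numpy as np

# single sample: prediction y_hat = w * x, loss L = (y_hat - y)^2
x, y, w = 2.0, 3.0, 0.5

def loss(w):
    return (w * x - y) ** 2

# analytic gradient: dL/dw = 2 * (w*x - y) * x
grad_analytic = 2 * (w * x - y) * x

# numerical estimate of the same partial derivative
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(grad_analytic, grad_numeric)   # both ≈ -8.0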
Gradient Descent:
Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively
updating the weights in the direction of the negative gradient. This direction, defined by the slope
of the loss function, leads the model toward the minimum point, where prediction error is at its
lowest. A crucial factor in this process is the learning rate, which defines the step size toward the
minimum. A lower learning rate results in smaller steps, increasing the time to reach the minimum
but often yielding a more precise result. If the learning rate is too large, the steps become larger and the model may never reach the minimum, because it overshoots and bounces back and forth across the valley of the loss function instead of settling at it.
Figure 2: Impact of Learning Rate on Convergence Speed and Stability
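The following minimal sketch, assuming a simple one-parameter quadratic loss chosen only for illustration, shows how the learning rate affects convergence: a small rate makes slow progress, a moderate rate converges, and an overly large rate overshoots and diverges.
import numpy as np

def grad(w):
    # derivative of the loss L(w) = (w - 3)^2, whose minimum is at w = 3
    return 2 * (w - 3)

for lr in [0.01, 0.1, 1.1]:           # small, moderate, too large
    w = 0.0
    for _ in range(50):
        w = w - lr * grad(w)          # gradient descent update
    print(f"lr={lr}: w after 50 steps = {w:.4f}")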
Types of Gradient Descent
1) Batch Gradient Descent - Updates weights after computing the gradient over the entire dataset.
2) Stochastic Gradient Descent - Updates weights for each training sample individually.
3) Mini-Batch Gradient Descent - Updates weights after computing the gradient over a small batch
of training samples.
1. Batch Gradient Descent
Batch gradient descent, also called vanilla gradient descent, uses the entire dataset to calculate the gradient and updates the parameters once per epoch. Its advantage is that it produces a stable error gradient and stable convergence; however, it requires that the entire training set resides in memory and is available to the algorithm. It can also be computationally expensive, especially for large datasets, because every single update requires computing gradients over the whole dataset. As a result, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory. It may also converge slowly, get stuck in local minima or saddle points, and suffer from poor generalization if not properly tuned.
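Below is a minimal NumPy sketch of batch gradient descent for a linear model with mean-squared-error loss on synthetic data; it illustrates the one-update-per-epoch idea and is not the Keras code used later in this experiment.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))                 # synthetic inputs
y = X @ np.array([2.0, -1.0]) + 0.5           # synthetic targets

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(100):
    y_hat = X @ w + b                         # predictions on the FULL dataset
    error = y_hat - y
    grad_w = 2 * X.T @ error / len(X)         # gradient of MSE over all samples
    grad_b = 2 * error.mean()
    w -= lr * grad_w                          # one parameter update per epoch
    b -= lr * grad_b
print(w, b)                                   # approaches [2, -1] and 0.5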
Figure 3: Effect of Learning Rate on Gradient Descent
2. Stochastic Gradient Descent
Stochastic Gradient Descent updates the model parameters using only one training sample at a
time. It is computationally efficient and allows for faster iterations, making it suitable for
large-scale datasets. One advantage is that the frequent updates give us a detailed picture of the rate of
improvement. However, because of its frequent updates, it introduces noise in the optimization
path, leading to a more fluctuating convergence. This randomness can help escape local minima
but may also prevent reaching the exact minimum unless properly tuned.
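Using the same illustrative setup (linear model, MSE loss, synthetic data), a stochastic version performs one parameter update per training sample:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w, b, lr = np.zeros(2), 0.0, 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):         # shuffle, then take one sample at a time
        error = (X[i] @ w + b) - y[i]
        w -= lr * 2 * error * X[i]            # update from a single sample
        b -= lr * 2 * error
print(w, b)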
3. Mini-Batch Gradient Descent
Mini-Batch Gradient Descent combines the advantages of both Batch and Stochastic Gradient
Descent. It splits the training dataset into smaller batches, then computes the gradient and updates the parameters once per batch, offering a good trade-off between convergence speed and stability. It reduces the variance of the updates compared to SGD while requiring far less computation per update than Batch Gradient Descent, balancing the robustness of stochastic gradient descent with the efficiency of batch gradient descent. Mini-batching also allows for
parallelization and efficient use of hardware (like GPUs), making it the most commonly used
method in practice.
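Continuing the same illustrative setup, the mini-batch version below performs one parameter update per small batch of samples; the batch size of 32 is an arbitrary choice for the example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w, b, lr, batch_size = np.zeros(2), 0.0, 0.05, 32
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]          # indices of one mini-batch
        error = (X[batch] @ w + b) - y[batch]
        w -= lr * 2 * X[batch].T @ error / len(batch)  # gradient over the mini-batch
        b -= lr * 2 * error.mean()
print(w, b)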
Batch vs. Stochastic vs. Mini-Batch Gradient Descent
1) Which is faster (for the same number of epochs)?
• Order: Batch > Mini-Batch > Stochastic
• Reason: Batch performs a single vectorized update per epoch over the full data, so an epoch is cheapest; SGD performs one update per sample, so an epoch takes the longest.
2) Which converges faster (for the same number of epochs)?
• Order: Stochastic > Mini-Batch > Batch
• Reason: SGD updates the weights more frequently, so it approaches the minimum faster, though noisily.
3) Which converges more smoothly and stably?
• Order: Batch > Mini-Batch > Stochastic
• Reason: Batch has the smoothest gradient, while SGD fluctuates a lot.
Figure 4: Comparison of Cost Function Behaviour
Conclusion:
Thus, we successfully implemented the learning algorithms to learn the parameters of the
supervised single layer feed forward neural network.
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# load the dataset and keep only the columns used for training
df = pd.read_csv('Social_Network_Ads.csv')
df.head()
df = df[['Age', 'EstimatedSalary', 'Purchased']]
df.head()
df.shape
# features (Age, EstimatedSalary) and target (Purchased)
X = df.iloc[:, 0:2]
y = df.iloc[:, -1]

# standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled.shape

import tensorflow as tf
from tensorflow import keras
from keras import Sequential
from keras.layers import Dense

# two hidden layers of 10 ReLU units and a sigmoid output for binary classification
model = Sequential()
model.add(Dense(10, activation='relu', input_dim=2))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Batch Gradient Descent
# plain SGD optimizer (assumed here, so the updates follow gradient descent)
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
# batch_size equal to the full training set -> one weight update per epoch
history = model.fit(X_scaled, y, epochs=500, batch_size=400, validation_split=0.2, verbose=0)
plt.plot(history.history['loss'])
# Stochastic Gradient Descent
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
# batch_size=1 -> one weight update per training sample
history = model.fit(X_scaled, y, epochs=500, batch_size=1, validation_split=0.2)
plt.plot(history.history['loss'])
# Mini-Batch Gradient Descent
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
# batch_size=250 -> a few updates per epoch, between the two extremes
history = model.fit(X_scaled, y, epochs=500, batch_size=250, validation_split=0.2)
plt.plot(history.history['loss'])