DL (2)
The Linear Activation Function (Identity Function) returns the input as the output; it is typically used in the output layer of regression networks.
1. Sigmoid Function
The Sigmoid Activation Function is characterized by its ‘S’ shape. It is mathematically defined as
A = 1 / (1 + e^(-x)).
This formula ensures a smooth and continuous output that is essential for gradient-based optimization methods.
● It allows neural networks to handle and model complex patterns.
● Its output lies between 0 and 1, which makes it well suited to binary classification.
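A minimal sketch of the definition above (assuming NumPy; the test inputs are arbitrary), showing how the sigmoid squashes any real input into the open interval (0, 1):

```python
import numpy as np

def sigmoid(x):
    # A = 1 / (1 + e^(-x)); output is squashed into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```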
2. Tanh Function
The Tanh (Hyperbolic Tangent) Activation Function is defined as
f(x) = tanh(x) = 2 / (1 + e^(-2x)) - 1.
It can also be written in terms of the sigmoid: tanh(x) = 2 × sigmoid(2x) - 1.
● Its output is zero-centered and lies in the range (-1, 1), which is why tanh is commonly used in hidden layers.
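A small check of the identity above (assuming NumPy; the sample points are arbitrary) confirms that tanh can be built from the sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
# tanh can be written in terms of sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```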
3. ReLU Function
ReLU (Rectified Linear Unit) is defined as
A(x) = max(0, x),
which means that if the input x is positive, ReLU returns x; if the input is negative, it returns 0.
● Value Range: [0, ∞), meaning the function only outputs non-negative values.
● Only a subset of neurons is activated at any time, which makes the network sparse and more efficient.
● It requires only a simple comparison, making it cheap for computation.
Figure: ReLU Activation Function
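A minimal sketch of the definition (assuming NumPy; the inputs are arbitrary), showing that negative values are zeroed out while positive values pass through unchanged:

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x): negative inputs map to 0, positive inputs pass through
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ] -> only non-negative outputs
```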
1. Softmax Function
The Softmax function is designed to handle multi-class classification
problems. It transforms raw output scores from a neural network into
probabilities. It works by squashing the output values of each class into
the range of 0 to 1, while ensuring that the sum of all probabilities equals
1.
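A minimal sketch of this behaviour (assuming NumPy; the logits are arbitrary, and subtracting the maximum is a common numerical-stability trick not mentioned above):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize so probabilities sum to 1
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])   # raw scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())            # ~[0.659 0.242 0.099], sums to 1.0
```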
2. SoftPlus Function
The Softplus function is defined mathematically as:
A(x) = log(1 + e^x).
It can be viewed as a smooth approximation of ReLU, without the hard corner at zero that ReLU has.
● Smoothness: Softplus is a smooth, continuous function, meaning it is differentiable everywhere, which benefits gradient-based optimization.
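A short sketch (assuming NumPy; the inputs are arbitrary) comparing Softplus with ReLU on the same values:

```python
import numpy as np

def softplus(x):
    # A(x) = log(1 + e^x): a smooth, everywhere-differentiable approximation of ReLU
    return np.log1p(np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(softplus(x))           # ~[0.127, 0.693, 2.127]
print(np.maximum(0.0, x))    # ReLU for comparison: [0., 0., 2.]
```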
8. Explain Momentum / Nesterov / Adagrad / RMSProp / Adam based Gradient Descent algorithms. (ALL GPT)
1. What is Momentum in Learning?
Momentum helps a model retain past updates to smooth the optimization process. Instead of considering only the current gradient, momentum incorporates previous updates, making optimization faster and more stable.
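A rough sketch of this idea (assuming NumPy, a toy quadratic loss, and illustrative values for the learning rate and momentum coefficient): the velocity term carries past updates forward, and the current gradient only nudges it:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # The velocity accumulates past updates; the current gradient only nudges it
    velocity = beta * velocity - lr * grad
    w = w + velocity
    return w, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(3):
    grad = 2.0 * w            # gradient of the toy loss ||w||^2
    w, v = momentum_step(w, grad, v)
print(w)
```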
2. What is Adagrad (Adaptive Gradient)?
Adagrad adapts the learning rate for each parameter individually, like a student who, instead of studying all subjects equally, realizes that Math needs more focus while English needs less: parameters that receive large or frequent gradients get smaller learning rates, while rarely updated parameters get larger ones.
📌 Advantages of Adagrad
✅ No need to manually tune the learning rate → Adagrad adjusts it automatically.
✅ Great for sparse data problems → Common in NLP (e.g., word embeddings) and Recommendation Systems.
✅ Works well for convex optimization problems → Suitable for simpler ML tasks.
❌ Disadvantages of Adagrad
1. 📉 Learning Rate Becomes Too Small (Vanishing Learning Rate)
○ Since Adagrad keeps accumulating squared gradients, the denominator grows over time.
○ This reduces the learning rate too much, causing optimization to slow down or even stop.
2. 🛠️ Not Suitable for Deep Learning
○ Deep networks require long-term learning stability.
○ Adagrad slows down too quickly, making it ineffective for complex models.
○ Algorithms like RMSProp & Adam fix this issue.
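A rough sketch of the Adagrad update (assuming NumPy, a toy quadratic loss, and an illustrative base learning rate); it also shows the vanishing-learning-rate issue from point 1 above, since the accumulator only ever grows:

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate *all* past squared gradients; the denominator only ever grows
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.array([5.0])
accum = np.zeros_like(w)
for step in range(5):
    grad = 2.0 * w                                  # gradient of the toy loss w^2
    w, accum = adagrad_step(w, grad, accum)
    print(step, w, 0.1 / (np.sqrt(accum) + 1e-8))   # effective learning rate keeps shrinking
```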
1️⃣ RMSProp (Root Mean Square Propagation)
RMSProp balances Adagrad's rapidly shrinking learning rate by:
✅ Keeping track of recent effort (recent gradients, via an exponentially decaying average) rather than all past efforts.
✅ Smoothing out sudden changes in speed (the size of the update steps).
✅ Advantages of RMSProp
● Solves Adagrad’s problem of vanishing learning rate.
● Works well for non-stationary (changing) problems.
● Great for deep learning and RNNs (Recurrent Neural Networks).
❌ Disadvantages of RMSProp
● The base learning rate (and the decay factor) still has to be chosen manually.
● On its own it does not use momentum, which is part of what Adam adds on top.
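A rough sketch of the RMSProp update (assuming NumPy, a toy quadratic loss, and illustrative hyperparameter values); the exponentially decaying average keeps only "recent effort" in the denominator:

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=0.01, beta=0.9, eps=1e-8):
    # Exponentially decaying average: recent gradients count, old ones fade out
    sq_avg = beta * sq_avg + (1.0 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg

w = np.array([5.0])
sq_avg = np.zeros_like(w)
for _ in range(5):
    grad = 2.0 * w                 # gradient of the toy loss w^2
    w, sq_avg = rmsprop_step(w, grad, sq_avg)
print(w)
```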
2️⃣ Adam (Adaptive Moment Estimation)
🧠 Key Idea: Combine Momentum & RMSProp
Adam is like a mix of Momentum and RMSProp, making it one of the best optimizers for
deep learning.
● Momentum (the β1 term) helps keep the movement smooth so you don’t oversteer.
● RMSProp (the β2 term) adjusts how much you slow down or speed up based on past driving experience.
● Together, Adam ensures efficient, smooth, and adaptive learning – like an expert
driver adjusting speed & direction simultaneously.
✅ Advantages of Adam
● Fast convergence → Works well even with noisy gradients.
● Adaptive learning rate → Adjusts per parameter.
● Great for deep learning → Default optimizer in many frameworks (e.g., TensorFlow,
PyTorch).
❌ Disadvantages of Adam
● Computationally expensive due to tracking multiple moving averages.
● May not always generalize well for some problems (SGD can generalize better).
● Requires tuning of β1 and β2, though the defaults work well.
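A rough sketch of the Adam update (assuming NumPy, a toy quadratic loss, and the commonly cited default hyperparameters); m plays the momentum role, v plays the RMSProp role, and the bias-correction terms are part of the standard formulation (not discussed above):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: momentum-like running mean of gradients; v: RMSProp-like running mean of squared gradients
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)   # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    grad = 2.0 * w                   # gradient of the toy loss ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```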
Comparison: Adagrad vs RMSProp
● Learning Rate Adjustment
○ Adagrad: Decreases over time due to continuous accumulation of squared gradients, making it slow for long-term training.
○ RMSProp: Keeps the learning rate more stable by preventing extreme reduction, making it better for non-convex problems.
● Handling of Sparse Data
○ Adagrad: Works well with sparse data since it aggressively reduces learning rates for frequently occurring features.
○ RMSProp: Also works well with sparse data but avoids drastic reduction of learning rates.
● Best Use Case
○ Adagrad: Suitable for convex problems and sparse data scenarios.
○ RMSProp: Works well for non-stationary and non-convex optimization problems.
Comparison: Momentum-based Gradient Descent vs Adam
● Handling of Sparse Data
○ Momentum: Not well-suited for sparse data, as the learning rate remains constant.
○ Adam: Performs well on sparse data due to per-parameter adaptive learning rates.
● Convergence Stability
○ Momentum: Helps escape local minima but may overshoot due to the fixed learning rate.
○ Adam: More stable and adaptive, making it a robust choice for deep learning.
● Best Use Case
○ Momentum: Useful when dealing with high variance or noisy gradients.
○ Adam: Preferred for deep learning models due to its adaptive and robust nature.
Summary
● AdaGrad is good for sparse data but suffers from diminishing learning rates.
● RMSProp improves upon AdaGrad by using an exponentially decaying average of
squared gradients, making it more suitable for non-convex problems.
● Momentum-based Gradient Descent accelerates convergence by adding
momentum but uses a fixed learning rate.
● Adam combines the benefits of momentum and adaptive learning rates, making it
one of the most popular optimizers for deep learning.