DL (2)

The document covers various concepts in neural networks, including the McCulloch-Pitts model, perceptron learning algorithms, and different activation functions such as linear, sigmoid, tanh, ReLU, softmax, and softplus. It also discusses optimization techniques like Momentum, Adagrad, RMSProp, and Adam, comparing their features and use cases. Additionally, it touches on autoencoders and their architectures, including undercomplete, overcomplete, and sparse autoencoders.

PT1

1. Explain the McCulloch-Pitts model.

2. Explain the discrete perceptron learning algorithm with a block diagram.

3. Explain the continuous perceptron learning algorithm with a block diagram.

4. Problems based on the discrete/continuous perceptron learning algorithm.

5. Explain with example any four activation functions.


Types of Activation Functions in Deep Learning

1. Linear Activation Function

The linear activation function resembles a straight line defined by y = x. No matter how many layers the neural network contains, if they all use linear activation functions, the output is a linear combination of the input.

● The range of the output spans from (−∞, +∞).
● The linear activation function is used at just one place, i.e. the output layer.
● Using linear activation across all layers limits the network's ability to learn complex patterns.


Linear activation functions are useful for specific tasks but must be
combined with non-linear functions to enhance the neural network’s
learning and predictive capabilities.

Linear Activation Function or Identity Function returns the input as the output
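To make the point above concrete, here is a minimal NumPy sketch (added for illustration, not part of the original notes; the weight matrices W1, W2 and input x are arbitrary example values) showing that two stacked linear layers collapse into a single linear layer:

```python
import numpy as np

# Arbitrary example weights for two layers that use the identity (linear) activation.
W1 = np.array([[1.0, 2.0],
               [0.5, -1.0]])
W2 = np.array([[0.2, 0.0],
               [1.0, 3.0]])
x = np.array([1.0, -2.0])

# Forward pass through two "linear-activation" layers.
two_layer_output = W2 @ (W1 @ x)

# A single layer whose weight is the product W2 @ W1 gives exactly the same output,
# which is why stacking linear layers adds no representational power.
one_layer_output = (W2 @ W1) @ x

print(np.allclose(two_layer_output, one_layer_output))  # True
```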

2. Non-Linear Activation Functions

1. Sigmoid Function
The sigmoid activation function is characterized by its 'S' shape. It is mathematically defined as

A = 1 / (1 + e^(−x))

This formula ensures a smooth and continuous output that is essential for gradient-based optimization methods.
● It allows neural networks to handle and model complex patterns that linear equations cannot.
● The output ranges between 0 and 1, hence useful for binary classification.
● The function exhibits a steep gradient when x values are between −2 and 2. This sensitivity means that small changes in input x can cause significant changes in output y, which is critical during the training process.

Sigmoid or Logistic Activation Function Graph
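As a small illustration (a sketch added here, not from the original notes; the sample inputs are arbitrary), the sigmoid and its derivative follow directly from the formula above, and the derivative is largest near x = 0:

```python
import numpy as np

def sigmoid(x):
    # A = 1 / (1 + e^(-x)), giving outputs in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative sigmoid(x) * (1 - sigmoid(x)): steepest around x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-5.0, -2.0, 0.0, 2.0, 5.0])
print(sigmoid(xs))       # values squashed between 0 and 1
print(sigmoid_grad(xs))  # gradient is largest around x = 0 and shrinks at the tails
```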

2. Tanh Activation Function


The tanh function, or hyperbolic tangent function, is a shifted version of the sigmoid, allowing it to stretch across the y-axis. It is defined as:

f(x) = tanh(x) = 2 / (1 + e^(−2x)) − 1

Alternatively, it can be expressed using the sigmoid function:

tanh(x) = 2 × sigmoid(2x) − 1

● Value Range: Outputs values from −1 to +1.
● Non-linear: Enables modeling of complex data patterns.
● Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output, facilitating easier learning for subsequent layers.

Tanh Activation Function
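The identity tanh(x) = 2 × sigmoid(2x) − 1 quoted above can be checked numerically; this short sketch (added for illustration, with arbitrary sample inputs) does exactly that:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-3.0, 3.0, 7)

# tanh expressed through the sigmoid, as in the formula above.
tanh_via_sigmoid = 2.0 * sigmoid(2.0 * xs) - 1.0

# NumPy's built-in tanh agrees, and the outputs are zero-centered in (-1, 1).
print(np.allclose(tanh_via_sigmoid, np.tanh(xs)))  # True
```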

3. ReLU (Rectified Linear Unit) Function


ReLU activation is defined by A(x) = max(0, x). This means that if the input x is positive, ReLU returns x; if the input is negative, it returns 0.

● Value Range: [0, ∞), meaning the function only outputs non-negative values.
● Nature: It is a non-linear activation function, allowing neural networks to learn complex patterns and making backpropagation more efficient.
● Advantage over other activations: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. Only a few neurons are activated at a time, making the network sparse and therefore efficient and easy to compute.
ReLU Activation Function
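The following sketch (illustrative only; the input vector is made up) implements A(x) = max(0, x) and shows the sparsity mentioned above, since negative inputs are zeroed out:

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x): negative inputs become 0, positive inputs pass through.
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
activations = relu(x)

print(activations)                    # [0.  0.  0.  1.2 4. ]
print(np.count_nonzero(activations))  # only the positive inputs stay "active"
```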

3. Exponential Linear Units

1. Softmax Function
Softmax function is designed to handle multi-class classification
problems. It transforms raw output scores from a neural network into
probabilities. It works by squashing the output values of each class into
the range of 0 to 1, while ensuring that the sum of all probabilities equals
1.

● Softmax is a non-linear activation function.
● The Softmax function ensures that each class is assigned a probability, helping to identify which class the input belongs to.


Softmax Activation Function
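A common, numerically stable way to implement the behaviour described above is to subtract the maximum score before exponentiating; this sketch (added here with made-up logits) shows the outputs summing to 1:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the resulting probabilities.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes (example values)
probs = softmax(logits)

print(probs)             # each value lies in (0, 1)
print(probs.sum())       # 1.0: a proper probability distribution over classes
print(np.argmax(probs))  # index of the most likely class
```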

2. SoftPlus Function
The softplus function is defined mathematically as:

A(x) = log(1 + e^x)

This equation ensures that the output is always positive and differentiable at all points, which is an advantage over the traditional ReLU function.

● Nature: The Softplus function is non-linear.
● Range: The function outputs values in the range (0, ∞), similar to ReLU, but without the hard zero threshold that ReLU has.
● Smoothness: Softplus is a smooth, continuous function, meaning it avoids ReLU's sharp corner at zero (where ReLU is not differentiable), which can sometimes lead to problems during optimization.

Softplus Activation Function
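To illustrate the smoothness point (a sketch, not from the notes; the sample inputs are arbitrary), softplus can be compared with ReLU directly:

```python
import numpy as np

def softplus(x):
    # A(x) = log(1 + e^x): always positive and differentiable everywhere.
    return np.log1p(np.exp(x))

xs = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

print(softplus(xs))         # strictly positive outputs
print(np.maximum(0.0, xs))  # ReLU for comparison: hard zero below 0, corner at 0
# For large |x| softplus approaches ReLU, but near 0 it stays smooth.
```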

6. Explain different loss functions.


https://www.geeksforgeeks.org/loss-functions-in-deep-learning/
7. Explain different activation functions and loss functions. How
to choose the activation functions and loss functions for deep
learning applications.

8. Explain Momentum/Nesterov/Adagrad/RMSProp/Adam based gradient descent algorithms. (ALL GPT)
1. What is Momentum in Learning?

Momentum helps a model retain past updates to smooth the optimization process. Instead of only considering the current gradient, momentum incorporates previous updates, making the optimization faster and more stable.

● Uses past update history to smooth the update.
● Moves faster in relevant directions while dampening oscillations.
● Helps escape local minima and speeds up convergence.
📌 Key Terms (used in the update rule sketched below):
● η (eta) = learning rate (controls the step size).
● γ (gamma) = momentum coefficient (controls how much the past update contributes).
● ∇w_t = gradient at time step t.
● update_t = velocity term, which accumulates past gradients.
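Using these terms, a minimal sketch of the momentum update rule looks like this (illustrative Python, not from the notes; the quadratic example loss, starting point, and hyperparameter values are assumptions):

```python
import numpy as np

def grad(w):
    # Gradient of a simple example loss L(w) = w^2 (assumed for illustration).
    return 2.0 * w

w = 5.0                  # initial weight (arbitrary)
update = 0.0             # update_t: the velocity term accumulating past gradients
eta, gamma = 0.1, 0.9    # learning rate and momentum coefficient (typical values)

for t in range(100):
    # update_t = gamma * update_{t-1} + eta * grad(w_t);  w_{t+1} = w_t - update_t
    update = gamma * update + eta * grad(w)
    w = w - update

print(w)  # converges toward the minimum at w = 0
```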

📌 Adagrad (Adaptive Gradient Algorithm) – Theory & Explanation


Adagrad is an adaptive learning rate optimization algorithm that adjusts the learning
rate for each parameter individually. It is particularly useful for problems with sparse data
and features with different frequencies.

🖼️ Real-World Example: Studying for Exams 📚


Imagine you are preparing for multiple subjects:

●​ Mathematics (Harder Subject)


●​ English (Easier Subject)

Instead of studying all subjects equally, you realize that Math needs more focus while
English needs less.

Adagrad applies the same logic! 🚀


●​ It reduces the learning rate for frequently updated parameters (like easy
subjects).
●​ It increases focus on rarely updated parameters (like hard subjects).
●​ Over time, parameters that receive large gradients slow down in learning, while
others continue to improve.
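The mechanism just described can be sketched in a few lines of NumPy (an illustration added here, not from the original notes; the example gradient function, learning rate, and iteration count are assumptions):

```python
import numpy as np

def grad(w):
    # Example loss with one "frequent" and one "rare" direction (assumed for illustration):
    # the first parameter receives large gradients, the second receives small ones.
    return np.array([10.0 * w[0], 0.1 * w[1]])

w = np.array([1.0, 1.0])
G = np.zeros(2)          # running sum of squared gradients, one entry per parameter
eta, eps = 0.5, 1e-8

for t in range(100):
    g = grad(w)
    G += g ** 2                          # accumulate squared gradients forever
    w -= eta * g / (np.sqrt(G) + eps)    # per-parameter learning rate: eta / sqrt(G)

print(w)                        # both parameters move toward 0
print(eta / (np.sqrt(G) + eps)) # much smaller effective rate for the large-gradient parameter
```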

📌 Advantages of Adagrad
✅ No need to manually tune the learning rate → Adagrad adjusts it automatically.
✅ Great for sparse data problems → Common in NLP (e.g., word embeddings) and Recommendation Systems.
✅ Works well for convex optimization problems → Suitable for simpler ML tasks.

❌ Disadvantages of Adagrad
1. 📉 Learning Rate Becomes Too Small (Vanishing Learning Rate)
○ Since Adagrad keeps accumulating squared gradients, the denominator grows over time.
○ This reduces the learning rate too much, causing optimization to slow down or even stop.
2. 🛠️ Not Suitable for Deep Learning
○ Deep networks require long-term learning stability.
○ Adagrad slows down too quickly, making it ineffective for complex models.
○ Algorithms like RMSProp & Adam fix this issue.

📌 When to Use Adagrad?


🔹 NLP & Sparse Data Problems (e.g., text-based models, recommendation systems).
🔹 Feature Imbalance Scenarios, where some features are much more frequent than others.
🔹 Convex optimization problems, but not deep learning.
1️⃣ RMSProp (Root Mean Square Propagation)
🧠 Key Idea: Solve Adagrad’s "Vanishing Learning Rate" Issue
Adagrad accumulates squared gradients, which causes the learning rate to decrease too
much over time. RMSProp fixes this by using a moving average of squared gradients
instead of a full sum.

🖼️ Real-World Analogy: Adjusting Running Pace 🏃‍♂️


Imagine you're running a marathon. If you always keep slowing down based on past
effort (like Adagrad), you might stop completely before finishing the race.


RMSProp balances this by:

● Keeping track of recent effort rather than all past efforts.
● Smoothing out sudden changes in speed.
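A minimal sketch of RMSProp's moving-average idea (illustrative only; the example loss, decay rate, and learning rate are assumptions, not from the notes):

```python
import numpy as np

def grad(w):
    # Gradient of an example loss L(w) = w^2 (assumed for illustration).
    return 2.0 * w

w = 5.0
avg_sq = 0.0                   # moving average of squared gradients (not a full sum)
eta, beta, eps = 0.01, 0.9, 1e-8

for t in range(1000):
    g = grad(w)
    # Exponentially decaying average: old information is gradually forgotten,
    # so the denominator does not grow without bound as it does in Adagrad.
    avg_sq = beta * avg_sq + (1.0 - beta) * g ** 2
    w -= eta * g / (np.sqrt(avg_sq) + eps)

print(w)  # ends up near the minimum at w = 0 (within roughly the step size eta)
```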

✅ Advantages of RMSProp
●​ Solves Adagrad’s problem of vanishing learning rate.
●​ Works well for non-stationary (changing) problems.
●​ Great for deep learning and RNNs (Recurrent Neural Networks).

❌ Disadvantages of RMSProp
● Mitigates AdaGrad's aggressive learning rate decay, but still requires careful tuning of hyperparameters such as the learning rate and the decay factor.

2️⃣ Adam (Adaptive Moment Estimation)
🧠 Key Idea: Combine Momentum & RMSProp
Adam is like a mix of Momentum and RMSProp, making it one of the best optimizers for
deep learning.

🖼️ Real-World Analogy: Driving a Car 🚗


Think about driving on a curvy road.

● Momentum (like β1) helps keep the movement smooth so you don't oversteer.
● RMSProp (like β2) adjusts how much you slow down or speed up based on past driving experience.
● Together, Adam ensures efficient, smooth, and adaptive learning – like an expert driver adjusting speed & direction simultaneously.
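Putting the two ideas together, here is a minimal sketch of Adam's update with bias correction (illustrative only; the example loss and learning rate are assumptions, while β1 = 0.9 and β2 = 0.999 are the common defaults):

```python
import numpy as np

def grad(w):
    # Gradient of an example loss L(w) = w^2 (assumed for illustration).
    return 2.0 * w

w = 5.0
m, v = 0.0, 0.0   # first-moment (momentum) and second-moment (RMS) estimates
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1.0 - beta1) * g          # momentum-style moving average of gradients
    v = beta2 * v + (1.0 - beta2) * g ** 2     # RMSProp-style moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1.0 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # ends up near the minimum at w = 0
```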

✅ Advantages of Adam
●​ Fast convergence → Works well even with noisy gradients.
●​ Adaptive learning rate → Adjusts per parameter.
●​ Great for deep learning → Default optimizer in many frameworks (e.g., TensorFlow,
PyTorch).

❌ Disadvantages of Adam
● Computationally expensive due to tracking multiple moving averages.
● May not always generalize well for some problems (SGD can generalize better).
● Requires tuning of β1 and β2, though the defaults work well.

9. Compare and contrast AdaGrad with RMSProp gradient descent.

10. Differentiate between Momentum-based and Adam gradient descent.
Comparison of AdaGrad and RMSProp Gradient Descent
● Full Form: AdaGrad is the Adaptive Gradient Algorithm; RMSProp is Root Mean Square Propagation.
● Adaptability: AdaGrad adapts the learning rate per parameter by accumulating squared gradients over time; RMSProp modifies AdaGrad by using an exponentially decaying moving average of past squared gradients.
● Learning Rate Adjustment: AdaGrad's learning rate decreases over time due to continuous accumulation of squared gradients, making it slow for long-term training; RMSProp keeps the learning rate more stable by preventing extreme reduction, making it better for non-convex problems.
● Handling of Sparse Data: AdaGrad works well with sparse data since it aggressively reduces learning rates for frequently occurring features; RMSProp also works well with sparse data but avoids drastic reduction of learning rates.
● Downside: AdaGrad's learning rate can become too small over time, leading to slow convergence or training stopping prematurely; RMSProp mitigates AdaGrad's aggressive learning rate decay but still requires careful tuning of hyperparameters.
● Best Use Case: AdaGrad is suitable for convex problems and sparse data scenarios; RMSProp works well for non-stationary and non-convex optimization problems.

Difference Between Momentum-Based and Adam Gradient Descent


● Full Form: Momentum-based gradient descent uses past gradients to smooth updates; Adam stands for Adaptive Moment Estimation.
● Update Mechanism: Momentum adds a fraction of the past update to the current update, helping to overcome local minima and smooth updates; Adam combines momentum and adaptive learning rates using first-moment and second-moment estimates.
● Speed: Momentum accelerates training by reducing oscillations in steep regions; Adam converges faster due to its adaptive learning rate combined with momentum.
● Hyperparameters: Momentum uses a learning rate (α) and a momentum coefficient (β); Adam uses a learning rate (α), β1 (momentum term), β2 (RMS term), and epsilon (ϵ).
● Handling of Sparse Data: Momentum is not well-suited for sparse data, as the learning rate remains constant; Adam performs well on sparse data due to per-parameter adaptive learning rates.
● Convergence Stability: Momentum helps escape local minima but may overshoot due to its fixed learning rate; Adam is more stable and adaptive, making it a robust choice for deep learning.
● Best Use Case: Momentum is useful when dealing with high variance or noisy gradients; Adam is preferred for deep learning models due to its adaptive and robust nature.

Summary

●​ AdaGrad is good for sparse data but suffers from diminishing learning rates.
●​ RMSProp improves upon AdaGrad by using an exponentially decaying average of
squared gradients, making it more suitable for non-convex problems.
●​ Momentum-based Gradient Descent accelerates convergence by adding
momentum but uses a fixed learning rate.
●​ Adam combines the benefits of momentum and adaptive learning rates, making it
one of the most popular optimizers for deep learning.

11. Discuss autoencoders with an appropriate architecture diagram. (Lect 7)

12. Discuss the architectures of undercomplete and overcomplete autoencoders.
11,12 https://youtu.be/wPz3MPl5jvY?si=2EdkFk5HBZx0fwFD

13. Explain sparse autoencoders in short.
https://youtu.be/REzrCEDMQws?si=aA5hZVxKraPkudeg
