📘 DEEP LEARNING – CLASS NOTES
Sources used
✅ DOC file (Session 7) → FULL
✅ CS05_Optimization → Slides 29 to 51 ONLY
✅ CS06_Activation → COMPLETE slides
Language: simple, academic, exam-ready
Structured for understanding + revision
PART 1: SESSION 7 – DEEP LEARNING (DOC FILE –
FULL CONTENT)
1. Overview of Deep Learning Training
Deep learning training involves:
1. Forward propagation
2. Loss computation
3. Backpropagation
4. Weight update using optimization algorithms
The goal is to:
Minimize training error (optimization)
Minimize generalization error (avoid overfitting)
2. Objective Function (Loss Function)
In optimization terminology:
Loss function = Objective function
Optimization algorithms aim to minimize this function
3. Convex vs Non-Convex Optimization
Convex Function
Single global minimum
Easy optimization
Non-Convex Function (DNN Loss)
Multiple local minima
Saddle points
Flat regions
➡ Deep neural networks always involve non-convex optimization
4. Gradient Descent Recap
Weight update rule:
w = w − η ∇L(w)
Where:
η = learning rate
Gradient sign decides direction
Learning rate decides step size
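A minimal runnable sketch of this update rule, using the toy loss L(w) = w² (so ∇L(w) = 2w); the loss function and constants are illustrative, not from the notes:

```python
# Toy gradient descent on L(w) = w^2, whose gradient is 2w.
def gd_step(w, grad, lr=0.1):
    # w = w - eta * gradient
    return w - lr * grad

w = 5.0
for _ in range(100):
    w = gd_step(w, 2 * w)
# w has moved close to the minimum at w = 0
```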
5. Types of Gradient Descent
Batch Gradient Descent
Uses all training samples
One update per epoch
Computationally expensive
Stochastic Gradient Descent (SGD)
Uses one sample at a time
Fast but noisy
Mini-Batch Gradient Descent
Uses subset of data
Most commonly used in practice
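The three variants differ only in how much data feeds each update. A hypothetical sketch of mini-batch selection (the data and batch size are made up for illustration):

```python
import random

data = list(range(10))   # stand-in for 10 training samples
batch_size = 4           # batch GD would use all 10; SGD would use 1

def minibatches(samples, batch_size, seed=0):
    # Shuffle once per epoch, then yield consecutive chunks.
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    for i in range(0, len(shuffled), batch_size):
        yield shuffled[i:i + batch_size]

batches = list(minibatches(data, batch_size))
# One weight update would be performed per batch.
```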
6. Learning Rate Scheduling
The learning rate should not be kept constant throughout training; it is usually decayed according to a schedule.
Types:
Piecewise constant
Exponential decay
Polynomial decay
👉 Piecewise constant LR is most commonly used.
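Illustrative versions of the three schedule types (all boundaries and decay constants here are made-up example values):

```python
import math

def piecewise_constant(epoch):
    # LR held constant within ranges, dropped at fixed epochs.
    if epoch < 10:
        return 0.1
    if epoch < 20:
        return 0.01
    return 0.001

def exponential_decay(epoch, lr0=0.1, k=0.05):
    # LR shrinks by a constant factor per epoch.
    return lr0 * math.exp(-k * epoch)

def polynomial_decay(epoch, lr0=0.1, total_epochs=100, power=2):
    # LR decays to zero by the final epoch.
    return lr0 * (1 - epoch / total_epochs) ** power
```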
7. Need for Advanced Optimization
Problems with basic GD:
Oscillations near minima
Slow convergence
Sensitive to learning rate
Saddle points more common than local minima in high dimensions
PART 2: CS05 – OPTIMIZATION (Slides 29–51)
8. Exponentially Weighted Averages (EWA)
Used to smooth values over time.
Formula:
V_t = β V_{t−1} + (1 − β) × NewSample
Where:
β ∈ [0, 1]
Higher β → more weight to past values
Used in:
Momentum
RMSProp
Adam
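A direct implementation of the recurrence above, applied to a constant input stream to show the smoothing effect (input values are illustrative):

```python
# V_t = beta * V_{t-1} + (1 - beta) * new_sample
def ewa(samples, beta=0.9):
    v = 0.0
    smoothed = []
    for x in samples:
        v = beta * v + (1 - beta) * x
        smoothed.append(v)
    return smoothed

values = ewa([10.0, 10.0, 10.0, 10.0], beta=0.9)
# Starting from 0, the average approaches 10 gradually (≈ 1.0, 1.9, 2.71, 3.44)
```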
9. Gradient Descent with Momentum
Momentum accumulates past gradients.
Equations:
v_t = β v_{t−1} + (1 − β) ∇L
w = w − η v_t
Advantages:
Faster convergence
Reduced oscillations
Prevents stalling in SGD
Interpretation:
Acts like a ball rolling downhill
Keeps moving in consistent directions
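The two equations above, sketched on the same toy loss L(w) = w² (gradient 2w); all constants are illustrative:

```python
# Momentum: velocity is an EWA of gradients; weights follow the velocity.
def momentum_step(w, v, grad, beta=0.9, lr=0.1):
    v = beta * v + (1 - beta) * grad   # v_t = beta*v_{t-1} + (1-beta)*grad
    w = w - lr * v                     # w = w - eta * v_t
    return w, v

w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, 2 * w)
# w converges toward the minimum at 0
```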
10. Sparse Features Problem
In standard GD:
Same learning rate for all parameters
Rare features get fewer updates
➡ Leads to slow learning for sparse features
11. AdaGrad Optimizer
AdaGrad adapts learning rate per parameter.
Formula:
s_t = s_{t−1} + g_t²
w = w − (η / (√s_t + ε)) g_t
Advantages:
Larger updates for infrequent features
No need to tune learning rate manually
Drawback:
Learning rate decays too aggressively
May stop learning early
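A sketch of the AdaGrad formulas on the toy loss L(w) = w² (gradient 2w); tracking the step sizes makes the aggressive-decay drawback visible. Constants are illustrative:

```python
# AdaGrad: accumulate squared gradients, scale each step by 1/sqrt(s_t).
def adagrad_step(w, s, grad, lr=0.5, eps=1e-8):
    s = s + grad ** 2                       # s_t = s_{t-1} + g_t^2
    w = w - lr / (s ** 0.5 + eps) * grad    # per-parameter scaled step
    return w, s

w, s = 5.0, 0.0
steps = []
for _ in range(100):
    prev = w
    w, s = adagrad_step(w, s, 2 * w)
    steps.append(prev - w)
# Step sizes shrink over time because s_t only ever grows.
```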
12. RMSProp Optimizer
Fixes AdaGrad’s rapid decay problem.
Formula:
s_t = β s_{t−1} + (1 − β) g_t²
Benefits:
Prevents denominator from growing indefinitely
Stable convergence
Widely used in practice
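The same toy setup as before, with AdaGrad's sum replaced by an EWA of squared gradients (learning rate and β are illustrative):

```python
# RMSProp: EWA of squared gradients keeps the denominator bounded.
def rmsprop_step(w, s, grad, lr=0.05, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2   # s_t = beta*s_{t-1} + (1-beta)*g_t^2
    w = w - lr / (s ** 0.5 + eps) * grad
    return w, s

w, s = 5.0, 0.0
for _ in range(500):
    w, s = rmsprop_step(w, s, 2 * w)
# Unlike AdaGrad, the denominator stops growing, so progress does not stall.
```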
13. Adam Optimizer
Adam = Momentum + RMSProp
Uses:
First moment (mean of gradients)
Second moment (uncentered variance of gradients)
Advantages:
Fast convergence
Handles sparse gradients well
Default optimizer in many frameworks
Practical Note:
Adam may converge faster but sometimes generalizes worse than SGD
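A sketch combining the momentum and RMSProp ideas above, including Adam's standard bias-correction terms; run on the same toy loss L(w) = w² with illustrative constants:

```python
import math

def adam_step(w, m, v, grad, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction (moments start at 0)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 301):                   # t starts at 1 for bias correction
    w, m, v = adam_step(w, m, v, 2 * w, t)
```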
14. Saddle Points in Deep Learning
Saddle points are more common than local minima
Gradient ≈ 0 but not a minimum
Optimization may stall
Momentum-based methods help escape saddle points.
PART 3: CS06 – ACTIVATION FUNCTIONS
(COMPLETE)
15. Need for Activation Functions
Without activation:
Network becomes linear
Cannot learn complex patterns
Activation functions:
Introduce non-linearity
Enable universal function approximation
16. Step Function
Binary output (0 or 1)
Non-differentiable
❌ Not used in deep learning
17. Linear Activation
f(x) = x
No non-linearity
Used only in regression output layer
18. Sigmoid Activation
σ(x) = 1 / (1 + e^(−x))
Range: (0,1)
Problems:
Vanishing gradient
Saturation
Non zero-centered
Computationally expensive
Used for:
Binary classification output
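Sigmoid and its derivative, sketched directly from the formula above; the derivative peaks at only 0.25, which is the root of the vanishing-gradient problem when many sigmoid layers are stacked:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigma(x) * (1 - sigma(x)), maximal (0.25) at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)
```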
19. Tanh Activation
Range: (-1,1)
✔ Zero-centered
❌ Vanishing gradient still exists
Used in:
Recurrent Neural Networks (RNNs)
20. ReLU Activation
f(x) = max(0, x)
Advantages:
Fast computation
No vanishing gradient (positive side)
Sparse activations
Disadvantage:
Dying ReLU problem
21. Dying ReLU Problem
Neuron outputs zero for all inputs
Gradient becomes zero
Neuron stops learning permanently
22. Leaky ReLU
f(x) = max(αx, x)
✔ Fixes dying ReLU
✔ Small gradient for negative values
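Both activations, written straight from the definitions above (α = 0.01 is a typical small value, chosen here for illustration):

```python
def relu(x):
    # f(x) = max(0, x): zero for negative inputs.
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # f(x) = max(alpha*x, x): small negative slope avoids dead neurons.
    return max(alpha * x, x)
```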
23. Softmax Activation
Used in multi-class classification
Softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
Properties:
Outputs probability distribution
Sum of probabilities = 1
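A sketch following the formula above; subtracting max(z) before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax(z):
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]        # probabilities summing to 1

p = softmax([1.0, 2.0, 3.0])
```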
24. Output Representation
Binary Classification
Output: Sigmoid
Loss: Binary Cross Entropy
Multi-Class Classification
One-hot encoding
Output: Softmax
Loss: Categorical Cross Entropy
25. Weight Initialization
Poor initialization leads to:
Vanishing gradients
Exploding gradients
26. Zero Initialization Problem
All neurons learn same features
Symmetry problem
❌ Not allowed
27. Xavier Initialization
Gaussian distribution
Mean = 0
Controlled variance
Used with:
Sigmoid
Tanh
Prevents:
Vanishing gradients
Exploding gradients
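A sketch of Xavier (Glorot) initialization as described above: zero-mean Gaussian with variance scaled by the layer size. The 1/fan_in variance used here is one common variant; 2/(fan_in + fan_out) is another — both are assumptions, pick per framework convention:

```python
import math
import random

def xavier_init(fan_in, fan_out, seed=0):
    # Zero-mean Gaussian with std = sqrt(1 / fan_in) (controlled variance).
    rng = random.Random(seed)
    std = math.sqrt(1.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = xavier_init(256, 128)   # hypothetical 256 -> 128 layer
```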
🔑 IMPORTANT EXAM POINTS
Momentum vs RMSProp vs Adam
Exponentially weighted averages
Saddle points in DNN
Dying ReLU problem
Softmax vs Sigmoid
Xavier initialization importance
📝 PROBABLE EXAM QUESTIONS
Long Answer
1. Explain Adam optimizer with equations.
2. Discuss RMSProp and AdaGrad.
3. Explain activation functions and their importance.
4. What are saddle points? Why are they problematic?
Short Answer
1. What is dying ReLU?
2. Why does zero initialization fail?
3. Define EWA.
4. Why is softmax used for multi-class classification?