Session 7

The document contains comprehensive class notes on deep learning, covering topics such as training processes, optimization algorithms, activation functions, and weight initialization. Key concepts include forward propagation, loss functions, gradient descent methods, and various optimizers like Adam and RMSProp. Additionally, it highlights important exam points and probable questions related to these topics.

Uploaded by 2024da04053


📘 DEEP LEARNING – CLASS NOTES


Sources used

• ✅ DOC file (Session 7) → FULL
• ✅ CS05_Optimization → Slides 29 to 51 ONLY
• ✅ CS06_Activation → COMPLETE slides
• Language: simple, academic, exam-ready
• Structured for understanding + revision

PART 1: SESSION 7 – DEEP LEARNING (DOC FILE – FULL CONTENT)

1. Overview of Deep Learning Training

Deep learning training involves:

1. Forward propagation
2. Loss computation
3. Backpropagation
4. Weight update using optimization algorithms

The goal is to:

• Minimize training error (optimization)
• Minimize generalization error (avoid overfitting)

2. Objective Function (Loss Function)

In optimization terminology:

• Loss function = objective function
• Optimization algorithms aim to minimize this function

3. Convex vs Non-Convex Optimization

Convex Function

• Single global minimum
• Easy optimization

Non-Convex Function (DNN Loss)

• Multiple local minima
• Saddle points
• Flat regions

➡ Deep neural networks always involve non-convex optimization

4. Gradient Descent Recap

Weight update rule:

w = w − η∇L(w)

Where:

• η = learning rate
• Gradient sign decides direction
• Learning rate decides step size
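The update rule above can be sanity-checked on a toy convex loss. The quadratic loss L(w) = (w − 3)² below is an illustrative choice, not from the notes:

```python
# Gradient descent on a toy convex loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2*(w - 3); the minimum sits at w = 3.
def gradient_descent(w, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)   # gradient of the toy loss, ∇L(w)
        w = w - lr * grad    # w ← w − η∇L(w)
    return w

w_final = gradient_descent(w=0.0)  # approaches 3.0
```

Each step moves w against the gradient; with η = 0.1 the distance to the minimum shrinks by a constant factor per step.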

5. Types of Gradient Descent

Batch Gradient Descent

• Uses all training samples
• One update per epoch
• Computationally expensive

Stochastic Gradient Descent (SGD)

• Uses one sample at a time
• Fast but noisy

Mini-Batch Gradient Descent

• Uses a subset of data
• Most commonly used in practice

6. Learning Rate Scheduling

The learning rate should typically not be kept constant throughout training.

Types:

• Piecewise constant
• Exponential decay
• Polynomial decay

👉 Piecewise constant LR is the most commonly used in practice.
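The three schedule types can each be written as a pure function of the step t; the base rate, boundaries, and decay constants below are arbitrary illustrative values, not prescribed by the notes:

```python
import math

def piecewise_constant(t, boundaries=(30, 60), rates=(0.1, 0.01, 0.001)):
    # Hold the rate constant within each interval, dropping at boundaries.
    for i, b in enumerate(boundaries):
        if t < b:
            return rates[i]
    return rates[-1]

def exponential_decay(t, lr0=0.1, k=0.05):
    # Smooth multiplicative decay: lr(t) = lr0 * e^(−kt).
    return lr0 * math.exp(-k * t)

def polynomial_decay(t, lr0=0.1, t_max=100, power=1.0):
    # Decays to zero at t_max; power=1 gives linear decay.
    return lr0 * (1 - min(t, t_max) / t_max) ** power
```

Piecewise constant schedules are popular partly because the drop points can be tied to observed plateaus in the validation loss.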

7. Need for Advanced Optimization

Problems with basic GD:

• Oscillations near minima
• Slow convergence
• Sensitive to learning rate
• Saddle points more common than local minima in high dimensions

PART 2: CS05 – OPTIMIZATION (Slides 29–51)

8. Exponentially Weighted Averages (EWA)

Used to smooth values over time.

Formula:

V_t = βV_{t−1} + (1 − β) · NewSample

Where:

• β ∈ [0, 1]
• Higher β → more weight to past values

Used in:

• Momentum
• RMSProp
• Adam
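The EWA recurrence above is a one-liner in code; the constant input sequence below is just an illustrative choice that makes the smoothing visible:

```python
# Exponentially weighted average of a sequence of samples.
# Higher beta keeps more of the history, giving a smoother curve.
def ewa(samples, beta=0.9):
    v = 0.0
    out = []
    for x in samples:
        v = beta * v + (1 - beta) * x   # V_t = βV_{t−1} + (1−β)x_t
        out.append(v)
    return out

smoothed = ewa([1.0] * 50)  # ramps up toward 1.0 from the zero start
```

Note the warm-up bias: because V_0 = 0, early averages underestimate the signal (here V_t = 1 − 0.9^t), which is exactly what Adam's bias correction compensates for.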

9. Gradient Descent with Momentum

Momentum accumulates past gradients.

Equations:

v_t = βv_{t−1} + (1 − β)∇L
w = w − ηv_t

Advantages:

• Faster convergence
• Reduced oscillations
• Prevents stalling in SGD

Interpretation:

• Acts like a ball rolling downhill
• Keeps moving in consistent directions
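The two momentum equations translate directly to code; the toy quadratic loss and the hyperparameter values are illustrative assumptions:

```python
# Gradient descent with momentum on the toy loss L(w) = (w - 3)^2,
# following v_t = βv_{t−1} + (1−β)∇L and w ← w − ηv_t.
def momentum_gd(w=0.0, lr=0.1, beta=0.9, steps=300):
    v = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)              # ∇L(w)
        v = beta * v + (1 - beta) * grad  # EWA of past gradients
        w = w - lr * v                  # step along the velocity
    return w
```

The velocity v is itself an exponentially weighted average of gradients, which is why consistent gradient directions accumulate while oscillating components cancel.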

10. Sparse Features Problem

In standard GD:

• Same learning rate for all parameters
• Rare features get fewer updates

➡ Leads to slow learning for sparse features

11. AdaGrad Optimizer

AdaGrad adapts the learning rate per parameter.

Formula:

s_t = s_{t−1} + g_t²
w = w − (η / (√s_t + ε)) · g_t

Advantages:

• Larger updates for infrequent features
• No need to tune the learning rate manually

Drawback:

• Learning rate decays too aggressively
• May stop learning early
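A single-parameter sketch of the AdaGrad update above (the toy loss and base rate are illustrative assumptions); note that `s` only ever grows, which is the source of the aggressive decay:

```python
import math

# AdaGrad on the toy loss L(w) = (w - 3)^2:
# s_t accumulates squared gradients and shrinks the effective rate.
def adagrad(w=0.0, lr=0.5, eps=1e-8, steps=500):
    s = 0.0
    for _ in range(steps):
        g = 2 * (w - 3)                        # gradient of the toy loss
        s = s + g * g                          # s_t = s_{t−1} + g_t²
        w = w - lr / (math.sqrt(s) + eps) * g  # per-parameter scaled step
    return w
```

Because s is monotonically increasing, the effective learning rate η/√s_t can only shrink; on long runs this is what may stop learning early.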

12. RMSProp Optimizer

Fixes AdaGrad's rapid decay problem.

Formula:

s_t = βs_{t−1} + (1 − β)g_t²

Benefits:

• Prevents the denominator from growing indefinitely
• Stable convergence
• Widely used in practice
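The only change from AdaGrad is that the squared-gradient accumulator becomes an exponentially weighted average, so old gradients decay instead of piling up forever. A sketch on the same toy quadratic (loss and hyperparameters are illustrative assumptions):

```python
import math

# RMSProp on the toy loss L(w) = (w - 3)^2.
def rmsprop(w=0.0, lr=0.01, beta=0.9, eps=1e-8, steps=2000):
    s = 0.0
    for _ in range(steps):
        g = 2 * (w - 3)                        # gradient of the toy loss
        s = beta * s + (1 - beta) * g * g      # s_t = βs_{t−1} + (1−β)g_t²
        w = w - lr / (math.sqrt(s) + eps) * g  # bounded denominator
    return w
```

Near the minimum, g/√s_t approaches ±1, so steps stay on the order of η; in practice RMSProp is therefore often paired with a decaying learning rate.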

13. Adam Optimizer

Adam = Momentum + RMSProp

Uses:

• First moment (mean of gradients)
• Second moment (mean of squared gradients, i.e. uncentered variance)

Advantages:

• Fast convergence
• Handles sparse gradients well
• Default optimizer in many frameworks

Practical note:

• Adam may converge faster but sometimes generalizes worse than SGD
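Putting the two pieces together on the toy quadratic (loss and hyperparameters are illustrative assumptions): a first-moment EWA from momentum, a second-moment EWA from RMSProp, and bias correction for the zero-initialized averages.

```python
import math

# Adam on the toy loss L(w) = (w - 3)^2.
def adam(w=0.0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    m = s = 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 3)
        m = b1 * m + (1 - b1) * g            # first moment (momentum)
        s = b2 * s + (1 - b2) * g * g        # second moment (RMSProp)
        m_hat = m / (1 - b1 ** t)            # bias correction for zero start
        s_hat = s / (1 - b2 ** t)
        w = w - lr * m_hat / (math.sqrt(s_hat) + eps)
    return w
```

The bias correction matters most in early steps: without it, m and s start near zero and the first updates would be badly underscaled.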

14. Saddle Points in Deep Learning

• Saddle points are more common than local minima
• Gradient ≈ 0 but not a minimum
• Optimization may stall

Momentum-based methods help escape saddle points.

PART 3: CS06 – ACTIVATION FUNCTIONS (COMPLETE)

15. Need for Activation Functions

Without activation:

• The network becomes a purely linear model
• Cannot learn complex patterns

Activation functions:

• Introduce non-linearity
• Enable universal function approximation
16. Step Function

• Binary output (0 or 1)
• Non-differentiable

❌ Not used in deep learning

17. Linear Activation

f(x) = x

• No non-linearity
• Used only in the regression output layer

18. Sigmoid Activation

σ(x) = 1 / (1 + e^{−x})

Range: (0, 1)

Problems:

• Vanishing gradient
• Saturation
• Not zero-centered
• Computationally expensive (requires exp)

Used for:

• Binary classification output
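The saturation problem is easy to see numerically: the sigmoid's derivative σ(x)(1 − σ(x)) peaks at 0.25 at x = 0 and collapses toward zero for large |x|.

```python
import math

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(−x)), mapping ℝ to (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # dσ/dx = σ(x)(1 − σ(x)); peaks at 0.25, vanishes as |x| grows
    s = sigmoid(x)
    return s * (1 - s)
```

Stacking layers multiplies these derivatives during backpropagation, so factors ≤ 0.25 per layer compound into the vanishing-gradient problem listed above.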

19. Tanh Activation

Range: (−1, 1)

✔ Zero-centered
❌ Vanishing gradient still exists

Used in:

• Recurrent Neural Networks (RNNs)

20. ReLU Activation

f(x) = max(0, x)

Advantages:

• Fast computation
• No vanishing gradient (positive side)
• Sparse activations

Disadvantage:

• Dying ReLU problem

21. Dying ReLU Problem

• Neuron outputs zero for all inputs
• Gradient becomes zero
• Neuron stops learning permanently

22. Leaky ReLU

f(x) = max(αx, x)

✔ Fixes dying ReLU
✔ Small gradient for negative values
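Both formulas above are one-liners; α = 0.01 below is a common illustrative default, not a value fixed by the notes:

```python
def relu(x):
    # f(x) = max(0, x): zero output and zero gradient for x < 0
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # f(x) = max(αx, x): small slope α for x < 0 keeps the gradient
    # nonzero, so negative-regime units can still learn
    return max(alpha * x, x)
```

For a negative input, ReLU returns exactly 0 (and contributes no gradient), while Leaky ReLU returns a small negative value instead.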

23. Softmax Activation

Used in multi-class classification.

Softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

Properties:

• Outputs a probability distribution
• Sum of probabilities = 1
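A standard numerically stable implementation of the formula above: subtracting max(z) from every logit leaves the result mathematically unchanged but prevents overflow in exp for large logits.

```python
import math

def softmax(z):
    # Softmax(z_i) = e^(z_i) / Σ_j e^(z_j), computed stably
    m = max(z)                                 # shift for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # a valid probability distribution
```

Without the shift, `math.exp(1000.0)` would overflow; with it, `softmax([1000.0, 1000.0])` cleanly returns two probabilities of 0.5.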

24. Output Representation

Binary Classification

• Output: Sigmoid
• Loss: Binary Cross-Entropy

Multi-Class Classification

• One-hot encoding
• Output: Softmax
• Loss: Categorical Cross-Entropy

25. Weight Initialization

Poor initialization leads to:

• Vanishing gradients
• Exploding gradients

26. Zero Initialization Problem

• All neurons learn the same features
• Symmetry problem

❌ Not allowed

27. Xavier Initialization

• Gaussian distribution
• Mean = 0
• Controlled variance (for the Gaussian variant, 2 / (fan_in + fan_out))

Used with:

• Sigmoid
• Tanh

Prevents:

• Vanishing gradients
• Exploding gradients
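A minimal sketch of the Gaussian (Glorot normal) variant: weights drawn from a zero-mean normal with variance 2 / (fan_in + fan_out). The layer sizes and seed are illustrative assumptions.

```python
import numpy as np

def xavier_init(fan_in, fan_out, seed=0):
    # Zero-mean Gaussian with std = sqrt(2 / (fan_in + fan_out)),
    # chosen so activation variance stays roughly stable layer to layer.
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

W = xavier_init(256, 128)   # weight matrix for a 256 → 128 layer
```

Scaling the variance by the fan-in and fan-out keeps forward activations and backward gradients from systematically shrinking or blowing up as depth grows, which is exactly the failure mode listed above.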

🔑 IMPORTANT EXAM POINTS

• Momentum vs RMSProp vs Adam
• Exponentially weighted averages
• Saddle points in DNNs
• Dying ReLU problem
• Softmax vs Sigmoid
• Importance of Xavier initialization

📝 PROBABLE EXAM QUESTIONS

Long Answer

1. Explain the Adam optimizer with equations.
2. Discuss RMSProp and AdaGrad.
3. Explain activation functions and their importance.
4. What are saddle points? Why are they problematic?

Short Answer

1. What is the dying ReLU problem?
2. Why does zero initialization fail?
3. Define exponentially weighted averages (EWA).
4. Why is softmax used for multi-class classification?

