TRAINING FEED FORWARD NEURAL NETWORKS
DR. SANJAY CHATTERJI
The Fast-Food Problem
During training, we show the neural net a large number of training examples and iteratively modify the weights to minimize the errors we make on those examples. One idea is to be intelligent about picking our training cases; instead, we try to motivate a solution that works well in general.

Gradient Descent
We minimize the squared error over all of the training examples by first simplifying the problem to a linear neuron with two inputs. The error then lives in a three-dimensional space where the horizontal dimensions correspond to the weights w1 and w2, and the vertical dimension corresponds to the value of the error function E. We can also conveniently visualize this surface as a set of elliptical contours, where each contour corresponds to settings of w1 and w2 that evaluate to the same value of E. From this picture we can develop a high-level strategy for finding the values of the weights that minimize the error function: repeatedly step in the direction perpendicular to the contours, i.e., the direction of steepest descent.

The Delta Rule and Learning Rates
Hyperparameters: in addition to the weight parameters defined in our neural network, learning algorithms also require a couple of additional parameters to carry out the training process. One of these is the learning rate: at each step of moving perpendicular to the contour, we need to determine how far we want to walk before recalculating our new direction. This distance should depend on the steepness of the surface. If we pick a learning rate that is too small, we risk taking too long during the training process; but if we pick a learning rate that is too big, we will most likely start diverging away from the minimum.

Continued..
Now we are finally ready to derive the delta rule for training our linear neuron. To calculate how to change each weight, we evaluate the gradient, which is essentially the partial derivative of the error function with respect to each of the weights. For the squared error E = 1/2 Σ_i (t^(i) − y^(i))^2, this yields the delta rule update Δw_k = ε Σ_i x_k^(i) (t^(i) − y^(i)), where ε is the learning rate, and t^(i) and y^(i) are the target and the neuron's output on the ith training example.
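To make the delta rule concrete, here is a minimal NumPy sketch of gradient descent on a linear neuron with two inputs; the synthetic dataset, learning rate, and step count are illustrative assumptions, not from the original slides.

```python
# A minimal sketch of the delta rule for a linear neuron with two inputs,
# minimizing the squared error E = 1/2 * sum_i (t_i - y_i)^2 by gradient
# descent. Dataset and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # 100 training examples, inputs x1 and x2
true_w = np.array([1.5, -2.0])
t = X @ true_w                  # targets generated by a known linear rule

w = np.zeros(2)                 # weights w1 and w2 to be learned
epsilon = 0.1                   # the learning rate

for step in range(100):
    y = X @ w                   # linear neuron: y = w1*x1 + w2*x2
    # dE/dw_k = -sum_i x_k^(i) (t^(i) - y^(i)), so the delta rule step is
    # delta w_k = epsilon * sum_i x_k^(i) (t^(i) - y^(i)) (averaged here).
    w += epsilon * X.T @ (t - y) / len(X)

print(w)   # approaches [1.5, -2.0], the bottom of the quadratic bowl
```

Each update walks a short distance along the direction of steepest descent on the quadratic error surface, exactly the contour-perpendicular strategy described above.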
Gradient Descent with Sigmoidal Neurons
Now we will deal with training neurons and neural networks that utilize nonlinearities, using the sigmoidal neuron as our model. For simplicity, we assume that the neurons do not use a bias term. The neuron computes the weighted sum of its inputs, the logit z = Σ_k w_k x_k. It then feeds its logit into the logistic function to compute y, its final output: y = 1 / (1 + e^(−z)).
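A small sketch of this forward pass; the weight and input values are arbitrary assumptions.

```python
# Forward pass of a sigmoidal neuron with no bias term.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])   # illustrative weights
x = np.array([1.0, 2.0])    # illustrative inputs

z = w @ x                   # the logit: weighted sum of the inputs
y = sigmoid(z)              # final output: y = 1 / (1 + e^(-z))
print(z, y)
```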
Gradient Descent with Sigmoidal Neurons Cont.
For learning, we want to compute the gradient of the error function with respect to the weights. To do so, we start by taking the derivative of the logit with respect to the inputs and the weights: ∂z/∂x_k = w_k and ∂z/∂w_k = x_k. Applying the chain rule through the logistic function, whose derivative is dy/dz = y(1 − y), the update rule is just like the delta rule, except with extra multiplicative terms: Δw_k = ε Σ_i x_k^(i) y^(i) (1 − y^(i)) (t^(i) − y^(i)).

The Backpropagation Algorithm
How do we tackle the problem of training multilayer neural networks (instead of just single neurons)? We don't know what the hidden units ought to be doing, but we can compute how fast the error changes as we change a hidden activity. From there, we can figure out how fast the error changes when we change the weight of an individual connection. Essentially, we will be trying to find the path of steepest descent in a high-dimensional space!

The Backpropagation Algorithm Cont.
Our strategy will be one of dynamic programming, where ED denotes an error derivative. From ED(one layer of hidden units) we compute ED(activities of the layer below), and then we compute ED(weights leading into the units); in other words, we can express ED(layer_i) in terms of ED(layer_j) for the layer above it. Once we fill up the table with all of the partial derivatives, we can determine how the error changes with respect to the weights. This tells us how to modify the weights after each training example.

Stochastic and Minibatch Gradient Descent
So far we have been using a version of gradient descent known as batch gradient descent, which computes the error surface over the entire dataset. Another potential approach is stochastic gradient descent, where at each iteration the error surface is estimated with respect to only a single example; instead of a single static error surface, our error surface is now dynamic. In minibatch gradient descent, at every iteration we compute the error surface with respect to some subset of the total dataset, balancing the two extremes.
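The following sketch ties the last few slides together: manual backpropagation through a two-layer network of sigmoidal neurons, trained with minibatch gradient descent on the squared error. The network shape, toy dataset, and hyperparameters are all assumptions for illustration.

```python
# A minimal sketch of backpropagation with minibatch gradient descent for a
# two-layer sigmoidal network (no bias terms, squared error).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # 200 examples, 2 inputs
t = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # toy targets

W1 = rng.normal(scale=0.5, size=(2, 3))   # input -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> 1 output unit
epsilon, batch_size = 0.5, 10

for epoch in range(100):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]        # one minibatch
        x, target = X[idx], t[idx]

        # Forward pass: logits, then sigmoidal outputs, layer by layer.
        h = sigmoid(x @ W1)
        y = sigmoid(h @ W2)

        # Backward pass (dynamic programming): the error derivatives of one
        # layer's activities give the derivatives of the layer below.
        delta_out = (y - target) * y * (1 - y)        # ED at the output
        delta_hid = (delta_out @ W2.T) * h * (1 - h)  # ED at the hidden layer

        # Gradient step on the weights leading into each unit.
        W2 -= epsilon * h.T @ delta_out / len(idx)
        W1 -= epsilon * x.T @ delta_hid / len(idx)
```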
Test Sets, Validation Sets, and Overfitting
A model that is too complex does not generalize well: this is overfitting. Because of it, evaluating a model on the data we used to train it is very misleading. Instead, at the end of each epoch, we want to measure how well our model is generalizing to data it has never seen. As the number of connections and the depth of the network increase, the propensity to overfit also increases.
(Figure: ANNs with two inputs, a softmax output of size two, and a hidden layer with 3, 6, or 20 neurons.)
(Figure: ANNs with one, two, or four hidden layers of three neurons each.)
(Figure: the workflow we use when building and training deep learning models.)
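A minimal sketch of this epoch-level bookkeeping: hold out a validation set and measure error on it after every epoch. The linear model and synthetic data are assumptions; they illustrate the workflow rather than overfitting itself.

```python
# Track training vs. validation error per epoch on a held-out split.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
t = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=500)

split = int(0.8 * len(X))          # hold out 20% as a validation set
X_tr, t_tr = X[:split], t[:split]
X_va, t_va = X[split:], t[split:]

w = np.zeros(2)
epsilon = 0.1
for epoch in range(20):
    # One epoch of delta-rule training on the training set only.
    w += epsilon * X_tr.T @ (t_tr - X_tr @ w) / len(X_tr)
    train_err = np.mean((t_tr - X_tr @ w) ** 2)
    val_err = np.mean((t_va - X_va @ w) ** 2)   # generalization estimate
    print(f"epoch {epoch}: train {train_err:.4f}  val {val_err:.4f}")
```

When the validation error starts rising while the training error keeps falling, the model has begun to memorize the training set rather than learn from it.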
Preventing Overfitting in Deep Neural Networks
One method of combatting overfitting is called regularization. Regularization modifies the objective function that we minimize by adding additional terms that penalize large weights: the objective becomes Error + λ f(θ), where λ is the regularization strength. In L2 regularization, we add 1/2 λw^2 to the error function for every weight; in L1 regularization, we add the term λ|w|. The value of λ determines how much we want to protect against overfitting: with λ = 0 we take no measures against the possibility of overfitting, while if λ is too large our model will prioritize keeping θ as small as possible over finding the parameter values that perform well on the training set.
(Figure: ANNs trained using minibatch gradient descent (batch size 10) and L1 regularization strengths of 0.01, 0.1, and 1.)

L1 Vs L2 Regularization
L1 regularization leads the weight vectors to become sparse during optimization (close to zero), so neurons with L1 regularization end up using only a small subset of their most important inputs and become quite resistant to noise. In comparison, weight vectors from L2 regularization are usually diffuse, small numbers. L1 regularization is therefore very useful when you want to understand exactly which features are contributing to a decision; if this level of feature analysis is not necessary, we prefer L2 regularization because it empirically performs better.

Max Norm Constraints
Max norm constraints share the goal of restricting θ from becoming too large, but they do so more directly: they enforce an absolute upper bound c on the magnitude of the incoming weight vector for every neuron, and use projected gradient descent to enforce the constraint. Typical values of c are 3 and 4. One nice property is that the parameter vector cannot grow out of control (even if the learning rate is too high), because the updates to the weights are always bounded. (Both the regularization penalties and the max norm projection are sketched in code at the end of this section.)

Dropout
Dropout is implemented by keeping a neuron active only with some probability p. It prevents the network from becoming too dependent on any one neuron (or any small combination of neurons). Mathematically, it prevents overfitting by providing a way of approximately combining exponentially many different neural network architectures efficiently.

Intricacies
Dropout is pretty intuitive to understand, but there are some important intricacies to consider. First, we would like the outputs of neurons during test time to be equivalent to their expected outputs at training time. We could fix this naively by scaling the output at test time: for example, if p = 0.5, neurons must halve their outputs at test time in order to have the same (expected) output they would have during training. This is because if a neuron's output prior to dropout is x, then after dropout the expected output is E[output] = p·x + (1 − p)·0 = p·x.

This naive implementation of dropout is undesirable, however, because it requires scaling neuron outputs at test time, and test-time performance is extremely critical to model evaluation. It is therefore always preferable to use inverted dropout, where the scaling occurs at training time instead of at test time. In inverted dropout, any neuron whose activation hasn't been silenced has its output divided by p before the value is propagated to the next layer. With this fix, E[output] = p·(x/p) + (1 − p)·0 = x, and we can avoid arbitrarily scaling neuronal output at test time.
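As promised above, here is a sketch of the L2/L1 penalty terms and a max norm projection applied to a single weight matrix; lam (the regularization strength) and c (the norm bound) are illustrative values, not prescriptions.

```python
# L2/L1 penalties and a max norm constraint on one weight matrix W, where
# each column of W is the incoming weight vector of one neuron.
import numpy as np

lam, c = 0.1, 3.0
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
grad_error = rng.normal(size=(3, 4))      # stand-in for dError/dW

# Regularized objective: Error + lambda * f(theta).
l2_penalty = 0.5 * lam * np.sum(W ** 2)   # f(theta) = sum of 1/2 * w^2
l1_penalty = lam * np.sum(np.abs(W))      # f(theta) = sum of |w|
print(l2_penalty, l1_penalty)

# The L2 penalty adds lam * w to the gradient (d/dw of 1/2*lam*w^2).
W -= 0.01 * (grad_error + lam * W)

# Max norm: after the gradient step, project each neuron's incoming weight
# vector back onto the ball of radius c (projected gradient descent).
norms = np.linalg.norm(W, axis=0, keepdims=True)   # one norm per neuron
W *= np.minimum(1.0, c / norms)
```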
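And a sketch of inverted dropout itself, assuming a keep probability p of 0.5 and arbitrary activations; note that test time requires no scaling at all.

```python
# Inverted dropout on one layer's activations: scale at training time so
# that test-time outputs need no adjustment.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                        # probability of keeping a neuron

def inverted_dropout(activations, train=True):
    if not train:
        return activations                     # test time: no scaling needed
    mask = rng.random(activations.shape) < p   # keep each neuron w.p. p
    # Dividing by p at training time keeps the expected output equal to the
    # test-time output: E[output] = p*(x/p) + (1 - p)*0 = x.
    return activations * mask / p

h = rng.random((4, 3))                    # some layer's activations
print(inverted_dropout(h))                # training-time pass
print(inverted_dropout(h, train=False))   # test-time pass, untouched
```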
Thank You